It will break! Leonid Movsesyan, Dropbox



slide-1
SLIDE 1
slide-2
SLIDE 2

It will break!

Leonid Movsesyan, Dropbox

slide-3
SLIDE 3

Hierarchy of needs

slide-4
SLIDE 4

4 easy steps

  • Build your hierarchy of needs
  • Put together your assumptions about each layer
  • Run regular disaster recovery testing (DRT) as close to reality as possible
  • Adjust them constantly to reflect the changes
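
The first two steps can be written down as data. Below is a minimal sketch, assuming a hypothetical three-layer hierarchy; the layer names, levels, and assumptions are invented for illustration and do not come from the talk. It lists, bottom layer first, every assumption that no DRT has exercised yet.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    name: str
    level: int  # 0 = most fundamental need, higher = built on top of it
    assumptions: list = field(default_factory=list)


def untested_assumptions(layers, tested):
    """Return (layer name, assumption) pairs no DRT has covered, bottom layer first."""
    ordered = sorted(layers, key=lambda layer: layer.level)
    return [
        (layer.name, assumption)
        for layer in ordered
        for assumption in layer.assumptions
        if assumption not in tested
    ]


# Hypothetical hierarchy: every name and assumption here is an example.
HIERARCHY = [
    Layer("application", 2, ["serves stale results if indexing stops"]),
    Layer("power", 0, ["generators start on demand"]),
    Layer("network", 1, ["switch failover is transparent"]),
]
```

Calling `untested_assumptions(HIERARCHY, tested={"generators start on demand"})` leaves the network and application assumptions as the backlog for the next DRT round.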
slide-5
SLIDE 5

Search engine example

slide-6
SLIDE 6

Optimize for metrics that you care about

slide-7
SLIDE 7

Treat your DRTs how you treat your unit tests
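
One way to read this slide: a DRT is a test case with a name, a setup, an injected failure, and an assertion, and it runs on a schedule like any other test. The sketch below uses Python's `unittest` against `FakeCluster`, a hypothetical in-memory stand-in for a real service; in a real DRT the `kill` call would hit actual infrastructure.

```python
import unittest


class FakeCluster:
    """Hypothetical stand-in for a service with one primary and some replicas."""

    def __init__(self, replicas):
        self.nodes = {"primary": True}
        self.nodes.update({f"replica-{i}": True for i in range(replicas)})

    def kill(self, name):
        self.nodes[name] = False

    def can_serve_reads(self):
        # Any live node can answer reads.
        return any(self.nodes.values())

    def can_serve_writes(self):
        # Only the primary accepts writes; there is no automatic failover here.
        return self.nodes["primary"]


class PrimaryLossDRT(unittest.TestCase):
    def setUp(self):
        self.cluster = FakeCluster(replicas=2)

    def test_reads_survive_primary_loss(self):
        self.cluster.kill("primary")
        self.assertTrue(self.cluster.can_serve_reads())

    def test_writes_stop_without_failover(self):
        self.cluster.kill("primary")
        self.assertFalse(self.cluster.can_serve_writes())


def run_drts():
    """Run the DRT suite and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PrimaryLossDRT)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

The second test documents a known limitation as an assertion, so the day someone adds failover the test fails and the assumption gets updated.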

slide-8
SLIDE 8

How to design a good DRT?

  • For every action in your design ask yourself ‘What if?’
  • We’ll use S3 to store our data: what if an S3 availability zone goes down?
  • We’ll run credit card processing using this 3rd party vendor: what if it times out?
  • We’ll store the metadata in MySQL: what if the MySQL master dies?
  • Take this question as deep as possible and use the ‘what if?’ scenarios as DRTs
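
The vendor timeout question above can be turned into a DRT by injecting the failure and checking what the system does next. This is a sketch under assumptions: `charge`, `VendorTimeout`, and the queue-for-retry behaviour are hypothetical choices made for illustration, not something the talk prescribes.

```python
class VendorTimeout(Exception):
    """Raised when the (hypothetical) payment vendor does not answer in time."""


def charge(amount_cents, vendor_call, retry_queue):
    """Charge through the vendor; on timeout, queue the charge instead of failing."""
    try:
        return vendor_call(amount_cents)
    except VendorTimeout:
        retry_queue.append(amount_cents)
        return "queued"


def healthy_vendor(amount_cents):
    return "charged"


def timing_out_vendor(amount_cents):
    # Failure injection: behave as if the vendor never answered.
    raise VendorTimeout("simulated timeout")


def drt_vendor_timeout():
    """What if the vendor times out? Expect the charge to land in the retry queue."""
    queue = []
    outcome = charge(1999, timing_out_vendor, queue)
    return outcome, queue
```

Going one level deeper means asking the next question of the answer: what if the retry queue itself is unavailable?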

slide-9
SLIDE 9

Power outage

slide-10
SLIDE 10

Power will fail

slide-11
SLIDE 11

Diesel generators may not start

slide-12
SLIDE 12

You’ll run out of diesel sooner than you expect

slide-13
SLIDE 13

Avoid the consequences

  • Split the servers into groups based on the hierarchy of needs
  • Create automation that can power off the top of the hierarchy first
  • Test this automation regularly as well as your diesel generators
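
The grouping and shutdown order can be sketched as follows. The host names, the numeric levels, and the `power_off` callback are assumptions for illustration; a real version would call whatever power management interface the site uses.

```python
from collections import defaultdict


def shedding_order(servers):
    """Group (hostname, level) pairs by level and return the groups, top level first.

    Level 0 is the base of the hierarchy of needs, so it is powered off last.
    """
    groups = defaultdict(list)
    for host, level in servers:
        groups[level].append(host)
    return [sorted(groups[level]) for level in sorted(groups, reverse=True)]


def shed_load(servers, power_off, levels_to_shed):
    """Power off the top `levels_to_shed` groups and return the hosts, in order."""
    powered_off = []
    for group in shedding_order(servers)[:levels_to_shed]:
        for host in group:
            power_off(host)  # hypothetical callback supplied by the caller
            powered_off.append(host)
    return powered_off
```

Passing `power_off` in as a parameter lets the same code run in a dry-run DRT with a recording callback and in a real outage with the actual power control.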
slide-14
SLIDE 14

Network outage

slide-15
SLIDE 15

Never expect networks you don’t control to be reliable

slide-16
SLIDE 16

Never expect your switches, at any level of your network, to be always available

slide-17
SLIDE 17

Expect the network to fail slowly

slide-18
SLIDE 18

Network testing

  • Use DRTs to fine-tune timeouts
  • Imitate multiple types of network issues
  • Fail over every network device in your own topology
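
One possible way to fine-tune a timeout from a DRT: record request latencies while the failure is being imitated, then set the timeout from a high percentile plus headroom. The percentile, the headroom factor, and the floor below are assumed starting values, not recommendations from the talk.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers (pct in 1..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_timeout(latencies_ms, pct=99, headroom=1.5, floor_ms=50):
    """Suggest a timeout in ms: chosen percentile times headroom, never below the floor."""
    return max(floor_ms, percentile(latencies_ms, pct) * headroom)
```

Because networks tend to fail slowly, comparing the suggestion from a healthy run with the one from a degraded run shows how far a timeout can stretch before it stops detecting the failure at all.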
slide-19
SLIDE 19

Pro tip: Use switch failover as an opportunity to upgrade the firmware

slide-20
SLIDE 20

Wishful thinking

slide-21
SLIDE 21
slide-22
SLIDE 22

Cloud vendors

  • Expect your cloud instance to fail
  • Expect your cloud-stored data to get lost or corrupted
  • Expect to lose network connectivity to your cloud provider
  • Expect cloud providers to lose the whole region
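
Expecting lost or corrupted data implies being able to notice it. A minimal sketch, assuming content digests are recorded at write time and that a second copy exists somewhere else; the dict-based stores and the function names are hypothetical stand-ins for real storage clients.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def read_verified(key, replicas, expected_digests):
    """Return the first copy of `key` whose SHA-256 matches the recorded digest.

    `replicas` is an ordered list of dict-like stores, preferred source first.
    Missing and corrupted copies are skipped; LookupError means no intact copy.
    """
    for store in replicas:
        data = store.get(key)
        if data is not None and sha256_hex(data) == expected_digests[key]:
            return data
    raise LookupError(f"no intact copy of {key!r}")
```

A DRT for this path would deliberately corrupt or delete the preferred copy and check that reads still succeed from the fallback.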
slide-23
SLIDE 23

Try not to over-engineer

slide-24
SLIDE 24

Don’t forget about human error

slide-25
SLIDE 25

Avoid manual operations and runbooks in places where a mistake cannot be tolerated

slide-26
SLIDE 26

Analyze and group your outages

  • Structure the root causes of the outages
  • Analyze times to detect, diagnose and recover
  • Group outages to identify the patterns
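
The three bullets above can be sketched as a small report over outage records. The record fields and the sample data are made up for illustration; real input would come from postmortems or an incident tracker.

```python
from collections import defaultdict
from statistics import mean

# Made-up example records, times in minutes.
OUTAGES = [
    {"root_cause": "network", "detect_min": 5, "diagnose_min": 20, "recover_min": 15},
    {"root_cause": "network", "detect_min": 15, "diagnose_min": 40, "recover_min": 5},
    {"root_cause": "human error", "detect_min": 2, "diagnose_min": 10, "recover_min": 30},
]


def summarize(outages):
    """Group outages by root cause and average the detect/diagnose/recover times."""
    groups = defaultdict(list)
    for outage in outages:
        groups[outage["root_cause"]].append(outage)
    return {
        cause: {
            "count": len(items),
            "mean_detect_min": mean(o["detect_min"] for o in items),
            "mean_diagnose_min": mean(o["diagnose_min"] for o in items),
            "mean_recover_min": mean(o["recover_min"] for o in items),
        }
        for cause, items in groups.items()
    }
```

Splitting the total into detect, diagnose, and recover shows where the time goes: a group with fast detection but slow diagnosis points at tooling, not alerting.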
slide-27
SLIDE 27

Questions?