Learning from failures - Kripa Krishnan, Technical Program Director

SLIDE 1

Google Confidential and Proprietary

Learning from failures

Kripa Krishnan Technical Program Director

Sep.3.2014

SLIDE 2

Topics

  • DiRT: Disaster simulation exercise at Google
  • What we learned
  • Applying these lessons
SLIDE 6

Introducing DiRT

  • DiRT
    ○ Annual disaster recovery & testing exercise
    ○ 8 years since inception
    ○ Multi-day exercise triggering (controlled) failures in systems and process
  • Premise
    ○ 30-day incapacitation of headquarters following a disaster
    ○ Other offices and facilities may be affected
  • When
    ○ "Big disaster": Annually for 3-5 days
    ○ Continuous testing: Year-round
  • Who
    ○ 100s of engineers (Site Reliability, Network, Hardware, Software, Infrastructure, Security, Facilities)
    ○ Business units (Human Resources, Finance, Safety, Crisis response etc.)

SLIDE 8

DiRT - Structure

  • First 24 hours
    ○ Crisis response
    ○ Operations failover
  • Rest of DiRT
    ○ Series of injected failures
    ○ 100s of smaller tests run by individual teams
  • Team
    ○ Command Center - 30-40 'insiders'
    ○ 'Storyteller' narrates the test
    ○ 100s of teams respond to the tests
    ○ 100s of teams write post-mortems for each of the tests
    ○ 100s of teams fix issues found in post-mortems
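
The "series of injected failures" can be illustrated with a minimal sketch. This is not Google's DiRT tooling; `FlakyBackend`, `InjectedFailure`, and `serve_with_failover` are hypothetical names invented here. The sketch shows the general idea only: a test harness deliberately fails one component, and the test checks that the system fails over to another.

```python
import random


class InjectedFailure(Exception):
    """Raised when the test harness deliberately fails a call."""


class FlakyBackend:
    """Hypothetical backend wrapper that fails a controlled fraction of calls."""

    def __init__(self, name, failure_rate, rng):
        self.name = name
        self.failure_rate = failure_rate  # 0.0 = never fail, 1.0 = always fail
        self.rng = rng

    def handle(self, request):
        # rng.random() is in [0.0, 1.0), so a rate of 1.0 always triggers.
        if self.rng.random() < self.failure_rate:
            raise InjectedFailure(f"{self.name}: injected failure")
        return f"{self.name} served {request}"


def serve_with_failover(backends, request):
    """Try each backend in order and return the first successful response."""
    errors = []
    for backend in backends:
        try:
            return backend.handle(request)
        except InjectedFailure as exc:
            errors.append(str(exc))
    raise RuntimeError("all backends failed: " + "; ".join(errors))


rng = random.Random(0)  # seeded so the exercise is repeatable
primary = FlakyBackend("primary", failure_rate=1.0, rng=rng)  # forced outage
standby = FlakyBackend("standby", failure_rate=0.0, rng=rng)  # healthy

print(serve_with_failover([primary, standby], "req-1"))
# prints "standby served req-1"
```

Keeping the failure controlled (an explicit rate and a seeded generator) mirrors the "(controlled) failures" wording in the talk: the point is to observe the response, not to cause unbounded damage.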

SLIDE 9

Rolling out a disaster (Example 1/4 - first 6 hours)

  • Research and Test
    ○ Scenario: Bay Area earthquake, later an aftershock
    ○ Story: Meteor hit the Bay Area. Zombies emerge from the ground.
    ○ Structural damage, flooding, communications hampered
  • Response
    ○ Safety (Evacuations?)
    ○ People 'Finder' and 'Alerter'
    ○ Incident management, team and operational failovers
    ○ Communicate, communicate
  • Did we learn anything?
    ○ Mass evacuations
    ○ Satellite phones - a good idea?
    ○ People 'Alerter'
  • Record. Fix.
SLIDE 10

Rolling out a disaster (Example 2/4)

  • Test
    ○ Distributed offices responsible for carrying on operations
    ○ Oh wait! Unrelated fiber cut incident! Datacenter down!
  • Response
    ○ Teams bring back systems with no help or communication from HQ
  • Did we learn anything?
    ○ Where is our monitoring?
    ○ Is it possible to DoS cafes?
    ○ Great - approvals system works! Now what?
    ○ Phones - wait...
    ○ How quickly can we recover? Good enough?
  • Record. Fix.
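
The question "How quickly can we recover? Good enough?" implies comparing a measured recovery time against a target. A minimal sketch, assuming the team records when the outage started and when service returned; `recovery_report`, the timestamps, and the one-hour objective are all made up for illustration and do not come from the talk.

```python
from datetime import datetime, timedelta


def recovery_report(outage_start, recovered_at, objective):
    """Compare measured recovery time against a target (hypothetical objective)."""
    elapsed = recovered_at - outage_start
    return {
        "elapsed": elapsed,
        "objective": objective,
        "met": elapsed <= objective,  # True only if recovery was fast enough
    }


# Illustrative values only: a 90-minute outage against a 1-hour target.
start = datetime(2014, 1, 1, 10, 0)
back = datetime(2014, 1, 1, 11, 30)
report = recovery_report(start, back, objective=timedelta(hours=1))

print(report["elapsed"], report["met"])
# prints "1:30:00 False"
```

Recording the result per test gives the post-mortem a concrete number to track from one exercise to the next.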
SLIDE 11

Rolling out a disaster (Example 3/4)

  • Test
    ○ Oh wait! Massive power outage! Many days!
  • Response
    ○ Emergency funding to respond.
    ○ Run on backup generators.
    ○ Longer-term outage - capacity concerns addressed.
  • Did we learn anything?
    ○ Can we seamlessly cut to generators at full load?
    ○ For how long? Do we get into trouble?
    ○ If the datacenter was down for n hours and we fixed everything, why is nothing coming back up?
  • Record. Fix.
SLIDE 12

Rolling out a disaster (Example 4/4)

  • Test
    ○ Meanwhile at HQ, there are a lot of issues to resolve.
    ○ 'I am travelling, send me back.'
    ○ 'Do customers need to pay us?'
  • Response
    ○ New policies on the fly
    ○ Sharing resources (food, shelter etc.)
  • Did we learn anything?
    ○ Creativity and a culture that promotes flexibility help a lot.
    ○ Communication is hard
    ○ Exhaustion and decisions
  • Record. Fix.
SLIDE 16

Meta: What did we learn?

  • Post-mortem culture: Failures are inevitable - the best we can do is be prepared for them and learn from them. Fix!
  • Continuous simulations and testing: An untested plan is not really a plan. Test against them. All the time.
  • Test everything: There is no real distinction between business continuity and disaster recovery - people and technology co-exist.
  • Rinse and repeat: Repetition is important. A system or document that is never used is not helpful when it matters.
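
One way to act on "rinse and repeat" is to track when each plan was last exercised and flag the ones that have gone untested too long. The sketch below is an illustration only: `stale_plans`, the plan names, the dates, and the one-year threshold are hypothetical and are not described in the talk.

```python
from datetime import date, timedelta


def stale_plans(last_exercised, today, max_age):
    """Return names of plans not exercised within max_age, oldest first."""
    overdue = [
        (when, name)
        for name, when in last_exercised.items()
        if today - when > max_age
    ]
    return [name for when, name in sorted(overdue)]


# Made-up records of when each plan was last run in a test.
plans = {
    "datacenter-failover": date(2014, 8, 1),
    "office-evacuation": date(2013, 2, 15),
    "payroll-continuity": date(2012, 11, 30),
}

print(stale_plans(plans, today=date(2014, 9, 3), max_age=timedelta(days=365)))
# prints "['payroll-continuity', 'office-evacuation']"
```

Sorting oldest first puts the least-exercised plan at the top of the list for the next round of testing.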

SLIDE 17

Thank you!