Building resilience: How outages shaped Etsy’s systems

SLIDE 1
SLIDE 2

Building resilience

How outages shaped Etsy’s systems

SLIDE 3

Act 1

SLIDE 4

Quick! Be resilient!

http://www.flickr.com/photos/niaid/11854196633/sizes/l/

SLIDE 5

Quick! Be resilient!

  • Actually, it’s a slow process
  • Iterative
  • Introspective
  • Horizontal and vertical development

SLIDE 6

Quick! Be resilient!

http://www.flickr.com/photos/ogcodes/6091644301/sizes/l/

SLIDE 7

Quick! Be resilient!

http://www.flickr.com/photos/studio360/1150744342/sizes/o/

SLIDE 8

Quick! Be resilient!

http://www.flickr.com/photos/studio360/1150744368/sizes/o/

SLIDE 9

Quick! Be resilient!

http://www.flickr.com/photos/ogcodes/6091644301/sizes/l/

SLIDE 10

Quick! Be resilient!

Current generation Next generation

SLIDE 11

Quick! Be resilient!

http://www.flickr.com/photos/jurvetson/8671257096/

SLIDE 12

Quick! Be resilient!

http://cudebi.wordpress.com/2012/09/19/tah-pagh-tahbe-o-el-reconocimiento-de-william-shakespeare-en-el-universo-de-star-trek/

SLIDE 13

Resilience Engineering

http://www.flickr.com/photos/freefoto/728651045/sizes/o/

SLIDE 14

Resilience Engineering

  • “To Engineer is Human” and “To Forgive Design”, Henry Petroski
  • “The Field Guide to Understanding Human Error” and “Just Culture”, Sidney Dekker
SLIDE 15

Act 2

SLIDE 16

Building resilience at Etsy

  • Continuous deployment
  • Metrics, metrics, metrics
  • Peer review
  • Postmortems
SLIDE 17

Building resilience at Etsy

  • Continuous deployment
  • Metrics, metrics, metrics
  • Peer review
  • Postmortems

All of the above: Culture

SLIDE 18

Postmortems

Or: How to win at failing

SLIDE 19
Constructive cultures

  • No blame
  • Open discussion
  • Focus on improvements

SLIDE 20
Constructive cultures

  • No blame
  • Open discussion
  • Focus on improvements

All of the above: Culture

SLIDE 21

Destructive cultures

“The nail that sticks up gets hammered down”
–Japanese proverb

SLIDE 22

The result?

SLIDE 23
  • #23: Fortune’s “Top 50 best small and medium businesses to work for”

  • Rapid code iterations and deploys
  • Lasting relationships
  • Generosity of spirit
  • …and much more
SLIDE 24

Act 3

SLIDE 25

Doing postmortems? Get Morgue

http://github.com/etsy/morgue

SLIDE 26

Morgue

SLIDE 27

Morgue

SLIDE 28

Morgue

SLIDE 29

Forkistan

  • Mean time to detect: 0 min
  • Mean time to recover: 10 mins
SLIDE 30

Yo Dawg, I Heard You Like Errors..

  • Mean time to detect: 2 mins
  • Mean time to recover: 15 mins
SLIDE 31

Smashing INT for Fun and Profit

  • Mean time to detect: 0 min
  • Mean time to recover: 4 hrs 52 mins
SLIDE 32

Apache Amnesia

  • Mean time to detect: 2 hours
  • Mean time to recover: 5 mins
SLIDE 33

Continuously Upgrading Databases

  • Mean time to detect: 2 mins
  • Mean time to recover: 1 hour (but not really...)
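
The five postmortems above each list a time to detect and a time to recover. As a rough sketch only (this is not from the talk, and it is not how Morgue works), the figures from those slides can be collected in one place and averaged. The `incidents` list and the `summarize` helper are hypothetical names introduced for illustration; the numbers are the ones shown on the slides, converted to minutes.

```python
from statistics import mean

# (postmortem title, minutes to detect, minutes to recover),
# copied from the slides above and converted to minutes.
incidents = [
    ("Forkistan", 0, 10),
    ("Yo Dawg, I Heard You Like Errors..", 2, 15),
    ("Smashing INT for Fun and Profit", 0, 4 * 60 + 52),
    ("Apache Amnesia", 2 * 60, 5),
    # The slide qualifies this recovery time with "but, not really",
    # so treat the 60 minutes as nominal.
    ("Continuously Upgrading Databases", 2, 60),
]


def summarize(rows):
    """Return (mean minutes to detect, mean minutes to recover)."""
    detect = mean(row[1] for row in rows)
    recover = mean(row[2] for row in rows)
    return detect, recover


mttd, mttr = summarize(incidents)
print(f"mean time to detect:  {mttd:.1f} min")
print(f"mean time to recover: {mttr:.1f} min")
```

With only five incidents, a single long outage (the 4 hrs 52 mins recovery) accounts for most of the recovery average, so per-incident figures like those on the slides are often more informative than the mean.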
SLIDE 34

Q & A

Avleen Vig, Staff Operations Engineer, Etsy, Inc. (@avleen)

SLIDE 35