Reliability from the ground up Designing for 5 9s Astrid Atkinson, - - PowerPoint PPT Presentation


slide-1
SLIDE 1

Reliability from the ground up

Designing for 5 9s

Astrid Atkinson, June 2018

slide-2
SLIDE 2

What do we mean when we talk about reliability?

slide-3
SLIDE 3

Reliability, n. - the quality of being trustworthy or of performing consistently well.

Resilience, n. - the capacity to recover quickly from difficulties; toughness, the ability of a substance or object to spring back into shape.

slide-4
SLIDE 4

Reliability is a property of the system, not the sum of its parts

slide-5
SLIDE 5
slide-6
SLIDE 6
slide-7
SLIDE 7

How much reliability do you need?

Baseline requirements for uptime

Optimizations

Mikey Dickerson’s hierarchy of reliability

slide-8
SLIDE 8

Time per Year of Downtime

Availability    Downtime per year
90%             36.5 days
99%             3.65 days
99.9%           8.76 hours
99.99%          52.56 minutes
99.999%         5.26 minutes
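
The downtime figures on this slide follow from simple arithmetic: the unavailable fraction of a year, expressed in minutes. A minimal sketch, assuming a 365-day year; the function name `downtime_minutes_per_year` is illustrative and not from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of allowed downtime per year at the given availability."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


# Print the budget for each availability target listed on the slide.
for target in (90.0, 99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.2f} minutes ({minutes / 60:.2f} hours)")
```

Each added nine divides the budget by ten, which is why five nines leaves only a few minutes per year.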

slide-9
SLIDE 9

Principles for resilience

slide-10
SLIDE 10

Rule #1: Every node for itself

“Keep doing what you’re doing, unless it’s actively unsafe.”

slide-11
SLIDE 11

Recovering from failure

  • Loop script restarts
  • Validate on startup
  • All required state cached locally
  • Report health: “I’m OK to serve!”
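
The recovery steps on this slide could be sketched roughly as follows. This is an illustration under assumptions, not the speaker's implementation: the `Node` class, `run_with_restarts`, and the idea that "required state" is a set of keys in a local dictionary are all hypothetical.

```python
import time
from typing import Callable, Dict, Iterable


class Node:
    """Hypothetical serving node; all names here are illustrative."""

    def __init__(self, local_cache: Dict[str, object], required_keys: Iterable[str]):
        self.local_cache = local_cache
        self.required_keys = list(required_keys)
        self.healthy = False

    def validate_on_startup(self) -> None:
        # Refuse to serve unless all required state is already cached locally.
        missing = [k for k in self.required_keys if k not in self.local_cache]
        if missing:
            raise RuntimeError(f"missing local state: {missing}")
        self.healthy = True  # Report health: "I'm OK to serve!"


def run_with_restarts(start: Callable[[], None], max_restarts: int,
                      delay_s: float = 0.0) -> int:
    """Loop script: call start() again whenever it raises.

    Gives up after max_restarts restarts and returns how many were needed.
    """
    restarts = 0
    while True:
        try:
            start()
            return restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            restarts += 1
            time.sleep(delay_s)
```

In a real deployment the loop would usually live outside the process (a shell loop or a supervisor), so that it survives crashes of the server itself.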

slide-12
SLIDE 12

Failures of dependencies

slide-13
SLIDE 13

Handling bad inputs

“I’d like a search result and a unicorn!”

“I’ve never heard of a unicorn :-(”

RIP
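
One reading of this slide is that a request the server does not understand should be rejected, not allowed to crash the server. A minimal sketch under that assumption; `KNOWN_RESULT_TYPES`, `handle_request`, and the status codes are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical set of result types this server knows how to produce.
KNOWN_RESULT_TYPES = {"search_result", "image", "news"}


def handle_request(requested_types: List[str]) -> Tuple[int, Dict[str, object]]:
    """Serve what we understand; reject what we don't without crashing."""
    unknown = [t for t in requested_types if t not in KNOWN_RESULT_TYPES]
    if unknown:
        # Return a client error instead of raising: one malformed request
        # should not take the whole server down.
        return 400, {"error": f"unknown result types: {unknown}"}
    return 200, {"results": {t: f"<{t} payload>" for t in requested_types}}
```

Here `handle_request(["search_result", "unicorn"])` returns a 400 response and the process keeps serving other requests.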

slide-14
SLIDE 14

Rule #2: Everything runs on more than one machine

slide-15
SLIDE 15

  • Serving components are stateless
  • Data is split across machines and replicated

Idempotency and sharding

slide-16
SLIDE 16

Idempotency

Any of these servers can perform the same actions

slide-17
SLIDE 17

Sharding

Data is split across machines and replicated
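
As an illustration of splitting data across machines with replication, the sketch below hashes a key to a primary machine and places copies on the next machines in the list. The function name, the modulo placement scheme, and the replication factor of 3 are assumptions for the example; the talk does not specify a scheme.

```python
import hashlib
from typing import List


def replicas_for_key(key: str, machines: List[str],
                     replication_factor: int = 3) -> List[str]:
    """Pick the machines that hold a key: a primary plus its successors."""
    if not machines:
        raise ValueError("need at least one machine")
    copies = min(replication_factor, len(machines))
    # Stable hash so every caller maps the same key to the same machines.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(machines)
    return [machines[(start + i) % len(machines)] for i in range(copies)]
```

Simple modulo placement moves most keys when the machine list changes; consistent hashing is a common alternative when machines are added or removed often.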

slide-18
SLIDE 18

DNS

Multiple clusters provide better redundancy

slide-19
SLIDE 19

Rule #3: Loosely coupled dependencies

slide-20
SLIDE 20

“Hey, who should I talk to?” LB

Network

slide-21
SLIDE 21

“OK, I’ll just keep using the last policy I got.” LB
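
The behavior described here, continuing with the last policy received when the policy source is unreachable, could look roughly like this. The `LoadBalancerClient` class and the shape of a "policy" (a mapping from cluster name to traffic weight) are assumptions made for the sketch.

```python
from typing import Callable, Dict, Optional


class LoadBalancerClient:
    """Hypothetical client that asks a policy service where to send traffic."""

    def __init__(self, fetch_policy: Callable[[], Dict[str, float]]):
        self._fetch_policy = fetch_policy
        self._last_good: Optional[Dict[str, float]] = None

    def current_policy(self) -> Dict[str, float]:
        try:
            self._last_good = self._fetch_policy()
        except (ConnectionError, TimeoutError):
            # Dependency unreachable: keep using the last policy we got.
            if self._last_good is None:
                raise  # nothing cached yet, so there is no safe fallback
        return self._last_good
```

The only case that still fails is a cold start with the dependency down, which is one reason to cache required state locally before reporting healthy.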

slide-22
SLIDE 22

Rule #4: Design for change

  1. Use tools, not processes
  2. Check in your configs
  3. Canary all changes
  4. Rollbacks should always be safe
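
Points 3 and 4 can be illustrated with a small rollout routine: apply a change to a few machines first, check health, and restore the previous version if the canaries look bad. This is a sketch only; `canary_rollout`, the `config` dictionary, and the `is_healthy` callback are hypothetical.

```python
from typing import Callable, Dict, List


def canary_rollout(
    machines: List[str],
    config: Dict[str, str],
    old_version: str,
    new_version: str,
    is_healthy: Callable[[str], bool],
    canary_count: int = 1,
) -> bool:
    """Apply new_version to a few machines first; roll back if any is unhealthy.

    `config` maps machine name -> deployed version and is mutated in place.
    Returns True if the change reached every machine.
    """
    canaries = machines[:canary_count]
    for m in canaries:
        config[m] = new_version
    if not all(is_healthy(m) for m in canaries):
        for m in canaries:
            config[m] = old_version  # rollback restores the known-good version
        return False
    for m in machines[canary_count:]:
        config[m] = new_version
    return True
```

Keeping versions in a checked-in config (point 2) is what makes the rollback a plain revert.
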
slide-23
SLIDE 23

Global state == global failure

slide-24
SLIDE 24

Rule #5: Observe the system, not the components

slide-25
SLIDE 25

Pick a few high-level metrics that describe the overall behavior of the system. Make those perfect.

Everything else is background info.

queries, latency, errors

The only* dashboard you need
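
A minimal sketch of tracking the three system-level signals named on this slide (queries, latency, errors). The `ServiceMetrics` class and the nearest-rank percentile are assumptions for illustration; a production system would normally use an existing metrics library.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServiceMetrics:
    """System-level counters: queries, latency, errors."""

    latencies_ms: List[float] = field(default_factory=list)
    errors: int = 0

    def record(self, latency_ms: float, ok: bool) -> None:
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1

    @property
    def queries(self) -> int:
        return len(self.latencies_ms)

    @property
    def error_rate(self) -> float:
        return self.errors / self.queries if self.queries else 0.0

    def latency_percentile(self, pct: float) -> float:
        """Nearest-rank percentile of recorded latencies (pct in (0, 100])."""
        if not self.latencies_ms:
            raise ValueError("no queries recorded")
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
        return ordered[rank - 1]
```

Recording at the edge of the system, where user requests arrive, keeps these numbers about the system as a whole and not about any one component.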

slide-26
SLIDE 26

The fifth nine is people.

- Trisha Weir
slide-27
SLIDE 27

Unexpected failure mode

System design

Sad engineer paged at night

Highly skilled engineer who will quit if woken up too often

slide-28
SLIDE 28

Future proof

slide-29
SLIDE 29

Supporting an ecosystem

  • Healthy systems grow, which means more teams and systems
  • Tooling and infrastructure scale better than people and processes

Come to my talk later!

slide-30
SLIDE 30

grant me the serenity to accept the things I cannot change; courage to change the things I can; and wisdom to know the difference

The limits of control

slide-31
SLIDE 31

Thank you.

Credits:

  • Susan J Fowler, Production Ready Microservices, O’Reilly, 2016
  • Mikey Dickerson, Hierarchy of Reliability
  • Trisha Weir et al, The Fifth Nine: Diverse Perspectives on Reliability, GHC 2015

astrid@gmail.com