Reliability from the ground up: Designing for 5 9s, Astrid Atkinson


Aug 30, 2022


SLIDE 1

Reliability from the ground up

Designing for 5 9s

Astrid Atkinson, June 2018

SLIDE 2

What do we mean when we talk about reliability?

SLIDE 3

Reliability, n. - the quality of being trustworthy or of performing consistently well.

Resilience, n. - the capacity to recover quickly from difficulties; toughness, the ability of a substance or object to spring back into shape.

SLIDE 4

Reliability is a property of the system, not the sum of its parts
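
One way to make this concrete is to look at how availability composes. The sketch below is an illustration rather than anything from the talk: it assumes failures are independent, and the function names and numbers are hypothetical. Parts that must all be up multiply their availabilities, while redundant replicas multiply their failure probabilities.

```python
from math import prod

def serial_availability(parts):
    """Availability of a chain where every part must be up.

    Assumes the parts fail independently of one another.
    """
    return prod(parts)

def replicated_availability(part, replicas):
    """Availability when any one of `replicas` identical copies is enough.

    Assumes the copies fail independently of one another.
    """
    return 1 - (1 - part) ** replicas

# Three 99.9% parts in series are worse than any single part.
chain = serial_availability([0.999, 0.999, 0.999])

# Two 99% replicas in parallel are better than either one alone.
pair = replicated_availability(0.99, 2)

print(f"three 99.9% parts in series: {chain:.4%}")
print(f"two 99% replicas in parallel: {pair:.4%}")
```

The same components give very different results depending on how the system arranges them, which is the sense in which reliability belongs to the system.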

SLIDE 5

SLIDE 6

SLIDE 7

How much reliability do you need?

Baseline requirements for uptime

Optimizations

Mikey Dickerson’s hierarchy of reliability

SLIDE 8

Availability    Time per Year of Downtime
90%             36.5 days
99%             3.65 days
99.9%           8.76 hours
99.99%          52.56 minutes
99.999%         5.26 minutes
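
The figures above follow from simple arithmetic. A minimal sketch, assuming a 365-day year (the function name is made up for illustration):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; a 365-day year is assumed

def downtime_minutes_per_year(availability_percent):
    """Minutes of allowed downtime per year at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for pct in (90, 99, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(
        f"{pct}%: {minutes:.2f} minutes "
        f"({minutes / 60:.2f} hours, {minutes / 1440:.2f} days)"
    )
```

Each extra nine cuts the yearly downtime budget by a factor of ten.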

SLIDE 9

Principles for resilience

SLIDE 10

Rule #1: Every node for itself

“Keep doing what you’re doing, unless it’s actively unsafe.”

SLIDE 11

Recovering from failure

Loop script restarts

Validate startup

All required state cached locally

Report health: “I’m OK to serve!”
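
The recovery steps above can be sketched as a small supervisor loop. This is only an illustration under stated assumptions: `Node`, `run_forever`, the required state keys, and the simulated crash are all hypothetical, and a real loop script would typically restart a separate process rather than call a Python function.

```python
import time

class Node:
    def __init__(self, local_cache):
        # State saved locally on a previous run (hypothetical contents).
        self.local_cache = local_cache
        self.healthy = False

    def startup(self):
        # Validate startup: refuse to serve without the state we need.
        required = {"config", "routing_table"}
        missing = required - set(self.local_cache)
        if missing:
            raise RuntimeError(f"missing local state: {sorted(missing)}")
        # Report health: "I'm OK to serve!"
        self.healthy = True

def run_forever(make_node, serve, max_restarts=None, backoff_seconds=0.0):
    """Loop script: restart the node whenever startup or serving fails."""
    restarts = 0
    while max_restarts is None or restarts <= max_restarts:
        node = make_node()
        try:
            node.startup()
            serve(node)
            return restarts  # serve() returned cleanly
        except Exception as exc:
            print(f"node failed ({exc}); restarting")
            restarts += 1
            time.sleep(backoff_seconds)
    return restarts

# Usage: a serve function that crashes twice, then runs cleanly.
attempts = []

def flaky_serve(node):
    attempts.append(node.healthy)
    if len(attempts) < 3:
        raise RuntimeError("simulated crash")

cache = {"config": {}, "routing_table": {}}
restarts = run_forever(lambda: Node(cache), flaky_serve, max_restarts=5)
```

Because the required state is cached locally, the node in this sketch can come back without asking any other system for help.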

SLIDE 12

Failures of dependencies

SLIDE 13

Handling bad inputs

“I’d like a search result and a unicorn!”

“I’ve never heard of a unicorn :-(”

RIP
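
A minimal sketch of the alternative to dying on a bad request: reject what the server does not understand and keep serving. The feature names, status codes, and `handle_request` function are assumptions made for illustration.

```python
# Hypothetical set of things this server knows how to return.
KNOWN_FEATURES = {"search_result", "spelling_suggestion"}

def handle_request(requested_features):
    """Serve what we understand; reject the rest without crashing."""
    unknown = [f for f in requested_features if f not in KNOWN_FEATURES]
    if unknown:
        # Fail this one request with a clear error; the server stays up.
        return {"status": 400, "error": f"unknown features: {unknown}"}
    return {"status": 200, "results": {f: f"<{f}>" for f in requested_features}}

bad = handle_request(["search_result", "unicorn"])
good = handle_request(["search_result"])
print(bad)
print(good)
```

The bad request costs one error response instead of one server.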

SLIDE 14

Rule #2: Everything runs on more than one machine

SLIDE 15

Serving components are stateless

Data is split across machines and replicated

Idempotency and sharding

SLIDE 16

Idempotency

Any of these servers can perform the same actions
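
One common way to get this property is to keep the servers stateless and make each operation safe to repeat. The sketch below assumes a shared store and client-supplied request IDs; `Store`, `apply_deposit`, and the account data are hypothetical.

```python
class Store:
    """Stand-in for a shared, replicated data store (hypothetical)."""
    def __init__(self):
        self.balances = {}
        self.applied = set()  # request IDs that have already taken effect

def apply_deposit(store, request_id, account, amount):
    """Idempotent: repeating the same request_id has no further effect."""
    if request_id in store.applied:
        return store.balances[account]
    store.balances[account] = store.balances.get(account, 0) + amount
    store.applied.add(request_id)
    return store.balances[account]

store = Store()
# Stateless servers run the same code against the same store.
server_a = server_b = apply_deposit

first = server_a(store, "req-1", "alice", 10)
retry = server_b(store, "req-1", "alice", 10)  # client retried on another server
print(first, retry)
```

Since no server holds private state, a retry can land anywhere and still produce the same result.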

SLIDE 17

Sharding

Data is split across machines and replicated
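
A minimal sketch of splitting and replicating data, assuming simple hash-based placement. The function name, machine names, and replication factor of 3 are illustrative choices, not details from the talk.

```python
import hashlib

def replicas_for(key, machines, replication=3):
    """Pick `replication` distinct machines for a key by stable hashing."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(machines)
    count = min(replication, len(machines))
    # Take consecutive machines starting at the hashed position.
    return [machines[(start + i) % len(machines)] for i in range(count)]

machines = [f"machine-{i}" for i in range(5)]
placement = replicas_for("user:42", machines)
print(placement)
```

Every caller computes the same placement for the same key, so no central lookup is needed. Note that this simple modulo scheme moves many keys when the machine list changes.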

SLIDE 18

DNS

Multiple clusters provide better redundancy

SLIDE 19

Rule #3: Loosely coupled dependencies

SLIDE 20

“Hey, who should I talk to?”

LB

Network

SLIDE 21

“OK, I’ll just keep using the last policy I got.”

LB
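
The fallback behavior on this slide can be sketched as a client that remembers the last policy it received. `PolicyClient`, the policy shape, and the simulated outage are assumptions for illustration.

```python
class PolicyClient:
    def __init__(self, fetch_policy):
        # fetch_policy is a callable that may raise ConnectionError.
        self._fetch_policy = fetch_policy
        self._last_good = None

    def current_policy(self):
        try:
            self._last_good = self._fetch_policy()
        except ConnectionError:
            if self._last_good is None:
                raise  # nothing cached yet, so there is no safe fallback
        return self._last_good

# Usage: the policy source answers once, then becomes unreachable.
responses = iter([{"backends": ["a", "b"]}])

def fetch():
    try:
        return next(responses)
    except StopIteration:
        raise ConnectionError("policy source unreachable") from None

client = PolicyClient(fetch)
first = client.current_policy()
second = client.current_policy()  # source is down; last policy is reused
print(first, second)
```

A stale policy is usually a better failure mode than no policy, as long as the stale answer is still safe to act on.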

SLIDE 22

Rule #4: Design for change

1. Use tools, not processes
2. Check in your configs
3. Canary all changes
4. Rollbacks should always be safe
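
Items 3 and 4 can be sketched together as a canary rollout that rolls back on a bad signal. Everything here is hypothetical: the function names, the 1% error threshold, and the in-memory `versions` table standing in for a real fleet.

```python
def canary_rollout(servers, deploy, rollback, error_rate,
                   canary_count=1, max_error_rate=0.01):
    """Deploy to a few servers first; continue only if they stay healthy."""
    canaries, rest = servers[:canary_count], servers[canary_count:]
    for server in canaries:
        deploy(server)
    if any(error_rate(server) > max_error_rate for server in canaries):
        # The change looks bad: undo it before it spreads.
        for server in canaries:
            rollback(server)
        return False
    for server in rest:
        deploy(server)
    return True

# Usage: a release that makes every server it touches return errors.
versions = {name: "v1" for name in ("s1", "s2", "s3")}

def deploy_v2(server):
    versions[server] = "v2"

def rollback_v1(server):
    versions[server] = "v1"

def bad_release_errors(server):
    return 0.2 if versions[server] == "v2" else 0.0

ok = canary_rollout(list(versions), deploy_v2, rollback_v1, bad_release_errors)
print(ok, versions)
```

Only one server ever ran the bad release, and the rollback path was exercised as part of the normal tooling.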

SLIDE 23

Global state == global failure

SLIDE 24

Rule #5: Observe the system, not the components

SLIDE 25

Pick a few high-level metrics that describe the overall behavior of the system. Make those perfect.

Everything else is background info.

queries, latency, errors

The only* dashboard you need
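
A minimal sketch of computing those three numbers from requests observed at the system's front door. The `summarize` function, the tuple format, and the choice of a nearest-rank 99th percentile are assumptions for illustration.

```python
def summarize(requests, window_seconds):
    """Summarize (latency_ms, ok) tuples seen during one time window."""
    if not requests:
        raise ValueError("no requests observed in this window")
    latencies = sorted(latency for latency, _ in requests)
    errors = sum(1 for _, ok in requests if not ok)
    # Nearest-rank 99th percentile, using integer math for ceil(0.99 * n).
    rank = max(1, (99 * len(latencies) + 99) // 100)
    return {
        "queries_per_second": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p99_latency_ms": latencies[rank - 1],
    }

# Usage: 100 requests in a 10-second window, one error, two slow requests.
sample = [(20, True)] * 98 + [(500, False), (900, True)]
summary = summarize(sample, window_seconds=10)
print(summary)
```

These are measured where users meet the system, so they describe the system's behavior regardless of which component is misbehaving.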

SLIDE 26

The fifth nine is people.

Trisha Weir

SLIDE 27

Unexpected failure mode

System design

Sad engineer paged at night

Highly skilled engineer who will quit if woken up too often

SLIDE 28

Future proof

SLIDE 29

Supporting an ecosystem

Healthy systems grow, which means more teams and systems
Tooling and infrastructure scale better than people and processes

Come to my talk later!

SLIDE 30

grant me the serenity to accept the things I cannot change; courage to change the things I can; and wisdom to know the difference

The limits of control

SLIDE 31

Thank you.

Credits:

Susan J Fowler, Production-Ready Microservices, O’Reilly, 2016

Mikey Dickerson, Hierarchy of Reliability

Trisha Weir et al, The Fifth Nine: Diverse Perspectives on Reliability, GHC 2015

astrid@gmail.com