Pitfalls in Measuring SLOs, Danyel Fisher (@fisherdanyel), presentation slides

SLIDE 1

Pitfalls in Measuring SLOs

Danyel Fisher @fisherdanyel

SLIDE 2

An Outage

SLIDE 5

What do you do when things break?
How bad was this break?

SLIDE 7

Build new features!
We need to improve quality!

SLIDE 8

Management, Engineering, Clients and Users

How broken is “too broken”?
What does “good enough” mean?
Combatting alert fatigue

SLIDE 9

A telemetry system produces events that correspond to real-world use.
We can describe some of these events as eligible.
We can describe some of them as good.

SLIDE 10

Given an event, is it eligible? Is it good?
Eligible: “Had an HTTP status code”
Good: “... that was a 200, and was served under 500 ms”
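The eligible/good distinction above can be sketched as two predicates over an event. This is a minimal illustration, not the speaker's implementation: the field names `status_code` and `duration_ms` are assumptions chosen for the example.

```python
def is_eligible(event: dict) -> bool:
    """An event counts toward the SLO only if it has an HTTP status code."""
    return event.get("status_code") is not None


def is_good(event: dict) -> bool:
    """A good event is an eligible one that was a 200 served under 500 ms."""
    return (
        is_eligible(event)
        and event["status_code"] == 200
        and event.get("duration_ms", float("inf")) < 500
    )


# Hypothetical events; the field names are assumptions.
events = [
    {"status_code": 200, "duration_ms": 120},  # eligible and good
    {"status_code": 200, "duration_ms": 900},  # eligible, but too slow
    {"status_code": 500, "duration_ms": 40},   # eligible, but an error
    {"name": "background_job"},                # not eligible: no status code
]
eligible = [e for e in events if is_eligible(e)]
good = [e for e in eligible if is_good(e)]
print(f"{len(good)} good out of {len(eligible)} eligible")
```

Events that are not eligible are left out of the ratio entirely, so they count neither for nor against the SLO.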

SLIDE 13

Minimum quality ratio over a period of time
Number of bad events allowed
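The relationship between the quality ratio and the number of bad events allowed can be sketched as simple arithmetic. The traffic volume and target below are made-up numbers for illustration only.

```python
import math


def allowed_bad_events(eligible_events: int, target: float) -> int:
    """Bad events the SLO tolerates over the period, rounded down.

    round() guards against float noise such as 99.99999999999899.
    """
    return math.floor(round(eligible_events * (1 - target), 6))


def budget_remaining(bad_events: int, allowed: int) -> float:
    """Fraction of the budget still unspent (negative once overspent)."""
    return 1 - bad_events / allowed


# Hypothetical period: one million eligible events against a 99.9% target.
allowed = allowed_bad_events(1_000_000, 0.999)
remaining = budget_remaining(250, allowed)
print(f"allowed={allowed}, remaining={remaining:.0%}")
```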

SLIDE 14

Deploy faster
Room for experimentation
Opportunity to tighten SLO

SLIDE 15

We always store incoming user data         99.99%   ~4.3 minutes
Default dashboards usually load in < 1 s   99.9%    45 minutes
Queries often return in < 10 s             99%      7.3 hours
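The durations next to each target are consistent with a month-long budget expressed as time. A rough sketch of that arithmetic, assuming a 30-day period (the slide's figures look rounded from a slightly longer average month):

```python
def budget_minutes(target_percent: float, period_days: float = 30.0) -> float:
    """Minutes of total failure the target tolerates over the period."""
    return period_days * 24 * 60 * (100 - target_percent) / 100


for target in (99.99, 99.9, 99.0):
    print(f"{target}% -> {budget_minutes(target):.1f} minutes")

# A 12-minute total outage measured against the 99.99% target,
# expressed in months of budget.
months_burned = 12 / budget_minutes(99.99)
print(f"{months_burned:.1f} months of budget")
```

Under the 30-day assumption this gives about 4.3 minutes, 43 minutes, and 7.2 hours, and a 12-minute outage costs a little under three months of the 99.99% budget.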

SLIDE 16

User Data Throughput

We blew through three months’ budget in those 12 minutes.

SLIDE 17

We dropped customer data

SLIDE 18

We dropped customer data
We rolled it back (manually)
We communicated to customers
We halted deploys

SLIDE 19

We checked in code that didn’t build. We had experimental CI build wiring. Our scripts deployed empty binaries. There was no health check and rollback.

SLIDE 20

We stopped writing new features
We prioritized stability
We mitigated risks

SLIDE 21

SLOs allowed us to characterize what went wrong, how badly it went wrong, and how to prioritize repair

SLIDE 22

Learning from SLOs

SLIDE 24

  • Design Thinking
  • Expressing and Viewing
  • Burndown Alerts and Responding
  • Learning from our Experiences
  • Success Stories

SLIDE 25

Design Thinking and Task Analysis

Understand user goals and needs
Learn from informants and experts
Collaborate with internal team
Collect feedback and ideas externally

SLIDE 26

Displays and Views

SLIDE 27

See where the burndown was happening, explain why, and remediate

SLIDE 28

Expressing SLOs

Time based: “How many 5-minute periods had a P95(duration) < 500 ms?”
Event based: “How many events had a duration < 500 ms?”
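The two formulations can give quite different answers on the same data. The sketch below compares them on made-up events; the nearest-rank P95 and the `(timestamp_seconds, duration_ms)` tuple layout are assumptions for illustration.

```python
import math
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def time_based_compliance(events, window_s=300, threshold_ms=500):
    """Fraction of 5-minute windows whose P95 duration is under the threshold."""
    windows = defaultdict(list)
    for ts, duration_ms in events:
        windows[ts // window_s].append(duration_ms)
    good = sum(1 for durations in windows.values() if p95(durations) < threshold_ms)
    return good / len(windows)


def event_based_compliance(events, threshold_ms=500):
    """Fraction of individual events whose duration is under the threshold."""
    good = sum(1 for _, duration_ms in events if duration_ms < threshold_ms)
    return good / len(events)


# Window 0: 100 fast events. Window 1: 8 fast events and 2 slow ones.
events = [(i, 100) for i in range(100)]
events += [(300 + i, 100) for i in range(8)] + [(310, 900), (311, 900)]

time_based = time_based_compliance(events)
event_based = event_based_compliance(events)
print(f"time based: {time_based:.1%}, event based: {event_based:.1%}")
```

Here a quiet window with two slow requests counts as a whole bad period in the time-based view, while the event-based view weights it by the handful of events it actually contained.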

SLIDE 29

How do we express SLOs?

Good events
Bad events
How often
Time range

SLIDE 31

How do we express SLOs?

Good events
Bad events
How often
Time range

Eligible: $name is “run_trigger_detailed”
Good: $app.error does not exist
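One way to read the example definition above is as a single function that returns nothing for ineligible events and true or false for eligible ones. This is a sketch only: mapping the `$name` and `$app.error` columns onto dictionary keys `name` and `app.error` is an assumption, and the events are invented.

```python
def sli(event: dict):
    """Return None if the event is not eligible, else True (good) or False (bad)."""
    if event.get("name") != "run_trigger_detailed":
        return None
    return "app.error" not in event


# Hypothetical events for illustration.
events = [
    {"name": "run_trigger_detailed"},
    {"name": "run_trigger_detailed", "app.error": "timeout"},
    {"name": "render_dashboard", "app.error": "boom"},  # not eligible
    {"name": "run_trigger_detailed"},
]
results = [r for r in map(sli, events) if r is not None]
compliance = sum(results) / len(results)
print(f"compliance: {compliance:.1%}")
```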

SLIDE 34

Status of an SLO

SLIDE 35

How have we done?

SLIDE 37

Where did it go?

SLIDE 38

When did the errors happen?

SLIDE 40

What went wrong?

High dimensional data
High cardinality data

SLIDE 41

Why did it happen?

SLIDE 44

See where the burndown was happening, explain why, and remediate

SLIDE 45

User Feedback

“The Bubble Up in the SLO page is really powerful at highlighting what is contributing the most to missing our SLIs, it has definitely confirmed our assumptions.”

SLIDE 46

User Feedback

“Your customers have to be happy... we have to have an understanding of the customer experience. … To the millisecond we knew what our percentage was of success versus failure.”
Josh Hull, Site Reliability Engineering Lead, Clover Health

SLIDE 47

User Feedback

“The historical SLO chart also confirms a fix for a performance issue we did greatly contributed to the SLO compliance by showing a nice upward trend line. :)”

SLIDE 48

User Feedback

“I’d love to drive alerts off our SLOs. Right now we don’t have anything to draw us in and have some alerts on the average error rate but they’re a little spiky to be useful. It would be great to get a better sense of when the budget is going and define alerts that way.”

SLIDE 49

Burndown Alerts

SLIDE 50

How is my system doing?

Am I over budget?
When will my alarm fail?

SLIDE 51

When will I fail?

User goal: get alerts based on time to exhaustion
Human-digestible units
24 hours: “I’ll take a look in the morning”
4 hours: “All hands on deck!”
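A simple way to turn remaining budget into a time to exhaustion is linear extrapolation from recent burn. The sketch below assumes that approach (the slides do not say how the projection is computed); the function names and sample numbers are hypothetical.

```python
def hours_to_exhaustion(budget_remaining, burned_recently, lookback_hours):
    """Linearly extrapolate the recent burn to estimate hours until the budget is gone.

    budget_remaining and burned_recently are fractions of the total budget.
    Returns None when nothing was burned in the lookback window.
    """
    if burned_recently <= 0:
        return None
    rate_per_hour = burned_recently / lookback_hours
    return budget_remaining / rate_per_hour


def urgency(hours):
    """Map the projection onto the two human-digestible thresholds."""
    if hours is None or hours > 24:
        return "ok"
    if hours > 4:
        return "I'll take a look in the morning"
    return "All hands on deck!"


slow_burn = hours_to_exhaustion(0.5, burned_recently=0.125, lookback_hours=2)
fast_burn = hours_to_exhaustion(0.5, burned_recently=0.25, lookback_hours=1)
print(slow_burn, urgency(slow_burn))
print(fast_burn, urgency(fast_burn))
```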

SLIDE 54

Implementing Burn Alerts

Run a 30-day query at a 5-minute resolution, every minute

SLIDE 56

Caching is Fun!

SLIDE 57

Fun with Caching

Vital to cache results
… but not incomplete results
… at what resolution of cache?
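The "cache results, but not incomplete results" rule can be sketched as a per-bucket cache that only stores a bucket once its time range has fully elapsed. This is a minimal illustration under assumed names (`BucketCache`, `query_fn`), and it ignores late-arriving data.

```python
BUCKET_SECONDS = 5 * 60
# A 30-day window at 5-minute resolution is 8640 buckets per evaluation.
BUCKETS_PER_WINDOW = 30 * 24 * 60 * 60 // BUCKET_SECONDS


class BucketCache:
    """Caches per-bucket query results, but only for buckets that are complete."""

    def __init__(self, bucket_s=BUCKET_SECONDS):
        self.bucket_s = bucket_s
        self._cache = {}
        self.queries = 0  # how many times the backing store was hit

    def get(self, bucket_start, now, query_fn):
        complete = bucket_start + self.bucket_s <= now
        if complete and bucket_start in self._cache:
            return self._cache[bucket_start]
        self.queries += 1
        result = query_fn(bucket_start)
        if complete:
            # Still-open buckets are never stored: their counts can change.
            self._cache[bucket_start] = result
        return result


cache = BucketCache()
count_events = lambda bucket_start: 42  # stand-in for a real query
cache.get(0, now=1000, query_fn=count_events)    # complete: queried, then cached
cache.get(0, now=1000, query_fn=count_events)    # served from cache
cache.get(900, now=1000, query_fn=count_events)  # still open: queried
cache.get(900, now=1000, query_fn=count_events)  # still open: queried again
print(cache.queries)
```

With this scheme, each one-minute evaluation only has to re-query the bucket that is still open, rather than all 8640.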

SLIDE 58

Flappy Alerts

“It’ll expire at 3:55”
“Wait, make that 4:05”
“Nope, 3:55 again!”
(We added a 10%ish buffer)
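The slide does not say how the buffer is applied. One plausible reading is hysteresis: fire at the threshold, but clear only once the projection has recovered past the threshold plus 10%. The class below is a sketch of that interpretation, with hypothetical names and sample values.

```python
class ExhaustionAlert:
    """Fires at the threshold; clears only after recovering past threshold + buffer."""

    def __init__(self, threshold_hours=4.0, buffer=0.10):
        self.threshold = threshold_hours
        self.clear_at = threshold_hours * (1 + buffer)
        self.firing = False

    def update(self, hours_to_exhaustion):
        if not self.firing and hours_to_exhaustion <= self.threshold:
            self.firing = True
        elif self.firing and hours_to_exhaustion > self.clear_at:
            self.firing = False
        return self.firing


# A projection that wobbles around the 4-hour threshold.
projections = [4.2, 3.9, 4.1, 3.95, 4.3, 4.5]
alert = ExhaustionAlert()
states = [alert.update(h) for h in projections]
print(states)
```

Without the buffer, the same sequence would toggle the alert on and off at almost every step.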

SLIDE 59

Recovering from Bankruptcy

A failure a month ago brought us to -169% and still hasn’t aged out?
That means we don’t get alerts anymore
Customer workaround: delete and re-create the SLO, thus blowing the cache
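A negative budget and the "aging out" behavior both fall out of a rolling window. The sketch below assumes a 30-day window and a budget counted in bad events; the numbers are chosen only to reproduce a -169% reading.

```python
def budget_remaining_pct(bad_by_day, today, allowed_bad, window_days=30):
    """Percent of budget left, counting bad events in the trailing window."""
    window = range(today - window_days + 1, today + 1)
    bad = sum(bad_by_day.get(day, 0) for day in window)
    return 100.0 * (1 - bad / allowed_bad)


# Hypothetical: one large incident on day 0, 1000 bad events allowed per window.
bad_by_day = {0: 2690}
day_29 = budget_remaining_pct(bad_by_day, today=29, allowed_bad=1000)
day_30 = budget_remaining_pct(bad_by_day, today=30, allowed_bad=1000)
print(f"day 29: {day_29:.0f}%, day 30: {day_30:.0f}%")
```

While the incident is still inside the window the budget stays deeply negative, so an alert keyed on "budget about to run out" has nothing new to report until the incident ages out.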

SLIDE 60

Learning from Experience

SLIDE 61

Volume is important

Tolerate at least dozens of bad events per day
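The volume rule can be checked with a quick calculation: given a target, how much daily traffic is needed before the budget allows dozens of bad events per day? The sketch below takes "dozens" to mean 24, which is an assumption made for the example.

```python
import math


def allowed_bad_per_day(events_per_day, target_percent):
    """Average number of bad events per day the target tolerates."""
    return events_per_day * (100 - target_percent) / 100


def min_daily_volume(target_percent, min_bad_per_day=24):
    """Smallest daily event count at which the budget allows min_bad_per_day."""
    # round() guards against float noise before taking the ceiling.
    return math.ceil(round(min_bad_per_day * 100 / (100 - target_percent), 6))


print(allowed_bad_per_day(1_000, 99.9))  # about one bad event a day: too twitchy
print(min_daily_volume(99.9))            # volume needed for two dozen a day
```

At low volume, a single failed request can swing the budget noticeably, which makes the SLO noisy rather than informative.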

SLIDE 62

Faults

SLIDE 63

SLOs for Customer Service

Remember that user having a bad day?

SLIDE 64

Blackouts are easy

… but brownouts are much more interesting

SLIDE 72

Timeline

1:29 am   SLO alerts. “Maybe it’s just a blip”
          1.5% brownout for 20 minutes
4:21 am   Minor incident. “It might be an AWS problem”
6:25 am   SLO alerts again. “Could it be ALB compat?”
9:55 am   “Why is our system uptime dropping to zero?”
          It’s out of memory
          We aren’t alerting on that crash
10:32 am  Fixed
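To see why a 1.5% brownout is enough to trip a burn alert, it helps to compare the error rate with the rate the SLO allows. The sketch below assumes a 99.9% target over 30 days, which the slides do not state for this incident.

```python
def burn_rate(error_rate, target):
    """How many times faster than sustainable the budget is being spent."""
    return error_rate / (1 - target)


def hours_to_burn_full_budget(error_rate, target, period_days=30):
    """Hours for a sustained error rate to consume an entire period's budget."""
    return period_days * 24 / burn_rate(error_rate, target)


# Assumed target of 99.9%; the 1.5% error rate is the brownout from the timeline.
rate = burn_rate(0.015, 0.999)
hours = hours_to_burn_full_budget(0.015, 0.999)
print(f"burn rate: {rate:.0f}x, full budget gone in {hours:.0f} hours")
```

Under that assumption, a brownout far too small to look like an outage would still exhaust a month's budget in about two days if it persisted.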

SLIDE 75

We stopped writing new features
We prioritized stability
We mitigated risks
… and we promoted our SLO burn alerts

SLIDE 76

Cultural Change

It’s hard to replace alerts with SLOs
But a clear incident can help

SLIDE 77

Reduce Alarm Fatigue

Focus on user-affecting SLOs
Focus on actionable alarms

SLIDE 78

Conclusion

SLIDE 79

SLOs allowed us to characterize what went wrong, how badly it went wrong, and how to prioritize repair

SLIDE 80

You can do it too
And maybe avoid our mistakes

SLIDE 81

Pitfalls in Measuring SLOs

Email: danyel@honeycomb.io
Twitter: @fisherdanyel
Visit our booth on the 5th floor to kick the SLO tires, and to learn how we debug in high-res
hny.co/danyel