A DARK AND STORMY NIGHT - Kiran Bhattaram (@kiranb)



slide-1
SLIDE 1

A DARK AND STORMY NIGHT

TALES OF OPERABILITY ANTI-PATTERNS

slide-2
SLIDE 2

Kiran Bhattaram

@kiranb

slide-3
SLIDE 3

BULWER-LYTTON

It was a dark and stormy night; the rain fell in torrents — except at

occasional intervals, when it was checked by a violent gust of wind which swept up the streets (for it is in London that our scene lies), rattling along the housetops, and fiercely agitating the scanty flame of the lamps that struggled against the darkness.
slide-4
SLIDE 4

DEFINITIONS

What is operability?

▸ The ability to keep a system in a safe and reliable functioning condition, according to pre-defined operational requirements.
slide-5
SLIDE 5

DEFINITIONS

Characteristics of operability

▸ safety & reliability
▸ scalability
▸ grace under pressure
▸ ease of upgrades
▸ observability
▸ usability
▸ cultural practices around incidents
▸ AND MORE

slide-6
SLIDE 6

DEFINITIONS

Characteristics of an operable system

▸ Converge towards a stable state.
▸ Give operators visibility and tools.
▸ Designed to be usable and unsurprising.

slide-7
SLIDE 7

DEFINITIONS

Agenda

▸ Robustness
▸ Observability
▸ Usability
▸ Review!

slide-8
SLIDE 8

1. ROBUSTNESS

slide-9
SLIDE 9

THE TALE OF THE SYSTEM THAT COULDN’T GIVE ANYTHING UP

STORY 1

slide-10
SLIDE 10

ROBUSTNESS

Define your critical path.

slide-11
SLIDE 11

ROBUSTNESS

Harvest, Yield and Scalable Tolerant Systems

Yield = successful requests / total requests (!= uptime)
Harvest = data available / total data

* dropping requests reduces yield
* degrading responses reduces harvest
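A minimal sketch of the two ratios, assuming per-window counters are already collected somewhere. The `WindowStats` fields and the idea of measuring harvest by available shards are illustrative assumptions, not something the slides specify.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Counters for one measurement window (all field names are hypothetical)."""
    total_requests: int
    successful_requests: int
    shards_total: int        # partitions that together hold the full data set
    shards_available: int    # partitions that could be consulted for answers


def yield_fraction(stats: WindowStats) -> float:
    """Yield: fraction of requests that were answered successfully."""
    if stats.total_requests == 0:
        return 1.0  # no traffic means nothing was dropped
    return stats.successful_requests / stats.total_requests


def harvest_fraction(stats: WindowStats) -> float:
    """Harvest: fraction of the data that the answers could reflect."""
    if stats.shards_total == 0:
        return 1.0
    return stats.shards_available / stats.shards_total


# Shedding 5% of requests lowers yield; losing 1 of 4 shards lowers harvest.
stats = WindowStats(total_requests=2000, successful_requests=1900,
                    shards_total=4, shards_available=3)
print(f"yield={yield_fraction(stats):.2f} harvest={harvest_fraction(stats):.2f}")
```

Tracking the two numbers separately shows which lever a degradation pulled: dropped requests or thinner responses.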

slide-12
SLIDE 12

ROBUSTNESS

Controlling yield: load shedding upstream requests

▸ categories of load shedders:
  ▸ # of requests
  ▸ # of concurrent requests (protect against the long tail)
  ▸ overall fleet utilization (keep x% of workers for core traffic)
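One way to combine the last two categories, as a rough sketch: cap in-flight requests and hold back part of that capacity for critical-path traffic. The class name, parameters, and the critical/non-critical split are assumptions made for illustration.

```python
import threading


class ConcurrencyShedder:
    """Rejects work once too many requests are in flight.

    A slice of capacity is reserved for critical-path traffic, so
    non-critical requests are shed first. (Hypothetical helper.)
    """

    def __init__(self, max_in_flight: int, reserved_for_critical: int):
        if not 0 <= reserved_for_critical <= max_in_flight:
            raise ValueError("reserved capacity must fit within the limit")
        self._max = max_in_flight
        self._reserved = reserved_for_critical
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self, critical: bool) -> bool:
        """Return True if the request may proceed, False if it is shed."""
        with self._lock:
            limit = self._max if critical else self._max - self._reserved
            if self._in_flight >= limit:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Call when an admitted request finishes."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


shedder = ConcurrencyShedder(max_in_flight=3, reserved_for_critical=1)
decisions = [
    shedder.try_acquire(critical=False),  # admitted (1 in flight)
    shedder.try_acquire(critical=False),  # admitted (2 in flight)
    shedder.try_acquire(critical=False),  # shed: only reserved capacity left
    shedder.try_acquire(critical=True),   # admitted (3 in flight)
    shedder.try_acquire(critical=True),   # shed: everything is in use
]
print(decisions)  # [True, True, False, True, False]
```

Limiting concurrency rather than request rate means a few very slow requests cannot quietly occupy every worker.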

slide-13
SLIDE 13

ROBUSTNESS

Controlling harvest: circuit breakers

▸ stop calling a dependency if it seems down!
▸ what do you return?
  ▸ cached data
  ▸ nil
  ▸ or propagate the error upstream
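A small sketch of a circuit breaker that returns a fallback (cached data here) while the dependency looks down. The threshold, the cool-down, and the function names are assumptions, and a production breaker would also need thread safety.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stops calling a dependency after repeated failures (illustrative only)."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # After the cool-down, let a trial call through to probe the dependency.
        return self._clock() - self._opened_at < self._reset_after_s

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()          # harvest degrades, the request still succeeds
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0             # a success closes the breaker again
        self._opened_at = None
        return result


attempts: List[int] = []


def flaky_dependency() -> str:
    attempts.append(1)
    raise ConnectionError("dependency is down")


def cached_value() -> str:
    return "cached profile"


breaker = CircuitBreaker(failure_threshold=2, reset_after_s=60.0)
results = [breaker.call(flaky_dependency, cached_value) for _ in range(5)]
print(results[0], len(attempts))  # the dependency is only tried twice
```

Whether the fallback is cached data, nil, or a re-raised error is a per-call decision; here it is passed in by the caller.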

slide-14
SLIDE 14

ROBUSTNESS

Controlling harvest: circuit breakers & compartmentalization

http://idighardware.com/2013/10/fire-doors-everything-you-always-wanted-to-know-but-were-afraid-to-ask/

slide-15
SLIDE 15

ROBUSTNESS

Putting it all together: giving things up

▸ Combine harvest/yield degradation in different ways to protect the critical path
▸ Monitor any degradation!
▸ Dark launch your rate limiters to check what they’d block.
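Dark launching a limiter could look roughly like this: the limiter makes its usual decision but, with enforcement off, only counts what it would have blocked. The token-bucket choice, the `enforce` flag, and the `would_block` counter are assumptions for the sketch.

```python
import time
from typing import Callable


class TokenBucketLimiter:
    """Token bucket with a dark-launch mode (hypothetical helper)."""

    def __init__(self, rate_per_s: float, burst: int, enforce: bool,
                 clock: Callable[[], float] = time.monotonic):
        self._rate = rate_per_s
        self._capacity = float(burst)
        self._tokens = float(burst)
        self._enforce = enforce
        self._clock = clock
        self._last = clock()
        self.would_block = 0  # export this counter to your metrics system

    def allow(self) -> bool:
        now = self._clock()
        # Refill tokens for the time elapsed since the previous call.
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        self.would_block += 1
        # In dark-launch mode the request is still let through.
        return not self._enforce


fake_now = [0.0]  # a fake clock keeps the example deterministic
dark = TokenBucketLimiter(1.0, burst=2, enforce=False, clock=lambda: fake_now[0])
dark_results = [dark.allow() for _ in range(5)]

live = TokenBucketLimiter(1.0, burst=2, enforce=True, clock=lambda: fake_now[0])
live_results = [live.allow() for _ in range(5)]

print(dark_results, dark.would_block)  # all allowed, 3 counted as would-block
print(live_results)                    # [True, True, False, False, False]
```

Comparing `would_block` against real traffic before flipping `enforce` shows whether the limits would hurt legitimate users.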

slide-16
SLIDE 16

ROBUSTNESS

Robustness, in review

▸ know how the system sheds load
▸ know how it reacts to downstream failures

Converge to a stable state.

slide-17
SLIDE 17

2. OBSERVABILITY

slide-18
SLIDE 18

THE TALE OF THE FRACTAL QUEUE

STORY 2

slide-19
SLIDE 19

OBSERVABILITY

Instrument EVERYTHING

▸ especially with queues
▸ percentiles, not averages
▸ don’t intermingle logs (keep a searchable trace ID on requests)
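A quick illustration of why percentiles matter, using made-up latencies: most requests are fast and two are very slow. The nearest-rank `percentile` helper is one common definition, chosen here for simplicity.

```python
from typing import List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical queue latencies in milliseconds: 98 fast requests, 2 slow ones.
latencies_ms = [20.0] * 98 + [2000.0, 4000.0]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f}ms p50={p50_ms:.0f}ms p99={p99_ms:.0f}ms")
```

The mean lands near 80 ms, a latency no request actually saw, while p50 and p99 show both the typical case and the long tail.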

slide-20
SLIDE 20

OBSERVABILITY

Over-collect data, but build dashboards carefully

▸ work metrics
  ▸ is the system doing the thing it’s supposed to?
▸ resource metrics
  ▸ how are the components of the system behaving?
▸ build your dashboard with work metrics first.

slide-21
SLIDE 21

THE TALE OF THE 64 ALERT WEEK

STORY 4

slide-22
SLIDE 22

OBSERVABILITY

Don’t normalize deviance

slide-23
SLIDE 23

OBSERVABILITY

Knowing what to alert on

▸ Monitor the alert volume of your system!
▸ Pages should be actionable and represent user pain.
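Tracking alert volume can be as simple as counting pages per week and flagging weeks over a budget. The weekly budget and the timestamps below are invented for the sketch.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, List, Tuple

Week = Tuple[int, int]  # (ISO year, ISO week number)


def pages_per_week(page_times: Iterable[datetime]) -> Counter:
    """Count pages per ISO week."""
    counts: Counter = Counter()
    for ts in page_times:
        iso = ts.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return counts


def noisy_weeks(counts: Counter, weekly_budget: int) -> List[Week]:
    """Weeks in which on-call received more pages than the budget allows."""
    return sorted(week for week, n in counts.items() if n > weekly_budget)


# Hypothetical paging history.
pages = [
    datetime(2024, 1, 1, 3, 0),
    datetime(2024, 1, 2, 4, 30),
    datetime(2024, 1, 3, 23, 15),
    datetime(2024, 1, 9, 2, 0),
]
counts = pages_per_week(pages)
print(noisy_weeks(counts, weekly_budget=2))  # [(2024, 1)]
```

A week that blows through the budget is a prompt to review which pages were actionable, not just to endure them.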

slide-24
SLIDE 24

OBSERVABILITY

Observability: what we learned

▸ Kiran has a special vendetta against unmonitored queues.
▸ Building good dashboards: work metrics & resource metrics.
▸ Monitor alert volume, too!

slide-25
SLIDE 25

3. USABILITY

slide-26
SLIDE 26

USABILITY

A quick side note: Nielsen Heuristics

1. Visibility of system status
2. Match between system and the real world
3. User control and freedom
4. Consistency and standards
5. Error prevention
6. Recognition vs. recall
7. Flexibility and efficiency of use
8. Aesthetic and minimalist design
9. Help users recognize, diagnose, and recover from errors
10. Help and documentation

slide-27
SLIDE 27

Story 5: the tale of the special snowflake service

slide-28
SLIDE 28

USABILITY

Heuristic 4. Consistency and Standards

▸ pattern-matching across similar systems is really valuable!
▸ Choose boring technology: spend your innovation tokens wisely!

slide-29
SLIDE 29

USABILITY

Heuristic 3. User control and freedom

▸ Tooling is a part of the service!
▸ relatedly, deploy mechanisms affect availability!
▸ Give operators the ability to change operational parameters.

slide-30
SLIDE 30

THE TALE OF THE OPS SPELL BOOK

STORY 6

slide-31
SLIDE 31

USABILITY

Heuristic 6. Recognition vs. recall

▸ Keep checklists minimal and heavily automated.
  ▸ long flowcharts in a runbook are :(
▸ relatedly: scripting user communications is helpful.

slide-32
SLIDE 32

USABILITY

Heuristic 1. Visibility of system status

▸ which of these are changes to production?
  ▸ config changes
  ▸ deploys
  ▸ utility script runs
  ▸ failovers
  ▸ adding/decreasing capacity
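If all of these count as changes to production, one option is to record every one of them in a single timeline so operators can ask "what changed recently?". The `ChangeLog` class, its fields, and the sample events are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# The change kinds listed above, as short labels (names are assumptions).
CHANGE_KINDS = {"config", "deploy", "script", "failover", "capacity"}


@dataclass(frozen=True)
class ChangeEvent:
    at: datetime
    kind: str
    actor: str
    summary: str


class ChangeLog:
    """In-memory timeline of production changes (illustrative only)."""

    def __init__(self) -> None:
        self._events: List[ChangeEvent] = []

    def record(self, at: datetime, kind: str, actor: str, summary: str) -> ChangeEvent:
        if kind not in CHANGE_KINDS:
            raise ValueError(f"unknown change kind: {kind}")
        event = ChangeEvent(at, kind, actor, summary)
        self._events.append(event)
        return event

    def since(self, now: datetime, window: timedelta) -> List[ChangeEvent]:
        """Changes made within `window` before `now`, oldest first."""
        cutoff = now - window
        recent = [e for e in self._events if cutoff <= e.at <= now]
        return sorted(recent, key=lambda e: e.at)


log = ChangeLog()
now = datetime(2024, 1, 1, 12, 0)
log.record(datetime(2024, 1, 1, 9, 0), "capacity", "alice", "added 4 workers")
log.record(datetime(2024, 1, 1, 11, 50), "deploy", "bob", "api build 1234")
log.record(datetime(2024, 1, 1, 11, 58), "config", "carol", "raised queue timeout")

recent_kinds = [e.kind for e in log.since(now, timedelta(minutes=15))]
print(recent_kinds)  # ['deploy', 'config']
```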

slide-33
SLIDE 33

THE TALE OF THE AMBIGUOUS ERROR MESSAGE

STORY 7

slide-34
SLIDE 34

USABILITY

Heuristic 9. Help users recognize, diagnose, and recover from errors

▸ error messages are a crucial part of your interface
▸ Writing a good alert message:
  ▸ express it in plain language, precisely indicate the problem, and constructively suggest a solution (runbooks!)
  ▸ (ex.) CRITICAL: Served 5% 5xx results in the last 5 minutes! <link to runbook>
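A sketch of generating an alert in that shape from raw counts. The threshold, the function name, and the runbook URL are placeholders, not part of the talk.

```python
from typing import Optional


def build_5xx_alert(total_responses: int, error_responses: int,
                    window_minutes: int, threshold: float,
                    runbook_url: str) -> Optional[str]:
    """Return an alert message if the 5xx ratio meets the threshold, else None."""
    if total_responses == 0:
        return None  # nothing served, so there is no error ratio to report
    ratio = error_responses / total_responses
    if ratio < threshold:
        return None
    return (
        f"CRITICAL: Served {ratio:.0%} 5xx results in the last "
        f"{window_minutes} minutes! Runbook: {runbook_url}"
    )


# Hypothetical numbers and a placeholder runbook location.
message = build_5xx_alert(
    total_responses=1000,
    error_responses=50,
    window_minutes=5,
    threshold=0.05,
    runbook_url="https://runbooks.example.com/api-5xx",
)
print(message)
```

The message states what happened, over what window, and where to go next, so the operator does not have to recall any of it at 3 a.m.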

slide-35
SLIDE 35

USABILITY

Usability, in review

▸ Operational experience matters! Consider:
  ▸ whether the system follows general conventions.
  ▸ how it alerts operators to errors clearly and unambiguously.
  ▸ how minimal and usable the tooling is.

slide-36
SLIDE 36

REVIEW

Review

▸ Robustness
  ▸ Does your system converge to a stable state?
▸ Observability
  ▸ Can you infer what the internal state of the system looks like?
▸ Usability
  ▸ Do your operators have control over the state of the system? Do you adhere to general standards?

slide-37
SLIDE 37

THE TALE OF THE SAD QUEUE

STORY THE LAST

: (

slide-38
SLIDE 38

A DARK AND STORMY NIGHT

STORY THE LAST

slide-39
SLIDE 39

REVIEW

Resources

▸ Harvest, Yield, and Scalable Tolerant Systems (Brewer & Fox)
▸ How Complex Systems Fail (Cook)
▸ "Going solid": a model of system dynamics and consequences for patient safety (Cook)
▸ Nielsen’s Usability Heuristics
▸ Choose Boring Technology (Dan McKinley)
▸ Site Reliability Engineering: How Google Runs Production Systems
▸ Stripe’s (upcoming) rate limiting blog post
▸ Collection of postmortems (Dan Luu)

slide-40
SLIDE 40

REVIEW

On Designing and Deploying Internet-Scale Services, James Hamilton

▸ list of best practices, from design, to upgrades, to incident response

slide-41
SLIDE 41

THANKS!

Thanks to Ines Sombra, Charity Majors, Alyssa Frazee, Rachel Sanders, and Andy Bonventre for review!

slide-42
SLIDE 42

APPENDIX

STUFF I COULDN’T GET TO

slide-43
SLIDE 43

OBSERVABILITY

decouple deploys from releases

▸ get a minimal version in dark-reads into production asap
▸ corollary: have good kill switches!
▸ Know what rollbacks look like
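Dark reads behind a kill switch might look something like this: the old path always serves the response, the new path runs in its shadow, and an operator can turn the shadow off at runtime. The switch name, the two stores, and the mismatch list are all assumptions for the sketch.

```python
from typing import Callable, Dict, List


class KillSwitches:
    """Runtime on/off flags an operator can flip (hypothetical helper)."""

    def __init__(self, defaults: Dict[str, bool]):
        self._enabled = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        return self._enabled.get(name, False)  # unknown switches stay off

    def kill(self, name: str) -> None:
        self._enabled[name] = False


def read_profile(user_id: int, switches: KillSwitches,
                 old_read: Callable[[int], dict],
                 new_read: Callable[[int], dict],
                 mismatches: List[int]) -> dict:
    """Serve from the old path; exercise the new path without trusting it."""
    result = old_read(user_id)
    if switches.is_enabled("dark_read_new_store"):
        try:
            if new_read(user_id) != result:
                mismatches.append(user_id)
        except Exception:
            mismatches.append(user_id)  # a shadow failure must not hurt users
    return result


old_store = {1: {"name": "ada"}, 2: {"name": "grace"}}
new_store = {1: {"name": "ada"}, 2: {"name": "GRACE"}}  # not yet consistent
new_calls: List[int] = []


def new_read(user_id: int) -> dict:
    new_calls.append(user_id)
    return new_store[user_id]


switches = KillSwitches({"dark_read_new_store": True})
mismatches: List[int] = []
served = [read_profile(uid, switches, old_store.__getitem__, new_read, mismatches)
          for uid in (1, 2)]

switches.kill("dark_read_new_store")  # operator turns the shadow path off
served.append(read_profile(2, switches, old_store.__getitem__, new_read, mismatches))
print(mismatches, len(new_calls))     # [2] 2
```

Because users only ever see the old path's answer, the new code can ship early and be released later, once the mismatch count is boring.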

slide-44
SLIDE 44

OBSERVABILITY

collect operational metrics in this shadow phase

▸ Gain historical knowledge of what the system’s healthy state looks like.
▸ Tweak your alerts and SLAs.
▸ Gameday the system! Write runbooks!