The Art of SLOs


slide-1
SLIDE 1 https://cre.page.link/art-of-slos-slides

The Art of SLOs

In the midst of chaos, there is also opportunity [crossed out] reliability

— Sun Tzu, The Art of War

slide-2
SLIDE 2 https://cre.page.link/art-of-slos-slides

Welcome!

Don't be shy … say hello to your neighbours

slide-3
SLIDE 3 https://cre.page.link/art-of-slos-slides

Group Agreements

  • We’re here to learn
  • Please ask questions (raise your hand)
  • One speaker at a time
  • Assume positive intent
  • “Why am I speaking?”

slide-4
SLIDE 4 https://cre.page.link/art-of-slos-slides

Agenda

  • Terminology
  • Why your services need SLOs
  • Spending your error budget
  • Choosing a good SLI
  • Developing SLOs and SLIs

slide-5
SLIDE 5 https://cre.page.link/art-of-slos-slides

Service Level Indicator

A quantifiable measure of service reliability

slide-6
SLIDE 6 https://cre.page.link/art-of-slos-slides

Service Level Objectives

Set a reliability target for an SLI

slide-7
SLIDE 7 https://cre.page.link/art-of-slos-slides

Users? Customers?

Customers are users who directly pay for a service

slide-8
SLIDE 8 https://cre.page.link/art-of-slos-slides

Services Need SLOs

slide-9
SLIDE 9 https://cre.page.link/art-of-slos-slides

Don't believe us?

"Since introducing SLOs, the relationship between our operations and development teams has subtly but markedly improved."

— Ben McCormack, Evernote; The Site Reliability Workbook, Chapter 3

"... it is difficult to do your job well without clearly defining well. SLOs provide the language we need to define well."

— Theo Schlossnagle, Circonus; Seeking SRE, Chapter 21

slide-10
SLIDE 10 https://cre.page.link/art-of-slos-slides

The most important feature of any system is its reliability

slide-11
SLIDE 11 https://cre.page.link/art-of-slos-slides

Operators

Stability

How do you incentivize reliability?

Developers

Agility

slide-12
SLIDE 12 https://cre.page.link/art-of-slos-slides

A principled way to agree on the desired reliability of a service
slide-13
SLIDE 13 https://cre.page.link/art-of-slos-slides

What does "reliable" mean?

Think about Netflix, Google Search, Gmail, Twitter… how do you tell if they are ‘working’?

slide-14
SLIDE 14 https://cre.page.link/art-of-slos-slides

[Figure: response-time axis marked 0 ms, 200 ms, 300 ms; Customer: “HTTP GET / …”, “Ugh”]

Objective Agreement

slide-15
SLIDE 15 https://cre.page.link/art-of-slos-slides

With me so far?

slide-16
SLIDE 16 https://cre.page.link/art-of-slos-slides

When do we need to make a service more reliable?

slide-17
SLIDE 17 https://cre.page.link/art-of-slos-slides

100%

100% is the wrong reliability target for basically everything

— Benjamin Treynor Sloss, VP 24x7, Google; Site Reliability Engineering, Introduction

slide-18
SLIDE 18 https://cre.page.link/art-of-slos-slides

😢😌

SLOs should capture the performance and availability levels that, if barely met, would keep the typical customer of a service happy

“meets SLO targets” ⇒ “happy customers”
“sad customers” ⇒ “misses SLO targets”

slide-19
SLIDE 19 https://cre.page.link/art-of-slos-slides

Measure SLO achieved & try to be slightly over target...

Target SLI

slide-20
SLIDE 20 https://cre.page.link/art-of-slos-slides

…but don’t be too much better or users will depend on it

"Workflow", Randall Munroe, XKCD. Source: https://xkcd.com/1172/

[Figure labels: Target, SLI, !]

slide-21
SLIDE 21 https://cre.page.link/art-of-slos-slides

Error Budgets

An SLO implies an acceptable level of unreliability.
This is a budget that can be allocated.
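As a rough illustration of the arithmetic behind this slide, the sketch below turns an SLO target into an error budget. The function names and the choice to pass the target as a decimal string (to avoid floating-point rounding) are assumptions made for this example, not part of the workshop material.

```python
import math
from fractions import Fraction


def error_budget(slo_percent: str) -> Fraction:
    """Fraction of valid events that may be bad, e.g. "99.9" -> 1/1000."""
    target = Fraction(slo_percent)
    if not 0 < target <= 100:
        raise ValueError("SLO target must be in (0, 100]")
    return 1 - target / 100


def allowed_bad_events(slo_percent: str, valid_events: int) -> int:
    """Whole number of bad events the budget tolerates for a given volume."""
    return math.floor(error_budget(slo_percent) * valid_events)


if __name__ == "__main__":
    # A 99.9% target over one million valid requests leaves room for 1000 failures.
    print(allowed_bad_events("99.9", 1_000_000))
```

The budget is simply the complement of the target, which is why a 100% target leaves nothing to allocate.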

slide-22
SLIDE 22 https://cre.page.link/art-of-slos-slides

Implementation Mechanics

Evaluate SLO performance over a set window, e.g. 28 days.
Remaining budget drives prioritization of engineering effort.
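One way to picture these mechanics is the sketch below: it computes how much of the window's budget is left from event counts and maps that to a priority. The `next_priority` rule is a hypothetical policy invented for illustration; a real error budget policy is something each team agrees on.

```python
from fractions import Fraction


def budget_remaining(slo_percent: str, valid_events: int, bad_events: int) -> Fraction:
    """Share of the window's error budget still unspent (negative if overspent)."""
    allowed_bad = (1 - Fraction(slo_percent) / 100) * valid_events
    if allowed_bad == 0:
        raise ValueError("no error budget: target is 100% or there were no valid events")
    return 1 - Fraction(bad_events) / allowed_bad


def next_priority(remaining: Fraction) -> str:
    """Hypothetical policy: spend budget on features until it runs out."""
    return "feature work" if remaining > 0 else "reliability work"


if __name__ == "__main__":
    # Counts are assumed to cover the whole evaluation window, e.g. the last 28 days.
    left = budget_remaining("99.9", valid_events=1_000_000, bad_events=250)
    print(float(left), next_priority(left))
```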

slide-23
SLIDE 23 https://cre.page.link/art-of-slos-slides

ITIL Approximation

Service in SLO → most operational work is a standard change
Service close to being out of SLO → revert to normal change

(No, I don't understand the difference between "standard" and "normal" either…)

slide-24
SLIDE 24 https://cre.page.link/art-of-slos-slides

What should we spend our error budget on?
slide-25
SLIDE 25 https://cre.page.link/art-of-slos-slides

Error budgets can accommodate

  • releasing new features
  • expected system changes
  • inevitable failure in hardware, networks, etc.
  • planned downtime
  • risky experiments

slide-26
SLIDE 26 https://cre.page.link/art-of-slos-slides

Benefits of error budgets

Common incentive for devs and SREs
  • Find the right balance between innovation and reliability

Dev team can manage the risk themselves
  • They decide how to spend their error budget

Unrealistic reliability goals become unattractive
  • These goals dampen the velocity of innovation

Dev team becomes self-policing
  • The error budget is a valuable resource for them

Shared responsibility for system uptime
  • Infrastructure failures eat into the error budget

slide-27
SLIDE 27 https://cre.page.link/art-of-slos-slides

Still with me?

slide-28
SLIDE 28 https://cre.page.link/art-of-slos-slides

Activity

Reliability Principles

slide-29
SLIDE 29 https://cre.page.link/art-of-slos-slides

Dear Colleagues,

The negative press from our recent outage has convinced me that we all need to take the reliability of our services more seriously. In this open letter, I want to lay down three reliability principles to guide your future decision making.

slide-30
SLIDE 30 https://cre.page.link/art-of-slos-slides

The first principle concerns our users. We let them down, but they deserve better. They deserve to be happy when using our services! Our business must ...

1. ... rebuild user trust by making a financial commitment to reliability.
2. ... find ways to help our users tolerate or enjoy future outages.
3. ... meet our users' expectations of reliability before building features.
4. ... build the features that make our users happy faster.
5. ... never suffer another outage, ever again!

slide-31
SLIDE 31 https://cre.page.link/art-of-slos-slides

The second principle concerns the way we build our services. We have to change our development process to incorporate reliability. Our business must...

1. … choose to fail fast and catch errors early through rapid iteration.
2. … have Ops engage in the design of new features to reduce risk.
3. … only release new features publicly when they are shown to be reliable.
4. … build and release software in small, controlled steps.
5. … reduce feature iteration speed when our systems are unreliable.

slide-32
SLIDE 32 https://cre.page.link/art-of-slos-slides

The third principle concerns our operational practices. What we're doing today isn't working. Our Ops teams are burned out and our incident rate is too high. We have to do things differently to improve! Our business must...

1. … share responsibility for reliability between Ops and Dev teams.
2. … tie operational response and team priorities to a reliability goal.
3. … make our systems more resilient to failure to cut operational load.
4. … give Ops a veto on all releases to prevent failures reaching our users.
5. … route negative complaints on Twitter directly to Ops pagers.

slide-33
SLIDE 33 https://cre.page.link/art-of-slos-slides

To put these principles into practice, we are going to borrow some ideas from Google! The next step is to define some SLOs for our services and begin tracking our performance against them.

Thanks for reading!

Eleanor Exec, CEO

slide-34
SLIDE 34 https://cre.page.link/art-of-slos-slides

Break!

slide-35
SLIDE 35 https://cre.page.link/art-of-slos-slides

Choosing a Good SLI

slide-36
SLIDE 36 https://cre.page.link/art-of-slos-slides
slide-37
SLIDE 37 https://cre.page.link/art-of-slos-slides

[Figure: unhappy users over time]

slide-38
SLIDE 38 https://cre.page.link/art-of-slos-slides

[Figure: two plots of a metric over time, labelled BAD and GOOD]

slide-39
SLIDE 39 https://cre.page.link/art-of-slos-slides

Variance obscures metric deterioration

[Figure: two plots of a metric over time, labelled BAD and GOOD]

slide-40
SLIDE 40 https://cre.page.link/art-of-slos-slides

Metric deterioration correlates with outage

[Figure: two plots of a metric over time, labelled BAD and GOOD]

slide-41
SLIDE 41 https://cre.page.link/art-of-slos-slides

BAD (?): Metric provides poor signal-to-noise ratio
GOOD (✓): Metric provides good signal-to-noise ratio

[Figure: two plots of a metric over time, labelled BAD and GOOD]

slide-42
SLIDE 42 https://cre.page.link/art-of-slos-slides

SLO SLI

slide-43
SLIDE 43 https://cre.page.link/art-of-slos-slides

SLI = (good events / valid events) × 100%
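The ratio on this slide can be computed directly from two counters. A minimal sketch, where the function name and the input checks are assumptions added for illustration:

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """SLI as a percentage: good events divided by valid events, times 100."""
    if valid_events <= 0:
        raise ValueError("an SLI is undefined without valid events")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good events must be a subset of valid events")
    return 100.0 * good_events / valid_events


if __name__ == "__main__":
    print(sli_percent(9_990, 10_000))
```

Expressing every SLI this way keeps the values between 0% and 100%, so targets and error budgets can be reasoned about uniformly.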

slide-44
SLIDE 44 https://cre.page.link/art-of-slos-slides

3–5 SLIs*

* per user journey

slide-45
SLIDE 45 https://cre.page.link/art-of-slos-slides

SLO SLI

slide-46
SLIDE 46 https://cre.page.link/art-of-slos-slides

What performance does the business need?

slide-47
SLIDE 47 https://cre.page.link/art-of-slos-slides

User expectations are strongly tied to past performance

slide-48
SLIDE 48 https://cre.page.link/art-of-slos-slides

?

Continuous Improvement

slide-49
SLIDE 49 https://cre.page.link/art-of-slos-slides

Information overload?
slide-50
SLIDE 50 https://cre.page.link/art-of-slos-slides

Developing SLOs and SLIs

slide-51
SLIDE 51 https://cre.page.link/art-of-slos-slides

?

slide-52
SLIDE 52 https://cre.page.link/art-of-slos-slides

Our Game: Fang Faction

[Architecture diagram: Load Balancer, Web Server, API Server, Leaderboards, User Profiles, Game Servers, Leaderboard Generation]

slide-53
SLIDE 53 https://cre.page.link/art-of-slos-slides

SomeUser's Profile

https://fangfactiongame.com/profile/someuser

SomeUser
Tribe of Frog
Faction Score: 31337
Midwest Canyon

Faction Name: Tribe of Frog
Leader Name: SomeUser
Email Address: user@example.com
[Update]

1. Tri-Bool 65535
2. Tri Repetae 61995
3. Triassic Five 52391
4. Tricksy Hobbits 37164
5. Tribe of Frog 31337
6. Trite Examples 29243
slide-54
SLIDE 54 https://cre.page.link/art-of-slos-slides

Loading a Profile Page

[Architecture diagram: Load Balancer, Web Server, API Server, Leaderboards, User Profiles, Game Servers, Leaderboard Generation]

slide-55
SLIDE 55 https://cre.page.link/art-of-slos-slides

SLI Menu

Request / Response: Availability, Latency, Quality
Data Processing: Coverage, Correctness, Freshness, Throughput
Storage: Throughput, Latency

slide-56
SLIDE 56 https://cre.page.link/art-of-slos-slides

Availability: The profile page should load successfully
Latency: The profile page should load quickly

slide-57
SLIDE 57 https://cre.page.link/art-of-slos-slides

Availability: The profile page should load successfully
  • How do we define success?
  • Where is the success / failure recorded?

Latency: The profile page should load quickly
  • How do we define quickly?
  • When does the timer start / stop?

slide-58
SLIDE 58 https://cre.page.link/art-of-slos-slides

Availability: The profile page should load successfully
SLI: The proportion of valid requests served successfully.
  • How do we define success?
  • Where is the success / failure recorded?

Latency: The profile page should load quickly
SLI: The proportion of valid requests served faster than a threshold.
  • How do we define quickly?
  • When does the timer start / stop?

slide-60
SLIDE 60 https://cre.page.link/art-of-slos-slides

Availability: The profile page should load successfully
SLI: The proportion of HTTP GET requests for /profile/{user} or /profile/{user}/avatar served successfully.
  • How do we define success?
  • Where is the success / failure recorded?

Latency: The profile page should load quickly
SLI: The proportion of HTTP GET requests for /profile/{user} served faster than a threshold.
  • How do we define quickly?
  • When does the timer start / stop?

slide-62
SLIDE 62 https://cre.page.link/art-of-slos-slides

Availability: The profile page should load successfully
SLI: The proportion of HTTP GET requests for /profile/{user} or /profile/{user}/avatar that have 2XX, 3XX or 4XX (excl. 429) status.
  • How do we define success?
  • Where is the success / failure recorded?

Latency: The profile page should load quickly
SLI: The proportion of HTTP GET requests for /profile/{user} served within X ms.
  • How do we define quickly?
  • When does the timer start / stop?
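The success rule for the availability SLI (2XX, 3XX or 4XX except 429) can be written as a small predicate. This is a sketch only: the function names are made up for the example, and the status codes are assumed to have already been filtered down to the valid GET requests named in the SLI.

```python
from typing import Iterable


def is_good_status(status: int) -> bool:
    """2XX, 3XX and 4XX count as success, except 429 (Too Many Requests)."""
    if status == 429:
        return False
    return 200 <= status < 500


def availability_sli(statuses: Iterable[int]) -> float:
    """Percentage of valid requests whose status code counts as success."""
    statuses = list(statuses)
    if not statuses:
        raise ValueError("no valid requests to measure")
    good = sum(1 for status in statuses if is_good_status(status))
    return 100.0 * good / len(statuses)


if __name__ == "__main__":
    # Three of these five responses are good: 200, 302 and 404.
    print(availability_sli([200, 302, 404, 429, 500]))
```

Most 4XX responses are counted as good because they usually reflect a client mistake rather than a service failure, while 429 signals that the service is shedding load.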

slide-63
SLIDE 63 https://cre.page.link/art-of-slos-slides

SLI Menu

Measurement Strategies
  • Application-level Metrics
  • Logs Processing
  • Front-end Infra Metrics
  • Synthetic Clients/Data
  • Client-side Instrumentation

slide-64
SLIDE 64 https://cre.page.link/art-of-slos-slides

Availability: The profile page should load successfully
SLI: The proportion of HTTP GET requests for /profile/{user} or /profile/{user}/avatar that have 2XX, 3XX or 4XX (excl. 429) status measured at the load balancer.
  • How do we define success?
  • Where is the success / failure recorded?

Latency: The profile page should load quickly
SLI: The proportion of HTTP GET requests for /profile/{user} that send their entire response within X ms measured at the load balancer.
  • How do we define quickly?
  • When does the timer start / stop?
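To make the latency SLI concrete, the sketch below computes it from load balancer records. The `LbRecord` shape, its field names, and the path pattern are assumptions for illustration; real load balancer logs will look different, and the threshold X is left as a parameter because the slides do not fix it here.

```python
import re
from dataclasses import dataclass
from typing import Iterable

# Matches /profile/{user} but not /profile/{user}/avatar (assumed URL layout).
PROFILE_PATH = re.compile(r"^/profile/[^/]+$")


@dataclass(frozen=True)
class LbRecord:
    """Hypothetical load balancer log entry."""
    method: str
    path: str
    response_ms: float  # time until the entire response had been sent


def latency_sli(records: Iterable[LbRecord], threshold_ms: float) -> float:
    """Percentage of valid profile requests fully served within the threshold."""
    valid = [r for r in records if r.method == "GET" and PROFILE_PATH.match(r.path)]
    if not valid:
        raise ValueError("no valid requests to measure")
    fast = sum(1 for r in valid if r.response_ms <= threshold_ms)
    return 100.0 * fast / len(valid)


if __name__ == "__main__":
    sample = [
        LbRecord("GET", "/profile/a", 120),
        LbRecord("GET", "/profile/b", 480),
        LbRecord("GET", "/profile/c", 900),
        LbRecord("GET", "/profile/d", 300),
        LbRecord("GET", "/profile/a/avatar", 50),  # not part of the latency SLI
        LbRecord("POST", "/profile/a", 10),        # not a GET, so not valid
    ]
    print(latency_sli(sample, threshold_ms=500))
```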

slide-65
SLIDE 65 https://cre.page.link/art-of-slos-slides

Activity

Postmortem

slide-66
SLIDE 66 https://cre.page.link/art-of-slos-slides

Availability:
  • Proportion of HTTP GET requests for /profile/{user} or /profile/{user}/avatar that have 2XX, 3XX or 4XX (excl. 429) status, measured at the load balancer
and
  • Proportion of HTTP GET requests for /profile/prober_user and all linked resources returning valid HTML containing "ProberUser", measured by a black-box prober every 5s

Latency:
  • Proportion of HTTP GET requests for /profile/{user} that send their entire response within X ms, measured at the load balancer
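The prober-based SLI can be sketched as a check applied to each probe result. This is deliberately simplified and rests on assumptions: it only checks the status code and the "ProberUser" marker, it does not validate the HTML or fetch linked resources, and the probe results are passed in as plain (status, body) pairs rather than fetched over the network.

```python
from typing import Iterable, Tuple


def probe_is_good(status: int, body: str, marker: str = "ProberUser") -> bool:
    """A probe is good when the page loads and contains the expected marker text."""
    return status == 200 and marker in body


def prober_sli(results: Iterable[Tuple[int, str]]) -> float:
    """Percentage of probes that returned the expected profile page."""
    results = list(results)
    if not results:
        raise ValueError("no probe results to measure")
    good = sum(1 for status, body in results if probe_is_good(status, body))
    return 100.0 * good / len(results)


if __name__ == "__main__":
    probes = [
        (200, "<html><h1>ProberUser</h1></html>"),
        (200, "<html>Something went wrong</html>"),  # 200 but wrong content
        (503, ""),
        (200, "<html>ProberUser</html>"),
    ]
    print(prober_sli(probes))
```

The second probe shows why the prober adds coverage: the load balancer would record a 200, yet the user would not see their profile.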

slide-67
SLIDE 67 https://cre.page.link/art-of-slos-slides

Do the SLIs cover the failure modes?

[Architecture diagram: Load Balancer (Availability, Latency), Black Box Prober (Availability), Web Server, API Server, Leader Boards, User Profiles, Game Servers, Leaderboard Generation]

slide-68
SLIDE 68 https://cre.page.link/art-of-slos-slides

Activity

Define SLO Targets

slide-69
SLIDE 69 https://cre.page.link/art-of-slos-slides

What goals should we set for the reliability of our journey?

Service             SLO Type       Objective
Web: User Profile   Availability   99.95% successful in previous 28d
Web: User Profile   Latency        90% of requests < 500ms in previous 28d
...                 ...

Your objectives should have both a target and a measurement window
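A target and a window together are enough to evaluate an objective. The sketch below models that pairing; the `Slo` class and its method names are assumptions for illustration. The `budget_minutes` figure is only a rule of thumb: it equals tolerated full-outage time if traffic is uniform, whereas the SLIs in this workshop are request-based.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Slo:
    """Hypothetical objective: a target percentage over a rolling window."""
    name: str
    target_percent: str  # decimal string, e.g. "99.95", to avoid float rounding
    window_days: int

    def is_met(self, good_events: int, valid_events: int) -> bool:
        """True when the SLI measured over the window reaches the target."""
        return Fraction(good_events, valid_events) * 100 >= Fraction(self.target_percent)

    def budget_minutes(self) -> Fraction:
        """Minutes of total outage the window tolerates, assuming uniform traffic."""
        window_minutes = self.window_days * 24 * 60
        return (1 - Fraction(self.target_percent) / 100) * window_minutes


if __name__ == "__main__":
    availability = Slo("Web: User Profile availability", "99.95", 28)
    latency = Slo("Web: User Profile latency < 500ms", "90", 28)
    print(float(availability.budget_minutes()))  # about 20 minutes per 28 days
    print(availability.is_met(good_events=999_400, valid_events=1_000_000))
    print(latency.is_met(good_events=930, valid_events=1_000))
```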

slide-70
SLIDE 70 https://cre.page.link/art-of-slos-slides

Fallen asleep yet?

slide-71
SLIDE 71 https://cre.page.link/art-of-slos-slides

Break!

slide-72
SLIDE 72 https://cre.page.link/art-of-slos-slides

Workshop: Let's develop some more SLIs and SLOs!

Follow the process we demonstrated for the Buy In-Game Currency journey:

1. Choose SLI specifications from the menu (see booklet, p6)
2. Substitute definitions in to create a detailed SLI implementation
3. Walk through user journey and look for coverage gaps
4. Set aspirational SLOs based on business needs

Once you're done, choose another journey as a group.

You have roughly 45 minutes for each journey.

slide-73
SLIDE 73 https://cre.page.link/art-of-slos-slides

Our Game: Fang Faction

[Architecture diagram: Load Balancer, Web Server, API Server, Leaderboards, User Profiles, Game Servers, Leaderboard Generation, Play Store]

slide-74
SLIDE 74 https://cre.page.link/art-of-slos-slides

Break!

slide-75
SLIDE 75 https://cre.page.link/art-of-slos-slides

Buy In-Game Currency

Model Answer

slide-76
SLIDE 76 https://cre.page.link/art-of-slos-slides

Break Down The Journey

Five request/response pairs

1. Fetch list of SKUs from API server
2. Fetch SKU details from Play Store
3. User launches Play billing flow
4. Send token to API server
5. Verify token with Play Store

[Diagram labels: 1, 2, 3, 4, 5]

slide-77
SLIDE 77 https://cre.page.link/art-of-slos-slides

Break Down The Journey

Journey has two parts. A: Fetch SKUs

1. Fetch list of SKUs from API server
2. Fetch SKU details from Play Store
3. User launches Play billing flow
4. Send token to API server
5. Verify token with Play Store

[Diagram labels: 1, 2]

slide-78
SLIDE 78 https://cre.page.link/art-of-slos-slides

Break Down The Journey

Journey has two parts. B: Buy Item

1. Fetch list of SKUs from API server
2. Fetch SKU details from Play Store
3. User launches Play billing flow
4. Send token to API server
5. Verify token with Play Store

[Diagram labels: 3, 4, 5]

slide-79
SLIDE 79 https://cre.page.link/art-of-slos-slides

Break Down The Journey

User might choose not to buy an item :-(

1. Fetch list of SKUs from API server
2. Fetch SKU details from Play Store
3. User launches Play billing flow
4. Send token to API server
5. Verify token with Play Store

We have to treat these parts separately!

[Diagram labels: 1, 2]

slide-80
SLIDE 80 https://cre.page.link/art-of-slos-slides

Break Down The Journey

Two requests don't hit API server at all!

1. Fetch list of SKUs from API server
2. Fetch SKU details from Play Store
3. User launches Play billing flow
4. Send token to API server
5. Verify token with Play Store

Server or load balancer metrics may not give enough coverage of the journey :-(

[Diagram labels: 2, 3]

slide-81
SLIDE 81 https://cre.page.link/art-of-slos-slides

Break Down The Journey

Two requests don't hit API server at all!

1. Fetch list of SKUs from API server
2. Fetch SKU details from Play Store
3. User launches Play billing flow
4. Send token to API server
5. Verify token with Play Store

Server or load balancer metrics may not give enough coverage of the journey :-(
… we'll have to ask our users to consent to client-side telemetry.

[Diagram labels: 2, 3]

slide-82
SLIDE 82 https://cre.page.link/art-of-slos-slides

Buy Flow What SLIs?

Buy Flow journey is Request / Response.
SLI menu suggests we use Availability and Latency SLIs.

slide-83
SLIDE 83 https://cre.page.link/art-of-slos-slides

Buy Flow Availability: Specification

B makes money, so let's start with that

1. Fetch list of SKUs from API server
2. Fetch SKU details from Play Store
3. User launches Play billing flow
4. Send token to API server
5. Verify token with Play Store

Availability SLI Specification
The proportion of valid requests served successfully.

[Diagram labels: 3, 4, 5]

slide-84
SLIDE 84 https://cre.page.link/art-of-slos-slides

Buy Flow Availability: Valid Requests

Availability SLI
The proportion of valid requests served successfully.

… but which requests are valid?
3. User launches Play billing flow
4. Send token to API server?
5. Verify token with Play Store?

Launching the billing flow indicates a user's intent to buy a product.
Users consenting to client-side telemetry collection allows us to track this intent.

[Diagram label: 3]

slide-85
SLIDE 85 https://cre.page.link/art-of-slos-slides

Buy Flow Availability: Success Criteria

Availability SLI
The proportion of launched billing flows from users consenting to collection served successfully.

… and how do we determine success? All interactions must be successful!
3. Good status code; purchase token
4. Good status code; account updated
5. Good status code; valid token
   ○ Return 402 to API call when token is invalid

[Diagram labels: 3, 4, 5]

slide-86
SLIDE 86 https://cre.page.link/art-of-slos-slides

Buy Flow Availability: Measurement

Availability SLI
The proportion of launched billing flows from users consenting to collection where the billing flow returns:

  • OK and a purchase token
  • or FEATURE_NOT_SUPPORTED
  • or ITEM_UNAVAILABLE
  • or USER_CANCELED

and /api/completePurchase returns:

  • 200 OK and Parseable JSON
  • or 402 Payment Required

… but where are we measuring this?

[Diagram labels: 3, 4]

slide-87
SLIDE 87 https://cre.page.link/art-of-slos-slides

Buy Flow Availability: Measurement

Availability SLI
The proportion of launched billing flows from users consenting to collection where the billing flow returns:

  • OK
  • or FEATURE_NOT_SUPPORTED
  • or ITEM_UNAVAILABLE
  • or USER_CANCELED

and /api/completePurchase returns:

  • 200 OK
  • or 402 Payment Required
  • and Parseable JSON

measured by the game client and reported back asynchronously.

[Diagram labels: 3, 4]
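The success criteria for a launched billing flow could be evaluated on the game client roughly as follows. This is a sketch under stated assumptions: the function and constant names are invented, and it assumes that when the billing flow ends with anything other than OK there is no purchase token, so no /api/completePurchase call is made and the event is judged on the billing result alone.

```python
from typing import Optional

# Billing flow outcomes the SLI treats as "served successfully".
GOOD_BILLING_RESULTS = {"OK", "FEATURE_NOT_SUPPORTED", "ITEM_UNAVAILABLE", "USER_CANCELED"}
# /api/completePurchase statuses treated as success (402 means the token was invalid).
GOOD_API_STATUSES = {200, 402}


def is_good_purchase(billing_result: str,
                     api_status: Optional[int],
                     json_parseable: bool) -> bool:
    """Classify one launched billing flow as a good or bad event."""
    if billing_result not in GOOD_BILLING_RESULTS:
        return False
    if billing_result != "OK":
        # Assumption: no token was issued, so the API is never called.
        return True
    return api_status in GOOD_API_STATUSES and json_parseable


if __name__ == "__main__":
    print(is_good_purchase("OK", 200, True))               # completed purchase
    print(is_good_purchase("USER_CANCELED", None, False))  # user backed out
    print(is_good_purchase("OK", 500, True))               # server error
```

Each result would then be reported back asynchronously and aggregated as good events over valid events.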

slide-88
SLIDE 88 https://cre.page.link/art-of-slos-slides

Buy Flow Latency: Specification

We want to measure latency for B too!

1. Fetch list of SKUs from API server
2. Fetch SKU details from Play Store
3. User launches Play billing flow
4. Send token to API server
5. Verify token with Play Store

Latency SLI Specification
The proportion of valid requests served faster than a threshold.

[Diagram labels: 3, 4, 5]

slide-89
SLIDE 89 https://cre.page.link/art-of-slos-slides

Buy Flow Latency: Valid Requests

Latency SLI
The proportion of valid requests served faster than a threshold.

… but which requests are valid?
3. User launches Play billing flow?
4. Send token to API server
5. Verify token with Play Store?

Why not 3?

  • Too variable, SLI will have poor SNR
  • Billing flow contains lots of "poking device with a finger" time

[Diagram label: 4]

slide-90
SLIDE 90 https://cre.page.link/art-of-slos-slides

Buy Flow Latency: "Too Slow" Threshold

Latency SLI
The proportion of /api/completePurchase requests served faster than a threshold.

… and what is fast enough? Rough estimate time!

  • Verify Token <= 500ms
  • Database Write <= 200ms
  • Round up a bit…
slide-91
SLIDE 91 https://cre.page.link/art-of-slos-slides

Buy Flow Latency: Measurement

Latency SLI
The proportion of /api/completePurchase requests served within 1000ms.

… but where are we measuring this? Where does the timer start/stop?

slide-92
SLIDE 92 https://cre.page.link/art-of-slos-slides

Buy Flow Latency: Measurement

Latency SLI
The proportion of /api/completePurchase requests where the complete response is returned to the client within 1000ms measured at the load balancer.

slide-93
SLIDE 93 https://cre.page.link/art-of-slos-slides

A brief word from our sponsors...
slide-94
SLIDE 94 https://cre.page.link/art-of-slos-slides

https://cre.page.link/art-of-slos

slide-95
SLIDE 95 https://cre.page.link/art-of-slos-slides

Want to learn more about SLOs? Take our course on Coursera: https://cre.page.link/coursera

slide-96
SLIDE 96 https://cre.page.link/art-of-slos-slides

Both of these are now available in HTML format for free! https://landing.google.com/sre/books/

slide-97
SLIDE 97 https://cre.page.link/art-of-slos-slides

Thanks!

Please fill in the feedback form


slide-98
SLIDE 98 https://cre.page.link/art-of-slos-slides

Q&A

Please ask our panelists questions!