The Art of SLOs. "In the midst of chaos, there is also opportunity [reliability]" (Sun Tzu, The Art of War) https://cre.page.link/art-of-slos-slides

Welcome! Don't be shy … say hello to your neighbours

Group Agreements
⁄ We're here to learn
⁄ Please ask questions (raise your hand)
⁄ One speaker at a time
⁄ Assume positive intent
⁄ "Why am I speaking?"

Agenda
⁄ Terminology
⁄ Why your services need SLOs
⁄ Spending your error budget
⁄ Choosing a good SLI
⁄ Developing SLOs and SLIs

Service Level Indicator (SLI): a quantifiable measure of service reliability

Service Level Objectives (SLOs): set a reliability target for an SLI

Users? Customers? Customers are users who directly pay for a service

Services Need SLOs

Don't believe us?
"Since introducing SLOs, the relationship between our operations and development teams has subtly but markedly improved." (Ben McCormack, Evernote; The Site Reliability Workbook, Chapter 3)
"... it is difficult to do your job well without clearly defining 'well'. SLOs provide the language we need to define 'well'." (Theo Schlossnagle, Circonus; Seeking SRE, Chapter 21)

The most important feature of any system is its reliability

Developers (Agility) vs. Operators (Stability): how do you incentivize reliability?

A principled way to agree on the desired reliability of a service

What does "reliable" mean? Think about Netflix, Google Search, Gmail, Twitter… how do you tell if they are 'working'?

[Figure: latency timeline for a Customer's "HTTP GET / …" request, with marks at 0 ms, 200 ms, and 300 ms, and the labels Objective, Agreement, and "Ugh".]

With me so far?

When do we need to make a service more reliable?

"100% is the wrong reliability target for basically everything" (Benjamin Treynor Sloss, VP 24x7, Google; Site Reliability Engineering, Introduction)

SLOs should capture the performance and availability levels that, if barely met, would keep the typical customer of a service happy.
"meets SLO targets" ⇒ "happy customers"
"sad customers" ⇒ "misses SLO targets"

Measure SLO achieved & try to be slightly over target... [Figure: SLI plotted against Target]

…but don't be too much better, or users will depend on it! [Figure: SLI plotted against Target, alongside the comic "Workflow", Randall Munroe, XKCD. Source: https://xkcd.com/1172/]

Error Budgets: An SLO implies an acceptable level of unreliability. This is a budget that can be allocated.
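As a rough illustration of "an SLO implies an acceptable level of unreliability", the sketch below turns an SLO percentage into an error budget. The function names, the 99.9% target, and the 28-day window are assumptions made for the example, not figures prescribed by the deck for any particular service.

```python
def error_budget_fraction(slo_percent: float) -> float:
    """Fraction of events (or time) that may be bad while still meeting the SLO."""
    if not 0 < slo_percent <= 100:
        raise ValueError("SLO must be a percentage in (0, 100]")
    return (100.0 - slo_percent) / 100.0


def allowed_downtime_minutes(slo_percent: float, window_days: int) -> float:
    """Error budget expressed as minutes of full downtime over the window."""
    minutes_in_window = window_days * 24 * 60
    return error_budget_fraction(slo_percent) * minutes_in_window


# Hypothetical target: 99.9% over a 28-day window.
budget = error_budget_fraction(99.9)
downtime = allowed_downtime_minutes(99.9, 28)
print(f"budget: {budget:.4%} of events, about {downtime:.2f} minutes of downtime")
```

Expressing the budget in minutes is only one framing; for request-driven services it is often more natural to count bad events instead.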

Implementation Mechanics: Evaluate SLO performance over a set window, e.g. 28 days. Remaining budget drives prioritization of engineering effort.
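One way to picture "remaining budget drives prioritization" is the sketch below. It assumes event counts have already been aggregated over the evaluation window. The `WindowCounts` type, the 25% caution threshold, and the three priority labels are all hypothetical choices for illustration, not a policy stated in the deck.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Event counts aggregated over one evaluation window (e.g. 28 days)."""
    valid_events: int
    good_events: int


def remaining_budget_fraction(counts: WindowCounts, slo_percent: float) -> float:
    """Share of the error budget left: 1.0 is untouched, 0.0 is spent, below 0 is overspent."""
    allowed_bad = counts.valid_events * (100.0 - slo_percent) / 100.0
    actual_bad = counts.valid_events - counts.good_events
    if allowed_bad == 0:
        # A 100% target (or an empty window) leaves no budget to spend.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


def suggested_priority(remaining: float, caution_threshold: float = 0.25) -> str:
    """Map remaining budget to a coarse priority (threshold is an assumed value)."""
    if remaining <= 0:
        return "reliability work"
    if remaining < caution_threshold:
        return "slow down risky changes"
    return "feature work"


# Hypothetical window: 1,000,000 valid requests, 600 of them bad, 99.9% SLO.
window = WindowCounts(valid_events=1_000_000, good_events=999_400)
remaining = remaining_budget_fraction(window, 99.9)
print(f"{remaining:.0%} of budget left -> {suggested_priority(remaining)}")
```

In practice the thresholds and the actions attached to them would be agreed between the teams that share the budget.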

ITIL Approximation
Service in SLO → most operational work is a standard change
Service close to being out of SLO → revert to normal change
(No, I don't understand the difference between "standard" and "normal" either…)

What should we spend our error budget on?

Error budgets can accommodate
⁄ releasing new features
⁄ expected system changes
⁄ inevitable failure in hardware, networks, etc.
⁄ planned downtime
⁄ risky experiments

Benefits of error budgets
⁄ Common incentive for devs and SREs: find the right balance between innovation and reliability
⁄ Dev team can manage the risk themselves: they decide how to spend their error budget
⁄ Unrealistic reliability goals become unattractive: these goals dampen the velocity of innovation
⁄ Dev team becomes self-policing: the error budget is a valuable resource for them
⁄ Shared responsibility for system uptime: infrastructure failures eat into the error budget

Still with me?

Activity: Reliability Principles

Dear Colleagues, The negative press from our recent outage has convinced me that we all need to take the reliability of our services more seriously. In this open letter, I want to lay down three reliability principles to guide your future decision making.

The first principle concerns our users. We let them down, but they deserve better. They deserve to be happy when using our services! Our business must ...
1. ... rebuild user trust by making a financial commitment to reliability.
2. ... find ways to help our users tolerate or enjoy future outages.
3. ... meet our users' expectations of reliability before building features.
4. ... build the features that make our users happy faster.
5. ... never suffer another outage, ever again!

The second principle concerns the way we build our services. We have to change our development process to incorporate reliability. Our business must...
1. … choose to fail fast and catch errors early through rapid iteration.
2. … have Ops engage in the design of new features to reduce risk.
3. … only release new features publicly when they are shown to be reliable.
4. … build and release software in small, controlled steps.
5. … reduce feature iteration speed when our systems are unreliable.

The third principle concerns our operational practices. What we're doing today isn't working. Our Ops teams are burned out and our incident rate is too high. We have to do things differently to improve! Our business must...
1. … share responsibility for reliability between Ops and Dev teams.
2. … tie operational response and team priorities to a reliability goal.
3. … make our systems more resilient to failure to cut operational load.
4. … give Ops a veto on all releases to prevent failures reaching our users.
5. … route negative complaints on Twitter directly to Ops pagers.

To put these principles into practice, we are going to borrow some ideas from Google! The next step is to define some SLOs for our services and begin tracking our performance against them. Thanks for reading! Eleanor Exec, CEO

Break!

Choosing a Good SLI

[Figure: unhappy users plotted over time]

[Figure: two panels, BAD and GOOD, each plotting a metric over time]
BAD: variance obscures metric deterioration; the metric provides a poor signal-to-noise ratio
GOOD: metric deterioration correlates with outage; the metric provides a good signal-to-noise ratio

SLI SLO

SLI: (good events / valid events) × 100%
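The ratio above can be sketched in a few lines. The deck does not say which events count as valid or good, so the rules used here (client errors excluded from the valid set, server errors counted as bad) are assumptions for illustration only, as is the `sli_percent` name.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """SLI as the percentage of valid events that were good."""
    if valid_events <= 0:
        raise ValueError("need at least one valid event")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good events must be between 0 and valid events")
    return 100.0 * good_events / valid_events


# Hypothetical HTTP status codes observed for one endpoint.
statuses = [200, 200, 404, 500, 200, 302, 503, 200]

# Assumption: 4xx responses are the caller's fault, so they are not valid events.
valid = [s for s in statuses if not 400 <= s < 500]
# Assumption: any valid response that is not a 5xx counts as good.
good = [s for s in valid if s < 500]

availability = sli_percent(len(good), len(valid))
print(f"availability SLI: {availability:.2f}%")
```

Separating "valid" from "all" events keeps traffic the service cannot be blamed for out of the denominator, so the SLI tracks what users actually experience.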

3–5 SLIs* (*per user journey)

SLI SLO

What performance does the business need?

User expectations are strongly tied to past performance

Continuous Improvement?

Information overload?

Developing SLOs and SLIs

?

Our Game: Fang Faction
[Architecture diagram with components: Load Balancer, Web Server, API Server, Game Servers, User Profiles, Leaderboards, Leaderboard Generation]

[Profile page mock-up at https://fangfactiongame.com/profile/someuser]
SomeUser's Profile
Faction Name: Tribe of Frog
Leader Name: SomeUser
Email Address: user@example.com
Faction Score: 31337
[Update button]
Midwest Canyon
1. Tri-Bool 65535
2. Tri Repetae 61995
3. Triassic Five 52391
4. Tricksy Hobbits 37164
5. Tribe of Frog 31337
6. Trite Examples 29243

Loading a Profile Page
[Architecture diagram with components: Load Balancer, Web Server, API Server, Game Servers, User Profiles, Leaderboards, Leaderboard Generation]
