The SRE I aspire to be Yaniv Aknin // @aknin #VelocityConf San Jose



slide-1
SLIDE 1

The SRE I aspire to be

Yaniv Aknin // @aknin #VelocityConf San Jose 2019

slide-2
SLIDE 2

The SRE I aspire to be // @aknin

Who is this guy

  • Google SRE since 2013

Most recently GCP's Quantitative Reliability Lead

  • Jack of all trades

Equal parts SRE, dev, and /pro(duct|ject) manager/

  • Opinions my own

But I owe a lot here to others

slide-3
SLIDE 3

NB: what does "SRE" really mean?

slide-4
SLIDE 4

Wikipedia says Engineering is "using scientific principles to design and build $THINGS"

https://en.wikipedia.org/wiki/Engineering

slide-5
SLIDE 5

Imagine THINGS="Reliability"... how do we apply science to that?

slide-6
SLIDE 6

Innovation (engineering, proactive, change)

Reliability (support, reactive, preserve)

slide-7
SLIDE 7

Reliability (support, reactive, preserve)

Innovation (engineering, proactive, change)

?

slide-8
SLIDE 8

Reliability (engineering, proactive, change)

Innovation (engineering, proactive, change)

The Error Budget
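
As a rough illustration of the arithmetic behind an error budget, the sketch below turns an availability target into an allowance of "bad" time. The function names and the choice of a time-based budget are assumptions made for this example, not something the slides specify. With a 99.9% target over 30 days, the allowance works out to about 43 minutes.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the amount of 'bad' time the target tolerates within the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta, bad_time: timedelta) -> float:
    """Return the fraction of the budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, window)
    return 1.0 - bad_time / budget


# Hypothetical numbers: a 99.9% target over 30 days, with 21m36s already spent.
monthly_budget = error_budget(0.999, timedelta(days=30))
left = budget_remaining(0.999, timedelta(days=30), timedelta(minutes=21, seconds=36))
```

The point of expressing it this way is that both sides of the slide can read the same number: unspent budget is room for change, overspent budget is a signal to invest in reliability.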

slide-9
SLIDE 9

Measurably optimise reliability vs cost

slide-10
SLIDE 10

“When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, your knowledge is of a meagre and unsatisfactory kind.”

William Thomson (Lord Kelvin), President of the Royal Society, lecture on "Electrical Units of Measurement"

Published in "Popular Lectures", Vol. 1, 1883 (abridged to fit slide)

slide-11
SLIDE 11

MTBF/MTTR

Challenge: fungible definition of "failure"
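
To make the "fungible definition of failure" concrete, here is a minimal sketch that computes MTBF, MTTR and the resulting steady-state availability, MTBF / (MTBF + MTTR), from the same incident log under two different failure thresholds. The `Incident` type, the threshold parameter and the sample incidents are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Incident(NamedTuple):
    duration_hours: float
    users_affected: float  # fraction of users affected, between 0 and 1


def mtbf_mttr(incidents: List[Incident], period_hours: float,
              min_users_affected: float) -> Tuple[float, float]:
    """Count an incident as a failure only if enough users were affected."""
    failures = [i for i in incidents if i.users_affected >= min_users_affected]
    if not failures:
        raise ValueError("no incident qualifies as a failure under this definition")
    downtime = sum(i.duration_hours for i in failures)
    mttr = downtime / len(failures)                   # mean time to repair
    mtbf = (period_hours - downtime) / len(failures)  # mean uptime between failures
    return mtbf, mttr


def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability implied by MTBF and MTTR."""
    return mtbf / (mtbf + mttr)


# Hypothetical 30-day (720 hour) incident log.
log = [Incident(2.0, 1.0), Incident(4.0, 0.05), Incident(1.0, 0.5)]
strict = availability(*mtbf_mttr(log, 720.0, min_users_affected=0.01))
lenient = availability(*mtbf_mttr(log, 720.0, min_users_affected=0.5))
```

The same month of incidents yields a noticeably different availability depending only on where the failure threshold sits, which is the difficulty the slide points at.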

"9s" (e.g. "99.95% uptime")

Challenge: aggregating individual events into business credible 9s
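
The aggregation problem can be sketched with two common ways of rolling events up into a single figure: weighting by requests, or weighting by time buckets. Both functions, the 1% "bad bucket" threshold and the sample data are assumptions for illustration; the slides do not prescribe either method.

```python
import math
from typing import List, Tuple

Bucket = Tuple[int, int]  # (good_requests, total_requests) for one time bucket


def nines(availability: float) -> float:
    """Express availability as a count of 9s, e.g. 0.999 is roughly 3."""
    if availability >= 1.0:
        return math.inf
    return -math.log10(1.0 - availability)


def request_weighted(buckets: List[Bucket]) -> float:
    """Fraction of all requests that succeeded."""
    good = sum(g for g, _ in buckets)
    total = sum(t for _, t in buckets)
    return good / total


def time_weighted(buckets: List[Bucket], bad_threshold: float = 0.01) -> float:
    """Fraction of buckets whose error rate stayed at or below the threshold."""
    ok = sum(1 for g, t in buckets if t == 0 or (t - g) / t <= bad_threshold)
    return ok / len(buckets)


# Hypothetical data: one quiet bucket with a 90% error rate among three healthy ones.
sample = [(1000, 1000), (1000, 1000), (10, 100), (1000, 1000)]
by_requests = request_weighted(sample)
by_time = time_weighted(sample)
```

On this sample the request-weighted figure is far higher than the time-weighted one, because the bad bucket carried little traffic. Which of the two is "business credible" depends on what customers actually experienced.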

[Figure labels: MTBF, MTTR, 99.9%, 99.99%]

slide-12
SLIDE 12

Why is this hard?

  • Scope
  • Difficulty
  • Cost++
  • Misconceptions

slide-13
SLIDE 13

And why is it good?

  • Leverage
  • Precision
  • Cost--
slide-14
SLIDE 14

On ops, user harm, and tradeoffs

[Figure labels: Ops; User happiness; "Your product is here."]

slide-18
SLIDE 18

You need "better quality" 9s!

  • 99%: "Whatever I happened to ship"
  • 99.999%: "I spent time making my metrics hit certain thresholds"
  • Misaligned: "Whatever I happened to measure"
  • Aligned: "I spent time ensuring 9s correlate with customer pain"

slide-19
SLIDE 19

  • Misaligned, 99.999%: Wasted Effort
  • Aligned, 99.999%: Happy Customers
  • Misaligned, 99%: Unknown Problem
  • Aligned, 99%: Known Problem

First move right, then move up
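
One possible way to check whether a metric is "aligned" is to see how strongly it correlates with some independent signal of customer pain. The sketch below computes a plain Pearson correlation; using daily support tickets as the pain signal, and the sample numbers themselves, are assumptions for this example only.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily data: error rate reported by the metric vs. support tickets.
daily_error_rate = [0.001, 0.002, 0.010, 0.001, 0.020]
daily_tickets = [3, 4, 12, 2, 25]
alignment = pearson(daily_error_rate, daily_tickets)
```

A value close to 1 suggests the metric moves with customer pain; a value near 0 suggests it measures "whatever I happened to measure". Correlation on a handful of days is weak evidence, so this is a starting point rather than a verdict.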

slide-20
SLIDE 20

SRE team: a recipe

Obvious

Monitoring, Alerting, Capacity planning, CI/CD & Rollouts, Load Balancing

slide-21
SLIDE 21

Less Obvious

System Architecture, Distributed Algorithms, Networking, Operating Systems

slide-22
SLIDE 22

Least Obvious

Product Management, Data Science, Business Acumen, (nose for) UX Research

slide-23
SLIDE 23

Litmus test of SRE

  • Have a measurement of reliability
  • When unreliable, resource allocation changes
  • When reliable, you don't do ops
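
The three points above can be read as a small decision rule. The sketch below is one hypothetical encoding of it: the `Focus` enum, the `allocate` function and the sample numbers are invented for illustration and are not a policy taken from the talk.

```python
from enum import Enum
from typing import Optional


class Focus(Enum):
    RELIABILITY_WORK = "shift resources toward reliability"
    PROJECT_WORK = "no extra ops; spend the time on engineering projects"


def allocate(measured_availability: Optional[float], slo_target: float) -> Focus:
    """Pick where effort goes, based on measured reliability against the target."""
    if measured_availability is None:
        # First point: without a measurement there is nothing to decide on.
        raise ValueError("no measurement of reliability available")
    if measured_availability < slo_target:
        # Second point: when unreliable, resource allocation changes.
        return Focus.RELIABILITY_WORK
    # Third point: when reliable, you don't do ops.
    return Focus.PROJECT_WORK


# Hypothetical quarter: target 99.9%, measured 99.85%.
decision = allocate(0.9985, 0.999)
```

The value of writing it down is that the second and third branches are both binding: missing the target has to move resources, and meeting it has to free them.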
slide-24
SLIDE 24

Please remember this is my litmus test... tell me yours?

slide-25
SLIDE 25

Thank you!

Art credits "Lord Kelvin", Messrs. Dickinson, London, goo.gl/RHF61Z, [cropped] Yin Yang, https://openclipart.org/detail/276316/ying-yang

Yaniv Aknin // @aknin