ACM.org Highlights For Scientists, Programmers, Designers, and - - PowerPoint PPT Presentation


slide-1
SLIDE 1

For Scientists, Programmers, Designers, and Managers:

  • Learning Center - https://learning.acm.org
  • View past TechTalks & Podcasts with top inventors, innovators, entrepreneurs, & award winners
  • Access to O’Reilly Learning Platform – technical books, courses, videos, tutorials & case studies
  • Access to Skillsoft Training & ScienceDirect – vendor certification prep, technical books & courses
  • Ethical Responsibility – https://ethics.acm.org

ACM.org Highlights

Popular Publications & Research Papers

  • Communications of the ACM - http://cacm.acm.org
  • Queue Magazine - http://queue.acm.org
  • Digital Library - http://dl.acm.org

Major Conferences, Events, & Recognition

  • https://www.acm.org/conferences
  • https://www.acm.org/chapters
  • https://awards.acm.org

By the Numbers

  • 2,200,000+ content readers
  • 1,800,000+ DL research citations
  • $1,000,000 Turing Award prize
  • 100,000+ global members
  • 1160+ Fellows
  • 700+ chapters globally
  • 170+ yearly conferences globally
  • 100+ yearly awards
  • 70+ Turing Award Laureates
slide-2
SLIDE 2

OOPS! Learning from surprise at Netflix

Lorin Hochstein

  • Sr. Software Engineer, Netflix
slide-3
SLIDE 3

@lhochstein

Let’s talk about outages!
slide-4
SLIDE 4

@lhochstein

At Netflix, we call them incidents

slide-5
SLIDE 5

@lhochstein

Incidents are scary!

slide-6
SLIDE 6

@lhochstein

The system did something we didn’t expect…

slide-7
SLIDE 7

@lhochstein

…and a bad thing happened!

slide-8
SLIDE 8

@lhochstein

Uncertainty makes people nervous

slide-9
SLIDE 9

@lhochstein

We want closure

slide-10
SLIDE 10

@lhochstein

How can we be confident this won’t happen again?

slide-11
SLIDE 11

@lhochstein

We do an incident review

slide-12
SLIDE 12

@lhochstein

Why did this happen?

slide-13
SLIDE 13

@lhochstein

Do a root cause analysis

slide-14
SLIDE 14

@lhochstein

Identify action items that will prevent reoccurrence

slide-15
SLIDE 15

@lhochstein

We can now move past it

slide-16
SLIDE 16

@lhochstein

Until the next one…

slide-17
SLIDE 17

@lhochstein

… which is completely different

slide-18
SLIDE 18

@lhochstein

We can get more out of incidents than preventing the last one

slide-19
SLIDE 19

@lhochstein

slide-20
SLIDE 20

@lhochstein

Learning isn’t proportional to impact of an incident

slide-21
SLIDE 21

@lhochstein

We can learn just as much from “incidents” where there is no business impact!

slide-22
SLIDE 22

@lhochstein

An operational surprise

slide-23
SLIDE 23

@lhochstein

OOPSies

slide-24
SLIDE 24

@lhochstein

OOPS

slide-25
SLIDE 25

@lhochstein

slide-26
SLIDE 26

@lhochstein

slide-27
SLIDE 27

@lhochstein

slide-28
SLIDE 28

@lhochstein

slide-29
SLIDE 29

@lhochstein https://twitter.com/FakeRyanGosling/status/1106714429247221761

slide-30
SLIDE 30

@lhochstein

slide-31
SLIDE 31

A play in three acts

  • 1. What we hope to learn from OOPSies
  • 2. What to ask when looking into how an OOPS happened
  • 3. How to write up an OOPS
slide-32
SLIDE 32

@lhochstein

  • I. What we hope to learn
slide-33
SLIDE 33

@lhochstein

Fools learn from experience. I prefer to learn from the experience of others.

– Otto von Bismarck (attributed)

slide-34
SLIDE 34

@lhochstein

Identify gaps

slide-35
SLIDE 35

@lhochstein

Tooling gaps

slide-36
SLIDE 36

@lhochstein

slide-37
SLIDE 37

@lhochstein

slide-38
SLIDE 38

Consider a cluster of servers

Server group

slide-39
SLIDE 39

The size is configurable

[Diagram: EC2 server group with Desired = 128]

slide-40
SLIDE 40

Netflix traffic varies over time

slide-41
SLIDE 41

Autoscaling sizes for you

[Diagram: Autoscaler, Metrics, EC2, Server group; Desired = 128, Min = 20, Max = 1000]

slide-42
SLIDE 42

@lhochstein

One day…

slide-43
SLIDE 43

Desired: 128, Max: 1000, Min: 12

slide-44
SLIDE 44

Desired: 256, Max: 1000, Min: 12

slide-45
SLIDE 45

One day…

[Diagram: Autoscaler, Metrics, EC2, Server group; Desired = 256, Min = 20, Max = 1000]

slide-46
SLIDE 46
  • 1. EC2: Bring up new instances

Desired: 256

slide-47
SLIDE 47
  • 2. Autoscaler fires: 256 → 128

Desired: 128

slide-48
SLIDE 48
  • 3. EC2: terminate instances

Desired: 128

slide-49
SLIDE 49

User sees green → gray

[Diagram: Autoscaler, Metrics, EC2, Server group; Desired = 128, Min = 20, Max = 1000]
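
The sequence on the preceding slides can be replayed as a small simulation. This is only a sketch, not Netflix's or AWS's actual autoscaler: the `ServerGroup` class, `clamp`, `autoscaler_tick`, and `replay_surprise` are hypothetical names. It assumes the change from 128 to 256 was a manual resize, and that the autoscaler keeps computing a target of 128 from traffic metrics. The Min of 20 and Max of 1000 follow the diagram slides (the resize slides show Min 12).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ServerGroup:
    """Hypothetical model of a server group's sizing settings."""
    desired: int
    min_size: int
    max_size: int


def clamp(value: int, low: int, high: int) -> int:
    """Keep value inside the inclusive range [low, high]."""
    return max(low, min(high, value))


def autoscaler_tick(group: ServerGroup, target_from_metrics: int) -> int:
    """Overwrite desired with the metrics-derived target, within [min, max]."""
    group.desired = clamp(target_from_metrics, group.min_size, group.max_size)
    return group.desired


def replay_surprise() -> list[tuple[str, int]]:
    """Replay the slides: manual resize up, then the autoscaler undoes it."""
    group = ServerGroup(desired=128, min_size=20, max_size=1000)
    timeline = [("start", group.desired)]

    # Assumption: a person manually doubles the desired size.
    group.desired = 256
    timeline.append(("manual resize", group.desired))

    # The autoscaler still derives 128 from metrics and overwrites the change.
    autoscaler_tick(group, target_from_metrics=128)
    timeline.append(("autoscaler fires", group.desired))
    return timeline


if __name__ == "__main__":
    for event, desired in replay_surprise():
        print(f"{event}: desired={desired}")
```

Nothing in this model tells the person who resized the group that an autoscaler also owns the `desired` value, which is one way to read the tooling gap the example illustrates.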

slide-50
SLIDE 50

@lhochstein

slide-51
SLIDE 51

@lhochstein

Operational expertise gaps

slide-52
SLIDE 52

@lhochstein

Resource gaps

slide-53
SLIDE 53

@lhochstein

Beware the law of stretched systems!

slide-54
SLIDE 54

@lhochstein

Every system is stretched to operate at its capacity

slide-55
SLIDE 55

@lhochstein

Beware the law of fluency!

slide-56
SLIDE 56

@lhochstein

Hard to tell when a skilled engineer starts to become overloaded
slide-57
SLIDE 57

@lhochstein

Build shared understanding

slide-58
SLIDE 58

@lhochstein

It came as a surprise that X calls Y’s endpoint

slide-59
SLIDE 59

@lhochstein

Facilitate skill transfer

slide-60
SLIDE 60

@lhochstein

Learn by watching experts in action

slide-61
SLIDE 61

@lhochstein

  • II. What to ask
slide-62
SLIDE 62

@lhochstein

Do an investigation afterwards

slide-63
SLIDE 63

@lhochstein

(but don’t call it that)

slide-64
SLIDE 64

@lhochstein

“How did we get here?”

slide-65
SLIDE 65

@lhochstein

How did X seem reasonable in the moment?

slide-66
SLIDE 66

@lhochstein

What were all of the things that had to be true for the surprise to happen?

slide-67
SLIDE 67

@lhochstein

Capture perspectives from multiple people

slide-68
SLIDE 68

@lhochstein

  • III. How to write it up
slide-69
SLIDE 69

@lhochstein

Narrative description

slide-70
SLIDE 70

@lhochstein

Tell a good story

slide-71
SLIDE 71

@lhochstein

Imagine a new team member reading it

slide-72
SLIDE 72

@lhochstein

Contributing factors

slide-73
SLIDE 73

@lhochstein

Front50 provides an inconsistent view of application permissions, which triggered endless retries

slide-74
SLIDE 74

@lhochstein

Similar feature was already in use, so enabling it here seemed low-risk

slide-75
SLIDE 75

@lhochstein

X was out sick when the feature was deployed

slide-76
SLIDE 76

@lhochstein

Mitigators

slide-77
SLIDE 77

@lhochstein

Spinnaker's staging stack was not impacted, which gave us a backdoor way to monitor and make changes

slide-78
SLIDE 78

@lhochstein

Demand Engineering has tooling & experience in changing size of many server groups automatically, which was sufficient to undo most bad changes

slide-79
SLIDE 79

@lhochstein

Risks

slide-80
SLIDE 80

@lhochstein

The regression occurred in an area of Spinnaker that is difficult to test

slide-81
SLIDE 81

@lhochstein

Misconfigured pools and queues

slide-82
SLIDE 82

@lhochstein

Difficulties in handling

slide-83
SLIDE 83

@lhochstein

Observability blind spots: a lack of metrics around connection pool and Redis command usage made it difficult to determine how Redis usage had changed

slide-84
SLIDE 84

@lhochstein

slide-85
SLIDE 85

@lhochstein

clouddriver was rolled out at 3pm and we were paged at 5:30pm, so it was not immediately clear that the issue had to do with the deployment
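
The write-up sections named in this part of the talk (narrative description, contributing factors, mitigators, risks, difficulties in handling) could be captured in a simple template. The sketch below is one possible shape, not the format Netflix actually uses: the `OopsWriteup` class, its field names, and the example title and narrative are hypothetical. The bullet items in the usage example are quoted from the slides.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class OopsWriteup:
    """Hypothetical template for an OOPS write-up."""
    title: str
    narrative: str
    contributing_factors: list[str] = field(default_factory=list)
    mitigators: list[str] = field(default_factory=list)
    risks: list[str] = field(default_factory=list)
    difficulties_in_handling: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render as plain text, skipping sections that have no items."""
        lines = [self.title, "", "Narrative description", self.narrative]
        sections = [
            ("Contributing factors", self.contributing_factors),
            ("Mitigators", self.mitigators),
            ("Risks", self.risks),
            ("Difficulties in handling", self.difficulties_in_handling),
        ]
        for heading, items in sections:
            if not items:
                continue
            lines += ["", heading]
            lines += [f"  • {item}" for item in items]
        return "\n".join(lines)


if __name__ == "__main__":
    writeup = OopsWriteup(
        title="Example OOPS (hypothetical title)",
        narrative="Placeholder: tell the story for a new team member.",
        contributing_factors=["X was out sick when the feature was deployed"],
        risks=["Misconfigured pools and queues"],
    )
    print(writeup.render())
```

Keeping the narrative as free text and the other sections as lists mirrors the talk's emphasis on telling a story first, then listing what contributed, what helped, and what remains risky.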

slide-86
SLIDE 86

If you only remember three things…

  • Any operational surprise is a potential opportunity for learning
  • Ask questions that answer “how did we get here?”
  • Tell a good story

@lhochstein

slide-87
SLIDE 87

I want to learn more about learning more!

  • Etsy Debrief Facilitation Guide
  • The Field Guide To Understanding ‘Human Error’ by Sidney Dekker
  • http://resiliencepapers.club

@lhochstein

slide-88
SLIDE 88

The Learning Continues…

  • TechTalk Discourse: https://on.acm.org
  • TechTalk Inquiries: learning@acm.org
  • TechTalk Archives: https://learning.acm.org/techtalks
  • Learning Center: https://learning.acm.org
  • Professional Ethics: https://ethics.acm.org
  • Queue Magazine: https://queue.acm.org