How Computers Help Humans Root Cause Issues at Netflix - Seth Katz



slide-1
SLIDE 1

How Computers Help Humans Root Cause Issues at Netflix

SETH KATZ QCON NEW YORK, 2018

slide-2
SLIDE 2
Hello!

  • Seth Katz
  • 5 years at Netflix
  • Focused on improving Netflix operations
  • Share what we’ve learned on applying machine intelligence to operations

slide-3
SLIDE 3

I got paged!

slide-4
SLIDE 4

Funny Tweet - Serious Situation

slide-5
SLIDE 5

Agenda

  • Netflix operations
  • Approach and challenges to ML in operations
  • Anomaly detection
    ○ Real-time
    ○ Near real-time

  • Visualization and making it practical
  • Reflections and takeaways
slide-6
SLIDE 6

Android devices that can’t play a movie exceed 1%

What if we get this page?

slide-7
SLIDE 7

Microservices

NQ NRDJS Play API manifest Zuul

slide-8
SLIDE 8

Android NQ NRDJS Zuul Play API

slide-9
SLIDE 9

Slack Message

slide-10
SLIDE 10

Why is diagnosing pages hard?

  • It’s 3 a.m. - are you thinking clearly?
  • Maybe you understand your microservice?
  • What about all the other services involved?
  • What about their push schedules in every region?

slide-11
SLIDE 11

Hard problem - how to build a minimum viable product?

slide-12
SLIDE 12

Simple, Principled, Robust Anomaly Detection

  • Principled algorithms have guarantees you can use to reason about any data pattern
  • Simple algorithms are very easy to implement. Don’t need major frameworks, GPUs, Python, etc.

Approach and Challenges for ML

slide-13
SLIDE 13

Wouldn’t it be great if ...

slide-14
SLIDE 14
slide-15
SLIDE 15

Golden Age of AI

Approach and Challenges for ML

slide-16
SLIDE 16

Why do Star Trek robots seem near, but Lost In Space robots seem further into the future?

slide-17
SLIDE 17

AI challenges in operations

  • Limited examples of outages
  • Cause and effect
  • Tribal knowledge

slide-18
SLIDE 18

More AI challenges

  • Curse of dimensionality
  • Rapidly changing ground truth
  • Generalization to new problems

slide-19
SLIDE 19

So what can we do? - Real-time root cause detection

slide-20
SLIDE 20

Root cause for the oracle

Real Time Root Cause Detection

slide-21
SLIDE 21

Real world example

Timeline

  • 11:50:15 - Region failover from us-east-1 -> eu-west-1
  • 11:51:12 - Service A timeouts increase 243% in eu-west-1
  • 11:51:14 - Android movie errors increase 840%

Complete picture of what happens - time suggests causality
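A minimal sketch of how a timeline like this could be assembled: collect timestamped anomaly events and list the ones that fall shortly before the alert. The function name `preceding_events`, the two-minute window, and the placeholder date are assumptions for illustration; only the three event times and descriptions come from the slide.

```python
from datetime import datetime, timedelta

# Placeholder date: the slide only gives times of day.
DAY = (2018, 1, 1)

# Events arrive in arbitrary order; descriptions are taken from the slide.
EVENTS = [
    (datetime(*DAY, 11, 51, 12), "Service A timeouts increase 243% in eu-west-1"),
    (datetime(*DAY, 11, 50, 15), "Region failover from us-east-1 -> eu-west-1"),
]
ALERT_TIME = datetime(*DAY, 11, 51, 14)  # Android movie errors increase 840%


def preceding_events(events, alert_time, window=timedelta(minutes=2)):
    """Return events inside the window before the alert, oldest first."""
    start = alert_time - window
    return sorted((t, msg) for t, msg in events if start <= t < alert_time)


for t, msg in preceding_events(EVENTS, ALERT_TIME):
    print(t.strftime("%H:%M:%S"), msg)
```

Ordering by time only suggests a causal chain; it does not prove one.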

slide-22
SLIDE 22

Victory?

We can only do this on metric subsets

  • Signals usually relatively stable and slow changing
  • Signals with an up-to-date event source
  • Signals with rapid updates, many samples

slide-23
SLIDE 23

How can we detect scalar anomalies?

slide-24
SLIDE 24
Scalar Anomaly Signal

  • Anomaly very clear to humans
  • Limited data needed
  • Historical trend unnecessary
  • Recovery also clear
  • Principled signal analysis possible

Android error rate

slide-25
SLIDE 25

What’s normal?

slide-26
SLIDE 26

Median on a Stream.

If Incoming > Median:
    Median = Median + Alpha
Else:
    Median = Median - Alpha

  • Alpha can be adjusted if consecutively on one side
  • Need rapid data updates for timely convergence.
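The update rule above can be sketched in a few lines of Python. The class name `StreamingMedian`, the `growth` factor used to enlarge Alpha during a same-side streak, and the choice to leave the estimate unchanged on an exact tie are assumptions; the slide only says Alpha "can be adjusted".

```python
class StreamingMedian:
    """Running median estimate updated one sample at a time, O(1) memory."""

    def __init__(self, initial, alpha=1.0, growth=2.0):
        self.median = float(initial)
        self.base_alpha = alpha  # step size used after a change of side
        self.alpha = alpha       # current step size
        self.growth = growth     # assumed: step multiplier on same-side streaks
        self._last_side = 0      # +1 sample was above, -1 below, 0 unknown

    def update(self, incoming):
        if incoming == self.median:
            return self.median   # assumed: an exact tie leaves the estimate alone
        side = 1 if incoming > self.median else -1
        if side == self._last_side:
            self.alpha *= self.growth     # consecutive samples on one side
        else:
            self.alpha = self.base_alpha  # direction changed, reset the step
        self.median += side * self.alpha
        self._last_side = side
        return self.median


estimate = StreamingMedian(initial=0.0, alpha=1.0, growth=1.0)
for sample in [8, 9, 10, 11, 12] * 20:
    estimate.update(sample)
print(estimate.median)  # hovers around the true median of 10
```

With `growth=1.0` this is exactly the fixed-Alpha rule on the slide; a larger value converges faster at the cost of overshoot.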
slide-27
SLIDE 27

What’s abnormal?

slide-28
SLIDE 28
Hoeffding Bound

  • Is the next data point from the same distribution as the sample?
  • Can I guarantee it is the same distribution with a desired level of confidence?
  • Do I need to assume my data is normally distributed (aka Gaussian)?
  • Hoeffding Bound

slide-29
SLIDE 29
Hoeffding Bound Very Simple

  • n = sample size
  • d = desired certainty, e.g. .01 for 99%
  • r = sample range, i.e. (max - min)
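The formula itself did not survive in this transcript. A commonly quoted form of the Hoeffding bound with these three symbols is epsilon = sqrt(r^2 * ln(1/d) / (2n)), and the sketch below assumes that form. Using epsilon as a threshold on the distance from the expected value (for example the streaming median) is also an assumption about how the bound is applied here.

```python
import math


def hoeffding_bound(n, d, r):
    """Epsilon such that, with probability 1 - d, the mean of n samples
    with range r lies within epsilon of the true mean (no Gaussian assumption)."""
    return math.sqrt((r * r) * math.log(1.0 / d) / (2.0 * n))


def is_anomaly(value, expected, n, d, r):
    """Assumed usage: flag values further than epsilon from the expected value."""
    return abs(value - expected) > hoeffding_bound(n, d, r)


# Hypothetical numbers: 100 recent samples of an error rate spanning 0..1.
epsilon = hoeffding_bound(n=100, d=0.01, r=1.0)
print(round(epsilon, 4))
```

The bound shrinks as n grows, which is one reason the approach suits signals with rapid updates and many samples.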

slide-30
SLIDE 30

Not Anomaly Anomaly

slide-31
SLIDE 31

Another problem - detecting a bad config push?

slide-32
SLIDE 32

Consecutive histogram snapshots

11:10:15 11:10:20

Sharp drop in English titles

slide-33
SLIDE 33

Is there a principled way to measure the difference between histograms?

slide-34
SLIDE 34

Information Theory

slide-35
SLIDE 35

Fair Coin

Entropy - Average Information

9-1 Biased Coin
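The two coins can be worked through with the standard Shannon entropy, H = -sum(p * log2(p)). This is a small sketch; the function name `entropy` is illustrative, and "9-1" is read here as probabilities 0.9 and 0.1.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]

print(entropy(fair_coin))              # 1 bit per toss
print(round(entropy(biased_coin), 3))  # well under 1 bit per toss
```

The biased coin is more predictable, so each toss carries less information on average.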

slide-36
SLIDE 36

How much entropy do we lose if we estimate a histogram with the wrong probability distribution?

slide-37
SLIDE 37

Uniform Distribution Info Loss

slide-38
SLIDE 38

Minor Formula Change for Entropy difference

  • Entropy

KL Divergence

  • KL Divergence
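The formulas on this slide are not preserved in the transcript. The standard definitions are H(P) = -sum(p * log2(p)) for entropy and KL(P || Q) = sum(p * log2(p / q)) for KL divergence, so the "minor change" is replacing p inside the logarithm with the ratio p / q. A minimal sketch under that assumption, with an illustrative function name:

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in bits: extra information cost of modelling P with Q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # outcomes that never occur in P cost nothing
        if qi == 0:
            return math.inf   # Q rules out something P actually produces
        total += pi * math.log2(pi / qi)
    return total


biased = [0.9, 0.1]
uniform = [0.5, 0.5]
print(round(kl_divergence(biased, uniform), 3))
print(round(kl_divergence(uniform, biased), 3))  # a different value
```

Against a uniform Q over n outcomes the result equals log2(n) - H(P), which is the "info loss" of assuming a uniform distribution.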
slide-39
SLIDE 39

Is KL divergence a good score?

slide-40
SLIDE 40
Jensen Shannon Divergence (JSD)

  • Not symmetric?
    ○ Take KL divergence in both directions and add
  • No upper limit?
    ○ Normalize it
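A sketch of JSD as it is usually defined: each histogram is compared with the average of the two, M = (P + Q) / 2, and the two KL terms are averaged. Comparing against M rather than directly against each other keeps every term finite, and with log base 2 the score stays between 0 and 1. The snapshot counts below are hypothetical, and the helper names are illustrative.

```python
import math


def normalize(counts):
    """Turn histogram counts into probabilities."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical consecutive snapshots of titles by language: [English, Spanish, French]
before = normalize([900, 60, 40])
after = normalize([100, 60, 40])  # sharp drop in English titles
print(round(jsd(before, after), 3))
```

A fixed threshold on the score is then enough to flag a snapshot as anomalous; the threshold value itself would need tuning per signal.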

slide-41
SLIDE 41

Not Anomaly Anomaly

Real Time Root Cause Detection

slide-42
SLIDE 42

Real time Algo Recap

Scalar?

  • Yes: Median for expected, Hoeffding threshold
  • No: JSD, normalize to 1, threshold

slide-43
SLIDE 43

How to communicate anomalies?

slide-44
SLIDE 44
Example

  • Android movie errors increase 840%?
    ○ Increased from what?
    ○ Why not use z-score (number of standard deviations from mean)?

slide-45
SLIDE 45

This is your brain on Pager Duty

slide-46
SLIDE 46

Intuitive messages beat mathematically precise ones
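A small sketch contrasting the two ways of reporting the same anomaly. The baseline and current values are hypothetical, chosen only so that the increase comes out at the 840% used in the example, and the message wording is an assumption. Using the expected value (for instance the streaming median) as the baseline answers "increased from what?".

```python
from statistics import mean, stdev


def percent_increase(baseline, current):
    """Relative change against the expected (baseline) value, in percent."""
    return (current - baseline) / baseline * 100.0


def z_score(history, current):
    """Number of sample standard deviations between current and the mean."""
    return (current - mean(history)) / stdev(history)


def describe(metric, baseline, current):
    """Message meant to be readable by someone who was just paged."""
    pct = percent_increase(baseline, current)
    return f"{metric} increased {pct:.0f}% (from {baseline:g} to {current:g})"


print(describe("Android movie errors", baseline=0.5, current=4.7))
```

The z-score is the more precise statistic, but the percentage with its baseline is easier to act on at a glance.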

slide-47
SLIDE 47

What about nearly real-time signals?

slide-48
SLIDE 48

More Time and More Data

slide-49
SLIDE 49

Diurnal Patterns

Prime Time Night Time

slide-50
SLIDE 50
Drawbacks

  • Usually better for mean time to resolve than mean time to detect
  • Less precise timing
  • Use correlation, but humans decide cause vs effect

slide-51
SLIDE 51

Suspicious Things

slide-52
SLIDE 52
Error Code 1234 is High?

  • Is there an attribute over-represented for sessions with error code 1234?
    ○ Device?
    ○ UI version?
  • Baseline essential
    ○ What if only one UI version actually produces error code 1234?

slide-53
SLIDE 53

How do we identify significant change from baseline?

slide-54
SLIDE 54

              Error 1234    UI Version 0.0.1
Baseline      1000000       10000
Now           100000        1150

Two-Way Contingency Table

Use Chi-Squared test
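A sketch of Pearson's chi-squared test on the two-way table above, written without external libraries. It treats the four counts as a plain 2x2 table exactly as given (rows Baseline and Now) and applies no continuity correction; both choices are assumptions, since the slide does not show the calculation.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the survival function reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


statistic, p_value = chi_squared_2x2(1_000_000, 10_000, 100_000, 1_150)
print(round(statistic, 2), p_value)
```

With counts this large, a shift from 1.0% to 1.15% is enough to produce a very small p-value.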

slide-55
SLIDE 55

Contingency Tables Fail

  • Yes/No: are past and present the same?
  • Chi-squared says significant, 99.999% confidence
  • Netflix is always changing

slide-56
SLIDE 56

Bonferroni’s principle

Are we there yet? Eventually right by chance if you ask enough!
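A worked sketch of why asking often enough eventually produces a "significant" answer. It assumes independent tests at the same significance level; the function names are illustrative, and the per-test adjustment shown is the usual Bonferroni correction rather than anything stated on the slide.

```python
def prob_any_false_alarm(alpha, tests):
    """Chance that at least one of `tests` independent checks fires by luck."""
    return 1.0 - (1.0 - alpha) ** tests


def bonferroni_alpha(alpha, tests):
    """Per-test significance level that keeps the overall rate near alpha."""
    return alpha / tests


for tests in (1, 14, 100):
    print(tests, round(prob_any_false_alarm(0.05, tests), 3))
```

At a 5% level, a false alarm becomes more likely than not after only 14 independent checks.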

Near real time signals

slide-57
SLIDE 57
  • Contingency tables don’t work
  • Convert it to a time series problem

Getting Correlation Right

slide-58
SLIDE 58

Why would time series work when contingency tables fail?

slide-59
SLIDE 59
Sensitivity

  • Chi-squared test is so sensitive because of very large samples
  • Number of time windows much smaller - significance tests work on smaller sets

slide-60
SLIDE 60

Correlation Windows

Time Window      Pearson Correlation Score (Error 1234 and UI Version 0.0.1)
10am-10:30am     .18
10:30-11:00am    .2
11-11:30am       .25
11:30am-12pm     .95

Near real time signals
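A sketch of how per-window scores like these could be computed: split two aligned count series into fixed-size windows and take the Pearson correlation inside each one. The series values, the window size, and the function names are hypothetical; the talk does not say how the windows are sampled.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def windowed_correlations(xs, ys, window):
    """One correlation score per consecutive, non-overlapping window."""
    return [
        pearson(xs[i:i + window], ys[i:i + window])
        for i in range(0, len(xs) - window + 1, window)
    ]


# Hypothetical per-minute counts: error 1234 vs sessions on UI version 0.0.1
errors = [5, 7, 6, 8, 5, 9, 14, 20]
sessions = [9, 6, 8, 7, 3, 6, 10, 15]
print([round(score, 2) for score in windowed_correlations(errors, sessions, 4)])
```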

slide-61
SLIDE 61
Significant Change?

  • Mann-Whitney U Test on correlation values (not Student’s t-test)
    ○ No Gaussian assumption involved
  • Works best after human determines present is “interesting”
    ○ E.g., run after an alert fires
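A minimal sketch of the Mann-Whitney U test on correlation scores, using an exact one-sided p-value obtained by enumerating every way to split the pooled scores. Exact enumeration is practical only for a handful of windows, which fits this use. The scores below are hypothetical (the slide shows fewer windows), and the function names are illustrative.

```python
from itertools import combinations


def u_statistic(sample, other):
    """Count pairs where the sample value beats the other value (ties count 0.5)."""
    return sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in sample
        for b in other
    )


def exact_one_sided_p(sample, other):
    """Share of all relabelings whose U is at least as large as the observed U."""
    pooled = list(sample) + list(other)
    observed = u_statistic(sample, other)
    indices = range(len(pooled))
    hits = total = 0
    for chosen in combinations(indices, len(sample)):
        picked = set(chosen)
        group = [pooled[i] for i in chosen]
        rest = [pooled[i] for i in indices if i not in picked]
        total += 1
        if u_statistic(group, rest) >= observed:
            hits += 1
    return hits / total


# Hypothetical correlation scores per time window
baseline = [0.18, 0.20, 0.25, 0.22, 0.17, 0.21]
recent = [0.95, 0.91, 0.88]
print(u_statistic(recent, baseline), round(exact_one_sided_p(recent, baseline), 4))
```

Because the test only uses ranks, it makes no assumption about how the correlation scores are distributed.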

slide-62
SLIDE 62

Anomaly detection for near real-time

slide-63
SLIDE 63

InterQuartile Range

IQR = 20
Anomaly > 75% + N*IQR
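A sketch of the interquartile-range rule, reading "75%" as the 75th percentile of the baseline samples. The multiplier default of 1.5 and the "inclusive" quantile method are assumptions; the slide leaves N open. To respect diurnal patterns, the same function could be applied to a separate baseline per hour of day.

```python
from statistics import quantiles


def iqr_threshold(samples, n_iqr=1.5):
    """Upper anomaly threshold: 75th percentile plus n_iqr times the IQR."""
    q1, _, q3 = quantiles(samples, n=4, method="inclusive")
    return q3 + n_iqr * (q3 - q1)


def is_anomaly(value, samples, n_iqr=1.5):
    return value > iqr_threshold(samples, n_iqr)


# Hypothetical baseline for one hour of the day
baseline = list(range(1, 101))
print(iqr_threshold(baseline), is_anomaly(150, baseline), is_anomaly(149, baseline))
```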

slide-64
SLIDE 64

Near real-time anomalies

2-3 am IQR Threshold 3-4 am IQR Threshold Signal

slide-65
SLIDE 65

Placeholder for dense graphs

  • Microservices, call pattern
  • Color coded errors
  • Sentence for more context
  • Need to de-noise for Slack to work well
  • Need deduplication
slide-66
SLIDE 66

Displaying anomalies in context

slide-67
SLIDE 67

Android NQ NRDJS Zuul Play API

slide-68
SLIDE 68

Visualization and making it practical

slide-69
SLIDE 69

Summary on Slack

slide-70
SLIDE 70

Reflections and Takeaways

slide-71
SLIDE 71
Back to basics - simple statistics

  • scikit-learn and TensorFlow might be overkill, at least for these algorithms
  • Human curation reduces scope so we don’t need a Danger Will Robinson intelligence

Reflections and Takeaways

slide-72
SLIDE 72

Real time vs Near real time

Real time

  • Timing suggests causality
  • Useful for mean time to detect
  • Careful choice of metrics needed

Near real time

  • Cause requires correlation
  • Humans assign cause and effect
  • More granular metrics
  • Useful for mean time to resolve
  • Diurnal pattern improved predictions

Reflections and Takeaways

slide-73
SLIDE 73

Get correlation right

  • Contingency tables don’t work
  • Correlation and Mann-Whitney U test work pretty well
slide-74
SLIDE 74

A Summary Incident Approach

Android errors increased 850 percent?

Statistics + Visualization

Hoeffding, JSD, Mann-Whitney U-test, Hourly IQR

slide-75
SLIDE 75

More Information, Q&A

Team: https://medium.com/netflix-techblog/lessons-from-building-observability-tools-at-netflix-7cfafed6ab17
Me: https://www.linkedin.com/in/katzseth22202

slide-76
SLIDE 76

Thank you.