Understanding Software System Behavior With ML and Time Series Data




slide-1
SLIDE 1

Sumo Logic Confidential

Understanding Software System Behavior With ML and Time Series Data

QCon.ai SF – April 11, 2018

David Andrzejewski - @davidandrzej Engineering, Sumo Logic

slide-2
SLIDE 2

Intro / context

  • Currently:
    – Sumo Logic since 2011
    – Co-organizer: SF ML Meetup
    – @davidandrzej on Twitter
  • Previously:
    – Postdoc at LLNL
    – U Wisconsin
      • BS Comp E / CS / Math
      • PhD CS (ML)
slide-3
SLIDE 3

Continuous intelligence for machine data

slide-4
SLIDE 4

Overview

  • 1. Mega-trends: “Softwarification” of Everything + ML
  • 2. Machine data: practicalities and basic analytics
  • 3. Machine learning, data mining, and pitfalls
slide-7
SLIDE 7

Trouble in software paradise!

slide-8
SLIDE 8

Microservices “death star”

slide-10
SLIDE 10

Big Data to the rescue?

  • Logs (TBs / day)
  • Metrics (M DPs / min)
  • Source code (GBs)
  • Traces
  • Events

DEBUG-level visibility, in production

slide-11
SLIDE 11

Not so fast! “Could a Neuroscientist Understand a Microprocessor?”

  • (cool NES plotter art - Michael Fogleman)

Jonas & Kording (PLoS Comp Bio 2017)

slide-12
SLIDE 12

  • Data: necessary but not sufficient?
  • Today’s systems:

    – Software
    – Biological
    – Social / economic

“Grand challenge” problem

Using data to understand complex, dynamic, multi-scale systems

new measurements → new science

slide-13
SLIDE 13

Machine data time series

slide-14
SLIDE 14

Operational time series telemetry: the basics

  • What:
    – “Four Golden Signals” (Google SRE book)
      • Latency, Traffic, Errors, Saturation
      • (also: USE, RED, …)
    – Basic resources: CPU, memory, …
    – More granular timings
    – Event counts, cache miss rates, other internals…
  • How:
    – “push” agents/daemons (eg, StatsD)
    – “pull” metrics endpoints (eg, Prometheus)
  • Where:
    – TSDB (time series database)
    – OSS / Commercial systems

slide-15
SLIDE 15

Operational time series telemetry: why

Q: WTF is my system actually doing?

Monitoring & troubleshooting

  • data visualization
  • alerting*
  • summarize behavior
  • comparisons
slide-16
SLIDE 16

Operational time series telemetry: example

“Metrics 2.0”–style key-value identifier

deployment=production cluster=indexer host=foobuzz-39 metric=write_latency units=ms

Time:    8:01  8:02  8:03  8:04  8:05  …
Value:     64   128    72   144    96  …

Actual data: sequence of (timestamp, value) pairs
slide-18
SLIDE 18

Quantization: rollup / time-based aggregation

Raw event/observation data → coarser, more regular 1-minute aggregations → 1-hour aggregations, etc.

1-minute:  8:00  8:01     …  8:58  8:59
           60.1  43.2  33.3  45.1  42.5

1-hour:    6:00  7:00  8:00  9:00  10:00  …
              …     …  33.3     …      …

$f : \mathcal{M}(\mathbb{R}) \to \mathbb{R}$

Aggregation: map from a multiset of floats, written $\mathcal{M}(\mathbb{R})$ here, to some single-valued summary

  • Min
  • Max
  • Avg
  • Sum
  • Count
  • Percentiles?
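
A minimal sketch of a time-based rollup, assuming raw points arrive as (seconds, value) pairs. The `rollup` helper and the sample timestamps are hypothetical; only the values echo the slide.

```python
from collections import defaultdict

def rollup(points, bucket_seconds, agg):
    """Group (seconds, value) points into fixed buckets and aggregate each bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: agg(values) for start, values in sorted(buckets.items())}

# Hypothetical raw observations: (seconds since 8:00, write latency in ms).
raw = [(5, 60.1), (42, 43.2), (61, 33.3), (95, 45.1), (130, 42.5)]

per_minute_min = rollup(raw, 60, min)
per_minute_count = rollup(raw, 60, len)
```

Any function from a list of floats to a single number (min, max, sum, len) can be passed as `agg`; percentiles need more care, as the later slides argue.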
slide-19
SLIDE 19

SRE percentiles

  • avg = 1485 ms
  • p95 = 4894 ms

Percentile as guarantee: “p99 < 2000 ms” translates into unambiguous language: “No more than 1% of customer requests take longer than 2 seconds to execute”

slide-20
SLIDE 20

Percentiles via CDF$^{-1}$ (inverse CDF)

p60 = -1.8 etc...

https://en.wikipedia.org/wiki/Normal_distribution

slide-22
SLIDE 22

Algebraic structure for fun and profit

Example: word counts

[Diagram: blocks labeled “data”]

$f(s_1 + s_2) = f(s_1) \oplus f(s_2)$

Left-hand side: aggregate of combined data. Right-hand side: combination of aggregates.

Monoid homomorphism!
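
The word-count case can be sketched with the standard library, taking string concatenation as "+" and `Counter` addition as ⊕. The sample strings are made up for illustration.

```python
from collections import Counter

def word_counts(text):
    """f: map a chunk of text to its word-count summary."""
    return Counter(text.split())

s1 = "error timeout error"
s2 = "timeout ok"

# Aggregate of combined data ...
combined_then_counted = word_counts(s1 + " " + s2)
# ... equals the combination of aggregates.
counted_then_combined = word_counts(s1) + word_counts(s2)
```

This is why counts can be computed per shard or per time bucket and merged later without touching the raw data again.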

slide-23
SLIDE 23

Percentile original sin: $f(s_1 + s_2) \neq f(s_1) \oplus f(s_2)$

  • In general, cannot combine:
    – p95 of dataset X
    – p95 of dataset Y
  • ...to say anything meaningful at all about dataset X ∪ Y
  • Impress your SRE/DevOps friends at parties!

Not a monoid homomorphism
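
A small illustration with made-up latencies, assuming a nearest-rank percentile definition (other definitions interpolate and give slightly different numbers). Averaging two p95 values does not give the p95 of the combined data, while max merges cleanly.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

x = [10] * 99 + [1000]        # one slow request out of 100
y = [10] * 90 + [1000] * 10   # ten slow requests out of 100

p95_x = percentile(x, 95)
p95_y = percentile(y, 95)
p95_union = percentile(x + y, 95)
naive_average = (p95_x + p95_y) / 2

# By contrast, max does merge: the max of the union is the max of the maxes.
max_merges = max(x + y) == max(max(x), max(y))
```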

slide-24
SLIDE 24

Basic aggregation: across series

Input: one series per host (host=foobuzz-1, host=foobuzz-2, host=foobuzz-3, …)

8:01  8:02  8:03  8:04  8:05  …
  64   128    72   144    96  …
  23    33    49    57    37  …
  46   101    78    58    39  …
   …     …     …     …     …  …

What is max write_latency of entire foobuzz cluster?

Output: f = MAX( ) applied across series at each timestamp

8:01  8:02  8:03  8:04  8:05  …
55.3  47.1  76.8  52.3  41.7

slide-25
SLIDE 25

Basic aggregation: across time (aka “fold”)

Input: one series per host (host=foobuzz-1, host=foobuzz-2, host=foobuzz-3, …)

8:01  8:02  8:03  …
  64   128    72  …
  23    33    49  …
  46   101    78  …
   …     …     …  …

What is average queue depth of each foobuzz host over this time period?

Output: f = AVG( ) applied across time, one value per host

103.4  48.6  62.1
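
The two directions of aggregation can be sketched on the first three timestamps from the tables. The assignment of rows to host names is an assumption, and the results will not match the slide's output rows, which appear to cover more hosts and a longer window.

```python
from statistics import mean

# Values copied from the slide tables; which row belongs to which host is assumed.
series = {
    "foobuzz-1": [64, 128, 72],
    "foobuzz-2": [23, 33, 49],
    "foobuzz-3": [46, 101, 78],
}

# Across series: one value per timestamp (MAX over hosts).
cluster_max = [max(column) for column in zip(*series.values())]

# Across time ("fold"): one value per host (AVG over the window).
host_avg = {host: mean(values) for host, values in series.items()}
```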

slide-27
SLIDE 27

Time-shifted comparisons

deployment=production cluster=indexer host=foobuzz-21 metric=write_latency units=ms

How does write_latency for this foobuzz instance compare versus yesterday?

Time:              8:01  8:02  8:03  8:04  8:05  …
Now:                 64   128    72   144    96  …
Timeshift (-24h):    23    12    18    37    24  …

[Chart: “Comparison” of Now vs Timeshift, 8:01 to 8:05, y-axis 20 to 160]
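
A sketch of the comparison using the values from the slide, assuming both series are already keyed by time of day. Using the ratio as the comparison is one choice among several (a difference would work too).

```python
now = {"8:01": 64, "8:02": 128, "8:03": 72, "8:04": 144, "8:05": 96}
yesterday = {"8:01": 23, "8:02": 12, "8:03": 18, "8:04": 37, "8:05": 24}

# Align on time of day, then compare now against the -24h series.
ratio = {t: now[t] / yesterday[t] for t in now if t in yesterday}
worst_time = max(ratio, key=ratio.get)
```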

slide-28
SLIDE 28

Windowing data

  • Tiled / Fixed
  • Sliding / Rolling
  • See Tyler Akidau (Apache Beam)

    – QCon SF 2016 slides
    – “Beyond Batch” blog posts Part 1, Part 2

Aka “grouping over time”

slide-29
SLIDE 29

Handling “missing” data

Reality: often messy!

pandas

  • fillna() – some very sane basics

Fancier model / ML based approaches

  • try to “predict” missing data
    – “imputation” (statistics / econometrics)
    – inference / sampling (probabilistic models)

slide-30
SLIDE 30

[Figure panels: Original data, Fixed value (mean), Interpolation, Back fill, Forward fill]

(notebook code on Github)
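
The four fill strategies named in the panels can be sketched in pandas as follows. This is not the speaker's notebook; the sample series is made up and pandas and NumPy are assumed to be installed.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps.
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

fixed_mean = s.fillna(s.mean())   # fixed value: mean of the observed points
interpolated = s.interpolate()    # linear interpolation between neighbors
back_filled = s.bfill()           # copy the next observed value backward
forward_filled = s.ffill()        # carry the last observed value forward
```

Which strategy is appropriate depends on what a gap means for the metric: a counter that stopped reporting is a different situation from a gauge sampled too sparsely.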

slide-31
SLIDE 31

Fixed-threshold alerting

“Wake somebody up if the site is down”

slide-32
SLIDE 32

MACHINE SCALE = overwhelming complexity!

N ≈ one million series

  • Can’t analyze them all
  • Can’t even look at them
  • $\binom{N}{2}$ pairs to compare
  • Historical comparisons over different timescales
  • PROBLEM: how to “scale” expert human time and attention?

slide-33
SLIDE 33

“Machine learning studies computer algorithms for learning to do stuff.”

  • Prof. Rob Schapire (COS 511 scribe notes)
slide-34
SLIDE 34

ML cheat sheet

Is machine learning right for you?

Do you know what you’re trying to accomplish?

  • NO → Uh oh
  • YES → Can you do it with simple / deterministic analysis?
    – YES → Do that
    – NO → Let’s try ML…?

slide-35
SLIDE 35

Predictive models and outliers

Surprise: Your prediction is wrong!

slide-37
SLIDE 37

Outlier detection via predictive modeling

KEY ASSUMPTIONS

1. In “steady-state”, data exhibit some regularity / predictability
2. Learn a model of this behavior
3. Major deviations from our expectation represent new underlying behavior or totally novel “exogenous shock”
4. These surprises are valuable to discover

KEY Qs

1. Is behavior actually regular?
2. How to model behavior?
3. How major is “major”?
4. Are surprises actually valuable?

“It’s tough to make predictions, especially about the future”

slide-38
SLIDE 38

Simple example: rolling window

  • Model
    – predict as sliding window avg
  • Threshold
    – standardize on sliding window std dev
  • Cons:
    – very simple / naïve
    – Doesn’t handle well:
      • “expected spikes”
      • periodicity
  • Pros:
    – easy to visualize
    – people can understand it

aka “Bollinger bands”

µ ± 3σ
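
A minimal sketch of the rolling-window detector, assuming the window covers the previous 5 points and excludes the point being tested. The window size, the helper name, and the sample latencies are illustrative choices.

```python
from statistics import mean, stdev

def rolling_outliers(values, window=5, k=3.0):
    """Flag indices whose value lies outside mean ± k*std of the previous `window` points."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(values[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged

# Hypothetical write latencies with one spike at index 7.
latency = [64, 66, 63, 65, 64, 66, 65, 140, 64, 65]
outliers = rolling_outliers(latency)
```

Note the side effect right after the spike: the spike inflates the window's standard deviation, so the band widens and later anomalies are easier to miss for a few points.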

slide-42
SLIDE 42

Little fancier: autoregression (AR)

  • Model: next data point is linear combination of previous N
  • Estimation: what are weights?
  • Bonus question: can you express “rolling avg” in this framework?

Estimate future based on past

$x_t = \sum_{i=1}^{N} w_i x_{t-i} + \epsilon_t$

[Chart: foobuzz write_latency, 8:01 to 8:05]
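
One way to estimate the weights is ordinary least squares on lagged copies of the series. The sketch below assumes NumPy is available and omits an intercept term; `fit_ar` and `predict_next` are illustrative helpers, not a library API.

```python
import numpy as np

def fit_ar(series, n_lags):
    """Least-squares estimate of AR weights w_1..w_N (no intercept term)."""
    x = np.asarray(series, dtype=float)
    # The row for time t holds [x_{t-1}, x_{t-2}, ..., x_{t-N}].
    lagged = np.column_stack(
        [x[n_lags - i:len(x) - i] for i in range(1, n_lags + 1)]
    )
    targets = x[n_lags:]
    weights, *_ = np.linalg.lstsq(lagged, targets, rcond=None)
    return weights

def predict_next(series, weights):
    """One-step forecast: weighted sum of the most recent values, newest first."""
    recent = np.asarray(series[-len(weights):], dtype=float)[::-1]
    return float(recent @ weights)

# Noise-free toy series generated with known weights (0.5, 0.3), purely for illustration.
toy = [1.0, 2.0]
for _ in range(18):
    toy.append(0.5 * toy[-1] + 0.3 * toy[-2])
estimated = fit_ar(toy, n_lags=2)

# Bonus question: a rolling average is an AR model with every weight equal to 1/N.
rolling_avg_forecast = predict_next([1, 2, 3, 4, 5, 6], np.full(3, 1 / 3))
```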

slide-43
SLIDE 43

Fixed-length feature vectors

Sequence: A B C D E F

Sliding windows of length 3:

A B C
B C D
C D E
D E F

NOTE: can add other variables (eg, host load) to context

! "

slide-44
SLIDE 44

Data with linear trend

  • Easy to fit a linear regression

Estimate future based on past

$x_t = a \cdot t + b + \epsilon_t$

slide-45
SLIDE 45

Data with linear trend

  • OR can remove linear trend by simple differencing operation

Estimate future based on past

[Charts: foobuzz disk_used (7:59 to 8:05) and the diff'ed series (8:02 to 8:05)]

$x'_t = x_t - x_{t-1}$
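
The differencing operation is a one-liner. The disk usage numbers below are hypothetical, chosen to show a steady upward trend turning into a roughly flat series.

```python
def difference(values):
    """First difference: x'_t = x_t - x_{t-1} (the output is one element shorter)."""
    return [curr - prev for prev, curr in zip(values, values[1:])]

# Hypothetical disk usage with a steady upward trend plus small wiggles.
disk_used = [1.0, 1.5, 2.1, 2.5, 3.0]
diffed = difference(disk_used)
```

After differencing, the values hover around the slope of the trend (about 0.5 per step here), which is easier to model with a fixed mean.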

slide-47
SLIDE 47

Seasonality

Very common in data linked to human activity

slide-48
SLIDE 48

Modeling seasonal data

  • Detection
    – (p)ACF plots (Rob J. Hyndman)
    – FFT spectrum
  • Modeling
    – “manual” adjustment
    – ARIMA
    – Seasonal Holt-Winters
    – Fourier coefficients (eg, FB Prophet)

Detection + Modeling

slide-49
SLIDE 49

Latent state models

  • Assume each observed data point x is associated with a hidden state s
  • Hidden Markov Model (HMM)
    – Rabiner 1989
    – Jurafsky & Martin book chapter
  • More complex but very expressive
  • Can (sometimes?) interpret the latent/hidden state information

Observed data produced by hidden mechanism

[Diagram: hidden states $s_1, s_2, s_3$ emitting observations $x_1, x_2, x_3$]

slide-51
SLIDE 51

Bayesian Change Point Detection

Ryan Prescott Adams & David J.C. MacKay

[Figure: example time series over roughly 150 time steps, with a second panel plotted against the same time axis]

Example: system occasionally does internal “maintenance”

slide-52
SLIDE 52

Why are we doing this again?

Which aspects don’t scale “manually”?

  • Forecasting (data-driven prediction vs human guesswork)
    – capacity planning
    – preventative maintenance
  • Outlier detection (automatic predictions and comparisons)
    – model accurately characterizes “typical” behavior
    – significant surprises may therefore be interesting
    – useful if you:
      • don’t want to set fixed/hard thresholds
      • do really care about unexplained variation in this quantity
  • Machine advantage: multivariate data
slide-53
SLIDE 53

Code & data resources

Try this at home!

  • Python
    – pandas
    – StatsModels
    – scikit-learn
    – Keras
  • Datasets
    – Numenta Anomaly Benchmark (NAB)
    – Yahoo Extensible Generic Anomaly Detection System (EGADS)
    – Kaggle time series datasets

slide-54
SLIDE 54

Distance-based data mining of time series

slide-55
SLIDE 55

Identifying similar behaviors

  • We have lots of machines
  • One has some interesting behavior (alarm, etc)
  • Which other hosts may be behaving in a similar way?
slide-57
SLIDE 57

Metric similarity: naïve approach

“Hosts who look like X”

  • Are these “behaving similarly”?
  • Direct norm distance calculation
    – $d(x, y) = \lVert x - y \rVert_p$
    – Spikes are “disjoint”
    – Distance would be large
  • Intuition: can we slightly shift?
    – Would be very similar…

[Chart: series A and B over time steps 1 to 12]

slide-58
SLIDE 58

Metric similarity: Dynamic Time Warping (DTW)

  • Keogh research group (UC-Riverside)
  • Etsy Kale (v1, v2)
  • Idea: allow some “warping” of the time series to get best alignment

Diagram: “Fast Multisegment Alignments for Temporal Expression Profiles”, Adam A. Smith and Mark Craven (UW Madison), Comput Syst Bioinformatics Conf. 2008
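
A textbook dynamic-programming version of DTW, shown on two made-up series whose spikes are offset in time. This is the unconstrained quadratic-time form with an absolute-difference cost, not the optimized variants from the work cited above.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

spike_early = [1, 1, 5, 1, 1, 1, 1, 1]
spike_late = [1, 1, 1, 1, 1, 5, 1, 1]

pointwise = sum(abs(x - y) for x, y in zip(spike_early, spike_late))
warped = dtw_distance(spike_early, spike_late)
```

The pointwise distance penalizes both spikes, while the warped alignment matches spike to spike and reports the two hosts as behaving the same.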

slide-60
SLIDE 60

Top N similar hosts via DTW

Highly similar hosts

slide-61
SLIDE 61

Build a graph of host-host similarity

Edge weight $\propto$ (DTW distance)$^{-1}$

slide-63
SLIDE 63

Spectral clustering

Tutorial (von Luxburg), sklearn implementation

slide-64
SLIDE 64

Anomaly detection & event classification with log data

slide-68
SLIDE 68

Anatomy of a log message: Five W’s

When? Timestamp with time zone
Where? Host, module, code location
Who? Authentication context

slide-72
SLIDE 72

Deriving time series from log data

  • High volume stream of semi-structured strings
  • Count them
  • Parse things out of them
  • Cluster them

Logs: what are they good for?

slide-76
SLIDE 76

printf("%s Health status check: %s is %s", timestamp, hostid, hoststatus)

02/15/2014 10:03:16 UTC Health status check: zim-5 is OK
02/15/2014 10:03:11 UTC Health status check: gir-3 is OK
02/15/2014 10:03:07 UTC Health status check: gir-2 is TIMED OUT
02/15/2014 10:02:45 UTC Health status check: dib-1 is OK

$DATETIME Health status check: **** is ****
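
A toy sketch of recovering the template from the four example lines by masking the variable fields. The two regular expressions are hand-written for this one message family and are an assumption; a real log clusterer would have to infer them.

```python
import re
from collections import Counter

LOG_LINES = [
    "02/15/2014 10:03:16 UTC Health status check: zim-5 is OK",
    "02/15/2014 10:03:11 UTC Health status check: gir-3 is OK",
    "02/15/2014 10:03:07 UTC Health status check: gir-2 is TIMED OUT",
    "02/15/2014 10:02:45 UTC Health status check: dib-1 is OK",
]

# Hypothetical masking rules for this one message family, not a general log parser.
DATETIME = re.compile(r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} UTC")
HOST_AND_STATUS = re.compile(r"check: \S+ is .+$")

def to_template(line):
    """Replace the timestamp, host id, and status with placeholders."""
    line = DATETIME.sub("$DATETIME", line)
    return HOST_AND_STATUS.sub("check: **** is ****", line)

template_counts = Counter(to_template(line) for line in LOG_LINES)
```

Counting lines per template per time bucket turns the raw log stream into a count time series.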

slide-77
SLIDE 77

Log data as (approximate) program execution trace

Logs emitted by printf()

  • Which printf() gets hit can vary with code path
  • Changes in printf() counts imply code behavior changes

Code + Behavior → Logs

slide-79
SLIDE 79

Series: “Health check OK”; “Request processed”; “Txn timeout, retry”

Log cluster counts as multivariate time series

slide-80
SLIDE 80

Distances over multivariate count vectors

  • Kullback-Leibler Divergence
  • IDEA
    – track “distance” between recent time and historical average
    – do rolling outlier on this quantity

Comparing printf counts

[Chart: “Histogram Distance”, Vector 1 vs Vector 2 over categories A to D]

$D(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$
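
A sketch of the divergence over count vectors. Additive smoothing is an assumption added here so that an empty bin does not produce a division by zero; the counts are made up.

```python
import math

def kl_divergence(p_counts, q_counts, smoothing=1.0):
    """D(P || Q) over count vectors, with additive smoothing so no bin is zero."""
    p_total = sum(p_counts) + smoothing * len(p_counts)
    q_total = sum(q_counts) + smoothing * len(q_counts)
    total = 0.0
    for p_raw, q_raw in zip(p_counts, q_counts):
        p = (p_raw + smoothing) / p_total
        q = (q_raw + smoothing) / q_total
        total += p * math.log(p / q)
    return total

historical = [30, 25, 10, 5]   # hypothetical average counts for templates A-D
recent_ok = [29, 26, 11, 4]    # close to the historical mix
recent_odd = [5, 25, 10, 30]   # A and D have swapped roles

score_ok = kl_divergence(recent_ok, historical)
score_odd = kl_divergence(recent_odd, historical)
```

The resulting score is itself a univariate time series, so the rolling-window detector from earlier in the talk can be applied to it.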
slide-81
SLIDE 81

Multiclass event classification

  • Vector representation of “event”
  • Classify new events via nearest-neighbors via cosine similarity
  • Interpretability: these are log lines

IDEA: categorize anomalies by difference vector

[Charts: “Histogram Distance” (Vector 1 vs Vector 2 over categories A to D) and the resulting “Difference” per category]

$\vec{d} = \vec{v}_1 - \vec{v}_2$

$\cos\theta_{ij} = \frac{\langle d_i, d_j \rangle}{\lVert d_i \rVert \, \lVert d_j \rVert}$
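
A sketch of nearest-neighbor classification over difference vectors. The labels and vectors in `known` are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def classify(difference, labeled_events):
    """Return the label of the stored difference vector with the highest cosine similarity."""
    return max(labeled_events, key=lambda label: cosine(difference, labeled_events[label]))

# Hypothetical labeled difference vectors over log templates A-D.
known = {
    "deploy": [10, -5, 0, 0],
    "db_timeout": [0, 0, -10, 25],
}
label = classify([1, 0, -4, 9], known)
```

Because cosine similarity ignores magnitude, a mild and a severe instance of the same kind of event point in the same direction and get the same label.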

slide-82
SLIDE 82

Don’t fool yourself

slide-83
SLIDE 83

Some warnings on thresholds

Family-wise Error Rate (FWER)

  • You’ve got a super accurate model!
  • Only alert if 0.01% chance…
  • Congratulations!
    – Over 1M series you can expect ~100 false positives
  • Why Nobody Cares About Your Anomaly Detection
    – Baron Schwartz (VividCortex), O’Reilly Strata San Jose 2018
  • Multiple-testing adjustment (eg, Bonferroni correction)
    – Just divide the significance threshold by number of trials (very loose / conservative)

$p \le \frac{\alpha}{m}$
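
The arithmetic behind these numbers, as a quick check. The family-wise target `alpha = 0.05` is an assumed value, not something stated in the talk.

```python
n_series = 1_000_000
per_series_false_alarm = 0.0001   # the slide's 0.01% chance

# Expected false alarms if every series is tested independently at that rate.
expected_false_positives = n_series * per_series_false_alarm

# Bonferroni: to keep the family-wise rate at alpha, test each series at alpha / m.
alpha = 0.05                      # assumed target, not from the talk
per_series_threshold = alpha / n_series
```

The corrected per-series threshold comes out around 5e-8, which shows why the correction is described as very conservative.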

slide-84
SLIDE 84

Finance: time series epistemology for “fun” and “profit”

“Predicting stock returns with random hybrid convolutional deep recurrent neural networks”

Rich source of errors

  • “Domain-oblivious ML” (DANGER!)
  • Blog takedowns
    – Zachary David (blog 1, blog 2)
    – Knight Capital

slide-85
SLIDE 85

Sanity check: historical backtesting

Simulated “replay” of past data

  • Common in financial domain
  • Restrict model-data interaction
  • Applicable to machine data modeling (?)

slide-86
SLIDE 86

Advanced topics

slide-87
SLIDE 87

Bayesian methods

(one) advantage: explicit uncertainty modeling

  • Put a prior on it!

Plot from scikit-learn docs (Vincent Dubourg, Jake Vanderplas, Jan Hendrik Metzen)

slide-88
SLIDE 88

Hierarchical Bayesian methods

  • Bayesian Time Series: Structured Representations for Scalability
    – Emily Fox (Univ of Washington)

Put a prior on your prior!

[Diagram: hierarchical model with time, machine, and cluster levels]

Exploit structure?

slide-90
SLIDE 90

WHAT ABOUT “DEEP LEARNING”?

$\theta_{t+1} = \theta_t - \eta \, \nabla f_{\theta_t}(X, y)$

slide-91
SLIDE 91

Everything we’ve discussed, but more parameters and stuff

  • (AFAIK) doesn’t free you (yet…?) from
    – Understanding the problem domain
    – Framing the ML problem
  • See: Intro to forecasting, Two effective algos for time series forecasting
  • Deep Neural Net versions of predictive approaches
    – AWS DeepAR service (arXiv)
    – Relevant flavors:
      • Recurrent Neural Networks (RNN)
      • Convolutional Neural Networks (CNN)
      • Long short-term memory (LSTM)
      • Attentional models
    – (potential) advantage: certainly should have plenty of machine data to train…

(Probably something relevant has been published on arXiv during this talk…)

slide-92
SLIDE 92

“In conclusion, machine data is a land of contrasts…”

  • Software systems
    – Run our entire civilization…
    – but very complex!
  • Machine data
    – Key tool for understanding…
    – but overwhelming volume!
  • Machine learning
    – Powerful toolkit…
    – but can mislead / confuse!
slide-93
SLIDE 93

BONUS MATERIAL

slide-94
SLIDE 94

Overfitting 101

Too much of a good thing

  • Given rich enough model class, can fit any training data “perfectly”
    – Risk of “over”-fitting to idiosyncratic noise in your training data
    – Actually degrades true predictive performance

[Chart: generalization error vs training iteration, with the “Overfitting!” region marked]

slide-95
SLIDE 95

K-fold cross validation

Split data into k batches with train/test splits

Example: K=4

  • Idea: always testing model against “new” data
  • See Aarti Singh CMU lecture notes
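
A minimal sketch of generating the K train/test splits by index. Contiguous, unshuffled folds are an illustrative choice here; the helper name is hypothetical.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    indices = list(range(n_items))
    # Spread any remainder over the first few folds.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(k_fold_splits(n_items=8, k=4))
```

For time-ordered data, most of these splits train on points that come after the test fold, so a forward-chaining split (train on the past, test on the next block) may be the safer variant.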

slide-97
SLIDE 97

Advanced overfitting: human-in-the-loop

Manual search: model selection, network architecture, hyperparameters, …

[Chart: “Human overfitting”: k-fold CV “Test” accuracy vs true generalization for candidates A to D, y-axis 78% to 92%]

slide-98
SLIDE 98

Advanced overfitting: human-in-the-loop

How to avoid this?

  • Pre-commit to modeling choices
  • Use genuine holdout set (only used once)
  • “Thresholdout” - reusable holdout techniques

    – Related to theory of differential privacy (Cynthia Dwork & Aaron Roth, pdf)
    – The reusable holdout: Preserving validity in adaptive data analysis
    – Generalization in Adaptive Data Analysis and Holdout Reuse
    – Google Research blog post

slide-99
SLIDE 99

P-hacking: FWER on steroids

Big data → easy to find “significant” results

  • “Replication crisis” in science
  • Spurious correlations generator
  • Machine data: easy to do
  • Data Scientists: watch out!
  • Business / Managers: watch out for your Data Scientists doing this!