Understanding Software System Behavior With ML and Time Series Data

Understanding Software System Behavior With ML and Time Series Data
QCon.ai SF – April 11, 2018
David Andrzejewski (@davidandrzej), Engineering, Sumo Logic
Sumo Logic Confidential

Intro / context
• Currently:
  – Sumo Logic since 2011
  – Co-organizer: SF ML Meetup
  – @davidandrzej on Twitter
• Previously:
  – Postdoc at LLNL
  – U Wisconsin
    • BS Comp E / CS / Math
    • PhD CS (ML)

Continuous intelligence for machine data

Overview
1. Mega-trends: “Softwarification” of Everything + ML
2. Machine data: practicalities and basic analytics
3. Machine learning, data mining, and pitfalls

Trouble in software paradise!

Microservices “death star”

Big Data to the rescue?
DEBUG-level visibility, in production
• Logs (TBs / day)
• Metrics (millions of data points / min)
• Source code (GBs)
• Traces
• Events

Not so fast!
“Could a Neuroscientist Understand a Microprocessor?” Jonas & Kording (PLoS Comp Bio 2017)
• (cool NES plotter art - Michael Fogleman)

Using data to understand complex, dynamic, multi-scale systems
• “Grand challenge” problem
• New measurements → new science
• Data: necessary but not sufficient?
• Today’s systems:
  – Software
  – Biological
  – Social / economic

Machine data time series

Operational time series telemetry: the basics
• What:
  – “Four Golden Signals” (Google SRE book)
    • Latency, Traffic, Errors, Saturation
    • (also: USE, RED, …)
  – Basic resources: CPU, memory, …
  – More granular timings
  – Event counts, cache miss rates, other internals…
• How:
  – “push” agents/daemons (e.g., StatsD)
  – “pull” metrics endpoints (e.g., Prometheus)
• Where:
  – TSDB (time series database)
  – OSS / commercial systems

Operational time series telemetry: why
Q: WTF is my system actually doing?
Monitoring & troubleshooting
• data visualization
• alerting*
• summarize behavior
• comparisons

Operational time series telemetry: example
“Metrics 2.0”–style key-value identifier:
  deployment=production
  cluster=indexer
  host=foobuzz-39
  metric=write_latency
  units=ms
Actual data: sequence of (timestamp, value)
  timestamp:  8:01  8:02  8:03  8:04  8:05  …
  value:      64    128   72    144   96    …
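One way to picture the identifier-plus-data split above is a small record type. This is only a sketch: the `TimeSeries` class and its `matches` helper are hypothetical names for illustration, not part of Metrics 2.0 or any particular TSDB.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TimeSeries:
    """Hypothetical container: tags identify the series, points hold the data."""
    tags: Dict[str, str]
    # (timestamp, value) pairs in time order; timestamps kept as strings here.
    points: List[Tuple[str, float]] = field(default_factory=list)

    def matches(self, **selector: str) -> bool:
        """True if every selector key/value pair appears in this series' tags."""
        return all(self.tags.get(k) == v for k, v in selector.items())


write_latency = TimeSeries(
    tags={
        "deployment": "production",
        "cluster": "indexer",
        "host": "foobuzz-39",
        "metric": "write_latency",
        "units": "ms",
    },
    points=[("8:01", 64), ("8:02", 128), ("8:03", 72), ("8:04", 144), ("8:05", 96)],
)

# Select series by any subset of tags, e.g. all write_latency in the indexer cluster.
selected = write_latency.matches(cluster="indexer", metric="write_latency")
```

Keeping the tags separate from the points is what makes queries like "all hosts in cluster=indexer" possible without parsing a dotted metric name.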

Quantization: rollup / time-based aggregation
Raw event/observation data → coarser, more regular
1-minute aggregations → 1-hour aggregations, etc.
Aggregation f: map from a multiset of floats to some single-valued summary (f: multiset of ℝ → ℝ)
• Min
• Max
• Avg
• Sum
• Count
• Percentiles?

  1-minute:  8:00  8:01  …     8:58  8:59
             60.1  43.2  33.3  45.1  42.5
  1-hour:    6:00  7:00  8:00  9:00  10:00
             …     …     33.3  …     …
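A rollup of this kind can be sketched in a few lines of plain Python. The `rollup` function and the `AGGREGATIONS` table are illustrative names, and the minute for the 33.3 reading and the 9:00 values are made up, since the slide elides them.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical 1-minute observations as (hour, minute, value).
# The 8:30 timestamp and the 9:xx points are invented for illustration.
raw = [
    (8, 0, 60.1), (8, 1, 43.2), (8, 30, 33.3), (8, 58, 45.1), (8, 59, 42.5),
    (9, 0, 50.0), (9, 1, 70.0),
]

# Each aggregation maps a multiset of floats to a single summary value.
AGGREGATIONS = {"min": min, "max": max, "avg": mean, "sum": sum, "count": len}


def rollup(points, agg):
    """Group minute-level points by hour and reduce each bucket to one value."""
    buckets = defaultdict(list)
    for hour, _minute, value in points:
        buckets[hour].append(value)
    return {hour: AGGREGATIONS[agg](values) for hour, values in sorted(buckets.items())}


hourly_min = rollup(raw, "min")
hourly_count = rollup(raw, "count")
```

With this data the 8:00 bucket reduces to 33.3 under `min`, which is consistent with the hourly value shown on the slide.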

SRE percentiles
Percentile as guarantee: p99 < 2000 ms translates into unambiguous language:
“No more than 1% of customer requests take longer than 2 seconds to execute”
Example values:
• avg = 1485 ms
• p95 = 4894 ms
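The guarantee reading of a percentile can be checked directly by counting slow requests. The sketch below uses invented traffic and hypothetical function names; it also shows how an average can look healthy while the tail breaks the guarantee.

```python
from statistics import mean


def fraction_slower_than(latencies_ms, threshold_ms):
    """Fraction of requests whose latency exceeds the threshold."""
    return sum(1 for x in latencies_ms if x > threshold_ms) / len(latencies_ms)


def meets_p99_guarantee(latencies_ms, threshold_ms=2000):
    """True if no more than 1% of requests take longer than the threshold."""
    return fraction_slower_than(latencies_ms, threshold_ms) <= 0.01


# Hypothetical traffic: mostly fast requests plus a slow tail.
healthy = [100] * 990 + [5000] * 10    # exactly 1% slow
degraded = [100] * 980 + [5000] * 20   # 2% slow

healthy_ok = meets_p99_guarantee(healthy)
degraded_ok = meets_p99_guarantee(degraded)
degraded_avg = mean(degraded)  # well under 2000 ms despite the broken guarantee
```

In the degraded case the average stays far below the 2-second threshold, which is why the percentile statement is the more useful contract.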

Percentiles via CDF
p60 = CDF⁻¹(0.60) = -1.8, etc.
(figure: normal distribution CDF, https://en.wikipedia.org/wiki/Normal_distribution)
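Reading a percentile off a CDF is the same as evaluating the inverse CDF. The sketch below assumes the curve in question is the normal distribution with μ = -2 and σ² = 0.5 from the linked Wikipedia figure; that choice is a guess, made because it reproduces the p60 ≈ -1.8 quoted on the slide.

```python
from math import sqrt
from statistics import NormalDist

# Assumed parameters (not stated on the slide): mu = -2, variance = 0.5.
dist = NormalDist(mu=-2.0, sigma=sqrt(0.5))

# The p-th percentile is the value x where CDF(x) = p / 100.
p60 = dist.inv_cdf(0.60)

# Round trip: feeding the percentile back through the CDF recovers 0.60.
round_trip = dist.cdf(p60)
```

`NormalDist.inv_cdf` is in the standard library from Python 3.8 onward.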

Algebraic structure for fun and profit
Example: item counts
f(s₁ + s₂) = f(s₁) ⊕ f(s₂)

Algebraic structure for fun and profit
Example: word counts
f(s₁ + s₂) = f(s₁) ⊕ f(s₂)
• Left side: aggregate of combined data
• Right side: combination of aggregates
Monoid homomorphism!
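The word-count case can be checked with `collections.Counter`, where list concatenation plays the role of "+" on the data and Counter addition plays the role of ⊕ on the aggregates. The sentences and chunking are made up for illustration.

```python
from collections import Counter
from functools import reduce
from operator import add

s1 = "the quick brown fox".split()
s2 = "the lazy dog and the fox".split()

# Aggregate of combined data: f(s1 + s2)
combined_then_counted = Counter(s1 + s2)

# Combination of aggregates: f(s1) ⊕ f(s2)
counted_then_combined = Counter(s1) + Counter(s2)

# The same property lets partial counts from many chunks be merged in any grouping.
chunks = [s1, s2, "fox fox".split()]
merged = reduce(add, (Counter(chunk) for chunk in chunks), Counter())
```

Because the two sides agree, counts can be computed per shard or per time bucket and merged later without revisiting the raw data.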

Percentile original sin: f(s₁ + s₂) ≠ f(s₁) ⊕ f(s₂)
Not a monoid homomorphism
• In general, cannot combine:
  – p95 of dataset X
  – p95 of dataset Y
• ...to say anything meaningful at all about dataset X ∪ Y
• Impress your SRE/DevOps friends at parties!
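A small counterexample makes the point concrete: two pairs of datasets with identical per-dataset p95 values whose unions have different p95 values, so no combining function of the two p95s alone can be right for both. The data are invented, and `p95` here uses the nearest-rank definition (other definitions interpolate).

```python
def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


# Pair A and pair B agree on p95(X) and on p95(Y)...
x_a, y_a = [10] * 100, [20] * 100
x_b, y_b = [10] * 95 + [15] * 5, [0] * 94 + [20] * 6

same_inputs = (p95(x_a), p95(y_a)) == (p95(x_b), p95(y_b))

# ...but the p95 of the combined data differs between the pairs.
union_a = p95(x_a + y_a)
union_b = p95(x_b + y_b)

# Contrast: max combines cleanly, max(X + Y) == max(max(X), max(Y)).
max_combines = max(x_b + y_b) == max(max(x_b), max(y_b))
```

This is why percentile rollups are usually computed from the raw data or from a mergeable summary such as a histogram, not from previously computed percentiles.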

Basic aggregation: across series
What is max write_latency of entire foobuzz cluster?

                   8:01  8:02  8:03  8:04  8:05  …
  host=foobuzz-1   64    128   72    144   96    …
  host=foobuzz-2   23    33    49    57    37    …
  host=foobuzz-3   46    101   78    58    39    …
  …                …     …     …     …     …     …
  f = MAX( )       55.3  47.1  76.8  52.3  41.7
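Aggregating across series means reducing the values of all hosts at each timestamp. A minimal sketch, using only the three host rows that are legible on the slide (the elided hosts are not recoverable, so the result covers just these three):

```python
timestamps = ["8:01", "8:02", "8:03", "8:04", "8:05"]
series_by_host = {
    "foobuzz-1": [64, 128, 72, 144, 96],
    "foobuzz-2": [23, 33, 49, 57, 37],
    "foobuzz-3": [46, 101, 78, 58, 39],
}

# zip(*rows) turns per-host rows into per-timestamp columns;
# max() then collapses each column to a single cluster-wide value.
columns = zip(*series_by_host.values())
cluster_max = {ts: max(column) for ts, column in zip(timestamps, columns)}
```

The output is itself a time series with one value per timestamp, so it can be charted or alerted on like any single-host series.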

Basic aggregation: across time (aka “fold”)
What is average queue depth of each foobuzz host over this time period?

                   8:01  8:02  8:03  …    f = AVG( )
  host=foobuzz-1   64    128   72    …    103.4
  host=foobuzz-2   23    33    49    …    48.6
  host=foobuzz-3   46    101   78    …    62.1
  …                …     …     …     …
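Aggregating across time collapses each series to one number. The sketch below writes the average as an explicit fold over a (sum, count) pair; `fold_avg` is a hypothetical helper, and only the three visible points per host are used, so the results differ from the slide's averages over the full period.

```python
from functools import reduce

series_by_host = {
    "foobuzz-1": [64, 128, 72],
    "foobuzz-2": [23, 33, 49],
    "foobuzz-3": [46, 101, 78],
}


def fold_avg(values):
    """Average as a fold: carry (running sum, running count), divide at the end."""
    total, count = reduce(lambda acc, v: (acc[0] + v, acc[1] + 1), values, (0.0, 0))
    return total / count


avg_by_host = {host: fold_avg(values) for host, values in series_by_host.items()}
```

Carrying (sum, count) instead of a running average keeps the intermediate state mergeable, which matters when the fold is split across time buckets.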

Comparison: time-shifted comparisons
How does write_latency for this foobuzz instance compare versus yesterday?
Series: deployment=production, cluster=indexer, host=foobuzz-21, metric=write_latency, units=ms
(chart: “Now” vs “Timeshift”, 8:01 to 8:05, y-axis 0 to 160)

  Now:               8:01  8:02  8:03  8:04  8:05  …
                     64    128   72    144   96    …
  Timeshift (-24h):  8:01  8:02  8:03  8:04  8:05  …
                     23    12    18    37    24    …
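A time-shifted comparison amounts to relabeling yesterday's timestamps by +24h so they line up with today's, then comparing point by point. In this sketch the calendar date and the `shift` helper are assumptions for illustration; the values are the ones on the slide.

```python
from datetime import datetime, timedelta


def shift(series, delta):
    """Relabel each timestamp by +delta so older data lines up with newer data."""
    return {ts + delta: value for ts, value in series.items()}


day = datetime(2018, 4, 11)  # hypothetical date for "now"
minutes = [1, 2, 3, 4, 5]

now = {day.replace(hour=8, minute=m): v
       for m, v in zip(minutes, [64, 128, 72, 144, 96])}
yesterday = {day.replace(hour=8, minute=m) - timedelta(days=1): v
             for m, v in zip(minutes, [23, 12, 18, 37, 24])}

# Move yesterday forward 24h, then subtract wherever both series have a point.
aligned = shift(yesterday, timedelta(hours=24))
delta_vs_yesterday = {ts: now[ts] - aligned[ts] for ts in now if ts in aligned}
```

Only timestamps present in both series are compared, which sidesteps gaps in either day.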

Windowing data
Aka “grouping over time”
• Tiled / Fixed
• Sliding / Rolling
• …
• See Tyler Akidau (Apache Beam)
  – QCon SF 2016 slides
  – “Beyond Batch” blog posts Part 1, Part 2
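The difference between the two window types is easiest to see on a short list. This is a count-based sketch with hypothetical helper names; real systems usually window by time rather than by number of points.

```python
def tiled_windows(values, size):
    """Non-overlapping (fixed) windows; the last one may be shorter."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def sliding_windows(values, size):
    """Overlapping (rolling) windows that advance one point at a time."""
    return [values[i:i + size] for i in range(len(values) - size + 1)]


values = [64, 128, 72, 144, 96]

tiled = tiled_windows(values, 2)
sliding = sliding_windows(values, 2)

# Any aggregation can then be applied per window, e.g. a rolling max.
rolling_max = [max(window) for window in sliding]
```

Tiled windows give each point to exactly one group, while sliding windows reuse points, which smooths the output at the cost of correlated neighbors.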

Handling “missing” data
Reality: often messy!
• pandas fillna() – some very sane basics
• Fancier model / ML based approaches
  – try to “predict” missing data
  – “imputation” (statistics / econometrics)
  – inference / sampling (probabilistic models)

(figure panels: Original data; Fixed value (mean); Forward fill; Back fill; Interpolation. Notebook code on GitHub.)
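The four basic strategies named above map onto standard pandas `Series` methods. This is a minimal sketch on an invented five-point series, not the notebook code referenced on the slide.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing observations.
s = pd.Series([64.0, np.nan, 72.0, np.nan, 96.0])

fixed = s.fillna(s.mean())       # fixed value: mean of the observed points
forward = s.ffill()              # forward fill: carry the last observation forward
backward = s.bfill()             # back fill: pull the next observation backward
interpolated = s.interpolate()   # linear interpolation between neighbors
```

Each choice encodes a different assumption about the gap: forward fill says "nothing changed", interpolation says "it changed smoothly", and a fixed value ignores local context entirely.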

Fixed-threshold alerting
“Wake somebody up if the site is down”

MACHINE SCALE = overwhelming complexity!
N ≈ one million series
• Can’t analyze them all
• Can’t even look at them
• On the order of N² pairs to compare
• Historical comparisons over different timescales
PROBLEM: how to “scale” expert human time and attention?

“Machine learning studies computer algorithms for learning to do stuff.”
- Prof. Rob Schapire (COS 511 scribe notes)

ML cheat sheet: is machine learning right for you?
1. Do you know what you’re trying to accomplish?
   – NO: Uh oh
   – YES: go to 2
2. Can you do it with simple / deterministic analysis?
   – YES: Do that
   – NO: Let’s try ML…?

Predictive models and outliers
Surprise: Your prediction is wrong!

Outlier detection via predictive modeling
“It’s tough to make predictions, especially about the future”
KEY ASSUMPTIONS
1. In “steady-state”, data exhibit some regularity / predictability
2. Learn a model of this behavior
3. Major deviations from our expectation represent new underlying behavior or totally novel “exogenous shock”
4. These surprises are valuable to discover
KEY Qs
1. Is behavior actually regular?
2. How to model behavior?
3. How major is “major”?
4. Are surprises actually valuable?
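The recipe above can be sketched with about the simplest possible model: predict each point as the mean of a trailing window and flag it when the error exceeds a multiple of the window's standard deviation. The model choice, the `find_surprises` name, the window size, the threshold, and the data are all assumptions for illustration, not the method used in the talk.

```python
from statistics import mean, stdev


def find_surprises(values, window=5, threshold=3.0):
    """Return indices where the value is far from a trailing-mean prediction."""
    surprises = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        predicted = mean(history)   # the "model": recent average
        spread = stdev(history)     # scale for deciding how major is "major"
        if spread > 0 and abs(values[i] - predicted) > threshold * spread:
            surprises.append(i)
    return surprises


# Hypothetical latency-like series with one spike at index 7.
values = [50, 52, 49, 51, 50, 51, 49, 120, 50, 51]
surprise_indices = find_surprises(values)
```

Each key question shows up as a knob here: `window` encodes what counts as regular, the trailing mean is the behavior model, and `threshold` decides how major a deviation has to be. After the spike enters the window it inflates the spread, so nearby points are harder to flag, which is one of the pitfalls of this simple approach.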
