Automating Operations with Machine Learning



slide-1
SLIDE 1

Matt Callanan Senior Software Development Engineer Expedia Group

Automating Operations with Machine Learning

Willie Wheeler Principal Applications Engineer Expedia Group

slide-2
SLIDE 2
slide-3
SLIDE 3

  • Too many alerts
  • Too many false alerts
  • Alerts not actionable
  • Diagnosis takes too long
  • Remediation takes too long

slide-4
SLIDE 4

Can automation and machine learning help solve these problems?

slide-5
SLIDE 5

Overview

  • Automating Operations
  • Machine Learning Overview
  • Automating Ops with ML

slide-6
SLIDE 6

Automating Operations

slide-7
SLIDE 7

System healthy System unhealthy Detect Diagnose Remediate

slide-8
SLIDE 8

Develop Build Deploy Monitor Rollback

Infrastructure Queues Services Network Databases Cloud Firewalls

Feature Switch Config Change

slide-9
SLIDE 9

Develop Build Deploy Monitor Rollback

Infrastructure Queues Services Network Databases Cloud Firewalls

Feature Switch Config Change

System healthy System unhealthy Detect Diagnose Remediate

slide-10
SLIDE 10

How does it work?

slide-11
SLIDE 11

Alert Manager Deployment Bot Treatment

Remediation

Diagnostics Anomaly Detection Logging Tracing

Apps

Metrics

slide-12
SLIDE 12


slide-13
SLIDE 13
slide-14
SLIDE 14
slide-15
SLIDE 15
slide-16
SLIDE 16
slide-17
SLIDE 17
slide-18
SLIDE 18
slide-19
SLIDE 19
slide-20
SLIDE 20
slide-21
SLIDE 21
slide-22
SLIDE 22
slide-23
SLIDE 23

Machine Learning Overview

slide-24
SLIDE 24

Machine learning systems perform tasks by learning from data, instead of requiring explicit programming.
slide-25
SLIDE 25

Traditional programming

Inputs + Program → Computer → Outputs

Machine Learning

Inputs + Outputs → Computer → Program
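The contrast above can be sketched in a few lines of Python. This is only an illustration, not something from the talk: the temperature-conversion task, the function names, and the use of least squares as the "learning" step are all assumptions chosen to keep the example small.

```python
def celsius_to_fahrenheit(c):
    """Traditional programming: a person writes the rule explicitly."""
    return c * 9 / 5 + 32


def fit_line(xs, ys):
    """Machine learning in its simplest form: recover the rule
    (slope and intercept) from example inputs and outputs
    using ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Inputs and outputs go in; the "program" (slope, intercept) comes out.
inputs = [-10.0, 0.0, 15.0, 30.0]
outputs = [celsius_to_fahrenheit(c) for c in inputs]
slope, intercept = fit_line(inputs, outputs)


def learned(c):
    """The learned program, built from the fitted parameters."""
    return slope * c + intercept
```

Here the fitted parameters play the role of the "Program" box in the machine learning row: nobody typed in the conversion rule for `learned`, it was estimated from the examples.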

slide-26
SLIDE 26

There are many approaches to ML.

slide-27
SLIDE 27

Convolutional neural network

Image (https://www.flickr.com/photos/a_peach/8631368705) → “Dog”

Recurrent neural network

“Hay dos estudiantes en la cocina.” → “There are two students in the kitchen.”

slide-28
SLIDE 28

Model Training, Executable Model, ML-enabled Application, Training Data, Model Repo

Data Ingestion, Pre-Processing, Data Analysis

slide-29
SLIDE 29
slide-30
SLIDE 30

Automating Ops with ML

slide-31
SLIDE 31

1. Anomalies and Anomaly Detection
2. ML Ops
3. Our Approach
4. Situation Diagnostics
slide-32
SLIDE 32

Anomalies and Anomaly Detection

slide-33
SLIDE 33

An anomaly is an unusual data point.

slide-34
SLIDE 34

Anomalies signal a change. We want to detect them.

slide-35
SLIDE 35

Time Series Anomaly Detection with Adaptive Alerting

slide-36
SLIDE 36
slide-37
SLIDE 37

Adaptive Alerting (AA) minimises Mean Time To Detect (MTTD) by performing anomaly detection on streaming time series data. AA supports classical and ML-based time series anomaly detection algorithms.

slide-38
SLIDE 38

  • Constant threshold
  • STL regression
  • Recurrent neural network
  • Holt-Winters
  • AWS random cut forest
  • Custom regression

slide-39
SLIDE 39

Figure: prediction versus actual value (axis label: T-1), with bands on either side of the prediction labelled Normal, Weak Anomaly, and Strong Anomaly.
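A minimal sketch of the banding idea, assuming symmetric bands around the prediction. The function name, the band widths, and the "NORMAL" and "STRONG" labels are illustrative assumptions; only the "WEAK" label appears in the slides, and this is not the Adaptive Alerting implementation.

```python
def classify(actual, prediction, weak_width, strong_width):
    """Classify an observation by how far it falls from the prediction.

    weak_width and strong_width are the half-widths of the weak and
    strong bands around the prediction (hypothetical parameters).
    """
    deviation = abs(actual - prediction)
    if deviation > strong_width:
        return "STRONG"
    if deviation > weak_width:
        return "WEAK"
    return "NORMAL"


# With a prediction of 100, a weak band of +/-10 and a strong band of +/-20:
labels = [classify(v, 100, weak_width=10, strong_width=20)
          for v in (104, 115, 70)]
```

Values inside the weak band are normal, values between the two bands are weak anomalies, and values outside the strong band are strong anomalies, on either side of the prediction.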

slide-40
SLIDE 40

metrics (Kafka) → Anomaly detectors (Kafka streams) → anomalies (Kafka)

Metric in: { "metric": "details-tp99", "timestamp": 1549932163, "value": 150 }

Anomaly out: { "metric": "details-tp99", "timestamp": 1549932163, "value": 150, "anomaly": "WEAK" }
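The message transformation can be sketched without Kafka by treating each message as a JSON string. The constant-threshold values, the `detect` function, and the "NORMAL" and "STRONG" labels are assumptions for illustration; the message fields and the "WEAK" label come from the slide.

```python
import json

# Hypothetical thresholds for the details-tp99 metric.
WEAK_THRESHOLD = 120
STRONG_THRESHOLD = 200


def detect(message: str) -> str:
    """Read one metric message and return it with an "anomaly" field added."""
    metric = json.loads(message)
    value = metric["value"]
    if value >= STRONG_THRESHOLD:
        level = "STRONG"
    elif value >= WEAK_THRESHOLD:
        level = "WEAK"
    else:
        level = "NORMAL"
    return json.dumps({**metric, "anomaly": level})


incoming = '{ "metric": "details-tp99", "timestamp": 1549932163, "value": 150 }'
outgoing = detect(incoming)
```

In a streaming deployment the same function would sit between a consumer on the metrics topic and a producer on the anomalies topic; that wiring is omitted here.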

slide-41
SLIDE 41

anomalies (Kafka)

Anomaly detectors (Kafka streams)

metrics (Kafka)

ML detectors Training data (S3) ML training

slide-42
SLIDE 42

ML Ops

slide-43
SLIDE 43
  • Data Scientists and Engineers tend to operate in siloes
  • Models can take many months to deploy
  • Training and retraining consistency issues
slide-44
SLIDE 44
  • Data Scientists:
      • Experiment Tracking for Training and hyperparameter optimisation (HPO)
  • Engineers:
      • Scalability, Robustness, Repeatability
      • Reliable Deployment of Models
slide-45
SLIDE 45

Kubeflow / Kubeflow Pipelines on Kubernetes: ingest → preprocess → analyze → train → deploy (one Container Op per step)

Data Store (e.g. S3)

Non-Kubernetes Runtime: Web Service, Streaming

slide-46
SLIDE 46

Adaptive Alerting ML Training Repositories

  • “aa-ct-trainer”: /aa-ct-trainer Python code (“*_starter.py” scripts), /pipeline Python code (“create_pipeline.py”), aa-ct-trainer Dockerfile (/aa-ct-trainer → /app)
  • aa-ml-training-core: Python Library
  • aa-ml-training-pipelines: Python Library
  • aa-ml-training-docker

Dockerfiles: aa-ct-trainer, aa-ml-training-app-base, aa-ml-training-miniconda3, aa-ml-training-pipelines-base

“*_starter.py” scripts

  • ingest_starter.py
  • preprocess_starter.py
  • analyze_starter.py
  • train_starter.py
  • deploy_starter.py

Docker Image Stack

  • aa-ct-trainer Docker Image
  • aa-ml-training-pipelines-base Docker Image
  • aa-ml-training-app-base Docker Image
  • aa-ml-training-miniconda3 Docker Image
  • Alpine Linux

Python libraries shown in the images: “aa-ml-training-core” Library, “aa-ml-training-pipelines” Library, Kubeflow Pipelines SDK, and pandas, scipy, matplotlib, boto3, etc.

slide-47
SLIDE 47

Our Approach

slide-48
SLIDE 48
  • Anomaly Detection
      • Building a high-volume system
      • Several thousand auto-generated constant-threshold (CT) detectors for 5xx errors
      • Fine-tuned ML models
      • Exploring additional Deep Learning and LSTM models
  • Diagnostics
      • Automated runbook diagnostics
      • Exploring ML for situation diagnostics
  • Auto-Remediation
      • Automated rollback hints
      • Piloting full auto-rollback remediation
slide-49
SLIDE 49
  • Ingest at least 4 weeks of data
  • Remove outliers (e.g. Hampel filtering)
  • Interpolate missing data
  • Analyse data set (EDA)
  • Visualise with scatter plots, ACF
  • Perform tests, e.g. adfuller
  • Record properties of data set (e.g. stationary vs single/dual seasonality)
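The outlier-removal and interpolation steps can be sketched with the standard library alone. This is a minimal sketch, not the pipeline's actual code: the function names, the window size, the 3-sigma cut-off, and the choice to replace outliers with the local median are assumptions (1.4826 is the usual factor that scales the median absolute deviation to a standard deviation for normal data).

```python
from statistics import median


def hampel(values, half_window=3, n_sigmas=3.0):
    """Replace points that deviate too far from the local median.

    None marks a missing observation and is skipped.
    """
    cleaned = list(values)
    for i, value in enumerate(values):
        if value is None:
            continue
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = [v for v in values[lo:hi] if v is not None]
        med = median(window)
        mad = median([abs(v - med) for v in window])
        if abs(value - med) > n_sigmas * 1.4826 * mad:
            cleaned[i] = med
    return cleaned


def interpolate(values):
    """Fill None gaps linearly between the nearest known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for i, value in enumerate(values):
        if value is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None and right is None:
            continue  # nothing to interpolate from
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


raw = [10, 11, 10, 12, 100, 11, None, 12, 11]
prepared = interpolate(hampel(raw))
```

In this made-up series the spike of 100 is replaced by its local median and the missing point is filled from its two neighbours; the other points pass through unchanged.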

slide-50
SLIDE 50
  • Based on profile properties, select algorithms to explore
  • For each algorithm:
      • Explore hyperparameter space (HPO)
      • For each HP combination, find the best fit
      • Select model with best set of HPs for algorithm
  • Compare scores of each algorithm
  • Select the best model from the best algorithm
  • Ensemble detectors could serve multiple models
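The selection loop above can be sketched as a grid search over two toy forecasters. Everything named here is an assumption made for illustration: the two algorithms (moving average and exponential smoothing), their hyperparameter grids, and mean absolute error as the score. For brevity the sketch scores on the same series it forecasts; a real run would score on held-out data.

```python
from itertools import product


def moving_average(series, window):
    """One-step-ahead forecasts as (prediction, actual) pairs."""
    return [(sum(series[i - window:i]) / window, series[i])
            for i in range(window, len(series))]


def exp_smoothing(series, alpha):
    """Simple exponential smoothing, one-step-ahead (prediction, actual) pairs."""
    pairs, level = [], series[0]
    for actual in series[1:]:
        pairs.append((level, actual))
        level = alpha * actual + (1 - alpha) * level
    return pairs


def mae(pairs):
    """Mean absolute error; lower is better."""
    return sum(abs(p - a) for p, a in pairs) / len(pairs)


# Hypothetical hyperparameter grid per algorithm.
SEARCH_SPACE = {
    moving_average: {"window": [2, 3, 5]},
    exp_smoothing: {"alpha": [0.2, 0.5, 0.8]},
}


def best_for_algorithm(algorithm, grid, series):
    """Try every hyperparameter combination and keep the best-scoring one."""
    names = list(grid)
    candidates = []
    for combo in product(*(grid[name] for name in names)):
        hps = dict(zip(names, combo))
        candidates.append((mae(algorithm(series, **hps)), hps))
    return min(candidates, key=lambda c: c[0])


def select_model(series):
    """Compare the best model of each algorithm and return the overall winner."""
    results = []
    for algorithm, grid in SEARCH_SPACE.items():
        score, hps = best_for_algorithm(algorithm, grid, series)
        results.append((score, algorithm.__name__, hps))
    return min(results, key=lambda r: r[0])


series = [10, 12, 11, 13, 12, 14, 13, 15]
score, name, hps = select_model(series)
```

`select_model` returns the score, algorithm name, and hyperparameters of the winner, which is the information a scheduler would need to retrain and redeploy the same configuration later.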
slide-51
SLIDE 51
  • When detector is created, a schedule is created, e.g. daily/weekly
  • Scheduler initiates Data Prep, Training, and Deployment tasks

slide-52
SLIDE 52
  • Good for fast anomaly detection
  • Detect mean shift over period of time
  • Less volatile anomaly classification
slide-53
SLIDE 53
  • Treat first 3 weeks as “training data”
  • Treat final 7 days as “test data”
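A minimal sketch of this chronological split, assuming the data is a time-ordered list of (timestamp, value) pairs. The function name, the hourly sampling, and the start date are hypothetical.

```python
from datetime import datetime, timedelta


def split_train_test(points, test_days=7):
    """Split time-ordered (timestamp, value) pairs so that the final
    test_days of data become the test set and everything earlier
    becomes the training set. No shuffling: order matters for time series."""
    cutoff = points[-1][0] - timedelta(days=test_days)
    train = [p for p in points if p[0] <= cutoff]
    test = [p for p in points if p[0] > cutoff]
    return train, test


# Four weeks of hourly observations (values are placeholders).
start = datetime(2019, 1, 1)
points = [(start + timedelta(hours=h), float(h % 24)) for h in range(28 * 24)]
train, test = split_train_test(points)
```

With four weeks of hourly data this yields three weeks (504 points) for training and seven days (168 points) for testing, and every training point precedes every test point.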
slide-54
SLIDE 54
slide-55
SLIDE 55

Situation Diagnostics

slide-56
SLIDE 56

Time series anomaly detection generates a large volume of individually unactionable anomalies. We need a small number of actionable anomalies.

slide-57
SLIDE 57
slide-58
SLIDE 58

Haystack is Expedia Group’s OSS distributed tracing product. Key features:

  • Distributed tracing
  • Trace telemetry
  • Dynamic service graph
  • Anomaly detection on telemetry
  • State snapshotting

ExpediaDotCom/haystack

slide-59
SLIDE 59

Haystack service graph

ExpediaDotCom/haystack

slide-60
SLIDE 60

Can machine learning help?

Haystack data creates new automation opportunities:

  • Incident classification
  • Incident diagnostics
  • Incident prediction

ExpediaDotCom/haystack

slide-61
SLIDE 61

A graph network works with graph-structured inputs and outputs. It is similar to a convolutional network, but it can learn over arbitrary graph topology, not just a grid.
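To make the graph-structured idea concrete, here is a sketch of one neighbourhood-aggregation (message-passing) round over a small call graph. It is not a trained graph network: the learned update functions are replaced by a fixed average, and the service names, edge list, and anomaly scores are invented for illustration.

```python
def aggregate_neighbours(features, edges):
    """One message-passing round on a call graph.

    features maps service name -> anomaly score.
    edges is a list of (caller, callee) pairs; each caller receives
    a message from every service it depends on.
    """
    incoming = {node: [] for node in features}
    for caller, callee in edges:
        incoming[caller].append(features[callee])

    updated = {}
    for node, own in features.items():
        messages = incoming[node]
        if messages:
            # Fixed 50/50 mix stands in for a learned update function.
            updated[node] = 0.5 * own + 0.5 * sum(messages) / len(messages)
        else:
            updated[node] = own
    return updated


# Hypothetical call graph: web calls booking, booking calls geo.
scores = {"web": 0.0, "booking": 0.0, "geo": 1.0}
calls = [("web", "booking"), ("booking", "geo")]

round_1 = aggregate_neighbours(scores, calls)
round_2 = aggregate_neighbours(round_1, calls)
```

After each round the anomaly in the leaf service reaches one more hop up the call graph, which is the kind of structural signal a graph network can learn to exploit where a grid-based convolution cannot.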

slide-62
SLIDE 62

Use cases

  • Incident classification: Graph net + classifier → “Hotel bookings drop”
  • Incident diagnostics: Graph net → “Geo Service: bad deployment”
  • Incident Prediction: Graph net → “ALERT: Hotel bookings drop in 8m”

slide-63
SLIDE 63
slide-64
SLIDE 64

Summary

slide-65
SLIDE 65
Close Your Loops

  • Open loops are costly and error prone

Automate to Reduce MTTK, MTTD, MTTR

  • Outage Detection, Diagnosis, and Remediation can be automated to create closed loop systems

Machine Learning

  • ML helps automate detection and diagnosis by converting historical observations into predictions and classifications

MLOps

  • Break down the siloes between Data Science and Engineering

Graph Networks

  • What story is your call graph telling you?

slide-66
SLIDE 66

Ops automation

  • Closing Loops and Opening Minds (AWS re:Invent) https://www.youtube.com/watch?v=O8xLxNje30M
  • Kubeflow http://kubeflow.org

Forecasting & anomaly detection

  • Forecasting: Principles and Practice https://otexts.com/fpp2/
  • Outlier Detection: A Survey https://web.cs.hacettepe.edu.tr/~aykut/classes/spring2013/bil682/supplemental/Outlier_Detection_A_Survey.pdf

slide-67
SLIDE 67

Graph networks

  • Relational inductive biases, deep learning, and graph networks https://arxiv.org/abs/1806.01261
  • A comprehensive survey on graph neural networks https://arxiv.org/abs/1901.00596
  • Graph Nets https://github.com/deepmind/graph_nets

Machine learning

  • Awesome Public Datasets https://github.com/awesomedata/awesome-public-datasets
  • Awesome TensorFlow https://github.com/jtoy/awesome-tensorflow

slide-68
SLIDE 68

Thanks!

Matt Callanan: @mcallana
Willie Wheeler: @williewheeler
ExpediaDotCom/adaptive-alerting
ExpediaDotCom/haystack

Hiring: www.lifeatexpedia.com