Automating Operations with Machine Learning
Matt Callanan, Senior Software Development Engineer, Expedia Group
Willie Wheeler, Principal Applications Engineer, Expedia Group
Too many alerts Too many false alerts Alerts not actionable Diagnosis takes too long Remediation takes too long
System healthy System unhealthy Detect Diagnose Remediate
Develop Build Deploy Monitor Rollback
Infrastructure Queues Services Network Databases Cloud Firewalls
Feature Switch Config Change
Alert Manager Deployment Bot Treatment
Remediation
Diagnostics Anomaly Detection Logging Tracing
Apps
Metrics
Inputs + Program → Computer → Outputs
Inputs + Outputs → Computer → Program
Convolutional neural network
Dog
https://www.flickr.com/photos/a_peach/8631368705
Recurrent neural network
“Hay dos estudiantes en la cocina.” “There are two students in the kitchen.”
Model Training Executable Model ML-enabled Application Training Data Model Repo
Data Ingestion Pre-Processing Data Analysis
Constant threshold STL regression Recurrent neural network Holt-Winters AWS random cut forest Custom regression
Normal Weak Anomaly Strong Anomaly Strong Anomaly Weak Anomaly
prediction actual T-1
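The bands above (normal, weak anomaly, strong anomaly on either side of a prediction) can be sketched as a small classifier. This is a minimal illustration under stated assumptions, not the Adaptive Alerting implementation: the names `AnomalyLevel` and `classify` and the band-width parameters are hypothetical.

```python
from enum import Enum


class AnomalyLevel(Enum):
    NORMAL = "NORMAL"
    WEAK = "WEAK"
    STRONG = "STRONG"


def classify(actual: float, prediction: float,
             weak_width: float, strong_width: float) -> AnomalyLevel:
    """Classify an observation by its distance from the predicted value.

    weak_width and strong_width are hypothetical band half-widths around
    the prediction; the bands are symmetric above and below it.
    """
    if weak_width < 0 or strong_width < weak_width:
        raise ValueError("expected 0 <= weak_width <= strong_width")
    deviation = abs(actual - prediction)
    if deviation > strong_width:
        return AnomalyLevel.STRONG
    if deviation > weak_width:
        return AnomalyLevel.WEAK
    return AnomalyLevel.NORMAL
```

Any of the listed model types (constant threshold, Holt-Winters, a recurrent network, and so on) could supply the prediction; only the comparison step is shown here.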
anomalies (Kafka) metrics (Kafka)
Anomaly detectors (Kafka streams)
Metric in: { "metric": "details-tp99", "timestamp": 1549932163, "value": 150 }
Anomaly out: { "metric": "details-tp99", "timestamp": 1549932163, "value": 150, "anomaly": "WEAK" }
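The message shapes above suggest a detector that reads a metric record and emits the same record with an added "anomaly" field. A rough sketch of that enrichment step, using a constant-threshold detector, might look like the following. The function name `detect` and both threshold values are hypothetical, the "NORMAL" and "STRONG" labels are assumed by analogy with "WEAK", and the Kafka consumer and producer wiring is omitted.

```python
import json


def detect(message: str, weak_threshold: float, strong_threshold: float) -> str:
    """Return the metric message with an added "anomaly" field.

    Thresholds are hypothetical upper limits for a constant-threshold
    detector: values at or above a limit are flagged at that level.
    """
    metric = json.loads(message)
    value = metric["value"]
    if value >= strong_threshold:
        level = "STRONG"
    elif value >= weak_threshold:
        level = "WEAK"
    else:
        level = "NORMAL"
    # Copy the original fields unchanged and append the classification.
    return json.dumps({**metric, "anomaly": level})


incoming = '{"metric": "details-tp99", "timestamp": 1549932163, "value": 150}'
outgoing = detect(incoming, weak_threshold=100, strong_threshold=200)
```

With these assumed thresholds, the value 150 falls between the weak and strong limits, so the outgoing record carries "anomaly": "WEAK".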
anomalies (Kafka)
Anomaly detectors (Kafka streams)
metrics (Kafka)
ML detectors Training data (S3) ML training
Kubernetes ingest
Data Store (e.g. S3)
Non-Kubernetes Runtime preprocess analyze train deploy Kubeflow / Kubeflow Pipelines
Container Op Container Op Container Op Container Op Container Op
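The chain of stages above (ingest, preprocess, analyze, train, deploy) can be illustrated with plain Python functions run in order. This sketch deliberately does not use the Kubeflow Pipelines SDK; each function stands in for one containerized step, and all names, sample values, and the toy "model" are hypothetical.

```python
from typing import Callable, Dict, List

Step = Callable[[Dict], Dict]


def ingest(ctx: Dict) -> Dict:
    # Hypothetical raw samples; a real step would read from a data store such as S3.
    return {**ctx, "raw": [150.0, None, 152.0, 400.0, 149.0]}


def preprocess(ctx: Dict) -> Dict:
    # Drop missing samples.
    return {**ctx, "clean": [v for v in ctx["raw"] if v is not None]}


def analyze(ctx: Dict) -> Dict:
    # Summarize the cleaned data.
    clean = ctx["clean"]
    return {**ctx, "mean": sum(clean) / len(clean)}


def train(ctx: Dict) -> Dict:
    # Toy "model": a constant threshold at twice the observed mean.
    model = {"type": "constant-threshold", "threshold": 2 * ctx["mean"]}
    return {**ctx, "model": model}


def deploy(ctx: Dict) -> Dict:
    # Stand-in for publishing the trained model.
    return {**ctx, "deployed": True}


def run_pipeline(steps: List[Step]) -> Dict:
    """Run each step in order, passing the accumulated context along."""
    ctx: Dict = {"steps_run": []}
    for step in steps:
        ctx = step(ctx)
        ctx["steps_run"].append(step.__name__)
    return ctx


result = run_pipeline([ingest, preprocess, analyze, train, deploy])
```

In a containerized pipeline each step would run in its own image and exchange data through storage instead of an in-memory dictionary; the ordering and hand-off are the only parts this sketch tries to show.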
Web Service, Streaming
“aa-ct-trainer” aa-ml-training-core Python Library aa-ml-training-pipelines Python Library aa-ml-training-docker /aa-ct-trainer Python code “*_starter.py” scripts /pipeline Python code “create_pipeline.py” aa-ct-trainer Dockerfile /aa-ct-trainer → /app
aa-ml-training-app-base Dockerfile aa-ml-training-miniconda3 Dockerfile Kubeflow Pipelines SDK Python Library pandas, scipy, matplotlib, boto3, etc. Python Libraries pandas, scipy, matplotlib, boto3, etc. Python Libraries pandas, scipy, matplotlib, boto3, etc. Python Libraries
aa-ct-trainer Docker Image
aa-ml-training-app-base Docker Image aa-ml-training-miniconda3 Docker Image
Alpine Linux
Adaptive Alerting ML Training Repositories
aa-ml-training-pipelines-base Docker Image
“aa-ml-training-core” Library “aa-ml-training-pipelines” Library
aa-ml-training-pipelines-base Dockerfile
Docker Image Stack
vs single/dual seasonality)
explore
created, e.g. daily/weekly
Deployment tasks
Distributed tracing Trace telemetry Dynamic service graph Anomaly detection on telemetry State snapshotting
ExpediaDotCom/haystack
Graph net + classifier
Hotel bookings drop
Graph net
Geo Service: bad deployment
Graph net
ALERT: Hotel bookings drop in 8m
Close Your Loops
Automate: to create closed-loop systems; to reduce MTTK, MTTD, MTTR
historical observations into predictions and classifications Machine Learning
MLOps
Graph Networks
Closing Loops and Opening Minds (AWS re:Invent): https://www.youtube.com/watch?v=O8xLxNje30M
Kubeflow: http://kubeflow.org
Forecasting: Principles and Practice: https://otexts.com/fpp2/
Outlier Detection: A Survey: https://web.cs.hacettepe.edu.tr/~aykut/classes/spring2013/bil682/supplemental/Outlier_Detection_A_Survey.pdf
Relational inductive biases, deep learning, and graph networks: https://arxiv.org/abs/1806.01261
A comprehensive survey on graph neural networks: https://arxiv.org/abs/1901.00596
Graph Nets: https://github.com/deepmind/graph_nets
Awesome Public Datasets: https://github.com/awesomedata/awesome-public-datasets
Awesome TensorFlow: https://github.com/jtoy/awesome-tensorflow
Matt Callanan: @mcallana
Willie Wheeler: @williewheeler
ExpediaDotCom/adaptive-alerting
ExpediaDotCom/haystack
Hiring: www.lifeatexpedia.com