Automating Operations with Machine Learning



  1. Automating Operations with Machine Learning. Matt Callanan, Senior Software Development Engineer, Expedia Group; Willie Wheeler, Principal Applications Engineer, Expedia Group.

  2. Too many alerts. Too many false alerts. Alerts not actionable. Diagnosis takes too long. Remediation takes too long.

  3. Can automation and machine learning help solve these problems?

  4. Overview: Automating Operations; Machine Learning Overview; Automating Ops with ML.

  5. Automating Operations

  6. Operations loop (diagram): System healthy → System unhealthy → Detect → Diagnose → Remediate → System healthy.

  7. Diagram labels: Firewalls, Cloud Services, Feature Switch, Queues, Network, Databases, Config Change, Infrastructure; Develop → Build → Deploy → Monitor → Rollback.

  8. Diagram labels: Firewalls, Cloud Services, Feature Switch, Queues, Network, Databases, Config Change, Infrastructure; Develop → Build → Deploy → Monitor → Rollback; System healthy → System unhealthy → Detect → Diagnose → Remediate.

  9. How does it work?

  10. Diagram labels: Apps, Metrics, Logging, Tracing, Anomaly Detection, Alert Manager, Diagnostics, Remediation, Treatment, Deployment, Bot.

  11. Diagram labels: Apps, Metrics, Logging, Tracing, Anomaly Detection, Alert Manager, Diagnostics, Remediation, Treatment, Deployment, Bot.

  12. Machine Learning Overview

  13. Machine learning systems perform tasks by learning from data, instead of requiring explicit programming.

  14. Traditional programming: Inputs + Program → Computer → Outputs. Machine learning: Inputs + Outputs → Computer → Program.
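
A minimal sketch of the contrast on this slide, assuming a hypothetical latency example that is not from the talk. In the first function a person writes the rule. In the second, the rule (a cut-off) is derived from example inputs and their known outputs. The 200 ms value, the function names, and the sample data are all illustrative assumptions.

```python
# Traditional programming: a person writes the program (the rule).
def is_slow_handwritten(latency_ms: float) -> bool:
    return latency_ms > 200.0  # hypothetical, hand-picked threshold


# Machine learning: the computer derives the program from inputs and outputs.
def learn_threshold(latencies, labels):
    """Return the candidate cut-off that misclassifies the fewest examples."""
    def errors(threshold):
        return sum((x > threshold) != y for x, y in zip(latencies, labels))

    return min(sorted(set(latencies)), key=errors)


# Hypothetical labelled examples: inputs (latencies) and outputs (slow or not).
example_latencies = [90.0, 120.0, 150.0, 310.0, 400.0, 520.0]
example_is_slow = [False, False, False, True, True, True]

learned_threshold = learn_threshold(example_latencies, example_is_slow)


def is_slow_learned(latency_ms: float) -> bool:
    return latency_ms > learned_threshold
```

With these examples the learned cut-off is the largest latency labelled as not slow, so new data can change the rule without anyone editing the code.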

  15. There are many approaches to ML.

  16. Convolutional neural network: image → “Dog” (photo: https://www.flickr.com/photos/a_peach/8631368705). Recurrent neural network: “Hay dos estudiantes en la cocina.” → “There are two students in the kitchen.”

  17. Diagram labels: Data Ingestion, Pre-Processing, Data Analysis, Training Data, Model Training, Model Repo, Executable Model, ML-enabled Application.

  18. Automating Ops with ML

  19. (1) Anomalies and Anomaly Detection; (2) ML Ops; (3) Our Approach; (4) Situation Diagnostics.

  20. Anomalies and Anomaly Detection

  21. An anomaly is an unusual data point.

  22. Anomalies signal a change. We want to detect them.

  23. Time Series Anomaly Detection with Adaptive Alerting

  24. Adaptive Alerting (AA) minimises Mean Time To Detect (MTTD) by performing anomaly detection on streaming time series data. AA supports classical and ML-based time series anomaly detection algorithms.

  25. Constant threshold, STL regression, recurrent neural network, Holt-Winters, custom regression, AWS random cut forest.

  26. Diagram: bands on either side of the prediction classify the actual value as Normal, Weak Anomaly, or Strong Anomaly (axis label: T-1).
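
The banding on this slide can be sketched as a comparison between the actual value and the prediction. This is an illustrative sketch, not Adaptive Alerting's implementation: it assumes symmetric bands, and the names `classify`, `weak_width`, and `strong_width` are hypothetical.

```python
from enum import Enum


class AnomalyLevel(Enum):
    NORMAL = "NORMAL"
    WEAK = "WEAK"
    STRONG = "STRONG"


def classify(actual: float, prediction: float,
             weak_width: float, strong_width: float) -> AnomalyLevel:
    """Classify `actual` by how far it falls from `prediction`.

    Assumes symmetric bands: within `weak_width` is normal, beyond
    `strong_width` is a strong anomaly, anything in between is weak.
    """
    if not 0 <= weak_width <= strong_width:
        raise ValueError("expected 0 <= weak_width <= strong_width")
    deviation = abs(actual - prediction)
    if deviation > strong_width:
        return AnomalyLevel.STRONG
    if deviation > weak_width:
        return AnomalyLevel.WEAK
    return AnomalyLevel.NORMAL
```

A detector that needs different sensitivity above and below the prediction would take separate upper and lower widths instead.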

  27. metrics (Kafka) → Anomaly detectors (Kafka streams) → anomalies (Kafka). Input: { "metric": "details-tp99", "timestamp": 1549932163, "value": 150 }. Output: { "metric": "details-tp99", "timestamp": 1549932163, "value": 150, "anomaly": "WEAK" }.
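
The message transformation on this slide can be sketched without Kafka as a function from one JSON string to another. This is a sketch under assumptions: it uses a one-sided constant-threshold detector, the threshold values 120 and 200 are hypothetical, and the "NORMAL" and "STRONG" labels are assumed by analogy with the "WEAK" label shown on the slide.

```python
import json


def constant_threshold_level(value: float, weak_upper: float,
                             strong_upper: float) -> str:
    """One-sided constant-threshold check (labels other than WEAK are assumed)."""
    if value > strong_upper:
        return "STRONG"
    if value > weak_upper:
        return "WEAK"
    return "NORMAL"


def enrich(metric_json: str, weak_upper: float, strong_upper: float) -> str:
    """Parse a metric message, attach the anomaly level, and re-serialise it."""
    point = json.loads(metric_json)
    point["anomaly"] = constant_threshold_level(
        point["value"], weak_upper, strong_upper)
    return json.dumps(point)


# The input message from the slide; the thresholds are hypothetical.
incoming = '{"metric": "details-tp99", "timestamp": 1549932163, "value": 150}'
outgoing = enrich(incoming, weak_upper=120, strong_upper=200)
```

In a streaming deployment the same function would sit between the consumer of the metrics topic and the producer of the anomalies topic.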

  28. metrics (Kafka) → Anomaly detectors (Kafka streams) → anomalies (Kafka). Training data (S3) → ML training → ML detectors.

  29. ML Ops

  30. • Data Scientists and Engineers tend to operate in siloes • Models can take many months to deploy • Training and retraining consistency issues

  31. Data Scientists: • Experiment Tracking for Training and HPO. Engineers: • Scalability, Robustness, Repeatability • Reliable Deployment of Models

  32. Kubeflow / Kubeflow Pipelines on Kubernetes (diagram): Data Store (e.g. S3); pipeline steps ingest, preprocess, analyze, train, deploy (each a Container Op); Non-Kubernetes Runtime (Web Service, Streaming).

  33. Adaptive Alerting ML Training: Repositories and Docker Image Stack (diagram). Repositories: aa-ct-trainer (/aa-ct-trainer: Python code, “*_starter.py” scripts; /pipeline: Python code, “create_pipeline.py”; Dockerfile), aa-ml-training-core (Python library), aa-ml-training-pipelines (Python library), aa-ml-training-docker (Dockerfiles for aa-ml-training-pipelines-base, aa-ml-training-app-base, aa-ml-training-miniconda3). Docker images: “aa-ct-trainer” (/aa-ct-trainer → /app; ingest_starter.py, preprocess_starter.py, analyze_starter.py, train_starter.py, deploy_starter.py), aa-ml-training-pipelines-base, aa-ml-training-app-base, aa-ml-training-miniconda3, Alpine Linux. Python libraries shown in the images: “aa-ml-training-core”, “aa-ml-training-pipelines”, Kubeflow Pipelines SDK, pandas, scipy, matplotlib, boto3, etc.

  34. Our Approach

  35. Anomaly Detection: • Building high-volume system • Several thousand auto-generated CT detectors for 5xx • Fine-tuned ML models • Exploring additional Deep Learning and LSTM models. Diagnostics: • Automated runbook diagnostics • Exploring ML for situation diagnostics. Auto-Remediation: • Automated rollback hints • Piloting full auto-rollback remediation

  36. • Ingest at least 4 weeks of data • Remove outliers (e.g. Hampel filtering) • Interpolate missing data • Analyse data set (EDA) • Visualise with scatter plots, ACF • Perform tests, e.g. adfuller (Augmented Dickey-Fuller stationarity test) • Record properties of data set (e.g. stationary vs single/dual seasonality)
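
The outlier-removal and interpolation steps above can be sketched with the standard library alone. This is a simplified illustration, not the talk's pipeline code: the function names are hypothetical, and the window size and the 3-sigma cut-off are assumed defaults. A Hampel filter flags a point when it lies more than a set number of scaled median absolute deviations (MAD) from the median of its surrounding window.

```python
from statistics import median

MAD_TO_SIGMA = 1.4826  # scales MAD to a standard-deviation estimate (Gaussian data)


def hampel_flags(values, half_window=3, n_sigmas=3.0):
    """Return one boolean per point, True where the point is an outlier."""
    flags = []
    for i, x in enumerate(values):
        lo, hi = max(0, i - half_window), min(len(values), i + half_window + 1)
        window = [v for v in values[lo:hi] if v is not None]
        if x is None or len(window) < 3:
            flags.append(False)  # missing points and tiny windows are not judged
            continue
        med = median(window)
        mad = median(abs(v - med) for v in window)
        flags.append(abs(x - med) > n_sigmas * MAD_TO_SIGMA * mad)
    return flags


def interpolate_missing(values):
    """Fill None entries linearly; gaps at the edges copy the nearest value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no known values to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def clean(values, half_window=3, n_sigmas=3.0):
    """Drop Hampel outliers, then interpolate every resulting gap."""
    flags = hampel_flags(values, half_window, n_sigmas)
    kept = [None if flagged else v for v, flagged in zip(values, flags)]
    return interpolate_missing(kept)


# Hypothetical series with one spike (500) and one missing reading (None).
raw = [10, 11, 10, 12, 500, 11, None, 10, 12, 11]
cleaned = clean(raw)
```

Treating outliers as missing values means a single interpolation pass repairs both kinds of gap.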

  37. • Based on profile properties, select algorithms to explore • For each algorithm: explore hyperparameter space (HPO); for each HP combination, find the best fit; select model with best set of HPs for algorithm • Compare scores of each algorithm • Select the best model from the best algorithm • Ensemble detectors could serve multiple models
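
The selection loop above can be sketched as a grid search nested inside a comparison across algorithms. This is a toy illustration under assumptions: the two forecasters, the hyperparameter grids, and the mean-absolute-error score are hypothetical stand-ins, and these simple forecasters have no separate fitting step.

```python
from itertools import product


def moving_average(history, window):
    """Forecast the next point as the mean of the last `window` points."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def exponential_smoothing(history, alpha):
    """Forecast the next point as the exponentially smoothed level."""
    level = history[0]
    for x in history[1:]:
        level = alpha * x + (1 - alpha) * level
    return level


def mean_absolute_error(predict, series, warmup, **hyperparams):
    """Score one-step-ahead forecasts made from index `warmup` onwards."""
    errors = [abs(series[i] - predict(series[:i], **hyperparams))
              for i in range(warmup, len(series))]
    return sum(errors) / len(errors)


# Hypothetical search space: algorithm -> {hyperparameter: candidate values}.
SEARCH_SPACE = {
    moving_average: {"window": [2, 4, 8]},
    exponential_smoothing: {"alpha": [0.2, 0.5, 0.8]},
}


def select_model(series, search_space, warmup=8):
    """Return (best algorithm name, {algorithm name: (score, best HPs)})."""
    best_per_algorithm = {}
    for algorithm, grid in search_space.items():
        names = list(grid)
        scored = []
        for combo in product(*(grid[name] for name in names)):
            hp = dict(zip(names, combo))
            scored.append((mean_absolute_error(algorithm, series, warmup, **hp), hp))
        best_per_algorithm[algorithm.__name__] = min(scored, key=lambda item: item[0])
    winner = min(best_per_algorithm, key=lambda name: best_per_algorithm[name][0])
    return winner, best_per_algorithm


# Hypothetical steadily rising series.
trend = [float(t) for t in range(20)]
winner, best = select_model(trend, SEARCH_SPACE)
```

On this rising series the smoother with the largest alpha wins because it lags the trend least. A real setup would score on held-out data rather than the series used for tuning.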

  38. • When detector is created, a schedule is created, e.g. daily/weekly • Scheduler initiates Data Prep, Training, and Deployment tasks

  39. • Good for fast anomaly detection • Detect mean shift over period of time • Less volatile anomaly classification

  40. • Treat first 3 weeks as “training data” • Treat final 7 days as “test data”
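
The split on this slide is chronological rather than random, which can be sketched as below. The function name and the hourly sample data are hypothetical; the 7-day test window comes from the slide.

```python
from datetime import datetime, timedelta


def split_train_test(points, test_days=7):
    """Split time-ordered (timestamp, value) pairs chronologically.

    Everything in the final `test_days` days becomes test data;
    everything earlier becomes training data.
    """
    if not points:
        raise ValueError("points must not be empty")
    cutoff = points[-1][0] - timedelta(days=test_days)
    train = [p for p in points if p[0] <= cutoff]
    test = [p for p in points if p[0] > cutoff]
    return train, test


# Hypothetical data: 4 weeks of hourly readings.
start = datetime(2019, 1, 1)
hourly = [(start + timedelta(hours=h), float(h % 24)) for h in range(28 * 24)]
train, test = split_train_test(hourly)
```

Keeping the test window strictly after the training window avoids scoring a model on data older than what it was trained on.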

  41. Situation Diagnostics

  42. Time series anomaly detection generates a large volume of individually unactionable anomalies. We need a small number of actionable anomalies.

  43. ExpediaDotCom/haystack: Haystack is Expedia Group’s OSS distributed tracing product. Key features: distributed tracing, dynamic service graph, state snapshotting, trace telemetry, anomaly detection on telemetry.

  44. ExpediaDotCom/haystack Haystack service graph

  45. ExpediaDotCom/haystack Haystack data creates new automation opportunities: • Incident classification • Incident diagnostics • Incident prediction Can machine learning help?

  46. A graph network works with graph-structured inputs and outputs. Similar to convolutional networks, but can learn graph topology (not just grid).
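
Graph networks are built around message passing, where each node updates its state from its neighbours along the edges. The sketch below shows one such step on a small call graph. It is a heavy simplification: a real graph network learns its update functions, while this uses fixed averaging. The service names, the error-rate feature, and `self_weight` are hypothetical.

```python
def message_passing_step(features, edges, self_weight=0.5):
    """Run one round of message passing over a call graph.

    features: {service: float}; edges: list of (caller, callee).
    Each service mixes its own feature with the mean feature of the
    services it calls. Services with no callees keep their value.
    """
    callees = {node: [] for node in features}
    for caller, callee in edges:
        callees[caller].append(callee)
    updated = {}
    for node, own in features.items():
        incoming = [features[n] for n in callees[node]]
        if incoming:
            aggregate = sum(incoming) / len(incoming)
            updated[node] = self_weight * own + (1 - self_weight) * aggregate
        else:
            updated[node] = own
    return updated


# Hypothetical call graph: web -> booking -> {geo, pricing}.
edges = [("web", "booking"), ("booking", "geo"), ("booking", "pricing")]
error_rate = {"web": 0.0, "booking": 0.0, "geo": 0.8, "pricing": 0.0}

after_one = message_passing_step(error_rate, edges)
after_two = message_passing_step(after_one, edges)
```

After one step the failure in `geo` shows up in `booking`, and after two it reaches `web`. This is how graph structure lets a signal in one service inform its callers.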

  47. Use cases. Incident classification: graph net + classifier → “Hotel bookings drop”. Incident diagnostics: graph net → “Geo Service: bad deployment”. Incident prediction: graph net → “ALERT: Hotel bookings drop in 8m”.

  48. Summary

  49. Close Your Loops: open loops are costly and error prone. Automate to Reduce MTTK, MTTD, MTTR: outage detection, diagnosis, and remediation can be automated to create closed loop systems. Machine Learning: ML helps automate detection and diagnosis by converting historical observations into predictions and classifications. MLOps: break down the siloes between Data Science and Engineering. Graph Networks: what story is your call graph telling you?

  50. Ops automation: Closing Loops and Opening Minds (AWS re:Invent), https://www.youtube.com/watch?v=O8xLxNje30M. Kubeflow: http://kubeflow.org. Forecasting & anomaly detection: Forecasting: Principles and Practice, https://otexts.com/fpp2/; Outlier Detection: A Survey, https://web.cs.hacettepe.edu.tr/~aykut/classes/spring2013/bil682/supplemental/Outlier_Detection_A_Survey.pdf

  51. Machine learning: Awesome Public Datasets, https://github.com/awesomedata/awesome-public-datasets; Awesome TensorFlow, https://github.com/jtoy/awesome-tensorflow. Graph networks: Relational inductive biases, deep learning, and graph networks, https://arxiv.org/abs/1806.01261; A comprehensive survey on graph neural networks, https://arxiv.org/abs/1901.00596; Graph Nets, https://github.com/deepmind/graph_nets

  52. Thanks! Matt Callanan: @mcallana. Willie Wheeler: @williewheeler. ExpediaDotCom/adaptive-alerting, ExpediaDotCom/haystack. Hiring: www.lifeatexpedia.com
