Griffon: Reasoning about Job Anomalies with Unlabeled Data in - PowerPoint PPT Presentation

Griffon: Reasoning about Job Anomalies with Unlabeled Data in Cloud-based Platforms
Liqun Shao, Yiwen Zhu, Siqi Liu*, Abhiram Eswaran, Kristin Lieber, Janhavi Mahajan, Minsoo Thigpen, Sudhir Darbha, Subru Krishnan, Soundar Srinivasan, Carlo Curino, Konstantinos Karanasos
Microsoft, *University of Pittsburgh

Microsoft’s Internal Big Data Analytics Platform: 500K jobs/day, 250K nodes

My job is SLOWER …

On-Call Support Engineer Workflow: 57 mins, 88 mins

• End-to-end, deployed and used
• Identifies job slowdown causes
• Drops the investigation time
• Consistent results validated by domain experts

Griffon: Before and After

Before Griffon
• A job goes out of service-level objectives (SLO) and the engineer is alerted.
• An engineer spends hours of manual labor looking through hundreds of metrics.
• After 2-3 days of investigation, the reason for job slowdown is found.

After Griffon
• A job goes out of SLO and the engineer is alerted.
• The Job ID and VC are fed through Griffon and the top reasons for job slowdown are generated automatically.
• The reason is found in the top five generated by Griffon.
• All the metrics Griffon has looked at can be ruled out and the engineer can direct their efforts to a smaller set of metrics.

Griffon • ML Methodology • System Architecture

Challenges
• Data wrangling
• Data collection: Identifying the right data
• Unlabeled data
• Model building: Small amount of validation data; tradeoff between accuracy and interpretability; cannot maintain models for each job template
• Deployment and Scalability
• Evaluation: Evaluation metrics for root causes of slow jobs

Identify Job Slowdown Reasons • Job Runtime Predictor • Feature Contributions

Job Runtime Prediction

Job Runtime Predictor (MARE)
Model                 LR      RF      GBT     DNN
Per-Template Model    0.186   0.116   0.124   0.146
Global Model          0.235   0.121   0.277   0.353
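The slide does not expand MARE; the sketch below assumes it stands for mean absolute relative error, i.e. the mean of |predicted - actual| / actual over jobs. The function name and the runtimes are hypothetical and only illustrate how such a score could be computed, not how the authors computed theirs.

```python
def mare(actual, predicted):
    """Mean absolute relative error (assumed meaning of MARE).

    Averages |predicted - actual| / actual over all jobs.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - a) / a for a, p in zip(actual, predicted))
    return total / len(actual)


# Hypothetical job runtimes in minutes, for illustration only.
actual_runtimes = [10.0, 20.0, 40.0]
predicted_runtimes = [12.0, 18.0, 40.0]

score = mare(actual_runtimes, predicted_runtimes)
print(f"MARE = {score:.3f}")
```

Under this reading, lower is better, so the table would favor the random forest (RF) in both the per-template and the global setting.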

Feature Contributions Reformulate decision tree models to linear models: Compare feature contributions to baseline predictions:
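The formulas for this slide did not survive extraction. As a rough illustration of what "reformulating a decision tree as a linear model" can mean, the sketch below uses the common path-based decomposition: the prediction equals the root node's mean (the bias) plus, for each split on the path to the leaf, the change in node mean, credited to the split feature. This is a general technique and an assumption about the slide's intent, not the authors' confirmed method. The `Node` class, `explain` function, and the tiny tree with its values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Node:
    value: float                    # mean target of training samples at this node
    feature: Optional[str] = None   # split feature; None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None   # followed when x[feature] <= threshold
    right: Optional["Node"] = None  # followed otherwise


def explain(root: Node, x: Dict[str, float]) -> Tuple[float, Dict[str, float], float]:
    """Return (bias, per-feature contributions, prediction) for one sample."""
    contributions: Dict[str, float] = {}
    node = root
    while node.feature is not None:
        child = node.left if x[node.feature] <= node.threshold else node.right
        # Credit the change in node mean to the feature used for this split.
        delta = child.value - node.value
        contributions[node.feature] = contributions.get(node.feature, 0.0) + delta
        node = child
    return root.value, contributions, node.value


# Hypothetical tree; node values are runtimes in minutes.
tree = Node(
    value=10.0, feature="InputSize", threshold=100.0,
    left=Node(value=8.0),
    right=Node(
        value=16.0, feature="JobPriority", threshold=1.0,
        left=Node(value=18.0),
        right=Node(value=12.0),
    ),
)

bias, contrib, prediction = explain(tree, {"InputSize": 500.0, "JobPriority": 3.0})
print(bias, contrib, prediction)
```

By construction the bias plus the sum of contributions equals the prediction, which is what makes the result read like a linear model. For an ensemble such as a random forest, the same decomposition can be averaged over the trees.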

Feature Contributions
  Intercept/Bias   10 m
  InputSize        +6 m
  JobPriority      -4 m
  BonusPnHours     -0 m
  Prediction       12 m

Slow Job
  Intercept        10 m
  InputSize        +6 m
  JobPriority      -4 m
  BonusPnHours     -0 m
  Prediction       12 m

Baseline Job
  Intercept        10 m
  InputSize        +3 m
  JobPriority      -2 m
  BonusPnHours     -4 m
  Prediction        7 m

Difference in contributions
  InputSize: 6 - 3 = 3
  JobPriority: 4 - 2 = 2
  BonusPnHours: 4 - 0 = 4
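The comparison in this example can be sketched in a few lines. The contribution values are taken from the slide; the function name `rank_reasons` and the choice to rank by the magnitude of the difference are assumptions for illustration. The slide lists magnitudes (3, 2, 4), while the code keeps the signed differences (slow job minus baseline job), which add up to the gap between the two predictions.

```python
# Contributions in minutes, copied from the slide's example.
INTERCEPT = 10.0
slow_job = {"InputSize": 6.0, "JobPriority": -4.0, "BonusPnHours": 0.0}
baseline_job = {"InputSize": 3.0, "JobPriority": -2.0, "BonusPnHours": -4.0}


def rank_reasons(slow, baseline):
    """Rank features by how much their contribution differs from the baseline.

    Assumption: candidate reasons are ordered by absolute difference.
    """
    deltas = {name: slow[name] - baseline[name] for name in slow}
    return sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)


slow_prediction = INTERCEPT + sum(slow_job.values())          # 12 minutes
baseline_prediction = INTERCEPT + sum(baseline_job.values())  # 7 minutes
ranked = rank_reasons(slow_job, baseline_job)

for name, delta in ranked:
    print(f"{name}: {delta:+.0f} m")
```

With these numbers BonusPnHours comes first (+4 m), then InputSize (+3 m), then JobPriority (-2 m), and the signed differences sum to the 5-minute gap between the slow job's prediction and the baseline's.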

Architecture

Azure Big Data Analytics Platform
Azure ML with MLflow: • Archiving • Versioning • Serving

Flask Application

Griffon Output

Validation of Griffon Predictions

Job Id   Predicted Reason        Engineer Validated Reason   Rank   Confidence Level
9182     Input size              Input size                  1      High
8578     Revocation              Revocation                  4      Medium
4414     Yarn or cluster issue   Yarn or cluster issue       -      Low
6170     PN hours                PN hours                    5      Medium
7588     Time skew               Time skew                   1      High
3798     PN hours                PN hours                    1      High
1590     PN hours                PN hours                    1      High
2560     Usable machine count    Usable machine count        2      High

Scalability & Generalization

Conclusions
• End-to-end interpretable ranking system to identify the root causes of job slowdowns
• No human-labeled reasons needed
• Highly consistent results validated by on-call engineers
• Our model generalizes well when tested on job templates not included in the training set

Thank you! Please see our poster for more details ☺
