
SLIDE 1

Fingerprinting the datacenter: automated classification of performance crises

Peter Bodík 1,3, Moises Goldszmidt 3, Armando Fox 1, Dawn Woodard 4, Hans Andersen 2

1 RAD Lab, UC Berkeley   2 Microsoft   3 Microsoft Research   4 Cornell University

SLIDE 2

Crisis identification is difficult, time-consuming, and costly

Frequent SW/HW failures cause downtime

Timeline of a typical crisis
– detection: automatic, easy
– identification: manual, difficult
  • takes minutes to hours
– resolution: depends on crisis type
– root-cause diagnosis, documentation

Web apps are complex and large-scale
– app used for evaluation: 400 servers, 100 metrics

[Timeline figure: OK → OK → CRISIS, annotated 3:00 AM, 3:15 AM, 4:15 AM, next day.]

SLIDE 3

Insight: performance metrics help identify recurring crises

Performance crises recur
– incorrect root-cause diagnosis
– takes time to deploy the fix
  • other priorities, test new code

System state is similar during similar crises
– but not easily captured by a fixed set of metrics
– 3 operator-selected metrics not enough

SLIDE 4

Contribution: crisis identification as it happens, via classification

1. Fingerprint = compact representation of system state
   – uniquely identifies a crisis
   – robust to noise
   – intuitive visualization

2. Using fingerprints to identify crises as they happen
   – goal: operator receives an email about the crisis
   – "Crisis similar to DB config error from 2 weeks ago"

3. Evaluation on data from a real commercial service deployed on hundreds of servers
   – 80% identification accuracy

SLIDE 5

Outline

  • Definition of performance crises
  • Crisis fingerprints
  • Evaluation results
  • Related work
  • Conclusion


SLIDE 6

Definition and examples of performance crises

Performance crisis = violation of a service-level objective (SLO)
– based on business objectives
– captures performance of the whole cluster
– example: >90% of servers have latency < 100 ms during a 15-minute epoch

Crises we analyzed
– app config, DB config, request-routing errors
– overloaded front-end, overloaded back-end
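The SLO check from the example above can be sketched in a few lines. The function name, argument names, and default thresholds (taken from the example on this slide) are illustrative, not the actual system's API:

```python
def slo_violated(server_latencies_ms, latency_threshold_ms=100.0,
                 compliant_fraction=0.9):
    """Return True if this epoch violates the cluster-wide SLO, i.e. fewer
    than 90% of servers kept latency under the 100 ms target."""
    ok = sum(1 for lat in server_latencies_ms if lat < latency_threshold_ms)
    return ok / len(server_latencies_ms) < compliant_fraction

# 95 of 100 servers under 100 ms -> SLO holds
print(slo_violated([50.0] * 95 + [150.0] * 5))   # -> False
# only 80 of 100 servers compliant -> crisis epoch
print(slo_violated([50.0] * 80 + [150.0] * 20))  # -> True
```

Note the SLO is defined over the whole cluster per epoch, not per server, which matches the slide's "captures performance of whole cluster."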

SLIDE 7

Fingerprints capture state of performance metrics during crisis

Metrics as arbitrary time series
– OS, resource utilization, workload, latency, app, …

[Figure: servers 1 through 1000 each report ~100 metric time series per epoch (1: CPU utilization, 2: workload, …, 100: latency), recorded across OK and CRISIS periods.]

Building the crisis fingerprint:
1. select relevant metrics
2. summarize using quantiles
3. map into hot/normal/cold
4. average over time

SLIDE 8

Step 1: Using feature selection to pick relevant metrics

Logistic regression with L1 constraints
– fits an accurate linear model using only a few metrics
– selected metrics that operators didn't consider
– model input: all metrics; model output: binary (OK / CRISIS)

What would not work (low identification accuracy):
  • all 100 metrics
  • 3 operator-selected metrics
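The metric-selection step can be illustrated with a toy L1-regularized logistic regression. This is a from-scratch sketch (proximal gradient / ISTA on synthetic data), not the paper's actual model, features, or hyperparameters; every name and constant below is an assumption for illustration:

```python
import math
import random

def l1_logistic_regression(X, y, lam=0.1, lr=0.1, epochs=500):
    """Logistic regression with an L1 penalty, fit by proximal gradient
    descent (ISTA): a gradient step on the log-loss followed by
    soft-thresholding, which drives weights of irrelevant metrics to zero."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j in range(d):
                grad[j] += (p - yi) * xi[j] / n
        for j in range(d):
            wj = w[j] - lr * grad[j]
            # soft-thresholding: the proximal operator of the L1 norm
            w[j] = math.copysign(max(abs(wj) - lr * lam, 0.0), wj)
    return w

# Toy data: metric 0 determines the OK/crisis label; metrics 1-2 are noise.
random.seed(0)
X = [[1.0 if i % 2 else -1.0, random.gauss(0, 1), random.gauss(0, 1)]
     for i in range(200)]
y = [1 if xi[0] > 0 else 0 for xi in X]
w = l1_logistic_regression(X, y)
selected = [j for j, wj in enumerate(w) if abs(wj) > 0.2]
```

On this toy data the penalty keeps the informative metric and shrinks the noise metrics toward zero, which is the behavior the slide relies on to pick a small, relevant metric subset out of ~100.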

SLIDE 9

Step 2: Summarize selected metrics across servers using 3 quantiles

  • robust to outliers
  • can be computed efficiently even for datacenter-sized clusters

[Figure: histogram of CPU utilization (0–100%) across servers, marked with the 25th percentile, the 50th percentile (median), and the 95th percentile.]

What would not work:
  • mean, variance
  • only the median
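A minimal sketch of the quantile summary: the three percentiles come from the figure, while `summarize_metric` and its linear-interpolation rule are illustrative choices, not the system's implementation:

```python
def summarize_metric(values, quantiles=(0.25, 0.50, 0.95)):
    """Summarize one metric across all servers by a few quantiles,
    linearly interpolating between sorted values."""
    xs = sorted(values)
    out = []
    for q in quantiles:
        pos = q * (len(xs) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        out.append(xs[lo] + (pos - lo) * (xs[hi] - xs[lo]))
    return out

cpu_util = [5, 10, 12, 15, 20, 22, 30, 35, 40, 95]  # one value per server
summary = summarize_metric(cpu_util)  # [25th, median, 95th]
```

Three quantiles per metric compress 400 servers into a fixed-size vector, and unlike the mean/variance they are not dragged around by a single pathological server (the 95-utilization outlier above barely moves the median).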

SLIDE 10

Step 3: Map metric quantiles into hot/normal/cold

Thresholds based on historic values of each metric quantile

Epoch fingerprints
– differentiate among crises
– compact
– intuitive

[Figure: epoch fingerprints over time for four crises — overloaded back-end (twice), DB config error, app config error.]

What would not work:
  • raw metric values
  • time series model
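The hot/normal/cold mapping might look like the sketch below. The −1/0/+1 encoding and the historical-percentile cutoffs are assumptions for illustration; the talk only says the thresholds come from historic values:

```python
def hot_cold_thresholds(history, low_q=0.02, high_q=0.98):
    """Derive 'cold' and 'hot' cutoffs from a quantile's historical values
    (the 2nd/98th percentile cutoffs here are illustrative)."""
    xs = sorted(history)
    return xs[int(low_q * (len(xs) - 1))], xs[int(high_q * (len(xs) - 1))]

def discretize(value, thresholds):
    """Map a metric quantile to cold (-1), normal (0), or hot (+1)."""
    low, high = thresholds
    if value < low:
        return -1  # cold: unusually low vs. history
    if value > high:
        return 1   # hot: unusually high vs. history
    return 0       # normal

history = list(range(100))        # historical values of one metric quantile
th = hot_cold_thresholds(history)
states = [discretize(v, th) for v in (0, 50, 99)]  # cold, normal, hot
```

Discretizing against each quantile's own history is what makes the fingerprint robust: raw metric values differ across metrics and workloads, but "hot vs. its own past" is comparable everywhere.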

SLIDE 11

Step 4: Averaging over time

Different crises have different durations

Crisis fingerprint
– average the epoch fingerprints over time; the crisis fingerprint is a vector
– compare crises by computing Euclidean distance

What would not work:
  • all epoch fingerprints
  • 1 epoch fingerprint
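The averaging and comparison steps can be sketched directly; the helper names are illustrative:

```python
import math

def crisis_fingerprint(epoch_fingerprints):
    """Element-wise mean of the epoch fingerprint vectors."""
    n = len(epoch_fingerprints)
    return [sum(col) / n for col in zip(*epoch_fingerprints)]

def distance(fp_a, fp_b):
    """Euclidean distance between two crisis fingerprints."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fp_a, fp_b)))

# Each epoch fingerprint holds one hot/normal/cold value per (metric, quantile);
# the two toy crises below last three and two epochs respectively.
crisis1 = crisis_fingerprint([[1, 0, -1], [1, 0, 0], [1, 0, -1]])
crisis2 = crisis_fingerprint([[1, 0, -1], [1, 0, -1]])
d = distance(crisis1, crisis2)  # small distance -> likely the same crisis type
```

Averaging over time is what makes fingerprints of different-length crises comparable: both collapse to one vector of the same dimension regardless of how many epochs the crisis lasted.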

SLIDE 12

Crisis identification in an operational setting

Crisis detected automatically via SLO violation

During first hour of crisis
– update the fingerprint of the current crisis
– if a similar past crisis P is found, emit label P; else emit "?" = previously-unseen crisis

When crisis is over
– automatically update relevant metrics and fingerprints
– ideally, operators enter the supplied label into the crisis DB

[Figure: per-epoch identification output during a crisis: ? ? A A A — unknown at first, then identified as known crisis A.]
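The per-epoch identification loop could be sketched as below; the `identify` helper and its match threshold are illustrative assumptions, not the deployed system's logic:

```python
import math

def identify(current_fp, known, threshold=1.0):
    """Return the label of the nearest known crisis fingerprint within
    `threshold`, or '?' for a previously-unseen crisis.
    `known` maps crisis label -> fingerprint vector."""
    best_label, best_dist = "?", threshold
    for label, fp in known.items():
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(current_fp, fp)))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

known = {"A": [1, 0, -1], "B": [-1, 1, 0]}
print(identify([1, 0, -0.9], known))  # -> A   (close to crisis A)
print(identify([0, 0, 5], known))     # -> ?   (far from every known crisis)
```

As the slide's "? ? A A A" trace shows, early epochs may fall outside the threshold while the current crisis's fingerprint is still being averaged up; re-running `identify` each epoch lets the label firm up as more data arrives.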

SLIDE 13

Outline

  • Definition of performance crises
  • Crisis fingerprints
  • Evaluation results
  • Related work
  • Conclusion


SLIDE 14

System under study

24 x 7 enterprise-class user-facing application at Microsoft

– 400 machines
– 100 metrics per machine, 15-minute epochs
– operators: "Correct label useful during first hour"

Definition of a crisis
– operators supplied 3 latency metrics and thresholds
– crisis: 10% of servers have latency > threshold during 1 epoch

19 operator-labeled crises of 10 types
– 9 of type A, 2 of type B, 1 each of 8 more types
– 4-month period

SLIDE 15

Evaluation results

Identification stability = stick to the first label
– unstable: ??A??, AABBB
– stable: ?????, AAAAA, ??AAA

Previously-seen crises
– identification accuracy: 77%
– identified when detected, or one epoch later

For 77% of crises, average time to identification was 10 minutes
– could potentially save up to 50 minutes
– more with shorter epochs

Accuracy for previously-unseen crises: 82%
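The stability criterion can be made precise with a small helper; `is_stable` is an illustrative formalization of "stick to the first label", checked against the slide's own examples:

```python
def is_stable(labels):
    """Stable = after any leading '?' epochs, every emitted label is the
    same; an all-'?' sequence (never identified) also counts as stable."""
    seq = list(labels)
    while seq and seq[0] == "?":
        seq.pop(0)
    return all(l == seq[0] for l in seq)

# Examples from the slide:
unstable = [is_stable("??A??"), is_stable("AABBB")]                    # both False
stable = [is_stable("?????"), is_stable("AAAAA"), is_stable("??AAA")]  # all True
```

Stability matters operationally: an identifier that flips labels mid-crisis (AABBB) sends operators chasing the wrong past incident, which is worse than admitting "?" a little longer.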

SLIDE 16

More results in the paper

Comparison to other approaches
– using all metrics
– 3 operator-specified metrics
– failure signatures [SOSP '05]

Updating fingerprints

Sensitivity analysis

Online-clustering approach
– models the evolution of a fingerprint during a crisis
– doesn't assume 100% correct labeling of crises

SLIDE 17

Closest related work

"Capturing, indexing, clustering, and retrieving system history", SOSP '05
– authors: Cohen, Zhang, Goldszmidt, Symons, Kelly, Fox

Failure signatures
– signature for individual servers
– build and manage per-crisis classification models
– detailed comparison in the paper

SLIDE 18

Conclusion

Crisis fingerprint
– compact representation of system state
– scales to large clusters
– intuitive visualization

Use of machine learning crucial for metric selection

Correct identification for 80% of crises
– on average after 10 minutes
– rigorous evaluation on production data

Selection of relevant metrics used at Microsoft

SLIDE 19

Thank you!
