FluxRank: A Widely-Deployable Framework to Automatically Localizing - - PowerPoint PPT Presentation




slide-1
SLIDE 1

FluxRank: A Widely-Deployable Framework to Automatically Localizing Root Cause Machines for Software Service Failure Mitigation

Ping Liu1, Yu Chen2, Xiaohui Nie1, Jing Zhu1, Shenglin Zhang3, Kaixin Sui4 Ming Zhang5, Dan Pei1


slide-2
SLIDE 2
slide-3
SLIDE 3
slide-4
SLIDE 4

Background

  • Why do we focus on failure mitigation?

Because mitigating a failure of a complex distributed service takes too long

slide-5
SLIDE 5

Service outages of 2019

  • Three and a half hours

before successful mitigation

slide-6
SLIDE 6

Service outages of 2019

  • Almost a full day

before successful mitigation

slide-7
SLIDE 7

Service outages of 2019

  • Almost three hours

before successful mitigation

slide-8
SLIDE 8

Background

  • Mitigation Time
  • Three and a half hours
  • A full day
  • Three hours
  • ...

Our algorithm cuts the mitigation time by more than 80% on average.

So long!

slide-9
SLIDE 9
  • FluxRank

Fluctuation of KPI

Rank

slide-10
SLIDE 10

Background

  • Failure mitigation takes too much time.

Why?

slide-11
SLIDE 11

Troubleshooting process

  • Critical KPI

Response time Error rate … Monitor

slide-12
SLIDE 12

Troubleshooting process

  • Response Time

Failure start time → Confirmation

  • Operator
slide-13
SLIDE 13

Troubleshooting process

  • Response Time

Failure start time → Confirmation → Mitigation start time → Mitigation

  • Switch traffic
  • Rollback version
  • Restart instances
  • Operator
slide-14
SLIDE 14

Troubleshooting process

  • Root Cause Analysis

Response Time: Failure start time → Confirmation → Mitigation start time → Mitigation → Root cause analysis

  • Analyze source codes
  • Analyze logs

developer

slide-15
SLIDE 15

Troubleshooting process

  • Root Cause Analysis

Response Time Confirmation Mitigation

  • Operator

How do operators mitigate the failure?

slide-16
SLIDE 16

Mitigation

  • Software Service

Web server Database

  • Computation

Data Center Data Center Data Center

… …

Hundreds of modules, tens of thousands of machines, hundreds of KPIs

slide-17
SLIDE 17

Mitigation

  • Software Service

Web server Database

  • Computation

Data Center Data Center Data Center

… …

failed

slide-18
SLIDE 18

Mitigation

  • Anomaly detection by statistical

methods, like static thresholds, 3-sigma, etc.

alert
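The statistical detectors named on this slide fit in a few lines. Below is a minimal sketch of a sliding-window 3-sigma detector; the window size and the synthetic KPI data are illustrative, not from the paper:

```python
import numpy as np

def three_sigma_alerts(values, window=60):
    """Flag points deviating more than 3 standard deviations
    from the mean of the preceding window of samples."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = history.mean(), history.std()
        if sigma > 0 and abs(values[i] - mu) > 3 * sigma:
            alerts.append(i)
    return alerts

# a stable KPI followed by a sudden spike at index 200
rng = np.random.default_rng(0)
kpi = np.concatenate([rng.normal(0.5, 0.01, 200), [0.95]])
print(three_sigma_alerts(kpi))
```

A static threshold would instead compare each point against a fixed bound; 3-sigma adapts the bound to the KPI's recent behavior.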

slide-19
SLIDE 19

Mitigation

  • Software Service

Web server Database

  • Computation

Data Center Data Center Data Center

… …

failed

Because of the dependencies between modules and machines, failures propagate across modules and machines

slide-20
SLIDE 20

Mitigation

  • Software Service

Web server Database

  • Data Center

Data Center Data Center

  • Computation

… …

alert alert alert

Alerts will be found everywhere!

slide-21
SLIDE 21

Mitigation

  • Operator

Find possible failure reasons from past experience

  • Workload increase?
  • System updated events?
  • New service is online?

Take mitigation actions

  • Switch traffic
  • Rollback version
  • Restart instances

Try possible reasons one by one

slide-22
SLIDE 22

Mitigation

  • If trying the possible reasons fails to mitigate

the failure, operators will manually scan KPIs to find the root cause location

slide-23
SLIDE 23

Mitigation

  • Why are operators reluctant to check the code and exception logs?

developer

Only service developers can understand the details of the code and exception logs

slide-24
SLIDE 24

Mitigation

  • Why are operators reluctant to check the code and exception logs?

Operators mostly scan the KPIs to monitor the running status of modules and machines

  • Operator
slide-25
SLIDE 25

Mitigation

  • Hundreds of KPIs
  • Tens of thousands of machines

The search space is huge!

Web server Database

  • Computation

Hundreds of modules

slide-26
SLIDE 26

Mitigation

  • I have to mitigate

the failure quickly!

Hundreds of KPIs

  • Tens of thousands of machines

Web server Database

  • Computation

Hundreds of modules

slide-27
SLIDE 27

Mitigation

  • Web server

Database

  • Computation

Dependency graph

  • Dependency graph based approaches
  • Sherlock [SIGCOMM’ 07]
  • MonitorRank [SIGMETRICS’ 13]
  • Fchain [ICDCS’ 13]
  • CauseInfer [INFOCOM’ 14]
  • BRCA [IPCCC’ 16]
  • A dependency graph represents the

dependencies between modules

The root cause location can be localized along the dependency graph

slide-28
SLIDE 28

Mitigation

  • Web server

Database

  • Computation

Dependency graph

In practice, automatically obtaining the dependency graph of an online complex distributed service is difficult:

  • Additional data-collection code needs to be added, as in Google’s Dapper.
  • For a complex distributed service that is already online, this is often infeasible.

slide-29
SLIDE 29

Mitigation

  • Web server

Database

  • Computation

Dependency graph

The dependency graph can also be manually obtained from the experience of developers and operators:

  • Maintaining the graphs for rapidly changing software services is difficult, because frequent code changes make the dependency graph elusive.

slide-30
SLIDE 30

Mitigation

  • Therefore, in practice, localization is

still a manual process.

slide-31
SLIDE 31

Core idea

  • If the manual scanning process can be automated

by machine learning, then the overall mitigation time can be greatly reduced.

Machine learning algorithm

KPIs → Root cause machines

slide-32
SLIDE 32

Core idea

  • Machine learning algorithm

KPIs → Root cause machines

Directly training machine learning models in an end-to-end manner does not work:

  • Lack of interpretability.
  • Insufficient failure cases.
slide-33
SLIDE 33

Core idea

  • Machine learning algorithm

KPIs → Root cause machines (Phase 1, Phase 2, …)

Domain knowledge can be utilized to divide the problem into several phases

Each phase has sufficient data, and an interpretable algorithm can be used

slide-34
SLIDE 34

Manual localization without dependency graph

  • Software Service

Web server Database

  • Data Center

Data Center Data Center

  • Computation

… …

Step-1: scan the KPIs to understand the status of machines

slide-35
SLIDE 35

Manual localization without dependency graph

  • Software Service

Web server Database

  • Data Center

Data Center Data Center

  • Computation

… …

Step-2: rank the potential root cause machines according to experience

slide-36
SLIDE 36

Manual localization without dependency graph

  • Software Service

Web server Database

  • Data Center

Data Center Data Center

  • Computation

… …

Step-3: trigger mitigation action on the

highest-ranked machines one by one until successful mitigation

slide-37
SLIDE 37

Core idea

  • KPIs → Change Quantification → Digest Distillation → Digest Ranking → Root cause machines

FluxRank mimics steps 1 and 2 of the manual mitigation process

slide-38
SLIDE 38
slide-39
SLIDE 39

Design

  • FluxRank: KPIs → Change Quantification → Digest Distillation → Digest Ranking → Root cause machines

FluxRank distills a valuable digest from the huge number of KPIs

slide-40
SLIDE 40

Digest

  • FluxRank’s output for a real failure case: a digest

A digest represents the change patterns of several machines from the same module.

slide-41
SLIDE 41

Digest

  • Module name
slide-42
SLIDE 42

Digest

  • Machines list
slide-43
SLIDE 43

Digest

  • ratio = (# of machines in the digest) / (# of machines in the module)

slide-44
SLIDE 44

Digest

  • KPI list
slide-45
SLIDE 45

Digest

  • Downward change

score of KPI

slide-46
SLIDE 46

Digest

  • Upward change score of KPI
slide-47
SLIDE 47

Digest

  • In the top digest, CPU idle KPIs dropped

abnormally, and CPU load KPIs rose abnormally

slide-48
SLIDE 48

Digest

  • Operators can easily understand that 27 machines of module

M1 from data center 1 have CPU overload exception
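The digest fields walked through on these slides (module name, machine list, ratio of affected machines, and a KPI list with downward/upward change scores) can be summarized in a small sketch. All field and class names below are illustrative, not FluxRank's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class Digest:
    """Illustrative container for the fields a FluxRank digest exposes."""
    module: str                 # module name, e.g. "M1"
    machines: list              # machines sharing the change pattern
    module_size: int            # total machines in the module
    kpis: list = field(default_factory=list)  # (name, down_score, up_score)

    @property
    def ratio(self):
        # (# of machines in the digest) / (# of machines in the module)
        return len(self.machines) / self.module_size

# the CPU-overload example: 27 of module M1's 750 machines
d = Digest("M1", [f"dc1-host{i}" for i in range(27)], 750,
           [("cpu_idle", 9.1, 0.0), ("cpu_load", 0.0, 8.7)])
print(round(d.ratio, 3))  # 27/750 = 0.036
```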

slide-49
SLIDE 49

Digest

  • Module M1 is deployed on 750 machines, each machine

has 47 standard Linux KPIs

slide-50
SLIDE 50

Digest

  • FluxRank
  • The search space and analysis time of operators can be greatly reduced!

slide-51
SLIDE 51

Design

  • FluxRank: KPIs → Change Quantification → Digest Distillation → Digest Ranking → Root cause machines

slide-52
SLIDE 52

Change quantification

  • The change of each KPI is quantified into two change

scores: upward change score (o) and downward change score (u)

slide-53
SLIDE 53

Change quantification

  • Because the quantified change scores will be used for

ranking in phase three, the scores have to satisfy the following requirement: the change scores must be comparable across diversified KPI characteristics

slide-54
SLIDE 54

Change quantification

  • The change quantification also must be lightweight ,

because hundreds of thousands of KPIs need to be quickly quantified

slide-55
SLIDE 55

Change quantification

  • A lightweight Kernel Density Estimation (KDE) based quantification algorithm
  • Comparable across diversified KPI characteristics: in KDE, choose different kernels for different KPIs
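One way such a KDE-based score can be computed is sketched below, assuming a Gaussian kernel for simplicity. FluxRank chooses different kernels per KPI type, and the paper's exact score definition may differ, so treat this as an illustration of the idea: the change score is the negative log tail probability of the post-change value under a density fitted to the pre-failure history.

```python
import numpy as np
from scipy.stats import gaussian_kde

def change_scores(history, current):
    """Quantify a KPI change as upward (o) and downward (u) scores:
    negative log tail probabilities of the current value under a KDE
    fitted to the pre-failure history."""
    kde = gaussian_kde(history)
    mean_now = np.mean(current)
    # tail probabilities under the historical density
    p_up = kde.integrate_box_1d(mean_now, np.inf)     # P(X >= current)
    p_down = kde.integrate_box_1d(-np.inf, mean_now)  # P(X <= current)
    o = -np.log(max(p_up, 1e-12))    # upward change score
    u = -np.log(max(p_down, 1e-12))  # downward change score
    return o, u

rng = np.random.default_rng(0)
hist = rng.normal(30.0, 2.0, 500)   # e.g. CPU load hovering around 30%
o, u = change_scores(hist, [55.0])  # load jumps to 55% after the failure
print(o > u)  # the upward score dominates for an upward jump
```

Because both scores are log tail probabilities, they are on the same scale regardless of each KPI's raw units, which is what makes them comparable across KPIs.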

slide-56
SLIDE 56

Digest distillation

  • KPIs of the same module

Construct vector representations of the change patterns of machines

slide-57
SLIDE 57

Digest distillation

  • Suppose each machine has n KPIs; then the KPIs’ upward

change scores o and downward change scores u can form a vector to represent the change pattern of the machine

  • (o1, u1, o2, u2, …, on, un)
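Building that change-pattern vector is a simple interleaving of the per-KPI score pairs; a minimal sketch (the score values are made up for illustration):

```python
import numpy as np

def machine_vector(scores):
    """Interleave per-KPI (upward, downward) change scores into the
    machine's change-pattern vector (o1, u1, ..., on, un)."""
    vec = []
    for o, u in scores:   # one (o, u) pair per KPI
        vec.extend([o, u])
    return np.array(vec)

# hypothetical scores for 3 KPIs on one machine
v = machine_vector([(5.1, 0.2), (0.0, 4.7), (0.3, 0.1)])
print(v.shape)  # (6,)
```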
slide-58
SLIDE 58

Digest distillation

  • KPIs of the same module

Construct a vector representation of the change pattern of a machine. Use the constructed vectors to cluster machines and generate digests.

slide-59
SLIDE 59

Digest distillation

[Figure: machines of the Computation module in Data Center 2 and Data Center 3 are grouped into two digests (DC-2&DC-3 and DC-3), each listing memory, disk, network, … KPIs]

slide-60
SLIDE 60

Digest distillation

  • We choose DBSCAN as the clustering algorithm,

because the number of clusters cannot be determined in advance. We use Pearson correlation as the distance function of the clustering algorithm, which can capture similar change patterns.
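A minimal sketch of this clustering step, using 1 − Pearson correlation as a precomputed distance matrix for DBSCAN. The `eps` and `min_samples` values and the toy vectors are illustrative, not the paper's settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_machines(vectors, eps=0.3, min_samples=2):
    """Cluster machine change-pattern vectors with DBSCAN, using
    1 - Pearson correlation as the distance, so machines with
    similar change patterns end up in the same digest."""
    X = np.asarray(vectors, dtype=float)
    corr = np.corrcoef(X)                # pairwise Pearson correlation
    dist = np.clip(1.0 - corr, 0, None)  # correlated machines are "close"
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric="precomputed").fit_predict(dist)
    return labels

# two machines with the same CPU-overload pattern, one unrelated machine
vecs = [[5.0, 0.1, 0.2, 4.8], [4.9, 0.2, 0.1, 4.5], [0.1, 0.0, 3.9, 0.2]]
print(cluster_machines(vecs))  # first two share a label; third is noise (-1)
```

Correlation distance ignores the absolute magnitude of the scores and keys on the shape of the change, which is why it captures "similar change patterns" better than Euclidean distance here.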

slide-61
SLIDE 61

Digest Ranking

  • The distilled digests need to be ranked so that

the one most relevant to the root cause can be listed at the top

slide-62
SLIDE 62

Digest Ranking

[Figure: digests (Database DC-1; Computation DC-2&DC-3; Computation DC-3), each listing memory, disk, network, … KPIs, are fed into a learning-to-rank model, producing ranked digests]

Extract root-cause-related features, then train a logistic regression model
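The ranking step can be sketched as follows: extract a feature vector per digest, train logistic regression on historically labeled failures, and sort new digests by predicted root-cause probability. The feature set here (max change score, machine ratio, KPI count) and the training data are hypothetical, chosen only to illustrate the mechanism:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-digest features: (max KPI change score, machine
# ratio, # of changed KPIs); labels mark historical root-cause digests.
train_X = np.array([[9.2, 0.9, 5], [1.1, 0.2, 2], [7.5, 0.8, 4],
                    [0.7, 0.1, 1], [8.8, 1.0, 6], [0.9, 0.3, 2]])
train_y = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(train_X, train_y)

def rank_digests(digest_features):
    """Order digests by the model's root-cause probability, best first."""
    scores = model.predict_proba(digest_features)[:, 1]
    return np.argsort(-scores)   # indices of digests, highest score first

new_digests = np.array([[0.8, 0.2, 2], [9.0, 0.95, 5]])
print(rank_digests(new_digests))  # the high-change digest ranks first
```

Logistic regression keeps the ranking interpretable: each feature's learned weight shows how much it pushes a digest toward the top.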

slide-63
SLIDE 63
slide-64
SLIDE 64

Datasets

  • The largest system contains 11,519 machines

Our dataset contains 70 real failure cases from five different online software services

slide-65
SLIDE 65

Datasets

slide-66
SLIDE 66

Metric

  • Root cause digest (RCD). A root cause digest is a

digest satisfying the following conditions:

  • All machines of a digest are root cause machines where

the root cause took place.

  • The top five KPIs of a digest contain one or more root-cause-relevant KPIs.

slide-67
SLIDE 67

Root Cause Digest

  • All 27 machines are root cause machines where

CPU overload took place. The top five KPIs are root-cause-related KPIs which can indicate CPU overload.

slide-68
SLIDE 68

Metric

  • We use Recall@K as the evaluation metric:

the fraction of cases whose root cause digest is ranked in the top K

slide-69
SLIDE 69

Offline Evaluation

  • FluxRank is stable across different folds of cross-validation.

66/70 cases' root cause digests are ranked into the top 3.

slide-70
SLIDE 70

Offline Evaluation

  • Compared with manual localization,

FluxRank cuts the mitigation time by more than 80% on average

slide-71
SLIDE 71

Online Evaluation

  • FluxRank has been successfully deployed online on one Internet

service (with hundreds of machines) and six banking services (each with tens of machines) in two large banks for three months.

55/59 cases' root cause digests are ranked into top 1

slide-72
SLIDE 72
slide-73
SLIDE 73

Case study

  • This case is a CPU overload failure. The failed service

contains 29 modules and runs on 11,519 machines.

  • The root cause of this failure is that 27 machines raised a CPU overload exception.
slide-74
SLIDE 74

Case study

  • Offline service

Tester

Start offline CPU stress test on the offline service

slide-75
SLIDE 75

Case study

  • Online service

Module Module Module … …

  • Offline service

Tester

Due to a faulty configuration, the tester incorrectly overloaded the CPUs of some online machines.

  • Operator

After one hour of mitigation attempts, with no success

slide-76
SLIDE 76

Case study

  • Online service

Module Module Module … …

  • Offline service

Tester

Then, the failure was escalated. Operators stopped all stress tests that might influence the online service

  • Operator
slide-77
SLIDE 77

Case study

  • Online service

Module Module Module … …

  • Offline service

Tester

Eventually, they successfully mitigated the failure, but spent about two hours in total.

  • Operator
slide-78
SLIDE 78

Case study

  • FluxRank successfully

recommended the 27 CPU-overloaded machines to the top

The CPU-related KPIs are also ranked to the top of the digest’s KPI list

Operators can easily understand that 27 machines of module M1 from data center 1 have a CPU overload exception.
slide-79
SLIDE 79

Conclusion

  • Target
  • Failure mitigation.
  • Not finding the exact root cause
  • Method
  • Quantify KPIs -> Cluster machines -> Rank digests
  • No dependency graph required
  • Offline evaluation
  • 66/70 cases' root cause digests are ranked into top 3
  • Online evaluation
  • 55/59 cases' root cause digests are ranked into top 1
slide-80
SLIDE 80
slide-81
SLIDE 81

Widely-deployable Framework

  • FluxRank can be easily deployed using existing

KPI data without any change to the service.

  • FluxRank has been quickly deployed on six online services.
slide-82
SLIDE 82

Diversified KPI characteristics

  • CPU idle KPI
  • Ratio KPI
  • Value range: [0, 1.0]
  • It is anomalous when the value is close to 0 or 1
  • A Beta distribution is more suitable for describing ratio KPIs
  • Example
  • If normal range is: [0.3, 0.8]
  • During CPU overload, the value is: 0.1
  • Under a Gaussian distribution, 0.1 looks normal
  • Under a Beta distribution, 0.1 is anomalous
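The kernel-choice point above can be illustrated numerically: fit both a Gaussian and a Beta distribution to ratio-KPI history confined to [0.3, 0.8], then compare the tail probability each assigns to an overload reading of 0.1. This is only a sketch (uniform synthetic history, simple parametric fits rather than FluxRank's actual kernels):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
idle = rng.uniform(0.3, 0.8, 1000)   # normal CPU-idle ratio samples

# Fit both candidate distributions to the same history.
mu, sigma = idle.mean(), idle.std()
a, b, loc, scale = stats.beta.fit(idle, floc=0, fscale=1)

x = 0.1  # CPU-idle ratio observed during overload
p_gauss = stats.norm.cdf(x, mu, sigma)  # P(X <= 0.1) under Gaussian
p_beta = stats.beta.cdf(x, a, b)        # P(X <= 0.1) under Beta
print(p_beta < p_gauss)  # Beta assigns the smaller tail probability
```

Because the Beta density respects the [0, 1] support of a ratio KPI, it treats values near the boundary as more surprising than a Gaussian does, yielding a larger (more anomalous) change score for the same observation.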