FluxRank: A Widely-Deployable Framework to Automatically Localizing - - PowerPoint PPT Presentation




slide-1
SLIDE 1

FluxRank: A Widely-Deployable Framework to Automatically Localizing Root Cause Machines for Software Service Failure Mitigation

Ping Liu1, Yu Chen2, Xiaohui Nie1, Jing Zhu1, Shenglin Zhang3, Kaixin Sui4 Ming Zhang5, Dan Pei1


slide-2
SLIDE 2
slide-3
SLIDE 3
slide-4
SLIDE 4

Background

  • Why do we focus on failure mitigation?

Because mitigating a failure of a complex distributed service takes too long

slide-5
SLIDE 5

Service outages of 2019

  • Three and a half hours

before successful mitigation

slide-6
SLIDE 6

Service outages of 2019

  • Almost a full day

before successful mitigation

slide-7
SLIDE 7

Service outages of 2019

  • Almost three hours

before successful mitigation

slide-8
SLIDE 8

Background

  • Mitigation Time
  • Three and a half hours
  • A full day
  • Three hours
  • ...

Our algorithm cuts the mitigation time by more than 80% on average.

So long!

slide-9
SLIDE 9
  • FluxRank

Fluctuation of KPI

Rank

slide-10
SLIDE 10

Background

  • Failure mitigation takes too much time.

Why?

slide-11
SLIDE 11

Troubleshooting process

  • Critical KPI

Response time Error rate … Monitor

slide-12
SLIDE 12

Troubleshooting process

  • Response Time

Failure start time → Confirmation

  • Operator
slide-13
SLIDE 13

Troubleshooting process

  • Response Time

Failure start time → Confirmation → Mitigation start time → Mitigation

  • Switch traffic
  • Rollback version
  • Restart instances
  • Operator
slide-14
SLIDE 14

Troubleshooting process

  • Root Cause Analysis

Response Time: Failure start time → Confirmation → Mitigation start time → Mitigation → Root cause analysis

  • Analyze source codes
  • Analyze logs

developer

slide-15
SLIDE 15

Troubleshooting process

  • Root Cause Analysis

Response Time Confirmation Mitigation

  • Operator

How do operators mitigate the failure?

slide-16
SLIDE 16

Mitigation

  • Software Service

Web server Database

  • Computation

Data Center Data Center Data Center

… …

Hundreds of modules, tens of thousands of machines, hundreds of KPIs

slide-17
SLIDE 17

Mitigation

  • Software Service

Web server Database

  • Computation

Data Center Data Center Data Center

… …

failed

slide-18
SLIDE 18

Mitigation

  • Anomaly detection by statistical

methods, like static thresholds, 3-sigma, etc.

alert
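The statistical detectors named on this slide fit in a few lines. Below is a minimal sketch of a sliding-window 3-sigma detector; the window size and the synthetic KPI data are illustrative, not from the paper:

```python
import numpy as np

def three_sigma_alerts(values, window=60):
    """Flag points deviating more than 3 standard deviations
    from the mean of the preceding window of samples."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = history.mean(), history.std()
        if sigma > 0 and abs(values[i] - mu) > 3 * sigma:
            alerts.append(i)
    return alerts

# a stable KPI followed by a sudden spike at index 200
rng = np.random.default_rng(0)
kpi = np.concatenate([rng.normal(0.5, 0.01, 200), [0.95]])
print(three_sigma_alerts(kpi))
```

A static threshold would instead compare each point against a fixed bound; 3-sigma adapts the bound to the KPI's recent behavior.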

slide-19
SLIDE 19

Mitigation

  • Software Service

Web server Database

  • Computation

Data Center Data Center Data Center

… …

failed

Because of the dependencies between modules and machines, failures propagate across modules and machines

slide-20
SLIDE 20

Mitigation

  • Software Service

Web server Database

  • Data Center

Data Center Data Center

  • Computation

… …

alert alert alert

Alerts will be found everywhere!

slide-21
SLIDE 21

Mitigation

  • Operator

Find possible failure reasons from past experience

  • Workload increase?
  • System updated events?
  • New service is online?

Take mitigation actions

  • Switch traffic
  • Rollback version
  • Restart instances

Try possible reasons one by one

slide-22
SLIDE 22

Mitigation

  • If trying the possible reasons fails to mitigate

the failure, operators will manually scan KPIs to find the root cause location

slide-23
SLIDE 23

Mitigation

  • Why are operators reluctant to check the code and exception logs?

developer

Only service developers can understand the details of the code and exception logs

slide-24
SLIDE 24

Mitigation

  • Why are operators reluctant to check the code and exception logs?

Operators mostly scan the KPIs to monitor the running status of modules and machines

  • Operator
slide-25
SLIDE 25

Mitigation

  • Hundreds of KPIs
  • Tens of thousands of machines

The search space is huge!

Web server Database

  • Computation

Hundreds of modules

slide-26
SLIDE 26

Mitigation

  • I have to mitigate

the failure quickly!

Hundreds of KPIs

  • Tens of thousands of machines

Web server Database

  • Computation

Hundreds of modules

slide-27
SLIDE 27

Mitigation

  • Web server

Database

  • Computation

Dependency graph

  • Dependency graph based approaches
  • Sherlock [SIGCOMM’ 07]
  • MonitorRank [SIGMETRICS’ 13]
  • Fchain [ICDCS’ 13]
  • CauseInfer [INFOCOM’ 14]
  • BRCA [IPCCC’ 16]
  • A dependency graph represents the

dependencies between modules

The root cause location can be localized along the dependency graph

slide-28
SLIDE 28

Mitigation

  • Web server

Database

  • Computation

Dependency graph

In practice, automatically obtaining the dependency graph of an online complex distributed service is difficult:

  • Additional data-collection code needs to be added, as in Google’s Dapper.
  • For a complex distributed service that is already online, this is often infeasible.

slide-29
SLIDE 29

Mitigation

  • Web server

Database

  • Computation

Dependency graph

The dependency graph can also be manually obtained from the experience of developers and operators:

  • Maintaining the graphs for rapidly changing software services is difficult, because frequent code changes make the dependency graph elusive.

slide-30
SLIDE 30

Mitigation

  • Therefore, in practice, localization is

still a manual process.

slide-31
SLIDE 31

Core idea

  • If the manual scanning process can be automated

by machine learning, then the overall mitigation time can be greatly reduced.

Machine learning algorithm

KPIs → Root cause machines

slide-32
SLIDE 32

Core idea

  • Machine learning algorithm

KPIs → Root cause machines

Directly training machine learning models in an end-to-end manner does not work:

  • Lack of interpretability.
  • Insufficient failure cases.
slide-33
SLIDE 33

Core idea

  • Machine learning algorithm

KPIs → Root cause machines (Phase 1, Phase 2, …)

Domain knowledge can be utilized to divide the problem into several phases

Each phase has sufficient data, and an interpretable algorithm can be used

slide-34
SLIDE 34

Manual localization without dependency graph

  • Software Service

Web server Database

  • Data Center

Data Center Data Center

  • Computation

… …

Step-1: scan the KPIs to understand the status of machines

slide-35
SLIDE 35

Manual localization without dependency graph

  • Software Service

Web server Database

  • Data Center

Data Center Data Center

  • Computation

… …

Step-2: rank the potential root cause machines according to experience

slide-36
SLIDE 36

Manual localization without dependency graph

  • Software Service

Web server Database

  • Data Center

Data Center Data Center

  • Computation

… …

Step-3: trigger mitigation action on the

highest-ranked machines one by one until successful mitigation

slide-37
SLIDE 37

Core idea

  • KPIs → Change Quantification → Digest Distillation → Digest Ranking → Root cause machines

FluxRank mimics steps 1 and 2 of the manual mitigation process

slide-38
SLIDE 38
slide-39
SLIDE 39

Design

  • FluxRank: KPIs → Change Quantification → Digest Distillation → Digest Ranking → Root cause machines

FluxRank distills a valuable digest from the huge number of KPIs

slide-40
SLIDE 40

Digest

  • FluxRank’s output for a real failure case: a digest

A digest represents the change patterns of several machines from the same module.

slide-41
SLIDE 41

Digest

  • Module name
slide-42
SLIDE 42

Digest

  • Machines list
slide-43
SLIDE 43

Digest

  • ratio = (# of machines in the digest) / (# of machines in the module)

slide-44
SLIDE 44

Digest

  • KPI list
slide-45
SLIDE 45

Digest

  • Downward change

score of KPI

slide-46
SLIDE 46

Digest

  • Upward change score of KPI
slide-47
SLIDE 47

Digest

  • In the top digest, CPU idle KPIs dropped

abnormally, and CPU load KPIs rose abnormally

slide-48
SLIDE 48

Digest

  • Operators can easily understand that 27 machines of module

M1 from data center 1 have CPU overload exception
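The digest fields walked through on these slides (module name, machine list, ratio of affected machines, and a KPI list with downward/upward change scores) can be summarized in a small sketch. All field and class names below are illustrative, not FluxRank's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class Digest:
    """Illustrative container for the fields a FluxRank digest exposes."""
    module: str                 # module name, e.g. "M1"
    machines: list              # machines sharing the change pattern
    module_size: int            # total machines in the module
    kpis: list = field(default_factory=list)  # (name, down_score, up_score)

    @property
    def ratio(self):
        # (# of machines in the digest) / (# of machines in the module)
        return len(self.machines) / self.module_size

# the CPU-overload example: 27 of module M1's 750 machines
d = Digest("M1", [f"dc1-host{i}" for i in range(27)], 750,
           [("cpu_idle", 9.1, 0.0), ("cpu_load", 0.0, 8.7)])
print(round(d.ratio, 3))  # 27/750 = 0.036
```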

slide-49
SLIDE 49

Digest

  • Module M1 is deployed on 750 machines, each machine

has 47 standard Linux KPIs

slide-50
SLIDE 50

Digest

  • FluxRank
  • The search space and analysis time of operators can be greatly reduced!

slide-51
SLIDE 51

Design

  • FluxRank: KPIs → Change Quantification → Digest Distillation → Digest Ranking → Root cause machines

slide-52
SLIDE 52

Change quantification

  • The change of each KPI is quantified into two change

scores: upward change score (o) and downward change score (u)

slide-53
SLIDE 53

Change quantification

  • Because the quantified change scores will be used for

ranking in phase three, the scores have to satisfy the following requirement: the change scores must be comparable across diversified KPI characteristics

slide-54
SLIDE 54

Change quantification

  • The change quantification also must be lightweight ,

because hundreds of thousands of KPIs need to be quickly quantified

slide-55
SLIDE 55

Change quantification

  • A lightweight Kernel Density Estimation (KDE) based quantification algorithm
  • Comparable across diversified KPI characteristics: in KDE, choose different kernels for different KPIs
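One way such a KDE-based score can be computed is sketched below, assuming a Gaussian kernel for simplicity. FluxRank chooses different kernels per KPI type, and the paper's exact score definition may differ, so treat this as an illustration of the idea: the change score is the negative log tail probability of the post-change value under a density fitted to the pre-failure history.

```python
import numpy as np
from scipy.stats import gaussian_kde

def change_scores(history, current):
    """Quantify a KPI change as upward (o) and downward (u) scores:
    negative log tail probabilities of the current value under a KDE
    fitted to the pre-failure history."""
    kde = gaussian_kde(history)
    mean_now = np.mean(current)
    # tail probabilities under the historical density
    p_up = kde.integrate_box_1d(mean_now, np.inf)     # P(X >= current)
    p_down = kde.integrate_box_1d(-np.inf, mean_now)  # P(X <= current)
    o = -np.log(max(p_up, 1e-12))    # upward change score
    u = -np.log(max(p_down, 1e-12))  # downward change score
    return o, u

rng = np.random.default_rng(0)
hist = rng.normal(30.0, 2.0, 500)   # e.g. CPU load hovering around 30%
o, u = change_scores(hist, [55.0])  # load jumps to 55% after the failure
print(o > u)  # the upward score dominates for an upward jump
```

Because both scores are log tail probabilities, they are on the same scale regardless of each KPI's raw units, which is what makes them comparable across KPIs.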

slide-56
SLIDE 56

Digest distillation

  • KPIs of the same module

Construct vector representations of the change patterns of machines

slide-57
SLIDE 57

Digest distillation

  • Suppose each machine has n KPIs; then the KPIs’ upward

change scores o and downward change scores u can form a vector to represent the change pattern of the machine

  • (o1, u1, o2, u2, …, on, un)
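Building that change-pattern vector is a simple interleaving of the per-KPI score pairs; a minimal sketch (the score values are made up for illustration):

```python
import numpy as np

def machine_vector(scores):
    """Interleave per-KPI (upward, downward) change scores into the
    machine's change-pattern vector (o1, u1, ..., on, un)."""
    vec = []
    for o, u in scores:   # one (o, u) pair per KPI
        vec.extend([o, u])
    return np.array(vec)

# hypothetical scores for 3 KPIs on one machine
v = machine_vector([(5.1, 0.2), (0.0, 4.7), (0.3, 0.1)])
print(v.shape)  # (6,)
```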
slide-58
SLIDE 58

Digest distillation

  • KPIs of the same module

Construct a vector representation of the change pattern of a machine. Use the constructed vectors to cluster machines and generate digests.

slide-59
SLIDE 59

Digest distillation

[Figure: machines of the Computation module in Data Center 2 and Data Center 3 are grouped into two digests (DC-2&DC-3 and DC-3), each listing memory, disk, network, … KPIs]

slide-60
SLIDE 60

Digest distillation

  • We choose DBSCAN as the clustering algorithm,

because the number of clusters cannot be determined in advance. We use Pearson correlation as the distance function of the clustering algorithm, which can capture similar change patterns.
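A minimal sketch of this clustering step, using 1 − Pearson correlation as a precomputed distance matrix for DBSCAN. The `eps` and `min_samples` values and the toy vectors are illustrative, not the paper's settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_machines(vectors, eps=0.3, min_samples=2):
    """Cluster machine change-pattern vectors with DBSCAN, using
    1 - Pearson correlation as the distance, so machines with
    similar change patterns end up in the same digest."""
    X = np.asarray(vectors, dtype=float)
    corr = np.corrcoef(X)                # pairwise Pearson correlation
    dist = np.clip(1.0 - corr, 0, None)  # correlated machines are "close"
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric="precomputed").fit_predict(dist)
    return labels

# two machines with the same CPU-overload pattern, one unrelated machine
vecs = [[5.0, 0.1, 0.2, 4.8], [4.9, 0.2, 0.1, 4.5], [0.1, 0.0, 3.9, 0.2]]
print(cluster_machines(vecs))  # first two share a label; third is noise (-1)
```

Correlation distance ignores the absolute magnitude of the scores and keys on the shape of the change, which is why it captures "similar change patterns" better than Euclidean distance here.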

slide-61
SLIDE 61

Digest Ranking

  • The distilled digests need to be ranked so that

the one most relevant to the root cause can be listed at the top

slide-62
SLIDE 62

Digest Ranking

[Figure: digests (Database DC-1; Computation DC-2&DC-3; Computation DC-3), each listing memory, disk, network, … KPIs, are fed into a learning-to-rank model, producing ranked digests]

Extract root-cause-related features, then train a logistic regression model
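The ranking step can be sketched as follows: extract a feature vector per digest, train logistic regression on historically labeled failures, and sort new digests by predicted root-cause probability. The feature set here (max change score, machine ratio, KPI count) and the training data are hypothetical, chosen only to illustrate the mechanism:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-digest features: (max KPI change score, machine
# ratio, # of changed KPIs); labels mark historical root-cause digests.
train_X = np.array([[9.2, 0.9, 5], [1.1, 0.2, 2], [7.5, 0.8, 4],
                    [0.7, 0.1, 1], [8.8, 1.0, 6], [0.9, 0.3, 2]])
train_y = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(train_X, train_y)

def rank_digests(digest_features):
    """Order digests by the model's root-cause probability, best first."""
    scores = model.predict_proba(digest_features)[:, 1]
    return np.argsort(-scores)   # indices of digests, highest score first

new_digests = np.array([[0.8, 0.2, 2], [9.0, 0.95, 5]])
print(rank_digests(new_digests))  # the high-change digest ranks first
```

Logistic regression keeps the ranking interpretable: each feature's learned weight shows how much it pushes a digest toward the top.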

slide-63
SLIDE 63
slide-64
SLIDE 64

Datasets

  • The largest system contains 11,519 machines

Our dataset contains 70 real failure cases from five different online software services

slide-65
SLIDE 65

Datasets

slide-66
SLIDE 66

Metric

  • Root cause digest (RCD). A root cause digest is a

digest satisfying the following conditions:

  • All machines of a digest are root cause machines where

the root cause took place.

  • The top five KPIs of a digest contain one or more root-cause-relevant KPIs.

slide-67
SLIDE 67

Root Cause Digest

  • All 27 machines are root cause machines where

CPU overload took place. The top five KPIs are root-cause-related KPIs which can indicate CPU overload.

slide-68
SLIDE 68

Metric

  • We use Recall@K as the evaluation metric:

the fraction of cases whose root cause digest is ranked in the top K

slide-69
SLIDE 69

Offline Evaluation

  • FluxRank is stable across different folds of cross-validation.

66/70 cases' root cause digests are ranked into the top 3.

slide-70
SLIDE 70

Offline Evaluation

  • Compared with manual localization,

FluxRank cuts the mitigation time by more than 80% on average

slide-71
SLIDE 71

Online Evaluation

  • FluxRank has been successfully deployed online on one Internet

service (with hundreds of machines) and six banking services (each with tens of machines) in two large banks for three months.

55/59 cases' root cause digests are ranked into top 1

slide-72
SLIDE 72
slide-73
SLIDE 73

Case study

  • This case is a CPU overload failure. The failed service

contains 29 modules and runs on 11,519 machines.

  • The root cause of this failure is that 27 machines raised a CPU overload exception.
slide-74
SLIDE 74

Case study

  • Offline service

Tester

Start offline CPU stress test on the offline service

slide-75
SLIDE 75

Case study

  • Online service

Module Module Module … …

  • Offline service

Tester

Due to a faulty configuration, the tester incorrectly overloaded the CPUs of some online machines.

  • Operator

After one hour of mitigation attempts, with no success

slide-76
SLIDE 76

Case study

  • Online service

Module Module Module … …

  • Offline service

Tester

Then, the failure was escalated. Operators stopped all stress tests that might influence the online service

  • Operator
slide-77
SLIDE 77

Case study

  • Online service

Module Module Module … …

  • Offline service

Tester

Eventually, they successfully mitigated the failure, but spent about two hours in total.

  • Operator
slide-78
SLIDE 78

Case study

  • FluxRank successfully

recommended the 27 CPU-overloaded machines to the top

The CPU-related KPIs are also ranked to the top of the digest’s KPI list

Operators can easily understand that 27 machines of module M1 from data center 1 have a CPU overload exception.
slide-79
SLIDE 79

Conclusion

  • Target
  • Failure mitigation.
  • Not finding the exact root cause
  • Method
  • Quantify KPIs -> Cluster machines -> Rank digests
  • No dependency graph required
  • Offline evaluation
  • 66/70 cases' root cause digests are ranked into top 3
  • Online evaluation
  • 55/59 cases' root cause digests are ranked into top 1
slide-80
SLIDE 80
slide-81
SLIDE 81

Widely-deployable Framework

  • FluxRank can be easily deployed using existing

KPI data without any change to the service.

  • FluxRank has been quickly deployed on six online services.
slide-82
SLIDE 82

Diversified KPI characteristics

  • CPU idle KPI
  • Ratio KPI
  • Value range: [0, 1.0]
  • It is anomalous when the value is close to 0 or 1
  • A Beta distribution is more suitable for describing ratio KPIs
  • Example
  • If normal range is: [0.3, 0.8]
  • During CPU overload, the value is: 0.1
  • Under a Gaussian distribution, 0.1 looks normal
  • Under a Beta distribution, 0.1 is anomalous
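The kernel-choice point above can be illustrated numerically: fit both a Gaussian and a Beta distribution to ratio-KPI history confined to [0.3, 0.8], then compare the tail probability each assigns to an overload reading of 0.1. This is only a sketch (uniform synthetic history, simple parametric fits rather than FluxRank's actual kernels):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
idle = rng.uniform(0.3, 0.8, 1000)   # normal CPU-idle ratio samples

# Fit both candidate distributions to the same history.
mu, sigma = idle.mean(), idle.std()
a, b, loc, scale = stats.beta.fit(idle, floc=0, fscale=1)

x = 0.1  # CPU-idle ratio observed during overload
p_gauss = stats.norm.cdf(x, mu, sigma)  # P(X <= 0.1) under Gaussian
p_beta = stats.beta.cdf(x, a, b)        # P(X <= 0.1) under Beta
print(p_beta < p_gauss)  # Beta assigns the smaller tail probability
```

Because the Beta density respects the [0, 1] support of a ratio KPI, it treats values near the boundary as more surprising than a Gaussian does, yielding a larger (more anomalous) change score for the same observation.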