The Failure Trace Archive: Enabling Comparative Analysis of Diverse Distributed Systems - PowerPoint PPT Presentation



SLIDE 1

The Failure Trace Archive: Enabling Comparative Analysis of Diverse Distributed Systems

Derrick Kondo¹, Bahman Javadi¹, Alexandru Iosup², Dick Epema²

¹INRIA, France   ²TU Delft, The Netherlands

SLIDE 4

Motivation

  • Push toward experimental computer science
  • Hard to evaluate and compare algorithms and models for fault-tolerance
  • Lack of public trace data sets
  • Lack of standard trace format
  • Lack of parsing and analytical tools
  • Failures in distributed systems have increasingly high negative impact and complex dynamics

SLIDE 5

Failure Trace Archive (FTA)

  • Availability traces of distributed systems, differing in scale, volatility, and usage
  • Standard event-based format for failure traces
  • Scripts and tools for parsing and analyzing traces in svn repository

http://fta.inria.fr

SLIDE 6

Related Work

  Resource                | Data Sets            | Format | Parsing Tools | Analysis Tools
  ------------------------|----------------------|--------|---------------|---------------
  Grid Observatory        | Emphasis on EGEE     |   ✗    |      ✗        |       ✗
  Computer Failure Repo.  | 12 (mainly clusters) |   ✗    |      ✗        |       ✗
  Repo. of Avail. Traces  | 5 (mainly P2P)       |   ✓    |      ✓        |       ✗
  Desktop Grid Archive    | 4 Desktop Grids      |   ✓    |      ✗        |       ✗
  FTA¹                    | 22                   |   ✓    |      ✓        |       ✓

¹ FTA includes data sets of the former three resources, in addition to providing several new data sets

SLIDE 11

Enabled Studies

  • Comparing models/algorithms using identical data sets
  • Evaluation of generality/specificity of a model/algorithm across different types of systems
  • Evaluation of the generality of a system trace
  • Analysis of evolution of failures over time
  • And many more...
SLIDE 12

Contributions

  • Description of FTA, trace format and

analysis toolbox

  • High-level statistical characterization of

failures in each data set

  • Show importance of public data sets and

methods via characterization of ambiguous data sets

SLIDE 13

Background Definitions

  • Failure: observed deviation from correct system state
  • Availability (unavailability) interval: continuous period that the system is in correct (incorrect) state
  • Error: system state (not externally visible) that leads to failure
  • Fault: root cause of an error
SLIDE 29

FTA Schema

[Schema diagram: tables platform, node, component, event_trace, creator, node_perf, event_state, plus code tables component_type, event_type, and event_end_reason]

  • Resource (versus job or user) centric
  • Raw, Tabbed, and Relational database (MySQL) representations
  • Event-based
  • Codes for different components, events, and errors
  • Extensibility
  • Associated metadata
  • Balance between completeness and sparseness
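As a concrete illustration of an event-based, resource-centric record, a row of a tabbed event trace could be parsed along these lines. This is only a sketch: the column order, field names, and event-type codes below are assumptions for illustration, not the actual FTA layout (see http://fta.inria.fr for the real format).

```python
# Hedged sketch: parse one tab-separated availability event.
# The field order here is hypothetical, chosen only to illustrate
# the event-based, resource-centric idea.
from dataclasses import dataclass

@dataclass
class Event:
    event_id: int
    node_id: int       # resource-centric: events attach to a node
    event_type: int    # coded: e.g. 1 = available, 0 = unavailable (assumed)
    start_time: float  # epoch seconds
    end_time: float

def parse_event(line: str) -> Event:
    f = line.rstrip("\n").split("\t")
    return Event(int(f[0]), int(f[1]), int(f[2]), float(f[3]), float(f[4]))

ev = parse_event("7\t42\t1\t1000.0\t3600.0")
```

The interval length is then simply `ev.end_time - ev.start_time`, which is what the statistical analysis below operates on.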

SLIDE 30

Data Quality Assessment

  • Syntactic: standard format library that checks data types and number of fields (automated)
  • Semantic: time moves forward and is non-overlapping, state is valid (automated)
  • Visual: look at the distribution for outliers (manual)
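The automated semantic checks can be sketched as follows. This is an illustrative Python version, not the FTA tooling itself (which is Matlab), and the `(start, end, state)` tuple layout and state codes are assumed for the example.

```python
# Hedged sketch of the semantic checks: time moves forward,
# intervals do not overlap, and every state code is valid.
VALID_STATES = {0, 1}  # e.g. 0 = unavailable, 1 = available (assumed codes)

def semantic_check(intervals):
    """intervals: list of (start, end, state) tuples, sorted by start time."""
    prev_end = float("-inf")
    for start, end, state in intervals:
        if end <= start:           # time must move forward
            return False
        if start < prev_end:       # intervals must not overlap
            return False
        if state not in VALID_STATES:
            return False
        prev_end = end
    return True
```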

SLIDE 31

Data Sets

  • Usage (P2P, supercomputer, grids, desktop PCs)
  • Type (CPU, network, I/O)
  • Scale (50-240,000 hosts)
  • Volatility (minutes to days)
  • Resolution (with respect to failure detection)

SLIDE 32

Currently 21 Data Sets

http://fta.inria.fr

SLIDE 42

Statistical Analysis

SLIDE 43

FTA Toolbox

[Toolbox pipeline diagram: initialize → query MySQL trace database → process → finalize, with output to text, html, wiki, and latex]

  • Makes it easy to run a set of statistical measures across all the data sets
  • Provides a library of functions that can be reused and incorporated
  • Implemented in Matlab
  • svn checkout svn://scm.gforge.inria.fr/svn/fta/toolbox

SLIDE 44

Failure Modelling

  • Approach
  • Model availability and unavailability intervals, each with a single probability distribution
  • Assume availability and unavailability intervals are identically and independently distributed
  • Descriptive, not prescriptive
SLIDE 46

Distributions of Availability and Unavailability Intervals

Qualitative Description

SLIDE 47

Model Fitting

  • For each candidate probability distribution
  • Compute parameters that maximize the distribution's likelihood
  • Measure goodness of fit using Kolmogorov-Smirnov (KS) and Anderson-Darling (AD) tests
  • Compute p-value using 30 samples; take the average of 1000 p-values
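The fitting procedure above can be sketched with SciPy. This is a hedged reimplementation, not the FTA Matlab toolbox: the default candidate distribution (`weibull_min`), the fixed zero location parameter, and the use of the KS test only are choices made for the example.

```python
import numpy as np
from scipy import stats

def fit_and_test(intervals, dist=stats.weibull_min,
                 n_sub=30, n_rep=1000, seed=0):
    """MLE-fit a candidate distribution to interval lengths, then
    estimate a KS p-value by averaging over many 30-sample
    subsamples, following the procedure described on the slide."""
    params = dist.fit(intervals, floc=0)  # maximize likelihood
    rng = np.random.default_rng(seed)
    pvals = [stats.kstest(rng.choice(intervals, size=n_sub, replace=False),
                          dist.name, args=params).pvalue
             for _ in range(n_rep)]
    return params, float(np.mean(pvals))
```

Subsampling before testing matters here: with tens of thousands of intervals, a raw KS test rejects every candidate model for tiny deviations, so the slide's 30-sample averaging gives a more useful notion of fit.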

SLIDE 48

P-Values for KS & AD Goodness-of-Fit Tests

[Figure: p-values of the fitted distributions for availability and unavailability intervals]

p-value < 0.05 or 0.10 ⇒ reject H0 that data came from fitted distribution

  • (Un)availability generally not heavy-tailed
  • Exponential usually not a good fit
  • Gamma a good fit; amenable for Markov Models
  • Weibull and Log-Normal provide best fit

SLIDE 57

Parameters of Distributions

[Table: fitted parameters for availability and unavailability intervals; μ: mean, σ: std. dev., k: shape, λ: scale]

  • k < 1, ∴ decreasing hazard rate
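The "k < 1, ∴ decreasing hazard rate" remark follows from the Weibull hazard h(t) = (k/λ)(t/λ)^(k−1), which decreases in t whenever the shape k < 1. A quick numeric check, with parameter values chosen arbitrarily for illustration:

```python
from scipy import stats

def weibull_hazard(t, k, lam):
    """Hazard rate h(t) = pdf(t) / survival(t) for Weibull(shape=k, scale=lam)."""
    d = stats.weibull_min(c=k, scale=lam)
    return d.pdf(t) / d.sf(t)

# With shape k < 1 the hazard decreases over time: the longer a
# machine has been up, the less likely its interval is to end soon.
h_early = weibull_hazard(1.0, k=0.5, lam=10.0)
h_late = weibull_hazard(10.0, k=0.5, lam=10.0)
```

This is why the fitted k < 1 values matter for scheduling: long-lived availability intervals are good predictors of continued availability.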

SLIDE 59

Can different interpretations of trace data sets affect the model?

SLIDE 62

Ambiguous Data Sets

  Data Set  | Ambiguity                              | Interpretation
  ----------|----------------------------------------|-----------------------------------
  G5K06     | Monitored state is an error or failure | error
  G5K06B    | Monitored state is an error or failure | failure
  LANL0516  | Overlapping intervals                  | union
  LANL0516B | Overlapping intervals                  | intersection
  ND07CPU   | Definition of idleness                 | w/o user and CPU load for 15 mins
  ND07CPUB  | Definition of idleness                 | CPU load < 10%
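The LANL0516 vs. LANL0516B ambiguity, whether overlapping unavailability intervals are merged by union or by intersection, can be made concrete with a small sketch (the `(start, end)` pair representation is assumed for the example):

```python
# Hedged sketch: two readings of a pair of overlapping
# unavailability intervals, as in LANL0516 vs. LANL0516B.
def union(a, b):
    """Merged interval covering both, or None if a and b are disjoint."""
    if a[1] < b[0] or b[1] < a[0]:
        return None
    return (min(a[0], b[0]), max(a[1], b[1]))

def intersection(a, b):
    """Overlap of the two intervals, or None if they are disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None

a, b = (0, 60), (30, 90)  # two overlapping outage reports, in minutes
```

Here the union reading yields one 90-minute outage while the intersection reading yields a 30-minute one, which is exactly the kind of divergence the fitted models below reflect.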

SLIDE 65

QQ Plots for Ambiguous Data Sets

[QQ plots comparing fitted quantiles of the paired interpretations: g5k06 vs. g5k06B, lanl0516 vs. lanl0516B, and nd07cpu vs. nd07cpuB]
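A QQ comparison like the ones plotted can be computed directly; this NumPy sketch is illustrative only, and the quantile grid is chosen arbitrarily:

```python
import numpy as np

def qq_points(sample_a, sample_b, n=9):
    """Matching quantiles of two samples. Points near the diagonal
    y = x mean the two interpretations yield similar distributions;
    systematic departure means the interpretation changed the model."""
    qs = np.linspace(0.1, 0.9, n)
    return np.quantile(sample_a, qs), np.quantile(sample_b, qs)
```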

SLIDE 66

Distribution Parameters for Ambiguous Data Sets

[Table: fitted parameters per interpretation; μ: mean, σ: std. dev., k: shape, λ: scale]

  • Mean of G5K06B is 1.5 times greater than that of G5K06
  • Gamma scale parameter often significantly different

SLIDE 70

How to identify interpretation?

  • Parsing script is the exact interpretation
  • Meaning explained in comments
  • Publicly accessible in svn
  • Format supports different interpretations of availability
  • Can have multiple event_trace's corresponding to different definitions of availability
  • So each interpretation can be uniquely identified

SLIDE 71

How to resolve differences of interpretation?

  • Determine which interpretation affects the application (e.g. G5K06)
  • Determine the most common interpretation, or the interpretation that is the lowest common denominator (e.g. ND07CPU)
  • Exclude the period of ambiguity, or post-process it so that it is consistent with the rest of the data set (e.g. LANL05)

SLIDE 72

Future Directions

  • Call to arms: trace data exists in many production environments, but is not always accessible
  • Include more production systems
  • Types of failures
  • Causes of failures
  • State before failures
  • Automated trace collection
  • Failure models and algorithms
  • Integration of job and resource failures

SLIDE 73

Acknowledgements

  • All contributors of trace data to the FTA
  • INRIA ALEAE project directed by Emmanuel Jeannot
  • Feedback from Cecile Germain, Eric Heien, Artur Andrzejak, and anonymous reviewers

SLIDE 74

Summary

  • FTA: Data Sets, Format, Tools
  • http://fta.inria.fr
  • High-level modelling and statistical characterization of 9 data sets
  • Slight differences in interpretation make a significant difference in the model
  • Got data? Questions? Please email derrick.kondo@inria.fr or any other FTA team member

SLIDE 75

Thank you