SLIDE 1 The Failure Trace Archive: Enabling Comparative Analysis of Diverse Distributed Systems
Derrick Kondo (1), Bahman Javadi (1), Alexandru Iosup (2), Dick Epema (2)
(1) INRIA, France  (2) TU Delft, The Netherlands
SLIDE 2-4 Motivation
- Push toward experimental computer science
- Hard to evaluate and compare algorithms and models for fault tolerance
- Lack of public trace data sets
- Lack of a standard trace format
- Lack of parsing and analytical tools
- Failures in distributed systems have increasingly high negative impact and complex dynamics
SLIDE 5 Failure Trace Archive (FTA)
- Availability traces of distributed systems, differing in scale, volatility, and usage
- Standard event-based format for failure traces
- Scripts and tools for parsing and analyzing traces in an svn repository
http://fta.inria.fr
SLIDE 6 Related Work

Resource                 | Data Sets            | Format | Parsing Tools | Analysis Tools
Grid Observatory         | Emphasis on EGEE     | ✗      | ✗             | ✗
Computer Failure Repo.   | 12 (mainly clusters) | ✗      | ✗             | ✗
Repo.                    | 5 (mainly P2P)       | ✓      | ✓             | ✗
Desktop Grid Archive     | 4 Desktop Grids      | ✓      | ✗             | ✗
FTA (1)                  | 22                   | ✓      | ✓             | ✓

(1) FTA includes the data sets of the former three resources, in addition to providing several new data sets
SLIDE 7-11 Enabled Studies
- Comparing models/algorithms using identical data sets
- Evaluating the generality/specificity of a model/algorithm across different types of systems
- Evaluating the generality of a system trace
- Analyzing the evolution of failures over time
- And many more...
SLIDE 12 Contributions
- Description of the FTA, its trace format, and its analysis toolbox
- High-level statistical characterization of failures in each data set
- Demonstration of the importance of public data sets and methods via characterization of ambiguous data sets
SLIDE 13 Background Definitions
- Failure: observed deviation from the correct system state
- Availability (unavailability) interval: continuous period during which the system is in a correct (incorrect) state
- Error: system state (not externally visible) that leads to a failure
- Fault: root cause of an error
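The interval definitions above can be sketched in code. This is a minimal illustration, not part of the FTA toolbox: the event-tuple shape and state codes (1 = available, 0 = unavailable) are assumptions made for the example.

```python
# Illustrative sketch only: each event is (start_time, end_time, state),
# with state 1 = available and state 0 = unavailable (assumed codes).

def interval_lengths(events, state):
    """Return the lengths of all intervals recorded with the given state."""
    return [end - start for start, end, s in events if s == state]

# Toy trace for one node; timestamps in seconds.
events = [
    (0, 100, 1),    # available for 100 s
    (100, 130, 0),  # unavailable for 30 s
    (130, 400, 1),  # available for 270 s
]

availability = interval_lengths(events, 1)    # [100, 270]
unavailability = interval_lengths(events, 0)  # [30]
```

These interval-length samples are exactly what the statistical analysis later in the talk is fitted to.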
SLIDE 14-29 FTA Schema
[Schema diagram: platform, node, component, event_trace, creator, node_perf, event_state, component_type codes, event_type codes, event_end reason codes]
- Resource-centric (versus job- or user-centric)
- Event-based
- Codes for different components, events, and errors
- Extensibility
- Associated metadata
- Balance between completeness and sparseness
- Raw, Tabbed, and Relational database (MySQL) formats
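A hypothetical sketch of reading the event-based, tabbed representation. The field names below are borrowed from the schema tables named on the slide, not from the exact FTA column layout (documented at fta.inria.fr), so treat them as placeholders.

```python
import csv
import io

# Hypothetical field names, loosely following the schema diagram above;
# the real FTA tabbed format defines its own exact columns.
FIELDS = ["event_id", "node_id", "component_type", "event_type",
          "event_state", "event_start_time", "event_end_time"]

raw = ("1\t42\t1\t1\t1\t0\t3600\n"      # node 42 available for one hour
       "2\t42\t1\t1\t0\t3600\t3900\n")  # then unavailable for five minutes

reader = csv.DictReader(io.StringIO(raw), fieldnames=FIELDS, delimiter="\t")
events = list(reader)
```

Because the format is event-based and resource-centric, each row describes one state interval of one component, which makes cross-trace parsing uniform.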
SLIDE 30 Data Quality Assessment
- Syntactic: standard format library that checks data types and number of fields (automated)
- Semantic: time moves forward and is non-overlapping, state is valid (automated)
- Visual: inspect the distribution for outliers (manual)
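The automated semantic checks can be sketched as below. This assumes events are (start_time, end_time, state) tuples sorted by start time, with an assumed set of valid state codes; the actual FTA checker is part of the toolbox.

```python
# Sketch of the semantic checks: time moves forward, intervals do not
# overlap, and every state code is valid. Event shape and state codes
# are assumptions for illustration.

def semantically_valid(events, valid_states=frozenset({0, 1})):
    """Return True iff the event list passes the semantic checks."""
    for start, end, state in events:
        if end < start or state not in valid_states:
            return False  # time runs backward, or unknown state code
    for (_, end1, _), (start2, _, _) in zip(events, events[1:]):
        if start2 < end1:
            return False  # overlapping intervals
    return True
```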
SLIDE 31 Data Sets
- Usage (P2P, supercomputers, grids, desktop PCs)
- Type (CPU, network, I/O)
- Scale (50-240,000 hosts)
- Volatility (minutes to days)
- Resolution (w.r.t. failure detection)
SLIDE 32
Currently 21 Data Sets
http://fta.inria.fr
SLIDE 42 Statistical Analysis
SLIDE 43 FTA Toolbox
[Toolbox pipeline: initialize → query → process → finalize, over a MySQL trace database; output formats: text, html, wiki, latex]
- Makes it easy to run a set of statistical measures across all the data sets
- Provides a library of functions that can be reused and incorporated
- Implemented in Matlab
- svn checkout svn://scm.gforge.inria.fr/svn/fta/toolbox
SLIDE 44 Failure Modelling
- Approach
- Model availability and unavailability intervals, each with a single probability distribution
- Assume availability and unavailability intervals are independently and identically distributed
- Descriptive, not prescriptive
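The approach above can be sketched for the simplest candidate family. This is an illustration, not the toolbox's code: the toolbox fits several distribution families, while the exponential shown here is the only one with a closed-form maximum-likelihood estimate (rate = 1 / sample mean).

```python
import math

# Sketch: fit one probability distribution to the interval lengths via
# maximum likelihood. For the exponential family the MLE is closed-form.

def fit_exponential(intervals):
    """Return the MLE rate parameter and the log-likelihood at the MLE."""
    n = len(intervals)
    total = sum(intervals)
    rate = n / total                              # 1 / sample mean
    loglik = n * math.log(rate) - rate * total    # sum of log-densities
    return rate, loglik

rate, loglik = fit_exponential([2.0, 2.0, 2.0])   # rate == 0.5
```

Heavier-tailed families (Weibull, gamma, log-normal) need numerical likelihood maximization, which is what the toolbox automates.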
SLIDE 45-46 Distributions of Availability and Unavailability Intervals
[Plots with a qualitative description of the availability and unavailability distributions]
SLIDE 47 Model Fitting
- For each candidate probability distribution:
- Compute parameters that maximize the distribution's likelihood
- Measure goodness of fit using the Kolmogorov-Smirnov (KS) and Anderson-Darling (AD) tests
- Compute the p-value using 30 samples; take the average of 1000 p-values
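A stdlib-only sketch of this procedure for the KS test. The asymptotic p-value approximation below (the Numerical Recipes form) and the subsampling details are assumptions; the authors' toolbox may compute p-values differently.

```python
import math
import random

def ks_statistic(sample, cdf):
    """KS D: max distance between the empirical CDF and the model CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, abs((i + 1) / n - f), abs(f - i / n))
    return d

def ks_pvalue(d, n):
    """Asymptotic KS p-value (Numerical Recipes approximation)."""
    t = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if t < 0.3:
        return 1.0  # series converges poorly here; p is ~1 anyway
    s = sum(2.0 * (-1) ** (j - 1) * math.exp(-2.0 * j * j * t * t)
            for j in range(1, 101))
    return max(0.0, min(1.0, s))

def avg_subsample_pvalue(data, cdf, k=30, reps=1000, seed=0):
    """Average the p-value over many random subsamples of size k,
    as described on the slide (30 samples, 1000 repetitions)."""
    rng = random.Random(seed)
    ps = [ks_pvalue(ks_statistic(rng.sample(data, k), cdf), k)
          for _ in range(reps)]
    return sum(ps) / reps
```

Subsampling is used because with the full trace (thousands of intervals) even tiny model deviations drive the p-value to zero.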
SLIDE 48-56 P-Values for KS & AD Goodness-of-Fit Tests
[Plots: p-values per candidate distribution, for availability and unavailability]
p-value < 0.05 or 0.10 ⇒ reject H0 that the data came from the fitted distribution
- (Un)availability is generally not heavy-tailed
- The exponential is usually not a good fit
- The gamma is a good fit, and amenable to Markov models
- Weibull and log-normal provide the best fit
SLIDE 57-58 Parameters of Distributions
[Tables of fitted parameters for availability and unavailability; μ: mean, σ: std. dev., k: shape, λ: scale]
- k < 1, ∴ decreasing hazard rate
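The "k < 1 implies decreasing hazard rate" claim follows directly from the Weibull hazard function. With shape k and scale λ:

```latex
h(t) \;=\; \frac{f(t)}{1 - F(t)}
     \;=\; \frac{\frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1} e^{-(t/\lambda)^{k}}}{e^{-(t/\lambda)^{k}}}
     \;=\; \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1},
```

which is strictly decreasing in t exactly when k < 1 (and constant, the exponential case, when k = 1). Intuitively: the longer a machine has already been available, the lower its instantaneous failure rate.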
SLIDE 59 Can different interpretations of trace data sets affect the model?
SLIDE 60-62 Ambiguous Data Sets

Data Set   | Ambiguity                             | Interpretation
G5K06      | Monitored state is an error or failure | error
G5K06B     | Monitored state is an error or failure | failure
LANL0516   | Overlapping intervals                  | union
LANL0516B  | Overlapping intervals                  | intersection
ND07CPU    | Definition of idleness                 | w/o user and CPU load for 15 mins
ND07CPUB   | Definition of idleness                 | CPU load < 10%
SLIDE 63-65 QQ Plots for Ambiguous Data Sets
[QQ plots comparing fitted quantiles: g5k06 vs. g5k06B, lanl0516 vs. lanl0516B, nd07cpu vs. nd07cpuB]
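The data behind such a QQ plot is just paired quantiles of two samples (here, intervals under two interpretations of the same trace). The linear-interpolation quantile used below is one of several common conventions, chosen for this illustration.

```python
# Sketch of QQ-plot data: pair the quantiles of two samples. Points far
# from the y = x line indicate the samples follow different distributions.

def quantile(sorted_xs, p):
    """Linear-interpolation quantile of a pre-sorted sample, 0 <= p <= 1."""
    i = p * (len(sorted_xs) - 1)
    lo = int(i)
    hi = min(lo + 1, len(sorted_xs) - 1)
    return sorted_xs[lo] + (i - lo) * (sorted_xs[hi] - sorted_xs[lo])

def qq_points(a, b, num=19):
    """Paired quantiles of samples a and b for a num-point QQ plot."""
    a, b = sorted(a), sorted(b)
    ps = [j / (num + 1) for j in range(1, num + 1)]
    return [(quantile(a, p), quantile(b, p)) for p in ps]
```

For identical samples every point lies on y = x; the deviations visible on the slides are how the interpretation differences show up.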
SLIDE 66-69 Distribution Parameters for Ambiguous Data Sets
[Tables of fitted parameters; μ: mean, σ: std. dev., k: shape, λ: scale]
- Mean of G5K06B is 1.5 times greater than that of G5K06
- Gamma scale parameter is often significantly different
SLIDE 70 How to identify the interpretation?
- The parsing script is the exact interpretation
- Meaning explained in comments
- Publicly accessible in svn
- The format supports different interpretations of availability
- Can have multiple event_trace's corresponding to different definitions of availability
- So each interpretation can be uniquely identified
SLIDE 71 How to resolve differences of interpretation?
- Determine which interpretation affects the application (e.g., G5K06)
- Determine the most common interpretation, or the interpretation that is the lowest common denominator (e.g., ND07CPU)
- Exclude the period of ambiguity, or post-process it so that it is consistent with the rest
SLIDE 72 Future Directions
- Call to arms: trace data exists in many production environments, but is not always accessible
- Include more production systems
- Types of failures
- Causes of failures
- State before failures
- Automated trace collection
- Failure models and algorithms
- Integration of job and resource failures
SLIDE 73 Acknowledgements
- All contributors of trace data to the FTA
- INRIA ALEAE project directed by Emmanuel Jeannot
- Feedback from Cecile Germain, Eric Heien, Artur Andrzejak, and anonymous reviewers
SLIDE 74 Summary
- FTA: data sets, format, tools
- http://fta.inria.fr
- High-level modelling and statistical characterization of 9 data sets
- Slight differences in interpretation make a significant difference in the model
- Got data? Questions? Please email derrick.kondo@inria.fr or any other FTA team member
SLIDE 75
Thank you