[PPT] - The Failure Trace Archive: Enabling Comparative Analysis of Diverse PowerPoint Presentation

SLIDE 1

The Failure Trace Archive: Enabling Comparative Analysis

f Diverse Distributed Systems

Derrick Kondo1, Bahman Javadi1, Alexandru Iosup2, Dick Epema2

1INRIA, France 2 TU Delft, The Netherlands

SLIDE 2

Motivation

Push toward experimental computer

science

SLIDE 3

Motivation

Push toward experimental computer

science

Hard to evaluate and compare algorithms

and models for fault-tolerance

Lack of public trace data sets
Lack of standard trace format
Lack of parsing and analytical tools

SLIDE 4

Motivation

Push toward experimental computer

science

Hard to evaluate and compare algorithms

and models for fault-tolerance

Lack of public trace data sets
Lack of standard trace format
Lack of parsing and analytical tools
Failures in distributed systems have increasingly

high negative impact and complex dynamics

SLIDE 5

Failure Trace Archive (FTA)

Availability traces of distributed systems,

differing in scale, volatility, and usage

Standard event-based format for failure

traces

Scripts and tools for parsing and analyzing

traces in svn repository

http://fta.inria.fr

SLIDE 6

Related Work

Resource

Data Sets

Format

Parsing Tools

Analysis Tools

Grid Observatory

Emphasis on EGEE

✗ ✗ ✗

Computer Failure Repo.

12 (mainly

clusters)

✗ ✗ ✗

Repo.

f Avail. Traces

5 (mainly P2P)

✓ ✓ ✗

Desktop Grid Archive

4 Desktop Grids

✓ ✗ ✗

FTA1

22

✓ ✓ ✓

1 FTA includes data sets of the former three resources, in

addition to providing several new data sets

✗ ✗

SLIDE 7

Enabled Studies

Comparing models/algorithms using the identical data

sets

SLIDE 8

Enabled Studies

Comparing models/algorithms using the identical data

sets

Evaluation of generality/specificity of model/algorithm

across different types of systems

SLIDE 9

Enabled Studies

Comparing models/algorithms using the identical data

sets

Evaluation of generality/specificity of model/algorithm

across different types of systems

Evaluation of the generality of a system trace

SLIDE 10

Enabled Studies

Comparing models/algorithms using the identical data

sets

Evaluation of generality/specificity of model/algorithm

across different types of systems

Evaluation of the generality of a system trace
Analysis of evolution of failures over time

SLIDE 11

Enabled Studies

Comparing models/algorithms using the identical data

sets

Evaluation of generality/specificity of model/algorithm

across different types of systems

Evaluation of the generality of a system trace
Analysis of evolution of failures over time
And many more...

SLIDE 12

Contributions

Description of FTA, trace format and

analysis toolbox

High-level statistical characterization of

failures in each data set

Show importance of public data sets and

methods via characterization of ambiguous data sets

SLIDE 13

Background Definitions

Failure: observed deviation from correct

system state

Availability (unavailability) interval:

continuous period that system is in correct state (incorrect state)

Error: system state (not externally visible) that

leads to failure

Fault: root cause of an error

SLIDE 14

FTA Schema

platform node component event_trace creator node_perf event_state component_type codes event_type codes event_end reason codes

SLIDE 15

FTA Schema

platform node component event_trace creator node_perf event_state component_type codes event_type codes event_end reason codes

Resource (versus job
r user) centric

SLIDE 16

FTA Schema

platform node component event_trace creator node_perf event_state component_type codes event_type codes event_end reason codes

Resource (versus job
r user) centric

SLIDE 17

FTA Schema

platform node component event_trace creator node_perf event_state component_type codes event_type codes event_end reason codes

Resource (versus job
r user) centric

SLIDE 18

FTA Schema

platform node component event_trace creator node_perf event_state component_type codes event_type codes event_end reason codes

Resource (versus job
r user) centric
Event-based

SLIDE 19

FTA Schema

platform node component event_trace creator node_perf event_state component_type codes event_type codes event_end reason codes

Resource (versus job
r user) centric
Event-based

SLIDE 20

FTA Schema

platform node component event_trace creator node_perf event_state component_type codes event_type codes event_end reason codes

Resource (versus job
r user) centric
Event-based

SLIDE 21

FTA Schema

platform node component event_trace creator node_perf event_state component_type codes event_type codes event_end reason codes

Resource (versus job
r user) centric
Event-based
Associated metadata

SLIDE 22

FTA Schema

platform node component event_trace creator node_perf event_state component_type codes event_type codes event_end reason codes

Resource (versus job
r user) centric
Event-based
Associated metadata

SLIDE 23

FTA Schema

platform node component event_trace creator node_perf event_state component_type codes event_type codes event_end reason codes

Resource (versus job
r user) centric
Event-based
Associated metadata

SLIDE 24

FTA Schema

platform node component event_trace creator node_perf event_state component_type codes event_type codes event_end reason codes

Resource (versus job
r user) centric
Event-based
Codes for different

components, events, and errors

Associated metadata

SLIDE 25

FTA Schema

platform node component event_trace creator node_perf event_state component_type codes event_type codes event_end reason codes

Resource (versus job
r user) centric
Event-based
Codes for different

components, events, and errors

Associated metadata

SLIDE 26

FTA Schema

platform node component event_trace creator node_perf event_state component_type codes event_type codes event_end reason codes

Resource (versus job
r user) centric
Event-based
Codes for different

components, events, and errors

Associated metadata

SLIDE 27

FTA Schema

platform node component event_trace creator node_perf event_state component_type codes event_type codes event_end reason codes

Resource (versus job
r user) centric
Event-based
Codes for different

components, events, and errors

Associated metadata
Balance between

completeness and sparseness

SLIDE 28

FTA Schema

platform node component event_trace creator node_perf event_state component_type codes event_type codes event_end reason codes

Resource (versus job
r user) centric
Event-based
Codes for different

components, events, and errors

Extensibility
Associated metadata
Balance between

completeness and sparseness

SLIDE 29

FTA Schema

platform node component event_trace creator node_perf event_state component_type codes event_type codes event_end reason codes

Resource (versus job
r user) centric
Raw, Tabbed, Relational database

(MySQL)

Event-based
Codes for different

components, events, and errors

Extensibility
Associated metadata
Balance between

completeness and sparseness

SLIDE 30

Data Quality Assessment

Syntactic: standard format library that checks

data types, number fields (automated)

Semantic: time moves forward and is non-
verlapping, state is valid (automated)
Visual: look at the distribution for outliers

(manual)

SLIDE 31

Data Sets

Usage (p2p, supercomputer, grids, desktop

PC’s)

Type (CPU, network, IO)
Scale (50-240,000 hosts)
Volatility (minutes to days)
Resolution (wrt failure detection)

SLIDE 42

Statistical Analysis

SLIDE 43

FTA Toolbox

initialize MySQL trace database query process finalize text html wiki latex

Makes it easy to run a set of statistical measures

across all the data sets

Provides library of functions that can be reused and

incorporated

Implemented in Matlab
svn checkout svn://scm.gforge.inria.fr/svn/fta/

toolbox

SLIDE 44

Failure Modelling

Approach
Model availability and unavailability

intervals, each with a single probability distribution

Assume availability and unavailability is

identically and independently distributed

Descriptive, not prescriptive

SLIDE 45

Distributions of Availability and Unavailability Intervals

SLIDE 46

Distributions of Availability and Unavailability Intervals

Qualitative Description

SLIDE 47

Model Fitting

For each candidate probability distribution
Compute parameters that maximize the

distribution’s likelihood

Measure goodness of fit using Kolomorov-

Smirnov (KS) and Anderson-Darling (AD) tests

Compute p-value using 30 samples. Take

average of 1000 p-values

SLIDE 48

P-Values for KS & AD Goodness-of-fit tests

Availability Unavailability

SLIDE 49

P-Values for KS & AD Goodness-of-fit tests

Availability Unavailability

SLIDE 56

P-Values for KS & AD Goodness-of-fit tests

Availability Unavailability Weibull and Log- Normal provide best fit

p-value < 0.05 or 0.10 ⇒ reject H0 that data came from fitted distribution

SLIDE 57

Parameters of Distributions

Availability Unavailability μ: mean, σ: std dev., k: shape, λ: scale

SLIDE 58

Parameters of Distributions

Availability Unavailability μ: mean, σ: std dev., k: shape, λ: scale k < 1, ∴ decreasing hazard rate

Overlapping intervals

union LANL0516B

Overlapping intervals

intersection ND07CPU

Definition of idleness

w/o user and CPU load for 15 mins

ND07CPUB

Definition of idleness

CPU load < 10%

SLIDE 63

QQ Plots for Ambiguous Data Sets

200 400 600 800 1000 100 200 300 400 500 600 700 800 900 1000 Quantiles of g5k06 fit Quantiles of g5k06B fit

SLIDE 64

QQ Plots for Ambiguous Data Sets

200 400 600 800 1000 100 200 300 400 500 600 700 800 900 1000 Quantiles of g5k06 fit Quantiles of g5k06B fit 50 100 150 200 20 40 60 80 100 120 140 160 180 200 Quantiles of lanl0516 fit Quantiles of lanl0516B fit

SLIDE 65

QQ Plots for Ambiguous Data Sets

200 400 600 800 1000 100 200 300 400 500 600 700 800 900 1000 Quantiles of g5k06 fit Quantiles of g5k06B fit 50 100 150 200 20 40 60 80 100 120 140 160 180 200 Quantiles of lanl0516 fit Quantiles of lanl0516B fit 50 100 150 200 250 300 50 100 150 200 250 300 Quantiles of nd07cpu fit Quantiles of nd07cpuB fit

SLIDE 66

Distribution Parameters for Ambiguous Data Sets

μ: mean, σ: std dev., k: shape, λ: scale

SLIDE 67

Distribution Parameters for Ambiguous Data Sets

Mean of G5K06B 1.5 times greater than G5K06 μ: mean, σ: std dev., k: shape, λ: scale

SLIDE 68

Distribution Parameters for Ambiguous Data Sets

μ: mean, σ: std dev., k: shape, λ: scale

SLIDE 69

Distribution Parameters for Ambiguous Data Sets

Gamma scale parameter often significantly different μ: mean, σ: std dev., k: shape, λ: scale

SLIDE 70

How to identify interpretation?

Parsing script is the exact interpretation
Meaning explained in comments
Publicly accessible in svn
Format supports different interpretations of availability
Can have multiple event_trace’s corresponding to

different definitions availability

So each interpretation can be uniquely identified

SLIDE 71

How to resolve differences

f interpretation?
Determine which interpretation affects the
application. (E.g. G5K06)
Determine most common interpretation,
r interpretation that is the lowest

common denominator (E.g. ND07CPU)

Exclude period of ambiguity or post-

process it so that it is consistent with rest

f data set (E.g. LANL05)

SLIDE 72

platform node component event_trace creator node_perf event_state component_type codes event_type codes event_end reason codes

Future Directions

Call to arms: trace data exists in

many production environments, but not always accessible

Include more production systems
Types of failures
Causes of failures
State before failures
Automated trace collection
Failure models and algorithms
Integration of job and resource

failures

SLIDE 73

Acknowledgements

All contributors of trace data to the FTA
INRIA ALEAE project directed by

Emmanuel Jeannot

Feedback from Cecile Germain, Eric Heien,

Artur Andrzejak, anonymous reviewers

SLIDE 74

Summary

FTA: Data Sets, Format, Tools
http://fta.inria.fr
High-level modelling and statistical

characterization of 9 data sets

Slight differences in interpretation make

significant difference in model

Got data? Questions? Please email

derrick.kondo@inria.fr or any other FTA team member

SLIDE 75