SLIDE 1

Black-Box Problem Diagnosis in Parallel File Systems

Michael P. Kasick¹

Jiaqi Tan², Rajeev Gandhi¹, Priya Narasimhan¹

¹Carnegie Mellon University   ²DSO National Labs, Singapore

February 24, 2010

SLIDE 2

Problem Diagnosis Goals

To diagnose problems in off-the-shelf parallel file systems

Environmental performance problems: disk & network faults
Target file systems: PVFS & Lustre

To develop methods applicable to existing deployments

Application transparency: avoid code-level instrumentation
Minimal overhead, training, and configuration
Support for arbitrary workloads: avoid models, SLOs, etc.

SLIDE 6

Motivation: Real Problem Anecdotes

Problems motivated by PVFS developers’ experiences

From Argonne’s Blue Gene/P PVFS cluster

“Limping-but-alive” server problems

No errors reported, can't identify faulty node with logs
Single faulty server impacts overall system performance

Storage-related problems:

Accidental launch of rogue processes decreases throughput
Buggy RAID controller issues patrol reads when not at idle

Network-related problems:

Faulty-switch ports corrupt packets, fail CRC checks
Overloaded switches drop packets but pass diagnostic tests

SLIDE 7

Outline

1. Introduction
2. Experimental Methods
3. Diagnostic Algorithm
4. Results
5. Conclusion

SLIDE 8

Target Parallel File Systems

Aim to support I/O-intensive applications
Provide high-bandwidth, concurrent access

SLIDE 9

Parallel File System Architecture

[Diagram: clients connect over the network to I/O servers (ios0, ios1, ios2, …, iosN) and metadata servers (mds0, …, mdsM)]

One or more I/O and metadata servers
Clients communicate with every server

No server-server communication

SLIDE 10

Parallel File System Data Striping

[Diagram: logical file chunks … 5 4 3 2 1 are striped across physical files on the I/O servers: Server 1 holds chunks 3, 6, …; Server 2 holds 1, 4, 7, …; Server 3 holds 2, 5, 8, …]

Client stripes local file into 64 kB–1 MB chunks
Writes to each I/O server in round-robin order (sketched below)
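To make the round-robin layout concrete, here is a minimal Python sketch (illustrative only, not the PVFS or Lustre implementation) that maps a byte offset in the logical file to the I/O server and chunk holding it; the stripe size and server count are assumed values.

```python
# Round-robin striping sketch (stripe size and server count are assumptions).
STRIPE_SIZE = 64 * 1024    # 64 kB chunks (parallel file systems typically use 64 kB-1 MB)
NUM_SERVERS = 3

def locate(offset):
    """Map a byte offset in the logical file to (server, chunk on that server, offset in chunk)."""
    chunk = offset // STRIPE_SIZE        # global chunk number within the logical file
    server = chunk % NUM_SERVERS         # chunks are assigned to I/O servers round-robin
    local_chunk = chunk // NUM_SERVERS   # position of the chunk in that server's physical file
    return server, local_chunk, offset % STRIPE_SIZE

# The first chunks land on servers 0, 1, 2 and then wrap around.
for off in (0, STRIPE_SIZE, 2 * STRIPE_SIZE, 3 * STRIPE_SIZE):
    print(off, locate(off))
```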

SLIDE 11

Parallel File Systems: Empirical Insights (I)

Server behavior is similar for most requests

Large requests are striped across all servers
Small requests, in aggregate, equally load all servers

Hypothesis: Peer-similarity

Fault-free servers exhibit similar performance metrics
Faulty servers exhibit dissimilarities in certain metrics
Peer-comparison of metrics identifies faulty node

SLIDE 12

Example: Disk-Hog Fault

[Plot: Sectors Read (/s) vs. Elapsed Time (s); the faulty server's throughput diverges from the non-faulty servers' (peer-asymmetry)]

Strongly motivates peer-comparison approach

SLIDE 14

Parallel File Systems: Empirical Insights (II)

Faults manifest asymmetrically only on some metrics

Ex: A disk-busy fault manifests . . .

Asymmetrically on latency metrics (↑ on faulty, ↓ on fault-free)

[Plot: I/O Wait Time (ms) vs. Elapsed Time (s); the faulty server's latency rises while the non-faulty servers' falls (peer-asymmetry)]

SLIDE 15

Parallel File Systems: Empirical Insights (II)

Faults manifest asymmetrically only on some metrics

Ex: A disk-busy fault manifests . . .

Asymmetrically on latency metrics (↑ on faulty, ↓ on fault-free)
Symmetrically on throughput metrics (↓ on all nodes)

[Plot: Sectors Read (/s) vs. Elapsed Time (s); faulty and non-faulty servers decrease together (no asymmetry)]

SLIDE 16

Parallel File Systems: Empirical Insights (II)

Faults manifest asymmetrically only on some metrics

Ex: A disk-busy fault manifests . . .

Asymmetrically on latency metrics (↑ on faulty, ↓ on fault-free)
Symmetrically on throughput metrics (↓ on all nodes)

Faults distinguishable by which metrics are peer-divergent

SLIDE 17

Outline

1. Introduction
2. Experimental Methods
3. Diagnostic Algorithm
4. Results
5. Conclusion

SLIDE 18

System Model

Fault Model:

Non-fail-stop problems

“Limping-but-alive” performance problems

Problems affecting storage & network resources

Assumptions:

Hardware is homogeneous, identically configured
Workloads are non-pathological (balanced requests)
Majority of servers exhibit fault-free behavior

SLIDE 19

Instrumentation

Sampling of storage & network performance metrics

Sampled from /proc once every second (sampling sketch below)
Gathered from all server nodes

Storage-related metrics of interest:

Throughput: bytes read/sec, bytes written/sec
Latency: I/O wait time

Network-related metrics of interest:

Throughput: bytes received/sec, bytes transmitted/sec
Congestion: TCP sending congestion window
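A minimal sketch, assuming the standard Linux /proc/diskstats and /proc/net/dev layouts and hypothetical device names (sda, eth0), of how such once-per-second black-box samples could be gathered; the talk only states that metrics are sampled from /proc, so the actual collector may differ.

```python
# Once-per-second /proc sampling sketch (device names "sda"/"eth0" are assumptions).
import time

def disk_sample(dev="sda"):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                return {"sectors_read": int(fields[5]),      # sectors read
                        "sectors_written": int(fields[9]),   # sectors written
                        "io_ms": int(fields[12])}            # ms spent doing I/O

def net_sample(iface="eth0"):
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return {"rx_bytes": int(fields[0]), "tx_bytes": int(fields[8])}

prev_d, prev_n = disk_sample(), net_sample()
for _ in range(5):                       # sample once per second (bounded here for the sketch)
    time.sleep(1)
    d, n = disk_sample(), net_sample()
    print("sectors_read/s:", d["sectors_read"] - prev_d["sectors_read"],
          "rx_bytes/s:", n["rx_bytes"] - prev_n["rx_bytes"])
    prev_d, prev_n = d, n
```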

SLIDE 20

Workloads

ddw & ddr (dd write & read)

Use dd to write/read many GB to/from file
Large (order MB) I/O requests, saturating workload

iozonew & iozoner (IOzone write & read)

Ran in either write/rewrite or read/reread mode
Large I/O requests, workload transitions, fsync

postmark (PostMark)

Metadata-heavy, small reads/writes (single server)
Simulates email/news servers

SLIDE 22

Fault Types

Susceptible resources:

Storage: Access contention
Network: Congestion, packet loss (faulty hardware)

Manifestation mechanism:

Hog: Introduces new visible workload (server-monitored)
Busy/Loss: Alters existing workload (unmonitored)

            Storage     Network
Hog         disk-hog    write-network-hog, read-network-hog
Busy/Loss   disk-busy   receive-packet-loss, send-packet-loss

SLIDE 23

Experiment Setup

PVFS cluster configurations:

10 clients, 10 combined I/O & metadata servers
6 clients, 12 combined I/O & metadata servers

Lustre cluster configurations:

10 clients, 10 I/O servers, 1 metadata server
6 clients, 12 I/O servers, 1 metadata server

Each client runs the same workload for ≈600 s
Faults injected on a single server for 300 s
All workload & fault combinations run 10 times

SLIDE 24

Outline

1. Introduction
2. Experimental Methods
3. Diagnostic Algorithm
4. Results
5. Conclusion

SLIDE 25

Diagnostic Algorithm

Phase I: Node Indictment

Histogram-based approach (for most metrics)
Time series-based approach (congestion window)
Both use peer-comparison to indict the faulty node

Phase II: Root-Cause Analysis

Ascribes a root cause based on which metrics are affected

SLIDE 30

Phase I: Node Indictment (Histogram-Based)

Peer-compare metric PDFs (histograms) across servers

Compute the PDF of the metric for each server over a sliding window
Compute the Kullback-Leibler divergence for each server pair
Flag a pair as anomalous if its divergence exceeds a threshold
Flag a server if over half of its server pairs are anomalous (sketched below)
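A compact sketch of this indictment step, assuming NumPy. The bin count, the zero-bin smoothing, and the use of a symmetrized Kullback-Leibler divergence are illustrative choices rather than the paper's exact settings.

```python
# Histogram-based peer comparison sketch (bin count, smoothing, symmetrized KL are assumptions).
import numpy as np

def pdf(samples, bins, rng):
    """Histogram-based PDF estimate of one server's metric over the current window."""
    hist, _ = np.histogram(samples, bins=bins, range=rng)
    hist = hist + 1e-6                   # smooth empty bins so the KL divergence stays finite
    return hist / hist.sum()

def indict(windows, threshold, bins=20):
    """windows: dict server -> 1-D array of a metric over the sliding window.
    Returns the set of servers flagged as anomalous for this metric."""
    servers = list(windows)
    lo = min(np.min(w) for w in windows.values())
    hi = max(np.max(w) for w in windows.values())
    pdfs = {s: pdf(np.asarray(windows[s]), bins, (lo, hi)) for s in servers}

    anomalous_pairs = {s: 0 for s in servers}
    for i, a in enumerate(servers):
        for b in servers[i + 1:]:
            p, q = pdfs[a], pdfs[b]
            kl = 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
            if kl > threshold:           # this pair of servers is peer-divergent
                anomalous_pairs[a] += 1
                anomalous_pairs[b] += 1

    # indict a server when more than half of its pairs are anomalous
    return {s for s in servers if anomalous_pairs[s] > (len(servers) - 1) / 2}
```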

SLIDE 31

Threshold Selection

Fault-free training session (stress test)

Run ddw, ddr, & postmark under fault-free conditions
Find the minimum threshold that eliminates all anomalies (sketched below)

Histogram comparison uses per-server thresholds

Captures the performance profile of each server
Important to train on each cluster & file system

Train on performance-stressing workloads only

Metrics deviate most when servers are saturated
Less intense workloads exhibit more tightly coupled behavior
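A sketch of this training step under the same assumptions as the indictment sketch above: replay fault-free windows, record the largest pairwise divergence each server exhibits, and take that as the server's minimum threshold that flags no anomalies during training.

```python
# Per-server threshold training sketch (fault-free runs only; divergence as in the sketch above).
import numpy as np
from itertools import combinations

def train_thresholds(fault_free_windows, bins=20):
    """fault_free_windows: list of dicts, one per sliding window of the fault-free training
    runs (ddw, ddr, postmark), each mapping server -> metric samples for that window."""
    worst = {}
    for windows in fault_free_windows:
        servers = list(windows)
        lo = min(np.min(w) for w in windows.values())
        hi = max(np.max(w) for w in windows.values())
        pdfs = {s: pdf(np.asarray(windows[s]), bins, (lo, hi)) for s in servers}  # pdf() as above
        for a, b in combinations(servers, 2):
            p, q = pdfs[a], pdfs[b]
            kl = 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
            worst[a] = max(worst.get(a, 0.0), kl)
            worst[b] = max(worst.get(b, 0.0), kl)
    # the largest divergence a server shows while fault-free is the smallest
    # threshold that eliminates all of its training anomalies
    return worst
```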

SLIDE 32

Example: PVFS Throughput (Disk-Hog Fault)

[Plots: Sectors Read (/s) vs. Elapsed Time (s), one panel for PVFS + disk-hog and one for PVFS only; faulty vs. non-faulty servers]

Throughput diverges due to disk-hog on faulty server

SLIDE 33

Phase II: Root-Cause Analysis

Build table of metrics & faults affecting them:

Storage throughput:   disk-hog
Storage latency:      disk-hog, disk-busy
Network throughput:   network-hog, packet-loss (ACKs only)
Network congestion:   network-hog, packet-loss

Derive checklist that maps divergent metrics to cause:
Infers the resource responsible
Determines the mechanism by which the resource faulted

SLIDE 34

Checklist for Root-Cause Analysis

Peer-divergence in . . .
Storage throughput?    Yes: disk-hog fault     No: next question
Storage latency?       Yes: disk-busy fault    No: next question
Network throughput?∗   Yes: network-hog fault  No: next question
Network congestion?    Yes: packet-loss fault  No: no fault discovered

∗Must diverge in both receive & transmit, or in absence of congestion
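The checklist can be read as a small decision function over the set of peer-divergent metrics of an indicted server. The metric names below (storage_throughput, storage_latency, rx_bytes, tx_bytes, congestion_window) are hypothetical labels, not identifiers from the paper.

```python
# Checklist-as-code sketch (metric names are hypothetical labels).
def root_cause(divergent):
    """divergent: set of metric names found peer-divergent for the indicted server."""
    net_tp = {"rx_bytes", "tx_bytes"} & divergent
    if "storage_throughput" in divergent:
        return "disk-hog fault"
    if "storage_latency" in divergent:
        return "disk-busy fault"
    # network throughput counts only if it diverges in both directions,
    # or in the absence of congestion-window divergence (the checklist's footnote)
    if net_tp == {"rx_bytes", "tx_bytes"} or (net_tp and "congestion_window" not in divergent):
        return "network-hog fault"
    if "congestion_window" in divergent:
        return "packet-loss fault"
    return "no fault discovered"

print(root_cause({"storage_latency"}))                 # disk-busy fault
print(root_cause({"tx_bytes", "congestion_window"}))   # packet-loss fault (ACK-only throughput divergence)
```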

SLIDE 35

Outline

1. Introduction
2. Experimental Methods
3. Diagnostic Algorithm
4. Results
5. Conclusion

SLIDE 36

Results: Single Cluster

[Bar chart: accuracy rate (%) for each injected fault (control, diskhog, diskbusy, wnethog, rnethog, recvloss, sendloss), showing indicted and diagnosed true-positive and false-positive rates]

SLIDE 37

Results: Aggregate

[Bar chart: accuracy rate (%) for each cluster configuration (PVFS 10/10, PVFS 6/12, Lustre 10/10, Lustre 6/12), showing indicted and diagnosed true-positive and false-positive rates]

SLIDE 38

Results Summary

True-positives inconsistent across faults

Some faults are not observable for all workloads
Minimal performance effect where not observable

True- & false-positives inconsistent across clusters

Algorithm sensitive to imprecise thresholds
Rank metrics based on degree of dissimilarity

Strategy is promising in general

Instrumentation overhead:

< 1% increase in workload runtime, negligible

SLIDE 39

Outline

1. Introduction
2. Experimental Methods
3. Diagnostic Algorithm
4. Results
5. Conclusion

SLIDE 40

Future Work

Analysis: Improve diagnosis accuracy rates

Make analysis robust to imprecise thresholds

Real-world data: Deploy on a production system

Validate technique on real workloads, at scale

Coverage: Expand target problem class

Other sources of performance & non-performance faults

Instrumentation: Expand instrumentation

Additional black-box metrics, request sniffing & tracing

SLIDE 41

Summary

Problem diagnosis in parallel file systems

Illustrates use of OS-level metrics in diagnosis
Leverages peer-comparison to identify faulty nodes
Demonstrates root-cause analysis by metrics affected

Diagnosis method is applicable to existing deployments

Instrumentation is minimally invasive, low overhead
Fault-free training with stress tests

SLIDE 42

Peer-Comparison Scalability

Number of comparisons: n(n−1)/2 ⇒ O(n²)

Insight: Don't need to compare one node against all

Proposed solution:

Establish n−k partitions of k servers each
Perform peer-comparisons among servers in each partition
Repartition with a different grouping for each window

Solution comparisons: (n−k) · k(k−1)/2 ⇒ O(n) (sketched below)
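A sketch of one simple variant of this idea: shuffle the server list with a per-window seed, compare only within groups of k, and count the comparisons performed. The disjoint grouping and the random shuffle are assumptions; the slide's exact n−k partitioning scheme may differ.

```python
# Windowed partitioning sketch (disjoint groups of k with a per-window shuffle; an assumption).
import random

def partition(servers, k, window_index):
    """Split servers into groups of (about) k, using a different grouping for each window."""
    order = servers[:]
    random.Random(window_index).shuffle(order)
    return [order[i:i + k] for i in range(0, len(order), k)]

def comparisons(groups):
    """Peer comparisons performed: sum of k*(k-1)/2 per group, i.e. O(n) for fixed k."""
    return sum(len(g) * (len(g) - 1) // 2 for g in groups)

servers = [f"ios{i}" for i in range(12)]
groups = partition(servers, k=4, window_index=0)
print(groups, comparisons(groups))   # 3 groups of 4 -> 18 comparisons vs. 66 for all pairs
```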

SLIDE 43

Congestion Window Problem

No closely-coupled peer behavior

cwnd is intentionally noisy under normal conditions
Synchronized connections can't fully use link capacity
Can't compare histograms: too much variance

Congestion window packet-loss heuristic:

TCP responds to packet loss by halving cwnd
Exponential decay after multiple loss events
Log scale: each loss results in a linear cwnd decrease (sketched below)
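A sketch of a peer-comparison heuristic built on this observation, assuming a simple statistic: compare each server's windowed mean of log(cwnd), where every loss-driven halving appears as a constant-sized drop, against the median of its peers. The statistic and threshold are illustrative, not the paper's exact time-series algorithm.

```python
# cwnd peer-comparison sketch in log space (statistic and threshold are assumptions).
import math

def cwnd_outlier(cwnd_windows, threshold):
    """cwnd_windows: dict server -> list of cwnd samples (segments) over the current window.
    Flags the server whose mean log-cwnd sits far below the median of its peers."""
    mean_log = {s: sum(math.log(v) for v in w) / len(w) for s, w in cwnd_windows.items()}
    for s, m in mean_log.items():
        peers = sorted(v for t, v in mean_log.items() if t != s)
        peer_median = peers[len(peers) // 2]
        if peer_median - m > threshold:      # persistently halved cwnd relative to peers
            return s
    return None

# Hypothetical samples: "ios2" keeps halving its congestion window after repeated losses.
print(cwnd_outlier({"ios0": [40, 42, 38], "ios1": [41, 39, 40], "ios2": [20, 10, 5]}, threshold=0.7))
```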

SLIDE 44

Time Series Comparison Example

[Plot: Segments (TCP congestion window) vs. Elapsed Time (s), log-scale axis]

SLIDE 46

Time Series Comparison Example

[Plots: Segments (TCP congestion window, log scale) vs. Elapsed Time (s), shown as one panel per server]

SLIDE 48

Heterogeneous Hardware (ddr)

[Plot: I/O Wait Time (ms) vs. Elapsed Time (s) under ddr]

Disks are the same model but have different performance profiles

SLIDE 49

Load Imbalances (postmark)

[Plot: Bytes Received (B/s) vs. Elapsed Time (s) under postmark]

“/” resides on one metadata server, so all path lookups go there

SLIDE 50

Cross-Resource Influence (ddr)

[Plot: Segments (TCP congestion window, log scale) vs. Elapsed Time (s) under ddr; faulty vs. non-faulty servers]

A disk-busy fault also affects the faulty server's cwnd through unintentional synchronization

SLIDE 51

Delayed ACKs (ddw)

[Plot: Bytes Transmitted (B/s) vs. Elapsed Time (s) under ddw; faulty vs. non-faulty servers]

A packet-loss fault may also cause network throughput to diverge (delayed ACKs)

SLIDE 52

Results: PVFS 10/10 Cluster

[Bar chart: accuracy rate (%) for each injected fault (control, diskhog, diskbusy, wnethog, rnethog, recvloss, sendloss), showing indicted and diagnosed true-positive and false-positive rates]

SLIDE 53

Results: PVFS 6/12 Cluster

[Bar chart: accuracy rate (%) for each injected fault (control, diskhog, diskbusy, wnethog, rnethog, recvloss, sendloss), showing indicted and diagnosed true-positive and false-positive rates]

SLIDE 54

Results: Lustre 10/10 Cluster

[Bar chart: accuracy rate (%) for each injected fault (control, diskhog, diskbusy, wnethog, rnethog, recvloss, sendloss), showing indicted and diagnosed true-positive and false-positive rates]

SLIDE 55

Results: Lustre 6/12 Cluster

[Bar chart: accuracy rate (%) for each injected fault (control, diskhog, diskbusy, wnethog, rnethog, recvloss, sendloss), showing indicted and diagnosed true-positive and false-positive rates]
