SLIDE 1

ParaStack: Efficient Hang Detection for MPI Programs at Large Scale

Hongbo Li, Zizhong Chen & Rajiv Gupta

SLIDE 2

Outline: Question, Solution, Evaluation

SLIDE 3

Outline: Question, Solution, Evaluation

Question: Program Hang, Resource Wastage, Current Solution

SLIDE 4

Execution in Batch Mode

[Figure: execution timeline with process IDs on the vertical axis and time on the horizontal axis; shaded bars mark occupied supercomputer time.]

Processes communicate via message passing (MPI).

SLIDE 5

Program Hang Occurs

Program hang: a type of bug whose occurrence stalls the program’s execution. The root cause can lie in a single process, e.g., process 0 (incorrect thread-level synchronization, an infinite loop), or in all processes (e.g., a communication deadlock across all processes).

[Figure: execution timeline, process ID vs. time.]

SLIDE 6

Hang Causes Resource Wastage

[Figure: execution timeline, process ID vs. time, annotated "Resource waste" and "Large scale".]

Negative impact: significant resource wastage at large scale.

SLIDE 7

Solution: Hang Detection

Release resources when a hang is detected.

Shorter detection delay → bigger saving.

[Figure: execution timeline, process ID vs. time, marking the detection delay and the resulting saving.]

SLIDE 8

Traditional Detection Method

Timeout is a commonly used method based on various metrics; e.g., IO-watchdog monitors how often a program writes. Setting a good timeout is hard due to the following two dilemmas:

Small timeout → large savings; too small a timeout → false alarms.
Large timeout → avoids false positives; too large a timeout → large wastage.

SLIDE 9

Outline: Question, Solution, Evaluation

Solution: Statistical Model, Two Problems

SLIDE 10

ParaStack

Unlike timeout methods, ParaStack does not guess without evidence. It detects hangs based on runtime history.

SLIDE 11

Basic Concept

while (…) { user code; MPI_Function(); }

Definition: $S_{out} = N_{out} / N_{total}$, where $N_{out}$ denotes the number of processes executing inside user code and $N_{total}$ denotes the total number of processes employed in the run.
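
As a minimal sketch of the definition of $S_{out}$, the snippet below computes it from one snapshot of process states. The `in_user_code` flags are a hypothetical input; how each process is actually inspected is not described on this slide.

```python
def s_out(in_user_code):
    """Return S_out = N_out / N_total for one sample.

    in_user_code: list of booleans, one per monitored process
    (hypothetical input). True means the process is executing user
    code; False means it is inside an MPI function.
    """
    n_total = len(in_user_code)
    if n_total == 0:
        raise ValueError("at least one process is required")
    n_out = sum(1 for flag in in_user_code if flag)
    return n_out / n_total


# Example: 3 of 10 monitored processes are in user code.
snapshot = [True, False, False, True, False,
            False, False, True, False, False]
print(s_out(snapshot))
```

A value near 1 means most processes are computing; a value near 0 means most are waiting inside MPI calls.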

SLIDE 12

Dynamic Variation of Sout

A snippet of $S_{out}$ variation obtained by sampling at 1-millisecond intervals.

[Figure: three panels (LU, FT, SP), each plotting $S_{out}$ against the running timeline.]

SLIDE 13

When a Hang Occurs

[Figure: $S_{out}$ against the running timeline for a faulty LU run.]

$S_{out}$ variation of a faulty LU run, where the fault is simulated by a very long sleep injected at the left border of the red region. A program hang is characterized by two features: (1) very small $S_{out}$ and (2) consecutive observations of (1).

SLIDE 14

Suspicion

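$F(S_{out})$ is the empirical cumulative distribution function obtained from randomly sampling $S_{out}$.

Given a probability $\hat{p}$, we obtain the threshold $t = F^{-1}(\hat{p})$ and classify each observed value of $S_{out}$ into one of a pair of opposite random events: suspicion ($S_{out} \le t$) and non-suspicion ($S_{out} > t$).

Feature 1: small $S_{out}$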
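
A rough sketch of how a suspicion threshold could be derived from sampled history: the threshold is taken as the empirical $\hat{p}$-quantile of past $S_{out}$ samples. The quantile rule used here (smallest sample whose empirical CDF reaches $\hat{p}$) and the function names are assumptions for illustration, not the authors' implementation.

```python
import math


def empirical_quantile(samples, p_hat):
    """Smallest sample value v whose empirical CDF F(v) is >= p_hat."""
    if not samples or not 0.0 < p_hat <= 1.0:
        raise ValueError("need samples and 0 < p_hat <= 1")
    ordered = sorted(samples)
    # F(ordered[i]) >= (i + 1) / n, so take the first index reaching p_hat.
    index = math.ceil(p_hat * len(ordered)) - 1
    return ordered[index]


def is_suspicion(s_out_value, threshold):
    """A suspicion is an observation at or below the threshold."""
    return s_out_value <= threshold


# Hypothetical history of S_out samples from a healthy run.
history = [0.9, 0.1, 0.8, 0.3, 1.0, 0.7, 0.2, 0.6]
t = empirical_quantile(history, 0.25)
print(t, is_suspicion(0.1, t), is_suspicion(0.5, t))
```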

SLIDE 15

Significance Test of Hang

Geometric distribution. The probability of observing $X = k$ suspicions before the first occurrence of a non-suspicion is $P(X = k) = q^k (1 - q)$, where $q$ estimates the true suspicion probability $p$. Given the confidence level $1 - \alpha$, we claim a hang is detected if $P(X \ge k) = q^k \le \alpha$. Put simply: something is very likely wrong when a very rare event occurs.

Feature 1+2: consecutively small $S_{out}$

SLIDE 16

Whole Picture

[Figure annotation: the probability drops as consecutive suspicions are observed.]

SLIDE 17

Two Problems with the Model

(1) How to achieve random sampling?
(2) The observed suspicion probability ($\hat{p}$) doesn’t reflect the true probability ($p$), i.e., $p \neq \hat{p}$.

SLIDE 18

Random Sampling

Consecutive samplings are separated by a random time step derived from the sampling interval $I$ (a random component plus a fraction of $I$). Too small $I$ → lack of randomness; bigger $I$ → better randomness. Solution: use the runs test to check the randomness of the sample sequence, and double $I$ whenever the sequence is found to lack randomness, until randomness is assured.

[Figure: $S_{out}$ against the running timeline with two sets of sampling points. ✗: lack of randomness; ✓: better randomness.]

SLIDE 19

Random Sampling (Cont.)

Runs test: a standard test that checks the randomness of a two-valued data sequence. Procedure:

1) Calculate the average of the sample sequence.
2) Denote values bigger than the average as (+) and those smaller than the average as (-).
3) Count the number of runs ($R$); a run is a series of consecutive (+) or consecutive (-).
4) Too small or too large $R$ → the sequence lacks randomness (significance test).

SLIDE 20

Random Sampling (Cont.)

Example. We have a sample sequence as

0.2 0.1 0.1 0.2 0.1 0.1 0.0 0.0 0.8 0.9 1.0 0.8 0.9 0.1 0.9 0.9,

which can be transformed as below

− − − − − − − − + + + + + − + + .

Its average is 0.44375, the non-rejection region at 95% confidence is (4, 14), and the number of runs is $R = 4$. As $R$ is outside the non-rejection region, we claim the sampling is not random and thus double $I$.
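
The example can be checked with a short sketch. Note one swapped-in technique: the slide uses tabulated critical values, i.e., the (4, 14) non-rejection region, whereas this code uses the large-sample normal approximation of the runs test. Both reject randomness for this sequence. The function name and the handling of values equal to the average are assumptions.

```python
import math


def runs_test(samples, z_critical=1.96):
    """Runs test via the normal approximation.

    Returns (mean, runs, z, is_random).
    """
    mean = sum(samples) / len(samples)
    # Values equal to the mean are skipped (an assumption; the procedure
    # only covers values above or below the average).
    signs = [v > mean for v in samples if v != mean]
    n_plus = sum(signs)
    n_minus = len(signs) - n_plus
    if n_plus == 0 or n_minus == 0:
        raise ValueError("need values on both sides of the mean")
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    n = n_plus + n_minus
    expected = 2 * n_plus * n_minus / n + 1
    variance = (2 * n_plus * n_minus * (2 * n_plus * n_minus - n)
                / (n * n * (n - 1)))
    if variance == 0:
        raise ValueError("too few samples for the test")
    z = (runs - expected) / math.sqrt(variance)
    return mean, runs, z, abs(z) <= z_critical


sequence = [0.2, 0.1, 0.1, 0.2, 0.1, 0.1, 0.0, 0.0,
            0.8, 0.9, 1.0, 0.8, 0.9, 0.1, 0.9, 0.9]
mean, runs, z, is_random = runs_test(sequence)

interval_ms = 400  # initial sampling interval I
if not is_random:
    interval_ms *= 2  # double I when the sequence lacks randomness
print(mean, runs, round(z, 3), is_random, interval_ms)
```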

SLIDE 21

$\hat{p} \neq p$

The difference ($e$) between the observed probability ($\hat{p}$) and the true probability ($p$) is closely related to the sample size $n$. Solution: we estimate $|p - \hat{p}| \le e$ at different sample-size levels with high confidence (95%):

$\hat{p} = 0.47$, $e = 0.3$ when $11 \le n < 19$
$\hat{p} = 0.27$, $e = 0.2$ when $19 \le n < 42$
$\hat{p} = 0.12$, $e = 0.1$ when $42 \le n < 86$
$\hat{p} = 0.06$, $e = 0.05$ when $86 \le n$

At each level, we use a different credible $\hat{p}$ to define what is a suspicion ($S_{out} \le F^{-1}(\hat{p})$). Put simply: the difference gets smaller as the sample size increases.
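
The sample-size levels lend themselves to a simple lookup. The sketch below is illustrative only; the function name is hypothetical, and returning `None` below 11 samples (treating it as "not enough data yet") is an assumption, since the slide lists no level for smaller samples.

```python
# (minimum sample size, observed probability p_hat, error bound e),
# taken from the four levels on this slide, largest level first.
LEVELS = [
    (86, 0.06, 0.05),
    (42, 0.12, 0.1),
    (19, 0.27, 0.2),
    (11, 0.47, 0.3),
]


def level_for(n):
    """Return (p_hat, e) for sample size n, or None if n < 11."""
    for min_n, p_hat, e in LEVELS:
        if n >= min_n:
            return p_hat, e
    return None


for n in (5, 11, 30, 42, 500):
    print(n, level_for(n))
```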

SLIDE 22

$\hat{p} \neq p$ (Cont.)

$|p - \hat{p}| \le e$ is not enough, as underestimating $p$, i.e., $\hat{p} < p$, leads to false positives.

Given $\hat{p} < p$, $\hat{p}^k$ (the probability that a program is still healthy) drops to the significance level $\alpha$ faster than $p^k$ as $k$ increases → more false positives.

We use $q = \hat{p} + e$ as an estimate of $p$ in the calculation of the hang probability ($q^k$), which guarantees that $q \ge p$ with 97.5% confidence.
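
A small numeric illustration of why the conservative estimate matters, using the level $\hat{p} = 0.27$, $e = 0.2$ and the significance level $\alpha = 0.1\%$ from the evaluation setting. The helper name is hypothetical.

```python
def consecutive_suspicions_needed(prob, alpha):
    """Smallest k such that prob**k <= alpha (requires 0 < prob < 1)."""
    k = 1
    while prob ** k > alpha:
        k += 1
    return k


alpha = 0.001          # significance level (0.1%)
p_hat, e = 0.27, 0.2   # level for 19 <= n < 42
q = p_hat + e          # conservative estimate of p

k_naive = consecutive_suspicions_needed(p_hat, alpha)
k_safe = consecutive_suspicions_needed(q, alpha)
print(k_naive, k_safe)
```

Using $\hat{p}$ directly would declare a hang after fewer consecutive suspicions than using $q$, which is where the extra false positives would come from if $\hat{p}$ underestimates $p$.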

SLIDE 23

Outline: Question, Solution, Evaluation

SLIDE 24

Goal

Trivial overhead
High accuracy & low false-positive rate
ParaStack > Timeout
Short detection delay
Enable resource saving when a hang occurs

SLIDE 25

Evaluation Setting

ParaStack’s default setting: 10 randomly selected processes are monitored. Significance level $\alpha = 0.1\%$. The initial maximal sampling interval is set as $I = 400$ ms.

Fault injection: a hang is simulated by injecting a long enough sleep() in either source code or binary.

Target programs: HPL, HPCG, NPB benchmark set.

SLIDE 26

Evaluation Setting (Cont.)

Used notations

AC: accuracy
FP: false positive rate
D: average delay
S: standard deviation of delays

Number of hang-injected runs using default ParaStack (platforms: Tardis, Tianhe-2, Stampede)

Scale 256: 800+, 20+
Scale 1024: 300+, 100+
Scale 4096: 50
Scale 8192: 5
Scale 16384: 3

SLIDE 27

Overhead, Accuracy & False Alarms

Average accuracy: over 99% for 100 runs of each program.

No false alarm reported in:

39.7 hours of hang-free runs at scale of 1024
66 hours of hang-free runs at scale of 256
all hang-injected runs

Overhead at scale 1024 with 5 runs of each program. We disable the automatic adaptation of $I$.

SLIDE 28

ParaStack vs. Timeout

Timeout baseline

A hang is claimed upon K consecutive observations of $S_{out} \le 0$ sampled at a fixed interval I. Like ParaStack, it samples only 10 processes to keep the overhead trivial.

10 runs per setting, 256 processes
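
The timeout baseline described on this slide can be sketched as a streak counter. The samples are assumed to have been taken at the fixed interval I; the function name and the list-based input are hypothetical.

```python
def timeout_detects_hang(s_out_samples, k, threshold=0.0):
    """Return the index of the sample that completes K consecutive
    observations of S_out <= threshold, or None if that never happens."""
    streak = 0
    for index, value in enumerate(s_out_samples):
        streak = streak + 1 if value <= threshold else 0
        if streak >= k:
            return index
    return None


# Hypothetical S_out samples taken at a fixed interval.
samples = [0.5, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0]
print(timeout_detects_hang(samples, k=3))
print(timeout_detects_hang(samples, k=4))
```

The effective timeout is K times I, so both values have to be chosen in advance, which is the tuning difficulty the statistical model is meant to avoid.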

SLIDE 29

ParaStack vs. Timeout (Cont.)

Setting of ParaStack:

P: ParaStack initializing $I$ as 400 ms.
P*: ParaStack initializing $I$ as 10 ms, which doesn’t deliver random sampling.

P* compares well with P, as ParaStack is able to automatically adjust $I$ to ensure a good model.

10 runs per setting, 256 processes

SLIDE 30

Detection Delay

The median of detection delays based on 100 runs per setting at scale 256 (unit: seconds).

Program   BT  CG  LU  SP  FT  MG  HPL  HPCG
Delay     4   6   3   3   13  3   4    5

SLIDE 31

Detection Delay (Cont.)

Delay on Tianhe-2 with 50 runs per setting.
Delay on Stampede with 20 runs per setting at scale 1024 and 10 runs per setting at scale 4096.

ParaStack detects hangs in a few seconds, far less than the commonly used 1-minute timeout.

SLIDE 32

Timesaving

10 faulty HPL runs, with the hang’s occurrence uniformly distributed over the program execution. On average, 35.5% time saving.

[Figure: bar chart of saved time (%) per hang, for hangs 1 to 10: 27.5%, 55.5%, 24.0%, 0.0%, 88.7%, 59.2%, 33.5%, 44.8%, 10.0%, 11.3%.]

SLIDE 33

Thank you!
