SLIDE 1

Automated Analysis of Time Series Data to Understand Parallel Program Behaviors

Lai Wei and John Mellor-Crummey, Rice University, Houston, TX, USA

SLIDE 2

Background

Parallel computers of increasing scale

  • Support scientific simulations of increasing ambition

Performance of many applications fails to scale accordingly

  • Load imbalance, serialization, network congestion, etc.

Performance tools to understand application behaviors

  • Measure and present performance data
  • Used by experts to manually identify performance inefficiencies

2

SLIDE 5

Profile

Breaks down application run time into sources of costs

[Figure: run time of processes P0-P3 broken down over the calling context main() → init(), solve() → compute(), sync(); each process spends 9s in main(): 1s in init() and 8s in solve(), split roughly evenly (~4s each) between compute() and sync()]

Performance loss, why?
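As an illustration of what a profile captures, the toy sketch below (Python; not any tool's actual API) sums the per-process times from the example into totals per calling context. The per-process numbers are taken from the figure.

```python
# Illustrative sketch: a profile aggregates each process's run time
# into its calling contexts, discarding when the time was spent.
per_process = {
    "P0": {"init()": 1.0, "compute()": 4.0, "sync()": 4.0},
    "P1": {"init()": 1.0, "compute()": 4.1, "sync()": 3.9},
    "P2": {"init()": 1.0, "compute()": 3.9, "sync()": 4.1},
    "P3": {"init()": 1.0, "compute()": 4.0, "sync()": 4.0},
}

def profile(per_process):
    """Sum costs per calling context across all processes."""
    totals = {}
    for costs in per_process.values():
        for ctx, t in costs.items():
            totals[ctx] = totals.get(ctx, 0.0) + t
    return totals

totals = profile(per_process)
```

The totals show that compute() and sync() each account for 16s across the four processes, but the profile alone cannot say *why* sync() costs that much, which motivates the time-series view on the next slides.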

SLIDE 6

Time series

Presents application behavior over time

[Figure: timeline for P0-P3 at call path depth 1 over a 1s-9s time axis, showing init() followed by solve()]

SLIDE 9

Time series

Presents application behavior over time

[Figure: timeline for P0-P3 at call path depth 2 over a 1s-9s time axis: init(), then alternating compute() and sync() phases; uneven phase lengths across processes expose load imbalance]

SLIDE 12

Motivation

Experts manually examine time series

  • Understand how and why performance inefficiencies arise

Time series of large-scale parallel executions

  • Vast in three dimensions
    • Process
    • Time
    • Call path depth
  • Manual analysis is difficult if not impractical

12

SLIDE 13

Related work -- automated analysis

Analysis of profiles [Huck, SC’05] [Tallent, SC’10]

  • Often insufficient for diagnosing how and why parallel inefficiencies arise

Analysis of execution traces

  • Collecting instrumentation-based traces is costly in time and space
    • Fine-grained traces explode at large scale
  • Analysis at coarse granularity [Gonzalez, IPDPS’09] [Llort, IPDPS’10]
    • Still needs lots of manual effort
  • Analysis at fine granularity for short intervals [Geimer, CCPE’10] [Böhme, TOPC’16]
    • Requires prior knowledge for selective tracing

13

SLIDE 14

Our contribution

Automated analysis of sample-based time-series data

  • Feasible for large-scale programs
    • Data volume is manageable
  • Derive compact top-down summaries
    • Uncover patterns and variance
  • Direct attention to potential performance losses
    • Attribute losses to code regions where they originate

14

SLIDE 15

Approach

  1. Collect and prepare sample-based time-series for further analysis
     • Collect a time series of call paths with HPCToolkit
     • Organize each time series as a tree of program calling contexts
     • Identify iterative behaviors in the time series
  2. Build clusters across threads and loop iterations
  3. Quantify performance losses and attribute them to call paths

15

SLIDE 45

Collect call path samples over time

[Figure: call path samples T1-T8 taken at successive timer interrupts along the time axis; T1 samples main() → foo@13 → C@5, while T2-T8 sample call paths under main() → foo@13 → loop@6 that end in A@7, B@8, or C@9 at depth 3]

SLIDE 48

Construct a Temporal Context Tree

[Figure: the samples T1-T8 merged into a temporal context tree; main() and foo@13 cover T1-T8, loop@6 covers T2-T8, and the leaves C@5, A@7, B@8, and C@9 record which samples observed them]

One tree per thread
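The merge step above can be sketched as follows. This is an illustrative dict-based tree, not HPCToolkit's actual representation, and the sample-to-path mapping is only approximated from the figure:

```python
# Merge sampled call paths into a temporal context tree: each frame on
# a sampled path becomes a node that records the samples observing it.
# The concrete paths below are hypothetical, approximated from the figure.
samples = {
    "T1": ["main()", "foo@13", "C@5"],
    "T2": ["main()", "foo@13", "loop@6", "A@7"],
    "T3": ["main()", "foo@13", "loop@6", "B@8"],
    "T4": ["main()", "foo@13", "loop@6", "A@7"],
    "T5": ["main()", "foo@13", "loop@6", "A@7"],
    "T6": ["main()", "foo@13", "loop@6", "A@7"],
    "T7": ["main()", "foo@13", "loop@6", "C@9"],
    "T8": ["main()", "foo@13", "loop@6", "A@7"],
}

def build_tct(samples):
    """Merge sampled call paths into one tree (per thread)."""
    root = {"frame": "<root>", "samples": [], "children": {}}
    for tid, path in samples.items():
        node = root
        for frame in path:
            node = node["children"].setdefault(
                frame, {"frame": frame, "samples": [], "children": {}})
            node["samples"].append(tid)
    return root

tct = build_tct(samples)
```

With this structure, main() and foo@13 end up covering T1-T8 while loop@6 covers T2-T8, mirroring the tree on the slide.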

SLIDE 52

Identify iterations

[Figure: loop@6 spans source lines 6-9: begin of loop@6, then A@7, B@8, C@9; the samples under loop@6 (T2-T8) are split into iterations #0, #1, #2]

Insert a new iteration when a back edge must have been taken

Use ParseAPI to analyze binaries
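A toy version of the back-edge rule is sketched below. The real analysis uses loop structure recovered from the binary by ParseAPI; this sketch only uses sampled source lines within one loop body:

```python
# Inside a loop body, a sample at a source line *before* the previous
# sample's line can only occur if the loop's back edge was taken, so it
# starts a new iteration. (Equal lines are conservatively kept in the
# same iteration, since no back edge is provably taken.)
def split_iterations(sampled_lines):
    iterations, current, prev = [], [], None
    for line in sampled_lines:
        if prev is not None and line < prev:  # back edge must have been taken
            iterations.append(current)
            current = []
        current.append(line)
        prev = line
    if current:
        iterations.append(current)
    return iterations

# Samples at lines 7, 8, 7, 9, 7, 8 inside loop@6 split into
# iterations #0, #1, #2.
```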

SLIDE 53

Approach

  1. Collect and prepare sample-based time-series for further analysis
     • Collect a time series of call paths with HPCToolkit
     • Organize each time series as a tree of program calling contexts
     • Identify iterative behaviors in the time series
  2. Build clusters across threads and loop iterations
  3. Quantify performance losses and attribute them to call paths

53

SLIDE 54

Clustering

Objective

  • Concisely summarize behaviors of a collection of threads executing many iterations
    • Represent a large set of instances with a few representatives
  • Identify variations across threads and iterations
    • Variations may indicate performance bottlenecks
    • Is there any variation? How large is it? Where does variation arise?

Steps

  • Quantify differences between Temporal Context Trees (TCTs)
  • K-farthest clustering [Bahmani, BIG DATA’15]
    • Time complexity = O(N*K*G); N = number of instances, K = number of clusters, G = size of TCTs
    • Multi-level clustering ✔
    • Parallelization ✔

54
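The greedy farthest-point scheme behind this step can be sketched as follows. This is a hedged illustration on scalars with a plain absolute-difference metric; the paper's version compares TCTs, where each distance evaluation costs O(G), giving the O(N*K*G) total quoted above:

```python
# Greedy farthest-point ("k-farthest") clustering: repeatedly pick the
# instance farthest from the centers chosen so far, then assign every
# instance to its nearest center. Caching each point's distance to its
# nearest center keeps this at O(N*K) distance evaluations.
def k_farthest(points, k, dist):
    centers = [0]  # start from an arbitrary instance
    nearest = [dist(p, points[0]) for p in points]
    while len(centers) < k:
        far = max(range(len(points)), key=lambda i: nearest[i])
        centers.append(far)
        nearest = [min(nearest[i], dist(points[i], points[far]))
                   for i in range(len(points))]
    assignment = [min(centers, key=lambda c: dist(p, points[c]))
                  for p in points]
    return centers, assignment

# Hypothetical 1-D "instances": three natural groups
pts = [0.0, 0.1, 0.2, 5.0, 5.1, 9.9]
centers, assignment = k_farthest(pts, 3, lambda a, b: abs(a - b))
```

On this toy data the three representatives land in the three groups, so a few representatives summarize all instances.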

SLIDE 55

Approach

  1. Collect and prepare sample-based time-series for further analysis
     • Collect a time series of call paths with HPCToolkit
     • Organize each time series as a tree of program calling contexts
     • Identify iterative behaviors in the time series
  2. Build clusters across threads and loop iterations
  3. Quantify performance losses and attribute them to call paths

55

SLIDE 56

Quantify performance losses

Variation across threads provides clues to performance losses

56

[Figure: timeline for P0-P2 over segments A, B, C; computation X takes 8s, 2s, and 6s across the processes, with synchronization Y = 7s and Z = 4s filling the slack; an optimized schedule that balances X (X = 4s, X = 2s with Y = 7s) is projected to improve run time by 37.5%]

SLIDE 59

Quantify imbalance

[Figure: same example timeline for P0-P2: computation X of 8s, 2s, and 6s, synchronization Y = 7s and Z = 4s, and the optimized schedule with X balanced]

imb(X) = max(X) - avg(X) = 8s - 4s = 4s

Assume X is computation and Y & Z are synchronization. For a computation node C:

  • imb(C) = projected reduction in execution time if work in C is balanced across threads
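The metric on this slide reduces to a one-liner, sketched below; the per-thread times are hypothetical values chosen so that max = 8s and avg = 4s, reproducing the slide's imb(X) = 4s:

```python
# Imbalance of a computation node: if its work were spread evenly,
# every thread would take the average time, so the projected saving is
# the gap between the slowest thread and the mean.
def imb(times_per_thread):
    return max(times_per_thread) - sum(times_per_thread) / len(times_per_thread)

# Hypothetical per-thread times for X (max 8s, avg 4s)
x_times = [8.0, 2.0, 2.0]
```

A perfectly balanced node has imb = 0, so only nodes with genuine variation across threads are flagged.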
SLIDE 61

Quantify imbalance

[Figure: same example timeline; the wait in synchronization Y mirrors the imbalance of the computation X that precedes it]

imb(Y) = imb(X) = 4s

Assume X is computation and Y & Z are synchronization. For a synchronization node S:

  • imb(S) = projected reduction in execution time if work between the prior synchronization and S is balanced

SLIDE 66

Attribute imbalance

[Figure: TCT node A with children X, Y, Z, annotated with imb(N) / sumImb(N): A = 0s / 8s, X = 4s / 4s, Y = 4s / 4s, Z = 0s / 0s]

Assume X is computation and Y & Z are synchronization. sumImb(N) for each node N in the TCT:

  • sumImb(N) = imb(N) if N is a leaf
  • sumImb(N) = Sum { sumImb(C) over every child C of N } otherwise
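The bottom-up attribution is a small recursion, sketched below with nodes modelled as (imb_seconds, children) pairs for illustration:

```python
# sumImb rolls losses up the temporal context tree: a leaf contributes
# its own imbalance, while an interior node aggregates its children.
def sum_imb(node):
    imb_n, children = node
    if not children:
        return imb_n          # leaf: its own imbalance
    return sum(sum_imb(child) for child in children)

# Slide example: A (imb 0s) with children X, Y, Z of imb 4s, 4s, 0s
a = (0.0, [(4.0, []), (4.0, []), (0.0, [])])
```

For the slide's example this yields sumImb(A) = 8s even though imb(A) itself is 0s: the loss originates in the children but is visible at the parent.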

SLIDE 67

Highlight significant call paths

Pick call paths that contribute significantly to imbalance

  • sumImb(N) / RunTime > significanceRatio (= 1%)
  • Avoid reporting call paths with tiny losses

Pick appropriate depths for significant call paths

  • imb(N) / sumImb(N) > appropriateDepthRatio (= 70%)
  • Avoid reporting too many children with small losses

Quantify and attribute waiting in a similar way

67

[Figure: example tree annotated with imb(N) / sumImb(N) as percentages of run time: the root at 0% / 10% and an interior node at 0.4% / 9.6% are traversed; nodes at 5% / 5%, 4% / 4%, 2.3% / 2.3%, and 1.8% / 1.8% are reported at this depth; nodes at 0.4% / 0.4% and 0.9% / 0.9% fall below the significance threshold]
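The two reporting rules combine into a short tree walk, sketched below; the node names and numbers are hypothetical, loosely mirroring the figure:

```python
# Prune subtrees whose total imbalance is at most significanceRatio of
# run time; stop descending (report the node itself) once its own imb
# accounts for more than appropriateDepthRatio of its subtree's sumImb.
# Nodes are (name, imb, sumImb, children) tuples for illustration.
def highlight(node, run_time, sig=0.01, depth_ratio=0.70, path=()):
    name, imb_n, sum_imb_n, children = node
    if sum_imb_n / run_time <= sig:
        return []                                # tiny loss: don't report
    if not children or imb_n / sum_imb_n > depth_ratio:
        return [path + (name,)]                  # appropriate depth reached
    out = []
    for child in children:
        out += highlight(child, run_time, sig, depth_ratio, path + (name,))
    return out

# Hypothetical tree with sumImb values measured in seconds of a 100s run
tree = ("main", 0.0, 10.0,
        [("solve", 0.4, 9.6,
          [("X", 5.0, 5.0, []),
           ("Y", 4.0, 4.0, []),
           ("U", 0.3, 0.3, []),
           ("V", 0.3, 0.3, [])]),
         ("init", 0.4, 0.4, [])])
reported = highlight(tree, run_time=100.0)
```

Here only the call paths ending at X and Y are reported: init, U, and V fall under the 1% significance threshold, and the walk stops at X and Y because their own imb dominates their subtrees.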

SLIDE 71

Experiments

Platform: Titan @ Oak Ridge National Laboratory

  • One 16-core AMD Opteron per node; one thread per core
  • Gemini interconnect (3D torus)

Applications:

  • PFLOTRAN @ 512 MPI ranks, ~178 seconds
    • Simulation of subsurface flow and reactive transport
    • Chemical reactions; environmental assessment
  • AMG2013 @ 512 MPI ranks, ~23 seconds
    • Parallel solver for structured/unstructured linear systems

71

SLIDE 74

Manual analysis of PFLOTRAN?

74

[Figure: time-series view of call paths (depth on the vertical axis) for P0-P511; run time = 178s]

Need to zoom in to examine the execution at a higher resolution. Need to select appropriate call path depths to derive detailed insights.

SLIDE 80

Visualization with automated insights

80

[Figure: clustered time-series view; run time = 178s; clusters #1 and #2 with processes P0, P1, P4, P60, P511 labeled and call path depth on the vertical axis]

Processes are ordered first by clusters and then by MPI rank. Height of each cluster is proportional to log2(size + 1). Each pixel shows a procedure frame on the call path; depths are selected by automated analysis. Insignificant call paths are colored grey. The execution is cut into several segments, which the user can zoom into.

SLIDE 87

Understand behavior of PFLOTRAN

87

[Figure: zoomed segment from 51.97s to 81.75s for clusters #1 and #2 (P0, P1, P4, P60, P511). Work is colored red, yellow, and brown; waiting is green and blue; synchronization is purple. The segment shows I/O work, MPI_Send, and MPI_Alltoall, with extreme load imbalance that suggests serialization]

SLIDE 91

Sketch of PFLOTRAN serialized I/O

If I’m P0

  • Write global grid to visualization file

For -- (runs 3 iterations)

  • MPI_Alltoall within local MPI group
  • If I’m not P0
    • MPI_Send my data to P0
  • Else
    • Write my data to visualization file
    • MPI_Recv data from P1 to P511 and write to visualization file

91

Annotations: the MPI_Send to P0 is the symptom of serialization; P0 alone writing (and receiving all ranks' data for) the visualization file is the cause. Cluster #1 (P4-P60) is in the same local MPI group as P0.

SLIDE 92

Conclusion of PFLOTRAN

Serialized I/O is causing performance loss

  • Automated analysis estimates the run time improvement: 178s → 66s
  • Replacing serial I/O with parallel I/O achieves: 178s → 70s

Presentation of automated insights of PFLOTRAN

  • Reduces analysis complexity of time series in three dimensions
    • Process (group into clusters), Time (split into segments), Depth (automatic selection)
  • Directs attention to potential performance losses
  • Helps users understand the causes of such losses

92

SLIDE 93

Summary

Automated analysis of parallel time-series performance data

  • Identifies potential inefficiencies in a large set of time series
  • Automation will be critical for analyzing performance on emerging exascale systems
  • Replace hours/days of manual effort with automated analysis

Future work

  • Visualize summarized iterative behaviors over time
  • Use semantic information for 1) MPMD applications; 2) more accurate diagnosis
  • Provide automated hints on how to fix highlighted performance losses

93