Automated Analysis of Time Series Data to Understand Parallel Program Behaviors
LAI WEI, JOHN MELLOR-CRUMMEY RICE UNIVERSITY HOUSTON, TX, USA
Parallel computers of increasing scale
Performance of many applications fails to scale accordingly
Performance tools to understand application behaviors
Breaks down application run time into sources of costs
Calling context    P0    P1    P2    P3
main()             9s    9s    9s    9s
  init()           1s    1s    1s    1s
  solve()          8s    8s    8s    8s
    compute()      4s    4.1s  3.9s  4s
    sync()         4s    3.9s  4.1s  4s
Performance loss, why?
Presents application behavior over time
[Figure: trace view of P0-P3 over time (1s to 9s). At call path depth 1 the timelines show init() and solve(); at depth 2 they show init(), compute() and sync().]
Experts manually examine time series
Time series of large scale parallel executions
Analysis of profiles [Huck, SC’05] [Tallent, SC’10]
Analysis of execution traces
Automated analysis of sample-based time-series data
[Figure: source lines 1-15 sampled at times T1-T8, with call stacks shown at depths 0-3. Sampled call paths include main() > foo@13 > C@5 and main() > foo@13 > loop@6 > A@7, B@8 or C@9. The samples are merged into a tree: main() T1-T8, foo@13 T1-T8, loop@6 T2-T8.]
One tree per thread
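As a rough illustration of how one thread's samples could be merged into such a tree, here is a minimal sketch. The per-sample call paths are one possible reading of the figure, not data taken from the talk, and the `Node` class and `build_tree` function are hypothetical names. The sketch merges samples purely by call path and records only the first and last sample seen at each node; it does not split loops into iterations.

```python
class Node:
    """One call path frame plus the first and last sample that hit it."""

    def __init__(self, name):
        self.name = name
        self.first = None
        self.last = None
        self.children = {}  # frame name -> Node, in first-seen order

    def touch(self, t):
        # Record the sample index; samples arrive in time order.
        if self.first is None:
            self.first = t
        self.last = t

    def span(self):
        return f"T{self.first}-T{self.last}"


def build_tree(samples):
    """Merge (sample index, call path) pairs of one thread into a tree."""
    root = Node(samples[0][1][0])
    for t, path in samples:
        node = root
        node.touch(t)
        for frame in path[1:]:
            if frame not in node.children:
                node.children[frame] = Node(frame)
            node = node.children[frame]
            node.touch(t)
    return root


# Assumed reading of the figure: T1 is outside the loop, T2-T8 are inside it.
loop = ["main()", "foo@13", "loop@6"]
samples = [
    (1, ["main()", "foo@13", "C@5"]),
    (2, loop + ["A@7"]), (3, loop + ["B@8"]), (4, loop + ["A@7"]),
    (5, loop + ["A@7"]), (6, loop + ["A@7"]), (7, loop + ["C@9"]),
    (8, loop + ["A@7"]),
]
root = build_tree(samples)
foo = root.children["foo@13"]
print(root.span(), foo.span(), foo.children["loop@6"].span())
```

Under these assumptions the three spans agree with the labels in the figure: main() and foo@13 cover T1-T8, and loop@6 covers T2-T8.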
[Figure: loop@6 (T2-T8) spans source lines 6-9: begin of loop@6, A@7, B@8, C@9. Its samples are divided into iterations #0, #1, #2.]
Use ParseAPI to analyze binaries
Insert a new iteration when a back edge must have been taken
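The back-edge rule can be sketched as follows, under a strong simplifying assumption: the loop body is straight-line code, so a sample whose source line is lower than the previous sample's line can only be reached by taking the back edge. A real analysis would consult the control flow graph recovered from the binary. The function name `split_into_iterations` and the sample sequence are hypothetical.

```python
from typing import List, Tuple


def split_into_iterations(samples: List[Tuple[str, int]]) -> List[List[str]]:
    """Group (sample label, source line) pairs of one loop into iterations.

    A new iteration starts only when the line number decreases, i.e. when
    a back edge must have been taken. Two consecutive samples on the same
    line stay in one iteration because no back edge is required.
    """
    iterations: List[List[str]] = []
    prev_line = None
    for label, line in samples:
        if prev_line is None or line < prev_line:
            iterations.append([])  # open iteration #len(iterations)
        iterations[-1].append(label)
        prev_line = line
    return iterations


# Assumed sample sequence inside loop@6: A@7, B@8, A@7, A@7, A@7, C@9, A@7.
loop_samples = [("T2", 7), ("T3", 8), ("T4", 7), ("T5", 7),
                ("T6", 7), ("T7", 9), ("T8", 7)]
iters = split_into_iterations(loop_samples)
for i, group in enumerate(iters):
    print(f"#{i}: {group}")
```

With this assumed sequence the rule produces three iterations, which is consistent with the #0, #1, #2 labels on the slide.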
Objective
Parallelization ✔
Variation across threads provides clues to performance losses
[Figure: timelines of P0, P1 and P2 (time axis 2 to 12) across nodes A, B, C, with segments labeled X = 8s, X = 2s, Y = 7s, X = 6s, Z = 4s; an "Optimized" variant labeled X = 4s, X = 2s, Y = 7s; Opt #2 = 37.5%. Nodes are annotated with imb(N) and sumImb(N).]
Assume X as computation, Y & Z as synchronization
For a computation node C
For a synchronization node S
synchronization and S is balanced.
sumImb(N) for each node N in TCT
  = imb(N), if N is a leaf
  = Sum { sumImb(every child of N) }, otherwise
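The recursive definition of sumImb can be written down directly. The sketch below follows the two cases exactly as stated on the slide; the `TctNode` class, the node names and the imbalance values are hypothetical, and how imb(N) itself is computed is not covered here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TctNode:
    name: str
    imb: float = 0.0  # imbalance attributed to this node itself
    children: List["TctNode"] = field(default_factory=list)


def sum_imb(node: TctNode) -> float:
    """sumImb(N): imb(N) for a leaf, else the sum over all children."""
    if not node.children:
        return node.imb
    return sum(sum_imb(child) for child in node.children)


# Hypothetical two-level tree with imbalance values in percent.
tree = TctNode("parent", children=[
    TctNode("leaf_a", imb=5.0),
    TctNode("leaf_b", imb=4.0),
])
print(sum_imb(tree))
```

Because the value at an interior node is an aggregate of its subtree, comparing sumImb across siblings shows which subtree carries most of the imbalance.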
Pick call paths that contribute significantly to imbalance
Pick appropriate depths for significant call paths
Quantify and attribute waiting in a similar way
[Figure: tree with each node labeled imb(N)/sumImb(N); labels include 0%/10%, 0.4%/9.6%, 5%/5%, 4%/4%, 2.3%/2.3%, 1.8%/1.8%, 0.9%/0.9%, 0.4%/0.4%.]
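One way to picture the selection of significant call paths and their depths is a threshold-driven descent: follow every child whose sumImb is at least a threshold, and stop at the deepest node that still meets it. This is an assumed heuristic for illustration only, not the selection rule used in the talk; the tree, its values, the threshold and the function name `significant_paths` are all hypothetical.

```python
def significant_paths(node, threshold, prefix=()):
    """Return call paths, cut at the depth where significance ends.

    `node` is a dict with keys "name", "sum_imb" and optional "children".
    The caller is expected to pass a node that is itself significant.
    """
    path = prefix + (node["name"],)
    picks = []
    for child in node.get("children", []):
        if child["sum_imb"] >= threshold:
            picks.extend(significant_paths(child, threshold, path))
    # No significant child: report this node at its own depth.
    return picks or [path]


# Hypothetical tree; sum_imb values are percentages of run time.
tree = {"name": "main()", "sum_imb": 10.0, "children": [
    {"name": "init()", "sum_imb": 0.4},
    {"name": "solve()", "sum_imb": 9.6, "children": [
        {"name": "compute()", "sum_imb": 5.0},
        {"name": "sync()", "sum_imb": 4.0},
        {"name": "misc()", "sum_imb": 0.6},
    ]},
]}
paths = significant_paths(tree, threshold=2.0)
for p in paths:
    print(" > ".join(p))
```

In this sketch the depth of each reported path falls out of the descent: a path is extended only while some child still carries enough imbalance to matter.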
Platform: Titan @ Oak Ridge National Laboratory
Applications:
Need to zoom in to examine the execution at a higher resolution
Need to select appropriate call path depths to derive detailed insights
[Figure: automated trace presentation with rows grouped into Cluster #1 and Cluster #2; labeled processes: P0, P1, P4, P60, P511; annotations: Zoom in, MPI_Send, MPI_Alltoall, I/O work.]
Processes are ordered first by clusters and then by MPI rank
Height of each cluster is proportional to log2(size+1)
Execution cut into several segments
Each pixel shows a procedure frame on the call path. Depths are selected by automated analysis
Insignificant call paths are colored grey
Work colored in red, yellow, brown
Wait colored in green and blue
Synchronization colored in purple
Extreme load-imbalance, probably serialization
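The layout rules for the clustered view (order by cluster, then by MPI rank; cluster height proportional to log2(size+1)) can be sketched in a few lines. The function names, the cluster assignment and the row budget below are hypothetical; only the two rules come from the slides.

```python
import math


def order_processes(cluster_of):
    """Sort MPI ranks first by cluster id, then by rank."""
    return sorted(cluster_of, key=lambda rank: (cluster_of[rank], rank))


def cluster_heights(cluster_sizes, total_rows):
    """Split total_rows among clusters in proportion to log2(size + 1)."""
    weights = [math.log2(size + 1) for size in cluster_sizes]
    total = sum(weights)
    return [total_rows * w / total for w in weights]


# Hypothetical assignment of five ranks to two clusters.
cluster_of = {0: 2, 1: 2, 4: 1, 60: 1, 511: 2}
order = order_processes(cluster_of)
heights = cluster_heights([1, 3], total_rows=30)
print(order, heights)
```

The logarithm keeps a very large cluster from crowding out a small one: a cluster of 3 processes gets only twice the height of a singleton, so outliers remain visible.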
If I’m P0
For -- (runs 3 iterations)
Symptom of serialization
Cause of serialization
Cluster #1 (P4 - P60) in the same local group as P0
Serialized I/O is causing performance loss
Presentation of automated insights of PFLOTRAN
Automated analysis of parallel time-series performance data
Future work