Lecture 10: Performance Tools
Abhinav Bhatele, Department of Computer Science
Introduction to Parallel Computing (CMSC498X / CMSC818X)
Announcements
Quiz 1 has been posted. Deadline: October 1, 11:59 pm AoE.
Department seminar tomorrow.
Abhinav Bhatele (CMSC498X/CMSC818X) LIVE RECORDING
double start, end;
double phase1, phase2, phase3;

start = MPI_Wtime();
... phase1 code ...
end = MPI_Wtime();
phase1 = end - start;

start = MPI_Wtime();
... phase2 ...
end = MPI_Wtime();
phase2 = end - start;

start = MPI_Wtime();
... phase3 ...
end = MPI_Wtime();
phase3 = end - start;
Vampir visualization: https://hpc.llnl.gov/software/development-environment-software/vampir-vampir-server
Gprof data in hpctView
[Figure: nodes main, physics, solvers, mpi, psm2, hypre, mpi, psm2]
Calling context tree (CCT)
[Figure: CCT with nodes foo, bar, qux, waldo, baz, grault, quux, corge, bar, grault, garply, baz, grault, fred, garply, plugh, xyzzy, thud, baz, garply]
Call graph
[Figure: call graph with nodes foo, bar, qux, waldo, baz, grault, quux, corge, garply, fred, plugh, xyzzy, thud]
Contextual information: file, line number, function name, callpath, load module, process ID, thread ID
Performance metrics: time, flops, cache misses
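The difference between the two structures can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the node names come from the slide, but the parent/child shape of the tree is an assumption, and `count_nodes` and `to_call_graph` are hypothetical helper names. A CCT keeps one node per distinct call path, while a call graph keeps one node per function name.

```python
from collections import defaultdict

# Hypothetical CCT as nested (name, children) tuples. The node names come from
# the slide; the parent/child shape is an assumption made for illustration.
cct = ("foo", [
    ("bar", [("baz", []), ("grault", [])]),
    ("qux", [("quux", [("corge", [("bar", [("baz", []), ("grault", [])]),
                                  ("grault", []),
                                  ("garply", [])])])]),
    ("waldo", [("fred", [("plugh", []),
                         ("xyzzy", [("thud", [("baz", []), ("garply", [])])])]),
               ("garply", [])]),
])

def count_nodes(node):
    """Count CCT nodes; the same function may appear under several call paths."""
    _, children = node
    return 1 + sum(count_nodes(child) for child in children)

def to_call_graph(node, edges=None):
    """Merge CCT nodes that share a function name into one call-graph node."""
    if edges is None:
        edges = defaultdict(set)
    name, children = node
    edges[name]  # touching the key makes leaf functions appear as nodes too
    for child in children:
        edges[name].add(child[0])
        to_call_graph(child, edges)
    return edges

call_graph = to_call_graph(cct)
print(count_nodes(cct), "CCT nodes ->", len(call_graph), "call graph nodes")
for caller in sorted(call_graph):
    print(caller, "->", sorted(call_graph[caller]))
```

With this assumed shape, the 20 CCT nodes collapse to 13 call graph nodes, which matches the node counts on the slide. The price of the merge is that per-path context is lost: after merging, time in baz can no longer be attributed to a specific caller chain.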
The static call graph can be constructed from the source text of the program. However, discovering the static call graph from the source text would require two moderately difficult steps: finding the source text for the program (which may not be available), and scanning and parsing that text, which may be in any one of several languages.
In our programming system, the static calling information is also contained in the executable version of the program, which we already have available, and which is in language-independent form. One can examine the instructions in the object program, looking for calls to routines, and note which routines can be called. This technique allows us to add arcs to those already in the dynamic call graph. If a statically discovered arc already exists in the dynamic call graph, no action is required. Statically discovered arcs that do not exist in the dynamic call graph are added to the graph with a traversal count of zero. Thus they are never responsible for any time propagation. However, they may affect the structure of the graph. Since they may complete strongly connected components, the static call graph construction is done before topological ordering.
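The arc-merging rule in this passage can be sketched as follows. This is an illustration under stated assumptions, not gprof's implementation: the routine names, the counts, and the helper `add_static_arcs` are all hypothetical.

```python
# Dynamic arcs observed at run time: (caller, callee) -> traversal count.
# All routine names and counts here are hypothetical.
dynamic_arcs = {("main", "solve"): 3, ("solve", "dot"): 120}

# Arcs found by scanning the executable for call instructions (hypothetical).
static_arcs = [("main", "solve"), ("solve", "report"), ("report", "solve")]

def add_static_arcs(dynamic, static):
    """Add statically discovered arcs with a traversal count of zero.

    Arcs that were already seen dynamically keep their measured count.
    """
    merged = dict(dynamic)
    for arc in static:
        merged.setdefault(arc, 0)
    return merged

call_graph = add_static_arcs(dynamic_arcs, static_arcs)
for (caller, callee), count in sorted(call_graph.items()):
    print(f"{caller} -> {callee}: {count}")
```

In this example the two zero-count arcs between solve and report close a cycle. They carry no time, but they change the graph's structure, which is why the merge has to happen before the graph is ordered.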
Presentation
The data is presented to the user in two different formats. The first presentation simply lists the routines without regard to the amount of time their descendants use. The second presentation incorporates the call graph of the program.
5.1. The Flat Profile
The flat profile consists of a list of all the routines that are called during execution of the program, with the count of the number of times they are called and the number of seconds of execution time for which they are themselves accountable. The routines are listed in decreasing order of execution time. A list of the routines that are never called during execution of the program is also available to verify that nothing important is omitted by this execution. The flat profile gives a quick overview of the routines that are used, and shows the routines that are themselves responsible for large fractions of the execution time. In practice, this profile usually shows that no single function is [...] times sum to the total execution time.
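A flat profile of this kind can be sketched in plain Python. The routine names, call counts, and timings below are made up for illustration, and `flat_profile` is a hypothetical helper, not part of gprof.

```python
# Hypothetical measurements: routine -> (number of calls, self seconds).
samples = {
    "solve": (10, 4.5),
    "dot": (1200, 2.25),
    "main": (1, 0.25),
    "log_error": (0, 0.0),
}

def flat_profile(samples):
    """Return called routines by decreasing self time, plus never-called ones."""
    called = [(name, calls, secs)
              for name, (calls, secs) in samples.items() if calls > 0]
    called.sort(key=lambda row: row[2], reverse=True)
    never_called = sorted(name for name, (calls, _) in samples.items()
                          if calls == 0)
    return called, never_called

rows, never_called = flat_profile(samples)
total = sum(secs for _, _, secs in rows)  # self times sum to the total run time
for name, calls, secs in rows:
    print(f"{name:<10} {calls:>6} {secs:>8.2f} {100 * secs / total:>6.1f}%")
print("never called:", never_called)
```

Because only self time is listed, the rows add up to the total execution time. The call graph profile gives that property up in exchange for showing time inherited from descendants.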
5.2. The Call Graph Profile
Ideally, we would like to print the call graph of the program, but we are limited by the two-dimensional nature of our output devices. We cannot assume that a call graph is planar, and even if it is, that we can print a planar version of it. Instead, we choose to list each routine, together with information about the routines that are its direct parents and children. This listing presents a window into the call graph. Based on our experience, both parent information and child information is important, and should be available without searching through the output.
The major entries of the call graph profile are the entries from the flat profile, augmented by the time propagated to each routine from its descendants. [...] for the routine itself plus the time inherited from its descendants. The profile shows which of the higher level routines spend large portions of the total execution time in the routines that they call. For each routine, we show the amount [...] passed by each child to the routine, which includes time for the child itself and for the descendants of the child (and thus the descendants of the routine). We also show the percentage these times represent [...] the parents of each routine are listed, along with time, and percentage of total routine time, propagated to each one.
Cycles are handled as single entities. The cycle as a whole is shown as though it were a single routine, except that members of the cycle are listed in place of the children. Although the number of calls [...] they do not affect time propagation. When a child is a member [...] time shown is the appropriate fraction of the time for the whole cycle. Self-recursive routines have their calls broken down into calls from the outside and self-recursive calls. Only the outside calls affect the propagation of time.
The following example is a typical fragment of a call graph. The entry in the call graph profile listing for this example is shown in Figure 4. The entry is for routine EXAMPLE, which has the Caller routines as its parents, and the Sub routines as its children. The reader should keep in mind that all information is given with respect to EXAMPLE. [...] EXAMPLE is the second entry in the profile listing. The EXAMPLE routine is called ten times, four times by CALLER1, and six times by CALLER2. Consequently 40% of EXAMPLE's time is propagated to CALLER1, and 60% of EXAMPLE's time is propagated to CALLER2. The self and descendant fields of the parents show the amount of self and descendant time EXAMPLE propagates to them (but not the time used by the parents directly). Note that EXAMPLE calls itself recursively four times. The routine EXAMPLE calls routine SUB1 twenty times, SUB2 once, and never calls SUB3. Since SUB2 is called a total of five times, 20% of its self and descendant time is propagated to EXAMPLE's descendant time field. Because SUB1 is a [...]
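The propagation arithmetic in the EXAMPLE entry can be checked with a short sketch. The call counts are the ones quoted in the excerpt; the helper `propagated_share` and the 2.5 second total are assumptions added for illustration.

```python
from fractions import Fraction

def propagated_share(calls_from_parent, total_calls):
    """Fraction of a routine's self + descendant time charged to one parent."""
    return Fraction(calls_from_parent, total_calls)

# Call counts from the EXAMPLE entry described in the text. The four
# self-recursive calls are left out because only outside calls propagate time.
calls_into_example = {"CALLER1": 4, "CALLER2": 6}
outside_calls = sum(calls_into_example.values())
shares = {parent: propagated_share(n, outside_calls)
          for parent, n in calls_into_example.items()}

# SUB2 is called five times in total, one of them by EXAMPLE.
sub2_share = propagated_share(1, 5)

# Hypothetical number: assume EXAMPLE's self + descendant time is 2.5 seconds.
example_seconds = Fraction(5, 2)
for parent, share in shares.items():
    seconds = float(share * example_seconds)
    print(f"{parent}: {float(share):.0%} -> {seconds:.2f} s")
print(f"SUB2 -> EXAMPLE: {float(sub2_share):.0%}")
```

The shares reproduce the 40%, 60%, and 20% figures from the text. Each share is simply calls along that arc divided by all outside calls to the routine.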
https://hatchet.readthedocs.io/en/latest/
[Figure: a table annotated with its columns, rows, and index; surviving slide text: "databases"]
[Figure: nodes main, physics, solvers, mpi, psm2, hypre, mpi, psm2; numeric labels 360 to 371]
[Figure: nodes main, physics, solvers, mpi, psm2, hypre, mpi, psm2 reduced to main, physics, hypre, psm2, psm2]
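One way to read the before and after node lists in this figure is as a filter followed by a squash: drop the nodes that fail a predicate, then reconnect the survivors. The sketch below is plain Python and is not Hatchet's API or implementation. The node names are from the slide, while the tree shape, the times, the threshold, and the helpers `filter_nodes` and `squash` are assumptions chosen so that the surviving names match the slide.

```python
# Hypothetical tree: node id -> (function name, parent id, time in seconds).
# Names are from the slide; the shape and the times are assumptions.
nodes = {
    0: ("main", None, 1.0),
    1: ("physics", 0, 6.0),
    2: ("mpi", 1, 0.5),
    3: ("psm2", 2, 4.0),
    4: ("solvers", 0, 0.5),
    5: ("hypre", 4, 7.0),
    6: ("mpi", 5, 0.5),
    7: ("psm2", 6, 3.0),
}

def filter_nodes(nodes, keep):
    """Return the ids of the nodes whose row satisfies the predicate."""
    return {nid for nid, row in nodes.items() if keep(row)}

def squash(nodes, kept):
    """Reattach each kept node to its nearest kept ancestor."""
    squashed = {}
    for nid in sorted(kept):
        name, parent, seconds = nodes[nid]
        while parent is not None and parent not in kept:
            parent = nodes[parent][1]  # climb past filtered-out ancestors
        squashed[nid] = (name, parent, seconds)
    return squashed

kept = filter_nodes(nodes, lambda row: row[2] >= 1.0)
small = squash(nodes, kept)
for nid, (name, parent, seconds) in small.items():
    parent_name = None if parent is None else small[parent][0]
    print(f"{name} (parent: {parent_name}, time: {seconds})")
```

Under these assumptions the surviving nodes are main, physics, hypre, and two psm2 nodes, as in the figure. Each psm2 node now hangs directly under physics or hypre because its mpi parent was filtered out.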
[Slide text garbled in extraction; surviving words: addition, subtract, assignment, graphs, graphframe, printed. Figure: nodes quux, corge, foo, bar, fred, xyzzy, thud, qux, bar, waldo, shown with the full CCT of foo, bar, qux, waldo, baz, grault, quux, corge, bar, grault, garply, baz, grault, fred, garply, plugh, xyzzy, thud, baz, garply]
Abhinav Bhatele 5218 Brendan Iribe Center (IRB) / College Park, MD 20742 phone: 301.405.4507 / e-mail: bhatele@cs.umd.edu