Lecture 10: Performance Tools

Abhinav Bhatele, Department of Computer Science

Introduction to Parallel Computing (CMSC498X / CMSC818X)

Abhinav Bhatele (CMSC498X/CMSC818X) LIVE RECORDING

Announcements

  • Quiz 1 has been posted
  • Deadline: October 1, 11:59 pm AoE
  • Department seminar tomorrow at 11:00 am
  • Zoom link forwarded by e-mail

Performance analysis

  • Parallel performance of a program might not be what the developer expects
  • How do we find performance bottlenecks?
  • Two parts to performance analysis: measurement and analysis/visualization
  • Simplest tool: timers in the code and printf

Using timers

    double start, end;
    double phase1, phase2, phase3;

    start = MPI_Wtime();
    ... phase1 code ...
    end = MPI_Wtime();
    phase1 = end - start;

    start = MPI_Wtime();
    ... phase2 ...
    end = MPI_Wtime();
    phase2 = end - start;

    start = MPI_Wtime();
    ... phase3 ...
    end = MPI_Wtime();
    phase3 = end - start;

Phase 1 took 2.45 s
Phase 2 took 11.79 s
Phase 3 took 4.37 s
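
The start/stop pattern above can be wrapped in a small helper so that each phase is timed with one line. The sketch below is in Python and uses `time.perf_counter` instead of `MPI_Wtime`; the `PhaseTimer` class, the phase names, and the stand-in workloads are illustrative assumptions, not part of the lecture.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase (hypothetical helper)."""

    def __init__(self):
        self.elapsed = {}  # phase name -> seconds

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Add to any earlier time so a phase can be entered repeatedly.
            self.elapsed[name] = self.elapsed.get(name, 0.0) + (
                time.perf_counter() - start
            )

    def report(self):
        return [f"{name} took {secs:.2f} s" for name, secs in self.elapsed.items()]


timer = PhaseTimer()
with timer.phase("Phase 1"):
    sum(i * i for i in range(10_000))    # stand-in for phase 1 code
with timer.phase("Phase 2"):
    sorted(range(10_000), reverse=True)  # stand-in for phase 2 code

for line in timer.report():
    print(line)
```

In an MPI program each rank would hold its own timer, so the per-rank values still need to be combined (for example by taking the maximum) before they are reported.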

Performance tools

  • Tracing tools: capture the entire execution trace
  • Profiling tools: provide aggregated information, typically using statistical sampling
  • Many tools can do both

Metrics recorded

  • Counts of function invocations
  • Time spent in code
  • Number of bytes sent
  • Hardware counters
  • To fix performance problems, we need to connect metrics to source code

Tracing tools

  • Record all the events in the program with timestamps
  • Events: function calls, MPI events, etc.

Vampir visualization: https://hpc.llnl.gov/software/development-environment-software/vampir-vampir-server
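
As a rough illustration of what a tracing tool records, the sketch below logs a timestamped enter and exit event for every call to a decorated function. The `traced` decorator, the `TRACE` list, and the two toy functions are assumptions made for this example; real tools such as those named in the lecture instrument programs very differently.

```python
import functools
import time

TRACE = []  # list of (timestamp, event, function name) tuples


def traced(func):
    """Record a timestamped enter/exit event around every call to func."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        TRACE.append((time.perf_counter(), "enter", func.__name__))
        try:
            return func(*args, **kwargs)
        finally:
            TRACE.append((time.perf_counter(), "exit", func.__name__))

    return wrapper


@traced
def solve(n):
    return sum(range(n))


@traced
def main():
    return [solve(1000) for _ in range(2)]


main()
for timestamp, event, name in TRACE:
    print(f"{timestamp:.6f} {event:5s} {name}")
```

Every call adds two records, which is why full traces grow with run time while aggregated profiles do not.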

Examples of tracing tools

  • VampirTrace
  • Score-P
  • TAU
  • Projections
  • HPCToolkit

Profiling tools

  • Ignore the specific times at which events occurred
  • Provide aggregate information about different parts of the code
  • Examples:
  • Gprof, perf
  • mpiP
  • HPCToolkit, Caliper
  • Python tools: cProfile, pyinstrument, Scalene

Gprof data in hpctView
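
The difference between a trace and a profile can be sketched by collapsing a list of timestamped events into per-function totals. The trace below is made up for illustration, and `profile_from_trace` is a hypothetical helper; the profilers listed above usually sample the program instead of recording every event.

```python
from collections import defaultdict

# Hypothetical trace: (timestamp in seconds, event, function name).
trace = [
    (0.0, "enter", "main"),
    (1.0, "enter", "solve"),
    (4.0, "exit", "solve"),
    (5.0, "enter", "solve"),
    (7.0, "exit", "solve"),
    (8.0, "exit", "main"),
]


def profile_from_trace(events):
    """Collapse a trace into per-function call counts and inclusive time."""
    calls = defaultdict(int)
    inclusive = defaultdict(float)
    stack = []  # (function name, enter timestamp) of currently active calls
    for timestamp, event, name in events:
        if event == "enter":
            stack.append((name, timestamp))
            calls[name] += 1
        else:
            entered_name, entered_at = stack.pop()
            assert entered_name == name, "trace is not properly nested"
            # Note: a recursive function would be counted once per active frame.
            inclusive[name] += timestamp - entered_at
    return {name: (calls[name], inclusive[name]) for name in calls}


profile = profile_from_trace(trace)
for name, (count, seconds) in profile.items():
    print(f"{name}: {count} call(s), {seconds:.1f} s inclusive")
```

The result no longer says when each call to `solve` happened, only how often it ran and how long it took in total.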

Calling contexts, trees, and graphs

  • Calling context or call path: sequence of function invocations leading to the current sample
  • Calling context tree (CCT): dynamic prefix tree of all call paths in an execution
  • Call graph: merge nodes in a CCT with the same name into a single node but keep caller-callee relationships as arcs

[Figure: example calling context tree with nodes main, physics, solvers, mpi, psm2, hypre, mpi, psm2]

Calling context trees, call graphs, …

[Figure: a calling context tree (CCT) and the corresponding call graph for functions foo, bar, qux, waldo, baz, grault, quux, corge, garply, fred, plugh, xyzzy, thud]

  • Contextual information: file, line number, function name, callpath, load module, process ID, thread ID
  • Performance metrics: time, flops, cache misses
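
The relationship between call paths, a CCT, and a call graph can be sketched in a few lines. The two call paths below are an assumption (they reuse the node names from the earlier main/physics/solvers figure, whose exact shape is not given in the text), and the nested-dict tree is only one possible representation.

```python
# Hypothetical sampled call paths, outermost caller first.
call_paths = [
    ("main", "physics", "mpi", "psm2"),
    ("main", "solvers", "hypre", "mpi", "psm2"),
]


def build_cct(paths):
    """Insert every call path into a prefix tree of nested dicts."""
    root = {}
    for path in paths:
        node = root
        for frame in path:
            node = node.setdefault(frame, {})
    return root


def count_nodes(tree):
    return sum(1 + count_nodes(children) for children in tree.values())


def call_graph(tree, caller=None, arcs=None):
    """Merge same-named CCT nodes: keep only unique (caller, callee) arcs."""
    arcs = set() if arcs is None else arcs
    for name, children in tree.items():
        if caller is not None:
            arcs.add((caller, name))
        call_graph(children, name, arcs)
    return arcs


cct = build_cct(call_paths)
arcs = call_graph(cct)
graph_nodes = {name for arc in arcs for name in arc}
print("CCT nodes:", count_nodes(cct))
print("call graph nodes:", len(graph_nodes))
print("call graph arcs:", sorted(arcs))
```

In the CCT, `mpi` and `psm2` each appear twice because they are reached through two different contexts; in the call graph each name appears once, so the information about which caller led to which `psm2` time is lost.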

Output of profiling tools

  • Flat profile: listing of all functions with counts and execution times
  • Call graph profile
  • Calling context tree

The static call graph can be constructed from the source text of the program. However, discovering the static call graph from the source text would require two moderately difficult steps: finding the source text for the program (which may not be available), and scanning and parsing that text, which may be in any one of several languages.

In our programming system, the static calling information is also contained in the executable version of the program, which we already have available, and which is in language-independent form. One can examine the instructions in the object program, looking for calls to routines, and note which routines can be called. This technique allows us to add arcs to those already in the dynamic call graph. If a statically discovered arc already exists in the dynamic call graph, no action is required. Statically discovered arcs that do not exist in the dynamic call graph are added to the graph with a traversal count of zero. Thus they are never responsible for any time propagation. However, they may affect the structure of the graph. Since they may complete strongly connected components, the static call graph construction is done before topological ordering.

5. Data Presentation

The data is presented to the user in two different formats. The first presentation simply lists the routines without regard to the amount of time their descendants use. The second presentation incorporates the call graph of the program.

5.1. The Flat Profile

The flat profile consists of a list of all the routines that are called during execution of the program, with the count of the number of times they are called and the number of seconds of execution time for which they are themselves accountable. The routines are listed in decreasing order of execution time. A list of the routines that are never called during execution of the program is also available to verify that nothing important is omitted by this execution. The flat profile gives a quick overview of the routines that are used, and shows the routines that are themselves responsible for large fractions of the execution time. In practice, this profile usually shows that no single function is overwhelmingly responsible for the total time of the program. Notice that for this profile, the individual times sum to the total execution time.

5.2. The Call Graph Profile

Ideally, we would like to print the call graph of the program, but we are limited by the two-dimensional nature of our output devices. We cannot assume that a call graph is planar, and even if it is, that we can print a planar version of it. Instead, we choose to list each routine, together with information about the routines that are its direct parents and children. This listing presents a window into the call graph. Based on our experience, both parent information and child information is important, and should be available without searching through the output.

The major entries of the call graph profile are the entries from the flat profile, augmented by the time propagated to each routine from its descendants. This profile is sorted by the sum of the time for the routine itself plus the time inherited from its descendants. The profile shows which of the higher level routines spend large portions of the total execution time in the routines that they call. For each routine, we show the amount of time passed by each child to the routine, which includes time for the child itself and for the descendants of the child (and thus the descendants of the routine). We also show the percentage these times represent of the total time accounted to the child. Similarly, the parents of each routine are listed, along with time, and percentage of total routine time, propagated to each one.

Cycles are handled as single entities. The cycle as a whole is shown as though it were a single routine, except that members of the cycle are listed in place of the children. Although the number of calls of each member from within the cycle are shown, they do not affect time propagation. When a child is a member of a cycle, the time shown is the appropriate fraction of the time for the whole cycle. Self-recursive routines have their calls broken down into calls from the outside and self-recursive calls. Only the outside calls affect the propagation of time.

The following example is a typical fragment of a call graph. The entry in the call graph profile listing for this example is shown in Figure 4. The entry is for routine EXAMPLE, which has the Caller routines as its parents, and the Sub routines as its children. The reader should keep in mind that all information is given with respect to EXAMPLE. The index in the first column shows that EXAMPLE is the second entry in the profile listing. The EXAMPLE routine is called ten times, four times by CALLER1, and six times by CALLER2. Consequently 40% of EXAMPLE's time is propagated to CALLER1, and 60% of EXAMPLE's time is propagated to CALLER2. The self and descendant fields of the parents show the amount of self and descendant time EXAMPLE propagates to them (but not the time used by the parents directly). Note that EXAMPLE calls itself recursively four times. The routine EXAMPLE calls routine SUB1 twenty times, SUB2 once, and never calls SUB3. Since SUB2 is called a total of five times, 20% of its self and descendant time is propagated to EXAMPLE's descendant time field. Because SUB1 is a ...
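
The percentages in the excerpt follow from splitting a routine's time among its parents in proportion to call counts. The sketch below reproduces that arithmetic; the call counts come from the excerpt, while the time values, the `propagate` helper, and the `OTHER_CALLERS` label are assumptions for illustration.

```python
def propagate(child_time, calls_by_parent):
    """Split a routine's self+descendant time among parents by call count.

    Only calls from outside the routine are passed in; self-recursive
    calls do not take part in the split.
    """
    total_calls = sum(calls_by_parent.values())
    return {
        parent: child_time * calls / total_calls
        for parent, calls in calls_by_parent.items()
    }


# EXAMPLE is called 4 times by CALLER1 and 6 times by CALLER2.
# The 2.0 s total is a made-up number.
example_shares = propagate(2.0, {"CALLER1": 4, "CALLER2": 6})

# SUB2 is called 5 times in total, once of them by EXAMPLE.
# "OTHER_CALLERS" lumps the remaining 4 calls together (hypothetical label).
sub2_shares = propagate(1.0, {"EXAMPLE": 1, "OTHER_CALLERS": 4})

for parent, seconds in example_shares.items():
    print(f"EXAMPLE -> {parent}: {seconds:.2f} s ({seconds / 2.0:.0%})")
print(f"SUB2 -> EXAMPLE: {sub2_shares['EXAMPLE']:.0%}")
```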

Hatchet

  • Hatchet enables programmatic analysis of parallel profiles
  • Leverages pandas which supports multi-dimensional tabular datasets
  • Create a structured index to enable indexing pandas dataframes by nodes in a graph
  • A set of operators to filter, prune and/or aggregate structured data

https://hatchet.readthedocs.io/en/latest/

Pandas and dataframes

  • Pandas is an open-source Python library for data analysis
  • Dataframe: two-dimensional tabular data structure
  • Supports many operations borrowed from SQL databases
  • MultiIndex enables working with high-dimensional data in a 2D data structure

[Figure: a dataframe with its columns, rows, and index labeled]
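
A short sketch of these ideas, assuming made-up timings for two functions on two MPI ranks: the `(node, rank)` MultiIndex stores three-dimensional data (function, rank, metric) in a two-dimensional table, and the filter and group-by steps mirror SQL's WHERE and GROUP BY.

```python
import pandas as pd

# Hypothetical per-(function, MPI rank) measurements.
index = pd.MultiIndex.from_product(
    [["main", "solve"], [0, 1]], names=["node", "rank"]
)
df = pd.DataFrame({"time": [10.0, 12.0, 7.0, 9.0]}, index=index)

# SQL-like filter: rows where time exceeds 8 seconds.
slow = df[df["time"] > 8.0]

# SQL-like GROUP BY: mean time per function across ranks.
mean_per_node = df.groupby(level="node")["time"].mean()

# MultiIndex lookup: all ranks for one function.
solve_rows = df.loc["solve"]

print(slow)
print(mean_per_node)
print(solve_rows)
```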

Central data structure: a GraphFrame

  • Consists of a structured index graph object and a pandas dataframe
  • Graph stores caller-callee relationships
  • Dataframe stores all numerical and categorical data

[Figure: example graph with nodes main, physics, solvers, mpi, psm2, hypre, mpi, psm2]

Dataframe operation: filter

Graph operation: squash

[Figure: the original graph (main, physics, solvers, mpi, psm2, hypre, mpi, psm2) is filtered, and the result is then squashed into a smaller graph (main, physics, hypre, psm2, psm2)]
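
The two steps can be sketched with plain pandas and a dictionary. This is not Hatchet's API: the child-to-parent map, the `#1`/`#2` node IDs, the times, and the `filter_rows` and `squash` helpers are all assumptions chosen so that the result has the same nodes as the squashed graph in the figure.

```python
import pandas as pd

# Hypothetical tree as child -> parent; node IDs keep the two mpi/psm2
# call sites distinct, as in a calling context tree.
parent = {
    "main": None,
    "physics": "main",
    "solvers": "main",
    "mpi#1": "physics",
    "psm2#1": "mpi#1",
    "hypre": "solvers",
    "mpi#2": "hypre",
    "psm2#2": "mpi#2",
}
# Made-up exclusive times (seconds), indexed by node ID.
df = pd.DataFrame(
    {"time": [5.0, 4.0, 0.1, 0.2, 3.0, 6.0, 0.3, 2.0]},
    index=list(parent),
)


def filter_rows(frame, predicate):
    """Dataframe operation: keep only rows that satisfy the predicate."""
    return frame[frame.apply(predicate, axis=1)]


def squash(parent_of, kept):
    """Graph operation: reattach each kept node to its nearest kept ancestor."""
    kept_set = set(kept)
    new_parent = {}
    for node in kept:
        ancestor = parent_of[node]
        while ancestor is not None and ancestor not in kept_set:
            ancestor = parent_of[ancestor]
        new_parent[node] = ancestor
    return new_parent


filtered = filter_rows(df, lambda row: row["time"] > 1.0)
squashed = squash(parent, list(filtered.index))
print(filtered)
print(squashed)
```

Filtering alone leaves the dataframe and the graph out of sync, since the graph still contains the dropped nodes. Squashing rewires the graph so that it contains exactly the rows that survived the filter.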

Graphframe operation: subtract

[Figure: graphframe 1 minus graphframe 2 equals a third graphframe; each graph has nodes main, physics, solvers, mpi, psm2, hypre, mpi, psm2]

https://hatchet.readthedocs.io
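
The dataframe half of a subtraction can be sketched with plain pandas, assuming both runs are indexed by the same node names. The numbers and the extra `io` node are made up, and the graph-matching step that a graphframe subtraction also needs is not shown.

```python
import pandas as pd

# Made-up times (seconds) for two runs of the same program, indexed by node.
run1 = pd.DataFrame(
    {"time": [10.0, 4.0, 6.0]}, index=["main", "physics", "solvers"]
)
run2 = pd.DataFrame(
    {"time": [7.0, 3.0, 2.0, 1.0]}, index=["main", "physics", "solvers", "io"]
)

# Align rows by node; a node missing from one run counts as 0 there.
diff = run1.sub(run2, fill_value=0.0)
print(diff.sort_values("time"))
```

Positive entries mark nodes where the first run spent more time; the negative entry for `io` shows a node that only appears in the second run.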

Visualizing small graphs

[Figure: visualizations of the example calling context tree (nodes foo, bar, qux, waldo, baz, grault, quux, corge, garply, fred, plugh, xyzzy, thud), including a flamegraph]

Example 1: Generating a flat profile
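
A flat profile groups every call site by function name and sums the time spent there. The sketch below does this with plain pandas on made-up records; it is not the Hatchet code shown in the lecture, and the column names `name`, `time`, and `call_sites` are assumptions.

```python
import pandas as pd

# Made-up per-call-site records: the same function may appear at several
# nodes of the calling context tree.
records = pd.DataFrame(
    {
        "name": ["main", "physics", "mpi", "solvers", "hypre", "mpi"],
        "time": [1.0, 4.0, 2.5, 0.5, 6.0, 1.0],
    }
)

# Group call sites by function name, sum their exclusive time, and list the
# most expensive functions first.
flat = (
    records.groupby("name")["time"]
    .agg(time="sum", call_sites="count")
    .sort_values("time", ascending=False)
)
print(flat)
```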

Example 2: Comparing two executions

Example 3: Scaling study

Abhinav Bhatele 5218 Brendan Iribe Center (IRB) / College Park, MD 20742 phone: 301.405.4507 / e-mail: bhatele@cs.umd.edu