Lecture 11: Measurement Tools, Abhinav Bhatele, Department of Computer Science (PowerPoint PPT Presentation)



SLIDE 1

Lecture 11: Measurement Tools

Abhinav Bhatele, Department of Computer Science

High Performance Computing Systems (CMSC714)

SLIDE 2

Abhinav Bhatele, CMSC714

Summary of last lecture

  • Scalable networks: fat-tree, dragonfly
  • Use high-radix routers
  • Many nodes connected to each switch
  • Low network diameter, high bisection bandwidth
  • Dynamic routing

SLIDE 3

Abhinav Bhatele, CMSC714

Performance analysis

  • Parallel performance of a program might not be what we expect
  • How do we find performance bottlenecks?
  • Two parts to performance analysis: measurement and analysis/visualization
  • Simplest tool: timers in the code and printf
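The "timers in the code and printf" approach from the last bullet can be sketched in a few lines. This is a minimal illustration (shown in Python rather than C); the function `work` is a hypothetical stand-in for whatever region you want to measure:

```python
import time

def work(n):
    # Hypothetical region of interest: sum the first n squares.
    return sum(i * i for i in range(n))

# Simplest measurement tool: wrap the region in timer calls and print.
start = time.perf_counter()
result = work(100_000)
elapsed = time.perf_counter() - start
print(f"work took {elapsed:.6f} s, result = {result}")
```

The obvious drawback, which motivates the tools on the next slides, is that you must decide in advance which regions to instrument and recompile to change them.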

SLIDE 4

Abhinav Bhatele, CMSC714

Performance Tools

  • Tracing tools
    • Capture entire execution trace
    • Vampir, Score-P
  • Profiling tools
    • Typically use statistical sampling
    • Gprof
  • Many tools can do both
    • TAU, HPCToolkit, Projections
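The distinction between tracing (record every event) and profiling (record aggregate statistics, often via sampling) can be made concrete with a toy tracer. This sketch uses Python's `sys.setprofile` hook purely for illustration; it is not how Vampir, gprof, or the other tools listed above are implemented:

```python
import sys

events = []  # chronological list of (event, function-name) records

def tracer(frame, event, arg):
    # Record every function entry and exit, as a tracing tool would.
    if event in ("call", "return"):
        events.append((event, frame.f_code.co_name))

def helper():
    pass

def work():
    helper()
    helper()

sys.setprofile(tracer)
work()
sys.setprofile(None)

for ev in events:
    print(ev)
```

A sampling profiler would instead wake up periodically and record only the current call stack, trading completeness for much lower overhead and data volume.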

SLIDE 5

Abhinav Bhatele, CMSC714

Metrics recorded

  • Counts of function invocations
  • Time spent in code
  • Hardware counters

SLIDE 6

Abhinav Bhatele, CMSC714

Calling contexts, trees, and graphs

  • Calling context or call path: sequence of function invocations leading to the current sample
  • Calling context tree: dynamic prefix tree of all call paths in an execution
  • Call graph: keep caller-callee relationships as arcs
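The relationship between call paths and the calling context tree can be made concrete: the CCT is just a prefix tree over the sampled call paths. A small sketch (the routine names and the `insert` helper are illustrative, not from the slides):

```python
from collections import defaultdict

def make_node():
    # Each CCT node keeps a sample count and its children by routine name.
    return {"count": 0, "children": defaultdict(make_node)}

def insert(cct, call_path):
    # Walk the path root-to-leaf, creating nodes as needed, and
    # attribute the sample to the innermost frame.
    node = cct
    for func in call_path:
        node = node["children"][func]
    node["count"] += 1

cct = make_node()
# Three sampled call paths; shared prefixes share CCT nodes.
insert(cct, ["main", "solve", "mpi"])
insert(cct, ["main", "solve", "mpi"])
insert(cct, ["main", "io"])
```

Collapsing all nodes with the same routine name (discarding context) turns this tree into a call graph, which is exactly the information loss the slide's distinction is about.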


[Figure: example calling context tree with routines main, physics, solvers, mpi, psm2, hypre]

SLIDE 7

Abhinav Bhatele, CMSC714

Output

  • Flat profile: listing of all functions with counts and execution times
  • Call graph profile
  • Calling context tree


The static call graph can be constructed from the source text of the program. However, discovering the static call graph from the source text would require two moderately difficult steps: finding the source text for the program (which may not be available), and scanning and parsing that text, which may be in any one of several languages. In our programming system, the static calling information is also contained in the executable version of the program, which we already have available, and which is in language-independent form. One can examine the instructions in the object program, looking for calls to routines, and note which routines can be called. This technique allows us to add arcs to those already in the dynamic call graph. If a statically discovered arc already exists in the dynamic call graph, no action is required. Statically discovered arcs that do not exist in the dynamic call graph are added to the graph with a traversal count of zero. Thus they are never responsible for any time propagation. However, they may affect the structure of the graph. Since they may complete strongly connected components, the static call graph construction is done before topological ordering.
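The idea of recovering static call arcs from the executable rather than the source can be illustrated in Python by scanning compiled bytecode for global names that the function loads (and, here, calls). This is a rough analogue of the paper's object-code scan, not the actual gprof mechanism:

```python
import dis

def leaf():
    pass

def caller():
    leaf()
    print("done")

# Scan the compiled code of `caller` for global names it references,
# a crude stand-in for scanning object code for call instructions.
called = set()
for ins in dis.get_instructions(caller):
    if ins.opname in ("LOAD_GLOBAL", "LOAD_NAME"):
        called.add(ins.argval)

print(called)
```

As in the paper, arcs discovered this way say only that a call *can* happen; they carry a count of zero until the dynamic data shows the call actually occurred.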
5. Data Presentation

The data is presented to the user in two different formats. The first presentation simply lists the routines without regard to the amount of time their descendants use. The second presentation incorporates the call graph of the program.

5.1. The Flat Profile

The flat profile consists of a list of all the routines that are called during execution of the program, with the count of the number of times they are called and the number of seconds of execution time for which they are themselves accountable. The routines are listed in decreasing order of execution time. A list of the routines that are never called during execution of the program is also available to verify that nothing important is omitted by this execution. The flat profile gives a quick overview of the routines that are used, and shows the routines that are themselves responsible for large fractions of the execution time. In practice, this profile usually shows that no single function is overwhelmingly responsible for the total time of the program. Notice that for this profile, the individual times sum to the total execution time.
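The flat profile described above is easy to mimic with a small sketch; the routine names and times below are made-up illustration data, not real gprof output:

```python
# Hypothetical per-routine measurements: (name, call count, self seconds).
measurements = [
    ("solve", 42, 1.75),
    ("setup", 1, 0.25),
    ("output", 3, 0.50),
]

# A flat profile lists routines in decreasing order of self time;
# the individual self times sum to the total execution time.
flat_profile = sorted(measurements, key=lambda r: r[2], reverse=True)
total = sum(t for _, _, t in flat_profile)

print(f"{'routine':<10}{'calls':>8}{'self s':>10}{'% total':>10}")
for name, calls, self_s in flat_profile:
    print(f"{name:<10}{calls:>8}{self_s:>10.2f}{100 * self_s / total:>9.1f}%")
```

Because only self time is shown, the flat profile cannot tell you that, say, `solve` is expensive only because of what its callees do; that is what the call graph profile adds.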

5.2. The Call Graph Profile

Ideally, we would like to print the call graph of the program, but we are limited by the two-dimensional nature of our output devices. We cannot assume that a call graph is planar, and even if it is, that we can print a planar version of it. Instead, we choose to list each routine, together with information about the routines that are its direct parents and children. This listing presents a window into the call graph. Based on our experience, both parent information and child information is important, and should be available without searching through the output.

The major entries of the call graph profile are the entries from the flat profile, augmented by the time propagated to each routine from its descendants. This profile is sorted by the sum of the time for the routine itself plus the time inherited from its descendants. The profile shows which of the higher level routines spend large portions of the total execution time in the routines that they call. For each routine, we show the amount of time passed by each child to the routine, which includes time for the child itself and for the descendants of the child (and thus the descendants of the routine). We also show the percentage these times represent of the total time accounted to the child. Similarly, the parents of each routine are listed, along with time, and percentage of total routine time, propagated to each one.

Cycles are handled as single entities. The cycle as a whole is shown as though it were a single routine, except that members of the cycle are listed in place of the children. Although the number of calls of each member from within the cycle are shown, they do not affect time propagation. When a child is a member of a cycle, the time shown is the appropriate fraction of the time for the whole cycle. Self-recursive routines have their calls broken down into calls from the outside and self-recursive calls. Only the outside calls affect the propagation of time.

The following example is a typical fragment of a call graph. The entry in the call graph profile listing for this example is shown in Figure 4. The entry is for routine EXAMPLE, which has the Caller routines as its parents, and the Sub routines as its children. The reader should keep in mind that all information is given with respect to EXAMPLE. The index in the first column shows that EXAMPLE is the second entry in the profile listing. The EXAMPLE routine is called ten times, four times by CALLER1, and six times by CALLER2. Consequently 40% of EXAMPLE's time is propagated to CALLER1, and 60% of EXAMPLE's time is propagated to CALLER2. The self and descendant fields of the parents show the amount of self and descendant time EXAMPLE propagates to them (but not the time used by the parents directly). Note that EXAMPLE calls itself recursively four times. The routine EXAMPLE calls routine SUB1 twenty times, SUB2 once, and never calls SUB3. Since SUB2 is called a total of five times, 20% of its self and descendant time is propagated to EXAMPLE's descendant time field. Because SUB1 is a
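The 40%/60% attribution in the EXAMPLE fragment follows directly from propagation in proportion to call counts. This sketch reproduces that arithmetic; the routine names come from the paper's example, while the `propagate` helper is our own illustration:

```python
def propagate(total_time, calls_from_parents):
    # gprof-style attribution: a routine's self + descendant time is
    # charged to each parent in proportion to how often it called us.
    # (Self-recursive calls are excluded; only outside calls count.)
    outside_calls = sum(calls_from_parents.values())
    return {parent: total_time * n / outside_calls
            for parent, n in calls_from_parents.items()}

# EXAMPLE is called 10 times from outside: 4 by CALLER1, 6 by CALLER2,
# so CALLER1 inherits 40% and CALLER2 60% of EXAMPLE's time.
shares = propagate(total_time=1.0,
                   calls_from_parents={"CALLER1": 4, "CALLER2": 6})
print(shares)
```

Note the key assumption baked into this rule: every call to a routine is taken to cost the same on average, regardless of which parent made it.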

[Figure: example call graph fragment with routines foo, bar, baz, qux, quux, corge, grault, garply, waldo, fred, plugh, xyzzy, thud]

SLIDE 8

Abhinav Bhatele, CMSC714

Questions

  • Execution count: the paper distinguishes two types of counts, an actual count and a boolean. What is the benefit of introducing the second type?
  • It seems that a call to a monitoring routine is more informative but slower than an inline counter increment. Will the slowdown actually affect the accuracy of the monitoring? Also, is this trade-off generally worth it (in terms of profiling)?
  • It is not immediately clear from the paper how they actually derive the timing approximation from the histogram. If possible, I'd like to see an illustrating example.
  • Is there any principled way to extract a static call graph from a generic program?
  • What are the different types of call graphs? How is each type best used for understanding program performance?
  • How much memory does profiling data usually require? Related: how does gprof balance its various overheads?
  • How does timeslicing work on timeshared machines?


gprof: A Call Graph Execution Profiler

SLIDE 9

Abhinav Bhatele, CMSC714

Questions

  • The paper states that "dynamic instrumentation remains susceptible to systematic measurement error because of instrumentation overhead". Where do these overheads come from, compared to static and binary instrumentation?
  • Loop optimizations performed by the compiler introduce a semantic gap between source code and binary. Is there any effort to incorporate the compiler into the profiling system to reduce this gap?
  • It seems from the paper that the proposed HPCToolkit is better than gprof. How do they compare in practice when used to profile a program?
  • How does highly optimized code make accurate profiling harder? How does binary analysis address these issues?
  • What are the measurement techniques for instrumentation?


Binary Analysis for Measurement and Attribution of Program Performance

SLIDE 10

Abhinav Bhatele, 5218 Brendan Iribe Center (IRB), College Park, MD 20742 / phone: 301.405.4507 / e-mail: bhatele@cs.umd.edu

Questions?