

SLIDE 1

Lecture 13: Things of Interest

G63.2011.002/G22.2945.001 · November 30, 2010

Debugging · Instrumentation · Profiling and Hardware

SLIDE 2

Outline

  • Debugging
  • Instrumentation
  • Profiling and Hardware

SLIDE 3

Today

“Odds and Ends”

  • Tools (emphasis on Linux, non-proprietary)
  • Ways to use them
  • Learn new details about hardware along the way
  • . . . across our four ways of high-performance computing:

Serial, OpenMP, MPI, GPU

We will post slides and (hopefully) video. Questions about your final project? → Ask us! We're happy to help!

SLIDE 4

Outline

  • Debugging
  • Instrumentation
  • Profiling and Hardware


SLIDE 6

Debugging

Bad program behavior:

  • Wrong result
  • Segmentation fault
  • Run-time errors
  • assert() violations (<assert.h>, -DNDEBUG)
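
A minimal sketch of assert() use: the check is compiled in by default and compiled out when NDEBUG is defined (e.g. via cc -DNDEBUG):

    #include <assert.h>
    #include <stddef.h>

    double average(const double *x, size_t n)
    {
        assert(x != NULL);  /* aborts with file/line info if violated */
        assert(n > 0);      /* disappears entirely under -DNDEBUG */
        double sum = 0;
        for (size_t i = 0; i < n; ++i)
            sum += x[i];
        return sum / n;
    }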

Desired Insight:

  • Where? (Source code location)
  • When? (Execution History)
  • History within function
  • Call stack
  • With what data? (Variable contents, etc.)
  • → Why? (And how do I fix it?)

Key Actions: Attach to the inferior (the process being debugged), trace its execution (ptrace()).

What about bugs that aren't reproducible?
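A minimal sketch of the ptrace() mechanism a debugger builds on (Linux; error handling omitted, and single-stepping /bin/ls is purely illustrative): the child requests tracing, and the parent then drives it one instruction at a time.

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>

    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0) {
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);  /* child: allow tracing */
            execl("/bin/ls", "ls", NULL);           /* stops at exec */
            _exit(1);
        }
        int status;
        waitpid(pid, &status, 0);  /* child is stopped at the exec */
        long count = 0;
        while (WIFSTOPPED(status)) {
            ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL);  /* one instruction */
            waitpid(pid, &status, 0);
            ++count;
        }
        printf("child executed %ld instructions\n", count);
        return 0;
    }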

SLIDE 7

Debugging with GDB: Summary

  • Three main usage patterns:
    • Run-until-crash ('Post-mortem')
    • Core dump
    • Break-and-trace
  • -g vs. -On
  • Ctrl-X Ctrl-A (toggle the TUI source display)
  • Step into (s), step over (n), finish (fin)
  • p data to look at variables
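A minimal session sketch tying these together (my-program and my_function are placeholder names):

    $ cc -g -O0 -o my-program my-program.c
    $ gdb ./my-program
    (gdb) run                  # run-until-crash pattern
    (gdb) bt                   # call stack at the crash site
    (gdb) break my_function    # break-and-trace pattern
    (gdb) run
    (gdb) n                    # step over
    (gdb) s                    # step into
    (gdb) fin                  # finish current function
    (gdb) p data               # look at a variable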

SLIDE 8

Other Debuggers: DDD

GNU Data Display Debugger (Free)


SLIDE 9

Other Debuggers: TotalView

TotalView (Proprietary)


SLIDE 10

Other Debuggers: DDT

Allinea Distributed Debugging Tool (Proprietary)


SLIDE 11

Outline

  • Debugging
  • Instrumentation
  • Profiling and Hardware

SLIDE 12

Question

Problem: Debugging only deals with problems once they cause observable wrong behavior (e.g. a crash). It doesn't find latent problems.

Suggested solution: Monitor program behavior (precisely) while it's executing. Possible?

SLIDE 13

What is Instrumentation?

A.k.a. how does Valgrind work?

x86(-64) binary → IR (SSA) → instrumentation tool → x86(-64) binary

Tools:

  • Memcheck (find pointer bugs)
  • Massif (find memory allocations)
  • Cachegrind/Callgrind (find cache misbehavior)
  • Helgrind/DRD (find data races)
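A minimal usage sketch (no special compilation needed; -g adds source locations to the reports):

    $ cc -g -o my-program my-program.c
    $ valgrind ./my-program                 # Memcheck is the default tool
    $ valgrind --tool=massif ./my-program
    $ valgrind --tool=callgrind ./my-program
    $ valgrind --tool=helgrind ./my-program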

SLIDE 14

Outline

  • Debugging
  • Instrumentation
  • Profiling and Hardware

SLIDE 15

Profilers

Slow program execution:

  • Poor memory access pattern
  • Expensive processing (e.g. division, transcendental functions)
  • Control overhead (branches, function calls)

Desired Insight:

  • Where is time spent? (Source code location)
  • When? (Execution History)
  • Call stack
  • What is the limiting factor?

Main Types of Profilers:

  • Exact, Sampling
  • Hardware, Software

SLIDE 16

Reflections on Profilers

Sampling:
  + Fast
  – Noisy (takes time to converge!)

Exact:
  + Exact
  – Slow

No free lunch. But: there is no exact machine-level profiler!

SLIDE 17

Making sense of OProfile sample counts

What do OProfile sample counts mean? Individually: not much! → Ratios make sense! What kind of ratios?

  • (Events in Routine 1) / (Events in Routine 2)
  • (Events in Line 1) / (Events in Line 2)
  • (Count of Event 1 in X) / (Count of Event 2 in X)

Always ask: Is the sample count sufficiently converged?
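A collection sketch using OProfile's opcontrol interface (event names and sample counts are CPU-specific; these are illustrative):

    $ sudo opcontrol --event=INST_RETIRED:100000 \
                     --event=CPU_CLK_UNHALTED:100000
    $ sudo opcontrol --start
    $ ./my-program
    $ sudo opcontrol --stop
    $ opreport -l ./my-program          # per-symbol sample counts
    $ opannotate --source ./my-program  # per-line sample counts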

SLIDE 18

OProfile: Examples I

  • (DCU_LINES_IN or L1D_REPL) / INST_RETIRED
    L1 miss rate; target: small; location understood (seen)
  • L2_LINES_IN / INST_RETIRED
    L2 miss rate; target: small
  • INST_RETIRED / CPU_CLK_UNHALTED
    Instructions per clock; target: > 1 (seen)
  • CYCLES_L1I_MEM_STALLED / CPU_CLK_UNHALTED
    Instruction fetch stalls. Should never happen: it means the CPU could not predict where the code is going. (→ pipeline stall)
  • BR_IND_CALL_EXEC / INST_RETIRED
    Fraction of indirect calls (virtual table lookups)

SLIDE 19

OProfile: Examples II

  • L1D_CACHE_LD / CPU_CLK_UNHALTED
    Fraction of time the L1 load/store buffers are full
  • STORE_BLOCK / CPU_CLK_UNHALTED
    Fraction of cycles the CPU is blocked waiting to be able to write to memory
  • PAGE_WALKS / CPU_CLK_UNHALTED
    Cycles spent waiting for page table walks (TLB miss penalty)
  • DTLB_MISSES / INST_RETIRED
    Data TLB miss rate

SLIDE 20

Virtual Memory

[Figure: a process's virtual address space (0x00000000–0x7fffffff, containing its text, data, and stack) mapped page-by-page onto a smaller physical address space (0x00000000–0x00ffffff); some pages belong to the process, others do not.]


SLIDE 22

Virtual Memory

[Figure: x86 two-level paging. A 32-bit linear address splits into a 10-bit page directory index, a 10-bit page table index, and a 12-bit offset into a 4K memory page. CR3 holds the 32-bit base of the page directory (aligned to a 4-KByte boundary); each 32-bit PD entry points to a page table, and each 32-bit PT entry points to a 4K page.]

(One page directory per process.)

. . . and two extra memory accesses per memory access?
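A worked decomposition of a (hypothetical) linear address under this scheme, as a C sketch:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t lin = 0x00403004;                /* hypothetical linear address */
        uint32_t pd_index = lin >> 22;            /* top 10 bits   -> 1 */
        uint32_t pt_index = (lin >> 12) & 0x3ff;  /* next 10 bits  -> 3 */
        uint32_t offset   = lin & 0xfff;          /* low 12 bits   -> 4 */
        printf("PD entry %u, PT entry %u, offset 0x%03x\n",
               pd_index, pt_index, offset);
        return 0;
    }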


SLIDE 24

Caching the Page Table

[Figure: the TLB caches the page table. A virtual address is first looked up in the TLB; a TLB hit yields the physical address directly. On a TLB miss the page table is consulted (followed by a TLB write); a page table hit yields the physical address, while a page-not-present fault goes to disk (followed by a page table write).]

What leads to a TLB flush? TLB flush ⇒ cache flush?

SLIDE 25

Influencing TLB performance

What to do if limited by TLB performance?

  • Access fewer pages:
    • Increase locality
    • Problem: fragmented memory!
  • Default x86 page granularity: 4 kiB. "Huge" pages also exist: 2 MiB.
  • Obtaining huge-page memory (Linux only; see the sketch below):
    • mount -t hugetlbfs none /mnt/huge
    • Create /mnt/huge/myfile
    • mmap() that file.

→ 5–10% gain on matmul. But: huge pages are a shared, scarce resource!
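A minimal C sketch of the mmap() step, assuming hugetlbfs is mounted at /mnt/huge and huge pages are 2 MiB:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define HUGE_SIZE (2UL * 1024 * 1024)  /* one 2 MiB huge page */

    int main(void)
    {
        int fd = open("/mnt/huge/myfile", O_CREAT | O_RDWR, 0700);
        if (fd < 0) { perror("open"); return 1; }

        void *p = mmap(NULL, HUGE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        ((char *) p)[0] = 42;        /* use p like ordinary memory */

        munmap(p, HUGE_SIZE);
        close(fd);
        unlink("/mnt/huge/myfile");  /* release the scarce resource */
        return 0;
    }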


SLIDE 26

OProfile: Also for multi-processor programs

  • EXT_SNOOP / INST_RETIRED
    Fraction of instructions causing retrieval of a modified cache line from another core
  • (L1D_CACHE_LOCK_DURATION + 20 × L1D_CACHE_LOCK) / CPU_CLK_UNHALTED
    Fraction of cycles spent waiting for synchronized ("atomic") access to memory


SLIDE 33

Atomic Operations

Collaborative (inter-block) global memory update: Read → Increment → Write. Both hand-offs (read → increment and increment → write) are interruptible!

Atomic global memory update: Read → Increment → Write, with both hand-offs protected.

How? OpenCL: atomic_{add,inc,cmpxchg,...}(__global int *, int value);
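A minimal OpenCL sketch (kernel name and setting are illustrative): each work item that sees a positive entry atomically increments a shared counter, so no update is lost to interleaving.

    // Count positive entries of x across all work groups.
    __kernel void count_positive(__global const float *x,
                                 __global int *count)
    {
        int i = get_global_id(0);
        if (x[i] > 0.0f)
            atomic_inc(count);  // protected read-increment-write
    }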

SLIDE 34

GProf

  1. $ cc -pg -o my-program my-program.c
  2. $ ./my-program (gmon.out gets created)
  3. $ gprof ./my-program

Implementation:

  • Change function invocation to store call graph information on every function call
  • Look at the program counter/call graph ∼100 times per second (coarse!)

SLIDE 35

(Abbreviated) GProf Output

Flat profile:

      %   cumulative    self              self     total
     time    seconds  seconds    calls  ms/call  ms/call  name
    80.95       0.17     0.17     2560     0.07     0.07  regmul
    19.05       0.21     0.04       40     1.00     1.00  square_dgemm
     0.00       0.21     0.00      640     0.00     0.00  copy_to_block
     0.00       0.21     0.00      320     0.00     0.53  blockmul
    [...]

Call graph:

    index  % time    self  children    called     name
                                                  <spontaneous>
    [1]     100.0    0.00      0.21               main [1]
                     0.00      0.17     40/40        square_dgemm_tuned [5]
                     0.04      0.00     40/40        square_dgemm [6]
                     0.00      0.00     80/80        fill_random_matrix [10]
    ------------------------------------------------
                     0.00      0.17    320/320      form_Cblock [4]
    [3]      81.0    0.00      0.17    320          blockmul [3]
                     0.17      0.00   2560/2560     regmul [2]
                     0.00      0.00    640/640      copy_to_block [7]

Also: Annotated source.

SLIDE 36

Intel VTune

Intel VTune (sampling, perf counters, proprietary)


SLIDE 37

AMD CodeAnalyst

AMD CodeAnalyst (sampling, perf counters, proprietary)


SLIDE 38

Apple Shark

Apple Shark (sampling, proprietary)



SLIDE 40

Profiling MPI: Jumpshot

Jumpshot MPI Profiler (exact, event-based, free)

  1. Install MPE. (Works on top of an existing MPI.)
  2. $ mpecc -o my-program my-program.c
  3. $ mperun -np 32 my-program

What to do about the deluge of data?

SLIDE 41

Profiling MPI: Jumpshot

Jumpshot MPI Profiler (exact, event-based, free)

Now how about GPUs? (Seen in hw3: Events)

SLIDE 42

Nvidia GPU Profiler: Events

  • gld_request: number of executed global load instructions per warp in an SM
  • gst_request: number of executed global store instructions per warp in an SM
  • divergent_branch: number of unique branches that diverge
  • instructions: instructions executed
  • warp_serialize: number of SIMD groups that serialize on address conflicts to local memory

And many more: see /share/apps/cuda/toolkit/3.1/doc/Compute_Profiler_3.1.txt (Careful: CUDA terminology)
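A usage sketch for the era's environment-variable-driven command-line profiler (the config file name is illustrative; counter availability varies by GPU):

    $ export COMPUTE_PROFILE=1                  # enable profiling
    $ export COMPUTE_PROFILE_CONFIG=events.txt  # counters to record
    $ cat events.txt
    gld_request
    gst_request
    divergent_branch
    $ ./my-opencl-program    # counters land in opencl_profile_0.log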

SLIDE 43

OpenCL ↔ CUDA: A dictionary

    OpenCL                          CUDA
    Grid                            Grid
    Work Group                      Block
    Work Item                       Thread
    __kernel                        __global__
    __global                        __device__
    __local                         __shared__
    __private                       __local__
    image<n>d_t                     texture<type, n, ...>
    barrier(CLK_LOCAL_MEM_FENCE)    __syncthreads()
    get_local_id(0/1/2)             threadIdx.x/y/z
    get_group_id(0/1/2)             blockIdx.x/y/z
    get_global_id(0/1/2)            – (reimplement)
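A small illustration of the dictionary (OpenCL spelling, with the CUDA equivalent of each construct in comments; kernel name is illustrative, and a work group size of at most 64 is assumed):

    __kernel void reverse_scale(__global float *x, float a)  // CUDA: __global__ void ...
    {
        __local float tmp[64];              // CUDA: __shared__ float tmp[64];
        int l = get_local_id(0);            // CUDA: threadIdx.x
        int n = get_local_size(0);          // CUDA: blockDim.x
        int g = get_group_id(0) * n + l;    // CUDA: blockIdx.x * blockDim.x + threadIdx.x
        tmp[l] = x[g];
        barrier(CLK_LOCAL_MEM_FENCE);       // CUDA: __syncthreads();
        x[g] = a * tmp[n - 1 - l];          // reverse within the work group
    }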

SLIDE 44

Local Memory: Banking

[Figure: local memory organized into banks; axes show bank, address, and work item.]

SLIDE 45

Local Memory: Banking

[Figure: addresses are distributed round-robin across the banks (one bank holds 4, 8, 12, 16, 20, ...; the next 1, 5, 9, 13, 17, 21, ...; then 2, 6, 10, 14, 18, 22, ...; then 3, 7, 11, 15, 19, 23, ...); work items 1, 2, 3 access the banks in parallel.]

SLIDE 47

Local Memory: Banking

OK: local_variable[get_local_id(0)] (Single cycle)

SLIDE 48

Local Memory: Banking

Bad: local_variable[BANK_COUNT * get_local_id(0)] (BANK_COUNT cycles)

SLIDE 49

Local Memory: Banking

OK: local_variable[(BANK_COUNT + 1) * get_local_id(0)] (Single cycle)

SLIDE 50

Local Memory: Banking

OK: local_variable[ODD_NUMBER * get_local_id(0)] (Single cycle)

SLIDE 51

Local Memory: Banking

Bad: local_variable[2 * get_local_id(0)] (BANK_COUNT/2 cycles)

SLIDE 52

Local Memory: Banking

OK: local_variable[f(blockIdx)] (Broadcast: single cycle)

SLIDE 53

Local Memory: Banking

Nvidia hardware has 16 banks. Work items access local memory in groups of 16.
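The "+1 padding" trick from the slides above, sketched as an OpenCL kernel (tile size and kernel name are illustrative): a 16×16 tile is stored with 17 columns, so a column walk hits 16 distinct banks instead of one.

    #define TILE 16

    // Transpose one TILE x TILE block of an n x n matrix, staging it
    // in local memory. The +1 column of padding makes the row stride
    // odd, so row-wise *and* column-wise accesses land in distinct banks.
    __kernel void transpose(__global float *out,
                            __global const float *in,
                            int n)
    {
        __local float tile[TILE][TILE + 1];

        int lx = get_local_id(0), ly = get_local_id(1);
        int gx = get_global_id(0), gy = get_global_id(1);

        tile[ly][lx] = in[gy * n + gx];
        barrier(CLK_LOCAL_MEM_FENCE);

        int ox = get_group_id(1) * TILE + lx;  /* swap block coordinates */
        int oy = get_group_id(0) * TILE + ly;
        out[oy * n + ox] = tile[lx][ly];       /* conflict-free column read */
    }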

SLIDE 54

Questions?


SLIDE 55

Image Credits

  • Valgrind logo: Julian Seward
  • Clock: sxc.hu/cema
  • Bar chart: sxc.hu/miamiamia
  • x86 Paging Illustrations: Wikipedia
  • Dictionary: sxc.hu/topfer
