Bursty Tracing: A Framework for Low-Overhead Temporal Profiling
Martin Hirzel (hirzel@colorado.edu), Trishul Chilimbi (trishulc@microsoft.com)
FDDO4, December 2001, Austin, Texas
“Low-overhead temporal profiling”
- Low overhead
– Intended for dynamic optimization systems
– Profile overhead must be recovered by optimization
- Temporal profiling
– Trend in profiling literature: discover more causality (path profiling, calling context trees, etc.)
– Temporal profiles expose more optimization opportunities
Arnold-Ryder profiling framework
[Figure: (a) original procedure with blocks A and B; (b) modified procedure (Arnold-Ryder): checking code A, B with an entry check and a back-edge check, plus instrumented code A′, B′.]
- Counter nCheck
- Sampling rate $r = \frac{1}{nCheck_0 + 1}$
- Implemented in Jikes RVM (Java on PowerPC)
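The counter and the sampling rate above can be illustrated with a minimal sketch. It is not the Jikes RVM implementation; it assumes that the counter is decremented at every check and that, when it reaches zero, exactly one event runs in instrumented code before the counter is reset to its initial value. The names `arnold_ryder_samples` and `n_check0` are made up for this illustration.

```python
def arnold_ryder_samples(events, n_check0):
    """Return the events that a counter-based sampler would instrument.

    Assumption: each event corresponds to one check (procedure entry or
    loop back-edge). The counter starts at n_check0 and is decremented at
    every check; when it is zero, the next event runs in instrumented code
    and the counter is reset.
    """
    samples = []
    n_check = n_check0
    for event in events:
        if n_check == 0:
            samples.append(event)  # one event in instrumented code
            n_check = n_check0     # reset and return to checking code
        else:
            n_check -= 1           # stay in checking code
    return samples


events = list(range(1000))
samples = arnold_ryder_samples(events, n_check0=9)
rate = len(samples) / len(events)  # matches 1 / (n_check0 + 1)
```

Each sample here is a single isolated event, which is the property the next slide argues against.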
Why longer bursts
- The Arnold-Ryder framework isolates events by loop back-edges, calls, and returns
- Example:
    for (i = 1; i < n; i++)
        if (...) f(); else g();
- Temporal relationships interesting for optimization:
– Single-entry multiple-exit regions
– Field reordering
Contributions
- Longer bursts
– Our framework captures temporal relationships across loop back-edges, calls, and returns.
- x86 binaries
– We report experiences with the framework in an alternative setting with different advantages and disadvantages.
- Overhead reduction techniques
– We eliminate some of the checks at procedure entries and at loop back-edges.
Talk outline
- Introduction
- Methodology
– Longer bursts
– Overhead reduction by eliminating checks
- Evaluation
– Overhead
– Profile quality
- Conclusion
Longer bursts
[Figure: (a) original procedure with blocks A and B; (b) modified procedure (longer bursts): checking code A, B with an entry check and a back-edge check, plus instrumented code A′, B′.]
- Counters nCheck and nInstr
- Sampling rate $r = \frac{nInstr_0}{nCheck_0 + nInstr_0}$
- Implemented using Vulcan (x86 binaries)
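The interplay of the two counters can be sketched as a small state machine. This is an illustration only, not the Vulcan-based implementation: it assumes every event is one check, that nCheck counts down while execution is in checking code, and that nInstr counts down while execution is in instrumented code. The function name `bursty_trace` and its parameters are hypothetical.

```python
def bursty_trace(events, n_check0, n_instr0):
    """Split a stream of check events into bursts of profiled events.

    Assumption: after n_check0 checks in checking code, the next n_instr0
    checks run in instrumented code and form one burst; then the cycle
    repeats. Both initial values must be at least 1.
    """
    bursts, current = [], []
    instrumented = False
    n_check, n_instr = n_check0, 0
    for event in events:
        if not instrumented:
            n_check -= 1
            if n_check == 0:          # switch to instrumented code
                instrumented = True
                n_instr = n_instr0
        else:
            current.append(event)     # event is recorded in the burst
            n_instr -= 1
            if n_instr == 0:          # burst complete, back to checking code
                bursts.append(current)
                current = []
                instrumented = False
                n_check = n_check0
    return bursts


bursts = bursty_trace(range(2100), n_check0=1000, n_instr0=50)
profiled = sum(len(b) for b in bursts)
rate = profiled / 2100  # matches n_instr0 / (n_check0 + n_instr0)
```

With `n_instr0=1` the sketch degenerates to one event per sample, so the rate becomes 1 / (n_check0 + 1) as on the Arnold-Ryder slide.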
Fewer checks
- Goal: reduce overhead
- Starting point: 6-35% overhead in our setting with checks on all procedure entries and loop back-edges
- Constraint: never recurse or loop for an unbounded amount of time without a check
- Remark: analogous to thread-yield points, gc-safe points, asynchronous-exception points
Eliminating entry checks
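[Figure: call graph of an example program with procedures substitute, check, match, expand, main, join, delete_digram, insert_after, and ~symbols; nodes are annotated with numbers from 1 to 4.]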
$C = \{\, f \in N \mid \neg\,\mathit{is\_leaf}(f) \wedge (\mathit{is\_root}(f) \vee \mathit{addr\_taken}(f) \vee \mathit{recursion\_from\_below}(f)) \,\}$
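A rough sketch of how the set C of procedures that keep their entry check could be computed from a call graph. The slide does not define recursion_from_below; the sketch assumes it means that f is the target of a back edge in a depth-first traversal of the call graph, which guarantees that every recursive cycle contains at least one checked procedure. The call graph, the procedure names, and the function `entry_check_set` are hypothetical.

```python
def entry_check_set(calls, roots, addr_taken):
    """Return the procedures that keep their entry check.

    calls: dict mapping every procedure to the list of procedures it calls.
    Assumption: recursion_from_below(f) holds when some call closes a cycle
    through f, i.e. f is the target of a depth-first back edge.
    """
    back_edge_targets = set()
    state = {}  # procedure -> "active" (on the DFS stack) or "done"

    def dfs(f):
        state[f] = "active"
        for g in calls[f]:
            if state.get(g) == "active":
                back_edge_targets.add(g)   # this call closes a cycle through g
            elif g not in state:
                dfs(g)
        state[f] = "done"

    # Start at the roots, then cover procedures not reachable from them.
    for f in sorted(roots) + sorted(calls):
        if f not in state:
            dfs(f)

    return {
        f for f in calls
        if calls[f]  # not a leaf
        and (f in roots or f in addr_taken or f in back_edge_targets)
    }


# Hypothetical call graph, not the one shown in the figure.
calls = {
    "main": ["parse", "eval"],
    "parse": ["next_token"],
    "eval": ["apply", "next_token"],
    "apply": ["eval"],
    "next_token": [],
    "callback": ["next_token"],
}
checked = entry_check_set(calls, roots={"main"}, addr_taken={"callback", "next_token"})
```

In this example `next_token` loses its check even though its address is taken, because it is a leaf, and `apply` loses its check because the cycle through it is already covered by the check in `eval`.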
Eliminating loop back-edge checks
- Tight inner loops
– Checking gets expensive relative to time spent in original code
– Statically optimized, not much opportunity for dynamic optimization
- Omit both checking and profiling for tight inner loops
- k-boring loop:
– No calls
– At most k profiling events of interest
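The k-boring test can be written as a simple predicate. This is a sketch under assumptions: the `Loop` record and its fields are invented for the illustration, and how calls and profiling events of interest are counted in a real loop body is left open.

```python
from dataclasses import dataclass


@dataclass
class Loop:
    name: str        # hypothetical label for the loop
    num_calls: int   # call sites in the loop body
    num_events: int  # profiling events of interest in the loop body


def is_k_boring(loop, k):
    """A loop is k-boring if it has no calls and at most k events of interest."""
    return loop.num_calls == 0 and loop.num_events <= k


def needs_back_edge_check(loop, k):
    """k-boring loops get neither a back-edge check nor profiling."""
    return not is_k_boring(loop, k)


loops = [
    Loop("copy_row", num_calls=0, num_events=2),
    Loop("walk_list", num_calls=0, num_events=7),
    Loop("dispatch", num_calls=1, num_events=1),
]
kept_with_k4 = [l.name for l in loops if needs_back_edge_check(l, 4)]
kept_with_k10 = [l.name for l in loops if needs_back_edge_check(l, 10)]
```

Raising k removes more checks: with these made-up loops, k = 4 keeps the check in `walk_list`, while k = 10 drops it.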
Evaluation: Overhead
- overhead(r) = basic overhead + r · instr overhead
[Figure: bar chart of % basic overhead (axis up to 40) for 181.mcf, 252.eon, 300.twolf, 305.espresso, and boxsim, one bar per configuration.]
- orig: all checks intact
- EL: no checks on entry to leaf procedures
- EC: call-graph technique
- EN: no checks on entry to any procedures
- L4: 4-boring loop technique
- L10: 10-boring loop technique
- LN: no checks on any loop back-edges
- EC+L4: call-graph and 4-boring loop techniques
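The linear overhead model can be turned around to ask which sampling rate fits a given overhead budget. The sketch below is a plain rearrangement of the formula on this slide; the numbers are hypothetical and are not measurements from the talk, and the function names are invented.

```python
def overhead(r, basic_overhead, instr_overhead):
    """Linear model from the slide: total overhead at sampling rate r."""
    return basic_overhead + r * instr_overhead


def max_sampling_rate(budget, basic_overhead, instr_overhead):
    """Largest r whose modeled overhead stays within budget (0 if none fits)."""
    if budget <= basic_overhead:
        return 0.0
    return min(1.0, (budget - basic_overhead) / instr_overhead)


# Hypothetical numbers, in percent of the uninstrumented running time.
BASIC, INSTR = 5.0, 100.0
total = overhead(0.05, BASIC, INSTR)
r_max = max_sampling_rate(10.0, BASIC, INSTR)
```

The basic overhead is paid regardless of r, which is why the check-elimination techniques target it directly.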
Case study: Hot data stream profiles
- data reference: dynamic load, (pc, addr) pair
- data stream: sequence v of data references
- heat of data stream: v.heat = v.length ∗ v.frequency
- hot data stream: a data stream v with v.heat > heat threshold (we set the threshold such that all hot data streams together cover 90% of the profile)
- hot data stream profile: set P of hot data streams and their heats
- $\mathit{overlap}(P, Q) = \sum_{v \in P \cup Q} \min\{v.heat_P, v.heat_Q\}$
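The definitions above can be sketched in a few lines. Several points are assumptions of this illustration and not statements from the talk: a stream that is missing from a profile is treated as having heat 0 there, the hot set is chosen greedily by descending heat until the coverage target is reached, and heats would have to be expressed as percentages of each profile for the overlap to come out in percent. Stream names and counts are made up.

```python
def select_hot(streams, total_refs, coverage=0.9):
    """Pick hot data streams until they cover `coverage` of the profile.

    streams: dict mapping a stream to (length, frequency).
    Returns a profile: dict mapping each hot stream to its heat.
    """
    by_heat = sorted(streams.items(),
                     key=lambda kv: kv[1][0] * kv[1][1], reverse=True)
    profile, covered = {}, 0
    for v, (length, frequency) in by_heat:
        if covered >= coverage * total_refs:
            break
        profile[v] = length * frequency   # v.heat = v.length * v.frequency
        covered += profile[v]
    return profile


def overlap(p, q):
    """Sum over all streams of the smaller of the two heats."""
    return sum(min(p.get(v, 0), q.get(v, 0)) for v in p.keys() | q.keys())


# Hypothetical streams: name -> (length, frequency), 100 references in total.
streams = {"abc": (3, 20), "de": (5, 5), "fgh": (3, 3), "xy": (2, 1)}
hot = select_hot(streams, total_refs=100)

p = {"abc": 60, "de": 25, "fgh": 9}
q = {"abc": 50, "de": 30, "zz": 12}
shared = overlap(p, q)
```

Streams that appear in only one profile contribute nothing, so summing over the union gives the same result as summing over the intersection.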
Evaluation: Overlap
[Figure: bar chart of % overlap (axis up to 60) for 181.mcf, 252.eon, 300.twolf, 305.espresso, and boxsim, one bar per setting $nCheck_0 : nInstr_0$ = 5000:50, 1000:50, 2000:10, 1000:10, 200:10, 200:1, 100:1, 20:1.]
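Applying the sampling-rate formula from the "Longer bursts" slide to the eight settings in this chart shows how they relate. The sketch below only does that arithmetic; the grouping is an observation about the numbers, not a claim about how the authors chose them.

```python
from fractions import Fraction

# (nCheck0, nInstr0) pairs as listed in the chart.
settings = [(5000, 50), (1000, 50), (2000, 10), (1000, 10),
            (200, 10), (200, 1), (100, 1), (20, 1)]


def sampling_rate(n_check0, n_instr0):
    """r = nInstr0 / (nCheck0 + nInstr0), kept exact as a fraction."""
    return Fraction(n_instr0, n_check0 + n_instr0)


# Group the settings by their exact sampling rate.
by_rate = {}
for n_check0, n_instr0 in settings:
    rate = sampling_rate(n_check0, n_instr0)
    by_rate.setdefault(rate, []).append((n_check0, n_instr0))

for rate in sorted(by_rate, reverse=True):
    print(f"r = {float(rate):.4f}: {by_rate[rate]}")
```

The settings fall into three sampling rates, 1/21, 1/101, and 1/201, each paired with different burst lengths, so comparing bars within one group isolates the effect of burst length.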
Evaluation: Overlap
$nCheck_0 : nInstr_0 = 1000 : 50$
[Figure: bar chart of % overlap (axis up to 60) for 181.mcf, 252.eon, 300.twolf, 305.espresso, and boxsim under the configurations orig, EL, EC, EN, L4, L10, LN, and EC+L4 (same legend as the overhead chart).]
Related work
- Arnold, Ryder, A framework for reducing the cost of instrumented code, PLDI 2001
- Temporal profiling
– Ball, Larus, Efficient path profiling, MICRO 1996
– Ammons, Ball, Larus, Exploiting hardware performance counters with flow and context sensitive profiling, PLDI 1997
– Larus, Whole program paths, PLDI 1999
– Chilimbi, Efficient representations and abstractions for quantifying and exploiting data reference locality, PLDI 2001
Conclusions
- Bursty tracing can collect temporal profiles online
– General, low-overhead, deterministic
– Flexible trade-off between sampling rate, overhead, and burst-length
– Temporal
- Future work
– Prefetching hot data streams
– Eliminating more loop back-edge checks
– Improving profile quality further