Bursty Tracing: A Framework for Low-Overhead Temporal Profiling

SLIDE 1

Bursty Tracing: A Framework for Low-Overhead Temporal Profiling

Martin Hirzel

hirzel@colorado.edu

Trishul Chilimbi

trishulc@microsoft.com

FDDO-4, December 2001, Austin, Texas

SLIDE 2

“Low-overhead temporal profiling”

  • Low overhead

– Intended for dynamic optimization systems
– Profile overhead must be recovered by optimization

  • Temporal profiling

– Trend in profiling literature: discover more causality (path profiling, calling context trees, etc.)
– Temporal profiles expose more optimization opportunities

SLIDE 3

Arnold-Ryder profiling framework

[Figure: (a) original procedure with blocks A and B; (b) modified procedure (Arnold-Ryder) with checking code and instrumented code versions of the blocks (A, B and A’, B’), plus an entry check and a back-edge check]

  • Counter nCheck
  • Sampling rate r = 1 / (nCheck0 + 1)

  • Implemented in Jikes RVM (Java on PowerPC)
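
The counter-based check on this slide can be sketched as follows. This is a minimal illustration, not the Jikes RVM code: the `ar_sampler` type and the `ar_init` and `ar_check` functions are hypothetical names, and the sketch assumes the counter counts down from nCheck0 and is reset each time a sample is taken.

```c
#include <assert.h>

/* Hypothetical sampler state; nCheck and nCheck0 follow the slide. */
typedef struct {
    long n_check0; /* reset value nCheck0 */
    long n_check;  /* countdown counter nCheck */
} ar_sampler;

void ar_init(ar_sampler *s, long n_check0) {
    assert(n_check0 >= 0);
    s->n_check0 = n_check0;
    s->n_check = n_check0;
}

/* Called at every entry or back-edge check.
 * Returns 1 if this check should branch to the instrumented code,
 * 0 if execution stays in the checking code. */
int ar_check(ar_sampler *s) {
    if (s->n_check == 0) {
        s->n_check = s->n_check0; /* take one sample, then start over */
        return 1;
    }
    s->n_check--;
    return 0;
}
```

Under these assumptions, nCheck0 checks stay in the checking code and the next one samples, so one check in every nCheck0 + 1 is sampled, which matches r = 1 / (nCheck0 + 1).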

SLIDE 4

Why longer bursts

  • Arnold-Ryder framework isolates events by loop back-edges, calls, and returns
  • Example:

    for (i = 1; i < n; i++)
        if (. . .) f();
        else g();

  • Temporal relationships interesting for optimization:

– Single-entry multiple-exit regions
– Field reordering

SLIDE 5

Contributions

  • Longer bursts

– Our framework captures temporal relationships across loop back-edges, calls, and returns.

  • x86 binaries

– We report experiences with the framework in an alternative setting with different advantages and disadvantages.

  • Overhead reduction techniques

– We eliminate some of the checks at procedure entries and at loop back-edges.

SLIDE 6

Talk outline

  • Introduction
  • Methodology

– Longer bursts
– Overhead reduction by eliminating checks

  • Evaluation

– Overhead
– Profile quality

  • Conclusion

SLIDE 7

Longer bursts

[Figure: (a) original procedure with blocks A and B; (b) modified procedure (longer bursts) with checking code and instrumented code versions of the blocks (A, B and A’, B’), plus an entry check and a back-edge check]

  • Counters nCheck and nInstr
  • Sampling rate r = nInstr0 / (nCheck0 + nInstr0)

  • Implemented using Vulcan (x86 binaries)
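
The two-counter scheme on this slide can be sketched as a small state machine. This is an illustration under stated assumptions, not the Vulcan-based implementation: `burst_sampler`, `burst_init`, and `burst_check` are hypothetical names, and the sketch assumes both counters count down from their reset values nCheck0 and nInstr0.

```c
#include <assert.h>

/* Hypothetical sampler state; counter names follow the slide. */
typedef struct {
    long n_check0, n_instr0; /* reset values nCheck0 and nInstr0 */
    long n_check, n_instr;   /* countdown counters nCheck and nInstr */
    int in_burst;            /* 1 while running instrumented code */
} burst_sampler;

void burst_init(burst_sampler *s, long n_check0, long n_instr0) {
    assert(n_check0 >= 0 && n_instr0 >= 1);
    s->n_check0 = n_check0;
    s->n_instr0 = n_instr0;
    s->n_check = n_check0;
    s->n_instr = n_instr0;
    s->in_burst = 0;
}

/* Called at every check. Returns 1 if execution should continue in the
 * instrumented code, 0 if it should continue in the checking code. */
int burst_check(burst_sampler *s) {
    if (!s->in_burst) {
        if (s->n_check > 0) {
            s->n_check--;
            return 0;            /* still between bursts */
        }
        s->in_burst = 1;         /* nCheck ran out: start a burst */
        s->n_instr = s->n_instr0;
    }
    if (--s->n_instr == 0) {     /* this is the last check of the burst */
        s->in_burst = 0;
        s->n_check = s->n_check0;
    }
    return 1;
}
```

In this sketch, each period of nCheck0 + nInstr0 checks contains one contiguous burst of nInstr0 instrumented checks, which gives the sampling rate above. Setting nInstr0 = 1 reduces the rate to 1 / (nCheck0 + 1), the Arnold-Ryder rate.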

SLIDE 8

Fewer checks

  • Goal: reduce overhead
  • Starting point: 6-35% overhead in our setting with checks on all procedure entries and loop back-edges
  • Constraint: never recurse or loop for unbounded amount of time without check
  • Remark: analogous to thread-yield points, gc-safe points, asynchronous-exception points

SLIDE 9

Eliminating entry checks

[Figure: call graph with nodes main, substitute, check, match, expand, join, delete_digram, insert_after, ~symbols]

SLIDE 10

Eliminating entry checks

[Figure: call graph with nodes main, substitute, check, match, expand, join, delete_digram, insert_after, ~symbols, annotated with the numeric labels 4 3 2 1 3 2 1 3]

C = { f ∈ N | ¬is_leaf(f) ∧ (is_root(f) ∨ addr_taken(f) ∨ recursion_from_below(f)) }
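
The membership test for the set C can be written directly as a predicate. This is a sketch only: it assumes C is the set of procedures that keep their entry check, and it takes the four properties as precomputed flags. The `proc_info` type and both function names are hypothetical, and the sketch does not show how `recursion_from_below` is derived from the call graph.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-procedure summary; field names mirror the predicates
 * in the definition of C. All flags are assumed to be computed elsewhere. */
typedef struct {
    bool is_leaf;
    bool is_root;
    bool addr_taken;
    bool recursion_from_below;
} proc_info;

/* True if f is in C. */
bool needs_entry_check(const proc_info *f) {
    return !f->is_leaf &&
           (f->is_root || f->addr_taken || f->recursion_from_below);
}

/* Counts the members of C among n procedures. */
size_t count_entry_checks(const proc_info *procs, size_t n) {
    assert(procs != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (needs_entry_check(&procs[i])) count++;
    }
    return count;
}
```

Note that a leaf procedure is never in C, even if its address is taken, because the ¬is_leaf(f) term applies to the whole conjunction.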
SLIDE 11

Eliminating loop back-edge checks

  • Tight inner loops

– Checking gets expensive relative to time spent in original code
– Statically optimized, not much opportunity for dynamic optimization

  • Omit both checking and profiling for tight inner loops
  • k-boring loop:

– No calls
– At most k profiling events of interest
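
The k-boring test can be sketched as a predicate over a per-loop summary. The `loop_summary` type and its fields are hypothetical, and the sketch assumes both quantities are counted statically over the loop body.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical static summary of one loop body. */
typedef struct {
    int num_calls;          /* call sites inside the loop */
    int num_profile_events; /* profiling events of interest inside the loop */
} loop_summary;

/* A loop is k-boring if it contains no calls and at most k profiling
 * events of interest. */
bool is_k_boring(const loop_summary *loop, int k) {
    assert(k >= 0);
    return loop->num_calls == 0 && loop->num_profile_events <= k;
}
```

For example, a loop with no calls and three events of interest is 4-boring and 10-boring but not 2-boring, while any loop containing a call is not k-boring for any k.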

SLIDE 12

Evaluation: Overhead

  • overhead(r) = basic overhead + r · instr overhead
[Figure: bar chart of % basic overhead (y-axis, ticks 5 to 40) for 181.mcf, 252.eon, 300.twolf, 305.espresso, and boxsim, with one bar per configuration. Legend: orig = all checks intact; EL = no checks on entry to leaf procedures; EC = call-graph technique; EN = no checks on entry to any procedures; L4 = 4-boring loop technique; L10 = 10-boring loop technique; EC+L4 = call-graph and 4-boring loop techniques; LN = no checks on any loop back-edges]
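
The overhead model on this slide is a linear function of the sampling rate. The sketch below evaluates it together with the longer-bursts sampling rate r = nInstr0 / (nCheck0 + nInstr0). Function and parameter names are hypothetical, and any numbers passed in are illustrative inputs rather than measurements from the talk.

```c
#include <assert.h>

/* Sampling rate of the longer-bursts framework:
 * r = nInstr0 / (nCheck0 + nInstr0). */
double sampling_rate(long n_check0, long n_instr0) {
    assert(n_check0 >= 0 && n_instr0 > 0);
    return (double)n_instr0 / (double)(n_check0 + n_instr0);
}

/* overhead(r) = basic_overhead + r * instr_overhead.
 * Both overheads are assumed to be in the same unit, e.g. percent. */
double overhead(double basic_overhead, double instr_overhead, double r) {
    assert(r >= 0.0 && r <= 1.0);
    return basic_overhead + r * instr_overhead;
}
```

The basic overhead is the part that remains as r approaches 0, which is why the check-elimination techniques target it.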

SLIDE 13

Case study: Hot data stream profiles

  • data reference: dynamic load, (pc, addr) pair
  • data stream: sequence v of data references
  • heat of data stream: v.heat = v.length ∗ v.frequency
  • hot data stream: when v.heat > heat_threshold (we set the threshold such that all hot data streams together cover 90% of the profile)
  • hot data stream profile: set P of hot data streams and their heats
  • overlap(P, Q) = Σ_{v ∈ P∪Q} min{v.heat_P, v.heat_Q}
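
The heat and overlap definitions can be sketched as follows. This is an illustration under assumptions: each data stream is identified by a hypothetical integer `id`, a profile is an array of (id, heat) entries with unique ids, and a stream missing from one profile is treated as having heat 0 there, so it contributes nothing to the sum.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical profile entry: one hot data stream and its heat. */
typedef struct {
    int id;      /* assumed unique identifier of the data stream */
    double heat; /* v.heat = v.length * v.frequency */
} stream_heat;

/* Heat of a data stream from its length and frequency. */
double heat_of(long length, long frequency) {
    assert(length >= 0 && frequency >= 0);
    return (double)length * (double)frequency;
}

/* overlap(P, Q): sum over streams of min(heat in P, heat in Q).
 * Streams present in only one profile contribute 0 under the
 * assumption stated above, so only common ids are summed. */
double overlap(const stream_heat *p, size_t np,
               const stream_heat *q, size_t nq) {
    double total = 0.0;
    for (size_t i = 0; i < np; i++) {
        for (size_t j = 0; j < nq; j++) {
            if (p[i].id == q[j].id) {
                total += p[i].heat < q[j].heat ? p[i].heat : q[j].heat;
                break; /* ids are assumed unique within a profile */
            }
        }
    }
    return total;
}
```

Comparing a profile with itself returns the sum of its heats, which is the largest overlap it can have with any other profile.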

SLIDE 14

Evaluation: Overlap

[Figure: bar chart of % overlap (y-axis, ticks 10 to 60) for 181.mcf, 252.eon, 300.twolf, 305.espresso, and boxsim, with one bar per nCheck0:nInstr0 setting: 5000:50, 1000:50, 2000:10, 1000:10, 200:10, 200:1, 100:1, 20:1]

SLIDE 15

Evaluation: Overlap

nCheck0:nInstr0 = 1000:50

[Figure: bar chart of % overlap (y-axis, ticks 10 to 60) for 181.mcf, 252.eon, 300.twolf, 305.espresso, and boxsim, with one bar per configuration. Legend: orig = all checks intact; EL = no checks on entry to leaf procedures; EC = call-graph technique; EN = no checks on entry to any procedures; L4 = 4-boring loop technique; L10 = 10-boring loop technique; EC+L4 = call-graph and 4-boring loop techniques; LN = no checks on any loop back-edges]

SLIDE 16

Related work

  • Arnold, Ryder, A framework for reducing the cost of instrumented code, PLDI 2001
  • Temporal profiling

– Ball, Larus, Efficient path profiling, MICRO 1996
– Ammons, Ball, Larus, Exploiting hardware performance counters with flow and context sensitive profiling, PLDI 1997
– Larus, Whole program paths, PLDI 1999
– Chilimbi, Efficient representations and abstractions for quantifying and exploiting data reference locality, PLDI 2001

SLIDE 17

Conclusions

  • Bursty tracing can collect temporal profiles online

– General, low-overhead, deterministic
– Flexible trade-off between sampling rate, overhead, and burst-length
– Temporal

  • Future work

– Prefetching hot data streams
– Eliminating more loop back-edge checks
– Improving profile quality further
