Heuristics for Profile-driven Method-level Speculative Parallelization



SLIDE 1

Heuristics for Profile-driven Method-level Speculative Parallelization

John Whaley and Christos Kozyrakis

Stanford University

June 15, 2005

SLIDE 2

June 15, 2005 Heuristics for Profile-driven Method-level Speculative Parallelization

Speculative Multithreading

  • Speculatively parallelize an application
      – Uses speculation to overcome ambiguous dependencies
      – Uses hardware support to recover from misspeculation
      – Promising technique for automatically extracting parallelism from programs
  • Problem: Where to put the threads?
SLIDE 3

Method-Level Speculation

  • Idea: Use method boundaries as speculative threads
      – Computation is naturally partitioned into methods
      – Execution often independent
      – Well-defined interface
  • Extract parallelism from irregular, non-numerical applications

SLIDE 4

Method-Level Speculation Example

main() {
    work_A;
    foo();
    work_C;    // reads *q
}

foo() {
    work_B;    // writes *p
}

SLIDE 5

Method-Level Speculation Example

main() {
    work_A;
    foo() {        // callee body shown inline
        work_B;    // writes *p
    }
    work_C;        // reads *q
}

SLIDE 6

Method-Level Speculation Example

(Same code as on the previous slide.)

[Figure: sequential execution timeline: work_A, then the call to foo() runs work_B, then work_C.]

SLIDE 7

Method-Level Speculation Example

(Same code as on the previous slide.)

[Figure: TLS execution, no violation. At the call to foo(), a speculative thread is forked; after the fork overhead, work_C runs in parallel with work_B. Since p != q, there is no violation.]

SLIDE 8

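Method-Level Speculation Example

(Same code as on the previous slide.)

[Figure: TLS execution, violation. At the call to foo(), a speculative thread is forked and, after the fork overhead, starts work_C in parallel with work_B. Since p = q, a violation occurs: the speculative work_C is aborted and, after further overhead, work_C runs again.]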

SLIDE 9

Method-Level Speculation Example

[Figure: the three timelines side by side: Sequential; TLS, no violation (p != q); TLS, violation (p = q, speculative work_C aborted).]
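The three timelines can be compared with a toy cost model. The sketch below is only an illustration: the function names and the timing rules are assumptions, not the model used in the talk. It assumes the speculative thread starts work_C one fork overhead after the call to foo(), and that on a violation work_C restarts only once work_B has finished, paying the overhead once more.

```python
def sequential_cycles(work_a, work_b, work_c):
    """All three pieces of work run back to back on one CPU."""
    return work_a + work_b + work_c


def tls_cycles(work_a, work_b, work_c, overhead, violation):
    """Toy model of forking work_C at the call to foo().

    Assumptions (not from the slides): the speculative thread begins
    work_C `overhead` cycles after the fork; on a violation, work_C is
    restarted after work_B completes and pays `overhead` again.
    """
    if violation:
        return work_a + work_b + overhead + work_c
    return work_a + max(work_b, overhead + work_c)


# Hypothetical workload: 100, 200 and 200 cycles, 70-cycle overhead.
seq = sequential_cycles(100, 200, 200)
ok = tls_cycles(100, 200, 200, 70, violation=False)
bad = tls_cycles(100, 200, 200, 70, violation=True)
print(seq, ok, bad)
```

Under these assumptions a violation makes the speculative run slower than the sequential one, which is why the choice of speculation points matters.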

SLIDE 10

Nested Speculation

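main() {
    foo() {
        work_A;
    }
    work_B;
    bar() {
        work_C;
    }
    work_D;
}

[Figure: timeline. A fork at foo() starts work_B speculatively (after overhead) while work_A runs; a second fork at bar() starts work_D speculatively (after overhead) while work_C runs.]

Sequences of method calls can cause nested speculation.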

SLIDE 11

This Talk: Choosing Speculation Points

  • Which methods to speculate?
      – Low chance of violation
      – Not too short, not too long
      – Not too many stores
  • Idea: Use profile data to choose good speculation points
      – Used for profile-driven and dynamic compiler
      – Should be low-cost but accurate
  • We evaluated 7 different heuristics
      – ~80% effective compared to perfect oracle

SLIDE 12

Difficulties in Method-Level Speculation

  • Method invocations can have varying execution times
      – Too short: Doesn’t overcome speculation overhead
      – Too long: More likely to violate or overflow, prevents other threads from retiring
  • Return values
      – Mispredicted return value causes violation

SLIDE 13

Classes of Heuristics

  • Simple Heuristics
      – Use only simple information, such as method runtime
  • Single-Pass Heuristics
      – More advanced information, such as sequence of store addresses
      – Single pass through profile data
  • Multi-Pass Heuristics
      – Multiple passes through profile data

SLIDE 15

Runtime Heuristic (SI-RT)

  • Speculate on all methods with:
      – MIN < runtime < MAX
  • Idea: Should be long enough to amortize overhead, but not long enough to violate
  • Data required:
      – Average runtime of each method

SLIDE 16

Store Heuristic (SI-SC)

  • Speculate on all methods with:
      – dynamic # of stores < MAX
  • Idea: Stores cause violations, so speculate on methods with few stores
  • Data required:
      – Average dynamic store count of each method
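Both simple heuristics (SI-RT and SI-SC) amount to a threshold filter over per-method profile averages. A minimal sketch, assuming the profile is available as a dictionary from method name to average value; the method names, profile numbers and threshold values below are made up for illustration.

```python
def select_by_runtime(avg_runtime, min_cycles, max_cycles):
    """SI-RT sketch: keep methods with MIN < average runtime < MAX."""
    return {m for m, cycles in avg_runtime.items()
            if min_cycles < cycles < max_cycles}


def select_by_store_count(avg_stores, max_stores):
    """SI-SC sketch: keep methods with average dynamic store count < MAX."""
    return {m for m, stores in avg_stores.items() if stores < max_stores}


# Hypothetical profile data (cycles and dynamic stores per invocation).
profile_runtime = {"parse": 5.0e4, "getX": 2.0e1, "compileAll": 3.0e8}
profile_stores = {"parse": 1.2e3, "getX": 0.0, "compileAll": 4.0e6}

by_runtime = select_by_runtime(profile_runtime, 1e3, 1e7)
by_stores = select_by_store_count(profile_stores, 1e5)
print(sorted(by_runtime), sorted(by_stores))
```

Note that the store filter alone still admits very short methods such as the hypothetical `getX`, which the runtime filter rejects as too short to amortize the overhead.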

SLIDE 17

Classes of Heuristics

  • Simple Heuristics
      – Use only simple information, such as method runtime
  • Single-Pass Heuristics
      – More advanced information, such as sequence of store addresses
      – Single pass through profile data
  • Multi-Pass Heuristics
      – Multiple passes through profile data

SLIDE 18

Stalled Threads

foo() {
    bar() {
        work_A;
    }
    work_B;
}

[Figure: timeline. A fork at bar() starts work_B speculatively (after overhead) while work_A runs; an idle period follows work_B.]

Speculative threads may stall while waiting to become main thread.

SLIDE 19

Fork at intermediate points

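foo() {
    bar() {
        work_A;
    }
    work_B;
}

[Figure: timeline. The fork is placed partway through bar()'s work_A rather than at the call; work_B starts (after overhead) from that point.]

Fork at an intermediate point within a method to avoid violations and stalling.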

SLIDE 20

Best Speedup Heuristic (SP-SU)

  • Speculate on methods with:
      – predicted speedup > THRES
  • Calculate predicted speedup by:
      predicted speedup = (expected sequential run time) / (expected parallel run time)
  • Scan store stream backwards to find fork point
      – Choose fork point to avoid violations and stalling
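The slide gives only the outline of SP-SU, so the following is a sketch under stated assumptions rather than the authors' algorithm. It assumes the profile supplies the callee's stores as (cycle, address) pairs in program order plus the set of addresses the continuation reads, and it models parallel time as the time up to the fork plus the longer of the remaining callee work and the overhead-delayed continuation. All names are hypothetical.

```python
def find_fork_point(stores, continuation_reads):
    """Scan the callee's store stream backwards and fork just after the
    last store whose address the continuation reads (assumed rule)."""
    for cycle, address in reversed(stores):
        if address in continuation_reads:
            return cycle
    return 0  # no conflicting store: fork at method entry


def predicted_speedup(callee_cycles, continuation_cycles, fork_cycle, overhead):
    """Expected sequential run time divided by expected parallel run time."""
    sequential = callee_cycles + continuation_cycles
    parallel = fork_cycle + max(callee_cycles - fork_cycle,
                                overhead + continuation_cycles)
    return sequential / parallel


# Hypothetical profile of one method: three stores, one of them conflicting.
stores = [(10, 0xA0), (50, 0xB0), (90, 0xC0)]
fork = find_fork_point(stores, continuation_reads={0xB0})
speedup = predicted_speedup(1000, 800, fork, overhead=70)
print(fork, speedup)
```

A method would then be selected when `speedup` exceeds THRES.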

SLIDE 21

Most Cycles Saved Heuristic (SP-CS)

  • Speculate on methods with:
      – predicted cycle savings > THRES
  • Calculate predicted cycle savings by:
      predicted cycle savings = sequential cycle count - parallel cycle count
  • Place fork point such that:
      – predicted probability of violation < RATIO
  • Uses same information as SP-SU
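A minimal sketch of the SP-CS decision, with assumptions labeled: candidate fork points are taken to arrive as (fork cycle, estimated violation probability) pairs ordered from earliest to latest, and the parallel cycle count uses a simple hypothetical model in which the continuation starts one overhead after the fork. Neither detail comes from the slides.

```python
def place_fork_point(candidates, ratio):
    """Return the earliest fork cycle whose predicted probability of
    violation is below RATIO, or None if no candidate qualifies."""
    for fork_cycle, violation_probability in candidates:
        if violation_probability < ratio:
            return fork_cycle
    return None


def predicted_cycles_saved(callee_cycles, continuation_cycles, fork_cycle, overhead):
    """Sequential cycle count minus (assumed) parallel cycle count."""
    sequential = callee_cycles + continuation_cycles
    parallel = fork_cycle + max(callee_cycles - fork_cycle,
                                overhead + continuation_cycles)
    return sequential - parallel


# Hypothetical candidates: later fork points are less likely to violate.
candidates = [(0, 0.9), (40, 0.75), (120, 0.3), (300, 0.0)]
fork = place_fork_point(candidates, ratio=0.7)
saved = predicted_cycles_saved(500, 800, fork, overhead=70)
print(fork, saved)
```

The method is speculated on only if `saved` exceeds THRES. Unlike the speedup ratio, this metric favors methods that account for many cycles in absolute terms.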

SLIDE 22

Classes of Heuristics

  • Simple Heuristics
      – Use only simple information, such as method runtime
  • Single-Pass Heuristics
      – More advanced information, such as sequence of store addresses
      – Single pass through profile data
  • Multi-Pass Heuristics
      – Multiple passes through profile data

SLIDE 23

Nested Speculation

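main() {
    foo() {
        work_A;
        bar() {
            work_B;
        }
        work_C;
    }
    work_D;
}

[Figure: timeline. A fork at foo() starts work_D speculatively, and a nested fork at bar() starts work_C speculatively while work_B runs; each fork pays overhead, and one thread is idle for part of the time.]

Effectiveness of speculation choice depends on choices for caller methods!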

SLIDE 24

Best Speedup Heuristic with Parent Info (MP-SU)

  • Iterative algorithm:
      – Choose speculation with best speedup
      – Readjust all callee methods to account for speculation in caller
      – Repeat until best speedup < THRES
  • Max # of iterations: depth of call graph
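The iterative loop can be sketched as follows. The slides do not say how callee estimates are readjusted, so the sketch takes that step as a caller-supplied function; `readjust`, the restriction to direct callees, and the damping rule in the example are all assumptions made for illustration.

```python
def select_with_parent_info(speedup, callees_of, readjust, thres):
    """Greedy MP-SU-style loop: pick the best predicted speedup, then
    re-estimate the direct callees of the chosen method, and repeat."""
    speedup = dict(speedup)          # work on a copy of the estimates
    remaining = set(speedup)
    chosen = []
    while remaining:
        best = max(remaining, key=speedup.get)
        if speedup[best] < thres:
            break                    # best remaining speedup < THRES
        chosen.append(best)
        remaining.discard(best)
        for callee in callees_of.get(best, ()):
            if callee in remaining:
                speedup[callee] = readjust(callee, best, speedup[callee])
    return chosen


# Hypothetical inputs: foo() calls bar(); speculating on foo() is assumed
# to halve the benefit of also speculating on bar().
estimates = {"main": 1.1, "foo": 1.6, "bar": 1.5, "baz": 1.45}
call_graph = {"main": ["foo", "baz"], "foo": ["bar"]}
halve_gain = lambda callee, caller, s: 1.0 + (s - 1.0) * 0.5

chosen = select_with_parent_info(estimates, call_graph, halve_gain, thres=1.4)
print(chosen)
```

Here `bar` would have passed the threshold on its own, but it is rejected once its estimate is adjusted for the speculation already chosen in its caller.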

SLIDE 25

Most Cycles Saved Heuristic with Parent Info (MP-CS)

  • Iterative algorithm:
      1. Choose speculation with most cycles saved and predicted violations < RATIO
      2. Readjust all callee methods to account for speculation in caller
      3. Repeat until most cycles saved < THRES
  • Multi-pass version of SP-CS
SLIDE 26

Most Cycles Saved Heuristic with No Nesting (MP-CSNN)

  • Iterative algorithm:
      – Choose speculation with most cycles saved and predicted violations < RATIO.
      – Eliminate all callee methods from consideration.
      – Repeat until most cycles saved < THRES.
  • Disallows nested speculation to avoid double-counting the benefits
  • Faster to compute than MP-CS
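A sketch of the no-nesting selection, assuming per-method estimates of cycles saved and violation probability are already available and that "callee methods" means everything reachable from the chosen method in the call graph (the slide does not say whether the elimination is transitive). All names and numbers are hypothetical.

```python
def transitive_callees(call_graph, method):
    """All methods reachable from `method` through calls."""
    seen = set()
    stack = list(call_graph.get(method, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(call_graph.get(current, ()))
    return seen


def select_no_nesting(cycles_saved, violation_prob, call_graph, thres, ratio):
    """Greedy MP-CSNN-style loop: pick the method with most cycles saved
    (and predicted violations < RATIO), drop its callees, and repeat."""
    candidates = {m for m in cycles_saved if violation_prob[m] < ratio}
    chosen = []
    while candidates:
        best = max(candidates, key=cycles_saved.get)
        if cycles_saved[best] < thres:
            break                     # most cycles saved < THRES
        chosen.append(best)
        candidates.discard(best)
        candidates -= transitive_callees(call_graph, best)
    return chosen


# Hypothetical profile estimates.
call_graph = {"main": ["foo", "baz"], "foo": ["bar"]}
cycles_saved = {"main": 1e4, "foo": 5e5, "bar": 4e5, "baz": 2e5}
violation_prob = {"main": 0.9, "foo": 0.2, "bar": 0.1, "baz": 0.5}

chosen = select_no_nesting(cycles_saved, violation_prob, call_graph,
                           thres=1e5, ratio=0.7)
print(chosen)
```

Because eliminated callees never need their estimates recomputed, each pass is a plain set operation, which is consistent with the slide's remark that this variant is faster to compute than MP-CS.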
SLIDE 27

Experimental Results

SLIDE 28

Trace-Driven Simulation

  • How to find the optimal parameters (THRES, RATIO, etc.)?
  • Parameter sweeps
      – For each benchmark
          • For each heuristic
              – Multiple parameters for each heuristic
  • For cycle-accurate simulation: >100 CPU years?!
  • Alternative: trace-driven simulation
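The nested sweep described above has the shape of a grid search. A minimal sketch, assuming a `simulate(benchmark, params)` function that returns a speedup; that function, the toy stand-in used here, and the grid values are placeholders, not the simulator or settings from the talk.

```python
from itertools import product


def sweep(simulate, benchmarks, grid):
    """For each benchmark, try every parameter combination in `grid`
    and remember the combination with the highest simulated speedup."""
    names = list(grid)
    best = {}
    for bench in benchmarks:
        for values in product(*(grid[name] for name in names)):
            params = dict(zip(names, values))
            speedup = simulate(bench, params)
            if bench not in best or speedup > best[bench][0]:
                best[bench] = (speedup, params)
    return best


def toy_simulate(bench, params):
    """Stand-in for a trace-driven simulation run (purely illustrative)."""
    bonus = 0.5 if params["THRES"] == 1e5 else 0.1
    return 1.0 + bonus - abs(params["RATIO"] - 0.7)


grid = {"THRES": [1e2, 1e5, 1e7], "RATIO": [0.1, 0.3, 0.5, 0.7, 0.9]}
best = sweep(toy_simulate, ["jack", "javac"], grid)
print(best["jack"])
```

The number of simulation runs is the product of benchmarks, heuristics and grid points, which is why the per-run cost of the simulator dominates the feasibility of the sweep.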
SLIDE 29

Trace-Driven Simulation

  1. Collect trace on Pentium III (3-way out-of-order CPU, 32K L1, 256K L2)
      – Record all memory accesses, enter/exit method events, etc.
  2. Recalibrate to remove instrumentation overhead
  3. Simulate trace on 4-way CMP hardware
      – Model shared cache, speculation overheads, dependencies, squashing, etc.

Spot check with cycle-accurate simulator: Accurate within ~3%

SLIDE 30

Simulated Architecture

  • Four 3-way out-of-order CPUs
      – 32K L1, 256K shared L2
  • Single speculative buffer per CPU
  • Forking, retiring, squashing overhead: 70 cycles each
  • Speculative threads can be preempted
      – Low priority speculations can be squashed by higher priority ones

SLIDE 31

The Oracle

  • A “Perfect” Oracle
      – Preanalyzes entire trace
      – Makes a separate decision on every method invocation
      – Chooses fork points to never violate
      – Zero overhead for forking or retiring threads
  • Upper-bound on performance of any heuristic

SLIDE 32

Benchmarks

  • SpecJVM
      – compress: Lempel-Ziv compression
      – jack: Java parser generator
      – javac: Java compiler from the JDK 1.0.2
      – jess: Java expert shell system
      – mpeg: Mpeg layer 3 audio decompression
      – raytrace: Raytracer that works on a dinosaur scene
  • SPLASH-2
      – barnes: Hierarchical N-body solver
      – water: Simulation of water molecules

SLIDE 33

Heuristic Parameter Tuning

Runtime (SI-RT)

[Figure: two surface plots over the MIN and MAX thresholds: speedup, and number of violations.]

SLIDE 34

Heuristic Parameter Tuning

Store (SI-SC)

[Figure: speedup vs. threshold, with series Void only, Constant, and Perfect.]

SLIDE 35

Heuristic Parameter Tuning

Store (SI-SC)

[Figure: number of violations vs. threshold, with series Void only, Constant, and Perfect.]

SLIDE 36

Heuristic Parameter Tuning

Best Speedup (SP-SU)

[Figure: speedup vs. threshold, with series Void only, Constant, and Perfect.]

SLIDE 37

Heuristic Parameter Tuning

Best Speedup (SP-SU)

[Figure: number of violations vs. threshold, with series Void only, Constant, and Perfect.]

SLIDE 38

Heuristic Parameter Tuning

Most Cycles Saved (SP-CS)

[Figure: two surface plots over RATIO and THRES: speedup, and number of violations.]

SLIDE 39

Heuristic Parameter Tuning

Best Speedup with Parent Info (MP-SU)

[Figure: speedup vs. threshold, with series Void only, Constant, and Perfect.]

SLIDE 40

Heuristic Parameter Tuning

Best Speedup with Parent Info (MP-SU)

[Figure: number of violations vs. threshold, with series Void only, Constant, and Perfect.]

SLIDE 41

Heuristic Parameter Tuning

Most Cycles Saved with Parent Info (MP-CS)

[Figure: two surface plots over RATIO and THRES: speedup, and number of violations.]

SLIDE 42

Heuristic Parameter Tuning

Most Cycles Saved with No Nesting (MP-CSNN)

[Figure: two surface plots over RATIO and THRES: speedup, and number of violations.]

SLIDE 43

Tuning Summary

  • Runtime (SI-RT):
      – MIN = 10^3 cycles, MAX = 10^7 cycles
  • Store (SI-SC):
      – MAX = 10^5 stores
  • Best speedup (SP-SU, MP-SU):
      – Single pass: MIN = 1.2x speedup
      – Multi pass: MIN = 1.4x speedup
  • Most cycles saved (SP-CS, MP-CS, MP-CSNN):
      – THRES = 10^5 cycles saved, RATIO = 70% violation
  • Return value prediction:
      – Constant is within 15% of perfect value prediction

SLIDE 44

Overall Speedups

[Figure: bar chart of speedup for barnes, compress, jack, javac, jess, mpeg, raytrace, water, and Average, with one bar each for SI-RT, SI-SC, SP-SU, SP-CS, MP-SU, MP-CS, MP-CSNN, and the Oracle.]

SLIDE 45

Breakdown of Speculative Threads

[Figure: stacked bars of the normalized number of threads for each heuristic (SI-RT, SI-SC, SP-SU, SP-CS, MP-SU, MP-CS, MP-CSNN) on each benchmark (barnes, compress, jack, javac, jess, mpeg, raytrace, water), split into Successful, Preempted, and Killed.]

SLIDE 46

Breakdown of Execution Time

[Figure: stacked bars of normalized execution time for each heuristic (SI-RT, SI-SC, SP-SU, SP-CS, MP-SU, MP-CS, MP-CSNN) on each benchmark (barnes, compress, jack, javac, jess, mpeg, raytrace, water), split into Useful, Idle, and Wasted.]

SLIDE 47

Speculative Store Buffer Size

           barnes   comp   jack  javac   jess   mpeg  rtrace  water
SI-RT        0.31   0.18   0.39   2.05   0.26   0.76    1.64   0.20
SI-SC       12.02   6.47   0.19   3.51   0.15  13.02    1.64   1.45
SP-SU        8.11   6.48   0.39   1.08   0.30  13.02    1.64   0.55
SP-CS        0.31   6.48   0.39   2.57   0.30  15.29    1.64   0.22
MP-SU       12.01   6.48   0.39   0.30   0.30   1.27    1.27   1.38
MP-CS       12.02   6.48   0.39   0.30   0.30   1.64    1.27   1.38
MP-CSNN     12.02   6.48   0.39   2.57   0.30  13.02    1.27   1.38

Maximum speculative store buffer size: 16KB

SLIDE 48

Related Work

  • Loop-level parallelism
  • Method-level parallelism
      – Warg and Stenstrom
          • ICPAC’01: Limit study
          • IPDPS’03: Heuristic based on runtime
          • CF’05: Misspeculation prediction
  • Compilers
      – Multiscalar: Vijaykumar and Sohi, JPDC’99
      – SpMT: Bhowmik & Chen, SPAA’02

SLIDE 49

Conclusions

  • Evaluated 7 heuristics for method-level speculation
  • Take-home points:
      – Method-level speculation has complex interactions, very hard to predict
      – Single-pass heuristics do a good job: 80% of a perfect oracle
      – Most important issue is the balance between over- and under-speculating