Loop Selection for Thread-Level Speculation
Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew
Department of Computer Science & Engineering
University of Minnesota
Shengyue Wang
Chip Multiprocessors (CMPs)
- CMPs:
  - IBM Power5
  - Sun Niagara
  - Intel dual-core Xeon
  - AMD dual-core Opteron
[Figure: processors (Proc) sharing a cache]
Improve program performance with parallel threads
Thread-Level Speculation (TLS)
Automatic parallelization is difficult
- Ambiguous data dependences
- Complex control flow
TLS facilitates automatic parallelization by:
- Executing potentially dependent threads in parallel
- Preserving data dependences via runtime checking
Where do we find speculative parallel threads?
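The two TLS bullets above can be illustrated with a small software simulation. This is only a sketch under stated assumptions: real TLS relies on hardware support, and the names `execute`, `run_speculatively`, and `make_epoch` are hypothetical, not taken from the presentation. Each loop iteration ("epoch") first runs against the same snapshot of memory, as if in parallel. Epochs then commit in order, and an epoch that read a location written by an earlier epoch is squashed and re-executed.

```python
def execute(epoch, memory_view):
    """Run one epoch against a view of memory, tracking reads and buffering writes."""
    read_set, write_buffer = set(), {}

    def read(addr):
        if addr in write_buffer:      # the epoch sees its own earlier writes
            return write_buffer[addr]
        read_set.add(addr)            # exposed read: may depend on an earlier epoch
        return memory_view.get(addr, 0)

    def write(addr, value):
        write_buffer[addr] = value    # buffered until the epoch commits

    epoch(read, write)
    return read_set, write_buffer


def run_speculatively(epochs, memory):
    """Execute all epochs speculatively, then commit in order with a runtime check."""
    snapshot = dict(memory)
    speculative = [execute(epoch, snapshot) for epoch in epochs]
    squashes = 0
    written_by_earlier = set()
    for epoch, (read_set, write_buffer) in zip(epochs, speculative):
        if read_set & written_by_earlier:
            # Dependence violated: discard the speculative result and re-execute
            # against memory that already holds all earlier commits.
            squashes += 1
            read_set, write_buffer = execute(epoch, memory)
        memory.update(write_buffer)
        written_by_earlier.update(write_buffer)
    return squashes


def make_epoch(i):
    """Hypothetical loop body: always writes a[i]; iterations 2 and 3 also bump 'total'."""
    def epoch(read, write):
        write(("a", i), i * i)
        if i >= 2:
            write("total", read("total") + 1)
    return epoch


memory = {"total": 0}
squashes = run_speculatively([make_epoch(i) for i in range(4)], memory)
```

In this toy run only the last iteration reads a value produced by an earlier one, so it is the only epoch that is squashed, and the final memory matches sequential execution.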
Parallelizing Loops under TLS
Loops are good candidates for parallelism
- Regular structure
- Significant coverage of dynamic execution time
General-purpose applications are complicated
Facts about SPECINT 2000:
- Average number of loops: 714
- Average dynamic loop nesting: 8
Loop selection: which loops should be parallelized?
Potential of Loop Selection
[Figure: bar chart of program speedup (y-axis, -40% to 100%) for mcf, crafty, twolf, gzip, bzip2, vortex, vpr, parser, gap, gcc, and perlbmk, comparing three selections: Outer loop, Inner loop, Best]
Carefully selected loops can improve performance significantly!
Outline
- Loop selection
- Algorithm
- Parallel performance prediction
- Dynamic loop behavior
- Conclusions
Loop Nesting
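Source code:
    main( ) {
        while ( condition1 ) {          /* main_loop1 */
            while ( condition2 ) {      /* main_loop2 */
                foo( );
                goo( );
            }
        }
    }
    foo( ) {
        while ( condition3 ) {          /* foo_loop1 */
            goo( );
        }
    }
    goo( ) {
        while ( condition4 ) { }        /* goo_loop1 */
    }
Loop graph (node: static loop; edge: nesting relationship):
    main_loop1 -> main_loop2
    main_loop2 -> foo_loop1
    main_loop2 -> goo_loop1
    foo_loop1 -> goo_loop1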
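As a rough illustration, the loop graph for this example can be written down as an adjacency map, with the nesting relationship derived by reachability. The edge list is read off the example source (loops reached through calls to foo() and goo() count as nested); the names `LOOP_GRAPH`, `descendants`, and `nested` are hypothetical helpers, not part of the presentation.

```python
# Outer loop -> loops directly nested in it, including loops reached through calls.
LOOP_GRAPH = {
    "main_loop1": ["main_loop2"],
    "main_loop2": ["foo_loop1", "goo_loop1"],
    "foo_loop1": ["goo_loop1"],
    "goo_loop1": [],
}


def descendants(graph, loop):
    """Return every loop nested, directly or indirectly, inside `loop`."""
    seen = set()
    stack = list(graph[loop])
    while stack:
        inner = stack.pop()
        if inner not in seen:
            seen.add(inner)
            stack.extend(graph[inner])
    return seen


def nested(graph, a, b):
    """True if one of the two loops is nested inside the other."""
    return b in descendants(graph, a) or a in descendants(graph, b)


inside_main_loop2 = descendants(LOOP_GRAPH, "main_loop2")
foo_goo_nested = nested(LOOP_GRAPH, "foo_loop1", "goo_loop1")
```

In this small example every pair of loops is related by nesting, so at most one of the four loops can be parallelized at a time.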
Benefit of Parallelizing a Single Loop
benefit = % program execution time saved = coverage × (1 – 1 / loop speedup)
Loop          Coverage   Loop speedup   Benefit
main_loop1    80%        1.2            13%
main_loop2    70%        1.4            20%
goo_loop1     30%        1.2            5%
foo_loop1     50%        1.6            18%
Program speedup = 1 / (1 - benefit) = 1.25
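The arithmetic on this slide can be checked with a short script. The coverage and loop speedup values are the ones shown for the four example loops; the function and variable names are illustrative only.

```python
def benefit(coverage, loop_speedup):
    """Fraction of program execution time saved by parallelizing one loop."""
    return coverage * (1 - 1 / loop_speedup)


def program_speedup(total_benefit):
    """Whole-program speedup once `total_benefit` of the execution time is saved."""
    return 1 / (1 - total_benefit)


# (coverage, loop speedup) for each example loop
LOOPS = {
    "main_loop1": (0.80, 1.2),
    "main_loop2": (0.70, 1.4),
    "goo_loop1": (0.30, 1.2),
    "foo_loop1": (0.50, 1.6),
}

benefits = {name: benefit(cov, spd) for name, (cov, spd) in LOOPS.items()}
best_loop = max(benefits, key=benefits.get)
best_speedup = program_speedup(benefits[best_loop])

for name, value in benefits.items():
    print(f"{name}: {value:.1%}")
print(f"best single loop: {best_loop}, program speedup {best_speedup:.2f}")
```

The program speedup of 1.25 corresponds to parallelizing main_loop2 alone (benefit 20%). For foo_loop1 the formula gives 18.75%, which the slide lists as 18%.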
Loop Selection: Problem Definition
Goal:
- Select the set of loops that maximizes overall program performance when parallelized
Constraint:
- The set cannot contain two loops with a nesting relationship
Loop selection is NP-complete!
- It corresponds to the weighted maximum independent set problem
Loop Selection: Algorithm
- Exhaustive search (≤ 50 nodes)
  - Try all possible combinations of loops
- Greedy algorithm (> 50 nodes)
  - Consider loops in descending order of benefit
  - Check for a nesting relation with the loops already selected
  - Add the loop to the set if there is no nesting
Average number of loops for SPECINT 2000: 714
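The two strategies above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the greedy version assumes that loops with non-positive benefit are skipped, and the second data set (outer, inner_a, inner_b) is an invented example chosen to show a case where greedy and exhaustive search disagree.

```python
from itertools import combinations


def conflict_free(loops, nested_pairs):
    """True if no two loops in `loops` have a nesting relationship."""
    return all(frozenset(pair) not in nested_pairs for pair in combinations(loops, 2))


def exhaustive_select(benefit, nested_pairs):
    """Try every combination of loops and keep the best conflict-free one."""
    best, best_total = (), 0.0
    names = sorted(benefit)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            total = sum(benefit[loop] for loop in subset)
            if total > best_total and conflict_free(subset, nested_pairs):
                best, best_total = subset, total
    return set(best), best_total


def greedy_select(benefit, nested_pairs):
    """Visit loops in descending order of benefit; add each one that does not conflict."""
    selected = []
    for loop in sorted(benefit, key=benefit.get, reverse=True):
        if benefit[loop] <= 0:  # assumption: never select a loop that does not help
            break
        if all(frozenset((loop, chosen)) not in nested_pairs for chosen in selected):
            selected.append(loop)
    return set(selected), sum(benefit[loop] for loop in selected)


# Example loops from the slides: every pair is related by nesting.
SLIDE_BENEFIT = {"main_loop1": 0.13, "main_loop2": 0.20, "goo_loop1": 0.05, "foo_loop1": 0.18}
SLIDE_NESTED = {frozenset(pair) for pair in combinations(SLIDE_BENEFIT, 2)}
slide_choice, slide_total = exhaustive_select(SLIDE_BENEFIT, SLIDE_NESTED)

# Hypothetical graph: `outer` contains two sibling loops that are not nested in each other.
TOY_BENEFIT = {"outer": 0.30, "inner_a": 0.20, "inner_b": 0.20}
TOY_NESTED = {frozenset(("outer", "inner_a")), frozenset(("outer", "inner_b"))}
greedy_choice, greedy_total = greedy_select(TOY_BENEFIT, TOY_NESTED)
optimal_choice, optimal_total = exhaustive_select(TOY_BENEFIT, TOY_NESTED)
```

On the slide example both strategies pick main_loop2. On the hypothetical graph, greedy takes the outer loop first and then cannot add either inner loop, while exhaustive search finds that the two inner loops together save more time. Exhaustive search visits all 2^n subsets, which is why a size cutoff such as 50 nodes is needed.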
Loop Pruning
With pruning, the greedy algorithm is needed only for gcc and parser
[Figure: loop graph with nodes loop1 through loop8]
Benefit of Parallelizing a Single Loop
Loop          Coverage   Loop speedup   Benefit
main_loop1    80%        1.2            13%
main_loop2    70%        1.4            20%
goo_loop1     30%        1.2            5%
foo_loop1     50%        1.6            18%
How can we estimate the speedup?
Outline
- Loop selection
- Algorithm
- Parallel performance prediction
- Dynamic loop behavior
- Conclusions
Estimating Parallel Performance
Communicating values between speculative threads adds significant overhead to parallel execution
- Synchronization: resolves frequently occurring data dependences
- Speculation: resolves infrequently occurring data dependences
Estimating communication costs with the compiler
Cost of Mis-speculation
Cost of mis-speculation = amount of work wasted × probability of mis-speculation
[Figure: threads T1 (store) and T2 (load); a mis-speculated dependence (×), with the amount of work wasted and the resulting sequential part marked]
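A small numeric illustration of the cost formula, assuming the wasted work is measured in cycles. The function name and all numbers below are hypothetical; they are not measurements from the presentation.

```python
def misspeculation_cost(work_wasted_cycles, misspeculation_prob):
    """Expected cost of mis-speculation: wasted work weighted by how often it happens."""
    if not 0.0 <= misspeculation_prob <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return work_wasted_cycles * misspeculation_prob


WORK_WASTED = 400  # hypothetical: cycles of speculative work discarded on a violation

rare_dependence = misspeculation_cost(WORK_WASTED, 0.02)      # dependence occurs rarely
frequent_dependence = misspeculation_cost(WORK_WASTED, 0.60)  # dependence occurs often

print(f"rare dependence: about {rare_dependence:.0f} cycles expected per thread")
print(f"frequent dependence: about {frequent_dependence:.0f} cycles expected per thread")
```

The same amount of wasted work costs far more when the dependence occurs often, which is consistent with using speculation only for infrequently occurring dependences.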
Synchronization
[Figure: threads T1 and T2 with a synchronized store and load]
Synchronization serializes parallel execution
Cost of Synchronization
[Figure: three timelines of threads T1 and T2 with store/load pairs (store1/load1, store2/load2), one per estimate]
- Est. I: synchronization cost = # of dependent instructions
- Est. II: synchronization cost = longest stall
- Est. III: synchronization cost = longest stall based on dependent instructions
Experimental Framework
- Machine model
  - 4 single-issue in-order processors
  - Private L1 data cache (32 KB, 2-way, 1 cycle)
  - Shared L2 data cache (2 MB, 4-way, 10 cycles)
  - Speculation support (write buffer, address buffer)
  - Synchronization support (comm. buffer, 10 cycles)
- Compiler optimizations using ORC 2.1
  - Instruction scheduling to improve parallelism
Comparison: Speedup Estimation Techniques
[Figure: two bar charts over mcf, crafty, twolf, gzip, bzip2, vortex, vpr, parser, gap, gcc, and perlbmk, comparing Est. I, Est. II, Est. III, and Perfect. First chart: program speedup (-40% to 100%). Second chart: coverage (0% to 100%).]
Average program speedup: 20%, coverage: 70%
Outline
- Loop selection
- Algorithm
- Parallel performance prediction
- Dynamic loop behavior
- Conclusions
Loop Behavior May Change
Source code:
    main( ) {
        while ( condition1 ) {
            while ( condition2 ) {
                foo( );
                goo( );
            }
        }
    }
    foo( ) {
        while ( condition3 ) {
            goo( );
        }
    }
    goo( ) {
        while ( condition4 ) { }
    }
Calling context of a loop: the path from the root to that loop
Loop tree (goo_loop1 appears once per calling context, as goo_loop1_A and goo_loop1_B):
    main_loop1
        main_loop2
            foo_loop1
                goo_loop1 (reached via foo_loop1)
            goo_loop1 (reached directly from main_loop2)
Loop Selection in a Tree
[Figure: loop tree with nodes main_loop1, main_loop2, foo_loop1, goo_loop1_A, and goo_loop1_B, annotated with benefits 13%, 20%, 5%, -2%, and 18%]
- goo_loop1 is parallelized only when it is reached from main_loop2
- Loop cloning can be used
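Once each calling context has its own node, the loop graph becomes a tree, and selection can be done bottom-up: for each node, either parallelize it or keep the best choice from its subtrees. The sketch below shows one way to do this and is not necessarily the authors' method. The assignment of benefits to nodes is an assumption chosen to match the statement that goo_loop1 is parallelized only when reached from main_loop2 (5% in that context, -2% when reached via foo_loop1); the node names with "@" are hypothetical labels for the two contexts.

```python
# Loop tree: each node lists the loops directly nested in it for that calling context.
TREE = {
    "main_loop1": ["main_loop2"],
    "main_loop2": ["foo_loop1", "goo_loop1@main_loop2"],
    "foo_loop1": ["goo_loop1@foo_loop1"],
}

# Assumed mapping of the benefits shown on the slide to tree nodes.
BENEFIT = {
    "main_loop1": 0.13,
    "main_loop2": 0.20,
    "foo_loop1": 0.18,
    "goo_loop1@foo_loop1": -0.02,
    "goo_loop1@main_loop2": 0.05,
}


def select(tree, benefit, node):
    """Return (total benefit, selected loops) for the subtree rooted at `node`."""
    children_total, children_choice = 0.0, []
    for child in tree.get(node, []):
        child_total, child_choice = select(tree, benefit, child)
        children_total += child_total
        children_choice += child_choice
    if benefit[node] > children_total:
        return benefit[node], [node]         # parallelizing this loop beats its subtrees
    return children_total, children_choice   # otherwise keep the nested selections


total_benefit, chosen = select(TREE, BENEFIT, "main_loop1")
speedup_with_context = 1 / (1 - total_benefit)
```

Under these assumed numbers, selecting foo_loop1 together with the copy of goo_loop1 reached from main_loop2 saves about 23% of execution time, compared with 20% for main_loop2 alone.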
Loop Behavior May Change
Exploit loop behavior dynamically
[Figure: loop tree with nodes main_loop1, goo_loop1_A, goo_loop1_B, and two invocations of foo_loop1, with some invocations highlighted]
- Only parallelize the loop in these invocations
Potential of Exploiting Dynamic Behavior
[Figure: two bar charts over mcf, crafty, twolf, gzip, bzip2, vortex, vpr, parser, gap, gcc, and perlbmk, comparing No Context, Calling Context, and Oracle. First chart: program speedup (0% to 100%). Second chart: coverage (0% to 100%).]
5 out of 11 benchmarks show performance potential
Conclusions
Loop selection is important for TLS
- Compiler-based loop selection
  - Speedup 20%, coverage 70%
- Exploiting dynamic behavior offers performance potential