Loop Selection for Thread-Level Speculation (PowerPoint PPT Presentation)



SLIDE 1

Loop Selection for Thread-Level Speculation

Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota

SLIDE 2

Chip Multiprocessors (CMPs)

  • IBM Power5
  • Sun Niagara
  • Intel dual-core Xeon
  • AMD dual-core Opteron

[Diagram: several processors sharing an on-chip cache; photo of the IBM Power5]

Improve program performance with parallel threads

SLIDE 3

Thread-Level Speculation (TLS)

Automatic parallelization is difficult:

  • Ambiguous data dependences
  • Complex control flow

TLS facilitates automatic parallelization by:

  • Executing potentially dependent threads in parallel
  • Preserving data dependences via runtime checking

Where do we find speculative parallel threads?

SLIDE 4

Parallelizing Loops under TLS

Loops are good candidates for parallelism:

  • Regular structure
  • Significant coverage on dynamic execution time

General-purpose applications are complicated. Facts about SPECINT 2000:

  • Average number of loops: 714
  • Average dynamic loop nesting: 8

Loop selection: which loops should be parallelized?

SLIDE 5

Potential of Loop Selection

[Chart: program speedup for mcf, crafty, twolf, gzip, bzip2, vortex, vpr, parser, gap, gcc, and perlbmk; bars compare the best outer loop, the best inner loop, and the best overall loop selection]

Carefully selected loops can improve performance significantly!

SLIDE 6

Outline

  • Loop selection: algorithm (this section)
  • Parallel performance prediction
  • Dynamic loop behavior
  • Conclusions
SLIDE 7

Loop Nesting

Source code:

main() {
  while (condition1) {        // main_loop1
    while (condition2) {      // main_loop2
      foo();
      goo();
    }
  }
}

foo() {
  while (condition3) {        // foo_loop1
    goo();
  }
}

goo() {
  while (condition4) { }      // goo_loop1
}

Loop graph: each node is a static loop and each edge is a nesting relationship (main_loop1 → main_loop2; main_loop2 → foo_loop1 and goo_loop1; foo_loop1 → goo_loop1).

SLIDE 8

Benefit of Parallelizing a Single Loop

benefit = % program execution time saved = coverage × (1 – 1 / loop speedup)

Loop         Coverage   Loop Speedup   Benefit
main_loop1   80%        1.2            13%
main_loop2   70%        1.4            20%
goo_loop1    30%        1.2            5%
foo_loop1    50%        1.6            18%

Selecting the best single loop, main_loop2 (benefit 20%): program speedup = 1 / (1 - benefit) = 1 / 0.8 = 1.25
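As a sketch, the benefit and program-speedup formulas above can be reproduced in a few lines (the loop names and values come from the slide's table; the helper functions themselves are hypothetical):

```python
def benefit(coverage, loop_speedup):
    """Fraction of program execution time saved by parallelizing one loop."""
    return coverage * (1.0 - 1.0 / loop_speedup)

def program_speedup(saved_fraction):
    """Overall speedup when a fraction of execution time is saved."""
    return 1.0 / (1.0 - saved_fraction)

# (coverage, loop speedup) per loop, from the slide's table.
loops = {
    "main_loop1": (0.80, 1.2),
    "main_loop2": (0.70, 1.4),
    "goo_loop1":  (0.30, 1.2),
    "foo_loop1":  (0.50, 1.6),
}

for name, (cov, sp) in loops.items():
    print(name, round(benefit(cov, sp), 2))

# Best single loop is main_loop2 with benefit 20%:
print(round(program_speedup(benefit(0.70, 1.4)), 2))  # 1.25
```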

SLIDE 9

Loop Selection: Problem Definition

Goal: select the set of loops that, when parallelized, maximizes overall program performance.

Constraint: the set cannot contain loops with a nesting relationship.

Loop selection is NP-complete: it is an instance of the weighted maximum independent set problem on the loop graph.

SLIDE 10

Loop Selection: Algorithm

  • Exhaustive search (≤ 50 nodes): try all possible combinations of loops
  • Greedy algorithm (> 50 nodes): consider loops in descending order of benefit, check for a nesting relation with already-selected loops, and add the loop to the set if there is none

Average number of loops for SPECINT 2000: 714
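The greedy branch can be sketched as follows (a minimal sketch; the data structures and the transitively-closed nesting pairs are assumptions, since the slide only specifies the ordering and the nesting check):

```python
def greedy_select(benefits, nesting):
    """Greedy loop selection: visit loops in descending order of benefit
    and add a loop only if it has no nesting relationship with any
    already-selected loop.

    benefits: dict mapping loop name -> benefit
    nesting:  set of frozensets {a, b} meaning a and b are nested
    """
    selected = []
    for loop in sorted(benefits, key=benefits.get, reverse=True):
        if all(frozenset((loop, s)) not in nesting for s in selected):
            selected.append(loop)
    return selected

# Loop graph from the earlier example; nesting is transitive, so all
# ancestor/descendant pairs are listed.
benefits = {"main_loop1": 0.13, "main_loop2": 0.20,
            "goo_loop1": 0.05, "foo_loop1": 0.18}
nesting = {frozenset(p) for p in [
    ("main_loop1", "main_loop2"), ("main_loop2", "foo_loop1"),
    ("main_loop2", "goo_loop1"), ("foo_loop1", "goo_loop1"),
    ("main_loop1", "foo_loop1"), ("main_loop1", "goo_loop1"),
]}
print(greedy_select(benefits, nesting))  # ['main_loop2']
```

Here main_loop2 (benefit 20%) is taken first, and every other loop nests with it, so it is the whole selection.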

SLIDE 11

Loop Pruning

Only resort to greedy algorithm for gcc and parser

[Diagram: example loop graph with nodes loop1 through loop8]

SLIDE 12

Benefit of Parallelizing a Single Loop

Loop         Coverage   Speedup   Benefit
main_loop1   80%        1.2       13%
main_loop2   70%        1.4       20%
goo_loop1    30%        1.2       5%
foo_loop1    50%        1.6       18%

How can we estimate the speedup?

SLIDE 13

Outline

  • Loop selection: algorithm
  • Parallel performance prediction (this section)
  • Dynamic loop behavior
  • Conclusions
SLIDE 14

Estimating Parallel Performance

Communicating values between speculative threads adds significant overhead to parallel execution:

  • Synchronization: resolves frequently occurring data dependences
  • Speculation: resolves infrequently occurring data dependences

Estimating communication costs with the compiler.

SLIDE 15

Cost of Mis-speculation

Cost of mis-speculation = amount of work wasted × prob. of mis-speculation

[Diagram: thread T2's load executes before thread T1's store; the violated dependence squashes T2, wasting the work done so far]

SLIDE 16

Cost of Mis-speculation

Cost of mis-speculation = amount of work wasted × prob. of mis-speculation

[Diagram: the same dependence; after the squash, T2 re-executes, so the wasted work plus re-execution forms a sequential part]
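A hypothetical worked example of the cost formula (the cycle count and probability are illustrative, not from the talk):

```python
def misspeculation_cost(work_wasted_cycles, prob_misspeculation):
    """Expected cost of a dependence = amount of work wasted on a
    squash times the probability that the squash occurs."""
    return work_wasted_cycles * prob_misspeculation

# A violation that squashes 1000 cycles of work and fires on 5% of
# iterations costs 50 cycles per iteration on average.
print(misspeculation_cost(1000, 0.05))  # 50.0
```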

SLIDE 17

Synchronization

[Diagram: T1's store forwards its value to T2's load; T2 stalls until the value arrives]

Synchronization serializes parallel execution

SLIDE 18

Cost of Synchronization

[Diagram: threads T1 and T2 with synchronized dependences store1 → load1 and store2 → load2]

  • Est. I: synchronization cost = number of dependent instructions
  • Est. II: synchronization cost = longest stall
  • Est. III: synchronization cost = longest stall, based on dependent instructions
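One way to picture Est. II, assuming single-issue in-order threads so that an instruction's position in the iteration approximates its cycle number (the positions and the 10-cycle communication latency here are illustrative assumptions):

```python
def stall_cycles(store_pos, load_pos, comm_latency=10):
    """Cycles the consumer thread waits at a synchronized load,
    assuming single-issue threads where position ~ cycle number."""
    return max(0, store_pos + comm_latency - load_pos)

# Two synchronized dependences in one iteration:
# (store position in producer T1, load position in consumer T2)
deps = [(40, 5), (60, 50)]

# Est. II takes the longest stall over all dependences.
est2 = max(stall_cycles(s, l) for s, l in deps)
print(est2)  # 45
```

Est. III would apply the same idea but measure positions over dependent instructions only, which tends to shorten the estimated stalls.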
SLIDE 19

Experimental Framework

  • Machine model:
    • 4 single-issue in-order processors
    • Private L1 data cache (32K, 2-way, 1 cycle)
    • Shared L2 data cache (2M, 4-way, 10 cycles)
    • Speculation support (write buffer, address buffer)
    • Synchronization support (comm. buffer, 10 cycles)
  • Compiler optimizations using ORC 2.1:
    • Instruction scheduling to improve parallelism
SLIDE 20

Comparison: Speedup Estimation Techniques

[Charts: program speedup (top) and coverage (bottom) for mcf, crafty, twolf, gzip, bzip2, vortex, vpr, parser, gap, gcc, and perlbmk under Est. I, Est. II, Est. III, and a perfect speedup estimator]

Average program speedup: 20%, coverage: 70%

SLIDE 21

Outline

  • Loop selection: algorithm
  • Parallel performance prediction
  • Dynamic loop behavior (this section)
  • Conclusions
SLIDE 22

Loop Behavior May Change

Source code:

main() {
  while (condition1) {        // main_loop1
    while (condition2) {      // main_loop2
      foo();
      goo();
    }
  }
}

foo() {
  while (condition3) {        // foo_loop1
    goo();
  }
}

goo() {
  while (condition4) { }      // goo_loop1
}

Calling context of a loop: the path from the root to that loop.

Loop tree: main_loop1 → main_loop2 → foo_loop1, goo_loop1_A, goo_loop1_B. The static loop goo_loop1 appears as two tree nodes because it is reached through two calling contexts: directly from main_loop2, and through foo_loop1.
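The idea of distinguishing a loop by its calling context can be sketched with context keys (a hypothetical representation; the talk builds the loop tree in the compiler):

```python
from collections import defaultdict

def context_key(path):
    """A dynamic loop is identified by the path from the root of the
    loop tree down to the loop, not just by its static id."""
    return tuple(path)

# goo_loop1 reached along two different paths, as in the example above:
paths = [
    ("main_loop1", "main_loop2", "goo_loop1"),
    ("main_loop1", "main_loop2", "foo_loop1", "goo_loop1"),
]

counts = defaultdict(int)
for p in paths:
    counts[context_key(p)] += 1

# The same static loop yields two distinct loop-tree nodes.
print(len(counts))  # 2
```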

SLIDE 23

Loop Selection in a Tree

Loop tree with per-node benefits: main_loop1 13%, main_loop2 20%, foo_loop1 5%, goo_loop1_A 2%, goo_loop1_B 18%.

goo_loop1 is parallelized only when it is reached from main_loop2. Loop cloning can be used.

SLIDE 24

Loop Behavior May Change

Exploit loop behavior dynamically

[Diagram: a sequence of loop invocations (main_loop1, goo_loop1_A, goo_loop1_B, foo_loop1, foo_loop1); the profitable invocations are highlighted]

Only parallelize the loop in these invocations.

SLIDE 25

Potential of Exploiting Dynamic Behavior

[Charts: program speedup (top) and coverage (bottom) for mcf, crafty, twolf, gzip, bzip2, vortex, vpr, parser, gap, gcc, and perlbmk comparing No Context, Calling Context, and Oracle selection]

5 out of 11 benchmarks show performance potential.

SLIDE 26

Conclusions

  • Loop selection is important for TLS
  • Compiler-based loop selection achieves an average 20% speedup at 70% coverage
  • Exploiting dynamic loop behavior offers further performance potential

SLIDE 27

Thank You!