Loop Selection for Thread-Level Speculation
Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew
Department of Computer Science & Engineering, University of Minnesota


  1. Loop Selection for Thread-Level Speculation
     Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew
     Department of Computer Science & Engineering, University of Minnesota

  2. Chip Multiprocessors (CMPs)
     [Figure: IBM Power5; CMP block diagram with several processors (Proc) sharing a cache]
     CMPs:
     • IBM Power5
     • Sun Niagara
     • Intel dual-core Xeon
     • AMD dual-core Opteron
     Improve program performance with parallel threads

  3. Thread-Level Speculation (TLS)
     Automatic parallelization is difficult:
     • Ambiguous data dependences
     • Complex control flow
     TLS facilitates automatic parallelization by:
     • Executing potentially dependent threads in parallel
     • Preserving data dependences via runtime checking
     Where do we find speculative parallel threads?

  4. Parallelizing Loops under TLS
     Loops are good candidates for parallelism:
     • Regular structure
     • Significant coverage of dynamic execution time
     General-purpose applications are complicated. Facts about SPECINT 2000:
     • Average number of loops: 714
     • Average dynamic loop nesting: 8
     Loop selection: which loops should be parallelized?

  5. Potential of Loop Selection
     [Figure: program speedup (-40% to 100%) per benchmark for three loop choices: Outer loop, Inner loop, Best]
     Carefully selected loops can improve performance significantly!

  6. Outline
     • Loop selection (current)
       • Algorithm (current)
       • Parallel performance prediction
     • Dynamic loop behavior
     • Conclusions

  7. Loop Nesting
     Source code:
         main( ) {
           while ( condition1 ) {        /* main_loop1 */
             while ( condition2 ) {      /* main_loop2 */
               foo( );
               goo( );
             }
           }
         }
         foo( ) {
           while ( condition3 ) {        /* foo_loop1 */
             goo( );
           }
         }
         goo( ) {
           while ( condition4 ) {        /* goo_loop1 */
           }
         }
     Loop graph: each node (main_loop1, main_loop2, foo_loop1, goo_loop1) is a static loop; each edge is a nesting relationship.

  8. Benefit of Parallelizing a Single Loop
     Loop          Coverage   Loop speedup   Benefit
     main_loop1    80%        1.2            13%
     main_loop2    70%        1.4            20%
     foo_loop1     30%        1.2            5%
     goo_loop1     50%        1.6            18%
     benefit = % of program execution time saved = coverage × (1 - 1 / loop speedup)
     Program speedup = 1 / (1 - benefit) = 1.25
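The two formulas on this slide can be checked directly against the table. The sketch below is only an illustration: the function names are made up here, and the coverage and speedup values are copied from the slide. Note that 50% coverage at a 1.6 speedup works out to 18.75%, which the slide appears to show truncated as 18%.

```python
def benefit(coverage: float, loop_speedup: float) -> float:
    """Fraction of program execution time saved by parallelizing one loop."""
    return coverage * (1.0 - 1.0 / loop_speedup)


def program_speedup(total_benefit: float) -> float:
    """Overall speedup when `total_benefit` of the execution time is saved."""
    return 1.0 / (1.0 - total_benefit)


# (coverage, loop speedup) pairs copied from the slide's table.
LOOPS = {
    "main_loop1": (0.80, 1.2),
    "main_loop2": (0.70, 1.4),
    "foo_loop1": (0.30, 1.2),
    "goo_loop1": (0.50, 1.6),
}

if __name__ == "__main__":
    for name, (cov, speedup) in LOOPS.items():
        print(f"{name}: benefit = {benefit(cov, speedup):.2%}")
    best = max(benefit(cov, speedup) for cov, speedup in LOOPS.values())
    print(f"program speedup with the best single loop: {program_speedup(best):.2f}")
```

The slide's program speedup of 1.25 corresponds to a benefit of 20%, the value for main_loop2.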

  9. Loop Selection: Problem Definition
     Goal: select the set of loops that maximizes overall program performance when parallelized.
     Constraint: the set cannot contain loops with a nesting relationship.
     Loop selection is NP-complete! (Weighted maximum independent set)
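The problem definition can be made concrete on the talk's four-loop example. The sketch below is an assumption-laden illustration, not the authors' implementation: the edge set of the loop graph is read off the example source code, the benefits are the slide's rounded percentages, and total benefit is assumed to be additive across loops that are not nested. It simply tries every subset and rejects any that contains two loops in a nesting relationship.

```python
from itertools import combinations

# Direct nesting edges (outer -> inner). Assumed from the example source code.
NESTS = {
    "main_loop1": ["main_loop2"],
    "main_loop2": ["foo_loop1", "goo_loop1"],
    "foo_loop1": ["goo_loop1"],
    "goo_loop1": [],
}
# Per-loop benefit as a fraction of execution time (rounded slide values).
BENEFIT = {"main_loop1": 0.13, "main_loop2": 0.20, "foo_loop1": 0.05, "goo_loop1": 0.18}


def descendants(loop, nests):
    """All loops nested, directly or transitively, inside `loop`."""
    seen, stack = set(), list(nests[loop])
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(nests[cur])
    return seen


def is_valid(selection, nests):
    """True if no two selected loops are in a nesting relationship."""
    return all(
        b not in descendants(a, nests) and a not in descendants(b, nests)
        for a, b in combinations(selection, 2)
    )


def exhaustive_select(benefit, nests):
    """Try every subset; keep the valid one with the largest total benefit."""
    best_set, best_total = (), 0.0
    loops = list(benefit)
    for k in range(1, len(loops) + 1):
        for subset in combinations(loops, k):
            total = sum(benefit[loop] for loop in subset)
            if total > best_total and is_valid(subset, nests):
                best_set, best_total = subset, total
    return set(best_set), best_total


if __name__ == "__main__":
    print(exhaustive_select(BENEFIT, NESTS))
```

In this example every pair of loops is nested, so only single-loop selections are valid and main_loop2 wins. The number of subsets doubles with each added loop, which is why exhaustive search is only practical for small graphs.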

  10. Loop Selection: Algorithm
      • Exhaustive search (≤ 50 nodes)
        • Try all possible combinations of loops
      • Greedy algorithm (> 50 nodes)
        • In descending order of benefit
        • Check for nesting relation
        • Add the loop to the set if no nesting
      Average number of loops for SPECINT 2000: 714
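The greedy steps listed on this slide can be sketched as follows. The loop names, benefits, and nesting pairs are hypothetical, and stopping at non-positive benefit is an added assumption that the slide does not state.

```python
def greedy_select(benefit, conflicts):
    """Greedy loop selection.

    benefit:   dict mapping loop name -> estimated benefit
    conflicts: set of frozenset({a, b}) pairs that are in a nesting relationship
    """
    selected = []
    # Visit loops in descending order of benefit.
    for loop in sorted(benefit, key=benefit.get, reverse=True):
        if benefit[loop] <= 0:
            break  # assumption: loops that do not help are never selected
        # Add the loop only if it is not nested with anything already chosen.
        if all(frozenset((loop, s)) not in conflicts for s in selected):
            selected.append(loop)
    return selected


# Hypothetical loop graph: "outer" nests "inner_a" and "inner_b"; "other" is unrelated.
benefit = {"outer": 0.20, "inner_a": 0.15, "inner_b": 0.12, "other": 0.08}
conflicts = {frozenset(("outer", "inner_a")), frozenset(("outer", "inner_b"))}

print(greedy_select(benefit, conflicts))  # ['outer', 'other']
```

Here the greedy choice totals 0.28, while selecting inner_a, inner_b, and other would total 0.35. This shows why the greedy algorithm is a fallback for large graphs rather than an exact method.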

  11. Loop Pruning
      [Figure: loop graph with nodes loop1 through loop8]
      Only resort to greedy algorithm for gcc and parser

  12. Benefit of Parallelizing a Single Loop
      Loop          Coverage   Speedup   Benefit
      main_loop1    80%        1.2       13%
      main_loop2    70%        1.4       20%
      foo_loop1     30%        1.2       5%
      goo_loop1     50%        1.6       18%
      [Figure: loop graph annotated with each loop's benefit]
      How can we estimate the speedup?

  13. Outline
      • Loop selection (current)
        • Algorithm
        • Parallel performance prediction (current)
      • Dynamic loop behavior
      • Conclusions

  14. Estimating Parallel Performance
      Communicating values between speculative threads adds significant overhead to parallel execution.
      • Synchronization: resolves frequently occurring data dependences
      • Speculation: resolves infrequently occurring data dependences
      Estimating communication costs with the compiler

  15. Cost of Mis-speculation
      [Figure: threads T1 and T2; a load/store dependence between them is violated (×), and the amount of work wasted is marked]
      Cost of mis-speculation = amount of work wasted × prob. of mis-speculation

  16. Cost of Mis-speculation
      [Figure: threads T1 and T2 with a sequential part; a load/store dependence is violated (×), and the amount of work wasted is marked]
      Cost of mis-speculation = amount of work wasted × prob. of mis-speculation
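The cost formula on these two slides is an expected value. A minimal sketch follows; the cycle counts and probabilities are hypothetical and do not come from the talk, and measuring wasted work in cycles is an assumption.

```python
def misspeculation_cost(work_wasted: float, prob_misspeculation: float) -> float:
    """Expected cost: amount of work wasted times probability of mis-speculation."""
    if not 0.0 <= prob_misspeculation <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return work_wasted * prob_misspeculation


# Hypothetical dependences that each waste 400 cycles when violated.
rare = misspeculation_cost(400, 0.02)      # infrequent dependence
frequent = misspeculation_cost(400, 0.60)  # frequent dependence

print(f"rare: {rare:.0f} cycles, frequent: {frequent:.0f} cycles")
```

With the same amount of wasted work, the frequent dependence costs far more in expectation, which matches the talk's split between speculating on infrequent dependences and synchronizing frequent ones.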

  17. Synchronization
      [Figure: threads T1 and T2; a store in one thread is synchronized with a load in the other]
      Synchronization serializes parallel execution

  18. Cost of Synchronization
      [Figure: three T1/T2 timelines with store1/load1 and store2/load2 dependences, one per estimate]
      • Est. I: synchronization cost = # of dependent instructions
      • Est. II: synchronization cost = longest stall
      • Est. III: synchronization cost = longest stall based on dependent instructions

  19. Experimental Framework
      • Machine model
        • 4 single-issue in-order processors
        • Private L1 data cache (32 KB, 2-way, 1 cycle)
        • Shared L2 data cache (2 MB, 4-way, 10 cycles)
        • Speculation support (write buffer, address buffer)
        • Synchronization support (comm. buffer, 10 cycles)
      • Compiler optimizations using ORC 2.1
        • Instruction scheduling to improve parallelism

  20. Comparison: Speedup Estimation Techniques
      [Figure, two panels: program speedup (-40% to 100%) and coverage (0% to 100%) for Est. I, Est. II, Est. III, and Perfect, per benchmark: mcf, crafty, twolf, gzip, bzip2, vortex, vpr, parser, gap, gcc, perlbmk]
      Average program speedup: 20%, coverage: 70%

  21. Outline
      • Loop selection
        • Algorithm
        • Parallel performance prediction
      • Dynamic loop behavior (current)
      • Conclusions

  22. Loop Behavior May Change
      Source code:
          main( ) {
            while ( condition1 ) {        /* main_loop1 */
              while ( condition2 ) {      /* main_loop2 */
                foo( );
                goo( );
              }
            }
          }
          foo( ) {
            while ( condition3 ) {        /* foo_loop1 */
              goo( );
            }
          }
          goo( ) {
            while ( condition4 ) {        /* goo_loop1: goo_loop1_A and goo_loop1_B in the loop tree */
            }
          }
      Loop tree: main_loop1, main_loop2, foo_loop1, goo_loop1_A, goo_loop1_B
      Calling context of a loop: the path from the root to that loop
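Because a calling context is the path from the root to a loop, the loop tree can be obtained by enumerating every such path in the loop graph. The sketch below assumes an acyclic loop graph (no recursion) whose edges are read off the example source code; the function name is made up for illustration.

```python
def calling_contexts(graph, root):
    """Return every root-to-loop path (calling context) in an acyclic loop graph."""
    contexts = []

    def walk(loop, path):
        path = path + (loop,)
        contexts.append(path)  # one loop-tree node per distinct path
        for inner in graph[loop]:
            walk(inner, path)

    walk(root, ())
    return contexts


# Direct nesting edges (outer -> inner). Assumed from the example source code.
LOOP_GRAPH = {
    "main_loop1": ["main_loop2"],
    "main_loop2": ["foo_loop1", "goo_loop1"],
    "foo_loop1": ["goo_loop1"],
    "goo_loop1": [],
}

for ctx in calling_contexts(LOOP_GRAPH, "main_loop1"):
    print(" -> ".join(ctx))
```

The four static loops expand into five tree nodes, because goo_loop1 is reached through two different paths. Those two contexts correspond to goo_loop1_A and goo_loop1_B on the slide.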

  23. Loop Selection in a Tree
      Benefit per loop-tree node:
      • main_loop1: 13%
      • main_loop2: 20%
      • foo_loop1: 5%
      • goo_loop1_A: -2%
      • goo_loop1_B: 18%
      goo_loop1 is parallelized only when it is reached from main_loop2.
      Loop cloning can be used.
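On a tree, a simple bottom-up rule is enough to pick the best set of non-nested loops: at each node, either select the node itself or keep the best selections from its children. This is a standard tree dynamic program offered here as an illustration, not something the slides describe. The sketch also assumes that goo_loop1_A is the context reached through foo_loop1, that goo_loop1_B is the one reached directly from main_loop2, and that benefits of non-nested loops add up.

```python
# Loop tree (parent -> children). The placement of the A/B contexts is assumed.
TREE = {
    "main_loop1": ["main_loop2"],
    "main_loop2": ["foo_loop1", "goo_loop1_B"],
    "foo_loop1": ["goo_loop1_A"],
    "goo_loop1_A": [],
    "goo_loop1_B": [],
}
# Benefit in percent of execution time, taken from the slide.
BENEFIT = {
    "main_loop1": 13,
    "main_loop2": 20,
    "foo_loop1": 5,
    "goo_loop1_A": -2,
    "goo_loop1_B": 18,
}


def select_in_tree(node, tree, benefit):
    """Return (best total benefit, selected loops) for the subtree rooted at `node`."""
    child_total, child_selection = 0, []
    for child in tree[node]:
        total, selection = select_in_tree(child, tree, benefit)
        child_total += total
        child_selection += selection
    # Selecting `node` excludes everything nested inside it.
    if benefit[node] > child_total:
        return benefit[node], [node]
    return child_total, child_selection


print(select_in_tree("main_loop1", TREE, BENEFIT))
```

Under these assumptions the sketch selects goo_loop1_B and skips goo_loop1_A, consistent with parallelizing goo_loop1 only when it is reached from main_loop2. It also selects foo_loop1, which the slide neither confirms nor rules out.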

  24. Loop Behavior May Change
      [Figure: loop tree (main_loop1, foo_loop1, goo_loop1_A, goo_loop1_B) with repeated invocations of foo_loop1; annotation: only parallelize the loop in these invocations]
      Exploit loop behavior dynamically

  25. Potential of Exploiting Dynamic Behavior
      [Figure, two panels: program speedup (0% to 100%) and coverage (0% to 100%) for No Context, Calling Context, and Oracle, per benchmark]
      5 out of 11 benchmarks show performance potential

  26. Conclusions
      Loop selection is important for TLS
      • Compiler-based loop selection
        • Speedup 20%, coverage 70%
      • Exploiting dynamic behavior offers performance potential

  27. Thank You!
