Loop Selection for Thread-Level Speculation
Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew
Department of Computer Science & Engineering, University of Minnesota

Chip Multiprocessors (CMPs)
• CMPs:
  • IBM Power5
  • Sun Niagara
  • Intel dual-core Xeon
  • AMD dual-core Opteron
[Figure: IBM Power5, with processors (Proc) and cache labeled]
Improve program performance with parallel threads

Thread-Level Speculation (TLS)
Automatic parallelization is difficult
• Ambiguous data dependences
• Complex control flow
TLS facilitates automatic parallelization by:
• Executing potentially dependent threads in parallel
• Preserving data dependences via runtime checking
Where do we find speculative parallel threads?

Parallelizing Loops under TLS
Loops are good candidates for parallelism
• Regular structure
• Significant coverage of dynamic execution time
General purpose applications are complicated
Facts about SPECINT 2000
• Average number of loops: 714
• Average dynamic loop nesting: 8
Loop selection: which loops should be parallelized?

Potential of Loop Selection
[Figure: bar chart of program speedup (y-axis, -40% to 100%) per benchmark for three selections: outer loop, inner loop, and best]
Carefully selected loops can improve performance significantly!

Outline
→ Loop selection
→ Algorithm
• Parallel performance prediction
• Dynamic loop behavior
• Conclusions

Loop Nesting
Source code:
    main( ) {
      while ( condition1 ) {        // main_loop1
        while ( condition2 ) {      // main_loop2
          foo( );
          goo( );
        }
      }
    }
    foo( ) {
      while ( condition3 ) {        // foo_loop1
        goo( );
      }
    }
    goo( ) {
      while ( condition4 ) {        // goo_loop1
      }
    }
[Figure: loop graph over main_loop1, main_loop2, foo_loop1, and goo_loop1; each node is a static loop and each edge is a nesting relationship]

Benefit of Parallelizing a Single Loop
Loop         Coverage   Loop Speedup   Benefit
main_loop1   80%        1.2            13%
main_loop2   70%        1.4            20%
foo_loop1    30%        1.2            5%
goo_loop1    50%        1.6            18%
benefit = % of program execution time saved = coverage × (1 - 1 / loop speedup)
Program speedup = 1 / (1 - benefit) = 1.25 (for main_loop2, where benefit = 20%)
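The two formulas on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the function names `benefit` and `program_speedup` and the `LOOPS` dictionary are made up here, and the coverage and loop-speedup numbers are copied from the slide. Note that the formula gives 18.75% for goo_loop1, which the slide reports as 18%.

```python
def benefit(coverage: float, loop_speedup: float) -> float:
    """Fraction of program execution time saved by parallelizing one loop."""
    return coverage * (1.0 - 1.0 / loop_speedup)


def program_speedup(benefit_fraction: float) -> float:
    """Whole-program speedup implied by a given benefit."""
    return 1.0 / (1.0 - benefit_fraction)


# (coverage, loop speedup) per loop, taken from the slide.
LOOPS = {
    "main_loop1": (0.80, 1.2),
    "main_loop2": (0.70, 1.4),
    "foo_loop1": (0.30, 1.2),
    "goo_loop1": (0.50, 1.6),
}

if __name__ == "__main__":
    for name, (cov, speedup) in LOOPS.items():
        b = benefit(cov, speedup)
        print(f"{name}: benefit {b:.2%}, program speedup {program_speedup(b):.2f}")
```

For main_loop2 this reproduces the slide's figures: a benefit of 20% and a program speedup of 1.25.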

Loop Selection: Problem Definition
Goal: Select the set of loops that maximizes the overall program performance when parallelized
Constraint: The set cannot contain loops with a nesting relationship
Loop selection is NP-complete (weighted maximum independent set)

Loop Selection: Algorithm
• Exhaustive search (≤ 50 nodes)
  • Try all possible combinations of loops
• Greedy algorithm (> 50 nodes)
  • Consider loops in descending order of benefit
  • Check for a nesting relation with loops already selected
  • Add the loop to the set if there is no nesting
Average number of loops for SPECINT 2000: 714
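The two strategies named on this slide might look roughly like the sketch below. It is not the authors' implementation: the function names, the representation of nesting as a set of unordered pairs, the rule that greedy skips loops with non-positive benefit, and the three-loop example (`outer`, `inner_a`, `inner_b`) are all assumptions made for illustration. The sketch also assumes that the benefits of non-nested loops simply add up.

```python
from itertools import combinations


def is_independent(loops, nested):
    """True if no two loops in `loops` have a nesting relationship."""
    return all(frozenset(pair) not in nested for pair in combinations(loops, 2))


def exhaustive_select(benefits, nested):
    """Try every combination of loops; practical only for small graphs."""
    best, best_total = (), 0.0
    names = list(benefits)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            if is_independent(subset, nested):
                total = sum(benefits[n] for n in subset)
                if total > best_total:
                    best, best_total = subset, total
    return set(best), best_total


def greedy_select(benefits, nested):
    """Visit loops in descending order of benefit; keep those with no nesting."""
    chosen = []
    for name in sorted(benefits, key=benefits.get, reverse=True):
        if benefits[name] <= 0:  # assumption: never pick an unprofitable loop
            break
        if all(frozenset((name, c)) not in nested for c in chosen):
            chosen.append(name)
    return set(chosen), sum(benefits[n] for n in chosen)


# Hypothetical example: one outer loop containing two sibling inner loops.
EXAMPLE_BENEFITS = {"outer": 0.20, "inner_a": 0.15, "inner_b": 0.15}
EXAMPLE_NESTED = {
    frozenset(("outer", "inner_a")),
    frozenset(("outer", "inner_b")),
}

if __name__ == "__main__":
    print(exhaustive_select(EXAMPLE_BENEFITS, EXAMPLE_NESTED))
    print(greedy_select(EXAMPLE_BENEFITS, EXAMPLE_NESTED))
```

In this hypothetical graph the greedy pass commits to `outer` first and then cannot add either inner loop, while the exhaustive search finds that the two inner loops together are worth more. That gap is the price of using the greedy algorithm on large graphs.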

Loop Pruning
[Figure: loop graph with nodes loop1 through loop8, illustrating pruning]
The greedy algorithm is needed only for gcc and parser

Benefit of Parallelizing a Single Loop
Loop         Coverage   Speedup   Benefit
main_loop1   80%        1.2       13%
main_loop2   70%        1.4       20%
foo_loop1    30%        1.2       5%
goo_loop1    50%        1.6       18%
[Figure: loop graph annotated with the benefit of each loop]
How can we estimate the speedup?

Outline
→ Loop selection
• Algorithm
→ Parallel performance prediction
• Dynamic loop behavior
• Conclusions

Estimating Parallel Performance
Communicating values between speculative threads adds significant overhead to parallel execution
• Synchronization:
  • Resolves frequently occurring data dependences
• Speculation:
  • Resolves infrequently occurring data dependences
The compiler estimates these communication costs

Cost of Mis-speculation
[Figure: timeline of threads T1 and T2 showing a store, a load, the violated dependence (×), the sequential part, and the amount of work wasted]
Cost of mis-speculation = amount of work wasted × probability of mis-speculation

Synchronization
[Figure: timeline of threads T1 and T2, with a store in T1 and the dependent load in T2]
Synchronization serializes parallel execution

Cost of Synchronization
[Figure: three timelines of threads T1 and T2 with store/load pairs (store1, load1, store2, load2), one per estimate]
• Est. I: synchronization cost = # of dependent instructions
• Est. II: synchronization cost = longest stall
• Est. III: synchronization cost = longest stall based on dependent instructions

Experimental Framework
• Machine model
  • 4 single-issue in-order processors
  • Private L1 data cache (32K, 2-way, 1 cycle)
  • Shared L2 data cache (2M, 4-way, 10 cycles)
  • Speculation support (write buffer, address buffer)
  • Synchronization support (comm. buffer, 10 cycles)
• Compiler optimizations using ORC 2.1
  • Instruction scheduling to improve parallelism

Comparison: Speedup Estimation Techniques
[Figure: two bar charts, program speedup (-40% to 100%) and coverage (0% to 100%), comparing Est. I, Est. II, Est. III, and Perfect on mcf, crafty, twolf, gzip, bzip2, vortex, vpr, parser, gap, gcc, and perlbmk]
Average program speedup: 20%, coverage: 70%

Outline
• Loop selection
• Algorithm
• Parallel performance prediction
→ Dynamic loop behavior
• Conclusions

Loop Behavior May Change
Source code:
    main( ) {
      while ( condition1 ) {        // main_loop1
        while ( condition2 ) {      // main_loop2
          foo( );
          goo( );
        }
      }
    }
    foo( ) {
      while ( condition3 ) {        // foo_loop1
        goo( );
      }
    }
    goo( ) {
      while ( condition4 ) {        // goo_loop1 (goo_loop1_A and goo_loop1_B in the loop tree)
      }
    }
[Figure: loop tree over main_loop1, main_loop2, foo_loop1, goo_loop1_A, and goo_loop1_B]
Calling context of a loop: the path from the root to that loop

Loop Selection in a Tree
[Figure: loop tree annotated with the benefit of each node]
• main_loop1: 13%
• main_loop2: 20%
• foo_loop1: 5%
• goo_loop1_A: -2%
• goo_loop1_B: 18%
goo_loop1 is parallelized only when it is reached from main_loop2
Loop cloning can be used
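When the loop graph is expanded into a tree of calling contexts, the no-nesting constraint means that no selected node may be an ancestor of another. On a tree, that can be solved with a simple bottom-up pass: for each node, keep whichever is better, the node itself or the best selections from its children. The sketch below illustrates that idea only and is not claimed to be the authors' method. `LoopNode` and `select` are hypothetical names. The tree shape is also an assumption: goo_loop1_A is taken to be the context reached through foo_loop1, and goo_loop1_B the context reached directly from main_loop2. Benefits of non-nested nodes are assumed to add up.

```python
from dataclasses import dataclass, field


@dataclass
class LoopNode:
    name: str
    benefit: float  # fraction of program time saved if this node is parallelized
    children: list = field(default_factory=list)


def select(node):
    """Return (total benefit, chosen loop names) for the subtree rooted at node."""
    child_total, child_choice = 0.0, []
    for child in node.children:
        total, choice = select(child)
        child_total += total
        child_choice += choice
    # Parallelize this node only if it beats the best choice among descendants;
    # a leaf with negative benefit is therefore never selected.
    if node.benefit > child_total:
        return node.benefit, [node.name]
    return child_total, child_choice


# Benefits from the slide; the parent/child structure is an assumption.
goo_a = LoopNode("goo_loop1_A", -0.02)
goo_b = LoopNode("goo_loop1_B", 0.18)
foo = LoopNode("foo_loop1", 0.05, [goo_a])
main2 = LoopNode("main_loop2", 0.20, [foo, goo_b])
main1 = LoopNode("main_loop1", 0.13, [main2])

if __name__ == "__main__":
    total, chosen = select(main1)
    print(f"selected {chosen} for a total benefit of {total:.0%}")
```

Under these assumptions the pass selects foo_loop1 and goo_loop1_B for a combined benefit of 23%, which beats parallelizing main_loop2 alone (20%) and leaves goo_loop1_A sequential.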

Loop Behavior May Change
[Figure: successive invocations of the loops main_loop1, foo_loop1, goo_loop1_A, and goo_loop1_B, with a subset of invocations marked "only parallelize the loop in these invocations"]
Exploit loop behavior dynamically

Potential of Exploiting Dynamic Behavior
[Figure: two bar charts, program speedup and coverage (0% to 100%), per benchmark for No Context, Calling Context, and Oracle]
5 out of 11 benchmarks show performance potential

Conclusions
Loop selection is important for TLS
• Compiler-based loop selection
  • Speedup 20%, coverage 70%
• Exploiting dynamic behavior offers performance potential

Thank You!
