
An example of a research compiler, Simone Campanoni



  1. An example of a research compiler. Simone Campanoni (simonec@eecs.northwestern.edu)

  2. Sequential programs are not accelerating like they used to. [Figure: performance (log scale) of a sequential program running on a platform, with the time axis marked 1992 and 2004; the core frequency scaling era gives way to the multicore era, leaving a performance gap.]

  3. Multicores are underutilized
      Single application: not enough explicit parallelism
      • Developing parallel code is hard
      • Sequentially-designed code is still ubiquitous
      Multiple applications: only a few CPU-intensive applications run concurrently on client devices

  4. Parallelizing compiler: exploit unused cores to accelerate sequential programs

  5. Non-numerical programs need to be parallelized. [Figure: non-numerical programs compared with numerical programs.]

  6. Parallelize loops to parallelize a program: 99% of time is spent in loops. [Figure: outermost loops along a time axis.]

  7. DOALL parallelism: iterations 0, 1, and 2 each run work() at the same time, because no iteration depends on another. [Figure: the three iterations side by side along a time axis.]
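A minimal sketch of DOALL execution in Python, assuming the iterations share no data. The `work` body and the `doall` helper are hypothetical names chosen for illustration, not part of the talk. In CPython, threads show the structure of the technique rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def work(i):
    # Hypothetical loop body: it reads only its own iteration index,
    # so iterations are independent of one another.
    return i * i


def doall(n_iters, n_workers=4):
    # Independent iterations can be handed to workers in any order;
    # pool.map returns the results in iteration order.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(work, range(n_iters)))


if __name__ == "__main__":
    print(doall(8))
```

The same result as the sequential loop is guaranteed only because `work` touches no shared state; a loop-carried dependence would break this scheme.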

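  8. DOACROSS parallelism: each iteration has a sequential segment (c=f(c); d=f(d)), which must run in iteration order, followed by a parallel segment (work()), which can overlap with other iterations. [Figure: three iterations along a time axis.]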

  9. HELIX: DOACROSS for multicore [Campanoni et al., CGO 2012; Campanoni et al., DAC 2012; Campanoni et al., IEEE Micro 2012]. [Figure: three iterations, each executing c=f(c), d=f(d), and work().]

  10. HELIX: DOACROSS for multicore (continued). [Figure: the same three iterations, with c=f(c), d=f(d), and work() shifted relative to one another in time.]

  11. HELIX: DOACROSS for multicore (continued). Each sequential segment is bracketed by a wait and a signal: sequential segment 0 (c=f(c)) runs between Wait 0 and Signal 0, and sequential segment 1 (d=f(d)) runs between Wait 1 and Signal 1. work() runs outside the sequential segments. [Figure: three iterations synchronized by these waits and signals.]
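The wait/signal pattern can be sketched with one event per iteration and per sequential segment. This is an illustration under stated assumptions, not the HELIX implementation: `f`, `work`, `sequential_loop`, and `helix_like_loop` are hypothetical names, iterations are assumed to be assigned round-robin to worker threads, and `threading.Event` stands in for whatever signalling mechanism the real system uses.

```python
import threading


def f(x):
    # Hypothetical loop-carried update, standing in for c=f(c) and d=f(d).
    return (3 * x + 1) % 1009


def work(c, d):
    # Hypothetical parallel work: uses only values private to the iteration.
    return c * 1009 + d


def sequential_loop(n_iters):
    # Reference version: the original sequential loop.
    c, d, out = 0, 1, []
    for _ in range(n_iters):
        c = f(c)
        d = f(d)
        out.append(work(c, d))
    return out


def helix_like_loop(n_iters, n_cores=4):
    # go[s][i] is set once iteration i may enter sequential segment s.
    go = [[threading.Event() for _ in range(n_iters + 1)] for _ in range(2)]
    go[0][0].set()
    go[1][0].set()
    shared = {"c": 0, "d": 1}
    out = [None] * n_iters

    def core(k):
        # Worker k runs iterations k, k + n_cores, k + 2 * n_cores, ...
        for i in range(k, n_iters, n_cores):
            go[0][i].wait()        # Wait 0
            c = f(shared["c"])     # sequential segment 0
            shared["c"] = c
            go[0][i + 1].set()     # Signal 0
            go[1][i].wait()        # Wait 1
            d = f(shared["d"])     # sequential segment 1
            shared["d"] = d
            go[1][i + 1].set()     # Signal 1
            out[i] = work(c, d)    # parallel code, overlaps across workers

    threads = [threading.Thread(target=core, args=(k,)) for k in range(n_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


if __name__ == "__main__":
    print(helix_like_loop(8) == sequential_loop(8))
```

Because each segment has its own chain of signals, iteration i+1 can start segment 0 while iteration i is still in segment 1 or in `work`.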

  14. Parallelize loops to parallelize a program: 99% of time is spent in loops. [Figure: outermost and innermost loops along a time axis.]

  15. Parallelize loops to parallelize a program. [Figure: loops ranging from innermost to outermost, compared on coverage, ease of analysis, and communication; HELIX is marked on the figure.]

  16. HELIX: DOACROSS for multicore. [Figure: the innermost-to-outermost loop spectrum (coverage, ease of analysis, communication) with HELIX, HELIX-RC, and HELIX-UP marked. Speedup chart with axis from 1 to 4, SPEC INT baseline on a 4-core Intel Nehalem: small loop parallelism with HELIX compared with ICC, Microsoft Visual Studio, and DOACROSS.]

  17. Outline
      • Small Loop Parallelism and HELIX [CGO 2012, DAC 2012, IEEE Micro 2012]
      • HELIX-RC: Architecture/Compiler Co-Design [ISCA 2014]
      • HELIX-UP: Unleash Parallelization [CGO 2015]

  18. SLP challenge: short loop iterations. [Figure: duration of loop iterations, in clock cycles, for SPEC CPU Int benchmarks.]

  19. SLP challenge: short loop iterations (continued). [Figure: the same plot, annotated with the value 90.]

  20. SLP challenge: short loop iterations (continued). [Figure: duration of loop iterations, in clock cycles, compared with the adjacent-core communication latency.]

  21. A compiler-architecture co-design to efficiently execute short iterations
      Compiler:
      • Identify latency-critical code in each small loop
        • Code that generates shared data
      • Expose information to the architecture
      Architecture (Ring Cache):
      • Reduce the communication latency on the critical path
      [Figure: sequential segments 0 and 1, each bracketed by its Wait and Signal.]

  22. Light-weight enhancement of today's multicore architecture. [Figure: four cores (Core 0 to Core 3) running iterations 0 to 3, each with a DL1 cache and a ring node, above a last level cache. Example code: Store X, 1; Load X; Store Y, 1; Load Y. Annotation: "75 to 260 cycles!"]

  23. Light-weight enhancement of today's multicore architecture (continued). [Figure: Core 0 and Core 1 running iterations 0 and 1 through their ring nodes. Example code: Store X, 1; Wait 0; Store Y, 1; Load Y; Signal 0.]

  24. 98% hit rate

  25. The importance of HELIX-RC. [Figure: numerical programs compared with non-numerical programs.]

  27. Outline
      • Small Loop Parallelism and HELIX [CGO 2012, DAC 2012, IEEE Micro 2012]
      • HELIX-RC: Architecture/Compiler Co-Design [ISCA 2014]
      • HELIX-UP: Unleash Parallelization [CGO 2015]

  28. HELIX and its limitations
      Performance:
      • Lower than you would like
      • Inconsistent across architectures
      • Sensitive to dependence analysis accuracy
      What can we do to improve it?
      [Figure: iterations 0 to 2 running on threads 0 to 3 and passing data between them, with labels 80% and 50%. Results for 4 cores: values 2.77, 2.31, and 1.68 next to Nehalem, Bulldozer, and Haswell; values 1.19 and 1.61 next to 78% accuracy and 79% accuracy.]

  29. Opportunity: relax program semantics
      • Some workloads tolerate output distortion
      • Output distortion is workload-dependent

  30. Relaxing transformations remove performance bottlenecks
      • Sequential bottleneck
      [Figure: threads 1 to 3 each execute Inst 1 to Inst 4; a dependence (Dep) creates a sequential segment, and relaxing it yields a speedup.]

  31. Relaxing transformations remove performance bottlenecks
      • Sequential bottleneck
      • Communication bottleneck
      • Data locality bottleneck

  32. Relaxing transformations remove performance bottlenecks. [Figure: a spectrum from no relaxing transformations (no output distortion, baseline performance), through relaxing transformations 1, 2, …, k, to max output distortion and max performance.]

  33. Design space of HELIX-UP
      A configuration assigns relaxing transformations to code regions, for example "apply relaxing transformation 3 to code region 1" and "apply relaxing transformation 5 to code region 2". Each configuration is characterized by:
      o Performance
      o Energy saved
      o Output distortion
      1) User provides output distortion limits
      2) System finds the best configuration
      3) Run parallelized code with that configuration
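Steps 1 and 2 can be sketched as a filter followed by a maximum. This is a simplified illustration, not the HELIX-UP search: the `Config` type, the `best_config` helper, and every number in `CANDIDATES` are made up for the example, and only performance and output distortion are modelled.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    speedup: float      # higher is better
    distortion: float   # output distortion, in percent


def best_config(configs, distortion_limit):
    # Step 1: keep only configurations within the user's distortion limit.
    allowed = [c for c in configs if c.distortion <= distortion_limit]
    # Step 2: among those, pick the best-performing one (None if none qualify).
    return max(allowed, key=lambda c: c.speedup) if allowed else None


# Made-up measurements, for illustration only.
CANDIDATES = [
    Config("no relaxing transformations", 1.0, 0.0),
    Config("transformation 3 on code region 1", 1.4, 0.5),
    Config("transformation 5 on code region 2", 1.9, 2.0),
]

if __name__ == "__main__":
    print(best_config(CANDIDATES, distortion_limit=1.0))
```

With a limit of 0, only the unrelaxed configuration qualifies, which matches the "no output distortion, baseline performance" end of the spectrum.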

  34. Pruning the design space
      Empirical observation: transforming a code region affects only the loop it belongs to.
      Example: 50 loops, 2 code regions per loop, 2 transformations per code region.
      Complete space = $2^{100}$
      Pruned space = $50 \times 2^2 = 200$
      How well does HELIX-UP perform?
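The two counts on this slide can be checked with a few lines of Python. The function names are hypothetical, and the sketch assumes, as the slide does, that every code region independently picks one of its transformations.

```python
def complete_space(loops, regions_per_loop, transformations_per_region):
    # Without pruning, every code region in the program is chosen jointly,
    # so the choices multiply across all regions.
    return transformations_per_region ** (loops * regions_per_loop)


def pruned_space(loops, regions_per_loop, transformations_per_region):
    # If a region affects only its own loop, each loop is explored on its
    # own and the per-loop spaces add up instead of multiplying.
    return loops * transformations_per_region ** regions_per_loop


if __name__ == "__main__":
    print(complete_space(50, 2, 2))  # 2**100 configurations
    print(pruned_space(50, 2, 2))    # 50 * 2**2 configurations
```

The pruned count grows linearly with the number of loops, which is what makes an exhaustive per-loop search feasible.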

  35. HELIX-UP unblocks extra parallelism with small output distortions. [Figure: HELIX (no relaxing transformations) as the reference, on Nehalem with 6 cores and 2 threads per core.]

  36. HELIX-UP unblocks extra parallelism with small output distortions (continued). [Figure: results on Nehalem with 6 cores and 2 threads per core.]

  37. Performance/distortion tradeoff. [Figure: 256.bzip2, with HELIX as the reference; axis in %.]

  38. Run time code tuning
      • Static HELIX-UP decides how to transform the code based on profile data averaged over inputs
      • The runtime reacts to transient bottlenecks by adjusting code accordingly

  39. Adapting code at run time unlocks more parallelism. [Figure: 256.bzip2, with HELIX as the reference; axis in %.]

  40. HELIX-UP improves more than just performance
      • Robustness to inaccuracies in the data dependence graph (DDG)
      • Consistent performance across platforms

  41. Relaxed transformations to be robust to DDG inaccuracies. [Figure: 256.bzip2, HELIX compared with HELIX-UP. Annotations: "Increasing DDG inaccuracies leads to lower performance" and "No impact on HELIX-UP".]

  42. Relaxed transformations for consistent performance. [Figure: performance under increasing communication latency.]

  43. Summary
      Small Loop Parallelism and HELIX
      • Parallelism hides in small loops
      HELIX-RC: Architecture/Compiler Co-Design
      • Irregular programs require low latency
      HELIX-UP: Unleash Parallelization
      • Tolerating distortions boosts parallelization

  44. Thank you!
