
Polly-ACC: Transparent Compilation to Heterogeneous Hardware, Torsten Hoefler - PowerPoint PPT Presentation



  1. Polly-ACC: Transparent Compilation to Heterogeneous Hardware. Torsten Hoefler (with Tobias Grosser)

  2. Evading various “ends” – the hardware view

  3. Parallel Hardware, Sequential Software. The slide contrasts a hardware diagram (multi-core CPUs and many-core GPU accelerators) with a sequential Fortran/C/C++ program, here a 2D convolution loop nest with the statements S0, S1, S2 marked:

     row = 0;
     output_image_ptr = output_image;
     output_image_ptr += (NN * dead_rows);
     for (r = 0; r < NN - KK + 1; r++) {
         output_image_offset = output_image_ptr;
         output_image_offset += dead_cols;
         col = 0;
         for (c = 0; c < NN - KK + 1; c++) {
             input_image_ptr = input_image;
             input_image_ptr += (NN * row);
             kernel_ptr = kernel;
     S0:     *output_image_offset = 0;
             for (i = 0; i < KK; i++) {
                 input_image_offset = input_image_ptr;
                 input_image_offset += col;
                 kernel_offset = kernel_ptr;
                 for (j = 0; j < KK; j++) {
     S1:             temp1 = *input_image_offset++;
     S1:             temp2 = *kernel_offset++;
     S1:             *output_image_offset += temp1 * temp2;
                 }
                 kernel_ptr += KK;
                 input_image_ptr += NN;
             }
     S2:     *output_image_offset = ((*output_image_offset) / normal_factor);
             output_image_offset++;
             col++;
         }
         output_image_ptr += NN;
         row++;
     }

  4. Design Goals: Automatic (accelerator mapping), “Regression Free”, High Performance. Non-Goal: Algorithmic Changes. How close can we get?

  5. Tool: Polyhedral Modeling. Program code:

     for (i = 0; i <= N; i++)
       for (j = 0; j <= i; j++)
         S(i,j);

     Iteration space (plotted for N = 4 as the triangular set of points (0,0), (1,0), (1,1), …, (4,4)), bounded by the constraints 0 ≤ i ≤ N and 0 ≤ j ≤ i:

     D = { (i,j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i }

     Polly – Performing Polyhedral Optimizations on a Low-Level Intermediate Representation, Tobias Grosser et al., Parallel Processing Letters, 2012
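
  To make the iteration-space picture concrete, here is a minimal sketch (plain C, not from the slides) that simply enumerates the points of D = { (i,j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i } for N = 4; it prints the 15 points shown in the slide's triangular plot.

     #include <stdio.h>

     int main(void) {
         const int N = 4;                     /* same bound as on the slide */
         /* Enumerate D = { (i,j) | 0 <= i <= N  and  0 <= j <= i } */
         for (int i = 0; i <= N; i++)
             for (int j = 0; j <= i; j++)
                 printf("(%d,%d)\n", i, j);   /* stands in for S(i,j) */
         return 0;
     }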

  6. Mapping Computation to Device. The iteration space (i, j) is mapped onto device blocks and threads via modulo mappings (diagram: iteration space on the left, device blocks & threads on the right):

     BID = { (i, j) → (i/4 % 2, j/3 % 2) }
     TID = { (i, j) → (i % 4, j % 3) }
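
  A minimal CUDA sketch of this idea, not the code Polly-ACC generates; the kernel name run_iterations and the empty statement body are placeholders. It shows how a 2-D iteration space lands on 4×3-thread blocks, matching the TID mapping above. The slide's "% 2" in BID corresponds to a fixed 2×2 grid whose blocks wrap around over tiles, whereas this sketch uses the simpler one-tile-per-block launch.

     /* Hypothetical illustration: each GPU thread executes one (i, j) iteration.
      * Block geometry mirrors the slide: 4x3 threads per block. */
     __global__ void run_iterations(int ni, int nj) {
         int i = blockIdx.x * blockDim.x + threadIdx.x;  /* TID.x = i % 4, block index = i / 4 */
         int j = blockIdx.y * blockDim.y + threadIdx.y;  /* TID.y = j % 3, block index = j / 3 */
         if (i < ni && j < nj) {
             /* the statement S(i, j) would go here */
         }
     }

     /* Launch:
      *   dim3 block(4, 3);
      *   dim3 grid((ni + 3) / 4, (nj + 2) / 3);
      *   run_iterations<<<grid, block>>>(ni, nj); */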

  7. Memory Hierarchy of a Heterogeneous System

  8. Host-device data transfers

  9. Host-device data transfers

  10. Mapping onto fast memory

  11. Mapping onto fast memory. Polyhedral Parallel Code Generation for CUDA, Sven Verdoolaege et al., ACM Transactions on Architecture and Code Optimization, 2013
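
  Slides 10-11 are diagrams; purely as an illustration of what "mapping onto fast memory" looks like in GPU code, here is a minimal hand-written sketch (the names average_shared and TILE are assumptions, and this is not PPCG or Polly-ACC output) that stages a 1-D stencil tile plus halo in shared memory before computing.

     #define TILE 256   /* launch with blockDim.x == TILE */

     /* Each block stages its input tile (plus one halo cell per side) in fast
      * on-chip shared memory, then every thread reads its neighbours from there
      * instead of from slow global memory. */
     __global__ void average_shared(int n, const float *in, float *out) {
         __shared__ float tile[TILE + 2];
         int gid = blockIdx.x * blockDim.x + threadIdx.x;  /* global index                */
         int lid = threadIdx.x + 1;                        /* local index, past the halo  */

         if (gid < n) {
             tile[lid] = in[gid];
             if (threadIdx.x == 0)              tile[0]        = (gid > 0)     ? in[gid - 1] : in[gid];
             if (threadIdx.x == blockDim.x - 1) tile[TILE + 1] = (gid < n - 1) ? in[gid + 1] : in[gid];
         }
         __syncthreads();

         if (gid > 0 && gid < n - 1)
             out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
     }

     /* Usage: average_shared<<<(n + TILE - 1) / TILE, TILE>>>(n, d_in, d_out); */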

  12. Profitability Heuristic. All loop nests are filtered before GPU execution: a static check rejects trivial or unsuitable loop nests, and a dynamic check (execution modeling) rejects regions with insufficient compute; only the remaining candidates run on the GPU.
      T. Grosser, TH: Polly-ACC: Transparent Compilation to Heterogeneous Hardware, ACM ICS'16
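
  The actual heuristic is described in the ICS'16 paper; purely to illustrate the shape of such a filter, here is a hedged sketch in which every name and the threshold are invented and are not Polly-ACC internals.

     /* Hypothetical profitability filter, loosely following the slide's flow:
      * static checks reject trivial/unsuitable loop nests at compile time,
      * a dynamic check skips offloading when the modeled work is too small. */

     typedef struct {
         int  depth;            /* loop nest depth                  */
         int  has_side_effects; /* e.g. unknown function calls, I/O */
         long modeled_ops;      /* runtime estimate of dynamic work */
     } Region;

     enum Target { RUN_ON_CPU, RUN_ON_GPU };

     static int statically_suitable(const Region *r) {
         if (r->depth < 2)        return 0;   /* "trivial": too shallow to be worth offloading */
         if (r->has_side_effects) return 0;   /* "unsuitable" for GPU execution                */
         return 1;
     }

     enum Target choose_target(const Region *r) {
         const long MIN_OPS = 1L << 20;                   /* invented threshold   */
         if (!statically_suitable(r)) return RUN_ON_CPU;
         if (r->modeled_ops < MIN_OPS) return RUN_ON_CPU; /* insufficient compute */
         return RUN_ON_GPU;
     }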

  13. From kernels to program – data transfers

     void heat(int n, float A[n], float hot, float cold) {
         float B[n];                    /* a VLA cannot be initialized with "= {0}", */
         for (int i = 0; i < n; i++)    /* so zero B explicitly                      */
             B[i] = 0.0f;
         initialize(n, A, cold);
         setCenter(n, A, hot, n/4);
         for (int t = 0; t < T; t++) {  /* T: number of time steps, defined elsewhere */
             average(n, A, B);
             average(n, B, A);
             printf("Iteration %d done\n", t);
         }
     }

  14. Data Transfer – Per Kernel. For the heat() code above, every offloaded kernel is bracketed by its own copies between host memory (H) and device memory (D), repeated over time:

     initialize()   D → H
     setCenter()    D → H
     average()      H → D,  D → H
     average()      H → D,  D → H
     average()      H → D,  D → H
     ...
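
  A minimal hand-written CUDA sketch (not Polly-ACC's generated code) of this per-kernel pattern: each offloaded call to average() copies its data in and out, even when the next kernel immediately needs the same array. The kernel and wrapper names are placeholders.

     #include <cuda_runtime.h>

     __global__ void average_kernel(int n, const float *in, float *out) {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i > 0 && i < n - 1)
             out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
     }

     /* Per-kernel offload: the H -> D / D -> H copies bracket every single launch,
      * as in the slide's timeline. */
     void average_offloaded(int n, const float *A, float *B) {
         float *dA, *dB;
         cudaMalloc(&dA, n * sizeof(float));
         cudaMalloc(&dB, n * sizeof(float));

         cudaMemcpy(dA, A, n * sizeof(float), cudaMemcpyHostToDevice);  /* H -> D */
         average_kernel<<<(n + 255) / 256, 256>>>(n, dA, dB);
         cudaMemcpy(B, dB, n * sizeof(float), cudaMemcpyDeviceToHost);  /* D -> H */

         cudaFree(dA);
         cudaFree(dB);
     }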

  15. Data Transfer – Inter Kernel Caching. With inter-kernel caching, data stays resident in device memory across kernels: initialize(), setCenter(), and the average() calls chain on the device, and only a single D → H / H → D pair remains in the timeline where host code runs between average() calls.
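
  For contrast, a hand-written sketch of the cached scheme (again not Polly-ACC output; it reuses average_kernel from the sketch above, and initialize()/setCenter() are omitted for brevity): buffers live on the device for the whole time loop, and data is copied back only once the host actually needs it.

     #include <stdio.h>
     #include <cuda_runtime.h>

     /* Hypothetical cached version of the heat() time loop: dA/dB stay on the
      * device for the whole loop; no per-kernel copies. */
     void heat_cached(int n, float *A, int T) {
         float *dA, *dB;
         cudaMalloc(&dA, n * sizeof(float));
         cudaMalloc(&dB, n * sizeof(float));
         cudaMemcpy(dA, A, n * sizeof(float), cudaMemcpyHostToDevice);  /* one-time H -> D */

         dim3 block(256), grid((n + 255) / 256);
         for (int t = 0; t < T; t++) {
             average_kernel<<<grid, block>>>(n, dA, dB);   /* data stays on the device      */
             average_kernel<<<grid, block>>>(n, dB, dA);
             printf("Iteration %d done\n", t);             /* host code not touching A      */
         }

         cudaMemcpy(A, dA, n * sizeof(float), cudaMemcpyDeviceToHost); /* copy back at end */
         cudaFree(dA);
         cudaFree(dB);
     }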

  16. Evaluation. Workstation: 10-core Sandy Bridge CPU + NVIDIA Titan Black (Kepler). Mobile: 4-core Haswell CPU + NVIDIA GT730M (Kepler).

  17. LLVM Nightly Test Suite. Bar chart (log scale, 1 to 10000): number of compute regions / kernels, broken down into SCoPs and 0-dim, 1-dim, 2-dim, and 3-dim kernels, shown with and without the profitability heuristics.
      T. Grosser, TH: Polly-ACC: Transparent Compilation to Heterogeneous Hardware, ACM ICS'16

  18. Some results: Polybench 3.2. Speedup over icc -O3: geomean ≈ 6×, arithmetic mean ≈ 30×. Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop).
      T. Grosser, TH: Polly-ACC: Transparent Compilation to Heterogeneous Hardware, ACM ICS'16

  19. Compiles all of SPEC CPU 2006 – Example: LBM. Runtime (m:s) chart for icc, icc -openmp, clang, and Polly-ACC on two systems: Mobile (“essentially my 4-core x86 laptop with the (free) GPU that’s in there”) and Workstation (Xeon E5-2690, 10 cores, 0.5 Tflop, vs. Titan Black Kepler GPU, 2.9k cores, 1.7 Tflop); the chart annotates Polly-ACC improvements of roughly 20% and 4×.
      T. Grosser, TH: Polly-ACC: Transparent Compilation to Heterogeneous Hardware, ACM ICS'16

  20. Cactus ADM (SPEC 2006). Results for Workstation and Mobile.
      T. Grosser, TH: Polly-ACC: Transparent Compilation to Heterogeneous Hardware, ACM ICS'16

  21. Cactus ADM (SPEC 2006) – Data Transfer. Results for Workstation and Mobile.
      T. Grosser, TH: Polly-ACC: Transparent Compilation to Heterogeneous Hardware, ACM ICS'16

  22. Polly-ACC: http://spcl.inf.ethz.ch/Polly-ACC. Automatic, “Regression Free”, High Performance.
      T. Grosser, TH: Polly-ACC: Transparent Compilation to Heterogeneous Hardware, ACM ICS'16

  23. Brave new compiler world!? Unfortunately not …
      - Limited to affine code regions (see the sketch after this list); it may generalize to control-restricted programs.
      - No distributed anything!
      - Good news: much of traditional HPC fits that model, and the infrastructure is coming along.
      - Bad news: modern data-driven HPC and Big Data fit less well; we need a programming model for distributed heterogeneous machines!
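
  For readers unfamiliar with the restriction: an affine region is one whose loop bounds, conditionals, and array subscripts are affine functions of loop indices and parameters. The following invented examples (not from the talk) contrast a region the polyhedral model can handle with one it cannot.

     /* Affine region: bounds and subscripts are affine in i, j, n, so the
      * polyhedral model (and hence Polly / Polly-ACC) can analyze and map it. */
     void affine(int n, float A[n][n], float B[n][n]) {
         for (int i = 1; i < n - 1; i++)
             for (int j = 0; j < i; j++)       /* triangular but still affine */
                 A[i][j] = B[i - 1][j] + B[i][j];
     }

     /* Non-affine region: the indirect subscript idx[i] and the data-dependent
      * break are not affine, so this loop falls outside the model. */
     void non_affine(int n, const int *idx, float *A, const float *B) {
         for (int i = 0; i < n; i++) {
             if (B[i] < 0.0f)
                 break;                        /* data-dependent control flow */
             A[idx[i]] += B[i];                /* indirect (non-affine) access */
         }
     }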

  24. How do we program GPUs today? MPI + CUDA. CUDA: over-subscribe the hardware and use spare parallel slack for latency hiding. MPI: host-controlled, with full device synchronization. (Diagram legend: device, compute core, instruction latency, active thread.)
      T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
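
  A minimal sketch of this host-controlled MPI + CUDA pattern (all names, sizes, and the boundary-exchange logic are simplified placeholders, not taken from the talk): the device must be fully synchronized and drained before every exchange, so communication cannot overlap with computation.

     #include <mpi.h>
     #include <cuda_runtime.h>

     #define N    (1 << 20)
     #define HALO 1024

     __global__ void compute_interior(int n, float *field) {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n) field[i] = 0.5f * field[i];            /* stand-in for real work */
     }

     /* Host-controlled pattern: kernel, full device sync, copy halo to the host,
      * MPI exchange, copy halo back, next kernel. Nothing overlaps. */
     void timestep_loop(int steps, float *d_field, int left, int right, MPI_Comm comm) {
         float h_send[HALO], h_recv[HALO];
         for (int s = 0; s < steps; s++) {
             compute_interior<<<(N + 255) / 256, 256>>>(N, d_field);
             cudaDeviceSynchronize();                                      /* full device sync */

             cudaMemcpy(h_send, d_field, HALO * sizeof(float), cudaMemcpyDeviceToHost);
             MPI_Sendrecv(h_send, HALO, MPI_FLOAT, left,  0,               /* host drives comm */
                          h_recv, HALO, MPI_FLOAT, right, 0, comm, MPI_STATUS_IGNORE);
             cudaMemcpy(d_field, h_recv, HALO * sizeof(float), cudaMemcpyHostToDevice);
         }
     }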

  25. Latency hiding at the cluster level? dCUDA (distributed CUDA): a unified programming model for GPU clusters that avoids unnecessary device synchronization to enable system-wide latency hiding. (Diagram legend: device, compute core, instruction latency, active thread; remote puts overlap with compute.)
      T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)

  26. Talk on Wednesday. Tobias Gysi, Jeremiah Baer, TH: “dCUDA: Hardware Supported Overlap of Computation and Communication”, Wednesday, Nov. 16th, 4:00-4:30pm, Room 355-D.
