Enabling Automatic Partitioning of Data-Parallel Kernels with - PowerPoint PPT Presentation

Enabling Automatic Partitioning of Data-Parallel Kernels with Polyhedral Compilation Alexander Matz, Holger Fröning Heidelberg University, Germany LLVM Performance Workshop @CGO 2018 Sat 24 Feb 2018, Vienna, Austria

Multi GPU in the Real World • NVIDIA DGX-1 and HGX-1   8 Tesla GPUs [1] • Amazon AWS P2   up to 16 Tesla GPUs • Google Cloud Platform   up to 8 Tesla GPUs [2] [1] https://www.nvidia.de/content/dam/en-zz/Solutions/Data-Center/hgx-1/data-center-nvidia-hgx-1-update-2-hero-desktop@2x.jpg [2] https://aws.amazon.com/de/ec2/instance-types/p2/ 2

Observations GPU Programming • Execution model • No guarantees exist for interactions among CTAs until kernel completion   => Kernels can be safely partitioned along CTA boundaries (usually) • Memory • Strong NUMA effects prohibit latency tolerance for remote accesses • Good partitioning mainly depends on memory access pattern • Language • Data-parallel languages help in identifying areas of interest (kernels) • Parallel slackness helps for scalability (larger core count due to multi-GPU) 3

Basic Idea Single GPU Single GPU NVCC • Keep clear data ownership and movements of single GPU programming • Automatically sync buffers Single GPU Multi GPU • Hybrid compile time / run time approach Mekong • Minimize runtime overhead 4

Pipeline Overview Contributions • Based on LLVM (gpucc) • Preprocessing based on text substitution • Majority of functionality implemented as passes • Not fully integrated yet 5

Kernel Analysis & Code Generation 1. Kernel Code 2. Application Model 3. Memory Range Array Access GPU Thread Grid b[y*N + x] += a[y*N + x]; b[y*N + x] += a[y*N + x + 1]; ↦ b[y*N + x] += a[y*N + x - 1]; b[y*N + x] += a[y*N + x + N]; b[y*N + x] += a[y*N + x - N]; b[y*N + x] *= 0.2; Polyhedral Analysis Polyhedral Code Generation 6

Kernel Analysis • Based on Polyhedral Value & Memory Analysis [1] • Model should intuitively map Global ID ↦ Array Element, so ℤ 3 ↦ ℤ d • CUDA Expression “threadIdx + blockIdx * blockDim” not affine [N] -> { I[y, x] -> S[o1=y, o2=x] : 0 <= o1,o2 < N; • Workaround I[y, x] -> S[o1=y, o2=x-1] : 0 <= o1,o2 < N; I[y, x] -> S[o1=y, o2=x+1] : 0 <= o1,o2 < N; I[y, x] -> S[o1=y-1, o2=x] : 0 <= o1,o2 < N; I[y, x] -> S[o1=y+1, o2=x] : 0 <= o1,o2 < N; • Replace product with new input dimension “blockOffset” } • Limit “threadIdx” to [0..“blockDim”], then project out • Model is now: ℤ 6 ↦ ℤ d , with three pairs of two dependent dimensions [1] http://www.llvm.org/devmtg/2017-10/#src2 7

Code Generation • Purpose A: Encode buffer dimension sizes and type information • Purpose B: Implement efficient iterators for array accesses • Tracking buffer state requires iterators for write accesses • Synchronizing buffers for kernels requires iterators for read accesses 8

Iterator Code Generation 2D 5-point stencil, read map 2D domain [N] -> { [yl, yu, xl, xu] -> { I[y, x] -> S[o1=y, o2=x] : 0 <= o1,o2 < N; I[y, x] : 0 <= yl <= y < yu and 0 <= xl <= x < xu I[y, x] -> S[o1=y, o2=x-1] : 0 <= o1,o2 < N; } I[y, x] -> S[o1=y, o2=x+1] : 0 <= o1,o2 < N; I[y, x] -> S[o1=y-1, o2=x] : 0 <= o1,o2 < N; I[y, x] -> S[o1=y+1, o2=x] : 0 <= o1,o2 < N; } • Based on isl AST generation Identity schedule of map range • Accurate but inefficient for (int c0 = max(max(0, yl - 1), yl + xl - N); c0 <= min(min(yu, N - 1), yu - xl + N - 1); c0 += 1) for (int c1 = max(max(max(0, xl - 1), yl + xl - c0 - 1), -yu + xl + c0); • Reads don’t need 100% accuracy c1 <= min(min(min(xu, N - 1), yu + xu - c0 - 1), -yl + xu + c0); c1 += 1) S(c0, c1); • Last dimension is stored contiguous in memory in C 9

Iterator Code Generation Loop for o1 only for (int c0 = max(max(0, yl - 1), yl + xl - N); Minimum o2 for o1 = c0 c0 <= min(min(yu, N - 1), yu - xl + N - 1); c0 += 1) { Maximum o2 for o1 = c0 int y_lower = yl == c0 && yu >= c0 + 1 && xl == 0 && xu >= 2 ? 0 : c0 >= yl && yu >= c0 + 1 && xl >= 1 ? xl - 1 : xl; int y_upper = c0 >= yl && yu >= c0 + 1 && N >= xu + 2 ? xu : (yl == c0 + 1 && yu >= c0 + 2 && N >= xu + 1) || (c0 >= yl + 1 && yu == c0 && N >= xu + 1) Contiguous memory || (yl >= c0 && yu >= c0 + 2 && xu == N) ? xu - 1 : N - 1; S(c0, y_lower, y_upper); chunk in row o1 = c0 } • Replace one loop with closed-form lower/upper expressions (optimized by LLVM) • Good estimate for read maps • Write maps need extra checks (modulo, non-convex sets) to verify accuracy • Allows more efficient tracking and data transfers 10

Runtime Buffer Management 01010 01010 01010 01010 foreach GPU: 01010 10101 10101 10101 10101 cudaMalloc(size) -> Refs += [cudaMalloc(size)] -> 10101 new Tracker() Ref&Tracker 01010 01010 01010 foreach GPU: 01010 01010 10101 10101 10101 maybeCopy ( ) 10101 cudaMemcpy( ) 10101 update_tracker( ) Ref&Tracker 01010 foreach GPU: calc_partition() 01010 01010 01010 01010 10101 foreach GPU: sync_buffer( ) 10101 10101 10101 10101 kernel<<<grid>>>( ) foreach GPU: kernel<<<partition>>>( ) Ref&Tracker foreach GPU: update_tracker( ) 11

Runtime Buffer Synchronization First Kernel Launch Kernel Iteration Data Gathering GPU 1 GPU 2 Host Transfer • • • Data is in host memory   Data is distributed on Data is distributed on GPUs GPUs • • • Each GPU transfers its Host transfers most up to Each GPU only transfers whole read set data chunk from each stale data GPU • Often the most repeated part of application 12

Runtime Buffer Tracking • Synchronization requires tracking • Track intervals of memory describing location of most recent update • No overlapping intervals, implemented as b-tree based map with lower bound search • Coalescing neighboring intervals keeps memory footprint and performance stable 0x00 0x00 0xA0 0xA0 0xF0 0xF0 0x00 0xF0 0x00 0xA0 0xF0 GPU 2 INV HOST INV GPU 1 13

Performance Matrix Multiply Hotspot (n = 32768) N − Body (n = 262144) 10 12 10 GPUs 8 10 1 8 Speedup 8 4 6 6 8 6 4 12 4 4 16 2 2 2 10000 20000 0 500 1000 1500 2000 0 20 40 60 Matrix side length Iterations Iterations Matrix Multiply (n = 28384) Hotspot (n = 28384, i = 1000) N − Body (n = 262144, i = 64) 80 100 Execution time (s) 60 100 75 Rest 40 50 Transfers 50 Kernel 20 25 0 0 0 1 3 5 7 9 11 13 15 1 3 5 7 9 11 13 15 1 3 5 7 9 11 13 15 GPUs GPUs GPUs 14

Future Work • Fully integrated proof-of-concept • Better handling of non-affine accesses • More comprehensive validation using well-known benchmarks • Array reshaping for better performance and memory utilization • Explore shared memory optimizations (e.g. posted writes for synchronization) 15

Conclusion • Compiler based Automatic Partitioning is feasible • Polyhedral compilation is a good fit for GPU memory access patterns • Accuracy of extracted memory access patterns crucial for both correctness (write accesses) and performance (read accesses) • Performance of prototype experiments very promising • LLVM provides excellent research platform for non-traditional compiler researchers 16

Thank you We especially thank Christoph Klein and Lorenz Braun (Heidelberg University), and   Johannes Doerfert (Saarland University) for their contributions to our research as well as Sudha Yalamanchili (Georgia Tech), Mark Hummel (NVIDIA), Peter Zaspel (University of Basel), Tobias Grosser (ETH Zürich), Johannes Doerfert and Sebastian Hack (Saarland University) for many helpful discussions and our Sponsors BMBF, Google, NVIDIA, and the German Excellence Initiative

1D-Identity Map 1. Analysis Output [boff_x, tid_x] -> { [] -> [boff_x + tid_x] } 3. Canonicalized Access Map 2. 1D Iteration Domain (CUDA Thread Grid) [boffmin_x, boffmax_x, bidmin_x, [boffmin_x, boffmax_x, bidmin_x, bidmax_x, bdim_x] -> { bidmax_x, bdim_x] -> { [boff_x, bid_x, tid_x] : [boff_x, bid_x] -> [o0] : boffmin_x <= boff_x < boffmax_x boffmin_x <= boff_x < boffmax_x and bidmin_x <= bid_x < bidmax_x and bidmin_x <= bid_x < bidmax_x and 0 <= tid_x < bdim_x; and boff_x <= o0 < bdim_x + boff_x } } 18

Enabling Automatic Partitioning of Data-Parallel Kernels with - PowerPoint PPT Presentation

Enabling Automatic Partitioning of Data-Parallel Kernels with Polyhedral Compilation Alexander Matz, Holger Frning Heidelberg University, Germany LLVM Performance Workshop @CGO 2018 Sat 24 Feb 2018, Vienna, Austria Multi GPU in the Real

The Gray Code Kernels The Gray Code Kernels The Gray Code Kernels Gil Ben-Artzi Hagit Hel-Or

Overview: Kernels for Sequences and Graphs String Kernels 8 Example Sequence Classification

Kernels on structures Andrea Passerini passerini@disi.unitn.it Machine Learning Kernels on

Beta kernels and transformed kernels applications to copulas and quantiles Arthur Charpentier

Algorithms in the parallel Algorithms in the parallel partitioning tool partitioning tool

Partitioning and Divide-and- Conquer Strategies Partitioning Strategies Partitioning simply

Partitioning Introduction to Partitioning Mahapatra-Texas A&M-Spring02 1 System

Scalable Machine Learning 6. Kernels Alex Smola Yahoo! Research and ANU

SVM Kernels COMPSCI 371D Machine Learning COMPSCI 371D Machine Learning SVM Kernels 1 /

GRAPH MINING AND GRAPH KERNELS Part II: Graph Kernels Karsten Borgwardt^ and Xifeng Yan*

Automatic Verification of Automatic Verification of Automatic Verification of Automatic

Using Processor Partitioning to Using Processor Partitioning to Evaluate the Performance of MPI,

Partitioning under the hood in MySQL 5.5 Mattias Jonsson, Partitioning developer Mikael

1 1 Slide 5 Slide 6 Partitioning and Load Balancing Partitioning Goals Assignment of

Partitioning Problem and Usage Lecture 8 CSCI 4974/6971 26 Sep 2016 1 / 14 Todays Biz 1.

Investigating hypergraph-partitioning-based sparse matrix partitioning methods Bora U car

Skeleton and Dual Complex Chenyang Xu Beijing International Center of Mathematics Research

Kai Xu Lecturer, London Bachelor, Shanghai PhD, Brisbane Research Scientist, Research

Lifting the Cartier transform of Ogus and Vologodsky modulo p n [following H. Oyama, A. Shiho and

Machine Learning (CSE 446): PCA (continued) and Learning as Minimizing Loss Sham M Kakade

Interpolation points and interpolation formulae on the square Stefano De Marchi Department of

Why dont we move slower? On the cost of time in the neural control of movement eric Jean 1

Next Generation Assessment Stakeholder Meeting March 24, 2016 Agenda DC Health updates

COORDINATED ASSESSMENT WORK GROUP September 12, 2019 WELCOME! Coordinated Assessment is the

Enabling Automatic Partitioning of Data-Parallel Kernels with - PowerPoint PPT Presentation

Enabling Automatic Partitioning of Data-Parallel Kernels with Polyhedral Compilation Alexander Matz, Holger Frning Heidelberg University, Germany LLVM Performance Workshop @CGO 2018 Sat 24 Feb 2018, Vienna, Austria Multi GPU in the Real

The Gray Code Kernels The Gray Code Kernels The Gray Code Kernels Gil Ben-Artzi Hagit Hel-Or

Overview: Kernels for Sequences and Graphs String Kernels 8 Example Sequence Classification

Kernels on structures Andrea Passerini passerini@disi.unitn.it Machine Learning Kernels on

Beta kernels and transformed kernels applications to copulas and quantiles Arthur Charpentier

Algorithms in the parallel Algorithms in the parallel partitioning tool partitioning tool

Partitioning and Divide-and- Conquer Strategies Partitioning Strategies Partitioning simply

Partitioning Introduction to Partitioning Mahapatra-Texas A&amp;M-Spring02 1 System

Scalable Machine Learning 6. Kernels Alex Smola Yahoo! Research and ANU

SVM Kernels COMPSCI 371D Machine Learning COMPSCI 371D Machine Learning SVM Kernels 1 /

GRAPH MINING AND GRAPH KERNELS Part II: Graph Kernels Karsten Borgwardt^ and Xifeng Yan*

Automatic Verification of Automatic Verification of Automatic Verification of Automatic

Using Processor Partitioning to Using Processor Partitioning to Evaluate the Performance of MPI,

Partitioning under the hood in MySQL 5.5 Mattias Jonsson, Partitioning developer Mikael

1 1 Slide 5 Slide 6 Partitioning and Load Balancing Partitioning Goals Assignment of

Partitioning Problem and Usage Lecture 8 CSCI 4974/6971 26 Sep 2016 1 / 14 Todays Biz 1.

Investigating hypergraph-partitioning-based sparse matrix partitioning methods Bora U car

Skeleton and Dual Complex Chenyang Xu Beijing International Center of Mathematics Research

Kai Xu Lecturer, London Bachelor, Shanghai PhD, Brisbane Research Scientist, Research

Lifting the Cartier transform of Ogus and Vologodsky modulo p n [following H. Oyama, A. Shiho and

Machine Learning (CSE 446): PCA (continued) and Learning as Minimizing Loss Sham M Kakade

Interpolation points and interpolation formulae on the square Stefano De Marchi Department of

Why dont we move slower? On the cost of time in the neural control of movement eric Jean 1

Next Generation Assessment Stakeholder Meeting March 24, 2016 Agenda DC Health updates

COORDINATED ASSESSMENT WORK GROUP September 12, 2019 WELCOME! Coordinated Assessment is the

Partitioning Introduction to Partitioning Mahapatra-Texas A&M-Spring02 1 System