Compiling NESL for GPUs

  1. Compiling NESL for GPUs. John Reppy, University of Chicago. August 2014.

  2. Introduction: Credits
  This work is a collaboration with:
  - Lars Bergstrom (Mozilla)
  - Nora Sandler (U. of Chicago)
  We also had support from NVIDIA.

  3. Introduction: GPUs
  GPUs provide supercomputer levels of parallelism. For example, NVIDIA's Fermi architecture has 14 Streaming Multiprocessors (SMs), each with 32 ALUs (1.5 TFLOPS peak performance).
  [Figure: Fermi block diagram, with the SMs sharing a 768 KB L2 cache in front of global memory.]

  4. Introduction: Fermi SM
  [Figure: Fermi SM layout: instruction cache; two warp scheduler/dispatch units; a 32K by 32-bit register file that holds thread state; two groups of 16 cores; 16 load/store units; 64 KB of L1 cache/local memory.]
  More recent designs (Kepler and Maxwell) have even more compute power.

  5. Introduction: GPU programming model
  - Single-Instruction, Multiple-Thread (SIMT) execution model.
  - Each warp (32 threads) executes the same instruction.
  - SM-local barrier synchronization (fast); global atomics (slow).
  - Predication is used to handle divergent control flow (conditionals/loops); a minimal sketch follows this slide.
  - Explicit memory hierarchy:
    - per-thread memory in registers on the SM
    - per-SM shared memory and cache
    - global memory (backed by the shared L2 cache)
    - host memory
  The "high-level" GPU programming languages (CUDA and OpenCL) expose these properties!
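To make the predication point concrete, here is a minimal CUDA sketch (not from the talk; the kernel name is made up): when threads of the same warp take different sides of a branch, both paths are issued with the inactive lanes masked off, so divergence costs throughput.

    // Minimal sketch (not from the talk): lanes of one warp that disagree on
    // the test are handled by predication, i.e. both paths are executed with
    // the inactive lanes masked off.
    __global__ void clamp_negatives (int n, const float *in, float *out)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) {
            if (in[i] > 0.0f)
                out[i] = in[i];      // lanes where the test is true
            else
                out[i] = 0.0f;       // lanes where the test is false
        }
    }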

  6. Introduction: Programming becomes harder
  C code for dot product (map-reduce):

    float dotp (int n, float *a, float *b)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

  CUDA code for dot product (kernel plus CPU-side code):

    __global__ void dotp (int n, const float *a, const float *b, float *results)
    {
        __shared__ float cache[ThreadsPerBlock];
        const unsigned int idx = threadIdx.x;
        unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x;
        float temp = 0.0f;
        while (tid < n) {
            temp += a[tid] * b[tid];
            tid += blockDim.x * gridDim.x;
        }
        cache[idx] = temp;
        __syncthreads();
        int i = blockDim.x / 2;
        while (i != 0) {
            if (idx < i)
                cache[idx] += cache[idx + i];
            __syncthreads();
            i /= 2;
        }
        if (idx == 0)
            results[blockIdx.x] = cache[0];
    }

    // CPU-side code
    cudaMalloc ((void **)&V1_D, N * sizeof(float));
    cudaMalloc ((void **)&V2_D, N * sizeof(float));
    cudaMalloc ((void **)&V3_D, blockPerGrid * sizeof(float));
    cudaMemcpy (V1_D, V1_H, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy (V2_D, V2_H, N * sizeof(float), cudaMemcpyHostToDevice);
    dotp <<<blockPerGrid, ThreadsPerBlock>>> (N, V1_D, V2_D, V3_D);
    V3_H = new float [blockPerGrid];
    cudaMemcpy (V3_H, V3_D, blockPerGrid * sizeof(float), cudaMemcpyDeviceToHost);
    float sum = 0.0f;
    for (int i = 0; i < blockPerGrid; i++)
        sum += V3_H[i];
    delete[] V3_H;

  7. Introduction: Better programming models for GPUs
  - Domain-specific languages can be harnessed to both lift the level of programming and provide portable parallelism.
  - Higher-level, but more restricted, programming models can be mapped to efficient parallel code.
  This talk is about the second approach.

  8. Nested Data Parallelism: NESL
  - NESL is a first-order functional language for parallel programming over sequences, designed by Guy Blelloch [CACM '96].
  - It provides a parallel for-each operation:

    { x + y : x in xs; y in ys }

  - It provides other parallel operations on sequences, such as reductions, prefix-scans, and permutations:

    function dotp (xs, ys) = sum ({ x * y : x in xs; y in ys })

  - It supports Nested Data Parallelism (NDP): the components of a parallel computation may themselves be parallel.

  9. Nested Data Parallelism: NDP example, sparse-matrix times vector
  Represent the matrix as a sequence of sequences of (column-index, value) pairs, so that we avoid computing products where the matrix entries are 0:

    [(0, 1), (2, 4)]
    [(1, 3), (4, 2)]
    [(3, 5)]
    [(0, 6), (1, 7), (4, 8)]
    [(0, 1)]

  [Figure: the corresponding dense 5 x 5 matrix alongside the dense vector (1, 2, 3, 4, 5).]
  A hypothetical host-side rendering of this nested representation follows.
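As a concrete (hypothetical, not from the talk) host-side rendering of the nested, array-of-structures form, one array of (column, value) pairs per row:

    // Hypothetical host-side rendering (not from the talk) of the nested,
    // array-of-structures form: one array of (column, value) pairs per row.
    typedef struct { int col; float val; } Elem;

    Elem row0[] = {{0, 1}, {2, 4}};
    Elem row1[] = {{1, 3}, {4, 2}};
    Elem row2[] = {{3, 5}};
    Elem row3[] = {{0, 6}, {1, 7}, {4, 8}};
    Elem row4[] = {{0, 1}};

    Elem *sm[]     = {row0, row1, row2, row3, row4};   // the nested sequence
    int   rowLen[] = {2, 2, 1, 3, 1};                  // per-row lengths

Flattening (slide 11) turns this array-of-structures layout into the structure-of-arrays form that the flat vector primitives operate on.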

  10. Nested Data Parallelism: NDP example, sparse-matrix times vector
  In NESL, this algorithm has a compact expression:

    function svxv (sv, v) = sum ({ x * v[i] : (i, x) in sv })
    function smxv (sm, v) = { svxv (sv, v) : sv in sm }

  Notice that smxv is a map of map-reduce subcomputations, i.e., nested data parallelism.

  11. Nested Data Parallelism: NDP example, sparse-matrix times vector
  A naive parallel decomposition will be unbalanced because of the irregularity in the sub-problem sizes. The flattening transformation converts NDP to flat DP (including converting AoS to SoA): the nested sequence of pairs becomes a segment descriptor plus flat vectors of column indices and values:

    segment descriptor: [2, 2, 1, 3, 1]
    column indices:     [0, 2, 1, 4, 3, 0, 1, 4, 0]
    values:             [1, 4, 3, 2, 5, 6, 7, 8, 1]

  A sketch of the flat, element-wise code follows.
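As an illustration of the flat form (hypothetical kernel, not compiler output; the array names follow the flattened representation above), the element-wise part of the computation assigns one thread per stored nonzero:

    // Hypothetical kernel (not compiler output): the element-wise part of the
    // flattened sparse-matrix-times-vector computation, one thread per stored
    // nonzero, reading the SoA arrays produced by flattening.
    __global__ void spmv_products (int nnz,
                                   const int   *colIdx,  // flat column indices
                                   const float *vals,    // flat nonzero values
                                   const float *v,       // dense input vector
                                   float       *prods)   // one product per nonzero
    {
        int k = blockDim.x * blockIdx.x + threadIdx.x;
        if (k < nnz)
            prods[k] = vals[k] * v[colIdx[k]];   // equal work per thread
    }

The per-row sums are then obtained with a segmented reduction driven by the segment descriptor, so the irregular row lengths do not show up as per-thread load imbalance.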

  12. Nested Data Parallelism: NESL on GPUs
  - NESL was designed for wide-vector (SIMD) machines.
  - It is a good fit for GPU computation.
  - First try [ICFP '12]:
    - Implemented Blelloch's VCODE VM on GPUs using CUDA and Thrust (a rough Thrust illustration follows this slide).
    - Added map fusion.
    - Outperforms the CPU on NDP benchmarks.
    - As fast as hand-written CUDA in some cases (Quickhull), but usually slower (worst case: 100 times slower on Barnes-Hut).
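As a rough illustration only (this is not the actual VCODE backend), a flat map-reduce such as dot product can be expressed with a single fused Thrust call:

    // Rough illustration only (not the actual VCODE/Thrust backend):
    // thrust::inner_product combines the elementwise multiply with the
    // reduction in one pass, so no intermediate product vector is built.
    #include <thrust/device_vector.h>
    #include <thrust/inner_product.h>

    float dotp (const thrust::device_vector<float> &a,
                const thrust::device_vector<float> &b)
    {
        return thrust::inner_product(a.begin(), a.end(), b.begin(), 0.0f);
    }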

  13. Nested Data Parallelism: Areas for improvement
  There are a number of areas for improvement:
  - Better fusion:
    - Fuse generators, scans, and reductions with maps (see the sketch after this list).
    - "Horizontal fusion": fuse independent maps over the same index space.
  - Better segment-descriptor management.
  - Better memory management.
  It proved difficult or impossible to support these improvements in that implementation.
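As a sketch of the first kind of fusion (illustrative only, hypothetical kernel), a generator fused with the map that consumes it computes each element directly from its index, so the generated index vector is never materialized:

    // Illustrative kernel (not compiler output): the index generator is fused
    // with the map that consumes it, so the index vector is never written to
    // memory; each thread computes its element directly from its index.
    __global__ void squares (int n, float *out)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n)
            out[i] = (float)i * (float)i;   // generator fused with map (x => x*x)
    }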

  14. Nessie
  Nessie is a new NESL compiler built from scratch.
  - Designed to support better fusion, etc.
  - The backend transforms flattened code to CUDA in several steps.
  - A testbed for future optimization experiments:
    - Vectorization avoidance (works for SIMT but not SIMD) [Keller et al. '12]
    - Piecewise execution [Prins '96; Madsen and Filinski '13]
    - Blocking, i.e., multiple elements per CUDA thread (see the sketch after this slide).
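A minimal sketch of the blocking idea (illustrative, not Nessie output): a grid-stride loop gives each CUDA thread several elements instead of exactly one, so fewer blocks are launched and each thread does more work.

    // Minimal sketch (illustrative, not Nessie output) of blocking: each
    // thread walks the index space with a grid-sized stride, processing
    // multiple elements instead of exactly one.
    __global__ void scale_blocked (int n, float a, const float *x, float *y)
    {
        for (int i = blockDim.x * blockIdx.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x)
            y[i] = a * x[i];
    }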

  15. Nessie: Nessie compiler
  - The front-end produces a monomorphic, direct-style IR.
  - Flattening eliminates NDP and produces Flan, which is a flat-vector language.
  - Shape analysis is used to tag vectors with size information (symbolic in some cases).
  - The backend transforms the flattened code to CUDA in several steps.

  16. Nessie: Generating CUDA from Flan
  Getting from Flan to CUDA takes a number of transformation steps:
  - Translate Flan to FuseAST, which makes maps, reductions, etc. explicit.
  - Fuse map compositions ("vertical fusion").
  - Compute the program dependence graph (PDG) for the FuseAST program [Ferrante et al. 1987].
  - For each group of computational nodes in a control region, compute a schedule based on data dependencies and synchronization requirements.
  - Using the schedules, translate the program into λcu, which makes the CPU/GPU distinction explicit.
  - Generate CUDA code from the λcu (plus some library code).

  17. Nessie: Example
  Consider the following NESL function:

    function sumprods (xs, ys, zs) =
        let s1 = sum ({ x * y : x in xs; y in ys });
            s2 = sum ({ x * z : x in xs; z in zs })
        in s1 + s2;

  18. Nessie: Example (continued)
  The dataflow graph for sumprods:
  [Figure: two nodes, map MUL (xs, ys) and map MUL (xs, zs), each feeding a REDUCE(+) node; the two reductions feed a final + node, which feeds EXIT.]

  19. Nessie: Example (continued)
  [Figure: the same dataflow graph partitioned into stages. Stage 1 contains the two map MUL nodes and the two REDUCE(+) nodes; Stage 2 contains the final +.]
  Stage 1 is translated to a CUDA kernel that produces two scalar results.

  20. Nessie: Example (continued)
  The λcu representation:

    task task1 () (xs : [int], ys : [int], zs : [int]) =
        let (t1 : [int], t2 : [int]) =
            map (kernel (x, y, z) => (x*y, x*z)) (xs, ys, zs)
        let s1 = reduce t1
        let s2 = reduce t2
        in (s1, s2)

    function sumprods (xs : [int], ys : [int], zs : [int]) =
        let (s1, s2) = run task1 () (xs, ys, zs)
        in s1 + s2

  A sketch of the kind of CUDA that Stage 1 could compile to follows.
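Here is a hedged sketch of the kind of CUDA kernel that Stage 1 could compile to. This is illustrative, not actual Nessie output: the kernel name, the fixed block size, and the atomicAdd-based combine of per-block results are assumptions.

    // Sketch only (not Nessie output): the two fused maps and the two
    // reductions of Stage 1 run in one pass; each block atomically adds its
    // partial sums into the two scalar results. Assumes the kernel is
    // launched with THREADS_PER_BLOCK threads per block and that *s1 and *s2
    // are zero-initialized on the device.
    #define THREADS_PER_BLOCK 256

    __global__ void sumprods_stage1 (int n,
                                     const int *xs, const int *ys, const int *zs,
                                     int *s1, int *s2)
    {
        __shared__ int part1[THREADS_PER_BLOCK];
        __shared__ int part2[THREADS_PER_BLOCK];

        // Fused maps feeding the two reductions: each thread accumulates
        // partial sums for both products over a grid-stride loop.
        int acc1 = 0, acc2 = 0;
        for (int i = blockDim.x * blockIdx.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x) {
            acc1 += xs[i] * ys[i];
            acc2 += xs[i] * zs[i];
        }
        part1[threadIdx.x] = acc1;
        part2[threadIdx.x] = acc2;
        __syncthreads();

        // Tree reduction of both partial-sum arrays in shared memory.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride) {
                part1[threadIdx.x] += part1[threadIdx.x + stride];
                part2[threadIdx.x] += part2[threadIdx.x + stride];
            }
            __syncthreads();
        }

        // One atomic add per block folds the block results into the scalars.
        if (threadIdx.x == 0) {
            atomicAdd(s1, part1[0]);
            atomicAdd(s2, part2[0]);
        }
    }

The final addition s1 + s2 (Stage 2) then runs on the CPU side after the two scalars are copied back, matching the CPU/GPU split that λcu makes explicit.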

  21. Conclusion: Status and future work
  - Generating running code for a number of the simpler examples (e.g., dot product).
  - Optimized reduce, scan, etc. operations [NVIDIA].
  - Early performance measurements are promising (1.3x speedup on dot product over the VCODE version).
  - Lots of work to do, particularly for segmented operations.
  - Want to develop a proper calculus of heterogeneous computation (λcu is a first step).
  - Lots of optimizations to explore!
