gpucc: An Open-Source GPGPU Compiler Jingyue Wu (jingyue@google.com) - PowerPoint PPT Presentation

gpucc: An Open-Source GPGPU Compiler Jingyue Wu (jingyue@google.com) , Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Artem Belevich, Robert Hundt

One-Slide Overview ● Motivation Lack of a state-of-the-art platform for CUDA compiler and HPC research ○ ○ Binary dependencies, performance tuning, language features, bug turnaround times, etc. Solution ● ○ gpucc : the first fully-functional, open-source, high performance CUDA compiler based on LLVM and supports C++11 and C++14 ○ ○ developed and tuned several general and CUDA-specific optimization passes ● Results highlight (compared with nvcc) up to 51% faster on internal end-to-end benchmarks ○ ○ on par on open-source benchmarks compile time is 8% faster on average and 2.4x faster for pathological compilations ○

Mixed-Mode CUDA Code template <int N> __global__ void kernel( float *y) { ... } GPU/device

Mixed-Mode CUDA Code template <int N> void host(float *x) { float *y; template <int N> cudaMalloc(&y, 4*N); __global__ void kernel( cudaMemcpy(y, x, ...); float *y) { kernel<N><<<16, 128>>>(y); ... ... } } CPU/host GPU/device

Mixed-Mode CUDA Code foo.cu template <int N> void host(float *x) { float *y; template <int N> cudaMalloc(&y, 4*N); __global__ void kernel( cudaMemcpy(y, x, ...); float *y) { kernel<N><<<16, 128>>>(y); ... ... } } CPU/host GPU/device

gpucc Architecture (Current and Interim) Mixed mode input file template <int N> __global__ void kernel( float *y) { Host-device splitter ... } Host code Device code template <int N> Clang void host(float *x) { float *y; Device code IR optimizer cudaMalloc(&y, 4*N); generator cudaMemcpy(y, x, ...); kernel<N><<<16, 128>>>(y); NVPTX codegen PTX assembly ... } Host compilation Fat Binary

Clang Integration (WIP and Long-Term) mixed mode ● Major issues with the separate compilation input file Source-to-source translation is complex and fragile ○ ○ Long compilation time Device compilation Clang driver instead of physical code splitting ● ○ (by Artem Belevich) PTX assembly ○ $ clang foo.cu ... ○ $ clang -x cuda <file> ... Host compilation Fat Binary

CPU vs GPU Characteristics CPU GPU ● Designed for general purposes ● Designed for rendering Optimized for latency Optimized for throughput ● ● ● Heavyweight hardware threads ● Lightweight hardware threads Branch prediction ○ ○ Out-of-order execution Superscalar ○ ● Small number of cores per die Massive parallelism ● ○ Can trade latency for throughput

Major Optimizations in gpucc ● Straight-line scalar optimizations Inferring memory spaces ● Loop unrolling and function inlining ● ● Memory-space alias analysis ● Speculative execution Bypassing 64-bit divisions ●

Straight-Line Scalar Optimizations y n for (long x = 0; x < 3; ++x) { for (long y = 0; y < 3; ++y) { float *p = &a[(c+y) + (b+x) * n]; ... // load from p } } (b,c) a x

Straight-Line Scalar Optimizations p0 = &a[c + b * n]; p1 = &a[c + 1 + b * n]; p2 = &a[c + 2 + b * n]; loop for (long x = 0; x < 3; ++x) { unroll p3 = &a[c + (b + 1) * n]; for (long y = 0; y < 3; ++y) { p4 = &a[c + 1 + (b + 1) * n]; float *p = &a[(c+y) + (b+x) * n]; p5 = &a[c + 2 + (b + 1) * n]; ... // load from p } p6 = &a[c + (b + 2) * n]; } p7 = &a[c + 1 + (b + 2) * n]; p8 = &a[c + 2 + (b + 2) * n];

Straight-Line Scalar Optimizations p0 = &a[c + b * n]; p1 = &a[c + 1 + b * n]; p2 = &a[c + 2 + b * n]; p3 = &a[c + (b + 1) * n]; p4 = &a[c + 1 + (b + 1) * n]; p5 = &a[c + 2 + (b + 1) * n]; p6 = &a[c + (b + 2) * n]; p7 = &a[c + 1 + (b + 2) * n]; c + 2 b + 2 (b + 2) * n c + 2 + (b + 2) * n p8 = &a[c + 2 + (b + 2) * n];

Straight-Line Scalar Optimizations p0 = &a[c + b * n]; p1 = &a[c + 1 + b * n]; p2 = &a[c + 2 + b * n]; p3 = &a[c + (b + 1) * n]; p4 = &a[c + 1 + (b + 1) * n]; p5 = &a[c + 2 + (b + 1) * n]; Addressing mode (base+imm) p6 = &a[c + (b + 2) * n]; p8 = &a[c + (b + 2) * n] + 2 Injured redundancy p7 = &a[c + 1 + (b + 2) * n]; (b + 1) * n + n ● Pointer arithmetic reassociation c + 2 ● Straight-line strength reduction b + 2 ● Global reassociation (b + 2) * n c + 2 + (b + 2) * n p8 = &a[c + 2 + (b + 2) * n];

Pointer Arithmetic Reassociation p0 = &a[c + b * n]; p0 = &a[c + b * n]; p1 = &a[c + 1 + b * n]; p1 = &p0[1]; p2 = &a[c + 2 + b * n]; p2 = &p0[2]; p3 = &a[c + (b + 1) * n]; p3 = &a[c + (b + 1) * n]; p4 = &a[c + 1 + (b + 1) * n]; p4 = &p3[1]; p5 = &a[c + 2 + (b + 1) * n]; p5 = &p3[2]; p6 = &a[c + (b + 2) * n]; p6 = &a[c + (b + 2) * n]; p7 = &a[c + 1 + (b + 2) * n]; p7 = &p6[1]; p8 = &a[c + 2 + (b + 2) * n]; p8 = &p6[2];

Straight-Line Strength Reduction x = (base+C0)*stride x = (base+C0)*stride y = (base+C1)*stride y = x + (C1-C0)*stride

Straight-Line Strength Reduction x = (base+C0)*stride x = (base+C0)*stride y = (base+C1)*stride y = x + (C1-C0)*stride x0 = b * n; x0 = b * n; p0 = &a[c + x0]; p0 = &a[c + x0]; p1 = &p0[1]; p1 = &p0[1]; p2 = &p0[2]; p2 = &p0[2]; x1 = (b + 1) * n; x1 = x0 + n; p3 = &a[c + x1]; p3 = &a[c + x1]; p4 = &p3[1]; p4 = &p3[1]; p5 = &p3[2]; p5 = &p3[2]; x2 = (b + 2) * n; x2 = x1 + n; p6 = &a[c + x2]; p6 = &a[c + x2]; p7 = &p6[1]; p7 = &p6[1]; p8 = &p6[2]; p8 = &p6[2];

Global Reassociation x0 = b * n; p0 = &a[c + x0]; p1 = &p0[1]; p2 = &p0[2]; x1 = x0 + n; p3 = &a[c + x1]; p4 = &p3[1]; p5 = &p3[2]; x2 = x1 + n; p6 = &a[c + x2]; p7 = &p6[1]; p8 = &p6[2];

Global Reassociation x0 = b * n; x0 = b * n; p0 = &a[c + x0]; i0 = c + x0; p1 = &p0[1]; p2 = &p0[2]; x1 = x0 + n; x1 = x0 + n; p3 = &a[c + x1]; i1 = c + x1; p4 = &p3[1]; p5 = &p3[2]; x2 = x1 + n; p6 = &a[c + x2]; p7 = &p6[1]; p8 = &p6[2];

Global Reassociation x0 = b * n; x0 = b * n; p0 = &a[c + x0]; i0 = c + x0; p1 = &p0[1]; p2 = &p0[2]; x1 = x0 + n; x1 = x0 + n; p3 = &a[c + x1]; i1 = c + x1; // = c+(x0+n) = (c+x0)+n p4 = &p3[1]; p5 = &p3[2]; x2 = x1 + n; p6 = &a[c + x2]; p7 = &p6[1]; p8 = &p6[2];

Global Reassociation x0 = b * n; x0 = b * n; p0 = &a[c + x0]; i0 = c + x0; p1 = &p0[1]; p2 = &p0[2]; x1 = x0 + n; x1 = x0 + n; p3 = &a[c + x1]; i1 = c + x1; i1 = i0 + n; p4 = &p3[1]; p5 = &p3[2]; x2 = x1 + n; p6 = &a[c + x2]; p7 = &p6[1]; p8 = &p6[2];

Global Reassociation x0 = b * n; x0 = b * n; p0 = &a[c + x0]; i0 = c + x0; p1 = &p0[1]; p0 = &a[i0]; p2 = &p0[2]; x1 = x0 + n; x1 = x0 + n; p3 = &a[c + x1]; i1 = c + x1; i1 = i0 + n; p3 = &p0[n]; p4 = &p3[1]; p3 = &a[i1]; p3 = &a[i1]; p5 = &p3[2]; x2 = x1 + n; p6 = &a[c + x2]; p7 = &p6[1]; p8 = &p6[2];

Global Reassociation x0 = b * n; x0 = b * n; x0 = b * n; p0 = &a[c + x0]; i0 = c + x0; p0 = &a[c + x0]; p1 = &p0[1]; p0 = &a[i0]; p1 = &p0[1]; p2 = &p0[2]; p2 = &p0[2]; x1 = x0 + n; x1 = x0 + n; p3 = &a[c + x1]; i1 = c + x1; i1 = i0 + n; p3 = &p0[n]; p3 = &p0[n]; p4 = &p3[1]; p3 = &a[i1]; p3 = &a[i1]; p4 = &p3[1]; p5 = &p3[2]; p5 = &p3[2]; x2 = x1 + n; p6 = &a[c + x2]; p6 = &p3[n]; p7 = &p6[1]; p7 = &p6[1]; p8 = &p6[2]; p8 = &p6[2];

Summary of Straight-Line Scalar Optimizations x0 = b * n; p0 = &a[c + b * n]; p0 = &a[c + x0]; p1 = &a[c + 1 + b * n]; p1 = &p0[1]; p2 = &a[c + 2 + b * n]; p2 = &p0[2]; p3 = &a[c + (b + 1) * n]; p3 = &p0[n]; p4 = &a[c + 1 + (b + 1) * n]; p4 = &p3[1]; p5 = &a[c + 2 + (b + 1) * n]; p5 = &p3[2]; p6 = &a[c + (b + 2) * n]; p6 = &p3[n]; p7 = &a[c + 1 + (b + 2) * n]; p7 = &p6[1]; p8 = &a[c + 2 + (b + 2) * n]; p8 = &p6[2]; Design doc: https://goo.gl/4Rb9As

Optimizations ● Straight-line scalar optimizations Inferring memory spaces ● Loop unrolling and function inlining ● ● Memory-space alias analysis ● Speculative execution Bypassing 64-bit divisions ●

Inferring Memory Spaces Load/store PTX assembly instructions GPU Device ● Special Block (processor) Block (processor) ○ ld.shared/st.shared ○ ld.global/st.global Thread Thread Thread Thread ○ ... ● Shared memory Shared memory Generic ○ ld/st ○ Overhead in checking (e.g. ~10% Global memory slower than ld.shared ) ○ Alias analysis suffers

gpucc: An Open-Source GPGPU Compiler Jingyue Wu (jingyue@google.com) - PowerPoint PPT Presentation

gpucc: An Open-Source GPGPU Compiler Jingyue Wu (jingyue@google.com) , Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Artem Belevich, Robert Hundt One-Slide Overview Motivation Lack

gpucc: An Open-Source GPGPU Compiler Jingyue Wu , Artem Belevich, Eli Bendersky, Mark Heffernan,

Welcome! Global Agenda: 1. GPGPU (1) : Introduction, architecture, concepts 2. GPGPU (2) :

Welcome! Todays Agenda: GPU Execution Model GPGPU Flow GPGPU Low Level Notes

Parallel Incep+on MPP Databases GPGPU Kyle Dunn Me Data nerd for Recovering HPC/GPGPU

Compiler Construction Chapter 11 1 Compiler Construction Compiler Construction A New Compiler

Welcome! Todays Agenda: Introduction to GPGPU Example: Voronoi Noise GPGPU

Welcome! Todays Agenda: Practical GPGPU: Verlet Fluid GPGPU Algorithms Optimizing

Efficient Abstractions for GPGPU Programming . Mathias Bourgoin 10.03.2015 Efficient

Make Money With Open Source What is Open Source? Community Free software vs. open source

and Retrieval Source: H. Jegou Source: H. Jegou Source: H. Jegou Source: H. Jegou Source: H.

Open Source Databases Peter Zaitsev, CEO Percona What a Year! Huge changes for Open Source and

Automating Your Lights with Open Source Combining Open Source Hardware with Free and Open Source

11/8/2012 The Structure of a Compiler (2) The Structure of a Compiler (1) Any compiler must

Compiler Development (CMPSC 401) Janyl Jumadinova January 17, 2018 Janyl Jumadinova Compiler

Principles of Compiler Design - The Brainf*ck Compiler - Clifford Wolf - www.clifford.at

K E D b . D a L a t a B a s e Jordan Vincent XML processing using GPGPU Jordan

A GHOST FROM POSTSCRIPT for RUXCON 2017 A GHOST FROM POSTSCRIPT WHO ARE WE redrain

2PAC 2Furious: Envisioning an iOS compromise in 2019 Marco Grassi - @marcograss Liang Chen -

Bypassing Microsoft JEA role capabilities for fun & profit whoami Cristhian Parrot -

Cloud Interconnections William B. Norton Console, Chief Scientist Last Updated: 9/19/16 4:04pm

the kernel bypass with RDMA! Using the RDMA infrastructure for performance while retaining kernel

XDP (eXpress Data Path) as a building block for other FOSS projects Jesper Dangaard Brouer (Red

Banked Multiported Register Files for High-Frequency Superscalar Microprocessors Jessica H. Tseng

Practical Attacks against Mobile Device Management Solutions Michael Shaulov, CEO

gpucc: An Open-Source GPGPU Compiler Jingyue Wu (jingyue@google.com) - PowerPoint PPT Presentation

gpucc: An Open-Source GPGPU Compiler Jingyue Wu (jingyue@google.com) , Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Artem Belevich, Robert Hundt One-Slide Overview Motivation Lack

gpucc: An Open-Source GPGPU Compiler Jingyue Wu , Artem Belevich, Eli Bendersky, Mark Heffernan,

Welcome! Global Agenda: 1. GPGPU (1) : Introduction, architecture, concepts 2. GPGPU (2) :

Welcome! Todays Agenda: GPU Execution Model GPGPU Flow GPGPU Low Level Notes

Parallel Incep+on MPP Databases GPGPU Kyle Dunn Me Data nerd for Recovering HPC/GPGPU

Compiler Construction Chapter 11 1 Compiler Construction Compiler Construction A New Compiler

Welcome! Todays Agenda: Introduction to GPGPU Example: Voronoi Noise GPGPU

Welcome! Todays Agenda: Practical GPGPU: Verlet Fluid GPGPU Algorithms Optimizing

Efficient Abstractions for GPGPU Programming . Mathias Bourgoin 10.03.2015 Efficient

Make Money With Open Source What is Open Source? Community Free software vs. open source

and Retrieval Source: H. Jegou Source: H. Jegou Source: H. Jegou Source: H. Jegou Source: H.

Open Source Databases Peter Zaitsev, CEO Percona What a Year! Huge changes for Open Source and

Automating Your Lights with Open Source Combining Open Source Hardware with Free and Open Source

11/8/2012 The Structure of a Compiler (2) The Structure of a Compiler (1) Any compiler must

Compiler Development (CMPSC 401) Janyl Jumadinova January 17, 2018 Janyl Jumadinova Compiler

Principles of Compiler Design - The Brainf*ck Compiler - Clifford Wolf - www.clifford.at

K E D b . D a L a t a B a s e Jordan Vincent XML processing using GPGPU Jordan

A GHOST FROM POSTSCRIPT for RUXCON 2017 A GHOST FROM POSTSCRIPT WHO ARE WE redrain

2PAC 2Furious: Envisioning an iOS compromise in 2019 Marco Grassi - @marcograss Liang Chen -

Bypassing Microsoft JEA role capabilities for fun &amp; profit whoami Cristhian Parrot -

Cloud Interconnections William B. Norton Console, Chief Scientist Last Updated: 9/19/16 4:04pm

the kernel bypass with RDMA! Using the RDMA infrastructure for performance while retaining kernel

XDP (eXpress Data Path) as a building block for other FOSS projects Jesper Dangaard Brouer (Red

Banked Multiported Register Files for High-Frequency Superscalar Microprocessors Jessica H. Tseng

Practical Attacks against Mobile Device Management Solutions Michael Shaulov, CEO

Bypassing Microsoft JEA role capabilities for fun & profit whoami Cristhian Parrot -