High GPU performance while the compiler handles the low level - PowerPoint PPT Presentation

High GPU performance while the compiler handles the low level details pros and cons of using a graphics pipeline Jeppe Revall Frisvad January 2013

Timeline on the programmability of the GPU 1995 2001 2003 2006 2009 2010 2012 m m n s y s s l o e t t e s s n i c i n l i i i l t i e l b e e c t r r m ) n s e a l l l ) e l e k a a u s r v ) e e l e r s g r f l e l e p a a - g e e n d l t ) p p g n l a u o s g y e e a n i ) r a r p e c n s r x i a t e r e o v i r m ) ) i t a i M t m r r r s s F c i M t A o n s e i e 0 o 0 r c s r a r M I 9 T 0 4 o o C a n 8 ( 0 2 c 2 M c 2 y 6 ( ( 2 1 0 ( D 1 ( 8 5 8 ( 2 ( 1995 Fixed-function rasterization pipeline in hardware. 2001 Vertex shaders (first programmable part of the pipeline). 2003 Fragment/pixel shaders (GPGPU). 2006 Unified shaders (CUDA) and geometry shaders. 2008 High level GPU programming in M ATLAB with Jacket. 2009 Compute shaders (interoperability) and tesselation shaders. 2010 Programmable ray tracing pipeline on the GPU (OptiX). 2012 Dynamic parallelism (threads spawn threads).

Trade-offs between performance and ease of use ◮ M ATLAB is a simple and easy way to use the GPU. How much efficiency do we lose? ◮ CUDA offers both high and low level functionality. It is easy to get started, but hard to get good performance. ◮ GPGPU requires basic knowledge of graphics programming. Harder to get started, easier to get good performance. ◮ Let us explore the validity of these statements. ◮ Case study: The full colour FFT of Lenna.

The full color FFT of Lenna in M ATLAB img = imread(’Lenna.png’); img size = size(img); s = sqrt(img size(1)*img size(2)); float img = (single(img) + 0.5)/256; tic; fft img = fft2(float img); norm img = sqrt(fft img.*conj(fft img)); norm img = rgb fftshift(clamp(norm img/s, 0, 1)); toc; image(norm img);

GPU computing in M ATLAB img = imread(’Lenna.png’); img size = size(img); s = sqrt(img size(1)*img size(2)); float img = (single(img) + 0.5)/256; gpu img = gpuArray(float img); tic; fft img = fft2(gpu img); norm img = sqrt(fft img.*conj(fft img)); norm img = rgb fftshift(clamp(norm img/s, 0, 1)); toc; result = gather(gpu img); image(result);

From vectorized M ATLAB to GPU-based M ATLAB FFT performance comparison 16 14 12 10 8 Time (ms) 6 4 2 0 Vectorized GPU-based CUDA (CUFFT) Shaders Matlab Matlab (GPGPU)

GPU architecture ( c � NVIDIA)

CUDA C version using CUFFT - inititalization // Load image from file int width, height, channels; unsigned char* data = SOIL load image("Lenna.png", &width, &height, &channels, SOIL LOAD RGB); // Declare variables for CUDA code unsigned int size = width*height; unsigned int half width = width/2 + 1; unsigned char* result = new unsigned char[size*channels]; uchar3* img = 0; cufftReal* float img = 0; cufftComplex* fft img = 0; uchar3* norm img = 0;

CUDA C version using CUFFT - allocation and transfer // Allocate GPU buffers cudaMalloc((void**)&img, size*sizeof(uchar3)); cudaMalloc((void**)&float img, size*sizeof(cufftReal)); cudaMalloc((void**)&fft img, half width*height*sizeof(cufftComplex)); cudaMalloc((void**)&norm img, size*sizeof(uchar3)); // Copy input image from host to GPU cudaMemcpy(img, data, size*sizeof(uchar3), cudaMemcpyHostToDevice); // Create FFT plan cufftHandle planR2C; cufftPlan2d(&planR2C, width, height, CUFFT R2C);

CUDA C version using CUFFT - kernel invocation timer.start(); // Launch kernels and perform FFT on the GPU // with one block for each row of pixels. uchar3 to floatR �� height, width �� (float img, img); cufftExecR2C(planR2C, float img, fft img); complexR to uchar3 �� height, half width �� (norm img, fft img, width, height); . . . (same for green (G) and blue (B)) . . . cudaDeviceSynchronize(); timer.stop();

CUDA C version using CUFFT - finalization // Copy output image from GPU to host cudaMemcpy(result, norm img, size*sizeof(uchar3), cudaMemcpyDeviceToHost); stbi write png("fft.png", width, height, channels, result, width*channels); // Clean up (free memory) cufftDestroy(planR2C); cudaFree(img); cudaFree(float img); cudaFree(fft img); cudaFree(norm img); delete [ ] result;

From vectorized M ATLAB to GPU-based M ATLAB FFT performance comparison 16 14 12 10 8 Sta�onary �me (ms), 6 3.5 GHz and 512 cores 4 2 0 Vectorized GPU-based CUDA Shaders Matlab Matlab (CUFFT) (GPGPU)

Rasterization pipeline - classical version Object World Eye Model View Projection Normalized Clip Window device W divide Viewport

Rasterization pipeline ◮ The rasterization pipeline is still available in GPUs. References - Fernando, R., and Kilgard, M. J. Introduction. In The Cg Tutorial: The Definitive guide to Programmable Real-Time Graphics , Chapter 1, Addison-Wesley, 2003. - Luebke, D., and Humphreys, G. How GPUs work. Computer 40 (2), pp. 96–100, February 2007.

FFT on the GPU using shaders ◮ Only 2 log 2 ( N ) passes for two 2D FFTs ( N is width). ◮ Scrambler indices and weights in small 1D textures.

Shader version (GPGPU) - creating a context int main(int argc, char** argv) { glutInit(&argc, argv); glutInitWindowSize(width, height); glutInitDisplayMode(GLUT DOUBLE | GLUT RGBA | GLUT DEPTH | GLUT ALPHA); glutCreateWindow("GPU FFT demo"); glewInit(); // Load image from file int width, height, channels; unsigned char* data = SOIL load image("Lenna.png", &width, &height, &channels, SOIL LOAD RGB); // Transfer image to texture memory SOIL create OGL texture(data, width, height, channels, SOIL CREATE NEW ID, SOIL FLAG INVERT Y); . . . return 0; }

Shader version (GPGPU) - drawing triangles GLuint source list = glGenLists(1); glNewList(source list, GL COMPILE); glEnable(GL TEXTURE 2D); glBindTexture(GL TEXTURE 2D, source tex); glBegin(GL POLYGON); glTexCoord2f(0.0f, 0.0f); glVertex2f(-1.0f, -1.0f); glTexCoord2f(1.0f, 0.0f); glVertex2f(1.0f, -1.0f); glTexCoord2f(1.0f, 1.0f); glVertex2f(1.0f, 1.0f); glTexCoord2f(0.0f, 1.0f); glVertex2f(-1.0f, 1.0f); glEnd(); glDisable(GL TEXTURE 2D); glEndList();

Shader version (GPGPU) - loading shaders // Create shaders GLuint vs = glCreateShader(GL VERTEX SHADER); GLuint fs = glCreateShader(GL FRAGMENT SHADER); // Load source code strings into shaders glShaderSource(vs, 1, &vert shader, 0); glShaderSource(fs, 1, &frag shader, 0); // Compile shaders glCompileShader(vs); glCompileShader(fs); // Create a program and attach the shaders GLuint prog = glCreateProgram(); glAttachShader(prog, vs); glAttachShader(prog, fs); glLinkProgram(prog); // Link the program

Shader version (GPGPU) - using the FFT const float scale = 1.0f/sqrt(width*height); FFT* fft = new FFT(width, height); timer.start(); fft − > set input(draw fft source rb); fft − > do fft(); glBlendFunc(GL ONE, GL ONE); glEnable(GL BLEND); fft − > draw output(scale, 0.0f, 0.0f, 1); fft − > draw output(0.0f, 0.0f, scale, 2); glDisable(GL BLEND); fft − > set input(draw fft source g); fft − > redraw input(); fft − > do fft(); glEnable(GL BLEND); fft − > draw output(0.0f, scale, 0.0f); glDisable(GL BLEND); timer.stop();

Performance FFT performance comparison 16 14 12 10 Sta�onary �me (ms), 8 3.5 GHz and 512 cores Laptop �me (ms), 6 2.4 GHz and 32 cores 4 2 0 Vectorized GPU-based CUDA (CUFFT) Shaders Matlab Matlab (GPGPU) ◮ Are newer GPUs better at CUDA or were the CUDA magic numbers better suited for the 512 cores?

What were the difficulties? ◮ M ATLAB ◮ None. ◮ CUDA (in summary) ◮ Two pass compilation (.cu files and the NVCC compiler). ◮ CUDA libraries dependencies. ◮ Magic numbers for thread/block allocation. ◮ Easy to make inefficient code (non-coalesced memory access). ◮ Must write kernels (syntax needs learning). ◮ Shaders (in summary) ◮ Graphics context needed (library dependency). ◮ OpenGL extensions need detection (library dependency). ◮ Algorithms run through the graphics pipeline. ◮ Runtime compilation. ◮ Must write shaders (syntax needs learning).

Rasterization example 50 50 100 100 150 150 200 200 250 250 300 300 350 350 400 400 450 450 500 500 550 550 100 200 300 400 500 600 700 100 200 300 400 500 600 700 > > render Elapsed time is 22.266896 seconds. > > render Starting matlabpool using the ’local’ profile ... connected to 6 workers. Sending a stop signal to all the workers ... stopped. Elapsed time is 18.082566 seconds. Rasteriza�on performance comparison > > gpu render 25 Elapsed time is 4.079594 seconds. 20 > > gpu render 15 Elapsed time is 0.306040 seconds. Time (ms) 10 5 0 Matlab for- Matlab parfor GPU-based Pre-allocated loops Matlab GPU ◮ Frame rate using graphics pipeline: > 5000 frames per second.

GPU ray tracing pipeline (OptiX)

Interoperability ◮ Compute shaders ◮ Rasterization-computing interop. ◮ Compiler handles thread/block allocation. ◮ OptiX ◮ Interop functionality for OpenGL/DirectX and CUDA. ◮ Ray tracing-rasterization-computing interop. ◮ Thread/block allocation needed for computing kernels. ◮ Downside: Interop requires a graphics context. ◮ Thank you for your attention.

High GPU performance while the compiler handles the low level - PowerPoint PPT Presentation

High GPU performance while the compiler handles the low level details pros and cons of using a graphics pipeline Jeppe Revall Frisvad January 2013 Timeline on the programmability of the GPU 1995 2001 2003 2006 2009 2010 2012 m m n s

Method Handles Everywhere! Charles Oliver Nutter @headius Method Handles What are method

1 Handle: Extended handles helps to keep hands away from direct heat Extended handles

Compilers Recognizing Handles Alex Aiken Recognizing Handles Bad News There are no known

Compiler Construction Chapter 11 1 Compiler Construction Compiler Construction A New Compiler

Persistent Handles: approaches Ralph Bhme, Samba Team, SerNet 2018-06-08 Outline Persistent

Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple

Status of GPU offloading on Wayland Axel Davy FOSDEM 2014 Status of GPU offloading on Wayland

Motivation to Learn GPGPU Julius Parulek Why to Learn About GPU? Computational power of GPU vs.

While Loops Python While Loops Form of the while loop: while condition : Statement Block

Features Exclusively from Surface of wok Handles Matt finish - easy to Extended handles

Understanding GPU performance How to get peak FLOPS (GPU version) Kenjiro Taura 1 / 7 Contents

GPU peak performance vs. CPU Squeezing GPU performance Peak Double Precision FLOPS Peak Memory

UNIFIED MEMORY ON PASCAL AND VOLTA Nikolay Sakharnykh - May 10, 2017 1 HETEROGENEOUS

Advancements in V-Ray RT GPU Vlado Koylazov, CTO & Co-founder Blagovest Taskov, RT GPU Team

HIGH-PERFORMANCE GPU VIDEO ENCODING ABHIJIT PATAIT SR. MANAGER, NVIDIA AGENDA GPU Video

Performance Evaluation of a Multithreaded GPU Using CUDA GPU architecture GeForce 8800 GPU

face of current global crisis osten olorunsola Chairman Energy Institute Nigeria 202005 Life

iPlots eXtreme Next-generation interactive graphics for analysis of large data Simon Urbanek

Cowen and Company 8 th Annual Global Transportation Conference September 9, 2015 Alan H. Shaw

Uniprocessor Performance not Scaling Performance (vs. VAX-11/780) 10000 20%/year Concurrency

Convergent synthesis of a linker-connected fluorescent ebselen-coumarin heterodimer Jim Kppers 1

Celiac Disease & Non-Celiac Gluten Sensitivity Amandeep Shergill, M.D., M.S. H.S.

That is the Question? Gary S. Roubin MD PhD. Chairman IMC. CREST 2 Trial. Conflicts of Interest.

Quan%fying organics: COD & BOD CTB3365x Introduc1on to water

High GPU performance while the compiler handles the low level - PowerPoint PPT Presentation

High GPU performance while the compiler handles the low level details pros and cons of using a graphics pipeline Jeppe Revall Frisvad January 2013 Timeline on the programmability of the GPU 1995 2001 2003 2006 2009 2010 2012 m m n s

Method Handles Everywhere! Charles Oliver Nutter @headius Method Handles What are method

1 Handle: Extended handles helps to keep hands away from direct heat Extended handles

Compilers Recognizing Handles Alex Aiken Recognizing Handles Bad News There are no known

Compiler Construction Chapter 11 1 Compiler Construction Compiler Construction A New Compiler

Persistent Handles: approaches Ralph Bhme, Samba Team, SerNet 2018-06-08 Outline Persistent

Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple

Status of GPU offloading on Wayland Axel Davy FOSDEM 2014 Status of GPU offloading on Wayland

Motivation to Learn GPGPU Julius Parulek Why to Learn About GPU? Computational power of GPU vs.

While Loops Python While Loops Form of the while loop: while condition : Statement Block

Features Exclusively from Surface of wok Handles Matt finish - easy to Extended handles

Understanding GPU performance How to get peak FLOPS (GPU version) Kenjiro Taura 1 / 7 Contents

GPU peak performance vs. CPU Squeezing GPU performance Peak Double Precision FLOPS Peak Memory

UNIFIED MEMORY ON PASCAL AND VOLTA Nikolay Sakharnykh - May 10, 2017 1 HETEROGENEOUS

Advancements in V-Ray RT GPU Vlado Koylazov, CTO &amp; Co-founder Blagovest Taskov, RT GPU Team

HIGH-PERFORMANCE GPU VIDEO ENCODING ABHIJIT PATAIT SR. MANAGER, NVIDIA AGENDA GPU Video

Performance Evaluation of a Multithreaded GPU Using CUDA GPU architecture GeForce 8800 GPU

face of current global crisis osten olorunsola Chairman Energy Institute Nigeria 202005 Life

iPlots eXtreme Next-generation interactive graphics for analysis of large data Simon Urbanek

Cowen and Company 8 th Annual Global Transportation Conference September 9, 2015 Alan H. Shaw

Uniprocessor Performance not Scaling Performance (vs. VAX-11/780) 10000 20%/year Concurrency

Convergent synthesis of a linker-connected fluorescent ebselen-coumarin heterodimer Jim Kppers 1

Celiac Disease &amp; Non-Celiac Gluten Sensitivity Amandeep Shergill, M.D., M.S. H.S.

That is the Question? Gary S. Roubin MD PhD. Chairman IMC. CREST 2 Trial. Conflicts of Interest.

Quan%fying organics: COD &amp; BOD CTB3365x Introduc1on to water

Advancements in V-Ray RT GPU Vlado Koylazov, CTO & Co-founder Blagovest Taskov, RT GPU Team

Celiac Disease & Non-Celiac Gluten Sensitivity Amandeep Shergill, M.D., M.S. H.S.

Quan%fying organics: COD & BOD CTB3365x Introduc1on to water