GPU COMPUTING RESEARCH WITH OPENCL Studying Future Workloads and - PowerPoint PPT Presentation

GPU COMPUTING RESEARCH WITH OPENCL Studying Future Workloads and Devices Perhaad Mistry, Dana Schaa, Enqiang Sun, Rafael Ubal, Yash Ukidave, David Kaeli Dept of Electrical and Computer Engineering Northeastern University CCIS Class - CS 6240 1 | CCIS Class | Nov 30, 2011

TOPICS  Introduction to OpenCL and GPU Computing  Speeded Up Robust Features  HAPTIC - OpenCL Heterogeneous Application Profiling & Introspection Capabilities 2 | CCIS Class | Nov 30, 2011

MOTIVATION TO STUDY GPU COMPUTING More than 65% of Americans played a video game in 2009 – economies of scale Manufacturers include NVIDIA, AMD/ATI, IBM-Cell Very competitive commodities market 3 | CCIS Class | Nov 30, 2011

MOTIVATION TO STUDY GPU COMPUTING Theoretical Peaks Don’t matter Much How do you write an application that performs well ?? 4 | CCIS Class | Nov 30, 2011

GPU COMPUTING - A wide range of GPU applications � 3D image analysis � Protein folding � Film � Adaptive radiation therapy � Quantum chemistry � Financial � Acoustics � Ray tracing � Languages � Astronomy � Radar � GIS � Audio � Reservoir simulation � Holographics cinema � Automobile vision � Robotic vision / AI � Machine learning � Bioinfomatics � Robotic surgery � Mathematics research � Biological simulation � Satellite data analysis � Military � Broadcast � Seismic imaging � Mine planning � Cellular automata � Surgery simulation � Molecular dynamics � Fluid dynamics � Surveillance � MRI reconstruction � Computer vision � Ultrasound � Multispectral imaging � Cryptography � Video conferencing � N-body simulation � CT reconstruction � Telescope � Network processing � Data mining � Video � Neural network � Digital cinema / projections � Visualization � Oceanographic research � Electromagnetic simulation � Wireless � Optical inspection � Equity training � X-Ray � Particle physics 5 | CCIS Class | Nov 30, 2011

CPU VS GPU ARCHITECTURES Irregular data accesses Regular data accesses Focus on per thread performance More ALUs and massively parallel Space devoted to control logic instead of ALU Throughput oriented Efficiently handle control flow intensive workloads Multi level caches used to hide latency 6 | CCIS Class | Nov 30, 2011

MODERN GPGPU ARCHITECTURE  Generic many core GPU – Less space devoted to control logic and caches – Large register files to support multiple thread contexts  Low latency hardware managed thread switching  Large number of ALU per “core” with small user managed cache per core On Board System Memory  Memory bus optimized for bandwidth High Bandwidth bus to ALUs – ~150 GBPS bandwidth allows us to service a large number of ALUs simultaneously ALUs Simple Cache 7 | CCIS Class | Nov 30, 2011

NVIDIA GPU COMPUTE ARCHITECTURE  Compute Unified Device Architecture  Hierarchical architecture  A device contains many multiprocessors  Scalar “cuda cores” per multiprocessor – 32 for Fermi CUDA Core  Single instruction issue unit per Dispatch Port multiprocessor Operand Collector  Many memory spaces FP Int Unit  GTX 480 - Compute 2.0 capability Unit – 15 Streaming Multiprocessors (SMs) Result Queue – 1 SM features 32 CUDA processors – 480 CUDA processors 8 | CCIS Class | Nov 30, 2011

GPU MEMORY ARCHITECTURE  Device Memory (GDDR) – Large memory with a high bandwidth link to multiprocessor  Registers on chip (~16k) – Large number of registers enable low overhead context switching and massive multithreading  Shared memory ( on chip) – Shared between scalar cores – Low latency and banked  Constant and texture memory – Read only and cached 9 | CCIS Class | Nov 30, 2011

A “TRANSPARENTLY” SCALABLE ARCHITECTURE The programming model maps easily to underlying architecture Scalable programming model Program consists of independent blocks of threads Same program will be scalable across devices 10 | CCIS Class | Nov 30, 2011

AN OPTIMAL GPGPU PROGRAM  From the discussion on hardware we see that an ideal kernel for a GPU: – Has thousands of independent pieces of work  Uses all available compute units  Allows interleaving for latency hiding – Is amenable to instruction stream sharing  Maps to SIMD execution by preventing divergence between work items – Has high arithmetic intensity  Ratio of math operations to memory access is high  Not limited by memory bandwidth  Note that these caveats apply to all GPUs 11 | CCIS Class | Nov 30, 2011

OPENCL – THE FUTURE FOR MANY-CORE COMPUTING  OpenCL (Open Computing Language) released in 2008 - Developed by Khronos Group – a non-profit  A framework similar to CUDA for writing programs that execute on heterogeneous systems  Allows CPU and GPU to work together for faster and more efficient processing  Modeled as four parts: – Platform Model – Execution Model – Memory Model – Programming Model  Kernels — execute on heterogeneous devices – Same kernel on multiple devices such as CPUs, GPUs, DSPs, FPGAs, etc 12 | CCIS Class | Nov 30, 2011

OPENCL – CONFORMANT COMPANIES Over 300+ OpenCL 1.1 Compliant Devices Altera, TI coming up … OpenCL 1.2 announced at SC 11 13 | CCIS Class | Nov 30, 2011

OPENCL - THE BIG PICTURE 14 | CCIS Class | Nov 30, 2011

GPU MEMORY MODEL IN OPENCL  For both AMD, Nvidia GPUs a subset of hardware memory exposed in OpenCL Private Private Private Private Memory Memory Memory Memory  Configurable shared memory is usable as local memory Workitem 1 Workitem 1 Workitem 1 Workitem 1 – Local memory used to share data between Compute Unit 1 Compute Unit N items of a work group at lower latency than global memory  Private memory utilizes registers per work item Local Memory Local Memory Global / Constant Memory Data Cache Compute Device Global Memory Compute Device Memory 15 | CCIS Class | Nov 30, 2011

OPENCL EXAMPLE - BASIC MATRIX MULTIPLICATION  Non-blocking matrix multiplication – Doesn’t use local memory  Each element of matrix reads its own data independently  Serial matrix multiplication for(int i = 0; i < Ha; i++) for(int j = 0; j < Wb; j++){ c[i][j] = 0; for(int k = 0; k < Wa; k++) c[i][j] += a[i][k] + b[k][j] }  Reuse code from image rotation – Create context, command queues and compile program – Only need one more input memory object for 2 nd matrix 16 | CCIS Class | Nov 30, 2011

SIMPLE MATRIX MULTIPLICATION __kernel void simpleMultiply( Wb __global float* c, int Wa, int Wb, B __global float* a, __global float* b) { Hb col //Get global position in Y direction int row = get_global_id(1); //Get global position in X direction int col = get_global_id(0); A C float sum = 0.0f; row //Calculate result of one element for (int i = 0; i < Wa; i++) { Ha sum += a[row*Wa+i] * b[i*Wb+col]; } c[row*Wb+col] = sum; } Wa Wb 17 | CCIS Class | Nov 30, 2011

STEP0: INITIALIZE DEVICE  Declare context Query Platform  Choose a device from context Platform  Using device and context create a command queue Layer Query Devices cl_context myctx = clCreateContextFromType ( Command Queue 0, CL_DEVICE_TYPE_GPU, NULL, NULL, &ciErrNum); Create Buffers ciErrNum = clGetDeviceIDs (0, Compiler CL_DEVICE_TYPE_GPU, Compile Program 1, &device, cl_uint *num_devices) Compile Kernel cl_commandqueue myqueue ; Runtime myqueue = clCreateCommandQueue ( Set Arguments Layer myctx, device, 0, &ciErrNum); Execute Kernel 18 | CCIS Class | Nov 30, 2011

STEP1: CREATE BUFFERS  Create buffers on device Query Platform  Input data is read-only Platform Layer Query Devices  Output data is write-only cl_mem d_a = clCreateBuffer ( myctx, Command Queue CL_MEM_READ_ONLY, mem_size, NULL, &ciErrNum); Create Buffers cl_mem d_c = clCreateBuffer ( myctx, Compiler Compile Program CL_MEM_WRITE_ONLY, mem_size, NULL, &ciErrNum); Compile Kernel  Transfer input data to the device Runtime ciErrNum = clEnqueueWriteBuffer ( Set Arguments Layer myqueue , d_a, CL_TRUE, 0, mem_size, (void *)src_image, 0, NULL, NULL) Execute Kernel 19 | CCIS Class | Nov 30, 2011

STEP2: BUILD PROGRAM, SELECT KERNEL // create the program Query Platform cl_program myprog = clCreateProgramWithSource Platform ( myctx,1, (const char **)&source, Layer Query Devices &program_length, &ciErrNum); Command Queue // build the program ciErrNum = clBuildProgram ( myprog, 0, Create Buffers NULL, NULL, NULL, NULL); Compiler Compile Program //Use the “image_rotate” function as the kernel Compile Kernel cl_kernel mykernel = clCreateKernel ( Runtime myprog , “image_rotate” , error_code) Set Arguments Layer Execute Kernel 20 | CCIS Class | Nov 30, 2011

STEP3: SET ARGUMENTS, ENQUEUE KERNEL Query Platform // Set Arguments Platform clSetKernelArg (mykernel, 0, sizeof(cl_mem), ( void *)&d_a); Layer Query Devices clSetKernelArg (mykernel, 1, sizeof(cl_mem), ( void *)&d_b); clSetKernelArg (mykernel, 2, sizeof(cl_int), ( void *)&W); … Command Queue Create Buffers //Set local and global workgroup sizes Compiler size_t localws[2] = {16,16} ; Compile Program size_t globalws[2] = {W, H};//Assume divisible by 16 Compile Kernel // execute kernel Runtime Set Arguments clEnqueueNDRangeKernel( Layer myqueue , myKernel, 2, 0, globalws, localws, Execute Kernel 0, NULL, NULL); 21 | CCIS Class | Nov 30, 2011

GPU COMPUTING RESEARCH WITH OPENCL Studying Future Workloads and - PowerPoint PPT Presentation

GPU COMPUTING RESEARCH WITH OPENCL Studying Future Workloads and Devices Perhaad Mistry, Dana Schaa, Enqiang Sun, Rafael Ubal, Yash Ukidave, David Kaeli Dept of Electrical and Computer Engineering Northeastern University CCIS Class - CS 6240 1

OpenCL Kernel Compilation Slides taken from Hands On OpenCL by Simon McIntosh-Smith, Tom Deakin,

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and

Investigation of the OpenCL support in the GeantV's Vectorized Geometry Gabor Biro 22.09.2014.

The OpenCL C++ API Slides taken from Hands On OpenCL by Simon McIntosh-Smith, Tom Deakin, James

Introduction to OpenCL David Black-Schaffer david.black-schaffer@it.uu.se 1 Disclaimer I

OpenCL on FPGAs Contains material from Hands On OpenCL by Simon McIntosh-Smith, Tom Deakin, James

Synchronization in OpenCL Slides taken from Hands On OpenCL by Simon McIntosh-Smith, Tom Deakin,

Scalable Multi-Precision Simulation of Spiking Neural Networks on GPU with OpenCL Dmitri

GPU Parallel Implementation of The Approximate K-SVD Algorithm Using OpenCL Paul Irofti 1 Bogdan

GPU Architecture and chitecture and GPU Ar The good The good The bad The bad

Status of GPU offloading on Wayland Axel Davy FOSDEM 2014 Status of GPU offloading on Wayland

Motivation to Learn GPGPU Julius Parulek Why to Learn About GPU? Computational power of GPU vs.

Han Dong Dibyajyoti Ghosh Fahad Zafar Shujia Zhou Motivation Explore OpenCL in accelerating

Simulation of OpenCL and APUs on Multi2Sim 4.1 Rafael Ubal, David Kaeli Conference title 1

OpenCL Application on Mobile GPU: A Case Study Elena Barreras, Juan M. Jimenez, Arian Maghazeh,

SymEngine Symbolic Executjon of OpenCL Kernels Alberto Magni Optjmize code for GPUs Optjmize

1 Learning Objectives Upon completion of this module, learners will be able to: 1. Identify the

Chapter 7: Clinical Treatment Planning in External Photon Beam Radiotherapy Set of 232 slides

Digital Image Processing (CS/ECE 545) Lecture 11: Geometric Operations, Comparing Images and Future

Status of the Next-Generation Supercomputer Project YOKOKAWA, Mitsuo Next-Generation

How to build How to build a successful research group? a successful research group? Prof. M

Similarity Metrics and Efficient Optimization for Simultaneous Registration Supplementary Material

Ionoacoustic range monitoring for proton therapy W. Assmann Faculty of Physics Department for

Ra Radiation T Therapy Kirk Bertsche, Accuray, Inc. For the 2018 Annual Meeting of the