
2110412 Parallel Comp Arch CUDA: Parallel Programming on GPU

Natawut Nupairoj, Ph.D. Department of Computer Engineering, Chulalongkorn University

Outline

- Overview
- Parallel Computing with GPU
- Introduction to CUDA
- CUDA Thread Model
- CUDA Memory Hierarchy and Memory Spaces
- CUDA Synchronization

Overview

Modern graphics accelerators are called GPUs (Graphics Processing Units).

How GPUs speed up graphics:

- Pipelining: similar to pipelining in CPUs.
- CPUs like the Pentium 4 have about 20 pipeline stages; GPUs typically have 600-800 stages, since there are very few branches and most of the functionality is fixed.

Source: Leigh, “Graphics Hardware Architecture & Miscellaneous Real Time Special Effects”

Overview – 2D Primitive - BitBLT

Source: Wikipedia

slide-2
SLIDE 2

Overview – Chroma Key

Overview – Parallelizing

Process the data in parallel within the GPU; in essence, multiple pipelines running in parallel.

The basic model is SIMD (Single Instruction, Multiple Data), i.e., the same graphics algorithm, but lots of polygons to process.

Source: Leigh, “Graphics Hardware Architecture & Miscellaneous Real Time Special Effects”

SIMD (revisited)

One control unit tells all processing elements to compute at the same time.

[Diagram: a single control unit (Ctrl) broadcasts one instruction stream (I) to multiple processing elements (P), each with its own memory (M) and data stream (D).]

Examples: TMC/CM-1, Maspar MP-1, Modern GPU

Modern GPUs are More General Purpose – Lots of ALUs


The nVidia G80 GPU

► 128 streaming floating-point processors @ 1.5 GHz
► 1.5 GB shared RAM with 86 GB/s bandwidth
► 500 GFLOPS on one chip (single precision)

Source: Kirk, “Parallel Computing: What has changed lately?”

Programming Interface

Interface to the GPU is via nVidia's proprietary API – CUDA (very C-like).

Looks a lot like UPC (simplified CUDA below):

void AddVectors(float *r, float *a, float *b)
{
    int tx = threadIdx.x;     // ~processor rank
    r[tx] = a[tx] + b[tx];    // executed in parallel
}
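In full (non-simplified) CUDA the kernel also needs the __global__ qualifier and an explicit launch configuration. A minimal sketch of what that might look like; the device pointers r_d, a_d, b_d and the element count N are assumptions, set up with the allocation and copy calls covered later:

__global__ void AddVectors(float *r, float *a, float *b)
{
    int tx = threadIdx.x;     // ~processor rank
    r[tx] = a[tx] + b[tx];    // executed in parallel
}

// launch one block of N threads, one thread per element
// (r_d, a_d, b_d are assumed device pointers)
AddVectors<<<1, N>>>(r_d, a_d, b_d);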

Still A Specialized Processor

Very Efficient For:
- Fast parallel floating-point processing
- Single Instruction Multiple Data operations
- High computation per memory access

Not As Efficient For:
- Double precision (need to test performance)
- Logical operations on integer data
- Branching-intensive operations
- Random-access, memory-intensive operations


Parallel Programming with CUDA

Source: CUDA Tutorial Workshop, ISC-2009

SETI@home and CUDA

Runs 5x to 10x faster than the CPU-only version

Introduction to CUDA

nVidia introduced CUDA in November 2006
- Utilizes the parallel computing engine in the GPU to solve complex computational problems

CUDA is industry-standard C
- Subset of C with extensions
- Write a program for one thread
- Instantiate it on many parallel threads
- Familiar programming model and language

CUDA is a scalable parallel programming model
- Program runs on any number of processors without recompiling


CUDA Concept

Co-execution between Host (CPU) and Device (GPU)
- Parallel portions are executed on the device as kernels
- One kernel is executed at a time
- Many threads execute each kernel
- All threads run the same code
- Each thread has an ID that it uses to compute memory addresses and make control decisions

Serial program with parallel kernels, all in C
- Serial C code executes in a CPU thread
- Parallel kernel C code executes in thread blocks across multiple processing elements

CUDA Development: nvcc

Normal C Program

void VecAdd_CPU(float* A, float* B, float* C, int N)
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

int main()
{
    VecAdd_CPU(A, B, C, N);
}

CUDA Program

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation: 1 block of N threads
    VecAdd<<<1, N>>>(A, B, C);
}


Source: High Performance Computing with CUDA, DoD HPCMP: 2009

CUDA Thread Model

CUDA threads can be:
- one-dimensional
- two-dimensional
- three-dimensional

Thread Hierarchy:
- Grid (2-D)
- Block (3-D)
- Thread

Calling CUDA Kernel

Modified C function call syntax:

    kernel<<<dim3 dG, dim3 dB>>>(…)

Execution Configuration (“<<< >>>”):

dG - dimension and size of grid in blocks
- Two-dimensional: x and y
- Blocks launched in the grid: dG.x * dG.y

dB - dimension and size of blocks in threads:
- Three-dimensional: x, y, and z
- Threads per block: dB.x * dB.y * dB.z

Unspecified dim3 fields initialize to 1.
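As a concrete illustration (not from the original slides), a launch covering n elements with 256-thread blocks could be configured like this; the kernel name and n are assumed:

int n = 1000000;                    // assumed element count
dim3 dB(256);                       // threads per block; dB.y and dB.z default to 1
dim3 dG((n + dB.x - 1) / dB.x);     // enough blocks to cover n elements; dG.y defaults to 1
kernel<<<dG, dB>>>(/* args */);     // launches dG.x blocks of dB.x threads each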

Example: Adding 2-D Matrix

// Kernel definition
__global__ void MatAdd(float A[M][N], float B[M][N], float C[M][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // Kernel invocation: one block of M x N threads
    dim3 dimBlock(M, N);
    MatAdd<<<1, dimBlock>>>(A, B, C);
}


CUDA Built-In Device Variables

All __global__ and __device__ functions have access to these automatically defined variables:

dim3 gridDim;
- Dimensions of the grid in blocks (at most 2-D)

dim3 blockDim;
- Dimensions of the block in threads

uint3 blockIdx;
- Block index within the grid

uint3 threadIdx;
- Thread index within the block
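Taken together, these variables let each thread compute a unique global index. A small sketch (assumed kernel and parameter names, not from the slides) for the 2-D case:

__global__ void globalIndex2D(int *out)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column of this thread
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row of this thread
    int width = gridDim.x * blockDim.x;             // grid width in threads
    out[y * width + x] = y * width + x;             // unique global index per thread
}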

Example: Adding 2-D Matrix

// Kernel definition
__global__ void MatAdd(float A[M][N], float B[M][N], float C[M][N])
{
    int i = blockIdx.x;
    int j = threadIdx.x;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // Kernel invocation: M blocks of N threads
    MatAdd<<<M, N>>>(A, B, C);
}

Example: Adding 2-D Matrix

// Kernel definition
__global__ void MatAdd(float A[M][N], float B[M][N], float C[M][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < M && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // Kernel invocation: 16x16 blocks, enough to cover an M x N matrix
    dim3 dimBlock(16, 16);
    dim3 dimGrid((M + dimBlock.x - 1) / dimBlock.x,
                 (N + dimBlock.y - 1) / dimBlock.y);
    MatAdd<<<dimGrid, dimBlock>>>(A, B, C);
}

Function Qualifiers

Kernels are designated by a function qualifier:

__global__
- Function called from host and executed on device
- Must return void

Other CUDA function qualifiers:

__device__
- Function called from device and run on device
- Cannot be called from host code
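A short sketch (assumed names, not from the slides) showing a __device__ helper called from a __global__ kernel:

__device__ float square(float x)       // runs on device, callable only from device code
{
    return x * x;
}

__global__ void squareAll(float *a, int N)   // callable from host, returns void
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        a[i] = square(a[i]);
}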


Exercise

int main()
{
    ...
    kernel<<<3, 5>>>( d_a );
    ...
}

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}

Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}

Output: 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}

Output: 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4

Incremental Array Example

CPU Program:

void inc_cpu(int *a, int N)
{
    int idx;
    for (idx = 0; idx < N; idx++)
        a[idx] = a[idx] + 1;
}

int main()
{
    ...
    inc_cpu(a, N);
    ...
}

CUDA Program:

__global__ void inc_gpu(int *a_d, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a_d[idx] = a_d[idx] + 1;
}

int main()
{
    ...
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N/(float)blocksize));
    inc_gpu<<<dimGrid, dimBlock>>>(a_d, N);
    ...
}

Incremental Array Example

Increment an N-element vector a by a scalar b. Let's assume N=16 and blockDim=4 -> 4 blocks.

blockIdx.x=0  blockDim.x=4  threadIdx.x=0,1,2,3  idx=0,1,2,3
blockIdx.x=1  blockDim.x=4  threadIdx.x=0,1,2,3  idx=4,5,6,7
blockIdx.x=2  blockDim.x=4  threadIdx.x=0,1,2,3  idx=8,9,10,11
blockIdx.x=3  blockDim.x=4  threadIdx.x=0,1,2,3  idx=12,13,14,15

int idx = blockDim.x * blockIdx.x + threadIdx.x;
maps from the local index threadIdx to a global index.

NB: blockDim should be bigger than 4 in real code; this is just an example.


Note on CUDA Kernel

Kernels are C functions with some restrictions:
- Cannot access host memory
- Must have void return type
- No variable number of arguments (“varargs”)
- Not recursive
- No static variables

Function arguments are automatically copied from host to device.
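Because arguments are copied automatically, scalars and small structs can simply be passed by value at launch, with no explicit transfer. A sketch under assumed names:

struct Params { float scale; float offset; };

// p is copied from host to device automatically at kernel launch
__global__ void transform(float *a, int N, Params p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        a[i] = a[i] * p.scale + p.offset;
}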

CUDA Memory Hierarchy

- Each thread has private per-thread local memory
- All threads in a block have per-block shared memory
- All threads can access shared global memory

Source: High Performance Computing with CUDA, DoD HPCMP: 2009

CUDA Host/Device Memory Spaces

- “Local” memory resides in device DRAM
- Use registers and shared memory to minimize local memory use
- Host can read and write global memory but not shared memory

Source: High Performance Computing with CUDA, DoD HPCMP: 2009


Memory Spaces

CPU and GPU have separate memory spaces:
- Data is moved across the PCIe bus
- Use functions to allocate/set/copy memory on the GPU
- Very similar to corresponding C functions

Host (CPU) manages device (GPU) memory

cudaMalloc(void **pointer, size_t nbytes)
cudaMemset(void *pointer, int value, size_t count)
cudaFree(void *pointer)

int n = 1024;
int nbytes = 1024*sizeof(int);
int *a_d = 0;
cudaMalloc( (void**)&a_d, nbytes );
cudaMemset( a_d, 0, nbytes );
cudaFree(a_d);

Host / Device Data Copies

cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);

- direction specifies the locations (host or device) of src and dst
- Blocks the CPU thread: returns after the copy is complete
- Doesn't start copying until previous CUDA calls complete

enum cudaMemcpyKind:
- cudaMemcpyHostToDevice
- cudaMemcpyDeviceToHost
- cudaMemcpyDeviceToDevice

#include <stdlib.h>
#include <assert.h>

int main(void)
{
    float *a_h, *b_h;    // host data
    float *a_d, *b_d;    // device data
    int N = 14, nBytes, i;

    nBytes = N*sizeof(float);
    a_h = (float *)malloc(nBytes);
    b_h = (float *)malloc(nBytes);
    cudaMalloc((void **) &a_d, nBytes);
    cudaMalloc((void **) &b_d, nBytes);

    for (i=0; i<N; i++)
        a_h[i] = 100.f + i;

    cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);
    cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

    for (i=0; i<N; i++)
        assert( a_h[i] == b_h[i] );

    free(a_h); free(b_h);
    cudaFree(a_d); cudaFree(b_d);
    return 0;
}

Host Synchronization

All kernel launches are asynchronous:
- Control returns to the CPU immediately
- The kernel starts executing once all previous CUDA calls have completed

Memcopies are synchronous:
- Control returns to the CPU once the copy is complete
- The copy starts once all previous CUDA calls have completed

cudaThreadSynchronize()
- Blocks until all previous CUDA calls complete

Asynchronous CUDA calls provide:
- Non-blocking memcopies
- Ability to overlap memcopies and kernel execution
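Because launches are asynchronous, an elapsed-time measurement must synchronize before stopping the clock. A sketch using the CUDA event API of this era; the kernel name and launch configuration are assumed from the earlier incremental-array example:

cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                 // record in the default stream
inc_gpu<<<dimGrid, dimBlock>>>(a_d, N);    // asynchronous launch
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                // block until the kernel and stop event complete
cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);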


Host Synchronization Example

…
// copy data from host to device
cudaMemcpy(a_d, a_h, numBytes, cudaMemcpyHostToDevice);

// execute the kernel
inc_gpu<<<ceil(N/(float)blocksize), blocksize>>>(a_d, N);

// run independent CPU code
run_cpu_stuff();

// copy data from device back to host
cudaMemcpy(a_h, a_d, numBytes, cudaMemcpyDeviceToHost);
…

GPU Thread Synchronization

void __syncthreads();

Synchronizes all threads in a block:
- Generates a barrier synchronization instruction
- No thread can pass this barrier until all threads in the block reach it
- Used to avoid RAW / WAR / WAW hazards when accessing shared memory
- Allowed in conditional code only if the conditional is uniform across the entire thread block
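A classic use (a sketch, not from the slides): staging data in shared memory and synchronizing before reading what other threads wrote, here reversing the elements within each block. Assumes blockDim.x is 256 and the array holds exactly gridDim.x * blockDim.x elements:

__global__ void reverseBlock(int *d)
{
    __shared__ int s[256];                 // one block's worth of data
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    s[t] = d[i];
    __syncthreads();                       // all writes to s[] finish before any reads
    d[i] = s[blockDim.x - 1 - t];          // read an element written by another thread
}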

CUDA Shared Memory

__device__
- Stored in global memory (large, high latency, no cache)
- Allocated with cudaMalloc (__device__ qualifier implied)
- Accessible by all threads
- Lifetime: application

__shared__
- Stored in on-chip shared memory (very low latency)
- Specified by execution configuration or at compile time
- Accessible by all threads in the same thread block
- Lifetime: thread block

Unqualified variables:
- Scalars and built-in vector types are stored in registers
- Arrays may be in registers or local memory

Using Shared Memory

Size known at compile time:

__global__ void kernel(…)
{
    …
    __shared__ float sData[256];
    …
}

int main(void)
{
    …
    kernel<<<nBlocks, blockSize>>>(…);
    …
}

Size known at kernel launch:

__global__ void kernel(…)
{
    …
    extern __shared__ float sData[];
    …
}

int main(void)
{
    …
    smBytes = blockSize*sizeof(float);
    kernel<<<nBlocks, blockSize, smBytes>>>(…);
    …
}


Example: Matrix Multiplication version 1

Example: Matrix Multiplication version 2
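The code for these two examples is not preserved in this transcript. Below is a minimal sketch of what the two versions typically look like: version 1 is naive and reads everything from global memory, while version 2 tiles the computation through __shared__ memory with __syncthreads(). All names and the tile width are assumptions; version 2 assumes N is a multiple of TILE:

#define TILE 16

// Version 1: naive (C = A * B for square N x N matrices, row-major)
__global__ void MatMul1(float *A, float *B, float *C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)
            sum += A[row*N + k] * B[k*N + col];   // every operand fetched from global memory
        C[row*N + col] = sum;
    }
}

// Version 2: tiled, staging sub-blocks of A and B in shared memory
__global__ void MatMul2(float *A, float *B, float *C, int N)
{
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int m = 0; m < N / TILE; m++) {
        sA[threadIdx.y][threadIdx.x] = A[row*N + m*TILE + threadIdx.x];
        sB[threadIdx.y][threadIdx.x] = B[(m*TILE + threadIdx.y)*N + col];
        __syncthreads();                          // tile fully loaded before use
        for (int k = 0; k < TILE; k++)
            sum += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();                          // done reading before the next load
    }
    C[row*N + col] = sum;
}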

How to Build CUDA on Windows XP

Requirements for building a CUDA program:
- CUDA software (available at no cost from http://www.nvidia.com/cuda)
  - CUDA toolkit
  - CUDA SDK
- Microsoft Visual Studio 2005 or 2008, or the corresponding versions of Microsoft Visual C++ Express
- CUDA VS Wizard (http://sourceforge.net/projects/cudavswizard/)

Requirements for running CUDA:
- Using the emulator in the SDK (EmuDebug / EmuRelease), or
- A CUDA-enabled GPU with a device driver (version 185.xx+)

See “CUDA Getting Started” for more details; a command-line build with nvcc is sketched below.
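For reference, outside Visual Studio the same program can be built directly with the toolkit's nvcc compiler driver; the file name here is an assumption:

nvcc -o vecadd vecadd.cu

and, in the CUDA 2.x era, with device emulation mode so no GPU is required:

nvcc -deviceemu -o vecadd vecadd.cu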

Assignment

Write a CUDA program for calculating PI.

- You must measure the elapsed time for the calculation
- This is an individual project
- Due date: 21 September 2010 at 18:00
- How to submit: send email to “natawut.n@chula.ac.th”
- Note: I will use the timestamp on your email