Module 3.1 - CUDA Parallelism Model: Kernel-Based SPMD Parallel Programming - PowerPoint PPT Presentation


GPU Teaching Kit | Accelerated Computing. Module 3.1 - CUDA Parallelism Model: Kernel-Based SPMD Parallel Programming. Objective: to learn the basic concepts involved in a simple CUDA kernel function - declaration, built-in variables, and thread index to data index mapping.


SLIDE 1

Accelerated Computing

GPU Teaching Kit

Kernel-Based SPMD Parallel Programming

Module 3.1 - CUDA Parallelism Model

SLIDE 2

Objective

– To learn the basic concepts involved in a simple CUDA kernel function

– Declaration
– Built-in variables
– Thread index to data index mapping

SLIDE 3

Example: Vector Addition Kernel

// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

Device Code

SLIDE 4

Example: Vector Addition Kernel Launch (Host Code)

void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    // d_A, d_B, d_C allocations and copies omitted
    // Run ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
}

Host Code


The ceiling function makes sure that there are enough threads to cover all elements.
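For completeness, here is a sketch of the allocation and copy steps the slide omits, using the standard cudaMalloc/cudaMemcpy/cudaFree runtime calls (error checking is elided for brevity; a real program should check every CUDA API return value):

```cuda
#include <math.h>

void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Allocate device (global memory) copies of A, B, and C
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Copy the input vectors from host to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);

    // Copy the result back and release device memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
```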

SLIDE 5

More on Kernel Launch (Host Code)

void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    dim3 DimGrid((n-1)/256 + 1, 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);
}


Host Code

This is an equivalent way to express the ceiling function.

SLIDE 6

Kernel execution in a nutshell

__host__
void vecAdd(…)
{
    dim3 DimGrid(ceil(n/256.0), 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid,DimBlock>>>(d_A, d_B, d_C, n);
}

[Diagram: the launch creates a Grid of blocks Blk 0 • • • Blk N-1, which the GPU schedules onto its processors M0 • • • Mk, all with access to device RAM.]

__global__
void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

SLIDE 7

More on CUDA Function Declarations

– __global__ defines a kernel function
– Each “__” consists of two underscore characters
– A kernel function must return void
– __device__ and __host__ can be used together
– __host__ is optional if used alone


                                 Executed on the:   Only callable from the:
__device__ float DeviceFunc()    device             device
__global__ void  KernelFunc()    device             host
__host__   float HostFunc()      host               host
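The rules above can be sketched in a few declarations. A minimal example (function names are made up for illustration):

```cuda
// Compiled for BOTH host and device: __device__ and __host__ combined.
__host__ __device__ float square(float x) { return x * x; }

// Device-only helper: callable from kernels and other device functions.
__device__ float plusOne(float x) { return x + 1.0f; }

// Kernel: must return void, launched from host, executed on device.
__global__ void squareKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = plusOne(square(data[i]));  // both helpers usable here
}

// Plain function: __host__ is implicit when no qualifier is given.
float hostSquare(float x) { return square(x); }     // square() works here too
```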

SLIDE 8

Compiling A CUDA Program

Integrated C programs with CUDA extensions go through the NVCC compiler, which splits them in two: host code, passed to the host C compiler/linker, and device code (PTX), passed to the device just-in-time compiler. Together they target a heterogeneous computing platform with CPUs, GPUs, etc.
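In practice this whole split-and-compile flow is driven by a single nvcc invocation (the file names below are illustrative):

```shell
# One-step build: nvcc separates host and device code, forwards the
# host portion to the system C/C++ compiler, compiles the device
# portion, and links the result.
nvcc vecAdd.cu -o vecAdd

# Generate only the device code's PTX, the intermediate form that the
# driver's just-in-time compiler finishes at run time.
nvcc -ptx vecAdd.cu -o vecAdd.ptx
```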

SLIDE 9

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.