Memory Allocation and Data Movement API Functions
Lecture 2.2 - Introduction to CUDA C
Accelerated Computing
Objective
To learn the basic API functions in CUDA host code
– Device Memory Allocation
– Host-Device Data Transfer
[Figure: Vector addition, conceptual view. Corresponding elements of vector A and vector B are summed to produce vector C: C[0] = A[0] + B[0], C[1] = A[1] + B[1], ..., C[N-1] = A[N-1] + B[N-1].]
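For comparison, a purely sequential CPU version of this computation is just a loop over the elements (a minimal sketch; the function name vecAdd_cpu is illustrative and not part of the lecture code):

// Compute the vector sum h_C = h_A + h_B on the host, one element at a time
void vecAdd_cpu(float *h_A, float *h_B, float *h_C, int n)
{
    for (int i = 0; i < n; i++)
        h_C[i] = h_A[i] + h_B[i];
}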
[Figure: Heterogeneous system. The CPU works on host memory while the GPU works on its own device memory, so data must be moved between the two.]
#include <cuda.h>

void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Part 1
    // Allocate device memory for A, B, and C
    // Copy A and B to device memory

    // Part 2
    // Kernel launch code – the device performs the actual vector addition

    // Part 3
    // Copy C from the device memory
    // Free device vectors
}
We will cover more memory types and more sophisticated memory models later.
[Figure: Partial overview of CUDA memories. The host sits beside the device grid; each thread has its own registers, threads are organized into blocks such as Block (0,0) and Block (0,1), and all blocks share the device global memory.]
cudaMalloc()
– Allocates an object in the device global memory
– Two parameters
  – Address of a pointer to the allocated object
  – Size of allocated object in terms of bytes

cudaFree()
– Frees object from device global memory
– One parameter
  – Pointer to freed object
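As a small sketch of how these two calls pair up in host code (the buffer name d_buf and the element count 256 are illustrative assumptions):

float *d_buf;                                       // device pointer, declared in host code
cudaMalloc((void **) &d_buf, 256 * sizeof(float));  // allocate 256 floats in device global memory
// ... use d_buf in kernels and memory copies ...
cudaFree(d_buf);                                    // release the allocation when done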
cudaMemcpy()
– Memory data transfer
– Requires four parameters
  – Pointer to destination
  – Pointer to source
  – Number of bytes copied
  – Type/Direction of transfer
– Transfer to device is asynchronous
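For example, moving the hypothetical d_buf allocation above to and from a host array h_buf (both names are illustrative) uses the same call with different direction constants:

cudaMemcpy(d_buf, h_buf, 256 * sizeof(float), cudaMemcpyHostToDevice);  // host to device
// ... launch kernels that read and write d_buf ...
cudaMemcpy(h_buf, d_buf, 256 * sizeof(float), cudaMemcpyDeviceToHost);  // device to host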
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    cudaMalloc((void **) &d_C, size);

    // Kernel invocation code – to be shown later

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
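The kernel itself is the subject of a later lecture; as a hedged sketch (the kernel name vecAddKernel and the block size of 256 threads are assumptions, not part of this lecture), the invocation that would replace the comment above could look like this:

// One thread per element; extra threads in the last block do nothing
__global__ void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;  // global thread index
    if (i < n) C[i] = A[i] + B[i];
}

// Inside vecAdd: launch enough 256-thread blocks to cover all n elements
vecAddKernel<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);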