CUDA Programming Model


  1. CUDA Programming Model. Ming Yang, Apr 5, 2016.
     Keywords: Thread, Block, Grid, Warp, Kernel, Occupancy, Register, Local memory, Shared memory, Global memory

  2. ???
     • What are the scheduling units in a Streaming Multiprocessor (SM)? Warps. How are they scheduled?
     • How is occupancy computed? Does it have anything to do with blocks, threads, registers, or shared memory? Yes: all of them.

  3. Thread hierarchy
     • [Figure: a 2×2 grid of blocks, each block a 4×2 arrangement of threads]
     • # of threads: (2*2) * (4*2) = 32
     • Launch configuration:
       dim3 dimGrid(2, 2, 1);
       dim3 dimBlock(4, 2, 1);
       vectorAdd<<<dimGrid, dimBlock>>>(a, b, c);
     • How are these threads assigned to the SMs? (A runnable sketch of this launch follows below.)
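A minimal sketch of the launch on slide 3. The kernel body and the assumption that a, b, c are device pointers to 32 floats are mine; the slide only shows the launch configuration.

```cuda
__global__ void vectorAdd(const float* a, const float* b, float* c) {
    // Flatten the 2D grid of 2D blocks into one global index.
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // 0..7
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // 0..3
    int i = y * gridDim.x * blockDim.x + x;          // 0..31
    c[i] = a[i] + b[i];
}

void launch(const float* a, const float* b, float* c) {
    // (2*2) blocks of (4*2) threads = 32 threads in total.
    dim3 dimGrid(2, 2, 1);
    dim3 dimBlock(4, 2, 1);
    vectorAdd<<<dimGrid, dimBlock>>>(a, b, c);
}
```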

  4. Thread Blocks Assignment
     • [Figure: blocks (0,0), (1,0), (0,1), (1,1) of a grid assigned to a Streaming Multiprocessor (SM)]
     • Threads are assigned to an SM at block granularity
     • Blocks in one grid can be assigned to different SMs
     • The SM manages/schedules thread execution. How? (See the device-query sketch below.)
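As a hedged aside (not from the slides), the runtime can report how many SMs a device has and how many threads each can hold, which bounds how many blocks can be resident at once:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("SMs:                %d\n", prop.multiProcessorCount);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Warp size:          %d\n", prop.warpSize);
    return 0;
}
```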

  5. Warps as Scheduling Units
     • Each block is executed as 32-thread warps (threads 0–31 form warp 0, threads 32–63 warp 1, threads 64–95 warp 2, …)
     • Warps are the scheduling units in an SM. How are they scheduled?
     • Threads in a warp execute in SIMT. What is SIMT (Single Instruction, Multiple Thread)? What about control divergence? (A small divergence example follows below.)
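A hedged illustration of control divergence (mine, not from the deck): when lanes of one warp take different branches, the hardware runs the two paths one after the other with inactive lanes masked off.

```cuda
__global__ void divergent(float* out) {
    int tid = threadIdx.x;
    if (tid % 2 == 0) {
        out[tid] = 2.0f * tid;   // even lanes execute; odd lanes sit idle
    } else {
        out[tid] = -1.0f;        // then odd lanes execute; even lanes sit idle
    }
    // Branching on warp boundaries instead, e.g. if ((tid / 32) % 2 == 0),
    // would cause no divergence: all 32 lanes of a warp agree.
}
```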

  6. Warps as Scheduling Units (cont.)
     • Warps are the scheduling units in an SM
     • [Figure: an SM holds a pool of warps (Warp 0 … Warp 63); over time (cycles) the scheduler picks ready warps to issue, e.g. Warp 0, Warp 2, Warp 1, Warp 3, …]

  7. Warps as Scheduling Units (cont.)
     • Threads in a warp execute in SIMT
     • [Figure: one control unit (PC, IR) drives multiple processing units, each with its own register file and ALU, all sharing memory I/O]

  8. Review
     • Threads are organized into blocks and grids
     • Threads are assigned to an SM at block granularity
     • Threads are scheduled in units of warps, which execute in SIMT fashion

  9. Occupancy
     • Occupancy = # of active warps / maximum number of resident warps per SM
     • Maximum number of resident warps per SM (by compute capability): 48 for 2.x; 64 for 3.0–5.3
     • Occupancy limiters:
       • Register usage
       • Shared memory usage
       • Block size

  10. Memory hierarchy
     • Per-thread local memory (private to each thread)
     • Per-block shared memory (shared by all threads in a block)
     • Global memory (shared by all blocks in the grid)
     • [Figure: threads, blocks, and the grid annotated with their corresponding memory spaces]
     (A sketch mapping these levels onto CUDA qualifiers follows below.)
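A sketch (mine) mapping the three levels above onto CUDA qualifiers. It assumes a single block of 64 threads; all names are illustrative.

```cuda
__device__ float g_scale = 2.0f;      // global memory: visible to the whole grid

__global__ void memorySpaces(const float* in, float* out) {
    __shared__ float tile[64];        // shared memory: one copy per block
    float x = in[threadIdx.x];        // register / per-thread local storage

    tile[threadIdx.x] = x;
    __syncthreads();                  // shared data is coherent only after a sync

    out[threadIdx.x] = tile[threadIdx.x] * g_scale;
}
```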

  11. Occupancy limiter: Register usage
     • Maximum number of 32-bit registers per thread block (by compute capability): 32 K for 2.x; 64 K for 3.0–5.2; 32 K for 5.3
     • Maximum number of resident threads per SM: 1536 for 2.x; 2048 for 3.0–5.3
     • Example 1 (capability 3.0):
       • Kernel uses 21 registers per thread
       • # of active threads = 64 K / 21 ≈ 3121
       • 3121 > 2048, so registers are not the limiter: occupancy is 100%

  12. Occupancy limiter: Register usage (cont.)
     • (Same register and thread limits as above; maximum resident warps per SM: 48 for 2.x, 64 for 3.0–5.3)
     • Example 2 (capability 3.0):
       • Kernel uses 64 registers per thread
       • # of active threads = 64 K / 64 = 1024
       • # of warps = 1024 / 32 = 32
       • Occupancy = 32 / 64 = 50%

  13. Occupancy limiter: Shared memory
     • Maximum shared memory per SM (by compute capability): 48 KB for 2.x–3.5; 112 KB for 3.7; 64 KB for 5.0; 96 KB for 5.2; 64 KB for 5.3
     • Example 1 (capability 3.0):
       • Kernel uses 16 bytes of shared memory per thread
       • # of active threads = 48 K / 16 = 3072
       • 3072 > 2048, so shared memory is not the limiter: occupancy is 100%

  14. Occupancy limiter: Shared memory (cont.)
     • Example 2 (capability 3.0):
       • Kernel uses 32 bytes of shared memory per thread
       • # of active threads = 48 K / 32 = 1536
       • # of warps = 1536 / 32 = 48
       • Occupancy = 48 / 64 = 75%

  15. Occupancy limiter: Block size
     • Maximum number of resident blocks per SM (by compute capability): 8 for 2.x; 16 for 3.x; 32 for 5.x
     • Capability 3.0, warp size 32:

       Block size | Active threads     | Active warps  | Occupancy
       32         | 32 * 16 = 512      | 512 / 32 = 16 | 16 / 64 = 25%
       64         | 1024               | 32            | 50%
       128        | 2048               | 64            | 100%
       192        | 3072 (capped 2048) | 64            | 100%
       256        | 4096 (capped 2048) | 64            | 100%

     (Instead of doing this arithmetic by hand, the runtime can be queried; see the sketch below.)
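The slide's arithmetic can also be delegated to the CUDA runtime via cudaOccupancyMaxActiveBlocksPerMultiprocessor (available since CUDA 6.5). In this sketch, myKernel is a placeholder of mine, and blockSize = 128 matches the table's first 100% row.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() {}   // placeholder kernel (assumption)

int main() {
    const int blockSize = 128;
    int numBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel,
                                                  blockSize, 0 /* dynamic smem */);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int activeWarps = numBlocks * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Predicted occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}
```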

  16. Occupancy
     • Do we want higher occupancy? Usually yes: with more threads resident, the latency of memory and arithmetic operations can be hidden.
     • Is occupancy a metric of performance? No. It is just one contributing factor.

  17. Reference:
     • http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
     • http://on-demand.gputechconf.com/gtc/2010/video/S12238-Better-Performance-at-Lower-Occupancy.mp4

  18. Review
     • Occupancy formula: # of active warps / maximum number of resident warps per SM
     • Occupancy limiters: registers, shared memory, block size
     • Occupancy is not equivalent to performance, but higher occupancy is usually still desirable

  19. Case study: cublasSgemm
     • Matrix multiplication of single-precision real numbers
     • SGEMM performs one of the matrix-matrix operations: C := alpha*op(A)*op(B) + beta*C (with alpha = 1.0 and beta = 0 in our case)
     • op(X) is one of op(X) = X or op(X) = X**T (transposed; always the latter in our case)
     • It is used by the fully-connected (fc) layer in Caffe (when the batch size is larger than 1)

  20. Reasons for case-studying cublasSgemm: sgemm_largek_lds64
     • it is the kernel used by cublasSgemm
     • it decreases fastest as the batch size increases
     • it is the only kernel I observed whose occupancy changes with different batch sizes

  21. Experiment
     • Use cublasSgemm:
       • Inputs: matrix A (M*K), matrix B (K*N)
       • Output: matrix C (M*N) = A*B
     • Variables are consistent with their usage in the fully-connected layers in Caffe:
       • M: batch size (2, 4, 8, …, 1024)
       • K: 9216 / 4096 / 4096
       • N: 4096 / 4096 / 1000
     (A hedged sketch of the call follows below.)
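A hedged sketch (mine) of the experiment's call: C (M×N) = A (M×K) · B (K×N) with alpha = 1 and beta = 0. The handle and the device buffers d_A, d_B, d_C are assumed to be set up elsewhere. cuBLAS treats all matrices as column-major; the slide suggests Caffe sometimes passes a transposed operand, but for simplicity this sketch uses CUBLAS_OP_N throughout.

```cuda
#include <cublas_v2.h>

void runSgemm(cublasHandle_t handle, int M, int N, int K,
              const float* d_A, const float* d_B, float* d_C) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                M, N, K,
                &alpha,
                d_A, M,   // A: M×K, leading dimension M
                d_B, K,   // B: K×N, leading dimension K
                &beta,
                d_C, M);  // C: M×N, leading dimension M
}
```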

  22. Results
     • [Figure: occupancy (0–1) and execution time (ms) vs. batch size M ∈ {2, 4, …, 1024}; plotted execution times: 1,085.0, 542.6, 269.5, 135.1, 96.6, 97.2, 53.3, 33.5, 40.5, 64.6 ms]
     • Kernels observed, with launch configurations:
       • maxwell_sgemm_128x128_nn <<<32*1*1, 256*1*1>>>
       • sgemm_largek_lds64 <<<64*1*8, 32*4*1>>>
       • maxwell_sgemm_128x64_nn <<<32*1*1, 128*1*1>>>
       • sgemm_largek_lds64 with different parameters <<<128*1*4, 16*16*1>>>

  23. Summary
     • Thread hierarchy
     • Streaming multiprocessor scheduling
     • Memory hierarchy
     • Occupancy
     • Case study on `cublasSgemm`

  24. References
     • (Coursera class) Heterogeneous Parallel Programming by Wen-mei W. Hwu (https://class.coursera.org/hetero-004)
     • http://docs.nvidia.com/cuda/cuda-c-programming-guide/
     • http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

  25. Backup slides (about streams and concurrency) follow. They are basically copied from http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf

  26. Streams
     • A stream is a sequence of operations that execute in issue-order on the GPU
     • Streams are the programming model used to effect concurrency: CUDA operations in different streams may run concurrently, and operations from different streams may be interleaved
     • Rules: a CUDA operation is dispatched from the engine queue if:
       • preceding calls in the same stream have completed,
       • preceding calls in the same queue have been dispatched, and
       • resources are available
     (A minimal two-stream sketch follows below.)
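A minimal two-stream sketch (mine, not from the webinar): work issued to different streams may overlap, subject to the dispatch rules above. myKernel, grid, block, and all buffers are illustrative placeholders.

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(const float* in, float* out);  // placeholder (assumption)

void twoStreams(float* h_in0, float* h_in1, float* d_in0, float* d_in1,
                float* d_out0, float* d_out1, size_t bytes,
                dim3 grid, dim3 block) {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Async copies overlap only if the host buffers are pinned (cudaMallocHost).
    cudaMemcpyAsync(d_in0, h_in0, bytes, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(d_in1, h_in1, bytes, cudaMemcpyHostToDevice, s1);

    myKernel<<<grid, block, 0, s0>>>(d_in0, d_out0);  // issued in stream s0
    myKernel<<<grid, block, 0, s1>>>(d_in1, d_out1);  // issued in stream s1

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```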

  27. Example [figure-only slide]

  28. Example [figure-only slide]

