Computer Graphics WS07/08 – HW-Shading
Parallel Programming with CUDA
Hendrik Lensch
Overview
- So far:
– Introduction to CUDA
– GPGPU via CUDA (general-purpose computing on the GPU)
– Block matrix-matrix multiplication
- Today:
– Some parallel programming principles
– Parallel vector reduction
– Parallel prefix sum (scan) calculation
- Next:
– No lectures on Monday
– Input/output devices
Resources
- Where to find CUDA and the documentation?
– http://www.nvidia.com/object/cuda_home.html
- Lecture on parallel programming on the GPU by David Kirk and Wen-mei W. Hwu (most of the following slides are copied from this course)
– http://courses.ece.uiuc.edu/ece498/al1/Syllabus.html
- On the Parallel Prefix Sum (Scan) algorithm
– http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/scan/doc/scan.pdf
GeForce 8800
- 16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU
[Block diagram: the host and input assembler feed a thread execution manager; the multiprocessors with their parallel data caches, texture units and load/store units connect to global memory]
CUDA Highlights: On-Chip Shared Memory
- CUDA enables access to a parallel on-chip shared memory for efficient inter-thread data sharing
– Big memory bandwidth savings
[Diagram: two multiprocessors, each with control logic, cache, ALUs and a shared memory holding data elements d0–d3 and d4–d7, both attached to DRAM]
Global, Constant, and Texture Memories (Long Latency Accesses)
- Global memory
– Main means of communicating R/W data between host and device
– Contents visible to all threads
- Texture and constant memories
– Constants initialized by host
– Contents visible to all threads
[Diagram: per-thread registers and local memory, per-block shared memory, and device-wide global, constant and texture memories accessible from the host]
Courtesy: NVIDIA
Thread Batching: Grids and Blocks
- A kernel is executed as a grid of thread blocks
– All threads share data memory space
- A thread block is a batch of threads that can cooperate with each other by:
– Synchronizing their execution, for hazard-free shared memory accesses
– Efficiently sharing data through a low-latency shared memory
- Two threads from two different blocks cannot cooperate
[Diagram: the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2 on the device; each grid consists of blocks (0,0)…(2,1), and each block of threads (0,0)…(4,2)]
Courtesy: NVIDIA
Quick Terminology Review
- Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads)
– The unit of parallelism in CUDA
- Warp: a group of threads executed physically in parallel on the G80
- Block: a group of threads that are executed together and form the unit of resource assignment
- Grid: a group of thread blocks that must all complete before the next phase of the program can begin
How Thread Blocks are Partitioned
- Thread blocks are partitioned into warps
– Thread IDs within a warp are consecutive and increasing
– Warp 0 starts with thread ID 0
- Partitioning is always the same
– Thus you can use this knowledge in control flow (see the index sketch below)
– However, the exact size of warps may change from generation to generation
– (Covered next)
- However, DO NOT rely on any ordering between warps
– If there are any dependencies between threads, you must call __syncthreads() to get correct results
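Because this partitioning is deterministic, a thread can compute its warp and lane index from its thread ID. A minimal sketch (not part of the original slides; assumes a 1D block and uses the built-in warpSize, 32 on the G80; the output arrays are hypothetical):

    __global__ void warpInfo(int* warpIdOut, int* laneIdOut)
    {
        unsigned int t    = threadIdx.x;               // linear thread ID within the block
        unsigned int warp = t / warpSize;              // index of the warp this thread belongs to
        unsigned int lane = t % warpSize;              // position of the thread within its warp
        unsigned int i    = blockIdx.x * blockDim.x + t;
        warpIdOut[i] = warp;                           // one entry per thread
        laneIdOut[i] = lane;
    }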
Control Flow Instructions
- The main performance concern with branching is divergence
– Threads within a single warp take different paths
– Different execution paths are serialized on the G80: the control paths taken by the threads in a warp are traversed one at a time until there are no more
- A common case: avoid divergence when the branch condition is a function of the thread ID (both cases appear in the sketch below)
– Example with divergence:
    if (threadIdx.x > 2) { }
  This creates two different control paths for threads in a block; the branch granularity is smaller than the warp size, so threads 0, 1 and 2 follow a different path than the rest of the threads in the first warp
– Example without divergence:
    if (threadIdx.x / WARP_SIZE > 2) { }
  This also creates two different control paths for threads in a block, but the branch granularity is a whole multiple of the warp size, so all threads in any given warp follow the same path
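For illustration only (not from the original slides), a small kernel that contains both branch conditions discussed above; WARP_SIZE and the data array are assumptions of this sketch:

    #define WARP_SIZE 32   // warp width on the G80; an assumption of this sketch

    __global__ void branchExample(float* data)
    {
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Divergent branch: within the first warp, threads 0-2 and 3-31 take
        // different paths, so the two paths are serialized for that warp.
        if (threadIdx.x > 2)
            data[i] *= 2.0f;

        // Non-divergent branch: the condition changes only at whole-warp
        // boundaries, so all threads of any given warp take the same path.
        if (threadIdx.x / WARP_SIZE > 2)
            data[i] += 1.0f;
    }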
Shared Memory Bank Conflicts
- Shared memory is as fast as registers if there are no bank conflicts
- The fast case:
– If all threads of a half-warp access different banks, there is no bank conflict
– If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
- The slow case:
– Bank conflict: multiple threads in the same half-warp access the same bank
– Must serialize the accesses
– Cost = max # of simultaneous accesses to a single bank
Linear Addressing
- Given:
    __shared__ float shared[256];
    float foo = shared[baseIndex + s * threadIdx.x];
- This is only bank-conflict-free if s shares no common factors with the number of banks
– 16 on the G80, so s must be odd
[Diagram: threads 0–15 mapped onto banks 0–15 for strides s=1 and s=3; both mappings are conflict-free]
Data Types and Bank Conflicts
- This has no conflicts if the type of shared is 32 bits:
    foo = shared[baseIndex + threadIdx.x];
- But not if the data type is smaller
– 4-way bank conflicts:
    __shared__ char shared[];
    foo = shared[baseIndex + threadIdx.x];
– 2-way bank conflicts:
    __shared__ short shared[];
    foo = shared[baseIndex + threadIdx.x];
[Diagram: thread-to-bank mapping for 8-bit and 16-bit element types; several threads fall into the same bank]
Structs and Bank Conflicts
- Struct assignments compile into as many memory accesses as there are struct members:
    struct vector { float x, y, z; };
    struct myType { float f; int c; };
    __shared__ struct vector vectors[64];
    __shared__ struct myType myTypes[64];
- This has no bank conflicts for vector; the struct size is 3 words
– 3 accesses per thread, contiguous banks (no common factor with 16)
    struct vector v = vectors[baseIndex + threadIdx.x];
- This has 2-way bank conflicts for myType (2 accesses per thread)
    struct myType m = myTypes[baseIndex + threadIdx.x];
Common Array Bank Conflict Patterns 1D
- Each thread loads 2 elements into shared memory:
– 2-way-interleaved loads result in 2-way bank conflicts:
    int tid = threadIdx.x;
    shared[2*tid]   = global[2*tid];
    shared[2*tid+1] = global[2*tid+1];
- This makes sense for traditional CPU threads: locality in cache-line usage and reduced sharing traffic
– Not in shared memory usage, where there are no cache-line effects, only banking effects
[Diagram: with the interleaved loads, two threads map onto each bank]
A Better Array Access Pattern
- Each thread loads one element in every consecutive group of blockDim.x elements:
    shared[tid]              = global[tid];
    shared[tid + blockDim.x] = global[tid + blockDim.x];
[Diagram: threads 0–15 map one-to-one onto banks 0–15; conflict-free]
Example: Parallel Reduction
- Given an array of values, "reduce" them to a single value in parallel
- Examples
– Sum reduction: sum of all values in the array
– Max reduction: maximum of all values in the array
- Typical parallel implementation:
– Recursively halve the number of threads, add two values per thread
– Takes log(n) steps for n elements, requires n/2 threads
A Vector Reduction Example
- Assume an in-place reduction using shared memory
– The original vector is in device global memory
– The shared memory is used to hold a partial sum vector
– Each iteration brings the partial sum vector closer to the final sum
– The final solution will be in element 0
A Simple Implementation
- Assume we have already loaded the array into
– __shared__ float partialSum[]

    unsigned int t = threadIdx.x;
    // loop log(n) times
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        // make sure the sums of the previous iteration are available
        __syncthreads();
        if (t % (2*stride) == 0)
            partialSum[t] += partialSum[t+stride];
    }

(A complete kernel sketch follows below.)
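For context, here is a complete single-kernel sketch built around this loop (an illustration, not the original course code): each block loads blockDim.x elements from global memory, reduces them in shared memory, and thread 0 writes the block's partial result. The dynamic shared array and the blockSums output are assumptions of this sketch.

    __global__ void reduceSimple(const float* input, float* blockSums)
    {
        extern __shared__ float partialSum[];          // blockDim.x floats, sized at launch
        unsigned int t = threadIdx.x;
        partialSum[t] = input[blockIdx.x * blockDim.x + t];   // each thread loads one element

        // loop log(n) times, doubling the stride (blockDim.x assumed to be a power of two)
        for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
            __syncthreads();                           // sums of the previous iteration must be ready
            if (t % (2 * stride) == 0)
                partialSum[t] += partialSum[t + stride];
        }

        if (t == 0)
            blockSums[blockIdx.x] = partialSum[0];     // one partial sum per block
    }

    // launch sketch: reduceSimple<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_blockSums);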
Vector Reduction with Bank Conflicts
[Diagram: array elements reduced over iterations 1–3; stride-1 pair sums (0+1, 2+3, …, 10+11), then stride-2 sums (0..3, 4..7, 8..11), then 0..7 and 8..15]
Vector Reduction with Branch Divergence
[Same reduction diagram, annotated with thread IDs: only the even-numbered threads (0, 2, 4, 6, 8, 10, …) perform additions, so each warp diverges]
Some Observations
- In each iteration, two control flow paths will be sequentially traversed for each warp
– Threads that perform addition and threads that do not
– Threads that do not perform addition may cost extra cycles, depending on the implementation of divergence
- No more than half of the threads will be executing at any time
– All odd-index threads are disabled right from the beginning!
– On average, less than ¼ of the threads will be activated for all warps over time
– After the 5th iteration, entire warps in each block will be disabled: poor resource utilization, but no divergence
- This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each iteration only has one thread activated until all warps retire
Shortcomings of the Implementation
- Assume we have already loaded the array into
– __shared__ float partialSum[]

    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % (2*stride) == 0)                      // BAD: divergence due to interleaved branch decisions
            partialSum[t] += partialSum[t+stride];    // BAD: bank conflicts due to the stride
    }
A better implementation
- Assume we have already loaded the array into
– __shared__ float partialSum[]

    unsigned int t = threadIdx.x;
    for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t+stride];
    }
No Divergence until < 16 sub-sums
[Diagram: thread 0 adds elements 0+16, thread 15 adds 15+31, and so on; the active threads stay contiguous, so whole warps remain coherent in the early iterations]
Observations About the New Implementation
- Only the last 5 iterations will have divergence
- Entire warps will be shut down as iterations progress
– For a 512-thread block, it takes 4 iterations to shut down all but one warp in each block
– Better resource utilization; warps and thus blocks will likely retire faster
- Recall: no bank conflicts either (a host-side sketch for combining the per-block results follows below)
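A block-level reduction (either version) produces one partial sum per block; those still have to be combined. One common pattern, shown here as a hedged host-side sketch (reduceSimple refers to the earlier kernel sketch; names and error handling are assumptions), is to copy the per-block sums back and finish on the CPU, or to run the kernel again on the partial sums:

    #include <vector>
    #include <cuda_runtime.h>

    // Hypothetical host helper: reduce n floats that are already resident on the device.
    float reduceOnDevice(const float* d_in, float* d_blockSums, int n)
    {
        const int blockSize = 512;
        const int numBlocks = n / blockSize;           // assumes n is a multiple of blockSize
        reduceSimple<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_blockSums);

        std::vector<float> h(numBlocks);               // copy the per-block sums back to the host
        cudaMemcpy(h.data(), d_blockSums, numBlocks * sizeof(float), cudaMemcpyDeviceToHost);

        float total = 0.0f;                            // final accumulation on the CPU
        for (int i = 0; i < numBlocks; ++i)
            total += h[i];
        return total;
    }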
Application: MipMap Construction
- Texture available in multiple resolutions
– Pre-processing step
- Rendering: select the appropriate texture resolution
– Selection is usually per pixel!
– Texel size(n) < extent of pixel footprint < texel size(n+1) (see the level-selection formula below)
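One standard way to turn this inequality into a level-of-detail value (not spelled out on the slide; the fractional part of d drives the blending between the two adjacent levels):

    d = \log_2(\text{extent of the pixel footprint, measured in level-0 texels}), \qquad n = \lfloor d \rfloor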
Application: MipMapping II
- Multum in parvo (MIP): much in little
- Hierarchical resolution pyramid
– Repeated averaging over 2x2 texels
– This is vector reduction! (a downsampling kernel sketch follows below)
- Rectangular arrangement (RGB)
- Reconstruction
– Tri-linear interpolation of the 8 nearest texels
[Diagram: mipmap pyramid addressed by (u, v) and level d]
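One averaging step of the pyramid maps naturally onto a kernel with one thread per output texel. A sketch under simple assumptions (single-channel float texture stored as a linear array; level n is exactly twice the size of level n+1; all names are hypothetical):

    // Hypothetical kernel: build mip level n+1 from level n by averaging 2x2 texel blocks.
    __global__ void downsample2x2(const float* src, float* dst, int dstWidth, int dstHeight)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= dstWidth || y >= dstHeight) return;

        int srcWidth = 2 * dstWidth;                   // level n is twice as wide and twice as high
        int sx = 2 * x, sy = 2 * y;
        float sum = src[ sy      * srcWidth + sx    ]
                  + src[ sy      * srcWidth + sx + 1]
                  + src[(sy + 1) * srcWidth + sx    ]
                  + src[(sy + 1) * srcWidth + sx + 1];
        dst[y * dstWidth + x] = 0.25f * sum;           // average of the 2x2 block
    }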
Scan – Algorithm Effects
- Parallelism and memory conflicts
Parallel Prefix Sum (Scan)
- Definition:
The all-prefix-sums operation takes a binary associative operator ⊕ with identity I, and an array of n elements
[a_0, a_1, …, a_{n−1}]
and returns the ordered set
[I, a_0, (a_0 ⊕ a_1), …, (a_0 ⊕ a_1 ⊕ … ⊕ a_{n−2})].
- Example:
if ⊕ is addition, then scan on the set [3 1 7 0 4 1 6 3] returns the set [0 3 4 11 11 15 16 22]
(From Blelloch, 1990, “Prefix Sums and Their Applications”)
Exclusive scan: last input element is not included in the result
Applications of Scan
- Scan is a simple and useful parallel building block (a stream-compaction example follows after the list below)
– Convert recurrences from sequential:
    for (j = 1; j < n; j++)
        out[j] = out[j-1] + f(j);
– into parallel:
    forall (j) { temp[j] = f(j); }
    scan(out, temp);
- Useful for many parallel algorithms:
- radix sort
- quicksort
- String comparison
- Lexical analysis
- Stream compaction
- Polynomial evaluation
- Solving recurrences
- Tree operations
- Histograms
- Etc.
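As a concrete illustration of one entry in this list, stream compaction uses an exclusive scan of 0/1 keep-flags to compute each surviving element's output address. A plain host-side C sketch of the idea (on the GPU, the flagging, the scan and the scatter each run as a kernel; the predicate "keep positive values" is an arbitrary assumption):

    // CPU sketch of stream compaction via exclusive scan (illustration only).
    int compact(const float* in, float* out, int* flags, int* scanned, int n)
    {
        for (int i = 0; i < n; ++i)                    // 1. flag the elements to keep
            flags[i] = (in[i] > 0.0f) ? 1 : 0;

        scanned[0] = 0;                                // 2. exclusive scan of the flags:
        for (int i = 1; i < n; ++i)                    //    scanned[i] = output position of element i
            scanned[i] = scanned[i - 1] + flags[i - 1];

        for (int i = 0; i < n; ++i)                    // 3. scatter the kept elements to their addresses
            if (flags[i]) out[scanned[i]] = in[i];

        return scanned[n - 1] + flags[n - 1];          // number of elements written (n >= 1 assumed)
    }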
Scan on the CPU
- Just add each element to the sum of the elements before it
- Trivial, but sequential
- Exactly n adds: optimal in terms of work efficiency

    void scan(float* scanned, float* input, int length)
    {
        scanned[0] = 0;
        for (int i = 1; i < length; ++i) {
            scanned[i] = input[i-1] + scanned[i-1];
        }
    }
A First-Attempt Parallel Scan Algorithm
- 1. Read input from device memory to shared memory. Set the first element to zero and shift the others right by one.
– Each thread reads one value from the input array in device memory into shared memory array T0.
– Thread 0 writes 0 into the shared memory array.
  Example (input from the definition slide):  In = 3 1 7 0 4 1 6 3  →  T0 = 0 3 1 7 0 4 1 6
A First-Attempt Parallel Scan Algorithm
- 1. (previous slide)
- 2. Iterate log(n) times. Threads stride to n: add pairs of elements that are stride elements apart. Double the stride at each iteration. (Note: the shared memory arrays must be double-buffered.)
– Active threads: stride to n−1 (n − stride threads)
– Thread j adds elements j and j−stride from T0 and writes the result into shared memory buffer T1 (ping-pong)
  Iteration 1, stride = 1:  T0 = 0 3 1 7 0 4 1 6  →  T1 = 0 3 4 8 7 4 5 7
A First-Attempt Parallel Scan Algorithm
- 1. Read input from device memory to shared memory. Set the first element to zero and shift the others right by one.
- 2. Iterate log(n) times. Threads stride to n: add pairs of elements that are stride elements apart. Double the stride at each iteration. (Double-buffer the shared memory arrays.)
  Iteration 2, stride = 2:  T1 = 0 3 4 8 7 4 5 7  →  T0 = 0 3 4 11 11 12 12 11
A First-Attempt Parallel Scan Algorithm
- 1. Read input from device memory to shared memory. Set the first element to zero and shift the others right by one.
- 2. Iterate log(n) times. Threads stride to n: add pairs of elements that are stride elements apart. Double the stride at each iteration. (Double-buffer the shared memory arrays.)
  Iteration 3, stride = 4:  T0 = 0 3 4 11 11 12 12 11  →  T1 = 0 3 4 11 11 15 16 22
A First-Attempt Parallel Scan Algorithm
- 1. Read input from device memory to shared memory. Set the first element to zero and shift the others right by one.
- 2. Iterate log(n) times. Threads stride to n: add pairs of elements that are stride elements apart. Double the stride at each iteration. (Double-buffer the shared memory arrays.)
- 3. Write the output to device memory.
  Result:  Out = 0 3 4 11 11 15 16 22
(A complete kernel sketch of these three steps follows below.)
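Putting the three steps together, a single-block sketch of this first-attempt scan (an illustration under stated assumptions: n = blockDim.x is a power of two, and the dynamic shared memory holds the two ping-pong buffers):

    __global__ void scanNaive(const float* in, float* out, int n)
    {
        extern __shared__ float temp[];                // 2*n floats: two ping-pong buffers
        int t = threadIdx.x;
        int pout = 0, pin = 1;

        // 1. load shifted right by one; element 0 becomes 0 (exclusive scan)
        temp[pout * n + t] = (t > 0) ? in[t - 1] : 0.0f;
        __syncthreads();

        // 2. iterate log(n) times, doubling the stride; double-buffer to avoid read/write races
        for (int stride = 1; stride < n; stride *= 2) {
            pout = 1 - pout;                           // swap the roles of the two buffers
            pin  = 1 - pout;
            if (t >= stride)
                temp[pout * n + t] = temp[pin * n + t] + temp[pin * n + t - stride];
            else
                temp[pout * n + t] = temp[pin * n + t];
            __syncthreads();
        }

        out[t] = temp[pout * n + t];                   // 3. write the result to device memory
    }

    // launch sketch: scanNaive<<<1, n, 2 * n * sizeof(float)>>>(d_in, d_out, n);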
Work Efficiency Considerations
- The first-attempt scan executes log(n) parallel iterations
– The iterations do (n−1), (n−2), (n−4), …, (n − n/2) adds, respectively
– Total adds: n · (log(n) − 1) + 1 → O(n·log(n)) work
- This scan algorithm is not very work-efficient
– The sequential scan algorithm does n adds
– A factor of log(n) hurts: 20x for 10^6 elements!
- A parallel algorithm can be slow when execution resources are saturated due to low work efficiency
Balanced Trees
- For improving efficiency
- A common parallel algorithm pattern:
– Build a balanced binary tree on the input data and sweep it to and from the root
– The tree is not an actual data structure, but a concept to determine what each thread does at each step
- For scan:
– Traverse down from the leaves to the root, building partial sums at internal nodes in the tree
  - The root holds the sum of all leaves
– Traverse back up the tree, building the scan from the partial sums
Build the Sum Tree
  T = 3 1 7 0 4 1 6 3
Assume the array is already in shared memory.
Build the Sum Tree
  T = 3 1 7 0 4 1 6 3  →  T = 3 4 7 7 4 5 6 9   (iteration 1, stride 1, n/2 threads)
Iterate log(n) times. Each thread adds the value stride elements away to its own value. Each addition corresponds to a single thread.
Build the Sum Tree
  T = 3 4 7 7 4 5 6 9  →  T = 3 4 7 11 4 5 6 14   (iteration 2, stride 2, n/4 threads)
Iterate log(n) times. Each thread adds the value stride elements away to its own value. Each addition corresponds to a single thread.
Build the Sum Tree
  T = 3 4 7 11 4 5 6 14  →  T = 3 4 7 11 4 5 6 25   (iteration log(n), stride 4, 1 thread)
Iterate log(n) times. Each thread adds the value stride elements away to its own value. Note that this algorithm operates in place: no need for double buffering. Each addition corresponds to a single thread.
Zero the Last Element
  T = 3 4 7 11 4 5 6 25  →  T = 3 4 7 11 4 5 6 0
We now have an array of partial sums. Since this is an exclusive scan, set the last element to zero. It will propagate back to the first element.
Build Scan From Partial Sums
  T = 3 4 7 11 4 5 6 0
Build Scan From Partial Sums
  T = 3 4 7 11 4 5 6 0  →  T = 3 4 7 0 4 5 6 11   (iteration 1, stride 4, 1 thread)
Iterate log(n) times. Each thread adds the value stride elements away to its own value, and sets the value stride elements away to its own previous value. Each operation corresponds to a single thread.
Build Scan From Partial Sums
  T = 3 4 7 0 4 5 6 11  →  T = 3 0 7 4 4 11 6 16   (iteration 2, stride 2, 2 threads)
Iterate log(n) times. Each thread adds the value stride elements away to its own value, and sets the value stride elements away to its own previous value. Each operation corresponds to a single thread.
Build Scan From Partial Sums
  T = 3 0 7 4 4 11 6 16  →  T = 0 3 4 11 11 15 16 22   (iteration log(n), stride 1, n/2 threads)
Done! We now have a completed scan that we can write out to device memory.
Total steps: 2 · log(n). Total work: 2 · (n−1) adds = O(n). Work-efficient! (A kernel sketch of the up-sweep/down-sweep pair follows below.)
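A compact single-block kernel sketch of this work-efficient scan, following the scan.pdf reference cited at the beginning (assumptions: n = 2 * blockDim.x elements, n a power of two, in-place in shared memory, no bank-conflict padding shown):

    __global__ void scanWorkEfficient(const float* in, float* out, int n)
    {
        extern __shared__ float temp[];                // n floats
        int t = threadIdx.x;
        temp[2 * t]     = in[2 * t];                   // each thread loads two elements
        temp[2 * t + 1] = in[2 * t + 1];

        // up-sweep (reduce): build the sum tree in place
        int offset = 1;
        for (int d = n >> 1; d > 0; d >>= 1) {
            __syncthreads();
            if (t < d) {
                int ai = offset * (2 * t + 1) - 1;
                int bi = offset * (2 * t + 2) - 1;
                temp[bi] += temp[ai];
            }
            offset *= 2;
        }

        if (t == 0) temp[n - 1] = 0.0f;                // zero the last element (exclusive scan)

        // down-sweep: build the scan from the partial sums
        for (int d = 1; d < n; d *= 2) {
            offset >>= 1;
            __syncthreads();
            if (t < d) {
                int ai = offset * (2 * t + 1) - 1;
                int bi = offset * (2 * t + 2) - 1;
                float v  = temp[ai];                   // each active thread adds the value stride
                temp[ai] = temp[bi];                   // elements away and passes its old value down
                temp[bi] += v;
            }
        }
        __syncthreads();

        out[2 * t]     = temp[2 * t];                  // write the completed scan to device memory
        out[2 * t + 1] = temp[2 * t + 1];
    }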
Summary
- Parallel programming requires careful planning
– of the branching behavior
– of the memory access patterns
– of the work efficiency
- Vector reduction
– branch-efficient
– bank-efficient
- Scan algorithm
– based on the balanced-tree principle: bottom-up, then top-down traversal