Who is this guy?
Patrick Cozzi – Analytical Graphics, Inc. / University of Pennsylvania
Developer, lecturer, author, editor
See http://www.seas.upenn.edu/~pcozzi/

GPU Architecture – CIS 371 Guest Lecture, Spring 2012

How did this happen? Graphics Workloads
Triangles/vertices and pixels/fragments
Left image from http://proteneer.com/blog/?p=263; right image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch14.html
Why GPUs? (Early 90s – pre-GPU)
- Graphics workloads are embarrassingly parallel: data-parallel and pipeline-parallel
- CPU and GPU execute in parallel
- Dedicated hardware: texture filtering, rasterization, etc.
Slide from http://s09.idav.ucdavis.edu/talks/01-BPS-SIGGRAPH09-mhouston.pdf

Data Parallel – NVIDIA GeForce 6 (2004)
- 6 vertex shader processors
- 16 fragment shader processors
Image from: https://plus.google.com/u/0/photos/100838748547881402137/albums/5407605084626995217/5581900335460078306

Beyond Graphics
- Cloth simulation
- Particle systems
- Matrix multiply
Image from http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter30.html
NVIDIA G80 Architecture – Why Unify Shader Processors?
Slide from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf

GPU Architecture Big Ideas
- GPUs are specialized for compute-intensive, highly parallel computation – graphics is just the beginning
- Transistors are devoted to processing, not data caching or flow control
Slide from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
NVIDIA G80
- Streaming Processor (SP)
- Streaming Multiprocessor (SM)
NVIDIA G80 vs. NVIDIA GT200
- G80: 16 SMs, each with 8 SPs – 128 total SPs; each SM hosts up to 768 threads – up to 12,288 threads in flight
- GT200: 30 SMs, each with 8 SPs – 240 total SPs; each SM hosts up to 1,024 threads – up to 30,720 threads in flight
Slides from David Luebke: http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf

GPU Computing History
- 2001/2002 – researchers see the GPU as a data-parallel coprocessor ("Let's program this thing!"); the GPGPU field is born
- 2007 – NVIDIA releases CUDA (Compute Unified Device Architecture); GPGPU shifts to GPU Computing
- 2008 – Khronos releases the OpenCL specification
CUDA Abstractions
- A hierarchy of thread groups
- Shared memories
- Barrier synchronization

CUDA Kernels
- Executed N times in parallel by N different CUDA threads
- Key pieces: thread ID, declaration specifier, execution configuration

CUDA Program Execution – Thread Hierarchies
- Grid – one or more thread blocks; 1D or 2D
- Block – array of threads; 1D, 2D, or 3D
- Each block in a grid has the same number of threads
- Each thread in a block can synchronize and access shared memory
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
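The three pieces named above – declaration specifier, thread ID, and execution configuration – fit together as in this minimal vector-add sketch (the kernel and variable names are illustrative, not from the slides):

```cuda
#include <cstdio>

// Declaration specifier: __global__ marks a kernel that runs on the GPU,
// executed N times in parallel by N different CUDA threads.
__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    // Thread ID: each thread derives a unique global index from its
    // block index, block dimension, and thread index within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    // ... fill dA and dB via cudaMemcpy from host arrays ...

    // Execution configuration: <<<blocks in the grid, threads per block>>>.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);
    cudaDeviceSynchronize();

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Every block runs the same kernel body; only the thread and block indices differ, which is what makes the launch data-parallel.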
Thread Hierarchies – Thread Block
- A group of threads: up to 512 on G80 and GT200, up to 1024 on Fermi
- Threads in a block reside on the same processor core and share that core's memory
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Threads in a block
- Share (limited) low-latency memory
- Synchronize execution to coordinate memory accesses
- __syncthreads(): a lightweight barrier – threads in the block wait until all threads reach it
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf
Scheduling Threads
Warp – threads from a block
- G80 / GT200 – 32 threads per warp
- Run on the same SM; warps from several blocks can be scheduled on the same SM
- The unit of thread scheduling
- Formed from consecutive threadIdx values
- An implementation detail – in theory, warpSize
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter3-CudaThreadingModel.pdf
Remember this: [latency-hiding figure] Image from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf; Slide from: http://courses.engr.illinois.edu/ece498/al/Syllabus.html
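Because warps are formed from consecutive threadIdx values, a thread can recover its warp and lane position with integer arithmetic. A hypothetical kernel illustrating this, using the built-in warpSize variable (32 on G80/GT200):

```cuda
__global__ void whereAmI(int* warpIds, int* lanes)
{
    // Consecutive threadIdx values land in the same warp:
    // threads 0..31 form warp 0, threads 32..63 form warp 1, etc.
    int warpId = threadIdx.x / warpSize;   // which warp in this block
    int lane   = threadIdx.x % warpSize;   // position within the warp
    warpIds[threadIdx.x] = warpId;
    lanes[threadIdx.x]   = lane;
}
```

Using warpSize instead of a hard-coded 32 keeps the code correct even if the implementation detail changes.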
Scheduling Threads
What happens if branches in a warp diverge?
Image from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf

32 threads per warp but 8 SPs per SM – what gives?
When an SM schedules a warp whose next instruction is ready:
- 8 threads enter the SPs on the 1st cycle
- 8 more on each of the 2nd, 3rd, and 4th cycles
- Therefore, 4 cycles are required to dispatch a warp
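When threads in one warp take different sides of a branch, the hardware executes each path serially with the other threads masked off, so the costs of both paths add up. A hypothetical pair of kernels showing the divergent pattern and a warp-aligned rewrite that avoids it:

```cuda
// Divergent: even and odd lanes of every warp take different paths,
// so each warp executes BOTH paths, masking the inactive threads.
__global__ void diverge(float* out)
{
    if (threadIdx.x % 2 == 0)
        out[threadIdx.x] = 2.0f * threadIdx.x;   // path A
    else
        out[threadIdx.x] = 0.5f * threadIdx.x;   // path B
}

// Non-divergent: the branch condition is uniform within each warp
// (all 32 lanes agree), so every warp executes only one path.
__global__ void noDiverge(float* out)
{
    if (threadIdx.x / 32 == 0)
        out[threadIdx.x] = 2.0f * threadIdx.x;   // whole warp takes path A
    else
        out[threadIdx.x] = 0.5f * threadIdx.x;   // whole warp takes path B
}
```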
Scheduling Threads – Question
A kernel has:
- 1 global memory read (200 cycles)
- 4 non-dependent multiplies/adds
How many warps are required to hide the memory latency?

Solution
- Each warp executes 4 multiplies/adds at 4 cycles per dispatch – 16 cycles of useful work
- We need to cover 200 cycles: 200 / 16 = 12.5; ceil(12.5) = 13
- 13 warps are required

Memory Model – Thread Synchronization
Recall: threads in a block can synchronize
- Call __syncthreads to create a barrier
- A thread waits at this call until all threads in the block reach it, then all threads continue

Mds[i] = Md[j];
__syncthreads();
func(Mds[i], Mds[i + 1]);

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
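The Mds snippet above only works because of the barrier: thread i reads Mds[i + 1], an element that a *different* thread loaded. A fleshed-out sketch of the same pattern (the block size, halo handling, and the averaging func are illustrative assumptions, not from the slides):

```cuda
#define BLOCK 256

// A hypothetical stand-in for func from the snippet above.
__device__ float func(float a, float b) { return 0.5f * (a + b); }

__global__ void adjacentAvg(const float* Md, float* out, int n)
{
    __shared__ float Mds[BLOCK + 1];            // +1 slot for the halo element
    int i = threadIdx.x;
    int j = blockIdx.x * blockDim.x + i;

    if (j < n)
        Mds[i] = Md[j];                         // each thread loads one element
    if (i == 0 && blockIdx.x * blockDim.x + BLOCK < n)
        Mds[BLOCK] = Md[blockIdx.x * blockDim.x + BLOCK];  // halo load

    __syncthreads();  // barrier: no thread proceeds until every load is done

    if (j + 1 < n)
        out[j] = func(Mds[i], Mds[i + 1]);      // safe: neighbor's load is visible
}
```

Without the __syncthreads() call, thread i could read Mds[i + 1] before thread i + 1 has written it – a race the barrier rules out.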
Thread Synchronization (timeline)
Threads 0–3 each execute:

Mds[i] = Md[j];
__syncthreads();
func(Mds[i], Mds[i+1]);

Time 0 → Time 2: the threads reach __syncthreads() at different times. At time 1, threads 0 and 1 are blocked at the barrier; only once threads 2 and 3 also arrive do all four continue past it.