GPU Programming

Assignment 4
• Consists of two programming assignments
  • Concurrency
  • GPU programming
• Requires a computer with a CUDA/OpenCL/DirectCompute compatible GPU
• Due Jun 07
• We have no final exams

GPU Resources
• Download the CUDA toolkit from the web
• Very good textbook:
  • Programming Massively Parallel Processors
  • Wen-mei Hwu and David Kirk
• Available at
  • http://courses.engr.illinois.edu/ece498/al/Syllabus.html

Acknowledgments
• Slides and material from
  • Wen-mei Hwu (UIUC) and David Kirk (NVIDIA)

Why GPU Programming
• More processing power + higher memory bandwidth
• GPU in every PC and workstation: massive volume and potential impact

Current CPU
• 4 cores
• 4-float-wide SIMD
• 3 GHz
• 48-96 GFlops
• 2x HyperThreaded
• 64 kB L1 cache per core
• 20 GB/s to memory
• $200
• 200 W
[Diagram: four cores (CPU 0 to CPU 3) sharing an L2 cache]

Current GPU
• 32 cores
• 32-float-wide SIMD
• 1 GHz
• 1 TeraFlop
• 32x “HyperThreaded”
• 64 kB L1 cache per core
• 150 GB/s to memory
• $200
• 200 W
[Diagram: a grid of SIMD cores sharing an L2 cache]

Bandwidth and Capacity
• CPU: 50 GFlops, with 4-6 GB of CPU RAM
• GPU: 1 TFlop, with 1 GB of GPU RAM
• All values are approximate
[Diagram: CPU, GPU, CPU RAM, and GPU RAM connected by links labeled 1 GB/s, 10 GB/s, and 100 GB/s]

CUDA
• “Compute Unified Device Architecture”
• General purpose programming model
  • User kicks off batches of threads on the GPU
  • GPU = dedicated super-threaded, massively data parallel co-processor
• Targeted software stack
  • Compute oriented drivers, language, and tools
• Driver for loading computation programs into GPU

Languages with Similar Capabilities
• CUDA
• OpenCL
• DirectCompute
• You are free to use any of the above for assignment 4
• I will focus on CUDA for the rest of the lecture
  • Same abstractions present in all three with different (and confusing) names

CUDA Programming Model
The GPU = compute device that:
• Is a coprocessor to the CPU or host
• Has its own DRAM (device memory)
• Runs many threads in parallel
GPU program = kernel
Differences between GPU and CPU threads
• GPU threads are extremely lightweight
  • Very little creation overhead
• GPU needs 1000s of threads for full efficiency
  • Multi-core CPU needs only a few

A CUDA Program
1. Host performs some CPU computation
2. Host copies input data into the device
3. Host instructs the device to execute a kernel
4. Device executes the kernel and produces results
5. Host copies the results back from the device
6. Go to step 1
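Steps 2 to 5 above can be sketched in CUDA as follows. This is a minimal sketch, not code from the lecture: the kernel `scaleKernel` (which simply doubles each element) and the host helper `scaleOnDevice` are hypothetical names chosen for illustration, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread doubles one input element.
__global__ void scaleKernel(const float* in, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = 2.0f * in[tid];   // guard threads past the end
}

// Hypothetical host helper covering steps 2 to 5 of the slide.
std::vector<float> scaleOnDevice(const std::vector<float>& host_in) {
    int n = static_cast<int>(host_in.size());
    std::vector<float> host_out(n);
    if (n == 0) return host_out;
    size_t bytes = n * sizeof(float);

    float* dev_in = nullptr;
    float* dev_out = nullptr;
    cudaMalloc(&dev_in, bytes);
    cudaMalloc(&dev_out, bytes);

    // Step 2: host copies input data into the device.
    cudaMemcpy(dev_in, host_in.data(), bytes, cudaMemcpyHostToDevice);

    // Steps 3 and 4: host launches the kernel, device executes it.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up
    scaleKernel<<<blocks, threads>>>(dev_in, dev_out, n);

    // Step 5: host copies the results back (this copy waits for the kernel).
    cudaMemcpy(host_out.data(), dev_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_in);
    cudaFree(dev_out);
    return host_out;
}
```

Calling `scaleOnDevice` in a loop, with CPU work between calls, corresponds to the "go to step 1" cycle.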

CUDA Kernel is an SPMD program
• SPMD = Single Program Multiple Data
• All threads run the same code
• Each thread uses its id to
  • Operate on different memory addresses
  • Make control decisions

Kernel:
    …
    i = input[tid];
    o = f(i);
    output[tid] = o;
    …

CUDA Kernel is an SPMD program
• SPMD = Single Program Multiple Data
• All threads run the same code
• Each thread uses its id to
  • Operate on different memory addresses
  • Make control decisions
• Difference with SIMD
  • Threads can execute different control flow
  • At a performance cost

Kernel:
    …
    i = input[tid];
    if(i%2 == 0)
        o = f(i);
    else
        o = g(i);
    output[tid] = o;
    …
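The kernel outlined on this slide could be written out in CUDA roughly as below. This is a sketch under stated assumptions: the slides do not define `f` and `g`, so the bodies here (halving, and "triple plus one") are placeholders, the data is assumed to be `int`, and `branchKernel` and `branchOnDevice` are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder device functions; the slides leave f and g unspecified.
__device__ int f(int i) { return i / 2; }
__device__ int g(int i) { return 3 * i + 1; }

// Every thread runs this same code; the thread id selects the data,
// and the data selects the branch.
__global__ void branchKernel(const int* input, int* output, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    int i = input[tid];
    int o;
    if (i % 2 == 0)
        o = f(i);   // threads holding even values take this path
    else
        o = g(i);   // threads holding odd values take this path
    output[tid] = o;
}

// Hypothetical host helper that runs the kernel on a host vector.
std::vector<int> branchOnDevice(const std::vector<int>& host_in) {
    int n = static_cast<int>(host_in.size());
    std::vector<int> host_out(n);
    if (n == 0) return host_out;
    size_t bytes = n * sizeof(int);
    int* dev_in = nullptr;
    int* dev_out = nullptr;
    cudaMalloc(&dev_in, bytes);
    cudaMalloc(&dev_out, bytes);
    cudaMemcpy(dev_in, host_in.data(), bytes, cudaMemcpyHostToDevice);
    int threads = 128;
    branchKernel<<<(n + threads - 1) / threads, threads>>>(dev_in, dev_out, n);
    cudaMemcpy(host_out.data(), dev_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_out);
    return host_out;
}
```

When neighbouring threads take different branches, the hardware has to run both paths one after the other for that group of threads, which is the performance cost the slide mentions.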

Threads Organization
• Kernel threads = Grid of Thread Blocks (1D or 2D)
• Thread Block = Array of Threads (1D or 2D or 3D)
• Simplifies memory addressing for multidimensional data
[Diagram: on the host, Kernel 1 and Kernel 2 are launched; on the device, Kernel 1 runs as Grid 1, made of Blocks (0, 0), (1, 0), (0, 1), and (1, 1), and Kernel 2 runs as Grid 2]

Threads Organization
• Kernel threads = Grid of Thread Blocks (1D or 2D)
• Thread Block = Array of Threads (1D or 2D or 3D)
• Simplifies memory addressing for multidimensional data
[Diagram: on the host, Kernel 1 and Kernel 2 are launched; on the device, Kernel 1 runs as Grid 1, made of Blocks (0, 0), (1, 0), (0, 1), and (1, 1), and Kernel 2 runs as Grid 2; each block contains Threads (0,0), (1,0), (0,1), and (1,1)]
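To illustrate how a 2D grid of 2D blocks simplifies addressing, here is a small sketch in which each thread derives its own (row, column) position from the built-in `blockIdx`, `blockDim`, and `threadIdx` variables. The kernel `fillCoordsKernel`, the helper `fillCoordsOnDevice`, the 16 x 16 block size, and the encoding `row * 1000 + col` are all assumptions made for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread handles one element of a row-major width x height array.
__global__ void fillCoordsKernel(int* out, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x covers columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y covers rows
    if (col < width && row < height)
        out[row * width + col] = row * 1000 + col;    // illustrative value
}

// Hypothetical host helper: launches a 2D grid that covers the array.
std::vector<int> fillCoordsOnDevice(int width, int height) {
    std::vector<int> host_out(static_cast<size_t>(width) * height);
    if (host_out.empty()) return host_out;
    size_t bytes = host_out.size() * sizeof(int);
    int* dev_out = nullptr;
    cudaMalloc(&dev_out, bytes);

    dim3 block(16, 16);                               // 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,        // round up in x
              (height + block.y - 1) / block.y);      // round up in y
    fillCoordsKernel<<<grid, block>>>(dev_out, width, height);

    cudaMemcpy(host_out.data(), dev_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_out);
    return host_out;
}
```

Because the grid is rounded up to whole blocks, some threads fall outside the array; the bounds check in the kernel makes them do nothing.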

Threads within a Block
• Execute in lock step
• Can share memory
• Can synchronize with each other
[Diagram: a CUDA thread block with thread ids 0, 1, 2, 3, … m, each running the same thread program. Courtesy: John Nickolls, NVIDIA]
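Sharing memory and synchronizing within a block can be illustrated with a per-block sum. This is a sketch under assumptions, not code from the lecture: `blockSumKernel`, `sumOnDevice`, and the block size of 256 are hypothetical choices, and the final addition of the per-block results is done on the host to keep the example short.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 256;  // assumed block size; must be a power of two

// Each block adds up its slice of the input in shared memory.
__global__ void blockSumKernel(const float* in, float* blockSums, int n) {
    __shared__ float partial[kBlockSize];      // visible to the whole block
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    partial[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                           // all loads are now visible

    // Tree reduction: halve the number of active threads each round.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();                       // finish round before the next
    }
    if (tid == 0) blockSums[blockIdx.x] = partial[0];
}

// Hypothetical host helper: one partial sum per block, combined on the host.
float sumOnDevice(const std::vector<float>& data) {
    int n = static_cast<int>(data.size());
    if (n == 0) return 0.0f;
    int blocks = (n + kBlockSize - 1) / kBlockSize;
    float* dev_in = nullptr;
    float* dev_sums = nullptr;
    cudaMalloc(&dev_in, n * sizeof(float));
    cudaMalloc(&dev_sums, blocks * sizeof(float));
    cudaMemcpy(dev_in, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSumKernel<<<blocks, kBlockSize>>>(dev_in, dev_sums, n);

    std::vector<float> sums(blocks);
    cudaMemcpy(sums.data(), dev_sums, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_sums);

    float total = 0.0f;
    for (float s : sums) total += s;
    return total;
}
```

Note that `__syncthreads()` is called by every thread of the block, outside the `if`, since it is a barrier for the whole block. Threads in different blocks cannot synchronize this way, which is why the last step happens on the host here.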

CUDA Function Declarations

Declaration                      Executed on the:   Only callable from the:
__device__ float DeviceFunc()    device             device
__global__ void KernelFunc()     device             host
__host__ float HostFunc()        host               host

• __global__ defines a kernel function
  • Must return void
• __device__ and __host__ can be used together
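The three qualifiers can be seen side by side in a short sketch. All function names here (`square`, `squarePlusOne`, `squarePlusOneKernel`, `squarePlusOneOnDevice`) are hypothetical and chosen only for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// __host__ and __device__ together: compiled for both CPU and GPU.
__host__ __device__ float square(float x) { return x * x; }

// __device__ only: runs on the device, callable only from device code.
__device__ float squarePlusOne(float x) { return square(x) + 1.0f; }

// __global__: a kernel; runs on the device, launched from the host,
// and must return void.
__global__ void squarePlusOneKernel(const float* in, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = squarePlusOne(in[tid]);
}

// Ordinary host function (implicitly __host__) that launches the kernel.
std::vector<float> squarePlusOneOnDevice(const std::vector<float>& host_in) {
    int n = static_cast<int>(host_in.size());
    std::vector<float> host_out(n);
    if (n == 0) return host_out;
    size_t bytes = n * sizeof(float);
    float* dev_in = nullptr;
    float* dev_out = nullptr;
    cudaMalloc(&dev_in, bytes);
    cudaMalloc(&dev_out, bytes);
    cudaMemcpy(dev_in, host_in.data(), bytes, cudaMemcpyHostToDevice);
    int threads = 128;
    squarePlusOneKernel<<<(n + threads - 1) / threads, threads>>>(
        dev_in, dev_out, n);
    cudaMemcpy(host_out.data(), dev_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_out);
    return host_out;
}
```

Here `square` can also be called directly from host code, whereas calling `squarePlusOne` from the host would be rejected by the compiler.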

CUDA Function Declarations (cont.)
• __device__ functions cannot have their address taken
• For functions executed on the device:
  • No recursion
  • No static variable declarations inside the function
  • No variable number of arguments

Putting it all together

    __global__ void KernelFunc(…);
    dim3 DimGrid(100, 50);      // 100 x 50 = 5000 thread blocks
    dim3 DimBlock(4, 8, 8);     // 4 x 8 x 8 = 256 threads per block
    KernelFunc<<< DimGrid, DimBlock >>>(...);

CUDA Memory Model
• Registers
  • Read/write per thread
• Local memory
  • Read/write per thread
• Shared memory
  • Read/write per block
• Global memory
  • Read/write per grid
• Constant memory
  • Read only, per grid
• Texture memory
  • Read only, per grid
[Diagram: a grid with Block (0, 0) and Block (1, 0); each block has its own shared memory and Threads (0, 0) and (1, 0), each thread with its own registers; global, constant, and texture memory are shared by the grid and accessible from the host]

Memory Access Efficiency
• Registers
  • Fast
• Local memory
  • Not cached -> Slow
  • Registers spill into local memory
• Shared memory
  • On chip -> Fast
• Global memory
  • Not cached -> Slow
• Constant memory
  • Cached -> Fast if good reuse
• Texture memory
  • Cached -> Fast if good reuse
[Diagram: same memory hierarchy as on the previous slide]
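As an illustration of several of these memory spaces in one kernel, the sketch below evaluates a cubic polynomial whose coefficients live in constant memory, while the inputs and outputs live in global memory and the per-thread temporaries live in registers. The names `coeffs`, `cubicKernel`, and `evalCubicOnDevice` are hypothetical; texture memory and explicit shared memory are not shown.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Constant memory: read-only in kernels, written by the host, cached.
// Every thread reads the same four values, so reuse is good.
__constant__ float coeffs[4];

// Evaluates c0 + c1*x + c2*x^2 + c3*x^3 using Horner's rule.
__global__ void cubicKernel(const float* xs, float* ys, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // held in a register
    if (tid >= n) return;
    float x = xs[tid];                                // global memory read
    float y = coeffs[0] + x * (coeffs[1] + x * (coeffs[2] + x * coeffs[3]));
    ys[tid] = y;                                      // global memory write
}

// Hypothetical host helper; host_coeffs must point to four floats.
std::vector<float> evalCubicOnDevice(const float* host_coeffs,
                                     const std::vector<float>& host_xs) {
    int n = static_cast<int>(host_xs.size());
    std::vector<float> host_ys(n);
    if (n == 0) return host_ys;
    size_t bytes = n * sizeof(float);

    // Host writes the constant memory before launching the kernel.
    cudaMemcpyToSymbol(coeffs, host_coeffs, 4 * sizeof(float));

    float* dev_xs = nullptr;
    float* dev_ys = nullptr;
    cudaMalloc(&dev_xs, bytes);
    cudaMalloc(&dev_ys, bytes);
    cudaMemcpy(dev_xs, host_xs.data(), bytes, cudaMemcpyHostToDevice);
    int threads = 128;
    cubicKernel<<<(n + threads - 1) / threads, threads>>>(dev_xs, dev_ys, n);
    cudaMemcpy(host_ys.data(), dev_ys, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_xs);
    cudaFree(dev_ys);
    return host_ys;
}
```

Each element of `xs` is read from global memory only once, so there is little to gain from staging it in shared memory in this particular case.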
