GPU Programming 101
GridKa School 2017: make science && run
Andreas Herten, Forschungszentrum Jülich, 31 August 2017
Andreas Herten
Physics in Aachen (Dipl. at CMS) and Jülich/Bochum (Dr. at PANDA)
[Figure: Line Hough Transform around isochrone (5° step size); x/y in cm]
Since then: NVIDIA Application Lab, optimizing scientific applications for/on GPUs at Jülich Supercomputing Centre
Outline: Motivation | Platform: Hardware Features, High Throughput, Summary | Programming GPUs: Libraries, Directives, Languages, Abstraction Libraries/DSL, Tools | Conclusions
GPU all around
1999: General computations with shaders of graphics hardware
2001: NVIDIA GeForce 3 with programmable shaders [1]; 2003: DirectX 9 at ATI
2007: CUDA
2017: Top 500: 15 % with GPUs; Green 500: 9 of the top 10 with GPUs
Peak performance single precision
[Plot: Theoretical peak performance, single precision (GFLOP/sec vs. end of year, 2008–2016) for Intel Xeon CPUs, NVIDIA GeForce GPUs, AMD Radeon GPUs, and Intel Xeon Phis]
Graphic: Rupp [2]
Performance per Watt
[Plot: Theoretical peak floating point operations per watt, single precision (GFLOP/sec per watt vs. end of year, 2008–2016) for Intel Xeon CPUs, NVIDIA GeForce GPUs, AMD Radeon GPUs, and Intel Xeon Phis]
Graphic: Rupp [2]
Peak performance double precision
[Plot: Theoretical peak performance, double precision (GFLOP/sec vs. end of year, 2008–2016) for Intel Xeon CPUs, NVIDIA Tesla GPUs, AMD Radeon GPUs, and Intel Xeon Phis]
Graphic: Rupp [2]
JURECA: Top 500 #70
CPU vs. GPU
A matter of specialties
Transporting one (motorbike) vs. transporting many (coach)
Graphics: Lee [3] and Shearings Holidays [4]
CPU vs. GPU
[Diagram: CPU chip with few ALUs, large control logic and cache, attached DRAM; GPU chip with many ALUs, small control and cache per block, attached DRAM]
GPU Architecture
Overview
Everything else follows
SIMT
GPU memory ain’t no CPU memory
[Diagram: Host (CPU with control, cache, ALUs, and its DRAM) connected to Device (GPU with its own DRAM)]
GPU: accelerator / extension card → separate device from the CPU
Separate memory, but UVA (Unified Virtual Addressing) and UM (Unified Memory)
Memory transfers need special consideration! Do as little as possible!
Formerly: explicitly copy data to/from GPU; now: can be done automatically (performance…?)
Link speeds: PCIe <16 GB/s; NVLink ≈160 GB/s (P100) / ≈300 GB/s (V100); on-card HBM2 <720 GB/s (P100) / <900 GB/s (V100)
Example values: P100 16 GB RAM, 720 GB/s; V100 16 GB RAM, 900 GB/s
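As a sketch of the two transfer models mentioned above (not from the slides): explicit copies with cudaMalloc/cudaMemcpy versus Unified Memory with cudaMallocManaged. The size n, the host array x, and the kernel launch are placeholders.

// Explicit model: distinct host and device buffers, copies by hand
float *d_x;
cudaMalloc(&d_x, n * sizeof(float));
cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);  // host to device
// ... launch kernel working on d_x ...
cudaMemcpy(x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);  // device to host
cudaFree(d_x);

// Unified Memory: one pointer, pages migrate on demand (convenient; check performance)
float *u_x;
cudaMallocManaged(&u_x, n * sizeof(float));
// ... use u_x from CPU code and from kernels ...
cudaFree(u_x);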
Following different streams
Problem: Memory transfer is comparably slow. Solution: Do something else in the meantime (computation)! → Overlap tasks. Copy and compute engines run separately (streams)
[Diagram: copy and compute phases of several streams overlapping in time]
GPU needs to be fed: schedule many computations. The CPU can do other work while the GPU computes; synchronization is needed.
Also: fast switching of contexts to keep the GPU busy.
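A minimal sketch of such overlap with CUDA streams (not from the slides): the buffers h_x (pinned host memory), d_x, the chunk size, the kernel, and the launch configuration blocks/threads are all assumed to exist.

cudaStream_t streams[2];
for (int i = 0; i < 2; i++)
    cudaStreamCreate(&streams[i]);
for (int i = 0; i < 2; i++) {
    int offset = i * chunk;
    // Copy in, compute, copy out; each chunk in its own stream, so the copies of
    // one chunk can overlap with the kernel of the other (host buffers must be pinned)
    cudaMemcpyAsync(d_x + offset, h_x + offset, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
    kernel<<<blocks, threads, 0, streams[i]>>>(d_x + offset, chunk);
    cudaMemcpyAsync(h_x + offset, d_x + offset, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
}
cudaDeviceSynchronize();  // wait for all streams to finish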
Of threads and warps
CPU: — Single Instruction, Multiple Data (SIMD) — Simultaneous Multithreading (SMT)
GPU: Single Instruction, Multiple Threads (SIMT)
— CPU core ≊ GPU multiprocessor (SM)
— Working unit: set of threads (32, a warp)
— Fast switching of threads (large register file)
— Branching: threads of a warp that diverge at an if are executed serially
Tesla V100 multiprocessor: Tensor Cores, 120 TFLOP/s for Deep Learning
[Diagram: scalar operation (A0 + B0 = C0) vs. vector operation (A0…A3 + B0…B3 = C0…C3); SMT threads on CPU cores vs. SIMT warps on a GPU SM]
Graphics: Nvidia Corporation [5]
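To make the branching point concrete, a small sketch (not from the slides) of a kernel whose warps diverge; the array data and its launch configuration are assumptions.

__global__ void branchy(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)          // even and odd lanes of the same warp
        data[i] = 2.0f * data[i];      // take different paths; the warp executes both
    else                               // branches one after the other, masking off
        data[i] = data[i] + 1.0f;      // the inactive threads each time
}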
Maybe GPU’s ultimate feature
CPU: minimizes latency within each thread. GPU: hides latency with computations from other thread warps.
[Diagram: CPU core (low latency) switching between threads T1–T4 vs. GPU streaming multiprocessor (high throughput) cycling through warps W1–W4; states: waiting, ready, context switch, processing]
CPU vs. GPU
Let’s summarize this!
CPU, optimized for low latency: + large main memory; + fast clock rate; + large caches; + branch prediction; + powerful ALU; − relatively low memory bandwidth; − cache misses costly; − low performance per watt
GPU, optimized for high throughput: + high-bandwidth main memory; + latency tolerant (parallelism); + more compute resources; + high performance per watt; − limited memory capacity; − low per-thread performance; − extension card
A simple CPU program as reference!
SAXPY: y⃗ = a x⃗ + y⃗, with single precision; part of BLAS (Level 1)
void saxpy(int n, float a, float * x, float * y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int a = 42;
int n = 10;
float x[n], y[n];
// fill x, y
saxpy(n, a, x, y);
The truth is out there!
Programming GPUs is easy: Just don’t!
Wizard: Breazell [6]
cuBLAS, cuSPARSE, cuFFT, cuRAND, CUDA Math, Theano, …
Parallel algebra
GPU-parallel BLAS (all 152 routines)
Single, double, complex data types Constant competition with Intel’s MKL Multi-GPU support → https://developer.nvidia.com/cublas
http://docs.nvidia.com/cuda/cublas
Code example
float a = 42.0f;
int n = 10;
float x[n], y[n];
// fill x, y

cublasHandle_t handle;                            // Initialize
cublasCreate(&handle);
float * d_x, * d_y;                               // Allocate GPU memory
cudaMallocManaged(&d_x, n * sizeof(x[0]));
cudaMallocManaged(&d_y, n * sizeof(y[0]));
cublasSetVector(n, sizeof(x[0]), x, 1, d_x, 1);   // Copy data to GPU
cublasSetVector(n, sizeof(y[0]), y, 1, d_y, 1);
cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);       // Call BLAS routine
cublasGetVector(n, sizeof(y[0]), d_y, 1, y, 1);   // Copy result to host
cudaFree(d_x); cudaFree(d_y);                     // Finalize
cublasDestroy(handle);
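Building such a program (the file name is a placeholder, not from the slides) typically only requires linking against the cuBLAS library shipped with the CUDA Toolkit:

nvcc saxpy_cublas.cu -lcublas -o saxpy_cublas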
Iterators! Iterators everywhere! Thrust is to CUDA what the STL is to C++
Template library Based on iterators Data-parallel primitives (scan(), sort(), reduce(), … ) Fully compatible with plain CUDA C (comes with CUDA Toolkit) Great with [](){} lambdas! → http://thrust.github.io/
http://docs.nvidia.com/cuda/thrust/
Code example
int a = 42;
int n = 10;
thrust::host_vector<float> x(n), y(n);
// fill x, y

thrust::device_vector<float> d_x = x, d_y = y;
using namespace thrust::placeholders;
thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_y.begin(), a * _1 + _2);
y = d_y;
Code example with lambdas

#include <thrust/for_each.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>

constexpr int gGpuThreshold = 10000;

void saxpy(float *x, float *y, float a, int N) {
    auto r = thrust::counting_iterator<int>(0);
    auto lambda = [=] __host__ __device__ (int i) { y[i] = a * x[i] + y[i]; };
    if (N > gGpuThreshold)
        thrust::for_each(thrust::device, r, r + N, lambda);
    else
        thrust::for_each(thrust::host, r, r + N, lambda);
}
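A note on building the lambda version (not on the slide): __host__ __device__ lambdas need an extra nvcc switch, spelled --expt-extended-lambda in older CUDA toolkits or --extended-lambda in newer ones; the file name below is a placeholder.

nvcc --expt-extended-lambda saxpy_thrust_lambda.cu -o saxpy_thrust_lambda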
GPU Programming with Directives
Keepin’ you portable
Annotate usual source code by directives
#pragma acc loop
for (int i = 0; i < 1; i++) {};
Also: Generalized API functions
acc_copy();
Compiler interprets directives, creates according instructions
Pro: Portability
— Other compiler? No problem! To it, it's a serial program
— Different target architectures from same code
Easy to program
Con: Compiler support limited
Raw power hidden
Somewhat harder to debug
GPU Programming with Directives
The power of… two.
OpenMP Standard for multithreaded programming on the CPU; GPU offloading since 4.0, improved in 4.5
#pragma omp target map(tofrom:y), map(to:x)
#pragma omp teams num_teams(10) num_threads(10)
#pragma omp distribute
for ( ) {
    #pragma omp parallel for
    for ( ) {
        // …
    }
}
OpenACC Similar to OpenMP, but more specifically targeting GPUs; might eventually be re-merged into the OpenMP standard
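A complete OpenMP 4.5 offloading version of SAXPY, as a hedged sketch that fills in the loops the fragment above leaves open; the combined construct roughly corresponds to nesting the directives shown there.

void saxpy_omp(int n, float a, float *x, float *y) {
    // map x in, map y in and back out; the runtime distributes iterations over teams and threads
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}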
Code example
void saxpy_acc(int n, float a, float * x, float * y) {
    #pragma acc kernels
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int a = 42;
int n = 10;
float x[n], y[n];
// fill x, y
saxpy_acc(n, a, x, y);
Code example
void saxpy_acc(int n, float a, float * x, float * y) {
    #pragma acc parallel loop copy(y) copyin(x)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int a = 42;
int n = 10;
float x[n], y[n];
// fill x, y
saxpy_acc(n, a, x, y);
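With the PGI compiler of that era, such a file would typically be built along these lines (the exact flags are an assumption, check your compiler's documentation); -Minfo=accel reports what the compiler did with each loop:

pgcc -acc -ta=tesla -Minfo=accel saxpy_acc.c -o saxpy_acc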
Finally…
Two solutions:
OpenCL Open Computing Language, by the Khronos Group (Apple, IBM, NVIDIA, …), 2009
— Platform: programming language (OpenCL C/C++), API, and compiler
— Targets CPUs, GPUs, FPGAs, and other many-core machines
— Fully open source
— Different compilers available
CUDA NVIDIA's GPU platform, 2007
— Platform: drivers, programming language (CUDA C/C++), API, compiler, debuggers, profilers, …
— Only NVIDIA GPUs
— Compilation with nvcc (free, but not open); clang has CUDA support, but CUDA is needed for the last step
— Also: CUDA Fortran
Choose what flavor you like, what colleagues/collaborations are using. Hardest: come up with a parallelized algorithm
Warp the kernel, it’s a thread!
Methods to exploit parallelism:
— Threads → Block
— Blocks → Grid
— Threads & blocks in 3D
[Diagram: threads numbered 0–5 grouped into blocks 1, 2, … forming a grid]
Parallel function: kernel
— __global__ void kernel(int a, float * b) { }
— Access own ID by global variables threadIdx.x, blockIdx.y, …
Execution entity: threads
— Lightweight → fast switching!
— 1000s of threads execute simultaneously → order non-deterministic!
⇒ SAXPY!
With runtime-managed data transfers
__global__ void saxpy_cuda(int n, float a, float * x, float * y) {  // Specify kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;                  // ID variables
    if (i < n)                                                      // Guard against too many threads
        y[i] = a * x[i] + y[i];
}

int a = 42;
int n = 10;
float *x, *y;
cudaMallocManaged(&x, n * sizeof(float));   // Allocate GPU-capable memory
cudaMallocManaged(&y, n * sizeof(float));
// fill x, y
saxpy_cuda<<<2, 5>>>(n, a, x, y);           // Call kernel: 2 blocks, each 5 threads
cudaDeviceSynchronize();                    // Wait for kernel to finish
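The hard-coded <<<2, 5>>> only fits n = 10. A common pattern (a sketch, not from the slides) derives the grid from the problem size:

int threadsPerBlock = 256;                                 // typical block size (assumption)
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up so every element gets a thread
saxpy_cuda<<<blocks, threadsPerBlock>>>(n, a, x, y);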
Libraries with ready-programmed abstractions; partly a compiler/transpiler is necessary. Different backends to choose from for the targeted accelerator. Conceptually between Thrust, OpenACC, and CUDA. Examples: Kokkos, Alpaka, Futhark, HIP, C++AMP, …
Kokkos, from Sandia National Laboratories
C++ library for performance portability Data-parallel patterns, architecture-aware memory layouts, … → https://github.com/kokkos/kokkos/
Kokkos::View<double*> x("X", length);
Kokkos::View<double*> y("Y", length);
double a = 2.0;
// Fill x, y
Kokkos::parallel_for(length, KOKKOS_LAMBDA (const int& i) {
    x(i) = a * x(i) + y(i);
});
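The snippet assumes Kokkos is already set up; in a standalone program this happens explicitly (a minimal sketch, the backend is chosen at build time):

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        // Views and Kokkos::parallel_for as above live inside this scope,
        // so they are destroyed before finalize()
    }
    Kokkos::finalize();
    return 0;
}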
GPU Tools
The helpful helpers helping helpless (and others)
NVIDIA:
cuda-gdb GDB-like command line utility for debugging
cuda-memcheck Like Valgrind's memcheck, for checking errors in memory accesses
Nsight IDE for GPU development, based on Eclipse (Linux, OS X) or Visual Studio (Windows)
nvprof Command line profiler, including detailed performance counters
Visual Profiler Timeline profiling and annotated performance experiments
OpenCL: CodeXL (open source, GPUOpen/AMD), debugging and profiling
nvprof
Command that line
Usage: nvprof ./app
$ nvprof ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
==37064== Profiling application: ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
==37064== Profiling result:
Time(%)     Time  Calls       Avg       Min       Max  Name
 99.19%  262.43ms   301  871.86us  863.88us  882.44us  void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
  0.58%  1.5428ms     2  771.39us  764.65us  778.12us  [CUDA memcpy HtoD]
  0.23%  599.40us     1  599.40us  599.40us  599.40us  [CUDA memcpy DtoH]
==37064== API calls:
Time(%)     Time  Calls       Avg       Min       Max  Name
 61.26%  258.38ms     1  258.38ms  258.38ms  258.38ms  cudaEventSynchronize
 35.68%  150.49ms     3  50.164ms  914.97us  148.65ms  cudaMalloc
  0.73%  3.0774ms     3  1.0258ms  1.0097ms  1.0565ms  cudaMemcpy
  0.62%  2.6287ms     4  657.17us  655.12us  660.56us  cuDeviceTotalMem
  0.56%  2.3408ms   301  7.7760us  7.3810us  53.103us  cudaLaunch
  0.48%  2.0111ms   364  5.5250us     235ns  201.63us  cuDeviceGetAttribute
  0.21%  872.52us     1  872.52us  872.52us  872.52us  cudaDeviceSynchronize
  0.15%  612.20us  1505     406ns     361ns  1.1970us  cudaSetupArgument
  0.12%  499.01us     3  166.34us  140.45us  216.16us  cudaFree
  0.11%  477.69us     1  477.69us  477.69us  477.69us  cudaGetDeviceProperties
  0.04%  179.27us     4  44.817us  40.744us  53.504us  cuDeviceGetName
  0.03%  136.20us   301     452ns     401ns  2.4000us  cudaConfigureCall
  0.00%  9.0850us     2  4.5420us  3.4760us  5.6090us  cudaEventRecord
  0.00%  8.7210us     1  8.7210us  8.7210us  8.7210us  cudaGetDevice
With metrics: nvprof --metrics flop_sp_efficiency ./app
$ nvprof --metrics flop_sp_efficiency ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
[Matrix Multiply Using CUDA] - Starting...
==37122== NVPROF is profiling process 37122, command: ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
GPU Device 0: "Tesla P100-SXM2-16GB" with compute capability 6.0
MatrixA(1024,1024), MatrixB(1024,1024)
Computing result using CUDA Kernel...
==37122== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==37122== Replaying kernel "void matrixMulCUDA<int=32>(float*, float*, float*, int, int)" (0 of 2)...
==37122== Replaying kernel "void matrixMulCUDA<int=32>(float*, float*, float*, int, int)" (done)
...
==37122== Replaying kernel "void matrixMulCUDA<int=32>(float*, float*, float*, int, int)" (done)
Performance= 26.61 GFlop/s, Time= 80.697 msec, Size= 2147483648 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==37122== Profiling application: ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
==37122== Profiling result:
==37122== Metric result:
Invocations  Metric Name         Metric Description            Min     Max     Avg
Device "Tesla P100-SXM2-16GB (0)"
    Kernel: void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
        301  flop_sp_efficiency  FLOP Efficiency(Peak Single)  22.96%  23.40%  23.15%
Your new favorite tool
Application acceleration approaches:
Libraries → Drop-in Acceleration
OpenACC Directives → Easy Acceleration
Programming Languages → Flexible Acceleration
For fashionable modern programmers
There’s so much more!
What I did not talk about: atomic operations, shared memory, pinned memory, managed memory, debugging, overlapping streams, multi-GPU programming (intra-node; MPI), cooperative threads, half precision (FP16), …
GPUs can improve your performance many-fold
For a fitting, parallelizable application. Libraries are easiest; direct programming (plain CUDA) is most powerful; OpenACC is somewhere in between (and portable). There are many tools helping the programmer. → See it in action this afternoon at the OpenACC tutorial
a.herten@fz-juelich.de
Appendix: Further Reading & Links, GPU Performances, Glossary, References
More!
A discussion of SIMD, SIMT, SMT by Y. Kreinin. NVIDIA’s documentation: docs.nvidia.com NVIDIA’s Parallel For All blog
Table 1. Comparison of NVIDIA Tesla GPUs
Tesla Product             Tesla K40         Tesla M40         Tesla P100        Tesla V100
GPU                       GK180 (Kepler)    GM200 (Maxwell)   GP100 (Pascal)    GV100 (Volta)
SMs                       15                24                56                80
TPCs                      15                24                28                40
FP32 Cores / SM           192               128               64                64
FP32 Cores / GPU          2880              3072              3584              5120
FP64 Cores / SM           64                4                 32                32
FP64 Cores / GPU          960               96                1792              2560
Tensor Cores / SM         NA                NA                NA                8
Tensor Cores / GPU        NA                NA                NA                640
GPU Boost Clock           810/875 MHz       1114 MHz          1480 MHz          1462 MHz
Peak FP32 TFLOPS (1)      5                 6.8               10.6              15
Peak FP64 TFLOPS (1)      1.7               0.21              5.3               7.5
Peak Tensor TFLOPS (1)    NA                NA                NA                120
Texture Units             240               192               224               320
Memory Interface          384-bit GDDR5     384-bit GDDR5     4096-bit HBM2     4096-bit HBM2
Memory Size               Up to 12 GB       Up to 24 GB       16 GB             16 GB
L2 Cache Size             1536 KB           3072 KB           4096 KB           6144 KB
Shared Memory Size / SM   16 KB/32 KB/48 KB 96 KB             64 KB             Configurable up to 96 KB
Register File Size / SM   256 KB            256 KB            256 KB            256 KB
Register File Size / GPU  3840 KB           6144 KB           14336 KB          20480 KB
TDP                       235 Watts         250 Watts         300 Watts         300 Watts
Transistors               7.1 billion       8 billion         15.3 billion      21.1 billion
GPU Die Size              551 mm²           601 mm²           610 mm²           815 mm²
Manufacturing Process     28 nm             28 nm             16 nm FinFET+     12 nm FFN
(1) Peak TFLOPS rates are based on GPU Boost Clock
Figure: Tesla V100 performance characteristics in comparison [5]
API A programmatic interface to software by well-defined
ATI Canada-based GPU manufacturing company; bought by AMD in 2006. 4, 118
CPI Cycles per Instruction; a metric to determine efficiency of an architecture or program. 118
CUDA Computing platform for GPUs from NVIDIA. Provides,
among others, CUDA C/C++. 4, 63, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 98, 99, 112, 113, 118
DSL A Domain-Specific Language is a specialization of a
more general language to a specific domain. 2, 3, 97, 98, 99, 118
GCC The GNU Compiler Collection, the collection of open
source compilers, among others for C and Fortran. 118
IPC Instructions per Cycle; a metric to determine efficiency
LLVM An open Source compiler infrastructure, providing,
among others, Clang for C. 118
MPI The Message Passing Interface, an API definition for multi-node computing. 111, 118
NVIDIA US technology company creating GPUs. 2, 3, 4, 75, 76,
77, 78, 79, 102, 103, 115, 118
NVLink NVIDIA’s communication protocol connecting CPU ↔ GPU and GPU ↔ GPU with 80 GB/s. PCI-Express:
16 GB/s. 118
OpenACC Directive-based programming, primarily for many-core
OpenCL The Open Computing Language. Framework for writing
code for heterogeneous architectures (CPU, GPU, DSP,
FPGA). The alternative to CUDA. 75, 76, 77, 78, 79, 102,
103, 118
OpenGL The Open Graphics Library, an API for rendering graphics across different hardware architectures. 118
OpenMP Directive-based programming, primarily for
multi-threaded machines. 70, 118
P100 A large GPU with the Pascal architecture from NVIDIA. It
employs NVLink as its interconnect and has fast HBM2
PAPI The Performance API, a C/C++ API for querying
performance counters. 118
Pascal GPU architecture from NVIDIA (announced 2016). 118
perf Part of the Linux kernel which facilitates access to performance counters; comes with command line
PGI Compiler creators. Formerly The Portland Group, Inc.;
since 2013 part of NVIDIA. 118
POWER8 CPU architecture from IBM, available also under the
OpenPOWER Foundation. 118
SAXPY Single-precision A × X + Y. A simple code example of scaling a vector and adding an offset. 47, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 118
Score-P Collection of tools for instrumenting and subsequently
scoring applications to gain insight into the program’s
Tesla The GPU product line for general purpose computing of NVIDIA. 118
Thrust A parallel algorithms library for (among others) GPUs.
See https://thrust.github.io/. 63, 118
[1] Chris McClanahan. "History and evolution of gpu architecture". In: A Survey Paper (2010). URL: http://mcclanahoochie.com/blog/wp-content/uploads/2011/03/gpu-hist-paper.pdf (page 4).
[2] Karl Rupp. Pictures: CPU/GPU Performance Comparison. URL: https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/ (pages 5–7).
[3] Mark Lee. Picture: kawasaki ninja. URL: https://www.flickr.com/photos/pochacco20/39030210/ (pages 12, 13).
[4] Shearings Holidays. Picture: Shearings coach 636. URL: https://www.flickr.com/photos/shearings/13583388025/ (pages 12, 13).
[5] Nvidia Corporation. Pictures: Volta GPU. Volta Architecture. URL: https://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf (pages 38–41, 116).
[6] Wes Breazell. Picture: Wizard. URL: https://thenounproject.com/wes13/collection/its-a-wizards-world/ (pages 48–52, 61, 62).