
GPU Programming 101
GridKa School 2017: make science && run
Andreas Herten, Forschungszentrum Jülich (Member of the Helmholtz Association), 31 August 2017


  1.-2. GPU memory ain't no CPU memory
     GPU: accelerator / extension card → a separate device from the CPU, with separate memory (but UVA and Unified Memory exist)
     Memory transfers need special consideration! Do as little of them as possible!
     - Formerly: explicitly copy data to/from the GPU
     - Now: done automatically (performance…?) (see the sketch below)
     Schematic: Host (Control, ALU, Cache, DRAM) connected via NVLink to the Device (GPU with HBM2 memory)
     Example values: NVLink ≈ 160 GB/s (P100), ≈ 300 GB/s (V100); device memory (HBM2): P100 16 GB RAM at up to 720 GB/s, V100 16 GB RAM at up to 900 GB/s
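
     To make the "formerly vs. now" point concrete, the two transfer styles look roughly like this in CUDA C. This is a sketch, not from the slides; n, blocks, threads, and the kernel work are placeholders.

        // Formerly: explicit copies between host and device memory
        float *h = (float *)malloc(n * sizeof(float));
        float *d;
        cudaMalloc((void **)&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);   // copy input to the GPU
        work<<<blocks, threads>>>(d, n);
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);   // copy result back

        // Now: Unified Memory, transfers handled by the runtime
        float *u;
        cudaMallocManaged(&u, n * sizeof(float));
        work<<<blocks, threads>>>(u, n);
        cudaDeviceSynchronize();   // after the kernel finishes, u is accessible on the host again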

  3.-4. Overview: GPU Architecture
     Aim: hide latency; everything else follows:
     - Asynchronicity
     - Memory
     - SIMT
     - High Throughput

  5. Asynchronicity: following different streams
     Problem: memory transfer is comparably slow.
     Solution: do something else in the meantime (computation)! → Overlap tasks (sketched below)
     - Copy and compute engines run separately (streams)
     - The GPU needs to be fed: schedule many computations
     - The CPU can do other work while the GPU computes; synchronize when needed
     - Also: fast switching of contexts keeps the GPU busy
     Timeline: sequential Copy → Compute → Copy → Compute vs. overlapped asynchronous copies and computes
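
     A minimal sketch of overlapping copy and compute with CUDA streams. Not from the slides; chunk, blocks, threads, the kernel, and the pointer names are placeholders, and the host buffers would need to be allocated with cudaMallocHost for the copies to be truly asynchronous.

        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        // Stream 0: copy the first chunk and compute on it
        cudaMemcpyAsync(d_a, h_a, chunk, cudaMemcpyHostToDevice, s0);
        kernel<<<blocks, threads, 0, s0>>>(d_a);

        // Stream 1: meanwhile, the copy engine moves the second chunk
        cudaMemcpyAsync(d_b, h_b, chunk, cudaMemcpyHostToDevice, s1);
        kernel<<<blocks, threads, 0, s1>>>(d_b);

        cudaDeviceSynchronize();   // wait for both streams before using the results
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);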


  8.-18. Of threads and warps
     CPU:
     - Single Instruction, Multiple Data (SIMD): a vector unit computes C0..C3 = A0..A3 + B0..B3 in one instruction
     - Simultaneous Multithreading (SMT): a core hosts a few hardware threads
     GPU: Single Instruction, Multiple Threads (SIMT)
     - CPU core ≊ GPU multiprocessor (SM)
     - Working unit: a set of 32 threads (a warp)
     - Fast switching between threads (large register file)
     - Branching (if): the threads of a warp execute in lockstep, so divergent branches are serialized (see the sketch below)
     Example hardware: Tesla V100 and its multiprocessor, including Tensor Cores (about 120 TFLOP/s for deep learning). Graphics: Nvidia Corporation [5]
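
     A small illustration of that branching point as a hypothetical CUDA kernel (not from the slides): both sides of the if are executed by the warp, with the inactive lanes masked off each time.

        __global__ void diverge(float *out) {
            int i = threadIdx.x;
            if (i % 2 == 0)            // even lanes take this path ...
                out[i] = 2.0f * i;
            else                       // ... then the odd lanes take this one
                out[i] = -1.0f * i;
            // within one warp, the two paths run one after the other, not in parallel
        }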

  19.-21. Low Latency vs. High Throughput (maybe the GPU's ultimate feature)
     CPU core: optimized for low latency; minimizes the latency within each thread (threads T1..T4, with context switches while waiting)
     GPU streaming multiprocessor: optimized for high throughput; hides latency with computations from other warps (warps W1..W4 cycle between processing, ready, and waiting)

  22. CPU vs. GPU: let's summarize this!
     CPU: optimized for low latency
     + Large main memory
     + Fast clock rate
     + Large caches
     + Branch prediction
     + Powerful ALU
     − Relatively low memory bandwidth
     − Cache misses are costly
     − Low performance per watt
     GPU: optimized for high throughput
     + High-bandwidth main memory
     + Latency tolerant (through parallelism)
     + More compute resources
     + High performance per watt
     − Limited memory capacity
     − Low per-thread performance
     − Extension card

  23. Programming GPUs

  24. Preface: CPU (a simple CPU program as reference)
     SAXPY: y = a*x + y on vectors, with single precision; part of BLAS Level 1 (as used by LAPACK)

        void saxpy(int n, float a, float *x, float *y) {
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }

        int a = 42;
        int n = 10;
        float x[n], y[n];
        // fill x, y
        saxpy(n, a, x, y);
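
     For reference, the snippet compiles into a complete program with only a little scaffolding; this is a hypothetical main, not part of the slides.

        #include <stdio.h>

        void saxpy(int n, float a, float *x, float *y) {
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }

        int main(void) {
            int n = 10;
            float a = 42.0f;                   /* the slide uses int a = 42 */
            float x[10], y[10];
            for (int i = 0; i < n; i++) {      /* "fill x, y" */
                x[i] = (float)i;
                y[i] = 1.0f;
            }
            saxpy(n, a, x, y);
            printf("y[9] = %.1f\n", y[9]);     /* 42*9 + 1 = 379.0 */
            return 0;
        }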

  25.-29. Libraries: the truth is out there! Use applications & libraries!
     Programming GPUs is easy: just don't (use libraries instead)!
     Examples: cuBLAS, cuSPARSE, cuFFT, cuRAND, the CUDA Math library; higher-level frameworks such as Theano
     Wizard graphic: Breazell [6]

  30. cuBLAS
     GPU-parallel BLAS (all 152 routines); parallel algebra
     Single, double, and complex data types; multi-GPU support
     Constant competition with Intel's MKL
     http://docs.nvidia.com/cuda/cublas → https://developer.nvidia.com/cublas

  31.-37. cuBLAS: code example

        int a = 42;
        int n = 10;
        float x[n], y[n];
        // fill x, y

        cublasHandle_t handle;
        cublasCreate(&handle);                              // Initialize

        float *d_x, *d_y;
        cudaMallocManaged(&d_x, n * sizeof(x[0]));          // Allocate GPU memory
        cudaMallocManaged(&d_y, n * sizeof(y[0]));

        cublasSetVector(n, sizeof(x[0]), x, 1, d_x, 1);     // Copy data to GPU
        cublasSetVector(n, sizeof(y[0]), y, 1, d_y, 1);

        cublasSaxpy(n, a, d_x, 1, d_y, 1);                  // Call BLAS routine

        cublasGetVector(n, sizeof(y[0]), d_y, 1, y, 1);     // Copy result to host

        cudaFree(d_x);                                      // Finalize
        cudaFree(d_y);
        cublasDestroy(handle);
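
     Note that the SAXPY call above follows the legacy BLAS-style interface; the current cuBLAS API (cublas_v2.h) takes the handle and a pointer to the scalar, so the central call would rather read as follows (a sketch based on the documented cublasSaxpy signature):

        float alpha = 42.0f;
        cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // handle first, alpha passed by pointer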


  40. Thrust: iterators! Iterators everywhere!
     The STL of CUDA: a C++ template library, based on iterators
     Data-parallel primitives (scan(), sort(), reduce(), …)
     Fully compatible with plain CUDA C (comes with the CUDA Toolkit)
     Great with [](){} lambdas!
     http://docs.nvidia.com/cuda/thrust/ → http://thrust.github.io/

  41. Thrust: code example

        int a = 42;
        int n = 10;
        thrust::host_vector<float> x(n), y(n);
        // fill x, y
        thrust::device_vector<float> d_x = x, d_y = y;

        using namespace thrust::placeholders;
        thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_y.begin(),
                          a * _1 + _2);

        x = d_x;   // copy device data back to the host

  42. Thrust: code example with lambdas

        #include <thrust/execution_policy.h>
        #include <thrust/for_each.h>
        #include <thrust/iterator/counting_iterator.h>

        constexpr int gGpuThreshold = 10000;

        void saxpy(float *x, float *y, float a, int N) {
            auto r = thrust::counting_iterator<int>(0);
            auto lambda = [=] __host__ __device__ (int i) {
                y[i] = a * x[i] + y[i];
            };
            if (N > gGpuThreshold)
                thrust::for_each(thrust::device, r, r + N, lambda);
            else
                thrust::for_each(thrust::host, r, r + N, lambda);
        }
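
     Because the lambda captures the raw pointers by value, the same pointers must be valid on both host and device for this dispatch to work; managed memory is the simplest way to get that. A sketch with a hypothetical main, compiled with nvcc (device lambdas also need the --expt-extended-lambda flag):

        int main() {
            const int N = 1 << 20;
            float *x, *y;
            cudaMallocManaged(&x, N * sizeof(float));
            cudaMallocManaged(&y, N * sizeof(float));
            for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
            saxpy(x, y, 42.0f, N);       // N > gGpuThreshold, so this runs on the device
            cudaDeviceSynchronize();     // make sure the GPU pass is done before reading y
            cudaFree(x);
            cudaFree(y);
            return 0;
        }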

  43. Programming GPUs: Directives

  44.-46. GPU Programming with Directives (keepin' you portable)
     Annotate the usual source code with directives:

        #pragma acc loop
        for (int i = 0; i < 1; i++) {};

     Also: generalized API functions, e.g. acc_copy()
     The compiler interprets the directives and creates the corresponding instructions
     Pro:
     - Easy to program
     - Portability: different target architectures from the same code
     - Other compiler? No problem! To it, it's a serial program
     Con:
     - Only limited compiler support
     - Raw power hidden
     - Somewhat harder to debug

  47. GPU Programming with Directives: the power of… two.
     OpenMP: standard for multithreaded programming on the CPU; GPU support since 4.0, better since 4.5 (a compact combined-construct version is sketched below)

        #pragma omp target map(tofrom:y), map(to:x)
        #pragma omp teams num_teams(10) num_threads(10)
        #pragma omp distribute
        for (…) {
            #pragma omp parallel for
            for (…) {
                // …
            }
        }

     OpenACC: similar to OpenMP, but more specifically for GPUs; might eventually be re-merged into the OpenMP standard
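
     For comparison, the same structure can be written with the combined OpenMP 4.5 construct; a minimal SAXPY sketch, not from the slides, with array-section map clauses assumed:

        void saxpy_omp(int n, float a, float *x, float *y) {
            #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }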

  48. OpenACC: code example

        void saxpy_acc(int n, float a, float *x, float *y) {
            #pragma acc kernels
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }

        int a = 42;
        int n = 10;
        float x[n], y[n];
        // fill x, y
        saxpy_acc(n, a, x, y);

  49.-50. OpenACC: code example with explicit data clauses

        void saxpy_acc(int n, float a, float *x, float *y) {
            #pragma acc parallel loop copy(y) copyin(x)
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }

        int a = 42;
        int n = 10;
        float x[n], y[n];
        // fill x, y
        saxpy_acc(n, a, x, y);

     GPU tutorial this afternoon!
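
     When saxpy_acc is called many times, the transfers implied by the loop directive become the bottleneck. A surrounding data region (standard OpenACC, not shown on the slides; iterations is a placeholder) keeps x and y resident on the device, and the inner copy/copyin clauses then find the data already present:

        #pragma acc data copyin(x[0:n]) copy(y[0:n])
        {
            for (int iter = 0; iter < iterations; iter++)
                saxpy_acc(n, a, x, y);    // no host/device transfers inside the region
        }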

  51. Programming GPUs: Languages

  52.-56. Programming GPUs directly: finally…
     Two solutions:
     OpenCL: Open Computing Language by the Khronos Group (Apple, IBM, NVIDIA, …), 2009
     - Platform: programming language (OpenCL C/C++), API, and compiler
     - Targets CPUs, GPUs, FPGAs, and other many-core machines
     - Fully open source
     - Different compilers available
     CUDA: NVIDIA's GPU platform, 2007
     - Platform: drivers, programming language (CUDA C/C++), API, compiler, debuggers, profilers, …
     - Only NVIDIA GPUs
     - Compilation with nvcc (free, but not open); clang has CUDA support, but CUDA is still needed for the last step
     - Also: CUDA Fortran
     Choose what flavor you like, what colleagues/collaborations are using
     Hardest part: coming up with a parallelized algorithm

  57.-66. CUDA Threading Model: warp the kernel, it's a thread!
     Methods to exploit parallelism:
     - Threads are grouped into a block; blocks are grouped into the grid
     - Threads and blocks can be indexed in up to 3D (3D grid of 3D blocks)
     - On the hardware, threads are scheduled in warps
     Execution entity: threads
     - Lightweight → fast switching!
     - 1000s of threads execute simultaneously → order is non-deterministic!
     Parallel function: kernel

        __global__ void kernel(int a, float *b) { }

     - Each thread accesses its own ID via built-in variables: threadIdx.x, blockIdx.y, …
     ⇒ SAXPY!

  67.-73. CUDA SAXPY (with runtime-managed data transfers)

        __global__ void saxpy_cuda(int n, float a, float *x, float *y) {   // Specify kernel
            int i = blockIdx.x * blockDim.x + threadIdx.x;                 // ID variables
            if (i < n)                                                     // Guard against too many threads
                y[i] = a * x[i] + y[i];
        }

        int a = 42;
        int n = 10;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));          // Allocate GPU-capable memory
        cudaMallocManaged(&y, n * sizeof(float));
        // fill x, y

        saxpy_cuda<<<2, 5>>>(n, a, x, y);                  // Call kernel: 2 blocks, each 5 threads
        cudaDeviceSynchronize();                           // Wait for the kernel to finish
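
     The <<<2, 5>>> configuration is hard-wired to n = 10. For general n, the usual pattern (standard CUDA practice, not spelled out on the slides) derives the number of blocks from a chosen block size:

        int blockSize = 256;
        int numBlocks = (n + blockSize - 1) / blockSize;   // round up so every element gets a thread
        saxpy_cuda<<<numBlocks, blockSize>>>(n, a, x, y);
        cudaDeviceSynchronize();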

  74. Programming GPUs: Abstraction Libraries / DSLs

  75.-76. Abstraction Libraries & DSLs (between Thrust, OpenACC, and CUDA)
     Libraries with ready-made abstractions; partly a compiler/transpiler is necessary
     They have different backends to choose from for the targeted accelerator
     Examples: Kokkos, Alpaka, Futhark, HIP, C++ AMP, …

  77. An Alternative: Kokkos
     From Sandia National Laboratories; a C++ library for performance portability
     Data-parallel patterns, architecture-aware memory layouts, …
     → https://github.com/kokkos/kokkos/

        Kokkos::View<double*> x("X", length);
        Kokkos::View<double*> y("Y", length);
        double a = 2.0;
        // Fill x, y
        Kokkos::parallel_for(length, KOKKOS_LAMBDA (const int &i) {
            x(i) = a * x(i) + y(i);
        });
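
     The snippet assumes Kokkos has already been initialized; a minimal, hypothetical host program wrapping it could look like this (the inner scope makes sure the Views are destroyed before Kokkos::finalize):

        #include <Kokkos_Core.hpp>

        int main(int argc, char *argv[]) {
            Kokkos::initialize(argc, argv);
            {
                const int length = 10;
                Kokkos::View<double*> x("X", length);
                Kokkos::View<double*> y("Y", length);
                double a = 2.0;
                // Fill x, y here, e.g. with another parallel_for
                Kokkos::parallel_for(length, KOKKOS_LAMBDA (const int i) {
                    x(i) = a * x(i) + y(i);
                });
                Kokkos::fence();   // wait for the kernel before reading the result
            }
            Kokkos::finalize();
            return 0;
        }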
