GPUs: Platform, Programming, Pitfalls
GridKa School 2016: Data Science on Modern Architectures

  1. GPUs: Platform, Programming, Pitfalls. GridKa School 2016: Data Science on Modern Architectures. Andreas Herten, Forschungszentrum Jülich, 1 September 2016. Member of the Helmholtz Association.

  2. About, Outline. Andreas Herten: physics in Aachen (Dipl. at CMS) and Jülich/Bochum (Dr. at PANDA); since then: NVIDIA Application Lab, optimizing scientific applications for/on GPUs. [Figure: line Hough transform around isochrone (5° step size), y/cm vs. x/cm.] Outline: Motivation, Platform, Hardware Features, Programming (Libraries, Directives, Languages, Tools), Pitfalls.

  3. Status Quo: GPUs all around. 1999: general computations with shaders of graphics hardware. 2001: NVIDIA GeForce 3 with programmable shaders [1]. 2003: DirectX 9 at ATI. 2016: Top 500: 1/10 with GPUs; Green 500: 70 % of top 50 with GPUs.

  4. Status Quo: GPUs all around. Graphic: Rupp [2].

  5. Status Quo: GPUs all around (build of slide 4). Graphic: Rupp [2].

  6. Status Quo: GPUs all around (continued).

  7. Status Quo: GPUs all around. But why?! Let's find out!

  8. Platform

  9. CPU vs. GPU: a matter of specialties. Transporting one vs. transporting many. Graphics: Lee [3] and Shearings Holidays [4].

  10. CPU vs. GPU: chip layout. [Diagram: the CPU die spends its area on control logic and cache next to a few ALUs, with its DRAM; the GPU die is packed with ALUs, with only small control/cache, and its own DRAM.]

  11. GPU Architecture: Overview. Aim: hide latency; everything else follows. Asynchronicity, Memory, SIMT, High Throughput.

  12. GPU Architecture: Overview (repeated as section divider).

  13. Memory. GPU: accelerator / extension card → separate device from CPU, with separate memory (but UVA and UM). GPU memory ain't no CPU memory: memory transfers need special consideration, so do as little as possible! Formerly: explicitly copy data to/from GPU; now: done automatically (performance…?). Bandwidths: host memory ≈ 80 GB/s; PCIe < 16 GB/s between host and device; device memory < 720 GB/s (HBM2); NVLink as faster interconnect. Values for P100: 16 GB RAM, 720 GB/s. [Diagram: Host (Control, ALUs, Cache, DRAM) ↔ Device (ALUs, DRAM).]

  14. GPU Architecture: Overview (repeated).

  15. Async: following different streams. Problem: memory transfer is comparably slow. Solution: do something else in the meantime (computation)! → Overlap tasks. Copy and compute engines run separately (streams). GPU needs to be fed: schedule many computations. CPU can do other work while GPU computes; synchronization.

  16. GPU Architecture: Overview (repeated).

  17. SIMT: of threads and warps. CPU: Single Instruction, Multiple Data (SIMD: one vector instruction computes Ai + Bi = Ci for all i) and Simultaneous Multithreading (SMT: several threads per core). GPU: Single Instruction, Multiple Threads (SIMT). CPU core ≊ GPU multiprocessor (SM); working unit: set of 32 threads (a warp); fast switching of threads (large register file); branching (if).

  18. SIMT: of threads and warps (build of slide 17, adding the Pascal GP100 diagram; graphics: Nvidia Corporation [5]).

  19. SIMT: of threads and warps (build, highlighting one multiprocessor of Pascal GP100; graphics: Nvidia Corporation [5]).

  20. Latency Hiding: the GPU's ultimate feature. CPU core (low-latency processor): minimizes latency within each thread (T1…T4). GPU streaming multiprocessor (high-throughput processor): hides latency with computations from other thread groups (warps W1…W4: waiting, ready, context switch). Graphics: Meinke and Nvidia Corporation [6].

  21. CPU vs. GPU: low latency vs. high throughput.
      CPU (optimized for low latency): + large main memory; + fast clock rate; + large caches; + branch prediction; + powerful ALU; − relatively low memory bandwidth; − cache misses costly; − low performance per watt.
      GPU (optimized for high throughput): + high-bandwidth main memory; + latency tolerant (parallelism); + more compute resources; + high performance per watt; − limited memory capacity; − low per-thread performance; − extension card.

  22. Programming

  23. Preface: CPU. A simple CPU program! SAXPY: y = a·x + y on vectors, with single precision. Part of BLAS Level 1.

      void saxpy(int n, float a, float *x, float *y) {
          for (int i = 0; i < n; i++)
              y[i] = a * x[i] + y[i];
      }

      int a = 42;
      int n = 10;
      float x[n], y[n];
      // fill x, y

      saxpy(n, a, x, y);

  24. Libraries: the truth is out there! Programming GPUs is easy: just don't! Use applications & libraries: cuBLAS, cuSPARSE, cuFFT, cuRAND, CUDA Math. Wizard graphic: Breazell [7].

  25. Libraries (build of slide 24), adding more, e.g. Theano.

  26. cuBLAS: GPU-parallel BLAS (all 152 routines). Single, double, complex data types. Multi-GPU support. Constant competition with Intel's MKL. http://docs.nvidia.com/cuda/cublas → https://developer.nvidia.com/cublas

  27. cuBLAS: code example.

      int n = 10;
      int a = 42;
      float x[n], y[n];
      // fill x, y

      float *d_x, *d_y;
      cublasInit();
      cudaMalloc((void **)&d_x, n * sizeof(x[0]));
      cudaMalloc((void **)&d_y, n * sizeof(y[0]));
      cublasSetVector(n, sizeof(x[0]), x, 1, d_x, 1);
      cublasSetVector(n, sizeof(y[0]), y, 1, d_y, 1);
      cublasSaxpy(n, a, d_x, 1, d_y, 1);
      cublasGetVector(n, sizeof(y[0]), d_y, 1, y, 1);
      cublasShutdown();
