GPUs: Platform, Programming, Pitfalls
GridKa School 2016: Data Science on Modern Architectures
Andreas Herten, Forschungszentrum Jülich, 1 September 2016
About, Outline
Andreas Herten
Physics in
— Aachen (Dipl. at CMS)
— Jülich/Bochum (Dr. at PANDA)
[Figure: line Hough transform around an isochrone (5° step size)]
Since then: NVIDIA Application Lab
Optimizing scientific applications for/on GPUs
Outline:
Motivation
Platform: Hardware, Features
Programming: Libraries, Directives, Languages, Tools
Pitfalls
GPU all around
1999: General computations with shaders of graphics hardware
2001: NVIDIA GeForce 3 with programmable shaders [1]
2003: DirectX 9 at ATI
2016: Top 500: 1/10 with GPUs; Green 500: 70 % of top 50 with GPUs
Graphic: Rupp [2]
CPU vs. GPU
A matter of specialties
Transporting one (motorbike) vs. transporting many (coach)
Graphics: Lee [3] and Shearings Holidays [4]
[Diagram: CPU chip with control logic, cache, a few ALUs, and DRAM next to a GPU chip with many ALUs and its own DRAM]
GPU Architecture
Overview
Everything else follows
SIMT
GPU memory ain’t no CPU memory
[Diagram: Host (CPU with control, cache, ALUs, DRAM) and Device (GPU with its own DRAM), connected via PCIe (<16 GB/s); GPU memory: HBM2 (<720 GB/s); NVLink: ≈80 GB/s]
GPU: accelerator / extension card → separate device from the CPU
Separate memory, but UVA and UM
Memory transfers need special consideration! Do as little as possible!
Formerly: explicitly copy data to/from the GPU
Now: done automatically (performance…?); see the sketch below
Values for P100: 16 GB RAM, 720 GB/s
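A minimal sketch of both approaches (illustrative only, not from the slides; assumes a host array x of n floats and some kernel to launch):

// Formerly: explicit allocation on the device and explicit copies
float *d_x;
cudaMalloc((void **)&d_x, n * sizeof(float));
cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);   // host → device
// … launch kernel working on d_x …
cudaMemcpy(x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);   // device → host
cudaFree(d_x);

// Now: Unified Memory, one pointer usable from host and device
float *u_x;
cudaMallocManaged(&u_x, n * sizeof(float));
// … fill u_x on the host, launch kernel on u_x, cudaDeviceSynchronize() before reading results …
cudaFree(u_x);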
Following different streams
Problem: Memory transfer is comparably slow
Solution: Do something else in the meantime (computation)! → Overlap tasks (see the sketch below)
Copy and compute engines run separately (streams)
GPU needs to be fed: schedule many computations
CPU can do other work while the GPU computes; synchronization
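A minimal sketch of such an overlap (illustrative only, not from the slides; my_kernel, d_x, x, chunk, blocks and threads are assumed to exist, and x should be pinned host memory for the copies to truly overlap):

cudaStream_t s[2];
for (int i = 0; i < 2; i++)
    cudaStreamCreate(&s[i]);

for (int i = 0; i < 2; i++) {
    int offset = i * chunk;
    // the copy of one stream can overlap with the kernel of the other stream
    cudaMemcpyAsync(d_x + offset, x + offset, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, s[i]);
    my_kernel<<<blocks, threads, 0, s[i]>>>(d_x + offset, chunk);
}
// the CPU is free to do other work here …
cudaDeviceSynchronize();   // … until the results are needed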
Of threads and warps
CPU:
— Single Instruction, Multiple Data (SIMD)
— Simultaneous Multithreading (SMT)
GPU: Single Instruction, Multiple Threads (SIMT)
— CPU core ≊ GPU multiprocessor (SM)
— Working unit: set of threads (32, a warp)
— Fast switching of threads (large register file)
— Branching if
[Diagram: element-wise vector addition A + B = C mapped onto a vector unit (SIMD), onto CPU cores/threads (SMT), and onto GPU threads (SIMT)]
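An illustrative sketch (not from the slides) of what SIMT means in code: every thread runs the same kernel on its own element; the 32 threads of a warp execute in lockstep, so a data-dependent branch is handled by masking threads rather than by jumping:

__global__ void simt_example(int n, float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // each thread picks its own element
    if (i < n) {
        if (x[i] < 0.0f)        // threads of one warp may diverge here …
            x[i] = -x[i];
        else                    // … both paths are executed, inactive threads are masked off
            x[i] = 2.0f * x[i];
    }
}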
Graphics: Pascal GP100 block diagram and Pascal GP100 multiprocessor, Nvidia Corporation [5]
GPU’s ultimate feature
CPU minimizes latency within each thread.
GPU hides latency with computations from other thread groups.
[Diagram: a CPU core (low latency processor) runs threads T1–T4 with costly context switches; a GPU streaming multiprocessor (high throughput processor) switches to another warp W1–W4 whenever one is waiting. Graphics: Meinke and Nvidia Corporation [6]]
Low latency vs. high throughput
Optimized for low latency (CPU)
+ Large main memory
+ Fast clock rate
+ Large caches
+ Branch prediction
+ Powerful ALU
− Relatively low memory bandwidth
− Cache misses costly
− Low performance per watt

Optimized for high throughput (GPU)
+ High bandwidth main memory
+ Latency tolerant (parallelism)
+ More compute resources
+ High performance per watt
− Limited memory capacity
− Low per-thread performance
− Extension card
A simple CPU program!
SAXPY: y = a·x + y (x, y vectors, a scalar), in single precision. Part of BLAS Level 1.
void saxpy(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int a = 42;
int n = 10;
float x[n], y[n];
// fill x, y
saxpy(n, a, x, y);
The truth is out there!
Programming GPUs is easy: Just don’t!
Wizard: Breazell [7]
Libraries and frameworks: cuBLAS, cuSPARSE, cuFFT, cuRAND, CUDA Math, Theano, …
Parallel algebra
GPU-parallel BLAS (all 152 routines)
Single, double, complex data types
Constant competition with Intel's MKL
Multi-GPU support
→ https://developer.nvidia.com/cublas
http://docs.nvidia.com/cuda/cublas
Code example
int a = 42;
int n = 10;
float x[n], y[n];
// fill x, y
cublasInit();
float *d_x, *d_y;
cudaMalloc((void **)&d_x, n * sizeof(x[0]));
cudaMalloc((void **)&d_y, n * sizeof(y[0]));
cublasSetVector(n, sizeof(x[0]), x, 1, d_x, 1);
cublasSetVector(n, sizeof(y[0]), y, 1, d_y, 1);
cublasSaxpy(n, a, d_x, 1, d_y, 1);
cublasGetVector(n, sizeof(y[0]), d_y, 1, y, 1);
cublasShutdown();
Iterators! Iterators everywhere! Thrust is to CUDA what the STL is to C++.
Template library
Based on iterators
Data-parallel primitives (scan(), sort(), reduce(), …)
Fully compatible with plain CUDA C (comes with the CUDA Toolkit)
→ http://thrust.github.io/
http://docs.nvidia.com/cuda/thrust/
Code example
int a = 42;
int n = 10;
thrust::host_vector<float> x(n), y(n);
// fill x, y
thrust::device_vector<float> d_x = x, d_y = y;
using namespace thrust::placeholders;
thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_y.begin(),
                  a * _1 + _2);
y = d_y;
GPU Programming with Directives
Keepin’ you portable
Annotate the usual source code with directives
#pragma acc loop
for (int i = 0; i < 1; i++) {};
Also: Generalized functions
acc_copy();
Compiler interprets the directives and creates the corresponding instructions
Pro:
— Portability
— Other compiler? No problem! To it, it's a serial program
— Different target architectures from the same code
— Easy to program
Con:
— Only few compilers
— Not all the raw power available
— Harder to debug
— Easy to program wrong
GPU Programming with Directives
The power of… two.
OpenMP: standard for multi-threaded programming on the CPU; GPU support since 4.0, better since 4.5
#pragma omp target map(tofrom:y), map(to:x)
#pragma omp teams num_teams(10) num_threads(10)
#pragma omp distribute
for ( ) {
    #pragma omp parallel for
    for ( ) {
        // …
    }
}
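Filled in for SAXPY, such a construct could look like the following sketch (assuming OpenMP 4.5 target offloading; not taken from the slides):

void saxpy_omp(int n, float a, float *x, float *y)
{
    #pragma omp target teams distribute parallel for map(to:x[0:n]) map(tofrom:y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}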
OpenACC Similar to OpenMP, but more specifically for GPUs
Code example
void saxpy_acc(int n, float a, float * x, float * y) {
#pragma acc kernels
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int a = 42;
int n = 10;
float x[n], y[n];
// fill x, y
saxpy_acc(n, a, x, y);
Code example
void saxpy_acc(int n, float a, float * x, float * y) {
#pragma acc parallel loop copy(y) copyin(x)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int a = 42;
int n = 10;
float x[n], y[n];
// fill x, y
saxpy_acc(n, a, x, y);
Finally…
Two solutions:
OpenCL: Open Computing Language by the Khronos Group (Apple, IBM, NVIDIA, …), 2009
— Platform: programming language (OpenCL C/C++), API, and compiler
— Targets CPUs, GPUs, FPGAs, and other many-core machines
— Fully open source
— Different compilers available
CUDA: NVIDIA's GPU platform, 2007
— Platform: drivers, programming language (CUDA C/C++), API, compiler, debuggers, profilers, …
— Only NVIDIA GPUs
— Compilation with nvcc; GCC/LLVM solutions on the way (slowly)
— Also: CUDA Fortran
Choose the flavor you like, or what your colleagues/collaboration are using.
The hardest part: coming up with a parallelized algorithm.
Warp the kernel, it’s a thread.
Methods to exploit parallelism:
— Threads → Block
— Blocks → Grid
— All in 3D
[Diagram: threads (0–5) grouped into blocks, blocks grouped into a grid]
Execution unit: kernel
— Function executing in parallel on device
__global__ void kernel(int a, float * b) { }
— Access own ID by global variables
threadIdx.x, blockIdx.y, …
— Execution order is non-deterministic!
— Only threads in one warp (32 threads of a block) can communicate reliably/quickly
⇒ SAXPY!
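An illustrative sketch (not from the slides) of configuring a launch in more than one dimension; my_kernel, width and height are placeholders:

dim3 threadsPerBlock(16, 16);   // 256 threads per block, arranged 16 × 16
dim3 numBlocks((width  + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (height + threadsPerBlock.y - 1) / threadsPerBlock.y);
my_kernel<<<numBlocks, threadsPerBlock>>>(width, height);

// inside the kernel, each thread derives its global coordinates:
//   int col = blockIdx.x * blockDim.x + threadIdx.x;
//   int row = blockIdx.y * blockDim.y + threadIdx.y;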
With runtime-managed data transfers
__global__ void saxpy_cuda(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int a = 42;
int n = 10;
float *x, *y;
cudaMallocManaged(&x, n * sizeof(float));
cudaMallocManaged(&y, n * sizeof(float));
// fill x, y
saxpy_cuda<<<2, 5>>>(n, a, x, y);
cudaDeviceSynchronize();
GPU Tools
The helpful helpers helping helpless (and others)
NVIDIA
cuda-gdb: GDB-like command-line utility for debugging
cuda-memcheck: like Valgrind's memcheck, for checking errors in memory accesses
Nsight: IDE for GPU development, based on Eclipse (Linux, OS X) or Visual Studio (Windows)
nvprof: command-line profiler, including detailed performance counters
Visual Profiler: timeline profiling and annotated performance experiments
OpenCL: CodeXL (open source, GPUOpen/AMD) for debugging and profiling
nvprof
Command that line
Usage: nvprof ./app
With metrics: nvprof --metrics flop_sp_efficiency ./app
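With timeline export (an additional example, not on the slide): nvprof --export-profile profile.nvvp ./app — the resulting file can be opened in the Visual Profiler.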
Your new favorite tool
There are mistakes to be made, opportunities to be missed
Try to use a library if possible; let others do the hard work
Profile! Don't trust your gut!
Gradually improve and specialize when porting and optimizing
Expose enough parallelism! The GPU wants to be fed
Express data locality
Study your data transfers; can you reduce them?
Unified Memory is a good start, but explicit transfers might be faster
Use specialized memory: constant memory, shared memory!
Pinned host memory is sometimes a very easy performance booster (see the sketch below)
Overlap computation and transfer
Does your code really need double precision? Is single precision sufficient? Or, maybe, even half precision?
The number of threads and blocks is a tunable parameter; 128 threads per block is a good start
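As one illustrative sketch for the pinned-memory point (not from the slides; d_x, n and stream are placeholders):

float *h_x;
cudaMallocHost((void **)&h_x, n * sizeof(float));   // pinned (page-locked) host memory
// fill h_x …
cudaMemcpyAsync(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice, stream);   // fast, asynchronous copy
// …
cudaFreeHost(h_x);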
There’s so much more!
What I did not talk about:
Atomic operations
Shared memory
Pinned memory
How debugging works
Overlapping streams
Cross-compilation for heterogeneous systems
…
GPUs can improve your performance many-fold (for a fitting, parallelizable application)
Libraries are easiest
Direct programming (plain CUDA) is most powerful
OpenACC is somewhere in between (and portable)
There are many tools helping the programmer
→ Felice will surely give you more details in today's GPU tutorial!
a.herten@fz-juelich.de
Appendix: Further Reading & Links, Pascal Performance, Glossary, References
More!
A discussion of SIMD, SIMT, SMT by Y. Kreinin
NVIDIA's documentation: docs.nvidia.com
NVIDIA's Parallel For All blog
TSMC’s 16 Tesla Products Tesla K40 Tesla M40 Tesla P100 GPU GK110 (Kepler) GM200 (Maxwell) GP100 (Pascal) SMs 15 24 56 TPCs 15 24 28 FP32 CUDA Cores / SM 192 128 64 FP32 CUDA Cores / GPU 2880 3072 3584 FP64 CUDA Cores / SM 64 4 32 FP64 CUDA Cores / GPU 960 96 1792 Base Clock 745 MHz 948 MHz 1328 MHz GPU Boost Clock 810/875 MHz 1114 MHz 1480 MHz Peak FP32 GFLOPs1 5040 6840 10600 Peak FP64 GFLOPs1 1680 210 5300 Texture Units 240 192 224 Memory Interface 384-bit GDDR5 384-bit GDDR5 4096-bit HBM2 Memory Size Up to 12 GB Up to 24 GB 16 GB L2 Cache Size 1536 KB 3072 KB 4096 KB Register File Size / SM 256 KB 256 KB 256 KB Register File Size / GPU 3840 KB 6144 KB 14336 KB TDP 235 Watts 250 Watts 300 Watts Transistors 7.1 billion 8 billion 15.3 billion GPU Die Size 551 mm² 601 mm² 610 mm² Manufacturing Process 28-nm 28-nm 16-nm FinFET
Figure: Tesla P100 performance characteristics in comparison [5]
API: A programmatic interface to software by well-defined …
ATI: Canada-based GPU manufacturing company; bought by AMD in 2006
CUDA: Computing platform for GPUs from NVIDIA. Provides, among others, CUDA C/C++
GCC: The GNU Compiler Collection, the collection of open-source compilers, among others for C and Fortran
LLVM: An open-source compiler infrastructure, providing, among others, Clang for C
NVIDIA: US technology company creating GPUs
OpenACC: Directive-based programming, primarily for many-core machines
OpenCL: The Open Computing Language. Framework for writing code for heterogeneous architectures (CPU, GPU, DSP, FPGA). The alternative to CUDA
OpenMP: Directive-based programming, primarily for multi-threaded machines
SAXPY: Single-precision A × X + Y. A simple code example of scaling a vector and adding an offset
CPU: Central Processing Unit
GPU: Graphics Processing Unit
SIMD: Single Instruction, Multiple Data
SIMT: Single Instruction, Multiple Threads
SM: Streaming Multiprocessor
SMT: Simultaneous Multithreading
[1] Chris McClanahan. "History and evolution of gpu architecture". In: A Survey Paper (2010). URL: http://mcclanahoochie.com/blog/wp-content/uploads/2011/03/gpu-hist-paper.pdf
[2] Karl Rupp. Pictures: CPU/GPU Performance Comparison. URL: https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
[3] Mark Lee. Picture: kawasaki ninja. URL:
https://www.flickr.com/photos/pochacco20/39030210/.
[4] Shearings Holidays. Picture: Shearings coach 636. URL:
https://www.flickr.com/photos/shearings/13583388025/.
[5] Nvidia Corporation. Pictures: Pascal Block Diagram, Pascal Multiprocessor. URL: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
[6] Jan Meinke and Nvidia Corporation. Diagram: Latency Hiding.
[7] Wes Breazell. Picture: Wizard. URL: https://thenounproject.com/wes13/collection/its-a-wizards-world/