

  1. ACCELERATING APPLICATIONS WITH CUDA C/C++ Pedro Mario Cruz e Silva, Solutions Architect Manager

  2. ELEVEN YEARS OF GPU COMPUTING (2006–2017) — CUDA Launched (2006); Fermi: World’s First HPC GPU; World’s First GPU Top500 System; Discovered How H1N1 Mutates to Resist Drugs; World’s First 3-D Mapping of Human Genome; AlexNet Beats Expert Code by Huge Margin Using GPUs; Stanford Builds AI Machine Using GPUs; Google Outperforms Humans in ImageNet; GPU-Trained AI Machine Beats World Champion in Go; World’s First Atomic Model of HIV Capsid; Oak Ridge Deploys World’s Fastest Supercomputer w/ GPUs; Top 13 Greenest Supercomputers Powered by NVIDIA GPUs

  3. “SCALABILITY OF CPU AND GPU SOLUTIONS OF THE PRIME ELLIPTIC CURVE DISCRETE LOGARITHM PROBLEM” — Jairo Panetta (ITA), Paulo Souza (ITA), Luiz Laranjeira (UnB), Carlos Teixeira Jr (UnB). [Bar chart: Visit Speed (10^6) for STI PS3, K40 + CUDA 8.0, P100 + CUDA 8.0, and V100 + CUDA 9.0; reported values 25.99, 29.77, 77.84, and 197.33.]

  4. GPU PROGRAMMING

  5. HOW GPU ACCELERATION WORKS — The application code is split: the compute-intensive functions (roughly 5% of the code) run on the GPU, while the rest of the sequential code runs on the CPU.

  6. 3 WAYS TO ACCELERATE APPLICATIONS — Libraries (“drop-in” acceleration); OpenACC Directives (easily accelerate applications); Programming Languages (maximum flexibility).
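The “drop-in libraries” path can be sketched with cuBLAS. This is a hypothetical minimal example, not from the slides: a hand-written SAXPY loop is replaced by a single library call that runs on the GPU (link with -lcublas).

```cuda
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 1 << 20;
    float *d_x, *d_y;                       // device arrays
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    // ... fill d_x and d_y, e.g. with cudaMemcpy from host arrays ...

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 2.0f;
    // y = alpha * x + y, computed on the GPU by the library
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```

No kernel code is written: the library chooses the launch configuration, which is exactly the “drop-in” trade-off versus the flexibility of hand-written CUDA.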

  7. THE BASICS — Heterogeneous Computing • Host: the CPU and its memory (host memory) • Device: the GPU and its memory (device memory)
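The host/device split can be illustrated with the standard CUDA runtime API. A minimal sketch (illustrative, not from the slides): allocate device memory, copy data over, and copy results back.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int N = 16;
    size_t bytes = N * sizeof(float);

    float h_data[N];                 // host (CPU) memory
    for (int i = 0; i < N; ++i) h_data[i] = (float)i;

    float *d_data = nullptr;         // device (GPU) memory
    cudaMalloc(&d_data, bytes);

    // host -> device
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    // ... launch kernels that operate on d_data here ...
    // device -> host
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    printf("h_data[5] = %f\n", h_data[5]);
    return 0;
}
```

Host pointers and device pointers live in separate address spaces, so every transfer between them is explicit via cudaMemcpy.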

  8. ACCELERATING C/C++ CODE WITH CUDA ON GPUS

  9.

  10. V100 ARCHITECTURE

  11. TESLA V100 — The Fastest and Most Productive GPU for AI and HPC. Volta Architecture (Most Productive GPU); Tensor Core (125 Programmable TFLOPS Deep Learning); Improved NVLink & HBM2 (Efficient Bandwidth); Volta MPS (Inference Utilization); Improved SIMT Model (New Algorithms).

  12. THREAD HIERARCHY — Grid, Block & Threads

  13. TESLA V100 — 21B transistors; 815 mm²; 80 SMs*; 5120 CUDA Cores; 640 Tensor Cores; 16 GB HBM2; 900 GB/s HBM2 bandwidth; 300 GB/s NVLink. *The full GV100 chip contains 84 SMs.

  14. VOLTA GV100 SM — FP32 units: 64; FP64 units: 32; INT32 units: 64; Tensor Cores: 8; Register File: 256 KB; Unified L1/Shared memory: 128 KB; Active Threads: 2048.

  15. NEW TENSOR CORE — New CUDA TensorOp instructions & data formats. 4x4 matrix processing array: D[FP32] = A[FP16] * B[FP16] + C[FP32]. Optimized for deep learning (activation inputs, weight inputs, output results).
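The D = A*B + C operation is exposed to CUDA C++ through the warp-level WMMA API introduced in CUDA 9 (nvcuda::wmma). A hedged sketch, assuming Volta (sm_70) or newer; note that the API operates on 16x16x16 tiles per warp even though the hardware unit processes 4x4x4 chunks.

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 FP32 tile D = A[FP16] * B[FP16] + C[FP32].
// Launch with a single warp, e.g. wmma_16x16x16<<<1, 32>>>(a, b, d);
__global__ void wmma_16x16x16(const half *a, const half *b, float *d) {
    // Per-warp fragments: A and B in FP16, accumulator C/D in FP32,
    // matching the slide's D[FP32] = A[FP16] * B[FP16] + C[FP32].
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);         // C = 0
    wmma::load_matrix_sync(a_frag, a, 16);     // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // D = A*B + C
    wmma::store_matrix_sync(d, c_frag, 16, wmma::mem_row_major);
}
```

Compile with nvcc -arch=sm_70 (or newer); the fragment layout across the warp's registers is opaque, which is why loads and stores go through the _sync intrinsics.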

  16. TENSOR CORE — 4x4x4 matrix multiply and accumulate

  17. CONCEPTS

  __global__ — tells the CUDA compiler that the function is to be compiled for the GPU, and is callable from both the host and the GPU itself. For CUDA C/C++, the nvcc compiler handles compiling this code.

  blockIdx.x — a read-only variable that is defined for you. Used within a GPU kernel to determine the ID of the block currently executing code. Since many blocks run in parallel, this ID helps determine which chunk of data a particular block works on.

  threadIdx.x — a read-only variable that is defined for you. Used within a GPU kernel to determine the ID of the thread currently executing code in the active block.

  blockDim.x — a read-only variable that is defined for you. It returns the number of threads per block. Remember that all the blocks scheduled to execute on the GPU are identical, except for the blockIdx.x value.

  myKernel<<<number_of_blocks, threads_per_block>>>(...) — the syntax used to launch a kernel on the GPU. Inside the triple angle brackets we set two values: first the total number of blocks to run on the GPU, then the number of threads per block. It is possible, and in fact recommended, to schedule more blocks than the GPU can actively run in parallel; in this case the system simply continues executing blocks until they have all run.
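The concepts above combine into a minimal kernel. A sketch with a hypothetical kernel name and sizes (not from the slides), showing __global__, blockIdx.x, blockDim.x, threadIdx.x, and the <<<blocks, threads>>> launch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_one(int *data, int n) {
    // Global index: which element this thread owns.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the grid may cover more than n elements
        data[i] += 1;
}

int main() {
    const int n = 1000;
    int *d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemset(d, 0, n * sizeof(int));

    int threads_per_block = 256;
    // Round up so blocks * threads covers all n elements.
    int number_of_blocks = (n + threads_per_block - 1) / threads_per_block;
    add_one<<<number_of_blocks, threads_per_block>>>(d, n);
    cudaDeviceSynchronize();

    int h[n];
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h[999] = %d\n", h[999]);   // every element was incremented once
    cudaFree(d);
    return 0;
}
```

With n = 1000 and 256 threads per block, the launch uses 4 blocks (1024 threads); the if (i < n) guard keeps the 24 surplus threads from writing out of bounds.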

  18. NVIDIA DEEP LEARNING INSTITUTE — Hands-on self-paced and instructor-led training in deep learning and accelerated computing for developers. Course areas: Deep Learning Fundamentals; Accelerated Computing Fundamentals; Autonomous Vehicles; Medical Image Analysis; Genomics; Finance; Intelligent Video Analytics; Game Development & Digital Content; more industry-specific training coming soon. Request onsite instructor-led workshops at your organization: www.nvidia.com/requestdli. Take self-paced labs online: www.nvidia.com/dlilabs. Download the course catalog, view upcoming workshops, and learn about the University Ambassador Program: www.nvidia.com/dli

  19. developer.nvidia.com

  20. developer.nvidia.com

  21. NVIDIA HW GRANT PROGRAM — Hardware: Jetson TX2 (Dev Kit); Titan V (Volta); Quadro P6000. Areas: Scientific Computing; Scientific Visualization; Robotics; HPC; Virtual Reality; Autonomous Machines; Deep Learning. https://developer.nvidia.com/academic_gpu_seeding

  22. INCEPTION PROGRAM — http://www.nvidia.com/object/inception-program.html
