
SLIDE 1

Stephen Jones, GTC 2019

CUDA NEW FEATURES AND BEYOND

SLIDE 2

A QUICK LOOK BACK

This Time Last Year...

[Diagram, Inception-style module: input feeds 1x1, 3x3, and 5x5 convolutions and a max pool, each followed by ReLU, then concat]

Asynchronous Task Graphs | DGX-2 + Unified Memory

S9241 – All You Need To Know About Programming NVIDIA’s DGX-2, Wednesday March 20, 1-2PM

SLIDE 3

HPC Apps: AMBER, Chroma, GROMACS, GTC, LAMMPS, MILC, NAMD, QE, RTM, SPECFEM3D, VASP

[Chart: HPC Applications Speedup, comparing CUDA 8 + cuBLAS 8 + cuFFT 8 against CUDA 10 + cuBLAS 10 + cuFFT 10, measured as 2x Broadwell vs 4x P100 and 2x Broadwell vs 4x V100]

2X on the same hardware

ACCELERATED COMPUTING IS FULL-STACK OPTIMIZATION

2X More Performance With Software Optimizations Alone

SLIDE 4

APPS & FRAMEWORKS NVIDIA SDK & LIBRARIES

TESLA UNIVERSAL ACCELERATION PLATFORM

Single Platform To Drive Utilization and Productivity

MACHINE LEARNING | RAPIDS

cuML cuDF cuGRAPH

CUDA

DEEP LEARNING

cuDNN cuBLAS CUTLASS NCCL TensorRT

SUPERCOMPUTING

cuBLAS OpenACC cuFFT

550+ Applications (e.g. Amber, NAMD)

CUSTOMER USE CASES CONSUMER INTERNET

Speech Translate Recommender

SUPERCOMPUTING

Molecular Simulations Weather Forecasting Seismic Mapping

INDUSTRIAL APPLICATIONS

Manufacturing Healthcare Finance

TESLA GPUs & SYSTEMS

SYSTEM OEM CLOUD TESLA GPU NVIDIA HGX NVIDIA DGX FAMILY VIRTUAL GPU

SLIDE 5

TECHNOLOGY DEVELOPMENT TOOLKIT PLATFORM

SLIDE 6

NEW TURING GPU

GREATEST LEAP SINCE 2006 CUDA GPU

SLIDE 7

320 Turing Tensor Cores | 2,560 CUDA Cores
65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS
16GB | 320GB/s | 70 W
Deep Learning Training & Inference | HPC Workloads | Video Transcode | Remote Graphics

TESLA T4

WORLD’S MOST ADVANCED SCALE-OUT GPU

SLIDE 8

TURING SM

TU102 SM: 64 INT32 cores | 64 FP32 cores | 8 Tensor Cores | 1 RT Core
Register File 256 KB | L1 and shmem 96 KB | Max threads 1024
Compute Capability 75*

*Volta (cc70) code runs on Turing without JIT or recompile!

SLIDE 9

RT CORE POTENTIAL FOR ACCELERATION OF NUMERICAL ALGORITHMS

Geometry-Heavy Compute Applications Unstructured Algorithms

Neutron Transport

Credit: CERT, Texas A&M

RF Wave Propagation

Credit: COMSOL

Seismic Shear Wave Tracing

Credit: SERC, Carleton College

Radiation Transport

Credit: Greg Stewart / SLAC

R-Trees, Decision Trees

Credit: Wikimedia

Nearest Neighbor Search

Credit: Fortmann-Roe

SLIDE 10

LOCATING NEIGHBORS WITHIN A RANGE

For any arbitrary set of points

For a point P, find neighbors within a shape enclosed in a Bounding Box

Ray-based solution

  • 1. Attach a box of width R to each point
  • 2. Shoot one ray from P in an arbitrary direction, with t_max = 2*R
  • 3. Neighbors' boxes will have either an entry or an exit intersection, but never both
  • 4. Refine result points to any shape within the box in the SM

Intersect Rays With Bounding Box Around Points Of Interest P

SLIDE 11

RAY TRACED NEAREST NEIGHBOUR SEARCH

Using RT-Cores Through OptiX RTX

SLIDE 12

NEW TURING TENSOR CORE

MULTI-PRECISION FOR AI INFERENCE & SCALE-OUT TRAINING 65 TFLOPS FP16 | 130 TeraOPS INT8 | 260 TeraOPS INT4

SLIDE 13

TURING TENSOR CORE

8-bit integer WMMA operations
▪ Turing (sm_75) only
▪ Signed & unsigned 8-bit input
▪ 32-bit integer accumulator
▪ Matches input/output dimensions with half
▪ 2048 ops per cycle, per SM

New 8-Bit & Sub-Byte Warp Matrix Functions In CUDA

Each WMMA operation computes D = A×B + C:

WMMA 32x8x16: A 32x16, B 16x8, C and D 32x8
WMMA 8x32x16: A 8x16, B 16x32, C and D 8x32
WMMA 16x16x16: A 16x16, B 16x16, C and D 16x16
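As a sketch of the 8-bit path, the following CUDA kernel is an illustrative fragment (the kernel name, row/col-major layouts, and single-warp launch are our choices, and it assumes sm_75): one warp computes a 16x16x16 tile of D = A×B + C with the nvcuda::wmma API.

```cuda
#include <mma.h>
using namespace nvcuda;

// Launch with a single warp (32 threads). Inputs are 16x16 s8 tiles with
// leading dimension 16; the accumulator is 32-bit integer.
__global__ void wmma_s8_tile(const signed char *a, const signed char *b,
                             const int *c, int *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int> facc;

    wmma::load_matrix_sync(fa, a, 16);
    wmma::load_matrix_sync(fb, b, 16);
    wmma::load_matrix_sync(facc, c, 16, wmma::mem_row_major);
    wmma::mma_sync(facc, fa, fb, facc);   // facc = fa * fb + facc
    wmma::store_matrix_sync(d, facc, 16, wmma::mem_row_major);
}
```

The half-precision variants follow the same fragment/load/mma/store pattern with the other supported shapes listed above.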

SLIDE 14

EXPERIMENTAL WARP MATRIX FUNCTIONS

Experimental Sub-Byte Operations

4-bit signed & unsigned input 1-bit input with custom matrix operations 32-bit accumulator output

Access via special namespace nvcuda::wmma::experimental

Turing Enables Experimental Sub-Byte Tensor Core Operations

namespace experimental {
    namespace precision {
        struct u4;   // 4-bit unsigned
        struct s4;   // 4-bit signed
        struct b1;   // 1-bit
    }
    enum bmmaBitOp { bmmaBitOpXOR = 1 };
    enum bmmaAccumulateOp { bmmaAccumulateOpPOPC = 1 };
}

Enables researchers to experiment with ultra low precision

SLIDE 15

BINARY TENSOR CORES

Concept

▪ Train neural networks on lower-precision data: faster compute, lower memory size
▪ Reduce data to a positive/negative sign value, which fits in a single bit (1 = +ve, 0 = -ve)
▪ 1-bit weight & activation calculations based only on the sign of the data

Example: Binarized Neural Networks

1-bit

Ref: Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1, M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio, 2016. https://arxiv.org/pdf/1602.02830.pdf

SLIDE 16

BINARY TENSOR CORE OPERATION

1-bit input signal
Bitwise XOR operation
128-bit population count added to accumulator (previous accumulation plus other row/column results)
32-bit integer output per point

SLIDE 17

NEW TURING WARP MATRIX FUNCTIONS

Input Precision | Output | Supported Sizes | Max Ops/Clock/SM
Native types:
  half | half or float | 16x16x16, 32x8x16, 8x32x16 | 1024
  char, unsigned char | integer (int32) | 16x16x16, 32x8x16, 8x32x16 | 2048
Experimental:
  precision::u4 (4-bit unsigned), precision::s4 (4-bit signed) | integer (int32) | 8x8x32 | 4096
  precision::b1 (1-bit) | integer (int32) | 8x8x128 | 16384

* Also available on Volta sm_70. Note: WMMA requires recompilation for Turing sm_75 for peak performance

SLIDE 18

CUTLASS 1.3

GEMM kernels targeting Volta Tensor Cores natively with mma.sync

New in CUDA 10.1 & CUTLASS 1.3: the mma.sync PTX instruction enables maximum efficiency of Volta Tensor Core operation

SLIDE 19

INDEPENDENT THREAD SCHEDULING

Communicating Algorithms

Pascal: Lock-Free Algorithms (threads cannot wait for messages)
Volta/Turing: Starvation-Free Algorithms (threads may wait for messages)

SLIDE 20

INDEPENDENT THREAD SCHEDULING

Ref: High Radix Concurrent C++, Olivier Giroux, CppCon 2018 - https://www.youtube.com/watch?v=75LcDvlEIYw See Also: https://devblogs.nvidia.com/cuda-turing-new-gpu-compute-possibilities/

Enable Fast Mutexes For Concurrent Data Structures, Replace Complex Lock-Free Algorithms

[Chart: concurrent data structure performance, multi-threading (CPU) vs acceleration (RTX 2070)]

SLIDE 21

WARP IMPLEMENTATIONS

Pre-Volta: 32-thread warp with a single Program Counter (PC) and Stack (S)
Volta/Turing: 32-thread warp with independent scheduling: per-thread PC and S, managed by a Convergence Optimizer

SLIDE 22

SYNCHRONIZING WARP FUNCTIONS

Pre-Volta: one PC,S per warp
Volta & Turing: per-thread PC,S

my_value = __shfl(thread, their_value)


SLIDE 25

SYNCHRONIZING WARP FUNCTIONS

Pre-Volta: one PC,S per warp
Volta & Turing: per-thread PC,S

my_value = __shfl_sync(thread_mask, thread, their_value)

SLIDE 26

SYNCHRONIZING WARP FUNCTIONS

__shfl_sync() and all other *_sync collective operations work on all GPU architectures

Pre-Volta: one PC,S per warp
Volta & Turing: per-thread PC,S

my_value = __shfl_sync(FULL_WARP, thread, their_value)

SLIDE 27

REMOVAL OF NON-SYNC WARP FUNCTIONS

Programs using old functions:

▪ Will no longer compile for sm_70 (Volta) or sm_75 (Turing)
▪ Will still compile as the older compute_60 (Pascal) architecture, but without support for any Volta or Turing features

To compile as compute_60, add the following arguments to your compile line:
  -arch=compute_60 -code=sm_70

Functions Deprecated In CUDA 9.0: Now Removed In CUDA 10.1

Removed Function -> Replacement Function
__ballot()      -> __ballot_sync()
__any()         -> __any_sync()
__all()         -> __all_sync()
__shfl()        -> __shfl_sync()
__shfl_up()     -> __shfl_up_sync()
__shfl_down()   -> __shfl_down_sync()
__shfl_xor()    -> __shfl_xor_sync()
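Migration is mechanical: add a participation mask as the new first argument. A hedged sketch (the warp_sum helper is our example, not from the slides), using the full-warp mask 0xffffffff:

```cuda
// Hypothetical warp reduction updated from the removed __shfl_down() to
// __shfl_down_sync(). The mask names the threads that participate; here,
// the whole warp.
__device__ int warp_sum(int v) {
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);  // was: __shfl_down(v, offset)
    return v;  // lane 0 holds the warp total
}
```

As the slides note, the *_sync variants work on all GPU architectures, so the same code runs on Pascal, Volta, and Turing.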

SLIDE 28

CUDA 10.1 FOR TEGRA SYSTEMS

Platform | Host OS Version | Target OS Version | Compiler Support
L4T | 16.04 LTS, 18.04 LTS | 18.04 LTS | GCC 7.3
Android | 16.04 LTS | P (Pie) | Clang 6.0
Auto | 16.04 LTS | 18.04 LTS (GCC 7.3), QNX SDP 7.0.2 (GCC 5.4), Yocto 2.5 (GCC 7.3)

SLIDE 29

DRIVE DEVELOPER WORKFLOW

Iterative Workflow

Developer Lab (PC with dGPU) -> DRIVE™ Xavier -> Vehicle Integration -> Iterative Testing
Fast iteration loop with the PC; the same CUDA code is used across PC, DRIVE Dev Platform, and vehicle

SLIDE 30

CUDA 10.1 TEGRA SYSTEMS ENHANCEMENTS

NVIDIA-Direct™ RDMA

Third-party PCIe devices can communicate directly with the integrated GPU

User-Mode Submission on Linux-4-Tegra

Faster and more predictable work submission latency

Rich Error Reporting

Detailed error reporting from GPU execution faults (MMU, alignment, etc)

SLIDE 31

CUDA 10.1 PLATFORM SUPPORT

New OS and Host Compilers

PLATFORM | OS VERSION | COMPILERS
Linux | Ubuntu 18.04.2 LTS, 16.04.5 LTS, 14.04.5 LTS; RHEL/CentOS 7.6; RHEL 7.6 POWER LE; SLES 15; Fedora 29; OpenSUSE Leap 15 | GCC 8.x, PGI 19.x, Clang 7.0.x, ICC 19, XLC 16.1.x (POWER)
Windows | Windows Server 2019, 2016, 2012 R2 | Microsoft Visual Studio 2017 (15.x), Microsoft Visual Studio 2019 (Previews)
Mac | macOS 10.13.6 | Xcode 10.1

SLIDE 32

TESLA DRIVERS AND COMPATIBILITY

Long Term Service Branch (LTSB)

One per GPU architecture (i.e. per major CUDA release, such as CUDA 10.0)
Supported for up to 3 years
R418 is the first LTSB
CUDA compatibility will be supported for the lifetime of the LTSB

Run New Versions Of CUDA Without Upgrading Kernel Drivers

Driver Branch | CUDA 10 Compatible | CUDA 10.1 Compatible
CUDA 9.0 | Yes | Yes
CUDA 9.1 | No | No
CUDA 9.2 | No | Coming soon
CUDA 10.0 | | Yes
SLIDE 33

CUDA CONTAINERS ON NVIDIA GPU CLOUD

CUDA containers available from NGC Registry at nvcr.io/nvidia/cuda

Three different flavors:
Base: contains the minimum components required to run CUDA applications
Runtime: contains base + CUDA libraries (e.g. cuBLAS, cuFFT)
Devel: contains runtime + CUDA command-line developer tools; some devel tags also include cuDNN

SLIDE 34

INCREASING CUDA CAPABILITIES ON WINDOWS

Additions Since CUDA 9

Windows Peer-to-Peer (GPU 0 <-> GPU 1)
Compute Preemption (CILP) Support
CUDA Interop with Vulkan and DX12

S9957 – Using CUDA on Windows, Wednesday 3-4pm

SLIDE 35

NEW GRAPHICS INTEROP

Direct native resource mapping, plus CUDA-OpenGL interop via Vulkan (GL_EXT_external_objects, coming soon).

Memory: allocations (buffers, images) exported via VK_KHR_external_memory (ID3D12Heap, ID3D12Resource) are imported into CUDA as cudaExternalMemory_t (device memory, cuArray)
Synchronization: semaphores exported via VK_KHR_external_semaphore (ID3D12Fence) are imported into CUDA as cudaExternalSemaphore_t

SLIDE 36

ASYNCHRONOUS TASK GRAPHS

Sequence of operations, connected by dependencies. Operations are one of:

Kernel Launch: CUDA kernel running on GPU
CPU Function Call: callback function on CPU
Memcopy/Memset: GPU data management
Sub-Graph: graphs are hierarchical

A Graph Node Is A CUDA Operation
[Diagram: example graph with nodes A, B, X, C, D, E, Y leading to End]

SLIDE 37

THREE-STAGE EXECUTION MODEL

Define: a single graph "template", created in host code or built up from libraries
Instantiate: multiple "executable graphs"; a snapshot of the template that sets up & initializes GPU execution structures (create once, run many times)
Execute: executable graphs running in CUDA streams (s1, s2, s3); concurrency in a graph is not limited by the stream

SLIDE 38

NEW EXECUTION MECHANISM

Graphs Can Be Generated Once Then Launched Repeatedly

for(int i=0; i<1000; i++) {
    launch_graph( G );
}

SLIDE 39

WORKFLOW EXECUTION OPTIMIZATIONS

Reducing System Overheads Around Short-Running Kernels

[Timeline: Launch, then Grid Initialization, then a 2µs kernel, repeated three times; launch and initialization overhead is 53% of total time]

Breakdown of time spent during execution

SLIDE 40

WORKFLOW EXECUTION OPTIMIZATIONS

Reducing System Overheads Around Short-Running Kernels

[Timeline: with CPU-side launch overhead reduction, overhead drops from 53% to 46%]

Breakdown of time spent during execution; CPU-side launch overhead reduction

SLIDE 41

WORKFLOW EXECUTION OPTIMIZATIONS

Reducing System Overheads Around Short-Running Kernels

[Timeline: CPU-side launch overhead reduction plus device-side execution overhead reduction cuts overhead from 53% to 46% to 37%: a 26% shorter total time with three 2µs kernels]

Breakdown of time spent during execution; CPU-side launch and device-side execution overhead reductions

SLIDE 42

FREE UP CPU RESOURCES

Release CPU Time For Lower Power, or Running Other Work

[Timeline: with stream launch the CPU issues Launch A through Launch E back-to-back before idling; with graph launch the CPU builds the graph once, launches it, and idles while A through E run on the GPU]

SLIDE 43

LAUNCH & EXECUTION SPEEDUP

Note: Reduction in System Overheads – Kernel Runtime is Not Affected

Launching an already-created graph is 7-8x faster than launching the same kernels into a stream
GPU overhead when running kernels is 1.4x lower than for equivalent work in a stream

SLIDE 44

SMALL-GRAPH PERFORMANCE

Fixed CPU/GPU transaction cost

▪ Is paid once per graph launch
▪ Is paid on every kernel launch for streams
▪ Becomes insignificant when the graph exceeds ~15 nodes

Speedup Decreases For Graphs Of <15 Nodes

SLIDE 45

[Chart: inference execution throughput increase over stream launch, using graphs (TU104, Mobile Linux, streams throughput = 1)]

MOBILE INFERENCE

Embedded System Inference Benchmarks (Turing TU104 GPU)

Embedded system launch times improve up to 11x Embedded system execution times improve up to 3x

SLIDE 46

CREATING AND USING GRAPHS

All CUDA Stream Work Already Forms A Graph

Any CUDA stream can be mapped to a graph.

[Diagram: CUDA work in streams (kernels A, B, C, D, E, X, Y with cross-stream waits) and the equivalent graph of dependencies ending in End]

SLIDE 47

CAPTURE STREAM WORK INTO A GRAPH

Create A Graph With Two Lines Of Code

// Start by initiating stream capture
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

// Captures my kernel launches, recursing into library calls
X<<< ..., stream >>>();
libraryCall(stream);   // Launches A, B, C, D
Z<<< ..., stream >>>();

// Now convert the stream to a graph
cudaStreamEndCapture(stream, &graph);

[Resultant graph: X, then A, B, C, D from the library call, then Z; launches made by the library also build the graph]

SLIDE 48

CREATE GRAPHS DIRECTLY

Map Graph-Based Workflows Directly Into CUDA

[Diagram: graph with A feeding B and C, which feed D]

// Define graph of work + dependencies
cudaGraphCreate(&graph);
cudaGraphAddNode(graph, kernel_a, {}, ...);
cudaGraphAddNode(graph, kernel_b, { kernel_a }, ...);
cudaGraphAddNode(graph, kernel_c, { kernel_a }, ...);
cudaGraphAddNode(graph, kernel_d, { kernel_b, kernel_c }, ...);

// Instantiate graph and apply optimizations
cudaGraphInstantiate(&instance, graph);

// Launch executable graph 1000 times
for(int i=0; i<1000; i++)
    cudaGraphLaunch(instance, stream);

Graph from framework

SLIDE 49

FOR IN-DEPTH INFORMATION

S9956 – Best Practices When Benchmarking CUDA Applications, Wednesday 2-3pm
S9957 – Using CUDA on Windows, Wednesday 3-4pm
S9241 – All You Need To Know About Programming NVIDIA’s DGX-2, Wednesday 1-2pm
S9329 – Synchronization Is Bad, But If You Must..., Thursday 9-10am
S9681 – Visualize Your Large Datasets, Wednesday 9-10am
S9768 – New Features in OptiX 6.0, Wednesday 1-2pm

See These Sessions This Week

SLIDE 50

NVCC ENHANCEMENTS

Warp Matrix Functions (new C++ namespace)
Extensible Whole Program (-ewp) mode compilation support

Efficient compilation with use of CUDA run-time device library & with Cooperative Groups grid/multi-grid synchronization

New address predicate functions

__isShared, __isConstant, __isLocal

Ongoing C++17 language support

Improving Efficiency

Efficient Code Generation for Chip Architecture

SLIDE 51

ENHANCED HALF-PRECISION FUNCTIONALITY

Half-precision atomic ADD (Volta+), round-to-nearest mode
Host-side conversion operators between float and half types
Host-side construction and assignment operators for half and half2 types

Includes Limited half Type Support For CPU Code

half atomicAdd(half *address, half val);
half2 atomicAdd(half2 *address, half2 val);

half pi = 3.1415f;
half also_pi = pi;              // Assign half to half
half2 vector_pi(pi, also_pi);   // Construct half2 from half

half hPI = 3.1415f;             // Convert float to half
float fPI = (float)hPI;         // Convert half to float

NOTE: Half-precision arithmetic operations remain only available in device code

SLIDE 52

DIRECTIVE-BASED HPC PROGRAMMING

Who’s Using OpenACC?

160,000+ downloads | 725 trained experts | 5 of 13 CAAR codes | 3 of top 5 HPC apps accelerated

[Charts: growth over time in downloads, trained experts, accelerated apps, and Slack members]

SLIDE 53

CUDA Fortran Tensor Core Support OpenACC printf() OpenACC Deep Copy OpenACC Auto-compare OpenACC C++ Lambda CUDA 10.x support Full C++17 language OpenMP 4.5 for CPUs PGI in the Cloud

Fortran, C and C++ for the Tesla Platform

pgicompilers.com/whats-new

SLIDE 54

THE FUTURE OF GPU PROGRAMMING

Standard Languages | Directives | CUDA

Maximize GPU Performance with CUDA C++/Fortran GPU Accelerated C++17 and Fortran 2018 Incremental Performance Optimization with OpenACC

do concurrent (i = 1:n)
    y(i) = y(i) + a*x(i)
enddo

#pragma acc data copy(x,y)
{
    ...
    std::for_each_n(POL, idx(0), n,
        [=](Index_t i){ y[i] += a*x[i]; });
    ...
}

__global__ void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] += a*x[i];
}

int main(void) {
    ...
    cudaMemcpy(d_x, x, ...);
    cudaMemcpy(d_y, y, ...);
    saxpy<<<(N+255)/256,256>>>(...);
    cudaMemcpy(y, d_y, ...);
}

std::for_each_n(POL, idx(0), n, [=](Index_t i){ y[i] += a*x[i]; });

SLIDE 55

PGI SESSIONS AT GTC

S9279 - OpenACC Programming Model — User Stories, Vendor Reaction, Relevance, and Roadmap with Duncan Poole and Michael Wolfe, Tuesday at 4:00 in room 210F S9770 - C++17 Parallel Algorithms for NVIDIA GPUs with PGI C++ by David Olsen, Wednesday at 10:00 in room 210G S9289 - PGI Compilers, The NVIDIA HPC SDK: Updates for 2019 by Michael Wolfe, Thursday at 10:00 in room 211A

SLIDE 56

SANITIZER: CODE ANALYSIS

Tracks API calls and memory accesses during CUDA kernel execution
Support for Windows, Linux, Mac
Samples available on GitHub

https://github.com/NVIDIA/compute-sanitizer-samples

New APIs in CUDA 10.1

S9751 - Accelerate Your CUDA Development with Latest Debugging and Code Analysis Developer Tools

SLIDE 57

NSIGHT SYSTEMS

Observe Application Behavior: CPU threads, GPU traces, memory bandwidth, and more
Locate Optimization Opportunities: CUDA & OpenGL APIs, Unified Memory transfers, user annotations using NVTX
Ready for Big Data: fast GUI capable of visualizing in excess of 10 million events on laptops, container support, minimum user privileges

System-Wide Performance Analysis

https://developer.nvidia.com/nsight-systems

SLIDE 58

NVIDIA NSIGHT COMPUTE

Interactive CUDA API debugging and kernel profiling
Fast data collection
Improved workflow, fully customizable (baselining, programmable UI/rules)
Command line, standalone, and IDE integration

OS: Linux (x86, ARM), Windows GPUs: Pascal, Volta, Turing

Next Generation Kernel Profiler

Kernel Profile Comparisons with Baseline Metric Data Source Correlation

SLIDE 59

TOOLS SESSIONS AT GTC

Talks

S9503 - Using Nsight Tools to Optimize the NAMD Molecular Dynamics Simulation Program S9345 - CUDA Kernel Profiling using NVIDIA Nsight Compute S9751 - Accelerate Your CUDA Development with Latest Debugging and Code Analysis Developer Tools S9661 - Nsight Graphics - DXR/Vulkan Profiling/Vulkan Raytracing

Connect with the Experts

CE9123 - Connect with Experts: CUDA & Graphics Developer Tools CE9137 - Connect with Jetson Embedded Platform Experts

Devtools pod at NVIDIA booth on exhibition show-floor

SLIDE 60

CUDA MATH LIBRARIES

Major Initiatives:

Functional Safety: Drive AV SW stack
Extended Features: new libraries & APIs
Performance: tuning + new algorithms
Multi-GPU: strong/weak scaling
Single GPU: Tensor Cores & low/mixed precision

[Diagram: C = A * B tensor contraction, cuTENSOR]

SLIDE 61

cuTENSOR

A New High-Performance CUDA Library for Tensor Primitives

Tensor Contractions Elementwise operations Pre-release version available

[Diagrams: tensor contraction D = A·B + C and elementwise operation D = A * B + C]

SLIDE 62

CUTENSOR

Tensor transpositions: NCHW -> NHWC (FP16), as found in various DL networks (AlexNet, GoogleNet); benchmarked against HPTT (https://github.com/springer13/hptt)

Random tensor contractions: 3D to 6D tensors with increasing arithmetic intensity; benchmarked against TBLIS (https://github.com/devinamatthews/tblis)

SLIDE 63

cuBLASLt

New header and binary with a lightweight context
Targets power GEMM users; not a replacement for cuBLAS
Increased flexibility: data layout, input and compute types, algorithm choice and heuristics
Workspace enables new algorithms; layout flexibility enables hardware optimization

New MATMUL Library with Full Algorithm Control

SLIDE 64

cuBLASLt

Average 2.8X, up to 3.9X Speedup with cuBLASLt Turing IMMA Support

SLIDE 65

cuFFTDx

New Library: cuFFT Device EXtention

Key Features:
Device-callable library: retain and reuse on-chip data, inline FFTs in user kernels, combine FFT operations

Motivation:
Performance: FFTs are memory bound; CPU-issued commands; PCIe latency
Size: the entire library is required even for single-size use
Customization: cuFFT launches its own kernels, so there is no opportunity to inline

When: initial release mid-2019

SLIDE 66

cuSOLVER

Tensor Core Accelerated Dense Linear Solver Coming Soon

Results obtained on GV100 using MAGMA

SLIDE 67

nvJPEG

New Features

Batched decoding
Baseline encoding
Device and pinned memory control
Linux POWER (ppc64le) platform support
JPEG stream parsing
Hybrid decode API
ROI decoding

GPU Results obtained on GV100 CPU Results obtained with TJPEG on 2-socket Intel Xeon Gold 6140

SLIDE 68

CUDA LIBRARIES SESSIONS AT GTC

S9593 - cuTENSOR: High-performance Tensor Operations in CUDA, Wednesday March 20, 1-2PM S9226 - Fast Singular Value Decomposition on GPUs, Wednesday March 20, 2-3PM CWE 9114 - Connect with the Experts: CUDA Libraries, Wednesday March 20, 5-6PM S9257 - New FFT Library with Flexible C++ API, Thursday March 21, 3-4PM

Come learn more about CUDA Libraries

SLIDE 69

APPS & FRAMEWORKS NVIDIA SDK & LIBRARIES

TESLA UNIVERSAL ACCELERATION PLATFORM

Single Platform To Drive Utilization and Productivity

MACHINE LEARNING | RAPIDS

cuML cuDF cuGRAPH

CUDA

DEEP LEARNING

cuDNN cuBLAS CUTLASS NCCL TensorRT

SUPERCOMPUTING

cuBLAS OpenACC cuFFT

550+ Applications (e.g. Amber, NAMD)

CUSTOMER USE CASES CONSUMER INTERNET

Speech Translate Recommender

SUPERCOMPUTING

Molecular Simulations Weather Forecasting Seismic Mapping

INDUSTRIAL APPLICATIONS

Manufacturing Healthcare Finance

TESLA GPUs & SYSTEMS

SYSTEM OEM CLOUD TESLA GPU NVIDIA HGX NVIDIA DGX FAMILY VIRTUAL GPU

SLIDE 70

ACCELERATING DISCOVERIES WITH AI

New drugs typically take 12-14 years and $2.6 billion to bring to market. BenevolentAI is using GPU deep learning to bring new therapies to market quickly and more affordably. They’ve automated the process of identifying patterns within large amounts of research data, enabling scientists to form hypotheses and draw conclusions quicker than any human researcher could. And using the NVIDIA DGX-1 AI supercomputer, they identified two potential drug targets for Alzheimer’s in less than one month.

SLIDE 71

AI-BUILD AI TO FABRICATE SUBATOMIC MATERIALS

To expand the benefits of deep learning for science, researchers need new tools to build high-performing neural networks that don’t require specialized knowledge. Scientists at Oak Ridge National Laboratory used the MENNDL algorithm on Summit to develop a neural network that analyzes electron microscopy data at the atomic level. The team achieved a speed of 152.5 petaflops across 3,000 nodes.

SLIDE 72

With the Earth's population at 7 billion and growing, understanding population distribution is essential to meeting societal needs for infrastructure, resources and vital services. Using GPUs and deep learning, Oak Ridge National Laboratory can quickly process high-resolution satellite imagery to map human settlements and changing urban dynamics. With the ability to process a major city in minutes, ORNL can provide emergency response teams critical information that used to take days to create.

A 21st CENTURY PLANNING TOOL BUILT ON AI

SLIDE 73

In 2015 gravitational waves (GW) were observed for the first time by astronomers at the Laser Interferometer Gravitational-wave Observatory (LIGO) originating from a pair of merging Black Holes 1.3B light years away. “Seeing” gravity opens the door to new discoveries and a daunting new challenge: observing GW in parallel with electromagnetic waves, and analyzing the combined data in real-time. Scientists at NCSA are using GPU-powered deep learning to make this computationally intensive approach possible. Using a deep Convolutional Neural Network (CNN), NCSA trained its system to process gravitational wave data more than 5000 times faster than its previous machine learning methods — making real time analysis possible and putting us one step closer to understanding the universe’s oldest secrets.

“SEEING” GRAVITY IN REAL-TIME

Ref: Daniel George and E. A. Huerta, “Deep learning for real-time gravitational wave detection and parameter estimation: Results with Advanced LIGO data,” Physics Letters B.