NVIDIA GPU COMPUTING: A JOURNEY FROM PC GAMING TO DEEP LEARNING



  1. NVIDIA GPU COMPUTING: A JOURNEY FROM PC GAMING TO DEEP LEARNING Stuart Oberman | October 2017

  2. NVIDIA ACCELERATED COMPUTING: Gaming | Pro Visualization | Enterprise | Data Center | Auto

  3. GEFORCE: PC GAMING
      • 200M GeForce gamers worldwide
      • Most advanced technology
      • Gaming ecosystem: more than just chips
      • Amazing experiences & imagery

  4. NINTENDO SWITCH: POWERED BY NVIDIA TEGRA

  5. GEFORCE NOW: AMAZING GAMES ANYWHERE
      • AAA titles delivered at 1080p 60fps
      • Streamed to the SHIELD family of devices
      • Streaming to Mac (beta): https://www.nvidia.com/en-us/geforce/products/geforce-now/mac-pc/

  6. GPU COMPUTING
      • Drug Design (Molecular Dynamics)
      • Seismic Imaging (Reverse Time Migration)
      • Automotive Design (Computational Fluid Dynamics)
      • Medical Imaging (Computed Tomography)
      • Astrophysics (n-body)
      • Options Pricing (Monte Carlo)
      • Product Development (Finite Difference Time Domain)
      • Weather Forecasting (Atmospheric Physics)
      Speedups cited: 15x, 14x, 30-100x, 20x

  7. GPU: 2017

  8. 2017: TESLA VOLTA V100
      • 21B transistors, 815 mm²
      • 80 SMs*, 5120 CUDA cores, 640 Tensor Cores
      • 16 GB HBM2, 900 GB/s HBM2 bandwidth
      • 300 GB/s NVLink
      *full GV100 chip contains 84 SMs

  9. V100 SPECIFICATIONS

  10. HOW DID WE GET HERE?

  11. NVIDIA GPUS: 1999 TO NOW https://youtu.be/I25dLTIPREA

  12. SOUL OF THE GRAPHICS PROCESSING UNIT: GPU: Changes Everything
      • NVIDIA introduced the GPU in 1999: a single-chip processor to accelerate PC gaming and 3D graphics
      • Goal: approach the image quality of movie-studio offline rendering farms, but in real time: more than 60 frames per second instead of hours per frame
      • Millions of pixels per frame can all be operated on in parallel; 3D graphics is often termed embarrassingly parallel
      • Use large arrays of floating-point units to exploit wide and deep parallelism, and accelerate computationally intensive applications

  13. CLASSIC GEFORCE GPUS

  14. GEFORCE 6 AND 7 SERIES (2004-2006)
      Example: GeForce 7900 GTX
      • 278M transistors
      • 650 MHz pipeline clock
      • 196 mm² in 90nm
      • >300 GFLOPS peak, single precision

  15. THE LIFE OF A TRIANGLE IN A GPU (Classic Edition)
      • Host / Front End / Vertex Fetch: process commands, convert to floating point
      • Vertex Processing: transform vertices to screen space
      • Primitive Assembly, Setup: generate per-triangle equations
      • Rasterize & Z-Cull: generate pixels, delete pixels that cannot be seen
      • Pixel Shader, Texture, Register Combiners: determine the colors, transparency, and depth of the pixel
      • Pixel Engines (ROP): do final hidden-surface test, blend, and write out color and new depth
      • Frame Buffer Controller
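The "generate per-triangle equations, generate pixels" steps on this slide can be sketched in a few lines. This is a minimal software model of edge-equation rasterization, not NVIDIA's actual hardware algorithm; the triangle, the integer sampling grid, and the inclusive fill rule are illustrative choices.

```python
def edge(a, b, p):
    """Per-triangle edge equation: positive when p lies left of directed edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(v0, v1, v2, width, height):
    """Return grid points covered by a counter-clockwise triangle."""
    covered = []
    for y in range(height):
        for x in range(width):
            p = (x, y)
            # A point survives when it is on the interior side of all three
            # edge equations; a GPU evaluates these tests for many pixels in parallel.
            if edge(v0, v1, p) >= 0 and edge(v1, v2, p) >= 0 and edge(v2, v0, p) >= 0:
                covered.append(p)
    return covered

pixels = rasterize((0, 0), (4, 0), (0, 4), 8, 8)
print(len(pixels))  # 15 grid points pass all three edge tests
```

The three edge values are also reusable as barycentric weights for the pixel-shader stage's attribute interpolation, which is one reason hardware computes them this way.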

  16. NUMERIC REPRESENTATIONS IN A GPU
      • Fixed-point formats: u8, s8, u16, s16, s3.8, s5.10, ...
      • Floating-point formats: fp16, fp24, fp32, ...: a tradeoff of dynamic range vs. precision
      • Block floating-point formats: treat multiple operands as having a common exponent, allowing a tradeoff of dynamic range vs. storage and computation
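The range-vs-precision tradeoffs above can be demonstrated with Python's stdlib half-precision codec (struct format 'e' is IEEE binary16). The `block_fp` helper is an illustrative sketch of the block floating-point idea, not any particular GPU format.

```python
import math
import struct

def to_fp16(x):
    """Round a float to the nearest representable fp16 value (round-trip through binary16)."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Precision: fp16 has a 10-bit mantissa, roughly 3 decimal digits.
print(to_fp16(1.0001))   # 1.0   (the fourth decimal is below half an ulp)
print(to_fp16(2049.0))   # 2048.0 (integer spacing is 2 above 2048)
print(to_fp16(65504.0))  # 65504.0, the largest finite fp16 value

def block_fp(values, mantissa_bits=8):
    """Sketch of block floating point: quantize all values against one shared exponent."""
    shared_exp = max(math.frexp(v)[1] for v in values)  # common exponent of the block
    scale = 2.0 ** (shared_exp - mantissa_bits)
    return [round(v / scale) * scale for v in values]

# The large value keeps its precision; the small one is flushed to zero,
# showing the dynamic-range cost of sharing one exponent.
print(block_fp([1.0, 0.001]))  # [1.0, 0.0]
```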

  17. INSIDE THE 7900 GTX GPU
      • Host / FW / VTF: vertex fetch engine
      • 8 vertex shaders
      • Cull / Clip / Setup, Z-Cull: conversion to pixels
      • Shader Instruction Dispatch, L2 Tex: 24 pixel shaders
      • Fragment Crossbar: redistributes pixels
      • 16 pixel engines
      • 4 independent 64-bit memory partitions, each with its own DRAM(s)

  18. G80: REDEFINED THE GPU

  19. G80: GeForce 8800, released 2006
      • First GPU with a unified shader processor architecture
      • Introduced the SM (Streaming Multiprocessor): an array of simple streaming processor cores (SPs, or CUDA cores)
      • All shader stages use the same instruction set and execute on the same units, permitting better sharing of SM hardware resources
      • Recognized that building dedicated units often results in under-utilization, depending on the application workload

  20. 20

  21. G80 FEATURES
      • 681M transistors, 470 mm² in 90nm
      • First to support the Microsoft DirectX 10 API
      • Invested a little extra (epsilon) hardware in the SM to also support general-purpose throughput computing: the beginning of CUDA everywhere
      • SM functional units designed to run at 2x frequency, with half the number of units
      • 576 GFLOPS @ 1.5 GHz, IEEE 754 fp32 FADD and FMUL
      • 155W
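The 576 GFLOPS figure on this slide can be reproduced with back-of-envelope arithmetic. The 128-core count and the 3-flops-per-clock assumption (a multiply-add counted as 2 flops plus a co-issued multiply) are commonly cited G80 figures, not stated on the slide itself.

```python
# Peak single-precision throughput = cores x shader clock x flops per core per clock.
cuda_cores = 128                 # assumed G80 SP count (not on the slide)
clock_hz = 1.5e9                 # shader clock; the slide notes the 2x-frequency design
flops_per_core_per_clock = 3     # assumed: MAD (2 flops) + co-issued MUL (1 flop)

peak_gflops = cuda_cores * clock_hz * flops_per_core_per_clock / 1e9
print(peak_gflops)  # 576.0
```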

  22. BEGINNING OF GPU COMPUTING: Throughput Computing
      Latency-oriented (CPU):
      • Fewer, bigger cores with out-of-order, speculative execution
      • Big caches, optimized for latency
      • Math units are a small part of the die
      Throughput-oriented (GPU):
      • Lots of simple compute cores and hardware scheduling
      • Big register files; caches optimized for bandwidth
      • Math units are most of the die

  23. CUDA
      • Most successful environment for throughput computing
      • C++ for throughput computers
      • On-chip memory management
      • Asynchronous, parallel API
      • Programmability makes it possible to innovate: new layer type? No problem.
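The programming model behind this slide maps each data element to one thread identified by a (block, thread) pair. Here is a plain-Python sketch of that indexing scheme, run sequentially instead of in parallel; the SAXPY kernel and the helper names are illustrative, not CUDA API calls.

```python
def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    """One 'thread': compute a single output element from its global index."""
    i = block_idx * block_dim + thread_idx   # CUDA-style global thread index
    if i < len(x):                           # guard: the last block may be ragged
        out[i] = a * x[i] + y[i]

def launch(grid_dim, block_dim, kernel, *args):
    """Emulate a kernel launch: hardware runs these bodies in parallel, we loop."""
    for b in range(grid_dim):
        for t in range(block_dim):
            kernel(b, t, block_dim, *args)

n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n
launch((n + 3) // 4, 4, saxpy_kernel, 2.0, x, y, out)   # 3 blocks of 4 threads
print(out)  # [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0]
```

In real CUDA C++ the same kernel body reads its indices from `blockIdx`, `threadIdx`, and `blockDim`, and the loop nest disappears into the hardware scheduler.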

  24. G80 ARCHITECTURE

  25. FROM FERMI TO PASCAL

  26. FERMI GF100: Tesla C2070, released 2011
      • 3B transistors, 529 mm² in 40nm
      • 1150 MHz SM clock
      • 3rd-generation SM, each with configurable L1/shared memory
      • IEEE 754-2008 FMA
      • 1030 GFLOPS fp32, 515 GFLOPS fp64
      • 247W

  27. KEPLER GK110: Tesla K40, released 2013
      • 7.1B transistors, 550 mm² in 28nm
      • Intense focus on power efficiency, operating at lower frequency: 2880 CUDA cores at 810 MHz
      • Tradeoff of area efficiency vs. power efficiency
      • 4.3 TFLOPS fp32, 1.4 TFLOPS fp64
      • 235W

  28. 28

  29. TITAN SUPERCOMPUTER: Oak Ridge National Laboratory

  30. PASCAL GP100: released 2016
      • 15.3B transistors, 610 mm² in 16FF
      • 10.6 TFLOPS fp32, 5.3 TFLOPS fp64
      • 21 TFLOPS fp16 for deep learning training and inference acceleration
      • New high-bandwidth NVLink GPU interconnect
      • HBM2 stacked memory
      • 300W

  31. MAJOR ADVANCES IN PASCAL: P100 vs. K40/M40 baseline: 3x compute (FP32/FP16 teraflops), 5x GPU-GPU bandwidth, 3x GPU memory bandwidth

  32. GEFORCE GTX 1080 Ti
      https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/
      https://youtu.be/2c2vN736V60

  33. FINAL FANTASY XV PREVIEW DEMO WITH GEFORCE GTX 1080 Ti
      https://www.geforce.com/whats-new/articles/final-fantasy-xv-windows-edition-4k-trailer-nvidia-gameworks-enhancements
      https://youtu.be/h0o3fctwXw0

  34. 2017: VOLTA

  35. TESLA V100: 2017
      • 21B transistors, 815 mm² in 16FF
      • 80 SMs*, 5120 CUDA cores, 640 Tensor Cores
      • 16 GB HBM2, 900 GB/s HBM2 bandwidth
      • 300 GB/s NVLink
      *full GV100 chip contains 84 SMs

  36. TESLA V100
      • Volta architecture: most productive GPU
      • Tensor Core: 120 programmable TFLOPS for deep learning
      • Improved NVLink & HBM2: efficient bandwidth
      • Independent thread scheduling: new algorithms
      • New SM core (sub-cores, L1 I$, TEX, L1 D$ & shared memory): performance & programmability
      • More V100 features: 2x L2 atomics, int8, new memory model, copy-engine page migration, MPS acceleration, and more
      The fastest and most productive GPU for deep learning and HPC

  37. GPU PERFORMANCE COMPARISON
                          P100            V100            Ratio
      DL Training         10 TFLOPS       120 TFLOPS      12x
      DL Inferencing      21 TFLOPS       120 TFLOPS      6x
      FP64/FP32           5/10 TFLOPS     7.5/15 TFLOPS   1.5x
      HBM2 Bandwidth      720 GB/s        900 GB/s        1.2x
      STREAM Triad Perf   557 GB/s        855 GB/s        1.5x
      NVLink Bandwidth    160 GB/s        300 GB/s        1.9x
      L2 Cache            4 MB            6 MB            1.5x
      L1 Caches           1.3 MB          10 MB           7.7x

  38. TENSOR CORE
      • CUDA TensorOp instructions & data formats
      • 4x4 matrix processing array: D[FP32] = A[FP16] * B[FP16] + C[FP32]
      • Optimized for deep learning: activation inputs, weight inputs, output results

  39. TENSOR CORE: Mixed-Precision Matrix Math on 4x4 matrices
      D = AB + C, where A and B are FP16 4x4 matrices, and C and D are FP16 or FP32 4x4 matrices

  40. VOLTA TENSOR OPERATION
      • FP16 storage/input -> full-precision product -> sum with FP32 accumulator -> FP32 result, then accumulate more products
      • Also supports an FP16 accumulator mode for inferencing
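The tensor-op dataflow on this slide can be emulated numerically: round the A and B inputs to fp16, form full-precision products, and accumulate in a wider type. This is a behavioral sketch only; Python floats are doubles (wider than the FP32 accumulator), so only the fp16 input rounding is modeled here.

```python
import struct

def fp16(x):
    """Round a float to the nearest fp16 value ('e' = IEEE binary16)."""
    return struct.unpack('e', struct.pack('e', x))[0]

def tensor_op(A, B, C):
    """D = A*B + C for 4x4 matrices: A and B rounded to fp16, wide accumulation."""
    n = 4
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = C[i][j]                              # wide accumulator (FP32 on Volta)
            for k in range(n):
                acc += fp16(A[i][k]) * fp16(B[k][j])   # full-precision products
            D[i][j] = acc
    return D

I = [[float(i == j) for j in range(4)] for i in range(4)]  # identity matrix
C = [[1.0] * 4 for _ in range(4)]
D = tensor_op(I, I, C)  # identity * identity + ones
print(D[0][0], D[0][1])  # 2.0 1.0
```

A single Volta tensor core performs this 4x4 operation per clock; larger matrix multiplies are tiled into many such operations.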

  41. NVLINK: PERFORMANCE AND POWER
      • Bandwidth: 25 Gbps signaling, 6 NVLinks for GV100, 1.9x bandwidth improvement over GP100
      • Coherence: latency-sensitive CPU caches GMEM; fast access in local cache hierarchy; probe filter in GPU
      • Power savings: reduce the number of active lanes for a lightly loaded link
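The slide's 25 Gbps signaling and 6 links can be reconciled with the 300 GB/s NVLink figure quoted on the V100 slides. The 8-lanes-per-direction link width used below is an assumption about GV100's NVLink configuration, not stated on this slide.

```python
signal_gbps = 25          # per-lane signaling rate from the slide
lanes_per_direction = 8   # assumed NVLink 2.0 link width (not on the slide)
links = 6                 # NVLinks on GV100, from the slide

per_link_per_dir_GBps = signal_gbps * lanes_per_direction / 8  # bits -> bytes
total_GBps = per_link_per_dir_GBps * 2 * links                 # both directions, all links
print(total_GBps)  # 300.0
```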

  42. NVLINK NODES
      • HPC: P9 CORAL node (Summit): two POWER9 CPUs, each NVLink-connected to three V100s
      • DL: hybrid cube mesh (DGX-1 with Volta): eight V100s

  43. NARROWING THE SHARED MEMORY GAP with the GV100 L1 cache
      Directed testing, global accesses through the cache vs. shared memory: average shared-memory benefit of ~70% on Pascal narrows to ~93% on Volta
      • Cache: easier to use, 90%+ as good as shared memory
      • Shared memory: faster atomics, more banks, more predictable

  44. 44

  45. GPU COMPUTING AND DEEP LEARNING
