SLIDE 1

INSIDE PASCAL

Lars Nyland and Mark Harris, April 5, 2016

April 4-7, 2016 | Silicon Valley

SLIDE 2

INTRODUCING TESLA P100

New GPU Architecture to Enable the World’s Fastest Compute Node

Pascal Architecture: Highest Compute Performance
NVLink: GPU Interconnect for Maximum Scalability
HBM2 Stacked Memory: Unifying Compute & Memory in a Single Package
Page Migration Engine: Simple Parallel Programming with 512 TB of Virtual Memory

[Diagram: Unified Memory spanning the CPU and Tesla P100]

SLIDE 3

GIANT LEAPS IN EVERYTHING

[Bar charts comparing K40, M40, and P100]
3x GPU memory bandwidth
5x GPU-GPU bandwidth (GB/s)
3x compute (teraflops, FP32/FP16)

SLIDE 4

TESLA P100 PERFORMANCE DELIVERED

NVLink for Max Scalability, More than 45x Faster with 8x P100

[Bar chart: speed-up vs. a dual-socket Haswell CPU server for Caffe/AlexNet, VASP, HOOMD-Blue, COSMO, MILC, Amber, and HACC, comparing 2x K80 (M40 for AlexNet), 2x P100, 4x P100, and 8x P100]

SLIDE 5

PASCAL ARCHITECTURE

SLIDE 6

TESLA P100 GPU: GP100

56 SMs
3584 CUDA cores
5.3 TFLOP/s double precision
10.6 TFLOP/s single precision
21.2 TFLOP/s half precision
16 GB HBM2
720 GB/s memory bandwidth
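These limits can also be read back at run time. Below is a minimal sketch (not from the slides) using the standard cudaGetDeviceProperties call; the values in the comments are the GP100 figures listed above.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                        // device 0

        printf("SMs:            %d\n", prop.multiProcessorCount); // 56 on GP100
        printf("Global memory:  %.1f GB\n", prop.totalGlobalMem / 1e9); // ~16 GB

        // Peak theoretical bandwidth = 2 (double data rate) * memory clock * bus width
        double gbps = 2.0 * prop.memoryClockRate * 1e3            // kHz -> Hz
                    * (prop.memoryBusWidth / 8.0) / 1e9;          // bits -> bytes, -> GB/s
        printf("Peak bandwidth: %.0f GB/s\n", gbps);              // ~720 GB/s on GP100
        return 0;
    }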

SLIDE 7

GPU PERFORMANCE COMPARISON

                             P100     M40           K40
Double precision (TFLOP/s)   5.3      0.2           1.4
Single precision (TFLOP/s)   10.6     7.0           4.3
Half precision (TFLOP/s)     21.2     NA            NA
Memory bandwidth (GB/s)      720      288           288
Memory size                  16 GB    12 GB, 24 GB  12 GB

SLIDE 8

GP100 SM

Per-SM resources on GP100:
CUDA cores        64
Register file     256 KB
Shared memory     64 KB
Active threads    2048
Active blocks     32
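How many blocks actually become resident on an SM depends on a kernel's register and shared-memory footprint as well as the 2048-thread and 32-block limits above. Below is a minimal sketch (not from the slides) using the CUDA occupancy API; the saxpy kernel is only an illustration.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int blockSize = 256;          // threads per block
        int blocksPerSM = 0;

        // Maximum resident blocks of this kernel per SM, given its resource
        // usage and the hardware limits listed above.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, saxpy,
                                                      blockSize, 0 /* dynamic smem */);

        printf("Resident blocks per SM: %d (%d of 2048 threads)\n",
               blocksPerSM, blocksPerSM * blockSize);
        return 0;
    }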

SLIDE 9

[Diagram: one Maxwell SM next to two P100 SMs, each partition showing cores, FP64 units, LD/ST units, SFUs, registers, warp slots, and shared memory]

More resources per core

2x registers
1.33x shared memory capacity
2x shared memory bandwidth
2x warps

Higher Instruction Throughput

SLIDE 10

IEEE 754 FLOATING POINT ON GP100

3 sizes, 3 speeds, all fast

Feature             Half precision     Single precision   Double precision
Layout              s5.10              s8.23              s11.52
Issue rate          pair every clock   1 every clock      1 every 2 clocks
Subnormal support   Yes                Yes                Yes
Atomic addition     Yes                Yes                Yes
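One consequence of the last row: GP100 supports native FP64 atomic addition in global memory, which earlier GPUs had to emulate with atomicCAS loops. Below is a minimal sketch (not from the slides), compiled with -arch=sm_60.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void sumAll(const double* x, int n, double* result) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(result, x[i]);   // native FP64 atomic on GP100 (sm_60)
    }

    int main() {
        const int n = 1 << 20;
        double *x, *result;
        cudaMallocManaged(&x, n * sizeof(double));
        cudaMallocManaged(&result, sizeof(double));
        for (int i = 0; i < n; ++i) x[i] = 1.0;
        *result = 0.0;

        sumAll<<<(n + 255) / 256, 256>>>(x, n, result);
        cudaDeviceSynchronize();
        printf("sum = %.0f\n", *result);      // expect 1048576
        cudaFree(x);
        cudaFree(result);
        return 0;
    }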

SLIDE 11

HALF-PRECISION FLOATING POINT (FP16)

  • 16 bits
  • 1 sign bit, 5 exponent bits, 10 fraction bits
  • 2^40 dynamic range
  • Normalized values: 1024 values for each power of 2, from 2^-14 to 2^15
  • Subnormals at full speed: 1024 values from 2^-24 to 2^-15
  • Special values: ±Infinity, Not-a-Number (NaN)

[Bit layout: s | exp (5 bits) | frac (10 bits)]

USE CASES

Deep learning training, radio astronomy, sensor data, image processing
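For the arithmetic itself, here is a minimal sketch (not from the slides) using the cuda_fp16.h intrinsics. GP100 reaches its peak FP16 rate by operating on half2 pairs packed into 32-bit registers; the haxpy kernel name and data layout are illustrative only.

    #include <cuda_fp16.h>
    #include <cuda_runtime.h>

    // y = a*x + y on packed half2 data; n2 is the number of half2 pairs.
    __global__ void haxpy(int n2, __half a, const __half2* x, __half2* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        __half2 a2 = __halves2half2(a, a);    // broadcast the scalar into both lanes
        if (i < n2)
            y[i] = __hfma2(a2, x[i], y[i]);   // two half-precision FMAs per instruction
    }

Data is typically converted from float with the __float2half intrinsic or kept in FP16 end to end; compile for sm_60 or later to get the full-rate half2 path.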

SLIDE 12

NVLink

SLIDE 13

NVLINK

NVLink on Tesla P100:
P100 supports 4 NVLinks
Up to 94% bandwidth efficiency
Supports reads, writes, and atomics to peer GPUs
Supports read/write access to an NVLink-enabled CPU
Links can be ganged for higher bandwidth
Each of the four links provides 40 GB/s of bidirectional bandwidth

SLIDE 14

NVLINK - GPU CLUSTER

Two fully connected quads, connected at corners
160 GB/s per GPU bidirectional to peers
Load/store access to peer memory
Full atomics to peer GPUs
High-speed copy engines for bulk data copies
PCIe to/from the CPU
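The peer load/store and atomic capabilities are exposed through the usual CUDA peer-access calls; over NVLink the same API applies as over PCIe, the fabric simply provides far more bandwidth. Below is a minimal two-GPU sketch (not from the slides); the kernel launch is only indicated in a comment.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);    // can GPU 0 reach GPU 1?
        if (!canAccess) { printf("no peer access between GPU 0 and GPU 1\n"); return 1; }

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);             // map GPU 1's memory into GPU 0

        float* onGpu1 = nullptr;
        cudaSetDevice(1);
        cudaMalloc(&onGpu1, 1024 * sizeof(float));

        cudaSetDevice(0);
        // Kernels launched on GPU 0 may now load, store, and perform atomics on
        // onGpu1 directly; cudaMemcpyPeerAsync uses the copy engines for bulk data.
        // myKernel<<<grid, block>>>(onGpu1);          // hypothetical kernel

        cudaSetDevice(1);
        cudaFree(onGpu1);
        return 0;
    }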

SLIDE 15

NVLINK TO CPU

Fully connected quad
120 GB/s per GPU bidirectional for peer traffic
40 GB/s per GPU bidirectional to the CPU
Direct load/store access to CPU memory
High-speed copy engines for bulk data movement

SLIDE 16

TESLA P100 PHYSICAL CONNECTOR

With NVLink

SLIDE 17

HBM2 STACKED MEMORY

SLIDE 18

HBM2: 720 GB/SEC BANDWIDTH

And ECC is free

[Diagram: package cross-section showing the GPU and a 4-high HBM2 stack with a spacer, mounted via bumps on a silicon carrier over the package substrate]

SLIDE 19

UNIFIED MEMORY

SLIDE 20

PAGE MIGRATION ENGINE

Support Virtual Memory Demand Paging

49-bit Virtual Addresses

Sufficient to cover 48-bit CPU address + all GPU memory

GPU page faulting capability

Can handle thousands of simultaneous page faults

Up to 2 MB page size

Better TLB coverage of GPU memory


SLIDE 21

KEPLER/MAXWELL UNIFIED MEMORY

Performance Through Data Locality

Migrate data to the accessing processor
Guarantee global coherency
Still allows explicit hand tuning

Simpler Programming & Memory Model

Single allocation, single pointer, accessible anywhere
Eliminate need for explicit copy
Greatly simplifies code porting

[Diagram (CUDA 6+): Unified Memory spanning the CPU and a Kepler GPU; allocations up to GPU memory size]
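In code, the CUDA 6 model is a single cudaMallocManaged allocation used from both processors. Below is a minimal sketch (not from the slides).

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float* data, int n, float s) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= s;
    }

    int main() {
        const int n = 1 << 20;
        float* data;
        cudaMallocManaged(&data, n * sizeof(float));    // one allocation, one pointer

        for (int i = 0; i < n; ++i) data[i] = 1.0f;     // CPU writes it directly

        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f); // GPU uses the same pointer
        cudaDeviceSynchronize();                        // no explicit cudaMemcpy anywhere

        printf("data[0] = %.1f\n", data[0]);            // CPU reads the result: 2.0
        cudaFree(data);
        return 0;
    }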

SLIDE 22

PASCAL UNIFIED MEMORY

Large datasets, simple programming, high performance

Allocate Beyond GPU Memory Size
  Enable large data models
  Oversubscribe GPU memory
  Allocate up to system memory size
Tune Unified Memory Performance
  Usage hints via the cudaMemAdvise API
  Explicit prefetching API
Simpler Data Access
  CPU/GPU data coherence
  Unified memory atomic operations

[Diagram (CUDA 8): Unified Memory spanning the CPU and a Pascal GPU]
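Below is a minimal CUDA 8 sketch (not from the slides) of the new knobs: an allocation that may exceed GPU memory and is paged in on demand, a cudaMemAdvise placement hint, and an explicit cudaMemPrefetchAsync. The 32 GB size and the increment kernel are illustrative assumptions.

    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    __global__ void increment(char* data, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    int main() {
        int device = 0;
        cudaSetDevice(device);

        // Oversubscription: this may exceed the 16 GB of HBM2 (assumes enough system RAM);
        // pages migrate to the GPU on demand as they are touched.
        const size_t n = 32ull << 30;                   // 32 GB
        char* data;
        cudaMallocManaged(&data, n);

        const size_t chunk = 1ull << 30;                // work on the first 1 GB
        memset(data, 0, chunk);                         // pages start resident on the CPU

        // Hint: prefer to keep these pages resident in GPU memory.
        cudaMemAdvise(data, n, cudaMemAdviseSetPreferredLocation, device);

        // Explicitly prefetch the first chunk instead of faulting it in page by page.
        cudaMemPrefetchAsync(data, chunk, device, 0);

        increment<<<(unsigned)((chunk + 255) / 256), 256>>>(data, chunk);
        cudaDeviceSynchronize();

        printf("data[0] = %d\n", data[0]);              // CPU and GPU stay coherent
        cudaFree(data);
        return 0;
    }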

SLIDE 23

INTRODUCING TESLA P100

New GPU Architecture to Enable the World’s Fastest Compute Node

Pascal Architecture: Highest Compute Performance
NVLink: GPU Interconnect for Maximum Scalability
HBM2 Stacked Memory: Unifying Compute & Memory in a Single Package
Page Migration Engine: Simple Parallel Programming with 512 TB of Virtual Memory

[Diagram: Unified Memory spanning the CPU and Tesla P100]

More P100 features: compute preemption, new instructions, larger L2 cache, more…
Find out more at http://devblogs.nvidia.com/parallelforall/inside-pascal