SLIDE 1

INSIDE PASCAL

Lars Nyland and Mark Harris, April 5, 2016

April 4-7, 2016 | Silicon Valley

SLIDE 2

INTRODUCING TESLA P100

New GPU Architecture to Enable the World’s Fastest Compute Node

Pascal Architecture: Highest Compute Performance
NVLink: GPU Interconnect for Maximum Scalability
HBM2 Stacked Memory: Unifying Compute & Memory in a Single Package
Page Migration Engine: Simple Parallel Programming with 512 TB of Virtual Memory

[Diagram: Unified Memory spanning the CPU and Tesla P100]

SLIDE 3

GIANT LEAPS IN EVERYTHING

[Bar charts comparing K40, M40, and P100]
3x GPU memory bandwidth
5x GPU-GPU bandwidth (GB/s)
3x compute (teraflops, FP32/FP16)

SLIDE 4

TESLA P100 PERFORMANCE DELIVERED

NVLink for Max Scalability, More than 45x Faster with 8x P100

[Bar chart: speed-up vs. a dual-socket Haswell CPU server for Caffe/AlexNet, VASP, HOOMD-Blue, COSMO, MILC, Amber, and HACC, comparing 2x K80 (M40 for AlexNet), 2x P100, 4x P100, and 8x P100]

SLIDE 5

PASCAL ARCHITECTURE

SLIDE 6

TESLA P100 GPU: GP100

56 SMs
3584 CUDA cores
5.3 TFLOP/s double precision
10.6 TFLOP/s single precision
21.2 TFLOP/s half precision
16 GB HBM2
720 GB/s memory bandwidth
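These limits can also be read back at run time. Below is a minimal sketch (not from the slides) using the standard cudaGetDeviceProperties call; the values in the comments are the GP100 figures listed above.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                        // device 0

        printf("SMs:            %d\n", prop.multiProcessorCount); // 56 on GP100
        printf("Global memory:  %.1f GB\n", prop.totalGlobalMem / 1e9); // ~16 GB

        // Peak theoretical bandwidth = 2 (double data rate) * memory clock * bus width
        double gbps = 2.0 * prop.memoryClockRate * 1e3            // kHz -> Hz
                    * (prop.memoryBusWidth / 8.0) / 1e9;          // bits -> bytes, -> GB/s
        printf("Peak bandwidth: %.0f GB/s\n", gbps);              // ~720 GB/s on GP100
        return 0;
    }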

SLIDE 7

GPU PERFORMANCE COMPARISON

                             P100     M40           K40
Double precision (TFLOP/s)   5.3      0.2           1.4
Single precision (TFLOP/s)   10.6     7.0           4.3
Half precision (TFLOP/s)     21.2     NA            NA
Memory bandwidth (GB/s)      720      288           288
Memory size                  16 GB    12 GB, 24 GB  12 GB

SLIDE 8

GP100 SM

Per-SM resources on GP100:
CUDA cores        64
Register file     256 KB
Shared memory     64 KB
Active threads    2048
Active blocks     32
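How many blocks actually become resident on an SM depends on a kernel's register and shared-memory footprint as well as the 2048-thread and 32-block limits above. Below is a minimal sketch (not from the slides) using the CUDA occupancy API; the saxpy kernel is only an illustration.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int blockSize = 256;          // threads per block
        int blocksPerSM = 0;

        // Maximum resident blocks of this kernel per SM, given its resource
        // usage and the hardware limits listed above.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, saxpy,
                                                      blockSize, 0 /* dynamic smem */);

        printf("Resident blocks per SM: %d (%d of 2048 threads)\n",
               blocksPerSM, blocksPerSM * blockSize);
        return 0;
    }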

SLIDE 9

[Diagram: one Maxwell SM next to two P100 SMs, each partition showing cores, FP64 units, LD/ST units, SFUs, registers, warp slots, and shared memory]

More resources per core

2x registers
1.33x shared memory capacity
2x shared memory bandwidth
2x warps

Higher Instruction Throughput

SLIDE 10

IEEE 754 FLOATING POINT ON GP100

3 sizes, 3 speeds, all fast

Feature             Half precision     Single precision   Double precision
Layout              s5.10              s8.23              s11.52
Issue rate          pair every clock   1 every clock      1 every 2 clocks
Subnormal support   Yes                Yes                Yes
Atomic addition     Yes                Yes                Yes
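One consequence of the last row: GP100 supports native FP64 atomic addition in global memory, which earlier GPUs had to emulate with atomicCAS loops. Below is a minimal sketch (not from the slides), compiled with -arch=sm_60.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void sumAll(const double* x, int n, double* result) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(result, x[i]);   // native FP64 atomic on GP100 (sm_60)
    }

    int main() {
        const int n = 1 << 20;
        double *x, *result;
        cudaMallocManaged(&x, n * sizeof(double));
        cudaMallocManaged(&result, sizeof(double));
        for (int i = 0; i < n; ++i) x[i] = 1.0;
        *result = 0.0;

        sumAll<<<(n + 255) / 256, 256>>>(x, n, result);
        cudaDeviceSynchronize();
        printf("sum = %.0f\n", *result);      // expect 1048576
        cudaFree(x);
        cudaFree(result);
        return 0;
    }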

SLIDE 11

HALF-PRECISION FLOATING POINT (FP16)

  • 16 bits
  • 1 sign bit, 5 exponent bits, 10 fraction bits
  • 2^40 dynamic range
  • Normalized values: 1024 values for each power of 2, from 2^-14 to 2^15
  • Subnormals at full speed: 1024 values from 2^-24 to 2^-15
  • Special values: ±Infinity, Not-a-Number (NaN)

[Bit layout: s | exp (5 bits) | frac (10 bits)]

USE CASES

Deep learning training, radio astronomy, sensor data, image processing
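For the arithmetic itself, here is a minimal sketch (not from the slides) using the cuda_fp16.h intrinsics. GP100 reaches its peak FP16 rate by operating on half2 pairs packed into 32-bit registers; the haxpy kernel name and data layout are illustrative only.

    #include <cuda_fp16.h>
    #include <cuda_runtime.h>

    // y = a*x + y on packed half2 data; n2 is the number of half2 pairs.
    __global__ void haxpy(int n2, __half a, const __half2* x, __half2* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        __half2 a2 = __halves2half2(a, a);    // broadcast the scalar into both lanes
        if (i < n2)
            y[i] = __hfma2(a2, x[i], y[i]);   // two half-precision FMAs per instruction
    }

Data is typically converted from float with the __float2half intrinsic or kept in FP16 end to end; compile for sm_60 or later to get the full-rate half2 path.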

SLIDE 12

NVLink

SLIDE 13

NVLINK

NVLink on Tesla P100:
P100 supports 4 NVLinks
Up to 94% bandwidth efficiency
Supports reads, writes, and atomics to peer GPUs
Supports read/write access to an NVLink-enabled CPU
Links can be ganged for higher bandwidth
Each of the four links provides 40 GB/s of bidirectional bandwidth

SLIDE 14

NVLINK - GPU CLUSTER

Two fully connected quads, connected at corners
160 GB/s per GPU bidirectional to peers
Load/store access to peer memory
Full atomics to peer GPUs
High-speed copy engines for bulk data copies
PCIe to/from the CPU
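The peer load/store and atomic capabilities are exposed through the usual CUDA peer-access calls; over NVLink the same API applies as over PCIe, the fabric simply provides far more bandwidth. Below is a minimal two-GPU sketch (not from the slides); the kernel launch is only indicated in a comment.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);    // can GPU 0 reach GPU 1?
        if (!canAccess) { printf("no peer access between GPU 0 and GPU 1\n"); return 1; }

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);             // map GPU 1's memory into GPU 0

        float* onGpu1 = nullptr;
        cudaSetDevice(1);
        cudaMalloc(&onGpu1, 1024 * sizeof(float));

        cudaSetDevice(0);
        // Kernels launched on GPU 0 may now load, store, and perform atomics on
        // onGpu1 directly; cudaMemcpyPeerAsync uses the copy engines for bulk data.
        // myKernel<<<grid, block>>>(onGpu1);          // hypothetical kernel

        cudaSetDevice(1);
        cudaFree(onGpu1);
        return 0;
    }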

SLIDE 15

NVLINK TO CPU

Fully connected quad
120 GB/s per GPU bidirectional for peer traffic
40 GB/s per GPU bidirectional to the CPU
Direct load/store access to CPU memory
High-speed copy engines for bulk data movement

SLIDE 16

TESLA P100 PHYSICAL CONNECTOR

With NVLink

SLIDE 17

HBM2 STACKED MEMORY

SLIDE 18

HBM2: 720 GB/SEC BANDWIDTH

And ECC is free

[Diagram: package cross-section showing the GPU and a 4-high HBM2 stack with a spacer, mounted via bumps on a silicon carrier over the package substrate]

SLIDE 19

UNIFIED MEMORY

SLIDE 20

PAGE MIGRATION ENGINE

Support Virtual Memory Demand Paging

49-bit Virtual Addresses

Sufficient to cover 48-bit CPU address + all GPU memory

GPU page faulting capability

Can handle thousands of simultaneous page faults

Up to 2 MB page size

Better TLB coverage of GPU memory


SLIDE 21

KEPLER/MAXWELL UNIFIED MEMORY

Performance Through Data Locality

Migrate data to the accessing processor
Guarantee global coherency
Still allows explicit hand tuning

Simpler Programming & Memory Model

Single allocation, single pointer, accessible anywhere
Eliminate need for explicit copy
Greatly simplifies code porting

[Diagram (CUDA 6+): Unified Memory spanning the CPU and a Kepler GPU; allocations up to GPU memory size]
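In code, the CUDA 6 model is a single cudaMallocManaged allocation used from both processors. Below is a minimal sketch (not from the slides).

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float* data, int n, float s) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= s;
    }

    int main() {
        const int n = 1 << 20;
        float* data;
        cudaMallocManaged(&data, n * sizeof(float));    // one allocation, one pointer

        for (int i = 0; i < n; ++i) data[i] = 1.0f;     // CPU writes it directly

        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f); // GPU uses the same pointer
        cudaDeviceSynchronize();                        // no explicit cudaMemcpy anywhere

        printf("data[0] = %.1f\n", data[0]);            // CPU reads the result: 2.0
        cudaFree(data);
        return 0;
    }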

SLIDE 22

PASCAL UNIFIED MEMORY

Large datasets, simple programming, high performance

Allocate Beyond GPU Memory Size
  Enable large data models
  Oversubscribe GPU memory
  Allocate up to system memory size
Tune Unified Memory Performance
  Usage hints via the cudaMemAdvise API
  Explicit prefetching API
Simpler Data Access
  CPU/GPU data coherence
  Unified memory atomic operations

[Diagram (CUDA 8): Unified Memory spanning the CPU and a Pascal GPU]
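Below is a minimal CUDA 8 sketch (not from the slides) of the new knobs: an allocation that may exceed GPU memory and is paged in on demand, a cudaMemAdvise placement hint, and an explicit cudaMemPrefetchAsync. The 32 GB size and the increment kernel are illustrative assumptions.

    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    __global__ void increment(char* data, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    int main() {
        int device = 0;
        cudaSetDevice(device);

        // Oversubscription: this may exceed the 16 GB of HBM2 (assumes enough system RAM);
        // pages migrate to the GPU on demand as they are touched.
        const size_t n = 32ull << 30;                   // 32 GB
        char* data;
        cudaMallocManaged(&data, n);

        const size_t chunk = 1ull << 30;                // work on the first 1 GB
        memset(data, 0, chunk);                         // pages start resident on the CPU

        // Hint: prefer to keep these pages resident in GPU memory.
        cudaMemAdvise(data, n, cudaMemAdviseSetPreferredLocation, device);

        // Explicitly prefetch the first chunk instead of faulting it in page by page.
        cudaMemPrefetchAsync(data, chunk, device, 0);

        increment<<<(unsigned)((chunk + 255) / 256), 256>>>(data, chunk);
        cudaDeviceSynchronize();

        printf("data[0] = %d\n", data[0]);              // CPU and GPU stay coherent
        cudaFree(data);
        return 0;
    }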

SLIDE 23

INTRODUCING TESLA P100

New GPU Architecture to Enable the World’s Fastest Compute Node

Pascal Architecture: Highest Compute Performance
NVLink: GPU Interconnect for Maximum Scalability
HBM2 Stacked Memory: Unifying Compute & Memory in a Single Package
Page Migration Engine: Simple Parallel Programming with 512 TB of Virtual Memory

[Diagram: Unified Memory spanning the CPU and Tesla P100]

More P100 features: compute preemption, new instructions, larger L2 cache, more…
Find out more at http://devblogs.nvidia.com/parallelforall/inside-pascal