SLIDE 1

Accelerating Exascale

How the End of Moore’s Law Scaling is Changing the Machines You Use, the Way You Code, and the Algorithms You Use

Steve Oberlin, CTO, Accelerated Computing, NVIDIA

Mar 2, 2014

SLIDE 2

A Little Time Travel

SLIDE 3

The Last Single-CPU Supercomputer

SLIDE 4

Seymour’s Last (Successful) Supercomputer

SLIDE 5

“Attack of the Killer Micros”

SLIDE 6

My Last Supercomputer

SLIDE 7

Future Shock

SLIDE 8

The Cold Equations

SLIDE 9

Hitting a Frequency Wall?

G Bell, History of Supercomputers, LLNL, April 2013

SLIDE 10

How To Build A Frequency Wall

Depletion of ILP

“We’re running out of computer science...”

Justin Rattner, Micro2000 presentation, 1990

End of Voltage Scaling

Maxed out power budget

SLIDE 11

The End of Voltage Scaling

The Good Old Days

Leakage was not important, and voltage scaled with feature size:

L' = L/2, V' = V/2, E' = C'V'² = E/8, f' = 2f, D' = 1/L'² = 4D, P' = P

Halve L and get 4x the transistors and 8x the capability for the same power.

The New Reality

Leakage has limited threshold voltage, largely ending voltage scaling:

L' = L/2, V' ≈ V, E' = C'V'² = E/2, f' = 2f, D' = 1/L'² = 4D, P' = 4P

Halve L and get 4x the transistors and 8x the capability for 4x the power, or 2x the capability for the same power in ¼ the area.
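To make the scaling arithmetic above concrete, here is a minimal sketch (plain host-side C++, compilable with nvcc or any C++ compiler; the factors are the idealized ones from this slide, not measured data) that applies one full shrink under each regime and prints the resulting energy, density, capability, and power factors.

#include <cstdio>

// Scale factors after one full shrink (L' = L/2) under a given voltage rule.
// energy  ~ C * V^2   (capacitance tracks L)
// density ~ 1 / L^2
// power   ~ density * frequency * energy   (per unit area)
struct Shrink {
    double voltage;   // V' / V
    double frequency; // f' / f
};

static void report(const char* name, Shrink s) {
    double cap        = 0.5;                            // C' / C
    double energy     = cap * s.voltage * s.voltage;    // E' / E
    double density    = 4.0;                            // D' / D
    double capability = density * s.frequency;          // ops/s per unit area
    double power      = density * s.frequency * energy; // P' / P
    printf("%-16s energy x%.3f  density x%.1f  capability x%.1f  power x%.1f\n",
           name, energy, density, capability, power);
}

int main() {
    report("Dennard scaling", {0.5, 2.0}); // V' = V/2: 8x capability, same power
    report("Post-Dennard",    {1.0, 2.0}); // V' ~ V:  8x capability, 4x power
    return 0;
}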

SLIDE 12

The End of Historic Scaling

C. Moore, Data Processing in Exascale-Class Computer Systems, Salishan, April 2011

SLIDE 13

Chickens and Plows

SLIDE 14

Rise of Accelerated Computing

Adoption of Accelerators: the share of HPC customers with accelerators grew from 44% to 77% between 2010 and 2013

GPU Accelerated Apps: 113 (2011), 182 (2012), 242 (2013)

NVIDIA HPC Share: NVIDIA GPUs 85%, Intel Phi 4%, others 11%

Sources: Intersect360 Research HPC User Site Census: Systems, July 2013; IDC HPC End-User MSC Study, 2013

SLIDE 15

Accelerator Perf/Watt

[Chart: SGEMM/W, normalized, by GPU generation: Tesla (2008), Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016)]

SLIDE 16

GPUs Power World’s 10 Greenest Supercomputers

Green500 Rank   MFLOPS/W   Site
1               4,503.17   GSIC Center, Tokyo Tech
2               3,631.86   Cambridge University
3               3,517.84   University of Tsukuba
4               3,185.91   Swiss National Supercomputing (CSCS)
5               3,130.95   ROMEO HPC Center
6               3,068.71   GSIC Center, Tokyo Tech
7               2,702.16   University of Arizona
8               2,629.10   Max-Planck
9               2,629.10   (Financial Institution)
10              2,358.69   CSIRO
37              1,959.90   Intel Endeavor (top Xeon Phi cluster)
49              1,247.57   Météo France (top CPU cluster)

SLIDE 17

The Exascale Challenge

SLIDE 18

The Efficiency Gap

2013: 20 PF, 10 MW, 2 GFLOPS/W on LINPACK
2020: 1,000 PF (50x), 20 MW (2x), 50 GFLOPS/W (25x) on LINPACK

Closing the 25x efficiency gap between 2013 and 2020: 2-4x from technology, 6-12x from circuits and architecture.

Energy efficiency improvements due only to technological scaling:

Year                                     2013-14   2016    2020
Process node                             28 nm     16 nm   7 nm
Logic Energy Scaling Factor (0.70x)      1         0.70    0.49
Wires Energy Scaling Factor (0.90x)      1         0.85    0.72
VDD (Volts)                              0.9       0.80    0.75
Total Power (W) (70% Logic / 30% Wires)  100       58      38
Improvement from scaling alone           1.00      1.70    2.57
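As a quick check of the gap using only the numbers on this slide (a sketch, not new data): LINPACK efficiency must rise 25x, from 2 to 50 GFLOPS/W, while the table credits technology scaling alone with about 2.57x by 2020, leaving roughly a 10x factor that has to come from circuits and architecture; that is the slide's 2-4x tech / 6-12x circuits-and-architecture split.

#include <cstdio>

int main() {
    double eff_2013 = 2.0;        // GFLOPS/W on LINPACK (20 PF / 10 MW)
    double eff_2020 = 50.0;       // GFLOPS/W target (1,000 PF / 20 MW)
    double tech_only = 2.57;      // efficiency gain from 28 nm -> 7 nm scaling alone

    double required  = eff_2020 / eff_2013;   // 25x overall
    double from_arch = required / tech_only;  // ~9.7x from circuits + architecture
    printf("required %.0fx = technology %.2fx * circuits+arch %.1fx\n",
           required, tech_only, from_arch);
    return 0;
}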

SLIDE 19

Pascal with HBM Stacked Memory

Cross-section view: the GP100 GPU and HBM stacks sit side by side on a passive silicon interposer, which sits on the package substrate.

  • 4x Bandwidth
  • More Capacity
  • ¼ Power per bit
SLIDE 20

NVLink

  • 5x PCIe bandwidth: move data at CPU memory speed
  • 3x lower energy/bit
  • Links a Tesla GPU (with stacked memory) to a POWER or ARM CPU (with DDR memory)

NVLink: 80 GB/s    DDR4: 50-75 GB/s    HBM: 1 Terabyte/s

SLIDE 21

SP Energy Efficiency @ 28 nm

[Chart: normalized single-precision energy efficiency at 28 nm for Fermi, Kepler, and Maxwell]

SLIDE 22

Cost of Computation vs. Communications

[Figure: approximate energy per operation in a 28 nm IC: a 64-bit DP operation ~20 pJ; a 256-bit access to an 8 kB SRAM ~50 pJ; moving 256 bits on-chip ~26 pJ over a short hop, ~256 pJ over a longer one, and ~1000 pJ across a 20 mm die; an efficient off-chip link ~500 pJ; a DRAM Rd/Wr ~16,000 pJ]
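Assuming the approximate energies recovered above (this is a reconstruction of the original figure, so treat the exact pairings as estimates), a few ratios show why locality dominates the energy budget: the arithmetic is cheap compared with moving its operands, consistent with the ">100:1 global vs. local energy cost" cited in the conclusions once data has to leave the chip.

#include <cstdio>

int main() {
    // Approximate 28 nm energies from the figure above (estimates).
    double flop_pj = 20.0;     // 64-bit DP operation
    double sram_pj = 50.0;     // 256-bit access to an 8 kB SRAM
    double wire_pj = 1000.0;   // 256 bits moved ~20 mm across the die
    double link_pj = 500.0;    // 256 bits over an efficient off-chip link
    double dram_pj = 16000.0;  // 256-bit DRAM read/write

    printf("local SRAM access  : %6.1fx one DP op\n", sram_pj / flop_pj);
    printf("cross-die transfer : %6.1fx one DP op\n", wire_pj / flop_pj);
    printf("off-chip link      : %6.1fx one DP op\n", link_pj / flop_pj);
    printf("DRAM access        : %6.1fx one DP op\n", dram_pj / flop_pj);
    return 0;
}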

SLIDE 23

Cost of Computation vs. Communications

SLIDE 24

Cost of Computation vs. Communications

SM XBAR

SLIDE 25

Enhanced On-Chip Signaling

Assumptions: 180 W continuous power, 25x25 mm die, bisection bandwidth = 6 TB/s, data moved an average of 15 mm.

Standard P&R: ECost = 190 fJ/bit/mm, PAVG = 68 W, global signaling is 38% of GPU power (vs. compute + other), delay over 25 mm ~ 17.0 ns

"Custom wire": ECost = 145 fJ/bit/mm, PAVG = 39 W, global signaling is 22% of GPU power (vs. compute + other), delay over 25 mm ~ 12.5 ns

nTECH’13 - John Wilson

SLIDE 26

Attack of the Killer Smartphones

[What if there were no long wires?]

SLIDE 27

TEGRA K1

Mobile Super Chip

  • Unify GPU and Tegra architecture: the mobile roadmap (Tegra 3, Tegra 4, Tegra K1) joins the GPU architecture roadmap (Tesla, Fermi, Kepler, Maxwell)
  • CUDA enabled
  • 192 Kepler cores, the same Kepler GPU architecture as Tesla, Quadro, and GeForce

SLIDE 28

JETSON TK1

Development platform for embedded computer vision, robotics, and medical applications
  • 192 Kepler cores · 326 GFLOPS
  • 4 ARM A15 cores
  • 2 GB DDR3L, 16-256 GB flash
  • Gigabit Ethernet
  • CUDA enabled
  • 5-11 watts
  • $192, available now

SLIDE 29

Perf/Watt Comparison

K40 + CPU
  • Peak SP: 4.2 TFLOPS
  • SP SGEMM: ~3.8 TFLOPS
  • Memory: 12 GB @ 288 GB/s
  • Power: GPU 235 W, CPU + memory 150 W, total 385 W
  • Perf/Watt: ~10 SP GFLOPS/W

TK1
  • Peak SP: 326 GFLOPS
  • SP SGEMM: ~290 GFLOPS
  • Memory: 2 GB @ 14.9 GB/s
  • Power: GPU + CPU < 11 W working hard, about 1/35 of K40 + CPU
  • Perf/Watt: ~26 SP GFLOPS/W

For the same power as K40 + CPU, you could have 10+ TFLOPS SP and 70 GB of DRAM @ 500+ GB/s.
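The closing claim is straight arithmetic on the figures above; a minimal sketch (using only this slide's numbers) showing how many TK1s fit in the K40 + CPU power envelope and what they would add up to:

#include <cstdio>

int main() {
    double node_w     = 385.0;   // K40 + CPU + memory, W
    double tk1_w      = 11.0;    // one TK1 working hard, W
    double tk1_sgemm  = 290.0;   // SP SGEMM, GFLOPS
    double tk1_mem_gb = 2.0;     // GB
    double tk1_bw_gbs = 14.9;    // GB/s

    double n = node_w / tk1_w;   // ~35 TK1s in the same power budget
    printf("%.0f TK1s: %.1f TFLOPS SP, %.0f GB DRAM, %.0f GB/s aggregate bandwidth\n",
           n, n * tk1_sgemm / 1000.0, n * tk1_mem_gb, n * tk1_bw_gbs);
    return 0;
}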

SLIDE 30

25x or 1 Exa?

SLIDE 31

Likely Exascale Node

Three Building Blocks (GPU, CPU, Network)

[Diagram: the GPU building block has several groups of throughput-optimized cores (TOCs, each with cores C0..Cn and an L2) on a NoC, with memory controllers (MC) to DRAM stacks and a NIC; the CPU building block has latency-optimized cores LOC 0..LOC 7 (each with an L2cpu) on a NoC/bus, with an MC to NVRAM and bulk DRAM; links from the NIC join the node to the system interconnect spanning ~100K nodes]

GPU: throughput optimized, parallel code
CPU: latency optimized, OS, pointer chasing
NIC: links to a system interconnect of ~100K nodes

Direct Evolution of 2016 Node
  • Programming model continuity
  • Specialized cores: GPU for parallel work, CPU for serial work
  • Coherent memory system with stacked, bulk, & NVRAM
  • Amortize non-parallel costs: increase GPU:CPU ratio, smaller CPU

SLIDE 32

LINPACK vs. Real Apps

Oreste Villa, Scaling the Power Wall: A Path to Exascale

SLIDE 33

Future Programming Systems

SLIDE 34

A Simple Parallel Program

forall molecule in set {
  forall neighbor in molecule.neighbors {
    forall force in forces {
      molecule.force = reduce_sum(force(molecule, neighbor))
    }
  }
}

SLIDE 35

Why Is This Easy?

forall molecule in set {
  forall neighbor in molecule.neighbors {
    forall force in forces {
      molecule.force = reduce_sum(force(molecule, neighbor))
    }
  }
}

  • No machine details
  • All parallelism is expressed
  • Synchronization is semantic (in the reduction)

SLIDE 36

We Can Make It Hard

pid = fork();                  // explicitly managing threads

lock(struct.lock);             // complicated, error-prone synchronization
// manipulate struct
unlock(struct.lock);

code = send(pid, tag, &msg);   // partition across nodes
SLIDE 37

Programmers, Tools, and Architecture Need to Play Their Positions

Programmer:

forall molecule in set {                     // launch a thread array
  forall neighbor in molecule.neighbors {    // doubly nested
    forall force in forces {
      molecule.force = reduce_sum(force(molecule, neighbor))
    }
  }
}

Tools:
  • Map foralls in time and space
  • Map molecules across memories
  • Stage data up/down the hierarchy
  • Select mechanisms

Architecture:
  • Exposed storage hierarchy
  • Fast comm/sync/thread mechanisms
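As one concrete (purely illustrative) target for that mapping, here is a minimal CUDA sketch of the "launch a thread array" step: one thread per molecule, the neighbor and force loops run inside the thread, and the reduction is a per-thread sum. The Molecule layout, MAX_NEIGHBORS, NUM_FORCES, and force_term are hypothetical stand-ins, not part of the original example; staging data through the memory hierarchy and choosing the launch shape are exactly the jobs the slide assigns to the tools and architecture.

#include <cuda_runtime.h>

#define MAX_NEIGHBORS 32
#define NUM_FORCES 2

struct Molecule {
    float x, y, z;                   // position
    float force;                     // scalar force magnitude (for illustration)
    int   neighbors[MAX_NEIGHBORS];  // indices into the molecule array
    int   num_neighbors;
};

// Hypothetical pairwise force terms standing in for the slide's "forces" set.
__device__ float force_term(int which, const Molecule& a, const Molecule& b) {
    float dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
    float r2 = dx * dx + dy * dy + dz * dz + 1e-6f;
    return (which == 0) ? 1.0f / r2 : -1.0f / (r2 * r2);
}

// "forall molecule in set" -> one CUDA thread per molecule.
__global__ void compute_forces(Molecule* set, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sum = 0.0f;                                  // reduce_sum, private to this thread
    for (int k = 0; k < set[i].num_neighbors; ++k) {   // forall neighbor
        const Molecule& nb = set[set[i].neighbors[k]];
        for (int f = 0; f < NUM_FORCES; ++f) {         // forall force
            sum += force_term(f, set[i], nb);
        }
    }
    set[i].force = sum;
}

// Host side: "map foralls in time and space" reduces to choosing a launch shape.
void launch_compute_forces(Molecule* d_set, int n) {
    int block = 256;
    int grid  = (n + block - 1) / block;
    compute_forces<<<grid, block>>>(d_set, n);
}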

SLIDE 38

System Functions -> Application Optimizations

Energy Management

Power allocation among LOCs and TOCs

Resilience

Failure-tolerant applications by design

SLIDE 39

Conclusions

SLIDE 40

Exascale (25x) is Within Reach

(Not so sure about Zetta-scale…)

  • Requires clever circuits and ruthlessly efficient architecture; Moore’s Law cannot be relied upon
  • Need to exploit locality: > 100:1 global vs. local energy cost
  • Need to expose massive concurrency: an exaflop at O(GHz) clocks ⇒ O(10 billion)-way parallelism
  • Need to simplify programming and automate mapping: “MPI + X” is only a step in the right direction

SLIDE 41

Questions?