

SLIDE 1

Efficiency and Programmability: Enablers for ExaScale

Bill Dally | Chief Scientist and SVP, Research, NVIDIA | Professor (Research), EE&CS, Stanford

SLIDE 2

Scientific Discovery and Business Analytics

Driving an Insatiable Demand for More Computing Performance

SLIDE 3

Figure: the three demands (compute, communication, memory & storage) common to HPC and analytics.

SLIDE 4

The End of Historic Scaling

Source: C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011
SLIDE 5

“Moore’s Law gives us more transistors…

Dennard scaling made them useful.”

Bob Colwell, DAC 2013, June 4, 2013

SLIDE 6

TITAN

  • 18,688 NVIDIA Tesla K20X GPUs
  • 27 Petaflops peak: 90% of performance from the GPUs
  • 17.59 Petaflops sustained on Linpack
  • Numerous real science applications
  • 2.14 GF/W: the most efficient accelerator

SLIDE 7

SLIDE 8

You Are Here

2013: 20 PF, 18,000 GPUs, 10 MW, 2 GFLOPS/W, ~10^7 threads
2020: 1,000 PF (50x), 72,000 HCNs (4x), 20 MW (2x), 50 GFLOPS/W (25x), ~10^10 threads (1,000x)

(1,000 PF in a 20 MW budget is what requires 50 GFLOPS/W.)

SLIDE 9

Chart: GFLOPS/W needed (rising to ~50 by 2020) vs. what process scaling provides, 2013-2020; the space between the curves is the EFFICIENCY GAP.

SLIDE 10

SLIDE 11

Chart: the GFLOPS/W gap, 2013-2020, with process contributing 2.2x and circuits a further 3x.

SLIDE 12

Simpler Cores = Energy Efficiency

Source: Azizi [PhD 2010]

SLIDE 13

CPU (Westmere, 32 nm): 1690 pJ/flop. Optimized for latency: caches.

GPU (Kepler, 28 nm): 140 pJ/flop, about 12x less. Optimized for throughput: explicit management of on-chip memory.

SLIDE 14

Chart: process (2.2x), circuits (3x), and architecture (4x) together close the gap: 2.2 × 3 × 4 ≈ 26x, roughly the 25x needed.

SLIDE 15

Programmers, Tools, and Architecture Need to Play Their Positions

Programmer Architecture Tools

SLIDE 16

Programmers, Tools, and Architecture Need to Play Their Positions

Programmer: algorithm, all of the parallelism, abstract locality
Architecture: fast mechanisms, exposed costs
Tools: combinatorial optimization, mapping, selection of mechanisms

SLIDE 17

Figure: a candidate exascale node. Latency-optimized cores (LOCs) and arrays of streaming multiprocessors (SMs) sit on networks-on-chip (NOCs), with a crossbar (XBAR) to L2 banks and DRAM and network (NW) I/O interfaces; each SM contains multiple lanes.

SLIDE 18
An Enabling HPC Network

  • <1 ms latency
  • Scalable bandwidth
  • Small messages: 50% @ 32 B
  • Global adaptive routing
  • PGAS
  • Collectives & atomics
  • MPI offload

SLIDE 19

An Open HPC Network Ecosystem

Common:

  • Software-NIC API
  • NIC-Router Channel

Shared by processor/NIC vendors, system vendors, and networking vendors.

SLIDE 20

Programmer | Architecture | Tools

Programming: parallelism, heterogeneity, hierarchy
Power: 25x efficiency with only 2.2x from process

SLIDE 21

“Super” Computing

From Super Computers to Super Phones

SLIDE 22

Backup

SLIDE 23

SLIDE 24

In The Past, Demand Was Fueled by Moore's Law

Source: Moore, Electronics 38(8), April 19, 1965

SLIDE 25

ILP Was Mined Out in 2001

Source: Dally et al., "The Last Classical Computer", ISAT Study, 2001

Chart: performance (ps/Inst, log scale) vs. year, 1980-2020, with the linear trend diverging from actual performance by 30:1, then 1,000:1, then 30,000:1.

SLIDE 26

Voltage Scaling Ended in 2005

Source: Moore, ISSCC Keynote, 2003

SLIDE 27

Summary

Moore’s law is alive and well, but…
  • Instruction-level parallelism (ILP) was mined out in 2001
  • Voltage scaling (Dennard scaling) ended in 2005
  • Most power is spent on communication

What does this mean to you?

SLIDE 28

The End of Historic Scaling

Source: C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011
SLIDE 29

In the Future

  • All performance is from parallelism
  • Machines are power limited (efficiency IS performance)
  • Machines are communication limited (locality IS performance)

SLIDE 30

Two Major Challenges

Programming: parallel (~10^10 threads), hierarchical, heterogeneous
Energy efficiency: 25x in 7 years (~2.2x from process)

SLIDE 31

Chart: GFLOPS/W needed through 2020 vs. what process scaling alone provides.

SLIDE 32

How is Power Spent in a CPU?

In-order embedded CPU (Dally [2008]):
  • Instruction supply: 42%
  • Clock + control logic: 24%
  • Data supply: 17%
  • Register file: 11%
  • ALU: 6%

Out-of-order high-performance CPU, Alpha 21264 (Natarajan [2003]):
  • Clock + pins: 45%
  • RF: 14%
  • Fetch: 11%
  • Issue: 11%
  • Rename: 10%
  • Data supply: 5%
  • ALU: 4%

SLIDE 33

Energy Shopping List

Processor technology: 40 nm → 10 nm
  • Vdd (nominal): 0.9 V → 0.7 V
  • DFMA energy: 50 pJ → 7.6 pJ
  • 64b 8 KB SRAM read: 14 pJ → 2.1 pJ
  • Wire energy (256 bits, 10 mm): 310 pJ → 174 pJ

Memory technology: 45 nm → 16 nm
  • DRAM interface pin bandwidth: 4 Gbps → 50 Gbps
  • DRAM interface energy: 20-30 pJ/bit → 2 pJ/bit
  • DRAM access energy: 8-15 pJ/bit → 2.5 pJ/bit

FP op lower bound = 4 pJ

Keckler [Micro 2011], Vogelsang [Micro 2010]
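For scale, the 2020 target of 50 GFLOPS/W is a total budget of 1 W / 50 GFLOPS = 20 pJ per flop, and that 20 pJ must cover not only the 7.6 pJ DFMA but all of the instruction supply, data supply, and memory traffic around it.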

SLIDE 34

SLIDE 35

Chart: the GFLOPS/W gap, 2013-2020, with process contributing 2.2x and circuits a further 3x.

SLIDE 36

Throughput-Optimized Core (TOC) vs. Latency-Optimized Core (LOC)

Figure: the LOC pipeline (PC, branch predict, I$, register rename, instruction window, reorder buffer, register file, four ALUs) beside the far simpler TOC pipeline (PCs, select, I$, register file, four ALUs).

SLIDE 37

Streaming Multiprocessor (SM)

Figure: SIMT lanes (ALU, SFU, MEM, TEX) fed by a warp scheduler, 32 KB of shared memory, and a 32-bank main register file; the main register file accounts for 15% of SM energy.

SLIDE 38

Hierarchical Register File

Figure: histograms over all values produced, bucketed by read count (0, 1, 2, >2 times) and by lifetime (1, 2, 3, >3 instructions). Most values are read at most twice and die within a few instructions.
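To make the histograms concrete, here is a minimal Python sketch on a hypothetical toy trace (all names invented for illustration) of how produced values can be bucketed by read count and lifetime:

from collections import Counter

# Toy SSA-style trace: (destination, source registers) in program order.
# Hypothetical data; the real histograms come from full GPU traces.
trace = [
    ("r1", []),
    ("r2", ["r1", "r1"]),   # r2 = r1 * r1
    ("r3", ["r2"]),
    ("r4", ["r3", "r1"]),
]

reads = Counter()   # how many times each value is read
born = {}           # instruction index where each value is produced
last_read = {}      # instruction index of each value's last use

for pc, (dest, srcs) in enumerate(trace):
    for s in srcs:
        reads[s] += 1
        last_read[s] = pc
    born[dest] = pc

for reg in born:
    life = last_read.get(reg, born[reg]) - born[reg]
    print(f"{reg}: read {reads[reg]} times, lifetime {life}")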

SLIDE 39

Register File Caching (RFC)

Figure: a register file cache (RFC, 4x32-bit banks, 3R1W) between the ALU and the main register file (MRF, 4x128-bit banks, 1R1W), with operand routing and operand buffering to the SFU, MEM, and TEX units.

SLIDE 40

Energy Savings from RF Hierarchy

54% Energy Reduction

Source: Gebhart et al. (MICRO 2011)

SLIDE 41

Two Major Challenges

Programming: parallel (~10^10 threads), hierarchical, heterogeneous
Energy efficiency: 25x in 7 years (~2.2x from process)

SLIDE 42

Skills on LinkedIn

Skill                  Size (approx)   Growth (rel)
C++                    1,000,000       -8%
Javascript             1,000,000       -1%
Python                 429,000          7%
Fortran                90,000         -11%
MPI                    21,000          -3%
x86 Assembly           17,000          -8%
CUDA                   14,000           9%
Parallel programming   13,000           3%
OpenMP                 8,000            2%
TBB                    389             10%
6502 Assembly          256            -13%

Source: linkedin.com/skills (as of Jun 11, 2013)

Mainstream programming (top rows) vs. parallel and assembly programming (bottom rows).

SLIDE 43

Parallel Programming is Easy

forall molecule in set:                      # 1E6 molecules
    forall neighbor in molecule.neighbors:   # 1E2 neighbors each
        forall force in forces:              # several forces
            molecule.force += force(molecule, neighbor)   # reduction
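As a runnable illustration, the same loop nest in plain (sequential) Python; Molecule, lj_force, and coulomb_force are toy stand-ins invented for this sketch:

from dataclasses import dataclass, field

@dataclass
class Molecule:
    position: float
    force: float = 0.0
    neighbors: list = field(default_factory=list)

def lj_force(m, n):       # toy stand-in for a Lennard-Jones pair force
    return 1.0 / (m.position - n.position)

def coulomb_force(m, n):  # toy stand-in for an electrostatic pair force
    return 0.5 / (m.position - n.position)

forces = [lj_force, coulomb_force]

def accumulate(molecules):
    # Every (molecule, neighbor, force) term is independent work;
    # only the += reduction needs coordination when parallelized.
    for m in molecules:
        for n in m.neighbors:
            for f in forces:
                m.force += f(m, n)

a, b = Molecule(0.0), Molecule(1.0)
a.neighbors, b.neighbors = [b], [a]
accumulate([a, b])
print(a.force, b.force)   # -1.5 1.5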

SLIDE 44

We Can Make It Hard

pid = fork();                 // explicitly managing threads
lock(struct.lock);            // complicated, error-prone synchronization
// manipulate struct
unlock(struct.lock);
code = send(pid, tag, &msg);  // partition across nodes

SLIDE 45

Programmers, Tools, and Architecture Need to Play Their Positions

Programmer Architecture Tools

SLIDE 46

Programmers, Tools, and Architecture Need to Play Their Positions

Programmer: algorithm, all of the parallelism, abstract locality
Architecture: fast mechanisms, exposed costs
Tools: combinatorial optimization, mapping, selection of mechanisms

SLIDE 47

OpenACC: Easy and Portable

Serial code (SAXPY):

do i = 1, 20*128
  do j = 1, 5000000
    fa(i) = a * fa(i) + fb(i)
  end do
end do

With OpenACC:

!$acc parallel loop
do i = 1, 20*128
  !dir$ unroll 1000
  do j = 1, 5000000
    fa(i) = a * fa(i) + fb(i)
  end do
end do
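For context: with an OpenACC-capable compiler of the time this built with a single flag, e.g. pgfortran -acc -Minfo=accel saxpy.f90 for PGI (the file name is hypothetical); compilers without OpenACC support simply ignore the directive, so one source serves both serial and accelerated builds.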

SLIDE 48

Conclusion

SLIDE 49

The End of Historic Scaling

Source: C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011
SLIDE 50

  • Parallelism is the source of all performance
  • Power limits all computing
  • Communication dominates power

SLIDE 51

Two Challenges

Programming: parallelism, heterogeneity, hierarchy
Power: 25x efficiency with only 2.2x from process

SLIDE 52

Programmer | Architecture | Tools

Programming: parallelism, heterogeneity, hierarchy
Power: 25x efficiency with only 2.2x from process

SLIDE 53

“Super” Computing

From Super Computers to Super Phones

SLIDE 54

Communication Takes More Energy Than Arithmetic

Figure: energy on a 20 mm die with 256-bit buses: a 64-bit DP op costs 20 pJ; a 256-bit access to an 8 kB SRAM costs 50 pJ; moving 256 bits on-chip costs 26 pJ locally and 256 pJ across the die; going off-chip costs ~1 nJ, or 500 pJ with an efficient off-chip link; a DRAM Rd/Wr costs 16 nJ. A DRAM access thus costs 800x a DP op (16 nJ / 20 pJ).

SLIDE 55

Key to Parallelism: Independent operations on independent data

sum(map(multiply, x, x))

Every pair-wise multiply is independent, so parallelism is permitted.
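A minimal runnable Python sketch of the same point; multiprocessing.Pool is just one stand-in for a parallel map, since any scheduler may claim parallelism the expression permits:

from operator import mul
from multiprocessing import Pool

x = [1.0, 2.0, 3.0, 4.0]

# Serial: the meaning is fixed by map/sum, not by an execution order.
total = sum(map(mul, x, x))

if __name__ == "__main__":
    # Parallel: each x[i]*x[i] is independent, so a pool may compute
    # them in any order, or all at once.
    with Pool() as pool:
        total_par = sum(pool.starmap(mul, zip(x, x)))
    assert total == total_par == 30.0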

SLIDE 56

Key to Locality: Data decomposition should drive mapping

Flat computation:

total = sum(x)
vs.
tiles = split(x)
partials = map(sum, tiles)
total = sum(partials)

SLIDE 57

Key to Locality: Data decomposition should drive mapping

Explicit decomposition:

total = sum(x)
vs.
tiles = split(x)
partials = map(sum, tiles)
total = sum(partials)
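A hedged Python sketch of the decomposed form; split is a hypothetical helper defined here, and a real tile size would be chosen to match a cache or memory level:

def split(x, tile=4):
    # Carve x into contiguous tiles so each partial sum touches
    # one cache-sized chunk at a time.
    return [x[i:i + tile] for i in range(0, len(x), tile)]

x = list(range(16))

total_flat = sum(x)                 # flat: one reduction over all of x

tiles = split(x)                    # explicit decomposition
partials = list(map(sum, tiles))    # per-tile partial sums (local)
total = sum(partials)               # small final reduction

assert total == total_flat == 120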