SLIDE 1

Accelerating Exascale

How the End of Moore’s Law Scaling is Changing the Machines You Use, the Way You Code, and the Algorithms You Use

Steve Oberlin, CTO, Accelerated Computing, NVIDIA

Mar 2, 2014

SLIDE 2

A Little Time Travel

SLIDE 3

The Last Single-CPU Supercomputer

SLIDE 4

Seymour’s Last (Successful) Supercomputer

SLIDE 5

“Attack of the Killer Micros”

SLIDE 6

My Last Supercomputer

SLIDE 7

Future Shock

SLIDE 8

The Cold Equations

SLIDE 9

Hitting a Frequency Wall?

G Bell, History of Supercomputers, LLNL, April 2013

SLIDE 10

How To Build A Frequency Wall

Depletion of ILP

“We’re running out of computer science...”

Justin Rattner, Micro2000 presentation, 1990

End of Voltage Scaling

Maxed out power budget

SLIDE 11

The End of Voltage Scaling

The Good Old Days

Leakage was not important, and voltage scaled with feature size:

L' = L/2, V' = V/2, E' = C'V'² = E/8, f' = 2f, D' = 1/L'² = 4D, P' = P

Halve L and get 4x the transistors and 8x the capability for the same power.

The New Reality

Leakage has limited threshold voltage, largely ending voltage scaling:

L' = L/2, V' ≈ V, E' = C'V'² = E/2, f' = 2f, D' = 1/L'² = 4D, P' = 4P

Halve L and get 4x the transistors and 8x the capability for 4x the power, or 2x the capability for the same power in ¼ the area.
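To make the scaling arithmetic above concrete, here is a minimal sketch (plain host-side C++, compilable with nvcc or any C++ compiler; the factors are the idealized ones from this slide, not measured data) that applies one full shrink under each regime and prints the resulting energy, density, capability, and power factors.

#include <cstdio>

// Scale factors after one full shrink (L' = L/2) under a given voltage rule.
// energy  ~ C * V^2   (capacitance tracks L)
// density ~ 1 / L^2
// power   ~ density * frequency * energy   (per unit area)
struct Shrink {
    double voltage;   // V' / V
    double frequency; // f' / f
};

static void report(const char* name, Shrink s) {
    double cap        = 0.5;                            // C' / C
    double energy     = cap * s.voltage * s.voltage;    // E' / E
    double density    = 4.0;                            // D' / D
    double capability = density * s.frequency;          // ops/s per unit area
    double power      = density * s.frequency * energy; // P' / P
    printf("%-16s energy x%.3f  density x%.1f  capability x%.1f  power x%.1f\n",
           name, energy, density, capability, power);
}

int main() {
    report("Dennard scaling", {0.5, 2.0}); // V' = V/2: 8x capability, same power
    report("Post-Dennard",    {1.0, 2.0}); // V' ~ V:  8x capability, 4x power
    return 0;
}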

SLIDE 12

The End of Historic Scaling

C. Moore, Data Processing in Exascale-Class Computer Systems, Salishan, April 2011

SLIDE 13

Chickens and Plows

SLIDE 14

Rise of Accelerated Computing

Adoption of Accelerators: the share of HPC customers with accelerators grew from 44% to 77% between 2010 and 2013

GPU Accelerated Apps: 113 (2011), 182 (2012), 242 (2013)

NVIDIA HPC Share: NVIDIA GPUs 85%, Intel Phi 4%, others 11%

Sources: Intersect360 Research HPC User Site Census: Systems, July 2013; IDC HPC End-User MSC Study, 2013

SLIDE 15

Accelerator Perf/Watt

[Chart: SGEMM/W, normalized, by GPU generation: Tesla (2008), Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016)]

SLIDE 16

GPUs Power World’s 10 Greenest Supercomputers

Green500 Rank   MFLOPS/W   Site
1               4,503.17   GSIC Center, Tokyo Tech
2               3,631.86   Cambridge University
3               3,517.84   University of Tsukuba
4               3,185.91   Swiss National Supercomputing (CSCS)
5               3,130.95   ROMEO HPC Center
6               3,068.71   GSIC Center, Tokyo Tech
7               2,702.16   University of Arizona
8               2,629.10   Max-Planck
9               2,629.10   (Financial Institution)
10              2,358.69   CSIRO
37              1,959.90   Intel Endeavor (top Xeon Phi cluster)
49              1,247.57   Météo France (top CPU cluster)

SLIDE 17

The Exascale Challenge

SLIDE 18

The Efficiency Gap

2013: 20 PF, 10 MW, 2 GFLOPS/W on LINPACK
2020: 1,000 PF (50x), 20 MW (2x), 50 GFLOPS/W (25x) on LINPACK

Closing the 25x efficiency gap between 2013 and 2020: 2-4x from technology, 6-12x from circuits and architecture.

Energy efficiency improvements due only to technological scaling:

Year                                     2013-14   2016    2020
Process node                             28 nm     16 nm   7 nm
Logic Energy Scaling Factor (0.70x)      1         0.70    0.49
Wires Energy Scaling Factor (0.90x)      1         0.85    0.72
VDD (Volts)                              0.9       0.80    0.75
Total Power (W) (70% Logic / 30% Wires)  100       58      38
Improvement from scaling alone           1.00      1.70    2.57
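As a quick check of the gap using only the numbers on this slide (a sketch, not new data): LINPACK efficiency must rise 25x, from 2 to 50 GFLOPS/W, while the table credits technology scaling alone with about 2.57x by 2020, leaving roughly a 10x factor that has to come from circuits and architecture; that is the slide's 2-4x tech / 6-12x circuits-and-architecture split.

#include <cstdio>

int main() {
    double eff_2013 = 2.0;        // GFLOPS/W on LINPACK (20 PF / 10 MW)
    double eff_2020 = 50.0;       // GFLOPS/W target (1,000 PF / 20 MW)
    double tech_only = 2.57;      // efficiency gain from 28 nm -> 7 nm scaling alone

    double required  = eff_2020 / eff_2013;   // 25x overall
    double from_arch = required / tech_only;  // ~9.7x from circuits + architecture
    printf("required %.0fx = technology %.2fx * circuits+arch %.1fx\n",
           required, tech_only, from_arch);
    return 0;
}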

SLIDE 19

Pascal with HBM Stacked Memory

Cross-section view: the GP100 GPU and HBM stacks sit side by side on a passive silicon interposer, which sits on the package substrate.

  • 4x Bandwidth
  • More Capacity
  • ¼ Power per bit
SLIDE 20

NVLink

  • 5x PCIe bandwidth: move data at CPU memory speed
  • 3x lower energy/bit
  • Links a Tesla GPU (with stacked memory) to a POWER or ARM CPU (with DDR memory)

NVLink: 80 GB/s    DDR4: 50-75 GB/s    HBM: 1 Terabyte/s

SLIDE 21

SP Energy Efficiency @ 28 nm

[Chart: normalized single-precision energy efficiency at 28 nm for Fermi, Kepler, and Maxwell]

SLIDE 22

Cost of Computation vs. Communications

[Figure: approximate energy per operation in a 28 nm IC: a 64-bit DP operation ~20 pJ; a 256-bit access to an 8 kB SRAM ~50 pJ; moving 256 bits on-chip ~26 pJ over a short hop, ~256 pJ over a longer one, and ~1000 pJ across a 20 mm die; an efficient off-chip link ~500 pJ; a DRAM Rd/Wr ~16,000 pJ]
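Assuming the approximate energies recovered above (this is a reconstruction of the original figure, so treat the exact pairings as estimates), a few ratios show why locality dominates the energy budget: the arithmetic is cheap compared with moving its operands, consistent with the ">100:1 global vs. local energy cost" cited in the conclusions once data has to leave the chip.

#include <cstdio>

int main() {
    // Approximate 28 nm energies from the figure above (estimates).
    double flop_pj = 20.0;     // 64-bit DP operation
    double sram_pj = 50.0;     // 256-bit access to an 8 kB SRAM
    double wire_pj = 1000.0;   // 256 bits moved ~20 mm across the die
    double link_pj = 500.0;    // 256 bits over an efficient off-chip link
    double dram_pj = 16000.0;  // 256-bit DRAM read/write

    printf("local SRAM access  : %6.1fx one DP op\n", sram_pj / flop_pj);
    printf("cross-die transfer : %6.1fx one DP op\n", wire_pj / flop_pj);
    printf("off-chip link      : %6.1fx one DP op\n", link_pj / flop_pj);
    printf("DRAM access        : %6.1fx one DP op\n", dram_pj / flop_pj);
    return 0;
}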

SLIDE 23

Cost of Computation vs. Communications

SLIDE 24

Cost of Computation vs. Communications

SM XBAR

SLIDE 25

Enhanced On-Chip Signaling

Assumptions: 180 W continuous power, 25x25 mm die, bisection bandwidth = 6 TB/s, data moved an average of 15 mm.

Standard P&R: ECost = 190 fJ/bit/mm, PAVG = 68 W, global signaling is 38% of GPU power (vs. compute + other), delay over 25 mm ~ 17.0 ns

"Custom wire": ECost = 145 fJ/bit/mm, PAVG = 39 W, global signaling is 22% of GPU power (vs. compute + other), delay over 25 mm ~ 12.5 ns

nTECH’13 - John Wilson

SLIDE 26

Attack of the Killer Smartphones

[What if there were no long wires?]

SLIDE 27

TEGRA K1

Mobile Super Chip

  • Unify GPU and Tegra architecture: the mobile roadmap (Tegra 3, Tegra 4, Tegra K1) joins the GPU architecture roadmap (Tesla, Fermi, Kepler, Maxwell)
  • CUDA enabled
  • 192 Kepler cores, the same Kepler GPU architecture as Tesla, Quadro, and GeForce

SLIDE 28

JETSON TK1

Development platform for embedded computer vision, robotics, and medical applications
  • 192 Kepler cores · 326 GFLOPS
  • 4 ARM A15 cores
  • 2 GB DDR3L, 16-256 GB flash
  • Gigabit Ethernet
  • CUDA enabled
  • 5-11 watts
  • $192, available now

SLIDE 29

Perf/Watt Comparison

K40 + CPU
  • Peak SP: 4.2 TFLOPS
  • SP SGEMM: ~3.8 TFLOPS
  • Memory: 12 GB @ 288 GB/s
  • Power: GPU 235 W, CPU + memory 150 W, total 385 W
  • Perf/Watt: ~10 SP GFLOPS/W

TK1
  • Peak SP: 326 GFLOPS
  • SP SGEMM: ~290 GFLOPS
  • Memory: 2 GB @ 14.9 GB/s
  • Power: GPU + CPU < 11 W working hard, about 1/35 of K40 + CPU
  • Perf/Watt: ~26 SP GFLOPS/W

For the same power as K40 + CPU, you could have 10+ TFLOPS SP and 70 GB of DRAM @ 500+ GB/s.
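The closing claim is straight arithmetic on the figures above; a minimal sketch (using only this slide's numbers) showing how many TK1s fit in the K40 + CPU power envelope and what they would add up to:

#include <cstdio>

int main() {
    double node_w     = 385.0;   // K40 + CPU + memory, W
    double tk1_w      = 11.0;    // one TK1 working hard, W
    double tk1_sgemm  = 290.0;   // SP SGEMM, GFLOPS
    double tk1_mem_gb = 2.0;     // GB
    double tk1_bw_gbs = 14.9;    // GB/s

    double n = node_w / tk1_w;   // ~35 TK1s in the same power budget
    printf("%.0f TK1s: %.1f TFLOPS SP, %.0f GB DRAM, %.0f GB/s aggregate bandwidth\n",
           n, n * tk1_sgemm / 1000.0, n * tk1_mem_gb, n * tk1_bw_gbs);
    return 0;
}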

SLIDE 30

25x or 1 Exa?

SLIDE 31

Likely Exascale Node

Three Building Blocks (GPU, CPU, Network)

[Diagram: the GPU building block has several groups of throughput-optimized cores (TOCs, each with cores C0..Cn and an L2) on a NoC, with memory controllers (MC) to DRAM stacks and a NIC; the CPU building block has latency-optimized cores LOC 0..LOC 7 (each with an L2cpu) on a NoC/bus, with an MC to NVRAM and bulk DRAM; links from the NIC join the node to the system interconnect spanning ~100K nodes]

GPU: throughput optimized, parallel code
CPU: latency optimized, OS, pointer chasing
NIC: links to a system interconnect of ~100K nodes

Direct Evolution of 2016 Node
  • Programming model continuity
  • Specialized cores: GPU for parallel work, CPU for serial work
  • Coherent memory system with stacked, bulk, & NVRAM
  • Amortize non-parallel costs: increase GPU:CPU ratio, smaller CPU

SLIDE 32

LINPACK vs. Real Apps

Oreste Villa, Scaling the Power Wall: A Path to Exascale

SLIDE 33

Future Programming Systems

SLIDE 34

A Simple Parallel Program

forall molecule in set {
  forall neighbor in molecule.neighbors {
    forall force in forces {
      molecule.force = reduce_sum(force(molecule, neighbor))
    }
  }
}

SLIDE 35

Why Is This Easy?

forall molecule in set {
  forall neighbor in molecule.neighbors {
    forall force in forces {
      molecule.force = reduce_sum(force(molecule, neighbor))
    }
  }
}

  • No machine details
  • All parallelism is expressed
  • Synchronization is semantic (in the reduction)

SLIDE 36

We Can Make It Hard

pid = fork();                  // explicitly managing threads

lock(struct.lock);             // complicated, error-prone synchronization
// manipulate struct
unlock(struct.lock);

code = send(pid, tag, &msg);   // partition across nodes
SLIDE 37

Programmers, Tools, and Architecture Need to Play Their Positions

Programmer:

forall molecule in set {                     // launch a thread array
  forall neighbor in molecule.neighbors {    // doubly nested
    forall force in forces {
      molecule.force = reduce_sum(force(molecule, neighbor))
    }
  }
}

Tools:
  • Map foralls in time and space
  • Map molecules across memories
  • Stage data up/down the hierarchy
  • Select mechanisms

Architecture:
  • Exposed storage hierarchy
  • Fast comm/sync/thread mechanisms
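As one concrete (purely illustrative) target for that mapping, here is a minimal CUDA sketch of the "launch a thread array" step: one thread per molecule, the neighbor and force loops run inside the thread, and the reduction is a per-thread sum. The Molecule layout, MAX_NEIGHBORS, NUM_FORCES, and force_term are hypothetical stand-ins, not part of the original example; staging data through the memory hierarchy and choosing the launch shape are exactly the jobs the slide assigns to the tools and architecture.

#include <cuda_runtime.h>

#define MAX_NEIGHBORS 32
#define NUM_FORCES 2

struct Molecule {
    float x, y, z;                   // position
    float force;                     // scalar force magnitude (for illustration)
    int   neighbors[MAX_NEIGHBORS];  // indices into the molecule array
    int   num_neighbors;
};

// Hypothetical pairwise force terms standing in for the slide's "forces" set.
__device__ float force_term(int which, const Molecule& a, const Molecule& b) {
    float dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
    float r2 = dx * dx + dy * dy + dz * dz + 1e-6f;
    return (which == 0) ? 1.0f / r2 : -1.0f / (r2 * r2);
}

// "forall molecule in set" -> one CUDA thread per molecule.
__global__ void compute_forces(Molecule* set, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sum = 0.0f;                                  // reduce_sum, private to this thread
    for (int k = 0; k < set[i].num_neighbors; ++k) {   // forall neighbor
        const Molecule& nb = set[set[i].neighbors[k]];
        for (int f = 0; f < NUM_FORCES; ++f) {         // forall force
            sum += force_term(f, set[i], nb);
        }
    }
    set[i].force = sum;
}

// Host side: "map foralls in time and space" reduces to choosing a launch shape.
void launch_compute_forces(Molecule* d_set, int n) {
    int block = 256;
    int grid  = (n + block - 1) / block;
    compute_forces<<<grid, block>>>(d_set, n);
}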

SLIDE 38

System Functions -> Application Optimizations

Energy Management

Power allocation among LOCs and TOCs

Resilience

Failure-tolerant applications by design

SLIDE 39

Conclusions

SLIDE 40

Exascale (25x) is Within Reach

(Not so sure about Zetta-scale…)

  • Requires clever circuits and ruthlessly efficient architecture; Moore’s Law cannot be relied upon
  • Need to exploit locality: > 100:1 global vs. local energy cost
  • Need to expose massive concurrency: an exaflop at O(GHz) clocks ⇒ O(10 billion)-way parallelism
  • Need to simplify programming and automate mapping: “MPI + X” is only a step in the right direction

SLIDE 41

Questions?