Efficiency and Programmability: Enablers for ExaScale
Bill Dally | Chief Scientist and SVP, Research, NVIDIA | Professor (Research), EE&CS, Stanford
Scientific Discovery and Business Analytics
Driving an Insatiable Demand for More Computing Performance
[Diagram: HPC and analytics driving demand for compute, communication, and memory & storage]
Source: C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011
The End of Historic Scaling
“Moore’s Law gives us more transistors…
Dennard scaling made them useful.”
Bob Colwell, DAC 2013, June 4, 2013
TITAN
18,688 NVIDIA Tesla K20X GPUs
27 Petaflops Peak: 90% of Performance from GPUs
17.59 Petaflops Sustained Performance on Linpack
Numerous real science applications
2.14 GF/W – Most efficient accelerator
You Are Here
2013: 20 PF, 18,000 GPUs, 10 MW, 2 GFLOPS/W, ~10^7 threads
2020: 1,000 PF (50x), 72,000 HCNs (4x), 20 MW (2x), 50 GFLOPS/W (25x), ~10^10 threads (1000x)
[Chart: GFLOPS/W needed vs. process scaling alone, 2013-2020: the efficiency gap]
[Chart: closing the gap on a log scale: process contributes 2.2x, circuits another 3x]
Simpler Cores = Energy Efficiency
Source: Azizi [PhD 2010]
CPU (Westmere, 32 nm): 1690 pJ/flop; optimized for latency, caches
GPU (Kepler, 28 nm): 140 pJ/flop; optimized for throughput, explicit management of on-chip memory
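To relate these pJ/flop figures to the GFLOPS/W numbers used elsewhere in the talk, a quick back-of-the-envelope conversion (a host-side C++/CUDA sketch; the only inputs are the figures quoted above and the 50 GFLOPS/W exascale target):

// Convert energy per flop into achievable GFLOPS/W, and invert the exascale target.
#include <cstdio>

int main() {
    auto gflops_per_watt = [](double pj_per_flop) {
        return 1.0 / (pj_per_flop * 1e-12) / 1e9;   // flops per joule, scaled to GFLOPS/W
    };
    printf("Westmere CPU, 1690 pJ/flop -> %.2f GFLOPS/W\n", gflops_per_watt(1690.0));
    printf("Kepler GPU,    140 pJ/flop -> %.2f GFLOPS/W\n", gflops_per_watt(140.0));
    printf("50 GFLOPS/W target -> %.0f pJ total budget per flop\n", 1e12 / (50.0 * 1e9));
    return 0;
}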
[Chart: GFLOPS/W needed, 2013-2020: process 2.2x, circuits 3x, architecture 4x close the gap]
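The factors on this chart are meant to multiply out to the required gain; a quick arithmetic check (a host-side C++/CUDA sketch using only numbers quoted in the talk):

// Exascale efficiency-gap arithmetic from the preceding slides.
#include <cstdio>

int main() {
    double perf_2013 = 20.0,  perf_2020  = 1000.0;   // PF
    double power_2013 = 10.0, power_2020 = 20.0;     // MW
    double eff_2013 = perf_2013 / power_2013;        // 2 GFLOPS/W
    double eff_2020 = perf_2020 / power_2020;        // 50 GFLOPS/W
    printf("Required efficiency gain: %.0fx\n", eff_2020 / eff_2013);            // 25x
    printf("Needed beyond process alone: %.1fx\n", (eff_2020 / eff_2013) / 2.2); // ~11x
    printf("Process x circuits x architecture: %.1fx\n", 2.2 * 3.0 * 4.0);       // ~26x
    return 0;
}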
Programmers, Tools, and Architecture Need to Play Their Positions
Programmer: algorithm, all of the parallelism, abstract locality
Tools: combinatorial optimization, mapping, selection of mechanisms
Architecture: fast mechanisms, exposed costs
[Diagram: GPU node organization: SMs, each with multiple lanes, connected through a NOC and crossbar (XBAR) to L2 banks]
An Enabling HPC Network
- <1 ms latency
- Scalable bandwidth
- Small messages: 50% @ 32B
- Global adaptive routing
- PGAS
- Collectives & atomics
- MPI offload
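A hedged MPI sketch of the kind of traffic these features target: a small one-sided (PGAS-style) put plus a global collective (standard MPI-3 calls; buffer names and sizes are illustrative, not from the talk):

// Small-message one-sided put plus a collective, the traffic pattern the network
// features above (PGAS, collectives & atomics, small messages) are meant to serve.
#include <mpi.h>
#include <cstdint>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Expose a small window of memory for one-sided (PGAS-style) access.
    int64_t win_buf[4] = {0, 0, 0, 0};
    MPI_Win win;
    MPI_Win_create(win_buf, sizeof(win_buf), sizeof(int64_t),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    // A 32 B put (4 x int64) to a neighbor: the "50% of messages @ 32B" case.
    int64_t payload[4] = {rank, rank, rank, rank};
    int neighbor = (rank + 1) % nranks;
    MPI_Win_fence(0, win);
    MPI_Put(payload, 4, MPI_INT64_T, neighbor, 0, 4, MPI_INT64_T, win);
    MPI_Win_fence(0, win);

    // Collective: a global sum, the kind of operation collective offload accelerates.
    int64_t local = rank, global = 0;
    MPI_Allreduce(&local, &global, 1, MPI_INT64_T, MPI_SUM, MPI_COMM_WORLD);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}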
An Open HPC Network Ecosystem
Common:
- Software-NIC API
- NIC-Router Channel
Processor/NIC vendors, system vendors, networking vendors
Programmer Architecture Tools
Programming
Parallelism Heterogeneity Hierarchy
Power
25x Efficiency with 2.2x from process
“Super” Computing
From Super Computers to Super Phones
Backup
Source: Moore, Electronics 38(8) April 19, 1965
In The Past, Demand Was Fueled by Moore’s Law
Source: Dally et al. “The Last Classical Computer”, ISAT Study, 2001
ILP Was Mined Out in 2001
[Chart: perf (ps/Inst), 1980-2020, measured vs. a linear extrapolation, diverging by 30:1, 1,000:1, 30,000:1]
Source: Moore, ISSCC Keynote, 2003
Voltage Scaling Ended in 2005
Moore’s law is alive and well, but…
Instruction-level parallelism (ILP) was mined out in 2001.
Voltage scaling (Dennard scaling) ended in 2005.
Most power is spent on communication.
What does this mean to you?
Summary
Source: C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011
The End of Historic Scaling
All performance is from parallelism.
Machines are power limited (efficiency IS performance).
Machines are communication limited (locality IS performance).
In the Future
Two Major Challenges
Programming
Parallel (10^10 threads)
Hierarchical, heterogeneous
Energy Efficiency
25x in 7 years (~2.2x from process)
[Chart: GFLOPS/W needed vs. process scaling, 2013-2020]
How is Power Spent in a CPU?
In-order embedded CPU (Dally [2008]): instruction supply 42%, clock + control logic 24%, data supply 17%, register file 11%, ALU 6%
OOO high-performance CPU (Natarajan [2003], Alpha 21264): clock + pins 45%, register file 14%, fetch 11%, issue 11%, rename 10%, data supply 5%, ALU 4%
Energy Shopping List

Processor Technology              40 nm          10 nm
Vdd (nominal)                     0.9 V          0.7 V
DFMA energy                       50 pJ          7.6 pJ
64b 8 KB SRAM Rd                  14 pJ          2.1 pJ
Wire energy (256 bits, 10 mm)     310 pJ         174 pJ

Memory Technology                 45 nm          16 nm
DRAM interface pin bandwidth      4 Gbps         50 Gbps
DRAM interface energy             20-30 pJ/bit   2 pJ/bit
DRAM access energy                8-15 pJ/bit    2.5 pJ/bit

Source: Keckler [Micro 2011], Vogelsang [Micro 2010]
FP Op lower bound = 4 pJ
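A small worked comparison from this table shows why data movement, not arithmetic, sets the budget (a host-side C++/CUDA sketch; the only inputs are the 10 nm and 16 nm columns above):

// Ratios from the energy shopping list: operand movement vs. a double-precision FMA.
#include <cstdio>

int main() {
    double dfma_pj      = 7.6;    // DFMA, 10 nm
    double sram_read_pj = 2.1;    // 64b read from an 8 KB SRAM, 10 nm
    double wire_pj      = 174.0;  // moving 256 bits 10 mm on chip, 10 nm
    double dram_pj_bit  = 2.5;    // DRAM access energy per bit, 16 nm

    printf("10 mm wire vs. DFMA:        %.0fx\n", wire_pj / dfma_pj);              // ~23x
    printf("256b DRAM access vs. DFMA:  %.0fx\n", 256.0 * dram_pj_bit / dfma_pj);  // ~84x
    printf("Local SRAM read vs. DFMA:   %.2fx\n", sram_read_pj / dfma_pj);         // ~0.3x
    return 0;
}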
[Chart: GFLOPS/W needed, 2013-2020: process 2.2x, circuits 3x]
Throughput-Optimized Core (TOC) vs. Latency-Optimized Core (LOC)
[Diagram: LOC: PC, branch predictor, I$, register rename, instruction window, reorder buffer, register file, 4 ALUs. TOC: many PCs, I$, select, register file, SIMT lanes]
Streaming Multiprocessor (SM)
[Diagram: warp scheduler, 32 KB shared memory, main register file (32 banks), ALU / SFU / MEM / TEX units; the main register file accounts for 15% of SM energy]
Hierarchical Register File
[Charts: percent of all values produced, broken down by number of reads (0, 1, 2, >2 times) and by lifetime (1, 2, 3, >3 instructions)]
Register File Caching (RFC)
[Diagram: MRF, 4x128-bit banks (1R1W), feeding operand routing and buffering; RFC, 4x32-bit banks (3R1W), next to the ALU; SFU / MEM / TEX ports]
Energy savings from RF hierarchy: 54% energy reduction
Source: Gebhart et al. [Micro 2011]
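A minimal model of why the register file cache saves energy, with illustrative per-access energies (the pJ values and hit rate below are assumptions for the sketch, not numbers from Gebhart et al.; only the structure, a small file near the ALU capturing most operand reads, follows the slide):

// Toy energy model of a two-level register file (compiler-managed RFC + MRF):
// each operand read is served by exactly one level.
#include <cstdio>

int main() {
    double e_mrf = 8.0;   // pJ per main-register-file access (assumed)
    double e_rfc = 1.0;   // pJ per register-file-cache access (assumed)
    double hit   = 0.6;   // fraction of reads served by the RFC (assumed)

    double flat = e_mrf;                                   // every read hits the MRF
    double hier = hit * e_rfc + (1.0 - hit) * e_mrf;       // reads split across levels
    printf("Flat RF:         %.2f pJ/read\n", flat);
    printf("Hierarchical RF: %.2f pJ/read (%.0f%% saved)\n",
           hier, 100.0 * (1.0 - hier / flat));
    return 0;
}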
Two Major Challenges
Programming
Parallel (10^10 threads)
Hierarchical, heterogeneous
Energy Efficiency
25x in 7 years (~2.2x from process)
Skills on LinkedIn           Size (approx)    Growth (rel)
C++                          1,000,000        -8%
Javascript                   1,000,000        -1%
Python                       429,000           7%
Fortran                      90,000          -11%
MPI                          21,000           -3%
x86 Assembly                 17,000           -8%
CUDA                         14,000            9%
Parallel programming         13,000            3%
OpenMP                       8,000             2%
TBB                          389              10%
6502 Assembly                256             -13%
Source: linkedin.com/skills (as of Jun 11, 2013)
Mainstream programming vs. parallel and assembly programming
forall molecule in set:                       # 1E6 molecules
  forall neighbor in molecule.neighbors:      # 1E2 neighbors each
    forall force in forces:                   # several forces
      molecule.force += force(molecule, neighbor)   # reduction
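For comparison, a hedged CUDA sketch of how the outer forall might map onto hardware threads (the Molecule type, NUM_FORCES, and force_eval below are illustrative stand-ins, not from the talk):

// One thread per molecule; the loops over neighbors and force terms stay serial within
// the thread, so the per-molecule reduction needs no synchronization.
struct Molecule {
    float force;
    int   num_neighbors;
    int  *neighbors;       // indices of neighboring molecules
};

#define NUM_FORCES 4       // "several forces": illustrative count

__device__ float force_eval(int which, const Molecule &m, const Molecule &n) {
    return 0.0f;           // placeholder physics, illustrative only
}

__global__ void accumulate_forces(Molecule *set, int num_molecules) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_molecules) return;                   // ~1e6 molecules
    float f = 0.0f;
    for (int k = 0; k < set[i].num_neighbors; ++k) {  // ~1e2 neighbors each
        const Molecule &nbr = set[set[i].neighbors[k]];
        for (int w = 0; w < NUM_FORCES; ++w)          // several force terms
            f += force_eval(w, set[i], nbr);          // per-thread reduction
    }
    set[i].force = f;
}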
Parallel Programming is Easy
We Can Make It Hard
pid = fork();                 // explicitly managing threads
lock(struct.lock);            // complicated, error-prone synchronization
// manipulate struct
unlock(struct.lock);
code = send(pid, tag, &msg);  // partition across nodes
Programmers, Tools, and Architecture Need to Play Their Positions
Programmer: algorithm, all of the parallelism, abstract locality
Tools: combinatorial optimization, mapping, selection of mechanisms
Architecture: fast mechanisms, exposed costs
OpenACC: Easy and Portable
Serial Code: SAXPY
do i = 1, 20*128
  do j = 1, 5000000
    fa(i) = a * fa(i) + fb(i)
  end do
end do

With OpenACC:
!$acc parallel loop
do i = 1, 20*128
  !dir$ unroll 1000
  do j = 1, 5000000
    fa(i) = a * fa(i) + fb(i)
  end do
end do
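For reference, a minimal CUDA version of the same SAXPY-style update (a sketch, not code from the talk):

// Each thread updates one element: fa[i] = a * fa[i] + fb[i].
__global__ void saxpy(int n, float a, float *fa, const float *fb) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        fa[i] = a * fa[i] + fb[i];
}

// Launch with one thread per element, e.g. for n = 20*128:
//   saxpy<<<(n + 255) / 256, 256>>>(n, a, d_fa, d_fb);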
Conclusion
Source: C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011
The End of Historic Scaling
Parallelism is the source of all performance.
Power limits all computing.
Communication dominates power.
Two Challenges
Programming
Parallelism Heterogeneity Hierarchy
Power
25x Efficiency with 2.2x from process
Programmer Architecture Tools
“Super” Computing
From Super Computers to Super Phones
Communication Takes More Energy Than Arithmetic
[Diagram of a 20 mm die: 64-bit DP op 20 pJ; 256-bit access to 8 kB SRAM 50 pJ; 256-bit on-chip buses 26 pJ (local) to 256 pJ (across the die); efficient off-chip link 500 pJ; ~1 nJ off-chip transfer; DRAM Rd/Wr 16 nJ]
Key to Parallelism: Independent operations on independent data
sum(map(multiply, x, x))
Every pair-wise multiply is independent, so parallelism is permitted.
Flat computation
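A hedged CUDA/Thrust sketch of this flat computation, sum of element-wise products, where the library is free to exploit the independent multiplies (array size and values are illustrative):

#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <cstdio>

int main() {
    thrust::device_vector<float> x(1 << 20, 2.0f);    // 1M elements, all 2.0

    // sum(map(multiply, x, x)): every pair-wise multiply is independent, so the
    // multiplies and the reduction tree can be mapped across the GPU.
    float total = thrust::inner_product(x.begin(), x.end(), x.begin(), 0.0f);

    printf("sum of x*x = %.1f\n", total);              // 4.0 * 2^20
    return 0;
}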
Key to Locality: Data decomposition should drive mapping
total = sum(x)

vs.

tiles = split(x)
partials = map(sum, tiles)
total = sum(partials)
Explicit decomposition
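A hedged CUDA sketch of the tiled decomposition: each block sums its own tile in on-chip shared memory, producing explicit partials that are then summed (the tile size of 256 is an assumption, not from the talk):

// Stage 1: one partial sum per tile, computed in shared memory by one block.
__global__ void tile_sum(const float *x, float *partials, int n) {
    __shared__ float tile[256];                       // one tile lives on chip
    int i   = blockIdx.x * blockDim.x + threadIdx.x;  // launch with blockDim.x == 256
    int lid = threadIdx.x;
    tile[lid] = (i < n) ? x[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // in-tile reduction tree
        if (lid < s) tile[lid] += tile[lid + s];
        __syncthreads();
    }
    if (lid == 0) partials[blockIdx.x] = tile[0];     // partials = map(sum, tiles)
}
// Stage 2: total = sum(partials); run tile_sum again on the (much smaller) partials
// array, or finish it on the host.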