

SLIDE 1

ON THE FUTURE OF HIGH PERFORMANCE COMPUTING: HOW TO THINK FOR PETA AND EXASCALE COMPUTING

Jack Dongarra, University of Tennessee / Oak Ridge National Laboratory

SLIDE 2

Over Last 20 Years - Performance Development

[Chart: TOP500 performance development, 1993-2012, log scale from 100 Mflop/s to 100 Pflop/s, with SUM, N=1, and N=500 trend lines separated by roughly 6-8 years. In 1993: N=1 at 59.7 Gflop/s, N=500 at 400 Mflop/s, SUM at 1.17 Tflop/s. In 2012: N=1 at 16.3 Pflop/s, N=500 at 60.8 Tflop/s, SUM at 123 Pflop/s. For scale: my laptop delivers ~70 Gflop/s; my iPad 2 and iPhone 4s ~1.02 Gflop/s.]

SLIDE 3

June 2012: The TOP10

| Rank | Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | MFlops/Watt |
|---|---|---|---|---|---|---|---|---|
| 1 | DOE / NNSA Lawrence Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + custom | USA | 1,572,864 | 16.3 | 81 | 8.6 | 1895 |
| 2 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 830 |
| 3 | DOE / OS Argonne Nat Lab | Mira, BlueGene/Q (16c) + custom | USA | 786,432 | 8.16 | 81 | 3.95 | 2069 |
| 4 | Leibniz Rechenzentrum | SuperMUC, Intel (8c) + IB | Germany | 147,456 | 2.90 | 90* | 3.52 | 823 |
| 5 | Nat. SuperComputer Center in Tianjin | Tianhe-1A, NUDT Intel (6c) + Nvidia GPU (14c) + custom | China | 186,368 | 2.57 | 55 | 4.04 | 636 |
| 6 | DOE / OS Oak Ridge Nat Lab | Jaguar, Cray AMD (16c) + custom | USA | 298,592 | 1.94 | 74 | 5.14 | 377 |
| 7 | CINECA | Fermi, BlueGene/Q (16c) + custom | Italy | 163,840 | 1.73 | 82 | 0.821 | 2099 |
| 8 | Forschungszentrum Juelich (FZJ) | JuQUEEN, BlueGene/Q (16c) + custom | Germany | 131,072 | 1.38 | 82 | 0.657 | 2099 |
| 9 | Commissariat a l'Energie Atomique (CEA) | Curie, Bull Intel (8c) + IB | France | 77,184 | 1.36 | 82 | 2.25 | 604 |
| 10 | Nat. Supercomputer Center in Shenzhen | Nebulae, Dawning Intel (6c) + Nvidia GPU (14c) + IB | China | 120,640 | 1.27 | 43 | 2.58 | 493 |
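The last two columns are derived quantities. As a quick sanity check (mine, not from the slides), a short Python sketch reproduces them from Rmax, percent of peak, and power for the top three entries:

```python
# Hypothetical check: derive implied Rpeak and MFlops/W for a few TOP10
# entries from the table's Rmax, % of peak, and power columns.
entries = [
    # (system, Rmax in Pflop/s, % of peak, power in MW)
    ("Sequoia",    16.3, 81,  8.6),
    ("K computer", 10.5, 93, 12.7),
    ("Mira",        8.16, 81, 3.95),
]

for name, rmax, pct, power_mw in entries:
    rpeak = rmax / (pct / 100.0)                     # implied peak, Pflop/s
    mflops_per_watt = rmax * 1e9 / (power_mw * 1e6)  # 1 Pflop/s = 1e9 Mflop/s
    print(f"{name:10s}: Rpeak ~ {rpeak:5.1f} Pflop/s, {mflops_per_watt:4.0f} MFlops/W")
```

This recovers, e.g., Sequoia's ~20.1 Pflop/s peak and 1895 MFlops/W, matching the table to rounding.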
SLIDE 4

Accelerators (58 systems)

0" 10" 20" 30" 40" 50" 60" 2006" 2007" 2008" 2009" 2010" 2011" 2012" Systems% Intel"MIC"(1)" Clearspeed"CSX600"(0)" ATI"GPU"(2)" IBM"PowerXCell"8i"(2)" NVIDIA"2070"(10)" NVIDIA"2050(12)" NVIDIA"2090"(31)"

SLIDE 5

Countries Share

Absolute counts: US 252, China 68, Japan 35, UK 25, France 22, Germany 20.

[Chart: TOP500 share by country, with Switzerland highlighted.]

SLIDE 6

Swiss Machines in Top500 (max:12 min:1)

[Chart: number of Swiss systems on each TOP500 list, January 1993 to April 2012. High point: 12 systems (6/95); low point: 1 system (6/02, 11/02, 6/12). Counts per list: 4, 5, 7, 9, 12, 9, 8, 9, 6, 6, 5, 6, 6, 5, 8, 8, 6, 2, 1, 1, 3, 3, 2, 3, 3, 4, 4, 5, 5, 7, 6, 4, 4, 5, 5, 4, 4, 3, 1.]

SLIDE 7

28 Systems at > Pflop/s (Peak)

0" 5" 10" 15" 20" 25" 30" 35" 40" 45" US""""""""""""""""""" (9)" "Japan""""""" (4)" China""""""""""""""" (5)" Germany""""""""""" (4)" France"""""""""""" (2)" UK""""""""""""""""" (2)" Italy"""""""""""" (1)" Russia""""""" (1)"

41# 16.2# 11.1# 6.9# 2.92# 2.73# 2.1# 1.7#

Pflop/s"Club"

Pflop/s (Peak)

10/2/12 7

SLIDE 8

Linpack Efficiency

[Chart: Linpack efficiency (Rmax as a fraction of peak, 0-100%) plotted across the 500 systems of the list.]

SLIDE 11

Performance Development in Top500

[Chart: TOP500 performance development extrapolated from 1994 to 2020, log scale from 100 Mflop/s to beyond 1 Eflop/s, with N=1 and N=500 trend lines; the extrapolated N=1 line reaches 1 Eflop/s around 2019-2020.]

SLIDE 12

The High Cost of Data Movement

| Operation | 2011 | 2018 (projected) |
|---|---|---|
| DP FMADD flop | 100 pJ | 10 pJ |
| DP DRAM read | 4800 pJ | 1920 pJ |
| Local interconnect | 7500 pJ | 2500 pJ |
| Cross system | 9000 pJ | 3500 pJ |
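To make the table concrete, a small sketch (values copied from the table above) expresses each operation's energy as a multiple of a double-precision flop in the same year:

```python
# Energy per operation in picojoules, from the table above.
energy_pj = {
    #              2011   2018 (projected)
    "DP FMADD":   (100,   10),
    "DRAM read":  (4800,  1920),
    "Local link": (7500,  2500),
    "Cross sys":  (9000,  3500),
}

for op, (e2011, e2018) in energy_pj.items():
    f2011 = e2011 / energy_pj["DP FMADD"][0]   # cost in "flops", 2011
    f2018 = e2018 / energy_pj["DP FMADD"][1]   # cost in "flops", 2018
    print(f"{op:10s}: {f2011:5.0f}x a flop in 2011 -> {f2018:5.0f}x in 2018")
```

Flops get ~10x cheaper while data movement only gets ~2-3x cheaper, so the relative cost of moving data roughly quadruples: a DRAM read goes from 48 flops' worth of energy to 192.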

SLIDE 13

Energy Cost Challenge

At ~$1M per MW per year, energy costs are substantial:

• 10 Pflop/s in 2011 uses ~10 MW, i.e., roughly $10M per year.
• 1 Eflop/s in 2018: >100 MW.
• DOE target: 1 Eflop/s in 2018 at 20 MW (~$20M per year).

SLIDE 14

Potential System Architecture with a cap of $200M and 20 MW

| | 2012 system (BG/Q) | 2019 | Difference, today vs. 2019 |
|---|---|---|---|
| System peak | 20 Pflop/s | 1 Eflop/s | O(100) |
| Power | 8.6 MW | ~20 MW | |
| System memory | 1.6 PB (16 GB x 96 x 1024 nodes) | 32-64 PB | O(10) |
| Node performance | 205 GF/s (16 cores x 1.6 GHz x 8 flops/cycle) | 1.2 or 15 TF/s | O(10) - O(100) |
| Node memory BW | 42.6 GB/s | 2-4 TB/s | O(1000) |
| Node concurrency | 64 threads | O(1k) or 10k threads | O(100) - O(1000) |
| Total node interconnect BW | 20 GB/s | 200-400 GB/s | O(10) |
| System size (nodes) | 98,304 (96 x 1024) | O(100,000) or O(1M) | O(100) - O(1000) |
| Total concurrency | 5.97 M | O(billion) | O(1,000) |
| MTTI | 4 days | O(<1 day) | O(10) |
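One way to read the node rows is through machine balance, flops per byte of memory traffic. A sketch, assuming (my pairing, not the slide's) that the 15 TF/s node design goes with the 4 TB/s bandwidth and the 1.2 TF/s node with 2 TB/s:

```python
# Machine balance implied by the table: node flop rate / node memory BW.
nodes = {
    "BG/Q 2012":   (205e9,  42.6e9),   # (flop/s, bytes/s)
    "2019 'fat'":  (15e12,  4e12),     # hypothetical pairing of table values
    "2019 'thin'": (1.2e12, 2e12),     # hypothetical pairing of table values
}

for name, (flops, bw) in nodes.items():
    print(f"{name:12s}: {flops / bw:4.1f} flops per byte of memory traffic")
```

Today's BG/Q node needs ~4.8 flops per byte to stay compute-bound; the "fat" 2019 node stays near that, so algorithms get no relief from hardware on arithmetic intensity.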
SLIDE 15

Potential System Architecture with a cap of $200M and 20 MW (revised)

Same projection as the previous slide, with the exascale target shifted from 2019 to 2022; the system parameters and the gaps relative to today's BG/Q are unchanged.
SLIDE 16

Critical Issues at Peta & Exascale for Algorithm and Software Design

• Synchronization-reducing algorithms: break the fork-join model.
• Communication-reducing algorithms: use methods that attain the lower bounds on communication.
• Mixed-precision methods: working in single precision gives roughly 2x the speed of the arithmetic and 2x the effective data-movement bandwidth (see the sketch after this list).
• Autotuning: today's machines are too complicated; build "smarts" into the software so it adapts to the hardware.
• Fault-resilient algorithms: implement algorithms that can recover from failures and bit flips.
• Reproducibility of results: today we can't guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.
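As an illustration of the mixed-precision bullet, here is a minimal sketch (not PLASMA's implementation) of iterative refinement: the O(n^3) factorization runs in single precision, and double-precision accuracy is recovered with cheap O(n^2) residual corrections.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
b = rng.standard_normal(n)

A32 = A.astype(np.float32)
lu = lu_factor(A32)                               # O(n^3) work, done in single
x = lu_solve(lu, b.astype(np.float32)).astype(np.float64)

for _ in range(5):                                # O(n^2) per refinement sweep
    r = b - A @ x                                 # residual in double precision
    x += lu_solve(lu, r.astype(np.float32)).astype(np.float64)

print("relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

For a reasonably conditioned system the refined solution reaches double-precision accuracy while the expensive factorization enjoyed the single-precision speed advantage.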

SLIDE 17
Major Changes to Algorithms/Software

We must rethink the design of our algorithms and software:

• Manycore and hybrid architectures are disruptive technology, similar to what happened with cluster computing and message passing.
• Rethink and rewrite the applications, algorithms, and software.
• Data movement is expensive; flops are cheap.

SLIDE 18

Dense Linear Algebra

Software Evolution

| Package | Era | Key idea | Building blocks |
|---|---|---|---|
| LINPACK | 70's | vector operations | Level-1 BLAS |
| LAPACK | 80's | block operations | Level-3 BLAS |
| ScaLAPACK | 90's | block-cyclic data distribution | PBLAS, BLACS (message passing) |
| PLASMA | 00's | tile operations | tile layout, dataflow scheduling |

SLIDE 19

PLASMA

Principles

" Tile Algorithms

" minimize capacity misses

" Tile Matrix Layout

" minimize conflict misses

" Dynamic DAG Scheduling

" minimizes idle time " More overlap " Asynchronous ops

CPU MEM cache CPU MEM cache CPU cache CPU cache CPU cache

LAPACK PLASMA
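A toy illustration of the tile layout (a hypothetical helper, not PLASMA's API): each b-by-b tile is stored contiguously, so a kernel working on one tile streams a single dense block through cache.

```python
import numpy as np

def to_tiles(A, b):
    """Return tiles[i][j] = contiguous copy of the b-by-b block (i, j) of A."""
    n = A.shape[0]
    assert n % b == 0
    return [[np.ascontiguousarray(A[i:i + b, j:j + b])
             for j in range(0, n, b)]
            for i in range(0, n, b)]

A = np.arange(64.0).reshape(8, 8)
tiles = to_tiles(A, 4)
print(tiles[1][0])   # bottom-left 4x4 tile, stored with stride 1
```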

SLIDE 20

Fork-Join Parallelization of LU and QR. Parallelize the update:

  • Easy, and done in any reasonable software.
  • This is the 2/3 n^3 term in the flop count.
  • Can be done efficiently with LAPACK plus a multithreaded BLAS (dgemm); a sketch follows.

[Diagram: cores vs. time, showing parallel update phases separated by sequential panel factorizations.]
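A numpy/scipy sketch of one right-looking step (illustrative only, not LAPACK's code): once the panel is factored, the entire trailing update is a single GEMM, which a multithreaded BLAS parallelizes well.

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

n, b = 600, 100
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))

P, Lpan, U11 = lu(A[:, :b])          # O(n b^2) panel factorization with pivoting
Atr = P.T @ A[:, b:]                 # apply the panel's row swaps to the rest
U12 = solve_triangular(Lpan[:b], Atr[:b], lower=True, unit_diagonal=True)
A22 = Atr[b:] - Lpan[b:] @ U12       # trailing update: one big dgemm,
                                     # the 2/3 n^3 bulk of the work;
                                     # recursing on A22 completes the LU
```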

SLIDE 21

Objectives

• High utilization of each core
• Scaling to large numbers of cores
• Synchronization-reducing algorithms

Methodology

• Dynamic DAG scheduling (QUARK)
• Explicit parallelism
• Implicit communication
• Fine granularity / block data layout

PLASMA/MAGMA: Parallel Linear Algebra Software for Multicore/Hybrid Architectures

[Diagram: fork-join parallelism vs. DAG-scheduled parallelism over time; the arbitrary DAG with dynamic scheduling keeps cores busy between what would otherwise be global barriers. A toy scheduler sketch follows.]
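A toy dynamic scheduler in the spirit of QUARK (not its API): tasks carry explicit dependences and fire as soon as their inputs are ready, with no global barrier between them. Task names here are hypothetical labels for tiles of a factorization.

```python
from concurrent.futures import ThreadPoolExecutor

tasks = {                      # task -> (dependences, work)
    "potrf1": ((),            lambda: "factor tile (0,0)"),
    "trsm":   (("potrf1",),   lambda: "solve  tile (1,0)"),
    "syrk":   (("trsm",),     lambda: "update tile (1,1)"),
    "potrf2": (("syrk",),     lambda: "factor tile (1,1)"),
}

def run(dag, workers=4):
    futures = {}
    with ThreadPoolExecutor(workers) as pool:
        def fire(name):
            deps, work = dag[name]
            for d in deps:            # wait only on *this* task's inputs,
                futures[d].result()   # never on a global barrier
            return work()
        for name in dag:              # dict order is a topological order here
            futures[name] = pool.submit(fire, name)
        for name, f in futures.items():
            print(name, "->", f.result())

run(tasks)
```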

SLIDE 22

Communication Avoiding QR Example

A. Pothen and P. Raghavan. Distributed orthogonal factorization. In Proceedings of the 3rd Conference on Hypercube Concurrent Computers and Applications, volume II (Applications), pages 1610-1620, Pasadena, CA, Jan. 1988. ACM.

[Diagram: the matrix is split into domains D0-D3; Domain_Tile_QR factors each domain independently, producing local triangular factors R0-R3, which are then merged pairwise up a binary reduction tree (R0 with R1, R2 with R3, then the final R).]
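A numpy sketch of the idea in the figure (illustrative, not the PLASMA kernel): factor each domain D0-D3 independently, then merge the small R factors pairwise up the tree. The result matches the R of a flat QR up to row signs.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4000, 8))                      # tall and skinny
D = np.split(A, 4)                                      # domains D0..D3

R = [np.linalg.qr(Di, mode="r") for Di in D]            # local QRs -> R0..R3
R01 = np.linalg.qr(np.vstack([R[0], R[1]]), mode="r")   # merge R0, R1
R23 = np.linalg.qr(np.vstack([R[2], R[3]]), mode="r")   # merge R2, R3
Rfinal = np.linalg.qr(np.vstack([R01, R23]), mode="r")  # root of the tree

Rflat = np.linalg.qr(A, mode="r")                       # reference: flat QR
print(np.allclose(np.abs(Rfinal), np.abs(Rflat)))       # equal up to row signs
```

Each merge touches only small b-by-b triangles, which is what makes the scheme communication-avoiding on distributed or deeply hierarchical machines.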
SLIDE 27

PowerPack 2.0

The PowerPack platform consists of software and hardware for fine-grained power measurement. Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/

SLIDE 28

Power for QR Factorization

Dual-socket quad-core Intel Xeon E5462 (Harpertown) @ 2.80 GHz (8 cores total) with MKL BLAS; the test matrix is very tall and skinny (m x n = 1,152,000 by 288).

[Chart: power traces over time for four QR variants:
• PLASMA's communication-reducing QR factorization (DAG-based)
• PLASMA's conventional QR factorization (DAG-based)
• MKL's QR factorization (fork-join based)
• LAPACK's QR factorization (fork-join based)]

SLIDE 29

The standard Tridiagonal reduction xSYTRD

At step k, LAPACK xSYTRD applies the two-sided transformation $Q A Q^*$ and then updates, before moving to step k+1:

1. Apply the left and right transformations $Q A Q^*$ to the panel.
2. Update the remaining submatrix A33:

$$
\begin{pmatrix} T_{11} & T_{21}^T & \\ T_{21} & A_{22} & A_{32}^T \\ & A_{32} & A_{33} \end{pmatrix}
\;\Longrightarrow\;
\begin{pmatrix} T_{11} & T_{21}^T & \\ T_{21} & T_{22} & T_{32}^T \\ & T_{32} & A_{33} \end{pmatrix},
\qquad A_{33} \leftarrow A_{33} - YW^T - WY^T
$$

For the symmetric eigenvalue problem: First stage takes:

  • 90% of the time if only eigenvalues
  • 50% of the time if eigenvalues and eigenvectors
SLIDE 30

The Standard Tridiagonal Reduction xSYTRD: Characteristics

1. Phase 1 requires:
  • 4 panel-vector multiplications,
  • 1 symmetric matrix-vector multiplication with A33,
  • cost: 2(n-k)^2 b flops.

2. Phase 2 requires:
  • a symmetric update of A33 using SYRK,
  • cost: 2(n-k)^2 b flops.

Observations:

  • Too many Level-2 BLAS operations (a timing sketch follows),
  • relies on panel factorization,
  • total cost: 4n^3/3 flops,
  • bulk-synchronous phases,
  • memory-bound algorithm.
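A rough timing sketch of why Level-2-heavy code is memory bound: n matrix-vector products do the same flops as one matrix-matrix product, but re-read A from memory n times. Numbers vary by machine; this only illustrates the gap.

```python
import time
import numpy as np

n = 2000
A = np.random.standard_normal((n, n))
X = np.random.standard_normal((n, n))

t0 = time.perf_counter()
for j in range(n):                 # Level 2: n GEMV calls, memory bound
    A @ X[:, j]
t1 = time.perf_counter()
A @ X                              # Level 3: one GEMM, compute bound
t2 = time.perf_counter()
print(f"GEMV loop {t1 - t0:.2f}s vs GEMM {t2 - t1:.2f}s, same flop count")
```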

SLIDE 31

Symmetric Eigenvalue Problem

  • Standard reduction algorithms are very slow on multicore.
  • Step 1: reduce the dense matrix to band form (matrix-matrix operations, high degree of parallelism).
  • Step 2: bulge chasing on the band matrix (performed in groups, cache-aware).
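For reference, a scipy sketch of the reduce-then-solve pipeline. This is the classic one-stage version; the two-stage dense-to-band-to-tridiagonal algorithm described above is what PLASMA implements, and scipy does not expose it.

```python
import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal, eigh

rng = np.random.default_rng(0)
A = rng.standard_normal((300, 300))
A = (A + A.T) / 2                            # symmetric test matrix

T = hessenberg(A)                            # symmetric Hessenberg = tridiagonal
w = eigh_tridiagonal(np.diag(T), np.diag(T, -1), eigvals_only=True)

print(np.allclose(np.sort(w), eigh(A, eigvals_only=True)))
```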
SLIDE 32

[Chart: Gflop/s vs. matrix size (2k to 24k) comparing PLASMA DSYTRD+DSTEV, MKL-SBR DSYRDB+DSTEV, SBR-toolkit DSYRDD+DSTEV, MKL DSYTRD+DSTEV, and reference-LAPACK DSYTRD+DSTEV, with panels for symmetric eigenvalues (eigenvalues only) and singular values (singular values only); annotated speedups of 11x and 50x.]

Block DAG-based reduction to banded form, then pipelined group chasing to tridiagonal form. The reduction to condensed form accounts for the factor-of-50 improvement over LAPACK. Execution rates are based on 4n^3/3 ops.

Experiments on eight-socket six-core AMD Opteron 2.4 GHz processors with MKL V10.3.

SLIDE 33

Summary

These are old ideas (today: SMPSs, StarPU, Charm++, ParalleX, Swarm, …).

Major challenges are ahead for extreme computing:

• Power
• Levels of parallelism
• Communication
• Hybrid architectures
• Fault tolerance
• … and many others not discussed here

This is not just a programming assignment; it opens up many new opportunities for applied mathematicians and computer scientists.

SLIDE 34

Collaborators / Software / Support

• PLASMA: http://icl.cs.utk.edu/plasma/
• MAGMA: http://icl.cs.utk.edu/magma/
• QUARK (runtime for shared memory): http://icl.cs.utk.edu/quark/
• PaRSEC (Parallel Runtime Scheduling and Execution Control): http://icl.cs.utk.edu/parsec/

Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver; INRIA, France; KAUST, Saudi Arabia.

These tools are being applied to a range of applications beyond dense linear algebra: sparse direct methods, sparse iterative methods, and fast multipole methods.