

SLIDE 1

Tomorrow’s Exascale Systems:

Not Just Bigger Versions of Today’s Peta-Computers

Thomas Sterling

Professor of Informatics and Computing, Indiana University

Chief Scientist and Associate Director, Center for Research in Extreme Scale Technologies (CREST)

School of Informatics and Computing, Indiana University

November 20, 2013

SLIDE 2

Tianhe-2: Half-way to Exascale

  • China, 2013: the 30 PetaFLOPS dragon
  • Developed in cooperation between NUDT and Inspur for the National Supercomputer Center in Guangzhou
  • Peak performance of 54.9 PFLOPS
    – 16,000 nodes containing 32,000 Xeon Ivy Bridge processors and 48,000 Xeon Phi accelerators, totaling 3,120,000 cores
    – 162 cabinets in a 720 m² footprint
    – Total of 1.404 PB memory (88 GB per node)
    – Each Xeon Phi board uses 57 cores for an aggregate 1.003 TFLOPS at a 1.1 GHz clock
    – Proprietary TH Express-2 interconnect (fat tree with thirteen 576-port switches)
    – 12.4 PB parallel storage system
    – 17.6 MW power consumption under load; 24 MW including (water) cooling
    – 4,096 SPARC V9-based Galaxy FT-1500 processors in the front-end system

SLIDE 3

Exaflops by 2019 (maybe)

[Figure: Top500 performance projection, 1994–2020, showing the Sum, N=1, and N=500 trend lines spanning 100 Mflop/s to 1 Eflop/s]

Courtesy of Erich Strohmaier, LBNL

SLIDE 4

Elements of an MFE (Magnetic Fusion Energy) Integrated Model

Complex Multi-scale, Multi-physics Processes

Courtesy of Bill Tang, Princeton

SLIDE 5

| GTC simulation (year) | Computer name | PE # used | Speed (TF) | Particle # | Time steps | Physics Discovery (Publication) |
|---|---|---|---|---|---|---|
| 1998 | Cray T3E (NERSC) | 10^2 | 10^-1 | 10^8 | 10^4 | Ion turbulence zonal flow (Science, 1998) |
| 2002 | IBM SP (NERSC) | 10^3 | 10^0 | 10^9 | 10^4 | Ion transport scaling (PRL, 2002) |
| 2007 | Cray XT3/4 (ORNL) | 10^4 | 10^2 | 10^10 | 10^5 | Electron turbulence (PRL, 2007); EP transport (PRL, 2008) |
| 2009 | Jaguar/Cray XT5 (ORNL) | 10^5 | 10^3 | 10^11 | 10^5 | Electron transport scaling (PRL, 2009); EP-driven MHD modes |
| 2012–13 (current) | Cray XT5/Titan (ORNL); Tianhe-1A (China) | 10^5 | 10^4 | 10^12 | 10^5 | Kinetic-MHD; Turbulence + EP + MHD |
| 2018 (future) | Extreme-scale HPC systems | 10^6 | – | 10^13 | 10^6 | Turbulence + EP + MHD + RF |

Progress in Turbulence Simulation Capability: Faster Computers → Achievement of Improved Fusion Energy Physics Insights

* Example here of the GTC code (Z. Lin, et al.) delivering production runs at the TF scale in 2002 and the PF scale in 2009

Courtesy of Bill Tang, Princeton

SLIDE 6


Practical Constraints for Exascale

  • Sustained Performance
    – Exaflops
    – 100 Petabytes
    – 125 Petabytes/sec.
  • Cost
    – Deployment: $200M
    – Operational support
  • Power (see the note after this list)
    – Energy required to run the computer
    – Energy for cooling (removing heat from the machine)
    – 20 Megawatts
  • Reliability
    – One factor of availability
  • Generality
    – How good is it across a range of problems?
    – Strong scaling
  • Productivity
    – User programmability
    – Performance portability
  • Size
    – Floor space: 4,000 sq. meters
    – Access way for power and signal cabling
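A back-of-the-envelope consequence of the exaflops and 20 MW targets listed above (this arithmetic is added here for context, not taken from the slide):

```latex
% Implied energy budget per operation for a sustained exaflops machine in a 20 MW envelope:
\frac{2 \times 10^{7}\,\text{W}}{10^{18}\,\text{ops/s}}
  = 2 \times 10^{-11}\,\text{J/op} = 20\,\text{pJ per operation}
% equivalently, an efficiency target of roughly 50 GFLOPS per watt
```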
SLIDE 7

Execution Model Phase Change

  • Guiding principles for system design and operation
    – Semantics, mechanisms, policies, parameters, metrics
    – Driven by technology opportunities and challenges
    – Historically, catalyzed by paradigm shift
  • Decision chain across system layers
    – For reasoning towards optimization of design and operation
  • Essential for co-design of all system layers
    – Architecture, runtime and OS, programming models
    – Reduces design complexity from O(N²) to O(N)
    – Enables holistic reasoning about concepts and tradeoffs
  • Empowers discrimination, commonality, portability
    – Establishes a phylum of HPC-class systems

Execution model phase changes: Von Neumann Model (1949) → SIF-MOE Model (1968) → Vector Model (1975) → SIMD-array Model (1983) → CSP Model (1991) → ? Model (2020)

SLIDE 8

Total Power

[Figure: total system power (MW), 1992–2012, plotted separately for heavyweight, lightweight, and heterogeneous architectures]

Courtesy of Peter Kogge, University of Notre Dame

SLIDE 9

Technology Demands a New Response


SLIDE 10

Total Concurrency

[Figure: total concurrency, TC (flops/cycle), 1992–2012, plotted separately for heavyweight, lightweight, and heterogeneous architectures]

Courtesy of Peter Kogge, University of Notre Dame

SLIDE 11

Performance Factors - SLOWER

  • Starvation
    – Insufficiency of concurrency of work
    – Impacts scalability and latency hiding
    – Affects programmability
  • Latency
    – Time-measured distance for remote access and services
    – Impacts efficiency
  • Overhead
    – Critical-time additional work to manage tasks & resources
    – Impacts efficiency and granularity for scalability
  • Waiting for contention resolution
    – Delays due to simultaneous access requests to shared physical or logical resources

P = e(L,O,W) × S(s) × a(r) × U(E)

where:
  P – performance (ops)
  e – efficiency (0 < e < 1), a function of latency L, overhead O, and waiting time W
  S – scalability as a function of s, the application's average parallelism
  a – availability (0 < a < 1), a function of reliability r (0 < r < 1)
  U – normalization factor per compute unit, a function of E, watts per average compute unit
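As a purely illustrative instance of how the factors in this model multiply down (all values below are hypothetical, not from the slide):

```latex
% Hypothetical evaluation of P = e(L,O,W) * S(s) * a(r) * U(E):
% 50% efficiency, 10^6-way average parallelism, 90% availability,
% and 10^9 ops/s per normalized compute unit.
P = \underbrace{0.5}_{e(L,O,W)} \times \underbrace{10^{6}}_{S(s)} \times
    \underbrace{0.9}_{a(r)} \times \underbrace{10^{9}\,\text{ops/s}}_{U(E)}
  = 4.5 \times 10^{14}\ \text{ops/s}
```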

SLIDE 12

The Negative Impact of Global Barriers in Astrophysics Codes

Computational phase diagram from the MPI-based GADGET code (used for N-body and SPH simulations) with 1M particles over four timesteps on 128 processors. Red indicates computation; blue indicates waiting for communication.

SLIDE 13

Goals of a New Execution Model for Exascale

  • Serve as a discipline to govern future scalable system architectures, programming methods, and runtimes
  • Latency hiding at all system distances
    – Latency-mitigating architectures
  • Exploit parallelism in a diversity of forms and granularities
  • Provide a framework for efficient fine-grain synchronization and scheduling (dispatch)
  • Enable optimized runtime adaptive resource management and task scheduling for dynamic load balancing
  • Support full virtualization for fault tolerance, power management, and continuous optimization
  • Self-aware infrastructure for power management
  • Semantics of failure response for graceful degradation
  • Complexity of operation as an emergent behavior from simplicity of design, high replication, and local adaptation for global optima in time and space

SLIDE 14

ParalleX Execution Model

  • Lightweight multi-threading
    – Divides work into smaller tasks
    – Increases concurrency
  • Message-driven computation
    – Moves work to data
    – Keeps work local, stops blocking
  • Constraint-based synchronization (see the sketch after this list)
    – Declarative criteria for work
    – Event driven
    – Eliminates global barriers
  • Data-directed execution
    – Merger of flow control and data structure
  • Shared name space
    – Global address space
    – Simplifies random gathers
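To make the "constraint-based synchronization" and "eliminates global barriers" points concrete, here is a minimal sketch using plain C++11 standard futures; it is not ParalleX or HPX code, and every function name in it is invented for illustration. Each dependent task starts as soon as its own declared inputs are ready, rather than waiting at a global barrier.

```cpp
// Plain C++11 sketch of constraint-based synchronization (not actual HPX/ParalleX code).
#include <future>
#include <iostream>
#include <vector>

double compute_chunk(int chunk_id) {
    // Stand-in for a unit of local work on one chunk of the problem domain.
    return chunk_id * 1.5;
}

double combine(double left, double right) {
    // Stand-in for work whose only precondition is these two inputs.
    return left + right;
}

int main() {
    // Launch many small tasks; none of them waits on a global barrier.
    std::vector<std::future<double>> chunks;
    for (int i = 0; i < 8; ++i)
        chunks.push_back(std::async(std::launch::async, compute_chunk, i));

    // A dependent task fires when its declared inputs (chunks 0 and 1) resolve,
    // regardless of whether chunks 2..7 have finished yet.
    double a = chunks[0].get();
    double b = chunks[1].get();
    std::future<double> pair01 = std::async(std::launch::async, combine, a, b);

    std::cout << "pair01 = " << pair01.get() << "\n";

    // Remaining chunks are consumed whenever they individually complete.
    for (int i = 2; i < 8; ++i)
        std::cout << "chunk " << i << " = " << chunks[i].get() << "\n";
    return 0;
}
```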

SLIDE 15

ParalleX Addresses Critical Challenges (1)

  • Starvation
    – Lightweight threads for an additional level of parallelism
    – Lightweight threads with rapid context switching for non-blocking execution
    – Low overhead for finer granularity and more parallelism
    – Parallelism discovery at runtime through data-directed execution
    – Overlap of successive phases of computation for more parallelism
  • Latency
    – Lightweight thread context switching for non-blocking execution
    – Overlap of computation and communication to limit effects
    – Message-driven computation to reduce latency by putting work near data
    – Reduced number and size of global messages

SLIDE 16

ParalleX Addresses Critical Challenges (2)

  • Overhead
    – Eliminates (most) global barriers
    – However, will ultimately require hardware support in the limit
    – Uses synchronization objects exhibiting high semantic power
    – Reduces context switching time
    – Not all actions require thread instantiation
  • Waiting due to contention
    – Adaptive resource allocation with redundant resources (e.g., hardware support for threads)
    – Eliminates polling and reduces the number of sources of synchronization contention

SLIDE 17

HPX Runtime Design

  • The current version of HPX provides the following infrastructure, as defined by the ParalleX execution model (a brief usage sketch follows below):
    – Complexes (ParalleX threads) and ParalleX thread management
    – Parcel transport and parcel management
    – Local Control Objects (LCOs)
    – Active Global Address Space (AGAS)
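A minimal usage sketch follows, assuming a reasonably recent HPX release (header paths have moved between versions, so treat the includes as an assumption); it is not taken from the talk. Here hpx::async stands in for spawning a lightweight HPX thread (a ParalleX "complex"), and the returned future plays the role of a simple LCO.

```cpp
// Hedged sketch of HPX usage; header layout varies by HPX release.
#include <hpx/hpx_main.hpp>   // lets a plain main() run as an HPX thread
#include <hpx/future.hpp>     // hpx::async, hpx::future (older releases: hpx/include/async.hpp)
#include <iostream>

int square(int x) { return x * x; }

int main() {
    // Work is expressed as many small tasks; the runtime schedules them over its
    // worker threads and, via AGAS and parcels, can place work near its data.
    hpx::future<int> f = hpx::async(square, 21);
    std::cout << "result = " << f.get() << "\n";  // suspends this HPX thread, not the core
    return 0;
}
```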

SLIDE 18

Overlapping computational phases for hydrodynamics

Computational phases for LULESH (a mini-app for hydrodynamics codes), MPI vs. HPX. Red indicates work; white indicates waiting for communication. Overdecomposition: MPI used 64 processes, while HPX used roughly 1,000 threads spread across 64 cores.
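The overdecomposition idea can be illustrated with a short, self-contained C++ sketch (standard threads rather than HPX threads; the chunk counts mirror the 64-core / ~1,000-task ratio above, but the names and the work itself are invented): splitting the domain into far more chunks than cores keeps every worker supplied with ready work while other chunks are stalled.

```cpp
// Illustrative C++ sketch of overdecomposition (not LULESH itself).
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const int num_cores  = 64;     // hardware workers, as in the slide's HPX run
    const int num_chunks = 1000;   // ~1E3 tasks, i.e. roughly 16 chunks per core

    std::atomic<int> next{0};                    // shared work-queue cursor
    std::vector<double> result(num_chunks, 0.0);

    auto worker = [&]() {
        // Each worker repeatedly grabs the next ready chunk instead of being
        // statically bound to exactly one chunk the way a single MPI rank is.
        for (int c; (c = next.fetch_add(1)) < num_chunks; )
            result[c] = c * 0.5;                 // stand-in for the chunk's local physics
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < num_cores; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();

    std::cout << "processed " << num_chunks << " chunks on "
              << num_cores << " workers\n";
    return 0;
}
```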

SLIDE 19

Dynamic load balancing via message-driven work-queue execution for Adaptive Mesh Refinement (AMR)

SLIDE 20

Application: Adaptive Mesh Refinement (AMR) for Astrophysics simulations

SLIDE 21

Conclusions

  • HPC is in a (6th) phase change
  • Ultra-high-scale computing of the next decade will require a new model of computation to effectively exploit new technologies and guide system co-design
  • ParalleX is an example of an experimental execution model that addresses key challenges on the path to Exascale
  • Early experiments are encouraging for enhancing the scaling of graph-based, numerically intensive, and knowledge-management applications
