

SLIDE 1

Tomorrow’s Exascale Systems:

Not Just Bigger Versions of Today’s Peta-Computers

Thomas Sterling

Professor of Informatics and Computing, Indiana University

Chief Scientist and Associate Director, Center for Research in Extreme Scale Technologies (CREST)

School of Informatics and Computing, Indiana University

November 20, 2013

SLIDE 2

Tianhe-2: Half-way to Exascale

  • China, 2013: the 30 PetaFLOPS dragon
  • Developed in cooperation between NUDT and Inspur for the National Supercomputer Center in Guangzhou
  • Peak performance of 54.9 PFLOPS
    – 16,000 nodes containing 32,000 Xeon Ivy Bridge processors and 48,000 Xeon Phi accelerators, totaling 3,120,000 cores
    – 162 cabinets in a 720 m² footprint
    – Total of 1.404 PB memory (88 GB per node)
    – Each Xeon Phi board uses 57 cores for an aggregate 1.003 TFLOPS at a 1.1 GHz clock
    – Proprietary TH Express-2 interconnect (fat tree with thirteen 576-port switches)
    – 12.4 PB parallel storage system
    – 17.6 MW power consumption under load; 24 MW including (water) cooling
    – 4,096 SPARC V9-based Galaxy FT-1500 processors in the front-end system

SLIDE 3

Exaflops by 2019 (maybe)

[Figure: Top500 performance projection, 1994–2020, showing the Sum, N=1, and N=500 trend lines spanning 100 Mflop/s to 1 Eflop/s]

Courtesy of Erich Strohmaier, LBNL

SLIDE 4

Elements of an MFE (Magnetic Fusion Energy) Integrated Model

Complex Multi-scale, Multi-physics Processes

Courtesy of Bill Tang, Princeton

SLIDE 5

| GTC simulation (year) | Computer name | PE # used | Speed (TF) | Particle # | Time steps | Physics Discovery (Publication) |
|---|---|---|---|---|---|---|
| 1998 | Cray T3E (NERSC) | 10^2 | 10^-1 | 10^8 | 10^4 | Ion turbulence zonal flow (Science, 1998) |
| 2002 | IBM SP (NERSC) | 10^3 | 10^0 | 10^9 | 10^4 | Ion transport scaling (PRL, 2002) |
| 2007 | Cray XT3/4 (ORNL) | 10^4 | 10^2 | 10^10 | 10^5 | Electron turbulence (PRL, 2007); EP transport (PRL, 2008) |
| 2009 | Jaguar/Cray XT5 (ORNL) | 10^5 | 10^3 | 10^11 | 10^5 | Electron transport scaling (PRL, 2009); EP-driven MHD modes |
| 2012–13 (current) | Cray XT5/Titan (ORNL); Tianhe-1A (China) | 10^5 | 10^4 | 10^12 | 10^5 | Kinetic-MHD; Turbulence + EP + MHD |
| 2018 (future) | Extreme-scale HPC systems | 10^6 | – | 10^13 | 10^6 | Turbulence + EP + MHD + RF |

Progress in Turbulence Simulation Capability: Faster Computers → Achievement of Improved Fusion Energy Physics Insights

* Example here of the GTC code (Z. Lin, et al.) delivering production runs at the TF scale in 2002 and the PF scale in 2009

Courtesy of Bill Tang, Princeton

SLIDE 6


Practical Constraints for Exascale

  • Sustained Performance
    – Exaflops
    – 100 Petabytes
    – 125 Petabytes/sec.
  • Cost
    – Deployment: $200M
    – Operational support
  • Power (see the note after this list)
    – Energy required to run the computer
    – Energy for cooling (removing heat from the machine)
    – 20 Megawatts
  • Reliability
    – One factor of availability
  • Generality
    – How good is it across a range of problems?
    – Strong scaling
  • Productivity
    – User programmability
    – Performance portability
  • Size
    – Floor space: 4,000 sq. meters
    – Access way for power and signal cabling
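A back-of-the-envelope consequence of the exaflops and 20 MW targets listed above (this arithmetic is added here for context, not taken from the slide):

```latex
% Implied energy budget per operation for a sustained exaflops machine in a 20 MW envelope:
\frac{2 \times 10^{7}\,\text{W}}{10^{18}\,\text{ops/s}}
  = 2 \times 10^{-11}\,\text{J/op} = 20\,\text{pJ per operation}
% equivalently, an efficiency target of roughly 50 GFLOPS per watt
```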
SLIDE 7

Execution Model Phase Change

  • Guiding principles for system design and operation
    – Semantics, mechanisms, policies, parameters, metrics
    – Driven by technology opportunities and challenges
    – Historically, catalyzed by paradigm shift
  • Decision chain across system layers
    – For reasoning towards optimization of design and operation
  • Essential for co-design of all system layers
    – Architecture, runtime and OS, programming models
    – Reduces design complexity from O(N²) to O(N)
    – Enables holistic reasoning about concepts and tradeoffs
  • Empowers discrimination, commonality, portability
    – Establishes a phylum of HPC-class systems

Execution model phase changes: Von Neumann Model (1949) → SIF-MOE Model (1968) → Vector Model (1975) → SIMD-array Model (1983) → CSP Model (1991) → ? Model (2020)

SLIDE 8

Total Power

[Figure: total system power (MW), 1992–2012, plotted separately for heavyweight, lightweight, and heterogeneous architectures]

Courtesy of Peter Kogge, University of Notre Dame

SLIDE 9

Technology Demands a New Response


SLIDE 10

Total Concurrency

[Figure: total concurrency, TC (flops/cycle), 1992–2012, plotted separately for heavyweight, lightweight, and heterogeneous architectures]

Courtesy of Peter Kogge, University of Notre Dame

SLIDE 11

Performance Factors - SLOWER

  • Starvation
    – Insufficiency of concurrency of work
    – Impacts scalability and latency hiding
    – Affects programmability
  • Latency
    – Time-measured distance for remote access and services
    – Impacts efficiency
  • Overhead
    – Critical-time additional work to manage tasks & resources
    – Impacts efficiency and granularity for scalability
  • Waiting for contention resolution
    – Delays due to simultaneous access requests to shared physical or logical resources

P = e(L,O,W) × S(s) × a(r) × U(E)

where:
  P – performance (ops)
  e – efficiency (0 < e < 1), a function of latency L, overhead O, and waiting time W
  S – scalability as a function of s, the application's average parallelism
  a – availability (0 < a < 1), a function of reliability r (0 < r < 1)
  U – normalization factor per compute unit, a function of E, watts per average compute unit
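As a purely illustrative instance of how the factors in this model multiply down (all values below are hypothetical, not from the slide):

```latex
% Hypothetical evaluation of P = e(L,O,W) * S(s) * a(r) * U(E):
% 50% efficiency, 10^6-way average parallelism, 90% availability,
% and 10^9 ops/s per normalized compute unit.
P = \underbrace{0.5}_{e(L,O,W)} \times \underbrace{10^{6}}_{S(s)} \times
    \underbrace{0.9}_{a(r)} \times \underbrace{10^{9}\,\text{ops/s}}_{U(E)}
  = 4.5 \times 10^{14}\ \text{ops/s}
```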

SLIDE 12

The Negative Impact of Global Barriers in Astrophysics Codes

Computational phase diagram from the MPI-based GADGET code (used for N-body and SPH simulations) with 1M particles over four timesteps on 128 processors. Red indicates computation; blue indicates waiting for communication.

SLIDE 13

Goals of a New Execution Model for Exascale

  • Serve as a discipline to govern future scalable system architectures, programming methods, and runtimes
  • Latency hiding at all system distances
    – Latency-mitigating architectures
  • Exploit parallelism in a diversity of forms and granularities
  • Provide a framework for efficient fine-grain synchronization and scheduling (dispatch)
  • Enable optimized runtime adaptive resource management and task scheduling for dynamic load balancing
  • Support full virtualization for fault tolerance, power management, and continuous optimization
  • Self-aware infrastructure for power management
  • Semantics of failure response for graceful degradation
  • Complexity of operation as an emergent behavior from simplicity of design, high replication, and local adaptation for global optima in time and space

SLIDE 14

ParalleX Execution Model

  • Lightweight multi-threading
    – Divides work into smaller tasks
    – Increases concurrency
  • Message-driven computation
    – Moves work to data
    – Keeps work local, stops blocking
  • Constraint-based synchronization (see the sketch after this list)
    – Declarative criteria for work
    – Event driven
    – Eliminates global barriers
  • Data-directed execution
    – Merger of flow control and data structure
  • Shared name space
    – Global address space
    – Simplifies random gathers
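To make the "constraint-based synchronization" and "eliminates global barriers" points concrete, here is a minimal sketch using plain C++11 standard futures; it is not ParalleX or HPX code, and every function name in it is invented for illustration. Each dependent task starts as soon as its own declared inputs are ready, rather than waiting at a global barrier.

```cpp
// Plain C++11 sketch of constraint-based synchronization (not actual HPX/ParalleX code).
#include <future>
#include <iostream>
#include <vector>

double compute_chunk(int chunk_id) {
    // Stand-in for a unit of local work on one chunk of the problem domain.
    return chunk_id * 1.5;
}

double combine(double left, double right) {
    // Stand-in for work whose only precondition is these two inputs.
    return left + right;
}

int main() {
    // Launch many small tasks; none of them waits on a global barrier.
    std::vector<std::future<double>> chunks;
    for (int i = 0; i < 8; ++i)
        chunks.push_back(std::async(std::launch::async, compute_chunk, i));

    // A dependent task fires when its declared inputs (chunks 0 and 1) resolve,
    // regardless of whether chunks 2..7 have finished yet.
    double a = chunks[0].get();
    double b = chunks[1].get();
    std::future<double> pair01 = std::async(std::launch::async, combine, a, b);

    std::cout << "pair01 = " << pair01.get() << "\n";

    // Remaining chunks are consumed whenever they individually complete.
    for (int i = 2; i < 8; ++i)
        std::cout << "chunk " << i << " = " << chunks[i].get() << "\n";
    return 0;
}
```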

SLIDE 15

ParalleX Addresses Critical Challenges (1)

  • Starvation
    – Lightweight threads for an additional level of parallelism
    – Lightweight threads with rapid context switching for non-blocking execution
    – Low overhead for finer granularity and more parallelism
    – Parallelism discovery at runtime through data-directed execution
    – Overlap of successive phases of computation for more parallelism
  • Latency
    – Lightweight thread context switching for non-blocking execution
    – Overlap of computation and communication to limit effects
    – Message-driven computation to reduce latency by putting work near data
    – Reduced number and size of global messages

SLIDE 16

ParalleX Addresses Critical Challenges (2)

  • Overhead
    – Eliminates (most) global barriers
    – However, will ultimately require hardware support in the limit
    – Uses synchronization objects exhibiting high semantic power
    – Reduces context switching time
    – Not all actions require thread instantiation
  • Waiting due to contention
    – Adaptive resource allocation with redundant resources (e.g., hardware support for threads)
    – Eliminates polling and reduces the number of sources of synchronization contention

SLIDE 17

HPX Runtime Design

  • The current version of HPX provides the following infrastructure, as defined by the ParalleX execution model (a brief usage sketch follows below):
    – Complexes (ParalleX threads) and ParalleX thread management
    – Parcel transport and parcel management
    – Local Control Objects (LCOs)
    – Active Global Address Space (AGAS)
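A minimal usage sketch follows, assuming a reasonably recent HPX release (header paths have moved between versions, so treat the includes as an assumption); it is not taken from the talk. Here hpx::async stands in for spawning a lightweight HPX thread (a ParalleX "complex"), and the returned future plays the role of a simple LCO.

```cpp
// Hedged sketch of HPX usage; header layout varies by HPX release.
#include <hpx/hpx_main.hpp>   // lets a plain main() run as an HPX thread
#include <hpx/future.hpp>     // hpx::async, hpx::future (older releases: hpx/include/async.hpp)
#include <iostream>

int square(int x) { return x * x; }

int main() {
    // Work is expressed as many small tasks; the runtime schedules them over its
    // worker threads and, via AGAS and parcels, can place work near its data.
    hpx::future<int> f = hpx::async(square, 21);
    std::cout << "result = " << f.get() << "\n";  // suspends this HPX thread, not the core
    return 0;
}
```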

SLIDE 18

Overlapping computational phases for hydrodynamics

Computational phases for LULESH (a mini-app for hydrodynamics codes), MPI vs. HPX. Red indicates work; white indicates waiting for communication. Overdecomposition: MPI used 64 processes, while HPX used roughly 1,000 threads spread across 64 cores.
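The overdecomposition idea can be illustrated with a short, self-contained C++ sketch (standard threads rather than HPX threads; the chunk counts mirror the 64-core / ~1,000-task ratio above, but the names and the work itself are invented): splitting the domain into far more chunks than cores keeps every worker supplied with ready work while other chunks are stalled.

```cpp
// Illustrative C++ sketch of overdecomposition (not LULESH itself).
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const int num_cores  = 64;     // hardware workers, as in the slide's HPX run
    const int num_chunks = 1000;   // ~1E3 tasks, i.e. roughly 16 chunks per core

    std::atomic<int> next{0};                    // shared work-queue cursor
    std::vector<double> result(num_chunks, 0.0);

    auto worker = [&]() {
        // Each worker repeatedly grabs the next ready chunk instead of being
        // statically bound to exactly one chunk the way a single MPI rank is.
        for (int c; (c = next.fetch_add(1)) < num_chunks; )
            result[c] = c * 0.5;                 // stand-in for the chunk's local physics
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < num_cores; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();

    std::cout << "processed " << num_chunks << " chunks on "
              << num_cores << " workers\n";
    return 0;
}
```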

SLIDE 19

Dynamic load balancing via message-driven work-queue execution for Adaptive Mesh Refinement (AMR)

SLIDE 20

Application: Adaptive Mesh Refinement (AMR) for Astrophysics simulations

SLIDE 21

Conclusions

  • HPC is in a (6th) phase change
  • Ultra-high-scale computing of the next decade will require a new model of computation to effectively exploit new technologies and guide system co-design
  • ParalleX is an example of an experimental execution model that addresses key challenges on the path to Exascale
  • Early experiments are encouraging for enhancing the scaling of graph-based, numerically intensive, and knowledge-management applications
