Reactive Runtime Systems for Heterogeneous Extreme Computations
Thomas Sterling
Chief Scientist, CREST; Professor, School of Informatics and Computing, Indiana University
November 19, 2014
Shifting Paradigms of Computing

- Abacus
  – Counting tables
- Pascaline
- Difference engine
  – Charles Babbage
  – Per Georg Scheutz
- Tabulators
  – Herman Hollerith
- Analog computer
  – Vannevar Bush
- Harvard architecture machine
  – Howard Aiken
  – Konrad Zuse
  – Charles Babbage's Analytical Engine
The von Neumann Age

- Foundations:
  – Information theory – Claude Shannon
  – Computability – Turing/Church
  – Cybernetics – Norbert Wiener
  – Stored-program computer architecture – von Neumann
- The von Neumann shift: 1945–1960
  – Vacuum tubes, core memory
  – Technology assumptions:
    - ALUs are the most expensive components
    - Memory capacity and clock rate are the scale drivers – mainframes
    - Data movement is of secondary importance
- Von Neumann extended: 1960–2014
  – Semiconductors, exploitation of parallelism
  – Out-of-order completion – vector – SIMD – multiprocessor (MIMD)
  – SMP
    - Maintains sequential consistency
  – MPP/clusters
    - Ensemble computations with message passing
[Figure: Capability computing for brain simulation – computational complexity vs. memory requirements (1 MB to 100 PB) and machine scale (1 Gigaflops to 1 Exaflops). Model fidelity ranges from subcellular detail and a single cellular model through cellular neocortical column, cellular mesocircuit, cellular rodent brain, and cellular human brain. Machines range from the EPFL 4-rack BlueGene/L, CADMOS 4-rack BlueGene/P, BBP/CSCS 4-rack BlueGene/Q + BGAS, and Jülich 28-rack BlueGene/Q to the planned EU HBP exaflop machine. Overlaid process costs: glia-cell/vasculature O(1-10x), plasticity O(1-10x), learning & memory O(10-100x), behavior O(100-1,000x), reaction-diffusion O(100-1,000x), molecular dynamics O(>1,000,000,000x?).]
The Negative Impact of Global Barriers in Astrophysics Codes

[Figure: Computational phase diagram from the MPI-based GADGET code (used for N-body and SPH simulations), running 1M particles over four time steps on 128 processors. Red indicates computation; blue indicates waiting for communication.]
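The effect in the phase diagram can be sketched in miniature: a bulk-synchronous step stalls at the barrier for the full communication time, while a message-driven step keeps computing while the exchange is in flight. Below is a toy Python model, not GADGET's actual structure; a thread stands in for an MPI exchange, and `work` and `comm_delay` are illustrative parameters.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_step(work, comm_delay):
    """Bulk-synchronous step: compute, then sit at the barrier
    until the exchange finishes, then compute again."""
    t0 = time.perf_counter()
    work()
    time.sleep(comm_delay)      # everyone waits for the slowest exchange
    work()
    return time.perf_counter() - t0

def overlapped_step(work, comm_delay):
    """Message-driven step: launch the exchange asynchronously and
    keep computing while it is in flight."""
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as pool:
        exchange = pool.submit(time.sleep, comm_delay)
        work()                  # this work overlaps the exchange
        work()
        exchange.result()       # block only if computation ran out first
    return time.perf_counter() - t0
```

With `work` costing roughly 0.1 s and a 0.3 s exchange, the blocking step takes about 0.5 s while the overlapped step takes about 0.3 s, which is the gain the red/blue diagram visualizes.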
Clock Rate

[Figure: Clock rate (MHz) of heavyweight, lightweight, and heterogeneous systems, 1992–2012. Courtesy of Peter Kogge, UND]

Total Concurrency

[Figure: Total concurrency, TC (flops/cycle), of heavyweight, lightweight, and heterogeneous systems, 1992–2012. Courtesy of Peter Kogge, UND]
[Figure: Performance gain of non-blocking programs over blocking programs with respect to core count (memory contention) and network latency. Gain shown for 1–32 cores per node and latencies of 64–8192 reg-ops; 8 tasks per core, overhead of 16 reg-ops, 8% network ops.]

[Figure: Performance gain of non-blocking programs over blocking programs with varying core counts (memory contention) and overheads. Gain shown for 1–32 cores per node and overheads of 10–70; latency of 8192 reg-ops, 64 tasks per core.]
The Purpose of a QUARK Runtime

- Objectives
  – High utilization of each core
  – Scaling to a large number of cores
  – Synchronization-reducing algorithms
- Methodology
  – Dynamic DAG scheduling (QUARK)
  – Explicit parallelism
  – Implicit communication
  – Fine granularity / block data layout
- Arbitrary DAG with dynamic scheduling

[Figure: Fork-join parallelism vs. DAG-scheduled parallelism. Notice the synchronization penalty of fork-join in the presence of heterogeneity. Courtesy of Jack Dongarra, UTK]
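The firing rule behind dynamic DAG scheduling can be sketched in a few lines. This is not QUARK's API; it is a sequential Python sketch where a task runs as soon as every task it depends on has produced a result, with no fork-join barrier between phases.

```python
def run_dag(tasks):
    """Sequential sketch of dynamic DAG scheduling.
    tasks: {name: (fn, [names of input tasks])}.
    A task 'fires' as soon as all of its inputs are available."""
    done = {}
    ready = [t for t, (fn, ins) in tasks.items() if not ins]
    while ready:
        t = ready.pop()
        fn, ins = tasks[t]
        done[t] = fn(*[done[i] for i in ins])       # fire the task
        for u, (g, uins) in tasks.items():          # wake any dependents
            if u not in done and u not in ready and all(i in done for i in uins):
                ready.append(u)
    return done

# Tiny diamond-shaped DAG: b and c both depend on a; d depends on both.
example = {
    "a": (lambda: 1, []),
    "b": (lambda a: a + 1, ["a"]),
    "c": (lambda a: a * 2, ["a"]),
    "d": (lambda b, c: b + c, ["b", "c"]),
}
```

In a real runtime the ready tasks run concurrently on worker threads; the sketch only shows the dependence-driven firing rule that removes global synchronization points, so b and c (which do not depend on each other) need never wait at a common barrier.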
Asynchronous Many-Task Runtime Systems

- Dharma – Sandia National Laboratories
- Legion – Stanford University
- Charm++ – University of Illinois
- Uintah – University of Utah
- STAPL – Texas A&M University
- HPX – Indiana University
- OCR – Rice University
Semantic Components of ParalleX
Overlapping Computational Phases for Hydrodynamics

[Figure: Computational phases for LULESH (a mini-app for hydrodynamics codes), MPI vs. HPX. Red indicates work; white indicates waiting for communication. Overdecomposition: MPI used 64 processes, while HPX used 1,000 threads spread across 64 cores.]
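Overdecomposition itself is simple to sketch: split the domain into many more chunks than cores, so that whenever one chunk stalls (for example, on communication) the scheduler still has runnable chunks to keep every core busy. A minimal Python sketch, with an illustrative chunk count rather than LULESH's actual decomposition:

```python
from concurrent.futures import ThreadPoolExecutor

def overdecomposed_sum(data, cores=4, tasks_per_core=16):
    """Overdecomposition sketch: carve the domain into
    cores * tasks_per_core chunks and let a pool of 'cores' workers
    drain them, instead of one monolithic chunk per core."""
    chunk = max(1, len(data) // (cores * tasks_per_core))
    pieces = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=cores) as pool:
        return sum(pool.map(sum, pieces))
```

The result is identical to the serial sum; the payoff is in scheduling freedom, which is what lets HPX's 1,000 threads on 64 cores fill the white waiting gaps in the MPI phase diagram.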
HPX-5 Development Progress

[Figure: Zoom-in on the best performers; all cases run on 16 cores (1 locality). Courtesy of Matt Anderson, Indiana University]
[Figure: Block diagram of a processing cell – a wide-word struct ALU; per-thread register sets (threads 0 through N-1); scratchpad; memory interface with row buffers; dataflow control state; wide instruction buffer; parcel handler; thread manager; memory vault; fault detection and handling; power management; access control; AGAS translation; datapaths with associated control; control-only interfaces; PRECISE decoder; interconnect.]
Neo-Digital Age

- Goal and Objectives
  – Devise means of exploiting nano-scale semiconductor technologies at the end of Moore's Law to leverage fabrication-facility investments
  – Create scalable structures and semantics capable of optimal performance (time to solution) within technology and power limitations
- Technical Strategy
  1. Liberate parallel computer architecture from von Neumann (vN) archaism for efficiency and scalability; eliminate the vN bottleneck
  2. Rebalance and integrate functional elements for data movement, operations, storage, and control to minimize time and energy
  3. Emphasize tightly coupled logic locality for low time and energy
  4. Dynamic adaptive localized control to address asynchrony and expose parallelism with emergent behavior of global computing
  5. Innovation in the execution model for governing principles at all levels
Neo-Digital Age – extending past foundations

- Near nano-scale semiconductor technology
  – Flat-lining of Moore's Law at single-digit nanometers
  – Cost benefits of fab lines for economy of scale through the mass market
- Logic function modules
  – Exploit existing and new IP for efficient functional units
  – ALUs, latches/registers, nearest-neighbor data paths
- Integrated optics
  – Orders-of-magnitude bandwidth increase
  – Inter-socket
  – On chip
- Advanced packaging and cooling
  – Dramatic improvement opportunities in volumetric utilization
  – 3-D integration of combined memory/logic/communication dies
  – Return to wafer-scale integration
Neo-Digital Age – Principles

- Elimination of vN-based parallel architecture
  – Constrained parallel control state
  – The processor-centric approach optimizes for ALU utilization with a very deep memory hierarchy; wrong answer
- ALU-pervasive structures
  – Merge ALUs with storage and communication structures
  – High availability of ALUs rather than high utilization
  – Slashes access latencies to save time and energy
- Cells of logic/memory/data-passing for single-cycle actions
  – Optimized for space/energy/performance
  – Optimized for memory-bandwidth utilization
- Emphasis on fine-grain nearest-neighbor data-movement structures
  – Direct access to adjacent state storage
  – Enables communication through nearest neighbors
- Virtual objects in a global name space
  – Data, instructions, synchronization
  – Intra-medium packet-switched wormhole routing
  – Dynamic allocation
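The "virtual objects in a global name space" principle can be made concrete with a toy sketch. The class and method names below are illustrative inventions, not HPX's AGAS API: the point is only that a name resolves to a locality through one level of translation, so an object can migrate without its name (or any reference holding that name) changing.

```python
class ToyAGAS:
    """Toy active global address space: a global name resolves to a
    locality via a directory, so objects can migrate dynamically
    while every reference to them stays valid. Illustrative only."""
    def __init__(self):
        self.directory = {}              # global name -> locality id
        self.heaps = {}                  # locality id -> {name: object}

    def put(self, name, locality, value):
        self.directory[name] = locality
        self.heaps.setdefault(locality, {})[name] = value

    def get(self, name):
        locality = self.directory[name]  # the dynamic translation step
        return self.heaps[locality][name]

    def migrate(self, name, new_locality):
        old = self.directory[name]
        self.put(name, new_locality, self.heaps[old].pop(name))
```

Contrast this with a static, address-is-location scheme (plain MPI ranks), where moving data forces every holder of the old address to be updated.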
Prior Art: Non-von Neumann Architectures

- Dataflow
  – Static (Dennis)
  – Dynamic (Arvind)
- Systolic arrays
  – Streaming (H.T. Kung)
- Neural networks – connectionist
- Processor in Memory (PIM)
  – Bit- or word-level logic on-chip at/near the sense amps of memory
  – SIMD (Iobst) or MIMD (Kogge)
- Cellular Automata (CA)
  – Global emergent behavior from local rules and state
  – von Neumann (ironically)
- Continuum Computer Architecture
  – CA with global naming and active packets (Sterling/Brodowicz)
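The cellular-automaton entry above is worth a concrete line of code, since it is the "local rules, global behavior" idea in its purest form. A minimal 1-D binary CA step with a periodic boundary, using rule 110 (a rule known to support complex, even universal, global behavior from strictly local updates):

```python
def ca_step(cells, rule=110):
    """One step of a 1-D binary cellular automaton: the next state of
    cell i is read out of the 8-bit rule table, indexed by the states
    of cells i-1, i, i+1 (periodic boundary). Each update is purely
    local, yet the global pattern can be arbitrarily complex."""
    n = len(cells)
    return [(rule >> (cells[(i - 1) % n] * 4
                      + cells[i] * 2
                      + cells[(i + 1) % n])) & 1
            for i in range(n)]
```

Iterating `ca_step` from a single live cell grows the characteristic rule-110 pattern; the Continuum Computer Architecture entry above can be read as this scheme augmented with global naming and active packets.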
Properties

- Local communication
  – Systolic array: mostly
  – Dataflow: classically not
  – Processor in Memory: logic at the sense amps
  – Cellular Automata: fully
  – Continuum Computer Architecture: fully, with pipelined packet switching
  – Neural networks: not
- Event-driven for asynchrony management
  – Systolic array: classically not; iWarp, yes
  – Dataflow: yes
  – PIM: no
  – Cellular Automata: can be
  – CCA: yes
  – Neural networks: classically no; neuromorphic, yes
- Merged logic/storage/communication
  – Systolic array: yes
  – Dataflow: classically not, but possible
  – PIM: yes
  – Cellular Automata: yes
  – CCA: yes
  – Neural networks: yes
[Figure: Dataflow processing element – tag-match logic, control, ALU, register file, tagged values, and a network interface.]

[Figure: CCA die, 3-D die stack, and CCA system – fonton TSVs and on-chip network, optical fiber bundle, integrated optical transceiver.]
Performance

- Maximum ALUs
  – For high utilization
- Maximum memory bandwidth
- Ample inter-cell communication bandwidth
- Reduced overhead
- Reduced latency
- Adaptive routing for contention avoidance
- Multi-variate storage beyond the bit
- Multi-variate logic beyond base-2 Boolean
- Dynamic adaptive execution model
Energy

- Minimize distance between elements
  – Permeate structures with ALUs
  – Short latencies between memory and registers
- Multiple clock rates to match timings between logic, storage, and communication
- Neighbor-cell memory/register access
- Eliminate large caches and multi-level caches
- Eliminate speculative execution
Questions That Can Be Answered Now

1. Trade-offs of DRAM bits and SRAM bits
   – Space, time, energy
2. Cell rules
   – Derived from the parallel execution model
3. Cell granularity
   – Breakpoint where mitosis is better
4. Design of the logic cell (Fonton)
   – Very simple compared to full conventional processors
5. Reference implementation
   – Emulation, simulation, FPGA, ASIC, custom
6. 3-D packaging
7. Software environment
8. Application analysis
Conclusions

- New high-density functional structures are required at the end of Moore's Law, and they are emerging
- Reactive runtime systems, supported by innovations in hardware architecture mechanisms, will exploit extremes of parallelism at high efficiency
- The Neo-Digital Age advances beyond von Neumann architectures to maximize execution concurrency and react to the uncertainties of asynchrony