Reactive Runtime Systems for Heterogeneous Extreme Computations - PowerPoint PPT Presentation



SLIDE 1

Reactive Runtime Systems for Heterogeneous Extreme Computations

Thomas Sterling

Chief Scientist, CREST; Professor, School of Informatics and Computing, Indiana University. November 19, 2014

SLIDE 2

Shifting Paradigms of Computing

  • Abacus

– Counting tables

  • Pascaline
  • Difference engine

– Charles Babbage
– Per Georg Scheutz

  • Tabulators

– Herman Hollerith

  • Analog computer

– Vannevar Bush Machine

  • Harvard Architecture

– Howard Aiken
– Konrad Zuse
– Charles Babbage (Analytical Engine)

SLIDE 3

The von Neumann Age

  • Foundations:

– Information Theory – Claude Shannon
– Computability – Turing/Church
– Cybernetics – Norbert Wiener
– Stored Program Computer Architecture – von Neumann

  • The von Neumann Shift: 1945 – 1960

– Vacuum tubes, core memory
– Technology assumptions

  • ALUs are the most expensive components
  • Memory capacity and clock rate are scale drivers – mainframes
  • Data movement of secondary importance

  • Von Neumann extended: 1960 – 2014

– Semiconductors, exploitation of parallelism
– Out-of-order completion
– Vector
– SIMD
– Multiprocessor (MIMD)

  • SMP

– Maintain sequential consistency

  • MPP/Clusters

– Ensemble computations with message passing

SLIDE 4

Subcellular Detail and Capability Computing

[Chart: computational complexity vs. memory requirements for brain simulation at increasing scale. Subcellular detail multipliers: glia-cell/vasculature O(1–10x), reaction-diffusion O(100–1,000x), molecular dynamics O(>1,000,000,000x?), plasticity O(1–10x), learning & memory O(10–100x), behavior O(100–1,000x). Models span the single cellular model (~1 MB), cellular neocortical column, cellular mesocircuit, cellular rodent brain, and cellular human brain (~100 PB), requiring 1 Gigaflops up to 1 Exaflops. Machines: EPFL 4-rack BlueGene/L, CADMOS 4-rack BlueGene/P, BBP/CSCS 4-rack BlueGene/Q + BGAS, Jülich 28-rack BlueGene/Q, and the planned EU HBP Exaflop machine.]

SLIDE 5

The Negative Impact of Global Barriers in Astrophysics Codes

Computational phase diagram from the MPI-based GADGET code (used for N-body and SPH simulations) using 1M particles over four time steps on 128 processors. Red indicates computation; blue indicates waiting for communication.
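The barrier penalty shown in this phase diagram can be illustrated with a toy load-imbalance model (an illustrative sketch with made-up work distributions, not GADGET's actual timing): with a global barrier after every step, the whole machine pays for the slowest processor in each step, while a fully asynchronous schedule is bounded only by each processor's own total work.

```python
import random

def total_time(work, barrier):
    """work[s][p] = compute time of processor p in step s.
    With a global barrier, every step waits for its slowest processor;
    without one, a processor is bounded only by its own total work."""
    if barrier:
        return sum(max(step) for step in work)
    return max(sum(step[p] for step in work) for p in range(len(work[0])))

# Hypothetical imbalance: 128 processors, 4 time steps (as in the GADGET run).
rng = random.Random(42)
work = [[rng.uniform(1.0, 2.0) for _ in range(128)] for _ in range(4)]

t_sync = total_time(work, barrier=True)
t_async = total_time(work, barrier=False)
```

Since a sum of per-step maxima always dominates the maximum of per-processor sums, `t_sync >= t_async` holds for any workload, and the gap widens as imbalance grows.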

SLIDE 6

Clock Rate

[Chart: clock rate (MHz), 10 to 10,000 on a log scale, from 1992 to 2012, for heavyweight, lightweight, and heterogeneous systems.]

Courtesy of Peter Kogge, UND

SLIDE 7

Total Concurrency

[Chart: total concurrency TC (flops/cycle), 1E+01 to 1E+07 on a log scale, from 1992 to 2012, for heavyweight, lightweight, and heterogeneous systems.]

Courtesy of Peter Kogge, UND

SLIDE 8

Gain with Respect to Cores per Node and Latency; 8 Tasks per Core, Overhead of 16 reg-ops, 8% Network Ops

[Chart: performance gain (1x to 9x) of non-blocking programs over blocking programs with respect to cores per node (memory contention; 1 to 32) and network latency (64 to 8192 reg-ops).]
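The shape of this gain curve can be reproduced with a simple closed-form toy model (my assumption, not the talk's actual simulator; the `work=100` reg-ops per task is an illustrative guess): blocking tasks pay the full network latency on every remote operation, while non-blocking tasks hide latency behind other ready tasks at the price of a switch overhead.

```python
def gain(tasks, work, overhead, latency):
    """Toy model: each of `tasks` tasks per core does `work` reg-ops of
    compute, then one remote operation of `latency` reg-ops.
    Blocking: the core idles through every latency.
    Non-blocking: latencies overlap other tasks; only the switch overheads
    and any un-hidable remainder of a latency are exposed."""
    t_blocking = tasks * (work + latency)
    t_nonblocking = max(tasks * (work + overhead), work + overhead + latency)
    return t_blocking / t_nonblocking

# Slide-flavored parameters: 8 tasks per core, 16 reg-op overhead.
low = gain(tasks=8, work=100, overhead=16, latency=64)
high = gain(tasks=8, work=100, overhead=16, latency=8192)
```

As latency grows, the gain saturates near the number of tasks per core, matching the intuition that more concurrent tasks can hide more latency.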

SLIDE 9

Gain with Respect to Cores per Node and Overhead; Latency of 8192 reg-ops, 64 Tasks per Core

[Chart: performance gain of non-blocking programs over blocking programs with varying core counts (memory contention; 1 to 32 cores per node) and overheads (10 to 70 reg-ops).]

SLIDE 10
The Purpose of a QUARK Runtime

  • Objectives

– High utilization of each core
– Scaling to large numbers of cores
– Synchronization-reducing algorithms

  • Methodology

– Dynamic DAG scheduling (QUARK)
– Explicit parallelism
– Implicit communication
– Fine granularity / block data layout

  • Arbitrary DAG with dynamic scheduling

[Diagrams: fork-join parallelism vs. DAG-scheduled parallelism. Notice the synchronization penalty in the presence of heterogeneity.]

Courtesy of Jack Dongarra, UTK
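Dependency-driven DAG execution of the kind QUARK performs can be sketched in a few lines (a minimal serial illustration, not QUARK's API): each task carries a dependency count, and a task joins the ready queue the moment its count reaches zero, with no global join between phases.

```python
from collections import deque

def dag_schedule(tasks, deps):
    """Run `tasks` (name -> callable) respecting `deps`
    (name -> set of prerequisite names), dependency-count style."""
    remaining = {t: len(deps.get(t, ())) for t in tasks}
    children = {t: [] for t in tasks}
    for t, ds in deps.items():
        for d in ds:
            children[d].append(t)
    ready = deque(t for t, n in remaining.items() if n == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()                     # run the task (a worker pool in a real runtime)
        order.append(t)
        for c in children[t]:          # completing a task may release others
            remaining[c] -= 1
            if remaining[c] == 0:
                ready.append(c)
    return order

# Tiny diamond DAG: b and c depend on a; d depends on both b and c.
log = []
order = dag_schedule(
    {name: (lambda name=name: log.append(name)) for name in "abcd"},
    {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}},
)
```

In a real runtime the ready queue feeds a pool of workers, so independent tasks like `b` and `c` run concurrently instead of serializing at a fork-join boundary.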

SLIDE 11
Asynchronous Many-Task Runtime Systems

  • Dharma – Sandia National Laboratories
  • Legion – Stanford University
  • Charm++ – University of Illinois
  • Uintah – University of Utah
  • STAPL – Texas A&M University
  • HPX – Indiana University
  • OCR – Rice University

SLIDE 12

Semantic Components of ParalleX
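The interplay of ParalleX's semantic components (localities, parcels, a global address space, lightweight threads) can be caricatured in a few lines. This is a hedged sketch with invented names (`Parcel`, `Locality`, `agas`), not HPX's API: a parcel carries an action to the locality that owns the target's global address, moving work to data instead of data to work.

```python
from dataclasses import dataclass

@dataclass
class Parcel:
    target: int      # global address of the destination object
    action: str      # name of the action to run at the target
    args: tuple = ()

class Locality:
    """One node of the system; owns a slice of the global address space."""
    def __init__(self):
        self.objects = {}        # global address -> local state

    def deliver(self, parcel, actions):
        # A parcel spawns a lightweight thread at the data's home locality.
        obj = self.objects[parcel.target]
        return actions[parcel.action](obj, *parcel.args)

def add(obj, n):
    obj["count"] += n
    return obj["count"]

actions = {"add": add}

# AGAS-like table mapping global addresses to home localities (illustrative).
localities = [Locality(), Locality()]
agas = {100: 0, 200: 1}
localities[0].objects[100] = {"count": 0}
localities[1].objects[200] = {"count": 10}

# Send work to where address 200 lives, rather than pulling its data here.
p = Parcel(target=200, action="add", args=(5,))
result = localities[agas[p.target]].deliver(p, actions)
```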

SLIDE 13

Overlapping Computational Phases for Hydrodynamics

Computational phases for LULESH (a mini-app for hydrodynamics codes) under MPI and HPX. Red indicates work; white indicates waiting for communication. Overdecomposition: MPI used 64 processes while HPX used 1,000 threads spread across 64 cores.

SLIDE 14

HPX-5 Development Progress

Zoom-in on best performers. All cases run on 16 cores (1 locality).

Courtesy of Matt Anderson, Indiana University

SLIDE 15

Wide-Word Struct ALU

[Block diagram of a processing element: a wide-word struct ALU; registers for threads 0 through N-1; a scratchpad; a memory interface with row buffers; dataflow control state and a wide instruction buffer; a parcel handler; a thread manager; a memory vault; fault detection and handling; power management; access control; AGAS translation; datapaths with associated control; control-only interfaces; a PRECISE decoder; and the interconnect.]

SLIDE 16

Neo-Digital Age

  • Goal and Objectives

– Devise means of exploiting nano-scale semiconductor technologies at the end of Moore’s Law to leverage fabrication facility investments
– Create scalable structures and semantics capable of optimal performance (time to solution) within technology and power limitations

  • Technical Strategy

1. Liberate parallel computer architecture from von Neumann (vN) archaism for efficiency and scalability; eliminate the vN bottleneck
2. Rebalance and integrate functional elements for data movement, operations, storage, and control to minimize time and energy
3. Emphasize tightly coupled logic locality for low time and energy
4. Dynamic adaptive localized control to address asynchrony and expose parallelism with the emergent behavior of global computing
5. Innovation in execution model for governing principles at all levels

SLIDE 17

Neo-Digital Age – extending past foundations

  • Near nano-scale semiconductor technology

– Flat-lining of Moore’s Law at single-digit nanometers
– Cost benefits of fab lines for economy of scale through the mass market

  • Logic function modules

– Exploit existing and new IP for efficient functional units
– ALUs, latches/registers, nearest-neighbor data paths

  • Integrated optics

– Orders-of-magnitude bandwidth increase
– Inter-socket
– On chip

  • Advanced packaging and cooling

– Dramatic improvement opportunities in volumetric utilization
– 3-D integration of combined memory/logic/communication dies
– Return to wafer-scale integration

SLIDE 18

Neo-Digital Age – Principles

  • Elimination of vN-based parallel architecture

– Constrained parallel control state
– The processor-centric approach optimizes for ALU utilization with a very deep memory hierarchy; the wrong answer

  • ALU-pervasive structures

– Merge ALUs with storage and communication structures
– High availability of ALUs rather than high utilization
– Slashes access latencies to save time and energy

  • Cells of logic/memory/data-passing for single-cycle actions

– Optimized for space/energy/performance
– Optimized for memory bandwidth utilization

  • Emphasis on fine-grain nearest-neighbor data-movement structures

– Direct access to adjacent state storage
– Enables communication through nearest neighbors

  • Virtual objects in global name space

– Data, instructions, synchronization
– Intra-medium packet-switched wormhole routing
– Dynamic allocation

SLIDE 19

Prior Art: Non-von Neumann Architectures

  • Dataflow

– Static (Dennis)
– Dynamic (Arvind)

  • Systolic Arrays

– Streaming (HT Kung)

  • Neural Networks – connectionist
  • Processor in Memory (PIM)

– Bit- or word-level logic on-chip at/near the sense amps of memory
– SIMD (Iobst) or MIMD (Kogge)

  • Cellular Automata (CA)

– Global emergent behavior from local rules and state
– von Neumann (ironically)

  • Continuum Computer Architecture

– CA with global naming and active packets (Sterling/Brodowicz)
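The "global emergent behavior from local rules" property of cellular automata is easy to demonstrate. A minimal sketch (an elementary CA with wraparound; rule 110 chosen purely for illustration): each cell's next state depends only on its three-cell neighborhood, yet the global pattern becomes complex.

```python
def step(cells, rule=110):
    """Elementary CA step: a cell's next state is a pure function of its
    three-cell neighborhood (local rule), read off the rule number's bits."""
    n = len(cells)
    return [
        (rule >> (4 * cells[(i - 1) % n] + 2 * cells[i] + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

# A single live cell evolves into a growing, intricate global pattern.
row = [0] * 31
row[15] = 1
history = [row]
for _ in range(15):
    history.append(step(history[-1]))
```

The Continuum Computer Architecture entry below extends exactly this idea with global naming and active packets, so cells can interact beyond their immediate neighborhood.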

SLIDE 20

Properties

  • Local communication

– Systolic array: mostly
– Dataflow: classically not
– Processor in Memory: logic at the sense amps
– Cellular Automata: fully
– Continuum Computer Architecture: fully, with pipelined packet switching
– Neural networks: not

  • Event driven for asynchrony management

– Systolic array: classically not; iWarp yes
– Dataflow: yes
– PIM: no
– Cellular Automata: can be
– CCA: yes
– Neural networks: classically no; neuromorphic yes

  • Merged logic/storage/communication

– Systolic array: yes
– Dataflow: classically not, but possible
– PIM: yes
– Cellular Automata: yes
– CCA: yes
– Neural networks: yes

SLIDE 21

[Diagram of a dynamic dataflow processing element: tag-match logic, control, ALU, network interface, register file, and operand slots holding (tag, value) pairs.]
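Dynamic (tagged-token) dataflow firing of the kind this tag-match logic performs can be sketched briefly (a toy illustration, not a real dataflow ISA): an instruction fires only when both operands carrying the same tag have arrived, so execution is driven by data availability rather than a program counter.

```python
# waiting maps (destination instruction, tag) -> the operand that arrived first.
waiting = {}
fired = []   # (instruction, tag, result) triples, in firing order

def arrive(dest, tag, value, op):
    """A token reaches the tag-match logic: fire `op` if the partner
    operand with the same (dest, tag) is already waiting; else wait."""
    key = (dest, tag)
    if key in waiting:
        fired.append((dest, tag, op(waiting.pop(key), value)))
    else:
        waiting[key] = value

add = lambda a, b: a + b
arrive("add1", 7, 3, add)    # no partner yet: token waits
arrive("add1", 9, 10, add)   # different tag (e.g. another iteration): waits too
arrive("add1", 7, 4, add)    # tag match: the instruction fires with 3 and 4
```

Tags let multiple activations of the same instruction (loop iterations, recursive calls) be in flight at once without confusing their operands.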

SLIDE 22

[Diagram: CCA die, CCA system, and 3-D die stack, showing fonton TSVs and the on-chip network, an optical fiber bundle, and an integrated optical transceiver.]
SLIDE 23

Performance

  • Maximum ALUs

– For high utilization

  • Maximum memory bandwidth
  • Lots of inter-cell communication bandwidth
  • Reduced overhead
  • Reduced latency
  • Adaptive routing for contention avoidance
  • Multi-variate storage beyond the bit
  • Multi-variate logic beyond base-2 Boolean
  • Dynamic adaptive execution model

SLIDE 24

Energy

  • Minimize distance between elements

– Permeate structures with ALUs – Short latencies between memory & registers

  • Multiple clock rates to match timings between logic and storage

  • Neighbor cell memory/register access
  • Eliminate large caches and multi-level caches
  • Eliminate speculative execution

SLIDE 25

Questions That Can Be Answered Now

  • 1. Trade-offs of DRAM bits and SRAM bits

– Space, Time, Energy

  • 2. Cell rules

– derived from parallel execution model

  • 3. Cell granularity

– Breakpoint where mitosis is better

  • 4. Design of logic cell (Fonton)

– Very simple compared to full conventional processors

  • 5. Reference implementation

– Emulation, Simulation, FPGA, ASIC, custom

  • 6. 3-D packaging
  • 7. Software environment
  • 8. Application analysis

SLIDE 26

Conclusions

  • New high-density functional structures are required at the end of Moore’s Law and are emerging
  • Reactive runtime systems supported by innovations in hardware architecture mechanisms will exploit extremes of parallelism at high efficiency
  • The Neo-Digital Age advances beyond von Neumann architectures to maximize execution concurrency and react to the uncertainties of asynchrony
