Dataflow Supercomputers. Michael J. Flynn, Maxeler Technologies and Stanford University

Outline ◦ History • Dataflow as a supercomputer technology • openSPL: generalizing the dataflow programming model • Optimizing the hardware for dataflow

The great parallel processor debate of 1967 • Amdahl espouses the sequential machine and posits Amdahl’s law • Daniel Slotnick (father of ILLIAC IV) posits the parallel approach while recognizing the problem of programming

“The parallel approach to computing does require that some original thinking be done about numerical analysis and data management in order to secure efficient use. In an environment which has represented the absence of the need to think as the highest virtue this is a decided disadvantage.” - Daniel Slotnick (1967)
Speedup in parallel processing is achieved by programming effort. - Michael J. Flynn

The (multi-core) Parallel Processor Problem • Efficient distribution of tasks • Inter-node communications (data assembly & dispatch) reduce computational efficiency (speedup/nodes) • Memory bottleneck limitations • The sequential programming model: layers of abstraction hide critical sources of, and limits to, efficient parallel execution
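The efficiency measure named on this slide (speedup per node) and Amdahl's law from the 1967 debate can be written out as below. The symbols p (fraction of the work that can run in parallel) and N (number of nodes) are notation assumed here, not taken from the slides, and this idealized form ignores the communication and memory costs listed above.

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}, \qquad
E(N) = \frac{S(N)}{N}, \qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

As a worked case under these assumptions, p = 0.99 and N = 100 give S = 1 / (0.01 + 0.0099) ≈ 50.3 and E ≈ 0.50, so half of the nodes' capacity is already lost before any communication overhead is counted.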

May’s first law: Software efficiency halves every 18 months, exactly compensating for Moore’s Law. May’s second law: Compiler technology doubles efficiency no faster than once a decade. - David May

Dataflow as a supercomputer technology • Looking for another way. Some experience from Maxeler • Dataflow is an old technology (70’s and 80’s); conceptually it creates an ideal machine to match the program, but the interconnect problem was insurmountable for the day. • Today’s FPGAs have come a long way and enable an emulation of the dataflow machine.

Hardware and Software Alternatives • Hardware: A reconfigurable heterogeneous accelerator array model • Software: A spatial (2D) dataflow programming model rather than a sequential model

Accelerator HW model • Assumes host CPU + FPGA accelerator • Application consists of two parts – Essential (high usage, >99%) part (kernel(s)) – Bulk part (<1% dynamic activity) • Essential part is executed on accelerator; Bulk part on host • So Slotnick’s law of effort now only applies to a small portion of the application

FPGA accelerator hardware model: server with acceleration cards

• Each (essential) program has a data flow graph (DFG) • The ideal HW to execute the DFG is a data flow machine that exactly matches the DFG • A compiler / translator transforms the DF machine so that it can be emulated by the FPGA • FPGA-based accelerators, while slow in cycle time, offer much more flexibility in matching DFGs • Limitation 1: The DFG is limited in (static) size to O(10^4) nodes • Limitation 2: Only the control structure is matched, not the data access patterns

Acceleration with Static, Synchronous, Streaming DFMs • Create a static DFM (unroll loops, etc.); generally the goal is throughput, not latency • Create a fully synchronous DFM synchronized to multiple memory channels. The time through the DFM is always the same • Stream computations across the long DFM array, creating MISD or pipelined parallelism • If silicon area and pin BW allow, create multiple copies of the DFM (as with SIMD or vector computations) • Iterate on the DFM aspect ratio to optimize speedup
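The trade-off between throughput and latency described above can be sketched with a toy timing model. This is an illustration only: the class `StreamTiming`, its method names, and the idea of counting whole "ticks" per pipeline stage are assumptions made for the example, not part of Maxeler's tools.

```java
/** Toy timing model for a static, synchronous, streaming pipeline (hypothetical helper). */
final class StreamTiming {
    private StreamTiming() {}

    /**
     * Ticks needed to push `items` inputs through `copies` identical pipelines,
     * each `depth` stages long, accepting one input per tick per pipeline.
     */
    static long ticks(long items, int depth, int copies) {
        if (items <= 0) {
            return 0;
        }
        if (depth < 1 || copies < 1) {
            throw new IllegalArgumentException("depth and copies must be >= 1");
        }
        long perCopy = (items + copies - 1) / copies; // ceiling division
        // The first result appears after `depth` ticks (fixed latency);
        // every later result follows one tick apart.
        return depth + perCopy - 1;
    }

    /** Average results per tick over the whole run. */
    static double itemsPerTick(long items, int depth, int copies) {
        return (double) items / ticks(items, depth, copies);
    }
}
```

In this model a single item through a 100-stage pipeline takes 100 ticks, while a million items take 1,000,099 ticks, so the fixed latency is amortized almost completely; adding copies divides the streaming term, which is the role the slide assigns to replicated DFMs.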

Acceleration with Static, Synchronous, Streaming DFMs • Create a fully synchronous data flow machine synchronized to multiple memory channels, then stream computations across a long array • The PCIe accelerator card with memory is the DFE (dataflow engine)
[Figure: FPGA-based DFM; data from node memory streams through Computation #1 and Computation #2, intermediate results are buffered, and results go to memory]

Example: x² + 30
SCSVar x = io.input("x", scsInt(32));
SCSVar result = x * x + 30;
io.output("y", result, scsInt(32));
[Figure: dataflow graph; input x feeds both inputs of a multiplier, an adder adds the constant 30, output y]
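For readers used to sequential code, the kernel above can be read as a function applied to every element of the input stream, one element per tick. The sketch below is a plain-Java software model of that reading, not OpenSPL code; the class name `SquarePlus30` is hypothetical, and Java's wrapping 32-bit `int` is assumed as a stand-in for `scsInt(32)`.

```java
/** Software model of the x*x + 30 kernel (hypothetical class, plain Java). */
final class SquarePlus30 {
    private SquarePlus30() {}

    /** Produces one output element per input element, as the hardware would per tick. */
    static int[] run(int[] x) {
        int[] y = new int[x.length];
        for (int tick = 0; tick < x.length; tick++) {
            // In hardware the multiplier and adder are separate units in a pipeline;
            // here they are evaluated one after the other.
            y[tick] = x[tick] * x[tick] + 30;
        }
        return y;
    }
}
```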

Example: Moving Average
Y[n] = (X[n-1] + X[n] + X[n+1]) / 3
SCSVar x = io.input("x", scsFloat(7,17));
SCSVar prev = stream.offset(x, -1);
SCSVar next = stream.offset(x, 1);
SCSVar sum = prev + x + next;
SCSVar result = sum / 3;
io.output("y", result, scsFloat(7,17));
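The stream offsets in this kernel give each output access to its neighbours in the stream. A minimal plain-Java model is sketched below under two stated assumptions: Java `float` stands in for the custom `scsFloat(7,17)` type, and neighbours that fall outside the stream are read as 0, which the slide does not specify. The names `MovingAverage3` and `offset` are hypothetical.

```java
/** Software model of the three-point moving average kernel (hypothetical class). */
final class MovingAverage3 {
    private MovingAverage3() {}

    /**
     * Stand-in for stream.offset(x, k): the value k positions away from index i.
     * Assumption: positions outside the stream read as 0.
     */
    static float offset(float[] x, int i, int k) {
        int j = i + k;
        return (j >= 0 && j < x.length) ? x[j] : 0.0f;
    }

    /** Computes y[i] = (x[i-1] + x[i] + x[i+1]) / 3 for every stream position. */
    static float[] run(float[] x) {
        float[] y = new float[x.length];
        for (int i = 0; i < x.length; i++) {
            float prev = offset(x, i, -1);
            float next = offset(x, i, 1);
            float sum = prev + x[i] + next;
            y[i] = sum / 3;
        }
        return y;
    }
}
```

With the input 3, 6, 9, 12 this model yields 3, 6, 9, 7; the first and last values reflect the assumed zero padding rather than anything stated in the deck.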

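Example: Choices
SCSVar x = io.input("x", scsUInt(24));
SCSVar result = (x > 10) ? x + 1 : x - 1;
io.output("y", result, scsUInt(24));
[Figure: dataflow graph; x feeds an adder (+1), a subtractor (-1) and a comparator (> 10), whose result selects the output y]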
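In a spatial implementation a conditional like this one is usually not a branch: both arms exist as hardware and a multiplexer picks one result per element. The plain-Java sketch below models that reading. The class name `ChoiceKernel` is hypothetical, and masking to 24 bits (so that 0 - 1 wraps around) is an assumption about how `scsUInt(24)` overflows, not something the slide states.

```java
/** Software model of the conditional kernel: both arms computed, one selected (hypothetical class). */
final class ChoiceKernel {
    private static final int MASK_24 = 0xFFFFFF; // assumed wrap-around for a 24-bit unsigned value

    private ChoiceKernel() {}

    /** Expects inputs in the range 0 .. 2^24 - 1. */
    static int[] run(int[] x) {
        int[] y = new int[x.length];
        for (int i = 0; i < x.length; i++) {
            int plus = (x[i] + 1) & MASK_24;   // adder unit, always active
            int minus = (x[i] - 1) & MASK_24;  // subtractor unit, always active
            boolean select = x[i] > 10;        // comparator unit
            y[i] = select ? plus : minus;      // multiplexer
        }
        return y;
    }
}
```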

Data flow graph as generated by compiler • 4866 nodes; about 250x100 • Each node represents a line of Java code with area-time parameters, so that the designer can change the aspect ratio to improve pin BW, area usage and speedup

• 8 dataflow engines (192-384 GB RAM) • High-speed MaxRing • Zero-copy RDMA between CPUs and DFEs over InfiniBand • Dynamic CPU/DFE balancing

Example: Seismic Data Processing • For oil & gas exploration: distribute a grid of sensors over a large area • Send a sonic impulse into the area and record reflections: frequency, amplitude, delay at each sensor • Sea-based surveys use 30,000 sensors to record data (120 dB range), each sampled at more than 2kbps, with a new sonic impulse every 10 seconds • Order of terabytes of data each day

Generates >1 GB every 10 s
[Figure: survey geometry with dimensions labelled 1200 m]

• Up to 240x speedup for 1 MAX2 card compared to a single CPU core • Speedup increases with cube size • 1 billion point modelling domain using a single FPGA card

Achieved Computational Speedup for the entire application (not just the kernel) compared to Intel server • RTM with Chevron: VTI 19x and TTI 25x • Sparse Matrix: 20-40x • Seismic Trace Processing: 24x • Lattice Boltzmann Fluid Flow: 30x • Conjugate Gradient Opt: 26x • Credit 32x and Rates 26x

So for HPC, how can emulation (FPGA) be better than high-performance x86 processor(s)? • The multi-core approach lacks robustness in streaming hardware (spanning area, time, power) • Multi-core lacks a robust parallel software methodology and tools • FPGAs emulate the ideal data flow machine • Success comes from their flexibility in matching the DFG with a synchronous DFM and streaming data through it, and from sheer size (> 1 million cells) • Effort and support tools provide significant application speedup

Generalizing the programming model openSPL • Open spatial programming language, an orderly way to expose parallelism • 2D dataflow is the programmer’s model, JAVA the syntax • Could target hardware implementations, beyond the DFEs – map on to CPUs (e.g. using OpenMP/MPI) – GPUs – Other accelerators

Temporal Computing (1D) • A program is a sequence of instructions • Performance is dominated by: – Memory latency – ALU availability
[Figure: CPU and memory timeline for three instructions, each shown as get instruction, read data, compute, write result, with the actual computation time marked]

Spatial Computing (2D) • Synchronous data movement • Throughput dominated
[Figure: data in, control, buffer and an array of ALUs, data out; timeline showing read data [1..N], computation, and write results [1..N]]

OpenSPL Basics • Control and Data-flows are decoupled – both are fully programmable – can run in parallel for maximum performance • Operations exist in space and by default run in parallel – their number is limited only by the available space • All operations can be customized at various levels – e.g., from algorithm down to the number representation • Data sets (actions) stream through the operations • The data transport and processing can be matched

OpenSPL Models • Memory: – Fast Memory (FMEM): many, small in size, low latency – Large Memory (LMEM): few, large in size, high latency – Scalars: many, tiny, lowest latency, fixed during exec. • Execution: – datasets + scalar settings sent as atomic “actions” – all data flows through the system synchronously in “ticks” • Programming: – API allows construction of a graph computation – meta-programming allows complex construction

Spatial Arithmetic • Operations instantiated as separate arithmetic units • Units along data paths use custom arithmetic and number representation • The above may reduce individual unit sizes – can maximize the number that fit on a given SCS • Data rates of memory and I/O communication may also be maximized due to scaled down data sizes
[Figure: a floating-point format with sign, Exponent (8) and Mantissa (23) compared with a potentially optimal encoding with sign, Exponent (3) and Mantissa (10)]
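The effect of a narrower number representation can be tried out in software. The sketch below is a plain-Java illustration, not OpenSPL: it truncates the 23-bit mantissa of an IEEE 754 `float` to a chosen width and counts the bits a sign/exponent/mantissa format occupies. The class `ReducedMantissa` and its method names are hypothetical; only the mantissa is narrowed, so the reduced exponent range of the 3-bit format in the figure is not modelled.

```java
/** Illustrates reduced-precision floating point by truncating mantissa bits (hypothetical class). */
final class ReducedMantissa {
    private static final int FLOAT_MANTISSA_BITS = 23;

    private ReducedMantissa() {}

    /**
     * Keeps the top `keepBits` mantissa bits of `v` and clears the rest (round toward zero).
     * NaN and infinities are not treated specially.
     */
    static float truncate(float v, int keepBits) {
        if (keepBits < 0 || keepBits > FLOAT_MANTISSA_BITS) {
            throw new IllegalArgumentException("keepBits must be in 0..23");
        }
        int drop = FLOAT_MANTISSA_BITS - keepBits;
        int mask = -1 << drop; // all ones except the low `drop` bits
        return Float.intBitsToFloat(Float.floatToIntBits(v) & mask);
    }

    /** Total width of a format with one sign bit plus the given exponent and mantissa widths. */
    static int storageBits(int exponentBits, int mantissaBits) {
        return 1 + exponentBits + mantissaBits;
    }
}
```

Under these assumptions the two formats in the figure occupy 32 and 14 bits per value, so the same memory or I/O bandwidth carries a little more than twice as many values, at the cost of a relative truncation error below 2^-10.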

Spatial Arithmetic at All Levels • Arithmetic optimizations at the bit level – e.g., minimizing the number of ’1’s in binary numbers, leading to linear savings of both space and power (the zeros are omitted in the implementation) • Higher level arithmetic optimizations – e.g., in matrix algebra, the location of all non-zero elements in sparse matrix computations is important • Spatial encoding of data structures can reduce transfers between memory and computational units (boost performance and improve efficiency) – In temporal computing encoding and decoding would take time and eventually can cancel out all of the advantages – In spatial computing, encoding and decoding just consume a bit more of additional space
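One setting where the number of '1' bits translates directly into hardware is multiplication by a fixed constant, which can be built as one shifted copy of the input per '1' bit, all summed together. The slide does not say which operation it has in mind, so treating it as constant multiplication is an assumption of this sketch, and the class `ShiftAddMultiplier` is hypothetical.

```java
/** Shift-and-add multiplication by a non-negative constant (hypothetical class). */
final class ShiftAddMultiplier {
    private ShiftAddMultiplier() {}

    /** Multiplies x by `constant` using one shifted term per '1' bit of the constant. */
    static long multiply(long x, int constant) {
        if (constant < 0) {
            throw new IllegalArgumentException("constant must be non-negative");
        }
        long acc = 0;
        for (int bit = 0; bit < 31; bit++) {
            if (((constant >>> bit) & 1) != 0) {
                acc += x << bit; // a zero bit contributes no term and needs no hardware
            }
        }
        return acc;
    }

    /** Number of shifted terms, and so a rough measure of adder cost, for this constant. */
    static int partialProducts(int constant) {
        return Integer.bitCount(constant);
    }
}
```

In this model multiplying by 255 needs eight terms while multiplying by 256 needs one, which is the kind of linear saving in space the slide attributes to omitting the zeros.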
