

SLIDE 1

Dataflow Supercomputers

Michael J. Flynn, Maxeler Technologies and Stanford University

SLIDE 2

Outline

  • History
  • Dataflow as a supercomputer technology
  • openSPL: generalizing the dataflow programming model

  • Optimizing the hardware for dataflow
SLIDE 3
The great parallel processor debate of 1967

  • Amdahl espouses the sequential machine and posits Amdahl’s law
  • Daniel Slotnick (father of ILLIAC IV) posits the parallel approach while recognizing the problem of programming

SLIDE 4


“The parallel approach to computing does require that some original thinking be done about numerical analysis and data management in order to secure efficient use. In an environment which has represented the absence of the need to think as the highest virtue this is a decided disadvantage.”

  • Daniel Slotnick (1967)

…Speedup in parallel processing is achieved by programming effort…

SLIDE 5

The (multi-core) Parallel Processor Problem

  • Efficient distribution of tasks
  • Inter-node communications (data assembly & dispatch) reduce computational efficiency: speedup/nodes
  • Memory bottleneck limitations
  • The sequential programming model: layers of abstraction hide critical sources of, and limits to, efficient parallel execution

SLIDE 6

May’s first law: Software efficiency halves every 18 months, exactly compensating for Moore’s Law.

May’s second law: Compiler technology doubles efficiency no faster than once a decade.

David May

SLIDE 7
Dataflow as a supercomputer technology

  • Looking for another way: some experience from Maxeler
  • Dataflow is an old technology (1970s and 80s); conceptually it creates an ideal machine to match the program, but the interconnect problem was insurmountable for the day.
  • Today’s FPGAs have come a long way and enable an emulation of the dataflow machine.

SLIDE 8

Hardware and Software Alternatives

  • Hardware: a reconfigurable heterogeneous accelerator array model
  • Software: a spatial (2D) dataflow programming model rather than a sequential model

SLIDE 9

Accelerator HW model

  • Assumes host CPU + FPGA accelerator
  • Application consists of two parts:
    – Essential (high usage, >99%) part (kernel(s))
    – Bulk part (<1% dynamic activity)
  • Essential part is executed on the accelerator; bulk part on the host
  • So Slotnick’s law of effort now only applies to a small portion of the application

SLIDE 10


FPGA accelerator hardware model: server with acceleration cards

SLIDE 11

Each (essential) program has a data flow graph (DFG). The ideal HW to execute the DFG is a data flow machine that exactly matches the DFG. A compiler / translator transforms the DF machine so that it can be emulated by the FPGA. FPGA-based accelerators, while slow in cycle time, offer much more flexibility in matching DFGs.

Limitation 1: the DFG is limited in (static) size to O(10⁴) nodes.

Limitation 2: only the control structure is matched, not the data access patterns.

SLIDE 12


Acceleration with Static, Synchronous, Streaming DFMs

  • Create a static DFM (unroll loops, etc.); generally the goal is throughput, not latency (a minimal sketch follows this list).
  • Create a fully synchronous DFM synchronized to multiple memory channels. The time through the DFM is always the same.
  • Stream computations across the long DFM array, creating MISD or pipelined parallelism.
  • If silicon area and pin BW allow, create multiple copies of the DFM (as with SIMD or vector computations).
  • Iterate on the DFM aspect ratio to optimize speedup.
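A minimal sketch of the first step, not from the original deck, written in the SCS style of the later slides: a per-element loop body such as y[i] = 2*x[i] + 1 becomes a static streaming pipeline; the loop index disappears and successive elements simply flow through the same spatial operators. The stream names and the scsFloat format are assumptions for illustration.

SCSVar x = io.input("x", scsFloat(7, 17));   // streamed input; one element enters per tick
SCSVar y = x * 2 + 1;                        // the former loop body, now a fixed spatial pipeline
io.output("y", y, scsFloat(7, 17));          // one result leaves per tick, fully synchronous

Unrolling a loop with more work per iteration simply makes this pipeline longer (or wider, if copies are replicated), which is exactly the aspect-ratio trade-off mentioned in the last bullet.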

SLIDE 13


Acceleration with Static, Synchronous, Streaming DFMs

Create a fully synchronous data flow machine synchronized to multiple memory channels, then stream computations across a long array

[Figure: FPGA-based DFM: data from node memory streams through Computation #1, intermediate results are buffered, then Computation #2, and results return to memory. The PCIe accelerator card with memory is the DFE (Dataflow Engine).]

SLIDE 14

Example: x² + 30

[Figure: dataflow graph: x multiplied by x, 30 added, result output as y]

SCSVar x = io.input("x", scsInt(32));
SCSVar result = x * x + 30;
io.output("y", result, scsInt(32));
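As a quick worked check, not on the original slide: streaming the values 1, 2, 3 into x produces 31, 34 and 39 on y, one result per tick.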

SLIDE 15

Example: Moving Average

SCSVar x = io.input("x", scsFloat(7, 17));
SCSVar prev = stream.offset(x, -1);
SCSVar next = stream.offset(x, 1);
SCSVar sum = prev + x + next;
SCSVar result = sum / 3;
io.output("y", result, scsFloat(7, 17));

Y[n] = (X[n-1] + X[n] + X[n+1]) / 3
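A quick worked check, not on the original slide: for the stream values 3, 6, 9 the middle output is (3 + 6 + 9) / 3 = 6; the first and last outputs depend on how the stream.offset boundary cases are handled.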

SLIDE 16

Example: Choices

[Figure: dataflow graph: x compared with 10 selects between x + 1 and x - 1; result output as y]

SCSVar x = io.input("x", scsUInt(24));
SCSVar result = (x > 10) ? x + 1 : x - 1;
io.output("y", result, scsUInt(24));
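A quick worked check, not on the original slide: an input of 12 takes the x + 1 branch and yields 13, while an input of 7 takes the x - 1 branch and yields 6. Spatially, both branches are instantiated and a multiplexer selects between them each tick.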

SLIDE 17

Data flow graph as generated by the compiler: 4866 nodes, roughly 250 × 100

Each node represents a line of Java code with area-time parameters, so that the designer can change the aspect ratio to improve pin BW, area usage and speedup.

SLIDE 18


SLIDE 19

  • 8 dataflow engines (192-384 GB RAM)
  • High-speed MaxRing
  • Zero-copy RDMA between CPUs and DFEs over InfiniBand
  • Dynamic CPU/DFE balancing

SLIDE 20

Example: Seismic Data Processing

For Oil & Gas exploration: distribute a grid of sensors over a large area. Sonic impulse the area and record the reflections: frequency, amplitude and delay at each sensor. Sea-based surveys use 30,000 sensors to record data (120 dB range), each sampled at more than 2 kbps, with a new sonic impulse every 10 seconds. Order of terabytes of data each day.
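A rough back-of-the-envelope check, assuming the quoted "2 kbps" means about 2 kilobits per second per sensor (an interpretation, not stated on the slide):

30,000 sensors × 2 kb/s ≈ 60 Mb/s ≈ 7.5 MB/s ≈ 0.65 TB per day of raw samples

so the terabytes-per-day figure follows directly from the sensor count and sampling rate, before re-shots and higher-resolution sampling push it further.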

SLIDE 21

[Figure: sensor survey grid, marked off in 1200 m intervals]

Generates >1 GB every 10 s

SLIDE 22

  • Up to 240x speedup for one MAX2 card compared to a single CPU core
  • Speedup increases with cube size
  • 1 billion point modelling domain using a single FPGA card

SLIDE 23

Achieved Computational Speedup for the entire application (not just the kernel) compared to Intel server

  • RTM with Chevron: VTI 19x and TTI 25x
  • Sparse Matrix: 20-40x
  • Seismic Trace Processing: 24x
  • Lattice Boltzmann Fluid Flow: 30x
  • Conjugate Gradient Opt: 26x
  • Credit: 32x and Rates: 26x

SLIDE 24


So for HPC, how can emulation (FPGA) be better than high performance x86 processor(s)?

  • The multi-core approach lacks robustness in streaming hardware (spanning area, time, power)
  • Multi-core lacks a robust parallel software methodology and tools
  • FPGAs emulate the ideal data flow machine
  • Success comes from their flexibility in matching the DFG with a synchronous DFM and streaming data through, plus their sheer size (>1 million cells)
  • Effort and support tools provide significant application speedup

SLIDE 25
Generalizing the programming model: openSPL

  • Open spatial programming language, an orderly way to expose parallelism
  • 2D dataflow is the programmer’s model, Java the syntax
  • Could target hardware implementations beyond the DFEs:
    – map onto CPUs (e.g. using OpenMP/MPI)
    – GPUs
    – other accelerators
SLIDE 26
Temporal Computing (1D)

  • A program is a sequence of instructions (a small code contrast with the spatial model follows below)
  • Performance is dominated by:
    – Memory latency
    – ALU availability

[Figure: CPU timeline: Get Inst. 1, 2, 3, each followed by read data, compute, write result; the actual computation time is a small fraction of the total]
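For contrast with the spatial kernels shown earlier, a minimal sketch, not from the deck, of the x² + 30 example in the temporal (1D) model: one instruction stream, one element read and one result written per iteration, so memory traffic dominates the elapsed time.

// Temporal (1D) version: the CPU walks a single instruction stream and,
// per iteration, reads one datum and writes one result.
static int[] computeTemporal(int[] input) {
    int[] output = new int[input.length];
    for (int i = 0; i < input.length; i++) {
        int x = input[i];          // "Read data i" (memory latency)
        output[i] = x * x + 30;    // actual computation is a small fraction of the loop time
    }
    return output;
}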

SLIDE 27

Spatial Computing (2D)

[Figure: 2D fabric of ALUs, buffers and control: data streams in one side and out the other with synchronous data movement; over time, read data [1..N], computation, write results [1..N]]

Throughput dominated

SLIDE 28

OpenSPL Basics

  • Control and data flows are decoupled
    – both are fully programmable
    – they can run in parallel for maximum performance
  • Operations exist in space and by default run in parallel
    – their number is limited only by the available space
  • All operations can be customized at various levels
    – e.g., from the algorithm down to the number representation
  • Data sets (actions) stream through the operations
  • The data transport and processing can be matched (see the sketch after this list)
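A minimal sketch, not from the deck, in the SCS style of the earlier examples, of operations existing in space and running in parallel: two independent results are produced from the same input stream, and because both pipelines are instantiated side by side they execute concurrently on every tick. Stream names and formats are illustrative assumptions.

SCSVar x = io.input("x", scsInt(32));
SCSVar squared = x * x + 30;                  // pipeline 1
SCSVar stepped = (x > 10) ? x + 1 : x - 1;    // pipeline 2, laid out alongside pipeline 1
io.output("y1", squared, scsInt(32));         // both outputs are produced in parallel, in space
io.output("y2", stepped, scsInt(32));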
SLIDE 29

OpenSPL Models

  • Memory:
    – Fast Memory (FMEM): many, small in size, low latency
    – Large Memory (LMEM): few, large in size, high latency
    – Scalars: many, tiny, lowest latency, fixed during execution
  • Execution:
    – datasets + scalar settings sent as atomic “actions”
    – all data flows through the system synchronously in “ticks”
  • Programming:
    – the API allows construction of a graph computation
    – meta-programming allows complex construction

SLIDE 30

Spatial Arithmetic

  • Operations instantiated as separate arithmetic units
  • Units along data paths use custom arithmetic and number representation (a sketch follows the figure below)
  • The above may reduce individual unit sizes
    – can maximize the number that fit on a given SCS
  • Data rates of memory and I/O communication may also be maximized due to scaled-down data sizes

[Figure: bit layouts: a standard single-precision float (sign, 8-bit exponent, 23-bit mantissa) versus a custom format (sign, 3-bit exponent, 10-bit mantissa), a potentially optimal encoding]
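As a small illustration, not from the deck, of custom number representation in the SCS style used earlier: the moving-average arithmetic could be instantiated with the narrow format of the figure above, assuming the scsFloat arguments denote exponent and mantissa widths. Each unit and every data path shrinks accordingly; whether the reduced precision suffices is an application-level judgement.

SCSVar x = io.input("x", scsFloat(3, 10));              // custom narrow float: 3-bit exponent, 10-bit mantissa (assumed argument order)
SCSVar sum = stream.offset(x, -1) + x + stream.offset(x, 1);
io.output("y", sum / 3, scsFloat(3, 10));               // smaller units and narrower paths, so more of them fit on a given SCS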

SLIDE 31

Spatial Arithmetic at All Levels

  • Arithmetic optimizations at the bit level
    – e.g., minimizing the number of ’1’s in binary numbers, leading to linear savings of both space and power (the zeros are omitted in the implementation)
  • Higher-level arithmetic optimizations
    – e.g., in matrix algebra, the location of all non-zero elements in sparse matrix computations is important
  • Spatial encoding of data structures can reduce transfers between memory and computational units (boosting performance and improving efficiency)
    – In temporal computing, encoding and decoding would take time and can eventually cancel out all of the advantages
    – In spatial computing, encoding and decoding just consume a bit of additional space
SLIDE 32
Benchmarking Spatial Computers

  • Spatial computing systems generate one result during every tick
  • SC system efficiency is strongly determined by how efficiently data can be fed from external sources
  • Fair comparison metrics are needed, among others:
    – computations per cubic foot of datacenter space
    – computations per Watt
    – operational costs per computation

SLIDE 33
Hardware: FPGA pros & cons

  • The FPGA, while quite suitable for emulation, is not an ideal hardware substrate
    – Too fine grained, wasted area
    – Expensive
    – Place and route times are excessive and will get longer
    – Slow cycle time
  • FPGA advantages
    – Commodity part with the best process technology
    – Flexible interconnect
    – Transistor density scaling
    – In principle, possible to quickly reduce to an ASIC

SLIDE 34

Silicon device density scaling (ITRS 10 year projections)

Net: there’s either 20 billion transistors or 50 gigabytes of flash on a 1 cm² die
SLIDE 35

Hardware alternatives: DFArray

  • Clearly an array structure is attractive
  • The LUT is inefficient in the context of dataflow
  • Dataflow operations are relatively few and well defined (arithmetic, logic, mux, FIFO and store)
  • Flexibility in data sizing is important, but not necessarily down to the bit
  • Existing DSPs (really MACs) in FPGAs are prized and well used (2000 or so per chip)

SLIDE 36

Hardware alternatives: DFArray

  • The flexible interconnect required by the dataflow will still limit the cycle time (perhaps 5x slower than a CPU). This also reduces power requirements.
  • The great advantage is the increased operational density (perhaps 10x over existing FPGAs), enabling much larger dataflow machines and greater parallelism.
  • Avoids the very long (many hours) “place & route” time required by FPGAs

SLIDE 37

Hardware alternatives: DFArray

  • The big disadvantage: it’s not a commodity part (even an expensive one).
  • There are a lot of research issues in determining the best dataflow hardware fabric.

SLIDE 38


Conclusions

  • Parallel processing demands rethinking algorithms, programming approach and environment, and hardware.
  • The success of FPGA acceleration points to the weakness of evolutionary approaches to parallel processing, both hardware (multi-core) and software (C++, etc.), at least for some applications.
  • The automation of acceleration is still early on; still required: tools, a methodology for writing applications, an analysis methodology and (maybe) a new hardware basis.
  • For FPGA success, software is key: VHDL, inefficient place and route, and SW are big limitations.

SLIDE 39

Conclusions 2

  • In parallel processing: to find success, start with the problem, not the solution.
  • There’s a lot of research ahead to effectively create parallel translation technology.

SLIDE 40

Thank you