  1. Acceleration in the Wild, with Data Flow Computing James Spooner, VP of Acceleration QCon, Finance Track, 08 March 2012

  2. Acceleration in the Wild with Data Flow
  • Deliberate, focused approach to improving application speed
    – Involves adding Data Flow Engines (DFEs)
    – Makes some of the program faster
    – Will be programmed intentionally and be architecture specific
    – Will exploit as much available parallelism as possible
    – May require transformations to expose parallelism
    – May have multiple implementations
  Maxeler is an acceleration specialist, delivering end-to-end performance for a range of clients in the banking and oil/gas exploration industries.

  3. Making efficient use of Silicon

  4. Computing History… – J. P. Eckert, Jr. (Co-Inventor of ENIAC). Credit: Prof. Paul H.J. Kelly

  5. Computing History… “The parallel approach to computing does require that some original thinking be done about numerical analysis and data management in order to secure efficient use. In an environment which has represented the absence of the need to think as the highest virtue this is a decided disadvantage.” -Daniel Slotnick (Chief Architect of ILLIAC IV), 1967 Credit: Prof. Michael J. Flynn

  6. So what happened?
  • Eckert (and Amdahl) were right, Slotnick was wrong, until…
  • Serial computing hit the wall(s) last decade:
    – The memory wall: the increasing gap between processor and memory speeds. This effect pushes cache sizes larger in order to mask the latency of memory, which helps only to the extent that memory bandwidth is not the bottleneck in performance.
    – The ILP wall: the increasing difficulty of finding enough parallelism in a single instruction stream to keep a high-performance single-core processor busy.
    – The power wall: the trend of consuming exponentially increasing power with each factorial increase of operating frequency. This increase can be mitigated by "shrinking" the processor by using smaller traces for the same logic. The power wall poses manufacturing, system design and deployment problems that have not been justified in the face of the diminished gains in performance due to the memory wall and ILP wall.
  P_avg = C_load · V_DD² · f (Source: Wikipedia)

  7. Using silicon efficiently – parallelism
  • Coarse-grained
    – Examples: multi-node, multi-chip, multi-core; process / thread level parallelism
    – Costs: developing a distributed system; locks, mutexes, queues, etc.
  • Fine-grained
    – Examples: instruction level parallelism (ILP) – out-of-order execution, superscalar, instruction pipelining, speculative execution; data level parallelism – SIMD / SSE
    – Costs: lots of silicon; compiler can do some work upfront
  • Ultra-fine-grained
    – Examples: data flow architectures – massively parallel, lock free, hazard free, streaming datapaths
    – Costs: resolve once

  8. How is modern silicon used? Intel 6-Core X5680 “Westmere”

  9. How is modern silicon used? Intel 6-Core X5680 “Westmere” – die area split between computation and support logic for fine-grained parallelism

  10. What is Dataflow Computing? Computing with control flow processors vs. computing with dataflow engines (DFEs)

  11. MPC-X1000: a 1U dataflow cloud providing dynamically scalable compute capability over Infiniband
  • 8 vectis dataflow engines (DFEs)
  • 192GB of DFE RAM
  • Dynamic allocation of DFEs to conventional CPU servers
    – Zero-copy RDMA between CPUs and DFEs over Infiniband
  • Equivalent performance to 40-60 x86 servers

  12. Dataflow Programming

  13. Application Components: the host application runs on the CPU with the SLiC interface and MaxelerOS; kernels and a manager run on the dataflow engine alongside its local memory; the two sides communicate over PCI Express.

  14. Programming with MaxCompiler: CPU code in C / C++ / Fortran calls the DFE through SLiC; kernels and managers are written in MaxJ.

  15. MaxCompiler Development Process – starting point: everything runs on the CPU out of main memory.
  CPU Code (.c):
      int *x, *y;
      for (int i = 0; i < DATA_SIZE; i++)
          y[i] = x[i] * x[i] + 30;

  16. MaxCompiler Development Process – the loop moves onto the DFE; x streams in and y streams back over PCI Express.
  CPU Code (.c):
      #include "MaxSLiCInterface.h"
      #include "Calc.max"
      int *x, *y;
      Calc(x, y, DATA_SIZE);
  Manager (.java):
      Manager m = new Manager("Calc");
      Kernel k = new MyKernel();
      m.setKernel(k);
      m.setIO(link("x", PCIE), link("y", PCIE));
      m.addMode(modeDefault());
      m.build();
  MyKernel (.java):
      HWVar x = io.input("x", hwInt(32));
      HWVar result = x * x + 30;
      io.output("y", result, hwInt(32));

  17. MaxCompiler Development Process – y is now written to the DFE's on-board DRAM instead of streaming back.
  Host Code (.c):
      #include "MaxSLiCInterface.h"
      #include "Calc.max"
      int *x, *y;
      device = max_open_device(maxfile, "/dev/maxeler0");
      Calc(x, DATA_SIZE);
  Manager (.java):
      Manager m = new Manager();
      Kernel k = new MyKernel();
      m.setKernel(k);
      m.setIO(link("x", PCIE), link("y", DRAM_LINEAR1D));
      m.addMode(modeDefault());
      m.build();
  MyKernel (.java):
      HWVar x = io.input("x", hwInt(32));
      HWVar result = x * x + 30;
      io.output("y", result, hwInt(32));

  18. The Full Kernel
      public class MyKernel extends Kernel {
          public MyKernel(KernelParameters parameters) {
              super(parameters);
              HWVar x = io.input("x", hwInt(32));
              HWVar result = x * x + 30;
              io.output("y", result, hwInt(32));
          }
      }

  19.-28. Kernel Streaming: In Hardware – ten frames animating the input stream 0, 1, 2, 3, 4, 5 through the x * x + 30 pipeline, one value per cycle: each input enters the multiplier, its square passes to the adder, and once the pipeline has filled the results 30, 31, 34, 39, 46, 55 emerge on y, one per cycle.

  29. Data flow graph as generated by MaxCompiler: 4866 nodes; about 250x100.

  30. How we approach Acceleration

  31. What always makes Acceleration hard?
  • Messy code
  • Complicated build dependences
  • Confused control-flow
  • Impenetrable data access
  • Pointer-intensive data structures
  • Premature optimization
      for (i = 0; i < N; ++i) {
          points[i]->incx();
      }
  (Figure: the loop chases pointers through point structures of non-uniform type – x/y/z and r/θ variants – scattered across memory.)

  32. Conflicting Goals
  • Some well-motivated software structures have real value, but make acceleration harder
  • Examples:
    – Virtual method calls inside a loop
    – Collections with non-uniform type
    – Substructure sharing
      for (i = 0; i < N; ++i) {
          points[i]->incx();
      }

  33. What makes Acceleration easier?
  • Self-evident data dependences
  • Computing on large collections of uniform data
  • Appropriate representation hiding
  • Getting the abstraction right
  (Figure: a dense, uniform array of x/y/z points.)

  34. Maximum Performance Computing
  • Identify parallelism and take advantage of it
    – Fully understand data dependencies
  • Minimize memory bandwidth
    – Data reuse and representation
  • Regularize the computation and data
    – Minimize control flow complexity
  • Find optimal balance for underlying architecture
    – Memory hierarchy bandwidth(s), size(s) and latency(s)
    – Communication bandwidth(s) and latency(s)
    – Math performance
    – Branch cost (control divergence)
    – Axes of parallelism

  35. Maxeler Acceleration Process: Analysis → Transformation → Partitioning → Implementation → Result
  • Run the code with profiling tools
  • Understand data and loop structures and data access patterns
  • Investigate transformation options for these structures and access patterns (sets theoretical performance bounds)
  • Decide which parts of the code need acceleration
  • Implement and validate (achieve performance)

  36. Application Analysis

  37. Partitioning Options – code partitioning, data access plans and transformations plotted against development time and runtime, yielding a Pareto-optimal frontier of options. Try to minimise runtime and development time, while maximising flexibility and precision.

  38. Credit Derivatives Valuation & Risk
  • Compute value of complex financial derivatives (CDOs)
  • Typically run overnight, but beneficial to compute in real-time
  • Many independent jobs
  • Speedup: 220-270x
  • Power consumption per node drops from 250W to 235W

  39. Discovering the Dataflow of an Application
