Fletcher: A Framework to Efficiently Integrate FPGA Accelerators

Fletcher: A Framework to Efficiently Integrate FPGA Accelerators with Apache Arrow
@ FPL2019, Barcelona, September 11, 2019
Johan Peltenburg (1), Jeroen van Straten (1), Lars Wijtemans (1), Lars T.J. van Leeuwen (1), Zaid Al-Ars (1), H. Peter Hofstee (2)
1. Delft University of Technology, Netherlands
2. IBM, Austin, Texas, USA
Thanks to our supporters: Fitoptivis European ECSEL project no. ECSEL2017-1-737451, Xilinx

Outline
● The challenge of FPGA integration with Big Data Analytics
● Overcoming serialization bottlenecks with Apache Arrow
● Fletcher
● Mini-tutorial (if time)
● Results
● Conclusion & future work

An FPGA Accelerator Dev. Perspective
● Byte or even bit-level control of data structure (01001011)
● High-performance datapath
● Write a host-side C lib
● In memory: structs, unions, bitspecs, padding
● Sculpt data structure to feed datapath efficiently

A Big Data Analytics Dev. Perspective
Source: https://spark.apache.org/
● DataFrame: like a database table or Excel spreadsheet, but...
● Huge. Typically on the order of GiBs to TiBs.
● Distributed over multiple worker nodes (also in storage).
● Operations on it build Directed Acyclic Graphs (DAGs) and are lazily evaluated.
● DAGs are optimized, planned and scheduled to execute in parallel over a cluster.
● Resilient to node failure, provides automatic recovery and continuation.
● What is all that computer scientist magic that makes this possible?

Big Data Analytics SW Ecosystem
● Software languages & run-times
● Frameworks for storage, scalability, resilience, analysis, etc.

A string
● C++: String size; Pointer to char buffer; Internal char array (optionally used); Optionally allocated char array
● Java: JVM object header; UTF-16 Array reference; Hash cache; JVM array object header; UTF16 array
● Python: Python variable length object header; Hash; State; Variable length character array
● FPGA: Length; Characters

Serialization
Collection X in memory of Process A → Serialize → Serialized collection in shared memory or IPC message → Deserialize → Collection X in memory of Process B
● Iterate over all objects in collection (data is big)
● Traverse all object graphs (memory latency)
● Copy fields to some intermediate format both A and B understand (bandwidth lost)
● Reconstruct objects in B ((de)allocation overhead)
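
The four steps above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the `Point` class, the `<qd` record layout, and the function names are hypothetical and do not come from the talk.

```python
import struct
from dataclasses import dataclass

# Hypothetical record type standing in for one object in "Collection X".
@dataclass
class Point:
    x: int
    y: float

# Assumed intermediate format: little-endian int64 followed by float64.
RECORD = struct.Struct("<qd")

def serialize(points):
    """Process A: visit every object and copy its fields into one buffer."""
    return b"".join(RECORD.pack(p.x, p.y) for p in points)

def deserialize(buf):
    """Process B: allocate a fresh object for every record in the buffer."""
    return [Point(*fields) for fields in RECORD.iter_unpack(buf)]

collection = [Point(1, 3.14), Point(5, 1.41), Point(3, 1.61)]
wire = serialize(collection)
restored = deserialize(wire)
```

Every element is visited on both sides and copied through the intermediate buffer, which is where the latency, bandwidth, and (de)allocation costs listed on the slide come from.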

I/O bandwidth catching up
[1] F. Kruger, “CPU Bandwidth The Worrisome 2020 Trend,” Mar. 2016. [Online]. Available: https://blog.westerndial.com/cpu-bandwidth-the-worrisome-2020-trend/

Relative impact on accelerators
[Figure: execution-time profiles (potentially, but not necessarily, exaggerated). Legend: CPU compute time, (De)serialize / copy time, Accelerator compute time]
● Original process on CPU
● Process on GPGPU/FPGA with serialization
● Desired profile
[Figure: Serialization throughput on collection of Java (OpenJDK) objects on POWER8 [2]]
[2] J. Peltenburg, A. Hesam, and Z. Al-Ars, “Pushing Big Data into Accelerators: Can the JVM Saturate Our Hardware?” in International Conference on High Performance Computing. Springer, 2017, pp. 220–236.

Overcoming serialization bottlenecks
● In-memory formats determined by:
  – Programming languages
    ● Run-time system design choices
    ● Standard libraries
  – Algorithms
  – Programmers
● Increased heterogeneity → more IPC → more serialization overhead
● What if data is…
  – In a standardized format?
    ● That every language can use (through libraries or otherwise).
  – As contiguous as possible?
    ● We can move it using large bursts, no pointer chasing, less misalignment overhead

Apache Arrow [3]
● Standardized representation in-memory
  – Common Data Layer
● Columnar format
  – Hardware friendly while iterating over entries in single column (SIMD, caches, etc…)
  – Better for many algorithms, worse for some others.
● Libraries and APIs for 10+ languages to build and access data sets (zero-copy)
[3] The Apache Software Foundation, “Apache Arrow,” 2018. [Online]. Available: https://arrow.apache.org

Arrow in-memory dataset, trivial example
Schema MySchema {
  A: Float (nullable)
  B: List<Char>
  C: Struct{
    E: Int16
    F: Double
  }
}
Dataset:
  Index  A      B     C
  0      1.33f  ola   {1, 3.14}
  1      7.01f  fpl   {5, 1.41}
  2      ∅      @upc  {3, 1.61}
Buffers (entries listed by index, starting at 0):
  A values:    1.33f, 7.01f, X
  A validity:  1, 1, 0
  B offsets:   0, 3, 6, 10
  B values:    o, l, a, f, p, l, @, u, p, c
  C.E values:  1, 5, 3
  C.F values:  3.14, 1.41, 1.61
Arrow terminology:
● Schema: Description of data types in a RecordBatch
● RecordBatch: Tabular structure containing Arrow arrays
● Arrays: A RecordBatch “column”. Combination of Arrow buffers, can be nested
● Buffers: Contiguous C-like arrays
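
The buffers of columns A and B in this example can be reproduced with plain Python. This is only a sketch of the layout idea, not the Arrow library: the function names are hypothetical, and the validity flags are kept as a list for readability, whereas Arrow packs them into a bitmap.

```python
def build_string_column(strings):
    """Return (offsets, values) for a variable-length column such as B."""
    offsets = [0]
    values = bytearray()
    for s in strings:
        values += s.encode("utf-8")
        offsets.append(len(values))  # end of this entry = start of the next
    return offsets, bytes(values)

def build_nullable_column(items, placeholder=0.0):
    """Return (validity, values) for a nullable column such as A."""
    validity = [0 if item is None else 1 for item in items]
    values = [placeholder if item is None else item for item in items]
    return validity, values

def get_string(offsets, values, index):
    """Read entry `index` using two neighbouring offsets."""
    return values[offsets[index]:offsets[index + 1]].decode("utf-8")

b_offsets, b_values = build_string_column(["ola", "fpl", "@upc"])
a_validity, a_values = build_nullable_column([1.33, 7.01, None])

print(b_offsets)   # [0, 3, 6, 10]
print(b_values)    # b'olafpl@upc'
print(a_validity)  # [1, 1, 0]
```

The null entry still occupies a slot in the values buffer (shown as X on the slide); only the validity flag marks it as absent.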

Integrating FPGA and Arrow
● Arrow ‘turns out’ to be hardware-friendly
  – In-memory format clearly specified, to every bit
  – Highly contiguous & columnar format
    ● Iterate over a column in streaming fashion
    ● Useful for: maps, reductions, filters, etc...
  – Parallel accessible format
    ● E.g. uses offsets, not lengths, for variable length data – we can start anywhere
    ● Useful for: maps, reductions, filters, etc…
● Backed by a large and ever growing community
● Integration in many BDA frameworks, even without official format stability
● Can we generate easy-to-use, high throughput hardware interfaces automatically?
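
Why offsets allow a reader to "start anywhere" can be shown with the offsets and values of the three strings from the example dataset. The sketch below is plain Python with hypothetical helper names; the worker split is an illustration of the idea, not how Fletcher schedules accesses.

```python
from itertools import accumulate

offsets = [0, 3, 6, 10]   # offsets buffer of the strings "ola", "fpl", "@upc"
values = b"olafpl@upc"    # values buffer of the same strings

def entry_range(offsets, index):
    """Byte range of one entry: two reads, independent of `index`."""
    return offsets[index], offsets[index + 1]

def lengths_to_offsets(lengths):
    """With lengths only, a running sum over all earlier entries is needed."""
    return [0, *accumulate(lengths)]

def split_rows(num_rows, num_workers):
    """Divide row indices into contiguous chunks, one per worker."""
    step = -(-num_rows // num_workers)  # ceiling division
    return [(lo, min(lo + step, num_rows)) for lo in range(0, num_rows, step)]

# Each hypothetical worker finds its bytes without scanning earlier rows.
for first, last in split_rows(num_rows=3, num_workers=2):
    start, end = offsets[first], offsets[last]
    print(first, last, values[start:end])
```

With a lengths buffer instead, a worker starting at row k would first have to sum the k preceding lengths, which serializes the access.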

Main contribution:
● Fully open-source (Apache-2.0)
● Vendor agnostic
● Generates easy-to-use, high-throughput interfaces
● Integrate FPGA accelerators with Apache Arrow

Example: Interface for accelerator parsing strings
● Typical: Memory interface (Byte Address, Bus Word) → Manual implementation of interface → Computational part
● Fletcher: Memory interface → Fletcher-generated interface (Length stream, Char stream, RecordBatch Range) → Computational part
● Easy-to-use: Data is delivered as streams that make sense w.r.t. schema field types.
● High-throughput: Number of values delivered per cycle configurable.

Generated interface overview
● Architecture based on library with streaming primitives
● BufferReader/Writer: Basic unit to read (N) Arrow Buffer elements
● ArrayReader/Writer: Combination of BufferReaders/Writers [4]
  – Dictated by the schema field and format specification
  – Generated through pure HDL; vendor agnostic
● RecordBatchReader/Writer: Combination of ArrayReaders/Writers
● Mantle: Wraps multiple RecordBatchR/W + bus infrastructure
[4] J. Peltenburg, J. van Straten, M. Brobbel, H. P. Hofstee, and Z. Al-Ars, “Supporting Columnar In-memory Formats on FPGA: The Hardware Design of Fletcher for Apache Arrow”, in Applied Reconfigurable Computing, Cham: Springer International Publishing, 2019, pp. 32–47.

Combining BufferReaders into ArrayReaders
● Arrow Schema & format spec dictate how to combine buffers.
● Passed to ArrayReaders through configuration string in HDL.
● Seeking the limits of synthesis tools :-)
● 10k+ random field types simulated.
[Figure: the value, validity, and offset buffers of columns A, B, and C from the example dataset]

High-level architecture generation: Fletchgen
Arrow support: ☑ RecordBatches ☑! Arrays ☑! Buffers
● Need syntactically pleasing interfaces
  – Grouping of ArrayReader/Writer interfaces for RecordBatches
  – Stream names must correspond to schema fields
  – Synthesizable HDL too limited
● Need kernel template generation for kernel implementation in HDL/HLS
● Need simulation
● Need platform integration
● High-level architecture generator: Fletchgen

Fletcher run-time stack
● Reap the benefits of Arrow:
  – Create one accelerator.
  – Leverage in any supported language.
● Fletcher Generated Hardware Interface is platform agnostic
  – requires no IP, tcl scripts, etc…
● Top level with AXI4 interface available.

Mini-tutorial: Fletcher “Hello, World!”
● Trivial example:
  – Sum a column of integers
● Get to know the toolchain
● More realistic applications:
  – Complex types
  – More Arrow Arrays
  – More input/output RecordBatches
Also on GitHub: https://github.com/abs-tudelft/fletcher

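Step 1: Create an Arrow Schema
    import pyarrow as pa

    # One non-nullable 64-bit integer column named 'number'.
    number_field = pa.field('number', pa.int64(), nullable=False)
    schema = pa.schema([number_field])
    # Metadata keys used by Fletcher: access mode and RecordBatch name.
    metadata = {b'fletcher_mode': b'read', b'fletcher_name': b'ExampleBatch'}
    # with_metadata replaces the deprecated add_metadata in recent pyarrow.
    schema = schema.with_metadata(metadata)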

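Step 2: Create a RecordBatch (optional, for simulation)
    # Continues from Step 1 (uses pa and schema).
    data = [pa.array([1, -3, 3, -7])]
    # Pass the schema by keyword; the second positional argument is names.
    recordbatch = pa.RecordBatch.from_arrays(data, schema=schema)
    # Write the batch to an Arrow file that is passed to the generator.
    writer = pa.RecordBatchFileWriter('recordbatch.rb', schema)
    writer.write(recordbatch)
    writer.close()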
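
When simulating the design it helps to know what the sum kernel should return for these four values. The sketch below is a software reference in plain Python; the function names are hypothetical, and the 64-bit wrap-around is an assumption about how a hardware accumulator of the column's int64 type might behave, not something stated in the talk.

```python
def wrap_int64(value):
    """Interpret an integer as a 64-bit two's-complement number."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value

def reference_sum(numbers):
    """Accumulate like a 64-bit register would: wrap instead of growing."""
    total = 0
    for n in numbers:
        total = wrap_int64(total + n)
    return total

expected = reference_sum([1, -3, 3, -7])
print(expected)  # -6
```

For the example column the wrap-around never triggers, so the expected result is simply the arithmetic sum.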

Step 3: Generate the design
    $ fletchgen -n Sum -r recordbatch.rb -s recordbatch.srec -l vhdl dot --sim
● -n Sum: Kernel name
● -r recordbatch.rb: RecordBatch input
● -s recordbatch.srec: Memory model file
● -l vhdl dot: Design output languages
● --sim: Generate simulation top-level
