
Fletcher: A Framework to Efficiently Integrate FPGA Accelerators with Apache Arrow

Fletcher: A Framework to Efficiently Integrate FPGA Accelerators with Apache Arrow. Presented at FPL2019, Barcelona, September 11, 2019. Johan Peltenburg¹, Jeroen van Straten¹, Lars Wijtemans¹, Lars T.J. van Leeuwen¹, Zaid Al-Ars¹, H. Peter Hofstee²


  1. Fletcher: A Framework to Efficiently Integrate FPGA Accelerators with Apache Arrow
     FPL2019, Barcelona, September 11, 2019
     Johan Peltenburg¹, Jeroen van Straten¹, Lars Wijtemans¹, Lars T.J. van Leeuwen¹, Zaid Al-Ars¹, H. Peter Hofstee²
     ¹ Delft University of Technology, Netherlands
     ² IBM, Austin, Texas, USA
     Thanks to our supporters: Fitoptivis European ECSEL project no. ECSEL2017-1-737451; Xilinx

  2. Outline
     ● The challenge of FPGA integration with Big Data Analytics
     ● Overcoming serialization bottlenecks with Apache Arrow
     ● Fletcher
     ● Mini-tutorial (if time)
     ● Results
     ● Conclusion & future work

  3. An FPGA Accelerator Dev. Perspective
     ● 01001011: byte or even bit-level control of the data structure
     ● High-performance datapath
     ● Write a host-side C lib
     ● In memory: structs, unions, bitspecs, padding
     ● Sculpt the data structure to feed the datapath efficiently

  4. A Big Data Analytics Dev. Perspective (image source: https://spark.apache.org/)
     ● DataFrame: like a database table or Excel spreadsheet, but...
     ● Huge. Typically on the order of GiBs to TiBs.
     ● Distributed over multiple worker nodes (also in storage).
     ● Operations on it build Directed Acyclic Graphs (DAGs) and are lazily evaluated.
     ● DAGs are optimized, planned and scheduled to execute in parallel over a cluster.
     ● Resilient to node failure; provides automatic recovery and continuation.
     ● What is all that computer scientist magic that makes this possible?

  5. Big Data Analytics SW Ecosystem
     ● Software languages & run-times
     ● Frameworks for storage, scalability, resilience, analysis, etc.

  6. A string
     ● C++: string size; pointer to char buffer; internal char array (optionally used); optionally allocated char array
     ● Java: JVM object header; UTF-16 array reference; hash cache. The referenced array has a JVM array object header and a UTF-16 array.
     ● Python: Python variable-length object header; hash; state; variable-length character array
     ● FPGA: length; characters
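To make the per-object overhead on this slide tangible, here is a minimal sketch that compares the size of a CPython `str` object with the bytes of its actual characters. The helper name `string_overhead` is hypothetical, and the exact numbers depend on the interpreter version and platform, so no expected values are shown.

```python
import sys


def string_overhead(text: str) -> int:
    """Bytes CPython spends on a str object beyond its UTF-8 payload."""
    payload = len(text.encode("utf-8"))
    return sys.getsizeof(text) - payload


# The three strings from the Arrow example later in the deck.
for word in ["ola", "fpl", "@upc"]:
    print(word, "object bytes:", sys.getsizeof(word),
          "overhead bytes:", string_overhead(word))
```

For short strings the header, hash, and state fields outweigh the characters themselves, which is the contrast the slide draws with the FPGA view of "length plus characters".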

  7. Serialization
     Collection X in memory of Process A → Serialize → serialized collection in shared memory or IPC message → Deserialize → Collection X in memory of Process B
     ● Iterate over all objects in the collection (data is big)
     ● Traverse all object graphs (memory latency)
     ● Copy fields to some intermediate format that both A and B understand (bandwidth lost)
     ● Reconstruct objects in B ((de)allocation overhead)
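The four steps above can be sketched in software. This is only an illustration under stated assumptions: Python's `pickle` stands in for whatever serializer processes A and B would really share, and the `collection` of dictionaries is made-up data. The second half shows the contrast with a contiguous buffer, which can be handed over as a view without visiting each element.

```python
import pickle
from array import array

# Hypothetical collection in "Process A": many small heap objects.
collection = [{"number": i, "name": f"row{i}"} for i in range(1000)]

# Serialize: iterate over every object, traverse its fields, and copy
# them into an intermediate byte format that both sides understand.
wire = pickle.dumps(collection)

# Deserialize in "Process B": allocate and reconstruct every object again.
restored = pickle.loads(wire)

# Contiguous alternative: one flat buffer of 64-bit integers.
numbers = array("q", (row["number"] for row in collection))
view = memoryview(numbers)  # refers to the same memory, no per-element copy

print(len(wire), "serialized bytes for", len(restored), "objects")
print(view.nbytes, "contiguous bytes for", len(view), "integers")
```

The view covers only the integer column; the point is that a contiguous layout removes the traverse, copy, and reconstruct steps rather than making them faster.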

  8. I/O bandwidth catching up
     [1] F. Kruger, “CPU Bandwidth The Worrisome 2020 Trend,” Mar. 2016. [Online]. Available: https://blog.westerndial.com/cpu-bandwidth-the-worrisome-2020-trend/

  9. Relative impact on accelerators
     Figure: execution-time profiles (potentially, but not necessarily, exaggerated) for three cases:
     ● Original process on CPU
     ● Process on GPGPU/FPGA with serialization
     ● Desired profile
     Legend: CPU compute time; (de)serialize / copy time; accelerator compute time.
     Figure: serialization throughput on a collection of Java (OpenJDK) objects on POWER8 [2].
     [2] J. Peltenburg, A. Hesam, and Z. Al-Ars, “Pushing Big Data into Accelerators: Can the JVM Saturate Our Hardware?” in International Conference on High Performance Computing. Springer, 2017, pp. 220–236.

  10. Overcoming serialization bottlenecks
      ● In-memory formats are determined by:
        – Programming languages
          ● Run-time system design choices
          ● Standard libraries
        – Algorithms
        – Programmers
      ● Increased heterogeneity → more IPC → more serialization overhead
      ● What if data is…
        – In a standardized format?
          ● That every language can use (through libraries or otherwise).
        – As contiguous as possible?
          ● We can move it using large bursts, with no pointer chasing and less misalignment overhead.

  11. Apache Arrow [3]
      ● Standardized representation in-memory
        – Common Data Layer
      ● Columnar format
        – Hardware friendly while iterating over entries in a single column (SIMD, caches, etc.)
        – Better for many algorithms, worse for some others.
      ● Libraries and APIs for 10+ languages to build and access data sets (zero-copy)
      [3] The Apache Software Foundation, “Apache Arrow,” 2018. [Online]. Available: https://arrow.apache.org

  12. Arrow in-memory dataset, trivial example

      Schema MySchema { A: Float (nullable), B: List<Char>, C: Struct{ E: Int16, F: Double } }

      Index   A       B      C
      0       1.33f   ola    {1, 3.14}
      1       7.01f   fpl    {5, 1.41}
      2       ∅       @upc   {3, 1.61}

      Buffers for A:    values   = [1.33f, 7.01f, X]
                        validity = [1, 1, 0]
      Buffers for B:    offsets  = [0, 3, 6, 10]
                        values   = [o, l, a, f, p, l, @, u, p, c]
      Buffers for C.E:  values   = [1, 5, 3]
      Buffers for C.F:  values   = [3.14, 1.41, 1.61]

      Arrow terminology:
      ● Schema: description of data types in a RecordBatch
      ● RecordBatch: tabular structure containing Arrow arrays
      ● Arrays: a RecordBatch “column”. Combination of Arrow buffers, can be nested
      ● Buffers: contiguous C-like arrays
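The buffers in this example can be reproduced with a small standard-library model. This is a sketch of the layout idea, not the Arrow library itself: the helper names `build_list_of_char` and `build_nullable_floats` are hypothetical, and the validity buffer is modelled as a list of 0/1 flags for readability, whereas Arrow packs it into a bitmap.

```python
from array import array


def build_list_of_char(strings):
    """Build the offsets and values buffers of a List<Char> column."""
    offsets = array("i", [0])      # one more entry than there are strings
    values = bytearray()           # all characters, back to back
    for s in strings:
        values += s.encode("ascii")
        offsets.append(len(values))
    return offsets, values


def build_nullable_floats(items):
    """Build the validity flags and values buffer of a nullable Float column."""
    validity = [0 if x is None else 1 for x in items]
    # A null slot still occupies space in the values buffer; its content is unused.
    values = array("f", (0.0 if x is None else x for x in items))
    return validity, values


offsets, chars = build_list_of_char(["ola", "fpl", "@upc"])
validity, floats = build_nullable_floats([1.33, 7.01, None])

print(list(offsets))          # [0, 3, 6, 10]
print(chars.decode("ascii"))  # olafpl@upc
print(validity)               # [1, 1, 0]
```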

  13. Integrating FPGA and Arrow
      ● Arrow ‘turns out’ to be hardware-friendly
        – In-memory format clearly specified, to every bit
        – Highly contiguous & columnar format
          ● Iterate over a column in streaming fashion
          ● Useful for: maps, reductions, filters, etc.
        – Parallel accessible format
          ● E.g. uses offsets, not lengths, for variable length data, so we can start anywhere
          ● Useful for: maps, reductions, filters, etc.
      ● Backed by a large and ever growing community
      ● Integration in many BDA frameworks, even without official format stability
      ● Can we generate easy-to-use, high throughput hardware interfaces automatically?
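The "offsets, not lengths" point can be illustrated with the string column from the earlier example. In this sketch (function names are hypothetical), finding element `i` through offsets takes two reads, while a length-based layout first has to add up every preceding length, which is what prevents starting at an arbitrary position.

```python
# Offsets-based layout, as used by Arrow for variable-length data.
offsets = [0, 3, 6, 10]
chars = b"olafpl@upc"

# Length-based layout of the same data, for comparison only.
lengths = [3, 3, 4]


def element_from_offsets(i):
    """Two independent reads locate the element; any index can start first."""
    return chars[offsets[i]:offsets[i + 1]]


def element_from_lengths(i):
    """All earlier lengths must be accumulated before the element is found."""
    start = sum(lengths[:i])
    return chars[start:start + lengths[i]]


print(element_from_offsets(2))  # b'@upc'
print(element_from_lengths(2))  # b'@upc'
```

With offsets, several workers can each take a slice of the index range and begin reading immediately, which is the property the slide calls parallel accessible.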

  14. Main contribution: integrate FPGA accelerators with Apache Arrow.
      ● Fully open-source (Apache-2.0)
      ● Vendor agnostic
      ● Generates easy-to-use, high-throughput interfaces

  15. Example: Interface for accelerator parsing strings
      ● Typical: memory interface (byte address, bus word) → manual implementation of interface → computational part
      ● Fletcher: memory interface → Fletcher-generated interface (RecordBatch range in; length stream and char stream out) → computational part
      ● Easy-to-use: data is delivered as streams that make sense w.r.t. schema field types.
      ● High-throughput: number of values delivered per cycle is configurable.
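As a rough software analogy for the two streams named on this slide, the sketch below derives a length stream from the offsets buffer and delivers the characters in groups of a configurable size. The generator names and the `values_per_cycle` parameter are hypothetical, and whether a real hardware transfer may span two strings is not stated on the slide; this model simply chunks the values buffer.

```python
def length_stream(offsets):
    """Yield one string length per element, derived from adjacent offsets."""
    for start, end in zip(offsets, offsets[1:]):
        yield end - start


def char_stream(chars, values_per_cycle):
    """Yield up to `values_per_cycle` characters per modelled cycle."""
    for i in range(0, len(chars), values_per_cycle):
        yield chars[i:i + values_per_cycle]


offsets = [0, 3, 6, 10]
chars = b"olafpl@upc"

print(list(length_stream(offsets)))  # [3, 3, 4]
print(list(char_stream(chars, 4)))   # [b'olaf', b'pl@u', b'pc']
```

Raising `values_per_cycle` shortens the number of modelled cycles needed for the same column, which is the throughput knob the slide describes as configurable.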

  16. Generated interface overview
      ● Architecture based on a library with streaming primitives [4]
      ● BufferReader/Writer: basic unit to read (N) Arrow buffer elements
      ● ArrayReader/Writer: combination of BufferReaders/Writers
        – Dictated by the schema field and format specification
        – Generated through pure HDL; vendor agnostic
      ● RecordBatchReader/Writer: combination of ArrayReaders/Writers
      ● Mantle: wraps multiple RecordBatchReaders/Writers + bus infrastructure
      [4] J. Peltenburg, J. van Straten, M. Brobbel, H. P. Hofstee, and Z. Al-Ars, “Supporting Columnar In-memory Formats on FPGA: The Hardware Design of Fletcher for Apache Arrow,” in Applied Reconfigurable Computing, Cham: Springer International Publishing, 2019, pp. 32–47.

  17. Combining BufferReaders into ArrayReaders
      ● Arrow schema & format spec dictate how to combine buffers.
      ● Passed to ArrayReaders through a configuration string in HDL.
      ● Seeking the limits of synthesis tools :-)
      ● Over 10k random field types simulated.
      Figure: buffer pairs from the slide 12 example: values + validity (A), offsets + values (B), values + values (C.E and C.F).
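How a schema field dictates the set of buffers can be sketched as a small recursive function. This is a simplified model of the Arrow layout rules for the three field kinds used in the deck's example, not Fletcher's HDL configuration string; the tuple encoding of types and the name `buffers_for` are hypothetical.

```python
def buffers_for(field_type, nullable=False):
    """List the buffers a field needs (primitive, list, and struct only)."""
    kind = field_type[0]
    buffers = ["validity"] if nullable else []
    if kind == "prim":
        return buffers + ["values"]
    if kind == "list":
        # A list adds an offsets buffer on top of its child's buffers.
        child = buffers_for(field_type[1])
        return buffers + ["offsets"] + ["item." + b for b in child]
    if kind == "struct":
        # A struct has no data buffer of its own; each child brings its own.
        out = list(buffers)
        for name, child_type in field_type[1]:
            out += [name + "." + b for b in buffers_for(child_type)]
        return out
    raise ValueError(f"unsupported field kind: {kind}")


# The schema from the trivial example: A nullable Float, B List<Char>, C Struct{E, F}.
schema = {
    "A": (("prim", "float"), True),
    "B": (("list", ("prim", "char")), False),
    "C": (("struct", [("E", ("prim", "int16")), ("F", ("prim", "double"))]), False),
}

for name, (field_type, nullable) in schema.items():
    print(name, buffers_for(field_type, nullable))
```

Each buffer name in the result corresponds to one BufferReader in the slide's terminology, so nesting types produces a tree of readers rather than a hand-written interface per type.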

  18. High-level architecture generation: Fletchgen
      Arrow support: ☑ RecordBatches ☑! Arrays ☑! Buffers
      ● Need syntactically pleasing interfaces
        – Grouping of ArrayReader/Writer interfaces for RecordBatches
        – Stream names must correspond to schema fields
        – Synthesizable HDL too limited
      ● Need kernel template generation for kernel implementation in HDL/HLS
      ● Need simulation
      ● Need platform integration
      ● High-level architecture generator: Fletchgen

  19. Fletcher run-time stack
      Reap the benefits of Arrow:
      ● Create one accelerator.
        – Leverage it in any supported language.
      Fletcher generated hardware:
      ● Interface is platform agnostic
        – Requires no IP, tcl scripts, etc.
      ● Top level with AXI4 interface available.

  20. Mini-tutorial: Fletcher “Hello, World!”
      ● Trivial example:
        – Sum a column of integers
      ● Get to know the toolchain
      ● More realistic applications:
        – Complex types
        – More Arrow Arrays
        – More input/output RecordBatches
      Also on GitHub: https://github.com/abs-tudelft/fletcher

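  21. Step 1: Create an Arrow Schema

          import pyarrow as pa

          # A single non-nullable 64-bit integer column named "number".
          number_field = pa.field('number', pa.int64(), nullable=False)
          schema = pa.schema([number_field])

          # Metadata for Fletcher: access mode and RecordBatch name.
          metadata = {b'fletcher_mode': b'read', b'fletcher_name': b'ExampleBatch'}
          # Older pyarrow releases spelled this schema.add_metadata(metadata).
          schema = schema.with_metadata(metadata)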

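  22. Step 2: Create a RecordBatch (optional, for simulation)

          # Continues from Step 1: `pa` and `schema` are defined there.
          data = [pa.array([1, -3, 3, -7])]
          # Pass the schema by keyword; the second positional argument is `names`.
          recordbatch = pa.RecordBatch.from_arrays(data, schema=schema)

          # Write the RecordBatch to a file that Fletchgen can read.
          writer = pa.RecordBatchFileWriter('recordbatch.rb', schema)
          writer.write(recordbatch)
          writer.close()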
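When simulating the generated design, it helps to know what the kernel should return. The following is a minimal software reference for the "sum a column of integers" example, using the tutorial's values 1, -3, 3, -7. The function name `sum_int64` is hypothetical, and the wrap-around to a signed 64-bit result is an assumption about how a hardware accumulator of that width would behave, not something stated in the slides.

```python
MASK_64 = (1 << 64) - 1


def sum_int64(values):
    """Sum integers with signed 64-bit wrap-around (assumed accumulator width)."""
    acc = 0
    for v in values:
        acc = (acc + v) & MASK_64          # keep only the low 64 bits
    # Reinterpret the 64-bit pattern as a signed two's complement value.
    return acc - (1 << 64) if acc >= (1 << 63) else acc


print(sum_int64([1, -3, 3, -7]))  # -6
```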

  23. Step 3: Generate the design

          $ fletchgen -n Sum -r recordbatch.rb -s recordbatch.srec -l vhdl dot --sim

      ● -n Sum: kernel name
      ● -r recordbatch.rb: RecordBatch input
      ● -s recordbatch.srec: memory model file
      ● -l vhdl dot: design output languages
      ● --sim: generate simulation top-level
