High-Throughput Multi-Threaded Sum-Product Network Inference in the Reconfigurable Cloud Micha Ober, Jaco A. Hofmann, Lukas Sommer, Lukas Weber, Andreas Koch Embedded Systems and Applications Group, TU Darmstadt 20.11.2019 | TU Darmstadt | ESA | L. Sommer | 1

Agenda
• TaPaSCo in the Clouds
  • Introduction to the TaPaSCo framework
  • Challenges in porting TaPaSCo to Amazon AWS F1
• High-Throughput Sum-Product Network Inference
  • Introduction to Sum-Product Networks
  • FPGA Acceleration Toolflow
  • Optimizations for Amazon AWS F1
• Evaluation
• Conclusion

TaPaSCo Framework
• Builds complete FPGA SoC designs from HLS kernels or custom HDL cores
• Automates design-space exploration to determine the best system composition
• Supports a wide variety of Xilinx platforms
• Includes a software API for dispatching compute tasks to the FPGA
• Available as free & open-source software

TaPaSCo Design Flow
    tapasco compose [cnn x 2, sobel x 3] @ 100 MHz -p vc709
• Core name: cnn, sobel
• Core count: x 2, x 3
• Design frequency: @ 100 MHz
• Platform: -p vc709

TaPaSCo Architecture

TaPaSCo Software API

TaPaSCo Software API – Example
    Tapasco tapasco;
    // Wrap information about data-transfer
    auto a_wrapped = makeWrappedPointer(a.data(), a.size());
    auto b_wrapped = makeWrappedPointer(b.data(), b.size());
    // Provide information about data-transfer direction
    auto job = tapasco.launch(SIMPLE_HLS_ID, makeInOnly(a_wrapped), makeOutOnly(b_wrapped));
    // Launch FPGA execution
    job();

TaPaSCo Platforms
Datacenter
• Xilinx Alveo U250
• Xilinx Virtex UltraScale+ VCU1525
• Xilinx Virtex UltraScale+ VCU118
• Xilinx Virtex UltraScale VCU108
• Digilent NetFPGA SUME
• Xilinx Virtex VC709
Edge Devices
• Xilinx Zynq UltraScale+ MPSoC ZCU102
• Xilinx Zynq SoC ZC706
• AVNET ZedBoard
• Digilent Pynq-Z1

TaPaSCo in the Cloud
• Amazon deploys Xilinx VU9P FPGAs in AWS EC2 F1 instances
• Most of the FPGA logic is freely programmable, all interfaces are routed through a fixed Shell provided by Amazon
[Figure: F1 FPGA layout with the Shell (one DDR4 channel) and the custom logic (3 optional DDR4 channels). Image source: Amazon]

TaPaSCo in the Cloud – Challenges
• Shell provides only a few frequencies, TaPaSCo supports arbitrary design frequencies
  • Include custom clock controller in programmable logic
• DMA engine in Shell provides only limited throughput
  • Replace with TaPaSCo's own DMA engine
• Shell provides only 16 interrupts, not enough for TaPaSCo architecture
  • Include custom interrupt controller for translation
• Memory controllers for 3 DDR channels have to be placed in custom logic
  • Careful timing necessary

TaPaSCo in the Clouds – Conclusion
• Completely automated toolflow to generate SoC design from HLS code or custom HDL core for Amazon AWS EC2 F1 FPGA instances
• Generates ready-to-use Amazon FPGA Image (AFI)
• Supports up to four independent memory channels
• Easy-to-use software API for interfacing with FPGA accelerator
• Available as open source!

Case Study: Sum-Product Network Inference

Sum-Product Networks
• ML technique from the class of probabilistic models
• Capture joint probability over a set of random variables
• Advantage over NN: Exact inference, express uncertainty over output
• Advantage over other PGM: Tractable inference in linear time w.r.t. network size
• Three kinds of nodes in DAG:
  • Sum nodes
  • Product nodes
  • Leaf nodes

Sum-Product Networks – Leaf Nodes
• Capture univariate distributions, e.g., Gaussian, Poisson
• Queried with evidence (input value) to obtain probability value
• Can be represented efficiently using histograms
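The histogram representation of a leaf can be sketched in a few lines of C++. This is only an illustration under assumptions: the struct name, the equal-width bucketing, and the choice to return 0 outside the modelled range are hypothetical and not taken from the toolflow.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical histogram leaf with equal-width buckets over [lower, upper).
// Names and layout are illustrative, not the toolflow's actual data structure.
struct HistogramLeaf {
    double lower;                 // inclusive lower bound of the modelled range
    double upper;                 // exclusive upper bound of the modelled range
    std::vector<double> density;  // one probability value per bucket

    // Look up the probability stored for the bucket containing `evidence`.
    // Evidence outside the modelled range yields 0.0 (an assumption).
    double query(double evidence) const {
        assert(!density.empty() && upper > lower);
        if (evidence < lower || evidence >= upper) return 0.0;
        const double width = (upper - lower) / static_cast<double>(density.size());
        std::size_t bucket = static_cast<std::size_t>((evidence - lower) / width);
        if (bucket >= density.size()) bucket = density.size() - 1;  // guard against rounding
        return density[bucket];
    }
};
```

With buckets of width 10 over [20, 60), a query for age 27 reads the first bucket, so the lookup is a single index computation and one memory access.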

Sum-Product Networks – Product Nodes
• Factorization over independent random variables
• Multiply probability values from child nodes to obtain result
• Domain knowledge might be required to determine independence
[Figure: product node (x) with child nodes A and B]

Sum-Product Networks – Sum Nodes
• Mixture of two distributions over the same set of random variables
• Cluster and split samples, e.g. kNN-clustering
• Associated weight corresponds to relative size of the cluster
[Figure: sum node (+) with weights 0.3 and 0.7 over two distributions of A]
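As a minimal sketch of the two ideas on this slide, the code below derives mixture weights from cluster sizes and evaluates a sum node as the weighted sum of its children. The function names are hypothetical, and the clustering step itself is assumed to have happened elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each weight is the fraction of training samples that fell into the cluster.
// (Hypothetical helper; the clustering itself is not shown.)
std::vector<double> weightsFromClusterSizes(const std::vector<std::size_t>& sizes) {
    std::size_t total = 0;
    for (std::size_t s : sizes) total += s;
    assert(total > 0);
    std::vector<double> weights;
    weights.reserve(sizes.size());
    for (std::size_t s : sizes) {
        weights.push_back(static_cast<double>(s) / static_cast<double>(total));
    }
    return weights;
}

// Sum node: weighted sum of the probability values of its children.
double sumNode(const std::vector<double>& weights, const std::vector<double>& children) {
    assert(weights.size() == children.size());
    double result = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        result += weights[i] * children[i];
    }
    return result;
}
```

For clusters of 30 and 70 samples this yields the weights 0.3 and 0.7 shown in the figure.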

Sum-Product Networks – Example
[Figure: example with four clusters labeled Professors, Administrative staff, Ph.D.-students, undergraduate students]

Sum-Product Networks – Example Network
[Figure: sum node (+) with weights 0.1, 0.4, 0.3, 0.2 over four product nodes (x), each with leaf nodes A and I]

Sum-Product Networks – Inference
• Answer probabilistic queries & solve ML tasks
  • Probability of earning 100k$ at age 27: P(A=27, I=100k$)
  • Probability of earning 150k$: P(I=150k$) (marginalization)
  • Add label {student, Ph.D.-student, admin, professor} as input variable, do classification based on information about age and income
• Inference is bottom-up evaluation of the SPN graph with (partial) evidence
• Some queries might require multiple passes, but always linear time
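Bottom-up evaluation with partial evidence can be sketched as a single pass over the nodes in topological order. The node layout below is a hypothetical one chosen for brevity. Leaves are assumed to carry the probability already looked up for the evidence, and a marginalized variable is modelled by setting its leaves to 1.0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class NodeKind { Leaf, Product, Sum };

// Hypothetical flat node representation; children refer to earlier indices.
struct Node {
    NodeKind kind;
    std::vector<std::size_t> children;  // indices of child nodes
    std::vector<double> weights;        // one weight per child (Sum nodes only)
    double leafValue;                   // Leaf only: probability for the evidence,
                                        // or 1.0 if the variable is marginalized
};

// Evaluate the network in one pass. `nodes` must be topologically ordered
// (children before parents) with the root last. Every edge is visited once,
// so the cost is linear in the size of the network.
double evaluate(const std::vector<Node>& nodes) {
    assert(!nodes.empty());
    std::vector<double> value(nodes.size(), 0.0);
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        const Node& n = nodes[i];
        switch (n.kind) {
        case NodeKind::Leaf:
            value[i] = n.leafValue;
            break;
        case NodeKind::Product: {
            double product = 1.0;
            for (std::size_t c : n.children) {
                assert(c < i);
                product *= value[c];
            }
            value[i] = product;
            break;
        }
        case NodeKind::Sum: {
            assert(n.weights.size() == n.children.size());
            double sum = 0.0;
            for (std::size_t k = 0; k < n.children.size(); ++k) {
                assert(n.children[k] < i);
                sum += n.weights[k] * value[n.children[k]];
            }
            value[i] = sum;
            break;
        }
        }
    }
    return value.back();
}
```

A marginal query such as P(A=27) reuses the same pass: only the leaf values for I change to 1.0, the graph traversal stays identical.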

Sum-Product Networks – Example Network
Probability of earning 100k$ at age 27: P(A=27, I=100k$) ≈ 0
[Figure: example network evaluated bottom-up. Sum node weights: 0.1, 0.4, 0.3, 0.2. Leaf values: 0.7, 0.01, 0.9, 0.1, 0.1, 0.001, 0.25, 0.0001. Clusters: Professors, Administrative staff, Ph.D.-students, undergraduate students]
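The arithmetic behind this query can be written out as a short sketch. It rests on an assumption the extracted slide does not confirm: the eight leaf values are taken left to right as one (A, I) pair per product node, matched to the sum weights in the same order. The type and constant names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// One branch of the example network: a sum weight and the two leaf values
// feeding the product node. The pairing of values to branches is assumed.
struct Branch {
    double weight;
    double pAge;
    double pIncome;
};

// Weighted sum over the products of each branch's leaf probabilities.
double jointProbability(const std::array<Branch, 4>& branches) {
    double weightSum = 0.0;
    double result = 0.0;
    for (const Branch& b : branches) {
        weightSum += b.weight;
        result += b.weight * (b.pAge * b.pIncome);
    }
    assert(std::abs(weightSum - 1.0) < 1e-9);  // sum-node weights are normalized
    return result;
}

// Values from the slide under the assumed left-to-right pairing.
const std::array<Branch, 4> kAssumedExample = {{
    {0.1, 0.7, 0.01},
    {0.4, 0.9, 0.1},
    {0.3, 0.1, 0.001},
    {0.2, 0.25, 0.0001},
}};
```

Under that pairing the four terms are 0.0007, 0.036, 0.00003 and 0.000005, which add up to about 0.0367. A different assignment of leaf values to branches would give a different number.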

FPGA Inference Accelerator
• Automatic toolflow for FPGA acceleration of SPN inference developed in prior work [TPM2018, ICCD2018]
• Maps SPN graph to fully spatial, pipelined accelerator with AXI4-based, pipelined memory interface
• Throughput-oriented scenario, accelerate inference for batch of input queries
• Turn-key solution, heterogeneous system integration with TaPaSCo

Memory Interface
• Existing memory interface aggressively optimized, occupies bus through long-running AXI4 bursts and many transfers in-flight
  • Potential deadlocks in multi-core scenario
  • No concurrent DMA-transfer between host and FPGA memory possible
• Solution: Complete re-design of the memory interface
  • Strictly limit the number of outstanding transfers
  • Buffer result values, write back block-wise in short-running AXI4 burst transfer
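The two measures of the re-design can be illustrated with a small software model. The actual interface is hardware, so this is only a behavioural sketch under assumptions: the class names, the credit-style limiter, and the fixed block size are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Behavioural model: a request may only be issued while fewer than
// `maxOutstanding` transfers are in flight.
class OutstandingLimiter {
public:
    explicit OutstandingLimiter(std::size_t maxOutstanding) : max_(maxOutstanding) {
        assert(max_ > 0);
    }
    // Try to issue a new transfer; returns false if the limit is reached.
    bool tryIssue() {
        if (inFlight_ == max_) return false;
        ++inFlight_;
        return true;
    }
    // Mark one in-flight transfer as completed.
    void complete() {
        assert(inFlight_ > 0);
        --inFlight_;
    }
    std::size_t inFlight() const { return inFlight_; }

private:
    std::size_t max_;
    std::size_t inFlight_ = 0;
};

// Behavioural model: results are collected and written back as one short
// burst per full block instead of occupying the bus continuously.
class BlockWriteBuffer {
public:
    explicit BlockWriteBuffer(std::size_t blockSize) : blockSize_(blockSize) {
        assert(blockSize_ > 0);
    }
    // Buffer one result; returns true if this completed a block and a burst was emitted.
    bool push(double result) {
        pending_.push_back(result);
        if (pending_.size() < blockSize_) return false;
        bursts_.push_back(pending_);
        pending_.clear();
        return true;
    }
    // Emit any remaining results as a final, shorter burst.
    void flush() {
        if (!pending_.empty()) {
            bursts_.push_back(pending_);
            pending_.clear();
        }
    }
    const std::vector<std::vector<double>>& bursts() const { return bursts_; }

private:
    std::size_t blockSize_;
    std::vector<double> pending_;
    std::vector<std::vector<double>> bursts_;
};
```

Bounding the in-flight transfers keeps one compute unit from monopolizing the bus, and short write bursts leave gaps that other cores or host DMA transfers can use.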

Multi-core Architecture
• Size of VU9P FPGA allows for replication of accelerator
• Baseline architecture: 1 compute unit, 1 memory channel
• Multi-core, single memory: Up to 4 compute units, 1 memory channel
• Multi-core, multi-memory: Up to 4 compute units, 4 memory channels

Multi-threaded Operation
• Low to moderate computational density for SPN inference, data-transfer overhead significant
• Solution: Split computation into blocks, overlap computation and data-transfer to/from host with multiple threads on host-side
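The host-side scheme might look roughly like the following sketch, which splits a batch into blocks and lets several threads pull blocks from a shared counter. The kernel callback is a hypothetical stand-in for "copy block to the device, run inference, copy results back". No TaPaSCo calls are shown, and the function and type names are assumptions.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for transfer-in, accelerator execution and transfer-out
// of one block of `count` samples.
using BlockKernel = std::function<void(const double* in, double* out, std::size_t count)>;

// Process `input` in blocks of `blockSize` using `numThreads` host threads.
// While one thread waits on its block, the others keep further blocks in
// flight, which is what lets transfers and computation overlap.
void processInBlocks(const std::vector<double>& input, std::vector<double>& output,
                     std::size_t blockSize, unsigned numThreads,
                     const BlockKernel& kernel) {
    assert(blockSize > 0 && numThreads > 0);
    output.resize(input.size());
    const std::size_t numBlocks = (input.size() + blockSize - 1) / blockSize;
    std::atomic<std::size_t> nextBlock{0};

    auto worker = [&]() {
        for (;;) {
            const std::size_t block = nextBlock.fetch_add(1);
            if (block >= numBlocks) return;
            const std::size_t begin = block * blockSize;
            const std::size_t count = std::min(blockSize, input.size() - begin);
            // Blocks cover disjoint ranges, so threads never write the same element.
            kernel(input.data() + begin, output.data() + begin, count);
        }
    };

    std::vector<std::thread> threads;
    threads.reserve(numThreads);
    for (unsigned t = 0; t < numThreads; ++t) threads.emplace_back(worker);
    for (std::thread& t : threads) t.join();
}
```

Pulling blocks from an atomic counter rather than assigning fixed ranges keeps all threads busy even when individual blocks take different amounts of time.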

Evaluation
• Evaluation with 8 different benchmarks from the NeurIPS corpus
• FPGA implementation for AWS F1 for all three architectures
• CPU comparison with generated C++ code on 12-core Xeon E5-2680 v3
• GPU comparison with optimized CUDA code on Nvidia V100 (AWS EC2)
• Measure end-to-end throughput, including data-transfer from/to host

FPGA Implementation Results

Performance Comparison
