High-Throughput Multi-Threaded Sum-Product Network Inference in the Reconfigurable Cloud
Micha Ober, Jaco A. Hofmann, Lukas Sommer, Lukas Weber, Andreas Koch
Embedded Systems and Applications Group, TU Darmstadt, 20.11.2019


SLIDE 1

20.11.2019 | TU Darmstadt | ESA | L. Sommer | 1

Micha Ober, Jaco A. Hofmann, Lukas Sommer, Lukas Weber, Andreas Koch Embedded Systems and Applications Group, TU Darmstadt

High-Throughput Multi-Threaded Sum-Product Network Inference in the Reconfigurable Cloud

SLIDE 2

Agenda

  • TaPaSCo in the Clouds
  • Introduction to the TaPaSCo framework
  • Challenges in porting TaPaSCo to Amazon AWS F1
  • High-Throughput Sum-Product Network Inference
  • Introduction to Sum-Product Networks
  • FPGA Acceleration Toolflow
  • Optimizations for Amazon AWS F1
  • Evaluation
  • Conclusion
SLIDE 3

TaPaSCo Framework

  • Builds complete FPGA SoC designs from HLS kernels or custom HDL cores
  • Automates design-space exploration to determine the best system composition
  • Supports a wide variety of Xilinx platforms
  • Includes a software API for dispatching compute tasks to the FPGA
  • Available as free and open-source software
SLIDE 4

TaPaSCo Design Flow

tapasco compose [cnn x 2, sobel x 3] @ 100 MHz -p vc709

  • Core name: cnn, sobel
  • Core count: x 2, x 3
  • Design frequency: 100 MHz
  • Platform: vc709

SLIDE 5

TaPaSCo Architecture

SLIDE 6

TaPaSCo Software API

SLIDE 7

TaPaSCo Software API – Example

Tapasco tapasco;

// Wrap information about the data transfer
auto a_wrapped = makeWrappedPointer(a.data(), a.size());
auto b_wrapped = makeWrappedPointer(b.data(), b.size());

// Launch FPGA execution; makeInOnly/makeOutOnly provide information
// about the data-transfer direction
auto job = tapasco.launch(SIMPLE_HLS_ID, makeInOnly(a_wrapped), makeOutOnly(b_wrapped));
job();

SLIDE 8

TaPaSCo Platforms

Datacenter

  • Xilinx Alveo U250
  • Xilinx Virtex UltraScale+ VCU1525
  • Xilinx Virtex UltraScale+ VCU118
  • Xilinx Virtex UltraScale VCU108
  • Digilent NetFPGA SUME
  • Xilinx Virtex VC709

Edge Devices

  • Xilinx Zynq UltraScale+ MPSoC ZCU102
  • Xilinx Zynq SoC ZC706
  • AVNET ZedBoard
  • Digilent Pynq-Z1
SLIDE 9

TaPaSCo in the Cloud

  • Amazon deploys Xilinx Virtex UltraScale+ VU9P FPGAs in AWS EC2 F1 instances
  • Most of the FPGA logic is freely programmable; all interfaces are routed through a fixed Shell provided by Amazon

[Figure: F1 FPGA with Shell, custom logic, DDR4 channel, and 3 optional DDR4 channels. Image source: Amazon]

SLIDE 10

TaPaSCo in the Cloud - Challenges

  • Shell provides only a few frequencies, TaPaSCo supports arbitrary design frequencies
  • Include custom clock controller in programmable logic
  • DMA engine in Shell provides only limited throughput
  • Replace with TaPaSCo's own DMA engine
  • Shell provides only 16 interrupts, not enough for TaPaSCo architecture
  • Include custom interrupt controller for translation
  • Memory controllers for 3 DDR channels have to be placed in custom logic
  • Careful timing necessary
SLIDE 11

TaPaSCo in the Clouds – Conclusion

  • Completely automated toolflow to generate an SoC design from HLS code or a custom HDL core for Amazon AWS EC2 F1 FPGA instances

  • Generates ready-to-use Amazon FPGA Image (AFI)
  • Supports up to four independent memory channels
  • Easy-to-use software API for interfacing with FPGA accelerator
  • Open-source available!
SLIDE 12

SUM-PRODUCT NETWORK INFERENCE

Case-Study

SLIDE 13

Sum-Product Networks

  • ML technique from the class of probabilistic models
  • Capture joint probability over a set of random variables
  • Advantage over neural networks (NN): exact inference, can express uncertainty over the output
  • Advantage over other probabilistic graphical models (PGM): tractable inference in time linear in the network size
  • Three kinds of nodes in DAG:
  • Sum nodes
  • Product nodes
  • Leaf nodes
SLIDE 14

Sum-Product Networks – Leaf Nodes

  • Capture univariate distributions, e.g., Gaussian, Poisson
  • Queried with evidence (input value) to obtain a probability value
  • Can be represented efficiently using histograms
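
A histogram leaf can be pictured as a plain table lookup. The following is a minimal sketch, not the authors' implementation: the struct `HistogramLeaf`, its fields, the uniform bucket width, and the choice to return 0 for evidence outside the histogram are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical histogram leaf over an integer-valued random variable.
// Bucket i covers evidence in [lower + i * width, lower + (i + 1) * width).
struct HistogramLeaf {
    int lower;                   // smallest value covered by the histogram
    int width;                   // bucket width (assumed uniform in this sketch)
    std::vector<double> density; // probability value stored per bucket

    // Query the leaf with evidence. Values outside the histogram yield 0,
    // which is an assumption of this sketch.
    double query(int evidence) const {
        assert(width > 0 && !density.empty());
        if (evidence < lower) return 0.0;
        const std::size_t bucket =
            static_cast<std::size_t>((evidence - lower) / width);
        return bucket < density.size() ? density[bucket] : 0.0;
    }
};
```

For example, `HistogramLeaf{20, 10, {0.5, 0.3, 0.2}}` maps an age of 27 to the first bucket and an age of 30 to the second.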

SLIDE 15

Sum-Product Networks – Product Nodes

  • Factorization over independent random variables
  • Multiply probability values from child nodes to obtain the result
  • Domain knowledge might be required to determine independence

[Figure: product node (x) with child nodes A and B]
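
Written as a formula, a product node over two independent variables A and B multiplies the values of its children. The symbols $P_A$ and $P_B$ are notation introduced here for the two child distributions:

```latex
P(A = a, B = b) = P_A(a) \cdot P_B(b)
```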

SLIDE 18

Sum-Product Networks – Sum-Node

  • Mixture of two distributions over the same set of random variables
  • Cluster and split samples, e.g. kNN-clustering
  • Associated weight corresponds to the relative size of the cluster

[Figure: sum node (+) with two child nodes over A, weighted 0.3 and 0.7]
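
Written as a formula, a sum node is a weighted mixture whose weights are non-negative and sum to one. Using the weights 0.3 and 0.7 from the figure, with $P_1$ and $P_2$ as names introduced here for the two child distributions over A:

```latex
P(A = a) = 0.3 \, P_1(a) + 0.7 \, P_2(a), \qquad 0.3 + 0.7 = 1
```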

SLIDE 19

Sum-Product Networks – Example

SLIDE 20

Sum-Product Networks – Example

[Figure labels: Professors, Administrative staff, Ph.D.-students, undergraduate students]

SLIDE 21

Sum-Product Networks – Example Network

[Figure: sum node (+) with weights 0.4, 0.2, 0.3, 0.1 over four product nodes (x), each with leaf nodes A and I]

SLIDE 22

Sum-Product Networks - Inference

  • Answer probabilistic queries & solve ML tasks
  • Probability of earning 100k$ at age 27: P(A=27, I=100k$)
  • Probability of earning 150k$: P(I=150k$) – marginalization
  • Add label {student, Ph.D.-student, admin, professor} as input variable, do classification based on information about age and income

  • Inference is bottom-up evaluation of the SPN graph with (partial) evidence
  • Some queries might require multiple passes, but always linear time
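
The bottom-up evaluation described above can be sketched in a few lines of C++. This is an illustrative software model under stated assumptions, not the generated accelerator or the authors' code: the `Node` layout, the helper functions `leaf`, `product`, and `sum`, the discrete lookup-table leaves, and the sentinel `kMarginalized` are all hypothetical. Nodes are assumed to be stored with children before parents, so a single pass visits every node once, which is where the linear runtime comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

enum class Kind { Leaf, Product, Sum };

// One SPN node. Which fields are used depends on `kind`.
struct Node {
    Kind kind;
    std::size_t variable;              // Leaf: index of the random variable
    std::vector<double> table;         // Leaf: P(X = k) for k = 0, 1, ...
    std::vector<std::size_t> children; // Product/Sum: indices of child nodes
    std::vector<double> weights;       // Sum: one weight per child
};

Node leaf(std::size_t variable, std::vector<double> table) {
    return Node{Kind::Leaf, variable, std::move(table), {}, {}};
}
Node product(std::vector<std::size_t> children) {
    return Node{Kind::Product, 0, {}, std::move(children), {}};
}
Node sum(std::vector<std::size_t> children, std::vector<double> weights) {
    assert(children.size() == weights.size());
    return Node{Kind::Sum, 0, {}, std::move(children), std::move(weights)};
}

constexpr int kMarginalized = -1; // evidence value meaning "not observed"

// Bottom-up pass. `nodes` must list children before their parents; the last
// node is the root. Every node is visited exactly once.
double evaluate(const std::vector<Node>& nodes, const std::vector<int>& evidence) {
    assert(!nodes.empty());
    std::vector<double> value(nodes.size(), 0.0);
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        const Node& n = nodes[i];
        if (n.kind == Kind::Leaf) {
            const int x = evidence[n.variable];
            if (x == kMarginalized) {
                value[i] = 1.0; // a marginalized leaf contributes probability 1
            } else {
                const std::size_t k = static_cast<std::size_t>(x);
                value[i] = (x >= 0 && k < n.table.size()) ? n.table[k] : 0.0;
            }
        } else if (n.kind == Kind::Product) {
            value[i] = 1.0;
            for (std::size_t c : n.children) value[i] *= value[c];
        } else { // Kind::Sum
            for (std::size_t k = 0; k < n.children.size(); ++k)
                value[i] += n.weights[k] * value[n.children[k]];
        }
    }
    return value.back();
}
```

Passing full evidence answers a joint query such as P(A=27, I=100k$), while setting a variable to `kMarginalized` answers a marginal query such as P(I=150k$) in the same single pass.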
SLIDE 23

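Sum-Product Networks – Example Network

[Figure: example network evaluated for the query "Probability of earning 100k$ at age 27: P(A=27, I=100k$)". Sum node (+) with weights 0.4, 0.2, 0.3, 0.1 over four product nodes (x), each with leaf nodes A and I. Branch labels: Professors, Administrative staff, Ph.D.-students, undergraduate students. Values annotated in the figure: 0.01, 0.1, 0.001, 0.0001, 0.9, 0.1, 0.7, 0.25, ≈ 0]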

SLIDE 24

FPGA Inference Accelerator

  • Automatic toolflow for FPGA acceleration of SPN inference developed in prior work [TPM2018, ICCD2018]
  • Maps SPN graph to fully spatial, pipelined accelerator with AXI4-based, pipelined memory interface

  • Throughput-oriented scenario, accelerate inference for batch of input queries
  • Turn-key solution, heterogeneous system integration with TaPaSCo
SLIDE 25

Memory Interface

  • Existing memory interface aggressively optimized, occupies bus through long-running AXI4 bursts and many transfers in-flight

  • Potential deadlocks in multi-core scenario
  • No concurrent DMA-transfer between host and FPGA memory possible
  • Solution: Complete re-design of the memory interface
  • Strictly limit the number of outstanding transfers
  • Buffer result values, write back block-wise in short-running AXI4 burst transfer
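
The block-wise write-back idea can be illustrated with a small software model. This is only a sketch of the buffering behaviour, not the hardware design from the talk: the class `ResultBuffer`, the `WriteBurst` callback, and the burst length are hypothetical, and the limit on outstanding transfers is not modelled here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Software model (not HDL): collect result values and hand them to the
// memory side in short bursts of at most `burst_length` values.
class ResultBuffer {
public:
    using WriteBurst = std::function<void(const std::vector<double>&)>;

    ResultBuffer(std::size_t burst_length, WriteBurst write_burst)
        : burst_length_(burst_length), write_burst_(std::move(write_burst)) {
        assert(burst_length_ > 0);
        pending_.reserve(burst_length_);
    }

    // Buffer one result; issue a burst as soon as the buffer is full.
    void push(double result) {
        pending_.push_back(result);
        if (pending_.size() == burst_length_) flush();
    }

    // Write back whatever is left, e.g. at the end of a batch.
    void flush() {
        if (pending_.empty()) return;
        write_burst_(pending_);
        pending_.clear();
    }

private:
    std::size_t burst_length_;
    WriteBurst write_burst_;
    std::vector<double> pending_;
};
```

Because each burst is short and bounded, the bus is released between bursts, which is the property that leaves room for other compute units and for DMA transfers.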
SLIDE 26

Multi-core Architecture

  • Size of VU9P FPGA allows for replication of accelerator
  • Baseline architecture: 1 compute unit, 1 memory channel
  • Multi-core, single memory: Up to 4 compute units, 1 memory channel
  • Multi-core, multi-memory: Up to 4 compute units, 4 memory channels
SLIDE 27

Multi-threaded Operation

  • Low to moderate computational density for SPN inference, data-transfer overhead significant
  • Solution: Split computation into blocks, overlap computation and data-transfer to/from host with multiple threads on host-side
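
The host-side scheme of splitting a batch into blocks and processing them from several threads might look roughly like the sketch below. It is written under assumptions and does not use the TaPaSCo API: `run_blocked` and `BlockJob` are hypothetical names, and `BlockJob` stands in for "copy one block to the FPGA, run the accelerator, copy the results back".

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for transfer-in, accelerator launch, and transfer-out
// of one block. It receives the half-open sample range [begin, end).
using BlockJob = std::function<void(std::size_t begin, std::size_t end)>;

// Split `num_samples` queries into blocks and let `num_threads` host threads
// pull blocks from a shared counter. While one thread waits on a transfer or
// on the accelerator, the other threads can keep working on their blocks.
void run_blocked(std::size_t num_samples, std::size_t block_size,
                 unsigned num_threads, const BlockJob& job) {
    assert(block_size > 0 && num_threads > 0);
    const std::size_t num_blocks = (num_samples + block_size - 1) / block_size;
    std::atomic<std::size_t> next_block{0};

    auto worker = [&] {
        for (;;) {
            const std::size_t b = next_block.fetch_add(1);
            if (b >= num_blocks) return;
            const std::size_t begin = b * block_size;
            const std::size_t end = std::min(begin + block_size, num_samples);
            job(begin, end); // blocks are disjoint, so no locking is needed
        }
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) threads.emplace_back(worker);
    for (auto& th : threads) th.join();
}
```

The block size and thread count are tuning parameters in this sketch; the talk does not state which values were used.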

SLIDE 28

Evaluation

  • Evaluation with 8 different benchmarks from the NeurIPS corpus
  • FPGA implementation for AWS F1 for all three architectures
  • CPU comparison with generated C++ code on 12-core Xeon E5-2680v3
  • GPU comparison with optimized CUDA code on Nvidia V100 (AWS EC2)
  • Measure end-to-end throughput, including data-transfer from/to host
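
An end-to-end throughput measurement of this kind can be sketched as follows. The helper `measure_throughput` and its `run` callback are hypothetical and not the benchmark harness used for the evaluation; the point is only that the timed region must cover the transfers to and from the host, not just the computation.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>

// Returns samples per second for one end-to-end run. `run` is a hypothetical
// callable that must include host-to-device transfer, inference, and
// device-to-host transfer.
double measure_throughput(std::size_t num_samples,
                          const std::function<void()>& run) {
    assert(num_samples > 0);
    const auto start = std::chrono::steady_clock::now();
    run();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> seconds = stop - start;
    return static_cast<double>(num_samples) / seconds.count();
}
```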
SLIDE 29

FPGA Implementation Results

SLIDE 30

FPGA Implementation Results

SLIDE 31

FPGA Implementation Results

SLIDE 32

Performance Comparison

SLIDE 33

Conclusion

  • Case study demonstrates ease-of-use of the TaPaSCo design flow to generate heterogeneous accelerator system for the Amazon AWS EC2 F1 instances
  • Multi-core architecture with multi-threaded software interface significantly improves throughput for SPN inference

  • Up to 1.9x speedup over 12-core Xeon CPU
  • Up to 6.6x speedup over Nvidia Tesla V100 GPU – due to low arithmetic intensity
SLIDE 34

Start building your own AWS F1 accelerator system using TaPaSCo! Download TaPaSCo from GitHub: github.com/esa-tu-darmstadt/tapasco

SLIDE 35

Existing FPGA Acceleration Toolflow

SLIDE 36

Existing FPGA Accelerator Core