
SLIDE 1

AIgean: An Open Framework for Machine Learning on a Heterogeneous Cluster

Naif Tarafdar¹, Giuseppe Di Guglielmo², Philip C Harris³, Jeffrey D Krupa³, Vladimir Loncar⁴, Dylan S Rankin³, Nhan Tran⁵, Zhenbin Wu⁶, Qianfeng Shen¹ and Paul Chow¹

University of Toronto¹, Columbia University², Massachusetts Institute of Technology³, CERN⁴, Fermilab⁵, University of Illinois⁶

SLIDE 2

Takeaways

  • Galapagos: Platform for multi-FPGA application deployment
    – A scalable giant FPGA comprised of individual FPGAs
  • AIgean: Mapping an ML application onto the giant FPGA
    – Could also be your own applications
  • Depending on your area of expertise and interest, you can use different parts of this project

November 13, 2020 H2RC 2020

SLIDE 3

Machine Learning

  • One of the most popular topics of research
    – In many areas and applications (e.g. medical, financial, safety, transportation)
    – Also within the computing community
  • Wide usage in the world pushes the limits of devices
    – Metrics include performance and energy
    – Leading many researchers to consider heterogeneity!

SLIDE 4

Heterogeneity All Around Us

SLIDE 5

Applying Machine Learning to a Heterogeneous Environment

  • Challenge: How do you design machine learning algorithms for a heterogeneous space?
    – Hard enough with a homogeneous computing environment
    – Is there a framework for such a thing?
  • Challenge: If such a framework exists, can we get both flexibility and performance?

SLIDE 6

Outline

  • Brief motivation
  • Overview of machine learning frameworks
    – Categorized as an abstraction layer stack
  • Overview of AIgean
    – HLS4ML
    – Galapagos
  • Results

SLIDE 7

MACHINE LEARNING FRAMEWORKS

SLIDE 8

Many Popular Examples!

  • Such as:
    – TensorFlow
    – PyTorch
    – Caffe
    – Intel DLA
    – Xilinx xfDNN
  • What do these different frameworks offer?
    – Depends on who you ask!

SLIDE 9

Machine Learning Stack

  • Applications & Algorithms
  • Cluster Deployment & Communication
  • Hardware

SLIDE 10

Machine Learning Stack

  • Applications & Algorithms (e.g. neural net layers, quantization, compression, pruning)
  • Cluster Deployment & Communication
  • Hardware

SLIDE 11

Machine Learning Stack

  • Applications & Algorithms
  • Cluster Deployment & Communication (e.g. physical connections such as PCIe and Ethernet, communication protocols)
  • Hardware

SLIDE 12

Machine Learning Stack

  • Applications & Algorithms
  • Cluster Deployment & Communication
  • Hardware (e.g. hardware circuits such as multipliers and shifters, memory architecture such as caching)

SLIDE 13

Machine Learning Stack

  • Allows researchers to pick and choose the layers they wish to configure
  • Collapsible/expandable for a specific application and infrastructure!

SLIDE 14

AIGEAN OVERVIEW

SLIDE 15

AIgean Introduction

  • Like the archipelago and sea
  • Combines two existing frameworks:
    – HLS4ML: HLS IP cores for ML
    – Galapagos: connects and deploys heterogeneous distributed applications across multiple nodes

SLIDE 16

HLS4ML

  • Open-source project
  • Input:
    – Description of FPGA resources
      • LUT, BRAM, DSP
    – Description of the neural net
      • PyTorch, Keras, ONNX support
  • Output:
    – HLS-synthesizable C++ implementing the neural net within the resource constraints
    – Tunable HLS code, made to fit the FPGA
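"Tunable HLS code, made to fit the FPGA" refers to knobs such as the reuse factor, which time-shares multipliers so a layer fits a DSP budget. A minimal sketch of that fitting idea follows; this is illustrative Python, not HLS4ML's actual algorithm, and the function names are made up:

```python
# Hedged sketch: pick the smallest per-layer "reuse factor" so the layer's
# multiplies fit the FPGA's DSP budget. Reuse factor 1 is fully parallel
# (one DSP per multiply); higher reuse serializes multiplies onto fewer DSPs.

def min_reuse_factor(multiplies: int, dsp_budget: int) -> int:
    """Smallest reuse factor that brings parallel multipliers under budget."""
    if dsp_budget <= 0:
        raise ValueError("DSP budget must be positive")
    return -(-multiplies // dsp_budget)  # ceiling division

def dsps_used(multiplies: int, reuse: int) -> int:
    """Parallel multipliers needed at a given reuse factor."""
    return -(-multiplies // reuse)
```

For example, a 64x64 dense layer has 4096 multiplies; against the ZU19EG's 1968 DSP slices, a reuse factor of 3 brings it under budget (1366 parallel multipliers).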

SLIDE 17

Galapagos

  • User can define an FPGA cluster using cluster description files and AXI-Stream kernels
  • Tool flow

[Diagram: VMs and FPGAs (FPGA 1, FPGA 2, FPGA 3) connected over the network; AXI-Stream kernels are mapped onto the nodes from a cluster description file]

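The information a Galapagos-style cluster description carries is essentially a kernel-to-node mapping. The sketch below illustrates that idea; the schema, field names, and node/kernel names are invented for illustration and are not Galapagos's actual file format:

```python
# Hedged sketch of a cluster description: which kernels exist and which
# node (CPU or FPGA) each is mapped to. Moving a kernel between CPU and
# FPGA changes only this mapping, not the kernel code itself.

cluster = {
    "nodes": [
        {"id": "fpga1", "type": "FPGA", "comm": "udp"},
        {"id": "fpga2", "type": "FPGA", "comm": "udp"},
        {"id": "vm1",   "type": "CPU",  "comm": "tcp"},
    ],
    "kernels": [
        {"num": 1, "name": "dense_1", "node": "fpga1"},
        {"num": 2, "name": "relu_1",  "node": "fpga1"},
        {"num": 3, "name": "dense_2", "node": "fpga2"},
        {"num": 4, "name": "sink",    "node": "vm1"},
    ],
}

def kernels_on(cluster: dict, node_id: str) -> list:
    """List the kernel names mapped to a given node."""
    return [k["name"] for k in cluster["kernels"] if k["node"] == node_id]
```

Because the tool flow, not the kernel, owns the placement, retargeting `dense_2` from `fpga2` to `vm1` would be a one-line change to the description.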

SLIDE 21

Galapagos

  • Heterogeneous stack
  • Allows users to create flexible heterogeneous clusters across CPUs/FPGAs
  • Seamlessly prototype by implementing on both CPU and FPGA
    – Galapagos ensures functional portability for network communication
    – Essentially "network-connected" HLS kernels, for both SW and HW
    – Iterative development: selectively move bottlenecks from SW to hardware without modifying code
  • Flexibly change the communication protocol without modifying the user application
    – TCP, UDP, L1, etc.
    – The user application is agnostic to this

  • Galapagos stack layers: Communication Layer, Middleware/Network Layer, Hypervisor Layer, Physical Hardware

SLIDE 22

Birth of AIgean

  • HLS4ML creates HLS IP cores to maximize FPGA utilization
  • Galapagos can provide a multi-FPGA fabric
  • The tools are combined to deploy a neural net on a multi-FPGA fabric

SLIDE 23

AIgean Tool Flow

[Diagram: Model Training (Keras, PyTorch) produces a Model (HDF5 & JSON); hls4ml* turns it into HLS ML Layers (C++ & TCL); the HLS-ML-to-Galapagos Bridge (C++ & TCL) plus the Partitioner produce an unconnected IP Cluster; ML2G then produces the CPU/FPGA Cluster. Tuning feeds back into the model, and the whole chain is the AIgean automated flow.]


SLIDE 26

HLS4ML Modifications

  • HLS4ML modified to create independent layers as separate HLS IP cores
    – Each IP core is a streaming core, with one stream per dimension of the particular layer

SLIDE 27

HLS4ML Galapagos Bridge

[Diagram: ML Layer ↔ Bridge ↔ Galapagos Kernel; each bridge converts between one 8-bit stream per dimension on the ML-layer side and one 512-bit stream on the Galapagos side]

  • Bridges are custom-made for the layers used in the network (different bridges are needed for different numbers of dimensions)
  • If the user has a different application layer, they would need a different bridge
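The width conversion the bridge performs can be sketched in plain Python: interleave one byte from each dimension's 8-bit stream, then cut the result into 512-bit (64-byte) flits. This is an illustrative model of the data movement only; the real bridge is an HLS core, and the interleaving order here is an assumption:

```python
# Hedged sketch of the outbound bridge: interleave per-dimension 8-bit
# streams element by element, then pack into fixed-size 512-bit flits
# (the last flit is zero-padded).

def pack_streams(streams, flit_bytes: int = 64):
    """streams: one list of byte values (0-255) per dimension."""
    interleaved = bytearray()
    for element in zip(*streams):          # one byte from each dimension
        interleaved.extend(element)
    pad = (-len(interleaved)) % flit_bytes # pad to a whole number of flits
    interleaved.extend(b"\x00" * pad)
    return [bytes(interleaved[i:i + flit_bytes])
            for i in range(0, len(interleaved), flit_bytes)]
```

A bridge for a layer with a different number of dimensions would change the `zip(*streams)` arity, which is why different layers need different bridges.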


SLIDE 29

Partitioner

  • The partitioner separates IP cores onto different FPGAs
  • Currently uses IP resource estimates from HLS place and route, with a simple greedy approach
  • Does not place the bridges, as those are AIgean-specific; this partitioner is general for all Galapagos IP kernels
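The slide only says "simple greedy approach", so the exact heuristic below is an assumption: walk the layer pipeline in order, filling each FPGA with consecutive IP cores until its resource estimate is exhausted, then start the next FPGA.

```python
# Hedged sketch of greedy partitioning: assign consecutive cores (by index)
# to FPGAs without exceeding a per-FPGA resource budget, e.g. the DSP or
# LUT estimates that HLS reports per core.

def greedy_partition(core_costs, fpga_budget: int):
    """Return a list of FPGAs, each a list of core indices placed on it."""
    fpgas = [[]]
    used = 0
    for idx, cost in enumerate(core_costs):
        if cost > fpga_budget:
            raise ValueError(f"core {idx} cannot fit on any FPGA")
        if used + cost > fpga_budget:
            fpgas.append([])               # current FPGA full: start a new one
            used = 0
        fpgas[-1].append(idx)
        used += cost
    return fpgas
```

Keeping consecutive layers together suits a feed-forward pipeline, since only the cut points between FPGAs need network bridges.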


SLIDE 31

Machine Learning to Galapagos (ML2G)

  • Adds the appropriate bridges on the interfaces of the FPGAs
  • Creates the local connections for kernels on the same FPGA

SLIDE 32

RESULTS

SLIDE 33

Experiment Setup

  • CPUs
    – Xeon E5-2650
      • 24 cores at 2.2 GHz
  • FPGAs
    – Fidus Sidewinder
      • ZU19EG FPGA
        – ~1 million logic cells, 35 MB BRAM, 1968 DSP slices
      • 100 Gbps network interface
        – 100 Gbps UDP core

SLIDE 34

Microbenchmarks

  Link                  | Latency    | Throughput
  Software to Hardware  | 0.029 ms   | 0.244 GB/s
  Hardware to Hardware  | 0.00017 ms | 100 Gbps
  Hardware to Software  | 0.0203 ms  | N/A

  • Latency: sending a single flit
  • Throughput: maximum throughput of the link (varying packet size for software)

SLIDE 35

Microbenchmarks

  • Software to hardware: the larger the packet, the higher the throughput
  • UDP packet size is limited
    – No segmentation
    – MTU size
    – Jumbo frames: 8K

slide-36
SLIDE 36

Microbenchmarks

Link Latency Throughput Software to Hardware 0.029 ms 0.244 GB/s Hardware to Hardware 0.00017 ms 100 GB/s Hardware to Software 0.0203 ms N/A

November 13, 2020

  • Line-rate,

same throughput at small and large packet size

H2RC 2020

SLIDE 37

Microbenchmarks

  • Hardware to software: HW sends at line rate
  • With UDP, SW can't keep up and we see packet drops

SLIDE 38

Small Neural Network: Results

  • Single CPU, single FPGA; used in a physics application to calculate the energy of a particle
  • 16K inferences
  • SDAccel (without AIgean): 3 ms
  • AIgean: 6.3 ms
    – Latency of a single inference is 0.08 ms; we can measure this since we stream, which is not possible via SDAccel
  • Bottleneck: sending data to the FPGA via the CPU network link
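The streaming point is easiest to see with the arithmetic behind these numbers (reading "16K" as 16,000 inferences, an assumption): the total time is far less than the per-inference latency times the batch size, because the pipeline overlaps inferences.

```python
# Quick arithmetic from the slide's numbers: 16K inferences finish in
# 6.3 ms total even though one inference has 0.08 ms latency, because
# streaming overlaps many inferences in flight at once.

inferences = 16_000          # "16K", assumed decimal
total_ms = 6.3               # AIgean end-to-end time
latency_ms = 0.08            # single-inference latency

throughput_per_s = inferences / (total_ms / 1000.0)  # inferences per second
serial_time_ms = inferences * latency_ms             # time if not pipelined
```

The implied throughput is over two million inferences per second, versus roughly 1.3 seconds if each inference had to wait for the previous one to finish.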

SLIDE 39

Small Neural Network: Takeaway

  • Comparison vs. SDAccel shows that a network link to a single FPGA can be competitive with PCIe
    – The network link wins on scalability: many more FPGAs are reachable over the network than over PCIe
  • Can stream data
    – Latency of a single inference is much lower
  • Should target a larger application
    – We can do this since we have a large multi-FPGA fabric!

SLIDE 40

Autoencoder: Results

  • Autoencoder implemented both in SDAccel on a single FPGA and in AIgean using 3 FPGAs
  • SDAccel: single FPGA, higher reuse factor needed to fit the logic
    – 0.26 ms
  • AIgean: three FPGAs
    – 0.08 ms, more than a 3x improvement

SLIDE 41

Autoencoder: Takeaway

  • Using a larger fabric allows us to implement larger circuits
  • The difficulty of communication between multiple FPGAs is abstracted away

SLIDE 42

ResNet-50

  • IP cores currently implemented at 6600 images/second (slightly better than Brainwave)
  • Prototype in software working
  • Bridges working at line rate
  • 12 FPGA bitstreams currently being synthesized and tested
  • In the pipeline: 30000 images/second

SLIDE 43

SUMMARY AND CONCLUSION

SLIDE 44

Summary

  • Multi-FPGA/CPU neural net framework built by leveraging and combining the HLS4ML and Galapagos frameworks
  • Tunable IP cores, flexible communication
  • ML HLS IP cores deployed onto a cluster of network-connected FPGAs and CPUs
  • Communication abstracted away from the user

SLIDE 45

Conclusions

  • Network-connected FPGAs/CPUs are more scalable than traditional PCIe
  • Creating larger fabrics with network-connected FPGAs opens the door to more complex algorithms
  • Many opportunities to explore in multi-FPGA ML
  • Galapagos provides a good foundation for multi-FPGA applications

SLIDE 46

Acknowledgments

SLIDE 47

Thank You

  • Email: pc@eecg.toronto.edu