SLIDE 1

Ken O’Brien, Nicholas Fraser, Michaela Blott
27th March 2019

Heterogeneous Compute Architectures For Deep Learning In The Cloud

SLIDE 2

Outline

˃ Why FPGAs?
˃ Deep Learning: Challenges & Solutions
˃ FINN
˃ FPGAs to ACAPs

SLIDE 3

Mega-Trend: Explosion of Data

˃ Astronomically growing amounts of data

More sensors
More users
More use cases, e.g. genomics (DNA): "genomical"

Stephens, Zachary D., et al. "Big data: Astronomical or genomical?" PLoS Biology 13.7 (2015).

[Chart: data acquisition in 2025 — projected storage in exabytes/year: Astronomy 1, Twitter 0.1, YouTube 2, Genomics 21]

We need significantly more compute resources to process and extract patterns / insights from this data!

SLIDE 4

Technology: End of Moore’s Law & Dennard Scaling

Economics become questionable
Power dissipation becomes problematic

SLIDE 5

Technology Trends

Era of Heterogeneous Compute using Accelerators

˃ Diversification of increasingly heterogeneous devices and systems

Moving away from standard von Neumann architectures

˃ True Architectural innovation & Unconventional Computing Systems

SLIDE 6

Deep Learning

  • Customized precision arithmetic
SLIDE 7

What’s the Challenge? Example: Convolutional Neural Networks

[Diagram: input image → neural network → "Cat?" — forward pass (inference)]

For ResNet50:
70 layers
7.7 billion operations
25.5 million weights

Basic arithmetic, incredibly parallel, but huge compute and memory requirements

SLIDE 8

Compute and Memory for Inference

[Chart: spectrum of neural networks — average compute (GOPS) and memory (MBytes) per single-input inference for MLPs, ImageNet classification CNNs, object detection, semantic segmentation, OCR, and speech recognition; architecture independent, 1 image forward pass, batch = 1, int8]

Huge Compute and Memory Requirements & Variations

SLIDE 9

Floating Point to Reduced Precision: Neural Networks Deliver Competitive Accuracy

[Chart: ImageNet classification top-5 error (%) vs. publication date, 2009–2019; series: CNN, BNN, reduced precision, internal results]

Floating-point improvements are slowing down; reduced precision delivers competitive accuracy

SLIDE 10

Reducing Precision Scales Performance & Reduces Memory

˃ Reducing precision shrinks LUT cost

Instantiate 100x more compute within the same fabric

˃ Potential to reduce memory footprint

NN model can stay on-chip => no memory bottlenecks

Precision   Model size [MB] (ResNet50)
1b          3.2
8b          25.5
32b         102.5

Compute cost: C = size of accumulator × size of weight × size of activation
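To make the numbers concrete, a minimal sketch of the slide’s arithmetic (the 25.5M weight count comes from the slide; the 32-bit accumulator and matching activation widths are assumptions for illustration):

    #include <cstdio>

    int main() {
        const double weights = 25.5e6; // ResNet50 weight count (from the slide)
        const int accum_bits = 32;     // assumed accumulator width

        const int widths[] = {1, 8, 32};
        for (int wbits : widths) {
            // Model size: weight count * bits per weight, converted to MB.
            double mbytes = weights * wbits / 8.0 / 1e6;
            // Slide's cost heuristic: C = accumulator bits * weight bits
            // * activation bits, assuming activations match weight width.
            long c = (long)accum_bits * wbits * wbits;
            printf("%2db weights: %6.1f MB, cost C = %ld\n", wbits, mbytes, c);
        }
        return 0;
    }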

SLIDE 11

Reducing Precision Inherently Saves Power

Source: Bill Dally (Stanford), Cadence Embedded Neural Network Summit, February 1, 2017.

[Chart: LSTM test error (%) vs. estimated power consumption (W) for FPGA and ASIC, with Pareto-optimal bit widths (W/A) of 2/2, 2/3, 3/4, 2/4, 2/8, 4/4, 3/8, 8/8, 3/3, 4/8; target device ZU7EV, ambient temperature 25 °C, 12.5% toggle rate, 0.5 static probability, power reported for the PL accelerated block only]

Rybalkin, V., Pappalardo, A., Ghaffar, M.M., Gambardella, G., Wehn, N. and Blott, M. "FINN-L: Library Extensions and Design Trade-off Analysis for Variable Precision LSTM Networks on FPGAs." FPL (2018).

SLIDE 12

Design Space Trade-Offs

[Chart: ImageNet classification top-5 validation error (%) vs. compute cost f(LUT, DSP) = LUTs + 100*DSPs, log scale; series: 1b, 2b, 5b, 8b, and floating-point weights, minifloat, ResNet-50, SYQ]

ResNet18 8b/8b: compute cost 286, error 10.68%
ResNet50 2b/8b: compute cost 127, error 9.86%

Reduced Precision can

  • reduce cost / resources
  • save power
  • scale performance

Pareto-optimal solutions

SLIDE 13

Scaling with FINN

SLIDE 14

FINN – Tool for Exploration of NNs on FPGAs

˃ Design Flow Tool for Quantized Neural Networks

Rapid access to network structure and compute/memory footprint statistics
Performance prediction for target device
Automatic architecture scaling and generation for target device

˃ Multi-stage tool-flow

Frontend → Design Space Exploration → Backend

˃ Binary Network Release Available

https://github.com/Xilinx/FINN

SLIDE 15

HW Architecture – Dataflow

[Diagram: input image → Layer 0 → Layer 1 → … → Layer X-1 → inference output; each layer has its own weight buffer]

SLIDE 16

HW Architecture – Dataflow

[Diagram: same pipeline as Slide 15, with weight buffers held in on-chip memory]

Weight buffering in on-chip memory

  • High operational intensity for inference
  • Small intermediate buffers for feature maps
  • No data reordering between layers
  • Multi-line buffering for convolutions
  • Low latency, high throughput
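A minimal HLS-style sketch of this per-layer streaming organization (hypothetical layer functions and 8-bit stream widths for illustration; not the FINN library’s actual API):

    #include <hls_stream.h>
    #include <ap_int.h>

    // Hypothetical per-layer engines: each reads an activation stream,
    // applies its on-chip weights, and writes an output stream.
    void layer0(hls::stream<ap_uint<8>> &in, hls::stream<ap_uint<8>> &out);
    void layer1(hls::stream<ap_uint<8>> &in, hls::stream<ap_uint<8>> &out);
    void layer2(hls::stream<ap_uint<8>> &in, hls::stream<ap_uint<8>> &out);

    // Top level: all layers run concurrently as a pipeline; only small
    // FIFOs for feature maps sit between them, so there is no data
    // reordering and no off-chip traffic between layers.
    void network(hls::stream<ap_uint<8>> &in, hls::stream<ap_uint<8>> &out) {
    #pragma HLS DATAFLOW
        hls::stream<ap_uint<8>> fifo0, fifo1;
        layer0(in, fifo0);
        layer1(fifo0, fifo1);
        layer2(fifo1, out);
    }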
SLIDE 17

HW Architecture – Dataflow

[Diagram: same pipeline; Layer 0’s engine built from DSPs, Layer 1’s from LUT-MACs]

1 compute engine per layer

  • Ad-hoc arithmetic according to each layer’s quantization

SLIDE 18

HW Architecture – Dataflow

[Diagram: same pipeline; each layer’s engine scaled to a different number of processing elements (PEs)]

1 compute engine per layer

  • Adjust parallelism to each layer’s compute requirements

SLIDE 19

Frontend Stage – Import and Network Statistics

[Diagram: neural network description (Prototxt) → FINN → per-layer operations, topology summary]

SLIDE 20

Design Space Exploration Stage: Balanced Dataflow

[Diagram: neural network description + device specification file → FINN → folding factor calculation, performance prediction]

SLIDE 21

Convolutional Layer – Folding

[Diagram: input feature map (height × width × channels) fed through SIMD lanes into PEs producing the output feature map; SIMD folds over input channels, PEs fold over output channels]
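Under this folding scheme the per-layer cycle count follows directly from the folding factors; a sketch (variable names are illustrative, not FINN’s):

    // Cycles per frame for one convolutional layer under SIMD/PE folding.
    // SIMD lanes parallelize input channels, PEs parallelize output
    // channels; everything not parallelized is iterated in time.
    long conv_layer_cycles(int ofm_dim, int kernel,
                           int ifm_chans, int ofm_chans,
                           int simd, int pe) {
        long fold_in  = ifm_chans / simd; // SIMD must divide IFM_CHANS
        long fold_out = ofm_chans / pe;   // PE must divide OFM_CHANS
        return (long)ofm_dim * ofm_dim    // every output pixel
             * kernel * kernel            // every kernel position
             * fold_in * fold_out;        // serialized channel work
    }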

SLIDE 22

Design Space Exploration Stage: Balanced Dataflow

[Diagram: neural network description + device specification file → FINN → folding factor calculation, performance prediction]

1: Given a target FPS, what resources are required?
2: Given total resources, what FPS can be achieved?
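Since the layers form a dataflow pipeline, steady-state throughput is bounded by the slowest layer; a minimal sketch of the prediction (clock frequency and per-layer cycle counts are inputs, e.g. from the folding formula above):

    #include <algorithm>
    #include <vector>

    // Predicted frames/second: the layer needing the most cycles per
    // frame limits the pipeline's steady-state throughput.
    double predict_fps(const std::vector<long> &layer_cycles, double clock_hz) {
        long worst = *std::max_element(layer_cycles.begin(), layer_cycles.end());
        return clock_hz / worst;
    }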

SLIDE 23

Vivado HLS – QNN Library

Layer-specific configuration values

– Support for multiple padding modes, in this case "same" padding

Implementation-specific parallelism values

– Folding factors

Precision configuration values

– Independent precision for input/output activations and weights, and signed/unsigned math
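A hypothetical layer template in this style, with shape, folding, and precision as compile-time parameters (a simplified matrix-vector unit for illustration; names and signatures are invented, not the actual QNN library API):

    #include <hls_stream.h>
    #include <ap_int.h>

    template <int ROWS, int COLS,   // layer dimensions (outputs x inputs)
              int SIMD, int PE,     // folding factors
              int WBITS, int ABITS, // weight / activation precision
              int OBITS>            // output precision
    void mvau(hls::stream<ap_uint<SIMD * ABITS>> &in,
              hls::stream<ap_uint<PE * OBITS>> &out,
              const ap_int<WBITS> weights[PE][ROWS / PE][COLS / SIMD][SIMD]) {
        // Buffer the input vector once so it can be replayed for each
        // group of PE output channels.
        ap_uint<SIMD * ABITS> inbuf[COLS / SIMD];
        for (int c = 0; c < COLS / SIMD; c++)
            inbuf[c] = in.read();

        for (int r = 0; r < ROWS / PE; r++) {       // output-channel fold
            ap_int<32> acc[PE];
            for (int p = 0; p < PE; p++) acc[p] = 0;
            for (int c = 0; c < COLS / SIMD; c++) { // input-channel fold
    #pragma HLS PIPELINE II=1
                for (int p = 0; p < PE; p++)         // PE parallelism
                    for (int s = 0; s < SIMD; s++) { // SIMD parallelism
                        ap_uint<ABITS> a =
                            inbuf[c]((s + 1) * ABITS - 1, s * ABITS);
                        acc[p] += weights[p][r][c][s] * a;
                    }
            }
            // Truncate accumulators to OBITS (a real layer would apply
            // thresholding / activation here) and pack one word per PE group.
            ap_uint<PE * OBITS> word = 0;
            for (int p = 0; p < PE; p++)
                word((p + 1) * OBITS - 1, p * OBITS) = ap_uint<OBITS>(acc[p]);
            out.write(word);
        }
    }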

SLIDE 24

Backend Stage – Hardware/Runtime Generation

[Diagram: neural network description + optimal folding factors + device specification file → FINN + FINN QNN library → hardware generation]

SLIDE 25

Hardware Generation – Network Dataflow Example

˃ top.cpp

Sequence of layers, 1:1 with network topology

˃ config.h

FINN-generated configuration, with network configuration values + parallelism-specific values

˃ (optional) params.h

FINN-generated weight values to be hardcoded into the bitstream
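A hedged sketch of what such a generated top level might look like (the layer functions and the constants from config.h / params.h are invented for illustration; FINN’s actual generated code differs):

    // top.cpp (illustrative): one library call per layer, 1:1 with topology
    #include <hls_stream.h>
    #include <ap_int.h>
    #include "config.h" // FINN-generated shapes & folding (e.g. L0_SIMD, L0_PE)
    #include "params.h" // FINN-generated weights, baked into the bitstream

    void top(hls::stream<ap_uint<L0_SIMD * L0_ABITS>> &in,
             hls::stream<ap_uint<L2_PE * L2_OBITS>> &out) {
    #pragma HLS DATAFLOW
        hls::stream<ap_uint<L1_SIMD * L1_ABITS>> s01;
        hls::stream<ap_uint<L2_SIMD * L2_ABITS>> s12;
        conv_layer<L0_K, L0_CH_IN, L0_CH_OUT, L0_SIMD, L0_PE,
                   L0_WBITS, L0_ABITS>(in, s01, L0_WEIGHTS);
        conv_layer<L1_K, L1_CH_IN, L1_CH_OUT, L1_SIMD, L1_PE,
                   L1_WBITS, L1_ABITS>(s01, s12, L1_WEIGHTS);
        fc_layer<L2_ROWS, L2_COLS, L2_SIMD, L2_PE,
                 L2_WBITS, L2_ABITS>(s12, out, L2_WEIGHTS);
    }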

SLIDE 26

Scaling Parallelism

For each layer, set all SIMD, PE to 1

– Single MAC

Until hardware no longer fits on the device or the FPS target is reached:

– Find the slowest layer

  • Increase its SIMD to the next factor of IFM_CHANS, or
  • Increase its PE to the next factor of OFM_CHANS

Goal: calculate folding factors such that the layers produce a balanced dataflow (see the sketch below)
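A compact sketch of this greedy balancing loop (the cycle and resource models here are toy stand-ins, not FINN’s implementation):

    #include <algorithm>
    #include <vector>

    struct Layer {
        long ops;                 // total MACs per frame in this layer
        int ifm_chans, ofm_chans; // channel counts
        int simd = 1, pe = 1;     // start from a single MAC
        long cycles() const { return ops / ((long)simd * pe); }
        long luts() const { return 100L * simd * pe; } // toy resource model
    };

    // Next divisor of n above cur (folding factors must divide evenly).
    static int next_factor(int n, int cur) {
        for (int f = cur + 1; f <= n; ++f)
            if (n % f == 0) return f;
        return cur;
    }

    void scale_parallelism(std::vector<Layer> &net, long lut_budget,
                           long target_cycles_per_frame) {
        for (;;) {
            // The slowest layer bounds pipeline throughput, so grow it first.
            auto slow = std::max_element(net.begin(), net.end(),
                [](const Layer &a, const Layer &b) { return a.cycles() < b.cycles(); });
            if (slow->cycles() <= target_cycles_per_frame)
                return;                       // FPS target reached
            Layer trial = *slow;
            if (trial.simd < trial.ifm_chans)
                trial.simd = next_factor(trial.ifm_chans, trial.simd);
            else if (trial.pe < trial.ofm_chans)
                trial.pe = next_factor(trial.ofm_chans, trial.pe);
            else
                return;                       // slowest layer fully unrolled
            long total = 0;                   // check the device still fits
            for (const Layer &l : net)
                total += (&l == &*slow) ? trial.luts() : l.luts();
            if (total > lut_budget)
                return;                       // would no longer fit: stop
            *slow = trial;
        }
    }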

SLIDE 27

FINN Performance Results

˃ Up to 50 TOPS measured performance for BNNs

Network         Platform   Precision (W/A)   Performance (TOPS)
MLP             AWS-F1     1/1               50.8
CNV             AWS-F1     1/1               12.1
Tincy-YOLO      AWS-F1     1/3               5.3
DoReFa-Net/PF   AWS-F1     1/2               11.4

˃ Multiple precision types supported

8-bit in DSPs, reduced precision in LUTs

Blott, M., et al. "FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks." ACM TRETS (2018).

SLIDE 28

From FPGAs to ACAPs

SLIDE 29

New Heterogeneous Devices

[Diagram: ACAP block diagram — AI Engines (array of SW PEs), Programmable Logic (LUT, BRAM, DSP, URAM), Processing System (application processor, real-time processor), and I/O (GT, AMS: transceivers, PCIe, DDR, HBM), all connected by a NoC]

˃ From the Xilinx World: Evolution of FPGAs to ACAPs

Up to ~147 TOPS of Int8 performance!

SLIDE 30

Conclusions

˃ As Moore’s Law has ended, heterogeneous accelerated systems have emerged
˃ The high computational demand of machine learning applications is driving hardware development
˃ Customized dataflow architectures, memory subsystems, and custom precisions:

  • Dramatic performance scaling and energy efficiency benefits
  • Target Datacenter or Embedded devices
  • Enabling new exciting trade-offs within the design space

˃ New ACAP devices with AI engines


SLIDE 31

Adaptable. Intelligent.

Thanks!