Heterogeneous Compute Architectures For Deep Learning In The Cloud
Ken O’Brien, Nicholas Fraser, Michaela Blott
27th March 2019
Outline
˃ Why FPGAs?
˃ Deep Learning: Challenges & Solutions
˃ FINN
˃ FPGAs to ACAPs
Mega-Trend: Explosion of Data
˃ Astronomically growing amounts of data
More sensors, more users, more use cases: genomics (DNA) data is becoming "genomical"
Stephens, Zachary D., et al. "Big data: astronomical or genomical?" PLoS Biology 13.7 (2015).
[Figure: projected data acquisition in 2025, storage in exabytes/year: Astronomy ~1, Twitter ~0.1, YouTube ~2, Genomics ~21]
We need significantly more compute resources to process and extract patterns / insights from this data!
Technology: End of Moore’s Law & Dennard Scaling
˃ Economics become questionable
˃ Power dissipation becomes problematic
Technology Trends
Era of Heterogeneous Compute using Accelerators
˃ Diversification of increasingly heterogeneous devices and systems
Moving away from standard von Neumann architectures
˃ True architectural innovation & unconventional computing systems
Deep Learning
- customized precision arithmetic
[Figure: image classification example: input image → neural network → "Cat?"]
What’s the Challenge? Example: Convolutional Neural Networks
Forward Pass (Inference)
For ResNet50: 70 layers, 7.7 billion operations, 25.5 million weights
Basic arithmetic, incredibly parallel, but huge compute and memory requirements
Compute and Memory for Inference
[Figure: spectrum of neural networks: average GOPS and MBytes per inference (1 input, batch = 1, int8, architecture-independent) for MLPs, ImageNet classification CNNs, object detection, semantic segmentation, OCR, and speech recognition]
Huge Compute and Memory Requirements & Variations
[Figure: ImageNet classification top-5 error (%) vs publication date, 2009-2019: CNNs, BNNs, and other reduced-precision networks converge toward competitive accuracy]
Floating Point to Reduced Precision: Neural Networks Deliver Competitive Accuracy
˃ Floating-point accuracy improvements are slowing down
˃ Reduced-precision networks achieve competitive accuracy
Reducing Precision Scales Performance & Reduces Memory
˃ Reducing precision shrinks LUT cost
Instantiate 100x more compute within the same fabric
˃ Potential to reduce memory footprint
NN model can stay on-chip => no memory bottlenecks

Precision   Model size [MB] (ResNet50)
1b          3.2
8b          25.5
32b         102.5

C = size of accumulator × size of weight × size of activation
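As a quick sanity check on the table, here is a minimal sketch (plain C++, not part of FINN) that derives the model sizes above from the weight count:

#include <cstdio>

// Model size in MB for a network with a given weight count and precision;
// 25.5M weights is the ResNet50 figure quoted earlier.
double model_size_mb(double num_weights, int bits_per_weight) {
    return num_weights * bits_per_weight / 8.0 / 1e6;
}

int main() {
    const double resnet50_weights = 25.5e6;
    for (int bits : {1, 8, 32})
        std::printf("%2db -> %6.1f MB\n", bits, model_size_mb(resnet50_weights, bits));
    // Prints ~3.2, 25.5, and 102.0 MB, matching the table (up to rounding).
}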
Reducing Precision Inherently Saves Power
Source: Bill Dally (Stanford), Cadence Embedded Neural Network Summit, February 1, 2017
[Figure: LSTM test error (%) vs estimated power consumption (W) for different weight/activation bit widths, with Pareto-optimal points marked. Target device ZU7EV; ambient temperature 25 °C; 12.5% toggle rate; 0.5 static probability; power reported for the PL accelerated block only]
Rybalkin, V., Pappalardo, A., Ghaffar, M.M., Gambardella, G., Wehn, N. and Blott, M. "FINN-L: Library Extensions and Design Trade-off Analysis for Variable Precision LSTM Networks on FPGAs." FPL 2018.
Design Space Trade-Offs
[Figure: ImageNet classification top-5 validation error (%) vs compute cost F(LUT, DSP) = LUTs + 100×DSPs (log scale), for weight/activation precisions from 2/2 to 8/8, comparing 1b, 2b, 5b, and 8b weights, floating-point weights, minifloat, ResNet-50, and SYQ]
Example: ResNet18 at 8b/8b has compute cost 286 with 10.68% error, while ResNet50 at 2b/8b has compute cost 127 with 9.86% error: lower cost and better accuracy.
Reduced precision can:
- reduce cost / resources
- save power
- scale performance
˃ Pareto-optimal solutions (see the sketch below)
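To make "Pareto-optimal" concrete, here is a minimal, self-contained sketch (plain C++, not FINN code) that filters design points in the (compute cost, error) plane, using the slide's cost metric and the two ResNet data points quoted above:

#include <algorithm>
#include <cstdio>
#include <vector>

// A design point in the accuracy/cost plane, using the slide's cost
// metric F(LUT, DSP) = LUTs + 100*DSPs.
struct DesignPoint {
    const char* name;
    double cost;   // LUTs + 100*DSPs (relative units)
    double error;  // top-5 validation error, %
};

// A point is Pareto-optimal if no other point is at least as good in
// both cost and error, and strictly better in one.
std::vector<DesignPoint> pareto_front(const std::vector<DesignPoint>& pts) {
    std::vector<DesignPoint> front;
    for (const auto& p : pts) {
        bool dominated = std::any_of(pts.begin(), pts.end(), [&](const DesignPoint& q) {
            return (q.cost <= p.cost && q.error <= p.error) &&
                   (q.cost < p.cost || q.error < p.error);
        });
        if (!dominated) front.push_back(p);
    }
    return front;
}

int main() {
    // The two data points quoted on the slide.
    std::vector<DesignPoint> pts = {
        {"ResNet18 8b/8b", 286, 10.68},
        {"ResNet50 2b/8b", 127, 9.86},
    };
    for (const auto& p : pareto_front(pts))
        std::printf("Pareto-optimal: %s (cost %.0f, error %.2f%%)\n", p.name, p.cost, p.error);
    // Here ResNet50 2b/8b dominates ResNet18 8b/8b on both axes.
}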
Scaling with FINN
˃ Design Flow Tool for Quantized Neural Networks
Rapid access to network structure and compute/memory footprint statistics
Performance prediction for a target device
Automatic architecture scaling and generation for the target device
˃ Multi-stage tool-flow
Frontend → Design Space Exploration → Backend
˃ Binary Network Release Available
https://github.com/Xilinx/FINN
FINN – A Tool for Exploration of NNs on FPGAs
HW Architecture – Dataflow
[Figure: dataflow pipeline: the input image streams through Layer 0, Layer 1, …, Layer X-1, each layer with its own on-chip weight buffer, producing the inference output]
˃ Weight buffering in on-chip memory
- High operational intensity for inference
- Small intermediate buffers for feature maps
- No data reordering between layers
- Multi-line buffering for convolutions
- Low latency, high throughput
˃ One compute engine per layer (arrays of PEs built from DSPs or LUT-based MACs)
- Ad-hoc arithmetic according to each layer's quantization
- Parallelism adjusted to each layer's compute requirements
A minimal sketch of this streaming style follows below.
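The sketch below illustrates the dataflow style in Vivado HLS terms: one function per layer, weights held on-chip, and layers connected by FIFO streams so that all layers run concurrently. This is illustrative only, not FINN's generated code; all sizes, types, and names are assumptions.

#include <ap_int.h>
#include <hls_stream.h>

typedef ap_uint<8> act_t;  // illustrative 8-bit activations

// One streaming compute engine for a small fully-connected layer;
// quantized 2-bit weights stay in on-chip memory.
template <int IN, int OUT>
void fc_layer(hls::stream<act_t>& in, hls::stream<act_t>& out,
              const ap_int<2> weights[OUT][IN]) {
    act_t buf[IN];
    for (int i = 0; i < IN; i++) buf[i] = in.read();
    for (int o = 0; o < OUT; o++) {
        ap_int<24> acc = 0;
        for (int i = 0; i < IN; i++) {
#pragma HLS PIPELINE II=1
            acc += weights[o][i] * buf[i];
        }
        out.write(acc > 0 ? act_t(acc) : act_t(0));  // stand-in activation function
    }
}

void top(hls::stream<act_t>& img, hls::stream<act_t>& result) {
#pragma HLS DATAFLOW
    // Weights baked into the design (zero-initialized placeholders here).
    static const ap_int<2> w0[64][256] = {};
    static const ap_int<2> w1[10][64] = {};
    hls::stream<act_t> l0_to_l1("l0_to_l1");  // small FIFO between layers
    fc_layer<256, 64>(img, l0_to_l1, w0);     // Layer 0
    fc_layer<64, 10>(l0_to_l1, result, w1);   // Layer 1
}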
Frontend Stage – Import and Network Statistics
˃ Inputs: neural network description (Prototxt), device specification file
˃ Outputs: per-layer operation counts, topology summary, folding factor calculation, performance prediction (a sketch of the per-layer statistics follows below)
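A minimal sketch (plain C++, not FINN's frontend; the topology below is hypothetical) of the kind of per-layer statistics the frontend reports:

#include <cstdio>
#include <vector>

// Shape and precision of a convolutional layer.
struct ConvLayer {
    const char* name;
    int k, ifm_ch, ofm_ch, ofm_h, ofm_w;  // kernel size, channels, output dims
    int wbits;                            // weight precision in bits
};

int main() {
    std::vector<ConvLayer> net = {
        // Hypothetical topology, for illustration only.
        {"conv0", 3, 3, 64, 224, 224, 8},
        {"conv1", 3, 64, 64, 112, 112, 2},
    };
    for (const auto& l : net) {
        // One MAC = 2 ops; each output pixel of each output channel
        // needs k*k*ifm_ch MACs.
        double ops = 2.0 * l.k * l.k * l.ifm_ch * l.ofm_ch * (double)l.ofm_h * l.ofm_w;
        double weight_kb = l.k * l.k * l.ifm_ch * (double)l.ofm_ch * l.wbits / 8.0 / 1024.0;
        std::printf("%s: %.2f GOPS, %.1f KiB weights\n", l.name, ops / 1e9, weight_kb);
    }
}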
Design Space Exploration Stage: Balanced Dataflow
˃ Given the network description, FINN answers two questions:
1: Given a target FPS, what resources are required?
2: Given total resources, what FPS can be achieved?
Convolutional Layer – Folding
[Figure: folding of a convolutional layer: the input feature map (height × width × channels) is processed SIMD input channels at a time by each of PE parallel processing elements against the weights, producing the output feature map; the device specification file feeds folding factor calculation and performance prediction]
The folding arithmetic is sketched below.
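A minimal sketch of that arithmetic (plain C++, illustrative names and numbers, not FINN's API): SIMD folds the input channels, PE folds the output channels, and the remaining fold gives cycles per output pixel:

#include <cstdio>

int main() {
    const int K = 3, IFM_CH = 64, OFM_CH = 64;  // layer shape (example)
    const int OFM_DIM = 56;                     // output feature map is 56x56
    const int SIMD = 16, PE = 4;                // parallelism (must divide the channel counts)

    long long macs_per_pixel = (long long)K * K * IFM_CH * OFM_CH;
    long long macs_per_cycle = (long long)SIMD * PE;
    long long cycles_per_pixel = macs_per_pixel / macs_per_cycle;  // the "fold"
    long long layer_cycles = cycles_per_pixel * OFM_DIM * OFM_DIM;

    // At e.g. 200 MHz, the layer's throughput in frames/s is freq / layer_cycles.
    double fps = 200e6 / (double)layer_cycles;
    std::printf("fold = %lld cycles/pixel, %lld cycles/frame, ~%.0f FPS\n",
                cycles_per_pixel, layer_cycles, fps);
}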
Vivado HLS – QNN Library
˃ Layer-specific configuration values
– Support for multiple padding modes (in this case, "same" padding)
˃ Implementation-specific parallelism values
– Folding factors
˃ Precision configuration values
– Independent precision for input/output activations and weights, and signed/unsigned math
A hedged sketch of how these become template parameters follows below.
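The sketch below shows how these three kinds of values can be expressed as C++ template parameters; all names here are illustrative assumptions, not the actual FINN QNN library headers:

#include <ap_int.h>

// Select signed/unsigned HLS integer types from a configuration flag.
template <bool SIGNED, int BITS> struct qnn_int { typedef ap_int<BITS> type; };
template <int BITS> struct qnn_int<false, BITS> { typedef ap_uint<BITS> type; };

// Compile-time ceil(log2(n)) for sizing the accumulator.
template <int N> struct clog2 { static const int value = 1 + clog2<(N + 1) / 2>::value; };
template <> struct clog2<1> { static const int value = 0; };

template <
    int K, int IFM_CH, int OFM_CH, int PAD,           // layer-specific (PAD = K/2 for "same")
    int SIMD, int PE,                                 // parallelism: folding factors
    int WBITS, bool WSIGNED, int ABITS, bool ASIGNED  // independent precisions
>
struct ConvConfig {
    typedef typename qnn_int<WSIGNED, WBITS>::type weight_t;
    typedef typename qnn_int<ASIGNED, ABITS>::type act_t;
    // Accumulator sized for K*K*IFM_CH products of WBITS x ABITS values.
    static const int ACC_BITS = WBITS + ABITS + clog2<K * K * IFM_CH>::value;
    typedef ap_int<ACC_BITS> acc_t;
};

// Example instantiation: 3x3 "same"-padded conv, 64->64 channels, SIMD=16,
// PE=4, 2-bit signed weights, 8-bit unsigned activations (ACC_BITS = 20).
typedef ConvConfig<3, 64, 64, 1, 16, 4, 2, true, 8, false> Layer1Config;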
Backend Stage – Hardware / Runtime Generation
Inputs: neural network description, FINN QNN library, optimal folding factors. FINN generates:
˃ top.cpp
Sequence of layers, 1:1 with the network topology
˃ config.h
FINN-generated configuration: network configuration values + parallelism-specific values
˃ (optionally) params.h
FINN-generated weight values to be hardcoded into the bitstream
A hedged sketch of these generated files follows below.
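A sketch of what the generated files can look like; every name and value below is an illustrative assumption, not actual FINN output, and fc_layer stands for a streaming layer template like the one sketched earlier:

// --- config.h: network + parallelism values chosen during exploration ---
#define L0_MW 256   // layer 0 inputs
#define L0_MH 64    // layer 0 outputs
#define L0_SIMD 16  // folding factors from the DSE stage
#define L0_PE 4
#define L1_MW 64
#define L1_MH 10

// --- params.h (optional): weights baked into the bitstream ---
// static const ap_int<2> l0_weights[L0_MH][L0_MW] = { /* generated values */ };

// --- top.cpp: layer calls 1:1 with the network topology ---
#include <ap_int.h>
#include <hls_stream.h>

// Declaration of the streaming layer template (defined as sketched earlier).
template <int MW, int MH>
void fc_layer(hls::stream<ap_uint<8> >& in, hls::stream<ap_uint<8> >& out,
              const ap_int<2> weights[MH][MW]);

static const ap_int<2> l0_weights[L0_MH][L0_MW] = {};  // zero-init placeholders
static const ap_int<2> l1_weights[L1_MH][L1_MW] = {};

void top(hls::stream<ap_uint<8> >& in, hls::stream<ap_uint<8> >& out) {
#pragma HLS DATAFLOW
    hls::stream<ap_uint<8> > inter("inter");
    fc_layer<L0_MW, L0_MH>(in, inter, l0_weights);   // layer 0
    fc_layer<L1_MW, L1_MH>(inter, out, l1_weights);  // layer 1
}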
Hardware Generation – Network Dataflow Example
Scaling Parallelism
Goal: calculate folding factors such that the layers produce a balanced dataflow.
˃ For each layer, set SIMD and PE to 1
– A single MAC per layer
˃ Until the hardware no longer fits on the device, or the FPS target is reached:
– Find the slowest layer
- Increase its SIMD to the next factor of IFM_CHANS, or
- Increase its PE to the next factor of OFM_CHANS
A sketch of this greedy loop follows below.
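A hedged sketch of the greedy search just described (plain C++; data structures, the LUTs-per-MAC resource model, and the "grow SIMD before PE" tie-break are all assumptions, since the slide leaves those details open):

#include <algorithm>
#include <cstdio>
#include <vector>

struct Layer {
    long long macs_per_frame;  // total MACs for one input
    int ifm_chans, ofm_chans;
    int simd, pe;              // start from a single MAC per layer
    long long cycles() const { return macs_per_frame / ((long long)simd * pe); }
};

// Next divisor of n strictly greater than cur (or cur if already maxed).
int next_factor(int cur, int n) {
    for (int f = cur + 1; f <= n; f++)
        if (n % f == 0) return f;
    return cur;
}

// Assumed resource model: LUT cost grows with total SIMD*PE across layers.
bool fits(const std::vector<Layer>& net, long long lut_budget) {
    long long cost = 0;
    for (const auto& l : net) cost += (long long)l.simd * l.pe * 300;  // ~LUTs/MAC, assumed
    return cost <= lut_budget;
}

void scale_parallelism(std::vector<Layer>& net, double fps_target, double clk_hz,
                       long long lut_budget) {
    while (true) {
        // The slowest layer bounds the whole pipeline's throughput.
        auto slow = std::max_element(net.begin(), net.end(),
            [](const Layer& a, const Layer& b) { return a.cycles() < b.cycles(); });
        if (clk_hz / (double)slow->cycles() >= fps_target) break;  // FPS target reached
        Layer trial = *slow;  // tentatively parallelize the slowest layer
        if (trial.simd < trial.ifm_chans)
            trial.simd = next_factor(trial.simd, trial.ifm_chans);
        else if (trial.pe < trial.ofm_chans)
            trial.pe = next_factor(trial.pe, trial.ofm_chans);
        else break;  // layer fully unrolled, cannot go faster
        Layer saved = *slow;
        *slow = trial;
        if (!fits(net, lut_budget)) { *slow = saved; break; }  // no longer fits
    }
}

int main() {
    // Two hypothetical layers: MACs per frame, IFM/OFM channels, SIMD=PE=1.
    std::vector<Layer> net = { {36864LL * 3136, 64, 64, 1, 1},
                               {36864LL * 784, 64, 128, 1, 1} };
    scale_parallelism(net, 100.0, 200e6, 500000);
    for (const auto& l : net)
        std::printf("SIMD=%d PE=%d cycles/frame=%lld\n", l.simd, l.pe, l.cycles());
}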
Performance Results
˃ Up to 50TOPS measured performance for BNNs
Network         Platform   Precision (W/A)   Performance (TOPS)
MLP             AWS-F1     1/1               50.8
CNV             AWS-F1     1/1               12.1
Tincy-YOLO      AWS-F1     1/3                5.3
DoReFa-Net/PF   AWS-F1     1/2               11.4
˃ Multiple precision types supported
8-bit in DSPs, reduced precision in LUTs (see the packing sketch below)
Blott, M., et al. "FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks." ACM TRETS 11.3 (2018).
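"8-bit in DSPs" refers to packing two INT8 multiplies that share one operand into a single wide multiplier, which is what lets one DSP slice serve two MACs. A simplified, plain-C++ sketch of the packing idea (an illustration of the arithmetic, not the exact DSP48 wiring):

#include <cstdint>
#include <cstdio>

int main() {
    int8_t a = 57;            // shared activation
    int8_t w1 = -3, w2 = 17;  // two weights
    // Pack w1 into a high field far enough above w2 that the two
    // products cannot overlap, then do a single wide multiply.
    int64_t packed = ((int64_t)w1 << 18) + w2;
    int64_t prod = packed * a;
    // Unpack: the low 18 bits hold a*w2 (sign-extend the field); the
    // rounded high field holds a*w1 (the +2^17 corrects the borrow).
    int32_t p2 = (int32_t)(prod & 0x3FFFF);
    if (p2 & 0x20000) p2 -= 0x40000;
    int32_t p1 = (int32_t)((prod + (1 << 17)) >> 18);
    std::printf("a*w1 = %d (unpacked %d), a*w2 = %d (unpacked %d)\n",
                a * w1, p1, a * w2, p2);
}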
From FPGAs to ACAPs
New Heterogeneous Devices
[Figure: ACAP block diagram: a network-on-chip (NoC) connects the Programmable Logic (LUTs, BRAM, DSPs, URAM), the Processing System (application processor, real-time processor), an array of AI Engines (software-programmable PEs), and I/O (transceivers/GT, PCIe, DDR, HBM, AMS)]
˃ From the Xilinx World: Evolution of FPGAs to ACAPs
Up to ~147 TOPS of INT8 performance!
Conclusions
˃ As Moore’s Law has ended, heterogeneous accelerated systems have emerged
˃ The high computational demand of machine learning applications is driving hardware development
˃ Customized dataflow architectures, memory subsystems, and custom precisions
- deliver dramatic performance scaling and energy efficiency benefits
- target datacenter or embedded devices
- enable exciting new trade-offs within the design space
˃ New ACAP devices with AI Engines
Adaptable. Intelligent.