SLIDE 1

Ken O’Brien, Nicholas Fraser, Michaela Blott
27th March 2019

Heterogeneous Compute Architectures For Deep Learning In The Cloud

SLIDE 2

Outline

˃ Why FPGAs?
˃ Deep Learning: Challenges & Solutions
˃ FINN
˃ FPGAs to ACAPs

SLIDE 3

Mega-Trend: Explosion of Data

˃ Astronomically growing amounts of data

More sensors
More users
More use cases, e.g. genomics (DNA): "genomical"

Stephens, Zachary D., et al. "Big data: Astronomical or genomical?" PLoS Biology 13.7 (2015).

[Chart: data acquisition in 2025 — projected storage in exabytes/year: Astronomy 1, Twitter 0.1, YouTube 2, Genomics 21]

We need significantly more compute resources to process and extract patterns / insights from this data!

SLIDE 4

Technology: End of Moore’s Law & Dennard Scaling

Economics become questionable
Power dissipation becomes problematic

SLIDE 5

Technology Trends

Era of Heterogeneous Compute using Accelerators

˃ Diversification of increasingly heterogeneous devices and systems

Moving away from standard von Neumann architectures

˃ True Architectural innovation & Unconventional Computing Systems

SLIDE 6

Deep Learning

  • Customized precision arithmetic
SLIDE 7

What’s the Challenge? Example: Convolutional Neural Networks

[Diagram: input image → neural network → "Cat?" — forward pass (inference)]

For ResNet50:
70 layers
7.7 billion operations
25.5 million weights

Basic arithmetic, incredibly parallel, but huge compute and memory requirements

SLIDE 8

Compute and Memory for Inference

[Chart: spectrum of neural networks — average compute (GOPS) and memory (MBytes) per single-input inference for MLPs, ImageNet classification CNNs, object detection, semantic segmentation, OCR, and speech recognition; architecture independent, 1 image forward pass, batch = 1, int8]

Huge Compute and Memory Requirements & Variations

SLIDE 9

Floating Point to Reduced Precision: Neural Networks Deliver Competitive Accuracy

[Chart: ImageNet classification top-5 error (%) vs. publication date, 2009–2019; series: CNN, BNN, reduced precision, internal results]

Floating-point improvements are slowing down; reduced precision delivers competitive accuracy

SLIDE 10

Reducing Precision Scales Performance & Reduces Memory

˃ Reducing precision shrinks LUT cost

Instantiate 100x more compute within the same fabric

˃ Potential to reduce memory footprint

NN model can stay on-chip => no memory bottlenecks

Precision   Model size [MB] (ResNet50)
1b          3.2
8b          25.5
32b         102.5

Compute cost: C = size of accumulator × size of weight × size of activation
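To make the numbers concrete, a minimal sketch of the slide’s arithmetic (the 25.5M weight count comes from the slide; the 32-bit accumulator and matching activation widths are assumptions for illustration):

    #include <cstdio>

    int main() {
        const double weights = 25.5e6; // ResNet50 weight count (from the slide)
        const int accum_bits = 32;     // assumed accumulator width

        const int widths[] = {1, 8, 32};
        for (int wbits : widths) {
            // Model size: weight count * bits per weight, converted to MB.
            double mbytes = weights * wbits / 8.0 / 1e6;
            // Slide's cost heuristic: C = accumulator bits * weight bits
            // * activation bits, assuming activations match weight width.
            long c = (long)accum_bits * wbits * wbits;
            printf("%2db weights: %6.1f MB, cost C = %ld\n", wbits, mbytes, c);
        }
        return 0;
    }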

SLIDE 11

Reducing Precision Inherently Saves Power

Source: Bill Dally (Stanford), Cadence Embedded Neural Network Summit, February 1, 2017.

[Chart: LSTM test error (%) vs. estimated power consumption (W) for FPGA and ASIC, with Pareto-optimal bit widths (W/A) of 2/2, 2/3, 3/4, 2/4, 2/8, 4/4, 3/8, 8/8, 3/3, 4/8; target device ZU7EV, ambient temperature 25 °C, 12.5% toggle rate, 0.5 static probability, power reported for the PL accelerated block only]

Rybalkin, V., Pappalardo, A., Ghaffar, M.M., Gambardella, G., Wehn, N. and Blott, M. "FINN-L: Library Extensions and Design Trade-off Analysis for Variable Precision LSTM Networks on FPGAs." FPL (2018).

SLIDE 12

Design Space Trade-Offs

[Chart: ImageNet classification top-5 validation error (%) vs. compute cost f(LUT, DSP) = LUTs + 100*DSPs, log scale; series: 1b, 2b, 5b, 8b, and floating-point weights, minifloat, ResNet-50, SYQ]

ResNet18 8b/8b: compute cost 286, error 10.68%
ResNet50 2b/8b: compute cost 127, error 9.86%

Reduced Precision can

  • reduce cost / resources
  • save power
  • scale performance

Pareto-optimal solutions

SLIDE 13

Scaling with FINN

SLIDE 14

FINN – Tool for Exploration of NNs on FPGAs

˃ Design Flow Tool for Quantized Neural Networks

Rapid access to network structure and compute/memory footprint statistics
Performance prediction for target device
Automatic architecture scaling and generation for target device

˃ Multi-stage tool-flow

Frontend → Design Space Exploration → Backend

˃ Binary Network Release Available

https://github.com/Xilinx/FINN

SLIDE 15

HW Architecture – Dataflow

[Diagram: input image → Layer 0 → Layer 1 → … → Layer X-1 → inference output; each layer has its own weight buffer]

SLIDE 16

HW Architecture – Dataflow

[Diagram: same pipeline as Slide 15, with weight buffers held in on-chip memory]

Weight buffering in on-chip memory

  • High operational intensity for inference
  • Small intermediate buffers for feature maps
  • No data reordering between layers
  • Multi-line buffering for convolutions
  • Low latency, high throughput
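A minimal HLS-style sketch of this per-layer streaming organization (hypothetical layer functions and 8-bit stream widths for illustration; not the FINN library’s actual API):

    #include <hls_stream.h>
    #include <ap_int.h>

    // Hypothetical per-layer engines: each reads an activation stream,
    // applies its on-chip weights, and writes an output stream.
    void layer0(hls::stream<ap_uint<8>> &in, hls::stream<ap_uint<8>> &out);
    void layer1(hls::stream<ap_uint<8>> &in, hls::stream<ap_uint<8>> &out);
    void layer2(hls::stream<ap_uint<8>> &in, hls::stream<ap_uint<8>> &out);

    // Top level: all layers run concurrently as a pipeline; only small
    // FIFOs for feature maps sit between them, so there is no data
    // reordering and no off-chip traffic between layers.
    void network(hls::stream<ap_uint<8>> &in, hls::stream<ap_uint<8>> &out) {
    #pragma HLS DATAFLOW
        hls::stream<ap_uint<8>> fifo0, fifo1;
        layer0(in, fifo0);
        layer1(fifo0, fifo1);
        layer2(fifo1, out);
    }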
SLIDE 17

HW Architecture – Dataflow

[Diagram: same pipeline; Layer 0’s engine built from DSPs, Layer 1’s from LUT-MACs]

1 compute engine per layer

  • Ad-hoc arithmetic according to each layer’s quantization

SLIDE 18

HW Architecture – Dataflow

[Diagram: same pipeline; each layer’s engine scaled to a different number of processing elements (PEs)]

1 compute engine per layer

  • Adjust parallelism to each layer’s compute requirements

SLIDE 19

Frontend Stage – Import and Network Statistics

[Diagram: neural network description (Prototxt) → FINN → per-layer operations, topology summary]

SLIDE 20

Design Space Exploration Stage: Balanced Dataflow

[Diagram: neural network description + device specification file → FINN → folding factor calculation, performance prediction]

SLIDE 21

Convolutional Layer – Folding

[Diagram: input feature map (height × width × channels) fed through SIMD lanes into PEs producing the output feature map; SIMD folds over input channels, PEs fold over output channels]
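Under this folding scheme the per-layer cycle count follows directly from the folding factors; a sketch (variable names are illustrative, not FINN’s):

    // Cycles per frame for one convolutional layer under SIMD/PE folding.
    // SIMD lanes parallelize input channels, PEs parallelize output
    // channels; everything not parallelized is iterated in time.
    long conv_layer_cycles(int ofm_dim, int kernel,
                           int ifm_chans, int ofm_chans,
                           int simd, int pe) {
        long fold_in  = ifm_chans / simd; // SIMD must divide IFM_CHANS
        long fold_out = ofm_chans / pe;   // PE must divide OFM_CHANS
        return (long)ofm_dim * ofm_dim    // every output pixel
             * kernel * kernel            // every kernel position
             * fold_in * fold_out;        // serialized channel work
    }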

SLIDE 22

Design Space Exploration Stage: Balanced Dataflow

[Diagram: neural network description + device specification file → FINN → folding factor calculation, performance prediction]

1: Given a target FPS, what resources are required?
2: Given total resources, what FPS can be achieved?
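Since the layers form a dataflow pipeline, steady-state throughput is bounded by the slowest layer; a minimal sketch of the prediction (clock frequency and per-layer cycle counts are inputs, e.g. from the folding formula above):

    #include <algorithm>
    #include <vector>

    // Predicted frames/second: the layer needing the most cycles per
    // frame limits the pipeline's steady-state throughput.
    double predict_fps(const std::vector<long> &layer_cycles, double clock_hz) {
        long worst = *std::max_element(layer_cycles.begin(), layer_cycles.end());
        return clock_hz / worst;
    }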

SLIDE 23

Vivado HLS – QNN Library

Layer-specific configuration values

– Support for multiple padding modes, in this case "same" padding

Implementation-specific parallelism values

– Folding factors

Precision configuration values

– Independent precision for input/output activations and weights, and signed/unsigned math
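A hypothetical layer template in this style, with shape, folding, and precision as compile-time parameters (a simplified matrix-vector unit for illustration; names and signatures are invented, not the actual QNN library API):

    #include <hls_stream.h>
    #include <ap_int.h>

    template <int ROWS, int COLS,   // layer dimensions (outputs x inputs)
              int SIMD, int PE,     // folding factors
              int WBITS, int ABITS, // weight / activation precision
              int OBITS>            // output precision
    void mvau(hls::stream<ap_uint<SIMD * ABITS>> &in,
              hls::stream<ap_uint<PE * OBITS>> &out,
              const ap_int<WBITS> weights[PE][ROWS / PE][COLS / SIMD][SIMD]) {
        // Buffer the input vector once so it can be replayed for each
        // group of PE output channels.
        ap_uint<SIMD * ABITS> inbuf[COLS / SIMD];
        for (int c = 0; c < COLS / SIMD; c++)
            inbuf[c] = in.read();

        for (int r = 0; r < ROWS / PE; r++) {       // output-channel fold
            ap_int<32> acc[PE];
            for (int p = 0; p < PE; p++) acc[p] = 0;
            for (int c = 0; c < COLS / SIMD; c++) { // input-channel fold
    #pragma HLS PIPELINE II=1
                for (int p = 0; p < PE; p++)         // PE parallelism
                    for (int s = 0; s < SIMD; s++) { // SIMD parallelism
                        ap_uint<ABITS> a =
                            inbuf[c]((s + 1) * ABITS - 1, s * ABITS);
                        acc[p] += weights[p][r][c][s] * a;
                    }
            }
            // Truncate accumulators to OBITS (a real layer would apply
            // thresholding / activation here) and pack one word per PE group.
            ap_uint<PE * OBITS> word = 0;
            for (int p = 0; p < PE; p++)
                word((p + 1) * OBITS - 1, p * OBITS) = ap_uint<OBITS>(acc[p]);
            out.write(word);
        }
    }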

SLIDE 24

Backend Stage – Hardware/Runtime Generation

[Diagram: neural network description + optimal folding factors + device specification file → FINN + FINN QNN library → hardware generation]

SLIDE 25

Hardware Generation – Network Dataflow Example

˃ top.cpp

Sequence of layers, 1:1 with network topology

˃ config.h

FINN-generated configuration, with network configuration values + parallelism-specific values

˃ (optional) params.h

FINN-generated weight values to be hardcoded into the bitstream
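A hedged sketch of what such a generated top level might look like (the layer functions and the constants from config.h / params.h are invented for illustration; FINN’s actual generated code differs):

    // top.cpp (illustrative): one library call per layer, 1:1 with topology
    #include <hls_stream.h>
    #include <ap_int.h>
    #include "config.h" // FINN-generated shapes & folding (e.g. L0_SIMD, L0_PE)
    #include "params.h" // FINN-generated weights, baked into the bitstream

    void top(hls::stream<ap_uint<L0_SIMD * L0_ABITS>> &in,
             hls::stream<ap_uint<L2_PE * L2_OBITS>> &out) {
    #pragma HLS DATAFLOW
        hls::stream<ap_uint<L1_SIMD * L1_ABITS>> s01;
        hls::stream<ap_uint<L2_SIMD * L2_ABITS>> s12;
        conv_layer<L0_K, L0_CH_IN, L0_CH_OUT, L0_SIMD, L0_PE,
                   L0_WBITS, L0_ABITS>(in, s01, L0_WEIGHTS);
        conv_layer<L1_K, L1_CH_IN, L1_CH_OUT, L1_SIMD, L1_PE,
                   L1_WBITS, L1_ABITS>(s01, s12, L1_WEIGHTS);
        fc_layer<L2_ROWS, L2_COLS, L2_SIMD, L2_PE,
                 L2_WBITS, L2_ABITS>(s12, out, L2_WEIGHTS);
    }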

SLIDE 26

Scaling Parallelism

For each layer, set all SIMD, PE to 1

– Single MAC

Until hardware no longer fits on the device or the FPS target is reached:

– Find the slowest layer

  • Increase its SIMD to the next factor of IFM_CHANS, or
  • Increase its PE to the next factor of OFM_CHANS

Goal: calculate folding factors such that the layers produce a balanced dataflow (see the sketch below)
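A compact sketch of this greedy balancing loop (the cycle and resource models here are toy stand-ins, not FINN’s implementation):

    #include <algorithm>
    #include <vector>

    struct Layer {
        long ops;                 // total MACs per frame in this layer
        int ifm_chans, ofm_chans; // channel counts
        int simd = 1, pe = 1;     // start from a single MAC
        long cycles() const { return ops / ((long)simd * pe); }
        long luts() const { return 100L * simd * pe; } // toy resource model
    };

    // Next divisor of n above cur (folding factors must divide evenly).
    static int next_factor(int n, int cur) {
        for (int f = cur + 1; f <= n; ++f)
            if (n % f == 0) return f;
        return cur;
    }

    void scale_parallelism(std::vector<Layer> &net, long lut_budget,
                           long target_cycles_per_frame) {
        for (;;) {
            // The slowest layer bounds pipeline throughput, so grow it first.
            auto slow = std::max_element(net.begin(), net.end(),
                [](const Layer &a, const Layer &b) { return a.cycles() < b.cycles(); });
            if (slow->cycles() <= target_cycles_per_frame)
                return;                       // FPS target reached
            Layer trial = *slow;
            if (trial.simd < trial.ifm_chans)
                trial.simd = next_factor(trial.ifm_chans, trial.simd);
            else if (trial.pe < trial.ofm_chans)
                trial.pe = next_factor(trial.ofm_chans, trial.pe);
            else
                return;                       // slowest layer fully unrolled
            long total = 0;                   // check the device still fits
            for (const Layer &l : net)
                total += (&l == &*slow) ? trial.luts() : l.luts();
            if (total > lut_budget)
                return;                       // would no longer fit: stop
            *slow = trial;
        }
    }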

SLIDE 27

FINN Performance Results

˃ Up to 50 TOPS measured performance for BNNs

Network         Platform   Precision (W/A)   Performance (TOPS)
MLP             AWS-F1     1/1               50.8
CNV             AWS-F1     1/1               12.1
Tincy-YOLO      AWS-F1     1/3               5.3
DoReFa-Net/PF   AWS-F1     1/2               11.4

˃ Multiple precision types supported

8-bit in DSPs, reduced precision in LUTs

Blott, M., et al. "FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks." ACM TRETS (2018).

SLIDE 28

From FPGAs to ACAPs

SLIDE 29

New Heterogeneous Devices

[Diagram: ACAP block diagram — AI Engines (array of SW PEs), Programmable Logic (LUT, BRAM, DSP, URAM), Processing System (application processor, real-time processor), and I/O (GT, AMS: transceivers, PCIe, DDR, HBM), all connected by a NoC]

˃ From the Xilinx World: Evolution of FPGAs to ACAPs

Up to ~147 TOPS of Int8 performance!

SLIDE 30

Conclusions

˃ As Moore’s Law has ended, heterogeneous accelerated systems have emerged
˃ The high computational demand of machine learning applications is driving hardware development
˃ Customized dataflow architectures, memory subsystems, and custom precisions:

  • Dramatic performance scaling and energy efficiency benefits
  • Target Datacenter or Embedded devices
  • Enabling new exciting trade-offs within the design space

˃ New ACAP devices with AI engines


SLIDE 31

Adaptable. Intelligent.

Thanks!