Heterogeneous Compute Architectures For Deep Learning In The Cloud

  1. Heterogeneous Compute Architectures For Deep Learning In The Cloud. Ken O’Brien, Nicholas Fraser, Michaela Blott, 27th March 2019

  2. Outline ˃ Why FPGAs? ˃ Deep Learning: Challenges & Solutions ˃ FINN ˃ FPGAs to ACAPs

  3. Mega-Trend: Explosion of Data ˃ Astronomically growing amounts of data: more sensors, more users, more use cases such as genomics (DNA), whose data growth is becoming “genomical” ˃ We need significantly more compute resources to process and extract patterns / insights from this data! [Chart: projected data acquisition in 2025, storage in exabytes/year, for Astronomy, Twitter, YouTube and Genomics] Stephens, Zachary D., et al. "Big data: astronomical or genomical?."

  4. Technology: End of Moore’s Law & Dennard Scaling ˃ Economics become questionable ˃ Power dissipation becomes problematic

  5. Era of Heterogeneous Compute using Accelerators ˃ Diversification of increasingly heterogeneous devices and systems: moving away from standard von Neumann architectures ˃ True architectural innovation & unconventional computing systems

  6. Deep Learning - customized precision arithmetic

  7. What’s the Challenge? ˃ Example: Convolutional Neural Networks, forward pass (inference): input image → neural network → “Cat?” ˃ For ResNet-50: 70 layers, 7.7 billion operations, 25.5 million weights ˃ Basic arithmetic and inherently parallel, but huge compute and memory requirements

  8. Spectrum of Neural Networks: Compute and Memory for Inference* [Chart: average GOPS and MBytes per inference (1 input) across workloads: MLP, OCR, ImageNet classification CNNs, object detection, semantic segmentation, speech recognition] ˃ Huge compute and memory requirements & variations *architecture independent, 1 image forward, batch = 1, int8

  9. Floating Point to Reduced Precision: Neural Networks Deliver Competitive Accuracy [Chart: ImageNet classification top-5 error (%) over publication date, 2009–2019, for floating-point CNNs, reduced-precision networks and BNNs] ˃ Floating-point accuracy improvements are slowing down ˃ Reduced precision delivers competitive accuracy

  10. Reducing Precision Scales Performance & Reduces Memory ˃ Reducing precision shrinks LUT cost: instantiate 100x more compute within the same fabric ˃ Potential to reduce memory footprint: NN model can stay on-chip => no memory bottlenecks ˃ Compute cost per operation: C = size of accumulator * size of weight * size of activation

Precision   Model size [MB] (ResNet-50)
1b          3.2
8b          25.5
32b         102.5
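To make the scaling concrete, here is a minimal C++ sketch that reproduces the model-size column from ResNet-50's ~25.5 million weights (slide 7) and evaluates the cost formula above for a few precisions. The accumulator widths are assumptions chosen only for illustration.

```cpp
#include <cstdio>

int main() {
    // ResNet-50 weight count from slide 7; precision column from slide 10.
    const double num_weights = 25.5e6;
    const int weight_bits[] = {1, 8, 32};
    for (int wb : weight_bits) {
        double model_mb = num_weights * wb / 8.0 / 1e6;   // bits -> megabytes
        std::printf("%2d-bit weights: ~%.1f MB model\n", wb, model_mb);
    }

    // Relative per-MAC hardware cost, C = bits(acc) * bits(weight) * bits(act).
    // Accumulator widths below are assumptions for illustration only.
    auto cost = [](int acc, int w, int a) { return acc * w * a; };
    std::printf("relative cost, 32b MAC   : %d\n", cost(32, 32, 32));
    std::printf("relative cost, int8 MAC  : %d\n", cost(32, 8, 8));
    std::printf("relative cost, binary MAC: %d\n", cost(16, 1, 1));
    return 0;
}
```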

  11. Reducing Precision Inherently Saves Power ˃ ASIC: Source: Bill Dally (Stanford), Cadence Embedded Neural Network Summit, February 1, 2017 ˃ FPGA: [Chart: LSTM test error [%] vs estimated power consumption [W] for weight/activation bit widths from 2/2 to 8/8, with the Pareto-optimal points highlighted] Target device ZU7EV; ambient temperature 25 °C; 12.5% toggle rate; 0.5 static probability; power reported for the PL accelerated block only. Rybalkin, V., Pappalardo, A., Ghaffar, M.M., Gambardella, G., Wehn, N. and Blott, M. "FINN-L: Library Extensions and Design Trade-off Analysis for Variable Precision LSTM Networks on FPGAs."

  12. Design Space Trade-Offs [Chart: ImageNet classification top-5 validation error (%) vs compute cost f(LUT, DSP) = LUTs + 100*DSPs, for 1b, 2b, 5b, 8b, minifloat and FP weights, SYQ and ResNet-50; examples: ResNet-18 8b/8b with compute cost 286 and error 10.68%, ResNet-50 2b/8b with compute cost 127 and error 9.86%; Pareto-optimal solutions highlighted] ˃ Reduced precision can: reduce cost / resources, save power, scale performance

  13. Scaling with FINN

  14. FINN – Tool for Exploration of NNs on FPGAs ˃ Design flow tool for quantized neural networks: rapid access to network structure and compute/memory footprint statistics; performance prediction for target device; automatic architecture scaling and generation for target device ˃ Multi-stage tool-flow: Frontend, Design Space Exploration, Backend ˃ Binary network release available: https://github.com/Xilinx/FINN

  15. HW Architecture – Dataflow [Diagram: the input image streams through Layer 0, Layer 1, …, Layer X-1, each with its own on-chip weight buffer, producing the inference output]

  16. HW Architecture – Dataflow ˃ Weight buffering in on-chip memory ˃ Small intermediate buffers for feature maps ˃ Multi-line buffering for convolutions • High operational intensity for inference • No data reordering between layers • Low latency, high throughput [Diagram as on slide 15]

  17. HW Architecture – Dataflow ˃ One compute engine per layer • Ad-hoc arithmetic according to each layer’s quantization (DSP- or LUT-based MACs) [Diagram as on slide 15]

  18. HW Architecture – Dataflow ˃ One compute engine per layer • Parallelism adjusted to each layer’s compute requirements [Diagram as on slide 15, with a different number of PEs instantiated per layer, implemented in DSPs or LUT-MACs]

  19. Frontend Stage – Import and Network Statistics [Diagram: a neural network description (Prototxt) is imported by FINN, which reports per-layer operations and a topology summary]
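As an illustration of the kind of per-layer statistics the frontend reports, here is a minimal sketch; the layer struct, field names and the 2*MACs operation-count convention are assumptions for illustration, not FINN's internal representation.

```cpp
#include <cstdio>

// Hypothetical, simplified conv-layer description (illustration only).
struct ConvLayer {
    const char* name;
    int k;               // kernel size (k x k)
    int ifm_ch, ofm_ch;  // input / output feature-map channels
    int ofm_h, ofm_w;    // output feature-map dimensions
};

// Operations counted as 2 * MACs (one multiply + one add), a common convention.
long long ops(const ConvLayer& l) {
    return 2LL * l.k * l.k * l.ifm_ch * l.ofm_ch * l.ofm_h * l.ofm_w;
}

long long weights(const ConvLayer& l) {
    return 1LL * l.k * l.k * l.ifm_ch * l.ofm_ch;
}

int main() {
    // Example numbers only (a ResNet-style 3x3 layer), not taken from the deck.
    ConvLayer l{"conv3_1", 3, 128, 128, 28, 28};
    std::printf("%s: %lld ops, %lld weights\n", l.name, ops(l), weights(l));
    return 0;
}
```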

  20. Design Space Exploration Stage: Balanced Dataflow [Diagram: the neural network description and a device specification file feed into FINN, which performs folding factor calculation and performance prediction]

  21. Convolutional Layer – Folding [Diagram: the weights of a convolutional layer are tiled across PEs (parallelism over output channels) and SIMD lanes (parallelism over input channels); the input feature map (height x width x channels) streams in and the output feature map streams out]
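A minimal sketch of how folding factors translate into cycles per layer, and therefore throughput, assuming one MAC per PE/SIMD lane per cycle and that SIMD/PE divide the channel counts as described on slide 26. All numbers and names are illustrative.

```cpp
#include <cstdio>

// Cycles needed by one layer when its MACs are folded onto PE x SIMD lanes,
// assuming one MAC per lane per cycle (simplified; ignores pipeline fill).
long long layer_cycles(long long total_macs, int pe, int simd) {
    long long lanes = 1LL * pe * simd;
    return (total_macs + lanes - 1) / lanes;   // ceiling division
}

int main() {
    // Example: 3x3 conv, 128 -> 128 channels, 28x28 output (illustrative numbers).
    long long macs = 3LL * 3 * 128 * 128 * 28 * 28;

    // SIMD folds over input channels, PE over output channels (slide 26).
    int simd = 32;   // a factor of the 128 input channels
    int pe   = 16;   // a factor of the 128 output channels

    long long cycles = layer_cycles(macs, pe, simd);
    double fclk_hz = 200e6;  // assumed clock frequency
    std::printf("cycles per inference: %lld (~%.0f inferences/s at 200 MHz)\n",
                cycles, fclk_hz / cycles);
    return 0;
}
```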

  22. Design Space Exploration Stage: Balanced Dataflow [Diagram as on slide 20] ˃ Question 1: Given a target FPS, what resources are required? ˃ Question 2: Given total resources, what FPS can be achieved?
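Both questions reduce to the same relation between clock rate, per-layer cycle count and frame rate; a minimal sketch, assuming a 200 MHz clock and that the slowest layer dominates a balanced dataflow pipeline:

```cpp
#include <cstdio>

// In a balanced dataflow pipeline the slowest layer sets the throughput:
//   fps ~= f_clk / max_layer_cycles
// Question 2 (resources given -> fps) evaluates this directly; Question 1
// (target fps -> resources) inverts it into a cycle budget per layer, from
// which the required PE/SIMD folding, and hence LUT/DSP cost, follow.
int main() {
    const double fclk_hz = 200e6;              // assumed clock

    // Question 2: the slowest layer needs 225792 cycles (illustrative number).
    double max_layer_cycles = 225792.0;
    std::printf("achievable fps: %.0f\n", fclk_hz / max_layer_cycles);

    // Question 1: a 5000 fps target gives every layer this cycle budget.
    double target_fps = 5000.0;
    std::printf("cycle budget per layer: %.0f\n", fclk_hz / target_fps);
    return 0;
}
```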

  23. Vivado HLS – QNN Library ˃ Layer-specific configuration values – support for multiple padding modes, in this example “same” padding ˃ Implementation-specific parallelism values – folding factors ˃ Precision configuration values – independent precision for input/output activations and weights, and signed/unsigned math
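As a rough illustration of how such compile-time configuration tends to look in a Vivado HLS C++ library, here is a hedged sketch. The template name, parameter names and the function signature are invented for illustration and are not the actual FINN QNN library interface; only `hls::stream`, `ap_int` and `ap_uint` are real Vivado HLS types.

```cpp
#include <ap_int.h>
#include <hls_stream.h>

// Hypothetical layer template (illustration only, not the real FINN API),
// showing the three kinds of compile-time configuration from the slide:
//  - layer-specific shape / padding values,
//  - implementation-specific folding factors (SIMD, PE),
//  - precision values for activations and weights.
template <
    int K, int IFM_CH, int OFM_CH, int PAD,   // layer-specific configuration
    int SIMD, int PE,                         // folding factors
    int IN_BITS, int W_BITS, int OUT_BITS     // precision configuration
>
void qnn_conv_layer(hls::stream<ap_uint<SIMD * IN_BITS>>& in,
                    hls::stream<ap_uint<PE * OUT_BITS>>& out,
                    const ap_int<W_BITS>* weights) {
    // Body omitted in this sketch: sliding-window line buffering feeding a
    // PE x SIMD array of MACs whose datapath widths follow IN_BITS / W_BITS.
}
```

An instantiation fixes everything at compile time, e.g. `qnn_conv_layer<3, 64, 64, 1, 16, 8, 2, 2, 2>(in, out, w);`, letting HLS tailor the arithmetic of each layer to its chosen precisions.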

  24. Backend Stage – Hardware / Runtime Generation [Diagram: the neural network description, the optimal folding factors, a device specification file and the FINN QNN library feed into FINN, which generates the hardware]

  25. Hardware Generation – Network Dataflow Example ˃ top.cpp: sequence of layers, 1:1 with the network topology ˃ config.h: FINN-generated configuration, with network configuration values + parallelism-specific values ˃ (optionally) params.h: FINN-generated weight values to be hardcoded into the bitstream
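A hedged sketch of what such a generated top level might look like, reusing the hypothetical layer template from slide 23 and invented constant names standing in for the contents of config.h and params.h; the files FINN actually generates differ in detail.

```cpp
#include <ap_int.h>
#include <hls_stream.h>
#include "config.h"   // generated shape, folding and precision constants (names assumed)
#include "params.h"   // generated weight arrays, hardcoded into the bitstream

// Illustrative top level: one streaming call per network layer, 1:1 with the
// topology, connected by on-chip FIFOs so that all layers run concurrently.
void top(hls::stream<ap_uint<IN_WIDTH>>& in, hls::stream<ap_uint<OUT_WIDTH>>& out) {
#pragma HLS DATAFLOW
    hls::stream<ap_uint<L0_OUT_WIDTH>> s0("s0");
    hls::stream<ap_uint<L1_OUT_WIDTH>> s1("s1");

    qnn_conv_layer<L0_K, L0_IFM_CH, L0_OFM_CH, L0_PAD,
                   L0_SIMD, L0_PE, L0_IN_BITS, L0_W_BITS, L0_OUT_BITS>(in, s0, L0_WEIGHTS);
    qnn_conv_layer<L1_K, L1_IFM_CH, L1_OFM_CH, L1_PAD,
                   L1_SIMD, L1_PE, L1_IN_BITS, L1_W_BITS, L1_OUT_BITS>(s0, s1, L1_WEIGHTS);
    // ... remaining layers down to the final classifier, writing to `out`.
}
```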

  26. Scaling Parallelism ˃ Goal: calculate folding factors such that the layers produce a balanced dataflow (see the sketch below) ˃ For each layer, set SIMD and PE to 1 – a single MAC ˃ Until the hardware no longer fits on the device or the FPS target is reached: find the slowest layer, then • increase its SIMD to the next factor of IFM_CHANS, or • increase its PE to the next factor of OFM_CHANS
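A minimal sketch of this greedy search, using the simplified cycle model from the folding sketch on slide 21, an invented resource-fit check and example layer numbers; FINN's actual implementation models resources and scheduling in far more detail.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Simplified greedy folding search illustrating slide 26: start every layer
// at SIMD = PE = 1 and repeatedly speed up the slowest layer until the FPS
// target is met or the design no longer fits on the device.
struct Layer {
    long long macs;      // total MACs per inference
    int ifm_ch, ofm_ch;  // SIMD must divide ifm_ch, PE must divide ofm_ch
    int simd, pe;
    long long cycles() const {
        long long lanes = 1LL * simd * pe;
        return (macs + lanes - 1) / lanes;
    }
};

// Next divisor of n strictly greater than cur (returns cur if none exists).
static int next_factor(int cur, int n) {
    for (int f = cur + 1; f <= n; ++f)
        if (n % f == 0) return f;
    return cur;
}

// Placeholder resource model (pure assumption): cost grows with MAC lanes.
static bool fits(const std::vector<Layer>& net, long long lut_budget) {
    long long lanes = 0;
    for (const auto& l : net) lanes += 1LL * l.simd * l.pe;
    return lanes * 300 <= lut_budget;   // "300 LUTs per lane" is invented
}

int main() {
    // Two illustrative layers (MAC counts and channel counts are examples).
    std::vector<Layer> net = {{115605504LL, 128, 128, 1, 1},
                              { 57802752LL, 128, 256, 1, 1}};
    const double fclk_hz = 200e6, target_fps = 1000.0;
    const long long lut_budget = 500000;

    while (true) {
        auto slow = std::max_element(net.begin(), net.end(),
            [](const Layer& a, const Layer& b) { return a.cycles() < b.cycles(); });
        if (fclk_hz / slow->cycles() >= target_fps) break;   // FPS target reached

        // Unfold the slowest layer a little more (here SIMD first, then PE;
        // the slide leaves the choice between the two open).
        Layer trial = *slow;
        int s = next_factor(trial.simd, trial.ifm_ch);
        if (s != trial.simd) trial.simd = s;
        else trial.pe = next_factor(trial.pe, trial.ofm_ch);

        std::vector<Layer> candidate = net;
        candidate[slow - net.begin()] = trial;
        if (trial.cycles() == slow->cycles() || !fits(candidate, lut_budget)) break;
        net = candidate;
    }
    for (const auto& l : net)
        std::printf("SIMD=%3d PE=%3d -> %lld cycles/inference\n", l.simd, l.pe, l.cycles());
    return 0;
}
```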

  27. FINN Performance Results

Network         Platform   Precision (W/A)   Performance (TOPS)
MLP             AWS-F1     1/1               50.8
CNV             AWS-F1     1/1               12.1
Tincy-YOLO      AWS-F1     1/3               5.3
DoReFa-Net/PF   AWS-F1     1/2               11.4

˃ Up to 50 TOPS measured performance for BNNs ˃ Multiple precision types supported: 8-bit in DSPs, reduced precision in LUTs. Blott, M., et al. "FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks"

  28. From FPGAs to ACAPs

  29. New Heterogeneous Devices ˃ From the Xilinx world: evolution of FPGAs to ACAPs [Diagram: ACAP block diagram with an array of AI Engine SW PEs, a processing system (application and real-time processors), programmable logic (LUT, BRAM, URAM, DSP, AMS), I/O (GT, AMS, transceivers, PCIe), NoC, DDR and HBM] ˃ Up to ~147 TOPS of Int8 performance!

  30. Conclusions ˃ As Moore’s law has ended, heterogeneous accelerated systems have emerged ˃ High computational demand of machine learning applications is driving hardware development ˃ Customized dataflow architectures and memory subsystems, custom precisions • Dramatic performance scaling and energy efficiency benefits • Target datacenter or embedded devices • Enabling new exciting trade-offs within the design space ˃ New ACAP devices with AI engines

  31. Thanks! Adaptable. Intelligent.
