
Recurrent Neural Networks and Project BrainWave (PowerPoint presentation)



  1. Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI: computer vision, language translation, speech recognition, question answering, and more. [Diagram: an unrolled recurrent network with inputs x_{t-1}, x_t, x_{t+1}, hidden states h_{t-1}, h_t, h_{t+1}, and outputs y_{t-1}, y_t, y_{t+1}, shown alongside convolutional neural networks.] Problem: DNNs are challenging to serve and deploy in large-scale interactive services.
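To make the unrolled diagram concrete, here is a minimal NumPy sketch of one vanilla RNN step; the weight names (W_xh, W_hh, W_hy), the tanh nonlinearity, and the layer sizes are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One step of a vanilla RNN: update the hidden state, emit an output."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # hidden state h_t
    y_t = W_hy @ h_t + b_y                             # output y_t
    return h_t, y_t

# Unrolling over x_1..x_T reuses the same weights at every step, so serving an
# RNN is dominated by repeated matrix-vector products against fixed parameters.
D, H = 64, 128                                   # illustrative input / hidden sizes
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((H, D))
W_hh = rng.standard_normal((H, H))
W_hy = rng.standard_normal((D, H))
b_h, b_y = np.zeros(H), np.zeros(D)
h = np.zeros(H)
for x in rng.standard_normal((10, D)):           # a 10-step input sequence
    h, y = rnn_step(x, h, W_xh, W_hh, W_hy, b_h, b_y)
```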

  2. DNN Processing Units. [Diagram: a flexibility-to-efficiency spectrum running from CPUs (registers, control unit (CU), arithmetic logic unit (ALU)) and GPUs, through soft DPUs (FPGA), to hard DPUs and ASICs. Hard DPU examples: Cerebras, Google TPU, Graphcore, Groq, Intel Nervana, Movidius, Wave Computing, etc. Soft DPU examples: BrainWave, Baidu SDA, Deephi Tech, ESE, Teradeep, etc.]

  3. BrainWave: a scalable FPGA-powered DNN serving platform. [Diagram: a pretrained DNN model in CNTK, etc., is mapped onto a scalable DNN hardware microservice built from FPGAs (soft DPUs) connected through L0/L1 network switches; each soft DPU contains an instruction decoder and control block plus a neural functional unit (FU).]

  4. [Diagram: CPU compute layer, reconfigurable compute layer (FPGA), and a converged network.]

  5. Sub-millisecond FPGA compute latencies at batch size 1.

  6. [Figure-only slide; no text to recover.]

  7-11. The BrainWave stack (progressive build across five slides):
  - A framework-neutral federated compiler and runtime for compiling pretrained DNN models to soft DPUs.
  - An adaptive ISA for narrow-precision DNN inference, flexible and extensible to support fast-changing AI algorithms.
  - The BrainWave soft DPU microarchitecture, highly optimized for narrow precision and low batch sizes.
  - Model parameters persisted entirely in FPGA on-chip memories; large models supported by scaling across many FPGAs.
  - Intel FPGAs deployed at scale with hardware microservices [MICRO'16].

  12. [Figure-only slide; no text to recover.]

  13. A Cloud-Scale Acceleration Architecture [MICRO'16].

  14. Interconnected FPGAs form a separate plane of computation that can be managed and used independently from the CPU. [Diagram: a hardware acceleration plane of FPGAs (running deep neural networks, SQL, web search ranking, and SDN offload) is layered over the traditional software (CPU) server plane; in each server, two CPUs linked by QPI attach to an FPGA, which connects over 40 Gb/s QSFP links toward the top-of-rack (ToR) routers.]

  15. [Figure-only slide; no text to recover.]

  16. The BrainWave compiler flow. [Diagram: Caffe, CNTK, and TensorFlow models enter through framework frontends and are lowered to a portable IR; a graph splitter and optimizer partitions the graph into transformed IRs for target compilers (CPU-CNTK, FPGA, CPU-Caffe), which emit a deployment package for the FPGA hardware microservice. Worked example: a 1000-dim vector is split into two 500-dim halves across FPGA0 and FPGA1; each FPGA multiplies by 500x500 matrix blocks (MatMul500), accumulates (Add500), applies Sigmoid500, and the halves are concatenated back into a 1000-dim vector.]
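As a rough illustration of the graph split in the example above, here is a NumPy sketch of a 1000-dim layer partitioned into 500x500 blocks, with each half of the output computed as if on a separate FPGA; the block layout and variable names are assumptions made only for illustration.

```python
import numpy as np

N, H = 1000, 500                      # full layer width and the per-FPGA half
rng = np.random.default_rng(0)
W = rng.standard_normal((N, N))       # 1000x1000 weight matrix, four 500x500 blocks
x = rng.standard_normal(N)            # 1000-dim input vector

def device_half(blocks, x_halves):
    """One 'FPGA': two 500x500 MatMuls, an Add, and a Sigmoid for half the output."""
    acc = blocks[0] @ x_halves[0] + blocks[1] @ x_halves[1]   # MatMul500 twice, then Add500
    return 1.0 / (1.0 + np.exp(-acc))                         # Sigmoid500

x0, x1 = x[:H], x[H:]                                         # Split: 1000 -> 500 + 500
y0 = device_half((W[:H, :H], W[:H, H:]), (x0, x1))            # FPGA0: output rows 0..499
y1 = device_half((W[H:, :H], W[H:, H:]), (x0, x1))            # FPGA1: output rows 500..999
y = np.concatenate([y0, y1])                                  # Concat back to 1000 dims

assert np.allclose(y, 1.0 / (1.0 + np.exp(-(W @ x))))         # matches the unsplit layer
```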

  17-18. Compute intensity of convolutions vs. matrix-vector products (two-slide build). [Diagrams: output pre-activation = N weight kernels applied to the input activation, with O(N^3) data and O(N^4 K^2) compute; output pre-activation = weight matrix times input activation vector, with O(N^2) data and O(N^2) compute.] A convolution reuses each weight many times, while a matrix-vector product touches every weight exactly once, so RNN/MLP-style layers served at batch size 1 are memory-bandwidth-bound; this motivates keeping model parameters in on-chip memory.
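A back-of-the-envelope sketch of the arithmetic intensity (operations per value of data moved) implied by the slide's complexity estimates; the concrete layer sizes below are assumptions chosen only to make the contrast visible.

```python
# Rough arithmetic-intensity comparison implied by the slide (assumed sizes).
def conv_layer(N, K):
    """N input/output channels on an NxN feature map with KxK kernels."""
    data = N * N * N + N * N * K * K      # activations O(N^3) + weights O(N^2 K^2)
    compute = 2 * N * N * N * N * K * K   # multiply-accumulates: O(N^4 K^2)
    return compute / data

def dense_matvec(N):
    """NxN weight matrix times an N-dim activation vector at batch size 1."""
    data = N * N + 2 * N                  # weights O(N^2) + input/output vectors
    compute = 2 * N * N                   # one multiply-add per weight: O(N^2)
    return compute / data

print(f"conv   intensity: {conv_layer(N=256, K=3):8.1f} ops per value")
print(f"matvec intensity: {dense_matvec(N=2048):8.1f} ops per value")
# The dense case stays near 2 ops per value no matter how large N gets, which
# is why batch-1 RNN/MLP serving is memory-bound unless the weights live on-chip.
```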

  19-20. Model parameters initialized in DRAM (two-slide build). [Figures: FPGA vs. 2x CPU.]

  21. [Chart: FPGA hardware utilization (%) vs. batch size.]

  22-23. Batching improves hardware utilization but increases latency (two-slide build). [Charts: hardware utilization (%) vs. batch size, and 99th-percentile latency vs. batch size against a maximum allowed latency line.]
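A toy model of why batching raises utilization but also raises the latency every request observes, under the illustrative assumption (not from the slides) that a DRAM-resident model must stream its weights once per batch:

```python
# Toy batching model (illustrative assumptions): streaming the weights once per
# batch costs a fixed time; each sample then adds a small amount of compute time.
WEIGHT_STREAM_MS = 5.0        # time to read all weights from DRAM (assumed)
COMPUTE_PER_SAMPLE_MS = 0.2   # math time per sample once weights are loaded (assumed)

for batch in (1, 4, 16, 64):
    latency_ms = WEIGHT_STREAM_MS + batch * COMPUTE_PER_SAMPLE_MS
    # Fraction of the time the arithmetic units are doing useful math:
    utilization = (batch * COMPUTE_PER_SAMPLE_MS) / latency_ms
    print(f"batch {batch:3d}: latency {latency_ms:6.1f} ms, utilization {utilization:5.1%}")

# Utilization climbs with batch size, but so does the latency of every request
# in the batch; persisting weights on-chip removes the fixed per-batch cost.
```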

  24. [Chart: FPGA vs. 2x CPU.]

  25-27. Observations (three-slide build). [Charts comparing against 2x CPU; annotations not recoverable.]

  28-30. [Figure-only slides; no text to recover.]

  31. Core features of the soft DPU. [Diagram: Matrix-Vector Unit on the FPGA.] Proprietary, parameterizable narrow-precision format wrapped in float16 interfaces.
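The exact ms-fp8/ms-fp9 encodings are proprietary, so the sketch below only illustrates the general idea of a float16-in, float16-out interface around a narrower float format; the chosen mantissa width and rounding scheme are placeholders, not the actual BrainWave formats.

```python
import numpy as np

def quantize_narrow_float(x_fp16, mantissa_bits=2):
    """Round float16 values to a toy narrow float with fewer mantissa bits.

    Keeps the float16 exponent range and simply rounds the significand, standing
    in for the 'convert to msft-fp' stage in the datapath diagram (assumption).
    """
    x = x_fp16.astype(np.float32)
    m, e = np.frexp(x)                          # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** (mantissa_bits + 1)          # +1 for the implicit leading bit
    m_q = np.round(m * scale) / scale           # keep only mantissa_bits of fraction
    return np.ldexp(m_q, e).astype(np.float16)  # back through the float16 interface

w = np.random.randn(4, 4).astype(np.float16)    # float16 weights at the interface
w_q = quantize_narrow_float(w)                  # what the matrix unit would multiply with
print(np.abs(w - w_q).max())                    # quantization error stays small
```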

  32. [Block diagram of the Neural Functional Unit: an instruction decoder and control processor drive a Matrix-Vector Unit built from multiple matrix-vector multiply kernels, each fed by matrix register files (Matrix RF) and vector register files (VRF), with convert-to-msft-fp stages on the way in and convert-to-float16 stages on the way out. Tensor arbiters (TA) route tensor data between the matrix memory manager (backed by DRAM), the vector memory manager, multifunction units (activation, multiply, add/sub, crossbars, VRFs), the tensor manager, and the input/output message processors on the network interface. Legend: tensor data, instructions, commands, memory.]

  33. [Diagram: inside the matrix-vector multiply, a float16 input tensor of features is broadcast to matrix rows 1 through N; each row forms elementwise products with the features and reduces them through an adder tree, producing one element of the float16 output tensor.]
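A small sketch of the multiply-and-adder-tree structure the diagram suggests: each matrix row is multiplied elementwise with the feature vector and the products are reduced pairwise, as a hardware adder tree would; the tree depth, sizes, and float16 types are the only assumptions.

```python
import numpy as np

def adder_tree_sum(values):
    """Reduce a list of partial products pairwise, like a hardware adder tree."""
    while len(values) > 1:
        reduced = [a + b for a, b in zip(values[0::2], values[1::2])]
        if len(values) % 2:            # an odd element passes through this level
            reduced.append(values[-1])
        values = reduced
    return values[0]

def matvec_row_by_row(matrix_fp16, features_fp16):
    """One output element per matrix row: multiply with the features, then reduce."""
    out = [adder_tree_sum([w * x for w, x in zip(row, features_fp16)])
           for row in matrix_fp16]
    return np.array(out, dtype=np.float16)        # float16 output tensor

M = np.random.randn(8, 16).astype(np.float16)     # matrix rows 1..N
x = np.random.randn(16).astype(np.float16)        # float16 input tensor (features)
print(matvec_row_by_row(M, x))
```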

  34. FPGA performance vs. data type. [Chart, Stratix V D5 @ 225 MHz: roughly 1.4 Tera-operations/sec for 16-bit int, 2.0 for 8-bit int, 2.7 for ms-fp9, and 4.5 for ms-fp8.]

  35. [Figure-only slide; no text to recover.]

  36-40. FPGA performance vs. data type (progressive build across five slides). [Charts: Stratix V D5 @ 225 MHz reaches roughly 1.4, 2.0, 2.7, and 4.5 Tera-operations/sec for 16-bit int, 8-bit int, ms-fp9, and ms-fp8 respectively; Stratix 10 280 @ 500 MHz reaches roughly 12, 31, 65, and 90 Tera-operations/sec for the same data types.]

  41. Impact of narrow precision on accuracy. [Chart: accuracy (0.50 to 1.00) for Model 1 (GRU-based), Model 2 (LSTM-based), and Model 3 (LSTM-based), comparing float32, ms-fp9, and ms-fp9 with retraining; shown alongside the FPGA performance vs. data type charts from the previous slides.]

  42. Project BrainWave is a powerful platform for an accelerated AI cloud. It runs on Microsoft's hyperscale infrastructure with FPGAs, achieves excellent performance at low batch sizes via persistency and narrow precision, and is adaptable to precision and changes in future AI algorithms. BrainWave running on Hardware Microservices will push the boundary of what is possible to deploy in the cloud: deeper and larger CNNs for more accurate computer vision, higher-dimensional RNNs toward human-like natural language processing, state-of-the-art speech, and much more. Stay tuned for announcements about external availability.
