Recurrent Neural Networks

Deep neural networks have enabled major advances in machine learning and AI:
- Computer vision
- Language translation
- Speech recognition
- Question answering
- And more…

Problem: DNNs are challenging to serve and deploy in large-scale interactive services.

[Figure: a convolutional network beside a recurrent network unrolled over time, with inputs x_t-1, x_t, x_t+1, hidden states h_t-1, h_t, h_t+1, and outputs y_t-1, y_t, y_t+1]
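The recurrence the figure depicts is h_t = f(W x_t + U h_t-1 + b). A minimal NumPy sketch, with illustrative dimensions and tanh as the nonlinearity (the deck specifies neither):

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    # One step of a vanilla RNN: new hidden state from the current input
    # and the previous hidden state.
    return np.tanh(W @ x_t + U @ h_prev + b)

# Illustrative dimensions; the deck does not specify any.
d_in, d_h = 64, 128
rng = np.random.default_rng(0)
W = rng.standard_normal((d_h, d_in)) * 0.01
U = rng.standard_normal((d_h, d_h)) * 0.01
b = np.zeros(d_h)

h = np.zeros(d_h)
for x_t in rng.standard_normal((10, d_in)):  # a 10-step input sequence
    h = rnn_step(x_t, h, W, U, b)
```

The serial dependence of h_t on h_t-1 is what makes RNNs hard to batch and parallelize, which is the serving problem the rest of the deck addresses.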
DNN Processing Units

[Figure: a spectrum from flexibility to efficiency: CPUs (control unit, registers, ALU) and GPUs at the flexible end, Soft DPUs (FPGA) in the middle, Hard DPUs (ASICs) at the efficient end]
- Soft DPUs (FPGA): BrainWave, Baidu SDA, Deephi Tech, ESE, Teradeep, etc.
- Hard DPUs (ASIC): Cerebras, Google TPU, Graphcore, Groq, Intel Nervana, Movidius, Wave Computing, etc.
A Scalable FPGA-powered DNN Serving Platform

[Figure: a pretrained DNN model (in CNTK, etc.) is deployed onto a scalable DNN hardware microservice; each BrainWave Soft DPU pairs an instruction decoder & control with a Neural FU, and FPGAs (F) are pooled behind network switches into layers L0, L1]
- CPU compute layer plus a reconfigurable compute layer (FPGA) over a converged network
- Sub-millisecond FPGA compute latencies at batch 1
- A framework-neutral federated compiler and runtime for compiling pretrained DNN models to soft DPUs
- An adaptive ISA for narrow-precision DNN inference, flexible and extensible to support fast-changing AI algorithms
- The BrainWave Soft DPU microarchitecture, highly optimized for narrow precision and low batch
- Model parameters persisted entirely in FPGA on-chip memories, with large models supported by scaling across many FPGAs (see the sizing sketch below)
- Intel FPGAs deployed at scale with HW microservices [MICRO'16]
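To make the last two bullets concrete, a back-of-the-envelope sizing sketch. The 9 bits per parameter and ~30 MB of usable on-chip SRAM per FPGA are illustrative assumptions, not figures from the deck:

```python
import math

def fpgas_needed(num_params, bits_per_param=9, sram_bytes_per_fpga=30e6):
    # Assumptions: every parameter pinned in on-chip SRAM, an ms-fp9-style
    # 9-bit encoding, ~30 MB usable SRAM per FPGA. All numbers illustrative.
    total_bytes = num_params * bits_per_param / 8
    return total_bytes, math.ceil(total_bytes / sram_bytes_per_fpga)

for n_params in (50_000_000, 200_000_000, 1_000_000_000):
    nbytes, n_fpgas = fpgas_needed(n_params)
    print(f"{n_params/1e6:6.0f}M params -> {nbytes/1e6:7.1f} MB -> {n_fpgas} FPGA(s)")
```

Once a model no longer fits in one device's SRAM, it must be partitioned across several FPGAs, which is exactly the graph-splitting step shown later.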
A Cloud-Scale Acceleration Architecture [MICRO'16]

[Figure: each server pairs a CPU (attached via QPI) with an FPGA on a 40 Gb/s QSFP link to the ToR switch; the traditional software (CPU) server plane runs services such as web search ranking, while the hardware acceleration plane runs web search ranking, deep neural networks, SDN offload, and SQL]
- Interconnected FPGAs form a separate plane of computation
- Can be managed and used independently from the CPU
[Figure: a 1000-dim matrix-vector product split across FPGA0 and FPGA1. The 1000-dim input vector is Split into two 500-dim halves; each FPGA holds 500x500 matrix tiles and runs MatMul500, Add500, and Sigmoid500 on its half, and the two 500-dim results are Concat'ed into the 1000-dim output]
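A minimal NumPy sketch of the tiling the figure implies: the 1000x1000 matrix-vector product decomposed into four 500x500 tiles, with partial products summed (Add500), passed through Sigmoid500, and concatenated. The tile-to-FPGA assignment is illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W = rng.standard_normal((1000, 1000))
x = rng.standard_normal(1000)

# Split operands into 500-element halves (the "Split" nodes).
x0, x1 = x[:500], x[500:]
W00, W01 = W[:500, :500], W[:500, 500:]
W10, W11 = W[500:, :500], W[500:, 500:]

# Each device computes two MatMul500s and one Add500 for its output half.
y0 = sigmoid(W00 @ x0 + W01 @ x1)   # e.g., FPGA0's half
y1 = sigmoid(W10 @ x0 + W11 @ x1)   # e.g., FPGA1's half

y = np.concatenate([y0, y1])        # the "Concat" node
assert np.allclose(y, sigmoid(W @ x))
```

The assert confirms the tiled schedule is numerically identical to the unsplit computation; only the placement changes.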
[Figure: the BrainWave tool flow. Frontends ingest CNTK, Caffe, and TensorFlow models into a Portable IR; a Graph Splitter and Optimizer produces Transformed IRs; target compilers (FPGA, CPU-CNTK, CPU-Caffe) emit a Deployment Package that runs as an FPGA HW Microservice]
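A toy sketch of what such a federated splitter might look like in code. The IRNode shape, the device names, and the one-line placement heuristic are all hypothetical, for illustration only:

```python
from dataclasses import dataclass, field

@dataclass
class IRNode:
    op: str                      # e.g. "MatMul", "Add", "Sigmoid"
    inputs: list = field(default_factory=list)
    device: str = "unassigned"   # filled in by the graph splitter

def split_graph(nodes, targets=("fpga", "cpu")):
    # Hypothetical heuristic: matrix-heavy ops go to the soft DPU,
    # everything else falls back to the CPU target compiler.
    for node in nodes:
        node.device = targets[0] if node.op == "MatMul" else targets[1]
    return nodes

graph = [IRNode("MatMul"), IRNode("Add"), IRNode("Sigmoid")]
for node in split_graph(graph):
    print(f"{node.op} -> {node.device}")
```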
[Figure: arithmetic intensity by layer type. A matrix-vector multiply (weight matrix x input activation = output pre-activation) touches O(N^2) data for O(N^2) compute, roughly one operation per weight fetched; a convolution with N weight kernels touches O(N^3) data for O(N^4 K^2) compute, reusing each weight many times]
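Why this matters for serving: arithmetic intensity (operations per byte of data touched) determines whether a layer is bandwidth-bound or compute-bound. The sketch below plugs in the slide's big-O shapes with all constants set to 1, so the numbers are illustrative only:

```python
def arithmetic_intensity(ops, data_bytes):
    return ops / data_bytes

N, K = 1024, 3

# Matrix-vector multiply: O(N^2) data, O(N^2) compute.
mv = arithmetic_intensity(ops=2 * N * N, data_bytes=N * N)

# Convolution with N kernels: O(N^3) data, O(N^4 * K^2) compute.
conv = arithmetic_intensity(ops=N**4 * K**2, data_bytes=N**3)

print(f"matrix-vector: ~{mv:.0f} ops/byte  (bandwidth-bound at batch 1)")
print(f"convolution:   ~{conv:.0f} ops/byte (compute-bound)")
```

RNN serving at batch 1 is dominated by matrix-vector products, which is why BrainWave pins weights in on-chip SRAM instead of streaming them from DRAM.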
[Figure: a BrainWave FPGA alongside a 2x CPU server; model parameters are initialized in DRAM]
Batching improves HW utilization but increases latency.

[Figure: two plots versus batch size: hardware utilization (%) climbs with larger batches, while 99th-percentile latency also climbs, eventually exceeding the maximum allowed latency]
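A toy model of the tradeoff (the overhead, per-request time, and arrival gap below are invented constants, not measurements): larger batches amortize the fixed launch overhead, raising utilization, but the first request in a batch waits for the batch to fill, and the batch itself takes longer to run.

```python
# Toy model: fixed per-launch overhead amortized over the batch,
# plus per-request compute; early requests also wait for the batch to fill.
overhead_us, per_req_us, arrival_gap_us = 100.0, 10.0, 5.0

for batch in (1, 4, 16, 64):
    compute = overhead_us + batch * per_req_us
    utilization = batch * per_req_us / compute   # fraction of time doing useful work
    wait = (batch - 1) * arrival_gap_us          # fill time seen by the first request
    latency = wait + compute
    print(f"batch {batch:>3}: util {utilization:5.1%}, latency {latency:7.1f} us")
```

Interactive services must stay under the maximum allowed latency, so they cannot simply batch their way to high utilization; BrainWave instead targets high utilization at batch 1.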
Observations

[Figure: the BrainWave FPGA compared against a 2x CPU server]
Core Features
- Proprietary parameterizable narrow-precision format wrapped in float16 interfaces

[Figure: the Matrix Vector Unit within the FPGA]
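One plausible reading of "narrow precision wrapped in float16 interfaces", sketched in NumPy: tensors enter and leave as float16 but are internally quantized to a small sign/exponent/mantissa format. The 1-5-3 bit split for ms-fp9 is an assumption here, not a published spec:

```python
import numpy as np

def quantize_msfp9_like(x, exp_bits=5, man_bits=3):
    # Hypothetical ms-fp9-style minifloat: 1 sign, 5 exponent, 3 mantissa bits.
    x = np.asarray(x, dtype=np.float32)
    m, e = np.frexp(x)                             # x = m * 2**e, 0.5 <= |m| < 1
    m = np.round(m * 2**man_bits) / 2**man_bits    # keep man_bits of mantissa
    e_max = 2 ** (exp_bits - 1)
    e = np.clip(e, -e_max, e_max)                  # crude 5-bit exponent range
    return np.ldexp(m, e).astype(np.float16)       # back to the float16 interface

v = np.float16([0.1234, -3.7, 42.0])
print(quantize_msfp9_like(v))   # coarse approximations of the inputs
```

The point of the float16 wrapper is that callers never see the internal format: models quantized this way remain drop-in replacements at the API boundary.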
Neural Functional Unit

[Figure: the NFU microarchitecture. Input/Output Message Processors and a Control Processor feed an Instruction Decoder; a Tensor Manager with tensor arbiters (TA) coordinates a Matrix Memory Manager and a Vector Memory Manager backed by DRAM. The Matrix-Vector Unit holds multiple matrix-vector multiply kernels, each with a matrix RF, a VRF, and a network interface (Network IFC); operands are converted to msft-fp on entry and back to float16 on exit, then flow through two Multifunction Units, each with a crossbar (xbar), multiply (x), add/sub (+), and activation (A) stages with VRFs. Legend: memory, tensor data, instructions, commands]
Features

[Figure: the matrix-vector multiply datapath. A float16 input tensor is broadcast against matrix rows 1..N; each row drives a tree of multipliers (x) and adders (+) that reduces the dot product, producing a float16 output tensor]
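A functional sketch of the datapath in the figure: each matrix row drives an adder tree that reduces the elementwise products pairwise in log2(N) levels. This models the arithmetic only, not the hardware's actual pipelining:

```python
import numpy as np

def dot_tree(row, x):
    # Pairwise (tree) reduction of elementwise products, as an adder tree would.
    p = row * x
    while len(p) > 1:
        p = p[0::2] + p[1::2]   # one adder level; assumes power-of-two length
    return p[0]

rng = np.random.default_rng(2)
W = rng.standard_normal((4, 8)).astype(np.float16)   # matrix rows 1..N
x = rng.standard_normal(8).astype(np.float16)        # float16 input tensor

y = np.array([dot_tree(r, x) for r in W])            # float16 output tensor
assert np.allclose(y, W @ x, rtol=1e-2, atol=1e-2)
```

A tree reduction finishes in log2(N) adder stages rather than N sequential adds, which is what lets the unit sustain one dot product per cycle once the pipeline is full.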
FPGA Performance vs. Data Type (Tera-Operations/sec)

Data type  | Stratix V D5 @ 225 MHz | Stratix 10 280 @ 500 MHz
-----------|------------------------|-------------------------
16-bit int | 1.4                    | 12
8-bit int  | 2.0                    | 31
ms-fp9     | 2.7                    | 65
ms-fp8     | 4.5                    | 90
Impact of Narrow Precision on Accuracy

[Figure: accuracy (y-axis 0.50-1.00) for Model 1 (GRU-based), Model 2 (LSTM-based), and Model 3 (LSTM-based), comparing float32, ms-fp9, and ms-fp9 with retraining]