Recurrent Neural Networks

Deep neural networks have enabled major advances in machine learning and AI:
- Computer vision
- Language translation
- Speech recognition
- Question answering
- And more…

Problem: DNNs are challenging to serve and deploy in large-scale interactive services.

[Figure: a convolutional network beside a recurrent network unrolled over time, with inputs x_t-1, x_t, x_t+1, hidden states h_t-1, h_t, h_t+1, and outputs y_t-1, y_t, y_t+1]
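The recurrence the figure depicts is h_t = f(W x_t + U h_t-1 + b). A minimal NumPy sketch, with illustrative dimensions and tanh as the nonlinearity (the deck specifies neither):

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    # One step of a vanilla RNN: new hidden state from the current input
    # and the previous hidden state.
    return np.tanh(W @ x_t + U @ h_prev + b)

# Illustrative dimensions; the deck does not specify any.
d_in, d_h = 64, 128
rng = np.random.default_rng(0)
W = rng.standard_normal((d_h, d_in)) * 0.01
U = rng.standard_normal((d_h, d_h)) * 0.01
b = np.zeros(d_h)

h = np.zeros(d_h)
for x_t in rng.standard_normal((10, d_in)):  # a 10-step input sequence
    h = rnn_step(x_t, h, W, U, b)
```

The serial dependence of h_t on h_t-1 is what makes RNNs hard to batch and parallelize, which is the serving problem the rest of the deck addresses.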
DNN Processing Units

[Figure: a spectrum from flexibility to efficiency: CPUs (control unit, registers, ALU) and GPUs at the flexible end, Soft DPUs (FPGA) in the middle, Hard DPUs (ASICs) at the efficient end]
- Soft DPUs (FPGA): BrainWave, Baidu SDA, Deephi Tech, ESE, Teradeep, etc.
- Hard DPUs (ASIC): Cerebras, Google TPU, Graphcore, Groq, Intel Nervana, Movidius, Wave Computing, etc.
A Scalable FPGA-powered DNN Serving Platform

[Figure: a pretrained DNN model (in CNTK, etc.) is deployed onto a scalable DNN hardware microservice; each BrainWave Soft DPU pairs an instruction decoder & control with a Neural FU, and FPGAs (F) are pooled behind network switches into layers L0, L1]
- CPU compute layer plus a reconfigurable compute layer (FPGA) over a converged network
- Sub-millisecond FPGA compute latencies at batch 1
- A framework-neutral federated compiler and runtime for compiling pretrained DNN models to soft DPUs
- An adaptive ISA for narrow-precision DNN inference, flexible and extensible to support fast-changing AI algorithms
- The BrainWave Soft DPU microarchitecture, highly optimized for narrow precision and low batch
- Model parameters persisted entirely in FPGA on-chip memories, with large models supported by scaling across many FPGAs (see the sizing sketch below)
- Intel FPGAs deployed at scale with HW microservices [MICRO'16]
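To make the last two bullets concrete, a back-of-the-envelope sizing sketch. The 9 bits per parameter and ~30 MB of usable on-chip SRAM per FPGA are illustrative assumptions, not figures from the deck:

```python
import math

def fpgas_needed(num_params, bits_per_param=9, sram_bytes_per_fpga=30e6):
    # Assumptions: every parameter pinned in on-chip SRAM, an ms-fp9-style
    # 9-bit encoding, ~30 MB usable SRAM per FPGA. All numbers illustrative.
    total_bytes = num_params * bits_per_param / 8
    return total_bytes, math.ceil(total_bytes / sram_bytes_per_fpga)

for n_params in (50_000_000, 200_000_000, 1_000_000_000):
    nbytes, n_fpgas = fpgas_needed(n_params)
    print(f"{n_params/1e6:6.0f}M params -> {nbytes/1e6:7.1f} MB -> {n_fpgas} FPGA(s)")
```

Once a model no longer fits in one device's SRAM, it must be partitioned across several FPGAs, which is exactly the graph-splitting step shown later.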
A Cloud-Scale Acceleration Architecture [MICRO'16]

[Figure: each server pairs a CPU (attached via QPI) with an FPGA on a 40 Gb/s QSFP link to the ToR switch; the traditional software (CPU) server plane runs services such as web search ranking, while the hardware acceleration plane runs web search ranking, deep neural networks, SDN offload, and SQL]
- Interconnected FPGAs form a separate plane of computation
- Can be managed and used independently from the CPU
[Figure: a 1000-dim matrix-vector product split across FPGA0 and FPGA1. The 1000-dim input vector is Split into two 500-dim halves; each FPGA holds 500x500 matrix tiles and runs MatMul500, Add500, and Sigmoid500 on its half, and the two 500-dim results are Concat'ed into the 1000-dim output]
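A minimal NumPy sketch of the tiling the figure implies: the 1000x1000 matrix-vector product decomposed into four 500x500 tiles, with partial products summed (Add500), passed through Sigmoid500, and concatenated. The tile-to-FPGA assignment is illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W = rng.standard_normal((1000, 1000))
x = rng.standard_normal(1000)

# Split operands into 500-element halves (the "Split" nodes).
x0, x1 = x[:500], x[500:]
W00, W01 = W[:500, :500], W[:500, 500:]
W10, W11 = W[500:, :500], W[500:, 500:]

# Each device computes two MatMul500s and one Add500 for its output half.
y0 = sigmoid(W00 @ x0 + W01 @ x1)   # e.g., FPGA0's half
y1 = sigmoid(W10 @ x0 + W11 @ x1)   # e.g., FPGA1's half

y = np.concatenate([y0, y1])        # the "Concat" node
assert np.allclose(y, sigmoid(W @ x))
```

The assert confirms the tiled schedule is numerically identical to the unsplit computation; only the placement changes.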
[Figure: the BrainWave tool flow. Frontends ingest CNTK, Caffe, and TensorFlow models into a Portable IR; a Graph Splitter and Optimizer produces Transformed IRs; target compilers (FPGA, CPU-CNTK, CPU-Caffe) emit a Deployment Package that runs as an FPGA HW Microservice]
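A toy sketch of what such a federated splitter might look like in code. The IRNode shape, the device names, and the one-line placement heuristic are all hypothetical, for illustration only:

```python
from dataclasses import dataclass, field

@dataclass
class IRNode:
    op: str                      # e.g. "MatMul", "Add", "Sigmoid"
    inputs: list = field(default_factory=list)
    device: str = "unassigned"   # filled in by the graph splitter

def split_graph(nodes, targets=("fpga", "cpu")):
    # Hypothetical heuristic: matrix-heavy ops go to the soft DPU,
    # everything else falls back to the CPU target compiler.
    for node in nodes:
        node.device = targets[0] if node.op == "MatMul" else targets[1]
    return nodes

graph = [IRNode("MatMul"), IRNode("Add"), IRNode("Sigmoid")]
for node in split_graph(graph):
    print(f"{node.op} -> {node.device}")
```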
[Figure: arithmetic intensity by layer type. A matrix-vector multiply (weight matrix x input activation = output pre-activation) touches O(N^2) data for O(N^2) compute, roughly one operation per weight fetched; a convolution with N weight kernels touches O(N^3) data for O(N^4 K^2) compute, reusing each weight many times]
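Why this matters for serving: arithmetic intensity (operations per byte of data touched) determines whether a layer is bandwidth-bound or compute-bound. The sketch below plugs in the slide's big-O shapes with all constants set to 1, so the numbers are illustrative only:

```python
def arithmetic_intensity(ops, data_bytes):
    return ops / data_bytes

N, K = 1024, 3

# Matrix-vector multiply: O(N^2) data, O(N^2) compute.
mv = arithmetic_intensity(ops=2 * N * N, data_bytes=N * N)

# Convolution with N kernels: O(N^3) data, O(N^4 * K^2) compute.
conv = arithmetic_intensity(ops=N**4 * K**2, data_bytes=N**3)

print(f"matrix-vector: ~{mv:.0f} ops/byte  (bandwidth-bound at batch 1)")
print(f"convolution:   ~{conv:.0f} ops/byte (compute-bound)")
```

RNN serving at batch 1 is dominated by matrix-vector products, which is why BrainWave pins weights in on-chip SRAM instead of streaming them from DRAM.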
[Figure: a BrainWave FPGA alongside a 2x CPU server; model parameters are initialized in DRAM]
Batching improves HW utilization but increases latency.

[Figure: two plots versus batch size: hardware utilization (%) climbs with larger batches, while 99th-percentile latency also climbs, eventually exceeding the maximum allowed latency]
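A toy model of the tradeoff (the overhead, per-request time, and arrival gap below are invented constants, not measurements): larger batches amortize the fixed launch overhead, raising utilization, but the first request in a batch waits for the batch to fill, and the batch itself takes longer to run.

```python
# Toy model: fixed per-launch overhead amortized over the batch,
# plus per-request compute; early requests also wait for the batch to fill.
overhead_us, per_req_us, arrival_gap_us = 100.0, 10.0, 5.0

for batch in (1, 4, 16, 64):
    compute = overhead_us + batch * per_req_us
    utilization = batch * per_req_us / compute   # fraction of time doing useful work
    wait = (batch - 1) * arrival_gap_us          # fill time seen by the first request
    latency = wait + compute
    print(f"batch {batch:>3}: util {utilization:5.1%}, latency {latency:7.1f} us")
```

Interactive services must stay under the maximum allowed latency, so they cannot simply batch their way to high utilization; BrainWave instead targets high utilization at batch 1.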
Observations

[Figure: the BrainWave FPGA compared against a 2x CPU server]
Core Features
- Proprietary parameterizable narrow-precision format wrapped in float16 interfaces

[Figure: the Matrix Vector Unit within the FPGA]
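One plausible reading of "narrow precision wrapped in float16 interfaces", sketched in NumPy: tensors enter and leave as float16 but are internally quantized to a small sign/exponent/mantissa format. The 1-5-3 bit split for ms-fp9 is an assumption here, not a published spec:

```python
import numpy as np

def quantize_msfp9_like(x, exp_bits=5, man_bits=3):
    # Hypothetical ms-fp9-style minifloat: 1 sign, 5 exponent, 3 mantissa bits.
    x = np.asarray(x, dtype=np.float32)
    m, e = np.frexp(x)                             # x = m * 2**e, 0.5 <= |m| < 1
    m = np.round(m * 2**man_bits) / 2**man_bits    # keep man_bits of mantissa
    e_max = 2 ** (exp_bits - 1)
    e = np.clip(e, -e_max, e_max)                  # crude 5-bit exponent range
    return np.ldexp(m, e).astype(np.float16)       # back to the float16 interface

v = np.float16([0.1234, -3.7, 42.0])
print(quantize_msfp9_like(v))   # coarse approximations of the inputs
```

The point of the float16 wrapper is that callers never see the internal format: models quantized this way remain drop-in replacements at the API boundary.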
Neural Functional Unit

[Figure: the NFU microarchitecture. Input/Output Message Processors and a Control Processor feed an Instruction Decoder; a Tensor Manager with tensor arbiters (TA) coordinates a Matrix Memory Manager and a Vector Memory Manager backed by DRAM. The Matrix-Vector Unit holds multiple matrix-vector multiply kernels, each with a matrix RF, a VRF, and a network interface (Network IFC); operands are converted to msft-fp on entry and back to float16 on exit, then flow through two Multifunction Units, each with a crossbar (xbar), multiply (x), add/sub (+), and activation (A) stages with VRFs. Legend: memory, tensor data, instructions, commands]
Features

[Figure: the matrix-vector multiply datapath. A float16 input tensor is broadcast against matrix rows 1..N; each row drives a tree of multipliers (x) and adders (+) that reduces the dot product, producing a float16 output tensor]
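A functional sketch of the datapath in the figure: each matrix row drives an adder tree that reduces the elementwise products pairwise in log2(N) levels. This models the arithmetic only, not the hardware's actual pipelining:

```python
import numpy as np

def dot_tree(row, x):
    # Pairwise (tree) reduction of elementwise products, as an adder tree would.
    p = row * x
    while len(p) > 1:
        p = p[0::2] + p[1::2]   # one adder level; assumes power-of-two length
    return p[0]

rng = np.random.default_rng(2)
W = rng.standard_normal((4, 8)).astype(np.float16)   # matrix rows 1..N
x = rng.standard_normal(8).astype(np.float16)        # float16 input tensor

y = np.array([dot_tree(r, x) for r in W])            # float16 output tensor
assert np.allclose(y, W @ x, rtol=1e-2, atol=1e-2)
```

A tree reduction finishes in log2(N) adder stages rather than N sequential adds, which is what lets the unit sustain one dot product per cycle once the pipeline is full.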
FPGA Performance vs. Data Type (Tera-Operations/sec)

Data type  | Stratix V D5 @ 225 MHz | Stratix 10 280 @ 500 MHz
-----------|------------------------|-------------------------
16-bit int | 1.4                    | 12
8-bit int  | 2.0                    | 31
ms-fp9     | 2.7                    | 65
ms-fp8     | 4.5                    | 90
Impact of Narrow Precision on Accuracy

[Figure: accuracy (y-axis 0.50-1.00) for Model 1 (GRU-based), Model 2 (LSTM-based), and Model 3 (LSTM-based), comparing float32, ms-fp9, and ms-fp9 with retraining]