SLIDE 1

Deep neural networks have enabled major advances in machine learning and AI:
• Computer vision
• Language translation
• Speech recognition
• Question answering

[Figure: an unrolled recurrent neural network with inputs x_{t-1}, x_t, x_{t+1}, hidden states h_{t-1}, h_t, h_{t+1}, and outputs y_{t-1}, y_t, y_{t+1}]


SLIDE 2

Deep neural networks have enabled major advances in machine learning and AI:
• Computer vision
• Language translation
• Speech recognition
• Question answering
• And more…

Problem: DNNs are challenging to serve and deploy in large-scale interactive services.

[Figures: Convolutional Neural Networks; Recurrent Neural Networks, unrolled over time with inputs x_{t-1}, x_t, x_{t+1}, hidden states h_{t-1}, h_t, h_{t+1}, and outputs y_{t-1}, y_t, y_{t+1}]
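The recurrence pictured above is the core of the serving workload discussed in the rest of the deck. Below is a minimal NumPy sketch of one unrolled step, assuming a plain Elman-style cell with tanh; the production models referenced later (Slide 42) are GRU- and LSTM-based, so this is illustrative only.

    import numpy as np

    def rnn_step(x_t, h_prev, W_x, W_h, W_y, b_h, b_y):
        """One step of the unrolled recurrence in the figure:
        h_t depends on the current input and the previous hidden state."""
        h_t = np.tanh(W_x @ x_t + W_h @ h_prev + b_h)   # hidden state h_t
        y_t = W_y @ h_t + b_y                           # output y_t
        return h_t, y_t

    # Unroll over a sequence x_1..x_T, threading the hidden state through time.
    rng = np.random.default_rng(0)
    D, H, O = 64, 128, 10                    # input/hidden/output sizes (arbitrary)
    W_x = rng.standard_normal((H, D)) * 0.1
    W_h = rng.standard_normal((H, H)) * 0.1
    W_y = rng.standard_normal((O, H)) * 0.1
    b_h, b_y = np.zeros(H), np.zeros(O)
    h = np.zeros(H)
    for x in rng.standard_normal((5, D)):    # T = 5 time steps
        h, y = rnn_step(x, h, W_x, W_h, W_y, b_h, b_y)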

SLIDE 3

DNN Processing Units

[Figure: a spectrum trading FLEXIBILITY for EFFICIENCY: CPUs (Control Unit (CU), Registers, Arithmetic Logic Unit (ALU)) → GPUs → Soft DPUs (FPGA) → Hard DPUs (ASICs)]

Hard DPUs (ASICs): Cerebras, Google TPU, Graphcore, Groq, Intel Nervana, Movidius, Wave Computing, etc.
Soft DPUs (FPGA): BrainWave, Baidu SDA, Deephi Tech, ESE, Teradeep, etc.

SLIDE 4

A Scalable FPGA-powered DNN Serving Platform

[Figure: a pretrained DNN model (in CNTK, etc.) is compiled into a scalable DNN hardware microservice: network switches connect FPGAs (F) hosting the model's layers (L0, L1); each BrainWave Soft DPU pairs an instruction decoder & control block with a Neural FU]

SLIDE 5

[Figure: the datacenter stack: a CPU compute layer, a reconfigurable compute layer (FPGA), and a converged network]

SLIDE 6

Sub-millisecond FPGA compute latencies at batch 1



SLIDE 8

A framework-neutral federated compiler and runtime for compiling pretrained DNN models to soft DPUs


SLIDES 9–12 (incremental builds, adding one point per slide)

• Adaptive ISA for narrow-precision DNN inference: flexible and extensible to support fast-changing AI algorithms
• BrainWave Soft DPU microarchitecture: highly optimized for narrow precision and low batch
• Persist model parameters entirely in FPGA on-chip memories; support large models by scaling across many FPGAs (see the sketch below)
• Intel FPGAs deployed at scale with HW microservices [MICRO’16]
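One way to picture the persistency and scale-out point above: shard the weight matrix so each device's shard stays resident in its fast local memory, broadcast the input vector to every device, and gather the partial results. A minimal NumPy sketch under those assumptions; the row-wise split and the two-device count are illustrative, not BrainWave's actual partitioning.

    import numpy as np

    def shard_rows(W, n_devices):
        """Partition a weight matrix row-wise so each device can pin
        its shard in on-chip memory (modeled here as a plain list)."""
        return np.array_split(W, n_devices, axis=0)

    def distributed_matvec(shards, x):
        """Broadcast x to every device, compute partial products locally,
        then concatenate: equivalent to (W @ x) on a single device."""
        return np.concatenate([W_i @ x for W_i in shards])

    rng = np.random.default_rng(1)
    W = rng.standard_normal((1000, 1000))
    x = rng.standard_normal(1000)
    shards = shard_rows(W, n_devices=2)      # e.g. two FPGAs
    assert np.allclose(distributed_matvec(shards, x), W @ x)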


SLIDE 14


A Cloud-Scale Acceleration Architecture [MICRO’16]

SLIDE 15

Interconnected FPGAs form a separate plane of computation that can be managed and used independently from the CPU.

• Traditional software (CPU) server plane: web search ranking
• Hardware acceleration plane: web search ranking, deep neural networks, SDN offload, SQL

[Figure: in each server, the CPUs (linked by QPI) connect to an FPGA, and the FPGA connects to the ToR switch over 40Gb/s QSFP links; CPUs, FPGAs, and routers form the converged network]


SLIDE 17

[Figure: graph splitting across FPGA0 and FPGA1. 1000-dim vectors are split into 500-dim halves; four MatMul500 ops against 500x500 weight matrices, plus Add500 and Sigmoid500 stages, run partly on each FPGA, and the two 500-dim results are concatenated back into a 1000-dim vector]

Toolflow: CNTK, Caffe, and TensorFlow models enter through frontends that emit a portable IR; the Graph Splitter and Optimizer produces transformed IRs; target compilers (FPGA, CPU-CNTK, CPU-Caffe) generate a deployment package for the FPGA HW microservice.
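A toy rendering of that toolflow in Python: a framework-neutral graph of ops (the "portable IR") that a splitter assigns to devices before target compilation. The Node fields and the round-robin placement heuristic are invented for illustration; the real Graph Splitter and Optimizer is far more sophisticated.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        """One op in a toy portable IR (framework-neutral graph)."""
        op: str                      # e.g. "Split", "MatMul500", "Sigmoid500"
        inputs: list = field(default_factory=list)
        device: str = "unassigned"   # filled in by the splitter

    def split_graph(nodes, devices):
        """Toy 'graph splitter': round-robin ops onto devices.
        Stands in for the real partitioning pass."""
        for i, n in enumerate(nodes):
            n.device = devices[i % len(devices)]
        return nodes

    ir = [Node("Split"), Node("MatMul500"), Node("MatMul500"),
          Node("Add500"), Node("Sigmoid500"), Node("Concat")]
    for n in split_graph(ir, ["FPGA0", "FPGA1"]):
        print(n.op, "->", n.device)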

SLIDE 18

[Figure: data and compute costs per layer type. Matrix-vector multiply (input activation × weights = output pre-activation): O(N²) data, O(N²) compute. Convolution of the input activation with N weight kernels: O(N³) data, O(N⁴K²) compute]

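To make those asymptotics concrete, count data touched versus multiply-accumulates. A small sketch with arbitrary N and K; it suggests why batch-1 matrix-vector layers are memory-bound (each weight is used exactly once per input), which is the pressure that on-chip parameter persistence (Slides 9–12) relieves.

    N, K = 1000, 3   # layer width and kernel size (arbitrary)

    # Dense / matrix-vector layer: each of the N*N weights is read once
    # and used in exactly one multiply-accumulate per input vector.
    mv_data    = N * N          # weights touched: O(N^2)
    mv_compute = N * N          # MACs: O(N^2)
    print("matvec ops per datum:", mv_compute / mv_data)   # 1.0 -> memory-bound at batch 1

    # Convolution reuses each weight across many output positions, so
    # compute grows much faster than data (using the slide's asymptotics).
    conv_data    = N**3           # O(N^3) data
    conv_compute = N**4 * K**2    # O(N^4 K^2) MACs
    print("conv ops per datum:", conv_compute / conv_data)  # ~N*K^2: heavy reuse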

SLIDE 20

[Figure: a conventional serving node: 2x CPU with an FPGA; model parameters initialized in DRAM]


SLIDE 22

[Chart: FPGA hardware utilization (%) vs. batch size]

SLIDE 23

Batching improves HW utilization but increases latency.

[Charts: hardware utilization (%) vs. batch size; 99th-percentile latency vs. batch size, plotted against a maximum-allowed-latency bound]

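That trade-off can be played with in a toy model: larger batches amortize a fixed per-launch cost (utilization rises), but the first request in a batch must wait for the batch to fill (tail latency rises). Every constant below is invented for illustration.

    # Toy model of the batching trade-off. All constants are illustrative.
    ARRIVAL_GAP_MS = 0.2     # mean time between request arrivals
    FIXED_OVERHEAD_MS = 1.0  # per-launch cost that batching amortizes
    PER_ITEM_MS = 0.1        # marginal compute per request in a batch

    def utilization(batch):
        """Fraction of device time doing useful work: grows with batch size."""
        useful = batch * PER_ITEM_MS
        return useful / (useful + FIXED_OVERHEAD_MS)

    def worst_case_latency_ms(batch):
        """The first request waits for the rest of the batch to arrive,
        then the whole batch computes together."""
        fill_wait = (batch - 1) * ARRIVAL_GAP_MS
        return fill_wait + FIXED_OVERHEAD_MS + batch * PER_ITEM_MS

    for b in (1, 4, 16, 64):
        print(f"batch={b:3d}  util={utilization(b):5.1%}  "
              f"latency={worst_case_latency_ms(b):6.2f} ms")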

SLIDE 25

[Figure: 2x CPU with FPGA serving node]

SLIDE 26

Observations

[Figure: 2x CPU serving node]



SLIDE 32

Core Features

• Proprietary parameterizable narrow-precision format wrapped in float16 interfaces

[Figure: FPGA block containing the Matrix-Vector Unit]
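One reading of "wrapped in float16 interfaces": tensors cross the unit boundary as float16 and are quantized to the internal narrow format on entry. Since ms-fp8/ms-fp9 are proprietary, the mantissa-truncation scheme below is purely an assumed stand-in, not the actual format.

    import numpy as np

    MANTISSA_BITS = 2   # assumed internal mantissa width; ms-fp formats are proprietary

    def quantize_narrow_float(x, mantissa_bits=MANTISSA_BITS):
        """Round each value to a float with a truncated mantissa, keeping
        sign and exponent: a stand-in for an ms-fp-style internal format."""
        x = np.asarray(x, dtype=np.float16)          # float16 at the interface
        m, e = np.frexp(x.astype(np.float32))        # x = m * 2**e, 0.5 <= |m| < 1
        scale = 2.0 ** mantissa_bits
        m_q = np.round(m * scale) / scale            # keep only `mantissa_bits` bits
        return np.ldexp(m_q, e).astype(np.float16)   # back to float16 at the interface

    x = np.array([0.33, -1.7, 42.0], dtype=np.float16)
    print(quantize_narrow_float(x))   # coarsely rounded, same dynamic range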

SLIDE 33

Neural Functional Unit

[Figure: the NFU microarchitecture. An Input Message Processor, Control Processor, and Output Message Processor front the unit; an Instruction Decoder issues commands through Tensor Arbiters (TA). A Tensor Manager, Matrix Memory Manager, and Vector Memory Manager move tensor data between DRAM and the datapath. The Matrix-Vector Unit converts incoming float16 tensors to msft-fp, runs parallel matrix-vector-multiply kernels (each with its own Matrix RF and VRF, plus a Network IFC), and converts results back to float16. Two chained Multifunction Units (crossbar, multiply (x), add/sub (+), activation (A), VRFs) apply elementwise post-processing. Legend: memory, tensor data, instructions, commands]
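Functionally, one pass through the NFU datapath is a matrix-vector multiply followed by elementwise stages. A minimal functional model in NumPy; the stage ordering and the choice of sigmoid for the activation block are assumptions for illustration, not the actual microcode.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def nfu_layer(x, W, scale, bias):
        """Functional model of one pass through the NFU datapath:
        Matrix-Vector Unit, then Multifunction Unit stages."""
        acc = W @ x              # Matrix-Vector Unit: parallel MV-multiply kernels
        acc = acc * scale        # Multifunction Unit: elementwise multiply (x)
        acc = acc + bias         # Multifunction Unit: add/sub (+)
        return sigmoid(acc)      # Multifunction Unit: activation (A)

    rng = np.random.default_rng(2)
    x = rng.standard_normal(500).astype(np.float16)       # float16 at the interface
    W = rng.standard_normal((500, 500)).astype(np.float16)
    y = nfu_layer(x.astype(np.float32), W.astype(np.float32),
                  scale=1.0, bias=0.0)                    # wider internal compute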

SLIDE 34

Features

[Figure: the multiply-add tree inside a matrix-vector kernel: a float16 input tensor is broadcast against matrix rows 1..N through parallel multipliers feeding an adder tree, producing the float16 output tensor]

SLIDE 35

FPGA Performance vs. Data Type (Stratix V D5 @ 225 MHz), in Tera-Operations/sec:
16-bit int: 1.4
8-bit int: 2.0
ms-fp9: 2.7
ms-fp8: 4.5


SLIDES 37–41 (incremental build of the chart)

FPGA Performance vs. Data Type, in Tera-Operations/sec:

Data type     Stratix V D5 @ 225 MHz    Stratix 10 280 @ 500 MHz
16-bit int    1.4                       12
8-bit int     2.0                       31
ms-fp9        2.7                       65
ms-fp8        4.5                       90

SLIDE 42

[Chart: FPGA Performance vs. Data Type, as on Slides 37–41]

Impact of Narrow Precision on Accuracy

[Chart: accuracy (0.50–1.00) for Model 1 (GRU-based), Model 2 (LSTM-based), and Model 3 (LSTM-based), comparing float32, ms-fp9, and ms-fp9 after retraining]

SLIDE 43

Project BrainWave is a powerful platform for an accelerated AI cloud:
• Runs on Microsoft’s hyperscale infrastructure with FPGAs
• Achieves excellent performance at low batch sizes via persistency and narrow precision
• Adaptable to changes in precision and in future AI algorithms

BrainWave running on Hardware Microservices will push the boundary of what is possible to deploy in the cloud:
• Deeper/larger CNNs for more accurate computer vision
• Higher-dimensional RNNs toward human-like natural language processing
• State-of-the-art speech
• And much more…

Stay tuned for announcements about external availability.