SLIDE 1

SCALEDEEP

Bryce Paputa and Royce Hwang

SLIDE 2

Motivation: DNN Applications

  • Google image search, Apple Siri
  • Self-driving cars, Education, Healthcare

Source: https://deepmind.com/
Source: http://fortune.com/2017/03/27/waymo-self-driving-minivans-snow/
Source: https://www.verizonwireless.com/od/smartphones/apple-iphone-x/

SLIDE 3

Simple Neural Network

Source: https://www.dtreg.com/solution/view/22

SLIDE 4

3 Stages of Training

  • Forward propagation: evaluates the network
  • Back propagation: calculates the error and propagates it from the output stages to the input stages
  • Weight gradient and update: calculates the gradient of the error and updates the weights to reduce the error
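
A minimal NumPy sketch of one pass through these three stages for a single fully connected layer (a toy illustration, not the paper's code; the layer size, loss, and learning rate are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 4))            # weights of a tiny layer
    x = rng.normal(size=(4,))              # input features
    target = np.array([0.0, 1.0, 0.0])     # desired output
    lr = 0.1                               # learning rate

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # 1. Forward propagation: evaluate the network
    y = sigmoid(W @ x)

    # 2. Back propagation: compute the output error and push it back toward the inputs
    err = y - target                       # gradient of 1/2 * squared error w.r.t. y
    delta = err * y * (1.0 - y)            # propagated through the sigmoid

    # 3. Weight gradient and update: adjust the weights to reduce the error
    grad_W = np.outer(delta, x)
    W = W - lr * grad_W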

SLIDE 5

From Simple to Deep NN

Source: https://hackernoon.com/log-analytics-with-deep-learning-and-machine-learning-20a1891ff70e

SLIDE 6

Convolutional Neural Network

Source: http://cs231n.github.io/convolutional-networks/

SLIDE 7

Implementation Challenges

  • Training and inference steps are extremely compute and data intensive
  • Example: OverFeat DNN – 820K neurons, 145M parameters, ~3.3 x 10^9 operations for a single 231 x 231 image
  • To process the ImageNet dataset (1.2 million images), it needs ~15 x 10^15 operations for a single training pass
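
A back-of-the-envelope check of those numbers (illustrative Python; the 3x factor for back propagation and the weight update is an assumption, not a figure from the slides):

    ops_per_image = 3.3e9        # forward pass for one 231 x 231 image (OverFeat)
    num_images = 1.2e6           # ImageNet training set size

    forward_ops = ops_per_image * num_images       # ~4.0e15 operations, forward only
    # Back propagation and the weight gradient/update stages are assumed here
    # to cost roughly 3x the forward pass (a rule of thumb, not a slide figure).
    training_ops = forward_ops * (1 + 3)            # ~1.6e16, close to ~15 x 10^15 above
    print(f"forward: {forward_ops:.1e}  full training pass: {training_ops:.1e}")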

SLIDE 8

Escalating Computational Requirements

Unless otherwise noted, all figures are from SCALEDEEP: A Scalable Compute Architecture for Learning and Evaluating Deep Networks, Venkataramani et al., ISCA 2017.

SLIDE 9

Ways to Speed This Up

SLIDE 10

System Architecture

SLIDE 11

Convolutional DNN

Source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

SLIDE 12

3 Main Layers: Convolution

  • Convolution (CONV) Layer
    – Takes in inputs and applies the convolution operation with weights
    – Outputs values (features) to the next layers
    – Computationally intensive
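
A minimal sketch of what one CONV layer computes – each output value is the dot product of an input patch with a small weight kernel (toy NumPy code; the image and kernel sizes are assumptions):

    import numpy as np

    def conv2d(image, kernel):
        """Valid 2D convolution (cross-correlation) of a single channel."""
        kh, kw = kernel.shape
        oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.random.rand(231, 231)       # one input channel
    kernel = np.random.rand(11, 11)        # one learned filter (weights)
    features = conv2d(image, kernel)       # feature map passed to the next layer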

SLIDE 13

3 Main Layers: Sampling

  • Sampling (SAMP) Layer
    – Also known as the pooling layer
    – Performs up/down sampling on features
    – Example: decreasing image resolution
    – Data intensive
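
A minimal sketch of down-sampling with 2x2 max pooling, which quarters the number of values per feature map (toy NumPy code; the window size is an assumption):

    import numpy as np

    def max_pool_2x2(fm):
        """Halve each dimension by keeping the max of every 2x2 block."""
        h, w = fm.shape
        fm = fm[:h - h % 2, :w - w % 2]                        # drop odd edges
        blocks = fm.reshape(fm.shape[0] // 2, 2, fm.shape[1] // 2, 2)
        return blocks.max(axis=(1, 3))

    feature_map = np.random.rand(56, 56)
    pooled = max_pool_2x2(feature_map)     # shape (28, 28): lower resolution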

SLIDE 14

3 Main Layers: Fully Connected

  • Fully Connected (FC) Layer
    – Composes features from the CONV layers into an output (classification, etc.)
    – Data intensive
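
A minimal sketch of an FC layer: every output is a weighted sum over all inputs, so the weight matrix is large and each weight is used only once per input – little compute, lots of data (toy NumPy code; the sizes are assumptions):

    import numpy as np

    features = np.random.rand(4096)        # flattened features from CONV/SAMP layers
    W = np.random.rand(1000, 4096)         # one weight per (output, input) pair
    b = np.zeros(1000)

    logits = W @ features + b              # class scores
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                   # softmax over the 1000 classes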

SLIDE 15

Computation Heavy Layers

Initial CONV layers

  • Fewer, but larger features
  • 16% of Flops
  • Very high reuse of weights

Middle CONV layers

  • Smaller features, but more numerous
  • 80% of Flops

SLIDE 16

Memory Heavy Layers

Fully connected layers

  • Fewer Flops (4%)
  • No weight reuse

Sampling Layers

  • Even fewer Flops (0.1%)
  • No training step/weights
  • Very high Bytes/FLOP
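
One way to see the compute-heavy vs. memory-heavy split of the last two slides is to compare Bytes/FLOP for a typical CONV layer and a typical FC layer (rough sketch; the layer sizes are made up and 4-byte floats are assumed):

    # CONV layer: a small kernel is reused at every output position
    k, c_in, c_out, h, w = 3, 64, 64, 56, 56
    conv_flops = 2 * h * w * c_out * c_in * k * k
    conv_bytes = 4 * (c_in * c_out * k * k + c_in * h * w + c_out * h * w)
    print("CONV Bytes/FLOP ~", conv_bytes / conv_flops)    # << 1: compute bound

    # FC layer: every weight is loaded once and used once
    n_in, n_out = 4096, 4096
    fc_flops = 2 * n_in * n_out
    fc_bytes = 4 * (n_in * n_out + n_in + n_out)
    print("FC Bytes/FLOP ~", fc_bytes / fc_flops)          # ~2: memory bound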

SLIDE 17

Summary of Characteristics

SLIDE 18

CompHeavy Tile

  • Used for low Bytes/FLOP stages
  • 2D-PE computes the dot product of an input and kernel
  • Computes many kernels convolved with the input
  • Statically controlled
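
A rough sketch of the work a CompHeavy tile's 2D-PE array does: many kernels share one input patch, and each "PE row" accumulates one dot product with no data-dependent control flow (illustrative Python, not the actual microarchitecture; the grid sizes are assumptions):

    import numpy as np

    def compheavy_tile(input_patch, kernels):
        x = input_patch.ravel()
        outputs = np.zeros(len(kernels))
        for row, kern in enumerate(kernels):       # one PE row per kernel
            acc = 0.0
            for a, w in zip(x, kern.ravel()):      # one multiply-accumulate per PE
                acc += a * w
            outputs[row] = acc
        return outputs

    patch = np.random.rand(3, 3)
    kernels = [np.random.rand(3, 3) for _ in range(8)]   # 8 kernels reuse the same input
    out = compheavy_tile(patch, kernels)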

SLIDE 19

MemHeavy Tile

  • Stores features, weights, errors, and error gradients in scratchpad memory
  • Special Function Units (SFUs) implement activation functions like ReLU, tanh, and sigmoid

SLIDE 20

SCALEDEEP Chip

SLIDE 21

Heterogeneous Chips

CONV Layer Chip
FC Layer Chip

SLIDE 22

Node Architecture

  • All memory is on chip or directly connected
  • Wheel configuration allows for high memory bandwidth and for layers to be split between chips
  • Ring configuration allows for high model parallelism

SLIDE 23

Intra-layer Parallelism

SLIDE 24

Inter-Layer Parallelism

  • Pipeline depth is equal to twice the number of layers used during training (each layer contributes a forward and a backward stage)
  • Depth is equal to the number of layers during evaluation
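
A quick worked example of those depths (the layer count here is arbitrary):

    layers = 5                       # DNN layers mapped onto the pipeline
    training_depth = 2 * layers      # forward + backward stage per layer -> 10
    inference_depth = layers         # forward stages only -> 5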

SLIDE 25

Experimental Results

  • System tested using 7032 Processing Elements
  • Single precision - 680 TFLOPS
  • Half precision - 1.35 PFLOPS
  • 6-28x speedup compared to a TitanX GPU

SLIDE 26

Power Usage

SLIDE 27

Hardware (PE) Utilization

  • 1. The granularity at which PEs can be allocated is coarser than ideal:
    • a. Layer distribution to columns
    • b. Feature distribution to MemHeavy Tiles
    • c. Feature sizes are not a multiple of 2D-PE rows
  • 2. Control logic and data transfer also lower utilization

Total utilization is 35%

SLIDE 28

Key Features of SCALEDEEP

  • Heterogeneous processing tiles and compute chips
  • System design matches the memory access structure of DNNs
  • Nested pipelining to minimize data movement and improve core utilization

SLIDE 29

Discussion

  • Since DNN design is still more of an art than a science at this point, does it make sense to build an ASIC, given the high cost of developing hardware?
  • How does ScaleDeep compare to other systems like Google’s TPU and TABLA? In what situations is it better and worse?

  • What are some pitfalls of this design?
