SCALEDEEP
Bryce Paputa and Royce Hwang
Motivation: DNN Applications
- Google image search, Apple Siri
- Self-driving cars, Education, Healthcare
Sources: https://deepmind.com/, http://fortune.com/2017/03/27/waymo-self-driving-minivans-snow/, https://www.verizonwireless.com/od/smartphones/apple-iphone-x/
Simple Neural Network
Source: https://www.dtreg.com/solution/view/22
3 Stages of Training
- Forward propagation: evaluates the network on an input.
- Back propagation: calculates the error and propagates it from the output stages back to the input stages.
- Weight gradient and update: calculates the gradient of the error with respect to the weights and updates the weights to reduce the error (a code sketch of all three stages follows).
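A minimal sketch of the three stages for a tiny one-hidden-layer network in NumPy; the shapes, learning rate, and squared-error loss are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 4 inputs -> 8 hidden units -> 3 outputs.
W1 = rng.normal(scale=0.1, size=(4, 8))
W2 = rng.normal(scale=0.1, size=(8, 3))
x = rng.normal(size=4)                  # one input example
y = np.array([1.0, 0.0, 0.0])           # its target
lr = 0.01

# 1. Forward propagation: evaluate the network.
h = np.maximum(x @ W1, 0.0)             # hidden layer, ReLU activation
y_hat = h @ W2                          # output layer (linear)

# 2. Back propagation: compute the error at the output and
#    propagate it back from the output stage toward the input stage.
err_out = y_hat - y                     # gradient of squared error w.r.t. y_hat
err_hid = (err_out @ W2.T) * (h > 0.0)  # gradient through the ReLU

# 3. Weight gradient and update: gradients of the error w.r.t. the
#    weights, then a step that reduces the error.
W2 -= lr * np.outer(h, err_out)
W1 -= lr * np.outer(x, err_hid)
```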
From Simple to Deep NN
Source: https://hackernoon.com/log-analytics-with-deep-learning-and-machine-learning-20a1891ff70e
Convolutional Neural Network
Source: http://cs231n.github.io/convolutional-networks/
Implementation Challenges
- Training and inference steps are extremely compute- and data-intensive.
- Example: the Overfeat DNN has 820K neurons and 145M parameters, and needs ~3.3 × 10^9 operations for a single 231 × 231 image.
- Processing the ImageNet dataset (1.2 million images) takes ~15 × 10^15 operations for a single training iteration (rough arithmetic below).
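A rough check of that estimate; the 3-4x training-to-forward cost ratio is a common rule of thumb, not a figure from the slide.

```python
forward_ops_per_image = 3.3e9   # Overfeat, one 231x231 image (from the slide)
images = 1.2e6                  # ImageNet training set size

# Assumption: back propagation and the weight-gradient/update stage each
# cost roughly as much as the forward pass, so a full training pass is
# about 4x the forward-only cost.
ops_per_training_pass = forward_ops_per_image * images * 4
print(f"{ops_per_training_pass:.1e}")   # ~1.6e16, i.e. ~15 x 10^15 operations
```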
Escalating Computational Requirements
Unless otherwise noted, all figures are from SCALEDEEP: A Scalable Compute Architecture for Learning and Evaluating Deep Networks, Venkataramani et al., ISCA 2017.
Ways to Speed This Up
System Architecture
Convolutional DNN
Source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
3 Main Layers: Convolution
- Convolution (CONV) Layer
– Takes in inputs and applies a convolution operation with its weights
– Outputs values (features) to the next layers
– Computationally intensive (sketched below)
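A minimal sketch of the operation a CONV layer applies; as in common DNN usage this is really cross-correlation (no kernel flip), and the kernel size is an illustrative assumption.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution of one input feature map with one weight kernel."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Dot product of the kernel with one input window: this
            # multiply-accumulate work is what dominates CONV layers.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.random.rand(231, 231)   # input sized as in the Overfeat example
kernel = np.random.rand(11, 11)    # illustrative kernel size
features = conv2d(image, kernel)   # one output feature map
```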
3 Main Layers: Sampling
- Sampling (SAMP) Layer
– Also known as the pooling layer
– Performs up/down sampling on features
– Example: decreasing image resolution
– Data intensive (sketched below)
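A minimal down-sampling sketch using max pooling, one common choice; the 2×2 window is an illustrative assumption.

```python
import numpy as np

def max_pool(feature, size=2):
    """Max pooling: down-samples a feature map (e.g. halves image resolution).

    Little arithmetic, lots of data movement -- which is why SAMP layers
    are data-intensive rather than compute-intensive.
    """
    h, w = feature.shape[0] // size, feature.shape[1] // size
    return feature[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

pooled = max_pool(np.random.rand(8, 8))   # 8x8 feature map -> 4x4
```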
3 Main Layers: Fully Connected
- Fully Connected (FC) Layer
– Composes features from the CONV layers into an output (classification, etc.)
– Data intensive (sketched below)
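A minimal FC-layer sketch; the 4096-input / 1000-output shape is an illustrative, AlexNet-like assumption, not taken from the paper.

```python
import numpy as np

# A fully connected layer is a matrix-vector product: every input feature
# is connected to every output by its own weight.
features = np.random.rand(4096)   # flattened CONV-layer features (assumed size)
W = np.random.rand(1000, 4096)    # one weight per (output, input) pair
b = np.random.rand(1000)
scores = W @ features + b         # e.g. scores for 1000 classes

# Each of the 4096*1000 weights is read once and used in exactly one
# multiply-accumulate, so there is no weight reuse: the layer is
# data-intensive (memory-bound) rather than compute-bound.
```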
Computation Heavy Layers
Initial CONV layers:
- Fewer, but larger features
- 16% of FLOPs
- Very high reuse of weights

Middle CONV layers:
- Smaller, but more numerous features
- 80% of FLOPs
Memory Heavy Layers
Fully connected layers:
- Fewer FLOPs (4%)
- No weight reuse

Sampling layers:
- Even fewer FLOPs (0.1%)
- No weights, so no weight-update step during training
- Very high bytes/FLOP (estimated below)
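A rough bytes/FLOP estimate for an FC layer, under assumed sizes and single-precision weights, illustrating why these layers are memory-heavy.

```python
# Assumed FC layer: 4096 inputs, 1000 outputs, 4-byte (single-precision)
# weights. Each weight contributes one multiply and one add (2 FLOPs)
# and, with no reuse, is fetched from memory exactly once.
weights = 4096 * 1000
flops = 2 * weights
bytes_moved = 4 * weights
print(bytes_moved / flops)   # 2.0 bytes/FLOP -- far above CONV layers,
                             # where each weight is reused across the image
```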
Summary of Characteristics
CompHeavy Tile
- Used for low bytes/FLOP stages
- A 2D-PE computes the dot product of an input and a kernel
- Computes many kernels convolved with the same input (sketched below)
- Statically controlled
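A sketch of the work a CompHeavy tile performs, under assumed sizes; the hardware-mapping comments are our reading of the paper, and the fixed loop bounds (known ahead of time) are what static control exploits.

```python
import numpy as np

image = np.random.rand(32, 32)        # one shared input (assumed size)
kernels = np.random.rand(16, 5, 5)    # 16 kernels convolved with that input

oh = ow = 32 - 5 + 1
out = np.zeros((16, oh, ow))
for k in range(16):                   # in hardware: one 2D-PE per kernel
    for i in range(oh):
        for j in range(ow):
            # The inner dot product of an input window with a kernel is
            # the multiply-accumulate work a 2D-PE performs.
            out[k, i, j] = np.sum(image[i:i+5, j:j+5] * kernels[k])
```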
MemHeavy Tile
- Stores features, weights, errors, and error gradients in scratchpad memory
- Special Function Units (SFUs) implement activation functions like ReLU, tanh, and sigmoid (reference definitions below)
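Reference definitions of the activation functions the SFUs implement; the hardware evaluates these with dedicated function units, not software.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)
```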
SCALEDEEP Chip
Heterogeneous Chips
CONV Layer Chip vs. FC Layer Chip
Node Architecture
- All memory is on chip or directly connected
- The wheel configuration allows for high memory bandwidth and for layers to be split between chips
- The ring configuration allows for high model parallelism
Intra-layer Parallelism
Inter-Layer Parallelism
- Pipeline depth is equal to twice the number of layers during training, since each layer appears once in the forward pass and once in the backward pass
- Depth is equal to the number of layers during evaluation, which uses only the forward pass
Experimental Results
- System tested using 7,032 Processing Elements
- Single precision: 680 TFLOPS
- Half precision: 1.35 PFLOPS
- 6-28x speedup compared to a TitanX GPU
Power Usage
Hardware (PE) Utilization
- 1. The granularity at which PEs can be allocated is coarser than ideal:
- a. Layer distribution to columns
- b. Feature distribution to MemHeavy Tiles
- c. Feature sizes that are not a multiple of the 2D-PE rows
- 2. Control logic and data transfer also lower utilization
Total utilization is 35%
Key Features of SCALEDEEP
- Heterogeneous processing tiles (CompHeavy and MemHeavy) and compute chips (CONV-layer and FC-layer)
- The system design matches the memory-access structure of DNNs
- Nested pipelining to minimize data movement and improve core utilization
Discussion
- Since DNN design is still more of an art than a science at this point, does it make sense to build an ASIC, given the high cost of developing hardware?
- How does SCALEDEEP compare to other systems like Google's TPU and TABLA? In what situations is it better or worse?
- What are some pitfalls of this design?