  1. SCALEDEEP Bryce Paputa and Royce Hwang

  2. Motivation: DNN Applications • Google image search, Apple Siri • Self-driving cars, Education, Healthcare Source: https://deepmind.com/ Source: http://fortune.com/2017/03/27/waymo-self-driving-minivans-snow/ Source: https://www.verizonwireless.com/od/smartphones/apple-iphone-x/

  3. Simple Neural Network Source: https://www.dtreg.com/solution/view/22

  4. 3 Stages of Training - Forward propagation: Evaluates the network. - Back propagation: Calculates the error and propagates it from the output stages back to the input stages. - Weight gradient and update: Calculates the gradient of the error and updates the weights to reduce the error.
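
As a rough illustration of how these three stages show up in practice (a generic sketch, not code from the paper; the tiny model, loss, and optimizer below are arbitrary choices), a single training step in PyTorch looks like this:

    import torch
    import torch.nn as nn

    # Arbitrary toy model and data, used only to label the three training stages.
    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x, target = torch.randn(16, 4), torch.randint(0, 2, (16,))

    output = model(x)                                   # 1. Forward propagation: evaluate the network
    loss = nn.functional.cross_entropy(output, target)  #    error at the output
    loss.backward()                                     # 2. Back propagation: errors and weight gradients flow from output to input
    optimizer.step()                                    # 3. Weight update: adjust weights to reduce the error
    optimizer.zero_grad()                               # clear gradients before the next step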

  5. From Simple to Deep NN Source: https://hackernoon.com/log-analytics-with-deep-learning-and-machine-learning-20a1891ff70e

  6. Convolutional Neural Network Source: http://cs231n.github.io/convolutional-networks/

  7. Implementation Challenges • Training and inference steps are extremely compute- and data-intensive • Example: Overfeat DNN – 820K neurons, 145M parameters, ~3.3 × 10^9 operations for a single 231 × 231 image • To process the ImageNet dataset (1.2 million images), it needs ~15 × 10^15 operations for a single training iteration
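
The ~15 × 10^15 figure can be sanity-checked with back-of-the-envelope arithmetic (the multiplier for the backward and weight-gradient passes is an approximation, not a number from the slides):

    # Approximate operation counts from the slide above.
    ops_per_image_forward = 3.3e9            # forward pass on one 231 x 231 image
    images_per_iteration = 1.2e6             # ImageNet training set
    forward_only = ops_per_image_forward * images_per_iteration
    print(f"{forward_only:.1e}")             # ~4.0e+15 operations for the forward passes alone
    # Training also requires back propagation and weight-gradient computation,
    # which together cost a few times the forward pass, bringing the total to
    # roughly 15e15 operations per training iteration.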

  8. Escalating Computational Requirements Unless otherwise noted, all figures are from SCALEDEEP: A Scalable Compute Architecture for Learning and Evaluating Deep Networks, Venkataramani et al., ISCA 2017.

  9. Ways to Speed This Up

  10. System Architecture

  11. Convolutional DNN Source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

  12. 3 Main Layers: Convolution • Convolution (CONV) Layer – Takes in inputs and applies a convolution operation with weights – Outputs values (features) to the next layers – Computationally intensive

  13. 3 Main Layers: Sampling • Sampling (SAMP) Layer – Also known as a pooling layer – Performs up/down sampling on features – Example: decreasing image resolution – Data intensive

  14. 3 Main Layers: Fully Connected • Fully Connected (FC) Layer – Composes features from the CONV layers into the output (classification, etc.) – Data intensive
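
A minimal sketch tying the three layer types together (an arbitrary toy network in PyTorch, not the topology from the paper):

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),  # CONV: convolves weights with the input, compute intensive
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2),                 # SAMP: down-samples features (halves resolution), data intensive
        nn.Flatten(),
        nn.Linear(16 * 16 * 16, 10),                 # FC: composes features into the output classes, data intensive
    )

    x = torch.randn(1, 3, 32, 32)                    # one 32 x 32 RGB image
    print(model(x).shape)                            # torch.Size([1, 10])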

  15. Computation Heavy Layers Initial CONV layers - Fewer, but larger features - 16% of FLOPs - Very high reuse of weights Middle CONV layers - Smaller but more numerous features - 80% of FLOPs

  16. Memory Heavy Layers Fully connected layers - Fewer FLOPs (4%) - No weight reuse Sampling layers - Even fewer FLOPs (0.1%) - No training step/weights - Very high Bytes/FLOP
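
The compute-heavy versus memory-heavy split comes down to arithmetic intensity. A rough estimate for illustrative layer shapes (the sizes below are arbitrary examples, not taken from the paper):

    # CONV layer: 256 -> 256 channels, 3x3 kernel, 14x14 output; each weight is
    # reused at every one of the 14x14 output positions.
    conv_flops = 2 * 256 * 256 * 3 * 3 * 14 * 14
    conv_weight_bytes = 256 * 256 * 3 * 3 * 4        # fp32 weights
    # FC layer: 4096 x 4096; each weight is used exactly once per input.
    fc_flops = 2 * 4096 * 4096
    fc_weight_bytes = 4096 * 4096 * 4
    print(conv_weight_bytes / conv_flops)            # ~0.01 Bytes/FLOP -> compute heavy
    print(fc_weight_bytes / fc_flops)                # ~2.0  Bytes/FLOP -> memory heavy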

  17. Summary of Characteristics

  18. CompHeavy Tile - Used for low Bytes/FLOP stages - 2D-PE computes the dot product of an input and a kernel - Computes many kernels convolved with the input - Statically controlled
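
One way to see how a 2D array of processing elements maps onto convolution is the standard im2col trick, where each output value is a dot product of an input patch with a kernel (a generic sketch with arbitrary sizes, not the SCALEDEEP dataflow itself):

    import numpy as np

    H, W, K, num_kernels = 6, 6, 3, 4
    x = np.random.rand(H, W)
    kernels = np.random.rand(num_kernels, K, K)

    # Gather every K x K input patch into a row (im2col).
    patches = np.array([x[i:i + K, j:j + K].ravel()
                        for i in range(H - K + 1)
                        for j in range(W - K + 1)])   # shape (16, 9)

    weights = kernels.reshape(num_kernels, -1)        # shape (4, 9)
    out = patches @ weights.T                         # every entry is one patch-kernel dot product
    print(out.shape)                                  # (16, 4): four 4x4 output feature maps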

  19. MemHeavy Tile - Stores features, weights, errors, and error gradients in scratchpad memory - Special Function Units (SFUs) implement activation functions like ReLU, tanh, and sigmoid
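
Reference definitions of the activation functions named above (simple NumPy versions, not the SFU hardware implementations):

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.array([-2.0, 0.0, 2.0])
    print(relu(x), sigmoid(x), np.tanh(x))            # tanh comes directly from NumPy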

  20. SCALEDEEP Chip

  21. Heterogeneous Chips - CONV Layer Chip - FC Layer Chip

  22. Node Architecture - All memory is on chip or directly connected - Wheel configuration allows for high memory bandwidth and for layers to be split between chips - Ring configuration allows for high model parallelism

  23. Intra-layer Parallelism

  24. Inter-Layer Parallelism - Pipeline depth is equal to twice the number of layers during training - Depth is equal to the number of layers during evaluation
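
A small sketch of where the 2x comes from, assuming one pipeline stage per layer for the forward pass and one for the backward pass (the layer list is an arbitrary example):

    layers = ["conv1", "conv2", "fc"]
    forward_stages = [f"fwd:{name}" for name in layers]
    backward_stages = [f"bwd:{name}" for name in reversed(layers)]

    train_pipeline = forward_stages + backward_stages
    print(len(train_pipeline))      # 6 = 2 x number of layers (training)
    print(len(forward_stages))      # 3 = number of layers (evaluation)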

  25. Experimental Results - System tested using 7,032 Processing Elements - Single precision: 680 TFLOPS - Half precision: 1.35 PFLOPS - 6-28x speedup compared to a Titan X GPU
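
Dividing the reported peak throughput by the PE count gives a rough per-PE figure (a derived estimate, not a number stated on the slide):

    total_pes = 7032
    single_precision = 680e12                         # 680 TFLOPS
    half_precision = 1.35e15                          # 1.35 PFLOPS
    print(single_precision / total_pes / 1e9)         # ~97 GFLOPS per PE (single precision)
    print(half_precision / total_pes / 1e9)           # ~192 GFLOPS per PE (half precision)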

  26. Power Usage

  27. Hardware (PE) Utilization 1. The granularity at which PEs can be allocated is coarser than ideal: a. Layer distribution to columns b. Feature distribution to MemHeavy Tiles c. Feature sizes are not a multiple of the 2D-PE rows 2. Control logic and data transfer also lower utilization Total utilization is 35%
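
If that utilization figure translated directly into achieved throughput, the sustained single-precision rate would be roughly a third of peak (an illustrative estimate, not a result reported in the paper):

    peak_tflops = 680.0
    utilization = 0.35
    print(peak_tflops * utilization)                  # ~238 TFLOPS effective (single precision)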

  28. Key Features of SCALEDEEP • Heterogeneous processing tiles and compute chips • System design matches the memory-access structure of DNNs • Nested pipelining to minimize data movement and improve core utilization

  29. Discussion • Since DNN design is still more of an art than a science at this point, does it make sense to build an ASIC, given the high cost of developing hardware? • How does ScaleDeep compare to other systems like Google’s TPU and TABLA? In what situations is it better, and in what situations worse? • What are some pitfalls of this design?
