The Explosion in Neural Network Hardware

USC Friday 19th April

Trevor Mudge, Bredt Family Professor of Computer Science and Engineering

The University of Michigan, Ann Arbor


What Just Happened?

§ For years the common wisdom was that hardware was a bad bet for venture capital
§ That has changed
§ More than 45 start-ups are designing chips for image processing, speech, and self-driving cars
§ 5 have raised more than $100 million
§ Venture capitalists invested over $1.5 billion in chip start-ups last year


Driving Factors

§ Fueled by pragmatism—the "unreasonable" success of neural nets
§ Slowing of Moore's Law has made accelerators more attractive
§ Existing accelerators could easily be repurposed—GPUs and DSPs
§ Algorithms fitted an existing paradigm
§ Orders of magnitude increase in the size of data sets
§ Independent foundries—TSMC is the best known


What Are Neural Networks Used For?

[Figure: example applications include computer vision, self-driving cars, keyword spotting, and seizure detection]

§ A unifying approach to "understanding"—in contrast to an expert-guided set of algorithms to recognize faces, for example
§ Their success is based on the availability of enormous amounts of training data


Notable Successes

§ Facebook's DeepFace is 97.35% accurate on the Labeled Faces in the Wild (LFW) dataset—as good as a human in some cases
§ Recent attention-grabbing application—DeepMind's AlphaGo
§ It beat European Go champion Fan Hui in October 2015
§ It was powered by Google's Tensor Processing Unit (TPU 1.0)
§ TPU 2.0 beat Ke Jie, the world no. 1 Go player, in May 2017
§ AlphaZero improved on that by playing itself (not just NNs)


Slowing of Moore’s Law ⇒ Accelerators

§ Power scaling—ended a long time ago
§ Cost-per-transistor scaling—more recently
§ Technical limits—still has several nodes to go
§ 2nm may not be worth it—see article from EE Times 3/23/18
§ Time between nodes increasing significantly

[Chart: *projected (source: The Linley Group)]


Algorithms Fit Existing Paradigm

§ Algorithms fitted an existing paradigm—variations on dense matrix-vector and matrix-matrix multiply
§ Many variations, notably convolutional neural networks—CNNs


Orders of Magnitude Increase in Data

§ Orders of magnitude increase in the size of data sets
§ The FAANGs (Facebook/Amazon/Apple/Netflix/Google) have access to vast amounts of data and this has been the game changer
§ Add to that list: Baidu/Microsoft/Alibaba/Tencent/FSB (!)
§ Available to 3rd parties—Cambridge Analytica
§ Open source
§ AlexNet—image classification (CNN)
§ VGG-16—large-scale image recognition (CNN)
§ Deep Residual Network—Microsoft


What are Neural Nets—NNs

NEURON

§ Unfortunate anthropomorphization! Only a passing relationship to the neurons in your brain
§ Neuron shown with (synaptic) weighted inputs feeding dendrites!
§ The net input function is just a dot product
§ The "activation" function is a non-linear function
§ Often simplified to the rectified linear unit—ReLU

[Figure: mandatory brain picture]
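To make the dot product plus activation concrete, here is a minimal NumPy sketch of a single neuron; the weights, input, and bias are made-up values for illustration.

```python
import numpy as np

def relu(z):
    # Rectified linear unit: elementwise max(0, z)
    return np.maximum(0.0, z)

def neuron(x, w, b):
    # The net input function is just a dot product of inputs and
    # (synaptic) weights; the activation function is non-linear.
    return relu(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])    # inputs (hypothetical)
w = np.array([0.8,  0.1, -0.4])   # weights (hypothetical)
print(neuron(x, w, b=0.2))        # net input is -0.72, so ReLU gives 0.0
```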


What are Neural Nets—5 Slide Introduction!

NEURAL NETS

§ From input to first hidden layer is a matrix-vector multiply with a weight matrix: W ⊗ I = V
§ Deep Neural Nets (DNNs) have multiple hidden layers: Output = … ⊗ W3 ⊗ W2 ⊗ W1 ⊗ I
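A minimal NumPy sketch of that forward pass, with ⊗ realized as a matrix-vector multiply followed by a ReLU non-linearity; the layer sizes and random weights are arbitrary.

```python
import numpy as np

def forward(weight_matrices, x):
    # Each layer is one "⊗": a matrix-vector multiply with W,
    # followed by the non-linearity.
    for W in weight_matrices:
        x = np.maximum(0.0, W @ x)
    return x

rng = np.random.default_rng(0)
W1 = rng.normal(size=(64, 128))    # input -> hidden 1
W2 = rng.normal(size=(64, 64))     # hidden 1 -> hidden 2
W3 = rng.normal(size=(10, 64))     # hidden 2 -> output
I = rng.normal(size=128)           # input vector
output = forward([W1, W2, W3], I)  # Output = W3 ⊗ W2 ⊗ W1 ⊗ I
```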


DNN—deep neural networks

§ DNNs have more than two levels that are "fully connected"
§ Bipartite graphs
§ Dense matrix operations


CNN—convolutional neural networks

§ Borrowed an idea from signal processing
§ Used typically in image applications
§ Cuts down on dimensionality
§ The 4 feature maps are produced as a result of 4 convolution kernels being applied to the image array
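As an illustration, the sketch below slides 4 kernels over an image array to produce 4 feature maps; the data is random, and real implementations use optimized library kernels.

```python
import numpy as np

def conv2d(image, kernel):
    # "Valid" 2-D convolution (strictly, cross-correlation, as in
    # most neural-network usage): slide the kernel over the image
    # and take a dot product at each position.
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(28, 28)                       # hypothetical image array
kernels = [np.random.rand(3, 3) for _ in range(4)]   # 4 convolution kernels
feature_maps = [conv2d(image, k) for k in kernels]   # 4 feature maps (26x26 each)
```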


Training and Inference

§ The weights come from the learning or training phase
§ Start with randomly assigned weights and "learn" through a process of successive approximation that involves back propagation with gradient descent
§ Both processes involve matrix-vector multiplication
§ Inference is done much more frequently
§ Often inference uses fixed point and training uses floating point
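A minimal sketch of that successive approximation for a single linear layer under a squared-error loss; with only one layer the back-propagated gradient reduces to an outer product, and the learning rate and data here are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
W = 0.1 * rng.normal(size=(4, 8))   # randomly assigned starting weights
x = rng.normal(size=8)              # one training input
t = rng.normal(size=4)              # its target output
lr = 0.01                           # learning rate (hypothetical)

for step in range(100):
    y = W @ x                       # forward pass: matrix-vector multiply
    grad = np.outer(y - t, x)       # gradient of 0.5*||y - t||^2 w.r.t. W
    W -= lr * grad                  # gradient-descent weight update
```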

[Figure: backpropagation]


Summary

§ Basic algorithm is a vector-matrix multiply: … ⊗ W3 ⊗ W2 ⊗ W1 ⊗ I
§ The number of weight matrices corresponds to the depth of the network—the rank of the matrices can be in the millions
§ The non-linear operator ⊗ prevents us from pre-evaluating the matrix products—this is a significant inefficiency
§ BUT it makes possible non-linear separation in classification space
§ The basic operation is a dot product followed by a non-linear operation—a MAC operation and some sort of thresholding: threshold(∑ᵢ wᵢ × xᵢ + b)


Summary—Note on pre-evaluation

§ Basic algorithm is a vector-matrix multiply: … ⊗ W3 ⊗ W2 ⊗ W1 ⊗ I
§ The product is a function of I
§ If ⊗ were simply normal matrix multiply (∙) then … W3∙W2∙W1∙I can be written W∙I, where W = … W3∙W2∙W1
§ The inference step would be just ONE matrix multiply
§ Question: can we use (W2 ⊗ W1 ⊗ I − W2∙W1∙I), for representative samples of I, as an approximate correction?
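A small NumPy sketch of the point, including the correction term the question refers to; sizes and weights are arbitrary, and ⊗ is taken to be multiply-then-ReLU.

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.normal(size=(16, 32))
W2 = rng.normal(size=(8, 16))
I = rng.normal(size=32)

# If ⊗ were plain matrix multiply, the weights collapse to ONE matrix:
W = W2 @ W1
assert np.allclose(W @ I, W2 @ (W1 @ I))

# With a non-linearity in between, they do not collapse:
nonlinear = W2 @ np.maximum(0.0, W1 @ I)   # W2 ⊗ W1 ⊗ I
linear = W @ I                             # W2·W1·I
correction = nonlinear - linear            # the residual the question asks about
```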


What’s Changed?

§ Neural nets have been around for over 70 years—eons in computer-evolution time
§ McCulloch–Pitts neurons—1943
§ Countless innovations but the basic idea is quite old
§ Notably back propagation to learn weights in supervised learning
§ Convolutional NN—nearest-neighbor convolution layer
§ Recurrent NN—feedback added
§ Massive improvements in compute power & more data
§ Larger, deeper, better
§ AlexNet—8 layers, 240MB weights
§ VGG-16—16 layers, 550MB weights
§ Deep Residual Network—152 layers, 229MB weights


Convergence—what is the common denominator?

§ Dot product for dense matrix operations—MAC units
§ Take-away for computer architects:
§ Dense ⇒ vector processing
§ We know how to do this
§ Why not use existing hardware—repurpose it
§ There are still opportunities
§ Size and power
§ Systolic-type organizations
§ Tailor precision to the application


Who’s On The Bandwagon?

§ Recall:
§ More than 45 start-ups are designing chips for image processing, speech, and self-driving cars
§ 5 have raised more than $100 million
§ Venture capitalists invested over $1.5 billion in chip start-ups last year
§ These numbers are conservative


Just Some of the Offerings

§ Two approaches:
§ Repurpose a signal-processing chip or a GPU—CEVA & nVidia
§ Start from scratch—Google's TPU & now nVidia is claiming a TPU in the works
§ Because the key ingredient is a dot product, hardware to do this has been around for decades—DSP MACs
§ Consequently everyone in the DSP space claims they have a DNN solution!
§ Some of the current offerings and their characteristics:
§ Intel—purchased Nervana and Movidius
§ Possible use of the Movidius accelerator in Intel's future PC chip sets
§ Wave—45-person start-up with DSP expertise
§ TPU—disagrees with M/soft's FPGA solution and nVidia's GPU solution
§ CEVA—XM6-based vision platform
§ nVidia—announced a TPU-like processor
§ Tesla for training
§ Graphcore's Intelligence Processing Unit (IPU)
§ TSMC—no details, has "very high" memory bandwidth, 8-bit arithmetic
§ FIVEAI from GraphCore
§ Apple's Bionic neural engine in the A11 SoC in its iPhone
§ The DeePhi block in Samsung's Exynos 9810 in the Galaxy S9
§ The neural engine from China's Cambricon in Huawei's Kirin 970 handset


Landscape for Hardware Offerings

§ Training tends to use heavy-weight GPGPUs
§ Inference uses smaller engines
§ Inference is now being done in mobile platforms


Raw Performance of Inference Accelerators Announced to Date

[Chart: raw performance of announced inference accelerators; MACs are the unit of work]


Intel Movidius—no details



Cadence / Tensilica C5



CEVA-XM4—currently at XM6



Appearing in Small Low-Power Applications

§ Non-uniform scratchpad architecture
§ Many always-on applications execute in a repeatable and deterministic fashion
§ Optimal memory access can be pre-determined statically
§ Scratchpad instead of cache
§ Assign more frequently accessed data to smaller, nearby banks

[Figure: always-on accelerator with 4 PEs and per-PE memories, a Cortex-M0 processor with compiled SRAM, a central arbitration unit, and a serial bus to an external sensor and application core; each PE's scratchpad is split into banks at levels L1–L4]


Chip implementation

Process: 40 nm
Chip area: 7.1 mm²
# of PEs: 4
Accelerator SRAM size: 270 KB
Available fixed-point precision: 6, 8, 12, 16, 24, 32 bits
Operating power: 0.288 mW
Efficiency: 374 GOPS/W

Reference: S. Bang et al., ISSCC 2017


Google’s TPU 1.0*—a 3 year old technology

§ Matrix multiply unit—65,536 (256×256) 8-bit multiply-accumulate units
§ 700 MHz clock
§ Peak: 92T operations/second (65,536 MACs × 2 ops × 700 MHz ≈ 92T)
§ >25× more MACs vs GPU
§ >1000× more MACs vs CPU
§ 24 MB of on-chip Unified Buffer
§ 3.5× as much on-chip memory vs GPU
§ Two 2133 MHz DDR3 DRAM channels
§ 8 GB of off-chip weight DRAM memory
§ Control and data pipelined

* "In-Datacenter Performance Analysis of a Tensor Processing Unit," Jouppi et al., 44th International Symposium on Computer Architecture (ISCA), Toronto, Canada, June 26, 2017.


Performance Comparison—nVidia Responds



Use Scenario

§ TPU card replaces a disk
§ Accelerator for server
§ Connected through PCIe bus
§ Host sends it instructions, like an FPU—it doesn't fetch its own instructions
§ 4 cards per server
§ Five basic instructions—complex:
  Read_Host_Memory
  Write_Host_Memory
  Read_Weights
  MatrixMultiply/Convolve
  Activate(ReLU, Sigmoid, Maxpool, LRN, …)
§ CPI > 10
§ 4-stage pipeline execution
§ No branches / in-order issue / SW controlled


Observations

§ TPU 1.0 uses 8-bit integer arithmetic to save power and area
§ A theme for others too—e.g. GraphCore
§ BUT TPU 2.0 appears to be floating point
§ Ease of programming is worth something
§ Systolic operation is best suited to dense matrices
§ Publicly available development environment—TensorFlow


What Next?

§ Recall:
§ AlexNet—8 layers, 240MB weights
§ VGG-16—16 layers, 550MB weights
§ Deep Residual Network—152 layers, 229MB weights
§ Large model size leads to high energy cost
§ NNs cannot fit in on-chip SRAM
§ DRAM access is energy-consuming

[Chart: relative energy cost* on a log scale (1 to 10,000) for 32-bit DRAM, 32-bit SRAM, 32-bit float MULT, 32-bit int MULT, 32-bit float ADD, 32-bit int ADD]

* Han et al. "EIE: Efficient Inference Engine on Compressed Deep Neural Network." arXiv preprint arXiv:1602.01528 (2016).


Solution 1: Run in Cloud*

§ Pro: solves data and power growth
§ Con:
§ Network delay—a function of the data set
§ User privacy
§ Data connection not guaranteed
§ Partitioning a challenge
§ Works for small, low-frequency data sets—Siri
§ Challenging for images—compression adds to latency

* Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang. "Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge." In Proc. 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Xi'an, China, April 2017.

Solution 2: Reduce the DNN Size*

§ Precision reduction
§ Low-precision fixed-point representation (see the sketch below)
§ Need hardware support
§ Weights pruning
§ Remove redundant weights
§ Sparse weights matrix
§ Weight sharing
§ Application-specific accelerators†

* Han et al. "A deep neural network compression pipeline: Pruning, quantization, Huffman encoding." arXiv preprint arXiv:1510.00149 (2015).
† Han et al. "EIE: Efficient Inference Engine on Compressed Deep Neural Network." arXiv preprint arXiv:1602.01528 (2016).
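As an illustration of the precision-reduction idea, here is a minimal sketch of symmetric linear quantization to 8-bit integers; this is one common scheme, not necessarily the exact one used in the cited papers.

```python
import numpy as np

def quantize_int8(W):
    # Map floats onto int8 with a single per-tensor scale factor.
    scale = np.max(np.abs(W)) / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

W = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(W)
W_hat = q.astype(np.float32) * scale   # dequantized approximation
print(np.max(np.abs(W - W_hat)))       # worst-case rounding error (~scale/2)
```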


Reducing Storage and Computation*

§ Weights pruning*
§ Remove weights lower than pruning thresholds
§ Remove neurons without inputs/outputs
§ Weights sharing
§ K-means clustering
§ Multiple weights share the same value (see the sketch below)

[Figure: original network → weights pruning (remove weights, remove neurons) → weights sharing]

* Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS 2015.
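A minimal sketch of both steps; the magnitude threshold and the number of shared values are arbitrary, and the retraining that Han et al. interleave with pruning is elided.

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(8, 8))

# Weights pruning: zero out weights below a magnitude threshold
threshold = 0.5                          # hypothetical pruning threshold
W[np.abs(W) < threshold] = 0.0

# Weights sharing: crude 1-D k-means (k=4) over the surviving weights,
# so many weights end up sharing the same centroid value
vals = W[W != 0.0]
centroids = np.quantile(vals, [0.125, 0.375, 0.625, 0.875])   # initialization
for _ in range(20):
    assign = np.argmin(np.abs(vals[:, None] - centroids[None, :]), axis=1)
    for k in range(len(centroids)):
        if np.any(assign == k):
            centroids[k] = vals[assign == k].mean()
W[W != 0.0] = centroids[assign]          # each weight replaced by its centroid
```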


Drawbacks—Sparsity Difficult to Vectorize

§ Execution time increases
§ Computation reduction not fully utilized
§ Extra computation for decoding the sparse format
§ AlexNet

[Chart: AlexNet vs. unpruned baseline: 22%, 42%, 125%, 334%]


Sparse Storage Formats

§ Compressed Sparse Row—CSR (CSC)
§ Irregular data structure
§ Significant "work" to find elements
§ Vector machines don't do so well
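A minimal sketch of CSR and the gather-heavy sparse matrix-vector multiply it implies; the data-dependent indexing is exactly the "work" that vector machines struggle with.

```python
import numpy as np

# CSR stores only the nonzeros: their values, their column indices,
# and row pointers marking where each row starts.
A = np.array([[5., 0., 0., 1.],
              [0., 0., 2., 0.],
              [0., 3., 0., 4.]])
values  = np.array([5., 1., 2., 3., 4.])
col_idx = np.array([0, 3, 2, 1, 3])
row_ptr = np.array([0, 2, 3, 5])   # row i occupies values[row_ptr[i]:row_ptr[i+1]]

def csr_matvec(values, col_idx, row_ptr, x):
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Irregular gather: the "work" needed to find elements
            y[i] += values[k] * x[col_idx[k]]
    return y

x = np.array([1., 2., 3., 4.])
assert np.allclose(csr_matvec(values, col_idx, row_ptr, x), A @ x)
```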


What’s in the Future

§ Investment boom is tailing off—"AI fatigue"
§ Recognition that many of the future ML problems will require efficient handling of sparse data structures
§ Big data collected from various sources
§ Sensor feeds, social media, scientific experiments
§ Challenge: the nature of the data is sparse
§ Architecture research previously focused on improving compute
§ Sparse matrix computation: a key example of memory-bound workloads
§ GPUs achieve ~100 GFLOPS for dense matrix multiply vs. ~100 MFLOPS for sparse matrices
§ Change of focus to data movement & a less rigid SIMD compute model


OuterSPACE Project

§ SPMD-style Processing Elements (PEs), high-speed crossbars, non-coherent caches with request coalescing, and an HBM interface
§ Local Control Processor (LCP): streams instructions to the PEs
§ Central Control Processor (CCP): work scheduling and memory management


OuterSPACE: Merge Phase

§ The L0 cache-crossbar blocks are reconfigured to accommodate the change in data access pattern
§ L0 reconfigured into smaller, private caches and private scratchpads

HPCA 2018
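For context, OuterSPACE organizes sparse matrix-matrix multiply as an outer-product computation: a multiply phase forms independent partial products from column/row pairs, and the merge phase combines them. The dense NumPy sketch below shows only the mathematical skeleton; the actual design streams sparse operands and merges sparse partial products.

```python
import numpy as np

def outer_product_matmul(A, B):
    # Multiply phase: column k of A times row k of B yields an
    # independent partial-product matrix (parallel across PEs).
    partials = [np.outer(A[:, k], B[k, :]) for k in range(A.shape[1])]
    # Merge phase: combine the partial products into the result.
    return sum(partials)

A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
assert np.allclose(outer_product_matmul(A, B), A @ B)
```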


Performance of Matrix-Matrix Multiplication

§ Evaluation of SpGEMM on UFL SuiteSparse and SNAP matrices

Summary of results (HPCA 2018): a mean speedup of 7.9× over Intel Math Kernel Library on a Xeon CPU, 13.0× against cuSPARSE and 14.0× against CUSP when run on an NVIDIA K40 GPU, while achieving an average throughput of 2.9 GFLOPS within a 24 W power budget in an area of 87 mm².

Power and Area Summary

Table: power and area estimates for OuterSPACE in 32 nm

§ Total chip area for OuterSPACE: ~87 mm² @ 24 W power budget
§ Average throughput of 2.9 GFLOPS → ~126 MFLOPS/W
§ K40 GPU achieves avg. 67 MFLOPS @ 85 W for UFL/SNAP matrices

HPCA 2018


Q&A

Thank you. Questions?