 
The Explosion in Neural Network Hardware
USC, Friday 19th April
Trevor Mudge, Bredt Family Professor of Computer Science and Engineering
The University of Michigan, Ann Arbor

What Just Happened?
§ For years the common wisdom was that hardware was a bad bet for venture capital
§ That has changed
§ More than 45 start-ups are designing chips for image processing, speech, and self-driving cars
§ 5 have raised more than $100 million
§ Venture capitalists invested over $1.5 billion in chip start-ups last year

Driving Factors
§ Fueled by pragmatism: the “unreasonable” success of neural nets
§ Slowing of Moore’s Law has made accelerators more attractive
§ Existing accelerators (GPUs and DSPs) could easily be repurposed
§ Algorithms fitted an existing paradigm
§ Orders of magnitude increase in the size of data sets
§ Independent foundries: TSMC is the best known

What Are Neural Networks Used For?
[Figure panels: computer vision, self-driving cars, keyword spotting, seizure detection]
§ A unifying approach to “understanding”, in contrast to an expert-guided set of algorithms (to recognize faces, for example)
§ Their success is based on the availability of enormous amounts of training data

Notable Successes
§ Facebook’s DeepFace is 97.35% accurate on the Labeled Faces in the Wild (LFW) dataset, as good as a human in some cases
§ Recent attention-grabbing application: DeepMind’s AlphaGo
  § It beat European Go champion Fan Hui in October 2015
  § It was powered by Google’s Tensor Processing Unit (TPU 1.0)
  § TPU 2.0 beat Ke Jie, the world no. 1 Go player, in May 2017
  § AlphaZero improved on that by playing itself
  § (Not just NNs)

Slowing of Moore’s Law ⇒ Accelerators
§ Power scaling: ended a long time ago
§ Cost per transistor scaling: ended more recently
§ Technical limits: still has several nodes to go
§ 2nm may not be worth it (see article from EE Times, 3/23/18)
§ Time between nodes increasing significantly
[Figure note: * projected. Source: The Linley Group]

Algorithms Fit Existing Paradigm
§ Algorithms fitted an existing paradigm: variations on dense matrix-vector and matrix-matrix multiply
§ Many variations, notably convolutional neural networks (CNNs)

Orders of Magnitude Increase in Data
§ Orders of magnitude increase in the size of data sets
§ FAANGs (Facebook/Amazon/Apple/Netflix/Google) have access to vast amounts of data and this has been the game changer
§ Add to that list: Baidu/Microsoft/Alibaba/Tencent/FSB (!)
§ Available to third parties: Cambridge Analytica
§ Open source
  § AlexNet: image classification (CNN)
  § VGG-16: large-scale image recognition (CNN)
  § Deep Residual Network: Microsoft

What Are Neural Nets (NNs)?
§ Unfortunate anthropomorphization! Only a passing relationship to the neurons in your brain
[Figure: mandatory brain picture]
[Figure: NEURON]
§ Neuron shown with (synaptic) weighted inputs feeding dendrites!
§ The net input function is just a dot product
§ The “activation” function is a non-linear function
§ Often simplified to the rectified linear unit (ReLU)

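As a minimal sketch of the neuron described above: the net input is a dot product of weights and inputs, and the activation is ReLU. The function names and the sample numbers are illustrative assumptions, not taken from the slides.

```python
def relu(z):
    """Rectified linear unit: pass positive values, clamp negatives to zero."""
    return max(0.0, z)


def neuron(weights, inputs):
    """Net input is a dot product; the activation is applied to the result."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return relu(net)


# Illustrative values only (hypothetical weights and inputs).
print(neuron([0.5, -1.0, 2.0], [1.0, 2.0, 3.0]))  # 0.5 - 2.0 + 6.0 = 4.5
print(neuron([0.5, -1.0, 2.0], [1.0, 4.0, 1.0]))  # 0.5 - 4.0 + 2.0 = -1.5, clamped to 0.0
```
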
What Are Neural Nets: 5-Slide Introduction!
[Figure: NEURAL NETS]
§ From input to first hidden layer is a matrix-vector multiply with a weight matrix: W ⊗ I = V
§ Deep Neural Nets (DNNs) have multiple hidden layers: output = … ⊗ W_3 ⊗ W_2 ⊗ W_1 ⊗ I

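The layered product can be sketched as a loop over weight matrices. This assumes that ⊗ means "matrix-vector multiply followed by the non-linearity" and that the non-linearity is ReLU; the matrices W1 and W2 are made-up examples.

```python
def relu(v):
    """Apply the rectified linear unit element-wise."""
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Dense matrix-vector multiply: one dot product per row of W."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def forward(weight_matrices, I):
    """Apply W_1 first, then W_2, and so on, each followed by the non-linearity."""
    v = I
    for W in weight_matrices:
        v = relu(matvec(W, v))
    return v


# Hypothetical 2-layer network: 2 inputs, 2 hidden units, 1 output.
W1 = [[1.0, -1.0], [-1.0, 1.0]]
W2 = [[1.0, 1.0]]
print(forward([W1, W2], [3.0, 1.0]))  # hidden layer [2.0, 0.0], output [2.0]
```
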
DNN: Deep Neural Networks
§ DNNs have more than two levels that are “fully connected”
§ Bipartite graphs
§ Dense matrix operations

CNN: Convolutional Neural Networks
§ Borrowed an idea from signal processing
§ Used typically in image applications
§ Cuts down on dimensionality
§ The 4 feature maps are produced as a result of 4 convolution kernels being applied to the image array

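A small sketch of how 4 kernels yield 4 feature maps. The 4x4 image and the 2x2 kernels are invented for illustration, and the sliding-window operation is written without flipping the kernel (strictly a cross-correlation, which is an assumption about the convention rather than something the slides specify).

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image with no padding; each output is a dot product."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical 4x4 image and four 2x2 kernels.
image = [
    [1, 2, 3, 0],
    [4, 5, 6, 0],
    [7, 8, 9, 0],
    [0, 0, 0, 0],
]
kernels = [
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
    [[1, 1], [1, 1]],
    [[1, -1], [1, -1]],
]

feature_maps = [conv2d_valid(image, k) for k in kernels]
print(len(feature_maps))                               # 4 feature maps
print(len(feature_maps[0]), len(feature_maps[0][0]))   # each 3 x 3

# 4 kernels x 4 weights = 16 weights in total, versus 16 x 36 = 576 weights
# for a fully connected layer mapping the same 16 inputs to 36 outputs.
```
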
Training and Inference
§ The weights come from the learning or training phase
§ Start with randomly assigned weights and “learn” through a process of successive approximation that involves back propagation with gradient descent
§ Both processes involve matrix-vector multiplication
§ Inference is done much more frequently
§ Often inference uses fixed point and training uses floating point
[Figure: backpropagation]

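To illustrate fixed-point inference, here is a sketch of one dot product computed with 8-bit integers and a single rescale at the end. The scale factors and values are assumptions chosen so that the numbers are exactly representable; in general, quantization introduces a small error relative to floating point.

```python
def quantize(values, scale):
    """Map floats to signed 8-bit integers: q = round(x / scale), clamped to [-128, 127]."""
    return [max(-128, min(127, round(x / scale))) for x in values]


# Hypothetical trained weights and one input vector.
weights = [0.5, -0.25, 0.75]
inputs = [1.0, 2.0, -1.0]

W_SCALE = 1 / 128  # assumed weight scale
X_SCALE = 1 / 32   # assumed input scale

qw = quantize(weights, W_SCALE)  # [64, -32, 96]
qx = quantize(inputs, X_SCALE)   # [32, 64, -32]

# Integer multiply-accumulate, as a fixed-point MAC unit would do it.
int_acc = sum(w * x for w, x in zip(qw, qx))

# One rescale converts the accumulator back to a real-valued result.
fixed_result = int_acc * W_SCALE * X_SCALE
float_result = sum(w * x for w, x in zip(weights, inputs))

print(int_acc, fixed_result, float_result)
```
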
Summary
§ Basic algorithm is a vector-matrix multiply: … ⊗ W_3 ⊗ W_2 ⊗ W_1 ⊗ I
§ The number of weight matrices corresponds to the depth of the network; the rank of the matrices can be in the millions
§ The non-linear operator ⊗ prevents us from pre-evaluating the matrix products; this is a significant inefficiency
§ BUT it makes possible non-linear separation in classification space
§ The basic operation is a dot product followed by a non-linear operation: a MAC operation and some sort of thresholding
  threshold(∑_i w_i × x_i)

Summary: Note on Pre-evaluation
§ Basic algorithm is a vector-matrix multiply: … ⊗ W_3 ⊗ W_2 ⊗ W_1 ⊗ I
§ The product is a function of I
§ If ⊗ were simply normal matrix multiply ∙ then … W_3 ∙ W_2 ∙ W_1 ∙ I can be written W ∙ I, where W = … W_3 ∙ W_2 ∙ W_1
§ The inference step would be just ONE matrix multiply
§ Question: can we use (W_2 ⊗ W_1 ⊗ I − W_2 ∙ W_1 ∙ I) for representative samples of I as an approximate correction?

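The pre-evaluation argument can be checked numerically. The sketch below assumes ⊗ is a matrix multiply followed by ReLU and uses made-up 2-layer weights: with plain products the layers collapse into one matrix W, while with the non-linearity the result differs, and the difference is the "correction" the question refers to.

```python
def matvec(W, v):
    """Dense matrix-vector multiply."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def matmul(A, B):
    """Dense matrix-matrix multiply."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu(v):
    """Element-wise rectified linear unit (assumed non-linearity)."""
    return [max(0.0, x) for x in v]


# Hypothetical weights and input.
W1 = [[1.0, -1.0], [-1.0, 1.0]]
W2 = [[1.0, 1.0]]
I = [3.0, 1.0]

# Linear case: the two layers pre-evaluate into a single matrix W.
W = matmul(W2, W1)
linear_two_steps = matvec(W2, matvec(W1, I))
linear_one_step = matvec(W, I)

# Non-linear case: ReLU between the layers blocks the pre-evaluation.
nonlinear = relu(matvec(W2, relu(matvec(W1, I))))

# Difference between the two for this sample of I.
correction = [n - l for n, l in zip(nonlinear, linear_one_step)]

print(W, linear_one_step, nonlinear, correction)
```

For these particular weights the collapsed matrix W is all zeros, so the linear network outputs 0 for every input, while the ReLU network outputs 2.0 for this I.
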
What’s Changed?
§ Neural nets have been around for over 70 years: eons in computer-evolution time
  § McCulloch–Pitts neurons, 1943
§ Countless innovations but the basic idea is quite old
  § Notably back propagation to learn weights in supervised learning
  § Convolutional NN: nearest neighbor convolution layer
  § Recurrent NN: feedback added
§ Massive improvements in compute power and more data
§ Larger, deeper, better
  § AlexNet: 8 layers, 240MB weights
  § VGG-16: 16 layers, 550MB weights
  § Deep Residual Network: 152 layers, 229MB weights

Convergence: What Is the Common Denominator?
§ Dot product for dense matrix operations: MAC units
§ Takeaway for computer architects:
  § Dense ⇒ vector processing
  § We know how to do this
  § Why not repurpose existing hardware?
§ There are still opportunities
  § Size and power
  § Systolic-type organizations
  § Tailor precision to the application

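Since the dot product is the common denominator, the work in a dense network can be estimated by counting MACs: a fully connected layer mapping n inputs to m outputs needs m × n of them. The layer sizes below are hypothetical, chosen only to show the arithmetic.

```python
def dense_macs(layer_sizes):
    """One MAC per weight: a layer with n inputs and m outputs costs m * n MACs."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical network: 784 inputs, one hidden layer of 256 units, 10 outputs.
print(dense_macs([784, 256, 10]))  # 784*256 + 256*10 = 203264 MACs per inference
```
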
Who’s On The Bandwagon?
§ Recall:
  § More than 45 start-ups are designing chips for image processing, speech, and self-driving cars
  § 5 have raised more than $100 million
  § Venture capitalists invested over $1.5 billion in chip start-ups last year
§ These numbers are conservative

Just Some of the Offerings
§ Two approaches
  § Repurpose a signal processing chip or a GPU: CEVA and nVidia
  § Start from scratch: Google’s TPU, and now nVidia is claiming a TPU in the works
§ Because the key ingredient is a dot product, hardware to do this has been around for decades: DSP MACs
§ Consequently everyone in the DSP space claims they have a DNN solution!
§ Some of the current offerings and their characteristics
  § Intel: purchased Nervana and Movidius
    § Possible use of the Movidius accelerator in Intel’s future PC chip sets
  § Wave: 45-person start-up with DSP expertise
  § TPU: disagrees with Microsoft’s FPGA solution and nVidia’s GPU solution
  § CEVA-XM6-based vision platform
  § nVidia: announced a TPU-like processor
    § Tesla for training
  § Graphcore’s Intelligence Processing Unit (IPU)
    § TSMC: no details, has “very high” memory bandwidth, 8-bit arithmetic
  § FIVEAI from Graphcore
  § Apple’s Bionic neural engine in the A11 SoC in its iPhone
  § The DeePhi block in Samsung’s Exynos 9810 in the Galaxy S9
  § The neural engine from China’s Cambricon in Huawei’s Kirin 970 handset

Landscape for Hardware Offerings
§ Training tends to use heavy-weight GPGPUs
§ Inference uses smaller engines
§ Inference is now being done in mobile platforms

Raw Performance of Inference Accelerators Announced to Date
§ MACs are the unit of work

Intel Movidius: no details

Cadence / Tensilica C5

CEVA-XM4 (currently at XM6)

Appearing in Small Low-Power Applications
§ Non-uniform scratchpad architecture
§ Many always-on applications execute in a repeatable and deterministic fashion
§ Optimal memory access can be pre-determined statically
§ Scratchpad instead of cache
§ Assign more frequently accessed data to smaller, nearby SRAM banks
[Figure: always-on accelerator block diagram. Labels: PE1 to PE4 with PE memories; scratchpad levels L1 to L4, each with Bank1 to Bank4; central arbitration unit; Cortex-M0 processor; serial bus; external sensor and application core; compiled for Cortex-M0]

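One way to picture static scratchpad assignment is a greedy pass that places the most frequently accessed buffers in the nearest bank with room. This is only an illustrative heuristic, not the allocation method used in the design on the slide; the bank capacities, buffer names, and access counts are all hypothetical.

```python
def assign_to_banks(buffers, banks):
    """Greedy static placement.

    buffers: list of (name, size_bytes, access_count)
    banks:   list of (bank_name, capacity_bytes), ordered nearest (smallest) first
    Returns a dict mapping buffer name to bank name.
    """
    free = [capacity for _, capacity in banks]
    placement = {}
    # Most frequently accessed buffers get first pick of the nearest banks.
    for name, size, _ in sorted(buffers, key=lambda b: b[2], reverse=True):
        for i, (bank_name, _) in enumerate(banks):
            if size <= free[i]:
                free[i] -= size
                placement[name] = bank_name
                break
        else:
            raise ValueError(f"{name} does not fit in any bank")
    return placement


# Hypothetical scratchpad levels and buffers.
banks = [("L1", 1024), ("L2", 4096), ("L3", 16384)]
buffers = [
    ("weights_fc", 12000, 10),
    ("activations", 800, 1000),
    ("weights_conv", 3000, 500),
]
print(assign_to_banks(buffers, banks))
```

Because the access pattern of an always-on application is deterministic, a placement like this can be computed once at compile time rather than decided by a cache at run time.
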