Backpropagation
Why backpropagation
- Neural networks are sequences of parametrized functions h(x; θ)
  [Figure: a convolutional network x → conv (filters) → subsample → conv (filters) → subsample → linear (weights) → h(x; θ); the filters and weights together form the parameters θ]
- Parameters need to be set by minimizing some loss function over the training set:
  min_θ (1/N) Σ_{i=1}^N L(h(x_i; θ), y_i)
- Minimization through gradient descent requires computing the gradient (a toy gradient-descent sketch follows this slide):
  θ(t+1) = θ(t) − λ (1/N) Σ_{i=1}^N ∇L(h(x_i; θ), y_i)
  Writing z = h(x; θ), the chain rule gives ∇_θ L(z, y) = (∂L(z, y)/∂z) (∂z/∂θ)
- Backpropagation: a way to compute the gradient ∂z/∂θ
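As a concrete illustration of the gradient-descent update above, here is a minimal Python/NumPy sketch that fits a toy scalar model h(x; θ) = θx with a squared-error loss. The toy data and names are assumptions made for illustration only, not part of the slides.

import numpy as np

# Toy data: N scalar inputs x_i and targets y_i (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

theta = 0.0   # the parameter to learn
lam = 0.1     # learning rate (lambda in the slides)

for step in range(200):
    z = theta * x                      # forward: z = h(x; theta)
    grad = np.mean(2 * (z - y) * x)    # (1/N) sum_i dL/dtheta for L = (z - y)^2
    theta = theta - lam * grad         # gradient-descent update

print(theta)   # converges to roughly 3.0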
The gradient of convnets
[Figure: a chain of five functions, z_1 = f_1(x, w_1), z_2 = f_2(z_1, w_2), ..., z_5 = f_5(z_4, w_5) = z]
- By the chain rule:
  ∂z/∂w_3 = (∂z/∂z_3) (∂z_3/∂w_3)
  ∂z/∂z_3 = (∂z/∂z_4) (∂z_4/∂z_3)
  ∂z/∂z_2 = (∂z/∂z_3) (∂z_3/∂z_2),   ∂z/∂w_2 = (∂z/∂z_2) (∂z_2/∂w_2)
- Recurrence going backward!
Backpropagation for a sequence
- A sequence of n functions f_i: z_i = f_i(z_{i−1}, w_i), with z_0 = x and z = z_n
- Assume we can compute the partial derivatives of each function:
  ∂z_i/∂z_{i−1} = ∂f_i(z_{i−1}, w_i)/∂z_{i−1},   ∂z_i/∂w_i = ∂f_i(z_{i−1}, w_i)/∂w_i
- Use g(z_i) to store the gradient of z w.r.t. z_i, and g(w_i) for w_i
- Compute g(z_i) by iterating backwards (see the sketch after this slide):
  g(z_n) = ∂z/∂z_n = 1
  g(z_{i−1}) = (∂z/∂z_i) (∂z_i/∂z_{i−1}) = g(z_i) ∂z_i/∂z_{i−1}
- Use g(z_i) to compute the gradient of the parameters:
  g(w_i) = (∂z/∂z_i) (∂z_i/∂w_i) = g(z_i) ∂z_i/∂w_i
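A minimal NumPy sketch of this backward recurrence, assuming each layer is the scalar function z_i = f_i(z_{i−1}, w_i) = w_i * z_{i−1} so the local derivatives are easy to write down; the layer choice and the numbers are illustrative assumptions, not from the slides.

import numpy as np

# Toy sequence: z_i = f_i(z_{i-1}, w_i) = w_i * z_{i-1} (illustrative choice).
w = np.array([2.0, 0.5, 3.0, 1.5, 0.1])   # parameters w_1..w_n
x = 4.0                                    # input z_0

# Forward pass: store every intermediate z_i.
z = [x]
for wi in w:
    z.append(wi * z[-1])                   # z_i = w_i * z_{i-1}

# Backward pass: g(z_n) = 1, then iterate the recurrence backwards.
g_z = 1.0                                  # g(z_n) = dz/dz_n = 1
g_w = np.zeros_like(w)
for i in range(len(w), 0, -1):
    # Local derivatives of f_i(z_{i-1}, w_i) = w_i * z_{i-1}:
    dz_dzprev = w[i - 1]                   # dz_i / dz_{i-1}
    dz_dw = z[i - 1]                       # dz_i / dw_i
    g_w[i - 1] = g_z * dz_dw               # g(w_i) = g(z_i) * dz_i/dw_i
    g_z = g_z * dz_dzprev                  # g(z_{i-1}) = g(z_i) * dz_i/dz_{i-1}

print(g_w)   # gradient of z with respect to each w_i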
Loss as a function
[Figure: the convolutional network again, now followed by a loss computed from the network output and the label; the loss is just one more function at the end of the sequence]
Putting it all together: SGD training of ConvNets
- 1. Sample image and label
- 2. Pass the image through the network to get the loss (forward)
- 3. Backpropagate to get gradients (backward)
- 4. Take a step along the negative gradients to update the weights
- 5. Repeat! (a training-loop sketch follows)
[Figure: the convolutional network with an image as input and the loss computed against its label]
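To make the five steps concrete, here is a framework-free Python sketch of the loop structure. The functions forward and backward and the dataset object are placeholders standing in for the real network and data; they are assumptions for illustration, not code from the slides.

import numpy as np

def sgd_train(params, dataset, forward, backward, lr=0.01, steps=10000, rng=None):
    """Skeleton of SGD training; forward/backward are supplied by the network."""
    rng = rng or np.random.default_rng(0)
    for step in range(steps):
        # 1. Sample an image and its label.
        image, label = dataset[rng.integers(len(dataset))]
        # 2. Forward pass: compute the loss (and cache intermediates for backward).
        loss, cache = forward(params, image, label)
        # 3. Backward pass: backpropagate to get gradients w.r.t. the parameters.
        grads = backward(params, cache)
        # 4. Step along the negative gradient.
        for name in params:
            params[name] -= lr * grads[name]
        # 5. Repeat (the loop).
    return params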
Beyond sequences: computation graphs
- Arbitrary graphs of functions
- No distinction between intermediate outputs and
parameters
[Figure: an arbitrary graph of functions f, g, h, k, l connecting inputs x, y and parameters w, u to an output z]
Computation graph - Functions
- Each node implements two functions
- A “forward”
  - Computes the output given the input
- A “backward”
  - Computes the derivative of z w.r.t. the input, given the derivative of z w.r.t. the output
Computation graphs
[Figure: a single node f_i with inputs a, b, c and output d. In the forward pass it computes d from a, b, c; in the backward pass it receives ∂z/∂d and produces ∂z/∂a, ∂z/∂b, ∂z/∂c]
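A minimal sketch of such a node in Python, using an elementwise multiply as the example operation; the class and method names are invented here for illustration and are not taken from any particular framework.

import numpy as np

class MultiplyNode:
    """Example computation-graph node: d = a * b (elementwise)."""

    def forward(self, a, b):
        # Forward: compute the output given the inputs, caching them for backward.
        self.a, self.b = a, b
        return a * b

    def backward(self, dz_dd):
        # Backward: given dz/d(output), return dz/d(each input) via the chain rule.
        dz_da = dz_dd * self.b
        dz_db = dz_dd * self.a
        return dz_da, dz_db

# Usage: run forward once, then feed the upstream gradient into backward.
node = MultiplyNode()
d = node.forward(np.array([1.0, 2.0]), np.array([3.0, 4.0]))
dz_da, dz_db = node.backward(np.ones_like(d))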
Neural network frameworks
Stochastic gradient descent
- Estimate the gradient from a minibatch of K sampled examples instead of the whole training set:
  θ(t+1) ← θ(t) − λ (1/K) Σ_{k=1}^K ∇L(h(x_{i_k}; θ(t)), y_{i_k})
- Noisy!
Momentum
- Average multiple gradient steps
- Use exponential averaging (sketched below):
  g(t) ← (1/K) Σ_{k=1}^K ∇L(h(x_{i_k}; θ(t)), y_{i_k})
  p(t) ← μ g(t) + (1 − μ) p(t−1)
  θ(t+1) ← θ(t) − λ p(t)
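A minimal sketch of this exponential averaging in Python; grad_fn is a placeholder for the minibatch gradient computation and is an assumption for illustration.

import numpy as np

def sgd_momentum(theta, grad_fn, lr=0.01, mu=0.9, steps=1000):
    """SGD with an exponentially averaged gradient, following the slide's convention."""
    p = np.zeros_like(theta)                # p(0): running average of gradients
    for t in range(steps):
        g = grad_fn(theta)                  # minibatch gradient g(t)
        p = mu * g + (1 - mu) * p           # p(t) = mu*g(t) + (1-mu)*p(t-1)
        theta = theta - lr * p              # theta(t+1) = theta(t) - lambda*p(t)
    return theta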
Weight decay
- Add a decay term proportional to −θ to each update, so the weights shrink slightly at every step
- Prevents θ from growing to infinity
- Equivalent to L2 regularization of the weights
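A sketch of one update with weight decay; the coefficient name wd is my own, and grad stands in for the minibatch gradient.

import numpy as np

def sgd_step_with_weight_decay(theta, grad, lr=0.01, wd=1e-4):
    """One SGD step with weight decay: shrink the weights a little every update."""
    # Equivalent to using the gradient of L + (wd/2)*||theta||^2 (L2 regularization).
    return theta - lr * (grad + wd * theta)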
Learning rate decay
- A large step size / learning rate gives faster convergence initially
- But causes bouncing around at the end because of noisy gradients
- So the learning rate must be decreased over time
- Usually done in steps, dropping the rate by a constant factor every so many iterations (sketch below)
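A small sketch of a step schedule; the specific drop factor and interval below are illustrative assumptions, not values from the slides.

def step_decay_lr(base_lr, step, drop_every=30000, factor=0.1):
    """Piecewise-constant learning rate: multiply by `factor` every `drop_every` steps."""
    return base_lr * (factor ** (step // drop_every))

# e.g. base_lr=0.01 -> 0.01 for steps 0..29999, then 0.001 for 30000..59999, ...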
Convolutional network training
- Initialize network
- Sample minibatch of images
- Forward pass to compute loss
- Backpropagate loss to compute gradient
- Combine the gradient with momentum and weight decay
- Take a step according to the current learning rate (the pieces are combined in the sketch below)
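Pulling the pieces together, here is a sketch of one parameter update that combines the minibatch gradient with weight decay, momentum, and the current learning rate. The names wd and mu and the decay schedule are my own illustrative assumptions; grad stands in for the backpropagated minibatch gradient.

import numpy as np

def update_step(theta, p, grad, step, base_lr=0.01, mu=0.9, wd=1e-4):
    """One training update: gradient + weight decay, exponentially averaged, scaled by the current learning rate."""
    lr = base_lr * (0.1 ** (step // 30000))   # step learning-rate decay (illustrative schedule)
    g = grad + wd * theta                     # add the weight-decay term to the gradient
    p = mu * g + (1 - mu) * p                 # momentum: exponential average of gradients
    theta = theta - lr * p                    # step along the negative (averaged) gradient
    return theta, p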
Setting hyperparameters
- How do we find a hyperparameter setting that works? Try it!
- Train on train, test on test?
- Picking hyperparameters that work for the test set = overfitting on the test set
- Instead, pick them on a separate validation set (a search-loop sketch follows)
[Figure: Train / Validation / Test split. Run training iterations, test on validation, pick new hyperparameters, and repeat; test on the test set at the very end (ideally only once)]
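A sketch of this loop in Python; train_model and evaluate are placeholder functions assumed for illustration, not part of the slides.

def pick_hyperparameters(candidates, train_set, val_set, test_set, train_model, evaluate):
    """Choose hyperparameters on the validation set; touch the test set only once."""
    best_hp, best_model, best_val = None, None, float("inf")
    for hp in candidates:                       # e.g. a list of candidate learning rates
        model = train_model(train_set, hp)      # train on train
        val_err = evaluate(model, val_set)      # test on validation
        if val_err < best_val:
            best_hp, best_model, best_val = hp, model, val_err
    test_err = evaluate(best_model, test_set)   # test on test, ideally only once
    return best_hp, test_err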
Vagaries of optimization
- Non-convex
- Local optima
- Sensitivity to initialization
- Vanishing / exploding gradients:
  ∂z/∂z_i = (∂z/∂z_{n−1}) (∂z_{n−1}/∂z_{n−2}) ⋯ (∂z_{i+1}/∂z_i)
- If each term is (much) greater than 1 → gradients explode
- If each term is (much) less than 1 → gradients vanish
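A tiny numeric illustration of why long products of per-layer derivative terms explode or vanish; the factor values 1.1 and 0.9 are arbitrary choices for the demo.

# Product of 100 per-layer derivative terms, as in the chain-rule expression above.
n = 100
print(1.1 ** n)   # each term slightly > 1  -> about 1.4e4 (exploding gradient)
print(0.9 ** n)   # each term slightly < 1  -> about 2.7e-5 (vanishing gradient)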
Image Classification
How to do machine learning
- Create training / validation sets
- Identify loss functions
- Choose hypothesis class
- Find best hypothesis by minimizing training loss
Multiclass classification!
- The network outputs a vector of scores: h(x) = s
- A softmax turns scores into probabilities: p̂(y = k | x) ∝ e^{s_k}, i.e. p̂(y = k | x) = e^{s_k} / Σ_j e^{s_j}
- The loss is the negative log-likelihood of the true label: L(h(x), y) = − log p̂(y | x)  (a sketch follows)
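A minimal NumPy sketch of the softmax and the negative log-likelihood loss for a single example; the score values are made up for illustration.

import numpy as np

def softmax_nll(scores, label):
    """Loss for one example: -log p(y=label | x), with a softmax over the class scores."""
    scores = scores - scores.max()                  # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()   # p(y=k|x) = e^{s_k} / sum_j e^{s_j}
    return -np.log(probs[label])

# Example: 10 class scores, true class 3.
loss = softmax_nll(np.array([0.1, -2.0, 0.3, 2.5, 0.0, 1.0, -1.0, 0.2, 0.4, -0.5]), label=3)
print(loss)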
Building a convolutional network
[Figure: conv + relu + subsample → conv + relu + subsample → conv + relu + subsample → average pool → linear → 10 classes]
MNIST Classification

Method                         Error rate (%)
Linear classifier over pixels  12
Kernel SVM over HOG            0.56
Convolutional Network          0.8
ImageNet
- 1000 categories
- ~1000 instances per category
Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. (* = equal contribution) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 2015.
ImageNet
- Top-5 error: the algorithm makes 5 predictions, and the true label must be among the top 5 (computed in the sketch below)
- Useful for incomplete labelings
[Figure: the ILSVRC challenge winner's accuracy, 2010-2012]
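A small sketch of how top-5 error can be computed from class scores; the array shapes noted in the comments are illustrative assumptions.

import numpy as np

def top5_error(scores, labels):
    """Fraction of examples whose true label is NOT among the 5 highest-scoring classes."""
    # scores: (num_examples, num_classes) array; labels: (num_examples,) integer array.
    top5 = np.argsort(-scores, axis=1)[:, :5]        # indices of the 5 largest scores per example
    hit = (top5 == labels[:, None]).any(axis=1)      # is the true label among the top 5?
    return 1.0 - hit.mean()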
Convolutional Networks