SLIDE 1

Backpropagation

SLIDE 2

Why backpropagation

  • Neural networks are sequences of parametrized functions

[Diagram: x → conv (filters) → subsample → conv (filters) → subsample → linear (weights) → h(x; θ); the filters and weights together form the parameters θ]
SLIDE 3

Why backpropagation

  • Neural networks are sequences of parametrized functions
  • Parameters need to be set by minimizing some loss function

min_θ (1/N) ∑_{i=1}^{N} L(h(x_i; θ), y_i)

(h is the convolutional network)
SLIDE 4

Why backpropagation

  • Neural networks are sequences of parametrized functions
  • Parameters need to be set by minimizing some loss function
  • Minimization through gradient descent requires computing the gradient

θ^(t+1) = θ^(t) − λ (1/N) ∑_{i=1}^{N} ∇L(h(x_i; θ), y_i)
SLIDE 5

Why backpropagation

  • Neural networks are sequences of parametrized functions
  • Parameters need to be set by minimizing some loss function
  • Minimization through gradient descent requires computing the gradient

θ^(t+1) = θ^(t) − λ (1/N) ∑_{i=1}^{N} ∇L(h(x_i; θ), y_i)

z = h(x; θ)

∇_θ L(z, y) = (∂L(z, y)/∂z) (∂z/∂θ)
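As a concrete illustration (not from the slides), a minimal numpy sketch of this update, assuming a hypothetical `grad_loss(theta, x, y)` that returns ∇_θ L(h(x; θ), y):

```python
import numpy as np

def gd_step(theta, xs, ys, grad_loss, lr=0.01):
    # theta <- theta - lr * (1/N) * sum_i grad_loss(theta, x_i, y_i)
    # `grad_loss` is a hypothetical per-example gradient function.
    grads = [grad_loss(theta, x, y) for x, y in zip(xs, ys)]
    return theta - lr * np.mean(grads, axis=0)
```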

SLIDE 6

Why backpropagation

  • Neural networks are sequences of parametrized functions
  • Parameters need to be set by minimizing some loss function
  • Minimization through gradient descent requires computing the gradient
  • Backpropagation: a way to compute the gradient ∂z/∂θ
SLIDE 7

The gradient of convnets

[Chain diagram: x → f_1(·; w_1) → z_1 → f_2(·; w_2) → z_2 → f_3(·; w_3) → z_3 → f_4(·; w_4) → z_4 → f_5(·; w_5) → z_5 = z]
SLIDE 8

The gradient of convnets

[Chain diagram: x → f_1 → z_1 → f_2 → z_2 → f_3 → z_3 → f_4 → z_4 → f_5 → z_5 = z, with weights w_1 … w_5]

∂z/∂w_3 = (∂z/∂z_3) (∂z_3/∂w_3)
SLIDE 9

The gradient of convnets

[Chain diagram: x → f_1 → z_1 → f_2 → z_2 → f_3 → z_3 → f_4 → z_4 → f_5 → z_5 = z, with weights w_1 … w_5]

∂z/∂w_3 = (∂z/∂z_3) (∂z_3/∂w_3)
SLIDE 10

The gradient of convnets

[Chain diagram: x → f_1 → z_1 → f_2 → z_2 → f_3 → z_3 → f_4 → z_4 → f_5 → z_5 = z, with weights w_1 … w_5]

∂z/∂w_3 = (∂z/∂z_3) (∂z_3/∂w_3)
SLIDE 11

The gradient of convnets

[Chain diagram: x → f_1 → z_1 → f_2 → z_2 → f_3 → z_3 → f_4 → z_4 → f_5 → z_5 = z, with weights w_1 … w_5]

∂z/∂w_3 = (∂z/∂z_3) (∂z_3/∂w_3)

∂z/∂z_3 = (∂z/∂z_4) (∂z_4/∂z_3)
SLIDE 12

The gradient of convnets

[Chain diagram: x → f_1 → z_1 → f_2 → z_2 → f_3 → z_3 → f_4 → z_4 → f_5 → z_5 = z, with weights w_1 … w_5]

∂z/∂w_3 = (∂z/∂z_3) (∂z_3/∂w_3)

∂z/∂z_3 = (∂z/∂z_4) (∂z_4/∂z_3)
SLIDE 13

The gradient of convnets

[Chain diagram: x → f_1 → z_1 → f_2 → z_2 → f_3 → z_3 → f_4 → z_4 → f_5 → z_5 = z, with weights w_1 … w_5]

∂z/∂z_2 = (∂z/∂z_3) (∂z_3/∂z_2)

∂z/∂w_2 = (∂z/∂z_2) (∂z_2/∂w_2)

Recurrence going backward!!

SLIDE 14

The gradient of convnets

[Chain diagram: x → f_1 → z_1 → f_2 → z_2 → f_3 → z_3 → f_4 → z_4 → f_5 → z_5 = z, with weights w_1 … w_5]
SLIDE 15

Backpropagation for a sequence

  • A sequence of n functions f_1, …, f_n
  • Assume we can compute the partial derivatives of each function
  • Use g(z_i) to store the gradient of z w.r.t. z_i, and g(w_i) for w_i
  • Calculate g(z_i) by iterating backwards
  • Use g(z_i) to compute the gradients of the parameters

z_i = f_i(z_{i−1}, w_i),   z_0 = x,   z = z_n

∂z_i/∂z_{i−1} = ∂f_i(z_{i−1}, w_i)/∂z_{i−1}

∂z_i/∂w_i = ∂f_i(z_{i−1}, w_i)/∂w_i

g(z_n) = ∂z/∂z_n = 1

g(z_{i−1}) = (∂z/∂z_i) (∂z_i/∂z_{i−1}) = g(z_i) ∂z_i/∂z_{i−1}

g(w_i) = (∂z/∂z_i) (∂z_i/∂w_i) = g(z_i) ∂z_i/∂w_i
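The recurrence maps directly to code. Below is a minimal sketch (not the lecture's implementation) in which each layer supplies its forward map f_i and the two local products g·∂z_i/∂z_{i−1} and g·∂z_i/∂w_i as callables:

```python
import numpy as np

def backprop_sequence(layers, weights, x):
    """Backpropagation through z_i = f_i(z_{i-1}, w_i), z_0 = x, z = z_n.

    Each element of `layers` is a triple (f, df_dz, df_dw); df_dz(g, z_prev, w)
    returns g * dz_i/dz_{i-1} and df_dw(g, z_prev, w) returns g * dz_i/dw_i.
    This triple interface is an assumption made for the sketch.
    """
    # Forward pass: store every intermediate z_i.
    zs = [x]
    for (f, _, _), w in zip(layers, weights):
        zs.append(f(zs[-1], w))

    # Backward pass: g(z_n) = dz/dz_n = 1, then iterate backwards.
    g = np.ones_like(zs[-1])
    grads = [None] * len(weights)
    for i in reversed(range(len(layers))):
        _, df_dz, df_dw = layers[i]
        grads[i] = df_dw(g, zs[i], weights[i])  # g(w_i) = g(z_i) * dz_i/dw_i
        g = df_dz(g, zs[i], weights[i])         # g(z_{i-1}) = g(z_i) * dz_i/dz_{i-1}
    return zs[-1], grads
```

For example, a scaling layer f(z, w) = w·z would supply df_dz = lambda g, z, w: g * w and df_dw = lambda g, z, w: g * z.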

SLIDE 16

Loss as a function

[Diagram: image → conv (filters) → subsample → conv (filters) → subsample → linear (weights) → loss, which also takes the label]
SLIDE 17

Putting it all together: SGD training of ConvNets

  • 1. Sample image and label

[Diagram: image → conv → subsample → conv → subsample → linear → loss (with label)]
SLIDE 18

Putting it all together: SGD training of ConvNets

  • 1. Sample image and label
  • 2. Pass image through network to get loss (forward)

[Diagram: image → conv → subsample → conv → subsample → linear → loss (with label)]
SLIDE 19

Putting it all together: SGD training of ConvNets

  • 1. Sample image and label
  • 2. Pass image through network to get loss (forward)
  • 3. Backpropagate to get gradients (backward)

[Diagram: image → conv → subsample → conv → subsample → linear → loss (with label)]
SLIDE 20

Putting it all together: SGD training of ConvNets

  • 1. Sample image and label
  • 2. Pass image through network to get loss (forward)
  • 3. Backpropagate to get gradients (backward)
  • 4. Take step along negative gradients to update weights

[Diagram: image → conv → subsample → conv → subsample → linear → loss (with label)]
SLIDE 21

Putting it all together: SGD training of ConvNets

  • 1. Sample image and label
  • 2. Pass image through network to get loss (forward)
  • 3. Backpropagate to get gradients (backward)
  • 4. Take step along negative gradients to update weights
  • 5. Repeat!

[Diagram: image → conv → subsample → conv → subsample → linear → loss (with label)]
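As an illustration only, steps 1–5 as a loop; `sample_minibatch`, `forward`, and `backward` are hypothetical helpers standing in for the network above, not a specific API:

```python
def train_sgd(params, dataset, sample_minibatch, forward, backward, lr=0.01, steps=10000):
    for _ in range(steps):                             # 5. Repeat!
        images, labels = sample_minibatch(dataset)     # 1. sample image and label
        loss, cache = forward(params, images, labels)  # 2. forward pass to the loss
        grads = backward(params, cache)                # 3. backpropagate to get gradients
        for k in params:                               # 4. step along the negative gradient
            params[k] -= lr * grads[k]
    return params
```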

SLIDE 22

Beyond sequences: computation graphs

  • Arbitrary graphs of functions
  • No distinction between intermediate outputs and parameters

[Diagram: a computation graph of functions f, g, h, k, l connecting inputs x, y and parameters w, u to the output z]
SLIDE 23

Computation graph - Functions

  • Each node implements two functions
  • A “forward”
    • Computes output given input
  • A “backward”
    • Computes derivative of z w.r.t. input, given derivative of z w.r.t. output
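A minimal sketch of one such node, using a ReLU as a stand-in example (the slides do not name a specific node): `forward` computes the output given the input; `backward` turns ∂z/∂output into ∂z/∂input via the local derivative.

```python
import numpy as np

class ReLUNode:
    def forward(self, a):
        # Compute the output given the input; remember the input for backward.
        self.a = a
        return np.maximum(a, 0)

    def backward(self, dz_dout):
        # Given dz/d(output), return dz/d(input) = dz/d(output) * d(output)/d(input).
        return dz_dout * (self.a > 0)
```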
SLIDE 24

Computation graphs

[Diagram: node f_i with inputs a, b, c and output d]
SLIDE 25

Computation graphs

[Diagram: the backward pass at node f_i: given ∂z/∂d for its output d, it produces ∂z/∂a, ∂z/∂b, and ∂z/∂c for its inputs]
SLIDE 26

Computation graphs

[Diagram: node f_i with inputs a, b, c and output d]
SLIDE 27

Computation graphs

[Diagram: the backward pass at node f_i: given ∂z/∂d for its output d, it produces ∂z/∂a, ∂z/∂b, and ∂z/∂c for its inputs]
SLIDE 28

Neural network frameworks

SLIDE 29

Stochastic gradient descent

θ^(t+1) = θ^(t) − λ (1/K) ∑_{k=1}^{K} ∇L(h(x_{i_k}; θ^(t)), y_{i_k})

Noisy!
SLIDE 30

Momentum

  • Average multiple gradient steps
  • Use exponential averaging

g^(t) = (1/K) ∑_{k=1}^{K} ∇L(h(x_{i_k}; θ^(t)), y_{i_k})

p^(t) = μ g^(t) + (1 − μ) p^(t−1)

θ^(t+1) = θ^(t) − λ p^(t)
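A minimal sketch of this update (illustrative values; it follows the convention above where μ weights the new gradient):

```python
def momentum_step(theta, p, grad, lr=0.01, mu=0.1):
    # p^(t) = mu * g^(t) + (1 - mu) * p^(t-1);  theta^(t+1) = theta^(t) - lr * p^(t)
    p = mu * grad + (1 - mu) * p
    return theta - lr * p, p
```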

SLIDE 31

Weight decay

  • Add a decay term (proportional to −θ) to the gradient step
  • Prevents θ from growing to infinity
  • Equivalent to L2 regularization of the weights
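A sketch of the idea, treating weight decay as the gradient of the L2 penalty (wd/2)·‖θ‖² added to the loss gradient; the coefficient name `wd` and its value are illustrative:

```python
def add_weight_decay(grad, theta, wd=1e-4):
    # Gradient of (wd/2) * ||theta||^2 is wd * theta; adding it shrinks theta each step.
    return grad + wd * theta
```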
SLIDE 32

Learning rate decay

  • Large step size / learning rate
    • Faster convergence initially
    • Bouncing around at the end because of noisy gradients
  • Learning rate must be decreased over time
  • Usually done in steps
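A sketch of such a stepwise schedule; the drop interval and factor are illustrative, not values from the slides:

```python
def step_lr(base_lr, iteration, drop_every=30000, factor=0.1):
    # Multiply the learning rate by `factor` every `drop_every` iterations.
    return base_lr * factor ** (iteration // drop_every)
```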
SLIDE 33

Convolutional network training

  • Initialize network
  • Sample minibatch of images
  • Forward pass to compute loss
  • Backpropagate loss to compute gradient
  • Combine gradient with momentum and weight decay
  • Take step according to current learning rate
SLIDE 34

Setting hyperparameters

  • How do we find a hyperparameter setting that works?
  • Try it!
  • Train on train
  • Test on validation (not on test)
  • Picking hyperparameters that work for test = overfitting on the test set
SLIDE 35

Setting hyperparameters

[Diagram: Train / Validation / Test splits over training iterations. Test on validation, pick new hyperparameters, repeat; test on the test set only at the end (ideally only once)]
SLIDE 36

Vagaries of optimization

  • Non-convex
  • Local optima
  • Sensitivity to initialization
  • Vanishing / exploding gradients
    • If each term is (much) greater than 1 → explosion of gradients
    • If each term is (much) less than 1 → vanishing gradients

∂z/∂z_i = (∂z/∂z_{n−1}) (∂z_{n−1}/∂z_{n−2}) ⋯ (∂z_{i+1}/∂z_i)
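A tiny numeric illustration of that product (the factor values are chosen only to show the effect):

```python
import numpy as np

# 100 layers whose per-layer factors are slightly below or above 1.
print(np.prod(np.full(100, 0.9)))   # ~2.7e-05: gradients vanish
print(np.prod(np.full(100, 1.1)))   # ~1.4e+04: gradients explode
```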

SLIDE 37

Image Classification

SLIDE 38

How to do machine learning

  • Create training / validation sets
  • Identify loss functions
  • Choose hypothesis class
  • Find best hypothesis by minimizing training loss
SLIDE 39

How to do machine learning

  • Create training / validation sets
  • Identify loss functions
  • Choose hypothesis class
  • Find best hypothesis by minimizing training loss

h(x) = s

Multiclass classification!!

p̂(y = k | x) ∝ e^{s_k}

p̂(y = k | x) = e^{s_k} / ∑_j e^{s_j}

L(h(x), y) = −log p̂(y | x)
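A minimal numpy sketch of this loss (the max-shift is a standard numerical-stability trick, not something the slide mentions):

```python
import numpy as np

def softmax_cross_entropy(scores, y):
    # p_hat(k|x) = exp(s_k) / sum_j exp(s_j);  loss = -log p_hat(y|x)
    s = scores - np.max(scores)                # stability shift
    log_probs = s - np.log(np.sum(np.exp(s)))
    return -log_probs[y]
```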

SLIDE 40

Building a convolutional network

[Diagram: conv + relu + subsample → conv + relu + subsample → conv + relu + subsample → average pool → linear → 10 classes]
SLIDE 41

Building a convolutional network

SLIDE 42

MNIST Classification

Method                        | Error rate (%)
------------------------------|---------------
Linear classifier over pixels | 12
Kernel SVM over HOG           | 0.56
Convolutional Network         | 0.8
SLIDE 43

ImageNet

  • 1000 categories
  • ~1000 instances per category

Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. (* = equal contribution) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 2015.

SLIDE 44

ImageNet

  • Top-5 error: algorithm makes 5 predictions, true label must be in top 5
  • Useful for incomplete labelings
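A sketch of how the metric is computed for one example (an illustrative helper, not from the slides):

```python
import numpy as np

def top5_correct(scores, true_label):
    # Correct if the true label is among the 5 highest-scoring classes.
    return true_label in np.argsort(scores)[-5:]
```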
SLIDE 45

[Chart: challenge winner's accuracy for 2010, 2011, and 2012 (axis ticks 5–30); the 2012 winner used Convolutional Networks]