
SLIDE 1

Back-Propagation

16-385 Computer Vision (Kris Kitani)

Carnegie Mellon University

SLIDE 2

Back to the… World’s Smallest Perceptron!

y = wx

(a.k.a. the line equation, linear regression)

A function of ONE parameter!

[diagram: input x → weight w → activation f → output y]

SLIDE 3

Training the world’s smallest perceptron

This is just gradient descent. That means this term should be the gradient of the loss function. Now where does this come from?

SLIDE 4

dL/dw

…is the rate at which this will change (the loss function):

L = ½(y − ŷ)²

…per unit change of this (the weight parameter), where:

ŷ = wx

Let’s compute the derivative…

SLIDE 5

Compute the derivative

dL/dw = d/dw {½(y − ŷ)²}
      = −(y − ŷ) d(wx)/dw
      = −(y − ŷ)x
      = ∇w   (just shorthand)

That means the weight update for gradient descent is:

w = w − η∇w = w + η(y − ŷ)x

(move in the direction of the negative gradient)

SLIDE 6

Gradient Descent (world’s smallest perceptron)

For each sample {xi, yi}:

  • 1. Predict
  • a. Forward pass: ŷ = wxi
  • b. Compute Loss: Li = ½(yi − ŷ)²
  • 2. Update
  • a. Back Propagation: dLi/dw = −(yi − ŷ)xi = ∇w
  • b. Gradient update: w = w − η∇w
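This loop translates almost line for line into code. A minimal sketch (not from the slides; the function name, step size η = 0.1, and toy data drawn from the line y = 3x are all illustrative):

```python
# World's smallest perceptron: y_hat = w*x, loss L = 0.5*(y - y_hat)^2.
# Gradient descent, one sample at a time.

def train(samples, w=0.0, eta=0.1, epochs=100):
    for _ in range(epochs):
        for x, y in samples:
            y_hat = w * x                   # 1a. forward pass
            loss = 0.5 * (y - y_hat) ** 2   # 1b. compute loss
            grad = -(y - y_hat) * x         # 2a. back propagation: dL/dw
            w = w - eta * grad              # 2b. gradient update
    return w

# Samples lie on the line y = 3x, so w should converge to 3.
w = train([(1.0, 3.0), (2.0, 6.0), (-1.0, -3.0)])
```

Each pass shrinks the error (w − 3) by a constant factor per sample, so a few hundred updates already land within floating-point precision of the answer.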

SLIDE 7

Training the world’s smallest perceptron

SLIDE 8

World’s (second) smallest perceptron!

y = w1x1 + w2x2

A function of TWO parameters!

[diagram: inputs x1, x2 → weights w1, w2 → output y]

SLIDE 9

Gradient Descent

For each sample {xi, yi}:

  • 1. Predict
  • a. Forward pass
  • b. Compute Loss
  • 2. Update
  • a. Back Propagation
  • b. Gradient update

We just need to compute the partial derivatives for this network.

SLIDE 10

Back-Propagation

Why do we have partial derivatives now? (The loss is now a function of two parameters.)

∂L/∂w1 = ∂/∂w1 {½(y − ŷ)²}
       = −(y − ŷ) ∂ŷ/∂w1
       = −(y − ŷ) ∂(Σi wixi)/∂w1
       = −(y − ŷ) ∂(w1x1)/∂w1
       = −(y − ŷ)x1 = ∇w1

∂L/∂w2 = ∂/∂w2 {½(y − ŷ)²}
       = −(y − ŷ) ∂ŷ/∂w2
       = −(y − ŷ) ∂(Σi wixi)/∂w2
       = −(y − ŷ) ∂(w2x2)/∂w2
       = −(y − ŷ)x2 = ∇w2

SLIDE 11

Back-Propagation

∂L/∂w1 = ∂/∂w1 {½(y − ŷ)²} = −(y − ŷ) ∂ŷ/∂w1 = −(y − ŷ) ∂(Σi wixi)/∂w1 = −(y − ŷ)x1 = ∇w1

∂L/∂w2 = ∂/∂w2 {½(y − ŷ)²} = −(y − ŷ) ∂ŷ/∂w2 = −(y − ŷ) ∂(Σi wixi)/∂w2 = −(y − ŷ)x2 = ∇w2

Gradient Update

w1 = w1 − η∇w1 = w1 + η(y − ŷ)x1
w2 = w2 − η∇w2 = w2 + η(y − ŷ)x2

SLIDE 12

Gradient Descent

For each sample {xi, yi}:

  • 1. Predict
  • a. Forward pass: ŷ = fMLP(xi; θ)
  • b. Compute Loss: Li = ½(yi − ŷ)²
  • 2. Update
  • a. Back Propagation: ∇w1 = −(yi − ŷ)x1i, ∇w2 = −(yi − ŷ)x2i   (two BP lines now)
  • b. Gradient update: w1 = w1 + η(yi − ŷ)x1i, w2 = w2 + η(yi − ŷ)x2i

(the gradients are approximated from a stochastic sample; η is an adjustable step size)
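One SGD step for this two-parameter perceptron can be sketched as follows (hypothetical code, not from the slides; the toy data follows y = 2·x1 − x2):

```python
# Two-parameter perceptron: y_hat = w1*x1 + w2*x2.
# One partial derivative (one "BP line") per parameter.

def sgd_step(w1, w2, x1, x2, y, eta=0.1):
    y_hat = w1 * x1 + w2 * x2      # forward pass
    grad_w1 = -(y - y_hat) * x1    # back propagation
    grad_w2 = -(y - y_hat) * x2
    # gradient update: move against the gradient
    return w1 - eta * grad_w1, w2 - eta * grad_w2

# Fit y = 2*x1 - x2 by cycling through three consistent samples.
samples = [(1.0, 0.0, 2.0), (0.0, 1.0, -1.0), (1.0, 1.0, 1.0)]
w1, w2 = 0.0, 0.0
for _ in range(200):
    for x1, x2, y in samples:
        w1, w2 = sgd_step(w1, w2, x1, x2, y)
```

Because the samples are consistent with a single line, repeated stochastic steps contract the parameter error toward (w1, w2) = (2, −1).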

SLIDE 13

We haven’t seen a lot of ‘propagation’ yet because our perceptrons only had one layer…

SLIDE 14

Multi-layer perceptron

A function of FOUR parameters and FOUR layers!

[diagram: x → w1, b1 → h1 → w2 → h2 → w3 → y]

SLIDE 15

[diagram: input layer 1 (input x, bias b1) → weight w1 → sum a1 → activation f1 (hidden layer 2) → weight w2 → sum a2 → activation f2 (hidden layer 3) → weight w3 → sum a3 → activation f3 → output y (output layer 4)]


SLIDE 17

[diagram: input layer 1 (input x, bias b1) → weight w1 → sum a1 → activation f1 (hidden layer 2) → weight w2 → sum a2 → activation f2 (hidden layer 3) → weight w3 → sum a3 → activation f3 → output y (output layer 4)]

a1 = w1 · x + b1


SLIDE 19

[diagram: input layer 1 (input x, bias b1) → weight w1 → sum a1 → activation f1 (hidden layer 2) → weight w2 → sum a2 → activation f2 (hidden layer 3) → weight w3 → sum a3 → activation f3 → output y (output layer 4)]

a1 = w1 · x + b1
a2 = w2 · f1(w1 · x + b1)


SLIDE 21

[diagram: input layer 1 (input x, bias b1) → weight w1 → sum a1 → activation f1 (hidden layer 2) → weight w2 → sum a2 → activation f2 (hidden layer 3) → weight w3 → sum a3 → activation f3 → output y (output layer 4)]

a1 = w1 · x + b1
a2 = w2 · f1(w1 · x + b1)
a3 = w3 · f2(w2 · f1(w1 · x + b1))


SLIDE 23

[diagram: input layer 1 (input x, bias b1) → weight w1 → sum a1 → activation f1 (hidden layer 2) → weight w2 → sum a2 → activation f2 (hidden layer 3) → weight w3 → sum a3 → activation f3 → output y (output layer 4)]

a1 = w1 · x + b1
a2 = w2 · f1(w1 · x + b1)
a3 = w3 · f2(w2 · f1(w1 · x + b1))
y = f3(w3 · f2(w2 · f1(w1 · x + b1)))
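Written as code, the forward pass is just these equations evaluated inside out. A sketch assuming sigmoid activations for f1, f2, f3 (the slides pick a concrete activation a few slides later, so that choice is an assumption here):

```python
import math

# Forward pass of the four-parameter MLP:
#   y = f3(w3 * f2(w2 * f1(w1 * x + b1)))
# Sigmoid activations are an assumption for illustration.

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, w1, w2, w3, b1):
    a1 = w1 * x + b1     # weighted sum into hidden layer 2
    f1 = sigmoid(a1)     # activation, hidden layer 2
    a2 = w2 * f1         # weighted sum into hidden layer 3
    f2 = sigmoid(a2)     # activation, hidden layer 3
    a3 = w3 * f2         # weighted sum into output layer 4
    f3 = sigmoid(a3)     # activation, output layer 4
    return f3

y = forward(x=0.5, w1=1.0, w2=-2.0, w3=0.5, b1=0.1)
```

With sigmoid activations the output always lands in (0, 1), and with all parameters zero every sum is 0, so every activation is exactly 0.5.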

SLIDE 24

y = f3(w3 · f2(w2 · f1(w1 · x + b1)))

The entire network can be written out as one long equation.

We need to train the network: What is known? What is unknown?

SLIDE 25

y = f3(w3 · f2(w2 · f1(w1 · x + b1)))

The entire network can be written out as one long equation. What is known? What is unknown?

known: the input x and the target output y (the training samples)

We need to train the network:

SLIDE 26

y = f3(w3 · f2(w2 · f1(w1 · x + b1)))

The entire network can be written out as one long equation. What is known? What is unknown?

unknown: the weights w1, w2, w3 and the bias b1 (and the activation function sometimes has parameters too)

We need to train the network:

SLIDE 27

Learning an MLP

Given a set of samples {xi, yi} and an MLP y = fMLP(x; θ),
estimate the parameters θ = {f, w, b} of the MLP.

SLIDE 28

Stochastic Gradient Descent

For each random sample {xi, yi}:

  • 1. Predict
  • a. Forward pass: ŷ = fMLP(xi; θ)
  • b. Compute Loss: Li
  • 2. Update
  • a. Back Propagation: ∂L/∂θ   (vector of parameter partial derivatives)
  • b. Gradient update: θ ← θ − η∇θ   (vector of parameter update equations)

SLIDE 29

So we need to compute the partial derivatives

∂L/∂θ = [ ∂L/∂w3  ∂L/∂w2  ∂L/∂w1  ∂L/∂b ]

SLIDE 30

[diagram: x → w1 → a1 → f1 → w2 → a2 → f2 → w3 → a3 → f3 → y → L (loss layer), with bias b1]

Remember, the partial derivative ∂L/∂w1 describes how this (the weight w1) affects this (the loss).

So, how do you compute it?

SLIDE 31

The Chain Rule

SLIDE 32

According to the chain rule…

∂L/∂w3 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂w3)

[diagram: rest of the network → f2 → weight w3 → sum a3 → activation f3 → ŷ → L(y, ŷ)]

Intuitively, the effect of the weight w3 on the loss function: L depends on f3, f3 depends on a3, and a3 depends on w3.

SLIDE 33

[diagram: rest of the network → f2 → weight w3 → sum a3 → activation f3 → ŷ → L(y, ŷ)]

∂L/∂w3 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂w3)

Chain Rule!

SLIDE 34

[diagram: rest of the network → f2 → weight w3 → sum a3 → activation f3 → ŷ → L(y, ŷ)]

∂L/∂w3 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂w3) = −(y − ŷ)(∂f3/∂a3)(∂a3/∂w3)

(∂L/∂f3 = −(y − ŷ) is just the partial derivative of the L2 loss)

SLIDE 35

[diagram: rest of the network → f2 → weight w3 → sum a3 → activation f3 → ŷ → L(y, ŷ)]

∂L/∂w3 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂w3) = −(y − ŷ)(∂f3/∂a3)(∂a3/∂w3)

Let’s use a Sigmoid function, s(x) = 1/(1 + e^(−x)), whose derivative is:

ds(x)/dx = s(x)(1 − s(x))
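This derivative identity is easy to verify numerically against a finite difference (a quick sketch, not part of the slides):

```python
import math

# Verify ds/dx = s(x) * (1 - s(x)) for the sigmoid s(x) = 1/(1 + e^-x)
# by comparing against a central finite difference at a few points.

def s(x):
    return 1.0 / (1.0 + math.exp(-x))

def ds_analytic(x):
    return s(x) * (1.0 - s(x))

def ds_numeric(x, eps=1e-6):
    return (s(x + eps) - s(x - eps)) / (2.0 * eps)

checks = [(ds_analytic(x), ds_numeric(x)) for x in (-2.0, 0.0, 1.5)]
```

At x = 0 the identity gives s(0)(1 − s(0)) = 0.5 · 0.5 = 0.25, the sigmoid's steepest slope.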

SLIDE 36

[diagram: rest of the network → f2 → weight w3 → sum a3 → activation f3 → ŷ → L(y, ŷ)]

∂L/∂w3 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂w3)
       = −(y − ŷ)(∂f3/∂a3)(∂a3/∂w3)
       = −(y − ŷ) f3(1 − f3) (∂a3/∂w3)

Let’s use a Sigmoid function, so ds(x)/dx = s(x)(1 − s(x)).

SLIDE 37

[diagram: rest of the network → f2 → weight w3 → sum a3 → activation f3 → ŷ → L(y, ŷ)]

∂L/∂w3 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂w3)
       = −(y − ŷ)(∂f3/∂a3)(∂a3/∂w3)
       = −(y − ŷ) f3(1 − f3) (∂a3/∂w3)
       = −(y − ŷ) f3(1 − f3) f2
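The final expression can be sanity-checked numerically. A sketch with arbitrary made-up values, assuming ŷ = f3 = sigmoid(a3) and a3 = w3 · f2 as on this slide:

```python
import math

# Check dL/dw3 = -(y - f3) * f3 * (1 - f3) * f2 against a
# finite-difference derivative of L = 0.5 * (y - f3)^2,
# where f3 = sigmoid(w3 * f2). The numbers are arbitrary.

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

y, f2, w3 = 1.0, 0.8, -0.3

def loss(w):
    return 0.5 * (y - sigmoid(w * f2)) ** 2

f3 = sigmoid(w3 * f2)
analytic = -(y - f3) * f3 * (1.0 - f3) * f2

eps = 1e-6
numeric = (loss(w3 + eps) - loss(w3 - eps)) / (2.0 * eps)
```

The analytic chain-rule product and the finite-difference estimate should agree to several decimal places.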

SLIDE 38

[diagram: x → w1 → a1 → f1 → w2 → a2 → f2 → w3 → a3 → f3 → y, with bias b1]

∂L/∂w2 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂w2)

SLIDE 39

[diagram: x → w1 → a1 → f1 → w2 → a2 → f2 → w3 → a3 → f3 → y, with bias b1]

∂L/∂w2 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂w2)

The leading terms were already computed for ∂L/∂w3: re-use (propagate) them!

SLIDE 40

The Chain rule says…

[diagram: x → w1 → a1 → f1 → w2 → a2 → f2 → w3 → a3 → f3 → y, with bias b1; each quantity depends on the one before it]

∂L/∂w1 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂f1)(∂f1/∂a1)(∂a1/∂w1)

SLIDE 41

The Chain rule says…

[diagram: x → w1 → a1 → f1 → w2 → a2 → f2 → w3 → a3 → f3 → y, with bias b1; each quantity depends on the one before it]

∂L/∂w1 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂f1)(∂f1/∂a1)(∂a1/∂w1)

The leading terms were already computed for ∂L/∂w2: re-use (propagate) them!

SLIDE 42

[diagram: x → w1 → a1 → f1 → w2 → a2 → f2 → w3 → a3 → f3 → y, with bias b1]

∂L/∂w3 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂w3)
∂L/∂w2 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂w2)
∂L/∂w1 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂f1)(∂f1/∂a1)(∂a1/∂w1)
∂L/∂b  = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂f1)(∂f1/∂a1)(∂a1/∂b)


SLIDE 45

Stochastic Gradient Descent

For each sample {xi, yi}:

  • 1. Predict
  • a. Forward pass: ŷ = fMLP(xi; θ)
  • b. Compute Loss: Li
  • 2. Update
  • a. Back Propagation:
      ∂L/∂w3 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂w3)
      ∂L/∂w2 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂w2)
      ∂L/∂w1 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂f1)(∂f1/∂a1)(∂a1/∂w1)
      ∂L/∂b  = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂f1)(∂f1/∂a1)(∂a1/∂b)
  • b. Gradient update:
      w3 = w3 − η∇w3
      w2 = w2 − η∇w2
      w1 = w1 − η∇w1
      b = b − η∇b
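The whole recipe fits in a short function. A sketch of one possible implementation (sigmoid activations throughout and all helper names are my assumptions, not the slides'); the deltas d3, d2, d1 carry the re-used (propagated) chain-rule products:

```python
import math

# Backprop for the four-parameter MLP
#   y_hat = f3(w3 * f2(w2 * f1(w1 * x + b))),
# sigmoid activations, L2 loss L = 0.5 * (y - y_hat)^2.
# d3, d2, d1 are dL/da3, dL/da2, dL/da1: each re-uses
# (propagates) the one computed just before it.

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, w1, w2, w3, b):
    f1 = sigmoid(w1 * x + b)
    f2 = sigmoid(w2 * f1)
    f3 = sigmoid(w3 * f2)
    return f1, f2, f3

def gradients(x, y, w1, w2, w3, b):
    f1, f2, f3 = forward(x, w1, w2, w3, b)   # 1a. forward pass
    d3 = -(y - f3) * f3 * (1.0 - f3)         # dL/da3
    d2 = d3 * w3 * f2 * (1.0 - f2)           # dL/da2 (re-uses d3)
    d1 = d2 * w2 * f1 * (1.0 - f1)           # dL/da1 (re-uses d2)
    return {"w3": d3 * f2, "w2": d2 * f1, "w1": d1 * x, "b": d1}

def sgd_step(x, y, w1, w2, w3, b, eta=0.5):
    g = gradients(x, y, w1, w2, w3, b)       # 2a. back propagation
    return (w1 - eta * g["w1"], w2 - eta * g["w2"],
            w3 - eta * g["w3"], b - eta * g["b"])   # 2b. update

g = gradients(x=0.7, y=1.0, w1=0.5, w2=-0.4, w3=0.8, b=0.1)
```

One `sgd_step` per sample implements the loop on this slide; because each delta is reused by the layer below it, the cost of backprop is about the same as one forward pass.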

SLIDE 46

Stochastic Gradient Descent

For each sample {xi, yi}:

  • 1. Predict
  • a. Forward pass: ŷ = fMLP(xi; θ)
  • b. Compute Loss: Li
  • 2. Update
  • a. Back Propagation: ∂L/∂θ   (vector of parameter partial derivatives)
  • b. Gradient update: θ ← θ − η ∂L/∂θ   (vector of parameter update equations)