SLIDE 1

Machine Learning: Chenhao Tan

University of Colorado Boulder

LECTURE 7 Slides adapted from Jordan Boyd-Graber, Chris Ketelsen

SLIDE 2

Final projects

  • WSDM Cup
  • SemEval 2018

SLIDE 3

Overview

  • Forward propagation recap
  • Back propagation
      • Chain rule
      • Back propagation
      • Full algorithm

SLIDE 4

Forward propagation recap

Outline

  • Forward propagation recap
  • Back propagation
      • Chain rule
      • Back propagation
      • Full algorithm

SLIDE 5

Forward propagation recap

Forward propagation algorithm

Store the biases for layer l in b^l and the weight matrix in W^l.

[Figure: feed-forward network with inputs x1, x2, . . . , xd and layer parameters (W^1, b^1), (W^2, b^2), (W^3, b^3)]

SLIDE 6

Forward propagation recap

Forward propagation algorithm

Suppose your network has L layers. To make a prediction based on test point x:

1: Initialize a^0 = x
2: for l = 1 to L do
3:     z^l = W^l a^{l−1} + b^l
4:     a^l = g(z^l)
5: end for
6: The prediction ŷ is simply a^L
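
To make the loop concrete, here is a minimal NumPy sketch of this forward pass (the sigmoid activation and the function/variable names are illustrative assumptions, not part of the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, g=sigmoid):
    """Forward propagation: returns activations a^0..a^L and pre-activations z^1..z^L."""
    a = x
    activations, zs = [a], []
    for W, b in zip(weights, biases):
        z = W @ a + b          # z^l = W^l a^{l-1} + b^l
        a = g(z)               # a^l = g(z^l)
        zs.append(z)
        activations.append(a)
    return activations, zs     # the prediction y_hat is activations[-1]
```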

SLIDE 7

Forward propagation recap

Neural networks in a nutshell

  • Training data S_train = {(x, y)}
  • Network architecture (model)

ŷ = f_w(x), with parameters W^l, b^l, l = 1, . . . , L

  • Loss function (objective function)

L(y, ŷ)

  • How do we learn the parameters?

SLIDE 8

Forward propagation recap

Neural networks in a nutshell

  • Training data S_train = {(x, y)}
  • Network architecture (model)

ŷ = f_w(x), with parameters W^l, b^l, l = 1, . . . , L

  • Loss function (objective function)

L(y, ŷ)

  • How do we learn the parameters?

Stochastic gradient descent: W^l ← W^l − η ∂L(y, ŷ)/∂W^l
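
As a one-line illustration of this update rule, a minimal NumPy sketch (the shapes, learning rate, and placeholder gradient are assumptions; in practice the gradient comes from back propagation, derived in the rest of the lecture):

```python
import numpy as np

eta = 0.1                              # learning rate
W_l = np.zeros((3, 4))                 # some layer's weight matrix W^l
grad_W_l = 0.01 * np.ones((3, 4))      # placeholder for dL/dW^l

W_l -= eta * grad_W_l                  # W^l <- W^l - eta * dL/dW^l
```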

SLIDE 9

Forward propagation recap

Challenge

  • Challenge: How the heck do we compute derivatives of the loss function with respect to the weights and biases?

  • Solution: Back Propagation

SLIDE 10

Back propagation

Outline

  • Forward propagation recap
  • Back propagation
      • Chain rule
      • Back propagation
      • Full algorithm

SLIDE 11

Back propagation | Chain rule

The Chain Rule

The chain rule allows us to take derivatives of nested functions. There are two forms of the Chain Rule.

SLIDE 12

Back propagation | Chain rule

The Chain Rule

The chain rule allows us to take derivatives of nested functions. There are two forms of the Chain Rule.

Baby Chain Rule: (d/dx) f(g(x)) = f′(g(x)) g′(x) = (df/dg)(dg/dx)

SLIDE 13

Back propagation | Chain rule

The Chain Rule

The chain rule allows us to take derivatives of nested functions. There are two forms of the Chain Rule.

Baby Chain Rule: (d/dx) f(g(x)) = f′(g(x)) g′(x) = (df/dg)(dg/dx)

Example: (d/dx) sin(x²) = cos(x²) · 2x
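
As a quick numerical sanity check of this example, a tiny Python sketch (the test point, step size, and helper names are arbitrary choices, not from the slides):

```python
import numpy as np

f = lambda x: np.sin(x ** 2)                # the nested function sin(x^2)
df = lambda x: np.cos(x ** 2) * 2 * x       # its derivative from the chain rule

x, h = 1.3, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)   # central finite difference
print(numeric, df(x))                       # the two values agree closely
```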

SLIDE 14

Back propagation | Chain rule

The Chain Rule

Full-Grown Adult Chain Rule:

[Figure: computation graph in which inputs u and v feed r(u, v) and s(u, v); these feed x(r, s), y(r, s), and z(r, s), which in turn feed f(x, y, z)]

SLIDE 15

Back propagation | Chain rule

The Chain Rule

Full-Grown Adult Chain Rule (for the computation graph above): the derivative of f with respect to x is ∂f/∂x; similarly, ∂f/∂y and ∂f/∂z.

SLIDE 16

Back propagation | Chain rule

The Chain Rule

What is the derivative of f with respect to r?

SLIDE 17

Back propagation | Chain rule

The Chain Rule

What is the derivative of f with respect to r?

∂f/∂r = (∂f/∂x)(∂x/∂r) + (∂f/∂y)(∂y/∂r) + (∂f/∂z)(∂z/∂r)

SLIDE 18

Back propagation | Chain rule

The Chain Rule

What is the derivative of f with respect to s?

SLIDE 19

Back propagation | Chain rule

The Chain Rule

What is the derivative of f with respect to s?

∂f/∂s = (∂f/∂x)(∂x/∂s) + (∂f/∂y)(∂y/∂s) + (∂f/∂z)(∂z/∂s)

SLIDE 20

Back propagation | Chain rule

The Chain Rule

What is the derivative of f with respect to s?

Example: Let f = xyz, x = r, y = rs, and z = s. Find ∂f/∂s.

∂f/∂s = (∂f/∂x)(∂x/∂s) + (∂f/∂y)(∂y/∂s) + (∂f/∂z)(∂z/∂s)

SLIDE 21

Back propagation | Chain rule

The Chain Rule

What is the derivative of f with respect to s?

Example: Let f = xyz, x = r, y = rs, and z = s. Find ∂f/∂s.

∂f/∂s = yz · 0 + xz · r + xy · 1

SLIDE 22

Back propagation | Chain rule

The Chain Rule

What is the derivative of f with respect to s?

Example: Let f = xyz, x = r, y = rs, and z = s. Find ∂f/∂s.

∂f/∂s = rs² · 0 + rs · r + r²s · 1

SLIDE 23

Back propagation | Chain rule

The Chain Rule

What is the derivative of f with respect to s?

Example: Let f = xyz, x = r, y = rs, and z = s. Find ∂f/∂s.

∂f/∂s = 2r²s

SLIDE 24

Back propagation | Chain rule

The Chain Rule

What is the derivative of f with respect to s?

Example: Let f = xyz, x = r, y = rs, and z = s. Find ∂f/∂s.

Checking directly: f(r, s) = r · rs · s = r²s² ⇒ ∂f/∂s = 2r²s
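
To double-check the multivariable example symbolically, a small SymPy sketch (using SymPy here is an illustrative choice, not part of the lecture):

```python
import sympy as sp

r, s = sp.symbols('r s')
x, y, z = r, r * s, s          # x = r, y = rs, z = s
f = x * y * z                  # f = xyz = r^2 s^2

# Chain rule by hand: df/ds = yz * dx/ds + xz * dy/ds + xy * dz/ds
chain = y * z * 0 + x * z * r + x * y * 1

print(sp.simplify(chain))      # 2*r**2*s
print(sp.diff(f, s))           # 2*r**2*s, the same answer computed directly
```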

SLIDE 25

Back propagation | Chain rule

The Chain Rule

What is the derivative of f with respect to u?

SLIDE 26

Back propagation | Chain rule

The Chain Rule

What is the derivative of f with respect to u?

∂f/∂u = (∂f/∂r)(∂r/∂u) + (∂f/∂s)(∂s/∂u)

SLIDE 27

Back propagation | Chain rule

The Chain Rule

What is the derivative of f with respect to u?

Crux: if you know the derivative of the objective w.r.t. an intermediate value in the chain, you can eliminate everything in between.

SLIDE 28

Back propagation | Chain rule

The Chain Rule

What is the derivative of f with respect to u?

Crux: if you know the derivative of the objective w.r.t. an intermediate value in the chain, you can eliminate everything in between. This is the cornerstone of the Back Propagation algorithm.

SLIDE 29

Back propagation | Back propagation

Back Propagation

[Figure: feed-forward network with inputs x1, x2, . . . , xd and layer parameters (W^1, b^1), (W^2, b^2), (W^3, b^3), (W^4, b^4)]

SLIDE 30

Back propagation | Back propagation

Back Propagation

For the derivation, we’ll consider a simplified network:

a^0 → (W^1) → z^1 | a^1 → (W^2) → z^2 | a^2 → L(y, a^2)

We want to use back propagation to compute the partial derivatives of L w.r.t. the weights and biases, ∂L/∂w^l_{ij} (and ∂L/∂b^l_j), for l = 1, 2.

SLIDE 31

Back propagation | Back propagation

Back Propagation

For the derivation, we’ll consider a simplified network:

a^0 → (W^1) → z^1 | a^1 → (W^2) → z^2 | a^2 → L(y, a^2)

We need to choose an intermediate term that lives on the nodes and that we can easily take derivatives with respect to. We could choose the a’s, but we’ll choose the z’s because the math is easier.

SLIDE 32

Back propagation | Back propagation

Back Propagation

For the derivation, we’ll consider a simplified network:

a^0 → (W^1) → z^1 | a^1 → (W^2) → z^2 | a^2 → L(y, a^2)

Define the derivative w.r.t. the z’s by δ:

δ^l_j = ∂L/∂z^l_j

Note that δ^l has the same size as z^l and a^l.

SLIDE 33

Back propagation | Back propagation

Back Propagation

For the derivation, we’ll consider a simplified network:

a^0 → (W^1) → z^1 | a^1 → (W^2) → z^2 | a^2 → L(y, a^2)

Let’s compute δ^L for the output layer L:

δ^L_j = ∂L/∂z^L_j = (∂L/∂a^L_j) (da^L_j/dz^L_j)

SLIDE 34

Back propagation | Back propagation

Back Propagation

δ^L_j = ∂L/∂z^L_j = (∂L/∂a^L_j) (da^L_j/dz^L_j)

We know that a^L_j = g(z^L_j), so da^L_j/dz^L_j = g′(z^L_j). Therefore

δ^L_j = (∂L/∂a^L_j) g′(z^L_j)

Note: the first term is the jth entry of the gradient of L w.r.t. the a^L’s, (∇_{a^L} L)_j.

SLIDE 35

Back propagation | Back propagation

Back Propagation

δ^L_j = (∇_{a^L} L)_j g′(z^L_j)

We can combine all of these into a vector operation:

δ^L = ∇_{a^L} L ⊙ g′(z^L)

where g′(z^L) is the derivative of the activation function applied elementwise to z^L, and the symbol ⊙ indicates element-wise multiplication of vectors.

SLIDE 36

Back propagation | Back propagation

Back Propagation

δ^L_j = (∇_{a^L} L)_j g′(z^L_j)

We can combine all of these into a vector operation:

δ^L = ∇_{a^L} L ⊙ g′(z^L)

where g′(z^L) is the derivative of the activation function applied elementwise to z^L, and the symbol ⊙ indicates element-wise multiplication of vectors.

Notice that computing δ^L requires knowing the activations. This means that before we can compute derivatives for SGD through back propagation, we first run forward propagation through the network.

SLIDE 37

Back propagation | Back propagation

Back Propagation

Example: Suppose we’re in a regression setting and choose a sigmoid activation function:

L = (1/2) Σ_j (y_j − a^L_j)²,  with g(z) = σ(z), the sigmoid.

Then

∂L/∂a^L_j = (a^L_j − y_j),  da^L_j/dz^L_j = σ′(z^L_j) = σ(z^L_j)(1 − σ(z^L_j))

So δ^L = (a^L − y) ⊙ σ(z^L) ⊙ (1 − σ(z^L))
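
A small NumPy sketch of δ^L for this squared-loss/sigmoid example (the toy values and array names are placeholders for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy output layer with 3 units (values chosen arbitrarily)
z_L = np.array([0.2, -1.0, 0.5])    # pre-activations z^L
a_L = sigmoid(z_L)                   # activations a^L = sigma(z^L)
y   = np.array([0.0, 1.0, 1.0])     # targets

# delta^L = (a^L - y) ⊙ sigma(z^L) ⊙ (1 - sigma(z^L))
delta_L = (a_L - y) * sigmoid(z_L) * (1 - sigmoid(z_L))
print(delta_L)
```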

SLIDE 38

Back propagation | Back propagation

Back Propagation

OK, great! Now we can (easily-ish) compute the δ’s for the output layer. But really we’re after the partials w.r.t. the weights and biases.

a^0 → (W^1) → z^1 | a^1 → (W^2) → z^2 | a^2 → L(y, a^2)

SLIDE 39

Back propagation | Back propagation

Back Propagation

OK, great! Now we can (easily-ish) compute the δ’s for the output layer. But really we’re after the partials w.r.t. the weights and biases.

a^0 → (W^1) → z^1 | a^1 → (W^2) → z^2 | a^2 → L(y, a^2)

Question: What do you notice?

SLIDE 40

Back propagation | Back propagation

Back Propagation

We want to find the derivative of L w.r.t. the weights and biases.

a^0 → (W^1) → z^1 | a^1 → (W^2) → z^2 | a^2 → L(y, a^2)

Every weight connected to a node in layer L depends on a single δ^L_j.

SLIDE 41

Back propagation | Back propagation

Back Propagation

a^0 → (W^1) → z^1 | a^1 → (W^2) → z^2 | a^2 → L(y, a^2)

So we have

∂L/∂w^L_{jk} = (∂L/∂z^L_j) (∂z^L_j/∂w^L_{jk}) = δ^L_j (∂z^L_j/∂w^L_{jk})

We need to compute ∂z^L_j/∂w^L_{jk}. Recall z^L = W^L a^{L−1} + b^L; taking the jth entry of this vector,

z^L_j = Σ_i w^L_{ji} a^{L−1}_i + b^L_j

SLIDE 42

Back propagation | Back propagation

Back Propagation

a^0 → (W^1) → z^1 | a^1 → (W^2) → z^2 | a^2 → L(y, a^2)

So we have

∂L/∂w^L_{jk} = (∂L/∂z^L_j) (∂z^L_j/∂w^L_{jk}) = δ^L_j (∂z^L_j/∂w^L_{jk})

Taking the derivative of z^L_j = Σ_i w^L_{ji} a^{L−1}_i + b^L_j w.r.t. w^L_{jk} gives

∂z^L_j/∂w^L_{jk} = a^{L−1}_k  ⇒  ∂L/∂w^L_{jk} = a^{L−1}_k δ^L_j

SLIDE 43

Back propagation | Back propagation

Back Propagation

a^0 → (W^1) → z^1 | a^1 → (W^2) → z^2 | a^2 → L(y, a^2)

So we have

∂L/∂w^L_{jk} = a^{L−1}_k δ^L_j

This is an easy expression for the derivative w.r.t. every weight leading into layer L.

SLIDE 44

Back propagation | Back propagation

Back Propagation

Let’s make the notation a little more practical.

W^2 = [ w^2_11  w^2_12  w^2_13 ]
      [ w^2_21  w^2_22  w^2_23 ]

SLIDE 45

Back propagation | Back propagation

Back Propagation

Let’s make the notation a little more practical.

W^2 = [ w^2_11  w^2_12  w^2_13 ]
      [ w^2_21  w^2_22  w^2_23 ]

∂L/∂W^2 = [ ∂L/∂w^2_11  ∂L/∂w^2_12  ∂L/∂w^2_13 ]
          [ ∂L/∂w^2_21  ∂L/∂w^2_22  ∂L/∂w^2_23 ]

        = [ δ^2_1 a^1_1  δ^2_1 a^1_2  δ^2_1 a^1_3 ]
          [ δ^2_2 a^1_1  δ^2_2 a^1_2  δ^2_2 a^1_3 ]

Now we can write this as an outer product of δ^2 and a^1:

∂L/∂W^2 = δ^2 (a^1)^T

(Exercise for yourself: ∂L/∂b^2.)
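
A short NumPy sketch of this outer-product form, reusing the 2×3 shape of the example (the numeric values are placeholders):

```python
import numpy as np

delta2 = np.array([0.1, -0.3])          # delta^2, one entry per unit in layer 2
a1     = np.array([0.5, 0.2, 0.9])      # activations a^1 from the previous layer

dW2 = np.outer(delta2, a1)              # dL/dW^2 = delta^2 (a^1)^T, shape (2, 3)
db2 = delta2                            # dL/db^2 = delta^2 (the exercise above)
print(dW2)
```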

SLIDE 46

Back propagation | Back propagation

Intermediate summary

For a given training example x, perform forward propagation to get z^l and a^l on each layer. Then, to get the partial derivatives for the output layer (W^2 here, or W^L in general):

  • 1. Compute δ^L = ∇_{a^L} L ⊙ g′(z^L)
  • 2. Compute ∂L/∂W^L = δ^L (a^{L−1})^T and ∂L/∂b^L = δ^L

OK, that wasn’t so bad! We found very simple expressions for the derivatives with respect to the weights in the last hidden layer!

SLIDE 47

Back propagation | Back propagation

Intermediate summary

For a given training example x, perform forward propagation to get z^l and a^l on each layer. Then, to get the partial derivatives for the output layer (W^2 here, or W^L in general):

  • 1. Compute δ^L = ∇_{a^L} L ⊙ g′(z^L)
  • 2. Compute ∂L/∂W^L = δ^L (a^{L−1})^T and ∂L/∂b^L = δ^L

OK, that wasn’t so bad! We found very simple expressions for the derivatives with respect to the weights in the last hidden layer!

Problem: How do we do the other layers?

SLIDE 48

Back propagation | Back propagation

Intermediate summary

For a given training example x, perform forward propagation to get z^l and a^l on each layer. Then, to get the partial derivatives for the output layer (W^2 here, or W^L in general):

  • 1. Compute δ^L = ∇_{a^L} L ⊙ g′(z^L)
  • 2. Compute ∂L/∂W^L = δ^L (a^{L−1})^T and ∂L/∂b^L = δ^L

OK, that wasn’t so bad! We found very simple expressions for the derivatives with respect to the weights in the last hidden layer!

Problem: How do we do the other layers? Since the formulas were so nice once we knew the adjacent δ^l, it sure would be nice if we could easily compute the δ^l’s on earlier layers.

SLIDE 49

Back propagation | Back propagation

Back Propagation

But the relationship between L and z^1 is really complicated because of multiple passes through the activation functions.

SLIDE 50

Back propagation | Back propagation

Back Propagation

But the relationship between L and z^1 is really complicated because of multiple passes through the activation functions. It is OK! Back propagation comes to the rescue!

SLIDE 51

Back propagation | Back propagation

Back Propagation

But the relationship between L and z^1 is really complicated because of multiple passes through the activation functions. It is OK! Back propagation comes to the rescue! Notice that δ^1 depends on δ^2.

a^0 → (W^1) → z^1 | a^1 → (W^2) → z^2 | a^2 → L(y, a^2)

SLIDE 52

Back propagation | Back propagation

Back Propagation

Notice that δ^1 depends on δ^2.

a^0 → (W^1) → z^1 | a^1 → (W^2) → z^2 | a^2 → L(y, a^2)

By the adult chain rule,

∂L/∂z^{l−1}_k = Σ_j (∂L/∂z^l_j) (∂z^l_j/∂z^{l−1}_k)

SLIDE 53

Back propagation | Back propagation

Back Propagation

Notice that δ^1 depends on δ^2.

a^0 → (W^1) → z^1 | a^1 → (W^2) → z^2 | a^2 → L(y, a^2)

By the adult chain rule,

δ^{l−1}_k = ∂L/∂z^{l−1}_k = Σ_j (∂L/∂z^l_j) (∂z^l_j/∂z^{l−1}_k) = Σ_j δ^l_j (∂z^l_j/∂z^{l−1}_k)

SLIDE 54

Back propagation | Back propagation

Back Propagation

Notice that δ^1 depends on δ^2.

a^0 → (W^1) → z^1 | a^1 → (W^2) → z^2 | a^2 → L(y, a^2)

By the adult chain rule, for example,

δ^1_2 = ∂L/∂z^1_2 = δ^2_1 (∂z^2_1/∂z^1_2) + δ^2_2 (∂z^2_2/∂z^1_2)

SLIDE 55

Back propagation | Back propagation

Back Propagation

δ^1_2 = ∂L/∂z^1_2 = δ^2_1 (∂z^2_1/∂z^1_2) + δ^2_2 (∂z^2_2/∂z^1_2)

SLIDE 56

Back propagation | Back propagation

Back Propagation

δ^1_2 = ∂L/∂z^1_2 = δ^2_1 (∂z^2_1/∂z^1_2) + δ^2_2 (∂z^2_2/∂z^1_2)

Recall that z^2 = W^2 a^1 + b^2, so it follows that

z^2_i = w^2_i1 a^1_1 + w^2_i2 a^1_2 + w^2_i3 a^1_3 + b^2_i

Since a^1_2 = g(z^1_2), taking the derivative gives ∂z^2_i/∂z^1_2 = w^2_i2 g′(z^1_2). Plugging in gives

δ^1_2 = ∂L/∂z^1_2 = δ^2_1 w^2_12 g′(z^1_2) + δ^2_2 w^2_22 g′(z^1_2)

SLIDE 57

Back propagation | Back propagation

Back Propagation

If we do this for each of the 3 δ^1_i’s, something nice happens (exercise: work out δ^1_1 and δ^1_3 for yourself):

δ^1_1 = δ^2_1 w^2_11 g′(z^1_1) + δ^2_2 w^2_21 g′(z^1_1)
δ^1_2 = δ^2_1 w^2_12 g′(z^1_2) + δ^2_2 w^2_22 g′(z^1_2)
δ^1_3 = δ^2_1 w^2_13 g′(z^1_3) + δ^2_2 w^2_23 g′(z^1_3)

SLIDE 58

Back propagation | Back propagation

Back Propagation

If we do this for each of the 3 δ^1_i’s, something nice happens (exercise: work out δ^1_1 and δ^1_3 for yourself):

δ^1_1 = δ^2_1 w^2_11 g′(z^1_1) + δ^2_2 w^2_21 g′(z^1_1)
δ^1_2 = δ^2_1 w^2_12 g′(z^1_2) + δ^2_2 w^2_22 g′(z^1_2)
δ^1_3 = δ^2_1 w^2_13 g′(z^1_3) + δ^2_2 w^2_23 g′(z^1_3)

Notice that each row of the system gets multiplied by g′(z^1_i), so let’s factor those out.

SLIDE 59

Back propagation | Back propagation

Back Propagation

If we do this for each of the 3 δ^1_i’s, something nice happens:

δ^1_1 = (δ^2_1 w^2_11 + δ^2_2 w^2_21) · g′(z^1_1)
δ^1_2 = (δ^2_1 w^2_12 + δ^2_2 w^2_22) · g′(z^1_2)
δ^1_3 = (δ^2_1 w^2_13 + δ^2_2 w^2_23) · g′(z^1_3)

SLIDE 60

Back propagation | Back propagation

Back Propagation

If we do this for each of the 3 δ^1_i’s, something nice happens:

δ^1_1 = (δ^2_1 w^2_11 + δ^2_2 w^2_21) · g′(z^1_1)
δ^1_2 = (δ^2_1 w^2_12 + δ^2_2 w^2_22) · g′(z^1_2)
δ^1_3 = (δ^2_1 w^2_13 + δ^2_2 w^2_23) · g′(z^1_3)

Remember

δ^2 = [ δ^2_1 ]          W^2 = [ w^2_11  w^2_12  w^2_13 ]
      [ δ^2_2 ]                [ w^2_21  w^2_22  w^2_23 ]

Do you see δ^2 and W^2 lurking anywhere in the above system?

SLIDE 61

Back propagation | Back propagation

Back Propagation

If we do this for each of the 3 δ^1_i’s, something nice happens:

δ^1_1 = (δ^2_1 w^2_11 + δ^2_2 w^2_21) · g′(z^1_1)
δ^1_2 = (δ^2_1 w^2_12 + δ^2_2 w^2_22) · g′(z^1_2)
δ^1_3 = (δ^2_1 w^2_13 + δ^2_2 w^2_23) · g′(z^1_3)

Does this help?

(W^2)^T = [ w^2_11  w^2_21 ]          δ^2 = [ δ^2_1 ]
          [ w^2_12  w^2_22 ]                [ δ^2_2 ]
          [ w^2_13  w^2_23 ]

SLIDE 62

Back propagation | Back propagation

Back Propagation

If we do this for each of the 3 δ^1_i’s, something nice happens:

δ^1_1 = (δ^2_1 w^2_11 + δ^2_2 w^2_21) · g′(z^1_1)
δ^1_2 = (δ^2_1 w^2_12 + δ^2_2 w^2_22) · g′(z^1_2)
δ^1_3 = (δ^2_1 w^2_13 + δ^2_2 w^2_23) · g′(z^1_3)

In vector form:

δ^1 = (W^2)^T δ^2 ⊙ g′(z^1)
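
A minimal NumPy sketch of this vector form, reusing the 2×3 shapes from the example (the sigmoid choice for g and the numeric values are assumptions for illustration):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
g_prime = lambda z: sigmoid(z) * (1 - sigmoid(z))   # g'(z) for a sigmoid activation

W2     = np.array([[0.1, -0.2, 0.4],
                   [0.3,  0.5, -0.1]])              # W^2, shape (2, 3)
delta2 = np.array([0.1, -0.3])                      # delta^2, shape (2,)
z1     = np.array([0.2, -1.0, 0.5])                 # pre-activations z^1, shape (3,)

delta1 = (W2.T @ delta2) * g_prime(z1)              # delta^1 = (W^2)^T delta^2 ⊙ g'(z^1)
print(delta1)                                       # shape (3,), one entry per unit in layer 1
```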

SLIDE 63

Back propagation | Back propagation

Back Propagation

OK, great! We can easily compute δ^1 from δ^2. Then we can compute derivatives of L w.r.t. the weights W^1 and biases b^1 exactly the way we did for W^2 and b^2:

  • 1. Compute δ^1 = (W^2)^T δ^2 ⊙ g′(z^1)
  • 2. Compute ∂L/∂W^1 = δ^1 (a^0)^T and ∂L/∂b^1 = δ^1

SLIDE 64

Back propagation | Back propagation

Back Propagation

OK, great! We can easily compute δ^1 from δ^2. Then we can compute derivatives of L w.r.t. the weights W^1 and biases b^1 exactly the way we did for W^2 and b^2:

  • 1. Compute δ^1 = (W^2)^T δ^2 ⊙ g′(z^1)
  • 2. Compute ∂L/∂W^1 = δ^1 (a^0)^T and ∂L/∂b^1 = δ^1

We’ve worked this out for a simple network with one hidden layer. Nothing we’ve done assumed anything about the number of layers, so we can apply the same procedure recursively with any number of layers.

SLIDE 65

Back propagation | Full algorithm

Back Propagation

δ^L = ∇_{a^L} L ⊙ g′(z^L)                 # compute δ's on the output layer
for ℓ = L, . . . , 1:
    ∂L/∂W^ℓ = δ^ℓ (a^{ℓ−1})^T             # compute weight derivatives
    ∂L/∂b^ℓ = δ^ℓ                          # compute bias derivatives
    δ^{ℓ−1} = (W^ℓ)^T δ^ℓ ⊙ g′(z^{ℓ−1})   # back-prop δ's to the previous layer

(After this, we are ready to do an SGD update on the weights and biases.)
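
Putting the pieces together, here is a minimal NumPy sketch of one forward plus backward pass (the sigmoid activation and squared loss follow the earlier regression example; the function and variable names are my own choices, not part of the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backprop(x, y, weights, biases):
    """One forward + backward pass; returns dL/dW^l and dL/db^l for every layer."""
    # Forward propagation: store a^0..a^L and z^1..z^L
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)

    # delta^L for the squared loss with sigmoid output (the slide's regression example)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])

    grads_W = [None] * len(weights)
    grads_b = [None] * len(biases)
    for l in range(len(weights) - 1, -1, -1):          # l = L-1, ..., 0 (0-indexed layers)
        grads_W[l] = np.outer(delta, activations[l])   # dL/dW^l = delta^l (a^{l-1})^T
        grads_b[l] = delta                             # dL/db^l = delta^l
        if l > 0:
            delta = (weights[l].T @ delta) * sigmoid_prime(zs[l - 1])  # back-prop delta
    return grads_W, grads_b
```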

SLIDE 66

Back propagation | Full algorithm

Training a Feed-Forward Neural Network

Given an initial guess for the weights and biases, loop over each training example in random order (see the sketch after this list):

  • 1. Forward Propagate to get activations on each layer
  • 2. Back Propagate to get derivatives
  • 3. Update weights and biases via Stochastic Gradient Descent
  • 4. Rinse and Repeat
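
A minimal training-loop sketch in the same NumPy style (it assumes the hypothetical backprop helper from the previous sketch; the layer sizes, learning rate, epoch count, and toy data are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network: 4 inputs -> 3 hidden -> 1 output (sizes chosen only for illustration)
sizes = [4, 3, 1]
weights = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases  = [np.zeros(m) for m in sizes[1:]]

# Toy training set S_train = {(x, y)}
S_train = [(rng.normal(size=4), rng.uniform(size=1)) for _ in range(50)]

eta, epochs = 0.1, 100
for _ in range(epochs):
    rng.shuffle(S_train)                                 # random order over training examples
    for x, y in S_train:
        # 1. forward propagate and 2. back propagate (backprop is defined in the sketch above)
        grads_W, grads_b = backprop(x, y, weights, biases)
        # 3. stochastic gradient descent update: W^l <- W^l - eta * dL/dW^l
        for l in range(len(weights)):
            weights[l] -= eta * grads_W[l]
            biases[l]  -= eta * grads_b[l]
```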

SLIDE 67

Back propagation | Full algorithm

Training a Feed-Forward Neural Network

Remaining Questions:

  • 1. Can I batch this?
  • 2. When do we stop?
  • 3. How do we initialize weights and biases?
