

SLIDE 1

Applied Machine Learning

Gradient Computation & Automatic Differentiation

Siamak Ravanbakhsh

COMP 551 (Winter 2020)

SLIDE 2

Learning objectives

using the chain rule to calculate the gradients
automatic differentiation
  forward mode
  reverse mode (backpropagation)

SLIDES 3-10

Landscape of the cost function

two layer MLP model:

f(x; W, V) = g(W h(V x))

the loss function depends on the task; the objective is

min_{W,V} ∑_n L(y^(n), f(x^(n); W, V))

this is a non-convex optimization problem with many critical points (points where the gradient is zero)

saddle points are not stable, and SGD can escape them

there are exponentially many global optima: given one global optimum we can
  permute the hidden units in each layer
  for symmetric activations: negate the input/output of a unit
  for rectifiers: rescale the input/output of a unit

general beliefs, supported by empirical and theoretical results in special settings:
  there are many more saddle points than local minima
  the number of local minima increases for lower costs, therefore most local optima are close to global optima

strategy: use gradient descent methods (covered earlier in the course)

[figure: visualization of a neural network cost landscape; image credit: https://www.offconvex.org]
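The weight-space symmetries above can be checked in a few lines. A minimal numpy sketch, assuming a tanh hidden activation h and identity output g (both hypothetical choices for illustration): permuting the hidden units, or negating a unit's input and output under the symmetric tanh, leaves f(x; W, V) unchanged, which is why there are exponentially many equivalent global optima.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, C = 4, 5, 3
V = rng.normal(size=(M, D))   # input-to-hidden weights
W = rng.normal(size=(C, M))   # hidden-to-output weights
x = rng.normal(size=D)

f = W @ np.tanh(V @ x)        # f(x; W, V) = g(W h(V x)) with g = identity

perm = rng.permutation(M)     # permute the hidden units
f_perm = W[:, perm] @ np.tanh(V[perm] @ x)
assert np.allclose(f, f_perm)

# tanh is odd, so negating a unit's input and output also preserves f
s = np.where(rng.random(M) < 0.5, -1.0, 1.0)
f_neg = (W * s) @ np.tanh((V * s[:, None]) @ x)
assert np.allclose(f, f_neg)
```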

SLIDES 11-15

Jacobian matrix

for f : R → R we have the derivative (d/dw) f(w) ∈ R

for f : R^D → R the gradient is the vector of all partial derivatives

∇_w f(w) = [∂f(w)/∂w_1, …, ∂f(w)/∂w_D]^⊤ ∈ R^D

for f : R^D → R^M the Jacobian matrix collects all partial derivatives

J = [ ∂f_1(w)/∂w_1  …  ∂f_1(w)/∂w_D ]
    [       ⋮       ⋱        ⋮      ]
    [ ∂f_M(w)/∂w_1  …  ∂f_M(w)/∂w_D ]   ∈ R^{M×D}

the first row of J is ∇_w f_1(w)^⊤ and the first column is ∂f(w)/∂w_1
(note that we also use J to denote the cost function)

for all three cases we may simply write ∂f(w)/∂w, where M and D will be clear from the context

what if W is a matrix? we assume it is reshaped into a vector for these calculations
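A minimal numpy sketch of the M×D layout, using a hypothetical map f : R² → R³ chosen for illustration: each row of the Jacobian is the gradient of one output, and the analytic matrix can be checked against central finite differences.

```python
import numpy as np

def f(w):  # R^2 -> R^3
    return np.array([w[0] * w[1], np.sin(w[0]), w[1] ** 2])

def jacobian_analytic(w):  # 3 x 2, row m is the gradient of f_m
    return np.array([[w[1], w[0]],
                     [np.cos(w[0]), 0.0],
                     [0.0, 2.0 * w[1]]])

def jacobian_fd(f, w, eps=1e-6):
    D = w.size
    cols = []
    for d in range(D):
        e = np.zeros(D); e[d] = eps
        cols.append((f(w + e) - f(w - e)) / (2 * eps))  # column d: df/dw_d
    return np.stack(cols, axis=1)  # shape M x D

w = np.array([0.3, -1.2])
assert np.allclose(jacobian_analytic(w), jacobian_fd(f, w), atol=1e-5)
```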

SLIDES 16-20

Chain rule

for f : x ↦ z and h : z ↦ y, where x, y, z ∈ R:

dy/dx = (dy/dz)(dz/dx)

the speed of change in y as we change x is the speed of change in y as we change z, times the speed of change in z as we change x

more generally, for x ∈ R^D, z ∈ R^M, y ∈ R^C:

∂y_c/∂x_d = ∑_{m=1}^{M} (∂y_c/∂z_m)(∂z_m/∂x_d)

we are looking at all the "paths" through which a change in x_d changes y_c, and adding up their contributions

in matrix form:

∂y/∂x = (∂y/∂z)(∂z/∂x)

where ∂y/∂x is the C×D Jacobian, ∂y/∂z the C×M Jacobian, and ∂z/∂x the M×D Jacobian

Winter 2020 | Applied Machine Learning (COMP551)
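The matrix form of the chain rule can be sketched directly. A minimal numpy example with hypothetical maps f : R² → R³ and h : R³ → R², chosen for illustration: the product of the C×M and M×D Jacobians matches a finite-difference estimate of the composed map.

```python
import numpy as np

def f(x):            # R^2 -> R^3,  z = f(x)
    return np.array([x[0] + x[1], x[0] * x[1], np.sin(x[1])])

def Jf(x):           # 3 x 2 Jacobian of f
    return np.array([[1.0, 1.0],
                     [x[1], x[0]],
                     [0.0, np.cos(x[1])]])

def h(z):            # R^3 -> R^2,  y = h(z)
    return np.array([z[0] * z[2], z[1] + z[2]])

def Jh(z):           # 2 x 3 Jacobian of h
    return np.array([[z[2], 0.0, z[0]],
                     [0.0, 1.0, 1.0]])

x = np.array([0.5, -0.7])
J_chain = Jh(f(x)) @ Jf(x)   # (2 x 3) @ (3 x 2) -> 2 x 2, by the chain rule

# finite-difference check of d(h o f)/dx
eps = 1e-6
J_fd = np.stack([(h(f(x + eps * e)) - h(f(x - eps * e))) / (2 * eps)
                 for e in np.eye(2)], axis=1)
assert np.allclose(J_chain, J_fd, atol=1e-5)
```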

SLIDES 21-25

Training a two layer network

suppose we have
  D inputs x_1, …, x_D
  M hidden units z_1, …, z_M
  C outputs ŷ_1, …, ŷ_C

for simplicity we drop the bias terms

[figure: network diagram with inputs x_1, …, x_D, hidden units z_1, …, z_M, and outputs ŷ_1, …, ŷ_C; V is the input-to-hidden weight matrix and W the hidden-to-output weight matrix]

model:

ŷ = g(W h(V x))

cost function we want to minimize:

J(W, V) = ∑_n L(y^(n), g(W h(V x^(n))))

we need the gradients wrt W and V: ∂J/∂W and ∂J/∂V

it is simpler to write these for one instance (n), so we will calculate ∂L/∂W and ∂L/∂V and recover

∂J/∂W = ∑_{n=1}^{N} ∂L(y^(n), ŷ^(n))/∂W        ∂J/∂V = ∑_{n=1}^{N} ∂L(y^(n), ŷ^(n))/∂V
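The model and the summed cost can be sketched in a few lines. A minimal numpy sketch, assuming a tanh hidden activation h, identity output g, a squared-error loss, and random data in place of a real dataset (all hypothetical choices; the slides leave the loss task-dependent):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, C, N = 4, 8, 3, 10
V = rng.normal(size=(M, D))        # input-to-hidden weights
W = rng.normal(size=(C, M))        # hidden-to-output weights
X = rng.normal(size=(N, D))        # N instances x^(n)
Y = rng.normal(size=(N, C))        # N targets y^(n)

def forward(x, W, V):              # one instance: yhat = g(W h(V x))
    return W @ np.tanh(V @ x)      # h = tanh, g = identity

def J(W, V):                       # cost summed over instances
    return sum(0.5 * np.sum((y - forward(x, W, V)) ** 2) for x, y in zip(X, Y))

# the per-instance sum agrees with a single vectorized forward pass
assert np.isclose(J(W, V), 0.5 * np.sum((Y - np.tanh(X @ V.T) @ W.T) ** 2))
```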

SLIDES 26-29

Gradient calculation

pre-activations:

q_m = ∑_{d=1}^{D} V_{m,d} x_d        z_m = h(q_m)
u_c = ∑_{m=1}^{M} W_{c,m} z_m        ŷ_c = g(u_c)        loss L(y, ŷ)

using the chain rule:

∂L/∂W_{c,m} = (∂L/∂ŷ_c) (∂ŷ_c/∂u_c) (∂u_c/∂W_{c,m})

where ∂L/∂ŷ_c depends on the loss function, ∂ŷ_c/∂u_c depends on the activation function, and ∂u_c/∂W_{c,m} = z_m

similarly for V:

∂L/∂V_{m,d} = ∑_c (∂L/∂ŷ_c) (∂ŷ_c/∂u_c) (∂u_c/∂z_m) (∂z_m/∂q_m) (∂q_m/∂V_{m,d})

where ∂u_c/∂z_m = W_{c,m}, ∂z_m/∂q_m depends on the middle layer activation, and ∂q_m/∂V_{m,d} = x_d

SLIDES 30-33

Gradient calculation: regression

using the chain rule:

∂L/∂W_{c,m} = (∂L/∂ŷ_c) (∂ŷ_c/∂u_c) z_m

for regression the output activation is the identity and the loss is squared error:

ŷ = g(u) = u = Wz        L(y, ŷ) = ½ ||y − ŷ||²

substituting:

L(y, z) = ½ ||y − Wz||²

taking the derivative (we have seen this in the linear regression lecture):

∂L/∂W_{c,m} = (ŷ_c − y_c) z_m
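A minimal numpy sketch of the regression result, on hypothetical random values: the analytic gradient (ŷ − y) z^⊤ matches a finite-difference estimate of the squared-error loss.

```python
import numpy as np

rng = np.random.default_rng(1)
C, M = 3, 4
W = rng.normal(size=(C, M))
z = rng.normal(size=M)
y = rng.normal(size=C)

def loss(W):  # L(y, z) = 0.5 ||y - Wz||^2
    return 0.5 * np.sum((y - W @ z) ** 2)

grad_analytic = np.outer(W @ z - y, z)   # entry (c, m) is (yhat_c - y_c) z_m

eps = 1e-6
grad_fd = np.zeros_like(W)
for c in range(C):
    for m in range(M):
        E = np.zeros_like(W); E[c, m] = eps
        grad_fd[c, m] = (loss(W + E) - loss(W - E)) / (2 * eps)

assert np.allclose(grad_analytic, grad_fd, atol=1e-4)
```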

SLIDES 34-36

Gradient calculation: binary classification

for binary classification we have a scalar output (C = 1), a logistic output activation, and the cross-entropy loss:

ŷ = g(u) = (1 + e^{−u})^{−1}        L(y, ŷ) = −( y log ŷ + (1 − y) log(1 − ŷ) )

substituting and simplifying (see the logistic regression lecture):

L(y, u) = y log(1 + e^{−u}) + (1 − y) log(1 + e^{u})

substituting u = ∑_m W_m z_m in L and taking the derivative:

∂L/∂W_m = (ŷ − y) z_m

SLIDES 37-39

Gradient calculation: multiclass classification

for multiclass classification (C is the number of classes) the output activation is the softmax and the loss is cross-entropy:

ŷ = g(u) = softmax(u)        L(y, ŷ) = −∑_k y_k log ŷ_k

substituting and simplifying (see the logistic regression lecture):

L(y, u) = −y^⊤ u + log ∑_c e^{u_c}

substituting u_c = ∑_m W_{c,m} z_m in L and taking the derivative:

∂L/∂W_{c,m} = (ŷ_c − y_c) z_m

SLIDES 40-45

Gradient calculation: gradient wrt V

∂L/∂V_{m,d} = ∑_c (∂L/∂ŷ_c) (∂ŷ_c/∂u_c) (∂u_c/∂z_m) (∂z_m/∂q_m) (∂q_m/∂V_{m,d})

we already calculated the first two factors; ∂u_c/∂z_m = W_{c,m} and ∂q_m/∂V_{m,d} = x_d

∂z_m/∂q_m depends on the middle layer activation:
  logistic function: σ(q_m)(1 − σ(q_m))
  hyperbolic tangent: 1 − tanh(q_m)²
  ReLU: 0 if q_m ≤ 0, 1 if q_m > 0

example (logistic sigmoid hidden units, softmax output):

∂J/∂V_{m,d} = ∑_n ∑_c (ŷ_c^(n) − y_c^(n)) W_{c,m} σ(q_m^(n)) (1 − σ(q_m^(n))) x_d^(n)
            = ∑_n ∑_c (ŷ_c^(n) − y_c^(n)) W_{c,m} z_m^(n) (1 − z_m^(n)) x_d^(n)

for biases we simply assume the corresponding input is 1: x^(n) = 1
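The three activation derivatives listed above can be verified numerically. A minimal numpy sketch using central finite differences (the point q = 0 is skipped, since ReLU is not differentiable there):

```python
import numpy as np

def sigma(q):  # logistic function
    return 1.0 / (1.0 + np.exp(-q))

q = np.linspace(-3, 3, 13)
q = q[np.abs(q) > 1e-3]          # avoid the ReLU kink at 0
eps = 1e-6

def fd(f):                        # central finite difference of f at q
    return (f(q + eps) - f(q - eps)) / (2 * eps)

assert np.allclose(fd(sigma), sigma(q) * (1 - sigma(q)), atol=1e-5)
assert np.allclose(fd(np.tanh), 1 - np.tanh(q) ** 2, atol=1e-5)
assert np.allclose(fd(lambda t: np.maximum(t, 0)), (q > 0).astype(float), atol=1e-5)
```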

SLIDES 46-47

Gradient calculation: a common pattern

∂L/∂W_{c,m} = (∂L/∂ŷ_c) (∂ŷ_c/∂u_c) z_m = (∂L/∂u_c) z_m

∂L/∂V_{m,d} = ∑_c (∂L/∂ŷ_c) (∂ŷ_c/∂u_c) (∂u_c/∂z_m) (∂z_m/∂q_m) x_d = (∂L/∂q_m) x_d

in each case the gradient of a weight is the "error from above" (∂L/∂u_c or ∂L/∂q_m) times the "input from below" (z_m or x_d)

SLIDES 48-54

Example: classification

Iris dataset (D = 2 features + 1 bias), M = 16 hidden units, C = 3 classes

model: q_m = ∑_{d=1}^{D} V_{m,d} x_d,  z_m = σ(q_m),  u_c = ∑_{m=1}^{M} W_{c,m} z_m,  ŷ = softmax(u)

cost is softmax cross-entropy (the code below averages over N rather than summing):

J = ∑_{n=1}^{N} ( −y^{(n)⊤} u^{(n)} + log ∑_c e^{u_c^{(n)}} )

def cost(X,  # N x D
         Y,  # N x C
         W,  # M x C
         V,  # D x M
         ):
    Q = np.dot(X, V)   # N x M
    Z = logistic(Q)    # N x M
    U = np.dot(Z, W)   # N x C
    Yh = softmax(U)    # predictions (not needed for the cost itself)
    nll = -np.mean(np.sum(U * Y, 1) - logsumexp(U))
    return nll

helper functions:

def logsumexp(Z  # N x C
              ):
    Zmax = np.max(Z, axis=1)[:, None]                               # N x 1
    lse = Zmax + np.log(np.sum(np.exp(Z - Zmax), axis=1))[:, None]  # N x 1
    return lse[:, 0]  # N

def softmax(u  # N x C
            ):
    u_exp = np.exp(u - np.max(u, 1)[:, None])  # subtract the max for stability
    return u_exp / np.sum(u_exp, axis=-1)[:, None]

slide-55
SLIDE 55

Iris dataset (D=2 features + 1 bias) M = 16 hidden units C=3 classes

xd z =

m

σ(q )

m

= y ^ softmax(u) q =

m

V x ∑d=1

D m,d d

u =

c

W z ∑m=1

M c,m m

6 . 2

L(y, ) y ^

Example: Example: classification classification

slide-56
SLIDE 56

def gradients(X,#N x D Y,#N x K W,#M x K V,#D x M ): 1 2 3 4 5 Z = logistic(np.dot(X, V))#N x M 6 N,D = X.shape 7 Yh = softmax(np.dot(Z, W))#N x K 8 dY = Yh - Y #N x K 9 dW= np.dot(Z.T, dY)/N #M x K 10 dZ = np.dot(dY, W.T) #N x M 11 dV = np.dot(X.T, dZ * Z * (1 - Z))/N #D x M 12 return dW, dV 13

L =

∂Wm ∂

( − y ^ y)zm L =

∂Vm,d ∂

( − y ^ y)W z (1 −

m m

z )x

m d

Iris dataset (D=2 features + 1 bias) M = 16 hidden units C=3 classes

xd z =

m

σ(q )

m

= y ^ softmax(u) q =

m

V x ∑d=1

D m,d d

u =

c

W z ∑m=1

M c,m m

6 . 2

L(y, ) y ^

Example: Example: classification classification

slide-57
SLIDE 57

def gradients(X,#N x D Y,#N x K W,#M x K V,#D x M ): 1 2 3 4 5 Z = logistic(np.dot(X, V))#N x M 6 N,D = X.shape 7 Yh = softmax(np.dot(Z, W))#N x K 8 dY = Yh - Y #N x K 9 dW= np.dot(Z.T, dY)/N #M x K 10 dZ = np.dot(dY, W.T) #N x M 11 dV = np.dot(X.T, dZ * Z * (1 - Z))/N #D x M 12 return dW, dV 13 Z = logistic(np.dot(X, V))#N x M def gradients(X,#N x D 1 Y,#N x K 2 W,#M x K 3 V,#D x M 4 ): 5 6 N,D = X.shape 7 Yh = softmax(np.dot(Z, W))#N x K 8 dY = Yh - Y #N x K 9 dW= np.dot(Z.T, dY)/N #M x K 10 dZ = np.dot(dY, W.T) #N x M 11 dV = np.dot(X.T, dZ * Z * (1 - Z))/N #D x M 12 return dW, dV 13

L =

∂Wm ∂

( − y ^ y)zm L =

∂Vm,d ∂

( − y ^ y)W z (1 −

m m

z )x

m d

Iris dataset (D=2 features + 1 bias) M = 16 hidden units C=3 classes

check your gradient function using finite difference approximation that uses the cost function

scipy.optimize.check_grad 1

xd z =

m

σ(q )

m

= y ^ softmax(u) q =

m

V x ∑d=1

D m,d d

u =

c

W z ∑m=1

M c,m m

6 . 2

L(y, ) y ^



slide-62
SLIDE 62

Example: classification

Iris dataset (D = 2 features + 1 bias), M = 16 hidden units, C = 3 classes

using GD for optimization:

    def GD(X, Y, M, lr=.1, eps=1e-9, max_iters=100000):
        N, D = X.shape
        N, K = Y.shape
        W = np.random.randn(M, K)*.01
        V = np.random.randn(D, M)*.01
        dW = np.inf*np.ones_like(W)
        t = 0
        while np.linalg.norm(dW) > eps and t < max_iters:
            dW, dV = gradients(X, Y, W, V)
            W = W - lr*dW
            V = V - lr*dV
            t += 1
        return W, V

slide-63
SLIDE 63

Winter 2020 | Applied Machine Learning (COMP551)

the resulting decision boundaries (shown on the slide)
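The GD driver above follows a standard pattern: loop until the gradient norm falls below a threshold or an iteration cap is hit. A toy sketch of the same loop on a simple convex problem (a hypothetical quadratic, not from the slides) shows the stopping criterion actually triggering:

```python
import numpy as np

# Same loop structure as the slide's GD, on f(w) = 0.5*||w - w_star||^2,
# whose gradient is simply w - w_star.
def gd_toy(w_star, lr=0.1, eps=1e-9, max_iters=100000):
    w = np.zeros_like(w_star)
    dw = np.inf * np.ones_like(w)   # force at least one iteration
    t = 0
    while np.linalg.norm(dw) > eps and t < max_iters:
        dw = w - w_star             # gradient of 0.5*||w - w_star||^2
        w = w - lr * dw
        t += 1
    return w, t

w, t = gd_toy(np.array([1.0, -2.0]))
print(np.allclose(w, [1.0, -2.0]), t < 100000)
```

On this problem each step contracts the error by (1 − lr), so the loop exits on the gradient-norm test long before `max_iters`.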

slide-64
SLIDE 64

Automating gradient computation

gradient computation is tedious and mechanical. can we automate it?

using numerical differentiation?

∂f/∂w ≈ (f(w + ϵ) − f(w)) / ϵ

  • approximates partial derivatives using finite differences
  • needs multiple forward passes (one for each input-output pair)
  • can be slow and inaccurate
  • useful for black-box cost functions or for checking the correctness of gradient functions

symbolic differentiation: symbolic calculation of derivatives
  • does not identify the computational procedure and the reuse of intermediate values

automatic / algorithmic differentiation is what we want:
  • write code that calculates various functions, e.g., the cost function
  • automatically produce (partial) derivatives, e.g., gradients used in learning

slide-68
SLIDE 68

Automatic differentiation

idea:
  • use the chain rule + derivatives of simple operations (∗, sin, ...)
  • use a computational graph as a data structure (for storing the results of computation)

step 1: break the function down into atomic operations

L = ½ (y − wx)²

a1 = w    a2 = x    a3 = y
a4 = a1 × a2
a5 = a4 − a3
a6 = a5²
a7 = 0.5 × a6

step 2: build a graph with operations as internal nodes and input variables as leaf nodes (here a1, a2, a3 are leaves; a4, a5, a6 and a7 = L are internal nodes)

step 3: there are two ways to use the computational graph to calculate derivatives

forward mode: start from the leaves and propagate derivatives upward

reverse mode:
  • 1. first, in a bottom-up (forward) pass, calculate the values a1, …, a4
  • 2. then, in a top-down (backward) pass, calculate the derivatives

this second procedure is called backpropagation when applied to neural networks
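The atomic decomposition above can be transcribed directly into code, which is exactly what an autodiff system records before differentiating. A minimal sketch, checked against evaluating L in one expression:

```python
# Direct transcription of the slide's atomic operations for
# L = 0.5 * (y - w*x)^2; each line is one node of the graph.
def loss_atomic(w, x, y):
    a1, a2, a3 = w, x, y      # leaf nodes
    a4 = a1 * a2              # w * x
    a5 = a4 - a3              # w*x - y  (squaring makes the sign irrelevant)
    a6 = a5 ** 2
    a7 = 0.5 * a6             # L
    return a7

w, x, y = 2.0, 3.0, 1.0
print(loss_atomic(w, x, y) == 0.5 * (y - w * x) ** 2)
```

Each intermediate `a_i` is what the graph stores at a node; both forward and reverse mode differentiate this sequence of simple operations rather than the original formula.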

slide-76
SLIDE 76

Forward mode

suppose we want the derivative ∂y1/∂w1, where

y1 = sin(w1 x + w0)
y2 = cos(w1 x + w0)

we can calculate y1, y2 and both derivatives ∂y1/∂w1 and ∂y2/∂w1 in a single forward pass

we write ȧ for the partial derivative ∂a/∂w1; we initialize the leaf derivatives to identify which derivative we want:

evaluation                         partial derivatives (ȧ = ∂a/∂w1)
a1 = w0                            ȧ1 = 0
a2 = w1                            ȧ2 = 1
a3 = x                             ȧ3 = 0
a4 = a2 × a3 = w1 x                ȧ4 = ȧ2 × a3 + a2 × ȧ3 = x
a5 = a4 + a1 = w1 x + w0           ȧ5 = ȧ4 + ȧ1 = x
y1 = a6 = sin(a5)                  ȧ6 = cos(a5) ȧ5, so ∂y1/∂w1 = x cos(w1 x + w0)
y2 = a7 = cos(a5)                  ȧ7 = −sin(a5) ȧ5, so ∂y2/∂w1 = −x sin(w1 x + w0)

note that we get all partial derivatives ∂□/∂w1 in one forward pass
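The forward-mode table above amounts to carrying a (value, tangent) pair through each operation, with tangents seeded at the leaves. A minimal sketch of that single pass:

```python
import math

# Forward-mode pass for y1 = sin(w1*x + w0), y2 = cos(w1*x + w0),
# carrying (value, tangent) pairs with tangents w.r.t. w1:
# leaf seeds are w0dot = 0, w1dot = 1, xdot = 0, as on the slide.
def fwd(w0, w1, x):
    a1, d1 = w0, 0.0
    a2, d2 = w1, 1.0
    a3, d3 = x, 0.0
    a4, d4 = a2 * a3, d2 * a3 + a2 * d3      # product rule
    a5, d5 = a4 + a1, d4 + d1                # sum rule
    y1, dy1 = math.sin(a5), math.cos(a5) * d5
    y2, dy2 = math.cos(a5), -math.sin(a5) * d5
    return dy1, dy2

w0, w1, x = 0.5, 2.0, 1.5
dy1, dy2 = fwd(w0, w1, x)
u = w1 * x + w0
print(abs(dy1 - x * math.cos(u)) < 1e-12, abs(dy2 + x * math.sin(u)) < 1e-12)
```

One pass with this seed yields ∂y1/∂w1 and ∂y2/∂w1 together; getting derivatives with respect to w0 or x would each require another pass with a different seed.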

slide-86
SLIDE 86

Forward mode: computational graph

we can represent this computation using a graph: the leaves a1 = w0, a2 = w1, a3 = x feed the internal nodes a4 = a2 × a3, a5 = a4 + a1, y1 = a6 = sin(a5), y2 = a7 = cos(a5), and each node carries its value together with its derivative ȧ = ∂a/∂w1

the derivatives propagate along the edges: ȧ1 = 0, ȧ2 = 1, ȧ3 = 0, ȧ4 = ȧ2 × a3 + a2 × ȧ3, ȧ5 = ȧ4 + ȧ1, ȧ6 = cos(a5) ȧ5 = ∂y1/∂w1, ȧ7 = −sin(a5) ȧ5 = ∂y2/∂w1

  • once the nodes upstream have calculated their values and derivatives, we may discard a node; e.g., once a5, ȧ5 are obtained we can discard the values and partial derivatives for a4, ȧ4 and a1, ȧ1

slide-94
SLIDE 94

Reverse mode

suppose we want the derivative ∂y2/∂w1, where y2 = cos(w1 x + w0)

1) evaluation: first do a forward pass for evaluation

a1 = w0    a2 = w1    a3 = x
a4 = a2 × a3 = w1 x
a5 = a4 + a1 = w1 x + w0
y1 = a6 = sin(a5) = sin(w1 x + w0)
y2 = a7 = cos(a5) = cos(w1 x + w0)

2) partial derivatives: then use these values to calculate partial derivatives in a backward pass

we write ā for ∂y2/∂a, and initialize ā7 = ∂y2/∂y2 = 1 and ā6 = ∂y2/∂y1 = 0 to identify which derivative we want

ā5 = ∂y2/∂a5 = ∂y2/∂a6 ∂a6/∂a5 + ∂y2/∂a7 ∂a7/∂a5 = cos(a5) ā6 − sin(a5) ā7 = −sin(w1 x + w0)
ā4 = ∂y2/∂a4 = ā5 = −sin(w1 x + w0)
ā3 = ∂y2/∂x = a2 ā4 = −w1 sin(w1 x + w0)
ā2 = ∂y2/∂w1 = a3 ā4 = −x sin(w1 x + w0)
ā1 = ∂y2/∂w0 = ā5 = −sin(w1 x + w0)

we get all partial derivatives ∂y2/∂□ in one backward pass
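The two passes above can be written out directly: a forward pass storing the values, then a backward pass accumulating the adjoints ā = ∂y2/∂a, seeded with ā7 = 1, ā6 = 0. A minimal sketch:

```python
import math

# Reverse-mode pass for y2 = cos(w1*x + w0): forward pass stores the
# node values, backward pass accumulates adjoints abar = dy2/da,
# seeded with a7bar = 1, a6bar = 0 as on the slide.
def reverse(w0, w1, x):
    # 1) evaluation (forward pass)
    a1, a2, a3 = w0, w1, x
    a4 = a2 * a3
    a5 = a4 + a1
    # 2) partial derivatives (backward pass)
    a7bar, a6bar = 1.0, 0.0
    a5bar = math.cos(a5) * a6bar - math.sin(a5) * a7bar
    a4bar = a5bar
    a1bar = a5bar              # dy2/dw0
    a2bar = a3 * a4bar         # dy2/dw1
    a3bar = a2 * a4bar         # dy2/dx
    return a1bar, a2bar, a3bar

w0, w1, x = 0.5, 2.0, 1.5
g0, g1, gx = reverse(w0, w1, x)
u = w1 * x + w0
print(abs(g1 + x * math.sin(u)) < 1e-12)
```

Note the contrast with forward mode: one backward pass yields the derivatives of y2 with respect to all three inputs at once, which is why this (backpropagation, for neural networks) is the mode of choice when there are many parameters and few outputs.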

slide-105
SLIDE 105

Reverse mode: Reverse mode: computational graph computational graph

7 . 6

suppose we want the derivative where y =

2

cos(w x +

1

w ) ∂w1 ∂y2

a2 a3 a1 a4 a5 a7 a6

slide-106
SLIDE 106

Reverse mode: Reverse mode: computational graph computational graph

7 . 6

suppose we want the derivative where y =

2

cos(w x +

1

w ) ∂w1 ∂y2

a2 a3 a1 a4 a5 a7 a6

we can represent this computation using a graph

  • 1. in a forward pass we do evaluation and keep the values
  • 2. use these values in the backward pass to get partial derivatives


slide-119
SLIDE 119

Winter 2020 | Applied Machine Learning (COMP551)

Forward vs Reverse mode

forward mode is more natural, easier to implement, and requires less memory; a single forward pass calculates ∂y1/∂w, …, ∂yc/∂w for one input variable w

however, reverse mode is more efficient for calculating the gradient ∇w y = [∂y/∂w1, …, ∂y/∂wD]⊤

this is more efficient when we have a single output (the cost) and many variables (the weights); for this reason reverse mode is used in training neural networks; the backward pass in reverse mode is called backpropagation

many machine learning software packages implement autodiff: autograd (extends numpy), pytorch, tensorflow
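the forward-mode side of this comparison can be sketched with dual numbers; this is an illustrative toy, not how the libraries above are implemented. each forward pass propagates the derivative with respect to one chosen input, so D inputs need D passes, whereas reverse mode gets all of them in one backward pass:

```python
import math

class Dual:
    """Toy forward-mode value: (value, derivative w.r.t. one chosen input)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule, applied as the computation runs forward
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def cos_d(u):
    # chain rule for cos: d cos(u) = -sin(u) * du
    return Dual(math.cos(u.val), -math.sin(u.val) * u.dot)

w0, w1, x = 0.5, 2.0, 1.5
# seed the derivative of w1 to 1: this single pass yields only d(y2)/d(w1);
# derivatives w.r.t. w0 and x would each need their own pass
y2 = cos_d(Dual(w1, 1.0) * Dual(x) + Dual(w0))
assert abs(y2.dot - (-x * math.sin(w1 * x + w0))) < 1e-12
```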


slide-125
SLIDE 125

Improving optimization in deep learning

Initialization of parameters: random initialization (uniform or Gaussian) with small variance, to break the symmetry of hidden units

models that are simpler to optimize: using ReLU activations, using skip-connections, using batch-normalization (next)

skip-connection: x^{ℓ+1} = W^{ℓ+1} ReLU(… ReLU(W^{ℓ} x^{ℓ}) …) + x^{ℓ}
this block fixes the residual errors of the predictions of the previous layers

Pretrain a (simpler) model on a (simpler) task and fine-tune on a more difficult target setting (this has many forms)

continuation methods in optimization: gradually increase the difficulty of the optimization problem; each solution gives a good initialization for the next iteration

curriculum learning (similar idea): increase the number of "difficult" examples over time, similar to the way humans learn

image credit: Mobahi'16
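the skip-connection formula can be sketched in numpy; the two-layer depth, the shapes, and the small-initialization scale are assumptions for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W_inner, W_outer):
    """x_{l+1} = W_outer @ ReLU(W_inner @ x) + x  (identity skip-connection).
    The weighted path only has to model a residual correction to x."""
    return W_outer @ relu(W_inner @ x) + x

rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal(d)
# small initial weights: near initialization the block is close to identity,
# which is part of why such models are easier to optimize
W_inner = 0.01 * rng.standard_normal((d, d))
W_outer = 0.01 * rng.standard_normal((d, d))
out = residual_block(x, W_inner, W_outer)
assert np.allclose(out, x, atol=0.05)  # output stays near the skipped input
```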


slide-132
SLIDE 132

Batch Normalization

  • original motivation: with gradient descent, parameters in all layers are updated, so the distribution of inputs to each layer changes and each layer has to re-adjust; this is inefficient for very deep networks

idea: normalize the input to each unit m of layer ℓ:

x̂_m^{ℓ,(n)} = (x_m^{ℓ,(n)} − μ_m^{ℓ}) / σ_m^{ℓ}

where x_m^{ℓ,(n)} is the activation of unit m for instance (n) at layer ℓ; alternatively, apply the batch-norm to W^{ℓ} x^{ℓ}

each unit is unnecessarily constrained to have zero mean and std = 1 (we only need to fix the distribution), so introduce learnable parameters: ReLU(γ^{ℓ} BN(W^{ℓ} x^{ℓ}) + β^{ℓ})

mean and std per unit are calculated for the minibatch during the forward pass, and we backpropagate through this normalization; at test time, use the mean and std from the whole training set

BN regularizes the model (e.g., no need for dropout)

recent observations: the change in the distribution of activations is not a big issue; empirically, BN works so well because it makes the loss function smooth

Improving optimization in deep learning
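a minimal numpy sketch of the training-time normalization described above (the `eps` term is an assumption added to guard against division by zero, as standard implementations do; running statistics for test time are omitted):

```python
import numpy as np

def batch_norm_forward(X, gamma, beta, eps=1e-5):
    """X: (batch, units). Normalize each unit over the minibatch,
    then rescale with learnable gamma (scale) and beta (shift)."""
    mu = X.mean(axis=0)                 # per-unit minibatch mean
    var = X.var(axis=0)                 # per-unit minibatch variance
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta         # learned distribution, not forced N(0,1)

rng = np.random.default_rng(0)
X = 3.0 + 2.0 * rng.standard_normal((64, 5))   # units with mean 3, std 2
out = batch_norm_forward(X, gamma=np.ones(5), beta=np.zeros(5))
# with gamma=1, beta=0 each unit is normalized over the minibatch
assert np.allclose(out.mean(axis=0), 0.0, atol=1e-7)
assert np.allclose(out.std(axis=0), 1.0, atol=1e-3)
```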


slide-135
SLIDE 135

Summary

  • Optimization landscape in neural networks is special and not yet fully understood: exponentially many local optima and saddle points, yet most local minima are good; calculate the gradients using backpropagation

automatic differentiation simplifies gradient calculation for complex models, so gradient descent becomes simpler to use; for f : R^Q → R^P, forward mode is useful for calculating the Jacobian when P ≥ Q, and reverse mode can be more efficient when Q > P; backpropagation is reverse-mode autodiff

Better optimization in deep learning: better initialization; models that are easier to optimize (using skip-connections, batch-norm, ReLU); pre-training and curriculum learning