slide-1
SLIDE 1: Review of Lecture 9

• Logistic regression: h(x) = θ(s), with signal s = wᵀx over inputs x0, x1, ..., xd

• Likelihood measure:

    ∏_{n=1}^{N} P(y_n | x_n) = ∏_{n=1}^{N} θ(y_n wᵀx_n)

• Gradient descent:

    [figure: in-sample error E_in versus weights w, with the descent path]

    - Initialize w(0)
    - For t = 0, 1, 2, ... [to termination]:  w(t+1) = w(t) − η ∇E_in(w(t))
    - Return final w
slide-2
SLIDE 2: Learning From Data

Yaser S. Abu-Mostafa, California Institute of Technology

Lecture 10: Neural Networks

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, May 3, 2012
slide-3
SLIDE 3: Outline

• Stochastic gradient descent
• Neural network model
• Backpropagation algorithm

AML | Creator: Yaser Abu-Mostafa | LFD Lecture 10 | 2/21
slide-4
SLIDE 4: Stochastic gradient descent

GD minimizes:

    E_in(w) = (1/N) Σ_{n=1}^{N} e(h(x_n), y_n)    ← ln(1 + e^{−y_n wᵀx_n}) in logistic regression

by iterative steps along −∇E_in:

    Δw = −η ∇E_in(w)

∇E_in is based on all examples (x_n, y_n): "batch" GD
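As a concrete sketch (the toy dataset, step count, and step size below are illustrative assumptions, not from the lecture), batch GD on the logistic-regression error looks like:

```python
import numpy as np

def Ein(w, X, y):
    # In-sample error: (1/N) * sum ln(1 + exp(-y_n w.x_n))
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def grad_Ein(w, X, y):
    # Gradient of E_in: (1/N) * sum of -y_n x_n / (1 + exp(y_n w.x_n))
    s = y * (X @ w)
    return -(X * (y / (1.0 + np.exp(s)))[:, None]).mean(axis=0)

# Toy data; x0 = 1 is the bias coordinate, labels are in {-1, +1}
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, -1.0, 1.0])
w = np.zeros(2)

eta = 0.1
for t in range(100):
    w = w - eta * grad_Ein(w, X, y)   # batch step: uses all N examples
```

Each iteration touches every example once, which is exactly the cost SGD avoids on the next slide.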
slide-5
SLIDE 5: The stochastic aspect

Pick one (x_n, y_n) at a time. Apply GD to e(h(x_n), y_n).

Average direction:

    E_n[ −∇e(h(x_n), y_n) ] = (1/N) Σ_{n=1}^{N} −∇e(h(x_n), y_n) = −∇E_in

A randomized version of GD: stochastic gradient descent (SGD).
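The averaging identity can be checked numerically: the mean of the per-example gradients agrees with a finite-difference gradient of E_in (random toy data and the logistic log-loss are assumptions for illustration):

```python
import numpy as np

def grad_e(w, xn, yn):
    # Gradient of the single-example error ln(1 + exp(-yn * w.xn))
    return -yn * xn / (1.0 + np.exp(yn * (xn @ w)))

def Ein(w, X, y):
    # In-sample error: the average of the single-example errors
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.where(rng.normal(size=50) > 0, 1.0, -1.0)
w = rng.normal(size=3)

# Average of the per-example gradients: the expected SGD direction, up to sign
avg_grad = np.mean([grad_e(w, xn, yn) for xn, yn in zip(X, y)], axis=0)

# Numerical gradient of E_in, one component at a time
eps = 1e-6
num_grad = np.array([
    (Ein(w + eps * np.eye(3)[i], X, y) - Ein(w - eps * np.eye(3)[i], X, y)) / (2 * eps)
    for i in range(3)
])
```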
slide-6
SLIDE 6: Benefits of SGD

[figure: E_in versus weights w, showing how randomization helps the descent]

1. cheaper computation
2. randomization
3. simple

Rule of thumb: η = 0.1 works.
slide-7
SLIDE 7: SGD in action

[figure: matrix-factorization model for movie ratings; user factors u_{i1}, ..., u_{iK} and movie factors v_{j1}, ..., v_{jK} combine to produce the rating r_ij]

Remember movie ratings?

    e_ij = ( r_ij − Σ_{k=1}^{K} u_ik v_jk )²
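A minimal sketch of SGD on this rating error, picking one (i, j) at a time and moving u_i and v_j along the negative gradient of e_ij (the dense made-up ratings, K = 5, and η = 0.01 are illustrative assumptions; the lecture fixes none of these):

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_movies, K = 20, 30, 5
U = 0.1 * rng.normal(size=(n_users, K))    # user factors u_ik
V = 0.1 * rng.normal(size=(n_movies, K))   # movie factors v_jk
ratings = rng.integers(1, 6, size=(n_users, n_movies)).astype(float)

def mse():
    return np.mean((ratings - U @ V.T) ** 2)

mse_before = mse()
eta = 0.01
for t in range(20000):
    # pick one rating (i, j) at a time
    i = rng.integers(n_users)
    j = rng.integers(n_movies)
    err = ratings[i, j] - U[i] @ V[j]       # r_ij - sum_k u_ik v_jk
    # SGD step: -grad of e_ij = err^2 with respect to u_i and v_j
    du = 2 * err * V[j]
    dv = 2 * err * U[i]
    U[i] += eta * du
    V[j] += eta * dv
mse_after = mse()
```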
slide-8
SLIDE 8: Outline

• Stochastic gradient descent
• Neural network model
• Backpropagation algorithm
slide-9
SLIDE 9: Biological inspiration

biological function → biological structure

[figure: a biological neuron alongside its network counterpart]
slide-10
SLIDE 10: Combining perceptrons

[figure: two half-plane classifiers h1 and h2 in the (x1, x2) plane, combined into a composite region]

    OR(x1, x2):  weights 1.5, 1, 1   i.e., sign(1.5 + x1 + x2)
    AND(x1, x2): weights −1.5, 1, 1  i.e., sign(−1.5 + x1 + x2)
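The two perceptrons can be written out directly; for inputs in {−1, +1} with the constant coordinate x0 = 1, the weights shown on the slide give:

```python
import numpy as np

def perceptron(w0, w1, w2, x1, x2):
    # h(x) = sign(w0 * 1 + w1 * x1 + w2 * x2)
    return np.sign(w0 + w1 * x1 + w2 * x2)

def OR(x1, x2):
    return perceptron(1.5, 1.0, 1.0, x1, x2)

def AND(x1, x2):
    return perceptron(-1.5, 1.0, 1.0, x1, x2)
```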
slide-11
SLIDE 11: Creating layers

    f = h1 h̄2 + h̄1 h2

[figure: f built in two layers: two AND nodes (weights −1.5, 1, −1 and −1.5, −1, 1) feeding an OR node (weights 1.5, 1, 1)]
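The layered construction can likewise be coded; the two hidden AND units and the output OR unit below use the ±1 coding and the threshold weights from the slide (the helper name xor_net is mine):

```python
def sign(s):
    return 1 if s > 0 else -1

def xor_net(h1, h2):
    # hidden layer: a = h1 AND (NOT h2), b = (NOT h1) AND h2
    a = sign(-1.5 + h1 - h2)
    b = sign(-1.5 - h1 + h2)
    # output layer: f = a OR b
    return sign(1.5 + a + b)
```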
slide-12
SLIDE 12: The multilayer perceptron

[figure: a network computing f from the signals w1 · x and w2 · x through two layers of perceptrons]

3 layers, "feedforward"
slide-13
SLIDE 13: A powerful model

[figure: a circular target boundary approximated with 8 perceptrons and with 16 perceptrons]

2 red flags: for generalization and for optimization
slide-14
SLIDE 14: The neural network

[figure: network diagram with inputs x1, ..., xd, soft thresholds θ(s) at each node, and output h(x)]

input x    hidden layers 1 ≤ l < L    output layer l = L
slide-15
SLIDE 15: How the network operates

[figure: θ(s) compared with the linear and hard-threshold functions]

    θ(s) = tanh(s) = (e^s − e^{−s}) / (e^s + e^{−s})

    w^(l)_ij :  1 ≤ l ≤ L          layers
                0 ≤ i ≤ d^(l−1)    inputs
                1 ≤ j ≤ d^(l)      outputs

    x^(l)_j = θ(s^(l)_j) = θ( Σ_{i=0}^{d^(l−1)} w^(l)_ij x^(l−1)_i )

Apply x to x^(0)_1 ··· x^(0)_{d^(0)}  →  →  x^(L)_1 = h(x)
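A sketch of this forward computation in NumPy, with tanh at every layer and x0 = 1 prepended as the bias coordinate (the layer sizes and random weights are illustrative assumptions):

```python
import numpy as np

def forward(weights, x):
    # weights[l] has shape (d(l-1)+1, d(l)); row 0 multiplies the bias x0 = 1
    a = np.asarray(x, dtype=float)
    for W in weights:
        a = np.concatenate(([1.0], a))   # prepend x0 = 1
        a = np.tanh(a @ W)               # x(l)_j = theta(sum_i w(l)_ij x(l-1)_i)
    return a                             # x(L) = h(x)

rng = np.random.default_rng(0)
sizes = [2, 3, 1]                        # d(0)=2 inputs, one hidden layer, one output
weights = [rng.normal(size=(sizes[l] + 1, sizes[l + 1])) for l in range(len(sizes) - 1)]
h = forward(weights, [0.5, -0.2])
```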
slide-16
SLIDE 16: Outline

• Stochastic gradient descent
• Neural network model
• Backpropagation algorithm
slide-17
SLIDE 17: Applying SGD

All the weights w = {w^(l)_ij} determine h(x).

Error on example (x_n, y_n) is e(h(x_n), y_n) = e(w).

To implement SGD, we need the gradient ∇e(w):

    ∂e(w) / ∂w^(l)_ij    for all i, j, l
slide-18
SLIDE 18: Computing ∂e(w)/∂w^(l)_ij

[figure: x^(l−1)_i feeding weight w^(l)_ij into signal s^(l)_j, which θ turns into x^(l)_j]

We can evaluate ∂e(w)/∂w^(l)_ij one by one: analytically or numerically.

A trick for efficient computation:

    ∂e(w)/∂w^(l)_ij = ∂e(w)/∂s^(l)_j × ∂s^(l)_j/∂w^(l)_ij

We have

    ∂s^(l)_j / ∂w^(l)_ij = x^(l−1)_i

We only need:

    ∂e(w) / ∂s^(l)_j = δ^(l)_j
slide-19
SLIDE 19: δ for the final layer

    δ^(l)_j = ∂e(w) / ∂s^(l)_j

For the final layer l = L and j = 1:

    δ^(L)_1 = ∂e(w) / ∂s^(L)_1

    e(w) = ( x^(L)_1 − y_n )²

    x^(L)_1 = θ(s^(L)_1)

    θ′(s) = 1 − θ²(s)    for the tanh
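Both identities on this slide can be verified against finite differences; the chain rule gives δ^(L)_1 = 2 (x^(L)_1 − y_n)(1 − (x^(L)_1)²) for the squared error (the sample point s = 0.7 and target y = 1 are arbitrary choices for the check):

```python
import numpy as np

s, y = 0.7, 1.0
x = np.tanh(s)                                   # x(L)_1 = theta(s(L)_1)
eps = 1e-6

# theta'(s) = 1 - theta(s)^2, checked against a central difference
num_dtheta = (np.tanh(s + eps) - np.tanh(s - eps)) / (2 * eps)
closed_dtheta = 1.0 - x ** 2

# delta(L)_1 = de/ds(L)_1 for e = (theta(s) - y)^2, by the chain rule
delta = 2 * (x - y) * (1.0 - x ** 2)
num_delta = ((np.tanh(s + eps) - y) ** 2 - (np.tanh(s - eps) - y) ** 2) / (2 * eps)
```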
slide-20
SLIDE 20: Back propagation of δ

[figure: δ^(l)_j propagated back through w^(l)_ij and the factor 1 − (x^(l−1)_i)² to give δ^(l−1)_i]

    δ^(l−1)_i = ∂e(w) / ∂s^(l−1)_i

              = Σ_{j=1}^{d^(l)} ∂e(w)/∂s^(l)_j × ∂s^(l)_j/∂x^(l−1)_i × ∂x^(l−1)_i/∂s^(l−1)_i

              = Σ_{j=1}^{d^(l)} δ^(l)_j × w^(l)_ij × θ′(s^(l−1)_i)

    δ^(l−1)_i = ( 1 − (x^(l−1)_i)² ) Σ_{j=1}^{d^(l)} w^(l)_ij δ^(l)_j
slide-21
SLIDE 21: Backpropagation algorithm

[figure: weight w^(l)_ij updated from x^(l−1)_i below and δ^(l)_j above]

1: Initialize all weights w^(l)_ij at random
2: for t = 0, 1, 2, ... do
3:    Pick n ∈ {1, 2, ···, N}
4:    Forward: compute all x^(l)_j
5:    Backward: compute all δ^(l)_j
6:    Update the weights: w^(l)_ij ← w^(l)_ij − η x^(l−1)_i δ^(l)_j
7:    Iterate to the next step until it is time to stop
8: Return the final weights w^(l)_ij
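The eight steps above can be sketched in NumPy for a single training example (tanh units, squared error, and one hidden layer of size 3 are illustrative choices; the lecture leaves the stopping rule open, so a fixed step count stands in for it):

```python
import numpy as np

def forward(weights, x):
    # step 4: xs[l] = x(l) for every layer; the bias x0 = 1 is prepended on the fly
    xs = [np.asarray(x, dtype=float)]
    for W in weights:
        a = np.concatenate(([1.0], xs[-1]))
        xs.append(np.tanh(a @ W))
    return xs

def backward(weights, xs, y):
    # step 5: delta at the output for e = (x(L) - y)^2, then the recursion
    delta = 2 * (xs[-1] - y) * (1 - xs[-1] ** 2)
    grads = []
    for l in range(len(weights) - 1, -1, -1):
        a = np.concatenate(([1.0], xs[l]))
        grads.insert(0, np.outer(a, delta))      # de/dw(l)_ij = x(l-1)_i * delta(l)_j
        if l > 0:                                # delta(l-1)_i = (1 - x_i^2) sum_j w_ij delta_j
            delta = (1 - xs[l] ** 2) * (weights[l][1:] @ delta)
    return grads

# step 1: random weights; a single (xn, yn) plays the role of the picked example
rng = np.random.default_rng(0)
sizes = [2, 3, 1]
weights = [rng.normal(scale=0.5, size=(sizes[l] + 1, sizes[l + 1])) for l in range(2)]
init_weights = [W.copy() for W in weights]
xn, yn = np.array([0.5, -0.3]), 1.0

eta = 0.1
for t in range(200):                             # steps 2-7: forward, backward, update
    xs = forward(weights, xn)
    grads = backward(weights, xs, yn)
    weights = [W - eta * g for W, g in zip(weights, grads)]
```

A finite-difference check on any single weight confirms that the back-propagated gradient matches the numerical one.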
slide-22
SLIDE 22: Final remark: hidden layers

[figure: the network diagram again, highlighting the hidden layers]

a learned nonlinear transform. Interpretation?