slide-1
SLIDE 1: Review of Lecture 9

• Logistic regression: h(x) = θ(s), with signal s = wᵀx over inputs x0, x1, ..., xd

• Likelihood measure:

    ∏_{n=1}^{N} P(y_n | x_n) = ∏_{n=1}^{N} θ(y_n wᵀx_n)

• Gradient descent:

    [figure: in-sample error E_in versus weights w, with the descent path]

    - Initialize w(0)
    - For t = 0, 1, 2, ... [to termination]:  w(t+1) = w(t) − η ∇E_in(w(t))
    - Return final w
slide-2
SLIDE 2: Learning From Data

Yaser S. Abu-Mostafa, California Institute of Technology

Lecture 10: Neural Networks

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, May 3, 2012
slide-3
SLIDE 3: Outline

• Stochastic gradient descent
• Neural network model
• Backpropagation algorithm

AML | Creator: Yaser Abu-Mostafa | LFD Lecture 10 | 2/21
slide-4
SLIDE 4: Stochastic gradient descent

GD minimizes:

    E_in(w) = (1/N) Σ_{n=1}^{N} e(h(x_n), y_n)    ← ln(1 + e^{−y_n wᵀx_n}) in logistic regression

by iterative steps along −∇E_in:

    Δw = −η ∇E_in(w)

∇E_in is based on all examples (x_n, y_n): "batch" GD
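As a concrete sketch (the toy dataset, step count, and step size below are illustrative assumptions, not from the lecture), batch GD on the logistic-regression error looks like:

```python
import numpy as np

def Ein(w, X, y):
    # In-sample error: (1/N) * sum ln(1 + exp(-y_n w.x_n))
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def grad_Ein(w, X, y):
    # Gradient of E_in: (1/N) * sum of -y_n x_n / (1 + exp(y_n w.x_n))
    s = y * (X @ w)
    return -(X * (y / (1.0 + np.exp(s)))[:, None]).mean(axis=0)

# Toy data; x0 = 1 is the bias coordinate, labels are in {-1, +1}
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, -1.0, 1.0])
w = np.zeros(2)

eta = 0.1
for t in range(100):
    w = w - eta * grad_Ein(w, X, y)   # batch step: uses all N examples
```

Each iteration touches every example once, which is exactly the cost SGD avoids on the next slide.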
slide-5
SLIDE 5: The stochastic aspect

Pick one (x_n, y_n) at a time. Apply GD to e(h(x_n), y_n).

Average direction:

    E_n[ −∇e(h(x_n), y_n) ] = (1/N) Σ_{n=1}^{N} −∇e(h(x_n), y_n) = −∇E_in

A randomized version of GD: stochastic gradient descent (SGD).
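The averaging identity can be checked numerically: the mean of the per-example gradients agrees with a finite-difference gradient of E_in (random toy data and the logistic log-loss are assumptions for illustration):

```python
import numpy as np

def grad_e(w, xn, yn):
    # Gradient of the single-example error ln(1 + exp(-yn * w.xn))
    return -yn * xn / (1.0 + np.exp(yn * (xn @ w)))

def Ein(w, X, y):
    # In-sample error: the average of the single-example errors
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.where(rng.normal(size=50) > 0, 1.0, -1.0)
w = rng.normal(size=3)

# Average of the per-example gradients: the expected SGD direction, up to sign
avg_grad = np.mean([grad_e(w, xn, yn) for xn, yn in zip(X, y)], axis=0)

# Numerical gradient of E_in, one component at a time
eps = 1e-6
num_grad = np.array([
    (Ein(w + eps * np.eye(3)[i], X, y) - Ein(w - eps * np.eye(3)[i], X, y)) / (2 * eps)
    for i in range(3)
])
```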
slide-6
SLIDE 6: Benefits of SGD

[figure: E_in versus weights w, showing how randomization helps the descent]

1. cheaper computation
2. randomization
3. simple

Rule of thumb: η = 0.1 works.
slide-7
SLIDE 7: SGD in action

[figure: matrix-factorization model for movie ratings; user factors u_{i1}, ..., u_{iK} and movie factors v_{j1}, ..., v_{jK} combine to produce the rating r_ij]

Remember movie ratings?

    e_ij = ( r_ij − Σ_{k=1}^{K} u_ik v_jk )²
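A minimal sketch of SGD on this rating error, picking one (i, j) at a time and moving u_i and v_j along the negative gradient of e_ij (the dense made-up ratings, K = 5, and η = 0.01 are illustrative assumptions; the lecture fixes none of these):

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_movies, K = 20, 30, 5
U = 0.1 * rng.normal(size=(n_users, K))    # user factors u_ik
V = 0.1 * rng.normal(size=(n_movies, K))   # movie factors v_jk
ratings = rng.integers(1, 6, size=(n_users, n_movies)).astype(float)

def mse():
    return np.mean((ratings - U @ V.T) ** 2)

mse_before = mse()
eta = 0.01
for t in range(20000):
    # pick one rating (i, j) at a time
    i = rng.integers(n_users)
    j = rng.integers(n_movies)
    err = ratings[i, j] - U[i] @ V[j]       # r_ij - sum_k u_ik v_jk
    # SGD step: -grad of e_ij = err^2 with respect to u_i and v_j
    du = 2 * err * V[j]
    dv = 2 * err * U[i]
    U[i] += eta * du
    V[j] += eta * dv
mse_after = mse()
```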
slide-8
SLIDE 8: Outline

• Stochastic gradient descent
• Neural network model
• Backpropagation algorithm
slide-9
SLIDE 9: Biological inspiration

biological function → biological structure

[figure: a biological neuron alongside its network counterpart]
slide-10
SLIDE 10: Combining perceptrons

[figure: two half-plane classifiers h1 and h2 in the (x1, x2) plane, combined into a composite region]

    OR(x1, x2):  weights 1.5, 1, 1   i.e., sign(1.5 + x1 + x2)
    AND(x1, x2): weights −1.5, 1, 1  i.e., sign(−1.5 + x1 + x2)
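The two perceptrons can be written out directly; for inputs in {−1, +1} with the constant coordinate x0 = 1, the weights shown on the slide give:

```python
import numpy as np

def perceptron(w0, w1, w2, x1, x2):
    # h(x) = sign(w0 * 1 + w1 * x1 + w2 * x2)
    return np.sign(w0 + w1 * x1 + w2 * x2)

def OR(x1, x2):
    return perceptron(1.5, 1.0, 1.0, x1, x2)

def AND(x1, x2):
    return perceptron(-1.5, 1.0, 1.0, x1, x2)
```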
slide-11
SLIDE 11: Creating layers

    f = h1 h̄2 + h̄1 h2

[figure: f built in two layers: two AND nodes (weights −1.5, 1, −1 and −1.5, −1, 1) feeding an OR node (weights 1.5, 1, 1)]
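The layered construction can likewise be coded; the two hidden AND units and the output OR unit below use the ±1 coding and the threshold weights from the slide (the helper name xor_net is mine):

```python
def sign(s):
    return 1 if s > 0 else -1

def xor_net(h1, h2):
    # hidden layer: a = h1 AND (NOT h2), b = (NOT h1) AND h2
    a = sign(-1.5 + h1 - h2)
    b = sign(-1.5 - h1 + h2)
    # output layer: f = a OR b
    return sign(1.5 + a + b)
```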
slide-12
SLIDE 12: The multilayer perceptron

[figure: a network computing f from the signals w1 · x and w2 · x through two layers of perceptrons]

3 layers, "feedforward"
slide-13
SLIDE 13: A powerful model

[figure: a circular target boundary approximated with 8 perceptrons and with 16 perceptrons]

2 red flags: for generalization and for optimization
slide-14
SLIDE 14: The neural network

[figure: network diagram with inputs x1, ..., xd, soft thresholds θ(s) at each node, and output h(x)]

input x    hidden layers 1 ≤ l < L    output layer l = L
slide-15
SLIDE 15: How the network operates

[figure: θ(s) compared with the linear and hard-threshold functions]

    θ(s) = tanh(s) = (e^s − e^{−s}) / (e^s + e^{−s})

    w^(l)_ij :  1 ≤ l ≤ L          layers
                0 ≤ i ≤ d^(l−1)    inputs
                1 ≤ j ≤ d^(l)      outputs

    x^(l)_j = θ(s^(l)_j) = θ( Σ_{i=0}^{d^(l−1)} w^(l)_ij x^(l−1)_i )

Apply x to x^(0)_1 ··· x^(0)_{d^(0)}  →  →  x^(L)_1 = h(x)
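A sketch of this forward computation in NumPy, with tanh at every layer and x0 = 1 prepended as the bias coordinate (the layer sizes and random weights are illustrative assumptions):

```python
import numpy as np

def forward(weights, x):
    # weights[l] has shape (d(l-1)+1, d(l)); row 0 multiplies the bias x0 = 1
    a = np.asarray(x, dtype=float)
    for W in weights:
        a = np.concatenate(([1.0], a))   # prepend x0 = 1
        a = np.tanh(a @ W)               # x(l)_j = theta(sum_i w(l)_ij x(l-1)_i)
    return a                             # x(L) = h(x)

rng = np.random.default_rng(0)
sizes = [2, 3, 1]                        # d(0)=2 inputs, one hidden layer, one output
weights = [rng.normal(size=(sizes[l] + 1, sizes[l + 1])) for l in range(len(sizes) - 1)]
h = forward(weights, [0.5, -0.2])
```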
slide-16
SLIDE 16: Outline

• Stochastic gradient descent
• Neural network model
• Backpropagation algorithm
slide-17
SLIDE 17: Applying SGD

All the weights w = {w^(l)_ij} determine h(x).

Error on example (x_n, y_n) is e(h(x_n), y_n) = e(w).

To implement SGD, we need the gradient ∇e(w):

    ∂e(w) / ∂w^(l)_ij    for all i, j, l
slide-18
SLIDE 18: Computing ∂e(w)/∂w^(l)_ij

[figure: x^(l−1)_i feeding weight w^(l)_ij into signal s^(l)_j, which θ turns into x^(l)_j]

We can evaluate ∂e(w)/∂w^(l)_ij one by one: analytically or numerically.

A trick for efficient computation:

    ∂e(w)/∂w^(l)_ij = ∂e(w)/∂s^(l)_j × ∂s^(l)_j/∂w^(l)_ij

We have

    ∂s^(l)_j / ∂w^(l)_ij = x^(l−1)_i

We only need:

    ∂e(w) / ∂s^(l)_j = δ^(l)_j
slide-19
SLIDE 19: δ for the final layer

    δ^(l)_j = ∂e(w) / ∂s^(l)_j

For the final layer l = L and j = 1:

    δ^(L)_1 = ∂e(w) / ∂s^(L)_1

    e(w) = ( x^(L)_1 − y_n )²

    x^(L)_1 = θ(s^(L)_1)

    θ′(s) = 1 − θ²(s)    for the tanh
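Both identities on this slide can be verified against finite differences; the chain rule gives δ^(L)_1 = 2 (x^(L)_1 − y_n)(1 − (x^(L)_1)²) for the squared error (the sample point s = 0.7 and target y = 1 are arbitrary choices for the check):

```python
import numpy as np

s, y = 0.7, 1.0
x = np.tanh(s)                                   # x(L)_1 = theta(s(L)_1)
eps = 1e-6

# theta'(s) = 1 - theta(s)^2, checked against a central difference
num_dtheta = (np.tanh(s + eps) - np.tanh(s - eps)) / (2 * eps)
closed_dtheta = 1.0 - x ** 2

# delta(L)_1 = de/ds(L)_1 for e = (theta(s) - y)^2, by the chain rule
delta = 2 * (x - y) * (1.0 - x ** 2)
num_delta = ((np.tanh(s + eps) - y) ** 2 - (np.tanh(s - eps) - y) ** 2) / (2 * eps)
```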
slide-20
SLIDE 20: Back propagation of δ

[figure: δ^(l)_j propagated back through w^(l)_ij and the factor 1 − (x^(l−1)_i)² to give δ^(l−1)_i]

    δ^(l−1)_i = ∂e(w) / ∂s^(l−1)_i

              = Σ_{j=1}^{d^(l)} ∂e(w)/∂s^(l)_j × ∂s^(l)_j/∂x^(l−1)_i × ∂x^(l−1)_i/∂s^(l−1)_i

              = Σ_{j=1}^{d^(l)} δ^(l)_j × w^(l)_ij × θ′(s^(l−1)_i)

    δ^(l−1)_i = ( 1 − (x^(l−1)_i)² ) Σ_{j=1}^{d^(l)} w^(l)_ij δ^(l)_j
slide-21
SLIDE 21: Backpropagation algorithm

[figure: weight w^(l)_ij updated from x^(l−1)_i below and δ^(l)_j above]

1: Initialize all weights w^(l)_ij at random
2: for t = 0, 1, 2, ... do
3:    Pick n ∈ {1, 2, ···, N}
4:    Forward: compute all x^(l)_j
5:    Backward: compute all δ^(l)_j
6:    Update the weights: w^(l)_ij ← w^(l)_ij − η x^(l−1)_i δ^(l)_j
7:    Iterate to the next step until it is time to stop
8: Return the final weights w^(l)_ij
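The eight steps above can be sketched in NumPy for a single training example (tanh units, squared error, and one hidden layer of size 3 are illustrative choices; the lecture leaves the stopping rule open, so a fixed step count stands in for it):

```python
import numpy as np

def forward(weights, x):
    # step 4: xs[l] = x(l) for every layer; the bias x0 = 1 is prepended on the fly
    xs = [np.asarray(x, dtype=float)]
    for W in weights:
        a = np.concatenate(([1.0], xs[-1]))
        xs.append(np.tanh(a @ W))
    return xs

def backward(weights, xs, y):
    # step 5: delta at the output for e = (x(L) - y)^2, then the recursion
    delta = 2 * (xs[-1] - y) * (1 - xs[-1] ** 2)
    grads = []
    for l in range(len(weights) - 1, -1, -1):
        a = np.concatenate(([1.0], xs[l]))
        grads.insert(0, np.outer(a, delta))      # de/dw(l)_ij = x(l-1)_i * delta(l)_j
        if l > 0:                                # delta(l-1)_i = (1 - x_i^2) sum_j w_ij delta_j
            delta = (1 - xs[l] ** 2) * (weights[l][1:] @ delta)
    return grads

# step 1: random weights; a single (xn, yn) plays the role of the picked example
rng = np.random.default_rng(0)
sizes = [2, 3, 1]
weights = [rng.normal(scale=0.5, size=(sizes[l] + 1, sizes[l + 1])) for l in range(2)]
init_weights = [W.copy() for W in weights]
xn, yn = np.array([0.5, -0.3]), 1.0

eta = 0.1
for t in range(200):                             # steps 2-7: forward, backward, update
    xs = forward(weights, xn)
    grads = backward(weights, xs, yn)
    weights = [W - eta * g for W, g in zip(weights, grads)]
```

A finite-difference check on any single weight confirms that the back-propagated gradient matches the numerical one.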
slide-22
SLIDE 22: Final remark: hidden layers

[figure: the network diagram again, highlighting the hidden layers]

a learned nonlinear transform. Interpretation?