SLIDE 1

CptS 570 – Machine Learning School of EECS Washington State University

SLIDE 2

• Also called multilayer perceptrons
• Inspired by the human brain
  • The brain consists of interconnected neurons
  • The brain still outperforms machines on several tasks, e.g., vision, speech recognition, learning
• Nonparametric estimator
• Classification and regression
• Trained using error backpropagation

SLIDE 3

SLIDE 4

• Processors
  • Computer: typically 1–2 (~10⁹ Hz)
  • Brain: 10¹¹ neurons (~10³ Hz)
• Parallelism
  • Computer: typically little
  • Brain: massive parallelism
• On average, each neuron is connected via synapses to 10⁴ other neurons

SLIDE 5


[Figure from "The Singularity Is Near," Ray Kurzweil]

SLIDE 6

$$y = \sum_{j=1}^{d} w_j x_j + w_0 = \mathbf{w}^T \mathbf{x}$$

where $\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$ and $\mathbf{x} = [1, x_1, \ldots, x_d]^T$.

SLIDE 7

• y = wx + w₀

[Figure: perceptron with input x, bias unit x₀ = +1, weights w and w₀, output y]

SLIDE 8

• If (wx + w₀ > 0) Then y = 1 Else y = 0

[Figure: perceptron with input x, weights w and w₀, output y]

Alternatively, a sigmoid output:

$$y = \text{sigmoid}(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$$

SLIDE 9

Regression (K outputs):

$$y_i = \mathbf{w}_i^T\mathbf{x} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}, \qquad \mathbf{y} = \mathbf{W}\mathbf{x}$$

Classification (K classes, softmax):

$$y_i = \frac{\exp(\mathbf{w}_i^T\mathbf{x})}{\sum_k \exp(\mathbf{w}_k^T\mathbf{x})}, \qquad \text{choose } C_i \text{ if } y_i = \max_k y_k$$

SLIDE 10

• Batch learning (gradient descent)
  • Requires the entire training set
  • Each weight update is based on a pass through the entire training set
• Online learning (stochastic gradient descent)
  • Allows incremental arrival of training examples
  • Weights updated for each training example
  • Adaptive to problems that change over time
  • Tends to converge faster

Per-example update rule:

$$\Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t$$
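To make the contrast concrete, here is a minimal NumPy sketch of the two update styles for a linear layer (function names and shapes are my own; W is K×(d+1), and inputs carry a leading bias component):

```python
import numpy as np

def batch_update(W, X, R, Y, eta):
    """One batch gradient-descent step: error summed over the whole training set."""
    return W + eta * (R - Y).T @ X              # accumulate over all examples t at once

def online_update(W, x_t, r_t, y_t, eta):
    """One stochastic step: w_ij += eta * (r_i - y_i) * x_j for a single example."""
    return W + eta * np.outer(r_t - y_t, x_t)
```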

SLIDE 11

• Regression
• Single linear output

$$E(\mathbf{w} \mid \mathcal{X}) = \frac{1}{2}\sum_t \left(r^t - y^t\right)^2, \qquad y^t = \mathbf{w}^T\mathbf{x}^t$$

$$\Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t$$

SLIDE 12

• Classification
• Single sigmoid output
• K > 2 softmax outputs

Single sigmoid output (two classes):

$$y^t = \text{sigmoid}(\mathbf{w}^T\mathbf{x}^t)$$

$$E(\mathbf{w} \mid \mathcal{X}) = -\sum_t \left[\, r^t \log y^t + (1 - r^t)\log(1 - y^t) \,\right], \qquad \Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t$$

K > 2 softmax outputs (cross entropy):

$$y_i^t = \frac{\exp(\mathbf{w}_i^T\mathbf{x}^t)}{\sum_k \exp(\mathbf{w}_k^T\mathbf{x}^t)}, \qquad E(\{\mathbf{w}_i\} \mid \mathcal{X}) = -\sum_t \sum_i r_i^t \log y_i^t, \qquad \Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t$$

SLIDE 13

• Stochastic online gradient descent for K > 2 classes

For i = 1,…,K
  For j = 0,…,d
    w_ij ← rand(−0.01, 0.01)
Repeat
  For all (x^t, r^t) ∈ X in random order
    For i = 1,…,K
      o_i ← 0
      For j = 0,…,d
        o_i ← o_i + w_ij · x_j^t
    For i = 1,…,K
      y_i ← exp(o_i) / Σ_k exp(o_k)
    For i = 1,…,K
      For j = 0,…,d
        w_ij ← w_ij + η (r_i^t − y_i^t) x_j^t
Until convergence
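The pseudocode translates almost line for line into NumPy. A minimal sketch, assuming one-hot targets R and a fixed epoch count in place of the convergence test (the function name and the max-shift inside the softmax are my additions):

```python
import numpy as np

def train_softmax_sgd(X, R, eta=0.1, epochs=100, rng=None):
    """Stochastic online gradient descent for K > 2 classes.

    X: (N, d) inputs; R: (N, K) one-hot targets.
    Returns W of shape (K, d+1); column 0 holds the bias weights.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, d = X.shape
    K = R.shape[1]
    W = rng.uniform(-0.01, 0.01, size=(K, d + 1))    # w_ij <- rand(-0.01, 0.01)
    Xb = np.hstack([np.ones((N, 1)), X])             # prepend x_0 = 1 bias input
    for _ in range(epochs):                          # "Repeat ... Until convergence"
        for t in rng.permutation(N):                 # random order over examples
            o = W @ Xb[t]                            # o_i = sum_j w_ij x_j
            y = np.exp(o - o.max())                  # softmax, shifted for stability
            y /= y.sum()
            W += eta * np.outer(R[t] - y, Xb[t])     # w_ij += eta (r_i - y_i) x_j
    return W
```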

SLIDE 14

SLIDE 15

No w₀, w₁, w₂ satisfy:

$$\begin{aligned} w_0 &\le 0 \\ w_2 + w_0 &> 0 \\ w_1 + w_0 &> 0 \\ w_1 + w_2 + w_0 &\le 0 \end{aligned}$$

Minsky and Papert (1969): stalled perceptron research for 15 years.

SLIDE 16

• Perceptrons can only approximate linear functions
• But multiple layers of perceptrons can approximate nonlinear functions

[Figure: network with a hidden layer]
SLIDE 17

$$z_h = \text{sigmoid}(\mathbf{w}_h^T\mathbf{x}) = \frac{1}{1 + \exp\!\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}$$

$$y_i = \mathbf{v}_i^T\mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$$

(Rumelhart et al., 1986)
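A minimal NumPy rendering of these two equations (names and shapes are mine; outputs are left linear, as for regression):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W, V):
    """Forward pass for a one-hidden-layer MLP.

    W: (H, d+1) hidden-unit weights, V: (K, H+1) output weights;
    column 0 of each holds the bias term.
    """
    z = sigmoid(W @ np.append(1.0, x))   # z_h = sigmoid(w_h^T x), with x_0 = 1
    y = V @ np.append(1.0, z)            # y_i = v_i^T z, with z_0 = 1
    return y, z
```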

SLIDE 18

• x₁ XOR x₂ = (x₁ AND ~x₂) OR (~x₁ AND x₂)
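One hand-picked set of threshold-unit weights realizing this decomposition (the weights are illustrative, not unique):

```python
import numpy as np

def step(a):
    return (a > 0).astype(int)

def xor_mlp(x1, x2):
    """XOR as (x1 AND ~x2) OR (~x1 AND x2) with threshold units."""
    x = np.array([1, x1, x2])                  # bias input x_0 = 1
    h1 = step(np.array([-0.5, 1, -1]) @ x)     # hidden unit: x1 AND ~x2
    h2 = step(np.array([-0.5, -1, 1]) @ x)     # hidden unit: ~x1 AND x2
    return step(np.array([-0.5, 1, 1]) @ np.array([1, h1, h2]))  # output: h1 OR h2

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_mlp(a, b))             # prints the XOR truth table
```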

SLIDE 19

• MLP can represent any Boolean function
  • Any Boolean function can be expressed as a disjunction of conjunctions
  • Each conjunction implemented by a hidden unit
  • Disjunction implemented by one output unit
  • May need 2^d hidden units in the worst case

SLIDE 20

• MLP with two hidden layers can approximate any function with continuous inputs and outputs
  • First hidden layer computes hyperplanes for isolating regions of instance space
  • Second hidden layer ANDs hyperplanes together to isolate regions
  • Weight from a second-layer hidden unit to the output unit is the value of the function in this region
  • Piecewise constant approximator
• MLP with one sufficiently large hidden layer can learn any nonlinear function

SLIDE 21

• Weights v_ih feeding into output units are learned using the previous methods
• Weights w_hj feeding into hidden units are learned based on error propagated back from the output layer
• Error backpropagation (Rumelhart et al., 1986)

$$z_h = \text{sigmoid}(\mathbf{w}_h^T\mathbf{x}) = \frac{1}{1 + \exp\!\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}, \qquad y_i = \mathbf{v}_i^T\mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$$

Chain rule for the hidden-layer weights:

$$\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial z_h}\,\frac{\partial z_h}{\partial w_{hj}}$$

SLIDE 22

Regression with a single output:

$$E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = \frac{1}{2}\sum_t \left(r^t - y^t\right)^2, \qquad y^t = \sum_{h=1}^{H} v_h z_h^t + v_0, \qquad z_h = \text{sigmoid}(\mathbf{w}_h^T\mathbf{x})$$

Output-layer update (backward pass):

$$\Delta v_h = \eta \sum_t \left(r^t - y^t\right) z_h^t$$

Hidden-layer update via the chain rule:

$$\Delta w_{hj} = -\eta\,\frac{\partial E}{\partial w_{hj}} = -\eta \sum_t \frac{\partial E^t}{\partial y^t}\,\frac{\partial y^t}{\partial z_h^t}\,\frac{\partial z_h^t}{\partial w_{hj}} = \eta \sum_t \left(r^t - y^t\right) v_h\, z_h^t \left(1 - z_h^t\right) x_j^t$$

[Figure: forward pass computes y from x; backward pass propagates the error]
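These updates fit in a few lines of NumPy. A minimal batch-backpropagation sketch (the function name, initialization range, and fixed epoch count are my choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mlp_regression(X, r, H=2, eta=0.2, epochs=300, rng=None):
    """Batch backpropagation for a single-output regression MLP."""
    rng = np.random.default_rng() if rng is None else rng
    N, d = X.shape
    W = rng.uniform(-0.01, 0.01, size=(H, d + 1))   # hidden weights (bias in col 0)
    v = rng.uniform(-0.01, 0.01, size=H + 1)        # output weights (bias in v[0])
    Xb = np.hstack([np.ones((N, 1)), X])            # x_0 = 1
    for _ in range(epochs):
        Z = sigmoid(Xb @ W.T)                       # forward: z_h^t, shape (N, H)
        Zb = np.hstack([np.ones((N, 1)), Z])        # z_0 = 1
        y = Zb @ v                                  # forward: y^t
        err = r - y                                 # (r^t - y^t)
        dv = eta * Zb.T @ err                       # Δv_h = η Σ_t (r - y) z_h
        dW = eta * ((np.outer(err, v[1:]) * Z * (1 - Z)).T @ Xb)
        v += dv                                     # all updates computed before any applied
        W += dW
    return W, v
```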

SLIDE 23

Regression with multiple outputs:

$$E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = \frac{1}{2}\sum_t \sum_i \left(r_i^t - y_i^t\right)^2, \qquad y_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}$$

$$\Delta v_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) z_h^t$$

$$\Delta w_{hj} = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) v_{ih}\right] z_h^t \left(1 - z_h^t\right) x_j^t$$

[Figure: inputs x_j, hidden units z_h, weights w_hj and v_ih, outputs y_i]

SLIDE 24


An epoch is one pass through the training data X. Note: all weight updates are computed before any are applied.

SLIDE 25

• f(x) = sin(6x)
• x^t ~ U(−0.5, 0.5)
• y^t = f(x^t) + N(0, 0.1)
• 2 hidden units
• After 100, 200, and 300 epochs
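To reproduce this setup with the `train_mlp_regression` sketch from Slide 22 (the sample size of 100 is my assumption; the slides do not state it):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-0.5, 0.5, size=(100, 1))            # x^t ~ U(-0.5, 0.5)
r = np.sin(6 * X[:, 0]) + rng.normal(0, 0.1, 100)    # y^t = f(x^t) + N(0, 0.1)
W, v = train_mlp_regression(X, r, H=2, epochs=300)   # 2 hidden units, 300 epochs
```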

SLIDE 26

SLIDE 27

[Figure panels: hyperplanes w_h^T x + w_h0 computed by the hidden units; hidden-unit outputs z_h; inputs v_h z_h to the output unit]

SLIDE 28

• One sigmoid output y^t for P(C₁|x^t), with P(C₂|x^t) ≡ 1 − y^t

$$y^t = \text{sigmoid}\!\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right)$$

$$E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = -\sum_t \left[\, r^t \log y^t + (1 - r^t)\log(1 - y^t) \,\right]$$

$$\Delta v_h = \eta \sum_t \left(r^t - y^t\right) z_h^t, \qquad \Delta w_{hj} = \eta \sum_t \left(r^t - y^t\right) v_h\, z_h^t \left(1 - z_h^t\right) x_j^t$$

Same updates as before.

SLIDE 29

K > 2 classes:

$$o_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}, \qquad y_i^t = \frac{\exp(o_i^t)}{\sum_k \exp(o_k^t)} \equiv P(C_i \mid \mathbf{x}^t)$$

$$E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = -\sum_t \sum_i r_i^t \log y_i^t$$

$$\Delta v_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) z_h^t, \qquad \Delta w_{hj} = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) v_{ih}\right] z_h^t \left(1 - z_h^t\right) x_j^t$$

SLIDE 30

• Theoretically, only one hidden layer is needed
• Multiple hidden layers may simplify the network
• Training proceeds by propagating the error back layer by layer

$$z_{1h} = \text{sigmoid}(\mathbf{w}_{1h}^T\mathbf{x}) = \text{sigmoid}\!\left(\sum_{j=1}^{d} w_{1hj} x_j + w_{1h0}\right), \quad h = 1,\ldots,H_1$$

$$z_{2l} = \text{sigmoid}(\mathbf{w}_{2l}^T\mathbf{z}_1) = \text{sigmoid}\!\left(\sum_{h=1}^{H_1} w_{2lh} z_{1h} + w_{2l0}\right), \quad l = 1,\ldots,H_2$$

$$y = \mathbf{v}^T\mathbf{z}_2 = \sum_{l=1}^{H_2} v_l z_{2l} + v_0$$

SLIDE 31

• Gradient descent can be slow to converge
• Successive weight updates can lead to large oscillations
• Idea: use the previous weight update to smooth the trajectory
• Momentum α ∈ (0.5, 1.0)

$$\Delta w_i^t = -\eta\,\frac{\partial E^t}{\partial w_i} + \alpha\,\Delta w_i^{t-1}$$
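As a one-step sketch (the helper name and the idea of threading the previous update through the caller are mine):

```python
def momentum_step(w, grad, delta_prev, eta=0.1, alpha=0.9):
    """Δw^t = -η ∂E^t/∂w + α Δw^{t-1}; returns updated weights and the update."""
    delta = -eta * grad + alpha * delta_prev   # previous update smooths the trajectory
    return w + delta, delta
```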

SLIDE 32

• Learning rate η ∈ (0.0, 0.3)
• Kept low to avoid oscillations, but learning is slow
• Prefer a high η initially, but lower η as the network converges
• Adaptive learning rate (see the sketch below)
  • Increase η if error decreases
  • Decrease η if error increases
  • Best if E^t is an average over the past few epochs

$$\Delta\eta = \begin{cases} +a & \text{if } E^{t+\tau} < E^t \\ -b\,\eta & \text{otherwise} \end{cases}$$

SLIDE 33

• A network with d inputs, K outputs, and H hidden units has K(H+1) + H(d+1) weights (worked example below)
• Choosing H too high can lead to overfitting
• This is the same bias/variance dilemma as before
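For example, a network with d = 10 inputs, H = 5 hidden units, and K = 3 outputs has K(H+1) + H(d+1) = 3·6 + 5·11 = 18 + 55 = 73 weights.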

SLIDE 34


Previous example: f(x) = sin(6x)

SLIDE 35

• Similar overfitting behavior if training is continued too long
• More and more weights move away from zero
• Overtraining

SLIDE 36

• Cross validation can be used to choose a good H and a good stopping condition
• Training multiple networks, each with different random initial weights, can address local minima in the error

SLIDE 37

• Structuring the network
  • Completely connected networks are harder to train
  • Inputs may be locally correlated (e.g., pixels)
  • Not all inputs need be connected to all hidden units

SLIDE 38

• Normalizing inputs and outputs
• Virtual examples
  • Add training examples based on known invariances
  • E.g., optical character recognition: characters are invariant to rotation, translation, and scale

SLIDE 39

• Modifying the error function: E′ = E + λ_h E_h
• If two examples x and x′ are the same from the domain's point of view (e.g., the character "A" in two different fonts):
  • E_h = [g(x|θ) − g(x′|θ)]²
• If f(x) is known to lie in (a_x, b_x):

$$E_h = \begin{cases} 0 & \text{if } g(x \mid \theta) \in (a_x, b_x) \\ \left(g(x \mid \theta) - a_x\right)^2 & \text{if } g(x \mid \theta) < a_x \\ \left(g(x \mid \theta) - b_x\right)^2 & \text{if } g(x \mid \theta) > b_x \end{cases}$$

SLIDE 40

• Manually try different network structures with a validation set
• Incorporate structural adaptation into the learning algorithm
  • Destructive: start with a large network and gradually remove units and/or connections
  • Constructive: start with a small network and gradually add units and/or connections
  • Continue until performance degrades

SLIDE 41

• Weight decay (destructive)
  • Penalize networks with many non-zero weights

$$\Delta w_i = -\eta\,\frac{\partial E}{\partial w_i} - \lambda w_i, \qquad E' = E + \frac{\lambda}{2}\sum_i w_i^2$$
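As a one-line sketch (the names and the λ value are mine):

```python
def weight_decay_step(w, grad, eta=0.1, lam=0.01):
    """Δw_i = -η ∂E/∂w_i - λ w_i: the decay term pulls unused weights toward zero."""
    return w - eta * grad - lam * w
```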

SLIDE 42

• Dynamic node creation (constructive)
  • Train the network until convergence
  • If error is still high, add another hidden unit
  • Randomly initialize the new unit's weights
  • Do not reinitialize previously-trained weights

SLIDE 43

• Cascade correlation (constructive)
  • Train the network until convergence
  • If error is still high, add another hidden layer with one hidden unit
  • Connect the new unit to all previous hidden units, all inputs, and the output
  • Randomly initialize the new weights
  • Freeze previously-trained weights

SLIDE 44

• Search the hypothesis space H of all network topologies
• |H| is exponential
• Search guided by the performance of the network on a validation set (expensive)
• Previous tuning methods essentially search using operators (adding or removing units and connections)
• Genetic algorithms show promise
• Open research issue

SLIDE 45

• Consider weights w_i as random variables with priors p(w_i) ~ N(0, 1/(2λ))
• Weight decay, ridge regression, regularization
• Cost = data-misfit + λ · complexity

$$\hat{\mathbf{w}}_{MAP} = \arg\max_{\mathbf{w}} \log p(\mathbf{w} \mid \mathcal{X}), \qquad p(\mathbf{w} \mid \mathcal{X}) = \frac{p(\mathcal{X} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{X})}$$

$$\log p(\mathbf{w} \mid \mathcal{X}) = \log p(\mathcal{X} \mid \mathbf{w}) + \log p(\mathbf{w}) - \log p(\mathcal{X})$$

$$p(\mathbf{w}) = \prod_i p(w_i), \qquad p(w_i) = c \cdot \exp\!\left[-\frac{w_i^2}{2\,(1/2\lambda)}\right]$$

$$E' = E + \lambda \|\mathbf{w}\|^2$$

SLIDE 46

• Sequence recognition
  • Classify an ordered sequence
  • E.g., speech recognition: predict the word based on a sequence of auditory data
  • E.g., activity recognition: predict the activity based on a sequence of sensor data
• Sequence reproduction
  • Predict the rest of a sequence
  • E.g., predict the price of a stock into the future based on past performance

SLIDE 47

• Temporal association
  • Predict an output sequence given an input sequence
  • Both input and output change over time
  • E.g., output the sequence of steps to achieve a goal, where the input is the previous steps taken (states visited)

SLIDE 48

• Convert a temporal sequence to a spatial sequence
• Pass a time window of length T over the input sequence
• Invoke/train the network once T inputs are observed, all fed in at once (see the sketch below)

SLIDE 49

• Units have connections to themselves or to units in the same or previous layers
• Acts as a short-term memory of the past

SLIDE 50

• Unfolding in time
  • Convert the recurrent network to an equivalent non-recurrent feed-forward network
  • Okay if the input sequence is not too long

SLIDE 51

• Based loosely on the structure of the human brain
• Multilayer perceptron
• Universal approximator
• Error backpropagation
• Incorporating background knowledge
• Tuning network structure
• Learning time-dependent functions
• Overall, a powerful general-purpose learner, but requires considerable tuning