SLIDE 1

CptS 570 – Machine Learning School of EECS Washington State University

SLIDE 2

• Also called multilayer perceptrons
• Inspired by the human brain
  • The brain consists of interconnected neurons
  • The brain still outperforms machines on several tasks, e.g., vision, speech recognition, learning
• Nonparametric estimator
• Classification and regression
• Trained using error backpropagation

SLIDE 3

SLIDE 4

• Processors
  • Computer: typically 1–2 (~10⁹ Hz)
  • Brain: 10¹¹ neurons (~10³ Hz)
• Parallelism
  • Computer: typically little
  • Brain: massive parallelism
• On average, each neuron is connected via synapses to 10⁴ other neurons

SLIDE 5


[Figure from "The Singularity Is Near," Ray Kurzweil]

SLIDE 6

$$y = \sum_{j=1}^{d} w_j x_j + w_0 = \mathbf{w}^T \mathbf{x}$$

where $\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$ and $\mathbf{x} = [1, x_1, \ldots, x_d]^T$.

SLIDE 7

• y = wx + w₀

[Figure: perceptron with input x, bias unit x₀ = +1, weights w and w₀, output y]

SLIDE 8

• If (wx + w₀ > 0) Then y = 1 Else y = 0

[Figure: perceptron with input x, weights w and w₀, output y]

Alternatively, a sigmoid output:

$$y = \text{sigmoid}(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$$

SLIDE 9

Regression (K outputs):

$$y_i = \mathbf{w}_i^T\mathbf{x} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}, \qquad \mathbf{y} = \mathbf{W}\mathbf{x}$$

Classification (K classes, softmax):

$$y_i = \frac{\exp(\mathbf{w}_i^T\mathbf{x})}{\sum_k \exp(\mathbf{w}_k^T\mathbf{x})}, \qquad \text{choose } C_i \text{ if } y_i = \max_k y_k$$

SLIDE 10

• Batch learning (gradient descent)
  • Requires the entire training set
  • Each weight update is based on a pass through the entire training set
• Online learning (stochastic gradient descent)
  • Allows incremental arrival of training examples
  • Weights updated for each training example
  • Adaptive to problems that change over time
  • Tends to converge faster

Per-example update rule:

$$\Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t$$
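To make the contrast concrete, here is a minimal NumPy sketch of the two update styles for a linear layer (function names and shapes are my own; W is K×(d+1), and inputs carry a leading bias component):

```python
import numpy as np

def batch_update(W, X, R, Y, eta):
    """One batch gradient-descent step: error summed over the whole training set."""
    return W + eta * (R - Y).T @ X              # accumulate over all examples t at once

def online_update(W, x_t, r_t, y_t, eta):
    """One stochastic step: w_ij += eta * (r_i - y_i) * x_j for a single example."""
    return W + eta * np.outer(r_t - y_t, x_t)
```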

SLIDE 11

• Regression
• Single linear output

$$E(\mathbf{w} \mid \mathcal{X}) = \frac{1}{2}\sum_t \left(r^t - y^t\right)^2, \qquad y^t = \mathbf{w}^T\mathbf{x}^t$$

$$\Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t$$

SLIDE 12

• Classification
• Single sigmoid output
• K > 2 softmax outputs

Single sigmoid output (two classes):

$$y^t = \text{sigmoid}(\mathbf{w}^T\mathbf{x}^t)$$

$$E(\mathbf{w} \mid \mathcal{X}) = -\sum_t \left[\, r^t \log y^t + (1 - r^t)\log(1 - y^t) \,\right], \qquad \Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t$$

K > 2 softmax outputs (cross entropy):

$$y_i^t = \frac{\exp(\mathbf{w}_i^T\mathbf{x}^t)}{\sum_k \exp(\mathbf{w}_k^T\mathbf{x}^t)}, \qquad E(\{\mathbf{w}_i\} \mid \mathcal{X}) = -\sum_t \sum_i r_i^t \log y_i^t, \qquad \Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t$$

SLIDE 13

• Stochastic online gradient descent for K > 2 classes

For i = 1,…,K
  For j = 0,…,d
    w_ij ← rand(−0.01, 0.01)
Repeat
  For all (x^t, r^t) ∈ X in random order
    For i = 1,…,K
      o_i ← 0
      For j = 0,…,d
        o_i ← o_i + w_ij · x_j^t
    For i = 1,…,K
      y_i ← exp(o_i) / Σ_k exp(o_k)
    For i = 1,…,K
      For j = 0,…,d
        w_ij ← w_ij + η (r_i^t − y_i^t) x_j^t
Until convergence
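The pseudocode translates almost line for line into NumPy. A minimal sketch, assuming one-hot targets R and a fixed epoch count in place of the convergence test (the function name and the max-shift inside the softmax are my additions):

```python
import numpy as np

def train_softmax_sgd(X, R, eta=0.1, epochs=100, rng=None):
    """Stochastic online gradient descent for K > 2 classes.

    X: (N, d) inputs; R: (N, K) one-hot targets.
    Returns W of shape (K, d+1); column 0 holds the bias weights.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, d = X.shape
    K = R.shape[1]
    W = rng.uniform(-0.01, 0.01, size=(K, d + 1))    # w_ij <- rand(-0.01, 0.01)
    Xb = np.hstack([np.ones((N, 1)), X])             # prepend x_0 = 1 bias input
    for _ in range(epochs):                          # "Repeat ... Until convergence"
        for t in rng.permutation(N):                 # random order over examples
            o = W @ Xb[t]                            # o_i = sum_j w_ij x_j
            y = np.exp(o - o.max())                  # softmax, shifted for stability
            y /= y.sum()
            W += eta * np.outer(R[t] - y, Xb[t])     # w_ij += eta (r_i - y_i) x_j
    return W
```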

SLIDE 14

SLIDE 15

No w₀, w₁, w₂ satisfy:

$$\begin{aligned} w_0 &\le 0 \\ w_2 + w_0 &> 0 \\ w_1 + w_0 &> 0 \\ w_1 + w_2 + w_0 &\le 0 \end{aligned}$$

Minsky and Papert (1969): stalled perceptron research for 15 years.

SLIDE 16

• Perceptrons can only approximate linear functions
• But multiple layers of perceptrons can approximate nonlinear functions

[Figure: network with a hidden layer]
SLIDE 17

$$z_h = \text{sigmoid}(\mathbf{w}_h^T\mathbf{x}) = \frac{1}{1 + \exp\!\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}$$

$$y_i = \mathbf{v}_i^T\mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$$

(Rumelhart et al., 1986)
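A minimal NumPy rendering of these two equations (names and shapes are mine; outputs are left linear, as for regression):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W, V):
    """Forward pass for a one-hidden-layer MLP.

    W: (H, d+1) hidden-unit weights, V: (K, H+1) output weights;
    column 0 of each holds the bias term.
    """
    z = sigmoid(W @ np.append(1.0, x))   # z_h = sigmoid(w_h^T x), with x_0 = 1
    y = V @ np.append(1.0, z)            # y_i = v_i^T z, with z_0 = 1
    return y, z
```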

SLIDE 18

• x₁ XOR x₂ = (x₁ AND ~x₂) OR (~x₁ AND x₂)
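One hand-picked set of threshold-unit weights realizing this decomposition (the weights are illustrative, not unique):

```python
import numpy as np

def step(a):
    return (a > 0).astype(int)

def xor_mlp(x1, x2):
    """XOR as (x1 AND ~x2) OR (~x1 AND x2) with threshold units."""
    x = np.array([1, x1, x2])                  # bias input x_0 = 1
    h1 = step(np.array([-0.5, 1, -1]) @ x)     # hidden unit: x1 AND ~x2
    h2 = step(np.array([-0.5, -1, 1]) @ x)     # hidden unit: ~x1 AND x2
    return step(np.array([-0.5, 1, 1]) @ np.array([1, h1, h2]))  # output: h1 OR h2

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_mlp(a, b))             # prints the XOR truth table
```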

SLIDE 19

• MLP can represent any Boolean function
  • Any Boolean function can be expressed as a disjunction of conjunctions
  • Each conjunction implemented by a hidden unit
  • Disjunction implemented by one output unit
  • May need 2^d hidden units in the worst case

SLIDE 20

• MLP with two hidden layers can approximate any function with continuous inputs and outputs
  • First hidden layer computes hyperplanes for isolating regions of instance space
  • Second hidden layer ANDs hyperplanes together to isolate regions
  • Weight from a second-layer hidden unit to the output unit is the value of the function in this region
  • Piecewise constant approximator
• MLP with one sufficiently large hidden layer can learn any nonlinear function

SLIDE 21

• Weights v_ih feeding into output units are learned using the previous methods
• Weights w_hj feeding into hidden units are learned based on error propagated back from the output layer
• Error backpropagation (Rumelhart et al., 1986)

$$z_h = \text{sigmoid}(\mathbf{w}_h^T\mathbf{x}) = \frac{1}{1 + \exp\!\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}, \qquad y_i = \mathbf{v}_i^T\mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$$

Chain rule for the hidden-layer weights:

$$\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial z_h}\,\frac{\partial z_h}{\partial w_{hj}}$$

SLIDE 22

Regression with a single output:

$$E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = \frac{1}{2}\sum_t \left(r^t - y^t\right)^2, \qquad y^t = \sum_{h=1}^{H} v_h z_h^t + v_0, \qquad z_h = \text{sigmoid}(\mathbf{w}_h^T\mathbf{x})$$

Output-layer update (backward pass):

$$\Delta v_h = \eta \sum_t \left(r^t - y^t\right) z_h^t$$

Hidden-layer update via the chain rule:

$$\Delta w_{hj} = -\eta\,\frac{\partial E}{\partial w_{hj}} = -\eta \sum_t \frac{\partial E^t}{\partial y^t}\,\frac{\partial y^t}{\partial z_h^t}\,\frac{\partial z_h^t}{\partial w_{hj}} = \eta \sum_t \left(r^t - y^t\right) v_h\, z_h^t \left(1 - z_h^t\right) x_j^t$$

[Figure: forward pass computes y from x; backward pass propagates the error]
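These updates fit in a few lines of NumPy. A minimal batch-backpropagation sketch (the function name, initialization range, and fixed epoch count are my choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mlp_regression(X, r, H=2, eta=0.2, epochs=300, rng=None):
    """Batch backpropagation for a single-output regression MLP."""
    rng = np.random.default_rng() if rng is None else rng
    N, d = X.shape
    W = rng.uniform(-0.01, 0.01, size=(H, d + 1))   # hidden weights (bias in col 0)
    v = rng.uniform(-0.01, 0.01, size=H + 1)        # output weights (bias in v[0])
    Xb = np.hstack([np.ones((N, 1)), X])            # x_0 = 1
    for _ in range(epochs):
        Z = sigmoid(Xb @ W.T)                       # forward: z_h^t, shape (N, H)
        Zb = np.hstack([np.ones((N, 1)), Z])        # z_0 = 1
        y = Zb @ v                                  # forward: y^t
        err = r - y                                 # (r^t - y^t)
        dv = eta * Zb.T @ err                       # Δv_h = η Σ_t (r - y) z_h
        dW = eta * ((np.outer(err, v[1:]) * Z * (1 - Z)).T @ Xb)
        v += dv                                     # all updates computed before any applied
        W += dW
    return W, v
```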

SLIDE 23

Regression with multiple outputs:

$$E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = \frac{1}{2}\sum_t \sum_i \left(r_i^t - y_i^t\right)^2, \qquad y_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}$$

$$\Delta v_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) z_h^t$$

$$\Delta w_{hj} = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) v_{ih}\right] z_h^t \left(1 - z_h^t\right) x_j^t$$

[Figure: inputs x_j, hidden units z_h, weights w_hj and v_ih, outputs y_i]

SLIDE 24


An epoch is one pass through the training data X. Note: all weight updates are computed before any are applied.

SLIDE 25

• f(x) = sin(6x)
• x^t ~ U(−0.5, 0.5)
• y^t = f(x^t) + N(0, 0.1)
• 2 hidden units
• After 100, 200, and 300 epochs
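To reproduce this setup with the `train_mlp_regression` sketch from Slide 22 (the sample size of 100 is my assumption; the slides do not state it):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-0.5, 0.5, size=(100, 1))            # x^t ~ U(-0.5, 0.5)
r = np.sin(6 * X[:, 0]) + rng.normal(0, 0.1, 100)    # y^t = f(x^t) + N(0, 0.1)
W, v = train_mlp_regression(X, r, H=2, epochs=300)   # 2 hidden units, 300 epochs
```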

SLIDE 26

SLIDE 27

[Figure panels: hyperplanes w_h^T x + w_h0 computed by the hidden units; hidden-unit outputs z_h; inputs v_h z_h to the output unit]

SLIDE 28

• One sigmoid output y^t for P(C₁|x^t), with P(C₂|x^t) ≡ 1 − y^t

$$y^t = \text{sigmoid}\!\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right)$$

$$E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = -\sum_t \left[\, r^t \log y^t + (1 - r^t)\log(1 - y^t) \,\right]$$

$$\Delta v_h = \eta \sum_t \left(r^t - y^t\right) z_h^t, \qquad \Delta w_{hj} = \eta \sum_t \left(r^t - y^t\right) v_h\, z_h^t \left(1 - z_h^t\right) x_j^t$$

Same updates as before.

SLIDE 29

K > 2 classes:

$$o_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}, \qquad y_i^t = \frac{\exp(o_i^t)}{\sum_k \exp(o_k^t)} \equiv P(C_i \mid \mathbf{x}^t)$$

$$E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = -\sum_t \sum_i r_i^t \log y_i^t$$

$$\Delta v_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) z_h^t, \qquad \Delta w_{hj} = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) v_{ih}\right] z_h^t \left(1 - z_h^t\right) x_j^t$$

SLIDE 30

• Theoretically, only one hidden layer is needed
• Multiple hidden layers may simplify the network
• Training proceeds by propagating the error back layer by layer

$$z_{1h} = \text{sigmoid}(\mathbf{w}_{1h}^T\mathbf{x}) = \text{sigmoid}\!\left(\sum_{j=1}^{d} w_{1hj} x_j + w_{1h0}\right), \quad h = 1,\ldots,H_1$$

$$z_{2l} = \text{sigmoid}(\mathbf{w}_{2l}^T\mathbf{z}_1) = \text{sigmoid}\!\left(\sum_{h=1}^{H_1} w_{2lh} z_{1h} + w_{2l0}\right), \quad l = 1,\ldots,H_2$$

$$y = \mathbf{v}^T\mathbf{z}_2 = \sum_{l=1}^{H_2} v_l z_{2l} + v_0$$

SLIDE 31

• Gradient descent can be slow to converge
• Successive weight updates can lead to large oscillations
• Idea: use the previous weight update to smooth the trajectory
• Momentum α ∈ (0.5, 1.0)

$$\Delta w_i^t = -\eta\,\frac{\partial E^t}{\partial w_i} + \alpha\,\Delta w_i^{t-1}$$
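As a one-step sketch (the helper name and the idea of threading the previous update through the caller are mine):

```python
def momentum_step(w, grad, delta_prev, eta=0.1, alpha=0.9):
    """Δw^t = -η ∂E^t/∂w + α Δw^{t-1}; returns updated weights and the update."""
    delta = -eta * grad + alpha * delta_prev   # previous update smooths the trajectory
    return w + delta, delta
```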

SLIDE 32

• Learning rate η ∈ (0.0, 0.3)
• Kept low to avoid oscillations, but learning is slow
• Prefer a high η initially, but lower η as the network converges
• Adaptive learning rate (see the sketch below)
  • Increase η if error decreases
  • Decrease η if error increases
  • Best if E^t is an average over the past few epochs

$$\Delta\eta = \begin{cases} +a & \text{if } E^{t+\tau} < E^t \\ -b\,\eta & \text{otherwise} \end{cases}$$

SLIDE 33

• A network with d inputs, K outputs, and H hidden units has K(H+1) + H(d+1) weights (worked example below)
• Choosing H too high can lead to overfitting
• This is the same bias/variance dilemma as before
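For example, a network with d = 10 inputs, H = 5 hidden units, and K = 3 outputs has K(H+1) + H(d+1) = 3·6 + 5·11 = 18 + 55 = 73 weights.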

SLIDE 34


Previous example: f(x) = sin(6x)

SLIDE 35

• Similar overfitting behavior if training is continued too long
• More and more weights move away from zero
• Overtraining

SLIDE 36

• Cross validation can be used to choose a good H and a good stopping condition
• Training multiple networks, each with different random initial weights, can address local minima in the error

SLIDE 37

• Structuring the network
  • Completely connected networks are harder to train
  • Inputs may be locally correlated (e.g., pixels)
  • Not all inputs need be connected to all hidden units

SLIDE 38

• Normalizing inputs and outputs
• Virtual examples
  • Add training examples based on known invariances
  • E.g., optical character recognition: characters are invariant to rotation, translation, and scale

SLIDE 39

• Modifying the error function: E′ = E + λ_h E_h
• If two examples x and x′ are the same from the domain's point of view (e.g., the character "A" in two different fonts):
  • E_h = [g(x|θ) − g(x′|θ)]²
• If f(x) is known to lie in (a_x, b_x):

$$E_h = \begin{cases} 0 & \text{if } g(x \mid \theta) \in (a_x, b_x) \\ \left(g(x \mid \theta) - a_x\right)^2 & \text{if } g(x \mid \theta) < a_x \\ \left(g(x \mid \theta) - b_x\right)^2 & \text{if } g(x \mid \theta) > b_x \end{cases}$$

SLIDE 40

• Manually try different network structures with a validation set
• Incorporate structural adaptation into the learning algorithm
  • Destructive: start with a large network and gradually remove units and/or connections
  • Constructive: start with a small network and gradually add units and/or connections
  • Continue until performance degrades

SLIDE 41

• Weight decay (destructive)
  • Penalize networks with many non-zero weights

$$\Delta w_i = -\eta\,\frac{\partial E}{\partial w_i} - \lambda w_i, \qquad E' = E + \frac{\lambda}{2}\sum_i w_i^2$$
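As a one-line sketch (the names and the λ value are mine):

```python
def weight_decay_step(w, grad, eta=0.1, lam=0.01):
    """Δw_i = -η ∂E/∂w_i - λ w_i: the decay term pulls unused weights toward zero."""
    return w - eta * grad - lam * w
```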

SLIDE 42

• Dynamic node creation (constructive)
  • Train the network until convergence
  • If error is still high, add another hidden unit
  • Randomly initialize the new unit's weights
  • Do not reinitialize previously-trained weights

SLIDE 43

• Cascade correlation (constructive)
  • Train the network until convergence
  • If error is still high, add another hidden layer with one hidden unit
  • Connect the new unit to all previous hidden units, all inputs, and the output
  • Randomly initialize the new weights
  • Freeze previously-trained weights

SLIDE 44

• Search the hypothesis space H of all network topologies
• |H| is exponential
• Search guided by the performance of the network on a validation set (expensive)
• Previous tuning methods essentially search using operators (adding or removing units and connections)
• Genetic algorithms show promise
• Open research issue

SLIDE 45

• Consider weights w_i as random variables with priors p(w_i) ~ N(0, 1/(2λ))
• Weight decay, ridge regression, regularization
• Cost = data-misfit + λ · complexity

$$\hat{\mathbf{w}}_{MAP} = \arg\max_{\mathbf{w}} \log p(\mathbf{w} \mid \mathcal{X}), \qquad p(\mathbf{w} \mid \mathcal{X}) = \frac{p(\mathcal{X} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{X})}$$

$$\log p(\mathbf{w} \mid \mathcal{X}) = \log p(\mathcal{X} \mid \mathbf{w}) + \log p(\mathbf{w}) - \log p(\mathcal{X})$$

$$p(\mathbf{w}) = \prod_i p(w_i), \qquad p(w_i) = c \cdot \exp\!\left[-\frac{w_i^2}{2\,(1/2\lambda)}\right]$$

$$E' = E + \lambda \|\mathbf{w}\|^2$$

SLIDE 46

• Sequence recognition
  • Classify an ordered sequence
  • E.g., speech recognition: predict the word based on a sequence of auditory data
  • E.g., activity recognition: predict the activity based on a sequence of sensor data
• Sequence reproduction
  • Predict the rest of a sequence
  • E.g., predict the price of a stock into the future based on past performance

SLIDE 47

• Temporal association
  • Predict an output sequence given an input sequence
  • Both input and output change over time
  • E.g., output the sequence of steps to achieve a goal, where the input is the previous steps taken (states visited)

SLIDE 48

• Convert a temporal sequence to a spatial sequence
• Pass a time window of length T over the input sequence
• Invoke/train the network once T inputs are observed, all fed in at once (see the sketch below)

SLIDE 49

• Units have connections to themselves or to units in the same or previous layers
• Acts as a short-term memory of the past

SLIDE 50

• Unfolding in time
  • Convert the recurrent network to an equivalent non-recurrent feed-forward network
  • Okay if the input sequence is not too long

SLIDE 51

• Based loosely on the structure of the human brain
• Multilayer perceptron
• Universal approximator
• Error backpropagation
• Incorporating background knowledge
• Tuning network structure
• Learning time-dependent functions
• Overall, a powerful general-purpose learner, but requires considerable tuning