CptS 570 – Machine Learning School of EECS Washington State University
Also called multilayer perceptrons (MLPs)
Inspired by the human brain
Nonparametric estimator for classification and regression
Trained using error backpropagation
The brain consists of interconnected neurons, and it still outperforms computers on many perceptual tasks
Processors: individual neurons are slow compared to silicon, but there are very many of them
Parallelism: the brain is massively parallel; on average, each neuron is connected via synapses to 10^4 other neurons
“The Singularity is Near” by Ray Kurzweil.
Perceptron:
$$y = \sum_{j=1}^{d} w_j x_j + w_0 = \mathbf{w}^T \mathbf{x}$$
where $\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$ and $\mathbf{x} = [1, x_1, \ldots, x_d]^T$, with bias unit $x_0 = +1$.

For classification, threshold the output:
If $(\mathbf{w}^T \mathbf{x} > 0)$ Then $y = 1$ Else $y = 0$
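A minimal NumPy sketch of this computation (function names and the example weights are illustrative, not from the slides):

```python
import numpy as np

def perceptron_output(w, x):
    """Linear output y = w^T x, with x augmented by a bias unit x0 = +1."""
    x_aug = np.concatenate(([1.0], x))
    return float(np.dot(w, x_aug))

def perceptron_classify(w, x):
    """Threshold rule: y = 1 if w^T x > 0, else 0."""
    return 1 if perceptron_output(w, x) > 0 else 0

# Example: hand-set weights implementing x1 OR x2 (w = [w0, w1, w2])
w = np.array([-0.5, 1.0, 1.0])
print([perceptron_classify(w, np.array(v)) for v in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# -> [0, 1, 1, 1]
```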
K outputs: one weight vector $\mathbf{w}_i$ per output.

Regression:
$$y_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} = \mathbf{w}_i^T \mathbf{x}$$

Classification:
$$o_i = \mathbf{w}_i^T \mathbf{x}, \qquad y_i = \frac{\exp o_i}{\sum_k \exp o_k}, \qquad \text{choose } C_i \text{ if } y_i = \max_k y_k$$
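A sketch of the K-output layer under the same conventions (names and weights are illustrative):

```python
import numpy as np

def k_output_layer(W, x):
    """W is K x (d+1), one row w_i per output; x is the d-dimensional input.
    Returns linear outputs o_i = w_i^T x (regression) and softmax values y_i."""
    x_aug = np.concatenate(([1.0], x))   # bias unit x0 = +1
    o = W @ x_aug                        # o_i = w_i^T x
    e = np.exp(o - o.max())              # shift by max for numerical stability
    y = e / e.sum()                      # y_i = exp(o_i) / sum_k exp(o_k)
    return o, y

W = np.array([[0.1, 1.0, -0.5],
              [-0.2, 0.3, 0.8]])
o, y = k_output_layer(W, np.array([0.5, 1.0]))
print("choose C%d" % np.argmax(y))       # choose C_i if y_i = max_k y_k
```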
Batch learning (gradient descent): compute the weight updates over the entire training set before applying them
Online learning (stochastic gradient descent): update the weights after each training example
Regression: single linear output
$$y^t = \mathbf{w}^T \mathbf{x}^t$$
$$E(\mathbf{w} \mid \mathcal{X}) = \frac{1}{2} \sum_t \left(r^t - y^t\right)^2$$
$$\Delta w_j = \eta \sum_t \left(r^t - y^t\right) x_j^t$$
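A minimal batch gradient-descent sketch of this update (function name, η, and the epoch count are illustrative):

```python
import numpy as np

def train_linear_regression(X, r, eta=0.01, n_epochs=100):
    """Batch gradient descent on E = 1/2 sum_t (r^t - y^t)^2 with y^t = w^T x^t.
    X: N x d inputs, r: length-N targets; returns weights (bias first)."""
    N, d = X.shape
    X_aug = np.hstack([np.ones((N, 1)), X])   # prepend bias unit x0 = +1
    w = np.random.uniform(-0.01, 0.01, d + 1)
    for _ in range(n_epochs):
        y = X_aug @ w                         # y^t = w^T x^t for all t
        w += eta * X_aug.T @ (r - y)          # Δw_j = η Σ_t (r^t − y^t) x_j^t
    return w
```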
Classification: single sigmoid output (two classes)
$$y^t = \text{sigmoid}(\mathbf{w}^T \mathbf{x}^t)$$
$$E(\mathbf{w} \mid \mathcal{X}) = -\sum_t r^t \log y^t + \left(1 - r^t\right) \log\left(1 - y^t\right) \quad \text{(cross entropy)}$$
$$\Delta w_j = \eta \sum_t \left(r^t - y^t\right) x_j^t$$

Classification: K > 2 softmax outputs
$$y_i^t = \frac{\exp \mathbf{w}_i^T \mathbf{x}^t}{\sum_k \exp \mathbf{w}_k^T \mathbf{x}^t}$$
$$E(\{\mathbf{w}_i\} \mid \mathcal{X}) = -\sum_t \sum_i r_i^t \log y_i^t \quad \text{(cross entropy)}$$
$$\Delta w_{ij} = \eta \sum_t \left(r_i^t - y_i^t\right) x_j^t$$

Stochastic gradient descent (online updates):
For i = 1,…,K
    For j = 0,…,d
        w_ij ← rand(-0.01, 0.01)
Repeat
    For all (x^t, r^t) in X in random order
        For i = 1,…,K
            o_i ← 0
            For j = 0,…,d
                o_i ← o_i + w_ij x_j^t
        For i = 1,…,K
            y_i ← exp(o_i) / Σ_k exp(o_k)
        For i = 1,…,K
            For j = 0,…,d
                w_ij ← w_ij + η (r_i^t − y_i) x_j^t
Until convergence
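The pseudocode above, rendered as a runnable NumPy sketch (a fixed epoch count stands in for the convergence test; names and defaults are illustrative):

```python
import numpy as np

def train_softmax_sgd(X, R, eta=0.1, n_epochs=100):
    """Stochastic gradient descent for K > 2 softmax outputs.
    X: N x d inputs, R: N x K one-hot targets."""
    N, d = X.shape
    K = R.shape[1]
    W = np.random.uniform(-0.01, 0.01, (K, d + 1))   # w_ij ← rand(-0.01, 0.01)
    X_aug = np.hstack([np.ones((N, 1)), X])          # bias unit x0 = +1
    for _ in range(n_epochs):                        # "Repeat ... Until convergence"
        for t in np.random.permutation(N):           # (x^t, r^t) in random order
            o = W @ X_aug[t]                          # o_i = Σ_j w_ij x_j^t
            y = np.exp(o - o.max()); y /= y.sum()     # y_i = exp(o_i) / Σ_k exp(o_k)
            W += eta * np.outer(R[t] - y, X_aug[t])   # w_ij += η (r_i^t − y_i) x_j^t
    return W
```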
Learning XOR: no $w_0, w_1, w_2$ satisfy
$$\begin{aligned} w_0 &\le 0 \\ w_1 + w_0 &> 0 \\ w_2 + w_0 &> 0 \\ w_1 + w_2 + w_0 &\le 0 \end{aligned}$$
so a single perceptron cannot compute XOR. Minsky and Papert (1969): stalled perceptron research for 15 years.
Perceptrons can only approximate linear discriminants, but multiple layers of perceptrons can represent nonlinear functions such as XOR.
Hidden Layer
$$z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}, \quad h = 1, \ldots, H$$
$$y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$$
(Rumelhart et al., 1986)
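These two equations as a NumPy forward pass (a sketch; names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(W, V, x):
    """One-hidden-layer MLP. W: H x (d+1) hidden weights w_hj,
    V: K x (H+1) output weights v_ih, x: d-dimensional input."""
    z = sigmoid(W @ np.concatenate(([1.0], x)))   # z_h = sigmoid(w_h^T x)
    y = V @ np.concatenate(([1.0], z))            # y_i = v_i^T z (linear outputs)
    return z, y
```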
x1 XOR x2 = (x1 AND NOT x2) OR (NOT x1 AND x2)
An MLP can represent any Boolean function as such a disjunction of conjunctions: one hidden unit per conjunction, with the output unit computing the disjunction (see the sketch below).
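For example, a two-hidden-unit MLP with threshold units computes XOR via exactly this disjunction of conjunctions (the weights here are hand-set for illustration):

```python
import numpy as np

def step(a):
    return (a > 0).astype(int)

def xor_mlp(x1, x2):
    x = np.array([1.0, x1, x2])                 # bias unit first
    # Hidden layer: z1 = x1 AND NOT x2, z2 = NOT x1 AND x2
    W = np.array([[-0.5, 1.0, -1.0],
                  [-0.5, -1.0, 1.0]])
    z = step(W @ x)
    # Output layer: y = z1 OR z2
    v = np.array([-0.5, 1.0, 1.0])
    return int(step(v @ np.concatenate(([1.0], z))))

print([xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```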
An MLP with two hidden layers can approximate any function by isolating regions of the instance space: first-layer units define hyperplanes, second-layer units combine them to isolate regions, and the weight from each region unit to the output is the value of the function in that region.
An MLP with one sufficiently large hidden layer is already a universal approximator (Hornik et al., 1989).
Weights $v_{ih}$ feeding into the output units are updated directly from the output error.
Weights $w_{hj}$ feeding into the hidden units are updated by propagating the error back through the output layer with the chain rule: error backpropagation.
$$y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$$
$$z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}$$
$$\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial y_i} \frac{\partial y_i}{\partial z_h} \frac{\partial z_h}{\partial w_{hj}}$$
Regression with a single output. Forward pass:
$$z_h^t = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}^t), \qquad y^t = \sum_{h=1}^{H} v_h z_h^t + v_0$$
$$E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = \frac{1}{2} \sum_t \left(r^t - y^t\right)^2$$
Backward pass:
$$\Delta v_h = \eta \sum_t \left(r^t - y^t\right) z_h^t$$
$$\Delta w_{hj} = -\eta \frac{\partial E}{\partial w_{hj}} = -\eta \sum_t \frac{\partial E}{\partial y^t} \frac{\partial y^t}{\partial z_h^t} \frac{\partial z_h^t}{\partial w_{hj}} = \eta \sum_t \left(r^t - y^t\right) v_h\, z_h^t \left(1 - z_h^t\right) x_j^t$$
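A compact NumPy sketch of this forward/backward procedure for single-output regression, using online (per-example) updates; the names, η, H, and epoch count are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mlp_regression(X, r, H=2, eta=0.1, n_epochs=300):
    """Backpropagation for an MLP with one hidden layer, single linear output.
    X: N x d inputs, r: length-N targets."""
    N, d = X.shape
    W = np.random.uniform(-0.01, 0.01, (H, d + 1))  # hidden weights w_hj
    v = np.random.uniform(-0.01, 0.01, H + 1)       # output weights v_h (v[0] = v_0)
    for _ in range(n_epochs):
        for t in np.random.permutation(N):
            x = np.concatenate(([1.0], X[t]))       # bias unit x0 = +1
            z = sigmoid(W @ x)                      # forward: z_h^t
            z_aug = np.concatenate(([1.0], z))
            y = v @ z_aug                           # forward: y^t
            err = r[t] - y                          # backward pass
            dv = eta * err * z_aug                  # Δv_h = η (r−y) z_h
            dW = eta * err * np.outer(v[1:] * z * (1 - z), x)  # Δw_hj = η (r−y) v_h z_h (1−z_h) x_j
            v += dv
            W += dW
    return W, v
```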
Regression with multiple outputs:
$$y_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}$$
$$E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = \frac{1}{2} \sum_t \sum_i \left(r_i^t - y_i^t\right)^2$$
$$\Delta v_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) z_h^t$$
$$\Delta w_{hj} = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) v_{ih}\right] z_h^t \left(1 - z_h^t\right) x_j^t$$
[Figure: network with inputs $x_j$, first-layer weights $w_{hj}$, hidden units $z_h$, output weights $v_{ih}$, outputs $y_i$]
An epoch is one complete pass through the training data X. Note: in batch learning, all weight updates are computed before any are applied.
Example: $f(x) = \sin(6x)$, $x^t \sim U(-0.5, 0.5)$, $y^t = f(x^t) + \mathcal{N}(0, 0.1)$, 2 hidden units. [Figure: network fit after 100, 200, and 300 epochs]
Two-class discrimination: [Figure: hyperplanes $\mathbf{w}_h^T \mathbf{x} + w_{h0}$ computed by the hidden units, their outputs $z_h$, and the inputs $v_h z_h$ to the output unit]
One sigmoid output $y^t$ estimates $P(C_1 \mid \mathbf{x}^t)$, and $P(C_2 \mid \mathbf{x}^t) = 1 - y^t$.
$$y^t = \text{sigmoid}\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right)$$
$$E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = -\sum_t r^t \log y^t + \left(1 - r^t\right) \log\left(1 - y^t\right)$$
The update equations are the same as before:
$$\Delta v_h = \eta \sum_t \left(r^t - y^t\right) z_h^t, \qquad \Delta w_{hj} = \eta \sum_t \left(r^t - y^t\right) v_h z_h^t \left(1 - z_h^t\right) x_j^t$$
K > 2 classes:
$$o_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}, \qquad y_i^t = \frac{\exp o_i^t}{\sum_k \exp o_k^t}$$
$$E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = -\sum_t \sum_i r_i^t \log y_i^t$$
$$\Delta v_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) z_h^t, \qquad \Delta w_{hj} = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) v_{ih}\right] z_h^t \left(1 - z_h^t\right) x_j^t$$
Theoretically, only one hidden layer is needed, but multiple hidden layers may simplify the network. Training proceeds by propagating the error back through each layer in turn:
$$z_{1h} = \text{sigmoid}(\mathbf{w}_{1h}^T \mathbf{x}) = \text{sigmoid}\left(\sum_{j=1}^{d} w_{1hj} x_j + w_{1h0}\right), \quad h = 1, \ldots, H_1$$
$$z_{2l} = \text{sigmoid}(\mathbf{w}_{2l}^T \mathbf{z}_1) = \text{sigmoid}\left(\sum_{h=1}^{H_1} w_{2lh} z_{1h} + w_{2l0}\right), \quad l = 1, \ldots, H_2$$
$$y = \mathbf{v}^T \mathbf{z}_2 = \sum_{l=1}^{H_2} v_l z_{2l} + v_0$$
Gradient descent can be slow to converge, and successive weight updates can lead to large oscillations.
Idea: use the previous weight update to smooth the current one.
Momentum, $\alpha \in (0.5, 1.0)$:
$$\Delta w_i^t = -\eta \frac{\partial E^t}{\partial w_i} + \alpha\, \Delta w_i^{t-1}$$
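The momentum update as a short sketch (grad stands in for $\partial E^t / \partial w$; names and defaults are illustrative):

```python
def momentum_step(w, grad, prev_dw, eta=0.1, alpha=0.9):
    """Δw^t = -η ∂E^t/∂w + α Δw^{t-1}; returns (new weights, step taken)."""
    dw = -eta * grad + alpha * prev_dw
    return w + dw, dw
```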
Learning rate $\eta \in (0.0, 0.3)$: kept low to avoid oscillations, but a low $\eta$ makes convergence slow.
Prefer a high $\eta$ initially, then lower $\eta$ as the network approaches a minimum.
Adaptive learning rate:
$$\Delta \eta = \begin{cases} +a & \text{if } E^{t+\tau} < E^t \\ -b\,\eta & \text{otherwise} \end{cases}$$
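The adaptive rule as a sketch (the constants a and b are illustrative):

```python
def adapt_learning_rate(eta, err_now, err_prev, a=0.01, b=0.1):
    """Δη = +a if the error decreased (E^{t+τ} < E^t), else η is scaled down by b."""
    return eta + a if err_now < err_prev else eta - b * eta
```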
A network with d inputs, K outputs, and H hidden units has H(d + 1) + K(H + 1) weights.
Choosing H too high can lead to overfitting; this is the same bias/variance dilemma as in other nonparametric estimators.
Previous example: f(x) = sin(6x). Similar overfitting behavior occurs if training continues too long: more and more weights move away from zero, and the network fits the noise. This is called overtraining.
Cross-validation can be used to choose a good network size: train multiple networks, each with a different number of hidden units, and select the one with the lowest validation error.
Structuring the network
Normalizing inputs and outputs
Virtual examples: generate additional training examples from known invariances
Example: characters are invariant to rotation, translation, and scale
Modifying the error function: E’ = E + λ_h E_h, where E_h penalizes violations of the hint.
If two examples x and x’ are the same from the point of view of the hint, require f(x) = f(x’).
If we know f(x) lies in (a_x, b_x), penalize outputs outside that interval:
$$E_h = \begin{cases} 0 & \text{if } g(\mathbf{x} \mid \theta) \in [a_x, b_x] \\ \left(g(\mathbf{x} \mid \theta) - a_x\right)^2 & \text{if } g(\mathbf{x} \mid \theta) < a_x \\ \left(g(\mathbf{x} \mid \theta) - b_x\right)^2 & \text{if } g(\mathbf{x} \mid \theta) > b_x \end{cases}$$
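The penalty $E_h$ as a short sketch (the function name is illustrative):

```python
def hint_penalty(g, a_x, b_x):
    """E_h: zero inside [a_x, b_x], squared distance to the violated bound outside."""
    if g < a_x:
        return (g - a_x) ** 2
    if g > b_x:
        return (g - b_x) ** 2
    return 0.0
```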
Manually try different network structures with cross-validation, or incorporate structural adaptation into the learning algorithm:
Destructive methods remove units and/or connections
Constructive methods add units and/or connections
Weight decay (destructive)
$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i} - \lambda w_i, \qquad E' = E + \frac{\lambda}{2} \sum_i w_i^2$$
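Weight decay as a sketch (names and defaults are illustrative):

```python
def weight_decay_step(w, grad, eta=0.1, lam=0.01):
    """Δw_i = -η ∂E/∂w_i − λ w_i: weights decay toward zero unless the
    error gradient pushes back, implementing the penalized error E'."""
    return w - eta * grad - lam * w
```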
Dynamic node creation (constructive)
Cascade correlation (constructive): each new hidden unit is connected to all of the inputs and to the previously added hidden units, and feeds the output; previously trained weights are frozen.
Search the hypothesis space H of all network structures: |H| is exponential, so the search must be guided by the performance of each network on validation data. The previous tuning methods essentially search this space greedily. Genetic algorithms show promise here; this remains an open research issue.
Consider the weights w_i as random variables with a prior distribution p(w). This gives a Bayesian view of weight decay, ridge regression, and regularization: Cost = data-misfit + λ · complexity.
$$\hat{\mathbf{w}}_{MAP} = \arg\max_{\mathbf{w}} \log p(\mathbf{w} \mid \mathcal{X})$$
$$p(\mathbf{w} \mid \mathcal{X}) = \frac{p(\mathcal{X} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{X})}$$
$$\log p(\mathbf{w} \mid \mathcal{X}) = \log p(\mathcal{X} \mid \mathbf{w}) + \log p(\mathbf{w}) + C$$
With a Gaussian prior $p(w_i) = c \exp(-\lambda w_i^2)$, maximizing the posterior is equivalent to minimizing
$$E' = E + \lambda \|\mathbf{w}\|^2$$
Sequence recognition: producing a label for an input sequence, e.g., recognizing a word from a sequence of auditory data, or an activity from a sequence of sensor data
Sequence reproduction: predicting the rest of a sequence from its past values
Temporal association: producing a particular output sequence for a given input sequence, e.g., where the input is the previous steps taken (states visited)
Time delay neural networks: convert the temporal sequence into a spatial one by passing a time window of recent inputs to the network, then invoke/train the network as usual (see the sketch below).
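A sketch of the time-window conversion (the window length T is an illustrative parameter):

```python
import numpy as np

def time_windows(seq, T=3):
    """Convert a temporal sequence into spatial input vectors:
    each row holds the T most recent values, ready for an ordinary MLP."""
    seq = np.asarray(seq)
    return np.array([seq[t - T:t] for t in range(T, len(seq) + 1)])

print(time_windows([1, 2, 3, 4, 5], T=3))
# [[1 2 3]
#  [2 3 4]
#  [3 4 5]]
```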
Recurrent networks: units have connections to themselves or to units in earlier layers. The recurrent state acts as a short-term memory of the past.
Unfolding in time: convert the recurrent network into an equivalent non-recurrent feed-forward network, with one copy of the network per time step, and train it with backpropagation (through time). This becomes impractical when sequences are too long.
Summary:
Based loosely on the structure of the human brain
Multilayer perceptron: a universal approximator for classification and regression
Trained by error backpropagation
Background knowledge can be incorporated (hints, virtual examples)
Network structure can be tuned (destructive and constructive methods)
Time-dependent functions can be learned (time windows, recurrent networks)
Overall, a powerful general-purpose learner, though training can be slow and the learned weights are hard to interpret