CS344: Introduction to Artificial Intelligence (associated lab: CS386)
Pushpak Bhattacharyya
CSE Dept., IIT Bombay
Lecture 32: Sigmoid neuron; Feedforward N/W; Error Backpropagation
29th March, 2011
The Perceptron Model
Output = y
y = 1 for Σ wi xi ≥ θ
  = 0 otherwise

[Figure: a perceptron with inputs x1, …, xn-1, xn, weights w1, …, wn-1, wn and threshold θ; the output y is a step function of Σ wi xi, jumping from 0 to 1 at θ.]
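A minimal sketch of this thresholding rule (in Python; the function and variable names are illustrative, not from the lecture):

    # Perceptron output: 1 if the weighted sum reaches the threshold, else 0.
    def perceptron_output(weights, x, theta):
        s = sum(w * xi for w, xi in zip(weights, x))
        return 1 if s >= theta else 0

    print(perceptron_output([1.0, 1.0], [1, 0], 0.5))   # fires: 1.0 >= 0.5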
Perceptron Training Algorithm
1. Start with a random value of w, e.g. <0, 0, 0, …>.
2. Test wxi > 0. If the test succeeds for all i = 1, 2, …, n, then return w.
3. Modify w: wnext = wprev + xfail, and go back to step 2.
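A minimal sketch of this loop, assuming the training vectors have already been augmented and sign-normalised so that w·x > 0 is the required test for every example (function and variable names are illustrative):

    # Perceptron Training Algorithm: add the failed vector to w until all examples pass.
    def perceptron_training(examples, dim, max_iters=10000):
        w = [0.0] * dim                      # step 1: start with <0, 0, 0, ...>
        for _ in range(max_iters):
            failed = None
            for x in examples:
                if sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                    failed = x               # step 2: the test w.x > 0 failed here
                    break
            if failed is None:
                return w                     # test succeeded for all examples
            w = [wi + xi for wi, xi in zip(w, failed)]   # step 3: w_next = w_prev + x_fail
        return w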
Feedforward Network
Example - XOR
[Figure: Calculation of XOR. The output neuron has w1 = 1, w2 = 1, θ = 0.5 and takes x1x̄2 and x̄1x2 as its two inputs, so it computes their OR, i.e. XOR. The hidden neuron for x̄1x2 has w1 = -1, w2 = 1.5, θ = 1 on inputs x1, x2.]

For a single perceptron to compute x̄1x2 (fire only on input (0, 1)), the weights must satisfy:
  0 < θ,  w1 < θ,  w2 ≥ θ,  w1 + w2 < θ
Calculation of x1x̄2

Example - XOR

[Figure: the complete XOR network. The output neuron (w1 = 1, w2 = 1, θ = 0.5) is fed by two hidden neurons computing x1x̄2 and x̄1x2; each hidden neuron has weight 1.5 from its un-negated input, weight -1 from the other input, and θ = 1.]
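A quick check that these threshold units indeed compute XOR (a Python sketch; the function names are illustrative, and the hidden weights 1.5 / -1 with θ = 1 are taken from the figure above):

    def step(weights, x, theta):
        # threshold unit: 1 if the weighted sum reaches theta
        return 1 if sum(w * xi for w, xi in zip(weights, x)) >= theta else 0

    def xor_net(x1, x2):
        h1 = step([1.5, -1.0], [x1, x2], 1.0)    # x1 AND (NOT x2)
        h2 = step([-1.0, 1.5], [x1, x2], 1.0)    # (NOT x1) AND x2
        return step([1.0, 1.0], [h1, h2], 0.5)   # OR of the two hidden outputs

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_net(a, b))           # truth table: 0, 1, 1, 0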
Can Linear Neurons Work?

[Figure: inputs x1, x2 feed two hidden neurons h1, h2, which feed an output neuron; each neuron has a linear I-O characteristic y = m1 x + c1, y = m2 x + c2, y = m3 x + c3 respectively.]

h1 = m1 (w1 x1 + w2 x2) + c1
h2 = m2 (w3 x1 + w4 x2) + c2
Out = w5 h1 + w6 h2 + c3
    = k1 x1 + k2 x2 + k3
Note: the whole structure shown in the earlier slide is reducible to a single neuron with the behaviour

Out = k1 x1 + k2 x2 + k3

Claim: A neuron with linear I-O behaviour can't compute XOR.

Proof: Consider all possible cases [assuming 0.1 and 0.9 as the lower and upper thresholds], writing the linear neuron's output as m(w1 x1 + w2 x2 - θ) + c:

For (0,0), Zero class:  m(0 - θ) + c < 0.1        ⇒  c - mθ < 0.1
For (0,1), One class:   m(w2 - θ) + c > 0.9       ⇒  m·w2 - mθ + c > 0.9
For (1,0), One class:   m(w1 - θ) + c > 0.9       ⇒  m·w1 - mθ + c > 0.9
For (1,1), Zero class:  m(w1 + w2 - θ) + c < 0.1  ⇒  m·w1 + m·w2 - mθ + c < 0.1

Adding the two One-class inequalities gives m(w1 + w2) - 2mθ + 2c > 1.8, while adding the two Zero-class inequalities gives m(w1 + w2) - 2mθ + 2c < 0.2. Both cannot hold, so these equations are inconsistent. Hence XOR can't be computed.

Observations:
1. A linear neuron can't compute XOR.
2. A multilayer FFN with linear neurons is collapsible to a single linear neuron, hence there is no additional power due to the hidden layer.
3. Non-linearity is essential for power.
Multilayer Perceptron
Gradient Descent Technique
Let E be the error at the output layer
E = (1/2) Σj=1..p Σi=1..n (ti - oi)j²

ti = target output; oi = observed output
i is the index going over the n neurons in the outermost layer
j is the index going over the p patterns (1 to p)
Ex: XOR: p = 4 and n = 1
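A minimal sketch of this error computation, here for the XOR pattern set (p = 4, n = 1); the variable names and the example observed outputs are illustrative:

    # Total squared error over p patterns, each with n output neurons.
    def total_error(targets, outputs):
        return 0.5 * sum((t - o) ** 2
                         for tj, oj in zip(targets, outputs)
                         for t, o in zip(tj, oj))

    targets = [[0], [1], [1], [0]]           # XOR: p = 4 patterns, n = 1 output
    outputs = [[0.1], [0.8], [0.7], [0.2]]   # some observed outputs
    print(total_error(targets, outputs))     # 0.5 * (0.01 + 0.04 + 0.09 + 0.04) = 0.09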
Weights in a FF NN
wmn is the weight of the connection from the nth neuron to the mth neuron.

[Figure: neuron n connected to neuron m by weight wmn.]

The E vs W surface is a complex surface in the space defined by the weights wij.

-δE/δwmn gives the direction in which a movement of the operating point in the wmn co-ordinate space will result in maximum decrease in error:

∆wmn ∝ -δE/δwmn
Sigmoid neurons
Gradient descent needs a derivative computation - not possible in the perceptron due to the discontinuous step function used!
Sigmoid neurons, with easy-to-compute derivatives, are used instead.
Computing power comes from the non-linearity of the sigmoid function.

y → 1 as x → ∞
y → 0 as x → -∞
Derivative of Sigmoid function
y = 1 / (1 + e^-x)

dy/dx = -(1 + e^-x)^-2 · (-e^-x)
      = e^-x / (1 + e^-x)²
      = [1 / (1 + e^-x)] · [1 - 1 / (1 + e^-x)]
      = y (1 - y)
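A small sketch checking this identity numerically (names illustrative):

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Compare a finite-difference derivative with y * (1 - y) at a few points.
    for x in (-2.0, 0.0, 1.5):
        h = 1e-6
        numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
        y = sigmoid(x)
        print(x, numeric, y * (1 - y))   # the two values agree closely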
Training algorithm
Initialize weights to random values.
For input x = <xn, xn-1, …, x0>, modify weights as follows:
  Target output = t, Observed output = o

E = (1/2) (t - o)²

∆wi ∝ -δE/δwi

Iterate until E < δ (threshold).
Calculation of ∆wi
δE/δwi = δE/δnet × δnet/δwi,  where net = Σi=1..n wi xi
       = δE/δo × δo/δnet × xi
       = -(t - o) · o (1 - o) · xi

∆wi = -η δE/δwi   (η = learning constant, 0 ≤ η ≤ 1)
    = η (t - o) o (1 - o) xi
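A minimal sketch of this update rule applied repeatedly to a single sigmoid neuron (the function names, the learning rate and the example AND data with a bias input are illustrative, not from the lecture):

    import math

    def sigmoid(net):
        return 1.0 / (1.0 + math.exp(-net))

    def train_sigmoid_neuron(patterns, targets, dim, eta=0.5, epochs=10000):
        w = [0.0] * dim
        for _ in range(epochs):
            for x, t in zip(patterns, targets):
                net = sum(wi * xi for wi, xi in zip(w, x))
                o = sigmoid(net)
                # delta rule: w_i += eta * (t - o) * o * (1 - o) * x_i
                w = [wi + eta * (t - o) * o * (1 - o) * xi for wi, xi in zip(w, x)]
        return w

    # Learn AND, with a constant bias input of 1 appended to each pattern.
    patterns = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]]
    targets = [0, 0, 0, 1]
    w = train_sigmoid_neuron(patterns, targets, dim=3)
    print([round(sigmoid(sum(wi * xi for wi, xi in zip(w, x))), 2) for x in patterns])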
Observations
Does the training technique support our intuition?
The larger the xi, the larger is ∆wi.
The error burden is borne by the weight values.