SLIDE 1

CS344: Introduction to Artificial Intelligence (associated lab: CS386)

Pushpak Bhattacharyya

CSE Dept., IIT Bombay

Lecture 32: Sigmoid neuron; Feedforward N/W; Error Backpropagation

29th March, 2011

SLIDE 2

The Perceptron Model

Output = y

y = 1 if Σwixi ≥ θ
y = 0 otherwise

[Diagram: inputs x1, …, xn−1, xn with weights w1, …, wn−1, wn feeding a threshold unit, threshold = θ]
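As a minimal sketch in Python (function and variable names are mine, not from the slides), the decision rule above can be written as:

```python
# Perceptron decision rule: y = 1 if sum(w_i * x_i) >= theta, else 0.
def perceptron_output(w, x, theta):
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s >= theta else 0
```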

SLIDE 3

[Plot: the perceptron's step-function characteristic: output y vs. Σwixi, jumping from 0 to 1 at the threshold θ]

SLIDE 4

Perceptron Training Algorithm

1. Start with a random value of w, e.g., <0, 0, 0, …>

2. Test for w·xi > 0. If the test succeeds for i = 1, 2, …, n, then return w.

3. Modify w: wnext = wprev + xfail
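A hedged Python sketch of this loop (names are mine; it assumes the usual PTA convention in which the threshold is absorbed as an extra input and vectors of the 0-class are negated, so that every training vector must satisfy w·x > 0):

```python
def train_perceptron(vectors, max_iters=10000):
    """vectors: training inputs (threshold absorbed, 0-class negated)."""
    w = [0.0] * len(vectors[0])                     # step 1: start with w = <0,0,0,...>
    for _ in range(max_iters):
        x_fail = next((x for x in vectors
                       if sum(wi * xi for wi, xi in zip(w, x)) <= 0), None)
        if x_fail is None:                          # step 2: test succeeds for all i
            return w
        w = [wi + xi for wi, xi in zip(w, x_fail)]  # step 3: w_next = w_prev + x_fail
    raise RuntimeError("did not converge (data may not be linearly separable)")
```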
SLIDE 5

Feedforward Network

SLIDE 6

Example - XOR

Calculation of XOR: x1 XOR x2 = x1x̄2 + x̄1x2

Output neuron (OR of the two hidden outputs x1x̄2 and x̄1x2): w1 = 1, w2 = 1, θ = 0.5

Calculation of x̄1x2 (inputs x1, x2): w1 = −1, w2 = 1.5, θ = 1

Truth-table constraints for x̄1x2 (output 1 only for (0,1)):
0 < θ, w2 ≥ θ, w1 < θ, w1 + w2 < θ

SLIDE 7

Example - XOR

[Diagram: the complete XOR network. Output neuron: w1 = 1, w2 = 1, θ = 0.5, taking the hidden outputs x1x̄2 and x̄1x2. Each hidden neuron (θ = 1) takes x1 and x2 with weights 1.5 and −1 for x1x̄2, and −1 and 1.5 for x̄1x2.]
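The hand-set weights above can be checked directly; a small Python sketch (network structure inferred from the slides, helper names mine):

```python
def step(s, theta):                        # perceptron threshold function
    return 1 if s >= theta else 0

def xor_net(x1, x2):
    h1 = step(1.5 * x1 - 1.0 * x2, 1.0)    # x1 AND (NOT x2)
    h2 = step(-1.0 * x1 + 1.5 * x2, 1.0)   # (NOT x1) AND x2
    return step(1.0 * h1 + 1.0 * h2, 0.5)  # OR of the hidden outputs

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))     # prints the XOR truth table: 0, 1, 1, 0
```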

SLIDE 8

Can Linear Neurons Work?

[Diagram: inputs x1, x2 feed hidden neurons h1, h2, which feed the output; each neuron has linear I-O behavior y = mx + c]

h1 = m1(w1x1 + w2x2 + c1)
h2 = m2(w3x1 + w4x2 + c2)
Out = w5h1 + w6h2 + c3 = k1x1 + k2x2 + k3
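A quick numeric check of this reduction (the parameter values are arbitrary, chosen only for illustration):

```python
import math

# arbitrary linear I-O parameters y = m*x + c for the three neurons
m1, c1, m2, c2, c3 = 2.0, 0.3, -1.0, 0.5, 0.1
w1, w2, w3, w4, w5, w6 = 0.4, -0.2, 0.7, 0.9, 1.1, -0.6

def out(x1, x2):                           # the two-layer linear network
    h1 = m1 * (w1 * x1 + w2 * x2 + c1)
    h2 = m2 * (w3 * x1 + w4 * x2 + c2)
    return w5 * h1 + w6 * h2 + c3

# coefficients of the equivalent single linear neuron
k1 = w5 * m1 * w1 + w6 * m2 * w3
k2 = w5 * m1 * w2 + w6 * m2 * w4
k3 = w5 * m1 * c1 + w6 * m2 * c2 + c3

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1), (0.3, -2.0)]:
    assert math.isclose(out(x1, x2), k1 * x1 + k2 * x2 + k3)
```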

SLIDE 9

Note: The whole structure shown in the earlier slide is reducible to a single neuron with the behavior

Out = k1x1 + k2x2 + k3

Claim: A neuron with linear I-O behavior can't compute X-OR.

Proof: Consider all possible cases [assuming 0.1 and 0.9 as the lower and upper thresholds].

For (0,0), Zero class: m(w1·0 + w2·0 − θ) + c < 0.1  ⇒  c − mθ < 0.1

For (0,1), One class: m(w1·0 + w2·1 − θ) + c > 0.9  ⇒  mw2 − mθ + c > 0.9

SLIDE 10

For (1,0), One class: m(w1·1 + w2·0 − θ) + c > 0.9  ⇒  mw1 − mθ + c > 0.9

For (1,1), Zero class: m(w1·1 + w2·1 − θ) + c < 0.1  ⇒  mw1 + mw2 − mθ + c < 0.1

These equations are inconsistent: adding the two One-class inequalities gives mw1 + mw2 − 2mθ + 2c > 1.8, while adding the two Zero-class inequalities gives mw1 + mw2 − 2mθ + 2c < 0.2. Hence X-OR can't be computed.

Observations:

1. A linear neuron can't compute X-OR.

2. A multilayer FFN with linear neurons is collapsible to a single linear neuron, hence there is no additional power due to the hidden layer.

3. Non-linearity is essential for power.

SLIDE 11

Multilayer Perceptron

SLIDE 12

Gradient Descent Technique

Let E be the error at the output layer

E = (1/2) Σ(j=1 to p) Σ(i=1 to n) (tij − oij)²

ti = target output; oi = observed output

i is the index going over the n neurons in the outermost layer.

j is the index going over the p patterns (1 to p).

Ex: XOR: p = 4 and n = 1
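For instance, a sketch computing E for the XOR case (the targets are XOR's truth table; the observed outputs here are made-up illustrative numbers):

```python
t = [[0], [1], [1], [0]]          # targets t[j][i]: one output neuron (n = 1), p = 4 patterns
o = [[0.1], [0.8], [0.7], [0.2]]  # observed outputs o[j][i] (illustrative values)

E = 0.5 * sum((t[j][i] - o[j][i]) ** 2
              for j in range(4) for i in range(1))
print(E)                          # 0.5 * (0.01 + 0.04 + 0.09 + 0.04) = 0.09
```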

SLIDE 13

Weights in a FF NN

wmn is the weight of the connection from the nth neuron to the mth neuron.

[Diagram: neuron n feeding neuron m through weight wmn]

The E vs. W surface is a complex surface in the space defined by the weights wij.

−δE/δwmn gives the direction in which a movement of the operating point in the wmn coordinate space will result in the maximum decrease in error:

∆wmn ∝ −δE/δwmn

SLIDE 14

Sigmoid neurons

Gradient descent needs a derivative computation, which is not possible in the perceptron due to the discontinuous step function used! Sigmoid neurons, with easy-to-compute derivatives, are used instead.

Computing power comes from the non-linearity of the sigmoid function:

y → 1 as x → ∞
y → 0 as x → −∞

SLIDE 15

Derivative of Sigmoid function

y = 1 / (1 + e^(−x))

dy/dx = −(1 + e^(−x))^(−2) · (−e^(−x))
      = e^(−x) / (1 + e^(−x))²
      = [1 / (1 + e^(−x))] · [1 − 1 / (1 + e^(−x))]
      = y(1 − y)
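A numeric sanity check of this identity (my own sketch, using a central-difference approximation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (-2.0, 0.0, 1.5):
    y = sigmoid(x)
    eps = 1e-6
    numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
    print(x, numeric, y * (1 - y))   # the two derivative columns agree closely
```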

SLIDE 16

Training algorithm

Initialize weights to random values.

For input x = <xn, xn−1, …, x0>, modify weights as follows (target output = t, observed output = o):

E = (1/2)(t − o)²

∆wi ∝ −δE/δwi

Iterate until E < δ (threshold).

SLIDE 17

Calculation of ∆wi

δE/δwi = (δE/δnet) × (δnet/δwi), where net = Σ(i=1 to n) wixi

Since o is the sigmoid of net, δo/δnet = o(1 − o), so:

δE/δnet = δ/δnet [(1/2)(t − o)²] = −(t − o) · o(1 − o)
δnet/δwi = xi

∆wi = −η · δE/δwi   (η = learning constant, 0 ≤ η ≤ 1)
    = η (t − o) o(1 − o) xi
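Putting the pieces together, a hedged sketch of one weight update for a single sigmoid neuron (names are mine, not from the slides):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def update_weights(w, x, t, eta=0.5):
    """One gradient-descent step: dw_i = eta * (t - o) * o * (1 - o) * x_i."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    o = sigmoid(net)
    delta = eta * (t - o) * o * (1 - o)
    return [wi + delta * xi for wi, xi in zip(w, x)], o
```

Repeatedly calling update_weights over the training patterns, until E falls below the threshold δ, gives the training loop of the previous slide.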

SLIDE 18

Observations

Does the training technique support our intuition?

The larger the xi, the larger is ∆wi.

The error burden is borne by the weight values corresponding to large input values.