SLIDE 1

Feedforward Networks Gradient Descent Learning and Backpropagation

Christian Jacob

CPSC 565 — Winter 2003
Department of Computer Science
University of Calgary, Canada

Learning by Gradient Descent

Definition of the Learning Problem

Let us start with the simple case of linear cells, which we have introduced as perceptron units. The linear network should learn mappings (for $m = 1, \dots, P$) between

- an input pattern $x^m = (x_1^m, \dots, x_N^m)$ and
- an associated target pattern $T^m$.

SLIDE 2

Figure 1. Perceptron

The output $O_i^m$ of cell $i$ for the input pattern $x^m$ is calculated as

(1) $O_i^m = \sum_k w_{ki}\, x_k^m$

The goal of the learning procedure is that eventually the output $O_i^m$ for input pattern $x^m$ corresponds to the desired output $T_i^m$:

(2) $O_i^m \overset{!}{=} T_i^m = \sum_k w_{ki}\, x_k^m$

Explicit Solution (Linear Network)*

For a linear network, the weights that satisfy Equation (2) can be calculated explicitly using the pseudo-inverse:

(3) $w_{ik} = \frac{1}{P} \sum_{m,l} T_i^m\, (Q^{-1})^{ml}\, x_k^l$

(4) $Q^{ml} = \frac{1}{P} \sum_k x_k^m\, x_k^l$

2 03-Backprop-Printout.nb

SLIDE 3

Correlation Matrix

Here $Q^{ml}$ is a component of the correlation matrix $Q$ of the input patterns:

(5) $Q = \frac{1}{P} \sum_k \begin{pmatrix} x_k^1 x_k^1 & x_k^1 x_k^2 & \cdots & x_k^1 x_k^P \\ \vdots & & & \vdots \\ x_k^P x_k^1 & \cdots & \cdots & x_k^P x_k^P \end{pmatrix}$

You can check that this is indeed a solution by verifying

(6) $\sum_k w_{ik}\, x_k^m = T_i^m$.
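To make Equations (3)-(6) concrete, here is a small numerical sketch in Python (not part of the original Mathematica notebook); the two patterns and targets are made up, and using a single output unit lets us drop the index $i$:

```python
# Tiny numerical check of the explicit pseudo-inverse solution (Eqs. 3-6):
# N = 2 inputs, P = 2 patterns, one output unit.
x = [[1.0, 2.0],   # pattern m=1: (x_1^1, x_2^1)
     [3.0, 1.0]]   # pattern m=2: (x_1^2, x_2^2)
T = [1.0, -1.0]    # targets T^1, T^2
P, N = len(x), len(x[0])

# Correlation matrix Q^{ml} = (1/P) sum_k x_k^m x_k^l   (Eq. 4)
Q = [[sum(x[m][k] * x[l][k] for k in range(N)) / P for l in range(P)]
     for m in range(P)]

# Inverse of the 2x2 matrix Q; it exists because the two patterns
# are linearly independent.
det = Q[0][0] * Q[1][1] - Q[0][1] * Q[1][0]
Qinv = [[ Q[1][1] / det, -Q[0][1] / det],
        [-Q[1][0] / det,  Q[0][0] / det]]

# Weights w_k = (1/P) sum_{m,l} T^m (Q^{-1})^{ml} x_k^l   (Eq. 3)
w = [sum(T[m] * Qinv[m][l] * x[l][k] for m in range(P) for l in range(P)) / P
     for k in range(N)]

# Verify Eq. (6): sum_k w_k x_k^m should reproduce T^m for every pattern
for m in range(P):
    print(sum(w[k] * x[m][k] for k in range(N)))  # approximately T^m
```

With these values the patterns are linearly independent, so $Q$ is invertible and the reconstruction is exact up to rounding.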

Caveat

Note that $Q^{-1}$ only exists for linearly independent input patterns. That means: if there are coefficients $a_1, \dots, a_P$, not all zero, such that for all $k = 1, \dots, N$

(7) $a_1 x_k^1 + a_2 x_k^2 + \dots + a_P x_k^P = 0$,

then the outputs $O_i^m$ cannot be selected independently of each other, and the problem is NOT solvable.


SLIDE 4

Learning by Gradient Descent (Linear Network)

Let us now try to find a learning rule for a linear network with $M$ output units. Starting from a random initial weight setting $\vec{w}_0$, the learning procedure should find a solution weight matrix for Equation (2).

Error Function

For this purpose, we define a cost or error function $E(\vec{w})$:

(8) $E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} (T_i^m - O_i^m)^2 = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \Big( T_i^m - \sum_k w_{ki}\, x_k^m \Big)^2$

$E(\vec{w}) \ge 0$, and it approaches zero as $\vec{w} = \{w_{ki}\}$ comes to satisfy Equation (2). This cost function is a quadratic function in weight space.


SLIDE 5

Paraboloid

Therefore, $E(\vec{w})$ is a paraboloid with a single global minimum.

<< RealTime3D`
Plot3D[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];


SLIDE 6

ContourPlot[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];

[Contour plot of $x^2 + y^2$ over $[-5, 5]^2$]

If the pattern vectors are linearly independent—i.e., a solution for Equation (2) exists—the minimum is at E = 0.

Finding the Minimum: Following the Gradient

We can find the minimum of $E(\vec{w})$ in weight space by following the negative gradient:

(9) $-\nabla_{\vec{w}}\, E(\vec{w}) = -\frac{\partial E(\vec{w})}{\partial \vec{w}}$

We can implement this gradient strategy as follows:


SLIDE 7

Changing a Weight

Each weight $w_{ki} \in \vec{w}$ is changed by $\Delta w_{ki}$, proportional to the gradient of $E$ at the current weight position (i.e., the current settings of all the weights):

(10) $\Delta w_{ki} = -\eta\, \frac{\partial E(\vec{w})}{\partial w_{ki}}$

Steps Towards the Solution

(11) $\Delta w_{ki} = -\eta\, \frac{\partial}{\partial w_{ki}} \Bigg[ \frac{1}{2} \sum_{j=1}^{M} \sum_{m=1}^{P} \Big( T_j^m - \sum_n w_{nj}\, x_n^m \Big)^2 \Bigg]$

$\Delta w_{ki} = -\eta\, \frac{1}{2} \sum_{m=1}^{P} \frac{\partial}{\partial w_{ki}} \Bigg[ \sum_{j=1}^{M} \Big( T_j^m - \sum_n w_{nj}\, x_n^m \Big)^2 \Bigg]$

$\Delta w_{ki} = -\eta\, \frac{1}{2} \sum_{m=1}^{P} 2\, \Big( T_i^m - \sum_n w_{ni}\, x_n^m \Big) (-x_k^m)$

Weight Adaptation Rule

(12) $\Delta w_{ki} = \eta \sum_{m=1}^{P} (T_i^m - O_i^m)\, x_k^m$

The parameter $\eta$ is usually referred to as the learning rate. In this formula, the adaptations of the weights are accumulated over all patterns.
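As an illustration of rule (12), the following Python sketch (not from the original notebook; the three patterns, learning rate, and epoch count are made up) trains a single linear output unit by accumulating the weight changes over all patterns before each update:

```python
# Batch gradient descent with the weight adaptation rule (12) for one
# linear output unit. Patterns, learning rate, and epochs are illustrative.
patterns = [([1.0, 0.0],  1.0),
            ([0.0, 1.0], -1.0),
            ([1.0, 1.0],  0.0)]   # pairs (x^m, T^m)
w = [0.0, 0.0]                    # initial weights w_k
eta = 0.1                         # learning rate

for epoch in range(200):
    delta_w = [0.0, 0.0]
    for x, T in patterns:
        O = sum(wk * xk for wk, xk in zip(w, x))   # output, Eq. (1)
        for k in range(len(w)):
            delta_w[k] += eta * (T - O) * x[k]     # Eq. (12), accumulated
    w = [wk + dk for wk, dk in zip(w, delta_w)]

print(w)  # converges toward the exact solution (1, -1) for this data
```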


SLIDE 8

Delta, LMS Learning

If we change the weights after each presentation of an input pattern to the network, we get a simpler form for the weight update term:

(13) $\Delta w_{ki} = \eta\, (T_i^m - O_i^m)\, x_k^m$

(14) $\Delta w_{ki} = \eta\, \delta_i^m\, x_k^m$

with

(15) $\delta_i^m = T_i^m - O_i^m$.

This learning rule has several names:
- Delta rule
- Adaline rule
- Widrow-Hoff rule
- LMS (least mean squares) rule
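In the same spirit as before, here is a sketch of the online rule (13)-(15), updating after every single pattern presentation (again with made-up data and hyperparameters):

```python
# Online delta / LMS learning: the weights change after each pattern,
# per Eqs. (13)-(15). Data, learning rate, and epochs are illustrative.
patterns = [([1.0, 2.0],  0.5),
            ([2.0, 1.0], -0.5)]   # pairs (x^m, T^m)
w = [0.0, 0.0]
eta = 0.05

for epoch in range(500):
    for x, T in patterns:
        O = sum(wk * xk for wk, xk in zip(w, x))
        d = T - O                          # delta, Eq. (15)
        for k in range(len(w)):
            w[k] += eta * d * x[k]         # Eqs. (13)/(14)

print(w)  # approaches the exact solution (-0.5, 0.5) for this data
```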

Gradient Descent Learning with Nonlinear Cells

We will now extend the gradient descent technique to the case of nonlinear cells, that is, where the activation/output function is a general nonlinear function $g(x)$.

- The input function is denoted by $h(x)$.
- The output function $g(h(x))$ is assumed to be differentiable in $x$.


SLIDE 9

Rewriting the Error Function

The definition of the error function (Equation (8)) can simply be rewritten as follows:

(16) $E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} (T_i^m - O_i^m)^2 = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \Big( T_i^m - g\Big( \sum_k w_{ki}\, x_k^m \Big) \Big)^2$

Weight Gradients

Consequently, we can compute the $w_{ki}$ gradients:

(17) $\frac{\partial E(\vec{w})}{\partial w_{ki}} = -\sum_{m=1}^{P} \big( T_i^m - g(h_i^m) \big) \cdot g'(h_i^m) \cdot x_k^m$

From Weight Gradients to the Learning Rule

This eventually (after some more calculations) shows that the adaptation term $\Delta w_{ki}$ for $w_{ki}$ has the same form as in Equations (10), (13), and (14), namely:

(18) $\Delta w_{ki} = \eta\, \delta_i^m\, x_k^m$

where

(19) $\delta_i^m = (T_i^m - O_i^m) \cdot g'(h_i^m)$


SLIDE 10

Suitable Activation Functions

The calculation of the above $\delta$ terms is easy for the following functions $g$, which are commonly used as activation functions:

Hyperbolic Tangent:

(20) $g(x) = \tanh(\beta x), \qquad g'(x) = \beta\, (1 - g^2(x))$

Plot of the hyperbolic tangent:

Plot[Tanh[x], {x, -5, 5}];



SLIDE 11

Plot of the first derivative:

Plot[Tanh'[x], {x, -5, 5}];


Check for equality with $1 - \tanh^2 x$:

Plot[1 - Tanh[x]^2, {x, -5, 5}];

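The same check can also be done numerically; this small Python sketch (not from the notebook; the value of β and the sample points are arbitrary) compares a central finite difference of tanh(βx) against the closed form β(1 − g²(x)) from Equation (20):

```python
import math

# Numerical spot-check of Eq. (20): for g(x) = tanh(beta*x),
# the derivative is g'(x) = beta * (1 - g(x)^2).
beta = 1.5
h = 1e-6   # step size for the central difference

def g(t):
    return math.tanh(beta * t)

for x in [-2.0, -0.3, 0.0, 0.7, 2.5]:
    numeric = (g(x + h) - g(x - h)) / (2 * h)   # central difference
    analytic = beta * (1.0 - g(x) ** 2)         # Eq. (20)
    print(abs(numeric - analytic) < 1e-6)       # True at every sample
```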

Influence of the $\beta$ parameter:

p1[b_] := Plot[Tanh[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
p2[b_] := Plot[Tanh'[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]


SLIDE 12

Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 1, 5}];

[GraphicsArray output: Tanh[b x] and its derivative for b = 1, …, 5]

SLIDE 13

Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 0.1, 1, 0.1}];

[GraphicsArray output: Tanh[b x] and its derivative for b = 0.1, …, 1]

SLIDE 14

[GraphicsArray output continued from Slide 13]

SLIDE 15

Sigmoid:

(21) $g(x) = \frac{1}{1 + e^{-2 \beta x}}, \qquad g'(x) = 2 \beta\, g(x)\, (1 - g(x))$

Plot of the sigmoid:

sigmoid[x_, b_] := 1 / (1 + E^(-2 b x))
Plot[sigmoid[x, 1], {x, -5, 5}];


Plot of the first derivative:

D[sigmoid[x, b], x]

which evaluates to 2 b E^(-2 x b) / (1 + E^(-2 x b))^2.


SLIDE 16

Plot[D[sigmoid[x, 1], x] // Evaluate, {x, -5, 5}];


Check for equality with $2\, g\, (1 - g)$:

Plot[2 sigmoid[x, 1] (1 - sigmoid[x, 1]), {x, -5, 5}];



SLIDE 17

Influence of the $\beta$ parameter:

p1[b_] := Plot[sigmoid[x, b], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
p2[b_] := Plot[D[sigmoid[x, b], x] // Evaluate, {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 1, 5}];

[GraphicsArray output: sigmoid[x, b] and its derivative for b = 1, …, 5]

SLIDE 18

Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 0.1, 1, 0.1}];

[GraphicsArray output: sigmoid[x, b] and its derivative for b = 0.1, …, 1]

SLIDE 19

[GraphicsArray output continued from Slide 18]

SLIDE 20

δ Update Rule for Sigmoid Units

Using the sigmoidal activation function (with $2\beta = 1$, i.e. the standard logistic function), the $\delta$ update rule takes the simple form:

(22) $\delta_i^m = O_i^m\, (1 - O_i^m)\, (T_i^m - O_i^m)$,

which is used in the weight update rule:

(23) $\Delta w_{ki} = \eta\, \delta_i^m\, x_k^m$
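A one-step numerical sketch of rules (22)-(23) for a single sigmoid unit (standard logistic, i.e. $2\beta = 1$; the input, weights, target, and learning rate are all made-up illustration values):

```python
import math

def g(h):                          # logistic sigmoid, 2*beta = 1
    return 1.0 / (1.0 + math.exp(-h))

x = [0.5, -1.0]   # input pattern x_k
w = [0.2, 0.4]    # weights w_k
T = 1.0           # target
eta = 0.5         # learning rate

h = sum(wk * xk for wk, xk in zip(w, x))
O = g(h)
delta = O * (1.0 - O) * (T - O)                        # Eq. (22)
w = [wk + eta * delta * xk for wk, xk in zip(w, x)]    # Eq. (23)
print(delta, w)
```

Note that the factor $O(1-O)$ shrinks toward zero when the unit saturates near 0 or 1, which slows learning there.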

Learning in Multilayer Networks

Multilayer networks with nonlinear processing elements have a wider capability for solving classification tasks. Learning by error backpropagation is a common method to train multilayer networks.

Error Backpropagation

The backpropagation (BP) algorithm describes an update procedure for the set of weights $\vec{w}$ in a feedforward multilayer network. The network has to learn the input-output patterns $\{x_k^m, T_i^m\}$.

The basis for BP learning is again the gradient descent technique used for perceptron learning, as described above.


SLIDE 21

Notation

We use the following notation:
- $x_k^m$: value of input unit $k$ for training pattern $m$; $k = 1, \dots, N$; $m = 1, \dots, P$
- $H_j$: output of hidden unit $j$
- $O_i$: output of output unit $i$; $i = 1, \dots, M$
- $w_{kj}$: weight of the link from input unit $k$ to hidden unit $j$
- $W_{ji}$: weight of the link from hidden unit $j$ to output unit $i$

Propagating the input through the network

For pattern $m$, hidden unit $j$ receives the input

(24) $h_j^m = \sum_{k=1}^{N} w_{kj}\, x_k^m$

and generates the output

(25) $H_j^m = g(h_j^m) = g\Big( \sum_{k=1}^{N} w_{kj}\, x_k^m \Big)$.

These signals are propagated to the output cells, which receive the signals

(26) $h_i^m = \sum_j W_{ji}\, H_j^m = \sum_j W_{ji}\, g\Big( \sum_{k=1}^{N} w_{kj}\, x_k^m \Big)$

and generate the output

(27) $O_i^m = g(h_i^m) = g\Big( \sum_j W_{ji}\, g\Big( \sum_{k=1}^{N} w_{kj}\, x_k^m \Big) \Big)$
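The forward pass (24)-(27) can be sketched directly in Python; the 2-3-1 layer sizes, the random weights, and the input pattern are illustrative assumptions, with $g = \tanh$:

```python
import math
import random

random.seed(0)
N, NH, M = 2, 3, 1    # input, hidden, output layer sizes
g = math.tanh

w = [[random.uniform(-1, 1) for _ in range(NH)] for _ in range(N)]   # w_kj
W = [[random.uniform(-1, 1) for _ in range(M)] for _ in range(NH)]   # W_ji
x = [0.5, -0.25]      # one input pattern x^m

# Hidden layer: h_j = sum_k w_kj x_k,  H_j = g(h_j)   (Eqs. 24-25)
H = [g(sum(w[k][j] * x[k] for k in range(N))) for j in range(NH)]

# Output layer: h_i = sum_j W_ji H_j,  O_i = g(h_i)   (Eqs. 26-27)
O = [g(sum(W[j][i] * H[j] for j in range(NH))) for i in range(M)]
print(O)  # a single output value in (-1, 1)
```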


SLIDE 22

Error function

We use the known quadratic function as our error function:

(28) $E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} (T_i^m - O_i^m)^2$

Continuing the calculations, we get:

(29) $E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \big( T_i^m - g(h_i^m) \big)^2$

$E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \Big( T_i^m - g\Big( \sum_j W_{ji}\, g\Big( \sum_{k=1}^{N} w_{kj}\, x_k^m \Big) \Big) \Big)^2$

$E(\vec{w}) = \frac{1}{2} \sum_{i=1}^{M} \sum_{m=1}^{P} \Big( T_i^m - g\Big( \sum_j W_{ji}\, H_j^m \Big) \Big)^2$

Updating the weights: hidden—output layer

For the connections from hidden to output cells we can use the delta weight update rule:

(30) $\Delta W_{ji} = -\eta\, \frac{\partial E}{\partial W_{ji}}$

$\Delta W_{ji} = \eta \sum_m (T_i^m - O_i^m)\, g'(h_i^m)\, H_j^m$

$\Delta W_{ji} = \eta \sum_m \delta_i^m\, H_j^m$

with

(31) $\delta_i^m = g'(h_i^m)\, (T_i^m - O_i^m)$


SLIDE 23

Updating the weights: input—hidden layer

(32) $\Delta w_{kj} = -\eta\, \frac{\partial E}{\partial w_{kj}} = -\eta \sum_m \frac{\partial E}{\partial H_j^m} \cdot \frac{\partial H_j^m}{\partial w_{kj}}$

After a few more calculations we get the following weight update rule:

(33) $\Delta w_{kj} = \eta \sum_m \delta_j^m\, x_k^m$

with

(34) $\delta_j^m = g'(h_j^m) \sum_i W_{ji}\, \delta_i^m$


SLIDE 24

The Backpropagation Algorithm

For the BP algorithm we use the following notation (here $m$ indexes the layers, and $\mu$ the training patterns):
- $V_i^m$: output of cell $i$ in layer $m$
- $V_i^0$: corresponds to $x_i$, the $i$-th input component
- $w_{ji}^m$: the connection from $V_j^{m-1}$ to $V_i^m$

Backpropagation Algorithm

Step 1: Initialize all weights with random values.

Step 2: Select a pattern $x^\mu$ and attach it to the input layer ($m = 0$):

(35) $V_j^0 = x_j^\mu, \quad \forall j$

Step 3: Propagate the signals through all layers:

(36) $V_i^m = g(h_i^m) = g\Big( \sum_j w_{ji}^m\, V_j^{m-1} \Big), \quad \forall i, \forall m$

Step 4: Calculate the $\delta$'s of the output layer:

(37) $\delta_i^M = g'(h_i^M)\, (T_i^\mu - V_i^M)$

Step 5: Calculate the $\delta$'s for the inner layers by error backpropagation:

(38) $\delta_i^{m-1} = g'(h_i^{m-1}) \sum_j w_{ij}^m\, \delta_j^m, \quad m = M, M-1, \dots, 2$

Step 6: Adapt all connection weights:

(39) $w_{ji}^{\text{new}} = w_{ji}^{\text{old}} + \Delta w_{ji}^m$ with $\Delta w_{ji}^m = \eta\, \delta_i^m\, V_j^{m-1}$

Step 7: Go back to Step 2 for the next training pattern.
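Steps 1-7 can be collected into a compact Python sketch for a two-layer (2-2-1) network with $g = \tanh$; the three training patterns, learning rate, epoch count, and random seed are all made-up illustration values, and the notation follows the input-hidden-output version of the previous slides:

```python
import math
import random

random.seed(1)
g = math.tanh
def g_prime(h):
    return 1.0 - math.tanh(h) ** 2   # g'(h) for g = tanh

N, NH, M = 2, 2, 1                   # layer sizes: input, hidden, output
patterns = [([1.0, 0.0], [ 0.8]),
            ([0.0, 1.0], [-0.8]),
            ([1.0, 1.0], [ 0.0])]
eta = 0.3

# Step 1: initialize all weights with random values
w = [[random.uniform(-0.5, 0.5) for _ in range(NH)] for _ in range(N)]  # input -> hidden
W = [[random.uniform(-0.5, 0.5) for _ in range(M)] for _ in range(NH)]  # hidden -> output

for epoch in range(5000):
    for x, T in patterns:                       # Step 2: select a pattern
        # Step 3: propagate the signals through all layers
        h_hid = [sum(w[k][j] * x[k] for k in range(N)) for j in range(NH)]
        H = [g(h) for h in h_hid]
        h_out = [sum(W[j][i] * H[j] for j in range(NH)) for i in range(M)]
        O = [g(h) for h in h_out]
        # Step 4: deltas of the output layer (Eq. 37)
        d_out = [g_prime(h_out[i]) * (T[i] - O[i]) for i in range(M)]
        # Step 5: deltas of the hidden layer by backpropagation (Eq. 38)
        d_hid = [g_prime(h_hid[j]) * sum(W[j][i] * d_out[i] for i in range(M))
                 for j in range(NH)]
        # Step 6: adapt all connection weights (Eq. 39)
        for j in range(NH):
            for i in range(M):
                W[j][i] += eta * d_out[i] * H[j]
        for k in range(N):
            for j in range(NH):
                w[k][j] += eta * d_hid[j] * x[k]

def predict(x):
    H = [g(sum(w[k][j] * x[k] for k in range(N))) for j in range(NH)]
    return [g(sum(W[j][i] * H[j] for j in range(NH))) for i in range(M)]

for x, T in patterns:
    print(x, predict(x), T)   # predictions approach the targets
```

Note that, as in the slides, this network has no bias weights; the example task was chosen so that it remains learnable without them.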


SLIDE 25

References

Freeman, J. A. Simulating Neural Networks with Mathematica. Addison-Wesley, Reading, MA, 1994.

Hertz, J., Krogh, A., and Palmer, R. G. Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA, 1991.

Rojas, R. Neural Networks: A Systematic Introduction. Springer-Verlag, Berlin, 1996.