
Artificial Neural Networks (Part 2) Gradient Descent Learning and Backpropagation

Christian Jacob

CPSC 533 — Winter 2001

Learning by Gradient Descent

Definition of the Learning Problem

Let us start with the simple case of linear cells, which we have introduced as perceptron units. The linear network should learn mappings (for $\mu = 1, \dots, P$) between

- an input pattern $x^\mu = (x_1^\mu, \dots, x_N^\mu)$ and
- an associated target pattern $T^\mu$.


Figure 1. Perceptron

The output $O_i^\mu$ of cell $i$ for the input pattern $x^\mu$ is calculated as

(1) $O_i^\mu = \sum_k w_{ki} \, x_k^\mu$

The goal of the learning procedure is that eventually the output $O_i^\mu$ for input pattern $x^\mu$ corresponds to the desired output $T_i^\mu$:

(2) $O_i^\mu \overset{!}{=} T_i^\mu = \sum_k w_{ki} \, x_k^\mu$

Explicit Solution (Linear Network)

For a linear network, the weights that satisfy Equation (2) can be calculated explicitly using the pseudo-inverse:

(3) $w_{ik} = \frac{1}{P} \sum_{\mu\lambda} T_i^\mu \, (Q^{-1})_{\mu\lambda} \, x_k^\lambda$
05.2-Backprop-Printout.nb


(4) $Q_{\mu\lambda} = \frac{1}{P} \sum_k x_k^\mu \, x_k^\lambda$

Correlation Matrix

Here $Q_{\mu\lambda}$ is a component of the correlation matrix $Q$ of the input patterns:

(5) $Q = \frac{1}{P} \sum_k \begin{pmatrix} x_k^1 x_k^1 & x_k^1 x_k^2 & \cdots & x_k^1 x_k^P \\ \vdots & & & \vdots \\ x_k^P x_k^1 & \cdots & \cdots & x_k^P x_k^P \end{pmatrix}$

You can check that this is indeed a solution by verifying

(6) $\sum_k w_{ik} \, x_k^\mu = T_i^\mu.$

Caveat

Note that $Q^{-1}$ only exists for linearly independent input patterns. That means, if there are coefficients $a_\mu$, not all zero, such that for all $k = 1, \dots, N$

(7) $a_1 x_k^1 + a_2 x_k^2 + \dots + a_P x_k^P = 0,$

then the outputs $O_i^\mu$ cannot be selected independently from each other, and the problem is NOT solvable.
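As an illustration of Equations (3) to (6), here is a small pure-Python sketch, added to this printout rather than taken from it, that solves a tiny linear network exactly via the inverse correlation matrix; the patterns and targets are made-up example values:

```python
# Hypothetical illustration of Eqs. (3)-(6): explicit solution for a
# linear network with two linearly independent patterns and one output.
P, N = 2, 2
x = [[1.0, 0.0],                 # x[mu][k]: input pattern mu
     [0.0, 1.0]]
T = [0.5, -0.3]                  # target T^mu for the single output unit

# Correlation matrix, Eq. (4): Q[mu][lam] = (1/P) * sum_k x_k^mu * x_k^lam
Q = [[sum(x[m][k] * x[l][k] for k in range(N)) / P for l in range(P)]
     for m in range(P)]

# Invert the 2x2 matrix Q (it exists because the patterns are independent)
det = Q[0][0] * Q[1][1] - Q[0][1] * Q[1][0]
Qinv = [[ Q[1][1] / det, -Q[0][1] / det],
        [-Q[1][0] / det,  Q[0][0] / det]]

# Weights, Eq. (3): w_k = (1/P) * sum_{mu,lam} T^mu * Qinv[mu][lam] * x_k^lam
w = [sum(T[m] * Qinv[m][l] * x[l][k] for m in range(P) for l in range(P)) / P
     for k in range(N)]

# Check Eq. (6): the network reproduces every target exactly
outputs = [sum(w[k] * x[m][k] for k in range(N)) for m in range(P)]
print(outputs, T)
```

With linearly dependent patterns, `det` would be zero and the inversion would fail, which is exactly the caveat above.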

Learning by Gradient Descent (Linear Network)

Let us now try to find a learning rule for a linear network with $M$ output units. Starting from a random initial weight setting $\vec{w}_0$, the learning procedure should find a solution weight matrix for Equation (2).

Error Function

For this purpose, we define a cost or error function $E(\vec{w})$:


(8) $E(\vec{w}) = \frac{1}{2} \sum_{m=1}^{M} \sum_{\mu=1}^{P} (T_m^\mu - O_m^\mu)^2$

$E(\vec{w}) = \frac{1}{2} \sum_{m=1}^{M} \sum_{\mu=1}^{P} \Bigl( T_m^\mu - \sum_k w_{km} \, x_k^\mu \Bigr)^2$

$E(\vec{w}) \ge 0$ will approach zero as $\vec{w} = \{w_{km}\}$ satisfies Equation (2). This cost function is a quadratic function in weight space.

Paraboloid

Therefore, $E(\vec{w})$ is a paraboloid with a single global minimum.

<< RealTime3D`
Plot3D[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];


ContourPlot[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];
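To make the descent concrete, here is a small pure-Python sketch (an addition, not part of the original notebook) that follows the negative gradient of the plotted paraboloid $E(w_1, w_2) = w_1^2 + w_2^2$ and converges to its single global minimum at the origin:

```python
# Gradient descent on the paraboloid E(w1, w2) = w1^2 + w2^2.
# The gradient is (2*w1, 2*w2); stepping against it shrinks E every step.
eta = 0.1                          # learning rate
w = [4.0, -3.0]                    # arbitrary starting point
for step in range(100):
    grad = [2 * w[0], 2 * w[1]]    # dE/dw1, dE/dw2
    w = [w[0] - eta * grad[0], w[1] - eta * grad[1]]

E = w[0] ** 2 + w[1] ** 2
print(w, E)                        # both coordinates approach 0
```

Each step multiplies every coordinate by $(1 - 2\eta)$, so the error decays geometrically toward the minimum.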


If the pattern vectors are linearly independent, i.e., a solution for Equation (2) exists, the minimum is at $E = 0$.

Finding the Minimum: Following the Gradient

We can find the minimum of $E(\vec{w})$ in weight space by following the negative gradient

(9) $-\nabla_{\vec{w}} E(\vec{w}) = -\frac{\partial E(\vec{w})}{\partial \vec{w}}$

We can implement this gradient strategy as follows:

Changing a Weight

Each weight $w_{ki} \in \vec{w}$ is changed by $\Delta w_{ki}$, proportionate to the gradient of $E$ at the current weight position (i.e., the current settings of all the weights):


(10) $\Delta w_{ki} = -\eta \, \frac{\partial E(\vec{w})}{\partial w_{ki}}$

Steps Towards the Solution

(11) $\Delta w_{ki} = -\eta \, \frac{\partial}{\partial w_{ki}} \Biggl( \frac{1}{2} \sum_{m=1}^{M} \sum_{\mu=1}^{P} \Bigl( T_m^\mu - \sum_n w_{nm} \, x_n^\mu \Bigr)^2 \Biggr)$

$\Delta w_{ki} = -\eta \, \frac{1}{2} \sum_{\mu=1}^{P} \frac{\partial}{\partial w_{ki}} \Biggl( \sum_{m=1}^{M} \Bigl( T_m^\mu - \sum_n w_{nm} \, x_n^\mu \Bigr)^2 \Biggr)$

$\Delta w_{ki} = -\eta \, \frac{1}{2} \sum_{\mu=1}^{P} 2 \, \Bigl( T_i^\mu - \sum_n w_{ni} \, x_n^\mu \Bigr) \, (-x_k^\mu)$

Weight Adaptation Rule

(12) $\Delta w_{ki} = \eta \sum_{\mu=1}^{P} (T_i^\mu - O_i^\mu) \, x_k^\mu$

The parameter $\eta$ is usually referred to as the learning rate. In this formula, the adaptations of the weights are accumulated over all patterns.

Delta, LMS Learning

If we change the weights after each presentation of an input pattern to the network, we get a simpler form for the weight update term:

(13) $\Delta w_{ki} = \eta \, (T_i^\mu - O_i^\mu) \, x_k^\mu$

or

(14) $\Delta w_{ki} = \eta \, \delta_i^\mu \, x_k^\mu$

with

(15) $\delta_i^\mu = T_i^\mu - O_i^\mu.$


This learning rule has several names:

- Delta rule
- Adaline rule
- Widrow-Hoff rule
- LMS (least mean square) rule.
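The online update of Equations (13) to (15) can be sketched as follows; this is an illustrative pure-Python addition (not the notebook's own code), with made-up patterns whose unique solution is $w = (1, -1)$:

```python
# Delta / LMS rule for one linear output unit: Dw_k = eta * (T - O) * x_k,
# applied after each pattern presentation (online learning).
eta = 0.2
patterns = [([1.0, 0.0], 1.0),     # (input vector x, target T)
            ([0.0, 1.0], -1.0),
            ([1.0, 1.0], 0.0)]
w = [0.0, 0.0]                     # initial weights

for epoch in range(200):           # repeated presentation of all patterns
    for x, T in patterns:
        O = sum(wk * xk for wk, xk in zip(w, x))             # Eq. (1)
        delta = T - O                                        # Eq. (15)
        w = [wk + eta * delta * xk for wk, xk in zip(w, x)]  # Eq. (14)

print([round(wk, 3) for wk in w])  # converges toward w = [1, -1]
```

Because the three patterns are consistent with a single weight vector, the online updates settle on that exact solution.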

Gradient Descent Learning with Nonlinear Cells

We will now extend the gradient descent technique to the case of nonlinear cells, that is, where the activation/output function is a general nonlinear function $g(x)$.

- The input function is denoted by $h(x)$.
- The output function $g(h(x))$ is assumed to be differentiable in $x$.

Rewriting the Error Function

The definition of the error function (Equation (8)) can be simply rewritten as follows:

(16) $E(\vec{w}) = \frac{1}{2} \sum_{m=1}^{M} \sum_{\mu=1}^{P} (T_m^\mu - O_m^\mu)^2$

$E(\vec{w}) = \frac{1}{2} \sum_{m=1}^{M} \sum_{\mu=1}^{P} \Bigl( T_m^\mu - g\Bigl( \sum_k w_{km} \, x_k^\mu \Bigr) \Bigr)^2$

Weight Gradients

Consequently, we can compute the $w_{ki}$ gradients:

(17) $\frac{\partial E(\vec{w})}{\partial w_{ki}} = -\sum_{\mu=1}^{P} (T_i^\mu - g(h_i^\mu)) \cdot g'(h_i^\mu) \cdot x_k^\mu$


From Weight Gradients to the Learning Rule

This eventually (after some more calculations) shows us that the adaptation term $\Delta w_{ki}$ for $w_{ki}$ has the same form as in Equations (10), (13), and (14), namely:

(18) $\Delta w_{ki} = \eta \, \delta_i^\mu \, x_k^\mu$

where

(19) $\delta_i^\mu = (T_i^\mu - O_i^\mu) \cdot g'(h_i^\mu)$
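A sketch of Equations (18) and (19) for a single nonlinear unit, as an illustrative Python addition; the choice of $g = \tanh$, the single training pattern, and its target are assumptions for the example, not values from the notebook:

```python
import math

# Gradient descent for one nonlinear unit: delta = (T - O) * g'(h).
def g(h):                            # activation: tanh (beta = 1)
    return math.tanh(h)

def g_prime(h):                      # g'(h) = 1 - tanh^2(h), cf. Eq. (20)
    return 1.0 - math.tanh(h) ** 2

eta = 0.5
x, T = [0.5, -1.0], 0.8              # one training pattern and its target
w = [0.1, 0.1]                       # initial weights

for step in range(500):
    h = sum(wk * xk for wk, xk in zip(w, x))
    delta = (T - g(h)) * g_prime(h)  # Eq. (19)
    w = [wk + eta * delta * xk for wk, xk in zip(w, x)]  # Eq. (18)

O_final = g(sum(wk * xk for wk, xk in zip(w, x)))
print(O_final)                       # the output approaches the target 0.8
```

The only change from the linear delta rule is the extra factor $g'(h)$ in the error term.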

Suitable Activation Functions

The calculation of the above $\delta$ terms is easy for the following functions $g$, which are commonly used as activation functions:

Hyperbolic Tangent:

(20) $g(x) = \tanh(\beta x), \qquad g'(x) = \beta \, (1 - g^2(x))$

Hyperbolic tangent plot:

Plot[Tanh[x], {x, -5, 5}];



Plot of the first derivative:

Plot[Tanh'[x], {x, -5, 5}];

Check for equality with $1 - \tanh^2 x$:

Plot[1 - Tanh[x]^2, {x, -5, 5}];


Influence of the $\beta$ parameter:

p1[b_] := Plot[Tanh[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
p2[b_] := Plot[Tanh'[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]


Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 1, 5}];

Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 0.1, 1, 0.1}];


Sigmoid:

(21) $g(x) = \frac{1}{1 + e^{-2 \beta x}}, \qquad g'(x) = 2 \beta \, g(x) \, (1 - g(x))$

Sigmoid plot:

sigmoid[x_, b_] := 1 / (1 + E^(-2 b x))
Plot[sigmoid[x, 1], {x, -5, 5}];


Plot of the first derivative:


D[sigmoid[x, b], x]

$\frac{2 \, b \, e^{-2 b x}}{(1 + e^{-2 b x})^2}$

Plot[D[sigmoid[x, 1], x] // Evaluate, {x, -5, 5}];


Check for equality with $2 \, g \, (1 - g)$:

Plot[2 sigmoid[x, 1] (1 - sigmoid[x, 1]), {x, -5, 5}];
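The same equality checks can be done numerically. This Python fragment, added here alongside the Mathematica plots, compares central finite differences of $\tanh(\beta x)$ and the sigmoid against the closed forms $\beta(1 - g^2)$ and $2\beta g(1 - g)$ from Equations (20) and (21):

```python
import math

def tanh_g(x, b):
    return math.tanh(b * x)

def sigmoid(x, b):
    return 1.0 / (1.0 + math.exp(-2.0 * b * x))

eps, b = 1e-6, 1.5
for x in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    # numeric derivatives via central differences
    num_t = (tanh_g(x + eps, b) - tanh_g(x - eps, b)) / (2 * eps)
    num_s = (sigmoid(x + eps, b) - sigmoid(x - eps, b)) / (2 * eps)
    # closed forms from Equations (20) and (21)
    ana_t = b * (1.0 - tanh_g(x, b) ** 2)
    ana_s = 2.0 * b * sigmoid(x, b) * (1.0 - sigmoid(x, b))
    assert abs(num_t - ana_t) < 1e-5
    assert abs(num_s - ana_s) < 1e-5
print("derivative identities hold")
```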


Influence of the $\beta$ parameter:


p1[b_] := Plot[sigmoid[x, b], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
p2[b_] := Plot[D[sigmoid[x, b], x] // Evaluate, {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 1, 5}];

Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 0.1, 1, 0.1}];


$\delta$ Update Rule for Sigmoid Units

Using the sigmoidal activation function (with $2\beta = 1$), the $\delta$ update rule takes the simple form:

(22) $\delta_i^\mu = O_i^\mu \, (1 - O_i^\mu) \, (T_i^\mu - O_i^\mu).$
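In code, Equation (22) means the derivative never has to be evaluated separately: the unit's own output suffices. A small illustrative Python check (an addition, assuming the standard sigmoid, i.e., $2\beta = 1$, with made-up values for $h$ and $T$):

```python
import math

def sigmoid(h):                    # g(h) = 1 / (1 + e^(-h)), i.e. 2*beta = 1
    return 1.0 / (1.0 + math.exp(-h))

h, T = 0.7, 1.0                    # example net input and target
O = sigmoid(h)

# delta via the derivative g'(h) = g(h) * (1 - g(h)) ...
delta_from_gprime = (T - O) * sigmoid(h) * (1.0 - sigmoid(h))
# ... equals delta computed from the output alone, Eq. (22)
delta_from_output = O * (1.0 - O) * (T - O)

print(delta_from_gprime, delta_from_output)
```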

Learning in Multilayer Networks

Multilayer networks with nonlinear processing elements can solve a much wider range of classification tasks. Learning by error backpropagation is a common method to train multilayer networks.

Error Backpropagation

The backpropagation (BP) algorithm describes an update procedure for the set of weights $\vec{w}$ in a feedforward multilayer network. The network has to learn input-output patterns $\{x_k^\mu, T_i^\mu\}$.

The basis for BP learning is, again, a gradient descent technique similar to the one used for perceptron learning, as described above.

Notation

We use the following notation:

- $x_k^\mu$: value of input unit $k$ for training pattern $\mu$; $k = 1, \dots, N$; $\mu = 1, \dots, P$
- $H_j$: output of hidden unit $j$


- $O_i$: output of output unit $i$, $i = 1, \dots, M$
- $w_{kj}$: weight of the link from input unit $k$ to hidden unit $j$
- $W_{ji}$: weight of the link from hidden unit $j$ to output unit $i$

Propagating the input through the network

For pattern $\mu$ the hidden unit $j$ receives the input

(23) $h_j^\mu = \sum_{k=1}^{N} w_{kj} \, x_k^\mu$

and generates the output

(24) $H_j^\mu = g(h_j^\mu) = g\Bigl( \sum_{k=1}^{N} w_{kj} \, x_k^\mu \Bigr).$

These signals are propagated to the output cells, which receive the signals

(25) $h_i^\mu = \sum_j W_{ji} \, H_j^\mu = \sum_j W_{ji} \, g\Bigl( \sum_{k=1}^{N} w_{kj} \, x_k^\mu \Bigr)$

and generate the output

(26) $O_i^\mu = g(h_i^\mu) = g\Bigl( \sum_j W_{ji} \, g\Bigl( \sum_{k=1}^{N} w_{kj} \, x_k^\mu \Bigr) \Bigr)$

Error function

We use the known quadratic function as our error function:

(27) $E(\vec{w}) = \frac{1}{2} \sum_{m=1}^{M} \sum_{\mu=1}^{P} (T_m^\mu - O_m^\mu)^2$

Continuing the calculations, we get:


(28) $E(\vec{w}) = \frac{1}{2} \sum_{m=1}^{M} \sum_{\mu=1}^{P} (T_m^\mu - g(h_m^\mu))^2$

$E(\vec{w}) = \frac{1}{2} \sum_{m=1}^{M} \sum_{\mu=1}^{P} \Bigl( T_m^\mu - g\Bigl( \sum_j W_{jm} \, g\Bigl( \sum_{k=1}^{N} w_{kj} \, x_k^\mu \Bigr) \Bigr) \Bigr)^2$

$E(\vec{w}) = \frac{1}{2} \sum_{m=1}^{M} \sum_{\mu=1}^{P} \Bigl( T_m^\mu - g\Bigl( \sum_j W_{jm} \, H_j^\mu \Bigr) \Bigr)^2$

Updating the weights: hidden to output layer

For the connections from hidden to output cells we can use the delta weight update rule:

(29) $\Delta W_{ji} = -\eta \, \frac{\partial E}{\partial W_{ji}}$

$\Delta W_{ji} = \eta \sum_\mu (T_i^\mu - O_i^\mu) \, g'(h_i^\mu) \, H_j^\mu$

$\Delta W_{ji} = \eta \sum_\mu \delta_i^\mu \, H_j^\mu$

with

(30) $\delta_i^\mu = g'(h_i^\mu) \, (T_i^\mu - O_i^\mu)$

Updating the weights: input to hidden layer

(31) $\Delta w_{kj} = -\eta \, \frac{\partial E}{\partial w_{kj}}$

$\Delta w_{kj} = -\eta \sum_\mu \Bigl( \frac{\partial E}{\partial H_j^\mu} \cdot \frac{\partial H_j^\mu}{\partial w_{kj}} \Bigr)$

After a few more calculations we get the following weight update rule:


(32) $\Delta w_{kj} = \eta \sum_\mu \delta_j^\mu \, x_k^\mu$

with

(33) $\delta_j^\mu = g'(h_j^\mu) \sum_i W_{ji} \, \delta_i^\mu$

The Backpropagation Algorithm

For the BP algorithm we use the following notation:

- $V_i^m$: output of cell $i$ in layer $m$
- $V_i^0$: corresponds to $x_i$, the $i$-th input component
- $w_{ji}^m$: the connection from $V_j^{m-1}$ to $V_i^m$

Backpropagation Algorithm

Step 1: Initialize all weights with random values.

Step 2: Select a pattern $x^\mu$ and attach it to the input layer ($m = 0$):

(34) $V_j^0 = x_j^\mu \quad \forall j$

Step 3: Propagate the signals through all layers:

(35) $V_i^m = g(h_i^m) = g\Bigl( \sum_j w_{ji}^m \, V_j^{m-1} \Bigr) \quad \forall i, \forall m$

Step 4: Calculate the $\delta$'s of the output layer:

(36) $\delta_i^M = g'(h_i^M) \, (T_i - V_i^M)$

Step 5: Calculate the $\delta$'s for the inner layers by error backpropagation:

(37) $\delta_i^{m-1} = g'(h_i^{m-1}) \sum_j w_{ij}^m \, \delta_j^m, \qquad m = M, M-1, \dots, 2$

Step 6: Adapt all connection weights:

(38) $w_{ji}^{\text{new}} = w_{ji}^{\text{old}} + \Delta w_{ji}^m$ with $\Delta w_{ji}^m = \eta \, \delta_i^m \, V_j^{m-1}$


Step 7: Go back to Step 2 for the next training pattern.
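Steps 1 through 7 can be put together in a compact sketch. The following pure-Python program is an illustrative addition, not the original notebook's code; it assumes a single hidden layer, the standard sigmoid ($2\beta = 1$), and bias weights fed by a constant input, and it trains a small network on XOR, one of the example tasks below. The helper names `forward` and `error` are made up for this sketch.

```python
import math, random

def g(h):                                    # standard sigmoid, 2*beta = 1
    return 1.0 / (1.0 + math.exp(-h))

def forward(x, w, W):
    """Steps 2-3: propagate one pattern; returns inputs, hidden and output values."""
    xb = x + [1.0]                           # constant bias input (an assumption)
    H = [g(sum(w[j][k] * xb[k] for k in range(3))) for j in range(len(w))]
    Hb = H + [1.0]                           # bias unit for the output layer
    O = g(sum(W[j] * Hb[j] for j in range(len(Hb))))
    return xb, H, Hb, O

random.seed(1)
nh = 3                                       # number of hidden units
# Step 1: initialize all weights with random values
w = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(nh)]  # input->hidden
W = [random.uniform(-1, 1) for _ in range(nh + 1)]                  # hidden->output
eta = 0.5
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

def error(w, W):                             # quadratic cost, Eq. (27)
    return 0.5 * sum((T - forward(x, w, W)[3]) ** 2 for x, T in data)

err_before = error(w, W)
for epoch in range(10000):                   # Step 7: keep presenting patterns
    for x, T in data:
        xb, H, Hb, O = forward(x, w, W)
        dO = O * (1 - O) * (T - O)           # Step 4, Eqs. (22)/(36)
        dH = [H[j] * (1 - H[j]) * W[j] * dO  # Step 5, Eqs. (33)/(37)
              for j in range(nh)]
        W = [W[j] + eta * dO * Hb[j] for j in range(nh + 1)]        # Step 6
        for j in range(nh):
            w[j] = [w[j][k] + eta * dH[j] * xb[k] for k in range(3)]

err_after = error(w, W)
print(err_before, err_after)                 # the error should have dropped
```

Note that the hidden-layer deltas must be computed with the old hidden-to-output weights before those weights are updated, which is why Step 5 precedes Step 6 inside the loop.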

Examples

- TC Learning Task
- XOR
