

SLIDE 1

Machine Learning 2007: Lecture 8
Instructor: Tim van Erven (Tim.van.Erven@cwi.nl)
Website: www.cwi.nl/~erven/teaching/0708/ml/

October 31, 2007

SLIDE 2

Overview


  • Organisational Matters
  • Linear Functions as Inner Products
  • Neural Networks
      – The Perceptron
      – General Neural Networks
  • Gradient Descent
      – Convex Functions
      – Gradient Descent in One Variable
      – Gradient Descent in More Variables
      – Optimizing Perceptron Weights

SLIDE 4

Course Organisation


Final Exam:

  • You have to enroll for the final exam on tisvu (when possible).
  • The final exam will be more difficult than the intermediate exam.

Mitchell:

  • Read: Chapter 4, sections 4.1 – 4.4.

This Lecture:

  • Explanation of linear functions as inner products is needed to understand Mitchell.
  • Neural networks are in Mitchell. I have some extra pictures.
  • Convex functions are not discussed in Mitchell.
  • I will give more background on gradient descent.


SLIDE 8

Linear Functions as Inner Products


Linear Function:

hw(x) = w0 + w1x1 + . . . + wdxd

  • x = (x1, . . . , xd)⊤ is a d-dimensional feature vector.
  • w = (w0, w1, . . . , wd)⊤ is a (d+1)-dimensional weight vector.

As an Inner Product (a standard trick):

We may change x into a (d+1)-dimensional vector x′ by adding an imaginary extra feature x0, which always has value 1:

x = (x1, . . . , xd)⊤  ⇒  x′ = (1, x1, . . . , xd)⊤

hw(x) = ∑_{i=0}^{d} wi x′i = ⟨w, x′⟩

  • Mitchell writes w · x′ for ⟨w, x′⟩.
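
As a concrete illustration (a sketch of my own, assuming NumPy; all names are mine, not from the slides), the trick turns evaluating hw into a single inner product:

```python
import numpy as np

def linear_h(w, x):
    """Evaluate hw(x) = <w, x'> by prepending the constant feature x0 = 1."""
    x_prime = np.concatenate(([1.0], x))  # x' = (1, x1, ..., xd)
    return np.dot(w, x_prime)             # <w, x'> = w0 + w1*x1 + ... + wd*xd

w = np.array([0.5, -1.0, 2.0])   # (d+1)-dimensional weights (w0, w1, w2)
x = np.array([3.0, 1.0])         # d-dimensional feature vector
print(linear_h(w, x))            # 0.5 - 3.0 + 2.0 = -0.5
```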

SLIDE 10

Artificial Neurons


An Artificial Neuron:

An (artificial) neuron is some function h that gets a feature vector x as input and outputs a (single) label y.

The Perceptron:

The most famous type of (artificial) neuron is the perceptron:

hw(x) = 1 if w0 + w1x1 + . . . + wdxd > 0, and −1 otherwise.

  • Applies a threshold to a linear function of x.
  • Has parameters w.
SLIDE 13

Different Views of The Perceptron


Simple Neural Network (Mitchell’s drawing):

[Figure: a network with inputs x1, . . . , x4 feeding a single output neuron that produces y1.]

Equation:

hw(x) = 1 if w0 + w1x1 + . . . + wdxd > 0, and −1 otherwise.

  • One of the simplest neural networks consists of just one perceptron neuron.
  • A perceptron does classification.
  • The network has no hidden units, and just one output.
  • It may have any number of inputs.
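
Written out in Python (a sketch of my own, not code from the lecture), the perceptron is a one-liner on top of the linear function:

```python
import numpy as np

def perceptron(w, x):
    """Perceptron: threshold a linear function of x; w includes the bias w0."""
    activation = w[0] + np.dot(w[1:], x)  # w0 + w1*x1 + ... + wd*xd
    return 1 if activation > 0 else -1

w = np.array([-0.8, 0.5, 0.5])  # the AND weights from the next slide
print(perceptron(w, np.array([1, 1])))    # 1
print(perceptron(w, np.array([-1, 1])))   # -1
```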
SLIDE 15

Decision Boundary of the Perceptron


Decision boundary: w0 + w1x1 + . . . + wdxd = 0

  • This is where the perceptron changes its output y from −1 (-) to +1 (+) if we change x a little bit.
  • For d = 2 this decision boundary is always a line.

Representing Boolean Functions (−1 = false, 1 = true):

[Figure: decision boundaries in the (x1, x2)-plane for AND and OR.]

AND: w0 = −0.8, w1 = 0.5, w2 = 0.5
OR: w0 = 0.3, w1 = 0.5, w2 = 0.5 (wrong in Mitchell!)
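
A quick check (a sketch of my own) that these weights really implement AND and OR on inputs in {−1, 1}:

```python
import itertools

def perceptron2(w0, w1, w2, x1, x2):
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

for x1, x2 in itertools.product([-1, 1], repeat=2):
    and_out = perceptron2(-0.8, 0.5, 0.5, x1, x2)
    or_out = perceptron2(0.3, 0.5, 0.5, x1, x2)
    print(f"x=({x1:+d},{x2:+d})  AND={and_out:+d}  OR={or_out:+d}")
# AND outputs +1 only for (+1,+1); OR outputs -1 only for (-1,-1).
```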


SLIDE 17

Perceptron Cannot Represent Exclusive Or


Exclusive Or:

[Figure: the four XOR points in the (x1, x2)-plane, with ‘+’ at (−1, 1) and (1, −1) and ‘−’ at (−1, −1) and (1, 1).]

  • There exists no line that separates the inputs with label ‘-’ from the inputs with label ‘+’. They are not linearly separable.
  • The decision boundary for the perceptron is always a line.
  • Hence a perceptron can never implement the ‘exclusive or’ function, whichever weights we choose!
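
No code can replace the geometric argument, but a random search (a sketch of my own, an illustration rather than a proof) shows how hopeless it is: among 100,000 random weight vectors, none classifies all four XOR points correctly.

```python
import numpy as np

rng = np.random.default_rng(0)
xor_data = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]

def classifies_all(w, data):
    return all((1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else -1) == y
               for (x1, x2), y in data)

hits = sum(classifies_all(rng.uniform(-5, 5, size=3), xor_data)
           for _ in range(100_000))
print(hits)  # 0: none of the sampled weight vectors implements XOR
```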


SLIDE 20

Artificial Neural Networks

[Figure: a network with inputs x1, . . . , x6 feeding hidden neurons, which feed output neurons producing y1, . . . , y4.]

  • We can create an (artificial) neural network (NN) by connecting neurons together.
  • We hook up our feature vector x to the input neurons in the network. We get a label vector y from the output neurons.
  • The parameters of the neural network w consist of all the parameters of the neurons in the network taken together in one big vector.
SLIDE 21

NN Example: ALVINN


[Figure: the ALVINN network — a 30×32 sensor input retina feeding 4 hidden units, which feed 30 output units ranging from Sharp Left through Straight Ahead to Sharp Right.]



SLIDE 25

Convex Functions


Intuition:

[Plot: x², which lies below the chord between any two points on its graph.]

  • A function is convex if it lies below the line between any two of its points. For example, the line between f(−3) and f(7).

Definition: A function f(x) is convex if

f(αx1 + (1 − α)x2) ≤ αf(x1) + (1 − α)f(x2)

for any two inputs x1, x2 and any 0 ≤ α ≤ 1.
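
The definition can be checked numerically (a sketch of my own; the tolerance guards against floating-point noise): for convex x² the inequality holds at every sampled α, while for x³ it fails.

```python
import numpy as np

def violates_convexity(f, x1, x2, num_alphas=101):
    """True if f(a*x1 + (1-a)*x2) > a*f(x1) + (1-a)*f(x2) for some sampled a."""
    alphas = np.linspace(0.0, 1.0, num_alphas)
    lhs = f(alphas * x1 + (1 - alphas) * x2)
    rhs = alphas * f(x1) + (1 - alphas) * f(x2)
    return bool(np.any(lhs > rhs + 1e-12))

print(violates_convexity(lambda x: x**2, -3.0, 7.0))   # False: x^2 is convex
print(violates_convexity(lambda x: x**3, -3.0, 1.0))   # True: x^3 is not convex
```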


SLIDE 31

Examples


Convex:

[Plots: eˣ and x².]

Not Convex:

[Plots: three non-convex functions, among them x³.]



SLIDE 35

Gradient Descent


  • Gradient descent is a method to find the minimum of a function: minx f(x).
  • It works for convex functions, but not for some other functions.

[Plot: a convex function f(x) with the starting point x1 marked.]

General Idea:

1. Pick some starting point x1.
2. Keep taking small steps downhill: f(x1) > f(x2) > f(x3) > . . .
3. Stop at the minimum.

SLIDE 39

Gradient Descent More Precisely


What is Downhill?

The derivative f′(x) points uphill, so downhill is −f′(x).

Step Size:

  • We multiply −f′(xn) by the learning rate η.
  • This controls the size of our steps.
  • If η is too big, we will walk past the minimum.
  • If η is too small, it will take very long before we get to the minimum.
  • There exist more advanced methods to choose your step size.

The Gradient Descent Algorithm:

1. Pick some starting point x1.
2. xn+1 = xn + ∆xn, where ∆xn = −η · f′(xn).
3. Stop when ∆xn is very small.
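
Translated directly into Python (a minimal sketch of my own; the function names, learning rate, and stopping threshold are my own choices):

```python
def gradient_descent_1d(f_prime, x1, eta=0.1, tol=1e-8, max_steps=10_000):
    """Minimize f by stepping against the derivative: x_{n+1} = x_n - eta*f'(x_n)."""
    x = x1
    for _ in range(max_steps):
        delta = -eta * f_prime(x)   # step downhill
        x += delta
        if abs(delta) < tol:        # stop when steps become very small
            break
    return x

# Minimize f(x) = x^2, whose derivative is f'(x) = 2x.
print(gradient_descent_1d(lambda x: 2 * x, x1=10.0))  # close to 0
```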


SLIDE 41

What Can Go Wrong?


Local minima:

[Plot: a non-convex function with a local minimum at x = 0, where gradient descent can get stuck.]

  • For some starting points, we may get stuck at a local minimum (x = 0 in figure).
  • Most important problem for gradient descent.
  • Convex functions do not have local minima!

No minimum exists:

[Plot: eˣ, which keeps decreasing as x decreases but never attains a minimum.]

  • The function may have no minima at all.
  • In that case gradient descent cannot find a minimum (of course).
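
To see the local-minimum problem concretely, here is a sketch of my own (the non-convex function is my own example, not the one in the figure): the same descent loop ends in different minima depending on the starting point.

```python
def gd(f_prime, x, eta=0.01, steps=5000):
    for _ in range(steps):
        x -= eta * f_prime(x)
    return x

# f(x) = x^4 + x^3 - 2x^2 has a global minimum near x = -1.44 and a
# merely local minimum near x = +0.69; where we end up depends on x1.
f_prime = lambda x: 4 * x**3 + 3 * x**2 - 4 * x

print(gd(f_prime, x=-2.0))  # about -1.44: the global minimum
print(gd(f_prime, x=+2.0))  # about +0.69: stuck in the local minimum
```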



SLIDE 44

The Gradient in Two Variables


One Variable:

  • Suppose g(x) is a function in one variable x.
  • Then we can take the derivative ∂g/∂x.

Two Variables:

  • But suppose f(x) is a function that takes a 2-dimensional vector x as input and outputs a scalar.
  • Does there exist something like the derivative of f with respect to x?
  • Yes, it is called the gradient:

Gradient: ∇f = (∂f/∂x1, ∂f/∂x2)⊤

  • Example: ∇(x1²x2 + x2) = (2x1x2, x1² + 1)⊤
  • Note that ∇f is a function that takes x as input (like f), but outputs a vector!
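
To double-check the example (a sketch of my own), compare the analytic gradient of f(x1, x2) = x1²x2 + x2 with a central-difference approximation:

```python
import numpy as np

def f(x):
    return x[0]**2 * x[1] + x[1]

def grad_f(x):
    return np.array([2 * x[0] * x[1], x[0]**2 + 1])  # analytic gradient

def numerical_grad(f, x, h=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)  # central difference
    return g

x = np.array([2.0, 3.0])
print(grad_f(x))             # [12.  5.]
print(numerical_grad(f, x))  # approximately [12.  5.]
```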
SLIDE 47

The Gradient in d Variables


Definition:

Suppose f is a function that takes a d-dimensional vector x as input and outputs a scalar. Then the gradient of f is

∇f = (∂f/∂x1, . . . , ∂f/∂xd)⊤

  • ∇f is a function that takes a d-dimensional vector x as input, just like f.
  • But ∇f also outputs a d-dimensional vector, unlike f.
  • For d = 1 the gradient is just the derivative.
  • The gradient is a generalisation of the derivative to higher dimensional inputs.

SLIDE 48

Gradient Examples


Examples on a 3-dimensional input vector x:

  f                 ∇f                          f at x = (1, 2, 3)⊤    ∇f at x = (1, 2, 3)⊤
  x1 + 2x2² − x3    (1, 4x2, −1)⊤               6                      (1, 8, −1)⊤
  x1x2x3²           (x2x3², x1x3², 2x1x2x3)⊤    18                     (18, 9, 12)⊤
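
The table can be verified numerically with the same central-difference idea (again a sketch of my own):

```python
import numpy as np

def f1(x): return x[0] + 2 * x[1]**2 - x[2]
def f2(x): return x[0] * x[1] * x[2]**2

def numerical_grad(f, x, h=1e-6):
    return np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                     for e in np.eye(len(x))])

x = np.array([1.0, 2.0, 3.0])
print(f1(x), f2(x))           # 6.0 18.0
print(numerical_grad(f1, x))  # ~ [ 1.  8. -1.]
print(numerical_grad(f2, x))  # ~ [18.  9. 12.]
```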

SLIDE 49

Gradient Descent in More Dimensions

[Plot: a surface f(x1, x2) over the (x1, x2)-plane.]

  • We can also use gradient descent to find the minimum of a function that takes a vector as input: minx f(x).
  • It is called gradient descent because it walks in the direction of minus the gradient.
  • It works for convex functions, but not for some other functions.

SLIDE 53

Gradient Descent in More Dimensions


What is Downhill?

It can be shown that the gradient ∇f(x) points in the direction of the steepest ascent at x, and that −∇f(x) points in the direction of the steepest descent.

Step Size:

  • We multiply −∇f(x) by the learning rate η.
  • This controls the size of our steps.

The Gradient Descent Algorithm:

1. Pick some starting point x1.
2. xn+1 = xn + ∆xn, where ∆xn = −η · ∇f(xn).
3. Stop when ∆xn is a very small vector.

  • Do not confuse ∆ (delta) and ∇ (the gradient).
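
The vector version looks almost identical in Python (a sketch of my own; applied here to f(x) = x1² + x2², whose gradient is 2x):

```python
import numpy as np

def gradient_descent(grad_f, x1, eta=0.1, tol=1e-8, max_steps=10_000):
    """Minimize f: step x_{n+1} = x_n - eta * grad_f(x_n) until steps are tiny."""
    x = np.asarray(x1, dtype=float)
    for _ in range(max_steps):
        delta = -eta * grad_f(x)
        x = x + delta
        if np.linalg.norm(delta) < tol:  # stop when the step vector is very small
            break
    return x

print(gradient_descent(lambda x: 2 * x, x1=[4.0, -3.0]))  # close to [0, 0]
```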


SLIDE 57

The Delta Rule


The idea: Given data D, use gradient descent to find perceptron weights that minimize the number of wrongly classified training examples in D.

A Problem:

  • The perceptron applies a threshold to a linear function.
  • This threshold makes the derivative/gradient undefined for some inputs.

Solution:

  • Minimize the sum of squared errors on D for the perceptron without the threshold.
  • Note that D is considered fixed: We are minimizing SSE(w, D) as a function of w.
  • The perceptron without the threshold is just a linear function hw(x) (also called a linear unit in NNs).
  • This is just linear regression!

SLIDE 60

Gradient Descent for Perceptron Weights


Remarks:

  • SSE(w, D) is a convex function of w.
  • To apply gradient descent we need to compute the gradient.
  • It will be convenient to minimize ½·SSE(w, D) instead of SSE(w, D).

Computing The Gradient:

We can compute the ith component of the gradient as follows (see Mitchell, Equation 4.6):

∂/∂wi [½·SSE(w, D)] = ∂/∂wi ½ ∑_{(y,x)∈D} (y − hw(x))² = ∑_{(y,x)∈D} (y − hw(x)) · (−xi)
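
Putting this together (a minimal sketch of my own, not code from the lecture: batch gradient descent on ½·SSE(w, D) for a linear unit; the AND data, learning rate, and step count are illustrative choices):

```python
import numpy as np

def delta_rule(X, y, eta=0.05, steps=1000):
    """Batch gradient descent on (1/2)*SSE for a linear unit hw(x) = <w, x'>."""
    Xp = np.hstack([np.ones((len(X), 1)), X])  # prepend x0 = 1 (bias trick)
    w = np.zeros(Xp.shape[1])
    for _ in range(steps):
        errors = y - Xp @ w        # y - hw(x) for every example
        grad = -Xp.T @ errors      # i-th component: sum of (y - hw(x)) * (-x'_i)
        w -= eta * grad / len(X)   # step against the gradient
    return w

# Train on the AND data from earlier; the sign of <w, x'> classifies correctly.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1], dtype=float)
w = delta_rule(X, y)
print(w)  # roughly [-0.5, 0.5, 0.5]
print(np.sign(np.hstack([np.ones((4, 1)), X]) @ w))  # [-1. -1. -1.  1.]
```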


SLIDE 62

References


  • S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
  • T.M. Mitchell, Machine Learning, McGraw-Hill, 1997.