SLIDE 1

Pattern Recognition Introduction to Gradient Descent

Ad Feelders

Universiteit Utrecht

SLIDE 2

Optimization (single variable)

Suppose we want to find the value of x for which the function y = f(x) is minimized (or maximized). From calculus we know that a necessary condition for a minimum is:

$$\frac{df}{dx} = 0 \qquad (1)$$

This condition is not sufficient, since maxima and points of inflection also satisfy equation (1). Together with the second-order condition

$$\frac{d^2 f}{dx^2} > 0 \qquad (2)$$

we have a sufficient condition for a local minimum.
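As a quick illustration (an assumed example, not from the slides): for f(x) = x² − 4x,

$$\frac{df}{dx} = 2x - 4 = 0 \;\Rightarrow\; x = 2, \qquad \frac{d^2 f}{dx^2} = 2 > 0,$$

so x = 2 is a local (in fact global) minimum.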

SLIDE 3

Optimization (single variable)

The equation df/dx = 0 may not have a closed-form solution, however. In such cases we have to resort to iterative numerical procedures such as gradient descent.

SLIDE 4

Optimization (single variable)

[Figure: graph of f(x) with the tangent line at x = x*; the slope df/dx(x = x*) is positive.]

The derivative at x = x* is positive, so to increase the function value we should increase the value of x, i.e. make a step in the direction of the derivative. To decrease the function value, we should make a step in the opposite direction.

SLIDE 5

Optimization (single variable)

Also, the tangent line to the graph at x = x* is a local linear approximation to f:

$$\Delta f \approx \frac{df}{dx}(x = x^*)\,\Delta x$$

The closer we are to x*, the better the approximation.

SLIDE 6

Gradient Descent Algorithm (single variable)

The basic gradient-descent algorithm is:

1. Set i ← 0, and choose an initial value x(0).
2. Determine the derivative

   $$\frac{df}{dx}(x = x^{(i)})$$

   of f(x) at x(i) and update

   $$x^{(i+1)} \leftarrow x^{(i)} - \eta\,\frac{df}{dx}(x = x^{(i)})$$

   Set i ← i + 1.
3. Repeat the previous step until df/dx = 0 and check if a (local) minimum has been reached.

Here η > 0 is the step size (or learning rate).
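A minimal R sketch of this single-variable procedure (not part of the original slides; the example function f, its derivative df, and the settings eta, tol and maxiter are assumptions for illustration):

f  <- function(x) x^2 - 4*x            # example function, minimum at x = 2
df <- function(x) 2*x - 4              # its derivative

gd1 <- function(x0, eta = 0.1, tol = 1e-8, maxiter = 1000) {
  x <- x0
  for (i in 1:maxiter) {
    g <- df(x)
    if (abs(g) < tol) break            # derivative (numerically) zero: stop
    x <- x - eta * g                   # x(i+1) <- x(i) - eta * df/dx(x(i))
  }
  x
}

gd1(x0 = 0)                            # converges to approximately 2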

SLIDE 7

Optimization (multiple variables)

Suppose we want to find the values of x1, …, xm for which the function y = f(x1, …, xm) is minimized (or maximized). Analogous to the single-variable case, a necessary condition for a minimum is:

$$\frac{\partial f}{\partial x_j} = 0, \qquad j = 1, \ldots, m \qquad (3)$$

Again this condition is not sufficient, since maxima and saddle points also satisfy (3). For the second-order condition, define the Hessian matrix H, with

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

Together with the second-order condition that H is positive definite, i.e.

$$z^\top H z > 0 \quad \text{for all } z \neq 0 \qquad (4)$$

we have a sufficient condition for a local minimum.
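Positive definiteness of H can be checked via its eigenvalues (all must be positive). A small R sketch for the assumed example f(x1, x2) = x1² + x2² − x1·x2, which is not from the slides:

H <- matrix(c( 2, -1,
              -1,  2), nrow = 2, byrow = TRUE)   # constant Hessian of the example
eigen(H)$values                                  # 3 and 1: all positive, so H is positive definite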

SLIDE 8

Linear Functions

Consider a linear function

$$f(x) = a + \sum_{i=1}^{m} b_i x_i = a + b^\top x$$

The contour lines of f are given by f(x) = a + b⊤x = c, for different values of the constant c. For linear functions the contours are parallel straight lines.

SLIDE 9

The Gradient

The gradient of f(x1, x2, …, xm) is the vector of partial derivatives

$$\nabla f = \begin{pmatrix} \dfrac{\partial f}{\partial x_1} \\ \dfrac{\partial f}{\partial x_2} \\ \vdots \\ \dfrac{\partial f}{\partial x_m} \end{pmatrix}$$

SLIDE 10

Gradient of a Linear Function

The gradient of a linear function f(x) = a + b⊤x is given by

$$\nabla f = b$$

Furthermore, for linear functions we have:

$$\Delta f = b^\top \Delta x = \nabla f^\top \Delta x$$

In which direction should we move to maximize ∆f?

SLIDE 11

The direction of steepest ascent (descent) is perpendicular to the contour line

The direction of steepest ascent (descent) is an increasing (decreasing) direction perpendicular to the contour line. The direction of steepest ascent (descent) from the point x∗ is where the contour line is tangent to a circle of radius one around x∗.

SLIDE 12

The gradient is also perpendicular to the contour line

Consider two points xA and xB, both of which lie on the same contour line. Because f(xA) = f(xB) = c, we have f(xA) − f(xB) = 0. Therefore

$$(a + b^\top x_A) - (a + b^\top x_B) = b^\top (x_A - x_B) = 0$$

and so the gradient is perpendicular to the contour line, because

1. The vector xA − xB runs parallel to the contour line.
2. Vectors are perpendicular if their dot product is zero.

SLIDE 13

The gradient is also perpendicular to the contour line

SLIDE 14

The gradient is perpendicular to the contour line

For linear functions the direction of steepest increase is perpendicular to the contour line, as is the gradient. From

$$\Delta f = b^\top \Delta x = \nabla f^\top \Delta x$$

we conclude that the gradient points in an increasing direction, since filling in ∇f for ∆x gives

$$\Delta f = \nabla f^\top \nabla f = \|\nabla f\|^2$$

Therefore:

1. The gradient points in the direction of fastest increase of f.
2. Minus the gradient points in the direction of fastest decrease of f.

SLIDE 15

Linear Approximation

This reasoning works for arbitrary functions by considering a local linear approximation to the function at x*, given by the tangent plane:

$$(y - y^*) = \frac{\partial f}{\partial x_1}(x^*)(x_1 - x^*_1) + \frac{\partial f}{\partial x_2}(x^*)(x_2 - x^*_2),$$

and using the linear approximation

$$\Delta f \approx \frac{\partial f}{\partial x_1}(x^*)\,\Delta x_1 + \frac{\partial f}{\partial x_2}(x^*)\,\Delta x_2 = \nabla f(x^*)^\top \Delta x.$$

Here ∂f/∂x1(x*) and ∂f/∂x2(x*) are the slopes of the tangent lines in the directions of x1 and x2, respectively, at the point x = x*.

SLIDE 16

Local Linear Approximation by Tangent Plane

The white dot represents the point (x*, f(x*)).

SLIDE 17

Gradient Descent Algorithm (multivariable)

The basic gradient-descent algorithm is:

1. Set i ← 0, and choose an initial value x(0).
2. Determine the gradient ∇f(x(i)) of f(x) at x(i) and update

   $$x^{(i+1)} \leftarrow x^{(i)} - \eta \nabla f(x^{(i)})$$

   Set i ← i + 1.
3. Repeat the previous step until ∇f(x(i)) = 0 and check if a (local) minimum has been reached.

Here η > 0 is the step size (or learning rate).
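The same procedure in a minimal R sketch (not from the slides; the quadratic example and the names gd, grad_f, eta, tol and maxiter are assumptions for illustration):

gd <- function(grad, x0, eta = 0.1, tol = 1e-8, maxiter = 10000) {
  x <- x0
  for (i in 1:maxiter) {
    g <- grad(x)
    if (sqrt(sum(g^2)) < tol) break    # stop when the gradient is (numerically) zero
    x <- x - eta * g                   # x(i+1) <- x(i) - eta * grad f(x(i))
  }
  x
}

# Example objective f(x) = (x1 - 1)^2 + 2*(x2 + 3)^2 with gradient:
grad_f <- function(x) c(2 * (x[1] - 1), 4 * (x[2] + 3))
gd(grad_f, x0 = c(0, 0))               # converges to approximately (1, -3)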

SLIDE 18

Example of gradient descent for linear regression

Note: w0 and w1 are the variables here!

n   x   t   y = w0 + w1*x    e = t - y
1   0   1   w0               1 - w0
2   1   3   w0 + w1          3 - w0 - w1
3   2   4   w0 + 2w1         4 - w0 - 2w1
4   3   3   w0 + 3w1         3 - w0 - 3w1
5   4   5   w0 + 4w1         5 - w0 - 4w1

$$SSE(w_0, w_1) = (1 - w_0)^2 + (3 - w_0 - w_1)^2 + (4 - w_0 - 2w_1)^2 + (3 - w_0 - 3w_1)^2 + (5 - w_0 - 4w_1)^2$$

SLIDE 19

Example of gradient descent

$$\frac{\partial SSE}{\partial w_0} = 2(1 - w_0)(-1) + 2(3 - w_0 - w_1)(-1) + 2(4 - w_0 - 2w_1)(-1) + 2(3 - w_0 - 3w_1)(-1) + 2(5 - w_0 - 4w_1)(-1) = -32 + 10 w_0 + 20 w_1$$

$$\frac{\partial SSE}{\partial w_1} = 0 + 2(3 - w_0 - w_1)(-1) + 2(4 - w_0 - 2w_1)(-2) + 2(3 - w_0 - 3w_1)(-3) + 2(5 - w_0 - 4w_1)(-4) = -80 + 20 w_0 + 60 w_1$$

SLIDE 20

Example of gradient descent

So the gradient is:

$$\nabla SSE = \begin{pmatrix} \dfrac{\partial SSE}{\partial w_0} \\[4pt] \dfrac{\partial SSE}{\partial w_1} \end{pmatrix} = \begin{pmatrix} -32 + 10 w_0 + 20 w_1 \\ -80 + 20 w_0 + 60 w_1 \end{pmatrix}$$

Let w(0) = (0, 0). Then the gradient evaluated in the point w(0) is:

$$\nabla SSE(w^{(0)}) = \begin{pmatrix} -32 + 10 \times 0 + 20 \times 0 \\ -80 + 20 \times 0 + 60 \times 0 \end{pmatrix} = \begin{pmatrix} -32 \\ -80 \end{pmatrix}$$

SLIDE 21

Example of gradient descent

Let η = 1/50. Then we get the following update:

$$w^{(1)}_0 = w^{(0)}_0 - \eta \frac{\partial SSE}{\partial w_0} = 0 - \tfrac{1}{50} \times (-32) = 0.64$$

$$w^{(1)}_1 = w^{(0)}_1 - \eta \frac{\partial SSE}{\partial w_1} = 0 - \tfrac{1}{50} \times (-80) = 1.6$$

Or both at once:

$$w^{(1)} = w^{(0)} - \eta \nabla SSE(w^{(0)}) = \begin{pmatrix} 0 \\ 0 \end{pmatrix} - \tfrac{1}{50} \begin{pmatrix} -32 \\ -80 \end{pmatrix} = \begin{pmatrix} 0.64 \\ 1.6 \end{pmatrix}$$
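The arithmetic above is easy to check in R; a minimal sketch (not from the slides) using the (x, t) pairs from the table and an assumed helper sse_grad:

x <- c(0, 1, 2, 3, 4)
t <- c(1, 3, 4, 3, 5)

sse_grad <- function(w) {              # gradient of SSE(w0, w1), derived above
  e <- t - w[1] - w[2] * x             # residuals t_n - w0 - w1*x_n
  c(-2 * sum(e), -2 * sum(e * x))
}

eta <- 1/50
w <- c(0, 0)
sse_grad(w)                            # -32 -80, the gradient at w(0)
w <- w - eta * sse_grad(w)
w                                      # 0.64 1.60, i.e. w(1)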

SLIDE 22

Gradient Descent with step size η = 0.02

[Figure: contour plot of SSE over (w0, w1), with the successive gradient-descent iterates marked as points.]

SLIDE 23

Gradient Ascent for Logistic Regression

The logistic regression log-likelihood function is:

$$\ell(w) = \sum_{n=1}^{N} \{ t_n \ln p_n + (1 - t_n) \ln(1 - p_n) \}$$

with

$$p_n = (1 + e^{-w^\top x_n})^{-1}, \qquad 1 - p_n = (1 + e^{w^\top x_n})^{-1},$$

where p_n ≡ P(t = 1 | x_n). Filling this in gives:

$$\ell(w) = \sum_{n=1}^{N} \left\{ t_n \ln \frac{1}{1 + e^{-w^\top x_n}} + (1 - t_n) \ln \frac{1}{1 + e^{w^\top x_n}} \right\}$$

SLIDE 24

Determining the gradient

$$\ell(w) = \sum_{n=1}^{N} \{ t_n \ln p_n + (1 - t_n) \ln(1 - p_n) \}$$

$$g(w_j) = \frac{\partial \ell(w)}{\partial w_j} = \sum_{n=1}^{N} \left\{ \frac{t_n}{p_n} \cdot \frac{\partial p_n}{\partial w_j} - \frac{1 - t_n}{1 - p_n} \cdot \frac{\partial p_n}{\partial w_j} \right\} \qquad (1)$$

$$\frac{\partial p_n}{\partial w_j} = p_n (1 - p_n) x_{nj} \qquad (2)$$

Filling in (2) in equation (1) gives (verify this!):

$$g(w_j) = \sum_{n=1}^{N} (t_n - p_n) x_{nj}$$

SLIDE 25

Determining the gradient

Recall that:

$$p_n = (1 + e^{-w^\top x_n})^{-1}$$

so we have (apply the chain rule twice):

$$\frac{\partial p_n}{\partial w_j} = -(1 + e^{-w^\top x_n})^{-2} \cdot e^{-w^\top x_n} \cdot (-x_{nj}) = \frac{e^{-w^\top x_n}}{(1 + e^{-w^\top x_n})^2}\, x_{nj} = \frac{1}{1 + e^{-w^\top x_n}} \cdot \frac{e^{-w^\top x_n}}{1 + e^{-w^\top x_n}}\, x_{nj} = p_n (1 - p_n) x_{nj}$$
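The result can also be verified numerically; a small R sketch (not from the slides; the values of w, x, j and the step h are assumptions):

p_fun <- function(w, x) 1 / (1 + exp(-sum(w * x)))   # p = (1 + exp(-w'x))^(-1)
w <- c(0.5, -1.2); x <- c(1, 2); j <- 2; h <- 1e-6
w_h <- w; w_h[j] <- w_h[j] + h
(p_fun(w_h, x) - p_fun(w, x)) / h                    # finite-difference derivative
p_fun(w, x) * (1 - p_fun(w, x)) * x[j]               # analytic p*(1-p)*x_j; the two agree up to O(h)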

SLIDE 26

Optimization by Gradient Ascent

The gradient points in the direction of the steepest ascent of the function. We can perform optimization by the method of gradient ascent as follows. The new estimate of w_j based on processing the n-th observation is:

$$w^{(i+1)}_j = w^{(i)}_j + \eta \, g_n(w_j) = w^{(i)}_j + \eta \,(t_n - p^{(i)}_n)\, x_{nj},$$

where

$$p^{(i)}_n = (1 + e^{-w^{(i)\top} x_n})^{-1}$$

is the current estimate of P(t = 1 | x_n), that is, using w(i).

SLIDE 27

Optimization by Gradient Ascent (batch version)

The basic gradient-ascent algorithm applied to logistic regression:

1. Choose an initial value w(0) (e.g. at random); i ← 0.
2. Determine the gradient and update

   $$w^{(i+1)}_j \leftarrow w^{(i)}_j + \eta \sum_{n=1}^{N} (t_n - p^{(i)}_n)\, x_{nj}, \qquad j = 0, \ldots, m$$

   Set i ← i + 1.
3. Repeat the previous step until g(wj) = 0 for all j = 0, …, m and check if a (local) maximum has been reached.
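A minimal batch-version sketch in R, in the style of the online implementation shown later (not from the slides; the name logreg.batch and the settings maxiter and eta are assumptions):

logreg.batch <- function(x, y, maxiter = 1000, eta = 0.01) {
  N <- length(y)
  x <- cbind(rep(1, N), x)                       # append column of ones for w0
  w <- runif(ncol(x), -1, 1)                     # random start weights
  for (i in 1:maxiter) {
    p <- 1 / (1 + exp(-as.numeric(x %*% w)))     # current estimates p_n
    g <- as.numeric(t(x) %*% (y - p))            # gradient: sum_n (t_n - p_n) x_nj
    w <- w + eta * g                             # batch update of all weights
  }
  w
}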

SLIDE 28

Optimization by Gradient Ascent (online version)

You can also update the weights after processing each single observation:

1. Choose an initial value w(0). Set i ← 0.
2. For n = 1 to N:

   $$w^{(i+1)}_j \leftarrow w^{(i)}_j + \eta \,(t_n - p^{(i)}_n)\, x_{nj}, \qquad j = 0, \ldots, m$$

   i ← i + 1.
3. Repeat step 2 until convergence and check if a (local) maximum has been reached.

SLIDE 29

Simple R implementation (online version)

function (x, y, maxepoch = 100, eta = 0.01) {
  N <- length(y)
  x <- cbind(rep(1, N), x)          # append column of ones
  M <- ncol(x)
  W <- matrix(nrow = maxepoch, ncol = M)
  w <- runif(M, -1, 1)              # random start weights
  p <- vector(length = N)
  for (i in 1:maxepoch) {
    W[i, ] <- w                     # store weights of this epoch
    index <- sample(N)              # process in random order
    for (n in index) {
      z <- as.numeric(-w %*% x[n, ])
      p[n] <- 1 / (1 + exp(z))      # p_n = 1 / (1 + exp(-w'x_n))
      for (j in 1:M) {
        s <- (y[n] - p[n]) * x[n, j]   # gradient contribution (t_n - p_n) x_nj
        w[j] <- w[j] + eta * s         # online update of w_j
      }
    }
  }
  return(W)
}
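A usage sketch (not from the slides), assuming the function above has been assigned the hypothetical name logreg.online and is run on simulated data:

set.seed(1)
x <- rnorm(200)
p <- 1 / (1 + exp(-(-1 + 2 * x)))      # true weights: w0 = -1, w1 = 2
y <- rbinom(200, 1, p)
W <- logreg.online(x, y, maxepoch = 500, eta = 0.05)
tail(W, 1)                             # final weight estimates; near (-1, 2) up to sampling noise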

SLIDE 30

Fitting the programming assignment data

> prog.logreg <- glm(succes ~ month.exp, data = prog.dat, family = binomial)
> summary(prog.logreg)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -3.05970    1.25935  -2.430   0.0151 *
month.exp     0.16149    0.06498   2.485   0.0129 *

Hence, the maximum likelihood estimates are w0 ≈ −3.06 and w1 ≈ 0.16. Final values obtained with a run of gradient ascent were:

w0 = -3.0596346
w1 = 0.1623779

See the next two slides for graphs of the iterations.

SLIDE 31

w1: 100,000 epochs with η = 0.00015

[Plot: trajectory of w1 against iteration number over the 100,000 epochs.]

SLIDE 32

w0: 100,000 epochs with η = 0.00015

[Plot: trajectory of w0 against iteration number over the 100,000 epochs.]