CS 6316 Machine Learning: Gradient Descent. Yangfeng Ji, Department of Computer Science, University of Virginia.



SLIDE 1

CS 6316 Machine Learning

Gradient Descent

Yangfeng Ji

Department of Computer Science University of Virginia

SLIDE 2

Overview

  • 1. Gradient Descent
  • 2. Stochastic Gradient Descent
  • 3. SGD with Momentum
  • 4. Adaptive Learning Rates


SLIDE 3

Gradient Descent

SLIDE 4

Learning as Optimization

As discussed before, learning can be viewed as an optimization problem.

◮ Training set S = {(x₁, y₁), . . . , (x_m, y_m)}

◮ Empirical risk

L(h_θ, S) = (1/m) ∑_{i=1}^m R(h_θ(x_i), y_i)   (1)

where R is the risk function

◮ Learning: minimize the empirical risk

θ ← argmin_{θ′} L(h_{θ′}, S)   (2)

SLIDE 8

Learning as Optimization (II)

Some examples of risk functions:

◮ Logistic regression

R(h_θ(x_i), y_i) = −log p(y_i | x_i; θ)   (3)

◮ Linear regression

R(h_θ(x_i), y_i) = ‖h_θ(x_i) − y_i‖₂²   (4)

◮ Neural network

R(h_θ(x_i), y_i) = Cross-entropy(h_θ(x_i), y_i)   (5)

◮ Perceptron and AdaBoost can also be viewed as minimizing certain loss functions


SLIDE 10

Constrained Optimization

The dual optimization problem for SVMs in the separable case is

max_α ∑_{i=1}^m α_i − (1/2) ∑_{i,j=1}^m α_i α_j y_i y_j ⟨x_i, x_j⟩   (6)

s.t. α_i ≥ 0, ∀i ∈ [m]   (7)

∑_{i=1}^m α_i y_i = 0   (8)

◮ The Lagrange multipliers α are also called dual variables
◮ This is an optimization problem only about α
◮ The dual problem is defined on the inner products ⟨x_i, x_j⟩

SLIDE 12

Optimization via Gradient Descent

The basic form of an optimization problem is

min f(θ) s.t. θ ∈ B   (9)

where f: ℝ^d → ℝ is the objective function and B ⊆ ℝ^d is the constraint set for θ, which usually can be formulated as a set of inequalities (e.g., SVM).

In this lecture:

◮ we focus only on unconstrained optimization problems, in other words, θ ∈ ℝ^d
◮ we assume f is convex and differentiable

SLIDE 13

Review: Gradient of a 1-D Function

Consider the gradient of this 1-dimensional function

y = f(x) = x² − x − 2   (10)

SLIDE 14

Review: Gradient of a 2-D Function

Now, consider a 2-dimensional function with x = (x₁, x₂):

y = f(x) = x₁² + 10x₂²   (11)

Here is the contour plot of this function. We are going to use this as our running example.

SLIDE 16

Gradient Descent

To learn the parameter θ, the learning algorithm updates it iteratively using the following three steps:

  • 1. Choose an initial point θ(0) ∈ ℝ^d
  • 2. Update

θ(t+1) ← θ(t) − η_t · ∇f(θ)|_{θ=θ(t)}   (12)

where η_t is the learning rate at time t

  • 3. Go back to step 2 until it converges

∇f(θ) is defined as

∇f(θ) = (∂f(θ)/∂θ₁, · · · , ∂f(θ)/∂θ_d)   (13)
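The update rule (12) can be sketched in a few lines. This is a minimal illustration, not code from the lecture; the running example f(θ) = θ₁² + 10θ₂² is taken from slide 14, while the step size and iteration count are arbitrary choices that happen to be stable for this function.

```python
import numpy as np

def grad_f(theta):
    # Gradient of the running example f(theta) = theta_1^2 + 10 * theta_2^2
    return np.array([2 * theta[0], 20 * theta[1]])

def gradient_descent(theta0, eta=0.05, steps=100):
    """Plain gradient descent: theta <- theta - eta * grad f(theta)."""
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad_f(theta)
    return theta

theta_hat = gradient_descent([3.0, 2.0])  # should approach the minimizer (0, 0)
```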

SLIDE 17

Gradient Descent Interpretation

An intuitive justification of the gradient descent algorithm is to consider the following plot. The direction of the gradient is the direction in which the function increases fastest.

SLIDE 20

Gradient Descent Interpretation (II)

Theoretical justification:

◮ First-order Taylor approximation

f(θ + Δθ) ≈ f(θ) + ⟨Δθ, ∇f|_θ⟩   (14)

◮ In gradient descent, Δθ = −η∇f|_θ

◮ Therefore, we have

f(θ + Δθ) ≈ f(θ) + ⟨Δθ, ∇f|_θ⟩ = f(θ) − η⟨∇f, ∇f⟩|_θ = f(θ) − η‖∇f‖₂²|_θ ≤ f(θ)   (15)
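The chain in (15) can be checked numerically. Here is a small sketch of my own (not from the slides) on the running example: for a convex f, one gradient step decreases f, and the actual decrease never exceeds the first-order prediction η‖∇f‖₂².

```python
import numpy as np

def f(theta):
    return theta[0] ** 2 + 10 * theta[1] ** 2   # running example

def grad_f(theta):
    return np.array([2 * theta[0], 20 * theta[1]])

theta = np.array([3.0, 2.0])
eta = 0.01
g = grad_f(theta)
theta_new = theta - eta * g

# (15) predicts a decrease of about eta * ||grad f||_2^2;
# for convex f the actual decrease is positive and no larger than that.
predicted_drop = eta * np.dot(g, g)
actual_drop = f(theta) - f(theta_new)
```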

SLIDE 24

Gradient Descent Interpretation (III)

Consider the second-order Taylor approximation of f:

f(θ′) ≈ f(θ) + ∇f(θ)ᵀ(θ′ − θ) + (1/2)(θ′ − θ)ᵀ∇²f(θ)(θ′ − θ)

◮ Replacing the Hessian ∇²f(θ) with (1/η)I gives the quadratic approximation

f(θ′) ≈ f(θ) + ∇f(θ)ᵀ(θ′ − θ) + (1/(2η))(θ′ − θ)ᵀ(θ′ − θ)

◮ Minimize f(θ′) wrt θ′:

∂f(θ′)/∂θ′ ≈ ∇f(θ) + (1/η)(θ′ − θ) = 0  ⇒  θ′ = θ − η · ∇f(θ)   (16)

◮ Gradient descent chooses the next point θ′ to minimize this approximation of the function

SLIDE 27

Step size

θ(t+1) ← θ(t) − η_t · ∂f(θ)/∂θ |_{θ=θ(t)}   (17)

If we choose a fixed step size η_t = η₀, consider the following function:

f(θ) = (10θ₁² + θ₂²)/2

[Figure: three gradient descent trajectories with step sizes that are (a) too small, (b) too large, and (c) just right]
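The three regimes in the figure can be reproduced with a short experiment. This is my own sketch on the slide's function f(θ) = (10θ₁² + θ₂²)/2; the three step-size values are arbitrary choices meant to land in the three regimes.

```python
import numpy as np

def grad_f(theta):
    # f(theta) = (10 * theta_1^2 + theta_2^2) / 2, so grad f = (10*theta_1, theta_2)
    return np.array([10 * theta[0], theta[1]])

def run(eta, steps=50, theta0=(1.0, 1.0)):
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad_f(theta)
    return theta

too_small = run(eta=0.0001)   # barely moves in 50 steps
too_large = run(eta=0.25)     # overshoots and diverges along theta_1
just_right = run(eta=0.1)     # converges toward the minimum at (0, 0)
```

The divergence threshold here is η = 0.2, since the largest curvature of f is 10 and the per-step multiplier along θ₁ is (1 − 10η).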


SLIDE 30

Optimal Step Sizes

◮ Exact Line Search: solve the one-dimensional subproblem

η_t ← argmin_{s≥0} f(θ − s∇f(θ))   (18)

◮ Backtracking Line Search: with parameters 0 < β < 1, 0 < α ≤ 1/2, and a large initial value of η_t, while

f(θ − η_t∇f(θ)) > f(θ) − αη_t‖∇f(θ)‖₂²   (19)

shrink η_t ← βη_t

◮ Usually, this is not worth the effort, since the computational complexity may be too high (e.g., when f is a neural network)
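Backtracking line search is simple enough to sketch directly from condition (19). This is a minimal illustration, not the lecture's code; the function and the parameter values α = 0.3, β = 0.8 are my own choices within the stated ranges.

```python
import numpy as np

def f(theta):
    return (10 * theta[0] ** 2 + theta[1] ** 2) / 2

def grad_f(theta):
    return np.array([10 * theta[0], theta[1]])

def backtracking_step(theta, eta0=1.0, alpha=0.3, beta=0.8):
    """Shrink eta until the sufficient-decrease condition (19) is satisfied."""
    g = grad_f(theta)
    eta = eta0
    while f(theta - eta * g) > f(theta) - alpha * eta * np.dot(g, g):
        eta *= beta
    return eta, theta - eta * g

theta = np.array([1.0, 1.0])
eta, theta_new = backtracking_step(theta)
```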


SLIDE 33

Convergence Analysis

◮ f is convex and differentiable; additionally,

‖∇f(θ) − ∇f(θ′)‖₂ ≤ L · ‖θ − θ′‖₂   (20)

for any θ, θ′ ∈ ℝ^d, where L is a fixed positive value

◮ Theorem: Gradient descent with fixed step size η₀ ≤ 1/L satisfies

f(θ(t)) − f* ≤ ‖θ(0) − θ*‖₂² / (2η₀t)   (21)

where f* is the optimal value and θ* is the optimal parameter

◮ The same result holds for backtracking, with η₀ replaced by β/L
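The bound (21) can be checked on the quadratic from the step-size slide, where the Lipschitz constant of the gradient is L = 10 (the largest Hessian eigenvalue), the optimum is θ* = (0, 0), and f* = 0. A small sketch of my own, not from the lecture:

```python
import numpy as np

def f(theta):
    return (10 * theta[0] ** 2 + theta[1] ** 2) / 2

def grad_f(theta):
    return np.array([10 * theta[0], theta[1]])

L = 10.0          # Lipschitz constant of grad f: the largest Hessian eigenvalue
eta0 = 1.0 / L    # fixed step size allowed by the theorem
theta0 = np.array([1.0, 1.0])
theta = theta0.copy()   # optimum is theta* = (0, 0) with f* = 0

bound_holds = True
for t in range(1, 51):
    theta = theta - eta0 * grad_f(theta)
    bound = np.dot(theta0, theta0) / (2 * eta0 * t)   # right-hand side of (21)
    bound_holds = bound_holds and (f(theta) <= bound)
```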

SLIDE 34

Stochastic Gradient Descent

SLIDE 35

Gradient Descent

Given a training set {(x_i, y_i)}_{i=1}^m, the loss function is defined as

L(h_θ, S) = (1/m) ∑_{i=1}^m R(h_θ(x_i), y_i)   (22)

where R is the risk function.

Examples:

◮ Logistic regression

R(h_θ(x_i), y_i) = −log p(y_i | x_i; θ)   (23)

◮ Linear regression

R(h_θ(x_i), y_i) = ‖h_θ(x_i) − y_i‖₂²   (24)


SLIDE 37

Gradient Descent (II)

◮ Consider the gradient of the loss function:

∇L(h_θ, S) = (1/m) ∑_{i=1}^m ∇R(h_θ(x_i), y_i)   (25)

◮ To simplify the notation, let f_i(θ) = R(h_θ(x_i), y_i) and f(θ) = L(h_θ, S); then

∇f(θ) = (1/m) ∑_{i=1}^m ∇f_i(θ)   (26)

SLIDE 38

Stochastic Gradient Descent

To learn the parameter θ, we can compute the gradient with one training example (x_i, y_i) at each time step and update the parameter as

θ(t+1) ← θ(t) − η_t · ∇f_i(θ)|_{θ=θ(t)}   (27)

where

◮ t is the time step
◮ ∇f_i(θ(t)) is the gradient of the single-example loss f_i
◮ η_t is the learning rate (step size)
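Update (27) can be sketched on a small linear-regression problem with the single-example squared loss f_i(θ) = ½(⟨θ, x_i⟩ − y_i)². This is my own illustration under assumed settings (synthetic noiseless data, a fixed step size, and a fixed random seed), not code from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 3
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(m, d))
y = X @ theta_true                      # noiseless data, so theta_true is recoverable

def grad_fi(theta, i):
    # Gradient of the single-example squared loss f_i(theta) = (1/2)(<theta, x_i> - y_i)^2
    return (X[i] @ theta - y[i]) * X[i]

theta = np.zeros(d)
eta = 0.05
for t in range(5000):
    i = rng.integers(m)                 # pick one example uniformly at random
    theta = theta - eta * grad_fi(theta, i)   # update (27)
```

Each iteration touches only one row of X, which is the memory and per-iteration-cost advantage discussed on the next slides.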

SLIDE 39

Stochastic?

Compare gradient descent and stochastic gradient descent. Since each step of SGD uses only the gradient from one training example, it can be viewed as a gradient descent method with some randomness.

SLIDE 40

Motivation

There are at least two motivations for using SGD:

◮ SGD can yield big savings in terms of memory usage
  ◮ learning with large-scale data
  ◮ models with lots of parameters
◮ The iteration cost of SGD is independent of the sample size m

SLIDE 41

Motivation (II)

An empirical comparison between SGD and a batch optimization method (L-BFGS) on a binary classification problem with logistic regression [Bottou et al., 2018]


SLIDE 44

How to Choose an Example

◮ Cyclic Rule: choose i ∈ (1, 2, . . . , m) in order
◮ Randomized Rule: at every iteration, choose i ∈ [m] uniformly at random
  ◮ Equivalently, shuffle the training examples at the end of each training epoch
◮ In practice, the randomized rule is more common, since under it

E[∇f_i(θ)] = (1/m) ∑_{i=1}^m ∇f_i(θ) = ∇f(θ)   (28)

i.e., ∇f_i(θ) is an unbiased estimate of ∇f(θ)
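The unbiasedness in (28) is easy to confirm empirically: averaging many uniformly sampled per-example gradients should approach the full gradient. A sketch of my own with assumed synthetic data (squared loss, fixed random seed):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 100, 2
X = rng.normal(size=(m, d))
y = rng.normal(size=m)
theta = np.array([0.5, -1.0])

def grad_fi(i):
    # Squared-loss gradient for example i at the fixed point theta
    return (X[i] @ theta - y[i]) * X[i]

full_grad = np.mean([grad_fi(i) for i in range(m)], axis=0)

# Under the randomized rule, E[grad f_i] equals the full gradient (28),
# so a large Monte-Carlo average should land close to it.
samples = [grad_fi(rng.integers(m)) for _ in range(50000)]
mc_grad = np.mean(samples, axis=0)
```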


SLIDE 46

Convergence of SGD

The convergence of SGD usually requires diminishing step sizes.

◮ The usual conditions on the learning rates are

∑_{t=1}^∞ η_t = ∞,  ∑_{t=1}^∞ η_t² < ∞   (29)

◮ The simplest schedule that satisfies these conditions is

η_t = 1/t   (30)

[Bottou, 1998]

SLIDE 47

SGD with Momentum

SLIDE 48

Review: Vector Addition

The parallelogram law of vector addition:

c = a + b   (31)

SLIDE 49

SGD with Momentum

Given the loss function f(θ) to be minimized, SGD with momentum is given by

v(t) = µv(t−1) + ∇f(θ)|_{θ=θ(t−1)}   (32)

θ(t) = θ(t−1) − η_t v(t)   (33)

where

◮ η_t is still the learning rate
◮ µ ∈ [0, 1] is the momentum coefficient. Usually, µ = 0.99 or 0.999.
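Updates (32) and (33) can be sketched directly. This is my own illustration on the running example with full (non-stochastic) gradients; the values η = 0.01 and µ = 0.9 are arbitrary choices for a quick demo, smaller than the µ values quoted on the slide.

```python
import numpy as np

def grad_f(theta):
    return np.array([2 * theta[0], 20 * theta[1]])   # f = theta_1^2 + 10 * theta_2^2

def momentum_descent(theta0, eta=0.01, mu=0.9, steps=300):
    """Updates (32)-(33): v accumulates past gradients, theta moves along -v."""
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = mu * v + grad_f(theta)
        theta = theta - eta * v
    return theta

theta_hat = momentum_descent([3.0, 2.0])
```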


SLIDE 51

Intuitive Explanation

(Note: the arrows show the opposite direction of the gradient)

(a) SGD without momentum (b) SGD with momentum

Figure: The effect of momentum in SGD: reducing the fluctuation (Credit: Genevieve B. Orr)

SLIDE 52

Another Example with Contour Plot

Consider the following problem:

y = x₁² + 10x₂²   (34)

∂y/∂x₁ = 2x₁,  ∂y/∂x₂ = 20x₂   (35)

Note: the arrows show the opposite direction of the gradient

SLIDE 53

Another Example with Contour Plot (Cont.)

Adding the previous update direction reduces the fluctuation of the stochastic gradients:

v(t) = µv(t−1) + g(t−1)   (36)

Note: the arrows show the opposite direction of the gradient

SLIDE 54

Adaptive Learning Rates

SLIDE 55

Basic Idea

The basic idea of using adaptive learning rates is to make sure that all θ_k's converge at roughly the same speed. For neural networks, the motivation of picking a different learning rate for each θ_k (the k-th component of the parameter θ) is not new [LeCun et al., 2012] (the article was originally published in 1998).


SLIDE 58

AdaGrad

The basic idea of AdaGrad [Duchi et al., 2011] is to modify the learning rate η for each θ_k by using the history of the gradients:

θ_k(t) = θ_k(t−1) − η₀ / (√(G_{k,k}(t−1)) + ǫ) · g_k(t−1)   (37)

◮ g_k(t−1) = [∇f(θ)|_{θ(t−1)}]_k is the k-th component of ∇f(θ)|_{θ(t−1)}
◮ G_{k,k}(t−1) = ∑_{i=1}^{t−1} (g_k(i))²
◮ η₀ is the initial learning rate
◮ ǫ is a smoothing parameter, usually of order 10⁻⁶
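The per-coordinate update (37) can be sketched as follows. This is my own illustration on the running example with full gradients; η₀ = 1.0 and the step count are arbitrary demo choices.

```python
import numpy as np

def grad_f(theta):
    return np.array([2 * theta[0], 20 * theta[1]])   # f = theta_1^2 + 10 * theta_2^2

def adagrad(theta0, eta0=1.0, eps=1e-6, steps=500):
    """Per-coordinate update (37): scale eta0 by the root of accumulated squared gradients."""
    theta = np.array(theta0, dtype=float)
    G = np.zeros_like(theta)    # running sum of squared gradients, one entry per coordinate
    for _ in range(steps):
        g = grad_f(theta)
        G += g ** 2
        theta = theta - eta0 / (np.sqrt(G) + eps) * g
    return theta

theta_hat = adagrad([3.0, 2.0])
```

Note how the coordinate with the larger gradients (θ₂ here) accumulates a larger G and therefore takes smaller effective steps, matching the intuition on the next slide.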


SLIDE 60

AdaGrad: Intuitive Explanation

Consider the gradient of a 2-dimensional optimization problem with θ = (θ₁, θ₂):

θ_k(t) = θ_k(t−1) − η₀ / (√(G_{k,k}(t−1)) + ǫ) · g_k(t−1)   (38)

The magnitude of the gradient along θ₂ is often larger than along θ₁. AdaGrad helps shrink the step sizes along θ₂, which allows both coordinates to converge at roughly the same speed.

SLIDE 61

RMSProp

RMSProp (Root Mean Square Propagation) uses a moving average of the past gradients:

θ_k(t) = θ_k(t−1) − η₀ / (√(r_k(t)) + ǫ) · g_k(t−1)   (39)

where

r_k(t) = ρ r_k(t−1) + (1 − ρ)[g_k(t−1)]²   (40)

and ρ ∈ (0, 1), k is the dimension index, and t is the time step [Hinton et al., 2012]
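Updates (39) and (40) can be sketched as below. A sketch of my own on the running example with assumed values η₀ = 0.05 and ρ = 0.9; with a constant step size RMSProp typically hovers in a small neighborhood of the optimum rather than converging exactly.

```python
import numpy as np

def f(theta):
    return theta[0] ** 2 + 10 * theta[1] ** 2

def grad_f(theta):
    return np.array([2 * theta[0], 20 * theta[1]])

def rmsprop(theta0, eta0=0.05, rho=0.9, eps=1e-6, steps=1000):
    """Updates (39)-(40): divide the step by a moving average of squared gradients."""
    theta = np.array(theta0, dtype=float)
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_f(theta)
        r = rho * r + (1 - rho) * g ** 2
        theta = theta - eta0 / (np.sqrt(r) + eps) * g
    return theta

theta_hat = rmsprop([3.0, 2.0])
```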


SLIDE 63

Adam

The Adam algorithm [Kingma and Ba, 2014] combines the ideas of SGD with momentum and RMSProp:

v_k(t) = µ v_k(t−1) + (1 − µ) g_k(t−1)   (41)

r_k(t) = ρ r_k(t−1) + (1 − ρ)[g_k(t−1)]²   (42)

v̂_k(t) = v_k(t) / (1 − µᵗ)   (43)

r̂_k(t) = r_k(t) / (1 − ρᵗ)   (44)

θ_k(t) = θ_k(t−1) − η₀ · v̂_k(t) / (√(r̂_k(t)) + ǫ)   (45)

The default values of µ and ρ are 0.9 and 0.999, respectively.

SLIDE 65

How to Choose an Optimization Algorithm?

[Hinton et al., 2012, Lecture Notes in 2012]

SLIDE 66

References

Bottou, L. (1998). Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9):142.

Bottou, L., Curtis, F. E., and Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Hinton, G., Srivastava, N., and Swersky, K. (2012). Neural networks for machine learning, Lecture 6a: Overview of mini-batch gradient descent.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R. (2012). Efficient BackProp. In Neural Networks: Tricks of the Trade, pages 9–48. Springer.