

SLIDE 1

Machine Learning

Support Vector Machines: Training with Stochastic Gradient Descent


SLIDE 2

Support vector machines

  • Training by maximizing margin
  • The SVM objective
  • Solving the SVM optimization problem
  • Support vectors, duals and kernels


SLIDE 3

SVM objective function

Regularization term:
  • Maximizes the margin
  • Imposes a preference over the hypothesis space and pushes for better generalization
  • Can be replaced with other regularization terms which impose other preferences

Empirical loss:
  • Hinge loss
  • Penalizes weight vectors that make mistakes
  • Can be replaced with other loss functions which impose other preferences

C: a hyper-parameter that controls the tradeoff between a large margin and a small hinge loss
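
The objective itself appeared as an image on this slide. Reconstructed from the two terms described above (and consistent with the update rule used later in the deck), it takes the standard form:

```latex
\min_{\mathbf{w}}\; J(\mathbf{w}) \;=\;
\underbrace{\tfrac{1}{2}\,\mathbf{w}^T\mathbf{w}}_{\text{regularization}}
\;+\;
C \underbrace{\sum_{i} \max\!\bigl(0,\; 1 - y_i\,\mathbf{w}^T\mathbf{x}_i\bigr)}_{\text{hinge loss}}
```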

SLIDE 4

Outline: Training SVM by optimization

1. Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron


SLIDE 6

Solving the SVM optimization problem

This function is convex in w


SLIDE 7

Recall: Convex functions

A function 𝑔 is convex if for every 𝒗, π’˜ in the domain, and for every πœ‡ ∈ [0,1], we have
𝑔(πœ‡π’— + (1 βˆ’ πœ‡)π’˜) ≀ πœ‡π‘”(𝒗) + (1 βˆ’ πœ‡)𝑔(π’˜)

[Figure: a convex curve with points u and v; the chord between f(u) and f(v) lies above the function]

From a geometric perspective: every tangent plane lies below the function.


SLIDE 9

Convex functions

[Figures: linear functions are convex; the max of convex functions is convex]

Some ways to show that a function is convex:
  • 1. Using the definition of convexity
  • 2. Showing that the second derivative is non-negative (for one-dimensional functions)
  • 3. Showing that the Hessian (the matrix of second derivatives) is positive semi-definite (for functions over vectors)
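
As a quick worked illustration of these criteria (not on the original slide):

```latex
f(x) = x^{2}:\quad f''(x) = 2 \ge 0 \;\Rightarrow\; f \text{ is convex (criterion 2).}
\qquad
\max(0,\, 1-x) \text{ is a max of two linear functions, hence convex.}
```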

SLIDE 10

Not all functions are convex

[Figures: examples of non-convex functions]

These are concave: 𝑔(πœ‡π’— + (1 βˆ’ πœ‡)π’˜) β‰₯ πœ‡π‘”(𝒗) + (1 βˆ’ πœ‡)𝑔(π’˜)
These are neither convex nor concave.

SLIDE 11

Convex functions are convenient

A function 𝑔 is convex if for every 𝒗, π’˜ in the domain, and for every πœ‡ ∈ [0,1], we have
𝑔(πœ‡π’— + (1 βˆ’ πœ‡)π’˜) ≀ πœ‡π‘”(𝒗) + (1 βˆ’ πœ‡)𝑔(π’˜)

In general, a necessary condition for x to be a minimum of the function f is βˆ‡f(x) = 0. For convex functions, this is both necessary and sufficient.

SLIDE 12

Solving the SVM optimization problem

This function is convex in w.

  • This is a quadratic optimization problem because the objective is quadratic
  • Older methods: used techniques from Quadratic Programming
    – Very slow
  • No constraints, can use gradient descent
    – Still very slow!

SLIDE 13

Gradient descent

General strategy for minimizing a function J(w):

  • Start with an initial guess for w, say wβ‚€
  • Iterate till convergence:
    – Compute the gradient of J at wβ‚œ
    – Update wβ‚œ to get wβ‚œβ‚Šβ‚ by taking a step in the opposite direction of the gradient

[Figure: the curve J(w) with iterates wβ‚€, w₁, wβ‚‚, w₃ stepping down toward the minimum]

Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction. We are trying to minimize J(w).
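
A minimal sketch of this strategy in Python (an illustration, assuming a differentiable objective; `grad_J`, the fixed learning rate, and the stopping tolerance are my choices, not from the slides):

```python
import numpy as np

def gradient_descent(grad_J, w0, lr=0.1, tol=1e-6, max_iters=1000):
    """Minimize J by repeatedly stepping opposite to its gradient."""
    w = w0
    for _ in range(max_iters):
        g = grad_J(w)                 # gradient of J at the current w
        if np.linalg.norm(g) < tol:   # converged: gradient is (near) zero
            break
        w = w - lr * g                # step in the opposite direction
    return w

# Example: minimize J(w) = ||w||^2, whose gradient is 2w
w_min = gradient_descent(lambda w: 2 * w, w0=np.array([3.0, -2.0]))
```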


SLIDE 17

Gradient descent for SVM

We are trying to minimize J(w).

  • 1. Initialize wβ‚€
  • 2. For t = 0, 1, 2, ….
    1. Compute the gradient of J(w) at wβ‚œ. Call it βˆ‡J(wβ‚œ)
    2. Update: wβ‚œβ‚Šβ‚ ← wβ‚œ βˆ’ Ξ³β‚œ βˆ‡J(wβ‚œ)

Ξ³β‚œ: called the learning rate.

SLIDE 18

Outline: Training SVM by optimization

βœ“ Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

SLIDE 19

Gradient descent for SVM

We are trying to minimize J(w).

  • 1. Initialize wβ‚€
  • 2. For t = 0, 1, 2, ….
    1. Compute the gradient of J(w) at wβ‚œ. Call it βˆ‡J(wβ‚œ)
    2. Update: wβ‚œβ‚Šβ‚ ← wβ‚œ βˆ’ Ξ³β‚œ βˆ‡J(wβ‚œ)

Ξ³β‚œ: called the learning rate.

The gradient of the SVM objective requires summing over the entire training set. Slow, does not really scale.

SLIDE 20

Stochastic gradient descent for SVM

Given a training set S = {(xα΅’, yα΅’)}, x ∈ ℝⁿ, y ∈ {βˆ’1, 1}
1. Initialize wβ‚€ = 0 ∈ ℝⁿ
2. For epoch = 1 … T:
   1. Pick a random example (xα΅’, yα΅’) from the training set S
   2. Treat (xα΅’, yα΅’) as a full dataset and take the derivative of the SVM objective at the current wβ‚œβ‚‹β‚ to be βˆ‡Jβ‚œ(wβ‚œβ‚‹β‚)
   3. Update: wβ‚œ ← wβ‚œβ‚‹β‚ βˆ’ Ξ³β‚œ βˆ‡Jβ‚œ(wβ‚œβ‚‹β‚)
3. Return final w


SLIDE 25

Stochastic gradient descent for SVM

Given a training set S = {(xα΅’, yα΅’)}, x ∈ ℝⁿ, y ∈ {βˆ’1, 1}
1. Initialize wβ‚€ = 0 ∈ ℝⁿ
2. For epoch = 1 … T:
   1. Pick a random example (xα΅’, yα΅’) from the training set S
   2. Treat (xα΅’, yα΅’) as a full dataset and take the derivative of the SVM objective at the current wβ‚œβ‚‹β‚ to be βˆ‡Jβ‚œ(wβ‚œβ‚‹β‚)
   3. Update: wβ‚œ ← wβ‚œβ‚‹β‚ βˆ’ Ξ³β‚œ βˆ‡Jβ‚œ(wβ‚œβ‚‹β‚)
3. Return final w

What is the gradient of the hinge loss with respect to w? (The hinge loss is not a differentiable function!)

This algorithm is guaranteed to converge to the minimum of J if Ξ³β‚œ is small enough. Why? Because the objective J(w) is a convex function.

SLIDE 26

Outline: Training SVM by optimization

βœ“ Review of convex functions and gradient descent
βœ“ Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

SLIDE 27

Gradient Descent vs SGD

[Figure: the optimization trajectory of gradient descent]

SLIDE 28–45

Gradient Descent vs SGD

[Figures: successive animation frames of the stochastic gradient descent trajectory]

Stochastic gradient descent: many more updates than gradient descent, but each individual update is less computationally expensive.

SLIDE 46

Outline: Training SVM by optimization

βœ“ Review of convex functions and gradient descent
βœ“ Stochastic gradient descent
βœ“ Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

SLIDE 47

Stochastic gradient descent for SVM

Given a training set S = {(xα΅’, yα΅’)}, x ∈ ℝⁿ, y ∈ {βˆ’1, 1}
1. Initialize wβ‚€ = 0 ∈ ℝⁿ
2. For epoch = 1 … T:
   1. Pick a random example (xα΅’, yα΅’) from the training set S
   2. Treat (xα΅’, yα΅’) as a full dataset and take the derivative of the SVM objective at the current wβ‚œβ‚‹β‚ to be βˆ‡Jβ‚œ(wβ‚œβ‚‹β‚)
   3. Update: wβ‚œ ← wβ‚œβ‚‹β‚ βˆ’ Ξ³β‚œ βˆ‡Jβ‚œ(wβ‚œβ‚‹β‚)
3. Return final w

What is the derivative of the hinge loss with respect to w? (The hinge loss is not a differentiable function!)

SLIDE 48

Hinge loss is not differentiable!

What is the derivative of the hinge loss with respect to w?


SLIDE 49

Detour: Sub-gradients

Generalization of gradients to non-differentiable functions. (Remember that for convex functions, every tangent is a hyperplane that lies below the function.)

Informally, a sub-tangent at a point is any hyperplane that lies below the function at that point. A sub-gradient is the slope of that line.

SLIDE 50

Sub-gradients

[Example from Boyd]

[Figure: g₁ is a gradient at x₁, the tangent at that point; gβ‚‚ and g₃ are both subgradients at xβ‚‚; f is differentiable at x₁]

Formally, a vector g is a subgradient of f at a point x if
f(z) β‰₯ f(x) + gα΅€(z βˆ’ x) for all z.
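
A concrete one-dimensional example (standard, not from the slide): f(x) = |x| is not differentiable at 0, and its subgradients there are exactly the slopes between βˆ’1 and 1:

```latex
\partial f(0) \;=\; \{\, g \in \mathbb{R} : |x| \ge g\,x \ \text{ for all } x \,\} \;=\; [-1,\, 1].
```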


SLIDE 53

Sub-gradient of the SVM objective

General strategy: first solve the max and compute the gradient for each case.
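
The objective and its case analysis appeared as images on this slide. Reconstructed from the per-example update used on the following slides (treating the single example's objective as Jβ‚œ(w) = Β½ wα΅€w + C max(0, 1 βˆ’ yα΅’ wα΅€xα΅’)), the case-wise sub-gradient is:

```latex
\nabla J_t(\mathbf{w}) \;=\;
\begin{cases}
\mathbf{w} - C\,y_i\,\mathbf{x}_i, & y_i\,\mathbf{w}^T\mathbf{x}_i \le 1 \quad (\text{hinge active})\\[2pt]
\mathbf{w}, & \text{otherwise} \quad (\text{hinge inactive})
\end{cases}
```

Substituting this into w ← w βˆ’ Ξ³β‚œ βˆ‡Jβ‚œ(w) yields exactly the two-case update on the next slide.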


SLIDE 55

Outline: Training SVM by optimization

βœ“ Review of convex functions and gradient descent
βœ“ Stochastic gradient descent
βœ“ Gradient descent vs stochastic gradient descent
βœ“ Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

SLIDE 56

Stochastic sub-gradient descent for SVM

Given a training set S = {(xα΅’, yα΅’)}, x ∈ ℝⁿ, y ∈ {βˆ’1, 1}

  • 1. Initialize w = 0 ∈ ℝⁿ
  • 2. For epoch = 1 … T:
    1. For each training example (xα΅’, yα΅’) ∈ S:
       If yα΅’ wα΅€xα΅’ ≀ 1: w ← (1 βˆ’ Ξ³β‚œ) w + Ξ³β‚œ C yα΅’ xα΅’
       else: w ← (1 βˆ’ Ξ³β‚œ) w
  • 3. Return w

(Each branch is the generic update w ← w βˆ’ Ξ³β‚œ βˆ‡Jβ‚œ applied to the corresponding case of the sub-gradient.)


SLIDE 61

Stochastic sub-gradient descent for SVM

Given a training set S = {(xα΅’, yα΅’)}, x ∈ ℝⁿ, y ∈ {βˆ’1, 1}

  • 1. Initialize w = 0 ∈ ℝⁿ
  • 2. For epoch = 1 … T:
    1. Shuffle the training set
    2. For each training example (xα΅’, yα΅’) ∈ S:
       If yα΅’ wα΅€xα΅’ ≀ 1: w ← (1 βˆ’ Ξ³β‚œ) w + Ξ³β‚œ C yα΅’ xα΅’
       else: w ← (1 βˆ’ Ξ³β‚œ) w
  • 3. Return w

Ξ³β‚œ: learning rate, many tweaks possible. It is important to shuffle the examples at the start of each epoch.
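
A runnable sketch of this algorithm in Python (a minimal illustration; the learning-rate schedule and hyper-parameter values are assumptions, not prescribed by the slides):

```python
import numpy as np

def sgd_svm(X, y, C=1.0, lr0=0.1, epochs=100, seed=0):
    """Stochastic sub-gradient descent for the SVM objective
    J(w) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i w.x_i)."""
    rng = np.random.default_rng(seed)
    n_examples, n_features = X.shape
    w = np.zeros(n_features)                  # 1. initialize w = 0
    t = 0
    for epoch in range(epochs):               # 2. for each epoch
        order = rng.permutation(n_examples)   # shuffle at the start of each epoch
        for i in order:
            lr = lr0 / (1 + t)                # assumed schedule: square summable, not summable
            t += 1
            if y[i] * np.dot(w, X[i]) <= 1:   # margin violated: hinge is active
                w = (1 - lr) * w + lr * C * y[i] * X[i]
            else:                             # hinge inactive: only shrink w
                w = (1 - lr) * w
    return w                                  # 3. return w

# Tiny usage example on linearly separable toy data
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = sgd_svm(X, y)
print(np.sign(X @ w))  # predicted labels
```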

SLIDE 62

Convergence and learning rates

With enough iterations, it will converge in expectation, provided the step sizes are β€œsquare summable, but not summable”:

  • Step sizes Ξ³β‚œ are positive
  • The sum of the squares of the step sizes over t = 1 to ∞ is finite
  • The sum of the step sizes over t = 1 to ∞ is infinite
  • Some examples: Ξ³β‚œ = γ₀ / (1 + γ₀t), or Ξ³β‚œ = γ₀ / (1 + t)
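
A quick check (not on the original slide) that the second schedule meets both conditions:

```latex
\sum_{t=1}^{\infty} \frac{\gamma_0}{1+t} = \infty,
\qquad
\sum_{t=1}^{\infty} \Bigl(\frac{\gamma_0}{1+t}\Bigr)^{2}
\;\le\; \gamma_0^{2} \sum_{t=1}^{\infty} \frac{1}{t^{2}}
\;=\; \frac{\gamma_0^{2}\,\pi^{2}}{6} \;<\; \infty .
```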


SLIDE 64

Convergence and learning rates

  • Number of iterations to get to accuracy within Ξ΅
  • For strongly convex functions, N examples, d dimensional:
    – Gradient descent: O(Nd ln(1/Ξ΅))
    – Stochastic gradient descent: O(d/Ξ΅)
  • More subtleties involved, but SGD is generally preferable when the data size is huge
  • Recently, many variants that are based on this general strategy, targeting multilayer neural networks
    – Examples: Adagrad, momentum, Nesterov’s accelerated gradient, Adam, RMSProp, etc.


SLIDE 66

Outline: Training SVM by optimization

βœ“ Review of convex functions and gradient descent
βœ“ Stochastic gradient descent
βœ“ Gradient descent vs stochastic gradient descent
βœ“ Sub-derivatives of the hinge loss
βœ“ Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

SLIDE 67

Stochastic sub-gradient descent for SVM

Given a training set S = {(xα΅’, yα΅’)}, x ∈ ℝⁿ, y ∈ {βˆ’1, 1}

  • 1. Initialize w = 0 ∈ ℝⁿ
  • 2. For epoch = 1 … T:
    1. Shuffle the training set
    2. For each training example (xα΅’, yα΅’) ∈ S:
       If yα΅’ wα΅€xα΅’ ≀ 1: w ← (1 βˆ’ Ξ³β‚œ) w + Ξ³β‚œ C yα΅’ xα΅’
       else: w ← (1 βˆ’ Ξ³β‚œ) w
  • 3. Return w

Compare with the Perceptron update: if y wα΅€x ≀ 0, update w ← w + r y x
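
Side by side in code (an illustrative sketch; the function and variable names are mine, not from the slides):

```python
import numpy as np

def perceptron_update(w, x, y, r=1.0):
    # Update only on a misclassification: y * w.x <= 0
    if y * np.dot(w, x) <= 0:
        w = w + r * y * x
    return w

def svm_sgd_update(w, x, y, lr, C=1.0):
    # Update on a margin violation (y * w.x <= 1), and always shrink w:
    # the (1 - lr) shrinkage comes from the regularization term.
    if y * np.dot(w, x) <= 1:
        return (1 - lr) * w + lr * C * y * x
    return (1 - lr) * w
```

The two differ exactly as the next slide describes: the SVM update fires inside the margin, not only on mistakes, and the shrinkage is the regularizer at work.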

SLIDE 68

Perceptron vs. SVM

  • Perceptron: stochastic sub-gradient descent for a different loss
    – No regularization, though
  • SVM optimizes the hinge loss
    – With regularization

SLIDE 69

SVM summary from an optimization perspective

  • Minimize regularized hinge loss
  • Solve using stochastic gradient descent
    – Very fast; run time does not depend on the number of examples
    – Compare with the Perceptron algorithm: the Perceptron does not maximize the margin width
      • Perceptron variants can force a margin
    – The convergence criterion is an issue: SGD can be aggressive in the beginning and get to a reasonably good solution fast, but convergence is slow if a very accurate weight vector is needed
  • Other successful optimization algorithms exist
    – E.g.: dual coordinate descent, implemented in liblinear

Questions?