

SLIDE 1

Support vector machines

CS 446

SLIDE 2

Part 1: linear support vector machines

[Figure: three contour plots of linear predictors fit to the same training data. Panels: Logistic regression; Least squares; SVM.]

SLIDE 3

Part 2: kernelized support vector machines

[Figure: four contour plots of nonlinear decision surfaces on the same data. Panels: ReLU network; Quadratic SVM; RBF SVM; Narrower RBF SVM.]

SLIDE 4

1. Recap: linearly separable data
SLIDES 5–7

Linear classifiers (with Y = {−1, +1})

Linear separability assumption. Assume there is a linear classifier that perfectly classifies the training data S: for some $w_\star \in \mathbb{R}^d$,

$$\min_{(x,y)\in S} y\, x^\top w_\star > 0.$$

Convex program. Finding any such w is a convex (linear!) feasibility problem.

Logistic regression. Alternatively, one can run enough steps of logistic regression.
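To make the logistic regression route concrete, here is a minimal numpy sketch; the toy data, step size, and iteration budget are illustrative assumptions, not from the slides. On separable data, gradient descent on the logistic loss eventually yields a w with positive minimum margin.

```python
# A minimal sketch, assuming toy Gaussian data, a fixed step size, and a
# fixed iteration budget (none of these come from the slides).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 2))
m = X @ np.array([2.0, -1.0])
keep = np.abs(m) > 0.3          # enforce a gap so separation is easy
X, y = X[keep], np.sign(m[keep])

w = np.zeros(2)
for _ in range(5000):
    margins = y * (X @ w)
    # Gradient of the average logistic loss  mean_i log(1 + exp(-margin_i)).
    grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= 0.5 * grad

print("min margin:", (y * (X @ w)).min())  # positive once w separates S
```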

SLIDES 8–9

Support vector machines (SVMs)

Motivation
◮ Let's first define a good linear separator, and then solve for it.
◮ Let's also find a principled approach to nonseparable data.

Support vector machines (Vapnik and Chervonenkis, 1963)
◮ Characterize a stable solution for linearly separable problems: the maximum margin solution.
◮ Solve for the maximum margin solution efficiently via convex optimization.
◮ The convex dual has valuable structure; it will give useful extensions, and is what we'll optimize.
◮ Extend the optimization problem to non-separable data via convex surrogate losses.
◮ Nonlinear separators via kernels.

SLIDE 10

2. Maximum margin solution
SLIDES 11–17

Maximum margin solution

[Figure sequence: the best linear classifier on the population; an arbitrary linear separator on the training data S; the maximum margin solution on S.]

Why use the maximum margin solution?
(i) It is uniquely determined by S, unlike the linear program.
(ii) It is a particular inductive bias, i.e., an assumption about the problem, that seems to be commonly useful.
◮ We've seen inductive bias before: least squares and logistic regression choose different predictors on the same data.
◮ This particular bias (margin maximization) is common in machine learning and has many nice properties.

Key insight: we can express this as another convex program.

SLIDES 18–24

Distance to decision boundary

Suppose $w \in \mathbb{R}^d$ satisfies $\min_{(x,y)\in S} y\, x^\top w > 0$.

◮ "Maximum margin" shouldn't care about scaling; w and 10w should be equally good.
◮ Thus for each direction $w/\|w\|_2$, we can fix a scaling.

Let $(\tilde{x}, \tilde{y})$ be any example in S that achieves the minimum.

[Figure: hyperplane H with normal w; the point $\tilde{y}\tilde{x}$ lies at distance $\tilde{y}\tilde{x}^\top w / \|w\|_2$ from H.]

◮ Rescale w so that $\tilde{y}\tilde{x}^\top w = 1$. (Now the scaling is fixed.)
◮ The distance from $\tilde{y}\tilde{x}$ to H is $1/\|w\|_2$. This is the (normalized minimum) margin.
◮ This gives the optimization problem

$$\max_{w} \; 1/\|w\|_2 \quad \text{subj. to} \quad \min_{(x,y)\in S} y\, x^\top w = 1.$$

We can relax the constraint to $\ge 1$.
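A quick numeric sanity check of the rescaling argument; the toy points and the weight vector are assumptions for illustration. The geometric margin $\min_{(x,y)\in S} y\, x^\top w / \|w\|_2$ is scale invariant, and after rescaling so the minimum equals 1 it reads off as $1/\|w\|_2$.

```python
# A small numeric check of the margin formula on assumed toy data.
import numpy as np

X = np.array([[3.0, 1.0], [1.0, 3.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0])
w = np.array([2.0, 1.0])                 # separates this toy set

margins = y * (X @ w)                    # here: [7, 5, 6], all positive
w_scaled = w / margins.min()             # now min_i y_i x_i^T w = 1
print(margins.min() / np.linalg.norm(w))  # geometric margin of direction w
print(1.0 / np.linalg.norm(w_scaled))     # the same value, now 1/||w||_2
```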

SLIDES 25–29

Maximum margin linear classifier

The solution $\hat{w}$ to the following optimization problem:

$$\min_{w\in\mathbb{R}^d} \; \frac{1}{2}\|w\|_2^2 \quad \text{s.t.} \quad y\, x^\top w \ge 1 \text{ for all } (x,y) \in S$$

gives the linear classifier with the largest minimum margin on S, i.e., the maximum margin linear classifier or support vector machine (SVM) classifier.

This is a convex optimization problem; it can be solved in polynomial time.

If there is a solution (i.e., S is linearly separable), then the solution is unique.

We can solve this in a variety of ways (e.g., projected gradient descent); we will work with the dual.

Note: we can also explicitly include an affine expansion, so the decision boundary need not pass through the origin. We'll do our derivations without it.
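Since the slide only asserts that this is a polynomial-time convex QP, here is one hedged way to solve the primal directly; the cvxpy modeling library and the toy data are assumptions (any QP solver would do).

```python
# A minimal sketch of the hard-margin primal, assuming cvxpy and toy data.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
m = X @ np.array([1.0, -0.5])
X, y = X[np.abs(m) > 0.2], np.sign(m[np.abs(m) > 0.2])  # separable toy data

w = cp.Variable(2)
constraints = [cp.multiply(y, X @ w) >= 1]   # y_i x_i^T w >= 1 for all i
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()

w_hat = w.value
print("maximum margin 1/||w||_2:", 1.0 / np.linalg.norm(w_hat))
print("min_i y_i x_i^T w:", (y * (X @ w_hat)).min())  # ~1 at the optimum
```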

SLIDE 30

3. SVM dual problem
SLIDES 31–34

Two SVM optimization problems

SVM (primal) problem:

$$\min_{w\in\mathbb{R}^d} \; \frac{1}{2}\|w\|_2^2 \quad \text{s.t.} \quad y_i x_i^\top w \ge 1 \text{ for all } i = 1, \dots, n.$$

SVM dual problem:

$$\max_{\alpha_1, \dots, \alpha_n \ge 0} \; \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^\top x_j.$$

We'll derive the SVM dual problem using Lagrangian duality. Looking ahead, the dual will help us with kernels and nonlinear separators.
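For readers who prefer matrix notation, the dual can be rewritten compactly; this is pure algebra on the displayed objective, and the symbol Q is our own notation, not the slides'.

```latex
% Matrix form of the dual. With Q_{ij} := y_i y_j x_i^T x_j, i.e.
% Q = diag(y) X X^T diag(y), which is positive semidefinite, the dual
% is a concave quadratic program over the nonnegative orthant:
\max_{\alpha \ge 0} \; \mathbf{1}^\top \alpha \;-\; \tfrac{1}{2}\, \alpha^\top Q \alpha .
```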

SLIDES 35–39

Lagrange multipliers

Move the constraints into the objective using the method of Lagrange multipliers. Original problem:

$$\min_{w\in\mathbb{R}^d} \; \frac{1}{2}\|w\|_2^2 \quad \text{s.t.} \quad 1 - y_i x_i^\top w \le 0 \text{ for all } i = 1, \dots, n.$$

◮ For each (inequality) constraint $1 - y_i x_i^\top w \le 0$, associate a (non-negative) dual variable $\alpha_i$ (a.k.a. Lagrange multiplier).
◮ Move the constraints into the objective by adding $\sum_{i=1}^n \alpha_i (1 - y_i x_i^\top w)$ and maximizing over $\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^n$ with $\alpha \ge 0$ (i.e., $\alpha_i \ge 0$ for all $i$).

Resulting optimization problem (note the lack of explicit constraints on w):

$$\min_{w\in\mathbb{R}^d} \; \max_{\alpha \ge 0} \left[ \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^n \alpha_i (1 - y_i x_i^\top w) \right].$$

Equivalence: if w violates the original i-th constraint (so $1 - y_i x_i^\top w > 0$), then the inner $\max_{\alpha \ge 0}$ will send $\alpha_i \to \infty$ and the objective $\to \infty$. Such a w cannot be the minimizer!
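To spell out the equivalence in one identity (a small added step, consistent with the argument above): the inner maximization acts as an exact penalty for the constraints.

```latex
% The inner maximum is 0 when w is feasible and +infinity otherwise:
\max_{\alpha \ge 0} \sum_{i=1}^n \alpha_i \bigl(1 - y_i x_i^\top w\bigr)
  = \begin{cases}
      0       & \text{if } 1 - y_i x_i^\top w \le 0 \text{ for all } i,\\
      +\infty & \text{otherwise,}
    \end{cases}
% so the min-max objective equals (1/2)||w||_2^2 on the feasible set and
% +infinity off it, recovering the original constrained problem.
```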

SLIDES 40–43

Dual problem

Let's define the Lagrangian L as

$$L(w, \alpha) := \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^n \alpha_i (1 - y_i x_i^\top w).$$

Our primal maximum margin problem was

$$P(w) = \max_{\alpha\ge 0} L(w, \alpha) = \max_{\alpha\ge 0} \left[ \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^n \alpha_i (1 - y_i x_i^\top w) \right].$$

The dual problem $D(\alpha) = \min_w L(w, \alpha)$ can be found in closed form; since $w \mapsto L(w, \alpha)$ is a convex quadratic with minimizer $w = \sum_{i=1}^n \alpha_i y_i x_i$,

$$D(\alpha) = \min_{w\in\mathbb{R}^d} L(w, \alpha) = L\Big(\sum_{i=1}^n \alpha_i y_i x_i,\, \alpha\Big) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\Big\|\sum_{i=1}^n \alpha_i y_i x_i\Big\|_2^2 = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^\top x_j.$$

(The last form appeared on an earlier slide as the dual problem.)

Key question: does a solution to D give us a solution to P?
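The minimizer asserted above follows from a one-line gradient computation; we spell it out here as a small added step, using only the definitions on this slide.

```latex
% L(., alpha) is a convex quadratic in w, so set its gradient to zero:
\nabla_w L(w, \alpha) \;=\; w - \sum_{i=1}^n \alpha_i y_i x_i \;=\; 0
\quad\Longleftrightarrow\quad
w \;=\; \sum_{i=1}^n \alpha_i y_i x_i \;=:\; v.
% Substituting w = v back into L gives the closed form for D:
L(v, \alpha) \;=\; \tfrac{1}{2}\|v\|_2^2 + \sum_{i=1}^n \alpha_i - v^\top v
            \;=\; \sum_{i=1}^n \alpha_i - \tfrac{1}{2}\|v\|_2^2 .
```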

SLIDES 44–46

Recall:

$$L(w, \alpha) = \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^n \alpha_i (1 - y_i x_i^\top w) \quad \text{(Lagrangian)},$$

$$P(w) = \max_{\alpha\ge 0} L(w, \alpha) \quad \text{(primal problem)}, \qquad D(\alpha) = \min_{w} L(w, \alpha) \quad \text{(dual problem)}.$$

◮ For general Lagrangians, we have weak duality $P(w) \ge D(\alpha)$, since

$$P(w) = \max_{\alpha'\ge 0} L(w, \alpha') \ge L(w, \alpha) \ge \min_{w'} L(w', \alpha) = D(\alpha).$$

◮ By convexity, we have strong duality $\min_w P(w) = \max_{\alpha\ge 0} D(\alpha)$, and an optimum $\alpha$ for D gives an optimum for P via

$$w = \sum_{i=1}^n \alpha_i y_i x_i = \operatorname*{arg\,min}_{w} L(w, \alpha).$$
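Here is a minimal numpy sketch of this recovery in action, solving the dual by projected gradient ascent; the toy data, step size, and iteration budget are assumptions, since the slides do not prescribe a dual solver. The gradient of D is $1 - Q\alpha$ with $Q_{ij} = y_i y_j x_i^\top x_j$, the constraint $\alpha \ge 0$ is enforced by clipping, and $\hat{w}$ is then read off as $\sum_i \hat{\alpha}_i y_i x_i$.

```python
# A minimal sketch: projected gradient ascent on the SVM dual, assuming
# toy data and a conservative step size.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
m = X @ np.array([1.0, 1.0])
X, y = X[np.abs(m) > 0.5], np.sign(m[np.abs(m) > 0.5])  # separable, with a gap
n = len(y)

Q = np.outer(y, y) * (X @ X.T)               # Q_ij = y_i y_j x_i^T x_j
eta = 1.0 / np.linalg.eigvalsh(Q).max()      # step size from the curvature
alpha = np.zeros(n)
for _ in range(20000):
    alpha = np.maximum(0.0, alpha + eta * (1.0 - Q @ alpha))  # ascend, project

w_hat = (alpha * y) @ X                      # primal optimum w = sum_i a_i y_i x_i
print("min margin:", (y * (X @ w_hat)).min())        # ~1 at the optimum
print("duality gap:", w_hat @ w_hat - alpha.sum())   # ~0 by strong duality
```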

SLIDES 47–50

Support vectors

Optimal solutions $\hat{w}$ and $\hat{\alpha} = (\hat{\alpha}_1, \dots, \hat{\alpha}_n)$ satisfy
◮ $\hat{w} = \sum_{i=1}^n \hat{\alpha}_i y_i x_i = \sum_{i:\, \hat{\alpha}_i > 0} \hat{\alpha}_i y_i x_i$,
◮ $\hat{\alpha}_i > 0 \Rightarrow y_i x_i^\top \hat{w} = 1$ for all $i = 1, \dots, n$ (complementary slackness).

The $y_i x_i$ with $\hat{\alpha}_i > 0$ are called support vectors.

[Figure: separating hyperplane H; the support vectors lie on the margin boundaries.]

◮ Support vector examples satisfy the "margin" constraints with equality.
◮ We would get the same solution even if we kept only the support vector examples.
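As a hedged cross-check with an off-the-shelf solver (an assumption: the slides do not use scikit-learn, and sklearn's SVC additionally fits an intercept, unlike our through-the-origin derivation): SVC with a linear kernel and a very large C approximates the hard-margin SVM, and its support_ attribute indexes exactly the training points with $\hat{\alpha}_i > 0$.

```python
# A hedged cross-check, assuming scikit-learn and toy data; large C
# approximates the hard-margin SVM, and SVC also fits an intercept b.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 2))
m = X @ np.array([1.5, -1.0])
X, y = X[np.abs(m) > 0.5], np.sign(m[np.abs(m) > 0.5])

clf = SVC(kernel="linear", C=1e8).fit(X, y)    # large C ~ hard margin
w_hat, b = clf.coef_.ravel(), clf.intercept_[0]
sv = clf.support_                              # indices with alpha_i > 0

# Complementary slackness: support vectors sit exactly on the margin.
print("support vector margins:", y[sv] * (X[sv] @ w_hat + b))  # each ~1
print("number of support vectors:", len(sv))
```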

SLIDES 51–56

Proof of complementary slackness

For the optimal (feasible) solutions $\hat{w}$ and $\hat{\alpha}$, we have

$$P(\hat{w}) = D(\hat{\alpha}) = \min_{w\in\mathbb{R}^d} L(w, \hat{\alpha}) \quad \text{(by strong duality)}$$
$$\le L(\hat{w}, \hat{\alpha}) = \frac{1}{2}\|\hat{w}\|_2^2 + \sum_{i=1}^n \hat{\alpha}_i (1 - y_i x_i^\top \hat{w})$$
$$\le \frac{1}{2}\|\hat{w}\|_2^2 \quad \text{(constraints are satisfied and } \hat{\alpha}_i \ge 0\text{)}$$
$$= P(\hat{w}).$$

The chain starts and ends at $P(\hat{w})$, so every inequality holds with equality. Each term $\hat{\alpha}_i (1 - y_i x_i^\top \hat{w})$ is nonpositive (since $\hat{\alpha}_i \ge 0$ and $1 - y_i x_i^\top \hat{w} \le 0$), and their sum is zero, so every term must be zero:

$$\hat{\alpha}_i (1 - y_i x_i^\top \hat{w}) = 0 \quad \text{for all } i = 1, \dots, n.$$

If $\hat{\alpha}_i > 0$, then we must have $1 - y_i x_i^\top \hat{w} = 0$.

SLIDE 57

SVM duality summary

Lagrangian:

$$L(w, \alpha) = \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^n \alpha_i (1 - y_i x_i^\top w).$$

The primal maximum margin problem was

$$P(w) = \max_{\alpha\ge 0} L(w, \alpha) = \max_{\alpha\ge 0} \left[ \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^n \alpha_i (1 - y_i x_i^\top w) \right].$$

Dual problem:

$$D(\alpha) = \min_{w\in\mathbb{R}^d} L(w, \alpha) = L\Big(\sum_{i=1}^n \alpha_i y_i x_i,\, \alpha\Big) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\Big\|\sum_{i=1}^n \alpha_i y_i x_i\Big\|_2^2 = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^\top x_j.$$

Given a dual optimum $\hat{\alpha}$:
◮ the corresponding primal optimum is $\hat{w} = \sum_{i=1}^n \hat{\alpha}_i y_i x_i$;
◮ strong duality holds: $P(\hat{w}) = D(\hat{\alpha})$;
◮ $\hat{\alpha}_i > 0$ implies $y_i x_i^\top \hat{w} = 1$, and these $y_i x_i$ are support vectors.