SLIDE 1

Announcements - Homework

  • Homework 1 is graded, please collect at end of lecture

  • Homework 2 due today
  • Homework 3 out soon (watch email)
  • Question 1 – midterm review
SLIDE 2

HW1 score distribution

[Histogram of HW1 total scores: bins 0~10 through 100~110 on the horizontal axis, student counts from 0 to 40 on the vertical axis]

SLIDE 3

Announcements - Midterm

  • When: Wednesday, 10/20
  • Where: In Class
  • What: You, your pencil, your textbook, your notes, course slides, your calculator, your good mood :)

  • What NOT: No computers, iPhones, or anything else that has an internet connection.

  • Material: Everything from the beginning of the semester, up to and including SVMs and the kernel trick

SLIDE 4

Recitation Tomorrow!

  • Boosting, SVM (convex optimization), midterm review!

  • Strongly recommended!!
  • Place: NSH 3305 (Note: change from last time)
  • Time: 5-6 pm

Rob

SLIDE 5

Support Vector Machines

Aarti Singh

Machine Learning 10-701/15-781, Oct 13, 2010

SLIDE 6

At Pittsburgh G-20 summit …


SLIDE 7

Linear classifiers – which line is better?


SLIDE 8

Pick the one with the largest margin!


SLIDE 9

Parameterizing the decision boundary

[Figure: linear decision boundary w.x + b = 0, with region w.x + b > 0 on one side and w.x + b < 0 on the other]

Data: examples (xi, yi), i = 1, 2, …, n

w.x = Σj w(j) x(j)


SLIDE 11

Maximizing the margin

[Figure: hyperplane w.x + b = 0 with margin γ marked on each side]

margin γ = 2a/‖w‖

Distance of closest examples from the line/hyperplane

SLIDE 12

Maximizing the margin

[Figure: hyperplane w.x + b = 0 with margin γ on each side]

max γ = 2a/‖w‖
w,b
s.t. (w.xj + b) yj ≥ a   ∀j

margin γ = 2a/‖w‖, the distance of the closest examples from the line/hyperplane

Note: ‘a’ is arbitrary (can normalize equations by a); the derivation below makes this precise.
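The jump from this program to the QP on the next slide is exactly that normalization: set a = 1 (equivalently, rescale w and b by 1/a), and note that maximizing 2/‖w‖ is the same as minimizing w.w. In LaTeX:

    \max_{w,b}\ \frac{2a}{\|w\|} \ \ \text{s.t.}\ \ y_j (w \cdot x_j + b) \ge a \ \ \forall j
    \quad\overset{a=1}{\Longrightarrow}\quad
    \min_{w,b}\ w \cdot w \ \ \text{s.t.}\ \ y_j (w \cdot x_j + b) \ge 1 \ \ \forall j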

SLIDE 13

Support Vector Machines

min w.w
w,b
s.t. (w.xj + b) yj ≥ 1   ∀j

Solve efficiently by quadratic programming (QP)

– Well-studied solution algorithms

Linear hyperplane defined by “support vectors”
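For concreteness, this QP can be handed to an off-the-shelf solver. A minimal sketch, assuming the cvxpy package and a made-up linearly separable dataset (neither is from the slides):

    import numpy as np
    import cvxpy as cp

    # Toy separable data: xj in R^2, labels yj in {-1, +1}
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w = cp.Variable(2)
    b = cp.Variable()

    # min w.w  s.t.  (w.xj + b) yj >= 1 for all j
    prob = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                      [cp.multiply(y, X @ w + b) >= 1])
    prob.solve()
    print(w.value, b.value)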

SLIDE 14

Support Vectors

Linear hyperplane defined by “support vectors”

  • Moving other points a little doesn’t affect the decision boundary
  • Only need to store the support vectors to predict labels of new points (illustrated in the sketch below)
  • How many support vectors in the linearly separable case? ≤ m+1 (m = number of features)
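A sketch of this with scikit-learn (my tooling choice, with made-up data): after fitting, the classifier stores exactly the support vectors, and prediction uses only them.

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
    y = np.array([1, 1, -1, -1])

    # Very large C approximates the hard-margin SVM on separable data
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    print(clf.support_vectors_)       # the stored points defining the boundary
    print(clf.predict([[1.0, 1.5]]))  # new points are labeled using them alone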

SLIDE 15

What if data is not linearly separable?

Use features of features, features of features of features, …

x1², x2², x1x2, …, exp(x1)

But we run the risk of overfitting!
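One common way to build such derived features is a polynomial expansion. A sketch with scikit-learn's PolynomialFeatures (my choice of tool; the slide doesn't name one):

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.array([[1.0, 2.0],
                  [3.0, 4.0]])

    # Degree-2 expansion of (x1, x2): 1, x1, x2, x1^2, x1*x2, x2^2
    poly = PolynomialFeatures(degree=2)
    print(poly.fit_transform(X))
    # Raising the degree adds ever more features (and risk of overfitting)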

SLIDE 16

What if data is still not linearly separable?

min w.w + C #(mistakes)
w,b
s.t. (w.xj + b) yj ≥ 1   ∀j

Allow “error” in classification: maximize the margin and minimize the number of mistakes on training data

  • C - tradeoff parameter
  • Not a QP anymore
  • 0/1 loss doesn’t distinguish between a near miss and a bad mistake

SLIDE 17

What if data is still not linearly separable?

min w.w + C Σj ξj
w,b
s.t. (w.xj + b) yj ≥ 1 - ξj   ∀j
     ξj ≥ 0   ∀j

Allow “error” in classification

  • ξj - “slack” variables ≥ 0 (> 1 if xj misclassified)
  • Pay a linear penalty for each mistake
  • C - tradeoff parameter (chosen by cross-validation; sketched below)
  • Still a QP

Soft margin approach
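Choosing C by cross-validation, as the slide suggests, might look like this in scikit-learn (a sketch; the dataset and the grid of C values are placeholders):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    # Placeholder data; substitute a real dataset
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

    # 5-fold cross-validation over a small grid of tradeoff parameters
    search = GridSearchCV(SVC(kernel="linear"),
                          param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                          cv=5)
    search.fit(X, y)
    print(search.best_params_)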

SLIDE 18

Slack variables – Hinge loss

min w.w + C Σj ξj
w,b
s.t. (w.xj + b) yj ≥ 1 - ξj   ∀j
     ξj ≥ 0   ∀j

w.w - complexity penalization
Σj ξj - hinge loss: at the optimum, ξj = max(0, 1 - (w.xj + b) yj)

[Plot: hinge loss vs. 0-1 loss as functions of the margin value (w.x + b) y, with the hinge at 1]

SLIDE 19

SVM vs. Logistic Regression

SVM: hinge loss
Logistic regression: log loss (negative log conditional likelihood)

[Plots: hinge loss vs. 0-1 loss; hinge loss vs. log loss]
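To make the comparison concrete, a small numpy sketch evaluating all three losses at a few margin values m = (w.x + b) y:

    import numpy as np

    m = np.linspace(-2.0, 2.0, 5)  # sample margin values

    zero_one = (m <= 0).astype(float)    # 0/1 loss: 1 iff misclassified
    hinge = np.maximum(0.0, 1.0 - m)     # hinge loss: linear penalty below margin 1
    log_loss = np.log(1.0 + np.exp(-m))  # log loss: -log conditional likelihood

    for row in zip(m, zero_one, hinge, log_loss):
        print(row)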

SLIDE 20

What about multiple classes?


SLIDE 21

One against all

Learn 3 classifiers separately, class k vs. rest: (wk, bk), k = 1, 2, 3

y = arg maxk wk.x + bk

But the wk’s may not be on the same scale. Note: (a wk).x + (a bk) is also a solution for any a > 0
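A sketch of the one-against-all decision rule, assuming the three classifiers have already been trained (the weights below are made up):

    import numpy as np

    # Hypothetical trained classifiers: row k of W is wk, entry k of b is bk
    W = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [-1.0, -1.0]])
    b = np.array([0.0, -0.5, 0.2])

    def predict(x):
        # y = arg max over k of wk.x + bk
        return int(np.argmax(W @ x + b))

    print(predict(np.array([2.0, 0.5])))  # scores 2.0, 0.0, -2.3 -> class 0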

SLIDE 22

Learn 1 classifier: Multi-class SVM

Simultaneously learn 3 sets of weights:

y = arg maxk w(k).x + b(k)

Margin - gap between the correct class and the nearest other class (see the formulation sketched below)
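That margin notion can be written out as constraints; this is the standard joint multi-class SVM formulation (my reconstruction, since the slide's equations are images):

    \min_{w^{(1)},\dots,w^{(K)},\,b}\ \sum_k w^{(k)} \cdot w^{(k)}
    \quad \text{s.t.}\quad
    w^{(y_j)} \cdot x_j + b^{(y_j)} \ \ge\ w^{(k)} \cdot x_j + b^{(k)} + 1
    \qquad \forall j,\ \forall k \ne y_j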

SLIDE 23

Learn 1 classifier: Multi-class SVM

Simultaneously learn 3 sets of weights:

y = arg maxk w(k).x + b(k)

Joint optimization: the w(k)’s have the same scale.

SLIDE 24

What you need to know

  • Maximizing margin
  • Derivation of SVM formulation
  • Slack variables and hinge loss
  • Relationship between SVMs and logistic regression

– 0/1 loss
– Hinge loss
– Log loss

  • Tackling multiple classes

– One against All
– Multiclass SVMs


SLIDE 25

SVMs reminder

min w.w + C Σj ξj
w,b
s.t. (w.xj + b) yj ≥ 1 - ξj   ∀j
     ξj ≥ 0   ∀j

w.w - regularization
C Σj ξj - hinge loss penalty

Soft margin approach

SLIDE 26

Today’s Lecture

  • Learn one of the most interesting and exciting recent advancements in machine learning

– The “kernel trick”
– High dimensional feature spaces at no extra cost!

  • But first, a detour

– Constrained optimization!


SLIDE 27

Constrained Optimization


SLIDE 28

Lagrange Multiplier – Dual Variables

Moving the constraint to the objective function via the Lagrangian (a worked example follows below)

Constraint is tight when a > 0
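Since the slide's equations are images, here is a standard one-dimensional worked example, consistent with the two cases on slide 30 (the specific problem is my assumption):

    % Primal: minimize x^2 subject to x >= b
    \min_x\ x^2 \quad \text{s.t.}\ x \ge b

    % Lagrangian: move the constraint into the objective with multiplier a >= 0
    L(x, a) = x^2 - a\,(x - b), \qquad a \ge 0

    % Solve: stationarity in x
    \frac{\partial L}{\partial x} = 2x - a = 0 \;\Rightarrow\; x = \frac{a}{2}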

SLIDE 29

Duality

Primal problem and dual problem (stated in symbols below)

Weak duality - holds for all feasible points
Strong duality - holds under the KKT conditions
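In symbols, using the Lagrangian L(x, a) from before (standard definitions, filled in because the slide's formulas are images):

    p^* = \min_x \max_{a \ge 0} L(x, a) \qquad \text{(primal)}
    d^* = \max_{a \ge 0} \min_x L(x, a) \qquad \text{(dual)}

    \text{Weak duality: } d^* \le p^*
    \qquad
    \text{Strong duality: } d^* = p^*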

SLIDE 30

Lagrange Multiplier – Dual Variables

Solving: two cases, b negative and b positive

When a > 0, constraint is tight