Support Vector Machines


  1. Support Vector Machines (Machine Learning)

  2. Big picture: Linear models

  3. Big picture: Linear models. How good is a learning algorithm?

  4. Big picture: Linear models. Online learning: Perceptron, Winnow. How good is a learning algorithm?

  5. Big picture: Linear models. Online learning: Perceptron, Winnow. PAC, agnostic learning. How good is a learning algorithm?

  6. Big picture: Linear models. Online learning: Perceptron, Winnow. PAC, agnostic learning: Support Vector Machines. How good is a learning algorithm?

  7. Big picture: Linear models. Online learning: Perceptron, Winnow, …. PAC, agnostic learning: Support Vector Machines, …. How good is a learning algorithm?

  8. This lecture: Support vector machines
     • Training by maximizing margin
     • The SVM objective
     • Solving the SVM optimization problem
     • Support vectors, duals and kernels

  9. This lecture: Support vector machines
     • Training by maximizing margin
     • The SVM objective
     • Solving the SVM optimization problem
     • Support vectors, duals and kernels

  10. VC dimensions and linear classifiers. What we know so far:
      1. If we have $n$ examples, then with probability $1 - \delta$, the true error of a hypothesis $h$ with training error $err_S(h)$ is bounded by
         $$err_D(h) \le err_S(h) + \sqrt{\frac{VC(H)\left(\ln\frac{2n}{VC(H)} + 1\right) + \ln\frac{4}{\delta}}{n}}$$
         (The left-hand side is the generalization error, the first term on the right is the training error, and the square-root term is a function of the VC dimension: low VC dimension gives a tighter bound.)

  11. VC dimensions and linear classifiers. What we know so far:
      1. The same bound as on slide 10, with the terms labeled on the slide: generalization error, training error, and a function of the VC dimension (low VC dimension gives a tighter bound).

  12. VC dimensions and linear classifiers. What we know so far:
      1. The generalization bound from slide 10.
      2. VC dimension of a linear classifier in $d$ dimensions $= d + 1$.

  13. VC dimensions and linear classifiers. What we know so far:
      1. The generalization bound from slide 10.
      2. VC dimension of a linear classifier in $d$ dimensions $= d + 1$.
      But are all linear classifiers the same? (A small numerical sketch of the bound follows below.)
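The bound on slides 10-13 can be evaluated numerically. Below is a minimal sketch in Python (not part of the original deck); the function name, the choice of delta = 0.05, and the example numbers are illustrative assumptions.

```python
# Hypothetical helper (not from the slides): evaluates the VC generalization
# bound err_D(h) <= err_S(h) + sqrt((VC(H) * (ln(2n / VC(H)) + 1) + ln(4/delta)) / n).
import math

def vc_bound(train_error, n, vc_dim, delta=0.05):
    """Upper bound on the true error, holding with probability 1 - delta."""
    complexity = vc_dim * (math.log(2 * n / vc_dim) + 1) + math.log(4 / delta)
    return train_error + math.sqrt(complexity / n)

# For a linear classifier in d dimensions, VC(H) = d + 1 (slide 12).
d, n = 100, 10_000
print(vc_bound(train_error=0.05, n=n, vc_dim=d + 1))  # roughly 0.3 for these made-up numbers
```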

  14. Recall: Margin. The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it. [Figure: positively and negatively labeled points separated by a hyperplane.]

  15. Recall: Margin. The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it. [Figure: the same points, with the margin with respect to this hyperplane marked.]

  16. Which line is a better choice? Why? [Figure: the same dataset separated by two different hyperplanes, h1 and h2.]

  17. Which line is a better choice? Why? A new example, not from the training set, might be misclassified if the margin is smaller. [Figure: the same two hyperplanes, h1 and h2.]

  18. Data dependent VC dimension
      • Intuitively, larger margins are better.
      • Suppose we only consider linear separators with margins $\gamma_1$ and $\gamma_2$:
        – $H_1$ = linear separators that have a margin $\gamma_1$
        – $H_2$ = linear separators that have a margin $\gamma_2$
        – and $\gamma_1 > \gamma_2$
      • The entire set of functions $H_1$ is “better”.

  19. Data dependent VC dimension
      Theorem (Vapnik):
      – Let $H$ be the set of linear classifiers that separate the training set by a margin of at least $\gamma$.
      – Then $VC(H) \le \min\left(\frac{R^2}{\gamma^2}, d\right) + 1$,
      – where $R$ is the radius of the smallest sphere containing the data.

  20. Data dependent VC dimension
      Theorem (Vapnik), as on slide 19.
      Larger margin ⇒ lower VC dimension.
      Lower VC dimension ⇒ better generalization bound.
      (A small numerical sketch follows below.)
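As a quick illustration of the theorem on slides 19-20, here is a sketch (not from the deck) that evaluates the bound $\min(R^2/\gamma^2, d) + 1$ for a few made-up values of the radius and the margin; the function name and the numbers are assumptions.

```python
# Hypothetical sketch of the data-dependent VC bound from the theorem above:
# VC(H) <= min(R^2 / gamma^2, d) + 1.
def margin_vc_bound(radius, margin, d):
    return min(radius**2 / margin**2, d) + 1

# With a larger margin the bound can fall far below the dimension-based d + 1.
print(margin_vc_bound(radius=1.0, margin=0.1, d=1000))  # 101.0
print(margin_vc_bound(radius=1.0, margin=0.5, d=1000))  # 5.0
```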

  21. Learning strategy: Find the linear separator that maximizes the margin.

  22. This lecture: Support vector machines
      • Training by maximizing margin
      • The SVM objective
      • Solving the SVM optimization problem
      • Support vectors, duals and kernels

  23. Support Vector Machines: so far
      • Lower VC dimension → better generalization.
      • Vapnik: For linear separators, the VC dimension depends inversely on the margin.
        – That is, larger margin → better generalization.
      • For the separable case:
        – Among all linear classifiers that separate the data, find the one that maximizes the margin.
        – Maximize the margin by minimizing $\mathbf{w}^T\mathbf{w}$ subject to $y\,\mathbf{w}^T\mathbf{x} \ge 1$ for all examples.
      • General case:
        – Introduce slack variables, one $\xi_i$ for each example.
        – Slack variables allow the margin constraint above to be violated.
      (A sketch of the resulting objective in code follows after this list.)
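In the general case, the slack for example $i$ works out to $\xi_i = \max(0, 1 - y_i\,\mathbf{w}^T\mathbf{x}_i)$, so the trade-off between margin and violations can be written as a single objective. The sketch below is not from the slides: it assumes NumPy, a trade-off constant C, and that the bias is folded into $\mathbf{w}$ via a constant feature; all names and numbers are illustrative.

```python
# Hypothetical sketch of the soft-margin SVM objective described above:
# (1/2) w^T w + C * sum_i max(0, 1 - y_i w^T x_i), where the max(...) terms
# are the slack variables xi_i. The 1/2 factor and C are conventional choices,
# not stated on the slide.
import numpy as np

def svm_objective(w, X, y, C=1.0):
    slack = np.maximum(0.0, 1.0 - y * (X @ w))  # xi_i, one per example
    return 0.5 * w @ w + C * slack.sum()

# Tiny made-up example: two separable points, so all slacks are zero.
X = np.array([[1.0, 2.0], [-2.0, -1.0]])
y = np.array([1.0, -1.0])
print(svm_objective(np.array([0.5, 0.5]), X, y))  # 0.25
```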

  24. Recall: The geometry of a linear classifier
      Prediction = sgn($b + w_1 x_1 + w_2 x_2$); the separating hyperplane is $b + w_1 x_1 + w_2 x_2 = 0$.
      The distance of a (correctly classified) point from the hyperplane is
      $$\frac{|w_1 x_1 + w_2 x_2 + b|}{\sqrt{w_1^2 + w_2^2}} = \frac{y\,(w_1 x_1 + w_2 x_2 + b)}{\|\mathbf{w}\|}$$
      [Figure: positively and negatively labeled points on either side of the hyperplane.]

  25. Recall: The geometry of a linear classifier
      As on slide 24, with the note: for the prediction we only care about the sign, not the magnitude.

  26. Recall: The geometry of a linear classifier
      We only care about the sign, not the magnitude: $b + w_1 x_1 + w_2 x_2 = 0$, $2b + 2w_1 x_1 + 2w_2 x_2 = 0$ and $1000b + 1000w_1 x_1 + 1000w_2 x_2 = 0$ all describe the same hyperplane. All of these are equivalent; we could multiply or divide the coefficients by any positive number and the sign of the prediction will not change. (A small numerical check follows below.)
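A quick check of the scale-invariance claim on slide 26; this snippet is not part of the deck, and the weight and input values are made up purely for illustration.

```python
# Scaling (b, w) by any positive constant leaves sgn(b + w1*x1 + w2*x2),
# and hence the prediction, unchanged. All values here are arbitrary.
import numpy as np

w, b = np.array([2.0, -1.0]), 0.5
x = np.array([0.3, 1.2])
for c in (1.0, 2.0, 1000.0):
    print(np.sign(c * b + (c * w) @ x))  # prints the same sign (-1.0) each time
```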

  27. Maximizing margin
      • Margin of a hyperplane = distance of the closest point from the hyperplane:
        $$\gamma_{\mathbf{w},b} = \min_i \frac{y_i(\mathbf{w}^T\mathbf{x}_i + b)}{\|\mathbf{w}\|}$$
      • We want to maximize this margin.
      Some people call this the geometric margin; the numerator alone is called the functional margin.

  28. Maximizing margin
      • Margin of a hyperplane = distance of the closest point from the hyperplane: the geometric margin $\gamma_{\mathbf{w},b}$ defined on slide 27 (the numerator alone is the functional margin).
      • We want to maximize this margin: $\max_{\mathbf{w},b}\ \gamma_{\mathbf{w},b}$.
      (A sketch computing the functional and geometric margins follows below.)
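To make the two quantities concrete, here is a small sketch (not from the slides) that computes the functional margins $y_i(\mathbf{w}^T\mathbf{x}_i + b)$ and the geometric margin $\gamma_{\mathbf{w},b}$ for a made-up two-point dataset; all names and numbers are illustrative.

```python
# Hypothetical example: functional margin per point vs. geometric margin
# of the hyperplane, following the definitions above.
import numpy as np

def geometric_margin(w, b, X, y):
    functional = y * (X @ w + b)              # functional margin of each example
    return np.min(functional) / np.linalg.norm(w)

X = np.array([[2.0, 2.0], [-1.0, -1.5]])      # one made-up point per class
y = np.array([1.0, -1.0])
w, b = np.array([1.0, 1.0]), 0.0
print(y * (X @ w + b))                        # functional margins: [4.0, 2.5]
print(geometric_margin(w, b, X, y))           # geometric margin: 2.5 / sqrt(2) ~= 1.77
```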

  29. Recall: The geometry of a linear classifier
      Prediction = sgn($b + w_1 x_1 + w_2 x_2$); hyperplane $b + w_1 x_1 + w_2 x_2 = 0$; distance of a point from the hyperplane: $\frac{|w_1 x_1 + w_2 x_2 + b|}{\sqrt{w_1^2 + w_2^2}}$. We only care about the sign, not the magnitude. [Figure: labeled points on either side of the hyperplane.]

  30. Towards maximizing the margin
      The hyperplane $b + w_1 x_1 + w_2 x_2 = 0$ and the distance of a point from it, $\frac{|w_1 x_1 + w_2 x_2 + b|}{\sqrt{w_1^2 + w_2^2}}$, are unchanged if we divide the coefficients by a positive constant $c$:
      $$\frac{\left|\frac{w_1}{c} x_1 + \frac{w_2}{c} x_2 + \frac{b}{c}\right|}{\sqrt{\left(\frac{w_1}{c}\right)^2 + \left(\frac{w_2}{c}\right)^2}}$$
      We only care about the sign, not the magnitude, so we can scale the weights to make the optimization easier.

  31. Towards maximizing the margin
      Key observation: We can scale the $\mathbf{w}$ so that the numerator is 1 for the points that define the margin. (A sketch of this rescaling follows below.)
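Following the key observation on slide 31, the sketch below (not from the deck) divides $(\mathbf{w}, b)$ by the smallest functional margin so that $y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$ for the closest point(s), without moving the hyperplane; it reuses the made-up two-point data from the earlier sketch.

```python
# Hypothetical illustration: rescale (w, b) so the numerator of the margin
# expression equals 1 for the margin-defining (closest) points.
import numpy as np

X = np.array([[2.0, 2.0], [-1.0, -1.5]])
y = np.array([1.0, -1.0])
w, b = np.array([1.0, 1.0]), 0.0

s = np.min(y * (X @ w + b))                   # smallest functional margin (2.5 here)
w_scaled, b_scaled = w / s, b / s             # same hyperplane, rescaled coefficients
print(np.min(y * (X @ w_scaled + b_scaled)))  # 1.0 for the closest point
```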
