Support Vector Machines (SVMs) and Semi-Supervised Learning

  1. Support Vector Machines (SVMs) • Semi-Supervised Learning • Semi-Supervised SVMs. Maria-Florina Balcan, 03/25/2015.

  2. Support Vector Machines (SVMs). One of the most theoretically well-motivated and practically most effective classification algorithms in machine learning. Directly motivated by margins and kernels!

  3. Geometric Margin. WLOG consider homogeneous linear separators [w_0 = 0]. Definition: the margin of example x w.r.t. a linear separator w is the distance from x to the plane w ⋅ x = 0. If ‖w‖ = 1, the margin of x w.r.t. w is |w ⋅ x|. (Figure: the margins of two examples x_1 and x_2 w.r.t. a separator w.)

  4. Geometric Margin. Definition: the margin of example x w.r.t. a linear separator w is the distance from x to the plane w ⋅ x = 0. Definition: the margin γ_w of a set of examples S w.r.t. a linear separator w is the smallest margin over points x ∈ S. Definition: the margin γ of a set of examples S is the maximum γ_w over all linear separators w. (Figure: positively and negatively labeled points with a separator w and margin γ on either side.)
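
A minimal sketch (with made-up data, not from the slides) of these definitions in Python/NumPy: the margin of an example x w.r.t. a unit-norm separator w is |w ⋅ x|, and the margin of the set is the smallest such value.

```python
# Margin of each example w.r.t. a unit-norm homogeneous separator w,
# and the margin of the whole set (the minimum over the examples).
import numpy as np

w = np.array([3.0, 4.0])
w = w / np.linalg.norm(w)          # WLOG ||w|| = 1, homogeneous separator

S = np.array([[1.0, 2.0],
              [-2.0, 0.5],
              [0.5, -1.0]])

margins = np.abs(S @ w)            # |w . x| for each example x
print("per-example margins:", margins)
print("margin of the set w.r.t. w:", margins.min())
```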

  5. Margin: an Important Theme in ML. It has both sample-complexity and algorithmic implications. Sample/mistake-bound complexity: if the margin is large, the number of mistakes the Perceptron makes is small (independent of the dimension of the space!). If the margin is γ and the algorithm produces a large-margin classifier, then the amount of data needed depends only on R/γ [Bartlett & Shawe-Taylor '99]. Algorithmic implications: this suggests searching for a large-margin classifier… SVMs.
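
A minimal Perceptron sketch (a standard textbook implementation on hypothetical data, not code from the slides) illustrating the mistake-bound point above: on linearly separable data with margin γ and radius R, the number of updates is at most (R/γ)², independent of the dimension.

```python
# Perceptron run to convergence on linearly separable data; the total
# number of mistakes (updates) is bounded by (R / gamma)^2.
import numpy as np

def perceptron(X, y, max_passes=100):
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(max_passes):
        updated = False
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:      # mistake: update w
                w += y_i * x_i
                mistakes += 1
                updated = True
        if not updated:                        # a full pass with no mistakes
            break
    return w, mistakes

X = np.array([[2.0, 1.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -2.5]])
y = np.array([1, 1, -1, -1])
w, m = perceptron(X, y)
print("separator:", w, "mistakes:", m)
```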

  6. Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator. First, assume we know a lower bound on the margin γ. Input: γ, S = {(x_1, y_1), …, (x_m, y_m)}. Find: some w where ‖w‖² = 1 and, for all i, y_i w ⋅ x_i ≥ γ. Output: w, a separator of margin γ over S. This is the realizable case, where the data is linearly separable by margin γ.
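
A minimal sketch of the realizable-case feasibility check described above (the function name and data are illustrative assumptions): given a candidate w and a target margin γ, verify that y_i w ⋅ x_i ≥ γ for all i.

```python
# Check whether a candidate unit-norm separator w attains margin gamma on S.
import numpy as np

def satisfies_margin(w, X, y, gamma):
    w = w / np.linalg.norm(w)               # enforce ||w|| = 1
    return bool(np.all(y * (X @ w) >= gamma))

X = np.array([[2.0, 1.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -2.5]])
y = np.array([1, 1, -1, -1])
print(satisfies_margin(np.array([1.0, 1.0]), X, y, gamma=0.5))
```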

  7. Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator: e.g., search for the best possible γ. Input: S = {(x_1, y_1), …, (x_m, y_m)}. Find: some w and the maximum γ where ‖w‖² = 1 and, for all i, y_i w ⋅ x_i ≥ γ. Output: the maximum-margin separator over S.

  8. Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator. Input: S = {(x_1, y_1), …, (x_m, y_m)}. Maximize γ under the constraints: ‖w‖² = 1, and for all i, y_i w ⋅ x_i ≥ γ.

  9. Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator. Input: S = {(x_1, y_1), …, (x_m, y_m)}. Maximize γ (the objective function) under the constraints: ‖w‖² = 1, and for all i, y_i w ⋅ x_i ≥ γ. This is a constrained optimization problem. A famous example of constrained optimization is linear programming, where the objective function is linear and the constraints are linear (in)equalities.
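
For concreteness, a minimal linear-programming sketch (the particular LP and the use of SciPy's linprog are illustrative assumptions, not part of the slides): a linear objective minimized subject to linear inequality constraints.

```python
# Linear program: minimize c.x subject to A_ub x <= b_ub and x >= 0.
# (linprog minimizes by convention, so a maximization flips the sign of c.)
from scipy.optimize import linprog

c = [-1.0, -2.0]                   # maximize x + 2y  <=>  minimize -x - 2y
A_ub = [[1.0, 1.0], [1.0, 0.0]]    # x + y <= 4,  x <= 3
b_ub = [4.0, 3.0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("optimum:", res.x, "objective:", res.fun)
```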

  10. Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator. Input: S = {(x_1, y_1), …, (x_m, y_m)}. Maximize γ under the constraints: ‖w‖² = 1, and for all i, y_i w ⋅ x_i ≥ γ. The constraint ‖w‖² = 1 is non-linear; in fact, it's even non-convex. (Figure: the non-convex constraint set ‖w‖² = 1 in the (w_1, w_2) plane.)

  11. Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator. Input: S = {(x_1, y_1), …, (x_m, y_m)}. Maximize γ under the constraints: ‖w‖² = 1, and for all i, y_i w ⋅ x_i ≥ γ. Let w′ = w/γ; then maximizing γ is equivalent to minimizing ‖w′‖² (since ‖w′‖² = 1/γ²). So, dividing both sides by γ and writing in terms of w′, we get: Input: S = {(x_1, y_1), …, (x_m, y_m)}. Minimize ‖w′‖² under the constraint: for all i, y_i w′ ⋅ x_i ≥ 1. (Figure: the separator w′ with the supporting hyperplanes w′ ⋅ x = −1 and w′ ⋅ x = 1.)

  12. Support Vector Machines (SVMs). Directly optimize for the maximum-margin separator. Input: S = {(x_1, y_1), …, (x_m, y_m)}. Find argmin_w ‖w‖² s.t.: for all i, y_i w ⋅ x_i ≥ 1. This is a constrained optimization problem in which the objective is convex (quadratic) and all constraints are linear, so it can be solved efficiently (in polynomial time) using standard quadratic programming (QP) software.
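
A minimal sketch of this QP solved with an off-the-shelf modelling library (cvxpy, chosen purely for illustration; the data is made up and assumed linearly separable): minimize ‖w‖² subject to y_i (w ⋅ x_i) ≥ 1 for all i.

```python
# Hard-margin SVM as a quadratic program.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
objective = cp.Minimize(cp.sum_squares(w))          # ||w||^2
constraints = [cp.multiply(y, X @ w) >= 1]          # y_i (w . x_i) >= 1
cp.Problem(objective, constraints).solve()
print("max-margin separator:", w.value)
```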

  13. Support Vector Machines (SVMs). Question: what if the data isn't perfectly linearly separable? Issue 1: we now have two objectives: maximize the margin and minimize the number of misclassifications. Answer 1: let's optimize their sum: minimize ‖w‖² + C (# misclassifications), where C is some tradeoff constant. Issue 2: this is computationally hard (NP-hard), even if we didn't care about the margin and just minimized the number of mistakes [Guruswami-Raghavendra '06].


  15. Support Vector Machines (SVMs). Question: what if the data isn't perfectly linearly separable? Replace "# mistakes" with an upper bound called the "hinge loss". Input: S = {(x_1, y_1), …, (x_m, y_m)}. Find argmin_{w, ξ_1, …, ξ_m} ‖w‖² + C Σ_i ξ_i s.t.: for all i, y_i w ⋅ x_i ≥ 1 − ξ_i and ξ_i ≥ 0. The ξ_i are "slack variables".
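
A minimal sketch of this soft-margin program with explicit slack variables, again using cvxpy on made-up data (the library choice, the data, and C = 1 are illustrative assumptions).

```python
# Soft-margin SVM: minimize ||w||^2 + C * sum_i xi_i
# subject to y_i (w . x_i) >= 1 - xi_i and xi_i >= 0.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.5], [1.5, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])    # last point breaks separability
C = 1.0

w = cp.Variable(X.shape[1])
xi = cp.Variable(X.shape[0], nonneg=True)     # slack variables xi_i >= 0
objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w) >= 1 - xi]
cp.Problem(objective, constraints).solve()
print("w:", w.value, "slacks:", xi.value)
```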

  16. Support Vector Machines (SVMs). Question: what if the data isn't perfectly linearly separable? Replace "# mistakes" with an upper bound called the "hinge loss". Input: S = {(x_1, y_1), …, (x_m, y_m)}. Find argmin_{w, ξ_1, …, ξ_m} ‖w‖² + C Σ_i ξ_i s.t.: for all i, y_i w ⋅ x_i ≥ 1 − ξ_i and ξ_i ≥ 0. The ξ_i are "slack variables". C controls the relative weighting between the twin goals of making ‖w‖² small (making the margin large) and ensuring that most examples have functional margin ≥ 1. Hinge loss: ℓ(w, x, y) = max(0, 1 − y w ⋅ x).

  17. Support Vector Machines (SVMs). Question: what if the data isn't perfectly linearly separable? Replace "# mistakes" with an upper bound called the "hinge loss". Input: S = {(x_1, y_1), …, (x_m, y_m)}. Find argmin_{w, ξ_1, …, ξ_m} ‖w‖² + C Σ_i ξ_i s.t.: for all i, y_i w ⋅ x_i ≥ 1 − ξ_i and ξ_i ≥ 0. Replacing the number of mistakes with the hinge loss turns ‖w‖² + C (# misclassifications) into ‖w‖² + C Σ_i ℓ(w, x_i, y_i), where ℓ(w, x, y) = max(0, 1 − y w ⋅ x).
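
Equivalently, the slack variables can be eliminated and the unconstrained objective ‖w‖² + C Σ_i ℓ(w, x_i, y_i) minimized directly. A minimal subgradient-descent sketch (the hyperparameters and data are illustrative assumptions, not from the slides):

```python
# Minimize ||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i)) by subgradient descent.
import numpy as np

def hinge_svm(X, y, C=1.0, lr=0.01, steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        active = margins < 1                           # nonzero hinge loss
        # subgradient: 2w from ||w||^2, minus C * sum of y_i x_i over active points
        grad = 2 * w - C * (y[active, None] * X[active]).sum(axis=0)
        w -= lr * grad
    return w

X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.5], [1.5, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
print("hinge-loss solution:", hinge_svm(X, y))
```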

  18. Support Vector Machines (SVMs). Question: what if the data isn't perfectly linearly separable? Replace "# mistakes" with an upper bound called the "hinge loss". Input: S = {(x_1, y_1), …, (x_m, y_m)}. Find argmin_{w, ξ_1, …, ξ_m} ‖w‖² + C Σ_i ξ_i s.t.: for all i, y_i w ⋅ x_i ≥ 1 − ξ_i and ξ_i ≥ 0. The total slack is the total amount we have to move the points to get them on the correct side of the lines w ⋅ x = +1/−1, where the distance between the lines w ⋅ x = 0 and w ⋅ x = 1 counts as "1 unit". Hinge loss: ℓ(w, x, y) = max(0, 1 − y w ⋅ x).

  19. What if the data is far from being linearly separable? Example: in the pixel representation of images, there is no good linear separator. SVM philosophy: "use a kernel".
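
A minimal sketch of what "use a kernel" can look like in practice (a Gaussian/RBF kernel is assumed here purely for illustration): the dual on the next slide touches the data only through inner products x_i ⋅ x_j, so a kernel Gram matrix can stand in for those inner products.

```python
# Gaussian (RBF) kernel Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2),
# a drop-in replacement for the linear inner products X @ X.T in the dual.
import numpy as np

def rbf_gram(X, gamma=0.5):
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
print(rbf_gram(X))
```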

  20. Support Vector Machines (SVMs). Primal form. Input: S = {(x_1, y_1), …, (x_m, y_m)}. Find argmin_{w, ξ_1, …, ξ_m} ‖w‖² + C Σ_i ξ_i s.t.: for all i, y_i w ⋅ x_i ≥ 1 − ξ_i and ξ_i ≥ 0. Which is equivalent to the Lagrangian dual. Input: S = {(x_1, y_1), …, (x_m, y_m)}. Find argmin_α (1/2) Σ_i Σ_j y_i y_j α_i α_j (x_i ⋅ x_j) − Σ_i α_i s.t.: for all i, 0 ≤ α_i ≤ C, and Σ_i y_i α_i = 0.
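
A minimal sketch of this Lagrangian dual solved with cvxpy on made-up data (the rewrite of the quadratic term as ‖Σ_i α_i y_i x_i‖² is an implementation convenience, not something from the slides).

```python
# Dual SVM: minimize (1/2) sum_{i,j} y_i y_j a_i a_j (x_i . x_j) - sum_i a_i
# subject to 0 <= a_i <= C and sum_i y_i a_i = 0.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 10.0

a = cp.Variable(len(y))
# (1/2) sum_{ij} y_i y_j a_i a_j (x_i . x_j) == (1/2) || sum_i a_i y_i x_i ||^2
objective = cp.Minimize(0.5 * cp.sum_squares(X.T @ cp.multiply(y, a)) - cp.sum(a))
constraints = [a >= 0, a <= C, y @ a == 0]
cp.Problem(objective, constraints).solve()
print("dual variables alpha:", a.value)
```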

  21. SVMs (Lagrangian Dual). Input: S = {(x_1, y_1), …, (x_m, y_m)}. Find argmin_α (1/2) Σ_i Σ_j y_i y_j α_i α_j (x_i ⋅ x_j) − Σ_i α_i s.t.: for all i, 0 ≤ α_i ≤ C, and Σ_i y_i α_i = 0. The final classifier is w = Σ_i α_i y_i x_i. The points x_i for which α_i ≠ 0 are called the "support vectors".
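
A minimal sketch (with illustrative α, X, y values that are not an actual optimum) of the recovery rule on this slide: w = Σ_i α_i y_i x_i, with the support vectors being the examples whose α_i is nonzero.

```python
# Recover the primal separator and the support vectors from dual variables.
import numpy as np

def recover_separator(alpha, X, y, tol=1e-6):
    w = (alpha * y) @ X                      # w = sum_i alpha_i y_i x_i
    support = np.flatnonzero(alpha > tol)    # indices with alpha_i != 0
    return w, support

# Illustrative values (e.g. the output of a dual solve like the one above):
alpha = np.array([0.25, 0.0, 0.25, 0.0])
X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, sv = recover_separator(alpha, X, y)
print("w:", w, "support vector indices:", sv)
```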
