
Lecture 17: Multi-class SVMs, Kernels (Aykut Erdem, December 2016)



  1. Lecture 17: Multi-class SVMs, Kernels. Aykut Erdem, December 2016, Hacettepe University

  2. Administrative • We will have a make-up lecture on Saturday, December 17, 2016 (I will check the date today). • Project progress reports are due today!

  3. Last time… Support Vector Machines. Linear function $f(x) = \langle w, x \rangle + b$, with the two classes satisfying $\langle w, x \rangle + b \ge 1$ and $\langle w, x \rangle + b \le -1$. (slide by Alex Smola)

  4. Last time… Support Vector Machines. Margin hyperplanes $\langle w, x \rangle + b = 1$ and $\langle w, x \rangle + b = -1$. Optimization problem: maximize over $w, b$ the margin $\frac{1}{\|w\|}$ subject to $y_i [\langle x_i, w \rangle + b] \ge 1$. (slide by Alex Smola)

  5. Last time… Support Vector Machines. Margin hyperplanes $\langle w, x \rangle + b = 1$ and $\langle w, x \rangle + b = -1$. Equivalent optimization problem: minimize over $w, b$ the objective $\frac{1}{2}\|w\|^2$ subject to $y_i [\langle x_i, w \rangle + b] \ge 1$. (slide by Alex Smola)

  6. Last time… Support Vector Machines. Primal: minimize over $w, b$ the objective $\frac{1}{2}\|w\|^2$ subject to $y_i [\langle x_i, w \rangle + b] \ge 1$. The solution has the form $w = \sum_i y_i \alpha_i x_i$. Dual: maximize over $\alpha$ the objective $-\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$. (slide by Alex Smola)
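
A minimal numeric sketch of this primal/dual relationship (not from the slides), assuming numpy and scikit-learn are available; the toy data and names are illustrative. It recovers $w = \sum_i y_i \alpha_i x_i$ from the dual coefficients and checks that it matches the primal weight vector.

```python
# Hedged sketch: recover w = sum_i y_i alpha_i x_i from the dual coefficients
# of a linear SVM and compare it with the primal weight vector.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Toy linearly separable 2-D data with labels -1 / +1 (illustrative only).
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.hstack([np.ones(20), -np.ones(20)])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

# dual_coef_ stores y_i * alpha_i for the support vectors (alpha_i > 0).
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))    # True: same separating hyperplane
```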

  7. Last time… Large Margin Classifier. Margin hyperplanes $\langle w, x \rangle + b = 1$ and $\langle w, x \rangle + b = -1$; the support vectors are exactly the points with $\alpha_i > 0$. (slide by Alex Smola)

  8. Last time… Soft-margin Classifier. When the constraints $\langle w, x \rangle + b \ge 1$ and $\langle w, x \rangle + b \le -1$ cannot all be satisfied, a perfect separator is impossible and we would like a minimum-error one instead. Theorem (Minsky & Papert): finding the minimum-error separating hyperplane is NP-hard. (slide by Alex Smola)

  9. Last time… Adding Slack Variables. Relax the constraints to $\langle w, x \rangle + b \ge 1 - \xi$ and $\langle w, x \rangle + b \le -1 + \xi$ with slack $\xi_i \ge 0$, and minimize the amount of slack. Convex optimization problem. (slide by Alex Smola)

  10. Last time… Adding Slack Variables. Constraints $\langle w, x \rangle + b \ge 1 - \xi$ and $\langle w, x \rangle + b \le -1 + \xi$ with $\xi_i \ge 0$. • For $0 < \xi \le 1$ the point lies inside the margin but is still correctly classified. • For $\xi > 1$ the point is misclassified. Minimize the amount of slack; convex optimization problem. (adopted from Andrew Zisserman)

  11. Last time… Adding Slack Variables
    • Hard margin problem: minimize over $w, b$ the objective $\frac{1}{2}\|w\|^2$ subject to $y_i [\langle w, x_i \rangle + b] \ge 1$.
    • With slack variables: minimize over $w, b$ the objective $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i [\langle w, x_i \rangle + b] \ge 1 - \xi_i$ and $\xi_i \ge 0$.
    The problem is always feasible. Proof: $w = 0$, $b = 0$ and $\xi_i = 1$ satisfy all constraints (and also yield an upper bound on the objective). (slide by Alex Smola)

  12. Soft-margin classifier
    • Optimisation problem: minimize over $w, b$ the objective $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i [\langle w, x_i \rangle + b] \ge 1 - \xi_i$ and $\xi_i \ge 0$.
    $C$ is a regularization parameter:
    • small $C$ allows constraints to be easily ignored → large margin
    • large $C$ makes constraints hard to ignore → narrow margin
    • $C = \infty$ enforces all constraints: hard margin
    (adopted from Andrew Zisserman)
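
A small sketch of the role of $C$ (not from the slides), assuming scikit-learn; the toy data and the chosen $C$ values are illustrative. Small $C$ tolerates slack and gives a wide margin; large $C$ approaches the hard-margin solution.

```python
# Hedged illustration: margin width and number of support vectors as C grows.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
# Two overlapping Gaussian blobs, so some slack is unavoidable.
X = np.vstack([rng.randn(50, 2) + [1.5, 0], rng.randn(50, 2) - [1.5, 0]])
y = np.hstack([np.ones(50), -np.ones(50)])

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_)   # geometric margin 2/||w||
    n_sv = clf.support_vectors_.shape[0]             # points with alpha_i > 0
    print(f"C={C:>7}: margin width={margin_width:.2f}, support vectors={n_sv}")
# Typically the margin shrinks as C grows; C -> infinity recovers the hard margin.
```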

  13. Demo time…

  14. This week • Multi-class classification • Introduction to kernels

  15. Multi-class classification (slide by Eric Xing)

  16. Multi-class classification (slide by Eric Xing)

  17. Multi-class classification (slide by Eric Xing)

  18. One versus all classification
    • Learn 3 classifiers:
      – $-$ vs. $\{o, +\}$, weights $w_-$
      – $+$ vs. $\{o, -\}$, weights $w_+$
      – $o$ vs. $\{+, -\}$, weights $w_o$
    • Predict the label using the classifier with the largest score, $\hat{y} = \arg\max_y \langle w_y, x \rangle$.
    • Any problems? • Could we learn this dataset? (slide by Eric Xing)
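
A one-versus-all sketch (not from the slides), assuming scikit-learn; the three-class toy data and all names are illustrative. One binary linear SVM is trained per class and the label is predicted by the arg-max of the scores.

```python
# Hedged one-vs-all sketch: one binary SVM per class, predict by largest score.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(2)
centers = np.array([[0.0, 3.0], [3.0, -2.0], [-3.0, -2.0]])
X = np.vstack([rng.randn(30, 2) + c for c in centers])
y = np.repeat([0, 1, 2], 30)                       # three classes, as on the slide

classifiers = []
for k in range(3):
    y_binary = np.where(y == k, 1, -1)             # class k vs. the rest
    classifiers.append(LinearSVC(C=1.0).fit(X, y_binary))

def predict(x):
    scores = [c.decision_function(x.reshape(1, -1))[0] for c in classifiers]
    return int(np.argmax(scores))                  # label with the largest score

print(predict(np.array([0.0, 2.5])))               # most likely class 0
```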

  19. Multi-class SVM
    • Simultaneously learn 3 sets of weights ($w_+$, $w_-$, $w_o$).
    • How do we guarantee the correct labels?
    • Need new constraints! The "score" of the correct class must be better than the "score" of the wrong classes: for every example $(x_i, y_i)$ and every wrong class $y' \ne y_i$, require $\langle w_{y_i}, x_i \rangle \ge \langle w_{y'}, x_i \rangle + 1$. (slide by Eric Xing)

  20. Multi-class SVM
    • As for the binary SVM, we introduce slack variables and maximize the margin.
    • To predict, we use the class whose weight vector gives the largest score.
    • Now, can we learn it? (slide by Eric Xing)
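
As a concrete version of these joint constraints (not from the slides): scikit-learn's Crammer-Singer option for LinearSVC optimizes a multi-class hinge loss of this form, where the score of the correct class must beat every other score by a margin, up to slack. The toy data below is illustrative.

```python
# Hedged sketch: joint multi-class SVM (Crammer-Singer) enforcing
#   score(correct class) >= score(wrong class) + 1 - slack.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(3)
centers = np.array([[0.0, 3.0], [3.0, -2.0], [-3.0, -2.0]])
X = np.vstack([rng.randn(30, 2) + c for c in centers])
y = np.repeat([0, 1, 2], 30)

clf = LinearSVC(multi_class="crammer_singer", C=1.0).fit(X, y)
scores = clf.decision_function(X)                 # one score per class per sample
print((scores.argmax(axis=1) == y).mean())        # accuracy of the arg-max rule
```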

  21. Kernels (slide by Alex Smola)

  22. Non-linear features
    • Regression: we got nonlinear functions by preprocessing the inputs.
    • Perceptron: map the data into a feature space $x \to \phi(x)$, solve the problem in this space, and in the code replace every query $\langle x, x' \rangle$ by $\langle \phi(x), \phi(x') \rangle$.
    • Feature Perceptron: the solution lies in the span of the $\phi(x_i)$. (slide by Alex Smola)

  23. Non-linear features • The separating surfaces are circles, hyperbolae, parabolae. (slide by Alex Smola)

  24. Solving XOR • Map $(x_1, x_2) \to (x_1, x_2, x_1 x_2)$. • XOR is not linearly separable in the original coordinates. • Mapping into 3 dimensions makes it easily solvable. (slide by Alex Smola)
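
A tiny sketch of the XOR mapping (not from the slides), assuming numpy and scikit-learn: the four points are not linearly separable in $(x_1, x_2)$, but after adding the product coordinate $x_1 x_2$ a linear classifier separates them perfectly.

```python
# Hedged XOR sketch: map (x1, x2) -> (x1, x2, x1*x2) and separate linearly.
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([1, -1, -1, 1])                  # XOR labels: the sign of x1*x2

phi = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])   # add the x1*x2 feature

clf = LinearSVC(C=10.0).fit(phi, y)
print(clf.score(phi, y))                      # 1.0: separable in three dimensions
# In the lifted space the plane x1*x2 = 0 already separates the two classes.
```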

  25. Linear Separation with Quadratic Kernels (slide by Alex Smola)

  26. Quadratic Features
    Quadratic features in $\mathbb{R}^2$: $\Phi(x) := \left(x_1^2, \sqrt{2}\, x_1 x_2, x_2^2\right)$.
    Dot product: $\langle \Phi(x), \Phi(x') \rangle = \left\langle \left(x_1^2, \sqrt{2}\, x_1 x_2, x_2^2\right), \left(x_1'^2, \sqrt{2}\, x_1' x_2', x_2'^2\right) \right\rangle = \langle x, x' \rangle^2$.
    Insight: the trick works for any polynomial of order $d$ via $\langle x, x' \rangle^d$. (slide by Alex Smola)
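
A quick numeric check (not from the slides, plain numpy) of the identity $\langle \Phi(x), \Phi(x') \rangle = \langle x, x' \rangle^2$ for the feature map above:

```python
# Hedged check: explicit quadratic features vs. the implicit squared dot product.
import numpy as np

def phi(x):
    # Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) for x in R^2, as on the slide.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
lhs = phi(x) @ phi(xp)          # dot product computed in feature space
rhs = (x @ xp) ** 2             # the same value from the input space only
print(np.isclose(lhs, rhs))     # True
```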

  27. Computational Efficiency
    Problem: extracting features can sometimes be very costly. Example: second-order features in 1000 dimensions already give $500500 \approx 5 \cdot 10^5$ numbers; for higher-order polynomial features it is much worse.
    Solution: don't compute the features, try to compute dot products implicitly. For some features this works…
    Definition: a kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a symmetric function in its arguments for which the following property holds: $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$ for some feature map $\Phi$. If $k(x, x')$ is much cheaper to compute than $\Phi(x)$… (slide by Alex Smola)
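
To make the cost argument concrete (a sketch of my own, plain numpy): in $d = 1000$ input dimensions there are $d(d+1)/2 = 500500$ second-order monomials, yet the kernel $\langle x, x' \rangle^2$ needs only a single 1000-dimensional dot product.

```python
# Hedged cost comparison: explicit degree-2 feature count vs. the implicit kernel.
import numpy as np

d = 1000
n_second_order = d * (d + 1) // 2       # pairs (i, j) with i <= j, incl. squares
print(n_second_order)                   # 500500, roughly 5e5 numbers per example

rng = np.random.RandomState(0)
x, xp = rng.randn(d), rng.randn(d)
k_value = (x @ xp) ** 2                 # O(d) work, no feature vector ever built
print(k_value)
```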

  28. Recap: The Perceptron
    initialize $w = 0$ and $b = 0$
    repeat
      if $y_i [\langle w, x_i \rangle + b] \le 0$ then $w \leftarrow w + y_i x_i$ and $b \leftarrow b + y_i$
    until all points are classified correctly
    • Nothing happens if a point is classified correctly.
    • The weight vector is a linear combination $w = \sum_{i \in I} y_i x_i$.
    • The classifier is a linear combination of inner products, $f(x) = \sum_{i \in I} y_i \langle x_i, x \rangle + b$. (slide by Alex Smola)

  29. Recap: The Perceptron on features
    initialize $w = 0$, $b = 0$
    repeat
      pick $(x_i, y_i)$ from the data
      if $y_i (w \cdot \Phi(x_i) + b) \le 0$ then $w \leftarrow w + y_i \Phi(x_i)$ and $b \leftarrow b + y_i$
    until $y_i (w \cdot \Phi(x_i) + b) > 0$ for all $i$
    • Nothing happens if a point is classified correctly.
    • The weight vector is a linear combination $w = \sum_{i \in I} y_i \phi(x_i)$.
    • The classifier is a linear combination of inner products, $f(x) = \sum_{i \in I} y_i \langle \phi(x_i), \phi(x) \rangle + b$. (slide by Alex Smola)

  30. The Kernel Perceptron
    initialize $f = 0$
    repeat
      pick $(x_i, y_i)$ from the data
      if $y_i f(x_i) \le 0$ then $f(\cdot) \leftarrow f(\cdot) + y_i k(x_i, \cdot) + y_i$
    until $y_i f(x_i) > 0$ for all $i$
    • Nothing happens if a point is classified correctly.
    • The weight vector is a linear combination $w = \sum_{i \in I} y_i \phi(x_i)$.
    • The classifier is a linear combination of inner products, $f(x) = \sum_{i \in I} y_i \langle \phi(x_i), \phi(x) \rangle + b = \sum_{i \in I} y_i k(x_i, x) + b$. (slide by Alex Smola)
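
The pseudocode above translates almost line for line into the following sketch (not from the slides, plain numpy); the quadratic kernel and the XOR data are illustrative choices, and the classifier is stored as kernel-expansion coefficients rather than an explicit $w$.

```python
# Hedged kernel-perceptron sketch following the slide's update rule:
#   f(.) <- f(.) + y_i k(x_i, .) + y_i   whenever y_i f(x_i) <= 0
import numpy as np

def k(x, xp):
    return (x @ xp) ** 2                     # example kernel: quadratic

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([1, -1, -1, 1])                 # XOR, separable under this kernel

alpha = np.zeros(len(X))                     # coefficient of k(x_i, .) in f
b = 0.0

def f(x):
    return sum(a * k(xi, x) for a, xi in zip(alpha, X)) + b

converged = False
while not converged:
    converged = True
    for i in range(len(X)):
        if y[i] * f(X[i]) <= 0:              # mistake: add y_i k(x_i, .) + y_i
            alpha[i] += y[i]
            b += y[i]
            converged = False

print([int(np.sign(f(x))) for x in X])       # matches y: [1, -1, -1, 1]
```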

  31. Processing Pipeline • Original data • Data in feature space (implicit) • Solve in feature space using kernels (slide by Alex Smola)

  32. Polynomial Kernels
    Idea: we want to extend $k(x, x') = \langle x, x' \rangle^2$ to $k(x, x') = (\langle x, x' \rangle + c)^d$ where $c > 0$ and $d \in \mathbb{N}$. Prove that such a kernel corresponds to a dot product.
    Proof strategy: simple and straightforward, compute the explicit sum given by the kernel, i.e. $k(x, x') = (\langle x, x' \rangle + c)^d = \sum_{i=0}^{d} \binom{d}{i} (\langle x, x' \rangle)^i c^{d-i}$. The individual terms $(\langle x, x' \rangle)^i$ are dot products for some $\Phi_i(x)$. (slide by Alex Smola)
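
A numeric sanity check of the expansion (not from the slides, plain numpy) for $d = 2$, $c = 1$ in $\mathbb{R}^2$: the kernel $(\langle x, x' \rangle + c)^2$ equals the dot product of an explicit feature map containing all monomials up to degree 2.

```python
# Hedged check that (<x, x'> + c)^2 is a dot product of explicit features.
import numpy as np

c = 1.0

def phi(x):
    # Features whose dot product reproduces (<x, x'> + c)^2 in R^2.
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

def k(x, xp):
    return (x @ xp + c) ** 2

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(np.isclose(phi(x) @ phi(xp), k(x, xp)))   # True
```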

  33. Kernel Conditions
    Computability: we have to be able to compute $k(x, x')$ efficiently (much cheaper than the dot products themselves).
    "Nice and Useful" Functions: the features themselves have to be useful for the learning problem at hand. Quite often this means smooth functions.
    Symmetry: obviously $k(x, x') = k(x', x)$ due to the symmetry of the dot product, $\langle \Phi(x), \Phi(x') \rangle = \langle \Phi(x'), \Phi(x) \rangle$.
    Dot Product in Feature Space: is there always a $\Phi$ such that $k$ really is a dot product? (slide by Alex Smola)

  34. Mercer’s Theorem
    The Theorem: for any symmetric function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which is square integrable in $\mathcal{X} \times \mathcal{X}$ and which satisfies $\int_{\mathcal{X} \times \mathcal{X}} k(x, x') f(x) f(x')\, dx\, dx' \ge 0$ for all $f \in L_2(\mathcal{X})$, there exist functions $\phi_i: \mathcal{X} \to \mathbb{R}$ and numbers $\lambda_i \ge 0$ such that $k(x, x') = \sum_i \lambda_i \phi_i(x) \phi_i(x')$ for all $x, x' \in \mathcal{X}$.
    Interpretation: the double integral is the continuous version of a vector-matrix-vector multiplication. For positive semidefinite matrices we have $\sum_i \sum_j k(x_i, x_j) \alpha_i \alpha_j \ge 0$. (slide by Alex Smola)
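
The discrete analogue can be checked numerically (a sketch of my own, plain numpy): a Mercer kernel such as the Gaussian yields a positive semidefinite kernel matrix on any finite sample, so its eigenvalues are non-negative up to round-off.

```python
# Hedged PSD check: eigenvalues of a Gaussian (RBF) kernel matrix on random data.
import numpy as np

rng = np.random.RandomState(4)
X = rng.randn(50, 3)

# K_ij = exp(-||x_i - x_j||^2 / 2), a standard Mercer kernel.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)

eigenvalues = np.linalg.eigvalsh(K)
print(eigenvalues.min() >= -1e-10)   # True: sum_ij alpha_i alpha_j K_ij >= 0
```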

  35. Properties
    Distance in Feature Space: the distance between points in feature space is given by $d(x, x')^2 := \|\Phi(x) - \Phi(x')\|^2 = \langle \Phi(x), \Phi(x) \rangle - 2 \langle \Phi(x), \Phi(x') \rangle + \langle \Phi(x'), \Phi(x') \rangle = k(x, x) + k(x', x') - 2 k(x, x')$.
    Kernel Matrix: to compare observations we compute dot products, so we study the matrix $K$ given by $K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle = k(x_i, x_j)$, where the $x_i$ are the training patterns.
    Similarity Measure: the entries $K_{ij}$ tell us the overlap between $\Phi(x_i)$ and $\Phi(x_j)$, so $k(x_i, x_j)$ is a similarity measure. (slide by Alex Smola)
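
A last small check (not from the slides, plain numpy): for the quadratic kernel, the kernel-only distance formula $k(x, x) + k(x', x') - 2k(x, x')$ matches the explicit distance in feature space.

```python
# Hedged check of d(x, x')^2 = k(x, x) + k(x', x') - 2 k(x, x')
# against the explicit feature map of the quadratic kernel.
import numpy as np

def k(x, xp):
    return (x @ xp) ** 2

def phi(x):
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x, xp = np.array([1.0, -2.0]), np.array([0.5, 2.5])
d2_from_kernel = k(x, x) + k(xp, xp) - 2 * k(x, xp)
d2_explicit = np.sum((phi(x) - phi(xp)) ** 2)
print(np.isclose(d2_from_kernel, d2_explicit))   # True
```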
