

  1. Natural Language Processing and Information Retrieval: Support Vector Machines. Alessandro Moschitti, Department of Information and Communication Technology, University of Trento. Email: moschitti@disi.unitn.it

  2. Summary Support Vector Machines Hard-margin SVMs Soft-margin SVMs

  3. Which hyperplane to choose?

  4. Classifier with a Maximum Margin. IDEA 1: select the hyperplane with maximum margin. [Figure: the two classes plotted in the Var 1 / Var 2 plane, with the margin around the separating hyperplane highlighted.]

  5. Support Vectors. [Figure: in the Var 1 / Var 2 plane, the points lying on the margin boundaries are the support vectors.]

  6. Support Vector Machine Classifiers. The margin is equal to 2k / ||w||. [Figure: the margin hyperplanes w · x + b = k and w · x + b = -k on either side of the separating hyperplane w · x + b = 0, in the Var 1 / Var 2 plane.]

  7. Support Vector Machines. The margin is equal to 2k / ||w||, so we need to solve: maximize 2k / ||w|| subject to w · x + b ≥ +k if x is a positive example and w · x + b ≤ -k if x is a negative example. [Figure: the hyperplanes w · x + b = ±k around w · x + b = 0 in the Var 1 / Var 2 plane.]

  8. Support Vector Machines. There is a rescaling for which k = 1, and the problem becomes: maximize 2 / ||w|| subject to w · x + b ≥ +1 if x is a positive example and w · x + b ≤ -1 if x is a negative example. [Figure: the hyperplanes w · x + b = ±1 around w · x + b = 0 in the Var 1 / Var 2 plane.]

  9. Final Formulation. Maximize 2 / ||w|| subject to w · x_i + b ≥ +1 if y_i = 1 and w · x_i + b ≤ -1 if y_i = -1; equivalently, maximize 2 / ||w|| subject to y_i (w · x_i + b) ≥ 1; equivalently, minimize ||w|| / 2, and hence ||w||² / 2, subject to y_i (w · x_i + b) ≥ 1.

  10. Optimization Problem. Optimal hyperplane: minimize τ(w) = ½ ||w||² subject to y_i (w · x_i + b) ≥ 1, i = 1, ..., m. The dual problem is simpler.
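
A minimal sketch of this problem in practice (the toy data, the use of scikit-learn's SVC and the very large C are illustrative assumptions, not part of the slides): in the separable case, a linear SVC with a huge C behaves essentially like the hard-margin formulation above.

      # Hard-margin SVM approximated with a very large C (illustrative sketch).
      import numpy as np
      from sklearn.svm import SVC

      # Toy, linearly separable data (assumed for illustration).
      X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
                    [2.0, 0.5], [3.0, 1.0], [4.0, 1.5]])
      y = np.array([1, 1, 1, -1, -1, -1])

      # A huge C makes margin violations prohibitively expensive,
      # so the solution approaches the hard-margin one.
      clf = SVC(kernel="linear", C=1e10).fit(X, y)

      w, b = clf.coef_[0], clf.intercept_[0]
      print("w =", w, " b =", b, " margin =", 2.0 / np.linalg.norm(w))
      # Every training point should satisfy y_i (w · x_i + b) >= 1 (up to tolerance).
      print(y * (X @ w + b))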

  11. Lagrangian Definition: L(w, b, α) = ½ ||w||² - Σ_{i=1..m} α_i [ y_i (w · x_i + b) - 1 ], with Lagrange multipliers α_i ≥ 0.

  12. Dual Optimization Problem

  13. Dual Transformation. Given the Lagrangian associated with our problem, to solve the dual problem we need to minimize it over w and b: let us set the derivative with respect to w to 0.

  14. Dual Transformation (cont'd). We do the same with respect to b, and then substitute the resulting conditions into the Lagrange function.

  15. Final Dual Problem
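
The formula on this slide was an image in the original deck; the standard hard-margin dual obtained from the substitutions above (a reconstruction, consistent with the constraints listed on slide 17, not copied from the slide) is:

      \[
      \begin{aligned}
      \max_{\alpha}\quad & \sum_{i=1}^{m} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j\, y_i y_j\, \vec{x}_i \cdot \vec{x}_j \\
      \text{subject to}\quad & \alpha_i \ge 0,\; i = 1,\dots,m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0
      \end{aligned}
      \]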

  16. Kuhn-Tucker Theorem. Necessary and sufficient conditions for optimality: ∂L(w*, α*, β*)/∂w = 0; ∂L(w*, α*, β*)/∂b = 0; α*_i g_i(w*) = 0, i = 1, ..., m; g_i(w*) ≤ 0, i = 1, ..., m; α*_i ≥ 0, i = 1, ..., m.

  17. Properties coming from constraints. Lagrange constraints: Σ_{i=1..m} α_i y_i = 0 and w = Σ_{i=1..m} α_i y_i x_i. Karush-Kuhn-Tucker constraints: α_i [ y_i (w · x_i + b) - 1 ] = 0, i = 1, ..., m. Support vectors are the examples with non-null α_i. To evaluate b, we can apply the KKT condition to any support vector x_k, which gives b = y_k - w · x_k.

  18. Warning! In the graphical examples we always consider normalized hyperplanes (hyperplanes with a normalized gradient); in this case b is exactly the distance of the hyperplane from the origin. If instead the equation is not normalized, e.g. w' · x + b = 0 with x = (x, y) and w' = (1, 1), then b is not that distance.

  19. Warning! Let us consider a normalized gradient w = (1/√2, 1/√2). Then (1/√2, 1/√2) · (x, y) + b = 0, i.e. x/√2 + y/√2 = -b, i.e. y = -x - b√2. Now we see that -b is exactly the distance: for x = 0 the intersection with the y axis is -b√2, and this distance projected onto w is -b.
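
A quick numerical check of this claim (a small sketch; the particular w and b values are illustrative assumptions):

      # For a hyperplane w · x + b = 0 with ||w|| = 1, |b| is its distance from the origin.
      import numpy as np

      w = np.array([1.0, 1.0]) / np.sqrt(2.0)   # normalized gradient
      b = -3.0                                  # illustrative offset

      # Distance from the origin to the hyperplane: |w · 0 + b| / ||w|| = |b|.
      distance = abs(np.dot(w, np.zeros(2)) + b) / np.linalg.norm(w)
      print(distance, abs(b))                   # both 3.0

      # The intersection with the y axis (x = 0) is y = -b * sqrt(2);
      # projecting that point onto w gives back -b.
      y_intercept = np.array([0.0, -b * np.sqrt(2.0)])
      print(np.dot(y_intercept, w))             # -b = 3.0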

  20. Soft Margin SVMs. Slack variables ξ_i are added: some errors are allowed, but they should penalize the objective function. [Figure: points violating the margin in the Var 1 / Var 2 plane, with slacks ξ_i measured from the margin hyperplanes w · x + b = ±1 around w · x + b = 0.]

  21. Soft Margin SVMs. The new constraints are y_i (w · x_i + b) ≥ 1 - ξ_i, where ξ_i ≥ 0. The objective function penalizes the incorrectly classified examples: min ½ ||w||² + C Σ_i ξ_i, where C is the trade-off between the margin and the error. [Figure: the slack ξ_i of a point inside the margin, with the hyperplanes w · x + b = ±1 and w · x + b = 0.]

  22. Dual formulation. min ½ ||w||² + C Σ_{i=1..m} ξ_i subject to y_i (w · x_i + b) ≥ 1 - ξ_i, ∀ i = 1, ..., m, and ξ_i ≥ 0, i = 1, ..., m. The Lagrangian is L(w, b, ξ, α) = ½ w · w + C Σ_{i=1..m} ξ_i - Σ_{i=1..m} α_i [ y_i (w · x_i + b) - 1 + ξ_i ]. We then derive it with respect to w, ξ and b.

  23. Partial Derivatives

  24. Substitution in the objective function (the derivation uses the Kronecker delta δ_ij).

  25. Final dual optimization problem
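
The formula on this slide was an image in the original deck; the standard soft-margin dual that results from the substitution (a reconstruction, not copied from the slide) is the hard-margin dual with box constraints on the multipliers:

      \[
      \begin{aligned}
      \max_{\alpha}\quad & \sum_{i=1}^{m} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j\, y_i y_j\, \vec{x}_i \cdot \vec{x}_j \\
      \text{subject to}\quad & 0 \le \alpha_i \le C,\; i = 1,\dots,m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0
      \end{aligned}
      \]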

  26. Soft Margin Support Vector Machines. min ½ ||w||² + C Σ_i ξ_i subject to y_i (w · x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0. The algorithm tries to keep the ξ_i low while maximizing the margin. NB: the number of errors is not directly minimized (that would be an NP-complete problem); the distances from the hyperplane are minimized instead. If C → ∞, the solution tends to the hard-margin one. Attention: if C = 0 we get ||w|| = 0, since the constraints reduce to y_i b ≥ 1 - ξ_i, which the slacks can satisfy on their own. As C increases the number of errors decreases; when C tends to infinity the number of errors must be 0, i.e. we recover the hard-margin formulation.
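
A small sketch of this trade-off (the overlapping toy data and the particular C values are illustrative assumptions): with a linear SVC, growing C shrinks the total slack and pushes the solution toward the hard-margin one.

      # Effect of C on the soft-margin solution (illustrative sketch).
      import numpy as np
      from sklearn.svm import SVC

      rng = np.random.RandomState(0)
      # Two partially overlapping blobs, so some slack is unavoidable.
      X = np.vstack([rng.randn(50, 2) + [1, 1], rng.randn(50, 2) - [1, 1]])
      y = np.hstack([np.ones(50), -np.ones(50)])

      for C in (0.01, 1.0, 100.0):
          clf = SVC(kernel="linear", C=C).fit(X, y)
          w, b = clf.coef_[0], clf.intercept_[0]
          slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # the xi_i
          print(f"C={C:7.2f}  margin={2 / np.linalg.norm(w):.3f}  "
                f"total slack={slack.sum():.3f}  errors={(y * (X @ w + b) < 0).sum()}")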

  27. Robustness of Soft vs. Hard Margin SVMs. [Figure: the separating hyperplane w · x + b = 0 learned by a hard-margin SVM and by a soft-margin SVM on the same Var 1 / Var 2 data; the soft-margin SVM absorbs the odd example with a slack ξ_i.]

  28. Soft vs. Hard Margin SVMs. Soft margin always has a solution; soft margin is more robust to odd examples; hard margin does not require parameters.

  29. Parameters. min ½ ||w||² + C Σ_i ξ_i = min ½ ||w||² + C⁺ Σ_{i ∈ positives} ξ_i + C⁻ Σ_{i ∈ negatives} ξ_i = min ½ ||w||² + C ( J Σ_{i ∈ positives} ξ_i + Σ_{i ∈ negatives} ξ_i ). C: trade-off parameter; J: cost factor.
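
In scikit-learn a cost factor of this kind can be approximated with per-class weights, which rescale C class by class (a sketch under that assumption; the toy data and the C and J values are illustrative):

      # Approximating the cost factor J with per-class weights (illustrative sketch).
      import numpy as np
      from sklearn.svm import SVC

      X = np.array([[0.0, 1.0], [1.0, 1.0], [0.0, -1.0], [1.0, -1.0]])
      y = np.array([1, 1, -1, -1])

      C, J = 1.0, 10.0   # illustrative values
      # class_weight rescales C per class: positives pay C*J per unit of slack,
      # negatives pay C, mirroring  C * (J * sum_+ xi_i + sum_- xi_i)  above.
      clf = SVC(kernel="linear", C=C, class_weight={1: J, -1: 1.0}).fit(X, y)
      print(clf.coef_, clf.intercept_)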

  30. Theoretical Justification

  31. Definition of Training Set Error. Training data: (x_1, y_1), ..., (x_m, y_m) ∈ R^N × {±1}, and a classifier f : R^N → {±1}. Empirical risk (error): R_emp[f] = (1/m) Σ_{i=1..m} ½ | f(x_i) - y_i |. Risk (error): R[f] = ∫ ½ | f(x) - y | dP(x, y).

  32. Error Characterization (part 1). From PAC-learning theory (Vapnik): R(α) ≤ R_emp(α) + ε(d, m, δ), where ε(d, m, δ) = sqrt( ( d (log(2m/d) + 1) - log(δ/4) ) / m ), d is the VC dimension, m is the number of examples, δ is a bound on the probability of getting such an error and α is a classifier parameter.

  33. There are many versions for different bounds

  34. Error Characterization (part 2)

  35. Ranking, Regression and Multiclassification

  36. The Ranking SVM [Herbrich et al. 1999, 2000; Joachims et al. 2002]. The aim is to classify instance pairs as correctly ranked or incorrectly ranked. This turns an ordinal regression problem back into a binary classification problem. We want a ranking function f such that x_i > x_j iff f(x_i) > f(x_j), or at least one that tries to do this with minimal error. Suppose that f is a linear function: f(x_i) = w · x_i.

  37. The Ranking SVM (Sec. 15.4.2). Ranking model: f(x_i).

  38. The Ranking SVM (Sec. 15.4.2). Then (combining the two equations on the last slide): x_i > x_j iff w · x_i - w · x_j > 0, i.e. x_i > x_j iff w · (x_i - x_j) > 0. Let us then create a new instance space from such pairs: z_k = x_i - x_k, with y_k = +1 or -1 according to whether x_i ≥ x_k or x_i < x_k.

  39. Support Vector Ranking. min ½ ||w||² + C Σ_{k=1..m²} ξ_k subject to y_k ( w · (x_i - x_j) + b ) ≥ 1 - ξ_k, ∀ i, j = 1, ..., m, and ξ_k ≥ 0, k = 1, ..., m², where y_k = 1 if rank(x_i) > rank(x_j), 0 otherwise, and k = i × m + j. Given two examples (x_i, x_j) we build one example z_k.
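
A minimal sketch of this pairwise construction (the toy feature vectors, the assumed ranks and the use of LinearSVC are illustrative assumptions): each pair with different ranks becomes a difference vector with a ±1 label, and a linear classifier is trained on those.

      # Ranking SVM via the pairwise transform (illustrative sketch).
      import numpy as np
      from sklearn.svm import LinearSVC

      # Toy feature vectors and their (assumed) relevance ranks.
      X = np.array([[1.0, 0.2], [0.5, 0.9], [0.1, 0.1], [0.8, 0.7]])
      rank = np.array([3, 2, 1, 4])   # higher = better

      # Build one example z_k = x_i - x_j per ordered pair with different ranks.
      Z, y = [], []
      for i in range(len(X)):
          for j in range(len(X)):
              if rank[i] != rank[j]:
                  Z.append(X[i] - X[j])
                  y.append(1 if rank[i] > rank[j] else -1)

      ranker = LinearSVC(C=1.0).fit(np.array(Z), np.array(y))
      w = ranker.coef_[0]
      # Scoring new items with f(x) = w · x reproduces the learned ordering.
      print(X @ w)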

  40. Support Vector Regression (SVR). Solution: f(x) = w x + b. Minimize ½ wᵀ w subject to the constraints y_i - wᵀ x_i - b ≤ ε and wᵀ x_i + b - y_i ≤ ε. [Figure: the regression function f(x) = w x + b with the ε-tube around it.]

  41. Support Vector Regression (SVR). Minimize ½ wᵀ w + C Σ_{i=1..N} ( ξ_i + ξ_i* ) subject to y_i - wᵀ x_i - b ≤ ε + ξ_i, wᵀ x_i + b - y_i ≤ ε + ξ_i*, and ξ_i, ξ_i* ≥ 0. [Figure: points outside the ε-tube around f(x) = w x + b pay slacks ξ_i and ξ_i*.]

  42. Support Vector Regression. y_i is no longer -1 or 1; it is now a real value. ε is the tolerance on our function value.
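
A minimal sketch with a linear-kernel SVR (the noisy toy data and the C and ε values are illustrative assumptions): deviations smaller than ε are not penalized, and C trades off the flatness of w against violations of the ε-tube.

      # epsilon-insensitive regression with a linear SVR (illustrative sketch).
      import numpy as np
      from sklearn.svm import SVR

      rng = np.random.RandomState(0)
      X = np.linspace(0.0, 5.0, 40).reshape(-1, 1)
      y = 2.0 * X.ravel() + 1.0 + 0.2 * rng.randn(40)   # y is a real value, not a class

      reg = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y)
      print(reg.coef_, reg.intercept_)   # roughly w = 2 and b = 1 for this data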

  43. From Binary to Multi-class Classifiers. Three different approaches: ONE-vs-ALL (OVA). Given the example sets {E1, E2, E3, ...} for the categories {C1, C2, C3, ...}, the binary classifiers {b1, b2, b3, ...} are built: for b1, E1 is the set of positives and E2 ∪ E3 ∪ ... is the set of negatives, and so on. For testing: given a classification instance x, the chosen category is the one whose binary classifier yields the maximum margin.
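
A minimal sketch of the OVA scheme described above (the toy data and the use of LinearSVC are illustrative assumptions): one binary linear SVM per category, prediction by the maximum margin.

      # One-vs-All multi-class classification with binary linear SVMs (sketch).
      import numpy as np
      from sklearn.svm import LinearSVC

      # Toy data for three categories (illustrative).
      X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [0, 5], [1, 6]], dtype=float)
      y = np.array([0, 0, 1, 1, 2, 2])   # categories C1, C2, C3 encoded as 0, 1, 2

      classifiers = []
      for c in np.unique(y):
          # Positives: examples of category c; negatives: the union of all the others.
          binary_labels = np.where(y == c, 1, -1)
          classifiers.append(LinearSVC(C=1.0).fit(X, binary_labels))

      def predict(x):
          # Choose the category whose binary classifier gives the maximum margin.
          margins = [clf.decision_function([x])[0] for clf in classifiers]
          return int(np.argmax(margins))

      print(predict([4.5, 5.5]))   # expected: category 1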
