Part 5: Structured Support Vector Machines


Sebastian Nowozin and Christoph H. Lampert, Structured Models in Computer Vision, Part 5: Structured SVMs. Colorado Springs, 25th June 2011.


  1. Title slide: Part 5: Structured Support Vector Machines. Sebastian Nowozin and Christoph H. Lampert. Colorado Springs, 25th June 2011.

  2. Problem (Loss-Minimizing Parameter Learning). Let $d(x,y)$ be the (unknown) true data distribution, let $D = \{(x^1,y^1),\ldots,(x^N,y^N)\}$ be i.i.d. samples from $d(x,y)$, let $\phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^D$ be a feature function, and let $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be a loss function. Goal: find a weight vector $w^*$ that leads to minimal expected loss $\mathbb{E}_{(x,y)\sim d(x,y)}\{\Delta(y, f(x))\}$ for $f(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \langle w, \phi(x,y) \rangle$.
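
A minimal sketch of this prediction rule for a small, explicitly enumerable label set (the function and variable names here are illustrative, not from the slides):

```python
import numpy as np

def predict(w, phi, x, label_set):
    """f(x) = argmax_{y in Y} <w, phi(x, y)>, by exhaustive search over a finite Y."""
    scores = [np.dot(w, phi(x, y)) for y in label_set]
    return label_set[int(np.argmax(scores))]
```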

  3. Problem (Loss-Minimizing Parameter Learning), continued. Pro: we directly optimize for the quantity of interest, the expected loss, and no expensive-to-compute partition function $Z$ shows up. Con: we need to know the loss function already at training time, and we cannot use probabilistic reasoning to find $w^*$.

  4. Reminder: learning by regularized risk minimization. For the compatibility function $g(x,y;w) := \langle w, \phi(x,y) \rangle$, find $w^*$ that minimizes $\mathbb{E}_{(x,y)\sim d(x,y)}\,\Delta\bigl(y, \operatorname{argmax}_y g(x,y;w)\bigr)$. Two major problems: (i) $d(x,y)$ is unknown; (ii) $\operatorname{argmax}_y g(x,y;w)$ maps into a discrete space, so $\Delta(y, \operatorname{argmax}_y g(x,y;w))$ is discontinuous and piecewise constant as a function of $w$.
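
To see the second problem concretely, the following toy snippet (entirely made up, not from the slides) evaluates the task loss along a line in weight space; the argmax only changes at finitely many crossing points, so the loss is a step function of $w$:

```python
import numpy as np

# Toy setup: one input x with three candidate labels; phi(x, y) stored as rows.
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])   # phi(x, 0), phi(x, 1), phi(x, 2)
y_true = 0
delta = lambda y, y_pred: 0.0 if y == y_pred else 1.0  # 0/1 task loss

for t in np.linspace(-1.0, 1.0, 11):
    w = np.array([t, 1.0 - t])
    y_pred = int(np.argmax(Phi @ w))                    # argmax_y g(x, y; w)
    print(f"t = {t:+.1f}   Delta = {delta(y_true, y_pred)}")
# The printed loss jumps between 0 and 1 and is flat in between: piecewise constant in w.
```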

  5. Task: $\min_w\; \mathbb{E}_{(x,y)\sim d(x,y)}\,\Delta(y, \operatorname{argmax}_y g(x,y;w))$. Problem 1: $d(x,y)$ is unknown. Solution: replace $\mathbb{E}_{(x,y)\sim d(x,y)}\{\cdot\}$ by the empirical estimate $\frac{1}{N}\sum_{(x^n,y^n)}\{\cdot\}$, and to avoid overfitting add a regularizer, e.g. $\lambda\|w\|^2$. New task: $\min_w\; \lambda\|w\|^2 + \frac{1}{N}\sum_{n=1}^{N}\Delta(y^n, \operatorname{argmax}_y g(x^n,y;w))$.

  6. Task: $\min_w\; \lambda\|w\|^2 + \frac{1}{N}\sum_{n=1}^{N}\Delta(y^n, \operatorname{argmax}_y g(x^n,y;w))$. Problem 2: $\Delta(y, \operatorname{argmax}_y g(x,y;w))$ is discontinuous with respect to $w$. Solution: replace $\Delta(y,y')$ by a well-behaved surrogate $\ell(x,y,w)$, typically an upper bound to $\Delta$ that is continuous and convex with respect to $w$. New task: $\min_w\; \lambda\|w\|^2 + \frac{1}{N}\sum_{n=1}^{N}\ell(x^n,y^n,w)$.

  7. Regularized Risk Minimization: $\min_w\; \lambda\|w\|^2 + \frac{1}{N}\sum_{n=1}^{N}\ell(x^n,y^n,w)$, i.e. regularization plus loss on the training data.

  8. Regularized Risk Minimization: $\min_w\; \lambda\|w\|^2 + \frac{1}{N}\sum_{n=1}^{N}\ell(x^n,y^n,w)$. Hinge loss (maximum margin training): $\ell(x^n,y^n,w) := \max_{y \in \mathcal{Y}}\bigl[\Delta(y^n,y) + \langle w, \phi(x^n,y)\rangle - \langle w, \phi(x^n,y^n)\rangle\bigr]$.
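
A hedged sketch of evaluating this hinge loss by brute-force maximization over a small $\mathcal{Y}$; in real structured problems the max is computed by loss-augmented inference (e.g. dynamic programming or graph cuts), which this toy helper does not attempt:

```python
import numpy as np

def structured_hinge_loss(w, phi, delta, x_n, y_n, label_set):
    """max_{y in Y} [ Delta(y_n, y) + <w, phi(x_n, y)> - <w, phi(x_n, y_n)> ]."""
    score_true = np.dot(w, phi(x_n, y_n))
    values = [delta(y_n, y) + np.dot(w, phi(x_n, y)) - score_true
              for y in label_set]
    return max(values)   # always >= 0: the candidate y = y_n contributes value 0
```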

  9. Hinge loss (maximum margin training): $\ell(x^n,y^n,w) := \max_{y \in \mathcal{Y}}\bigl[\Delta(y^n,y) + \langle w, \phi(x^n,y)\rangle - \langle w, \phi(x^n,y^n)\rangle\bigr]$. Properties: $\ell$ is a maximum over functions linear in $w$, hence continuous and convex; and $\ell$ bounds $\Delta$ from above. Proof: let $\bar{y} = \operatorname{argmax}_y g(x^n,y,w)$; since $\bar{y}$ maximizes $g$, we have $g(x^n,\bar{y},w) - g(x^n,y^n,w) \ge 0$, so $\Delta(y^n,\bar{y}) \le \Delta(y^n,\bar{y}) + g(x^n,\bar{y},w) - g(x^n,y^n,w) \le \max_{y\in\mathcal{Y}}\bigl[\Delta(y^n,y) + g(x^n,y,w) - g(x^n,y^n,w)\bigr]$.

  10. Hinge loss (maximum margin training): $\ell(x^n,y^n,w) := \max_{y \in \mathcal{Y}}\bigl[\Delta(y^n,y) + \langle w, \phi(x^n,y)\rangle - \langle w, \phi(x^n,y^n)\rangle\bigr]$. Alternative, logistic loss (probabilistic training): $\ell(x^n,y^n,w) := \log \sum_{y \in \mathcal{Y}} \exp\bigl(\langle w, \phi(x^n,y)\rangle - \langle w, \phi(x^n,y^n)\rangle\bigr)$.
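
The logistic loss can be sketched the same way; the log-sum-exp should be computed stably (here via SciPy), and enumerating $\mathcal{Y}$ is again only feasible for tiny label sets:

```python
import numpy as np
from scipy.special import logsumexp

def structured_logistic_loss(w, phi, x_n, y_n, label_set):
    """log sum_{y in Y} exp( <w, phi(x_n, y)> - <w, phi(x_n, y_n)> )."""
    score_true = np.dot(w, phi(x_n, y_n))
    margins = np.array([np.dot(w, phi(x_n, y)) - score_true for y in label_set])
    return logsumexp(margins)   # equals the negative conditional log-likelihood -log p(y_n | x_n; w)
```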

  11. Structured Output Support Vector Machine: $\min_w\; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^{N}\max_{y\in\mathcal{Y}}\bigl[\Delta(y^n,y) + \langle w, \phi(x^n,y)\rangle - \langle w, \phi(x^n,y^n)\rangle\bigr]$. Conditional Random Field: $\min_w\; \frac{\|w\|^2}{2\sigma^2} + \sum_{n=1}^{N}\log \sum_{y\in\mathcal{Y}} \exp\bigl(\langle w, \phi(x^n,y)\rangle - \langle w, \phi(x^n,y^n)\rangle\bigr)$. CRFs and SSVMs have more in common than usually assumed: both do regularized risk minimization, and $\log\sum_y \exp(\cdot)$ can be interpreted as a soft-max.
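
The soft-max remark can be made concrete: for finite scores $a_y$, $\max_y a_y \le \log\sum_y \exp(a_y) \le \max_y a_y + \log|\mathcal{Y}|$, so the CRF loss is a smoothed version of the SSVM max. A tiny numeric check (values made up):

```python
import numpy as np
from scipy.special import logsumexp

a = np.array([2.0, -1.0, 0.5])        # arbitrary per-label scores a_y
print(a.max(), logsumexp(a), a.max() + np.log(len(a)))
# prints roughly 2.0, 2.24, 3.10: max <= log-sum-exp <= max + log|Y|
```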

  12. Solving the Training Optimization Problem Numerically. Structured Output Support Vector Machine: $\min_w\; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^{N}\max_{y\in\mathcal{Y}}\bigl[\Delta(y^n,y) + \langle w, \phi(x^n,y)\rangle - \langle w, \phi(x^n,y^n)\rangle\bigr]$. This is an unconstrained, convex, non-differentiable optimization problem.

  13. Structured Output SVM (equivalent formulation): $\min_{w,\xi}\; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^{N}\xi^n$ subject to, for $n = 1,\ldots,N$, $\max_{y\in\mathcal{Y}}\bigl[\Delta(y^n,y) + \langle w, \phi(x^n,y)\rangle - \langle w, \phi(x^n,y^n)\rangle\bigr] \le \xi^n$. This has $N$ non-linear constraints and a convex, differentiable objective.

  14. Structured Output SVM (also equivalent formulation): $\min_{w,\xi}\; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^{N}\xi^n$ subject to, for $n = 1,\ldots,N$, $\Delta(y^n,y) + \langle w, \phi(x^n,y)\rangle - \langle w, \phi(x^n,y^n)\rangle \le \xi^n$ for all $y \in \mathcal{Y}$. This has $N|\mathcal{Y}|$ linear constraints and a convex, differentiable objective.

  15. Example: Multiclass SVM. $\mathcal{Y} = \{1, 2, \ldots, K\}$, $\Delta(y,y') = 1$ for $y \ne y'$ and $0$ otherwise, and $\phi(x,y) = \bigl([y{=}1]\,\phi(x),\, [y{=}2]\,\phi(x),\, \ldots,\, [y{=}K]\,\phi(x)\bigr)$. Solve: $\min_{w,\xi}\; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^{N}\xi^n$ subject to, for $n = 1,\ldots,N$, $\langle w, \phi(x^n,y^n)\rangle - \langle w, \phi(x^n,y)\rangle \ge 1 - \xi^n$ for all $y \in \mathcal{Y}\setminus\{y^n\}$. Classification: $f(x) = \operatorname{argmax}_{y\in\mathcal{Y}} \langle w, \phi(x,y)\rangle$. This is the Crammer-Singer Multiclass SVM. [K. Crammer, Y. Singer: "On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines", JMLR, 2001]
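
A sketch of the joint feature map used here: $\phi(x,y)$ places $\phi(x)$ into the $y$-th of $K$ blocks. This only illustrates the construction on the slide, not the solver from the cited JMLR paper:

```python
import numpy as np

def multiclass_joint_feature(phi_x, y, num_classes):
    """phi(x, y) = ( [y=1] phi(x), [y=2] phi(x), ..., [y=K] phi(x) ), with y zero-based here."""
    d = len(phi_x)
    out = np.zeros(num_classes * d)
    out[y * d:(y + 1) * d] = phi_x      # only the y-th block is non-zero
    return out

# With this map, <w, phi(x, y)> = <w_y, phi(x)> where w_y is the y-th block of w,
# so prediction reduces to K ordinary dot products.
```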

  16. Example: Hierarchical SVM. Hierarchical multiclass loss: $\Delta(y,y') := \frac{1}{2}\,(\text{distance in tree})$, e.g. $\Delta(\text{cat},\text{cat}) = 0$, $\Delta(\text{cat},\text{dog}) = 1$, $\Delta(\text{cat},\text{bus}) = 2$, etc. Solve: $\min_{w,\xi}\; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^{N}\xi^n$ subject to, for $n = 1,\ldots,N$, $\langle w, \phi(x^n,y^n)\rangle - \langle w, \phi(x^n,y)\rangle \ge \Delta(y^n,y) - \xi^n$ for all $y \in \mathcal{Y}$. [L. Cai, T. Hofmann: "Hierarchical Document Categorization with Support Vector Machines", ACM CIKM, 2004] [A. Binder, K.-R. Müller, M. Kawanabe: "On taxonomies for multi-class image categorization", IJCV, 2011]
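
The hierarchical loss can be sketched as half the number of edges on the taxonomy path between two classes; the toy tree below is made up so that it reproduces the cat/dog/bus values from the slide:

```python
def tree_loss(y, y_prime, parent):
    """Delta(y, y') = 1/2 * (number of tree edges on the path between y and y')."""
    def ancestors(node):
        path = [node]
        while node in parent:
            node = parent[node]
            path.append(node)
        return path
    a, b = ancestors(y), ancestors(y_prime)
    common = next(n for n in a if n in b)        # lowest common ancestor
    return (a.index(common) + b.index(common)) / 2

parent = {"cat": "animal", "dog": "animal", "bus": "vehicle",
          "animal": "root", "vehicle": "root"}
print(tree_loss("cat", "cat", parent),    # 0.0
      tree_loss("cat", "dog", parent),    # 1.0
      tree_loss("cat", "bus", parent))    # 2.0
```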

  17. Solving the Training Optimization Problem Numerically. We can solve SSVM training like CRF training: $\min_w\; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^{N}\max_{y\in\mathcal{Y}}\bigl[\Delta(y^n,y) + \langle w, \phi(x^n,y)\rangle - \langle w, \phi(x^n,y^n)\rangle\bigr]$. The objective is continuous, unconstrained, and convex, but non-differentiable, so we cannot use gradient descent directly; we have to use subgradients.

  18. Definition. Let $f : \mathbb{R}^D \to \mathbb{R}$ be a convex, not necessarily differentiable, function. A vector $v \in \mathbb{R}^D$ is called a subgradient of $f$ at $w_0$ if $f(w) \ge f(w_0) + \langle v, w - w_0 \rangle$ for all $w$. [Figure: a convex curve $f(w)$ with the linear lower bound $f(w_0) + \langle v, w - w_0 \rangle$ touching it at $w_0$.]
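
For the structured hinge loss, a subgradient at $w$ follows from the standard rule for maxima of affine functions: if $\hat{y}$ attains the max, then $v = \phi(x^n,\hat{y}) - \phi(x^n,y^n)$ is a subgradient. A minimal sketch, again with exhaustive search standing in for loss-augmented inference (helper names are illustrative):

```python
import numpy as np

def hinge_subgradient(w, phi, delta, x_n, y_n, label_set):
    """A subgradient of w -> max_y [Delta(y_n,y) + <w,phi(x_n,y)> - <w,phi(x_n,y_n)>]."""
    values = [delta(y_n, y) + np.dot(w, phi(x_n, y)) for y in label_set]
    y_hat = label_set[int(np.argmax(values))]     # loss-augmented prediction
    return phi(x_n, y_hat) - phi(x_n, y_n)        # satisfies l(w') >= l(w) + <v, w' - w> for all w'
```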
