

  1. Optimization for Machine Learning, Lecture 3: Bundle Methods. S.V. N. (vishy) Vishwanathan, Purdue University, vishy@purdue.edu. July 11, 2012.

  2. Motivation: Outline. 1 Motivation, 2 Cutting Plane Methods, 3 Non Smooth Functions, 4 Bundle Methods, 5 BMRM, 6 Convergence Analysis, 7 Experiments, 8 Lower Bounds, 9 References.

  3. Motivation: Regularized Risk Minimization. Objective function. Training data: $\{x_1, \dots, x_m\}$; labels: $\{y_1, \dots, y_m\}$; learn a vector $w$ by solving $\min_w J(w) := \underbrace{\lambda \Omega(w)}_{\text{Regularizer}} + \underbrace{\tfrac{1}{m} \sum_{i=1}^{m} l(x_i, y_i, w)}_{\text{Risk } R_{\text{emp}}}$.
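A minimal Python sketch of this objective, assuming the squared-norm regularizer $\Omega(w) = \tfrac{1}{2}\|w\|^2$ and a caller-supplied per-example loss; function and variable names are illustrative only, not from the lecture:

    import numpy as np

    def regularized_risk(w, data, labels, loss, lam):
        # J(w) = lam * Omega(w) + R_emp(w), with Omega(w) = 0.5 * ||w||^2 assumed here
        omega = 0.5 * np.dot(w, w)
        # empirical risk: the average per-example loss l(x_i, y_i, w)
        r_emp = np.mean([loss(x, y, w) for x, y in zip(data, labels)])
        return lam * omega + r_emp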

  4. Motivation: Binary Classification. [Figure, repeated over three animation builds: data points labeled $y_i = +1$ and $y_i = -1$.]

  7. Motivation: Binary Classification. For points on the two marginal hyperplanes, $\langle w, x_1 \rangle + b = +1$ and $\langle w, x_2 \rangle + b = -1$, so $\langle w, x_1 - x_2 \rangle = 2$ and therefore $\langle \tfrac{w}{\|w\|}, x_1 - x_2 \rangle = \tfrac{2}{\|w\|}$. [Figure: points labeled $y_i = +1$ and $y_i = -1$ separated by the hyperplanes $\{x \mid \langle w, x \rangle + b = +1\}$, $\{x \mid \langle w, x \rangle + b = 0\}$, and $\{x \mid \langle w, x \rangle + b = -1\}$.]

  8. Motivation: Linear Support Vector Machines. Optimization problem: $\max_{w,b} \tfrac{2}{\|w\|}$ subject to $y_i (\langle w, x_i \rangle + b) \ge 1$ for all $i$.

  9. Motivation: Linear Support Vector Machines. Equivalently: $\min_{w,b} \tfrac{1}{2} \|w\|^2$ subject to $y_i (\langle w, x_i \rangle + b) \ge 1$ for all $i$.

  10. Motivation: Linear Support Vector Machines. With slack variables: $\min_{w,b,\xi} \tfrac{1}{2} \|w\|^2$ subject to $y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$.

  11. Motivation: Linear Support Vector Machines. Penalizing the slacks: $\min_{w,b,\xi} \tfrac{\lambda}{2} \|w\|^2 + \tfrac{1}{m} \sum_{i=1}^{m} \xi_i$ subject to $y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$.

  12. Motivation: Linear Support Vector Machines. Rewriting the constraints: $\min_{w,b,\xi} \tfrac{\lambda}{2} \|w\|^2 + \tfrac{1}{m} \sum_{i=1}^{m} \xi_i$ subject to $\xi_i \ge 1 - y_i (\langle w, x_i \rangle + b)$ and $\xi_i \ge 0$ for all $i$.

  13. Motivation: Linear Support Vector Machines. Since at the optimum $\xi_i = \max(0, 1 - y_i(\langle w, x_i \rangle + b))$, the slacks can be eliminated: $\min_{w,b} \tfrac{\lambda}{2} \|w\|^2 + \tfrac{1}{m} \sum_{i=1}^{m} \max(0, 1 - y_i (\langle w, x_i \rangle + b))$.

  14. Motivation: Linear Support Vector Machines. This is regularized risk minimization: $\min_{w,b} \underbrace{\tfrac{\lambda}{2} \|w\|^2}_{\lambda \Omega(w)} + \underbrace{\tfrac{1}{m} \sum_{i=1}^{m} \max(0, 1 - y_i (\langle w, x_i \rangle + b))}_{R_{\text{emp}}(w)}$.

  15. Motivation: Binary Hinge Loss. [Figure: the hinge loss plotted against the margin $y(\langle w, x \rangle + b)$.]
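For reference, a short Python sketch of this loss; the names and array shapes (X of size m-by-d, labels y with entries in {-1, +1}) are our own:

    import numpy as np

    def binary_hinge_loss(w, b, X, y):
        # per-example loss max(0, 1 - y_i * (<w, x_i> + b)), averaged over the m examples
        margins = y * (X @ w + b)
        return np.maximum(0.0, 1.0 - margins).mean()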

  16. Cutting Plane Methods: Outline. 1 Motivation, 2 Cutting Plane Methods, 3 Non Smooth Functions, 4 Bundle Methods, 5 BMRM, 6 Convergence Analysis, 7 Experiments, 8 Lower Bounds, 9 References.

  17. Cutting Plane Methods: First Order Taylor Expansion. The first-order Taylor approximation globally lower bounds a convex function: for any $x$ and $x'$ we have $f(x) \ge f(x') + \langle x - x', \nabla f(x') \rangle$.
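A quick numerical sanity check of this bound, on a toy smooth convex function of our own choosing:

    import numpy as np

    f = lambda x: x ** 2                     # smooth convex example (our choice)
    grad = lambda x: 2.0 * x

    x0 = 1.0                                 # the expansion point x'
    xs = np.linspace(-3.0, 3.0, 61)
    tangent = f(x0) + (xs - x0) * grad(x0)   # f(x') + <x - x', grad f(x')>
    assert np.all(f(xs) >= tangent - 1e-12)  # the tangent never exceeds f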

  18. Cutting Plane Methods. [Figure, built up over nine animation steps.]

  27. Cutting Plane Methods: In a Nutshell. Cutting plane methods work by forming the piecewise linear lower bound
      $J(w) \ge J_t^{CP}(w) := \max_{1 \le i \le t} \{ J(w_{i-1}) + \langle w - w_{i-1}, s_i \rangle \},$
      where $s_i$ denotes the gradient $\nabla J(w_{i-1})$. At iteration $t$ the set $\{w_i\}_{i=0}^{t-1}$ is augmented by
      $w_t := \operatorname{argmin}_w J_t^{CP}(w).$
      Stop when the duality gap
      $\epsilon_t := \min_{0 \le i \le t} J(w_i) - J_t^{CP}(w_t)$
      falls below a pre-specified threshold $\epsilon$.

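Translated into code, a toy one-dimensional version of this loop might look as follows. Everything here is our own illustration (the objective, the box constraint $|w| \le B$ that keeps the inner linear program bounded, and all names), not the lecture's code; the inner argmin over the piecewise linear model is solved as an LP with scipy.optimize.linprog:

    import numpy as np
    from scipy.optimize import linprog

    def J(w):                        # toy nonsmooth convex objective, minimum at w = 0.5
        return abs(w) + (w - 1.0) ** 2

    def subgradient(w):              # any element of the subdifferential will do
        return np.sign(w) + 2.0 * (w - 1.0)

    B, eps = 10.0, 1e-6              # box |w| <= B bounds the LP; eps is the gap threshold
    w, ub = 5.0, np.inf              # starting point w_0 and best objective value so far
    A_ub, b_ub = [], []              # cuts, stored as: s_i * w - xi <= s_i * w_{i-1} - J(w_{i-1})
    for t in range(1, 101):
        s = subgradient(w)
        A_ub.append([s, -1.0])
        b_ub.append(s * w - J(w))
        ub = min(ub, J(w))
        # w_t := argmin_w J_t^CP(w), as the LP: min xi  s.t.  xi >= J(w_{i-1}) + s_i (w - w_{i-1})
        res = linprog(c=[0.0, 1.0], A_ub=A_ub, b_ub=b_ub,
                      bounds=[(-B, B), (None, None)])
        w, lb = res.x[0], res.fun    # next iterate and the lower bound J_t^CP(w_t)
        if ub - lb <= eps:           # the duality gap test from the slide
            break

    print(f"stopped at t = {t}, w = {w:.4f}, gap = {ub - lb:.2e}")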

  30. Non Smooth Functions: Outline. 1 Motivation, 2 Cutting Plane Methods, 3 Non Smooth Functions, 4 Bundle Methods, 5 BMRM, 6 Convergence Analysis, 7 Experiments, 8 Lower Bounds, 9 References.

  31. Non Smooth Functions: What if the Function is Nonsmooth? The piecewise linear function $J(w) := \max_i \langle u_i, w \rangle$ is convex but not differentiable at the kinks!

  32. Non Smooth Functions: Subgradients to the Rescue. A subgradient at $w'$ is any vector $s$ satisfying $J(w) \ge J(w') + \langle w - w', s \rangle$ for all $w$. The set of all subgradients is denoted $\partial J(w)$.

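As an illustration, one valid subgradient of the (nonsmooth) hinge risk from the Motivation section can be computed as below; at the kink, where the margin is exactly 1, we pick the zero contribution, which the definition permits. Names and shapes are our own (X is m-by-d, y has entries in {-1, +1}, and the bias b is dropped for brevity):

    import numpy as np

    def hinge_risk_subgradient(w, X, y):
        # R_emp(w) = (1/m) * sum_i max(0, 1 - y_i * <w, x_i>)
        margins = y * (X @ w)
        active = margins < 1.0   # examples where the loss is strictly positive
        # each active example contributes -y_i * x_i; inactive (and kink) examples contribute 0
        return -(y[active][:, None] * X[active]).sum(axis=0) / len(y)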


  36. Non Smooth Functions: Good News! Cutting plane methods work with subgradients: just choose an arbitrary one. Then what is the bad news?

  37. Non Smooth Functions: Bad News. [Figure: surface plot with both axes ranging over $[-1, 1]$.]

  38. Bundle Methods: Outline. 1 Motivation, 2 Cutting Plane Methods, 3 Non Smooth Functions, 4 Bundle Methods, 5 BMRM, 6 Convergence Analysis, 7 Experiments, 8 Lower Bounds, 9 References.
