 
Optimization for Machine Learning
Lecture 2: Support Vector Machine Training
S.V.N. (vishy) Vishwanathan
Purdue University
vishy@purdue.edu
July 11, 2012
Outline
1. Linear Support Vector Machines
2. Stochastic Optimization
3. Implicit Updates
4. Dual Problem
Binary Classification
[Figure: binary classification data with the two classes labeled $y_i = +1$ and $y_i = -1$.]
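Binary Classification
[Figure: the separating hyperplane $\{x \mid \langle w, x \rangle + b = 0\}$ and the two margin hyperplanes $\{x \mid \langle w, x \rangle + b = 1\}$ and $\{x \mid \langle w, x \rangle + b = -1\}$. The point $x_1$ lies on the first margin hyperplane (class $y_i = +1$) and $x_2$ on the second (class $y_i = -1$).]

$$\langle w, x_1 \rangle + b = +1$$
$$\langle w, x_2 \rangle + b = -1$$
$$\langle w, x_1 - x_2 \rangle = 2$$
$$\left\langle \frac{w}{\|w\|}, x_1 - x_2 \right\rangle = \frac{2}{\|w\|}$$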
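The margin computation above can be checked numerically. The sketch below is only an illustration: the helper names (`dot`, `margin_width`) and the example values of `w`, `b`, `x1`, `x2` are assumptions chosen so that `x1` and `x2` sit on the two margin hyperplanes.

```python
import math

def dot(u, v):
    """Inner product <u, v> of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def margin_width(w):
    """Distance 2 / ||w|| between the hyperplanes <w,x>+b=+1 and <w,x>+b=-1."""
    return 2.0 / math.sqrt(dot(w, w))

# Hypothetical example values (not from the slides).
w, b = [3.0, 4.0], -1.0
x1 = [0.24, 0.32]   # satisfies <w, x1> + b = +1
x2 = [0.0, 0.0]     # satisfies <w, x2> + b = -1

# Project x1 - x2 onto the unit normal w / ||w||.
norm_w = math.sqrt(dot(w, w))
diff = [a - c for a, c in zip(x1, x2)]
projected = dot([wi / norm_w for wi in w], diff)

print(margin_width(w), projected)  # both should agree up to rounding
```

Maximizing this width is the same as minimizing $\|w\|$, which is why the squared norm appears in the optimization problem.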
Linear Support Vector Machines
Optimization Problem
$$\min_{w, b, \xi} \; \frac{\lambda}{2} \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \xi_i$$
$$\text{s.t.} \quad y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i \ \text{for all } i, \qquad \xi_i \geq 0$$
Linear Support Vector Machines
Optimization Problem (slack variables eliminated)
$$\min_{w, b} \; \underbrace{\frac{\lambda}{2} \|w\|^2}_{\lambda \Omega(w)} + \underbrace{\frac{1}{m} \sum_{i=1}^{m} \max(0, 1 - y_i (\langle w, x_i \rangle + b))}_{R_{\mathrm{emp}}(w)}$$
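The constrained and unconstrained forms agree because, for fixed $w$ and $b$, the smallest feasible slack is $\xi_i = \max(0, 1 - y_i(\langle w, x_i \rangle + b))$. A minimal sketch of evaluating the unconstrained objective follows; the function names and the toy data are illustrative assumptions, not part of the lecture.

```python
def dot(u, v):
    """Inner product <u, v> of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def hinge(z):
    """Hinge loss max(0, 1 - z) for a signed margin z = y * (<w, x> + b)."""
    return max(0.0, 1.0 - z)

def svm_objective(w, b, X, y, lam):
    """(lam / 2) * ||w||^2 plus the average hinge loss over the m examples."""
    regularizer = 0.5 * lam * dot(w, w)
    empirical_risk = sum(hinge(yi * (dot(w, xi) + b))
                         for xi, yi in zip(X, y)) / len(X)
    return regularizer + empirical_risk

# Hypothetical toy data: only the middle point violates the margin.
X = [[2.0, 0.0], [0.0, 0.0], [-2.0, 0.0]]
y = [1.0, 1.0, -1.0]
value = svm_objective([1.0, 0.0], 0.0, X, y, lam=0.1)
print(value)  # regularizer 0.05 plus average hinge loss 1/3
```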
Stochastic Optimization Algorithms
Optimization Problem (with no bias)
$$\min_{w} \; \underbrace{\frac{\lambda}{2} \|w\|^2}_{\lambda \Omega(w)} + \underbrace{\frac{1}{m} \sum_{i=1}^{m} \max(0, 1 - y_i \langle w, x_i \rangle)}_{R_{\mathrm{emp}}(w)}$$
Properties of the problem: unconstrained, nonsmooth, convex.
Pegasos: Stochastic Gradient Descent
Require: $T$
$w_1 \leftarrow 0$
for $t = 1, \ldots, T$ do
    $\eta_t \leftarrow \frac{1}{\lambda t}$
    if $y_t \langle w_t, x_t \rangle < 1$ then
        $w'_t \leftarrow (1 - \eta_t \lambda) w_t + \eta_t y_t x_t$
    else
        $w'_t \leftarrow (1 - \eta_t \lambda) w_t$
    end if
    $w_{t+1} \leftarrow \min\left\{1, \frac{1/\sqrt{\lambda}}{\|w'_t\|}\right\} w'_t$
end for
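The Pegasos loop can be sketched in a few lines of Python. The slide does not say how the example $(x_t, y_t)$ is chosen at step $t$; this sketch assumes uniform sampling with replacement, starts from the zero vector, and uses illustrative names (`pegasos`, `seed`) and toy data that are not from the lecture.

```python
import math
import random

def dot(u, v):
    """Inner product <u, v> of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def pegasos(X, y, lam, T, seed=0):
    """Run T Pegasos steps on examples X with labels y in {-1, +1}."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    radius = 1.0 / math.sqrt(lam)            # radius of the feasible ball
    for t in range(1, T + 1):
        i = rng.randrange(len(X))            # assumption: uniform sampling
        eta = 1.0 / (lam * t)                # step size eta_t = 1 / (lam * t)
        violated = y[i] * dot(w, X[i]) < 1.0  # margin test uses the old w_t
        w = [(1.0 - eta * lam) * wj for wj in w]
        if violated:
            w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
        norm = math.sqrt(dot(w, w))
        if norm > radius:                    # project onto ||w|| <= 1/sqrt(lam)
            w = [wj * radius / norm for wj in w]
    return w

# Hypothetical linearly separable toy data.
X = [[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]]
y = [1.0, 1.0, -1.0, -1.0]
w = pegasos(X, y, lam=0.1, T=200)
print(w)
```

The margin test is evaluated before the shrinkage step so that both branches use $w_t$, matching the pseudocode.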
Understanding Pegasos
Objective Function Revisited
$$J(w) = \frac{\lambda}{2} \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \max(0, 1 - y_i \langle w, x_i \rangle)$$
Approximate the sum by a single example $(x_t, y_t)$:
$$J(w) \approx J_t(w) = \frac{\lambda}{2} \|w\|^2 + \max(0, 1 - y_t \langle w, x_t \rangle)$$
Subgradient
If $y_t \langle w, x_t \rangle < 1$ then
$$\partial_w J_t(w) = \lambda w - y_t x_t$$
else
$$\partial_w J_t(w) = \lambda w$$
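One way to sanity-check the subgradient formula is to test the defining inequality $J_t(v) \geq J_t(w) + \langle g, v - w \rangle$ at a few points. The sketch below does that; the function names and numeric values are illustrative assumptions.

```python
def dot(u, v):
    """Inner product <u, v> of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def J_t(w, x, y, lam):
    """Single-example objective (lam / 2) * ||w||^2 + max(0, 1 - y * <w, x>)."""
    return 0.5 * lam * dot(w, w) + max(0.0, 1.0 - y * dot(w, x))

def subgrad_J_t(w, x, y, lam):
    """One subgradient of J_t at w, following the two cases on the slide."""
    g = [lam * wj for wj in w]
    if y * dot(w, x) < 1.0:
        g = [gj - y * xj for gj, xj in zip(g, x)]
    return g

# Hypothetical values: y * <w, x> = -0.5 < 1, so the hinge term is active.
w, x, y, lam = [0.5, -0.5], [1.0, 2.0], 1.0, 0.1
g = subgrad_J_t(w, x, y, lam)

# Subgradient inequality at a few arbitrary test points v.
for v in ([0.0, 0.0], [1.0, 1.0], [-2.0, 3.0]):
    step = [vj - wj for vj, wj in zip(v, w)]
    assert J_t(v, x, y, lam) >= J_t(w, x, y, lam) + dot(g, step) - 1e-12
```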
Understanding Pegasos
Explicit Update
If $y_t \langle w_t, x_t \rangle < 1$ then
$$w'_t = w_t - \eta_t \partial_w J_t(w_t) = (1 - \lambda \eta_t) w_t + \eta_t y_t x_t$$
else
$$w'_t = w_t - \eta_t \partial_w J_t(w_t) = (1 - \lambda \eta_t) w_t$$
Projection
Project $w'_t$ onto the set
$$B = \left\{ w \ \text{s.t.} \ \|w\| \leq 1/\sqrt{\lambda} \right\}$$
Motivating Stochastic Gradient Descent
How are the Updates Derived?
Minimize the following objective function:
$$w_{t+1} = \operatorname*{argmin}_{w} \; \frac{1}{2} \|w - w_t\|^2 + \eta_t J_t(w)$$
Setting the subgradient of this objective to zero at $w = w_{t+1}$ gives
$$w_{t+1} = w_t - \eta_t \partial_w J_t(w_{t+1})$$
Approximating $\partial_w J_t(w_{t+1})$ by $\partial_w J_t(w_t)$ recovers the explicit update:
$$w_{t+1} \approx w_t - \eta_t \partial_w J_t(w_t)$$
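The gap between the exact condition and its approximation is easiest to see in a simplified case. The sketch below assumes the hinge term is inactive, so that $J_t(w) = \frac{\lambda}{2}\|w\|^2$ and $\partial_w J_t(w) = \lambda w$; this simplification and the function names are assumptions made for illustration only.

```python
def explicit_step(w, eta, lam):
    """w_t - eta * lam * w_t: the gradient is evaluated at the old point."""
    return [(1.0 - eta * lam) * wj for wj in w]

def implicit_step(w, eta, lam):
    """Solves w_new = w_t - eta * lam * w_new: gradient at the new point."""
    return [wj / (1.0 + eta * lam) for wj in w]

# Hypothetical values with a small step size.
w, eta, lam = [2.0, -1.0], 0.01, 1.0
w_exp = explicit_step(w, eta, lam)
w_imp = implicit_step(w, eta, lam)
print(w_exp, w_imp)
```

With $a = \eta_t \lambda$, the two scalings differ by $\frac{1}{1+a} - (1 - a) = \frac{a^2}{1+a}$, so the steps agree to first order in the step size. For large steps they differ markedly: at $a = 1$ the explicit step maps $w_t$ to zero while the implicit step halves it.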
Implicit Updates
What if we did not approximate $\partial_w J_t(w_{t+1})$?
Subgradient
$$\partial_w J_t(w) = \lambda w - \gamma y_t x_t$$
If $y_t \langle w, x_t \rangle < 1$ then $\gamma = 1$
If $y_t \langle w, x_t \rangle = 1$ then $\gamma \in [0, 1]$
If $y_t \langle w, x_t \rangle > 1$ then $\gamma = 0$
Substituting the subgradient, evaluated at $w = w_{t+1}$, into the update and solving for $w_{t+1}$:
$$w_{t+1} = w_t - \eta_t \partial_w J_t(w_{t+1})$$
$$w_{t+1} = w_t - \eta_t \lambda w_{t+1} + \gamma \eta_t y_t x_t$$
$$(1 + \eta_t \lambda) w_{t+1} = w_t + \gamma \eta_t y_t x_t$$
$$w_{t+1} = \frac{1}{1 + \eta_t \lambda} \left[ w_t + \gamma \eta_t y_t x_t \right]$$
Implicit Updates: Case 1
The Implicit Update Condition
$$w_{t+1} = \frac{1}{1 + \eta_t \lambda} \left[ w_t + \gamma \eta_t y_t x_t \right]$$
Case 1
Suppose $1 + \eta_t \lambda < y_t \langle w_t, x_t \rangle$. Set
$$w_{t+1} = \frac{1}{1 + \eta_t \lambda} w_t$$
Verify
$$y_t \langle w_{t+1}, x_t \rangle = \frac{y_t \langle w_t, x_t \rangle}{1 + \eta_t \lambda} > 1$$
which implies that $\gamma = 0$ and the implicit update condition is satisfied.
Implicit Updates: Case 2
The Implicit Update Condition
$$w_{t+1} = \frac{1}{1 + \eta_t \lambda} \left[ w_t + \gamma \eta_t y_t x_t \right]$$
Case 2
Suppose $y_t \langle w_t, x_t \rangle < 1 + \eta_t \lambda - \eta_t \langle x_t, x_t \rangle$. Set
$$w_{t+1} = \frac{1}{1 + \eta_t \lambda} \left[ w_t + \eta_t y_t x_t \right]$$
Verify, using $y_t^2 = 1$,
$$y_t \langle w_{t+1}, x_t \rangle = \frac{y_t \langle w_t, x_t \rangle + \eta_t \langle x_t, x_t \rangle}{1 + \eta_t \lambda} < 1$$
which implies that $\gamma = 1$ and the implicit update condition is satisfied.
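The two cases can be combined into a single update routine. In the sketch below, the first two branches follow Case 1 and Case 2 as stated. The final branch is an assumption that goes beyond the two cases shown here: when neither condition holds, it picks the $\gamma$ that makes $y_t \langle w_{t+1}, x_t \rangle = 1$, which solving the implicit update condition gives as $\gamma = (1 + \eta_t \lambda - y_t \langle w_t, x_t \rangle) / (\eta_t \langle x_t, x_t \rangle)$. All function and variable names are illustrative.

```python
def dot(u, v):
    """Inner product <u, v> of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def implicit_update(w, x, y, eta, lam):
    """Return (w_new, gamma) satisfying the implicit update condition."""
    margin = y * dot(w, x)
    xx = dot(x, x)
    if margin > 1.0 + eta * lam:
        gamma = 0.0                           # Case 1: new margin stays above 1
    elif margin < 1.0 + eta * lam - eta * xx:
        gamma = 1.0                           # Case 2: new margin stays below 1
    else:
        # Assumed remaining case: choose gamma so the new margin is exactly 1.
        # xx > 0 here, because xx == 0 forces margin == 0 and hence Case 2.
        gamma = (1.0 + eta * lam - margin) / (eta * xx)
    scale = 1.0 / (1.0 + eta * lam)
    w_new = [scale * (wj + gamma * eta * y * xj) for wj, xj in zip(w, x)]
    return w_new, gamma

# Hypothetical values exercising each branch (eta = 0.5, lam = 1).
x, y, eta, lam = [1.0, 0.0], 1.0, 0.5, 1.0
w1, g1 = implicit_update([4.0, 0.0], x, y, eta, lam)   # margin 4.0 > 1.5
w2, g2 = implicit_update([0.0, 0.0], x, y, eta, lam)   # margin 0.0 < 1.0
w3, g3 = implicit_update([1.2, 0.0], x, y, eta, lam)   # margin in [1.0, 1.5]
print(g1, g2, g3)
```

In the last branch the computed $\gamma$ lies in $[0, 1]$ exactly because the margin lies between the two thresholds, so it is consistent with the subgradient condition at $y_t \langle w_{t+1}, x_t \rangle = 1$.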