
Introduction to Machine Learning - CS725
Instructor: Prof. Ganesh Ramakrishnan
Lecture 13 - KKT Conditions, Duality, SVR Dual


  1. Introduction to Machine Learning - CS725
     Instructor: Prof. Ganesh Ramakrishnan
     Lecture 13 - KKT Conditions, Duality, SVR Dual

  2. KKT conditions for SVR
     L(w, b, ξ, ξ*, α, α*, μ, μ*) = (1/2)∥w∥² + C ∑_{i=1}^m (ξ_i + ξ*_i)
         + ∑_{i=1}^m α_i (y_i − w⊤φ(x_i) − b − ϵ − ξ_i)
         + ∑_{i=1}^m α*_i (b + w⊤φ(x_i) − y_i − ϵ − ξ*_i)
         − ∑_{i=1}^m μ_i ξ_i − ∑_{i=1}^m μ*_i ξ*_i
     Differentiating the Lagrangian w.r.t. w:
         w − ∑_{i=1}^m (α_i − α*_i) φ(x_i) = 0, i.e., w = ∑_{i=1}^m (α_i − α*_i) φ(x_i)
     Differentiating the Lagrangian w.r.t. ξ_i:
         C − α_i − μ_i = 0, i.e., α_i + μ_i = C
     Differentiating the Lagrangian w.r.t. ξ*_i:
         α*_i + μ*_i = C
     Differentiating the Lagrangian w.r.t. b:
         ∑_i (α*_i − α_i) = 0
     Complementary slackness:
         α_i (y_i − w⊤φ(x_i) − b − ϵ − ξ_i) = 0   AND   μ_i ξ_i = 0   AND
         α*_i (b + w⊤φ(x_i) − y_i − ϵ − ξ*_i) = 0   AND   μ*_i ξ*_i = 0
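The stationarity condition w = ∑_i (α_i − α*_i) φ(x_i) can be verified numerically. Below is a minimal sketch, assuming numpy and scikit-learn are available and using a linear kernel so that φ(x) = x; the synthetic data and the values of C and ϵ are chosen only for illustration.

    # A minimal numerical check of the stationarity condition w = sum_i (alpha_i - alpha*_i) phi(x_i),
    # using scikit-learn's SVR with a linear kernel so that phi(x) = x. The data, C and epsilon
    # below are illustrative only; dual_coef_ stores (alpha_i - alpha*_i) for the support vectors.
    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = rng.normal(size=(80, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=80)

    svr = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y)

    coeffs = svr.dual_coef_.ravel()              # (alpha_i - alpha*_i), one per support vector
    w_from_dual = coeffs @ svr.support_vectors_  # sum_i (alpha_i - alpha*_i) x_i

    print(np.allclose(w_from_dual, svr.coef_.ravel()))  # True: matches the primal weight vector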

  3. For Support Vector Regression, since the original objective and the constraints are convex,
     any (w, b, α, α*, μ, μ*, ξ, ξ*) that satisfies the necessary KKT conditions is optimal
     (the conditions are also sufficient).

  4. Some observations
     α_i, α*_i ≥ 0, μ_i, μ*_i ≥ 0, α_i + μ_i = C and α*_i + μ*_i = C
     Thus, α_i, μ_i, α*_i, μ*_i ∈ [0, C], ∀i
     If 0 < α_i < C, then 0 < μ_i < C (as α_i + μ_i = C)
     μ_i ξ_i = 0 and α_i (y_i − w⊤φ(x_i) − b − ϵ − ξ_i) = 0 are complementary slackness conditions
     So 0 < α_i < C ⇒ ξ_i = 0 and y_i − w⊤φ(x_i) − b = ϵ + ξ_i = ϵ
     All such points lie on the boundary of the ϵ band
     Using any point x_j on the margin (that is, with α_j ∈ (0, C)), we can recover b as:
         b = y_j − w⊤φ(x_j) − ϵ
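The recovery of b from a margin point can be illustrated in code. Below is a minimal sketch under the same assumptions as before (numpy, scikit-learn, linear kernel, made-up data); the heuristic used to pick a margin support vector is for illustration only.

    # A sketch of recovering b from a margin point, using scikit-learn's SVR with a linear kernel.
    # dual_coef_ stores c_i = alpha_i - alpha*_i; since alpha_i * alpha*_i = 0 at optimality,
    # 0 < |c_i| < C means the nonzero multiplier of the pair lies strictly inside (0, C).
    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 2))
    y = X @ np.array([2.0, -1.0]) + 0.3 + 0.05 * rng.normal(size=100)

    C, eps = 1.0, 0.1
    svr = SVR(kernel="linear", C=C, epsilon=eps).fit(X, y)

    c = svr.dual_coef_.ravel()
    sv = svr.support_vectors_
    w = c @ sv

    # Heuristic: pick the support vector whose |c_j| is closest to C/2, i.e. well inside (0, C)
    # (this assumes at least one such margin support vector exists).
    j = np.argmin(np.abs(np.abs(c) - C / 2))
    y_j = y[svr.support_[j]]
    # c_j > 0 (alpha_j in (0, C)): b = y_j - w.x_j - eps;  c_j < 0 (alpha*_j in (0, C)): b = y_j - w.x_j + eps
    b = y_j - w @ sv[j] - np.sign(c[j]) * eps

    print(b, svr.intercept_[0])   # the two values should agree up to the solver's tolerance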

  5. Support Vector Regression Dual Objective

  6. Weak Duality
     L*(α, α*, μ, μ*) = min_{w, b, ξ, ξ*} L(w, b, ξ, ξ*, α, α*, μ, μ*)
     By the weak duality theorem, we have:
         min_{w, b, ξ, ξ*}  (1/2)∥w∥² + C ∑_{i=1}^m (ξ_i + ξ*_i)  ≥  L*(α, α*, μ, μ*)
         s.t. y_i − w⊤φ(x_i) − b ≤ ϵ + ξ_i, and w⊤φ(x_i) + b − y_i ≤ ϵ + ξ*_i,
              and ξ_i, ξ*_i ≥ 0, ∀i = 1, …, m
     The above is true for any α_i, α*_i ≥ 0 and μ_i, μ*_i ≥ 0. Thus,

  7. Weak Duality
     L*(α, α*, μ, μ*) = min_{w, b, ξ, ξ*} L(w, b, ξ, ξ*, α, α*, μ, μ*)
     By the weak duality theorem, we have:
         min_{w, b, ξ, ξ*}  (1/2)∥w∥² + C ∑_{i=1}^m (ξ_i + ξ*_i)  ≥  L*(α, α*, μ, μ*)
         s.t. y_i − w⊤φ(x_i) − b ≤ ϵ + ξ_i, and w⊤φ(x_i) + b − y_i ≤ ϵ + ξ*_i,
              and ξ_i, ξ*_i ≥ 0, ∀i = 1, …, m
     The above is true for any α_i, α*_i ≥ 0 and μ_i, μ*_i ≥ 0. Thus,
         min_{w, b, ξ, ξ*}  (1/2)∥w∥² + C ∑_{i=1}^m (ξ_i + ξ*_i)  ≥  max_{α, α*, μ, μ*} L*(α, α*, μ, μ*)
         s.t. y_i − w⊤φ(x_i) − b ≤ ϵ + ξ_i, and w⊤φ(x_i) + b − y_i ≤ ϵ + ξ*_i,
              and ξ_i, ξ*_i ≥ 0, ∀i = 1, …, m
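The inequality above can be illustrated numerically: for any feasible choice of multipliers, the dual value stays below the optimal primal value. Below is a small sketch, assuming numpy and scikit-learn, with made-up data and hyperparameters.

    # A small numerical illustration of weak duality: for ANY feasible multipliers
    # (alpha_i, alpha*_i in [0, C] with sum_i (alpha_i - alpha*_i) = 0, taking mu_i = C - alpha_i,
    # mu*_i = C - alpha*_i), the dual value L* lower-bounds the primal objective of any feasible
    # point, here the fitted SVR's. Data and hyperparameters are illustrative only.
    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(5)
    m = 40
    X = rng.normal(size=(m, 2))
    y = X @ np.array([1.0, 2.0]) + 0.05 * rng.normal(size=m)

    C, eps = 1.0, 0.1
    svr = SVR(kernel="linear", C=C, epsilon=eps).fit(X, y)
    w = svr.coef_.ravel()
    slack = np.maximum(np.abs(y - svr.predict(X)) - eps, 0.0)   # xi_i + xi*_i at the fitted point
    primal = 0.5 * w @ w + C * slack.sum()

    K = X @ X.T
    for _ in range(3):
        d = rng.uniform(-C, C, size=m)            # d_i = alpha_i - alpha*_i
        d -= d.mean()                             # enforce sum_i d_i = 0
        d *= min(1.0, C / np.abs(d).max())        # keep alpha_i = max(d_i, 0), alpha*_i = max(-d_i, 0) <= C
        dual = -0.5 * d @ K @ d - eps * np.abs(d).sum() + y @ d
        print(dual <= primal)                     # always True (weak duality)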

  8. Dual objective
     L*(α, α*, μ, μ*) = min_{w, b, ξ, ξ*} L(w, b, ξ, ξ*, α, α*, μ, μ*)
     In the case of SVR, we have a strictly convex objective and linear constraints
     ⇒ the KKT conditions are necessary and sufficient, and strong duality holds:
         min_{w, b, ξ, ξ*}  (1/2)∥w∥² + C ∑_{i=1}^m (ξ_i + ξ*_i)  =  max_{α, α*, μ, μ*} L*(α, α*, μ, μ*)
         s.t. y_i − w⊤φ(x_i) − b ≤ ϵ + ξ_i, and w⊤φ(x_i) + b − y_i ≤ ϵ + ξ*_i,
              and ξ_i, ξ*_i ≥ 0, ∀i = 1, …, m
     This value is precisely obtained at the (w, b, ξ, ξ*, α, α*, μ, μ*) that satisfies the
     necessary (and sufficient) KKT optimality conditions.
     Given strong duality, we can equivalently solve max_{α, α*, μ, μ*} L*(α, α*, μ, μ*)
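Strong duality can be checked numerically at the fitted solution: the primal and dual objective values coincide up to solver tolerance. Below is a minimal sketch, assuming numpy and scikit-learn with a linear kernel and made-up data.

    # A minimal check of strong duality (zero duality gap) at the fitted SVR solution,
    # using a linear kernel; alpha_i + alpha*_i = |alpha_i - alpha*_i| because at most one
    # of the pair is nonzero at optimality. Data and hyperparameters are illustrative only.
    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(2)
    X = rng.normal(size=(60, 2))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)

    C, eps = 1.0, 0.05
    svr = SVR(kernel="linear", C=C, epsilon=eps, tol=1e-6).fit(X, y)

    c = svr.dual_coef_.ravel()                    # c_i = alpha_i - alpha*_i (support vectors only)
    sv = svr.support_vectors_
    y_sv = y[svr.support_]
    w = c @ sv

    # Primal objective: 0.5 ||w||^2 + C * sum_i max(0, |y_i - f(x_i)| - eps)
    slack = np.maximum(np.abs(y - svr.predict(X)) - eps, 0.0)
    primal = 0.5 * w @ w + C * slack.sum()

    # Dual objective: -0.5 c^T K c - eps * sum_i (alpha_i + alpha*_i) + sum_i y_i c_i
    K = sv @ sv.T
    dual = -0.5 * c @ K @ c - eps * np.abs(c).sum() + y_sv @ c

    print(primal, dual)                           # should agree up to the solver's tolerance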

  9. L(w, b, ξ, ξ*, α, α*, μ, μ*) = (1/2)∥w∥² + C ∑_{i=1}^m (ξ_i + ξ*_i)
         + ∑_{i=1}^m ( α_i (y_i − w⊤φ(x_i) − b − ϵ − ξ_i) + α*_i (w⊤φ(x_i) + b − y_i − ϵ − ξ*_i) )
         − ∑_{i=1}^m (μ_i ξ_i + μ*_i ξ*_i)
     We obtain w, b, ξ_i, ξ*_i in terms of α, α*, μ and μ* by using the KKT conditions
     derived earlier: w = ∑_{i=1}^m (α_i − α*_i) φ(x_i), ∑_{i=1}^m (α_i − α*_i) = 0,
     α_i + μ_i = C and α*_i + μ*_i = C
     Thus, we get:

  10. L(w, b, ξ, ξ*, α, α*, μ, μ*) = (1/2)∥w∥² + C ∑_{i=1}^m (ξ_i + ξ*_i)
          + ∑_{i=1}^m ( α_i (y_i − w⊤φ(x_i) − b − ϵ − ξ_i) + α*_i (w⊤φ(x_i) + b − y_i − ϵ − ξ*_i) )
          − ∑_{i=1}^m (μ_i ξ_i + μ*_i ξ*_i)
      We obtain w, b, ξ_i, ξ*_i in terms of α, α*, μ and μ* by using the KKT conditions
      derived earlier: w = ∑_{i=1}^m (α_i − α*_i) φ(x_i), ∑_{i=1}^m (α_i − α*_i) = 0,
      α_i + μ_i = C and α*_i + μ*_i = C
      Thus, we get:
          L(w, b, ξ, ξ*, α, α*, μ, μ*)
            = (1/2) ∑_i ∑_j (α_i − α*_i)(α_j − α*_j) φ⊤(x_i) φ(x_j)
              + ∑_i ( ξ_i (C − α_i − μ_i) + ξ*_i (C − α*_i − μ*_i) )
              − b ∑_i (α_i − α*_i) − ϵ ∑_i (α_i + α*_i) + ∑_i y_i (α_i − α*_i)
              − ∑_i ∑_j (α_i − α*_i)(α_j − α*_j) φ⊤(x_i) φ(x_j)

  11. L(w, b, ξ, ξ*, α, α*, μ, μ*) = (1/2)∥w∥² + C ∑_{i=1}^m (ξ_i + ξ*_i)
          + ∑_{i=1}^m ( α_i (y_i − w⊤φ(x_i) − b − ϵ − ξ_i) + α*_i (w⊤φ(x_i) + b − y_i − ϵ − ξ*_i) )
          − ∑_{i=1}^m (μ_i ξ_i + μ*_i ξ*_i)
      We obtain w, b, ξ_i, ξ*_i in terms of α, α*, μ and μ* by using the KKT conditions
      derived earlier: w = ∑_{i=1}^m (α_i − α*_i) φ(x_i), ∑_{i=1}^m (α_i − α*_i) = 0,
      α_i + μ_i = C and α*_i + μ*_i = C
      Thus, we get:
          L(w, b, ξ, ξ*, α, α*, μ, μ*)
            = (1/2) ∑_i ∑_j (α_i − α*_i)(α_j − α*_j) φ⊤(x_i) φ(x_j)
              + ∑_i ( ξ_i (C − α_i − μ_i) + ξ*_i (C − α*_i − μ*_i) )
              − b ∑_i (α_i − α*_i) − ϵ ∑_i (α_i + α*_i) + ∑_i y_i (α_i − α*_i)
              − ∑_i ∑_j (α_i − α*_i)(α_j − α*_j) φ⊤(x_i) φ(x_j)
            = −(1/2) ∑_i ∑_j (α_i − α*_i)(α_j − α*_j) φ⊤(x_i) φ(x_j)
              − ϵ ∑_i (α_i + α*_i) + ∑_i y_i (α_i − α*_i)
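This final expression is the dual objective, to be maximized subject to 0 ≤ α_i, α*_i ≤ C and ∑_i (α_i − α*_i) = 0 (which follow from α_i + μ_i = C, α*_i + μ*_i = C, μ_i, μ*_i ≥ 0 and the condition from differentiating w.r.t. b). Below is a sketch of solving it directly as a quadratic program, assuming cvxpy is installed; the linear-kernel formulation and the data are illustrative only.

    # A sketch of maximizing the SVR dual objective directly as a QP with cvxpy (assumed installed),
    # subject to 0 <= alpha_i, alpha*_i <= C and sum_i (alpha_i - alpha*_i) = 0. A linear kernel is
    # used so the quadratic term can be written as ||X^T (alpha - alpha*)||^2; data are illustrative.
    import cvxpy as cp
    import numpy as np

    rng = np.random.default_rng(3)
    m = 50
    X = rng.normal(size=(m, 2))
    y = X @ np.array([1.0, -0.5]) + 0.2 + 0.05 * rng.normal(size=m)

    C, eps = 1.0, 0.1
    a = cp.Variable(m, nonneg=True)       # alpha
    a_star = cp.Variable(m, nonneg=True)  # alpha*
    d = a - a_star

    objective = cp.Maximize(-0.5 * cp.sum_squares(X.T @ d) - eps * cp.sum(a + a_star) + y @ d)
    constraints = [a <= C, a_star <= C, cp.sum(d) == 0]
    cp.Problem(objective, constraints).solve()

    w = X.T @ d.value                     # w = sum_i (alpha_i - alpha*_i) x_i
    # b can then be recovered from any x_j with alpha_j in (0, C): b = y_j - w^T x_j - eps
    print(w)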

  12. Kernel function: K(x_i, x_j) = φ⊤(x_i) φ(x_j)
      w = ∑_{i=1}^m (α_i − α*_i) φ(x_i) ⇒ the final decision function is
          f(x) = w⊤φ(x) + b
               = ∑_{i=1}^m (α_i − α*_i) φ⊤(x_i) φ(x) + y_j − ∑_{i=1}^m (α_i − α*_i) φ⊤(x_i) φ(x_j) − ϵ
      where x_j is any point with α_j ∈ (0, C). Recall the similarity with

  13. Kernel function: K(x_i, x_j) = φ⊤(x_i) φ(x_j)
      w = ∑_{i=1}^m (α_i − α*_i) φ(x_i) ⇒ the final decision function is
          f(x) = w⊤φ(x) + b
               = ∑_{i=1}^m (α_i − α*_i) φ⊤(x_i) φ(x) + y_j − ∑_{i=1}^m (α_i − α*_i) φ⊤(x_i) φ(x_j) − ϵ
      where x_j is any point with α_j ∈ (0, C). Recall the similarity with the kernelized
      expression for Ridge Regression.
      The dual optimization problem to compute the α's for SVR is:
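The kernelized decision function can be evaluated directly from the dual coefficients. Below is a minimal sketch, assuming numpy and scikit-learn with an explicit RBF kernel; γ and the data are made up for illustration, and the result is compared against scikit-learn's own predictions.

    # A minimal sketch of evaluating the kernelized decision function
    # f(x) = sum_i (alpha_i - alpha*_i) K(x_i, x) + b with an explicit RBF kernel
    # K(u, v) = exp(-gamma ||u - v||^2); gamma and the data are illustrative only.
    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(4)
    X = rng.uniform(-3, 3, size=(120, 1))
    y = np.sin(X).ravel() + 0.1 * rng.normal(size=120)

    gamma = 0.5
    svr = SVR(kernel="rbf", gamma=gamma, C=1.0, epsilon=0.1).fit(X, y)

    def rbf(U, V):
        # pairwise K(u, v) = exp(-gamma * ||u - v||^2)
        d2 = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
    c = svr.dual_coef_.ravel()                       # c_i = alpha_i - alpha*_i
    f = c @ rbf(svr.support_vectors_, X_test) + svr.intercept_[0]

    print(np.allclose(f, svr.predict(X_test)))       # True: matches scikit-learn's own prediction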
