Introduction to Machine Learning - CS725
Instructor: Prof. Ganesh Ramakrishnan
Lecture 13 - KKT Conditions, Duality, SVR Dual


SLIDE 1

Introduction to Machine Learning - CS725
Instructor: Prof. Ganesh Ramakrishnan
Lecture 13 - KKT Conditions, Duality, SVR Dual

SLIDE 2

KKT conditions for SVR

L(w, b, ξ, ξ∗, α, α∗, µ, µ∗) = (1/2)∥w∥² + C ∑_{i=1}^m (ξ_i + ξ_i∗)
    + ∑_{i=1}^m α_i (y_i − w⊤φ(x_i) − b − ϵ − ξ_i)
    + ∑_{i=1}^m α_i∗ (b + w⊤φ(x_i) − y_i − ϵ − ξ_i∗)
    − ∑_{i=1}^m µ_i ξ_i − ∑_{i=1}^m µ_i∗ ξ_i∗

Differentiating the Lagrangian w.r.t. w: w − ∑_{i=1}^m (α_i − α_i∗) φ(x_i) = 0, i.e., w = ∑_{i=1}^m (α_i − α_i∗) φ(x_i)

Differentiating the Lagrangian w.r.t. ξ_i: C − α_i − µ_i = 0, i.e., α_i + µ_i = C

Differentiating the Lagrangian w.r.t. ξ_i∗: α_i∗ + µ_i∗ = C

Differentiating the Lagrangian w.r.t. b: ∑_i (α_i∗ − α_i) = 0

Complementary slackness: α_i (y_i − w⊤φ(x_i) − b − ϵ − ξ_i) = 0 AND µ_i ξ_i = 0 AND α_i∗ (b + w⊤φ(x_i) − y_i − ϵ − ξ_i∗) = 0 AND µ_i∗ ξ_i∗ = 0

SLIDE 3

For Support Vector Regression, since the original objective and the constraints are convex, any (w, b, α, α∗, µ, µ∗, ξ, ξ∗) that satisfies the necessary KKT conditions gives optimality (the conditions are also sufficient)

SLIDE 4

Some observations

α_i, α_i∗ ≥ 0, µ_i, µ_i∗ ≥ 0, α_i + µ_i = C and α_i∗ + µ_i∗ = C

Thus, α_i, µ_i, α_i∗, µ_i∗ ∈ [0, C], ∀i

If 0 < α_i < C, then 0 < µ_i < C (as α_i + µ_i = C). Since µ_i ξ_i = 0 and α_i (y_i − w⊤φ(x_i) − b − ϵ − ξ_i) = 0 are complementary slackness conditions, 0 < α_i < C ⇒ ξ_i = 0 and y_i − w⊤φ(x_i) − b = ϵ + ξ_i = ϵ

All such points lie on the boundary of the ϵ band. Using any point x_j on the margin (that is, with α_j ∈ (0, C)), we can recover b as: b = y_j − w⊤φ(x_j) − ϵ
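
This recovery of b is easy to sanity-check numerically. Below is a minimal sketch (our illustration, not the lecture's code), assuming the dual variables `alpha` and `alpha_star` have already been obtained from some QP solver, and `K` is the Gram matrix with entries K[i, j] = φ⊤(x_i)φ(x_j):

```python
import numpy as np

# Hypothetical helper (names are ours): recover b from any margin point x_j
# with 0 < alpha_j < C, using b = y_j - w^T phi(x_j) - eps from the slide.
def recover_b(alpha, alpha_star, K, y, eps, C, tol=1e-8):
    coef = alpha - alpha_star                          # alpha_i - alpha_i*
    # Assumes at least one margin point 0 < alpha_j < C exists.
    on_margin = np.where((alpha > tol) & (alpha < C - tol))[0]
    j = on_margin[0]                                   # any such point works
    # w^T phi(x_j) = sum_i (alpha_i - alpha_i*) K(x_i, x_j)
    return y[j] - K[j] @ coef - eps
```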

SLIDE 5

Support Vector Regression

Dual Objective

SLIDES 6-7

Weak Duality

L∗(α, α∗, µ, µ∗) = min_{w,b,ξ,ξ∗} L(w, b, ξ, ξ∗, α, α∗, µ, µ∗)

By the weak duality theorem, we have:

min_{w,b,ξ,ξ∗} (1/2)∥w∥² + C ∑_{i=1}^m (ξ_i + ξ_i∗) ≥ L∗(α, α∗, µ, µ∗)

s.t. y_i − w⊤φ(x_i) − b ≤ ϵ + ξ_i, and w⊤φ(x_i) + b − y_i ≤ ϵ + ξ_i∗, and ξ_i, ξ_i∗ ≥ 0, ∀i = 1, . . . , m

The above is true for any α_i, α_i∗ ≥ 0 and µ_i, µ_i∗ ≥ 0. Thus,

min_{w,b,ξ,ξ∗} (1/2)∥w∥² + C ∑_{i=1}^m (ξ_i + ξ_i∗) ≥ max_{α,α∗,µ,µ∗} L∗(α, α∗, µ, µ∗)

s.t. y_i − w⊤φ(x_i) − b ≤ ϵ + ξ_i, and w⊤φ(x_i) + b − y_i ≤ ϵ + ξ_i∗, and ξ_i, ξ_i∗ ≥ 0, ∀i = 1, . . . , m

SLIDE 8

Dual objective

L∗(α, α∗, µ, µ∗) = min_{w,b,ξ,ξ∗} L(w, b, ξ, ξ∗, α, α∗, µ, µ∗)

In the case of SVR, we have a strictly convex objective and linear constraints ⇒ the KKT conditions are necessary and sufficient, and strong duality holds:

min_{w,b,ξ,ξ∗} (1/2)∥w∥² + C ∑_{i=1}^m (ξ_i + ξ_i∗) = max_{α,α∗,µ,µ∗} L∗(α, α∗, µ, µ∗)

s.t. y_i − w⊤φ(x_i) − b ≤ ϵ + ξ_i, and w⊤φ(x_i) + b − y_i ≤ ϵ + ξ_i∗, and ξ_i, ξ_i∗ ≥ 0, ∀i = 1, . . . , m

This value is attained precisely at the (w, b, ξ, ξ∗, α, α∗, µ, µ∗) that satisfies the necessary (and sufficient) KKT optimality conditions. Given strong duality, we can equivalently solve max_{α,α∗,µ,µ∗} L∗(α, α∗, µ, µ∗)

SLIDES 9-12

L(w, b, ξ, ξ∗, α, α∗, µ, µ∗) = (1/2)∥w∥² + C ∑_{i=1}^m (ξ_i + ξ_i∗)
    + ∑_{i=1}^m ( α_i (y_i − w⊤φ(x_i) − b − ϵ − ξ_i) + α_i∗ (w⊤φ(x_i) + b − y_i − ϵ − ξ_i∗) )
    − ∑_{i=1}^m (µ_i ξ_i + µ_i∗ ξ_i∗)

We obtain w, b, ξ_i, ξ_i∗ in terms of α, α∗, µ and µ∗ by using the KKT conditions derived earlier:

w = ∑_{i=1}^m (α_i − α_i∗) φ(x_i), ∑_{i=1}^m (α_i − α_i∗) = 0, α_i + µ_i = C and α_i∗ + µ_i∗ = C

Thus, we get:

L(w, b, ξ, ξ∗, α, α∗, µ, µ∗) = (1/2) ∑_i ∑_j (α_i − α_i∗)(α_j − α_j∗) φ⊤(x_i)φ(x_j)
    + ∑_i ( ξ_i (C − α_i − µ_i) + ξ_i∗ (C − α_i∗ − µ_i∗) ) − b ∑_i (α_i − α_i∗)
    − ϵ ∑_i (α_i + α_i∗) + ∑_i y_i (α_i − α_i∗) − ∑_i ∑_j (α_i − α_i∗)(α_j − α_j∗) φ⊤(x_i)φ(x_j)

= −(1/2) ∑_i ∑_j (α_i − α_i∗)(α_j − α_j∗) φ⊤(x_i)φ(x_j) − ϵ ∑_i (α_i + α_i∗) + ∑_i y_i (α_i − α_i∗)
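
The surviving terms translate directly into code. A minimal sketch (our names, not the course's) of the simplified dual objective, where `K` is the Gram matrix and `eps` is the tube half-width ϵ:

```python
import numpy as np

# -1/2 sum_ij (a_i - a*_i)(a_j - a*_j) K_ij - eps sum_i (a_i + a*_i) + sum_i y_i (a_i - a*_i)
def dual_objective(alpha, alpha_star, K, y, eps):
    coef = alpha - alpha_star
    return -0.5 * coef @ K @ coef - eps * np.sum(alpha + alpha_star) + y @ coef
```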

SLIDES 13-15

Kernel function: K(x_i, x_j) = φ⊤(x_i)φ(x_j)

w = ∑_{i=1}^m (α_i − α_i∗) φ(x_i) ⇒ the final decision function is

f(x) = w⊤φ(x) + b = ∑_{i=1}^m (α_i − α_i∗) φ⊤(x_i)φ(x) + y_j − ∑_{i=1}^m (α_i − α_i∗) φ⊤(x_i)φ(x_j) − ϵ

where x_j is any point with α_j ∈ (0, C). Recall the similarity with the kernelized expression for Ridge Regression.

The dual optimization problem to compute the α's for SVR is:

max_{α, α∗} −(1/2) ∑_i ∑_j (α_i − α_i∗)(α_j − α_j∗) φ⊤(x_i)φ(x_j) − ϵ ∑_i (α_i + α_i∗) + ∑_i y_i (α_i − α_i∗)

s.t. ∑_i (α_i − α_i∗) = 0 and α_i, α_i∗ ∈ [0, C]

We notice that the only way these expressions involve φ is through φ⊤(x_i)φ(x_j) = K(x_i, x_j), for some i, j
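
This box-constrained quadratic program has to be solved numerically. A minimal sketch with a general-purpose solver follows; the names and the solver choice are ours, and a production implementation would use a dedicated QP/SMO routine instead:

```python
import numpy as np
from scipy.optimize import minimize

# Maximize the dual by minimizing its negative, subject to
# sum_i (alpha_i - alpha_i*) = 0 and alpha_i, alpha_i* in [0, C].
def solve_svr_dual(K, y, C=1.0, eps=0.1):
    m = len(y)
    def neg_dual(z):                       # z stacks [alpha; alpha_star]
        coef = z[:m] - z[m:]
        return 0.5 * coef @ K @ coef + eps * np.sum(z) - y @ coef
    constraint = {"type": "eq", "fun": lambda z: np.sum(z[:m] - z[m:])}
    res = minimize(neg_dual, np.zeros(2 * m), method="SLSQP",
                   bounds=[(0, C)] * (2 * m), constraints=[constraint])
    return res.x[:m], res.x[m:]            # alpha, alpha_star
```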

SLIDE 16

Recap from Quiz 1: Kernelizing Ridge Regression

Given w = (Φ⊤Φ + λI)⁻¹Φ⊤y and using the identity (P⁻¹ + B⊤R⁻¹B)⁻¹B⊤R⁻¹ = PB⊤(BPB⊤ + R)⁻¹

⇒ w = Φ⊤(ΦΦ⊤ + λI)⁻¹y = ∑_{i=1}^m α_i φ(x_i), where α_i = ( (ΦΦ⊤ + λI)⁻¹y )_i

⇒ the final decision function is f(x) = φ⊤(x) w = ∑_{i=1}^m α_i φ⊤(x)φ(x_i)

Again, we notice that the only way the decision function f(x) involves φ is through φ⊤(x_i)φ(x_j), for some i, j
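
This recap admits a direct numpy transcription. A sketch under the assumption that a kernel function `kernel(a, b)` is supplied by the caller (ΦΦ⊤ is just the Gram matrix K):

```python
import numpy as np

# alpha = (Phi Phi^T + lambda I)^{-1} y, with Phi Phi^T computed as the Gram matrix K.
def kernel_ridge_fit(X, y, kernel, lam=1.0):
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

# f(x) = sum_i alpha_i K(x, x_i)
def kernel_ridge_predict(X_train, alpha, kernel, x):
    return sum(a_i * kernel(x, x_i) for a_i, x_i in zip(alpha, X_train))
```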

SLIDE 17

The Kernel function

We call φ⊤(x_i)φ(x_j) a kernel function: K(x_i, x_j) = φ⊤(x_i)φ(x_j)

The Kernel Trick: for some important choices of φ, we can compute K(x_i, x_j) directly, and more efficiently than by explicitly computing/enumerating φ(x_i) and φ(x_j)

The expression for the decision function becomes f(x) = ∑_{i=1}^m α_i K(x, x_i)

The computation of the α_i is specific to the objective function being minimized: a closed form exists for Ridge Regression but NOT for SVR

SLIDE 18

Back to the Kernelized version of SVR

The kernelized dual problem:

max_{α, α∗} −(1/2) ∑_i ∑_j (α_i − α_i∗)(α_j − α_j∗) K(x_i, x_j) − ϵ ∑_i (α_i + α_i∗) + ∑_i y_i (α_i − α_i∗)

s.t. ∑_i (α_i − α_i∗) = 0 and α_i, α_i∗ ∈ [0, C]

The kernelized decision function: f(x) = ∑_i (α_i − α_i∗) K(x_i, x) + b

Using any point x_j with α_j ∈ (0, C): b = y_j − ∑_i (α_i − α_i∗) K(x_i, x_j) − ϵ

Computing K(x_1, x_2) often does not even require computing φ(x_1) or φ(x_2) explicitly
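
The kernelized decision function in code, again as a sketch with our own names (b comes from a margin point, as in the earlier recover_b sketch):

```python
# f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b
def svr_predict(X_train, alpha, alpha_star, b, kernel, x):
    coef = alpha - alpha_star
    return sum(c_i * kernel(x_i, x) for c_i, x_i in zip(coef, X_train)) + b
```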

SLIDE 19

Basis function expansion and the Kernel trick

We started off with the functional form¹ f(x) = ∑_{j=1}^p w_j φ_j(x). Each φ_j is called a basis function, and this representation is called a basis function expansion².

And we landed up with an equivalent f(x) = ∑_{i=1}^m α_i K(x, x_i) for Ridge Regression and Support Vector Regression.

Aside: for p ∈ [0, ∞), with what K, and what kinds of regularizers, loss functions, etc., will these dual representations hold?³

¹ The additional b term can either be absorbed in φ or kept separate, as discussed on several occasions.
² Section 2.8.3 of Tibshi.
³ Section 5.8.1 of Tibshi.

SLIDE 20

An Example Kernel

Let K(x_1, x_2) = (1 + x_1⊤x_2)²

What φ(x) will give φ⊤(x_1)φ(x_2) = K(x_1, x_2) = (1 + x_1⊤x_2)²?

Is such a φ guaranteed to exist? Is there a unique φ for a given K?

SLIDE 21

An Example Kernel

We can prove that such a φ exists. For example, for a 2-dimensional x_i:

φ(x_i) = [ 1, √2 x_i1, √2 x_i2, √2 x_i1 x_i2, x_i1², x_i2² ]⊤

This φ(x_i) lives in a 6-dimensional space. But to compute K(x_1, x_2), all we need is x_1⊤x_2, without having to enumerate φ(x_i)
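
A quick numeric sanity check (ours, not the slides') that this explicit φ reproduces the kernel:

```python
import numpy as np

def phi(x):  # the explicit 6-dimensional feature map from the slide
    return np.array([1.0, np.sqrt(2) * x[0], np.sqrt(2) * x[1],
                     np.sqrt(2) * x[0] * x[1], x[0] ** 2, x[1] ** 2])

x1, x2 = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(phi(x1) @ phi(x2), (1 + x1 @ x2) ** 2)  # both equal 4.0
```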

SLIDE 23

More on the Kernel Trick

Kernels operate in a high-dimensional, implicit feature space without necessarily computing the coordinates of the data in that space, but rather by simply computing the kernel function. This approach is called the "kernel trick"; we will subsequently talk about valid kernels. This is often computationally cheaper than explicitly computing the coordinates.

Claim: If K_ij = K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ are the entries of an n × n Gram matrix K, then K must be positive semi-definite.

Proof: b⊤Kb = ∑_{i,j} b_i K_ij b_j = ∑_{i,j} b_i b_j ⟨φ(x_i), φ(x_j)⟩ = ⟨ ∑_i b_i φ(x_i), ∑_j b_j φ(x_j) ⟩ = ∥ ∑_i b_i φ(x_i) ∥₂² ≥ 0
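
The claim is easy to test empirically, e.g. (our sketch) with the polynomial kernel from the earlier example:

```python
import numpy as np

# Eigenvalues of the Gram matrix of K(x1, x2) = (1 + x1.x2)^2 on random data
# should all be non-negative (up to floating-point round-off).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = (1 + X @ X.T) ** 2
assert np.all(np.linalg.eigvalsh(K) >= -1e-9)
```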

SLIDES 24-25

Existence of basis expansion φ for symmetric K?

Positive-definite kernel: For any dataset {x_1, x_2, . . . , x_m} and for any m, the Gram matrix K must be positive definite:

K = [ K(x_1, x_1)  · · ·  K(x_1, x_m) ]
    [     · · ·    K(x_i, x_j)  · · · ]
    [ K(x_m, x_1)  · · ·  K(x_m, x_m) ]

so that K = UΣU⊤ = (UΣ^{1/2})(UΣ^{1/2})⊤ = RR⊤, where the rows of U are linearly independent and Σ is a positive diagonal matrix.

Mercer kernel: Extending to an eigenfunction decomposition⁴: K(x_1, x_2) = ∑_{j=1}^∞ α_j φ_j(x_1) φ_j(x_2), where α_j ≥ 0 and ∑_{j=1}^∞ α_j² < ∞

Mercer kernels and positive-definite kernels turn out to be equivalent if the input space {x} is compact⁵

⁴ Eigen-decomposition w.r.t. linear operators. See https://en.wikipedia.org/wiki/Mercer%27s_theorem
⁵ That is, if every sequence has a convergent subsequence.

SLIDE 26

Mercer's theorem

Mercer kernel: K(x_1, x_2) is a Mercer kernel if ∫∫ K(x_1, x_2) g(x_1) g(x_2) dx_1 dx_2 ≥ 0 for all square-integrable functions g (g is square integrable iff ∫ (g(x))² dx is finite)

Mercer's theorem (an implication): for any Mercer kernel K(x_1, x_2), ∃ φ : ℝⁿ → H s.t. K(x_1, x_2) = φ⊤(x_1)φ(x_2), where H is a Hilbert space⁶, the infinite-dimensional version of Euclidean space.

Euclidean space: (ℝⁿ, ⟨·, ·⟩), where ⟨·, ·⟩ is the standard dot product in ℝⁿ

Advanced: Formally, a Hilbert space is an inner product space that is complete with respect to the norm induced by its inner product, i.e., every Cauchy sequence converges.

⁶ Do you know Hilbert? No? Then what are you doing in his space? :)

SLIDE 27

Prove that (x_1⊤x_2)^d is a Mercer kernel (d ∈ ℤ⁺, d ≥ 1)

We want to prove that ∫_{x_1} ∫_{x_2} (x_1⊤x_2)^d g(x_1) g(x_2) dx_1 dx_2 ≥ 0 for all square-integrable functions g. Here, x_1 and x_2 are vectors s.t. x_1, x_2 ∈ ℝᵗ.

Expanding (x_1⊤x_2)^d = ( ∑_{j=1}^t x_1j x_2j )^d by the multinomial theorem:

∫_{x_1} ∫_{x_2} (x_1⊤x_2)^d g(x_1) g(x_2) dx_1 dx_2
= ∫_{x_11} · · · ∫_{x_1t} ∫_{x_21} · · · ∫_{x_2t} [ ∑_{n_1 · · · n_t} (d! / (n_1! · · · n_t!)) ∏_{j=1}^t (x_1j x_2j)^{n_j} ] g(x_1) g(x_2) dx_11 · · · dx_1t dx_21 · · · dx_2t

s.t. ∑_{i=1}^t n_i = d (taking a leap)

SLIDES 28-29

Prove that (x_1⊤x_2)^d is a Mercer kernel (d ∈ ℤ⁺, d ≥ 1)

= ∑_{n_1 · · · n_t} (d! / (n_1! · · · n_t!)) ∫_{x_1} ∫_{x_2} ∏_{j=1}^t (x_1j x_2j)^{n_j} g(x_1) g(x_2) dx_1 dx_2

= ∑_{n_1 · · · n_t} (d! / (n_1! · · · n_t!)) ∫_{x_1} ∫_{x_2} (x_11^{n_1} x_12^{n_2} · · · x_1t^{n_t}) g(x_1) (x_21^{n_1} x_22^{n_2} · · · x_2t^{n_t}) g(x_2) dx_1 dx_2

= ∑_{n_1 · · · n_t} (d! / (n_1! · · · n_t!)) ( ∫_{x_1} (x_11^{n_1} · · · x_1t^{n_t}) g(x_1) dx_1 ) ( ∫_{x_2} (x_21^{n_1} · · · x_2t^{n_t}) g(x_2) dx_2 )

(the integral of a decomposable product is the product of the integrals), s.t. ∑_{i=1}^t n_i = d

SLIDE 30

Prove that (x_1⊤x_2)^d is a Mercer kernel (d ∈ ℤ⁺, d ≥ 1)

Realize that both integrals are the same, just with different variable names. Thus, the expression becomes:

∑_{n_1 · · · n_t} (d! / (n_1! · · · n_t!)) ( ∫_{x_1} (x_11^{n_1} · · · x_1t^{n_t}) g(x_1) dx_1 )² ≥ 0

(the square is non-negative for reals). Thus, we have shown that (x_1⊤x_2)^d is a Mercer kernel.

SLIDES 31-32

What about ∑_{d=1}^r α_d (x_1⊤x_2)^d s.t. α_d ≥ 0?

K(x_1, x_2) = ∑_{d=1}^r α_d (x_1⊤x_2)^d

Is ∫_{x_1} ∫_{x_2} ( ∑_{d=1}^r α_d (x_1⊤x_2)^d ) g(x_1) g(x_2) dx_1 dx_2 ≥ 0?

We have

∫_{x_1} ∫_{x_2} ( ∑_{d=1}^r α_d (x_1⊤x_2)^d ) g(x_1) g(x_2) dx_1 dx_2 = ∑_{d=1}^r α_d ∫_{x_1} ∫_{x_2} (x_1⊤x_2)^d g(x_1) g(x_2) dx_1 dx_2

SLIDE 33

What about ∑_{d=1}^r α_d (x_1⊤x_2)^d s.t. α_d ≥ 0?

We have already proved that ∫_{x_1} ∫_{x_2} (x_1⊤x_2)^d g(x_1) g(x_2) dx_1 dx_2 ≥ 0. Also, α_d ≥ 0, ∀d. Thus,

∑_{d=1}^r α_d ∫_{x_1} ∫_{x_2} (x_1⊤x_2)^d g(x_1) g(x_2) dx_1 dx_2 ≥ 0

whereby K(x_1, x_2) = ∑_{d=1}^r α_d (x_1⊤x_2)^d is a Mercer kernel.

Examples of Mercer kernels: the Linear kernel, the Polynomial kernel, and the Radial Basis Function (RBF) kernel
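
As before, the conclusion can be spot-checked numerically (our sketch): a non-negative combination of polynomial kernels yields a positive semi-definite Gram matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
G = X @ X.T                                    # linear-kernel Gram matrix
# Entrywise powers G ** d are the Gram matrices of (x_i.x_j)^d;
# the alpha_d >= 0 combination below should stay PSD.
K = 0.5 * G + 2.0 * G ** 2 + 1.5 * G ** 3
assert np.all(np.linalg.eigvalsh(K) >= -1e-9)
```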

SLIDE 34

Kernels in SVR

Recall the dual:

max_{α, α∗} −(1/2) ∑_i ∑_j (α_i − α_i∗)(α_j − α_j∗) K(x_i, x_j) − ϵ ∑_i (α_i + α_i∗) + ∑_i y_i (α_i − α_i∗)

and the decision function: f(x) = ∑_i (α_i − α_i∗) K(x_i, x) + b

Both are expressed in terms of the kernel K(x_i, x_j) only. One can now employ any Mercer kernel in SVR or Ridge Regression to implicitly perform linear regression in higher-dimensional spaces.
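
A minimal end-to-end illustration (ours, reusing the hypothetical solve_svr_dual, recover_b, and svr_predict sketches from above; the RBF bandwidth gamma and the toy data are arbitrary choices):

```python
import numpy as np

def rbf(a, b, gamma=0.5):                      # an example Mercer kernel
    return np.exp(-gamma * np.sum((a - b) ** 2))

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(30, 1))           # toy 1-D regression data
y = np.sin(X).ravel()

K = np.array([[rbf(a, b) for b in X] for a in X])
alpha, alpha_star = solve_svr_dual(K, y, C=10.0, eps=0.05)
b = recover_b(alpha, alpha_star, K, y, eps=0.05, C=10.0)
y_hat = np.array([svr_predict(X, alpha, alpha_star, b, rbf, x) for x in X])
```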