10701 Recitation 5: Duality and SVM
Ahmed Hefny

Outline
• Lagrangian and Duality
  – The Lagrangian
  – Duality
  – Examples
• Support Vector Machines
  – Primal Formulation
  – Dual Formulation
  – Soft Margin and Hinge Loss

Lagrangian
• Consider the problem $\min_x f(x)$ s.t. $g_i(x) = 0$
• Add a Lagrange multiplier for each constraint: $L(x, u) = f(x) + \sum_i u_i g_i(x)$

Lagrangian
• Lagrangian: $L(x, u) = f(x) + \sum_i u_i g_i(x)$
• Setting the gradient to 0 gives
  – $g_i(x) = 0$ [Feasible point]
  – $\nabla f(x) + \sum_i u_i \nabla g_i(x) = 0$ [Cannot decrease $f$ except by violating constraints]
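A small worked example may make the two conditions concrete. The toy problem below (minimize $x_1^2 + x_2^2$ subject to $x_1 + x_2 - 1 = 0$) is an assumption chosen for illustration and does not appear in the slides.

```latex
\begin{aligned}
L(x, u) &= x_1^2 + x_2^2 + u\,(x_1 + x_2 - 1) \\
\partial L / \partial u = 0 &\;\Rightarrow\; x_1 + x_2 = 1 \quad \text{[feasible point]} \\
\nabla_x L = 0 &\;\Rightarrow\; 2x_1 + u = 0,\quad 2x_2 + u = 0 \\
&\;\Rightarrow\; x_1 = x_2 = \tfrac{1}{2},\quad u = -1
\end{aligned}
```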

Lagrangian
• Consider the problem $\min_x f(x)$ s.t. $g_i(x) = 0$, $h_j(x) \le 0$
• Add a Lagrange multiplier for each constraint: $L(x, u, \lambda) = f(x) + \sum_i u_i g_i(x) + \sum_j \lambda_j h_j(x)$

Duality

Duality
• Primal problem: $\min_x f(x)$ s.t. $g_i(x) = 0$, $h_j(x) \le 0$
• Equivalent to $\min_x \max_{\lambda \ge 0,\, u} \left[ f(x) + \sum_i u_i g_i(x) + \sum_j \lambda_j h_j(x) \right]$
• Equivalent to $\min_x \begin{cases} f(x) & x \text{ is feasible} \\ \infty & \text{o.w.} \end{cases}$

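Duality
• Dual problem: $\max_{\lambda \ge 0,\, u} \min_x \left[ f(x) + \sum_i u_i g_i(x) + \sum_j \lambda_j h_j(x) \right]$
  – The bracketed expression is the Lagrangian
  – The inner minimum is the dual function $L(\lambda, u)$
• Dual function:
  – Concave, regardless of the convexity of the primal
  – Lower bound on primal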
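Both properties of the dual function can be checked numerically on a toy problem. The sketch below assumes the problem "minimize x^2 subject to 1 - x <= 0" (not from the slides), whose primal optimum is 1 at x = 1; the names `dual_function` and `is_midpoint_concave` are hypothetical helpers.

```python
def dual_function(lam):
    """q(lam) = min_x [x**2 + lam * (1 - x)]; the minimizer is x = lam / 2."""
    x = lam / 2.0
    return x ** 2 + lam * (1.0 - x)


def is_midpoint_concave(q, points, tol=1e-12):
    """Check q((a + b) / 2) >= (q(a) + q(b)) / 2 for all pairs of points."""
    return all(
        q((a + b) / 2.0) >= (q(a) + q(b)) / 2.0 - tol
        for a in points
        for b in points
    )


primal_optimum = 1.0                      # f(x) = x**2 at the feasible point x = 1
lams = [0.25 * k for k in range(17)]      # grid over lam in [0, 4]
values = [dual_function(lam) for lam in lams]

is_lower_bound = all(v <= primal_optimum + 1e-12 for v in values)
is_concave = is_midpoint_concave(dual_function, lams)
best_lam = max(lams, key=dual_function)   # maximizer of the dual on the grid

print(is_lower_bound, is_concave, best_lam, dual_function(best_lam))
```

On this problem the best dual value equals the primal optimum, so the duality gap is zero.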

Duality
• Primal problem: $\min_x \max_{\lambda \ge 0} L(x, \lambda)$
• For each row (choice of $x$), pick the largest element, then select the minimum.
[Figure: grid of values $L(x, \lambda)$, rows indexed by $x$, columns indexed by $\lambda$]

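Duality
• Dual problem: $\max_{\lambda \ge 0} \min_x L(x, \lambda)$
• For each column (choice of $\lambda$), pick the smallest element, then select the maximum.
[Figure: grid of values $L(x, \lambda)$, rows indexed by $x$, columns indexed by $\lambda$]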

Duality
• Claim: $\min_x \max_{\lambda \ge 0} L(x, \lambda) \ge \max_{\lambda \ge 0} \min_x L(x, \lambda)$
• Let $(x^*, \lambda^*)$ be the primal solution. For any $\lambda \ge 0$: $\min_x L(x, \lambda) \le L(x^*, \lambda) \le L(x^*, \lambda^*)$
• The difference between the primal minimum and the dual maximum is called the duality gap
• Duality gap = 0 → strong duality
[Figure: grid of values $L(x, \lambda)$ with the point $(x^*, \lambda^*)$ marked]
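The row and column picture can be reproduced on a small table. The sketch below treats a made-up matrix as the values of the Lagrangian (rows are choices of the primal variable, columns are choices of the multiplier); the matrices and the helper `minimax_values` are illustrative assumptions.

```python
def minimax_values(table):
    """Return (min over rows of row max, max over columns of column min)."""
    primal = min(max(row) for row in table)
    dual = max(min(col) for col in zip(*table))
    return primal, dual


# Rows: choices of x. Columns: choices of lambda.
with_gap = [
    [3, 1, 4],
    [2, 5, 0],
    [6, 2, 3],
]
primal, dual = minimax_values(with_gap)
gap = primal - dual

# A table with a saddle point: the entry 2 is the largest in its row
# and the smallest in its column, so the two values coincide.
with_saddle = [
    [1, 2],
    [0, 3],
]
saddle_primal, saddle_dual = minimax_values(with_saddle)

print(primal, dual, gap)
print(saddle_primal, saddle_dual)
```

The first table has a strictly positive gap; the second has none because it contains a saddle point.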

Duality
• When does $\min_x \max_{\lambda \ge 0} L(x, \lambda) = \max_{\lambda \ge 0} \min_x L(x, \lambda)$?
• When $(x^*, \lambda^*)$ is a saddle point: $L(x^*, \lambda) \le L(x^*, \lambda^*) \le L(x, \lambda^*)$
• Necessity: by definition of the dual
• Sufficiency:
  – $L(\lambda) = \min_x L(x, \lambda) \le L(x^*, \lambda^*)$
  – $L(\lambda^*) = L(x^*, \lambda^*)$
  – The dual at $\lambda^*$ attains the upper bound
[Figure: grid of values $L(x, \lambda)$ with the point $(x^*, \lambda^*)$ marked]

Duality
• If strong duality holds, the KKT conditions apply to the optimal point
  – Stationary point: $\nabla_x L(x, u, \lambda) = 0$
  – Primal feasibility
  – Dual feasibility ($\lambda \ge 0$)
  – Complementary slackness ($\lambda_j h_j(x) = 0$)
• KKT conditions are
  – Sufficient (for convex problems)
  – Necessary under strong duality
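As an illustration, the conditions can be solved by hand on a one-dimensional problem. The toy problem below (minimize $(x-2)^2$ subject to $x - 1 \le 0$) is an assumption made for this example and is not from the slides.

```latex
\begin{aligned}
&\min_x\ (x-2)^2 \quad \text{s.t.}\quad h(x) = x - 1 \le 0 \\
&L(x, \lambda) = (x-2)^2 + \lambda\,(x-1) \\
&\text{Stationarity: } 2(x-2) + \lambda = 0 \\
&\text{Complementary slackness: } \lambda\,(x-1) = 0 \\
&\lambda = 0 \Rightarrow x = 2 \ (\text{infeasible}); \qquad x = 1 \Rightarrow \lambda = 2 \ge 0 \\
&x^* = 1,\quad \lambda^* = 2,\quad f(x^*) = 1
\end{aligned}
```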

Example: LP
• Primal: $\min_x c^T x$ s.t. $Ax \ge b$
• Lagrangian: $L(x, \lambda) = c^T x - \lambda^T (Ax - b)$

Example: LP
• Dual function: $L(\lambda) = \min_x \left[ c^T x - \lambda^T (Ax - b) \right]$
• Set the gradient w.r.t. $x$ to 0: $c - A^T \lambda = 0$
• Dual problem: $\max_{\lambda \ge 0} \lambda^T b$ s.t. $c - A^T \lambda = 0$
• Why keep this as a constraint?
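A hand-sized instance can be used to check the primal/dual pair. One way to read the closing question: when $c - A^T \lambda \ne 0$ the Lagrangian is linear in $x$ with nonzero slope, so its minimum is $-\infty$ and such $\lambda$ can never be dual-optimal. The sketch below assumes a made-up LP and candidate solutions `x_star` and `lam_star`; all names and numbers are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


# Primal: min c^T x  s.t.  A x >= b
# Here: min x1 + 2*x2  s.t.  x1 + x2 >= 2, x1 >= 0, x2 >= 0
A = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
b = [2.0, 0.0, 0.0]
c = [1.0, 2.0]

x_star = [2.0, 0.0]          # candidate primal solution
lam_star = [1.0, 0.0, 1.0]   # candidate dual solution

# Primal feasibility: A x - b >= 0
primal_slack = [ax - bi for ax, bi in zip(matvec(A, x_star), b)]
# Dual equality constraint: c - A^T lam = 0
dual_residual = [ci - v for ci, v in zip(c, matvec(transpose(A), lam_star))]
# Complementary slackness: lam_j * (A x - b)_j = 0 for every j
comp_slack = [l * s for l, s in zip(lam_star, primal_slack)]

primal_value = dot(c, x_star)
dual_value = dot(lam_star, b)

print(primal_slack, dual_residual, comp_slack)
print(primal_value, dual_value)
```

Both points are feasible and the two objective values agree, so by weak duality both are optimal for this instance.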

Example: LASSO • We will use duality to transform LASSO into a QP

Example: LASSO
• Primal: $\min_w \frac{1}{2} \|y - Xw\|^2 + \gamma \|w\|_1$
• What is the dual function in this case?

Example: LASSO
• Reformulated primal: $\min_{z, w} \frac{1}{2} \|y - z\|^2 + \gamma \|w\|_1$ s.t. $z = Xw$
• Dual function: $L(\lambda) = \min_{z, w} \left[ \frac{1}{2} \|y - z\|^2 + \gamma \|w\|_1 + \lambda^T (z - Xw) \right]$
• Setting the gradient to zero gives
  – $z = y - \lambda$
  – $\|X^T \lambda\|_\infty \le \gamma$

Example: LASSO
• Dual problem: $\max_\lambda -\frac{1}{2} \|\lambda\|^2 + \lambda^T y$ s.t. $\|X^T \lambda\|_\infty \le \gamma$
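The primal and dual values can be compared in the special case where the design matrix is the identity, since the LASSO solution is then given by soft thresholding. This is a sketch under that assumption only; the data `y`, the weight `gamma`, and the helper `soft_threshold` are made up for illustration.

```python
def soft_threshold(v, gamma):
    """Closed-form LASSO solution per coordinate when X is the identity."""
    if v > gamma:
        return v - gamma
    if v < -gamma:
        return v + gamma
    return 0.0


y = [3.0, -0.5]
gamma = 1.0

# Primal solution (X = I, so z = X w = w).
w = [soft_threshold(v, gamma) for v in y]
z = list(w)

# Dual variable from the stationarity condition z = y - lambda.
lam = [yi - zi for yi, zi in zip(y, z)]

primal_value = 0.5 * sum((yi - zi) ** 2 for yi, zi in zip(y, z)) \
    + gamma * sum(abs(wi) for wi in w)
dual_value = -0.5 * sum(l * l for l in lam) \
    + sum(l * yi for l, yi in zip(lam, y))

# Dual feasibility: ||X^T lambda||_inf <= gamma (X = I here).
dual_feasible = max(abs(l) for l in lam) <= gamma

print(w, lam, primal_value, dual_value, dual_feasible)
```

The dual variable is the residual clipped to the interval [-gamma, gamma], and the two objective values coincide.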

Support Vector Machines
[Figure: maximum-margin separating hyperplane; source: docs.opencv.org]

Support Vector Machines
• Find the maximum margin hyper-plane
• "Distance" from a point $x_i$ to the hyper-plane $\langle w, x \rangle + b = 0$ is given by $d_i = (\langle w, x_i \rangle + b) / \|w\|$
• Margin $= \min_i y_i d_i = \frac{1}{\|w\|} \min_i (\langle w, x_i \rangle + b)\, y_i$
• Max margin: $\max_{w,b} \frac{1}{\|w\|} \min_i (\langle w, x_i \rangle + b)\, y_i$
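The margin formula is easy to evaluate directly. The sketch below uses a made-up four-point dataset and a hypothetical helper `margin`; it also shows that rescaling the weight vector and bias together leaves the margin unchanged, which is why the maximizer is not unique.

```python
import math


def margin(w, b, X, y):
    """min_i y_i * (<w, x_i> + b) / ||w|| for a labelled dataset."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        yi * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, yi in zip(X, y)
    )


X = [[1.0, 1.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]]
y = [1, 1, -1, -1]

m = margin([1.0, 1.0], 0.0, X, y)
m_scaled = margin([2.0, 2.0], 0.0, X, y)   # same hyper-plane, rescaled (w, b)

print(m, m_scaled)
```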

Support Vector Machines
• Max margin: $\max_{w,b} \frac{1}{\|w\|} \min_i (\langle w, x_i \rangle + b)\, y_i$
• Unpleasant (max min?)
• No unique solution
• Add a constraint: s.t. $\min_i (\langle w, x_i \rangle + b)\, y_i = 1$

Support Vector Machines
• Max margin: $\min_{w,b} \frac{1}{2} \|w\|^2$ s.t. $\min_i (\langle w, x_i \rangle + b)\, y_i = 1$

Support Vector Machines
• Max margin (canonical representation): $\min_{w,b} \frac{1}{2} \|w\|^2$ s.t. $(\langle w, x_i \rangle + b)\, y_i \ge 1,\ \forall i$
• QP, much better than $\max_{w,b} \frac{1}{\|w\|} \min_i (\langle w, x_i \rangle + b)\, y_i$

SVM Dual Problem
• Recall that the Lagrangian is formed by adding a Lagrange multiplier for each constraint: $L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_i \alpha_i \left[ (\langle w, x_i \rangle + b)\, y_i - 1 \right]$
• Fix $\alpha$ and minimize w.r.t. $w, b$:
  – $w - \sum_i \alpha_i y_i x_i = 0$ (plug in)
  – $\sum_i \alpha_i y_i = 0$ (constraint; why?)

SVM Dual Problem
• Dual problem: $\max_\alpha -\frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$ s.t. $\sum_i \alpha_i y_i = 0$, $\alpha_i \ge 0$
• Another QP. So what?
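The dual objective depends on the data only through inner products, which the following sketch makes explicit. It assumes a made-up two-point dataset for which the optimal multipliers can be found by hand (both equal to 0.25); `dual_objective` and `primal_objective` are hypothetical helper names.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha, X, y):
    """sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>."""
    n = len(alpha)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad


def primal_objective(w):
    return 0.5 * dot(w, w)


X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
alpha = [0.25, 0.25]          # hand-derived optimum for this toy dataset

# Recover w from the stationarity condition w = sum_i alpha_i y_i x_i.
w = [sum(alpha[i] * y[i] * X[i][k] for i in range(len(X))) for k in range(2)]

equality_constraint = sum(a * yi for a, yi in zip(alpha, y))
dual_value = dual_objective(alpha, X, y)
primal_value = primal_objective(w)

print(w, equality_constraint, dual_value, primal_value)
```

At these multipliers the primal and dual objectives agree, as strong duality predicts for this QP.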

SVM Dual Problem
• Only inner products → kernel trick
• Complementary slackness → support vectors
• KKT conditions lead to efficient optimization algorithms (compared to a general QP solver)

SVM Dual Problem
• Classification of a test point: $f(x) = \langle w, x \rangle + b = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$
• To get $b$, use the fact that $y_i f(x_i) = 1$ for any support vector.
• For numerical stability, average over all support vectors.
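The recipe above can be sketched in a few lines. The code assumes multipliers that already solve the dual for a made-up two-point dataset, a plain inner product as the kernel, and labels in {-1, +1}; `kernel`, `bias`, and `decision` are hypothetical names.

```python
def kernel(u, v):
    """Linear kernel; any valid kernel could be substituted here."""
    return sum(a * b for a, b in zip(u, v))


def bias(alpha, X, y, tol=1e-9):
    """Average b over support vectors, using y_i * f(x_i) = 1.

    Since y_i is +1 or -1, this gives b = y_i - sum_j alpha_j y_j k(x_j, x_i).
    """
    support = [i for i, a in enumerate(alpha) if a > tol]
    estimates = [
        y[i] - sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(len(X)))
        for i in support
    ]
    return sum(estimates) / len(estimates)


def decision(x, alpha, X, y, b):
    """f(x) = sum_i alpha_i y_i k(x_i, x) + b."""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b


X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
alpha = [0.25, 0.25]          # assumed dual solution for this toy dataset

b = bias(alpha, X, y)
score_pos = decision([2.0, 0.0], alpha, X, y, b)
score_neg = decision([-3.0, 1.0], alpha, X, y, b)

print(b, score_pos, score_neg)
```

The sign of the decision value gives the predicted class.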

Soft Margin SVM
• Hard margin SVM: $\min_{w,b} \sum_i E_\infty\!\left( 1 - (\langle w, x_i \rangle + b)\, y_i \right) + \frac{1}{2} \|w\|^2$, where $E_\infty(x) = \begin{cases} \infty & x > 0 \\ 0 & x \le 0 \end{cases}$
• The first term is the loss; the second term is the regularization
[Figure: loss plotted against $y_i f(x_i)$]

Soft Margin SVM
• Relax it a little bit: $\min_{w,b} \sum_i E_C\!\left( 1 - (\langle w, x_i \rangle + b)\, y_i \right) + \frac{1}{2} \|w\|^2$, where $E_C(x) = \begin{cases} Cx & x \ge 0 \\ 0 & x < 0 \end{cases}$
[Figure: loss plotted against $y_i f(x_i)$]

Soft Margin SVM
• Relax it a little bit: $\min_{w,b} C \sum_i \left[ 1 - (\langle w, x_i \rangle + b)\, y_i \right]_+ + \frac{1}{2} \|w\|^2$
[Figure: loss plotted against $y_i f(x_i)$]

Soft Margin SVM
• Equivalent formulation: $\min_{w,b,\xi} C \sum_i \xi_i + \frac{1}{2} \|w\|^2$ s.t. $\xi_i \ge 0$, $(\langle w, x_i \rangle + b)\, y_i \ge 1 - \xi_i$
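The equivalence between the hinge-loss form and the slack-variable form can be seen numerically: for a fixed hyper-plane, the smallest feasible slack for each point is exactly its hinge loss. The sketch below assumes a made-up dataset, hyper-plane, and value of C; all function names are hypothetical.

```python
def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def hinge_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * f(x_i))."""
    reg = 0.5 * sum(wi * wi for wi in w)
    loss = sum(max(0.0, 1.0 - yi * score(w, b, x)) for x, yi in zip(X, y))
    return reg + C * loss


def smallest_slacks(w, b, X, y):
    """Smallest xi_i with xi_i >= 0 and y_i * f(x_i) >= 1 - xi_i."""
    return [max(0.0, 1.0 - yi * score(w, b, x)) for x, yi in zip(X, y)]


def slack_objective(w, xi, C):
    """0.5 * ||w||^2 + C * sum_i xi_i."""
    return 0.5 * sum(wi * wi for wi in w) + C * sum(xi)


X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [0.25, 0.0]]
y = [1, 1, -1, -1]
w, b, C = [1.0, 0.0], 0.0, 2.0

xi = smallest_slacks(w, b, X, y)
value_hinge = hinge_objective(w, b, X, y, C)
value_slack = slack_objective(w, xi, C)

print(xi, value_hinge, value_slack)
```

The last point is misclassified and gets a slack above 1, the second lies inside the margin with a slack between 0 and 1, and the other two satisfy the hard-margin constraint with zero slack.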
