10701 Recitation 5: Duality and SVM
Ahmed Hefny
Outline
- Lagrangian and Duality
– The Lagrangian
– Duality
– Examples
- Support Vector Machines
– Primal Formulation
– Dual Formulation
– Soft Margin and Hinge Loss
Lagrangian
- Consider the problem
$$\min_x f(x) \quad \text{s.t.} \quad g_i(x) = 0$$
- Add a Lagrange multiplier for each constraint
$$L(x, u) = f(x) + \sum_i u_i g_i(x)$$
Lagrangian
- Lagrangian
$$L(x, u) = f(x) + \sum_i u_i g_i(x)$$
- Setting gradient to 0 gives
– $g_i(x) = 0$ [Feasible point]
– $\nabla f(x) + \sum_i u_i \nabla g_i(x) = 0$ [Cannot decrease $f$ except by violating constraints]
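As a minimal sketch of the stationarity conditions above, consider a toy problem that is not from the slides: minimize $x_1^2 + x_2^2$ subject to $x_1 + x_2 - 1 = 0$. The function name and the candidate point are assumptions chosen for illustration; the point was obtained by solving the three equations by hand.

```python
# Toy problem (an assumption for illustration): minimize f(x) = x1^2 + x2^2
# subject to g(x) = x1 + x2 - 1 = 0.

def grad_lagrangian(x1, x2, u):
    """Gradient of L(x, u) = f(x) + u * g(x) with respect to (x1, x2, u)."""
    dL_dx1 = 2 * x1 + u      # derivative of f plus u times derivative of g
    dL_dx2 = 2 * x2 + u
    dL_du = x1 + x2 - 1      # equals g(x): zero exactly at feasible points
    return dL_dx1, dL_dx2, dL_du

# Candidate solved by hand: x = (0.5, 0.5) with multiplier u = -1.
stationary = grad_lagrangian(0.5, 0.5, -1.0)
print(stationary)
```

All three components vanish at the candidate, so it is feasible and the objective cannot be decreased without leaving the constraint.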
Lagrangian
- Consider the problem
$$\min_x f(x) \quad \text{s.t.} \quad g_i(x) = 0, \quad h_j(x) \le 0$$
- Add a Lagrange multiplier for each constraint
$$L(x, u, \lambda) = f(x) + \sum_i u_i g_i(x) + \sum_j \lambda_j h_j(x)$$
Duality
Duality
- Primal problem
$$\min_x f(x) \quad \text{s.t.} \quad g_i(x) = 0, \quad h_j(x) \le 0$$
- Equivalent to
$$\min_x \max_{\lambda \ge 0,\, u} \; f(x) + \sum_i u_i g_i(x) + \sum_j \lambda_j h_j(x)$$
Duality
- Primal problem
$$\min_x f(x) \quad \text{s.t.} \quad g_i(x) = 0, \quad h_j(x) \le 0$$
- Equivalent to
$$\min_x \begin{cases} f(x) & x \text{ is feasible} \\ \infty & \text{o.w.} \end{cases}$$
Duality
- Dual Problem
$$\max_{\lambda \ge 0,\, u} \min_x \; f(x) + \sum_i u_i g_i(x) + \sum_j \lambda_j h_j(x)$$
- The inner minimization is the Lagrangian dual function $L(\lambda, u)$
- Dual function:
– Concave, regardless of the convexity of the primal
– Lower bound on primal
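Duality
[Figure: grid of values $L(x, \lambda)$, rows indexed by $x$, columns indexed by $\lambda$]
Primal Problem
$$\min_x \max_{\lambda \ge 0} L(x, \lambda)$$
For each row (choice of $x$), pick the largest element, then select the minimum.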
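Duality
[Figure: grid of values $L(x, \lambda)$, rows indexed by $x$, columns indexed by $\lambda$]
Dual Problem
$$\max_{\lambda \ge 0} \min_x L(x, \lambda)$$
For each column (choice of $\lambda$), pick the smallest element, then select the maximum.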
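Duality
[Figure: the same grid with the point $(x^*, \lambda^*)$ marked]
Claim:
$$\min_x \max_{\lambda \ge 0} L(x, \lambda) \ge \max_{\lambda \ge 0} \min_x L(x, \lambda)$$
For any $\lambda \ge 0$
$$\min_x L(x, \lambda) \le L(x^*, \lambda) \le L(x^*, \lambda^*)$$
The difference between the primal minimum and the dual maximum is called the duality gap. A duality gap of 0 is called strong duality.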
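The row and column picture can be tried on a small table. The sketch below uses made-up numbers (not taken from the slides) as a stand-in for the Lagrangian values, with rows playing the role of the primal variable and columns the role of the multiplier.

```python
# Hypothetical table of Lagrangian values: rows are choices of the primal
# variable, columns are choices of the multiplier.
table = [
    [3, 1, 4],
    [2, 5, 0],
    [6, 2, 3],
]

# Primal: for each row take the largest entry, then the smallest of those.
min_max = min(max(row) for row in table)

# Dual: for each column take the smallest entry, then the largest of those.
columns = list(zip(*table))
max_min = max(min(col) for col in columns)

duality_gap = min_max - max_min
print(min_max, max_min, duality_gap)
```

For this particular table the gap is strictly positive, so it has no saddle point; the inequality itself holds for any table.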
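Duality
[Figure: the same grid with the point $(x^*, \lambda^*)$ marked]
When does
$$\min_x \max_{\lambda \ge 0} L(x, \lambda) = \max_{\lambda \ge 0} \min_x L(x, \lambda) \; ?$$
When $(x^*, \lambda^*)$ is a saddle point:
$$L(x^*, \lambda) \le L(x^*, \lambda^*) \le L(x, \lambda^*)$$
- Necessity: by definition of the dual
- Sufficiency:
$$L(\lambda) = \min_x L(x, \lambda) \le L(x^*, \lambda^*), \qquad L(\lambda^*) = L(x^*, \lambda^*)$$
The dual at $\lambda^*$ attains the upper bound.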
Duality
- If strong duality holds, KKT conditions apply to the optimal point
– Stationary Point: $\nabla_x L(x, u, \lambda) = 0$
– Primal Feasibility
– Dual Feasibility ($\lambda \ge 0$)
– Complementary Slackness ($\lambda_j h_j(x) = 0$)
- KKT conditions are
– Sufficient (for convex problems)
– Necessary under strong duality
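The four conditions can be checked mechanically on a toy problem. The sketch below assumes a one-dimensional example that is not in the slides: minimize $x^2$ subject to $1 - x \le 0$. The function `kkt_report` and the tolerance are hypothetical names chosen for illustration.

```python
# Toy problem (assumed for illustration): minimize f(x) = x^2
# subject to h(x) = 1 - x <= 0.

def kkt_report(x, lam, tol=1e-9):
    """Return which KKT conditions hold for the toy problem at (x, lam)."""
    h = 1 - x
    return {
        # derivative in x of the Lagrangian x^2 + lam * (1 - x)
        "stationarity": abs(2 * x - lam) <= tol,
        "primal_feasibility": h <= tol,
        "dual_feasibility": lam >= -tol,
        "complementary_slackness": abs(lam * h) <= tol,
    }

# Candidate worked out by hand: x = 1 with multiplier 2.
report = kkt_report(1.0, 2.0)
# The unconstrained minimizer x = 0 (multiplier 0) is not primal feasible.
infeasible_report = kkt_report(0.0, 0.0)
print(report)
print(infeasible_report)
```

The second call shows why stationarity alone is not enough: the unconstrained minimizer is stationary but violates the constraint.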
Example: LP
- Primal
$$\min_x c^T x \quad \text{s.t.} \quad Ax \ge b$$
- Lagrangian
$$L(x, \lambda) = c^T x - \lambda^T (Ax - b)$$
Example: LP
- Dual Function
$$L(\lambda) = \min_x \; c^T x - \lambda^T (Ax - b)$$
- Set gradient w.r.t. $x$ to 0
$$c - A^T \lambda = 0$$
- Dual Problem
$$\max_{\lambda \ge 0} \lambda^T b \quad \text{s.t.} \quad c - A^T \lambda = 0$$
Why keep this as a constraint?
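A small numeric instance makes the primal and dual pair concrete. The data below is made up for illustration, and the optimal points were worked out by hand for this instance rather than computed by a solver.

```python
# Hypothetical LP: minimize c^T x subject to A x >= b.
A = [[1, 1], [1, 0], [0, 1]]
b = [4, 1, 0]
c = [2, 3]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def primal_feasible(x):
    return all(dot(row, x) >= bi for row, bi in zip(A, b))

def dual_feasible(lam):
    # A^T lam must equal c, and every multiplier must be nonnegative.
    cols = list(zip(*A))
    return all(l >= 0 for l in lam) and all(
        dot(col, lam) == ci for col, ci in zip(cols, c)
    )

x_opt, lam_opt = [4, 0], [2, 0, 1]        # worked out by hand
x_other, lam_other = [2, 3], [1, 1, 2]    # feasible but not optimal

primal_opt, dual_opt = dot(c, x_opt), dot(b, lam_opt)
primal_other, dual_other = dot(c, x_other), dot(b, lam_other)
print(primal_opt, dual_opt, primal_other, dual_other)
```

Any dual-feasible value sits below any primal-feasible value, and the two hand-computed points meet, which certifies that both are optimal for this instance.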
Example: LASSO
- We will use duality to transform LASSO into a QP
Example: LASSO
Primal
$$\min_w \tfrac{1}{2} \|y - Xw\|^2 + \gamma \|w\|_1$$
What is the dual function in this case?
Example: LASSO
Reformulated Primal min 1 2 𝑧 − 𝑨 2 + 𝛿 𝑥 1 s.t. 𝑨 = 𝑌𝑥 Dual 𝑀 𝜇 = min
𝑨,𝑥
1 2 𝑧 − 𝑨 2 + 𝛿 𝑥 1 + 𝜇𝑈(𝑨 − 𝑌𝑥)
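Example: LASSO
Dual
$$L(\lambda) = \min_{z, w} \; \tfrac{1}{2} \|y - z\|^2 + \gamma \|w\|_1 + \lambda^T (z - Xw)$$
Setting the gradient w.r.t. $z$ to zero gives $z = y - \lambda$. The minimum over $w$ is finite (and equal to 0) only if
$$\|X^T \lambda\|_\infty \le \gamma$$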
Example: LASSO
- Dual Problem
$$\max_\lambda \; -\tfrac{1}{2} \|\lambda\|^2 + \lambda^T y \quad \text{s.t.} \quad \|X^T \lambda\|_\infty \le \gamma$$
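In one dimension with the design matrix equal to 1, both problems have closed-form solutions, which gives a quick check that the primal and dual values agree. This is a sketch under that simplifying assumption; `lasso_1d` is a hypothetical helper, not something from the slides.

```python
def lasso_1d(y, gamma):
    """One-dimensional LASSO with X = 1: primal and dual optimal values."""
    # Primal minimizer: soft-thresholding of y at level gamma.
    magnitude = max(abs(y) - gamma, 0.0)
    w = magnitude if y >= 0 else -magnitude
    primal = 0.5 * (y - w) ** 2 + gamma * abs(w)

    # Dual maximizer: y clipped to the box [-gamma, gamma].
    lam = max(-gamma, min(gamma, y))
    dual = -0.5 * lam ** 2 + lam * y
    return w, lam, primal, dual

print(lasso_1d(3.0, 1.0))   # weight stays nonzero
print(lasso_1d(0.5, 1.0))   # weight is thresholded to zero
```

In both calls the primal and dual values coincide, and the residual $y - w$ equals the dual variable, matching the relation between the fitted values and the multiplier derived above.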
Support Vector Machines
[Figure: image from docs.opencv.org]
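Support Vector Machines
- Find the maximum margin hyper-plane
- “Distance” from a point $x_i$ to the hyper-plane $\langle w, x \rangle + b = 0$ is given by $d_i = (\langle w, x_i \rangle + b) / \|w\|$
- $\text{Margin} = \min_i y_i d_i = \frac{1}{\|w\|} \min_i (\langle w, x_i \rangle + b)\, y_i$
- Max Margin: $\max_{w, b} \frac{1}{\|w\|} \min_i (\langle w, x_i \rangle + b)\, y_i$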
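Support Vector Machines
- Max Margin
$$\max_{w, b} \frac{1}{\|w\|} \min_i (\langle w, x_i \rangle + b)\, y_i$$
- Unpleasant (max min?)
- No unique solution: rescaling $(w, b)$ by any positive constant leaves the objective unchanged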
Support Vector Machines
- Max Margin
$$\max_{w, b} \frac{1}{\|w\|} \min_i (\langle w, x_i \rangle + b)\, y_i \quad \text{s.t.} \quad \min_i (\langle w, x_i \rangle + b)\, y_i = 1$$
Support Vector Machines
- Max Margin
$$\min_{w, b} \tfrac{1}{2} \|w\|^2 \quad \text{s.t.} \quad \min_i (\langle w, x_i \rangle + b)\, y_i = 1$$
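Support Vector Machines
- Max Margin (Canonical Representation)
$$\min_{w, b} \tfrac{1}{2} \|w\|^2 \quad \text{s.t.} \quad (\langle w, x_i \rangle + b)\, y_i \ge 1, \; \forall i$$
- QP, much better than
$$\max_{w, b} \frac{1}{\|w\|} \min_i (\langle w, x_i \rangle + b)\, y_i$$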
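The step from the max-min objective to the canonical form relies on the margin being unchanged when the weight vector and bias are rescaled together. The sketch below illustrates this on a made-up 2-D data set; the data, the starting hyper-plane, and the helper names are all assumptions, and the rescaling only makes sense when the hyper-plane already separates the data.

```python
import math

# Hypothetical 2-D training set with labels in {-1, +1}.
points = [(2.0, 2.0), (3.0, 1.0), (-1.0, -1.0), (-2.0, 0.0)]
labels = [1, 1, -1, -1]

def functional_margins(w, b):
    return [y * (w[0] * x[0] + w[1] * x[1] + b) for x, y in zip(points, labels)]

def geometric_margin(w, b):
    return min(functional_margins(w, b)) / math.hypot(w[0], w[1])

def canonical(w, b):
    """Rescale (w, b) so the smallest functional margin equals 1.

    Assumes the hyper-plane separates the data (smallest margin > 0).
    """
    s = min(functional_margins(w, b))
    return (w[0] / s, w[1] / s), b / s

w, b = (1.0, 1.0), -1.0
w_c, b_c = canonical(w, b)
print(geometric_margin(w, b), geometric_margin(w_c, b_c), 1 / math.hypot(*w_c))
```

After rescaling, the margin equals one over the norm of the weight vector, which is why maximizing the margin turns into minimizing the squared norm.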
SVM Dual Problem
Recall that the Lagrangian is formed by adding a Lagrange multiplier for each constraint.
$$L(w, b, \alpha) = \tfrac{1}{2} \|w\|^2 - \sum_i \alpha_i \left[ (\langle w, x_i \rangle + b)\, y_i - 1 \right]$$
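SVM Dual Problem
$$L(w, b, \alpha) = \tfrac{1}{2} \|w\|^2 - \sum_i \alpha_i \left[ (\langle w, x_i \rangle + b)\, y_i - 1 \right]$$
Fix $\alpha$ and minimize w.r.t. $w, b$:
– $w - \sum_i \alpha_i y_i x_i = 0$ [Plug-in]
– $\sum_i \alpha_i y_i = 0$ [Constraint (why?)]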
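SVM Dual Problem
Dual Problem
$$\max_\alpha \; -\tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \; \alpha_i \ge 0$$
Another QP. So what?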
SVM Dual Problem
- Only inner products → Kernel Trick
- Complementary Slackness → Support Vectors
- KKT conditions lead to efficient optimization algorithms (compared to a general QP solver)
SVM Dual Problem
- Classification of a test point
$$f(x) = \langle w, x \rangle + b = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$$
- To get $b$, use the fact that $y_i f(x_i) = 1$ for any support vector.
- For numerical stability, average over all support vectors.
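The prediction rule and the bias recovery can be sketched on a two-point training set. The data and the multipliers below are assumptions: the multipliers were obtained by solving the dual by hand for these two points, not by a QP solver.

```python
# Hypothetical two-point training set; alpha solves the dual by hand here.
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]
alpha = [0.25, 0.25]

def inner(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def decision_without_bias(x):
    # Weighted sum of inner products between x and the training points.
    return sum(a * yi * inner(xi, x) for a, yi, xi in zip(alpha, y, X))

# Support vectors have a positive multiplier. Each one satisfies
# y_i * f(x_i) = 1, and since y_i is +1 or -1 this gives
# b = y_i - decision_without_bias(x_i). Average the estimates.
support = [i for i, a in enumerate(alpha) if a > 1e-12]
b = sum(y[i] - decision_without_bias(X[i]) for i in support) / len(support)

def f(x):
    return decision_without_bias(x) + b

print(b, f((2.0, 0.0)), f((-3.0, 1.0)))
```

Only inner products with training points appear in the decision function, so replacing `inner` with a kernel function would be the only change needed to apply the kernel trick in this sketch.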
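Soft Margin SVM
Hard Margin SVM
$$\min_{w, b} \; \underbrace{\sum_i E_\infty\!\left(1 - (\langle w, x_i \rangle + b)\, y_i\right)}_{\text{loss}} + \underbrace{\tfrac{1}{2} \|w\|^2}_{\text{regularization}}$$
where
$$E_\infty(x) = \begin{cases} \infty & x > 0 \\ 0 & x \le 0 \end{cases}$$
[Figure: loss plotted against $y_i f(x_i)$]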
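Soft Margin SVM
Relax it a little bit
$$\min_{w, b} \; \sum_i E_C\!\left(1 - (\langle w, x_i \rangle + b)\, y_i\right) + \tfrac{1}{2} \|w\|^2$$
where
$$E_C(x) = \begin{cases} Cx & x \ge 0 \\ 0 & x < 0 \end{cases}$$
[Figure: loss plotted against $y_i f(x_i)$]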
Soft Margin SVM
Relax it a little bit
$$\min_{w, b} \; C \sum_i \left[ 1 - (\langle w, x_i \rangle + b)\, y_i \right]_+ + \tfrac{1}{2} \|w\|^2$$
[Figure: loss plotted against $y_i f(x_i)$]
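Soft Margin SVM
Equivalent Formulation
$$\min_{w, b, \zeta} \; C \sum_i \zeta_i + \tfrac{1}{2} \|w\|^2 \quad \text{s.t.} \quad \zeta_i \ge 0, \quad (\langle w, x_i \rangle + b)\, y_i \ge 1 - \zeta_i$$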
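The equivalence between the slack formulation and the hinge loss comes from the fact that, for fixed weights and bias, the smallest feasible slack for each point is exactly its hinge loss. The sketch below evaluates the objective that way on made-up data; the function name, data, and parameter values are assumptions for illustration.

```python
def soft_margin_objective(w, b, C, points, labels):
    """Soft-margin objective with each slack set to its smallest feasible value."""
    slacks = []
    for x, y in zip(points, labels):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        # Smallest slack satisfying slack >= 0 and margin >= 1 - slack,
        # which is the hinge loss of this point.
        slacks.append(max(0.0, 1.0 - margin))
    reg = 0.5 * sum(wi * wi for wi in w)
    return C * sum(slacks) + reg, slacks

points = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0), (0.25, 0.0)]
labels = [1, 1, -1, -1]
value, slacks = soft_margin_objective((1.0, 0.0), 0.0, 2.0, points, labels)
print(value, slacks)
```

Points beyond the margin contribute zero slack, a point inside the margin contributes a slack below 1, and the misclassified point contributes a slack above 1.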
Conclusions
- Duality allows for establishing a lower bound on a minimization problem.
- Key idea
– “min max” upper bounds “max min”
- Strong Duality → Necessity of KKT Conditions
- Duality on SVMs
– Kernel Trick
– Support Vectors
- Soft Margin SVM = Hinge Loss
Resources
- Bishop, “Pattern Recognition and Machine Learning”, Chp 7
- Gordon & Tibshirani, 10725 Optimization (Fall 2012) Lecture Slides: http://www.cs.cmu.edu/~ggordon/10725-F12/schedule.html
- Fiterau, Kernels and SVM