Introduction to Machine Learning
5. Optimization
Geoff Gordon and Alex Smola Carnegie Mellon University
http://alex.smola.org/teaching/cmu2013-10-701 10-701x
Outline: Basic techniques (gradient descent, Newton's method).
(more on this in a later lecture)
$$-\log p(\theta|X) = \frac{1}{2\sigma^2}\|\theta\|^2 + \sum_{i=1}^{m}\left[g(\theta) - \langle \phi(x_i), \theta\rangle\right] + \text{const.}$$
Convex set: for $x, x' \in X$ it follows that $\lambda x + (1-\lambda)x' \in X$ for $\lambda \in [0,1]$.
Convex function: $f(\lambda x + (1-\lambda)x') \le \lambda f(x) + (1-\lambda)f(x')$ for $\lambda \in [0,1]$.
Sublevel sets of convex functions are convex: if $f(x), f(x') \le c$ then $f(\lambda x + (1-\lambda)x') \le \lambda f(x) + (1-\lambda)f(x') \le c$, hence $\lambda x + (1-\lambda)x'$ lies in the sublevel set $\{x \mid f(x) \le c\}$ whenever $x$ and $x'$ do.
Vertex: $x$ is a vertex of $X$ if $\lambda x + (1-\lambda)x' \notin X$ for all $\lambda > 1$ and all $x' \in X$ with $x' \ne x$.
Convex hull:
$$\mathrm{co}\, X := \left\{\bar{x} \;\middle|\; \bar{x} = \sum_{i=1}^{n} \alpha_i x_i \text{ where } n \in \mathbb{N},\ x_i \in X,\ \alpha_i \ge 0 \text{ and } \sum_{i=1}^{n} \alpha_i = 1\right\}$$
For convex $f$, taking the convex hull does not increase the supremum:
$$\sup_{x \in X} f(x) = \sup_{x \in \mathrm{co}\, X} f(x)$$
(exploit convexity in the upper bound and keep 5 points)
[figure: query points labeled 1 through 7 in the search interval]
Interval bisection:
Require: $a$, $b$, precision $\epsilon$
Set $A = a$, $B = b$
repeat
  if $f'\!\left(\frac{A+B}{2}\right) > 0$ then $B = \frac{A+B}{2}$ (solution on the left)
  else $A = \frac{A+B}{2}$
  end if
until $(B - A)\min(|f'(A)|, |f'(B)|) \le \epsilon$
Output: $x = \frac{A+B}{2}$
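A minimal runnable sketch of the bisection above, assuming $f$ is convex and differentiable on $[a, b]$; the function and parameter names are illustrative.

```python
import numpy as np

def bisect_min(f_prime, a, b, eps=1e-8):
    """Interval bisection for a convex differentiable f on [a, b].

    Halves the interval based on the sign of f'((A+B)/2); the stopping
    rule (B - A) * min(|f'(A)|, |f'(B)|) <= eps bounds the objective
    suboptimality, not just the interval width.
    """
    A, B = a, b
    while (B - A) * min(abs(f_prime(A)), abs(f_prime(B))) > eps:
        mid = 0.5 * (A + B)
        if f_prime(mid) > 0:      # minimum lies to the left of mid
            B = mid
        else:                     # minimum lies to the right of mid
            A = mid
    return 0.5 * (A + B)

# Example: minimize f(x) = (x - 1)^2 + e^x, so f'(x) = 2(x - 1) + e^x
x_star = bisect_min(lambda x: 2 * (x - 1) + np.exp(x), -2.0, 2.0)
```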
[figure: approximation of the objective function; convex objective]
General descent method (Boyd & Vandenberghe):
given a starting point $x \in \operatorname{dom} f$
repeat
  1. compute a descent direction $\Delta x$ (for gradient descent, $\Delta x = -\partial_x f(x)$)
  2. line search: choose a step size $t > 0$
  3. update: $x \leftarrow x + t\,\Delta x$
until stopping criterion is satisfied.
Convergence for strongly convex objectives: if $f$ is $m$-strongly convex, then
$$f(y) \ge f(x) + \langle y - x, \partial_x f(x)\rangle + \frac{m}{2}\|y - x\|^2$$
hence
$$f(x) - f(x^*) \ge \frac{m}{2}\|x - x^*\|^2$$
and
$$f(x) - f(x^*) \le \langle x - x^*, \partial_x f(x)\rangle - \frac{m}{2}\|x^* - x\|^2 \le \sup_y\left[\langle x - y, \partial_x f(x)\rangle - \frac{m}{2}\|y - x\|^2\right] = \frac{1}{2m}\|\partial_x f(x)\|^2.$$
If moreover $f$ is $M$-smooth, i.e.
$$f(y) \le f(x) + \langle y - x, \partial_x f(x)\rangle + \frac{M}{2}\|y - x\|^2,$$
then with $g_x = -\partial_x f(x)$ and step size $t = 1/M$
$$f(x + t g_x) \le f(x) - t\|g_x\|^2 + \frac{M}{2}t^2\|g_x\|^2 = f(x) - \frac{1}{2M}\|g_x\|^2$$
and therefore
$$f(x + t g_x) - f(x^*) \le \left[f(x) - f(x^*)\right]\left[1 - \frac{m}{M}\right].$$
Reaching precision $\epsilon$ thus takes at most $\frac{M}{m}\log\frac{f(x) - f(x^*)}{\epsilon}$ iterations.
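A short sketch of gradient descent with the fixed step $1/M$ used in the analysis above; the names and the quadratic test problem are illustrative.

```python
import numpy as np

def gradient_descent(grad, x0, M, tol=1e-6, max_iter=10_000):
    """Gradient descent with fixed step 1/M.

    Assumes f is M-smooth, so each step decreases f by at least
    ||grad f(x)||^2 / (2M); for m-strongly convex f this gives the
    (M/m) log(1/eps) iteration count derived above.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stopping criterion
            break
        x = x - g / M                 # x <- x - (1/M) grad f(x)
    return x

# Example: f(x) = 0.5 x^T Q x with Q = diag(1, 10), so m = 1, M = 10
Q = np.diag([1.0, 10.0])
x_min = gradient_descent(lambda x: Q @ x, x0=[5.0, -3.0], M=10.0)
```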
Newton's Method
For convex $f$ the Hessian is positive semidefinite, $\partial_x^2 f(x) \succeq 0$. Expanding to second order,
$$f(x + \delta) = f(x) + \langle \delta, \partial_x f(x)\rangle + \frac{1}{2}\delta^\top \partial_x^2 f(x)\,\delta + O(\|\delta\|^3).$$
Minimizing the quadratic model in $\delta$ yields the Newton update
$$x \leftarrow x - \left[\partial_x^2 f(x)\right]^{-1} \partial_x f(x).$$
Quadratic convergence: assume the second-order expansion is accurate in the sense that
$$\left\|\partial_x f(x^*) - \partial_x f(x) - \left\langle x^* - x, \partial_x^2 f(x)\right\rangle\right\| \le \gamma \|x^* - x\|^2.$$
Then
$$\|x_{n+1} - x^*\| = \left\|x_n - x^* - \left[\partial_x^2 f(x_n)\right]^{-1}\left[\partial_x f(x_n) - \partial_x f(x^*)\right]\right\| = \left\|\left[\partial_x^2 f(x_n)\right]^{-1}\left[\partial_x^2 f(x_n)\left[x_n - x^*\right] - \partial_x f(x_n) + \partial_x f(x^*)\right]\right\| \le \gamma \left\|\left[\partial_x^2 f(x_n)\right]^{-1}\right\| \|x_n - x^*\|^2,$$
so the error is squared at every step once $x_n$ is close to $x^*$.
See Boyd and Vandenberghe, Chapter 9.5 for much more
[figures from Boyd & Vandenberghe: Newton iterates $x^{(0)}, x^{(1)}, x^{(2)}$; comparison of the Newton step $x + \Delta x_{\mathrm{nt}}$ with the normalized steepest descent step $x + \Delta x_{\mathrm{nsd}}$]
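An undamped Newton iteration as a minimal sketch; Boyd & Vandenberghe's globally convergent variant adds a line search, which is omitted here. Function names and the test problem are illustrative.

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method: x <- x - [d^2 f(x)]^{-1} d f(x).

    Quadratically convergent near x* per the bound above; no damping
    or line search, so it is only locally convergent.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)   # avoids forming the inverse
    return x

# Example: f(x) = x1^4 + x2^2
grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0]**2, 0.0], [0.0, 2.0]])
x_star = newton(grad, hess, x0=[1.0, 1.0])
```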
Quasi-Newton (BFGS): instead of computing the Hessian, maintain an estimate $B_i$ and update
$$x_{i+1} = x_i - \alpha_i \delta_i \quad \text{where } \delta_i = B_i^{-1}\, \partial_x f(x_i),$$
using a rank-two correction $B_{i+1} = B_i + u_i u_i^\top + v_i v_i^\top$ chosen to satisfy the secant condition
$$B_{i+1}(x_{i+1} - x_i) = \partial_x f(x_{i+1}) - \partial_x f(x_i).$$
With $g_i = \partial_x f(x_{i+1}) - \partial_x f(x_i)$ this yields the BFGS update
$$B_{i+1} = B_i + \frac{g_i g_i^\top}{\alpha_i\, \delta_i^\top g_i} - \frac{B_i \delta_i \delta_i^\top B_i}{\delta_i^\top B_i \delta_i}.$$
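A self-contained quasi-Newton sketch in the standard $(s, y)$ parametrization of the BFGS secant update, with a crude backtracking line search; all names are illustrative, and sign conventions differ slightly from the slide's $\delta_i$ form.

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-8, max_iter=200):
    """Quasi-Newton sketch: maintain a Hessian estimate B via the
    rank-two secant update B_{i+1} s_i = y_i. Illustrative only."""
    x = np.asarray(x0, dtype=float)
    B = np.eye(len(x))
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -np.linalg.solve(B, g)          # quasi-Newton direction
        t = 1.0
        while f(x + t * d) > f(x) + 1e-4 * t * (g @ d) and t > 1e-12:
            t *= 0.5                        # backtracking line search
        s = t * d
        x_new, g_new = x + s, grad(x + s)
        y = g_new - g
        if y @ s > 1e-12:                   # curvature condition
            B = B + np.outer(y, y) / (y @ s) \
                  - np.outer(B @ s, B @ s) / (s @ B @ s)
        x, g = x_new, g_new
    return x

# Example: Rosenbrock function
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
x_star = bfgs(f, grad, x0=[-1.2, 1.0])
```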
Constrained optimization:
$$\text{minimize}_x\; f(x) \quad \text{subject to } c_i(x) \le 0 \text{ for all } i$$
Typical constraint classes: linear, $\langle w_i, x\rangle + b_i \le 0$; quadratic, $x^\top Q x + b^\top x \le c$ with $Q \succeq 0$; semidefinite, $M \succeq 0$ or $M_0 + \sum_i x_i M_i \succeq 0$.
[figure: separating hyperplane $\{x \mid \langle w, x\rangle + b = 0\}$ with margin hyperplanes $\{x \mid \langle w, x\rangle + b = -1\}$ and $\{x \mid \langle w, x\rangle + b = +1\}$, classes $y_i = -1$ and $y_i = +1$, and margin width $\frac{2}{\|w\|}$]
Note: $\langle w, x_1\rangle + b = +1$ and $\langle w, x_2\rangle + b = -1$ imply $\langle w, x_1 - x_2\rangle = 2$, hence the projection of $x_1 - x_2$ onto $\frac{w}{\|w\|}$ has length $\frac{2}{\|w\|}$.
Hard-margin SVM:
$$\text{minimize}_{w,b}\; \frac{1}{2}\|w\|^2 \quad \text{subject to } y_i\left[\langle w, x_i\rangle + b\right] \ge 1$$
For points on the margin, $\langle w, x_1\rangle + b = 1$ and $\langle w, x_2\rangle + b = -1$, hence $\langle w, x_1 - x_2\rangle = 2$ and
$$\left\langle \frac{w}{\|w\|},\; x_1 - x_2 \right\rangle = \frac{2}{\|w\|}.$$
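A quick numerical check of the margin formula, assuming scikit-learn is available: a large $C$ approximates the hard-margin problem, and the learned $w$ should give margin $2/\|w\|$. The data here is a made-up separable toy set.

```python
import numpy as np
from sklearn.svm import SVC

# Hard-margin SVM approximated with a large C.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
svm = SVC(kernel='linear', C=1e6).fit(X, y)
w, b = svm.coef_.ravel(), svm.intercept_[0]
print('margin =', 2.0 / np.linalg.norm(w))   # 2/||w|| as derived above
```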
Lagrange function:
$$L(x, \alpha) := f(x) + \sum_{i=1}^{n} \alpha_i c_i(x) \quad \text{where } \alpha_i \ge 0$$
Saddle point condition:
$$L(x^*, \alpha) \le L(x^*, \alpha^*) \le L(x, \alpha^*)$$
The first inequality gives $(\alpha_i - \alpha_i^*)\, c_i(x^*) \le 0$ for all $\alpha_i \ge 0$; letting $\alpha_i \to \infty$ forces $c_i(x^*) \le 0$ (feasibility), and setting $\alpha_i = 0$ forces $\alpha_i^* c_i(x^*) = 0$ (complementary slackness). The second inequality then shows that $x^*$ is optimal: for feasible $x$,
$$L(x^*, \alpha^*) = f(x^*) \le L(x, \alpha^*) = f(x) + \sum_i \alpha_i^* c_i(x) \le f(x).$$
Constraint qualifications:
- Slater: there exists some $x$ such that $c_i(x) < 0$ for all $i$.
- Karlin: for all nonnegative $\alpha$ there exists some $x$ such that $\sum_i \alpha_i c_i(x) \le 0$.
- Strict constraint qualification: the feasible region contains at least two distinct elements, and there exists an $x \in X$ such that all $c_i(x)$ are strictly convex at $x$ with respect to $X$.
Karush-Kuhn-Tucker conditions at the saddle point:
$$\partial_x L(x^*, \alpha^*) = \partial_x f(x^*) + \sum_i \alpha_i^*\, \partial_x c_i(x^*) = 0 \quad \text{(saddle point in } x^*\text{)}$$
$$\partial_{\alpha_i} L(x^*, \alpha^*) = c_i(x^*) \le 0 \quad \text{(saddle point in } \alpha^*\text{)}$$
$$\sum_i \alpha_i^* c_i(x^*) = 0 \quad \text{(vanishing KKT gap)}$$
This yields an algorithm for solving optimization problems: solve for the saddle point and the KKT conditions, as in the sketch below.
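A tiny worked instance of that recipe, assuming sympy is available; the problem ($\min x^2$ subject to $1 - x \le 0$) and all variable names are made up for illustration.

```python
import sympy as sp

# Minimize f(x) = x^2 subject to c(x) = 1 - x <= 0 (i.e. x >= 1).
x, a = sp.symbols('x alpha', real=True)
f = x**2
c = 1 - x
L = f + a * c                            # Lagrangian L(x, alpha)

stationarity = sp.Eq(sp.diff(L, x), 0)   # saddle point in x
complementarity = sp.Eq(a * c, 0)        # vanishing KKT gap
sol = sp.solve([stationarity, complementarity], [x, a], dict=True)

# Keep solutions with alpha >= 0 and c(x) <= 0 (dual/primal feasibility)
kkt = [s for s in sol if s[a] >= 0 and c.subs(x, s[x]) <= 0]
print(kkt)   # [{x: 1, alpha: 2}] -> minimizer x* = 1 with f(x*) = 1
```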
The KKT conditions certify optimality: for feasible $x$,
$$f(x) - f(x^*) \ge \left[\partial_x f(x^*)\right]^\top (x - x^*) \quad \text{(by convexity)}$$
$$= -\sum_i \alpha_i^* \left[\partial_x c_i(x^*)\right]^\top (x - x^*) \quad \text{(by saddle point in } x^*\text{)}$$
$$\ge -\sum_i \alpha_i^* \left(c_i(x) - c_i(x^*)\right) \quad \text{(by convexity)}$$
$$= -\sum_i \alpha_i^* c_i(x) \quad \text{(by vanishing KKT gap)} \quad \ge 0.$$
Linear programming:
$$\text{minimize}_x\; c^\top x \quad \text{subject to } Ax + d \le 0$$
Lagrange function and saddle point conditions:
$$L(x, \alpha) = c^\top x + \alpha^\top (Ax + d)$$
$$\partial_x L(x, \alpha) = A^\top \alpha + c = 0, \qquad \partial_\alpha L(x, \alpha) = Ax + d \le 0, \qquad 0 = \alpha^\top(Ax + d), \qquad 0 \le \alpha$$
Substituting $A^\top\alpha + c = 0$ into $L$ leaves the dual problem
$$\text{maximize}_\alpha\; d^\top \alpha \quad \text{subject to } A^\top \alpha + c = 0 \text{ and } \alpha \ge 0.$$
Note that the dual of a linear program is again a linear program, with the roles of $c$ and $d$ (and of $A$ and $A^\top$) exchanged.
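A numerical check of this primal-dual pair, assuming scipy is available; the particular $c$, $A$, $d$ are made-up toy data, and scipy's `linprog` convention $A_{ub} x \le b_{ub}$ means we pass $-d$ on the right-hand side.

```python
import numpy as np
from scipy.optimize import linprog

# Primal:  min c^T x   s.t.  A x + d <= 0
# Dual:    max d^T a   s.t.  A^T a + c = 0,  a >= 0
c = np.array([1.0, 2.0])
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
d = np.array([0.0, 0.0, -4.0])          # encodes x >= 0, x1 + x2 <= 4

primal = linprog(c, A_ub=A, b_ub=-d, bounds=[(None, None)] * 2)

# Dual as a minimization of -d^T a with equality constraints
dual = linprog(-d, A_eq=A.T, b_eq=-c, bounds=[(0, None)] * 3)

# Strong duality: the optimal values coincide
assert np.isclose(primal.fun, -dual.fun)
```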
Quadratic programming:
$$\text{minimize}_x\; \frac{1}{2}x^\top Q x + c^\top x \quad \text{subject to } Ax + d \le 0$$
$$L(x, \alpha) = \frac{1}{2}x^\top Q x + c^\top x + \alpha^\top(Ax + d)$$
$$\partial_x L(x, \alpha) = Qx + A^\top\alpha + c = 0, \qquad \partial_\alpha L(x,\alpha) = Ax + d \le 0, \qquad 0 = \alpha^\top(Ax+d), \qquad 0 \le \alpha$$
Solving $Qx + A^\top\alpha + c = 0$ for $x$ and substituting back,
$$L(x, \alpha) = -\frac{1}{2}x^\top Q x + \alpha^\top d = -\frac{1}{2}\left(A^\top\alpha + c\right)^\top Q^{-1}\left(A^\top\alpha + c\right) + \alpha^\top d = -\frac{1}{2}\alpha^\top A Q^{-1} A^\top \alpha + \alpha^\top\left[d - AQ^{-1}c\right] - \frac{1}{2}c^\top Q^{-1} c$$
subject to $\alpha \ge 0$. Dropping the constant and flipping the sign, the dual is
$$\text{minimize}_\alpha\; \frac{1}{2}\alpha^\top A Q^{-1} A^\top \alpha + \alpha^\top\left[AQ^{-1}c - d\right] \quad \text{subject to } \alpha \ge 0.$$
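A sketch of solving the QP through its dual, assuming $Q$ is invertible; `qp_dual` and the projected-gradient inner loop are illustrative choices (the simple $\alpha \ge 0$ constraint makes the projection trivial), not a production QP solver.

```python
import numpy as np

def qp_dual(Q, c, A, d, eta=None, iters=5000):
    """Solve min_x 0.5 x^T Q x + c^T x s.t. A x + d <= 0 via the dual
    min_a 0.5 a^T (A Q^{-1} A^T) a + a^T (A Q^{-1} c - d), a >= 0,
    using projected gradient descent on a."""
    Qinv = np.linalg.inv(Q)
    P = A @ Qinv @ A.T                    # dual quadratic term
    q = A @ Qinv @ c - d                  # dual linear term
    if eta is None:
        eta = 1.0 / np.linalg.norm(P, 2)  # step 1/L from dual smoothness
    a = np.zeros(len(d))
    for _ in range(iters):
        a = np.maximum(0.0, a - eta * (P @ a + q))  # project onto a >= 0
    x = -Qinv @ (A.T @ a + c)   # recover primal: Qx + A^T a + c = 0
    return x, a

# Example: min 0.5||x||^2 + c^T x  s.t.  -x <= 0 (i.e. x >= 0)
Q = np.eye(2); c = np.array([1.0, -1.0])
A = -np.eye(2); d = np.zeros(2)
x_star, a_star = qp_dual(Q, c, A, d)      # expect x* = (0, 1)
```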
Maximum a posteriori estimation:
$$\text{minimize}_\theta\; -\sum_{i=1}^{m} \log p(x_i|\theta) - \log p(\theta)$$
equivalently, for an exponential-family likelihood with Gaussian prior,
$$\text{minimize}_\theta\; \sum_{i=1}^{m}\left[g(\theta) - \langle\phi(x_i), \theta\rangle\right] + \frac{1}{2\sigma^2}\|\theta\|^2$$
and for supervised problems with loss $l$
$$\text{minimize}_\theta\; \sum_{i=1}^{m} l\left(y_i, \langle\phi(x_i), \theta\rangle\right) + \frac{1}{2\sigma^2}\|\theta\|^2.$$
General template:
$$\text{minimize}_\theta\; \sum_{i=1}^{m} l_i(\theta) + \lambda\,\Omega[\theta]$$
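One concrete instance of this template as code, with logistic loss and a quadratic regularizer; the loss/regularizer pair is an illustrative choice, and any convex pair fits the same shape.

```python
import numpy as np

def regularized_risk(theta, X, y, lam):
    """Template instance: sum_i l_i(theta) + lam * Omega[theta] with
    logistic loss l_i(theta) = log(1 + exp(-y_i <x_i, theta>)) and
    Omega[theta] = 0.5 ||theta||^2."""
    margins = y * (X @ theta)
    loss = np.sum(np.logaddexp(0.0, -margins))   # sum_i l_i(theta)
    return loss + lam * 0.5 * (theta @ theta)    # + lam * Omega[theta]
```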
Regularized Risk Minimization:
$$\text{minimize}_w\; R_{\mathrm{emp}}[w] + \lambda\,\Omega[w]$$
Taylor approximation for $R_{\mathrm{emp}}[w]$ (a lower bound, by convexity):
$$R_{\mathrm{emp}}[w] \ge R_{\mathrm{emp}}[w_t] + \langle w - w_t, \partial_w R_{\mathrm{emp}}[w_t]\rangle = \langle a_t, w\rangle + b_t$$
where $a_t = \partial_w R_{\mathrm{emp}}[w_{t-1}]$ and $b_t = R_{\mathrm{emp}}[w_{t-1}] - \langle a_t, w_{t-1}\rangle$. Bundle bound:
$$R_{\mathrm{emp}}[w] \ge R_t[w] := \max_{i \le t}\; \langle a_i, w\rangle + b_i$$
The regularizer $\Omega[w]$ solves stability problems.
Initialize $t = 0$, $w_0 = 0$, $a_0 = 0$, $b_0 = 0$
repeat
  Find minimizer $w_t := \operatorname{argmin}_w R_t(w) + \Omega[w]$
  Compute gradient $a_{t+1}$ and offset $b_{t+1}$.
  Increment $t \leftarrow t + 1$.
until $\epsilon_t \le \epsilon$
Convergence monitor: $\epsilon_t = R_{t+1}[w_t] - R_t[w_t]$. Since $R_{t+1}[w_t] = R_{\mathrm{emp}}[w_t]$ (the Taylor approximation is exact at $w_t$) we have
$$R_{t+1}[w_t] + \Omega[w_t] \ge \min_w\; R_{\mathrm{emp}}[w] + \Omega[w] \ge R_t[w_t] + \Omega[w_t],$$
so $\epsilon_t$ bounds the gap to the optimum.
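A runnable sketch of this bundle iteration, assuming scipy is available. The epigraph-form SLSQP subproblem, the `bmrm`/`risk_and_grad` names, and the toy hinge-loss data are illustrative choices, not the QP-dual solver discussed below.

```python
import numpy as np
from scipy.optimize import minimize

def bmrm(risk_and_grad, dim, lam=0.1, eps=1e-4, max_iter=100):
    """Bundle-method sketch for min_w R_emp[w] + lam/2 ||w||^2.

    risk_and_grad(w) returns (R_emp[w], grad R_emp[w]). Each iteration
    adds the cutting plane <a_t, w> + b_t and minimizes the piecewise
    linear lower bound plus the regularizer (epigraph form)."""
    w = np.zeros(dim)
    A, b = [], []
    for _ in range(max_iter):
        R, g = risk_and_grad(w)
        A.append(g); b.append(R - g @ w)   # cutting plane at current w
        Am, bm = np.array(A), np.array(b)

        def obj(z):                         # z = (w, xi)
            return 0.5 * lam * (z[:-1] @ z[:-1]) + z[-1]
        cons = {'type': 'ineq',
                'fun': lambda z: z[-1] - (Am @ z[:-1] + bm)}  # xi >= cuts
        res = minimize(obj, np.append(w, R), constraints=[cons],
                       method='SLSQP')
        w, Rt = res.x[:-1], res.x[-1]

        R_new, _ = risk_and_grad(w)
        if R_new - Rt <= eps:   # gap eps_t = R_emp[w_t] - R_t[w_t]
            break
    return w

# Example: hinge-loss risk on a toy dataset
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def hinge_risk(w):
    m = 1 - y * (X @ w)
    active = m > 0
    R = np.mean(np.maximum(m, 0))
    g = -(y[active, None] * X[active]).sum(axis=0) / len(y)
    return R, g

w_star = bmrm(hinge_risk, dim=2)
```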
Good news: the dual optimization problem for $\Omega[w] = \frac{1}{2}\|w\|_2^2$ is a quadratic program regardless of the choice of the empirical risk $R_{\mathrm{emp}}[w]$. Details:
$$\text{minimize}_\beta\; \frac{1}{2\lambda}\beta^\top A A^\top \beta - \beta^\top b \quad \text{subject to } \beta_i \ge 0 \text{ and } \|\beta\|_1 = 1$$
The primal coefficient is given by $w = -\lambda^{-1} A^\top \beta$. General result: use the Fenchel-Legendre dual of $\Omega[w]$, e.g. $\|\cdot\|_1 \to \|\cdot\|_\infty$. Very cheap variant: one can even use a simple line search for the update (almost as good).
- Parallelization: the empirical risk is a sum of many terms (MapReduce); the gradient is a sum of many terms, gathered from the cluster. Possible even for multivariate performance scores. Data stays local; combine data from competing entities.
- Solver independent of loss: no need to change the solver for a new loss.
- Loss independent of solver/regularizer: add a new regularizer without re-implementing the loss.
- Line search variant: optimization does not require a QP solver at all. Update along the gradient direction in the dual; we only need inner products on gradients.
Theorem: the number of iterations to reach $\epsilon$ precision is bounded by
$$n \le \log_2\frac{R_{\mathrm{emp}}[0]}{G^2} + \frac{8G^2}{\epsilon} - 4.$$
For a smooth loss with Hessian bounded by $H^*$, reaching any $\epsilon \le H^*/2$ takes at most
$$n \le \log_2\frac{R_{\mathrm{emp}}[0]}{4G^2} + 4\max\left[0,\; 1 - \frac{8G^2 H^*}{\lambda}\right] - 4H^*$$
steps.
Advantages: linear convergence for smooth loss; for non-smooth loss almost as good in practice (as long as the loss is smooth on a coarse scale); does not require a primal line search.
Proof sketch:
- Duality argument: the dual of $R_i[w] + \lambda\Omega[w]$ lower-bounds the minimum of the regularized risk $R_{\mathrm{emp}}[w] + \lambda\Omega[w]$, while $R_{i+1}[w_i] + \lambda\Omega[w_i]$ is an upper bound. Show that the gap $\gamma_i := R_{i+1}[w_i] - R_i[w_i]$ vanishes.
- Dual improvement: lower-bound the increase in the dual problem in terms of $\gamma_i$ and the subgradient $\partial_w\left[R_{\mathrm{emp}}[w] + \lambda\Omega[w]\right]$. For unbounded Hessian, $\delta\gamma = O(\gamma^2)$; for bounded Hessian, $\delta\gamma = O(\gamma)$.
- Convergence: solve the difference equation in $\gamma_t$ to get the desired result.
Stochastic gradient descent: the empirical risk is an expectation over the data,
$$\frac{1}{m}\sum_{i=1}^{m} l\left(y_i, \langle\phi(x_i), \theta\rangle\right) = \mathbf{E}_{i\sim\{1,\dots,m\}}\left[l\left(y_i, \langle\phi(x_i), \theta\rangle\right)\right],$$
so we may update with one randomly drawn term at a time. With the projection $\pi_X(\theta) = \operatorname{argmin}_{x\in X}\|x - \theta\|$, the updates are
$$\theta_{t+1} \leftarrow \theta_t - \eta_t\,\partial_\theta\, l\left(y_t, \langle\phi(x_t), \theta_t\rangle\right) \qquad \text{or, constrained,} \qquad \theta_{t+1} \leftarrow \pi_X\left[\theta_t - \eta_t\,\partial_\theta\, l\left(y_t, \langle\phi(x_t), \theta_t\rangle\right)\right].$$
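A projected-SGD sketch matching the updates above, with $X$ taken to be a Euclidean ball so the projection is closed-form; the step schedule $\eta_t = t^{-1/2}$ and $\eta$-weighted averaging follow the analysis below. All names and the toy data are illustrative.

```python
import numpy as np

def projected_sgd(data, theta0, grad_loss, radius=10.0, T=10_000):
    """Projected SGD: theta <- pi_X[theta - eta_t * grad l_t], where
    X = {theta : ||theta|| <= radius}, so pi_X rescales onto the ball.
    Returns the eta-weighted parameter average theta_bar."""
    theta = np.asarray(theta0, dtype=float)
    avg, eta_sum = np.zeros_like(theta), 0.0
    rng = np.random.default_rng(0)
    for t in range(1, T + 1):
        x, y = data[rng.integers(len(data))]   # sample (x_t, y_t)
        eta = 1.0 / np.sqrt(t)                 # eta_t = O(t^{-1/2})
        theta = theta - eta * grad_loss(theta, x, y)
        norm = np.linalg.norm(theta)
        if norm > radius:                      # project onto X
            theta *= radius / norm
        avg += eta * theta; eta_sum += eta     # eta-weighted average
    return avg / eta_sum

# Example: squared loss l = 0.5 (y - <x, theta>)^2
grad_sq = lambda th, x, y: -(y - x @ th) * x
data = [(np.array([1.0, 0.0]), 2.0), (np.array([0.0, 1.0]), -1.0)]
theta_bar = projected_sgd(data, theta0=[0.0, 0.0], grad_loss=grad_sq)
```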
Convergence (from Nesterov and Vial):
$$\mathbf{E}_{\bar\theta}\left[l(\bar\theta)\right] - l^* \le \frac{R^2 + L^2\sum_{t=0}^{T-1}\eta_t^2}{2\sum_{t=0}^{T-1}\eta_t}$$
where $l(\theta) = \mathbf{E}_{(x,y)}\left[l(y, \langle\phi(x), \theta\rangle)\right]$, $l^* = \inf_{\theta\in X} l(\theta)$, and
$$\bar\theta = \frac{\sum_{t=0}^{T-1}\eta_t\,\theta_t}{\sum_{t=0}^{T-1}\eta_t} \quad \text{(parameter average)}.$$
Let $\theta^* \in \operatorname{argmin}_{\theta\in X} l(\theta)$ and set $r_t := \|\theta^* - \theta_t\|$, with the initial distance bounded by $r_0 \le R$ and gradients bounded by $\|g_t\| \le L$.
Proof step: since projections are contractions,
$$r_{t+1}^2 = \left\|\pi_X\left[\theta_t - \eta_t g_t\right] - \theta^*\right\|^2 \le \left\|\theta_t - \eta_t g_t - \theta^*\right\|^2 = r_t^2 + \eta_t^2\|g_t\|^2 - 2\eta_t\langle\theta_t - \theta^*, g_t\rangle$$
hence, taking expectations,
$$\mathbf{E}\left[r_{t+1}^2 - r_t^2\right] \le \eta_t^2 L^2 + 2\eta_t\left[l^* - \mathbf{E}[l(\theta_t)]\right] \quad \text{(by convexity)}$$
$$\le \eta_t^2 L^2 + 2\eta_t\left[l^* - \mathbf{E}[l(\bar\theta)]\right] \quad \text{(by convexity, after averaging)}$$
Telescoping and rearranging recovers the bound above:
$$\mathbf{E}_{\bar\theta}\left[l(\bar\theta)\right] - l^* \le \frac{R^2 + L^2\sum_{t=0}^{T-1}\eta_t^2}{2\sum_{t=0}^{T-1}\eta_t}.$$
For a constant step size $\eta = \frac{R}{L\sqrt{T}}$ this gives
$$\mathbf{E}_{\bar\theta}[l(\bar\theta)] - l^* \le \frac{R\left[1 + 1/T\right]L}{2\sqrt{T}} < \frac{LR}{\sqrt{T}},$$
and for decreasing step sizes $\eta_t = O(t^{-\frac{1}{2}})$
$$\mathbf{E}_{\bar\theta}[l(\bar\theta)] - l^* = O\!\left(\frac{\log T}{\sqrt{T}}\right).$$
Strongly convex case: if each loss satisfies
$$l_i(\theta') \ge l_i(\theta) + \langle\partial_\theta l_i(\theta), \theta' - \theta\rangle + \frac{1}{2}\lambda\|\theta - \theta'\|^2,$$
then the same argument gives
$$r_{t+1}^2 \le r_t^2 + \eta_t^2\|g_t\|^2 - 2\eta_t\langle\theta_t - \theta^*, g_t\rangle \le r_t^2 + \eta_t^2 L^2 - 2\eta_t\left[l_t(\theta_t) - l_t(\theta^*)\right] - \lambda\eta_t r_t^2$$
hence
$$\mathbf{E}\left[r_{t+1}^2\right] \le (1 - \lambda\eta_t)\,\mathbf{E}\left[r_t^2\right] - 2\eta_t\left[\mathbf{E}[l(\theta_t)] - l^*\right].$$
With the geometric parameter average
$$\bar\theta = \frac{1 - \sigma}{1 - \sigma^T}\sum_{t=0}^{T-1}\sigma^{T-1-t}\,\theta_t$$
one obtains
$$l(\bar\theta) - l^* \le \frac{2L^2}{\lambda T}\log\left[1 + \frac{\lambda R T^{\frac{1}{2}}}{2L}\right] \quad \text{for } \eta = \frac{2}{\lambda T}\log\left[1 + \frac{\lambda R T^{\frac{1}{2}}}{2L}\right].$$
Summary: projected stochastic gradient descent $\theta_{t+1} \leftarrow \pi_X\left[\theta_t - \eta_t\,\partial_\theta\, l(y_t, \langle\phi(x_t), \theta_t\rangle)\right]$ converges at rate $O(t^{-\frac{1}{2}})$ in general and at rate $O(t^{-1})$ for strongly convex losses.
Integer linear programming:
$$\text{minimize}_x\; c^\top x \quad \text{subject to } Ax \le b \text{ and } x \in \mathbb{Z}^n$$
Example: assignment (perfect matching)
$$\text{maximize}_\pi\; \sum_{ij}\pi_{ij}C_{ij} \quad \text{subject to } \pi_{ij}\in\{0,1\},\ \sum_i \pi_{ij} = 1 \text{ and } \sum_j \pi_{ij} = 1$$
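This particular integer program is tractable: the assignment polytope has integral vertices, so the Hungarian method solves it in polynomial time. A quick check assuming scipy is available, with a made-up cost matrix:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

C = np.array([[4.0, 1.0, 3.0],
              [2.0, 0.0, 5.0],
              [3.0, 2.0, 2.0]])
rows, cols = linear_sum_assignment(C, maximize=True)
print(list(zip(rows, cols)), C[rows, cols].sum())   # optimal matching pi
```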
Randomized strategies: often a random sample contains a good element with high probability. The probability that $n$ independent draws all miss the top $\epsilon$-quantile is
$$\Pr\left\{F\left[\max_i x_i\right] < 1 - \epsilon\right\} = (1 - \epsilon)^n,$$
hence $n = \log\epsilon / \log(1 - \epsilon) \le -\epsilon^{-1}\log\epsilon$ samples suffice to find a good element with probability at least $1 - \epsilon$. Related tricks amplify the probability of keeping a good element (e.g. the Count-Min sketch and the mean-median theorem).
For web search results we might have many individually relevant pages, but if we can show only 4 we should probably pick a diverse subset rather than near-duplicates.
Submodularity (diminishing returns): $f(A \cup C) - f(A) \ge f(B \cup C) - f(B)$ for $A \subseteq B$.
Often NP-hard, even to find a tight approximation.
Constrained submodular maximization:
$$\max_{X \in \mathcal{X}}\; f(X) \quad \text{subject to } |X| \le k$$
Greedy strategy: repeatedly add the element $x$ maximizing $f(X \cup \{x\})$; see the sketch below.
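A minimal greedy sketch for the cardinality-constrained problem above; the coverage function and all names are made-up illustrations. For monotone submodular $f$, greedy is known to achieve a $(1 - 1/e)$ approximation (Nemhauser et al.).

```python
def greedy_max(f, ground_set, k):
    """Greedy maximization of f under the constraint |X| <= k:
    repeatedly add the element with the largest marginal gain
    f(X + {x}) - f(X)."""
    X = set()
    for _ in range(k):
        best = max((x for x in ground_set if x not in X),
                   key=lambda x: f(X | {x}) - f(X))
        X.add(best)
    return X

# Example: coverage, a classic submodular function
regions = {1: {'a', 'b'}, 2: {'b', 'c'}, 3: {'c', 'd', 'e'}, 4: {'a'}}
cover = lambda X: len(set().union(*(regions[x] for x in X))) if X else 0
print(greedy_max(cover, regions.keys(), k=2))   # {1, 3}: covers 5 items
```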
Further reading:
http://dl.acm.org/citation.cfm?id=1377347
http://books.nips.cc/papers/files/nips20/NIPS2007_0699.pdf
http://www.mcs.anl.gov/research/projects/tao/
http://martin.zinkevich.org/publications/ratliff_nathan_2007_3.pdf
http://dl.acm.org/citation.cfm?id=1273598
http://arxiv.org/abs/0911.0491
http://pages.cs.wisc.edu/~brecht/papers/hogwildTR.pdf