SLIDE 1
Convex Optimization by Stephen Boyd and Lieven Vandenberghe
Optimization for Machine Learning by Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright
Convex Optimization, Chapter 1: Introduction
SLIDE 2
SLIDE 3
Mathematical optimization
(mathematical) optimization problem

minimize   f₀(x)
subject to fᵢ(x) ≤ bᵢ,  i = 1, . . . , m

x = (x₁, . . . , xₙ): optimization variables
f₀ : Rⁿ → R: objective function
fᵢ : Rⁿ → R, i = 1, . . . , m: constraint functions

optimal solution x∗ has the smallest value of f₀ among all vectors that satisfy the constraints
SLIDE 4
Example
portfolio optimization
variables: amounts invested in different assets
constraints: budget, max./min. investment per asset, minimum return
objective: overall risk or return variance

data fitting
variables: model parameters
constraints: prior information, parameter limits
objective: measure of misfit or prediction error
SLIDE 5
Solving optimization problems
general optimization problem
very difficult to solve
methods involve some compromise, e.g., very long computation time, or not always finding the solution

exceptions: certain problem classes can be solved efficiently and reliably
least-squares problems
linear programming problems
convex optimization problems
SLIDE 6
Least-squares
minimize ‖Ax − b‖₂²

solving least-squares problems
analytical solution: x∗ = (AᵀA)⁻¹Aᵀb
reliable and efficient algorithms and software
computation time proportional to n²k (A ∈ Rᵏˣⁿ); less if structured
a mature technology

using least-squares
least-squares problems are easy to recognize
a few standard techniques increase flexibility (e.g., including weights, adding regularization terms)
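As a quick illustration, a minimal NumPy sketch (with hypothetical random data for A and b) comparing the analytical formula above to a library least-squares solver:

```python
import numpy as np

# Hypothetical data: A ∈ R^{k×n}, b ∈ R^k, drawn at random for illustration.
rng = np.random.default_rng(0)
k, n = 100, 10
A = rng.standard_normal((k, n))
b = rng.standard_normal(k)

# Library solver: uses an orthogonal factorization, which is preferred in
# practice over forming (A^T A)^{-1} A^T b explicitly.
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

# The analytical formula from the slide gives the same minimizer.
x_formula = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(x_star, x_formula))  # True
```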
SLIDE 7
Linear programming
minimize   cᵀx
subject to aᵢᵀx ≤ bᵢ,  i = 1, . . . , m

solving linear programs
no analytical formula for solution
reliable and efficient algorithms and software
computation time proportional to n²m if m ≥ n; less with structure
a mature technology

using linear programming
a few standard tricks used to convert problems into linear programs (e.g., problems involving ℓ₁- or ℓ∞-norms, piecewise-linear functions)
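A minimal sketch of solving such an LP with SciPy's linprog, on hypothetical random data (the box bounds on x are added only to keep the random instance bounded):

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical instance of: minimize c^T x subject to a_i^T x ≤ b_i.
rng = np.random.default_rng(0)
m, n = 20, 5
A_ub = rng.standard_normal((m, n))
b_ub = rng.standard_normal(m) + 5.0      # shifted so x = 0 is feasible
c = rng.standard_normal(n)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(-1, 1)] * n)
print(res.status, res.x)                 # status 0 means success
```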
SLIDE 8
Chebyshev approximation problem
minimize maxᵢ₌₁,...,ₖ |aᵢᵀx − bᵢ|

equivalent linear program:
minimize    t
subject to  aᵢᵀx − t ≤ bᵢ,   i = 1, . . . , k
           −aᵢᵀx − t ≤ −bᵢ,  i = 1, . . . , k
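A sketch of this reformulation in code, stacking the two constraint families into a single system over the variables z = (x, t), with hypothetical random data:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
k, n = 30, 4
A = rng.standard_normal((k, n))
b = rng.standard_normal(k)

# Variables z = (x, t); the objective picks out t.
c = np.r_[np.zeros(n), 1.0]
A_ub = np.vstack([np.hstack([A, -np.ones((k, 1))]),    #  a_i^T x − t ≤  b_i
                  np.hstack([-A, -np.ones((k, 1))])])  # −a_i^T x − t ≤ −b_i
b_ub = np.r_[b, -b]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + 1))
x, t = res.x[:n], res.x[-1]
print(t, np.max(np.abs(A @ x - b)))      # the two values agree at the optimum
```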
SLIDE 9
Convex optimization problem
minimize   f₀(x)
subject to fᵢ(x) ≤ bᵢ,  i = 1, . . . , m

objective and constraint functions are convex:
fᵢ(αx + βy) ≤ αfᵢ(x) + βfᵢ(y)  if α + β = 1, α ≥ 0, β ≥ 0

includes least-squares problems and linear programs as special cases
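In practice such problems are usually stated through a modeling layer; a minimal CVXPY sketch (hypothetical data, a constrained least-squares instance) showing the objective/constraint structure above:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)

x = cp.Variable(10)
objective = cp.Minimize(cp.sum_squares(A @ x - b))  # convex objective f0
constraints = [cp.norm(x, 1) <= 2, x >= -1]         # convex constraints fi
prob = cp.Problem(objective, constraints)
prob.solve()
print(prob.value)
```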
SLIDE 10
Convex optimization problem
solving convex optimization problems
no analytical solution
reliable and efficient algorithms
computation time (roughly) proportional to max{n³, n²m, F}, where F is the cost of evaluating the fᵢ's and their first and second derivatives
almost a technology

using convex optimization
often difficult to recognize
many tricks for transforming problems into convex form
surprisingly many problems can be solved via convex optimization
SLIDE 11
Nonlinear optimization
traditional techniques for general nonconvex problems involve compromises

local optimization methods (nonlinear programming)
find a point that minimizes f₀ among feasible points near it
fast, can handle large problems
require initial guess
provide no information about distance to (global) optimum

global optimization methods
find the (global) solution
worst-case complexity grows exponentially with problem size

these algorithms are often based on solving convex subproblems
SLIDE 12
Optimization and Machine Learning
soft-margin support vector machine (SVM), primal form:

minimize over w, b, ξ:  (1/2)wᵀw + C Σᵢ₌₁ᵐ ξᵢ
subject to  yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ,  ξᵢ ≥ 0,  1 ≤ i ≤ m

its dual:

minimize over α:  (1/2)αᵀYXᵀXYα − αᵀ1
subject to  Σᵢ yᵢαᵢ = 0,  0 ≤ αᵢ ≤ C

where Y = Diag(y₁, . . . , yₘ) and X = [x₁, . . . , xₘ] ∈ Rⁿˣᵐ

w = Σᵢ₌₁ᵐ yᵢαᵢxᵢ
f(x) = sgn(wᵀx + b)
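A minimal sketch on toy two-class data using scikit-learn's SVC, which solves (a variant of) this dual internally; its dual_coef_ attribute stores the products yᵢαᵢ for the support vectors, so w can be recovered as above:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated Gaussian blobs, labels ±1 (hypothetical toy data).
rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((50, 2)) + 2,
               rng.standard_normal((50, 2)) - 2])
y = np.r_[np.ones(50), -np.ones(50)]

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# w = sum_i y_i alpha_i x_i over the support vectors.
w = clf.dual_coef_ @ clf.support_vectors_
print(w, clf.intercept_)
```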
SLIDE 13
More powerful classifiers allow kernels: Kᵢⱼ := ⟨φ(xᵢ), φ(xⱼ)⟩ is the kernel matrix.

w = Σᵢ₌₁ᵐ yᵢαᵢφ(xᵢ)
f(x) = sgn[Σᵢ₌₁ᵐ yᵢαᵢK(xᵢ, x) + b]
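A sketch of the kernel trick with an RBF kernel on a toy problem: φ(x) is never formed, only kernel evaluations K(xᵢ, x); the last lines reproduce f(x) by hand from the fitted dual coefficients:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# A circle-separable toy problem: not linearly separable in R^2.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0)

clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

# f(x) = sgn( sum_i y_i alpha_i K(x_i, x) + b ), summed over support vectors.
K = rbf_kernel(X[:5], clf.support_vectors_, gamma=1.0)
f = np.sign(K @ clf.dual_coef_.ravel() + clf.intercept_)
print(np.array_equal(f, clf.predict(X[:5])))  # True
```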
SLIDE 14
Themes of algorithms
General techniques for convex quadratic programming have limited appeal here because of (1) large problem size and (2) an ill-conditioned Hessian. Two themes recur:

1. decomposition: rather than computing a step in all components of α at once, these methods focus on a relatively small subset and fix the other components
2. regularized solutions
SLIDE 15
decomposition approach
Early approach: works with a subset B ⊂ {1, 2, . . . , s}, whose size is assumed to exceed the number of nonzero components of α; replaces one element of B at each iteration and then re-solves the reduced problem.

Sequential minimal optimization (SMO): works with just two components of α at each iteration, reducing each QP subproblem to triviality (a sketch of the two-variable step follows).
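A minimal sketch of the analytic two-variable step at the heart of SMO (a Platt-style update; a full solver would also maintain the bias b and a working-set selection heuristic):

```python
import numpy as np

def smo_pair_update(i, j, alpha, y, K, b, C):
    """One SMO step: re-optimize the dual over (alpha_i, alpha_j) in closed
    form, holding all other components fixed."""
    # Prediction errors E_k = f(x_k) − y_k under the current (alpha, b).
    f = K @ (alpha * y) + b
    E_i, E_j = f[i] - y[i], f[j] - y[j]

    # Feasible segment for alpha_j implied by sum_k y_k alpha_k = 0
    # together with the box constraints 0 ≤ alpha ≤ C.
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])

    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]  # curvature along the segment
    if eta <= 0 or L >= H:
        return alpha                          # degenerate pair; pick another

    alpha = alpha.copy()
    a_j = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, L, H)
    alpha[i] += y[i] * y[j] * (alpha[j] - a_j)
    alpha[j] = a_j
    return alpha
```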
SLIDE 16
decomposition approach
SVMlight: uses a linearization of the objective around the current point to choose the working set B as the indices most likely to give descent, with a fixed limit on the size of B. Shrinking reduces the workload further by eliminating computation associated with components of α that appear to be at their lower or upper bounds. The computation per iteration is more complex, however, so further computational savings are needed.

Interior-point methods: hardly efficient on large problems (due to ill-conditioning of the kernel matrix). Remedies:
replace the Hessian with a low-rank approximation (VVᵀ, where V ∈ Rᵐˣʳ for r ≪ m), as sketched below
coordinate relaxation procedures
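A sketch of the low-rank replacement K ≈ VVᵀ via a truncated eigendecomposition on a small hypothetical kernel matrix (on large problems a Nyström-type sampling scheme is the usual cheaper route to V):

```python
import numpy as np

# Hypothetical RBF kernel matrix on m = 200 random points.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)

# Keep the r largest eigenpairs: K ≈ V V^T with V ∈ R^{m×r}, r ≪ m.
r = 10
vals, vecs = np.linalg.eigh(K)               # ascending eigenvalues
V = vecs[:, -r:] * np.sqrt(np.maximum(vals[-r:], 0.0))
print(np.linalg.norm(K - V @ V.T) / np.linalg.norm(K))  # small relative error
```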
SLIDE 17
regularized solutions
1. Regularized solutions generalize better.
2. Regularized solutions provide simplicity (w is sparse).

minimize over w:  φγ(w) = f(w) + γ r(w)

for the SVM: f(w) = Σᵢ ξᵢ, r(w) = (1/2)wᵀw, γ = 1/C

trade-off between minimizing the misclassification error and reducing ‖w‖²
SLIDE 18
Applications
Image denoising:
r: total-variation (TV) norm
result: large areas of constant intensity (a cartoon-like appearance)

Matrix completion:
W is the matrix variable
regularizer: the nuclear norm, the sum of the singular values of W
this regularizer favors matrices with low rank

Lasso procedure:
r: ℓ₁-norm
f: least squares (sketch below)
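A minimal lasso sketch with scikit-learn on hypothetical data with a sparse ground truth, illustrating how the ℓ₁ regularizer zeroes out most coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))
w_true = np.zeros(50)
w_true[:5] = 1.0                                   # only 5 active features
b = A @ w_true + 0.01 * rng.standard_normal(100)

model = Lasso(alpha=0.1).fit(A, b)
print(np.count_nonzero(model.coef_))               # few nonzeros survive
```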
SLIDE 19
Algorithm
1. Gradient and subgradient methods:

wₖ₊₁ ← wₖ − δₖgₖ,  gₖ ∈ ∂φγ(wₖ)

these methods have sublinear convergence, e.g. φγ(wₖ) − φγ(w∗) ≤ O(1/√k) for the subgradient method
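A sketch of the subgradient method on the ℓ₁-regularized least-squares objective, with a diminishing step size δₖ = δ₀/√k (the data and constants are illustrative):

```python
import numpy as np

# phi(w) = 0.5 ||Aw − b||^2 + gamma ||w||_1 on hypothetical data.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
b = rng.standard_normal(100)
gamma, w = 0.1, np.zeros(20)

for k in range(1, 2001):
    g = A.T @ (A @ w - b) + gamma * np.sign(w)  # g ∈ ∂phi(w)
    w -= (0.01 / np.sqrt(k)) * g                # delta_k = 0.01 / sqrt(k)

print(0.5 * np.sum((A @ w - b) ** 2) + gamma * np.abs(w).sum())
```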
SLIDE 20
Algorithm
2. Second approach (a prox-linear step):

wₖ₊₁ := arg min over w of  (w − wₖ)ᵀ∇f(wₖ) + γ r(w) + (1/2μ)‖w − wₖ‖₂²

It works well for f with Lipschitz continuous gradient. Sublinear rate of convergence O(1/k); in special cases O(1/k²).
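For r(w) = ‖w‖₁ the arg min above has a closed form (soft-thresholding), which gives the ISTA iteration; a sketch with μ = 1/L, where L is the Lipschitz constant of ∇f:

```python
import numpy as np

def soft_threshold(v, tau):
    # argmin_w  tau*||w||_1 + 0.5*||w − v||^2, applied componentwise
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
b = rng.standard_normal(100)
gamma = 0.1
mu = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L with L = sigma_max(A)^2

w = np.zeros(20)
for _ in range(500):
    grad = A.T @ (A @ w - b)           # ∇f(w) for f(w) = 0.5||Aw − b||^2
    w = soft_threshold(w - mu * grad, mu * gamma)

print(np.count_nonzero(w))             # the prox step returns a sparse iterate
```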