Convex Optimization by Stephen Boyd and Lieven Vandenberghe


SLIDE 1

Convex Optimization by Stephen Boyd and Lieven Vandenberghe.
Optimization for Machine Learning by Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright.

SLIDE 2

Convex Optimization, Chapter 1: Introduction

• mathematical optimization
• least-squares and linear programming
• convex optimization
• nonlinear optimization

SLIDE 3

Mathematical optimization

(mathematical) optimization problem

minimize f0(x)
subject to fi(x) ≤ bi, i = 1, . . . , m

• x = (x1, . . . , xn): optimization variables
• f0 : Rn → R: objective function
• fi : Rn → R, i = 1, . . . , m: constraint functions
• optimal solution x∗ has smallest value of f0 among all vectors that satisfy the constraints

SLIDE 4

Example

portfolio optimization
• variables: amounts invested in different assets
• constraints: budget, max./min. investment per asset, minimum return
• objective: overall risk or return variance

data fitting
• variables: model parameters
• constraints: prior information, parameter limits
• objective: measure of misfit or prediction error
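As a concrete instance of the portfolio example, here is a minimal numerical sketch using scipy; the returns, covariance, and per-asset limits are made-up illustrative data, not from the slides.

```python
# Minimal portfolio sketch: minimize risk (return variance) subject to
# a budget, per-asset limits, and a minimum expected return.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 5                                   # number of assets (illustrative)
mu = rng.uniform(0.02, 0.10, n)         # assumed expected returns
A = rng.normal(size=(n, n))
Sigma = A @ A.T / n                     # assumed covariance (PSD)
r_min = mu.mean()                       # feasible minimum-return target

risk = lambda x: x @ Sigma @ x          # objective: return variance
cons = [
    {"type": "eq",   "fun": lambda x: x.sum() - 1.0},     # budget
    {"type": "ineq", "fun": lambda x: mu @ x - r_min},    # minimum return
]
bounds = [(0.0, 0.4)] * n               # max./min. investment per asset

res = minimize(risk, np.full(n, 1.0 / n), bounds=bounds, constraints=cons)
print(res.x, mu @ res.x)
```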
SLIDE 5

Solving optimization problems

general optimization problem
• very difficult to solve
• methods involve some compromise, e.g., very long computation time, or not always finding the solution

exceptions: certain problem classes can be solved efficiently and reliably
• least-squares problems
• linear programming problems
• convex optimization problems

SLIDE 6

Least-squares

minimize ||Ax − b||₂²

solving least-squares problems
• analytical solution: x∗ = (ATA)−1ATb
• reliable and efficient algorithms and software
• computation time proportional to n²k (A ∈ Rk×n); less if structured
• a mature technology

using least-squares
• least-squares problems are easy to recognize
• a few standard techniques increase flexibility, e.g., including weights, adding regularization terms (sketched below)
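A minimal numpy sketch of the above on synthetic data: the analytical formula, a numerically preferable library solver, and the regularization trick mentioned on the slide.

```python
# Least-squares sketch: analytical solution x* = (A^T A)^{-1} A^T b
# versus a QR/SVD-based solver; data are synthetic.
import numpy as np

rng = np.random.default_rng(1)
k, n = 100, 10                                      # A in R^{k x n}, k >= n
A = rng.normal(size=(k, n))
b = rng.normal(size=k)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)        # normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)     # library solver
print(np.allclose(x_normal, x_lstsq))               # True

# One standard flexibility trick: add a Tikhonov regularization term,
# i.e. minimize ||Ax - b||^2 + lam * ||x||^2.
lam = 0.1
x_reg = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)
```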

SLIDE 7

Linear programming

minimize cTx
subject to aiTx ≤ bi, i = 1, . . . , m

solving linear programs
• no analytical formula for solution
• reliable and efficient algorithms and software
• computation time proportional to n²m if m ≥ n; less with structure
• a mature technology

using linear programming
• a few standard tricks used to convert problems into linear programs (e.g., problems involving l1- or l∞-norms, piecewise-linear functions)
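A small sketch of this problem class with scipy's linprog; the data are made up, and box bounds are added only to keep the random toy instance bounded.

```python
# LP sketch: minimize c^T x subject to a_i^T x <= b_i.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
n, m = 4, 8
c = rng.normal(size=n)
A_ub = rng.normal(size=(m, n))          # rows are the a_i^T
b_ub = rng.uniform(1.0, 2.0, size=m)    # x = 0 is feasible

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(-5, 5)] * n)
print(res.status, res.x)                # 0 means optimal
```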

SLIDE 8

Chebyshev approximation problem

minimize maxi=1,...,k |aiTx − bi|

equivalent linear program:

minimize t
subject to aiTx − t ≤ bi, i = 1, . . . , k
−aiTx − t ≤ −bi, i = 1, . . . , k
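The reduction can be carried out directly: stack the variable z = (x, t) and feed both constraint families to an LP solver. A sketch on synthetic data:

```python
# Chebyshev approximation via LP: minimize t over z = (x, t).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
k, n = 30, 5
A = rng.normal(size=(k, n))                 # rows a_i^T
b = rng.normal(size=k)

c = np.r_[np.zeros(n), 1.0]                 # objective picks out t
A_ub = np.block([[A, -np.ones((k, 1))],     #  a_i^T x - t <= b_i
                 [-A, -np.ones((k, 1))]])   # -a_i^T x - t <= -b_i
b_ub = np.r_[b, -b]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + 1))
x, t = res.x[:n], res.x[-1]
print(t, np.max(np.abs(A @ x - b)))         # t equals the minimax residual
```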

SLIDE 9

Convex optimization problem

minimize f0(x)
subject to fi(x) ≤ bi, i = 1, . . . , m

• objective and constraint functions are convex:
fi(αx + βy) ≤ αfi(x) + βfi(y) if α + β = 1, α ≥ 0, β ≥ 0
• includes least-squares problems and linear programs as special cases
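A quick numerical check of the defining inequality, here for the convex function f(x) = ||Ax − b||₂² and an arbitrary convex combination (illustrative data):

```python
# Verify f(ax + by) <= a f(x) + b f(y) for a convex f and a + b = 1.
import numpy as np

rng = np.random.default_rng(4)
A, b = rng.normal(size=(6, 3)), rng.normal(size=6)
f = lambda x: np.sum((A @ x - b) ** 2)      # a least-squares objective

x, y = rng.normal(size=3), rng.normal(size=3)
alpha = 0.3
beta = 1 - alpha
print(f(alpha * x + beta * y) <= alpha * f(x) + beta * f(y))  # True
```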

SLIDE 10

Convex optimization problem

solving convex optimization problems
• no analytical solution
• reliable and efficient algorithms
• computation time (roughly) proportional to max{n³, n²m, F}, where F is the cost of evaluating the fi's and their first and second derivatives
• almost a technology

using convex optimization
• often difficult to recognize
• many tricks for transforming problems into convex form
• surprisingly many problems can be solved via convex optimization
SLIDE 11

Nonlinear optimization

traditional techniques for general nonconvex problems involve compromises

local optimization methods (nonlinear programming)
• find a point that minimizes f0 among feasible points near it
• fast, can handle large problems
• require initial guess
• provide no information about distance to (global) optimum (illustrated below)

global optimization methods
• find the (global) solution
• worst-case complexity grows exponentially with problem size

these algorithms are often based on solving convex subproblems
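The "initial guess" compromise is easy to see numerically: a local method on a nonconvex one-dimensional function (chosen here purely for illustration) returns different minimizers, with different objective values, from different starting points.

```python
# A local method converges to whichever basin the start falls in.
import numpy as np
from scipy.optimize import minimize

def f(x):
    x = x[0]                            # scipy passes a length-1 array
    return (x**2 - 1.0) ** 2 + 0.3 * x  # nonconvex, two local minima

for x0 in (-2.0, 2.0):
    res = minimize(f, x0)
    print(x0, res.x, res.fun)           # different minimizers and values
```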

SLIDE 12

Optimization and Machine Learning

minimize over w, b, ξ: (1/2)wTw + C Σi ξi
subject to yi(wTxi + b) ≥ 1 − ξi, ξi ≥ 0, 1 ≤ i ≤ m.

Its dual:

minimize over α: (1/2)αTYXTXYα − αT1
subject to Σi yiαi = 0, 0 ≤ αi ≤ C

where Y = Diag(y1, . . . , ym) and X = [x1, . . . , xm] ∈ Rn×m. The primal solution is recovered as

w = Σi αiyixi

and the resulting classifier is

f(x) = sgn(wTx + b)
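One way to see the primal-dual relationship concretely: scikit-learn's linear SVC exposes the products yiαi for the support vectors as dual_coef_, so w can be reassembled from them. A sketch on synthetic two-cluster data:

```python
# Rebuild w = sum_i alpha_i y_i x_i from a fitted dual solution.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = np.r_[rng.normal(size=(20, 2)) + 2, rng.normal(size=(20, 2)) - 2]
y = np.r_[np.ones(20), -np.ones(20)]

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w = clf.dual_coef_ @ clf.support_vectors_   # sum_i (alpha_i y_i) x_i
print(np.allclose(w, clf.coef_))            # True
```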

SLIDE 13

More powerful classifiers allow kernels. Kij := ⟨φ(xi), φ(xj)⟩ is the kernel matrix, and

w = Σi αiyiφ(xi)

f(x) = sgn(Σi αiyiK(xi, x) + b)
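The same check in the kernelized case, assuming an RBF kernel: the decision values can be rebuilt by hand from K(xi, x), the fitted dual coefficients, and b.

```python
# f(x) depends on the data only through K(x_i, x).
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] * X[:, 1])              # not linearly separable

clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
K = rbf_kernel(X, clf.support_vectors_, gamma=1.0)      # K(x, x_i)
decision = K @ clf.dual_coef_.ravel() + clf.intercept_  # sum_i a_i y_i K + b
print(np.allclose(decision, clf.decision_function(X)))  # True
```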

SLIDE 14

Themes of algorithms

General techniques for convex quadratic programming have limited appeal here, because of (1) large problem size and (2) an ill-conditioned Hessian. Two themes recur:

1 decomposition: rather than computing a step in all components of α at once, these methods focus on a relatively small subset and fix the other components

2 regularized solutions

SLIDE 15

decomposition approach

Early approach: works with a subset B ⊂ {1, 2, . . . , s} whose size is assumed to exceed the number of nonzero components of α; replaces one element of B at each iteration and then re-solves the reduced problem. The sequential minimal optimization (SMO) method works with just two components of α at each iteration, reducing each QP subproblem to triviality (see the sketch below).
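A bare-bones sketch of the SMO idea, following the "simplified SMO" recipe (random second index, no sophisticated working-set selection); smo_simple is our illustrative name, not a library routine, and this is a teaching sketch rather than a production solver.

```python
# Each iteration touches only two components of alpha, solving that
# 2-variable QP in closed form and clipping to the box [0, C].
import numpy as np

def smo_simple(K, y, C=1.0, tol=1e-4, max_passes=20, seed=0):
    rng = np.random.default_rng(seed)
    m = len(y)
    alpha, b = np.zeros(m), 0.0
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(m):
            Ei = (alpha * y) @ K[:, i] + b - y[i]       # prediction error
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = rng.choice([k for k in range(m) if k != i])
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                if y[i] != y[j]:                        # box for alpha_j
                    L, H = max(0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]   # 2nd derivative
                if L == H or eta >= 0:
                    continue
                alpha[j] = np.clip(aj_old - y[j] * (Ei - Ej) / eta, L, H)
                if abs(alpha[j] - aj_old) < 1e-7:
                    continue
                alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])
                # keep the offset b consistent with the updated pair
                b1 = b - Ei - y[i] * (alpha[i] - ai_old) * K[i, i] \
                     - y[j] * (alpha[j] - aj_old) * K[i, j]
                b2 = b - Ej - y[i] * (alpha[i] - ai_old) * K[i, j] \
                     - y[j] * (alpha[j] - aj_old) * K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0

    return alpha, b

# usage on a toy linearly separable set, with a linear kernel K = X X^T
rng = np.random.default_rng(1)
X = np.r_[rng.normal(size=(15, 2)) + 2, rng.normal(size=(15, 2)) - 2]
y = np.r_[np.ones(15), -np.ones(15)]
alpha, b = smo_simple(X @ X.T, y)
print((alpha > 1e-8).sum(), "support vectors")
```

Production solvers differ mainly in how the pair is chosen (maximal KKT violation rather than at random), which is what makes the method fast in practice.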

SLIDE 16

decomposition approach

SVMlight: uses a linearization of the objective around the current point to choose the working set B as the indices most likely to give descent, with a fixed limit on the size of B. Shrinking reduces the workload further by eliminating computation associated with components of α that appear to be at their lower or upper bounds; the bookkeeping is more complex, however, and further computational savings are needed.

Interior-point methods: hardly efficient on large problems (due to ill-conditioning of the kernel matrix). One remedy is to replace the Hessian with a low-rank matrix (VVT, where V ∈ Rm×r for r ≪ m), as sketched below.

Coordinate relaxation procedures.
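A sketch of the low-rank replacement K ≈ VVT, here built from a truncated eigendecomposition of an RBF kernel matrix on synthetic data (a Nystrom-style column sampling would avoid forming K at all).

```python
# Approximate an m x m kernel (Hessian) matrix by V V^T, V in R^{m x r}.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5))
sqdist = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sqdist)                       # RBF kernel matrix

r = 20                                          # r << m = 200
vals, vecs = eigh(K, subset_by_index=[len(K) - r, len(K) - 1])  # top-r pairs
V = vecs * np.sqrt(np.clip(vals, 0, None))      # K approx V V^T
print(np.linalg.norm(K - V @ V.T) / np.linalg.norm(K))  # small relative error
```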

SLIDE 17

regularized solutions

1 Regularized solutions generalize better
2 Regularized solutions provide simplicity (w is sparse)

minimize over w: φγ(w) = f(w) + γr(w)

with f(w) = Σi ξi, r(w) = (1/2)wTw, and γ = 1/C: a trade-off between minimizing the misclassification error and reducing ||w||²
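The trade-off can be seen by sweeping C = 1/γ on synthetic data: as C shrinks (γ grows), the fitted ||w|| shrinks too, at the cost of more margin violations. A sketch:

```python
# Larger gamma = 1/C puts more weight on r(w) = (1/2) w^T w.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(8)
X = np.r_[rng.normal(size=(50, 2)) + 1, rng.normal(size=(50, 2)) - 1]
y = np.r_[np.ones(50), -np.ones(50)]

for C in (100.0, 1.0, 0.01):                 # gamma = 1/C grows downward
    w = SVC(kernel="linear", C=C).fit(X, y).coef_.ravel()
    print(C, np.linalg.norm(w))              # ||w|| shrinks as C decreases
```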

SLIDE 18

Applications

Image denoising:
• r: total-variation (TV) norm
• result: large areas of constant intensity (a cartoon-like appearance)

Matrix completion:
• W is the matrix variable
• regularizer: nuclear norm (sum of singular values of W)
• this regularizer favors matrices with low rank (illustrated below)

Lasso procedure:
• r: l1-norm
• f: least squares
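Why the nuclear norm favors low rank, in a few lines: its proximal operator soft-thresholds the singular values, zeroing out the small ones (illustrative matrix):

```python
# Proximal operator of tau * nuclear norm = singular-value thresholding.
import numpy as np

rng = np.random.default_rng(11)
W = rng.normal(size=(8, 6))
U, s, Vt = np.linalg.svd(W, full_matrices=False)
tau = 1.0
W_prox = U @ np.diag(np.maximum(s - tau, 0)) @ Vt   # small sigma_i -> 0
print(np.linalg.matrix_rank(W), np.linalg.matrix_rank(W_prox))
```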

SLIDE 19

Algorithm

1 Gradient and subgradient methods:

wk+1 ← wk − δkgk

These methods ensure sublinear convergence: φγ(wk) − φγ(w∗) ≤ O(1/√k)
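A subgradient-method sketch for the regularized SVM objective above, with the classic δk = 1/(γk) step size on synthetic data (the bias term is omitted to keep the sketch short):

```python
# Subgradient method for phi(w) = sum_i max(0, 1 - y_i w^T x_i)
#                               + (gamma/2) ||w||^2.
import numpy as np

rng = np.random.default_rng(9)
X = np.r_[rng.normal(size=(50, 2)) + 1, rng.normal(size=(50, 2)) - 1]
y = np.r_[np.ones(50), -np.ones(50)]
gamma = 0.1

w = np.zeros(2)
for k in range(1, 2001):
    margin = y * (X @ w)
    active = margin < 1                 # points contributing to hinge loss
    g = -(y[active, None] * X[active]).sum(axis=0) + gamma * w  # subgradient
    w -= (1.0 / (gamma * k)) * g        # delta_k = 1 / (gamma k)
print(w, np.mean(np.maximum(0, 1 - y * (X @ w))))
```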

SLIDE 20

Algorithm

1 Second approach (proximal gradient):

wk+1 := argminw (w − wk)T∇f(wk) + γr(w) + (1/(2μ))||w − wk||₂²

It works well for f with Lipschitz continuous gradient. Sublinear rate of convergence O(1/k); in special cases (accelerated variants) O(1/k²).

Some methods use second-order information.
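For r(w) = ||w||1 (the Lasso case from slide 18), the subproblem above has a closed-form soft-thresholding solution, giving the ISTA / proximal-gradient iteration. A sketch with μ = 1/L, where L is the Lipschitz constant of ∇f:

```python
# ISTA for minimize 0.5 ||Aw - b||^2 + gamma ||w||_1 on synthetic data.
import numpy as np

rng = np.random.default_rng(10)
A = rng.normal(size=(60, 100))
x_true = np.zeros(100)
x_true[:5] = rng.normal(size=5)             # sparse ground truth
b = A @ x_true
gamma = 0.5
mu = 1.0 / np.linalg.norm(A, 2) ** 2        # 1/L, L = ||A||_2^2

w = np.zeros(100)
for _ in range(500):
    grad = A.T @ (A @ w - b)                # gradient of f(w)
    z = w - mu * grad                       # forward (gradient) step
    w = np.sign(z) * np.maximum(np.abs(z) - mu * gamma, 0)  # prox step
print(np.nonzero(np.abs(w) > 1e-6)[0])      # recovers a sparse support
```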