Introduction to Convex Optimization for Machine Learning, John Duchi (PowerPoint presentation)



SLIDE 1

Introduction to Convex Optimization for Machine Learning

John Duchi

University of California, Berkeley

Practical Machine Learning, Fall 2009

Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 1 / 53

SLIDE 2

Outline

◮ What is Optimization
◮ Convex Sets
◮ Convex Functions
◮ Convex Optimization Problems
◮ Lagrange Duality
◮ Optimization Algorithms
◮ Take Home Messages

SLIDE 3

What is Optimization (and why do we care?)

SLIDES 4-5

What is Optimization?

◮ Finding the minimizer of a function subject to constraints:

    minimize_x f0(x)
    s.t. fi(x) ≤ 0, i ∈ {1, . . . , k}
         hj(x) = 0, j ∈ {1, . . . , l}

◮ Example: Stock market. “Minimize variance of return subject to getting at least $50.”

SLIDE 6

Why do we care?

Optimization is at the heart of many (most practical?) machine learning algorithms.

◮ Linear regression:

    minimize_w ‖Xw − y‖²

◮ Classification (logistic regression or SVM):

    minimize_w Σ_{i=1}^n log(1 + exp(−yi xiᵀ w))

  or

    minimize_w ½‖w‖² + C Σ_{i=1}^n ξi   s.t. ξi ≥ 1 − yi xiᵀ w, ξi ≥ 0.

SLIDE 7

We still care...

◮ Maximum likelihood estimation:

    maximize_θ Σ_{i=1}^n log p_θ(xi)

◮ Collaborative filtering:

    minimize_w Σ_{i≺j} log(1 + exp(wᵀxi − wᵀxj))

◮ k-means:

    minimize_{µ1,...,µk} J(µ) = Σ_{j=1}^k Σ_{i∈Cj} ‖xi − µj‖²

◮ And more (graphical models, feature selection, active learning, control)

SLIDES 8-9

But generally speaking...

We’re screwed.

◮ Local (non-global) minima of f0
◮ All kinds of constraints (even restricting to continuous functions):

    h(x) = sin(2πx) = 0

[figure: surface plot of a wildly non-convex function with many local minima]

◮ Go for convex problems!

SLIDE 10

Convex Sets

SLIDE 11

Convex Sets

Definition. A set C ⊆ Rⁿ is convex if for x, y ∈ C and any α ∈ [0, 1],

    αx + (1 − α)y ∈ C.

[figure: two points x, y joined by a segment lying inside a convex set]

SLIDES 12-14

Examples

◮ All of Rⁿ (obvious)
◮ Non-negative orthant R₊ⁿ: let x ≥ 0, y ≥ 0 (componentwise); clearly αx + (1 − α)y ≥ 0.
◮ Norm balls: let ‖x‖ ≤ 1, ‖y‖ ≤ 1; then

    ‖αx + (1 − α)y‖ ≤ ‖αx‖ + ‖(1 − α)y‖ = α‖x‖ + (1 − α)‖y‖ ≤ 1.

SLIDE 15

Examples

◮ Affine subspaces: if Ax = b and Ay = b, then

    A(αx + (1 − α)y) = αAx + (1 − α)Ay = αb + (1 − α)b = b.

[figure: a plane through points x1, x2, x3]

SLIDE 16

More examples

◮ Arbitrary intersections of convex sets: let Ci be convex for i ∈ I and C = ∩_{i∈I} Ci; then

    x ∈ C, y ∈ C ⇒ αx + (1 − α)y ∈ Ci for all i ∈ I, so αx + (1 − α)y ∈ C.

SLIDE 17

More examples

◮ PSD matrices, a.k.a. the positive semidefinite cone S₊ⁿ ⊂ Rⁿˣⁿ. A ∈ S₊ⁿ means xᵀAx ≥ 0 for all x ∈ Rⁿ. For A, B ∈ S₊ⁿ,

    xᵀ(αA + (1 − α)B)x = αxᵀAx + (1 − α)xᵀBx ≥ 0.

◮ On right:

    S₊² = { [x z; z y] : x ≥ 0, y ≥ 0, xy ≥ z² }

[figure: boundary of the 2×2 PSD cone in (x, y, z) coordinates]

SLIDE 18

Convex Functions

SLIDE 19

Convex Functions

Definition. A function f : Rⁿ → R is convex if for x, y ∈ dom f and any α ∈ [0, 1],

    f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y).

[figure: the chord from (x, f(x)) to (y, f(y)), with αf(x) + (1 − α)f(y) lying above the graph of f]

SLIDE 20

First order convexity conditions

Theorem. Suppose f : Rⁿ → R is differentiable. Then f is convex if and only if for all x, y ∈ dom f,

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x).

[figure: the tangent line f(x) + ∇f(x)ᵀ(y − x) at (x, f(x)) lying below f(y)]

SLIDE 21

Actually, more general than that

Definition. The subgradient set, or subdifferential set, ∂f(x) of f at x is

    ∂f(x) = { g : f(y) ≥ f(x) + gᵀ(y − x) for all y }.

Theorem. f : Rⁿ → R is convex if and only if it has a non-empty subdifferential set everywhere.

[figure: a supporting line f(x) + gᵀ(y − x) at a kink (x, f(x)) of f]
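The subgradient inequality is easy to sanity-check numerically. A minimal sketch (not from the deck): for f(x) = |x| at the kink x = 0, every g ∈ [−1, 1] is a subgradient; the candidate slopes and the grid of test points are arbitrary illustrative choices.

```python
import numpy as np

# Verify the subgradient inequality f(y) >= f(x) + g*(y - x) for f(x) = |x|
# at the kink x = 0, where every g in [-1, 1] is a subgradient.
# The candidate g's and the grid of y's are illustrative choices.
f = abs
ok = all(
    f(y) >= f(0.0) + g * (y - 0.0) - 1e-12   # small tolerance for float error
    for g in [-1.0, -0.3, 0.0, 0.7, 1.0]
    for y in np.linspace(-2.0, 2.0, 9)
)
# ok is True; by contrast g = 1.5 would violate the inequality at y = 2,
# since |2| = 2 < 1.5 * 2 = 3.
```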

SLIDE 22

Second order convexity conditions

Theorem. Suppose f : Rⁿ → R is twice differentiable. Then f is convex if and only if for all x ∈ dom f, ∇²f(x) ⪰ 0.

[figure: a convex bowl-shaped quadratic surface]

SLIDE 23

Convex sets and convex functions

Definition. The epigraph of a function f is the set of points epi f = {(x, t) : f(x) ≤ t}.

◮ epi f is convex if and only if f is convex.
◮ Sublevel sets {x : f(x) ≤ a} are convex for convex f.

[figure: the epigraph epi f of a convex f, with a sublevel set at height a]

SLIDES 24-25

Examples

◮ Linear/affine functions: f(x) = bᵀx + c.
◮ Quadratic functions: f(x) = ½xᵀAx + bᵀx + c for A ⪰ 0. For regression:

    ½‖Xw − y‖² = ½wᵀXᵀXw − yᵀXw + ½yᵀy.

SLIDES 26-28

More examples

◮ Norms (like ℓ1 or ℓ2 for regularization):

    ‖αx + (1 − α)y‖ ≤ ‖αx‖ + ‖(1 − α)y‖ = α‖x‖ + (1 − α)‖y‖.

◮ Composition with an affine function, f(Ax + b):

    f(A(αx + (1 − α)y) + b) = f(α(Ax + b) + (1 − α)(Ay + b))
                            ≤ αf(Ax + b) + (1 − α)f(Ay + b)

◮ Log-sum-exp (via ∇²f(x) PSD):

    f(x) = log Σ_{i=1}^n exp(xi)

SLIDE 29

Important examples in Machine Learning

◮ SVM loss:

    f(w) = [1 − yi xiᵀ w]₊

◮ Binary logistic loss:

    f(w) = log(1 + exp(−yi xiᵀ w))

[figure: plots of the hinge loss [1 − x]₊ and the logistic loss log(1 + eˣ)]

SLIDE 30

Convex Optimization Problems

Definition. An optimization problem is convex if its objective is a convex function, the inequality constraints fi are convex, and the equality constraints hj are affine:

    minimize_x f0(x)      (convex function)
    s.t. fi(x) ≤ 0        (convex sets)
         hj(x) = 0        (affine)

SLIDE 31

It’s nice to be convex

Theorem. If x̂ is a local minimizer of a convex optimization problem, it is a global minimizer.

[figure: a convex function over a convex feasible set, with unique minimizer x*]

SLIDE 32

Even more reasons to be convex

Theorem. For differentiable convex f, ∇f(x) = 0 if and only if x is a global minimizer of f.

Proof.
◮ If ∇f(x) = 0: for any y, f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) = f(x).
◮ If ∇f(x) ≠ 0: there is a direction of descent, so x is not a minimizer.

SLIDE 33

LET’S TAKE A BREAK

SLIDE 34

Lagrange Duality

SLIDE 35

Goals of Lagrange Duality

◮ Get a certificate for optimality of a problem
◮ Remove constraints
◮ Reformulate the problem

SLIDE 36

Constructing the dual

◮ Start with the optimization problem:

    minimize_x f0(x)
    s.t. fi(x) ≤ 0, i ∈ {1, . . . , k}
         hj(x) = 0, j ∈ {1, . . . , l}

◮ Form the Lagrangian using Lagrange multipliers λi ≥ 0, νj ∈ R:

    L(x, λ, ν) = f0(x) + Σ_{i=1}^k λi fi(x) + Σ_{j=1}^l νj hj(x)

◮ Form the dual function:

    g(λ, ν) = inf_x L(x, λ, ν) = inf_x { f0(x) + Σ_{i=1}^k λi fi(x) + Σ_{j=1}^l νj hj(x) }

SLIDE 37

Remarks

◮ The original problem is equivalent to

    minimize_x sup_{λ⪰0, ν} L(x, λ, ν)

◮ The dual problem switches the min and the max:

    maximize_{λ⪰0, ν} inf_x L(x, λ, ν).

SLIDE 38

One Great Property of the Dual

Lemma (Weak Duality). If λ ⪰ 0, then g(λ, ν) ≤ f0(x*).

Proof. We have

    g(λ, ν) = inf_x L(x, λ, ν) ≤ L(x*, λ, ν)
            = f0(x*) + Σ_{i=1}^k λi fi(x*) + Σ_{j=1}^l νj hj(x*) ≤ f0(x*),

since fi(x*) ≤ 0 with λi ≥ 0, and hj(x*) = 0.

SLIDE 39

The Greatest Property of the Dual

Theorem. For reasonable¹ convex problems,

    sup_{λ⪰0, ν} g(λ, ν) = f0(x*).

¹There are conditions, called constraint qualifications, under which this is true.

SLIDE 40

Geometric Look

Minimize ½(x − c − 1)² subject to x² ≤ c.

[figure, left: true function (blue), constraint (green), L(x, λ) for different λ (dotted)]
[figure, right: dual function g(λ) (black), primal optimal value (dotted blue)]

SLIDES 41-43

Intuition

Can interpret duality as linear approximation.

◮ Let I₋(a) = ∞ if a > 0 and 0 otherwise, and let I₀(a) = ∞ unless a = 0, in which case I₀(a) = 0. Rewrite the problem as

    minimize_x f0(x) + Σ_{i=1}^k I₋(fi(x)) + Σ_{j=1}^l I₀(hj(x))

◮ Replace I₋(fi(x)) with λi fi(x), a measure of “displeasure” when λi ≥ 0 and fi(x) > 0; νj hj(x) lower-bounds I₀(hj(x)):

    minimize_x f0(x) + Σ_{i=1}^k λi fi(x) + Σ_{j=1}^l νj hj(x)

SLIDE 44

Example: Linearly constrained least squares

    minimize_x ½‖Ax − b‖²  s.t. Bx = d.

Form the Lagrangian:

    L(x, ν) = ½‖Ax − b‖² + νᵀ(Bx − d)

Take the infimum:

    ∇ₓL(x, ν) = AᵀAx − Aᵀb + Bᵀν = 0  ⇒  x = (AᵀA)⁻¹(Aᵀb − Bᵀν)

A simple unconstrained quadratic problem!

    inf_x L(x, ν) = ½‖A(AᵀA)⁻¹(Aᵀb − Bᵀν) − b‖² + νᵀ(B(AᵀA)⁻¹(Aᵀb − Bᵀν) − d)
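The recipe above can be carried out numerically: solve a small linear system for the multiplier ν that makes x(ν) feasible, then recover x. A sketch (not from the deck; A, B, b, d are illustrative choices with A of full column rank):

```python
import numpy as np

# Linearly constrained least squares via the dual:
# x(nu) = (A^T A)^(-1) (A^T b - B^T nu); choose nu so that B x(nu) = d.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 0.0])
B = np.array([[1.0, 1.0]])     # constraint: x1 + x2 = 2
d = np.array([2.0])

AtA_inv = np.linalg.inv(A.T @ A)
# Solve B (A^T A)^(-1) (A^T b - B^T nu) = d for nu (a tiny linear system).
rhs = B @ AtA_inv @ (A.T @ b) - d
nu = np.linalg.solve(B @ AtA_inv @ B.T, rhs)
x = AtA_inv @ (A.T @ b - B.T @ nu)
# x is feasible (Bx = d) and stationary: A^T A x - A^T b + B^T nu = 0.
```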

SLIDE 45

Example: Quadratically constrained least squares

    minimize_x ½‖Ax − b‖²  s.t. ½‖x‖² ≤ c.

Form the Lagrangian (λ ≥ 0):

    L(x, λ) = ½‖Ax − b‖² + ½λ(‖x‖² − 2c)

Take the infimum:

    ∇ₓL(x, λ) = AᵀAx − Aᵀb + λx = 0  ⇒  x = (AᵀA + λI)⁻¹Aᵀb

    inf_x L(x, λ) = ½‖A(AᵀA + λI)⁻¹Aᵀb − b‖² + (λ/2)‖(AᵀA + λI)⁻¹Aᵀb‖² − λc

A one-variable dual problem!

    g(λ) = −½ bᵀA(AᵀA + λI)⁻¹Aᵀb − λc + ½‖b‖².
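The formulas above are easy to check on a concrete instance. A sketch (not from the deck; A, b, c are illustrative, chosen so the dual optimum sits at λ = 1 and strong duality is visible numerically):

```python
import numpy as np

# Dual of quadratically constrained least squares, following the slide:
# x(lam) = (A^T A + lam*I)^(-1) A^T b and
# g(lam) = -1/2 b^T A (A^T A + lam*I)^(-1) A^T b - lam*c + 1/2 ||b||^2.
A = np.eye(2)
b = np.array([2.0, 0.0])
c = 0.5                      # constraint: 1/2 ||x||^2 <= c, i.e. ||x|| <= 1

def g(lam):
    M = np.linalg.inv(A.T @ A + lam * np.eye(2))
    return -0.5 * b @ A @ M @ A.T @ b - lam * c + 0.5 * b @ b

lam = 1.0                                                  # dual-optimal here
x = np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ b)    # x(1) = (1, 0)
primal = 0.5 * np.linalg.norm(A @ x - b) ** 2              # = 1/2
# g(1) equals the primal optimal value 1/2 (strong duality), and any other
# lam >= 0 gives g(lam) <= 1/2 (weak duality).
```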

SLIDES 46-47

Uses of the Dual

◮ Main use: certificate of optimality (a.k.a. duality gap). If we have a feasible x and know the dual value g(λ, ν), then

    g(λ, ν) ≤ f0(x*) ≤ f0(x)
    ⇒ f0(x*) − f0(x) ≥ g(λ, ν) − f0(x)
    ⇒ f0(x) − f0(x*) ≤ f0(x) − g(λ, ν).

◮ Also used in more advanced primal-dual algorithms (we won’t talk about these).

SLIDE 48

Optimization Algorithms

SLIDE 49

Gradient Descent

The simplest algorithm in the world (almost). Goal:

    minimize_x f(x)

Just iterate

    x_{t+1} = x_t − η_t ∇f(x_t)

where η_t is the stepsize.
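The iteration above fits in a few lines of NumPy. A minimal sketch (not from the deck; the quadratic objective, constant stepsize, and iteration count are illustrative choices):

```python
import numpy as np

# Gradient descent on f(x) = 1/2 ||Ax - b||^2 with a fixed stepsize.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])

def grad(x):
    return A.T @ (A @ x - b)   # gradient of f: A^T (Ax - b)

x = np.zeros(2)
eta = 0.1                      # constant stepsize eta_t = 0.1
for _ in range(500):
    x = x - eta * grad(x)
# The minimizer solves A^T A x = A^T b, i.e. x* = (0.5, 1.0) here.
```

In practice the stepsize would come from a line search (next slides) rather than being fixed by hand.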

SLIDE 50

Single Step Illustration

[figure: f(x) with the linear model f(x_t) − η∇f(x_t)ᵀ(x − x_t) drawn at the point (x_t, f(x_t))]

SLIDE 51

Full Gradient Descent

    f(x) = log(exp(x1 + 3x2 − 0.1) + exp(x1 − 3x2 − 0.1) + exp(−x1 − 0.1))

[figure: gradient descent iterates on the level sets of f]

SLIDES 52-53

Stepsize Selection

How do I choose a stepsize?

◮ Idea 1: exact line search

    η_t = argmin_η f(x − η∇f(x))

  Too expensive to be practical.

◮ Idea 2: backtracking (Armijo) line search. Let α ∈ (0, ½), β ∈ (0, 1). Multiply η by β (η ← βη) until

    f(x − η∇f(x)) ≤ f(x) − αη‖∇f(x)‖².

  Works well in practice.
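The backtracking rule above translates directly into code. A sketch (not from the deck; the test function and the constants α = 0.3, β = 0.5 are illustrative choices within the stated ranges):

```python
import numpy as np

# Backtracking (Armijo) line search, following the slide's rule:
# shrink eta by beta until f(x - eta*g) <= f(x) - alpha*eta*||g||^2.
def backtrack(f, grad_f, x, alpha=0.3, beta=0.5, eta0=1.0):
    g = grad_f(x)
    eta = eta0
    while f(x - eta * g) > f(x) - alpha * eta * np.dot(g, g):
        eta *= beta
    return eta

f = lambda x: 5.0 * np.dot(x, x)      # a steep quadratic, f(x) = 5||x||^2
grad_f = lambda x: 10.0 * x
x = np.array([1.0, 1.0])
eta = backtrack(f, grad_f, x)         # eta = 0.125 for this instance
# The accepted step strictly decreases f.
```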

SLIDE 54

Illustration of Armijo/Backtracking Line Search

[figure: f(x − η∇f(x)) and the line f(x) − αη‖∇f(x)‖², both as functions of the stepsize η]

As a function of the stepsize η, there is clearly a region where f(x − η∇f(x)) lies below the line f(x) − αη‖∇f(x)‖².

SLIDE 55

Newton’s method

Idea: use a second-order approximation to the function:

    f(x + Δx) ≈ f(x) + ∇f(x)ᵀΔx + ½Δxᵀ∇²f(x)Δx

Choose Δx to minimize the approximation:

    Δx = −[∇²f(x)]⁻¹∇f(x)

This is a descent direction:

    ∇f(x)ᵀΔx = −∇f(x)ᵀ[∇²f(x)]⁻¹∇f(x) < 0.
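A Newton step is one linear solve. A sketch (not from the deck; the quadratic f(x) = ½xᵀAx − bᵀx with illustrative A, b, on which a single Newton step is exact):

```python
import numpy as np

# One Newton step: dx = -[hessian]^(-1) gradient.
# For a quadratic f(x) = 1/2 x^T A x - b^T x, one step lands on the minimizer.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])

grad = lambda x: A @ x - b               # gradient of f
hess = lambda x: A                       # Hessian of f (constant here)

x = np.array([5.0, -7.0])                # arbitrary starting point
dx = -np.linalg.solve(hess(x), grad(x))  # solve, don't invert explicitly
x = x + dx
# x now solves Ax = b exactly: Newton is exact on quadratics.
```

For non-quadratic f the step is combined with a line search, as on the previous slides.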

SLIDE 56

Newton step picture

[figure: f and its second-order approximation f̂; the step moves from (x, f(x)) to (x + Δx, f(x + Δx))]

f̂ is the 2nd-order approximation, f is the true function.

SLIDE 57

Convergence of gradient descent and Newton’s method

◮ Strongly convex case: ∇²f(x) ⪰ mI gives “linear convergence”: for some γ ∈ (0, 1),

    f(x_t) − f(x*) ≤ γᵗ,

  so on the order of log(1/ε) iterations suffice: t ≥ log(1/ε)/log(1/γ) ⇒ f(x_t) − f(x*) ≤ ε.

◮ Smooth case: ‖∇f(x) − ∇f(y)‖ ≤ C‖x − y‖ gives

    f(x_t) − f(x*) ≤ K/t²

  (the rate achieved by accelerated first-order methods).

◮ Newton’s method is often faster, especially when f has “long valleys”.

SLIDES 58-59

What about constraints?

◮ Linear constraints Ax = b are easy. For example, in Newton’s method (assuming Ax = b):

    minimize_Δx ∇f(x)ᵀΔx + ½Δxᵀ∇²f(x)Δx  s.t. AΔx = 0.

  The solution Δx satisfies A(x + Δx) = Ax + AΔx = b.

◮ Inequality constraints are a bit tougher:

    minimize_Δx ∇f(x)ᵀΔx + ½Δxᵀ∇²f(x)Δx  s.t. fi(x + Δx) ≤ 0

  is just as hard as the original problem.

SLIDE 60

Logarithmic Barrier Methods

Goal:

    minimize_x f0(x)  s.t. fi(x) ≤ 0, i ∈ {1, . . . , k}

Convert to

    minimize_x f0(x) + Σ_{i=1}^k I₋(fi(x))

Approximate I₋(u) ≈ −t log(−u) for small t:

    minimize_x f0(x) − t Σ_{i=1}^k log(−fi(x))
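A toy instance makes the t → 0 behavior concrete. A sketch (not from the deck; the one-dimensional problem, the closed-form inner minimization, and the t-schedule are all illustrative choices):

```python
# Barrier method sketch: minimize x subject to x >= 1,
# i.e. f0(x) = x and f1(x) = 1 - x <= 0, via the barrier problem
#     minimize x - t*log(x - 1),
# whose minimizer we can write down in closed form, then shrink t.
def barrier_minimizer(t):
    # d/dx [x - t*log(x - 1)] = 1 - t/(x - 1) = 0  =>  x = 1 + t
    return 1.0 + t

for t in [1.0, 0.1, 0.01, 0.001]:
    x = barrier_minimizer(t)
# x -> 1, the constrained optimum, as t -> 0; each barrier iterate is
# strictly feasible (x > 1), which is why these are "interior point" methods.
```

In a real solver each inner problem is minimized by Newton's method rather than in closed form, warm-starting from the previous t.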

SLIDE 61

The barrier function

[figure: I₋(u) (dotted line) and the approximations −t log(−u) for several values of t]

SLIDE 62

Illustration

Minimizing cᵀx subject to Ax ≤ b.

[figure: barrier minimizers for t = 5 and t = 1 inside the polyhedron, with the objective direction c]

SLIDE 63

Subgradient Descent

Really, the simplest algorithm in the world. Goal:

    minimize_x f(x)

Just iterate

    x_{t+1} = x_t − η_t g_t

where η_t is a stepsize and g_t ∈ ∂f(x_t).
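The method works even when f has kinks. A sketch (not from the deck; the objective f(x) = ‖x‖₁, the starting point, and the η_t ∝ 1/√t schedule are illustrative, and sign(x) is one valid choice of subgradient):

```python
import numpy as np

# Subgradient descent on the non-differentiable f(x) = ||x||_1.
# sign(x) is a valid subgradient (at a coordinate where x_i = 0,
# any value in [-1, 1] works; sign gives 0 there).
def subgrad_l1(x):
    return np.sign(x)

x = np.array([3.0, -2.0])
best = np.sum(np.abs(x))          # track the best value seen, as in the
for t in range(1, 2001):          # convergence analysis on later slides
    eta = 1.0 / np.sqrt(t)        # eta_t proportional to 1/sqrt(t)
    x = x - eta * subgrad_l1(x)
    best = min(best, np.sum(np.abs(x)))
# The iterates oscillate around the optimum x* = 0 at the scale of eta_t,
# so "best" approaches the optimal value f(x*) = 0.
```

Note that subgradient steps are not descent steps, which is why the analysis tracks the best iterate rather than the last one.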

SLIDE 64

Why subgradient descent?

◮ Lots of non-differentiable convex functions are used in machine learning:

    f(x) = [1 − aᵀx]₊,  f(x) = ‖x‖₁,  f(X) = Σ_{r=1}^k σ_r(X),

  where σ_r is the rth singular value of X.

◮ Easy to analyze
◮ Do not even need a true subgradient: just need E[g_t] ∈ ∂f(x_t).

SLIDES 65-66

Proof of convergence for subgradient descent

Idea: bound ‖x_{t+1} − x*‖ using the subgradient inequality. Assume that ‖g_t‖ ≤ G.

    ‖x_{t+1} − x*‖² = ‖x_t − ηg_t − x*‖² = ‖x_t − x*‖² − 2η g_tᵀ(x_t − x*) + η²‖g_t‖²

Recall that f(x*) ≥ f(x_t) + g_tᵀ(x* − x_t), so −g_tᵀ(x_t − x*) ≤ f(x*) − f(x_t), and

    ‖x_{t+1} − x*‖² ≤ ‖x_t − x*‖² + 2η[f(x*) − f(x_t)] + η²G².

Then

    f(x_t) − f(x*) ≤ (‖x_t − x*‖² − ‖x_{t+1} − x*‖²)/(2η) + (η/2)G².

SLIDES 67-68

Almost done...

Sum from t = 1 to T:

    Σ_{t=1}^T [f(x_t) − f(x*)] ≤ (1/(2η)) Σ_{t=1}^T [‖x_t − x*‖² − ‖x_{t+1} − x*‖²] + (Tη/2)G²
                               = (1/(2η))‖x_1 − x*‖² − (1/(2η))‖x_{T+1} − x*‖² + (Tη/2)G²

Now let D = ‖x_1 − x*‖, and keep track of the minimum along the run:

    f(x_best) − f(x*) ≤ D²/(2ηT) + (η/2)G².

Set η = D/(G√T), and

    f(x_best) − f(x*) ≤ DG/√T.

SLIDE 69

Extension: projected subgradient descent

Now we have a convex constraint set X. Goal:

    minimize_{x∈X} f(x)

Idea: take subgradient steps, projecting x_t back onto X at every iteration:

    x_{t+1} = Π_X(x_t − ηg_t)

The proof goes through because ‖Π_X(x_t) − x*‖ ≤ ‖x_t − x*‖ if x* ∈ X.

[figure: a point x_t outside X and its projection Π_X(x_t), with x* inside X]
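A sketch of the projected iteration (not from the deck). The slide's example constrains ‖x‖₁ ≤ 1; here the Euclidean ball is used instead because its projection is a one-liner, and the objective and the point c are illustrative choices:

```python
import numpy as np

# Projected subgradient step for: minimize f(x) s.t. x in X,
# with X = {x : ||x||_2 <= 1}, whose projection has a closed form.
def project_l2_ball(x, radius=1.0):
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

# minimize f(x) = 1/2 ||x - c||^2 over the unit ball
c = np.array([3.0, 0.0])
x = np.zeros(2)
for t in range(1, 501):
    g = x - c                                      # gradient of f
    x = project_l2_ball(x - (1.0 / np.sqrt(t)) * g)
# The constrained optimum is c / ||c|| = (1, 0): the closest feasible
# point to c, which the iterates reach and then stay at.
```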

SLIDES 70-77

Projected subgradient example

    minimize_x ½‖Ax − b‖  s.t. ‖x‖₁ ≤ 1

[figure: successive projected subgradient iterates moving along the boundary of the ℓ1 ball]

SLIDES 78-80

Convergence results for (projected) subgradient methods

◮ Any decreasing, non-summable stepsize (η_t → 0, Σ_{t=1}^∞ η_t = ∞) gives

    f(x_avg(t)) − f(x*) → 0.

◮ A slightly less brain-dead analysis than earlier shows that with η_t ∝ 1/√t,

    f(x_avg(t)) − f(x*) ≤ C/√t.

◮ The same convergence holds when g_t is random, i.e. E[g_t] ∈ ∂f(x_t). Example:

    f(w) = ½‖w‖² + C Σ_{i=1}^n [1 − yi xiᵀ w]₊

  Just pick a random training example.
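The stochastic variant on the SVM objective above can be sketched as follows (not from the deck; the two-point dataset, C, seed, and η_t = 1/t schedule are illustrative, and the sampled term is rescaled by n so its expectation is a subgradient of the full objective):

```python
import numpy as np

# Stochastic subgradient for f(w) = 1/2 ||w||^2 + C * sum_i [1 - y_i x_i^T w]_+,
# sampling one training example per step.
rng = np.random.default_rng(0)
X = np.array([[2.0, 0.0], [-2.0, 0.0]])   # two linearly separable points
y = np.array([1.0, -1.0])
C, n = 1.0, 2

w = np.zeros(2)
for t in range(1, 1001):
    i = rng.integers(n)                    # pick a random training example
    # subgradient of 1/2 ||w||^2 + C*n*[1 - y_i x_i^T w]_+ ,
    # whose expectation over i lies in the subdifferential of f
    g = w.copy()
    if y[i] * X[i] @ w < 1.0:              # hinge term is active
        g -= C * n * y[i] * X[i]
    w -= (1.0 / t) * g
# w hovers near the minimizer w* = (0.5, 0) of f for this dataset.
```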

SLIDE 81

Recap

◮ Defined convex sets and functions
◮ Saw why we want optimization problems to be convex (solvable)
◮ Sketched some of Lagrange duality
◮ First order methods are easy and (often) work well

SLIDE 82

Take Home Messages

◮ Many useful problems can be formulated as convex optimization problems
◮ If it is not convex and not an eigenvalue problem, you are out of luck
◮ If it is convex, you are golden