

SLIDE 1

Convex Optimization

DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis
http://www.cims.nyu.edu/~cfgranda/pages/OBDA_fall17/index.html
Carlos Fernandez-Granda

SLIDE 2

Convexity
Differentiable convex functions
Minimizing differentiable convex functions

SLIDE 3

Convex functions

A function f : R^n → R is convex if for any x, y ∈ R^n and any θ ∈ (0, 1)

θ f(x) + (1 − θ) f(y) ≥ f(θx + (1 − θ)y)

A function f is concave if −f is convex

SLIDE 4

Convex functions

[Figure: the chord value θ f(x) + (1 − θ) f(y) between f(x) and f(y) lies above the graph value f(θx + (1 − θ)y)]
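The inequality on the slide can be checked numerically for a specific convex function. This is an illustrative sketch, not part of the slides; the choice f(x) = ||x||² is an assumption made for the example.

```python
import numpy as np

# Check the convexity inequality
#   theta*f(x) + (1-theta)*f(y) >= f(theta*x + (1-theta)*y)
# for the squared Euclidean norm, which is convex.
def f(x):
    return np.sum(x ** 2)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    theta = rng.uniform(0, 1)
    lhs = theta * f(x) + (1 - theta) * f(y)   # chord value
    rhs = f(theta * x + (1 - theta) * y)      # function value at the combination
    assert lhs >= rhs - 1e-12                 # convexity inequality holds
```

For this f the gap lhs − rhs equals θ(1 − θ)||x − y||², which is nonnegative and zero only when x = y.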

SLIDE 5

Linear functions are convex

If f is linear,

f(θx + (1 − θ)y) = θ f(x) + (1 − θ) f(y)

so the convexity inequality holds with equality

SLIDE 7

Strictly convex functions

A function f : R^n → R is strictly convex if for any x ≠ y ∈ R^n and any θ ∈ (0, 1)

θ f(x) + (1 − θ) f(y) > f(θx + (1 − θ)y)

SLIDE 8

Local minima are global

Any local minimum of a convex function is also a global minimum

SLIDE 9

Proof

Let x_loc be a local minimum: for all x ∈ R^n such that ||x − x_loc||_2 ≤ γ,

f(x_loc) ≤ f(x)

Suppose x_loc is not global, so there is a global minimum x_glob with f(x_glob) < f(x_loc)

Choose θ so that x_θ := θ x_loc + (1 − θ) x_glob satisfies ||x_θ − x_loc||_2 ≤ γ. Then

f(x_loc) ≤ f(x_θ)
         = f(θ x_loc + (1 − θ) x_glob)
         ≤ θ f(x_loc) + (1 − θ) f(x_glob)    by convexity of f
         < f(x_loc)                          because f(x_glob) < f(x_loc)

a contradiction

SLIDE 14

Norm

Let V be a vector space; a norm is a function ||·|| from V to R with the following properties

◮ It is homogeneous: for any scalar α and any x ∈ V, ||αx|| = |α| ||x||
◮ It satisfies the triangle inequality: ||x + y|| ≤ ||x|| + ||y||. In particular, ||x|| ≥ 0
◮ ||x|| = 0 implies x = 0

SLIDE 15

Norms are convex

For any x, y ∈ R^n and any θ ∈ (0, 1)

||θx + (1 − θ)y|| ≤ ||θx|| + ||(1 − θ)y||    triangle inequality
                  = θ ||x|| + (1 − θ) ||y||   homogeneity

SLIDE 18

Composition of convex and affine function

If f : R^n → R is convex, then for any A ∈ R^{n×m} and b ∈ R^n

h(x) := f(Ax + b)

is convex

Consequence: f(x) := ||Ax + b|| is convex for any A and b

SLIDE 19

Composition of convex and affine function

h(θx + (1 − θ)y) = f( θ(Ax + b) + (1 − θ)(Ay + b) )
                 ≤ θ f(Ax + b) + (1 − θ) f(Ay + b)    by convexity of f
                 = θ h(x) + (1 − θ) h(y)

SLIDE 23

ℓ0 “norm”

Number of nonzero entries in a vector

Not a norm! It is not homogeneous: ||2x||_0 = ||x||_0 ≠ 2 ||x||_0

Not convex either. Let x := (1, 0) and y := (0, 1); for any θ ∈ (0, 1)

||θx + (1 − θ)y||_0 = 2 > 1 = θ ||x||_0 + (1 − θ) ||y||_0
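The counterexample on the slide can be reproduced directly; this is an illustrative sketch, not part of the slides.

```python
import numpy as np

# The l0 "norm" counts nonzero entries; it violates both homogeneity
# and the convexity inequality, as in the slide's example.
def l0(x):
    return int(np.count_nonzero(x))

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
theta = 0.3

assert l0(2 * x) == l0(x)                      # scaling does not double it
mid = l0(theta * x + (1 - theta) * y)          # both entries nonzero: 2
chord = theta * l0(x) + (1 - theta) * l0(y)    # equals 1 for any theta
assert mid > chord                             # convexity fails
```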

SLIDE 30

Promoting sparsity

Finding sparse vectors consistent with data is often very useful

Toy problem: find t such that

v_t := ( t, t − 1, t − 1 )

is sparse

Strategy: minimize f(t) := ||v_t||

SLIDE 31

Promoting sparsity

[Figure: f(t) = ||v_t|| as a function of t for the ℓ0, ℓ1, ℓ2 and ℓ∞ norms; the ℓ1 cost attains its minimum at the sparse solution t = 1]

SLIDE 32

The rank is not convex

The rank of matrices in R^{n×n}, interpreted as a function from R^{n×n} to R, is not convex

Let X := diag(1, 0) and Y := diag(0, 1). For any θ ∈ (0, 1)

rank(θX + (1 − θ)Y) = 2 > 1 = θ rank(X) + (1 − θ) rank(Y)
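The matrices in this transcription are garbled; the diagonal rank-1 pair below is the standard counterexample consistent with the stated rank values, used here as an assumption for an illustrative check.

```python
import numpy as np

# Two rank-1 matrices whose convex combination has rank 2,
# so the rank violates the convexity inequality.
X = np.diag([1.0, 0.0])
Y = np.diag([0.0, 1.0])
theta = 0.5

lhs = np.linalg.matrix_rank(theta * X + (1 - theta) * Y)   # rank of the average
rhs = theta * np.linalg.matrix_rank(X) + (1 - theta) * np.linalg.matrix_rank(Y)
assert lhs == 2 and rhs == 1.0                             # convexity fails
```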

SLIDE 36

Matrix norms

Frobenius norm:  ||A||_F := sqrt( Σ_{i=1}^{m} Σ_{j=1}^{n} A_{ij}² ) = sqrt( Σ_{i=1}^{min{m,n}} σ_i² )

Operator norm:  ||A|| := max_{ ||x||_2 = 1, x ∈ R^n } ||Ax||_2 = σ_1

Nuclear norm:  ||A||_* := Σ_{i=1}^{min{m,n}} σ_i
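All three norms can be computed from the singular values, matching the formulas above; an illustrative sketch with a random matrix (not from the slides):

```python
import numpy as np

# Compute the three matrix norms on the slide from the singular values.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
s = np.linalg.svd(A, compute_uv=False)   # singular values, descending

fro = np.sqrt(np.sum(s ** 2))    # Frobenius norm
op = s[0]                        # operator norm = largest singular value
nuc = np.sum(s)                  # nuclear norm

# Cross-check against NumPy's built-in matrix norms
assert np.isclose(fro, np.linalg.norm(A, 'fro'))
assert np.isclose(op, np.linalg.norm(A, 2))
assert np.isclose(nuc, np.linalg.norm(A, 'nuc'))
```

Note the ordering σ_1 ≤ ||A||_F ≤ ||A||_*, which holds for any matrix.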

SLIDE 37

Promoting low-rank structure

Finding low-rank matrices consistent with data is often very useful

Toy problem: find t such that

M(t) := ( 0.5 + t   1       1
          0.5       0.5     t
          0.5       1 − t   0.5 )

is low rank

Strategy: minimize f(t) := ||M(t)||

SLIDE 38

Promoting low-rank structure

[Figure: rank, operator norm, Frobenius norm and nuclear norm of M(t) as a function of t]

SLIDE 39

Convexity
Differentiable convex functions
Minimizing differentiable convex functions

SLIDE 40

Gradient

∇f(x) = ( ∂f(x)/∂x[1], ∂f(x)/∂x[2], …, ∂f(x)/∂x[n] )ᵀ

If the gradient exists at every point, the function is said to be differentiable

SLIDE 41

Directional derivative

Encodes the first-order rate of change in a particular direction:

f'_u(x) := lim_{h→0} ( f(x + hu) − f(x) ) / h = ⟨∇f(x), u⟩

where ||u||_2 = 1

SLIDE 42

Direction of maximum variation

∇f is the direction of maximum increase; −∇f is the direction of maximum decrease

|f'_u(x)| = |∇f(x)ᵀ u|
          ≤ ||∇f(x)||_2 ||u||_2    Cauchy–Schwarz inequality
          = ||∇f(x)||_2

Equality holds if and only if u = ± ∇f(x) / ||∇f(x)||_2

SLIDE 46

Gradient

[Figure]

SLIDE 47

First-order approximation

The first-order or linear approximation of f : R^n → R at x is

f^1_x(y) := f(x) + ∇f(x)ᵀ(y − x)

If f is continuously differentiable at x,

lim_{y→x} ( f(y) − f^1_x(y) ) / ||y − x||_2 = 0

SLIDE 48

First-order approximation

[Figure: f(y) and its linear approximation f^1_x(y) at x]

SLIDE 49

Convexity

A differentiable function f : R^n → R is convex if and only if for every x, y ∈ R^n

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)

It is strictly convex if and only if for every y ≠ x

f(y) > f(x) + ∇f(x)ᵀ(y − x)

SLIDE 50

Optimality condition

If f is convex and ∇f(x) = 0, then for any y ∈ R^n

f(y) ≥ f(x)

If f is strictly convex, then for any y ≠ x

f(y) > f(x)

SLIDE 51

Epigraph

The epigraph of f : R^n → R is

epi(f) := { x ∈ R^{n+1} | f( x[1], …, x[n] ) ≤ x[n + 1] }

SLIDE 52

Epigraph

[Figure: epi(f), the region above the graph of f]

SLIDE 53

Supporting hyperplane

A hyperplane H is a supporting hyperplane of a set S at x if

◮ H and S intersect at x
◮ S is contained in one of the half-spaces bounded by H

SLIDE 54

Geometric intuition

Geometrically, f is convex if and only if for every x the hyperplane

H_{f,x} := { y ∈ R^{n+1} | y[n + 1] = f^1_x( y[1], …, y[n] ) }

is a supporting hyperplane of the epigraph at x

If ∇f(x) = 0, the hyperplane is horizontal

SLIDE 55

Convexity

[Figure: f(y) lies above its linear approximation f^1_x(y), the supporting hyperplane at x]

SLIDE 56

Hessian matrix

If f has a Hessian matrix at every point, it is twice differentiable

∇²f(x) = ( ∂²f(x)/∂x[1]²        ∂²f(x)/∂x[1]∂x[2]   ···   ∂²f(x)/∂x[1]∂x[n]
           ∂²f(x)/∂x[1]∂x[2]    ∂²f(x)/∂x[2]²       ···   ∂²f(x)/∂x[2]∂x[n]
           ···                  ···                 ···   ···
           ∂²f(x)/∂x[1]∂x[n]    ∂²f(x)/∂x[2]∂x[n]   ···   ∂²f(x)/∂x[n]² )

SLIDE 57

Curvature

The second directional derivative f''_u of f at x equals

f''_u(x) = uᵀ ∇²f(x) u

for any unit-norm vector u ∈ R^n

SLIDE 58

Second-order approximation

The second-order or quadratic approximation of f at x is

f^2_x(y) := f(x) + ∇f(x)ᵀ(y − x) + (1/2) (y − x)ᵀ ∇²f(x) (y − x)

SLIDE 59

Second-order approximation

[Figure: f(y) and its quadratic approximation f^2_x(y) at x]

SLIDE 60

Quadratic form

Second-order polynomial in several dimensions:

q(x) := xᵀAx + bᵀx + c

parametrized by a symmetric matrix A ∈ R^{n×n}, a vector b ∈ R^n and a constant c

SLIDE 61

Quadratic approximation

The quadratic approximation f^2_x : R^n → R at x ∈ R^n of a twice-continuously differentiable function f : R^n → R satisfies

lim_{y→x} ( f(y) − f^2_x(y) ) / ||y − x||_2² = 0

SLIDE 62

Eigendecomposition of symmetric matrices

Let A = UΛUᵀ be the eigendecomposition of a symmetric matrix A

Eigenvalues: λ_1 ≥ ··· ≥ λ_n (which can be negative or 0)
Eigenvectors: u_1, …, u_n, an orthonormal basis

λ_1 = max_{ ||x||_2 = 1, x ∈ R^n } xᵀAx    u_1 = arg max_{ ||x||_2 = 1, x ∈ R^n } xᵀAx
λ_n = min_{ ||x||_2 = 1, x ∈ R^n } xᵀAx    u_n = arg min_{ ||x||_2 = 1, x ∈ R^n } xᵀAx

SLIDE 63

Maximum and minimum curvature

Let ∇²f(x) = UΛUᵀ be the eigendecomposition of the Hessian at x

Direction of maximum curvature: u_1
Direction of minimum curvature (or maximum negative curvature): u_n

SLIDE 64

Positive semidefinite matrices

For any x

xᵀAx = xᵀUΛUᵀx = Σ_{i=1}^{n} λ_i ⟨u_i, x⟩²

All eigenvalues are nonnegative if and only if xᵀAx ≥ 0 for all x

The matrix is then said to be positive semidefinite
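The identity above can be verified numerically; an illustrative sketch (not from the slides) using a Gram matrix, which is always positive semidefinite:

```python
import numpy as np

# A symmetric PSD matrix: Gram matrices B^T B have nonnegative eigenvalues,
# and x^T A x equals the weighted sum of squared projections on the slide.
rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
A = B.T @ B

lam, U = np.linalg.eigh(A)       # eigendecomposition of a symmetric matrix
assert np.all(lam >= -1e-10)     # nonnegative spectrum

x = rng.normal(size=3)
quad = x @ A @ x
# x^T A x = sum_i lambda_i <u_i, x>^2
assert np.isclose(quad, np.sum(lam * (U.T @ x) ** 2))
assert quad >= -1e-10
```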

SLIDE 65

Positive (negative) (semi)definite matrices

Positive (semi)definite: all eigenvalues are positive (nonnegative); equivalently, xᵀAx > (≥) 0 for all x ≠ 0
Quadratic form: all directions have positive curvature

Negative (semi)definite: all eigenvalues are negative (nonpositive); equivalently, xᵀAx < (≤) 0 for all x ≠ 0
Quadratic form: all directions have negative curvature

SLIDE 66

Convexity

A twice-differentiable function g : R → R is convex if and only if g''(x) ≥ 0 for all x ∈ R

A twice-differentiable function on R^n is convex if and only if its Hessian is positive semidefinite at every point

If the Hessian is positive definite at every point, the function is strictly convex

SLIDE 67

Second-order approximation

[Figure: f(y) and its quadratic approximation f^2_x(y)]

SLIDE 68

Convex

SLIDE 69

Concave

SLIDE 70

Neither

SLIDE 71

Convexity
Differentiable convex functions
Minimizing differentiable convex functions

SLIDE 72

Problem

Challenge: minimizing differentiable convex functions

min_{x ∈ R^n} f(x)

SLIDE 73

Gradient descent

Intuition: make local progress in the steepest-descent direction −∇f(x)

Set the initial point x^(0) to an arbitrary value

Update by setting

x^(k+1) := x^(k) − α_k ∇f( x^(k) )

where α_k > 0 is the step size, until a stopping criterion is met
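The update rule above can be sketched in a few lines; this is an illustrative implementation, not part of the slides, and the test function and fixed step size are assumptions.

```python
import numpy as np

# Gradient descent with a fixed step size:
# x^(k+1) = x^(k) - alpha * grad f(x^(k))
def gradient_descent(grad, x0, alpha, steps):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

# Example: f(x) = ||x - c||^2 has gradient 2(x - c) and minimizer c.
c = np.array([1.0, -2.0])
x_hat = gradient_descent(lambda x: 2 * (x - c), x0=np.zeros(2), alpha=0.1, steps=200)
assert np.allclose(x_hat, c, atol=1e-6)
```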

SLIDE 74

Gradient descent

[Figure: gradient descent iterates on the contour lines of f]

SLIDE 76

Small step size

[Figure: with a small step size, the iterates make slow progress]

SLIDE 77

Large step size

[Figure: with a large step size, the iterates can oscillate or diverge]

SLIDE 78

Line search

Idea: minimize the function along the descent direction,

α_k := arg min_{α ∈ R} h(α) = arg min_{α ∈ R} f( x^(k) − α ∇f( x^(k) ) )
SLIDE 79

Backtracking line search with Armijo rule

Given α_0 ≥ 0 and β, η ∈ (0, 1), set α_k := α_0 β^i for the smallest i such that

x^(k+1) := x^(k) − α_k ∇f( x^(k) )

satisfies

f( x^(k+1) ) ≤ f( x^(k) ) − (1/2) α_k ||∇f( x^(k) )||_2²

a condition known as the Armijo rule
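The backtracking rule can be sketched as follows; an illustrative implementation (not from the slides), with the default parameters chosen as assumptions.

```python
import numpy as np

# Backtracking: shrink alpha = alpha0 * beta^i until the sufficient-decrease
# condition f(x - alpha*g) <= f(x) - (alpha/2)*||g||^2 holds.
def backtracking_step(f, grad, x, alpha0=1.0, beta=0.5, max_iter=50):
    g = grad(x)
    alpha = alpha0
    for _ in range(max_iter):
        x_new = x - alpha * g
        if f(x_new) <= f(x) - 0.5 * alpha * np.dot(g, g):
            return x_new, alpha          # Armijo condition satisfied
        alpha *= beta                    # backtrack
    return x_new, alpha

f = lambda x: np.sum(x ** 2)
grad = lambda x: 2 * x
x = np.array([3.0, -1.0])
x_new, alpha = backtracking_step(f, grad, x)
assert f(x_new) < f(x)                   # the accepted step makes progress
```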

SLIDE 80

Backtracking line search with Armijo rule

[Figure: gradient descent iterates with backtracking line search]

SLIDE 81

Gradient descent for least squares

Aim: use n examples ( y^(1), x^(1) ), ( y^(2), x^(2) ), …, ( y^(n), x^(n) ) to fit a linear model by minimizing the least-squares cost function

minimize_{β ∈ R^p} ||y − Xβ||_2²

The gradient of the quadratic function

f(β) := ||y − Xβ||_2² = βᵀXᵀXβ − 2βᵀXᵀy + yᵀy

equals

∇f(β) = 2XᵀXβ − 2Xᵀy

Gradient descent updates are

β^(k+1) = β^(k) + 2α_k Xᵀ( y − Xβ^(k) )
        = β^(k) + 2α_k Σ_{i=1}^{n} ( y^(i) − ⟨x^(i), β^(k)⟩ ) x^(i)
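The least-squares updates above can be checked numerically; an illustrative sketch (not from the slides) on synthetic noiseless data, with the step size chosen from the Lipschitz constant of the gradient:

```python
import numpy as np

# Gradient descent for least squares:
# beta^(k+1) = beta^(k) + 2*alpha*X^T(y - X beta^(k))
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true                       # noiseless targets

beta = np.zeros(3)
L = 2 * np.linalg.norm(X.T @ X, 2)      # Lipschitz constant of the gradient
alpha = 1.0 / L                         # step size alpha <= 1/L
for _ in range(5000):
    beta = beta + 2 * alpha * X.T @ (y - X @ beta)

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta, beta_ls, atol=1e-6)   # matches the least-squares solution
```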
SLIDE 86

Gradient ascent for logistic regression

Aim: use n examples ( y^(1), x^(1) ), ( y^(2), x^(2) ), …, ( y^(n), x^(n) ) to fit a logistic-regression model by maximizing the log-likelihood

f(β) := Σ_{i=1}^{n} y^(i) log g( ⟨x^(i), β⟩ ) + ( 1 − y^(i) ) log( 1 − g( ⟨x^(i), β⟩ ) )

where g(t) = 1 / ( 1 + exp(−t) )

Since g'(t) = g(t)( 1 − g(t) ) and ( 1 − g(t) )' = −g(t)( 1 − g(t) ), the gradient of the cost function equals

∇f(β) = Σ_{i=1}^{n} y^(i) ( 1 − g(⟨x^(i), β⟩) ) x^(i) − ( 1 − y^(i) ) g(⟨x^(i), β⟩) x^(i)

The gradient ascent updates are

β^(k+1) := β^(k) + α_k Σ_{i=1}^{n} y^(i) ( 1 − g(⟨x^(i), β^(k)⟩) ) x^(i) − ( 1 − y^(i) ) g(⟨x^(i), β^(k)⟩) x^(i)
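The ascent updates can be sketched compactly, since the two gradient terms collapse to (y − g)x; an illustrative run on synthetic data (not from the slides; data and step size are assumptions):

```python
import numpy as np

# Gradient ascent on the logistic-regression log-likelihood.
def g(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(beta, X, y):
    p = g(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X @ np.array([1.5, -1.0]) + 0.3 * rng.normal(size=100) > 0).astype(float)

beta = np.zeros(2)
ll_old = log_likelihood(beta, X, y)
for _ in range(200):
    p = g(X @ beta)
    beta = beta + 0.01 * X.T @ (y - p)   # ascent step; (y - p) collapses both terms
assert log_likelihood(beta, X, y) > ll_old   # the likelihood improved
```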

SLIDE 91

Convergence of gradient descent

Does the method converge? How fast (or slow)? For what step sizes?

It depends on the function

SLIDE 93

Lipschitz continuity

A function f : R^n → R^m is Lipschitz continuous if for any x, y ∈ R^n

||f(y) − f(x)||_2 ≤ L ||y − x||_2

L is the Lipschitz constant

SLIDE 94

Lipschitz-continuous gradients

If ∇f is Lipschitz continuous with Lipschitz constant L,

||∇f(y) − ∇f(x)||_2 ≤ L ||y − x||_2

then for any x, y ∈ R^n we have a quadratic upper bound

f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2) ||y − x||_2²

SLIDE 95

Local progress of gradient descent

For x^(k+1) := x^(k) − α_k ∇f( x^(k) ),

f( x^(k+1) ) ≤ f( x^(k) ) + ∇f( x^(k) )ᵀ( x^(k+1) − x^(k) ) + (L/2) ||x^(k+1) − x^(k)||_2²
            = f( x^(k) ) − α_k ( 1 − α_k L / 2 ) ||∇f( x^(k) )||_2²

If α_k ≤ 1/L,

f( x^(k+1) ) ≤ f( x^(k) ) − (α_k / 2) ||∇f( x^(k) )||_2²

SLIDE 99

Convergence of gradient descent

Assume

◮ f is convex
◮ ∇f is L-Lipschitz continuous
◮ There exists a point x* at which f achieves a finite minimum
◮ The step size is set to α_k := α ≤ 1/L

Then

f( x^(k) ) − f(x*) ≤ ||x^(0) − x*||_2² / ( 2αk )

SLIDE 100

Convergence of gradient descent

By the local-progress bound,

f( x^(k) ) ≤ f( x^(k−1) ) − (α/2) ||∇f( x^(k−1) )||_2²

By the first-order characterization of convexity,

f( x^(k−1) ) + ∇f( x^(k−1) )ᵀ( x* − x^(k−1) ) ≤ f(x*)

Combining the two,

f( x^(k) ) − f(x*) ≤ f( x^(k−1) ) − f(x*) − (α/2) ||∇f( x^(k−1) )||_2²
                  ≤ ∇f( x^(k−1) )ᵀ( x^(k−1) − x* ) − (α/2) ||∇f( x^(k−1) )||_2²
                  = (1/(2α)) ( ||x^(k−1) − x*||_2² − ||x^(k−1) − x* − α∇f( x^(k−1) )||_2² )
                  = (1/(2α)) ( ||x^(k−1) − x*||_2² − ||x^(k) − x*||_2² )
SLIDE 105

Convergence of gradient descent

Since f( x^(i) ) − f(x*) never increases with i,

f( x^(k) ) − f(x*) ≤ (1/k) Σ_{i=1}^{k} ( f( x^(i) ) − f(x*) )
                  = (1/(2αk)) Σ_{i=1}^{k} ( ||x^(i−1) − x*||_2² − ||x^(i) − x*||_2² )
                  = (1/(2αk)) ( ||x^(0) − x*||_2² − ||x^(k) − x*||_2² )
                  ≤ ||x^(0) − x*||_2² / ( 2αk )

SLIDE 111

Accelerated gradient descent

◮ Gradient descent takes O(1/ε) iterations to achieve an error of ε
◮ The optimal rate is O(1/√ε)
◮ Gradient descent can be accelerated by adding a momentum term

SLIDE 112

Accelerated gradient descent

Set the initial point x^(0) to an arbitrary value

Update by setting

y^(k+1) = x^(k) − α_k ∇f( x^(k) )
x^(k+1) = β_k y^(k+1) + γ_k y^(k)

where α_k is the step size and β_k > 0 and γ_k > 0 are parameters
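The slides leave the parameter schedule unspecified; the sketch below uses the standard Nesterov momentum weight for strongly convex problems as an assumption, on an ill-conditioned quadratic where plain gradient descent is slow.

```python
import numpy as np

# Accelerated gradient descent with a constant Nesterov-style momentum
# weight (sqrt(kappa)-1)/(sqrt(kappa)+1), where kappa is the condition number.
kappa = 100.0
momentum = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)

def accelerated_gd(grad, x0, alpha, steps):
    x = np.asarray(x0, dtype=float)
    x_prev = x.copy()
    for _ in range(steps):
        v = x + momentum * (x - x_prev)   # look-ahead point with momentum
        x_prev = x
        x = v - alpha * grad(v)           # gradient step at the look-ahead
    return x

# Quadratic test problem f(x) = 0.5 x^T A x with condition number 100.
A = np.diag([100.0, 1.0])
grad = lambda x: A @ x
x_acc = accelerated_gd(grad, np.array([1.0, 1.0]), alpha=0.01, steps=300)
assert np.linalg.norm(x_acc) < 1e-3      # converged to the minimizer at 0
```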

SLIDE 113

Digit classification

MNIST data. Aim: determine whether a digit is a 5 or not

x_i is an image; y_i = 1 if image i is a 5 and y_i = 0 otherwise

We fit a logistic-regression model

SLIDE 114

Digit classification

[Figure: running time versus problem size for gradient descent and accelerated gradient descent]

SLIDE 115

Stochastic gradient descent

Cost functions used to fit models are often additive:

f(x) = (1/m) Σ_{i=1}^{m} f_i(x)

◮ Linear regression: Σ_{i=1}^{n} ( y^(i) − x^(i)ᵀβ )² = ||y − Xβ||_2²

◮ Logistic regression: Σ_{i=1}^{n} y^(i) log g( ⟨x^(i), β⟩ ) + ( 1 − y^(i) ) log( 1 − g( ⟨x^(i), β⟩ ) )

SLIDE 116

Stochastic gradient descent

In the big-data regime (very large n), gradient descent is too slow

In some cases, data are acquired sequentially (online setting)

Stochastic gradient descent: update the solution using a subset of the data

SLIDE 117

Stochastic gradient descent

Set the initial point x^(0) to an arbitrary value

Update by

1. Choosing a random subset of b indices B (b ≪ m is the batch size)
2. Setting

x^(k+1) := x^(k) − α_k Σ_{i∈B} ∇f_i( x^(k) )

where α_k is the step size
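The two steps above can be sketched for the least-squares cost from the earlier slides; an illustrative run (not from the slides; the data, batch size and step size are assumptions):

```python
import numpy as np

# Minibatch SGD: each step uses the gradient of a random batch B
# instead of the full sum over all examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true                        # noiseless targets

beta = np.zeros(3)
batch = 10
alpha = 0.01
for _ in range(3000):
    B = rng.choice(len(y), size=batch, replace=False)   # random batch of indices
    residual = y[B] - X[B] @ beta
    beta = beta + 2 * alpha * X[B].T @ residual         # stochastic update
assert np.allclose(beta, beta_true, atol=1e-2)
```

Because the data are noiseless, the stochastic gradient vanishes at the solution, so the iterates settle there rather than hovering around it.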

SLIDE 118

Stochastic gradient descent

We replace ∇f by

Σ_{i∈B} ∇f_i( x^(k) ) = Σ_{i=1}^{m} 1_{i∈B} ∇f_i( x^(k) )

a noisy estimate of ∇f

Unbiased (up to a constant factor) if every example is in the batch with probability p:

E( Σ_{i=1}^{m} 1_{i∈B} ∇f_i( x^(k) ) ) = Σ_{i=1}^{m} E( 1_{i∈B} ) ∇f_i( x^(k) )
                                       = Σ_{i=1}^{m} P( i ∈ B ) ∇f_i( x^(k) )
                                       = p m ∇f( x^(k) )
SLIDE 122

Stochastic gradient descent

◮ Linear regression:

β^(k+1) := β^(k) + 2α_k Σ_{i∈B} ( y^(i) − ⟨x^(i), β^(k)⟩ ) x^(i)

◮ Logistic regression:

β^(k+1) := β^(k) + α_k Σ_{i∈B} y^(i) ( 1 − g(⟨x^(i), β^(k)⟩) ) x^(i) − ( 1 − y^(i) ) g(⟨x^(i), β^(k)⟩) x^(i)

SLIDE 123

Digit classification

MNIST data. Aim: determine whether a digit is a 5 or not

x_i is an image; y_i = 1 if image i is a 5 and y_i = 0 otherwise

We fit a logistic-regression model

SLIDE 124

Digit classification

[Figure: training loss versus number of steps for gradient descent and SGD with batch sizes 1, 10, 100, 1000 and 10000]

SLIDE 125

Newton’s method

Motivation: convex functions are often almost quadratic, f ≈ f^2_x

Idea: iteratively minimize the quadratic approximation

f^2_x(y) := f(x) + ∇f(x)ᵀ(y − x) + (1/2) (y − x)ᵀ ∇²f(x) (y − x)

The minimum has a closed form:

arg min_{y ∈ R^n} f^2_x(y) = x − ∇²f(x)⁻¹ ∇f(x)

SLIDE 126

Proof

We have

∇f^2_x(y) = ∇f(x) + ∇²f(x)(y − x)

which equals zero if ∇²f(x)(y − x) = −∇f(x)

If the Hessian is positive definite, the only minimum of f^2_x is at x − ∇²f(x)⁻¹ ∇f(x)

SLIDE 127

Newton’s method

Set the initial point x^(0) to an arbitrary value

Update by setting

x^(k+1) := x^(k) − ∇²f( x^(k) )⁻¹ ∇f( x^(k) )

until a stopping criterion is met
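The Newton update can be sketched directly; an illustrative implementation (not from the slides) that solves the linear system instead of forming the inverse, verified on a quadratic where a single step reaches the minimum:

```python
import numpy as np

# Newton's method: x^(k+1) = x^(k) - Hessian^{-1} gradient.
def newton(grad, hess, x0, steps):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - np.linalg.solve(hess(x), grad(x))   # solve, don't invert
    return x

# On f(x) = 0.5 x^T A x - b^T x the quadratic approximation is exact,
# so Newton converges in one step from any starting point.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
hess = lambda x: A
x_hat = newton(grad, hess, x0=np.array([5.0, 5.0]), steps=1)
assert np.allclose(A @ x_hat, b)   # first-order optimality after one step
```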

SLIDE 128

Newton’s method

[Figure: the quadratic approximation minimized at each step]

SLIDE 129

Quadratic function

[Figure]

SLIDE 131

Convex function

[Figure]

SLIDE 132

Logistic regression

∂²f(β) / ∂β[j] ∂β[l] = − Σ_{i=1}^{n} g(⟨x^(i), β⟩) ( 1 − g(⟨x^(i), β⟩) ) x^(i)[j] x^(i)[l]

∇²f(β) = −XᵀG(β)X

The rows of X ∈ R^{n×p} contain x^(1), …, x^(n)

G(β) is a diagonal matrix such that

G(β)_ii := g(⟨x^(i), β⟩) ( 1 − g(⟨x^(i), β⟩) ),   1 ≤ i ≤ n

SLIDE 133

Logistic regression

Since ∇²f(β) = −XᵀG(β)X, the Newton updates are

β^(k+1) := β^(k) + ( XᵀG( β^(k) )X )⁻¹ ∇f( β^(k) )

Sanity check: the cost function is concave, since for any β, v ∈ R^p

vᵀ ∇²f(β) v = − Σ_{i=1}^{n} G(β)_ii ( (Xv)[i] )² ≤ 0