SLIDE 1
Convex Optimization
DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis
http://www.cims.nyu.edu/~cfgranda/pages/OBDA_fall17/index.html
Carlos Fernandez-Granda
SLIDE 2
Convexity
Differentiable convex functions
Minimizing differentiable convex functions
SLIDE 3
Convex functions
A function f : Rn → R is convex if for any x, y ∈ Rn and any θ ∈ (0, 1)
θf(x) + (1 − θ)f(y) ≥ f(θx + (1 − θ)y)
A function f is concave if −f is convex
SLIDE 4
Convex functions
[Figure: a convex function; the chord value θf(x) + (1 − θ)f(y) lies above f(θx + (1 − θ)y)]
SLIDE 5
Linear functions are convex
If f is linear f (θ x + (1 − θ) y)
SLIDE 6
Linear functions are convex
If f is linear f (θ x + (1 − θ) y) = θf ( x) + (1 − θ) f ( y)
SLIDE 7
Strictly convex functions
A function f : Rn → R is strictly convex if for any x ≠ y ∈ Rn and any θ ∈ (0, 1)
θf(x) + (1 − θ)f(y) > f(θx + (1 − θ)y)
SLIDE 8
Local minima are global
Any local minimum of a convex function is also a global minimum
SLIDE 9
Proof
Let x_loc be a local minimum: for all x ∈ Rn such that ||x − x_loc||_2 ≤ γ,
f(x_loc) ≤ f(x)
Let x_glob be a global minimum with f(x_glob) < f(x_loc)
SLIDE 10
Proof
Choose θ so that x_θ := θx_loc + (1 − θ)x_glob satisfies ||x_θ − x_loc||_2 ≤ γ, then
f(x_loc) ≤ f(x_θ)
SLIDE 11
Proof
Choose θ so that x_θ := θx_loc + (1 − θ)x_glob satisfies ||x_θ − x_loc||_2 ≤ γ, then
f(x_loc) ≤ f(x_θ) = f(θx_loc + (1 − θ)x_glob)
SLIDE 12
Proof
Choose θ so that x_θ := θx_loc + (1 − θ)x_glob satisfies ||x_θ − x_loc||_2 ≤ γ, then
f(x_loc) ≤ f(x_θ) = f(θx_loc + (1 − θ)x_glob)
             ≤ θf(x_loc) + (1 − θ)f(x_glob)   by convexity of f
SLIDE 13
Proof
Choose θ so that x_θ := θx_loc + (1 − θ)x_glob satisfies ||x_θ − x_loc||_2 ≤ γ, then
f(x_loc) ≤ f(x_θ) = f(θx_loc + (1 − θ)x_glob)
             ≤ θf(x_loc) + (1 − θ)f(x_glob)   by convexity of f
             < f(x_loc)   because f(x_glob) < f(x_loc)
This is a contradiction, so x_loc must also be a global minimum
SLIDE 14
Norm
Let V be a vector space; a norm is a function ||·|| from V to R with the following properties
◮ It is homogeneous: for any scalar α and any x ∈ V, ||αx|| = |α| ||x||
◮ It satisfies the triangle inequality ||x + y|| ≤ ||x|| + ||y||; in particular, ||x|| ≥ 0
◮ ||x|| = 0 implies x = 0
SLIDE 15
Norms are convex
For any x, y ∈ Rn and any θ ∈ (0, 1) ||θ x + (1 − θ) y||
SLIDE 16
Norms are convex
For any x, y ∈ Rn and any θ ∈ (0, 1) ||θ x + (1 − θ) y|| ≤ ||θ x|| + ||(1 − θ) y||
SLIDE 17
Norms are convex
For any x, y ∈ Rn and any θ ∈ (0, 1) ||θ x + (1 − θ) y|| ≤ ||θ x|| + ||(1 − θ) y|| = θ || x|| + (1 − θ) || y||
SLIDE 18
Composition of convex and affine function
If f : Rn → R is convex, then for any A ∈ Rn×m and b ∈ Rn
h(x) := f(Ax + b)
is convex
Consequence: f(x) := ||Ax + b|| is convex for any A and b
SLIDE 19
Composition of convex and affine function
h (θ x + (1 − θ) y)
SLIDE 20
Composition of convex and affine function
h(θx + (1 − θ)y) = f(θ(Ax + b) + (1 − θ)(Ay + b))
SLIDE 21
Composition of convex and affine function
h(θx + (1 − θ)y) = f(θ(Ax + b) + (1 − θ)(Ay + b))
                 ≤ θf(Ax + b) + (1 − θ)f(Ay + b)
SLIDE 22
Composition of convex and affine function
h(θx + (1 − θ)y) = f(θ(Ax + b) + (1 − θ)(Ay + b))
                 ≤ θf(Ax + b) + (1 − θ)f(Ay + b)
                 = θh(x) + (1 − θ)h(y)
SLIDE 23
ℓ0 “norm"
Number of nonzero entries in a vector Not a norm! ||2 x||0
SLIDE 24
ℓ0 “norm"
Number of nonzero entries in a vector Not a norm! ||2 x||0 = || x||0
SLIDE 25
ℓ0 “norm"
Number of nonzero entries in a vector Not a norm! ||2x||_0 = ||x||_0 ≠ 2 ||x||_0
SLIDE 26
ℓ0 “norm"
Number of nonzero entries in a vector Not a norm! ||2x||_0 = ||x||_0 ≠ 2 ||x||_0 Not convex
SLIDE 27
ℓ0 “norm"
Number of nonzero entries in a vector Not a norm! ||2x||_0 = ||x||_0 ≠ 2 ||x||_0
Not convex: let x := (1, 0) and y := (0, 1); for any θ ∈ (0, 1)
||θx + (1 − θ)y||_0        θ||x||_0 + (1 − θ)||y||_0
SLIDE 28
ℓ0 “norm"
Number of nonzero entries in a vector Not a norm! ||2x||_0 = ||x||_0 ≠ 2 ||x||_0
Not convex: let x := (1, 0) and y := (0, 1); for any θ ∈ (0, 1)
||θx + (1 − θ)y||_0 = 2        θ||x||_0 + (1 − θ)||y||_0
SLIDE 29
ℓ0 “norm"
Number of nonzero entries in a vector Not a norm! ||2x||_0 = ||x||_0 ≠ 2 ||x||_0
Not convex: let x := (1, 0) and y := (0, 1); for any θ ∈ (0, 1)
||θx + (1 − θ)y||_0 = 2        θ||x||_0 + (1 − θ)||y||_0 = 1
SLIDE 30
Promoting sparsity
Finding sparse vectors consistent with data is often very useful
Toy problem: find t such that v_t := (t, t − 1, t − 1) is sparse
Strategy: minimize f(t) := ||v_t|| for different choices of norm
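A minimal numerical sketch of this toy problem, assuming the reconstruction v_t := (t, t − 1, t − 1) above; it evaluates the candidate norms of v_t on a grid of t, as compared in the figure on the next slide.

```python
import numpy as np

def v(t):
    # toy vector from the slide: sparse (a single nonzero entry) when t = 1
    return np.array([t, t - 1.0, t - 1.0])

ts = np.linspace(-0.4, 1.4, 181)
l0 = [np.count_nonzero(np.abs(v(t)) > 1e-12) for t in ts]
l1 = [np.linalg.norm(v(t), 1) for t in ts]
l2 = [np.linalg.norm(v(t), 2) for t in ts]
linf = [np.linalg.norm(v(t), np.inf) for t in ts]

# the l1 curve attains its minimum at the sparse solution t = 1
print("argmin of the l1 norm over the grid:", ts[int(np.argmin(l1))])
```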
SLIDE 31
Promoting sparsity
[Figure: ||v_t||_0, ||v_t||_1, ||v_t||_2 and ||v_t||_∞ as functions of t]
SLIDE 32
The rank is not convex
The rank of matrices in Rn×n interpreted as a function from Rn×n to R is not convex
SLIDE 33
The rank is not convex
The rank of matrices in Rn×n, interpreted as a function from Rn×n to R, is not convex
X := [1 0; 0 0]        Y := [0 0; 0 1]
For any θ ∈ (0, 1)
rank(θX + (1 − θ)Y)        θ rank(X) + (1 − θ) rank(Y)
SLIDE 34
The rank is not convex
The rank of matrices in Rn×n, interpreted as a function from Rn×n to R, is not convex
X := [1 0; 0 0]        Y := [0 0; 0 1]
For any θ ∈ (0, 1)
rank(θX + (1 − θ)Y) = 2        θ rank(X) + (1 − θ) rank(Y)
SLIDE 35
The rank is not convex
The rank of matrices in Rn×n, interpreted as a function from Rn×n to R, is not convex
X := [1 0; 0 0]        Y := [0 0; 0 1]
For any θ ∈ (0, 1)
rank(θX + (1 − θ)Y) = 2        θ rank(X) + (1 − θ) rank(Y) = 1
SLIDE 36
Matrix norms
Frobenius norm: ||A||_F := ( Σ_{i=1}^m Σ_{j=1}^n A_ij² )^{1/2} = ( Σ_{i=1}^{min{m,n}} σ_i² )^{1/2}
Operator norm: ||A|| := max_{||x||_2 = 1, x ∈ Rn} ||Ax||_2 = σ_1
Nuclear norm: ||A||_* := Σ_{i=1}^{min{m,n}} σ_i
SLIDE 37
Promoting low-rank structure
Finding low-rank matrices consistent with data is often very useful
Toy problem: find t such that
M(t) := [ 0.5 + t    1       1
          0.5        0.5     t
          0.5        1 − t   0.5 ]
is low rank
Strategy: minimize f(t) := ||M(t)|| for different choices of matrix norm
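A sketch along the same lines, assuming the reconstructed M(t) above; it computes the rank and the three matrix norms from the singular values over a range of t, as compared in the figure on the next slide.

```python
import numpy as np

def M(t):
    # toy matrix from the slide (reconstructed entries)
    return np.array([[0.5 + t, 1.0,     1.0],
                     [0.5,     0.5,     t  ],
                     [0.5,     1.0 - t, 0.5]])

for t in np.linspace(-1.0, 1.5, 6):
    s = np.linalg.svd(M(t), compute_uv=False)   # singular values, sorted in decreasing order
    print(f"t = {t:+.1f}  rank = {np.sum(s > 1e-10)}  operator = {s[0]:.3f}  "
          f"Frobenius = {np.linalg.norm(s):.3f}  nuclear = {s.sum():.3f}")
```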
SLIDE 38
Promoting low-rank structure
[Figure: rank, operator norm, Frobenius norm and nuclear norm of M(t) as functions of t]
SLIDE 39
Convexity
Differentiable convex functions
Minimizing differentiable convex functions
SLIDE 40
Gradient
∇f(x) = ( ∂f(x)/∂x[1], ∂f(x)/∂x[2], . . . , ∂f(x)/∂x[n] )
If the gradient exists at every point, the function is said to be differentiable
SLIDE 41
Directional derivative
Encodes the first-order rate of change in a particular direction
f′_u(x) := lim_{h→0} ( f(x + hu) − f(x) ) / h = ⟨∇f(x), u⟩,   where ||u||_2 = 1
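A quick numerical check of this identity, not from the slides: for a simple smooth function (chosen here only as an example) a finite-difference quotient along u matches ⟨∇f(x), u⟩.

```python
import numpy as np

f = lambda x: 0.5 * x @ x + np.sin(x[0])             # example function (an assumption)
grad = lambda x: x + np.array([np.cos(x[0]), 0.0])   # its gradient, computed by hand

x = np.array([1.0, -2.0])
u = np.array([3.0, 4.0])
u = u / np.linalg.norm(u)                            # unit-norm direction

h = 1e-6
finite_difference = (f(x + h * u) - f(x)) / h
print(finite_difference, grad(x) @ u)                # the two values agree up to O(h)
```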
SLIDE 42
Direction of maximum variation
∇f is the direction of maximum increase
−∇f is the direction of maximum decrease
|f′_u(x)| = |∇f(x)^T u|
SLIDE 43
Direction of maximum variation
∇f is the direction of maximum increase
−∇f is the direction of maximum decrease
|f′_u(x)| = |∇f(x)^T u|
          ≤ ||∇f(x)||_2 ||u||_2   (Cauchy-Schwarz inequality)
SLIDE 44
Direction of maximum variation
∇f is the direction of maximum increase
−∇f is the direction of maximum decrease
|f′_u(x)| = |∇f(x)^T u|
          ≤ ||∇f(x)||_2 ||u||_2   (Cauchy-Schwarz inequality)
          = ||∇f(x)||_2
SLIDE 45
Direction of maximum variation
∇f is the direction of maximum increase
−∇f is the direction of maximum decrease
|f′_u(x)| = |∇f(x)^T u|
          ≤ ||∇f(x)||_2 ||u||_2   (Cauchy-Schwarz inequality)
          = ||∇f(x)||_2
equality holds if and only if u = ± ∇f(x) / ||∇f(x)||_2
SLIDE 46
Gradient
SLIDE 47
First-order approximation
The first-order or linear approximation of f : Rn → R at x is
f¹_x(y) := f(x) + ∇f(x)^T (y − x)
If f is continuously differentiable at x,
lim_{y→x} ( f(y) − f¹_x(y) ) / ||y − x||_2 = 0
SLIDE 48
First-order approximation
[Figure: f(y) and its first-order approximation f¹_x(y) at x]
SLIDE 49
Convexity
A differentiable function f : Rn → R is convex if and only if for every x, y ∈ Rn
f(y) ≥ f(x) + ∇f(x)^T (y − x)
It is strictly convex if and only if for every y ≠ x
f(y) > f(x) + ∇f(x)^T (y − x)
SLIDE 50
Optimality condition
If f is convex and ∇f(x) = 0, then for any y ∈ Rn, f(y) ≥ f(x)
If f is strictly convex, then for any y ≠ x, f(y) > f(x)
SLIDE 51
Epigraph
The epigraph of f : Rn → R is
epi(f) := { x | f(x[1], . . . , x[n]) ≤ x[n + 1] }
SLIDE 52
Epigraph
[Figure: a function f and its epigraph epi(f)]
SLIDE 53
Supporting hyperplane
A hyperplane H is a supporting hyperplane of a set S at x if
◮ H and S intersect at x
◮ S is contained in one of the half-spaces bounded by H
SLIDE 54
Geometric intuition
Geometrically, f is convex if and only if for every x the hyperplane
H_{f,x} := { y | y[n + 1] = f¹_x(y[1], . . . , y[n]) }
is a supporting hyperplane of the epigraph at x
If ∇f(x) = 0 the hyperplane is horizontal
SLIDE 55
Convexity
[Figure: f(y) and the supporting hyperplane defined by f¹_x(y) at x]
SLIDE 56
Hessian matrix
If f has a Hessian matrix at every point, it is twice differentiable
∇²f(x) = [ ∂²f(x)/∂x[1]²        ∂²f(x)/∂x[1]∂x[2]   · · ·   ∂²f(x)/∂x[1]∂x[n]
           ∂²f(x)/∂x[1]∂x[2]    ∂²f(x)/∂x[2]²       · · ·   ∂²f(x)/∂x[2]∂x[n]
           · · ·
           ∂²f(x)/∂x[1]∂x[n]    ∂²f(x)/∂x[2]∂x[n]   · · ·   ∂²f(x)/∂x[n]² ]
SLIDE 57
Curvature
The second directional derivative f″_u of f at x equals
f″_u(x) = u^T ∇²f(x) u
for any unit-norm vector u ∈ Rn
SLIDE 58
Second-order approximation
The second-order or quadratic approximation of f at x is
f²_x(y) := f(x) + ∇f(x)^T (y − x) + (1/2) (y − x)^T ∇²f(x) (y − x)
SLIDE 59
Second-order approximation
[Figure: f(y) and its second-order approximation f²_x(y) at x]
SLIDE 60
Quadratic form
Second-order polynomial in several dimensions
q(x) := x^T A x + b^T x + c
parametrized by a symmetric matrix A ∈ Rn×n, a vector b ∈ Rn and a constant c
SLIDE 61
Quadratic approximation
The quadratic approximation f²_x : Rn → R at x ∈ Rn of a twice-continuously differentiable function f : Rn → R satisfies
lim_{y→x} ( f(y) − f²_x(y) ) / ||y − x||_2² = 0
SLIDE 62
Eigendecomposition of symmetric matrices
Let A = UΛU^T be the eigendecomposition of a symmetric matrix A
Eigenvalues: λ_1 ≥ · · · ≥ λ_n (which can be negative or 0)
Eigenvectors: u_1, . . . , u_n, an orthonormal basis
λ_1 = max_{||x||_2 = 1, x ∈ Rn} x^T A x        u_1 = arg max_{||x||_2 = 1, x ∈ Rn} x^T A x
λ_n = min_{||x||_2 = 1, x ∈ Rn} x^T A x        u_n = arg min_{||x||_2 = 1, x ∈ Rn} x^T A x
SLIDE 63
Maximum and minimum curvature
Let ∇2f ( x) = UΛUT be the eigendecomposition of the Hessian at x Direction of maximum curvature: u1 Direction of minimum curvature (or maximum negative curvature): un
SLIDE 64
Positive semidefinite matrices
For any x,
x^T A x = x^T U Λ U^T x = Σ_{i=1}^n λ_i ⟨u_i, x⟩²
All eigenvalues are nonnegative if and only if x^T A x ≥ 0 for all x
The matrix is positive semidefinite
SLIDE 65
Positive (negative) (semi)definite matrices
Positive (semi)definite: all eigenvalues are positive (nonnegative); equivalently, x^T A x > 0 (≥ 0) for all x
Quadratic form: all directions have positive curvature
Negative (semi)definite: all eigenvalues are negative (nonpositive); equivalently, x^T A x < 0 (≤ 0) for all x
Quadratic form: all directions have negative curvature
SLIDE 66
Convexity
A twice-differentiable function g : R → R is convex if and only if g″(x) ≥ 0 for all x ∈ R
A twice-differentiable function on Rn is convex if and only if its Hessian is positive semidefinite at every point
If the Hessian is positive definite, the function is strictly convex
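A numerical sanity check of this criterion, sketched for the quadratic form q(x) = x^T A x + b^T x + c, whose Hessian is the constant matrix 2A; the particular A below is only an example.

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # symmetric matrix defining q
hessian = 2 * A                       # Hessian of q(x) = x^T A x + b^T x + c

eigenvalues = np.linalg.eigvalsh(hessian)   # eigenvalues of a symmetric matrix
print(eigenvalues)                          # all nonnegative => q is convex
print("convex:", bool(np.all(eigenvalues >= -1e-12)))
```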
SLIDE 67
Second-order approximation
[Figure: f(y) and its second-order approximation f²_x(y) at x]
SLIDE 68
Convex
SLIDE 69
Concave
SLIDE 70
Neither
SLIDE 71
Convexity Differentiable convex functions Minimizing differentiable convex functions
SLIDE 72
Problem
Challenge: minimizing differentiable convex functions
min_{x ∈ Rn} f(x)
SLIDE 73
Gradient descent
Intuition: make local progress in the steepest-descent direction −∇f(x)
Set the initial point x^(0) to an arbitrary value
Update by setting
x^(k+1) := x^(k) − α_k ∇f(x^(k))
where α_k > 0 is the step size, until a stopping criterion is met
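A minimal sketch of these updates for a generic differentiable f; the fixed step size and the small-gradient stopping criterion are choices, not prescribed by the slide.

```python
import numpy as np

def gradient_descent(grad_f, x0, step=0.1, max_iters=1000, tol=1e-8):
    """Gradient descent: x^(k+1) = x^(k) - alpha_k * grad f(x^(k))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # stopping criterion: gradient close to zero
            break
        x = x - step * g
    return x

# example: minimize f(x) = ||x - c||^2, whose gradient is 2 (x - c)
c = np.array([1.0, -3.0])
print(gradient_descent(lambda x: 2 * (x - c), np.zeros(2)))   # converges to c
```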
SLIDE 74
Gradient descent
SLIDE 75
Gradient descent
SLIDE 76
Small step size
SLIDE 77
Large step size
SLIDE 78
Line search
Idea: find the step size that minimizes h(α) := f(x^(k) − α ∇f(x^(k)))
α_k := arg min_{α ∈ R} f(x^(k) − α ∇f(x^(k)))
SLIDE 79
Backtracking line search with Armijo rule
Given α_0 ≥ 0 and β, η ∈ (0, 1), set α_k := α_0 β^i for the smallest i such that
x^(k+1) := x^(k) − α_k ∇f(x^(k))
satisfies
f(x^(k+1)) ≤ f(x^(k)) − (1/2) α_k ||∇f(x^(k))||_2²
a condition known as the Armijo rule
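A sketch of the backtracking search under the Armijo rule exactly as stated above; the values of α_0 and β are tuning choices.

```python
import numpy as np

def backtracking_step(f, grad_f, x, alpha0=1.0, beta=0.5, max_tries=50):
    """Return alpha_k = alpha0 * beta^i for the smallest i satisfying the Armijo rule."""
    g = grad_f(x)
    grad_norm_sq = g @ g
    alpha = alpha0
    for _ in range(max_tries):
        if f(x - alpha * g) <= f(x) - 0.5 * alpha * grad_norm_sq:   # Armijo condition
            return alpha
        alpha *= beta                                               # shrink and retry
    return alpha

# example with f(x) = x^T x
f = lambda x: x @ x
grad_f = lambda x: 2 * x
print(backtracking_step(f, grad_f, np.array([3.0, 4.0])))
```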
SLIDE 80
Backtracking line search with Armijo rule
SLIDE 81
Gradient descent for least squares
Aim: use n examples (y^(1), x^(1)), (y^(2), x^(2)), . . . , (y^(n), x^(n)) to fit a linear model by minimizing the least-squares cost function
minimize_{β ∈ Rp} ||y − Xβ||_2²
SLIDE 82
Gradient descent for least squares
The gradient of the quadratic function
f(β) := ||y − Xβ||_2² = β^T X^T X β − 2 β^T X^T y + y^T y
equals ∇f(β)
SLIDE 83
Gradient descent for least squares
The gradient of the quadratic function
f(β) := ||y − Xβ||_2² = β^T X^T X β − 2 β^T X^T y + y^T y
equals
∇f(β) = 2 X^T X β − 2 X^T y
SLIDE 84
Gradient descent for least squares
The gradient of the quadratic function
f(β) := ||y − Xβ||_2² = β^T X^T X β − 2 β^T X^T y + y^T y
equals
∇f(β) = 2 X^T X β − 2 X^T y
Gradient descent updates are
β^(k+1) = β^(k) + 2 α_k X^T ( y − X β^(k) )
SLIDE 85
Gradient descent for least squares
The gradient of the quadratic function
f(β) := ||y − Xβ||_2² = β^T X^T X β − 2 β^T X^T y + y^T y
equals
∇f(β) = 2 X^T X β − 2 X^T y
Gradient descent updates are
β^(k+1) = β^(k) + 2 α_k X^T ( y − X β^(k) )
        = β^(k) + 2 α_k Σ_{i=1}^n ( y^(i) − ⟨x^(i), β^(k)⟩ ) x^(i)
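A sketch of these updates on synthetic data; the data generation and the step-size choice α_k = 1/L with L = 2||X||² (the Lipschitz constant of the gradient) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta = np.zeros(p)
alpha = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)        # step 1/L, L = 2 * (largest singular value)^2
for _ in range(500):
    beta = beta + 2 * alpha * X.T @ (y - X @ beta)   # beta^(k+1) = beta^(k) + 2 alpha X^T (y - X beta^(k))

print(np.linalg.norm(beta - beta_true))              # close to the true coefficients
```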
SLIDE 86
Gradient ascent for logistic regression
Aim: use n examples (y^(1), x^(1)), (y^(2), x^(2)), . . . , (y^(n), x^(n)) to fit a logistic-regression model by maximizing the log-likelihood cost function
f(β) := Σ_{i=1}^n y^(i) log g(⟨x^(i), β⟩) + (1 − y^(i)) log( 1 − g(⟨x^(i), β⟩) )
where
g(t) = 1 / (1 + exp(−t))
SLIDE 87
Gradient ascent for logistic regression
g′(t) = g(t)(1 − g(t))        (1 − g(t))′ = −g(t)(1 − g(t))
The gradient of the cost function equals ∇f(β)
SLIDE 88
Gradient ascent for logistic regression
g′(t) = g(t)(1 − g(t))        (1 − g(t))′ = −g(t)(1 − g(t))
The gradient of the cost function equals
∇f(β) = Σ_{i=1}^n y^(i) ( 1 − g(⟨x^(i), β⟩) ) x^(i) − ( 1 − y^(i) ) g(⟨x^(i), β⟩) x^(i)
SLIDE 89
Gradient ascent for logistic regression
g′(t) = g(t)(1 − g(t))        (1 − g(t))′ = −g(t)(1 − g(t))
The gradient of the cost function equals
∇f(β) = Σ_{i=1}^n y^(i) ( 1 − g(⟨x^(i), β⟩) ) x^(i) − ( 1 − y^(i) ) g(⟨x^(i), β⟩) x^(i)
The gradient ascent updates are β^(k+1)
SLIDE 90
Gradient ascent for logistic regression
g′(t) = g(t)(1 − g(t))        (1 − g(t))′ = −g(t)(1 − g(t))
The gradient of the cost function equals
∇f(β) = Σ_{i=1}^n y^(i) ( 1 − g(⟨x^(i), β⟩) ) x^(i) − ( 1 − y^(i) ) g(⟨x^(i), β⟩) x^(i)
The gradient ascent updates are
β^(k+1) := β^(k) + α_k Σ_{i=1}^n y^(i) ( 1 − g(⟨x^(i), β^(k)⟩) ) x^(i) − ( 1 − y^(i) ) g(⟨x^(i), β^(k)⟩) x^(i)
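A sketch of these ascent updates; note that the gradient above simplifies to Σ_i ( y^(i) − g(⟨x^(i), β⟩) ) x^(i), which the code uses. The synthetic data and step size are assumptions.

```python
import numpy as np

def g(t):
    return 1.0 / (1.0 + np.exp(-t))          # logistic function

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = (rng.uniform(size=n) < g(X @ beta_true)).astype(float)

beta = np.zeros(p)
alpha = 5e-3
for _ in range(5000):
    beta = beta + alpha * X.T @ (y - g(X @ beta))   # ascent step along the log-likelihood gradient

print(beta)                                         # roughly recovers beta_true
```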
SLIDE 91
Convergence of gradient descent
Does the method converge? How fast (slow)? For what step sizes?
SLIDE 92
Convergence of gradient descent
Does the method converge? How fast (slow)? For what step sizes? It depends on the function
SLIDE 93
Lipschitz continuity
A function f : Rn → Rm is Lipschitz continuous if for any x, y ∈ Rn ||f ( y) − f ( x)||2 ≤ L || y − x||2 . L is the Lipschitz constant
SLIDE 94
Lipschitz-continuous gradients
If ∇f is Lipschitz continuous with Lipschitz constant L,
||∇f(y) − ∇f(x)||_2 ≤ L ||y − x||_2
then for any x, y ∈ Rn we have the quadratic upper bound
f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2) ||y − x||_2²
SLIDE 95
Local progress of gradient descent
x^(k+1) := x^(k) − α_k ∇f(x^(k))
f(x^(k+1))
SLIDE 96
Local progress of gradient descent
x^(k+1) := x^(k) − α_k ∇f(x^(k))
f(x^(k+1)) ≤ f(x^(k)) + ∇f(x^(k))^T ( x^(k+1) − x^(k) ) + (L/2) ||x^(k+1) − x^(k)||_2²
SLIDE 97
Local progress of gradient descent
x^(k+1) := x^(k) − α_k ∇f(x^(k))
f(x^(k+1)) ≤ f(x^(k)) + ∇f(x^(k))^T ( x^(k+1) − x^(k) ) + (L/2) ||x^(k+1) − x^(k)||_2²
           = f(x^(k)) − α_k ( 1 − α_k L / 2 ) ||∇f(x^(k))||_2²
SLIDE 98
Local progress of gradient descent
x^(k+1) := x^(k) − α_k ∇f(x^(k))
f(x^(k+1)) ≤ f(x^(k)) + ∇f(x^(k))^T ( x^(k+1) − x^(k) ) + (L/2) ||x^(k+1) − x^(k)||_2²
           = f(x^(k)) − α_k ( 1 − α_k L / 2 ) ||∇f(x^(k))||_2²
If α_k ≤ 1/L,
f(x^(k+1)) ≤ f(x^(k)) − (α_k / 2) ||∇f(x^(k))||_2²
SLIDE 99
Convergence of gradient descent
◮ f is convex
◮ ∇f is L-Lipschitz continuous
◮ There exists a point x* at which f achieves a finite minimum
◮ The step size is set to α_k := α ≤ 1/L
Then
f(x^(k)) − f(x*) ≤ ||x^(0) − x*||_2² / (2 α k)
SLIDE 100
Convergence of gradient descent
f(x^(k)) ≤ f(x^(k−1)) − (α_k / 2) ||∇f(x^(k−1))||_2²
f(x^(k−1)) + ∇f(x^(k−1))^T ( x* − x^(k−1) ) ≤ f(x*)
f(x^(k)) − f(x*)
SLIDE 101
Convergence of gradient descent
f(x^(k)) ≤ f(x^(k−1)) − (α_k / 2) ||∇f(x^(k−1))||_2²
f(x^(k−1)) + ∇f(x^(k−1))^T ( x* − x^(k−1) ) ≤ f(x*)
f(x^(k)) − f(x*) ≤ f(x^(k−1)) − f(x*) − (α_k / 2) ||∇f(x^(k−1))||_2²
SLIDE 102
Convergence of gradient descent
f(x^(k)) ≤ f(x^(k−1)) − (α_k / 2) ||∇f(x^(k−1))||_2²
f(x^(k−1)) + ∇f(x^(k−1))^T ( x* − x^(k−1) ) ≤ f(x*)
f(x^(k)) − f(x*) ≤ f(x^(k−1)) − f(x*) − (α_k / 2) ||∇f(x^(k−1))||_2²
                ≤ ∇f(x^(k−1))^T ( x^(k−1) − x* ) − (α / 2) ||∇f(x^(k−1))||_2²
SLIDE 103
Convergence of gradient descent
f(x^(k)) ≤ f(x^(k−1)) − (α_k / 2) ||∇f(x^(k−1))||_2²
f(x^(k−1)) + ∇f(x^(k−1))^T ( x* − x^(k−1) ) ≤ f(x*)
f(x^(k)) − f(x*) ≤ f(x^(k−1)) − f(x*) − (α_k / 2) ||∇f(x^(k−1))||_2²
                ≤ ∇f(x^(k−1))^T ( x^(k−1) − x* ) − (α / 2) ||∇f(x^(k−1))||_2²
                = (1 / (2α)) ( ||x^(k−1) − x*||_2² − ||x^(k−1) − x* − α ∇f(x^(k−1))||_2² )
SLIDE 104
Convergence of gradient descent
f(x^(k)) ≤ f(x^(k−1)) − (α_k / 2) ||∇f(x^(k−1))||_2²
f(x^(k−1)) + ∇f(x^(k−1))^T ( x* − x^(k−1) ) ≤ f(x*)
f(x^(k)) − f(x*) ≤ f(x^(k−1)) − f(x*) − (α_k / 2) ||∇f(x^(k−1))||_2²
                ≤ ∇f(x^(k−1))^T ( x^(k−1) − x* ) − (α / 2) ||∇f(x^(k−1))||_2²
                = (1 / (2α)) ( ||x^(k−1) − x*||_2² − ||x^(k−1) − x* − α ∇f(x^(k−1))||_2² )
                = (1 / (2α)) ( ||x^(k−1) − x*||_2² − ||x^(k) − x*||_2² )
SLIDE 105
Convergence of gradient descent
f(x^(k)) − f(x*)
SLIDE 106
Convergence of gradient descent
f(x^(k)) − f(x*) ≤ (1/k) Σ_{i=1}^k ( f(x^(i)) − f(x*) )
SLIDE 107
Convergence of gradient descent
f(x^(k)) − f(x*) ≤ (1/k) Σ_{i=1}^k ( f(x^(i)) − f(x*) )        because f(x^(k)) never increases
SLIDE 108
Convergence of gradient descent
f(x^(k)) − f(x*) ≤ (1/k) Σ_{i=1}^k ( f(x^(i)) − f(x*) )        because f(x^(k)) never increases
                = (1 / (2 α k)) Σ_{i=1}^k ( ||x^(i−1) − x*||_2² − ||x^(i) − x*||_2² )
SLIDE 109
Convergence of gradient descent
f(x^(k)) − f(x*) ≤ (1/k) Σ_{i=1}^k ( f(x^(i)) − f(x*) )        because f(x^(k)) never increases
                = (1 / (2 α k)) Σ_{i=1}^k ( ||x^(i−1) − x*||_2² − ||x^(i) − x*||_2² )
                = (1 / (2 α k)) ( ||x^(0) − x*||_2² − ||x^(k) − x*||_2² )
SLIDE 110
Convergence of gradient descent
f(x^(k)) − f(x*) ≤ (1/k) Σ_{i=1}^k ( f(x^(i)) − f(x*) )        because f(x^(k)) never increases
                = (1 / (2 α k)) Σ_{i=1}^k ( ||x^(i−1) − x*||_2² − ||x^(i) − x*||_2² )
                = (1 / (2 α k)) ( ||x^(0) − x*||_2² − ||x^(k) − x*||_2² )
                ≤ ||x^(0) − x*||_2² / (2 α k)
SLIDE 111
Accelerated gradient descent
◮ Gradient descent takes O(1/ε) iterations to achieve an error of ε
◮ The optimal rate is O(1/√ε)
◮ Gradient descent can be accelerated by adding a momentum term
SLIDE 112
Accelerated gradient descent
Set the initial point x^(0) to an arbitrary value
Update by setting
y^(k+1) = x^(k) − α_k ∇f(x^(k))
x^(k+1) = β_k y^(k+1) + γ_k y^(k)
where α_k is the step size and β_k, γ_k are parameters
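A sketch of one common realization of this scheme, Nesterov-style momentum x^(k+1) = y^(k+1) + μ ( y^(k+1) − y^(k) ), which corresponds to β_k = 1 + μ and γ_k = −μ; the parameter values below are assumptions, not given on the slide.

```python
import numpy as np

def accelerated_gradient(grad_f, x0, step, momentum=0.9, max_iters=500):
    """Gradient step on x, then extrapolation with a momentum term."""
    x = np.asarray(x0, dtype=float)
    y_prev = x.copy()
    for _ in range(max_iters):
        y = x - step * grad_f(x)            # y^(k+1) = x^(k) - alpha_k grad f(x^(k))
        x = y + momentum * (y - y_prev)     # x^(k+1) = (1 + mu) y^(k+1) - mu y^(k)
        y_prev = y
    return y_prev

# example: minimize f(x) = ||x - c||^2
c = np.array([2.0, -1.0])
print(accelerated_gradient(lambda x: 2 * (x - c), np.zeros(2), step=0.1))   # converges to c
```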
SLIDE 113
Digit classification
MNIST data
Aim: determine whether a digit is a 5 or not
x_i is an image; y_i = 1 or y_i = 0 if image i is a 5 or not, respectively
We fit a logistic-regression model
SLIDE 114
Digit classification
[Figure: running time (s) vs. problem size, on log-log axes, for gradient descent and accelerated gradient descent]
SLIDE 115
Stochastic gradient descent
Cost functions used to fit models are often additive
f(x) = (1/m) Σ_{i=1}^m f_i(x)
◮ Linear regression: Σ_{i=1}^n ( y^(i) − x^(i)T β )² = ||y − Xβ||_2²
◮ Logistic regression: Σ_{i=1}^n y^(i) log g(⟨x^(i), β⟩) + (1 − y^(i)) log( 1 − g(⟨x^(i), β⟩) )
SLIDE 116
Stochastic gradient descent
In the big-data regime (very large n), gradient descent is too slow
In some cases, data is acquired sequentially (online setting)
Stochastic gradient descent: update the solution using a subset of the data
SLIDE 117
Stochastic gradient descent
Set the initial point x^(0) to an arbitrary value
Update by
1. Choosing a random subset B of b indices (b ≪ m is the batch size)
2. Setting
x^(k+1) := x^(k) − α_k Σ_{i∈B} ∇f_i(x^(k))
where α_k is the step size
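A sketch of the mini-batch update applied to the additive least-squares cost; the batch size, step size and sampling scheme are choices, not fixed by the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, b = 10_000, 5, 100                   # examples, features, batch size
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta = np.zeros(p)
alpha = 1e-3
for k in range(2000):
    batch = rng.choice(n, size=b, replace=False)        # random subset B of b indices
    Xb, yb = X[batch], y[batch]
    beta = beta + 2 * alpha * Xb.T @ (yb - Xb @ beta)   # update uses only the mini-batch

print(np.linalg.norm(beta - beta_true))                 # close to beta_true, up to SGD noise
```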
SLIDE 118
Stochastic gradient descent
We replace ∇f by
Σ_{i∈B} ∇f_i(x^(k)) = Σ_{i=1}^m 1_{i∈B} ∇f_i(x^(k))
Noisy estimate of ∇f
Unbiased if every example is in the batch with probability p
E( Σ_{i=1}^m 1_{i∈B} ∇f_i(x^(k)) )
SLIDE 119
Stochastic gradient descent
We replace ∇f by
Σ_{i∈B} ∇f_i(x^(k)) = Σ_{i=1}^m 1_{i∈B} ∇f_i(x^(k))
Noisy estimate of ∇f
Unbiased if every example is in the batch with probability p
E( Σ_{i=1}^m 1_{i∈B} ∇f_i(x^(k)) ) = Σ_{i=1}^m E(1_{i∈B}) ∇f_i(x^(k))
SLIDE 120
Stochastic gradient descent
We replace ∇f by
Σ_{i∈B} ∇f_i(x^(k)) = Σ_{i=1}^m 1_{i∈B} ∇f_i(x^(k))
Noisy estimate of ∇f
Unbiased if every example is in the batch with probability p
E( Σ_{i=1}^m 1_{i∈B} ∇f_i(x^(k)) ) = Σ_{i=1}^m E(1_{i∈B}) ∇f_i(x^(k))
                                  = Σ_{i=1}^m P(i ∈ B) ∇f_i(x^(k))
SLIDE 121
Stochastic gradient descent
We replace ∇f by
Σ_{i∈B} ∇f_i(x^(k)) = Σ_{i=1}^m 1_{i∈B} ∇f_i(x^(k))
Noisy estimate of ∇f
Unbiased if every example is in the batch with probability p
E( Σ_{i=1}^m 1_{i∈B} ∇f_i(x^(k)) ) = Σ_{i=1}^m E(1_{i∈B}) ∇f_i(x^(k))
                                  = Σ_{i=1}^m P(i ∈ B) ∇f_i(x^(k))
                                  = p ∇f(x^(k))
SLIDE 122
Stochastic gradient descent
◮ Linear regression:
β^(k+1) := β^(k) + 2 α_k Σ_{i∈B} ( y^(i) − ⟨x^(i), β^(k)⟩ ) x^(i)
◮ Logistic regression:
β^(k+1) := β^(k) + α_k Σ_{i∈B} y^(i) ( 1 − g(⟨x^(i), β^(k)⟩) ) x^(i) − ( 1 − y^(i) ) g(⟨x^(i), β^(k)⟩) x^(i)
SLIDE 123
Digit classification
MNIST data
Aim: determine whether a digit is a 5 or not
x_i is an image; y_i = 1 or y_i = 0 if image i is a 5 or not, respectively
We fit a logistic-regression model
SLIDE 124
Digit classification
[Figure: training loss vs. number of steps for gradient descent and SGD with batch sizes 1, 10, 100, 1000 and 10000]
SLIDE 125
Newton’s method
Motivation: convex functions are often almost quadratic, f ≈ f²_x
Idea: iteratively minimize the quadratic approximation
f²_x(y) := f(x) + ∇f(x)^T (y − x) + (1/2) (y − x)^T ∇²f(x) (y − x)
The minimum has a closed form:
arg min_{y ∈ Rn} f²_x(y) = x − ∇²f(x)^{−1} ∇f(x)
SLIDE 126
Proof
We have
∇f²_x(y) = ∇f(x) + ∇²f(x)(y − x)
It is equal to zero if ∇²f(x)(y − x) = −∇f(x)
If the Hessian is positive definite, the only minimum of f²_x is at x − ∇²f(x)^{−1} ∇f(x)
SLIDE 127
Newton’s method
Set the initial point x^(0) to an arbitrary value
Update by setting
x^(k+1) := x^(k) − ∇²f(x^(k))^{−1} ∇f(x^(k))
until a stopping criterion is met
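A sketch of this update for a generic twice-differentiable f; solving the linear system ∇²f(x^(k)) d = ∇f(x^(k)) avoids forming the inverse explicitly.

```python
import numpy as np

def newton(grad_f, hess_f, x0, max_iters=50, tol=1e-10):
    """Newton's method: x^(k+1) = x^(k) - (hess f(x^(k)))^{-1} grad f(x^(k))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess_f(x), g)   # solve hess * d = grad instead of inverting
    return x

# example: f(x) = x^T A x - b^T x with A positive definite; one Newton step reaches the minimum
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
print(newton(lambda x: 2 * A @ x - b, lambda x: 2 * A, np.zeros(2)))
```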
SLIDE 128
Newton’s method
Quadratic approximation
SLIDE 129
Quadratic function
SLIDE 130
Quadratic function
SLIDE 131
Convex function
SLIDE 132
Logistic regression
∂²f(β)/∂β[j]∂β[l] = − Σ_{i=1}^n g(⟨x^(i), β⟩) ( 1 − g(⟨x^(i), β⟩) ) x^(i)[j] x^(i)[l]
∇²f(β) = −X^T G(β) X
The rows of X ∈ Rn×p contain x^(1), . . . , x^(n)
G is a diagonal matrix such that
G(β)_ii := g(⟨x^(i), β⟩) ( 1 − g(⟨x^(i), β⟩) ),   1 ≤ i ≤ n
SLIDE 133
Logistic regression
Newton updates are
β^(k+1) := β^(k) − ∇²f(β^(k))^{−1} ∇f(β^(k)) = β^(k) + ( X^T G(β^(k)) X )^{−1} ∇f(β^(k))
Sanity check: the cost function is concave, since for any β, v ∈ Rp
v^T ∇²f(β) v = − Σ_{i=1}^n g(⟨x^(i), β⟩) ( 1 − g(⟨x^(i), β⟩) ) ⟨x^(i), v⟩² ≤ 0
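A sketch of these Newton updates for the logistic log-likelihood, using the gradient Σ_i ( y^(i) − g(⟨x^(i), β⟩) ) x^(i) and −∇²f(β) = X^T G(β) X from the slides; the synthetic data and the fixed iteration count are assumptions.

```python
import numpy as np

def g(t):
    return 1.0 / (1.0 + np.exp(-t))          # logistic function

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.standard_normal((n, p))
y = (rng.uniform(size=n) < g(X @ np.array([1.0, -2.0, 0.5]))).astype(float)

beta = np.zeros(p)
for _ in range(10):
    mu = g(X @ beta)
    grad = X.T @ (y - mu)                       # gradient of the log-likelihood
    G_diag = mu * (1 - mu)                      # diagonal of G(beta)
    neg_hessian = X.T @ (G_diag[:, None] * X)   # -Hessian = X^T G(beta) X
    beta = beta + np.linalg.solve(neg_hessian, grad)   # beta - hess^{-1} grad, with hess = -X^T G X

print(beta)                                     # roughly recovers the coefficients used above
```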