compsci 514: algorithms for data science
Cameron Musco, University of Massachusetts Amherst. Fall 2019.
Lecture 18
logistics
- Problem Set 3 on Spectral Methods due this Friday at 8pm.
- Can turn in without penalty until Sunday at 11:59pm.
1
summary
Last Class:
- Power method for computing the top singular vector of a matrix.
- High-level discussion of Krylov methods and block versions for
computing more singular vectors.
- Power method is an iterative algorithm for solving the non-convex
optimization problem:
max_{⃗v : ∥⃗v∥₂² ≤ 1} ⃗vᵀXᵀX⃗v.
This Class (and until Thanksgiving):
- More general iterative algorithms for optimization, specifically
gradient descent and its variants.
- What are these methods, when are they applied, and how do you
analyze their performance?
- Small taste of what you can find in COMPSCI 590OP or 690OP.
2
discrete vs. continuous optimization
Discrete (Combinatorial) Optimization: (traditional CS algorithms)
- Graph Problems: min-cut, max flow, shortest path, matchings,
maximum independent set, traveling salesman problem
- Problems with discrete constraints or outputs: bin-packing,
scheduling, sequence alignment, submodular maximization
- Generally searching over a finite but exponentially large set of
possible solutions. Many of these problems are NP-Hard.
Continuous Optimization: (not covered in the core CS curriculum. Touched on in ML/advanced algorithms, maybe.)
- Unconstrained convex and non-convex optimization.
- Linear programming, quadratic programming, semidefinite
programming
3
continuous optimization examples
4
mathematical setup
Given some function f : Rd → R, find ⃗θ⋆ with:
f(⃗θ⋆) = min_{⃗θ∈Rd} f(⃗θ) + ϵ
Typically up to some small approximation factor ϵ. Often under some constraints:
- ∥⃗θ∥₂ ≤ 1, ∥⃗θ∥₁ ≤ 1.
- A⃗θ ≤ ⃗b, ⃗θᵀA⃗θ ≥ 0.
- ⃗1ᵀ⃗θ = ∑_{i=1}^d ⃗θ(i) ≤ c.
5
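For example (an added illustration, not from the slides): find ⃗θ minimizing f(⃗θ) = ∥X⃗θ − ⃗y∥₂² subject to the constraint ∥⃗θ∥₂ ≤ 1 (least squares regression restricted to the unit ball); the machine learning examples that follow are unconstrained or regularized problems of this same form.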
why continuous optimization?
Modern machine learning centers around continuous optimization.
Typical Set Up: (supervised machine learning)
- Have a model, which is a function mapping inputs to predictions
(neural network, linear function, low-degree polynomial, etc.).
- The model is parameterized by a parameter vector (weights in a
neural network, coefficients in a linear function or polynomial).
- Want to train this model on input data, by picking a parameter
vector such that the model does a good job mapping inputs to
predictions on your training data.
This training step is typically formulated as a continuous optimization problem.
6
optimization in ml
Example 1: Linear Regression
Model: M_⃗θ : Rd → R with M_⃗θ(⃗x) := ⟨⃗θ, ⃗x⟩ = ⃗θ(1)·⃗x(1) + . . . + ⃗θ(d)·⃗x(d).
Parameter Vector: ⃗θ ∈ Rd (the regression coefficients)
Optimization Problem: Given data points (training points) ⃗x1, . . . ,⃗xn (the rows of data matrix X ∈ Rn×d) and labels y1, . . . , yn ∈ R, find ⃗θ∗ minimizing the loss function:
L(⃗θ, X) = ∑_{i=1}^n ℓ(M_⃗θ(⃗xi), yi) + λ∥⃗θ∥₂²,
where ℓ is some measurement of how far M_⃗θ(⃗xi) is from yi.
- ℓ(M_⃗θ(⃗xi), yi) = (M_⃗θ(⃗xi) − yi)² (least squares regression)
- yi ∈ {−1, 1} and ℓ(M_⃗θ(⃗xi), yi) = ln(1 + exp(−yi·M_⃗θ(⃗xi))) (logistic regression)
7
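For concreteness, a minimal numpy sketch of these two loss functions (the variable names and the choice of numpy are illustrative assumptions, not from the slides):

import numpy as np

def squared_loss(theta, X, y, lam=0.0):
    # L(theta, X) = sum_i (theta^T x_i - y_i)^2 + lam * ||theta||_2^2
    residuals = X @ theta - y
    return np.sum(residuals ** 2) + lam * np.sum(theta ** 2)

def logistic_loss(theta, X, y, lam=0.0):
    # y_i in {-1, +1}; L(theta, X) = sum_i ln(1 + exp(-y_i * theta^T x_i)) + lam * ||theta||_2^2
    margins = y * (X @ theta)
    return np.sum(np.log1p(np.exp(-margins))) + lam * np.sum(theta ** 2)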
optimization in ml
Example 2: Neural Networks
Model: M_⃗θ : Rd → R. M_⃗θ(⃗x) = ⟨⃗wout, σ(W2 σ(W1 ⃗x))⟩.
Parameter Vector: ⃗θ ∈ R^(# edges) (the weights on every edge)
Optimization Problem: Given data points ⃗x1, . . . ,⃗xn and labels y1, . . . , yn ∈ R, find ⃗θ∗ minimizing the loss function:
L(⃗θ, X) = ∑_{i=1}^n ℓ(M_⃗θ(⃗xi), yi)
8
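A minimal numpy sketch of this forward pass, assuming a ReLU nonlinearity for σ (the function and variable names are illustrative):

import numpy as np

def relu(z):
    # one common choice for the nonlinearity sigma
    return np.maximum(z, 0.0)

def two_layer_net(x, W1, W2, w_out):
    # M_theta(x) = <w_out, sigma(W2 sigma(W1 x))>
    return w_out @ relu(W2 @ relu(W1 @ x))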
optimization in ml
L(⃗θ, X) = ∑_{i=1}^n ℓ(M_⃗θ(⃗xi), yi)
- Supervised means we have labels y1, . . . , yn for the training points.
- Solving the final optimization problem has many different names:
likelihood maximization, empirical risk minimization, minimizing
training loss, etc.
- Continuous optimization is also very common in unsupervised
learning (PCA, spectral clustering, etc.).
- Generalization tries to explain why minimizing the loss L(⃗θ, X) on
the training points also minimizes the loss on future test points,
i.e., gives good predictions on future inputs.
9
optimization algorithms
Choice of optimization algorithm for minimizing f(⃗θ) will depend on many things:
- The form of f (in ML, depends on the model & loss function).
- Any constraints on ⃗θ (e.g., ∥⃗θ∥ < c).
- Other constraints, such as memory constraints.
L(⃗θ, X) = ∑_{i=1}^n ℓ(M_⃗θ(⃗xi), yi)
What are some popular optimization algorithms?
10
gradient descent
This class: Gradient descent (and some important variants)
- An extremely simple greedy iterative method that can be applied
to almost any continuous function we care about optimizing.
- Often not the ‘best’ choice for any given function, but it is the
approach of choice in ML since it is simple, general, and often
works very well.
- At each step, it tries to move towards the lowest nearby point in
the function that it can – in the direction opposite the gradient.
11
multivariate calculus review
Let ⃗ei ∈ Rd denote the ith standard basis vector, ⃗ei = [0, 0, . . . , 1, . . . , 0] (1 at position i).
Partial Derivative:
∂f/∂⃗θ(i) = lim_{ϵ→0} [f(⃗θ + ϵ·⃗ei) − f(⃗θ)] / ϵ.
Directional Derivative:
D_⃗v f(⃗θ) = lim_{ϵ→0} [f(⃗θ + ϵ⃗v) − f(⃗θ)] / ϵ.
12
multivariate calculus review
Gradient: Just a ‘list’ of the partial derivatives.
⃗∇f(⃗θ) = [∂f/∂⃗θ(1), ∂f/∂⃗θ(2), . . . , ∂f/∂⃗θ(d)]
Directional Derivative in Terms of the Gradient:
D_⃗v f(⃗θ) = lim_{ϵ→0} [f(⃗θ + ϵ(⃗e1·⃗v(1) + ⃗e2·⃗v(2) + . . . + ⃗ed·⃗v(d))) − f(⃗θ)] / ϵ
≈ ⃗v(1)·∂f/∂⃗θ(1) + ⃗v(2)·∂f/∂⃗θ(2) + . . . + ⃗v(d)·∂f/∂⃗θ(d) = ⟨⃗v, ⃗∇f(⃗θ)⟩.
13
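A quick numerical check of this identity, using a finite difference in place of the limit (an added illustrative sketch, not from the slides):

import numpy as np

f = lambda theta: np.sum(theta ** 2)      # f(theta) = ||theta||_2^2
grad_f = lambda theta: 2 * theta          # its gradient

theta = np.array([1.0, -2.0, 0.5])
v = np.array([0.3, 0.4, -0.1])
eps = 1e-6

fd_estimate = (f(theta + eps * v) - f(theta)) / eps   # approximates D_v f(theta)
print(fd_estimate, np.dot(v, grad_f(theta)))          # the two values should nearly agree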
function access
Often the functions we are trying to optimize are very complex (e.g., a neural network). We will assume access to: Function Evaluation: Can compute f(⃗ θ) for any ⃗ θ. Gradient Evaluation: Can compute ⃗ ∇f(⃗ θ) for any ⃗ θ. In neural networks:
- Function evaluation is called a forward pass (propagate an
input through the network).
- Gradient evaluation is called a backward pass (compute the
gradient via chain rule, using backpropagation).
14
gradient example
Running Example: Least squares regression. Given input points ⃗x1, . . . ,⃗xn (the rows of data matrix X ∈ Rn×d) and labels y1, . . . , yn (the entries of ⃗y ∈ Rn), find ⃗θ∗ minimizing:
L(⃗θ, X) = ∑_{i=1}^n (⃗θᵀ⃗xi − yi)² = ∥X⃗θ − ⃗y∥₂².
By the chain rule:
∂L(⃗θ, X)/∂⃗θ(j) = ∑_{i=1}^n 2·(⃗θᵀ⃗xi − yi) · ∂(⃗θᵀ⃗xi − yi)/∂⃗θ(j) = ∑_{i=1}^n 2·(⃗θᵀ⃗xi − yi)·⃗xi(j),
since ∂(⃗θᵀ⃗xi − yi)/∂⃗θ(j) = ∂(⃗θᵀ⃗xi)/∂⃗θ(j) = lim_{ϵ→0} [(⃗θ + ϵ⃗ej)ᵀ⃗xi − ⃗θᵀ⃗xi] / ϵ = lim_{ϵ→0} ϵ⃗ejᵀ⃗xi / ϵ = ⃗xi(j).
15
gradient example
Partial derivative for least squares regression:
∂L(⃗θ, X)/∂⃗θ(j) = ∑_{i=1}^n 2·(⃗θᵀ⃗xi − yi)·⃗xi(j).
Stacking the d partial derivatives into a vector:
⃗∇L(⃗θ, X) = ∑_{i=1}^n 2·(⃗θᵀ⃗xi − yi)·⃗xi = 2Xᵀ(X⃗θ − ⃗y).
16
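A small numpy check of this gradient formula against a finite-difference estimate of one coordinate (an added sketch; the random data is purely illustrative):

import numpy as np

def least_squares_grad(theta, X, y):
    # grad L(theta, X) = 2 X^T (X theta - y)
    return 2 * X.T @ (X @ theta - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)
theta = rng.normal(size=3)

loss = lambda t: np.sum((X @ t - y) ** 2)
j, eps = 1, 1e-6
e_j = np.zeros(3)
e_j[j] = 1.0
fd = (loss(theta + eps * e_j) - loss(theta)) / eps    # finite-difference partial derivative
print(fd, least_squares_grad(theta, X, y)[j])         # should agree closely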
gradient example
Gradient for least squares regression via a linear algebraic approach:
⃗∇L(⃗θ, X) = ⃗∇∥X⃗θ − ⃗y∥₂², which again gives 2Xᵀ(X⃗θ − ⃗y).
17
gradient descent greedy approach
Gradient descent is a greedy iterative optimization algorithm: Starting at ⃗θ(0), in each iteration let ⃗θ(i) = ⃗θ(i−1) + η⃗v, where η is a (small) ‘step size’ and ⃗v is a direction chosen to minimize f(⃗θ(i−1) + η⃗v).
D_⃗v f(⃗θ(i−1)) = lim_{ϵ→0} [f(⃗θ(i−1) + ϵ⃗v) − f(⃗θ(i−1))] / ϵ.
So for small η:
f(⃗θ(i)) − f(⃗θ(i−1)) = f(⃗θ(i−1) + η⃗v) − f(⃗θ(i−1)) ≈ η · D_⃗v f(⃗θ(i−1)) = η · ⟨⃗v, ⃗∇f(⃗θ(i−1))⟩.
We want to choose ⃗v minimizing ⟨⃗v, ⃗∇f(⃗θ(i−1))⟩ – i.e., pointing in the direction of ⃗∇f(⃗θ(i−1)) but with the opposite sign.
18
gradient descent pseudocode
Gradient Descent
- Choose some initialization ⃗θ(0).
- For i = 1, . . . , t:
- ⃗θ(i) = ⃗θ(i−1) − η·⃗∇f(⃗θ(i−1))
- Return ⃗θ(t), as an approximate minimizer of f(⃗θ).
Step size η is chosen ahead of time or adapted during the algorithm (details to come).
- For now assume η stays the same in each iteration.
When will this algorithm work well?
19
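A minimal numpy implementation of this pseudocode, applied to the least squares running example (the step size, iteration count, and synthetic data below are illustrative choices, not values from the slides):

import numpy as np

def gradient_descent(grad_f, theta0, eta, t):
    # theta_i = theta_{i-1} - eta * grad f(theta_{i-1}), for i = 1, ..., t
    theta = theta0.copy()
    for _ in range(t):
        theta = theta - eta * grad_f(theta)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_theta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_theta + 0.1 * rng.normal(size=100)

grad_L = lambda theta: 2 * X.T @ (X @ theta - y)   # gradient of ||X theta - y||_2^2
theta_hat = gradient_descent(grad_L, np.zeros(5), eta=1e-3, t=2000)
print(theta_hat)   # approaches the least squares solution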
Gradient Descent Update: ⃗ θ(i) = ⃗ θ(i−1) − η∇f(⃗ θ(i−1))
20
conditions for gradient descent convergence
Convex Functions: After sufficient iterations, gradient descent will converge to an approximate minimizer θ̂ with:
f(θ̂) ≤ f(θ∗) + ϵ = min_θ f(θ) + ϵ.
Examples: least squares regression, logistic regression, sparse regression (lasso), regularized regression, SVMs, ...
Non-Convex Functions: After sufficient iterations, gradient descent will converge to an approximate stationary point θ̂ with:
∥⃗∇f(θ̂)∥₂ ≤ ϵ.
Examples: neural networks, clustering, mixture models.
21
stationary point vs. local minimum
Why, for non-convex functions, do we only guarantee convergence to an approximate stationary point rather than an approximate local minimum?
22
well-behaved functions
Gradient Descent Update: ⃗ θ(i) = ⃗ θ(i−1) − η∇f(⃗ θ(i−1))
23
well-behaved functions
Both Convex and Non-convex: Need to assume the function is well behaved in some way.
- Lipschitz (size of gradient is bounded): For all ⃗θ and some G,
∥⃗∇f(⃗θ)∥₂ ≤ G.
- Smooth (direction/size of gradient is not changing too
quickly): For all ⃗ θ1, ⃗ θ2 and some β, ∥⃗ ∇f(⃗ θ1) − ⃗ ∇f(⃗ θ2)∥2 ≤ β · ∥⃗ θ1 − ⃗ θ2∥2.
25
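As a concrete (added) example for the running least squares loss L(⃗θ, X) = ∥X⃗θ − ⃗y∥₂², whose gradient is 2Xᵀ(X⃗θ − ⃗y):
\[
\nabla L(\theta_1, X) - \nabla L(\theta_2, X) = 2X^TX(\theta_1 - \theta_2)
\;\Longrightarrow\;
\|\nabla L(\theta_1, X) - \nabla L(\theta_2, X)\|_2 \le 2\|X^TX\|_2 \cdot \|\theta_1 - \theta_2\|_2,
\]
so least squares is β-smooth with β = 2∥XᵀX∥₂ (twice the largest eigenvalue of XᵀX). Its gradient grows without bound as ∥⃗θ∥₂ → ∞, so it is G-Lipschitz only when ⃗θ is restricted to a bounded region.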
Gradient Descent analysis for convex functions.
26
convexity
Definition – Convex Function: A function f : Rd → R is convex if and only if, for any ⃗ θ1, ⃗ θ2 ∈ Rd and λ ∈ [0, 1]: (1 − λ) · f(⃗ θ1) + λ · f(⃗ θ2) ≥ f ( (1 − λ) · ⃗ θ1 + λ · ⃗ θ2 )
27
convexity
Corollary – Convex Function: A function f : Rd → R is convex if and only if, for any ⃗ θ1, ⃗ θ2 ∈ Rd and λ ∈ [0, 1]: f(⃗ θ2) − f(⃗ θ1) ≥ ⃗ ∇f(⃗ θ1)T ( ⃗ θ2 − ⃗ θ1 )
28
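As a concrete (added) check that the least squares loss satisfies this corollary: with f(⃗θ) = ∥X⃗θ − ⃗y∥₂² and ⃗∇f(⃗θ) = 2Xᵀ(X⃗θ − ⃗y),
\[
f(\theta_2) = \|X\theta_1 - y + X(\theta_2 - \theta_1)\|_2^2
= f(\theta_1) + 2(X\theta_1 - y)^T X(\theta_2 - \theta_1) + \|X(\theta_2 - \theta_1)\|_2^2
\;\ge\; f(\theta_1) + \nabla f(\theta_1)^T(\theta_2 - \theta_1),
\]
since the dropped term ∥X(⃗θ2 − ⃗θ1)∥₂² is non-negative.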
gd analysis – convex functions
Assume that:
- f is convex.
- f is G-Lipschitz (i.e., ∥⃗∇f(⃗θ)∥₂ ≤ G for all ⃗θ).
- ∥⃗θ0 − ⃗θ∗∥₂ ≤ R, where ⃗θ0 is the initialization point.
Gradient Descent
- Choose some initialization ⃗θ0 and set η = R/(G√t).
- For i = 1, . . . , t:
- ⃗θi = ⃗θi−1 − η⃗∇f(⃗θi−1)
- Return θ̂ = arg min_{⃗θi ∈ {⃗θ0, . . . , ⃗θt}} f(⃗θi).
29
gd analysis proof
Theorem – GD on Convex Lipschitz Functions: For a convex, G-Lipschitz function f, GD run with t ≥ R²G²/ϵ² iterations, η = R/(G√t), and starting point within radius R of θ∗, outputs θ̂ satisfying: f(θ̂) ≤ f(θ∗) + ϵ.
Step 1: For all i, f(θi) − f(θ∗) ≤ (∥θi − θ∗∥₂² − ∥θi+1 − θ∗∥₂²)/(2η) + ηG²/2. Visually:
30
gd analysis proof
Theorem – GD on Convex Lipschitz Functions: For a convex, G-Lipschitz function f, GD run with t ≥ R²G²/ϵ² iterations, η = R/(G√t), and starting point within radius R of θ∗, outputs θ̂ satisfying: f(θ̂) ≤ f(θ∗) + ϵ.
Step 1: For all i, f(θi) − f(θ∗) ≤ (∥θi − θ∗∥₂² − ∥θi+1 − θ∗∥₂²)/(2η) + ηG²/2. Formally:
31
gd analysis proof
Theorem – GD on Convex Lipschitz Functions: For a convex, G-Lipschitz function f, GD run with t ≥ R²G²/ϵ² iterations, η = R/(G√t), and starting point within radius R of θ∗, outputs θ̂ satisfying: f(θ̂) ≤ f(θ∗) + ϵ.
Step 1: For all i, f(θi) − f(θ∗) ≤ (∥θi − θ∗∥₂² − ∥θi+1 − θ∗∥₂²)/(2η) + ηG²/2.
Step 1.1: ∇f(θi)ᵀ(θi − θ∗) ≤ (∥θi − θ∗∥₂² − ∥θi+1 − θ∗∥₂²)/(2η) + ηG²/2 ⟹ Step 1.
32
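A sketch of where Step 1.1 comes from, filled in here from the GD update θi+1 = θi − η∇f(θi) and the Lipschitz bound (the extracted slides show only the statement):
\[
\|\theta_{i+1} - \theta^*\|_2^2 = \|\theta_i - \eta\nabla f(\theta_i) - \theta^*\|_2^2
= \|\theta_i - \theta^*\|_2^2 - 2\eta\,\nabla f(\theta_i)^T(\theta_i - \theta^*) + \eta^2\|\nabla f(\theta_i)\|_2^2,
\]
and bounding ∥∇f(θi)∥₂ ≤ G and rearranging gives Step 1.1. Step 1 then follows via the convexity corollary above, which gives f(θi) − f(θ∗) ≤ ∇f(θi)ᵀ(θi − θ∗).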
gd analysis proof
Theorem – GD on Convex Lipschitz Functions: For a convex, G-Lipschitz function f, GD run with t ≥ R²G²/ϵ² iterations, η = R/(G√t), and starting point within radius R of θ∗, outputs θ̂ satisfying: f(θ̂) ≤ f(θ∗) + ϵ.
Step 1: For all i, f(θi) − f(θ∗) ≤ (∥θi − θ∗∥₂² − ∥θi+1 − θ∗∥₂²)/(2η) + ηG²/2
⟹ Step 2: (1/T)·∑_{i=1}^T [f(θi) − f(θ∗)] ≤ R²/(2η·T) + ηG²/2.
33
gd analysis proof
Theorem – GD on Convex Lipschitz Functions: For a convex, G-Lipschitz function f, GD run with t ≥ R²G²/ϵ² iterations, η = R/(G√t), and starting point within radius R of θ∗, outputs θ̂ satisfying: f(θ̂) ≤ f(θ∗) + ϵ.
Step 2: (1/T)·∑_{i=1}^T [f(θi) − f(θ∗)] ≤ R²/(2η·T) + ηG²/2.
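The transcript is cut off here. For completeness, a sketch of how Step 2 gives the theorem with the slides' step size η = R/(G√t) and T = t iterations (this wrap-up is reconstructed, not part of the extracted text):
\[
\frac{1}{t}\sum_{i=1}^{t}\big(f(\theta_i) - f(\theta^*)\big) \le \frac{R^2}{2\eta t} + \frac{\eta G^2}{2} = \frac{RG}{2\sqrt{t}} + \frac{RG}{2\sqrt{t}} = \frac{RG}{\sqrt{t}},
\]
and since f(θ̂) = min_i f(θi) is at most this average, f(θ̂) − f(θ∗) ≤ RG/√t ≤ ϵ whenever t ≥ R²G²/ϵ².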