slide-1
SLIDE 1

compsci 514: algorithms for data science

Cameron Musco University of Massachusetts Amherst. Fall 2019. Lecture 18

slide-2
SLIDE 2

logistics

  • Problem Set 3 on Spectral Methods due this Friday at 8pm.
  • Can turn in without penalty until Sunday at 11:59pm.

1

slide-3
SLIDE 3

summary

Last Class:

  • Power method for computing the top singular vector of a matrix.
  • High level discussion of Krylov methods and block versions for computing more singular vectors.
  • Power method is an iterative algorithm for solving the non-convex optimization problem:

    max_{v : ∥v∥₂² ≤ 1} vᵀXᵀXv.

This Class (and until Thanksgiving):

  • More general iterative algorithms for optimization, specifically gradient descent and its variants.
  • What are these methods, when are they applied, and how do you analyze their performance?
  • A small taste of what you can find in COMPSCI 590OP or 690OP.

2
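
As a quick refresher on last class, here is a minimal power-method sketch in Python/NumPy for approximating the top right singular vector of X (the maximizer of vᵀXᵀXv over unit vectors). The matrix, iteration count, and random seed below are illustrative assumptions, not from the slides.

```python
import numpy as np

def power_method(X, num_iters=100, seed=0):
    """Approximate the top right singular vector of X by power iteration on X^T X."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)              # start from a random unit vector
    for _ in range(num_iters):
        v = X.T @ (X @ v)               # multiply by X^T X without forming it explicitly
        v /= np.linalg.norm(v)          # re-normalize each iteration
    return v

# Tiny usage example on illustrative data.
X = np.random.default_rng(1).standard_normal((50, 5))
v = power_method(X)
_, _, Vt = np.linalg.svd(X)
print(abs(v @ Vt[0]))                   # ≈ 1: v aligns with the true top singular vector
```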


slide-5
SLIDE 5

discrete vs. continuous optimization

Discrete (Combinatorial) Optimization: (traditional CS algorithms)

  • Graph problems: min-cut, max flow, shortest path, matchings, maximum independent set, traveling salesman problem.
  • Problems with discrete constraints or outputs: bin-packing, scheduling, sequence alignment, submodular maximization.
  • Generally searching over a finite but exponentially large set of possible solutions. Many of these problems are NP-Hard.

Continuous Optimization: (not covered in the core CS curriculum; touched on in ML/advanced algorithms, maybe)

  • Unconstrained convex and non-convex optimization.
  • Linear programming, quadratic programming, semidefinite programming.

3


slide-8
SLIDE 8

continuous optimization examples

4


slide-10
SLIDE 10

mathematical setup

Given some function f : Rd → R, find θ⋆ with:

    f(θ⋆) = min_{θ ∈ Rd} f(θ),

typically only up to some small approximation factor ϵ, i.e., f(θ⋆) ≤ min_{θ ∈ Rd} f(θ) + ϵ. Often under some constraints:

  • ∥θ∥₂ ≤ 1, ∥θ∥₁ ≤ 1.
  • Aθ ≤ b, θᵀAθ ≥ 0.
  • 1ᵀθ = ∑_{i=1}^d θ(i) ≤ c.

5


slide-13
SLIDE 13

why continuous optimization?

Modern machine learning centers around continuous optimization.

Typical Set Up: (supervised machine learning)

  • Have a model, which is a function mapping inputs to predictions (neural network, linear function, low-degree polynomial, etc.).
  • The model is parameterized by a parameter vector (weights in a neural network, coefficients in a linear function or polynomial).
  • Want to train this model on input data, by picking a parameter vector such that the model does a good job mapping inputs to predictions on your training data.
  • This training step is typically formulated as a continuous optimization problem.

6


slide-17
SLIDE 17
optimization in ml

Example 1: Linear Regression

Model: M_θ : Rd → R with M_θ(x) := ⟨θ, x⟩ = θ(1)·x(1) + . . . + θ(d)·x(d).

Parameter Vector: θ ∈ Rd (the regression coefficients)

Optimization Problem: Given data points (training points) x_1, . . . , x_n (the rows of data matrix X ∈ Rn×d) and labels y_1, . . . , y_n ∈ R, find θ∗ minimizing the loss function:

    L(θ, X) = ∑_{i=1}^n ℓ(M_θ(x_i), y_i) + R(θ),   e.g. R(θ) = λ∥θ∥₂²,

where ℓ is some measurement of how far M_θ(x_i) is from y_i.

  • ℓ(M_θ(x_i), y_i) = (M_θ(x_i) − y_i)² (least squares regression)
  • y_i ∈ {−1, 1} and ℓ(M_θ(x_i), y_i) = ln(1 + exp(−y_i M_θ(x_i))) (logistic regression)

7
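
A minimal sketch of the two losses above for a linear model M_θ(x) = ⟨θ, x⟩; the data, regularization weight, and parameter values are illustrative assumptions.

```python
import numpy as np

def squared_loss(theta, X, y, lam=0.0):
    residuals = X @ theta - y                   # M_theta(x_i) - y_i for every i
    return np.sum(residuals ** 2) + lam * np.sum(theta ** 2)

def logistic_loss(theta, X, y, lam=0.0):
    # here each y_i is assumed to be in {-1, +1}
    margins = y * (X @ theta)                   # y_i * M_theta(x_i)
    return np.sum(np.log(1.0 + np.exp(-margins))) + lam * np.sum(theta ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
theta = rng.standard_normal(5)
y_real = X @ theta + 0.1 * rng.standard_normal(100)   # regression labels
y_sign = np.sign(y_real)                               # {-1, +1} labels
print(squared_loss(theta, X, y_real, lam=0.1))
print(logistic_loss(theta, X, y_sign, lam=0.1))
```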


slide-25
SLIDE 25
optimization in ml

Example 2: Neural Networks

Model: M_θ : Rd → R with M_θ(x) = ⟨w_out, σ(W2 σ(W1 x))⟩.

Parameter Vector: θ ∈ R^(# edges) (the weights on every edge)

Optimization Problem: Given data points x_1, . . . , x_n and labels y_1, . . . , y_n ∈ R, find θ∗ minimizing the loss function:

    L(θ, X) = ∑_{i=1}^n ℓ(M_θ(x_i), y_i).

8
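
As a concrete picture of the model above, here is a minimal NumPy sketch of the forward map M_θ(x) = ⟨w_out, σ(W2σ(W1x))⟩. Taking σ to be ReLU and the layer sizes and random weights below are illustrative assumptions, not from the slides.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def model(x, W1, W2, w_out):
    """Forward map M_theta(x) = <w_out, relu(W2 relu(W1 x))>."""
    return w_out @ relu(W2 @ relu(W1 @ x))

# Illustrative dimensions: d = 4 inputs, hidden layers of width 8 and 6.
rng = np.random.default_rng(0)
W1, W2, w_out = rng.standard_normal((8, 4)), rng.standard_normal((6, 8)), rng.standard_normal(6)
x = rng.standard_normal(4)
print(model(x, W1, W2, w_out))   # a single real-valued prediction
```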


slide-29
SLIDE 29
optimization in ml

    L(θ, X) = ∑_{i=1}^n ℓ(M_θ(x_i), y_i)

  • Supervised means we have labels y_1, . . . , y_n for the training points.
  • Solving the final optimization problem has many different names: likelihood maximization, empirical risk minimization, minimizing training loss, etc.
  • Continuous optimization is also very common in unsupervised learning (PCA, spectral clustering, etc.).
  • Generalization tries to explain why minimizing the loss L(θ, X) on the training points also minimizes the loss on future test points, i.e., why it gives good predictions on future inputs.

9


slide-34
SLIDE 34
optimization algorithms

Choice of optimization algorithm for minimizing f(θ) will depend on many things:

  • The form of f (in ML, depends on the model & loss function).
  • Any constraints on θ (e.g., ∥θ∥ < c).
  • Other constraints, such as memory constraints.

    L(θ, X) = ∑_{i=1}^n ℓ(M_θ(x_i), y_i)

What are some popular optimization algorithms?

10


slide-36
SLIDE 36

gradient descent

This class: Gradient descent (and some important variants)

  • An extremely simple greedy iterative method that can be applied to almost any continuous function we care about optimizing.
  • Often not the ‘best’ choice for any given function, but it is the approach of choice in ML since it is simple, general, and often works very well.
  • At each step, tries to move towards the lowest nearby point in the function that it can – in the direction of the negative gradient.

11


slide-41
SLIDE 41

multivariate calculus review

Let e_i ∈ Rd denote the ith standard basis vector, e_i = [0, 0, . . . , 0, 1, 0, . . . , 0] (1 at position i).

Partial Derivative:

    ∂f/∂θ(i) = lim_{ϵ→0} ( f(θ + ϵ·e_i) − f(θ) ) / ϵ.

Directional Derivative:

    D_v f(θ) = lim_{ϵ→0} ( f(θ + ϵ·v) − f(θ) ) / ϵ.

12
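
To make the two limits above concrete, here is a small finite-difference sketch (a numerical stand-in for the limits, using a small fixed ϵ); the test function and the choices of θ, i, and v are illustrative assumptions.

```python
import numpy as np

def f(theta):
    # illustrative test function: f(theta) = sum_j theta(j)^2 + theta(1)*theta(2)
    return np.sum(theta ** 2) + theta[0] * theta[1]

def partial_derivative(f, theta, i, eps=1e-6):
    e_i = np.zeros_like(theta)
    e_i[i] = 1.0
    return (f(theta + eps * e_i) - f(theta)) / eps      # ≈ ∂f/∂θ(i)

def directional_derivative(f, theta, v, eps=1e-6):
    return (f(theta + eps * v) - f(theta)) / eps        # ≈ D_v f(θ)

theta = np.array([1.0, 2.0, -1.0])
v = np.array([0.5, -0.5, 1.0])
print(partial_derivative(f, theta, 0))    # ≈ 2·θ(1) + θ(2) = 4
print(directional_derivative(f, theta, v))
```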


slide-44
SLIDE 44

multivariate calculus review

Gradient: Just a ‘list’ of the partial derivatives:

    ∇f(θ) = [ ∂f/∂θ(1), ∂f/∂θ(2), . . . , ∂f/∂θ(d) ]ᵀ.

Directional Derivative in Terms of the Gradient:

    D_v f(θ) = lim_{ϵ→0} ( f(θ + ϵ(e_1·v(1) + e_2·v(2) + . . . + e_d·v(d))) − f(θ) ) / ϵ
             ≈ v(1)·∂f/∂θ(1) + v(2)·∂f/∂θ(2) + . . . + v(d)·∂f/∂θ(d)
             = ⟨v, ∇f(θ)⟩.

13
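
A quick numerical sanity check of the identity D_v f(θ) ≈ ⟨v, ∇f(θ)⟩, with the gradient assembled from finite-difference partial derivatives; the test function and chosen points are illustrative assumptions.

```python
import numpy as np

def f(theta):
    # illustrative smooth test function
    return np.sum(theta ** 4) + np.sin(theta[0])

def numerical_gradient(f, theta, eps=1e-6):
    # assemble the 'list of partial derivatives' by finite differences
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = 1.0
        grad[i] = (f(theta + eps * e_i) - f(theta)) / eps
    return grad

theta = np.array([0.3, -1.2, 0.7])
v = np.array([1.0, 2.0, -1.0])
eps = 1e-6
directional = (f(theta + eps * v) - f(theta)) / eps    # D_v f(θ) by its definition
print(directional, numerical_gradient(f, theta) @ v)   # the two agree up to O(eps)
```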


slide-51
SLIDE 51

function access

Often the functions we are trying to optimize are very complex (e.g., a neural network). We will assume access to:

Function Evaluation: Can compute f(θ) for any θ.

Gradient Evaluation: Can compute ∇f(θ) for any θ.

In neural networks:

  • Function evaluation is called a forward pass (propagate an input through the network).
  • Gradient evaluation is called a backward pass (compute the gradient via the chain rule, using backpropagation).

14
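
Here is a minimal sketch of what a forward and backward pass look like for the two-layer ReLU network of Example 2: the forward pass evaluates M_θ(x), and the backward pass computes gradients of that scalar output with respect to all the weights via the chain rule, checked against a finite difference. The ReLU choice of σ, the layer sizes, and the random data are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, W1, W2, w_out):
    """Forward pass: M_theta(x) = <w_out, relu(W2 relu(W1 x))>, caching activations."""
    a1 = W1 @ x; h1 = relu(a1)
    a2 = W2 @ h1; h2 = relu(a2)
    return w_out @ h2, (a1, h1, a2, h2)

def backward(x, W1, W2, w_out, cache):
    """Backward pass: gradients of the output w.r.t. all weights, via the chain rule."""
    a1, h1, a2, h2 = cache
    g_wout = h2                                   # d out / d w_out
    g_a2 = w_out * (a2 > 0)                       # d out / d a2 (ReLU derivative)
    g_W2 = np.outer(g_a2, h1)                     # d out / d W2
    g_a1 = (W2.T @ g_a2) * (a1 > 0)               # d out / d a1
    g_W1 = np.outer(g_a1, x)                      # d out / d W1
    return g_W1, g_W2, g_wout

rng = np.random.default_rng(0)
W1, W2, w_out = rng.standard_normal((8, 4)), rng.standard_normal((6, 8)), rng.standard_normal(6)
x = rng.standard_normal(4)

out, cache = forward(x, W1, W2, w_out)
g_W1, g_W2, g_wout = backward(x, W1, W2, w_out, cache)

# Finite-difference check on one weight of W1.
eps = 1e-6
W1_pert = W1.copy(); W1_pert[2, 3] += eps
print(g_W1[2, 3], (forward(x, W1_pert, W2, w_out)[0] - out) / eps)  # should match closely
```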


slide-53
SLIDE 53

gradient example

Running Example: Least squares regression. Given input points x_1, . . . , x_n (the rows of data matrix X ∈ Rn×d) and labels y_1, . . . , y_n (the entries of y ∈ Rn), find θ∗ minimizing:

    L(θ, X) = ∑_{i=1}^n (θᵀx_i − y_i)² = ∥Xθ − y∥₂².

By the chain rule:

    ∂L(θ, X)/∂θ(j) = ∑_{i=1}^n 2·(θᵀx_i − y_i) · ∂(θᵀx_i − y_i)/∂θ(j) = ∑_{i=1}^n 2·(θᵀx_i − y_i)·x_i(j),

since

    ∂(θᵀx_i − y_i)/∂θ(j) = ∂(θᵀx_i)/∂θ(j) = lim_{ϵ→0} ( (θ + ϵ·e_j)ᵀx_i − θᵀx_i ) / ϵ = lim_{ϵ→0} ϵ·e_jᵀx_i / ϵ = x_i(j).

15


slide-60
SLIDE 60

gradient example

Partial derivative for least squares regression:

    ∂L(θ, X)/∂θ(j) = ∑_{i=1}^n 2·(θᵀx_i − y_i)·x_i(j).

Stacking all d partial derivatives:

    ∇L(θ, X) = ∑_{i=1}^n 2·(θᵀx_i − y_i)·x_i = 2Xᵀ(Xθ − y).

16


slide-63
SLIDE 63

gradient example

Gradient for least squares regression via linear algebraic approach:

    ∇L(θ, X) = ∇∥Xθ − y∥₂²

17
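
A small sketch checking the least squares gradient numerically: the coordinate-wise sum ∑ᵢ 2(θᵀxᵢ − yᵢ)xᵢ, its matrix form, and a finite-difference gradient of ∥Xθ − y∥₂² should all agree; the data here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
y = rng.standard_normal(30)
theta = rng.standard_normal(4)

def loss(theta):
    return np.sum((X @ theta - y) ** 2)          # L(θ, X) = ∥Xθ − y∥₂²

# Coordinate-wise formula: sum_i 2 (θᵀx_i − y_i) x_i
grad_sum = sum(2.0 * (x_i @ theta - y_i) * x_i for x_i, y_i in zip(X, y))

# Linear algebraic form of the same gradient.
grad_matrix = 2.0 * X.T @ (X @ theta - y)

# Finite-difference gradient, one coordinate at a time.
eps = 1e-6
grad_fd = np.array([(loss(theta + eps * e_j) - loss(theta)) / eps
                    for e_j in np.eye(4)])

print(np.allclose(grad_sum, grad_matrix))        # True
print(np.max(np.abs(grad_matrix - grad_fd)))     # small (finite-difference error)
```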

slide-64
SLIDE 64

gradient descent greedy approach

Gradient descent is a greedy iterative optimization algorithm: Starting at θ(0), in each iteration let θ(i) = θ(i−1) + η·v, where η is a (small) ‘step size’ and v is a direction chosen to minimize f(θ(i−1) + η·v).

    D_v f(θ(i−1)) = lim_{ϵ→0} ( f(θ(i−1) + ϵ·v) − f(θ(i−1)) ) / ϵ.

So for small η:

    f(θ(i)) − f(θ(i−1)) = f(θ(i−1) + η·v) − f(θ(i−1)) ≈ η · D_v f(θ(i−1)) = η · ⟨v, ∇f(θ(i−1))⟩.

We want to choose v minimizing ⟨v, ∇f(θ(i−1))⟩ – i.e., pointing in the direction of ∇f(θ(i−1)) but with the opposite sign.

18


slide-71
SLIDE 71

gradient descent pseudocode

Gradient Descent

  • Choose some initialization θ(0).
  • For i = 1, . . . , t: θ(i) = θ(i−1) − η∇f(θ(i−1)).
  • Return θ(t), as an approximate minimizer of f(θ).

Step size η is chosen ahead of time or adapted during the algorithm (details to come).

  • For now assume η stays the same in each iteration.

When will this algorithm work well?

19
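
A direct translation of the pseudocode above into Python, applied to the least squares running example; the data, step size, and iteration count are illustrative assumptions (step size selection is discussed later).

```python
import numpy as np

def gradient_descent(grad_f, theta0, eta, t):
    """theta(i) = theta(i-1) - eta * grad_f(theta(i-1)), for i = 1, ..., t."""
    theta = theta0.copy()
    for _ in range(t):
        theta = theta - eta * grad_f(theta)
    return theta

# Least squares running example: f(theta) = ||X theta - y||_2^2.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.01 * rng.standard_normal(100)

grad = lambda theta: 2.0 * X.T @ (X @ theta - y)
theta_hat = gradient_descent(grad, np.zeros(5), eta=1e-3, t=2000)

print(np.sum((X @ theta_hat - y) ** 2))                              # loss at the GD output
print(np.sum((X @ np.linalg.lstsq(X, y, rcond=None)[0] - y) ** 2))   # optimal least squares loss
```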


slide-74
SLIDE 74

Gradient Descent Update: ⃗ θ(i) = ⃗ θ(i−1) − η∇f(⃗ θ(i−1))

20

slide-75
SLIDE 75

conditions for gradient descent convergence

Convex Functions: After sufficient iterations, gradient descent will converge to an approximate minimizer θ̂ with:

    f(θ̂) ≤ f(θ∗) + ϵ = min_θ f(θ) + ϵ.

Examples: least squares regression, logistic regression, sparse regression (lasso), regularized regression, SVMs, ...

Non-Convex Functions: After sufficient iterations, gradient descent will converge to an approximate stationary point θ̂ with:

    ∥∇f(θ̂)∥₂ ≤ ϵ.

Examples: neural networks, clustering, mixture models.

21


slide-80
SLIDE 80

stationary point vs. local minimum

Why, for non-convex functions, do we only guarantee convergence to an approximate stationary point rather than an approximate local minimum?

22


slide-82
SLIDE 82

well-behaved functions

Gradient Descent Update: ⃗ θ(i) = ⃗ θ(i−1) − η∇f(⃗ θ(i−1))

23

slide-83
SLIDE 83

well-behaved functions

Gradient Descent Update: ⃗ θ(i) = ⃗ θ(i−1) − η∇f(⃗ θ(i−1))

24

slide-84
SLIDE 84

well-behaved functions

Both Convex and Non-convex: Need to assume the function is well behaved in some way.

  • Lipschitz (size of gradient is bounded): For all θ and some G, ∥∇f(θ)∥₂ ≤ G.
  • Smooth (direction/size of gradient is not changing too quickly): For all θ1, θ2 and some β, ∥∇f(θ1) − ∇f(θ2)∥₂ ≤ β · ∥θ1 − θ2∥₂.

25
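
As a worked illustration of the smoothness condition (my example under stated assumptions, not from the slides): for the least squares loss the gradient is 2Xᵀ(Xθ − y), so ∇f(θ1) − ∇f(θ2) = 2XᵀX(θ1 − θ2) and β = 2σ_max(X)² works. The sketch below checks the inequality on random pairs with illustrative data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
y = rng.standard_normal(50)

grad = lambda theta: 2.0 * X.T @ (X @ theta - y)     # gradient of ||X theta - y||_2^2
beta = 2.0 * np.linalg.norm(X, ord=2) ** 2           # 2 * sigma_max(X)^2

# Check ||grad(t1) - grad(t2)||_2 <= beta * ||t1 - t2||_2 on random pairs.
for _ in range(5):
    t1, t2 = rng.standard_normal(6), rng.standard_normal(6)
    lhs = np.linalg.norm(grad(t1) - grad(t2))
    rhs = beta * np.linalg.norm(t1 - t2)
    print(lhs <= rhs + 1e-9)                         # True every time
```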


slide-86
SLIDE 86

Gradient Descent analysis for convex functions.

26

slide-87
SLIDE 87

convexity

Definition – Convex Function: A function f : Rd → R is convex if and only if, for any ⃗ θ1, ⃗ θ2 ∈ Rd and λ ∈ [0, 1]: (1 − λ) · f(⃗ θ1) + λ · f(⃗ θ2) ≥ f ( (1 − λ) · ⃗ θ1 + λ · ⃗ θ2 )

27

slide-88
SLIDE 88

convexity

Corollary – Convex Function: A function f : Rd → R is convex if and only if, for any ⃗ θ1, ⃗ θ2 ∈ Rd: f(⃗ θ2) − f(⃗ θ1) ≥ ⃗ ∇f(⃗ θ1)T ( ⃗ θ2 − ⃗ θ1 )

28
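
A small numerical illustration of both characterizations for the (convex) least squares loss: the chord inequality from the definition, and the first-order corollary above; the data and test points are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
y = rng.standard_normal(40)

f = lambda th: np.sum((X @ th - y) ** 2)
grad = lambda th: 2.0 * X.T @ (X @ th - y)

for _ in range(5):
    t1, t2 = rng.standard_normal(3), rng.standard_normal(3)
    lam = rng.uniform()
    # Definition: (1-lam) f(t1) + lam f(t2) >= f((1-lam) t1 + lam t2)
    chord_ok = (1 - lam) * f(t1) + lam * f(t2) >= f((1 - lam) * t1 + lam * t2) - 1e-9
    # Corollary: f(t2) - f(t1) >= grad(t1)^T (t2 - t1)
    first_order_ok = f(t2) - f(t1) >= grad(t1) @ (t2 - t1) - 1e-9
    print(chord_ok, first_order_ok)                   # True, True
```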

slide-89
SLIDE 89

gd analysis – convex functions

Assume that:

  • f is convex.
  • f is G-Lipschitz (i.e., ∥∇f(θ)∥₂ ≤ G for all θ).
  • ∥θ0 − θ∗∥₂ ≤ R, where θ0 is the initialization point.

Gradient Descent

  • Choose some initialization θ0 and set η = R/(G√t).
  • For i = 1, . . . , t: θi = θi−1 − η∇f(θi−1).
  • Return θ̂ = arg min_{θ0, . . . , θt} f(θi).

29
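
A sketch of exactly this variant (fixed η = R/(G√t), return the best iterate), run on an illustrative convex, 1-Lipschitz function f(θ) = ∥θ − a∥₂ whose minimizer a and minimum value 0 are known, so the error can be observed directly. The test function and all parameters are my assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(10)

f = lambda th: np.linalg.norm(th - a)                # convex, 1-Lipschitz, minimized at a

def grad(th):
    diff = th - a
    n = np.linalg.norm(diff)
    return diff / n if n > 0 else np.zeros_like(diff)   # (sub)gradient; 0 at the minimizer

theta0 = np.zeros(10)
G = 1.0
R = np.linalg.norm(theta0 - a)                       # distance from start to the minimizer
t = 10000
eta = R / (G * np.sqrt(t))

theta, best = theta0, theta0
for _ in range(t):
    theta = theta - eta * grad(theta)
    if f(theta) < f(best):
        best = theta                                 # keep the best iterate seen

print(f(best))                    # f(theta_hat) - f(theta*), since f(theta*) = 0
print(R * G / np.sqrt(t))         # the R*G/sqrt(t) error scale from the analysis
```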

slide-90
SLIDE 90

gd analysis proof

Theorem – GD on Convex Lipschitz Functions: For convex G-Lipschitz function f, GD run with t ≥ R²G²/ϵ² iterations, η = R/(G√t), and starting point within radius R of θ∗, outputs θ̂ satisfying: f(θ̂) ≤ f(θ∗) + ϵ.

Step 1: For all i, f(θi) − f(θ∗) ≤ (∥θi − θ∗∥₂² − ∥θi+1 − θ∗∥₂²) / (2η) + ηG²/2.

Visually:

30


slide-92
SLIDE 92

gd analysis proof

Theorem – GD on Convex Lipschitz Functions: For convex G-Lipschitz function f, GD run with t ≥ R²G²/ϵ² iterations, η = R/(G√t), and starting point within radius R of θ∗, outputs θ̂ satisfying: f(θ̂) ≤ f(θ∗) + ϵ.

Step 1: For all i, f(θi) − f(θ∗) ≤ (∥θi − θ∗∥₂² − ∥θi+1 − θ∗∥₂²) / (2η) + ηG²/2.

Formally:

31

slide-93
SLIDE 93

gd analysis proof

Theorem – GD on Convex Lipschitz Functions: For convex G-Lipschitz function f, GD run with t ≥ R²G²/ϵ² iterations, η = R/(G√t), and starting point within radius R of θ∗, outputs θ̂ satisfying: f(θ̂) ≤ f(θ∗) + ϵ.

Step 1: For all i, f(θi) − f(θ∗) ≤ (∥θi − θ∗∥₂² − ∥θi+1 − θ∗∥₂²) / (2η) + ηG²/2.

Step 1.1: ∇f(θi)ᵀ(θi − θ∗) ≤ (∥θi − θ∗∥₂² − ∥θi+1 − θ∗∥₂²) / (2η) + ηG²/2  =⇒  Step 1.

32


slide-95
SLIDE 95

gd analysis proof

Theorem – GD on Convex Lipschitz Functions: For convex G-Lipschitz function f, GD run with t ≥ R²G²/ϵ² iterations, η = R/(G√t), and starting point within radius R of θ∗, outputs θ̂ satisfying: f(θ̂) ≤ f(θ∗) + ϵ.

Step 1: For all i, f(θi) − f(θ∗) ≤ (∥θi − θ∗∥₂² − ∥θi+1 − θ∗∥₂²) / (2η) + ηG²/2

=⇒ Step 2: (1/T) ∑_{i=1}^T [f(θi) − f(θ∗)] ≤ R²/(2ηT) + ηG²/2.

33


slide-97
SLIDE 97

gd analysis proof

Theorem – GD on Convex Lipschitz Functions: For convex G-Lipschitz function f, GD run with t ≥ R²G²/ϵ² iterations, η = R/(G√t), and starting point within radius R of θ∗, outputs θ̂ satisfying: f(θ̂) ≤ f(θ∗) + ϵ.

Step 2: (1/T) ∑_{i=1}^T [f(θi) − f(θ∗)] ≤ R²/(2ηT) + ηG²/2.

34
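
To close the loop, a sketch that runs GD on the illustrative 1-Lipschitz function f(θ) = ∥θ − a∥₂ from before and checks the Step 1 and Step 2 inequalities numerically along the trajectory; this is an empirical check under my assumed test function, not part of the slides' proof. (For this particular function the bounds are nearly tight, so a small numerical tolerance is used.)

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal(8)
f = lambda th: np.linalg.norm(th - a)                  # convex, G = 1 Lipschitz, f(theta*) = 0

def grad(th):
    diff = th - a
    n = np.linalg.norm(diff)
    return diff / n if n > 0 else np.zeros_like(diff)

theta = np.zeros(8)
G, R = 1.0, np.linalg.norm(theta - a)
T = 400
eta = R / (G * np.sqrt(T))

step1_ok, gaps = [], []
for _ in range(T):
    theta_next = theta - eta * grad(theta)
    # Step 1: f(theta_i) - f(theta*) <= (||theta_i - theta*||^2 - ||theta_{i+1} - theta*||^2)/(2 eta) + eta G^2 / 2
    lhs = f(theta)                                     # f(theta*) = 0 here
    rhs = (np.linalg.norm(theta - a) ** 2 - np.linalg.norm(theta_next - a) ** 2) / (2 * eta) + eta * G ** 2 / 2
    step1_ok.append(lhs <= rhs + 1e-9)
    gaps.append(lhs)
    theta = theta_next

print(all(step1_ok))                                   # Step 1 holds at every iteration
# Step 2: the average gap is at most R^2/(2 eta T) + eta G^2 / 2 (up to numerical tolerance).
print(np.mean(gaps) <= R ** 2 / (2 * eta * T) + eta * G ** 2 / 2 + 1e-9)
```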

slide-98
SLIDE 98

Questions on Gradient Descent?

35