slide-1
SLIDE 1

compsci 514: algorithms for data science

Cameron Musco, University of Massachusetts Amherst. Fall 2019. Lecture 21.

slide-2
SLIDE 2

summary

Last Class:

  • Stochastic gradient descent (SGD).
  • Online optimization and online gradient descent (OGD).
  • Analysis of SGD as a special case of online gradient descent.

This Class:

  • Finish discussion of SGD.
  • Understanding gradient descent and SGD as applied to least squares regression.
  • Connections to more advanced techniques: accelerated methods and adaptive gradient methods.

slide-4
SLIDE 4

logistics

This class wraps up the optimization unit. Three remaining classes after break. Give your feedback on Piazza about what you’d like to see.

  • High dimensional geometry and connections to random projection.
  • Randomized methods for fast approximate SVD, eigendecomposition, and regression.
  • Fourier methods, compressed sensing, sparse recovery.
  • More advanced optimization methods (alternating minimization, k-means clustering, ...).
  • Fairness and differential privacy.


slide-5
SLIDE 5

quick review

Gradient Descent:

  • Applies to: any differentiable f : R^d → R.
  • Goal: find θ̂ ∈ R^d with f(θ̂) ≤ min_{θ ∈ R^d} f(θ) + ϵ.
  • Update step: θ^(i+1) = θ^(i) − η∇f(θ^(i)).

Online Gradient Descent:

  • Applies to: f_1, f_2, . . . , f_t : R^d → R presented online.
  • Goal: pick θ^(1), . . . , θ^(t) ∈ R^d in an online fashion with ∑_{i=1}^t f_i(θ^(i)) ≤ min_{θ ∈ R^d} ∑_{i=1}^t f_i(θ) + ϵ (i.e., achieve regret ≤ ϵ).
  • Update step: θ^(i+1) = θ^(i) − η∇f_i(θ^(i)).

Stochastic Gradient Descent:

  • Applies to: f : R^d → R that can be written as f(θ) = ∑_{i=1}^n f_i(θ).
  • Goal: find θ̂ ∈ R^d with f(θ̂) ≤ min_{θ ∈ R^d} f(θ) + ϵ.
  • Update step: θ^(i+1) = θ^(i) − η∇f_{j_i}(θ^(i)), where j_i is chosen uniformly at random from {1, . . . , n}.
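
A minimal numpy sketch of the gradient descent and SGD update steps (the function names, the gradient oracles grad_f and grad_fi, and the choice to return the average iterate are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(grad_f, theta0, eta, t):
    # Full gradient step: theta^(i+1) = theta^(i) - eta * grad f(theta^(i)).
    theta = theta0.copy()
    for _ in range(t):
        theta = theta - eta * grad_f(theta)
    return theta

def sgd(grad_fi, n, theta0, eta, t, rng=np.random.default_rng(0)):
    # Stochastic step: sample j uniformly from {0, ..., n-1} and move along grad f_j.
    theta = theta0.copy()
    iterates = [theta]
    for _ in range(t):
        j = rng.integers(n)
        theta = theta - eta * grad_fi(theta, j)
        iterates.append(theta)
    # Returning the average iterate matches the quantity bounded by the SGD/OGD analysis.
    return np.mean(iterates, axis=0)
```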


slide-9
SLIDE 9

stochastic gradient analysis recap

Minimizing a finite sum function: f(θ) = ∑_{i=1}^n f_i(θ).

  • Stochastic gradient descent is identical to online gradient descent run on the sequence of t functions f_{j_1}, f_{j_2}, . . . , f_{j_t}.
  • These functions are picked uniformly at random, so in expectation, E[∑_{i=1}^t f_{j_i}(θ^(i))] = E[∑_{i=1}^t f(θ^(i))].
  • By convexity, θ̂ = (1/t) ∑_{i=1}^t θ^(i) gives only a better solution, i.e., E[∑_{i=1}^t f(θ̂)] ≤ E[∑_{i=1}^t f(θ^(i))].
  • Quality directly bounded by the regret analysis for online gradient descent!


slide-12
SLIDE 12

sgd vs. gd

Stochastic gradient descent generally requires more iterations than gradient descent, but each iteration is much cheaper (by a factor of n): it uses a single ∇f_j(θ) instead of the full gradient ∇f(θ) = ∇∑_{j=1}^n f_j(θ).

slide-13
SLIDE 13

sgd vs. gd

Consider f(θ) = ∑_{j=1}^n f_j(θ) with each f_j convex.

Theorem – SGD: If ∥∇f_j(θ)∥₂ ≤ G/n for all θ, then after t ≥ R²G²/ϵ² iterations SGD outputs θ̂ satisfying E[f(θ̂)] ≤ f(θ*) + ϵ.

Theorem – GD: If ∥∇f(θ)∥₂ ≤ Ḡ for all θ, then after t ≥ R²Ḡ²/ϵ² iterations GD outputs θ̂ satisfying f(θ̂) ≤ f(θ*) + ϵ.

∥∇f(θ)∥₂ = ∥∇f_1(θ) + . . . + ∇f_n(θ)∥₂ ≤ ∑_{j=1}^n ∥∇f_j(θ)∥₂ ≤ n · (G/n) = G.

When would this bound be tight, i.e., SGD takes the same number of iterations as GD? When is it loose, i.e., SGD performs very poorly compared to GD?
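
A tiny numpy check of the bound ∥∇f(θ)∥₂ ≤ ∑_j ∥∇f_j(θ)∥₂ at a single point (the synthetic "gradients" are illustrative): the bound is essentially tight when the per-function gradients all point the same way, and very loose when they nearly cancel, which is the regime where SGD's worst-case iteration bound is far worse than GD's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten per-function gradients grad f_j in R^5, in two regimes.
aligned = np.tile(rng.normal(size=5), (10, 1))      # all gradients identical: bound tight
cancelling = rng.normal(size=(10, 5))
cancelling -= cancelling.mean(axis=0)               # gradients sum to ~0: bound very loose

for name, grads in [("aligned", aligned), ("cancelling", cancelling)]:
    full_norm = np.linalg.norm(grads.sum(axis=0))   # ||grad f|| = ||sum_j grad f_j||
    sum_norms = np.linalg.norm(grads, axis=1).sum() # sum_j ||grad f_j||
    print(f"{name}: ||grad f|| = {full_norm:.2f}  vs  sum of norms = {sum_norms:.2f}")
```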


slide-17
SLIDE 17

sgd vs. gd

Roughly: SGD performs well compared to GD when ∑_{j=1}^n ∥∇f_j(θ)∥₂ is small compared to ∥∇f(θ)∥₂.

∑_{j=1}^n ∥∇f_j(θ)∥₂² − ∥∇f(θ)∥₂² = ∑_{j=1}^n ∥∇f_j(θ) − ∇f(θ)∥₂² (good exercise)

Reducing this variance is a key technique used to increase the performance of SGD:

  • mini-batching
  • stochastic variance reduced gradient descent (SVRG)
  • stochastic average gradient (SAG)
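
A minimal sketch of the simplest of these ideas, mini-batching (the function signature mirrors the sgd sketch above and is illustrative): averaging the stochastic gradient over a small random batch reduces its variance at the cost of a slightly more expensive step.

```python
import numpy as np

def minibatch_sgd(grad_fi, n, theta0, eta, t, batch_size=32, rng=np.random.default_rng(0)):
    # Each step averages grad f_j over a random batch instead of using a single j;
    # the averaged gradient is still unbiased but has lower variance.
    theta = theta0.copy()
    for _ in range(t):
        batch = rng.integers(n, size=batch_size)
        grad = np.mean([grad_fi(theta, j) for j in batch], axis=0)
        theta = theta - eta * grad
    return theta
```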


slide-20
SLIDE 20

test of intuition

What does f_1(θ) + f_2(θ) + f_3(θ) look like?

[Plot: three convex functions f_1, f_2, f_3 plotted against θ.]

A sum of convex functions is always convex (good exercise).
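
For the exercise, the check is one line from the definition of convexity (a standard argument, not spelled out on the slide):

```latex
% Each f_j is convex, so for any \theta_1, \theta_2 and \lambda \in [0,1]:
%   f_j(\lambda\theta_1 + (1-\lambda)\theta_2) \le \lambda f_j(\theta_1) + (1-\lambda) f_j(\theta_2).
% Summing over j shows f = \sum_j f_j is convex:
f(\lambda\theta_1 + (1-\lambda)\theta_2)
  = \sum_{j} f_j(\lambda\theta_1 + (1-\lambda)\theta_2)
  \le \sum_{j} \big[\lambda f_j(\theta_1) + (1-\lambda) f_j(\theta_2)\big]
  = \lambda f(\theta_1) + (1-\lambda) f(\theta_2).
```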


slide-23
SLIDE 23

rest of today

Linear Algebra + Convex Optimization

slide-24
SLIDE 24

iterative optimization for least squares regression

Least Squares Regression: Given data matrix X ∈ R^{n×d} and label vector y ∈ R^n: f(θ) = ∥Xθ − y∥₂².

Writing the SVD X = UΣVᵀ, the optimum is given by θ* = VΣ⁻¹Uᵀy, and Xθ* = UUᵀy, the projection of y onto the column span of X.

Why solve with an iterative method (e.g., gradient descent)?
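
A small numpy sanity check of the closed form on synthetic data (the sizes, data, and cost remark are illustrative): the SVD solution matches the library least squares solver, and the cost comparison is the usual motivation for iterative methods.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Closed-form optimum via the SVD: theta* = V Sigma^{-1} U^T y.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
theta_star = Vt.T @ ((U.T @ y) / s)

theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_star, theta_lstsq))   # True: both give the optimum

# The SVD costs roughly O(n d^2) time, while a single gradient descent step costs
# only O(n d), which is why iterative methods can be attractive for large problems.
```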


slide-29
SLIDE 29

least squares regression reformulation

Least Squares Regression: Given data matrix X ∈ R^{n×d} and label vector y ∈ R^n: f(θ) = ∥Xθ − y∥₂².

Claim 1: f(θ) = ∥Xθ − Xθ*∥₂² + c = ∥X(θ − θ*)∥₂² + c.

Claim 2: ∇f(θ) = 2XᵀXθ − 2Xᵀy = 2Xᵀ(Xθ − y), where Xθ − y is the residual.

Gradient Descent Update: θ^(i+1) = θ^(i) − 2ηXᵀ(Xθ^(i) − y) = θ^(i) − 2η ∑_{j=1}^n x_j · r_{i,j},

where r_{i,j} = x_jᵀθ^(i) − y_j is the residual for data point j at step i.
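
A minimal numpy sketch of this gradient descent update in the residual form (function name and step size choice are illustrative; a step size around 1/(2λ_max(XᵀX)) is a standard safe choice):

```python
import numpy as np

def gd_least_squares(X, y, t):
    # Gradient descent on f(theta) = ||X theta - y||_2^2:
    # theta <- theta - 2 * eta * X^T (X theta - y).
    n, d = X.shape
    eta = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / (2 * lambda_max(X^T X))
    theta = np.zeros(d)
    for _ in range(t):
        residual = X @ theta - y                  # r_{i,j}, one entry per data point
        theta = theta - 2 * eta * X.T @ residual
    return theta
```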


slide-34
SLIDE 34

sgd for regression

Least Squares Regression: Given data matrix X ∈ R^{n×d} and label vector y ∈ R^n:
f(θ) = ∥Xθ − y∥₂² = ∑_{j=1}^n (x_jᵀθ − y_j)² = ∑_{j=1}^n f_j(θ).

Claim 3: ∇f_j(θ) = 2(x_jᵀθ − y_j) · x_j, where x_jᵀθ − y_j is the residual.

SGD Update: Pick random j ∈ {1, . . . , n} and set:
θ^(i+1) = θ^(i) − η∇f_j(θ^(i)) = θ^(i) − 2η x_j · r_{i,j},   versus   θ^(i+1) = θ^(i) − 2η ∑_{j=1}^n x_j · r_{i,j} for GD,

where r_{i,j} = x_jᵀθ^(i) − y_j is the residual for data point j at step i. Make a small correction for a single data point in each step, in the direction of that data point.
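
A minimal numpy sketch of this single-point update (function name and step size are illustrative):

```python
import numpy as np

def sgd_least_squares(X, y, eta, t, rng=np.random.default_rng(0)):
    # SGD on f(theta) = sum_j (x_j^T theta - y_j)^2: each step corrects theta using one
    # random data point, moving along x_j scaled by its residual.
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(t):
        j = rng.integers(n)
        r = X[j] @ theta - y[j]            # residual r_{i,j} for the sampled point
        theta = theta - 2 * eta * r * X[j]
    return theta
```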


slide-41
SLIDE 41

gradient descent as polynomial approximation

Gradient Descent for Regression: θ^(i+1) = θ^(i) − η∇f(θ^(i)) = θ^(i) − 2ηXᵀ(Xθ^(i) − y). Initialize θ^(1) = 0.

θ^(2) = 0 − 2ηXᵀ(X·0 − y) = 2ηXᵀy.

θ^(3) = 2ηXᵀy − 2ηXᵀ(2ηXXᵀy − y) = 4ηXᵀy − 4η²(XᵀX)Xᵀy = 4η(I − ηXᵀX)Xᵀy.

θ^(4) = θ^(3) − 2ηXᵀ(Xθ^(3) − y) = 6ηXᵀy − 12η²(XᵀX)Xᵀy + 8η³(XᵀX)²Xᵀy.

θ^(t) = p_t(XᵀX) · Xᵀy ≈ θ* = (XᵀX)⁻¹Xᵀy, where p_t is a degree t − 2 polynomial.
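
A small numpy check of this claim on synthetic data (sizes and step size are arbitrary): running the recursion θ^(i+1) = (I − 2ηXᵀX)θ^(i) + 2ηXᵀy alongside gradient descent shows that the iterate really is a polynomial in XᵀX applied to Xᵀy, and that it approaches θ*.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

M = X.T @ X
eta = 1.0 / (2 * np.linalg.eigvalsh(M).max())    # step size 1 / (2 * lambda_max)

# Build p_t(M) X^T y term by term: theta^(t) = 2*eta * sum_{k=0}^{t-2} (I - 2*eta*M)^k X^T y.
theta = np.zeros(d)
poly = np.zeros(d)
term = 2 * eta * (X.T @ y)
for _ in range(30):
    poly += term
    term = (np.eye(d) - 2 * eta * M) @ term
    theta = theta - 2 * eta * X.T @ (X @ theta - y)

theta_star = np.linalg.solve(M, X.T @ y)
print(np.allclose(theta, poly))                  # GD iterate equals p_t(X^T X) X^T y
print(np.linalg.norm(theta - theta_star))        # and is close to theta*
```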


slide-48
SLIDE 48

gradient descent as polynomial approximation

Upshot: Gradient descent computes θ^(t) = p_t(XᵀX) · Xᵀy ≈ (XᵀX)⁻¹Xᵀy = θ*.

View in the eigendecomposition: θ^(t) is close to θ* when p_t(x) is a good approximation of 1/x on the eigenvalues of XᵀX.

[Plot: p₁₀(x) and p₃₀(x) compared to 1/x.]

This is one of the most basic Krylov subspace methods. Related methods: Chebyshev iteration, Jacobi iteration, conjugate gradient, accelerated gradient descent, heavy ball methods, ...
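
One way to make the eigendecomposition view explicit (a standard calculation, assumed here rather than taken from the slides):

```latex
% Write X^T X = V \Lambda V^T with eigenvalues \lambda_1, \dots, \lambda_d. Then
% p_t(X^T X) = V\, p_t(\Lambda)\, V^T, so the error of the gradient descent iterate is
\theta^{(t)} - \theta^*
  = \big(p_t(X^T X) - (X^T X)^{-1}\big) X^T y
  = V \,\mathrm{diag}\big(p_t(\lambda_k) - 1/\lambda_k\big)\, V^T X^T y,
% which is small exactly when p_t(x) \approx 1/x on every eigenvalue \lambda_k of X^T X.
```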


slide-55
SLIDE 55

conditioning

Gradient descent for least squares regression requires a lot of iterations when the eigenvalues of XᵀX are spread out. Formally:

  • Is f(θ) = ∥Xθ − y∥₂² = ∥X(θ − θ*)∥₂² Lipschitz?
  • A convex function f : R^d → R is β-smooth and α-strongly convex if for all θ_1, θ_2:
    (α/2)∥θ_1 − θ_2∥₂² ≤ ∇f(θ_1)ᵀ(θ_1 − θ_2) − [f(θ_1) − f(θ_2)] ≤ (β/2)∥θ_1 − θ_2∥₂².
  • f(θ) is β = λ_max(XᵀX) smooth and α = λ_min(XᵀX) strongly convex.


slide-59
SLIDE 59

conditioning

Theorem: For any α-strongly convex and β-smooth function f(θ), GD initialized with θ^(1) within a radius R of θ* and run for t = O((β/α) · log(1/ϵ)) iterations returns θ̂ with ∥θ̂ − θ*∥₂ ≤ ϵR.

For least squares regression, α = λ_min(XᵀX), β = λ_max(XᵀX), and β/α is called the condition number κ.
slide-60
SLIDE 60

conditioning

Recall: f(θ) = ∥X(θ − θ*)∥₂².

How can we mitigate this issue? Scale the directions to make the surface more 'round'. This is the idea behind adaptive gradient methods (AdaGrad, RMSprop, Adam) and quasi-Newton methods (BFGS, L-BFGS, ...).
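
A rough sketch of this rescaling idea in the style of AdaGrad, applied to the SGD update for least squares (illustrative only; not the exact method from any of the papers above): each coordinate's step is divided by the square root of its accumulated squared gradients, so steep directions take smaller steps.

```python
import numpy as np

def adagrad_least_squares(X, y, eta, t, eps=1e-8, rng=np.random.default_rng(0)):
    n, d = X.shape
    theta = np.zeros(d)
    sq_grad_sum = np.zeros(d)
    for _ in range(t):
        j = rng.integers(n)
        grad = 2 * (X[j] @ theta - y[j]) * X[j]       # single-example gradient
        sq_grad_sum += grad ** 2                      # per-coordinate history of squared gradients
        theta -= eta * grad / (np.sqrt(sq_grad_sum) + eps)
    return theta
```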


slide-66
SLIDE 66

mathematical view of preconditioning – if time
