compsci 514: algorithms for data science
Cameron Musco, University of Massachusetts Amherst. Fall 2019. Lecture 21.
summary
Last Class:
- Stochastic gradient descent (SGD).
- Online optimization and online gradient descent (OGD).
- Analysis of SGD as a special case of online gradient descent.
This Class:
- Finish discussion of SGD.
- Understanding gradient descent and SGD as applied to least
squares regression.
- Connections to more advanced techniques: accelerated methods
and adaptive gradient methods.
1
logistics
This class wraps up the optimization unit. Three remaining classes after break. Give your feedback on Piazza about what you’d like to see.
- High dimensional geometry and connections to random
projection.
- Randomized methods for fast approximate SVD,
eigendecomposition, regression.
- Fourier methods, compressed sensing, sparse recovery.
- More advanced optimization methods (alternating
minimization, k-means clustering,...)
- Fairness and differential privacy.
2
quick review
Gradient Descent:
- Applies to: Any differentiable f : Rd → R.
- Goal: Find ˆθ ∈ Rd with f(ˆθ) ≤ min_{⃗θ∈Rd} f(⃗θ) + ϵ.
- Update Step: ⃗θ(i+1) = ⃗θ(i) − η⃗∇f(⃗θ(i)).
Online Gradient Descent:
- Applies to: f1, f2, . . . , ft : Rd → R presented online.
- Goal: Pick ⃗θ(1), . . . , ⃗θ(t) ∈ Rd in an online fashion with ∑_{i=1}^t fi(⃗θ(i)) ≤ min_{⃗θ∈Rd} ∑_{i=1}^t fi(⃗θ) + ϵ (i.e., achieve regret ≤ ϵ).
- Update Step: ⃗θ(i+1) = ⃗θ(i) − η⃗∇fi(⃗θ(i)).
Stochastic Gradient Descent:
- Applies to: f : Rd → R that can be written as f(⃗θ) = ∑_{i=1}^n fi(⃗θ).
- Goal: Find ˆθ ∈ Rd with f(ˆθ) ≤ min_{⃗θ∈Rd} f(⃗θ) + ϵ.
- Update Step: ⃗θ(i+1) = ⃗θ(i) − η⃗∇fji(⃗θ(i)), where ji is chosen uniformly at random from 1, . . . , n.
3
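As a concrete (and purely illustrative) companion to these definitions, here is a minimal numpy sketch of the gradient descent and SGD update rules. The function names, step size η, and iteration count t are placeholders, not part of the lecture.

```python
# Illustrative sketch of the update rules above (not from the slides).
# grad_f(theta) returns the full gradient of f; grad_fi(theta, j) returns the
# gradient of the j-th term f_j in the finite sum f = sum_j f_j.
import numpy as np

def gradient_descent(grad_f, theta0, eta, t):
    """Full gradient descent: theta^(i+1) = theta^(i) - eta * grad f(theta^(i))."""
    theta = theta0.copy()
    for _ in range(t):
        theta -= eta * grad_f(theta)
    return theta

def sgd(grad_fi, n, theta0, eta, t, seed=0):
    """SGD: at step i, pick j_i uniformly from {0, ..., n-1} and step on f_{j_i}.
    Returns the average iterate, which is what the regret-based analysis bounds."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    avg = np.zeros_like(theta)
    for _ in range(t):
        j = rng.integers(n)               # pick j_i uniformly at random
        theta -= eta * grad_fi(theta, j)
        avg += theta / t
    return avg
```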
stochastic gradient analysis recap
Minimizing a finite sum function: f(⃗θ) = ∑_{i=1}^n fi(⃗θ).
- Stochastic gradient descent is identical to online gradient descent run on the sequence of t functions fj1, fj2, . . . , fjt.
- These functions are picked uniformly at random, so in expectation, E[∑_{i=1}^t fji(⃗θ(i))] = E[∑_{i=1}^t f(⃗θ(i))].
- By convexity, ˆθ = (1/t) ∑_{i=1}^t ⃗θ(i) gives only a better solution. I.e., E[∑_{i=1}^t f(ˆθ)] ≤ E[∑_{i=1}^t f(⃗θ(i))].
- Quality directly bounded by the regret analysis for online gradient descent!
4
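The convexity step here is Jensen's inequality applied to the average iterate (a standard fact, spelled out below for reference; not reproduced from the slides):

```latex
% Jensen's inequality for the average iterate \hat{\theta} = \frac{1}{t}\sum_{i=1}^{t}\vec{\theta}^{(i)}:
f(\hat{\theta}) \le \frac{1}{t}\sum_{i=1}^{t} f\big(\vec{\theta}^{(i)}\big)
\quad\Longrightarrow\quad
\sum_{i=1}^{t} f(\hat{\theta}) \le \sum_{i=1}^{t} f\big(\vec{\theta}^{(i)}\big),
% and the same inequality holds after taking expectations over the random indices j_1, \dots, j_t.
```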
sgd vs. gd
Stochastic gradient descent generally requires more iterations than gradient descent, but each iteration is much cheaper (by a factor of n): it computes ⃗∇fj(⃗θ) for a single j rather than the full gradient ⃗∇f(⃗θ) = ⃗∇ ∑_{j=1}^n fj(⃗θ).
5
sgd vs. gd
Consider f(⃗θ) = ∑_{j=1}^n fj(⃗θ) with each fj convex.
Theorem – SGD: If ∥⃗∇fj(⃗θ)∥₂ ≤ G/n ∀⃗θ, after t ≥ R²G²/ϵ² iterations SGD outputs ˆθ satisfying: E[f(ˆθ)] ≤ f(θ∗) + ϵ.
Theorem – GD: If ∥⃗∇f(⃗θ)∥₂ ≤ Ḡ ∀⃗θ, after t ≥ R²Ḡ²/ϵ² iterations GD outputs ˆθ satisfying: f(ˆθ) ≤ f(θ∗) + ϵ.
∥⃗∇f(⃗θ)∥₂ = ∥⃗∇f1(⃗θ) + . . . + ⃗∇fn(⃗θ)∥₂ ≤ ∑_{j=1}^n ∥⃗∇fj(⃗θ)∥₂ ≤ n · (G/n) = G.
When would this bound be tight? I.e., when does SGD take the same number of iterations as GD? When is it loose? I.e., when does SGD perform very poorly compared to GD?
6
sgd vs. gd
Roughly: SGD performs well compared to GD when ∑_{j=1}^n ∥⃗∇fj(⃗θ)∥₂ is small compared to ∥⃗∇f(⃗θ)∥₂.
∑_{j=1}^n ∥⃗∇fj(⃗θ)∥₂² − ∥⃗∇f(⃗θ)∥₂² = ∑_{j=1}^n ∥⃗∇fj(⃗θ) − ⃗∇f(⃗θ)∥₂² (good exercise)
Reducing this variance is a key technique used to increase the performance of SGD:
- mini-batching (see the sketch after this list)
- stochastic variance reduced gradient descent (SVRG)
- stochastic average gradient (SAG)
7
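A minimal sketch of the first of these ideas, mini-batching, under the same illustrative setup as before (the batch size B and all names are placeholders): averaging B randomly chosen component gradients keeps the step unbiased while shrinking its variance by roughly a factor of B.

```python
# Mini-batch SGD sketch (illustrative). Each averaged gradient is an unbiased
# estimate of (1/n) * grad f(theta), with lower variance than a single term.
import numpy as np

def minibatch_sgd(grad_fi, n, theta0, eta, t, B=32, seed=0):
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    for _ in range(t):
        batch = rng.integers(n, size=B)   # sample B indices with replacement
        g = np.mean([grad_fi(theta, j) for j in batch], axis=0)
        theta -= eta * g
    return theta
```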
test of intuition
What does f1(θ) + f2(θ) + f3(θ) look like?
[Plot: the three functions f1, f2, f3 over θ.]
A sum of convex functions is always convex (good exercise).
8
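The exercise follows in one line from the definition of convexity (spelled out here for reference):

```latex
% For f = f_1 + f_2 with f_1, f_2 convex and any \lambda \in [0,1]:
f\big(\lambda\theta + (1-\lambda)\theta'\big)
  = f_1\big(\lambda\theta + (1-\lambda)\theta'\big) + f_2\big(\lambda\theta + (1-\lambda)\theta'\big)
  \le \lambda f(\theta) + (1-\lambda) f(\theta'),
% using convexity of f_1 and f_2 term by term; induction extends this to any finite sum.
```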
rest of today Linear Algebra + Convex Optimization
9
iterative optimization for least squares regression
Least Squares Regression: Given data matrix X ∈ Rn×d and label vector ⃗y ∈ Rn: f(⃗θ) = ∥X⃗θ − ⃗y∥₂².
Optimum given by ⃗θ∗ = VΣ⁻¹Uᵀ⃗y, where X = UΣVᵀ is the SVD of X. Have X⃗θ∗ = UUᵀ⃗y, the projection of ⃗y onto the column span of X.
Why solve with an iterative method (e.g., gradient descent)?
10
least squares regression reformulation
Least Squares Regression: Given data matrix X ∈ Rn×d and label vector ⃗y ∈ Rn: f(⃗θ) = ∥X⃗θ − ⃗y∥₂².
Claim 1: f(⃗θ) = ∥X⃗θ − X⃗θ∗∥₂² + c = ∥X(⃗θ − ⃗θ∗)∥₂² + c.
Claim 2: ⃗∇f(⃗θ) = 2XᵀX⃗θ − 2Xᵀ⃗y = 2Xᵀ(X⃗θ − ⃗y), where X⃗θ − ⃗y is the residual.
Gradient Descent Update: ⃗θ(i+1) = ⃗θ(i) − 2ηXᵀ(X⃗θ(i) − ⃗y) = ⃗θ(i) − 2η ∑_{j=1}^n ⃗xj · ri,j, where ri,j = ⃗xjᵀ⃗θ(i) − yj is the residual for data point j at step i.
11
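A minimal numpy sketch of this update, assuming X and ⃗y are given as arrays (the step size and iteration count are illustrative; a safe η depends on the largest eigenvalue of XᵀX, discussed later):

```python
# Gradient descent for least squares (illustrative sketch).
# Uses the gradient from Claim 2: grad f(theta) = 2 X^T (X theta - y).
import numpy as np

def least_squares_gd(X, y, eta, t):
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(t):
        residual = X @ theta - y           # entries r_{i,j} = x_j^T theta - y_j
        theta -= 2 * eta * (X.T @ residual)
    return theta

# Sanity check against the closed-form optimum on a small instance:
# theta_star = np.linalg.lstsq(X, y, rcond=None)[0]
```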
sgd for regression
Least Squares Regression: Given data matrix X ∈ Rn×d and label vector ⃗y ∈ Rn: f(⃗θ) = ∥X⃗θ − ⃗y∥₂² = ∑_{j=1}^n (⃗xjᵀ⃗θ − yj)² = ∑_{j=1}^n fj(⃗θ).
Claim 3: ⃗∇fj(⃗θ) = 2(⃗xjᵀ⃗θ − yj) · ⃗xj, where ⃗xjᵀ⃗θ − yj is the residual for data point j.
SGD Update: Pick random j ∈ {1, . . . , n} and set: ⃗θ(i+1) = ⃗θ(i) − η⃗∇fj(⃗θ(i)) = ⃗θ(i) − 2η⃗xj · ri,j, versus the full gradient step −2η ∑_{j=1}^n ⃗xj · ri,j, where ri,j = ⃗xjᵀ⃗θ(i) − yj is the residual for data point j at step i.
SGD makes a small correction for a single data point in each step, in the direction of that data point.
12
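The corresponding SGD sketch (again illustrative): each step touches a single row of X, so one SGD step costs O(d) versus O(nd) for a full gradient step.

```python
# SGD for least squares (illustrative sketch): one random data point per step.
import numpy as np

def least_squares_sgd(X, y, eta, t, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(t):
        j = rng.integers(n)
        r_ij = X[j] @ theta - y[j]        # residual for data point j at this step
        theta -= 2 * eta * r_ij * X[j]    # small correction in the direction of x_j
    return theta
```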
gradient descent as polynomial approximation
Gradient Descent for Regression: ⃗θ(i+1) = ⃗θ(i) − η⃗∇f(⃗θ(i)) = ⃗θ(i) − 2ηXᵀ(X⃗θ(i) − ⃗y). Initialize ⃗θ(1) = ⃗0.
⃗θ(2) = ⃗0 − 2ηXᵀ(X⃗0 − ⃗y) = 2ηXᵀ⃗y.
⃗θ(3) = 2ηXᵀ⃗y − 2ηXᵀ(2ηXXᵀ⃗y − ⃗y) = 4ηXᵀ⃗y − 4η²(XᵀX)Xᵀ⃗y = 4η(I − ηXᵀX)Xᵀ⃗y.
⃗θ(4) = ⃗θ(3) − 2ηXᵀ(X⃗θ(3) − ⃗y) = 6ηXᵀ⃗y − 12η²(XᵀX)Xᵀ⃗y + 8η³(XᵀX)²Xᵀ⃗y.
⃗θ(t) = pt(XᵀX) · Xᵀ⃗y ≈ ⃗θ∗ = (XᵀX)⁻¹Xᵀ⃗y, where pt is a degree t − 2 polynomial.
13
gradient descent as polynomial approximation
Upshot: Gradient descent computes ⃗θ(t) = pt(XᵀX) · Xᵀ⃗y ≈ (XᵀX)⁻¹Xᵀ⃗y = ⃗θ∗. View in Eigendecomposition:
[Plot: the polynomial approximations p10(x) and p30(x) compared to 1/x, roughly on x ∈ [0.1, 1].]
One of the most basic Krylov subspace methods. Chebyshev iteration, Jacobi iteration, conjugate gradient, accelerated gradient descent, heavy ball methods, ...
14
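A quick numerical check of the claim (illustrative, with synthetic data): unroll a few GD iterations from ⃗θ(1) = ⃗0 and verify the iterate lies in the Krylov subspace spanned by Xᵀ⃗y, (XᵀX)Xᵀ⃗y, (XᵀX)²Xᵀ⃗y, . . ., i.e., it equals pt(XᵀX)Xᵀ⃗y for some polynomial pt.

```python
# Verify that the GD iterate is a polynomial in X^T X applied to X^T y (illustrative).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
y = rng.standard_normal(50)
eta = 0.9 / np.linalg.norm(X, 2) ** 2      # step size below 1 / lambda_max(X^T X)

t = 6
theta = np.zeros(8)
for _ in range(t - 1):                     # theta^(1) -> theta^(t)
    theta -= 2 * eta * (X.T @ (X @ theta - y))

M, b = X.T @ X, X.T @ y
K = np.column_stack([np.linalg.matrix_power(M, k) @ b for k in range(t - 1)])  # Krylov basis
coeffs, *_ = np.linalg.lstsq(K, theta, rcond=None)
print(np.linalg.norm(K @ coeffs - theta))  # ~0: theta^(t) = p_t(X^T X) X^T y
```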
conditioning
Gradient descent for least squares regression requires a lot of iterations when the eigenvalues of XᵀX are spread out. Formally:
- Is f(⃗θ) = ∥X⃗θ − ⃗y∥₂² = ∥X(⃗θ − ⃗θ∗)∥₂² Lipschitz?
- A convex function f : Rd → R is β-smooth and α-strongly convex if ∀⃗θ1, ⃗θ2: (α/2)∥⃗θ1 − ⃗θ2∥₂² ≤ ⃗∇f(⃗θ1)ᵀ(⃗θ1 − ⃗θ2) − [f(⃗θ1) − f(⃗θ2)] ≤ (β/2)∥⃗θ1 − ⃗θ2∥₂².
- f(⃗θ) is β = λmax(XᵀX) smooth and α = λmin(XᵀX) strongly convex.
15
conditioning
Theorem: For any α-strongly convex and β-smooth function f(⃗θ), GD initialized with ⃗θ(1) within a radius R of ⃗θ∗ and run for t = O((β/α) · log(1/ϵ)) iterations returns ˆθ with ∥ˆθ − ⃗θ∗∥₂ ≤ ϵR.
For least squares regression, α = λmin(XᵀX), β = λmax(XᵀX), and β/α is called the condition number κ.
16
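A small illustration of the role of κ (synthetic data, illustrative step size, not from the slides): GD on a well-conditioned X converges in far fewer iterations than on an ill-conditioned one.

```python
# Condition number kappa = lambda_max / lambda_min of X^T X governs GD's iteration count.
import numpy as np

def gd_error(X, y, iters):
    """Run GD with eta = 1 / (2 * lambda_max) and return ||theta - theta*||_2."""
    theta_star = np.linalg.lstsq(X, y, rcond=None)[0]
    lam_max = np.linalg.norm(X, 2) ** 2
    eta = 1.0 / (2 * lam_max)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= 2 * eta * (X.T @ (X @ theta - y))
    return np.linalg.norm(theta - theta_star)

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((200, 5)))
y = rng.standard_normal(200)
X_good = U @ np.diag([1.0, 1.0, 1.0, 1.0, 1.0])    # kappa = 1 for X^T X
X_bad = U @ np.diag([1.0, 1.0, 1.0, 1.0, 0.01])    # kappa = 10^4 for X^T X
print(gd_error(X_good, y, 100), gd_error(X_bad, y, 100))  # second error is far larger
```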
conditioning
Recall: f(⃗θ) = ∥X(⃗θ − ⃗θ∗)∥₂².
How can we mitigate this issue? Scale the directions to make the surface more 'round'. This is the idea behind adaptive gradient methods (AdaGrad, RMSprop, Adam) and quasi-Newton methods (BFGS, L-BFGS, ...).
17
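As a rough sketch of the "scale the directions" idea, here is an AdaGrad-style update (one common form of the method; the parameter names and constants are illustrative, not the lecture's): each coordinate's effective step size shrinks according to the history of squared gradients in that coordinate, so steep directions take smaller steps and flat directions take larger ones.

```python
# AdaGrad-style diagonal scaling (illustrative sketch of the adaptive-gradient idea).
import numpy as np

def adagrad(grad_f, theta0, eta, t, eps=1e-8):
    theta = theta0.copy()
    G = np.zeros_like(theta)                     # running sum of squared gradients, per coordinate
    for _ in range(t):
        g = grad_f(theta)
        G += g ** 2
        theta -= eta * g / (np.sqrt(G) + eps)    # per-coordinate step: smaller where gradients
    return theta                                 # have historically been large
```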