

SLIDE 1

compsci 514: algorithms for data science

Cameron Musco, University of Massachusetts Amherst. Fall 2019. Lecture 19.

SLIDE 2

logistics

  • Problem Set 3 on Spectral Methods due this Friday at 8pm.
  • Can turn in without penalty until Sunday at 11:59pm.

SLIDE 3

summary

Last Class:

  • Intro to continuous optimization.
  • Multivariable calculus review.
  • Intro to Gradient Descent.

This Class:

  • Analysis of gradient descent for optimizing convex functions.
  • Analysis of projected gradient descent for optimizing under constraints.


SLIDE 6

gradient descent motivation

Gradient descent greedy motivation: At each step, make a small change to ⃗ θ(i−1) to give ⃗ θ(i), with minimum value of f(⃗ θ(i)).

Gradient descent step: When the step size is small, this is approximately optimized by stepping in the opposite direction of the gradient: ⃗ θ(i) = ⃗ θ(i−1) − η · ⃗ ∇f(⃗ θ(i−1)).

Pseudocode:

  • Choose some initialization ⃗ θ(0).
  • For i = 1, . . . , t: ⃗ θ(i) = ⃗ θ(i−1) − η · ⃗ ∇f(⃗ θ(i−1)).
  • Return ⃗ θ(t) as an approximate minimizer of f(⃗ θ).

Step size η is chosen ahead of time or adapted during the algorithm.

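A minimal Python sketch of this pseudocode (the quadratic test function, its gradient, the step size, and the iteration count below are illustrative assumptions, not from the slides):

```python
import numpy as np

def gradient_descent(grad_f, theta0, eta, t):
    """Runs theta(i) = theta(i-1) - eta * grad_f(theta(i-1)) for i = 1, ..., t."""
    theta = np.array(theta0, dtype=float)
    for _ in range(t):
        theta = theta - eta * grad_f(theta)
    return theta

# Example: f(theta) = ||theta - b||_2^2 has gradient 2 * (theta - b) and minimizer b.
b = np.array([1.0, -2.0])
theta_hat = gradient_descent(lambda th: 2 * (th - b), theta0=np.zeros(2), eta=0.1, t=100)
print(theta_hat)  # approaches b
```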

SLIDE 7

Gradient Descent Update: ⃗ θ(i) = ⃗ θ(i−1) − η∇f(⃗ θ(i−1))

SLIDE 8

convexity

Definition – Convex Function: A function f : Rd → R is convex if and only if, for any ⃗ θ1, ⃗ θ2 ∈ Rd and λ ∈ [0, 1]: (1 − λ) · f(⃗ θ1) + λ · f(⃗ θ2) ≥ f ( (1 − λ) · ⃗ θ1 + λ · ⃗ θ2 )


SLIDE 10

convexity

Corollary – Convex Function: A function f : Rd → R is convex if and only if, for any ⃗ θ1, ⃗ θ2 ∈ Rd: f(⃗ θ2) − f(⃗ θ1) ≥ ⃗ ∇f(⃗ θ1)ᵀ (⃗ θ2 − ⃗ θ1).

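A quick numeric sanity check of this corollary in Python (the least-squares objective below is an assumed test function, chosen only because it is convex with an easy gradient):

```python
import numpy as np

# Assumed convex test function: f(theta) = ||A theta - b||_2^2 (least squares),
# with gradient grad_f(theta) = 2 A^T (A theta - b).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(10, 4)), rng.normal(size=10)
f = lambda th: float(np.sum((A @ th - b) ** 2))
grad_f = lambda th: 2 * A.T @ (A @ th - b)

# Check f(theta2) - f(theta1) >= grad_f(theta1)^T (theta2 - theta1) at random pairs.
for _ in range(5):
    th1, th2 = rng.normal(size=4), rng.normal(size=4)
    assert f(th2) - f(th1) >= grad_f(th1) @ (th2 - th1) - 1e-9
print("first-order convexity condition holds on all sampled pairs")
```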


SLIDE 12

other assumptions

We will also assume that f(·) is ‘well-behaved’ in some way.

  • Lipschitz (size of gradient is bounded): For all ⃗ θ and some G, ∥⃗ ∇f(⃗ θ)∥2 ≤ G.
  • Smooth (direction/size of gradient is not changing too quickly): For all ⃗ θ1, ⃗ θ2 and some β, ∥⃗ ∇f(⃗ θ1) − ⃗ ∇f(⃗ θ2)∥2 ≤ β · ∥⃗ θ1 − ⃗ θ2∥2.

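For intuition, a small sketch of what G and β look like for the same assumed least-squares objective (for this f the smoothness constant β is exact, while a global gradient bound G does not exist, so it is only estimated over a bounded region):

```python
import numpy as np

# Assumed objective: f(theta) = ||A theta - b||_2^2, grad_f(theta) = 2 A^T (A theta - b).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(10, 4)), rng.normal(size=10)
grad_f = lambda th: 2 * A.T @ (A @ th - b)

# Smoothness: grad_f(th1) - grad_f(th2) = 2 A^T A (th1 - th2), so the smallest valid
# beta is 2 * ||A^T A||_2 (twice the largest eigenvalue of A^T A).
beta = 2 * np.linalg.norm(A.T @ A, 2)

# Lipschitzness: ||grad_f(theta)||_2 grows with ||theta||_2, so a finite G holds only on
# a bounded region; here it is estimated over the unit ball by sampling.
thetas = rng.normal(size=(1000, 4))
thetas /= np.maximum(np.linalg.norm(thetas, axis=1, keepdims=True), 1.0)
G_est = max(np.linalg.norm(grad_f(th)) for th in thetas)
print(f"beta = {beta:.2f}, estimated G over the unit ball = {G_est:.2f}")
```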

SLIDE 13

lipschitz assumption

SLIDE 14

gd analysis – convex functions

Assume that:

  • f is convex.
  • f is G-Lipschitz (i.e., ∥⃗ ∇f(⃗ θ)∥2 ≤ G for all ⃗ θ).
  • ∥⃗ θ0 − ⃗ θ∗∥2 ≤ R, where ⃗ θ0 is the initialization point.

Gradient Descent

  • Choose some initialization ⃗ θ0 and set η = R/(G√t).
  • For i = 1, . . . , t: ⃗ θi = ⃗ θi−1 − η · ⃗ ∇f(⃗ θi−1).
  • Return ˆ θ = arg min over ⃗ θ0, . . . , ⃗ θt of f(⃗ θi).

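A Python sketch of gradient descent under exactly these assumptions (the distance function f below is an illustrative convex, 1-Lipschitz example; the step size follows the slide's η = R/(G√t)):

```python
import numpy as np

def gd_convex_lipschitz(f, grad_f, theta0, R, G, t):
    """GD with fixed step eta = R / (G * sqrt(t)); returns the best iterate seen."""
    eta = R / (G * np.sqrt(t))
    theta = np.array(theta0, dtype=float)
    best = theta.copy()
    for _ in range(t):
        theta = theta - eta * grad_f(theta)
        if f(theta) < f(best):
            best = theta.copy()
    return best

# Illustrative convex, 1-Lipschitz test function: f(theta) = ||theta - b||_2.
b = np.array([1.0, -2.0, 0.5])
f = lambda th: np.linalg.norm(th - b)
def grad_f(th):
    r = th - b
    n = np.linalg.norm(r)
    return r / n if n > 0 else np.zeros_like(r)  # (sub)gradient with norm <= 1, so G = 1

theta0 = np.zeros(3)
R = np.linalg.norm(theta0 - b)  # distance from the start to the minimizer b
theta_hat = gd_convex_lipschitz(f, grad_f, theta0, R=R, G=1.0, t=1000)
print(f(theta_hat))  # within roughly R*G/sqrt(t) of the optimum f(b) = 0
```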


SLIDE 16

gd analysis proof

Theorem – GD on Convex Lipschitz Functions: For a convex G-Lipschitz function f, GD run with t ≥ R²G²/ϵ² iterations, η = R/(G√t), and starting point within radius R of θ∗, outputs ˆ θ satisfying: f(ˆ θ) ≤ f(θ∗) + ϵ.

Step 1: For all i, f(θi) − f(θ∗) ≤ (∥θi − θ∗∥₂² − ∥θi+1 − θ∗∥₂²)/(2η) + ηG²/2.

Visually:

SLIDE 17

gd analysis proof

Theorem – GD on Convex Lipschitz Functions (restated from Slide 16).

Step 1: For all i, f(θi) − f(θ∗) ≤ (∥θi − θ∗∥₂² − ∥θi+1 − θ∗∥₂²)/(2η) + ηG²/2.

Formally:

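The formal argument is left to the board; a standard derivation consistent with Steps 1 and 1.1 on the next slide (my reconstruction, not a transcript of the lecture) expands the squared distance after one step and applies the Lipschitz bound:

```latex
\begin{align*}
\|\theta_{i+1} - \theta^*\|_2^2
  &= \|\theta_i - \eta \nabla f(\theta_i) - \theta^*\|_2^2 \\
  &= \|\theta_i - \theta^*\|_2^2
     - 2\eta\, \nabla f(\theta_i)^\top (\theta_i - \theta^*)
     + \eta^2 \|\nabla f(\theta_i)\|_2^2 \\
  &\le \|\theta_i - \theta^*\|_2^2
     - 2\eta\, \nabla f(\theta_i)^\top (\theta_i - \theta^*)
     + \eta^2 G^2 .
\end{align*}
```

Rearranging gives Step 1.1, and the convexity corollary (applied with ⃗ θ1 = θi and ⃗ θ2 = θ∗) gives f(θi) − f(θ∗) ≤ ∇f(θi)ᵀ(θi − θ∗), so Step 1 follows.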


SLIDE 19

gd analysis proof

Theorem – GD on Convex Lipschitz Functions (restated from Slide 16).

Step 1: For all i, f(θi) − f(θ∗) ≤ (∥θi − θ∗∥₂² − ∥θi+1 − θ∗∥₂²)/(2η) + ηG²/2.

Step 1.1: ∇f(θi)ᵀ(θi − θ∗) ≤ (∥θi − θ∗∥₂² − ∥θi+1 − θ∗∥₂²)/(2η) + ηG²/2

⇒ Step 1 (using convexity: f(θi) − f(θ∗) ≤ ∇f(θi)ᵀ(θi − θ∗)).


SLIDE 21

gd analysis proof

Theorem – GD on Convex Lipschitz Functions (restated from Slide 16).

Step 1: For all i, f(θi) − f(θ∗) ≤ (∥θi − θ∗∥₂² − ∥θi+1 − θ∗∥₂²)/(2η) + ηG²/2.

Step 2: (1/t) ∑_{i=1}^t (f(θi) − f(θ∗)) ≤ R²/(2η·t) + ηG²/2.

SLIDE 22

gd analysis proof

Theorem – GD on Convex Lipschitz Functions (restated from Slide 16).

Step 2: (1/t) ∑_{i=1}^t (f(θi) − f(θ∗)) ≤ R²/(2η·t) + ηG²/2.

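For completeness, the calculation this slide sets up (again a reconstruction): summing Step 1 over the t iterations telescopes the squared distances, and the slide's choice of η and t finishes the proof:

```latex
\begin{align*}
\frac{1}{t}\sum_{i=1}^{t}\bigl(f(\theta_i) - f(\theta^*)\bigr)
  &\le \frac{1}{t}\sum_{i=1}^{t}
       \frac{\|\theta_i - \theta^*\|_2^2 - \|\theta_{i+1} - \theta^*\|_2^2}{2\eta}
       + \frac{\eta G^2}{2} \\
  &= \frac{\|\theta_1 - \theta^*\|_2^2 - \|\theta_{t+1} - \theta^*\|_2^2}{2\eta t}
     + \frac{\eta G^2}{2}
  \;\le\; \frac{R^2}{2\eta t} + \frac{\eta G^2}{2}.
\end{align*}
```

With η = R/(G√t) the right-hand side equals RG/(2√t) + RG/(2√t) = RG/√t, which is at most ϵ once t ≥ R²G²/ϵ². Since ˆ θ is the iterate with smallest f value, f(ˆ θ) − f(θ∗) is at most the average above, giving f(ˆ θ) ≤ f(θ∗) + ϵ.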


SLIDE 25

constrained convex optimization

Often want to perform convex optimization with convex constraints: θ∗ = arg min_{θ∈S} f(θ), where S is a convex set.

Definition – Convex Set: A set S ⊆ Rd is convex if and only if, for any ⃗ θ1, ⃗ θ2 ∈ S and λ ∈ [0, 1]: (1 − λ) · ⃗ θ1 + λ · ⃗ θ2 ∈ S.

E.g. S = {⃗ θ ∈ Rd : ∥⃗ θ∥2 ≤ 1}.


SLIDE 29

projected gradient descent

For any convex set S, let PS(·) denote the projection function onto S.

  • PS(⃗ y) = arg min_{⃗ θ∈S} ∥⃗ θ − ⃗ y∥2.
  • For S = {⃗ θ ∈ Rd : ∥⃗ θ∥2 ≤ 1}, what is PS(⃗ y)?
  • For S being a k-dimensional subspace of Rd, what is PS(⃗ y)?

Projected Gradient Descent

  • Choose some initialization ⃗ θ0 and set η = R/(G√t).
  • For i = 1, . . . , t:
      • ⃗ θ(out)i = ⃗ θi−1 − η · ⃗ ∇f(⃗ θi−1)
      • ⃗ θi = PS(⃗ θ(out)i).
  • Return ˆ θ = arg min over ⃗ θ0, . . . , ⃗ θt of f(⃗ θi).

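A Python sketch of projected gradient descent with S taken to be the unit ℓ2 ball (the test function, radius R, and iteration count are illustrative assumptions):

```python
import numpy as np

def project_unit_ball(y):
    """Projection onto S = {theta : ||theta||_2 <= 1}: rescale y if it lies outside."""
    n = np.linalg.norm(y)
    return y if n <= 1 else y / n

def projected_gd(f, grad_f, theta0, R, G, t, project):
    """Gradient step, then project back onto S; returns the best iterate seen."""
    eta = R / (G * np.sqrt(t))
    theta = np.array(theta0, dtype=float)
    best = theta.copy()
    for _ in range(t):
        theta_out = theta - eta * grad_f(theta)  # unconstrained step
        theta = project(theta_out)               # pull back into S
        if f(theta) < f(best):
            best = theta.copy()
    return best

# Minimize f(theta) = ||theta - b||_2 over the unit ball with b outside the ball;
# the constrained minimizer is b / ||b||_2.
b = np.array([3.0, 4.0])
f = lambda th: np.linalg.norm(th - b)
grad_f = lambda th: (th - b) / max(np.linalg.norm(th - b), 1e-12)
theta_hat = projected_gd(f, grad_f, np.zeros(2), R=2.0, G=1.0, t=2000,
                         project=project_unit_ball)
print(theta_hat)  # approximately b / ||b||_2 = [0.6, 0.8]
```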

SLIDE 30

projected gradient descent

Visually:


SLIDE 32

convex projections

Projected gradient descent can be analyzed identically to gradient descent!

Theorem – Projection to a convex set: For any convex set S ⊆ Rd, ⃗ y ∈ Rd, and ⃗ θ ∈ S: ∥PS(⃗ y) − ⃗ θ∥2 ≤ ∥⃗ y − ⃗ θ∥2.

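A quick numeric spot check of this theorem for the unit-ball projection (random sampled points, so an illustration rather than a proof):

```python
import numpy as np

def project_unit_ball(y):
    n = np.linalg.norm(y)
    return y if n <= 1 else y / n

# For y in R^d and theta in S = unit ball, check ||P_S(y) - theta||_2 <= ||y - theta||_2.
rng = np.random.default_rng(1)
for _ in range(1000):
    y = 3 * rng.normal(size=3)
    theta = rng.normal(size=3)
    theta /= max(np.linalg.norm(theta), 1.0)  # force theta into the unit ball
    assert np.linalg.norm(project_unit_ball(y) - theta) <= np.linalg.norm(y - theta) + 1e-9
print("projection never increased the distance to a point in S on the sampled cases")
```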


SLIDE 37

projected gradient descent analysis

Theorem – Projected GD: For a convex G-Lipschitz function f and convex set S, Projected GD run with t ≥ R²G²/ϵ² iterations, η = R/(G√t), and starting point within radius R of θ∗, outputs ˆ θ satisfying: f(ˆ θ) ≤ f(θ∗) + ϵ = min_{θ∈S} f(θ) + ϵ.

Recall: ⃗ θ(out)i+1 = ⃗ θi − η · ⃗ ∇f(⃗ θi) and ⃗ θi+1 = PS(⃗ θ(out)i+1).

Step 1: For all i, f(θi) − f(θ∗) ≤ (∥θi − θ∗∥₂² − ∥θ(out)i+1 − θ∗∥₂²)/(2η) + ηG²/2.

Step 1.a: For all i, f(θi) − f(θ∗) ≤ (∥θi − θ∗∥₂² − ∥θi+1 − θ∗∥₂²)/(2η) + ηG²/2.

Step 2: (1/t) ∑_{i=1}^t (f(θi) − f(θ∗)) ≤ R²/(2η·t) + ηG²/2 ⇒ Theorem.

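The move from Step 1 to Step 1.a is exactly where the projection theorem enters (my filling-in of the step the slide leaves implicit). Since θ∗ ∈ S,

```latex
\|\theta_{i+1} - \theta^*\|_2
  = \|P_S(\theta^{(\mathrm{out})}_{i+1}) - \theta^*\|_2
  \le \|\theta^{(\mathrm{out})}_{i+1} - \theta^*\|_2 ,
```

so −∥θ(out)i+1 − θ∗∥₂² ≤ −∥θi+1 − θ∗∥₂², and the right-hand side of Step 1 is at most the right-hand side of Step 1.a. The telescoping in Step 2 and the choice of η then go through exactly as in the unconstrained analysis.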


SLIDE 41

gradient descent at scale

Typical Optimization Problem in Machine Learning: Given data points ⃗ x1, . . . , ⃗ xn and labels/observations y1, . . . , yn, solve:

⃗ θ∗ = arg min_{⃗ θ∈Rd} L(⃗ θ, X) = ∑_{i=1}^n ℓ(M⃗ θ(⃗ xi), yi).

Why is gradient descent expensive to run if you have many data points?

⃗ ∇L(⃗ θ, X) = ∑_{i=1}^n ⃗ ∇ℓ(M⃗ θ(⃗ xi), yi): every step requires summing n per-example gradients.

Solution: Take each gradient step taking into account only one data point (or a small ‘batch’ of data points) at a time. Online and stochastic gradient descent.

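A Python sketch of the stochastic/minibatch idea (the linear model, squared loss, and hyperparameters are illustrative assumptions, not the course's notation):

```python
import numpy as np

def sgd(grad_batch, theta0, n, eta, epochs, batch_size=1, seed=0):
    """Each step uses the gradient on a small random batch instead of all n points."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(n // batch_size, 1)):
            theta = theta - eta * grad_batch(theta, idx)
    return theta

# Illustrative setup: linear model with squared loss, so the average gradient on a
# batch B is (2/|B|) * X_B^T (X_B theta - y_B).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true + 0.1 * rng.normal(size=500)
grad_batch = lambda th, idx: 2 * X[idx].T @ (X[idx] @ th - y[idx]) / len(idx)

theta_hat = sgd(grad_batch, np.zeros(5), n=500, eta=0.05, epochs=20, batch_size=10)
print(np.linalg.norm(theta_hat - theta_true))  # small: SGD recovers theta_true approximately
```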