

slide-1
SLIDE 1

Optimization

Aymeric DIEULEVEUT

EPFL, Lausanne

January 26, 2018, Journées YSP

1

slide-2
SLIDE 2

Outline

  • 1. General context and examples.
  • 2. What makes optimization hard ?

2

slide-3
SLIDE 3

Outline

  • 1. General context and examples.
  • 2. What makes optimization hard ?

In the context of supervised machine learning:

  • 3. Minimizing Empirical Risk.

2

slide-4
SLIDE 4

Outline

  • 1. General context and examples.
  • 2. What makes optimization hard ?

In the context of supervised machine learning:

  • 3. Minimizing Empirical Risk.
  • 4. Minimizing Generalization Risk.

2

slide-5
SLIDE 5

General context

What is optimization about?

$$\min_{\theta \in \Theta} f(\theta)$$

With $\theta$ a parameter, and $f$ a cost function.

3

slide-6
SLIDE 6

General context

What is optimization about?

$$\min_{\theta \in \Theta} f(\theta)$$

With $\theta$ a parameter, and $f$ a cost function. Why?

3

slide-7
SLIDE 7

General context

What is optimization about?

$$\min_{\theta \in \Theta} f(\theta)$$

With $\theta$ a parameter, and $f$ a cost function. Why? We formulate our problem as an optimization problem. 3 examples:

◮ Supervised machine learning
◮ Signal processing
◮ Optimal transport

3

slide-8
SLIDE 8

Some Examples

Example 1: Supervised Machine Learning. Goal: predict a phenomenon from "explanatory variables", given a set of observations.

4

slide-9
SLIDE 9

Some Examples

Example 1: Supervised Machine Learning. Goal: predict a phenomenon from "explanatory variables", given a set of observations.

◮ Bio-informatics. Input: DNA/RNA sequence; Output: drug responsiveness.
◮ Image classification. Input: images; Output: digit.

4

slide-10
SLIDE 10

Supervised Machine Learning

Example 1: Supervised Machine Learning Consider an input/output pair (X, Y ) ∈ X × Y, (X, Y ) ∼ ρ. Goal: function θ : X → R, s.t. θ(X) good prediction for Y .

5

slide-11
SLIDE 11

Supervised Machine Learning

Example 1: Supervised Machine Learning. Consider an input/output pair $(X, Y) \in \mathcal{X} \times \mathcal{Y}$, $(X, Y) \sim \rho$. Goal: a function $\theta : \mathcal{X} \to \mathbb{R}$, s.t. $\theta(X)$ is a good prediction for $Y$. Here, as a linear function $\langle \theta, \Phi(X) \rangle$ of features $\Phi(X) \in \mathbb{R}^d$.

5

slide-12
SLIDE 12

Supervised Machine Learning

Example 1: Supervised Machine Learning. Consider an input/output pair $(X, Y) \in \mathcal{X} \times \mathcal{Y}$, $(X, Y) \sim \rho$. Goal: a function $\theta : \mathcal{X} \to \mathbb{R}$, s.t. $\theta(X)$ is a good prediction for $Y$. Here, as a linear function $\langle \theta, \Phi(X) \rangle$ of features $\Phi(X) \in \mathbb{R}^d$. Consider a loss function $\ell : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}_+$. Define the Generalization risk: $R(\theta) := \mathbb{E}_\rho\big[\ell(Y, \langle \theta, \Phi(X) \rangle)\big]$.

5

slide-13
SLIDE 13

Empirical Risk minimization (I)

Data: $n$ observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, i.i.d. Empirical risk (or training error):

$$\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big).$$

6

slide-14
SLIDE 14

Empirical Risk minimization (I)

Data: $n$ observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, i.i.d. Empirical risk (or training error):

$$\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big).$$

Empirical risk minimization (ERM): find $\hat{\theta}$ solution of

$$\min_{\theta \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big) + \mu \Omega(\theta).$$

convex data-fitting term + regularizer

6

slide-15
SLIDE 15

Empirical Risk minimization (II)

For example, least-squares regression:

$$\min_{\theta \in \mathbb{R}^d} \; \frac{1}{2n} \sum_{i=1}^{n} \big(y_i - \langle \theta, \Phi(x_i) \rangle\big)^2 + \mu \Omega(\theta),$$

7

slide-16
SLIDE 16

Empirical Risk minimization (II)

For example, least-squares regression:

$$\min_{\theta \in \mathbb{R}^d} \; \frac{1}{2n} \sum_{i=1}^{n} \big(y_i - \langle \theta, \Phi(x_i) \rangle\big)^2 + \mu \Omega(\theta),$$

and logistic regression:

$$\min_{\theta \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \log\big(1 + \exp(-y_i \langle \theta, \Phi(x_i) \rangle)\big) + \mu \Omega(\theta).$$

Two fundamental questions: (1) computing and (2) analyzing $\hat{\theta}$. The problem is formalized as a (convex) optimization problem. In the large-scale setting, the problem is high dimensional and has many examples.
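As a concrete illustration of the regularized logistic-regression objective above, here is a minimal NumPy sketch of the empirical risk and its gradient, assuming the identity feature map $\Phi(x) = x$ and $\Omega(\theta) = \tfrac{1}{2}\|\theta\|^2$; the data and the constant mu are synthetic placeholders, not part of the talk.

```python
import numpy as np

def logistic_erm(theta, X, y, mu):
    """(1/n) * sum_i log(1 + exp(-y_i <theta, x_i>)) + (mu/2) * ||theta||^2."""
    margins = y * (X @ theta)                      # y_i <theta, Phi(x_i)>
    risk = np.mean(np.log1p(np.exp(-margins)))     # data-fitting term
    return risk + 0.5 * mu * theta @ theta         # + regularizer

def logistic_erm_grad(theta, X, y, mu):
    """Gradient of the objective above."""
    margins = y * (X @ theta)
    sigma = 1.0 / (1.0 + np.exp(margins))          # sigmoid(-margin)
    grad_risk = -(X.T @ (y * sigma)) / len(y)
    return grad_risk + mu * theta

# Hypothetical usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
theta_true = rng.normal(size=5)
y = np.sign(X @ theta_true + 0.1 * rng.normal(size=200))
theta0 = np.zeros(5)
print(logistic_erm(theta0, X, y, mu=0.1))
```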

7

slide-17
SLIDE 17

Some Examples

Example 2: Signal processing. Observe a signal $Y \in \mathbb{R}^{n \times q}$, try to recover the source $B \in \mathbb{R}^{p \times q}$, knowing the "forward matrix" $X \in \mathbb{R}^{n \times p}$ (multi-task regression):

$$\min_{\beta} \; \|X\beta - Y\|_F^2 + \lambda \Omega(\beta)$$

8

slide-18
SLIDE 18

Some Examples

Example 2: Signal processing. Observe a signal $Y \in \mathbb{R}^{n \times q}$, try to recover the source $B \in \mathbb{R}^{p \times q}$, knowing the "forward matrix" $X \in \mathbb{R}^{n \times p}$ (multi-task regression):

$$\min_{\beta} \; \|X\beta - Y\|_F^2 + \lambda \Omega(\beta)$$

$\Omega$: sparsity-inducing regularization.

8

slide-19
SLIDE 19

Some Examples

Example 2: Signal processing. Observe a signal $Y \in \mathbb{R}^{n \times q}$, try to recover the source $B \in \mathbb{R}^{p \times q}$, knowing the "forward matrix" $X \in \mathbb{R}^{n \times p}$ (multi-task regression):

$$\min_{\beta} \; \|X\beta - Y\|_F^2 + \lambda \Omega(\beta)$$

$\Omega$: sparsity-inducing regularization. How to choose $\lambda$?

8

slide-20
SLIDE 20

Some Examples

Example 3: Optimal transport

$$\min_{\pi \in \Pi} \int c(x, y)\, d\pi(x, y)$$

$\Pi$: set of probability distributions; $c(x, y)$: "distance" from $x$ to $y$; + regularization. Kantorovich formulation of OT.

9

slide-21
SLIDE 21

Is it a (hard) problem?

for convex optimization, in 99 % of the cases, no.

10

slide-22
SLIDE 22

Is it a (hard) problem?

for convex optimization, in 99 % of the cases, no. In other words:

10

slide-23
SLIDE 23

Is it a (hard) problem?

for convex optimization, in 99 % of the cases, no. In other words: Use cvxpy

10

slide-24
SLIDE 24

Is it a (hard) problem?

for convex optimization, in 99% of the cases, no. In other words: use cvxpy. The interesting (or hard) problems are the remaining cases.
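For instance, the regularized least-squares problem from the previous slides can be handed directly to cvxpy; a minimal sketch on synthetic placeholder data (the regularization weight mu is illustrative, not from the talk):

```python
import cvxpy as cp
import numpy as np

# Synthetic data (placeholders)
n, d = 100, 10
rng = np.random.default_rng(0)
A = rng.normal(size=(n, d))
y = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
mu = 0.1

# Regularized least squares: (1/2n) ||A theta - y||^2 + mu * ||theta||_1
theta = cp.Variable(d)
objective = cp.Minimize(cp.sum_squares(A @ theta - y) / (2 * n) + mu * cp.norm1(theta))
problem = cp.Problem(objective)
problem.solve()            # generic convex solver does the work
print(theta.value)         # approximate minimizer
```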

10

slide-25
SLIDE 25

What makes it hard: 1. Convexity

Why?

11

slide-26
SLIDE 26

What makes it hard: 1. Convexity

Why? Typical non-convex problems: Empirical risk minimization with 0-1 loss.

11

slide-27
SLIDE 27

What makes it hard: 1. Convexity

Why? Typical non-convex problems: Empirical risk minimization with the 0-1 loss:

$$\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{y_i \neq \operatorname{sign}\langle \theta, \Phi(x_i) \rangle}.$$

11

slide-28
SLIDE 28

What makes it hard: 1. Convexity

Why? Typical non-convex problems: Empirical risk minimization with the 0-1 loss:

$$\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{y_i \neq \operatorname{sign}\langle \theta, \Phi(x_i) \rangle}.$$

Matrix factorization: $\min_{Y, W} \|X - YW\|_F^2$ is not jointly convex.

11

slide-29
SLIDE 29

What makes it hard: 1. Convexity

Why? Typical non-convex problems: Empirical risk minimization with the 0-1 loss:

$$\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{y_i \neq \operatorname{sign}\langle \theta, \Phi(x_i) \rangle}.$$

Matrix factorization: $\min_{Y, W} \|X - YW\|_F^2$ is not jointly convex. Neural networks: parametric non-convex functions.

11

slide-30
SLIDE 30

What makes it hard: 2. Regularity of the function

  • a. Smoothness

◮ A function $g : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth if and only if it is twice differentiable and, $\forall \theta \in \mathbb{R}^d$, $\mathrm{eigenvalues}\big(g''(\theta)\big) \le L$.

12

slide-31
SLIDE 31

What makes it hard: 2. Regularity of the function

  • a. Smoothness

◮ A function $g : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth if and only if it is twice differentiable and, $\forall \theta \in \mathbb{R}^d$, $\mathrm{eigenvalues}\big(g''(\theta)\big) \le L$.

For all $\theta, \theta' \in \mathbb{R}^d$:

$$g(\theta) \le g(\theta') + \langle g'(\theta'), \theta - \theta' \rangle + \frac{L}{2} \|\theta - \theta'\|^2$$

12

slide-32
SLIDE 32

What makes it hard: 2. Regularity of the function

  • b. Strong Convexity

◮ A twice differentiable function $g : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex if and only if, $\forall \theta \in \mathbb{R}^d$, $\mathrm{eigenvalues}\big(g''(\theta)\big) \ge \mu$.

13

slide-33
SLIDE 33

What makes it hard: 2. Regularity of the function

  • b. Strong Convexity

◮ A twice differentiable function $g : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex if and only if, $\forall \theta \in \mathbb{R}^d$, $\mathrm{eigenvalues}\big(g''(\theta)\big) \ge \mu$.

For all $\theta, \theta' \in \mathbb{R}^d$:

$$g(\theta) \ge g(\theta') + \langle g'(\theta'), \theta - \theta' \rangle + \frac{\mu}{2} \|\theta - \theta'\|^2$$

13

slide-34
SLIDE 34

What makes it hard: 2. Regularity of the function

Why? Rates typically depend on the condition number $\kappa = L/\mu$:

14

slide-35
SLIDE 35

What makes it hard: 2. Regularity of the function

Why? Rates typically depend on the condition number $\kappa = L/\mu$:

Large $\kappa$: harder to optimize. Small $\kappa$: easier to optimize.
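As a small numerical aside (not from the slides): for the least-squares empirical risk the Hessian is the feature covariance matrix, so $L$, $\mu$ and $\kappa$ can be read off its extreme eigenvalues. A sketch on synthetic placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
# Features with deliberately uneven scales, to make the problem ill-conditioned
Phi = rng.normal(size=(n, d)) @ np.diag(np.linspace(0.1, 3.0, d))

# Hessian of the least-squares empirical risk (1/2n) sum_i (y_i - <theta, Phi(x_i)>)^2
H = Phi.T @ Phi / n
eigs = np.linalg.eigvalsh(H)
L, mu = eigs.max(), eigs.min()
print(f"L = {L:.3f}, mu = {mu:.4f}, condition number kappa = {L / mu:.1f}")
```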

14

slide-36
SLIDE 36

Smoothness and strong convexity in ML

We consider an a.s. convex loss in $\theta$. Thus $\hat{R}$ and $R$ are convex.

15

slide-37
SLIDE 37

Smoothness and strong convexity in ML

We consider an a.s. convex loss in $\theta$. Thus $\hat{R}$ and $R$ are convex. Hessian of $\hat{R}$ $\approx$ covariance matrix $\frac{1}{n} \sum_{i=1}^{n} \Phi(x_i)\Phi(x_i)^\top$.

15

slide-38
SLIDE 38

Smoothness and strong convexity in ML

We consider an a.s. convex loss in $\theta$. Thus $\hat{R}$ and $R$ are convex. Hessian of $\hat{R}$ $\approx$ covariance matrix $\frac{1}{n} \sum_{i=1}^{n} \Phi(x_i)\Phi(x_i)^\top$.

If $\ell$ is smooth, and $\mathbb{E}[\|\Phi(X)\|^2] \le r^2$, then $R$ is smooth. If $\ell$ is $\mu$-strongly convex, and the data has an invertible covariance matrix (low correlation/dimension), then $R$ is strongly convex.

15

slide-39
SLIDE 39

Smoothness and strong convexity in ML

We consider an a.s. convex loss in $\theta$. Thus $\hat{R}$ and $R$ are convex. Hessian of $\hat{R}$ $\approx$ covariance matrix $\frac{1}{n} \sum_{i=1}^{n} \Phi(x_i)\Phi(x_i)^\top$.

If $\ell$ is smooth, and $\mathbb{E}[\|\Phi(X)\|^2] \le r^2$, then $R$ is smooth. If $\ell$ is $\mu$-strongly convex, and the data has an invertible covariance matrix (low correlation/dimension), then $R$ is strongly convex.

Importance of regularization: provides strong convexity, and avoids overfitting.

15

slide-40
SLIDE 40

Smoothness and strong convexity in ML

We consider an a.s. convex loss in $\theta$. Thus $\hat{R}$ and $R$ are convex. Hessian of $\hat{R}$ $\approx$ covariance matrix $\frac{1}{n} \sum_{i=1}^{n} \Phi(x_i)\Phi(x_i)^\top$.

If $\ell$ is smooth, and $\mathbb{E}[\|\Phi(X)\|^2] \le r^2$, then $R$ is smooth. If $\ell$ is $\mu$-strongly convex, and the data has an invertible covariance matrix (low correlation/dimension), then $R$ is strongly convex.

Importance of regularization: provides strong convexity, and avoids overfitting.

Note: when considering the dual formulation of the problem:

◮ $L$-smoothness $\leftrightarrow$ $1/L$-strong convexity
◮ $\mu$-strong convexity $\leftrightarrow$ $1/\mu$-smoothness

15

slide-41
SLIDE 41

What makes it hard: 3. Set Θ, complexity of f

  • a. Set Θ: (if Θ is a convex set.)

◮ May be described implicitly (via equations):

$\Theta = \{\theta \in \mathbb{R}^d \text{ s.t. } \|\theta\|_2 \le R \text{ and } \langle \theta, \mathbf{1} \rangle = r\}$.

16

slide-42
SLIDE 42

What makes it hard: 3. Set Θ, complexity of f

  • a. Set Θ: (if Θ is a convex set.)

◮ May be described implicitly (via equations):

$\Theta = \{\theta \in \mathbb{R}^d \text{ s.t. } \|\theta\|_2 \le R \text{ and } \langle \theta, \mathbf{1} \rangle = r\}$. Use the dual formulation of the problem.

16

slide-43
SLIDE 43

What makes it hard: 3. Set Θ, complexity of f

  • a. Set Θ: (if Θ is a convex set.)

◮ May be described implicitly (via equations):

$\Theta = \{\theta \in \mathbb{R}^d \text{ s.t. } \|\theta\|_2 \le R \text{ and } \langle \theta, \mathbf{1} \rangle = r\}$. Use the dual formulation of the problem.

◮ Projection might be difficult or impossible.

16

slide-44
SLIDE 44

What makes it hard: 3. Set Θ, complexity of f

  • a. Set Θ: (if Θ is a convex set.)

◮ May be described implicitly (via equations):

$\Theta = \{\theta \in \mathbb{R}^d \text{ s.t. } \|\theta\|_2 \le R \text{ and } \langle \theta, \mathbf{1} \rangle = r\}$. Use the dual formulation of the problem.

◮ Projection might be difficult or impossible.

use algorithms requiring linear minimization oracle instead of quadratic oracles (Frank Wolfe)

16

slide-45
SLIDE 45

What makes it hard: 3. Set Θ, complexity of f

  • a. Set Θ: (if Θ is a convex set.)

◮ May be described implicitly (via equations):

$\Theta = \{\theta \in \mathbb{R}^d \text{ s.t. } \|\theta\|_2 \le R \text{ and } \langle \theta, \mathbf{1} \rangle = r\}$. Use the dual formulation of the problem.

◮ Projection might be difficult or impossible.

use algorithms requiring linear minimization oracle instead of quadratic oracles (Frank Wolfe)

◮ Even when Θ = Rd, d might be very large (typically

millions)

16

slide-46
SLIDE 46

What makes it hard: 3. Set Θ, complexity of f

  • a. Set Θ: (if Θ is a convex set.)

◮ May be described implicitly (via equations):

$\Theta = \{\theta \in \mathbb{R}^d \text{ s.t. } \|\theta\|_2 \le R \text{ and } \langle \theta, \mathbf{1} \rangle = r\}$. Use the dual formulation of the problem.

◮ Projection might be difficult or impossible.

use algorithms requiring linear minimization oracle instead of quadratic oracles (Frank Wolfe)

◮ Even when Θ = Rd, d might be very large (typically

millions) use only first order methods

16

slide-47
SLIDE 47

What makes it hard: 3. Set Θ, complexity of f

  • a. Set Θ: (if Θ is a convex set.)

◮ May be described implicitly (via equations):

$\Theta = \{\theta \in \mathbb{R}^d \text{ s.t. } \|\theta\|_2 \le R \text{ and } \langle \theta, \mathbf{1} \rangle = r\}$. Use the dual formulation of the problem.

◮ Projection might be difficult or impossible.

use algorithms requiring linear minimization oracle instead of quadratic oracles (Frank Wolfe)

◮ Even when Θ = Rd, d might be very large (typically

millions) use only first order methods

  • b. Structure of $f$. If $f = \hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big)$, computing a gradient has a cost proportional to $n$.
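To make that cost concrete, here is a hypothetical NumPy sketch contrasting one full-gradient evaluation (cost $O(nd)$) with one stochastic-gradient evaluation (cost $O(d)$) for least squares; data and sizes are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 50
Phi = rng.normal(size=(n, d))
y = Phi @ rng.normal(size=d)
theta = np.zeros(d)

def full_gradient(theta):
    # Touches all n observations: cost O(n d)
    residuals = Phi @ theta - y
    return Phi.T @ residuals / n

def stochastic_gradient(theta):
    # Touches a single observation: cost O(d)
    i = rng.integers(n)
    return (Phi[i] @ theta - y[i]) * Phi[i]

print(full_gradient(theta).shape, stochastic_gradient(theta).shape)
```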

16

slide-48
SLIDE 48

Optimization

Take home

◮ We express problems as minimizing a function over a set
◮ Most convex problems are solved
◮ Difficulties come from non-convexity, lack of regularity, complexity of the set Θ (or high dimension), complexity of computing gradients

17

slide-49
SLIDE 49

Optimization

Take home

◮ We express problems as minimizing a function over a set
◮ Most convex problems are solved
◮ Difficulties come from non-convexity, lack of regularity, complexity of the set Θ (or high dimension), complexity of computing gradients

What happens for supervised machine learning?

17

slide-50
SLIDE 50

Optimization

Take home

◮ We express problems as minimizing a function over a set
◮ Most convex problems are solved
◮ Difficulties come from non-convexity, lack of regularity, complexity of the set Θ (or high dimension), complexity of computing gradients

What happens for supervised machine learning? Goals:

◮ present algorithms (convex, large dimension, high number of observations)

17

slide-51
SLIDE 51

Optimization

Take home

◮ We express problems as minimizing a function over a set
◮ Most convex problems are solved
◮ Difficulties come from non-convexity, lack of regularity, complexity of the set Θ (or high dimension), complexity of computing gradients

What happens for supervised machine learning? Goals:

◮ present algorithms (convex, large dimension, high number of observations)
◮ show how rates depend on smoothness and strong convexity

17

slide-52
SLIDE 52

Optimization

Take home

◮ We express problems as minimizing a function over a set
◮ Most convex problems are solved
◮ Difficulties come from non-convexity, lack of regularity, complexity of the set Θ (or high dimension), complexity of computing gradients

What happens for supervised machine learning? Goals:

◮ present algorithms (convex, large dimension, high number of observations)
◮ show how rates depend on smoothness and strong convexity
◮ show how we can use the structure

17

slide-53
SLIDE 53

Optimization

Take home

◮ We express problems as minimizing a function over a set
◮ Most convex problems are solved
◮ Difficulties come from non-convexity, lack of regularity, complexity of the set Θ (or high dimension), complexity of computing gradients

What happens for supervised machine learning? Goals:

◮ present algorithms (convex, large dimension, high number of observations)
◮ show how rates depend on smoothness and strong convexity
◮ show how we can use the structure
◮ not forgetting the initial problem...!

17

slide-54
SLIDE 54

Stochastic algorithms for ERM

$$\min_{\theta \in \mathbb{R}^d} \; \hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big).$$

Two fundamental questions: (a) computing, (b) analyzing $\hat{\theta}$.

18

slide-55
SLIDE 55

Stochastic algorithms for ERM

$$\min_{\theta \in \mathbb{R}^d} \; \hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big).$$

Two fundamental questions: (a) computing, (b) analyzing $\hat{\theta}$. "Large scale" framework: the number of examples $n$ and the number of explanatory variables $d$ are both large.

1. High dimension $d$ $\Longrightarrow$ first-order algorithms. Gradient Descent (GD): $\theta_k = \theta_{k-1} - \gamma_k \hat{R}'(\theta_{k-1})$

18

slide-56
SLIDE 56

Stochastic algorithms for ERM

$$\min_{\theta \in \mathbb{R}^d} \; \hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big).$$

Two fundamental questions: (a) computing, (b) analyzing $\hat{\theta}$. "Large scale" framework: the number of examples $n$ and the number of explanatory variables $d$ are both large.

1. High dimension $d$ $\Longrightarrow$ first-order algorithms. Gradient Descent (GD): $\theta_k = \theta_{k-1} - \gamma_k \hat{R}'(\theta_{k-1})$. Problem: computing the gradient costs $O(dn)$ per iteration.

2. Large $n$ $\Longrightarrow$ stochastic algorithms: Stochastic Gradient Descent (SGD). A minimal sketch of the GD update follows below.
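A minimal sketch of plain gradient descent on the least-squares empirical risk, on synthetic placeholder data with an untuned constant step size; each iteration touches all $n$ observations, which is exactly the $O(dn)$ cost noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 20
Phi = rng.normal(size=(n, d))
y = Phi @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
gamma = 0.5                                 # constant step, should stay below 2/L
for k in range(200):
    grad = Phi.T @ (Phi @ theta - y) / n    # full gradient: O(n d) per iteration
    theta -= gamma * grad

print("empirical risk:", 0.5 * np.mean((Phi @ theta - y) ** 2))
```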

18

slide-57
SLIDE 57

Stochastic Gradient descent

◮ Goal: $\min_{\theta \in \mathbb{R}^d} f(\theta)$, given unbiased gradient estimates $f'_n$.
◮ $\theta_* := \arg\min_{\mathbb{R}^d} f(\theta)$.

[Figure: the minimizer $\theta_*$.]

19

slide-58
SLIDE 58

Stochastic Gradient descent

◮ Goal: $\min_{\theta \in \mathbb{R}^d} f(\theta)$, given unbiased gradient estimates $f'_n$.
◮ $\theta_* := \arg\min_{\mathbb{R}^d} f(\theta)$.
◮ Key algorithm: Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951): $\theta_k = \theta_{k-1} - \gamma_k f'_k(\theta_{k-1})$
◮ $\mathbb{E}[f'_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = f'(\theta_{k-1})$ for a filtration $(\mathcal{F}_k)_{k \ge 0}$; $\theta_k$ is $\mathcal{F}_k$-measurable.

19

[Figure: starting point $\theta_0$ and minimizer $\theta_*$.]

slide-59
SLIDE 59

Stochastic Gradient descent

◮ Goal: $\min_{\theta \in \mathbb{R}^d} f(\theta)$, given unbiased gradient estimates $f'_n$.
◮ $\theta_* := \arg\min_{\mathbb{R}^d} f(\theta)$.
◮ Key algorithm: Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951): $\theta_k = \theta_{k-1} - \gamma_k f'_k(\theta_{k-1})$
◮ $\mathbb{E}[f'_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = f'(\theta_{k-1})$ for a filtration $(\mathcal{F}_k)_{k \ge 0}$; $\theta_k$ is $\mathcal{F}_k$-measurable.

[Figure: SGD iterates $\theta_0, \theta_1, \dots, \theta_n$ approaching $\theta_*$.]

19

slide-60
SLIDE 60

SGD for ERM: $f = \hat{R}$

Loss for a single pair of observations, for any $j \le n$: $f_j(\theta) := \ell(y_j, \langle \theta, \Phi(x_j) \rangle)$. One observation at each step $\Longrightarrow$ complexity $O(d)$ per iteration.

20

slide-61
SLIDE 61

SGD for ERM: $f = \hat{R}$

Loss for a single pair of observations, for any $j \le n$: $f_j(\theta) := \ell(y_j, \langle \theta, \Phi(x_j) \rangle)$. One observation at each step $\Longrightarrow$ complexity $O(d)$ per iteration. For the empirical risk $\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$:

◮ At each step $k \in \mathbb{N}^*$, sample $I_k \sim \mathcal{U}\{1, \dots, n\}$, and use: $f'_{I_k}(\theta_{k-1}) = \ell'(y_{I_k}, \langle \theta_{k-1}, \Phi(x_{I_k}) \rangle)$

20

slide-62
SLIDE 62

SGD for ERM: $f = \hat{R}$

Loss for a single pair of observations, for any $j \le n$: $f_j(\theta) := \ell(y_j, \langle \theta, \Phi(x_j) \rangle)$. One observation at each step $\Longrightarrow$ complexity $O(d)$ per iteration. For the empirical risk $\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$:

◮ At each step $k \in \mathbb{N}^*$, sample $I_k \sim \mathcal{U}\{1, \dots, n\}$, and use: $f'_{I_k}(\theta_{k-1}) = \ell'(y_{I_k}, \langle \theta_{k-1}, \Phi(x_{I_k}) \rangle)$

$$\mathbb{E}[f'_{I_k}(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = \frac{1}{n} \sum_{i=1}^{n} \ell'\big(y_i, \langle \theta_{k-1}, \Phi(x_i) \rangle\big)$$

20

slide-63
SLIDE 63

SGD for ERM: $f = \hat{R}$

Loss for a single pair of observations, for any $j \le n$: $f_j(\theta) := \ell(y_j, \langle \theta, \Phi(x_j) \rangle)$. One observation at each step $\Longrightarrow$ complexity $O(d)$ per iteration. For the empirical risk $\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$:

◮ At each step $k \in \mathbb{N}^*$, sample $I_k \sim \mathcal{U}\{1, \dots, n\}$, and use: $f'_{I_k}(\theta_{k-1}) = \ell'(y_{I_k}, \langle \theta_{k-1}, \Phi(x_{I_k}) \rangle)$

$$\mathbb{E}[f'_{I_k}(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = \frac{1}{n} \sum_{i=1}^{n} \ell'\big(y_i, \langle \theta_{k-1}, \Phi(x_i) \rangle\big) = \hat{R}'(\theta_{k-1}),$$

with $\mathcal{F}_k = \sigma\big((x_i, y_i)_{1 \le i \le n}, (I_i)_{1 \le i \le k}\big)$.
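A minimal sketch of this sampling scheme for the least-squares empirical risk, mirroring the update $\theta_k = \theta_{k-1} - \gamma_k f'_{I_k}(\theta_{k-1})$; the synthetic data and the step-size schedule are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20
Phi = rng.normal(size=(n, d))
y = Phi @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
gamma0 = 0.02
for k in range(1, 50_001):
    i = rng.integers(n)                             # I_k ~ Uniform{1, ..., n}
    grad_i = (Phi[i] @ theta - y[i]) * Phi[i]       # f'_{I_k}(theta_{k-1}) for least squares
    theta -= (gamma0 / np.sqrt(k)) * grad_i         # SGD step with decaying step size

print("empirical risk after one run:", 0.5 * np.mean((Phi @ theta - y) ** 2))
```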

20

slide-64
SLIDE 64

Analysis: behaviour of $(\theta_n)_{n \ge 0}$

$$\theta_k = \theta_{k-1} - \gamma_k f'_k(\theta_{k-1})$$

Importance of the learning rate $(\gamma_k)_{k \ge 0}$. For a smooth and strongly convex problem, $\theta_k \to \theta_*$ a.s. if

$$\sum_{k=1}^{\infty} \gamma_k = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \gamma_k^2 < \infty.$$

21

slide-65
SLIDE 65

Analysis: behaviour of $(\theta_n)_{n \ge 0}$

$$\theta_k = \theta_{k-1} - \gamma_k f'_k(\theta_{k-1})$$

Importance of the learning rate $(\gamma_k)_{k \ge 0}$. For a smooth and strongly convex problem, $\theta_k \to \theta_*$ a.s. if

$$\sum_{k=1}^{\infty} \gamma_k = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \gamma_k^2 < \infty.$$

And asymptotic normality: $\sqrt{k}\,(\theta_k - \theta_*) \xrightarrow{d} \mathcal{N}(0, V)$, for $\gamma_k = \frac{\gamma_0}{k}$, $\gamma_0 \ge \frac{1}{\mu}$ (a toy illustration follows below).
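A toy one-dimensional illustration of the sensitivity of the $\gamma_k = \gamma_0/k$ schedule to the choice of $\gamma_0$ relative to $1/\mu$; the quadratic objective, noise level and constants are placeholders, not the talk's setting.

```python
import numpy as np

def sgd_run(mu=0.1, gamma0=1.0, sigma=1.0, n_steps=10_000, seed=0):
    """SGD on f(theta) = (mu/2) theta^2 with noisy gradients mu*theta + noise."""
    rng = np.random.default_rng(seed)
    theta = 5.0
    for k in range(1, n_steps + 1):
        grad = mu * theta + sigma * rng.normal()   # unbiased gradient estimate
        theta -= (gamma0 / k) * grad               # gamma_k = gamma0 / k
    return theta

# gamma0 >= 1/mu (here 1/mu = 10) behaves well; gamma0 = 1 converges very slowly.
print("gamma0 = 10:", sgd_run(gamma0=10.0))
print("gamma0 = 1: ", sgd_run(gamma0=1.0))
```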

21

slide-66
SLIDE 66

Analysis: behaviour of $(\theta_n)_{n \ge 0}$

$$\theta_k = \theta_{k-1} - \gamma_k f'_k(\theta_{k-1})$$

Importance of the learning rate $(\gamma_k)_{k \ge 0}$. For a smooth and strongly convex problem, $\theta_k \to \theta_*$ a.s. if

$$\sum_{k=1}^{\infty} \gamma_k = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \gamma_k^2 < \infty.$$

And asymptotic normality: $\sqrt{k}\,(\theta_k - \theta_*) \xrightarrow{d} \mathcal{N}(0, V)$, for $\gamma_k = \frac{\gamma_0}{k}$, $\gamma_0 \ge \frac{1}{\mu}$.

◮ Limit variance scales as $1/\mu^2$
◮ Very sensitive to ill-conditioned problems
◮ $\mu$ generally unknown...

21

slide-67
SLIDE 67

Polyak Ruppert averaging

Introduced by Polyak and Juditsky (1992) and Ruppert (1988):

$$\bar{\theta}_k = \frac{1}{k+1} \sum_{i=0}^{k} \theta_i.$$

◮ Offline averaging reduces the noise effect.

22

[Figure: SGD iterates $\theta_0, \theta_1, \dots, \theta_n$ around $\theta_*$.]

slide-68
SLIDE 68

Polyak Ruppert averaging

Introduced by Polyak and Juditsky (1992) and Ruppert (1988):

$$\bar{\theta}_k = \frac{1}{k+1} \sum_{i=0}^{k} \theta_i.$$

[Figure: raw SGD iterates $\theta_0, \theta_1, \dots, \theta_n$ and averaged iterates $\bar{\theta}_1, \bar{\theta}_2, \dots, \bar{\theta}_n$ around $\theta_*$.]

◮ Offline averaging reduces the noise effect.
◮ Online computing: $\bar{\theta}_{k+1} = \frac{1}{k+2}\, \theta_{k+1} + \frac{k+1}{k+2}\, \bar{\theta}_k.$
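A minimal sketch of SGD with this online running average for a least-squares toy problem (synthetic placeholder data, untuned constant step size); the average costs $O(d)$ extra per step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20
Phi = rng.normal(size=(n, d))
y = Phi @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
theta_bar = theta.copy()
gamma = 0.01                                    # constant step size, illustrative only
for k in range(1, 20_001):
    i = rng.integers(n)
    grad = (Phi[i] @ theta - y[i]) * Phi[i]
    theta -= gamma * grad                       # raw SGD iterate
    theta_bar += (theta - theta_bar) / (k + 1)  # online Polyak-Ruppert average

print("risk (last iterate):", 0.5 * np.mean((Phi @ theta - y) ** 2))
print("risk (averaged):    ", 0.5 * np.mean((Phi @ theta_bar - y) ** 2))
```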

22

slide-69
SLIDE 69

Convex stochastic approximation: convergence

Known global minimax rates for non-smooth problems:

◮ Strongly convex: $O((\mu k)^{-1})$. Attained by averaged stochastic gradient descent with $\gamma_k \propto (\mu k)^{-1}$.
◮ Non-strongly convex: $O(k^{-1/2})$. Attained by averaged stochastic gradient descent with $\gamma_k \propto k^{-1/2}$.

23

slide-70
SLIDE 70

Convex stochastic approximation: convergence

Known global minimax rates for non-smooth problems:

◮ Strongly convex: $O((\mu k)^{-1})$. Attained by averaged stochastic gradient descent with $\gamma_k \propto (\mu k)^{-1}$.
◮ Non-strongly convex: $O(k^{-1/2})$. Attained by averaged stochastic gradient descent with $\gamma_k \propto k^{-1/2}$.

For smooth problems:

◮ Strongly convex: $O((\mu k)^{-1})$ for $\gamma_k \propto k^{-1/2}$: adapts to strong convexity.

23

slide-71
SLIDE 71

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√k)

24
slide-72
SLIDE 72

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√k)
Stgly-Cvx    | O(1/(µk))   | O(e^(−µk))  | O((1 − (µ ∧ 1/n))^k) | O(1/(µk))

⊖ A gradient descent update costs n times as much as an SGD update. Can we get the best of both worlds?

24

slide-73
SLIDE 73

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√k)
Stgly-Cvx    | O(1/(µk))   | O(e^(−µk))  | O((1 − (µ ∧ 1/n))^k) | O(1/(µk))

⊖ A gradient descent update costs n times as much as an SGD update.

24

slide-74
SLIDE 74

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√k)
Stgly-Cvx    | O(1/(µk))   | O(e^(−µk))  | O((1 − (µ ∧ 1/n))^k) | O(1/(µk))

⊖ A gradient descent update costs n times as much as an SGD update. Can we get the best of both worlds?

24

slide-75
SLIDE 75

Methods for finite sum minimization

◮ GD: at step $k$, use $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta_k)$

25

slide-76
SLIDE 76

Methods for finite sum minimization

◮ GD: at step $k$, use $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta_k)$
◮ SGD: at step $k$, sample $i_k \sim \mathcal{U}[1; n]$, use $f'_{i_k}(\theta_k)$

25

slide-77
SLIDE 77

Methods for finite sum minimization

◮ GD: at step $k$, use $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta_k)$
◮ SGD: at step $k$, sample $i_k \sim \mathcal{U}[1; n]$, use $f'_{i_k}(\theta_k)$
◮ SAG: at step $k$,
  ◮ keep a "full gradient" $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta_{k_i})$, with $\theta_{k_i} \in \{\theta_1, \dots, \theta_k\}$

25

slide-78
SLIDE 78

Methods for finite sum minimization

◮ GD: at step $k$, use $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta_k)$
◮ SGD: at step $k$, sample $i_k \sim \mathcal{U}[1; n]$, use $f'_{i_k}(\theta_k)$
◮ SAG: at step $k$,
  ◮ keep a "full gradient" $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta_{k_i})$, with $\theta_{k_i} \in \{\theta_1, \dots, \theta_k\}$
  ◮ sample $i_k \sim \mathcal{U}[1; n]$, use

$$\frac{1}{n} \left( \sum_{i=1}^{n} f'_i(\theta_{k_i}) - f'_{i_k}(\theta_{k_{i_k}}) + f'_{i_k}(\theta_k) \right),$$

25

slide-79
SLIDE 79

Methods for finite sum minimization

◮ GD: at step $k$, use $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta_k)$
◮ SGD: at step $k$, sample $i_k \sim \mathcal{U}[1; n]$, use $f'_{i_k}(\theta_k)$
◮ SAG: at step $k$,
  ◮ keep a "full gradient" $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta_{k_i})$, with $\theta_{k_i} \in \{\theta_1, \dots, \theta_k\}$
  ◮ sample $i_k \sim \mathcal{U}[1; n]$, use

$$\frac{1}{n} \left( \sum_{i=1}^{n} f'_i(\theta_{k_i}) - f'_{i_k}(\theta_{k_{i_k}}) + f'_{i_k}(\theta_k) \right),$$

⊕ update costs the same as SGD
⊖ needs to store all gradients $f'_i(\theta_{k_i})$ at "points in the past" (see the sketch after this list)

Some references:

◮ SAG Schmidt et al. (2013), SAGA Defazio et al. (2014a)
◮ SVRG Johnson and Zhang (2013) (reduces memory cost but 2 epochs...)
◮ FINITO Defazio et al. (2014b)
◮ S2GD Konečný and Richtárik (2013)... And many others... See for example Niao He's lecture notes for a nice overview.
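A compact sketch of a SAG-style update for least squares, following the generic recipe above rather than any paper's exact pseudocode; the data, the zero initialization of stored gradients, and the step-size heuristic are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 20
Phi = rng.normal(size=(n, d))
y = Phi @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def grad_i(theta, i):
    return (Phi[i] @ theta - y[i]) * Phi[i]    # gradient of the i-th least-squares loss

theta = np.zeros(d)
stored = np.zeros((n, d))                      # g_i = f'_i(theta_{k_i}), one per observation
grad_sum = stored.sum(axis=0)                  # running sum of stored gradients
L_max = np.max(np.sum(Phi ** 2, axis=1))       # largest per-example smoothness constant
gamma = 1.0 / (16.0 * L_max)                   # conservative step size (assumption)

for k in range(50_000):
    i = rng.integers(n)
    new_g = grad_i(theta, i)
    grad_sum += new_g - stored[i]              # replace the old stored gradient in the sum
    stored[i] = new_g
    theta -= gamma * grad_sum / n              # step with the averaged estimate

print("empirical risk:", 0.5 * np.mean((Phi @ theta - y) ** 2))
```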

25

slide-80
SLIDE 80

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth objective $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√k)

26
slide-81
SLIDE 81

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth objective $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√k)
Stgly-Cvx    | O(1/(µk))   | O(e^(−µk))  | O((1 − (µ ∧ 1/n))^k) | O(1/(µk))

[Figure: comparison of GD, SGD, and SAG (from Schmidt et al. (2013)).]

26

slide-82
SLIDE 82

Take home

Stochastic algorithms for Empirical Risk Minimization.

◮ Rates depend on the regularity of the function.
◮ Several algorithms to optimize the empirical risk; the most efficient ones are stochastic and rely on the finite-sum structure.

slide-83
SLIDE 83

Take home

Stochastic algorithms for Empirical Risk Minimization.

◮ Rates depend on the regularity of the function.
◮ Several algorithms to optimize the empirical risk; the most efficient ones are stochastic and rely on the finite-sum structure.
◮ Stochastic algorithms to optimize a deterministic function.

27

slide-84
SLIDE 84

What about generalization risk

Initial problem: generalization guarantees.

◮ Uniform upper bound $\sup_\theta \big|\hat{R}(\theta) - R(\theta)\big|$ (empirical process theory).
◮ More precise: localized complexities (Bartlett et al., 2002), stability (Bousquet and Elisseeff, 2002).

28

slide-85
SLIDE 85

What about generalization risk

Initial problem: generalization guarantees.

◮ Uniform upper bound $\sup_\theta \big|\hat{R}(\theta) - R(\theta)\big|$ (empirical process theory).
◮ More precise: localized complexities (Bartlett et al., 2002), stability (Bousquet and Elisseeff, 2002).

Problems for ERM:

◮ Choose the regularization (overfitting risk)
◮ How many iterations (i.e., passes on the data)?
◮ Generalization guarantees are generally of order $O(1/\sqrt{n})$, no need to be more precise

28

slide-86
SLIDE 86

What about generalization risk

Initial problem: generalization guarantees.

◮ Uniform upper bound $\sup_\theta \big|\hat{R}(\theta) - R(\theta)\big|$ (empirical process theory).
◮ More precise: localized complexities (Bartlett et al., 2002), stability (Bousquet and Elisseeff, 2002).

Problems for ERM:

◮ Choose the regularization (overfitting risk)
◮ How many iterations (i.e., passes on the data)?
◮ Generalization guarantees are generally of order $O(1/\sqrt{n})$, no need to be more precise

2 important insights:

1. No need to optimize below the statistical error,
2. Generalization risk is more important than empirical risk.

28

slide-87
SLIDE 87

What about generalization risk

Initial problem: generalization guarantees.

◮ Uniform upper bound $\sup_\theta \big|\hat{R}(\theta) - R(\theta)\big|$ (empirical process theory).
◮ More precise: localized complexities (Bartlett et al., 2002), stability (Bousquet and Elisseeff, 2002).

Problems for ERM:

◮ Choose the regularization (overfitting risk)
◮ How many iterations (i.e., passes on the data)?
◮ Generalization guarantees are generally of order $O(1/\sqrt{n})$, no need to be more precise

2 important insights:

1. No need to optimize below the statistical error,
2. Generalization risk is more important than empirical risk.

SGD can be used to minimize the generalization risk.

28

slide-88
SLIDE 88

SGD for the generalization risk: $f = R$

SGD: key assumption $\mathbb{E}[f'_n(\theta_{n-1}) \mid \mathcal{F}_{n-1}] = f'(\theta_{n-1})$.

29

slide-89
SLIDE 89

SGD for the generalization risk: $f = R$

SGD: key assumption $\mathbb{E}[f'_n(\theta_{n-1}) \mid \mathcal{F}_{n-1}] = f'(\theta_{n-1})$.

For the risk $R(\theta) = \mathbb{E}_\rho\big[\ell(Y, \langle \theta, \Phi(X) \rangle)\big]$:

◮ At step $0 < k \le n$, use a new point, independent of $\theta_{k-1}$: $f'_k(\theta_{k-1}) = \ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle)$

29

slide-90
SLIDE 90

SGD for the generalization risk: $f = R$

SGD: key assumption $\mathbb{E}[f'_n(\theta_{n-1}) \mid \mathcal{F}_{n-1}] = f'(\theta_{n-1})$.

For the risk $R(\theta) = \mathbb{E}_\rho\big[\ell(Y, \langle \theta, \Phi(X) \rangle)\big]$:

◮ At step $0 < k \le n$, use a new point, independent of $\theta_{k-1}$: $f'_k(\theta_{k-1}) = \ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle)$
◮ For $0 \le k \le n$, $\mathcal{F}_k = \sigma\big((x_i, y_i)_{1 \le i \le k}\big)$.

$$\mathbb{E}[f'_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = \mathbb{E}_\rho\big[\ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle) \mid \mathcal{F}_{k-1}\big]$$

29

slide-91
SLIDE 91

SGD for the generalization risk: $f = R$

SGD: key assumption $\mathbb{E}[f'_n(\theta_{n-1}) \mid \mathcal{F}_{n-1}] = f'(\theta_{n-1})$.

For the risk $R(\theta) = \mathbb{E}_\rho\big[\ell(Y, \langle \theta, \Phi(X) \rangle)\big]$:

◮ At step $0 < k \le n$, use a new point, independent of $\theta_{k-1}$: $f'_k(\theta_{k-1}) = \ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle)$
◮ For $0 \le k \le n$, $\mathcal{F}_k = \sigma\big((x_i, y_i)_{1 \le i \le k}\big)$.

$$\mathbb{E}[f'_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = \mathbb{E}_\rho\big[\ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle) \mid \mathcal{F}_{k-1}\big] = \mathbb{E}_\rho\big[\ell'(Y, \langle \theta_{k-1}, \Phi(X) \rangle)\big] = R'(\theta_{k-1})$$

29

slide-92
SLIDE 92

SGD for the generalization risk: $f = R$

SGD: key assumption $\mathbb{E}[f'_n(\theta_{n-1}) \mid \mathcal{F}_{n-1}] = f'(\theta_{n-1})$.

For the risk $R(\theta) = \mathbb{E}_\rho\big[\ell(Y, \langle \theta, \Phi(X) \rangle)\big]$:

◮ At step $0 < k \le n$, use a new point, independent of $\theta_{k-1}$: $f'_k(\theta_{k-1}) = \ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle)$
◮ For $0 \le k \le n$, $\mathcal{F}_k = \sigma\big((x_i, y_i)_{1 \le i \le k}\big)$.

$$\mathbb{E}[f'_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = \mathbb{E}_\rho\big[\ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle) \mid \mathcal{F}_{k-1}\big] = \mathbb{E}_\rho\big[\ell'(Y, \langle \theta_{k-1}, \Phi(X) \rangle)\big] = R'(\theta_{k-1})$$

◮ Single pass through the data, running time $= O(nd)$,
◮ "Automatic" regularization.
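A minimal sketch of this single-pass scheme: each observation is used exactly once, so the iterate only ever sees fresh, independent samples. The streaming least-squares data, the decaying step size and the averaging are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50_000, 20
theta_star = rng.normal(size=d)

theta = np.zeros(d)
theta_bar = theta.copy()
gamma0 = 0.02
for k in range(1, n + 1):
    x = rng.normal(size=d)                       # fresh sample (x_k, y_k) ~ rho
    y = x @ theta_star + 0.1 * rng.normal()
    grad = (x @ theta - y) * x                   # unbiased estimate of R'(theta)
    theta -= (gamma0 / np.sqrt(k)) * grad        # one SGD step per observation, single pass
    theta_bar += (theta - theta_bar) / (k + 1)   # Polyak-Ruppert average

print("||theta_bar - theta_star|| =", np.linalg.norm(theta_bar - theta_star))
```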

29

slide-93
SLIDE 93

SGD for the generalization risk: $f = R$

SGD: key assumption $\mathbb{E}[f'_n(\theta_{n-1}) \mid \mathcal{F}_{n-1}] = f'(\theta_{n-1})$.

For the risk $R(\theta) = \mathbb{E}_\rho\big[\ell(Y, \langle \theta, \Phi(X) \rangle)\big]$:

◮ At step $0 < k \le n$, use a new point, independent of $\theta_{k-1}$: $f'_k(\theta_{k-1}) = \ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle)$
◮ For $0 \le k \le n$, $\mathcal{F}_k = \sigma\big((x_i, y_i)_{1 \le i \le k}\big)$.

$$\mathbb{E}[f'_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = \mathbb{E}_\rho\big[\ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle) \mid \mathcal{F}_{k-1}\big] = \mathbb{E}_\rho\big[\ell'(Y, \langle \theta_{k-1}, \Phi(X) \rangle)\big] = R'(\theta_{k-1})$$

◮ Single pass through the data, running time $= O(nd)$,
◮ "Automatic" regularization.

29

slide-94
SLIDE 94

SGD for the generalization risk: f = R

ERM minimization: several passes ($0 \le k$); $(x_i, y_i)$ is $\mathcal{F}_t$-measurable for any $t$.

Generalization-risk minimization: one pass ($0 \le k \le n$); $(x_i, y_i)$ is $\mathcal{F}_t$-measurable only for $t \ge i$.

30

slide-95
SLIDE 95

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth objective $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√k)

31
slide-96
SLIDE 96

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth objective $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
             | 0 ≤ k       | 0 ≤ k       | 0 ≤ k                | 0 ≤ k ≤ n
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√k)
Stgly-Cvx    | O(1/(µk))   | O(e^(−µk))  | O((1 − (µ ∧ 1/n))^k) | O(1/(µk))

Lower bounds (annotated α, β, γ, δ on the original slide); δ: information-theoretic lower bound, statistical theory (Tsybakov, 2003). Gradient does not even exist.

31

slide-97
SLIDE 97

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth objective $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
             | 0 ≤ k       | 0 ≤ k       | 0 ≤ k                | 0 ≤ k ≤ n
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√n)
Stgly-Cvx    | O(1/(µk))   | O(e^(−µk))  | O((1 − (µ ∧ 1/n))^k) | O(1/(µn))

31

slide-98
SLIDE 98

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth objective $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
             | 0 ≤ k       | 0 ≤ k       | 0 ≤ k                | 0 ≤ k ≤ n
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√n)
Stgly-Cvx    | O(1/(µk))   | O(e^(−µk))  | O((1 − (µ ∧ 1/n))^k) | O(1/(µn))

Gradient is unknown.

31

slide-99
SLIDE 99

Least Mean Squares: rate independent of µ

Least-squares: $R(\theta) = \frac{1}{2} \mathbb{E}\big[(Y - \langle \Phi(X), \theta \rangle)^2\big]$. Analysis for averaging and constant step-size $\gamma = 1/(4r^2)$ (Bach and Moulines, 2013):

◮ Assume $\|\Phi(x_n)\| \le r$ and $|y_n - \langle \Phi(x_n), \theta_* \rangle| \le \sigma$
◮ No assumption regarding the lowest eigenvalues of the Hessian

$$\mathbb{E}\, R(\bar{\theta}_n) - R(\theta_*) \le \frac{4\sigma^2 d}{n} + \frac{\|\theta_0 - \theta_*\|^2}{\gamma n}$$
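A minimal simulation of this setting: constant step size of order $1/(4r^2)$ with Polyak-Ruppert averaging on synthetic least-squares data. The constants (and the crude bound on $r^2$) are placeholders, not the theorem's exact conditions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 10
theta_star = rng.normal(size=d)
sigma = 0.5

r2 = d                                          # crude placeholder for E||Phi(X)||^2
gamma = 1.0 / (4.0 * r2)                        # constant "large" step size

theta = np.zeros(d)
theta_bar = theta.copy()
for k in range(1, n + 1):
    x = rng.normal(size=d)
    y = x @ theta_star + sigma * rng.normal()
    theta -= gamma * (x @ theta - y) * x        # constant-step least-mean-squares update
    theta_bar += (theta - theta_bar) / (k + 1)  # averaging

excess = 0.5 * np.sum((theta_bar - theta_star) ** 2)   # excess risk for identity covariance
bound = 4 * sigma**2 * d / n + np.sum(theta_star**2) / (gamma * n)
print(f"excess risk ~= {excess:.5f}, bound from the slide ~= {bound:.5f}")
```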

32

slide-100
SLIDE 100

Least Mean Squares: rate independent of µ

Least-squares: $R(\theta) = \frac{1}{2} \mathbb{E}\big[(Y - \langle \Phi(X), \theta \rangle)^2\big]$. Analysis for averaging and constant step-size $\gamma = 1/(4r^2)$ (Bach and Moulines, 2013):

◮ Assume $\|\Phi(x_n)\| \le r$ and $|y_n - \langle \Phi(x_n), \theta_* \rangle| \le \sigma$
◮ No assumption regarding the lowest eigenvalues of the Hessian

$$\mathbb{E}\, R(\bar{\theta}_n) - R(\theta_*) \le \frac{4\sigma^2 d}{n} + \frac{\|\theta_0 - \theta_*\|^2}{\gamma n}$$

◮ Matches the statistical lower bound (Tsybakov, 2003).
◮ Optimal rate with "large" step sizes.

32

slide-101
SLIDE 101

Take home

◮ SGD can be used to minimize the true risk directly
◮ Stochastic algorithm to minimize an unknown function

Stochastic approximation beyond least squares?

slide-102
SLIDE 102

Take home

◮ SGD can be used to minimize the true risk directly
◮ Stochastic algorithm to minimize an unknown function
◮ No regularization needed, only one pass

Stochastic approximation beyond least squares?

slide-103
SLIDE 103

Take home

◮ SGD can be used to minimize the true risk directly
◮ Stochastic algorithm to minimize an unknown function
◮ No regularization needed, only one pass
◮ For least squares, with a constant step size, optimal rate.

Stochastic approximation beyond least squares?

33

slide-104
SLIDE 104

Further references

Many stochastic algorithms are not covered in this talk (coordinate descent, online Newton, composite optimization, non-convex learning)...

◮ Good introduction: Francis Bach's lecture notes at Orsay
◮ Book: Convex Optimization: Algorithms and Complexity, Sébastien Bubeck

34

slide-105
SLIDE 105

Agarwal, A., Bartlett, P. L., Ravikumar, P., and Wainwright, M. J. (2012). Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249.

Agarwal, A. and Bottou, L. (2014). A lower bound for the optimization of finite sums. ArXiv e-prints.

Arjevani, Y. and Shamir, O. (2016). Dimension-free iteration complexity of finite sum optimization problems. In Advances in Neural Information Processing Systems 29, pages 3540–3548. Curran Associates, Inc.

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Advances in Neural Information Processing Systems (NIPS).

Bartlett, P. L., Bousquet, O., and Mendelson, S. (2002). Localized Rademacher complexities, pages 44–58. Springer Berlin Heidelberg, Berlin, Heidelberg.

Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526.

Defazio, A., Bach, F., and Lacoste-Julien, S. (2014a). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654.

Defazio, A., Domke, J., and Caetano, T. (2014b). Finito: A faster, permutable incremental gradient method for big data problems. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1125–1133.

Fabian, V. (1968). On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, pages 1327–1332.

34

slide-106
SLIDE 106

Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323.

Konečný, J. and Richtárik, P. (2013). Semi-stochastic gradient descent methods. arXiv preprint arXiv:1312.1666.

Nemirovsky, A. S. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, John Wiley & Sons, Inc., New York.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization. Springer.

Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407.

Robbins, H. and Siegmund, D. (1985). A convergence theorem for non negative almost supermartingales and some applications. In Herbert Robbins Selected Papers, pages 111–135. Springer.

Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering.

Schmidt, M., Le Roux, N., and Bach, F. (2013). Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112.

Tsybakov, A. B. (2003). Optimal rates of aggregation. In Proceedings of the Annual Conference on Computational Learning Theory.

34