Introduction to Stochastic Optimization
P. Carpentier — Master MMMEF, Cours MNOS 2014-2015
January 13, 2015


SLIDE 1

General Introduction to Stochastic Optimization Stochastic Gradient Method Overview

Introduction to Stochastic Optimization

January 13, 2015

P. Carpentier

Master MMMEF — Cours MNOS 2014-2015 3 / 265

SLIDE 2

Lecture Outline

1. General Introduction to Stochastic Optimization
   - Motivation and Goals
   - Reminders in the Deterministic Framework
   - Switching to the Stochastic Case

2. Stochastic Gradient Method Overview
   - Stochastic Gradient Algorithm
   - Connexion with Stochastic Approximation
   - Asymptotic Efficiency and Averaging
   - Practical Considerations

SLIDE 5

Goals of this Course

General objective: present numerical methods (convergence results, discretization schemes, algorithms...) in order to solve optimization problems in a stochastic framework.

Specific objective: be able to deal with large-scale problems for which standard methods (dynamic programming) are no longer effective because of the curse of dimensionality.

Problems under Consideration

- Open-loop problems: decisions do not depend on specific observations of the uncertainties.
- Closed-loop problems: available observations reveal some information, and decisions depend on these observations. New concept in stochastic optimization: the information structure, that is, the amount of information available to the decision maker.

SLIDE 6

Expected Difficulties

Solving stochastic optimization problems is not only a matter of optimizing a criterion under conventional constraints. Issues are:
- how to compute (conditional) expectations?
- how to deal with (probability) constraints?
- how to properly handle informational constraints?

Examples

- Open-loop problem: take a decision facing an uncertain future (investment problem).
- Recourse problem: take a first decision, then optimize its consequences (investment-operating problem).
- Multistage problems: a decision has to be taken at each time step (management, planning).

SLIDE 8

Deterministic Optimization Problems

General Problem:

  min_{u ∈ Uad ⊂ U}  J(u)                                  (1a)
  subject to  Θ(u) ∈ −C ⊂ V .                              (1b)

Dynamic Problem:

  min_{(u0,...,uT−1, x0,...,xT)}  Σ_{t=0}^{T−1} Lt(xt, ut) + K(xT)   (2a)
  subject to  x0 = xini given ,
              xt+1 = ft(xt, ut) ,  t = 0, ..., T−1 .       (2b)

SLIDE 10

Extension of the General Problem – Open-Loop Case (1)

Consider Problem (1) without the explicit constraint Θ, and suppose that J is in fact the expectation of a function j depending on a random variable W defined on a probability space (Ω, A, P) and valued in (W, W):

  J(u) = E[ j(u, W) ] .

Then the optimization problem writes

  min_{u ∈ Uad}  E[ j(u, W) ] .                            (3)

The decision u is a deterministic variable, which depends only on the probability law of W (and not on on-line observations of W).

Main difficulty: computation of the expectation.

SLIDE 11

Extension of the General Problem – Open-Loop Case (2)

Solution using Exact Quadrature

  J(u) = E[ j(u, W) ] ,   ∇J(u) = E[ ∇u j(u, W) ] .

Projected gradient algorithm:

  u(k+1) = projUad( u(k) − ǫ ∇J(u(k)) ) .

Sample Average Approximation (SAA)

Obtain a realization (w(1), ..., w(k)) of a k-sample of W and minimize the Monte Carlo approximation of J:

  u(k) ∈ arg min_{u ∈ Uad}  (1/k) Σ_{l=1}^{k} j(u, w(l)) .

Note that u(k) depends on the realization (w(1), ..., w(k))!
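The SAA recipe can be sketched in a few lines. Here the integrand j(u, w) = (u − w)²/2 and the Gaussian law of W are assumptions made for this illustration; with this j, the SAA problem has the sample mean as closed-form minimizer, which makes the sample-dependence of u(k) easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)

def saa_minimizer(sample):
    # For j(u, w) = (u - w)^2 / 2, the SAA problem
    #   min_u (1/k) * sum_l j(u, w_l)
    # is solved in closed form by the arithmetic mean of the sample.
    return float(np.mean(sample))

w = rng.normal(loc=3.0, scale=1.0, size=1000)   # k-sample of W ~ N(3, 1) (assumed law)
u_k = saa_minimizer(w)

# u_k is random: it depends on the realization (w(1), ..., w(k)).
print(u_k)  # close to E[W] = 3 for large k
```

Drawing a second, independent sample gives a different u_k: the SAA minimizer is itself a random variable.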

SLIDE 12

Extension of the General Problem – Open-Loop Case (3)

Stochastic Gradient Method

Underlying ideas:
- use an "easily computable" approximation of ∇J based on realizations (w(1), ..., w(k), ...) of samples of W,
- incorporate the realizations one by one into the algorithm.

These considerations lead to the following algorithm:

  u(k+1) = projUad( u(k) − ǫ(k) ∇u j(u(k), w(k+1)) ) .

Iterations of the gradient algorithm are used a) to move towards the solution and b) to refine the Monte Carlo sampling process. This is the topic of the first part of the course.

SLIDE 13

Extension of the General Problem – Closed-Loop Case (1)

Consider again Problem (3), and assume now that the control is in fact a random variable U defined on the probability space (Ω, A, P) and valued in (U, U):²

  J(U) = E[ j(U, W) ] .

Denote by F the σ-field generated by W. The (interesting part of the) information available to the decision maker is a piece of the information revealed by the noise W, and thus is represented by a σ-field G included in F. Then the optimization problem writes

  min_{U ⪯ G}  E[ j(U, W) ] ,

where U ⪯ G means that U is measurable w.r.t. the σ-field G.

² There is a tricky point in the notations here...

SLIDE 14

Extension of the General Problem – Closed-Loop Case (2)

  min_{U ⪯ G}  E[ j(U, W) ] .

Examples.

- G = {∅, Ω}: this corresponds to the open-loop case:
    min_{u ∈ U}  E[ j(u, W) ] .
- G = σ(W): we have, by the interchange theorem:
    min_{u ∈ U}  j(u, w) ,  ∀w ∈ W .
- G ⊂ σ(W): the problem is equivalent to
    E[ min_{u ∈ U} E[ j(u, W) | G ] ] .
SLIDE 15

Extension of the General Problem – Closed-Loop Case (3)

Generally, the σ-field G is generated by an observation, that is, a random variable Y: G = σ(Y). Then the information constraint writes U ⪯ Y, and the optimization problem is:

  min_{U ⪯ Y}  E[ j(U, W) ] .                              (4)

The information constraint may be viewed
- from the algebraic point of view: the constraint is expressed in terms of σ-fields, that is, σ(U) ⊂ σ(Y);
- from the functional point of view: using Doob's theorem, U is expressed as a function of Y, that is, U = ϕ(Y).

Main difficulty: taking the information constraint into account.

SLIDE 16

Extension of the Dynamic Problem (1)

The natural stochastic extension of Problem (2) consists in adding a perturbation Wt at each time step t:

  min_{(U0,...,UT−1, X0,...,XT)}  E[ Σ_{t=0}^{T−1} Lt(Xt, Ut, Wt+1) + K(XT) ]   (5a)

  subject to  X0 = f−1(W0) ,
              Xt+1 = ft(Xt, Ut, Wt+1) ,  t = 0, ..., T−1 .   (5b)

We denote by Ft the σ-field generated by the noises prior to time t:

  Ft = σ(W0, ..., Wt) ,  t = 0, ..., T .

Nonanticipativity: Ft is the maximal information available at time t.

SLIDE 17

Extension of the Dynamic Problem (2)

Information Structure. A new observation becomes available at time t:

  Zt = ht(Xt, Wt) ,  t = 0, ..., T−1 .

- Zt = Wt: observation of the noise.
- Zt = Xt: observation of the state.

The information at time t is a function of past observations:

  Yt = Ct(Z0, ..., Zt) ,  t = 0, ..., T−1 .

- Yt = Zt: memoryless information.
- Yt = (Z0, ..., Zt): perfect memory.

Information Constraints: Ut ⪯ Yt, ∀t.

- Functional approach: Ut = ϕt(Yt).
- Algebraic approach: σ(Ut) ⊂ σ(Yt).
SLIDE 18

Extension of the Dynamic Problem (3)

Functional Approach: Stochastic Optimal Control

Assumptions are made on the noise process {Wt}t=0,...,T.

- Markovian case: Zt = Xt / Yt = Zt. The solution may be computed by the Dynamic Programming approach, developed on the state Xt.
- Classical case: Zt = ht(Xt, Wt) / Yt = (Z0, ..., Zt). The Dynamic Programming approach is still available, the state being the probability law of Xt rather than Xt itself.
- General case: Zt = ht(Xt, Wt) / Yt = Ct(Z0, ..., Zt). We are usually not able to solve the optimality conditions (dual effect, Witsenhausen counterexample). Curse of dimensionality. Difficulties in using decomposition/coordination methods.

SLIDE 19

Extension of the Dynamic Problem (4)

Algebraic Approach: Stochastic Programming

Rather than looking for the solution of the problem as functions depending on the information (the Dynamic Programming point of view), we try to obtain the solution as random variables satisfying the information constraints:

  σ(Ut) ⊂ σ(Yt) ,  t = 0, ..., T−1 .

- First issue: characterize the classes of problems that can be solved by this approach. Intuitively, the problem is much more intricate if the information Y depends on the decision U...
- Second issue: obtain a finite approximation of the problem, and more specifically discretize the information constraints.

This is the topic of the second part of the course.

SLIDE 22

Stochastic Gradient (SG) Algorithm

Standard Stochastic Gradient Algorithm

  min_{u ∈ Uad ⊂ U}  E[ j(u, W) ] .                        (6)

1. Let u(0) ∈ Uad and choose a positive real sequence {ǫ(k)}k∈N.
2. At iteration k+1, draw a realization w(k+1) of the r.v. W.
3. Compute the gradient of j and update u(k+1) by the formula:
     u(k+1) = projUad( u(k) − ǫ(k) ∇u j(u(k), w(k+1)) ) .
4. Set k = k + 1 and go to step 2.

Note that (w(0), ..., w(k), ...) is a realization of an ∞-sample of W, which is what makes the numerical implementation of the stochastic gradient method possible.
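The four steps above can be sketched as follows, on an assumed toy problem: the cost j(u, w) = (u − w)²/2, the law W ~ N(4, 1), and the admissible set Uad = [0, 10] are all choices made for this illustration, not taken from the course.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_j(u, w):
    # gradient of j(u, w) = (u - w)^2 / 2 with respect to u
    return u - w

def proj(u, lo=0.0, hi=10.0):
    # projection onto the admissible set Uad = [lo, hi]
    return min(max(u, lo), hi)

u = 0.0                                   # step 1: u(0) in Uad
for k in range(5000):
    eps = 1.0 / (k + 1)                   # sigma-sequence of step sizes
    w = rng.normal(4.0, 1.0)              # step 2: draw w(k+1)
    u = proj(u - eps * grad_j(u, w))      # step 3: gradient step + projection

print(u)  # close to the minimizer u# = E[W] = 4
```

With this particular j, each iteration moves u towards the latest sample, and the decreasing steps make the iterate settle on the expectation.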

SLIDE 23

Probabilistic Considerations (1)

In order to study the convergence of the algorithm, it is necessary to cast it in an adequate probabilistic framework:

  U(k+1) = projUad( U(k) − ǫ(k) ∇u j(U(k), W(k+1)) ) ,

where {W(k)}k∈N is an infinite-dimensional sample of W.³ This is an iterative relation involving random variables; several notions of convergence are available:
- convergence in law,
- convergence in probability,
- convergence in Lp norm,
- almost sure convergence (the "intuitive" one).

³ Note that (Ω, A, P) has to be "big enough" to support such a sample.

SLIDE 24

Probabilistic Considerations (2)

An iteration of the algorithm is represented by the general relation:

  U(k+1) = R(k)( U(k), W(k+1) ) .

Let F(k) be the σ-field generated by (W(1), ..., W(k)). The random variable U(k) is F(k)-measurable for all k:

  E[ U(k) | F(k) ] = U(k) .

The conditional expectation of U(k+1) w.r.t. F(k) merely consists of a standard expectation:

  E[ U(k+1) | F(k) ](ω) = ∫ R(k)( U(k)(ω), W(ω′) ) dP(ω′) .
SLIDE 25

Example: Estimation of an Expectation (1)

Let W be a real-valued random variable defined on (Ω, A, P), and suppose we want to compute an estimate of its expectation

  E(W) = ∫ W(ω) dP(ω) .

Monte Carlo method: obtain a k-sample (W(1), ..., W(k)) of W and compute the associated arithmetic mean:

  U(k) = (1/k) Σ_{l=1}^{k} W(l) .

By the Strong Law of Large Numbers (SLLN), the sequence of random variables {U(k)}k∈N almost surely converges to E(W).

SLIDE 26

Example: Estimation of an Expectation (2)

It is easy to show that

  U(k+1) = U(k) − (1/(k+1)) ( U(k) − W(k+1) ) .

Using the notations ǫ(k) = 1/(k+1) and j(u, w) = (u − w)²/2, the last expression of U(k+1) writes

  U(k+1) = U(k) − ǫ(k) ∇u j(U(k), W(k+1)) ,

which corresponds to the stochastic gradient algorithm applied to:⁴

  min_{u ∈ R}  (1/2) E[ (u − W)² ] .

⁴ Recall that E(W) is the value which minimizes the dispersion of W.
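The identity above is easy to check numerically: iterating the update with ǫ(k) = 1/(k+1) reproduces the arithmetic mean exactly, up to round-off. The law of W used below is an arbitrary choice made for the check.

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.normal(1.5, 2.0, size=2000)   # realizations of W (assumed law)

u = 0.0
for k, wk in enumerate(w):
    # stochastic gradient step with eps(k) = 1/(k+1):
    # U(k+1) = U(k) - (U(k) - W(k+1)) / (k+1)
    u = u - (u - wk) / (k + 1)

print(abs(u - float(np.mean(w))))     # essentially zero: the iterate IS the running mean
```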
SLIDE 27

Example: Estimation of an Expectation (3)

This example highlights several features of the stochastic gradient method.

- The step size ǫ(k) = 1/(k+1) goes to zero as k goes to +∞. Note however that ǫ(k) goes to zero "not too fast", that is,
    Σ_{k∈N} ǫ(k) = +∞ .
- It is reasonable to expect an almost sure convergence result for the stochastic gradient algorithm (rather than a weaker notion such as convergence in distribution or convergence in probability).
- As the Central Limit Theorem (CLT) applies to this case, we may expect a similar result for the rate of convergence of the stochastic gradient algorithm.

SLIDE 29

Stochastic Approximation (SA) Framework

A classical problem in Stochastic Approximation is to determine a zero of a function h : U → U, with U = Rⁿ, when the observation of h(u) is perturbed by an additive random variable ξ. The standard SA algorithm consists of the following formula:

  U(k+1) = U(k) + ǫ(k) ( h(U(k)) + ξ(k+1) ) ,

a random process {ξ(k)}k∈N and a filtration {F(k)}k∈N being given.

Link with the stochastic gradient algorithm:

  h(u) = −∇J(u) ,   ξ(k+1) = ∇J(U(k)) − ∇u j(U(k), W(k+1)) .

Finding u♯ s.t. h(u♯) = 0 is equivalent to solving ∇J(u♯) = 0.

SLIDE 30

Convergence Theorem (SA) (1)

Assumptions

1. The random variable U(0) is F(0)-measurable.
2. The mapping h : U → U is continuous and such that
     ∃ u♯ ∈ Rⁿ, h(u♯) = 0  and  ⟨h(u), u − u♯⟩ < 0, ∀u ≠ u♯ ;
     ∃ a > 0, ∀u ∈ Rⁿ, ‖h(u)‖² ≤ a (1 + ‖u‖²) .
3. The random variable ξ(k) is F(k)-measurable for all k, and
     E[ ξ(k+1) | F(k) ] = 0 ,
     ∃ d > 0, E[ ‖ξ(k+1)‖² | F(k) ] ≤ d (1 + ‖U(k)‖²) .
4. The sequence {ǫ(k)}k∈N is a σ-sequence, that is,
     Σ_{k∈N} ǫ(k) = +∞ ,  Σ_{k∈N} (ǫ(k))² < +∞ .
SLIDE 31

Convergence Theorem (SA) (2)

Robbins-Monro Theorem. Under the previous assumptions, the sequence {U(k)}k∈N of random variables generated by the SA algorithm almost surely converges to u♯.

For a proof, see [Duflo, 1997, §1.4].

This theorem can be extended to more general situations.
- A projection operator can be added:
    U(k+1) = projUad( U(k) + ǫ(k) ( h(U(k)) + ξ(k+1) ) ) .
- A "small" additional term R(k+1) can be added:⁵
    U(k+1) = U(k) + ǫ(k) ( h(U(k)) + ξ(k+1) + R(k+1) ) .

⁵ For example a bias on h(u), as considered in the Kiefer-Wolfowitz algorithm.
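A minimal sketch of the Robbins-Monro setting, under assumptions chosen for this demo only: h(u) = −(u − 2) satisfies the sign condition ⟨h(u), u − u♯⟩ < 0 and the linear growth bound, the noise ξ is centered Gaussian with bounded variance, and ǫ(k) = 1/(k+1) is a σ-sequence.

```python
import numpy as np

rng = np.random.default_rng(3)

def h(u):
    # assumed mapping with unique zero u# = 2
    return -(u - 2.0)

u = 10.0
for k in range(20000):
    eps = 1.0 / (k + 1)            # sigma-sequence: sum eps = inf, sum eps^2 < inf
    xi = rng.normal(0.0, 1.0)      # centered noise, E[xi | F(k)] = 0
    u = u + eps * (h(u) + xi)      # standard SA update

print(u)  # close to u# = 2
```

Only noisy evaluations h(U(k)) + ξ(k+1) are used; the algorithm never sees h itself, which is the point of the SA framework.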

SLIDE 32

Rate of Convergence (SA) (1)

We recall a result about the asymptotic normality of the sequence {U(k)} generated by the SA algorithm, together with an estimate of its rate of convergence. We first need to be more specific about the notion of σ-sequence.

Definition. A positive real sequence {ǫ(k)}k∈N is a σ(α, β, γ)-sequence if it is of the form

  ǫ(k) = α / (kᵞ + β) ,

with α > 0, β ≥ 0 and 1/2 < γ ≤ 1.

A consequence of this definition is that every σ(α, β, γ)-sequence is also a σ-sequence.

SLIDE 33

Rate of Convergence (SA) (2)

Assumptions

1. h is continuously differentiable and, in a neighborhood of u♯,
     h(u) = −H (u − u♯) + O(‖u − u♯‖²) ,
   where H is a symmetric positive-definite matrix.
2. The sequence { E[ ξ(k+1) (ξ(k+1))⊤ | F(k) ] }k∈N almost surely converges to a symmetric positive-definite matrix Γ.
3. ∃ δ > 0 such that sup_{k∈N} E[ ‖ξ(k+1)‖^{2+δ} | F(k) ] < +∞.
4. The sequence {ǫ(k)}k∈N is a σ(α, β, γ)-sequence.
5. The square matrix (H − λI) is positive-definite, with
     λ = 0 if γ < 1 ,   λ = 1/(2α) if γ = 1 .

SLIDE 34

Rate of Convergence (SA) (3)

We retain the assumptions ensuring almost sure convergence.

Central Limit Theorem. Under all previous assumptions, the sequence of random variables { (1/√ǫ(k)) (U(k) − u♯) }k∈N converges in law towards a centered Gaussian distribution with covariance matrix Σ, that is,

  (1/√ǫ(k)) (U(k) − u♯)  →D  N(0, Σ) ,

in which Σ is the solution of the so-called Lyapunov equation

  (H − λI) Σ + Σ (H − λI) = Γ .

For a proof, see [Duflo, 1996, Chapter 4].
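In the scalar case (n = 1) the Lyapunov equation can be solved in closed form, which makes the dependence of the asymptotic covariance on H, Γ and λ explicit. This worked special case is an addition for illustration, not a statement from the course:

```latex
(H - \lambda)\,\Sigma + \Sigma\,(H - \lambda) = \Gamma
\quad\Longrightarrow\quad
\Sigma = \frac{\Gamma}{2(H - \lambda)} , \qquad H > \lambda .
```

In particular, for γ = 1 (where λ = 1/(2α)) one sees directly why assumption 5 is needed: if H ≤ λ, the formula yields no positive solution Σ.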

SLIDE 35

Rate of Convergence (SA) (4)

- The result is valid only for unconstrained problems: Uad = U.
- The result can be rephrased as
    k^{γ/2} (U(k) − u♯)  →D  N(0, αΣ) ,
  so that β has in fact no influence on the convergence rate.
- The choice γ = 1 achieves the greatest convergence rate. We recover the rate 1/√k of a standard Monte Carlo estimator.
- If we refer back to Problem (6), where h = −∇J, we notice that H is the Hessian matrix of J at u♯:
    H = ∇²J(u♯) ,
  and that Γ is the covariance matrix of ∇u j evaluated at u♯:
    Γ = E[ ∇u j(u♯, W) (∇u j(u♯, W))⊤ ] .

SLIDE 37

Stochastic Newton Algorithm (1)

Here, the step sizes ǫ(k) are built using the optimal choice γ = 1, and the scalar gain α is replaced by a matrix gain A, where A is a symmetric positive-definite matrix. The algorithm becomes

  U(k+1) = U(k) − (1/(k + β)) A ∇u j(U(k), W(k+1)) ,

which in the Stochastic Approximation setting writes

  U(k+1) = U(k) + (1/(k + β)) ( A h(U(k)) + A ξ(k+1) ) .

The Central Limit Theorem is thus available, and we have

  √k (U(k) − u♯)  →D  N(0, ΣA) ,

where ΣA is the unique solution of

  (AH − I/2) ΣA + ΣA (HA − I/2) = AΓA .
SLIDE 38

Stochastic Newton Algorithm (2)

Let C_H be the set of symmetric positive-definite matrices A such that AH − I/2 is positive-definite.

Theorem. The choice A♯ = H⁻¹ minimizes the asymptotic covariance matrix ΣA over the set C_H. The minimal asymptotic covariance matrix is

  ΣA♯ = H⁻¹ Γ H⁻¹ .

Definition. A stochastic gradient algorithm is Newton-efficient if the sequence {U(k)}k∈N it generates has the same asymptotic convergence rate as the optimal Newton algorithm, namely

  √k (U(k) − u♯)  →D  N(0, H⁻¹ΓH⁻¹) .

SLIDE 39

Stochastic Gradient Algorithm with Averaging (1)

[Polyak, 1992] proposed to implement a Newton-efficient algorithm by incorporating an averaging stage into the standard algorithm. Assuming that the admissible set Uad is equal to the whole space U, the standard stochastic iteration

  U(k+1) = U(k) − ǫ(k) ∇u j(U(k), W(k+1))

is replaced by

  U(k+1) = U(k) − ǫ(k) ∇u j(U(k), W(k+1)) ,
  U_M(k+1) = (1/(k+1)) Σ_{l=1}^{k+1} U(l) .

An equivalent recursive form of the averaging stage is

  U_M(k+1) = U_M(k) + (1/(k+1)) ( U(k+1) − U_M(k) ) .
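The two-stage scheme above can be sketched on an assumed scalar problem min_u E[(u − W)²/2] with W ~ N(4, 1), so u♯ = E(W) = 4 (problem and parameter values are choices made for this illustration). The step sizes form a σ(α, β, γ)-sequence with γ = 2/3 < 1, and the averaging stage uses its recursive form.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta, gamma = 1.0, 10.0, 2.0 / 3.0   # assumed tuning

u, u_avg = 0.0, 0.0
for k in range(20000):
    eps = alpha / (k**gamma + beta)           # sigma(alpha, beta, gamma)-sequence
    w = rng.normal(4.0, 1.0)
    u = u - eps * (u - w)                     # standard stochastic iteration
    u_avg = u_avg + (u - u_avg) / (k + 1)     # recursive averaging stage

print(u_avg)  # averaged iterate, close to u# = 4
```

The raw iterate u keeps fluctuating at the scale of √ǫ(k); the averaged iterate u_avg is much smoother.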
SLIDE 40

Stochastic Gradient Algorithm with Averaging (2)

Theorem. Under the additional assumption that the σ(α, β, γ)-sequence {ǫ(k)}k∈N is such that γ < 1, the averaged stochastic gradient algorithm is Newton-efficient:

  √k (U_M(k) − u♯)  →D  N(0, H⁻¹ΓH⁻¹) .

For a proof, see [Duflo, 1996, Chapter 4].

According to the standard theorem, the convergence rate achieved by the sequence {U(k)}k∈N with γ < 1 is smaller than 1/√k and hence not optimal. The "nice" convergence properties are obtained for the averaged sequence {U_M(k)}k∈N.

SLIDE 42

A Toy Problem (Stochastic Gradient Toolbox)

Let us consider the following optimization problem:

  min_{u ∈ R¹⁰}  E[ (1/2) u⊤Au + W⊤u ] ,

A being a symmetric positive-definite matrix, and W being an R¹⁰-valued Gaussian random variable N(m, Γ). Its optimal solution u♯ = −A⁻¹m is estimated by Monte Carlo:

  Ū(k+1) = −(1/(k+1)) Σ_{l=1}^{k+1} A⁻¹ W(l) ,

and the standard stochastic gradient algorithm writes in that case:

  U(k+1) = U(k) − ǫ(k) ( A U(k) + W(k+1) ) .
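A sketch of this toy problem in Python. The particular A, m and Γ (taken as the identity here) are assumptions made for the demo, as is the σ(1, 100, 1)-sequence; the stochastic gradient iterate is compared with the exact solution u♯ = −A⁻¹m.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10
A = np.diag(np.linspace(1.0, 2.0, n))      # assumed symmetric positive-definite matrix
m = rng.normal(size=n)                     # assumed mean of W
u_star = -np.linalg.solve(A, m)            # exact solution u# = -A^{-1} m

# Standard stochastic gradient: U(k+1) = U(k) - eps(k) (A U(k) + W(k+1)).
u = np.zeros(n)
for k in range(20000):
    eps = 1.0 / (k + 100.0)                # sigma(1, 100, 1)-sequence
    w = m + rng.normal(size=n)             # W(k+1) ~ N(m, I), Gamma = I assumed
    u = u - eps * (A @ u + w)

print(np.linalg.norm(u - u_star))          # small for large k
```

Note that the gradient sample A u + w is unbiased: its conditional expectation is A u + m = ∇J(u).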

SLIDE 43

Tuning the Standard Algorithm (1)

Let ǫ(k) be a σ(α, β, γ)-sequence, that is, ǫ(k) = α/(kᵞ + β).

- The best convergence rate is reached for γ = 1.
- The coefficient α influences the asymptotic behavior. The covariance matrix αΣ grows as α goes to +∞, but too small a value of α may generate very small gradient steps. The choice of α corresponds to a trade-off between stability and precision.
- Finally, the coefficient β makes it possible to regulate the transient behavior of the algorithm. During the first iterations, ǫ(k) is approximately equal to α/β. If α/β is too small, the transient phase may be slow; on the contrary, too large a ratio may lead to a numerical burst.

SLIDE 44

Tuning the Standard Algorithm (α/β = 0.1) (2)

[Figure: iterates of the standard stochastic gradient algorithm over 5000 iterations, for α = 0.3, α = 1, α = 5 and α = 10 (with α/β = 0.1).]

SLIDE 45

Tuning the Averaged Algorithm (1)

Take ǫ(k) shaped as a σ(α, β, γ)-sequence with 1/2 < γ < 1. The averaged stochastic gradient algorithm writes, on our example,

  U(k+1) = U(k) − (α/(kᵞ + β)) ( A U(k) + W(k+1) ) ,
  U_M(k+1) = (1/(k+1)) Σ_{l=1}^{k+1} U(l) .

- The value γ = 2/3 is considered a good choice.
- The tuning of the parameters α and β is much easier than for the standard algorithm. Indeed, the problem of "too small" step sizes arising from a bad choice of α is less critical because the term k^{−γ} goes to zero more slowly.
- Of course, the ratio α/β must be chosen so that numerical bursts do not occur during the first iterations of the algorithm.
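The σ(α, β, γ)-sequences used by the two variants can be compared directly; the values of α and β below are arbitrary choices for the comparison.

```python
# Step-size schedules eps(k) = alpha / (k**gamma + beta): gamma = 1 for the
# standard algorithm, gamma = 2/3 for the averaged one.
def sigma_sequence(alpha, beta, gamma):
    def eps(k):
        return alpha / (k**gamma + beta)
    return eps

eps_standard = sigma_sequence(1.0, 10.0, 1.0)        # standard algorithm
eps_averaged = sigma_sequence(1.0, 10.0, 2.0 / 3.0)  # averaged algorithm

for k in (10, 1000, 100000):
    # for large k the gamma = 2/3 steps dominate the gamma = 1 steps,
    # which is why a slightly mis-tuned alpha is less harmful when averaging
    print(k, eps_standard(k), eps_averaged(k))
```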

SLIDE 46

Tuning the Averaged Algorithm (α/β = 0.1) (2)

[Figure: iterates of the averaged stochastic gradient algorithm over 5000 iterations, for α = 0.3, α = 1, α = 5 and α = 10 (with α/β = 0.1).]
