Statistics and learning: Monte Carlo Markov Chains (methods)



slide-1
SLIDE 1

Statistics and learning

Monte Carlo Markov Chains (methods)

Emmanuel Rachelson and Matthieu Vignes

ISAE SupAero

22nd March 2013


slide-2
SLIDE 2


Monte Carlo computation

Why, what?

◮ An old experiment that gave rise to the idea of Monte Carlo methods is that of "Buffon's needle": you throw a needle of length l on a flat surface ruled with parallel lines with spacing D (> l). Under ideal conditions, P(needle crosses one of the lines) = 2l/(πD). → Estimation of π thanks to a large number of thrown needles: π = lim_{n→∞} 2l/(P_n D), where P_n is the proportion of crosses in n such throws (see the sketch after this slide).

◮ The basic concept here is that of simulating random processes in order to help evaluate some quantities of interest.

◮ First intensive use during WW II in order to make good use of computing facilities (ENIAC): neutron random diffusion for atomic bomb design and the estimation of eigenvalues in the Schrödinger equation. Intensively developed by (statistical) physicists.

◮ Main interest: when no closed-form solution is tractable.
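A minimal Python sketch of the Buffon's needle estimator (not from the slides; names and parameters are illustrative). It samples the needle's centre-to-line distance and angle, counts crossings, and inverts P_n = 2l/(πD); note the small circularity that, for demo purposes, the angle is drawn using math.pi.

```python
import math
import random

def buffon_pi(n, l=1.0, D=2.0):
    """Estimate pi from n Buffon needle throws (requires l < D)."""
    crosses = 0
    for _ in range(n):
        # distance from the needle's centre to the nearest line
        y = random.uniform(0.0, D / 2.0)
        # acute angle between the needle and the lines
        # (demo shortcut: math.pi is used to sample the angle)
        theta = random.uniform(0.0, math.pi / 2.0)
        if y <= (l / 2.0) * math.sin(theta):
            crosses += 1
    p_n = crosses / n              # proportion of crossing throws
    return 2.0 * l / (p_n * D)     # pi = 2l / (P_n * D)

print(buffon_pi(1_000_000))        # ≈ 3.14 for large n
```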

slide-6
SLIDE 6


Typical problems

1. Integral computation:
I = ∫ h(x)f(x)dx
can be seen as an expectation E_f[h] if f is a probability density. If f is not a density, it can be rewritten as
∫ h(x) (f(x)/g(x)) g(x)dx = E_g[hf/g]
for any density g with Supp(f) ⊂ Supp(g).

2. Optimisation:
max_{x∈X} f(x) or argmax_{x∈X} f(x) (min can replace max).

slide-8
SLIDE 8


Need of Monte Carlo techniques: integration

◮ An essential part of many scientific problems: computation of
I = ∫_D f(x)dx.

◮ If we can draw iid random samples x^(1), ..., x^(n) from D, we can compute Î_n = (1/n) Σ_j f(x^(j)) (taking D of unit volume), and the LLN says lim_n Î_n = I with probability 1, while the CLT gives the convergence rate: √n(Î_n − I) → N(0, σ²), where σ² = Var(f(X)).

◮ In dimension 1, Riemann approximation gives an O(1/n) error rate, but deterministic methods fail when dimensionality increases.

◮ However, no free lunch theorem: in high-dimensional D, (i) σ², which reflects how far f is from uniform, can be quite large, and (ii) producing uniformly distributed samples in D is itself an issue.

◮ Again, importance sampling theoretically solves this, but the choice of the sampling distribution is a challenge. (A plain-sampling sketch follows this slide.)
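To make the Î_n estimator concrete, here is a minimal sketch (not from the slides; names are illustrative) over the unit hypercube [0,1]^d, where the domain has unit volume so the estimate is just the sample mean; the O(1/√n) rate from the CLT holds in any dimension.

```python
import random

def mc_integrate(f, dim, n=100_000):
    """Plain Monte Carlo estimate of I = integral of f over [0,1]^dim.

    [0,1]^dim has unit volume, so I is estimated by the sample mean of
    f at n iid uniform draws (LLN), with O(1/sqrt(n)) error (CLT).
    """
    total = 0.0
    for _ in range(n):
        x = [random.random() for _ in range(dim)]
        total += f(x)
    return total / n

# Example: integral of (x1 + x2 + x3) over [0,1]^3 equals 1.5
print(mc_integrate(lambda x: sum(x), dim=3))
```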

slide-13
SLIDE 13


Integration

a classical Monte Carlo approach

If we try to evaluate I = ∫ f(x)g(x)dx, where g is a density function, then I = E_g[f] and:

classical Monte Carlo method
Î_n = (1/n) Σ_{i=1}^n f(x_i), where x_i ∼ L(g).

Justified by LLN & CLT if ∫ f²g < ∞.

slide-16
SLIDE 16


Integration

no density at first

If f is not a density (or not a "good" one), then for any density g whose support contains the support of f:
I = ∫ h(x) (f(x)/g(x)) g(x)dx = E_g[hf/g].

Similarly:

importance sampling Monte Carlo method
Î_n = (1/n) Σ_{i=1}^n h(y_i)f(y_i)/g(y_i), where y_i ∼ L(g).

Same justification, but this time one needs ∫ h²f²/g < ∞, which controls Var_g(Î_n) = Var_g((1/n) Σ_{i=1}^n h(Y_i)f(Y_i)/g(Y_i)); g must have a heavier tail than that of f. Choice of g? (A sketch follows this slide.)

Theorem (Rubinstein)
The density g* which minimises Var(Î_n) (for all n) is
g*(x) = |h(x)|f(x) / ∫ |h(y)|f(y)dy.
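A hedged sketch of the importance sampling estimator above (illustrative names; not from the slides). The instrumental law here is the standard Cauchy, whose tails are heavier than the Gaussian target's, so the condition ∫ h²f²/g < ∞ holds in this example.

```python
import math
import random

def importance_sampling(h, f, g_sample, g_pdf, n=200_000):
    """I_n = (1/n) * sum of h(y_i) f(y_i) / g(y_i), with y_i ~ g."""
    total = 0.0
    for _ in range(n):
        y = g_sample()
        total += h(y) * f(y) / g_pdf(y)
    return total / n

# Example: E[X^2] = 1 under f = N(0,1), using a standard Cauchy as g
# (heavier tail than f, as required).
h = lambda x: x * x
f = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
g_sample = lambda: math.tan(math.pi * (random.random() - 0.5))  # inverse CDF
g_pdf = lambda x: 1.0 / (math.pi * (1.0 + x * x))
print(importance_sampling(h, f, g_sample, g_pdf))  # ≈ 1
```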

slide-20
SLIDE 20


Monte Carlo integration

◮ Was this optimal g* really useful? Remember the denominator: if h > 0, it is exactly the integral we are trying to compute!

◮ In practice, we choose g such that Var(Î_n) < ∞ and |h|f/g ≃ C (roughly constant).

◮ If g is known only up to a constant, the self-normalised estimator
(Σ_{i=1}^n h(y_i)f(y_i)/g(y_i)) / (Σ_{i=1}^n f(y_i)/g(y_i))
can replace Î_n (the unknown constant cancels in the ratio; a sketch follows this slide).

◮ BUT the optimality of g* cannot give any clue on the variance of this estimator...
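A sketch of the self-normalised variant (illustrative names; f_unnorm is assumed to be the target density known only up to a multiplicative constant). The constant appears in both sums and cancels in the ratio.

```python
def self_normalised_is(h, f_unnorm, g_sample, g_pdf, n=200_000):
    """Self-normalised importance sampling: f (or g) known only up to
    a multiplicative constant, which cancels in the ratio below."""
    num = den = 0.0
    for _ in range(n):
        y = g_sample()
        w = f_unnorm(y) / g_pdf(y)   # unnormalised importance weight
        num += h(y) * w
        den += w
    return num / den
```

It can be called with the same h, g_sample, g_pdf as in the previous sketch, passing e.g. exp(-x²/2) without its 1/√(2π) factor.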

slide-24
SLIDE 24


Monte Carlo for optimisation

◮ Goal: max_{x∈X} f(x) or argmax_{x∈X} f(x).

◮ Very simple part 1: if X is bounded, take (x_i) ∼ U(X) and estimate the max by max_{i=1...n} f(x_i). If X is not bounded, use an adequate variable transformation. (A random-search sketch follows this slide.)

◮ Very simple part 2: if f ≥ 0, estimating argmax_{x∈X} f(x) boils down to estimating the mode of the distribution with density f/∫f. The recipe becomes: take (x_i) ∼ L(f/∫f); the estimator is the mode of the histogram of the x_i's. If f can take negative values, work with g(x) = exp[f(x)] or g(x) = exp[f(x)]/(1 + exp[f(x)]).

◮ In the latter case, the problem is the computation of the normalisation constant!

1. Newton-Raphson-like methods: MCNR (MC approximation of score integrals and Hessian matrices) or Stochastic Approximation NR.

2. EM-like approximations: MCEM or Stochastic Approximation MC.
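"Very simple part 1" as a sketch (illustrative names, bounded box X): draw uniform points and keep the best.

```python
import random

def random_search_max(f, lo, hi, n=100_000):
    """Estimate max/argmax of f over the box [lo_j, hi_j] by taking
    the best of n uniform draws ('very simple part 1')."""
    best_x, best_val = None, float("-inf")
    for _ in range(n):
        x = [random.uniform(a, b) for a, b in zip(lo, hi)]
        v = f(x)
        if v > best_val:
            best_x, best_val = x, v
    return best_x, best_val

# Example: f(x, y) = -(x - 1)^2 - (y + 2)^2, maximum at (1, -2)
f = lambda x: -(x[0] - 1.0) ** 2 - (x[1] + 2.0) ** 2
print(random_search_max(f, lo=[-5.0, -5.0], hi=[5.0, 5.0]))
```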

slide-31
SLIDE 31


Monte Carlo vs numerical methods

◮ Numerical methods have lower computational cost in low dimension (integration) and can take advantage of the regularity of f, whilst MC methods make no hypothesis on f nor on X (optimisation).

◮ Advantage of MC methods 1 (integration): important support areas are given priority (whether the function varies a lot there or its actual norm is large),

◮ advantage of MC methods 2 (optimisation): local minima can be escaped, and

◮ advantage of MC methods 3: a straightforward extension to statistical inference (see next slide).

◮ → ideally, a method which efficiently combines the two points of view sounds much cleverer...

slide-36
SLIDE 36


Monte Carlo and statistical inference

Integration
◮ Expectation computation
◮ Estimator precision estimation
◮ Bayesian analysis
◮ Mixture modelling or missing data treatment

Optimisation
◮ Optimisation of some criterion,
◮ MLE,
◮ same last 2 points.

slide-38
SLIDE 38


Monte Carlo and statistical inference

Bayesian framework

◮ Let x = (x_i)_{i=1...n} be a sample with density known up to a parameter θ ∈ Θ.

◮ The Bayesian approach treats θ as a rv with (prior) density π(θ).

◮ We denote by f(x|θ) the density of x conditional on θ.

◮ Bayes' rule states that the posterior law is
π(θ|x) = π(θ)f(x|θ) / ∫ π(θ)f(x|θ)dθ
(note that often, the normalising constant is not tractable).

◮ Main interests: (i) the prior π permits including prior knowledge on the parameter, and (ii) it is natural in some applications/modelling (Markov chains, mixture modelling, breakpoint detection...).

slide-43
SLIDE 43


A Bayesian estimator T(X) for θ

in a nutshell

1. Choose a cost function L(θ, T(X)), e.g. (i) the 0-1 cost 1{θ ≠ T(X)}, which gives T*(x) = argmax_θ π(θ|x) (an optimisation problem), or (ii) the quadratic cost ||T(X) − θ||², which gives T*(x) = ∫ θ π(θ|x)dθ (an integration problem),

2. Derive the average risk: R(T) = ∫_X (∫_Θ L(θ, T(x)) f(x|θ) π(θ)dθ) dx,

3. Find the Bayesian estimator T* = argmin_T R(T),

4. The generalised Bayesian estimator is T*(x) = argmin_T ∫_Θ L(θ, T(x)) f(x|θ) π(θ)dθ almost everywhere.

slide-47
SLIDE 47


MCMC methods

Why? How?

Why? Monte Carlo Markov Chain methods are used when the distribution under study cannot be simulated directly by usual techniques and/or when its density is known only up to a constant.

How? An MCMC method simulates a Markov chain (X_i)_{i≥0} with transition kernel P. The Markov chain converges, in a sense to be made precise, towards the distribution of interest π (ergodicity property).

slide-49
SLIDE 49


Ergodic theorem

for homogeneous Markov chains

Theorem
Under certain conditions (recurrence and existence of an invariant distribution, for example), whatever the initial distribution µ_0 for X_0, the distribution µ_i satisfies lim_{i→∞} ||µ_i − π|| = 0 and
(1/n) Σ_{k=0}^{n−1} h(X_k) → E_π[h(X)] = ∫ h(x)π(x)dx a.s.

Remarks
◮ The (X_i)'s are not independent, but the ergodic theorem replaces the LLN.
◮ Ergodic theorems exist under milder conditions and for inhomogeneous chains.

slide-51
SLIDE 51


MCMC algorithms

Just like accept/reject methods or importance sampling, MCMC methods make use of an instrumental law. This instrumental law can be characterised by a transition kernel q(·|·) or by a conditional distribution.

◮ Simulation and integration: Metropolis-Hastings algorithm or Gibbs sampling.

◮ Optimisation: simulated annealing.

slide-53
SLIDE 53


Metropolis-Hastings algorithm

◮ Initialisation: x_0.
◮ For each step k ≥ 0:
1. Simulate a value y_k from Y_k ∼ q(·|x_k),
2. Simulate a value u_k from U_k ∼ U([0, 1]),
3. Update
x_{k+1} = y_k if u_k ≤ ρ(x_k, y_k), x_{k+1} = x_k otherwise,
where ρ(x, y) = min(1, [π(y)q(x|y)] / [π(x)q(y|x)]).

Note that only the ratios π(y)/π(x) and q(x|y)/q(y|x) are needed, so there is no need to compute normalising constants! Note also that while favourable moves are always accepted, unfavourable moves can also be accepted (with a probability which decreases with the level of degradation). (A sketch follows this slide.)
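A minimal random-walk Metropolis-Hastings sketch (illustrative, not from the slides): with a symmetric Gaussian proposal the q-ratio cancels, and only an unnormalised log-density of π is needed, as stressed above.

```python
import math
import random

def metropolis_hastings(log_pi, x0, n_steps, step=1.0):
    """Random-walk MH: log_pi is log pi up to an additive constant."""
    x = x0
    chain = []
    for _ in range(n_steps):
        y = x + random.gauss(0.0, step)       # y_k ~ q(.|x_k), symmetric
        log_rho = log_pi(y) - log_pi(x)       # log of pi(y)/pi(x)
        # accept with probability min(1, rho); capped exp avoids overflow
        if random.random() <= math.exp(min(0.0, log_rho)):
            x = y
        chain.append(x)
    return chain

# Example: target N(3, 1), known only up to its normalising constant.
chain = metropolis_hastings(lambda x: -0.5 * (x - 3.0) ** 2,
                            x0=0.0, n_steps=50_000)
kept = chain[5_000:]                          # discard burn-in
print(sum(kept) / len(kept))                  # ≈ 3
```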

slide-55
SLIDE 55


Simulated annealing

Goal: minimise a real-valued function f.

Idea: apply a Metropolis-Hastings algorithm to simulate the distribution π(x) ∝ exp(−f(x)) and then estimate its mode(s).

Clever practical modification: the objective function is changed over the iterations: π(x) ∝ exp(−f(x)/T_k), where (T_k) is a non-increasing sequence of temperatures. In practice, the temperature is high in the first iterations, to explore and avoid local minima, and then decreases more or less rapidly towards 0.

slide-58
SLIDE 58

Simulated annealing algorithm

◮ Initialisation: x_0.
◮ For each step k ≥ 0:
1. Simulate a value y_k from Y_k ∼ q(·|x_k),
2. Simulate a value u_k from U_k ∼ U([0, 1]),
3. Update
x_{k+1} = y_k if u_k ≤ ρ(x_k, y_k), x_{k+1} = x_k otherwise,
where ρ(x, y) = min(1, [e^{−f(y)/T_k} q(x|y)] / [e^{−f(x)/T_k} q(y|x)]),
4. Decrease the temperature: T_k → T_{k+1}. (A sketch follows this slide.)
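A sketch of the algorithm above with a symmetric Gaussian proposal and a geometric cooling schedule T_{k+1} = α·T_k (one common choice among many; all names and parameter values are illustrative).

```python
import math
import random

def simulated_annealing(f, x0, n_steps, step=1.0, t0=10.0, alpha=0.999):
    """Minimise f via Metropolis moves at a slowly decreasing temperature."""
    x, best = x0, x0
    t = t0
    for _ in range(n_steps):
        y = x + random.gauss(0.0, step)       # y_k ~ q(.|x_k), symmetric
        log_rho = -(f(y) - f(x)) / t          # log acceptance ratio
        if random.random() <= math.exp(min(0.0, log_rho)):
            x = y
        if f(x) < f(best):                    # track the best point seen
            best = x
        t *= alpha                            # decrease the temperature
    return best

# Example: two local minima near x = +/-1; the global one is near x = -1.
f = lambda x: (x * x - 1.0) ** 2 + 0.2 * x
print(simulated_annealing(f, x0=5.0, n_steps=20_000))  # ≈ -1
```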

slide-59
SLIDE 59

This is over!

...or almost.

Was that clear enough? Too quick? Some simple applications might help...