
slide-1
SLIDE 1

Australian Centre of Excellence for Mathematical and Statistical Frontiers in Big Data, Big Models, New Insights

The past, present and future of Monte Carlo methods

(a personal experience) Scott A. Sisson

UNSW Sydney & ACEMS

8th–12th July 2019, MCM2019

1/71

slide-2
SLIDE 2

Talk Outline

* Sisson & Fan (2007), Statistics & Computing.

  • 1. The Past
  • RJMCMC convergence diagnostics∗
  • Adaptive MCMC
  • 2. The Present
  • Approximate Bayesian computing
  • Likelihood-free approximate Gibbs sampling

  • MCMC for approximate data
  • 3. The Future
  • of Monte Carlo methods research

2/71

slide-3
SLIDE 3

RJMCMC convergence diagnostics

Active research in the 1990s–2000s.

Gelman–Rubin (Potential Scale Reduction Factor) diagnostic (1992):

\hat{R} = \left( \frac{\frac{N-1}{N}\hat{W} + \frac{1}{N}\hat{B}}{\hat{W}} \right)^{1/2}

◮ B = between-chain variance.
◮ W = within-chain variance.
◮ \hat{R} → 1 when all chains have converged.

◮ Based on some quantity constantly observed across all chains.
◮ Not clear how to use for RJMCMC:

  • Monitor deviance?
  • Monitor model indicator?
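
One way to compute this diagnostic (a minimal sketch in Python/NumPy; my own illustration, not code from the talk):

```python
# Gelman-Rubin PSRF computed from C chains of length N, each monitoring
# one scalar quantity that is constantly observed across all chains.
import numpy as np

def gelman_rubin(chains):
    """chains: array of shape (C, N) holding one scalar quantity per chain."""
    C, N = chains.shape
    W = chains.var(axis=1, ddof=1).mean()       # within-chain variance
    B = N * chains.mean(axis=1).var(ddof=1)     # between-chain variance
    V = (N - 1) / N * W + B / N                 # pooled variance estimate
    return np.sqrt(V / W)                       # R-hat -> 1 at convergence

rng = np.random.default_rng(0)
print(gelman_rubin(rng.normal(size=(4, 5000))))  # ~1 for chains with a common target
```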

3/71

slide-4
SLIDE 4

RJMCMC convergence diagnostics

After reading some books on point processes (. . . )

◮ Distance X_v from a point v to the nearest event ξ_1, . . . , ξ_n in a space A.
◮ For some models, can represent parameters as point process realisations
  • E.g. mixture of Normals f(x) = \sum_{i=1}^{n} w_i φ(x; µ_i, σ_i)
  • θ = (µ_1, . . . , µ_n, σ_1, . . . , σ_n, w_1, . . . , w_n, n)
  • θ = {ξ_i = (µ_i, σ_i, w_i) : i = 1, . . . , n}
◮ Now X_v is constantly observed across all chains
◮ So could estimate \hat{F}(x_v | v) for a chain, from some location v = (µ_v, σ_v, w_v)

Use for a diagnostic:

◮ Comparing two chains: u_{jk} = \int_V \int_0^\infty \left| \hat{F}^j(x_v | v) − \hat{F}^k(x_v | v) \right|^p \, dx_v \, dv
◮ or \hat{R}_v computed on (x_v^1, . . . , x_v^C) over all chains (etc.)

4/71

slide-5
SLIDE 5

RJMCMC convergence diagnostics

Specific model¹: Corbelled domes (unknown number of changepoints)

log(r_i) = log(α_j) + β_j log(d_i + δ_j) + ε_i    for γ_{j−1} ≤ d_i < γ_j
log(r_i) = log(c) + ε_i                           for γ_J ≤ d_i

with changepoints 0 = γ_0 ≤ . . . ≤ γ_J, ε_i ∼ N(0, σ²), subject to some continuity constraints.

Fan and Brooks (2000):

◮ ξ_j = (α_j, β_j, δ_j, γ_j, σ)
◮ Run 5 RJMCMC samplers for 15m iterations
◮ Did they converge?

¹Fan and Brooks (2000), The Statistician.

5/71

slide-6
SLIDE 6

RJMCMC convergence diagnostics

◮ (a, b) Kolmogorov–Smirnov and χ² tests on model indicators
◮ (c) PSRF / \hat{R} using deviance

6/71

slide-7
SLIDE 7

RJMCMC convergence diagnostics

◮ (a) All u_{j3} traces suggest chain #3 is different (grey lines)
◮ (b) \hat{R}_v shows different performance for 3 specific v ∼ π(ξ|x)
◮ (c) Density of γ_2 under chain 3 is very different to the other chains
◮ Difference identified by the 3 closest v (points)

7/71

slide-8
SLIDE 8

Talk Outline

* Garthwaite, Fan & Sisson (2010/16) Communications in Statistics.

  • 1. The Past
  • RJMCMC convergence diagnostics
  • Adaptive MCMC∗
  • 2. The Present
  • Approximate Bayesian computing
  • Likelihood-free approximate Gibbs sampling

  • MCMC for approximate data
  • 3. The Future
  • of Monte Carlo methods research

8/71

slide-9
SLIDE 9

Adaptive MCMC

Adaptive Random Walk Metropolis-Hastings (ARWMH):

◮ MCMC with proposal distribution q(θ, θ∗) = N(θ, σ²A_t)
◮ where A_t = Cov(θ) is estimated from θ_{t−1}, . . . , θ_1
◮ Satisfies the “diminishing adaptation” condition, so achieves the correct limiting distribution

Q: How to choose σ?
(One) A: Realise a particular acceptance probability.

Which acceptance probability?

◮ 0.234 for multivariate proposal distributions (Roberts et al., 1997)
◮ 0.44 for univariate proposal distributions (Roberts and Rosenthal, 2001)

Appears to work well in general. Typically obtained by manually searching over σ² (laborious).

9/71

slide-10
SLIDE 10

The Robbins-Monro Process

◮ Stochastic search algorithm for a binary response with probability of success p(σ). Can control σ.
◮ The aim is to find the value of σ∗ that gives a specified value of p, say p∗ = p(σ∗).

[Figure: p(σ) as a function of σ, with the target value p∗ = p(σ∗) marked.]

Robbins–Monro process (Robbins & Monro, 1951)

Conduct a sequence of trials.

◮ At each trial σ is set equal to the current estimate of σ∗.
◮ If the trial is a success, σ is increased.
◮ If it is a failure, σ is reduced.
◮ The adjustment size decreases with each trial.
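
A toy sketch of this search (my own illustration; the success-probability curve p(σ) and steplength c are made up for the example):

```python
# Robbins-Monro search for sigma* satisfying p(sigma*) = p*, where each trial
# is a binary success with probability p(sigma) and the step size shrinks as c/t.
import numpy as np

def p(sigma):
    return np.exp(-sigma / 3.0)          # toy monotone "success probability" curve

rng = np.random.default_rng(1)
p_star, sigma, c = 0.44, 1.0, 2.0
for t in range(1, 5001):
    success = rng.random() < p(sigma)    # one binary trial at the current sigma
    if success:
        sigma += c * (1 - p_star) / t    # success: increase sigma
    else:
        sigma -= c * p_star / t          # failure: decrease sigma

print(sigma, -3.0 * np.log(p_star))      # estimate vs. the true root of p(sigma) = p*
```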

10/71

slide-11
SLIDE 11

Incorporation into RWMH algorithms

Within a RWMH algorithm:

Given the current value of σ_t, update σ_{t+1} according to

  σ_{t+1} = σ_t + c(1 − p∗)/t   if θ∗ is accepted
  σ_{t+1} = σ_t − c p∗/t        if θ∗ is rejected,

where θ∗ ∼ N(θ_t, σ²) and c is a chosen steplength constant, which controls algorithm convergence.

◮ Assumes that p(σ) is a monotonic function of σ.
◮ The optimum choice of c is (Hodges and Lehmann, 1955)

  c∗ = −1 / [dp(σ)/dσ]_{σ=σ∗}.

  In general this must be estimated.
◮ Previous attempts to estimate c∗ used 3 additional Markov chains (Andrieu and Robert, 2001).

11/71

slide-12
SLIDE 12

Estimating c∗

◮ Pr(accepting an MH move from θ):

  p(θ, σ) = \int α(θ, θ∗) \, q(θ, θ∗) \, dθ∗

◮ Hence, the overall MH acceptance probability (OAP) is

  p(σ) = \int p(θ, σ) \, π(θ|y) \, dθ.

For Gaussian RWMH (symmetric proposal) we then have

  \frac{dp(σ)}{dσ} = \int \int \min\left\{ \frac{π(θ∗|y)}{π(θ|y)}, 1 \right\} \frac{dq(θ, θ∗; σ)}{dσ} \, π(θ|y) \, dθ∗ \, dθ,

which is difficult to work with in general. However . . .

12/71

slide-13
SLIDE 13

Some results

If q is an m-dimensional MVN(θ_t, σ²A), where A is independent of σ.

Proposition 1

A lower bound on c∗/σ∗ is 1/(mp∗).

Proposition 2

If π(θ|y) has finite variance, then for m = 1: c∗/σ∗ → 1/p∗ as σ → ∞.

Proposition 3

If π(θ|y) is continuous over Θ, and if σ → 0 as p(σ) → 1, then c∗/σ∗ ≈ 1/(1 − p∗) as p∗ → 1.

13/71

slide-14
SLIDE 14

A simple estimate of c∗ (univariate)

Hence for univariate proposals (m = 1), one possibility is

  c∗/σ∗ ≈ 1 / [p∗(1 − p∗)],

which satisfies Propositions 1–3.

In practice:

Have an estimate of c∗ for every estimate of σ∗ in the Robbins-Monro search process. Results for multivariate posteriors too.

14/71

slide-15
SLIDE 15

How good in practice?

[Figure: c∗/σ∗ plotted against p∗ for a range of univariate target distributions.]

◮ Given p∗, numerically solve p(σ) for σ∗ and dp(σ)/dσ for c∗.
◮ Univariate distributions: N(0,1); t_5; Cauchy; logistic; double exponential; Gamma(5,1); Beta(3,7); U(0,1); (1/2)N(0,1) + (1/2)N(5,5).

◮ Note: Useful to slightly overestimate c∗ in practice

15/71

slide-16
SLIDE 16

How good in practice?

[Figure: c∗/σ∗ against p∗ for multivariate targets: (a) m = 2, (b) m = 4, (c) m = 8, (d) m = 20.]

◮ π(θ|y) = \prod_i h(θ_i), where h is each of the previous 9 densities.
◮ Consistent results as before.

16/71

slide-17
SLIDE 17

Algorithm comments

Univariate target distributions with θ∗ ∼ N(θt, σ2):

Set p∗ = 0.44 and update σ_t according to

  σ_{t+1} = σ_t + c(1 − p∗)/t   if θ∗ is accepted
  σ_{t+1} = σ_t − c p∗/t        if θ∗ is rejected,

where c = σ_t / {p∗(1 − p∗)}.

Various other implementational details omitted.
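
A rough end-to-end sketch (my own toy instance on a standard normal target, omitting the paper's safeguards):

```python
# Univariate adaptive RWMH: the proposal s.d. sigma is tuned by Robbins-Monro
# steps towards acceptance probability p* = 0.44, using c = sigma / (p*(1 - p*)).
import numpy as np

def log_target(theta):
    return -0.5 * theta**2                       # N(0, 1) target, up to a constant

rng = np.random.default_rng(0)
p_star, theta, sigma = 0.44, 0.0, 1.0
accepts, n_iter = 0, 20000
for t in range(1, n_iter + 1):
    proposal = rng.normal(theta, sigma)
    c = sigma / (p_star * (1 - p_star))          # plug-in steplength estimate
    if np.log(rng.random()) < log_target(proposal) - log_target(theta):
        theta = proposal
        sigma += c * (1 - p_star) / t            # accepted: increase sigma
        accepts += 1
    else:
        sigma -= c * p_star / t                  # rejected: decrease sigma
    sigma = max(sigma, 1e-3)                     # crude guard for the omitted details

print(sigma, accepts / n_iter)                   # sigma* ~ 2.4, acceptance ~ 0.44
```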

17/71

slide-18
SLIDE 18

Univariate simulation study

Target distribution          Optimal σ∗   σ∗ quantiles (0.05 / median / 0.95)   OAP quantiles (0.05 / median / 0.95)
N(0, 1)                      2.42         2.31 / 2.43 / 2.56                    0.417 / 0.443 / 0.468
t-dist. (5 d.f.)             2.71         2.54 / 2.73 / 2.89                    0.413 / 0.441 / 0.470
Cauchy                       4.39         3.69 / 4.25 / 5.03                    0.389 / 0.443 / 0.501
Logistic                     4.05         3.82 / 4.05 / 4.33                    0.417 / 0.442 / 0.467
Double exponential           2.70         2.52 / 2.70 / 2.93                    0.413 / 0.439 / 0.465
Gamma(5, 1)                  4.98         4.62 / 4.96 / 5.28                    0.414 / 0.443 / 0.467
Beta(3, 7)                   0.335        0.311 / 0.335 / 0.355                 0.417 / 0.440 / 0.466
Uniform                      0.806        0.764 / 0.807 / 0.849                 0.418 / 0.442 / 0.464
(1/2)N(0, 1) + (1/2)N(5, 5)  6.07         5.59 / 6.10 / 6.50                    0.415 / 0.442 / 0.468

[Figure: traces of σ_t and the acceptance rate against iteration for the N(0,1), double exponential and Beta(3,7) targets.]

18/71

slide-19
SLIDE 19

The Past: Summary

  • 1. The Past
  • RJMCMC convergence diagnostics
  • Adaptive MCMC

Active research in the 1990s–2000s. Who worries about convergence in 2020?

◮ Should be more important than ever (bigger, more complex models)
◮ MCMC adaptation is still important (e.g. HMC)
◮ But less focus on going beyond the (1-step) Markov chain
◮ Maybe the problems have been adequately resolved (for now)
◮ Research trends come and go, and this is ok

19/71

slide-20
SLIDE 20

Talk Outline

* Rodrigues, Nott & Sisson (2016) Computational Statistics & Data Analysis

  • 1. The Past
  • RJMCMC convergence diagnostics
  • Adaptive MCMC
  • 2. The Present
  • Approximate Bayesian computing∗
  • Likelihood-free approximate Gibbs sampling

  • MCMC for approximate data
  • 3. The Future
  • of Monte Carlo methods research

20/71

slide-21
SLIDE 21

ABC basics

Aim: Posterior simulation from π(θ|y) ∝ L(y|θ)π(θ)
BUT: The likelihood is computationally intractable.

Simple ABC mechanism:

  • 1. Draw samples (θ1, x1), (θ2, x2), . . . , (θN, xN) from

π(θ, x) = L(x | θ)π(θ)

  • 2. Look at marginal distribution of π(θ | x ≈ y)
  • 3. If x is “close enough” to y then:

π(θ | x ≈ y) ≈ π(θ|y)

Note: “close enough” could be ‖x − y‖ ≤ h for small h.

21/71

slide-22
SLIDE 22

ABC basics

◮ The condition x ≈ y is very unlikely.
◮ Reduce the data to summary statistics sx = S(x).

ABC Algorithm:

  • 1. Draw samples (θi, xi) from π(θ, x) = L(x | θ)π(θ)
  • 2. Compute summary statistics sx = S(x)
  • 3. Look at marginal distribution of π(θ | sx ≈ sy)
  • 4. If sx is “close enough” to sy (e.g. if ‖sx − sy‖ ≤ h) then

π(θ | sx ≈ sy) ≈ π(θ|sy) ≈ π(θ|y)

Note: If sy are sufficient statistics, then π(θ|sy) ≡ π(θ|y); otherwise there is some loss of information.

Note: Computation increases exponentially as accuracy increases (i.e. as h → 0).

22/71

slide-23
SLIDE 23

ABC toy example

Toy example setup:

◮ Suppose the model is x1, . . . , x100 ∼ N(θ, 1)
◮ A sufficient statistic is sx = x̄
◮ Suppose sy = 0 and the prior π(θ) is U(−1, 1)
◮ Then π(θ | y) ∝ N(0, 1/10²) on (−1, 1)

Simulation procedure (rejection sampling):

  • 1. Generate θ ∼ π(θ) from the prior
  • 2. Generate x1, . . . , x100 ∼ N(θ, 1) and compute sx = x̄
  • 3. Examine the distribution of those θ for which x̄ ≈ 0 (i.e. sx ≈ sy)
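
A minimal sketch of this toy rejection sampler (my own illustration; the threshold h and number of draws are arbitrary):

```python
# Toy ABC rejection: x_1..x_100 ~ N(theta, 1), prior theta ~ U(-1, 1),
# summary s_x = mean(x), observed s_y = 0, accept draws with |s_x - s_y| <= h.
import numpy as np

rng = np.random.default_rng(0)
s_y, h, n_draws = 0.0, 0.025, 50_000

theta = rng.uniform(-1, 1, size=n_draws)                   # 1. prior draws
x = rng.normal(theta[:, None], 1.0, size=(n_draws, 100))   # 2. simulate data
s_x = x.mean(axis=1)                                       #    and summarise
accepted = theta[np.abs(s_x - s_y) <= h]                   # 3. keep "close enough" draws

print(len(accepted) / n_draws)                   # acceptance rate (falls as h -> 0)
print(accepted.mean(), accepted.std())           # approx. N(0, 1/10^2), truncated to (-1, 1)
```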

23/71

slide-24
SLIDE 24

ABC toy example

Approximating the true posterior: h = ∞ Acceptance = 100%

[Figure: accepted (parameter, data summary) pairs and the resulting ABC posterior density for the model parameter.]

24/71

slide-25
SLIDE 25

ABC toy example

Approximating the true posterior: h = 4.5 Acceptance = 85%

25/71

slide-26
SLIDE 26

ABC toy example

Approximating the true posterior: h = 3 Acceptance = 55%

26/71

slide-27
SLIDE 27

ABC toy example

Approximating the true posterior: h = 2 Acceptance = 41%

27/71

slide-28
SLIDE 28

ABC toy example

Approximating the true posterior: h = 1.3 Acceptance = 25%

28/71

slide-29
SLIDE 29

ABC toy example

Approximating the true posterior: h = 0.005 Acceptance = 0.09%

29/71

slide-30
SLIDE 30

The computational overheads problem


Clear trade off: Greater accuracy ⇒ Greater computation needed

30/71

slide-31
SLIDE 31

What is the model here?

The actual (augmented) posterior distribution is:

  π(θ, sx | sy) ∝ K_h(sx − sy) L(sx|θ) π(θ)

where e.g. K_h(sx − sy) = 1 if ‖sx − sy‖ ≤ h, and 0 otherwise.

Interpretation:

  π_{ABC}(θ|sy) = \int K_h(sx − sy) L(sx|θ) π(θ) \, dsx = π(θ|sy) + other terms involving h,

based on kernel density estimation arguments. So π_{ABC}(θ|sy) → π(θ|sy) as h → 0.

31/71

slide-32
SLIDE 32

ABC Monte Carlo samplers

How does model augmentation help with Monte Carlo simulation?

Recall: Rejection sampling

Want to sample from f (θ) based on samples from g(θ):

◮ Sample θ ∼ g(θ)
◮ Accept θ as a draw from f(θ) with probability ∝ f(θ)/g(θ)

ABC Rejection sampling

Take f(θ) = π(θ, sx|sy) and g(θ) = L(sx|θ)π(θ). Then

  f(θ)/g(θ) ∝ \frac{K_h(sx − sy) L(sx|θ) π(θ)}{L(sx|θ) π(θ)} = K_h(sx − sy).

◮ The intractable likelihoods cancel out
◮ Their evaluation is completely circumvented
◮ Works similarly for other algorithms (e.g. MCMC, SMC, etc.)

32/71

slide-33
SLIDE 33

Hierarchical Gaussian process prior (Rodrigues, Nott & Sisson, 2016)

We have data {Xij}: i = 1, . . . , g groups and j = 1, . . . , ni observations per group.

A prior on the set of densities fi(x), i = 1, . . . , g:

  fi(x) = L(Zi(x)) b(x|φ) / ci(φ, Zi(x))          (1a)
  Zi(x) ∼ GP(µ(x), k(x, x′|θ1))                   (1b)
  µ(x) ∼ GP(m(x), k(x, x′|θ2))                    (1c)
  (θ1, θ2) ∼ π(·)                                 (1d)

◮ L(z) = 1/(1 + exp(−z)) is the logistic function
◮ b(x|φ) is a parametric base density
◮ GP(µ(x), k(x, x′|θ)) = a Gaussian process with mean function µ(x) and covariance function k(x, x′|θ)
◮ m(x) is a conveniently chosen function
◮ π(·) is a hyperprior for the parameters θ1, θ2
◮ ci(φ, Zi(x)) = \int L(Zi(x)) b(x|φ) dx = normalising constant of fi(x).
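
A small sketch of drawing one random density from (1a)–(1b) on a grid (my own illustration; the zero GP mean, squared-exponential kernel and base density b = U(0, 1) are assumptions made for the example):

```python
# Draw f(x) = L(Z(x)) b(x) / c on a grid, with Z ~ GP(0, squared-exponential
# kernel), logistic link L, base density b = Uniform(0, 1), and c the
# normalising constant computed numerically.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 200)
dx = x[1] - x[0]

def sq_exp_kernel(x, sigma=2.0, alpha=10.0):
    return sigma**2 * np.exp(-0.5 * (alpha * (x[:, None] - x[None, :]))**2)

K = sq_exp_kernel(x) + 1e-8 * np.eye(len(x))       # jitter for numerical stability
Z = rng.multivariate_normal(np.zeros(len(x)), K)   # one GP realisation Z(x)
unnorm = 1.0 / (1.0 + np.exp(-Z)) * 1.0            # L(Z(x)) * b(x|phi), with b = U(0,1)
f = unnorm / (unnorm.sum() * dx)                   # divide by the normalising constant

print((f * dx).sum())                              # integrates to 1
```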

33/71

slide-34
SLIDE 34

ABC for GP density estimation

ABC Algorithm

  • 1. Draw a sample (a set of density functions) from the hierarchical GP prior and generate data from these of the same dimensions as the observed data.
  • 2. Summarise the data using summary statistics: a standard kde for each group, i = 1, . . . , g.
  • 3. Comparison: If the simulated kde’s are close to the observed kde’s, accept the sampled densities from step 1. Otherwise reject.
  • 4. Repeat steps 1–3 to produce ℓ = 1, . . . , m accepted samples.

Also implement functional-regression-adjustment (not discussed).

34/71

slide-35
SLIDE 35

A simulated example

◮ g = 10 groups
◮ Sample sizes of 5, 20, 35, . . . , 140
◮ σ1, σ2 ∼ Gamma(3, 5) and α1, α2 ∼ Gamma(1, 0.1)
◮ Generate 50,000 samples, accept the closest 10% to the observed data
◮ b(x|φ) = U(0, 1)

Focus on groups 1 and 10 with smallest and largest sample sizes.

35/71

slide-36
SLIDE 36

A simulated example

[Figure: (a) true densities and (b) kernel density estimates for Group 1, Group 10 and the other groups.]

◮ The kde for group 10 is reasonable; for group 1 it is poor
◮ The boundary bias problem is evident

36/71

slide-37
SLIDE 37

A simulated example

[Figure: Group 1 posterior density samples and 95% credibility intervals vs. the true density, ABC estimate and kernel estimate.]

◮ Group 1: Samples cluster around the true density
◮ The posterior mean strongly outperforms the initial kde
◮ Less biased around the boundary

37/71

slide-38
SLIDE 38

A simulated example

[Figure: Group 10 posterior density samples and 95% credibility intervals vs. the true density, ABC estimate and kernel estimate.]

◮ Group 10: Small sample variability around the true density
◮ The posterior mean outperforms the initial kde
◮ Less biased around the boundary

38/71

slide-39
SLIDE 39

2-level Hierarchical KDE’s

Add an extra level to the original model:

  f_i(x) = L(Z^1_i(x)) b(x|φ) / c_i(φ, Z^1_i(x))
  Z^1_i(x) ∼ GP(Z^2_{δ(i)}(x), k(x, x′|σ1, α1))   – State level
  Z^2_j(x) ∼ GP(Z^3(x), k(x, x′|σ2, α2))          – Region level
  Z^3(x) ∼ GP(m(x), k(x, x′|σ3, α3))              – National level

for i = 1, . . . , g1 = 27 and j = 1, . . . , g2 = 5, where δ(i) indicates the region to which state i belongs.

39/71

slide-40
SLIDE 40

2-level Hierarchical KDE’s

[Figure: possible density relationships with 27 densities in 5 regions.]

Full details in Rodrigues, Nott & Sisson (2016).

40/71

slide-41
SLIDE 41

Talk Outline

* Rodrigues, Nott & Sisson (2019) https://arxiv.org/abs/1906.04347

  • 1. The Past
  • RJMCMC convergence diagnostics
  • Adaptive MCMC
  • 2. The Present
  • Approximate Bayesian computing
  • Likelihood-free approximate Gibbs sampling∗

  • MCMC for approximate data
  • 3. The Future
  • of Monte Carlo methods research

41/71

slide-42
SLIDE 42

LF approximate Gibbs sampling

Likelihood L(x|θ) is intractable.

Typical LF (ABC) MCMC algorithm:

  • 1. Given current state θt generate proposal θ′ ∼ q(θ|θt)
  • 2. Generate dataset x′ ∼ L(x|θ′)
  • 3. Compute summary statistic s′ = S(x′)
  • 4. Compute the M-H acceptance probability

α(θ_t, θ′) = \min\left\{ 1, \; \frac{K_h(s_{obs}, s′)}{K_h(s_{obs}, s_t)} \, \frac{L(s′|θ′)π(θ′)}{L(s_t|θ_t)π(θ_t)} \, \frac{L(s_t|θ_t) q(θ_t|θ′)}{L(s′|θ′) q(θ′|θ_t)} \right\}
          = \min\left\{ 1, \; \frac{K_h(s_{obs}, s′)}{K_h(s_{obs}, s_t)} \, \frac{π(θ′)}{π(θ_t)} \, \frac{q(θ_t|θ′)}{q(θ′|θ_t)} \right\}.

◮ The acceptance probability is free of intractable likelihoods.
◮ Targets the approximate posterior:

  π_h(θ|s_{obs}) ∝ \int K_h(s_{obs}, s) L(s|θ) π(θ) \, ds
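
A rough sketch of this sampler (my own toy instance: the same N(θ, 1) model, U(−1, 1) prior and mean summary as the earlier toy example, with an indicator kernel K_h):

```python
# Likelihood-free (ABC) MCMC: accept (theta', s') when K_h(s_obs, s') = 1 and the
# prior allows it; with a symmetric proposal, uniform prior and indicator kernel,
# the acceptance probability reduces to the two checks below.
import numpy as np

rng = np.random.default_rng(0)
s_obs, h, prop_sd = 0.0, 0.05, 0.2

def summary(theta):
    return rng.normal(theta, 1.0, size=100).mean()   # simulate a dataset, summarise

theta, s = 0.0, summary(0.0)
while abs(s - s_obs) > h:                            # initialise inside the ABC support
    s = summary(theta)

chain = []
for _ in range(20000):
    theta_prop = rng.normal(theta, prop_sd)          # symmetric proposal q
    s_prop = summary(theta_prop)
    if abs(s_prop - s_obs) <= h and abs(theta_prop) <= 1:
        theta, s = theta_prop, s_prop                # accept; otherwise keep current state
    chain.append(theta)

print(np.mean(chain), np.std(chain))                 # approx. the N(0, 1/10^2) posterior
```

Mixing of such a chain depends on how easily s′ lands close to s_obs, which is exactly the “sticking” issue discussed on the next slide.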

42/71

slide-43
SLIDE 43

LF approximate Gibbs sampling

Two main problems with ABC-MCMC:

  • 1. Need to match sobs and s′ in Kh(sobs, s′)
  • More problematic as |s| increases as then h increases
  • ⇒ low acceptance rates or low accuracy
  • 2. Mixing of chain is ∝ probability of generating s′ close to sobs
  • So mixing rate varies over parameter space

[Figure: trace and histogram of θ from an ABC-MCMC run.]

◮ The chain can “stick” in distributional tails
◮ Giving poor sampler performance
◮ Well known result (Sisson et al. 2007)

Block-update ABC-MCMC (Kousathanas et al., 2016): Perform block MH updates on θ = (θ[1], . . . , θ[b]), using lower-dimensional subsets s[i] ⊆ s.

43/71

slide-44
SLIDE 44

LF approximate Gibbs sampling

Benefits of Blocked ABC-MCMC:

◮ Reduces dimension of comparisons from |s| to |s[i]|

Down sides of Blocked ABC-MCMC:

◮ Doesn’t prevent “sticking”
◮ Some issues in defining a joint distribution by a set of approximations to conditionals

[Figure: trace of τx from a blocked ABC-MCMC run, still exhibiting sticking.]

A Gibbs sampler would solve the sticking problem. However:

◮ Don’t know / can’t evaluate the joint posterior
◮ Similarly for all conditionals
◮ How to implement?

44/71

slide-45
SLIDE 45

LF approximate Gibbs sampling

Basic idea (version #1):

◮ Generate (θ^{(k)}, s^{(k)}) ∼ L(s|θ)b(θ) in a region of high posterior density
  • Only generate these once
◮ Build regression models θ_{[i]} ∼ f(θ_{[−i]}, s) + ε_{[i]}
  • Could be linear, non-linear, non-parametric (NN)
  • Samples weighted according to K_h(s_{obs[i]}, s^{(k)}_{[i]})
◮ At each Gibbs update, sample θ′_{[i]} = \hat{f}(θ_{t[−i]}, s) + ε′_{[i]}
  • If the error distribution is parametric, then ε′_{[i]} ∼ \hat{Dist}(ε_{[i]})
  • If non-parametric, then draw an empirical residual ε′_{[i]} ∼ {θ^{(k)}_{[i]} − \hat{f}(θ_{t[−i]}, s^{(k)})}
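
A loose illustration of version #1 (my own toy sketch: a two-parameter problem, plain linear regressions, empirical residuals, and the kernel weighting omitted for brevity — not the paper's implementation):

```python
# Likelihood-free approximate Gibbs, version #1: pre-simulate (theta, s) pairs once,
# regress theta_[i] on (theta_[-i], s), then Gibbs-update each component as the
# prediction at (theta_t[-i], s_obs) plus an empirical residual.
# Toy model (assumed): s = theta + noise, with s_obs = (0, 0).
import numpy as np

rng = np.random.default_rng(0)
K = 5000
theta_sim = rng.normal(0, 1, size=(K, 2))               # theta^(k) ~ b(theta)
s_sim = theta_sim + rng.normal(0, 0.3, size=(K, 2))     # s^(k) ~ L(s | theta^(k))
s_obs = np.array([0.0, 0.0])

def fit(i):
    X = np.column_stack([np.ones(K), theta_sim[:, 1 - i], s_sim])  # (theta_[-i], s)
    beta, *_ = np.linalg.lstsq(X, theta_sim[:, i], rcond=None)
    resid = theta_sim[:, i] - X @ beta                  # empirical residuals
    return beta, resid

models = [fit(0), fit(1)]                               # built once, before sampling
theta, draws = np.zeros(2), []
for _ in range(5000):
    for i in (0, 1):                                    # one Gibbs sweep
        beta, resid = models[i]
        x = np.concatenate(([1.0], [theta[1 - i]], s_obs))
        theta[i] = x @ beta + rng.choice(resid)         # prediction + residual draw
    draws.append(theta.copy())

print(np.mean(draws, axis=0), np.std(draws, axis=0))
```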

45/71

slide-46
SLIDE 46

LF approximate Gibbs sampling

Basic idea (version #2):

◮ Generate (θ^{(k)}, s^{(k)}) ∼ L(s|θ)b(θ) in a region of high posterior density
  • Only generate these once
◮ At each Gibbs update
  • Build a simple regression model θ_{[i]} ∼ f(θ_{t[−i]}, s) + ε_{[i]}
  • Samples weighted according to K_h([s_{obs[i]}, θ_{t[−i]}], [s^{(k)}_{[i]}, θ^{(k)}_{[−i]}])
  • As before, sample θ′_{[i]} = \hat{f}(θ_{t[−i]}, s) + ε′_{[i]}

Building regressions:

◮ within Gibbs iterations: easier regression modelling, computation proportional to the length of the sampler
◮ before sampling: challenging modelling, computation done once

46/71

slide-47
SLIDE 47

LF approximate Gibbs sampling

  X_{uℓ} ∼ N(µ_u, τ_x^{−1})      µ_u ∼ N(µ, τ_µ^{−1})
  τ_x ∼ Gamma(α_x, ν_x)          τ_µ ∼ Gamma(α_µ, ν_µ)
  µ ∼ N(0, 1),

for ℓ = 1, . . . , L and u = 1, . . . , U.

[Figure: directed graphical model over µ, τ_µ, µ_u, τ_x and X_{uℓ}.]

Standard hierarchical model

◮ Assume the observation model is intractable: X_{uℓ} ∼ N(µ_u, τ_x^{−1})
◮ Estimate approximate conditionals for: µ_u, τ_x
◮ Standard Gibbs updates for µ, τ_µ

Samplers (all non-parametric error distributions):

  • 1. Simple model, global regression
  • 2. Simple model, local regression
  • 3. Flexible model, global regression

Note: Structure of the model:

◮ Helps identify which elements of θ_{[−i]} and s to regress on
◮ Only need to estimate one µ_u, ∀ groups

47/71

slide-48
SLIDE 48

LF approximate Gibbs sampling

[Figure: (c) relative MSE / coverage for τx and µu, (d) posterior for µ1, and (e) posterior for τx, comparing Exact Gibbs, Simple-global, Simple-local, Flexible-global and ABC-PaSS.]

◮ Blocked ABC-MCMC (ABC-PaSS) performs poorly
◮ Simple-global not so good
◮ Simple-local and Flexible-global ok

48/71

slide-49
SLIDE 49

LF approximate Gibbs sampling

[Figure: total computation time (sec, log scale) for each method.]

Method           Synthetic samples      Regression fits        MCMC sampler
                 Time (s)   Number      Time (s)   Number      Time (s)
Exact Gibbs      –          –           –          –           0.42
Simple-global    6.58       10,000      0.02       2           12.75
Simple-local     6.58       10,000      –          20,000∗     95.77
Flexible-global  6.58       10,000      85.85      4           27.51
ABC-PaSS         –          20,000∗     –          –           14.38

∗ – performed within the sampler

Given fit quality: Flexible-global performs the best in this case.

49/71

slide-50
SLIDE 50

LF approximate Gibbs sampling

[Figure: trace of τx under (f) the ABC-PaSS sampler and (g) the Flexible-global approximate Gibbs sampler.]

Good chain mixing compared to MH-based ABC-MCMC

◮ Will be an exact Gibbs sampler if \hat{f}(θ_{[i]}| . . .) → π(θ_{[i]}| . . .); otherwise an approximation
◮ Modelling with potentially inconsistent conditional distributions
◮ A 13,140-parameter intractable, high-dimensional multivariate non-linear state space model is fitted in the paper. A new record for ABC methods?

50/71

slide-51
SLIDE 51

Talk Outline

* X. Fan & Sisson (2019) in preparation.

  • 1. The Past
  • RJMCMC convergence diagnostics
  • Adaptive MCMC
  • 2. The Present
  • Approximate Bayesian computing
  • Likelihood-free approximate Gibbs sampling

  • MCMC for approximate data∗
  • 3. The Future
  • of Monte Carlo methods research

51/71

slide-52
SLIDE 52

MCMC for approximate data

◮ Want to fit the mixture model

  x_i ∼ \sum_{j=1}^{J} w_j N_d(µ_j, Σ_j)

◮ A typical Gibbs sampler is implemented using indicator variables z_i, so that

  x_i | z_i ∼ N_d(µ_{z_i}, Σ_{z_i}),    z_i | {w_k} ∼ Multinomial(1; w_1, . . . , w_J),

  for i = 1, . . . , N.

◮ What if N is big?

  • Higgs dataset: N = 11m, d = 26
  • Very many zi to estimate and mix over
  • Computation high
  • Expect slow mixing

◮ Can we use approximate data to speed this up?

52/71

slide-53
SLIDE 53

Modelling random histograms²

The general approach:

  L(S|θ, φ) ∝ \int_x f(S|x, φ) L(x|θ) \, dx

where

◮ L(x|θ) – the standard, classical data likelihood
◮ f(S|x, φ) – the probability of obtaining histogram S given x

Gist: Fitting the standard classical model L(x|θ), when the data are viewed only through the distribution S as a summary.

◮ L(S|θ) is an approximation of L(X|θ)
◮ As #bins → ∞, L(S|θ) → L(X|θ)
◮ A general method – just focus on random histograms here.

²Beranger, Lin & Sisson (2018). New models for symbolic data analysis. https://arxiv.org/abs/1809.03659

53/71

slide-54
SLIDE 54

Modelling random histograms

[Figure: histogram of n = 1000 data points in 11 bins, with standard GEV and symbolic GEV fits.]

◮ Underlying data X_1, . . . , X_n ∈ R^p ∼ g(x|θ).
◮ Collected into a histogram (random counts) with fixed bins as:

  S = (s_1, . . . , s_B)^⊤ = (#X_i ∈ B_1, . . . , #X_i ∈ B_B)^⊤,   such that \sum_b s_b = n,

  with f(S|x, φ) = 1 if there are s_b observations in bin b for each b = 1, . . . , B, and 0 otherwise.

The likelihood is then (multinomial):

  L(S|θ) ∝ \int_x f(S|x) \prod_{k=1}^{n} g(x_k|θ) \, dx ∝ \prod_{b=1}^{B} \left( \int_{B_b} g(z|θ) \, dz \right)^{s_b}

⇒ generalises the univariate result of McLachlan & Jones (1988).

54/71

slide-55
SLIDE 55

Calculating \int_{B_b} g(z|θ) dz

[Figure: bin boundaries for the 1D and 2D cases.]

1D: The probability of X_i falling in bin (L_{s_1−1}, L_{s_1}] is F(L_{s_1}|θ) − F(L_{s_1−1}|θ).

2D: The probability of X_i falling in bin (L^{(1)}_{s_1−1}, L^{(1)}_{s_1}] × (L^{(2)}_{s_2−1}, L^{(2)}_{s_2}] is

  F(L^{(1)}_{s_1−1}, L^{(2)}_{s_2−1}|θ) − F(L^{(1)}_{s_1−1}, L^{(2)}_{s_2}|θ) − F(L^{(1)}_{s_1}, L^{(2)}_{s_2−1}|θ) + F(L^{(1)}_{s_1}, L^{(2)}_{s_2}|θ).

d-D: Has 2^d components – viable for low d.

55/71

slide-56
SLIDE 56

Example: Fitting a GEV

Suppose x_1, . . . , x_n ∼ GEV(µ, σ, ξ) for large n. Create a histogram of counts s = (s_1, . . . , s_B). The log-likelihood function is then

  ℓ(s|µ, σ, ξ) ∝ \sum_{b=1}^{B} s_b \log [ G(a_{b+1}|µ, σ, ξ) − G(a_b|µ, σ, ξ) ]

[Figure: histogram (n = 1000, 11 bins) with standard GEV and symbolic GEV fits – good fits with moderate bin numbers.]

Computation:

◮ Optimisation of ℓ (very quick)
◮ Creation of the histogram s (slower)
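
A minimal sketch of evaluating this binned log-likelihood (my own illustration; note that scipy's genextreme parameterises the shape as c = −ξ, and the data/bins here are simulated):

```python
# Histogram ("symbolic") GEV log-likelihood:
#   ell(s | mu, sigma, xi) = sum_b s_b * log( G(a_{b+1}) - G(a_b) ).
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(0)
mu, sigma, xi = 0.0, 1.0, 0.1
x = genextreme.rvs(c=-xi, loc=mu, scale=sigma, size=100_000, random_state=rng)

counts, edges = np.histogram(x, bins=30)          # the histogram summary s
edges[0], edges[-1] = -np.inf, np.inf             # let the outer bins cover the tails

def binned_loglik(theta):
    m, s, shape = theta
    cdf = genextreme.cdf(edges, c=-shape, loc=m, scale=s)
    p = np.diff(cdf)                              # bin probabilities G(a_{b+1}) - G(a_b)
    return np.sum(counts * np.log(np.clip(p, 1e-300, None)))

print(binned_loglik([mu, sigma, xi]))             # at the true parameters
print(binned_loglik([mu + 0.5, sigma, xi]))       # lower at a mis-specified mean
```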

56/71

slide-57
SLIDE 57

MCMC for approximate data

◮ Want to fit the mixture model

  x_i ∼ \sum_{j=1}^{J} w_j N_d(µ_j, Σ_j)

◮ A typical Gibbs sampler is implemented using indicator variables z_i, so that

  x_i | z_i ∼ N_d(µ_{z_i}, Σ_{z_i}),    z_i | {w_k} ∼ Multinomial(1; w_1, . . . , w_J),

  for i = 1, . . . , N.

◮ What if N is big?

  • Higgs dataset: N = 11m, d = 26
  • Very many zi to estimate and mix over
  • Computation high
  • Expect slow mixing

◮ Can we use approximate data to speed this up?

57/71

slide-58
SLIDE 58

MCMC for approximate data

If we use histograms:

  log L(s|θ) ∝ \sum_{b=1}^{B} s_b \log \int_{B_b} g(z|θ) \, dz

◮ For all x_i in bin B_b, we have many z_i: Z_b = {z_i : x_i ∈ B_b}, with |Z_b| = big.
◮ There are only J possible values the z_i ∈ Z_b can take, so collapse these to

  (κ_1, . . . , κ_J) ∼ Multinomial(|Z_b|; w_1, . . . , w_J).

◮ A “distribution on indicators in each bin”
◮ Suddenly we only have J × B << n indicators to track
◮ Can update (κ_1, . . . , κ_J) for each bin in one step
◮ The downside is we no longer have a Gibbs sampler – so MH etc.
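
A rough sketch of this collapsed per-bin allocation (my own illustration for a univariate normal mixture: conditional on the current (w, µ, σ), I draw each bin's indicator counts in one multinomial step, with probabilities proportional to w_j times component j's mass on the bin):

```python
# Per-bin allocation counts for a J-component univariate normal mixture fitted to
# histogram counts: instead of N individual indicators, draw (kappa_1,...,kappa_J)
# for each bin, giving only J x B numbers to track.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
w, mu, sd = np.array([0.3, 0.7]), np.array([-2.0, 1.0]), np.array([1.0, 0.5])

# simulated data, reduced to a fixed-bin histogram
z = rng.choice(2, size=100_000, p=w)
x = rng.normal(mu[z], sd[z])
counts, edges = np.histogram(x, bins=np.linspace(-6, 6, 41))   # B = 40 bins

# component-j probability mass on each bin B_b
bin_mass = np.diff(norm.cdf(edges[:, None], loc=mu, scale=sd), axis=0)   # (B, J)
probs = w * bin_mass
probs /= probs.sum(axis=1, keepdims=True)

# one multinomial draw of (kappa_1, ..., kappa_J) per bin
kappa = np.array([rng.multinomial(n_b, p_b) for n_b, p_b in zip(counts, probs)])
print(kappa.sum(axis=0), counts.sum())            # allocations account for every count
```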

58/71

slide-59
SLIDE 59

MCMC for approximate data

[Figure: curve fitting with N = 10^5 and K = 10 or K = 50 components, comparing fits from the original/continuous data, histogram data with h = 1, 0.1, 0.01, 0.001, and interval data.]

◮ N = 10,000, K = 10 components, binwidths h = 1, 0.1, 0.01, 0.001
◮ Appears to perform well
◮ Also shown – mixture fitting for random intervals (not discussed)

59/71

slide-60
SLIDE 60

MCMC for approximate data

◮ n = 100,000, 5 components, 100 sampler iterations
◮ 21 × 21, 61 × 61 and 151 × 151 bins
◮ Slightly blurry for the lowest resolution

60/71

slide-61
SLIDE 61

MCMC for approximate data

[Figure: running time per iteration (log scale) against the number of data points (log scale), for 5 univariate and 5 bivariate Gaussian components: continuous data vs. histogram symbols with increasing numbers of bins.]

◮ Approximate-data computational overheads are constant in N (scaling with B)
◮ Classical data methods are exponential in N
◮ Seems quite promising.

61/71

slide-62
SLIDE 62

MCMC for approximate data

Data representation         Bins    Time (s per iteration)
Continuous data             11M     55.1
Symbolic data (3947×1635)   6.5M    563
Symbolic data (1974×818)    1.6M    117
Symbolic data (988×410)     405K    32
Symbolic data (495×206)     102K    8.64
Symbolic data (248×104)     26K     2.01

◮ Dataset: Higgs, N = 10,999,998, d = 26
◮ K = 15 mixture components (first 2 principal components: 20.3%)
◮ Time is the mean number of seconds per sampler iteration
◮ Good performance for large enough N

62/71

slide-63
SLIDE 63

Talk Outline

* Just for fun!

  • 1. The Past
  • RJMCMC convergence diagnostics
  • Adaptive MCMC
  • 2. The Present
  • Approximate Bayesian computing
  • Likelihood-free approximate Gibbs sampling

  • MCMC for approximate data
  • 3. The Future
  • of Monte Carlo methods research∗

63/71

slide-64
SLIDE 64

Future of Monte Carlo methods

“Spring Bayes” Conference predictions, Brisbane 2006:

My score: [1/3 (some good packages!) + 0 (?) + 1/2] out of 3 = 5/18 ≈ 28%

64/71

slide-65
SLIDE 65

Future of Monte Carlo methods

MCM Conference predictions, Sydney 2019:

  • 1. Quantum computing
  • In 2000: “Remember Bayes/MC methods before

single-core/thread computers?”

  • In 2015: “Remember Bayes/MC methods before

multi-core/thread computers?”

  • In 2030: “Remember Bayes/MC methods before quantum

computers?”

◮ Q computers are getting closer
◮ Q programming already exists
◮ Q Monte Carlo is already a thing
◮ New Q Monte Carlo methods exploiting Q computing strengths?

65/71

slide-66
SLIDE 66

Future of Monte Carlo methods

MCM Conference predictions, Sydney 2019

  • 2. Move away from Dirac mass-based Monte Carlo
  • Using point mass samples to approximate distributions

    \hat{π}(θ|x) ≈ \frac{1}{N} \sum_{i=1}^{N} δ_{θ^{(i)}}(θ),   θ^{(i)} ∼ π(θ|x)

    Using point masses δ_{θ^{(i)}}(θ) is not particularly efficient.

  • Can we move beyond this? E.g. use

    \hat{π}(θ|x) ≈ \frac{1}{N′} \sum_{i=1}^{N′} h(θ|β^{(i)}),   β^{(i)} ∼ q(β|x),

    where N′ << N and h is some specified function.

  • h(θ|β) = δ_{θ^{(i)}}(θ) (with θ^{(i)} = β) is a particular case.

66/71

slide-67
SLIDE 67

Future of Monte Carlo methods

MCM Conference predictions, Sydney 2019

◮ So rather than having

  E_π[g(θ)] = \int g(θ) π(θ|x) dθ ≈ \int g(θ) \left[ \frac{1}{N} \sum_{i=1}^{N} δ_{θ^{(i)}}(θ) \right] dθ = \frac{1}{N} \sum_{i=1}^{N} g(θ^{(i)}),

we would have

  E_π[g(θ)] = \int g(θ) π(θ|x) dθ ≈ \int g(θ) \left[ \frac{1}{N′} \sum_{i=1}^{N′} h(θ|β^{(i)}) \right] dθ = \frac{1}{N′} \sum_{i=1}^{N′} \int g(θ) h(θ|β^{(i)}) dθ = \frac{1}{N′} \sum_{i=1}^{N′} E_{h|β^{(i)}}[g(θ)].

◮ A good choice of h makes the expectations tractable. E.g. h(θ|β) = φ(θ; β_1, β_2²).
◮ If g(θ) = I(θ < −1), then this is averaging 0/1’s versus averaging probabilities.

67/71

slide-68
SLIDE 68

Future of Monte Carlo methods

MCM Conference predictions, Sydney 2019

◮ We know this works in principle because e.g.

  t_ν(θ; µ, σ) = \int φ(θ|µ, β) \, \mathrm{IGamma}\!\left(β \,\middle|\, \frac{ν}{2}, \frac{νσ^2}{2}\right) dβ ≈ \frac{1}{N′} \sum_{i=1}^{N′} φ(θ|µ, β^{(i)}),   β^{(i)} ∼ \mathrm{IGamma}\!\left(β \,\middle|\, \frac{ν}{2}, \frac{νσ^2}{2}\right).

◮ π(θ) = t_ν(µ, σ), h(θ|β) = φ(θ|µ, β), q(β) = IGamma(β; . . . ).
◮ Considerably more efficient.
◮ But how to generalise?
◮ Links to variational methods?
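
A quick numerical check of this example (my own sketch): estimate P(θ < −1) under a t_ν(0, 1) target either from point-mass draws of θ, or by averaging the tractable normal probabilities over draws of β:

```python
# t_nu(mu, sigma) as a scale mixture of normals:
#   theta | beta ~ N(mu, beta),  beta ~ InvGamma(nu/2, nu*sigma^2/2).
# For g(theta) = I(theta < -1), compare averaging 0/1's with averaging the
# exact per-draw probabilities E_{h|beta}[g] = Phi((-1 - mu)/sqrt(beta)).
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(0)
nu, mu, sigma, N = 5, 0.0, 1.0, 10_000

beta = 1.0 / rng.gamma(nu / 2, scale=2.0 / (nu * sigma**2), size=N)   # InvGamma draws
theta = rng.normal(mu, np.sqrt(beta))                                 # marginally t_nu

point_mass = np.mean(theta < -1)                                      # averaging 0/1's
averaged_probs = np.mean(norm.cdf((-1 - mu) / np.sqrt(beta)))         # averaging probabilities

print(point_mass, averaged_probs, t.cdf(-1, df=nu))   # both approximate the exact value,
                                                      # the second with lower variance
```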

68/71

slide-69
SLIDE 69

Future of Monte Carlo methods

MCM Conference predictions, Sydney 2019

  • 3. Even greater focus on faster approximate methods

What isn’t going away:

  • Increasing dataset size & complexity
  • Crazy/innovative ideas coming out of:
  • Computational biology (ABC, pseudo-marginal, annealing . . . )
  • Machine-learning (NN’s, GANs, variational inference . . . )
  • Physics (everything else!)

So we should be looking to:

  • Be able to handle more data, more complex models
  • Exact inference may be a luxury
  • Approximation may become the norm
  • Provide theory for the quality of these
  • Innovate with other disciplines (& fix their maths)

69/71

slide-70
SLIDE 70

Future of Monte Carlo methods

MCM Conference predictions, Sydney 2019

  • 1. Quantum computing
  • 2. Move away from Dirac mass-based Monte Carlo
  • 3. Even greater focus on faster approximate methods

“Spring Bayes” Conference predictions, Brisbane 2006:

◮ My score: ≈ 28%

⇒ 1 out of the above 3 predictions will come true. BUT WHICH ONE!?!?

70/71

slide-71
SLIDE 71

Australian Centre of Excellence for Mathematical and Statistical Frontiers in Big Data, Big Models, New Insights

Follow @ABC Research on www.twitter.com

◮ Workshop: The Future of Statistics, 31st Oct–1st Sept, UNSW Sydney
◮ Friendly challenge to MCM speakers: What will be the hot research topic at MCM in 2029/31?

THANK YOU

71/71