SLIDE 1

Multitree Generalized Steppingstone Sampling – A New MCMC Method for Estimating the Marginal Likelihood of a Model

Mark T. Holder, Paul O. Lewis, David L. Swofford, and David Bryant

KU, UConn, Duke, U. Otago (NZ)

Feb 16, 2013 – Austin, TX

SLIDE 2

Context

  • We rarely know the “true” model to use for analyses.
  • Model-averaging methods can be difficult to implement and use.
  • We can use the marginal likelihood to choose between models.

SLIDE 3

The marginal likelihood is the denominator of Bayes’ Rule

p(θ | D, M) = P(D | θ, M) p(θ | M) / P(D | M)

θ is a set of parameter values in the model M. D is the data. P(D | θ, M) is the likelihood of θ. p(θ | M) is the prior of θ.

SLIDE 4

The marginal likelihood assesses the fit of the model to the data by considering all parameter values

P(D | M) = ∫ P(D | θ, M) p(θ | M) dθ

SLIDE 5

MCMC avoids the marginal likelihood calculation

p(θ∗ | D, M) / p(θ | D, M) = [P(D | θ∗, M) p(θ∗ | M) / P(D | M)] / [P(D | θ, M) p(θ | M) / P(D | M)]

p(θ∗ | D, M) / p(θ | D, M) = P(D | θ∗, M) p(θ∗ | M) / [P(D | θ, M) p(θ | M)]
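To make the cancellation concrete, here is a minimal sketch (ours, not from the talk) of a Metropolis–Hastings step: the acceptance ratio only ever touches likelihood × prior, so P(D | M) never has to be computed. The log_likelihood and log_prior callables are hypothetical stand-ins for a real model.

```python
import math
import random

def mh_step(theta, log_likelihood, log_prior, proposal_sd=0.1):
    """One Metropolis-Hastings update with a symmetric normal proposal.

    The acceptance ratio is
        [P(D|theta*) p(theta*)] / [P(D|theta) p(theta)]
    because the marginal likelihood P(D|M) appears in both the numerator
    and the denominator of the posterior ratio and cancels.
    """
    theta_star = theta + random.gauss(0.0, proposal_sd)
    log_ratio = (log_likelihood(theta_star) + log_prior(theta_star)
                 - log_likelihood(theta) - log_prior(theta))
    if math.log(random.random()) < log_ratio:
        return theta_star  # accept the proposal
    return theta           # reject: keep the current state
```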

SLIDE 6

Bayesian model selection

Bayes Factor between two models: B10 = P(D | M1) / P(D | M0). We could estimate P(D | M1) by drawing points from the prior on θ and calculating the mean likelihood.
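A sketch of that naive estimator on a toy model of our own choosing (θ ~ Normal(0, 1) prior, each datum ~ Normal(θ, 1)); nothing here is specific to the talk’s software.

```python
import math
import random

def prior_mean_estimate(data, n_draws=100_000):
    """Estimate P(D|M) = E_prior[P(D|theta)] by averaging likelihoods over
    draws from the prior (toy model: theta ~ N(0,1), x_i ~ N(theta,1))."""
    def log_lik(theta):
        return sum(-0.5 * math.log(2.0 * math.pi) - 0.5 * (x - theta) ** 2
                   for x in data)
    total = 0.0
    for _ in range(n_draws):
        theta = random.gauss(0.0, 1.0)   # a draw from the prior
        total += math.exp(log_lik(theta))
    return total / n_draws
```

The next two slides show why this breaks down in phylogenetics: nearly every prior draw lands where the likelihood is negligible.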

SLIDE 7

[Figure: sharp posterior (black) and prior (red); density vs. x]

SLIDE 8

Drawing from the prior will often miss all of the trees with high posterior density – massive sample sizes would be needed. Perhaps we can use samples from the posterior?

SLIDE 9

We can use the harmonic mean of the likelihoods of MCMC samples to estimate P(D|M1). However…

SLIDE 10

“The Harmonic Mean of the Likelihood: Worst Monte Carlo Method Ever”

A post on Dr. Radford Neal’s blog

http://radfordneal.wordpress.com/2008/08/17/the-harmonic-mean-of-the-likelihood-worst-monte-carlo-method-ever

“The total unsuitability of the harmonic mean estimator should have been apparent within an hour of its discovery.”

SLIDE 11

Harmonic mean estimator of the marginal likelihood

  • appealing because it comes “for free” after we have sampled the posterior using MCMC,
  • unfortunately the estimator can have a huge variance associated with it in some (very common) cases, for example if:
  • the vast majority of parameter space has very low likelihood, and
  • a very small region has high likelihoods.

SLIDE 12

Importance sampling to approximate a difficult target distribution

1. Simulate points from an easy distribution.
2. Reweight the points by the ratio of densities between the easy and target distributions.
3. Treat the reweighted samples as draws from the target distribution.
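A sketch of the three steps for a one-dimensional example of our own (target N(0, 0.5²), importance N(0, 1)); the self-normalized weights make the weighted sample usable for target expectations.

```python
import math
import random

def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def importance_sample(target_pdf, importance_pdf, draw, n=10_000):
    """Steps 1-3: simulate from the easy distribution, reweight each point
    by the density ratio, and return points with normalized weights."""
    xs = [draw() for _ in range(n)]
    ws = [target_pdf(x) / importance_pdf(x) for x in xs]
    total = sum(ws)
    return xs, [w / total for w in ws]

# Example: target is N(0, 0.5^2); importance distribution is N(0, 1).
xs, ws = importance_sample(lambda x: normal_pdf(x, 0.5),
                           lambda x: normal_pdf(x, 1.0),
                           lambda: random.gauss(0.0, 1.0))
# A weighted average approximates a target expectation, e.g. E[x^2] ~ 0.25.
print(sum(w * x * x for x, w in zip(xs, ws)))
```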

SLIDE 13

[Figure: density plot, x vs. density]

SLIDE 14

[Figure: importance and target densities; importance weights]

SLIDE 15

[Figure: importance and target densities; importance weights; samples from importance distribution; weighted samples]

SLIDE 16

Importance sampling

The method works well if the importance distribution is:

  • fairly similar to the target distribution, and
  • not so “tight” that it prevents sampling the full range of the target distribution.

SLIDE 17

In phylogenetics our posterior distribution is too peaked and our prior is too vague to allow us to use them in importance sampling:

[Figure: sharp posterior (black) and prior (red); density]

SLIDE 18

Steppingstone sampling uses a series of importance sampling runs

[Figure: steppingstone densities, x vs. density]

SLIDE 19

Steppingstone sampling (Xie et al. 2011, Fan et al. 2011) blends two distributions:

  • the posterior, proportional to P(D | θ, M1) p(θ | M1), and
  • a tractable reference distribution, π(θ):

pβ(θ | D, M1) = [P(D | θ, M1) p(θ | M1)]^β [π(θ)]^(1−β) / cβ

p1(θ | D, M1) is the posterior. c1 is the marginal likelihood of the model. p0(θ | D, M1) is the reference distribution. c0 is 1.
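In code, the blended density is just a log-linear interpolation between the reference and the unnormalized posterior; a minimal sketch with hypothetical log-density callables:

```python
def log_p_beta(theta, beta, log_lik, log_prior, log_ref):
    """Unnormalized log density of the steppingstone blend
    [P(D|theta) p(theta|M1)]^beta * [pi(theta)]^(1-beta).

    beta = 1: the (unnormalized) posterior, with constant c1 = P(D|M1).
    beta = 0: the reference distribution, with constant c0 = 1.
    """
    return (beta * (log_lik(theta) + log_prior(theta))
            + (1.0 - beta) * log_ref(theta))
```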

SLIDE 20

Steppingstone sampling (Xie et al. 2011, Fan et al. 2011) blends two distributions:

  • the posterior, proportional to P(D | θ, M1) p(θ | M1), and
  • a tractable reference distribution, π(θ):

pβ(θ | D, M1) = [P(D | θ, M1) p(θ | M1)]^β [π(θ)]^(1−β) / cβ

P(D | M1) = c1 / c0 = (c1 / c0.38) × (c0.38 / c0.1) × (c0.1 / c0.01) × (c0.01 / c0)

Each intermediate normalizing constant appears once in a numerator and once in a denominator, so the product telescopes; each adjacent ratio can be estimated by importance sampling.
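A sketch (our simplification; the published method includes refinements we omit) of how each ratio in the telescoping product is estimated: run MCMC at βk, average importance weights toward βk+1, and sum the log ratios.

```python
import math

def log_step_ratio(samples_k, beta_k, beta_next, log_lik, log_prior, log_ref):
    """Estimate log(c_{beta_next} / c_{beta_k}) from MCMC samples drawn at
    p_{beta_k}. Each sample's importance weight is
    (likelihood * prior / reference) ** (beta_next - beta_k)."""
    d = beta_next - beta_k
    log_ws = [d * (log_lik(t) + log_prior(t) - log_ref(t)) for t in samples_k]
    m = max(log_ws)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(w - m) for w in log_ws) / len(log_ws))

def log_marginal_likelihood(samples_by_beta, betas, log_lik, log_prior, log_ref):
    """Telescoping sum: log P(D|M) = sum of adjacent log ratios.
    betas ascend from 0.0 to 1.0; samples are needed at every beta
    except the last."""
    return sum(log_step_ratio(samples_by_beta[b0], b0, b1,
                              log_lik, log_prior, log_ref)
               for b0, b1 in zip(betas, betas[1:]))
```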

SLIDE 21

Run MCMC with different β values

[Figure: steppingstone densities, x vs. density]

SLIDE 22

P(D | M) = (P(D | M) / c0.38) × (c0.38 / c0.1) × (c0.1 / c0.01) × (c0.01 / 1)

Photo by Johan Nobel, http://www.flickr.com/photos/43147325@N08/4326713557/, downloaded from Wikimedia.

SLIDE 23

Reference distributions in Steppingstone sampling

In the original steppingstone (Xie et al. 2011) the reference is the prior. In generalized steppingstone (Fan et al. 2011) it can be any distribution that:

  • is centered around the areas with high probability,
  • is a probability distribution with a known normalizing constant, and
  • is easy to draw samples from.
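One simple way to build such a reference (a sketch of the general idea; Fan et al. 2011 use parameter-appropriate families rather than the plain normals assumed here) is to fit an independent distribution to each parameter from a pilot run:

```python
import math
import random
import statistics

class IndependentNormalReference:
    """Product-of-independent-normals reference fit to pilot MCMC samples:
    centered on high-probability regions, with a known normalizing
    constant, and easy to sample from."""

    def __init__(self, pilot_samples):
        # pilot_samples: list of parameter vectors from a pilot posterior run
        columns = list(zip(*pilot_samples))
        self.means = [statistics.mean(c) for c in columns]
        self.sds = [statistics.stdev(c) for c in columns]

    def log_pdf(self, theta):
        return sum(-0.5 * math.log(2.0 * math.pi * s * s)
                   - 0.5 * ((x - m) / s) ** 2
                   for x, m, s in zip(theta, self.means, self.sds))

    def draw(self):
        return [random.gauss(m, s) for m, s in zip(self.means, self.sds)]
```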

SLIDE 24

The log Bayes Factor for a complex model compared to a simple (true) model, estimated twice (by Fan et al., 2011) with two different seeds. Harmonic mean vs. original steppingstone:

[Figure: paired estimates (log BF, −60 to 20) for Seed 1 vs. Seed 2]

SLIDE 25

Original steppingstone vs. generalized steppingstone:

[Figure: paired estimates (log BF, −60 to 20) for Seed 1 vs. Seed 2]

Figure from Fan et al., 2011

SLIDE 26

Steppingstone sampling when the tree is not known

  • generalized steppingstone assumes a fixed tree,
  • the original steppingstone can be very slow when the tree is not known.

SLIDE 27

Tree-Centered Independent-Split-Probability (TCISP) distribution

Input: a tree with probabilities for each split (from a pilot MCMC run).
Output: a probability distribution over all tree topologies.

SLIDE 28

[Figure: a focal tree on taxa A–L with split probabilities 0.9, 0.8, 0.6, 0.5, 0.4, 0.8, 0.3, 0.9]

Input: a focal tree with split probabilities to center the distribution

SLIDE 29

[Figure: focal tree with splits marked blue or red] The tree we will draw will display the blue splits and will not display the red splits.

SLIDE 30

[Figure: tree on taxa A–L]

SLIDE 31

[Figure: tree on taxa A–L] One of the many resolutions which avoid the red splits.

SLIDE 32

[Figure: two trees on taxa A–L, side by side]

SLIDE 33

Calculating a tree’s probability

[Figure: focal tree (split probabilities 0.9, 0.8, 0.6, 0.5, 0.4, 0.8, 0.3, 0.9) and the tree to score]

SLIDE 34

Calculating a tree’s probability

[Figure: focal tree and the tree to score]

SLIDE 35

Calculating a tree’s probability

[Figure: focal tree and the tree to score, with per-split factors 0.9, 0.2, 0.4, 0.5, 0.6, 0.2, 0.7, 0.9]

SLIDE 36

Calculating a tree’s probability

[Figure: focal tree and the tree with shared splits, with per-split factors 0.9, 0.2, 0.4, 0.5, 0.6, 0.2, 0.7, 0.9]

P = 0.2 × 0.9 × 0.4 × 0.5 × 0.6 × 0.2 × 0.7 × 0.9
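A sketch of the product just illustrated, with our own data structures and a deliberate simplification: each focal split contributes p if the scored tree displays it and 1 − p if not; the full TCISP distribution additionally divides the 1 − p mass among the alternative resolutions (which is why the tree-counting algorithm on the next slide matters). The split names s1–s8 are hypothetical labels for the slide’s splits.

```python
import math

def tcisp_log_score(focal_split_probs, displayed_splits):
    """Simplified TCISP-style log probability for a candidate tree.

    focal_split_probs: dict mapping each split of the focal tree (e.g. a
        frozenset of taxon names) to its pilot-run probability.
    displayed_splits: set of splits displayed by the tree being scored.
    """
    return sum(math.log(p if split in displayed_splits else 1.0 - p)
               for split, p in focal_split_probs.items())

# Hypothetical splits carrying the slide's probabilities; displaying
# s1, s4, and s8 reproduces the slide's product under our simplification.
probs = {"s1": 0.9, "s2": 0.8, "s3": 0.6, "s4": 0.5,
         "s5": 0.4, "s6": 0.8, "s7": 0.3, "s8": 0.9}
print(math.exp(tcisp_log_score(probs, {"s1", "s4", "s8"})))
```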

SLIDE 37

Counting trees at the max. distance from a tree

  • Bryant and Steel (2009) provide an O(n⁵) algorithm.
  • Bryant contributed an O(n²) algorithm.

SLIDE 38

Multitree steppingstone sampling

  • Has been validated on small data sets.
  • Minimum running time is unknown.
  • Appears to be a practical way to approximate a model’s marginal likelihood.
  • The TCISP distribution could be useful in other contexts.
  • The method assumes unrooted, fully resolved trees.
  • Implemented in Phycas (http://www.phycas.org).

Holder, Lewis, Swofford, Bryant (in prep)

SLIDE 39

Marginal likelihoods

  • The harmonic mean estimator is not reliable.
  • AIC is preferable to the harmonic mean estimator (Baele et al. MBE 2012).
  • Thermodynamic integration (aka path sampling) is an accurate method (Lartillot and Philippe, 2006; Rodrigue and Aris-Brosou).
  • Model-jumping approaches avoid the need to calculate marginal likelihoods.
  • Steppingstone and/or path sampling are available in MrBayes, BEAST, *BEAST, Migrate, and other Bayesian software for evolutionary analyses.

SLIDE 40

Thanks!

Thanks to the organizers, NSF, and to you (for listening to stats on a Saturday morning)

SLIDE 41

This slide shows the math demonstrating that we can use the harmonic mean to estimate the marginal likelihood, P(D). Consider a model that has a discrete parameter v that can take two values (r and b), and let h be the harmonic mean of the likelihoods for parameter values sampled from the posterior distribution:

h = 1 / [ P(v = r | D) × (1 / P(D | v = r)) + P(v = b | D) × (1 / P(D | v = b)) ]

Substituting P(v | D) = P(v) P(D | v) / P(D):

h = 1 / [ (P(v = r) P(D | v = r) / P(D)) × (1 / P(D | v = r)) + (P(v = b) P(D | v = b) / P(D)) × (1 / P(D | v = b)) ]

= 1 / [ P(v = r) / P(D) + P(v = b) / P(D) ]

= P(D) / [ P(v = r) + P(v = b) ] = P(D)

SLIDE 42

The harmonic mean as an importance sampling estimator

The marginal likelihood is the expected value of the likelihood over points drawn from the prior:

P(D) = Ep(θ)[P(D|θ)] = ∫ P(D|θ) P(θ) dθ

In importance sampling, we draw parameter values θi from a different density, g(θ), but multiply the points by a weight, wi:

Ep(θ)[P(D|θ)] ≈ (1/n) Σ_{i=1}^{n} P(D|θi) wi / b

where b is a normalization constant.

SLIDE 43

The appropriate weight is the ratio of the target density to our importance density:

wi = P(θi) / g(θi)

And it turns out that the normalization constant is simply the mean of the importance weights:

b = (1/n) Σ_{i=1}^{n} P(θi) / g(θi)

Ep(θ)[P(D|θ)] ≈ [ (1/n) Σ_{i=1}^{n} P(D|θi) P(θi) / g(θi) ] / [ (1/n) Σ_{i=1}^{n} P(θi) / g(θi) ]

SLIDE 44

Ep(θ)[P(D|θ)] ≈ [ (1/n) Σ_{i=1}^{n} P(D|θi) P(θi) / g(θi) ] / [ (1/n) Σ_{i=1}^{n} P(θi) / g(θi) ]

If the importance distribution, g(θ), is the prior, then the importance weights are all 1:

Ep(θ)[P(D|θ)] ≈ [ (1/n) Σ_{i=1}^{n} P(D|θi) P(θi) / P(θi) ] / [ (1/n) Σ_{i=1}^{n} P(θi) / P(θi) ]

= (1/n) Σ_{i=1}^{n} P(D|θi)

Recall that the points θi are drawn by sampling from the importance distribution, so this sum of likelihoods is already weighted by the prior.

SLIDE 45

Ep(θ)[P(D|θ)] ≈ [ (1/n) Σ_{i=1}^{n} P(D|θi) P(θi) / g(θi) ] / [ (1/n) Σ_{i=1}^{n} P(θi) / g(θi) ]

If the importance distribution, g(θ), is the posterior, then:

Ep(θ)[P(D|θ)] ≈ [ (1/n) Σ_{i=1}^{n} P(D|θi) P(θi) / (P(D|θi) P(θi)) ] / [ (1/n) Σ_{i=1}^{n} P(θi) / (P(D|θi) P(θi)) ]

= 1 / [ (1/n) Σ_{i=1}^{n} 1 / P(D|θi) ]

= n / Σ_{i=1}^{n} (1 / P(D|θi))

This is the justification of the harmonic mean estimator of the marginal likelihood.
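A toy check of this estimator (our construction) on a conjugate normal model where ln P(D) is known exactly; even here, occasional low-likelihood posterior draws dominate the sum of 1/P(D|θi), making the estimate unstable.

```python
import math
import random

def log_harmonic_mean(log_liks):
    """log of n / sum_i 1/L_i, computed in log space for stability."""
    neg = [-ll for ll in log_liks]
    m = max(neg)
    log_sum_inv = m + math.log(sum(math.exp(v - m) for v in neg))
    return math.log(len(log_liks)) - log_sum_inv

# Toy conjugate model: theta ~ N(0,1), one datum x | theta ~ N(theta,1).
x = 2.0
# The posterior is N(x/2, 1/2); sample it directly in place of MCMC.
thetas = [random.gauss(x / 2.0, math.sqrt(0.5)) for _ in range(100_000)]
log_liks = [-0.5 * math.log(2.0 * math.pi) - 0.5 * (x - t) ** 2 for t in thetas]
print("harmonic mean estimate of ln P(D):", log_harmonic_mean(log_liks))
# Exact answer: marginally x ~ N(0, 2).
print("true ln P(D):", -0.5 * math.log(4.0 * math.pi) - x * x / 4.0)
```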

SLIDE 46

Lartillot and Philippe’s thermodynamic integration

Like the steppingstone sampler, the thermodynamic integration method (or path sampling) uses power posterior densities with 0 ≤ β ≤ 1:

pβ(θ|D) = P(D|θ)^β P(θ) / cβ

with c0 = 1 and c1 = P(D). Lartillot and Philippe showed that

∂ ln cβ / ∂β = Epβ[ ∂ ln(P(D|θ)^β P(θ)) / ∂β ]

Note that by the definition of a definite integral:

∫₀¹ (∂ ln cβ / ∂β) dβ = ln c1 − ln c0 = ln P(D)

SLIDE 47

Thus,

ln P(D) = ∫₀¹ Epβ[ ∂ ln(P(D|θ)^β P(θ)) / ∂β ] dβ

By differentiating we see that:

∂ ln(P(D|θ)^β P(θ)) / ∂β = ∂[ln P(θ) + β ln P(D|θ)] / ∂β = ln P(D|θ)

The integration is not analytically tractable, but we can calculate Epβ[ln P(D|θ)] by conducting MCMC at the power posterior distribution pβ and taking an average of the log-likelihoods. Then we can use standard numerical integration techniques to estimate the integral.
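A sketch of that final numerical step: given the average log-likelihood from an MCMC run at each β on a grid, the trapezoid rule (one standard choice among the numerical integration techniques mentioned) approximates the integral.

```python
def log_marginal_by_path_sampling(betas, mean_log_liks):
    """Trapezoid-rule estimate of
        ln P(D) = integral from 0 to 1 of E_{p_beta}[ln P(D|theta)] d(beta).

    betas: increasing grid of beta values from 0.0 to 1.0.
    mean_log_liks: the average ln P(D|theta) over the MCMC samples
        drawn at each corresponding beta.
    """
    return sum(0.5 * (y0 + y1) * (b1 - b0)
               for b0, b1, y0, y1 in zip(betas, betas[1:],
                                         mean_log_liks, mean_log_liks[1:]))
```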
