Mixture models of truncated data for estimating the number of species



SLIDE 1

Context Mixture models Estimation Application to Metagenomics

Mixture models of truncated data for estimating the number of species.

Li-Thiao-Té Sébastien, Jean-Jacques Daudin, Stéphane Robin

Equipe Statistique et Génome, UMR 518 AgroParisTech / INRA MIA

19th COMPSTAT symposium, 23rd August 2010

Li-Thiao-Té S. (AgroParisTech / INRA) Mixture models / truncated data CompStat 2010 1 / 12

SLIDE 2

Context

Situation: individuals are sampled from a population, then classified into species.

  • Goal 1: estimate the species abundance distribution
  • Goal 2: estimate the number of species with no sampled individual

Applications:

  • ecological surveys: number of species of butterflies [Fisher et al., 1943]
  • metagenomics (our interest: large number of unobserved species, large datasets)
  • other: number of words in a language, number of unreported drug addicts

SLIDE 3

Example

Observations:

Species                  A    B    C    D    E   ...
Number of individuals   10  430   10  289    3   ...

Species abundance distribution:

Number of individuals    1    2    3    4    5   ...
Number of species      513  149   65   34   24   ...

[Figure: Species Abundance Distribution (frequency/count data). x-axis: nb of observations, from rare species to abundant species; y-axis ticks from 1 to 500 (log scale). Annotation: nb of missing species: 2121.]
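The step from per-species counts to the species abundance distribution can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `abundance_distribution` is hypothetical, and the input holds only the five species A to E shown in the observations (the elided remainder of the data set is not filled in).

```python
from collections import Counter

def abundance_distribution(counts):
    """Map each observed count k to the number of species seen exactly k times."""
    # Species with zero sampled individuals never appear in the data,
    # so only counts k >= 1 are tabulated.
    freq = Counter(c for c in counts if c > 0)
    return dict(sorted(freq.items()))

# Only the five species listed on the slide (A, B, C, D, E).
observed = {"A": 10, "B": 430, "C": 10, "D": 289, "E": 3}
sad = abundance_distribution(observed.values())
print(sad)
```

Species A and C both have 10 individuals, so the count 10 maps to two species; every other count maps to one.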

SLIDE 4

Sampling model

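Species abundance distribution:

\[ f(\lambda) = \sum_q \alpha_q f_q(\lambda) \]

$X_i$ individuals are observed for species $i$; conditionally on its abundance $\lambda_i$, this number is Poisson distributed:

\[ f(X_i \mid \lambda_i) = \frac{e^{-\lambda_i} \lambda_i^{X_i}}{X_i!} \]

Only positive numbers of individuals are recorded in the data set:

\[ f(X^+ \mid \lambda_i) = f(X_i \mid \lambda_i, X_i > 0) \]

Truncated model ($\vartheta = \{\alpha_q, \pi_q\}$):

\[ X^+ \sim f(x, \vartheta) = \frac{\sum_q \alpha_q f_q(x, \pi_q)}{1 - \sum_q \alpha_q f_q(0, \pi_q)} \]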
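A small Python sketch may help make the truncated mixture concrete. It assumes geometric components, as in the Bayesian model of these slides (X|Z ∼ Geom(πq)), with the parametrization f_q(x, π_q) = (1 − π_q) π_q^x on x = 0, 1, 2, ...; that parametrization, the function names and the numerical values are all assumptions made for illustration. The last helper uses the standard argument that a species is observed with probability 1 − p0, which is not spelled out on the slide.

```python
def geom_pmf(x, pi):
    """Assumed component density: f_q(x, pi) = (1 - pi) * pi**x, x = 0, 1, 2, ..."""
    return (1.0 - pi) * pi ** x

def prob_unobserved(alphas, pis):
    """p0 = sum_q alpha_q f_q(0, pi_q): probability that a species yields no individual."""
    return sum(a * geom_pmf(0, p) for a, p in zip(alphas, pis))

def truncated_mixture_pmf(x, alphas, pis):
    """Zero-truncated mixture: sum_q alpha_q f_q(x, pi_q) / (1 - p0), for x >= 1."""
    if x < 1:
        return 0.0
    numerator = sum(a * geom_pmf(x, p) for a, p in zip(alphas, pis))
    return numerator / (1.0 - prob_unobserved(alphas, pis))

def estimated_total_species(n_observed, alphas, pis):
    """Assumption: each species is observed with probability 1 - p0,
    so the total number of species is roughly n_observed / (1 - p0)."""
    return n_observed / (1.0 - prob_unobserved(alphas, pis))

alphas = [0.7, 0.3]  # hypothetical mixture weights
pis = [0.5, 0.9]     # hypothetical component parameters

# The truncated density sums to one over the positive integers.
total = sum(truncated_mixture_pmf(x, alphas, pis) for x in range(1, 2000))
print(total, estimated_total_species(62, alphas, pis))
```

With these hypothetical values p0 = 0.38, so 62 observed species would correspond to about 100 species in total, 38 of them unobserved.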

SLIDE 5

Bayesian model

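\[ X^+ \sim f(x, \vartheta) = \frac{\sum_q \alpha_q f_q(x, \pi_q)}{1 - \sum_q \alpha_q f_q(0, \pi_q)} \]

A priori:

  • $\alpha \sim \mathrm{Dirichlet}(a)$
  • $\pi_q \sim \mathrm{Beta}(b_q, c_q)$
  • $Z \sim \mathrm{Multinom}(a)$
  • $X \mid Z = q \sim \mathrm{Geom}(\pi_q)$

Approximate a posteriori distribution:

  • $\alpha \mid X \sim \mathrm{Dirichlet}(\tilde{a})$
  • $\pi_q \mid X \sim \mathrm{Beta}(\tilde{b}_q, \tilde{c}_q)$

The hyperparameters $\tilde{a}$, $\tilde{b}$ and $\tilde{c}$ provide an approximation of the a posteriori distribution and hence confidence intervals.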
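To illustrate how posterior hyperparameters translate into an interval, the sketch below draws from a Beta(b̃, c̃) distribution with Python's standard library and reads off empirical quantiles. It is only an assumption-laden example: the hyperparameter values, the function name `beta_interval` and the choice of a Monte Carlo quantile (rather than an exact Beta quantile function) are not from the slides.

```python
import random

def beta_interval(b, c, level=0.95, n_draws=20000, seed=0):
    """Equal-tailed interval for pi ~ Beta(b, c), estimated from sorted random draws."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    draws = sorted(rng.betavariate(b, c) for _ in range(n_draws))
    lo_idx = int((1.0 - level) / 2.0 * n_draws)
    hi_idx = int((1.0 + level) / 2.0 * n_draws) - 1
    return draws[lo_idx], draws[hi_idx]

# Hypothetical posterior hyperparameters for one component.
b_tilde, c_tilde = 120.0, 80.0
low, high = beta_interval(b_tilde, c_tilde)
print(low, high)
```

The posterior mean here is b̃ / (b̃ + c̃) = 0.6, and the interval should bracket it.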

SLIDE 6

Variational framework

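Theorem. The log-likelihood can be decomposed into:

\[ \log P(X) = \log \int P(X, Z, \vartheta) \, dZ \, d\vartheta = F(X, Q) + KL(Q, P(\cdot \mid X)) \]

where

\[ F(X, Q) := \int \log \frac{P(X, Z, \vartheta)}{Q(Z, \vartheta)} \, Q(Z, \vartheta) \, dZ \, d\vartheta. \]

Consequently:

  • $\log P(X) \geq F(X, Q)$
  • if $Q = \arg\max F(X, Q)$ then $Q = \arg\min KL(Q, P(\cdot \mid X))$
  • $\arg\max F(X, Q) = P(Z, \vartheta \mid X)$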
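The decomposition can be checked numerically on a toy problem. In the sketch below a single discrete latent variable stands in for (Z, ϑ), so the integrals become finite sums; the joint probabilities and the distribution Q are hypothetical numbers chosen only for illustration.

```python
import math

def decomposition(joint, q):
    """joint[z] = P(X = x_obs, Z = z), q[z] = Q(Z = z).
    Returns (log P(X), F(X, Q), KL(Q, P(.|X)))."""
    evidence = sum(joint)                      # P(X) = sum_z P(X, z)
    posterior = [p / evidence for p in joint]  # P(z | X)
    free_energy = sum(qz * math.log(pz / qz)
                      for pz, qz in zip(joint, q) if qz > 0)
    kl = sum(qz * math.log(qz / post)
             for qz, post in zip(q, posterior) if qz > 0)
    return math.log(evidence), free_energy, kl

joint = [0.02, 0.05, 0.01]  # hypothetical values of P(X, Z = z)
q = [0.2, 0.5, 0.3]         # an arbitrary distribution over z
log_px, f, kl = decomposition(joint, q)
print(log_px, f + kl, kl)

# Choosing Q equal to the exact posterior makes the KL term vanish.
exact_q = [p / sum(joint) for p in joint]
_, f_star, kl_star = decomposition(joint, exact_q)
```

Because KL is non-negative, F is a lower bound on log P(X), and the bound is tight when Q is the exact posterior.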

SLIDE 7

VB-EM algorithm

Application of [Beal and Ghahramani, 2003] leads to the following update formulae:

\[
\begin{cases}
a_q^{(n+1)} = a_q^0 + \sum_i \tau_{iq}^{(n)} \\
b_q^{(n+1)} = b_q^0 + \sum_i \tau_{iq}^{(n)} (X_i - 1) \\
c_q^{(n+1)} = c_q^0 + \sum_i \tau_{iq}^{(n)}
\end{cases}
\]

where $\tau_{iq}^{(n)} = P_{Q_{Z_i}}(Z_i = q)$.

Consequences:

  • approximate posterior distribution
  • approximate non-asymptotic credibility intervals
  • proposal distribution for importance sampling

[Figure, two panels comparing VB-EM, VB-EM with importance sampling (IS) and MCMC. Left: density of a parameter (x-axis: Parameter, 0.35 to 0.60; y-axis: density). Right: density of the total number of species (x-axis: total number of species, 4800 to 5400; y-axis: Density).]
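One pass of the hyperparameter updates can be written directly from the three formulae. The sketch below takes the responsibilities τ as given; how τ is computed from the current variational distribution is not shown here, and the data, the τ values and the prior hyperparameters are hypothetical.

```python
def update_hyperparameters(x, tau, a0, b0, c0):
    """Apply the update formulae once.
    x[i]      : positive count observed for species i
    tau[i][q] : responsibility of component q for species i (assumed given)
    a0, b0, c0: prior hyperparameters, one entry per component
    """
    n, n_comp = len(x), len(a0)
    a = [a0[q] + sum(tau[i][q] for i in range(n)) for q in range(n_comp)]
    b = [b0[q] + sum(tau[i][q] * (x[i] - 1) for i in range(n)) for q in range(n_comp)]
    c = [c0[q] + sum(tau[i][q] for i in range(n)) for q in range(n_comp)]
    return a, b, c

x = [1, 1, 2, 5, 12]  # hypothetical truncated counts (all >= 1)
tau = [[0.9, 0.1], [0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.05, 0.95]]
a, b, c = update_hyperparameters(x, tau, a0=[1.0, 1.0], b0=[1.0, 1.0], c0=[1.0, 1.0])
print(a, b, c)
```

Each row of τ sums to one, so the a hyperparameters gain exactly one unit of mass per observed species in total.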

SLIDE 8

Bayesian Model Averaging

Let $M$ denote the (random) number of components in the mixture model. Then the BMA model is

\[ f_{BMA} = \sum_m P(M = m \mid X) \, f_m \]

where $f_m$ is the posterior density of the observations given a model with $m$ components.

The weights $P(M = m \mid X)$ can be computed based on the Bayes formula:

\[ P(M = m \mid X) \propto P(X \mid M = m) \, P(M = m) \]

where $P(M)$ is the a priori distribution on $M$.

The evidence $P(X \mid M)$ is hard to compute in general; the VB-EM algorithm provides the approximation

\[ \log P(X) \approx F(X, Q) = \int \log \frac{P(X, Z, \vartheta)}{Q(Z, \vartheta)} \, Q(Z, \vartheta) \, dZ \, d\vartheta \]

where the error term $KL(Q, P(\cdot \mid X))$ has been neglected.
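As a rough sketch of how the weights might be computed in practice, the code below plugs one free energy F per model into Bayes' formula in place of the log evidence and normalizes in log space for numerical stability. The free-energy values, the uniform prior on M and the function names are assumptions for illustration only.

```python
import math

def bma_weights(log_evidence, prior=None):
    """P(M = m | X) proportional to exp(log_evidence[m]) * prior[m]."""
    n_models = len(log_evidence)
    if prior is None:
        prior = [1.0 / n_models] * n_models  # assumed uniform prior on M
    log_post = [le + math.log(p) for le, p in zip(log_evidence, prior)]
    shift = max(log_post)  # subtract the maximum before exponentiating
    unnorm = [math.exp(lp - shift) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def bma_density(weights, densities, x):
    """f_BMA(x) = sum_m P(M = m | X) f_m(x)."""
    return sum(w * f(x) for w, f in zip(weights, densities))

# Hypothetical free energies F(X, Q) for models with 1 to 5 components.
free_energies = [-1050.0, -1042.0, -1040.5, -1041.0, -1043.0]
weights = bma_weights(free_energies)
print(weights)
```

Working with differences of log evidences avoids underflow, since evidences of the order of exp(-1000) cannot be represented as floating-point numbers.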

SLIDE 9

Bayesian Model Averaging (example)

[Figure: x-axis: nb of components (1 to 5); y-axis: P(X|M); legend: BMA weights, P(X|M) by importance sampling (IS).]

SLIDE 10

Metagenomics

  • High-throughput DNA sequencing
  • Complex environmental samples: soil, seawater, intestine microflora

SLIDE 11

Real dataset example

Model fit to the data (human gut microbiota [Tap et al., 2009]) :

[Figure: x-axis: species abundance (10 to 60); y-axis: density (log scale, 5e-04 to 5e-01); legend: Data, models with 1, 2, 3, 4 and 5 components, BMA.]

SLIDE 12

Real dataset example

Estimated number of species and approximate posterior distributions :

[Figure: x-axis: nb of species (5000 to 30000); y-axis: density; legend: models with 1, 2, 3, 4 and 5 components, BMA.]

SLIDE 13

Beal, M. and Ghahramani, Z. (2003). The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayesian Statistics 7 (pp. 453–464).

Fisher, R., Corbet, A., and Williams, C. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology, 12(1):42–58.

Tap, J., Mondot, S., Levenez, F., Pelletier, E., Caron, C., Furet, J., Ugarte, E., Muñoz-Tamayo, R., Paslier, D., Nalin, R., et al. (2009). Towards the human intestinal microbiota phylogenetic core. Environmental Microbiology, 11(10):2574–2584.
