
Mixture models of truncated data for estimating the number of species



  1. Outline: Context, Mixture models, Estimation, Application to Metagenomics.
     Mixture models of truncated data for estimating the number of species.
     Li-Thiao-Té Sébastien, Jean-Jacques Daudin, Stéphane Robin.
     Équipe Statistique et Génome, UMR 518 AgroParisTech / INRA MIA.
     19th COMPSTAT symposium, 23rd August 2010.
     Li-Thiao-Té S. (AgroParisTech / INRA), Mixture models / truncated data, CompStat 2010.

  2. Context
     Situation: individuals are sampled from a population, then classified into species.
     Goal 1: estimate the species abundance distribution.
     Goal 2: estimate the number of species with no sampled individual.
     Applications: ecological surveys (number of species of butterflies [Fisher et al., 1943]); metagenomics (our interest: large number of unobserved species, large datasets); other (number of words in a language, number of unreported drug addicts).

  3. Example
     Observations:

     | Species               | A  | B   | C  | D   | E | ... |
     |-----------------------|----|-----|----|-----|---|-----|
     | Number of individuals | 10 | 430 | 10 | 289 | 3 | ... |

     Species abundance distribution:

     | Number of individuals | 1   | 2   | 3  | 4  | 5  | ... |
     |-----------------------|-----|-----|----|----|----|-----|
     | Number of species     | 513 | 149 | 65 | 34 | 24 | ... |

     [Figure: two panels titled "Frequency/Count data"; y-axis "nb of observations" (ticks 1, 5, 50, 500); x-axis from rare species to abundant species (0 to 25). Annotations: "Species Abundance Distribution", "nb of missing species: 2121".]
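
The step from the observations table to the species abundance distribution is a count of counts. A minimal sketch in Python, assuming only the five species counts visible on the slide (the function name `abundance_distribution` is an illustrative choice, not taken from the presentation):

```python
from collections import Counter

def abundance_distribution(individuals_per_species):
    """Map each abundance k to the number of species observed exactly k times."""
    # Species with zero sampled individuals never appear in the data,
    # so the resulting table starts at k = 1 (it is truncated at zero).
    return dict(sorted(Counter(individuals_per_species).items()))

# Only the five counts shown on the slide (species A to E).
observed = {"A": 10, "B": 430, "C": 10, "D": 289, "E": 3}
sad = abundance_distribution(observed.values())
print(sad)
```

The absence of a k = 0 entry is what makes the data truncated: the number of species with no sampled individual has to be estimated from the model rather than read from the table.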

  4. Sampling model
     Species abundance distribution:
     $$f(\lambda) = \sum_{q} \alpha_q f_q(\lambda)$$
     $X_i$ individuals are observed for species $i$ conditionally on its abundance $\lambda_i$ (Poisson distributed number):
     $$f(X_i \mid \lambda_i) = \exp(-\lambda_i) \frac{\lambda_i^{X_i}}{X_i!}$$
     Only positive numbers of individuals are recorded in the data set:
     $$f(X^+ \mid \lambda_i) = f(X_i \mid \lambda_i, X_i > 0)$$
     Truncated model ($\vartheta = \{\alpha_q, \pi_q\}$):
     $$X^+ \sim f(x, \vartheta) = \frac{\sum_{q=1}^{Q} \alpha_q f_q(x, \pi_q)}{1 - \sum_{q=1}^{Q} \alpha_q f_q(0, \pi_q)}$$
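
The truncated model can be sketched numerically. The following is a minimal illustration, not the authors' code: it assumes geometric components (the family named on the Bayesian model slide) with the parameterization $f_q(x, \pi_q) = (1 - \pi_q)\,\pi_q^{x}$ for $x = 0, 1, 2, \ldots$, which is an assumption, and all function names and parameter values are hypothetical.

```python
def geometric_pmf(x, pi):
    """Assumed component density: P(X = x) = (1 - pi) * pi**x for x = 0, 1, 2, ..."""
    return (1.0 - pi) * pi ** x

def mixture_pmf(x, alphas, pis):
    """Untruncated mixture: sum_q alpha_q * f_q(x, pi_q)."""
    return sum(a * geometric_pmf(x, p) for a, p in zip(alphas, pis))

def truncated_mixture_pmf(x, alphas, pis):
    """Zero-truncated mixture: f(x) / (1 - f(0)), defined for x >= 1."""
    if x < 1:
        raise ValueError("the truncated model only covers x >= 1")
    return mixture_pmf(x, alphas, pis) / (1.0 - mixture_pmf(0, alphas, pis))

def expected_unobserved(n_observed, alphas, pis):
    """Expected number of species with no sampled individual.

    Each species is seen with probability 1 - f(0), so n_observed species
    correspond to about n_observed * f(0) / (1 - f(0)) unseen ones.
    """
    p0 = mixture_pmf(0, alphas, pis)
    return n_observed * p0 / (1.0 - p0)

# Hypothetical parameter values, chosen only for illustration.
alphas, pis = [0.7, 0.3], [0.4, 0.9]
total = sum(truncated_mixture_pmf(x, alphas, pis) for x in range(1, 2000))
print(round(total, 6))
print(expected_unobserved(1000, alphas, pis))
```

With these illustrative values $f(0) = 0.45$, so 1000 observed species would suggest roughly 818 unobserved ones.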

  5. Bayesian model
     $$X^+ \sim f(x, \vartheta) = \frac{\sum_{q=1}^{Q} \alpha_q f_q(x, \pi_q)}{1 - \sum_{q=1}^{Q} \alpha_q f_q(0, \pi_q)}$$
     A priori:
     $$\alpha \sim \mathrm{Dirichlet}(\vec{a}), \quad \pi_q \sim \mathrm{Beta}(b_q, c_q), \quad Z \sim \mathrm{Multinom}(\vec{a}), \quad X \mid Z \sim \mathrm{Geom}(\pi_q)$$
     Approximate a posteriori distribution:
     $$\alpha \mid X \sim \mathrm{Dirichlet}(\tilde{a}), \quad \pi_q \mid X \sim \mathrm{Beta}(\tilde{b}_q, \tilde{c}_q)$$
     The hyperparameters $\tilde{a}$, $\tilde{b}$ and $\tilde{c}$ provide an approximation of the a posteriori distribution and hence confidence intervals.
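
One way to turn the approximate posterior into an interval is to sample from it. The sketch below is an illustration under stated assumptions rather than the authors' procedure: it assumes geometric components with $f_q(0, \pi_q) = 1 - \pi_q$, draws $\alpha$ and $\pi_q$ independently from the Dirichlet and Beta distributions, and reports an equal-tailed interval for the total number of species $n_{obs} / (1 - f(0))$. The hyperparameter values and the function names are hypothetical.

```python
import random

def sample_dirichlet(rng, a):
    """Draw from Dirichlet(a) by normalising independent Gamma(a_q, 1) draws."""
    g = [rng.gammavariate(a_q, 1.0) for a_q in a]
    total = sum(g)
    return [v / total for v in g]

def total_species_interval(rng, n_observed, a, b, c, n_draws=5000, level=0.95):
    """Equal-tailed Monte Carlo interval for n_observed / (1 - f(0))."""
    draws = []
    for _ in range(n_draws):
        alphas = sample_dirichlet(rng, a)
        pis = [rng.betavariate(b_q, c_q) for b_q, c_q in zip(b, c)]
        # Assumed parameterization: a component puts mass 1 - pi_q on x = 0.
        p0 = sum(al * (1.0 - p) for al, p in zip(alphas, pis))
        draws.append(n_observed / (1.0 - p0))
    draws.sort()
    lo = draws[int(n_draws * (1.0 - level) / 2.0)]
    hi = draws[int(n_draws * (1.0 + level) / 2.0) - 1]
    return lo, hi

# Hypothetical posterior hyperparameters for a two-component model.
rng = random.Random(0)
lo, hi = total_species_interval(rng, 1000, a=[70.0, 30.0], b=[40.0, 90.0], c=[60.0, 10.0])
print(lo, hi)
```

The interval is only as good as the approximate posterior it samples from; the seed is fixed so that repeated runs give the same numbers.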

  6. Variational framework
     Theorem. The log-likelihood can be decomposed into:
     $$\log P(X) = \log \iint P(X, Z, \vartheta) \, dZ \, d\vartheta = F(X, Q) + KL(Q, P(\cdot \mid X))$$
     where
     $$F(X, Q) := \iint Q(Z, \vartheta) \log \frac{P(X, Z, \vartheta)}{Q(Z, \vartheta)} \, dZ \, d\vartheta.$$
     Consequently:
     $\log P(X) \geq F(X, Q)$;
     if $Q = \operatorname{argmax} F(X, Q)$ then $Q = \operatorname{argmin} KL(Q, P(\cdot \mid X))$;
     $\operatorname{argmax} F(X, Q) = P(Z, \vartheta \mid X)$.
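
The decomposition can be checked numerically on a toy case. The sketch below is a simplification: it assumes a single discrete latent variable $Z$ with three values and no parameter $\vartheta$, so the integrals become sums, and the joint probabilities are hypothetical numbers.

```python
import math

# Hypothetical joint probabilities P(X = x_obs, Z = z) for three values of z.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                      # P(X)
posterior = [p / evidence for p in joint]  # P(Z | X)

def free_energy(q):
    """F(X, Q) = sum_z Q(z) * log(P(X, z) / Q(z))."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint) if qz > 0)

def kl(q, p):
    """KL(Q, P) = sum_z Q(z) * log(Q(z) / P(z))."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

q = [0.5, 0.3, 0.2]  # an arbitrary approximating distribution
gap = math.log(evidence) - (free_energy(q) + kl(q, posterior))
print(abs(gap) < 1e-12)                                          # decomposition
print(free_energy(q) <= math.log(evidence))                      # lower bound
print(abs(free_energy(posterior) - math.log(evidence)) < 1e-12)  # tight at the posterior
```

The three checks correspond to the three consequences of the theorem: the identity itself, the lower bound, and the fact that the bound is attained when $Q$ equals the posterior.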

  7. VB-EM algorithm
     Application of [Beal and Ghahramani, 2003] leads to the following update formulae:
     $$a_q^{(n+1)} = a_q^{0} + \sum_i \tau_{iq}^{(n)}$$
     $$b_q^{(n+1)} = b_q^{0} + \sum_i \tau_{iq}^{(n)} (X_i - 1)$$
     $$c_q^{(n+1)} = c_q^{0} + \sum_i \tau_{iq}^{(n)}$$
     where $\tau_{iq}^{(n)} = P_{Q_{Z_i}}(Z_i = q)$.
     Consequences: approximate posterior distribution; approximate non-asymptotic credibility intervals; proposal distribution for importance sampling.
     [Figure: two density plots comparing VB-EM, VB-EM importance sampling (VB-EM IS) and MCMC. Left: parameter density (x-axis 0.35 to 0.60). Right: density of the total number of species (x-axis 4800 to 5400).]
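
The three update formulae translate directly into code once the responsibilities $\tau_{iq}$ are available. The sketch below covers only that update step; the computation of $\tau_{iq}$ itself is not spelled out on the slide and is left out here. The function name, the toy counts and the responsibilities are hypothetical.

```python
def update_hyperparameters(x, tau, a0, b0, c0):
    """One application of the update formulae, given responsibilities tau[i][q]."""
    n_components = len(a0)
    # a_q = a0_q + sum_i tau_iq
    a = [a0[q] + sum(t[q] for t in tau) for q in range(n_components)]
    # b_q = b0_q + sum_i tau_iq * (X_i - 1)
    b = [b0[q] + sum(t[q] * (xi - 1) for xi, t in zip(x, tau))
         for q in range(n_components)]
    # c_q = c0_q + sum_i tau_iq
    c = [c0[q] + sum(t[q] for t in tau) for q in range(n_components)]
    return a, b, c

# Hypothetical toy data: three species counts and their responsibilities.
x = [1, 2, 5]
tau = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
a, b, c = update_hyperparameters(x, tau, a0=[1.0, 1.0], b0=[1.0, 1.0], c0=[1.0, 1.0])
print(a, b, c)
```

In a full VB-EM loop this step would alternate with a recomputation of `tau` from the current hyperparameters until the free energy stops increasing.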

  8. Bayesian Model Averaging
     Let $M$ denote the (random) number of components in the mixture model. Then the BMA model is
     $$f_{BMA} = \sum_m P(M = m \mid X) \, f_m$$
     where $f_m$ is the posterior density of the observations given a model with $m$ components.
     The weights $P(M = m \mid X)$ can be computed based on the Bayes formula:
     $$P(M = m \mid X) \propto P(X \mid M) \, P(M)$$
     where $P(M)$ is the a priori distribution on $M$.
     The evidence $P(X \mid M)$ is hard to compute in general; the VB-EM algorithm provides the approximation
     $$\log P(X) \approx F(X, Q) = \iint Q(Z, \vartheta) \log \frac{P(X, Z, \vartheta)}{Q(Z, \vartheta)} \, dZ \, d\vartheta$$
     where the error term $KL(Q, P(\cdot \mid X))$ has been neglected.
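
Given one free-energy value per candidate model, the weights follow from the Bayes formula after normalization. A minimal sketch, assuming a uniform prior on $M$ by default and using made-up free-energy values (the function name `bma_weights` is illustrative):

```python
import math

def bma_weights(lower_bounds, prior=None):
    """Posterior model weights from log-evidence approximations F_m.

    lower_bounds[m] stands in for log P(X | M = m); prior defaults to uniform.
    """
    n = len(lower_bounds)
    prior = prior or [1.0 / n] * n
    log_w = [f + math.log(p) for f, p in zip(lower_bounds, prior)]
    shift = max(log_w)  # subtract the maximum before exponentiating (log-sum-exp)
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

# Hypothetical free-energy values for models with 1 to 5 components.
F = [-5210.4, -5198.7, -5196.2, -5197.9, -5201.3]
weights = bma_weights(F)
print([round(w, 3) for w in weights])
```

Subtracting the maximum before exponentiating matters in practice: log-evidences of this magnitude would underflow to zero if exponentiated directly.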

  9. Bayesian Model Averaging (example)
     [Figure: "bma weights"; P(X|M) (y-axis, about 5.0e-08 to 1.5e-07) against nb of components (x-axis, 1 to 5); legend: P(X|M), IS.]

  10. Metagenomics
      High throughput DNA sequencing.
      Complex environmental samples: soil, seawater, intestine microflora.

  11. Real dataset example
      Model fit to the data (human gut microbiota [Tap et al., 2009]).
      [Figure: density (y-axis, log scale, 5e-04 to 5e-01) against species abundance (x-axis, 0 to 60); legend: Data, models with 1, 2, 3, 4, 5 components, BMA.]

  12. Real dataset example
      Estimated number of species and approximate posterior distributions.
      [Figure: density (y-axis, 0.0000 to 0.0020) against nb of species (x-axis, 0 to 30000); legend: models with 1, 2, 3, 4, 5 components, BMA.]

  13. References
      Beal, M. and Ghahramani, Z. (2003). The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayesian Statistics 7 (pp. 453–464).
      Fisher, R., Corbet, A., and Williams, C. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology, 12(1):42–58.
      Tap, J., Mondot, S., Levenez, F., Pelletier, E., Caron, C., Furet, J., Ugarte, E., Muñoz-Tamayo, R., Paslier, D., Nalin, R., et al. (2009). Towards the human intestinal microbiota phylogenetic core. Environmental Microbiology, 11(10):2574–2584.
