
Outline: Context, Mixture models, Estimation, Application to Metagenomics

Mixture models of truncated data for estimating the number of species
Li-Thiao-Té Sébastien, Jean-Jacques Daudin, Stéphane Robin
Equipe Statistique et Génome, UMR 518 AgroParisTech / INRA MIA
19th COMPSTAT symposium, 23rd August 2010

Li-Thiao-Té S. (AgroParisTech / INRA), Mixture models / truncated data, CompStat 2010

Context

Situation: individuals are sampled from a population, then classified into species.
Goal 1: estimate the species abundance distribution.
Goal 2: estimate the number of species with no sampled individual.

Applications:
- ecological surveys: number of species of butterflies [Fisher et al., 1943]
- metagenomics (our interest: large number of unobserved species, large datasets)
- other: number of words in a language, number of unreported drug addicts

Example

Observations
Species                  A     B     C     D     E     ...
Number of individuals    10    430   10    289   3     ...

Species abundance distribution
Number of individuals    1     2     3     4     5     ...
Number of species        513   149   65    34    24    ...

[Figure: two panels titled "Frequency/Count data". x-axis: rare species to abundant species (0 to 25); y-axis: nb of observations (ticks at 1, 5, 50, 500). Labels: "Species Abundance Distribution"; "nb of missing species: 2121".]
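
The step from the observation table to the species abundance distribution is a "count of counts". A minimal sketch, assuming the observations are held in a dictionary from species label to number of individuals (the function name `abundance_distribution` is hypothetical, and only the five species shown on the slide are used):

```python
from collections import Counter


def abundance_distribution(individuals_per_species):
    """Map each abundance k to the number of species observed exactly k times.

    Species with zero individuals never appear in the data, so they are skipped.
    """
    counts = Counter(n for n in individuals_per_species.values() if n > 0)
    return dict(sorted(counts.items()))


# The five species listed on the slide (the real table continues beyond E).
observations = {"A": 10, "B": 430, "C": 10, "D": 289, "E": 3}
sad = abundance_distribution(observations)
print(sad)  # {3: 1, 10: 2, 289: 1, 430: 1}
```

The number of species with abundance 0 is exactly the entry that this table cannot contain, which is what the rest of the talk sets out to estimate.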

Sampling model

- Species abundance distribution:
  $$f(\lambda) = \sum_q \alpha_q f_q(\lambda)$$
- $X_i$ individuals are observed for species $i$; conditionally on its abundance $\lambda_i$, this number is Poisson distributed:
  $$f(X_i \mid \lambda_i) = \frac{e^{-\lambda_i} \lambda_i^{X_i}}{X_i!}$$
- Only positive numbers of individuals are recorded in the data set:
  $$f(X^+ \mid \lambda_i) = f(X_i \mid \lambda_i, X_i > 0)$$
- Truncated model ($\vartheta = \{\alpha_q, \pi_q\}$):
  $$X^+ \sim f(x, \vartheta) = \frac{\sum_{q=1}^{Q} \alpha_q f_q(x, \pi_q)}{1 - \sum_{q=1}^{Q} \alpha_q f_q(0, \pi_q)}$$
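
The truncated model can be evaluated directly once a component family is chosen. The sketch below assumes geometric components (the family named on the Bayesian model slide) with the parametrisation $P(X = x) = (1 - \pi)\pi^x$ for $x \ge 0$; that parametrisation is an assumption, not something the slides state. The function names are hypothetical, and `expected_unobserved` is the standard plug-in estimate $n\,p_0 / (1 - p_0)$, which may differ from the estimator used by the authors.

```python
def geometric_pmf(x, pi):
    """Assumed parametrisation: P(X = x) = (1 - pi) * pi**x for x = 0, 1, 2, ..."""
    return (1.0 - pi) * pi ** x


def zero_probability(alphas, pis):
    """Mixture probability of observing no individual of a species."""
    return sum(a * geometric_pmf(0, p) for a, p in zip(alphas, pis))


def truncated_mixture_pmf(x, alphas, pis):
    """f(x, theta): mixture pmf conditioned on x > 0."""
    if x < 1:
        return 0.0
    numerator = sum(a * geometric_pmf(x, p) for a, p in zip(alphas, pis))
    return numerator / (1.0 - zero_probability(alphas, pis))


def expected_unobserved(n_observed, alphas, pis):
    """Plug-in estimate of the number of species with no sampled individual."""
    p0 = zero_probability(alphas, pis)
    return n_observed * p0 / (1.0 - p0)


# Illustrative parameter values only, not estimates from the talk.
alphas, pis = [0.7, 0.3], [0.3, 0.9]
total = sum(truncated_mixture_pmf(x, alphas, pis) for x in range(1, 2000))
print(round(total, 6))
```

Summing the truncated pmf over the positive integers should return 1 up to rounding, which is a quick sanity check on the normalising denominator.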

Bayesian model

$$X^+ \sim f(x, \vartheta) = \frac{\sum_{q=1}^{Q} \alpha_q f_q(x, \pi_q)}{1 - \sum_{q=1}^{Q} \alpha_q f_q(0, \pi_q)}$$

A priori:
- $\alpha \sim \mathrm{Dirichlet}(\vec{a})$
- $\pi_q \sim \mathrm{Beta}(b_q, c_q)$
- $Z \sim \mathrm{Multinom}(\vec{a})$
- $X \mid Z = q \sim \mathrm{Geom}(\pi_q)$

Approximate a posteriori distribution:
- $\alpha \mid X \sim \mathrm{Dirichlet}(\tilde{a})$
- $\pi_q \mid X \sim \mathrm{Beta}(\tilde{b}_q, \tilde{c}_q)$

The hyperparameters $\tilde{a}$, $\tilde{b}$ and $\tilde{c}$ provide an approximation of the a posteriori distribution and hence confidence intervals.

Variational framework

Theorem. The log-likelihood can be decomposed into:
$$\log P(X) = \log \iint P(X, Z, \vartheta)\, dZ\, d\vartheta = F(X, Q) + \mathrm{KL}(Q, P(\cdot \mid X))$$
where
$$F(X, Q) := \iint Q(Z, \vartheta) \log \frac{P(X, Z, \vartheta)}{Q(Z, \vartheta)}\, dZ\, d\vartheta.$$

Consequently:
- $\log P(X) \ge F(X, Q)$
- if $Q = \arg\max F(X, Q)$ then $Q = \arg\min \mathrm{KL}(Q, P(\cdot \mid X))$
- $\arg\max F(X, Q) = P(Z, \vartheta \mid X)$
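
The decomposition can be checked numerically on a toy problem. The sketch below is an illustration under simplifying assumptions: a single discrete latent variable $Z$ with three values, no parameter $\vartheta$, and made-up joint probabilities $P(X = x_{obs}, Z = z)$. The function name `free_energy_and_kl` is hypothetical.

```python
import math


def free_energy_and_kl(joint, q):
    """Return (F, KL, log evidence) for a discrete latent variable.

    joint[z] = P(X = x_obs, Z = z); q[z] = Q(Z = z), assumed strictly positive.
    """
    evidence = sum(joint)
    posterior = [p / evidence for p in joint]
    free_energy = sum(qz * math.log(pz / qz) for pz, qz in zip(joint, q))
    kl = sum(qz * math.log(qz / post) for qz, post in zip(q, posterior))
    return free_energy, kl, math.log(evidence)


joint = [0.12, 0.03, 0.05]   # made-up values of P(X = x_obs, Z = z)
q = [0.5, 0.3, 0.2]          # an arbitrary variational distribution

F, kl, log_evidence = free_energy_and_kl(joint, q)
print(abs(F + kl - log_evidence) < 1e-12)  # decomposition holds
print(F <= log_evidence)                   # F is a lower bound
```

Replacing `q` by the exact posterior `[p / sum(joint) for p in joint]` drives the KL term to zero, so the bound becomes tight, in line with the third consequence of the theorem.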

VB-EM algorithm

Application of [Beal and Ghahramani, 2003] leads to the following update formulae:
$$a_q^{(n+1)} = a_q^{0} + \sum_i \tau_{iq}^{(n)}$$
$$b_q^{(n+1)} = b_q^{0} + \sum_i \tau_{iq}^{(n)} (X_i - 1)$$
$$c_q^{(n+1)} = c_q^{0} + \sum_i \tau_{iq}^{(n)}$$
where $\tau_{iq}^{(n)} = P_{Q_{Z_i}}(Z_i = q)$.

Consequences:
- approximate posterior distribution
- approximate non-asymptotic credibility intervals
- proposal distribution for importance sampling

[Figure: two panels comparing VB-EM, VB-EM with importance sampling (IS) and MCMC. One panel shows a parameter density (x-axis from 0.35 to 0.60); the other shows the density of the total number of species (x-axis from 4800 to 5400).]
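
The three update formulae translate into a few sums over the observations. The sketch below covers only this hyperparameter update, taking the responsibilities $\tau_{iq}$ as given; how $\tau$ is recomputed at each iteration is not spelled out on the slide and is left out here. The function name and the toy inputs are hypothetical.

```python
def update_hyperparameters(counts, tau, a0, b0, c0):
    """One VB-EM hyperparameter update, following the slide's formulae.

    counts[i] = X_i (positive count for species i)
    tau[i][q] = current probability that species i belongs to component q
    a0, b0, c0 = prior hyperparameters, one value per component
    """
    n_components = len(a0)
    a, b, c = [], [], []
    for q in range(n_components):
        weight = sum(t[q] for t in tau)
        a.append(a0[q] + weight)
        b.append(b0[q] + sum(t[q] * (x - 1) for x, t in zip(counts, tau)))
        c.append(c0[q] + weight)
    return a, b, c


# Toy inputs: three species, two components.
counts = [1, 3, 2]
tau = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
a, b, c = update_hyperparameters(counts, tau, [1.0, 1.0], [1.0, 1.0], [1.0, 1.0])
print(a, b, c)  # [2.5, 2.5] [2.0, 3.0] [2.5, 2.5]
```

Because the observed counts are at least 1, the term $X_i - 1$ is never negative, so the updated hyperparameters stay valid Beta parameters.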

Bayesian Model Averaging

Let $M$ denote the (random) number of components in the mixture model. Then the BMA model is
$$f_{BMA} = \sum_m P(M = m \mid X)\, f_m$$
where $f_m$ is the posterior density of the observations given a model with $m$ components.

The weights $P(M = m \mid X)$ can be computed with Bayes' formula:
$$P(M = m \mid X) \propto P(X \mid M = m)\, P(M = m)$$
where $P(M)$ is the a priori distribution on $M$.

The evidence $P(X \mid M)$ is hard to compute in general; the VB-EM algorithm provides the approximation
$$\log P(X) \approx F(X, Q) = \iint Q(Z, \vartheta) \log \frac{P(X, Z, \vartheta)}{Q(Z, \vartheta)}\, dZ\, d\vartheta$$
where the error term $\mathrm{KL}(Q, P(\cdot \mid X))$ has been neglected.
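
Since the evidence is only available on the log scale (through $F(X, Q)$), the weights are best normalised with a log-sum-exp shift to avoid underflow. A minimal sketch, assuming one approximate log-evidence per candidate number of components and a uniform prior on $M$ by default; the function name `bma_weights` and the input values are hypothetical.

```python
import math


def bma_weights(log_evidence, log_prior=None):
    """Posterior model weights P(M = m | X) from (approximate) log-evidences.

    log_evidence[k] approximates log P(X | M = m_k), e.g. the value of F(X, Q).
    log_prior[k] = log P(M = m_k); a uniform prior is used when omitted.
    """
    if log_prior is None:
        log_prior = [0.0] * len(log_evidence)
    scores = [le + lp for le, lp in zip(log_evidence, log_prior)]
    shift = max(scores)  # subtracting the maximum keeps exp() from underflowing
    unnormalised = [math.exp(s - shift) for s in scores]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Made-up log-evidences for models with 1 to 3 components.
weights = bma_weights([-1050.0, -1042.0, -1043.5])
print([round(w, 3) for w in weights])
```

The BMA density is then the weighted sum of the per-model densities, using these weights.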

Bayesian Model Averaging (example)

[Figure: P(X|M) against the number of components (1 to 5). Legend: BMA weights; P(X|M) IS.]

Metagenomics

- High-throughput DNA sequencing
- Complex environmental samples: soil, seawater, intestine microflora

Real dataset example

Model fit to the data (human gut microbiota [Tap et al., 2009]):

[Figure: density against species abundance (0 to 60). Legend: Data; models with 1, 2, 3, 4 and 5 components; BMA.]

Real dataset example

Estimated number of species and approximate posterior distributions:

[Figure: density against nb of species (0 to 30000). Legend: models with 1, 2, 3, 4 and 5 components; BMA.]

References

Beal, M. and Ghahramani, Z. (2003). The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayesian Statistics 7 (pp. 453–464).

Fisher, R., Corbet, A., and Williams, C. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology, 12(1):42–58.

Tap, J., Mondot, S., Levenez, F., Pelletier, E., Caron, C., Furet, J., Ugarte, E., Muñoz-Tamayo, R., Paslier, D., Nalin, R., et al. (2009). Towards the human intestinal microbiota phylogenetic core. Environmental Microbiology, 11(10):2574–2584.
