Maximum Likelihood Density Estimation under Total Positivity
Elina Robeva (MIT)
Joint work with Bernd Sturmfels, Ngoc Tran, and Caroline Uhler
arXiv:1806.10120
ICERM Workshop on Nonlinear Algebra in Applications, November 12, 2018

Density estimation

Given i.i.d. samples $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$ from an unknown distribution on $\mathbb{R}^d$ with density $p$, can we estimate $p$?

Parametric: assume that $p$ lies in some parametric family, and estimate the parameters.
• finite-dimensional problem
• can be too restrictive: the real-world distribution might not lie in the specified parametric family

Non-parametric: assume that $p$ lies in a non-parametric family, e.g. impose shape constraints on $p$ (convex, log-concave, monotone, etc.).
• infinite-dimensional problem
• need constraints that are:
  • strong enough that there is no spiky behavior
  • weak enough that the function class is large

Shape-constrained density estimation

• monotonically decreasing densities: [Grenander 1956, Rao 1969]
• convex densities: [Anevski 1994, Groeneboom, Jongbloed, and Wellner 2001]
• log-concave densities: [Cule, Samworth, and Stewart 2008]
• generalized additive models with shape constraints: [Chen and Samworth 2016]
• this talk: totally positive and log-concave densities

MTP$_2$ distributions

• A distribution with density $p$ on $\mathcal{X} \subseteq \mathbb{R}^d$ is multivariate totally positive of order 2 (or MTP$_2$) if
  $$p(x)\,p(y) \le p(x \wedge y)\,p(x \vee y) \quad \text{for all } x, y \in \mathcal{X},$$
  where $x \wedge y$ and $x \vee y$ are the componentwise minimum and maximum.
• MTP$_2$ is the same as log-supermodular:
  $$\log(p(x)) + \log(p(y)) \le \log(p(x \wedge y)) + \log(p(x \vee y)) \quad \text{for all } x, y \in \mathcal{X}.$$
• A random vector $X$ taking values in $\mathbb{R}^d$ is positively associated if for any non-decreasing functions $\varphi, \psi : \mathbb{R}^d \to \mathbb{R}$,
  $$\operatorname{cov}(\varphi(X), \psi(X)) \ge 0.$$
• MTP$_2$ implies positive association (Fortuin-Kasteleyn-Ginibre inequality, 1971).
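The MTP$_2$ inequality above can be checked by brute force when the density is replaced by a probability table on a finite grid. The following is a minimal sketch under that assumption (the slide itself concerns densities on $\mathbb{R}^d$); the function name `is_mtp2` and the two example tables are hypothetical and chosen only for illustration.

```python
import itertools

import numpy as np


def is_mtp2(p, tol=1e-12):
    """Check p[x] * p[y] <= p[x ∧ y] * p[x ∨ y] for all pairs of grid indices.

    `p` is an array of nonnegative values indexed by grid points; the meet and
    join of two indices are their componentwise minimum and maximum.
    """
    p = np.asarray(p, dtype=float)
    indices = list(np.ndindex(p.shape))
    for x, y in itertools.combinations(indices, 2):
        meet = tuple(min(a, b) for a, b in zip(x, y))
        join = tuple(max(a, b) for a, b in zip(x, y))
        if p[x] * p[y] > p[meet] * p[join] + tol:
            return False
    return True


# Hypothetical 2x2 tables: mass on the diagonal vs. mass off the diagonal.
positively_dependent = np.array([[0.4, 0.1], [0.1, 0.4]])
negatively_dependent = np.array([[0.1, 0.4], [0.4, 0.1]])

print(is_mtp2(positively_dependent))
print(is_mtp2(negatively_dependent))
```

The check is quadratic in the number of grid points, so it is only meant for small tables.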

Properties of MTP$_2$ distributions

Theorem (Fallat, Lauritzen, Sadeghi, Uhler, Wermuth and Zwiernik, 2015)
If $X = (X_1, \dots, X_d)$ is MTP$_2$, then
(i) any marginal distribution is MTP$_2$,
(ii) any conditional distribution is MTP$_2$,
(iii) $X$ has the marginal independence structure
  $$X_i \perp\!\!\!\perp X_j \iff \operatorname{cov}(X_i, X_j) = 0.$$

Theorem (Karlin and Rinott, 1980)
If $p(x) > 0$ and $p$ is MTP$_2$ for any pair of coordinates when the others are held constant, then $p$ is MTP$_2$.

Examples of MTP$_2$ distributions

• A Gaussian random vector $X \sim N(\mu, \Sigma)$ is MTP$_2$ whenever $\Sigma^{-1}$ is an M-matrix, i.e. its off-diagonal entries are nonpositive.
• The joint distribution of observed variables influenced by one hidden variable.
  [Figure: hidden variable $Z$ connected to the observed variables $X_1, \dots, X_5$]
• Very common in real data: e.g. IQ test scores, phylogenetics data, financial econometrics data, and others.
• Many models imply MTP$_2$:
  • Ferromagnetic Ising models
  • Order statistics of i.i.d. variables
  • Brownian motion tree models
  • Latent tree models (e.g. single factor analysis models)
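The Gaussian criterion in the first bullet is easy to test numerically: invert the covariance matrix and inspect the signs of the off-diagonal entries. This is a minimal sketch assuming a symmetric positive definite covariance matrix; the helper name `gaussian_is_mtp2` and the example matrices are hypothetical.

```python
import numpy as np


def gaussian_is_mtp2(cov, tol=1e-10):
    """Return True if the precision matrix has nonpositive off-diagonal entries.

    Assumes `cov` is a symmetric positive definite covariance matrix.
    """
    cov = np.asarray(cov, dtype=float)
    precision = np.linalg.inv(cov)
    off_diagonal = precision[~np.eye(cov.shape[0], dtype=bool)]
    return bool(np.all(off_diagonal <= tol))


# Hypothetical covariance matrices with correlation +0.5 and -0.5.
positive_corr = np.array([[1.0, 0.5], [0.5, 1.0]])
negative_corr = np.array([[1.0, -0.5], [-0.5, 1.0]])

print(gaussian_is_mtp2(positive_corr))
print(gaussian_is_mtp2(negative_corr))
```

In dimension two the criterion reduces to a nonnegative correlation; in higher dimensions nonnegative covariances alone are not enough, which is why the check is done on the inverse.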

Maximum Likelihood Estimation

Given i.i.d. samples $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$ with weights $w = (w_1, \dots, w_n)$ (where $w_1, \dots, w_n \ge 0$ and $\sum_{i=1}^n w_i = 1$) from a distribution with density $p$ on $\mathbb{R}^d$, can we estimate $p$?

The log-likelihood of observing $X = \{x_1, \dots, x_n\}$ with weights $w = (w_1, \dots, w_n)$, if the samples are drawn i.i.d. from $p$, is (up to an additive constant)
$$\ell_p(X, w) := \sum_{i=1}^n w_i \log(p(x_i)).$$

We would like to
$$\underset{p}{\text{maximize}} \ \sum_{i=1}^n w_i \log(p(x_i)) \quad \text{s.t. } p \text{ is an MTP}_2 \text{ density}.$$
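To make the objective concrete, the weighted log-likelihood can be evaluated for any candidate density that is available as a function. The sketch below assumes such a callable; `weighted_log_likelihood` and the standard Gaussian candidate are hypothetical names used only for illustration, not part of the talk.

```python
import numpy as np


def weighted_log_likelihood(density, samples, weights):
    """Evaluate sum_i w_i * log(p(x_i)) for a candidate density p."""
    samples = np.asarray(samples, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be nonnegative and sum to 1")
    values = np.array([density(x) for x in samples])
    return float(np.sum(weights * np.log(values)))


def standard_normal_density(x):
    """Hypothetical candidate: the standard Gaussian density on R^d."""
    x = np.asarray(x, dtype=float)
    return float(np.exp(-0.5 * x @ x) / (2.0 * np.pi) ** (x.size / 2.0))


samples = [[0.0, 0.0], [1.0, 0.0]]
weights = [0.5, 0.5]
value = weighted_log_likelihood(standard_normal_density, samples, weights)
print(value)
```

Maximizing this quantity over a whole class of densities is the hard part; the function only scores one fixed candidate.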

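Maximum Likelihood Estimation under MTP$_2$

Suppose we observe two points: $X = \{x_1, x_2\} \subset \mathbb{R}^2$. We can find a sequence of MTP$_2$ densities $p_1, p_2, p_3, \dots$ such that $\ell_{p_n}(X) \to \infty$ as $n \to \infty$.

[Figure: three panels showing the densities $p_1$, $p_2$, $p_3$ with the sample points $x_1$, $x_2$; the first panel also marks points $x$, $y$ and their componentwise minimum $x \wedge y$ and maximum $x \vee y$]

Thus, the MLE doesn't exist.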
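One way to see the unboundedness numerically is the following sketch. It uses a construction that is an assumption of this example and not necessarily the one drawn on the slide: take the hypothetical sample points $x_1 = (0, 1)$ and $x_2 = (1, 0)$ with equal weights, and let each coordinate be independent and uniform on the union of two intervals of half-width `eps` around 0 and 1. A product of marginal densities satisfies the MTP$_2$ inequality with equality, and its value at both sample points is $1/(16\,\varepsilon^2)$ whenever `eps < 0.5`.

```python
import math


def product_box_log_likelihood(eps):
    """Log-likelihood of x1 = (0, 1), x2 = (1, 0) with weights (1/2, 1/2).

    Each coordinate is uniform on [-eps, eps] ∪ [1 - eps, 1 + eps], which has
    total length 4 * eps (assuming eps < 0.5 so the intervals are disjoint).
    """
    if not 0.0 < eps < 0.5:
        raise ValueError("eps must lie in (0, 0.5)")
    marginal = 1.0 / (4.0 * eps)
    density_at_sample = marginal * marginal  # same value at x1 and x2
    return 0.5 * math.log(density_at_sample) + 0.5 * math.log(density_at_sample)


# The log-likelihood grows without bound as the boxes shrink.
values = [product_box_log_likelihood(10.0 ** -k) for k in range(1, 6)]
for k, v in enumerate(values, start=1):
    print(f"eps = 1e-{k}: log-likelihood = {v:.3f}")
```

The log-likelihood equals $-\log 16 - 2 \log \varepsilon$, so no maximizer exists within the MTP$_2$ class alone.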

Maximum Likelihood Estimation under MTP$_2$

To ensure that the likelihood function is bounded, we impose the condition that $p$ is log-concave.
$$\underset{p}{\text{maximize}} \ \sum_{i=1}^n w_i \log(p(x_i)) \quad \text{s.t. } p \text{ is an MTP}_2 \text{ density, and } p \text{ is log-concave}.$$

A function $f : \mathbb{R}^d \to [0, \infty)$ is log-concave if its logarithm is concave.

• Log-concavity is a natural assumption because it ensures the density is continuous on the interior of its support and includes many known families of parametric distributions.
• Log-concave families:
  • Gaussian; Uniform($a$, $b$); Gamma($k$, $\theta$) for $k \ge 1$; Beta($a$, $b$) for $a, b \ge 1$.
• Maximum likelihood estimation under log-concavity is a well-studied problem (Cule et al. 2008, Dümbgen et al. 2009, Schuhmacher et al. 2010, ...).
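As a quick check of one entry in the list of log-concave families, the condition $k \ge 1$ for the Gamma family can be read off from the second derivative of the log-density. The derivation below assumes the shape-scale parametrization of Gamma($k$, $\theta$), which the slide does not state explicitly.

```latex
\begin{align*}
p(x) &= \frac{x^{k-1} e^{-x/\theta}}{\Gamma(k)\,\theta^{k}}, \qquad x > 0,\\
\log p(x) &= (k-1)\log x - \frac{x}{\theta} - \log\Gamma(k) - k\log\theta,\\
\frac{d^{2}}{dx^{2}} \log p(x) &= -\frac{k-1}{x^{2}} \le 0 \ \text{ for all } x > 0 \iff k \ge 1.
\end{align*}
```

For $k < 1$ the second derivative is positive, so the log-density is convex on $(0, \infty)$ and the density is not log-concave.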

Maximum Likelihood Estimation under Log-Concavity

$$\underset{p}{\text{maximize}} \ \sum_{i=1}^n w_i \log(p(x_i)) \quad \text{s.t. } p \text{ is a density and } p \text{ is log-concave}.$$

Theorem (Cule, Samworth and Stewart 2008)
• With probability 1, a log-concave maximum likelihood estimator $\hat{p}$ exists and is unique.
• Moreover, $\log(\hat{p})$ is a "tent function" supported on the convex hull of the data, $P(X) = \operatorname{conv}(x_1, \dots, x_n)$.
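The slide does not define a tent function. Under the usual definition, given heights $y_1, \dots, y_n$ at the sample points, it is the smallest concave function $h$ with $h(x_i) \ge y_i$, set to $-\infty$ outside the convex hull. The sketch below evaluates such a function in one dimension only, assuming distinct sample points; it does not compute the MLE, and the name `tent_function_1d` and the example heights are hypothetical.

```python
import numpy as np


def tent_function_1d(xs, heights):
    """Return h, the smallest concave function with h(xs[i]) >= heights[i].

    One-dimensional sketch assuming distinct sample points. h is piecewise
    linear on [min(xs), max(xs)] and equals -inf outside that interval.
    """
    order = np.argsort(xs)
    points = [(float(xs[i]), float(heights[i])) for i in order]

    # Build the upper concave envelope with a monotone-chain scan.
    hull = []
    for px, py in points:
        while len(hull) >= 2:
            (ax, ay), (bx, by) = hull[-2], hull[-1]
            cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
            if cross >= 0:  # middle point lies on or below the chord
                hull.pop()
            else:
                break
        hull.append((px, py))

    hull_x = np.array([pt[0] for pt in hull])
    hull_y = np.array([pt[1] for pt in hull])

    def h(x):
        x = np.asarray(x, dtype=float)
        inside = (x >= hull_x[0]) & (x <= hull_x[-1])
        return np.where(inside, np.interp(x, hull_x, hull_y), -np.inf)

    return h


# Hypothetical heights: the middle point is lifted by the tent in the first case.
valley = tent_function_1d([0.0, 1.0, 2.0], [0.0, -1.0, 0.0])
peak = tent_function_1d([0.0, 1.0, 2.0], [0.0, 1.0, 0.0])
print(float(valley(1.0)), float(peak(0.5)), float(peak(3.0)))
```

In higher dimensions the same object is the upper convex hull of the lifted points $(x_i, y_i)$, which requires a convex hull routine rather than a sorted scan.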
