 
Maximum Likelihood Density Estimation under Total Positivity

Elina Robeva (MIT)
Joint work with Bernd Sturmfels, Ngoc Tran, and Caroline Uhler
arXiv:1806.10120

ICERM Workshop on Nonlinear Algebra in Applications, November 12, 2018

Density estimation

Given i.i.d. samples $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$ from an unknown distribution on $\mathbb{R}^d$ with density $p$, can we estimate $p$?

Parametric: assume that $p$ lies in some parametric family, and estimate its parameters.
• finite-dimensional problem
• too restrictive; the real-world distribution might not lie in the specified parametric family

Non-parametric: assume that $p$ lies in a non-parametric family, e.g. impose shape constraints on $p$ (convex, log-concave, monotone, etc.).
• infinite-dimensional problem
• need constraints that are:
    • strong enough so that there is no spiky behavior
    • weak enough so that the function class is large

Shape-constrained density estimation

• monotonically decreasing densities: [Grenander 1956, Rao 1969]
• convex densities: [Anevski 1994; Groeneboom, Jongbloed, and Wellner 2001]
• log-concave densities: [Cule, Samworth, and Stewart 2008]
• generalized additive models with shape constraints: [Chen and Samworth 2016]
• this talk: totally positive and log-concave densities

MTP_2 distributions

• A distribution with density $p$ on $\mathcal{X} \subseteq \mathbb{R}^d$ is multivariate totally positive of order 2 (or $\mathrm{MTP}_2$) if
  $$p(x)\,p(y) \le p(x \wedge y)\,p(x \vee y) \quad \text{for all } x, y \in \mathcal{X},$$
  where $x \wedge y$ and $x \vee y$ are the componentwise minimum and maximum.
• $\mathrm{MTP}_2$ is the same as log-supermodular:
  $$\log(p(x)) + \log(p(y)) \le \log(p(x \wedge y)) + \log(p(x \vee y)) \quad \text{for all } x, y \in \mathcal{X}.$$
• A random vector $X$ taking values in $\mathbb{R}^d$ is positively associated if for any non-decreasing functions $\varphi, \psi : \mathbb{R}^d \to \mathbb{R}$
  $$\operatorname{cov}(\varphi(X), \psi(X)) \ge 0.$$
• $\mathrm{MTP}_2$ implies positive association (Fortuin-Kasteleyn-Ginibre inequality, 1971).
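
To make the log-supermodular inequality concrete, here is a minimal sketch that tests it on random pairs of points for a bivariate Gaussian. Everything in it is an illustrative assumption rather than part of the talk: the helper names `log_density` and `is_mtp2_on_samples`, the sampling box $[-3, 3]^2$, and the two example precision matrices. A sampled check can only refute the property, it cannot prove it.

```python
import random

def log_density(x, K):
    """Unnormalized log-density of a centered 2D Gaussian with symmetric
    precision matrix K: -0.5 * x^T K x."""
    return -0.5 * (K[0][0] * x[0] ** 2 + 2.0 * K[0][1] * x[0] * x[1] + K[1][1] * x[1] ** 2)

def is_mtp2_on_samples(K, n_pairs=2000, seed=0, tol=1e-9):
    """Check log p(x) + log p(y) <= log p(x ^ y) + log p(x v y) on random pairs."""
    rng = random.Random(seed)
    for _ in range(n_pairs):
        x = (rng.uniform(-3, 3), rng.uniform(-3, 3))
        y = (rng.uniform(-3, 3), rng.uniform(-3, 3))
        lo = (min(x[0], y[0]), min(x[1], y[1]))  # componentwise minimum x ^ y
        hi = (max(x[0], y[0]), max(x[1], y[1]))  # componentwise maximum x v y
        lhs = log_density(x, K) + log_density(y, K)
        rhs = log_density(lo, K) + log_density(hi, K)
        if lhs > rhs + tol:
            return False  # found a violating pair
    return True

# Hypothetical examples: the off-diagonal sign of the precision matrix decides the outcome.
K_negative_offdiag = [[1.0, -0.5], [-0.5, 1.0]]
K_positive_offdiag = [[1.0, 0.5], [0.5, 1.0]]
passes = is_mtp2_on_samples(K_negative_offdiag)
fails = is_mtp2_on_samples(K_positive_offdiag)
```

The normalizing constant is omitted because it cancels on both sides of the inequality.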

Properties of MTP_2 distributions

Theorem (Fallat, Lauritzen, Sadeghi, Uhler, Wermuth and Zwiernik, 2015)
If $X = (X_1, \dots, X_d)$ is $\mathrm{MTP}_2$, then
(i) any marginal distribution is $\mathrm{MTP}_2$,
(ii) any conditional distribution is $\mathrm{MTP}_2$,
(iii) $X$ has the marginal independence structure
  $$X_i \perp\!\!\!\perp X_j \iff \operatorname{cov}(X_i, X_j) = 0.$$

Theorem (Karlin and Rinott, 1980)
If $p(x) > 0$ and $p$ is $\mathrm{MTP}_2$ in any pair of coordinates when the others are held constant, then $p$ is $\mathrm{MTP}_2$.

Examples of MTP_2 distributions

• A Gaussian random variable $X \sim \mathcal{N}(\mu, \Sigma)$ is $\mathrm{MTP}_2$ whenever $\Sigma^{-1}$ is an M-matrix, i.e. its off-diagonal entries are nonpositive.
• The joint distribution of observed variables influenced by one hidden variable.
  [Figure: graph with hidden variable $Z$ and observed variables $X_1, \dots, X_5$.]
• Very common in real data: e.g. IQ test scores, phylogenetics data, financial econometrics data, and others.
• Many models imply $\mathrm{MTP}_2$:
    • Ferromagnetic Ising models
    • Order statistics of i.i.d. variables
    • Brownian motion tree models
    • Latent tree models (e.g. single factor analysis models)
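
The Gaussian criterion above can be checked numerically. The following is a rough sketch, assuming a positive definite covariance matrix is given; the helpers `invert` and `precision_is_m_matrix` and both example matrices are hypothetical and chosen only for illustration.

```python
def invert(A):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    # Augment A with the identity matrix.
    M = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def precision_is_m_matrix(sigma, tol=1e-9):
    """True if all off-diagonal entries of sigma^{-1} are nonpositive (up to tol).
    Assumes sigma is symmetric positive definite; this is not verified here."""
    K = invert(sigma)
    n = len(K)
    return all(K[i][j] <= tol for i in range(n) for j in range(n) if i != j)

# Hypothetical covariance matrices.
# AR(1)-type covariance with correlation 0.5: the precision matrix is tridiagonal
# with negative off-diagonal entries.
sigma_ar1 = [[1.0, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 1.0]]
# Variables 2 and 3 are uncorrelated but both correlate with variable 1:
# the precision matrix has a positive (2, 3) entry.
sigma_star = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.0], [0.5, 0.0, 1.0]]

ar1_ok = precision_is_m_matrix(sigma_ar1)
star_ok = precision_is_m_matrix(sigma_star)
```

The second example is consistent with property (iii) of the Fallat et al. theorem: under $\mathrm{MTP}_2$, zero covariance would force independence of the two variables, which does not hold for this Gaussian.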

Maximum Likelihood Estimation

Given i.i.d. samples $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$ with weights $w = (w_1, \dots, w_n)$ (where $w_1, \dots, w_n \ge 0$, $\sum_i w_i = 1$) from a distribution $p$ on $\mathbb{R}^d$, can we estimate $p$?

The log-likelihood of observing $X = \{x_1, \dots, x_n\}$ with weights $w = (w_1, \dots, w_n)$ if they are drawn i.i.d. from $p$ is (up to an additive constant)
$$\ell_p(X, w) := \sum_{i=1}^{n} w_i \log(p(x_i)).$$

We would like to
$$\begin{aligned} \underset{p}{\text{maximize}} \quad & \sum_{i=1}^{n} w_i \log(p(x_i)) \\ \text{s.t.} \quad & p \text{ is an } \mathrm{MTP}_2 \text{ density.} \end{aligned}$$
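
As a small illustration of the objective $\ell_p(X, w)$, the sketch below evaluates the weighted log-likelihood for a candidate log-density. The function name `weighted_log_likelihood`, the standard Gaussian candidate, and the two sample points are assumptions made for the example only.

```python
import math

def weighted_log_likelihood(log_p, points, weights):
    """Compute sum_i w_i * log p(x_i) for nonnegative weights summing to 1."""
    if len(points) != len(weights):
        raise ValueError("points and weights must have the same length")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to 1")
    # Skip zero-weight points so that log p(x_i) = -inf there does not produce NaN.
    return sum(w * log_p(x) for x, w in zip(points, weights) if w > 0)

def standard_gaussian_log_pdf(x):
    """Log-density of N(0, I) in R^2 (a hypothetical candidate p)."""
    return -0.5 * (x[0] ** 2 + x[1] ** 2) - math.log(2.0 * math.pi)

points = [(0.0, 0.0), (1.0, 1.0)]
weights = [0.5, 0.5]
value = weighted_log_likelihood(standard_gaussian_log_pdf, points, weights)
# By hand: 0.5 * (-log(2*pi)) + 0.5 * (-1 - log(2*pi)) = -0.5 - log(2*pi)
```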
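
Maximum Likelihood Estimation under MTP_2

Suppose we observe two points: $X = \{x_1, x_2\} \subset \mathbb{R}^2$. We can find a sequence of $\mathrm{MTP}_2$ densities $p_1, p_2, p_3, \dots$ such that $\ell_{p_n}(X) \to \infty$ as $n \to \infty$.

[Figure: three panels showing the densities $p_1$, $p_2$, $p_3$ together with the sample points $x_1$ and $x_2$; the points $x$, $y$, $x \wedge y$, and $x \vee y$ are also marked.]

Thus, the MLE doesn't exist.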
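
One way to see the unboundedness numerically is sketched below. The construction is an assumption chosen for simplicity and is not necessarily the one drawn on the slide: each $p_\sigma$ is a product density $f_\sigma(t_1)\,g_\sigma(t_2)$, where $f_\sigma$ and $g_\sigma$ are equal-weight mixtures of two narrow 1D Gaussians centered at the coordinates of the two sample points. A product density satisfies the $\mathrm{MTP}_2$ inequality with equality, and its mass concentrates near $x_1$, $x_2$, $x_1 \wedge x_2$, and $x_1 \vee x_2$ as $\sigma \to 0$. Uniform weights $w = (1/2, 1/2)$ and the specific points are also assumptions.

```python
import math

def mixture_pdf(t, centers, sigma):
    """Equal-weight mixture of 1D Gaussians with common standard deviation sigma."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return sum(norm * math.exp(-0.5 * ((t - c) / sigma) ** 2) for c in centers) / len(centers)

def log_p(x, sigma, x1, x2):
    """Log of the product density f(x[0]) * g(x[1]) built from the two sample points."""
    f = mixture_pdf(x[0], (x1[0], x2[0]), sigma)
    g = mixture_pdf(x[1], (x1[1], x2[1]), sigma)
    return math.log(f) + math.log(g)

# Hypothetical sample points (incomparable in the componentwise order).
x1 = (0.0, 1.0)
x2 = (1.0, 0.0)

sigmas = [1.0, 0.1, 0.01, 0.001]
log_likelihoods = [
    0.5 * log_p(x1, s, x1, x2) + 0.5 * log_p(x2, s, x1, x2) for s in sigmas
]
# The values grow without bound as sigma shrinks, roughly like -2 * log(sigma).
```

Shrinking `sigma` further keeps increasing the log-likelihood, which is the numerical counterpart of the statement that no maximizer exists.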

Maximum Likelihood Estimation under MTP_2

To ensure that the likelihood function is bounded, we impose the condition that $p$ is log-concave.
$$\begin{aligned} \underset{p}{\text{maximize}} \quad & \sum_{i=1}^{n} w_i \log(p(x_i)) \\ \text{s.t.} \quad & p \text{ is an } \mathrm{MTP}_2 \text{ density, and } p \text{ is log-concave.} \end{aligned}$$

A function $f : \mathbb{R}^d \to \mathbb{R}$ is log-concave if its logarithm is concave.

• Log-concavity is a natural assumption because it ensures the density is continuous and includes many known families of parametric distributions.
• Log-concave families:
    • Gaussian; Uniform($a$, $b$); Gamma($k$, $\theta$) for $k \ge 1$; Beta($a$, $b$) for $a, b \ge 1$.
• Maximum likelihood estimation under log-concavity is a well-studied problem (Cule et al. 2008, Dümbgen et al. 2009, Schuhmacher et al. 2010, ...).
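
The condition $k \ge 1$ for the Gamma family can be illustrated with a crude grid test: a concave function has nonpositive second differences on any equally spaced grid. The sketch below is only a necessary check on a finite grid, and the helper names, the grid, and the choice $\theta = 1$ are assumptions for illustration.

```python
import math

def gamma_log_pdf(t, k, theta):
    """Log-density of Gamma(k, theta) (shape k, scale theta) at t > 0."""
    return (k - 1.0) * math.log(t) - t / theta - math.lgamma(k) - k * math.log(theta)

def is_concave_on_grid(f, grid, tol=1e-9):
    """True if all second differences of f on an equally spaced grid are <= tol."""
    vals = [f(t) for t in grid]
    return all(
        vals[i - 1] - 2.0 * vals[i] + vals[i + 1] <= tol
        for i in range(1, len(vals) - 1)
    )

# Equally spaced grid inside the support (0, infinity).
grid = [0.1 + 0.05 * i for i in range(100)]

concave_k2 = is_concave_on_grid(lambda t: gamma_log_pdf(t, 2.0, 1.0), grid)
concave_k1 = is_concave_on_grid(lambda t: gamma_log_pdf(t, 1.0, 1.0), grid)
concave_k_half = is_concave_on_grid(lambda t: gamma_log_pdf(t, 0.5, 1.0), grid)
```

The outcome matches the second derivative of the log-density, $-(k - 1)/t^2$, which is nonpositive exactly when $k \ge 1$.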

Maximum Likelihood Estimation under Log-Concavity

$$\begin{aligned} \underset{p}{\text{maximize}} \quad & \sum_{i=1}^{n} w_i \log(p(x_i)) \\ \text{s.t.} \quad & p \text{ is a density and } p \text{ is log-concave.} \end{aligned}$$

Theorem (Cule, Samworth and Stewart 2008)
• With probability 1, a log-concave maximum likelihood estimator $\hat{p}$ exists and is unique.
• Moreover, $\log(\hat{p})$ is a "tent function" supported on the convex hull of the data, $P(X) = \operatorname{conv}(x_1, \dots, x_n)$.
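
To give a feel for what a tent function looks like, here is a one-dimensional sketch. It assumes the usual description of a tent function with heights $y_1, \dots, y_n$ as the smallest concave function $h$ with $h(x_i) \ge y_i$, equal to $-\infty$ outside the convex hull of the data. The heights below are made-up inputs (in the actual MLE they are the quantities being optimized), and the helpers `upper_hull` and `tent` are hypothetical names.

```python
import math
from bisect import bisect_right

def cross(o, a, b):
    """2D cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def upper_hull(points):
    """Vertices of the upper convex hull of (x, y) points, sorted by x."""
    best = {}
    for x, y in points:
        best[x] = max(y, best.get(x, -math.inf))  # keep the highest y per x
    hull = []
    for p in sorted(best.items()):
        # Drop the last vertex while it lies on or below the chord to p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

def tent(t, hull):
    """Evaluate the tent function at t: piecewise linear on the hull, -inf outside."""
    xs = [x for x, _ in hull]
    if t < xs[0] or t > xs[-1]:
        return -math.inf
    if len(hull) == 1:
        return hull[0][1]
    i = min(bisect_right(xs, t), len(xs) - 1)
    (x0, y0), (x1, y1) = hull[i - 1], hull[i]
    return y0 + (y1 - y0) * (t - x0) / (x1 - x0)

# Hypothetical data points and heights.
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.2), (3.0, 0.0)]
hull = upper_hull(data)
value_at_2 = tent(2.0, hull)    # lies above the height 0.2 assigned to x = 2
outside = tent(-1.0, hull)      # outside conv(x_1, ..., x_n)
```

In this example the point at $x = 2$ sits strictly below the tent, so it does not contribute a vertex. Exponentiating the tent and normalizing would give a log-concave density supported on $[0, 3]$.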