SLIDE 1

Maximum Likelihood Estimation for Learning Populations of Parameters

Ramya Korlakai Vinayak
Postdoctoral Researcher, Paul G. Allen School of CSE
joint work with Weihao Kong, Gregory Valiant, and Sham Kakade

Poster #189
ramya@cs.washington.edu

SLIDE 2

Motivation: Large yet Sparse Data

  • Population size is large, often hundreds of thousands or millions
  • The number of observations per individual is limited (sparse), prohibiting accurate estimation of the parameters of interest
  • Application domains: epidemiology, social sciences, psychology, medicine, biology

Example: Flu data
Suppose that for a large random subset of the population in California, we observe whether each person caught the flu in each of the last 5 years. Model person i as a coin whose bias p_i is the (unknown) probability of catching the flu; the record x_i ∈ {0, 1}^t is t = 5 tosses of that coin. With x_i = 2 heads observed, the per-individual estimate p̂_i = x_i/t = 0.4 ± 0.45 is far too noisy to be useful.

Goal: Can we learn the distribution of the biases over the population?

Why? Testing and estimating properties of the distribution is useful for downstream analysis.

SLIDE 3

Model: Non-parametric Mixture of Binomials

  • N independent coins, i = 1, 2, ..., N. Each coin has its own bias drawn from the true distribution P* (unknown): p_i ∼ P*
  • We get to observe t tosses for every coin. Observations: X_i ∼ Bin(t, p_i) ∈ {0, 1, ..., t} (e.g., x_i = 2 heads out of t = 5 tosses)
  • Given {X_i}_{i=1}^N, return an estimate P̂ of P*
  • Error metric: Wasserstein-1 (Earth Mover's) distance, W_1(P*, P̂)

[Figure: an example density of the true distribution P* on [0, 1]]

Lord 1965, 1969
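To make the generative model concrete, here is a minimal simulation sketch in Python (numpy only). The particular Beta-mixture choice of P* is an illustrative assumption, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
N, t = 100_000, 5              # N coins, t tosses each (sparse regime: t << N)

# Illustrative true distribution P*: a two-component Beta mixture on [0, 1].
comp = rng.random(N) < 0.5
p = np.where(comp, rng.beta(2, 10, N), rng.beta(8, 4, N))   # p_i ~ P* (hidden)

X = rng.binomial(t, p)         # observed: X_i ~ Bin(t, p_i) in {0, ..., t}
```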

SLIDE 4

Learning with Sparse Observations is Non-trivial

(N = number of coins; t = number of tosses per coin)

  • The empirical plug-in estimator is bad: P̂_plug-in = histogram{X_1/t, ..., X_i/t, ..., X_N/t}. When t ≪ N, it incurs an error of Θ(1/√t)
  • The setting in this work differs from the many recent works on estimating symmetric properties of a discrete distribution from sparse observations (Paninski 2003; Valiant and Valiant 2011; Jiao et al. 2015; Orlitsky et al. 2016; Acharya et al. 2017; ...)
  • Tian et al. 2017 proposed a moment-matching estimator that achieves the optimal error of O(1/t) when t < c log N. Its weakness: it fails to obtain the optimal error when t > c log N, due to the higher variance of the larger moments

What about the Maximum Likelihood Estimator?
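To see the plug-in estimator's Θ(1/√t) error numerically: for distributions on [0, 1], W_1 equals the integral of the absolute difference of the CDFs, which is easy to approximate on a grid. A self-contained sketch, where the choice of P* and the grid size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, t = 100_000, 5
p = rng.beta(2, 5, N)                  # p_i ~ P* (illustrative choice)
X = rng.binomial(t, p)                 # X_i ~ Bin(t, p_i)

def w1_on_unit_interval(a, b, m=2000):
    """Approximate W_1 between two samples on [0, 1] as the mean |CDF gap|."""
    grid = np.linspace(0, 1, m)
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.mean(np.abs(cdf_a - cdf_b))

p_hat = X / t                          # plug-in: only t + 1 distinct values
# The error does not vanish as N grows (the slide's Theta(1/sqrt(t)) behavior).
print(w1_on_unit_interval(p, p_hat))
```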

SLIDE 5

Maximum Likelihood Estimator

Sufficient statistic: the fingerprint vector h = [h_0, h_1, ..., h_s, ..., h_t], where h_s = (# coins that show s heads) / N for s = 0, 1, ..., t

[Figure: an example fingerprint, h_s plotted against s = 1, ..., 5]

P̂_mle ∈ argmin_{Q ∈ dist[0,1]} KL(observed fingerprint h, expected fingerprint under the distribution Q)

  • NOT the empirical estimator
  • Convex optimization: efficient (polynomial time)
  • Proposed in the late 1960s by Frederic Lord in the context of psychological testing. Several works study the geometry, identifiability, and uniqueness of the MLE solution (Lord 1965, 1969; Turnbull 1976; Laird 1978; Lindsay 1983; Wood 1999)

How well does the MLE recover the distribution?
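A minimal sketch of this computation: build the fingerprint, then solve a discretized version of the convex program by restricting Q to a fine grid on [0, 1] and running EM-style multiplicative updates. The grid resolution, iteration count, and EM as the solver are my choices for illustration; the slide only asserts that the problem is convex, not this particular method:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
N, t = 100_000, 5
X = rng.binomial(t, rng.beta(2, 5, N))        # observations from the model

# Fingerprint: h[s] = fraction of coins showing s heads, s = 0, ..., t.
h = np.bincount(X, minlength=t + 1) / N

# Discretize Q as weights w over a grid of candidate biases.
grid = np.linspace(0, 1, 201)
A = binom.pmf(np.arange(t + 1)[:, None], t, grid[None, :])  # A[s, k] = P(s heads | bias grid[k])
w = np.full(grid.size, 1.0 / grid.size)       # start from the uniform distribution

for _ in range(2000):
    m = A @ w                                  # expected fingerprint under current Q
    w *= A.T @ (h / np.maximum(m, 1e-300))     # EM update; KL(h, m) is non-increasing

# (grid, w) now approximates the MLE as a discrete distribution on [0, 1].
```

The multiplicative update keeps w on the probability simplex automatically, since the update's total mass sums back to one.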

SLIDE 6

Main Results: MLE is Minimax Optimal in the Sparse Regime

(N = number of coins; t = number of tosses per coin; sparse regime: t ≪ N)

Theorem 1 (non-asymptotic guarantees). With probability ≥ 1 − δ, the MLE achieves the following error bounds:

  • Small sample regime: W_1(P*, P̂_mle) = O(1/t) when t < c log N
  • Medium sample regime: W_1(P*, P̂_mle) = O(1/√(t log N)) when c log N ≤ t ≤ N^(2/9 − ε)

Theorem 2 (matching minimax lower bounds).

inf_f sup_P E[W_1(P, f(X))] ≥ Ω(1/t) ∨ Ω(1/√(t log N))
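For a rough feel of the two regimes (the two rates cross at t ≈ log N), a quick numeric comparison with all constants set to 1, purely illustrative:

```python
import math

N = 1_000_000                      # log N ~ 13.8, so the regimes cross near t = 14
for t in (5, 10, 14, 50, 200):
    small = 1 / t                              # small-sample rate, t < c log N
    medium = 1 / math.sqrt(t * math.log(N))    # medium-sample rate
    print(f"t = {t:3d}: 1/t = {small:.3f}, 1/sqrt(t log N) = {medium:.3f}")
```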

SLIDE 7

Novel Proof: Polynomial Approximations

Bernstein polynomials: f̂_t(x) = Σ_{j=0}^{t} b_j C(t, j) x^j (1 − x)^{t−j}

Key ingredient: new bounds on the coefficients of Bernstein polynomials approximating Lipschitz-1 functions on [0, 1]

Performance on Real Data

[Plots: estimated CDFs on two real datasets, political leanings and flight delays, comparing MLE, TVK17 (Tian et al. 2017), and the empirical estimator]
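To make the tool concrete, here is a small sketch that evaluates a Bernstein approximation. The classical coefficient choice b_j = f(j/t) and the particular Lipschitz-1 target are illustrative assumptions; the paper's contribution is refined bounds on such coefficients, not this evaluation routine:

```python
import numpy as np
from scipy.stats import binom

def bernstein(f, t, x):
    """Evaluate f_hat_t(x) = sum_j b_j * C(t, j) * x^j * (1 - x)^(t - j)."""
    j = np.arange(t + 1)
    b = f(j / t)                      # classical choice of coefficients
    # binom.pmf(j, t, x) is exactly C(t, j) x^j (1 - x)^(t - j)
    return binom.pmf(j[None, :], t, np.asarray(x, float)[:, None]) @ b

f = lambda u: np.abs(u - 0.5)         # a Lipschitz-1 function on [0, 1]
x = np.linspace(0, 1, 6)
print(np.c_[x, f(x), bernstein(f, 5, x)])   # approximation improves as t grows
```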

SLIDE 8

Summary

  • Learning the distribution of parameters over a population with sparse observations per individual
  • The MLE is minimax optimal even with sparse observations!
  • Novel proof: new bounds on the coefficients of Bernstein polynomials approximating Lipschitz-1 functions
  • Performance on real data: estimated CDFs on the political-leanings and flight-delay datasets (MLE, TVK17, empirical)

ramya@cs.washington.edu
Poster #189