

SLIDE 1

18.650 Statistics for Applications
Chapter 3: Maximum Likelihood Estimation

SLIDE 2

Total variation distance (1)

Let $(E, (\mathbb{P}_\theta)_{\theta \in \Theta})$ be a statistical model associated with a sample of i.i.d. r.v. $X_1, \ldots, X_n$. Assume that there exists $\theta^* \in \Theta$ such that $X_1 \sim \mathbb{P}_{\theta^*}$: $\theta^*$ is the true parameter.

Statistician's goal: given $X_1, \ldots, X_n$, find an estimator $\hat\theta = \hat\theta(X_1, \ldots, X_n)$ such that $\mathbb{P}_{\hat\theta}$ is close to $\mathbb{P}_{\theta^*}$ for the true parameter $\theta^*$. This means: $|\mathbb{P}_{\hat\theta}(A) - \mathbb{P}_{\theta^*}(A)|$ is small for all $A \subset E$.

Definition

The total variation distance between two probability measures $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ is defined by
$$\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \max_{A \subset E} \left| \mathbb{P}_\theta(A) - \mathbb{P}_{\theta'}(A) \right|.$$

SLIDE 3

Total variation distance (2)

Assume that $E$ is discrete (i.e., finite or countable). This includes Bernoulli, Binomial, Poisson, ... Therefore $X$ has a PMF (probability mass function): $\mathbb{P}_\theta(X = x) = p_\theta(x)$ for all $x \in E$, with
$$p_\theta(x) \ge 0, \qquad \sum_{x \in E} p_\theta(x) = 1.$$

The total variation distance between $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ is a simple function of the PMFs $p_\theta$ and $p_{\theta'}$:
$$\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \frac{1}{2} \sum_{x \in E} \left| p_\theta(x) - p_{\theta'}(x) \right|.$$
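In code, the discrete formula is a one-line sum over the common support. A minimal sketch (the `tv_distance` helper and the Bernoulli parameters are illustrative, not from the slides):

```python
# Total variation distance between two PMFs given as dicts {x: p(x)}.
def tv_distance(p, q):
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# Example: Ber(0.3) vs Ber(0.5). For Bernoulli, TV(Ber(a), Ber(b)) = |a - b|.
p = {0: 0.7, 1: 0.3}
q = {0: 0.5, 1: 0.5}
print(tv_distance(p, q))  # 0.2
```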

SLIDE 4

Total variation distance (3)

Assume that $E$ is continuous. This includes Gaussian, Exponential, ... Assume that $X$ has a density: $\mathbb{P}_\theta(X \in A) = \int_A f_\theta(x)\,dx$ for all $A \subset E$, with
$$f_\theta(x) \ge 0, \qquad \int_E f_\theta(x)\,dx = 1.$$

The total variation distance between $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ is a simple function of the densities $f_\theta$ and $f_{\theta'}$:
$$\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \frac{1}{2} \int_E \left| f_\theta(x) - f_{\theta'}(x) \right| dx.$$
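The integral can be checked numerically. A sketch using a midpoint Riemann sum for two unit-variance Gaussians (the grid bounds and step are arbitrary choices; for this case there is a closed form $\mathrm{TV} = 2\Phi(|\mu_1 - \mu_2|/(2\sigma)) - 1$ to compare against):

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def tv_gauss(mu1, mu2, sigma, lo=-10.0, hi=10.0, n=100_000):
    # Midpoint Riemann sum of (1/2) * integral |f1 - f2|.
    h = (hi - lo) / n
    return 0.5 * sum(
        abs(gauss_pdf(lo + (i + 0.5) * h, mu1, sigma)
            - gauss_pdf(lo + (i + 0.5) * h, mu2, sigma)) * h
        for i in range(n)
    )

# Closed form for equal variances: 2*Phi(0.5) - 1 = erf(0.5/sqrt(2)) ≈ 0.3829.
print(tv_gauss(0.0, 1.0, 1.0))
```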

SLIDE 5

Total variation distance (4)

Properties of total variation:

◮ $\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \mathrm{TV}(\mathbb{P}_{\theta'}, \mathbb{P}_\theta)$ (symmetric)
◮ $\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \ge 0$ (nonnegative)
◮ If $\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = 0$ then $\mathbb{P}_\theta = \mathbb{P}_{\theta'}$ (definite)
◮ $\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \le \mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta''}) + \mathrm{TV}(\mathbb{P}_{\theta''}, \mathbb{P}_{\theta'})$ (triangle inequality)

These imply that the total variation is a distance between probability distributions.

SLIDE 6

Total variation distance (5)

An estimation strategy: build an estimator $\widehat{\mathrm{TV}}(\mathbb{P}_\theta, \mathbb{P}_{\theta^*})$ for all $\theta \in \Theta$. Then find $\hat\theta$ that minimizes the function $\theta \mapsto \widehat{\mathrm{TV}}(\mathbb{P}_\theta, \mathbb{P}_{\theta^*})$.

SLIDE 7

Total variation distance (5)

An estimation strategy: build an estimator $\widehat{\mathrm{TV}}(\mathbb{P}_\theta, \mathbb{P}_{\theta^*})$ for all $\theta \in \Theta$. Then find $\hat\theta$ that minimizes the function $\theta \mapsto \widehat{\mathrm{TV}}(\mathbb{P}_\theta, \mathbb{P}_{\theta^*})$.

Problem: it is unclear how to build $\widehat{\mathrm{TV}}(\mathbb{P}_\theta, \mathbb{P}_{\theta^*})$!

SLIDE 8

Kullback-Leibler (KL) divergence (1)

There are many distances between probability measures that could replace total variation. Let us choose one that is more convenient.

Definition

The Kullback-Leibler (KL) divergence between two probability measures $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ is defined by
$$\mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) =
\begin{cases}
\displaystyle \sum_{x \in E} p_\theta(x) \log\left( \frac{p_\theta(x)}{p_{\theta'}(x)} \right) & \text{if } E \text{ is discrete} \\[2ex]
\displaystyle \int_E f_\theta(x) \log\left( \frac{f_\theta(x)}{f_{\theta'}(x)} \right) dx & \text{if } E \text{ is continuous}
\end{cases}$$
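A sketch of the discrete branch of the definition, again with illustrative Bernoulli PMFs; note that the two orderings of the arguments give different values:

```python
import math

# KL divergence between two PMFs on a common support
# (q must not vanish where p > 0).
def kl_divergence(p, q):
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

p = {0: 0.7, 1: 0.3}
q = {0: 0.5, 1: 0.5}
print(kl_divergence(p, q), kl_divergence(q, p))  # two different nonnegative numbers
```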

SLIDE 9

Kullback-Leibler (KL) divergence (2)

Properties of KL-divergence:

◮ $\mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \ne \mathrm{KL}(\mathbb{P}_{\theta'}, \mathbb{P}_\theta)$ in general (not symmetric)
◮ $\mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \ge 0$
◮ If $\mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = 0$ then $\mathbb{P}_\theta = \mathbb{P}_{\theta'}$ (definite)
◮ $\mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \not\le \mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta''}) + \mathrm{KL}(\mathbb{P}_{\theta''}, \mathbb{P}_{\theta'})$ in general (no triangle inequality)

Hence KL is not a distance; it is called a divergence. The asymmetry is the key to our ability to estimate it!

SLIDE 10

Kullback-Leibler (KL) divergence (3)

$$\mathrm{KL}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta) = \mathbb{E}_{\theta^*}\left[ \log\left( \frac{p_{\theta^*}(X)}{p_\theta(X)} \right) \right] = \mathbb{E}_{\theta^*}[\log p_{\theta^*}(X)] - \mathbb{E}_{\theta^*}[\log p_\theta(X)]$$

So the function $\theta \mapsto \mathrm{KL}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta)$ is of the form:
$$\text{“constant”} - \mathbb{E}_{\theta^*}[\log p_\theta(X)]$$

The expectation can be estimated (by the LLN):
$$\mathbb{E}_{\theta^*}[h(X)] \approx \frac{1}{n} \sum_{i=1}^n h(X_i)$$

Hence the estimator
$$\widehat{\mathrm{KL}}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta) = \text{“constant”} - \frac{1}{n} \sum_{i=1}^n \log p_\theta(X_i).$$
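The only estimable part is the second expectation, and the LLN step is just a sample average. A sketch with a hypothetical Bernoulli model (true parameter $p^* = 0.6$, candidate $p = 0.4$; both values are illustrative):

```python
import math, random

random.seed(0)

p_star, p = 0.6, 0.4           # true parameter and a candidate parameter
n = 100_000
sample = [1 if random.random() < p_star else 0 for _ in range(n)]

def log_pmf(x, p):
    return math.log(p) if x == 1 else math.log(1.0 - p)

# LLN estimate of E_{p*}[log p_theta(X)] vs. its exact value.
estimate = sum(log_pmf(x, p) for x in sample) / n
exact = p_star * math.log(p) + (1 - p_star) * math.log(1 - p)
print(estimate, exact)  # agree up to Monte Carlo error
```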

SLIDE 11

Kullback-Leibler (KL) divergence (4)

$$\begin{aligned}
\min_{\theta \in \Theta} \widehat{\mathrm{KL}}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta)
&\Leftrightarrow \min_{\theta \in \Theta} \; \text{“constant”} - \frac{1}{n} \sum_{i=1}^n \log p_\theta(X_i) \\
&\Leftrightarrow \max_{\theta \in \Theta} \; \frac{1}{n} \sum_{i=1}^n \log p_\theta(X_i) \\
&\Leftrightarrow \max_{\theta \in \Theta} \; \log \prod_{i=1}^n p_\theta(X_i) \\
&\Leftrightarrow \max_{\theta \in \Theta} \; \prod_{i=1}^n p_\theta(X_i)
\end{aligned}$$

This is the maximum likelihood principle.
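The chain of equivalences can be checked numerically: over a grid of candidate parameters, the maximizer of the average log-likelihood is the same point that would minimize the plug-in KL estimate, since the two differ only by a constant and a sign. A sketch (seed, sample size, and grid are arbitrary choices):

```python
import math, random

random.seed(1)

p_star, n = 0.3, 5_000
xs = [1 if random.random() < p_star else 0 for _ in range(n)]

def avg_log_lik(p):
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in xs) / n

# Since KL-hat(p) = "constant" - avg_log_lik(p), maximizing the average
# log-likelihood and minimizing the estimated KL pick the same parameter.
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=avg_log_lik)
print(best)  # lands next to the sample mean, near p_star
```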

SLIDE 12

Interlude: maximizing/minimizing functions (1)

Note that
$$\min_{\theta \in \Theta} -h(\theta) \Leftrightarrow \max_{\theta \in \Theta} h(\theta)$$

In this class, we focus on maximization. Maximization of arbitrary functions can be difficult. Example: $\theta \mapsto \prod_{i=1}^n (\theta - X_i)$.

SLIDE 13

Interlude: maximizing/minimizing functions (2)

Definition

A twice differentiable function $h : \Theta \subset \mathbb{R} \to \mathbb{R}$ is said to be concave if its second derivative satisfies
$$h''(\theta) \le 0, \quad \forall \theta \in \Theta.$$
It is said to be strictly concave if the inequality is strict: $h''(\theta) < 0$.

Moreover, $h$ is said to be (strictly) convex if $-h$ is (strictly) concave, i.e. $h''(\theta) \ge 0$ ($h''(\theta) > 0$).

Examples:

◮ $\Theta = \mathbb{R}$, $h(\theta) = -\theta^2$ (strictly concave)
◮ $\Theta = (0, \infty)$, $h(\theta) = \sqrt{\theta}$ (strictly concave)
◮ $\Theta = (0, \infty)$, $h(\theta) = \log \theta$ (strictly concave)
◮ $\Theta = [0, \pi]$, $h(\theta) = \sin(\theta)$ (concave)
◮ $\Theta = \mathbb{R}$, $h(\theta) = 2\theta - 3$ (both concave and convex, neither strictly)

SLIDE 14

Interlude: maximizing/minimizing functions (3)

More generally for a multivariate function $h : \Theta \subset \mathbb{R}^d \to \mathbb{R}$, $d \ge 2$, define the

◮ gradient vector:
$$\nabla h(\theta) = \begin{pmatrix} \frac{\partial h}{\partial \theta_1}(\theta) \\ \vdots \\ \frac{\partial h}{\partial \theta_d}(\theta) \end{pmatrix} \in \mathbb{R}^d$$

◮ Hessian matrix:
$$\nabla^2 h(\theta) = \begin{pmatrix} \frac{\partial^2 h}{\partial \theta_1 \partial \theta_1}(\theta) & \cdots & \frac{\partial^2 h}{\partial \theta_1 \partial \theta_d}(\theta) \\ \vdots & & \vdots \\ \frac{\partial^2 h}{\partial \theta_d \partial \theta_1}(\theta) & \cdots & \frac{\partial^2 h}{\partial \theta_d \partial \theta_d}(\theta) \end{pmatrix} \in \mathbb{R}^{d \times d}$$

$h$ is concave $\Leftrightarrow$ $x^\top \nabla^2 h(\theta)\, x \le 0$ for all $x \in \mathbb{R}^d$, $\theta \in \Theta$.
$h$ is strictly concave $\Leftrightarrow$ $x^\top \nabla^2 h(\theta)\, x < 0$ for all $x \in \mathbb{R}^d \setminus \{0\}$, $\theta \in \Theta$.

Examples:

◮ $\Theta = \mathbb{R}^2$, $h(\theta) = -\theta_1^2 - 2\theta_2^2$ or $h(\theta) = -(\theta_1 - \theta_2)^2$
◮ $\Theta = (0, \infty)^2$, $h(\theta) = \log(\theta_1 + \theta_2)$

SLIDE 15

Interlude: maximizing/minimizing functions (4)

Strictly concave functions are easy to maximize: if they have a maximum, then it is unique. It is the unique solution to
$$h'(\theta) = 0,$$
or, in the multivariate case,
$$\nabla h(\theta) = 0 \in \mathbb{R}^d.$$

There are many algorithms to find it numerically: this is the theory of “convex optimization”. In this class, there will often be a closed-form formula for the maximum.
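One standard numerical routine is Newton's method applied to the first-order condition $h'(\theta) = 0$. A sketch on the strictly concave toy function $h(\theta) = \log\theta - \theta$ (maximum at $\theta = 1$); the function and starting point are illustrative, not from the slides:

```python
# Newton's method for maximizing a strictly concave h: iterate on h'(θ) = 0.
def newton_maximize(h_prime, h_second, theta0, tol=1e-10, max_iter=100):
    theta = theta0
    for _ in range(max_iter):
        step = h_prime(theta) / h_second(theta)  # Newton step for the root of h'
        theta -= step
        if abs(step) < tol:
            return theta
    return theta

# h(θ) = log θ - θ on (0, ∞): h'(θ) = 1/θ - 1, h''(θ) = -1/θ² < 0.
theta_hat = newton_maximize(lambda t: 1.0 / t - 1.0, lambda t: -1.0 / t ** 2, theta0=0.5)
print(theta_hat)  # ≈ 1.0
```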

SLIDE 16

Likelihood, Discrete case (1)

Let $(E, (\mathbb{P}_\theta)_{\theta \in \Theta})$ be a statistical model associated with a sample of i.i.d. r.v. $X_1, \ldots, X_n$. Assume that $E$ is discrete (i.e., finite or countable).

Definition

The likelihood of the model is the map $L_n$ (or just $L$) defined as:
$$L_n : E^n \times \Theta \to \mathbb{R}, \qquad (x_1, \ldots, x_n, \theta) \mapsto \mathbb{P}_\theta[X_1 = x_1, \ldots, X_n = x_n].$$

SLIDE 17

Likelihood, Discrete case (2)

Example 1 (Bernoulli trials): If $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \mathrm{Ber}(p)$ for some $p \in (0, 1)$:

◮ $E = \{0, 1\}$;
◮ $\Theta = (0, 1)$;
◮ $\forall (x_1, \ldots, x_n) \in \{0, 1\}^n$, $\forall p \in (0, 1)$,
$$L(x_1, \ldots, x_n, p) = \prod_{i=1}^n \mathbb{P}_p[X_i = x_i] = \prod_{i=1}^n p^{x_i} (1 - p)^{1 - x_i} = p^{\sum_{i=1}^n x_i} (1 - p)^{n - \sum_{i=1}^n x_i}.$$
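A sketch putting the Bernoulli likelihood to work: maximize the log of the product form above over a grid of $p$ values and compare with the sample mean (seed, sample size, and grid resolution are arbitrary choices):

```python
import math, random

random.seed(2)

# Bernoulli log-likelihood from the product form:
# log L = (Σ x_i) log p + (n - Σ x_i) log(1 - p).
def bernoulli_log_lik(xs, p):
    s = sum(xs)
    return s * math.log(p) + (len(xs) - s) * math.log(1.0 - p)

xs = [1 if random.random() < 0.3 else 0 for _ in range(10_000)]
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: bernoulli_log_lik(xs, p))
print(p_hat, sum(xs) / len(xs))  # grid maximizer sits next to the sample mean
```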

SLIDE 18

Likelihood, Discrete case (3)

Example 2 (Poisson model): If $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \mathrm{Poiss}(\lambda)$ for some $\lambda > 0$:

◮ $E = \mathbb{N}$;
◮ $\Theta = (0, \infty)$;
◮ $\forall (x_1, \ldots, x_n) \in \mathbb{N}^n$, $\forall \lambda > 0$,
$$L(x_1, \ldots, x_n, \lambda) = \prod_{i=1}^n \mathbb{P}_\lambda[X_i = x_i] = \prod_{i=1}^n e^{-\lambda} \frac{\lambda^{x_i}}{x_i!} = e^{-n\lambda} \frac{\lambda^{\sum_{i=1}^n x_i}}{x_1! \cdots x_n!}.$$
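The same grid-search check for the Poisson model; the `poisson_sample` helper (Knuth's multiplication method) is an illustrative sampler, not part of the slides:

```python
import math, random

random.seed(3)

def poisson_sample(lam):
    # Knuth's multiplication method for one Poisson(λ) draw.
    threshold, k, prod = math.exp(-lam), 0, random.random()
    while prod > threshold:
        k += 1
        prod *= random.random()
    return k

# Poisson log-likelihood: log L = -nλ + (Σ x_i) log λ - Σ log(x_i!).
def poisson_log_lik(xs, lam):
    return (-len(xs) * lam + sum(xs) * math.log(lam)
            - sum(math.lgamma(x + 1) for x in xs))

xs = [poisson_sample(4.0) for _ in range(5_000)]
grid = [i / 100 for i in range(300, 501)]
lam_hat = max(grid, key=lambda lam: poisson_log_lik(xs, lam))
print(lam_hat, sum(xs) / len(xs))  # grid maximizer ≈ sample mean
```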

SLIDE 19

Likelihood, Continuous case (1)

Let $(E, (\mathbb{P}_\theta)_{\theta \in \Theta})$ be a statistical model associated with a sample of i.i.d. r.v. $X_1, \ldots, X_n$. Assume that all the $\mathbb{P}_\theta$ have density $f_\theta$.

Definition

The likelihood of the model is the map $L$ defined as:
$$L : E^n \times \Theta \to \mathbb{R}, \qquad (x_1, \ldots, x_n, \theta) \mapsto \prod_{i=1}^n f_\theta(x_i).$$

SLIDE 20

Likelihood, Continuous case (2)

Example 1 (Gaussian model): If $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$ for some $\mu \in \mathbb{R}$, $\sigma^2 > 0$:

◮ $E = \mathbb{R}$;
◮ $\Theta = \mathbb{R} \times (0, \infty)$;
◮ $\forall (x_1, \ldots, x_n) \in \mathbb{R}^n$, $\forall (\mu, \sigma^2) \in \mathbb{R} \times (0, \infty)$,
$$L(x_1, \ldots, x_n, \mu, \sigma^2) = \frac{1}{(\sigma \sqrt{2\pi})^n} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right).$$
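A sketch checking the Gaussian likelihood numerically: taking $\hat\mu = \bar x$ and $\hat\sigma^2$ equal to the sample variance, any perturbation of either coordinate can only lower the log-likelihood (the sample parameters and perturbation sizes are arbitrary):

```python
import math, random

random.seed(4)

# Gaussian log-likelihood: log L = -n log(σ√(2π)) - (1/(2σ²)) Σ (x_i - µ)².
def gaussian_log_lik(xs, mu, sigma2):
    return (-0.5 * len(xs) * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

xs = [random.gauss(1.5, 2.0) for _ in range(20_000)]
mu_hat = sum(xs) / len(xs)                             # maximizer in µ
s2_hat = sum((x - mu_hat) ** 2 for x in xs) / len(xs)  # maximizer in σ²

# Perturbing either coordinate can only decrease the log-likelihood:
top = gaussian_log_lik(xs, mu_hat, s2_hat)
assert top >= gaussian_log_lik(xs, mu_hat + 0.1, s2_hat)
assert top >= gaussian_log_lik(xs, mu_hat, s2_hat + 0.5)
print(mu_hat, s2_hat)  # ≈ 1.5 and ≈ 4.0
```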

SLIDE 21

Maximum likelihood estimator (1)

Let $X_1, \ldots, X_n$ be an i.i.d. sample associated with a statistical model $(E, (\mathbb{P}_\theta)_{\theta \in \Theta})$ and let $L$ be the corresponding likelihood.

Definition

The maximum likelihood estimator of $\theta$ is defined as:
$$\hat\theta_n^{\mathrm{MLE}} = \mathop{\mathrm{argmax}}_{\theta \in \Theta} L(X_1, \ldots, X_n, \theta),$$
provided it exists.

Remark (log-likelihood): In practice, we use the fact that
$$\hat\theta_n^{\mathrm{MLE}} = \mathop{\mathrm{argmax}}_{\theta \in \Theta} \log L(X_1, \ldots, X_n, \theta).$$

SLIDE 22

Maximum likelihood estimator (2)

Examples:

◮ Bernoulli trials: $\hat p_n^{\mathrm{MLE}} = \bar X_n$.
◮ Poisson model: $\hat\lambda_n^{\mathrm{MLE}} = \bar X_n$.
◮ Gaussian model: $(\hat\mu_n, \hat\sigma_n^2) = (\bar X_n, \hat S_n)$.

SLIDE 23

Maximum likelihood estimator (3)

Definition: Fisher information

Define the log-likelihood for one observation as:
$$\ell(\theta) = \log L_1(X, \theta), \quad \theta \in \Theta \subset \mathbb{R}^d.$$
Assume that $\ell$ is a.s. twice differentiable. Under some regularity conditions, the Fisher information of the statistical model is defined as:
$$I(\theta) = \mathbb{E}\left[ \nabla\ell(\theta)\, \nabla\ell(\theta)^\top \right] - \mathbb{E}\left[ \nabla\ell(\theta) \right] \mathbb{E}\left[ \nabla\ell(\theta) \right]^\top = -\mathbb{E}\left[ \nabla^2 \ell(\theta) \right].$$
If $\Theta \subset \mathbb{R}$, we get:
$$I(\theta) = \mathrm{var}\left[ \ell'(\theta) \right] = -\mathbb{E}\left[ \ell''(\theta) \right].$$
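A Monte Carlo sanity check of the one-dimensional identity $I(\theta) = \mathrm{var}[\ell'(\theta)]$, on the Bernoulli model where $\ell(p) = X \log p + (1 - X)\log(1 - p)$ and the Fisher information has the closed form $I(p) = 1/(p(1-p))$ (the parameter and sample size below are arbitrary):

```python
import random

random.seed(5)

p = 0.3
n = 200_000

# Score of one Bernoulli observation: ℓ'(p) = X/p - (1 - X)/(1 - p).
scores = []
for _ in range(n):
    x = 1 if random.random() < p else 0
    scores.append(x / p - (1 - x) / (1 - p))

mean = sum(scores) / n                       # ≈ 0 (the score has zero mean)
var = sum(s * s for s in scores) / n - mean ** 2
print(var, 1 / (p * (1 - p)))                # both ≈ 4.76
```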

SLIDE 24

Maximum likelihood estimator (4)

Theorem

Let $\theta^* \in \Theta$ (the true parameter). Assume the following:

1. The model is identified;
2. For all $\theta \in \Theta$, the support of $\mathbb{P}_\theta$ does not depend on $\theta$;
3. $\theta^*$ is not on the boundary of $\Theta$;
4. $I(\theta)$ is invertible in a neighborhood of $\theta^*$;
5. A few more technical conditions.

Then $\hat\theta_n^{\mathrm{MLE}}$ satisfies:

◮ $\hat\theta_n^{\mathrm{MLE}} \xrightarrow[n \to \infty]{\mathbb{P}} \theta^*$ w.r.t. $\mathbb{P}_{\theta^*}$ (consistency);
◮ $\sqrt{n}\left( \hat\theta_n^{\mathrm{MLE}} - \theta^* \right) \xrightarrow[n \to \infty]{(d)} \mathcal{N}\left( 0,\, I(\theta^*)^{-1} \right)$ w.r.t. $\mathbb{P}_{\theta^*}$ (asymptotic normality).
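The asymptotic normality statement can be checked by simulation. A sketch for the Bernoulli model, where $I(p^*)^{-1} = p^*(1 - p^*)$: the empirical variance of $\sqrt{n}(\hat p - p^*)$ across replications should approach the theoretical limit (the replication counts below are arbitrary):

```python
import math, random

random.seed(6)

p_star, n, reps = 0.4, 400, 2_000
zs = []
for _ in range(reps):
    # MLE for one simulated sample is the sample mean.
    p_hat = sum(1 if random.random() < p_star else 0 for _ in range(n)) / n
    zs.append(math.sqrt(n) * (p_hat - p_star))

mean_z = sum(zs) / reps
var_z = sum(z * z for z in zs) / reps - mean_z ** 2
print(var_z, p_star * (1 - p_star))  # empirical variance ≈ I(p*)^{-1} = 0.24
```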

SLIDE 25

MIT OpenCourseWare
https://ocw.mit.edu

18.650 / 18.6501 Statistics for Applications
Fall 2016

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.