18.650 Statistics for Applications — Chapter 3: Maximum Likelihood Estimation (PowerPoint presentation)


  1. 18.650 Statistics for Applications — Chapter 3: Maximum Likelihood Estimation

  2. Total variation distance (1)
     Let (E, (P_θ)_{θ∈Θ}) be a statistical model associated with a sample of
     i.i.d. r.v. X_1, …, X_n. Assume that there exists θ* ∈ Θ such that
     X_1 ∼ P_{θ*}: θ* is the true parameter.
     Statistician's goal: given X_1, …, X_n, find an estimator
     θ̂ = θ̂(X_1, …, X_n) such that P_θ̂ is close to P_{θ*} for the true
     parameter θ*. This means: |P_θ̂(A) − P_{θ*}(A)| is small for all A ⊂ E.
     Definition: The total variation distance between two probability
     measures P_θ and P_θ′ is defined by
         TV(P_θ, P_θ′) = max_{A⊂E} |P_θ(A) − P_θ′(A)|.

  3. Total variation distance (2)
     Assume that E is discrete (i.e., finite or countable). This includes
     Bernoulli, Binomial, Poisson, …
     Then X has a PMF (probability mass function): P_θ(X = x) = p_θ(x) for
     all x ∈ E, with p_θ(x) ≥ 0 and Σ_{x∈E} p_θ(x) = 1.
     The total variation distance between P_θ and P_θ′ is a simple function
     of the PMFs p_θ and p_θ′:
         TV(P_θ, P_θ′) = (1/2) Σ_{x∈E} |p_θ(x) − p_θ′(x)|.
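The discrete formula translates directly into code. A minimal sketch in Python; the choice of two Binomial PMFs is illustrative, not from the slides:

```python
# Total variation distance between two discrete distributions,
# computed from their PMFs: TV = (1/2) * sum_x |p(x) - q(x)|.
from math import comb

def tv_discrete(p, q, support):
    """TV(P, Q) = (1/2) * sum over the support of |p(x) - q(x)|."""
    return 0.5 * sum(abs(p(x) - q(x)) for x in support)

def binom_pmf(n, theta):
    # PMF of Bin(n, theta) as a function of x
    return lambda x: comb(n, x) * theta**x * (1 - theta)**(n - x)

# TV between Bin(10, 0.5) and Bin(10, 0.6); the maximum over sets A
# is attained at A = {x : p(x) > q(x)}
tv = tv_discrete(binom_pmf(10, 0.5), binom_pmf(10, 0.6), range(11))
```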

  4. Total variation distance (3)
     Assume that E is continuous. This includes Gaussian, Exponential, …
     Assume that X has a density f_θ: P_θ(X ∈ A) = ∫_A f_θ(x) dx for all
     A ⊂ E, with f_θ(x) ≥ 0 and ∫_E f_θ(x) dx = 1.
     The total variation distance between P_θ and P_θ′ is a simple function
     of the densities f_θ and f_θ′:
         TV(P_θ, P_θ′) = (1/2) ∫_E |f_θ(x) − f_θ′(x)| dx.
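In the continuous case the integral can be approximated numerically. A rough sketch using the midpoint rule on a truncated interval; the two Gaussian densities and the grid bounds are illustrative assumptions:

```python
# Numerical total variation between two densities via the integral
# formula TV = (1/2) * integral of |f(x) - g(x)| dx, midpoint rule.
from math import exp, pi, sqrt

def gauss_pdf(mu, sigma):
    return lambda x: exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def tv_continuous(f, g, lo, hi, n=100_000):
    # [lo, hi] must be wide enough to capture essentially all mass
    h = (hi - lo) / n
    return 0.5 * h * sum(abs(f(lo + (i + 0.5) * h) - g(lo + (i + 0.5) * h))
                         for i in range(n))

# TV between N(0, 1) and N(1, 1)
tv = tv_continuous(gauss_pdf(0, 1), gauss_pdf(1, 1), -10, 11)
```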

  5. Total variation distance (4)
     Properties of total variation:
     ◮ TV(P_θ, P_θ′) = TV(P_θ′, P_θ) (symmetric)
     ◮ TV(P_θ, P_θ′) ≥ 0 (nonnegative)
     ◮ If TV(P_θ, P_θ′) = 0 then P_θ = P_θ′ (definite)
     ◮ TV(P_θ, P_θ′) ≤ TV(P_θ, P_θ″) + TV(P_θ″, P_θ′) (triangle inequality)
     These imply that the total variation is a distance between probability
     distributions.

  6. Total variation distance (5)
     An estimation strategy: build an estimator TV̂(P_θ, P_{θ*}) for all
     θ ∈ Θ. Then find θ̂ that minimizes the function θ ↦ TV̂(P_θ, P_{θ*}).

  7. Total variation distance (5)
     An estimation strategy: build an estimator TV̂(P_θ, P_{θ*}) for all
     θ ∈ Θ. Then find θ̂ that minimizes the function θ ↦ TV̂(P_θ, P_{θ*}).
     Problem: it is unclear how to build TV̂(P_θ, P_{θ*})!

  8. Kullback-Leibler (KL) divergence (1)
     There are many distances between probability measures that could
     replace total variation. Let us choose one that is more convenient.
     Definition: The Kullback-Leibler (KL) divergence between two
     probability measures P_θ and P_θ′ is defined by
         KL(P_θ, P_θ′) = Σ_{x∈E} p_θ(x) log( p_θ(x) / p_θ′(x) )      if E is discrete,
         KL(P_θ, P_θ′) = ∫_E f_θ(x) log( f_θ(x) / f_θ′(x) ) dx       if E is continuous.
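The discrete KL formula in code, making the asymmetry visible. A minimal sketch; the two Binomial PMFs are illustrative choices:

```python
# KL divergence between two discrete distributions:
# KL(P, Q) = sum_x p(x) * log(p(x) / q(x)), summing over the support of P.
from math import comb, log

def kl_discrete(p, q, support):
    return sum(p(x) * log(p(x) / q(x)) for x in support if p(x) > 0)

def binom_pmf(n, theta):
    return lambda x: comb(n, x) * theta**x * (1 - theta)**(n - x)

p, q = binom_pmf(10, 0.5), binom_pmf(10, 0.6)
kl_pq = kl_discrete(p, q, range(11))   # KL(P, Q)
kl_qp = kl_discrete(q, p, range(11))   # KL(Q, P): not the same value
```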

  9. Kullback-Leibler (KL) divergence (2)
     Properties of KL-divergence:
     ◮ KL(P_θ, P_θ′) ≠ KL(P_θ′, P_θ) in general (not symmetric)
     ◮ KL(P_θ, P_θ′) ≥ 0
     ◮ If KL(P_θ, P_θ′) = 0 then P_θ = P_θ′ (definite)
     ◮ KL(P_θ, P_θ′) ≰ KL(P_θ, P_θ″) + KL(P_θ″, P_θ′) in general (no
       triangle inequality)
     KL is not a distance; it is called a divergence. Asymmetry is the key
     to our ability to estimate it!

  10. Kullback-Leibler (KL) divergence (3)
      KL(P_{θ*}, P_θ) = E_{θ*}[ log( p_{θ*}(X) / p_θ(X) ) ]
                      = E_{θ*}[ log p_{θ*}(X) ] − E_{θ*}[ log p_θ(X) ]
      So the function θ ↦ KL(P_{θ*}, P_θ) is of the form:
          "constant" − E_{θ*}[ log p_θ(X) ].
      The expectation can be estimated: E_{θ*}[h(X)] ≈ (1/n) Σ_{i=1}^n h(X_i)
      (by the LLN). Hence
          KL̂(P_{θ*}, P_θ) = "constant" − (1/n) Σ_{i=1}^n log p_θ(X_i).

  11. Kullback-Leibler (KL) divergence (4)
      KL̂(P_{θ*}, P_θ) = "constant" − (1/n) Σ_{i=1}^n log p_θ(X_i)
      min_{θ∈Θ} KL̂(P_{θ*}, P_θ) ⇔ min_{θ∈Θ} −(1/n) Σ_{i=1}^n log p_θ(X_i)
                                 ⇔ max_{θ∈Θ} (1/n) Σ_{i=1}^n log p_θ(X_i)
                                 ⇔ max_{θ∈Θ} Σ_{i=1}^n log p_θ(X_i)
                                 ⇔ max_{θ∈Θ} Π_{i=1}^n p_θ(X_i)
      This is the maximum likelihood principle.
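The chain of equivalences can be checked numerically: minimizing the estimated KL (up to the unknown constant) is the same as maximizing the empirical log-likelihood. A sketch on simulated Bernoulli data; the true parameter 0.7, the sample size, and the grid of candidate values are illustrative assumptions:

```python
# Maximizing sum_i log p_θ(X_i) over a grid of candidate θ recovers
# (approximately) the true Bernoulli parameter.
from math import log
import random

random.seed(0)
p_star = 0.7                                    # illustrative true parameter
xs = [1 if random.random() < p_star else 0 for _ in range(10_000)]

def log_lik(p):
    # sum_i log p_θ(X_i) with p_θ(x) = p^x (1-p)^(1-x)
    s, n = sum(xs), len(xs)
    return s * log(p) + (n - s) * log(1 - p)

grid = [i / 100 for i in range(1, 100)]         # θ in {0.01, ..., 0.99}
p_hat = max(grid, key=log_lik)                  # MLE restricted to the grid
```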

  12. Interlude: maximizing/minimizing functions (1)
      Note that min_{θ∈Θ} −h(θ) ⇔ max_{θ∈Θ} h(θ). In this class, we focus
      on maximization.
      Maximization of arbitrary functions can be difficult.
      Example: θ ↦ Π_{i=1}^n (θ − X_i)

  13. Interlude: maximizing/minimizing functions (2)
      Definition: A twice differentiable function h : Θ ⊂ ℝ → ℝ is said to
      be concave if its second derivative satisfies h″(θ) ≤ 0 for all θ ∈ Θ.
      It is said to be strictly concave if the inequality is strict:
      h″(θ) < 0.
      Moreover, h is said to be (strictly) convex if −h is (strictly)
      concave, i.e. h″(θ) ≥ 0 (h″(θ) > 0).
      Examples:
      ◮ Θ = ℝ, h(θ) = −θ²
      ◮ Θ = (0, ∞), h(θ) = √θ
      ◮ Θ = (0, ∞), h(θ) = log θ
      ◮ Θ = [0, π], h(θ) = sin(θ)
      ◮ Θ = ℝ, h(θ) = 2θ − 3
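The concavity of the examples above can be checked with a finite-difference second derivative. A sketch; the sample points in each domain are illustrative:

```python
# Numerically checking concavity: h''(θ) should be <= 0 on the domain.
# Central finite difference: h''(θ) ≈ (h(θ+ε) - 2h(θ) + h(θ-ε)) / ε².
from math import log, pi, sin

def second_deriv(h, t, eps=1e-4):
    return (h(t + eps) - 2 * h(t) + h(t - eps)) / eps**2

# each example function paired with a few test points in its domain
checks = [
    (lambda t: -t**2, [-2.0, 0.5, 3.0]),    # Θ = ℝ
    (lambda t: t**0.5, [0.5, 1.0, 4.0]),    # Θ = (0, ∞)
    (log, [0.5, 1.0, 4.0]),                 # Θ = (0, ∞)
    (sin, [0.5, pi / 2, 2.5]),              # Θ = [0, π]
]
all_concave = all(second_deriv(h, t) <= 1e-6 for h, pts in checks for t in pts)
```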

  14. Interlude: maximizing/minimizing functions (3)
      More generally, for a multivariate function h : Θ ⊂ ℝ^d → ℝ, d ≥ 2,
      define the
      ◮ gradient vector: ∇h(θ) = (∂h/∂θ_1(θ), …, ∂h/∂θ_d(θ))ᵀ ∈ ℝ^d
      ◮ Hessian matrix: ∇²h(θ) = ( ∂²h/∂θ_i∂θ_j (θ) )_{1≤i,j≤d} ∈ ℝ^{d×d}
      h is concave ⇔ xᵀ∇²h(θ)x ≤ 0 for all x ∈ ℝ^d, θ ∈ Θ.
      h is strictly concave ⇔ xᵀ∇²h(θ)x < 0 for all x ∈ ℝ^d, x ≠ 0, θ ∈ Θ.
      Examples:
      ◮ Θ = ℝ², h(θ) = −θ_1² − 2θ_2² or h(θ) = −(θ_1 − θ_2)²
      ◮ Θ = (0, ∞)², h(θ) = log(θ_1 + θ_2)

  15. Interlude: maximizing/minimizing functions (4)
      Strictly concave functions are easy to maximize: if they have a
      maximum, then it is unique. It is the unique solution to h′(θ) = 0,
      or, in the multivariate case, ∇h(θ) = 0 ∈ ℝ^d.
      There are many algorithms to find it numerically: this is the theory
      of "convex optimization". In this class, there is often a closed-form
      formula for the maximum.
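Solving h′(θ) = 0 numerically can be sketched with Newton's method, one of the standard algorithms alluded to above. The example function h(θ) = log θ − θ is an illustrative choice: it is strictly concave on (0, ∞) with h′(θ) = 1/θ − 1, so its unique maximizer is θ = 1:

```python
# Newton's method for maximizing a strictly concave function:
# iterate θ <- θ - h'(θ) / h''(θ) until the step is negligible.
def newton_max(h_prime, h_double_prime, theta0, tol=1e-12, max_iter=100):
    theta = theta0
    for _ in range(max_iter):
        step = h_prime(theta) / h_double_prime(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# h(θ) = log θ - θ: h'(θ) = 1/θ - 1, h''(θ) = -1/θ² < 0 (strictly concave)
theta_hat = newton_max(lambda t: 1 / t - 1, lambda t: -1 / t**2, theta0=0.5)
```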

  16. Likelihood, Discrete case (1)
      Let (E, (P_θ)_{θ∈Θ}) be a statistical model associated with a sample
      of i.i.d. r.v. X_1, …, X_n. Assume that E is discrete (i.e., finite or
      countable).
      Definition: The likelihood of the model is the map L_n (or just L)
      defined as:
          L_n : E^n × Θ → ℝ
          (x_1, …, x_n, θ) ↦ P_θ[X_1 = x_1, …, X_n = x_n].

  17. Likelihood, Discrete case (2)
      Example 1 (Bernoulli trials): If X_1, …, X_n iid∼ Ber(p) for some
      p ∈ (0, 1):
      ◮ E = {0, 1};
      ◮ Θ = (0, 1);
      ◮ ∀(x_1, …, x_n) ∈ {0, 1}^n, ∀p ∈ (0, 1),
          L(x_1, …, x_n, p) = Π_{i=1}^n P_p[X_i = x_i]
                            = Π_{i=1}^n p^{x_i} (1 − p)^{1 − x_i}
                            = p^{Σ_{i=1}^n x_i} (1 − p)^{n − Σ_{i=1}^n x_i}.
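The two expressions for the Bernoulli likelihood — the product of per-observation PMFs and the closed form — should agree. A quick sketch; the sample and the value p = 0.6 are illustrative:

```python
# Bernoulli likelihood computed two ways.
from math import prod

def bernoulli_lik_product(xs, p):
    # product of p^x (1-p)^(1-x) over the sample
    return prod(p**x * (1 - p)**(1 - x) for x in xs)

def bernoulli_lik_closed(xs, p):
    # closed form p^Σx (1-p)^(n-Σx)
    s, n = sum(xs), len(xs)
    return p**s * (1 - p)**(n - s)

xs = [1, 0, 1, 1, 0, 1]
lik1 = bernoulli_lik_product(xs, 0.6)
lik2 = bernoulli_lik_closed(xs, 0.6)
```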

  18. Likelihood, Discrete case (3)
      Example 2 (Poisson model): If X_1, …, X_n iid∼ Poiss(λ) for some
      λ > 0:
      ◮ E = ℕ;
      ◮ Θ = (0, ∞);
      ◮ ∀(x_1, …, x_n) ∈ ℕ^n, ∀λ > 0,
          L(x_1, …, x_n, λ) = Π_{i=1}^n P_λ[X_i = x_i]
                            = Π_{i=1}^n e^{−λ} λ^{x_i} / x_i!
                            = e^{−nλ} λ^{Σ_{i=1}^n x_i} / (x_1! ⋯ x_n!).
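The same product-vs-closed-form check for the Poisson likelihood. The sample and λ = 1.5 below are illustrative:

```python
# Poisson likelihood as a product of PMFs vs. the closed form
# e^{-nλ} λ^{Σx} / (x_1! ... x_n!).
from math import exp, factorial, prod

def poisson_lik_product(xs, lam):
    return prod(exp(-lam) * lam**x / factorial(x) for x in xs)

def poisson_lik_closed(xs, lam):
    n, s = len(xs), sum(xs)
    return exp(-n * lam) * lam**s / prod(factorial(x) for x in xs)

xs = [2, 0, 3, 1, 2]
lik1 = poisson_lik_product(xs, 1.5)
lik2 = poisson_lik_closed(xs, 1.5)
```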

  19. Likelihood, Continuous case (1)
      Let (E, (P_θ)_{θ∈Θ}) be a statistical model associated with a sample
      of i.i.d. r.v. X_1, …, X_n. Assume that all the P_θ have density f_θ.
      Definition: The likelihood of the model is the map L defined as:
          L : E^n × Θ → ℝ
          (x_1, …, x_n, θ) ↦ Π_{i=1}^n f_θ(x_i).

  20. Likelihood, Continuous case (2)
      Example 1 (Gaussian model): If X_1, …, X_n iid∼ N(µ, σ²) for some
      µ ∈ ℝ, σ² > 0:
      ◮ E = ℝ;
      ◮ Θ = ℝ × (0, ∞);
      ◮ ∀(x_1, …, x_n) ∈ ℝ^n, ∀(µ, σ²) ∈ ℝ × (0, ∞),
          L(x_1, …, x_n, µ, σ²) = (σ√(2π))^{−n} exp( −(1/(2σ²)) Σ_{i=1}^n (x_i − µ)² ).
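For the Gaussian model the product of densities and the closed form above should also agree. A sketch; the data and the parameters (µ, σ²) = (0, 1) are illustrative:

```python
# Gaussian likelihood: product of N(μ, σ²) densities vs. the closed
# form (2πσ²)^{-n/2} exp(-Σ(x_i - μ)² / (2σ²)).
from math import exp, pi, prod, sqrt

def gauss_pdf(x, mu, sigma2):
    return exp(-(x - mu)**2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

def gauss_lik_product(xs, mu, sigma2):
    return prod(gauss_pdf(x, mu, sigma2) for x in xs)

def gauss_lik_closed(xs, mu, sigma2):
    n = len(xs)
    return (2 * pi * sigma2) ** (-n / 2) * exp(-sum((x - mu)**2 for x in xs) / (2 * sigma2))

xs = [0.5, -1.2, 2.0, 0.3]
lik1 = gauss_lik_product(xs, 0.0, 1.0)
lik2 = gauss_lik_closed(xs, 0.0, 1.0)
```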

  21. Maximum likelihood estimator (1)
      Let X_1, …, X_n be an i.i.d. sample associated with a statistical
      model (E, (P_θ)_{θ∈Θ}) and let L be the corresponding likelihood.
      Definition: The maximum likelihood estimator of θ is defined as:
          θ̂_n^{MLE} = argmax_{θ∈Θ} L(X_1, …, X_n, θ),
      provided it exists.
      Remark (log-likelihood): In practice, we use the fact that
          θ̂_n^{MLE} = argmax_{θ∈Θ} log L(X_1, …, X_n, θ).

  22. Maximum likelihood estimator (2)
      Examples:
      ◮ Bernoulli trials: p̂_n^{MLE} = X̄_n.
      ◮ Poisson model: λ̂_n^{MLE} = X̄_n.
      ◮ Gaussian model: (µ̂_n, σ̂_n²) = (X̄_n, Ŝ_n²), where Ŝ_n² =
        (1/n) Σ_{i=1}^n (X_i − X̄_n)² is the sample variance.
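The Gaussian entry in the list above can be checked numerically: the log-likelihood evaluated at (X̄_n, Ŝ_n²) should beat nearby parameter values. A sketch on simulated data; the true parameters (2, 3²), the sample size, and the perturbation sizes are illustrative assumptions:

```python
# Verifying that (sample mean, sample variance with 1/n) maximizes
# the Gaussian log-likelihood on a simulated sample.
from math import log, pi
import random

random.seed(1)
xs = [random.gauss(2.0, 3.0) for _ in range(5_000)]
n = len(xs)

def gauss_log_lik(mu, var):
    return -n / 2 * log(2 * pi * var) - sum((x - mu)**2 for x in xs) / (2 * var)

mu_hat = sum(xs) / n
var_hat = sum((x - mu_hat)**2 for x in xs) / n   # note 1/n, not 1/(n-1)

best = gauss_log_lik(mu_hat, var_hat)
# the MLE should beat perturbations of either coordinate
perturbed = [gauss_log_lik(mu_hat + d, var_hat) for d in (-0.1, 0.1)]
perturbed += [gauss_log_lik(mu_hat, var_hat * f) for f in (0.9, 1.1)]
```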
