STAT 339: A Generative Linear Model and Max Likelihood Estimation


  1. STAT 339: A Generative Linear Model and Max Likelihood Estimation. 20-22 February 2017. Colin Reimer Dawson.

  2. Questions/Administrative Business?

  3. Outline: Linear Model Revisited; Maximum Likelihood Estimation

  4. Linear Model Revisited
     Our original formulation of the model was deterministic: for a given x, the model yields the same t every time.

  5. Modeling the “Errors”
     Of course, the actual data is more complicated.

  6. Adding Error to the Model
     ▸ We can capture this added complexity with a “catchall” error term, ε:
          t = w_0 + w_1 x_1 + \cdots + w_k x_k + \varepsilon   (1)
     ▸ ε is different for every case, even if x is the same.
     ▸ It is a different beast from the variables x, w, and t: it is a random variable.
     ▸ It is a stand-in for all the factors that we are not modeling.

  7. A Generative Linear Model
     If each observation is associated with a random ε_n term, then we have a generative model:
          t_n = w_0 + w_1 x_{n1} + \cdots + w_D x_{nD} + \varepsilon_n = x_n w + \varepsilon_n,
     where ε_n is a random error term.

  8. More specifically...
     The classic case is when
          t_n = x_n w + \varepsilon_n,
     where the ε_n are independent and identically distributed as N(0, σ²) random variables.
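To make the generative reading concrete, here is a minimal simulation sketch in Python with NumPy (not part of the slides; the function name simulate_linear_data and the uniform distribution of the inputs are purely illustrative choices). It draws each ε_n independently from N(0, σ²) and forms t_n = x_n w + ε_n.

```python
import numpy as np

def simulate_linear_data(w, sigma, N, seed=0):
    """Draw N observations from t_n = x_n w + eps_n with eps_n ~ N(0, sigma^2).

    w is assumed to include the intercept w_0, so each row x_n starts with a 1.
    """
    rng = np.random.default_rng(seed)
    D = len(w) - 1                                    # number of non-intercept features
    X = np.column_stack([np.ones(N), rng.uniform(-1, 1, size=(N, D))])
    eps = rng.normal(0.0, sigma, size=N)              # i.i.d. Gaussian errors
    t = X @ np.asarray(w) + eps                       # deterministic part plus noise
    return X, t

# Example: intercept 1.0, slope 2.0, error standard deviation 0.5
X, t = simulate_linear_data([1.0, 2.0], sigma=0.5, N=100)
```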

  9. The Likelihood Function
     Previously, we chose ŵ so as to minimize a loss function. With a generative model, an alternative is to find the parameters that make the data maximally likely.
     The Likelihood Function: If the distribution of a random variable X (or a random vector x) depends on a parameter vector θ, then given an observation X = x_0 (or x = x_0), the likelihood function is the probability (or density) of x_0 for each possible value of θ:
          L(\theta; x_0) = p(x_0; \theta)

  10. Example: Poisson Distribution
      The Poisson distribution with parameter λ is a discrete distribution on {0, 1, 2, ...} with PMF
           p(y; \lambda) = \frac{e^{-\lambda} \lambda^y}{y!}
      The likelihood function for λ is the same expression,
           L(\lambda; y) = \frac{e^{-\lambda} \lambda^y}{y!},
      but considered as a function of λ for a fixed observation y.
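As a quick numerical illustration of the distinction (a sketch, not from the slides; the grid of λ values is an arbitrary choice), the same formula e^{-λ} λ^y / y! can be evaluated either over y with λ fixed, or over λ with the observation fixed:

```python
import numpy as np
from math import factorial

def poisson_pmf(y, lam):
    """p(y; lambda) = exp(-lambda) * lambda**y / y!"""
    return np.exp(-lam) * lam**y / factorial(y)

# PMF: vary y, with lambda fixed at 1.5 (as in the figure's left panel)
pmf_values = [poisson_pmf(y, 1.5) for y in range(11)]

# Likelihood: vary lambda, with the observation fixed at y = 3 (right panel)
lams = np.linspace(0.01, 10, 1000)
likelihood = poisson_pmf(3, lams)   # same formula, now viewed as a function of lambda
```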

  11. Poisson PMF and Likelihood
      [Figure. Left: PMF p(y) for y from a Poisson(λ) distribution with λ = 1.5. Right: likelihood function L(λ) for λ given the observation y = 3.]

  12. Maximizing the Likelihood
      A reasonable criterion for estimating a parameter is to try to maximize the likelihood; i.e., choose the parameter that makes the data as “probable” as possible.
      MLE:
           \hat{\theta} = \arg\max_{\theta} L(\theta; x) = \arg\max_{\theta} p(x; \theta)

  13. Poisson MLE
      [Figure, repeated from the previous slide. Left: PMF p(y) for a Poisson(λ) distribution with λ = 1.5. Right: likelihood function L(λ) given the observation y = 3.]
      What is the MLE for λ?

  14. Analytically...
           L(\lambda; y) = \frac{e^{-\lambda} \lambda^y}{y!}
           \frac{dL(\lambda)}{d\lambda} = \frac{1}{y!}\left( y e^{-\lambda} \lambda^{y-1} - e^{-\lambda} \lambda^y \right)
      Set to zero and solve...
           y e^{-\hat{\lambda}} \hat{\lambda}^{y-1} = e^{-\hat{\lambda}} \hat{\lambda}^{y}
           \hat{\lambda} = y

  15. Log Likelihoods
      Many common likelihoods are more manageable after taking a log. Also, if we have several independent observations, likelihoods multiply, but log likelihoods add. Can we just maximize the log likelihood instead? Yes: the log is strictly increasing, so the maximizer of the log likelihood is the same as the maximizer of the likelihood.
           \log L(\lambda; y) = -\lambda + y \log(\lambda) - \log(y!)
           \frac{d \log L(\lambda)}{d\lambda} = -1 + \frac{y}{\lambda}
           \hat{\lambda} = y
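A small numeric sanity check of this result (a sketch assuming the same Poisson setup as the figure, with y = 3; the grid resolution is arbitrary, and math.lgamma(y + 1) is used for log(y!)): maximizing the log likelihood over a grid of λ values recovers λ̂ ≈ y.

```python
import numpy as np
from math import lgamma

y = 3                                    # the observed count from the figure
lams = np.linspace(0.01, 10, 10_000)     # grid of candidate lambda values

# log L(lambda; y) = -lambda + y*log(lambda) - log(y!)
log_lik = -lams + y * np.log(lams) - lgamma(y + 1)

lam_hat = lams[np.argmax(log_lik)]
print(lam_hat)                           # approximately 3.0, i.e. lambda_hat = y
```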

  16. Deriving the Likelihood Function for w
      The classic case is when
           t_n = x_n w + \varepsilon_n,
      where the ε_n are independent and identically distributed as N(0, σ²) random variables.

  17. Family of Conditional Densities for t_n
           \varepsilon_n \sim N(0, \sigma^2) \;\Rightarrow\; t_n \mid x_n, w \sim N(x_n w, \sigma^2)
      i.e.
           p(t_n \mid x_n, w, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left\{ -\frac{1}{2\sigma^2} (t_n - x_n w)^2 \right\}

  18. Family of Joint Densities for t
      Since we assume the ε_n are independent, then after fixing (i.e., conditioning on) x_n w for each n, the t_n are also (conditionally) independent:
           p(t \mid X, w, \sigma^2) = \prod_{n=1}^{N} p(t_n \mid x_n, w, \sigma^2)
           = \prod_{n=1}^{N} (2\pi\sigma^2)^{-1/2} \exp\left\{ -\frac{1}{2\sigma^2} (t_n - x_n w)^2 \right\}
           = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (t_n - x_n w)^2 \right\}
           = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} (t - Xw)^T (t - Xw) \right\}
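To see that the factored and vectorized forms agree, here is a small check (a sketch, not from the slides; it uses SciPy's normal log-density purely for comparison, the function names are hypothetical, and the data are made up just for this check):

```python
import numpy as np
from scipy.stats import norm

def log_lik_factored(t, X, w, sigma2):
    """Sum over n of log N(t_n | x_n w, sigma^2)."""
    return norm.logpdf(t, loc=X @ w, scale=np.sqrt(sigma2)).sum()

def log_lik_matrix(t, X, w, sigma2):
    """-(N/2) log(2 pi sigma^2) - (t - Xw)^T (t - Xw) / (2 sigma^2)."""
    N = len(t)
    resid = t - X @ w
    return -0.5 * N * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)

# The two forms agree (up to floating-point error) on any data set
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
t = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, size=50)
w = np.array([1.0, 2.0])
assert np.isclose(log_lik_factored(t, X, w, 0.25), log_lik_matrix(t, X, w, 0.25))
```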

  19. Finding the MLE for w
           L(w, \sigma^2; X, t) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} (t - Xw)^T (t - Xw) \right\}
      Taking the log to make finding the gradient (much!) easier...
           \log L(w, \sigma^2; X, t) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} (t - Xw)^T (t - Xw)
      Taking the gradient w.r.t. w...
           \frac{\partial \log L(w, \sigma^2; X, t)}{\partial w} = \frac{1}{\sigma^2} X^T (t - Xw)
      and setting to zero...
           X^T X \hat{w} = X^T t \;\Rightarrow\; \hat{w} = (X^T X)^{-1} X^T t
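In code, this closed form corresponds to solving the normal equations X^T X ŵ = X^T t (a sketch with a hypothetical helper name; np.linalg.solve is used rather than forming the inverse explicitly, which is numerically preferable but mathematically equivalent):

```python
import numpy as np

def mle_weights(X, t):
    """Maximum-likelihood weights: solve the normal equations X^T X w = X^T t."""
    return np.linalg.solve(X.T @ X, X.T @ t)
```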

  20. Finding the MLE for σ²
      If we want our estimated model to be generative, we also need to estimate σ².
           \log L(w, \sigma^2; X, t) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} (t - Xw)^T (t - Xw)
      Taking the derivative w.r.t. σ²...
           \frac{\partial \log L(w, \sigma^2; X, t)}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} (t - Xw)^T (t - Xw)
      and setting to zero...
           \hat{\sigma}^2 = \frac{1}{N} (t - X\hat{w})^T (t - X\hat{w}) = \frac{1}{N} \sum_{n=1}^{N} (t_n - \hat{t}_n)^2
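The corresponding estimate of σ² is just the average squared residual under ŵ (a sketch continuing the hypothetical helper above):

```python
def mle_sigma2(X, t, w_hat):
    """sigma^2 MLE: average squared residual, (1/N) (t - X w_hat)^T (t - X w_hat)."""
    resid = t - X @ w_hat
    return resid @ resid / len(t)
```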

  21. Summary: MLE Linear Regression
      Having defined a generative model according to
           t_n = x_n w + \varepsilon_n, \qquad \varepsilon_n \overset{\text{ind.}}{\sim} N(0, \sigma^2),
      we get MLEs for w and σ² given by:
           \hat{w} = (X^T X)^{-1} X^T t
           \hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (t_n - x_n \hat{w})^2
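Putting the pieces together (a sketch that relies on the hypothetical helpers introduced earlier: simulate_linear_data, mle_weights, and mle_sigma2): fitting ŵ and σ̂² on data generated with known parameters should roughly recover those parameters.

```python
# Simulate from the generative model, then fit by maximum likelihood
X, t = simulate_linear_data([1.0, 2.0], sigma=0.5, N=1_000)
w_hat = mle_weights(X, t)              # close to the true weights [1.0, 2.0]
sigma2_hat = mle_sigma2(X, t, w_hat)   # close to the true sigma^2 = 0.5**2 = 0.25
print(w_hat, sigma2_hat)
```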
