STAT 339: A Generative Linear Model and Max Likelihood Estimation
20-22 February 2017
Colin Reimer Dawson
Questions/Administrative Business?
Outline
▸ Linear Model Revisited
▸ Maximum Likelihood Estimation
Linear Model Revisited
Our original formulation of the model was deterministic: for a given x, the model yields the same t every time.
Modeling the “Errors”
Of course, the actual data is more complicated.
Adding Error to the Model
▸ We can capture this added complexity with a “catchall” error term, $\varepsilon$:

$$t = w_0 + w_1 x_1 + \cdots + w_k x_k + \varepsilon \tag{1}$$

▸ $\varepsilon$ is different for every case, even if $\mathbf{x}$ is the same.
▸ It is a different beast from the variables $\mathbf{x}$, $\mathbf{w}$ and $t$: it is a random variable.
▸ A stand-in for all the factors that we are not modeling.
A Generative Linear Model
If each observation is associated with a random $\varepsilon_n$ term, then we have a generative model:

$$t_n = w_0 + w_1 x_{n1} + \cdots + w_D x_{nD} + \varepsilon_n = \mathbf{x}_n \mathbf{w} + \varepsilon_n$$

where $\varepsilon_n$ is a random error term.
More specifically...
The classic case is when $t_n = \mathbf{x}_n \mathbf{w} + \varepsilon_n$, where the $\varepsilon_n$ are independent and identically distributed $\mathcal{N}(0, \sigma^2)$ random variables.
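To make "generative" concrete, here is a minimal sketch of drawing data from this model with NumPy. The function name `simulate_linear_model`, the weights `w_true`, the noise level `sigma_true`, and the uniform design for `x` are all illustrative assumptions, not values from the course.

```python
import numpy as np

def simulate_linear_model(X, w, sigma, rng):
    """Draw t_n = x_n w + eps_n with eps_n ~ N(0, sigma^2), i.i.d."""
    eps = rng.normal(loc=0.0, scale=sigma, size=X.shape[0])
    return X @ w + eps

rng = np.random.default_rng(0)            # fixed seed, for reproducibility only
N = 200
x = rng.uniform(0.0, 10.0, size=N)        # hypothetical predictor values
X = np.column_stack([np.ones(N), x])      # leading column of ones carries w_0
w_true = np.array([1.0, 2.0])             # hypothetical "true" weights
sigma_true = 0.5                          # hypothetical noise standard deviation

t1 = simulate_linear_model(X, w_true, sigma_true, rng)
t2 = simulate_linear_model(X, w_true, sigma_true, rng)
# Same X and w, but t1 and t2 differ: the model is no longer deterministic.
```

Calling the function twice with identical inputs gives two different target vectors, which is exactly what the error term adds to the deterministic model.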
The Likelihood Function
Previously, we chose $\hat{\mathbf{w}}$ so as to minimize a loss function. With a generative model, an alternative is to find the parameters that make the data maximally likely.
The Likelihood Function
If the distribution of a r.v. $X$ (or a random vector $\mathbf{x}$) depends on a parameter vector $\theta$, then given an observation $X = x$ (or $\mathbf{x} = \mathbf{x}_0$), the likelihood function is the probability (or density) of $x$ (or $\mathbf{x}_0$) for each possible value of $\theta$:

$$L(\theta; \mathbf{x}_0) = p(\mathbf{x}_0; \theta)$$
Example: Poisson Distribution
The Poisson distribution with parameter $\lambda$ is a discrete distribution on $\{0, 1, 2, \ldots\}$ with PMF

$$p(y; \lambda) = \frac{e^{-\lambda} \lambda^{y}}{y!}$$

The likelihood function for $\lambda$ is also

$$L(\lambda; y) = \frac{e^{-\lambda} \lambda^{y}}{y!}$$

but considered as a function of $\lambda$ for a fixed observation $y$.
Poisson PMF and Likelihood
Figure: Left: PMF for y from a Poisson(λ) distribution with λ = 1.5 (horizontal axis y, vertical axis p(y)). Right: Likelihood function for λ for a Poisson(λ) distribution with y = 3 (horizontal axis λ, vertical axis L(λ)).
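The two panels described in the figure come from the same formula evaluated along different arguments. The sketch below reproduces the numbers behind each panel using only the standard library; the helper name `poisson_pmf` and the grid of λ values are illustrative choices.

```python
import math

def poisson_pmf(y, lam):
    """p(y; lambda) = exp(-lambda) * lambda**y / y!"""
    return math.exp(-lam) * lam**y / math.factorial(y)

# PMF (left panel): fix lambda = 1.5 and vary y over 0, 1, ..., 10.
pmf_values = [poisson_pmf(y, 1.5) for y in range(0, 11)]

# Likelihood (right panel): fix y = 3 and vary lambda instead.
lambdas = [0.5 * k for k in range(1, 21)]     # 0.5, 1.0, ..., 10.0
likelihood_values = [poisson_pmf(3, lam) for lam in lambdas]
```

The only difference between the two lists is which argument is held fixed and which is varied.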
Maximizing the Likelihood
A reasonable criterion for estimating a parameter is to maximize the likelihood; i.e., choose the parameter value that makes the observed data as “probable” as possible.
$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} L(\theta; \mathbf{x}) = \arg\max_{\theta} p(\mathbf{x}; \theta)$$
Poisson MLE
What is the MLE for λ?
Analytically...
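$$L(\lambda; y) = \frac{e^{-\lambda} \lambda^{y}}{y!}$$

$$\frac{dL(\lambda)}{d\lambda} = \frac{1}{y!}\left(y e^{-\lambda} \lambda^{y-1} - e^{-\lambda} \lambda^{y}\right)$$

Set to zero and solve:

$$y e^{-\hat{\lambda}} \hat{\lambda}^{y-1} = e^{-\hat{\lambda}} \hat{\lambda}^{y} \quad\Rightarrow\quad \hat{\lambda} = y$$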
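As a sanity check on the calculus, one can also locate the maximum by brute force. This is a rough sketch assuming a single observation y = 3 (the value used in the likelihood figure) and an arbitrary grid with step 0.01; a grid search only recovers the maximizer up to the grid resolution.

```python
import math

def poisson_likelihood(lam, y):
    """L(lambda; y) for a single Poisson observation y."""
    return math.exp(-lam) * lam**y / math.factorial(y)

y_obs = 3
grid = [k / 100 for k in range(1, 1001)]      # lambda in 0.01, 0.02, ..., 10.00

# Pick the grid point with the largest likelihood.
lam_hat = max(grid, key=lambda lam: poisson_likelihood(lam, y_obs))
```

On this grid the search lands on λ = 3, which agrees with the analytic result that the MLE equals y.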
Log Likelihoods
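Many common likelihoods are more manageable after taking a log. Also, if we have several independent observations, likelihoods multiply, but log likelihoods add. Can we just maximize the log likelihood instead? Yes: the log is strictly increasing, so $L$ and $\log L$ are maximized at the same parameter value.

$$\log L(\lambda; y) = -\lambda + y \log(\lambda) - \log(y!)$$

$$\frac{d \log L(\lambda)}{d\lambda} = -1 + \frac{y}{\lambda} \quad\Rightarrow\quad \hat{\lambda} = y$$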
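The sketch below illustrates both points for several independent Poisson observations: the product of likelihoods and the sum of log likelihoods describe the same function on different scales, and both peak at the same λ. The counts in `ys` are made up for illustration, and the grid search is only a stand-in for the calculus.

```python
import math

def poisson_log_likelihood(lam, ys):
    """Sum of per-observation log likelihoods for independent counts ys."""
    return sum(-lam + y * math.log(lam) - math.lgamma(y + 1) for y in ys)

def poisson_likelihood(lam, ys):
    """Product of per-observation likelihoods for the same counts."""
    return math.prod(math.exp(-lam) * lam**y / math.factorial(y) for y in ys)

ys = [2, 3, 4, 7]                             # made-up counts, sample mean 4.0
grid = [k / 100 for k in range(1, 1001)]      # lambda in 0.01, ..., 10.00

lam_hat_from_L = max(grid, key=lambda lam: poisson_likelihood(lam, ys))
lam_hat_from_logL = max(grid, key=lambda lam: poisson_log_likelihood(lam, ys))
```

Both searches return the same grid point, which for these counts is their sample mean. Note that `math.lgamma(y + 1)` computes log(y!) directly, which avoids forming large factorials.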
Deriving the Likelihood Function for w
The classic case is when $t_n = \mathbf{x}_n \mathbf{w} + \varepsilon_n$, where the $\varepsilon_n$ are independent and identically distributed $\mathcal{N}(0, \sigma^2)$ random variables.
Family of Conditional Densities for tn
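$$\varepsilon_n \sim \mathcal{N}(0, \sigma^2) \quad\Rightarrow\quad t_n \mid \mathbf{x}_n, \mathbf{w} \sim \mathcal{N}(\mathbf{x}_n \mathbf{w}, \sigma^2)$$

i.e.

$$p(t_n \mid \mathbf{x}_n, \mathbf{w}, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{1}{2\sigma^2}(t_n - \mathbf{x}_n \mathbf{w})^2\right\}$$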
Family of Joint Densities for t
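Since we assume the $\varepsilon_n$ are independent, then after fixing (i.e., conditioning on) $\mathbf{x}_n \mathbf{w}$ for each $n$, the $t_n$ are also (conditionally) independent:

$$\begin{aligned}
p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^2) &= \prod_{n=1}^{N} p(t_n \mid \mathbf{x}_n, \mathbf{w}, \sigma^2) \\
&= \prod_{n=1}^{N} (2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{1}{2\sigma^2}(t_n - \mathbf{x}_n \mathbf{w})^2\right\} \\
&= (2\pi\sigma^2)^{-N/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(t_n - \mathbf{x}_n \mathbf{w})^2\right\} \\
&= (2\pi\sigma^2)^{-N/2} \exp\left\{-\frac{1}{2\sigma^2}(\mathbf{t} - \mathbf{X}\mathbf{w})^{\mathsf{T}}(\mathbf{t} - \mathbf{X}\mathbf{w})\right\}
\end{aligned}$$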
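The chain of equalities can be checked numerically on the log scale: summing the per-observation log densities should give the same number as the matrix form. The tiny data set, the weights `w`, and the variance `sigma2` below are made up for illustration.

```python
import numpy as np

def log_density_single(t_n, x_n, w, sigma2):
    """log p(t_n | x_n, w, sigma^2) for one observation."""
    resid = t_n - x_n @ w
    return -0.5 * np.log(2 * np.pi * sigma2) - resid**2 / (2 * sigma2)

def log_density_joint(t, X, w, sigma2):
    """log p(t | X, w, sigma^2) using the matrix form of the exponent."""
    N = len(t)
    r = t - X @ w
    return -0.5 * N * np.log(2 * np.pi * sigma2) - (r @ r) / (2 * sigma2)

# Made-up data: N = 4 cases, an intercept column plus one predictor.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([0.9, 3.2, 4.8, 7.1])
w = np.array([1.0, 2.0])
sigma2 = 0.25

sum_of_logs = sum(log_density_single(t[n], X[n], w, sigma2) for n in range(len(t)))
joint_log = log_density_joint(t, X, w, sigma2)
# The two values agree up to floating-point rounding.
```

Working on the log scale here is a practical choice as well: for large N the product of densities can underflow, while the sum of logs does not.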
Finding the MLE for w
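$$L(\mathbf{w}, \sigma^2; \mathbf{X}, \mathbf{t}) = (2\pi\sigma^2)^{-N/2} \exp\left\{-\frac{1}{2\sigma^2}(\mathbf{t} - \mathbf{X}\mathbf{w})^{\mathsf{T}}(\mathbf{t} - \mathbf{X}\mathbf{w})\right\}$$

Taking the log to make finding the gradient (much!) easier...

$$\log L(\mathbf{w}, \sigma^2; \mathbf{X}, \mathbf{t}) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(\mathbf{t} - \mathbf{X}\mathbf{w})^{\mathsf{T}}(\mathbf{t} - \mathbf{X}\mathbf{w})$$

Taking the gradient w.r.t. $\mathbf{w}$...

$$\frac{\partial \log L(\mathbf{w}, \sigma^2; \mathbf{X}, \mathbf{t})}{\partial \mathbf{w}} = \frac{1}{\sigma^2}\mathbf{X}^{\mathsf{T}}(\mathbf{t} - \mathbf{X}\mathbf{w})$$

and setting to zero...

$$\mathbf{X}^{\mathsf{T}}\mathbf{X}\hat{\mathbf{w}} = \mathbf{X}^{\mathsf{T}}\mathbf{t} \quad\Rightarrow\quad \hat{\mathbf{w}} = (\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{t}$$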
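A minimal NumPy sketch of this estimator follows. It assumes $\mathbf{X}^{\mathsf{T}}\mathbf{X}$ is invertible and solves the normal equations rather than forming the inverse explicitly; the function name `mle_weights` and the four-point data set are illustrative.

```python
import numpy as np

def mle_weights(X, t):
    """Solve the normal equations X^T X w = X^T t for w."""
    return np.linalg.solve(X.T @ X, X.T @ t)

# Made-up data: an intercept column plus one predictor.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([0.9, 3.2, 4.8, 7.1])

w_hat = mle_weights(X, t)

# Cross-check against NumPy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)

# The gradient X^T (t - X w_hat) should vanish at the maximum.
gradient_at_hat = X.T @ (t - X @ w_hat)
```

The agreement with `np.linalg.lstsq` reflects the fact that, under Gaussian errors, the maximum likelihood weights are the same as the least-squares weights: the log likelihood depends on $\mathbf{w}$ only through the sum of squared residuals.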
Finding the MLE for σ2
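If we want our estimated model to be generative, we also need to estimate $\sigma^2$.

$$\log L(\mathbf{w}, \sigma^2; \mathbf{X}, \mathbf{t}) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(\mathbf{t} - \mathbf{X}\mathbf{w})^{\mathsf{T}}(\mathbf{t} - \mathbf{X}\mathbf{w})$$

Taking the derivative w.r.t. $\sigma^2$...

$$\frac{\partial \log L(\mathbf{w}, \sigma^2; \mathbf{X}, \mathbf{t})}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}(\mathbf{t} - \mathbf{X}\mathbf{w})^{\mathsf{T}}(\mathbf{t} - \mathbf{X}\mathbf{w})$$

and setting to zero, with $\mathbf{w}$ replaced by its estimate $\hat{\mathbf{w}}$...

$$\hat{\sigma}^2 = \frac{1}{N}(\mathbf{t} - \mathbf{X}\hat{\mathbf{w}})^{\mathsf{T}}(\mathbf{t} - \mathbf{X}\hat{\mathbf{w}}) = \frac{1}{N}\sum_{n=1}^{N}(t_n - \hat{t}_n)^2$$

where $\hat{t}_n = \mathbf{x}_n \hat{\mathbf{w}}$ is the fitted value for case $n$.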
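Putting the two estimates together gives a fitted model that can generate new data. The sketch below is one way this might look in NumPy; the function name `mle_linear_gaussian`, the data, and the seed are illustrative assumptions, and $\mathbf{X}^{\mathsf{T}}\mathbf{X}$ is assumed invertible.

```python
import numpy as np

def mle_linear_gaussian(X, t):
    """Return the maximum likelihood estimates (w_hat, sigma2_hat)."""
    w_hat = np.linalg.solve(X.T @ X, X.T @ t)
    resid = t - X @ w_hat                      # t_n - t_hat_n for each case
    # The MLE divides by N, not by a degrees-of-freedom-corrected count.
    sigma2_hat = (resid @ resid) / len(t)
    return w_hat, sigma2_hat

# Made-up data: an intercept column plus one predictor.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([0.9, 3.2, 4.8, 7.1])

w_hat, sigma2_hat = mle_linear_gaussian(X, t)

# Use the fitted model generatively: draw new targets at the same inputs.
rng = np.random.default_rng(1)                 # arbitrary seed
t_new = X @ w_hat + rng.normal(0.0, np.sqrt(sigma2_hat), size=len(t))
```

Note that `rng.normal` takes a standard deviation, so the estimated variance is passed through `np.sqrt` before sampling.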