

SLIDE 1

STAT 339 A Generative Linear Model and Max Likelihood Estimation

Colin Reimer Dawson
20-22 February 2017

SLIDE 2

Questions/Administrative Business?


SLIDE 3

Outline

▸ Linear Model Revisited
▸ Maximum Likelihood Estimation

SLIDE 4

Linear Model Revisited

Our original formulation of the model was deterministic: for a given x, the model yields the same t every time.

SLIDE 5

Modeling the “Errors”

Of course, the actual data is more complicated.

SLIDE 6

Adding Error to the Model

▸ We can capture this added complexity with a “catchall” error term, ε:

t = w0 + w1x1 + ⋅⋅⋅ + wkxk + ε    (1)

▸ ε is different for every case, even if x is the same.
▸ It is a different beast from the variables x, w, and t: it is a random variable.
▸ It is a stand-in for all the factors that we are not modeling.

SLIDE 7

A Generative Linear Model

If each observation is associated with its own random term εn, then we have a generative model:

tn = w0 + w1xn1 + ⋅⋅⋅ + wDxnD + εn = xnw + εn

where εn is a random error term.

SLIDE 8

More specifically...

The classic case is when

tn = xnw + εn

where the εn are independent and identically distributed N(0, σ²) random variables.
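To make this concrete, here is a minimal simulation sketch of the model (assuming NumPy; the weights w_true, noise level sigma, and sample size are made-up illustration values, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
N = 50
w_true = np.array([1.0, 2.0])     # hypothetical [w0, w1]
sigma = 0.3                       # hypothetical noise sd

# Design matrix with a leading column of ones for the intercept w0
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, N)])
eps = rng.normal(0.0, sigma, N)   # eps_n i.i.d. ~ N(0, sigma^2)
t = X @ w_true + eps              # t_n = x_n w + eps_n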

SLIDE 9

The Likelihood Function

Previously, we chose ŵ so as to minimize a loss function. With a generative model, an alternative is to find the parameters that make the data maximally likely.

The Likelihood Function

If the distribution of an r.v. X (or a random vector x) depends on a parameter vector θ, then given an observation X = x0 (or x = x0), the likelihood function is the probability (or density) of x0 for each possible value of θ:

L(θ; x0) = p(x0; θ)

SLIDE 10

Example: Poisson Distribution

The Poisson distribution with parameter λ is a discrete distribution on {0, 1, 2, ...} with PMF

p(y; λ) = e^(−λ) λ^y / y!

The likelihood function for λ is the same expression,

L(λ; y) = e^(−λ) λ^y / y!

but considered as a function of λ for a fixed observation y.
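As a quick sketch of these two views of the same formula (assuming SciPy; λ = 1.5 and y = 3 are the values used in the figure on the next slide):

import numpy as np
from scipy.stats import poisson

# PMF view: p(y; lambda) over y for fixed lambda = 1.5
y_vals = np.arange(0, 11)
pmf = poisson.pmf(y_vals, 1.5)

# Likelihood view: L(lambda; y) over lambda for fixed y = 3
lam_grid = np.linspace(0.01, 10, 1000)
lik = poisson.pmf(3, lam_grid)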

SLIDE 11

Poisson PMF and Likelihood


Figure: Left: PMF for y from a Poisson(λ) distribution with λ = 1.5. Right: Likelihood function for λ for a Poisson(λ) distribution with y = 3.


SLIDE 12

Maximizing the Likelihood

A reasonable criterion for estimating a parameter is to maximize the likelihood; i.e., to choose the parameter value that makes the data as “probable” as possible:

θ̂_MLE = arg max_θ L(θ; x) = arg max_θ p(x; θ)
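A direct, if crude, transcription of this definition is a grid search over θ (a sketch assuming NumPy/SciPy, reusing the Poisson example with the hypothetical observation y = 3):

import numpy as np
from scipy.stats import poisson

y_obs = 3
lam_grid = np.linspace(0.01, 10, 1000)
lik = poisson.pmf(y_obs, lam_grid)    # L(lambda; y) on a grid
lam_hat = lam_grid[np.argmax(lik)]    # arg max over the grid
print(lam_hat)                        # ~3.0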


SLIDE 13

Poisson MLE


Figure: Left: PMF for y from a Poisson(λ) distribution with λ = 1.5. Right: Likelihood function for λ for a Poisson(λ) distribution with y = 3.

What is the MLE for λ?

SLIDE 14

Analytically...

L(λ; y) = e^(−λ) λ^y / y!

dL(λ)/dλ = (1/y!) (y e^(−λ) λ^(y−1) − e^(−λ) λ^y)

Set to zero and solve:

y e^(−λ̂) λ̂^(y−1) = e^(−λ̂) λ̂^y ⇒ λ̂ = y
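As a symbolic sanity check of this derivation (a sketch assuming SymPy; the positivity assumptions let solve discard the λ = 0 root):

import sympy as sp

lam, y = sp.symbols('lam y', positive=True)
L = sp.exp(-lam) * lam**y / sp.factorial(y)      # Poisson likelihood
print(sp.solve(sp.Eq(sp.diff(L, lam), 0), lam))  # -> [y], i.e. lambda-hat = y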

SLIDE 15

Log Likelihoods

Many common likelihoods are more manageable after taking a log. Also, if we have several independent observations, likelihoods multiply, but log likelihoods add. Can we just maximize the log likelihood instead? Yes: the log is monotone increasing, so the maximizer is unchanged.

log L(λ; y) = −λ + y log(λ) − log(y!)

d log L(λ)/dλ = −1 + y/λ

λ̂ = y
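Numerically, maximizing the log likelihood (here by minimizing its negative; a sketch assuming SciPy, with arbitrary optimizer bounds) recovers the same λ̂ = y:

from scipy.optimize import minimize_scalar
from scipy.stats import poisson

y_obs = 3
neg_log_lik = lambda lam: -poisson.logpmf(y_obs, lam)
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 20), method='bounded')
print(res.x)   # ~3.0: the same maximizer as for the raw likelihood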

SLIDE 16

Deriving the Likelihood Function for w

The classic case is when

tn = xnw + εn

where the εn are independent and identically distributed N(0, σ²) random variables.

SLIDE 17

Family of Conditional Densities for tn

εn ∼ N(0, σ²) ⇒ tn ∣ xn, w ∼ N(xnw, σ²)

i.e.,

p(tn ∣ xn, w, σ²) = (2πσ²)^(−1/2) exp{ −(tn − xnw)² / (2σ²) }
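A quick check of this density formula against a library implementation (a sketch assuming SciPy; the values of xnw, σ, and tn are made up):

import numpy as np
from scipy.stats import norm

xn_w, sigma, tn = 1.2, 0.3, 1.5   # hypothetical x_n w, sigma, t_n
by_formula = (2*np.pi*sigma**2)**(-0.5) * np.exp(-(tn - xn_w)**2 / (2*sigma**2))
by_scipy = norm.pdf(tn, loc=xn_w, scale=sigma)
print(np.isclose(by_formula, by_scipy))   # True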

SLIDE 18

Family of Joint Densities for t

Since we assume the εn are independent, then after fixing (i.e., conditioning on) xnw for each n, the tn are also (conditionally) independent:

p(t ∣ X, w, σ²) = ∏_{n=1}^N p(tn ∣ xn, w, σ²)
= ∏_{n=1}^N (2πσ²)^(−1/2) exp{ −(tn − xnw)² / (2σ²) }
= (2πσ²)^(−N/2) exp{ −(1/(2σ²)) Σ_{n=1}^N (tn − xnw)² }
= (2πσ²)^(−N/2) exp{ −(1/(2σ²)) (t − Xw)ᵀ(t − Xw) }
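A numerical sketch (assuming NumPy/SciPy; the simulated data are made up) confirming that the factorized and vectorized forms agree:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 50
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, N)])
w, sigma = np.array([1.0, 2.0]), 0.3
t = X @ w + rng.normal(0.0, sigma, N)

per_point = norm.pdf(t, loc=X @ w, scale=sigma).prod()   # product of N factors
r = t - X @ w
vectorized = (2*np.pi*sigma**2)**(-N/2) * np.exp(-(r @ r) / (2*sigma**2))
print(np.isclose(per_point, vectorized))                 # True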

SLIDE 19

Finding the MLE for w

L(w, σ²; X, t) = (2πσ²)^(−N/2) exp{ −(1/(2σ²)) (t − Xw)ᵀ(t − Xw) }

Taking the log to make finding the gradient (much!) easier:

log L(w, σ²; X, t) = −(N/2) log(2πσ²) − (1/(2σ²)) (t − Xw)ᵀ(t − Xw)

Taking the gradient w.r.t. w:

∂ log L(w, σ²; X, t) / ∂w = (1/σ²) Xᵀ(t − Xw)

and setting to zero:

XᵀX ŵ = Xᵀt ⇒ ŵ = (XᵀX)^(−1) Xᵀt
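A sketch of computing ŵ on simulated data (assuming NumPy; w_true and σ are made-up values). Solving the normal equations directly, as below, is numerically preferable to forming (XᵀX)^(−1) explicitly:

import numpy as np

rng = np.random.default_rng(0)
N = 200
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, N)])
w_true, sigma = np.array([1.0, 2.0]), 0.3
t = X @ w_true + rng.normal(0.0, sigma, N)

w_hat = np.linalg.solve(X.T @ X, X.T @ t)   # solves X^T X w-hat = X^T t
print(w_hat)                                # close to w_true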

SLIDE 20

Finding the MLE for σ2

If we want our estimated model to be generative, we also need to estimate σ².

log L(w, σ²; X, t) = −(N/2) log(2πσ²) − (1/(2σ²)) (t − Xw)ᵀ(t − Xw)

Taking the derivative w.r.t. σ²:

∂ log L(w, σ²; X, t) / ∂σ² = −N/(2σ²) + (1/(2(σ²)²)) (t − Xw)ᵀ(t − Xw)

and setting to zero (with w = ŵ plugged in):

σ̂² = (1/N) (t − Xŵ)ᵀ(t − Xŵ) = (1/N) Σ_{n=1}^N (tn − t̂n)²

where t̂n = xnŵ.
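Continuing the same sketch (assuming NumPy; same made-up simulation setup), σ̂² is just the average squared residual under ŵ:

import numpy as np

rng = np.random.default_rng(0)
N = 200
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, N)])
w_true, sigma = np.array([1.0, 2.0]), 0.3
t = X @ w_true + rng.normal(0.0, sigma, N)

w_hat = np.linalg.solve(X.T @ X, X.T @ t)
resid = t - X @ w_hat
sigma2_hat = (resid @ resid) / N   # (1/N) sum (t_n - t-hat_n)^2
print(sigma2_hat, sigma**2)        # should be close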

SLIDE 21

Summary: MLE Linear Regression

Having defined a generative model according to

tn = xnw + εn,  εn independently ∼ N(0, σ²)

we get MLEs for w and σ² given by:

ŵ = (XᵀX)^(−1) Xᵀt

σ̂² = (1/N) Σ_{n=1}^N (tn − xnŵ)²