Regression Methods

1. Linear Regression and Logistic Regression: definitions, and a common property
CMU, 2004 fall, Andrew Moore, HW2, pr. 4
Linear Regression and Logistic Regression: Definitions

Given an input vector X, linear regression models a real-valued output Y as

Y | X ∼ Normal(µ(X), σ²), where µ(X) = β⊤X = β0 + β1X1 + … + βpXp.

Given an input vector X, logistic regression models a binary output Y by

Y | X ∼ Bernoulli(θ(X)),

where the Bernoulli parameter is related to β⊤X by the logit transformation:

logit(θ(X)) ≝ log [θ(X) / (1 − θ(X))] = β⊤X.
a. For each of the two regression models above, write the log likelihood function of the data and its gradient with respect to the parameter vector β = (β0, β1, …, βp).

Answer: For linear regression, we can write the log likelihood function as:

LL(β) = log ∏_{i=1}^n [ (1/(√(2π)σ)) exp(−(yi − β⊤xi)² / (2σ²)) ]
      = ∑_{i=1}^n [ log(1/(√(2π)σ)) − (yi − β⊤xi)² / (2σ²) ]
      = −n log(√(2π)σ) − (1/(2σ²)) ∑_{i=1}^n (yi − β⊤xi)².

Therefore, its gradient is:

∇β LL(β) = (1/σ²) ∑_{i=1}^n (yi − β⊤xi) xi.
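As a quick numerical check (a sketch with simulated data; the names and setup are illustrative, not part of the original problem), the gradient formula above can be verified against a finite-difference approximation of LL:

```python
import numpy as np

# Check grad LL(beta) = (1/sigma^2) * sum_i (y_i - beta^T x_i) x_i
# against a central finite-difference approximation of LL.
rng = np.random.default_rng(0)
n, p, sigma = 50, 3, 1.0
X = rng.normal(size=(n, p))            # rows are the x_i (no intercept, for brevity)
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + sigma * rng.normal(size=n)

def log_lik(beta):
    r = y - X @ beta
    return -n * np.log(np.sqrt(2 * np.pi) * sigma) - (r @ r) / (2 * sigma**2)

def grad_log_lik(beta):
    return (X.T @ (y - X @ beta)) / sigma**2

beta = np.zeros(p)
eps = 1e-6
num_grad = np.array([
    (log_lik(beta + eps * e) - log_lik(beta - eps * e)) / (2 * eps)
    for e in np.eye(p)
])
assert np.allclose(num_grad, grad_log_lik(beta), atol=1e-4)
```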
For logistic regression:

log [θ(X)/(1 − θ(X))] = β⊤X ⇔ θ(X)/(1 − θ(X)) = exp(β⊤X) ⇔ exp(β⊤X) = θ(X)(1 + exp(β⊤X)).

Therefore,

θ(X) = exp(β⊤X) / (1 + exp(β⊤X)) = 1 / (1 + exp(−β⊤X)) and 1 − θ(X) = 1 / (1 + exp(β⊤X)).

Note that Y | X ∼ Bernoulli(θ(X)) means that P(Y = 1 | X) = θ(X) and P(Y = 0 | X) = 1 − θ(X), which can be equivalently written as P(Y = y | X) = θ(X)^y (1 − θ(X))^{1−y} for all y ∈ {0, 1}.
So, in this case the log likelihood function is:

LL(β) = log ∏_{i=1}^n θ(xi)^{yi} (1 − θ(xi))^{1−yi}
      = ∑_{i=1}^n [ yi log θ(xi) + (1 − yi) log(1 − θ(xi)) ]
      = ∑_{i=1}^n [ yi (β⊤xi) + log(1 − θ(xi)) ]     (using log θ(xi) = β⊤xi + log(1 − θ(xi)))
      = ∑_{i=1}^n [ yi (β⊤xi) − log(1 + exp(β⊤xi)) ].

And therefore,

∇β LL(β) = ∑_{i=1}^n [ yi − exp(β⊤xi)/(1 + exp(β⊤xi)) ] xi = ∑_{i=1}^n (yi − θ(xi)) xi.
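The gradient derived above is all that is needed to fit the model numerically. The following sketch (simulated data; the setup is illustrative, not part of the original solution) runs Newton's method, whose steps are built from this gradient, and checks that the gradient vanishes at the MLE:

```python
import numpy as np

# Fit logistic regression by Newton's method; each step uses the gradient
# derived above: grad LL(beta) = sum_i (y_i - theta(x_i)) x_i.
rng = np.random.default_rng(1)
n, p = 200, 2
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-(X @ beta_true)))).astype(float)

beta = np.zeros(p)
for _ in range(25):
    theta = 1 / (1 + np.exp(-(X @ beta)))
    grad = X.T @ (y - theta)                         # the gradient from above
    hess = (X * (theta * (1 - theta))[:, None]).T @ X  # negated Hessian of LL
    beta += np.linalg.solve(hess, grad)              # Newton ascent step

theta = 1 / (1 + np.exp(-(X @ beta)))
assert np.allclose(X.T @ (y - theta), 0.0, atol=1e-6)  # gradient vanishes at the MLE
```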
Actually, in the above solutions the full log likelihood function should first be written as:

log-likelihood = log ∏_{i=1}^n p(xi, yi) = log ∏_{i=1}^n [ pY|X(yi|xi) pX(xi) ]
 = log ∏_{i=1}^n pY|X(yi|xi) + log ∏_{i=1}^n pX(xi) = LL + LLx.

Because LLx does not depend on the parameter β, when doing MLE we can just consider maximizing LL.
b. Show that for each of the two regression models above, the MLE β̂ has the following property:

∑_{i=1}^n yi xi = ∑_{i=1}^n E[Y | X = xi, β = β̂] xi.

Answer: For linear regression:

∇β LL(β̂) = 0 ⇒ ∑_{i=1}^n yi xi = ∑_{i=1}^n (β̂⊤xi) xi.

Since Y | X ∼ Normal(µ(X), σ²), we have E[Y | X = xi, β = β̂] = µ(xi) = β̂⊤xi. So ∑_{i=1}^n yi xi = ∑_{i=1}^n E[Y | X = xi, β = β̂] xi.

For logistic regression:

∇β LL(β̂) = 0 ⇒ ∑_{i=1}^n yi xi = ∑_{i=1}^n θ(xi) xi.

Since Y | X ∼ Bernoulli(θ(X)), we have E[Y | X = xi, β = β̂] = θ(xi) = exp(β̂⊤xi)/(1 + exp(β̂⊤xi)). So again ∑_{i=1}^n yi xi = ∑_{i=1}^n E[Y | X = xi, β = β̂] xi.
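For linear regression this property is exactly the normal equations X⊤y = X⊤Xβ̂. A small numerical illustration (hypothetical data, not from the problem set):

```python
import numpy as np

# The normal equations X^T y = X^T X beta_hat are the stated property
#   sum_i y_i x_i = sum_i E[Y | X = x_i, beta = beta_hat] x_i,
# since E[Y | X = x_i, beta_hat] = beta_hat^T x_i for linear regression.
rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([0.5, 1.0, -1.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # MLE = least squares solution
lhs = X.T @ y                                  # sum_i y_i x_i
rhs = X.T @ (X @ beta_hat)                     # sum_i E[Y | x_i, beta_hat] x_i
assert np.allclose(lhs, rhs)
```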
CMU, 2012 fall, Tom Mitchell, Ziv Bar-Joseph, midterm, pr. 3
Consider real-valued variables X and Y. The Y variable is generated, conditional on X, from the following process:

ε ∼ N(0, σ²), Y = aX + ε,

where each ε is an independent random variable, called a noise term, drawn from a Gaussian distribution with mean 0 and standard deviation σ. This is a one-feature linear regression model, where a is the only weight parameter. The conditional probability of Y has the distribution p(Y | X, a) ∼ N(aX, σ²), so it can be written as

p(Y | X, a) = (1/(√(2π)σ)) exp(−(Y − aX)² / (2σ²)).
MLE estimation
Which of the following equations correctly represent the maximum likelihood problem for estimating a? Say yes or no to each one. More than one of them should have the answer yes.

i. arg max_a ∏_{i=1}^n (1/(√(2π)σ)) exp(−(Yi − aXi)² / (2σ²))
ii. arg max_a ∑_{i=1}^n (1/(√(2π)σ)) exp(−(Yi − aXi)² / (2σ²))
iii. arg max_a ∏_{i=1}^n exp(−(Yi − aXi)² / (2σ²))
iv. arg max_a ∑_{i=1}^n exp(−(Yi − aXi)² / (2σ²))
v. arg max_a −∑_{i=1}^n (Yi − aXi)²
vi. arg min_a ∑_{i=1}^n (Yi − aXi)²
Answer:
LD(a) ≝ p(Y1, …, Yn | a) = p(Y1, …, Yn | X1, …, Xn, a)
 = ∏_{i=1}^n p(Yi | Xi, a)     (i.i.d.)
 = ∏_{i=1}^n (1/(√(2π)σ)) exp(−(Yi − aXi)² / (2σ²)).
aMLE ≝ arg max_a LD(a)
 = arg max_a ∏_{i=1}^n (1/(√(2π)σ)) exp(−(Yi − aXi)² / (2σ²))
 = arg max_a (1/(√(2π)σ))ⁿ ∏_{i=1}^n exp(−(Yi − aXi)² / (2σ²))
 = arg max_a ∏_{i=1}^n exp(−(Yi − aXi)² / (2σ²))
 = arg max_a ln ∏_{i=1}^n exp(−(Yi − aXi)² / (2σ²))
 = arg max_a ∑_{i=1}^n −(1/(2σ²)) (Yi − aXi)²
 = arg max_a −(1/(2σ²)) ∑_{i=1}^n (Yi − aXi)²
 = arg min_a ∑_{i=1}^n (Yi − aXi)²     (vi.)
Derive the maximum likelihood estimate of a, starting from the simplest form of the problem you found above.

Answer:

aMLE = arg min_a ∑_{i=1}^n (Yi − aXi)²
 = arg min_a [ a² ∑_{i=1}^n Xi² − 2a ∑_{i=1}^n XiYi + ∑_{i=1}^n Yi² ].

Setting the derivative with respect to a to zero, 2a ∑_{i=1}^n Xi² − 2 ∑_{i=1}^n XiYi = 0, so

aMLE = (∑_{i=1}^n XiYi) / (∑_{i=1}^n Xi²).
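A quick check of the closed form (simulated data; illustrative only): aMLE = (∑XiYi)/(∑Xi²) should match the minimizer of ∑(Yi − aXi)² found by brute force:

```python
import numpy as np

# Compare the closed-form MLE a = (sum X_i Y_i)/(sum X_i^2) with a grid
# search over the sum of squared residuals.
rng = np.random.default_rng(3)
n, a_true, sigma = 100, 1.7, 0.5
X = rng.normal(size=n)
Y = a_true * X + sigma * rng.normal(size=n)

a_mle = (X @ Y) / (X @ X)            # closed form derived above

grid = np.linspace(0.0, 3.0, 30001)
sse = ((Y[None, :] - grid[:, None] * X[None, :]) ** 2).sum(axis=1)
a_grid = grid[np.argmin(sse)]
assert abs(a_mle - a_grid) < 1e-3
```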
MAP estimation
Let’s put a prior on a. Assume a ∼ N(0, λ²), so

p(a | λ) = (1/(√(2π)λ)) exp(−a² / (2λ²)).

By Bayes’ rule,

p(a | Y1, …, Yn, X1, …, Xn, λ) ∝ p(Y1, …, Yn | X1, …, Xn, a) p(a | λ).

We can ignore the denominator when doing MAP estimation, so

aMAP = arg max_a [ ln p(Y1, …, Yn | X1, …, Xn, a) + ln p(a | λ) ].

Your solution should be in terms of the Xi's, the Yi's, and λ (the derivation below takes σ = 1, which is why σ does not appear in the answer).
Answer: With σ = 1,

p(Y1, …, Yn | X1, …, Xn, a) · p(a | λ) = [ ∏_{i=1}^n (1/√(2π)) exp(−(Yi − aXi)² / 2) ] · (1/(√(2π)λ)) exp(−a² / (2λ²)).

Therefore,

aMAP = arg max_a [ n ln(1/√(2π)) − (1/2) ∑_{i=1}^n (Yi − aXi)² + ln(1/(√(2π)λ)) − a²/(2λ²) ]
 = arg max_a [ −(1/2) ∑_{i=1}^n (Yi − aXi)² − a²/(2λ²) ]
 = arg min_a [ ∑_{i=1}^n (Yi − aXi)² + a²/λ² ]
 = arg min_a [ a² (∑_{i=1}^n Xi² + 1/λ²) − 2a ∑_{i=1}^n XiYi + ∑_{i=1}^n Yi² ].

Setting the derivative with respect to a to zero gives

aMAP = (∑_{i=1}^n XiYi) / (∑_{i=1}^n Xi² + 1/λ²).
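The closed forms make the behavior of aMAP relative to aMLE easy to see numerically. A sketch (hypothetical data, with σ = 1 as in the derivation):

```python
import numpy as np

# a_MAP = (sum X_i Y_i)/(sum X_i^2 + 1/lambda^2) shrinks a_MLE toward 0;
# a narrow prior (small lambda) shrinks hard, a wide prior recovers the MLE.
rng = np.random.default_rng(4)
n, a_true = 50, 2.0
X = rng.normal(size=n)
Y = a_true * X + rng.normal(size=n)       # sigma = 1

def a_map(lam):
    return (X @ Y) / (X @ X + 1.0 / lam**2)

a_mle = (X @ Y) / (X @ X)
assert abs(a_map(0.1)) < abs(a_mle)       # narrow prior pulls the estimate toward 0
assert abs(a_map(1e6) - a_mle) < 1e-9     # very wide prior: MAP ~ MLE
```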
As the prior and the amount of data change as indicated below, how does each of the following change? Do aMLE and aMAP become closer together, or further apart?

Answer:

                                p(a|λ) prior:       likelihood p(Y1,…,Yn|X1,…,Xn,a):   |aMLE − aMAP|:
                                wider, narrower,    wider, narrower, or same?          increase or decrease?
                                or same?
As λ → ∞                        wider               same                               decrease
As λ → 0                        narrower            same                               increase
More data: as n → ∞ (fixed λ)   same                narrower                           decrease
Linear regression [without “intercept” term]
[Figure: a system S takes inputs x = (x1, x2) and produces the output y = c1x1 + c2x2 + ε, where ε is noise.]

This figure shows a system S which takes two inputs x1, x2 and outputs a linear combination of those two inputs, c1x1 + c2x2, where c1 and c2 are two unknown real numbers. The device you use to measure the output of S, i.e., c1x1 + c2x2, introduces an additive error ε, which is a random variable following some distribution. Thus, the output y that you observe is given by equation (1):

y = c1x1 + c2x2 + ε.     (1)

Assume that you have n > 2 instances {(xj1, xj2, yj)}, j = 1, …, n, or equivalently {(xj, yj)}, j = 1, …, n, where by notation xj = [xj1, xj2]. In other words, having n measurements in your hands is equivalent to having n equations of the following form:

yj = c1xj1 + c2xj2 + εj, j = 1, …, n.

The goal is to estimate c1 and c2 from those measurements using maximum likelihood.
a. Assume that the εi for i = 1, …, n are i.i.d. Gaussian random variables with zero mean and variance σ². Compute the log likelihood function and use it to prove that the maximum likelihood estimate c* = [c1*, c2*] is the solution of a least squares approximation problem. Find the solution of the least squares problem.

Answer: εi = yi − (c1xi1 + c2xi2) ∼ N(0, σ²), therefore yi ∼ N(c1xi1 + c2xi2, σ²). Since the noise terms are i.i.d., the likelihood function is given by

L(c1, c2) = ∏_{i=1}^n (1/(√(2π)σ)) exp(−(yi − c1xi1 − c2xi2)² / (2σ²)).

Taking the logarithm, we get the log likelihood function:

l(c1, c2) = −(n/2) log(2πσ²) − (1/(2σ²)) ∑_{i=1}^n (yi − c1xi1 − c2xi2)².

Let y ∈ Rⁿ be the vector containing the measurements, X the n×2 matrix with Xij = xij, and c = [c1, c2]⊤. Then maximizing l(c1, c2) amounts to minimizing ||y − Xc||₂², resulting in the solution c = (X⊤X)⁻¹X⊤y.
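A minimal sketch of this solution on simulated measurements (the names and noise level are assumptions, not from the problem):

```python
import numpy as np

# Recover c = (X^T X)^{-1} X^T y for the two-input system derived above.
rng = np.random.default_rng(5)
n, sigma = 200, 0.1
c_true = np.array([1.5, -0.7])
X = rng.normal(size=(n, 2))                # rows are [x_j1, x_j2]
y = X @ c_true + sigma * rng.normal(size=n)

c_hat = np.linalg.solve(X.T @ X, X.T @ y)  # least squares / ML estimate
assert np.allclose(c_hat, c_true, atol=0.05)
```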
b. Assume now that the εi are independent Gaussian random variables with zero mean and variances Var(εi) = σi². Compute the log likelihood function and find c* = [c1*, c2*] which maximizes it, i.e., the MLE.

Answer: εi = yi − (c1xi1 + c2xi2) ∼ N(0, σi²). Similarly to before,

l(c1, c2) = −(n/2) log(2π) − ∑_{i=1}^n log σi − ∑_{i=1}^n (yi − c1xi1 − c2xi2)² / (2σi²).

Now we are trying to minimize ||W(y − Xc)||₂², where W is a diagonal matrix with wii = 1/σi, resulting in the solution c = (X⊤W⊤WX)⁻¹X⊤W⊤Wy.
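A corresponding sketch for the weighted (heteroscedastic) case, with assumed per-measurement noise levels:

```python
import numpy as np

# Weighted least squares: c = (X^T W^T W X)^{-1} X^T W^T W y, with w_ii = 1/sigma_i.
rng = np.random.default_rng(6)
n = 500
c_true = np.array([2.0, -1.0])
X = rng.normal(size=(n, 2))
sigmas = rng.uniform(0.1, 2.0, size=n)     # per-measurement noise levels
y = X @ c_true + sigmas * rng.normal(size=n)

W = np.diag(1.0 / sigmas)
A = X.T @ W.T @ W @ X
b = X.T @ W.T @ W @ y
c_wls = np.linalg.solve(A, b)
assert np.allclose(c_wls, c_true, atol=0.1)
```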
c. Assume now that the εi are i.i.d., following a Laplace distribution with location parameter µ = 0 and scale parameter b, i.e., with density f(x) = (1/(2b)) exp(−|x|/b). Compute the log likelihood function under this noise model and explain why this model leads to more robust solutions.

Answer:

l(c1, c2) = −n log(2b) − (1/b) ∑_{i=1}^n |yi − c1xi1 − c2xi2| = −n log(2b) − (1/b) ||y − Xc||₁.

This model is prepared to see higher values of the residuals because the Laplace distribution has a heavier tail than the Gaussian; maximizing its likelihood minimizes the ℓ1 norm of the residuals rather than the ℓ2 norm, so large residuals are penalized less severely. Thus it is more robust to noise and outliers.

[Figure: Laplace p.d.f. f(x) for µ = 0, θ = 1; µ = 0, θ = 2; µ = 0, θ = 4; µ = −5, θ = 4.]
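The robustness claim can be illustrated numerically. The sketch below minimizes the ℓ1 residual norm by iteratively reweighted least squares (IRLS), one standard technique that is not part of the original solution, and compares it with ordinary least squares on data containing gross outliers:

```python
import numpy as np

# Compare the least squares fit with the Laplace-model (l1) fit on data
# containing a few gross outliers; the l1 fit should stay close to c_true.
rng = np.random.default_rng(7)
n = 100
c_true = np.array([1.0, 2.0])
X = rng.normal(size=(n, 2))
y = X @ c_true + 0.1 * rng.normal(size=n)
y[:5] += 50.0                                   # a few gross outliers

c_ls = np.linalg.solve(X.T @ X, X.T @ y)        # minimizes ||y - Xc||_2^2

c = c_ls.copy()
for _ in range(100):                            # IRLS for min ||y - Xc||_1
    w = 1.0 / np.maximum(np.abs(y - X @ c), 1e-8)
    Xw = X * w[:, None]                         # rows scaled by the weights
    c = np.linalg.solve(Xw.T @ X, Xw.T @ y)

# The l1 fit is far less affected by the outliers than the l2 fit.
assert np.linalg.norm(c - c_true) < np.linalg.norm(c_ls - c_true)
```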