Regression Methods
1. Linear Regression with only one parameter, and without offset; MLE and MAP estimation
CMU, 2012 fall, Tom Mitchell, Ziv Bar-Joseph, midterm, pr. 3
Consider real-valued variables $X$ and $Y$. The $Y$ variable is generated, conditional on $X$, from the following process:

$$\varepsilon \sim N(0, \sigma^2), \qquad Y = aX + \varepsilon,$$

where every $\varepsilon$ is an independent variable, called a noise term, which is drawn from a Gaussian distribution with mean 0 and standard deviation $\sigma$. This is a one-feature linear regression model, where $a$ is the only weight parameter. The conditional probability of $Y$ has the distribution $p(Y|X, a) \sim N(aX, \sigma^2)$, so it can be written as

$$p(Y|X, a) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(Y - aX)^2\right).$$
MLE estimation

Which of the following equations correctly represent the maximum likelihood problem for estimating $a$? Say yes or no to each one. More than one of them should have the answer yes.

i. $\arg\max_a \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right)$

ii. $\arg\max_a \sum_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right)$

iii. $\arg\max_a \prod_{i=1}^n \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right)$

iv. $\arg\max_a \sum_{i=1}^n \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right)$

v. $\arg\max_a -\sum_{i=1}^n (Y_i - aX_i)^2$

vi. $\arg\min_a \sum_{i=1}^n (Y_i - aX_i)^2$
Answer:

$$L_D(a) \stackrel{\text{def.}}{=} p(Y_1, \ldots, Y_n|a) = p(Y_1, \ldots, Y_n|X_1, \ldots, X_n, a) \stackrel{\text{i.i.d.}}{=} \prod_{i=1}^n p(Y_i|X_i, a) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right)$$

$$a_{MLE} \stackrel{\text{def.}}{=} \arg\max_a L_D(a) = \arg\max_a \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right) \quad \text{(i.)}$$
$$= \arg\max_a \frac{1}{(\sqrt{2\pi}\,\sigma)^n} \prod_{i=1}^n \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right) = \arg\max_a \prod_{i=1}^n \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right) \quad \text{(iii.)}$$
$$= \arg\max_a \ln \prod_{i=1}^n \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right) = \arg\max_a \sum_{i=1}^n -\frac{1}{2\sigma^2}(Y_i - aX_i)^2 = \arg\max_a -\sum_{i=1}^n (Y_i - aX_i)^2 \quad \text{(v.)}$$
$$= \arg\min_a \sum_{i=1}^n (Y_i - aX_i)^2 \quad \text{(vi.)}$$

The chain of equalities shows that formulations i, iii, v and vi all describe the same maximization problem (answer: yes), while the sum-of-densities forms ii and iv do not (answer: no).
Derive a closed-form expression for $a_{MLE}$ by solving the simplest form of the problem you found above.

Answer:

$$a_{MLE} = \arg\min_a \sum_{i=1}^n (Y_i - aX_i)^2 = \arg\min_a \left( a^2 \sum_{i=1}^n X_i^2 - 2a \sum_{i=1}^n X_i Y_i + \sum_{i=1}^n Y_i^2 \right).$$

Setting the derivative with respect to $a$ to zero, $2a \sum_{i=1}^n X_i^2 - 2 \sum_{i=1}^n X_i Y_i = 0$, hence

$$a_{MLE} = \frac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n X_i^2}.$$
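As a quick numerical sanity check (not part of the original exam solution; the data, true slope and seed below are invented), this sketch compares the closed-form $a_{MLE} = \sum_i X_i Y_i / \sum_i X_i^2$ with a brute-force grid minimization of the squared error:

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, sigma, n = 2.5, 1.0, 200                 # hypothetical ground truth
X = rng.uniform(-3, 3, size=n)
Y = a_true * X + rng.normal(0, sigma, size=n)    # Y = aX + eps

# Closed-form MLE: a = sum(X_i Y_i) / sum(X_i^2)
a_mle = np.sum(X * Y) / np.sum(X ** 2)

# Brute-force check: minimize sum_i (Y_i - a X_i)^2 over a grid of candidate a values
grid = np.linspace(0, 5, 10001)
sse = ((Y[None, :] - grid[:, None] * X[None, :]) ** 2).sum(axis=1)
a_grid = grid[np.argmin(sse)]

print(a_mle, a_grid)   # the two estimates should agree closely
```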
MAP estimation

Let's put a prior on $a$. Assume $a \sim N(0, \lambda^2)$, so

$$p(a|\lambda) = \frac{1}{\sqrt{2\pi}\,\lambda} \exp\!\left(-\frac{1}{2\lambda^2} a^2\right).$$

By Bayes' rule,

$$p(a|Y_1, \ldots, Y_n, X_1, \ldots, X_n, \lambda) = \frac{p(Y_1, \ldots, Y_n|X_1, \ldots, X_n, a)\, p(a|\lambda)}{p(Y_1, \ldots, Y_n|X_1, \ldots, X_n, \lambda)}.$$

We can ignore the denominator when doing MAP estimation, so

$$a_{MAP} = \arg\max_a \left[ \ln p(Y_1, \ldots, Y_n|X_1, \ldots, X_n, a) + \ln p(a|\lambda) \right].$$

Derive a closed-form expression for $a_{MAP}$. Your solution should be in terms of the $X_i$'s, the $Y_i$'s, and $\lambda$.
Answer:

$$p(Y_1, \ldots, Y_n|X_1, \ldots, X_n, a) \cdot p(a|\lambda) = \left[ \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right) \right] \cdot \frac{1}{\sqrt{2\pi}\,\lambda} \exp\!\left(-\frac{1}{2\lambda^2} a^2\right)$$

$$a_{MAP} = \arg\max_a \left[ \sum_{i=1}^n \left( \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2}(Y_i - aX_i)^2 \right) + \ln \frac{1}{\sqrt{2\pi}\,\lambda} - \frac{1}{2\lambda^2} a^2 \right]$$
$$= \arg\max_a \left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (Y_i - aX_i)^2 - \frac{1}{2\lambda^2} a^2 \right] = \arg\min_a \left[ \sum_{i=1}^n (Y_i - aX_i)^2 + \frac{\sigma^2}{\lambda^2} a^2 \right].$$

Setting the derivative with respect to $a$ to zero, $2a \sum_{i=1}^n X_i^2 + \frac{2\sigma^2}{\lambda^2} a - 2\sum_{i=1}^n X_i Y_i = 0$, hence

$$a_{MAP} = \frac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n X_i^2 + \sigma^2/\lambda^2},$$

which, taking $\sigma = 1$ as the question's phrasing ("in terms of the $X_i$'s, $Y_i$'s, and $\lambda$") suggests, reduces to $a_{MAP} = \dfrac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n X_i^2 + 1/\lambda^2}$.
For each of the scenarios below, how do the following quantities change? Do $a_{MLE}$ and $a_{MAP}$ become closer together, or further apart?

Answer:

As $\lambda \to \infty$: the prior $p(a|\lambda)$ becomes wider; the conditional likelihood $p(Y_1, \ldots, Y_n|X_1, \ldots, X_n, a)$ stays the same; $|a_{MLE} - a_{MAP}|$ decreases.

As $\lambda \to 0$: the prior becomes narrower; the conditional likelihood stays the same; $|a_{MLE} - a_{MAP}|$ increases.

More data, $n \to \infty$ (fixed $\lambda$): the prior stays the same; the conditional likelihood becomes narrower; $|a_{MLE} - a_{MAP}|$ decreases.
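The behaviour summarized above can be illustrated with a small simulation (my own sketch, assuming $\sigma = 1$ and invented data): $a_{MAP}$ is a shrunk version of $a_{MLE}$, the gap $|a_{MLE} - a_{MAP}|$ shrinks as $\lambda$ grows, and it also shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def estimates(n, lam, a_true=2.0, sigma=1.0):
    X = rng.uniform(-2, 2, size=n)
    Y = a_true * X + rng.normal(0, sigma, size=n)
    a_mle = np.sum(X * Y) / np.sum(X ** 2)
    # MAP with prior a ~ N(0, lam^2), taking sigma = 1:
    a_map = np.sum(X * Y) / (np.sum(X ** 2) + 1.0 / lam ** 2)
    return a_mle, a_map

for lam in (0.01, 0.1, 1.0, 10.0):                 # wider prior -> smaller gap
    a_mle, a_map = estimates(n=50, lam=lam)
    print(f"lam={lam:5.2f}  |a_MLE - a_MAP| = {abs(a_mle - a_map):.4f}")

for n in (10, 100, 10000):                         # more data -> smaller gap
    a_mle, a_map = estimates(n=n, lam=0.1)
    print(f"n={n:6d}   |a_MLE - a_MAP| = {abs(a_mle - a_map):.4f}")
```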
The objective of this problem is to gain knowledge of linear regression, Maximum Likelihood Estimation (MLE), Maximum-a-Posteriori (MAP) estimation, and some of their variants.

Consider a linear model with Gaussian noise:

$$Y_i = X_i \cdot w + b + \varepsilon_i, \quad \text{where } \varepsilon_i \sim N(0, \sigma^2),\ i = 1, \ldots, n, \qquad (1)$$

where $Y_i \in \mathbb{R}$ is a scalar, $X_i \in \mathbb{R}^d$ is a $d$-dimensional vector, $b \in \mathbb{R}$ is a constant, $w \in \mathbb{R}^d$ is a $d$-dimensional weight vector on $X_i$, and the $\varepsilon_i$ are i.i.d. Gaussian noise terms with variance $\sigma^2$. Given the data $X_i$, $i = 1, \ldots, n$, our goal is to estimate the $w$ and $b$ which specify the model. We will show that solving the linear model (1) with the MLE method is the same as solving the following Least Squares problem:

$$\arg\min_\beta\ (Y - X'\beta)^\top (Y - X'\beta), \qquad (2)$$

where $Y = (Y_1, \ldots, Y_n)^\top$, $X'_i = (1, X_i)^\top$, $X' = (X'_1, \ldots, X'_n)^\top$ and $\beta = (b, w)^\top$.
a. Write the conditional probability density function of $Y_i$ given $X_i$, $w$ and $b$. Note that $X_i$ is a fixed data point.

Answer:

Note that $Y_i|X_i; w, b \sim N(X_i \cdot w + b, \sigma^2)$, thus we can write the p.d.f. of $Y_i|X_i, w, b$ in the following form:

$$f(Y_i = y_i|X_i; w, b) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - X_i \cdot w - b)^2}{2\sigma^2}\right).$$
b. Assuming the $\varepsilon_i$, $i = 1, \ldots, n$ are i.i.d., give an explicit expression for the log-likelihood $\ell(Y|\beta)$ of the data. Note: the notation for $Y$ and $\beta$ was given at (2). Given that the $\varepsilon_i$'s are i.i.d., it follows that $P(Y|\beta) = \prod_i P(Y_i|w, b)$. Remark that we are just omitting $X_i$ for convenience, as the problem explicitly tells us that the $X_i$ are fixed points.

Answer:

Given $y = (y_1, \ldots, y_n)^\top$, since the $Y_i$ are independent (the $\varepsilon_i$ are i.i.d. and the $X_i$ are given), the likelihood of $Y|\beta$ is as follows:

$$f(Y = y|\beta) = \prod_{i=1}^n f(y_i|w, b) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - X_i \cdot w - b)^2}{2\sigma^2}\right) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n} \exp\!\left(-\frac{\sum_{i=1}^n (y_i - X_i \cdot w - b)^2}{2\sigma^2}\right).$$

Now, taking the $\ln$, the log-likelihood of $Y|\beta$ is as follows:

$$\ell(Y = y|\beta) = -n \ln(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - X_i \cdot w - b)^2. \qquad (3)$$
c. Show that maximizing the log-likelihood is the same as solving the Least Squares problem of (2).

Answer:

To maximize the log-likelihood $\ell(Y = y|\beta)$, we want to focus on the second term, since the first term of the log-likelihood (3) is a constant. In short, to maximize the second term of (3), we want to minimize $\sum_{i=1}^n (y_i - X_i \cdot w - b)^2$. Writing it in matrix-vector form, we get:

$$\arg\max_\beta\ \ell(Y = y|\beta) = \arg\min_\beta \sum_{i=1}^n (y_i - X_i \cdot w - b)^2 = \arg\min_\beta\ (y - X'\beta)^\top (y - X'\beta),$$

where again, $Y = (Y_1, \ldots, Y_n)^\top$, $X'_i = (1, X_i)^\top$, $X' = (X'_1, \ldots, X'_n)^\top$ and $\beta = (b, w)^\top$.
d. Assuming that $X'$ has full column rank, derive the closed-form expression for the $\beta$ that solves (2).

Hint: You may find useful the following formulas:ᵃ (5a) $\frac{\partial}{\partial X} a^\top X = \frac{\partial}{\partial X} X^\top a = a$; (5b) $\frac{\partial}{\partial X} X^\top A X = (A + A^\top) X$; (6c) $\frac{\partial (XY)}{\partial z} = X \frac{\partial Y}{\partial z} + \frac{\partial X}{\partial z} Y$.

Answer:

Setting the objective function $J(\beta) = (y - X'\beta)^\top (y - X'\beta)$ and applying rules (6c), (5a) and (5b) from the document mentioned in the Hint gives $\nabla_\beta J(\beta) = 2 X'^\top (X'\beta - y)$. The log-likelihood maximizer $\hat\beta$ can be found by solving the following optimality condition:

$$\nabla_\beta J(\hat\beta) = 0 \iff X'^\top (X'\hat\beta - y) = 0 \iff X'^\top X' \hat\beta = X'^\top y \iff \hat\beta = (X'^\top X')^{-1} X'^\top y. \qquad (4)$$

Note that $(X'^\top X')^{-1}$ exists because it was assumed that $X'$ has full column rank.

ᵃFrom Matrix Identities, Sam Roweis, 1999, http://www.cs.nyu.edu/~roweis/notes/matrixid.pdf.
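A minimal numpy sketch of the closed-form solution (4) (illustration only; the dimensions, noise level and true parameters are invented): it builds $X'$ by prepending a column of ones and solves the normal equations $X'^\top X' \beta = X'^\top y$ rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true, b_true, sigma = np.array([1.0, -2.0, 0.5]), 0.7, 0.3
y = X @ w_true + b_true + rng.normal(0, sigma, size=n)

# X' = (1, X_i) for every example; beta = (b, w)
Xp = np.hstack([np.ones((n, 1)), X])

# Solve the normal equations X'^T X' beta = X'^T y (numerically safer than inverting)
beta_hat = np.linalg.solve(Xp.T @ Xp, Xp.T @ y)
print("b_hat =", beta_hat[0], " w_hat =", beta_hat[1:])
```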
Part II. Suppose we map each $X_i$ (with $d = 1$, so $X_i \in \mathbb{R}$) to the feature vector $\varphi(X_i) = (1, X_i, X_i^2, \ldots, X_i^k)^\top$; then we can model $Y$ in the $k$th-order model as follows:

$$Y_i = \varphi(X_i) \cdot \beta' + \varepsilon_i \quad \text{with } \varepsilon_i \sim N(0, \sigma^2),\ i = 1, \ldots, n, \qquad (5)$$

where all the definitions are those from equation (1), except $\beta' \in \mathbb{R}^{k+1}$. As we did in Part I, show that if we use MLE and assume that $\varphi(X) \stackrel{\text{not.}}{=} (\varphi(X_1), \varphi(X_2), \ldots, \varphi(X_n))^\top$ has full column rank, then the optimal value for $\beta'$ in (5) is $(\varphi(X)^\top \varphi(X))^{-1} \varphi(X)^\top Y$. Hint: You are not expected to write out all the steps again. Focus on the change from the log-likelihood expressions of Part I, and derive the optimization problem.

Answer:

As the relation between $Y_i$ and $X_i$ has changed, the conditional p.d.f. and the likelihood function change as well. The expressions can be modified by simply replacing $X_i \cdot w + b$ with $\varphi(X_i) \cdot \beta'$, as follows:

$$f(Y_i = y_i|X_i; \beta') = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - \varphi(X_i) \cdot \beta')^2}{2\sigma^2}\right)$$

$$\ell(Y = y|\beta') = -n \ln(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \varphi(X_i) \cdot \beta')^2.$$

As we did before, the maximum likelihood method can be expressed as

$$\arg\max_{\beta'}\ \ell(Y = y|\beta') = \arg\min_{\beta'} \sum_{i=1}^n (y_i - \varphi(X_i) \cdot \beta')^2 = \arg\min_{\beta'}\ (y - \varphi(X)\beta')^\top (y - \varphi(X)\beta'),$$

and thus the maximum likelihood estimator is $\hat\beta' = (\varphi(X)^\top \varphi(X))^{-1} \varphi(X)^\top y$, through similar steps as in Part I.
Part III. Important Note: The results from this Part hold for both linear (Part I) and nonlinear (Part II) regression. For the sake of simplicity, here we will write them in terms of $X$; to obtain the nonlinear versions, it is enough to replace $X$ by $\varphi(X)$, where $\varphi$ was defined in Part II.

a. The matrix $X^\top X$ may not be invertible.ᵃ Adding a regularization term ensures that $X^\top X + \lambda I$ becomes invertible, with $I$ the identity matrix of size $d$, and $\lambda > 0$.ᵇ Show that $\hat\beta' = (X^\top X + \lambda I)^{-1} X^\top Y$ is the solution of the optimization problem

$$\arg\min_{\beta'}\ \|Y - X\beta'\|_2^2 + \lambda \|\beta'\|_2^2. \qquad (6)$$

ᵃThis is the case when, notably, the number of features ($d$ in Part I, or $k+1$ in Part II) is larger than the number of instances $n$ (i.e., $d > n$). In this case, $\mathrm{rank}(X^\top X) = \mathrm{rank}(X)$, cf. Matrix Identities, by Sam Roweis, formula (2f). So $\mathrm{rank}(X^\top X)$ is less than or equal to $\min(n, d) = n$, which is smaller than $d$. Therefore the matrix $X^\top X$, which is $d \times d$, is not full rank and thus cannot be inverted.

ᵇFor any $\lambda > 0$, the matrix $X^\top X + \lambda I$ is full rank and can be inverted. To see this, note that if two columns in $X^\top X$ were linearly dependent, then $\lambda I$ adds the same value ($\lambda$) to two different components of these columns, so they become linearly independent in $X^\top X + \lambda I$.
Answer:

The procedure is almost the same as the one in part d of Part I. First, set the objective

$$J'(\beta') = (Y - X\beta')^\top (Y - X\beta') + \lambda \|\beta'\|_2^2,$$

which, by rules (5b) and (5c), gives $\nabla_{\beta'} J'(\beta') = 2(X^\top X \beta' - X^\top Y) + 2\lambda \beta'$. The minimizer $\hat\beta'$ can be found by solving the following optimality condition:

$$\nabla_{\beta'} J'(\hat\beta') = 0 \iff X^\top X \hat\beta' - X^\top Y + \lambda \hat\beta' = 0 \iff (X^\top X + \lambda I)\hat\beta' = X^\top Y \iff \hat\beta' = (X^\top X + \lambda I)^{-1} X^\top Y.$$
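A short sketch of the Ridge estimator $\hat\beta' = (X^\top X + \lambda I)^{-1} X^\top Y$ (illustration only; $\lambda$ and the data are arbitrary). Note that it remains well defined even when $d > n$, where plain least squares fails.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 20, 50, 0.5          # more features than examples: X^T X is singular
X = rng.normal(size=(n, d))
beta_true = rng.normal(size=d)
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Ridge: solve (X^T X + lam*I) beta = X^T y  (invertible for any lam > 0)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print("ridge solution norm:", np.linalg.norm(beta_ridge))
```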
b. Assume now a Gaussian prior on the weights, $\beta' \sim N(0, \eta^2 I)$. Write the posterior distribution of $\beta'|Y_i$ given the $i$-th sample, and of $\beta'|Y$ given the whole data, respectively. Assume independence between $\beta'$ and the noise $\varepsilon_i \sim N(0, \sigma^2)$, for $i = 1, \ldots, n$.

Hint 1: Use Bayes' rule: $\Pr(\beta'|Y_i) = \dfrac{\Pr(Y_i|\beta') \Pr(\beta')}{\Pr(Y_i)}$. Then follow steps similar to Part I.

Hint 2: Make use of the fact that the sum of two independent Normal variables follows the Normal distribution, i.e., $X \sim N(a, b^2)$, $Y \sim N(c, d^2)$, $X \perp Y \Rightarrow X + Y \sim N(a + c, b^2 + d^2)$.

Answer:

We know that $Y_i|X_i, \beta' \sim N(X_i \cdot \beta', \sigma^2)$ and $\beta' \sim N(0, \eta^2 I)$. Using Bayes' rule, it follows that

$$h(\beta'|Y_i) \propto f(Y_i|\beta')\, g(\beta') \propto \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - X_i \cdot \beta')^2\right) \exp\!\left(-\frac{1}{2\eta^2} \beta'^\top \beta'\right),$$

where $g(\cdot)$ is the density function of $\beta'$ and $h(\cdot)$ is the density function of $\beta'|Y_i$. The normalizer for the p.d.f. $h$ is defined as $Z = \int_{-\infty}^{\infty} f(Y_i|\beta')\, g(\beta')\, d\beta'$, and rewriting the p.d.f.:

$$h(\beta'|Y_i) = \frac{1}{Z} \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - X_i \cdot \beta')^2 - \frac{1}{2\eta^2} \beta'^\top \beta'\right).$$

Likewise, the p.d.f. of $\beta'|Y$ is

$$h(\beta'|Y) = \frac{1}{Z_n} \exp\!\left(-\frac{1}{2\sigma^2}(Y - X\beta')^\top (Y - X\beta') - \frac{1}{2\eta^2} \beta'^\top \beta'\right) = \frac{1}{Z_n} \exp\!\left(-\frac{1}{2\sigma^2}\|Y - X\beta'\|_2^2 - \frac{1}{2\eta^2}\|\beta'\|_2^2\right),$$

where $Z_n$ is the normalization factor, defined as $Z_n = \int_{-\infty}^{\infty} f(Y|\beta')\, g(\beta')\, d\beta'$.
c. $\hat\beta'_{MAP}$, the MAP estimate of $\beta'$, is defined as the value of $\beta'$ that establishes the mode of the a posteriori probability $P(\beta'|X, Y)$. Show that solving for the MAP estimate leads to the problem (6) if $\lambda$ can be expressed in terms of $\sigma$ and $\eta$, i.e., $\lambda = g(\sigma, \eta)$. Find the explicit expression for $g(\sigma, \eta)$.

Answer:

From the previous subproblem, it is clear that

$$\arg\max_{\beta'}\ h(\beta'|Y) = \arg\min_{\beta'}\ \frac{1}{2\sigma^2}\|Y - X\beta'\|_2^2 + \frac{1}{2\eta^2}\|\beta'\|_2^2. \qquad (7)$$

If $\eta^2$ equals $\frac{\sigma^2}{\lambda}$, then the minimization problem becomes equivalent to (6), $\arg\min_{\beta'} \|Y - X\beta'\|_2^2 + \lambda\|\beta'\|_2^2$. In other words, rearranging the terms, $\lambda = g(\sigma, \eta) = \frac{\sigma^2}{\eta^2}$ for (7) to become (6).
d. Explain how the regularization term can alleviate the potential problem. Note: Let's skip the case of the non-invertible matrix, as we already covered that case in Part II.

Answer:

The regularization term penalizes large components in $\beta$, which shrinks $\beta$ toward a smaller norm. As a consequence, the penalty term encourages the model to avoid letting individual (possibly noisy) training points influence $\beta$ drastically.
Many modern regression problems can have hundreds or thousands of predictor variables. If these variables are correlated, standard techniques will lead to overly complex models and poor generalization error. Another issue is that problems may have more predictor variables than examples. When this happens, standard regression models will fail. One technique that addresses both these issues is called Ridge regression. The idea is to modify the loss function by adding a penalty term on the weights to encourage them to be small. This penalty term is known as a regularizer and controls model complexity by putting large weight on only the most important predictor variables in the model. Ridge regression penalizes the squared length of the weight vector $\beta$. This is sometimes known as an L2 penalty because it is the square of the L2 norm (a.k.a. the Euclidean norm).

Let $D = \{y_i, x_i\}_{1 \ldots N}$ be a collection of $N$ training examples. Let $y_i$ be the response variable for example $i$, and let $y \in \mathbb{R}^N$ be the column vector of all response variables. Each training example also has $k$ predictor variables: let $x_i \in \mathbb{R}^k$ be the vector of predictor variables for example $i$, let $X \in \mathbb{R}^{N \times k}$ be the matrix of all predictor variables whose row $i$ is $x_i^\top$, and let $\beta \in \mathbb{R}^{k \times 1}$ be the column vector of weight parameters that we're trying to learn. Note: At CMU, 2015 spring, Alex Smola, HW1, pr. 2 we have shown (by using the vector derivative method) that the analytic solution to

$$\arg\min_\beta\ \|y - X\beta\|^2 + \lambda\|\beta\|^2$$

is given by $\hat\beta_{Ridge} = (X^\top X + \lambda I)^{-1} X^\top y$.
While the analytic solution is easy to implement, it can become intractable when there are very large numbers of predictor variables, because we need to compute $X^\top X$ and its inverse. One way to deal with this problem is to use an optimization technique called steepest (a.k.a. batch) descent. This is an iterative procedure that updates the weights at the $t$-th iteration as follows:

$$\beta_t = \beta_{t-1} - \eta \nabla_\beta L(\beta; D),$$

where $\eta$ is a learning rate constant and $\nabla_\beta L(\beta; D)$ is the gradient of the loss function $L(\beta; D)$. Compute the steepest descent update rule when $L(\beta; D)$ is:

$$L(\beta; D) = \frac{1}{2} \sum_{i=1}^N (y_i - x_i^\top \beta)^2 + \frac{\lambda}{2}\, \beta^\top \beta.$$

Hint: You may find useful the following formulas (from Matrix Identities, by Sam Roweis, 1999): (5a) $\frac{\partial}{\partial X} a^\top X = \frac{\partial}{\partial X} X^\top a = a$; (5b) $\frac{\partial}{\partial X} X^\top A X = (A + A^\top) X$.
Answer:

$$L(\beta; D) = \frac{1}{2} \sum_{i=1}^N (y_i - x_i^\top \beta)^2 + \frac{\lambda}{2}\, \beta^\top \beta = \frac{1}{2} \sum_{i=1}^N \left( y_i^2 - 2 y_i x_i^\top \beta + \beta^\top x_i x_i^\top \beta \right) + \frac{\lambda}{2}\, \beta^\top \beta.$$

And now take the gradient, using (5a) and (5b):

$$\nabla_\beta L(\beta; D) = \frac{1}{2} \sum_{i=1}^N \left\{ -2 y_i x_i + 2 x_i x_i^\top \beta \right\} + \lambda \beta = \sum_{i=1}^N \left\{ -y_i x_i + x_i (x_i^\top \beta) \right\} + \lambda \beta.$$

So the final update becomes:

$$\beta_t = \beta_{t-1} - \eta \left( \sum_{i=1}^N \left\{ -y_i x_i + x_i (x_i^\top \beta_{t-1}) \right\} + \lambda \beta_{t-1} \right).$$
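The update rule above translates directly into code. The sketch below (my own, with arbitrary data, learning rate and $\lambda$) runs the batch update and compares the result against the analytic Ridge solution it should converge to.

```python
import numpy as np

rng = np.random.default_rng(4)
N, k, lam, eta = 200, 5, 1.0, 1e-3
X = rng.normal(size=(N, k))
y = X @ rng.normal(size=k) + 0.1 * rng.normal(size=N)

beta = np.zeros(k)
for _ in range(5000):
    grad = X.T @ (X @ beta - y) + lam * beta   # sum_i (x_i^T beta - y_i) x_i + lam * beta
    beta -= eta * grad

beta_exact = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)
print(np.max(np.abs(beta - beta_exact)))       # should be small
```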
Another type of update is called stochastic (a.k.a. incremental) gradient descent, which updates the weights using the gradient computed at each single example. This is extremely useful for very large datasets and also when data examples stream in over time; it is sometimes called an online learning technique. The update rule is given by:

$$\beta_t = \beta_{t-1} - \eta \nabla_\beta L_i(\beta_{t-1}; x_i, y_i),$$

where $\eta$ is a learning rate constant and $\nabla_\beta L_i(\beta; x_i, y_i)$ is the gradient of the loss function at a particular example $i$. Compute the stochastic descent update rule when $L_i(\beta; x_i, y_i)$ is:

$$L_i(\beta; x_i, y_i) = \frac{1}{2}(y_i - x_i^\top \beta)^2 + \frac{\lambda}{2}\, \beta^\top \beta.$$

Answer:

In a similar fashion to the batch method we arrive at:

$$\beta_t = \beta_{t-1} - \eta \left( -y_i x_i + x_i (x_i^\top \beta_{t-1}) + \lambda \beta_{t-1} \right).$$
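A matching sketch of the stochastic update, processing one $(x_i, y_i)$ at a time (again an illustration with invented data; the decaying learning rate is an arbitrary choice that keeps the iterates stable).

```python
import numpy as np

rng = np.random.default_rng(5)
N, k, lam = 500, 5, 0.1
X = rng.normal(size=(N, k))
y = X @ rng.normal(size=k) + 0.1 * rng.normal(size=N)

beta = np.zeros(k)
for epoch in range(20):
    eta = 0.02 / (1.0 + epoch)          # slowly decaying learning rate
    for i in rng.permutation(N):
        # per-example update: beta <- beta - eta * (-y_i x_i + x_i (x_i^T beta) + lam * beta)
        beta -= eta * (-y[i] * X[i] + X[i] * (X[i] @ beta) + lam * beta)
print(beta)
```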
From CMU, 2008 spring, Tom Mitchell, HW2, pr. 1 and Stanford, Andrew Ng, lecture notes #1, http://cs229.stanford.edu/notes/cs229-notes1.pdf, section 7.

While MLEs can sometimes be found analytically, for complicated likelihood functions they may need to be computed using numerical methods. One such method is the Newton algorithm, which iteratively finds a sequence $\theta_0, \theta_1, \ldots$ that (under ideal conditions) converges to the MLE $\hat\theta$. The idea here is that we are trying to find the value of $\theta$ that maximizes the likelihood function. Newton's method does this by efficiently finding a root of the likelihood function's first derivative, by following successively closer tangents of that first-derivative function. Note that one may expand the derivative of the log-likelihood function around $\theta_j$:

$$0 = l'(\hat\theta) \approx l'(\theta_j) + (\hat\theta - \theta_j)\, l''(\theta_j).$$

Solving for $\hat\theta$ gives $\hat\theta \approx \theta_j - \frac{l'(\theta_j)}{l''(\theta_j)}$. This leads to an iterative scheme where

$$\theta_{j+1} = \theta_j - \frac{l'(\theta_j)}{l''(\theta_j)}.$$

The generalization of Newton's method to a multi-dimensional setting (also called the Newton-Raphson method) is given by

$$\theta_{j+1} = \theta_j - H^{-1} \nabla_\theta l(\theta).$$

Here, $\nabla_\theta l(\theta)$ is, as usual, the vector of partial derivatives of $l(\theta)$ with respect to the $\theta_i$'s; and $H$ is an $n$-by-$n$ matrix (actually, $(n+1)$-by-$(n+1)$, assuming that we include the intercept term) called the Hessian, whose entries are given by

$$H_{ij} = \frac{\partial^2 l(\theta)}{\partial \theta_i\, \partial \theta_j}.$$

Newton's method typically enjoys faster convergence than (batch) gradient descent and requires many fewer iterations to get very close to the minimum. One iteration of Newton's method can, however, be more expensive than one iteration of gradient descent, since it requires finding and inverting an $n$-by-$n$ Hessian; but so long as $n$ is not too large, it is usually much faster overall.
In this problem we will prove that if we use Newton's method to solve the least squares problem, then we only need one iteration to converge to $\hat\beta$.

a. Find the Hessian of the cost function $J(\beta) = \frac{1}{2} \sum_{i=1}^m (\beta^\top x^{(i)} - y^{(i)})^2$.

b. Show that the first iteration of Newton's method gives us $\hat\beta = (X^\top X)^{-1} X^\top y$, the solution to our least squares problem.

Solution

a.

$$\frac{\partial J(\beta)}{\partial \beta_j} = \sum_{i=1}^m (\beta^\top x^{(i)} - y^{(i)})\, x_j^{(i)}.$$

So,

$$\frac{\partial^2 J(\beta)}{\partial \beta_j\, \partial \beta_k} = \sum_{i=1}^m \frac{\partial}{\partial \beta_k} \left[ (\beta^\top x^{(i)} - y^{(i)})\, x_j^{(i)} \right] = \sum_{i=1}^m x_j^{(i)} x_k^{(i)} = (X^\top X)_{jk}.$$

Therefore, the Hessian of $J(\beta)$ is $X^\top X$. (This can also be derived by simply applying the rules from the lecture notes on Linear Algebra.)

b.

$$\beta^{(1)} = \beta^{(0)} - H^{-1} \nabla_\beta J(\beta^{(0)}) = \beta^{(0)} - (X^\top X)^{-1} (X^\top X \beta^{(0)} - X^\top y) = \beta^{(0)} - \beta^{(0)} + (X^\top X)^{-1} X^\top y = (X^\top X)^{-1} X^\top y.$$

Therefore, no matter what $\beta^{(0)}$ we pick, Newton's method always finds $\hat\beta$ after one iteration.
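Because the Hessian of $J(\beta)$ is the constant matrix $X^\top X$, a single Newton step from any starting point lands exactly on the least squares solution. The sketch below (illustrative data only) verifies this numerically.

```python
import numpy as np

rng = np.random.default_rng(6)
m, d = 50, 4
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

beta0 = rng.normal(size=d)                   # arbitrary starting point
grad = X.T @ (X @ beta0 - y)                 # gradient of J at beta0
H = X.T @ X                                  # Hessian of J (constant)
beta1 = beta0 - np.linalg.solve(H, grad)     # one Newton step

beta_ls = np.linalg.solve(X.T @ X, X.T @ y)  # least squares solution
print(np.allclose(beta1, beta_ls))           # True: one step suffices
```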
In linear regression, we are given training data of the form $D = (X, y) = \{(x_i, y_i)\}$, $i = 1, 2, \ldots, n$, where $x_i = (x_{i,1}, \ldots, x_{i,d})^\top \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, $X \in \mathbb{R}^{n \times d}$ with row $i$ of $X$ being $x_i^\top$, and $y = (y_1, \ldots, y_n)^\top$. Assuming a parametric model of the form $y_i = x_i^\top \beta + \varepsilon_i$, where the $\varepsilon_i$ are noise terms from a given distribution, linear regression seeks to find the parameter vector $\beta$ that provides the best fit of the above regression model. One criterion to measure fitness is to find the $\beta$ that minimizes a given loss function $J(\beta)$. At problem CMU, 2015 spring, A. Smola, HW1, pr. 2 (part I), we have shown that if we take the loss function to be the squared error, i.e.

$$J_1(\beta) = \sum_i (y_i - x_i^\top \beta)^2 = (X\beta - y)^\top (X\beta - y), \quad \text{then } \hat\beta = (X^\top X)^{-1} X^\top y. \qquad (8)$$

Moreover, we have also shown that if we assume that $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. and sampled from the same zero-mean Gaussian, that is, $\varepsilon_i \sim N(0, \sigma^2)$, then the least squares estimate is also the MLE estimate for $p(y|X; \beta)$. In this problem we will explore two extensions to this basic regression model.

a. Assume that $\varepsilon_1, \ldots, \varepsilon_n$ are independent but each $\varepsilon_i \sim N(0, \sigma_i^2)$. Derive the MLE estimate of $\beta$ under this noise model.

Hint: You may find useful the following formulas (from Matrix Identities, by Sam Roweis, 1999):ᵃ (5a) $\frac{\partial}{\partial X} a^\top X = \frac{\partial}{\partial X} X^\top a = a$; (5b) $\frac{\partial}{\partial X} X^\top A X = (A + A^\top) X$.

ᵃhttp://www.cs.nyu.edu/~roweis/notes/matrixid.pdf.
Answer:

$y_i = x_i^\top \beta + \varepsilon_i$ and $p(y_i|x_i; \beta) = N(x_i^\top \beta, \sigma_i^2)$. Thus the formula for the MLE of $\beta$ is:

$$\beta_{MLE} = \arg\max_\beta \ln \prod_i p(y_i|x_i; \beta) = \arg\max_\beta \sum_i \ln p(y_i|x_i; \beta) = \arg\max_\beta \sum_i \ln \left[ \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left(-\frac{(y_i - x_i^\top \beta)^2}{2\sigma_i^2}\right) \right]$$
$$= \arg\max_\beta \sum_i \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma_i} - \frac{(y_i - x_i^\top \beta)^2}{2\sigma_i^2} \right] = \arg\max_\beta \sum_i -\frac{(y_i - x_i^\top \beta)^2}{2\sigma_i^2} = \arg\min_\beta \sum_i \frac{(y_i - x_i^\top \beta)^2}{2\sigma_i^2} = \arg\min_\beta \sum_i \frac{1}{\sigma_i^2} (y_i - x_i^\top \beta)^2. \qquad (9)$$

Now we write the expression (9) in matrix notation. If we let $W$ be the diagonal matrix with diagonal entries $w_{ii} = \frac{1}{\sigma_i^2}$, we get:

$$\beta_{MLE} = \arg\min_\beta\ (y - X\beta)^\top W (y - X\beta).$$

In order to get $\beta_{MLE}$ we just take derivatives of $(y - X\beta)^\top W (y - X\beta)$. Expanding,

$$(y - X\beta)^\top W (y - X\beta) = y^\top W y - y^\top W X \beta - \beta^\top X^\top W y + \beta^\top X^\top W X \beta. \qquad (10)$$

For any scalar $z$, $z^\top = z$, therefore $(\beta^\top X^\top W y)^\top = y^\top W^\top X \beta = y^\top W X \beta$, since $W^\top = W$ as $W$ is diagonal. Now, putting this back in (10) and taking the derivatives (rules (5a) and (5b)), we get:

$$0 = \frac{\partial}{\partial \beta} \left[ y^\top W y - 2 y^\top W X \beta + \beta^\top X^\top W X \beta \right] = -2 X^\top W y + 2 X^\top W X \beta,$$

which means that $\beta_{MLE} = (X^\top W X)^{-1} (X^\top W y)$.
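A brief numpy illustration (hypothetical heteroscedastic data) of $\beta_{MLE} = (X^\top W X)^{-1} X^\top W y$ with $W = \mathrm{diag}(1/\sigma_i^2)$, compared against the unweighted estimate.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 300, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -1.0, 2.0])
sigma_i = rng.uniform(0.1, 3.0, size=n)               # per-example noise levels
y = X @ beta_true + sigma_i * rng.normal(size=n)

W = np.diag(1.0 / sigma_i ** 2)                       # weights w_ii = 1 / sigma_i^2
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted MLE
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)          # unweighted, for comparison
print("weighted:", beta_wls)
print("ordinary:", beta_ols)
```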
b. Show that the MLE you derived in part a is also the minimizer of a weighted least squares loss function $J_2(\beta) = \sum_i a_i (y_i - x_i^\top \beta)^2$. Express each $a_i$ in terms of the variance of each example.

Answer:

We have already shown in (9) that the MLE we just calculated is the minimizer of $\sum_i a_i (y_i - x_i^\top \beta)^2$. From the same relationship (9) we have $a_i = \frac{1}{\sigma_i^2}$.
c. Give an intuitive explanation of why it makes sense to use this weighted version. (Hint: Consider the case when $\sigma_i^2$ is large and when it is small.)

Answer:

When the variance $\sigma_i^2$ is high, the data point $(x_i, y_i)$ might be an outlier, as the noise term $\varepsilon_i$ can be arbitrarily large. In this case, we don't want $\beta_{MLE}$ to be biased to accommodate such outliers, especially when using the squared error loss. The weighted least squares formulation in this problem achieves that by weighting the contribution of each data point to the objective function by the inverse of its variance term. Therefore, points with large variance won't contribute much to the loss function and can be safely ignored, or at least given less importance, when estimating $\beta$.
Now assume that $\varepsilon_1, \ldots, \varepsilon_n$ are independent and identically distributed according to a Laplace distribution with density $p(\varepsilon) = \frac{1}{2\theta} \exp\!\left(-\frac{|\varepsilon|}{\theta}\right)$, $\theta > 0$. Derive the MLE of $\beta$ and the corresponding loss function $J_3(\beta)$ under this noise model.

Answer:

The formula for the MLE of $\beta$ is:

$$\beta_{MLE} = \arg\max_\beta \ln \prod_i p(y_i|x_i; \beta) = \arg\max_\beta \sum_i \ln p(y_i|x_i; \beta) = \arg\max_\beta \sum_i \ln \left[ \frac{1}{2\theta} \exp\!\left(-\frac{|y_i - x_i^\top \beta|}{\theta}\right) \right]$$
$$= \arg\max_\beta \sum_i \left[ \ln \frac{1}{2\theta} - \frac{|y_i - x_i^\top \beta|}{\theta} \right] = \arg\max_\beta \sum_i -\frac{|y_i - x_i^\top \beta|}{\theta} \stackrel{\theta > 0}{=} \arg\min_\beta \sum_i |y_i - x_i^\top \beta|.$$

Thus $J_3(\beta) = \sum_i |y_i - x_i^\top \beta|$.
Why might this noise model be preferable to the Gaussian assumption? (Hint: Think about outliers.)

Answer:

If a point is an outlier, then the error in predicting this point given the correct $\beta$ is much larger under the Gaussian assumption (as it is squared) than in this model (as it is not squared). Therefore, outliers will affect the estimation of $\beta$ in the Gaussian model more than in the Laplace model. From a modeling point of view, since $y_i = x_i^\top \beta + \varepsilon_i$, if $y_i$ is an outlier, then the model can explain that by making $\varepsilon_i$ large to accommodate for the difference. This is possible in the Laplace model, since the Laplace distribution has heavier tails than the Gaussian distribution. To relate this to the previous part (the weighted least squares model), to achieve the same effect there we assumed that every example has a different variance. However, those variances have to be estimated (using an EM-like algorithm), since they affect the optimization problem, while in the Laplace model we don't have to do that. On the other hand, optimizing the L1 loss is harder than optimizing the squared loss, since it is not differentiable everywhere and has no closed-form solution.
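To see the robustness argument numerically, the sketch below fits the same toy data once with the squared loss (closed form) and once with the L1 loss, using plain subgradient descent as one simple (if slow) way around the non-differentiability; the data, outliers and step sizes are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 200, 2
X = rng.normal(size=(n, d))
beta_true = np.array([2.0, -1.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)
y[:10] += 100.0                                  # a few gross outliers

beta_l2 = np.linalg.solve(X.T @ X, X.T @ y)      # squared loss, closed form

beta_l1 = np.zeros(d)                            # L1 loss via subgradient descent
for t in range(1, 5001):
    r = y - X @ beta_l1
    subgrad = -X.T @ np.sign(r)                  # subgradient of sum_i |y_i - x_i^T beta|
    beta_l1 -= (0.01 / np.sqrt(t)) * subgrad

print("L2 fit:", beta_l2)                        # typically pulled noticeably off beta_true
print("L1 fit:", beta_l1)                        # typically lands much closer to beta_true
```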
Given the training feature vectors and output values $\{x_t, y_t\}_{t=1,\ldots,n}$, a test input vector $x_{test}$, and scaling factors $\{\alpha_i\}_{i=1,\ldots,d}$ (each $\alpha_i \neq 0$), prove that the prediction of the test output value would be the same if we trained a linear regression on $\{\tilde{x}_t, y_t\}_{t=1,\ldots,n}$, where $\tilde{x}_{t,i} = \alpha_i x_{t,i}$, and predicted on $\tilde{x}_{test}$, where $\tilde{x}_{test,i} = \alpha_i x_{test,i}$.

Solution

Let us consider the original features $x_t$ and the corresponding scaled versions $\hat{x}_{t,i} = \alpha_i x_{t,i}$. We can represent the relationship between the new and the old design matrices using the diagonal matrix $A$ with the scaling factors $\alpha_i$ on its diagonal: $\hat{X} = XA$. Plugging this into the optimal weight equation (where $X$, $y$ refer to the training set), we get:

$$\hat{w} = (\hat{X}^\top \hat{X})^{-1} \hat{X}^\top y = ((XA)^\top XA)^{-1} (XA)^\top y = (A^\top X^\top X A)^{-1} A^\top X^\top y = A^{-1} (X^\top X)^{-1} (A^\top)^{-1} A^\top X^\top y = A^{-1} (X^\top X)^{-1} X^\top y = A^{-1} w.$$

And so, the predicted output is:

$$\text{predicted } \hat{y}_{test} = \hat{X}_{test}\, \hat{w} = X_{test}\, A A^{-1} w = X_{test}\, w = \text{predicted } y_{test}.$$
The linear regression model has the form $y = x^\top \beta + \varepsilon$, where $x = (x_1, \ldots, x_d)^\top$, $\beta = (\beta_1, \ldots, \beta_d)^\top$, $y \in \mathbb{R}$, and $\varepsilon$ is an additive noise with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$. We observe a set of training data $(x_1, y_1), \ldots, (x_n, y_n)$ from which to estimate the parameters $\beta$. Each $x_i = (x_{i1}, \ldots, x_{id})^\top$ is a vector of inputs for the $i$-th case. Denote by $X$ the $n \times d$ matrix with each row an input vector, and similarly let $Y$ be the $n$-vector of outputs in the training set. We have shown at CMU, 2015 spring, A. Smola, HW1, pr. 2 that minimizing the residual sum of squares $\|Y - X\beta\|^2$ leads to the following estimate of $\beta$: $\hat\beta = (X^\top X)^{-1} X^\top Y$. The least squares prediction of $y$ is given by $\hat{y} = X\hat\beta = X(X^\top X)^{-1} X^\top Y$. Note that $Y$ is a random variable, because given any input vector $x$, the corresponding output is corrupted by the noise $\varepsilon$; hence $\hat\beta$ is a function of $Y$ and is therefore also a random quantity.

a. Show that the least squares estimator is unbiased, that is, $E(\hat\beta) = \beta$.

Answer:

$$E[\hat\beta] = E[(X^\top X)^{-1} X^\top Y] = E[(X^\top X)^{-1} X^\top (X\beta + \varepsilon)] = E[\beta + (X^\top X)^{-1} X^\top \varepsilon] \qquad (11)$$
$$= \beta + (X^\top X)^{-1} X^\top E[\varepsilon] = \beta,$$

since $E[\varepsilon] = 0$.
b. Show that the covariance matrix of $\hat\beta$ equals $\sigma^2 (X^\top X)^{-1}$.

Answer:

$$\mathrm{Cov}[\hat\beta] \stackrel{\text{def.}}{=} \left[ \mathrm{Cov}(\hat\beta_i, \hat\beta_j) \right]_{i,j \in \{1,\ldots,d\}} \stackrel{\text{def.}}{=} \left[ E[(\hat\beta_i - E[\hat\beta_i])(\hat\beta_j - E[\hat\beta_j])] \right]_{i,j} = \left[ E[\hat\beta_i \hat\beta_j] - E[\hat\beta_i] E[\hat\beta_j] \right]_{i,j} = E[\hat\beta \hat\beta^\top] - E[\hat\beta]\, E[\hat\beta]^\top.$$

Answer (cont'd):

Using (11) and $E[\hat\beta] = \beta$:

$$\mathrm{Cov}[\hat\beta] = E[(\beta + (X^\top X)^{-1} X^\top \varepsilon)(\beta + (X^\top X)^{-1} X^\top \varepsilon)^\top] - \beta\beta^\top = E[(\beta + (X^\top X)^{-1} X^\top \varepsilon)(\beta^\top + \varepsilon^\top X (X^\top X)^{-1})] - \beta\beta^\top$$
$$= \underbrace{E[\beta\beta^\top]}_{\beta\beta^\top} + \underbrace{E[\beta \varepsilon^\top X (X^\top X)^{-1} + (X^\top X)^{-1} X^\top \varepsilon\, \beta^\top]}_{=\,0,\ \text{since } E[\varepsilon]\,=\,0} + E[(X^\top X)^{-1} X^\top \varepsilon\, \varepsilon^\top X (X^\top X)^{-1}] - \beta\beta^\top$$
$$= E[(X^\top X)^{-1} X^\top \varepsilon\, \varepsilon^\top X (X^\top X)^{-1}] = (X^\top X)^{-1} X^\top \underbrace{E[\varepsilon \varepsilon^\top]}_{\sigma^2 I_{n \times n}} X (X^\top X)^{-1} = \sigma^2 (X^\top X)^{-1} (X^\top X)(X^\top X)^{-1} = \sigma^2 (X^\top X)^{-1}.$$

(We used that $((X^\top X)^\top)^{-1} = (X^\top X)^{-1}$, since $X^\top X$ is symmetric.)
c. We have seen that an extension of the least squares estimator can be obtained by imposing a penalty on the size of the regression coefficients:

$$\|X\beta - y\|_2^2 + \lambda \beta^\top \beta,$$

where the parameter $\lambda$ controls the contribution of the regularization term. Minimizing this cost function gives the following estimator of the regression coefficients, $\hat\beta = (X^\top X + \lambda I)^{-1} X^\top y$, which is known as the Ridge estimator in statistics. The Ridge prediction is $\hat{y} = X\hat\beta = X(X^\top X + \lambda I)^{-1} X^\top y$. Show that the Ridge estimator is biased.

Answer:

$$E[\hat\beta] - \beta = E[(X^\top X + \lambda I)^{-1} X^\top y] - \beta = E[(X^\top X + \lambda I)^{-1} X^\top (X\beta + \varepsilon)] - \beta = (X^\top X + \lambda I)^{-1} X^\top X \beta + (X^\top X + \lambda I)^{-1} X^\top \underbrace{E[\varepsilon]}_{=\,0} - \beta$$
$$= (X^\top X + \lambda I)^{-1} X^\top X \beta - \beta = (X^\top X + \lambda I)^{-1} (X^\top X + \lambda I - \lambda I)\beta - \beta = \beta - \lambda (X^\top X + \lambda I)^{-1} \beta - \beta = -\lambda (X^\top X + \lambda I)^{-1} \beta,$$

which is nonzero in general (whenever $\lambda > 0$ and $\beta \neq 0$), so the Ridge estimator is biased.
[without "intercept" term]

[Figure: a system S takes the input $x = (x_1, x_2)$ and produces the output $y = c_1 x_1 + c_2 x_2$, to which additive noise $\varepsilon$ is introduced by the measuring device.]
This figure shows a system S which takes two inputs $x_1, x_2$ and outputs a linear combination of those two inputs, $c_1 x_1 + c_2 x_2$, where $c_1$ and $c_2$ are two unknown real numbers. The device you use to measure the output of S, i.e., $c_1 x_1 + c_2 x_2$, introduces an additive error $\varepsilon$, which is a random variable following some distribution. Thus, the output $y$ that you observe is given by equation (12):

$$y = c_1 x_1 + c_2 x_2 + \varepsilon. \qquad (12)$$

Assume that you have $n > 2$ instances $\{x_{j1}, x_{j2}, y_j\}_{j=1,\ldots,n}$, or equivalently $\{x_j, y_j\}_{j=1,\ldots,n}$, where $x_j \stackrel{\text{not.}}{=} [x_{j1}, x_{j2}]$. In other words, having $n$ measurements in your hands is equivalent to having $n$ equations of the following form: $y_j = c_1 x_{j1} + c_2 x_{j2} + \varepsilon_j$, $j = 1, \ldots, n$. The goal is to estimate $c_1$ and $c_2$ from those measurements using maximum likelihood.
a. Assume that the $\varepsilon_i$ for $i = 1, \ldots, n$ are i.i.d. Gaussian random variables with zero mean and variance $\sigma^2$. Compute the log-likelihood function and use it to prove that the maximum likelihood estimate $c^* = [c_1^*, c_2^*]$ is the solution of a least squares approximation problem. Find the solution of the least squares problem.

Answer:

$\varepsilon_i = y_i - (c_1 x_{i1} + c_2 x_{i2}) \sim N(0, \sigma^2)$, therefore $y_i \sim N(c_1 x_{i1} + c_2 x_{i2}, \sigma^2)$. Since the noise terms are i.i.d., the likelihood function is given by

$$L(c_1, c_2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - c_1 x_{i1} - c_2 x_{i2})^2}{2\sigma^2}\right).$$

Taking the logarithm, we get the log-likelihood function:

$$l(c_1, c_2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - c_1 x_{i1} - c_2 x_{i2})^2.$$

Let $y \in \mathbb{R}^n$ be the vector containing the measurements, $X$ the $n \times 2$ matrix with $X_{ij} = x_{ij}$, and $c = [c_1, c_2]^\top$; then we are trying to minimize $\|y - Xc\|_2^2$, resulting in the solution $c^* = (X^\top X)^{-1} X^\top y$.
b. Assume that the $\varepsilon_i$ for $i = 1, \ldots, n$ are independent Gaussian random variables with zero mean and variance $\mathrm{Var}(\varepsilon_i) = \sigma_i^2$. Compute the log-likelihood function and find the $c^* = [c_1^*, c_2^*]$ which maximizes it, i.e., the MLE.

Answer:

$\varepsilon_i = y_i - (c_1 x_{i1} + c_2 x_{i2}) \sim N(0, \sigma_i^2)$. Similarly to before,

$$l(c_1, c_2) = -\frac{n}{2} \log(2\pi) - \sum_{i=1}^n \log \sigma_i - \sum_{i=1}^n \frac{(y_i - c_1 x_{i1} - c_2 x_{i2})^2}{2\sigma_i^2}.$$

Now we are trying to minimize $\|W(y - Xc)\|_2^2$, where $W$ is a diagonal matrix with $w_{ii} = \frac{1}{\sigma_i}$, resulting in the solution $c^* = (X^\top W^\top W X)^{-1} X^\top W^\top W y$. (See the correspondence with the formula obtained at CMU, 2015 spring, Alex Smola, HW1, pr. 2.d, or directly the formula obtained at CMU, 2010 spring, E. Xing, T. Mitchell, A. Singh, HW2, pr. 3.1-2.a.)
c. Assume now that the noise terms are i.i.d. with density $p(x) = \frac{1}{2\theta} \exp\!\left(-\frac{|x|}{\theta}\right)$, $\theta > 0$. In other words, our noise is i.i.d. following a Laplace distribution with location parameter $\mu = 0$ and scale parameter $\theta$. Compute the log-likelihood function under this noise model and explain why this model leads to more robust solutions.

Answer:

$$l(c_1, c_2) = -n \log(2\theta) - \frac{1}{\theta} \|y - Xc\|_1.$$

This model is prepared to see larger residual values because the Laplace distribution has heavier tails than the Gaussian; thus it is more robust to noise and outliers.
[Figure: the Laplace p.d.f. $f(x)$ plotted for $(\mu = 0, \theta = 1)$, $(\mu = 0, \theta = 2)$, $(\mu = 0, \theta = 4)$ and $(\mu = -5, \theta = 4)$.]
You are given $n$ i.i.d. training datapoints $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})$, where each vector $x^{(i)}$ has $d$ features / attributes. Here we will assume that $y^{(i)} \in \{0, 1\}$ for $i = 1, \ldots, n$. Logistic Regression is a classification algorithm that works by trying to learn a function that approximates $P(Y|X)$. It makes the central assumption that $P(Y|X)$ can be approximated as a sigmoid function (also called a logistic function) applied to a linear combination of the input features. Mathematically, for a single training datapoint $(x, y)$, Logistic Regression assumes:

$$P(Y = 1|X = x) = \sigma(z) \quad \text{and, equivalently,} \quad P(Y = 0|X = x) = 1 - \sigma(z),$$

where

$$\sigma(z) \stackrel{\text{def.}}{=} \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}, \quad \text{with } z \stackrel{\text{not.}}{=} w_0 + \sum_{i=1}^d w_i x_i = w \cdot x \quad \text{and} \quad w \stackrel{\text{not.}}{=} (w_0, w_1, \ldots, w_d) \in \mathbb{R}^{d+1}, \text{ assuming } x_0 = 1.$$
[Figure: plot of the sigmoid function $\sigma(x)$.]
Starting from the above formulas for the probability of $Y|X$, we can create an algorithm that selects values of $w$ that maximize that probability for all the training data.
Show that the conditional log-likelihood of the training data (under the Logistic Regression assumption) is:

$$L(w) = \sum_{i=1}^n \left\{ y^{(i)} \ln \sigma(w \cdot x^{(i)}) + (1 - y^{(i)}) \ln [1 - \sigma(w \cdot x^{(i)})] \right\}. \qquad (13)$$

Solution:

To start, here is a super slick way of writing the conditional probability of one datapoint:

$$P(Y = y|X = x) = \sigma(w \cdot x)^y \, [1 - \sigma(w \cdot x)]^{1-y}, \quad \text{assuming } y \in \{0, 1\}.$$

Since each datapoint is independent, the conditional probability of all the data is:

$$\prod_{i=1}^n P(Y = y^{(i)}|X = x^{(i)}) = \prod_{i=1}^n \sigma(w \cdot x^{(i)})^{y^{(i)}} \, [1 - \sigma(w \cdot x^{(i)})]^{1-y^{(i)}}. \qquad (14)$$

If you apply $\ln$ to this function, you get the reported conditional log-likelihood for Logistic Regression.
Note [from CMU, 2004 fall, T. Mitchell, Z. Bar-Joseph, HW2, pr. 4]: Actually, the full log-likelihood function is:

$$\text{log-likelihood} = \ln \prod_{i=1}^n P(x^{(i)}, y^{(i)}) = \ln \prod_{i=1}^n \left( P_{Y|X}(y^{(i)}|x^{(i)})\, P_X(x^{(i)}) \right) = \ln \left[ \prod_{i=1}^n P_{Y|X}(y^{(i)}|x^{(i)}) \cdot \prod_{i=1}^n P_X(x^{(i)}) \right]$$
$$= \ln \prod_{i=1}^n P_{Y|X}(y^{(i)}|x^{(i)}) + \ln \prod_{i=1}^n P_X(x^{(i)}) = L(w) + L_x.$$

Because $L_x$ does not depend on the parameter $w$, when doing MLE we can just consider maximizing $L(w)$.
Starting from the expression (13) for the conditional log-likelihood function, we simply need to choose the values of $w$ that maximize it. Unlike in other situations, here there is no closed-form way to calculate $w$. Instead we will choose it using optimization; we will employ an algorithm called gradient ascent. That algorithm claims that if you continuously take small steps in the direction of your gradient, you will eventually reach a local maximum. In the case of Logistic Regression it can be proven that the result will always be a global maximum.ᵃ The small step that we continually take, given the training dataset, can be calculated as:

$$w_j^{new} = w_j^{old} + \eta\, \frac{\partial}{\partial w_j^{old}} L(w^{old}),$$

where $\eta$ is the magnitude of the step size ("learning rate") that we take.

ᵃSee Stanford, 2008 fall, Andrew Ng, HW1, pr. 1.a.
Show that the partial derivative of the conditional log-likelihood with respect to each parameter $w_j$ is:

$$\frac{\partial}{\partial w_j} L(w) = \sum_{i=1}^n \left[ y^{(i)} - \sigma(w \cdot x^{(i)}) \right] x_j^{(i)}, \quad \text{for } j = 0, 1, \ldots, d. \qquad (15)$$

Hint: You may use the following property for the derivative of $\sigma$ with respect to its input: $\frac{\partial}{\partial z} \sigma(z) = \sigma(z)[1 - \sigma(z)]$ for all $z \in \mathbb{R}$.

Solution

The partial derivative of the conditional log-likelihood for only one datapoint $(x, y)$ w.r.t. the $w_j$ component is computed as follows:

$$\frac{\partial}{\partial w_j} \ln \left[ \sigma(w \cdot x)^y [1 - \sigma(w \cdot x)]^{1-y} \right] = \frac{\partial}{\partial w_j}\, y \ln \sigma(w \cdot x) + \frac{\partial}{\partial w_j}\, (1 - y) \ln[1 - \sigma(w \cdot x)]$$
$$= \left[ \frac{y}{\sigma(w \cdot x)} - \frac{1 - y}{1 - \sigma(w \cdot x)} \right] \frac{\partial}{\partial w_j} \sigma(w \cdot x) = \frac{y - \sigma(w \cdot x)}{\sigma(w \cdot x)[1 - \sigma(w \cdot x)]}\, \sigma(w \cdot x)[1 - \sigma(w \cdot x)]\, x_j = [y - \sigma(w \cdot x)]\, x_j. \qquad (16)$$

Because the derivative of a sum is the sum of the derivatives, the partial derivative of $L(w)$ is the sum of this term over the training datapoints. More exactly, after applying the $\ln$ function to the (14) equality and then calculating its partial derivative w.r.t. $w_j$ (for $j \in \{0, 1, \ldots, d\}$), we get the (15) result, due to the (16) relation proven above.
[based on CMU, 2004 fall, T. Mitchell, Z. Bar-Joseph, HW2, pr. 4] Starting from (13), the conditional log-likelihood function can be further written as:

$$L(w) = \sum_{i=1}^n \left\{ y_i \ln \sigma(w \cdot x_i) + (1 - y_i) \ln(1 - \sigma(w \cdot x_i)) \right\} = \sum_{i=1}^n \left\{ y_i \ln \frac{\sigma(w \cdot x_i)}{1 - \sigma(w \cdot x_i)} + \ln(1 - \sigma(w \cdot x_i)) \right\} \stackrel{(*)}{=} \sum_{i=1}^n \left\{ y_i (w \cdot x_i) - \ln(1 + e^{w \cdot x_i}) \right\}. \qquad (17)$$

And therefore, the gradient vector of $L(w)$ can be written as:

$$\nabla_w L(w) = \sum_{i=1}^n \left( y_i - \frac{e^{w \cdot x_i}}{1 + e^{w \cdot x_i}} \right) x_i = \sum_{i=1}^n \left( y_i - \sigma(w \cdot x_i) \right) x_i.$$

(*): We used the fact that the sigmoid function $\sigma: \mathbb{R} \to (0, 1)$ is bijective, and its inverse function is $\sigma^{-1}: (0, 1) \to \mathbb{R}$, defined by $\sigma^{-1}(z) = \ln \frac{z}{1 - z}$ (some people call $\sigma^{-1}$ the logit function); hence $\ln \frac{\sigma(w \cdot x_i)}{1 - \sigma(w \cdot x_i)} = w \cdot x_i$ and $1 - \sigma(w \cdot x_i) = \frac{1}{1 + e^{w \cdot x_i}}$.
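Putting the gradient ascent update $w \leftarrow w + \eta \sum_i \left( y^{(i)} - \sigma(w \cdot x^{(i)}) \right) x^{(i)}$ into code (a minimal sketch with synthetic labels; the learning rate, iteration count and true weights are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(10)
n, d = 500, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])      # x_0 = 1 carries the intercept w_0
w_true = np.array([-0.5, 2.0, -1.0, 0.5])
y = (rng.uniform(size=n) < sigmoid(X @ w_true)).astype(float)  # Bernoulli(sigma(w.x)) labels

w, eta = np.zeros(d + 1), 0.005
for _ in range(5000):
    grad = X.T @ (y - sigmoid(X @ w))   # gradient of the conditional log-likelihood, as in (15)
    w += eta * grad                     # gradient *ascent* step
print(w)                                # roughly recovers w_true
```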
Consider the log-likelihood function for logistic regression:

$$\ell(\theta) = \sum_{i=1}^m \left[ y^{(i)} \ln h(x^{(i)}) + (1 - y^{(i)}) \ln(1 - h(x^{(i)})) \right], \quad \text{where } x = (x_1, \ldots, x_k)^\top \text{ and } h(x) \stackrel{\text{def.}}{=} \sigma(\theta \cdot x) = \frac{1}{1 + \exp(-\theta \cdot x)}.$$

Find the Hessian $H$ of this function, and show that for any vector $z = (z_1, \ldots, z_k)^\top$ it holds true that $z^\top H z \leq 0$.

Hint: You might want to start by showing the fact that $\sum_i \sum_j z_i x_i x_j z_j = (x^\top z)^2 \geq 0$.

Remark: This is one of the standard ways of showing that the matrix $H$ is negative semi-definite, written "$H \leq 0$". This implies that $\ell$ is concave, and has no local maxima other than the global one.
Note: we do things in a shorter way here; this solution does not use the Hint. Recall that we have $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, and thus for $h(x) = \sigma(\theta \cdot x)$ we have $\frac{\partial h(x)}{\partial \theta_k} = h(x)(1 - h(x))\, x_k$. This latter fact is very useful in the following derivations. Now

$$\frac{\partial \ell(\theta)}{\partial \theta_k} = \frac{\partial}{\partial \theta_k} \sum_{i=1}^m \left[ y^{(i)} \ln h(x^{(i)}) + (1 - y^{(i)}) \ln(1 - h(x^{(i)})) \right] = \sum_{i=1}^m \left[ y^{(i)} \frac{h(x^{(i)})(1 - h(x^{(i)}))\, x_k^{(i)}}{h(x^{(i)})} - (1 - y^{(i)}) \frac{h(x^{(i)})(1 - h(x^{(i)}))\, x_k^{(i)}}{1 - h(x^{(i)})} \right]$$
$$= \sum_{i=1}^m \left[ y^{(i)}(1 - h(x^{(i)})) - (1 - y^{(i)}) h(x^{(i)}) \right] x_k^{(i)} = \sum_{i=1}^m \left( y^{(i)} - h(x^{(i)}) \right) x_k^{(i)},$$

and

$$H_{kl} = \frac{\partial^2 \ell(\theta)}{\partial \theta_k\, \partial \theta_l} = \sum_{i=1}^m -\frac{\partial h(x^{(i)})}{\partial \theta_l}\, x_k^{(i)} = \sum_{i=1}^m -h(x^{(i)})(1 - h(x^{(i)}))\, x_l^{(i)} x_k^{(i)}.$$

So we have for the Hessian matrix $H$ (using that $X = x x^\top$ if and only if $X_{ij} = x_i x_j$):

$$H = -\sum_{i=1}^m h(x^{(i)})(1 - h(x^{(i)}))\, x^{(i)} x^{(i)\top}.$$

To prove that $H$ is negative semi-definite, we show that $z^\top H z \leq 0$ for all $z$:

$$z^\top H z = -z^\top \left[ \sum_{i=1}^m h(x^{(i)})(1 - h(x^{(i)}))\, x^{(i)} x^{(i)\top} \right] z = -\sum_{i=1}^m h(x^{(i)})(1 - h(x^{(i)}))\, z^\top x^{(i)} x^{(i)\top} z = -\sum_{i=1}^m h(x^{(i)})(1 - h(x^{(i)}))\, (z^\top x^{(i)})^2 \leq 0,$$

with the last inequality holding since $0 \leq h(x^{(i)}) \leq 1$ implies $h(x^{(i)})(1 - h(x^{(i)})) \geq 0$, and $(z^\top x^{(i)})^2 \geq 0$.
Consider a binary classification problem with variable $X_1 \in \{0, 1\}$ and label $Y \in \{0, 1\}$. We have a training set $D_1$ made of $n$ examples: $D_1 = \{(x_1^1, y^1), \ldots, (x_1^n, y^n)\}$. Suppose we generate another training set $D_2$ of $n$ examples, $D_2 = \{(x_1^1, x_2^1, y^1), \ldots, (x_1^n, x_2^n, y^n)\}$, where in each example $x_1$ and $y$ are the same as in $D_1$, and $x_2$ is a duplicate of $x_1$. Now we learn a logistic regression from $D_1$, which should contain two parameters, $w_0$ and $w_1$; we also learn another logistic regression from $D_2$, which should have three parameters, $w_0$, $w_1$ and $w_2$. First, write down the training rule (maximum conditional likelihood estimation) we use to estimate the parameters $(w_0, w_1)$ and $(w_0, w_1, w_2)$ from data. Then, given the training rule, what is the relationship between the $(w_0, w_1)$ and $(w_0, w_1, w_2)$ estimated from $D_1$ and $D_2$? Use this fact to argue whether logistic regression is affected by the duplicated variable $X_2$.
Answer:

The training rule for $(w_0, w_1)$ aims to maximize the following log-likelihood function:

$$\ln \prod_{l=1}^n P(Y^l|X_1^l, w_0, w_1) \stackrel{(17)}{=} \sum_{l=1}^n \left[ Y^l (w_0 + w_1 X_1^l) - \ln(1 + \exp(w_0 + w_1 X_1^l)) \right].$$

Similarly, the training rule for $(w'_0, w'_1, w'_2)$ aims to maximize

$$\ln \prod_{l=1}^n P(Y^l|X_1^l, X_2^l, w'_0, w'_1, w'_2) \stackrel{(17)}{=} \sum_{l=1}^n \left[ Y^l (w'_0 + w'_1 X_1^l + w'_2 X_2^l) - \ln(1 + \exp(w'_0 + w'_1 X_1^l + w'_2 X_2^l)) \right]$$
$$= \sum_{l=1}^n \left[ Y^l (w'_0 + (w'_1 + w'_2) X_1^l) - \ln(1 + \exp(w'_0 + (w'_1 + w'_2) X_1^l)) \right],$$

which is basically the same as the log-likelihood function for deriving the training rule for $(w_0, w_1)$, with the substitutions $w_0 = w'_0$ and $w_1 = w'_1 + w'_2$. These substitutions express the relationship between the sets of parameters $(w_0, w_1)$ and $(w'_0, w'_1, w'_2)$ that we estimate from the training sets $D_1$ and $D_2$, respectively. Therefore, logistic regression will simply split the weight $w_1$ into $w'_1 + w'_2 = w_1$ when facing the duplicated variable $X_2 = X_1$.
CMU, 2012 fall, Tom Mitchell, Ziv Bar-Joseph, HW2, pr. 2
We can easily extend the binary Logistic Regression model to handle multi-class classification. Let's assume that we have $K$ different classes, and the posterior probability for class $k$ is given by:

$$P(Y = k|X = x) = \frac{\exp(w_k \cdot x)}{1 + \sum_{t=1}^{K-1} \exp(w_t \cdot x)} \quad \text{for } k = 1, \ldots, K-1, \qquad P(Y = K|X = x) = \frac{1}{1 + \sum_{t=1}^{K-1} \exp(w_t \cdot x)},$$

where $x$ and the $w_t$, $t = 1, \ldots, K$, are $d$-dimensional vectors. Notice that we ignored the components $w_{t0}$ in order to simplify the expression. Our goal is to estimate the weights $w_t$ using the gradient ascent optimization method. We will also define priors on the parameters to avoid overfitting and very large weights.

a. Assume we are given $n$ training examples $(x^1, y^1), \ldots, (x^n, y^n)$, where $n$ is the number of training examples and $d$ is the number of attributes / dimensions. Please explicitly write down the log-likelihood function $L(w_1, \ldots, w_K)$ with L2 regularization on the weights. Show your steps. Hint: You can simplify the multi-class logistic regression expression above by introducing a fixed parameter vector $w_K = 0$ (the $d$-dimensional vector made entirely of $0$'s).

b. There is no closed-form solution that maximizes the log-conditional likelihood $L(w_1, \ldots, w_K)$ with respect to $w_k$. However, we can still find the solution with the gradient ascent method, by using partial derivatives. Derive the expression for the $i$-th component in the gradient of $L(w_1, \ldots, w_K)$, which is the partial derivative of $L(w_1, \ldots, w_K)$ with respect to $w_i$.

c. Write down the gradient ascent update rule for $w_i$, using $\nu$ for the step size. Will the solution converge to a global maximum?
Answer

a. Let $1\{l\,k\}$ be an indicator function, where $1\{l\,k\} = 1$ if $Y^l = k$, otherwise $1\{l\,k\} = 0$. Then, using the convention $w_K = 0$ from the Hint, we can write the likelihood as:

$$L(w_1, \ldots, w_K) = \prod_{l=1}^n \prod_{k=1}^K P(Y^l = k|X^l = x^l; w)^{1\{l\,k\}} = \prod_{l=1}^n \prod_{k=1}^K \left( \frac{\exp(w_k \cdot x^l)}{\sum_{r=1}^K \exp(w_r \cdot x^l)} \right)^{1\{l\,k\}}.$$

Taking $\ln$:

$$\ell(w_1, \ldots, w_K) = \sum_{l=1}^n \sum_{k=1}^K 1\{l\,k\} \left[ w_k \cdot x^l - \ln \sum_{r=1}^K \exp(w_r \cdot x^l) \right].$$

Adding the L2 regularization term:

$$\ell(w_1, \ldots, w_K) = \sum_{l=1}^n \sum_{k=1}^K 1\{l\,k\} \left[ w_k \cdot x^l - \ln \sum_{r=1}^K \exp(w_r \cdot x^l) \right] - \frac{\lambda}{2} \sum_{k=1}^K \|w_k\|^2.$$

b.

$$\frac{\partial}{\partial w_i} \ell(w_1, \ldots, w_K) = \sum_{l=1}^n \left[ 1\{l\,i\} - \frac{\exp(w_i \cdot x^l)}{\sum_{r=1}^K \exp(w_r \cdot x^l)} \right] x^l - \lambda w_i = \sum_{l=1}^n \left[ 1\{l\,i\} - P(Y^l = i|X^l = x^l; w) \right] x^l - \lambda w_i.$$

c. The gradient ascent update rule is

$$w_i \leftarrow w_i + \nu \left( \sum_{l=1}^n \left[ 1\{l\,i\} - P(Y^l = i|X^l = x^l; w) \right] x^l - \lambda w_i \right).$$

This will converge to a global maximum, since the regularized log-likelihood is a concave function.
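A compact sketch of the regularized gradient-ascent update for the multi-class model (synthetic data; $\lambda$, $\nu$ and the iteration count are arbitrary). It keeps $w_K$ fixed at the zero vector, as in the Hint, and updates the remaining weight vectors with $\nabla_{w_i} \ell = \sum_l \left( 1\{l\,i\} - P(Y^l = i|x^l) \right) x^l - \lambda w_i$.

```python
import numpy as np

rng = np.random.default_rng(11)
n, d, K = 600, 4, 3
X = rng.normal(size=(n, d))
W_true = rng.normal(size=(K - 1, d))

def posteriors(W, X):
    """P(Y=k|x) for k = 1..K (columns 0..K-1), with w_K fixed to the zero vector."""
    scores = np.hstack([X @ W.T, np.zeros((X.shape[0], 1))])   # last column: w_K . x = 0
    scores -= scores.max(axis=1, keepdims=True)                # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

# sample labels (0..K-1 stand for classes 1..K) from the assumed true model
y = np.array([rng.choice(K, p=p) for p in posteriors(W_true, X)])
Y_onehot = np.eye(K)[y]

lam, nu = 1.0, 0.002
W = np.zeros((K - 1, d))
for _ in range(3000):
    P = posteriors(W, X)
    # gradient for classes 1..K-1: sum_l (1{y_l = k} - P(k|x_l)) x_l  -  lam * w_k
    grad = (Y_onehot[:, :K - 1] - P[:, :K - 1]).T @ X - lam * W
    W += nu * grad
print(W)
```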
CMU, 2004 fall, T. Mitchell, Z. Bar-Joseph, HW2, pr. 4
Given an input vector $X$, linear regression models a real-valued output $Y$ as $Y|X \sim \mathrm{Normal}(\mu(X), \sigma^2)$, where $\mu(X) = \beta^\top X = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p$. Given an input vector $X$, logistic regression models a binary output $Y$ by $Y|X \sim \mathrm{Bernoulli}(\theta(X))$, where the Bernoulli parameter is related to $\beta^\top X$ by the logit transformation:

$$\mathrm{logit}(\theta(X)) \stackrel{\text{def.}}{=} \log \frac{\theta(X)}{1 - \theta(X)} = \beta^\top X.$$
a. For each of the two regression models defined above, write the log-likelihood function and its gradient with respect to the parameter vector $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$.

Answer:

For linear regression, we can write the log-likelihood function as:

$$LL(\beta) = \log \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - \beta^\top x_i)^2}{2\sigma^2}\right) = \sum_{i=1}^n \log \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - \beta^\top x_i)^2}{2\sigma^2}\right) \right] = -n \log(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta^\top x_i)^2.$$

Therefore, its gradient is:

$$\nabla_\beta LL(\beta) = \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \beta^\top x_i)\, x_i.$$

For logistic regression:

$$\log \frac{\theta(X)}{1 - \theta(X)} = \beta^\top X \iff e^{\beta^\top X} = \frac{\theta(X)}{1 - \theta(X)} \iff e^{\beta^\top X} = \theta(X)(1 + e^{\beta^\top X}).$$

Therefore,

$$\theta(X) = \frac{e^{\beta^\top X}}{1 + e^{\beta^\top X}} = \frac{1}{1 + e^{-\beta^\top X}} \quad \text{and} \quad 1 - \theta(X) = \frac{1}{1 + e^{\beta^\top X}}.$$

Note that $Y|X \sim \mathrm{Bernoulli}(\theta(X))$ means that $P(Y = 1|X) = \theta(X)$ and $P(Y = 0|X) = 1 - \theta(X)$, which can be equivalently written as $P(Y = y|X) = \theta(X)^y (1 - \theta(X))^{1-y}$ for all $y \in \{0, 1\}$.
So, in this case the log-likelihood function is:

$$LL(\beta) = \log \prod_{i=1}^n \left\{ \theta(x_i)^{y_i} (1 - \theta(x_i))^{1-y_i} \right\} = \sum_{i=1}^n \left\{ y_i \log \theta(x_i) + (1 - y_i) \log(1 - \theta(x_i)) \right\}$$
$$= \sum_{i=1}^n \left\{ y_i \left( \beta^\top x_i + \log(1 - \theta(x_i)) \right) + (1 - y_i) \log(1 - \theta(x_i)) \right\} = \sum_{i=1}^n \left\{ y_i (\beta^\top x_i) - \log(1 + e^{\beta^\top x_i}) \right\}.$$

And therefore,

$$\nabla_\beta LL(\beta) = \sum_{i=1}^n \left( y_i - \frac{e^{\beta^\top x_i}}{1 + e^{\beta^\top x_i}} \right) x_i = \sum_{i=1}^n \left( y_i - \theta(x_i) \right) x_i.$$
b. Show that the maximum likelihood estimate $\hat\beta$ has the following property:

$$\sum_{i=1}^n y_i x_i = \sum_{i=1}^n E[Y|X = x_i, \beta = \hat\beta]\, x_i.$$

Answer:

For linear regression: $\nabla_\beta LL(\hat\beta) = 0 \Rightarrow \sum_{i=1}^n y_i x_i = \sum_{i=1}^n (\hat\beta^\top x_i)\, x_i$. Since $Y|X \sim \mathrm{Normal}(\mu(X), \sigma^2)$, we have $E[Y|X = x_i, \beta = \hat\beta] = \mu(x_i) = \hat\beta^\top x_i$. So $\sum_{i=1}^n y_i x_i = \sum_{i=1}^n E[Y|X = x_i, \beta = \hat\beta]\, x_i$.

For logistic regression: $\nabla_\beta LL(\hat\beta) = 0 \Rightarrow \sum_{i=1}^n y_i x_i = \sum_{i=1}^n \theta(x_i)\, x_i$. Since $Y|X \sim \mathrm{Bernoulli}(\theta(X))$, we have $E[Y|X = x_i, \beta = \hat\beta] = \theta(x_i) = \dfrac{e^{\hat\beta^\top x_i}}{1 + e^{\hat\beta^\top x_i}}$. So $\sum_{i=1}^n y_i x_i = \sum_{i=1}^n E[Y|X = x_i, \beta = \hat\beta]\, x_i$.