

SLIDE 1

Regression Methods


SLIDE 2

Linear Regression and Logistic Regression:

definitions, and a common property

CMU, 2004 fall, Andrew Moore, HW2, pr. 4


SLIDE 3

Linear Regression and Logistic Regression: Definitions

Given an input vector X, linear regression models a real-valued output Y as Y | X ∼ Normal(µ(X), σ²), where µ(X) = β⊤X = β_0 + β_1X_1 + … + β_pX_p.

Given an input vector X, logistic regression models a binary output Y by Y | X ∼ Bernoulli(θ(X)), where the Bernoulli parameter is related to β⊤X by the logit transformation:

logit(θ(X)) ≝ log [θ(X) / (1 − θ(X))] = β⊤X.
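To make the two definitions concrete, here is a minimal NumPy sketch (not part of the original slides; the values of β and x are made up, and x carries a leading 1 so that its first weight plays the role of β_0):

```python
import numpy as np

def linear_mean(beta, x):
    """mu(x) = beta^T x: the mean of the Gaussian for Y | X = x under linear regression."""
    return beta @ x

def logistic_prob(beta, x):
    """theta(x) = P(Y = 1 | X = x) under logistic regression: the inverse logit of beta^T x."""
    return 1.0 / (1.0 + np.exp(-(beta @ x)))

# Made-up example with p = 2 features; the leading 1 in x makes beta[0] the intercept beta_0.
beta = np.array([0.5, 2.0, -1.0])
x = np.array([1.0, 0.3, 1.2])
print(linear_mean(beta, x))    # conditional mean of Y given X = x
print(logistic_prob(beta, x))  # Bernoulli parameter theta(x) for Y given X = x
```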

SLIDE 4
a. For each of the two regression models defined above, write the log likelihood function and its gradient with respect to the parameter vector β = (β_0, β_1, …, β_p).

Answer: For linear regression, we can write the log likelihood function as

LL(β) = log ∏_{i=1}^n (1/(√(2π) σ)) exp(−(y_i − µ(x_i))² / (2σ²))
      = Σ_{i=1}^n log [(1/(√(2π) σ)) exp(−(y_i − β⊤x_i)² / (2σ²))]
      = −n log(√(2π) σ) − (1/(2σ²)) Σ_{i=1}^n (y_i − β⊤x_i)².

Therefore, its gradient is

∇_β LL(β) = (1/σ²) Σ_{i=1}^n (y_i − β⊤x_i) x_i.

(The constant factor 1/σ² does not change where the gradient vanishes, so it is often dropped.)
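As an informal check of these formulas, the sketch below (NumPy, synthetic data, σ = 1; nothing here comes from the original problem) evaluates the Gaussian log likelihood and compares its gradient against finite differences:

```python
import numpy as np

def linreg_loglik(beta, X, y, sigma=1.0):
    """LL(beta) = -n log(sqrt(2*pi)*sigma) - (1/(2*sigma^2)) * sum_i (y_i - beta^T x_i)^2."""
    n = len(y)
    resid = y - X @ beta
    return -n * np.log(np.sqrt(2 * np.pi) * sigma) - (resid @ resid) / (2 * sigma**2)

def linreg_grad(beta, X, y, sigma=1.0):
    """Gradient: (1/sigma^2) * sum_i (y_i - beta^T x_i) x_i."""
    return X.T @ (y - X @ beta) / sigma**2

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

beta, eps = np.zeros(3), 1e-6
numeric = np.array([(linreg_loglik(beta + eps * e, X, y) - linreg_loglik(beta - eps * e, X, y)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, linreg_grad(beta, X, y), atol=1e-4))  # True: analytic gradient matches
```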

SLIDE 5

For logistic regression:

log [θ(X) / (1 − θ(X))] = β⊤X  ⇔  e^{β⊤X} = θ(X) / (1 − θ(X))  ⇔  e^{β⊤X} = θ(X)(1 + e^{β⊤X}).

Therefore,

θ(X) = e^{β⊤X} / (1 + e^{β⊤X}) = 1 / (1 + e^{−β⊤X})  and  1 − θ(X) = 1 / (1 + e^{β⊤X}).

Note that Y | X ∼ Bernoulli(θ(X)) means that P(Y = 1 | X) = θ(X) and P(Y = 0 | X) = 1 − θ(X), which can equivalently be written as P(Y = y | X) = θ(X)^y (1 − θ(X))^{1−y} for all y ∈ {0, 1}.
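A tiny sketch (NumPy; the values standing in for β⊤X are made up) that checks numerically that this sigmoid form inverts the logit and that 1 − θ(X) = 1/(1 + e^{β⊤X}):

```python
import numpy as np

def sigmoid(z):
    """theta = 1 / (1 + exp(-z)): the inverse of the logit transformation."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(theta):
    """logit(theta) = log(theta / (1 - theta))."""
    return np.log(theta / (1.0 - theta))

z = np.linspace(-5, 5, 11)                        # made-up values of beta^T X
print(np.allclose(logit(sigmoid(z)), z))          # True: sigmoid and logit are inverses
print(np.allclose(1 - sigmoid(z), sigmoid(-z)))   # True: 1 - theta(X) = 1 / (1 + e^{beta^T X})
```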

SLIDE 6

So, in this case the log likelihood function is

LL(β) = log ∏_{i=1}^n θ(x_i)^{y_i} (1 − θ(x_i))^{1−y_i}
      = Σ_{i=1}^n [y_i log θ(x_i) + (1 − y_i) log(1 − θ(x_i))]
      = Σ_{i=1}^n [y_i (β⊤x_i + log(1 − θ(x_i))) + (1 − y_i) log(1 − θ(x_i))]
      = Σ_{i=1}^n [y_i β⊤x_i − log(1 + e^{β⊤x_i})].

And therefore,

∇_β LL(β) = Σ_{i=1}^n [y_i x_i − (e^{β⊤x_i} / (1 + e^{β⊤x_i})) x_i] = Σ_{i=1}^n (y_i − θ(x_i)) x_i.
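The same kind of finite-difference check works for the logistic log likelihood and its gradient Σ_i (y_i − θ(x_i)) x_i; the sketch below uses NumPy and synthetic data (all names and values are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_loglik(beta, X, y):
    """LL(beta) = sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ]."""
    z = X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))

def logreg_grad(beta, X, y):
    """Gradient: sum_i (y_i - theta(x_i)) x_i."""
    return X.T @ (y - sigmoid(X @ beta))

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = (sigmoid(X @ np.array([1.0, -1.0, 0.5])) > rng.uniform(size=40)).astype(float)

beta, eps = np.zeros(3), 1e-6
numeric = np.array([(logreg_loglik(beta + eps * e, X, y) - logreg_loglik(beta - eps * e, X, y)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, logreg_grad(beta, X, y), atol=1e-4))  # True: analytic gradient matches
```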

SLIDE 7

Remark

Actually, in the above solutions the full log likelihood function should first look like the following:

log-likelihood = log ∏_{i=1}^n p(x_i, y_i)
              = log ∏_{i=1}^n [p_{Y|X}(y_i | x_i) p_X(x_i)]
              = log ([∏_{i=1}^n p_{Y|X}(y_i | x_i)] · [∏_{i=1}^n p_X(x_i)])
              = log ∏_{i=1}^n p_{Y|X}(y_i | x_i) + log ∏_{i=1}^n p_X(x_i)
              = LL + LL_X.

Because LL_X does not depend on the parameter β, when doing MLE we can just consider maximizing LL.

SLIDE 8

b. Show that for each of the two regression models above, the MLE β̂ has the following property:

Σ_{i=1}^n y_i x_i = Σ_{i=1}^n E[Y | X = x_i, β = β̂] x_i.

Answer:

For linear regression: ∇_β LL(β) = 0 ⇒ Σ_{i=1}^n y_i x_i = Σ_{i=1}^n (β̂⊤x_i) x_i. Since Y | X ∼ Normal(µ(X), σ²), we have E[Y | X = x_i, β = β̂] = µ(x_i) = β̂⊤x_i. So Σ_{i=1}^n y_i x_i = Σ_{i=1}^n E[Y | X = x_i, β = β̂] x_i.

For logistic regression: ∇_β LL(β) = 0 ⇒ Σ_{i=1}^n y_i x_i = Σ_{i=1}^n θ(x_i) x_i. Since Y | X ∼ Bernoulli(θ(X)), we have E[Y | X = x_i, β = β̂] = θ(x_i) = e^{β̂⊤x_i} / (1 + e^{β̂⊤x_i}). So Σ_{i=1}^n y_i x_i = Σ_{i=1}^n E[Y | X = x_i, β = β̂] x_i.
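A quick numerical illustration of this property for logistic regression (everything here, including the data, learning rate, and iteration count, is made up for the demonstration): fit β̂ by plain gradient ascent, a crude stand-in for a real solver, and compare Σ_i y_i x_i with Σ_i θ(x_i) x_i.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])   # constant column plays the role of x_0 = 1
y = (sigmoid(X @ np.array([-0.5, 1.0, 2.0])) > rng.uniform(size=200)).astype(float)

# Plain gradient ascent on the average log likelihood (crude, but enough to reach the MLE here).
beta = np.zeros(3)
for _ in range(5000):
    beta += 0.1 * X.T @ (y - sigmoid(X @ beta)) / len(y)

# At the (approximate) MLE, sum_i y_i x_i matches sum_i E[Y | x_i, beta_hat] x_i = sum_i theta(x_i) x_i.
print(X.T @ y)
print(X.T @ sigmoid(X @ beta))   # approximately equal to the line above
```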

SLIDE 9

Linear Regression with only one parameter; MLE and MAP estimation

CMU, 2012 fall, Tom Mitchell, Ziv Bar-Joseph, midterm, pr. 3


SLIDE 10

Consider real-valued variables X and Y. The Y variable is generated, conditional on X, from the following process:

ε ∼ N(0, σ²),  Y = aX + ε,

where every ε is an independent random variable, called a noise term, drawn from a Gaussian distribution with mean 0 and standard deviation σ. This is a one-feature linear regression model, where a is the only weight parameter. The conditional probability of Y has the distribution p(Y | X, a) ∼ N(aX, σ²), so it can be written as

p(Y | X, a) = (1/(√(2π) σ)) exp(−(Y − aX)² / (2σ²)).
SLIDE 11

MLE estimation

a. Assume we have a training dataset of n pairs (X_i, Y_i) for i = 1, …, n, and σ is known. Which of the following equations correctly represent the maximum likelihood problem for estimating a? Say yes or no to each one. More than one of them should have the answer yes.

i.   arg max_a Σ_i (1/(√(2π) σ)) exp(−(Y_i − aX_i)² / (2σ²))
ii.  arg max_a ∏_i (1/(√(2π) σ)) exp(−(Y_i − aX_i)² / (2σ²))
iii. arg max_a Σ_i exp(−(Y_i − aX_i)² / (2σ²))
iv.  arg max_a ∏_i exp(−(Y_i − aX_i)² / (2σ²))
v.   arg max_a Σ_i (Y_i − aX_i)²
vi.  arg min_a Σ_i (Y_i − aX_i)²

SLIDE 12

Answer:

L_D(a) ≝ p(Y_1, …, Y_n | a) = p(Y_1, …, Y_n | X_1, …, X_n, a)
       = ∏_{i=1}^n p(Y_i | X_i, a)    (i.i.d.)
       = ∏_{i=1}^n (1/(√(2π) σ)) exp(−(Y_i − aX_i)² / (2σ²)).

Therefore,

a_MLE ≝ arg max_a L_D(a)
      = arg max_a ∏_{i=1}^n (1/(√(2π) σ)) exp(−(Y_i − aX_i)² / (2σ²))    (ii.)
      = arg max_a (1/(√(2π) σ))^n ∏_{i=1}^n exp(−(Y_i − aX_i)² / (2σ²))
      = arg max_a (1/(√(2π) σ))^n exp(−Σ_{i=1}^n (Y_i − aX_i)² / (2σ²))
      = arg max_a ∏_{i=1}^n exp(−(Y_i − aX_i)² / (2σ²))    (iv.)
      = arg max_a ln ∏_{i=1}^n exp(−(Y_i − aX_i)² / (2σ²))
      = arg max_a Σ_{i=1}^n [−(Y_i − aX_i)² / (2σ²)]
      = arg max_a [−(1/(2σ²)) Σ_{i=1}^n (Y_i − aX_i)²]
      = arg min_a Σ_{i=1}^n (Y_i − aX_i)²    (vi.)

SLIDE 13
b. Derive the maximum likelihood estimate of the parameter a in terms of the training examples X_i and Y_i. We recommend you start with the simplest form of the problem you found above.

Answer:

a_MLE = arg min_a Σ_{i=1}^n (Y_i − aX_i)²
      = arg min_a [a² Σ_{i=1}^n X_i² − 2a Σ_{i=1}^n X_iY_i + Σ_{i=1}^n Y_i²].

This is a quadratic in a, minimized at its vertex:

a_MLE = −(−2 Σ_{i=1}^n X_iY_i) / (2 Σ_{i=1}^n X_i²) = (Σ_{i=1}^n X_iY_i) / (Σ_{i=1}^n X_i²).
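A short sketch (NumPy, synthetic data; none of it from the exam) that evaluates this closed form and confirms it agrees with a generic least-squares solver:

```python
import numpy as np

def a_mle(X, Y):
    """Closed-form MLE for Y = a*X + noise: sum(X_i * Y_i) / sum(X_i^2)."""
    return np.sum(X * Y) / np.sum(X ** 2)

rng = np.random.default_rng(3)
X = rng.normal(size=100)
Y = 2.5 * X + rng.normal(size=100)                        # true slope 2.5, unit-variance noise

print(a_mle(X, Y))                                        # close to 2.5
print(np.linalg.lstsq(X[:, None], Y, rcond=None)[0][0])   # same value from numpy's least-squares routine
```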

SLIDE 14

MAP estimation

Let’s put a prior on a. Assume a ∼ N(0, λ²), so

p(a | λ) = (1/(√(2π) λ)) exp(−a² / (2λ²)).

The posterior probability of a is

p(a | Y_1, …, Y_n, X_1, …, X_n, λ) = p(Y_1, …, Y_n | X_1, …, X_n, a) p(a | λ) / ∫_{a′} p(Y_1, …, Y_n | X_1, …, X_n, a′) p(a′ | λ) da′.

We can ignore the denominator when doing MAP estimation.

c. Assume σ = 1 and a fixed prior parameter λ. Solve for the MAP estimate of a,

arg max_a [ln p(Y_1, …, Y_n | X_1, …, X_n, a) + ln p(a | λ)].

Your solution should be in terms of the X_i's, Y_i's, and λ.

SLIDE 15

Answer:

p(Y_1, …, Y_n | X_1, …, X_n, a) · p(a | λ)
  = [∏_{i=1}^n (1/(√(2π) σ)) exp(−(Y_i − aX_i)² / (2σ²))] · (1/(√(2π) λ)) exp(−a² / (2λ²))
  = [∏_{i=1}^n (1/√(2π)) exp(−(Y_i − aX_i)² / 2)] · (1/(√(2π) λ)) exp(−a² / (2λ²))    (σ = 1)

Therefore the MAP optimization problem is

arg max_a [n ln(1/√(2π)) − (1/2) Σ_{i=1}^n (Y_i − aX_i)² + ln(1/(√(2π) λ)) − a² / (2λ²)]
  = arg max_a [−(1/2) Σ_{i=1}^n (Y_i − aX_i)² − a² / (2λ²)]
  = arg min_a [Σ_{i=1}^n (Y_i − aX_i)² + a² / λ²]
  = arg min_a [a² (Σ_{i=1}^n X_i² + 1/λ²) − 2a Σ_{i=1}^n X_iY_i + Σ_{i=1}^n Y_i²]

⇒ a_MAP = (Σ_{i=1}^n X_iY_i) / (Σ_{i=1}^n X_i² + 1/λ²).
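The sketch below (NumPy, made-up data) computes a_MAP for a few values of λ; it also anticipates the next question by showing that a_MAP shrinks toward 0 for small λ and approaches a_MLE as λ grows:

```python
import numpy as np

def a_mle(X, Y):
    return np.sum(X * Y) / np.sum(X ** 2)

def a_map(X, Y, lam):
    """MAP estimate under the N(0, lambda^2) prior with sigma = 1: sum(X*Y) / (sum(X^2) + 1/lambda^2)."""
    return np.sum(X * Y) / (np.sum(X ** 2) + 1.0 / lam ** 2)

rng = np.random.default_rng(4)
X = rng.normal(size=20)
Y = 2.5 * X + rng.normal(size=20)

for lam in [0.1, 1.0, 100.0]:
    print(lam, a_map(X, Y, lam))   # heavily shrunk toward 0 for lambda = 0.1, near a_MLE for lambda = 100
print(a_mle(X, Y))
```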

SLIDE 16
d. Under the following conditions, how do the prior and conditional likelihood curves change? Do a_MLE and a_MAP become closer together, or further apart? For each condition, state whether the prior p(a | λ) becomes wider, narrower, or stays the same; whether the conditional likelihood p(Y_1, …, Y_n | X_1, …, X_n, a) becomes wider, narrower, or stays the same; and whether |a_MLE − a_MAP| increases or decreases.

As λ → ∞
As λ → 0
More data: as n → ∞ (fixed λ)

SLIDE 17

Answer:

As λ → ∞: the prior p(a | λ) becomes wider; the conditional likelihood p(Y_1, …, Y_n | X_1, …, X_n, a) stays the same; |a_MLE − a_MAP| decreases.

As λ → 0: the prior becomes narrower; the conditional likelihood stays the same; |a_MLE − a_MAP| increases.

More data, n → ∞ (fixed λ): the prior stays the same; the conditional likelihood becomes narrower; |a_MLE − a_MAP| decreases.

SLIDE 18

Linear Regression in ℝ²

[without “intercept” term]

with either Gaussian or Laplace noise

CMU, 2009 fall, Carlos Guestrin, HW3, pr. 1.5.2
CMU, 2012 fall, Eric Xing, Aarti Singh, HW1, pr. 2

SLIDE 19

[Figure: a system S takes the input x = (x_1, x_2) and produces the output y = c_1x_1 + c_2x_2 + ε, where ε is additive measurement noise.]

This figure shows a system S which takes two inputs x_1, x_2 and outputs a linear combination of those two inputs, c_1x_1 + c_2x_2, where c_1 and c_2 are two unknown real numbers. The device you use to measure the output of S, i.e., c_1x_1 + c_2x_2, introduces an additive error ε, which is a random variable following some distribution. Thus, the output y that you observe is given by equation (1):

y = c_1x_1 + c_2x_2 + ε.    (1)

Assume that you have n > 2 instances {(x_j1, x_j2, y_j)}_{j=1,…,n}, or equivalently {(x_j, y_j)}_{j=1,…,n}, where x_j ≝ [x_j1, x_j2]. In other words, having n measurements in your hands is equivalent to having n equations of the form

y_j = c_1x_j1 + c_2x_j2 + ε_j,  j = 1, …, n.

The goal is to estimate c_1 and c_2 from those measurements using maximum likelihood.

SLIDE 20

a. Assume that the ε_i for i = 1, …, n are i.i.d. Gaussian random variables with zero mean and variance σ². Compute the log-likelihood function and use it to prove that the maximum likelihood estimate c* = [c*_1, c*_2] is the solution of a least squares approximation problem. Find the solution of the least squares problem.

Answer:

ε_i = y_i − (c_1x_i1 + c_2x_i2) ∼ N(0, σ²), therefore y_i ∼ N(c_1x_i1 + c_2x_i2, σ²). Since the noise terms are i.i.d., the likelihood function is given by

L(c_1, c_2) = ∏_{i=1}^n (1/(√(2π) σ)) exp(−(y_i − c_1x_i1 − c_2x_i2)² / (2σ²)).

Taking the logarithm, we get the log-likelihood function

l(c_1, c_2) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (y_i − c_1x_i1 − c_2x_i2)².

Maximizing l is therefore equivalent to minimizing the sum of squared residuals. Let y ∈ ℝⁿ be the vector containing the measurements, X the n×2 matrix with X_ij = x_ij, and c = [c_1, c_2]⊤; then we are trying to minimize ||y − Xc||²_2, resulting in the solution c = (X⊤X)⁻¹X⊤y.
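A minimal sketch of this least-squares solution on synthetic data (NumPy; the true coefficients and noise level are made up). It solves the normal equations X⊤Xc = X⊤y with a linear solver rather than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
X = rng.normal(size=(n, 2))                      # columns are x_{j1} and x_{j2}
c_true = np.array([1.5, -0.7])
y = X @ c_true + rng.normal(scale=0.3, size=n)   # i.i.d. Gaussian measurement noise

# MLE / least-squares solution c = (X^T X)^{-1} X^T y, via the normal equations.
c_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(c_hat)                                     # close to c_true
print(np.linalg.lstsq(X, y, rcond=None)[0])      # same estimate from numpy's least-squares routine
```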

SLIDE 21
b. Assume that the ε_i for i = 1, …, n are independent Gaussian random variables with zero mean and variance Var(ε_i) = σ_i². Compute the log-likelihood function and find c* = [c*_1, c*_2] which maximizes it, i.e., the MLE.

Answer:

ε_i = y_i − (c_1x_i1 + c_2x_i2) ∼ N(0, σ_i²). Similarly as before,

l(c_1, c_2) = −(n/2) log(2π) − Σ_{i=1}^n log σ_i − Σ_{i=1}^n (y_i − c_1x_i1 − c_2x_i2)² / (2σ_i²),

where the first two terms do not depend on c. Now we are trying to minimize ||W(y − Xc)||²_2, where W is a diagonal matrix with w_ii = 1/σ_i, resulting in the solution c = (X⊤W⊤WX)⁻¹X⊤W⊤Wy.
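A corresponding sketch for this heteroscedastic case (NumPy; the σ_i values and coefficients are made up): it builds W⊤W as a diagonal matrix of 1/σ_i² and solves the weighted normal equations, with the unweighted fit shown for comparison:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
X = rng.normal(size=(n, 2))
c_true = np.array([1.5, -0.7])
sigmas = rng.uniform(0.1, 2.0, size=n)           # per-observation noise levels sigma_i
y = X @ c_true + sigmas * rng.normal(size=n)

# Weighted least squares: minimize ||W(y - Xc)||_2^2 with w_ii = 1/sigma_i,
# i.e. c = (X^T W^T W X)^{-1} X^T W^T W y.
W2 = np.diag(1.0 / sigmas ** 2)                  # this is W^T W
c_wls = np.linalg.solve(X.T @ W2 @ X, X.T @ W2 @ y)
c_ols = np.linalg.lstsq(X, y, rcond=None)[0]     # unweighted fit, for comparison
print(c_wls)                                     # close to c_true, giving noisy points less influence
print(c_ols)
```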

SLIDE 22
c. Assume that the ε_i for i = 1, …, n have density f_{ε_i}(x) = f(x) = (1/(2b)) exp(−|x|/b). In other words, our noise is i.i.d. following a Laplace distribution with location parameter µ = 0 and scale parameter b. Compute the log-likelihood function under this noise model and explain why this model leads to more robust solutions.

Answer:

l(c_1, c_2) = −n log(2b) − (1/b) Σ_{i=1}^n |y_i − c_1x_i1 − c_2x_i2| = −n log(2b) − (1/b) ||y − Xc||_1.

The Laplace model is prepared to see larger residual values because its tails are heavier than the Gaussian's, so large errors are penalized linearly rather than quadratically. Thus it is more robust to noise and outliers.

[Figure: Laplace p.d.f. f(x) on x ∈ [−10, 10] for (µ = 0, θ = 1), (µ = 0, θ = 2), (µ = 0, θ = 4), and (µ = −5, θ = 4), where θ denotes the scale parameter.]
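Maximizing this Laplace log-likelihood means minimizing the ℓ1 norm of the residuals, which has no closed form; one common way to approximate it is iteratively reweighted least squares. The sketch below (NumPy, synthetic data with a few injected outliers; the function name l1_regression and all values are illustrative, not something prescribed by the slides) contrasts the ℓ1 fit with ordinary least squares:

```python
import numpy as np

def l1_regression(X, y, iters=100, eps=1e-6):
    """Approximately minimize ||y - Xc||_1 (the Laplace-noise MLE) by iteratively reweighted least squares."""
    c = np.linalg.lstsq(X, y, rcond=None)[0]        # start from the ordinary least-squares fit
    for _ in range(iters):
        r = np.abs(y - X @ c)
        w = 1.0 / np.maximum(r, eps)                # weights 1/|residual|: large residuals get small weight
        WX = X * w[:, None]
        c = np.linalg.solve(X.T @ WX, X.T @ (w * y))
    return c

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -0.7]) + rng.laplace(scale=0.3, size=200)
y[:5] += 20.0                                       # a few gross outliers

print(np.linalg.lstsq(X, y, rcond=None)[0])         # ordinary least squares is pulled away by the outliers
print(l1_regression(X, y))                          # the L1 fit stays much closer to [1.5, -0.7]
```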