11. Regression and Least Squares (Prof. Tesler, Math 186, Winter 2019)



SLIDE 1

  • 11. Regression and Least Squares
  • Prof. Tesler
  • Math 186, Winter 2019

SLIDE 2

Regression

Given n points (x1, y1), (x2, y2), ..., (xn, yn), we want to determine a function y = f(x) that is close to them.

[Figure: scatter plot of the data (x, y).]

SLIDE 3

Regression

Based on knowledge of the underlying problem or on plotting the data, you have an idea of the general form of the function, such as:

  • Line: y = β0 + β1 x
  • Polynomial: y = β0 + β1 x + β2 x² + β3 x³
  • Exponential decay: y = A e^(−Bx)
  • Logistic curve: y = A/(1 + B/C^x)

[Figures: scatter plots illustrating each type of fit.]

Goal: Compute the parameters (β0, β1, . . . or A, B, C, . . .) that give a “best fit” to the data.

SLIDE 4

Regression

The methods we consider require the parameters to occur linearly. It is fine if (x, y) do not occur linearly. E.g., plugging (x, y) = (2, 3) into y = β0 + β1 x + β2 x² + β3 x³ gives 3 = β0 + 2β1 + 4β2 + 8β3, which is linear in the parameters β0, β1, β2, β3.

For exponential decay, y = A e^(−Bx), the parameter B does not occur linearly. Transform the equation to:

  ln y = ln(A) − Bx = A′ − Bx

When we plug in (x, y) values, the parameters A′, B occur linearly.

Transform the logistic curve y = A/(1 + B/C^x) to:

  ln(A/y − 1) = ln(B) − x ln(C) = B′ + C′ x

where A is determined from A = lim_{x→∞} y(x). Now B′, C′ occur linearly.
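To make the linearization concrete, here is a minimal sketch (not from the slides) of fitting the exponential-decay model by transforming to ln y and doing an ordinary linear fit. It assumes NumPy is available, and the data values are made up for illustration.

```python
import numpy as np

# Made-up data assumed to follow y = A * exp(-B*x) plus noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([7.2, 4.1, 2.5, 1.4, 0.9, 0.5])

# Transformed model: ln y = A' - B*x, which is linear in (A', B).
slope, intercept = np.polyfit(x, np.log(y), 1)
A_hat = np.exp(intercept)   # A = e^{A'}
B_hat = -slope
print(A_hat, B_hat)
```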

SLIDE 5

Least squares fit to a line

[Figure: scatter plot of the data (x, y).]

Given n points (x1, y1), ..., (xn, yn), we will fit them to a line ŷ = β0 + β1 x.

  • Independent variable: x. We assume the x's are known exactly or have negligible measurement errors.
  • Dependent variable: y. We assume the y's depend on the x's but fluctuate due to a random process. We do not have y = f(x), but instead, y = f(x) + error.

SLIDE 6

Least squares fit to a line

[Figure: scatter plot of the data (x, y) with the fitted line.]

Given n points (x1, y1), ..., (xn, yn), we will fit them to a line ŷ = β0 + β1 x.

  • Predicted y value (on the line): ŷi = β0 + β1 xi
  • Actual data (•): yi = β0 + β1 xi + εi
  • Residual (actual y minus prediction): εi = yi − ŷi = yi − (β0 + β1 xi)

SLIDE 7

Least squares fit to a line

[Figure: scatter plot of the data (x, y) with the fitted line.]

We will use the least squares method: pick parameters β0, β1 that minimize the sum of squares of the residuals:

  L = Σ_{i=1}^{n} (yi − (β0 + β1 xi))²

SLIDE 8

Least squares fit to a line

L = Σ_{i=1}^{n} (yi − (β0 + β1 xi))²

To find β0, β1 that minimize this, solve ∇L = (∂L/∂β0, ∂L/∂β1) = (0, 0):

  ∂L/∂β0 = −2 Σ_{i=1}^{n} (yi − (β0 + β1 xi)) = 0   ⇒   n β0 + (Σ_{i=1}^{n} xi) β1 = Σ_{i=1}^{n} yi

  ∂L/∂β1 = −2 Σ_{i=1}^{n} (yi − (β0 + β1 xi)) xi = 0   ⇒   (Σ_{i=1}^{n} xi) β0 + (Σ_{i=1}^{n} xi²) β1 = Σ_{i=1}^{n} xi yi

which has solution (all sums are i = 1 to n)

  β1 = [n (Σ xi yi) − (Σ xi)(Σ yi)] / [n (Σ xi²) − (Σ xi)²] = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²

  β0 = ȳ − β1 x̄

Not shown: use 2nd derivatives to confirm it's a minimum rather than a maximum or saddle point.
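As a quick illustration (not part of the slides), here is a minimal NumPy sketch that evaluates these closed-form estimates on made-up data, checking that the two forms of the slope formula agree.

```python
import numpy as np

# Made-up data (illustrative values only).
x = np.array([-15.0, -8.0, -2.0, 3.0, 9.0, 14.0, 20.0, 26.0])
y = np.array([ 16.0, 19.0, 24.0, 27.0, 30.0, 33.0, 38.0, 41.0])
n = len(x)

# Least-squares slope and intercept from the raw-sum formulas.
beta1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
beta0 = y.mean() - beta1 * x.mean()

# Equivalent centered form: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2).
beta1_centered = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
assert np.isclose(beta1, beta1_centered)
print(beta0, beta1)
```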

SLIDE 9

Best fitting line

[Figures: the same data fit two ways.
 Left, y = β0 + β1 x + ε: y = 24.9494 + 0.6180 x (slope 0.6180).
 Right, x = α0 + α1 y + ε: x = −28.2067 + 1.1501 y (slope 0.8695 in the (x, y) plot).]

The best fits for y = β0 + β1 x + error and for x = α0 + α1 y + error give different lines!

  • y = β0 + β1 x + error assumes the x's are known exactly with no errors, while the y's have errors.
  • x = α0 + α1 y + error is the other way around.

SLIDE 10

Total Least Squares / Principal Components Analysis

[Figures: the same data fit four ways.
 (1) y = β0 + β1 x + ε: y = 24.9494 + 0.6180 x (slope 0.6180).
 (2) x = α0 + α1 y + ε: x = −28.2067 + 1.1501 y (slope 0.8695 in the (x, y) plot).
 (3) First principal component of the centered data: slope 0.6934274.
 (4) All three lines together; x̄ = 1.685727, ȳ = 25.99114.]

In many experiments, both x and y have measurement errors. Use Total Least Squares or Principal Components Analysis, in which the residuals are measured perpendicular to the line. Details require advanced linear algebra, beyond Math 18.
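For readers who want to try this, here is a minimal sketch (not from the slides) of the perpendicular-residual fit via the first principal component of the centered data, using NumPy's SVD; the data values are made up.

```python
import numpy as np

# Made-up data; both coordinates are assumed to carry measurement error.
x = np.array([-14.0, -7.0, -1.0, 4.0, 10.0, 15.0, 21.0, 27.0])
y = np.array([ 15.0, 20.0, 23.0, 28.0, 31.0, 34.0, 37.0, 42.0])

# Center the data, then take the first principal component (the direction of
# largest variance).  Perpendicular distances to this line are minimized.
X = np.column_stack([x - x.mean(), y - y.mean()])
_, _, Vt = np.linalg.svd(X, full_matrices=False)
direction = Vt[0]                      # first right singular vector
slope_tls = direction[1] / direction[0]
intercept_tls = y.mean() - slope_tls * x.mean()   # line passes through the centroid
print(slope_tls, intercept_tls)
```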

SLIDE 11

Confidence intervals

[Figure: sample data with the true line, the best fit line, and a 95% prediction interval; r² = 0.7683551.]

The best fit line is different from the true line. We found point estimates of β0 and β1. Assuming the errors are independent of x and normally distributed gives:

  • Confidence intervals for β0, β1.
  • A prediction interval to extrapolate y = f(x) at other x's. Warning: it may diverge from the true line when we go out too far.

Not shown: one can also do hypothesis tests on the values of β0 and β1, and on whether two samples give the same line.

SLIDE 12

Confidence intervals

The method of least squares gave point estimates of β0 and β1:

  β̂1 = [n Σ xi yi − (Σ xi)(Σ yi)] / [n (Σ xi²) − (Σ xi)²] = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²

  β̂0 = ȳ − β̂1 x̄

The sample variance of the residuals is

  s² = (1/(n − 2)) Σ_{i=1}^{n} (yi − (β̂0 + β̂1 xi))²   (with df = n − 2).

100(1 − α)% confidence intervals:

  β0:  β̂0 ± t_{α/2, n−2} · s · √( Σ xi² / (n Σ (xi − x̄)²) )

  β1:  β̂1 ± t_{α/2, n−2} · s / √( Σ (xi − x̄)² )

  y at a new x:  (ŷ − w, ŷ + w) with ŷ = β̂0 + β̂1 x and w = t_{α/2, n−2} · s · √( 1 + 1/n + (x − x̄)² / Σ (xi − x̄)² )
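Here is a minimal sketch (not from the slides) that evaluates these intervals on made-up data; it assumes NumPy and SciPy (scipy.stats.t supplies the t critical value).

```python
import numpy as np
from scipy import stats

# Made-up data (illustrative values only).
x = np.array([2.0, 5.0, 8.0, 11.0, 14.0, 17.0, 20.0, 23.0])
y = np.array([12.0, 30.0, 41.0, 55.0, 69.0, 78.0, 95.0, 104.0])
n = len(x)
alpha = 0.05

# Point estimates.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

# Sample standard deviation of the residuals, df = n - 2.
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid**2) / (n - 2))
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
Sxx = np.sum((x - x.mean())**2)

# 95% confidence intervals for beta0 and beta1.
half_b0 = t_crit * s * np.sqrt(np.sum(x**2) / (n * Sxx))
half_b1 = t_crit * s / np.sqrt(Sxx)
ci_b0 = (b0 - half_b0, b0 + half_b0)
ci_b1 = (b1 - half_b1, b1 + half_b1)

# 95% prediction interval for y at a new x.
x_new = 30.0
y_hat = b0 + b1 * x_new
w = t_crit * s * np.sqrt(1 + 1/n + (x_new - x.mean())**2 / Sxx)
print(ci_b0, ci_b1, (y_hat - w, y_hat + w))
```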

SLIDE 13

Covariance

Let X and Y be random variables, possibly dependent. Let µX = E(X), µY = E(Y). Then

  Var(X + Y) = E((X + Y − µX − µY)²)
             = E(((X − µX) + (Y − µY))²)
             = E((X − µX)²) + E((Y − µY)²) + 2 E((X − µX)(Y − µY))
             = Var(X) + Var(Y) + 2 Cov(X, Y)

where the covariance of X and Y is defined as

  Cov(X, Y) = E((X − µX)(Y − µY)) = E(XY) − E(X) E(Y)

Independent variables have E(XY) = E(X) E(Y), so Cov(X, Y) = 0. But Cov(X, Y) = 0 does not guarantee X and Y are independent.
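As a quick numerical sanity check (not from the slides), the identity Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y) can be verified on simulated dependent data; the construction of the dependent pair below is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two dependent variables: y shares a component with x.
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)

# Population-style (divide-by-n) variance and covariance satisfy the identity exactly.
lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, bias=True)[0, 1]
assert np.isclose(lhs, rhs)
print(lhs, rhs)
```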

SLIDE 14

Covariance and independence

Independent variables have E(XY) = E(X) E(Y), so Cov(X, Y) = 0. But Cov(X, Y) = 0 does not guarantee X and Y are independent.

Consider the standard normal distribution Z. Z and Z² are dependent, yet:

  • Cov(Z, Z²) = E(Z³) − E(Z) E(Z²).
  • The standard normal distribution has mean 0: E(Z) = 0.
  • E(Z³) = 0 since z³ is an odd function and the pdf of Z is symmetric around z = 0.
  • So Cov(Z, Z²) = 0.

SLIDE 15

Covariance properties

We have

  Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)

where the covariance of X and Y is defined as

  Cov(X, Y) = E((X − µX)(Y − µY)) = E(XY) − E(X) E(Y)

Additional properties of covariance:

  • Cov(X, X) = Var(X)
  • Cov(X, Y) = Cov(Y, X)
  • Cov(aX + b, cY + d) = ac Cov(X, Y)

SLIDE 16

Sign of covariance

Cov(X, Y) = E((X − µX)(Y − µY))

  • When Cov(X, Y) is positive: there is a tendency to have X > µX when Y > µY and vice-versa, and X < µX when Y < µY and vice-versa.
  • When Cov(X, Y) is negative: there is a tendency to have X > µX when Y < µY and vice-versa, and X < µX when Y > µY and vice-versa.
  • When Cov(X, Y) = 0:
      a) X and Y might be independent, but it's not guaranteed.
      b) Var(X + Y) = Var(X) + Var(Y)

SLIDE 17

Sample variance and covariance

Variance of a random variable:

  σ² = Var(X) = E((X − µX)²) = E(X²) − (E(X))²

Sample variance from data x1, ..., xn to estimate σ²:

  s² = var(x) = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)² = (1/(n − 1)) Σ_{i=1}^{n} xi² − (n/(n − 1)) x̄²

Covariance between random variables X, Y:

  σ_XY = Cov(X, Y) = E((X − µX)(Y − µY)) = E(XY) − E(X) E(Y)

Sample covariance from data (x1, y1), ..., (xn, yn) to estimate σ_XY:

  s_XY = cov(x, y) = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) = (1/(n − 1)) Σ_{i=1}^{n} xi yi − (n/(n − 1)) x̄ ȳ
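A minimal sketch (not from the slides) checking that the two forms of each estimator agree and match NumPy's built-ins, which use the same n − 1 convention here; the data values are made up.

```python
import numpy as np

# Made-up paired data (illustrative values only).
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 4.0, 8.0, 9.0])
n = len(x)

# Sample variance and covariance with the n - 1 denominator.
s2 = np.sum((x - x.mean())**2) / (n - 1)
sxy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Shortcut forms from the slide.
s2_alt = np.sum(x**2) / (n - 1) - n / (n - 1) * x.mean()**2
sxy_alt = np.sum(x * y) / (n - 1) - n / (n - 1) * x.mean() * y.mean()

# NumPy's var(ddof=1) and cov() use the same n - 1 normalization.
assert np.isclose(s2, np.var(x, ddof=1)) and np.isclose(s2, s2_alt)
assert np.isclose(sxy, np.cov(x, y)[0, 1]) and np.isclose(sxy, sxy_alt)
print(s2, sxy)
```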

SLIDE 18

Correlation coefficient

Let X and Y be two random variables. Their correlation coefficient is

  ρ(X, Y) = Cov(X, Y) / √( Var(X) Var(Y) )

This is a normalized version of covariance, and is between ±1.

For a line Y = aX + b with a, b constants (a ≠ 0),

  ρ(X, Y) = a Var(X) / ( √Var(X) · √Var(aX) ) = a σ² / (σ · |a| σ) = a/|a| = ±1   (sign of a)

  • ρ(X, Y) = ±1 iff Y = aX + b with a, b constants (a ≠ 0).
  • Closer to ±1: more linear. Closer to 0: less linear.
  • If X and Y are independent then ρ(X, Y) = 0. The converse is not valid: dependent variables can have ρ(X, Y) = 0.

SLIDE 19

Correlation coefficient

ρ(X, Y) is estimated from data by the sample correlation coefficient (a.k.a. Pearson product-moment correlation coefficient):

  r(x, y) = cov(x, y) / √( var(x) var(y) )
          = Σ (xi − x̄)(yi − ȳ) / √( Σ (xi − x̄)² · Σ (yi − ȳ)² )
          = [n Σ xi yi − (Σ xi)(Σ yi)] / ( √( n Σ xi² − (Σ xi)² ) · √( n Σ yi² − (Σ yi)² ) )

People often report r² (between 0 and 1) instead of r.
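Here is a minimal sketch (not from the slides) computing r from both forms of the formula and comparing with NumPy's np.corrcoef; the data values are made up.

```python
import numpy as np

# Made-up paired data (illustrative values only).
x = np.array([1.0, 2.0, 4.0, 5.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.8, 6.2, 7.1, 8.5])
n = len(x)

# Sample correlation coefficient from the centered-sum formula.
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))

# Equivalent raw-sum formula.
r_alt = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / np.sqrt(
    (n * np.sum(x**2) - np.sum(x)**2) * (n * np.sum(y**2) - np.sum(y)**2))

assert np.isclose(r, r_alt) and np.isclose(r, np.corrcoef(x, y)[0, 1])
print(r, r**2)   # r and the often-reported r^2
```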

SLIDE 20

Sample correlation coefficient r

[Figure: grid of scatter plots labeled with their sample correlation coefficients (r values such as 1, 0.8, 0.4, −0.4, −0.8, −1), from
http://en.wikipedia.org/wiki/File:Correlation_examples2.svg
(see also http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient).]

Middle row: a perfect linear relation Y = aX + b gives

  • r = 1 for lines with positive slope (a > 0)
  • r = −1 for lines with negative slope (a < 0)
  • r undefined for a horizontal line (Y = b)

Other rows: coming up!

SLIDE 21

Interpretation of r²

Let ŷi = β̂1 xi + β̂0 be the predicted y-value for xi based on the least squares line. Write the deviation of yi from ȳ as

  yi − ȳ  =  (yi − ŷi)  +  (ŷi − ȳ)
  (total deviation) = (unexplained by line) + (explained by line)

It can be shown that the sum of squared deviations for all y's is

  Σ (yi − ȳ)²  =  Σ (yi − ŷi)²  +  Σ (ŷi − ȳ)²  +  2 Σ (yi − ŷi)(ŷi − ȳ)
  (total variation) = (unexplained variation) + (explained variation) + (= 0 by a miracle! Tedious algebra not shown)

and that

  r² = Σ (ŷi − ȳ)² / Σ (yi − ȳ)² = explained variation / total variation

  • r = 1: 100% of the variation is explained by the line and 0% is due to other factors, and the slope is positive.
  • r = −.8: 64% of the variation is explained by the line and 36% is due to other factors, and the slope is negative.
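The decomposition can be checked numerically; this minimal sketch (not from the slides) verifies that the cross term vanishes and that r² equals explained over total variation, on made-up data.

```python
import numpy as np

# Made-up data (illustrative values only).
x = np.array([1.0, 3.0, 4.0, 6.0, 8.0, 9.0, 11.0])
y = np.array([2.0, 3.5, 6.0, 7.0, 9.5, 10.0, 13.0])

# Least squares line and predictions.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

total = np.sum((y - y.mean())**2)                   # total variation
unexplained = np.sum((y - y_hat)**2)                # unexplained variation
explained = np.sum((y_hat - y.mean())**2)           # explained variation
cross = np.sum((y - y_hat) * (y_hat - y.mean()))    # the "miracle" cross term

r = np.corrcoef(x, y)[0, 1]
assert np.isclose(cross, 0)
assert np.isclose(total, unexplained + explained)
assert np.isclose(r**2, explained / total)
print(r**2, explained / total)
```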

SLIDE 22

Sample correlation coefficient r

[Figure: the same grid of scatter plots with various values of r, from
http://en.wikipedia.org/wiki/File:Correlation_examples2.svg
(see also http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient).]

Top row: Linear relations with varying r. Bottom: r = 0, yet X and Y are dependent in all of these (except possibly the last); it’s just that the relationship is not a line.

SLIDE 23

Correlation does not imply causation

High correlation between X and Y doesn’t mean X causes Y or vice-versa. It could be a coincidence. Or they could both be caused by a third variable. Website tylervigen.com plots many data sets (various quantities by year) against each other to find spurious correlations:

http://www.tylervigen.com/view_correlation?id=1703 http://tylervigen.com/view_correlation?id=1759

SLIDE 24

More about interpretation of correlation

  • Low r² does NOT guarantee independence; it just means that a line y = β0 + β1 x is not a good fit to the data.
  • r is an estimate of ρ. The estimate improves with higher n.
  • With additional assumptions on the underlying joint distribution of X, Y, we can use r to test H0: ρ = 0 vs. H1: ρ ≠ 0 (or other values).
  • Best fits and correlation generalize to other models (a sketch of the multiple-regression case follows below), including:
      Polynomial regression: y = β0 + β1 x + β2 x² + ··· + βp x^p
      Multiple linear regression: y = β0 + β1 t + β2 u + ··· + βp w, where t, u, ..., w are multiple independent variables and y is the dependent variable
      Weighted versions, when the variance is different at each value of the independent variables
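As a small illustration of the multiple linear regression generalization (not from the slides), here is a minimal NumPy sketch using a design matrix and np.linalg.lstsq; the variable names and data are made up.

```python
import numpy as np

# Made-up data with two independent variables t and u.
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
u = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([5.1, 5.9, 9.2, 9.8, 13.1, 13.8])

# Design matrix with a column of ones for the intercept beta0.
X = np.column_stack([np.ones_like(t), t, u])

# Least squares estimates of (beta0, beta1, beta2).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)
```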
