Political Science 209 - Fall 2018: Linear Regression

slide-1
SLIDE 1

Political Science 209 - Fall 2018

Linear Regression

Florian Hollenbach 12th October 2018

slide-2
SLIDE 2

Recall Correlation & Scatterplot

[Figure: scatterplot "Income and Child Mortality"; x-axis: logged GDP in PPP (ticks 7 to 11); y-axis: child mortality (ticks 50 to 200); correlation = −0.77]

What is the correlation?

slide-3
SLIDE 3

Recall the definition of correlation

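$$\text{Correlation}(x, y) = \frac{1}{N} \sum_{i=1}^{N} \left( \text{z-score of } x_i \times \text{z-score of } y_i \right)$$

$$\text{Correlation}(x, y) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \bar{x}}{\text{sd}_x} \times \frac{y_i - \bar{y}}{\text{sd}_y} \right)$$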
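
The definition can be checked numerically. The sketch below uses made-up toy numbers (not the lecture's data) and assumes the standard deviations use the same 1/N denominator as the formula; `sd_n` is a helper name introduced here for illustration.

```r
# Hypothetical toy data for illustration (not the lecture's dataset)
x <- c(7.5, 8.2, 9.0, 9.7, 10.4, 11.0)
y <- c(160, 120, 75, 50, 20, 10)

# Standard deviation with denominator N, matching the 1/N in the formula
sd_n <- function(v) sqrt(mean((v - mean(v))^2))

# z-scores: distance from the mean in standard deviation units
z_x <- (x - mean(x)) / sd_n(x)
z_y <- (y - mean(y)) / sd_n(y)

# Correlation as the average product of z-scores
r_by_hand <- mean(z_x * z_y)

# R's built-in correlation for comparison
r_builtin <- cor(x, y)

print(c(by_hand = r_by_hand, builtin = r_builtin))
```

Both numbers should agree, and the sign is negative because y falls as x rises in the toy data.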

slide-4
SLIDE 4

Correlations & Scatterplots/Data points

  1. positive correlation → upward slope
  2. negative correlation → downward slope
  3. high correlation → points are tighter, close to a line
  4. correlation cannot capture nonlinear relationships

slide-5
SLIDE 5

Correlations & Scatterplots/Data points

[Figure: four scatterplots, each with both axes running from −3 to 3: (a) correlation = 0.22, (b) correlation = 0.88, (c) correlation = −0.7, (d) correlation = 0.02]

slide-6
SLIDE 6

Moving from Correlation to Linear Regression

Preview:

  • linear regression allows us to create predictions
  • linear regression specifies direction of relationship
  • linear regression allows us to examine more than two variables at the same time (statistical control)

slide-7
SLIDE 7

Linear Regression

  • regression has one dependent (y) and, for now, one independent (x) variable
  • regression is a statistical method to estimate the linear relationship between variables


slide-9
SLIDE 9

Linear Regression

  • goal of regression is to approximate the (linear) relationship between X and Y as best as possible
  • regression is the mathematical model to draw the best-fitting line through the cloud of points

slide-10
SLIDE 10

Linear Regression

[Figure: scatterplot "Income and Child Mortality"; x-axis: logged GDP in PPP; y-axis: child mortality]

Linear regression is the mathematical model to draw the best-fitting line through the cloud of points

slide-11
SLIDE 11

Linear Regression

[Figure: scatterplot "Income and Child Mortality"; x-axis: logged GDP in PPP; y-axis: child mortality]

  • the regression line is an estimate of the (for now bivariate) relationship between x and y
  • for each x we have a prediction of y: what would we expect y to be given the value of x?

slide-16
SLIDE 16

What is the equation of a line?

Equation of a line?

y = mx + b
  • b → y-intercept
  • m → slope

regression equation: Y = α + βX + ε
  • α → y-intercept
  • β → slope
  • ε → error

slide-18
SLIDE 18

Regression equation

[Figure: scatterplot "Income and Child Mortality"; x-axis: logged GDP in PPP; y-axis: child mortality; annotated with y-intercept = 282.46 and slope = −26.61]

Y = 282.46 − 26.61X + ε

slide-19
SLIDE 19

Regression equation

Model:

$$Y = \underbrace{\alpha}_{\text{intercept}} + \underbrace{\beta}_{\text{slope}} X + \underbrace{\epsilon}_{\text{error term}}$$

  • Y: dependent/outcome/response variable
  • X: independent/explanatory variable, predictor
  • (α, β): coefficients (parameters of the model)
  • ε: unobserved error/disturbance term (mean zero)

slide-20
SLIDE 20

Regression: Interpretation of the Parameters

$$Y = \underbrace{\alpha}_{\text{intercept}} + \underbrace{\beta}_{\text{slope}} X + \underbrace{\epsilon}_{\text{error term}}$$

  • α + βX: average of Y at the given value of X
  • α: the average value of Y when X is zero
  • β: average change in Y associated with a one-unit increase in X
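
As a small worked example, the sketch below plugs in the intercept (282.46) and slope (−26.61) reported for the income and child mortality data. The function name `expected_y` is introduced here only for illustration.

```r
alpha <- 282.46   # intercept reported in the lecture
beta  <- -26.61   # slope reported in the lecture

# Average child mortality implied by the line at a given logged GDP
expected_y <- function(x) alpha + beta * x

# alpha: the value on the line when X is zero
at_zero <- expected_y(0)

# beta: change on the line for a one-unit increase in X
one_unit_change <- expected_y(10) - expected_y(9)

# Average of Y at X = 9: 282.46 - 26.61 * 9 = 42.97
at_nine <- expected_y(9)

print(c(at_zero = at_zero, one_unit_change = one_unit_change, at_nine = at_nine))
```

Note that X = 0 lies far outside the plotted range of logged GDP (roughly 7 to 11), so the intercept here is an extrapolation rather than a value anyone would expect to observe.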

slide-23
SLIDE 23

Regression equation

  • but we don't know the equation that generates the data
  • our regression line is an estimate, based on the collected data
  • estimates are denoted with little hats: ($\hat{\alpha}$, $\hat{\beta}$) are the estimated coefficients
  • we can use ($\hat{\alpha}$, $\hat{\beta}$, X) to create predicted values of Y

$\hat{Y} = \hat{\alpha} + \hat{\beta} x$: predicted/fitted value
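
A minimal sketch of how predicted values come out of the estimated coefficients in R. The data frame `toy` holds made-up numbers for illustration, not the lecture's dataset.

```r
# Hypothetical toy data for illustration (not the lecture's dataset)
toy <- data.frame(
  x = c(7.5, 8.2, 9.0, 9.7, 10.4, 11.0),
  y = c(160, 120, 75, 50, 20, 10)
)

fit <- lm(y ~ x, data = toy)
alpha_hat <- coef(fit)[["(Intercept)"]]
beta_hat  <- coef(fit)[["x"]]

# Predicted values computed directly from the estimated coefficients
y_hat_manual <- alpha_hat + beta_hat * toy$x

# The same predicted values as stored in the fitted model
y_hat_fitted <- unname(fitted(fit))

# Prediction for an x value that is not in the data
y_hat_new <- unname(predict(fit, newdata = data.frame(x = 9.5)))

print(round(cbind(x = toy$x, y_hat = y_hat_manual), 2))
```

`fitted()` returns predictions for the observations used in estimation, while `predict()` with `newdata` applies the same line to new x values.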

slide-26
SLIDE 26

Regression equation

How far off is our line? How do we know?

$\hat{\epsilon} = \text{true } Y - \hat{Y}$: residuals/error

The $\hat{\epsilon}$'s are an estimate of how well or poorly our line approximates the relationship
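
The residual definition can be sketched in a few lines of R. The toy numbers below are made up for illustration and are not the lecture's data.

```r
# Hypothetical toy data for illustration (not the lecture's dataset)
toy <- data.frame(
  x = c(7.5, 8.2, 9.0, 9.7, 10.4, 11.0),
  y = c(160, 120, 75, 50, 20, 10)
)

fit <- lm(y ~ x, data = toy)

# Residual = observed Y minus predicted Y
y_hat   <- fitted(fit)
eps_hat <- toy$y - y_hat

# A positive residual means the point lies above the line,
# a negative residual means it lies below the line
print(round(eps_hat, 2))
```

With an intercept in the model, the least-squares residuals sum to zero (up to rounding), so the line is never systematically too high or too low on the data used to fit it.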

slide-27
SLIDE 27

Regression

[Figure: scatterplot "Income and Child Mortality"; x-axis: logged GDP in PPP; y-axis: child mortality; annotations mark an outcome y, its predicted value ŷ, and the residual ε̂]

slide-28
SLIDE 28

Regression

  • (α, β) are estimated from the data
  • How do we find α, β?


slide-33
SLIDE 33

Regression: How do we find α, β?

We minimize the sum of the squared residuals (SSR):

$$\text{SSR} = \sum_{i=1}^{n} \hat{\epsilon}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{\alpha} - \hat{\beta} X_i)^2$$

This also minimizes the root mean squared error:

$$\text{RMSE} = \sqrt{\frac{1}{n} \text{SSR}}$$
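
To see the minimization at work, the sketch below computes the SSR for the least-squares line and for two nudged alternatives. The toy data and the helper `ssr` are assumptions introduced here for illustration.

```r
# Hypothetical toy data for illustration (not the lecture's dataset)
toy <- data.frame(
  x = c(7.5, 8.2, 9.0, 9.7, 10.4, 11.0),
  y = c(160, 120, 75, 50, 20, 10)
)

# Sum of squared residuals for any candidate intercept a and slope b
ssr <- function(a, b, x, y) sum((y - a - b * x)^2)

fit   <- lm(y ~ x, data = toy)
a_hat <- coef(fit)[["(Intercept)"]]
b_hat <- coef(fit)[["x"]]

# SSR at the least-squares estimates
ssr_ols <- ssr(a_hat, b_hat, toy$x, toy$y)

# Moving either coefficient away from the estimate raises the SSR
ssr_shift_intercept <- ssr(a_hat + 5, b_hat, toy$x, toy$y)
ssr_shift_slope     <- ssr(a_hat, b_hat + 1, toy$x, toy$y)

# RMSE is the square root of the average squared residual
rmse_ols <- sqrt(ssr_ols / nrow(toy))

print(c(ols = ssr_ols, shifted_intercept = ssr_shift_intercept,
        shifted_slope = ssr_shift_slope, rmse = rmse_ols))
```

Because the square root and the division by n are both increasing transformations, whichever line has the smallest SSR also has the smallest RMSE.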

slide-35
SLIDE 35

Regression by Hand

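$$\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}$$

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$

OR:

$$\hat{\beta} = \text{correlation of } X \text{ and } Y \times \frac{\text{standard deviation of } Y}{\text{standard deviation of } X}$$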
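
The by-hand formulas translate directly into R. The sketch below uses made-up toy numbers (not the lecture's data) and compares the results with what `lm()` returns.

```r
# Hypothetical toy data for illustration (not the lecture's dataset)
x <- c(7.5, 8.2, 9.0, 9.7, 10.4, 11.0)
y <- c(160, 120, 75, 50, 20, 10)

# Slope: sum of cross-deviations divided by sum of squared deviations of X
beta_hat <- sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)

# Intercept: mean of Y minus slope times mean of X
alpha_hat <- mean(y) - beta_hat * mean(x)

# Alternative slope formula: correlation times ratio of standard deviations
beta_hat_alt <- cor(x, y) * sd(y) / sd(x)

# Compare with the coefficients estimated by lm()
fit <- lm(y ~ x)
print(rbind(by_hand = c(alpha_hat, beta_hat), lm = unname(coef(fit))))
```

R's `sd()` divides by n − 1 rather than n, but the same denominator appears in both standard deviations, so it cancels in the ratio.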

slide-36
SLIDE 36

Regression by Hand

The regression line always goes through the point of averages $(\bar{X}, \bar{Y})$. At $X = \bar{X}$:

$$\hat{Y} = \hat{\alpha} + \hat{\beta} \bar{X} = (\bar{Y} - \hat{\beta} \bar{X}) + \hat{\beta} \bar{X} = \bar{Y}$$
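
A quick numerical check of this property, again on made-up toy numbers rather than the lecture's data:

```r
# Hypothetical toy data for illustration (not the lecture's dataset)
toy <- data.frame(
  x = c(7.5, 8.2, 9.0, 9.7, 10.4, 11.0),
  y = c(160, 120, 75, 50, 20, 10)
)

fit <- lm(y ~ x, data = toy)

# Prediction of the fitted line at the mean of x
y_hat_at_mean_x <- unname(predict(fit, newdata = data.frame(x = mean(toy$x))))

print(c(prediction_at_mean_x = y_hat_at_mean_x, mean_y = mean(toy$y)))
```

The two printed values should be identical up to rounding.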

slide-37
SLIDE 37

Regression always goes through point of averages

[Figure: scatterplot "Income and Child Mortality"; x-axis: logged GDP in PPP; y-axis: child mortality; annotations mark an outcome y, its predicted value ŷ, the residual ε̂, the mean of x, and the mean of y]

slide-38
SLIDE 38

Regression NOT by Hand

Enough math!

Fitting/estimating a regression in R:

    lm(dependent ~ independent, data = data_object)

slide-39
SLIDE 39

Regression NOT by Hand

Fitting/estimating a regression in R:

    data <- read.csv("bivariate_data.csv")
    data <- subset(data, Year == 2010)
    result <- lm(Child.Mortality ~ log(GDP), data = data)
    summary(result)

slide-40
SLIDE 40

Regression NOT by Hand

    result <- lm(Child.Mortality ~ log(GDP), data = data)
    coef(result)  ### coefficients

    (Intercept)    log(GDP)
      282.45870   -26.61347

R-output:
  • (Intercept): α
  • log(GDP): β

slide-42
SLIDE 42

Model Fit

How well does our regression line fit the data? How well does the model predict the outcome?

R² or coefficient of determination:

$$R^2 = 1 - \frac{\text{SSR}}{\text{Total sum of squares (TSS)}} = 1 - \frac{\sum_{i=1}^{n} \hat{\epsilon}_i^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$

slide-43
SLIDE 43

Model Fit

$$R^2 = 1 - \frac{\text{SSR}}{\text{Total sum of squares (TSS)}} = 1 - \frac{\sum_{i=1}^{n} \hat{\epsilon}_i^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$

R² is also interpreted as the share of the variance in Y that is explained by the model.

How much of the deviation of Y from the average is explained by X?
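
The R² formula can be reproduced by hand and compared with the value R reports. The sketch below uses made-up toy numbers, not the lecture's data.

```r
# Hypothetical toy data for illustration (not the lecture's dataset)
toy <- data.frame(
  x = c(7.5, 8.2, 9.0, 9.7, 10.4, 11.0),
  y = c(160, 120, 75, 50, 20, 10)
)

fit <- lm(y ~ x, data = toy)

# Sum of squared residuals and total sum of squares
ssr <- sum(resid(fit)^2)
tss <- sum((toy$y - mean(toy$y))^2)

# R-squared from the definition
r2_by_hand <- 1 - ssr / tss

# R-squared as reported by summary()
r2_builtin <- summary(fit)$r.squared

# With a single predictor, R-squared equals the squared correlation
r2_from_cor <- cor(toy$x, toy$y)^2

print(c(by_hand = r2_by_hand, builtin = r2_builtin, from_cor = r2_from_cor))
```

The link to the squared correlation holds only in the bivariate case. It is roughly in line with the lecture's numbers: a correlation of about −0.77 squares to about 0.59.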

slide-44
SLIDE 44

Model Fit

    result <- lm(Child.Mortality ~ log(GDP), data = data)
    summary(result)

    Call:
    lm(formula = Child.Mortality ~ log(GDP), data = data)

    Residuals:
        Min      1Q  Median      3Q     Max
    -49.455 -15.418  -4.161  10.847 132.136

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  282.459     16.569   17.05   <2e-16 ***
    log(GDP)     -26.613      1.809  -14.71   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 27.57 on 150 degrees of freedom
    Multiple R-squared:  0.5906, Adjusted R-squared:  0.5878
    F-statistic: 216.4 on 1 and 150 DF,  p-value: < 2.2e-16