Unit 6: Simple Linear Regression, Lecture: Introduction to SLR (PowerPoint Presentation)



SLIDE 1

Unit 6: Simple Linear Regression Lecture : Introduction to SLR

Statistics 101

Thomas Leininger

June 17, 2013

SLIDE 2

Outline

1. Recap: Chi-square test of independence
   - Ball throwing
   - Expected counts in two-way tables
2. Modeling numerical variables
3. Correlation
4. Fitting a line by least squares regression
   - Residuals
   - Best line
   - The least squares line
   - Prediction & extrapolation
   - Conditions for the least squares line
   - R²
   - Categorical explanatory variables

Statistics 101, U6 - L1: Introduction to SLR, Thomas Leininger

SLIDE 7

Recap: Chi-square test of independence / Ball throwing

Does ball-throwing ability vary by major?

Going back to our carnival game, should I be worried if a bus-load of public policy majors shows up at my booth? The hypotheses are:

H0: Ball-throwing ability and major are independent. Ball-throwing skills do not vary by major.

HA: Ball-throwing ability and major are dependent. Ball-throwing skills vary by major.

                Public Policy   Undeclared   Other   Total
Hit target           40             10         10      60
Missed target        20             30         30      80
Total                60             40         40     140

Note: I multiplied the numbers by 10 to meet our expected cell counts conditions.


SLIDE 9

Recap: Chi-square test of independence / Ball throwing

Chi-square test of independence

The test statistic is calculated as

    χ²_df = Σ_{i=1}^{k} (O_i − E_i)² / E_i,    df = (R − 1) × (C − 1)

where k is the number of cells, R is the number of rows, and C is the number of columns.

Note: We calculate df differently for one-way and two-way tables.

Expected counts in two-way tables:

    Expected Count = (row total) × (column total) / (table total)
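The expected-count formula can be applied to the carnival-game table in a few lines of Python (a quick sketch; the variable names are mine, not from the slides):

```python
# Observed counts from the ball-throwing table:
# rows = Hit target / Missed target; columns = Public Policy / Undeclared / Other
observed = [[40, 10, 10],
            [20, 30, 30]]

row_totals = [sum(row) for row in observed]        # [60, 80]
col_totals = [sum(col) for col in zip(*observed)]  # [60, 40, 40]
table_total = sum(row_totals)                      # 140

# Expected Count = (row total) x (column total) / (table total)
expected = [[r * c / table_total for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 1) for e in row])
# [25.7, 17.1, 17.1]
# [34.3, 22.9, 22.9]
```

The 25.7 and 22.857 that appear in the chi-square computation later in the deck are exactly these expected counts.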


SLIDE 15

Recap: Chi-square test of independence / Expected counts in two-way tables

Expected counts in two-way tables

                Public Policy   Undeclared   Other   Total
Hit target           40             10         10      60
Missed target        20             30         30      80
Total                60             40         40     140

df = (R − 1) × (C − 1) = (2 − 1) × (3 − 1) = 2

χ² = (40 − 25.7)²/25.7 + · · · + (30 − 22.857)²/22.857 = 24.306

p-value: smaller than 0.001

Upper tail   0.3    0.2    0.1    0.05   0.02   0.01   0.005   0.001
df 1         1.07   1.64   2.71   3.84   5.41   6.63   7.88    10.83
   2         2.41   3.22   4.61   5.99   7.82   9.21   10.60   13.82
   3         3.66   4.64   6.25   7.81   9.84   11.34  12.84   16.27
   4         4.88   5.99   7.78   9.49   11.67  13.28  14.86   18.47
   5         6.06   7.29   9.24   11.07  13.39  15.09  16.75   20.52
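The whole test statistic from this slide can be reproduced with a short script (a sketch in plain Python rather than a stats library; it recomputes the expected counts from the observed table):

```python
# Chi-square statistic for the ball-throwing table, computed cell by cell.
observed = [[40, 10, 10],
            [20, 30, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
table_total = sum(row_totals)

expected = [[r * c / table_total for c in col_totals] for r in row_totals]

# Sum of (O - E)^2 / E over all k = 6 cells
chi_sq = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(observed, expected)
             for o, e in zip(obs_row, exp_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi_sq, 3), df)  # 24.306 2
```

Comparing 24.306 against the df = 2 row of the table above, it exceeds the 0.001 cutoff (13.82), so the p-value is smaller than 0.001, as the slide states.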


SLIDE 23

Modeling numerical variables

So far we have worked with:

- 1 numerical variable (Z, T)
- 1 categorical variable (χ²)
- 1 numerical and 1 categorical variable (2-sample Z/T, ANOVA)
- 2 categorical variables (χ² test for independence)

Next up: relationships between two numerical variables, as well as modeling numerical response variables using a numerical or categorical explanatory variable. Wed–Friday: modeling numerical variables using many explanatory variables at once.


SLIDE 30

Modeling numerical variables

Poverty vs. HS graduate rate

The scatterplot below shows the relationship between HS graduate rate in all 50 US states and DC and the % of residents who live below the poverty line (income below $23,050 for a family of 4 in 2012).

[Scatterplot: % HS grad (x-axis, roughly 80–90) vs. % in poverty (y-axis, roughly 6–18)]

Response: % in poverty
Explanatory: % HS grad
Relationship: linear, negative, moderately strong


SLIDE 34

Correlation

Quantifying the relationship

Correlation describes the strength of the linear association between two variables. It takes values between −1 (perfect negative) and +1 (perfect positive). A value of 0 indicates no linear association.
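For concreteness, the correlation coefficient can be computed directly from its definition (a sketch in plain Python; the tiny data set is made up for illustration):

```python
import math

def pearson_r(xs, ys):
    """Sample correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfectly linear, decreasing relationship gives r = -1
r = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r)
```

Because the points fall exactly on a decreasing line, r comes out at −1 (up to floating-point error), matching the "perfect negative" endpoint described above.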

SLIDE 35

Correlation

Guessing the correlation

Question: Which of the following is the best guess for the correlation between % in poverty and % HS grad?

(a) 0.6  (b) -0.75  (c) -0.1  (d) 0.02  (e) -1.5

[Scatterplot: % HS grad vs. % in poverty]


SLIDE 37

Correlation

Guessing the correlation

Question: Which of the following is the best guess for the correlation between % in poverty and % female householder?

(a) 0.1  (b) -0.6  (c) -0.4  (d) 0.9  (e) 0.5

[Scatterplot: % female householder, no husband present vs. % in poverty]


SLIDE 40

Correlation

Assessing the correlation

Question: Which of the following has the strongest correlation, i.e., a correlation coefficient closest to +1 or -1?

[Four scatterplots: (a), (b), (c), (d)]

Answer: (b) → correlation means linear association


SLIDE 43

Fitting a line by least squares regression / Residuals

Residuals

Residuals are the leftovers from the model fit: Data = Fit + Residual

[Scatterplot: % HS grad vs. % in poverty, with the fitted line]


SLIDE 46

Fitting a line by least squares regression / Residuals

Residuals (cont.)

Residual: the difference between the observed and predicted y.

    e_i = y_i − ŷ_i

[Scatterplot with the regression line; the residual for DC is +5.44 and for RI is −4.16]

% living in poverty in DC is 5.44% more than predicted.
% living in poverty in RI is 4.16% less than predicted.


SLIDE 54

Fitting a line by least squares regression / Best line

A measure for the best line

We want a line that has small residuals:

1. Option 1: Minimize the sum of magnitudes (absolute values) of residuals: |e1| + |e2| + · · · + |en|
2. Option 2: Minimize the sum of squared residuals (least squares): e1² + e2² + · · · + en²

Why least squares?

1. Most commonly used
2. Easier to compute by hand and using software
3. In many applications, a residual twice as large as another is more than twice as bad

SLIDE 55

Fitting a line by least squares regression / Best line

The least squares line

    ŷ = β0 + β1 x

where ŷ is the predicted response, β0 is the intercept, β1 is the slope, and x is the explanatory variable.

Notation:
- Intercept: parameter β0, point estimate b0
- Slope: parameter β1, point estimate b1


SLIDE 57

Fitting a line by least squares regression / The least squares line

Given...

[Scatterplot: % HS grad vs. % in poverty]

              % HS grad (x)   % in poverty (y)
mean          x̄ = 86.01       ȳ = 11.35
sd            sx = 3.73       sy = 3.1
correlation   R = −0.75

SLIDE 60

Fitting a line by least squares regression / The least squares line

Slope

The slope of the regression can be calculated as

    b1 = (sy / sx) × R

In context:

    b1 = (3.1 / 3.73) × (−0.75) = −0.62

Interpretation: For each % point increase in HS graduate rate, we would expect the % living in poverty to decrease on average by 0.62% points.
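Plugging the summary statistics from the "Given..." slide into the slope formula (a quick sketch; the variable names are mine):

```python
# Summary statistics for % HS grad (x) and % in poverty (y)
s_x, s_y = 3.73, 3.1   # standard deviations
R = -0.75              # correlation

# Least squares slope: b1 = (s_y / s_x) * R
b1 = (s_y / s_x) * R
print(round(b1, 2))  # -0.62
```

Note the sign of the slope always matches the sign of the correlation, since sy/sx is positive.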

SLIDE 63

Fitting a line by least squares regression / The least squares line

Intercept

The intercept is where the regression line intersects the y-axis. The calculation of the intercept uses the fact that the regression line always passes through (x̄, ȳ):

    b0 = ȳ − b1 x̄

[Scatterplot with the line extended to the y-axis; the intercept lies far to the left of the observed data]

    b0 = 11.35 − (−0.62) × 86.01 = 64.68
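A similar sketch for the intercept, using the rounded slope the slide uses (−0.62) and the means from the "Given..." slide:

```python
# Means of % HS grad (x) and % in poverty (y)
x_bar, y_bar = 86.01, 11.35
b1 = -0.62  # rounded slope, as used on the slide

# The line passes through (x_bar, y_bar), so b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar
print(round(b0, 2))  # 64.68
```

Carrying the unrounded slope through instead would shift b0 slightly; the slides round at each step, which is why 64.68 is the value quoted.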


SLIDE 65

Fitting a line by least squares regression / The least squares line

Interpret b0

Question: How do we interpret the intercept? (b0 = 64.68)

[Scatterplot with the line extended to the y-intercept]

Answer: States with no HS graduates are expected on average to have 64.68% of residents living below the poverty line.

SLIDE 66

Fitting a line by least squares regression / The least squares line

Recap: Interpretation of slope and intercept

Intercept: When x = 0, y is expected to equal the value of the intercept.
Slope: For each unit increase in x, y is expected to increase/decrease on average by the value of the slope.

SLIDE 67

Fitting a line by least squares regression / The least squares line

Regression line

    predicted % in poverty = 64.68 − 0.62 × % HS grad

[Scatterplot: % HS grad vs. % in poverty, with the fitted line]


SLIDE 69

Fitting a line by least squares regression / Prediction & extrapolation

Prediction

Using the linear model to predict the value of the response variable for a given value of the explanatory variable is called prediction; we simply plug the value of x into the linear model equation. There will be some uncertainty associated with the predicted value; we'll talk about this next time.

[Scatterplot: % HS grad vs. % in poverty, with the fitted line]
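As a worked example of plugging in x (the 85% graduation rate below is my own illustrative choice, not a value from the slides):

```python
def predict_poverty(hs_grad_pct):
    # Fitted line from the slides: predicted % in poverty = 64.68 - 0.62 * % HS grad
    return 64.68 - 0.62 * hs_grad_pct

# Predicted % in poverty for a state with an 85% HS graduation rate
print(round(predict_poverty(85), 2))  # 11.98
```

Since 85% sits inside the observed range of HS graduation rates (roughly 80–90%), this is a prediction rather than an extrapolation, which is the distinction the next slide draws.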

SLIDE 70

Fitting a line by least squares regression / Prediction & extrapolation

Extrapolation

Applying a model estimate to values outside of the realm of the original data is called extrapolation. Sometimes the intercept might be an extrapolation.

[Scatterplot with the line extended to the y-intercept, far outside the observed range of % HS grad]


SLIDE 73

Fitting a line by least squares regression / Prediction & extrapolation

Examples of extrapolation

1. http://www.colbertnation.com/the-colbert-report-videos/269929
2. Sprinting:


slide-75
SLIDE 75

Fitting a line by least squares regression Conditions for the least squares line

1

Recap: Chi-square test of independence Ball throwing Expected counts in two-way tables

2

Modeling numerical variables

3

Correlation

4

Fitting a line by least squares regression Residuals Best line The least squares line Prediction & extrapolation Conditions for the least squares line

R2

Categorical explanatory variables

Statistics 101 U6 - L1: Introduction to SLR Thomas Leininger


slide-78
SLIDE 78

Fitting a line by least squares regression Conditions for the least squares line

Conditions for the least squares line

1. Linearity

2. Nearly normal residuals

3. Constant variability

Statistics 101 (Thomas Leininger) U6 - L1: Introduction to SLR June 17, 2013 26 / 35

slide-81
SLIDE 81

Fitting a line by least squares regression Conditions for the least squares line

Conditions: (1) Linearity

The relationship between the explanatory and the response variable should be linear. Methods for fitting a model to non-linear relationships exist, but are beyond the scope of this class. Check using a scatterplot of the data, or a residuals plot.

Statistics 101 (Thomas Leininger) U6 - L1: Introduction to SLR June 17, 2013 27 / 35

slide-83
SLIDE 83

Fitting a line by least squares regression Conditions for the least squares line

Anatomy of a residuals plot

[Residuals plot: % HS grad vs. % in poverty, with the residuals for RI and DC marked]

RI: % HS grad = 81, % in poverty = 10.3
    predicted % in poverty = 64.68 − 0.62 × 81 = 14.46
    residual e = observed − predicted = 10.3 − 14.46 = −4.16

DC: % HS grad = 86, % in poverty = 16.8
    predicted % in poverty = 64.68 − 0.62 × 86 = 11.36
    residual e = observed − predicted = 16.8 − 11.36 = 5.44

Statistics 101 (Thomas Leininger) U6 - L1: Introduction to SLR June 17, 2013 28 / 35
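The residual arithmetic on this slide is easy to check with a short Python sketch; the intercept (64.68) and slope (−0.62) are the least squares estimates used throughout the lecture:

```python
# Fitted line from the lecture: predicted % in poverty = 64.68 - 0.62 * (% HS grad)
def predict_poverty(hs_grad):
    """Predicted % in poverty for a given % HS graduation rate."""
    return 64.68 - 0.62 * hs_grad

def residual(hs_grad, observed_poverty):
    """Residual e = observed - predicted."""
    return observed_poverty - predict_poverty(hs_grad)

# Rhode Island: % HS grad = 81, % in poverty = 10.3
print(round(residual(81, 10.3), 2))   # -4.16
# DC: % HS grad = 86, % in poverty = 16.8
print(round(residual(86, 16.8), 2))   # 5.44
```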

slide-86
SLIDE 86

Fitting a line by least squares regression Conditions for the least squares line

Conditions: (2) Nearly normal residuals

The residuals should be nearly normal. This condition may not be satisfied when there are unusual observations that don’t follow the trend of the rest of the data.

Check using a histogram or normal probability plot of residuals.

[Histogram of residuals and normal Q–Q plot of the residuals]

Statistics 101 (Thomas Leininger) U6 - L1: Introduction to SLR June 17, 2013 29 / 35

slide-90
SLIDE 90

Fitting a line by least squares regression Conditions for the least squares line

Conditions: (3) Constant variability

[Scatterplot of % HS grad vs. % in poverty with the regression line, and the corresponding residuals plot]

The variability of points around the least squares line should be roughly constant. This implies that the variability of residuals around the 0 line should be roughly constant as well.

Also called homoscedasticity.

Check using a residuals plot.

Statistics 101 (Thomas Leininger) U6 - L1: Introduction to SLR June 17, 2013 30 / 35
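The three conditions can also be checked numerically. A minimal sketch, using made-up data rather than the lecture's state-level dataset: fit a line, compute the residuals, and compare their spread across the range of the explanatory variable.

```python
import numpy as np

# Synthetic data with a linear trend and constant-variance noise
# (hypothetical values; not the lecture's poverty data).
rng = np.random.default_rng(0)
x = rng.uniform(78, 92, 200)                  # hypothetical % HS grad values
y = 64.68 - 0.62 * x + rng.normal(0, 1, 200)  # trend + homoscedastic noise

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Least squares guarantees the residuals average to zero.
print(abs(residuals.mean()) < 1e-10)          # True

# Constant variability: residual spread should be similar
# on the left and right halves of the x range.
left = residuals[x < np.median(x)].std()
right = residuals[x >= np.median(x)].std()
print(left, right)
```

If the two spreads differed by a large factor, that would suggest the constant variability condition is violated.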

slide-91
SLIDE 91

Fitting a line by least squares regression Conditions for the least squares line

Checking conditions

Question: What condition is this linear model obviously violating?

(a) Constant variability
(b) Linear relationship
(c) Non-normal residuals
(d) No extreme outliers

Statistics 101 (Thomas Leininger) U6 - L1: Introduction to SLR June 17, 2013 31 / 35

slide-93
SLIDE 93

Fitting a line by least squares regression Conditions for the least squares line

Checking conditions

Question: What condition is this linear model obviously violating?

(a) Constant variability
(b) Linear relationship
(c) Non-normal residuals
(d) No extreme outliers

Statistics 101 (Thomas Leininger) U6 - L1: Introduction to SLR June 17, 2013 32 / 35

slide-100
SLIDE 100

Fitting a line by least squares regression R²

R²

The strength of the fit of a linear model is most commonly evaluated using R².

R² is calculated as the square of the correlation coefficient.

It tells us what percent of the variability in the response variable is explained by the model. The remainder of the variability is explained by variables not included in the model.

For the model we’ve been working with, R² = (−0.62)² ≈ 0.38.

Statistics 101 (Thomas Leininger) U6 - L1: Introduction to SLR June 17, 2013 33 / 35
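For simple linear regression, the squared correlation and the "share of variability explained" definition of R² agree exactly. A sketch on made-up data (not the lecture's dataset):

```python
import numpy as np

# Hypothetical data with a linear trend plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(78, 92, 100)
y = 64.68 - 0.62 * x + rng.normal(0, 2, 100)

r = np.corrcoef(x, y)[0, 1]            # correlation coefficient

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
ss_res = np.sum((y - fitted) ** 2)     # variability left unexplained
ss_tot = np.sum((y - y.mean()) ** 2)   # total variability in y
r_squared = 1 - ss_res / ss_tot

print(np.isclose(r ** 2, r_squared))   # True
```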

slide-101
SLIDE 101

Fitting a line by least squares regression R²

Interpretation of R²

Question

Which of the following is the correct interpretation of R = −0.62, R² = 0.38?

(a) 38% of the variability in the % of HS graduates among the 51 states is explained by the model.
(b) 38% of the variability in the % of residents living in poverty among the 51 states is explained by the model.
(c) 38% of the time % HS graduates predict % living in poverty correctly.
(d) 62% of the variability in the % of residents living in poverty among the 51 states is explained by the model.

[Scatterplot of % HS grad vs. % in poverty with the regression line]

Statistics 101 (Thomas Leininger) U6 - L1: Introduction to SLR June 17, 2013 34 / 35

slide-108
SLIDE 108

Fitting a line by least squares regression Categorical explanatory variables

Poverty vs. region (east, west)

predicted % poverty = 11.17 + 0.38 × west

Explanatory variable: region, reference level: east

Intercept: The estimated average poverty percentage in eastern states is 11.17%. This is the value we get if we plug in 0 for the explanatory variable.

Slope: The estimated average poverty percentage in western states is 0.38% higher than in eastern states. Then, the estimated average poverty percentage in western states is 11.17 + 0.38 = 11.55%. This is the value we get if we plug in 1 for the explanatory variable.

This is called using a dummy variable.

Statistics 101 (Thomas Leininger) U6 - L1: Introduction to SLR June 17, 2013 35 / 35
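With a 0/1 dummy variable, least squares recovers the two group means: the intercept is the mean of the reference group (east), and intercept + slope is the mean of the other group (west). A sketch with made-up poverty values (the lecture's actual state data is not reproduced here):

```python
import numpy as np

# Hypothetical poverty percentages by region.
east = np.array([10.5, 11.0, 12.0, 11.2])     # reference level (dummy = 0)
west = np.array([11.3, 12.0, 11.4])           # dummy = 1

y = np.concatenate([east, west])
west_dummy = np.array([0, 0, 0, 0, 1, 1, 1])  # indicator for "west"

slope, intercept = np.polyfit(west_dummy, y, 1)
print(np.isclose(intercept, east.mean()))          # True: intercept = east mean
print(np.isclose(intercept + slope, west.mean()))  # True: intercept + slope = west mean
```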