Business Statistics CONTENTS Ordinary least squares (recap for - - PowerPoint PPT Presentation



slide-1
SLIDE 1

SIMPLE REGRESSION ANALYSIS

Business Statistics

slide-2
SLIDE 2

CONTENTS
▪ Ordinary least squares (recap for some)
▪ Statistical formulation of the regression model
▪ Assessing the regression model
▪ Testing the regression coefficients
▪ The ANOVA table
▪ Old exam question
▪ Further study

slide-3
SLIDE 3

ORDINARY LEAST SQUARES
Idea of “curve fitting” in a scatterplot
▪ linear fit: $y = a + bx$ ($x$ = floor area of house, $y$ = price of house)

slide-4
SLIDE 4

ORDINARY LEAST SQUARES
You find the “best” line
▪ by minimizing the “misfit” ($e_i$) between the observed value ($y_i$) and the modelled/estimated value ($\hat{y}_i = a + b x_i$)
  ▪ $e_i = y_i - \hat{y}_i$
▪ in fact by minimizing the sum of squares of the misfit
  ▪ $\sum_{i=1}^{n} e_i^2$
▪ OLS regression
The hat (^) is our symbol for the estimate
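The minimization described above has a well-known closed-form solution for a straight line. The following is a minimal sketch of it, not the course's own implementation; the function names `ols_fit` and `sum_sq_misfit` and the five data points are hypothetical and chosen only for illustration.

```python
def ols_fit(x, y):
    """Return (a, b) of the line y = a + b*x that minimizes sum(e_i**2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Closed-form OLS solution: slope = S_xy / S_xx, intercept from the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def sum_sq_misfit(x, y, a, b):
    """Sum of squared misfits e_i = y_i - (a + b*x_i)."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data: floor area (m2) and price (thousands of euros).
floor_area = [50.0, 70.0, 90.0, 110.0, 130.0]
price = [60.0, 150.0, 310.0, 400.0, 550.0]

a, b = ols_fit(floor_area, price)
print("intercept a:", a, "slope b:", b)
print("sum of squared misfits:", sum_sq_misfit(floor_area, price, a, b))
```

Moving `a` or `b` away from the fitted values can only increase the sum of squared misfits, which is what “best” means here.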

slide-5
SLIDE 5

STATISTICAL FORMULATION OF THE REGRESSION MODEL
Rephrasing the model $y = a + bx$ as a statistical model
Assumptions and notation
▪ we assume a linear relation of the form of the population regression model $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
▪ or $Y = \beta_0 + \beta_1 X + \varepsilon$
▪ with
  ▪ $\beta_0$ the intercept or constant
  ▪ $\beta_1$ the slope or slope coefficient
  ▪ random variable $\varepsilon_i$ the error or residual, the “unexplained part”
We prefer to use $\beta_0$ instead of $a$ for the constant, and $\beta_1$ instead of $b$ for the slope
slide-6
SLIDE 6

STATISTICAL FORMULATION OF THE REGRESSION MODEL
Estimation of the model coefficients
▪ we assume that $\varepsilon_i \sim N(0, \sigma^2)$
▪ based on a sample of $n$ paired data points $(x_i, y_i)$, $i = 1, \dots, n$
▪ use OLS to find the best line, the estimated regression model $\hat{Y} = b_0 + b_1 X$ or $\hat{y}_i = b_0 + b_1 x_i$
▪ the estimated coefficients ($b_0$ for $\beta_0$ and $b_1$ for $\beta_1$) and the estimated error ($e_i$ for $\varepsilon_i$) correspond to $y_i = b_0 + b_1 x_i + e_i$
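To make the distinction between the population model and the estimated model concrete, here is a small simulation sketch. The population values `beta0`, `beta1`, `sigma`, the sample size, and the helper `ols_fit` are all hypothetical assumptions for illustration; they do not come from the house-price data.

```python
import random


def ols_fit(x, y):
    """Closed-form OLS estimates (b0, b1) for the line y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population model: Y_i = beta0 + beta1 * X_i + eps_i, eps_i ~ N(0, sigma^2)
beta0, beta1, sigma = -260.0, 6.0, 40.0
x = [random.uniform(40, 160) for _ in range(200)]
y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]

# Estimated model: y_i = b0 + b1 * x_i + e_i
b0, b1 = ols_fit(x, y)
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print("b0 estimates beta0:", b0)
print("b1 estimates beta1:", b1)
print("sum of estimated errors (zero up to rounding):", sum(e))
```

The estimates vary from sample to sample around the population values, while the estimated errors always sum to zero when the model includes an intercept.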

slide-7
SLIDE 7

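STATISTICAL FORMULATION OF THE REGRESSION MODEL

[Figure: scatterplot with the estimated regression line $\hat{Y} = b_0 + b_1 X$, marking an observed point $(x_i, y_i)$, the fitted point $(x_i, \hat{y}_i)$, the residual $e_i$ between them, the intercept $b_0$ and the slope $b_1$]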

slide-8
SLIDE 8

STATISTICAL FORMULATION OF THE REGRESSION MODEL
So
▪ $b_0$ is the estimated value of $\beta_0$
  ▪ the intercept or constant of the regression line
▪ $b_1$ is the estimated value of $\beta_1$
  ▪ the slope or slope coefficient of the regression line
▪ $e_i$ is the estimated residual or error for observation $i$
  ▪ the “misfit”

slide-9
SLIDE 9

EXERCISE 1
Look back at the house prices
▪ where we found the line $y = -264700 + 6152x$
  • a. Give the theoretical model
  • b. Give the estimated model

slide-10
SLIDE 10

ASSESSING THE REGRESSION MODEL
OLS will always give an estimate for $\beta_0$ and $\beta_1$
▪ the line of “best fit”
But is “best” also “good enough” to make good predictions?
▪ can we do a statistical test on the quality of the model?
We have minimized the sum of squares ($SS$) of the error
$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
We would like to compare this with:
▪ the “total” sum of squares $SST$
▪ the “explained” sum of squares $SSR$
“R” stands for “regression”

slide-11
SLIDE 11

ASSESSING THE REGRESSION MODEL
Total sum of squares:
$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$
So $SST$ is the total variation around the mean $\bar{y}$

slide-12
SLIDE 12

ASSESSING THE REGRESSION MODEL
Regression sum of squares:
$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
▪ So,
  ▪ the data has a total variability $SST$
  ▪ the regression model explains a variability $SSR$
  ▪ and the residual variability is $SSE$
  ▪ and $SST = SSR + SSE$
Coefficient of determination (“$R$-square”):
$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
So $SSR$ is the variation around the mean $\bar{y}$ that is explained by the model
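The decomposition $SST = SSR + SSE$ and both expressions for $R^2$ can be checked numerically. This is a minimal sketch on hypothetical data (the same five illustrative points, not the house-price sample); the function name `sums_of_squares` is an assumption.

```python
def sums_of_squares(x, y):
    """Fit y = b0 + b1*x by OLS and return (SST, SSR, SSE)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # residual variation
    return sst, ssr, sse


# Hypothetical data, for illustration only.
x = [50.0, 70.0, 90.0, 110.0, 130.0]
y = [60.0, 150.0, 310.0, 400.0, 550.0]

sst, ssr, sse = sums_of_squares(x, y)
r2 = ssr / sst
print("SST:", sst, "SSR:", ssr, "SSE:", sse)
print("R^2 as SSR/SST:", r2, "and as 1 - SSE/SST:", 1 - sse / sst)
```

The identity holds for OLS fits that include an intercept; for other fitting methods the two expressions for $R^2$ need not agree.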

slide-13
SLIDE 13

ASSESSING THE REGRESSION MODEL
$R^2$ is a measure of the usefulness of the model
▪ Properties
  ▪ $0 \le R^2 \le 1$
  ▪ $R^2 = 0$ means the model doesn’t explain anything
  ▪ $R^2 = 1$ means the model explains everything
  ▪ in between, the model explains $R^2 \times 100\%$ of the variance of $Y$
$R^2 = 1 - \frac{SSE}{SST}$

slide-14
SLIDE 14

ASSESSING THE REGRESSION MODEL
If $R^2 > 0$, the regression model explains “something”
▪ but in a random sample, $R^2$ may be non-zero due to chance
▪ when is $R^2$ “significantly” different from 0?
Finding a test statistic
▪ look at the variances associated with $SSR$ and $SSE$
▪ so define the mean sums of squares ($MS$) (variances!)
  ▪ $MST = \frac{SST}{n-1}$; $MSR = \frac{SSR}{1}$; $MSE = \frac{SSE}{n-2}$
▪ use $\frac{MSR}{MSE} = \frac{SSR/1}{SSE/(n-2)}$ as a ratio of two variances
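The mean squares and their ratio are plain arithmetic once the sums of squares are known. The sketch below assumes hypothetical values for $SST$, $SSR$, $SSE$ and $n$ (they are not taken from the house-price output), and the function name `mean_squares` is likewise an assumption.

```python
def mean_squares(sst, ssr, sse, n):
    """Return (MST, MSR, MSE, ratio) for simple regression with n observations."""
    mst = sst / (n - 1)   # total variance: n - 1 degrees of freedom
    msr = ssr / 1         # explained: 1 degree of freedom (one slope)
    mse = sse / (n - 2)   # residual: n - 2 degrees of freedom
    return mst, msr, mse, msr / mse


# Hypothetical sums of squares for a sample of n = 5 observations.
mst, msr, mse, ratio = mean_squares(sst=152520.0, ssr=151290.0, sse=1230.0, n=5)
print("MST:", mst, "MSR:", msr, "MSE:", mse)
print("MSR / MSE:", ratio)
```

A large ratio means the variance explained per degree of freedom is much larger than the residual variance.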

slide-15
SLIDE 15

ASSESSING THE REGRESSION MODEL
Statistical test:
▪ $H_0$: the independent variable ($X$) does not explain the variation in the dependent variable ($Y$)
  ▪ i.e., $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \ne 0$
▪ Sample statistic: $F = \frac{MSR}{MSE}$; reject for large values
▪ Under $H_0$: $F \sim F_{1,n-2}$; assumptions: see model
▪ Compare $F_{\text{calc}} = \frac{MSR}{MSE}$ with $F_{\text{crit}} = F_{1,n-2;\alpha}$
▪ or compute the $p$-value as the probability of obtaining $F_{\text{calc}}$ or more extreme if $H_0$ is true

slide-16
SLIDE 16

ASSESSING THE REGRESSION MODEL
Using SPSS, three types of output
Model summary
▪ $R^2$
Variance decomposition (ANOVA?)
▪ $SSR$, $SSE$, $SST$
▪ $MSR$, $MSE$
▪ $F_{\text{calc}}$
▪ $p$-value
Regression coefficients
▪ $b_0$ and $b_1$

slide-17
SLIDE 17

ASSESSING THE REGRESSION MODEL
The model is $Y = \beta_0 + \beta_1 X + \varepsilon$
▪ OLS extracts estimates from the data: $b_0$ and $b_1$
▪ But how accurate are these estimates?
We can also find the distribution of $B_0$ and $B_1$
▪ So, we can find confidence intervals and perform hypothesis tests
After standardization, $B_0$ and $B_1$ are $t$-distributed:
▪ $\frac{B_0 - \beta_0}{S_{B_0}} \sim t_{n-2}$
▪ $\frac{B_1 - \beta_1}{S_{B_1}} \sim t_{n-2}$

slide-18
SLIDE 18

ASSESSING THE REGRESSION MODEL
Mind the notation, like before:
▪ mean
  ▪ population value $\mu_X$
  ▪ sample estimate $\bar{x}$
  ▪ sampling distribution of random variable $\bar{X}$
▪ regression slope
  ▪ population value $\beta_1$
  ▪ sample estimate $b_1$
  ▪ sampling distribution of random variable $B_1$
When you’re careless with this, it all gets mixed up in one big abracadabra trickery!

slide-19
SLIDE 19
EXERCISE 2
  • a. Is the model significant?
  • b. Has the model practical relevance?

slide-20
SLIDE 20

TESTING THE REGRESSION COEFFICIENTS
▪ Testing $\beta_0$ is usually not interesting
  ▪ but testing $\beta_1$ is!
  ▪ in particular, the hypothesis $\beta_1 = 0$ is often interesting
  ▪ i.e., the hypothesis that there is no relation between $X$ and $Y$
  ▪ or: that knowledge of $X$ doesn’t tell you anything about $Y$
▪ This test requires the standard deviation of $B_1$
  ▪ it is calculated from the data; see computer output
  ▪ here $s_{B_1} = 347.578$

slide-21
SLIDE 21

TESTING THE REGRESSION COEFFICIENTS
So: $t_{\text{calc}} = \frac{b_1 - \beta_1}{s_{B_1}} = \frac{6151.670 - 0}{347.578} = 17.699$
▪ which has to be compared to $t_{\text{crit}} = \pm t_{0.025;69}$
▪ reject $H_0: \beta_1 = 0$, because $t_{\text{calc}} > t_{\text{crit}}$
▪ or with $p$-value: $p = 0.000 \ll 0.05$
▪ and conclude that the slope differs significantly from zero
  ▪ post-hoc conclusion: it is larger than zero
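The calculation above can be reproduced with the standard library only. In this sketch the two-sided $p$-value is obtained by numerically integrating the Student $t$ density with Simpson's rule; that is a substitute chosen here in place of the SPSS output, and the function names `t_pdf` and `t_two_sided_p` are assumptions.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))


def t_two_sided_p(t_calc, df, steps=20000):
    """P(|T| >= |t_calc|), using Simpson's rule on [0, |t_calc|] (steps must be even)."""
    upper = abs(t_calc)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central = 2 * total * h / 3        # P(-|t_calc| <= T <= |t_calc|) by symmetry
    return max(0.0, 1.0 - central)     # clamp tiny negative rounding errors


# Values from the slide: b1, hypothesized slope, standard error of B1, df = n - 2.
b1, beta1_h0, s_b1, df = 6151.670, 0.0, 347.578, 69

t_calc = (b1 - beta1_h0) / s_b1
p_value = t_two_sided_p(t_calc, df)
print("t_calc:", round(t_calc, 3))
print("two-sided p-value:", p_value)
print("reject H0 at alpha = 0.05:", p_value < 0.05)
```

For a $t_{\text{calc}}$ this large the tail probability is far below floating-point resolution, so the computed value is effectively zero, which matches the reported $p = 0.000$.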

slide-22
SLIDE 22

TESTING THE REGRESSION COEFFICIENTS
Testing the regression model
▪ on the basis of $\frac{MSR}{MSE} \sim F_{1,n-2}$
Testing the regression coefficient $\beta_1$
▪ on the basis of $\frac{B_1 - 0}{S_{B_1}} \sim t_{n-2}$
The two approaches are equivalent
▪ they have the same null hypothesis: $H_0: \beta_1 = 0$
▪ they lead to the same conclusion (rejection or no rejection)
▪ they lead to the same $p$-value
▪ when we do multiple regression with several explanatory variables this is not the case! See later.
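The equivalence of the two tests in simple regression shows up numerically as $F_{\text{calc}} = t_{\text{calc}}^2$. The sketch below checks this on hypothetical data. It uses the standard textbook formula $s_{B_1} = \sqrt{MSE / \sum (x_i - \bar{x})^2}$, which the slides do not state explicitly, and the function name `f_and_t` is an assumption.

```python
import math


def f_and_t(x, y):
    """Return (F_calc, t_calc) for H0: beta1 = 0 in simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    mse = sse / (n - 2)
    f_calc = (ssr / 1) / mse
    s_b1 = math.sqrt(mse / s_xx)   # standard error of the slope (textbook formula)
    t_calc = (b1 - 0) / s_b1
    return f_calc, t_calc


# Hypothetical data, for illustration only.
x = [50.0, 70.0, 90.0, 110.0, 130.0]
y = [60.0, 150.0, 310.0, 400.0, 550.0]

f_calc, t_calc = f_and_t(x, y)
print("F_calc:", f_calc)
print("t_calc squared:", t_calc ** 2)
```

Because an $F_{1,n-2}$ variable is the square of a $t_{n-2}$ variable, the two statistics also give the same two-sided $p$-value.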

slide-23
SLIDE 23

TESTING THE REGRESSION COEFFICIENTS
We can also perform other tests than $H_0: \beta_1 = 0$
▪ Case 1: Different test values for $\beta_1$
  ▪ for example $H_0: \beta_1 = 2$
  ▪ $t_{\text{calc}} = \frac{b_1 - 2}{s_{B_1}}$
  ▪ not in SPSS, but easily calculated using $s_{B_1}$
▪ Case 2: One-sided tests
  ▪ for example $H_0: \beta_1 \ge 0$
  ▪ $t_{\text{calc}}$ as before, but now tested with a different $t_{\text{crit}}$
  ▪ not in SPSS, but also easily calculated using the 2-sided $p$-value
▪ Case 3: combination of case 1 and case 2
  ▪ for example $H_0: \beta_1 \ge 2$
▪ Try all! (see tutorials)
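The two hand calculations mentioned in the cases above can be sketched as follows. The function names `t_for_test_value` and `one_sided_p` are hypothetical, and the conversion assumes that the two-sided $p$-value belongs to the same $t_{\text{calc}}$ (that is, the same test value). The two-sided $p$-value of 0.08 in the example is made up for illustration.

```python
def t_for_test_value(b1, test_value, s_b1):
    """Case 1: t statistic for H0: beta1 = test_value."""
    return (b1 - test_value) / s_b1


def one_sided_p(t_calc, p_two_sided, alternative):
    """Case 2: convert a two-sided p-value into a one-sided one.

    alternative = "greater" for H1: beta1 > test value,
    alternative = "less"    for H1: beta1 < test value.
    """
    half = p_two_sided / 2
    if alternative == "greater":
        # Half the two-sided p-value if t_calc points in the direction of H1.
        return half if t_calc > 0 else 1 - half
    if alternative == "less":
        return half if t_calc < 0 else 1 - half
    raise ValueError("alternative must be 'greater' or 'less'")


# Case 1 with the slide's slope estimate and standard error, test value 5500.
t_calc = t_for_test_value(6151.670, 5500.0, 347.578)
print("t_calc for H0: beta1 = 5500:", round(t_calc, 3))

# Case 2 with a hypothetical two-sided p-value of 0.08 and a positive t_calc.
print("p for H1 'greater':", one_sided_p(1.78, 0.08, "greater"))
print("p for H1 'less':", one_sided_p(1.78, 0.08, "less"))
```

When the sample statistic points away from the alternative hypothesis, the one-sided $p$-value exceeds 0.5 and $H_0$ cannot be rejected.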

slide-24
SLIDE 24

TESTING THE REGRESSION COEFFICIENTS
Example of case 3:
▪ is there evidence that the price per square meter is larger than 5500€?
  ▪ $H_0: \beta_1 \le 5500$; $H_1: \beta_1 > 5500$; $\alpha = 0.05$
  ▪ $t_{\text{calc}} = \frac{6151.670 - 5500}{347.578} = 1.875 > t_{\text{crit}} \approx 1.7$
  ▪ reject $H_0$
▪ conclude that the price per m² is significantly larger than 5500€
One-sided critical value, with $\alpha$, not $\alpha/2$

slide-25
SLIDE 25

TESTING THE REGRESSION COEFFICIENTS
One may also test $\beta_0$ in exactly the same way
▪ however, this is hardly ever useful
Overall significance of the $F$-test only depends on $\frac{B_1}{S_{B_1}}$, not on $\frac{B_0}{S_{B_0}}$
▪ that is because the slope explains variation
▪ while the intercept is only a vertical shift

slide-26
SLIDE 26

THE ANOVA TABLE
One of the regression results is the ANOVA table
ANOVA = analysis of variance
▪ Excel
▪ SPSS

slide-27
SLIDE 27

THE ANOVA TABLE
What was ANOVA?
▪ ANOVA: $Y$ numerical; $X$ categorical
▪ regression: $Y$ numerical; $X$ numerical
So ANOVA is really different from regression
▪ why then an ANOVA table in regression?
Because ANOVA and regression both decompose the total variance ($SST$) into
▪ an explained part
  ▪ $SSA$ in ANOVA (factor “A”); $SSR$ in regression (“Regression”)
▪ an unexplained part
  ▪ $SSW$ in ANOVA (“Within”); $SSE$ in regression (“Error”)

slide-28
SLIDE 28

THE ANOVA TABLE
The ANOVA table for regression
▪ $MS_* = \frac{SS_*}{df_*}$, with $* = R, E$
▪ $F_{\text{calc}} = \frac{MSR}{MSE}$ and associated $p$-value

slide-29
SLIDE 29

OLD EXAM QUESTION
21 May 2015, Q2c

slide-30
SLIDE 30

FURTHER STUDY
Doane & Seward 5/E 12.1-12.6
Tutorial exercises week 4 regression analysis