SLIDE 1
SIMPLE REGRESSION ANALYSIS Business Statistics CONTENTS Ordinary least squares (recap for some) Statistical formulation of the regression model Assessing the regression model Testing the regression coefficients The ANOVA table Old exam
SLIDE 2
SLIDE 3
Idea of “curve fitting” in a scatterplot ▪ linear fit: y = a + bx (x = floor area of house, y = price of house) ORDINARY LEAST SQUARES
SLIDE 4
You find the “best” line
▪ by minimizing the “misfit” (eᵢ) between the observed value (yᵢ) and the modelled/estimated value (ŷᵢ = a + bxᵢ)
▪ eᵢ = yᵢ − ŷᵢ
▪ in fact by minimizing the sum of squared misfits
▪ ∑ᵢ₌₁ⁿ eᵢ²
▪ OLS regression ORDINARY LEAST SQUARES
The hat (^) is our symbol for the estimate
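The minimization above has a closed-form solution; a minimal sketch in Python (the helper name `ols_fit` and the toy data are illustrative, not from the slides):

```python
def ols_fit(x, y):
    """Least-squares line y = a + b*x, minimizing sum((y_i - yhat_i)**2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Closed-form OLS slope: b = Sxy / Sxx
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar  # the fitted line passes through (x_bar, y_bar)
    return a, b

# Toy data lying exactly on y = 1 + 2x, so the misfit is zero
a, b = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0
```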
SLIDE 5
Rephrasing the model y = a + bx as a statistical model
Assumptions and notation
▪ we assume a linear relation of the form of the population regression model Yᵢ = β₀ + β₁Xᵢ + εᵢ
▪ or Y = β₀ + β₁X + ε
▪ with
▪ β₀ the intercept or constant
▪ β₁ the slope or slope coefficient
▪ random variable εᵢ the error or residual, the “unexplained part”
STATISTICAL FORMULATION OF THE REGRESSION MODEL
We prefer to use β₀ instead of a for the constant, and β₁ instead of b for the slope
SLIDE 6
Estimation of the model coefficients
▪ we assume that εᵢ ~ N(0, σ²)
▪ based on a sample of n paired data points (xᵢ, yᵢ), i = 1, …, n
▪ use OLS to estimate the best line through the estimated regression model Ŷ = b₀ + b₁X or ŷᵢ = b₀ + b₁xᵢ
▪ the estimated coefficients (b₀ for β₀ and b₁ for β₁) and the estimated errors (eᵢ for εᵢ) correspond to yᵢ = b₀ + b₁xᵢ + eᵢ
STATISTICAL FORMULATION OF THE REGRESSION MODEL
SLIDE 7
STATISTICAL FORMULATION OF THE REGRESSION MODEL
[Figure: scatterplot showing a data point (xᵢ, yᵢ), the fitted line Ŷ = b₀ + b₁X, the fitted value (xᵢ, ŷᵢ), the residual eᵢ, the intercept b₀ and the slope b₁]
SLIDE 8
So
▪ b₀ is the estimated value of β₀
▪ the intercept or constant of the regression line
▪ b₁ is the estimated value of β₁
▪ the slope or slope coefficient of the regression line
▪ eᵢ is the estimated residual or error for observation i
▪ the “misfit”
STATISTICAL FORMULATION OF THE REGRESSION MODEL
SLIDE 9
Look back at the house prices ▪ where we have found the line ŷ = −264700 + 6152x
- a. Give the theoretical model
- b. Give the estimated model
EXERCISE 1
SLIDE 10
OLS will always give an estimate for β₀ and β₁
▪ the line of “best fit”
But is “best” also “good enough” to make good predictions?
▪ can we do a statistical test on the quality of the model?
We have minimized the sum of squares (SS) of the error:
SSE = ∑ᵢ₌₁ⁿ eᵢ² = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
We would like to compare this with:
▪ the “total” sum of squares SST
▪ the “explained” sum of squares SSR
ASSESSING THE REGRESSION MODEL
“R” stands for “regression”
SLIDE 11
Total sum of squares: SST = ∑ᵢ₌₁ⁿ (yᵢ − ȳ)²
ASSESSING THE REGRESSION MODEL
So SST is the total variation around the mean ȳ
SLIDE 12
Regression sum of squares: SSR = ∑ᵢ₌₁ⁿ (ŷᵢ − ȳ)²
▪ So,
▪ the data has a total variability SST
▪ the regression model explains a variability SSR
▪ and the residual variability is SSE
▪ and SST = SSR + SSE
Coefficient of determination (“R-square”): R² = SSR/SST = 1 − SSE/SST
ASSESSING THE REGRESSION MODEL
So SSR is the variation around the mean ȳ that is explained by the model
SLIDE 13
R² is a measure of the usefulness of the model
▪ Properties
▪ 0 ≤ R² ≤ 1
▪ R² = 0 means the model doesn’t explain anything
▪ R² = 1 means the model explains everything
▪ in between, the model explains R² × 100% of the variance of Y
ASSESSING THE REGRESSION MODEL
R² = 1 − SSE/SST
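The decomposition SST = SSR + SSE and the resulting R² can be checked numerically; a sketch with hypothetical data (toy numbers, not the house-price sample):

```python
# Hypothetical sample
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)

# OLS fit
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# Sums of squares
sst = sum((yi - y_bar) ** 2 for yi in y)               # total
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual

r_squared = ssr / sst  # equivalently 1 - sse/sst
print(round(sst, 6), round(ssr + sse, 6))  # equal: SST = SSR + SSE
print(round(r_squared, 2))  # 0.81
```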
SLIDE 14
If R² > 0, the regression model explains “something”
▪ but in a random sample, R² may be non-zero due to chance
▪ when is R² “significantly” different from 0?
Finding a test statistic
▪ look at the variances associated with SSR and SSE
▪ so define the mean sums of squares (MS) (variances!)
▪ MST = SST/(n−1); MSR = SSR/1; MSE = SSE/(n−2)
▪ use MSR/MSE = (SSR/1)/(SSE/(n−2)) as a ratio of two variances
ASSESSING THE REGRESSION MODEL
SLIDE 15
Statistical test:
▪ H₀: the independent variable (X) does not explain the variation in the dependent variable (Y)
▪ i.e., H₀: β₁ = 0 versus H₁: β₁ ≠ 0
▪ Sample statistic: F = MSR/MSE; reject for large values
▪ Under H₀: F ~ F(1, n−2); assumptions: see model
▪ Compare F_calc = MSR/MSE with F_crit = F(1, n−2; α)
▪ or compute the p-value as the probability of obtaining F_calc or more extreme if H₀ is true
ASSESSING THE REGRESSION MODEL
SLIDE 16
Using SPSS, three types of output
Model summary
▪ R²
Variance decomposition (ANOVA?)
▪ SSR, SSE, SST
▪ MSR, MSE
▪ F_calc
▪ p-value
Regression coefficients
▪ b₀ and b₁
ASSESSING THE REGRESSION MODEL
SLIDE 17
The model is Y = β₀ + β₁X + ε
▪ OLS extracts estimates from the data: b₀ and b₁
▪ But how accurate are these estimates?
We can also find the distribution of B₀ and B₁
▪ So, we can find confidence intervals and perform hypothesis tests
B₀ and B₁ are t-distributed:
▪ (B₀ − β₀)/S_B₀ ~ t(n−2)
▪ (B₁ − β₁)/S_B₁ ~ t(n−2)
ASSESSING THE REGRESSION MODEL
SLIDE 18
Mind the notation, like before:
▪ mean
▪ population value μ_X
▪ sample estimate x̄
▪ sampling distribution of random variable X̄
▪ regression slope
▪ population value β₁
▪ sample estimate b₁
▪ sampling distribution of random variable B₁
ASSESSING THE REGRESSION MODEL
When you’re careless with this, everything gets mixed up into one big abracadabra!
SLIDE 19
- a. Is the model significant?
- b. Does the model have practical relevance?
EXERCISE 2
SLIDE 20
▪ Testing β₀ is usually not interesting
▪ but testing β₁ is!
▪ in particular, the hypothesis β₁ = 0 is often interesting
▪ i.e., the hypothesis that there is no relation between X and Y
▪ or: that knowledge of X doesn’t tell you anything about Y
▪ This test requires the standard deviation of B₁
▪ it is calculated from the data; see computer output
▪ here s_B₁ = 347.578
TESTING THE REGRESSION COEFFICIENTS
SLIDE 21
So: t_calc = (b₁ − β₁)/s_B₁ = (6151.670 − 0)/347.578 = 17.699
▪ which has to be compared to t_crit = ±t(0.025; 69)
▪ reject H₀: β₁ = 0, because t_calc > t_crit
▪ or with the p-value: p = 0.000 ≪ 0.05
▪ and conclude that the slope differs significantly from zero
▪ post-hoc conclusion: it is larger than zero
TESTING THE REGRESSION COEFFICIENTS
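The arithmetic of this test can be reproduced directly from the estimates reported on the slide (b₁ = 6151.670 and s_B₁ = 347.578):

```python
# t test for H0: beta1 = 0, using the slide's reported estimates
b1 = 6151.670
s_b1 = 347.578
beta1_h0 = 0

t_calc = (b1 - beta1_h0) / s_b1
print(round(t_calc, 3))  # 17.699, matching the slide
```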
SLIDE 22
Testing the regression model
▪ on the basis of MSR/MSE ~ F(1, n−2)
Testing the regression coefficient b₁
▪ on the basis of (B₁ − 0)/S_B₁ ~ t(n−2)
The two approaches are equivalent
▪ they have the same null hypothesis: H₀: β₁ = 0
▪ they lead to the same conclusion (rejection or no rejection)
▪ they lead to the same p-value
▪ when we do multiple regression with several explanatory variables this is not the case! See later.
TESTING THE REGRESSION COEFFICIENTS
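The equivalence can be seen numerically: in simple regression F_calc equals t_calc². A sketch with hypothetical data (toy numbers, not the house-price sample; s_b1 = sqrt(MSE/Sxx) is the standard formula for the slope's standard error):

```python
import math

# Hypothetical sample
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)

# OLS fit
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# F test: MSR / MSE
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
mse = sse / (n - 2)
f_calc = (ssr / 1) / mse

# t test: b1 / s_b1, with s_b1 = sqrt(MSE / Sxx)
s_b1 = math.sqrt(mse / sxx)
t_calc = b1 / s_b1

print(abs(f_calc - t_calc ** 2) < 1e-6)  # True: F equals t squared
```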
SLIDE 23
We can also perform tests other than H₀: β₁ = 0
▪ Case 1: different test values for β₁
▪ for example H₀: β₁ = 2
▪ t_calc = (b₁ − 2)/s_B₁
▪ not in SPSS, but easily calculated using s_B₁
▪ Case 2: one-sided tests
▪ for example H₀: β₁ ≥ 0
▪ t_calc as before, but now tested with a different t_crit
▪ not in SPSS, but also easily calculated using the two-sided p-value
▪ Case 3: combination of case 1 and case 2
▪ for example H₀: β₁ ≥ 2
▪ Try all! (see tutorials)
TESTING THE REGRESSION COEFFICIENTS
SLIDE 24
Example of case 3:
▪ is there evidence that the price per square meter is larger than €5500?
▪ H₀: β₁ ≤ 5500; H₁: β₁ > 5500; α = 0.05
▪ t_calc = (6151.670 − 5500)/347.578 = 1.875 > t_crit ≈ 1.7
▪ reject H₀
▪ conclude that the price per m² is significantly larger than €5500
TESTING THE REGRESSION COEFFICIENTS
One-sided critical value, with α, not α/2
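The one-sided calculation, again using the reported estimates b₁ = 6151.670 and s_B₁ = 347.578 (the critical value ≈ 1.7 is taken from the slide, not computed here):

```python
b1 = 6151.670
s_b1 = 347.578

# H0: beta1 <= 5500 vs H1: beta1 > 5500 (one-sided, test value 5500)
t_calc = (b1 - 5500) / s_b1
print(round(t_calc, 3))  # 1.875, above the one-sided t_crit of about 1.7
```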
SLIDE 25
One may also test β₀ in exactly the same way
▪ however, this is hardly ever useful
The overall significance of the F-test depends only on B₁/S_B₁, not on B₀/S_B₀
▪ that is because the slope explains variation
▪ while the intercept is only a vertical shift
TESTING THE REGRESSION COEFFICIENTS
SLIDE 26
One of the regression results is the ANOVA table
ANOVA = analysis of variance
▪ Excel
▪ SPSS
THE ANOVA TABLE
SLIDE 27
What was ANOVA?
▪ ANOVA: Y numerical; X categorical
▪ regression: Y numerical; X numerical
So ANOVA is really different from regression
▪ why then an ANOVA table in regression?
Because ANOVA and regression both decompose the total variance (SST) into
▪ an explained part
▪ SSA in ANOVA (factor “A”); SSR in regression (“Regression”)
▪ an unexplained part
▪ SSW in ANOVA (“Within”); SSE in regression (“Error”)
THE ANOVA TABLE
SLIDE 28
The ANOVA table for regression
▪ MS* = SS*/df*, with * = R, E
▪ F_calc = MSR/MSE and associated p-value
THE ANOVA TABLE
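A sketch of how the table's rows fit together (hypothetical data; the Source/SS/df/MS layout mimics the Excel and SPSS output):

```python
# Hypothetical sample and OLS fit
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

# ANOVA table rows: (source, SS, df, MS), with MS* = SS*/df*
table = [
    ("Regression", ssr, 1,     ssr / 1),
    ("Error",      sse, n - 2, sse / (n - 2)),
    ("Total",      sst, n - 1, None),  # no MS for the Total row
]
f_calc = table[0][3] / table[1][3]  # F_calc = MSR / MSE
for source, ss, df, ms in table:
    print(source, round(ss, 3), df, ms if ms is None else round(ms, 3))
```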
SLIDE 29
21 May 2015, Q2c OLD EXAM QUESTION
SLIDE 30