

SLIDE 1

2/22/2007

219323 Probability and Statistics for Software and Knowledge Engineers

Lecture 13: Simple Linear Regression and Correlation

Monchai Sopitkamon, Ph.D.

Outline

• The Simple Linear Regression Model (12.1)
• Fitting the Regression Line (12.2)
• The Analysis of Variance Table (12.6)
• Residual Analysis (12.7)
• Correlation Analysis (12.9)

SLIDE 2

The Simple Linear Regression Model I (12.1)

Purpose of regression analysis: predict the value of a dependent or response variable from the values of one or more explanatory or independent variables (also called predictors or factors).

Purpose of correlation analysis: measure the strength of the correlation between two variables.

The Simple Linear Regression Model II (12.1)

yi = β0 + β1xi + εi, so that Yi ∼ N(β0 + β1xi, σ²)

β0: intercept parameter
β1: slope parameter

Simple linear regression model
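The model above says each response is drawn from a normal distribution centered on the regression line β0 + β1x. A minimal sketch (the particular β0, β1, σ values are illustrative, not from the slides) simulating such data:

```python
import random

def simulate_slr(beta0, beta1, xs, sigma, rng=random):
    """Draw yi ~ N(beta0 + beta1*xi, sigma^2) for each xi."""
    return [beta0 + beta1 * x + rng.gauss(0, sigma) for x in xs]

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
# With sigma = 0 every response falls exactly on the line beta0 + beta1*x;
# a positive sigma scatters the points vertically around it.
ys = simulate_slr(0.409, 0.499, xs, sigma=0.0)
```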

SLIDE 3

The Simple Linear Regression Model III (12.1)

Interpretation of the error variance σ²

The Simple Linear Regression Model IV (12.1)

β1 > 0: positive relationship
β1 = 0: no relationship
β1 < 0: negative relationship

The SLR model is not appropriate for a nonlinear relationship.

SLIDE 4

The Simple Linear Regression Model V (12.1)

Ex.67 pg.536: Car Plant Electricity Usage

[Scatter plot of electricity usage vs. production]

Excel sheet

Outline

• The Simple Linear Regression Model (12.1)
• Fitting the Regression Line (12.2)
• The Analysis of Variance Table (12.6)
• Residual Analysis (12.7)
• Correlation Analysis (12.9)

SLIDE 5

Fitting the Regression Line I (12.2): Selecting the “best” line

[Plot: the least squares fit, with the errors shown as vertical distances between each observed y and the estimated y]

Fitting the Regression Line II (12.2)

ŷi = β̂0 + β̂1 xi : predicted value of y for observation i
yi : observed value of y for observation i

β̂0 and β̂1 are chosen to minimize:

SSE = ∑_{i=1}^{n} ei² = ∑_{i=1}^{n} (yi − ŷi)² = ∑_{i=1}^{n} (yi − (β̂0 + β̂1 xi))²

Subject to: ∑_{i=1}^{n} ei = 0

SLIDE 6

Fitting the Regression Line III (12.2)

Method of Least Squares

β̂1 = (∑_{i=1}^{n} xi yi − n x̄ ȳ) / (∑_{i=1}^{n} xi² − n x̄²)

β̂0 = ȳ − β̂1 x̄

Variance of errors:

σ̂² = SSE / (n − 2)

n − 2 since two regression parameters need to be computed first
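The closed-form estimates above translate directly into code. A sketch (not from the slides) implementing β̂1, β̂0, and σ̂² = SSE/(n − 2):

```python
def fit_line(xs, ys):
    """Least squares estimates: slope, intercept, and error variance SSE/(n-2)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    # Slope: (sum xi*yi - n*xbar*ybar) / (sum xi^2 - n*xbar^2)
    b1 = (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) / \
         (sum(x * x for x in xs) - n * xbar ** 2)
    # Intercept: ybar - b1*xbar
    b0 = ybar - b1 * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return b1, b0, sse / (n - 2)

# Points lying exactly on y = 2x + 1 recover slope 2, intercept 1, variance 0
b1, b0, s2 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```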

Fitting the Regression Line IV (12.2)

Ex.67 pg.545: Car Plant Electricity Usage

n = 12, x̄ = 4.885, ȳ = 2.846, ∑ xi yi = 169.253, ∑ xi² = 291.231

β̂1 = (∑ xi yi − n x̄ ȳ) / (∑ xi² − n x̄²) = (169.253 − 12 × 4.885 × 2.846) / (291.231 − 12 × 4.885²) = 0.4988

β̂0 = ȳ − β̂1 x̄ = 2.846 − 0.4988 × 4.885 = 0.409

∴ ŷ = 0.499x + 0.409

Excel sheet
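The arithmetic in this example can be reproduced from the quoted summary statistics; note that those sums are rounded, so the computed coefficients only approximately match the slide's 0.499 and 0.409:

```python
# Summary statistics quoted for Ex.67 (Car Plant Electricity Usage)
n, xbar, ybar = 12, 4.885, 2.846
sum_xy, sum_x2 = 169.253, 291.231

# Least squares formulas from 12.2
b1 = (sum_xy - n * xbar * ybar) / (sum_x2 - n * xbar ** 2)
b0 = ybar - b1 * xbar
# b1 and b0 come out near 0.499 and 0.409; rounding in the quoted
# sums shifts the last digits slightly.
```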

SLIDE 7

Fitting the Regression Line V (12.2)

Ex.67 pg.545: Car Plant Electricity Usage

[Scatter plot of electricity usage vs. production with fitted line: y = 0.498x + 0.409, R² = 0.802]

Outline

• The Simple Linear Regression Model (12.1)
• Fitting the Regression Line (12.2)
• The Analysis of Variance Table (12.6)
• Residual Analysis (12.7)
• Correlation Analysis (12.9)

SLIDE 8

Outline

• The Simple Linear Regression Model (12.1)
• Fitting the Regression Line (12.2)
• The Analysis of Variance Table (12.6)
• Residual Analysis (12.7)
• Correlation Analysis (12.9)

SLIDE 9

The Analysis of Variance Table: Sum of Squares Decomposition I (12.6.1)

Apply an ANOVA approach similar to the one-factor layout in Chapter 11.

Consider the variability in the dependent variable y.

Hypothesis test:

H0 : β1 = 0

The Analysis of Variance Table: Sum of Squares Decomposition II (12.6.1)

SST = ∑_{i=1}^{n} (yi − ȳ)²

SSR = ∑_{i=1}^{n} (ŷi − ȳ)² = SST − SSE

SSE = ∑_{i=1}^{n} (yi − ŷi)²
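The decomposition SST = SSR + SSE can be checked numerically. A sketch using a small made-up data set (not from the slides):

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

# Least squares fit (formulas from 12.2)
b1 = (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) / \
     (sum(x * x for x in xs) - n * xbar ** 2)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * x for x in xs]

sst = sum((y - ybar) ** 2 for y in ys)          # total sum of squares
ssr = sum((yh - ybar) ** 2 for yh in yhat)       # regression sum of squares
sse = sum((y - yh) ** 2 for y, yh in zip(ys, yhat))  # error sum of squares
r2 = ssr / sst
# sst == ssr + sse, and R^2 = SSR/SST = 1 - SSE/SST
```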

SLIDE 10

The Analysis of Variance Table: Sum of Squares Decomposition III (12.6.1)

The sums of squares for a simple linear regression

The Analysis of Variance Table: Sum of Squares Decomposition IV (12.6.1)

The analysis of variance table for a simple linear regression analysis. Hypothesis test:

H0 : β1 = 0

The two-sided p-value is

p-value = P(X > F), where X is a random variable that has an F1,n−2 distribution
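The F-statistic in that table is MSR/MSE with 1 and n − 2 degrees of freedom. A sketch on a small made-up data set (evaluating the p-value itself needs an F-distribution tail function such as scipy.stats.f.sf, omitted here to stay within the standard library):

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

b1 = (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) / \
     (sum(x * x for x in xs) - n * xbar ** 2)
b0 = ybar - b1 * xbar
sst = sum((y - ybar) ** 2 for y in ys)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
ssr = sst - sse

msr = ssr / 1          # regression mean square, 1 df
mse = sse / (n - 2)    # error mean square, n - 2 df
F = msr / mse
# p-value = P(X > F) for X ~ F(1, n-2), e.g. scipy.stats.f.sf(F, 1, n - 2)
```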

SLIDE 11

The Analysis of Variance Table: Sum of Squares Decomposition V (12.6.1)

Coefficient of determination (R²): the fraction of variation explained by the regression (0 ≤ R² ≤ 1)

R² = SSR / SST = 1 − SSE / SST

The closer R² is to one, the better the regression model.

The Analysis of Variance Table: Sum of Squares Decomposition VI (12.6.1)

The coefficient of determination R² is larger in scenario II than in scenario I

SLIDE 12

The Analysis of Variance Table: Sum of Squares Decomposition VII (12.6.1)

Ex.67 pg.572: Car Plant Electricity Usage

F = MSR / MSE = 1.2124 / 0.0299 = 40.53

R² = SSR / SST = 1.2124 / 1.5115 = 0.802

The higher the value of R², the better the regression.

Excel sheet

Outline

• The Simple Linear Regression Model (12.1)
• Fitting the Regression Line (12.2)
• The Analysis of Variance Table (12.6)
• Residual Analysis (12.7)
• Correlation Analysis (12.9)

SLIDE 13

Residual Analysis Methods I (12.7.1)

Residuals: the differences between the observed values of the dependent variable and the corresponding predicted (fitted) values:

ei = yi − ŷi, 1 ≤ i ≤ n

Residual analysis can be used to:
– Identify outliers
– Check if the fitted model is good
– Check if the variance of error is constant
– Check if the error terms are normally distributed

Excel sheet

Residual Analysis Methods II (12.7.1)

Plot the residuals ei against the values of the explanatory variable xi.

A random scatter indicates no problem with the obtained regression model.

If the standardized residual ei / σ̂ of a data point exceeds 3 in absolute value, data point i is an outlier. If there are outliers, they should be removed and the regression line should be fitted again.

Excel sheet
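The outlier rule above (flag any point with |ei/σ̂| > 3) can be sketched as follows; the helper fits the line with the least squares formulas from 12.2:

```python
def standardized_residuals(xs, ys):
    """Fit the least squares line, then return each residual divided by sigma-hat."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b1 = (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) / \
         (sum(x * x for x in xs) - n * xbar ** 2)
    b0 = ybar - b1 * xbar
    e = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]       # residuals ei
    sigma_hat = (sum(ei ** 2 for ei in e) / (n - 2)) ** 0.5  # sqrt(SSE/(n-2))
    return [ei / sigma_hat for ei in e]

std_res = standardized_residuals([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
# Flag points whose standardized residual exceeds 3 in absolute value
outliers = [i for i, z in enumerate(std_res) if abs(z) > 3]
```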

SLIDE 14

Residual Analysis Methods III (12.7.1)

[Residual plot indicating points that may be outliers]

Residual Analysis Methods IV (12.7.1)

If the residual plot shows positive and negative residuals grouped together, a linear model is not suitable.

[A grouping of positive and negative residuals indicates that the linear model is inappropriate]

SLIDE 15

Residual Analysis Methods V (12.7.1)

If the residual plot shows a “funnel shape”, the variance of error (σ²) is not constant, conflicting with the model assumption.

[A funnel shape in the residual plot indicates a non-constant error variance]

Residual Analysis Methods VI (12.7.1)

A normal probability plot (normal scores plot) of the residuals can be used to check if the error terms εi are normally distributed.

[A normal scores plot of a simulated sample from a normal distribution, which shows the points lying approximately on a straight line]

SLIDE 16

Residual Analysis Methods VII (12.7.1)

If the normal scores plot exhibits a non-normal distribution of residuals, the linear modeling approach may not be used.

[Normal scores plots of simulated samples from non-normal distributions, which show nonlinear patterns]

Outline

• The Simple Linear Regression Model (12.1)
• Fitting the Regression Line (12.2)
• The Analysis of Variance Table (12.6)
• Residual Analysis (12.7)
• Correlation Analysis (12.9)

SLIDE 17

The Sample Correlation Coefficient I (12.9.1)

From the correlation equation in Section 2.5.4,

ρ = Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))

which measures the strength of linear association between two jointly distributed RVs X and Y.

The sample correlation coefficient r for a set of paired data observations (xi, yi) is

r = ∑_{i=1}^{n} (xi − x̄)(yi − ȳ) / √( ∑_{i=1}^{n} (xi − x̄)² ∑_{i=1}^{n} (yi − ȳ)² )
  = (∑ xi yi − n x̄ ȳ) / √( (∑ xi² − n x̄²)(∑ yi² − n ȳ²) )

(−1 ≤ r ≤ 1)
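Both forms of r above give the same value; a sketch checking them against each other on a small made-up data set (where r² also matches the R² of Section 12.6):

```python
import math

def corr(xs, ys):
    """Sample correlation via the centered-sums form."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def corr_shortcut(xs, ys):
    """Sample correlation via the raw-sums (shortcut) form."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    num = sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar
    den = math.sqrt((sum(x * x for x in xs) - n * xbar ** 2) *
                    (sum(y * y for y in ys) - n * ybar ** 2))
    return num / den

xs, ys = [1, 2, 3, 4, 5], [2, 4, 5, 4, 5]
r = corr(xs, ys)
```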

The Sample Correlation Coefficient II (12.9.1)

r = 0: no linear association
r < 0: negative linear association
r > 0: positive linear association

SLIDE 18

The Sample Correlation Coefficient III (12.9.1)

r² = R²  — (sample correlation coefficient)² = (coefficient of determination)

r is unchanged if x and y are swapped, in contrast to regression analysis, which requires that one variable be dependent and the other explanatory.

r is also not affected by any linear transformation of the variables, e.g.,

x′i = a xi + b and y′i = c yi + d (with a, c > 0)
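Both invariance properties (symmetry in x and y, and invariance under x′i = a·xi + b, y′i = c·yi + d with a, c > 0) can be demonstrated directly; the corr helper below re-implements the formula from 12.9.1, and the transformation constants are arbitrary illustrative choices:

```python
import math

def corr(xs, ys):
    """Sample correlation coefficient (formula from 12.9.1)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = corr(xs, ys)
r_swapped = corr(ys, xs)                                   # swap x and y
r_scaled = corr([3 * x + 1 for x in xs],                   # a = 3, b = 1
                [2 * y - 7 for y in ys])                   # c = 2, d = -7
# r_swapped and r_scaled both equal r
```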

The Sample Correlation Coefficient IV (12.9.1)

Hypothesis test:

H0 : ρ = 0 (no correlation between the two RVs)

can be performed by computing the t-statistic

t = r √(n − 2) / √(1 − r²)

which has a t-distribution with n − 2 degrees of freedom.

SLIDE 19

The Sample Correlation Coefficient V (12.9.1)

Ex.69 pg.588: Cranial Circumferences

r = S_XY / √(S_XX S_YY) = 0.255

(r² = R² = 0.255² = 0.065)

Null hypothesis H0 : ρ = 0 (no correlation). Compute the t-statistic:

t = r √(n − 2) / √(1 − r²) = 0.255 × √18 / √(1 − 0.255²) = 1.12

p-value = 2 × P(X > 1.12) = 0.277, where X has a t-distribution with 18 degrees of freedom

Since the p-value > α, we accept H0: finger length and cranial circumference are not correlated.

Excel sheet
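The t-statistic arithmetic in this example can be verified from r = 0.255 and 18 degrees of freedom (so n = 20); computing the p-value itself would need a t-distribution tail function such as scipy.stats.t.sf, which is omitted here:

```python
import math

r, n = 0.255, 20
# t = r*sqrt(n-2)/sqrt(1-r^2), with n - 2 = 18 degrees of freedom
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
# t comes out to about 1.12, matching the slide's value
```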