Introduction to Regression Myra O Regan Myra.ORegan@tcd.ie Room 142 - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Introduction to Regression

Myra O’ Regan Myra.ORegan@tcd.ie Room 142 Lloyd Institute

1

slide-2
SLIDE 2

Description of module

  • Practical module on regression
  • Focussing on the application of multiple regression
  • Software
  • Lots of computer output; will use R sometimes
  • 2 labs
  • Some mathematics but no linear algebra

2

slide-3
SLIDE 3

Topics to be covered

  • Revision of Simple linear regression
  • Introduction to Multiple regression
  • Use of logs and other transformations
  • Regression Diagnostics
  • Use of Indicator Variables
  • Polynomial regression
  • Building a regression model
  • Dealing with multicollinearity
  • Introduction to Logistic regression
  • Other fun techniques

3

slide-4
SLIDE 4

Notes and Books

  • I use BlackBoard
  • Sheather, S. J. A Modern Approach to Regression with R. New York: Springer, 2009
  • Neter, J., Wasserman, W. & Kutner, M. H. Applied Linear Models, 2nd edition. Boston: Irwin, 1989
  • Kutner, M. H., Nachtsheim, C. J., Neter, J. & Li, W. Applied Linear Statistical Models, 5th edition. Boston: McGraw-Hill, 2005

4

slide-5
SLIDE 5

Purpose of regression

  • To build a model for prediction purposes
    – Price of a diamond from number of carats
    – Price of a house
    – Time to process invoices
    – Measuring the volume of wood in trees
  • To look at relationships
    – Factors relating to cot death

5

slide-6
SLIDE 6

Netflix competition

  • Variables were user, movie, date of grade, grade
  • Grade was measured from 1 to 5
  • 100,480,507 ratings
  • 480,189 users
  • 17,770 movies
  • Movie, title and year of release

6

slide-9
SLIDE 9

308 diamonds: price, colour, clarity and size

9

slide-12
SLIDE 12

Initial examination of data

  • Know the story behind the data
  • Understand the background
  • Understand meanings of variables
  • Look at each variable separately
  • Check the quality of data
  • Summary statistics and graphs
  • How much missing data?

12

slide-13
SLIDE 13

Revision of simple linear regression

  • The manager of a purchasing department of a large company would like to predict the average amount of time it takes to process a given number of invoices. Data were collected over a sample of 30 days on the number of invoices and the time taken in hours
  • Three variables: Time, Number of Invoices and Day

13

slide-14
SLIDE 14

Statistic   Invoices   Time
N           30         30
N*
Mean        130        2.11
SE Mean     13.7       0.165
StDev       74.8       0.905
Minimum     23         0.8
Q1          60         1.425
Median      127.5      2
Q3          190.8      2.8
Maximum     289        4.1

14

slide-16
SLIDE 16

Model to fit

  • $\text{Time}_i = \alpha + \beta \cdot \text{Invoices}_i + \varepsilon_i$
  • Linear model
  • Need estimates of α and β
  • Need SE for estimates
  • We use Minitab to calculate estimates of α and β

16

slide-18
SLIDE 18

What is going on here? What are the lines? More importantly, what are the differences?

18

slide-19
SLIDE 19

Prediction vs Confidence intervals

  • Confidence interval
  • For a given value $x_0$, this is an interval for the average value of the dependent variable
  • Point estimate $\pm\; t \cdot s \cdot \sqrt{\text{Distance value}}$
  • t has n-(k+1) df where k = no. of predictors
  • s = 0.330: what does this measure?
  • Distance value $= \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}$

19

slide-20
SLIDE 20

Prediction vs Confidence intervals

  • Prediction interval
  • For a given value $x_0$, this is an interval for a particular value of the dependent variable
  • Point estimate $\pm\; t \cdot s \cdot \sqrt{1 + \text{Distance value}}$
  • t has n-(k+1) df where k = no. of predictors
  • s = 0.330: what does this measure?
  • Distance value $= \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}$

20

slide-21
SLIDE 21

Approximate intervals for reasonably large samples

  • Confidence interval: point estimate $\pm\; 2 s \sqrt{\dfrac{1}{n}}$
  • Prediction interval: point estimate $\pm\; 2 s \sqrt{1 + \dfrac{1}{n}}$

21

slide-22
SLIDE 22

Example

  • Let number of invoices = 50
  • Where do these numbers come from roughly?
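
One rough way to see where such numbers come from is to rebuild both intervals from figures quoted elsewhere in this deck. The sketch below assumes the centred fit reported later (Time = 2.11 + 0.0113 × centred invoices), s = 0.330, n = 30, and the invoice mean (130) and StDev (74.8) from the summary table. The sum of squares `sxx` is reconstructed from the rounded StDev, and the t value 2.048 (28 df, 95% two-sided) is an assumption taken from standard tables, so the results are approximate and need not match the Minitab output exactly.

```python
import math

# Figures quoted in the slides (rounded), so results are approximate.
n = 30                  # days sampled
x_bar = 130.0           # mean number of invoices
sd_x = 74.8             # StDev of invoices
s = 0.330               # residual standard deviation of the fit
b0, b1 = 2.11, 0.0113   # centred fit: Time = b0 + b1 * (Invoices - x_bar)
t_crit = 2.048          # assumption: t table value, 28 df, 95% two-sided

x0 = 50.0
sxx = (n - 1) * sd_x ** 2                     # sum of (x_i - x_bar)^2
distance = 1 / n + (x0 - x_bar) ** 2 / sxx    # distance value
point = b0 + b1 * (x0 - x_bar)                # point estimate of Time

ci_half = t_crit * s * math.sqrt(distance)        # interval for the mean response
pi_half = t_crit * s * math.sqrt(1 + distance)    # interval for one new day

# Large-sample shortcuts: 2*s*sqrt(1/n) and 2*s*sqrt(1 + 1/n)
ci_approx = 2 * s * math.sqrt(1 / n)
pi_approx = 2 * s * math.sqrt(1 + 1 / n)

print(f"point estimate    : {point:.3f} hours")
print(f"95% CI half-width : {ci_half:.3f} (shortcut {ci_approx:.3f})")
print(f"95% PI half-width : {pi_half:.3f} (shortcut {pi_approx:.3f})")
```

The prediction interval is much wider because it adds the day-to-day variation of a single observation to the uncertainty in the fitted line. The shortcut for the confidence interval ignores how far 50 is from the mean of 130, so it understates the width here.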

22

slide-23
SLIDE 23

ANOVA table…

  • Total sums of squares (SS) $= \sum_i (Y_i - \bar{Y})^2$
  • Regression SS $= \sum_i (\hat{Y}_i - \bar{Y})^2$
  • Error SS $= \sum_i (Y_i - \hat{Y}_i)^2$
  • What is $R^2$?
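
The three sums of squares can be checked by hand on a tiny example. This is a minimal sketch on made-up data (the `x` and `y` values are hypothetical, not from the course); it fits a least-squares line and shows that Total SS = Regression SS + Error SS, with R² as the share of the total explained by the regression.

```python
# Hypothetical toy data, not from the course.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates for y = a + b*x
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar
fitted = [a + b * xi for xi in x]

total_ss = sum((yi - y_bar) ** 2 for yi in y)
regression_ss = sum((fi - y_bar) ** 2 for fi in fitted)
error_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
r_squared = regression_ss / total_ss

print(f"Total SS      = {total_ss:.4f}")
print(f"Regression SS = {regression_ss:.4f}")
print(f"Error SS      = {error_ss:.4f}")
print(f"R^2           = {r_squared:.4f}")
```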

23

slide-24
SLIDE 24

What happens if we do the following?

  • Let Invoices=X
  • Subtract a constant k from X for each case
  • What will change?
  • Time = α + β*X + ε (original model)
  • Time=α + β*(X-k)+ε= (α- βk)+ βX+ ε
  • Slope does not change but intercept does
  • Intercept = expected value of Time when X=k
  • Normally we use k=mean of the variable
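
The effect of centring can be seen numerically. The sketch below uses made-up invoice counts and times (hypothetical values, and `fit_line` is an illustrative helper, not a library function): the slope is identical in both fits, and with k equal to the mean of X the centred intercept is just the mean of Time.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data for illustration only.
invoices = [20.0, 60.0, 110.0, 150.0, 210.0]
hours = [0.9, 1.4, 1.9, 2.3, 3.1]

k = sum(invoices) / len(invoices)            # centre at the mean
a_raw, b_raw = fit_line(invoices, hours)
a_cen, b_cen = fit_line([xi - k for xi in invoices], hours)

print(f"raw fit     : intercept {a_raw:.4f}, slope {b_raw:.5f}")
print(f"centred fit : intercept {a_cen:.4f}, slope {b_cen:.5f}")
print(f"mean of y   : {sum(hours) / len(hours):.4f}")
```

The two intercepts are linked by centred intercept = raw intercept + slope × k, which is the same relationship as the expansion on the slide read in the other direction.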

24

slide-25
SLIDE 25

The regression equation is Time = 2.11 + 0.0113 Centered invoices

25

slide-27
SLIDE 27

Trees data

  • Sample of 31 black cherry trees in the Allegheny National Forest in Pennsylvania
  • Volume in cubic feet
  • Height in feet
  • Diameter in inches, measured 54 inches above ground

27

slide-28
SLIDE 28

Statistic   Diameter   Height   Volume
N           31         31       31
N*
Mean        13.248     76       30.17
SE Mean     0.564      1.14     2.95
StDev       3.138      6.37     16.44
Minimum     8.3        63       10.2
Q1          11         72       19.1
Median      12.9       76       24.2
Q3          16         80       38.3
Maximum     20.6       87       77

28

slide-35
SLIDE 35

What does the F-test mean?

  • Testing a hypothesis
  • Null hypothesis H0: β1 = β2 = 0
  • Alternative hypothesis H1: not all β's = 0
  • F = 254.97, df = (2, 28), p < 0.001
  • Enough evidence to reject the null hypothesis

35

slide-36
SLIDE 36

Interpretation of coefficients

  • Volume = β0 + β1*Height + β2*Diameter + ε
  • E(Volume) or Predicted(Volume), sometimes written as $\hat{Y}$, = -58.0 + 0.339*Height + 4.71*Diameter
  • Constant (-58.0) is the mean response when Height=0 and Diameter=0
  • β1: change in mean response per unit increase in Height when Diameter is held constant (at any value)
  • Similarly β2: change in mean response per unit increase in Diameter when Height is held constant (at any value)

36

slide-37
SLIDE 37

And a little more

  • Example: let Diameter = 12
  • E(Volume) = -58.0 + 0.339*Height + 4.71*12
  • = -1.48 + 0.339*Height
  • Intercept changes but β1 stays the same
  • Effect of Height on the mean response does not depend on Diameter
  • We say the effects are additive, or that they do not interact
  • Partial regression coefficients
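
The additive structure can be checked directly from the fitted equation. A minimal sketch, using the rounded coefficients quoted above (`expected_volume` is an illustrative helper name): fixing Diameter changes the intercept, while the change per foot of Height is the same at every Diameter.

```python
def expected_volume(height, diameter):
    """Fitted equation from the slides, with rounded coefficients."""
    return -58.0 + 0.339 * height + 4.71 * diameter

for diameter in (10, 12, 16):
    intercept = expected_volume(0, diameter)    # intercept once Diameter is fixed
    per_foot = expected_volume(81, diameter) - expected_volume(80, diameter)
    print(f"Diameter {diameter:>2}: intercept {intercept:7.2f}, "
          f"change per foot of Height {per_foot:.3f}")
```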

37

slide-38
SLIDE 38

Changing coefficients

  • Height by itself 1.54 (0.38)
  • Diameter by itself 5.07 (0.25)

Multiple regression

  • Height | Diameter 0.34 (0.13)
  • Diameter | Height 4.71 (0.26)

38

slide-39
SLIDE 39

Sums of squares

  • Same calculation as before
  • Sequential sums of squares, Diameter then Height
    – Diameter 7581.8
    – Height 102.4
  • Sequential sums of squares, Height then Diameter
    – Height 2901.1
    – Diameter 4783.0
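
Whichever order the predictors enter, the sequential sums of squares add up to the same regression SS, and that total leads back to the F statistic. The sketch below is a rough check only: the total SS is reconstructed from the rounded StDev of Volume (16.44) in the summary table, which is an assumption, so the F value lands near the reported 254.97 rather than matching it exactly.

```python
n, k = 31, 2                 # trees, predictors
sd_volume = 16.44            # StDev of Volume from the summary table (rounded)

seq_diameter_first = [7581.8, 102.4]    # Diameter, then Height
seq_height_first = [2901.1, 4783.0]     # Height, then Diameter

regression_ss = sum(seq_diameter_first)
total_ss = (n - 1) * sd_volume ** 2     # reconstructed from the rounded StDev
error_ss = total_ss - regression_ss

f_stat = (regression_ss / k) / (error_ss / (n - k - 1))
r_squared = regression_ss / total_ss

print(f"Regression SS, Diameter first: {sum(seq_diameter_first):.1f}")
print(f"Regression SS, Height first  : {sum(seq_height_first):.1f}")
print(f"Approximate F on ({k}, {n - k - 1}) df: {f_stat:.1f}")
print(f"Approximate R^2: {r_squared:.3f}")
```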

39

slide-40
SLIDE 40

Derived variables

  • Create a new x from the given x-variables
  • Could be a transformation or a combination
  • Use background knowledge to create new variable
  • Tree crudely modelled by a cylinder
  • Cylinder vol $= \pi r^2 \times \text{ht} = \dfrac{\pi}{4} (\text{Diam})^2 \times \text{ht}$
  • $\propto \text{ht} \times (\text{Diam})^2$

40

slide-41
SLIDE 41

Plot first

41

slide-46
SLIDE 46

Transform using logs

  • $y = \log_b a$ means $b^y = a$
  • $2^3 = 8$, so $\log_2 8 = 3$
  • b is called the base
  • Typical bases are e and 10
  • We are going to use base 10
  • e is a mathematical constant, approximately 2.718
  • Logs to the base e are called natural logs, often written as ln

46

slide-47
SLIDE 47

Basic rules for logs using base 10

  • $\log(10) = 1$
  • $\log(10^a) = a$
  • $\log(1) = 0$
  • $\log(0)$ is not defined
  • $\log(x^r) = r \log(x)$
  • $10^{\log(a)} = a$
  • The Richter scale for measuring earthquake strength is on a log 10 scale

47

slide-48
SLIDE 48

And some more

  • $\log(ab) = \log(a) + \log(b)$
  • $\log\left(\dfrac{a}{b}\right) = \log a - \log b$
  • $10^{ab} = (10^a)^b$; $10^{a+b} = 10^a \, 10^b$; $10^{a-b} = \dfrac{10^a}{10^b}$
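
The rules on this slide and the previous one can be spot-checked numerically. A minimal sketch using Python's `math.log10`; the values of `a`, `b` and `r` are arbitrary positive numbers chosen for illustration.

```python
import math

log = math.log10
a, b, r = 250.0, 4.0, 3.0   # arbitrary positive values

checks = {
    "log(10) = 1": math.isclose(log(10), 1.0),
    "log(10^a) = a": math.isclose(log(10 ** 2.5), 2.5),
    "log(1) = 0": log(1) == 0.0,
    "log(x^r) = r*log(x)": math.isclose(log(a ** r), r * log(a)),
    "10^log(a) = a": math.isclose(10 ** log(a), a),
    "log(ab) = log(a) + log(b)": math.isclose(log(a * b), log(a) + log(b)),
    "log(a/b) = log(a) - log(b)": math.isclose(log(a / b), log(a) - log(b)),
    "10^(a+b) = 10^a * 10^b": math.isclose(10 ** (2 + 3), 10 ** 2 * 10 ** 3),
}
for rule, holds in checks.items():
    print(f"{rule:<28} {holds}")

# log(0) is not defined: Python raises an error rather than returning a value.
try:
    log(0)
    log_zero_defined = True
except ValueError:
    log_zero_defined = False
print("log(0) defined?", log_zero_defined)
```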

48

slide-49
SLIDE 49

What are we going to do with all this?

  • Linear Model
  • We can take logs of X, of Y, or of both
  • We are interested in interpreting the coefficients in the original scale
  • We will see later when each is appropriate
  • Let us start with the model
  • Y=α + β*log(x) + ε

49

slide-52
SLIDE 52

Interpretation of coefficients

  • A 1 unit increase in log(X) is associated with a β unit increase in Y
  • log(X) + 1 = log(X) + log(10) = log(10X)
  • Converting to a percentage
  • Multiplying X by 10 is equivalent to a (10-1)*100% = 900% increase in X
  • β: expected change in Y when X is multiplied by 10
  • β: expected change in Y when X increases by 900%

52

slide-53
SLIDE 53

And more

  • For other percentage changes p
  • A p% increase in X is associated with a $\beta \log\left(\dfrac{100+p}{100}\right)$ increase in Y
  • A 10% increase in X is associated with a $\beta \log\left(\dfrac{100+10}{100}\right)$ increase in Y
  • $= \beta \log(1.1)$ increase in Y
  • $= 0.041\,\beta$ increase in Y

53

slide-54
SLIDE 54

What does this mean?

  • Volume = - 461 + 262 logheight
  • An increase of 1 in logheight is associated with an increase of 262 in Volume
  • Multiplying height by 10 is associated with an increase of 262 in Volume
  • A 10% increase in height is associated with an increase in Volume of $\beta \log\left(\dfrac{100+p}{100}\right) = 262 \times \log(1.1) = 10.84$
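
The same arithmetic for any percentage change can be sketched as a small function. It assumes the fitted slope of 262 quoted above and base-10 logs; `change_in_y` is an illustrative name.

```python
import math

def change_in_y(beta, pct):
    """Expected change in Y for a pct% increase in X when Y = a + beta*log10(X)."""
    return beta * math.log10((100 + pct) / 100)

beta = 262   # slope on logheight from the fitted equation
for pct in (1, 10, 50, 900):
    print(f"{pct:>3}% increase in height -> Volume changes by {change_in_y(beta, pct):.2f}")
```

A 900% increase is the same as multiplying height by 10, so the last line recovers the slope itself.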

54

slide-55
SLIDE 55

Next situation

  • Log(Y)=α+β*X+ε
  • A 1 unit increase in X is associated with a β unit increase in log Y
  • On the original scale, log Y + β corresponds to $10^{\log Y + \beta} = Y \times 10^{\beta}$
  • Each 1-unit increase in X multiplies the expected value of Y by $10^{\beta}$
  • The effect of a c-unit increase in X is to multiply the expected value of Y by $10^{c\beta}$

55

slide-56
SLIDE 56

More

  • Calculate the multiplicative change ch $= \dfrac{Y \times 10^{\beta}}{Y} = 10^{\beta}$
  • Calculate (ch-1)*100 to get the percentage change
  • ch = 1.20 implies a 20% increase
  • ch = 0.7 implies a 30% decrease

56

slide-58
SLIDE 58

And now ..

  • logVolume = - 0.346 + 0.0233 Height
  • A 1 unit increase in height increases logVolume by 0.0233
  • Each unit increase in height multiplies Volume by $10^{0.0233} = 1.055$, a 5.5% increase
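
A short sketch of the same calculation, assuming base-10 logs and the slope 0.0233 quoted above (`multiplier` and `pct_change` are illustrative names). It also shows the c-unit version, where the multiplier is 10 raised to c times the slope.

```python
def multiplier(beta, c=1):
    """Factor applied to Y for a c-unit increase in X when log10(Y) = a + beta*X."""
    return 10 ** (c * beta)

def pct_change(ch):
    """Convert a multiplicative change into a percentage change."""
    return (ch - 1) * 100

beta = 0.0233   # slope on Height from the fitted equation
for c in (1, 5, 10):
    ch = multiplier(beta, c)
    print(f"{c:>2} ft taller: Volume x {ch:.3f} ({pct_change(ch):+.1f}%)")
```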

58

slide-59
SLIDE 59

Last situation

  • Log Y = α +β*log(X) +ε
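  • A 1 unit increase in log(X) is associated with a β unit increase in log(Y)
  • A p% increase in X is associated with an increase of $\beta \log\left(\dfrac{100+p}{100}\right)$ in log Y units
  • Let $a = \beta \log\left(\dfrac{100+p}{100}\right)$
  • Adding a to log Y multiplies Y by $10^{a}$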

59

slide-61
SLIDE 61

Again some interpretation

  • logVolume = - 6.06 + 3.98 logheight
  • A 1 unit increase in logheight will increase logVolume by 3.98
  • Multiplying height by 10 multiplies Volume by $10^{3.98}$
  • A 10% increase in height multiplies Volume by $10^{3.98 \log(1.1)} = 1.46$
  • Can interpret this as a 46% increase in Volume
  • A 10% increase in height is associated with a 46% increase in Volume
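
The log-log multiplier can be computed for any percentage change. A minimal sketch assuming base-10 logs and the slope 3.98 quoted above (`volume_multiplier` is an illustrative name); note that the expression simplifies to (1 + p/100) raised to the power of the slope.

```python
import math

def volume_multiplier(beta, pct):
    """Factor applied to Y for a pct% increase in X when log10(Y) = a + beta*log10(X)."""
    return 10 ** (beta * math.log10(1 + pct / 100))

beta = 3.98   # slope on logheight from the fitted equation
for pct in (1, 5, 10, 20):
    m = volume_multiplier(beta, pct)
    print(f"{pct:>2}% taller: Volume x {m:.3f} ({(m - 1) * 100:.1f}% increase)")
```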

61

slide-62
SLIDE 62

Interpretation

  • logVolume = - 6.06 + 3.98 logheight
  • Can write as
  • $10^{\text{logVolume}} = 10^{(-6.06 + 3.98\,\text{logheight})}$
  • $\text{Volume} = 10^{-6.06} \times 10^{3.98\,\text{logheight}}$
  • This is sometimes called a multiplicative model
  • Using the above for prediction
  • Height = 85; remember to use log10(85) = 1.929
  • Using Minitab we get (1.2412, 1.9973) as the prediction interval (PI) on the log scale

62

slide-63
SLIDE 63

In the original units

  • (1.2412, 1.9973) on the log scale becomes $(10^{1.2412}, 10^{1.9973})$ = (17.42, 99.38) in the original units
  • 85 is not in the centre
  • Will return to when to use logs
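
The back-transformation can be reproduced in a few lines. This sketch uses the rounded coefficients and the Minitab interval quoted in the slides, so the last digit may differ slightly from the slide values. It also shows that the interval, symmetric on the log scale, is not symmetric around the point prediction once converted back to cubic feet.

```python
import math

# Coefficients and interval as quoted in the slides (rounded).
intercept, slope = -6.06, 3.98
height = 85.0
pi_log = (1.2412, 1.9973)    # Minitab prediction interval, log10 scale

log_pred = intercept + slope * math.log10(height)   # prediction on the log10 scale
point = 10 ** log_pred                               # back to cubic feet
lower, upper = (10 ** v for v in pi_log)

print(f"log10 prediction    : {log_pred:.3f}")
print(f"volume prediction   : {point:.1f}")
print(f"prediction interval : ({lower:.2f}, {upper:.2f})")
print(f"distance below/above: {point - lower:.1f} / {upper - point:.1f}")
```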

63

slide-64
SLIDE 64

64

Interpret the coefficients in the original scale. Calculate the predicted Sunday circulation for a weekday circulation of 300,000, giving both the prediction and the CI. You can just use the approximate solution.

slide-65
SLIDE 65

Interpretations

65

A 10% increase in weekly circulation is associated with Sunday circulation being multiplied by $10^{1.05 \log(1.1)} = 1.105$, equivalent to a 10.5% increase.

% increase in weekly   Multiplier for Sunday   % increase in Sunday
10                     1.105                   10.5
20                     1.211                   21.1
30                     1.317                   31.7
40                     1.424                   42
50                     1.531                   53

slide-66
SLIDE 66

Approximate Confidence Intervals

  • Calculate CIs for a weekly circulation of 300,000
  • Predicted value = -0.134 + 1.05*log(300000) = 5.62 on the log scale
  • N = 89; s = 0.056
  • 95% CI = 5.617 ± 2*0.056*$\sqrt{1/89}$
  • = (5.605, 5.629) on the log scale, or 402,835 to 425,473 in the original units
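
Both the percentage table and the approximate interval can be rebuilt from the quoted figures. The sketch below assumes the fitted equation log(Sunday) = -0.134 + 1.05 log(weekly), N = 89 and s = 0.056, and uses the large-sample shortcut 2*s*sqrt(1/N). Because the coefficients are rounded, the back-transformed limits come out close to, but not exactly at, the slide values.

```python
import math

b0, b1 = -0.134, 1.05    # fitted log-log equation quoted on the slides
n, s = 89, 0.056

def sunday_multiplier(pct):
    """Factor applied to Sunday circulation for a pct% rise in weekly circulation."""
    return 10 ** (b1 * math.log10(1 + pct / 100))

for pct in (10, 20, 30, 40, 50):
    m = sunday_multiplier(pct)
    print(f"{pct:>2}% -> x {m:.3f} ({(m - 1) * 100:.1f}% increase)")

pred_log = b0 + b1 * math.log10(300_000)     # prediction on the log10 scale
half = 2 * s * math.sqrt(1 / n)              # approximate 95% CI half-width
lower = 10 ** (pred_log - half)
upper = 10 ** (pred_log + half)

print(f"log-scale CI : ({pred_log - half:.3f}, {pred_log + half:.3f})")
print(f"original CI  : ({lower:,.0f}, {upper:,.0f})")
```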

66

slide-67
SLIDE 67

Great chapter on derived variables

  • Linoff, G. S. & Berry, M. J. A. Data Mining Techniques, 3rd edition. Indianapolis: Wiley, 2011

67

slide-68
SLIDE 68

Some summary thoughts

  • Get to know the story of your data
  • Use simple plots and summary statistics
  • Does it look ok?
  • Think about derived variables
  • Start simply
  • Don’t forget your common sense

68