CS 147: Computer Systems Performance Analysis
Multiple and Categorical Regression
2015-06-15
1 / 36
Overview

◮ Multiple Linear Regression
  ◮ Basic Formulas
  ◮ Example
  ◮ Quality of the Example
◮ Categorical Models
Multiple Linear Regression
3 / 36
Multiple Linear Regression
◮ Develops models with more than one predictor variable
◮ But each predictor variable has a linear relationship to the response variable
◮ Conceptually, plotting a regression line in n-dimensional space, instead of 2-dimensional
Multiple Linear Regression Basic Formulas
4 / 36
Basic Multiple Linear Regression Formula
Response y is a function of k predictor variables x1, x2, . . . , xk:

y = b0 + b1x1 + b2x2 + · · · + bkxk + e
5 / 36
A Multiple Linear Regression Model
Given a sample of n observations
{(x11, x21, . . . , xk1, y1), . . . , (x1n, x2n, . . . , xkn, yn)}
the model consists of n equations (note possible + vs. − typo in book):

y1 = b0 + b1x11 + b2x21 + · · · + bkxk1 + e1
y2 = b0 + b1x12 + b2x22 + · · · + bkxk2 + e2
. . .
yn = b0 + b1x1n + b2x2n + · · · + bkxkn + en
6 / 36
Looks Like It’s Matrix Arithmetic Time
y = Xb + e

[ y1 ]   [ 1  x11  x21  ...  xk1 ] [ b0 ]   [ e1 ]
[ y2 ] = [ 1  x12  x22  ...  xk2 ] [ b1 ] + [ e2 ]
[ .. ]   [ ..   ..   ..       .. ] [ .. ]   [ .. ]
[ yn ]   [ 1  x1n  x2n  ...  xkn ] [ bk ]   [ en ]

Note that:
◮ y and e have n elements
◮ b has k + 1 elements
◮ X is n by k + 1
7 / 36
Analysis of Multiple Linear Regression
◮ Listed in box 15.1 of Jain
◮ Not terribly important (for our purposes) how they were derived
  ◮ This isn't a class on statistics
  ◮ But you need to know how to use them
◮ Mostly matrix analogs to simple linear regression results
8 / 36
Example of Multiple Linear Regression
◮ IMDB keeps numerical popularity ratings of movies
◮ Postulate popularity of Academy Award-winning films is based on two factors:
  ◮ Year made
  ◮ Running time
◮ Produce a regression: rating = b0 + b1(year) + b2(length)
Multiple Linear Regression Example
9 / 36
Some Sample Data
Title                  Year  Length  Rating
Silence of the Lambs   1991    118     8.1
Terms of Endearment    1983    132     6.8
Rocky                  1976    119     7.0
Oliver!                1968    153     7.4
Marty                  1955     91     7.7
Gentleman's Agreement  1947    118     7.5
Mutiny on the Bounty   1935    132     7.6
It Happened One Night  1934    105     8.0
10 / 36
Now for Some Tedious Matrix Arithmetic
◮ We need to calculate X, XT, XTX, (XTX)−1, and XTy
◮ Because b = (XTX)−1(XTy)
◮ We will see that b = (18.5430, −0.0051, −0.0086)
◮ Meaning the regression predicts:
  rating = 18.5430 − 0.0051(year) − 0.0086(length)
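The matrix arithmetic on the following slides can be sanity-checked in a few lines; a minimal NumPy sketch (np.linalg.lstsq solves the same least-squares problem as b = (XTX)−1(XTy)):

```python
import numpy as np

# Sample data from the table of Academy Award winners
year   = np.array([1991, 1983, 1976, 1968, 1955, 1947, 1935, 1934])
length = np.array([118, 132, 119, 153, 91, 118, 132, 105])
rating = np.array([8.1, 6.8, 7.0, 7.4, 7.7, 7.5, 7.6, 8.0])

# X gets a leading column of ones so that b0 is the intercept
X = np.column_stack([np.ones(len(year)), year, length])

# Least-squares solution; same b as (X^T X)^-1 (X^T y) but more stable
b, *_ = np.linalg.lstsq(X, rating, rcond=None)
print(b)  # approximately [18.543, -0.0051, -0.0086]
```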
11 / 36
X Matrix for Example
X =
[ 1  1991  118 ]
[ 1  1983  132 ]
[ 1  1976  119 ]
[ 1  1968  153 ]
[ 1  1955   91 ]
[ 1  1947  118 ]
[ 1  1935  132 ]
[ 1  1934  105 ]
12 / 36
Transpose to Get XT
XT =
[    1     1     1     1     1     1     1     1 ]
[ 1991  1983  1976  1968  1955  1947  1935  1934 ]
[  118   132   119   153    91   118   132   105 ]
13 / 36
Multiply To Get XTX
XTX =
[     8     15689      968 ]
[ 15689  30771385  1899083 ]
[   968   1899083   119572 ]
14 / 36
Invert to Get C = (XTX)−1
C = (XTX)−1 =
[ 1207.7585  −0.6240    0.1328 ]
[   −0.6240   0.0003   −0.0001 ]
[    0.1328  −0.0001    0.0004 ]
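The inversion itself is easy to check numerically; a quick NumPy sketch:

```python
import numpy as np

# X^T X from the previous slide
XtX = np.array([[8,     15689,    968],
                [15689, 30771385, 1899083],
                [968,   1899083,  119572]], dtype=float)

C = np.linalg.inv(XtX)
print(np.round(C, 4))  # diagonal: 1207.7585, 0.0003, 0.0004
```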
15 / 36
Multiply to Get XTy
XTy =
[     60.1 ]
[ 117840.7 ]
[   7247.5 ]
16 / 36
Multiply (XTX)−1(XTy) to Get b
b =
[ 18.5430 ]
[ −0.0051 ]
[ −0.0086 ]
Multiple Linear Regression Quality of the Example
17 / 36
How Good Is This Regression Model?
◮ How accurately does the model predict film rating based on age and running time?
◮ Best way to determine this analytically is to calculate errors:
SSE = Σ ei² = yTy − bTXTy
18 / 36
Calculating the Errors
Year  Length  Rating  Estimated Rating    ei     ei²
1991   118     8.1          7.4         −0.71   0.51
1983   132     6.8          7.3          0.51   0.26
1976   119     7.0          7.5          0.45   0.21
1968   153     7.4          7.2         −0.20   0.04
1955    91     7.7          7.8          0.10   0.01
1947   118     7.5          7.6          0.11   0.01
1935   132     7.6          7.6         −0.05   0.00
1934   105     8.0          7.8         −0.21   0.04
19 / 36
Calculating the Errors, Continued
◮ SSE = 1.08
◮ SSY = Σ yi² = 452.91
◮ SS0 = n·ȳ² = 451.5
◮ SST = SSY − SS0 = 452.9 − 451.5 = 1.4
◮ SSR = SST − SSE = 0.33
◮ R² = SSR/SST = 0.33/1.41 = 0.23
◮ In other words, this regression stinks
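These sums of squares can be reproduced directly from the data; a NumPy sketch that refits the model and computes them:

```python
import numpy as np

year   = np.array([1991, 1983, 1976, 1968, 1955, 1947, 1935, 1934])
length = np.array([118, 132, 119, 153, 91, 118, 132, 105])
y      = np.array([8.1, 6.8, 7.0, 7.4, 7.7, 7.5, 7.6, 8.0])

X = np.column_stack([np.ones(len(y)), year, length])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

sse = np.sum((y - X @ b) ** 2)   # sum of squared errors
ssy = np.sum(y ** 2)             # sum of squares of y
ss0 = len(y) * y.mean() ** 2     # correction for the mean
sst = ssy - ss0                  # total variation
ssr = sst - sse                  # variation explained by the regression
r2  = ssr / sst
print(sse, sst, r2)  # roughly 1.08, 1.41, 0.23
```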
20 / 36
Why Does It Stink?
◮ Let’s look at properties of the regression parameters
se = √(SSE/(n − 3)) = √(1.08/5) = 0.46
◮ Now calculate standard deviations of the regression parameters (these are estimates only, since we're working with a sample)
◮ Estimated stdev of
  b0 is se·√c00 = 0.46·√1207.76 = 16.16
  b1 is se·√c11 = 0.46·√0.0003 = 0.0084
  b2 is se·√c22 = 0.46·√0.0004 = 0.0097
21 / 36
Calculating Confidence Intervals of Parameters
◮ We will use 90% level
◮ Confidence intervals for
  b0: 18.54 ± 2.015(16.16) = (−14.02, 51.10)
  b1: −0.0051 ± 2.015(0.0084) = (−0.022, 0.012)
  b2: −0.0086 ± 2.015(0.0097) = (−0.028, 0.011)
◮ None is significant at this level
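The interval arithmetic can be reproduced end to end; a NumPy sketch, hardcoding the t quantile t[0.95; 5] = 2.015 from the table (for the two-sided 90% interval):

```python
import numpy as np

year   = np.array([1991, 1983, 1976, 1968, 1955, 1947, 1935, 1934])
length = np.array([118, 132, 119, 153, 91, 118, 132, 105])
y      = np.array([8.1, 6.8, 7.0, 7.4, 7.7, 7.5, 7.6, 8.0])

X = np.column_stack([np.ones(len(y)), year, length])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

n, k = len(y), 2
sse = np.sum((y - X @ b) ** 2)
se = np.sqrt(sse / (n - k - 1))    # standard error of the regression
C = np.linalg.inv(X.T @ X)
sb = se * np.sqrt(np.diag(C))      # stdevs of b0, b1, b2

t = 2.015                          # t[0.95; 5] from the t table
lo, hi = b - t * sb, b + t * sb
for i in range(3):
    print(f"b{i}: ({lo[i]:.4f}, {hi[i]:.4f})")
# every interval contains 0, so no parameter is significant at 90%
```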
22 / 36
Analysis of Variance
◮ So, can we really say that none of the predictor variables are
significant?
◮ Not yet; predictors may be correlated
◮ F-tests can be used for this purpose
  ◮ E.g., to determine if the SSR is significantly higher than the SSE
◮ Equivalent to testing that y does not depend on any of the predictor variables
  ◮ Alternatively, that no bi is significantly nonzero
23 / 36
Running an F-Test
◮ Need to calculate SSR and SSE ◮ From those, calculate mean squares of regression (MSR) and
errors (MSE)
◮ MSR/MSE has an F distribution
◮ If MSR/MSE > Ftable, predictors explain significant fraction of response variation
◮ Note typos in book's table 15.3
  ◮ SSR has k degrees of freedom
  ◮ SST matches y − ȳ, not y − ŷ
24 / 36
F-Test for Our Example
◮ SSR = 0.33
◮ SSE = 1.08
◮ MSR = SSR/k = 0.33/2 = 0.16
◮ MSE = SSE/(n − k − 1) = 1.08/(8 − 2 − 1) = 0.22
◮ F-computed = MSR/MSE = 0.76
◮ F[90; 2, 5] = 3.78
◮ So it fails the F-test at 90% (miserably)
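The same fitted model gives the F statistic directly; a sketch (the critical value F[90; 2, 5] = 3.78 is hardcoded from the F table):

```python
import numpy as np

year   = np.array([1991, 1983, 1976, 1968, 1955, 1947, 1935, 1934])
length = np.array([118, 132, 119, 153, 91, 118, 132, 105])
y      = np.array([8.1, 6.8, 7.0, 7.4, 7.7, 7.5, 7.6, 8.0])

X = np.column_stack([np.ones(len(y)), year, length])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

n, k = len(y), 2
sse = np.sum((y - X @ b) ** 2)
sst = np.sum(y ** 2) - n * y.mean() ** 2
msr = (sst - sse) / k          # mean square of regression
mse = sse / (n - k - 1)        # mean square of errors
f_computed = msr / mse
print(f_computed)              # roughly 0.76, far below F[90; 2, 5] = 3.78
```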
25 / 36
Multicollinearity
◮ If two predictor variables are linearly dependent, they are
collinear
◮ Meaning they are related
◮ And thus the second variable does not improve the regression
  ◮ In fact, it can make it worse
◮ Typical symptom is inconsistent results from various significance tests
26 / 36
Finding Multicollinearity
◮ Must test correlation between predictor variables
◮ If it's high, eliminate one and repeat the regression without it
◮ If significance of regression improves, it's probably due to collinearity between the variables
27 / 36
Is Multicollinearity a Problem in Our Example?
◮ Probably not, since significance tests are consistent
◮ But let's check, anyway
◮ Calculate correlation of age and length
  ◮ After tedious calculation, 0.25
  ◮ Not especially correlated
◮ Important point: adding a predictor variable does not always improve a regression
◮ See example on p. 253 of book
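The "tedious calculation" is a one-liner with NumPy's correlation function:

```python
import numpy as np

year   = np.array([1991, 1983, 1976, 1968, 1955, 1947, 1935, 1934])
length = np.array([118, 132, 119, 153, 91, 118, 132, 105])

# Pearson correlation between the two predictor variables
r = np.corrcoef(year, length)[0, 1]
print(round(r, 2))  # 0.25
```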
28 / 36
Why Didn’t Regression Work Well Here?
◮ Check scatter plots
  ◮ Rating vs. year
  ◮ Rating vs. length
◮ Regardless of how good or bad regressions look, always check the scatter plots
29 / 36
Rating vs. Length
[Scatter plot: Rating (2–8) vs. Length (80–160)]
30 / 36
Rating vs. Year
[Scatter plot: Rating (2–8) vs. Year (1940–1980)]
Categorical Models
31 / 36
Regression With Categorical Predictors
◮ Regression methods discussed so far assume numerical variables
◮ What if some of your variables are categorical in nature?
◮ If all are categorical, use techniques discussed later in the course
◮ Levels: number of values a category can take
32 / 36
Handling Categorical Predictors
◮ If only two levels, define xi as follows:
  ◮ xi = 0 for first value
  ◮ xi = 1 for second value
◮ (This definition is missing from book in section 15.2)
◮ Can use +1 and −1 as values, instead
◮ Need k − 1 predictor variables for k levels
  ◮ To avoid implying order in categories
33 / 36
Categorical Variables Example
Which is a better predictor of a high rating in the movie database?
◮ Winning an Oscar?
◮ Winning the Golden Palm at Cannes?
◮ Winning the New York Critics Circle award?
34 / 36
Choosing Variables
◮ Categories are not mutually exclusive
◮ x1 = 1 if Oscar, 0 otherwise
◮ x2 = 1 if Golden Palm, 0 otherwise
◮ x3 = 1 if Critics Circle Award, 0 otherwise
◮ y = b0 + b1x1 + b2x2 + b3x3
35 / 36
A Few Data Points
Title                  Rating  Oscar  Palm  NYC
Gentleman's Agreement    7.5     X           X
Mutiny on the Bounty     7.6     X
Marty                    7.4     X      X    X
If                       7.8            X
La Dolce Vita            8.1            X
Kagemusha                8.2            X
The Defiant Ones         7.5                 X
Reds                     6.6                 X
High Noon                8.1                 X
36 / 36
And Regression Says. . .
◮ ŷ = 7.8 − 0.1x1 + 0.2x2 − 0.4x3
◮ How good is that?
  ◮ R² explains 34% of variation
  ◮ Better than age and length
  ◮ But still no great shakes
◮ Are regression parameters significant at 90% level?
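This categorical fit can be reproduced with the same least-squares machinery; a NumPy sketch, with the 0/1 award indicators read off the data table (Oscar, Golden Palm, Critics Circle, in that order; e.g., Marty has all three):

```python
import numpy as np

# (title, rating, x1=Oscar, x2=Golden Palm, x3=Critics Circle)
films = [
    ("Gentleman's Agreement", 7.5, 1, 0, 1),
    ("Mutiny on the Bounty",  7.6, 1, 0, 0),
    ("Marty",                 7.4, 1, 1, 1),
    ("If",                    7.8, 0, 1, 0),
    ("La Dolce Vita",         8.1, 0, 1, 0),
    ("Kagemusha",             8.2, 0, 1, 0),
    ("The Defiant Ones",      7.5, 0, 0, 1),
    ("Reds",                  6.6, 0, 0, 1),
    ("High Noon",             8.1, 0, 0, 1),
]
y = np.array([f[1] for f in films])
X = np.column_stack([np.ones(len(films)),
                     [f[2] for f in films],
                     [f[3] for f in films],
                     [f[4] for f in films]])

b, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = np.sum((y - X @ b) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r2 = 1 - sse / sst
print(np.round(b, 1), round(r2, 2))  # b ≈ [7.8, -0.1, 0.2, -0.4], R² ≈ 0.34
```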