

  1. Lecture 5: Multiple Linear Regression
CS109A Introduction to Data Science
Pavlos Protopapas and Kevin Rader

  2. Lecture Outline
Simple Regression:
• Predictor variables
• Standard Errors
• Evaluating Significance of Predictors
• Hypothesis Testing
• How well do we know f̂?
• How well do we know ŷ?
Multiple Linear Regression:
• Categorical Predictors
• Collinearity
• Hypothesis Testing
• Interaction Terms
Polynomial Regression

  3. Standard Errors
The variances of β̂₀ and β̂₁ are also called their standard errors, SE(β̂₀) and SE(β̂₁).
If our data is drawn from a larger set of observations, we can empirically estimate the standard errors SE(β̂₀) and SE(β̂₁) through bootstrapping.
If we know the variance σ² of the noise ε, we can compute SE(β̂₀) and SE(β̂₁) analytically, using the formulae below:
SE(β̂₀) = σ √( 1/n + x̄² / Σᵢ(xᵢ − x̄)² )
SE(β̂₁) = σ / √( Σᵢ(xᵢ − x̄)² )
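The formulas above translate almost directly into code. Below is a minimal NumPy sketch (not from the slides; the array x and the known noise level sigma are assumed inputs):

```python
import numpy as np

def analytic_standard_errors(x, sigma):
    """Analytic SE(beta0_hat) and SE(beta1_hat) for simple linear regression,
    assuming the noise standard deviation sigma is known."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)                    # sum_i (x_i - x_bar)^2
    se_beta0 = sigma * np.sqrt(1.0 / n + x.mean() ** 2 / sxx)
    se_beta1 = sigma / np.sqrt(sxx)
    return se_beta0, se_beta1
```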

  4. Standard Errors
SE(β̂₀) = σ √( 1/n + x̄² / Σᵢ(xᵢ − x̄)² ),  SE(β̂₁) = σ / √( Σᵢ(xᵢ − x̄)² )
• More data: n ↑ and Σᵢ(xᵢ − x̄)² ↑ ⟹ SE ↓
• Larger coverage: var(x) or Σᵢ(xᵢ − x̄)² ↑ ⟹ SE ↓
• Better data: σ² ↓ ⟹ SE ↓
In practice, we do not know the theoretical value of σ, since we do not know the exact distribution of the noise ε.
Remember: yᵢ = f(xᵢ) + εᵢ ⟹ εᵢ = yᵢ − f(xᵢ)

  5. Standard Errors
In practice, we do not know the theoretical value of σ, since we do not know the exact distribution of the noise ε. However, if we make the following assumptions:
• the errors εᵢ = yᵢ − ŷᵢ and εⱼ = yⱼ − ŷⱼ are uncorrelated for i ≠ j,
• each εᵢ is normally distributed with mean 0 and variance σ²,
then we can empirically estimate σ from the data and our regression line:
σ̂ = √( Σᵢ (yᵢ − ŷᵢ)² / (n − 2) ) = √( n · MSE / (n − 2) ) = √( Σᵢ (f̂(xᵢ) − yᵢ)² / (n − 2) )
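As a sketch of this estimate (the function and variable names are illustrative, not from the lecture code), σ̂ can be computed from the fitted residuals with n − 2 degrees of freedom:

```python
import numpy as np

def estimate_sigma(x, y):
    """Estimate sigma from the residuals of a fitted simple regression,
    dividing by n - 2 (one degree of freedom per estimated coefficient)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0 = y.mean() - beta1 * x.mean()
    residuals = y - (beta0 + beta1 * x)
    return np.sqrt(np.sum(residuals ** 2) / (len(y) - 2))
```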

  6. Standard Errors
SE(β̂₀) = σ √( 1/n + x̄² / Σᵢ(xᵢ − x̄)² ),  SE(β̂₁) = σ / √( Σᵢ(xᵢ − x̄)² )
• More data: n ↑ and Σᵢ(xᵢ − x̄)² ↑ ⟹ SE ↓
• Larger coverage: var(x) or Σᵢ(xᵢ − x̄)² ↑ ⟹ SE ↓
• Better data: σ² ↓ ⟹ SE ↓
• Better model: (f̂(xᵢ) − yᵢ) ↓ ⟹ σ̂ ↓ ⟹ SE ↓, where σ̂ = √( Σᵢ (f̂(xᵢ) − yᵢ)² / (n − 2) )
Question: What happens to β̂₀ and β̂₁ under these scenarios?

  7. Standard Errors
The following results are for the coefficient for TV advertising:

Method              SE(β̂₁)
Analytic Formula    0.0061
Bootstrap           0.0061

The coefficient for TV advertising, but restricting the coverage of x:

Method              SE(β̂₁)
Analytic Formula    0.0068
Bootstrap           0.0068

This makes no sense?

The coefficient for TV advertising, but with extra noise added:

Method              SE(β̂₁)
Analytic Formula    0.0028
Bootstrap           0.0023
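The bootstrap column in these tables can be reproduced along the following lines. This is a sketch, assuming x and y hold the TV and sales columns of the Advertising data:

```python
import numpy as np

def bootstrap_se_beta1(x, y, n_boot=1000, seed=0):
    """Bootstrap estimate of SE(beta1_hat): refit the slope on resampled
    (x, y) pairs and take the standard deviation of the estimates."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(x), size=len(x))       # resample rows with replacement
        xb, yb = x[idx], y[idx]
        slopes[b] = np.sum((xb - xb.mean()) * (yb - yb.mean())) / np.sum((xb - xb.mean()) ** 2)
    return slopes.std(ddof=1)
```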

  8. Importance of predictors
We have discussed finding the importance of predictors by determining the cumulative distribution of β̂₁ from −∞ to 0, i.e., how much of its distribution lies on the other side of zero.
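One way to turn this into a single number is sketched below (illustrative code, assuming boot_slopes holds bootstrap estimates of β̂₁ such as those produced above):

```python
import numpy as np

def tail_beyond_zero(boot_slopes):
    """Empirical tail probability: the fraction of the bootstrap distribution
    of beta1_hat that falls on the far side of zero."""
    boot_slopes = np.asarray(boot_slopes, float)
    return min(np.mean(boot_slopes <= 0), np.mean(boot_slopes >= 0))
```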

  9. Hypothesis Testing
Hypothesis testing is a formal process through which we evaluate the validity of a statistical hypothesis by considering evidence for or against the hypothesis, gathered by random sampling of the data.

  10. Random sampling of the data
Shuffle the values of the predictor variable (TV), keeping the response (sales) fixed, to produce shuffled copies of the data in which any real association between TV and sales is destroyed.
[Table: the original TV column, several shuffled TV columns, and the unchanged sales column]
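A sketch of the shuffling step (illustrative names; x is the TV column and y is sales). The slopes refit on shuffled data give a reference distribution for "no relationship":

```python
import numpy as np

def shuffled_slopes(x, y, n_shuffles=1000, seed=0):
    """Refit the slope after randomly permuting the predictor, keeping the
    response fixed, to see what slopes arise purely by chance."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    slopes = np.empty(n_shuffles)
    for s in range(n_shuffles):
        xp = rng.permutation(x)                          # shuffle TV, keep sales as-is
        slopes[s] = np.sum((xp - xp.mean()) * (y - y.mean())) / np.sum((xp - xp.mean()) ** 2)
    return slopes
```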

  11. [Figure-only slide]

  12. Importance of predictors
Translate this to Kevin’s language: let’s look at the distance of the estimated value of the coefficient from zero, in units of SE(β̂₁):
t = (β̂₁ − 0) / SE(β̂₁)

  13. Importance of predictors
We also evaluate how often a particular value of t can occur by accident, using the shuffled data. We expect that t will have a t-distribution with n − 2 degrees of freedom, so it is easy to compute the probability of observing any value equal to |t| or larger, assuming β̂₁ = 0. We call this probability the p-value.
A small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance.
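A minimal sketch of the p-value computation, using the t-distribution with n − 2 degrees of freedom (SciPy is assumed to be available):

```python
import numpy as np
from scipy import stats

def t_and_p_value(beta1_hat, se_beta1, n):
    """t-statistic and two-sided p-value for H0: beta1 = 0."""
    t = (beta1_hat - 0.0) / se_beta1
    p_value = 2 * stats.t.sf(np.abs(t), df=n - 2)        # P(|T| >= |t|) under H0
    return t, p_value
```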

  14. Hypothesis Testing
Hypothesis testing is a formal process through which we evaluate the validity of a statistical hypothesis by considering evidence for or against the hypothesis, gathered by random sampling of the data.
1. State the hypotheses, typically a null hypothesis H₀ and an alternative hypothesis H₁ that is the negation of the former.
2. Choose a type of analysis, i.e., how to use sample data to evaluate the null hypothesis. Typically this involves choosing a single test statistic.
3. Compute the test statistic.
4. Use the value of the test statistic to either reject or not reject the null hypothesis.

  15. Hypothesis testing
1. State the hypotheses:
Null hypothesis H₀: there is no relation between X and Y.
Alternative hypothesis Hₐ: there is some relation between X and Y.
2. Choose a test statistic:
To test the null hypothesis, we need to determine whether our estimate β̂₁ is sufficiently far from zero that we can be confident that β₁ is non-zero. We use the following test statistic:
t = (β̂₁ − 0) / SE(β̂₁)

  16. Hypothesis testing
3. Compute the test statistic:
Using the estimated β̂₁ and SE(β̂₁), we calculate the t-statistic.
4. Reject or do not reject the hypothesis:
If there is really no relationship between X and Y, then we expect that t will have a t-distribution with n − 2 degrees of freedom. It is easy to compute the probability of observing any value equal to |t| or larger, assuming β̂₁ = 0. We call this probability the p-value.
A small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance.
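In practice, a fitted OLS model reports all of these quantities directly. A sketch using statsmodels (the file name Advertising.csv and the column names TV and sales are assumptions):

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("Advertising.csv")        # assumed file and column names
X = sm.add_constant(df["TV"])              # add the intercept column
model = sm.OLS(df["sales"], X).fit()

print(model.bse)       # standard errors SE(beta0_hat), SE(beta1_hat)
print(model.tvalues)   # t-statistics
print(model.pvalues)   # two-sided p-values for H0: beta_j = 0
```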

  17. Hypothesis testing
P-values for all three predictors, done independently:

Method              SE(β̂₀)    SE(β̂₁)
Analytic Formula    0.353      0.0023
Bootstrap           0.328      0.0028

  18. Things to Consider
• Comparison of Two Models: How do we choose between two different models?
• Model Fitness: How well does the model perform at prediction?
• Evaluating Significance of Predictors: Does the outcome depend on the predictors?
• How well do we know f̂? The confidence intervals of our f̂.

  19. How well do we know f̂?
Our confidence in f̂ is directly connected with our confidence in the β̂'s. So for each set of β̂'s we can determine the corresponding model.

  20. How well do we know f̂?
Here we show two different sets of models, given the coefficients fitted for a given subsample.

  21. How well do we know f̂?
There is one such regression line for every imaginable sub-sample.

  22. How well do we know f̂?
Below we show the regression lines for a thousand such sub-samples. For a given x, we examine the distribution of f̂(x) and determine its mean and standard deviation.
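A sketch of this procedure (illustrative names; x and y are the predictor and response, x_grid the points at which f̂ is evaluated):

```python
import numpy as np

def bootstrap_fhat_band(x, y, x_grid, n_boot=1000, seed=0):
    """Fit a regression line on each bootstrap sub-sample, evaluate it on
    x_grid, and return the mean and standard deviation of f_hat(x)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    x_grid = np.asarray(x_grid, float)
    preds = np.empty((n_boot, len(x_grid)))
    for b in range(n_boot):
        idx = rng.integers(0, len(x), size=len(x))
        xb, yb = x[idx], y[idx]
        beta1 = np.sum((xb - xb.mean()) * (yb - yb.mean())) / np.sum((xb - xb.mean()) ** 2)
        beta0 = yb.mean() - beta1 * xb.mean()
        preds[b] = beta0 + beta1 * x_grid
    return preds.mean(axis=0), preds.std(axis=0)
```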
