Unit 6: Introduction to linear regression 1. Introduction to - PowerPoint PPT Presentation

Announcements ▶ MT 2 grades have been posted today! Unit 6: Introduction to linear regression 1. Introduction to regression The CDC monitors the physical activity level of Americans. A recent survey on a random sample of 23,129 Americans yielded a 95% confidence interval of 61.1% to 62.9% for the proportion of Americans who walk for at least 10 minutes per day. Which is the most accurate statement? STA 104 - Summer 2017 A. 95% of random samples of 23,129 Americans will yield confidence intervals between 61.1% and 62.9%. Duke University, Department of Statistical Science B. This interval does not support the claim that less than 50% of Americans walk at least 10 minutes per day. C. We are 95% confident that each American walks for at least 10 minutes per day on 61.1% to 62.9% of the days. D. Between 61.1% and 62.9% of random samples of 23,129 Americans are expected to yield confidence intervals that contain the true proportion of Americans who walk for at least 10 minutes per day. E. 95% of the time the true proportion of Americans who walk for at least 10 Prof. van den Boom Slides posted at minutes per day is between 61.1% to 62.9%. http://www2.stat.duke.edu/courses/Summer17/sta104.001-1/ For post-hoc tests of the results of an ANOVA we use a corrected alpha or significance level. If we want an overall type 1 error rate of 5%, what 1 should the alpha be for the individual pairwise tests if the number of groups equals 6? Choose the closest option. Modeling numerical variables Guessing the correlation A. 0.16667 B. 0.00833 C. 0.00333 D. 0.05 Clicker question E. 0.3 Which of the following is the best guess for the correlation between ▶ So far we have worked with single numerical and categorical annual murders per million and percentage living in poverty? variables, and explored relationships between numerical and categorical, and two categorical variables. 40 ● ▶ In this unit we will learn to quantify the relationship between two (a) -1.52 ● 35 ● numerical variables, as well as modeling numerical response annual murders per million 30 (b) -0.63 variables using a numerical or categorical explanatory variable. ● ● 25 ● ● ● (c) -0.12 ▶ In the next unit we’ll learn to model numerical variables using ● ● 20 ● many explanatory variables at once. (d) 0.02 ● 15 ● ● ● ● ● 10 ● (e) 0.84 ● ● 5 14 16 18 20 22 24 26 % in poverty 2 3

Guessing the correlation Assessing the correlation Clicker question Clicker question Which of the following is has the strongest correlation, i.e. Which of the following is the best guess for the correlation between correlation coefficient closest to +1 or -1? annual murders per million and population size? ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 40 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● (a) -0.97 ● ● ● ● ● 35 ● ● ● ● ● annual murders per million ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 30 ● ● ● ● ● ● ● ● (b) -0.61 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 25 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● (c) -0.06 (a) (b) ● ● 20 ● (d) 0.55 ● 15 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 10 ● ● ● ● ● ● ● ● ● ● ● ● (e) 0.97 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 5 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 2e+06 4e+06 6e+06 8e+06 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● population ● ● ● (c) (d) 4 5 Play the game! Spurious correlations Remember: correlation does not always imply causation! http://www.tylervigen.com/ To sharpen your correlation guessing abilities: http://guessthecorrelation.com/ 6 7

(2) Least squares line minimizes squared residuals (3) Interpreting the last squares line ▶ Residuals are the leftovers from the model fit, and calculated as the difference between the observed and predicted y : ▶ Slope: For each unit increase in x , y is expected to be e i = y i − ˆ y i ▶ The least squares line minimizes squared residuals: higher/lower on average by the slope. – Population data: ˆ y = β 0 + β 1 x b 1 = s y – Sample data: ˆ y = b 0 + b 1 x R s x 40 ● ▶ Intercept: When x = 0 , y is expected to equal the intercept. ● 35 ● annual murders per million 30 b 0 = ¯ y − b 1 ¯ x ● ● 25 ● ● ● ● ● 20 ● ● – The calculation of the intercept uses the fact the a regression line 15 ● ● ● ● always passes through (¯ x , ¯ y ) . ● 10 ● ● ● 5 14 16 18 20 22 24 26 % in poverty 8 9 Why does the regression line always pass through (¯ y ) ? x , ¯ ▶ If there is no relationship between x and y ( b 1 = 0 ), the best guess for ˆ y for any value of x is ¯ y . ▶ Even when there is a relationship between x and y ( b 1 ̸ = 0 ), the best guess for ˆ y when x = ¯ x is still ¯ y . Application exercise: 6.1 Linear model See course website for details 10 1.5 4 8 0.5 6 2 (x, y) (x, y) ● ● ● ● y2 ● y3 y ● ● ● 4 (x, y) −0.5 ● 0 2 ● 0 −2 −1.5 −2 −1.0 0.0 0.5 1.0 1.5 2.0 −1.0 0.0 0.5 1.0 1.5 2.0 −1.0 0.0 0.5 1.0 1.5 2.0 x x x 10 11

murder <- read.csv("https://stat.duke.edu/~mc301/data/murder.csv") 21.28663 # fit model m_mur_pov <- lm(annual_murders_per_mil ~ perc_pov, data = murder) # create new data newdata <- data.frame(perc_pov = 20) # predict predict(m_mur_pov, newdata) 1 # load data Clicker question Clicker question Suppose you want to predict annual murder count (per million) for a What is the interpretation of the slope? series of districts that were not included in the dataset. For which of the following districts would you be most comfortable with your (a) Each additional percentage in those living in poverty increases prediction? number of annual murders per million by 2.56. (b) For each percentage increase in those living in poverty, the A district where % in number of annual murders per million is expected to be higher 40 ● poverty = by 2.56 on average. ● ● 35 annual murders per million 30 (a) 5% (c) For each percentage increase in those living in poverty, the ● ● ● 25 ● number of annual murders per million is expected to be lower by ● (b) 15% ● ● 20 29.91 on average. (c) 20% ● 15 ● ● ● (d) For each percentage increase annual murders per million, the ● ● (d) 26% ● 10 ● percentage of those living in poverty is expected to be higher by ● (e) 40% ● 5 2.56 on average. 14 16 18 20 22 24 26 % in poverty 12 13 A note about the intercept Calculating predicted values By hand: � murder = − 29 . 91 + 2 . 56 poverty The predicted number of murders per million per year for a county Sometimes the intercept might be an extrapolation: useful for with 20% poverty rate is: adjusting the height of the line, but meaningless in the context of the data. � murder = − 29 . 91 + 2 . 56 × 20 = 21 . 29 In R: annual murders per million 80 40 0 ● −40 0 10 20 30 40 50 60 % in poverty 14 15

Summary of main ideas 1. Correlation coefficient describes the strength and direction of the linear association between two numerical variables 2. Least squares line minimizes squared residuals 3. Interpreting the least squares line 4. Predict, but don’t extrapolate 16

Unit 6: Introduction to linear regression 1. Introduction to - PowerPoint PPT Presentation

Announcements MT 2 grades have been posted today! Unit 6: Introduction to linear regression 1. Introduction to regression The CDC monitors the physical activity level of Americans. A recent survey on a random sample of 23,129 Americans

Regression 1: Linear Regression Marco Baroni Practical Statistics in R Outline Classic linear

Linear regression Linear regression is a simple approach to supervised learning. It assumes

Linear regression Linear regression is a simple approach to supervised learning. It assumes

Regression Methods 1. Linear Regression and Logistic Regression: definitions, and a common

Linear regression How to measure the accuracy of linear regression models Linear Regression

Linear Models for Regression Greg Mori - CMPT 419/726 Bishop PRML Ch. 3 Regression Linear Basis

STAT 213 Simple Linear Regression I Colin Reimer Dawson Oberlin College 5 October 2016 Outline

Linear regression Linear regression is a simple approach to supervised learning. It assumes

Logistic regression CS 446 1. Linear classifiers Linear regression Last two lectures, we studied

LINEAR REGRESSION LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW 25 SIMPLE LINEAR

Notes on the Non-linear Regression The model Non-linear regression models, like ordinary linear

CS70: Lecture 35. Regression (contd.): Linear and Beyond CS70: Lecture 35. Regression (contd.):

Chapter 7 Linear Regression 04/05/2016 Huamei Dong 1. Review Least square regression line 2.

Technical conditions for linear regression Jo Hardin Professor, Pomona College DataCamp

CS 7616 Pattern Recognition Linear, Linear, Linear Aaron Bobick School of Interactive

Lecture 8: Regression Trees Instructor: Saravanan Thirumuruganathan CSE 5334 Saravanan

NSF Center for GRid-connected Advanced Power Electronic Systems (GRAPES) GR-17-14 Decentralized

estimation for self-adjoint eigenvalue problems Liu Xuefeng Research Institue for Science and

An improved Branch-Cut-and-Price Algorithm for Heterogeneous Vehicle Routing Problems Artur Pessoa

Hairy Graphs and the Homology of Out ( F n ) Jim Conant Univ. of Tennessee joint w/ Martin

and Physical (In In) Activity The role of the FCP in Physical Activity Promotion A Public

Stanfield Elementary School Not Your Ordinary Elementary School Where Our Why, Is You!

CPCB AQI definitions India 4 >501 Severe-Plus All are in emergency levels of hazardous

Ubiquitous and Mobile Computing CS 528: My Smartphone Knows I am Hungry Hoang Ngo Computer Science

Sambuz

Useful Links

Newsletter

Mail Us

Unit 6: Introduction to linear regression 1. Introduction to - PowerPoint PPT Presentation

Announcements MT 2 grades have been posted today! Unit 6: Introduction to linear regression 1. Introduction to regression The CDC monitors the physical activity level of Americans. A recent survey on a random sample of 23,129 Americans

Regression 1: Linear Regression Marco Baroni Practical Statistics in R Outline Classic linear

Linear regression Linear regression is a simple approach to supervised learning. It assumes

Linear regression Linear regression is a simple approach to supervised learning. It assumes

Regression Methods 1. Linear Regression and Logistic Regression: definitions, and a common

Linear regression How to measure the accuracy of linear regression models Linear Regression

Linear Models for Regression Greg Mori - CMPT 419/726 Bishop PRML Ch. 3 Regression Linear Basis

STAT 213 Simple Linear Regression I Colin Reimer Dawson Oberlin College 5 October 2016 Outline

Linear regression Linear regression is a simple approach to supervised learning. It assumes

Logistic regression CS 446 1. Linear classifiers Linear regression Last two lectures, we studied

LINEAR REGRESSION LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW 25 SIMPLE LINEAR

Notes on the Non-linear Regression The model Non-linear regression models, like ordinary linear

CS70: Lecture 35. Regression (contd.): Linear and Beyond CS70: Lecture 35. Regression (contd.):

Chapter 7 Linear Regression 04/05/2016 Huamei Dong 1. Review Least square regression line 2.

Technical conditions for linear regression Jo Hardin Professor, Pomona College DataCamp

CS 7616 Pattern Recognition Linear, Linear, Linear Aaron Bobick School of Interactive

Lecture 8: Regression Trees Instructor: Saravanan Thirumuruganathan CSE 5334 Saravanan

NSF Center for GRid-connected Advanced Power Electronic Systems (GRAPES) GR-17-14 Decentralized

estimation for self-adjoint eigenvalue problems Liu Xuefeng Research Institue for Science and

An improved Branch-Cut-and-Price Algorithm for Heterogeneous Vehicle Routing Problems Artur Pessoa

Hairy Graphs and the Homology of Out ( F n ) Jim Conant Univ. of Tennessee joint w/ Martin

and Physical (In In) Activity The role of the FCP in Physical Activity Promotion A Public

Stanfield Elementary School Not Your Ordinary Elementary School Where Our Why, Is You!

CPCB AQI definitions India 4 &gt;501 Severe-Plus All are in emergency levels of hazardous

Ubiquitous and Mobile Computing CS 528: My Smartphone Knows I am Hungry Hoang Ngo Computer Science

Sambuz

Useful Links

Newsletter

Mail Us

CPCB AQI definitions India 4 >501 Severe-Plus All are in emergency levels of hazardous