Unit 6: Introduction to linear regression

1. Introduction to regression
Sta 101 - Fall 2017
Duke University, Department of Statistical Science
Dr. Mukherjee
Slides posted at http://www2.stat.duke.edu/courses/Fall17/sta101.002/

Announcements
▶ MT 2 scores posted in Sakai!
▶ Start working on your Final projects. Due date: Sunday Dec 3, 11:55 PM.
– Lab: 10:05 AM, 11:45 AM and 1:25 PM will present on Dec 4 during their Lab session. No lab on Monday for 8:30 AM and 3:05 PM.
– Lab: 8:30 AM and 3:05 PM will present on Dec 5 during our lecture at Social Science 139. Your lab TAs will be here! No lecture on Tuesday for 10:05 AM, 11:45 AM and 1:25 PM.
▶ PS 6 due date Nov 17 at 11:55 PM.
▶ PA 6 due date Nov 19 at 11:55 PM.

Modeling numerical variables
▶ So far we have worked with single numerical and categorical variables, and explored relationships between numerical and categorical, and two categorical variables.
▶ In this unit we will learn to quantify the relationship between two numerical variables, as well as modeling numerical response variables using a numerical or categorical explanatory variable.
▶ In the next unit we'll learn to model numerical variables using many explanatory variables at once.

Guessing the correlation
Clicker question
Which of the following is the best guess for the correlation between annual murders per million and percentage living in poverty?
(a) -1.52
(b) -0.63
(c) -0.12
(d) 0.02
(e) 0.84
[Figure: scatterplot of annual murders per million vs. % in poverty]
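As a rough illustration of what a correlation coefficient is, the sketch below computes one in R with `cor()`. The two vectors are toy values invented for this example (they are not the course's murder data), and the variable names are only placeholders. It also checks two general facts: a correlation always lies between -1 and 1 (so a guess such as -1.52 cannot be right), and it does not change when the variables are swapped or rescaled.

```r
# Toy data, invented for illustration only (NOT the course's murder dataset).
perc_pov <- c(14, 16, 18, 20, 22, 24, 26)
murders  <- c(6, 11, 14, 22, 25, 31, 38)

# Pearson correlation coefficient between the two numerical variables
r <- cor(perc_pov, murders)
print(r)

# A correlation coefficient always lies between -1 and 1
stopifnot(r >= -1, r <= 1)

# Swapping x and y, or changing the units of either variable,
# leaves the correlation unchanged
r_swapped  <- cor(murders, perc_pov)
r_rescaled <- cor(perc_pov / 100, murders * 1000)
print(c(r_swapped, r_rescaled))
```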

Guessing the correlation
Clicker question
Which of the following is the best guess for the correlation between annual murders per million and population size?
(a) -0.97
(b) -0.61
(c) -0.06
(d) 0.55
(e) 0.97
[Figure: scatterplot of annual murders per million vs. population]

Assessing the correlation
Clicker question
Which of the following has the strongest correlation, i.e. correlation coefficient closest to +1 or -1?
[Figure: four scatterplots labeled (a), (b), (c), (d)]

Spurious correlations
Remember: correlation does not always imply causation!
http://www.tylervigen.com/

(2) Least squares line minimizes squared residuals
▶ Residuals are the leftovers from the model fit, calculated as the difference between the observed and predicted y: $e_i = y_i - \hat{y}_i$
▶ The least squares line minimizes squared residuals:
– Population data: $\hat{y} = \beta_0 + \beta_1 x$
– Sample data: $\hat{y} = b_0 + b_1 x$
[Figure: scatterplot of annual murders per million vs. % in poverty]
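The definition of a residual and the "minimizes squared residuals" claim can be checked numerically. The sketch below is only an illustration on toy data invented for this example. The helper `ssr()` is a hypothetical name, not part of the course materials. It fits a line with `lm()`, computes residuals as observed minus predicted, and compares the sum of squared residuals with that of two slightly different lines.

```r
# Toy data, invented for illustration only.
x <- c(14, 16, 18, 20, 22, 24, 26)
y <- c(6, 11, 14, 22, 25, 31, 38)

# Least squares fit; extract intercept b0 and slope b1
fit <- lm(y ~ x)
b0 <- unname(coef(fit)[1])
b1 <- unname(coef(fit)[2])

# Residuals: observed y minus predicted y
y_hat <- b0 + b1 * x
e <- y - y_hat

# Hypothetical helper: sum of squared residuals for any candidate line
ssr <- function(intercept, slope) sum((y - (intercept + slope * x))^2)

ssr_ls      <- ssr(b0, b1)        # least squares line
ssr_shifted <- ssr(b0 + 1, b1)    # same slope, line moved up by 1
ssr_tilted  <- ssr(b0, b1 + 0.1)  # same intercept, slightly steeper

# The least squares line has the smallest value of the three
print(c(least_squares = ssr_ls, shifted = ssr_shifted, tilted = ssr_tilted))
```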

(3) Interpreting the least squares line
▶ Slope: For each unit increase in x, y is expected to be higher/lower on average by the slope.
$b_1 = \frac{s_y}{s_x} R$
▶ Intercept: When x = 0, y is expected to equal the intercept.
$b_0 = \bar{y} - b_1 \bar{x}$
– The calculation of the intercept uses the fact that a regression line always passes through $(\bar{x}, \bar{y})$.

Why does the regression line always pass through $(\bar{x}, \bar{y})$?
▶ If there is no relationship between x and y ($b_1 = 0$), the best guess $\hat{y}$ for any value of x is $\bar{y}$.
▶ Even when there is a relationship between x and y ($b_1 \neq 0$), the best guess $\hat{y}$ when $x = \bar{x}$ is still $\bar{y}$.
[Figure: three scatterplots, each marking the point $(\bar{x}, \bar{y})$ on the fitted line]

Clicker question
What is the interpretation of the slope?
(a) Each additional percentage in those living in poverty increases number of annual murders per million by 2.56.
(b) For each percentage increase in those living in poverty, the number of annual murders per million is expected to be higher by 2.56 on average.
(c) For each percentage increase in those living in poverty, the number of annual murders per million is expected to be lower by 29.91 on average.
(d) For each percentage increase in annual murders per million, the percentage of those living in poverty is expected to be higher by 2.56 on average.

Application exercise: 6.1 Linear model
See course website for details
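The slope and intercept formulas from the "Interpreting the least squares line" slide can be verified against `lm()`. This is a minimal sketch on toy data invented for the example. It computes the slope as the ratio of standard deviations times the correlation R, computes the intercept from the means, and then checks that the fitted line passes through the point of means.

```r
# Toy data, invented for illustration only.
x <- c(14, 16, 18, 20, 22, 24, 26)
y <- c(6, 11, 14, 22, 25, 31, 38)

# Slope from the formula b1 = (s_y / s_x) * R
R  <- cor(x, y)
b1 <- sd(y) / sd(x) * R

# Intercept from the formula b0 = y_bar - b1 * x_bar
b0 <- mean(y) - b1 * mean(x)

# Compare with the coefficients estimated by lm()
fit <- lm(y ~ x)
print(c(b0 = b0, b1 = b1))
print(coef(fit))

# The line evaluated at x_bar returns y_bar
y_at_xbar <- b0 + b1 * mean(x)
print(c(y_at_xbar = y_at_xbar, y_bar = mean(y)))
```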

A note about the intercept
Sometimes the intercept might be an extrapolation: useful for adjusting the height of the line, but meaningless in the context of the data.
[Figure: scatterplot of annual murders per million vs. % in poverty, with the fitted line extended back to 0% poverty]

Clicker question
Suppose you want to predict annual murder count (per million) for a series of districts that were not included in the dataset. For which of the following districts would you be most comfortable with your prediction?
A district where % in poverty =
(a) 5%
(b) 15%
(c) 20%
(d) 26%
(e) 40%
[Figure: scatterplot of annual murders per million vs. % in poverty]

Calculating predicted values
By hand:
$\widehat{\text{murder}} = -29.91 + 2.56 \times \text{poverty}$
The predicted number of murders per million per year for a county with 20% poverty rate is:
$\widehat{\text{murder}} = -29.91 + 2.56 \times 20 = 21.29$
In R:

    # load data
    murder <- read.csv("https://stat.duke.edu/~mc301/data/murder.csv")
    # fit model
    m_mur_pov <- lm(annual_murders_per_mil ~ perc_pov, data = murder)
    # create new data
    newdata <- data.frame(perc_pov = 20)
    # predict
    predict(m_mur_pov, newdata)

           1
    21.28663

Summary of main ideas
1. Correlation coefficient describes the strength and direction of the linear association between two numerical variables
2. Least squares line minimizes squared residuals
3. Interpreting the least squares line
4. Predict, but don't extrapolate
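The by-hand calculation and the "predict, but don't extrapolate" idea can be sketched together without downloading the data. The coefficients below are the rounded values reported on the slides. The function `predict_murders()` and the range limits `x_min` and `x_max` are hypothetical, with the limits read roughly off the plot axis (14 to 26) rather than taken from the dataset.

```r
# Rounded coefficients as reported on the slides
b0 <- -29.91
b1 <- 2.56

# Assumption: approximate observed range of % in poverty, read off the
# plot axis; the exact range in the dataset may differ.
x_min <- 14
x_max <- 26

# Hypothetical helper: predict by hand, warning when asked to extrapolate
predict_murders <- function(perc_pov) {
  outside <- perc_pov < x_min | perc_pov > x_max
  if (any(outside)) {
    warning("extrapolating beyond the observed range of % in poverty")
  }
  b0 + b1 * perc_pov
}

# Inside the observed range: matches the by-hand value of 21.29
p20 <- predict_murders(20)
print(p20)

# Outside the range the formula still returns numbers, but they are not
# trustworthy; at 5% poverty the prediction is even negative.
# (Warning suppressed here only to keep the output tidy.)
p_extrap <- suppressWarnings(predict_murders(c(5, 40)))
print(p_extrap)
```

The value from `predict()` on the fitted model (21.28663) differs slightly from 21.29 because the slide's coefficients are rounded to two decimals.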
