Simple Linear Regression

Government statisticians in England conducted a study of the relationship between smoking and lung cancer. The data concern 25 occupational groups and are condensed from data on thousands of individual men. The explanatory variable is the number of cigarettes smoked per day by men in each occupation relative to the number smoked by all men of the same age. This smoking ratio is 100 if men in an occupation are exactly average in their smoking, below 100 if they smoke less than average, and above 100 if they smoke more than average. The response variable is the standardized mortality ratio for deaths from lung cancer. It is also measured relative to the entire population of men of the same ages as those studied, and is greater or less than 100 when there are more or fewer deaths from lung cancer than would be expected based on the experience of all English men.

1. Plot the data in the file smoke.txt. The first variable is the smoking index smoke and the second is the mortality index mort. An appropriate graph would be a scatter plot to explore the data. Which variable should go on the x-axis and which on the y-axis? Describe any patterns that you observe. Does a linear relationship between smoke and mort seem plausible?

2. Many of you have probably studied simple linear regression, which is a method used to fit a straight line model to data in order to describe the relationship between a response variable $Y$ and an explanatory variable $X$. In simple linear regression, we assume that we can model the relationship between the $i$th observed values of $Y$ and $X$ as follows:

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i,$$

where $\beta_0$ is the intercept of the line, $\beta_1$ is the slope, and $i = 1, 2, \ldots, n$. The term $\epsilon_i$ in the model is an "error" term that expresses the random deviation of the observed $Y_i$ from the value of the true regression line $\beta_0 + \beta_1 X_i$.

The above statement is not a complete description of the model. The complete description also includes some important statistical assumptions:

• $\epsilon_i$ and $\epsilon_j$ are independent if $i \neq j$.
• The error terms $\epsilon_i$ are normally distributed with mean 0 and variance $\sigma^2$.
• Although the model explicitly allows for measurement error in the $Y$ variable, measurements made on $X_i$ are known precisely.

One of the commonly used methods of fitting a straight line to data is called linear regression, or least squares. Draw a picture that illustrates the principle behind this method.
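The least squares principle can also be explored numerically. The following Python sketch simulates data from the model under the assumptions listed above and compares the sum of squared vertical deviations for two candidate lines. The parameter values (intercept 10, slope 0.9, error standard deviation 5), the range of X, the seed, and the function names `simulate` and `sse` are all made up for illustration; none of them come from smoke.txt or smoke.m.

```python
import random

def simulate(n, beta0, beta1, sigma, seed=0):
    """Draw n points from Y_i = beta0 + beta1 * X_i + eps_i,
    with independent N(0, sigma^2) errors (hypothetical settings)."""
    rng = random.Random(seed)
    xs = [rng.uniform(60, 140) for _ in range(n)]   # X is treated as known exactly
    ys = [beta0 + beta1 * x + rng.gauss(0, sigma) for x in xs]
    return xs, ys

def sse(xs, ys, b0, b1):
    """Sum of squared vertical deviations of the points from the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

xs, ys = simulate(25, beta0=10.0, beta1=0.9, sigma=5.0)
print("SSE for the generating line:", sse(xs, ys, 10.0, 0.9))
print("SSE for another candidate line:", sse(xs, ys, 50.0, 0.5))
```

Least squares chooses the intercept and slope that make this sum as small as possible; the deviations are measured vertically because only Y is assumed to carry error.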

3. Using the method of least squares, the formulas for estimating the intercept $\beta_0$ and the slope $\beta_1$ are as follows:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X},$$

where $\bar{X}$ and $\bar{Y}$ are the means of the $X$ and $Y$ observations, respectively. Since these estimates are functions of data, they will change from data set to data set; that is, $\hat{\beta}_0$ and $\hat{\beta}_1$ are random variables. Furthermore, the standard errors for $\hat{\beta}_0$ and $\hat{\beta}_1$ can be computed as follows:

$$\mathrm{SE}[\hat{\beta}_0] = \sqrt{\mathrm{MSE}\left(\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\right)} \quad \text{and} \quad \mathrm{SE}[\hat{\beta}_1] = \sqrt{\frac{\mathrm{MSE}}{\sum_{i=1}^{n} (X_i - \bar{X})^2}},$$

where the mean square error $\mathrm{MSE} = \frac{\sum_{i=1}^{n} e_i^2}{n - 2}$ and $e_i$ is the $i$th residual.

These formulas look nasty and, as you can imagine, they are even worse when extended to the case where our model has more than one explanatory variable. As a result, we often frame the estimation problem in terms of linear algebra. Write down the matrix expressions for estimating the intercept and slope, as well as their covariance matrix.
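As a rough check on the scalar formulas in item 3, the following Python sketch evaluates them directly. The function name `fit_simple_linear` and the five data points are hypothetical, chosen only so the arithmetic is easy to follow by hand; they are not values from smoke.txt.

```python
import math

def fit_simple_linear(xs, ys):
    """Least squares estimates and standard errors for Y = b0 + b1 * X."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)                      # sum of (X_i - Xbar)^2
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))  # cross-product sum
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    mse = sum(e ** 2 for e in residuals) / (n - 2)              # n - 2 degrees of freedom
    se_b0 = math.sqrt(mse * (1 / n + xbar ** 2 / sxx))
    se_b1 = math.sqrt(mse / sxx)
    return {"b0": b0, "b1": b1, "mse": mse, "se_b0": se_b0, "se_b1": se_b1}

# Made-up illustrative data (not from smoke.txt)
xs = [80, 90, 100, 110, 120]
ys = [75, 95, 98, 112, 125]
fit = fit_simple_linear(xs, ys)
for name, value in fit.items():
    print(name, round(value, 4))
```

The same function can be applied to the two columns of smoke.txt once they have been read into lists.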

4. Now let's apply these formulas to the data set smoke.txt. You can write your own code or use the code in smoke.m. Fit the model with the intercept term. What do you conclude about the need to include an intercept term in the model?

5. Fit the model without an intercept term. What is your estimate of the lung cancer mortality ratio in an occupation where the smoking ratio is 100 (that is, where men are exactly average in their smoking)?

6. No statistical analysis is complete without a thorough check of the model assumptions that were given previously. Use the plots provided in the MATLAB code smoke.m (or any other methods you can think of) to test the model assumptions.
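For the model without an intercept, $Y_i = \beta_1 X_i + \epsilon_i$, minimizing the sum of squared residuals gives $\hat{\beta}_1 = \sum X_i Y_i / \sum X_i^2$. The Python sketch below applies that formula and returns the residuals for inspection. The function name `fit_through_origin` and the data are hypothetical placeholders, not values from smoke.txt, and this is not the code in smoke.m.

```python
def fit_through_origin(xs, ys):
    """Least squares slope and residuals for the no-intercept model Y = b1 * X."""
    b1 = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    residuals = [y - b1 * x for x, y in zip(xs, ys)]
    return b1, residuals

# Made-up illustrative data (not from smoke.txt)
xs = [80, 90, 100, 110, 120]
ys = [75, 95, 98, 112, 125]

b1, residuals = fit_through_origin(xs, ys)
print("slope:", b1)
print("predicted response at X = 100:", b1 * 100)
print("residuals:", [round(e, 2) for e in residuals])
```

Unlike the fit with an intercept, the residuals from a no-intercept fit need not sum to zero, so a residual plot against X or against the fitted values remains a useful check of the assumptions.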

7. Now repeat your analysis on the second data set, hubble.txt. These data were collected by Hubble (details: http://lib.stat.cmu.edu/DASL/Stories/Hubble’sConstant.html). Remember to follow the important steps in any statistical analysis, namely:

• plot the data
• propose a model
• fit the model (this includes standard error estimates and/or confidence intervals for parameter estimates) and
• check the fit of the model, as well as the other model assumptions.

Do these data support the Big Bang Theory? Based on this second data set, estimate the age of the universe. How does your estimate compare with the currently held belief that the universe is between 10 and 15 billion years old?
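Turning a fitted slope into an age requires a unit conversion. The Python sketch below assumes that recession velocity is regressed on distance, that velocity is in km/s and distance in megaparsecs, and that the expansion rate has been constant, so that the age is roughly the reciprocal of the slope. These are assumptions to check against the data description, and the slope of 70 used here is a placeholder, not an estimate from hubble.txt.

```python
KM_PER_MPC = 3.0857e19        # kilometres in one megaparsec
SECONDS_PER_YEAR = 3.15576e7  # seconds in one Julian year

def age_in_billion_years(slope_km_s_per_mpc):
    """Approximate age as 1 / slope, assuming a constant expansion rate."""
    seconds = KM_PER_MPC / slope_km_s_per_mpc   # (km/Mpc) / (km/s/Mpc) = s
    return seconds / SECONDS_PER_YEAR / 1e9

# Placeholder slope; substitute the slope estimated from hubble.txt
print(age_in_billion_years(70.0))
```

If the regression is instead run with distance as the response and velocity as the explanatory variable, the fitted slope already has units of time (Mpc per km/s) and only the unit conversion is needed.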
