SLIDE 1 STAT 113 Simple Linear Regression
Colin Reimer Dawson
Oberlin College
SLIDE 2
Outline
Prediction
What’s a Good Prediction?
Linear Prediction Equation
Prediction Error
Regression to the Mean
SLIDE 3 Prediction
◮ Correlations give us a description of the relationship
between two numeric variables.
◮ However, when two variables are related, we can go further
and use knowledge of one to make predictions about the other.
◮ Examples:
◮ Use SAT scores to predict college GPA
◮ Use economic indicators to predict stock prices
◮ Use credit score to predict probability of default on a loan
◮ Use biomarkers to predict disease progression
◮ What else?
SLIDE 4 What’s a Good Prediction?
Figure: Scatterplot of Y against X.
◮ Suppose I have this
data.
◮ What would be a good
prediction if I get a new X value of 12?
◮ What about an X value
SLIDE 5 Modeling relationships with a function
◮ We can capture all of our predictions by writing the y
variable as a function of the x variable
◮ Examples:
◮ f(x) = x^2
◮ f(x) = 1.6x + 20
◮ f(x) = 5 cos(2πx)
SLIDE 6 What’s a Good Prediction?
Figure: Plot of Y against X with a candidate prediction function.
How about this function?
SLIDE 7 What’s a Good Prediction?
Figure: Plot of Y against X with a candidate prediction function.
Or this?
SLIDE 8 What’s a Good Prediction?
Figure: Plot of Y against X with a candidate prediction function.
◮ What about this?
◮ There’s a tradeoff between how well we can fit the data and how simple our model (i.e., prediction function) is.
SLIDE 9 What’s a Good Prediction?
Figure: Plot of Y against X with a straight-line fit.
◮ Pretty much the
simplest model we can have is a straight line.
◮ Two things determine
what line we have:
◮ The intercept
◮ The slope
SLIDE 10
Intercept Slope Form
◮ The intercept and slope are the parameters of our
regression model.
◮ The general equation for a line is:
f(x) = a + bx
◮ In statistics notation, we write ŷ (“y hat”) to represent a predicted (or fitted) value.
◮ Given a value x_i, we predict using:
ŷ_i = â + b̂ x_i
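As a minimal sketch of how the prediction equation is used, the function below computes a fitted value from an intercept and a slope. The function name `predict` and the parameter values are hypothetical, chosen only for illustration; they do not come from the data on these slides.

```python
def predict(x, a_hat, b_hat):
    """Return the fitted value y-hat = a_hat + b_hat * x."""
    return a_hat + b_hat * x


# Hypothetical parameter values (assumed for illustration only).
a_hat = 2.0  # intercept: the predicted y when x is 0
b_hat = 0.5  # slope: the change in predicted y per one-unit change in x

y_hat = predict(12, a_hat, b_hat)
print(y_hat)  # 2.0 + 0.5 * 12 = 8.0
```

Changing the intercept shifts every prediction by the same amount, while changing the slope changes how quickly the prediction moves as x changes.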
SLIDE 11
Hat Notation
Figure: Source: brownsharpie.com
SLIDE 12
Systematic vs. Random
◮ We can split up each y value into two parts: a systematic
(predictable) part and a “random” part.
◮ That is, we can write, for the y coordinate of the ith data
point: y_i = ŷ_i + Error_i
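The decomposition can be checked numerically. The sketch below uses a small made-up data set and an arbitrary line (both are assumptions, not the data plotted on these slides) and computes each error as the observed value minus the fitted value.

```python
# Hypothetical data and line parameters, for illustration only.
xs = [6, 8, 10, 12, 14]
ys = [5.0, 5.5, 8.0, 8.5, 10.0]
a_hat, b_hat = 1.0, 0.625

# Systematic part: the fitted value for each x.
fitted = [a_hat + b_hat * x for x in xs]

# "Random" part: whatever is left over after subtracting the fitted value.
errors = [y - f for y, f in zip(ys, fitted)]

for x, y, f, e in zip(xs, ys, fitted, errors):
    print(f"x={x:2d}  y={y:5.2f}  fitted={f:5.2f}  error={e:+.2f}")
```

A positive error means the point lies above the line; a negative error means it lies below.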
SLIDE 13 What’s a Good Prediction?
Figure: Plot of Y against X with a candidate line.
Every line will have a different set of errors associated with it.
SLIDE 14 What’s a Good Prediction?
Figure: Plot of Y against X with another candidate line.
Every line will have a different set of errors associated with it.
SLIDE 15 What’s a Good Prediction?
Figure: Plot of Y against X with a candidate line.
◮ Every line will have a different set of errors associated with it.
◮ Which is best?
◮ Intuitively, we want to minimize the overall “distance” between the line and the points.
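One way to make "overall distance" concrete is to add up the squared vertical errors for each candidate line. That choice is an assumption here (it is the least-squares criterion; other choices, such as summing absolute errors, are possible), and the data and the two candidate lines are made up for illustration.

```python
# Hypothetical data, for illustration only.
xs = [6, 8, 10, 12, 14]
ys = [5.0, 5.5, 8.0, 8.5, 10.0]


def sum_squared_errors(a_hat, b_hat, xs, ys):
    """Total squared vertical distance between the points and the line."""
    return sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(xs, ys))


# Two candidate lines (assumed): one sloped, one flat at the mean of y.
sse_sloped = sum_squared_errors(1.0, 0.625, xs, ys)
sse_flat = sum_squared_errors(7.4, 0.0, xs, ys)

print("sloped line:", sse_sloped)
print("flat line:  ", sse_flat)
```

Under this criterion the sloped line has the smaller total, so it would be preferred over the flat one for these made-up points.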
SLIDE 16
The Prediction Equation
Prediction Function
ŷ_i = â + b̂ x_i
Pick â and b̂ that minimize the total distance. This is a calculus problem that the computer solves for us.
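If "total distance" is taken to mean the sum of squared vertical errors (ordinary least squares, which is an assumption here since the slides do not name the criterion), the calculus problem has a closed-form answer: the slope is the sum of products of deviations divided by the sum of squared x deviations, and the intercept makes the line pass through the point of means. The sketch below applies those formulas to made-up data; the function name `least_squares_line` is hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (a_hat, b_hat) minimizing the sum of squared vertical errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar  # the line passes through (x_bar, y_bar)
    return a_hat, b_hat


def sum_squared_errors(a_hat, b_hat, xs, ys):
    return sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, for illustration only.
xs = [6, 8, 10, 12, 14]
ys = [5.0, 5.5, 8.0, 8.5, 10.0]

a_hat, b_hat = least_squares_line(xs, ys)
best = sum_squared_errors(a_hat, b_hat, xs, ys)
print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}, SSE = {best:.3f}")
```

Nudging either parameter away from the returned values increases the sum of squared errors, which is what it means for this line to be the minimizer.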
SLIDE 17 Regression Example
Figure: Child's Adult Height (in.) plotted against Mid-parent Height (in.).
◮ The “father of
regression”, Francis Galton, looked at parents’ and children’s heights.
◮ Here’s his data, with
the associated regression line.
SLIDE 18
Example: Batting Average in Successive Seasons
SLIDE 19
Figure: Source: https://www.washingtonpost.com/opinions/why-our-childrens-future-no-longer-looks-so-bright/2011/10/14/gIQAofzlpL_story.html
SLIDE 20
“This fall, Lafley will step down for the second time, and no one will be mentioning Steve Jobs’s legendary return to Apple. Lafley hasn’t been bad – he slimmed the company down, selling off parts and getting out of less profitable businesses – but there’s been no dramatic turnaround. ... In other words, he’s been just O.K. How could someone who, according to Fortune, was known as “an all-time C.E.O. hero” end up being just O.K.? Well, if commentators had looked at the track record of returning C.E.O.s – boomerang C.E.O.s, as they’re sometimes called – that’s precisely what they’d have predicted. A 2014 study found that profitability at companies run by boomerang C.E.O.s fell slightly, and an earlier study detected no significant difference in long-term performance between firms that reappointed a former C.E.O. and ones that hired someone new.”
Figure: Source: http://www.newyorker.com/magazine/2015/09/21/the-comeback-conundrum
SLIDE 21
Regression to the Mean
◮ Many variables have a systematic and random part (e.g.,
“Skill” and “Luck”)
◮ If you had a really high score the first time, there’s a good
chance you had high values for both.
◮ If you try again, you would expect your skill to carry over,
but your luck will be average, on average, so your score would tend to go down.
◮ Conversely, low scores are likely partly the result of bad
luck, so they tend to go up as luck reverts to the mean.
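The skill-plus-luck argument can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: each score is modeled as skill plus luck, both drawn from a standard normal distribution, with skill held fixed across two attempts and luck redrawn each time.

```python
import random

random.seed(113)  # fixed seed so the run is repeatable
n = 10_000

# Skill carries over between attempts; luck is redrawn each time.
skill = [random.gauss(0, 1) for _ in range(n)]
first = [s + random.gauss(0, 1) for s in skill]
second = [s + random.gauss(0, 1) for s in skill]


def mean(values):
    return sum(values) / len(values)


# Rank people by their first score and pick out the bottom and top 10%.
order = sorted(range(n), key=lambda i: first[i])
k = n // 10
bottom, top = order[:k], order[-k:]

top_first = mean([first[i] for i in top])
top_second = mean([second[i] for i in top])
bottom_first = mean([first[i] for i in bottom])
bottom_second = mean([second[i] for i in bottom])

print(f"top 10%:    first {top_first:+.2f}, second {top_second:+.2f}")
print(f"bottom 10%: first {bottom_first:+.2f}, second {bottom_second:+.2f}")
```

In this model the top group still scores above average on the second attempt, because skill carries over, but less extremely than the first time; the bottom group moves up toward the mean in the same way.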