Bivariate Relationships (17.871, 2012)


slide-1
SLIDE 1

Bivariate Relationships

17.871 2012

1
slide-2
SLIDE 2

Testing associations (not causation!)

  • Continuous data
      • Scatter plot (always use first!)
      • (Pearson) correlation coefficient (rare, should be rarer!)
      • (Spearman) rank-order correlation coefficient (rare)
      • Regression coefficient (common)

  • Discrete data
      • Cross tabulations
      • χ²
      • Gamma, Beta, etc.
slide-3
SLIDE 3

Continuous DV, continuous EV

  • Dependent Variable: DV
  • Explanatory (or independent) Variable: EV
  • Example: What is the relationship between black percent in state legislatures and black percent in state populations?
slide-4
SLIDE 4

Regression interpretation

Three key things to learn (today)

  • 1. Where regression comes from
  • 2. How to interpret the regression coefficient
  • 3. How to interpret the confidence interval

We will learn how to calculate confidence intervals in a couple of weeks.
slide-5
SLIDE 5

Linear Relationship between African American Population & Black Legislators

[Figure: scatter plot with fitted values. x-axis: Black % in state population (bpop); y-axis: Black % in state legislatures (beo).]
slide-6
SLIDE 6

The linear relationship between two variables

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

Regression quantifies how one variable can be described in terms of another.
slide-7
SLIDE 7

Linear Relationship between African American Population & Black Legislators

[Figure: scatter plot with fitted line. x-axis: Black % in state population (bpop); y-axis: Black % in state legislatures (beo).]

$$\hat{\beta}_0 = -1.31, \qquad \hat{\beta}_1 = 0.359$$

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
slide-8
SLIDE 8

How did we get that line?

  • 1. Pick a value of $Y_i$

[Figure: scatter plot of Black % in state legis. (beo) against Black % in state population (bpop), with one observed value $Y_i$ marked.]

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
slide-9
SLIDE 9

How did we get that line?

  • 2. Decompose $Y_i$ into two parts

[Figure: scatter plot of Black % in state legis. (beo) against Black % in state population (bpop).]

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
slide-10
SLIDE 10

How did we get that line?

  • 3. Label the points

[Figure: scatter plot of Black % in state legis. (beo) against Black % in state population (bpop), with the observed value $Y_i$ labeled.]

$$Y_i = (\beta_0 + \beta_1 X_i) + \varepsilon_i$$
slide-11
SLIDE 11

How did we get that line?

  • 3. Label the points

[Figure: scatter plot of Black % in state legis. (beo) against Black % in state population (bpop), with the observed value $Y_i$ and the fitted value $\hat{Y}_i$ labeled.]

$$Y_i = (\beta_0 + \beta_1 X_i) + \varepsilon_i$$
slide-13
SLIDE 13

How did we get that line?

  • 3. Label the points

[Figure: scatter plot of Black % in state legis. (beo) against Black % in state population (bpop), with $Y_i$, $\hat{Y}_i$, and the gap $Y_i - \hat{Y}_i$ labeled.]

$$Y_i = (\beta_0 + \beta_1 X_i) + \varepsilon_i$$
slide-14
SLIDE 14

How did we get that line?

  • 3. Label the points

[Figure: scatter plot of Black % in state legis. (beo) against Black % in state population (bpop), with $Y_i$, $\hat{Y}_i$, and the gap $Y_i - \hat{Y}_i$ labeled. The gap is $\varepsilon_i$, the "residual".]

$$Y_i = (\beta_0 + \beta_1 X_i) + \varepsilon_i$$
slide-15
SLIDE 15

What is $\varepsilon_i$? (sometimes $u_i$)

  • Wrong functional form
  • Measurement error
  • Stochastic component in Y
  • Unmeasured influences on Y

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
slide-16
SLIDE 16

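The Method of Least Squares

Pick $\beta_0$ and $\beta_1$ to minimize

$$\sum_{i=1}^{n} \varepsilon_i^2$$

or

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

or

$$\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$$

where

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

[Figure: scatter plot of beo against bpop with fitted values, marking $Y_i$, $\hat{Y}_i$, and the residual $\varepsilon_i = Y_i - \hat{Y}_i$.]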
slide-17
SLIDE 17

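Minimize

$$\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$$

Solve for $\hat{\beta}_1$ (set the derivatives with respect to $\beta_0$ and $\beta_1$ to zero):

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \quad \text{or} \quad \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)}$$

Remember this for the problem set!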
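The slope formula can be checked by hand on a tiny data set. The sketch below is in Python rather than Stata, uses made-up numbers (not course data), and adds the standard intercept formula, intercept = mean(Y) minus slope times mean(X), which the slide does not show.

```python
def ols_fit(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of (X_i - X_bar)(Y_i - Y_bar); denominator: sum of (X_i - X_bar)^2.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # standard result, not shown on the slide
    return intercept, slope

# Hypothetical data lying exactly on the line y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
b0, b1 = ols_fit(x, y)
print(b0, b1)
```

Because the points sit exactly on a line, the fit should recover that line, which makes the formula easy to verify by hand.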
slide-18
SLIDE 18

Regression commands in Stata

  • reg depvar expvars
      E.g., reg y x
      E.g., reg beo bpop

  • Making predictions from regression lines
      predict newvar
      predict newvar, resid
      • newvar will now equal $\varepsilon_i$
slide-19
SLIDE 19
Black elected officials example

. reg beo bpop

      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  1,    39) =  202.56
       Model |   351.26542     1   351.26542           Prob > F      =  0.0000
    Residual |  67.6326195    39  1.73416973           R-squared     =  0.8385
-------------+------------------------------           Adj R-squared =  0.8344
       Total |  418.898039    40   10.472451           Root MSE      =  1.3169

------------------------------------------------------------------------------
         beo |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        bpop |   .3586751   .0251876    14.23   0.000     .3075284    .4094219
       _cons |  -1.314892   .3277508    -4.01   0.000    -1.977831   -.6519535
------------------------------------------------------------------------------

  • Always include interpretation in your presentations and papers

Interpretation: a one percentage point increase in black population is associated with a .36 percentage point increase in black composition in the legislature.
slide-20
SLIDE 20

The Linear Relationship between African American Population & Black Legislators

[Figure: scatter plot with fitted line. x-axis: Black % in state population (X, bpop); y-axis: Black % in state legislatures (Y, beo).]

$$\hat{\beta}_0 = -1.31, \qquad \hat{\beta}_1 = 0.359$$

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
slide-21
SLIDE 21

More regression examples
slide-22
SLIDE 22

Temperature and Latitude

[Figure: scatter plot of JanTemp (y-axis, about 20 to 80) against latitude (x-axis, 25 to 45), with each point labeled by city.]

scatter JanTemp latitude, mlabel(city)
slide-23
SLIDE 23
. reg jantemp latitude

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =   49.34
       Model |  3250.72219     1  3250.72219           Prob > F      =  0.0000
    Residual |  1185.82781    18  65.8793228           R-squared     =  0.7327
-------------+------------------------------           Adj R-squared =  0.7179
       Total |     4436.55    19  233.502632           Root MSE      =  8.1166

------------------------------------------------------------------------------
     jantemp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    latitude |  -2.341428   .3333232    -7.02   0.000    -3.041714   -1.641142
       _cons |   125.5072   12.77915     9.82   0.000     98.65921    152.3552
------------------------------------------------------------------------------

Interpretation: a one-degree increase in latitude is associated with a 2.3-degree decrease in average January temperature (in Fahrenheit).

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
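To make the interpretation concrete, the fitted line can be used to compute a prediction by hand. This is a Python sketch rather than Stata; the coefficients are copied from the reg jantemp latitude output, while the function name and the latitude of 40 are illustrative choices.

```python
# Coefficients copied from the "reg jantemp latitude" output.
B0 = 125.5072    # _cons
B1 = -2.341428   # latitude

def predict_jantemp(latitude):
    """Fitted average January temperature (Fahrenheit) at a given latitude."""
    return B0 + B1 * latitude

pred_40 = predict_jantemp(40)  # 40 is an arbitrary example latitude
# Moving one degree north changes the prediction by exactly the slope.
change_per_degree = predict_jantemp(41) - predict_jantemp(40)
print(round(pred_40, 2), round(change_per_degree, 2))
```

In Stata the same fitted values come from predict after reg; the arithmetic here only shows what that command computes for one value of latitude.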
slide-24
SLIDE 24

How to add a regression line:

Stata command: lfit

[Figure: scatter plot of JanTemp against latitude (25 to 45), points labeled by city, with fitted values line.]

scatter JanTemp latitude, mlabel(city) || lfit JanTemp latitude

or often better:

scatter JanTemp latitude, mlabel(city) m(i) || lfit JanTemp latitude
slide-25
SLIDE 25
Presenting regression results

Brief aside

  • First, show scatter plot
      • Label data points (if possible)
      • Include best-fit line
  • Second, show regression table
      • Assess statistical significance with confidence interval or p-value
      • Assess robustness to control variables (internal validity: nonrandom selection)
slide-26
SLIDE 26

Bush vote and Southern Baptists

[Figure: scatter plot of Bush Pct 2004 (y-axis, .4 to .7) against Southern Baptist % (x-axis, .2 to .6), points labeled by state, with fitted values line.]
slide-27
SLIDE 27

. reg bush sbc_mpct [aw=votes]
(sum of wgt is   1.2207e+08)

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  1,    48) =   40.18
       Model |  .118925068     1  .118925068           Prob > F      =  0.0000
    Residual |  .142084951    48  .002960103           R-squared     =  0.4556
-------------+------------------------------           Adj R-squared =  0.4443
       Total |  .261010018    49  .005326735           Root MSE      =  .05441

------------------------------------------------------------------------------
        bush |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    sbc_mpct |    .261779   .0413001     6.34   0.000     .1787395    .3448185
       _cons |   .4563507   .0112155    40.69   0.000     .4338004    .4789011
------------------------------------------------------------------------------

Coefficient interpretation:

  • A one percentage point increase in Baptist percentage is associated with a .26 percentage point increase in Bush vote share at the state level.
slide-28
SLIDE 28

[Figure: scatter plot of Bush Pct 2004 (y-axis, .4 to .7) against Southern Baptist % (x-axis, .2 to .6), points labeled by state, with fitted values line.]
slide-29
SLIDE 29

Interpreting confidence interval

. reg bush sbc_mpct [aw=votes]
(sum of wgt is   1.2207e+08)

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  1,    48) =   40.18
       Model |  .118925068     1  .118925068           Prob > F      =  0.0000
    Residual |  .142084951    48  .002960103           R-squared     =  0.4556
-------------+------------------------------           Adj R-squared =  0.4443
       Total |  .261010018    49  .005326735           Root MSE      =  .05441

------------------------------------------------------------------------------
        bush |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    sbc_mpct |    .261779   .0413001     6.34   0.000     .1787395    .3448185
       _cons |   .4563507   .0112155    40.69   0.000     .4338004    .4789011
------------------------------------------------------------------------------

Coefficient interpretation:

  • A 1 percentage point increase in Baptist percentage is associated with a .26 percentage point increase in Bush vote share at the state level.

Confidence interval interpretation:

  • The 95% confidence interval lies between .18 and .34.
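The interval in the table can be reproduced from the coefficient and its standard error. The Python sketch below assumes a critical value of about 2.0106 (the 97.5th percentile of a t distribution with 48 degrees of freedom, taken from a t table rather than from the slides); the course covers how to calculate these intervals later.

```python
# Values copied from the sbc_mpct row of the regression output.
coef = 0.261779
se = 0.0413001

# Assumption: approximate 97.5th percentile of t with 48 d.f. (from a t table).
t_crit = 2.0106

lower = coef - t_crit * se
upper = coef + t_crit * se
print(round(lower, 4), round(upper, 4))
```

The result should agree with the [95% Conf. Interval] columns up to rounding in the assumed critical value.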
slide-30
SLIDE 30

[Figure: scatter plot of change in House seats (loss; y-axis, about -80 to -20) against Gallup approval rating in Nov. (x-axis, 30 to 70), points labeled by midterm year (1938 to 2002), with fitted values line.]
slide-31
SLIDE 31
. reg loss gallup

      Source |       SS       df       MS              Number of obs =      17
-------------+------------------------------           F(  1,    15) =    5.70
       Model |  2493.96962     1  2493.96962           Prob > F      =  0.0306
    Residual |  6564.50097    15  437.633398           R-squared     =  0.2753
-------------+------------------------------           Adj R-squared =  0.2270
       Total |  9058.47059    16  566.154412           Root MSE      =   20.92

------------------------------------------------------------------------------
       Seats |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gallup |   1.283411     .53762     2.39   0.031     .1375011    2.429321
       _cons |  -96.59926   29.25347    -3.30   0.005    -158.9516   -34.24697
------------------------------------------------------------------------------

Coefficient interpretation:

  • A 1 percentage point increase in presidential approval is associated with an avg. of 1.28 more seats won by the president’s party in the midterm.

Confidence interval interpretation:

  • The 95% confidence interval lies between .14 and 2.43.
slide-32
SLIDE 32

Additional regression in bivariate relationship topics

  • Residuals
  • Comparing coefficients
  • Functional form
  • Goodness of fit (R² and SER)
  • Correlation
  • Discrete DV, discrete EV
  • Using the appropriate graph/table
slide-33
SLIDE 33

Residuals
slide-34
SLIDE 34

Residuals

$$e_i = Y_i - B_0 - B_1 X_i$$

slide-35
SLIDE 35

One important numerical property of residuals

  • The sum of the residuals is zero

[Figure: scatter plot of JanTemp against latitude (25 to 45), points labeled by city, with fitted values line.]
slide-36
SLIDE 36
Generating predictions and residuals

. reg jantemp latitude

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =   49.34
       Model |  3250.72219     1  3250.72219           Prob > F      =  0.0000
    Residual |  1185.82781    18  65.8793228           R-squared     =  0.7327
-------------+------------------------------           Adj R-squared =  0.7179
       Total |     4436.55    19  233.502632           Root MSE      =  8.1166

------------------------------------------------------------------------------
     jantemp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    latitude |  -2.341428   .3333232    -7.02   0.000    -3.041714   -1.641142
       _cons |   125.5072   12.77915     9.82   0.000     98.65921    152.3552
------------------------------------------------------------------------------

. predict py
(option xb assumed; fitted values)

. predict ry, resid
slide-37
SLIDE 37

. gsort -ry
. list city jantemp py ry

     +-------------------------------------------------+
     |           city   jantemp         py          ry |
     |-------------------------------------------------|
  1. |     PortlandOR        40    17.8015     22.1985 |
  2. | SanFranciscoCA        49   36.53293    12.46707 |
  3. |   LosAngelesCA        58   45.89864    12.10136 |
  4. |      PhoenixAZ        54   48.24007    5.759929 |
  5. |      NewYorkNY        32   29.50864    2.491357 |
     |-------------------------------------------------|
  6. |        MiamiFL        67   64.63007     2.36993 |
  7. |       BostonMA        29   27.16722    1.832785 |
  8. |      NorfolkVA        39   38.87436     .125643 |
  9. |    BaltimoreMD        32    34.1915     -2.1915 |
 10. |     SyracuseNY        22   24.82579   -2.825786 |
     |-------------------------------------------------|
 11. |       MobileAL        50   52.92293   -2.922928 |
 12. |   WashingtonDC        31    34.1915     -3.1915 |
 13. |      MemphisTN        40   43.55721   -3.557214 |
 14. |    ClevelandOH        25   29.50864   -4.508643 |
 15. |       DallasTX        43   48.24007   -5.240071 |
     |-------------------------------------------------|
 16. |      HoustonTX        50   55.26435   -5.264356 |
 17. |   KansasCityMO        28    34.1915     -6.1915 |
 18. |   PittsburghPA        25   31.85007   -6.850072 |
 19. |  MinneapolisMN        12   20.14293   -8.142929 |
 20. |       DuluthMN         7   15.46007   -8.460073 |
     +-------------------------------------------------+
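The listed residuals give a quick check of the property that least-squares residuals sum to zero. The Python sketch below simply adds up the ry column from the listing; the variable names are illustrative, and the small nonzero total comes from the rounding in the printed values.

```python
# Residuals (ry) copied from the listing, in the order shown.
residuals = [
    22.1985, 12.46707, 12.10136, 5.759929, 2.491357,
    2.36993, 1.832785, 0.125643, -2.1915, -2.825786,
    -2.922928, -3.1915, -3.557214, -4.508643, -5.240071,
    -5.264356, -6.1915, -6.850072, -8.142929, -8.460073,
]

total = sum(residuals)
largest_miss = max(residuals, key=abs)  # the city the line fits worst
print(len(residuals), abs(total) < 1e-4, largest_miss)
```

Sorting by residual, as gsort -ry does, also makes the worst-fitting observation easy to spot: PortlandOR is more than 22 degrees warmer than the line predicts.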
slide-38
SLIDE 38

Use residuals to diagnose potential problems

[Figure: scatter plot of change in House seats (loss; y-axis, about -80 to -20) against Gallup approval rating in Nov. (x-axis, 30 to 70), points labeled by midterm year, with fitted values line. 1938 sits lowest, near -80.]
slide-39
SLIDE 39

. reg loss gallup

      Source |       SS       df       MS              Number of obs =      17
-------------+------------------------------           F(  1,    15) =    5.70
       Model |  2493.96962     1  2493.96962           Prob > F      =  0.0306
    Residual |  6564.50097    15  437.633398           R-squared     =  0.2753
-------------+------------------------------           Adj R-squared =  0.2270
       Total |  9058.47059    16  566.154412           Root MSE      =   20.92

------------------------------------------------------------------------------
       Seats |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gallup |   1.283411     .53762     2.39   0.031     .1375011    2.429321
       _cons |  -96.59926   29.25347    -3.30   0.005    -158.9516   -34.24697
------------------------------------------------------------------------------

. reg loss gallup if year>1946

      Source |       SS       df       MS              Number of obs =      14
-------------+------------------------------           F(  1,    12) =   17.53
       Model |  3332.58872     1  3332.58872           Prob > F      =  0.0013
    Residual |  2280.83985    12  190.069988           R-squared     =  0.5937
-------------+------------------------------           Adj R-squared =  0.5598
       Total |  5613.42857    13  431.802198           Root MSE      =  13.787

------------------------------------------------------------------------------
       seats |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gallup |    1.96812   .4700211     4.19   0.001     .9440315    2.992208
       _cons |  -127.4281   25.54753    -4.99   0.000    -183.0914   -71.76486
------------------------------------------------------------------------------
slide-40
SLIDE 40

scatter loss gallup, mlabel(year) || lfit loss gallup || lfit loss gallup if year >1946

[Figure: scatter plot of change in House seats (loss; y-axis, about -80 to -20) against Gallup approval rating in Nov. (x-axis, 30 to 70), points labeled by midterm year, with two fitted values lines: all years, and years after 1946.]

slide-41
SLIDE 41

Comparing regression coefficients

  • As a general rule:
      • Code all your variables to vary between 0 and 1
          • That is, minimum = 0, maximum = 1
      • Regression coefficients then represent the effect of shifting from the minimum to the maximum.
      • This allows you to more easily compare the relative importance of coefficients.
slide-42
SLIDE 42
How to recode variables to 0-1 scale

  • Party ID example: pid7
      • Usually varies from
          • 1 (strong Republican)
          • to 8 (strong Democrat)
          • sometimes 0 needs to be recoded to missing (“.”).
  • Stata code?
      replace pid7 = (pid7-1)/7
slide-43
SLIDE 43
Regression interpretation with 0-1 scale

  • Continue with pid7 example
      • regress natlecon pid7 (both recoded to 0-1 scales)*
      • pid7 coefficient: b = -.46 (CCES data from 2006)
  • Interpretation?
      • Shifting from being a strong Republican to a strong Democrat corresponds with a .46 drop in evaluations of the national economy (on the one-point national economy scale)

*natlecon originally coded so that 1 = excellent, 4 = poor, 5 = not sure
slide-44
SLIDE 44

Functional Form
slide-45
SLIDE 45

About the Functional Form

  • Linear in the variables vs. linear in the parameters
      • $Y = a + bX + e$ (linear in both)
      • $Y = a + bX + cX^2 + e$ (linear in parms.)
      • $Y = a + X^b + e$ (linear in variables, not parms.)
  • Regression must be linear in parameters
slide-46
SLIDE 46

The Linear and Curvilinear Relationship between African American Population & Black Legislators

[Figure: scatter plot of leg against pop (x-axis, 10 to 30) with quadratic fitted values.]

$$Y = 0.11 + 0.0088X + 0.013X^2$$

scatter beo pop || qfit beo pop
slide-47
SLIDE 47

Log transformations (see Tufte, ch. 3)

Model                   | b               | Interpretation of b                               | Case
------------------------+-----------------+---------------------------------------------------+--------------------------
Y = a + bX + e          | dY/dX           | unit change in Y given a unit change in X         | Typical case
Y = a + b ln X + e      | dY/(dX/X)       | unit change in Y given a % change in X            | Log explanatory variable
ln Y = a + bX + e       | (dY/Y)/dX       | % change in Y given a unit change in X            | Log dependent variable
ln Y = a + b ln X + e   | (dY/Y)/(dX/X)   | % change in Y given a % change in X (elasticity)  | Economic production
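The elasticity case can be illustrated with a small numerical check. In this Python sketch the data are made up so that Y = 3 X^0.5 exactly; regressing ln Y on ln X should then return a slope of 0.5, meaning a 1% change in X goes with roughly a 0.5% change in Y. The helper name ols_slope is illustrative.

```python
import math

def ols_slope(x, y):
    """Least-squares slope: sum of cross-deviations over sum of squared x-deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical data (not from the slides): Y = 3 * X^0.5 with no error term.
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [3.0 * x ** 0.5 for x in xs]

# Log-log model: ln Y = a + b ln X, so b is the elasticity.
log_x = [math.log(x) for x in xs]
log_y = [math.log(y) for y in ys]
elasticity = ols_slope(log_x, log_y)
print(round(elasticity, 6))
```

With real data the points will not fall exactly on a curve, so the estimated elasticity comes with a standard error like any other regression coefficient.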
slide-48
SLIDE 48

Goodness of regression fit
slide-49
SLIDE 49

How “good” is the fitted line?

  • Goodness-of-fit is often not relevant to research
  • Goodness-of-fit receives too much emphasis
  • Focus on
      • Substantive interpretation of coefficients (most important)
      • Statistical significance of coefficients (less important)
          • Confidence interval
          • Standard error of a coefficient
          • t-statistic: coeff./s.e.
  • Nevertheless, you should know about
      • Standard Error of the Regression (SER)
          • Also called Standard Error of the Estimate (SEE)
          • Regrettably called Root Mean Squared Error (Root MSE) in Stata
      • R-squared (R²)
          • Often not informative, use sparingly
slide-50
SLIDE 50

Standard Error of the Regression: the idea

[Figure: scatter plot of beo against bpop (x-axis, 10 to 30) with fitted values line.]
slide-52
SLIDE 52

Standard Error of the Regression: picture

[Figure: scatter plot of beo against bpop (x-axis, 10 to 30) with fitted values line, marking $Y_i$, $\hat{Y}_i$, and the residual $\varepsilon_i = Y_i - \hat{Y}_i$.]
slide-53
SLIDE 53

Standard Error of the Regression (SER)

  • or Standard Error of the Estimate
  • or Root Mean Squared Error (Root MSE)

$$\text{SER} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\text{d.f.}}}$$

d.f. equals n minus the number of estimated coefficients (Bs). In the bivariate regression case, d.f. = n-2.
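As a check on the formula, the Root MSE that Stata reports for reg beo bpop can be rebuilt from the residual sum of squares in the same output. This is a Python sketch; the numbers are copied from that regression table and the variable names are illustrative.

```python
import math

# Values copied from the "reg beo bpop" output.
residual_ss = 67.6326195   # Residual SS
n_obs = 41                 # Number of obs
n_coefs = 2                # intercept and slope in the bivariate case

df = n_obs - n_coefs                 # 39, matching the Residual df column
ser = math.sqrt(residual_ss / df)    # square root of the Residual MS
print(df, round(ser, 4))
```

The ratio residual_ss / df is the Residual MS column (1.73416973), so the SER is simply its square root.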
slide-54
SLIDE 54
SER interpretation (called “Root MSE” in Stata)

  • On average, in-sample predictions will be off the mark by about one standard error of the regression

. reg beo bpop

      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  1,    39) =  202.56
       Model |   351.26542     1   351.26542           Prob > F      =  0.0000
    Residual |  67.6326195    39  1.73416973           R-squared     =  0.8385
-------------+------------------------------           Adj R-squared =  0.8344
       Total |  418.898039    40   10.472451           Root MSE      =  1.3169

------------------------------------------------------------------------------
         beo |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        bpop |   .3586751   .0251876    14.23   0.000     .3075284    .4094219
       _cons |  -1.314892   .3277508    -4.01   0.000    -1.977831   -.6519535
------------------------------------------------------------------------------
slide-55
SLIDE 55
R²: A less useful measure of fit

[Figure: scatter plot of beo (y-axis, -.884722 to 10.8) against bpop (x-axis, 1.2 to 30.8) with fitted values line and a horizontal line at $\bar{Y}$, marking the distances $(Y_i - \hat{Y}_i)$, $(\hat{Y}_i - \bar{Y})$, and $(Y_i - \bar{Y})$.]
slide-56
SLIDE 56

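R²: A less useful measure of fit

[Figure: scatter plot of beo (y-axis, -.884722 to 10.8) against bpop (x-axis, 1.2 to 30.8) with fitted values line and a horizontal line at $\bar{Y}$, marking the distances $(Y_i - \hat{Y}_i)$, $(\hat{Y}_i - \bar{Y})$, and $(Y_i - \bar{Y})$.]

$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 \quad \text{("total sum of squares")}$$

$$= \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \quad \text{("regression sum of squares")}$$

$$+ \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \quad \text{("residual sum of squares")}$$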
slide-57
SLIDE 57

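R-squared

[Figure: scatter plot of beo (y-axis, -.884722 to 10.8) against bpop (x-axis, 1.2 to 30.8) with fitted values line and a horizontal line at $\bar{Y}$, marking the distances $(Y_i - \hat{Y}_i)$, $(\hat{Y}_i - \bar{Y})$, and $(Y_i - \bar{Y})$.]

$$r^2 = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$

  • pct. variance "explained"

Also called “coefficient of determination”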
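The ratio can be verified against the reg beo bpop output, where the Model, Residual, and Total rows are the regression, residual, and total sums of squares. This is a Python sketch with numbers copied from that table; the variable names are illustrative.

```python
# Sums of squares copied from the "reg beo bpop" output.
model_ss = 351.26542       # regression sum of squares
residual_ss = 67.6326195   # residual sum of squares
total_ss = 418.898039      # total sum of squares

# The decomposition: total = regression + residual (up to rounding).
decomposition_gap = total_ss - (model_ss + residual_ss)

r_squared = model_ss / total_ss
r_squared_alt = 1 - residual_ss / total_ss  # equivalent form
print(round(r_squared, 4), round(r_squared_alt, 4))
```

Both forms should reproduce the R-squared line of the output because the two component sums of squares add up to the total.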
slide-58
SLIDE 58

Interpreting SER (Root MSE) and R²

. reg bush sbc_mpct

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  1,    48) =   11.83
       Model |  .069183833     1  .069183833           Prob > F      =  0.0012
    Residual |  .280630922    48  .005846478           R-squared     =  0.1978
-------------+------------------------------           Adj R-squared =  0.1811
       Total |  .349814756    49  .007139077           Root MSE      =  .07646

------------------------------------------------------------------------------
        bush |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    sbc_mpct |    .196814   .0572138     3.44   0.001     .0817779    .3118501
       _cons |   .4931758   .0155007    31.82   0.000     .4620095     .524342
------------------------------------------------------------------------------

Interpreting SER (Root MSE):

  • On average, in-sample predictions about Bush’s vote share will be off the mark by about 7.6 percentage points.

Interpreting R²:

  • Regression model explains about 19.8% of the variation in Bush vote.
slide-59
SLIDE 59

[Figure: scatter plot of Bush Pct 2004 (y-axis, .4 to .7) against Southern Baptist % (x-axis, .2 to .6), points labeled by state, with fitted values line.]
slide-60
SLIDE 60

Correlation
slide-61
SLIDE 61

Correlation

$$\text{Corr}(x, y) = r = \frac{\text{Cov}(x, y)}{\sigma_x \sigma_y}$$

$$\text{Corr}(\text{BushPct00}, \text{BushPct04}) = \frac{0.014858}{\sqrt{0.01499 \times 0.01605}} = 0.96$$

  • Measures how closely data points fall along the line
  • Varies between -1 and 1 (compare with Tufte p. 102)

[Figure: scatter plot of new2004 against new2000, both axes running from .2 to .6.]
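The arithmetic behind the 0.96 can be reproduced directly. This Python sketch assumes that 0.01499 and 0.01605 on the slide are the variances of the 2000 and 2004 Bush percentages, which is the reading under which the numbers work out; the variable names are illustrative.

```python
import math

# Numbers from the slide; treating the last two as variances is an assumption.
cov_00_04 = 0.014858
var_00 = 0.01499
var_04 = 0.01605

# Correlation = covariance divided by the product of the standard deviations.
r = cov_00_04 / (math.sqrt(var_00) * math.sqrt(var_04))
print(round(r, 2))
```

Dividing by the standard deviations is what forces r to lie between -1 and 1 regardless of the units of x and y.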
slide-62
SLIDE 62

Warning: Don’t correlate often!

  • Correlation only measures linear relationship
  • Correlation is sensitive to variance
  • Correlation usually doesn’t measure a theoretically interesting quantity
  • Same criticisms apply to R², which is the squared correlation between predictions and data points.
  • Instead, focus on regression coefficients (slopes)
slide-63
SLIDE 63

Discrete DV, discrete EV

  • Crosstabs
  • χ²
  • Gamma, Beta, etc.
slide-64
SLIDE 64

Example

  • What is the relationship between abortion sentiments and vote choice?
  • The abortion scale:
      • 1. BY LAW, ABORTION SHOULD NEVER BE PERMITTED.
      • 2. THE LAW SHOULD PERMIT ABORTION ONLY IN CASE OF RAPE, INCEST, OR WHEN THE WOMAN'S LIFE IS IN DANGER.
      • 3. THE LAW SHOULD PERMIT ABORTION FOR REASONS OTHER THAN RAPE, INCEST, OR DANGER TO THE WOMAN'S LIFE, BUT ONLY AFTER THE NEED FOR THE ABORTION HAS BEEN CLEARLY ESTABLISHED.
      • 4. BY LAW, A WOMAN SHOULD ALWAYS BE ABLE TO OBTAIN AN ABORTION AS A MATTER OF PERSONAL CHOICE.
slide-65
SLIDE 65

Abortion and vote choice in 2006

. tab housevote abortopinion, col

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

  us house candidate |         stmt most agrees w/ view on abortion law
          voting for |     Never     Rarely  Sometimes     Always  other (pl |     Total
---------------------+-------------------------------------------------------+----------
            Democrat |       446      1,749      1,903      8,759        770 |    13,627
                     |     13.60      20.21      36.90      57.93      34.30 |     39.55
---------------------+-------------------------------------------------------+----------
          Republican |     1,900      4,381      1,639      2,006        758 |    10,684
                     |     57.93      50.62      31.78      13.27      33.76 |     31.01
---------------------+-------------------------------------------------------+----------
other (please specify |       157        384        228        671        190 |     1,630
                     |      4.79       4.44       4.42       4.44       8.46 |      4.73
---------------------+-------------------------------------------------------+----------
i won't vote in this |        65        201        117        299         52 |       734
                     |      1.98       2.32       2.27       1.98       2.32 |      2.13
---------------------+-------------------------------------------------------+----------
     haven't decided |       712      1,939      1,270      3,386        475 |     7,782
                     |     21.71      22.41      24.63      22.39      21.16 |     22.58
---------------------+-------------------------------------------------------+----------
               Total |     3,280      8,654      5,157     15,121      2,245 |    34,457
                     |    100.00     100.00     100.00     100.00     100.00 |    100.00
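The column percentages in the crosstab are each cell's frequency divided by its column total. The Python sketch below recomputes them from the frequencies; the Democrat count of 770 in the "other" column is inferred from the column total of 2,245 (and matches a stray 770 in the source), and the short row labels are illustrative.

```python
# Frequencies from the crosstab; columns are Never, Rarely, Sometimes, Always, other.
counts = {
    "Democrat": [446, 1749, 1903, 8759, 770],   # 770 is inferred from the column total
    "Republican": [1900, 4381, 1639, 2006, 758],
    "Other": [157, 384, 228, 671, 190],
    "Won't vote": [65, 201, 117, 299, 52],
    "Undecided": [712, 1939, 1270, 3386, 475],
}

n_cols = 5
col_totals = [sum(row[j] for row in counts.values()) for j in range(n_cols)]

# Column percentage: share of each abortion-opinion group choosing each vote option.
col_pct = {
    name: [100.0 * cell / total for cell, total in zip(row, col_totals)]
    for name, row in counts.items()
}

print(col_totals)
print(round(col_pct["Democrat"][3], 2), round(col_pct["Republican"][0], 2))
```

Reading down a column compares vote choice within one opinion group, which is why column percentages are the right choice when the explanatory variable defines the columns.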
slide-66
SLIDE 66

Use the appropriate graph/table

  • Continuous DV, continuous EV
      • E.g., vote share by income growth
      • Use scatter plot
  • Continuous DV, discrete and unordered EV
      • E.g., vote share by religion or by union membership
      • Box plot, dot plot
  • Discrete DV, discrete EV
      • No graph: Use crosstabs (tabulate)
slide-67
SLIDE 67

Two quick notes about comparing coefficients

  • Recode/rescale independent variables to be in 0-1 interval
      • new_x = [x-min(x)]/[max(x)-min(x)]
      • Interpretation: a move from the minimum to the maximum in the independent variable yields an average change of b in the d.v.
slide-68
SLIDE 68

. reg beo bpop

      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  1,    39) =  202.56
       Model |   351.26542     1   351.26542           Prob > F      =  0.0000
    Residual |  67.6326195    39  1.73416973           R-squared     =  0.8385
-------------+------------------------------           Adj R-squared =  0.8344
       Total |  418.898039    40   10.472451           Root MSE      =  1.3169

------------------------------------------------------------------------------
         beo |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        bpop |   .3584751   .0251876    14.23   0.000     .3075284    .4094219
       _cons |  -1.314892   .3277508    -4.01   0.000    -1.977831   -.6519535
------------------------------------------------------------------------------

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        bpop |        41    10.13171    8.266633        1.2       30.8

. gen bpop01=(bpop-1.2)/(30.8-1.2)

. reg beo bpop01

      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  1,    39) =  202.56
       Model |  351.265419     1  351.265419           Prob > F      =  0.0000
    Residual |    67.63262    39  1.73416974           R-squared     =  0.8385
-------------+------------------------------           Adj R-squared =  0.8344
       Total |  418.898039    40   10.472451           Root MSE      =  1.3169

------------------------------------------------------------------------------
         beo |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      bpop01 |   10.61086   .7455536    14.23   0.000      9.10284    12.11889
       _cons |  -.8847219   .3048075    -2.90   0.006    -1.501253   -.2681905
------------------------------------------------------------------------------
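Rescaling the explanatory variable to 0-1 does not change the fit; it only rescales the coefficients. The Python sketch below derives the bpop01 results from the original bpop regression using the minimum and maximum reported for bpop. The numbers are copied from the output and the variable names are illustrative.

```python
# From "reg beo bpop" and the summary of bpop.
slope = 0.3584751
intercept = -1.314892
x_min, x_max = 1.2, 30.8

# bpop01 = (bpop - x_min) / (x_max - x_min), so bpop = x_min + (x_max - x_min) * bpop01.
# Substituting into beo = intercept + slope * bpop gives the rescaled coefficients.
slope_01 = slope * (x_max - x_min)      # effect of moving from the minimum to the maximum
intercept_01 = intercept + slope * x_min  # fitted value at the minimum of bpop
print(round(slope_01, 5), round(intercept_01, 7))
```

The t statistic, R-squared, and Root MSE are the same in both regressions because a linear rescaling of X leaves the fitted values untouched.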
slide-69
SLIDE 69

 Convert all variables, except dummy

variables, to “unit deviates”:\

 new x = [x

mean(x)]/sd(x)

 new_x = [x-mean(x)]/sd(x)  new_y = [y-mean(y)]/sd(y)

etc.

 Interpretation: a one standard deviation

Interpretation: a one standard deviation change in x yields, on average, a b standard deviation change in y.

 (For a dummy variable a change from category 0  (For a dummy variable, a change from category 0

to category 1 yields, on average, a b standard deviation change in y.

69
slide-70
SLIDE 70
. reg beo bpop

      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  1,    39) =  202.56
       Model |   351.26542     1   351.26542           Prob > F      =  0.0000
    Residual |  67.6326195    39  1.73416973           R-squared     =  0.8385
-------------+------------------------------           Adj R-squared =  0.8344
       Total |  418.898039    40   10.472451           Root MSE      =  1.3169

------------------------------------------------------------------------------
         beo |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        bpop |   .3584751   .0251876    14.23   0.000     .3075284    .4094219
       _cons |  -1.314892   .3277508    -4.01   0.000    -1.977831   -.6519535
------------------------------------------------------------------------------

. summ beo bpop

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         beo |        41    2.317073    3.236117                  10.8
        bpop |        41    10.13171    8.266633        1.2       30.8

. gen st_beo=(beo-2.317073)/3.236117
(9 missing values generated)

. gen st_bpop=(bpop-10.13171)/8.266633
(9 missing values generated)
slide-71
SLIDE 71
. reg st_beo st_bpop

      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  1,    39) =  202.56
       Model |  33.5418469     1  33.5418469           Prob > F      =  0.0000
    Residual |  6.45814509    39  .165593464           R-squared     =  0.8385
-------------+------------------------------           Adj R-squared =  0.8344
       Total |  39.9999919    40  .999999799           Root MSE      =  .40693

------------------------------------------------------------------------------
      st_beo |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     st_bpop |   .9157217   .0643416    14.23   0.000     .7855786    1.045865
       _cons |   3.54e-07   .0635521     0.00   1.000    -.1285458    .1285465
------------------------------------------------------------------------------

. reg beo bpop, beta

      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  1,    39) =  202.56
       Model |   351.26542     1   351.26542           Prob > F      =  0.0000
    Residual |  67.6326195    39  1.73416973           R-squared     =  0.8385
-------------+------------------------------           Adj R-squared =  0.8344
       Total |  418.898039    40   10.472451           Root MSE      =  1.3169

------------------------------------------------------------------------------
         beo |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
        bpop |   .3584751   .0251876    14.23   0.000                 .9157218
       _cons |  -1.314892   .3277508    -4.01   0.000                        .
------------------------------------------------------------------------------
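The standardized ("beta") coefficient can also be obtained without generating new variables: multiply the raw slope by sd(x)/sd(y). The Python sketch below uses the slope from reg beo bpop and the standard deviations from the summ output; the variable names are illustrative.

```python
import math

# From "reg beo bpop" and "summ beo bpop".
slope = 0.3584751
sd_bpop = 8.266633
sd_beo = 3.236117
r_squared = 0.8385

# A one-sd change in bpop moves beo by slope * sd_bpop units, i.e. this many sds of beo.
beta = slope * sd_bpop / sd_beo

# In a bivariate regression the standardized slope equals the correlation, so beta^2 is R-squared.
beta_from_r2 = math.sqrt(r_squared)
print(round(beta, 4), round(beta_from_r2, 4))
```

This ties the unit-deviate approach back to correlation: with one explanatory variable, standardizing both variables turns the slope into r.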
slide-72
SLIDE 72

MIT OpenCourseWare http://ocw.mit.edu

17.871 Political Science Laboratory

Spring 2012 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.