Sampling and Inference: The Quality of Data and Measures

SLIDE 1

Sampling and Inference

The Quality of Data and Measures 2012

SLIDE 2

Why do we sample?

[Figure: cost/benefit plotted against sample size N, with curves for benefit (precision) and cost (hassle factor)]

SLIDE 3

Effects of samples

  • Obvious: influences marginals
  • Less obvious
    – Allows effective use of time and effort
    – Effect on multivariate techniques
      • Sampling on independent variable: greater precision in regression estimates
      • Sampling on dependent variable: bias
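The contrast between the two kinds of selection can be illustrated with a small simulation. This is a minimal sketch, not part of the original slides: the population (y = 2x + noise), the selection rules, and the helper names `ols_slope` and `slope_of` are all assumptions chosen for illustration. It only shows the bias half of the claim; it does not measure precision.

```python
import random

def ols_slope(xs, ys):
    """Ordinary least squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: y = 2x + noise, so the true slope is 2.
TRUE_SLOPE = 2.0
population = []
for _ in range(20000):
    x = rng.gauss(0, 1)
    y = TRUE_SLOPE * x + rng.gauss(0, 2)
    population.append((x, y))

def slope_of(rows):
    """Fit the slope on a list of (x, y) pairs."""
    return ols_slope([x for x, _ in rows], [y for _, y in rows])

full = slope_of(population)
# Selection on the independent variable: keep only cases with extreme x.
on_x = slope_of([(x, y) for x, y in population if abs(x) > 1])
# Selection on the dependent variable: keep only cases with high y.
on_y = slope_of([(x, y) for x, y in population if y > 0])

print(f"full population:       {full:.2f}")
print(f"selected on x:         {on_x:.2f}")
print(f"selected on y (y > 0): {on_y:.2f}")
```

Under these assumptions the slope estimated after selecting on x should stay close to 2, while the slope estimated after selecting on y should be noticeably flatter.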

SLIDE 4

Sampling on Independent Variable

[Figure: scatterplot of y against x]

SLIDE 5

Sampling on Dependent Variable

[Figure: scatterplot of y against x]

SLIDE 6

Sampling

Consequences for Statistical Inference

SLIDE 7

Statistical Inference: Learning About the Unknown From the Known

  • Reasoning forward: distributions of sample means, when the population mean, μ, s.d., σ, and n are known.
  • Reasoning backward: learning about the population mean when only the sample, s.d., and n are known.

SLIDE 8

Reasoning Forward

SLIDE 9

Exponential Distribution Example

[Figure: histogram of inc (y-axis: Fraction)]

Mean = 250,000; Median = 125,000; s.d. = 283,474; Min = 0; Max = 1,000,000

SLIDE 10

Consider 10 random samples, of n = 100 apiece

Sample   Mean
1        253,396.9
2        198,789.6
3        271,074.2
4        238,928.7
5        280,657.3
6        241,369.8
7        249,036.7
8        226,422.7
9        210,593.4
10       212,137.3

[Figure: histogram of inc (y-axis: Fraction)]

SLIDE 11

Consider 10,000 samples of n = 100

N = 10,000; Mean = 249,993; s.d. = 28,559; Skewness = 0.060; Kurtosis = 2.92

[Figure: histogram of (mean) inc (y-axis: Fraction)]
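The "reasoning forward" exercise can be sketched as a simulation that draws many samples and looks at how their means are distributed. This is an illustrative sketch only: it assumes a pure exponential population with mean 250,000 (whose s.d. is therefore also 250,000, not the 283,474 reported for the slide's population), and the names `POP_MEAN`, `sample_mean`, and `results` are hypothetical.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

POP_MEAN = 250_000  # matches the slide's population mean
# Assumption: a pure exponential, whose s.d. equals its mean (250,000).
# The slide's population has s.d. 283,474, so the numbers differ somewhat.

def sample_mean(n):
    """Mean of one random sample of size n from the exponential population."""
    return statistics.fmean(rng.expovariate(1 / POP_MEAN) for _ in range(n))

results = {}
for n in (10, 100, 1000):
    # 1,000 samples of each size.
    means = [sample_mean(n) for _ in range(1000)]
    results[n] = (statistics.fmean(means), statistics.stdev(means))
    predicted = POP_MEAN / n ** 0.5  # population s.d. divided by sqrt(n)
    print(f"n={n:5d}  mean of means={results[n][0]:10,.0f}  "
          f"s.d. of means={results[n][1]:9,.0f}  sigma/sqrt(n)={predicted:9,.0f}")
```

In this sketch the mean of the sample means should sit near the population mean at every sample size, while their s.d. should shrink roughly in proportion to 1/√n.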

SLIDE 12

Consider 1,000 samples of various sizes

[Figure: three histograms of (mean) inc, one per sample size (y-axis: Fraction)]

n        Mean      s.d.     Skew    Kurt
10       250,105   90,891    0.38   3.13
100      250,498   28,297    0.02   2.90
1,000    249,938    9,376   -0.50   6.80

SLIDE 13

Difference of means example

[Figure: two histograms, inc for State 1 and inc2 for State 2 (y-axis: Fraction)]

State 1: Mean = 250,000
State 2: Mean = 300,000

SLIDE 14

Take 1,000 samples of 10, of each state, and compare them

First 10 samples

Sample   State 1        State 2
1        311,410    <   365,224
2        184,571    <   243,062
3        468,574    >   438,336
4        253,374    <   557,909
5        220,934    >   189,674
6        270,400    <   284,309
7        127,115    <   210,970
8        253,885    <   333,208
9        152,678    <   314,882
10       222,725    >   152,312

SLIDE 15

1,000 samples of 10

[Figure: scatterplot of (mean) inc2 against (mean) inc, with 300,000 and 250,000 marked]

State 2 > State 1: 673 times

SLIDE 16

1,000 samples of 100

[Figure: scatterplot of (mean) inc2 against (mean) inc, with 300,000 and 250,000 marked]

State 2 > State 1: 909 times

SLIDE 17

1,000 samples of 1,000

[Figure: scatterplot of (mean) inc2 against (mean) inc, with 300,000 and 250,000 marked]

State 2 > State 1: 1,000 times
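The comparison across sample sizes can be sketched in the same way. This is a hedged illustration rather than the code behind the slides: it assumes both states have pure exponential income distributions with means 250,000 and 300,000, and the function names are hypothetical, so the counts will not match the slides' 673, 909, and 1,000 exactly.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the sketch is repeatable

def sample_mean(pop_mean, n):
    """Mean of one random sample of size n from an exponential population."""
    return statistics.fmean(rng.expovariate(1 / pop_mean) for _ in range(n))

def times_state2_higher(n, reps=1000):
    """Count replications where State 2's sample mean exceeds State 1's."""
    wins = 0
    for _ in range(reps):
        if sample_mean(300_000, n) > sample_mean(250_000, n):
            wins += 1
    return wins

counts = {n: times_state2_higher(n) for n in (10, 100, 1000)}
for n, wins in counts.items():
    print(f"n={n:5d}: State 2 > State 1 in {wins} of 1,000 samples")
```

The pattern to look for is the one on the slides: with small samples the state with the higher population mean often loses the comparison, and with large samples it almost never does.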

SLIDE 18

Another way of looking at it: The distribution of Inc2 – Inc1

[Figure: three histograms of diff (Inc2 – Inc1), one per sample size (y-axis: Fraction)]

n        Mean      s.d.
10       51,845    124,815
100      49,704    38,774
1,000    49,816    13,932

SLIDE 19

Play with some simulations

  • http://onlinestatbook.com/stat_sim/sampling_dist/index.html

SLIDE 20

Reasoning Backward

When you know n, X̄, and s, but want to say something about μ

SLIDE 21

Central Limit Theorem

As the sample size n increases, the distribution of the mean X̄ of a random sample taken from practically any population approaches a normal distribution, with mean μ and standard deviation σ/√n

SLIDE 22

Calculating Standard Errors

In general:

  • std. err. = s/√n

SLIDE 23

Most important standard errors

Statistic                        Standard error
Mean                             s/√n
Proportion                       √(p(1 - p)/n)
Diff. of 2 means                 √(s₁²/n₁ + s₂²/n₂)
Diff. of 2 proportions           √(p₁(1 - p₁)/n₁ + p₂(1 - p₂)/n₂)
Diff. of 2 means (paired data)   s_d/√n
Regression (slope) coeff.        (s.e.r./√(n - 1)) × (1/s_x)
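The six standard errors listed on this slide translate directly into small functions. The sketch below is one possible rendering; the function and argument names are hypothetical, and the inputs in the usage section are the figures from the worked examples elsewhere in this deck.

```python
import math

def se_mean(s, n):
    """Standard error of a mean: s / sqrt(n)."""
    return s / math.sqrt(n)

def se_proportion(p, n):
    """Standard error of a proportion: sqrt(p(1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

def se_diff_means(s1, n1, s2, n2):
    """Standard error of a difference of two independent means."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def se_diff_proportions(p1, n1, p2, n2):
    """Standard error of a difference of two independent proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

def se_paired_diff(s_d, n):
    """Standard error of a mean difference in paired data."""
    return s_d / math.sqrt(n)

def se_slope(ser, n, s_x):
    """Standard error of a bivariate regression slope."""
    return ser / math.sqrt(n - 1) / s_x

# Inputs taken from the deck's worked examples.
tuition_se = se_mean(2196, 15)
approval_se = se_proportion(0.43, 1500)
tuition_diff_se = se_diff_means(2196, 15, 1894, 15)
party_diff_se = se_diff_proportions(0.43, 600, 0.82, 400)
slope_se = se_slope(13.8, 14, 8.14)

print(f"mean: {tuition_se:.0f}")
print(f"proportion: {approval_se:.3f}")
print(f"difference of means: {tuition_diff_se:.0f}")
print(f"difference of proportions: {party_diff_se:.2f}")
print(f"slope: {slope_se:.2f}")
```

Rounded as in the print statements, these values agree with the standard errors used in the deck's examples: 567, 0.013, 749, .03, and 0.47.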

SLIDE 24

Using Standard Errors, we can construct “confidence intervals”

  • Confidence interval (ci): an interval between two numbers, where there is a certain specified level of confidence that a population parameter lies within it
  • ci = sample parameter ± multiple * sample standard error

SLIDE 25

Constructing Confidence Intervals

  • Let’s say we draw a sample of tuitions from 15 private universities. Can we estimate what the average of all private university tuitions is?
  • N = 15
  • Average = 29,735
  • S.d. = 2,196
  • S.e. = s/√n = 2,196/√15 = 567

SLIDE 26

N = 15; avg. = 29,735; s.d. = 2,196; s.e. = s/√n = 567

The Picture

[Figure: normal curve centered on the mean, 29,735, with the x-axis marked from -4σ to +4σ and areas labelled 68%, 95%, and 99%]

  • 29,735 - 567 = 29,168; 29,735 + 567 = 30,302
  • 29,735 - 2*567 = 28,601; 29,735 + 2*567 = 30,869

SLIDE 27

Confidence Intervals for Tuition Example

  • 68% confidence interval = 29,735 ± 567 = [29,168 to 30,302]
  • 95% confidence interval = 29,735 ± 2*567 = [28,601 to 30,869]
  • 99% confidence interval = 29,735 ± 3*567 = [28,034 to 31,436]
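The three tuition intervals can be reproduced with a few lines of code. This is a minimal sketch with a hypothetical helper, `confidence_interval`. It uses the slides' rule-of-thumb multiples of 1, 2, and 3 standard errors; the exact normal multiplier for 95% is about 1.96, and with only 15 observations a t-based multiplier would be somewhat larger.

```python
import math

def confidence_interval(estimate, se, multiple):
    """Return (low, high) for estimate ± multiple * se."""
    margin = multiple * se
    return estimate - margin, estimate + margin

n, mean, sd = 15, 29_735, 2_196
se = round(sd / math.sqrt(n))  # rounded to a whole number, as on the slide

# Multiples of 1, 2, and 3 are the slides' approximations for 68%, 95%, 99%.
for level, multiple in (("68%", 1), ("95%", 2), ("99%", 3)):
    low, high = confidence_interval(mean, se, multiple)
    print(f"{level} confidence interval: [{low:,} to {high:,}]")
```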

SLIDE 28

What if someone (ahead of time) had said, “I think the average tuition of major research universities is $25k”?

  • Note that $25,000 is well out of the 99% confidence interval, [28,034 to 31,436]
  • Q: How far away is the $25k estimate from the sample mean?
    – A: Do it in z-scores: (29,735-25,000)/567 = 8.35
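The z-score calculation, and the tail probability it implies, can be sketched as follows. The helper names are hypothetical, and the p-value relies on a normal approximation, which is an assumption added here; the slide itself stops at the z-score.

```python
import math

def z_score(estimate, hypothesized, se):
    """Distance between estimate and hypothesized value, in standard errors."""
    return (estimate - hypothesized) / se

def two_sided_p_normal(z):
    """Two-sided tail probability under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

z = z_score(29_735, 25_000, 567)
print(f"z = {z:.2f}")
print(f"two-sided normal p-value = {two_sided_p_normal(z):.1e}")
```

The same two functions apply unchanged to the proportion, difference, and slope examples later in the deck, since each one divides a distance by a standard error.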

SLIDE 29

Constructing confidence intervals of proportions

  • Let us say we drew a sample of 1,500 adults and asked them if they approved of the way Barack Obama was handling his job as president. (March 23-25, 2012 Gallup Poll) Can we estimate the % of all American adults who approve?
  • N = 1500
  • p = .43
  • s.e. = √(p(1 - p)/n) = √(.43(1 - .43)/1500) = 0.013

http://www.gallup.com/poll/113980/gallup-daily-obama-job-approval.aspx

SLIDE 30

N = 1,500; p = .43; s.e. = √(p(1 - p)/n) = .013

The Picture

[Figure: normal curve centered on the mean, .43, with the x-axis marked from -4σ to +4σ and areas labelled 68%, 95%, and 99%]

  • .43 - .013 = .42; .43 + .013 = .44
  • .43 - 2*.013 = .40; .43 + 2*.013 = .46

SLIDE 31

Confidence Intervals for Obama approval example

  • 68% confidence interval = .43 ± .013 = [.42 to .44]
  • 95% confidence interval = .43 ± 2*.013 = [.40 to .46]
  • 99% confidence interval = .43 ± 3*.013 = [.39 to .47]

SLIDE 32

What if someone (ahead of time) had said, “I think Americans are equally divided in how they think about Obama.”

  • Note that 50% is well out of the 99% confidence interval, [39% to 47%]
  • Q: How far away is the 50% estimate from the sample proportion?
    – A: Do it in z-scores: (.43-.5)/.013 = -5.4

SLIDE 33

Constructing confidence intervals of differences of means

  • Let’s say we draw a sample of tuitions from 15 private and 15 public universities. Can we estimate what the difference in average tuitions is between the two types of universities?
  • N = 15 in both cases
  • Average = 29,735 (private); 5,498 (public); diff = 24,238
  • s.d. = 2,196 (private); 1,894 (public)
  • s.e. = √(s₁²/n₁ + s₂²/n₂) = √(4,822,416/15 + 3,587,236/15) = 749

SLIDE 34

N = 15 twice; diff = 24,238; s.e. = 749

The Picture

[Figure: normal curve centered on the mean, 24,238, with the x-axis marked from -4σ to +4σ and areas labelled 68%, 95%, and 99%]

  • 24,238 - 749 = 23,489; 24,238 + 749 = 24,987
  • 24,238 - 2*749 = 22,740; 24,238 + 2*749 = 25,736

SLIDE 35

Confidence Intervals for difference of tuition means example

  • 68% confidence interval = 24,238 ± 749 = [23,489 to 24,987]
  • 95% confidence interval = 24,238 ± 2*749 = [22,740 to 25,736]
  • 99% confidence interval = 24,238 ± 3*749 = [21,991 to 26,485]

SLIDE 36

What if someone (ahead of time) had said, “Private universities are no more expensive than public universities”?

  • Note that $0 is well out of the 99% confidence interval, [$21,991 to $26,485]
  • Q: How far away is the $0 estimate from the sample difference in means?
    – A: Do it in z-scores: (24,238-0)/749 = 32.4

SLIDE 37

Constructing confidence intervals of difference of proportions

  • Let us say we drew a sample of 1,500 adults and asked them if they approved of the way Barack Obama was handling his job as president. (March 23-25, 2012 Gallup Poll). We focus on the 1000 who are either independents or Democrats. Can we estimate whether independents and Democrats view Obama differently?
  • N = 600 ind; 400 Dem.
  • p = .43 (ind.); .82 (Dem.); diff = .39
  • s.e. = √(p₁(1 - p₁)/n₁ + p₂(1 - p₂)/n₂) = √(.43(1 - .43)/600 + .82(1 - .82)/400) = .03

SLIDE 38
diff. p. = .39; s.e. = .03

The Picture

[Figure: normal curve centered on the mean, .39, with the x-axis marked from -4σ to +4σ and areas labelled 68%, 95%, and 99%]

  • .39 - .03 = .36; .39 + .03 = .42
  • .39 - 2*.03 = .33; .39 + 2*.03 = .45

SLIDE 39

Confidence Intervals for Obama Ind/Dem approval example

  • 68% confidence interval = .39 ± .03 = [.36 to .42]
  • 95% confidence interval = .39 ± 2*.03 = [.33 to .45]
  • 99% confidence interval = .39 ± 3*.03 = [.30 to .48]

SLIDE 40

What if someone (ahead of time) had said, “I think Democrats and Independents are equally unsupportive of Obama”?

  • Note that 0% is well out of the 99% confidence interval, [30% to 48%]
  • Q: How far away is the 0% estimate from the sample difference in proportions?
    – A: Do it in z-scores: (.39-0)/.03 = 13

SLIDE 41

Constructing confidence intervals of regression coefficients

  • Let’s look at the relationship between the midterm seat loss by the President’s party and the President’s Gallup poll rating

Slope = 1.97
N = 14
s.e.r. = 13.8
s_x = 8.14
s.e.slope = (s.e.r./√(n - 1)) × (1/s_x) = (13.8/√13) × (1/8.14) = 0.47

[Figure: scatterplot of change in House seats (loss) against Gallup approval rating (Nov.), with points labelled by midterm year (1938 to 2002) and a line of fitted values]

SLIDE 42

N = 14; slope = 1.97; s.e. = 0.47

The Picture

[Figure: normal curve centered on the mean, 1.97, with the x-axis marked from -4σ to +4σ and areas labelled 68%, 95%, and 99%]

  • 1.97 - 0.47 = 1.50; 1.97 + 0.47 = 2.44
  • 1.97 - 2*0.47 = 1.03; 1.97 + 2*0.47 = 2.91

SLIDE 43

Confidence Intervals for regression example

  • 68% confidence interval = 1.97 ± 0.47 = [1.50 to 2.44]
  • 95% confidence interval = 1.97 ± 2*0.47 = [1.03 to 2.91]
  • 99% confidence interval = 1.97 ± 3*0.47 = [0.56 to 3.38]

SLIDE 44

What if someone (ahead of time) had said, “There is no relationship between the president’s popularity and how his party’s House members do at midterm”?

  • Note that 0 is well out of the 99% confidence interval, 1.97 ± 3*0.47 = [0.56 to 3.38]
  • Q: How far away is the 0 estimate from the sample slope?
    – A: Do it in z-scores: (1.97-0)/0.47 = 4.19

SLIDE 45
The Stata output

. reg loss gallup if year>1948

      Source |       SS       df       MS              Number of obs =      14
-------------+------------------------------           F(  1,    12) =   17.53
       Model |  3332.58872     1  3332.58872           Prob > F      =  0.0013
    Residual |  2280.83985    12  190.069988           R-squared     =  0.5937
-------------+------------------------------           Adj R-squared =  0.5598
       Total |  5613.42857    13  431.802198           Root MSE      =  13.787

------------------------------------------------------------------------------
        loss |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gallup |    1.96812   .4700211     4.19   0.001     .9440315    2.992208
       _cons |  -127.4281   25.54753    -4.99   0.000    -183.0914   -71.76486
------------------------------------------------------------------------------
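Several entries in the Stata table follow arithmetically from others, which makes the output easy to cross-check. The sketch below is in Python rather than Stata and is only an illustration; the variable names are hypothetical, and the critical value 2.1788 (two-sided 95%, t distribution with 12 degrees of freedom) is taken from a standard t table rather than from the slides.

```python
import math

# Values copied from the Stata output above.
ss_model, ss_resid, ss_total = 3332.58872, 2280.83985, 5613.42857
df_model, df_resid = 1, 12
coef, se = 1.96812, 0.4700211

ms_resid = ss_resid / df_resid             # residual mean square
f_stat = (ss_model / df_model) / ms_resid  # F(1, 12)
r_squared = ss_model / ss_total
root_mse = math.sqrt(ms_resid)             # the s.e.r. used for the slope s.e.
t_stat = coef / se

# Assumption: 2.1788 is the two-sided 95% critical value of the t
# distribution with 12 degrees of freedom, from a standard t table.
T_CRIT_12 = 2.1788
ci = (coef - T_CRIT_12 * se, coef + T_CRIT_12 * se)

print(f"F = {f_stat:.2f}, R-squared = {r_squared:.4f}, Root MSE = {root_mse:.3f}")
print(f"t = {t_stat:.2f}, 95% CI = [{ci[0]:.4f}, {ci[1]:.4f}]")
```

Stata's 95% interval is a little wider than the slides' 1.97 ± 2*0.47 because it multiplies the standard error by the t critical value of about 2.18 instead of 2.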

SLIDE 46

MIT OpenCourseWare http://ocw.mit.edu

17.871 Political Science Laboratory

Spring 2012 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.