
STAT 373/ STAT 814_STAT714 Week 9


2019

1

Week 9: Example

1996 Australian Bureau of Statistics census data

  • We will use data from the 1996 Australian census as an example.

  • Units: Local Government Areas (LGAs) of NSW (182 at the time).

  • Data are in the file LGA.MTW, available on the unit's iLearn site.

2

LGAs

Albury, Armidale, Ashfield, Auburn, Ballina, Balranald, Bankstown, Barraba, Bathurst, Baulkham Hills, Bega Valley, Bellingen, Berrigan, Bingara, Blacktown, …, Wollongong, Woollahra, Wyong, Yallaroi, Yarrowlumla, Yass, Young

3

Variables in LGA.mtw

Variable    Mean     Median
Total M     18335    6997
Total F     18803    6893
Total P     37138    13890
GE15 M      14286    5233
GE15 F      14944    5193
GE15 P      29231    10426
Aborig M    269.9    146.5
Aborig F    277.5    142.0
Aborig P    547.3    292.0
AusBorn     13302    5991
AusBorn     13716    6010
AusBorn     27018    12001
OSBorn M    4246     512
OSBorn F    4256     432
OSBorn P    8502     935
AusCit M    16171    6274
AusCit F    16599    6330
AusCit P    32769    12605
Unempl M    944      345
Unempl F    609.7    209.5
Unempl P    1554     586

4

  • We will be using these data to illustrate sampling and estimation techniques.

  • Sampling frame: list of 182 LGAs with IDs from 1 to 182 (N=182 LGAs)

  • We will estimate quantities such as the total overseas-born population of NSW on the basis of a random sample, and compare our answer with the actual population total.

5

Overseas-born population

  • Say we wish to estimate the mean LGA OS-born population, $\mu$, and the total NSW OS-born population, $\tau$, on the basis of a simple random sample of size n = 30 LGAs. (Note: You can find instructions on obtaining a SRS in Minitab on slides 47 and 48 of this lecture.)

  • Histogram of the number of OS-born in all 182 LGAs, ie, the population (see next slide).

6

The population is very skewed, so the normal approximation for the sample mean may be poor for n = 30.

[Figure: histogram of OSBorn P for all 182 LGAs; x-axis: OSBorn P (0 to 100,000), y-axis: Frequency.]


7

Population values

Descriptive Statistics

Variable    N     Mean   Median   TrMean   StDev   SE Mean
OSBorn P    182   8502   935      5918     16237   1204

Variable    Minimum   Maximum   Q1    Q3
OSBorn P    18        97203     275   8921

  • Mean: $\mu = 8{,}502$
  • Total: $\tau = N\mu = 182 \times 8{,}502 = 1{,}547{,}364$

8

Sample (n=30) drawn using Minitab:

(click Calc, Random Data, Sample from Columns and then follow it through)

LGA             OS Born     LGA               OS Born
Tumbarumba      278         Dungog            377
Albury          3998        Bourke            162
Muswellbrook    930         Pittwater         11177
Yarrowlumla     1449        Nambucca          1477
Mudgee          1382        Junee             300
Botany          16002       South Sydney      27729
Hay             183         Narrabri          611
Maitland        3624        Urana             48
Great Lakes     2763        Rockdale          33491
Warringah       31893       Culcairn          227
Bland           266         Mosman            7129
Wagga Wagga     3787        Lake Macquarie    16914
Crookwell       184         Kogarah           14914
Yass            879         Eurobodalla       3996
Holbrook        137         Shoalhaven        9502

9

Sample Statistics

Descriptive Statistics

Variable    N    Mean   Median   TrMean   StDev   SE Mean
OSBorn s    30   6527   1463     5009     9713    1773

Variable    Minimum   Maximum   Q1    Q3
OSBorn s    48        33491     275   9921

10

Estimation of the population mean

Based on the sample of 30 LGAs, we have

$$\bar{y} = 6527, \quad s = 9713, \quad f = \frac{30}{182} = 0.165, \quad t_{.975,\,29} = 2.0452$$

95% CI for population mean OS-born:

$$\bar{y} \pm t_{.975,\,29}\, s\sqrt{(1-f)/n} = 6527 \pm 2.0452 \times 9713 \sqrt{(1-0.165)/30} = 6{,}527 \pm 3{,}314 = (3{,}213,\ 9{,}841)$$

where the estimated standard error is

$$\widehat{SE}(\bar{y}) = s\sqrt{(1-f)/n}$$

11

Estimation of the population total

We have

$$y_T = N\bar{y} = 182 \times 6527 = 1{,}187{,}914, \quad s = 9713, \quad f = 0.165, \quad t_{.975,\,29} = 2.0452$$

95% CI for total OS-born:

$$y_T \pm t_{.975,\,29}\, N s\sqrt{(1-f)/n} = 1{,}187{,}914 \pm 2.0452 \times 182 \times 9713 \sqrt{(1-0.165)/30} = 1{,}187{,}914 \pm 603{,}175 = (584{,}739,\ 1{,}791{,}089)$$

This is a large error bound; the sample size may be too small.
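The two interval calculations above can be reproduced with a few lines of Python. This is only a sketch using the summary statistics quoted on the slides; the function name `srs_margin` is illustrative, and the t quantile is hard-coded rather than looked up. Because it uses the unrounded sampling fraction 30/182 instead of 0.165, the total's margin differs slightly from the 603,175 shown above.

```python
import math

def srs_margin(s, n, N, crit):
    """Margin of error for the SRS mean: crit * s * sqrt((1 - f) / n)."""
    f = n / N                      # sampling fraction
    return crit * s * math.sqrt((1 - f) / n)

N, n = 182, 30
ybar, s = 6527, 9713               # sample mean and standard deviation from the slides
t_crit = 2.0452                    # t_{.975, 29}, as quoted on the slides

mean_margin = srs_margin(s, n, N, t_crit)
mean_ci = (ybar - mean_margin, ybar + mean_margin)

total_hat = N * ybar               # estimated total, N * ybar
total_margin = N * mean_margin     # the margin for the total scales by N
total_ci = (total_hat - total_margin, total_hat + total_margin)

print(f"mean:  {ybar} +/- {mean_margin:.0f}")
print(f"total: {total_hat} +/- {total_margin:.0f}")
```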

12

NOTE:

  • We find that the true population values of $\mu$ = 8,502 and $\tau$ = 1,547,364 do in fact lie in the 95% confidence intervals.

  • However, because of the severe skewness of the population values, it would have been more appropriate to stratify the population on some criterion. [We will return to this issue later.]


13

Sample size required

Say we wish to estimate the total OS-born in NSW to within 200,000 persons (the error bound) of the true value, with a probability of 0.95.

Take the previous sample as a pilot study. We estimate $\sigma$ by s = 9713. Given D = 200,000 (from Lecture 8), we then have

$$n = \left[\frac{1}{N} + \left(\frac{D}{N z_{.975}\,\sigma}\right)^2\right]^{-1} \approx \left[\frac{1}{182} + \left(\frac{200000}{182 \times 1.96 \times 9713}\right)^2\right]^{-1} = 113.296$$

Take n = 114.

14

Now let's take a SRS of size n=114 and see what error bound we get:

Descriptive Statistics

Variable   N     Mean   Median   TrMean   StDev   SE Mean
C26        114   9097   935      6044     17837   1671

Variable   Minimum   Maximum   Q1    Q3
C26        72        97203     275   8464

15

We have

$$y_T = N\bar{y} = 182 \times 9097 = 1{,}655{,}654, \quad s = 17{,}837, \quad f = \frac{114}{182} = 0.626, \quad z_{.975} = 1.96 \ \text{(as we have a large sample here)}$$

95% CI for total OS-born:

$$y_T \pm z_{.975}\, N s\sqrt{(1-f)/n} = 1{,}655{,}654 \pm 1.96 \times 182 \times 17{,}837 \sqrt{(1-0.626)/114} = 1{,}655{,}654 \pm 364{,}263$$

16

Note:

  • Why has the error bound turned out to be 364,263 (compared to 603,175 when n = 30), still much greater than the 200,000 planned?

  • Recall that we used s to estimate the population standard deviation $\sigma$ in the calculation of the sample size, n.

17

  • If we had used the population standard deviation, $\sigma$ = 16,237, in the calculation of n, we would have obtained

$$n = \left[\frac{1}{182} + \left(\frac{200000}{182 \times 1.96 \times 16237}\right)^2\right]^{-1} = 149.5,$$

ie, we need n = 150.

18

We had:

Estimate of $\sigma$ from pilot sample: s = 9,713
Actual value: $\sigma$ = 16,237 >> s

Note: The pilot sample underestimated $\sigma$, which led us to underestimate the sample size required.
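The sample-size and error-bound arithmetic above can be checked with the short Python sketch below. The function names are illustrative, and the sample-size formula is written in the form implied by the worked numbers on the slides (n = [1/N + (D/(N z σ))²]⁻¹), so treat it as an assumption about the Lecture 8 formula rather than a quotation of it.

```python
import math

def n_for_total(N, sigma, D, z=1.96):
    """Sample size so that the total is estimated to within D (with fpc)."""
    return 1.0 / (1.0 / N + (D / (N * z * sigma)) ** 2)

def error_bound_total(N, n, s, crit=1.96):
    """Error bound for the estimated total: crit * N * s * sqrt((1 - f) / n)."""
    f = n / N
    return crit * N * s * math.sqrt((1 - f) / n)

N, D = 182, 200_000
n_pilot = n_for_total(N, 9713, D)    # using the pilot estimate s
n_true = n_for_total(N, 16237, D)    # using the population sigma

bound_114 = error_bound_total(N, 114, 17837)  # bound achieved by the n = 114 sample

print(math.ceil(n_pilot), math.ceil(n_true), round(bound_114))
```

The gap between the two sample sizes comes entirely from the standard deviation plugged in, which is why a pilot sample from a very skewed population can be misleading.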


19

Estimating a population proportion p

Say we are interested in the presence/absence of some characteristic, eg,

– person has HIV/AIDS
– person is watching a particular TV program
– person supports the use of nuclear power in Australia

20

  • We may want to estimate the

– proportion/percentage (p)
– number (a)

in the population that possess the characteristic.

A SRS of size n allows us to estimate

  • p = population proportion
  • a = Np = population total

21

Let r = number in sample with the characteristic of interest. Then we estimate p by

$$\hat{p} = \frac{r}{n}$$

and a by

$$\hat{a} = N\hat{p} = N\frac{r}{n}$$

22

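Let

$$u_i = \begin{cases} 1 & \text{if the } i\text{th member of the population has the characteristic} \\ 0 & \text{if the } i\text{th member does not have the characteristic} \end{cases}$$

Then

$$p = \frac{u_1 + u_2 + \cdots + u_N}{N} \quad (= \mu,\ \text{the population mean of the binary variable } u_i)$$

and

$$a = u_1 + u_2 + \cdots + u_N \quad (= \tau,\ \text{the population total})$$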

23

i.e.

  • p = population mean (of the binary variable with values of 0 or 1)

  • a = population total

Good news

  • We know properties of the estimators of means (and totals), so we know properties of the estimators of p and a shown on Slide 21.
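To make the indicator coding concrete, here is a minimal Python sketch. The ten-unit population and the four sampled positions are hypothetical, chosen only for illustration; they are not taken from the LGA data.

```python
# Hypothetical population of N = 10 units; 1 = has the characteristic.
population = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
N = len(population)

p = sum(population) / N    # population proportion = mean of the u_i
a = sum(population)        # population total = N * p

# Hypothetical SRS of n = 4 units (positions picked by hand for illustration).
sample = [population[i] for i in (0, 2, 3, 7)]
n = len(sample)
r = sum(sample)            # number in the sample with the characteristic

p_hat = r / n              # estimate of p
a_hat = N * p_hat          # estimate of a

print(p, a, p_hat, a_hat)
```

Because p and a are just a mean and a total of 0/1 values, everything already known about estimating means and totals under SRS carries over unchanged.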

24

Extra simplification

Here $u_i^2 = u_i$ since $u_i$ = 0 or 1; thus the population variance works out as follows:

$$\begin{aligned}
\sigma^2 &= \frac{1}{N-1}\sum_{i=1}^{N}(u_i-\mu)^2 = \frac{1}{N-1}\Big[\sum_{i=1}^{N}u_i^2 - N\mu^2\Big] \quad (\text{NB: } \mu = \text{the population mean}) \\
&= \frac{1}{N-1}\Big[\sum_{i=1}^{N}u_i - N\mu^2\Big] = \frac{1}{N-1}\big[Np - Np^2\big] \quad (\text{recall } \mu = p) \\
&= \frac{N}{N-1}\,p(1-p) = \frac{N}{N-1}\,pq, \quad \text{where } q = 1-p.
\end{aligned}$$


25

Sample estimator of $\sigma^2$

Suppose we have a simple random sample $y_1, y_2, \ldots, y_n$. (Note that the $y_i$'s will be 0's or 1's.) Then the sample variance is as shown on the next slide.

26

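Sample estimator of $\sigma^2$

$$\begin{aligned}
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2 = \frac{1}{n-1}\Big[\sum_{i=1}^{n}y_i^2 - n\bar{y}^2\Big] = \frac{1}{n-1}\Big[\sum_{i=1}^{n}y_i - n\bar{y}^2\Big] \\
&= \frac{1}{n-1}\big[n\hat{p} - n\hat{p}^2\big] \quad (\text{recall } \bar{y} = \hat{p}) \\
&= \frac{n}{n-1}\,\hat{p}(1-\hat{p}) = \frac{n}{n-1}\,\hat{p}\hat{q}, \quad \text{where } \hat{q} = 1-\hat{p}.
\end{aligned}$$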

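27

Apply the results from last week

$$E(\bar{y}) = \mu \ \Rightarrow\ E(\hat{p}) = p$$

$$Var(\bar{y}) = (1-f)\frac{\sigma^2}{n} \ \Rightarrow\ Var(\hat{p}) = \frac{(1-f)}{n}\cdot\frac{N}{N-1}\,pq = \frac{N-n}{N-1}\cdot\frac{pq}{n} \approx (1-f)\frac{pq}{n} \quad \text{when N is relatively large.}$$

$$E(s^2) = \sigma^2 \ \Rightarrow\ E\Big[\frac{n}{n-1}\,\hat{p}(1-\hat{p})\Big] = \frac{N}{N-1}\,p(1-p)$$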

28

Central Limit Theorem

$$\hat{p} \ \dot\sim\ N\!\left(p,\ (1-f)\frac{pq}{n}\right)$$

since

$$Var(\hat{p}) = \frac{1-f}{n}\cdot\frac{N}{N-1}\,pq \approx (1-f)\frac{pq}{n}, \quad \text{as } \frac{N}{N-1} \approx 1 \text{ for large } N.$$

29

  • Notice the similarity of the variance formula with that for the variance of a binomial proportion, viz. p(1-p)/n.

  • Here again the effect of sampling without replacement is seen in the factor (1-f).

  • As N → ∞, f → 0 and the properties of estimators based on sampling without replacement approach those based on sampling with replacement.

30

Warning

If p is “too close” to 0 or 1 then the normal approximation may be unreliable. When np or n(1-p) is less than about 30 and p < 0.25 or p > 0.75, the normal approximation may not be reliable.


31

Confidence interval for p

As $\hat{p}$ has an approximately normal distribution, ie,

$$\hat{p} \ \dot\sim\ N\!\left(p,\ (1-f)\frac{pq}{n}\right),$$

we could construct CIs in the usual way:

$$\hat{p} \pm z_{\alpha/2}\sqrt{(1-f)\frac{p(1-p)}{n}}$$

 32

Problem

The CI involves the population proportion, p, which is exactly the quantity that we don't know!

Recall

$$Var(\hat{p}) = (1-f)\frac{\sigma^2}{n}, \quad \text{where } \sigma^2 = \frac{N}{N-1}\,p(1-p).$$

Estimating $\sigma^2$ by $s^2 = \frac{n}{n-1}\hat{p}(1-\hat{p})$, we estimate $Var(\hat{p})$ by

$$\widehat{Var}(\hat{p}) = \frac{(1-f)}{n}\cdot\frac{n}{n-1}\,\hat{p}(1-\hat{p}) = (1-f)\frac{\hat{p}(1-\hat{p})}{n-1}.$$

33

This gives us the (approximate) 100(1-$\alpha$)% CI:

$$\hat{p} \pm z_{\alpha/2}\sqrt{(1-f)\frac{\hat{p}(1-\hat{p})}{n-1}}$$

where the square-root term is SE($\hat{p}$).

34

Sample size determination

We specify a margin of error (ie, error bound), B, such that

$$\Pr\big(|\hat{p} - p| \le B\big) \ge 1 - \alpha$$

and we seek the minimum value of n to achieve this inequality.

35

Sample size determination

Using the normal approximation result,

$$\Pr\left(\frac{|\hat{p} - p|}{\sqrt{(1-f)\frac{p(1-p)}{n}\cdot\frac{N}{N-1}}} \le \frac{B}{\sqrt{(1-f)\frac{p(1-p)}{n}\cdot\frac{N}{N-1}}}\right) \ge 1-\alpha \quad\Rightarrow\quad \frac{B}{\sqrt{(1-f)\frac{p(1-p)}{n}\cdot\frac{N}{N-1}}} \ge z_{\alpha/2}$$

36

Ignoring the fpc and taking $\frac{N}{N-1} \approx 1$, we get:

$$n = \frac{p(1-p)\,z_{\alpha/2}^2}{B^2}$$

Not ignoring the fpc, after some manipulation:

$$n = \left[\frac{N-1}{N}\cdot\frac{B^2}{p(1-p)\,z_{\alpha/2}^2} + \frac{1}{N}\right]^{-1}$$


37

What about p?

Both of the above formulae for n involve the population proportion p, in the form p(1-p). Let's look at p(1-p) over the range 0<p<1.

p      p(1-p)
0.1    0.09
0.2    0.16
0.3    0.21
0.4    0.24
0.5    0.25
0.6    0.24
0.7    0.21
0.8    0.16
0.9    0.09

[Figure: plot of p(1-p) against p for 0 < p < 1.]

Note: p(1-p) attains a maximum of 0.25 at p=0.5.
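The location of the maximum can also be seen without a table or plot by completing the square; this short derivation is added here as a check and is not part of the original slides.

```latex
p(1-p) \;=\; \tfrac{1}{4} - \left(p - \tfrac{1}{2}\right)^{2} \;\le\; \tfrac{1}{4},
\qquad \text{with equality only at } p = \tfrac{1}{2}.
```

The expression is symmetric about p = 0.5, which matches the tabulated values (for example 0.21 at both p = 0.3 and p = 0.7).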

38

When determining sample size, we can therefore use

  • an estimate of p (sometimes a range of likely values of p will be available; take the value of p leading to the largest n)

  • or p=0.5 (when no estimate of p is available, or when a range of likely values of p includes 0.5)

Note: Using p=0.5 in the formula for n will result in a conservative (i.e. possibly larger than necessary) sample size.

39

LGA example

Percentage of overseas-born people in LGAs:

[Figure: histogram of Percent OS Born across the LGAs; x-axis: Percent OS Born, y-axis: Frequency.]

40

If we use the percentage of overseas-born people in an LGA as our response of interest, we could answer the following research question: “Does an LGA have more than 20% of its residents born overseas?”

(In fact, 42 of the 182 LGAs have more than 20% overseas-born people. The population proportion is then $p = 42/182 = 0.2308$.)

41

Sample size determination

Example: We need to determine the sample size required to estimate the true proportion of LGAs with “%OS born ≥ 20%” to within 5% (ie, 0.05) of the true value (margin of error), with 0.95 probability (ie, $\alpha$ = 0.05).

42

Ignoring the fpc, taking $\frac{N}{N-1} \approx 1$, and using p = 0.5:

$$n = \frac{p(1-p)\,z_{\alpha/2}^2}{B^2} = \frac{.5 \times .5 \times 1.96^2}{.05^2} = 384.2 \ (!)$$


43

Not ignoring the fpc:

$$n = \left[\frac{N-1}{N}\cdot\frac{B^2}{p(1-p)\,z_{\alpha/2}^2} + \frac{1}{N}\right]^{-1} = \left[\frac{181}{182}\cdot\frac{.05^2}{.5 \times .5 \times 1.96^2} + \frac{1}{182}\right]^{-1} = 123.7$$

We therefore take n = 124.

Here we had better NOT ignore the fpc, given that N is not that large; ie, we use n = 124.
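The two sample-size formulas for a proportion can be compared directly in Python. This is a sketch with illustrative function names; the inputs (N = 182, B = 0.05, z = 1.96) are those of the example, and p = 0.25 is included only as a second planning value for comparison.

```python
import math

def n_prop_ignoring_fpc(p, B, z=1.96):
    """n = p(1-p) z^2 / B^2, ignoring the finite population correction."""
    return p * (1 - p) * z ** 2 / B ** 2

def n_prop_with_fpc(p, B, N, z=1.96):
    """n = [ (N-1)/N * B^2 / (p(1-p) z^2) + 1/N ]^(-1)."""
    return 1.0 / ((N - 1) / N * B ** 2 / (p * (1 - p) * z ** 2) + 1.0 / N)

N, B = 182, 0.05
n_no_fpc = n_prop_ignoring_fpc(0.5, B)      # larger than the population itself
n_fpc = n_prop_with_fpc(0.5, B, N)          # conservative choice p = 0.5
n_fpc_p25 = n_prop_with_fpc(0.25, B, N)     # a smaller planning value of p

print(round(n_no_fpc, 1), math.ceil(n_fpc), round(n_fpc_p25, 1))
```

Ignoring the fpc asks for more units than the 182 LGAs that exist, which is the reason the correction cannot be dropped when N is this small.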

44

Descriptive statistics of the 124 LGAs sampled from the 182 LGAs in Minitab:

Variable   N     Mean     Median   TrMean   StDev    SE Mean
GE20 sam   124   0.1935   0.0000   0.1607   0.3967   0.0356

Variable   Minimum   Maximum   Q1       Q3
GE20 sam   0.0000    1.0000    0.0000   0.0000

Note: GE20 is coded as 1 for an LGA with more than 20% overseas-born people and 0 otherwise.

45

We have $\hat{p} = 0.1935$ with 95% confidence interval:

$$\hat{p} \pm z_{\alpha/2}\sqrt{(1-f)\frac{\hat{p}(1-\hat{p})}{n-1}} = .1935 \pm 1.96\sqrt{\left(1-\frac{124}{182}\right)\frac{.1935\,(1-.1935)}{123}} = .1935 \pm .039 = (.154,\ .233)$$

This interval includes the true population proportion p=0.2308.
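The interval above can be reproduced with the following Python sketch. The function name `proportion_ci` is illustrative; the inputs are the sample values quoted on the slide.

```python
import math

def proportion_ci(p_hat, n, N, z=1.96):
    """Approximate CI for p under SRS: p_hat +/- z * sqrt((1-f) p_hat (1-p_hat) / (n-1))."""
    f = n / N
    se = math.sqrt((1 - f) * p_hat * (1 - p_hat) / (n - 1))
    margin = z * se
    return p_hat - margin, p_hat + margin, margin

lower, upper, margin = proportion_ci(0.1935, n=124, N=182)
p_true = 42 / 182                   # population proportion quoted earlier

print(round(lower, 3), round(upper, 3), round(margin, 3))
print(lower < p_true < upper)
```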

46

Comments on the CI

  • Note that the margin of error has come out to be 0.039, which is less than the 0.05 specified.

  • This has happened because we conservatively set p=0.5 in the formula for determination of n.

  • If we set p=0.25, we get n = 111.8.
  • Taking n=112 would have resulted in a sample with a margin of error greater than that for n=124.

47

Getting a SRS using Minitab

Draw a random sample of size 30 from LGA.MTW (available on the unit iLearn) in Minitab:

  • Start Minitab;
  • Open LGA.MTW in Minitab;
  • Click the Calc button on the top menu, then select Random Data and then select Sample from Columns (a dialog window will appear);

48

  • Fill in the dialog screen appropriately;
  • Click the OK button to finish.

In the dialog: specify the sample size; select the columns required; the same number of columns must be specified in both boxes.
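For readers without Minitab, the same kind of draw can be sketched in Python using only the standard library. This is an assumption-laden stand-in, not the Minitab procedure itself: it samples LGA IDs 1 to 182 from the sampling frame without replacement, and the seed value is arbitrary.

```python
import random

def draw_srs(N, n, seed=None):
    """Draw a simple random sample of n unit IDs from 1..N without replacement."""
    rng = random.Random(seed)               # seeded generator for a reproducible draw
    return sorted(rng.sample(range(1, N + 1), n))

ids = draw_srs(N=182, n=30, seed=2019)      # the seed here is arbitrary
print(ids)
```

The selected IDs would then be used to pick the corresponding rows (LGAs) from the data file.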


49

  • To check, you may examine the sample and obtain some descriptive statistics of it.

– Eg, estimate the mean and the total number of people who were unemployed, based on this sample (C55) of size 30.

Descriptive Statistics: Unempl P_1

Variable     Count   Mean   StDev   Minimum   Median   Maximum
Unempl P_1   30      1833   2267    89        786      9066

50

  • Use the sample information on the previous slide to estimate the mean and total number of unemployed persons in NSW, and their corresponding 95% confidence intervals. (You may refer to the examples on slides 10 and 11.)

Your solution: … …

51

  • Comment on the estimated results on the previous slide against the actual population values below.

Descriptive Statistics: Unempl P (population)

Variable   Count   Mean   StDev   Sum      Minimum   Median   Maximum
Unempl P   182     1554   2597    282778   18        586      21283

52

Stratified random sampling

Lohr chapter 4

Example

Consider the following (artificial) population:

{6, 3, 4, 4, 5, 3, 6, 2, 3, 2, 2, 6, 5, 3, 5, 2, 4, 6, 4, 5}

We have $N = 20$, $\mu = 4$, $\sigma^2 = 2.105$.

53

Take a SRS of size n=5:

Worst cases:

  • 2, 2, 2, 2, 3   $\bar{y}$ = 2.2
  • 5, 6, 6, 6, 6   $\bar{y}$ = 5.8

$$Var(\bar{y}) = (1-f)\frac{\sigma^2}{n} = \left(1-\frac{5}{20}\right)\frac{2.105}{5} = 0.3158$$

54

Rearrange the population:

Rearrange the population into 5 homogeneous strata, and take a SRS of size 1 from each:

Stratum 1:   2 2 2 2   → SRS, n = 1
Stratum 2:   3 3 3 3   → SRS, n = 1
Stratum 3:   4 4 4 4   → SRS, n = 1
Stratum 4:   5 5 5 5   → SRS, n = 1
Stratum 5:   6 6 6 6   → SRS, n = 1

Resulting sample: (2, 3, 4, 5, 6) with $\bar{y}$ = 4 and var($\bar{y}$) = 0

Note: We will always get this sample.
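The contrast between the SRS and the stratified design for the artificial population can be verified with a short Python sketch: it recomputes the population mean, variance and Var(ȳ) for an SRS of size 5, then enumerates every possible one-per-stratum sample.

```python
from itertools import product
from statistics import mean, variance

population = [6, 3, 4, 4, 5, 3, 6, 2, 3, 2, 2, 6, 5, 3, 5, 2, 4, 6, 4, 5]
N, n = len(population), 5

mu = mean(population)                      # population mean
sigma2 = variance(population)              # population variance (divisor N - 1)
var_ybar_srs = (1 - n / N) * sigma2 / n    # Var(ybar) under SRS without replacement

# Five homogeneous strata; take one unit from each and list every possible sample mean.
strata = [[value] * 4 for value in (2, 3, 4, 5, 6)]
stratified_means = {mean(sample) for sample in product(*strata)}

print(mu, round(sigma2, 3), round(var_ybar_srs, 4), stratified_means)
```

Every one of the 4⁵ stratified samples has mean 4, so the estimator has zero variance in this (extreme) case.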


55

Note:

  • This sample is known as a stratified random sample.

  • The variance was reduced from 0.3158 in the SRS case to the absolute minimum possible, i.e. zero, in the stratified random sample case (note this is a rather extreme case).

  • The point of stratifying a population is to reduce the variance of the estimators.

56

Notation

  • Number of strata: L
  • Stratum sizes: N1, N2, …, NL
  • Total population size: N= N1+ N2+ …+ NL
  • Members of the ith stratum: $u_{i1}, u_{i2}, \ldots, u_{iN_i}$
  • Strata population means: $\mu_1, \mu_2, \ldots, \mu_L$
  • Strata population variances: $\sigma_1^2, \sigma_2^2, \ldots, \sigma_L^2$

57

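Population mean

$$\tau = \sum_{i=1}^{L}\tau_i = \sum_{i=1}^{L}N_i\mu_i \quad (\text{population total; } \tau_i = N_i\mu_i \text{ is the population total for the } i\text{th stratum})$$

$$\mu = \frac{\tau}{N} = \sum_{i=1}^{L}\frac{N_i}{N}\mu_i = \sum_{i=1}^{L}W_i\mu_i, \quad \text{where } W_i = \frac{N_i}{N}$$

($\mu$ is a weighted average of the $\mu_i$'s.)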

58

Population variance

$$\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{L}\sum_{j=1}^{N_i}(u_{ij}-\mu)^2$$

Now

$$(u_{ij}-\mu)^2 = (u_{ij}-\mu_i+\mu_i-\mu)^2 = (u_{ij}-\mu_i)^2 + 2(u_{ij}-\mu_i)(\mu_i-\mu) + (\mu_i-\mu)^2$$

59

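It can be shown that

$$\begin{aligned}
\sigma^2 &= \frac{1}{N-1}\sum_{i=1}^{L}\sum_{j=1}^{N_i}(u_{ij}-\mu_i)^2 + \frac{1}{N-1}\sum_{i=1}^{L}N_i(\mu_i-\mu)^2 \\
&= \frac{1}{N-1}\sum_{i=1}^{L}(N_i-1)\sigma_i^2 + \frac{1}{N-1}\sum_{i=1}^{L}N_i(\mu_i-\mu)^2
\end{aligned}$$

Thus if we know $\mu_1, \ldots, \mu_L$ and $\sigma_1^2, \ldots, \sigma_L^2$ we can determine $\mu$ and $\sigma^2$.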
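The within-plus-between decomposition of the population variance can be checked numerically. The three strata below are hypothetical values chosen only to illustrate the identity.

```python
from statistics import fmean, variance

# Hypothetical stratified population (three strata of unequal size).
strata = [[2, 3, 3, 4], [5, 6, 6, 7, 8], [10, 12, 11]]
population = [u for stratum in strata for u in stratum]
N = len(population)
mu = fmean(population)

# Within-stratum part: sum of (N_i - 1) * sigma_i^2.
within = sum((len(s) - 1) * variance(s) for s in strata)
# Between-stratum part: sum of N_i * (mu_i - mu)^2.
between = sum(len(s) * (fmean(s) - mu) ** 2 for s in strata)

sigma2_decomposed = (within + between) / (N - 1)
sigma2_direct = variance(population)       # divisor N - 1, as on the slides

print(sigma2_decomposed, sigma2_direct)
```

When the strata are internally homogeneous the within part is small, so most of the population variance sits in the between part, which stratified sampling removes from the estimator's variance.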

60

More notation: sample quantities

  • Stratum sample sizes: n1, n2, …, nL
  • Total sample size: n= n1+ n2+ …+ nL
  • Members of the ith stratum SRS: $y_{i1}, y_{i2}, \ldots, y_{in_i}$
  • Strata sample means: $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_L$
  • Strata sample variances: $s_1^2, s_2^2, \ldots, s_L^2$


61

where

$$\bar{y}_i = \frac{1}{n_i}\sum_{j=1}^{n_i}y_{ij}, \qquad s_i^2 = \frac{1}{n_i-1}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i)^2, \qquad f_i = \frac{n_i}{N_i} \ (i\text{th sampling fraction})$$

62

Estimating $\mu$

It is tempting to use

$$\bar{y} = \frac{\text{sum of all sample observations}}{n} = \frac{\sum_{i} n_i\bar{y}_i}{n}$$

to estimate $\mu$. But it can be shown that this is generally a biased estimator of $\mu$.

63

Recall

$$\mu = \sum_{i=1}^{L}\frac{N_i}{N}\mu_i$$

Since $\bar{y}_i$ is unbiased for $\mu_i$, we take

$$\bar{y}_{ST} = \sum_{i=1}^{L}\frac{N_i}{N}\bar{y}_i$$

as an estimator of $\mu$.

64

We have

$$E(\bar{y}_{ST}) = E\left(\sum_{i=1}^{L}\frac{N_i}{N}\bar{y}_i\right) = \sum_{i=1}^{L}\frac{N_i}{N}E(\bar{y}_i) = \sum_{i=1}^{L}\frac{N_i}{N}\mu_i = \mu,$$

so $\bar{y}_{ST}$ is unbiased for $\mu$.

65

Variance of y ST

Remember the yi’s are independent. Hence Where Wi=Ni/N.

     

  

  

         

L i i i i i L i i i L i i i ST

n f W y Var W y N Var y Var

N

1 2 2 1 2 1

1 

66

We estimate $\sigma_i^2$ in the expression for $Var(\bar{y}_{ST})$ by $s_i^2$, where

$$s_i^2 = \frac{1}{n_i-1}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i)^2.$$

Then we have the estimated variance,

$$\widehat{Var}(\bar{y}_{ST}) = \sum_{i=1}^{L}W_i^2(1-f_i)\frac{s_i^2}{n_i}.$$
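The stratified estimator and its estimated variance translate directly into code. The sketch below is a minimal illustration: the function name, the stratum sizes and the sample values are all hypothetical and are not drawn from the LGA data.

```python
from statistics import fmean, variance

def stratified_mean(samples, stratum_sizes):
    """Return (y_ST, estimated Var(y_ST)) from one SRS per stratum."""
    N = sum(stratum_sizes)
    estimate, est_var = 0.0, 0.0
    for y, N_i in zip(samples, stratum_sizes):
        n_i = len(y)
        W_i = N_i / N                      # stratum weight
        f_i = n_i / N_i                    # stratum sampling fraction
        estimate += W_i * fmean(y)         # W_i * ybar_i
        est_var += W_i ** 2 * (1 - f_i) * variance(y) / n_i
    return estimate, est_var

# Hypothetical strata of sizes 50, 30 and 20, with small samples from each.
stratum_sizes = [50, 30, 20]
samples = [[2, 4, 6], [10, 14], [20, 22, 24, 26]]

y_st, var_y_st = stratified_mean(samples, stratum_sizes)
print(y_st, var_y_st)
```

Each stratum needs at least two sampled units here, since the sample variance s_i² is undefined for a single observation.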


67

It can be shown that the estimated $Var(\bar{y}_{ST})$ suggested on the previous slide is an unbiased estimator, as

$$E\big[\widehat{Var}(\bar{y}_{ST})\big] = E\left[\sum_{i}W_i^2(1-f_i)\frac{s_i^2}{n_i}\right] = \sum_{i}W_i^2(1-f_i)\frac{E(s_i^2)}{n_i} = \sum_{i}W_i^2(1-f_i)\frac{\sigma_i^2}{n_i} = Var(\bar{y}_{ST}).$$

68

SGTA Exercises for Week 9

(Try completing each task below before your SGTA class in Week 10.)

Lohr 2.12, Exercises: 19; 16 (golfsrs.mtw is available on iLearn); 18

Lohr 3.9, Exercises: 1 (for class discussion)

Note: The questions above are available on iLearn.