

SLIDE 1

Chapter 7: Sampling

In this chapter we will cover:

  • 1. Samples and Populations (§7.1, 7.2 Rice)
  • 2. Simple random sampling (§7.3 Rice)
  • 3. Confidence intervals for means, proportions and variances (§7.3 Rice)

Samples and Populations

  • Sample surveys are used to obtain information about a large population by examining only a small fraction of that population
  • These are used extensively in social science studies, by governments, and in audits
  • The sampling used here is probabilistic in nature: each member of the population has a specified probability of being included in the sample

Samples and Populations

Survey sampling is used because:

  • 1. The selection of units at random is a guard against investigator bias
  • 2. A small sample costs far less and is much faster than a complete enumeration (or census)
  • 3. The results from a small sample may be more accurate than from an enumeration: higher data quality
  • 4. Random sampling techniques provide for the calculation of an estimate of the error due to sampling
  • 5. In designing a survey it is frequently possible to determine the sample size needed to obtain a prescribed error level

Population parameters

  • The numerical characteristics of a population are called its parameters
  • In general we will assume the population is of size N
  • Each member of the population has an associated numerical value corresponding to the quantity of interest
  • These numerical values are denoted by x1, x2, · · · , xN
  • These can be continuous or discrete


SLIDE 2

Example A

  • The population is N = 393 short-stay hospitals
  • The data are xi, the number of patients discharged from the ith hospital in January 1968
  • The population mean is µ = (1/N) Σ_{i=1}^{N} xi; here µ = 814.6
  • The population total is τ = Σ_{i=1}^{N} xi = Nµ; here τ = 320 138
  • The population variance is σ² = (1/N) Σ_{i=1}^{N} (xi − µ)²

Example A

[Figure: histogram of the number of discharges for the 393 hospitals]
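These parameters can be computed directly whenever the whole population is available. A minimal sketch with a small made-up population (not the hospital data):

```python
# Population parameters for a small made-up population (not the hospital data).
xs = [3, 7, 7, 10, 13]                        # x1, ..., xN
N = len(xs)

mu = sum(xs) / N                              # population mean
tau = sum(xs)                                 # population total, tau = N * mu
sigma2 = sum((x - mu) ** 2 for x in xs) / N   # population variance (divide by N, not N - 1)

print(mu, tau, sigma2)                        # 8.0 40 11.2
```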


SLIDE 3

Simple random sampling

  • The most elementary form of sampling is simple random sampling (s.r.s.)
  • Here each sample of size n has the same probability of being selected
  • The sampling is done without replacement, so there are (N choose n) possible samples

Sample mean

  • If the sample size is n then denote the sample by X1, X2, · · · , Xn
  • Each is a random variable
  • The sample mean is then X̄ = (1/n) Σ_{i=1}^{n} Xi
  • This is also a random variable and will have a (sampling) distribution
  • We will use X̄, which is calculated from the sample, to estimate µ, which can only be calculated from the whole population
  • In practice we will know the sample but not the population

Example A

  • We would like to know the sampling distribution of X̄ for each n
  • If n = 16 there are about 10^33 different samples, so we can't enumerate the sampling distribution exactly
  • We can simulate it though, i.e. draw the sample many (500-1000) times and examine the distribution
  • In practice we use the fact that the sampling distribution is approximately Normal
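The simulation described above can be sketched as follows. The population here is synthetic, a stand-in for the discharge counts (which are not reproduced in these notes):

```python
import random
import statistics

random.seed(0)

# Synthetic stand-in for the N = 393 hospital discharge counts.
population = [random.gauss(814.6, 600) for _ in range(393)]

def srs_mean(n):
    """One simple random sample of size n, drawn without replacement."""
    return statistics.mean(random.sample(population, n))

# Draw the sample 1000 times and examine the simulated sampling distribution.
means = [srs_mean(16) for _ in range(1000)]
print(statistics.mean(means))   # centred near the population mean
print(statistics.stdev(means))  # spread shrinks as n increases
```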


SLIDE 4

Example A

[Figure: simulated sampling distributions of the sample mean for n = 8, 16, 32 and 64]

Example A

  • All the sampling distributions are centered near the true value (the red line)
  • As the sample size increases the histogram becomes less spread out, i.e. the variance decreases
  • For the larger values of n the histograms look well approximated by Normal distributions


SLIDE 5

Simple random sampling

The following results are proved in Rice (pp. 191-194):

  • For simple random sampling E(X̄) = µ. We say X̄ is an unbiased estimate of µ
  • For simple random sampling E(T) = τ, where T = N X̄ is the estimate of the population total. We say T is an unbiased estimate of τ
  • For simple random sampling Var(X̄) = (σ²/n) (1 − (n−1)/(N−1))
  • The term (n−1)/(N−1) is called the finite population correction. If N is much bigger than n this will be small

Mean square error

  • An unbiased estimate of a parameter is correct ‘on average’
  • One way of measuring how good an estimate θ̂ is of the parameter θ is the mean squared error mse = E[(θ̂ − θ)²]
  • We can rewrite the mse as mse = variance + bias²

Standard error

  • Since X̄ is unbiased its mse is just its variance
  • As long as n << N this is well approximated by Var(X̄) = (σ²/n) (1 − (n−1)/(N−1)) ≈ σ²/n
  • The term σ_X̄ = (σ/√n) √(1 − (n−1)/(N−1)) ≈ σ/√n is called the standard error of X̄. It measures how close the estimate is to the true value on average
  • As n gets bigger the standard error gets smaller
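The two expressions above differ only by the finite population correction; a small sketch (the function name and numbers are illustrative):

```python
import math

def se_mean(sigma, n, N):
    """Standard error of the sample mean under simple random sampling."""
    fpc = 1 - (n - 1) / (N - 1)          # finite population correction
    return sigma / math.sqrt(n) * math.sqrt(fpc)

# When N is much bigger than n the correction barely matters:
print(se_mean(10.0, 25, 100_000))        # close to sigma / sqrt(n) = 2.0
# When n is a sizeable fraction of N it shrinks the standard error noticeably:
print(se_mean(10.0, 25, 50))             # 2.0 * sqrt(25/49), about 1.43
```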


SLIDE 6

Estimating a proportion

  • Suppose the population is split into two groups, one group with some property and another group without
  • Let the proportion with the property be p
  • An estimate for p is p̂, the proportion in the sample with the property
  • This estimate is also unbiased
  • Its standard error is σ_p̂ = √( (p(1−p)/n) (1 − (n−1)/(N−1)) ) ≈ √( p(1−p)/n )

Estimating a population variance

  • By taking a random sample the population variance σ² can be estimated by the variance of the sample, σ̂² = (1/n) Σ_{i=1}^{n} (Xi − X̄)²
  • This is in fact a biased estimate since E(σ̂²) = σ² ((n−1)/n) (N/(N−1))
  • An unbiased estimate of Var(X̄) is s²_X̄ = (s²/n) (1 − n/N), where s² = (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄)²

Example A

  • A simple random sample of 50 of the 393 hospitals was taken. From this sample X̄ = 938.5
  • The sample variance is s² = 614.53²
  • The estimated standard error of X̄ is s_X̄ = √( (s²/n) (1 − n/N) ) = 81.19

Recommended Questions

From Rice §7.7 please look at Questions 1, 3, 5, 6, 7.
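The estimated standard error in Example A can be reproduced from the sample summaries (reading s² = 614.53² as the sample variance):

```python
import math

n, N = 50, 393
s = 614.53                       # sample standard deviation
s2 = s ** 2                      # sample variance (computed with the n - 1 divisor)

# Unbiased estimate of Var(X-bar), then take the square root for the standard error.
var_xbar = s2 / n * (1 - n / N)
se_xbar = math.sqrt(var_xbar)
print(round(se_xbar, 2))         # 81.19
```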

SLIDE 7

The Normal approximation to sampling distributions

  • We have calculated the mean and standard deviation of X̄; can we find its sampling distribution?
  • In general the exact sampling distribution will depend on the population distribution, which is unknown
  • The central limit theorem, however, tells us that we can get a good approximation if the sample size n is large enough

The Normal approximation to sampling distributions

  • The central limit theorem states that if the Xi are independent with the same distribution then P( (X̄n − µ)/(σ/√n) ≤ z ) → Φ(z) as n → ∞, where µ and σ are the mean and standard deviation of each Xi and Φ is the cdf of the standard normal
  • For simple random sampling the random variables are not strictly independent; nevertheless, for n/N sufficiently small a form of the CLT still applies

Example A

  • For the 393 hospitals the standard error of X̄ when n = 64 is σ_X̄ = (σ/√n) √(1 − (n−1)/(N−1)) = 67.5
  • Applying the CLT means we can ask what is the probability that the estimate X̄ is more than 100 from the true value, i.e. we want P(|X̄ − µ| > 100) = 2 P(X̄ − µ > 100)
  • P(X̄ − µ > 100) = 1 − P(X̄ − µ ≤ 100) = 1 − P( (X̄ − µ)/σ_X̄ ≤ 100/σ_X̄ ) ≈ 1 − Φ(100/67.5) = 0.069
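This probability can be checked with the standard normal cdf from Python's standard library:

```python
from statistics import NormalDist

se = 67.5                          # standard error of the sample mean for n = 64
z = 100 / se                       # 100 measured in standard errors

# P(X-bar - mu > 100) is approximately 1 - Phi(100/se); double it for two-sided.
one_sided = 1 - NormalDist().cdf(z)
two_sided = 2 * one_sided
print(round(one_sided, 3))         # 0.069
print(round(two_sided, 2))         # 0.14, the "14% predicted by theory"
```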

SLIDE 8

Example A: simulation

In the simulation the proportion of samples further than 100 from the true value is 15.6%, in comparison to the 14% predicted by theory.

[Figure: simulated sampling distribution for n = 64, with the sample means falling more than 100 from the true value marked]

The normal approximation also seems reasonable.

Confidence intervals

  • The previous example is a good way to understand a confidence interval
  • A confidence interval for a population parameter θ is a random interval (i.e. an interval that depends on the sample)
  • It contains the true value some fixed proportion of the times a sample is drawn
  • A 95% confidence interval contains θ for 95% of the samples
  • A confidence interval with coverage 1 − α contains the true value 100(1 − α)% of the times you use it


SLIDE 9

Confidence intervals

[Figure: 100 simulated 95% confidence intervals for the mean, plotted against sample index]

Confidence intervals: Algorithm

  • If you want to compute a 95% confidence interval from data X1, X2, · · · , Xn using the normal approximation:
  • Calculate X̄ and s², the sample mean and variance of the data
  • Calculate σ_X̄, the s.e. of the estimate; this is s/√n
  • In Table 2, Appendix B find the z_p such that P(|Z| > z_p) = 0.05. This will be z_p = 1.96
  • The confidence interval is ( X̄ − 1.96 σ_X̄, X̄ + 1.96 σ_X̄ )
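The steps above can be sketched directly in code; normal_ci and the data below are illustrative, not from the notes:

```python
import math
from statistics import NormalDist, mean, stdev

def normal_ci(data, alpha=0.05):
    """Normal-approximation confidence interval for the population mean."""
    n = len(data)
    xbar = mean(data)
    se = stdev(data) / math.sqrt(n)          # s / sqrt(n), ignoring the fpc
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_p, 1.96 when alpha = 0.05
    return xbar - z * se, xbar + z * se

# Made-up data purely for illustration.
lo, hi = normal_ci([12.1, 9.8, 11.4, 10.6, 12.9, 10.2, 11.8, 9.5])
print(lo, hi)
```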

Example

Suppose that from a sample of size 100 we have X̄ = 1.2 and s² = 0.09.

  • 1. What is the 95% confidence interval for µ?
  • 2. What is the 99% confidence interval for µ?


SLIDE 10

Confidence intervals

  • For 0 ≤ α ≤ 1 let z(α) be such that Φ(z(α)) = 1 − α
  • The shaded region covers the interval [z(α), ∞) and has area α

[Figure: standard normal density with the upper tail [z(α), ∞) of area α shaded]

  • So if Z has a standard normal distribution then P(−z(α/2) ≤ Z ≤ z(α/2)) = 1 − α

Confidence intervals

  • The normalised mean (X̄ − µ)/σ_X̄ has (approximately) a standard normal distribution
  • So we have that P( −z(α/2) ≤ (X̄ − µ)/σ_X̄ ≤ z(α/2) ) = 1 − α
  • That is, P( X̄ − z(α/2) σ_X̄ ≤ µ ≤ X̄ + z(α/2) σ_X̄ ) = 1 − α
  • That is, the interval ( X̄ − z(α/2) σ_X̄, X̄ + z(α/2) σ_X̄ ) is a 1 − α confidence interval
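The identity P(−z(α/2) ≤ Z ≤ z(α/2)) = 1 − α can be checked numerically for, say, α = 0.05:

```python
from statistics import NormalDist

alpha = 0.05
Z = NormalDist()                   # standard normal
z = Z.inv_cdf(1 - alpha / 2)       # z(alpha/2), about 1.96

# P(-z(alpha/2) <= Z <= z(alpha/2)) should equal 1 - alpha.
coverage = Z.cdf(z) - Z.cdf(-z)
print(round(coverage, 4))          # 0.95
```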


SLIDE 11

Confidence intervals

  • The form of the confidence interval ( X̄ − z(α/2) σ_X̄, X̄ + z(α/2) σ_X̄ ) depends on the sample
  • It is therefore a random interval, i.e. it would change if a different sample had been selected
  • The previous calculation tells us the probability that the random interval contains the true (but unknown) population mean
  • We want this probability to be high in order to be reasonably sure that we have caught the true value

Example A

  • We draw a sample of size 64 from the population
  • The sample has a mean of X̄ = 792.3 and its standard deviation is 647.6
  • The estimated standard error of X̄ is σ_X̄ = 647.6/√64 = 80.95
  • We want a 95% confidence interval so we use z(0.025) = 1.96
  • So the 95% confidence interval is ( X̄ − 1.96 σ_X̄, X̄ + 1.96 σ_X̄ ) = (792.3 − 1.96 × 80.95, 792.3 + 1.96 × 80.95) = (633.6, 951.0)

Recommended Questions

From Rice §7.7 please look at Questions 11, 12, 13, 15, 17, 18, 19(a), 25, 26 and 28.

Finite Samples

  • The previous results are based on the normal approximation, so assume that the sample size n is large enough
  • For smaller n, while X̄ is still well approximated by a normal distribution, the estimate of σ_X̄ by s/√n is not very accurate
  • In order to correct for the variability in estimating σ a different sampling distribution is used
  • This is called the t-distribution


SLIDE 12

The t-distribution

  • Let X1, · · · , Xn be a random sample from a Normal distribution with mean µ and variance σ²
  • Define X̄ = (1/n) Σ_{i=1}^{n} Xi and S² = (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄)² to be the sample mean and variance [note the n − 1 term]
  • From Rice §6.3 we have E(X̄) = µ and E(S²) = σ²

The t-distribution

  • The sampling distribution of (X̄ − µ)/(S/√n) is called the t-distribution with n − 1 degrees of freedom

[Figure: t-distribution densities for df = 2, 5, 20 and 100]

Confidence intervals for µ

  • We can base our confidence intervals for µ on the t-distribution
  • They will be more accurate when n is not too big
  • We use the formula ( X̄ − t(α/2) S/√n, X̄ + t(α/2) S/√n ), where t(α/2) is defined from Table 4, Appendix B of Rice
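The t-based interval follows the same pattern as the normal one, with t(α/2) in place of 1.96. The critical value below is supplied by hand since Python's standard library has no t quantile function (a stats library such as SciPy could compute it); the numbers are illustrative:

```python
import math

def t_ci(xbar, s, n, tcrit):
    """Confidence interval for mu based on the t-distribution.

    tcrit is t(alpha/2) with n - 1 degrees of freedom, taken from a table
    (e.g. Table 4, Appendix B of Rice) or a stats library.
    """
    se = s / math.sqrt(n)
    return xbar - tcrit * se, xbar + tcrit * se

# Illustrative numbers: n = 25, so 24 degrees of freedom; the 95% critical
# value with df = 24 is about 2.064, a little larger than the normal 1.96,
# so the t interval is slightly wider than the normal one.
lo, hi = t_ci(5.0, 1.2, 25, 2.064)
print(lo, hi)
```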


SLIDE 13

Confidence intervals for µ

Suppose that from a sample of size 100 we have X̄ = 1.2 and s² = 0.09.

  • 1. What is the 95% confidence interval for µ based on the t-distribution?
  • 2. What is the 99% confidence interval for µ based on the t-distribution?

Recommended Questions

From Rice §6.5 please look at Question 4.