ACMS 20340 Statistics for Life Sciences, Chapter 13: Sampling Distributions



SLIDE 1

ACMS 20340 Statistics for Life Sciences

Chapter 13: Sampling Distributions

SLIDE 2

Sampling

We use information from a sample to infer something about a population. When using random samples and randomized experiments, we cannot rule out the possibility of incorrect inferences. So we ask: How often would this method give a correct answer if we used it a large number of times?

SLIDE 3

Some Terminology

A parameter is a number which describes some aspect of a population. In practice, we don’t know the value of a parameter because we cannot directly examine/measure the entire population. A statistic is a number that can be computed from the sample data, without making use of any unknown parameters. In practice we often use statistics to estimate an unknown parameter.

SLIDE 4

Mnemonic Device

Statistics come from Samples. Parameters come from Populations.

SLIDE 5

An Illustration

According to the 2008 Health and Nutrition Examination Survey, the mean weight of the sample of American adult males was x̄ = 191.5 pounds. 191.5 is a statistic. The population: all American adult males over the age of 20. The parameter: the mean weight of all the members of the population.

SLIDE 6

On Means

We will always use µ to represent the mean of a population. This is a fixed parameter that is unknown when we use a sample for inference. We will always write x̄ for the mean of the sample. This is the average of the observations in the sample.

SLIDE 7

The Key Question

If the sample mean x̄ is rarely exactly equal to the population mean µ and can vary from sample to sample, how can we consider it a reasonable estimate of µ?

SLIDE 8

The Answer. . .

If we take larger and larger samples, the statistic x̄ is guaranteed to get closer and closer to the parameter µ. This fact is known as the Law of Large Numbers.

SLIDE 9

The Law of Large Numbers 1

Recall: In the long run, the proportion of occurrences of a given outcome gets closer and closer to the probability of that outcome.

E.g. the proportion of heads when tossing a fair coin gets closer to 1/2 in the long run. Similarly, in the long run, the average outcome gets close to the population mean.
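The coin-toss claim can be checked with a quick simulation. A minimal sketch in Python; the seed and toss counts are arbitrary choices made for illustration:

```python
import random

random.seed(1)  # arbitrary seed, for a reproducible illustration

def proportion_heads(n_tosses):
    """Toss a fair coin n_tosses times; return the proportion of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# The proportion drifts toward the probability 1/2 as the number of
# tosses grows: the Law of Large Numbers in action.
for n in (10, 1_000, 100_000):
    print(n, proportion_heads(n))
```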

SLIDE 10

The Law of Large Numbers 2

Using the basic laws of probability, we can prove the law of large numbers. The “Law of Large Numbers” applet is useful for illustrating the law.

SLIDE 11

A Word of Caution

Only in the very long run does the sample mean get really close to the population mean, and so in this respect, the Law of Large Numbers is not very practical. However, the success of certain businesses, such as casinos and insurance companies, depends on the Law of Large Numbers.

SLIDE 12

Sampling Distributions 1

The Law of Large Numbers ⇒ If we measure enough subjects, the statistic x̄ will eventually get close to the parameter µ. What if we can only take samples of a smaller size, say 10?

SLIDE 13

Sampling Distributions 2

What would happen if we took many samples of 10 subjects from this population? To answer this question:

◮ Take a large number of samples of size 10 from the population
◮ Calculate the sample mean x̄ for each sample
◮ Make a histogram of the values of x̄
◮ Examine the distribution in the histogram (shape, center, spread, outliers, etc.)
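These steps can be sketched directly in Python. The uniform population of 10,000 values below is a made-up stand-in, since the slide does not fix a particular population:

```python
import random
import statistics

random.seed(2)  # for a reproducible illustration

# A hypothetical population of 10,000 measurements (uniform on [0, 100]).
population = [random.uniform(0, 100) for _ in range(10_000)]

# Steps 1-2: take many samples of size 10 and compute each sample mean.
sample_means = [statistics.mean(random.sample(population, 10))
                for _ in range(1_000)]

# Steps 3-4: a histogram of sample_means would show the shape; here we
# just compare its center to the population mean.
print("population mean:     ", statistics.mean(population))
print("mean of x-bar values:", statistics.mean(sample_means))
```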

SLIDE 14

By Way of Example. . . 1

◮ High levels of dimethyl sulfide (DMS) in wine cause the wine to smell bad.
◮ Winemakers are thus interested in determining the odor threshold, the lowest concentration of DMS that the human nose can detect.
◮ The threshold varies from person to person, so we'd like to find the mean threshold µ in the population of all adults.
◮ An SRS of size 10 yields the values
  28 40 28 33 20 31 29 27 17 21
  and thus we have a sample mean x̄ = 27.4.

SLIDE 15

By Way of Example. . . 2

◮ It turns out that the DMS odor threshold of adults follows a roughly Normal distribution with µ = 25 mg/L and standard deviation σ = 7 mg/L.
◮ By following the procedure outlined before (taking 1,000 SRSs), we produce a histogram that displays the distribution of the values of x̄ from the 1,000 SRSs.
◮ This histogram displays the sampling distribution of the statistic x̄.
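A sketch of this simulation in Python, using the slide's values µ = 25 and σ = 7 (the seed and the count of 1,000 samples are arbitrary):

```python
import random
import statistics

random.seed(3)  # for a reproducible illustration

MU, SIGMA, N = 25, 7, 10  # DMS threshold parameters from the slide

# Take 1,000 SRSs of size 10 from N(25, 7); record each sample mean x-bar.
xbars = [statistics.mean(random.gauss(MU, SIGMA) for _ in range(N))
         for _ in range(1_000)]

print("center of x-bars:", statistics.mean(xbars))   # near mu = 25
print("spread of x-bars:", statistics.stdev(xbars))  # near 7/sqrt(10) ~ 2.21
```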

SLIDE 16

By Way of Example. . . 3

SLIDE 17

The Official Definition

The sampling distribution of a statistic is the distribution of values taken by the statistic over all possible samples of some fixed size from the population. Thus, the histogram on the previous slide actually displays an approximation to the sampling distribution of the statistic x̄. Important point: The sample mean is a random variable!

◮ Since "good" samples are chosen randomly, statistics such as the sample mean x̄ are random variables.
◮ Thus we can describe the behavior of a sample statistic by means of a probability model.

SLIDE 18

An Important Difference

◮ The law of large numbers describes what would happen if we took random samples of increasing size n.
◮ A sampling distribution describes what would happen if we took all random samples of a fixed size n.

SLIDE 19

Examining the Sampling Distribution

◮ Shape: It appears to be Normal.
◮ Center: The mean of the 1,000 x̄'s is 24.95, very close to the population mean µ = 25.
◮ Spread: The s.d. of the 1,000 x̄'s is 2.217, much smaller than the population s.d. σ = 7.

SLIDE 20

A General Fact

When we choose many SRSs from a population, the sampling distribution of the sample means is centered at the mean of the original population.

But the sampling distribution is also less spread out than the distribution of individual observations.

SLIDE 21

More Precisely

Suppose that x̄ is the mean of an SRS of size n drawn from a large population with mean µ and standard deviation σ. Then the sampling distribution of x̄ has mean µ_x̄ and standard deviation σ_x̄ = σ/√n.

Note that µ_x̄ = µ. This notation is simply to tell the difference between the two distributions. Because the mean of the sampling distribution of the statistic x̄, µ_x̄, is equal to µ, we say that the statistic x̄ is an unbiased estimator of the parameter µ.

SLIDE 22

Unbiased Estimators

◮ An unbiased estimator is "correct on the average" over many samples.
◮ Just how close the estimator will be to the parameter in most samples is determined by the spread of the sampling distribution.
◮ If the individual observations have s.d. σ, then sample means x̄ from samples of size n have s.d. σ/√n.
◮ Thus, averages are less variable than individual observations.

SLIDE 23

For a Normal Population

If individual observations have the distribution N(µ, σ), then the sample mean x̄ of an SRS of size n has the distribution N(µ, σ/√n).
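Python's `statistics.NormalDist` makes this concrete. A sketch reusing the DMS values µ = 25, σ = 7, and n = 10 (an assumption carried over from the earlier example):

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 25, 7, 10  # DMS example values, reused for illustration

one_obs = NormalDist(mu, sigma)         # one observation: N(mu, sigma)
xbar = NormalDist(mu, sigma / sqrt(n))  # sample mean: N(mu, sigma/sqrt(n))

# The average is far more likely than a single observation
# to land within 2 mg/L of mu.
p_single = one_obs.cdf(27) - one_obs.cdf(23)
p_mean = xbar.cdf(27) - xbar.cdf(23)
print(p_single, p_mean)
```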

SLIDE 24

Seeing is Believing

SLIDE 25

Non-Normal Distributions?

We know what the values of the mean and standard deviation of x̄ will be, regardless of the population distribution. But what can be known about the shape of the sampling distribution?

Population Distribution is Normal → Sampling Distribution is Normal.
Population Distribution is not Normal → Sampling Distribution is ?????.

SLIDE 26

Central Limit Theorem

Remarkably, as the sample size of a non-Normal population increases, the sampling distribution of ¯ x changes shape. In fact, the sampling distribution starts to look more like a Normal distribution regardless of what the population distribution looks like. This idea is the Central Limit Theorem.

SLIDE 27

The Official Definition

Draw an SRS of size n from any population with mean µ and standard deviation σ. When n is large, the sampling distribution of the sample mean x̄ is approximately Normal: x̄ is a random variable with distribution (roughly) N(µ, σ/√n).

SLIDE 28

So Why Do We Care?

The Central Limit Theorem allows us to use Normal probability calculations to answer questions about sample means, even if the population distribution is not Normal.

SLIDE 29

Central Limit in Action

(a) Strongly skewed population distribution. (b) Sampling distribution of x̄ with n = 2. (c) Sampling distribution of x̄ with n = 10. (d) Sampling distribution of x̄ with n = 25.
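The same progression can be reproduced numerically. A sketch using an exponential population as the strongly skewed distribution (an assumption; the slide does not specify one), checking that the skewness of the x̄ values shrinks toward 0, the value for a symmetric Normal shape, as n grows:

```python
import random
import statistics

random.seed(5)  # for a reproducible illustration

def sample_mean(n):
    """Mean of one SRS of size n from a right-skewed (exponential) population."""
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

skews = []
for n in (2, 10, 25):
    means = [sample_mean(n) for _ in range(5_000)]
    center, spread = statistics.mean(means), statistics.stdev(means)
    # Sample skewness: near 0 for a Normal shape.
    skews.append(statistics.mean((m - center) ** 3 for m in means) / spread ** 3)

print(skews)  # skewness shrinks toward 0 as n grows
```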

SLIDE 30

Warning!

The CLT applies to sampling distributions, not the distribution of a sample.

◮ "Now I'm confused. Larger sample size = more Normal distribution of a sample?" No: a skewed population will likely produce skewed random samples, no matter how large.
◮ The CLT only describes the distribution of averages for repeated samples.

SLIDE 31

Sample Sizes 1

How large does the sample need to be for the sampling distribution of x̄ to be close to Normal? The answer depends on the population distribution.

Farther from Normal ⇒ More observations per sample needed

SLIDE 32

Sample Sizes 2

General rule of thumb for sample size n:

◮ Skewed populations ⇒ A sample of size 25 is generally enough to obtain a Normal sampling distribution.
◮ Extremely skewed populations ⇒ A sample of size 40 is generally enough to obtain a Normal sampling distribution.

SLIDE 33

Sample Sizes 3

Angle of big toe deformations in 28 patients. Population likely close to Normal, so sampling distribution should be Normal.

SLIDE 34

Sample Sizes 4

Servings of fruit per day for 74 adolescent girls. Population likely skewed, but sampling distribution should be Normal due to large sample size.

SLIDE 35

CLT and Sampling Distributions

There are a few helpful facts that come out of the Central Limit Theorem. These are always true, regardless of population distribution.

◮ Means of random samples are less variable than individual observations.
◮ Means of random samples are more Normal than individual observations.
SLIDE 36

Sampling Distributions for Probabilities

We have seen that sampling distributions are useful for analyzing the means of quantitative variables. But what if we have a categorical variable instead? Fortunately, we can use the sampling distribution of p̂.

SLIDE 37

Probability and Categorical Variables

Categorical variables can take any of a finite number of possible outcomes. We choose one such outcome and call it a "success". All other outcomes are then "non-successes" or "failures." Note: This is an arbitrary choice, not a moral judgment.

SLIDE 38

Terminology

An experiment finds that 6 of 20 birds exposed to an avian flu strain develop flu symptoms. We let the random variable X be the number of birds with flu symptoms. Recall: X is a count of the "successes" of this categorical variable in a fixed number of observations.

SLIDE 39

Terminology

If the number of observations is labeled n, then the sample proportion is

p̂ = (count of successes in sample)/(size of sample) = X/n

Similar to the sample average x̄, we can find the sampling distribution for p̂.

SLIDE 40

Recall: Binomial Distribution

As we saw last week, a binomial distribution consists of n observations and a constant probability of success p for each observation.

Here we will rely heavily on the fact that the binomial distribution (which is discrete) can be approximated by a Normal distribution.

SLIDE 41

Recall: Normal Approximation to Binomial Distribution

Suppose a count X has a binomial distribution with n observations and success probability p. When n is large, the distribution of X is approximately Normal with distribution N(np, √(np(1 − p))).

As a rule of thumb, n should be large enough for the counts of successes and failures to be at least 10 each.
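We can check the approximation against the exact binomial probability with Python's `statistics.NormalDist`. The values n = 100 and p = 0.3 are made up for illustration; note np = 30 and n(1 − p) = 70 both exceed 10, so the rule of thumb is satisfied:

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.3  # hypothetical example; np and n(1-p) are both >= 10

# Exact binomial probability P(X <= 35).
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(36))

# Normal approximation: X is roughly N(np, sqrt(np(1-p))).
approx = NormalDist(n * p, sqrt(n * p * (1 - p))).cdf(35)

print(exact, approx)  # close, though not identical
```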

SLIDE 42

Sampling Distribution of a Sample Proportion

A count of successes has limited use when comparing different studies (as the sample sizes may differ drastically). The sample proportion p̂ is a much more informative sample statistic. How good is the statistic p̂ as an estimate of the parameter p? Again we ask: "What happens with many samples?"

SLIDE 43

The Official Definition

Choose an SRS of size n from a large population that has proportion p of successes. Let p̂ be the sample proportion of successes, p̂ = (count of successes in the sample)/n. Then:

◮ The mean of the sampling distribution is p.
◮ The standard deviation of the sampling distribution is √(p(1 − p)/n).
◮ As the sample size increases, the sampling distribution of p̂ becomes approximately Normal.
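A quick simulation confirms the claims about the center and spread of p̂ (a Python sketch; p = 0.3 and n = 200 are arbitrary illustrative values):

```python
import random
import statistics
from math import sqrt

random.seed(7)  # for a reproducible illustration

p, n = 0.3, 200  # hypothetical success probability and sample size

# Simulate 2,000 SRSs; record the sample proportion p-hat for each.
phats = [sum(random.random() < p for _ in range(n)) / n
         for _ in range(2_000)]

print("mean of p-hats:", statistics.mean(phats))   # near p = 0.3
print("sd of p-hats:  ", statistics.stdev(phats))  # near sqrt(p(1-p)/n)
print("theoretical sd:", sqrt(p * (1 - p) / n))
```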

SLIDE 44

Summary in Picture Form

SLIDE 45

Warning!

Do not use the Normal approximation for the sampling distribution of p̂ when the sample size is small. Also, the population should be much larger than the sample. We'll say, at least 20 times larger, as a rule of thumb. This approximation is least accurate when p is close to 0 or 1. (Our sample would contain only successes or failures unless n is very large.)

SLIDE 46

Example: Who Gets the Flu?

Suppose that we know that 2.5% of all American adults were sick with the flu on a given day of January 2010. The Gallup-Healthways survey interviewed a random sample of 29,483 people and asked each whether they were sick with the flu. What is the probability that at least 2.3% of such a sample would answer "yes" in the survey?

SLIDE 47

Example: Who Gets the Flu?

The population proportion is about p = 0.025 and n = 29,483. So the sample proportion p̂ has mean 0.025 and standard deviation

√(p(1 − p)/n) = √((0.025)(0.975)/29,483) ≈ 0.00091

SLIDE 48

Example: Who Gets the Flu?

We want the probability that p̂ is 0.023 or greater. First we standardize p̂ and call the corresponding statistic z:

z = (p̂ − 0.025)/0.00091

Now finish the calculation:

P(p̂ ≥ 0.023) = P((p̂ − 0.025)/0.00091 ≥ (0.023 − 0.025)/0.00091)
             = P(z ≥ −2.20)
             = 1 − 0.0139 = 0.9861
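The same arithmetic can be reproduced with `statistics.NormalDist` (a sketch; the table value 0.0139 corresponds to Φ(−2.20)):

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.025, 29_483  # values from the survey example

sd = sqrt(p * (1 - p) / n)      # standard deviation of p-hat
z = (0.023 - p) / sd            # standardized value
prob = 1 - NormalDist().cdf(z)  # P(p-hat >= 0.023)

print(round(sd, 5), round(z, 2), round(prob, 4))
```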

SLIDE 49

Example: Who Gets the Flu?

There is a more than 98% chance that any sample the Gallup-Healthways survey conducts will contain at least 2.3% who say “yes”.