SLIDE 1
ACMS 20340 Statistics for Life Sciences, Chapter 13: Sampling Distributions
SLIDE 2
SLIDE 3
Some Terminology
A parameter is a number that describes some aspect of a population. In practice, we don’t know the value of a parameter because we cannot directly examine/measure the entire population. A statistic is a number that can be computed from the sample data, without making use of any unknown parameters. In practice, we often use statistics to estimate an unknown parameter.
SLIDE 4
Mnemonic Device
Statistics come from Samples. Parameters come from Populations.
SLIDE 5
An Illustration
According to the 2008 Health and Nutrition Examination Survey, the mean weight of the sample of American adult males was x̄ = 191.5 pounds. 191.5 is a statistic. The population: all American adult males over the age of 20. The parameter: the mean weight of all the members of the population.
SLIDE 6
On Means
We will always use µ to represent the mean of a population. This is a fixed parameter that is unknown when we use a sample for inference. We will always write x̄ for the mean of the sample. This is the average of the observations in the sample.
SLIDE 7
The Key Question
If the sample mean x̄ is rarely exactly equal to the population mean µ and can vary from sample to sample, how can we consider it a reasonable estimate of µ?
SLIDE 8
The Answer. . .
If we take larger and larger samples, the statistic x̄ is guaranteed to get closer and closer to the parameter µ. This fact is known as the Law of Large Numbers.
SLIDE 9
The Law of Large Numbers 1
Recall: In the long run, the proportion of occurrences of a given outcome gets closer and closer to the probability of that outcome.
E.g. the proportion of heads when tossing a fair coin gets closer to 1/2 in the long run. Similarly, in the long run, the average outcome gets close to the population mean.
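The coin-tossing illustration can be run as a quick simulation (not from the slides; the seed and checkpoint counts are arbitrary choices for reproducibility):

```python
import random

random.seed(0)  # fixed seed so reruns give the same tosses

heads, tosses = 0, 0
for checkpoint in [100, 10_000, 1_000_000]:
    while tosses < checkpoint:
        heads += random.random() < 0.5  # True counts as 1
        tosses += 1
    print(f"{tosses:>9} tosses: proportion of heads = {heads / tosses:.4f}")
```

The printed proportions drift toward 1/2 as the number of tosses grows, just as the Law of Large Numbers predicts.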
SLIDE 10
The Law of Large Numbers 2
Using the basic laws of probability, we can prove the law of large numbers. The “Law of Large Numbers” applet is useful for illustrating the law.
SLIDE 11
A Word of Caution
Only in the very long run does the sample mean get really close to the population mean, and so in this respect, the Law of Large Numbers is not very practical. However, the success of certain businesses, such as casinos and insurance companies, depends on the Law of Large Numbers.
SLIDE 12
Sampling Distributions 1
The Law of Large Numbers ⇒ if we measure enough subjects, the statistic x̄ will eventually get close to the parameter µ. What if we can only take samples of a smaller size, say 10?
SLIDE 13
Sampling Distributions 2
What would happen if we took many samples of 10 subjects from this population? To answer this question:
◮ Take a large number of samples of size 10 from the population
◮ Calculate the sample mean x̄ for each sample
◮ Make a histogram of the values of x̄
◮ Examine the distribution in the histogram (shape, center, spread, outliers, etc.)
SLIDE 14
By Way of Example. . . 1
◮ High levels of dimethyl sulfide (DMS) in wine cause the wine to smell bad.
◮ Winemakers are thus interested in determining the odor threshold, the lowest concentration of DMS that the human nose can detect.
◮ The threshold varies from person to person, so we’d like to find the mean threshold µ in the population of all adults.
◮ An SRS of size 10 yields the values 28 40 28 33 20 31 29 27 17 21, and thus we have a sample mean x̄ = 27.4.
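The sample mean above is just the average of the ten measured values, which a one-liner confirms:

```python
# The ten odor-threshold measurements from the SRS above
thresholds = [28, 40, 28, 33, 20, 31, 29, 27, 17, 21]
x_bar = sum(thresholds) / len(thresholds)
print(x_bar)  # 27.4
```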
SLIDE 15
By Way of Example. . . 2
◮ It turns out that the DMS odor threshold of adults follows a roughly Normal distribution with µ = 25 mg/L and standard deviation σ = 7 mg/L.
◮ By following the procedure outlined before (taking 1,000 SRSs), we produce a histogram that displays the distribution of the values of x̄ from the 1,000 SRSs.
◮ This histogram displays the sampling distribution of the statistic x̄.
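The 1,000-SRS procedure can be reproduced with a short simulation (a sketch, not the slides’ actual code; the seed is an arbitrary choice):

```python
import random
import statistics

random.seed(1)  # reproducible; the seed itself is arbitrary

MU, SIGMA = 25, 7       # DMS population parameters from the slide
N, N_SAMPLES = 10, 1_000

# One sample mean per SRS of size 10
means = [statistics.mean(random.gauss(MU, SIGMA) for _ in range(N))
         for _ in range(N_SAMPLES)]

print(statistics.mean(means))   # close to mu = 25
print(statistics.stdev(means))  # close to sigma / sqrt(10), about 2.21
```

A histogram of `means` would show the approximately Normal shape discussed on the following slides.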
SLIDE 16
By Way of Example. . . 3
SLIDE 17
The Official Definition
The sampling distribution of a statistic is the distribution of values taken by the statistic over all possible samples of some fixed size from the population. Thus, the histogram on the previous slide actually displays an approximation to the sampling distribution of the statistic x̄. Important point: The sample mean is a random variable!
◮ Since “good” samples are chosen randomly, statistics such as the sample mean x̄ are random variables.
◮ Thus we can describe the behavior of a sample statistic by means of a probability model.
SLIDE 18
An Important Difference
◮ The law of large numbers describes what would happen if we took random samples of increasing size n.
◮ A sampling distribution describes what would happen if we took all random samples of a fixed size n.
SLIDE 19
Examining the Sampling Distribution
◮ Shape: It appears to be Normal.
◮ Center: The mean of the 1000 x̄’s is 24.95, very close to the population mean µ = 25.
◮ Spread: The s.d. of the 1000 x̄’s is 2.217, much smaller than the population s.d. σ = 7.
SLIDE 20
A General Fact
When we choose many SRSs from a population, the sampling distribution of the sample means is centered at the mean of the original population.
But the sampling distribution is also less spread out than the distribution of individual observations.
SLIDE 21
More Precisely
Suppose that x̄ is the mean of an SRS of size n drawn from a large population with mean µ and standard deviation σ. Then the sampling distribution of x̄ has mean µ_x̄ and standard deviation σ_x̄ = σ/√n.
Note that µ_x̄ = µ. This notation simply distinguishes the two distributions. Because the mean µ_x̄ of the sampling distribution of the statistic x̄ is equal to µ, we say that the statistic x̄ is an unbiased estimator of the parameter µ.
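The formula σ_x̄ = σ/√n can be checked against the DMS simulation numbers:

```python
import math

sigma, n = 7, 10                 # DMS example values from the slides
sd_of_mean = sigma / math.sqrt(n)
print(round(sd_of_mean, 3))      # 2.214, close to the simulated s.d. of 2.217
```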
SLIDE 22
Unbiased Estimators
◮ An unbiased estimator is “correct on the average” over many samples.
◮ Just how close the estimator will be to the parameter in most samples is determined by the spread of the sampling distribution.
◮ If the individual observations have s.d. σ, then sample means x̄ from samples of size n have s.d. σ/√n.
◮ Thus, averages are less variable than individual observations.
SLIDE 23
For a Normal Population
If individual observations have the distribution N(µ, σ), then the sample mean x̄ of an SRS of size n has the distribution N(µ, σ/√n).
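To see what this buys us, compare the chance that a single observation exceeds some cutoff with the chance that a sample mean does. Using the DMS numbers and a made-up cutoff of 30 (the cutoff is purely for illustration):

```python
import math

def normal_cdf(x, mu, sigma):
    # Normal CDF expressed through the error function
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma, n = 25, 7, 10   # DMS example; the cutoff 30 is a hypothetical choice
p_one = 1 - normal_cdf(30, mu, sigma)                  # one threshold above 30
p_mean = 1 - normal_cdf(30, mu, sigma / math.sqrt(n))  # mean of 10 above 30
print(p_one, p_mean)
```

A single observation lands above 30 roughly a quarter of the time, while a mean of 10 observations almost never does: averages are less variable.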
SLIDE 24
Seeing is Believing
SLIDE 25
Non-Normal Distributions?
We know what the values of the mean and standard deviation of x̄ will be, regardless of the population distribution. But what can be known about the shape of the sampling distribution?
Population Distribution is Normal → Sampling Distribution is Normal.
Population Distribution is not Normal → Sampling Distribution is ?????.
SLIDE 26
Central Limit Theorem
Remarkably, as the sample size of a non-Normal population increases, the sampling distribution of ¯ x changes shape. In fact, the sampling distribution starts to look more like a Normal distribution regardless of what the population distribution looks like. This idea is the Central Limit Theorem.
SLIDE 27
The Official Definition
Draw an SRS of size n from any population with mean µ and standard deviation σ. When n is large, the sampling distribution of the sample mean x̄ is approximately Normal: x̄ is a random variable with distribution (roughly) N(µ, σ/√n).
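A simulation sketch (not from the slides; the Exponential population, seed, and sample counts are arbitrary choices) shows the theorem at work on a strongly skewed population:

```python
import random
import statistics

random.seed(2)  # reproducible; the seed itself is arbitrary

# A strongly right-skewed population: Exponential with mean 1 (s.d. also 1)
N, N_SAMPLES = 50, 2_000

means = [statistics.mean(random.expovariate(1.0) for _ in range(N))
         for _ in range(N_SAMPLES)]

print(statistics.mean(means))   # close to the population mean, 1
print(statistics.stdev(means))  # close to 1 / sqrt(50), about 0.141
```

Even though each individual observation comes from a skewed distribution, a histogram of `means` would look roughly Normal with mean µ = 1 and s.d. σ/√n.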
SLIDE 28
So Why Do We Care?
The Central Limit Theorem allows us to use Normal probability calculations to answer questions about sample means, even if the population distribution is not Normal.
SLIDE 29
Central Limit in Action
(a) Strongly skewed population distribution. (b) Sampling distribution of x̄ with n = 2. (c) Sampling distribution of x̄ with n = 10. (d) Sampling distribution of x̄ with n = 25.
SLIDE 30
Warning!
The CLT applies to sampling distributions, not the distribution of a sample.
◮ Now I’m confused: doesn’t a larger sample size mean a more Normal distribution of a sample?
◮ No. A skewed population will likely yield skewed random samples; the CLT only describes the distribution of averages over repeated samples.
SLIDE 31
Sample Sizes 1
How large does the sample need to be for the sampling distribution of x̄ to be close to Normal? The answer depends on the population distribution. Farther from Normal ⇒ more observations per sample needed.
SLIDE 32
Sample Sizes 2
General rule of thumb for sample size n:
◮ Skewed populations ⇒ a sample of size 25 is generally enough to obtain a Normal sampling distribution.
◮ Extremely skewed populations ⇒ a sample of size 40 is generally enough to obtain a Normal sampling distribution.
SLIDE 33
Sample Sizes 3
Angle of big toe deformations in 28 patients. Population likely close to Normal, so sampling distribution should be Normal.
SLIDE 34
Sample Sizes 4
Servings of fruit per day for 74 adolescent girls. Population likely skewed, but sampling distribution should be Normal due to large sample size.
SLIDE 35
CLT and Sampling Distributions
There are a few helpful facts that come out of the Central Limit Theorem. These are always true, regardless of population distribution.
◮ Means of random samples are less variable than individual observations.
◮ Means of random samples are more Normal than individual observations.
SLIDE 36
Sampling Distributions for Probabilities
We have seen that sampling distributions are useful for analyzing the means of quantitative variables. But what if we have a categorical variable instead? Fortunately, we can use the sampling distribution of p̂.
SLIDE 37
Probability and Categorical Variables
Categorical variables can take any of a finite number of possible outcomes.
We choose one such outcome and call it a “success”. All other outcomes are then “non-successes” or “failures.” Note: This is an arbitrary choice, not a moral judgment.
SLIDE 38
Terminology
An experiment finds that 6 of 20 birds exposed to an avian flu strain develop flu symptoms. We define the random variable X = the number of birds with flu symptoms. Recall: X is a count of the “successes” of this categorical variable in a fixed number of observations.
SLIDE 39
Terminology
If the number of observations is labeled n, then the sample proportion is
p̂ = (count of successes in sample)/(size of sample) = X/n
Similar to the sample average x̄, we can find the sampling distribution for p̂.
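For the bird-flu experiment above, the sample proportion works out directly:

```python
X, n = 6, 20   # birds with flu symptoms, birds observed
p_hat = X / n
print(p_hat)   # 0.3
```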
SLIDE 40
Recall: Binomial Distribution
As we saw last week, a binomial distribution consists of n observations with a constant probability of success p for each observation.
Here we will rely heavily on the fact that the binomial distribution (which is discrete) can be approximated by a Normal distribution.
SLIDE 41
Recall: Normal Approximation to Binomial Distribution
Suppose a count X has a binomial distribution with n observations and success probability p. When n is large, the distribution of X is approximately Normal with distribution N(np, √(np(1 − p))).
As a rule of thumb, n should be large enough for the count of successes and failures to be at least 10 each.
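The rule of thumb can be written as a small check (a sketch; the helper name `normal_approx_ok` is my own, not from the slides):

```python
def normal_approx_ok(n, p):
    """Rule of thumb: expected successes and expected failures both at least 10."""
    return n * p >= 10 and n * (1 - p) >= 10

print(normal_approx_ok(20, 0.3))        # False: only 6 expected successes
print(normal_approx_ok(29_483, 0.025))  # True: about 737 expected successes
```

Note that the bird-flu sample (n = 20, p ≈ 0.3) fails the check, while the large survey used later in these slides passes it easily.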
SLIDE 42
Sampling Distribution of a Sample Proportion
A count of successes has limited use when comparing different studies (as the sample sizes may differ drastically). The sample proportion p̂ is a much more informative sample statistic. How good is the statistic p̂ as an estimate of the parameter p? Again we ask: “What happens with many samples?”
SLIDE 43
The Official Definition
Choose an SRS of size n from a large population that has proportion p of successes. Let p̂ be the sample proportion of successes,
p̂ = (count of successes in the sample)/n
Then:
◮ The mean of the sampling distribution is p.
◮ The standard deviation of the sampling distribution is √(p(1 − p)/n).
◮ As the sample size increases, the sampling distribution of p̂ becomes approximately Normal.
SLIDE 44
Summary in Picture Form
SLIDE 45
Warning!
Do not use the Normal approximation for the sampling distribution of p̂ when the sample size is small. Also, the population should be much larger than the sample. We’ll say, at least 20 times larger, as a rule of thumb. This approximation is least accurate when p is close to 0 or 1. (Our sample would contain only successes or failures unless n is very large.)
SLIDE 46
Example: Who Gets the Flu?
Suppose that we know that 2.5% of all American adults were sick with the flu on a given day of January 2010. The Gallup-Healthways survey interviewed a random sample of 29,483 people and asked each whether they were sick with the flu. What is the probability that at least 2.3% of such a sample would answer “yes” in the survey?
SLIDE 47
Example: Who Gets the Flu?
The population proportion is about p = 0.025 and n = 29,483. So the sample proportion p̂ has mean 0.025 and standard deviation
√(p(1 − p)/n) = √((0.025)(0.975)/29,483) = 0.00091
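The standard deviation can be verified with a quick computation:

```python
import math

p, n = 0.025, 29_483            # flu survey values from the slide
sd = math.sqrt(p * (1 - p) / n)
print(round(sd, 5))             # 0.00091
```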
SLIDE 48
Example: Who Gets the Flu?
We want the probability that p̂ is 0.023 or greater. First we standardize p̂ and call the corresponding statistic z:
z = (p̂ − 0.025)/0.00091
Now finish the calculation:
P(p̂ ≥ 0.023) = P((p̂ − 0.025)/0.00091 ≥ (0.023 − 0.025)/0.00091)
= P(z ≥ −2.20)
= 1 − 0.0139 = 0.9861
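The whole calculation can be reproduced numerically (a sketch using the error function in place of a Normal table):

```python
import math

def normal_cdf(z):
    # Standard Normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

p, n = 0.025, 29_483
sd = math.sqrt(p * (1 - p) / n)

z = (0.023 - p) / sd        # about -2.20
prob = 1 - normal_cdf(z)    # P(p_hat >= 0.023), about 0.9861
print(round(z, 2), round(prob, 4))
```

So a sample of this size reports at least 2.3% flu cases about 98.6% of the time, matching the table-based answer above.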
SLIDE 49