ACMS 20340 Statistics for Life Sciences, Chapter 13: Sampling Distributions



SLIDE 1

ACMS 20340 Statistics for Life Sciences

Chapter 13: Sampling Distributions

SLIDE 2

Sampling

We use information from a sample to infer something about a population. When using random samples and randomized experiments, we cannot rule out the possibility of incorrect inferences. So we ask: How often would this method give a correct answer if we used it a large number of times?

SLIDE 3

Some Terminology

A parameter is a number which describes some aspect of a population. In practice, we don’t know the value of a parameter because we cannot directly examine/measure the entire population. A statistic is a number that can be computed from the sample data, without making use of any unknown parameters. In practice we often use statistics to estimate an unknown parameter.

SLIDE 4

Mnemonic Device

Statistics come from Samples. Parameters come from Populations.

SLIDE 5

An Illustration

According to the 2008 Health and Nutrition Examination Survey, the mean weight of the sample of American adult males was x̄ = 191.5 pounds. 191.5 is a statistic. The population: all American adult males over the age of 20. The parameter: the mean weight of all the members of the population.

SLIDE 6

On Means

We will always use µ to represent the mean of a population. This is a fixed parameter that is unknown when we use a sample for inference. We will always write x̄ for the mean of the sample. This is the average of the observations in the sample.

SLIDE 7

The Key Question

If the sample mean x̄ is rarely exactly equal to the population mean µ and can vary from sample to sample, how can we consider it a reasonable estimate of µ?

SLIDE 8

The Answer. . .

If we take larger and larger samples, the statistic x̄ is guaranteed to get closer and closer to the parameter µ. This fact is known as the Law of Large Numbers.

SLIDE 9

The Law of Large Numbers 1

Recall: In the long run, the proportion of occurrences of a given outcome gets closer and closer to the probability of that outcome.

E.g. the proportion of heads when tossing a fair coin gets closer to 1/2 in the long run. Similarly, in the long run, the average outcome gets close to the population mean.
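The coin-toss claim can be checked with a quick simulation. A minimal sketch in Python; the seed and toss counts are arbitrary choices made for illustration:

```python
import random

random.seed(1)  # arbitrary seed, for a reproducible illustration

def proportion_heads(n_tosses):
    """Toss a fair coin n_tosses times; return the proportion of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# The proportion drifts toward the probability 1/2 as the number of
# tosses grows: the Law of Large Numbers in action.
for n in (10, 1_000, 100_000):
    print(n, proportion_heads(n))
```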

SLIDE 10

The Law of Large Numbers 2

Using the basic laws of probability, we can prove the law of large numbers. The “Law of Large Numbers” applet is useful for illustrating the law.

SLIDE 11

A Word of Caution

Only in the very long run does the sample mean get really close to the population mean, and so in this respect, the Law of Large Numbers is not very practical. However, the success of certain businesses, such as casinos and insurance companies, depends on the Law of Large Numbers.

SLIDE 12

Sampling Distributions 1

The Law of Large Numbers ⇒ If we measure enough subjects, the statistic x̄ will eventually get close to the parameter µ. What if we can only take samples of a smaller size, say 10?

SLIDE 13

Sampling Distributions 2

What would happen if we took many samples of 10 subjects from this population? To answer this question:

◮ Take a large number of samples of size 10 from the population
◮ Calculate the sample mean x̄ for each sample
◮ Make a histogram of the values of x̄
◮ Examine the distribution in the histogram (shape, center, spread, outliers, etc.)
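These steps can be sketched directly in Python. The uniform population of 10,000 values below is a made-up stand-in, since the slide does not fix a particular population:

```python
import random
import statistics

random.seed(2)  # for a reproducible illustration

# A hypothetical population of 10,000 measurements (uniform on [0, 100]).
population = [random.uniform(0, 100) for _ in range(10_000)]

# Steps 1-2: take many samples of size 10 and compute each sample mean.
sample_means = [statistics.mean(random.sample(population, 10))
                for _ in range(1_000)]

# Steps 3-4: a histogram of sample_means would show the shape; here we
# just compare its center to the population mean.
print("population mean:     ", statistics.mean(population))
print("mean of x-bar values:", statistics.mean(sample_means))
```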

SLIDE 14

By Way of Example. . . 1

◮ High levels of dimethyl sulfide (DMS) in wine cause the wine to smell bad.
◮ Winemakers are thus interested in determining the odor threshold, the lowest concentration of DMS that the human nose can detect.
◮ The threshold varies from person to person, so we'd like to find the mean threshold µ in the population of all adults.
◮ An SRS of size 10 yields the values
  28 40 28 33 20 31 29 27 17 21
  and thus we have a sample mean x̄ = 27.4.

SLIDE 15

By Way of Example. . . 2

◮ It turns out that the DMS odor threshold of adults follows a roughly Normal distribution with µ = 25 mg/L and standard deviation σ = 7 mg/L.
◮ By following the procedure outlined before (taking 1,000 SRSs), we produce a histogram that displays the distribution of the values of x̄ from the 1,000 SRSs.
◮ This histogram displays the sampling distribution of the statistic x̄.
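A sketch of this simulation in Python, using the slide's values µ = 25 and σ = 7 (the seed and the count of 1,000 samples are arbitrary):

```python
import random
import statistics

random.seed(3)  # for a reproducible illustration

MU, SIGMA, N = 25, 7, 10  # DMS threshold parameters from the slide

# Take 1,000 SRSs of size 10 from N(25, 7); record each sample mean x-bar.
xbars = [statistics.mean(random.gauss(MU, SIGMA) for _ in range(N))
         for _ in range(1_000)]

print("center of x-bars:", statistics.mean(xbars))   # near mu = 25
print("spread of x-bars:", statistics.stdev(xbars))  # near 7/sqrt(10) ~ 2.21
```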

SLIDE 16

By Way of Example. . . 3

SLIDE 17

The Official Definition

The sampling distribution of a statistic is the distribution of values taken by the statistic over all possible samples of some fixed size from the population. Thus, the histogram on the previous slide actually displays an approximation to the sampling distribution of the statistic x̄. Important point: The sample mean is a random variable!

◮ Since "good" samples are chosen randomly, statistics such as the sample mean x̄ are random variables.
◮ Thus we can describe the behavior of a sample statistic by means of a probability model.

SLIDE 18

An Important Difference

◮ The law of large numbers describes what would happen if we took random samples of increasing size n.
◮ A sampling distribution describes what would happen if we took all random samples of a fixed size n.

SLIDE 19

Examining the Sampling Distribution

◮ Shape: It appears to be Normal.
◮ Center: The mean of the 1,000 x̄'s is 24.95, very close to the population mean µ = 25.
◮ Spread: The s.d. of the 1,000 x̄'s is 2.217, much smaller than the population s.d. σ = 7.

SLIDE 20

A General Fact

When we choose many SRSs from a population, the sampling distribution of the sample means is centered at the mean of the original population.

But the sampling distribution is also less spread out than the distribution of individual observations.

SLIDE 21

More Precisely

Suppose that x̄ is the mean of an SRS of size n drawn from a large population with mean µ and standard deviation σ. Then the sampling distribution of x̄ has mean µ_x̄ and standard deviation σ_x̄ = σ/√n.

Note that µ_x̄ = µ. This notation is simply to tell the difference between the two distributions. Because the mean of the sampling distribution of the statistic x̄, µ_x̄, is equal to µ, we say that the statistic x̄ is an unbiased estimator of the parameter µ.

SLIDE 22

Unbiased Estimators

◮ An unbiased estimator is "correct on the average" over many samples.
◮ Just how close the estimator will be to the parameter in most samples is determined by the spread of the sampling distribution.
◮ If the individual observations have s.d. σ, then sample means x̄ from samples of size n have s.d. σ/√n.
◮ Thus, averages are less variable than individual observations.

SLIDE 23

For a Normal Population

If individual observations have the distribution N(µ, σ), then the sample mean x̄ of an SRS of size n has the distribution N(µ, σ/√n).
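Python's `statistics.NormalDist` makes this concrete. A sketch reusing the DMS values µ = 25, σ = 7, and n = 10 (an assumption carried over from the earlier example):

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 25, 7, 10  # DMS example values, reused for illustration

one_obs = NormalDist(mu, sigma)         # one observation: N(mu, sigma)
xbar = NormalDist(mu, sigma / sqrt(n))  # sample mean: N(mu, sigma/sqrt(n))

# The average is far more likely than a single observation
# to land within 2 mg/L of mu.
p_single = one_obs.cdf(27) - one_obs.cdf(23)
p_mean = xbar.cdf(27) - xbar.cdf(23)
print(p_single, p_mean)
```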

SLIDE 24

Seeing is Believing

SLIDE 25

Non-Normal Distributions?

We know what the values of the mean and standard deviation of x̄ will be, regardless of the population distribution. But what can be known about the shape of the sampling distribution?

Population Distribution is Normal → Sampling Distribution is Normal.
Population Distribution is not Normal → Sampling Distribution is ?????.

SLIDE 26

Central Limit Theorem

Remarkably, as the sample size of a non-Normal population increases, the sampling distribution of ¯ x changes shape. In fact, the sampling distribution starts to look more like a Normal distribution regardless of what the population distribution looks like. This idea is the Central Limit Theorem.

SLIDE 27

The Official Definition

Draw an SRS of size n from any population with mean µ and standard deviation σ. When n is large, the sampling distribution of the sample mean x̄ is approximately Normal: x̄ is a random variable with distribution (roughly) N(µ, σ/√n).

SLIDE 28

So Why Do We Care?

The Central Limit Theorem allows us to use Normal probability calculations to answer questions about sample means, even if the population distribution is not Normal.

SLIDE 29

Central Limit in Action

(a) Strongly skewed population distribution. (b) Sampling distribution of x̄ with n = 2. (c) Sampling distribution of x̄ with n = 10. (d) Sampling distribution of x̄ with n = 25.
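The same progression can be reproduced numerically. A sketch using an exponential population as the strongly skewed distribution (an assumption; the slide does not specify one), checking that the skewness of the x̄ values shrinks toward 0, the value for a symmetric Normal shape, as n grows:

```python
import random
import statistics

random.seed(5)  # for a reproducible illustration

def sample_mean(n):
    """Mean of one SRS of size n from a right-skewed (exponential) population."""
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

skews = []
for n in (2, 10, 25):
    means = [sample_mean(n) for _ in range(5_000)]
    center, spread = statistics.mean(means), statistics.stdev(means)
    # Sample skewness: near 0 for a Normal shape.
    skews.append(statistics.mean((m - center) ** 3 for m in means) / spread ** 3)

print(skews)  # skewness shrinks toward 0 as n grows
```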

SLIDE 30

Warning!

The CLT applies to sampling distributions, not the distribution of a sample.

◮ "Now I'm confused. Larger sample size = more Normal distribution of a sample?" No: a skewed population will likely produce skewed random samples, no matter how large.
◮ The CLT only describes the distribution of averages for repeated samples.

SLIDE 31

Sample Sizes 1

How large does the sample need to be for the sampling distribution of x̄ to be close to Normal? The answer depends on the population distribution.

Farther from Normal ⇒ More observations per sample needed

SLIDE 32

Sample Sizes 2

General rule of thumb for sample size n:

◮ Skewed populations ⇒ A sample of size 25 is generally enough to obtain a Normal sampling distribution.
◮ Extremely skewed populations ⇒ A sample of size 40 is generally enough to obtain a Normal sampling distribution.

SLIDE 33

Sample Sizes 3

Angle of big toe deformations in 28 patients. Population likely close to Normal, so sampling distribution should be Normal.

SLIDE 34

Sample Sizes 4

Servings of fruit per day for 74 adolescent girls. Population likely skewed, but sampling distribution should be Normal due to large sample size.

SLIDE 35

CLT and Sampling Distributions

There are a few helpful facts that come out of the Central Limit Theorem. These are always true, regardless of population distribution.

◮ Means of random samples are less variable than individual observations.
◮ Means of random samples are more Normal than individual observations.
SLIDE 36

Sampling Distributions for Probabilities

We have seen that sampling distributions are useful for analyzing the means of quantitative variables. But what if we have a categorical variable instead? Fortunately, we can use the sampling distribution of p̂.

SLIDE 37

Probability and Categorical Variables

Categorical variables can take any of a finite number of possible outcomes. We choose one such outcome and call it a "success". All other outcomes are then "non-successes" or "failures." Note: This is an arbitrary choice, not a moral judgment.

SLIDE 38

Terminology

An experiment finds that 6 of 20 birds exposed to an avian flu strain develop flu symptoms. We let the random variable X be the number of birds with flu symptoms. Recall: X is a count of the "successes" of this categorical variable in a fixed number of observations.

SLIDE 39

Terminology

If the number of observations is labeled n, then the sample proportion is

p̂ = (count of successes in sample)/(size of sample) = X/n

Similar to the sample average x̄, we can find the sampling distribution for p̂.

SLIDE 40

Recall: Binomial Distribution

As we saw last week, a binomial distribution consists of n observations and a constant probability of success p for each observation.

Here we will rely heavily on the fact that the binomial distribution (which is discrete) can be approximated by a Normal distribution.

SLIDE 41

Recall: Normal Approximation to Binomial Distribution

Suppose a count X has a binomial distribution with n observations and success probability p. When n is large, the distribution of X is approximately Normal with distribution N(np, √(np(1 − p))).

As a rule of thumb, n should be large enough for the counts of successes and failures to be at least 10 each.
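We can check the approximation against the exact binomial probability with Python's `statistics.NormalDist`. The values n = 100 and p = 0.3 are made up for illustration; note np = 30 and n(1 − p) = 70 both exceed 10, so the rule of thumb is satisfied:

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.3  # hypothetical example; np and n(1-p) are both >= 10

# Exact binomial probability P(X <= 35).
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(36))

# Normal approximation: X is roughly N(np, sqrt(np(1-p))).
approx = NormalDist(n * p, sqrt(n * p * (1 - p))).cdf(35)

print(exact, approx)  # close, though not identical
```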

SLIDE 42

Sampling Distribution of a Sample Proportion

A count of successes has limited use when comparing different studies (as the sample sizes may differ drastically). The sample proportion p̂ is a much more informative sample statistic. How good is the statistic p̂ as an estimate of the parameter p? Again we ask: "What happens with many samples?"

SLIDE 43

The Official Definition

Choose an SRS of size n from a large population that has proportion p of successes. Let p̂ be the sample proportion of successes, p̂ = (count of successes in the sample)/n. Then:

◮ The mean of the sampling distribution is p.
◮ The standard deviation of the sampling distribution is √(p(1 − p)/n).
◮ As the sample size increases, the sampling distribution of p̂ becomes approximately Normal.
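A quick simulation confirms the claims about the center and spread of p̂ (a Python sketch; p = 0.3 and n = 200 are arbitrary illustrative values):

```python
import random
import statistics
from math import sqrt

random.seed(7)  # for a reproducible illustration

p, n = 0.3, 200  # hypothetical success probability and sample size

# Simulate 2,000 SRSs; record the sample proportion p-hat for each.
phats = [sum(random.random() < p for _ in range(n)) / n
         for _ in range(2_000)]

print("mean of p-hats:", statistics.mean(phats))   # near p = 0.3
print("sd of p-hats:  ", statistics.stdev(phats))  # near sqrt(p(1-p)/n)
print("theoretical sd:", sqrt(p * (1 - p) / n))
```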

SLIDE 44

Summary in Picture Form

SLIDE 45

Warning!

Do not use the Normal approximation for the sampling distribution of p̂ when the sample size is small. Also, the population should be much larger than the sample. We'll say, at least 20 times larger, as a rule of thumb. This approximation is least accurate when p is close to 0 or 1. (Our sample would contain only successes or failures unless n is very large.)

SLIDE 46

Example: Who Gets the Flu?

Suppose that we know that 2.5% of all American adults were sick with the flu on a given day of January 2010. The Gallup-Healthways survey interviewed a random sample of 29,483 people and asked each whether they were sick with the flu. What is the probability that at least 2.3% of such a sample would answer "yes" in the survey?

SLIDE 47

Example: Who Gets the Flu?

The population proportion is about p = 0.025 and n = 29,483. So the sample proportion p̂ has mean 0.025 and standard deviation

√(p(1 − p)/n) = √((0.025)(0.975)/29,483) ≈ 0.00091

SLIDE 48

Example: Who Gets the Flu?

We want the probability that p̂ is 0.023 or greater. First we standardize p̂ and call the corresponding statistic z:

z = (p̂ − 0.025)/0.00091

Now finish the calculation:

P(p̂ ≥ 0.023) = P((p̂ − 0.025)/0.00091 ≥ (0.023 − 0.025)/0.00091)
             = P(z ≥ −2.20)
             = 1 − 0.0139 = 0.9861
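The same arithmetic can be reproduced with `statistics.NormalDist` (a sketch; the table value 0.0139 corresponds to Φ(−2.20)):

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.025, 29_483  # values from the survey example

sd = sqrt(p * (1 - p) / n)      # standard deviation of p-hat
z = (0.023 - p) / sd            # standardized value
prob = 1 - NormalDist().cdf(z)  # P(p-hat >= 0.023)

print(round(sd, 5), round(z, 2), round(prob, 4))
```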

SLIDE 49

Example: Who Gets the Flu?

There is a more than 98% chance that any sample the Gallup-Healthways survey conducts will contain at least 2.3% who say “yes”.