
SLIDE 1

Sampling

Marc H. Mehlman
marcmehlman@yahoo.com
University of New Haven

SLIDE 2

Table of Contents

1. Sampling Distributions
2. Central Limit Theorem
3. Binomial Distribution

SLIDE 3

Sampling Distributions

SLIDE 4

Sampling Distributions

As we begin to use sample data to draw conclusions about a wider population, we must be clear about whether a number describes a sample or a population.

Parameters and Statistics

  • A parameter is a number that describes some characteristic of the population. In statistical practice, the value of a parameter is not known because we cannot examine the entire population.
  • A statistic is a number that describes some characteristic of a sample. The value of a statistic can be computed directly from the sample data. We often use a statistic to estimate an unknown parameter.

Remember s and p: statistics come from samples and parameters come from populations. We write µ (the Greek letter mu) for the population mean and σ for the population standard deviation. We write x̄ (x-bar) for the sample mean and s for the sample standard deviation.

SLIDE 5

Sampling Distributions

Statistical Estimation

The process of statistical inference involves using information from a sample to draw conclusions about a wider population. Different random samples yield different statistics. We need to be able to describe the sampling distribution of possible statistic values in order to perform statistical inference. We can think of a statistic as a random variable because it takes numerical values that describe the outcomes of the random sampling process.

[Diagram: collect data from a representative sample, then make an inference about the population.]

SLIDE 6

Sampling Distributions

Sampling Variability

Different random samples yield different statistics. This basic fact is called sampling variability: the value of a statistic varies in repeated random sampling. To make sense of sampling variability, we ask, “What would happen if we took many samples?”

[Diagram: one population, many samples, each yielding its own statistic.]

SLIDE 7

Sampling Distributions

The law of large numbers assures us that if we measure enough subjects, the statistic x̄ will eventually get very close to the unknown parameter µ. If we took every one of the possible samples of a certain size, calculated the sample mean for each, and graphed all of those values, we’d have a sampling distribution.

The population distribution of a variable is the distribution of values of the variable among all individuals in the population. The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.

SLIDE 8

Sampling Distributions

Mean and Standard Deviation of a Sample Mean

Mean of a sampling distribution of a sample mean: there is no tendency for a sample mean to fall systematically above or below µ, even if the distribution of the raw data is skewed. Thus, the mean of the sampling distribution is an unbiased estimate of the population mean µ.

Standard deviation of a sampling distribution of a sample mean: the standard deviation of the sampling distribution measures how much the sample statistic varies from sample to sample. It is smaller than the standard deviation of the population by a factor of √n. Averages are less variable than individual observations.

SLIDE 9

Sampling Distributions

The Sampling Distribution of a Sample Mean

When we choose many SRSs from a population, the sampling distribution of the sample mean is centered at the population mean µ and is less spread out than the population distribution. Here are the facts.

The Sampling Distribution of Sample Means

Suppose that x̄ is the mean of an SRS of size n drawn from a large population with mean µ and standard deviation σ. Then:

  • The mean of the sampling distribution of x̄ is µ_x̄ = µ.
  • The standard deviation of the sampling distribution of x̄ is σ_x̄ = σ/√n.

Note: these facts about the mean and standard deviation of x̄ are true no matter what shape the population distribution has. Moreover, if individual observations have the N(µ, σ) distribution, then the sample mean of an SRS of size n has the N(µ, σ/√n) distribution regardless of the sample size n.
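These facts are easy to check by simulation. Below is a minimal R sketch, in the spirit of the R snippets on later slides; the values µ = 10, σ = 4, and n = 25 are made up for illustration. Drawing many SRSs and computing each sample mean should give values centered near µ with spread near σ/√n = 0.8.

# Minimal simulation sketch (illustrative values, not from the slides).
set.seed(1)
mu <- 10; sigma <- 4; n <- 25
xbars <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))
mean(xbars)   # close to mu = 10
sd(xbars)     # close to sigma/sqrt(n) = 0.8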

SLIDE 10

Central Limit Theorem

SLIDE 11

Central Limit Theorem

“I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the ‘law of frequency of error’ [the normal distribution]. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement amidst the wildest confusion. The huger the mob, and the greater the anarchy, the more perfect is its sway. It is the supreme law of Unreason.” – Francis Galton

In the previous slide, the sampling distribution of x̄ is depicted as:

1. having mean µ, i.e., being unbiased;
2. having standard deviation σ/√n;
3. being normally distributed.

The first two depictions are always true, regardless of sample size or population distribution. The Central Limit Theorem (below) says the third is approximately true, regardless of population distribution, for large sample sizes n. As Francis Galton said, the averaged effects of random acts from a large mob form a familiar pattern.

Theorem (Central Limit Theorem, CLT). Consider a random sample of size n from a population with mean µ and standard deviation σ. For large n, the sampling distribution of x̄ is approximately N(µ, σ/√n).
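To see the theorem in action, here is a minimal R sketch (not from the slides). The population is taken to be Exponential(1), a strongly right-skewed distribution with µ = σ = 1, matching the setup of the next slide’s example; despite the skew, the histogram of sample means is close to bell-shaped.

# Minimal CLT sketch: a skewed population, yet nearly Normal sample means.
set.seed(1)
n <- 70
xbars <- replicate(10000, mean(rexp(n, rate = 1)))  # Exponential(1) population
hist(xbars, breaks = 50)  # roughly N(1, 1/sqrt(70)) in shape
mean(xbars)               # close to mu = 1
sd(xbars)                 # close to 1/sqrt(70), about 0.12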

SLIDE 12

Central Limit Theorem

Example

Based on service records from the past year, the time (in hours) that a technician requires to complete preventative maintenance on an air conditioner follows a distribution that is strongly right-skewed and whose most likely outcomes are close to 0. The mean time is µ = 1 hour and the standard deviation is σ = 1.

Your company will service an SRS of 70 air conditioners. You have budgeted 1.1 hours per unit. Will this be enough?

By the central limit theorem, since n = 70 ≥ 30, the sampling distribution of the mean time spent working on the 70 units is approximately normal with

µ_x̄ = µ = 1 and σ_x̄ = σ/√n = 1/√70 ≈ 0.12,

that is, approximately N(1, 0.12). Then

z = (1.1 − 1)/0.12 ≈ 0.83 and P(x̄ > 1.1) = P(Z > 0.83) = 1 − 0.7967 = 0.2033.

If you budget 1.1 hours per unit, there is a 20% chance the technicians will not complete the work within the budgeted time.
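The same probability can be computed directly in R, in the style of the later slides, using the unrounded standard deviation 1/√70 instead of 0.12:

> 1 - pnorm(1.1, mean = 1, sd = 1/sqrt(70))   # about 0.201

The small difference from 0.2033 comes from rounding σ_x̄ to 0.12 in the hand calculation.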

SLIDE 13

Central Limit Theorem

A Few More Facts

Any linear combination of independent Normal random variables is also Normally distributed. More generally, the central limit theorem notes that the distribution of a sum or average of many small random quantities is close to Normal. Finally, the central limit theorem also applies to discrete random variables.
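As a quick illustration of the first fact, here is a minimal R sketch (the coefficients and parameters are made up): a normal quantile plot of a linear combination of independent Normals falls along a straight line.

# Minimal sketch: a linear combination of independent Normals looks Normal.
set.seed(1)
x <- rnorm(10000, mean = 2, sd = 3)
y <- rnorm(10000, mean = -1, sd = 0.5)
w <- 5*x - 2*y        # a linear combination of independent Normals
qqnorm(w); qqline(w)  # points hug the line, consistent with Normality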

SLIDE 14

Binomial Distribution

SLIDE 15

Binomial Distribution

Definition (Bernoulli Distribution, X ∼ BIN(1, p)). Model: X = the number of heads after a single toss of a coin that has probability p of heads.

Definition (Binomial Distribution, X ∼ BIN(n, p)). Model: X = the number of heads after n tosses of a coin that has probability p of heads on each toss.

Theorem. If X ∼ BIN(n, p) and j is an integer with 0 ≤ j ≤ n, then

P(X = j) = (n choose j) p^j (1 − p)^(n−j).

Furthermore, µ_X = np, σ²_X = np(1 − p), and σ_X = √(np(1 − p)).
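The pmf and the moment formulas can be checked directly in R; the values n = 10 and p = 0.3 below are made up for illustration.

# Minimal sketch: binomial pmf and moments in R.
n <- 10; p <- 0.3
dbinom(3, size = n, prob = p)          # P(X = 3) via R's built-in pmf
choose(n, 3) * p^3 * (1 - p)^(n - 3)   # same value from the formula above
n * p                                  # mean: np = 3
sqrt(n * p * (1 - p))                  # sd: sqrt(np(1 - p)), about 1.45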

SLIDE 16

Binomial Distribution

Let Y1, Y2, · · · , Yn be a random sample from BIN(1, p). Then:

1. X := Y1 + Y2 + · · · + Yn ∼ BIN(n, p).
2. p̂ := Ȳ = (# of heads)/(# of tosses) is an unbiased estimator of p.
3. For large n, the distribution of p̂ = Ȳ is approximately N(p, √(p(1 − p)/n)) by the Central Limit Theorem.

Since X = nȲ, one has:

Theorem (Normal Approximation for Binomial Distribution). For large n, X ∼ BIN(n, p) is approximately distributed as N(np, √(np(1 − p))).

For how large an n is the above approximation good? Convention: when np ≥ 10 and n(1 − p) ≥ 10.
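Facts 2 and 3 can be checked with a minimal R sketch (n = 100 and p = 0.3 are chosen to match the example on the next slide): simulated values of p̂ center on p, have spread near √(p(1 − p)/n), and look roughly Normal.

# Minimal sketch: p-hat is roughly unbiased and approximately Normal.
set.seed(1)
n <- 100; p <- 0.3
phat <- replicate(10000, mean(rbinom(n, size = 1, prob = p)))
mean(phat)  # close to p = 0.3 (unbiased)
sd(phat)    # close to sqrt(p*(1 - p)/n), about 0.046
hist(phat)  # roughly bell-shaped, as the CLT predicts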

SLIDE 17

Binomial Distribution

When dealing with a discrete random variable such as the binomial, a “continuity correction” can greatly improve accuracy. For instance, consider the example:

Example (Exact). Joe always runs red lights. The probability of an accident for each red light run is 0.3. Of the last 100 red lights run, what is the probability that there were 25 or fewer accidents?

Solution: Let X ∼ BIN(100, 0.3) be the number of accidents. The exact answer is

P(X ≤ 25) = Σ_{j=0}^{25} P(X = j) = Σ_{j=0}^{25} (100 choose j) (0.3)^j (0.7)^(100−j) = 0.1631

(obtained with Mathematica). Or using R:

> pbinom(25,100,0.3)
[1] 0.1631301

The exact answer can’t easily be obtained without a computer.

SLIDE 18

Binomial Distribution

Example (Normal approximation without continuity correction)

Joe always runs red lights. The probability of an accident for each red light run is 0.3. Of the last 100 red lights run, what is the probability, approximately, that there were 25 or fewer accidents?

Solution: Let X ∼ BIN(100, 0.3). Since 100(0.3) ≥ 10 and 100(1 − 0.3) ≥ 10, X has approximately the same distribution as Y ∼ N(30, √(100(0.3)(1 − 0.3))) = N(30, 4.582576). Thus

P[X ≤ 25] ≈ P[Y ≤ 25] = P[(Y − 30)/4.582576 ≤ (25 − 30)/4.582576] = P[Z ≤ −1.091089] = 0.1379,

using the Table. Instead of using a table, one can get more accuracy using R for the normal approximation without continuity correction:

> pnorm(25,30,sqrt(100*0.3*(1-0.3)))
[1] 0.1376168

Compared with the exact answer 0.1631, the approximation is unsatisfactory.

SLIDE 19

Binomial Distribution

Continuity Correction

Let X ∼ BIN(n, p) and let j, k be integers such that 0 ≤ j ≤ k ≤ n. Then it is common practice to use the following approximation when np ≥ 10 and n(1 − p) ≥ 10:

P[j ≤ X ≤ k] ≈ P[j − 0.5 ≤ Y ≤ k + 0.5], where Y ∼ N(np, √(np(1 − p))).
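As a sketch, this rule can be packaged as a small R function; binom_cc is a hypothetical name, not from the slides.

# Hypothetical helper: normal approximation with continuity correction
# for P(j <= X <= k), where X ~ BIN(n, p) and np, n(1 - p) >= 10.
binom_cc <- function(j, k, n, p) {
  mu <- n * p
  sigma <- sqrt(n * p * (1 - p))
  pnorm(k + 0.5, mu, sigma) - pnorm(j - 0.5, mu, sigma)
}
binom_cc(0, 25, 100, 0.3)  # about 0.163, close to the exact pbinom(25, 100, 0.3)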

SLIDE 20

Binomial Distribution

Example (Normal approximation with continuity correction)

Joe always runs red lights. The probability of an accident for each red light run is 0.3. Of the last 100 red lights run, what is the probability, approximately, that there were 25 or fewer accidents?

Solution: Since 100(0.3) ≥ 10 and 100(0.7) ≥ 10, the above convention says, letting Y ∼ N(30, √(100(0.3)(1 − 0.3))) = N(30, 4.582576),

P(X ≤ 25) ≈ P(Y ≤ 25.5) = P[(Y − 30)/4.582576 ≤ (25.5 − 30)/4.582576] = P(Z ≤ −0.9819805) ≈ 0.1635,

using the Table. Instead of using a table, one can get more accuracy using R for the normal approximation with continuity correction:

> pnorm(25.5,30,sqrt(100*0.3*(1-0.3)))
[1] 0.1630547

This approximation is much, much better than the normal approximation without continuity correction.