Business Statistics: Sampling and the central limit theorem

Sep 19, 2022


SLIDE 1

SAMPLING, THE CLT, AND THE STANDARD ERROR

Business Statistics

SLIDE 2

▪ Sampling
▪ The central limit theorem
▪ Point and interval estimates for $\mu$
▪ Confidence intervals for $\mu$
▪ Old exam question
▪ Further study

CONTENTS

SLIDE 3

Suppose you’re a scissors manufacturer in the UK

▪ What proportion of your production should be left-handed?

▪ Three strategies

  ▪ look at Wikipedia (“Studies suggest that 70–90% of the world population is right-handed.[4][5]”)
  ▪ ask all persons in the UK (~63 million)
  ▪ ask a sample of persons (100?) in the UK

SAMPLING

SLIDE 4

Sampling is the process of collecting data about a sample (a subset of the population), with the aim of representing the entire population

▪ Arguments pro sampling
  ▪ too costly to probe entire population
  ▪ too time-consuming
  ▪ too dangerous
  ▪ too destructive
  ▪ etc.

▪ Arguments against sampling
  ▪ limited accuracy → confidence intervals (later in this course)
  ▪ not representative → design of experiments (not in this course)

SAMPLING

SLIDE 5

A sample should be representative

▪ e.g., don’t ask people at Schiphol if they’re afraid of flying

A sample should be large enough

▪ cf. the “$\sqrt{n}$” law later on

Choice in sampling

▪ with replacement or without replacement
  ▪ this has consequences for the probability model
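The two sampling schemes can be sketched in Python with the standard `random` module. The population of 1,000 numbered persons and the sample size of 100 are assumptions made for this illustration; they do not come from the slides.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1,000 persons, identified by number.
population = list(range(1000))

# Without replacement: each person can be drawn at most once.
without_replacement = random.sample(population, k=100)

# With replacement: the same person may be drawn several times.
with_replacement = random.choices(population, k=100)

print(len(set(without_replacement)))  # always 100 distinct persons
print(len(set(with_replacement)))     # at most 100, usually fewer
```

With replacement the draws are independent of each other; without replacement they are not, which is why the two schemes lead to different probability models.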

SAMPLING

SLIDE 6

Population | Sample
unknown | known
we would like to know | irrelevant
parameter | statistic
mostly Greek letters ($\pi$, $\sigma$) | mostly Roman letters ($p$, $s$)
some deviating notations ($N$) | some deviating notations ($\bar{x}$, $n$)

SAMPLING

SLIDE 7

▪ Let $X_1, X_2, \ldots, X_n$ be a random sample from a population $X$ with mean $\mu_X$ and variance $\sigma_X^2$
  ▪ e.g., body heights of $n$ persons
  ▪ waiting times of $n$ customers
  ▪ failure rates of $n$ cars, ...

▪ Then, for $n$ sufficiently large, the mean $\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
  1. is (approximately) normally distributed
  2. with mean $\mu_{\bar{X}} = \mu_X$
  3. and variance $\sigma_{\bar{X}}^2 = \frac{\sigma_X^2}{n}$

THE CENTRAL LIMIT THEOREM

Capital $X$, because it is a random variable!
Capital $\bar{X}$, because this is also a random variable!

SLIDE 8

So for large $n$: $\bar{X} \sim N\left(\mu_{\bar{X}} = \mu_X,\ \sigma_{\bar{X}}^2 = \frac{\sigma_X^2}{n}\right)$

▪ or for short $\bar{X} \sim N\left(\mu_X, \frac{\sigma_X^2}{n}\right)$

▪ This holds regardless of the distribution of $X$!
  ▪ so that’s why the normal distribution is called “normal”
  ▪ this fact is called the central limit theorem (CLT)
  ▪ it is one of the most important results of statistics
  ▪ it holds for “sufficiently large” $n$
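A small simulation can make the claim concrete. The sketch below is only an illustration: the exponential population (rate 1, so $\mu_X = 1$ and $\sigma_X^2 = 1$), the sample size of 50, and the 20,000 repetitions are all assumptions chosen for the example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

n = 50           # sample size (assumed for illustration)
repeats = 20000  # number of simulated samples (assumed)

# Skewed population: exponential with rate 1, so mu_X = 1 and sigma_X^2 = 1.
mu_x, var_x = 1.0, 1.0

# Draw many samples of size n and record the mean of each.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(repeats)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.pvariance(sample_means)

# Share of sample means within 1.96 standard errors of mu_X;
# for a normal distribution this would be about 0.95.
se = (var_x / n) ** 0.5
share_within = sum(abs(m - mu_x) <= 1.96 * se for m in sample_means) / repeats

print(f"mean of sample means:     {mean_of_means:.3f} (CLT: {mu_x})")
print(f"variance of sample means: {var_of_means:.4f} (CLT: {var_x / n})")
print(f"share within 1.96 SE:     {share_within:.3f} (normal: 0.95)")
```

Even though the population is far from normal, the simulated sample means should have a mean near $\mu_X$, a variance near $\sigma_X^2 / n$, and roughly 95% of their values within 1.96 standard errors of $\mu_X$.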

THE CENTRAL LIMIT THEOREM

SLIDE 9

The CLT for a fair die
Distribution of $\bar{X}$ for

▪ $n = 1$
▪ $n = 2$
▪ $n = 5$
▪ $n = 20$

THE CENTRAL LIMIT THEOREM

SLIDE 10

The CLT for a loaded (unfair) die
Distribution of $\bar{X}$ for

▪ $n = 1$
▪ $n = 2$
▪ $n = 5$
▪ $n = 20$
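The distributions for both dice can be computed exactly rather than simulated. The sketch below builds the distribution of the sum of the rolls one roll at a time and then divides by the number of rolls. The function name `mean_distribution` and the weights of the loaded die (a six half of the time) are hypothetical choices for this illustration; the slides do not state how their die is loaded.

```python
from fractions import Fraction

def mean_distribution(face_probs, n):
    """Exact distribution of the mean of n independent rolls.

    face_probs maps each face value to its probability.
    Returns a dict mapping each possible mean to its probability.
    """
    # Distribution of the sum of zero rolls: the sum is 0 with certainty.
    sums = {0: Fraction(1)}
    for _ in range(n):
        new_sums = {}
        for total, p_total in sums.items():
            for face, p_face in face_probs.items():
                key = total + face
                new_sums[key] = new_sums.get(key, 0) + p_total * p_face
        sums = new_sums
    return {Fraction(total, n): p for total, p in sums.items()}

fair = {face: Fraction(1, 6) for face in range(1, 7)}

# Hypothetical loaded die: a six comes up half of the time.
loaded = {face: Fraction(1, 10) for face in range(1, 6)}
loaded[6] = Fraction(1, 2)

for n in (1, 2, 5, 20):
    dist = mean_distribution(loaded, n)
    mode = max(dist, key=dist.get)
    print(f"n = {n:2d}: most likely mean {float(mode):.2f} "
          f"with probability {float(dist[mode]):.3f}")
```

Passing `fair` instead of `loaded` gives the fair-die case. Plotting the returned probabilities against the means would show the bell shape emerging as $n$ grows.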

THE CENTRAL LIMIT THEOREM

SLIDE 11

We roll a die 100 times. The outcomes are $X = \{X_1, X_2, \ldots, X_{100}\}$. How is $\bar{X}$ distributed?

EXERCISE 1
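One way to check an answer is to compute the population mean and variance of a single roll and then apply the CLT. The sketch below assumes the die is fair, which the exercise does not state explicitly.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)  # fair die assumed

mu_x = sum(p * x for x in faces)                 # population mean
var_x = sum(p * (x - mu_x) ** 2 for x in faces)  # population variance

n = 100
var_mean = var_x / n  # variance of the sample mean under the CLT

print(mu_x, var_x, var_mean)  # 7/2 35/12 7/240
```

Under these assumptions $\bar{X}$ is approximately $N(3.5,\ 7/240)$, so its standard deviation is about 0.17.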

SLIDE 12

A “proof” of the theorem (for normal populations)

▪ Recall the additive property of the normal distribution:
  ▪ if $X_1 \sim N(\mu_X, \sigma_X^2)$ and $X_2 \sim N(\mu_X, \sigma_X^2)$, then $X_1 + X_2 \sim N(2\mu_X, 2\sigma_X^2)$ (provided $X_1$ and $X_2$ are independent)

▪ Also recall that if $X \sim N(\mu_X, \sigma_X^2)$ then $aX \sim N(a\mu_X, a^2\sigma_X^2)$

▪ So, if $X_1 + X_2 \sim N(2\mu_X, 2\sigma_X^2)$ then $\frac{X_1 + X_2}{2} \sim N\left(\mu_X, \frac{\sigma_X^2}{2}\right)$

▪ and more generally: $\frac{X_1 + \cdots + X_n}{n} \sim N\left(\mu_X, \frac{\sigma_X^2}{n}\right)$

▪ or equivalently: $\bar{X} \sim N\left(\mu_X, \frac{\sigma_X^2}{n}\right)$

▪ This proof works for normal populations and all $n$, but the CLT is valid for all populations and “large” $n$

THE CENTRAL LIMIT THEOREM

You don’t need to reproduce such proofs, but it may help
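The step from two observations to $n$ observations can be written out as follows. This is a sketch under the same assumptions as the slide: independent observations $X_1, \ldots, X_n$, each distributed as $N(\mu_X, \sigma_X^2)$.

```latex
\begin{align*}
X_1 + \cdots + X_n &\sim N\left(n\mu_X,\ n\sigma_X^2\right)
  && \text{(additive property, applied } n-1 \text{ times)} \\
\bar{X} = \tfrac{1}{n}\left(X_1 + \cdots + X_n\right)
  &\sim N\left(\tfrac{1}{n}\, n\mu_X,\ \tfrac{1}{n^2}\, n\sigma_X^2\right)
  && \text{(scaling with } a = \tfrac{1}{n}\text{)} \\
  &= N\left(\mu_X,\ \frac{\sigma_X^2}{n}\right)
\end{align*}
```

The factor $\frac{1}{n^2}$ from the scaling rule cancels against only one factor $n$ from the sum, which is why the variance of the mean shrinks as $\frac{1}{n}$.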

SLIDE 13

Some consequences of the CLT

▪ $\bar{X}$ is an estimator of $\mu_X$
  ▪ and $\bar{x}$ is the best estimate of $\mu_X$

▪ $\bar{X}$ will be a better estimator for large $n$
  ▪ because $\sigma_{\bar{X}}$ decreases with $n$

▪ we can use the distribution of $\bar{X}$ to construct a confidence interval for $\mu$

THE CENTRAL LIMIT THEOREM

SLIDE 14

The CLT holds for $n$ “sufficiently” large

▪ More specifically:
  ▪ if $X$ is normally distributed, the CLT holds for all sample sizes $n$
  ▪ if the distribution of $X$ is fairly symmetric without extreme outliers, for sample sizes $n \ge 15$ the CLT gives a pretty good approximation of the distribution of $\bar{X}$
  ▪ for any distribution of $X$ and a sample size $n \ge 30$, the CLT gives a pretty good approximation of the distribution of $\bar{X}$
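How quickly asymmetry fades can be explored with a simulation. The sketch below uses skewness as a rough measure of asymmetry (a normal distribution has skewness 0). The exponential population, the helper names `skewness` and `simulated_means`, and the number of repetitions are assumptions for this illustration.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible

def skewness(values):
    """Sample skewness: third central moment divided by the sd cubed."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean((v - m) ** 3 for v in values) / sd ** 3

def simulated_means(n, repeats=20000):
    """Means of `repeats` samples of size n from an exponential(1) population."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(n))
        for _ in range(repeats)
    ]

skew_by_n = {n: skewness(simulated_means(n)) for n in (1, 5, 15, 30)}
for n, skew in skew_by_n.items():
    print(f"n = {n:2d}: skewness of the sample mean is about {skew:.2f}")
```

For an exponential population the skewness of $\bar{X}$ is $2 / \sqrt{n}$, so it shrinks steadily but is not yet zero at $n = 30$. The thresholds 15 and 30 are rules of thumb, and strongly skewed populations may need larger samples.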

THE CENTRAL LIMIT THEOREM

SLIDE 15

The effect of asymmetry vs. sample size

THE CENTRAL LIMIT THEOREM

SLIDE 16

A statistic is a function of the (randomly sampled) data

▪ important example: the statistic $\bar{X}$
  ▪ defined by $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
  ▪ in a concrete case, $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the best possible estimate of the parameter $\mu$

▪ so the sample mean $\bar{x}$ is the best possible estimate of the population mean $\mu$
▪ because it is just one value, it is a point estimate

POINT AND INTERVAL ESTIMATES FOR $\mu$

SLIDE 17

Due to sampling variation, $\bar{x}$ will be different in each sample

▪ and there will be a distribution of $\bar{x}$-values, the distribution of $\bar{X}$
▪ the true value of $\mu$ may be different from the value of $\bar{x}$ obtained
▪ however, keep in mind that the value of $\bar{x}$ obtained cannot be “too” wrong
▪ we know that $\bar{X} \sim N(\mu_{\bar{X}}, \sigma_{\bar{X}}^2)$, so it follows that a specific value $\bar{x}$ lies within $[\mu_{\bar{X}} - 1.96\sigma_{\bar{X}},\ \mu_{\bar{X}} + 1.96\sigma_{\bar{X}}]$ with 95% probability

POINT AND INTERVAL ESTIMATES FOR $\mu$

SLIDE 18

Conversely, the population value $\mu_{\bar{X}}$ must be within $[\bar{x} - 1.96\sigma_{\bar{X}},\ \bar{x} + 1.96\sigma_{\bar{X}}]$ with 95% probability

▪ and because $\mu_{\bar{X}} = \mu_X$, the population value $\mu_X$ must be within $[\bar{x} - 1.96\sigma_{\bar{X}},\ \bar{x} + 1.96\sigma_{\bar{X}}]$ with 95% probability
▪ this is an interval estimate for $\mu_X$
▪ we say that $[\bar{x} - 1.96\sigma_{\bar{X}},\ \bar{x} + 1.96\sigma_{\bar{X}}]$ is a 95% confidence interval for $\mu_X$

POINT AND INTERVAL ESTIMATES FOR $\mu$
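The 95% can be checked by simulation: build the interval for many samples and count how often it contains $\mu_X$. In the sketch below the population parameters ($\mu_X = 10$, $\sigma_X = 2$), the sample size, and the number of repetitions are hypothetical values chosen for the illustration.

```python
import random
import statistics

random.seed(2015)  # fixed seed so the sketch is reproducible

mu_x, sigma_x = 10.0, 2.0  # hypothetical population parameters
n = 25                     # sample size (assumed)
repeats = 10000            # number of simulated samples (assumed)

se = sigma_x / n ** 0.5    # standard error of the mean

covered = 0
for _ in range(repeats):
    # Sample mean of one simulated sample from a normal population.
    x_bar = statistics.fmean(random.gauss(mu_x, sigma_x) for _ in range(n))
    lower, upper = x_bar - 1.96 * se, x_bar + 1.96 * se
    if lower <= mu_x <= upper:
        covered += 1

coverage = covered / repeats
print(f"share of intervals containing mu_X: {coverage:.3f}")  # close to 0.95
```

Read this way, the 95% describes the procedure: about 95 out of every 100 intervals constructed like this contain the population mean.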

SLIDE 19

So:

▪ we estimate $\mu_X$ by $\bar{x}$
▪ and we know with 95% probability that $\bar{x} - 1.96\sigma_{\bar{X}} \le \mu_X \le \bar{x} + 1.96\sigma_{\bar{X}}$
▪ the quantity $\sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}}$ is the standard deviation of the distribution of the mean $\bar{X}$
  ▪ it is so important that we give it a special name: the standard error of the mean
  ▪ sometimes (unfortunately!) abbreviated as the standard error

POINT AND INTERVAL ESTIMATES FOR $\mu$
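The formula $\sigma_{\bar{X}} = \sigma_X / \sqrt{n}$ implies that quadrupling the sample size halves the standard error. A minimal sketch, where the function name `standard_error` and the value $\sigma_X = 12$ are hypothetical:

```python
import math

def standard_error(sigma_x, n):
    """Standard error of the mean for a known population sd sigma_x."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sigma_x / math.sqrt(n)

sigma_x = 12.0  # hypothetical population standard deviation
for n in (36, 144, 576):
    # Each step quadruples n, so each standard error is half the previous one.
    print(n, standard_error(sigma_x, n))
```

This is the practical cost of precision: a confidence interval half as wide requires four times as many observations.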

SLIDE 20

We sample ($n = 25$) from a normal population $X$ with unknown $\mu_X$ and known $\sigma_X^2 = 4$. We find $\bar{x} = 3$.

a. Give a point estimate for $\mu_X$.
b. Find the standard error of the mean, $\sigma_{\bar{X}}$.
c. Give a 95%-confidence interval for $\mu_X$.

EXERCISE 2
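One way to check the answers is to plug the given numbers into the formulas from the previous slides. The sketch below uses the rounded value 1.96 from the slides; the variable names are chosen for this illustration.

```python
import math

n = 25
var_x = 4.0   # known population variance
x_bar = 3.0   # observed sample mean

point_estimate = x_bar                    # (a) the sample mean
se = math.sqrt(var_x) / math.sqrt(n)      # (b) sigma_X / sqrt(n)
lower = x_bar - 1.96 * se                 # (c) lower bound of the 95% interval
upper = x_bar + 1.96 * se                 # (c) upper bound of the 95% interval

print(f"point estimate: {point_estimate}")
print(f"standard error: {se:.1f}")
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")  # [2.216, 3.784]
```

Because the population is normal, the distribution of $\bar{X}$ is normal for every sample size, so the interval is valid even though $n = 25$ is below 30.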

SLIDE 21

▪ Carefully distinguish:
  ▪ $\mu_X$ (a value, often unknown)
  ▪ $\bar{x}$ (a value from observations)
  ▪ $\bar{X}$ (a distribution, not a value)
  ▪ and its two parameters $\mu_{\bar{X}}$ and $\sigma_{\bar{X}}^2$ (both are values, often unknown)
    and the CLT claims that $\mu_{\bar{X}} = \mu_X$ and $\sigma_{\bar{X}}^2 = \frac{\sigma_X^2}{n}$

▪ Later on, we will follow a similar logic, e.g.
  ▪ $\sigma_X^2$
  ▪ $s_X^2$
  ▪ $S_X^2$
  ▪ and its two parameters

CONCEPTS AND SYMBOLS

SLIDE 22

23 March 2015, Q1h

OLD EXAM QUESTION

SLIDE 23

▪ Doane & Seward 5/E 8.1-8.3
▪ Tutorial exercises week 2
▪ sampling distribution
▪ central limit theorem
▪ standard error

FURTHER STUDY