SLIDE 1
SAMPLING, THE CLT, AND THE STANDARD ERROR
Business Statistics
SLIDE 2
CONTENTS
▪ Sampling
▪ The central limit theorem
▪ Point and interval estimates for $\mu$
▪ Confidence intervals for $\mu$
▪ Old exam question
▪ Further study
SLIDE 3
Suppose you're a scissors manufacturer in the UK
▪ What proportion of your production should be left-handed?
▪ Three strategies
  ▪ look at Wikipedia ("Studies suggest that 70–90% of the world population is right-handed.[4][5]")
  ▪ ask all persons in the UK (~63 million)
  ▪ ask a sample of persons (100?) in the UK
SAMPLING
SLIDE 4
Sampling is the process of collecting data about a sample (a subset of the population), with the aim of representing the entire population
▪ Arguments pro sampling
  ▪ too costly to probe entire population
  ▪ too time-consuming
  ▪ too dangerous
  ▪ too destructive
  ▪ etc.
▪ Arguments against sampling
  ▪ limited accuracy → confidence intervals (later in this course)
  ▪ not representative → design of experiments (not in this course)
SAMPLING
SLIDE 5
A sample should be representative
▪ e.g., don't ask people at Schiphol if they're afraid of flying
A sample should be large enough
▪ cf. the $\sqrt{n}$ law later on
Choice in sampling
▪ with replacement or without replacement
  ▪ this has consequences for the probability model
SAMPLING
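The choice between sampling with and without replacement can be sketched in Python. The population below is hypothetical (1,000 people, of whom 10% are labelled left-handed purely for illustration), and `draw_sample` is a helper name introduced here, not something from the course.

```python
import random

def draw_sample(population, n, replace, rng):
    """Draw n items from population, with or without replacement."""
    if replace:
        # draws are independent; the same person can be picked more than once
        return rng.choices(population, k=n)
    # nobody can appear twice, so the draws depend on each other
    return rng.sample(population, n)

rng = random.Random(1)
# hypothetical population: 100 left-handed ("L") and 900 right-handed ("R") people
population = ["L"] * 100 + ["R"] * 900

with_repl = draw_sample(population, 100, True, rng)
without_repl = draw_sample(population, 100, False, rng)
print(with_repl.count("L") / 100, without_repl.count("L") / 100)
```

With replacement, the number of left-handed persons in the sample follows a binomial model; without replacement, it follows a hypergeometric model. For a sample that is small relative to the population the two are close.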
SLIDE 6
Population | Sample
unknown | known
we would like to know | irrelevant
parameter | statistic
mostly Greek letters (e.g. $\mu$, $\sigma$) | mostly Roman letters (e.g. $s$)
some deviating notations | some deviating notations (e.g. $\bar{x}$)
SAMPLING
SLIDE 7
▪ Let $X_1, X_2, \ldots, X_n$ be a random sample from a population $X$ with mean $\mu_X$ and variance $\sigma_X^2$
  ▪ e.g., body heights of $n$ persons
  ▪ waiting times of $n$ customers
  ▪ failure rates of $n$ cars, ...
▪ Then, for $n$ sufficiently large, the mean $\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
  1. is approximately normally distributed
  2. with mean $\mu_{\bar{X}} = \mu_X$
  3. and variance $\sigma_{\bar{X}}^2 = \frac{\sigma_X^2}{n}$
THE CENTRAL LIMIT THEOREM
Capital $X$, because it is a random variable! Capital $\bar{X}$, because this is also a random variable!
SLIDE 8
So for large $n$: $\bar{X} \sim N\left(\mu_{\bar{X}} = \mu_X,\ \sigma_{\bar{X}}^2 = \frac{\sigma_X^2}{n}\right)$
▪ or for short $\bar{X} \sim N\left(\mu_X, \frac{\sigma_X^2}{n}\right)$
▪ This holds regardless of the distribution of $X$!
  ▪ so that's why the normal distribution is called "normal"
  ▪ this fact is called the central limit theorem (CLT)
  ▪ it is one of the most important results of statistics
  ▪ it holds for "sufficiently large" $n$
THE CENTRAL LIMIT THEOREM
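A small simulation can make the claim tangible. The sketch below assumes a clearly non-normal population (exponential with rate 1, so $\mu_X = 1$ and $\sigma_X^2 = 1$); the population, the sample size $n = 50$, and the helper name `sample_means` are illustrative choices, not part of the course material.

```python
import random
import statistics

def sample_means(draw, n, repetitions, rng):
    """Return `repetitions` sample means, each computed from n independent draws."""
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(repetitions)]

rng = random.Random(42)
# assumed population: exponential with rate 1 (mean 1, variance 1), strongly right-skewed
draw = lambda r: r.expovariate(1.0)

n = 50
means = sample_means(draw, n, 5000, rng)
# the CLT predicts a mean close to mu_X = 1 and a variance close to sigma_X^2 / n = 0.02
print("mean of sample means:", statistics.fmean(means))
print("variance of sample means:", statistics.variance(means))
```

A histogram of `means` would look roughly bell-shaped even though the individual draws are far from normal.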
SLIDE 9
The CLT for a fair die
Distribution of $\bar{X}$ for
▪ $n = 1$
▪ $n = 2$
▪ $n = 5$
▪ $n = 20$
THE CENTRAL LIMIT THEOREM
SLIDE 10
The CLT for a loaded (unfair) die
Distribution of $\bar{X}$ for
▪ $n = 1$
▪ $n = 2$
▪ $n = 5$
▪ $n = 20$
THE CENTRAL LIMIT THEOREM
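The distributions on the two die slides can be computed exactly by repeated convolution. The sketch below is one way to do that; the function name `mean_distribution` and the particular loaded die (a six half of the time) are assumptions made for illustration, since the slides do not specify how the die is loaded.

```python
from fractions import Fraction

def mean_distribution(pmf, n):
    """Exact distribution of the mean of n independent rolls.

    pmf maps each face value to its probability; the result maps each
    possible value of the mean to its probability.
    """
    sums = {0: Fraction(1)}
    for _ in range(n):
        nxt = {}
        for total, p in sums.items():
            for face, q in pmf.items():
                # add one more roll to every partial sum
                nxt[total + face] = nxt.get(total + face, 0) + p * q
        sums = nxt
    return {Fraction(total, n): p for total, p in sums.items()}

fair = {face: Fraction(1, 6) for face in range(1, 7)}
# hypothetical loaded die: a six comes up half of the time
loaded = {face: Fraction(1, 10) for face in range(1, 6)}
loaded[6] = Fraction(1, 2)

for n in (1, 2, 5, 20):
    dist = mean_distribution(fair, n)
    mode = max(dist, key=dist.get)
    print(n, float(mode), float(dist[mode]))
```

Replacing `fair` by `loaded` in the loop gives the skewed case; plotting `dist` for growing $n$ shows the bell shape emerging in both cases.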
SLIDE 11
We roll a die 100 times. The outcomes are $X = X_1, X_2, \ldots, X_{100}$. How is $\bar{X}$ distributed?
EXERCISE 1
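One possible line of attack, assuming the die is fair and six-sided (the exercise does not say so explicitly): compute the population mean and variance, then apply the CLT with $n = 100$.

```latex
\begin{align*}
\mu_X &= \frac{1+2+3+4+5+6}{6} = 3.5 \\
\sigma_X^2 &= \frac{1}{6}\sum_{k=1}^{6}(k - 3.5)^2 = \frac{35}{12} \approx 2.92 \\
\bar{X} &\sim N\left(\mu_X, \frac{\sigma_X^2}{n}\right) = N\left(3.5, \frac{35}{1200}\right),
\qquad \sigma_{\bar{X}} = \sqrt{\frac{35}{1200}} \approx 0.17
\end{align*}
```

Because $n = 100$ is large, the normal approximation should be good here even though a single roll is uniformly, not normally, distributed.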
SLIDE 12
A "proof" of the theorem (for normal populations)
▪ Recall the additive property of the normal distribution:
  ▪ if $X_1 \sim N(\mu_X, \sigma_X^2)$ and $X_2 \sim N(\mu_X, \sigma_X^2)$, then $X_1 + X_2 \sim N(2\mu_X, 2\sigma_X^2)$ (provided $X_1$ and $X_2$ are independent)
▪ Also recall that if $X \sim N(\mu_X, \sigma_X^2)$ then $aX \sim N(a\mu_X, a^2\sigma_X^2)$
▪ So, if $X_1 + X_2 \sim N(2\mu_X, 2\sigma_X^2)$ then $\frac{X_1 + X_2}{2} \sim N\left(\mu_X, \frac{\sigma_X^2}{2}\right)$
▪ and more generally: $\frac{X_1 + \cdots + X_n}{n} \sim N\left(\mu_X, \frac{\sigma_X^2}{n}\right)$
▪ or equivalently: $\bar{X} \sim N\left(\mu_X, \frac{\sigma_X^2}{n}\right)$
▪ This proof works for normal populations and all $n$, but the CLT is valid for all populations and "large" $n$
THE CENTRAL LIMIT THEOREM
You don't need to reproduce such proofs, but it may help
SLIDE 13
Some consequences of the CLT
▪ $\bar{X}$ is an estimator of $\mu_X$
  ▪ and $\bar{x}$ is the best estimate of $\mu_X$
▪ $\bar{X}$ will be a better estimator for large $n$
  ▪ because $\sigma_{\bar{X}}$ decreases with $n$
▪ we can use the distribution of $\bar{X}$ to construct a confidence interval for $\mu$
THE CENTRAL LIMIT THEOREM
SLIDE 14
The CLT holds for $n$ "sufficiently" large
▪ More specifically:
  ▪ if $X$ is normally distributed, the CLT holds for all sample sizes $n$
  ▪ if the distribution of $X$ is fairly symmetric without extreme outliers, for sample sizes $n \geq 15$ the CLT gives a pretty good approximation of the distribution of $\bar{X}$
  ▪ for any distribution of $X$ and a sample size $n \geq 30$, the CLT gives a pretty good approximation of the distribution of $\bar{X}$
THE CENTRAL LIMIT THEOREM
SLIDE 15
The effect of asymmetry vs. sample size
THE CENTRAL LIMIT THEOREM
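The interplay between asymmetry and sample size can be explored numerically. The sketch below assumes a right-skewed exponential population (rate 1) and measures how the skewness of the simulated sample means shrinks as $n$ grows; the helper names `skewness` and `skew_of_means` are introduced here for illustration only.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: mean cubed deviation divided by the cubed standard deviation."""
    values = list(values)
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean((v - m) ** 3 for v in values) / sd ** 3

def skew_of_means(n, repetitions, rng):
    """Skewness of `repetitions` sample means of size n from an exponential population."""
    # assumed population: exponential with rate 1, which is strongly right-skewed
    means = [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
             for _ in range(repetitions)]
    return skewness(means)

rng = random.Random(7)
for n in (1, 5, 30, 100):
    print(n, round(skew_of_means(n, 4000, rng), 2))
```

A skewness near zero indicates a symmetric distribution of $\bar{X}$; for a skewed population it approaches zero only gradually, which is why larger samples are needed in that case.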
SLIDE 16
A statistic is a function of the (randomly sampled) data
▪ important example: the statistic $\bar{X}$
  ▪ defined by $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
  ▪ in a concrete case, $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the best possible estimate of the parameter $\mu$
▪ so the sample mean $\bar{x}$ is the best possible estimate of the population mean $\mu$
▪ because it is just one value, it is a point estimate
POINT AND INTERVAL ESTIMATES FOR $\mu$
SLIDE 17
Due to sampling variation, $\bar{x}$ will be different in each sample
▪ and there will be a distribution of $\bar{x}$-values, the distribution of $\bar{X}$
▪ the true value of $\mu$ may be different from the value of $\bar{x}$ obtained
▪ however, keep in mind that the value of $\bar{x}$ obtained cannot be "too" wrong
▪ we know that $\bar{X} \sim N(\mu_{\bar{X}}, \sigma_{\bar{X}}^2)$, so it follows that a specific value $\bar{x}$ lies within $[\mu_{\bar{X}} - 1.96\sigma_{\bar{X}},\ \mu_{\bar{X}} + 1.96\sigma_{\bar{X}}]$ with 95% probability
POINT AND INTERVAL ESTIMATES FOR $\mu$
SLIDE 18
Conversely, the population value $\mu_{\bar{X}}$ must be within $[\bar{x} - 1.96\sigma_{\bar{X}},\ \bar{x} + 1.96\sigma_{\bar{X}}]$ with 95% probability
▪ and because $\mu_{\bar{X}} = \mu_X$, the population value $\mu_X$ must be within $[\bar{x} - 1.96\sigma_{\bar{X}},\ \bar{x} + 1.96\sigma_{\bar{X}}]$ with 95% probability
▪ this is an interval estimate for $\mu_X$
▪ we say that $[\bar{x} - 1.96\sigma_{\bar{X}},\ \bar{x} + 1.96\sigma_{\bar{X}}]$ is a 95% confidence interval for $\mu_X$
POINT AND INTERVAL ESTIMATES FOR $\mu$
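The interval can be computed directly, and its 95% level can be checked by simulation. The sketch below assumes a known population standard deviation and a hypothetical normal population with $\mu_X = 10$ and $\sigma_X = 2$; the function names are illustrative, not from the course.

```python
import math
import random
import statistics

def confidence_interval_95(x_bar, sigma, n):
    """Interval x_bar +/- 1.96 * sigma / sqrt(n), for a known population sigma."""
    half_width = 1.96 * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

def coverage(mu, sigma, n, repetitions, rng):
    """Fraction of simulated samples whose interval contains the true mean mu."""
    hits = 0
    for _ in range(repetitions):
        x_bar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        low, high = confidence_interval_95(x_bar, sigma, n)
        hits += low <= mu <= high
    return hits / repetitions

# hypothetical normal population with mu = 10 and sigma = 2
print(coverage(10.0, 2.0, 25, 4000, random.Random(3)))
```

The printed fraction should be close to 0.95: about 95% of the intervals constructed this way cover the true mean, which is the sense in which the interval has 95% confidence.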
SLIDE 19
So:
▪ we estimate $\mu_X$ by $\bar{x}$
▪ and we know with 95% probability that $\bar{x} - 1.96\sigma_{\bar{X}} \leq \mu_X \leq \bar{x} + 1.96\sigma_{\bar{X}}$
▪ the quantity $\sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}}$ is the standard deviation of the distribution of the mean $\bar{X}$
▪ it is so important that we give it a special name: the standard error of the mean
▪ sometimes (unfortunately!) abbreviated as the standard error
POINT AND INTERVAL ESTIMATES FOR $\mu$
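The $\sqrt{n}$ dependence of the standard error is easy to tabulate. The sketch below uses an assumed population standard deviation of 2; `standard_error` is a helper name chosen here.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 2.0  # assumed population standard deviation, for illustration only
for n in (25, 100, 400):
    print(n, standard_error(sigma, n))
```

Each fourfold increase in the sample size halves the standard error, so doubling the precision costs four times as many observations.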
SLIDE 20
We sample ($n = 25$) from a normal population $X$ with unknown $\mu_X$ and known $\sigma_X^2 = 4$. We find $\bar{x} = 3$.
- a. Give a point estimate for $\mu_X$.
- b. Find the standard error of the mean, $\sigma_{\bar{X}}$.
- c. Give a 95%-confidence interval for $\mu_X$.
EXERCISE 2
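A possible worked solution, using only the formulas from the preceding slides (with $\sigma_X = \sqrt{4} = 2$):

```latex
\begin{align*}
\text{a.}\quad & \hat{\mu}_X = \bar{x} = 3 \\
\text{b.}\quad & \sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}} = \frac{2}{\sqrt{25}} = 0.4 \\
\text{c.}\quad & \bar{x} \pm 1.96\,\sigma_{\bar{X}} = 3 \pm 1.96 \times 0.4 = 3 \pm 0.784
  \;\Rightarrow\; [2.216,\ 3.784]
\end{align*}
```

The hat notation $\hat{\mu}_X$ for "estimate of $\mu_X$" is introduced here for convenience and does not appear in the slides.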
SLIDE 21
▪ Carefully distinguish:
  ▪ $\mu_X$ (a value, often unknown)
  ▪ $\bar{x}$ (a value from observations)
  ▪ $\bar{X}$ (a distribution, not a value)
  ▪ and its two parameters $\mu_{\bar{X}}$ and $\sigma_{\bar{X}}^2$ (both are values, often unknown)
▪ Later on, we will follow a similar logic, e.g.
  ▪ $\sigma_X^2$
  ▪ $s_X^2$
  ▪ $S_X^2$
  ▪ and its two parameters
CONCEPTS AND SYMBOLS
and the CLT claims that $\mu_{\bar{X}} = \mu_X$ and $\sigma_{\bar{X}}^2 = \frac{\sigma_X^2}{n}$
SLIDE 22
23 March 2015, Q1h
OLD EXAM QUESTION
SLIDE 23