Business Statistics CONTENTS Estimating parameters The sampling - - PowerPoint PPT Presentation
ESTIMATES, CONFIDENCE INTERVALS, AND TESTS
Business Statistics
CONTENTS
▪ Estimating parameters
▪ The sampling distribution
▪ Confidence intervals for $\mu$
▪ Hypothesis tests for $\mu$
▪ The $t$-distribution
▪ Comparison of $z$ and $t$
▪ Old exam question
▪ Further study
Central task in inferential statistics
▪ Estimation
  ▪ estimating a parameter (population value) from a sample
▪ Example
  ▪ what proportion of cars in Amsterdam is electric?
  ▪ population value: $\pi$
  ▪ a sample of size $n = 200$ cars yields 26 electric cars
  ▪ so, $p = \frac{26}{200} = 0.13$
  ▪ this suggests $\pi \approx 0.13$
ESTIMATING PARAMETERS
Terminology
▪ Parameter
  ▪ a characteristic descriptive of the population
  ▪ e.g., $\mu$, $\sigma$, $\pi$ (or $\sigma^2$)
▪ Estimator
  ▪ a statistic derived from a sample to infer the value of a population parameter
  ▪ e.g., $\bar{X}$, $S$, $P$ (or $S^2$)
▪ Estimate
  ▪ the value of the estimator in a particular sample
  ▪ e.g., $\bar{x}$, $s$, $p$ (or $s^2$)
ESTIMATING PARAMETERS
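Estimator, estimate, and population parameter for each quantity:
▪ Mean
  ▪ estimator: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
  ▪ estimate: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
  ▪ population parameter: $\mu$
▪ Standard deviation
  ▪ estimator: $S = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}$
  ▪ estimate: $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
  ▪ population parameter: $\sigma$
▪ Proportion
  ▪ estimator: $P = \frac{X}{n}$
  ▪ estimate: $p = \frac{x}{n}$
  ▪ population parameter: $\pi$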
ESTIMATING PARAMETERS
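The three estimates can be computed directly from their definitions. The sketch below is only an illustration: the five prices in `sample` are hypothetical values chosen for the example, while the 26 electric cars out of 200 come from the proportion example in these slides.

```python
import math
import statistics

# Hypothetical sample of five prices (illustrative values, not from the slides).
sample = [1.90, 2.10, 2.25, 2.00, 2.05]
n = len(sample)

# Sample mean: x_bar = (1/n) * sum of x_i
x_bar = sum(sample) / n

# Sample standard deviation: note the divisor n - 1, not n
s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))

# Sample proportion: p = x / n (26 electric cars in a sample of 200 cars)
x_electric, n_cars = 26, 200
p = x_electric / n_cars

print(f"x_bar = {x_bar:.2f}, s = {s:.3f}, p = {p:.2f}")
```

The standard library function `statistics.stdev` uses the same $n - 1$ divisor, so it can serve as a cross-check on the hand-written formula.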
▪ Another example (Amsterdam, 2015):
  ▪ what is the mean price of a glass of beer?
  ▪ population value: $\mu$
  ▪ a sample of size $n = 64$ glasses of beer yields $\bar{x} = 2.06$€
  ▪ this suggests that $\mu \approx 2.06$€
▪ But suppose we had taken a different sample
  ▪ again with sample size $n = 64$
  ▪ but now perhaps yielding $\bar{x} = 2.13$€
  ▪ then we would estimate $\mu \approx 2.13$€
▪ Obviously there is sampling variation
  ▪ so there is a distribution of $\bar{x}$-values (the sampling distribution of $\bar{X}$)
▪ Solution: point estimates and confidence intervals
ESTIMATING PARAMETERS
▪ Example
▪ Consider a discrete uniform population consisting of the integers {0, 1, 2, 3}
▪ The population parameters are:
  ▪ $\mu = 1.5$
  ▪ $\sigma = 1.118$
▪ Sample $n = 2$ values and calculate $\bar{x}$
▪ Do this for all possible samples of size $n = 2$
▪ You will get a distribution of $\bar{x}$-values: the sampling distribution of $\bar{X}$
THE SAMPLING DISTRIBUTION
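For a population this small, the sampling distribution can be enumerated exhaustively. The sketch below assumes sampling with replacement (ordered pairs, so 16 equally likely samples); the slides do not state this, so treat it as an assumption.

```python
import math
from collections import Counter
from fractions import Fraction
from itertools import product

population = [0, 1, 2, 3]
n = 2

# Population parameters (divisor N, because this is the whole population).
mu = sum(population) / len(population)
sigma = math.sqrt(sum((x - mu) ** 2 for x in population) / len(population))

# All ordered samples of size n, drawn with replacement (assumption).
samples = list(product(population, repeat=n))
means = [Fraction(sum(sample), n) for sample in samples]

# Sampling distribution of the mean: how often each x_bar value occurs.
distribution = Counter(means)
for value, count in sorted(distribution.items()):
    print(f"x_bar = {float(value):.1f}  probability = {count}/{len(samples)}")

# Mean and standard deviation of the sampling distribution.
mean_of_means = float(sum(means) / len(means))
sd_of_means = math.sqrt(float(sum((m - Fraction(3, 2)) ** 2 for m in means) / len(means)))
print(f"mu = {mu}, sigma = {sigma:.3f}")
print(f"mean of x_bar = {mean_of_means}, sd of x_bar = {sd_of_means:.3f}")
```

Under this assumption the $\bar{x}$-values are centred on $\mu = 1.5$, and their standard deviation equals $\sigma / \sqrt{2}$, which is smaller than $\sigma$.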
▪ We will study how much the estimate of a population parameter, based on a sample statistic, varies
▪ We will do so by studying how the sample statistic varies when you draw a different sample
▪ Example:
  ▪ GMAT score of MBA students
  ▪ $N = 2637$
  ▪ $\mu = 520.78$
  ▪ $\sigma = 86.60$
THE SAMPLING DISTRIBUTION
▪ Consider eight random samples, each of size $n = 5$
  ▪ the sample means ($\bar{x}_1 = 504.0$, $\bar{x}_2 = 576.0$, …, $\bar{x}_8 = 582$) tend to be close to the population mean ($\mu = 520.78$)
  ▪ sometimes a bit lower, sometimes a bit higher
THE SAMPLING DISTRIBUTION
▪ The dot plots show that the sample means ($\bar{x}_1, \ldots, \bar{x}_8$) have much less variation than the individual data points ($x_1, \ldots, x_{2637}$)
THE SAMPLING DISTRIBUTION
▪ An estimator is a random variable since samples vary
  ▪ so we write it as a capital letter, e.g., $X$, $\bar{X}$, $S$, etc.
▪ The sampling distribution of an estimator is the probability distribution of all possible values the statistic may assume when a random sample of (a fixed) size $n$ is taken
  ▪ so we write $X \sim N(\mu, \sigma)$, etc.
THE SAMPLING DISTRIBUTION
▪ The sampling distribution of $\bar{X}$
  ▪ for a population with $\mu = \mu_X$ and $\sigma^2 = \sigma_X^2$
▪ If the CLT holds: $\bar{X} \sim N\left(\mu_X, \frac{\sigma_X^2}{n}\right)$
▪ So, the statistic $\bar{X}$
  ▪ is normally distributed
  ▪ has mean $\mu_X$
  ▪ and has standard deviation $\frac{\sigma_X}{\sqrt{n}}$
▪ Fortunately, the CLT holds pretty often
THE SAMPLING DISTRIBUTION
3 things: shape, mean, dispersion
▪ The standard deviation of the distribution of sample means $\bar{X}$
  ▪ is given by $\sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}}$
  ▪ has a special name: standard error of the mean
  ▪ is often abbreviated as the standard error (SE)
  ▪ decreases with increasing sample size
  ▪ but only according to the "law of diminishing returns" ($1/\sqrt{n}$)
  ▪ is often calculated by software (SPSS, etc.)
  ▪ is the basis for confidence intervals and hypothesis tests (see later)
THE SAMPLING DISTRIBUTION
The abbreviation "standard error" is a bit confusing, because we will meet more standard errors later on
What is the meaning of the standard error? EXERCISE 1
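The $1/\sqrt{n}$ behaviour of the standard error is easy to see numerically. The sketch below uses $\sigma = 86.60$ from the GMAT example; the sample sizes other than $n = 5$ are arbitrary illustrative choices.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


sigma = 86.60  # population standard deviation from the GMAT example

# Each step quadruples the sample size, which only halves the standard error.
for n in (5, 20, 80, 320):
    print(f"n = {n:3d}  SE = {standard_error(sigma, n):6.2f}")
```

Quadrupling the sample size halves the standard error, which is the "diminishing returns" effect: each further gain in precision costs more observations than the last.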
▪ A sample mean $\bar{x}$ is a point estimate of the population mean $\mu$
  ▪ it is the best possible estimate of $\mu$
  ▪ but it will probably not be completely right
▪ A confidence interval (CI) for the mean is a range of possible values for $\mu$: $\mu_{\text{lower}} \leq \mu \leq \mu_{\text{upper}}$
  ▪ such that the interval $\mathrm{CI}_{\mu} = [\mu_{\text{lower}}, \mu_{\text{upper}}]$ contains the true value ($\mu$) with a certain probability (e.g., 95%)
CONFIDENCE INTERVALS FOR $\mu$
To simplify notation, we will drop the "$X$" from $\sigma_X$ now, and write just $\sigma$
▪ From the CLT it follows that under certain conditions:
  ▪ the distribution of $\bar{X}$ is normal
  ▪ the best estimate of $\mu$ is $\bar{x}$
  ▪ the standard deviation of $\bar{X}$ is $\frac{\sigma}{\sqrt{n}}$
▪ This implies that:
  ▪ with probability 2.5%, $\bar{X} < \mu - 1.96 \frac{\sigma}{\sqrt{n}} \Leftrightarrow \mu > \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}$
  ▪ with probability 2.5%, $\bar{X} > \mu + 1.96 \frac{\sigma}{\sqrt{n}} \Leftrightarrow \mu < \bar{X} - 1.96 \frac{\sigma}{\sqrt{n}}$
  ▪ so with probability 95%, $\bar{X} - 1.96 \frac{\sigma}{\sqrt{n}} \leq \mu \leq \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}$
▪ So, if we find a sample mean $\bar{x}$, we can construct the following 95% confidence interval for $\mu$:
  $\mathrm{CI}_{\mu,0.95} = \left[\bar{x} - 1.96 \frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}\right]$
CONFIDENCE INTERVALS FOR $\mu$
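Three notations for a confidence interval for $\mu$
▪ $\left[\bar{x} - 1.96 \frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}\right]$
▪ $\bar{x} - 1.96 \frac{\sigma}{\sqrt{n}} \leq \mu \leq \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}$
▪ $\bar{x} \pm 1.96 \frac{\sigma}{\sqrt{n}}$
CONFIDENCE INTERVALS FOR $\mu$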
Example
▪ Population
  ▪ $\mu = 520.78$ (unknown)
  ▪ $\sigma = 86.60$ (known)
  ▪ normally distributed (assumed)
▪ Sample
  ▪ $n = 5$ (chosen)
  ▪ $\bar{x} = 504.0$ (estimated)
▪ Calculation
  ▪ standard error of mean: $\frac{86.60}{\sqrt{5}} = 38.73$
  ▪ $1.96 \times 38.73 = 75.91$
  ▪ $\mathrm{CI}_{\mu,0.95} = [428.09, 579.91]$
CONFIDENCE INTERVALS FOR $\mu$
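The calculation in this example can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the function name `z_confidence_interval` is made up for the illustration.

```python
import math
from statistics import NormalDist


def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for mu when the population sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95%
    margin = z * sigma / math.sqrt(n)        # z times the standard error
    return x_bar - margin, x_bar + margin


# Numbers from the GMAT example: x_bar = 504.0, sigma = 86.60, n = 5
lower, upper = z_confidence_interval(504.0, 86.60, 5)
print(f"CI = [{lower:.2f}, {upper:.2f}]")
```

The function uses the unrounded critical value instead of 1.96, which does not change the result at two decimals here.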
Write the confidence interval $[428.09, 579.91]$ in two alternative ways. EXERCISE 2
▪ The factor 1.96 is of course related to the 95% probability
▪ Other confidence levels use other factors
▪ General form of a $(1 - \alpha) \times 100\%$ confidence interval of the mean:
  $\mathrm{CI}_{\mu,1-\alpha} = \left[\bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\right]$
Where $z_{\alpha/2}$ is such that $P(Z \geq z_{\alpha/2}) = \alpha/2$ if $Z$ is drawn from a standard normal distribution
CONFIDENCE INTERVALS FOR $\mu$
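The factor $z_{\alpha/2}$ for a given confidence level can be looked up in a table or computed. A minimal sketch with the standard library follows; the helper name `z_critical` is made up for the illustration.

```python
from statistics import NormalDist


def z_critical(confidence: float) -> float:
    """Return z_{alpha/2}: the value with upper-tail probability alpha/2."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")
```

A higher confidence level gives a larger factor and therefore a wider interval.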
▪ Trade-off
  ▪ narrow CI $\Leftrightarrow$ low confidence level
  ▪ wide CI $\Leftrightarrow$ high confidence level
▪ Choice of confidence level depends on application
  ▪ more precision required for a refinery than for a dairy farm
CONFIDENCE INTERVALS FOR $\mu$
▪ A confidence interval either does or does not contain $\mu$
▪ The confidence level quantifies the risk
▪ Out of 100 confidence intervals at the 95% level, approximately 95 will contain $\mu$, while approximately 5 will not contain $\mu$
CONFIDENCE INTERVALS FOR $\mu$
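This long-run interpretation can be checked by simulation. The sketch below assumes a normally distributed population with the GMAT parameters ($\mu = 520.78$, $\sigma = 86.60$) and samples of size $n = 5$; the number of repetitions and the random seed are arbitrary choices.

```python
import math
import random
from statistics import NormalDist, fmean


def coverage(mu, sigma, n, confidence=0.95, repetitions=10_000, seed=1):
    """Share of simulated z-based confidence intervals that contain mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(repetitions):
        # Draw one sample from the assumed normal population.
        x_bar = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / repetitions


share = coverage(mu=520.78, sigma=86.60, n=5)
print(f"share of intervals containing mu: {share:.3f}")
```

The printed share should be close to 0.95. Each individual interval still either contains $\mu$ or does not.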
▪ We can use the standard error to perform a hypothesis test
  ▪ recall that $\mathrm{CI}_{\mu,0.95} = [428.09, 579.91]$
▪ Suppose we hypothesize $\mu = 550$
▪ The value 550 is inside the 95% confidence interval for $\mu$
  ▪ therefore the sample statistic and its confidence interval do not suggest that the hypothesis ($\mu = 550$) is wrong
  ▪ and we will not reject the hypothesis
  ▪ notice that we didn't say that $\mu = 550$; we only said that we can't reject it (at a 5% significance level)
HYPOTHESIS TESTS FOR $\mu$
▪ Another example: suppose we hypothesize that $\mu = 600$
▪ The value 600 is outside the confidence interval for $\mu$
  ▪ finding a confidence interval not containing $\mu$ happens only in 5% of the cases
  ▪ so we conclude that $\mu \neq 600$ (at a 5% significance level)
  ▪ therefore the sample statistic and its confidence interval suggest that the hypothesis ($\mu = 600$) is wrong
  ▪ and we will reject the hypothesis
HYPOTHESIS TESTS FOR $\mu$
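The decision rule used in both examples is a simple containment check. A minimal sketch, with a made-up helper name `reject_hypothesis` and the 95% interval from the example:

```python
def reject_hypothesis(mu_0: float, ci: tuple) -> bool:
    """Reject the hypothesis mu = mu_0 when mu_0 lies outside the interval."""
    lower, upper = ci
    return not (lower <= mu_0 <= upper)


ci_95 = (428.09, 579.91)  # 95% confidence interval from the example

for mu_0 in (550, 600):
    decision = "reject" if reject_hypothesis(mu_0, ci_95) else "do not reject"
    print(f"hypothesis mu = {mu_0}: {decision} at the 5% significance level")
```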
Much more on hypothesis tests later on!
▪ A closer look at $\mathrm{CI}_{\mu,0.95} = \left[\bar{x} - 1.96 \frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}\right]$
▪ Given a sample mean $\bar{x}$, you can find a 95% confidence interval for the population mean $\mu$
▪ Sounds great when you don't know $\mu$ ...
▪ ... but it assumes you do know $\sigma$!
▪ There are many situations in which you don't know $\mu$ and you also don't know $\sigma$
▪ So what to do?
THE $t$-DISTRIBUTION
▪ A simple strategy
▪ If the population standard deviation $\sigma$ is unknown, we can estimate it with the sample standard deviation $s$
▪ Then we use $\pm 1.96 \frac{s}{\sqrt{n}}$ instead of $\pm 1.96 \frac{\sigma}{\sqrt{n}}$
▪ But we pay a price for that
▪ The reason is that $s$ is itself an estimate of $\sigma$, and therefore uncertain
▪ The price we pay is that the factor "1.96" must be somewhat larger
THE $t$-DISTRIBUTION
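▪ Recall that the CLT yields that $\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim N(0, 1)$
  ▪ where $N(0, 1)$ is the standard normal distribution
▪ Likewise, it can be shown that $\frac{\bar{X} - \mu}{s / \sqrt{n}} \sim t$
  ▪ where $t$ is the $t$-distribution (or Student's $t$-distribution)
  ▪ which has an even more complicated formula than the normal distribution
  ▪ $f(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} z^2}$ vs. $f(t; \nu) = \frac{\Gamma\left(\frac{1}{2}(\nu + 1)\right)}{\sqrt{\nu\pi}\, \Gamma\left(\frac{1}{2}\nu\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{1}{2}(\nu + 1)}$
THE $t$-DISTRIBUTION
Arrrgh: forget quickly!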
▪ The confidence interval for $\mu$ with unknown $\sigma$ is
  $\mathrm{CI}_{\mu,1-\alpha} = \left[\bar{x} - t_{\alpha/2} \frac{s}{\sqrt{n}},\ \bar{x} + t_{\alpha/2} \frac{s}{\sqrt{n}}\right]$
▪ What is the $t$-distribution?
  ▪ quite similar to the $z$-distribution ($\mu = 0$, continuous, symmetric, bell-shaped, infinite range, ...)
  ▪ a little bit "fatter" tails
  ▪ it has 1 parameter, usually denoted with df or $\nu$, and called degrees of freedom
THE $t$-DISTRIBUTION
Where $t_{\alpha/2}$ is such that $P(T \geq t_{\alpha/2}) = \alpha/2$ if $T$ is drawn from a $t$-distribution
▪ Graph of pdf of $t$-distribution
[Figure: probability density $f(x)$ against $x$ for the $z$ (standard normal) distribution and for $t$-distributions with df = 5, df = 13, and df = 1000]
THE $t$-DISTRIBUTION
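▪ Different notations
  ▪ $t_{13}$
  ▪ $t(\mathrm{df} = 13)$
  ▪ etc.
▪ And likewise
  ▪ $t_{13;\alpha/2}$
  ▪ $t_{13}(\alpha/2)$
  ▪ etc.
▪ So altogether for the confidence interval
  $\mathrm{CI}_{\mu,1-\alpha} = \left[\bar{x} - t_{n-1;\alpha/2} \frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1;\alpha/2} \frac{s}{\sqrt{n}}\right]$
Compare to $\left[\bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\right]$
THE $t$-DISTRIBUTION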
▪ How to choose the parameter df?
  ▪ it is a parameter based on the sample size that is used to determine the critical $t$-value
  ▪ it tells how many observations are used to estimate $\sigma$, less the number of intermediate estimates used in the calculation
  ▪ the df for the $t$-distribution in the case of a confidence interval for $\mu$ when $\sigma$ is unknown, is $\mathrm{df} = n - 1$
  ▪ but in other cases, it may be different
▪ Properties of $t$
  ▪ as the degrees of freedom increase, the $t$-distribution approaches the shape of the normal distribution
  ▪ for a given confidence level $1 - \alpha$, $t_{\alpha/2}$ is always larger than $z_{\alpha/2}$, so a confidence interval based on $t$ is always wider than if $z$ were used
THE $t$-DISTRIBUTION
▪ Reading the table of critical $t$-values
  ▪ e.g., $t_{0.025}(9)$: row $\mathrm{df} = 9$, column $\alpha/2 = 0.025$
  ▪ $t = 2.262$
Look carefully at tables for $z$ and $t$:
▪ $z$ usually runs from left to right
  ▪ $P(Z \leq z) = \int_{-\infty}^{z} f(x)\, dx$
▪ $t$ usually runs from right to left
  ▪ $P(T \geq t) = \int_{t}^{\infty} f(x)\, dx$
THE $t$-DISTRIBUTION
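Critical $t$-values can also be approximated numerically instead of being read from a table. The sketch below is one possible approach, not the method behind any particular table: it integrates the $t$ density with Simpson's rule and finds the critical value by bisection, using only the standard library. The function names, the step count, and the search range are illustrative choices.

```python
import math
from statistics import NormalDist


def t_pdf(t: float, df: int) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c) * (1 + t * t / df) ** (-(df + 1) / 2)


def t_upper_tail(t: float, df: int, steps: int = 1000) -> float:
    """P(T >= t) for t >= 0, via Simpson's rule on [0, t] and symmetry."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def t_critical(upper_tail: float, df: int) -> float:
    """Value t with P(T >= t) = upper_tail, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > upper_tail:
            lo = mid  # tail still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2


z = NormalDist().inv_cdf(0.975)
for df in (4, 9, 29, 99):
    print(f"df = {df:2d}  t = {t_critical(0.025, df):.3f}  (z = {z:.3f})")
```

For $\mathrm{df} = 9$ this reproduces the table value 2.262, and the output shows the critical $t$-value shrinking towards $z = 1.960$ as df grows.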
▪ Background of $t$
  ▪ developed by William Gosset in 1908
  ▪ while working at the Guinness Brewery, Dublin
  ▪ published under the pen name "Student"
THE $t$-DISTRIBUTION
Example for confidence interval
▪ Population
  ▪ $\mu = 520.78$ (unknown)
  ▪ $\sigma = 86.60$ (unknown)
  ▪ normally distributed (assumed)
▪ Sample
  ▪ $n = 5$ (chosen)
  ▪ $\bar{x} = 504.0$ (estimated)
  ▪ $s = 73.01$ (estimated)
▪ Calculation
  ▪ standard error of mean: $\frac{s}{\sqrt{n}} = 32.65$
  ▪ $t_{4;0.025} \times 32.65 = 90.65$ (with $t_{4;0.025} = 2.776$, or 2.7764 unrounded)
  ▪ $\mathrm{CI}_{\mu,0.95} = [413.35, 594.65]$
THE $t$-DISTRIBUTION
now we have a situation in which $\sigma$ is not known to us
df = 4
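The same calculation in code, as a minimal sketch. The critical value is passed in by hand; 2.7764 is the table value 2.776 for $\mathrm{df} = 4$ with one extra decimal, and the function name `t_confidence_interval` is made up for the illustration.

```python
import math


def t_confidence_interval(x_bar, s, n, t_crit):
    """Confidence interval for mu when sigma is estimated by s."""
    standard_error = s / math.sqrt(n)
    margin = t_crit * standard_error
    return x_bar - margin, x_bar + margin


# t_{4;0.025}: 2.776 in the table, 2.7764 with one more decimal
T_CRIT_DF4 = 2.7764

# Numbers from the example: x_bar = 504.0, s = 73.01, n = 5
lower, upper = t_confidence_interval(504.0, 73.01, 5, T_CRIT_DF4)
print(f"CI = [{lower:.2f}, {upper:.2f}]")
```

With the three-decimal table value 2.776, the bounds shift by about 0.01, which is why results can differ in the last digit depending on rounding.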
▪ Repeat the hypothesis test for this case
  ▪ now $\mathrm{CI}_{\mu,0.95} = [413.35, 594.65]$
▪ So we will reject the hypothesis $\mu = 600$
▪ while we will not reject the hypothesis $\mu = 550$
▪ Exactly the same reasoning as with the $z$-test, but with (slightly) different numbers
THE $t$-DISTRIBUTION
▪ When to use which?
  ▪ for a confidence interval for $\mu$ if $\sigma^2$ is known: use $z$
  ▪ for a confidence interval for $\mu$ if $\sigma^2$ is unknown: use $t$, and estimate $\sigma^2$ by $s^2$
▪ How to find?
  ▪ from a table with $z$-values: given $\alpha$, look up $z$
  ▪ from a table with $t$-values: given $\alpha$ and df, look up $t$
▪ What is the difference?
  ▪ confidence intervals with $t$ are a bit wider than with $z$
  ▪ the difference is small for $n \geq 30$ and negligible for $n \geq 100$
COMPARISON OF $z$ AND $t$
▪ Example: 50 confidence intervals with $z$ and $t$
[Figure: confidence intervals for 50 samples of size $n = 10$, simulated from an $N(2, 9)$ distribution; one panel based on $\sigma^2$, one panel based on $s^2$; horizontal axis: sample number $i$]
COMPARISON OF $z$ AND $t$