Beyond descriptive statistics
When we have a data set, we usually want to do more with the data than just describe them.
Keep in mind that data are information from a sample selected or generated from a population, and our goal is to make inferences about the population.
Statistical inference
Statistical inference can be further subdivided into two main areas: estimation and hypothesis testing.
- Estimation is concerned with estimating the values of specific population parameters (today's lecture).
- Hypothesis testing is concerned with testing whether the value of a population parameter is equal to some specific value (next lecture).
Point estimation and interval estimation
- Sometimes we are interested in obtaining specific values as estimates of our parameters (along with a measure of their precision). These values are referred to as point estimates.
- Sometimes we want to specify a range within which the parameter values are likely to fall. If the range is narrow, then we may feel our point estimate is good. These ranges are called interval estimates.
- Purpose of inference: make decisions about population characteristics when it is impractical to observe the whole population and we only have a sample of data drawn from the population.
From Sample to Population!
Towards statistical inference
- Parameter: a number describing the population
- Statistic: a number describing a sample
- Statistical inference: Statistic → Parameter
Estimation of population mean
- We have a sample $(x_1, x_2, \ldots, x_n)$ randomly sampled from a population
- The population mean $\mu$ and variance $\sigma^2$ are unknown
- Question: how to use the observed sample $(x_1, x_2, \ldots, x_n)$ to estimate $\mu$ and $\sigma^2$?
Point estimator of population mean and variance
- A natural estimator for estimating the population mean $\mu$ is the sample mean
  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- A natural estimator for estimating the population standard deviation $\sigma$ is the sample standard deviation
  $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
Point estimator of population mean
- A natural estimator for estimating the population mean $\mu$ is the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- Question: How good is this estimate?
- We would like $\bar{x}$ to be close to $\mu$; however, we don't know the value of $\mu$.
- We need to study the distribution of $\bar{X}$
Sampling distribution of sample mean
- To understand what properties of $\bar{X}$ make it a desirable estimator for $\mu$, we need to forget about our particular sample for the moment and consider all possible samples of size n that could have been selected from the population
- The values of $\bar{X}$ in different samples will be different. These values will be denoted by $\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots$
- The sampling distribution of $\bar{X}$ is the distribution of values $\bar{x}$ over all possible samples of size n that could have been selected from the study population
Research question: center of a population
[Figure: a population (population mean) from which random samples 1, 2, ..., K are drawn, giving sample means $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_K$]
- The sample is representative of the population
- The selection of a random sample is A RANDOM EXPERIMENT
- $\bar{X}$ is a RANDOM VARIABLE
- $\bar{x}_1, \ldots, \bar{x}_K$ are observed values of $\bar{X}$
An example of sampling distribution
Sample mean is an unbiased estimator of population mean
- We can show that the average of these sample means $\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots$ (over all possible samples) is equal to the population mean $\mu$
- Unbiasedness: Let X1, X2, …, Xn be a random sample drawn from some population with mean $\mu$. Then $E(\bar{X}) = \mu$
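The idea of averaging the sample mean over many samples can be tried out numerically. The sketch below is only an illustration: the exponential population with mean 1, the sample size `N`, the number of repeated samples `K`, and the helper `draw_sample_mean` are all assumptions chosen for the demonstration, not part of the lecture.

```python
import random
import statistics

# Hypothetical population: exponential with mean 1 (an assumption for this demo).
MU = 1.0
N = 25      # size of each random sample
K = 2000    # number of repeated random samples

rng = random.Random(0)  # fixed seed so the run is reproducible


def draw_sample_mean(n):
    """Draw one random sample of size n and return its sample mean."""
    return statistics.fmean(rng.expovariate(1.0 / MU) for _ in range(n))


# Observed values x_bar_1, ..., x_bar_K of the random variable X_bar.
sample_means = [draw_sample_mean(N) for _ in range(K)]

# The average of the sample means should be close to the population mean.
mean_of_means = statistics.fmean(sample_means)
print(f"average of {K} sample means: {mean_of_means:.3f} (population mean {MU})")
```

Individual sample means vary from sample to sample, but their average settles near the population mean, which is what unbiasedness describes.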
$\bar{X}$ is minimum variance unbiased estimator of $\mu$
- The unbiasedness of the sample mean is not sufficient reason to use it as an estimator of $\mu$
- There are many other unbiased estimators, like the sample median and the average of min and max (both are unbiased when the population is symmetric, e.g. normal)
- We can show (but not here) that, for a normal population, among all unbiased estimators the sample mean has the smallest variance
- Now what is the variance of the sample mean $\bar{X}$?
Standard error(SE) of sample mean
- The variance of the sample mean measures the precision of the estimate: $\mathrm{Var}(\bar{X}) = \sigma^2/n$
- The standard error of the sample mean is $SE(\bar{X}) = \sigma/\sqrt{n}$
- $\sigma^2$ is the population variance
Use $s/\sqrt{n}$ to estimate $\sigma/\sqrt{n}$
- In practice, the population variance $\sigma^2$ is rarely known, and the sample variance $s^2$ is a reasonable estimator for $\sigma^2$
- Therefore, the standard error of the mean $\sigma/\sqrt{n}$ can be estimated by $s/\sqrt{n}$ (recall that $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$)
NOTE: The larger the sample size, the smaller the standard error and the more precise the estimate
An example of standard error
- A sample of size 10 birthweights: 97, 125, 62, 120, 132, 135, 118, 137, 126, 118 (sample mean $\bar{x} = 117.00$ and sample standard deviation $s = 22.44$)
- In order to estimate the population mean $\mu$, a point estimate is the sample mean $\bar{x} = 117.00$, with standard error given by $SE = s/\sqrt{n} = 22.44/\sqrt{10} = 7.09$
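The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
import math
import statistics

# The 10 birthweights from the example.
birthweights = [97, 125, 62, 120, 132, 135, 118, 137, 126, 118]

n = len(birthweights)
x_bar = statistics.mean(birthweights)  # sample mean
s = statistics.stdev(birthweights)     # sample standard deviation (n - 1 denominator)
se = s / math.sqrt(n)                  # estimated standard error of the mean

print(f"n = {n}, x_bar = {x_bar:.2f}, s = {s:.2f}, SE = {se:.2f}")
```

Note that `statistics.stdev` divides by n - 1, matching the definition of s used in these slides, whereas `statistics.pstdev` would divide by n.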
Summary of sampling distribution of $\bar{X}$
- Let X1, …, Xn be a random sample from a population with mean $\mu$ and variance $\sigma^2$. Then the mean and variance of $\bar{X}$ are $\mu$ and $\sigma^2/n$, respectively
- Furthermore, if X1, …, Xn is a random sample from a normal population with mean $\mu$ and variance $\sigma^2$, then by the properties of linear combinations, $\bar{X}$ is also normally distributed, that is, $\bar{X} \sim N(\mu, \sigma^2/n)$
- Now the question is, if the population is NOT normal, what is the distribution of $\bar{X}$?
The Central Limit Theorem
- Let X1 , X2 , …, Xn denote n independent random variables sampled from some population with mean $\mu$ and variance $\sigma^2$
- When n is large, the sampling distribution of the sample mean is approximately normal even if the underlying population is not normal: $\bar{X} \approx N(\mu, \sigma^2/n)$
- By standardization: $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \approx N(0,1)$
Illustration of Central limit Theorem (CLT)
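One way to see the theorem at work is a small simulation. The sketch below is an illustration under stated assumptions: the skewed exponential population (mean 1, standard deviation 1), the sample size `N`, the number of repetitions `K`, and the helper `standardized_mean` are hypothetical choices, not taken from the lecture.

```python
import math
import random
import statistics

# Hypothetical non-normal population: exponential with mean 1 and sd 1.
MU, SIGMA = 1.0, 1.0
N = 50      # sample size
K = 5000    # number of repeated samples

rng = random.Random(1)  # fixed seed so the run is reproducible


def standardized_mean(n):
    """Return Z = (x_bar - mu) / (sigma / sqrt(n)) for one random sample of size n."""
    x_bar = statistics.fmean(rng.expovariate(1.0 / MU) for _ in range(n))
    return (x_bar - MU) / (SIGMA / math.sqrt(n))


z_values = [standardized_mean(N) for _ in range(K)]

# If Z is approximately N(0, 1), about 95% of values fall inside (-1.96, 1.96).
coverage = sum(-1.96 < z < 1.96 for z in z_values) / K
print(f"fraction of Z values inside (-1.96, 1.96): {coverage:.3f}")
```

Even though individual observations are strongly skewed, the standardized sample means behave roughly like standard normal values once n is moderately large.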
Interval estimation
- Let X1 , X2 , …, Xn denote n independent random variables sampled from some population with mean $\mu$ and variance $\sigma^2$
- Our goal is to estimate $\mu$. We know that $\bar{X}$ is a good point estimate
- Now we want to have a confidence interval $(\bar{X} - a, \bar{X} + a) = \bar{X} \pm a$ such that a 95% confidence interval (CI) for $\mu$ satisfies the following: $\Pr(\bar{X} - a < \mu < \bar{X} + a) = 95\%$
Interval estimation
- By the CLT we have, approximately, $\bar{X} \sim N(\mu, \sigma^2/n)$. Using the Z-transformation, we have $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$
- Thus for a standard normal variable, how wide is the interval that covers 95% of the probability?
  Pr(-1<Z<1)=0.6827, Pr(-1.96<Z<1.96)=0.95, Pr(-2.58<Z<2.58)=0.99
Interval estimation
- Thus we have $0.95 = P(-1.96 \le Z \le 1.96) = P\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right)$
- However, we don't know $\sigma$, so we replace $\sigma$ with s, the sample standard deviation: $0.95 \approx P\left(-1.96 \le \frac{\bar{X} - \mu}{s/\sqrt{n}} \le 1.96\right)$
- The term $s/\sqrt{n}$ is called the SAMPLE STANDARD ERROR, denoted as $SE(\bar{X})$. Thus the 95% confidence interval for $\mu$ is: $0.95 \approx P(\bar{X} - 1.96\,SE(\bar{X}) \le \mu \le \bar{X} + 1.96\,SE(\bar{X}))$
Ex.
- Assume that the n=92 students were a simple random sample of all NYU students. The sample mean weight is $\bar{x} = 145.2$ lbs and the sample standard deviation is s=23.7. What's the 95% confidence interval for the NYU student mean weight?
- Answer: First of all, the standard error is $SE(\bar{X}) = 23.7/\sqrt{92} = 2.47$
- The 95% CI is $\bar{x} \pm 1.96 \cdot SE(\bar{X}) = 145.2 \pm (1.96)(2.47) = 145.2 \pm 4.8$ lbs = [140.4 lbs, 150.0 lbs]
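The same calculation can be written as a small function. This is a sketch only; the function name `z_interval` and its parameters are illustrative and do not come from the lecture.

```python
import math


def z_interval(x_bar, s, n, z=1.96):
    """Large-sample CI for a mean: x_bar +/- z * s / sqrt(n)."""
    se = s / math.sqrt(n)  # estimated standard error of the mean
    return x_bar - z * se, x_bar + z * se


# Summary statistics from the NYU weight example.
lower, upper = z_interval(x_bar=145.2, s=23.7, n=92)
print(f"95% CI: [{lower:.1f} lbs, {upper:.1f} lbs]")
```

Passing a different `z` (for example 2.58) gives the interval for another confidence level.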
Interval estimation for an arbitrary level of confidence 1-α
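- $z_{\alpha/2}$ is the Z critical value for level α/2: $P(Z \le z_{\alpha/2}) = \alpha/2$
- In particular, $P(Z \le z_{0.025}) = 0.025$
- [Figure: standard normal density with central area 1-α between $z_{\alpha/2}$ and $z_{1-\alpha/2}$, and area α/2 in each tail]
- EXCEL: NORM.INV(0.025, 0, 1) = -1.96; in general, NORM.INV(probability, mean, sd) = critical value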
Z critical value table
1-α          0.80    0.90    0.95    0.99
α            0.20    0.10    0.05    0.01
α/2          0.10    0.05    0.025   0.005
z_{1-α/2}    1.28    1.64    1.96    2.58
E.G. For a 0.99 level of confidence, go out 2.58 standard errors from the sample mean.
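The table values can be reproduced without Excel. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `z_critical` is an illustrative assumption.

```python
from statistics import NormalDist


def z_critical(confidence):
    """Return z_{1 - alpha/2} for a two-sided interval with the given confidence level."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for level in (0.80, 0.90, 0.95, 0.99):
    print(f"1 - alpha = {level:.2f}  ->  z = {z_critical(level):.2f}")
```

`NormalDist().inv_cdf(0.025)` plays the same role as NORM.INV(0.025, 0, 1) and returns the lower critical value, about -1.96.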
Confidence interval (when n is large)
- Confidence interval for the mean of a normal distribution (large sample case)
- A 100%×(1-α) CI for the mean $\mu$ of a normal distribution with unknown variance is given by $(\bar{x} - z_{1-\alpha/2}\,SE(\bar{X}),\ \bar{x} + z_{1-\alpha/2}\,SE(\bar{X}))$, where $SE(\bar{X}) = s/\sqrt{n}$
- A shorthand notation for the CI is $\bar{x} \pm z_{1-\alpha/2}\,SE(\bar{X})$
Motivation for t-distribution
- $\frac{\bar{X} - \mu}{s/\sqrt{n}}$ has an approximately normal distribution only when it is computed using a large sample.
- For small n (e.g. n = 5, 10, 25, up to about 30), this is not true. When the sample size is small, we need to use Student's t distribution
T-distribution
- If X1, …, Xn ~ N(µ,σ2) and are independent, then the random variable $t = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}$, where $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$, follows what is called the t-distribution with n-1 degrees of freedom, denoted $t_{n-1}$
Normal density and t densities
Comparison of normal and t distributions
- If n is the sample size, the corresponding t distribution has n-1 degrees of freedom.
- The more degrees of freedom, the closer the t distribution is to the standard normal distribution
- The t distribution is more spread out than the normal distribution: $t_{d,.975}$ is farther from 0 than $z_{.975}$
Comparison of the 97.5th percentile of the t distribution and the normal distribution
d     t_{d,.975}   z_{.975}
4     2.776        1.960
9     2.262        1.960
29    2.045        1.960
60    2.000        1.960
∞     1.960        1.960
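[Figure: t density with central area 100%×(1-α) between $t_{n-1,\alpha/2} = -t_{n-1,1-\alpha/2}$ and $t_{n-1,1-\alpha/2}$, and area α/2 in each tail]
Define the critical values $t_{n-1,1-\alpha/2}$ and $-t_{n-1,1-\alpha/2}$ as follows:
$P(t_{n-1} > t_{n-1,1-\alpha/2}) = \alpha/2$ and $P(t_{n-1} < -t_{n-1,1-\alpha/2}) = \alpha/2$
EXCEL: T.INV(0.025, 60) = -2.000; in general, T.INV(probability, degrees of freedom) = critical value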
Confidence interval
- Confidence interval for the mean of a normal distribution
- A 100%×(1-α) CI for the mean $\mu$ of a normal distribution with unknown variance is given by $(\bar{x} - t_{n-1,1-\alpha/2}\,SE(\bar{X}),\ \bar{x} + t_{n-1,1-\alpha/2}\,SE(\bar{X}))$, where $SE(\bar{X}) = s/\sqrt{n}$
- A shorthand notation for the CI is $\bar{x} \pm t_{n-1,1-\alpha/2}\,SE(\bar{X})$
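As an illustration of the small-sample formula, the sketch below applies it to the 10 birthweights used earlier for the standard error. The lecture does not work this interval out, so treat it as an added example; the critical value 2.262 for d = 9 is copied from the t percentile table rather than computed, because the Python standard library has no t quantile function.

```python
import math
import statistics

# The 10 birthweights from the standard error example.
birthweights = [97, 125, 62, 120, 132, 135, 118, 137, 126, 118]

# t_{9, .975}, taken from the t table (n - 1 = 9 degrees of freedom).
T_CRIT = 2.262

n = len(birthweights)
x_bar = statistics.mean(birthweights)
se = statistics.stdev(birthweights) / math.sqrt(n)  # SE = s / sqrt(n)

half_width = T_CRIT * se
lower, upper = x_bar - half_width, x_bar + half_width
print(f"95% t-based CI for mu: ({lower:.2f}, {upper:.2f})")
```

Using 1.96 instead of 2.262 would give a narrower interval, which understates the uncertainty for a sample this small.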
Factors affecting the length of a CI
The length of a 100%×(1-α) confidence interval for µ equals $2\,t_{n-1,1-\alpha/2}\,s/\sqrt{n}$ and is determined by n, s and α
- n: As the sample size increases, the length of the CI decreases.
- s: As the standard deviation (s), which reflects the variability of the distribution of individual observations, increases, the length of the CI increases.
- α: As the confidence desired increases (α decreases), the length of the CI increases.
Estimating the population proportion
Setting:
- X follows the binomial distribution Bin(n, p)
- Parameter p is the unknown population proportion of success
- Intuition: we consider the following sample proportion of success: $\hat{p} = X/n$
An example of estimating p
- Estimating the prevalence of malignant melanoma in 45- to 54-year-old women in the US. Suppose a random sample of 5000 women is selected from this age group, of whom 28 are found to have the disease
- Suppose the prevalence of the disease (population proportion of the disease) in this age group is p. How can p be estimated?
Cancer example continued
- Let X be the number of women with the disease among
the n women
- X can be viewed as a binomial random variable with parameters n and p
- We have E(X)=np and Var(X)=npq, where q=1-p
Properties of sample proportion
- Sample proportion of the disease among n women: $\hat{p} = X/n$
- Its mean: $E(\hat{p}) = E(X)/n = p$. Thus, $\hat{p}$ is an unbiased estimator for p
- Its variance and standard deviation: $\mathrm{Var}(\hat{p}) = \mathrm{Var}(X)/n^2 = p(1-p)/n$ and $SD(\hat{p}) = \sqrt{p(1-p)/n}$
- Replacing p by $\hat{p}$, the standard error of the sample proportion is $SE(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}$
An example of point estimate of p
- (Cancer example continued): Estimate the prevalence of
malignant melanoma
- Solution: $\hat{p} = 28/5000 = 0.0056$ and $SE(\hat{p}) = \sqrt{0.0056(0.9944)/5000} = 0.0011$
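The point estimate and its standard error can be checked in a few lines. This is a minimal sketch; the function name `proportion_estimate` is illustrative.

```python
import math


def proportion_estimate(x, n):
    """Return the sample proportion p_hat = x / n and its estimated standard error."""
    p_hat = x / n
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, se


# Melanoma example: 28 cases among 5000 women.
p_hat, se = proportion_estimate(28, 5000)
print(f"p_hat = {p_hat:.4f}, SE(p_hat) = {se:.4f}")
```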
Why sample proportion is best estimator for p
- Actually, the sample proportion is also a sample mean: $\hat{p} = X/n = \sum_{i=1}^{n} X_i / n$, where $X_i$ is 1 if the i-th individual is a success and 0 otherwise
- Actually, the population proportion p is also a population mean: $\mu = E(X_i) = p$
- Recall that the sample mean is the minimum variance unbiased estimate of the population mean
- Therefore, the sample proportion $\hat{p}$ is also the minimum variance unbiased estimate of the population proportion
Interval estimation for p
- EX. Suppose we are interested in estimating the prevalence rate of breast cancer among 50- to 54-year-old women whose mothers have had breast cancer. Suppose that in a random sample of 10,000 such women, 400 are found to have had breast cancer at some point in their lives. The best point estimate of the prevalence rate p is given by the sample proportion $\hat{p} = 400/10000 = 0.04$, with standard error $SE(\hat{p}) = \sqrt{.04(.96)/10{,}000} = 0.002$
- How can an interval estimate of p be obtained?
Central limit theorem plays an important role
- Since $E(\hat{p}) = p$ and $\mathrm{Var}(\hat{p}) = pq/n$
- As discussed earlier, the sample proportion is a special case of the sample mean. When the sample size n is large, by the central limit theorem we have $\hat{p} \approx N(p, pq/n)$
CI for p – Normal theory method
- An approximate 100%×(1-α) CI for the binomial parameter p based on the normal approximation is given by $(\hat{p} - z_{1-\alpha/2}\,SE(\hat{p}),\ \hat{p} + z_{1-\alpha/2}\,SE(\hat{p})) = \hat{p} \pm z_{1-\alpha/2}\,SE(\hat{p})$, where $SE(\hat{p}) = \sqrt{\hat{p}\hat{q}/n}$ and $\hat{q} = 1 - \hat{p}$
An example of CI for p
Breast cancer example continued
- Since we have $\hat{p} = 0.04$, $\alpha = 0.05$, $z_{1-\alpha/2} = 1.96$ and $n = 10{,}000$, the 95% CI is given by
  $(0.04 - 1.96\sqrt{.04(.96)/10{,}000},\ 0.04 + 1.96\sqrt{.04(.96)/10{,}000}) = (0.04 - 0.004,\ 0.04 + 0.004) = (.036, .044)$
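The interval above can be reproduced with a short function. This is a sketch under the normal approximation described in these slides; the name `proportion_ci` and its parameters are illustrative, and `statistics.NormalDist` requires Python 3.8 or later.

```python
import math
from statistics import NormalDist


def proportion_ci(x, n, confidence=0.95):
    """Normal-theory CI for a binomial proportion: p_hat +/- z * sqrt(p_hat * q_hat / n)."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)  # z_{1 - alpha/2}
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


# Breast cancer example: 400 cases among 10,000 women.
lower, upper = proportion_ci(400, 10_000)
print(f"95% CI for p: ({lower:.3f}, {upper:.3f})")
```

Because the method relies on the central limit theorem, it is meant for large samples such as this one.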
Summary– three recipes for CI
- Proportion (large sample): $\hat{p} \pm z_{1-\alpha/2}\,SE(\hat{p})$
- Large sample population mean: $\bar{x} \pm z_{1-\alpha/2}\,SE(\bar{X})$
- Small sample population mean: $\bar{x} \pm t_{n-1,1-\alpha/2}\,SE(\bar{X})$
- Each of those SE is proportional to the magic number $1/\sqrt{n}$