Point Estimation
Edwin Leuven
Introduction
Last time we reviewed statistical inference. We saw that while in probability we ask:

◮ given a data generating process, what are the properties of the outcomes?

in statistics the question is the reverse:

◮ given the outcomes, what can we say about the process that generated the data?

Statistical inference consists of:

1. Estimation (point, interval)
2. Inference (quantifying sampling error, hypothesis testing)
2/43
Introduction
Today we take a closer look at point estimation. We will go over three desirable properties of an estimator:

1. Unbiasedness
2. Consistency
3. Efficiency

And we will see how to quantify the trade-off between location and variance using the

◮ Mean Squared Error (MSE)
3/43
Random sampling
Statistical inference starts with an assumption about how our data came about (the “data generating process”). We introduced the notion of sampling, where we consider the observations in our data X_1, . . . , X_n as draws from a population or, more generally, an unknown probability distribution f(X).

Simple Random Sample

We call a sample X_1, . . . , X_n random if the X_i are independent and identically distributed (i.i.d.) random variables. Random samples arise if we draw each unit in the population into our sample with equal probability.
4/43
Random sampling
We will assume throughout that our samples are random! The aim is to use our data X_1, . . . , X_n to learn something about the unknown probability distribution f(X) that the data came from. We typically focus on E[X], the mean of X, to explain things, but we can ask many different questions:

◮ What is the variance of X?
◮ What is the 10th percentile of X?
◮ What fraction of X lies below 100,000?
◮ etc.

Very often we are interested in comparing measurements across populations:

◮ What is the difference in earnings between men and women?
5/43
Bias
Consider

1. the estimand E[X], and
2. an estimator X̂

What properties do we want our estimator X̂ to have? One desirable property is that X̂ is on average correct:

E[X̂] = E[X]

We call such estimators unbiased.

Bias

Bias = E[X̂] − E[X]
6/43
Bias
The estimand – in our example the population mean E[X] – is a number. For a given sample, X̂ is also a number; we call this the estimate. Bias is not the difference between the estimate and the estimand

◮ that is the estimation error

Bias is the average estimation error across (infinitely) many random samples!
7/43
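The distinction between estimation error and bias is easy to check by simulation. A minimal sketch (in Python here, whereas the slides use R; the Normal(5, 4) population and sample size are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, nrep, n = 5.0, 10_000, 20

# One estimate per replication: the sample mean of n draws
estimates = rng.normal(true_mean, 2.0, size=(nrep, n)).mean(axis=1)

errors = estimates - true_mean  # estimation error: one per sample, rarely zero
bias = errors.mean()            # bias: the average error across many samples

print(bias)  # close to 0, even though individual errors are not
```

Each individual error is of the order of 2/√20 ≈ 0.45, while their average across replications is close to zero: the estimator is unbiased despite every single estimate missing the target.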
Estimating the Mean of X
The sample average is an unbiased estimator of the mean:

E[X̄_n] = (1/n) ∑_{i=1}^n E[X_i] = E[X]

but we can think of different unbiased estimators; e.g. X_1 on its own is also an unbiased estimator of E[X]. If X has a symmetric distribution then both

◮ median(X), and
◮ (min(X) + max(X))/2

are unbiased.
8/43
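The unbiasedness of these alternatives under symmetry can be checked numerically. A sketch in Python (the deck's own examples use R; the Normal(0, 1) population and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
nrep, n = 20_000, 15
x = rng.normal(0.0, 1.0, size=(nrep, n))  # symmetric around E[X] = 0

est_mean     = x.mean(axis=1)                       # sample average
est_first    = x[:, 0]                              # X_1 alone
est_median   = np.median(x, axis=1)                 # sample median
est_midrange = (x.min(axis=1) + x.max(axis=1)) / 2  # midrange

# All four are centered on the true mean 0 (unbiased),
# even though their spreads differ a lot
for est in (est_mean, est_first, est_median, est_midrange):
    print(round(est.mean(), 3))
```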
Estimating the Variance of X

The estimator of the variance:

V̂ar(X) = (1/(n − 1)) ∑_{i=1}^n (X_i − X̄_n)²

Why divide by n − 1 and not n?

E[(1/n) ∑_{i=1}^n (X_i − X̄_n)²] = (1/n) ∑_{i=1}^n E[(X_i − X̄_n)²]
  = (1/n) ∑_{i=1}^n E[X_i² − 2 X_i X̄_n + X̄_n²]
  = E[X_i²] − 2 E[X_i X̄_n] + E[X̄_n²]
  = ((n − 1)/n) (E[X_i²] − E[X_i]²)
  = ((n − 1)/n) Var(X_i)

where the last line follows since

E[X̄_n²] = E[X_i X̄_n] = (1/n) E[X_i²] + ((n − 1)/n) E[X_i]²
9/43
Variance Estimation
We can verify this through numerical simulation:

n = 20; nrep = 10^5
varhat1 = rep(0, nrep); varhat2 = rep(0, nrep)
for(i in 1:nrep) {
  x = rnorm(n, 5, sqrt(3))
  sx = sum((x - mean(x))^2)
  varhat1[i] = sx / (n - 1)
  varhat2[i] = sx / n
}
mean(varhat1)
## [1] 3.0000818
mean(varhat2)
## [1] 2.8500777
10/43
How to choose between two unbiased estimators?
Since both are centered around the truth:

◮ pick the one that tends to be closest!

One measure of “close” is Var(X̂), the sampling variance of X̂:

x1 = rep(0, nrep); x2 = rep(0, nrep)
for(i in 1:nrep) {
  x = rnorm(100, 0, 1)
  x1[i] = mean(x)
  x2[i] = (min(x) + max(x)) / 2
}
var(x1)
## [1] 0.0099863036
var(x2)
## [1] 0.092541761
11/43
How to choose between two unbiased estimators?
Since both are centered around the truth:

◮ pick the one that tends to be closest!

One measure of “close” is Var(X̂), the sampling variance of X̂:

y1 = rep(0, nrep); y2 = rep(0, nrep)
for(i in 1:nrep) {
  x = runif(100, 0, 1)
  y1[i] = mean(x)
  y2[i] = (min(x) + max(x)) / 2
}
var(y1)
## [1] 0.00083522986
var(y2)
## [1] 0.00004879091
12/43
How to choose between two unbiased estimators?
[Figure: Normal(0,1) distribution — sampling densities of the sample average (x1) and the midrange (x2)]
13/43
How to choose between two unbiased estimators?
[Figure: Uniform[0,1] distribution — sampling densities of the sample average (y1) and the midrange (y2)]
14/43
How to choose between two unbiased estimators?
The sampling distribution of our estimator depends on the underlying distribution of X_i in the population!

◮ X_i ∼ Normal: the sample average outperforms the midrange
◮ X_i ∼ Uniform: the midrange outperforms the sample average

However, the sample average is an attractive default because it often

1. has a sampling distribution that is well understood
2. is more efficient (smaller sampling variance) than alternative estimators

We will say more about this in the context of the WLLN and the CLT.
15/43
The Standard Error
Above we compared the average and the midrange estimators using the sampling variance:

Var(X̂) = E[(X̂ − E[X̂])²] = E[X̂²] − E[X̂]²

It is however common to use the square root of the sampling variance of our estimators. This is called the standard error:

Standard Error of X̂ = √Var(X̂)
16/43
The Standard Error of the Sample Proportion
Consider a Bernoulli random variable X where

X = 1 with probability p
X = 0 with probability 1 − p

The sample proportion is X̄_n = (1/n) ∑_i X_i with variance

Var(X̄_n) = (1/n²) ∑_i Var(X_i) = n Var(X)/n² = p(1 − p)/n

but this depends on p, which is unknown! We have an unbiased estimator of p, namely X̄_n, and we can therefore estimate the variance as follows:

V̂ar(X̄_n) = X̄_n(1 − X̄_n)/n
17/43
The Standard Error of the Sample Mean
When the distribution of X is unknown but the sample is i.i.d., we can also more generally derive the variance of the sample mean as follows:

Var(X̄_n) = (1/n²) ∑_i Var(X_i) = Var(X)/n

this again depends on an unknown parameter, Var(X), but one that we also have an estimator of:

V̂ar(X) = (1/(n − 1)) ∑_{i=1}^n (X_i − X̄_n)²

so that

V̂ar(X̄_n) = V̂ar(X)/n

and we get the standard error by taking the square root.
18/43
Calculating Standard Errors
phat = mean(rbinom(100, 1, .54))
sqrt(phat * (1 - phat) / 100) # estimate
## [1] 0.049638695
sqrt(.54 * (1 - .54) / 100) # theoretical s.e.
## [1] 0.049839743
sqrt(var(rnorm(100, 1, 2)) / 100) # estimate
## [1] 0.16962716
sqrt(2^2 / 100) # theoretical s.e.
## [1] 0.2
19/43
Bias vs Variance
Suppose we have

1. an unbiased estimator with a large sampling variance
2. a biased estimator with a small sampling variance

Should we choose our “best” estimator based on

◮ bias, or
◮ variance?
20/43
Bias vs Variance
[Figure: scatter illustration of the four combinations: high/low bias × low/high variance]
21/43
Bias vs Variance
[Figure: sampling density of an estimator x̂ around the true value E[X] = 0]
22/43
Bias vs Variance
[Figure: sampling density of an estimator x̂ around the true value E[X] = 0]
23/43
Mean Squared Error
We may need to choose between two estimators, only one of which is unbiased. Consider the biased estimator: is the sampling variance (or the standard error) still a good measure?

Var(X̂) = E[(X̂ − E[X̂])²]
  = E[(X̂ − (E[X̂] − E[X]) − E[X])²]
  = E[(X̂ − E[X] − Bias)²]

Suppose Var(X̂_biased) < Var(X̂_unbiased). What would you conclude?
24/43
Mean Squared Error
We are interested in the spread relative to the truth!! This is called the Mean Squared Error (MSE).

Mean Squared Error

MSE = E[(X̂ − E[X])²]

We can show that

MSE = E[(X̂ − E[X])²]
  = E[(X̂ − E[X̂] + E[X̂] − E[X])²]
  = E[(X̂ − E[X̂])²] + (E[X̂] − E[X])²
  = Var(X̂) + Bias²

There is a potential trade-off between Bias and Variance.
25/43
Mean Squared Error
Consider again the following two estimators of the variance:

1. V̂ar(X) = (1/(n − 1)) ∑_{i=1}^n (X_i − X̄_n)²
2. V̂ar(X) = (1/n) ∑_{i=1}^n (X_i − X̄_n)²

We saw that 1. is unbiased while 2. is not. How about the MSE? Consider the example on p.10:

        bias2        var         mse
vhat1   0.00000001   0.94744282  0.94743335
vhat2   0.02247669   0.85506714  0.87753529

here X ∼ N(5, 3) and n = 20; try for X ∼ χ²(1) and vary n.
26/43
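The suggested exercise can be scripted directly. A Python sketch of the same comparison (the p.10 simulation used R; N(5, 3) and n = 20 match that slide):

```python
import numpy as np

rng = np.random.default_rng(2)
n, nrep, true_var = 20, 100_000, 3.0
x = rng.normal(5.0, np.sqrt(true_var), size=(nrep, n))

ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
vhat1 = ss / (n - 1)  # unbiased estimator
vhat2 = ss / n        # biased, but with smaller variance

mse = {}
for name, v in (("vhat1", vhat1), ("vhat2", vhat2)):
    bias2 = (v.mean() - true_var) ** 2
    var = v.var()
    mse[name] = bias2 + var
    print(name, round(bias2, 4), round(var, 4), round(mse[name], 4))
```

vhat2 wins on MSE here: its squared bias is smaller than the variance it saves. Swapping `rng.normal` for `rng.chisquare(1, ...)` (true variance 2) and varying n reproduces the suggested exercise.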
Consistency
We mentioned unbiasedness as an attractive property of an estimator. But unbiasedness is a finite sample property:

◮ silent on “how close” the estimate is to the truth
◮ a nonlinear function of an unbiased estimator is typically not unbiased

We will now consider consistency, which is a large sample property:

◮ consistent estimators converge to the truth as sample sizes grow large
◮ a nonlinear function of a consistent estimator is typically consistent
27/43
Consistency
Consistency

Let θ̂_n be an estimator of θ based on a sample of size n. We call θ̂_n consistent if it gets closer and closer to θ as data accumulates, and write: θ̂_n → θ. The precise definition is:

lim_{n→∞} Pr(|θ̂_n − θ| > ε) = 0 for all ε > 0

Weak law of large numbers

If X_i are i.i.d. random variables with E[|X_i|] < ∞, then

(1/n) ∑_i X_i → E[X_i]
28/43
Consistency
Consider sampling from a population of voters where

X_i = 1 if person i supports the right
X_i = 0 if person i supports the left

and Pr(X_i = 1) = 0.54. Denote our data by x_1, . . . , x_n. We estimate p by

p̂ = (x_1 + . . . + x_n)/n
29/43
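The convergence pictured on the next slides can be sketched in a few lines of Python (the slides simulate in R; the seed and the grid of n values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.54
ns = [10, 100, 1_000, 10_000, 100_000]

# One long run of Bernoulli draws; phat(n) averages the first n of them
draws = rng.random(max(ns)) < p
phat = [draws[:n].mean() for n in ns]

for n, ph in zip(ns, phat):
    print(n, round(ph, 4))  # wanders for small n, settles near 0.54
```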
Consistency
[Figure: simulated path of p̂(n) for n from 10 to 100,000, converging toward 0.54]
30/43
Consistency
[Figure: simulated path of p̂(n) for n from 10 to 100,000, converging toward 0.54]
31/43
Consistency
[Figure: simulated path of p̂(n) for n from 10 to 100,000, converging toward 0.54]
32/43
Biased and Consistent
Consider U ∼ Uniform[0, θ]. Then θ̂ = max(u_1, . . . , u_n) is a biased estimator since

E[θ̂] = (n/(n + 1)) θ

but it is consistent.
33/43
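Both claims — downward bias at fixed n, consistency as n grows — are easy to verify by simulation. A Python sketch with θ = 10 (the choices of θ and n are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
theta = 10.0

# Bias at fixed n: E[theta_hat] = (n/(n+1)) * theta < theta
n, nrep = 5, 50_000
theta_hat = rng.uniform(0, theta, size=(nrep, n)).max(axis=1)
print(theta_hat.mean())  # near (5/6)*10 ≈ 8.33, not 10

# Consistency: with many draws the max is essentially theta
big_n_hat = rng.uniform(0, theta, size=1_000_000).max()
print(big_n_hat)         # just below 10
```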
Biased and Consistent
[Figure: simulated path of θ̂(n) = max(u_1, . . . , u_n) for n from 10 to 100,000, converging toward θ]
34/43
Biased vs Consistent
Estimates of the mean:

◮ unbiased and consistent: X̄
◮ unbiased and inconsistent: X_1
◮ biased and consistent: X̄ + 1/n
◮ biased and inconsistent: can you think of one?

35/43
Consistent Estimators
Finding unbiased estimators is not so easy, because even if E[θ̂] = θ, in general E[g(θ̂)] ≠ g(θ). For example, if we know that E[θ̂] = σ², then E[√θ̂] ≠ σ. Finding consistent estimators is much easier because of the WLLN, and because functions and combinations of consistent estimators are often again consistent.
36/43
Consistent Estimators
Continuous Mapping Theorem (CMT)

If g(·) is a continuous function and θ̂ a consistent estimator of θ, then

g(θ̂) → g(θ)

This means that if θ̂ → σ² then √θ̂ → σ.
37/43
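A quick numerical illustration of the CMT in Python (assuming a Normal population with σ = 2; any distribution with finite variance works):

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = 2.0
x = rng.normal(0.0, sigma, size=200_000)

var_hat = x.var(ddof=1)    # consistent for sigma^2 = 4
sd_hat = np.sqrt(var_hat)  # sqrt is continuous, so by the CMT
                           # sd_hat is consistent for sigma = 2
print(round(sd_hat, 3))
```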
Consistent Estimators
Suppose you want a consistent estimator of the variance of X:

Var(X) = E[X²] − E[X]²

By the WLLN you know that

(1/n) ∑_i X_i → E[X], and (1/n) ∑_i X_i² → E[X²]

and by the CMT

((1/n) ∑_i X_i)² → E[X]²

and therefore that

(1/n) ∑_i X_i² − ((1/n) ∑_i X_i)² → Var(X)

This is an application of the Method of Moments.
38/43
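The method-of-moments variance estimator from this slide, written out in Python (an Exponential population with Var(X) = 4 is an arbitrary test case):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.exponential(scale=2.0, size=500_000)  # E[X] = 2, Var(X) = 4

m1 = x.mean()          # -> E[X]   by the WLLN
m2 = (x ** 2).mean()   # -> E[X^2] by the WLLN
var_mm = m2 - m1 ** 2  # -> Var(X) by the CMT

print(round(var_mm, 3))  # close to 4
```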
Summary
With point estimation the objective is to estimate (compute a “best guess” of) a population parameter θ using our data. Parameters are things like:

◮ means, percentiles, minima, maxima, differences in means between groups, etc.

Estimates differ across samples, and estimators are therefore random variables. Estimators have a distribution.
39/43
Summary
To characterize an estimator we focused on two key properties of its sampling distribution:

1. location (unbiasedness, consistency)
2. spread (variance, MSE)

Unbiasedness, E[θ̂] = θ, means that the expectation of our estimator equals the population parameter it intends to estimate. The expectation here is across infinitely many random samples, and unbiasedness means that we are correct on average. Unbiasedness is a finite sample property because it holds for samples of any size.
40/43
Summary
While being on target on average (location) is important, we never have this average estimate but a single one. We would therefore prefer to be close to the target in a given sample. This is more likely to happen if the spread of our estimator is small. A natural measure of spread is the variance:

Var(θ̂) = E[(θ̂ − E[θ̂])²]

But for a biased estimator it measures the spread around the wrong location, since then E[θ̂] = θ + Bias ≠ θ.
41/43
Summary
This is why we turned to the Mean Squared Error (MSE):

MSE(θ̂) = E[(θ̂ − θ)²]

which measures the spread of the estimator θ̂ around the true parameter value θ. We saw that

MSE = Variance + Bias²

and that a trade-off between bias and variance can make us prefer a biased estimator over an unbiased one.
42/43
Summary
We often use consistent estimators because unbiased estimators are difficult to find or may not exist. Consistent estimators can be biased in small samples, but converge to the population parameter as more data become available: θ̂ → θ. The Weak Law of Large Numbers says that with random sampling, sample averages are consistent estimators of the corresponding population averages. We can often combine consistent estimators to construct new consistent estimators. Consistency is a large sample property.
43/43