SLIDE 1

Point Estimation

Edwin Leuven

SLIDE 2

Introduction

Last time we reviewed statistical inference. We saw that while in probability we ask:

◮ given a data generating process, what are the properties of the outcomes?

in statistics the question is the reverse:

◮ given the outcomes, what can we say about the process that generated the data?

Statistical inference consists of:

  • 1. Estimation (point, interval)
  • 2. Inference (quantifying sampling error, hypothesis testing)

SLIDE 3

Introduction

Today we take a closer look at point estimation. We will go over three desirable properties of estimators:

  • 1. Unbiasedness
  • 2. Consistency
  • 3. Efficiency

And how to quantify the trade-off between location and variance using the

◮ Mean Squared Error (MSE)

SLIDE 4

Random sampling

Statistical inference starts with an assumption about how our data came about (the “data generating process”). We introduced the notion of sampling, where we consider

◮ observations in our data X1, . . . , Xn as draws from a population or, more generally, an unknown probability distribution f(X)

Simple Random Sample

We call a sample X1, . . . , Xn random if the Xi are independent and identically distributed (i.i.d.) random variables. Random samples arise if we draw each unit in the population into our sample with equal probability.

SLIDE 5

Random sampling

We will assume throughout that our samples are random! The aim is to use our data X1, . . . , Xn to learn something about the unknown probability distribution f(X) the data came from. We typically focus on E[X], the mean of X, to explain things, but we can ask many different questions:

◮ What is the variance of X?
◮ What is the 10th percentile of X?
◮ What fraction of X lies below 100,000?
◮ etc.

Very often we are interested in comparing measurements across populations:

◮ What is the difference in earnings between men and women?

SLIDE 6

Bias

Consider

  • 1. the estimand E[X], and
  • 2. an estimator X̂

What properties do we want our estimator X̂ to have? One desirable property is that X̂ is on average correct:

E[X̂] = E[X]

We call such estimators unbiased.

Bias

Bias = E[X̂] − E[X]

SLIDE 7

Bias

The estimand – in our example the population mean E[X] – is a number. For a given sample X̂ is also a number, which we call the estimate. Bias is not the difference between the estimate and the estimand

◮ this is the estimation error

Bias is the average estimation error across (infinitely) many random samples!

SLIDE 8

Estimating the Mean of X

The sample average is an unbiased estimator of the mean:

$$E[\bar{X}_n] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = E[X]$$

but we can think of different unbiased estimators, e.g. X1 is also an unbiased estimator of E[X]. If X has a symmetric distribution then both

◮ median(X), and
◮ (min(X) + max(X))/2

are unbiased
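
A small simulation can illustrate that all of these estimators are centered on E[X] when the distribution is symmetric. This is only a sketch: the population N(5, 2²), the sample size, and the number of replications are arbitrary choices, not taken from the slides.

n = 50; nrep = 10^4
avg = first = med = mid = rep(0, nrep)
for (i in 1:nrep) {
  x = rnorm(n, 5, 2)                  # symmetric population with E[X] = 5
  avg[i]   = mean(x)                  # sample average
  first[i] = x[1]                     # a single observation X1
  med[i]   = median(x)                # sample median
  mid[i]   = (min(x) + max(x)) / 2    # midrange
}
c(mean(avg), mean(first), mean(med), mean(mid))   # all close to 5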

SLIDE 9

Estimating the Variance of X

The estimator of the variance is

$$\widehat{Var}(X) = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$$

Why divide by n − 1 and not n?

$$E\Big[\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2\Big] = \frac{1}{n}\sum_{i=1}^{n} E[(X_i - \bar{X}_n)^2] = \frac{1}{n}\sum_{i=1}^{n} E[X_i^2 - 2X_i\bar{X}_n + \bar{X}_n^2]$$

$$= E[X_i^2] - 2E[X_i\bar{X}_n] + E[\bar{X}_n^2] = \frac{n-1}{n}\big(E[X_i^2] - E[X_i]^2\big) = \frac{n-1}{n}\,Var(X_i)$$

where the last line follows since

$$E[\bar{X}_n^2] = E[X_i\bar{X}_n] = \frac{1}{n}E[X_i^2] + \frac{n-1}{n}E[X_i]^2$$

SLIDE 10

Variance Estimation

We can verify this through numerical simulation:

n = 20; nrep = 10^5
varhat1 = rep(0, nrep); varhat2 = rep(0, nrep)
for (i in 1:nrep) {
  x = rnorm(n, 5, sqrt(3))        # sample of size 20 from N(5, 3)
  sx = sum((x - mean(x))^2)
  varhat1[i] = sx / (n - 1)       # divide by n - 1
  varhat2[i] = sx / n             # divide by n
}
mean(varhat1)
## [1] 3.0000818
mean(varhat2)
## [1] 2.8500777

SLIDE 11

How to choose between two unbiased estimators?

Since both are centered around the truth:

◮ pick the one that tends to be closest!

One measure of “close” is Var(X̂), the sampling variance of X̂:

x1 = rep(0, nrep); x2 = rep(0, nrep)
for (i in 1:nrep) {
  x = rnorm(100, 0, 1)                # sample from N(0, 1)
  x1[i] = mean(x)                     # sample average
  x2[i] = (min(x) + max(x)) / 2       # midrange
}
var(x1)
## [1] 0.0099863036
var(x2)
## [1] 0.092541761

SLIDE 12

How to choose between two unbiased estimators?

Since both are centered around the truth:

◮ pick the one that tends to be closest!

One measure of “close” is Var(X̂), the sampling variance of X̂:

y1 = rep(0, nrep); y2 = rep(0, nrep)
for (i in 1:nrep) {
  x = runif(100, 0, 1)                # sample from Uniform[0, 1]
  y1[i] = mean(x)                     # sample average
  y2[i] = (min(x) + max(x)) / 2       # midrange
}
var(y1)
## [1] 0.00083522986
var(y2)
## [1] 0.00004879091

SLIDE 13

How to choose between two unbiased estimators?

Normal(0,1) distribution

[Figure: simulated sampling distributions of the two estimators; x-axis: x1, y-axis: Density]

SLIDE 14

How to choose between two unbiased estimators?

Uniform[0,1] distribution

[Figure: simulated sampling distributions of the two estimators; x-axis: y1, y-axis: Density]

SLIDE 15

How to choose between two unbiased estimators?

The sampling distribution of our estimator depends on the underlying distribution of Xi in the population!

◮ Xi ∼ Normal: the sample average outperforms the midrange
◮ Xi ∼ Uniform: the midrange outperforms the sample average

However, the sample average is an attractive default because it often

  • 1. has a sampling distribution that is well understood
  • 2. is more efficient (smaller sampling variance) than alternative estimators

We will say more about this in the context of the WLLN and the CLT

SLIDE 16

The Standard Error

Above we compared the average and the midrange estimators using the sampling variance

$$Var(\hat{X}) = E[(\hat{X} - E[\hat{X}])^2] = E[\hat{X}^2] - E[\hat{X}]^2$$

It is however common to use the square root of the sampling variance of our estimators. This is called the standard error:

$$\text{Standard Error of } \hat{X} = \sqrt{Var(\hat{X})}$$

SLIDE 17

The Standard Error of the Sample Proportion

Consider a Bernoulli random variable X where

$$X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1-p \end{cases}$$

The sample proportion is $\bar{X}_n = \frac{1}{n}\sum_i X_i$ with variance

$$Var(\bar{X}_n) = \frac{1}{n^2}\sum_i Var(X_i) = \frac{n\,Var(X)}{n^2} = \frac{p(1-p)}{n}$$

but this depends on p, which is unknown! We have an unbiased estimator of p, namely $\bar{X}_n$, and we can therefore estimate the variance as follows:

$$\widehat{Var}(\bar{X}_n) = \bar{X}_n(1-\bar{X}_n)/n$$

SLIDE 18

The Standard Error of the Sample Mean

When the distribution of X is unknown but the Xi are i.i.d. we can also more generally derive the variance of the sample mean as follows:

$$Var(\bar{X}_n) = \frac{1}{n^2}\sum_i Var(X_i) = \frac{Var(X)}{n}$$

this again depends on an unknown parameter, Var(X), but one that we also have an estimator of:

$$\widehat{Var}(X) = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$$

so that

$$\widehat{Var}(\bar{X}_n) = \widehat{Var}(X)/n$$

and we get the standard error by taking the square root
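
A quick simulation check of Var(X̄n) = Var(X)/n, as a sketch only; the population N(1, 2²) and n = 100 are chosen to match the example on the next slide, where the theoretical standard error is sqrt(4/100) = 0.2.

nrep = 10^4
xbar = rep(0, nrep)
for (i in 1:nrep) xbar[i] = mean(rnorm(100, 1, 2))   # Var(X) = 4, n = 100
var(xbar)        # should be close to 4 / 100 = 0.04
sqrt(var(xbar))  # should be close to 0.2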

SLIDE 19

Calculating Standard Errors

phat = mean(rbinom(100, 1, .54))      # sample proportion, n = 100, p = 0.54
sqrt(phat * (1 - phat) / 100)         # estimate
## [1] 0.049638695
sqrt(.54 * (1 - .54) / 100)           # theoretical s.e.
## [1] 0.049839743
sqrt(var(rnorm(100, 1, 2)) / 100)     # estimate
## [1] 0.16962716
sqrt(2^2 / 100)                       # theoretical s.e.
## [1] 0.2

SLIDE 20

Bias vs Variance

Suppose we have

  • 1. an unbiased estimator with a large sampling variance
  • 2. a biased estimator with a small sampling variance

Should we choose our “best” estimator based on

◮ bias, or
◮ variance?

SLIDE 21

Bias vs Variance

[Figure: scatter of simulated estimates illustrating the four combinations: High Bias / Low Bias crossed with Low Variance / High Variance]

SLIDE 22

Bias vs Variance

E[X] = 0

[Figure: sampling distribution of the estimator around E[X] = 0; x-axis: xhat, y-axis: Density]

SLIDE 23

Bias vs Variance

E[X] = 0

[Figure: sampling distribution of the estimator around E[X] = 0; x-axis: xhat, y-axis: Density]

SLIDE 24

Mean Squared Error

We may need to choose between two estimators, one of which is unbiased. Consider the biased estimator: is the sampling variance (or the standard error) still a good measure?

$$Var(\hat{X}) = E[(\hat{X} - E[\hat{X}])^2] = E[(\hat{X} - (E[\hat{X}] - E[X]) - E[X])^2] = E[(\hat{X} - E[X] - \text{Bias})^2]$$

Suppose $Var(\hat{X}_{biased}) < Var(\hat{X}_{unbiased})$, what would you conclude?

SLIDE 25

Mean Squared Error

We are interested in the spread relative to the truth!! This is called the Mean Squared Error (MSE)

Mean Squared Error

$$MSE = E[(\hat{X} - E[X])^2]$$

We can show that

$$MSE = E[(\hat{X} - E[X])^2] = E[(\hat{X} - E[\hat{X}] + E[\hat{X}] - E[X])^2] = \underbrace{E[(\hat{X} - E[\hat{X}])^2]}_{Var(\hat{X})} + \underbrace{(E[\hat{X}] - E[X])^2}_{\text{Bias}^2}$$

There is a potential trade-off between Bias and Variance

SLIDE 26

Mean Squared Error

Consider again the following two estimators of the variance:

1. $\widehat{Var}(X) = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$

2. $\widehat{Var}(X) = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$

We saw that 1. is unbiased while 2. is not. How about the MSE? Consider the example on slide 10:

          bias2        var          mse
vhat1     0.00000001   0.94744282   0.94743335
vhat2     0.02247669   0.85506714   0.87753529

here X ∼ N(5, 3) and n = 20; try for X ∼ χ2(1) and vary n
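
The table above comes from a simulation like the one on slide 10. The sketch below is one possible way to reproduce this kind of bias²/variance/MSE comparison (the exact code is not from the slides) and makes it easy to swap in X ∼ χ2(1) or vary n as suggested.

n = 20; nrep = 10^5
truevar = 3                                   # Var(X) for N(5, 3); use 2 for chi2(1)
vhat1 = vhat2 = rep(0, nrep)
for (i in 1:nrep) {
  x = rnorm(n, 5, sqrt(3))                    # swap in rchisq(n, 1) to try X ~ chi2(1)
  sx = sum((x - mean(x))^2)
  vhat1[i] = sx / (n - 1)                     # unbiased estimator
  vhat2[i] = sx / n                           # biased estimator
}
summ = function(v) c(bias2 = (mean(v) - truevar)^2,
                     var   = var(v),
                     mse   = mean((v - truevar)^2))
rbind(vhat1 = summ(vhat1), vhat2 = summ(vhat2))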

SLIDE 27

Consistency

We mentioned unbiasedness as an attractive property of an estimator. But unbiasedness is a finite sample property

◮ silent on “how close” the estimate is to the truth
◮ a nonlinear function of an unbiased estimator is typically not unbiased

We will now consider consistency, which is a large sample property

◮ consistent estimators converge to the truth as sample sizes grow large
◮ a nonlinear function of a consistent estimator is typically consistent

SLIDE 28

Consistency

Consistency

Let θ̂n be an estimator of θ based on a sample of size n. We call θ̂n consistent if it gets closer and closer to θ as data accumulates, and write: θ̂n → θ. The precise definition is:

$$\lim_{n\to\infty} \Pr(|\hat{\theta}_n - \theta| > \epsilon) = 0 \quad \forall\, \epsilon > 0$$

Weak law of large numbers

If the Xi are i.i.d. random variables with E[|Xi|] < ∞, then

$$\frac{1}{n}\sum_i X_i \to E[X_i]$$

SLIDE 29

Consistency

Consider sampling from a population of voters where

$$X_i = \begin{cases} 1 & \text{if person } i \text{ supports the right} \\ 0 & \text{if person } i \text{ supports the left} \end{cases}$$

and Pr(Xi = 1) = 0.54. Denote our data by x1, . . . , xn. We estimate p by p̂ = (x1 + . . . + xn)/n
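
The convergence plots on the next slides can be reproduced with a simulation along the following lines; this is a sketch, and the log-scaled x-axis and other plotting choices are assumptions about how the original figures were made.

p = 0.54; nmax = 10^5
x = rbinom(nmax, 1, p)                  # one long sequence of sampled voters
phat = cumsum(x) / (1:nmax)             # running estimate phat(n) after n observations
plot(1:nmax, phat, type = "l", log = "x", ylim = c(0, 1),
     xlab = "n", ylab = "phat(n)")
abline(h = p, lty = 2)                  # true p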

SLIDE 30

Consistency

[Figure: p̂(n) plotted against n (10 to 100,000); x-axis: n, y-axis: phat(n)]

SLIDE 31

Consistency

[Figure: p̂(n) plotted against n (10 to 100,000); x-axis: n, y-axis: phat(n)]

SLIDE 32

Consistency

[Figure: p̂(n) plotted against n (10 to 100,000); x-axis: n, y-axis: phat(n)]

SLIDE 33

Biased and Consistent

Consider U ∼ Uniform[0, θ]; then θ̂ = max(u1, . . . , un) is a biased estimator since

$$E[\hat{\theta}] = \frac{n}{n+1}\,\theta$$

but it is consistent
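
A quick simulation illustrates both the downward bias for small n and the convergence as n grows. This is a sketch; θ = 10 is an assumption suggested by the scale of the plot on the next slide.

theta = 10                                   # assumed true parameter value
for (n in c(5, 50, 500, 5000)) {
  thetahat = replicate(10^4, max(runif(n, 0, theta)))
  cat("n =", n,
      " mean estimate =", round(mean(thetahat), 3),
      " theory n/(n+1)*theta =", round(n / (n + 1) * theta, 3), "\n")
}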

SLIDE 34

Biased and Consistent

[Figure: θ̂(n) = max(u1, . . . , un) plotted against n (10 to 100,000); x-axis: n, y-axis from 2 to 10]

SLIDE 35

Biased vs Consistent

Estimates of the mean

◮ unbiased and consistent
  ◮ X̄
◮ unbiased and inconsistent
  ◮ X1
◮ biased and consistent
  ◮ X̄ + 1/n
◮ biased and inconsistent
  ◮ can you think of one?
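
A small simulation (assuming X ∼ N(0, 1), a choice not taken from the slides) makes the middle two cases concrete: X̄ + 1/n starts off biased but both its bias and its spread vanish as n grows, whereas X1 never settles down.

for (n in c(10, 100, 10000)) {
  est = replicate(2000, mean(rnorm(n, 0, 1)) + 1/n)   # Xbar + 1/n, with E[X] = 0
  cat("n =", n, " mean =", round(mean(est), 4), " sd =", round(sd(est), 4), "\n")
}
# X1, in contrast, has standard deviation 1 no matter how large n gets:
# unbiased, but it does not converge to E[X]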

SLIDE 36

Consistent Estimators

Finding unbiased estimators is not so easy because even if E[θ̂] = θ, in general

$$E[g(\hat{\theta})] \neq g(\theta)$$

For example, if we know that E[θ̂] = σ², then E[√θ̂] ≠ σ. Finding consistent estimators is much easier because of the WLLN and because functions and combinations of consistent estimators are often again consistent
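
That unbiasedness does not survive a nonlinear transformation can be checked by simulation. A sketch, assuming X ∼ N(0, 2²) so that σ = 2 and σ² = 4, with a small sample size where the effect is clearly visible:

n = 5; nrep = 10^5
s2 = replicate(nrep, var(rnorm(n, 0, 2)))   # var() is unbiased for sigma^2 = 4
mean(s2)           # close to 4
mean(sqrt(s2))     # noticeably below sigma = 2: the square root of an unbiased estimator is biased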

SLIDE 37

Consistent Estimators

Continuous Mapping Theorem (CMT)

If g(·) is a continuous function and θ̂ a consistent estimator of θ, then g(θ̂) → g(θ). This means that if θ̂ → σ² then √θ̂ → σ
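
The large-sample counterpart of the previous sketch: with the same assumed population N(0, 2²), the square root of a consistent variance estimator does converge to σ as n grows.

sigma = 2
for (n in c(10^2, 10^4, 10^6)) {
  x = rnorm(n, 0, sigma)
  cat("n =", n, " sqrt(varhat) =", round(sqrt(var(x)), 4), "\n")   # approaches sigma = 2
}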

SLIDE 38

Consistent Estimators

Suppose you want a consistent estimator of the variance of X:

$$Var(X) = E[X^2] - E[X]^2$$

By the WLLN you know that

$$\frac{1}{n}\sum_i X_i \to E[X], \quad \text{and} \quad \frac{1}{n}\sum_i X_i^2 \to E[X^2]$$

and by the CMT

$$\Big(\frac{1}{n}\sum_i X_i\Big)^2 \to E[X]^2$$

and therefore that

$$\frac{1}{n}\sum_i X_i^2 - \Big(\frac{1}{n}\sum_i X_i\Big)^2 \to Var(X)$$

This is an application of the Method of Moments
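
In R this plug-in estimator is simply mean(x^2) - mean(x)^2. A minimal sketch of its consistency, assuming an exponential population with rate 1 (so Var(X) = 1; the distribution is an arbitrary illustrative choice):

for (n in c(10^2, 10^4, 10^6)) {
  x = rexp(n)                                 # Var(X) = 1 for this population
  mm = mean(x^2) - mean(x)^2                  # method of moments estimator
  cat("n =", n, " estimate =", round(mm, 4), "\n")
}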

SLIDE 39

Summary

With point estimation the objective is to estimate (compute a “best guess” of) a population parameter θ using our data. Parameters are things like:

◮ means, percentiles, minima, maxima, differences in means between groups, etc.

Estimates differ across samples, and estimators are therefore random variables. Estimators have a distribution.

SLIDE 40

Summary

To characterize an estimator we focussed on two key properties of its sampling distribution:

  • 1. location (unbiasedness, consistency)
  • 2. spread (variance, MSE)

Unbiasedness, E[θ̂] = θ, means that the expectation of our estimator equals the population parameter it intends to estimate. The expectation here is across infinitely many random samples, and unbiasedness means that we are correct on average. Unbiasedness is a finite sample property because it holds for samples of any size.

SLIDE 41

Summary

While being on target on average (location) is important, we never have this average estimate but a single one. We would therefore prefer to be close to the target in a given sample. This is more likely to happen if the spread of our estimator is small. A natural measure of spread is the variance:

$$Var(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]$$

But for a biased estimator it measures the spread around the wrong location, since then E[θ̂] = θ + Bias

SLIDE 42

Summary

This is why we turned to the Mean Squared Error (MSE)

$$MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$$

which measures the spread of the estimator θ̂ around the true parameter value θ. We saw that

MSE = Variance + Bias²

and that a trade-off between bias and variance can make us prefer a biased estimator over an unbiased one.

SLIDE 43

Summary

We often use consistent estimators because unbiased estimators are difficult to find or may not exist. Consistent estimators can be biased in small samples, but converge to the population parameter as more data become available: θ̂ → θ. The Weak Law of Large Numbers says that with random sampling, sample averages are consistent estimators of the corresponding population averages. We can often combine consistent estimators to construct new consistent estimators. Consistency is a large sample property.