Estimation Theory Overview

J. McNames, Portland State University, ECE 4/557 Estimation Theory, Ver. 1.26



Terminology

  • Suppose we have N independent, identically distributed (i.i.d.) observations $\{x_i\}_{i=1}^{N}$
  • Ideally we would like to know the pdf of the data, $f(x; \theta)$, where $\theta \in \mathbb{R}^{p \times 1}$
  • In probability theory, we think about the “likeliness” of $\{x_i\}_{i=1}^{N}$ given the pdf and $\theta$
  • In inference, we are given $\{x_i\}_{i=1}^{N}$ and are interested in the “likeliness” of $\theta$
  • Called the sampling distribution
  • We will use $\theta$ to denote the parameter (or vector of parameters) we wish to estimate
  • This could be, for example, the process mean $\mu_x$

Estimation Theory Overview

  • Properties
  • Bias, Variance, and Mean Square Error
  • Cramér-Rao lower bound

  • Maximum likelihood
  • Consistency
  • Confidence intervals
  • Properties of the mean estimator

Estimators as Random Variables

  • Our estimator $\hat{\theta}\bigl(\{x_i\}_{i=1}^{N}\bigr)$ is a function of the measurements
  • It is therefore a random variable
  • It will be different for every different set of observations
  • It is called an estimate or, if $\theta$ is a scalar, a point estimate
  • Of course we want $\hat{\theta}$ to be as close to the true $\theta$ as possible
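A minimal numerical sketch of this point, assuming NumPy and an arbitrary Gaussian process with a hypothetical true mean of 5 and unit variance: the same estimator applied to freshly drawn data sets yields a different estimate each time.

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = 5.0   # hypothetical "true" process mean, used only for this demo
N = 100            # number of i.i.d. observations per data set

# Apply the same estimator (the sample mean) to several independent data sets.
for trial in range(5):
    x = rng.normal(loc=true_theta, scale=1.0, size=N)
    theta_hat = x.mean()          # the estimate is a function of the observations
    print(f"data set {trial}: theta_hat = {theta_hat:.4f}")

# The printed values differ from data set to data set: theta_hat is itself a random variable.
```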


Introduction

  • Up until now we have defined and discussed properties of random variables and processes
  • In each case we started with some known property (e.g. autocorrelation) and derived other related properties (e.g. the PSD)
  • In practical problems we rarely know these properties a priori
  • Instead, we must estimate what we wish to know from finite sets of measurements


Bias

The bias of an estimator $\hat{\theta}$ of a parameter $\theta$ is defined as

$$B(\hat{\theta}) \triangleq E[\hat{\theta}] - \theta$$

  • Unbiased: an estimator is said to be unbiased if $B(\hat{\theta}) = 0$
  • This implies the pdf of the estimator is centered at the true value $\theta$
  • The sample mean is unbiased
  • The estimator of variance on the earlier slide is biased
  • Unbiased estimators are generally good, but they are not always best (more later)
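A small Monte Carlo sketch of the last three bullets, assuming NumPy and a hypothetical Gaussian process with mean 2 and variance 9: averaged over many data sets, the sample mean lands on the true mean, while the 1/N variance estimator lands on (N−1)/N times the true variance.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, N, trials = 2.0, 9.0, 10, 200_000   # hypothetical process parameters

mean_estimates = np.empty(trials)
var_estimates = np.empty(trials)
for t in range(trials):
    x = rng.normal(mu, np.sqrt(sigma2), size=N)
    mean_estimates[t] = x.mean()        # sample mean
    var_estimates[t] = x.var(ddof=0)    # "natural" 1/N variance estimator

print("average mu_hat     :", mean_estimates.mean())  # close to 2.0 -> unbiased
print("average sigma2_hat :", var_estimates.mean())   # close to 9*(N-1)/N = 8.1 -> biased low
```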


Natural Estimators

$$\hat{\mu}_x = \hat{\theta}\bigl(\{x_i\}_{i=1}^{N}\bigr) = \frac{1}{N}\sum_{i=1}^{N} x_i$$

  • This is the obvious or “natural” estimator of the process mean
  • Sometimes called the average or sample mean
  • It will also turn out to be the “best” estimator
  • I will define “best” shortly

$$\hat{\sigma}_x^2 = \hat{\theta}\bigl(\{x_i\}_{i=1}^{N}\bigr) = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu}_x)^2$$

  • This is the obvious or “natural” estimator of the process variance
  • Not the “best”
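A direct NumPy transcription of these two natural estimators (a sketch; the observations below are arbitrary example values):

```python
import numpy as np

def sample_mean(x):
    """Natural estimator of the process mean: (1/N) * sum of the observations."""
    return np.sum(x) / len(x)

def natural_variance(x):
    """Natural (1/N) estimator of the process variance; note that it is biased."""
    mu_hat = sample_mean(x)
    return np.sum((x - mu_hat) ** 2) / len(x)

x = np.array([2.1, 1.7, 2.4, 2.0, 1.9, 2.3])   # arbitrary example observations
print(sample_mean(x), natural_variance(x))
# Equivalent library calls: np.mean(x) and np.var(x, ddof=0).
```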

Variance

The variance of an estimator $\hat{\theta}$ of a parameter $\theta$ is defined as

$$\operatorname{var}(\hat{\theta}) = \sigma_{\hat{\theta}}^2 \triangleq E\left[\left(\hat{\theta} - E[\hat{\theta}]\right)^2\right]$$

  • A measure of the spread of $\hat{\theta}$ about its mean
  • We would like the variance to be as small as possible

Good Estimators

[Figure: pdf $f_{\hat{\theta}}(\hat{\theta})$ of an estimator, with the true value $\theta$ marked on the $\hat{\theta}$ axis]

  • Without loss of generality, let us consider a scalar parameter $\theta$ for the time being
  • What is a “good” estimator?
    – The distribution of $\hat{\theta}$ should be centered at the true value
    – We want the distribution to be as narrow as possible
  • Lower-order moments enable coarse measurements of “good”


Bias, Variance, and Modeling

$$y(x) = g(x) + \varepsilon \qquad \hat{y}(x) = \hat{g}(x)$$

  • In the modeling context, we are usually interested in estimating a function
  • For a given input x, this function is a scalar
  • We can define $\theta = g(x)$
  • Thus, all of the ideas that apply to estimating parameters also apply to estimating functional relationships


Bias-Variance Tradeoff

[Figure: pdfs $f_{\hat{\theta}}(\hat{\theta})$ of two estimators, illustrating the tradeoff between bias and variance]

  • In many cases minimizing variance conflicts with minimizing bias
  • Note that the constant estimator $\hat{\theta} \equiv 0$ has zero variance, but is generally biased
  • In these cases we must trade variance for bias (or vice versa)

Notation and Prediction Error

$$y = g(x) + \varepsilon \qquad g = g(x) \qquad \hat{g} = \hat{g}(x) \qquad \hat{g}_e = E[\hat{g}(x)]$$

  • The expectation is taken over the distribution of data sets used to construct $\hat{g}(x)$ and the distribution of the process noise $f(\varepsilon)$
  • Everything is a function of x
  • Recall that $\varepsilon$ is i.i.d. with zero mean
  • We are treating x as a fixed, non-random variable
  • The dependence on x is not shown, to simplify notation

The prediction error for a new, given input is defined as

$$\mathrm{PE}(x) = E[(y - \hat{g})^2] = E[((g - \hat{g}) + \varepsilon)^2] = E[(g - \hat{g})^2] + 2\,E[(g - \hat{g})\varepsilon] + E[\varepsilon^2] = \mathrm{MSE}(x) + \sigma_\varepsilon^2$$

The cross term vanishes because $\varepsilon$ is zero-mean and independent of $g - \hat{g}$.


The Bias-Variance Tradeoff

[Figure: pdfs $f_{\hat{\theta}}(\hat{\theta})$ of two estimators, illustrating the bias-variance tradeoff]

  • Understanding the bias-variance tradeoff is crucial to this course
  • Unbiased models are not always best
  • The methods we will use to estimate the model coefficients are biased
  • But they may be more accurate, because they have less variance
  • This idea applies to nonlinear models as well


Bias-Variance Tradeoff Comments

$$\mathrm{MSE}(x) = (g - E[\hat{g}])^2 + E\left[(\hat{g} - E[\hat{g}])^2\right] = \text{Bias}^2 + \text{Variance}$$

  • Large variance: the model is sensitive to small changes in the data set
  • Large bias: if the model were compared to the true function on a large number of data sets, the expected value of the model $\hat{g}(x)$ would not be close to the true function $g(x)$
  • If the model is sensitive to small changes in the data, a biased model may have smaller error (MSE) than an unbiased model
  • If the data are strongly collinear, biased estimators can result in more accurate models (a numerical sketch follows)
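A sketch of the collinearity point, assuming NumPy and a hypothetical ridge penalty of λ = 1 (not taken from the lecture): with two nearly identical regressors, ordinary least squares is unbiased but has enormous variance, while the ridge-penalized fit is biased yet has a far smaller total MSE on the coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
beta = np.array([1.0, 1.0])     # hypothetical true coefficients
n, lam, trials = 30, 1.0, 2000  # sample size, ridge penalty, Monte Carlo runs

ols, ridge = [], []
for _ in range(trials):
    z = rng.standard_normal(n)
    X = np.column_stack([z, z + 0.01 * rng.standard_normal(n)])   # strongly collinear columns
    y = X @ beta + rng.standard_normal(n)                          # y = g(x) + eps
    ols.append(np.linalg.solve(X.T @ X, X.T @ y))                  # unbiased, huge variance
    ridge.append(np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y))  # biased, small variance

for name, b in [("OLS", np.array(ols)), ("ridge", np.array(ridge))]:
    bias2 = np.sum((b.mean(axis=0) - beta) ** 2)
    var = np.sum(b.var(axis=0))
    print(f"{name:5s}: bias^2 = {bias2:.4f}  variance = {var:.4f}  MSE = {bias2 + var:.4f}")
```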


The Bias-Variance Tradeoff Derivation

$$y = g(x) + \varepsilon \qquad g = g(x) \qquad \hat{g} = \hat{g}(x) \qquad \hat{g}_e = E[\hat{g}(x)]$$

  • Only $\hat{g}$ is a random function
  • Nothing else is dependent on the data set

$$\mathrm{MSE}(x) = E\left[(g - \hat{g})^2\right] = E\left[\left\{(g - \hat{g}_e) - (\hat{g} - \hat{g}_e)\right\}^2\right] = E\left[(g - \hat{g}_e)^2 - 2(g - \hat{g}_e)(\hat{g} - \hat{g}_e) + (\hat{g} - \hat{g}_e)^2\right]$$


Bias-Variance Tradeoff Comments Continued

$$\mathrm{MSE}(x) = (g - E[\hat{g}])^2 + E\left[(\hat{g} - E[\hat{g}])^2\right] = \text{Bias}^2 + \text{Variance}$$

  • Large variance, small bias
    – If the model is too flexible, it can overfit the data
    – The model will change dramatically from one data set to another
    – In this case the model has high variance, but potentially low bias
  • Small variance, large bias
    – If the model is not very flexible, it may not capture the true relationship between the inputs and the output
    – It will not vary as much from one data set to another
    – In this case the model has low variance, but potentially high bias


Bias-Variance Tradeoff Derivation Continued

$$\begin{aligned}
\text{①} &= E\left[(g - \hat{g}_e)^2 - 2(g - \hat{g}_e)(\hat{g} - \hat{g}_e)\right] \\
&= E\left[g^2 - 2g\hat{g}_e + \hat{g}_e^2 - 2g(\hat{g} - \hat{g}_e)\right] + 2\hat{g}_e^2 - 2\hat{g}_e^2 \\
&= E\left[g^2 - 2g\hat{g}_e + \hat{g}_e^2 - 2g\hat{g} + 2g\hat{g}_e\right] \\
&= E\left[g^2 - 2g\hat{g} + \hat{g}_e^2\right] \\
&= g^2 - 2g\,E[\hat{g}] + \hat{g}_e^2 \\
&= g^2 - 2g\hat{g}_e + \hat{g}_e^2 \\
&= (g - \hat{g}_e)^2
\end{aligned}$$

Thus

$$\mathrm{MSE}(x) = \text{①} + \text{②} = (g - \hat{g}_e)^2 + E\left[(\hat{g} - \hat{g}_e)^2\right] = (g - E[\hat{g}])^2 + E\left[(\hat{g} - E[\hat{g}])^2\right]$$



Cramér-Rao Lower Bound Comments

$$\operatorname{var}(\hat{\theta}) \ge \frac{1}{E\left[\left(\frac{\partial \ln f_{x;\theta}(x;\theta)}{\partial \theta}\right)^2\right]} = \frac{-1}{E\left[\frac{\partial^2 \ln f_{x;\theta}(x;\theta)}{\partial \theta^2}\right]}$$

  • Efficient estimator: an unbiased estimator that achieves the CRLB with equality
  • If it exists, then the unique solution is given by
    $$\frac{\partial \ln f_{x;\theta}(x; \theta)}{\partial \theta} = 0$$
    where the pdf is evaluated at the observed outcome $x(\zeta)$
  • Maximum Likelihood (ML) estimate: an estimator that satisfies the equation above
  • This can be generalized to vectors of parameters
  • Limited use: $f_{x;\theta}(x; \theta)$ is rarely known in practice
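As a quick numerical illustration of the ML condition (a sketch assuming NumPy/SciPy, i.i.d. Gaussian data, and a known, hypothetical σ = 2), maximizing the log-likelihood over the mean lands on the sample mean, i.e. the point where ∂ ln f/∂θ = 0:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
sigma = 2.0                                    # assumed known standard deviation
x = rng.normal(loc=4.0, scale=sigma, size=50)  # hypothetical Gaussian observations

def neg_log_likelihood(theta):
    """-ln f(x; theta) for i.i.d. Gaussian data with known sigma."""
    return 0.5 * np.sum((x - theta) ** 2) / sigma**2 + len(x) * np.log(sigma * np.sqrt(2 * np.pi))

result = minimize_scalar(neg_log_likelihood)   # maximize ln f by minimizing its negative
print("ML estimate :", result.x)
print("sample mean :", x.mean())               # agrees up to solver tolerance
```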

Mean Square Error

The mean square error of an estimator $\hat{\theta}$ of a parameter $\theta$ is defined as

$$\mathrm{MSE}(\theta) \triangleq E\left[|\hat{\theta} - \theta|^2\right] = \sigma_{\hat{\theta}}^2 + |B(\hat{\theta})|^2$$

  • We will often use the MSE as a global measure of estimator performance
  • Note that two different estimators may have the same MSE but different bias and variance
  • This criterion is convenient for building estimators
  • It creates a problem we can solve
  • Note that the rationale is one of convenience:
    – Picking the MSE results in a simple bias/variance decomposition
    – Other error measures generally do not have such a decomposition
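A short Monte Carlo check of the decomposition MSE = σ²_θ̂ + |B(θ̂)|², assuming NumPy and a deliberately biased, hypothetical shrinkage estimator θ̂ = 0.8 μ̂_x:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, N, trials = 3.0, 20, 100_000           # hypothetical true mean, sample size, MC runs

# A deliberately biased estimator: shrink the sample mean toward zero.
estimates = np.array([0.8 * rng.normal(theta, 1.0, N).mean() for _ in range(trials)])

mse = np.mean((estimates - theta) ** 2)        # direct Monte Carlo estimate of the MSE
bias = estimates.mean() - theta                # B(theta_hat) = E[theta_hat] - theta
var = estimates.var()                          # variance of the estimator
print(f"MSE = {mse:.4f}   var + bias^2 = {var + bias**2:.4f}")   # the two agree
```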


Consistency

  • Consistent estimator: an estimator such that
    $$\lim_{N \to \infty} \mathrm{MSE}(\hat{\theta}) = 0$$
  • This implies the following as the sample size grows ($N \to \infty$)
    – The estimator becomes unbiased
    – The variance approaches zero
    – The distribution $f_{\hat{\theta}}(x)$ becomes an impulse centered at $\theta$
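A brief numerical sketch of consistency for the sample mean, assuming NumPy and i.i.d. data: its Monte Carlo MSE shrinks toward zero as N grows (roughly like 1/N here).

```python
import numpy as np

rng = np.random.default_rng(5)
mu, trials = 1.0, 2000                        # hypothetical true mean, MC runs per N

for N in [10, 100, 1000, 10_000]:
    estimates = rng.normal(mu, 1.0, size=(trials, N)).mean(axis=1)
    mse = np.mean((estimates - mu) ** 2)
    print(f"N = {N:6d}   MSE(mu_hat) = {mse:.6f}")   # decreases roughly as 1/N
```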


Cramér-Rao Lower Bound

$$\operatorname{var}(\hat{\theta}) \ge \frac{1}{E\left[\left(\frac{\partial \ln f_{x;\theta}(x;\theta)}{\partial \theta}\right)^2\right]} = \frac{-1}{E\left[\frac{\partial^2 \ln f_{x;\theta}(x;\theta)}{\partial \theta^2}\right]}$$

  • Minimum Variance Unbiased (MVU): estimators that are both unbiased and have the smallest variance of all possible unbiased estimators
  • Note that these do not necessarily achieve the minimum MSE
  • The Cramér-Rao Lower Bound (CRLB) shown above is a lower bound on the variance of unbiased estimators
  • Log-likelihood function of $\theta$: $\ln f_{x;\theta}(x; \theta)$
  • Note that the pdf $f_{x;\theta}(x; \theta)$ describes the distribution of the data (stochastic process), not the parameter
  • $\theta$ is not a random variable; it is a parameter that defines the distribution
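For i.i.d. Gaussian data with known variance, the bound evaluates to var(θ̂) ≥ σ²/N for any unbiased estimator of the mean. A short simulation sketch (assuming NumPy, with hypothetical σ = 2 and N = 25) shows the sample mean sitting at this bound:

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, N, trials = 0.0, 2.0, 25, 200_000   # hypothetical parameters

# For i.i.d. Gaussian data with known sigma, E[(d ln f / d mu)^2] = N / sigma^2,
# so the CRLB on any unbiased estimator of the mean is sigma^2 / N.
crlb = sigma**2 / N

estimates = rng.normal(mu, sigma, size=(trials, N)).mean(axis=1)
print("CRLB             :", crlb)              # 0.16
print("var(sample mean) :", estimates.var())   # approximately 0.16 -> the bound is attained
```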



Sample Mean Confidence Intervals

$$f_{\hat{\mu}_x}(\hat{\mu}_x) = \frac{1}{\sqrt{2\pi}\,(\sigma_x/\sqrt{N})} \exp\left[-\frac{1}{2}\left(\frac{\hat{\mu}_x - \mu_x}{\sigma_x/\sqrt{N}}\right)^2\right]$$

$$\Pr\left\{\mu_x - k\frac{\sigma_x}{\sqrt{N}} < \hat{\mu}_x < \mu_x + k\frac{\sigma_x}{\sqrt{N}}\right\} = \Pr\left\{\hat{\mu}_x - k\frac{\sigma_x}{\sqrt{N}} < \mu_x < \hat{\mu}_x + k\frac{\sigma_x}{\sqrt{N}}\right\} = 1 - \alpha$$

  • In general, we don’t know the pdf
  • If we can assume the process is Gaussian and IID, we know the pdf (sampling distribution) of the estimator
  • If N is large and the distribution doesn’t have heavy tails, the distribution of $\hat{\mu}_x$ is Gaussian by the Central Limit Theorem (CLT)
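A sketch of building a two-sided interval from this sampling distribution, assuming NumPy/SciPy, a known σ_x, and hypothetical data: k is the standard normal quantile that makes the coverage 1 − α.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
sigma_x, N, alpha = 2.0, 100, 0.05              # known std. dev., sample size, 1 - confidence level
x = rng.normal(10.0, sigma_x, size=N)           # hypothetical observations with true mean 10

mu_hat = x.mean()
k = norm.ppf(1 - alpha / 2)                     # k such that Pr{|z| < k} = 1 - alpha (about 1.96)
half_width = k * sigma_x / np.sqrt(N)
print(f"{100 * (1 - alpha):.0f}% CI for mu_x: ({mu_hat - half_width:.3f}, {mu_hat + half_width:.3f})")
```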


Confidence Intervals

  • Confidence interval: an interval, $a \le \theta \le b$, that has a specified probability of covering the unknown true parameter value,
    $$\Pr\{a < \theta \le b\} = 1 - \alpha$$
  • The interval is estimated from the data; therefore it is also a pair of random variables
  • Confidence level: the coverage probability of a confidence interval, $1 - \alpha$
  • The confidence interval is not uniquely defined by the confidence level


Sample Mean Confidence Intervals Comments

$$\Pr\left\{\hat{\mu}_x - k\frac{\sigma_x}{\sqrt{N}} < \mu_x < \hat{\mu}_x + k\frac{\sigma_x}{\sqrt{N}}\right\} = 1 - \alpha$$

  • In many cases the confidence intervals are accurate, even if they are only approximate
  • We can choose k such that $1 - \alpha$ equals any probability we like
  • In general, the user picks $\alpha$
  • This controls how often the confidence interval does not cover $\mu_x$
  • 95% and 99% are common choices
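A small coverage check of these statements, assuming NumPy/SciPy and hypothetical Gaussian data: over many simulated data sets, the interval fails to cover the true mean in roughly a fraction α of the trials.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
mu_x, sigma_x, N, alpha, trials = 5.0, 1.0, 50, 0.05, 20_000   # hypothetical settings
k = norm.ppf(1 - alpha / 2)
half_width = k * sigma_x / np.sqrt(N)

misses = 0
for _ in range(trials):
    x = rng.normal(mu_x, sigma_x, size=N)
    if abs(x.mean() - mu_x) >= half_width:      # interval fails to cover mu_x
        misses += 1

print("empirical miss rate:", misses / trials)   # close to alpha = 0.05
```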

Properties of the Sample Mean

$$\hat{\mu}_x \triangleq \frac{1}{N}\sum_{n=0}^{N-1} x(n) \qquad E[\hat{\mu}_x] = \mu_x \qquad \operatorname{var}(\hat{\mu}_x) = \frac{\sigma_x^2}{N}\sum_{\ell=-N}^{N}\left(1 - \frac{|\ell|}{N}\right)\rho_x(\ell)$$

where $\rho_x(\ell) = \gamma_x(\ell)/\sigma_x^2$ is the normalized autocovariance.

  • The estimator is unbiased
  • It can also be shown that the sample mean
    – Has minimum variance (among unbiased estimators)
    – If the process is Gaussian, is the maximum likelihood estimator
    – If the process is Gaussian, attains the Cramér-Rao Lower Bound



Sample Mean Variance when Gaussian and IID

$$\Pr\left\{\hat{\mu}_x - k\frac{\sigma_x}{\sqrt{N}} < \mu_x < \hat{\mu}_x + k\frac{\sigma_x}{\sqrt{N}}\right\} = 1 - \alpha$$

  • If $\sigma_x$ is unknown (as it usually is), it must be estimated from the data:
    $$\hat{\sigma}_x^2 = \frac{1}{N-1}\sum_{n=0}^{N-1}\left[x(n) - \hat{\mu}_x\right]^2$$
  • The corresponding z-score then has a different distribution
  • If x(n) is IID and Gaussian,
    $$\frac{\hat{\mu}_x - \mu_x}{\hat{\sigma}_x/\sqrt{N}}$$
    has a Student’s t distribution with $v = N - 1$ degrees of freedom
  • This approaches a Gaussian distribution as v becomes large (> 20)
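A sketch of the resulting interval when σ_x is replaced by its estimate, assuming NumPy/SciPy and hypothetical data: the quantile now comes from the Student's t distribution with N − 1 degrees of freedom.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(9)
N, alpha = 15, 0.05
x = rng.normal(3.0, 2.0, size=N)               # hypothetical Gaussian observations

mu_hat = x.mean()
sigma_hat = x.std(ddof=1)                      # square root of the 1/(N-1) variance estimate
k = t.ppf(1 - alpha / 2, df=N - 1)             # t quantile; wider than the Gaussian value
half_width = k * sigma_hat / np.sqrt(N)
print(f"{100 * (1 - alpha):.0f}% CI for mu_x: ({mu_hat - half_width:.3f}, {mu_hat + half_width:.3f})")
```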

Sample Mean Variance when Gaussian

$$E[\hat{\mu}_x] = \mu_x \qquad \operatorname{var}(\hat{\mu}_x) = \frac{1}{N}\sum_{\ell=-N}^{N}\left(1 - \frac{|\ell|}{N}\right)\gamma_x(\ell)$$

  • If x(n) is Gaussian but not IID, the sample mean is normal with mean $\mu_x$
  • The approximate confidence interval is given by a Gaussian pdf:
    $$\Pr\left\{\hat{\mu}_x - k\sqrt{\operatorname{var}(\hat{\mu}_x)} < \mu_x < \hat{\mu}_x + k\sqrt{\operatorname{var}(\hat{\mu}_x)}\right\} = 1 - \alpha$$
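A sketch of evaluating this variance for a correlated process, assuming NumPy and a hypothetical exponentially decaying autocovariance γ_x(ℓ) = 0.9^|ℓ| (so σ_x² = γ_x(0) = 1): positive correlation makes var(μ̂_x) much larger than the IID value σ_x²/N.

```python
import numpy as np

N = 100
ell = np.arange(-N, N + 1)
gamma = 0.9 ** np.abs(ell)        # hypothetical autocovariance sequence, gamma_x(0) = 1

# var(mu_hat) = (1/N) * sum over ell of (1 - |ell|/N) * gamma_x(ell)
var_mu_hat = np.sum((1 - np.abs(ell) / N) * gamma) / N

print("var(mu_hat) with correlation :", var_mu_hat)       # roughly 0.17
print("IID value sigma_x^2 / N      :", gamma[N] / N)     # gamma_x(0)/N = 0.01
```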