

SLIDE 1

Data Analysis and Uncertainty Part 2: Estimation

Instructor: Sargur N. Srihari

University at Buffalo, The State University of New York

srihari@cedar.buffalo.edu


SLIDE 2

Topics in Estimation

  • 1. Estimation
  • 2. Desirable Properties of Estimators
  • 3. Maximum Likelihood Estimation
      – Examples: Binomial, Normal
  • 4. Bayesian Estimation
      – Examples: Binomial, Normal
      – Jeffreys Prior

slide-3
SLIDE 3

Estimation

  • In inference we want to make statements about the entire population from which the sample is drawn
  • The two most important methods for estimating the parameters of a model:
  • 1. Maximum Likelihood Estimation
  • 2. Bayesian Estimation

SLIDE 4

Desirable Properties of Estimators

  • Let θ̂ be an estimate of parameter θ
  • Two measures of estimator quality:
  • 1. Expected Value of Estimate (Bias)
      – Difference between expected and true value
      – Measures systematic departure from true value
  • 2. Variance of Estimate
      – Data-driven component of error in the estimation procedure
      – E.g., an estimator that always says θ̂ = 1 has a variance of zero but high bias
  • Mean Squared Error can be partitioned as the sum of bias² and variance

$$\mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$

$$\mathrm{Var}(\hat{\theta}) = E\left[\left(\hat{\theta} - E[\hat{\theta}]\right)^2\right]$$

$$\mathrm{MSE} = E\left[(\hat{\theta} - \theta)^2\right]$$

(Each expectation is over all possible data sets of size n.)

SLIDE 5

Bias-Variance in Point Estimate

True height of the Chinese emperor: 200 cm, about 6ʼ6”. Poll a random American: ask “How tall is the emperor?” We want to determine how wrong the answers are, on average. Squared error = square of bias error + variance, so as variance increases, error increases.

  • Scenario 1
      – Everyone believes it is 180 (variance = 0)
      – The answer is always 180
      – The error is always -20
      – Average squared error is 400
      – Average bias error is -20
      – 400 = 400 + 0
  • Scenario 2
      – Normally distributed beliefs with mean 180 and std dev 10 (variance = 100)
      – Poll two: one says 190, the other 170
      – Errors are -10 and -30
      – Average bias error is -20
      – Squared errors: 100 and 900
      – Average squared error: 500
      – 500 = 400 + 100

  • Scenario 3
      – Normally distributed beliefs with mean 180 and std dev 20 (variance = 400)
      – Poll two: one says 200, the other 160
      – Errors: 0 and -40
      – Average error is -20
      – Squared errors: 0 and 1600
      – Average squared error: 800
      – 800 = 400 + 400

[Figure: three distributions of answers centered at 180 against the true value 200 — bias with no variance, bias with some variance, bias with more variance.]

Each scenario has expected value 180 (bias error = -20), but increasing variance in the estimate.
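
The scenario arithmetic is easy to check by simulation. Below is a minimal Python sketch (mine, not from the slides) that draws many poll answers under the Scenario 3 assumptions and confirms that average squared error ≈ bias² + variance:

```python
import numpy as np

rng = np.random.default_rng(0)
true_height = 200.0

# Scenario 3: beliefs are N(180, 20^2), i.e. bias -20, variance 400;
# use scale=10.0 for Scenario 2 and scale=0.0 for Scenario 1
answers = rng.normal(loc=180.0, scale=20.0, size=1_000_000)

errors = answers - true_height
mse = np.mean(errors ** 2)          # average squared error, ~800
bias = np.mean(errors)              # systematic departure, ~ -20
variance = np.var(answers)          # spread of the answers, ~400

print(mse, bias ** 2 + variance)    # the two quantities agree
```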

SLIDE 6

Mean Squared Error as a Criterion

  • MSE has a natural decomposition as the sum of squared bias and variance (derived below)
  • Mean squared error (over data sets) is a useful criterion since it incorporates both bias and variance

$$\begin{aligned} E[(\hat{\theta} - \theta)^2] &= E\left[\left(\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta\right)^2\right] \\ &= \left(E[\hat{\theta}] - \theta\right)^2 + E\left[\left(\hat{\theta} - E[\hat{\theta}]\right)^2\right] \\ &= \left(\mathrm{Bias}(\hat{\theta})\right)^2 + \mathrm{Var}(\hat{\theta}) \end{aligned}$$

(The cross term vanishes because $E[\hat{\theta} - E[\hat{\theta}]] = 0$.)

SLIDE 7

Maximum Likelihood Estimation

  • Most widely used method for parameter estimation
  • The Likelihood Function is the probability that the data D would have arisen for a given value of θ
  • It is a scalar function of θ
  • The value of θ for which the data has the highest probability is the MLE

$$L(\theta \mid D) = L(\theta \mid x(1),\ldots,x(n)) = p(x(1),\ldots,x(n) \mid \theta) = \prod_{i=1}^{n} p(x(i) \mid \theta)$$

(the product form assumes the observations are independent given θ)

SLIDE 8

Example of MLE for Binomial

  • Customers either purchase or do not purchase milk
  • We want an estimate of the proportion purchasing
  • Binomial with unknown parameter θ
  • Binomial is a generalization of Bernoulli
      – Bernoulli: probability of a binary variable x = 1 or 0
      – Denoting p(x=1) = θ, the probability mass function is $\mathrm{Bern}(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}$
      – Mean = θ, variance = θ(1-θ)
  • Binomial: probability of r successes in n trials
      – Mean = nθ, variance = nθ(1-θ)

$$\mathrm{Bin}(r \mid n, \theta) = \binom{n}{r}\, \theta^{r} (1-\theta)^{n-r}$$

SLIDE 9

Likelihood Function: Binomial/Bernoulli

  • Samples x(1),..., x(1000), of which r purchase milk
  • Assuming conditional independence, the likelihood function is

$$L(\theta \mid x(1),\ldots,x(1000)) = \prod_{i} \theta^{x(i)} (1-\theta)^{1-x(i)} = \theta^{r} (1-\theta)^{1000-r}$$

  • The binomial pmf includes every possible way of getting r successes, so it carries an extra factor of nCr; this factor does not depend on θ and does not affect the maximization
  • Log-likelihood function:

$$l(\theta) = \log L(\theta) = r \log\theta + (1000 - r)\log(1-\theta)$$

  • Differentiating and setting equal to zero:

$$\hat{\theta}_{ML} = r / 1000$$
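
As a quick numeric check of this derivation, here is a short Python sketch (an illustration of mine, with assumed data): it simulates 1000 purchase decisions and verifies that the closed-form MLE r/n matches a numeric maximization of the log-likelihood.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
n, true_theta = 1000, 0.7
x = rng.binomial(1, true_theta, size=n)   # 1 = customer purchased milk
r = x.sum()

def neg_log_lik(theta):
    # -l(theta) = -(r log(theta) + (n - r) log(1 - theta))
    return -(r * np.log(theta) + (n - r) * np.log(1 - theta))

numeric = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6),
                          method="bounded").x
print(r / n, numeric)                     # closed form and numeric agree
```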

SLIDE 10

Binomial: Likelihood Functions

[Figure: likelihood functions for three binomial data sets — r milk purchases out of n customers, where θ is the probability that milk is purchased by a random customer: r=7, n=10; r=70, n=100; r=700, n=1000.]

Uncertainty becomes smaller as n increases.
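
The figure can be regenerated in a few lines of matplotlib; this is my reconstruction (normalizing each curve to peak 1 so the three share one axis), not the original plotting code:

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0.01, 0.99, 500)
for r, n in [(7, 10), (70, 100), (700, 1000)]:
    # Work in log space to avoid underflow for large n, then normalize
    log_lik = r * np.log(theta) + (n - r) * np.log(1 - theta)
    plt.plot(theta, np.exp(log_lik - log_lik.max()), label=f"r={r}, n={n}")

plt.xlabel("theta")
plt.ylabel("likelihood (normalized to peak 1)")
plt.legend()
plt.show()
```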

SLIDE 11

Likelihood under Normal Distribution

  • Unit variance, unknown mean θ
  • Likelihood function (first equation below)
  • Log-likelihood function (second equation below)
  • To find the MLE, set the derivative d/dθ to zero

$$L(\theta \mid x(1),\ldots,x(n)) = \prod_{i=1}^{n} (2\pi)^{-1/2} \exp\left(-\tfrac{1}{2}\left(x(i) - \theta\right)^2\right) = (2\pi)^{-n/2} \exp\left(-\tfrac{1}{2}\sum_{i=1}^{n}\left(x(i) - \theta\right)^2\right)$$

$$l(\theta \mid x(1),\ldots,x(n)) = -\frac{n}{2}\log 2\pi - \frac{1}{2}\sum_{i=1}^{n}\left(x(i) - \theta\right)^2$$

$$\hat{\theta}_{ML} = \frac{1}{n}\sum_{i} x(i)$$
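
A quick check in Python (my sketch): the numeric maximizer of this log-likelihood coincides with the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=200)   # unit variance, unknown mean (true value 0)

# Negative log-likelihood, dropping the constant -(n/2) log 2*pi
nll = lambda theta: 0.5 * np.sum((x - theta) ** 2)

print(minimize_scalar(nll).x, x.mean())   # agree up to solver tolerance
```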

SLIDE 12

Normal: Histogram, Likelihood, Log-Likelihood

[Figure: estimating the unknown mean θ — histogram of 20 data points drawn from a zero-mean, unit-variance normal, together with the resulting likelihood and log-likelihood functions.]

SLIDE 13

Normal: More data points

[Figure: histogram of 200 data points drawn from a zero-mean, unit-variance normal, together with the likelihood and log-likelihood functions, now much more sharply peaked.]

SLIDE 14

Sufficient Statistics

  • A useful general concept in statistical estimation
  • A quantity s(D) is a sufficient statistic for θ if the likelihood l(θ) depends on the data only through s(D)
  • Examples
      – For the binomial parameter θ, the number of successes r is sufficient
      – For the mean of a normal distribution, the sum of the observations Σ x(i) is sufficient for the likelihood function of the mean

SLIDE 15

Interval Estimate

  • A point estimate does not convey the uncertainty associated with it
  • Interval estimates provide a confidence interval
  • Example:
      – 100 observations from N(µ, σ²) with µ unknown and σ² known
      – We want a 95% confidence interval for the estimate of µ
      – The distribution of the sample mean is N(µ, σ²/100)
      – 95% of a normal distribution lies within 1.96 standard deviations of the mean

$$P\left(\mu - 1.96\,\sigma/10 < \bar{x} < \mu + 1.96\,\sigma/10\right) = 0.95$$

Rewritten as

$$P\left(\bar{x} - 1.96\,\sigma/10 < \mu < \bar{x} + 1.96\,\sigma/10\right) = 0.95$$

so $l(x) = \bar{x} - 1.96\,\sigma/10$ and $u(x) = \bar{x} + 1.96\,\sigma/10$ define a 95% confidence interval.
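
A numeric version of this example in Python (my sketch, with an assumed true µ of 5 and σ = 1):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n = 1.0, 100
x = rng.normal(5.0, sigma, size=n)        # true mu = 5.0; sigma is known

xbar = x.mean()
half_width = 1.96 * sigma / np.sqrt(n)    # here sqrt(n) = 10
print(xbar - half_width, xbar + half_width)   # 95% confidence interval
```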

SLIDE 16

Bayesian Approach

  • Frequentist Approach
      – Parameters are fixed but unknown
      – Data is a random sample
      – Intrinsic variability lies in the data D = {x(1),..., x(n)}
  • Bayesian Statistics
      – Data are known
      – Parameters θ are random variables
      – θ has a distribution of values
      – p(θ) reflects degree of belief in where the true parameters θ may be

SLIDE 17

Bayesian Estimation

  • The distribution of probabilities for θ is the prior p(θ)
  • Analysis of the data leads to a modified distribution, called the posterior p(θ|D)
  • The modification is done by Bayes rule:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \psi)\, p(\psi)\, d\psi}$$

  • Leads to a distribution rather than a single value
  • A single value is obtainable: the mean or the mode (the latter is known as the maximum a posteriori (MAP) method)
  • MAP and MLE of θ may well coincide
      – Since a flat prior prefers no single value
      – MLE can be viewed as a special case of the MAP procedure, which in turn is a restricted form of Bayesian estimation

SLIDE 18

Summary of Bayesian Approach

  • For a given data set D and a particular model (model = distributions for prior and likelihood):

$$p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$$

  • In words: the posterior distribution given D (the distribution conditioned on having observed the data) is proportional to the product of the prior p(θ) and the likelihood p(D|θ)
  • If we have a weak belief about the parameter before collecting data, choose a wide prior (e.g., a normal with large variance)
  • The larger the data set, the more dominant the likelihood becomes

SLIDE 19

Bayesian Binomial

  • A single binary variable X: we wish to estimate θ = p(X=1)
  • The prior for a parameter in [0,1] is the Beta distribution, where α > 0, β > 0 are the two parameters of this model:

$$p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}, \qquad \mathrm{Beta}(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{\alpha-1}(1-\theta)^{\beta-1}$$

$$E[\theta] = \frac{\alpha}{\alpha+\beta}, \qquad \mathrm{mode}[\theta] = \frac{\alpha-1}{\alpha+\beta-2}, \qquad \mathrm{var}[\theta] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$

  • Likelihood (same as for MLE): $L(\theta \mid D) = \theta^{r}(1-\theta)^{n-r}$
  • Combining likelihood and prior, we get another Beta distribution:

$$p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta) = \theta^{r}(1-\theta)^{n-r}\, \theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{r+\alpha-1}(1-\theta)^{n-r+\beta-1}$$

  • With parameters r+α and n-r+β, and mean

$$E[\theta \mid D] = \frac{r+\alpha}{n+\alpha+\beta}$$

  • If α = β = 0 we recover the standard MLE of r/n
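
A minimal sketch of this conjugate update in Python using scipy (the prior parameters and data are assumed values of mine):

```python
from scipy.stats import beta

alpha, beta0 = 2.0, 2.0          # prior Beta(alpha, beta0), assumed values
r, n = 70, 100                   # data: r successes in n trials

post = beta(alpha + r, beta0 + n - r)          # posterior is again a Beta
posterior_mean = (r + alpha) / (n + alpha + beta0)
map_estimate = (r + alpha - 1) / (n + alpha + beta0 - 2)

print(post.mean(), posterior_mean)   # agree: ~0.692
print(map_estimate, r / n)           # MAP is shrunk slightly toward the prior
```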

SLIDE 20

Advantages of Bayesian Approach

  • Retain full knowledge of all problem uncertainty
      – E.g., by calculating the full posterior distribution on θ
      – E.g., prediction of a new point x(n+1) not in the training set D is done by averaging over all possible θ:

$$p(x(n+1) \mid D) = \int p(x(n+1), \theta \mid D)\, d\theta = \int p(x(n+1) \mid \theta)\, p(\theta \mid D)\, d\theta$$

      (the second equality holds since x(n+1) is conditionally independent of the training data D given θ)

  • Can average over all possible models
      – Requires considerably more computation than maximum likelihood
  • Natural sequential updating of the distribution:

$$p(\theta \mid D_1, D_2) \propto p(D_2 \mid \theta)\, p(D_1 \mid \theta)\, p(\theta)$$
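
For the Beta posterior of the previous slide, the predictive integral for a new binary observation has a closed form; a numeric check (my sketch, reusing the assumed values above):

```python
from scipy.integrate import quad
from scipy.stats import beta

alpha, beta0, r, n = 2.0, 2.0, 70, 100
post = beta(alpha + r, beta0 + n - r)

# p(x(n+1)=1 | D) = integral over theta of theta * p(theta | D)
pred, _ = quad(lambda t: t * post.pdf(t), 0.0, 1.0)
print(pred, (r + alpha) / (n + alpha + beta0))   # both ~0.692
```
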
SLIDE 21

Predictive Distribution

  • In the equation that modifies the prior to the posterior:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \psi)\, p(\psi)\, d\psi}$$

  • The denominator p(D) is called the predictive distribution of D
  • It represents predictions about the value of D
  • It includes uncertainty about θ, via p(θ), and uncertainty about D when θ is known
  • Useful for model checking
      – If the observed data have only a small probability under the predictive distribution, the model is unlikely to be correct

SLIDE 22

Bayesian: Normal Distribution

  • Suppose x comes from a normal distribution with unknown mean θ and known variance α: x ~ N(θ, α)
  • Prior distribution for θ: θ ~ N(θ0, α0)
  • After some algebra, the normal prior yields a normal posterior:

$$p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta) = \frac{1}{\sqrt{2\pi\alpha}} \exp\left(-\frac{1}{2\alpha}\left(x - \theta\right)^2\right) \cdot \frac{1}{\sqrt{2\pi\alpha_0}} \exp\left(-\frac{1}{2\alpha_0}\left(\theta - \theta_0\right)^2\right)$$

$$p(\theta \mid x) = \frac{1}{\sqrt{2\pi\alpha_1}} \exp\left(-\frac{1}{2\alpha_1}\left(\theta - \theta_1\right)^2\right)$$

where $\alpha_1 = (\alpha_0^{-1} + \alpha^{-1})^{-1}$ (the posterior precision $\alpha_1^{-1}$ is the sum of the prior and data precisions) and $\theta_1 = \alpha_1(\theta_0/\alpha_0 + x/\alpha)$ is a precision-weighted sum of the prior mean and the datum.
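
A minimal sketch of this normal-normal update in Python (following the slide's notation, where α and α0 denote variances; the numbers are assumptions of mine):

```python
alpha = 1.0                  # known variance of x given theta
theta0, alpha0 = 0.0, 4.0    # prior N(theta0, alpha0)
x = 2.5                      # single observed datum

alpha1 = 1.0 / (1.0 / alpha0 + 1.0 / alpha)      # posterior variance
theta1 = alpha1 * (theta0 / alpha0 + x / alpha)  # posterior mean

print(theta1, alpha1)   # mean pulled from 0.0 toward the datum 2.5
```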

SLIDE 23

Improper Priors

  • If we have no idea of the normal mean, we could give it a uniform prior over the whole real line, but that would not be a density function
  • We can instead adopt an improper prior that is uniform over the regions where the parameter may occur

SLIDE 24

Jeffreys Prior

  • Results for one prior may differ from those for another
  • Solution: use a reference prior
  • Fisher Information
      – The negative of the expectation of the second derivative of the log-likelihood
      – Measures the curvature or flatness of the likelihood function
  • Jeffreys Prior
      – Consistent no matter how the parameter is transformed

$$I(\theta \mid x) = -E\left[\frac{\partial^2 \log L(\theta \mid x)}{\partial \theta^2}\right], \qquad p(\theta) \propto \sqrt{I(\theta \mid x)}$$
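
As a worked example (mine, not on the slide): for a single Bernoulli observation the Jeffreys prior comes out to a Beta(1/2, 1/2) distribution.

```latex
% Bernoulli log-likelihood for one observation x in {0,1}:
%   log L(theta | x) = x log(theta) + (1 - x) log(1 - theta)
\frac{\partial^2 \log L}{\partial \theta^2}
  = -\frac{x}{\theta^2} - \frac{1-x}{(1-\theta)^2}
\qquad\Rightarrow\qquad
I(\theta) = \frac{E[x]}{\theta^2} + \frac{1-E[x]}{(1-\theta)^2}
          = \frac{1}{\theta} + \frac{1}{1-\theta}
          = \frac{1}{\theta(1-\theta)}
```

Hence $p(\theta) \propto \sqrt{I(\theta)} = \theta^{-1/2}(1-\theta)^{-1/2}$, which is Beta(1/2, 1/2) up to normalization.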

SLIDE 25

Conjugate Priors

  • Beta prior to Beta posterior (binomial likelihood)
  • Normal prior to Normal posterior (normal likelihood)
  • The advantage of using conjugate families is that the updating process is replaced by a simple updating of the parameters

SLIDE 26

Credibility Interval

  • We can obtain a point estimate or an interval estimate from the posterior distribution
  • When there is a single parameter, an interval containing a given probability (say 90%) is a credibility interval
  • Its interpretation is more straightforward than that of frequentist confidence intervals
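
A sketch computing an equal-tailed 90% credibility interval from the Beta posterior of slide 19 (my assumed values again):

```python
from scipy.stats import beta

alpha, beta0, r, n = 2.0, 2.0, 70, 100
post = beta(alpha + r, beta0 + n - r)

lo, hi = post.ppf(0.05), post.ppf(0.95)   # equal-tailed 90% interval
print(lo, hi)
```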

SLIDE 27

Stochastic Estimation

  • Bayesian methods involve complicated joint distributions
  • Drawing random samples from the estimated distributions enables properties of the distributions of the parameters to be estimated
  • This approach is called Markov Chain Monte Carlo (MCMC)
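
As one concrete instance, here is a minimal random-walk Metropolis sampler (my sketch, not from the slides) targeting the Beta posterior from slide 19; its sample mean matches the closed-form posterior mean (r+α)/(n+α+β) ≈ 0.692.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta0, r, n = 2.0, 2.0, 70, 100

def log_post(theta):
    # Unnormalized log density of Beta(r + alpha, n - r + beta0)
    if not 0.0 < theta < 1.0:
        return -np.inf
    return (r + alpha - 1) * np.log(theta) + (n - r + beta0 - 1) * np.log(1 - theta)

samples, theta = [], 0.5
for _ in range(20_000):
    proposal = theta + rng.normal(0.0, 0.05)   # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal                        # accept; otherwise keep theta
    samples.append(theta)

print(np.mean(samples[5_000:]))   # ~0.692 after discarding burn-in
```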