

SLIDE 1

Data Analysis and Uncertainty Part 2: Estimation

Instructor: Sargur N. Srihari

University at Buffalo, The State University of New York

srihari@cedar.buffalo.edu


SLIDE 2

Topics in Estimation

  • 1. Estimation
  • 2. Desirable Properties of Estimators
  • 3. Maximum Likelihood Estimation
      – Examples: Binomial, Normal
  • 4. Bayesian Estimation
      – Examples: Binomial, Normal
      – Jeffreys Prior

slide-3
SLIDE 3

Estimation

  • In inference we want to make statements about the entire population from which the sample is drawn
  • The two most important methods for estimating the parameters of a model:
  • 1. Maximum Likelihood Estimation
  • 2. Bayesian Estimation

SLIDE 4

Desirable Properties of Estimators

  • Let θ̂ be an estimate of parameter θ
  • Two measures of estimator quality:
  • 1. Expected Value of Estimate (Bias)
      – Difference between expected and true value
      – Measures systematic departure from true value
  • 2. Variance of Estimate
      – Data-driven component of error in the estimation procedure
      – E.g., an estimator that always says θ̂ = 1 has a variance of zero but high bias
  • Mean Squared Error can be partitioned as the sum of bias² and variance

$$\mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$

$$\mathrm{Var}(\hat{\theta}) = E\left[\left(\hat{\theta} - E[\hat{\theta}]\right)^2\right]$$

$$\mathrm{MSE} = E\left[(\hat{\theta} - \theta)^2\right]$$

(Each expectation is over all possible data sets of size n.)

SLIDE 5

Bias-Variance in Point Estimate

True height of the Chinese emperor: 200 cm, about 6ʼ6”. Poll a random American: ask “How tall is the emperor?” We want to determine how wrong the answers are, on average. Squared error = square of bias error + variance, so as variance increases, error increases.

  • Scenario 1
      – Everyone believes it is 180 (variance = 0)
      – The answer is always 180
      – The error is always -20
      – Average squared error is 400
      – Average bias error is -20
      – 400 = 400 + 0
  • Scenario 2
      – Normally distributed beliefs with mean 180 and std dev 10 (variance = 100)
      – Poll two: one says 190, the other 170
      – Errors are -10 and -30
      – Average bias error is -20
      – Squared errors: 100 and 900
      – Average squared error: 500
      – 500 = 400 + 100

  • Scenario 3
      – Normally distributed beliefs with mean 180 and std dev 20 (variance = 400)
      – Poll two: one says 200, the other 160
      – Errors: 0 and -40
      – Average error is -20
      – Squared errors: 0 and 1600
      – Average squared error: 800
      – 800 = 400 + 400

[Figure: three distributions of answers centered at 180 against the true value 200 — bias with no variance, bias with some variance, bias with more variance.]

Each scenario has expected value 180 (bias error = -20), but increasing variance in the estimate.
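
The scenario arithmetic is easy to check by simulation. Below is a minimal Python sketch (mine, not from the slides) that draws many poll answers under the Scenario 3 assumptions and confirms that average squared error ≈ bias² + variance:

```python
import numpy as np

rng = np.random.default_rng(0)
true_height = 200.0

# Scenario 3: beliefs are N(180, 20^2), i.e. bias -20, variance 400;
# use scale=10.0 for Scenario 2 and scale=0.0 for Scenario 1
answers = rng.normal(loc=180.0, scale=20.0, size=1_000_000)

errors = answers - true_height
mse = np.mean(errors ** 2)          # average squared error, ~800
bias = np.mean(errors)              # systematic departure, ~ -20
variance = np.var(answers)          # spread of the answers, ~400

print(mse, bias ** 2 + variance)    # the two quantities agree
```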

SLIDE 6

Mean Squared Error as a Criterion

  • MSE has a natural decomposition as the sum of squared bias and variance (derived below)
  • Mean squared error (over data sets) is a useful criterion since it incorporates both bias and variance

$$\begin{aligned} E[(\hat{\theta} - \theta)^2] &= E\left[\left(\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta\right)^2\right] \\ &= \left(E[\hat{\theta}] - \theta\right)^2 + E\left[\left(\hat{\theta} - E[\hat{\theta}]\right)^2\right] \\ &= \left(\mathrm{Bias}(\hat{\theta})\right)^2 + \mathrm{Var}(\hat{\theta}) \end{aligned}$$

(The cross term vanishes because $E[\hat{\theta} - E[\hat{\theta}]] = 0$.)

SLIDE 7

Maximum Likelihood Estimation

  • Most widely used method for parameter estimation
  • The Likelihood Function is the probability that the data D would have arisen for a given value of θ
  • It is a scalar function of θ
  • The value of θ for which the data has the highest probability is the MLE

$$L(\theta \mid D) = L(\theta \mid x(1),\ldots,x(n)) = p(x(1),\ldots,x(n) \mid \theta) = \prod_{i=1}^{n} p(x(i) \mid \theta)$$

(the product form assumes the observations are independent given θ)

SLIDE 8

Example of MLE for Binomial

  • Customers either purchase or do not purchase milk
  • We want an estimate of the proportion purchasing
  • Binomial with unknown parameter θ
  • Binomial is a generalization of Bernoulli
      – Bernoulli: probability of a binary variable x = 1 or 0
      – Denoting p(x=1) = θ, the probability mass function is $\mathrm{Bern}(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}$
      – Mean = θ, variance = θ(1-θ)
  • Binomial: probability of r successes in n trials
      – Mean = nθ, variance = nθ(1-θ)

$$\mathrm{Bin}(r \mid n, \theta) = \binom{n}{r}\, \theta^{r} (1-\theta)^{n-r}$$

SLIDE 9

Likelihood Function: Binomial/Bernoulli

  • Samples x(1),..., x(1000), of which r purchase milk
  • Assuming conditional independence, the likelihood function is

$$L(\theta \mid x(1),\ldots,x(1000)) = \prod_{i} \theta^{x(i)} (1-\theta)^{1-x(i)} = \theta^{r} (1-\theta)^{1000-r}$$

  • The binomial pmf includes every possible way of getting r successes, so it carries an extra factor of nCr; this factor does not depend on θ and does not affect the maximization
  • Log-likelihood function:

$$l(\theta) = \log L(\theta) = r \log\theta + (1000 - r)\log(1-\theta)$$

  • Differentiating and setting equal to zero:

$$\hat{\theta}_{ML} = r / 1000$$
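
As a quick numeric check of this derivation, here is a short Python sketch (an illustration of mine, with assumed data): it simulates 1000 purchase decisions and verifies that the closed-form MLE r/n matches a numeric maximization of the log-likelihood.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
n, true_theta = 1000, 0.7
x = rng.binomial(1, true_theta, size=n)   # 1 = customer purchased milk
r = x.sum()

def neg_log_lik(theta):
    # -l(theta) = -(r log(theta) + (n - r) log(1 - theta))
    return -(r * np.log(theta) + (n - r) * np.log(1 - theta))

numeric = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6),
                          method="bounded").x
print(r / n, numeric)                     # closed form and numeric agree
```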

SLIDE 10

Binomial: Likelihood Functions

[Figure: likelihood functions for three binomial data sets — r milk purchases out of n customers, where θ is the probability that milk is purchased by a random customer: r=7, n=10; r=70, n=100; r=700, n=1000.]

Uncertainty becomes smaller as n increases.
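
The figure can be regenerated in a few lines of matplotlib; this is my reconstruction (normalizing each curve to peak 1 so the three share one axis), not the original plotting code:

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0.01, 0.99, 500)
for r, n in [(7, 10), (70, 100), (700, 1000)]:
    # Work in log space to avoid underflow for large n, then normalize
    log_lik = r * np.log(theta) + (n - r) * np.log(1 - theta)
    plt.plot(theta, np.exp(log_lik - log_lik.max()), label=f"r={r}, n={n}")

plt.xlabel("theta")
plt.ylabel("likelihood (normalized to peak 1)")
plt.legend()
plt.show()
```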

SLIDE 11

Likelihood under Normal Distribution

  • Unit variance, unknown mean θ
  • Likelihood function (first equation below)
  • Log-likelihood function (second equation below)
  • To find the MLE, set the derivative d/dθ to zero

$$L(\theta \mid x(1),\ldots,x(n)) = \prod_{i=1}^{n} (2\pi)^{-1/2} \exp\left(-\tfrac{1}{2}\left(x(i) - \theta\right)^2\right) = (2\pi)^{-n/2} \exp\left(-\tfrac{1}{2}\sum_{i=1}^{n}\left(x(i) - \theta\right)^2\right)$$

$$l(\theta \mid x(1),\ldots,x(n)) = -\frac{n}{2}\log 2\pi - \frac{1}{2}\sum_{i=1}^{n}\left(x(i) - \theta\right)^2$$

$$\hat{\theta}_{ML} = \frac{1}{n}\sum_{i} x(i)$$
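
A quick check in Python (my sketch): the numeric maximizer of this log-likelihood coincides with the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=200)   # unit variance, unknown mean (true value 0)

# Negative log-likelihood, dropping the constant -(n/2) log 2*pi
nll = lambda theta: 0.5 * np.sum((x - theta) ** 2)

print(minimize_scalar(nll).x, x.mean())   # agree up to solver tolerance
```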

SLIDE 12

Normal: Histogram, Likelihood, Log-Likelihood

[Figure: estimating the unknown mean θ — histogram of 20 data points drawn from a zero-mean, unit-variance normal, together with the resulting likelihood and log-likelihood functions.]

SLIDE 13

Normal: More data points

[Figure: histogram of 200 data points drawn from a zero-mean, unit-variance normal, together with the likelihood and log-likelihood functions, now much more sharply peaked.]

SLIDE 14

Sufficient Statistics

  • A useful general concept in statistical estimation
  • A quantity s(D) is a sufficient statistic for θ if the likelihood l(θ) depends on the data only through s(D)
  • Examples
      – For the binomial parameter θ, the number of successes r is sufficient
      – For the mean of a normal distribution, the sum of the observations Σ x(i) is sufficient for the likelihood function of the mean

SLIDE 15

Interval Estimate

  • A point estimate does not convey the uncertainty associated with it
  • Interval estimates provide a confidence interval
  • Example:
      – 100 observations from N(µ, σ²) with µ unknown and σ² known
      – We want a 95% confidence interval for the estimate of µ
      – The distribution of the sample mean is N(µ, σ²/100)
      – 95% of a normal distribution lies within 1.96 standard deviations of the mean

$$P\left(\mu - 1.96\,\sigma/10 < \bar{x} < \mu + 1.96\,\sigma/10\right) = 0.95$$

Rewritten as

$$P\left(\bar{x} - 1.96\,\sigma/10 < \mu < \bar{x} + 1.96\,\sigma/10\right) = 0.95$$

so $l(x) = \bar{x} - 1.96\,\sigma/10$ and $u(x) = \bar{x} + 1.96\,\sigma/10$ define a 95% confidence interval.
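
A numeric version of this example in Python (my sketch, with an assumed true µ of 5 and σ = 1):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n = 1.0, 100
x = rng.normal(5.0, sigma, size=n)        # true mu = 5.0; sigma is known

xbar = x.mean()
half_width = 1.96 * sigma / np.sqrt(n)    # here sqrt(n) = 10
print(xbar - half_width, xbar + half_width)   # 95% confidence interval
```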

SLIDE 16

Bayesian Approach

  • Frequentist Approach
      – Parameters are fixed but unknown
      – Data is a random sample
      – Intrinsic variability lies in the data D = {x(1),..., x(n)}
  • Bayesian Statistics
      – Data are known
      – Parameters θ are random variables
      – θ has a distribution of values
      – p(θ) reflects degree of belief in where the true parameters θ may be

SLIDE 17

Bayesian Estimation

  • The distribution of probabilities for θ is the prior p(θ)
  • Analysis of the data leads to a modified distribution, called the posterior p(θ|D)
  • The modification is done by Bayes rule:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \psi)\, p(\psi)\, d\psi}$$

  • Leads to a distribution rather than a single value
  • A single value is obtainable: the mean or the mode (the latter is known as the maximum a posteriori (MAP) method)
  • MAP and MLE of θ may well coincide
      – Since a flat prior prefers no single value
      – MLE can be viewed as a special case of the MAP procedure, which in turn is a restricted form of Bayesian estimation

SLIDE 18

Summary of Bayesian Approach

  • For a given data set D and a particular model (model = distributions for prior and likelihood):

$$p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$$

  • In words: the posterior distribution given D (the distribution conditioned on having observed the data) is proportional to the product of the prior p(θ) and the likelihood p(D|θ)
  • If we have a weak belief about the parameter before collecting data, choose a wide prior (e.g., a normal with large variance)
  • The larger the data set, the more dominant the likelihood becomes

SLIDE 19

Bayesian Binomial

  • A single binary variable X: we wish to estimate θ = p(X=1)
  • The prior for a parameter in [0,1] is the Beta distribution, where α > 0, β > 0 are the two parameters of this model:

$$p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}, \qquad \mathrm{Beta}(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{\alpha-1}(1-\theta)^{\beta-1}$$

$$E[\theta] = \frac{\alpha}{\alpha+\beta}, \qquad \mathrm{mode}[\theta] = \frac{\alpha-1}{\alpha+\beta-2}, \qquad \mathrm{var}[\theta] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$

  • Likelihood (same as for MLE): $L(\theta \mid D) = \theta^{r}(1-\theta)^{n-r}$
  • Combining likelihood and prior, we get another Beta distribution:

$$p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta) = \theta^{r}(1-\theta)^{n-r}\, \theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{r+\alpha-1}(1-\theta)^{n-r+\beta-1}$$

  • With parameters r+α and n-r+β, and mean

$$E[\theta \mid D] = \frac{r+\alpha}{n+\alpha+\beta}$$

  • If α = β = 0 we recover the standard MLE of r/n
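
A minimal sketch of this conjugate update in Python using scipy (the prior parameters and data are assumed values of mine):

```python
from scipy.stats import beta

alpha, beta0 = 2.0, 2.0          # prior Beta(alpha, beta0), assumed values
r, n = 70, 100                   # data: r successes in n trials

post = beta(alpha + r, beta0 + n - r)          # posterior is again a Beta
posterior_mean = (r + alpha) / (n + alpha + beta0)
map_estimate = (r + alpha - 1) / (n + alpha + beta0 - 2)

print(post.mean(), posterior_mean)   # agree: ~0.692
print(map_estimate, r / n)           # MAP is shrunk slightly toward the prior
```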

SLIDE 20

Advantages of Bayesian Approach

  • Retain full knowledge of all problem uncertainty
      – E.g., by calculating the full posterior distribution on θ
      – E.g., prediction of a new point x(n+1) not in the training set D is done by averaging over all possible θ:

$$p(x(n+1) \mid D) = \int p(x(n+1), \theta \mid D)\, d\theta = \int p(x(n+1) \mid \theta)\, p(\theta \mid D)\, d\theta$$

      (the second equality holds since x(n+1) is conditionally independent of the training data D given θ)

  • Can average over all possible models
      – Requires considerably more computation than maximum likelihood
  • Natural sequential updating of the distribution:

$$p(\theta \mid D_1, D_2) \propto p(D_2 \mid \theta)\, p(D_1 \mid \theta)\, p(\theta)$$
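
For the Beta posterior of the previous slide, the predictive integral for a new binary observation has a closed form; a numeric check (my sketch, reusing the assumed values above):

```python
from scipy.integrate import quad
from scipy.stats import beta

alpha, beta0, r, n = 2.0, 2.0, 70, 100
post = beta(alpha + r, beta0 + n - r)

# p(x(n+1)=1 | D) = integral over theta of theta * p(theta | D)
pred, _ = quad(lambda t: t * post.pdf(t), 0.0, 1.0)
print(pred, (r + alpha) / (n + alpha + beta0))   # both ~0.692
```
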
SLIDE 21

Predictive Distribution

  • In the equation that modifies the prior to the posterior:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \psi)\, p(\psi)\, d\psi}$$

  • The denominator p(D) is called the predictive distribution of D
  • It represents predictions about the value of D
  • It includes uncertainty about θ, via p(θ), and uncertainty about D when θ is known
  • Useful for model checking
      – If the observed data have only a small probability under the predictive distribution, the model is unlikely to be correct

SLIDE 22

Bayesian: Normal Distribution

  • Suppose x comes from a normal distribution with unknown mean θ and known variance α: x ~ N(θ, α)
  • Prior distribution for θ: θ ~ N(θ0, α0)
  • After some algebra, the normal prior yields a normal posterior:

$$p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta) = \frac{1}{\sqrt{2\pi\alpha}} \exp\left(-\frac{1}{2\alpha}\left(x - \theta\right)^2\right) \cdot \frac{1}{\sqrt{2\pi\alpha_0}} \exp\left(-\frac{1}{2\alpha_0}\left(\theta - \theta_0\right)^2\right)$$

$$p(\theta \mid x) = \frac{1}{\sqrt{2\pi\alpha_1}} \exp\left(-\frac{1}{2\alpha_1}\left(\theta - \theta_1\right)^2\right)$$

where $\alpha_1 = (\alpha_0^{-1} + \alpha^{-1})^{-1}$ (the posterior precision $\alpha_1^{-1}$ is the sum of the prior and data precisions) and $\theta_1 = \alpha_1(\theta_0/\alpha_0 + x/\alpha)$ is a precision-weighted sum of the prior mean and the datum.
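
A minimal sketch of this normal-normal update in Python (following the slide's notation, where α and α0 denote variances; the numbers are assumptions of mine):

```python
alpha = 1.0                  # known variance of x given theta
theta0, alpha0 = 0.0, 4.0    # prior N(theta0, alpha0)
x = 2.5                      # single observed datum

alpha1 = 1.0 / (1.0 / alpha0 + 1.0 / alpha)      # posterior variance
theta1 = alpha1 * (theta0 / alpha0 + x / alpha)  # posterior mean

print(theta1, alpha1)   # mean pulled from 0.0 toward the datum 2.5
```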

SLIDE 23

Improper Priors

  • If we have no idea of the normal mean, we could give it a uniform prior over the whole real line, but that would not be a density function
  • We can instead adopt an improper prior that is uniform over the regions where the parameter may occur

SLIDE 24

Jeffreys Prior

  • Results for one prior may differ from those for another
  • Solution: use a reference prior
  • Fisher Information
      – The negative of the expectation of the second derivative of the log-likelihood
      – Measures the curvature or flatness of the likelihood function
  • Jeffreys Prior
      – Consistent no matter how the parameter is transformed

$$I(\theta \mid x) = -E\left[\frac{\partial^2 \log L(\theta \mid x)}{\partial \theta^2}\right], \qquad p(\theta) \propto \sqrt{I(\theta \mid x)}$$
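
As a worked example (mine, not on the slide): for a single Bernoulli observation the Jeffreys prior comes out to a Beta(1/2, 1/2) distribution.

```latex
% Bernoulli log-likelihood for one observation x in {0,1}:
%   log L(theta | x) = x log(theta) + (1 - x) log(1 - theta)
\frac{\partial^2 \log L}{\partial \theta^2}
  = -\frac{x}{\theta^2} - \frac{1-x}{(1-\theta)^2}
\qquad\Rightarrow\qquad
I(\theta) = \frac{E[x]}{\theta^2} + \frac{1-E[x]}{(1-\theta)^2}
          = \frac{1}{\theta} + \frac{1}{1-\theta}
          = \frac{1}{\theta(1-\theta)}
```

Hence $p(\theta) \propto \sqrt{I(\theta)} = \theta^{-1/2}(1-\theta)^{-1/2}$, which is Beta(1/2, 1/2) up to normalization.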

SLIDE 25

Conjugate Priors

  • Beta prior to Beta posterior (binomial likelihood)
  • Normal prior to Normal posterior (normal likelihood)
  • The advantage of using conjugate families is that the updating process is replaced by a simple updating of the parameters

SLIDE 26

Credibility Interval

  • We can obtain a point estimate or an interval estimate from the posterior distribution
  • When there is a single parameter, an interval containing a given probability (say 90%) is a credibility interval
  • Its interpretation is more straightforward than that of frequentist confidence intervals
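
A sketch computing an equal-tailed 90% credibility interval from the Beta posterior of slide 19 (my assumed values again):

```python
from scipy.stats import beta

alpha, beta0, r, n = 2.0, 2.0, 70, 100
post = beta(alpha + r, beta0 + n - r)

lo, hi = post.ppf(0.05), post.ppf(0.95)   # equal-tailed 90% interval
print(lo, hi)
```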

SLIDE 27

Stochastic Estimation

  • Bayesian methods involve complicated joint distributions
  • Drawing random samples from the estimated distributions enables properties of the distributions of the parameters to be estimated
  • This approach is called Markov Chain Monte Carlo (MCMC)
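
As one concrete instance, here is a minimal random-walk Metropolis sampler (my sketch, not from the slides) targeting the Beta posterior from slide 19; its sample mean matches the closed-form posterior mean (r+α)/(n+α+β) ≈ 0.692.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta0, r, n = 2.0, 2.0, 70, 100

def log_post(theta):
    # Unnormalized log density of Beta(r + alpha, n - r + beta0)
    if not 0.0 < theta < 1.0:
        return -np.inf
    return (r + alpha - 1) * np.log(theta) + (n - r + beta0 - 1) * np.log(1 - theta)

samples, theta = [], 0.5
for _ in range(20_000):
    proposal = theta + rng.normal(0.0, 0.05)   # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal                        # accept; otherwise keep theta
    samples.append(theta)

print(np.mean(samples[5_000:]))   # ~0.692 after discarding burn-in
```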