

SLIDE 1

Review of Estimation Theory

Berlin 2003

References:

  • 1. X. Huang et al., Spoken Language Processing, Chapter 3
SLIDE 2

Introduction

  • Estimation theory is the most important theory and method in statistical inference

  • Statistical inference
    – Data generated in accordance with some unknown probability distribution must be analyzed
    – Some type of inference about the unknown distribution must be made, e.g., about the characteristics (parameters) of the distribution generating the experimental data, such as the mean and variance

    The data are generated according to a distribution $g(\mathbf{x} \mid \Phi)$, where $\Phi$ denotes the parameters of the distribution:

    $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$ : the vector of random variables

    $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$ : the vector of sample values

    $\theta(\mathbf{X})$ : the estimator (a function of the random variables)

    $\theta(\mathbf{x})$ : the estimate (the same function evaluated at the sample values)
SLIDE 3

Introduction

  • Three common estimators (estimation methods)
    – Minimum mean square error estimator
      • Estimate the random variable itself
      • Function approximation, curve fitting, …
    – Maximum likelihood estimator
      • Estimate the parameters of the distribution of the random variables
    – Bayes’ estimator
      • Estimate the parameters of the distribution of the random variables

SLIDE 4

Minimum Mean Square Error Estimation and Least Square Error Estimation

  • There are two random variables $X$ and $Y$. When observing the value of $X$, we want to find a transform $\hat{Y} = g(X; \Phi)$ ($\Phi$: the parameter vector of the function $g$) to predict the value of $Y$
    – Minimum Mean Square Error (MMSE) Estimation
    – Least Square Error (LSE) Estimation
  • By the law of large numbers, when the joint probability is uniform or the number of samples approaches infinity, MMSE and LSE are equivalent

    $\hat{Y} = g(X; \Phi)$

    If the joint distribution $f_{X,Y}(X, Y)$ is known:
    $\Phi_{MMSE} = \arg\min_{\Phi} E\left[\left(Y - g(X; \Phi)\right)^2\right]$

    When $n$ sample pairs $(x_i, y_i)$ are observed:
    $\Phi_{LSE} = \arg\min_{\Phi} \sum_{i=1}^{n} \left(y_i - g(x_i; \Phi)\right)^2$

    With the joint pdf $f_{X,Y}$ available, the MMSE predictor of $Y$ itself is the conditional expectation $\hat{Y} = E[Y \mid X]$

SLIDE 5

Minimum Mean Square Error Estimation and Least Square Error Estimation

  • Constant functions: $g(X) = c$
    – MMSE: $\nabla_c E\left[(Y - c)^2\right] = 0 \;\therefore\; c_{MMSE} = E[Y]$ (the mean)
    – LSE: $\nabla_c \sum_{i=1}^{n} (y_i - c)^2 = 0 \;\therefore\; c_{LSE} = \frac{1}{n}\sum_{i=1}^{n} y_i$ (the sample mean)
  • Linear functions: $g(X) = aX + b$
    – MMSE: $\nabla_a E\left[\left(Y - (aX + b)\right)^2\right] = 0, \quad \nabla_b E\left[\left(Y - (aX + b)\right)^2\right] = 0$

      $\Rightarrow\; aE[X^2] + bE[X] - E[XY] = 0, \quad aE[X] + b - E[Y] = 0$

      $\Rightarrow\; a = \frac{\mathrm{cov}(X, Y)}{\mathrm{Var}(X)} = \rho_{XY}\,\frac{\sigma_Y}{\sigma_X}, \quad b = E[Y] - aE[X]$
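As a quick numerical check of these closed-form solutions, here is a minimal NumPy sketch (the data and variable names are illustrative, not from the slides) that fits the constant and linear predictors to synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 10_000)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, 10_000)  # Y = 2X + 1 + noise

# Constant predictor g(X) = c: the LSE solution is the sample mean of y
c_lse = y.mean()

# Linear predictor g(X) = aX + b: a = cov(X, Y)/Var(X), b = E[Y] - a*E[X]
a_hat = np.cov(x, y, bias=True)[0, 1] / x.var()
b_hat = y.mean() - a_hat * x.mean()
print(c_lse, a_hat, b_hat)  # a_hat ≈ 2, b_hat ≈ 1
```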

SLIDE 6

Minimum Mean Square Error Estimation and Least Square Error Estimation

  • Linear functions
    – LSE
  • Suppose that the $\mathbf{x}_i$ are $d$-dimensional vectors and the $y_i$ are scalars

    $\mathbf{Y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
    X = \begin{pmatrix} 1 & x_1^1 & \cdots & x_1^d \\ 1 & x_2^1 & \cdots & x_2^d \\ \vdots & \vdots & & \vdots \\ 1 & x_n^1 & \cdots & x_n^d \end{pmatrix}, \quad
    A = \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_d \end{pmatrix}, \quad
    \hat{\mathbf{Y}} = XA$

    $e = \sum_{i=1}^{n} \left(y_i - A^t \mathbf{x}_i\right)^2 = \left(\mathbf{Y} - XA\right)^t \left(\mathbf{Y} - XA\right)$

    $\nabla_A\, e = -2 \sum_{i=1}^{n} \left(y_i - A^t \mathbf{x}_i\right) \mathbf{x}_i = -2 X^t \left(\mathbf{Y} - XA\right) = 0
    \;\Rightarrow\; X^t X A = X^t \mathbf{Y}
    \;\Rightarrow\; A = \left(X^t X\right)^{-1} X^t \mathbf{Y}$
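The normal-equation solution translates directly into a few lines of NumPy; a minimal sketch on synthetic data follows (names and values are illustrative). In practice `np.linalg.lstsq` is numerically preferable to forming $X^t X$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
x = rng.normal(size=(n, d))
a_true = np.array([0.5, -1.0, 2.0, 0.3])            # [a0, a1, ..., ad]
y = a_true[0] + x @ a_true[1:] + rng.normal(0.0, 0.1, n)

X = np.hstack([np.ones((n, 1)), x])                 # prepend the column of 1s
A = np.linalg.solve(X.T @ X, X.T @ y)               # normal equations XᵗXA = XᵗY
A_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)     # numerically safer equivalent
print(np.allclose(A, A_lstsq))                      # True
```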

SLIDE 7

Maximum Likelihood Estimation (MLE/ML)

  • ML is the most widely used parametric estimation method
  • A set of random samples is drawn independently according to a distribution with pdf $p(x \mid \Phi)$
    – Given a sequence of random samples $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, its likelihood is defined as $p_n(\mathbf{x} \mid \Phi)$, the joint pdf of $(x_1, x_2, \ldots, x_n)$
    – The maximum likelihood estimator of $\Phi$ is denoted $\Phi_{ML}$
    – Since the logarithm is a monotonically increasing function, the parameter set that maximizes the log-likelihood also maximizes the likelihood. The log-likelihood can be expressed as follows:

    $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}, \quad \mathbf{x} = (x_1, x_2, \ldots, x_n)$

    $p_n(\mathbf{x} \mid \Phi) = \prod_{k=1}^{n} p(x_k \mid \Phi) \quad (\because X_1, X_2, \ldots, X_n \text{ are i.i.d.})$

    $\Phi_{ML} = \arg\max_{\Phi} p_n(\mathbf{x} \mid \Phi) = \arg\max_{\Phi} \prod_{k=1}^{n} p(x_k \mid \Phi)$

    $l(\Phi) = \log p_n(\mathbf{x} \mid \Phi) = \sum_{k=1}^{n} \log p(x_k \mid \Phi)$
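As an illustration of maximizing the log-likelihood, the sketch below (NumPy; parameter values are illustrative) evaluates $l(\Phi)$ for a Gaussian over a grid of candidate means and confirms the maximizer sits at the sample mean:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(5.0, 2.0, 1000)          # samples with true mu = 5, sigma = 2

def log_likelihood(mu, sigma2, x):
    # l(Φ) = Σ_k log p(x_k|Φ) for the Gaussian pdf, written out explicitly
    return -0.5 * len(x) * np.log(2 * np.pi * sigma2) \
           - ((x - mu) ** 2).sum() / (2 * sigma2)

mus = np.linspace(3.0, 7.0, 401)
best = mus[np.argmax([log_likelihood(m, 4.0, x) for m in mus])]
print(best, x.mean())                   # grid maximizer ≈ sample mean
```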

SLIDE 8

Maximum Likelihood Estimation (MLE/ML)

  • If $p_n(\mathbf{x} \mid \Phi)$ is a differentiable function of $\Phi$, then $\Phi_{ML}$ can be attained by taking the partial derivatives with respect to $\Phi$ and setting them to zero
    – Let $\Phi = (\Phi_1, \Phi_2, \ldots, \Phi_M)^t$ be an $M$-component parameter vector
  • Example: $p(x \mid \Phi)$ is a univariate Gaussian pdf with the parameter set $\Phi = (\mu, \sigma^2)$

    $\nabla_{\Phi}\, l(\Phi) = \nabla_{\Phi} \sum_{k=1}^{n} \log p(x_k \mid \Phi) = \begin{pmatrix} \partial l / \partial \Phi_1 \\ \vdots \\ \partial l / \partial \Phi_M \end{pmatrix} = \mathbf{0}$

    $p(x \mid \Phi) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$

    $\log p_n(\mathbf{x} \mid \Phi) = \sum_{k=1}^{n} \log p(x_k \mid \Phi) = -\frac{n}{2} \log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \sum_{k=1}^{n} \left(x_k - \mu\right)^2$

SLIDE 9

Maximum Likelihood Estimation (MLE/ML)

  • Example: univariate Gaussian pdf (cont.)
    – Take the partial derivatives of the above expression and set them to zero
    – The maximum likelihood estimates for $\mu$ and $\sigma^2$ are given below
  • The maximum likelihood estimates of the mean and variance are just the sample mean and sample variance

    $\frac{\partial \log p_n(\mathbf{x} \mid \Phi)}{\partial \mu} = \frac{1}{\sigma^2} \sum_{k=1}^{n} \left(x_k - \mu\right) = 0$

    $\frac{\partial \log p_n(\mathbf{x} \mid \Phi)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{k=1}^{n} \left(x_k - \mu\right)^2 = 0$

    $\mu_{ML} = \frac{1}{n} \sum_{k=1}^{n} x_k = E[x_k], \qquad \sigma^2_{ML} = \frac{1}{n} \sum_{k=1}^{n} \left(x_k - \mu_{ML}\right)^2 = E\left[\left(x_k - \mu_{ML}\right)^2\right]$

    ($\Phi$ itself is fixed but unknown)
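A minimal sketch of these two estimates in NumPy (sample data illustrative); note the $1/n$ normalization of the ML variance, not the unbiased $1/(n-1)$:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(-1.0, 3.0, 5000)

mu_ml = x.mean()                     # (1/n) Σ x_k
var_ml = ((x - mu_ml) ** 2).mean()   # (1/n) Σ (x_k - μ_ML)²
print(mu_ml, var_ml)                 # ≈ -1.0 and ≈ 9.0
```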

SLIDE 10

Maximum Likelihood Estimation (MLE/ML)

  • Example: multivariate Gaussian pdf (cont.)
    – The maximum likelihood estimates for $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are given below
  • The maximum likelihood estimates of the mean vector and covariance matrix are just the sample mean vector and sample covariance matrix
  • In fact, $\Phi_{MLE}$ itself also has a Gaussian distribution

    $p(\mathbf{x} \mid \Phi) = \frac{1}{(2\pi)^{d/2}\, |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2} \left(\mathbf{x} - \boldsymbol{\mu}\right)^t \boldsymbol{\Sigma}^{-1} \left(\mathbf{x} - \boldsymbol{\mu}\right)\right)$

    $\hat{\boldsymbol{\mu}}_{MLE} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k$

    $\hat{\boldsymbol{\Sigma}}_{MLE} = \frac{1}{n} \sum_{k=1}^{n} \left(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_{MLE}\right) \left(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_{MLE}\right)^t = E\left[\left(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_{MLE}\right) \left(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_{MLE}\right)^t\right]$
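The multivariate case is the same computation with an outer product; a minimal NumPy sketch (parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0.0, 2.0], [[2.0, 0.5], [0.5, 1.0]], size=5000)

mu_mle = X.mean(axis=0)                  # (1/n) Σ x_k  (sample mean vector)
D = X - mu_mle
sigma_mle = (D.T @ D) / len(X)           # (1/n) Σ (x_k - μ̂)(x_k - μ̂)ᵗ
print(mu_mle)
print(sigma_mle)                         # ≈ the true mean and covariance
# np.cov(X.T, bias=True) gives the same (1/n-normalized) matrix
```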

SLIDE 11

Bayesian Estimation

  • Bayesian estimation has a different philosophy than maximum likelihood (ML) estimation
    – ML assumes the parameter set $\Phi$ is fixed but unknown (a non-informative, uniform prior)
    – Bayesian estimation assumes the parameter set $\Phi$ is itself a random variable with a prior distribution $p(\Phi)$
    – Given a sequence of random samples $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, which are i.i.d. with joint pdf $p(\mathbf{x} \mid \Phi)$, the posterior distribution of $\Phi$ follows from Bayes’ rule:

    $p(\Phi \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \Phi)\, p(\Phi)}{p(\mathbf{x})} \propto p(\mathbf{x} \mid \Phi)\, p(\Phi)$

SLIDE 12

Bayesian Estimation

  • $p(\Phi \mid \mathbf{x})$ : the posterior probability, the distribution of $\Phi$ after we have observed the values of the random variables
  • $p(\Phi)$ : the prior, the distribution of $\Phi$ before we observe the values of the random variables. A conjugate prior of the random variables (or vectors) is a prior distribution for the parameters $\Phi$ of their density function (e.g., $p(x \mid \Phi)$) such that the resulting posterior belongs to the same family as the prior

  • The joint pdf / likelihood function (a Gaussian with unknown mean $\Phi$ and known variance $\sigma^2$):

    $p_n(\mathbf{x} \mid \Phi) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(x_i - \Phi\right)^2\right) \propto \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(x_i - \Phi\right)^2\right)$

  • The prior is also a Gaussian distribution:

    $p(\Phi) = \frac{1}{\sqrt{2\pi}\,\nu} \exp\left(-\frac{(\Phi - \mu)^2}{2\nu^2}\right) \propto \exp\left(-\frac{(\Phi - \mu)^2}{2\nu^2}\right)$

SLIDE 13

Maximum a Posterior Probability (MAP)

  • MAP, which chooses the estimate that maximizes the posterior probability $p(\Phi \mid \mathbf{x})$, is the most common Bayesian estimator
    – For example, the conjugate prior for the mean of a Gaussian pdf is also a Gaussian pdf
  • Suppose, as in the previous example, that $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$ is drawn from a Gaussian whose mean $\Phi$ is unknown and whose variance $\sigma^2$ is known, while the conjugate prior (a Gaussian) has mean $\mu$ and variance $\nu^2$
  • The MAP estimate is:

    $\Phi_{MAP} = \arg\max_{\Phi} p(\Phi \mid \mathbf{x}) = \arg\max_{\Phi} p(\mathbf{x} \mid \Phi)\, p(\Phi)$

    $\Phi_{MAP} = \arg\max_{\Phi} \left[\log p(\mathbf{x} \mid \Phi) + \log p(\Phi)\right]$

    $\frac{\partial}{\partial \Phi} \left[\log p(\mathbf{x} \mid \Phi) + \log p(\Phi)\right] \Big|_{\Phi = \Phi_{MAP}} = 0$

    $\Phi_{MAP} = \frac{n \nu^2\, \bar{x}_n + \sigma^2 \mu}{n \nu^2 + \sigma^2}$

    where $n$ is the number of training samples and $\bar{x}_n$ is the sample mean
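A minimal sketch of this closed-form MAP estimate (NumPy; all parameter values are illustrative), showing the shrinkage of the sample mean toward the prior mean:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2 = 4.0                                 # known data variance σ²
mu0, nu2 = 0.0, 1.0                          # prior mean μ and prior variance ν²
x = rng.normal(3.0, np.sqrt(sigma2), 25)     # few samples, true mean 3.0

n, xbar = len(x), x.mean()
phi_map = (n * nu2 * xbar + sigma2 * mu0) / (n * nu2 + sigma2)
print(xbar, phi_map)  # the MAP estimate is pulled from x̄ toward the prior mean μ
# As n grows, the data term dominates and Φ_MAP approaches x̄ (the ML estimate).
```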

SLIDE 14

Bayes’ Decision Theory

  • Decision-making based on both the posterior knowledge obtained from specific observation data and prior knowledge of the categories
    – Prior class probabilities: $P(\omega_i)$, for every class $i$
    – Class-conditional probabilities (likelihoods): $P(x \mid \omega_i)$, for every class $i$

    $k = \arg\max_{i} P\left(\omega_i \mid x\right) = \arg\max_{i} \frac{P\left(x \mid \omega_i\right) P\left(\omega_i\right)}{P(x)} = \arg\max_{i} \frac{P\left(x \mid \omega_i\right) P\left(\omega_i\right)}{\sum_{j} P\left(x \mid \omega_j\right) P\left(\omega_j\right)}$

    $\therefore\; k = \arg\max_{i} P\left(x \mid \omega_i\right) P\left(\omega_i\right)$
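A minimal sketch of this decision rule for two one-dimensional Gaussian classes (SciPy for the pdf; all parameters are illustrative):

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.6, 0.4])                        # P(ω_1), P(ω_2)
means = np.array([0.0, 2.0])
stds = np.array([1.0, 1.0])

def classify(x):
    # class-conditional likelihoods P(x|ω_i) weighted by the priors P(ω_i)
    scores = norm.pdf(x, means, stds) * priors
    return int(np.argmax(scores)) + 1                # class label 1 or 2

print(classify(0.3), classify(1.8))                  # → 1, 2
```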

SLIDE 15

Bayes’ Decision Theory

  • Bayes’ decision rule is designed to minimize the overall risk involved in making a decision
    – The expected loss (conditional risk) when making decision $\delta_i$
    – The overall risk (Bayes’ risk)

    $R\left(\delta_i \mid x\right) = \sum_{j} l\left(\delta_i, \omega_j\right) P\left(\omega_j \mid x\right), \quad \text{where the loss function } l\left(\delta_i, \omega_j\right) = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}$

    ($\omega_j$: the class $x$ might belong to; $\delta_i$: a decision)

    $R = \int_{-\infty}^{\infty} R\left(\delta(x) \mid x\right) p(x)\, dx, \quad \delta(x): \text{the selected decision for a sample } x$

SLIDE 16

Bayes’ Decision Theory

  • Minimize the overall risk (classification error) by computing the conditional risks $R\left(\delta_i \mid x\right)$ and selecting the decision $\delta_i$ for which the conditional risk is minimum, i.e., for which $P\left(\omega_i \mid x\right)$ is maximum
    – This is called the minimum-error-rate decision rule, which minimizes the classification error rate

    $R\left(\delta_i \mid x\right) = \sum_{j} l\left(\delta_i, \omega_j\right) P\left(\omega_j \mid x\right) = \sum_{j \neq i} P\left(\omega_j \mid x\right) = 1 - P\left(\omega_i \mid x\right)$

    $\delta(x) = \arg\max_{i} P\left(\omega_i \mid x\right) = \arg\max_{i} P\left(x \mid \omega_i\right) P\left(\omega_i\right)$ (the decision that should be made)

SLIDE 17

Bayes’ Decision Theory

  • Two-class pattern classification
    – Likelihood ratio or log-likelihood ratio:

    Likelihood ratio test:
    $l(x) = \frac{P\left(x \mid \omega_1\right)}{P\left(x \mid \omega_2\right)} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \frac{P\left(\omega_2\right)}{P\left(\omega_1\right)}$

    Log-likelihood ratio test:
    $\log l(x) = \log P\left(x \mid \omega_1\right) - \log P\left(x \mid \omega_2\right) \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \log P\left(\omega_2\right) - \log P\left(\omega_1\right)$

    Bayes’ classifier:
    $P\left(x \mid \omega_1\right) P\left(\omega_1\right) \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; P\left(x \mid \omega_2\right) P\left(\omega_2\right)$

    Discriminant functions:
    $d_1(x) = P\left(\omega_1 \mid x\right) \cong P\left(x \mid \omega_1\right) P\left(\omega_1\right), \quad d_2(x) = P\left(\omega_2 \mid x\right) \cong P\left(x \mid \omega_2\right) P\left(\omega_2\right)$

    Classification error ($x$ falls in $R_2$ but the true class is $\omega_1$, or vice versa):
    $p(\text{error}) = P\left(x \in R_2, \omega_1\right) + P\left(x \in R_1, \omega_2\right) = \int_{R_2} P\left(x \mid \omega_1\right) P\left(\omega_1\right) dx + \int_{R_1} P\left(x \mid \omega_2\right) P\left(\omega_2\right) dx$
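A minimal sketch tying the pieces together (SciPy/NumPy; all parameters illustrative): the log-likelihood ratio test for two Gaussian classes, with a Monte Carlo estimate of the resulting classification error:

```python
import numpy as np
from scipy.stats import norm

p1, p2 = 0.5, 0.5                        # priors P(ω_1), P(ω_2)
m1, m2, s = -1.0, 1.0, 1.0               # class-conditional Gaussian parameters

def decide(x):
    # log l(x) = log P(x|ω_1) - log P(x|ω_2), compared against log P(ω_2)/P(ω_1)
    llr = norm.logpdf(x, m1, s) - norm.logpdf(x, m2, s)
    return np.where(llr > np.log(p2 / p1), 1, 2)

rng = np.random.default_rng(6)
labels = rng.choice([1, 2], size=100_000, p=[p1, p2])
x = np.where(labels == 1, rng.normal(m1, s, labels.size),
                          rng.normal(m2, s, labels.size))
print((decide(x) != labels).mean())      # Monte Carlo estimate of p(error) ≈ 0.159
```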
