Review of Estimation Theory
  1. Review of Estimation Theory
     Berlin 2003
     References: 1. X. Huang et al., Spoken Language Processing, Chapter 3

  2. Introduction
     • Estimation theory is the most important theory and method in statistical inference
     • Statistical inference
       – Data generated in accordance with some unknown probability distribution must be analyzed
       – Some type of inference about the unknown distribution must then be made, e.g. about the characteristics (parameters) of the distribution generating the experimental data, such as the mean and variance
       – The vector of random variables $X = \{X_1, X_2, \ldots, X_n\}$ is mapped by an estimator $g$ to $\hat{\Phi} = g(X)$; the vector of sample values $x = \{x_1, x_2, \ldots, x_n\}$ yields the estimate $\hat{\Phi} = g(x)$
       – $\Phi$: the parameters of the distribution

  3. Introduction
     • Three common estimators (estimation methods)
       – Minimum mean square error estimator
         • Estimates the random variable itself
         • Function approximation, curve fitting, …
       – Maximum likelihood estimator
         • Estimates the parameters of the distribution of the random variables
       – Bayes' estimator
         • Estimates the parameters of the distribution of the random variables

  4. Minimum Mean Square Error Estimation and Least Square Error Estimation
     • There are two random variables $X$ and $Y$. When observing the value of $X$, we want to find a transform $\hat{Y} = g(X, \Phi)$ ($\Phi$: the parameter vector of the function $g$) to predict the value of $Y$
       – Minimum mean square error estimation (when the joint distribution $f_{X,Y}(x, y)$ is known):
         $\Phi_{MMSE} = \arg\min_\Phi E\big[(Y - g(X, \Phi))^2\big]$
       – Least square error estimation (when $n$ sample pairs $(x_i, y_i)$ are observed):
         $\Phi_{LSE} = \arg\min_\Phi \sum_{i=1}^{n} \big[y_i - g(x_i, \Phi)\big]^2$
     • Based on the law of large numbers, when the joint probability $f_{X,Y}(x, y)$ is uniform or the number of samples approaches infinity, MMSE and LSE are equivalent
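To make the equivalence claim concrete, here is a minimal NumPy sketch (the Gaussian data, seed, and sample sizes are illustrative assumptions, not from the slides): for the simplest predictor, a constant $c$ (treated formally on the next slide), the LSE solution is the sample mean, which approaches the MMSE solution $E[Y]$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 2.5                    # the MMSE constant predictor is E[Y]

for n in (10, 1_000, 100_000):
    y = rng.normal(loc=true_mean, scale=1.0, size=n)
    c_lse = y.mean()               # LSE constant: argmin_c sum_i (y_i - c)^2
    print(n, c_lse)                # converges to E[Y] = 2.5 as n grows
```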

  5. Minimum Mean Square Error Estimation and Least Square Error Estimation
     • Constant functions $g(X) = c$
       – MMSE: $\nabla_c E\big[(Y - c)^2\big] = 0 \;\Rightarrow\; c_{MMSE} = E[Y]$ (the mean)
       – LSE: $\nabla_c \sum_{i=1}^{n} (y_i - c)^2 = 0 \;\Rightarrow\; c_{LSE} = \frac{1}{n}\sum_{i=1}^{n} y_i$ (the sample mean)
     • Linear functions $g(X) = aX + b$
       – MMSE:
         $\nabla_a E\big[(Y - (aX + b))^2\big] = 0 \;\Rightarrow\; aE[X^2] + bE[X] - E[XY] = 0$
         $\nabla_b E\big[(Y - (aX + b))^2\big] = 0 \;\Rightarrow\; aE[X] + b - E[Y] = 0$
         $a = \frac{\mathrm{cov}(X, Y)}{\mathrm{Var}(X)}, \qquad b = E[Y] - \rho_{XY}\frac{\sigma_Y}{\sigma_X}E[X]$
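A minimal sketch of the linear closed form, assuming synthetic data with arbitrarily chosen coefficients; note that $b = E[Y] - \rho_{XY}(\sigma_Y/\sigma_X)E[X]$ is the same quantity as $E[Y] - aE[X]$, since $a = \rho_{XY}\sigma_Y/\sigma_X$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: Y = 3 X + 1 + noise (coefficients and noise level
# are assumptions for the demo).
x = rng.normal(size=10_000)
y = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Slide formulas: a = cov(X, Y) / Var(X), b = E[Y] - a E[X]
a = np.cov(x, y, bias=True)[0, 1] / x.var()
b = y.mean() - a * x.mean()
print(a, b)                        # close to the generating values 3.0 and 1.0
```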

  6. Minimum Mean Square Error Estimation and Least Square Error Estimation
     • Linear functions – LSE
       – Suppose the $x_i$ are $d$-dimensional vectors and the $y_i$ are scalars:
         $\hat{Y} = XA$, where
         $X = \begin{pmatrix} 1 & x_1^1 & \cdots & x_1^d \\ 1 & x_2^1 & \cdots & x_2^d \\ \vdots & \vdots & & \vdots \\ 1 & x_n^1 & \cdots & x_n^d \end{pmatrix}, \qquad A = \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_d \end{pmatrix}, \qquad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$
       – The squared error and its gradient:
         $e(A) = \sum_{i=1}^{n} \big(A^t x_i - y_i\big)^2$
         $\nabla e(A) = \sum_{i=1}^{n} 2\big(A^t x_i - y_i\big)x_i = 2X^t(XA - Y) = 0$
         $\Rightarrow\; X^t X A = X^t Y \;\Rightarrow\; \hat{A} = \big(X^t X\big)^{-1} X^t Y$
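A minimal sketch of the normal-equations solution above, with an arbitrarily chosen $d = 3$ problem as the assumed data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative regression problem (coefficients are assumptions).
n, d = 200, 3
x = rng.normal(size=(n, d))
y = x @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=n)

X = np.hstack([np.ones((n, 1)), x])       # rows (1, x_i^1, ..., x_i^d)
A = np.linalg.solve(X.T @ X, X.T @ y)     # normal equations: X^t X A = X^t Y
print(A)                                  # approx. (4.0, 1.5, -2.0, 0.5)
```

In practice, `np.linalg.lstsq(X, y, rcond=None)` solves the same problem with better numerical behavior than explicitly forming $X^t X$.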

  7. Maximum Likelihood Estimation (MLE/ML)
     • ML is the most widely used parametric estimation method
     • A set of random samples $X = \{X_1, X_2, \ldots, X_n\}$ is drawn independently according to a distribution with pdf $p(x \mid \Phi)$
       – Given a sequence of random samples $x = (x_1, x_2, \ldots, x_n)$, its likelihood is defined as $p_n(x \mid \Phi)$, the joint pdf of $(x_1, x_2, \ldots, x_n)$:
         $p_n(x \mid \Phi) = \prod_{k=1}^{n} p(x_k \mid \Phi)$, since $X_1, X_2, \ldots, X_n$ are i.i.d.
       – The maximum likelihood estimator of $\Phi$ is denoted as
         $\Phi_{ML} = \arg\max_\Phi p_n(x \mid \Phi) = \arg\max_\Phi \prod_{k=1}^{n} p(x_k \mid \Phi)$
       – Since the logarithm is a monotonically increasing function, the parameter set $\Phi_{ML}$ that maximizes the log-likelihood also maximizes the likelihood. The log-likelihood can be expressed as:
         $l(\Phi) = \log p_n(x \mid \Phi) = \sum_{k=1}^{n} \log p(x_k \mid \Phi)$
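A minimal sketch of the log-likelihood sum for the univariate Gaussian case worked out on the next slides (the sample values are made up for the demo):

```python
import numpy as np

x = np.array([1.2, 0.7, 2.3, 1.9, 1.1])   # made-up sample

def log_likelihood(mu, sigma2):
    """l(Phi) = sum_k log p(x_k | Phi) for a univariate Gaussian."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                  - (x - mu) ** 2 / (2 * sigma2))

# The log-likelihood peaks at the sample mean and variance (next slides):
print(log_likelihood(0.0, 1.0))
print(log_likelihood(x.mean(), x.var()))  # larger, i.e. more likely
```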

  8. Maximum Likelihood Estimation (MLE/ML)
     • If $p_n(x \mid \Phi)$ is a differentiable function of $\Phi$, $\Phi_{ML}$ can be attained by taking the partial derivative with respect to $\Phi$ and setting it to zero
       – Let $\Phi = (\Phi_1, \Phi_2, \ldots, \Phi_M)^t$ be an $M$-component parameter vector, with $\nabla_\Phi = \Big(\frac{\partial}{\partial \Phi_1}, \ldots, \frac{\partial}{\partial \Phi_M}\Big)^t$. Then
         $\nabla_\Phi\, l(\Phi) = \sum_{k=1}^{n} \nabla_\Phi \log p(x_k \mid \Phi) = 0$
     • Example: $p(x \mid \Phi)$ is a univariate Gaussian pdf with parameter set $\Phi = (\mu, \sigma^2)$:
         $p(x \mid \Phi) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
         $\log p_n(x \mid \Phi) = \sum_{k=1}^{n} \log p(x_k \mid \Phi) = -\frac{n}{2}\log\big(2\pi\sigma^2\big) - \frac{1}{2\sigma^2}\sum_{k=1}^{n}(x_k - \mu)^2$

  9. Maximum Likelihood Estimation (MLE/ML)
     • Example: univariate Gaussian pdf (cont.)
       – Take the partial derivatives of the above expression and set them to zero ($\Phi$ itself is fixed but unknown):
         $\frac{\partial}{\partial \mu} \log p_n(x \mid \Phi) = \frac{1}{\sigma^2}\sum_{k=1}^{n}(x_k - \mu) = 0$
         $\frac{\partial}{\partial \sigma^2} \log p_n(x \mid \Phi) = -\frac{n}{2\sigma^2} + \sum_{k=1}^{n}\frac{(x_k - \mu)^2}{2\sigma^4} = 0$
       – The maximum likelihood estimates for $\mu$ and $\sigma^2$ are
         $\mu_{ML} = \frac{1}{n}\sum_{k=1}^{n} x_k$
         $\sigma^2_{ML} = \frac{1}{n}\sum_{k=1}^{n}\big(x_k - \mu_{ML}\big)^2$
     • The maximum likelihood estimates for the mean and variance are just the sample mean and variance
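A minimal sketch of these two estimates (the generating parameters and sample size are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)  # assumed mu=5, sigma^2=4

mu_ml = x.mean()                         # (1/n) sum_k x_k
sigma2_ml = np.mean((x - mu_ml) ** 2)    # (1/n) sum_k (x_k - mu_ml)^2
print(mu_ml, sigma2_ml)                  # close to 5.0 and 4.0
```

Note the $1/n$ (not $1/(n-1)$) divisor: the ML variance estimate is biased, matching NumPy's default `np.var` (ddof=0).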

  10. Maximum Likelihood Estimation (MLE/ML)
      • Example: multivariate Gaussian pdf
          $p(x \mid \Phi) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (x - \boldsymbol{\mu})\right)$
        – The maximum likelihood estimates for $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are
          $\hat{\boldsymbol{\mu}}_{MLE} = \frac{1}{n}\sum_{k=1}^{n} x_k$
          $\hat{\boldsymbol{\Sigma}}_{MLE} = \frac{1}{n}\sum_{k=1}^{n}\big(x_k - \hat{\boldsymbol{\mu}}_{MLE}\big)\big(x_k - \hat{\boldsymbol{\mu}}_{MLE}\big)^t$
      • The maximum likelihood estimates for the mean vector and covariance matrix are just the sample mean vector and sample covariance matrix
      • In fact, $\hat{\boldsymbol{\mu}}_{MLE}$ itself is also Gaussian-distributed
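A minimal sketch of the multivariate estimates, assuming arbitrarily chosen 2-D generating parameters:

```python
import numpy as np

rng = np.random.default_rng(4)

true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.5],
                     [0.5, 1.0]])
x = rng.multivariate_normal(true_mu, true_cov, size=50_000)

mu_ml = x.mean(axis=0)
centered = x - mu_ml
cov_ml = centered.T @ centered / len(x)  # 1/n divisor, matching the slide
print(mu_ml)                             # close to true_mu
print(cov_ml)                            # close to true_cov
```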

  11. Bayesian Estimation
      • Bayesian estimation has a different philosophy than maximum likelihood (ML) estimation
        – ML assumes the parameter set $\Phi$ is fixed but unknown (a non-informative, uniform prior)
        – Bayesian estimation assumes the parameter set $\Phi$ itself is a random variable with a prior distribution $p(\Phi)$
        – Given a sequence of random samples $x = (x_1, x_2, \ldots, x_n)$, i.i.d. with joint pdf $p(x \mid \Phi)$, the posterior distribution of $\Phi$ follows from Bayes' rule:
          $p(\Phi \mid x) = \frac{p(x \mid \Phi)\,p(\Phi)}{p(x)} \propto p(x \mid \Phi)\,p(\Phi)$

  12. Bayesian Estimation
      • $p(\Phi \mid x)$: the posterior probability, the distribution of $\Phi$ after we have observed the values of the random variables
      • $p(\Phi)$: a conjugate prior of the random variables (or vectors) is defined as the prior distribution for the parameters $\Phi$ of their density function
        – Before we observe the values of the random variables
      • The joint pdf/likelihood function (a univariate Gaussian with known variance $\sigma^2$ and unknown mean $\Phi$):
          $p(x \mid \Phi) = \frac{1}{\big(\sqrt{2\pi}\,\sigma\big)^n} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \Phi)^2\right) \propto \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \Phi)^2\right)$
      • The prior is also a Gaussian distribution:
          $p(\Phi) = \frac{1}{\sqrt{2\pi}\,\nu} \exp\!\left(-\frac{(\Phi - \mu)^2}{2\nu^2}\right) \propto \exp\!\left(-\frac{(\Phi - \mu)^2}{2\nu^2}\right)$
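Multiplying this Gaussian likelihood and Gaussian prior and completing the square yields a Gaussian posterior; the slides stop before the closed form, but the standard conjugate result gives posterior mean $(n\nu^2\bar{x} + \sigma^2\mu)/(n\nu^2 + \sigma^2)$ and posterior variance $\sigma^2\nu^2/(n\nu^2 + \sigma^2)$. A minimal sketch, with the prior hyperparameters and data as assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

sigma = 1.0                  # known likelihood std dev (the slide's sigma)
mu0, nu = 0.0, 2.0           # Gaussian prior N(mu0, nu^2) on the unknown mean Phi
x = rng.normal(loc=1.5, scale=sigma, size=20)   # data with assumed true mean 1.5

n = len(x)
post_var = sigma**2 * nu**2 / (n * nu**2 + sigma**2)
post_mean = (n * nu**2 * x.mean() + sigma**2 * mu0) / (n * nu**2 + sigma**2)
print(post_mean, post_var)   # posterior mean shrinks x.mean() toward mu0
```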
