

  1. Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation
  Selim Aksoy, Department of Computer Engineering, Bilkent University, saksoy@cs.bilkent.edu.tr
  CS 551, Spring 2019. © 2019, Selim Aksoy (Bilkent University)

  2. Introduction
  ◮ Bayesian Decision Theory shows us how to design an optimal classifier if we know the prior probabilities P(w_i) and the class-conditional densities p(x | w_i).
  ◮ Unfortunately, we rarely have complete knowledge of the probabilistic structure.
  ◮ However, we can often find design samples or training data that include particular representatives of the patterns we want to classify.

  3. Introduction
  ◮ To simplify the problem, we can assume some parametric form for the conditional densities and estimate these parameters using training data.
  ◮ Then, we can use the resulting estimates as if they were the true values and perform classification using the Bayesian decision rule.
  ◮ We will consider only the supervised learning case where the true class label for each sample is known.

  4. Introduction
  ◮ We will study two estimation procedures:
    ◮ Maximum likelihood estimation
      ◮ Views the parameters as quantities whose values are fixed but unknown.
      ◮ Estimates these values by maximizing the probability of obtaining the samples observed.
    ◮ Bayesian estimation
      ◮ Views the parameters as random variables having some known prior distribution.
      ◮ Observing new samples converts the prior to a posterior density.

  5. Maximum Likelihood Estimation
  ◮ Suppose we have a set D = {x_1, ..., x_n} of independent and identically distributed (i.i.d.) samples drawn from the density p(x | θ).
  ◮ We would like to use the training samples in D to estimate the unknown parameter vector θ.
  ◮ Define L(θ | D) as the likelihood function of θ with respect to D:
    $L(\theta \mid \mathcal{D}) = p(\mathcal{D} \mid \theta) = p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$.
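
  In practice the product above is evaluated on the log scale; the sketch below is a minimal illustration of the likelihood function for a univariate Gaussian, assuming NumPy and SciPy are available (the function and variable names are mine, not from the lecture).

```python
# A minimal sketch (my own illustration, not from the slides): evaluating the
# log-likelihood of an i.i.d. sample under a candidate Gaussian parameter
# vector theta = (mu, sigma). The raw product of densities underflows quickly,
# so sums of log densities are used instead.
import numpy as np
from scipy.stats import norm

def log_likelihood(theta, data):
    """log L(theta | D) = sum_i log p(x_i | theta) for a univariate Gaussian."""
    mu, sigma = theta
    return np.sum(norm.logpdf(data, loc=mu, scale=sigma))

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=1000)
print(log_likelihood((10.0, 2.0), data))   # near the true parameters
print(log_likelihood((12.0, 2.0), data))   # a worse parameter choice scores lower
```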

  6. Maximum Likelihood Estimation
  ◮ The maximum likelihood estimate (MLE) of θ is, by definition, the value θ̂ that maximizes L(θ | D) and can be computed as
    $\hat{\theta} = \arg\max_{\theta} L(\theta \mid \mathcal{D})$.
  ◮ It is often easier to work with the logarithm of the likelihood function (log-likelihood function), which gives
    $\hat{\theta} = \arg\max_{\theta} \log L(\theta \mid \mathcal{D}) = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta)$.
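
  When no closed-form maximizer exists, θ̂ can be found numerically by minimizing the negative log-likelihood. Below is a sketch using scipy.optimize.minimize for the univariate Gaussian case; parameterizing the problem via log σ to keep σ positive is my own choice, not part of the lecture.

```python
# Numerical MLE sketch: maximize log L(theta | D) by minimizing its negative.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=10.0, scale=2.0, size=500)

def neg_log_likelihood(theta, data):
    mu, log_sigma = theta                     # optimize log(sigma) so sigma stays > 0
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                      # should be close to 10 and 2
```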

  7. Maximum Likelihood Estimation
  ◮ If the number of parameters is p, i.e., θ = (θ_1, ..., θ_p)^T, define the gradient operator
    $\nabla_{\theta} \equiv \left( \frac{\partial}{\partial \theta_1}, \ldots, \frac{\partial}{\partial \theta_p} \right)^T$.
  ◮ Then, the MLE of θ should satisfy the necessary conditions
    $\nabla_{\theta} \log L(\theta \mid \mathcal{D}) = \sum_{i=1}^{n} \nabla_{\theta} \log p(x_i \mid \theta) = 0$.

  8. Maximum Likelihood Estimation
  ◮ Properties of MLEs:
    ◮ The MLE is the parameter point for which the observed sample is the most likely.
    ◮ The procedure with partial derivatives may result in several local extrema. We should check each solution individually to identify the global optimum.
    ◮ Boundary conditions must also be checked separately for extrema.
    ◮ Invariance property: if θ̂ is the MLE of θ, then for any function f(θ), the MLE of f(θ) is f(θ̂).

  9. The Gaussian Case
  ◮ Suppose that p(x | θ) = N(µ, Σ).
  ◮ When Σ is known but µ is unknown:
    $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
  ◮ When both µ and Σ are unknown:
    $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})^T$.
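
  A quick numerical check of these closed-form estimates (my own sketch, assuming NumPy; the true parameters below are arbitrary):

```python
# MLEs for a multivariate Gaussian: sample mean and the 1/n (biased) covariance.
import numpy as np

rng = np.random.default_rng(2)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_cov, size=1000)   # n x d sample matrix

mu_hat = X.mean(axis=0)                       # MLE of the mean
centered = X - mu_hat
sigma_hat = centered.T @ centered / len(X)    # MLE of the covariance (note the 1/n)
print(mu_hat)
print(sigma_hat)
```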

  10. The Bernoulli Case
  ◮ Suppose that P(x | θ) = Bernoulli(θ) = θ^x (1 − θ)^{1−x}, where x ∈ {0, 1} and 0 ≤ θ ≤ 1.
  ◮ The MLE of θ can be computed as
    $\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
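
  A short derivation (standard, not shown on the slide): the log-likelihood is $\log L(\theta \mid \mathcal{D}) = \sum_{i=1}^{n} \left[ x_i \log\theta + (1 - x_i)\log(1-\theta) \right]$, and setting its derivative to zero, $\frac{d}{d\theta} \log L = \frac{\sum_i x_i}{\theta} - \frac{n - \sum_i x_i}{1-\theta} = 0$, yields $\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} x_i$, the fraction of samples equal to 1.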

  11. Bias of Estimators
  ◮ Bias of an estimator θ̂ is the difference between the expected value of θ̂ and θ.
  ◮ The MLE of µ is an unbiased estimator for µ because E[µ̂] = µ.
  ◮ The MLE of Σ is not an unbiased estimator for Σ because
    $E[\hat{\Sigma}] = \frac{n-1}{n} \Sigma \neq \Sigma$.
  ◮ The sample covariance
    $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})^T$
    is an unbiased estimator for Σ.
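
  In NumPy this distinction corresponds to the ddof argument of np.var and np.cov; a tiny sketch (my own, not from the slides):

```python
# Biased (MLE, divide by n) vs. unbiased (divide by n-1) variance estimates.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=0.0, scale=2.0, size=20)

var_mle = np.var(x, ddof=0)        # divides by n; E[var_mle] = (n-1)/n * sigma^2
var_unbiased = np.var(x, ddof=1)   # divides by n-1; unbiased for sigma^2
print(var_mle, var_unbiased)
```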

  12. Goodness-of-fit
  ◮ To measure how well a fitted distribution resembles the sample data (goodness-of-fit), we can use the Kolmogorov-Smirnov test statistic.
  ◮ It is defined as the maximum value of the absolute difference between the cumulative distribution function estimated from the sample and the one calculated from the fitted distribution.
  ◮ After estimating the parameters for different distributions, we can compute the Kolmogorov-Smirnov statistic for each distribution and choose the one with the smallest value as the best fit to our sample.
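
  As a sketch of this recipe (my own illustration; SciPy's stats module is assumed), one can fit two candidate families to the same sample and compare their K-S statistics:

```python
# Fit Gaussian and Gamma distributions to a sample, then compare K-S statistics.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
data = rng.gamma(shape=4.0, scale=4.0, size=1000)    # a skewed, Gamma-like sample

mu, sigma = stats.norm.fit(data)                     # Gaussian MLE
a, loc, scale = stats.gamma.fit(data, floc=0)        # Gamma MLE (location fixed at 0)

ks_norm = stats.kstest(data, 'norm', args=(mu, sigma)).statistic
ks_gamma = stats.kstest(data, 'gamma', args=(a, loc, scale)).statistic
print(ks_norm, ks_gamma)    # the smaller statistic indicates the better fit
```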

  13. Maximum Likelihood Estimation Examples
  Figure 1: Histograms of samples and estimated densities for different distributions.
    (a) True pdf is N(10, 4); estimated pdf is N(9.98, 4.05).
    (b) True pdf is 0.5 N(10, 0.16) + 0.5 N(11, 0.25); estimated pdf is N(10.50, 0.47).
    (c) True pdf is Gamma(4, 4); estimated pdfs are N(16.1, 67.4) and Gamma(3.8, 4.2).
    (d) Cumulative distribution functions for the example in (c).

  14. Bayesian Estimation
  ◮ Suppose the set D = {x_1, ..., x_n} contains the samples drawn independently from the density p(x | θ) whose form is assumed to be known, but θ is not known exactly.
  ◮ Assume that θ is a quantity whose variation can be described by the prior probability distribution p(θ).

  15. Bayesian Estimation
  ◮ Given D, the prior distribution can be updated to form the posterior distribution using the Bayes rule
    $p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$,
    where
    $p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta$ and $p(\mathcal{D} \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$.

  16. Bayesian Estimation
  ◮ The posterior distribution p(θ | D) can be used to find estimates for θ (e.g., the expected value of p(θ | D) can be used as an estimate for θ).
  ◮ Then, the conditional density p(x | D) can be computed as
    $p(x \mid \mathcal{D}) = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta$
    and can be used in the Bayesian classifier.
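
  The integral can be made concrete with a simple grid approximation; the sketch below (my own illustration, not from the slides) computes the predictive probability of x = 1 for a Bernoulli likelihood with a uniform prior on θ.

```python
# Approximate p(x | D) = ∫ p(x | θ) p(θ | D) dθ on a grid, for a Bernoulli
# likelihood with a uniform prior over θ in (0, 1).
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])             # observed Bernoulli sample D

theta = np.linspace(0.001, 0.999, 999)                # grid over the parameter
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)                           # uniform prior p(θ)
likelihood = np.prod(theta[None, :] ** data[:, None]
                     * (1 - theta[None, :]) ** (1 - data[:, None]), axis=0)
posterior = likelihood * prior
posterior /= posterior.sum() * dtheta                 # normalize p(θ | D)

# Predictive probability of a new observation x = 1: ∫ θ p(θ | D) dθ
p_x1 = (theta * posterior).sum() * dtheta
print(p_x1)   # about 0.7 here, versus the MLE 6/8 = 0.75
```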

  17. MLEs vs. Bayes Estimates
  ◮ Maximum likelihood estimation finds an estimate of θ based on the samples in D, but a different sample set would give rise to a different estimate.
  ◮ The Bayes estimate takes this sampling variability into account.
  ◮ We assume that we do not know the true value of θ; instead of taking a single estimate, we take a weighted sum of the densities p(x | θ), weighted by the distribution p(θ | D).

  18. The Gaussian Case
  ◮ Consider the univariate case p(x | µ) = N(µ, σ²) where µ is the only unknown parameter, with a prior distribution p(µ) = N(µ_0, σ_0²) (σ², µ_0, and σ_0² are all known).
  ◮ This corresponds to drawing a value for µ from the population with density p(µ), treating it as the true value in the density p(x | µ), and drawing samples for x from this density.

  19. The Gaussian Case
  ◮ Given D = {x_1, ..., x_n}, we obtain
    $p(\mu \mid \mathcal{D}) \propto \prod_{i=1}^{n} p(x_i \mid \mu)\, p(\mu) \propto \exp\left[ -\frac{1}{2} \left( \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) \mu^2 - 2 \left( \frac{1}{\sigma^2} \sum_{i=1}^{n} x_i + \frac{\mu_0}{\sigma_0^2} \right) \mu \right) \right] = N(\mu_n, \sigma_n^2)$,
    where
    $\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} x_i$, $\mu_n = \left( \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2} \right) \hat{\mu}_n + \left( \frac{\sigma^2}{n\sigma_0^2 + \sigma^2} \right) \mu_0$, and $\sigma_n^2 = \frac{\sigma_0^2\, \sigma^2}{n\sigma_0^2 + \sigma^2}$.
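
  A small numerical sketch of these update equations (my own illustration; the values of σ, µ_0, and σ_0 are arbitrary): the posterior mean µ_n is a weighted combination of the sample mean and the prior mean, and the weight on the prior shrinks as n grows.

```python
# Posterior over µ for a univariate Gaussian with known σ² and prior N(µ0, σ0²).
import numpy as np

sigma, mu0, sigma0 = 2.0, 0.0, 1.0                  # known noise std, prior mean/std
rng = np.random.default_rng(5)
data = rng.normal(loc=10.0, scale=sigma, size=10)   # small sample with true µ = 10

n = len(data)
mu_hat = data.mean()                                              # sample mean (MLE)
mu_n = (n * sigma0**2 * mu_hat + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)
sigma_n2 = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)    # posterior variance
print(mu_hat, mu_n, sigma_n2)    # µ_n lies between the prior mean µ0 and the MLE
```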
