
Maximum-likelihood and Bayesian parameter estimation

Andrea Passerini passerini@disi.unitn.it

Machine Learning


Parameter estimation

Setting
- Data are sampled from a probability distribution p(x, y)
- The form of the probability distribution p is known, but its parameters are unknown
- There is a training set D = {(x_1, y_1), ..., (x_m, y_m)} of examples sampled i.i.d. according to p(x, y)

Task
Estimate the unknown parameters of p from the training data D.

Note: i.i.d. sampling
- independent: each example is sampled independently of the others
- identically distributed: all examples are sampled from the same distribution


Parameter estimation

Multiclass classification setting
- The training set can be divided into subsets D_1, ..., D_c, one for each class (D_i = {x_1, ..., x_n} contains i.i.d. examples for target class y_i)
- For any new example x (not in the training set), we compute the posterior probability of the class given the example and the full training set D:

$$P(y_i \mid x, D) = \frac{p(x \mid y_i, D)\, p(y_i \mid D)}{p(x \mid D)}$$

Note
This is the same as Bayesian decision theory (compute the posterior probability of the class given the example), except that the parameters of the distributions are unknown and a training set D is provided instead.


Parameter estimation

Multiclass classification setting: simplifications

$$P(y_i \mid x, D) = \frac{p(x \mid y_i, D_i)\, p(y_i \mid D)}{p(x \mid D)}$$

- We assume x is independent of D_j (j ≠ i) given y_i and D_i
- Without additional knowledge, p(y_i|D) can be computed as the fraction of examples with that class in the dataset
- The normalizing factor p(x|D) can be computed by marginalizing p(x|y_i, D_i) p(y_i|D) over the possible classes

Note
We must estimate the class-dependent parameters θ_i for p(x|y_i, D_i); a minimal code sketch of the posterior computation follows.
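The sketch below illustrates this computation end to end, assuming univariate Gaussian class-conditional densities whose parameters are estimated from each D_i (parameter estimation itself is the topic of the following slides; the data and names are illustrative only):

```python
import numpy as np
from scipy.stats import norm

# Toy per-class training sets D_1, D_2 (one feature, two classes).
D = {0: np.array([1.0, 1.2, 0.8, 1.1]), 1: np.array([3.0, 2.8, 3.3])}

# p(y_i|D): fraction of examples of each class in the dataset.
priors = {y: len(Dy) / sum(len(d) for d in D.values()) for y, Dy in D.items()}
# Class-dependent parameters theta_i (here: Gaussian mean and std per class).
params = {y: (Dy.mean(), Dy.std()) for y, Dy in D.items()}

def posterior(x):
    # p(x|y_i, D_i) p(y_i|D), normalized by p(x|D) (sum over classes).
    joint = {y: norm.pdf(x, mu, s) * priors[y] for y, (mu, s) in params.items()}
    evidence = sum(joint.values())
    return {y: j / evidence for y, j in joint.items()}

print(posterior(1.5))  # P(y_i | x=1.5, D) for each class
```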


Maximum Likelihood vs Bayesian estimation

Maximum likelihood / maximum a-posteriori estimation
- Assumes parameters θ_i have fixed but unknown values
- Values are computed as those maximizing the probability of the observed examples D_i (the training set for the class)
- The obtained values are used to compute the probability for new examples:

$$p(x \mid y_i, D_i) \approx p(x \mid \theta_i)$$


Maximum Likelihood vs Bayesian estimation

Bayesian estimation
- Assumes parameters θ_i are random variables with some known prior distribution
- Observing examples turns the prior distribution over the parameters into a posterior distribution
- Predictions for new examples are obtained by integrating over all possible values of the parameters:

$$p(x \mid y_i, D_i) = \int_{\theta_i} p(x, \theta_i \mid y_i, D_i)\, d\theta_i$$


Maximum likelihood / maximum a-posteriori estimation

Maximum a-posteriori (MAP) estimation

$$\theta_i^* = \arg\max_{\theta_i} p(\theta_i \mid D_i, y_i) = \arg\max_{\theta_i} p(D_i, y_i \mid \theta_i)\, p(\theta_i)$$

- Assumes a prior distribution p(θ_i) for the parameters is available

Maximum likelihood estimation (most common)

$$\theta_i^* = \arg\max_{\theta_i} p(D_i, y_i \mid \theta_i)$$

- Maximizes the likelihood of the parameters with respect to the training samples
- Makes no assumption about a prior distribution for the parameters

Note
Each class y_i is treated independently: we replace y_i, D_i with D in the following for simplicity.


Maximum-likelihood (ML) estimation

Setting (again)
- A training set D = {x_1, ..., x_n} of i.i.d. examples for the target class y is available
- We assume the parameter vector θ has a fixed but unknown value
- We estimate this value by maximizing its likelihood with respect to the training data:

$$\theta^* = \arg\max_\theta p(D \mid \theta) = \arg\max_\theta \prod_{j=1}^n p(x_j \mid \theta)$$

- The joint probability over D decomposes into a product because the examples are i.i.d. (thus independent of each other given the distribution)


Maximum-likelihood estimation

Maximizing the log-likelihood
- It is usually simpler to maximize the logarithm of the likelihood (the logarithm is monotonic):

$$\theta^* = \arg\max_\theta \ln p(D \mid \theta) = \arg\max_\theta \sum_{j=1}^n \ln p(x_j \mid \theta)$$

- Necessary conditions for the maximum are obtained by zeroing the gradient with respect to θ:

$$\nabla_\theta \sum_{j=1}^n \ln p(x_j \mid \theta) = 0$$

- Points zeroing the gradient can be local or global maxima, depending on the form of the distribution
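As a minimal numerical sketch of this recipe (before deriving the closed forms on the following slides), one can maximize the log-likelihood of a univariate Gaussian directly, assuming scipy is available; the data are a toy sample:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x = np.array([2.1, 1.9, 2.4, 2.0, 1.6])       # toy i.i.d. sample

def neg_log_likelihood(theta):
    mu, log_sigma = theta                      # optimize log(sigma) so sigma > 0
    return -np.sum(norm.logpdf(x, mu, np.exp(log_sigma)))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print(res.x[0], np.exp(res.x[1]))              # numerical ML estimates
print(x.mean(), x.std())                       # closed-form estimates (next slides)
```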


Maximum-likelihood estimation

Univariate Gaussian case: unknown µ and σ²
- The log-likelihood is:

$$L = \sum_{j=1}^n \left[-\frac{1}{2\sigma^2}(x_j - \mu)^2 - \frac{1}{2}\ln 2\pi\sigma^2\right]$$

- The gradient with respect to µ is:

$$\frac{\partial L}{\partial \mu} = 2\sum_{j=1}^n -\frac{1}{2\sigma^2}(x_j - \mu)(-1) = \sum_{j=1}^n \frac{1}{\sigma^2}(x_j - \mu)$$


Maximum-likelihood estimation

Univariate Gaussian case: unknown µ and σ²
- Setting the gradient to zero gives the mean:

$$\sum_{j=1}^n \frac{1}{\sigma^2}(x_j - \mu) = 0 \quad\Rightarrow\quad \sum_{j=1}^n (x_j - \mu) = 0$$
$$\sum_{j=1}^n x_j = \sum_{j=1}^n \mu = n\mu \quad\Rightarrow\quad \mu = \frac{1}{n}\sum_{j=1}^n x_j$$


Maximum-likelihood estimation

Univariate Gaussian case: unknown µ and σ²
- The log-likelihood is:

$$L = \sum_{j=1}^n \left[-\frac{1}{2\sigma^2}(x_j - \mu)^2 - \frac{1}{2}\ln 2\pi\sigma^2\right]$$

- The gradient with respect to σ² is:

$$\frac{\partial L}{\partial \sigma^2} = \sum_{j=1}^n \left[-(x_j - \mu)^2 \frac{\partial}{\partial \sigma^2}\frac{1}{2\sigma^2} - \frac{1}{2}\frac{1}{2\pi\sigma^2}\,2\pi\right] = \sum_{j=1}^n \left[\frac{(x_j - \mu)^2}{2\sigma^4} - \frac{1}{2\sigma^2}\right]$$


Maximum-likelihood estimation

Univariate Gaussian case: unknown µ and σ²
- Setting the gradient to zero gives the variance:

$$\sum_{j=1}^n \frac{1}{2\sigma^2} = \sum_{j=1}^n \frac{(x_j - \mu)^2}{2\sigma^4}$$
$$\sum_{j=1}^n \sigma^2 = \sum_{j=1}^n (x_j - \mu)^2 \quad\Rightarrow\quad \sigma^2 = \frac{1}{n}\sum_{j=1}^n (x_j - \mu)^2$$


Maximum-likelihood estimation

Multivariate Gaussian case: unknown µ and Σ
- The log-likelihood is:

$$L = \sum_{j=1}^n \left[-\frac{1}{2}(x_j - \mu)^t\Sigma^{-1}(x_j - \mu) - \frac{1}{2}\ln (2\pi)^d|\Sigma|\right]$$

- The maximum-likelihood estimates are:

$$\mu = \frac{1}{n}\sum_{j=1}^n x_j \qquad \Sigma = \frac{1}{n}\sum_{j=1}^n (x_j - \mu)(x_j - \mu)^t$$


Maximum-likelihood estimation

General Gaussian case
Maximum likelihood estimates for Gaussian parameters are simply their empirical estimates over the samples:
- the Gaussian mean is the sample mean
- the Gaussian covariance matrix is the mean of the sample covariances (x_j − µ)(x_j − µ)^t
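A minimal numpy sketch of this fact on synthetic data (note the 1/n normalization of the covariance, not the unbiased 1/(n−1)):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.5], [0.5, 1.0]], size=1000)

mu_ml = X.mean(axis=0)                            # sample mean
Sigma_ml = (X - mu_ml).T @ (X - mu_ml) / len(X)   # mean of sample covariances (1/n)
print(mu_ml)
print(Sigma_ml)                                   # matches np.cov(X.T, bias=True)
```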


Bayesian estimation

Setting (again)
- Assumes parameters θ_i are random variables with some known prior distribution
- Predictions for new examples are obtained by integrating over all possible values of the parameters:

$$p(x \mid y_i, D_i) = \int_{\theta_i} p(x, \theta_i \mid y_i, D_i)\, d\theta_i$$

- The probability of x given each class y_i is independent of the other classes y_j, so for simplicity we can again write:

$$p(x \mid y_i, D_i) \rightarrow p(x \mid D) = \int_{\theta} p(x, \theta \mid D)\, d\theta$$

where D is a dataset for a certain class y and θ are the parameters of its distribution.


Bayesian estimation

Setting

$$p(x \mid D) = \int_{\theta} p(x, \theta \mid D)\, d\theta = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta$$

- p(x|θ) can be easily computed (we have both the form and the parameters of the distribution, e.g. a Gaussian)
- We need to estimate the parameter posterior density given the training set:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$


Bayesian estimation

Denominator

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$

- p(D) is a constant independent of θ (i.e. it will not influence the final Bayesian decision)
- If the final probability (not only the decision) is needed, we can compute:

$$p(D) = \int_{\theta} p(D \mid \theta)\, p(\theta)\, d\theta$$


Bayesian estimation

Univariate normal case: unknown µ, known σ²
- Examples are drawn from: p(x|µ) ∼ N(µ, σ²)
- The prior distribution of the Gaussian mean is itself normal: p(µ) ∼ N(µ_0, σ_0²)
- The posterior of the Gaussian mean given the dataset is computed as:

$$p(\mu \mid D) = \frac{p(D \mid \mu)\, p(\mu)}{p(D)} = \alpha \prod_{j=1}^n p(x_j \mid \mu)\, p(\mu)$$

where α = 1/p(D) is independent of µ.


Univariate normal case: unknown µ, known σ²

A posteriori parameter density

$$p(\mu \mid D) = \alpha \prod_{j=1}^n \underbrace{\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{1}{2}\left(\frac{x_j - \mu}{\sigma}\right)^2\right]}_{p(x_j \mid \mu)} \underbrace{\frac{1}{\sqrt{2\pi}\sigma_0}\exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_0}{\sigma_0}\right)^2\right]}_{p(\mu)}$$

$$= \alpha' \exp\left[-\frac{1}{2}\left(\sum_{j=1}^n \left(\frac{\mu - x_j}{\sigma}\right)^2 + \left(\frac{\mu - \mu_0}{\sigma_0}\right)^2\right)\right]$$

$$= \alpha'' \exp\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{j=1}^n x_j + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$$

Normal distribution: the posterior is therefore itself normal:

$$p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\sigma_n}\exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right]$$


Univariate normal case: unknown µ, known σ²

Recovering mean and variance
Matching the exponents of the two expressions:

$$\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{j=1}^n x_j + \frac{\mu_0}{\sigma_0^2}\right)\mu + \alpha''' = \left(\frac{\mu - \mu_n}{\sigma_n}\right)^2 = \frac{1}{\sigma_n^2}\mu^2 - \frac{2\mu_n}{\sigma_n^2}\mu + \frac{\mu_n^2}{\sigma_n^2}$$

Solving for µ_n and σ_n² we obtain:

$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0 \qquad \sigma_n^2 = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2 + \sigma^2}$$

where $\hat{\mu}_n$ is the sample mean: $\hat{\mu}_n = \frac{1}{n}\sum_{j=1}^n x_j$


Univariate normal case: unknown µ, known σ²

Interpreting the posterior

$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0 \qquad \sigma_n^2 = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2 + \sigma^2}$$

- The posterior mean is a linear combination of the prior mean (µ_0) and the sample mean (µ̂_n)
- The more training examples n are seen, the more the sample mean dominates over the prior mean (unless σ_0² = 0)
- The more training examples n are seen, the more the variance decreases, making the distribution sharply peaked over its mean:

$$\lim_{n\to\infty} \frac{\sigma_0^2\sigma^2}{n\sigma_0^2 + \sigma^2} = \lim_{n\to\infty} \sigma_n^2 = 0$$
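A minimal sketch of this behaviour, assuming the closed-form update above (prior, known variance, and data are illustrative):

```python
import numpy as np

mu0, sigma0_sq = 0.0, 4.0                 # prior N(mu0, sigma0^2)
sigma_sq = 1.0                            # known data variance
rng = np.random.default_rng(0)
x = rng.normal(2.0, np.sqrt(sigma_sq), size=50)

for n in (1, 5, 10, 50):
    mu_hat = x[:n].mean()
    mu_n = (n * sigma0_sq * mu_hat + sigma_sq * mu0) / (n * sigma0_sq + sigma_sq)
    sigma_n_sq = sigma0_sq * sigma_sq / (n * sigma0_sq + sigma_sq)
    print(n, mu_n, sigma_n_sq)            # mean -> sample mean, variance -> 0
```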


Univariate normal case: unknown µ, known σ²

Mean posterior distribution varying sample size

[Figure: plots of the posterior density p(µ | x_1, x_2, ..., x_n) over µ for increasing numbers of training examples n; as n grows the posterior curves become narrower and more sharply peaked around the sample mean.]


Univariate normal case: unknown µ, known σ²

Computing the class-conditional density

$$p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu = \int \frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right] \frac{1}{\sqrt{2\pi}\sigma_n}\exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right] d\mu \;\sim\; N(\mu_n, \sigma^2 + \sigma_n^2)$$

Note (proof omitted)
The probability of x given the dataset for the class is a Gaussian with:
- mean equal to the posterior mean µ_n
- variance equal to the sum of the known variance σ² and an additional variance σ_n² due to the uncertainty on the mean
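Continuing the same sketch, the predictive density can be evaluated directly through the closed form N(µ_n, σ² + σ_n²) (again an illustration under the assumptions above, using scipy):

```python
import numpy as np
from scipy.stats import norm

mu0, sigma0_sq, sigma_sq = 0.0, 4.0, 1.0
x = np.array([2.1, 1.7, 2.5, 1.9])
n, mu_hat = len(x), x.mean()

mu_n = (n * sigma0_sq * mu_hat + sigma_sq * mu0) / (n * sigma0_sq + sigma_sq)
sigma_n_sq = sigma0_sq * sigma_sq / (n * sigma0_sq + sigma_sq)

predictive = norm(mu_n, np.sqrt(sigma_sq + sigma_n_sq))   # p(x|D)
print(predictive.pdf(2.0))   # density of a new example under p(x|D)
```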


Multivariate normal case: unknown µ, known Σ

Generalization of the univariate case

$$p(x \mid \mu) \sim N(\mu, \Sigma) \qquad p(\mu) \sim N(\mu_0, \Sigma_0)$$
$$\Downarrow$$
$$p(\mu \mid D) \sim N(\mu_n, \Sigma_n)$$
$$\Downarrow$$
$$p(x \mid D) \sim N(\mu_n, \Sigma + \Sigma_n)$$


Sufficient statistics

Definition
- Any function of a set of samples D is a statistic
- A statistic s = φ(D) is sufficient for some parameters θ if P(D|s, θ) = P(D|s)
- If θ is a random variable, a sufficient statistic contains all the relevant information D carries for estimating it:

$$p(\theta \mid D, s) = \frac{p(D \mid \theta, s)\, p(\theta \mid s)}{p(D \mid s)} = p(\theta \mid s)$$

Use
- A sufficient statistic allows compressing a sample D into (possibly few) values
- The sample mean and covariance are sufficient statistics for the true mean and covariance of a Gaussian distribution


Conjugate priors

Definition
- Given a likelihood function p(x|θ) and a prior distribution p(θ), p(θ) is a conjugate prior for p(x|θ) if the posterior distribution p(θ|x) is in the same family as the prior p(θ)

Examples

Likelihood          | Parameters             | Conjugate prior
--------------------|------------------------|----------------
Binomial            | p (probability)        | Beta
Multinomial         | p (probability vector) | Dirichlet
Normal              | µ (mean)               | Normal
Multivariate normal | µ (mean vector)        | Normal


Bernoulli distribution

Setting
- Boolean event: x = 1 for success, x = 0 for failure (e.g. tossing a coin)
- Parameter: θ = probability of success (e.g. head)
- Probability mass function: P(x|θ) = θ^x (1 − θ)^(1−x)
- Beta conjugate prior (with α = α_h + α_t):

$$P(\theta \mid \psi) = P(\theta \mid \alpha_h, \alpha_t) = \frac{\Gamma(\alpha)}{\Gamma(\alpha_h)\Gamma(\alpha_t)}\,\theta^{\alpha_h - 1}(1 - \theta)^{\alpha_t - 1}$$


Bernoulli distribution

Maximum likelihood estimation: example
- Dataset D = {H, H, T, T, T, H, H} of N realizations (e.g. head/tail coin-toss results)
- Likelihood function (h heads, t tails):

$$p(D \mid \theta) = \theta \cdot \theta \cdot (1 - \theta) \cdot (1 - \theta) \cdot (1 - \theta) \cdot \theta \cdot \theta = \theta^h (1 - \theta)^t$$

- Maximum likelihood parameter:

$$\frac{\partial}{\partial\theta}\ln p(D \mid \theta) = 0 \;\Rightarrow\; \frac{\partial}{\partial\theta}\left[h\ln\theta + t\ln(1 - \theta)\right] = 0$$
$$h\frac{1}{\theta} - t\frac{1}{1 - \theta} = 0 \;\Rightarrow\; h(1 - \theta) = t\theta \;\Rightarrow\; \theta = \frac{h}{h + t}$$

- h and t are the sufficient statistics


Bernoulli distribution

Bayesian estimation: example
- The parameter posterior is proportional to:

$$P(\theta \mid D, \psi) \propto P(D \mid \theta)\, P(\theta \mid \psi) \propto \theta^h(1 - \theta)^t\, \theta^{\alpha_h - 1}(1 - \theta)^{\alpha_t - 1}$$

i.e. the posterior is a beta distribution with parameters h + α_h and t + α_t:

$$P(\theta \mid D, \psi) \propto \theta^{h + \alpha_h - 1}(1 - \theta)^{t + \alpha_t - 1}$$

- The prediction for a new event is the expected value of the posterior beta:

$$P(x \mid D) = \int P(x \mid \theta)\, P(\theta \mid D, \psi)\, d\theta = \int \theta\, P(\theta \mid D, \psi)\, d\theta = E_{P(\theta \mid D, \psi)}[\theta] = \frac{h + \alpha_h}{h + t + \alpha_h + \alpha_t}$$
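A minimal sketch of this beta-Bernoulli update, assuming scipy (the prior pseudo-counts are illustrative):

```python
from scipy.stats import beta

ah, at = 2, 2                     # prior pseudo-counts alpha_h, alpha_t
h, t = 4, 3                       # observed heads/tails: D = {H,H,T,T,T,H,H}

posterior = beta(h + ah, t + at)  # Beta(h + alpha_h, t + alpha_t)
print(posterior.mean())           # P(heads|D) = (h+ah)/(h+t+ah+at) = 6/11
```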


Bernoulli distribution

Interpreting priors
- Our prior knowledge is encoded as a number α = α_h + α_t of imaginary experiments, in which we assume heads was observed α_h times
- α is called the equivalent sample size
- α → 0 reduces the estimation to the classical ML approach (frequentist)


Multinomial distribution

Setting
- Categorical event with r states: x ∈ {x_1, ..., x_r} (e.g. tossing a six-faced die)
- One-hot encoding z(x) = [z_1(x), ..., z_r(x)], with z_k(x) = 1 if x = x_k and 0 otherwise
- Parameters: θ = [θ_1, ..., θ_r], the probability of each state
- Probability mass function:

$$P(x \mid \theta) = \prod_{k=1}^r \theta_k^{z_k(x)}$$

- Dirichlet conjugate prior (with α = Σ_k α_k):

$$P(\theta \mid \psi) = P(\theta \mid \alpha_1, \ldots, \alpha_r) = \frac{\Gamma(\alpha)}{\prod_{k=1}^r \Gamma(\alpha_k)} \prod_{k=1}^r \theta_k^{\alpha_k - 1}$$


Multinomial distribution

Maximum likelihood estimation: example
- Dataset D of N realizations (e.g. results of tossing a die)
- Likelihood function (N_k = number of realizations in state x_k):

$$p(D \mid \theta) = \prod_{j=1}^N \prod_{k=1}^r \theta_k^{z_k(x_j)} = \prod_{k=1}^r \theta_k^{N_k}$$

- Maximum likelihood parameter: θ_k = N_k / N
- N_1, ..., N_r are the sufficient statistics


Multinomial distribution

Bayesian estimation: example
- The parameter posterior is proportional to:

$$P(\theta \mid D, \psi) \propto P(D \mid \theta)\, P(\theta \mid \psi) \propto \prod_{k=1}^r \theta_k^{N_k}\, \theta_k^{\alpha_k - 1}$$

i.e. the posterior is a Dirichlet distribution with parameters N_k + α_k, k = 1, ..., r:

$$P(\theta \mid D, \psi) \propto \prod_{k=1}^r \theta_k^{N_k + \alpha_k - 1}$$

- The prediction for a new event is the expected value of the posterior Dirichlet:

$$P(x_k \mid D) = \int \theta_k\, P(\theta \mid D, \psi)\, d\theta = E_{P(\theta \mid D, \psi)}[\theta_k] = \frac{N_k + \alpha_k}{N + \alpha}$$
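A minimal sketch of the Dirichlet-multinomial prediction above, assuming numpy (the prior and counts are illustrative):

```python
import numpy as np

alpha = np.ones(6)                      # Dirichlet prior over a six-faced die
counts = np.array([3, 1, 0, 2, 5, 1])   # observed counts N_k per face

predictive = (counts + alpha) / (counts.sum() + alpha.sum())
print(predictive)                       # P(x_k|D) = (N_k + alpha_k)/(N + alpha)
```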


APPENDIX

Additional reference material


Maximum-likelihood estimation

Multivariate Gaussian case: proof (mean)
- The gradient with respect to the mean is:

$$\nabla_\mu \sum_{j=1}^n \left[-\frac{1}{2}(x_j - \mu)^t\Sigma^{-1}(x_j - \mu) - \frac{1}{2}\ln (2\pi)^d|\Sigma|\right] = \sum_{j=1}^n \Sigma^{-1}(x_j - \mu)$$

Note
Use $\frac{\partial}{\partial x} x^T A x = A^T x + A x = 2Ax$ for symmetric A.


Maximum-likelihood estimation

Multivariate Gaussian case: proof (mean)
- Setting the gradient to zero gives:

$$\sum_{j=1}^n \Sigma^{-1}(x_j - \mu) = 0 \quad\Rightarrow\quad \sum_{j=1}^n (x_j - \mu) = \Sigma\, 0 = 0$$
$$\sum_{j=1}^n x_j = \sum_{j=1}^n \mu = n\mu \quad\Rightarrow\quad \mu = \frac{1}{n}\sum_{j=1}^n x_j$$


Maximum-likelihood estimation

Multivariate Gaussian case: proof (covariance)
- The gradient with respect to the covariance is:

$$\frac{\partial}{\partial\Sigma} \sum_{j=1}^n \left[-\frac{1}{2}(x_j - \mu)^t\Sigma^{-1}(x_j - \mu) - \frac{1}{2}\ln (2\pi)^d|\Sigma|\right] = -\frac{1}{2}\left[\sum_{j=1}^n \frac{\partial}{\partial\Sigma}(x_j - \mu)^t\Sigma^{-1}(x_j - \mu) + \sum_{j=1}^n \frac{\partial}{\partial\Sigma}\ln (2\pi)^d|\Sigma|\right]$$


Maximum-likelihood estimation

Multivariate Gaussian case: proof (covariance)

$$\frac{\partial}{\partial\Sigma}(x_j - \mu)^t\Sigma^{-1}(x_j - \mu) = (x_j - \mu)(x_j - \mu)^t\, \frac{\partial}{\partial\Sigma}\Sigma^{-1} = -(x_j - \mu)(x_j - \mu)^t\, \Sigma^{-2}$$

Note
Use the matrix derivative rule $\frac{\partial}{\partial B}\mathrm{tr}(ABC) = CA$, where A = (x_j − µ)^t, B = Σ^{-1}, C = (x_j − µ), and tr(ABC) = ABC as ABC is a scalar.


Maximum-likelihood estimation

Multivariate Gaussian case: proof (covariance)

$$\frac{\partial}{\partial\Sigma}\ln (2\pi)^d|\Sigma| = \frac{1}{(2\pi)^d|\Sigma|}\,\frac{\partial}{\partial\Sigma}(2\pi)^d|\Sigma| = \frac{1}{(2\pi)^d|\Sigma|}\,(2\pi)^d\,\frac{\partial}{\partial\Sigma}|\Sigma| = |\Sigma|^{-1}|\Sigma|\Sigma^{-1} = \Sigma^{-1}$$

Note
Use the matrix derivative rule $\frac{\partial}{\partial A}|A| = |A|A^{-1}$.


Maximum-likelihood estimation

Multivariate Gaussian case: proof (covariance)
- Combining the two terms and setting the result equal to zero:

$$-\frac{1}{2}\left[\sum_{j=1}^n \underbrace{\left(-(x_j - \mu)(x_j - \mu)^t\,\Sigma^{-2}\right)}_{\frac{\partial}{\partial\Sigma}(x_j - \mu)^t\Sigma^{-1}(x_j - \mu)} + \sum_{j=1}^n \underbrace{\Sigma^{-1}}_{\frac{\partial}{\partial\Sigma}\ln (2\pi)^d|\Sigma|}\right] = 0$$


Maximum-likelihood estimation

Multivariate Gaussian case: proof (covariance)

$$\sum_{j=1}^n \Sigma^{-1} = \sum_{j=1}^n (x_j - \mu)(x_j - \mu)^t\,\Sigma^{-2}$$
$$\Sigma^2 \sum_{j=1}^n \Sigma^{-1} = \Sigma^2 \sum_{j=1}^n (x_j - \mu)(x_j - \mu)^t\,\Sigma^{-2}$$
$$\sum_{j=1}^n \Sigma = \sum_{j=1}^n (x_j - \mu)(x_j - \mu)^t$$
$$n\Sigma = \sum_{j=1}^n (x_j - \mu)(x_j - \mu)^t \quad\Rightarrow\quad \Sigma = \frac{1}{n}\sum_{j=1}^n (x_j - \mu)(x_j - \mu)^t$$


Bayesian estimation

Gamma distribution
- Defined on the interval [0, ∞)
- Parameters: α > 0 (shape), β > 0 (rate)
- Probability density function:

$$p(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x}$$

- $E[x] = \frac{\alpha}{\beta}$, $\mathrm{Var}[x] = \frac{\alpha}{\beta^2}$

Note
Used to model the prior distribution of the precision (inverse variance, i.e. λ = 1/σ²).


Bayesian estimation

Univariate normal case: unknown µ and λ = 1/σ²
- Examples are drawn from: p(x|µ, λ) ∼ N(µ, 1/λ)
- The prior of mean and precision is the Normal-Gamma distribution:

$$p(\mu, \lambda) = p(\mu \mid \lambda)\, p(\lambda) = N\!\left(\mu \,\middle|\, \mu_0, \frac{1}{\kappa_0\lambda}\right) \mathrm{Ga}(\lambda \mid \alpha_0, \beta_0) = \mathrm{NG}(\mu, \lambda \mid \mu_0, \kappa_0, \alpha_0, \beta_0)$$


Univariate normal case: unknown µ and λ = 1/σ²

A posteriori parameter density

$$p(\mu, \lambda \mid D) = \frac{1}{p(D)} \prod_{j=1}^n \underbrace{\frac{\lambda^{1/2}}{\sqrt{2\pi}}\exp\left[-\frac{\lambda}{2}(x_j - \mu)^2\right]}_{p(x_j \mid \mu, \lambda)} \underbrace{\frac{(\kappa_0\lambda)^{1/2}}{\sqrt{2\pi}}\exp\left[-\frac{\kappa_0\lambda}{2}(\mu - \mu_0)^2\right]}_{p(\mu \mid \lambda)} \underbrace{\frac{\beta_0^{\alpha_0}}{\Gamma(\alpha_0)}\lambda^{\alpha_0 - 1}\exp(-\beta_0\lambda)}_{p(\lambda)}$$

$$\propto \lambda^{\alpha_0 + n/2 - 1}\exp(-\beta_0\lambda)\,\lambda^{1/2}\exp\left[-\frac{\lambda}{2}\left(\sum_{j=1}^n (x_j - \mu)^2 + \kappa_0(\mu - \mu_0)^2\right)\right]$$

The a posteriori parameter density is still Normal-Gamma:

$$p(\mu, \lambda \mid D) = \mathrm{NG}(\mu, \lambda \mid \mu_n, \kappa_n, \alpha_n, \beta_n)$$


Univariate normal case: unknown µ and λ = 1/σ²

The a posteriori parameter density is still Normal-Gamma, p(µ, λ|D) = NG(µ, λ|µ_n, κ_n, α_n, β_n), where:

$$\mu_n = \frac{\kappa_0\mu_0 + n\hat{\mu}_n}{\kappa_0 + n} \qquad \kappa_n = \kappa_0 + n \qquad \alpha_n = \alpha_0 + n/2$$
$$\beta_n = \beta_0 + \frac{1}{2}\sum_{j=1}^n (x_j - \hat{\mu}_n)^2 + \frac{\kappa_0 n (\hat{\mu}_n - \mu_0)^2}{2(\kappa_0 + n)}$$
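A minimal sketch of this Normal-Gamma update, assuming numpy and the closed-form expressions above (the hyperparameters and data are illustrative):

```python
import numpy as np

mu0, kappa0, alpha0, beta0 = 0.0, 1.0, 1.0, 1.0   # NG prior hyperparameters
x = np.array([2.3, 1.8, 2.9, 2.1, 2.6])
n, mu_hat = len(x), x.mean()

mu_n = (kappa0 * mu0 + n * mu_hat) / (kappa0 + n)
kappa_n = kappa0 + n
alpha_n = alpha0 + n / 2
beta_n = (beta0 + 0.5 * ((x - mu_hat) ** 2).sum()
          + kappa0 * n * (mu_hat - mu0) ** 2 / (2 * (kappa0 + n)))
print(mu_n, kappa_n, alpha_n, beta_n)
```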


Univariate normal case: unknown µ and λ = 1/σ²

Interpreting the posterior
- The posterior mean is a weighted average of the prior mean (µ_0) and the sample mean (µ̂_n), weighted by κ_0 and n respectively:

$$\mu_n = \frac{\kappa_0\mu_0 + n\hat{\mu}_n}{\kappa_0 + n}$$

- The posterior κ_n is the prior κ_0 increased by the number of samples n: κ_n = κ_0 + n
- The posterior α_n is the prior α_0 increased by half the number of samples: α_n = α_0 + n/2


Univariate normal case: unknown µ and λ = 1/σ²

Interpreting the posterior
- The posterior sum of squares β_n is the sum of the prior sum of squares β_0, the sample sum of squares $\frac{1}{2}\sum_{j=1}^n (x_j - \hat{\mu}_n)^2$, and a term due to the discrepancy between the sample mean and the prior mean:

$$\beta_n = \beta_0 + \frac{1}{2}\sum_{j=1}^n (x_j - \hat{\mu}_n)^2 + \frac{\kappa_0 n (\hat{\mu}_n - \mu_0)^2}{2(\kappa_0 + n)}$$


Univariate normal case: unknown µ and λ = 1/σ²

Computing the posterior predictive

$$p(x \mid D) = \int_\mu \int_\lambda p(x \mid \mu, \lambda)\, p(\mu, \lambda \mid D)\, d\mu\, d\lambda = \frac{P(x, D)}{P(D)} = t_{2\alpha_n}\!\left(x \,\middle|\, \mu_n, \frac{\beta_n(\kappa_n + 1)}{\alpha_n\kappa_n}\right)$$

It is a Student's t distribution with 2α_n degrees of freedom, mean µ_n and variance parameter $\frac{\beta_n(\kappa_n + 1)}{\alpha_n\kappa_n}$ (proof omitted).


Bayesian estimation

Wishart distribution
- Defined over d × d positive semi-definite matrices
- Parameters: ν > d − 1 (degrees of freedom), T > 0 (d × d scale matrix)
- Probability density function:

$$p(X; \nu, T) = \frac{1}{2^{\nu d/2}\,|T|^{\nu/2}\,\Gamma_d(\nu/2)}\,|X|^{\frac{\nu - d - 1}{2}} \exp\left[-\frac{1}{2}\mathrm{tr}(T^{-1}X)\right]$$

- $E[X] = \nu T$, $\mathrm{Var}[X_{ij}] = \nu\,(T_{ij}^2 + T_{ii}T_{jj})$

Note
Used to model the prior distribution of the precision matrix (inverse covariance matrix, i.e. Λ = Σ^{-1}); T is the prior covariance.


Bayesian estimation

Multivariate normal case: unknown µ and Σ
- Examples are drawn from: p(x|µ, Λ) ∼ N(µ, Λ^{-1})
- The prior of mean and precision is the Normal-Wishart distribution:

$$p(\mu, \Lambda) = p(\mu \mid \Lambda)\, p(\Lambda) = N(\mu \mid \mu_0, (\kappa_0\Lambda)^{-1})\, \mathrm{Wi}(\Lambda \mid \nu, T)$$


Multivariate normal case: unknown µ and Σ

A posteriori parameter density

$$p(\mu, \Lambda \mid D) = N(\mu \mid \mu_n, (\kappa_n\Lambda)^{-1})\, \mathrm{Wi}(\Lambda \mid \nu_n, T_n)$$

where

$$\mu_n = \frac{\kappa_0\mu_0 + n\hat{\mu}_n}{\kappa_0 + n} \qquad T_n = T + \sum_{i=1}^n (x_i - \hat{\mu}_n)(x_i - \hat{\mu}_n)^T + \frac{\kappa_0 n}{\kappa_0 + n}(\mu_0 - \hat{\mu}_n)(\mu_0 - \hat{\mu}_n)^T$$
$$\nu_n = \nu + n \qquad \kappa_n = \kappa_0 + n$$

Computing the posterior predictive

$$p(x \mid D) = t_{\nu_n - d + 1}\!\left(x \,\middle|\, \mu_n, \frac{T_n(\kappa_n + 1)}{\kappa_n(\nu_n - d + 1)}\right)$$