3 MIXTURE DENSITY ESTIMATION

In this chapter we consider mixture densities, the main building block for the dimension reduction techniques described in the following chapters. In the first section we introduce mixture densities and the expectation-maximization (EM) algorithm to estimate their parameters from data. The EM algorithm finds, from an initial parameter estimate, a sequence of parameter estimates that yield increasingly higher data log-likelihood. The algorithm is guaranteed to converge to a local maximum of the data log-likelihood as a function of the parameters. However, this local maximum may yield a significantly lower log-likelihood than the globally optimal parameter estimate. The first contribution we present is a technique that is empirically found to avoid many of the poor local maxima found when using random initial parameter estimates. Our technique finds an initial parameter estimate by starting with a one-component mixture and adding components to the mixture one-by-one. In Section 3.2 we apply this technique to mixtures of Gaussian densities, and in Section 3.3 to k-means clustering. Each iteration of the EM algorithm requires a number of computations that scales linearly with the product of the number of data points and the number of mixture components; this limits its applicability in large-scale applications with many data points and mixture components. In Section 3.4 we present a technique to speed up the estimation of mixture models from large quantities of data, where the amount of computation can be traded against the accuracy of the algorithm. However, for any preferred accuracy the algorithm is guaranteed in each step to increase a lower bound on the data log-likelihood.

3.1 The EM algorithm and Gaussian mixture densities

In this section we describe the expectation-maximization (EM) algorithm for estimating the parameters of mixture densities. Parameter estimation algorithms are sometimes also referred to as ‘learning algorithms’, since the machinery that implements the algorithm, in a sense, ‘learns’ about the data by estimating the parameters. Mixture models,


a weighted sum of finitely many elements of some parametric class of component densities, form an expressive class of models for density estimation. Due to the development of automated procedures to estimate mixture models from data, applications in a wide range of fields have emerged in recent decades. Examples are density estimation, clustering, and estimating class-conditional densities in supervised learning settings. Using the EM algorithm it is relatively straightforward to apply density estimation techniques in cases where some data is missing. The missing data could be the class labels of some objects in partially supervised classification problems, or the value of some features that describe the objects for which we try to find a density estimate.

3.1.1 Mixture densities

A mixture density (McLachlan and Peel, 2000) is defined as a weighted sum of, say, k component densities. The component densities are restricted to a particular parametric class of densities that is assumed to be appropriate for the data at hand or attractive for computational reasons. Let us denote by p(x; θs) the s-th component density, where θs are the component parameters. We use πs to denote the weighting factor of the s-th component in the mixture. The weights must satisfy two constraints: (i) non-negativity: πs ≥ 0, and (ii) partition of unity: Σ_{s=1}^{k} πs = 1. The weights πs are also known as ‘mixing proportions’ or ‘mixing weights’ and can be thought of as the probability p(s) that a data sample will be drawn from mixture component s. A k-component mixture density is then defined as:

p(x) ≡ Σ_{s=1}^{k} πs p(x; θs).   (3.1)

For a mixture we collectively denote all parameters with θ = {θ1, . . . , θk, π1, . . . , πk}. Throughout this thesis we assume that all data are independently and identically distributed (i.i.d.), and hence that the likelihood of a set of data vectors is just the product of the individual likelihoods.
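As a small, self-contained illustration of definition (3.1), the sketch below evaluates a two-component univariate Gaussian mixture on a grid of points; the parameter values are arbitrary choices for the example, not taken from the text.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x, pis, mus, sigmas):
    """Evaluate p(x) = sum_s pi_s p(x; theta_s), cf. (3.1)."""
    return sum(pi * gaussian_pdf(x, mu, s) for pi, mu, s in zip(pis, mus, sigmas))

# Example parameters: mixing weights are non-negative and sum to one.
pis, mus, sigmas = [0.3, 0.7], [-1.0, 2.0], [0.5, 1.0]
x = np.linspace(-4, 6, 101)
p = mixture_pdf(x, pis, mus, sigmas)
# Since the weights form a partition of unity, p is itself a proper
# density: non-negative, and its numerical integral is close to one.
```

Because each component is a density and the weights sum to one, the mixture is again a proper density, which is what the two constraints on the πs guarantee.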

One can think of a mixture density as modelling a process where first a ‘source’ s is selected according to the multinomial distribution {π1, . . . , πk} and then a sample is drawn from the corresponding component density p(x; θs). Thus, the probability of selecting source s and datum x is πs p(x; θs). The marginal probability of selecting datum x is then given by (3.1). We can think of the source that generated a data vector x as ‘missing information’: we only observe x and do not know the generating source. The expectation-maximization algorithm, presented in the next section, can be understood in terms of iteratively estimating this missing information.

An important derived quantity is the ‘posterior probability’ of a mixture component given a data vector. One can think of this as a distribution over which mixture component generated a particular data vector, i.e. “Which component density was this data vector drawn from?” or “To which cluster does this data vector belong?”. The posterior distribution on the mixture components is defined using Bayes’ rule:

p(s|x) ≡ πs p(x; θs) / p(x) = πs p(x; θs) / Σ_{s′} πs′ p(x; θs′).   (3.2)

The expectation-maximization algorithm to estimate the parameters of a mixture model from data makes essential use of these posterior probabilities.

Mixture modelling is also known as semi-parametric density estimation, and it can be placed in between two extremes: parametric and non-parametric density estimation. Parametric density estimation assumes the data is drawn from a density in a parametric class, say the class of Gaussian densities. The estimation problem then reduces to finding the parameters of the Gaussian that fits the data best. The assumption underlying parametric density estimation is often unrealistic but allows for very efficient parameter estimation. At the other extreme, non-parametric methods do not assume a particular form of the density from which the data is drawn. Non-parametric estimates typically take the form of a mixture density with a mixture component, often referred to as a ‘kernel’, for every data point in the data set. A well-known non-parametric density estimator is the Parzen estimator (Parzen, 1962), which uses Gaussian components with mean equal to the corresponding data point and a small isotropic covariance. Non-parametric estimates can implement a large class of densities. The price we have to pay is that to evaluate the estimator at a new point we have to evaluate all the kernels, which is computationally demanding if the estimate is based on a large data set. Mixture modelling strikes a balance between these extremes: a large class of densities can be implemented, and we can evaluate the density efficiently, since only relatively few density functions have to be evaluated.
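The posterior of (3.2) is straightforward to compute once the joint probabilities πs p(x; θs) are available. A minimal sketch for a univariate Gaussian mixture, with made-up example parameters, is:

```python
import numpy as np

def responsibilities(x, pis, mus, sigmas):
    """Posterior p(s|x) of (3.2) for a univariate Gaussian mixture.

    Returns an array of shape (k,) that sums to one.
    """
    pis, mus, sigmas = map(np.asarray, (pis, mus, sigmas))
    # Joint probabilities pi_s * p(x; theta_s) for each component s.
    joint = pis * np.exp(-0.5 * ((x - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    # Bayes' rule: normalize by the marginal p(x) = sum_s pi_s p(x; theta_s).
    return joint / joint.sum()

# A point lying near the first component receives almost all posterior mass.
post = responsibilities(-1.0, [0.5, 0.5], [-1.0, 3.0], [1.0, 1.0])
```

The normalization in the last line of the function is exactly the marginal (3.1), so the returned vector is a proper distribution over the k components.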

3.1.2 Parameter estimation with the EM algorithm

The first step when using a mixture model is to determine its architecture: a proper class of component densities and the number of component densities in the mixture. We will discuss these issues in Section 3.1.3. After these design choices have been made, we estimate the free parameters in the mixture model such that the model ‘fits’ our data as well as possible. The expectation-maximization algorithm is the most popular method to fit the parameters of mixture models to a given data set. We define “fits the data as well as possible” as “assigns maximum likelihood to the data”. Hence, fitting the model to given data becomes searching for the maximum-likelihood parameters for our data in the set of probabilistic models defined by the chosen architecture. Since the logarithm is a monotonically increasing function, the maximum likelihood criterion is equivalent to a maximum log-likelihood criterion, and these criteria are often interchanged. Due to the i.i.d. assumption, the log-likelihood of a data set


XN = {x1, . . . , xN} can be written as:

L(XN, θ) = log p(XN; θ) = log Π_{n=1}^{N} p(xn; θ) = Σ_{n=1}^{N} log p(xn; θ).   (3.3)

When no confusion arises, the dependence of the log-likelihood on XN is not made explicit and we simply write L(θ).

Finding the maximum likelihood parameters for a single component density is easy for a wide range of component densities and can often be done in closed form. This is, for example, the case for Gaussian mixture components. However, if the probabilistic model is a mixture, the estimation often becomes considerably harder, because the (log-)likelihood as a function of the parameters may have many local optima. Hence, some non-trivial optimization is needed to obtain good parameter estimates. The expectation-maximization (EM) algorithm (Dempster et al., 1977) finds a local optimum of the log-likelihood function given some initial parameter values. In our exposition we follow the generalized view on EM of (Neal and Hinton, 1998). The greatest advantages of the EM algorithm over other methods are (i) no parameters have to be set that influence the optimization algorithm, such as the step-size for gradient-based algorithms, and (ii) its ease of implementation. The biggest drawback is, as for all local-optimization methods, the sensitivity of the found solution to the initial parameter values. This sensitivity can be partially resolved by either (i) performing several runs from different initial parameter values and keeping the best, or (ii) finding the mixture parameters in a ‘greedy’ manner by starting with a single-component mixture and adding new components one at a time (this is the topic of Section 3.2). Other parameter estimation techniques are gradient-based and sampling-based methods; however, these are not treated in this thesis.

Intuitively, the idea of the EM algorithm is to make estimates of the ‘missing information’ mentioned in the previous section: to which mixture component do we ascribe each data vector?
The EM algorithm proceeds by (E-step) estimating to which component each data point belongs and (M-step) re-estimating the parameters on the basis of this estimation. This sounds like a chicken-and-egg problem: given the assignment of data to components we can re-estimate the components, and given the components we can re-assign the data to the components. Indeed, this is a chicken-and-egg problem, and we simply start with either initial parameters or an initial assignment. However, after each iteration of EM, we are guaranteed that the re-estimated parameters give at least as high a log-likelihood as the previous parameter values.

Technically, the idea of EM is to iteratively define a lower bound on the log-likelihood and to maximize this lower bound. The lower bound is obtained by subtracting from the log-likelihood a non-negative quantity known as the Kullback-Leibler (KL) divergence. The KL divergence is an asymmetric dissimilarity measure between two probability distributions p and q from the field of information theory (Cover and Thomas, 1991). It is defined for discrete distributions p and q with domain {1, . . . , k} as:

D(q‖p) ≡ Σ_{s=1}^{k} q(s) log [q(s) / p(s)] = −H(q) − Σ_{s=1}^{k} q(s) log p(s) ≥ 0,   (3.4)

where H(·) denotes the entropy of a distribution, an information-theoretic measure of the ‘uncertainty’ or ‘information’ in a distribution. The KL divergence is defined analogously for probability density functions by replacing the sum over the domain with an integral over the domain. The KL divergence is zero if and only if the two distributions p and q are identical.

We will use the KL divergence to measure, for each data point, how well the true posterior distribution p(s|x) matches an approximating distribution q for this data point. Since the KL divergence is non-negative, we can bound each of the terms log p(xn) in the log-likelihood (3.3) from below by subtracting the KL divergence D(qn‖p(s|xn)). Note that this bound holds for any distribution qn and becomes tight if and only if qn = p(s|xn). We will use q to denote the set of distributions {q1, . . . , qN}. Combining the bounds on the individual log-likelihoods log p(xn), the complete log-likelihood (3.3) can be bounded by:

L(θ) = Σ_{n=1}^{N} log p(xn) ≥ F(θ, q) = Σ_{n=1}^{N} [log p(xn; θ) − D(qn‖p(s|xn))]   (3.5)

= Σ_{n=1}^{N} [H(qn) + E_{qn} log (πs p(xn; θs))].   (3.6)

Now that we have the two forms (3.5) and (3.6), iterative maximization of the lower bound F on the log-likelihood becomes easy. In (3.5) the only term dependent on qn is the KL divergence D(qn‖p(s|xn)). To maximize F, recall that the KL divergence is zero if and only if its two arguments are identical. Therefore, to maximize F w.r.t. the qn we should set them to the posteriors: qn ← p(·|xn). In this case the lower bound equals the data log-likelihood: F = L.

To maximize F w.r.t. θ, only the second term of (3.6) is of interest. The maximization is often easy, since now we have to maximize a sum of single-component log-likelihoods instead of a sum of mixture model log-likelihoods. Let us use the compact notation qns for qn(s); then, if we consider optimizing the parameters of a single component s, the only relevant terms in (3.6) are:

Σ_{n=1}^{N} qns [log πs + log p(xn; θs)].   (3.7)

For many component densities from the exponential family the maximization of (3.7) w.r.t. the parameters θs can be done in closed form.
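To make the E- and M-steps concrete before the multivariate updates of Section 3.1.4, here is a compact EM sketch for a univariate Gaussian mixture; the synthetic data and the initial parameter values are arbitrary assumptions made for the example.

```python
import numpy as np

def em_step(x, pis, mus, sigmas):
    """One EM iteration for a univariate Gaussian mixture.

    E-step: responsibilities q_ns = p(s|x_n); M-step: closed-form
    maximization of (3.7) for Gaussian components.
    """
    # E-step: joint probabilities pi_s p(x_n; theta_s), shape (N, k).
    joint = pis * np.exp(-0.5 * ((x[:, None] - mus) / sigmas) ** 2) \
            / (sigmas * np.sqrt(2 * np.pi))
    q = joint / joint.sum(axis=1, keepdims=True)
    # M-step: weighted averages; the mean update uses the new mixing
    # weight and the variance update uses the new mean.
    Nk = q.sum(axis=0)
    pis = Nk / len(x)
    mus = (q * x[:, None]).sum(axis=0) / Nk
    sigmas = np.sqrt((q * (x[:, None] - mus) ** 2).sum(axis=0) / Nk)
    return pis, mus, sigmas

def log_likelihood(x, pis, mus, sigmas):
    """Data log-likelihood L(theta) of (3.3)."""
    joint = pis * np.exp(-0.5 * ((x[:, None] - mus) / sigmas) ** 2) \
            / (sigmas * np.sqrt(2 * np.pi))
    return np.log(joint.sum(axis=1)).sum()

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 0.5, 100)])
pis, mus, sigmas = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
ll = [log_likelihood(x, pis, mus, sigmas)]
for _ in range(25):
    pis, mus, sigmas = em_step(x, pis, mus, sigmas)
    ll.append(log_likelihood(x, pis, mus, sigmas))
# The recorded log-likelihoods ll are non-decreasing over the iterations.
```

Because the E-step uses the exact posterior and the M-step maximizes (3.7) in closed form, each iteration cannot decrease the log-likelihood, exactly as the bound argument above guarantees.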


Summarizing, the EM algorithm consists of iteratively maximizing F, in the E-step w.r.t. q and in the M-step w.r.t. θ. Intuitively, we can see the E-step as fixing a probabilistic assignment of every data vector to mixture components. The M-step then optimizes the mixture parameters given this assignment. Since after each E-step we have F = L, and both the E-step and the M-step do not decrease F, it follows immediately that iterative application of these steps cannot decrease L.

The qns are often referred to as ‘responsibilities’, since they indicate for each data point which mixture component models it. With every data vector xn we associate a random variable sn which can take values in {1, . . . , k}, indicating which component generated xn. These variables are often called ‘hidden’ or ‘unobserved’ variables, since we have no direct access to them. EM treats the mixture density log-likelihood maximization problem as a problem of dealing with missing data. Similar approaches can be taken to fit (mixture) density models to data where there truly is missing data, in the sense that certain values are missing in the input vectors x.

Generalized and variational EM. The EM algorithm presented above can be modified in two interesting ways, which both give up maximization of F and settle for an increase of F. The concession can be made either in the M-step (generalized EM) or in the E-step (variational EM). These modifications are useful when the computational requirements of either step of the algorithm become intractable.

For some models it may be difficult to maximize F w.r.t. θ but relatively easy to find values for θ that increase F. For example, it might be easy to maximize F w.r.t. any one of the elements of θ but not to maximize w.r.t. all of them simultaneously. However, convergence to local optima (or saddle points) can still be guaranteed when F is only increased instead of maximized in the M-step; algorithms that do so are known as ‘generalized EM’ algorithms.

Above, due to the independence assumption on the data vectors, it was possible to construct the EM lower bound by bounding each of the individual log-likelihoods log p(xn). This iterative bound optimization strategy can also be applied to other settings with unobserved variables. However, in some cases the hidden variables are dependent on each other given the observed data. If we have N hidden variables with k possible values each, then in principle the distribution over the hidden variables is characterized by k^N numbers. In such cases, the number of summands in the expected joint log-likelihood we have to optimize in the M-step also equals k^N. In this case tractability can be obtained if we do not allow general distributions q over the hidden variables, but only those from a class Q which gives us a tractable number of summands in the expected joint log-likelihood. For example, we can use distributions over the hidden variables that factor over the different hidden variables. EM algorithms using a restricted class Q are known as ‘variational EM’ algorithms. The term refers to the ‘variational parameters’ that characterize the distributions q ∈ Q. The variational parameters are optimized

in the E-step of the variational EM algorithm. Most often the variational approach is used to decouple dependencies present in the true distribution over hidden variables which give rise to intractability. An example of a variational EM algorithm is the mean-field algorithm used in hidden Markov random fields, which are applied to image segmentation problems (Celeux et al., 2003).

When using a variational EM algorithm we can no longer guarantee that F equals the log-likelihood after the E-step, since we cannot guarantee that the true posterior is in Q. Therefore, it is also not guaranteed that the log-likelihood increases after each EM iteration. However, we can at least still guarantee that the lower bound F on the log-likelihood increases after each EM iteration. By restricting the q to a specific class of distributions Q, we effectively augment the log-likelihood objective of the optimization algorithm with a penalty term (the generally non-zero KL divergence) that measures how close the true distribution over the hidden variables is to Q. Thus, variational EM parameter estimation algorithms have a bias towards parameters that yield posterior distributions over the hidden variables similar to a member of Q.
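The effect of restricting q can be made tangible numerically. The sketch below, with arbitrary example parameters, evaluates the bound F of (3.6) twice: once with qn equal to the exact posterior, where F = L, and once with a restricted ‘hard’ (winner-take-all) family of assignments, where the non-zero KL divergence leaves F strictly below L.

```python
import numpy as np

def bound_F(x, q, pis, mus, sigmas):
    """Lower bound F(theta, q) of (3.6): sum_n [H(q_n) + E_{q_n} log pi_s p(x_n; theta_s)]."""
    logjoint = np.log(pis) - 0.5 * ((x[:, None] - mus) / sigmas) ** 2 \
               - np.log(sigmas * np.sqrt(2 * np.pi))
    with np.errstate(divide="ignore", invalid="ignore"):
        entropy = -np.where(q > 0, q * np.log(q), 0.0).sum(axis=1)
    return (entropy + (q * logjoint).sum(axis=1)).sum()

# Arbitrary two-component univariate mixture and a few data points.
pis, mus, sigmas = np.array([0.4, 0.6]), np.array([0.0, 4.0]), np.array([1.0, 1.5])
x = np.array([-0.5, 1.0, 3.0, 5.0])

logjoint = np.log(pis) - 0.5 * ((x[:, None] - mus) / sigmas) ** 2 \
           - np.log(sigmas * np.sqrt(2 * np.pi))
L = np.log(np.exp(logjoint).sum(axis=1)).sum()          # log-likelihood (3.3)
posterior = np.exp(logjoint) / np.exp(logjoint).sum(axis=1, keepdims=True)
hard = np.eye(2)[posterior.argmax(axis=1)]              # restricted, hard q

F_exact = bound_F(x, posterior, pis, mus, sigmas)  # q_n = p(s|x_n): F = L
F_hard = bound_F(x, hard, pis, mus, sigmas)        # restricted q: F < L
```

The gap L − F_hard is exactly the summed KL divergence of (3.5), i.e. the penalty a variational EM algorithm implicitly pays for using the restricted family.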

3.1.3 Model selection

To apply mixture models, two model selection issues have to be resolved: (i) how many components should be used, and (ii) which class of component densities should be used. These two factors together determine the ‘model structure’. These design choices can be compared with the selection of the number and type of hidden nodes in feed-forward neural networks. A trade-off has to be made when choosing the class of models that is used. More complex models (e.g. more mixture components) allow for modelling of more properties of the data, and in general lead to a better fit to the data. However, when using more complex models, spurious artifacts of the data due to a noisy measurement system can also be captured by the model. Capturing accidental properties of the data may degrade the estimates, an effect known as ‘overfitting’. Less complexity allows for more robust identification of the best model within the selected complexity class and yields a smaller risk of overfitting the data. Throughout the previous decades several criteria have been proposed to resolve the model selection problem. However, the problem of model selection seems far from solved. Methods for model selection can be roughly divided into four groups:

1. Methods using concepts from statistical learning theory, e.g. structural risk minimization (Vapnik, 1995). Most of these methods use worst-case guarantees of performance on future data which are derived by a statistical analysis.

2. Information-theoretical methods, such as the minimum description length (MDL) principle (Rissanen, 1989). These methods are based on a data compression principle: a model is selected that allows for maximal compression of the data, where the rate of compression is measured by the total number of bits required to encode both the model and the data (using the model).

3. Bayesian methods that use a prior distribution on model structures and parameters together with a distribution on data given models to define a posterior distribution p(model structure|data). To compute the required probabilities several approximation techniques have been proposed, e.g. the Bayesian information criterion (Schwarz, 1978), variational Bayesian learning (Beal and Ghahramani, 2003), and a number of sampling techniques (Andrieu et al., 2003).

4. Holdout methods that keep part of the data to assess the performance of models learned on the rest of the data, e.g. cross-validation (Webb, 2002). Note that holdout methods are applicable only in cases where there is abundant data, and one can afford to ignore a part of the data for parameter estimation in order to obtain an independent test set.

Bayesian techniques are attractive because they allow for a unified treatment of model selection and parameter estimation. The application of compression-based methods like MDL is sometimes problematic because the so-called ‘two-part’ MDL code requires a particular coding scheme, and it is not always clear which coding scheme should be used (Verbeek, 2000). Methods based on worst-case analysis are known to give very conservative guarantees, which limits their use.

In general, to apply a particular model selection criterion, parameter estimation has to be performed for models with different numbers of components. The algorithm for estimating the parameters of a Gaussian mixture model that we present in Section 3.2 finds a parameter estimate of a k-component mixture by iteratively adding components, starting from a single Gaussian. Thus, this algorithm is particularly useful when the number of components has to be determined, since the model selection criterion can be applied to mixtures with 1, . . . , k components as components are added to the mixture.
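As a concrete example of applying such a criterion while varying the number of components, the following sketch fits univariate Gaussian mixtures with plain EM and scores them with the Bayesian information criterion, BIC = −2L + p log N, where p is the number of free parameters; the synthetic data, the quantile-based initialization, and the fixed iteration count are assumptions made for the illustration.

```python
import numpy as np

def fit_gmm(x, k, iters=50):
    """Fit a k-component univariate Gaussian mixture with plain EM;
    returns the final data log-likelihood."""
    pis = np.full(k, 1.0 / k)
    mus = np.quantile(x, (np.arange(k) + 0.5) / k)  # spread-out initial means
    sigmas = np.full(k, x.std())
    for _ in range(iters):
        joint = pis * np.exp(-0.5 * ((x[:, None] - mus) / sigmas) ** 2) \
                / (sigmas * np.sqrt(2 * np.pi))
        q = joint / joint.sum(axis=1, keepdims=True)
        Nk = q.sum(axis=0)
        pis = Nk / len(x)
        mus = (q * x[:, None]).sum(axis=0) / Nk
        sigmas = np.sqrt((q * (x[:, None] - mus) ** 2).sum(axis=0) / Nk) + 1e-9
    joint = pis * np.exp(-0.5 * ((x[:, None] - mus) / sigmas) ** 2) \
            / (sigmas * np.sqrt(2 * np.pi))
    return np.log(joint.sum(axis=1)).sum()

def bic(loglik, k, n):
    """BIC for a univariate k-component mixture: p = 3k - 1 free
    parameters (k means, k variances, k - 1 independent mixing weights)."""
    return -2.0 * loglik + (3 * k - 1) * np.log(n)

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 150), rng.normal(3, 1, 150)])
scores = {k: bic(fit_gmm(x, k), k, len(x)) for k in (1, 2, 3)}
best_k = min(scores, key=scores.get)  # lower BIC is better
```

On clearly bimodal data like this, the log-likelihood gain from a second component outweighs the BIC penalty, while further components typically do not, so criteria of this kind can be evaluated for each k as components are added.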

3.1.4 Gaussian mixture models

In this section we discuss the EM algorithm for mixtures of Gaussian densities (MoG). Recall that we already saw one example of a MoG: generative topographic mapping in Section 2.2.2. We also present a hierarchy of shapes the covariance matrices can take. In the hierarchy, increasingly stringent constraints are placed on the form of the covariance matrix. Then, we discuss how some of the shapes for covariance matrices can be interpreted as linear latent variable models.

Gaussian densities are probably the most commonly used densities to model continuous-valued data. The first reason for this popularity is that maximum likelihood parameter estimation can be done in closed form and only requires computation of the data mean and covariance. The second reason is that of all densities with a particular variance, the Gaussian density has the largest entropy and therefore is the most ‘vague’ density in this sense. This last property motivates the use of the Gaussian as a default density when there are no reasons to assume that some other parametric density is more appropriate to model the data at hand. A Gaussian density in a D-dimensional space, characterized by its mean µ ∈ ℝ^D and D × D covariance matrix Σ, is defined as:

N(x; θ) ≡ (2π)^{−D/2} |Σ|^{−1/2} exp( −(1/2) (x − µ)⊤ Σ^{−1} (x − µ) ),   (3.8)

where θ denotes the parameters µ and Σ, and |Σ| denotes the determinant of Σ. In order for (3.8) to be a proper density, it is necessary and sufficient that the covariance matrix be positive definite. Throughout this thesis we implicitly assume that the likelihood is bounded, e.g. by restricting the parameter space such that the determinant of the covariance matrices is bounded, and hence the maximum likelihood estimator is known to exist (Lindsay, 1983). Alternatively, the imperfect precision of each measurement can be taken into account by treating each data point as a Gaussian density centered on the data point with small but non-zero variance. We then maximize the expected log-likelihood, which is bounded by construction.

The EM algorithm for mixtures of Gaussian densities. In the E-step of the EM algorithm for a MoG we compute the posterior probabilities p(s|x) according to (3.2) and set qns ← p(s|xn). In the M-step we maximize F w.r.t. θ, by setting the partial derivatives of F w.r.t. the elements of θ to zero while taking the constraints on the πs into account, and find the updates:

πs ← (1/N) Σ_{n=1}^{N} qns,   (3.9)

µs ← (1/(Nπs)) Σ_{n=1}^{N} qns xn,   (3.10)

Σs ← (1/(Nπs)) Σ_{n=1}^{N} qns (xn − µs)(xn − µs)⊤.   (3.11)

The update of the mean uses the new mixing weight, and the update of the covariance matrix uses the new mean and mixing weight.

In many practical applications of MoGs for clustering and density estimation no constraints are imposed on the covariance matrix. Using an unconstrained covariance matrix, the number of parameters of a Gaussian density grows quadratically with the data dimensionality. For example, using 16 × 16 = 256 pixel gray-valued images of an object as data (treating an image as a 256-dimensional vector), we would need to estimate over


32,000 parameters! To reduce the number of parameters in the covariance matrix, it can be constrained to have a form that involves fewer parameters. In some applications a constrained covariance matrix can be appropriate to reflect certain assumptions about the data generating process; see the discussion below on latent variable models. Next, we discuss several frequently used ways to constrain the covariance matrix. We will use I to denote the identity matrix, Ψ to denote a diagonal matrix, and Λ to denote a D × d matrix, with d < D.

Type 1: factor analysis. The covariance matrices in factor analysis (FA) are constrained to be of the form:

Σ = Ψ + ΛΛ⊤.   (3.12)

The d columns of Λ are often referred to as the ‘factor loadings’. Each column of Λ can be associated with a latent variable (see the discussion on latent variable models below). The diagonal matrix Ψ models the variance of each data coordinate separately, and additional variance is added in the directions spanned by the columns of Λ. The number of parameters needed to specify the covariance matrix is O(Dd). In (Ghahramani and Hinton, 1996) the EM algorithm for mixtures of factor analyzers is given. The determinant and the inverse of the covariance matrix can be computed efficiently using two identities:

|A + BC| = |A| × |I + CA^{−1}B|,   (3.13)

(A + BCD)^{−1} = A^{−1} − A^{−1}B(C^{−1} + DA^{−1}B)^{−1}DA^{−1}.   (3.14)

Using these identities we only need to compute inverses and determinants of d × d and diagonal matrices, rather than of full D × D matrices:

|Ψ + ΛΛ⊤| = |Ψ| × |I + Λ⊤Ψ^{−1}Λ|,   (3.15)

(Ψ + ΛΛ⊤)^{−1} = Ψ^{−1} − Ψ^{−1}Λ(I + Λ⊤Ψ^{−1}Λ)^{−1}Λ⊤Ψ^{−1}.   (3.16)

Type 2: principal component analysis. Here the constraints are similar to those of FA, but the matrix Ψ is further constrained to be a multiple of the identity matrix:

Σ = σ²I + ΛΛ⊤ with σ > 0.   (3.17)

A MoG where the covariance matrices are of this type is termed a ‘mixture of probabilistic principal component analyzers’ (Tipping and Bishop, 1999). Principal component analysis (PCA) (Jolliffe, 1986) appears as a limiting case if we use just one mixture component and let the variance approach zero: σ² → 0 (Roweis, 1998). Let U be the D × d matrix with the eigenvectors of the data covariance matrix corresponding to the d largest eigenvalues, and V the diagonal matrix with the corresponding eigenvalues on its diagonal. Then the maximum likelihood estimate of Λ is given by Λ = UV^{1/2}. For probabilistic PCA the number of parameters is also O(Dd), as for FA.

To see the difference between PCA and FA, note the following. Using PCA we can arbitrarily rotate the original basis of our space: the covariance matrix that is obtained for the variables corresponding to the new basis is still of the PCA type. Thus the class of PCA models is closed under rotations, which is not the case for FA. On the other hand, if some of the data dimensions are scaled, the scaled FA covariance matrix is still of the FA type. Thus the class of FA models is closed under non-uniform scaling of the variables, which is not the case for PCA. Hence, if the relative scale of the variables is not considered meaningful, an FA model may be preferred over a PCA model, and vice versa if relative scale is important.

Type 3: k-subspaces. Here, we further restrict the covariance matrix such that the norms of all columns of Λ are equal:

Σ = σ²I + ΛΛ⊤ with Λ⊤Λ = ρ²I and σ, ρ > 0.   (3.18)

This type of covariance matrix was used in (Verbeek et al., 2002b) for a non-linear dimension reduction technique similar to the one described in Chapter 5. The number of parameters is again O(Dd). If we train a MoG with this type of covariance matrix with EM, the result is an algorithm known as k-subspaces (Kambhatla and Leen, 1994; de Ridder, 2001) under the following conditions: (i) all mixing weights are equal, πs = 1/k, and (ii) the σs and ρs are equal for all mixture components: σs = σ and ρs = ρ. Then, if we take the limit as σ → 0, we obtain k-subspaces. The term k-subspaces refers to the k linear subspaces spanned by the columns of the Λs. The k-subspaces algorithm, a variation on the k-means algorithm (see Section 2.1.2), iterates two steps:

1. Assign each data point to the subspace which reconstructs the data vector best (in terms of squared distance between the data point and the projection of the data point on the subspace).

2. For every subspace, compute the PCA subspace of the assigned data to obtain the updated subspace.

Type 4: isotropic. This shape of the covariance matrix is the most restricted:

Σ = σ²I with σ > 0.   (3.19)

This model is referred to as isotropic or spherical, since the variance is equal in all directions. It involves just the single parameter σ. If all components share the same variance σ² and we let it approach zero, the posterior on the components for a given data vector tends to put all mass on the component closest in Euclidean distance to the data vector (Bishop, 1995). This means that the E-step reduces to finding the closest component for each data vector. The expectation in (3.6) then only counts the term for the closest component. For the M-step this means that the centers µs of the components are found by simply averaging all the data vectors for which they are the closest. This simple case of the EM algorithm coincides exactly with the k-means algorithm.

Taking Σ diagonal, and thus letting all variables be independently distributed, is also a popular constraint, yielding O(D) parameters; however, it is not used by any of the techniques discussed in this thesis.

Linear latent variable models. Three types of covariance matrix (FA, PCA and k-subspaces) can be interpreted as linear latent variable models. The idea is that the data can be thought of as being generated in a low-dimensional latent space, which is linearly embedded in the higher-dimensional data space. Some noise may make the observed data deviate from the embedded linear subspace.

For these models it is conceptually convenient to introduce a new, hidden variable g for every data vector. This variable represents the (unknown) d-dimensional latent coordinate of the data vector. Using this variable we can write the density for the data vectors by marginalizing over the hidden variable:

p(g) = N(g; 0, I),   (3.20)

p(x|g) = N(x; µ + Λg, Ψ),   (3.21)

p(x) = ∫ p(x|g) p(g) dg = N(x; µ, Ψ + ΛΛ⊤).   (3.22)

It is not necessary to introduce g to formulate the model, but it may be convenient for computational reasons. For example, in (Roweis, 1998) this view is adopted to compute PCA subspaces without computing the data covariance matrix or the data inner-product matrix. The cost to compute these matrices is, respectively, O(ND²) and O(DN²). By treating the latent coordinates as hidden variables one can perform PCA with an EM algorithm that has a cost of O(dDN) per iteration to find the d principal components. This EM algorithm for PCA involves computing the posterior distribution on g given x, which is given by:

p(g|x) = N(g; m(x), C),   (3.23)

m(x) = CΛ⊤Ψ^{−1}(x − µ),   (3.24)

C^{−1} = I + Λ⊤Ψ^{−1}Λ.   (3.25)

We can use linear latent variable models to reduce the dimensionality of the data by inferring the d-dimensional latent coordinates from the D-dimensional data vectors and using the found latent coordinates for further processing of the data. Another option