APPLIED MACHINE LEARNING
Probability Density Functions and Gaussian Mixture Models
Discrete Probabilities

Consider two variables x and y taking discrete values over the intervals [1, ..., N_x] and [1, ..., N_y] respectively.
The joint probability is written p(x, y). The joint probability that variable x takes value i and variable y takes value j is written $P(x = i, y = j)$. $P(x \mid y)$ is the conditional probability of observing a value for x given a value for y:

$$P(x \mid y) = \frac{P(x, y)}{P(y)}$$
Bayes' theorem:

$$P(x \mid y) = \frac{P(y \mid x)\, P(x)}{P(y)}$$

When x and y are statistically independent:

$$P(x, y) = P(x)\, P(y)$$
Matlab Exercise I
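The exact exercise statement is not reproduced here; a minimal sketch in its spirit (the joint table below is a made-up example) manipulates a discrete joint distribution in Matlab:

```matlab
% Sketch: joint, marginal and conditional probabilities for Nx = 2, Ny = 3.
Pxy = [0.10 0.25 0.15;
       0.20 0.05 0.25];          % P(x = i, y = j); rows index x, columns y

Px = sum(Pxy, 2);                % marginal P(x = i) = sum_j P(x = i, y = j)
Py = sum(Pxy, 1);                % marginal P(y = j)

Px_given_y = Pxy ./ repmat(Py, 2, 1);   % conditional P(x | y)
Py_given_x = Pxy ./ repmat(Px, 1, 3);   % conditional P(y | x)

% Bayes' theorem: P(x | y) = P(y | x) P(x) / P(y)
bayes = (Py_given_x .* repmat(Px, 1, 3)) ./ repmat(Py, 2, 1);
disp(max(max(abs(bayes - Px_given_y))))  % ~0: both sides agree

% x and y are independent iff P(x,y) = P(x) P(y) for all entries:
disp(max(max(abs(Pxy - Px * Py))))       % > 0 here: not independent
```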
The marginal probability that variable x takes value i is given by summing the joint probability over all values of y:

$$P(x = i) = \sum_{j=1}^{N_y} P(x = i,\, y = j)$$

(We drop the x, y subscripts on N for simplicity of notation.)
The joint distribution is far richer than the marginals:
- The marginals of N variables taking K values each correspond to N(K - 1) probabilities.
- The joint distribution corresponds to ~K^N probabilities.

Pros of computing the joint distribution: it provides the statistical dependencies across all variables, as well as the marginal distributions.

Cons: computational costs grow exponentially with the number of dimensions (statistical power: ~10 samples are needed to estimate each parameter of a model).

Compute solely the conditional if you care only about dependencies across variables (this will be relevant for the lecture on non-linear regression methods).
A continuous function p(x), with $p(x) \ge 0$ and $\int p(x)\,dx = 1$, is the probability density function or probability distribution function (PDF) (sometimes also called probability distribution or simply density) of variable x.
Unlike a probability, the pdf is not bounded by 1: it can grow unbounded, depending on the value taken by x, as long as it integrates to 1.
The cumulative distribution function (or simply distribution function) of x is:

$$F(x^*) = P(x \le x^*) = \int_{-\infty}^{x^*} p(x)\, dx$$

p(x) dx ~ probability of x falling within an infinitesimal interval [x, x + dx].
The probability that x takes a value in the subinterval [a, b] is given by:

$$P(a \le x \le b) = \int_{a}^{b} p(x)\, dx$$

(Figure: a uniform distribution on x.)
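As a quick numerical illustration (the pdf and interval below are made-up choices for the example), one can check the normalization and compute such an interval probability in Matlab:

```matlab
% Sketch: a uniform pdf p(x) = 1/2 on [0, 2]; its normalization, cdf, and
% the probability of the subinterval [a, b].
xs = linspace(-1, 3, 4001);
p  = (xs >= 0 & xs <= 2) * 0.5;           % uniform pdf on [0, 2]

total = trapz(xs, p)                      % = 1: pdf normalization
F     = cumtrapz(xs, p);                  % cumulative distribution function

a = 0.5; b = 1.2;
Pab = trapz(xs(xs >= a & xs <= b), p(xs >= a & xs <= b))   % ~ (b - a)/2
```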
The expectation of the random variable x with probability P(x) (in the discrete case) and pdf p(x) (in the continuous case), also called the expected value or mean, is the mean of the observed values of x weighted by p(x). If X is the set of observations of x, then:

$$E[x] = \sum_{x \in X} x\, P(x) \quad \text{(discrete)}, \qquad E[x] = \int_{X} x\, p(x)\, dx \quad \text{(continuous)}$$
The variance measures the dispersion of the distribution around its mean:

$$\sigma^2 = E\left[(x - \mu)^2\right] = E[x^2] - \left(E[x]\right)^2$$
The uni-dimensional Gaussian or Normal distribution is a distribution with pdf given by:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

The Gaussian function is entirely determined by its mean and variance. For this reason, it is referred to as a parametric distribution.

(Illustrations from Wikipedia.)
~68% of the data lie within ±1σ of the mean, ~95% within ±2σ, and ~99.7% within ±3σ (illustrations from Wikipedia).

This is no longer true for arbitrary pdfs!
For pdfs other than the Gaussian distribution, the variance still represents a notion of dispersion around the expected value, but the ±1σ interval no longer contains ~68% of the probability mass.

(Figure: three Gaussian distributions f1, f2, f3 and the resulting distribution f = (f1 + f2 + f3)/3 obtained by superposing them, with its expectation and ±1σ interval.)
Matlab Demo I
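The demo script itself is not reproduced here; a minimal sketch in its spirit (the component means and variance below are made-up values) compares the ±1σ rule for a single Gaussian and for a superposition of three Gaussians:

```matlab
% Sketch: fraction of samples within +/- 1 std of the mean, for a single
% Gaussian (~0.68) versus a three-component mixture (generally not 0.68).
M   = 1e5;
g   = randn(M, 1);                         % samples from a single Gaussian
mus = [1; 2; 3];                           % made-up means of f1, f2, f3
m   = mus(randi(3, M, 1)) + 0.2 * randn(M, 1);  % samples from f = (f1+f2+f3)/3

frac_gauss = mean(abs(g - mean(g)) <= std(g))   % ~0.68 for a Gaussian
frac_mix   = mean(abs(m - mean(m)) <= std(m))   % generally far from 0.68
```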
The uni-dimensional Gaussian or Normal distribution is a distribution with pdf given by:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

The multi-dimensional Gaussian or Normal distribution has a pdf given by:

$$p(x) = \frac{1}{(2\pi)^{N/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

If x is N-dimensional, then μ is an N-dimensional mean vector and Σ is an N × N covariance matrix.
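A short sketch showing how to evaluate this pdf directly from the formula (the mean, covariance and query point are made-up values; no toolbox is required):

```matlab
% Sketch: evaluating the N-dimensional Gaussian pdf at a point x.
mu    = [0; 1];
Sigma = [1.0 0.6; 0.6 2.0];
x     = [0.5; 0.5];

N = length(mu);
d = x - mu;
p = exp(-0.5 * d' * (Sigma \ d)) / sqrt((2*pi)^N * det(Sigma))
```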
Isolines of a two-dimensional Gaussian p(x1, x2): the curves p(x) = cst are ellipses.

(Figure: isolines p(x) = cst in the (x1, x2) plane for two different Gaussians.)
The covariance matrix Σ of the multi-dimensional Gaussian can be constructed from a (centered) set of datapoints $X = \{x^i\}_{i=1...M}$, stored as the columns of an N × M matrix:

$$\Sigma = \frac{1}{M}\, X X^T$$
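A minimal sketch of this construction (the mixing matrix used to correlate the synthetic data is an arbitrary choice), comparing Σ = (1/M) X Xᵀ with Matlab's cov:

```matlab
% Sketch: empirical covariance from M centered 2-D datapoints.
M = 1000;
A = [1 0.8; 0 0.5];                 % made-up mixing to correlate the data
X = A * randn(2, M);                % zero-mean, correlated 2-D samples

X = X - repmat(mean(X, 2), 1, M);   % center the data
Sigma = (1/M) * (X * X')            % empirical covariance, as in the formula

% cov() expects datapoints as rows; second argument 1 -> normalize by M:
Sigma_check = cov(X', 1)
```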
Σ is square and symmetric. It can be decomposed using the eigenvalue decomposition:

$$\Sigma = V \Lambda V^T$$

$V = [v_1, ..., v_N]$: matrix of eigenvectors; $\Lambda = \mathrm{diag}(\lambda_1, ..., \lambda_N)$: diagonal matrix composed of the eigenvalues.

For the 1-std ellipse, the axes' lengths along the 1st and 2nd eigenvectors are equal to $\sqrt{\lambda_1}$ and $\sqrt{\lambda_2}$, with $\lambda_1 \ge \lambda_2$. Each isoline corresponds to a scaling of the 1-std ellipse.

(Figure: the 1st and 2nd eigenvectors of Σ overlaid on the isolines in the (x1, x2) plane.)
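A sketch of the eigenvalue decomposition and the resulting 1-std ellipse axes (Σ is a made-up example):

```matlab
% Sketch: Sigma = V * Lambda * V' and the 1-std ellipse axis half-lengths.
Sigma = [1.0 0.6; 0.6 2.0];            % made-up covariance matrix

[V, Lambda] = eig(Sigma);              % columns of V are the eigenvectors
lambda = diag(Lambda)                  % eigenvalues

axes_len = sqrt(lambda)                % half-lengths of the 1-std ellipse axes

max(max(abs(V * Lambda * V' - Sigma))) % ~0: reconstruction check
```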
PCA identifies a suitable representation of a multivariate data set by decorrelating the dataset:

$$p\left(e_1^T X\right) \sim N(\mu_1, \lambda_1), \qquad p\left(e_2^T X\right) \sim N(\mu_2, \lambda_2)$$

When projected onto $e_1$ and $e_2$, the set of datapoints appears to follow two uncorrelated Normal distributions.

(Figure: a data cloud in the (x1, x2) plane with its 1st and 2nd eigenvectors.)
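A sketch of this decorrelation property on synthetic data (the mixing matrix is an arbitrary choice): projecting onto the eigenvectors yields an approximately diagonal covariance, with the eigenvalues as variances:

```matlab
% Sketch: PCA decorrelates the dataset.
M = 1000;
X = [1 0.8; 0 0.5] * randn(2, M);   % correlated, zero-mean 2-D data
X = X - repmat(mean(X, 2), 1, M);   % center the data

Sigma = (1/M) * (X * X');
[V, Lambda] = eig(Sigma);

Y = V' * X;                         % projections onto e1 and e2
Sigma_proj = (1/M) * (Y * Y')       % ~diagonal: variances = eigenvalues
```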
Consider two random variables x1 and x2 with joint distribution p(x1, x2). The marginal probability of x1 is:

$$p(x_1) = \int p(x_1, x_2)\, dx_2$$

The conditional probability is given by:

$$p(x_2 \mid x_1) = \frac{p(x_1, x_2)}{p(x_1)}$$
The conditional and marginal pdfs of a multi-dimensional Gaussian are all Gaussian!

(Illustrations from Wikipedia: the joint density p(x1, x2); the conditional density of x2 given x1 = 0; the marginal densities of x1 and x2.)
Matlab Exercise II
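The exercise statement is not reproduced here; a sketch in its spirit uses the standard Gaussian conditioning formulas on a made-up bivariate Gaussian:

```matlab
% For a joint Gaussian over (x1, x2), the conditional p(x2 | x1 = a) is
% Gaussian with:
%   mu_cond     = mu2 + S21 / S11 * (a - mu1)
%   sigma2_cond = S22 - S21 / S11 * S12
mu    = [1; 2];
Sigma = [1.0 0.6; 0.6 2.0];
a     = 0;                                   % condition on x1 = 0

mu_cond     = mu(2) + Sigma(2,1) / Sigma(1,1) * (a - mu(1))
sigma2_cond = Sigma(2,2) - Sigma(2,1) / Sigma(1,1) * Sigma(1,2)

% The marginal of x2 is simply N(mu(2), Sigma(2,2)).
```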
When x1 and x2 are statistically independent, the expectation of their product is equal to the product of their expectations:

$$E[x_1 x_2] = E[x_1]\, E[x_2]$$

Equivalently, their covariance $\mathrm{cov}(x_1, x_2) = E[x_1 x_2] - E[x_1]\, E[x_2] = 0$: they are uncorrelated.

Statistical independence ensures uncorrelatedness. The converse is not true:

Independent ⇒ Uncorrelated, but Uncorrelated ⇏ Independent.
Are x1 and x2 uncorrelated? Are x1 and x2 statistically independent?

              x2 = -1    x2 = 0    x2 = 1    Total
  x1 = -1       3/12        0        3/12      1/2
  x1 =  1       1/12      4/12       1/12      1/2
  Total          1/3       1/3        1/3        1
x1 and x2 are uncorrelated, since

$$E[x_1 x_2] = E[x_1]\, E[x_2] = 0,$$

but not statistically independent, since e.g.

$$p(x_1 = -1,\, x_2 = 0) = 0 \;\ne\; p(x_1 = -1)\, p(x_2 = 0) = \frac{1}{6}.$$
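A short numerical check of both claims, using the table above:

```matlab
% Sketch: verifying uncorrelatedness and the failure of independence.
x1v = [-1; 1];  x2v = [-1; 0; 1];
P = [3 0 3; 1 4 1] / 12;            % P(x1 = x1v(i), x2 = x2v(j))

Ex1  = x1v' * sum(P, 2)             % E[x1] = 0
Ex2  = sum(P, 1) * x2v              % E[x2] = 0
Ex12 = x1v' * P * x2v               % E[x1 x2] = 0  -> uncorrelated

% Independence fails: P(x1=-1, x2=0) = 0, but P(x1=-1) * P(x2=0) = 1/6:
sum(P(1,:)) * sum(P(:,2))
```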
Part I - Exercises
Data are noisy: no model will fit the data perfectly (unless you fit the noise, i.e. overfitting). We therefore need a means to determine how well a model fits the underlying distribution. Which of two candidate models fits the data best?
Consider that the pdf of the dataset X is parametrized with parameters μ, σ. One writes p(X | μ, σ). The likelihood function (short: likelihood) of the model parameters is given by:

$$L(\mu, \sigma \mid X) = p(X \mid \mu, \sigma) = \prod_{i=1}^{M} p(x^i \mid \mu, \sigma)$$

It measures the probability of observing X if the distribution of X is parametrized with μ, σ. To determine the best fit, we search for the parameters that maximize the likelihood.
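A sketch of this comparison for two candidate 1-D Gaussian models (the dataset and candidate parameters are made up); in practice one compares log-likelihoods to avoid numerical underflow of the product:

```matlab
% Sketch: likelihood of two candidate (mu, sigma) models on a dataset X.
X = 2 + 0.5 * randn(1, 100);                 % made-up dataset

gauss = @(x, mu, s) exp(-(x - mu).^2 / (2*s^2)) / sqrt(2*pi*s^2);

L1 = prod(gauss(X, 2.0, 0.5));               % likelihood of a good fit
L2 = prod(gauss(X, 3.0, 0.5));               % likelihood of a poor fit

% Products of many small numbers underflow; compare log-likelihoods instead:
logL1 = sum(log(gauss(X, 2.0, 0.5)))
logL2 = sum(log(gauss(X, 3.0, 0.5)))         % logL1 > logL2: model 1 is better
```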
Values taken by the likelihood for two different fits using 1-D Gauss functions with different means:

(Figure: real distribution and two fits of a 1-D Gauss model, plotted as p(x) over x; one fit achieves Likelihood = 0.39253, the other Likelihood = 0.15959.)
E-M is used when no closed-form solution exists for the maximum-likelihood estimate. Example: fitting a mixture of K Gaussian functions, a linear weighted combination:

$$p(x) = \sum_{k=1}^{K} \alpha_k\, p(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \alpha_k = 1$$

There is no closed-form solution to:

$$\max_{\theta}\, L(\theta \mid X) = \max_{\theta}\, p(X \mid \theta) = \max_{\theta}\, \prod_{i=1}^{M} \sum_{k=1}^{K} \alpha_k\, p(x^i \mid \mu_k, \Sigma_k)$$

E-M is an iterative procedure to estimate the best set of parameters. It converges to a local optimum and is sensitive to initialization!
E-M is an iterative procedure, alternating an expectation (E) step and a maximization (M) step. It is ensured to converge to a local optimum only! (See more details in the next slides.)
The K-means algorithm is a simple version of Expectation-Maximization applied to a model composed of isotropic Gaussian functions.
Assignment step (E-step): compute the expectation of the equivalent Gaussian model with unit variance, centered on the centroid.

The distance of datapoint $x^i$ to the k-th centroid $\mu_k$ is:

$$d_k^{(i)} = \left\| x^i - \mu_k \right\|^2$$

Computing the distance to the k-th centroid is equivalent to computing the probability that the data point has been generated by the k-th model. The likelihood of the k-th model is:

$$L(\mu_k; x^i) = e^{-\left\| x^i - \mu_k \right\|^2}$$

(Figure: datapoints and centroids in the (x1, x2) plane.)
Assignment step (E-step), continued. Bayes' rule: assign each datapoint to the most likely centroid. If $p(x^i; \mu_2) > p(x^i; \mu_1)$, then $x^i$ belongs to cluster 2.
Update step (M-step): maximize the expectation of the equivalent Gaussian model with unit variance, centered on the centroid. Each centroid is moved to the mean of the datapoints assigned to it:

$$\mu_k = \frac{\sum_i r_k^i\, x^i}{\sum_i r_k^i}$$

The new centroid is closer to the datapoints: after the update step, the likelihood of the k-th model, $L(\mu_k; x^i) = e^{-\| x^i - \mu_k \|^2}$, increases.
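A minimal sketch of the two K-means steps (hard assignments, K = 2, synthetic 2-D data; a robust implementation would handle empty clusters and test for convergence):

```matlab
% Sketch: hard K-means, alternating E- and M-steps.
X  = [randn(2, 50), randn(2, 50) + 4];    % two synthetic clusters, M = 100
M  = size(X, 2);  K = 2;
Mu = X(:, randperm(M, K));                % random initial centroids

for iter = 1:20
    % E-step: assign each datapoint to the nearest centroid
    D = zeros(K, M);
    for k = 1:K
        D(k,:) = sum((X - repmat(Mu(:,k), 1, M)).^2, 1);
    end
    [~, z] = min(D, [], 1);               % hard assignments

    % M-step: move each centroid to the mean of its assigned datapoints
    for k = 1:K
        Mu(:,k) = mean(X(:, z == k), 2);  % (empty clusters not handled here)
    end
end
Mu
```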
Soft K-means can be seen as fitting the data distribution with a mixture of isotropic (spherical) Gaussian pdfs. E-M updates the parameters of each Gaussian to optimize the likelihood that the Gaussians represent the distribution of the datapoints.

Likelihood of the overall distribution (with a uniform prior for each Gaussian):

$$L(\mu_1, \sigma_1, \mu_2, \sigma_2 \mid X) = L(\mu_1, \sigma_1 \mid X)\; L(\mu_2, \sigma_2 \mid X)$$

(Figure: poor fit with L(μ1, σ1 | X) = 0.6 and L(μ2, σ2 | X) = 0.4, versus better fit with L(μ1, σ1 | X) = 0.8 and L(μ2, σ2 | X) = 0.9.)
Assignment step (E-step): compute the responsibility $r_k^i$ of cluster k for point $x^i$:

$$r_k^i = \frac{\alpha_k\, p(x^i; \mu_k, \sigma_k)}{\sum_{k'} \alpha_{k'}\, p(x^i; \mu_{k'}, \sigma_{k'})}$$

$p(x^i; \mu_k, \sigma_k) \in [0, 1]$: Gauss pdf evaluated at $x^i$. The responsibilities are normalized over the clusters: $\sum_k r_k^i = 1$, with $r_k^i \in [0, 1]$.

$\alpha_k$: relative importance of each of the K clusters (a measure of the number of datapoints in each cluster). The responsibility factor gives a measure of the likelihood that cluster k generated the dataset. In GMM, we will see that this is a measure of the likelihood that the Gaussian k (or cluster k) generated the whole dataset.
Update step (M-step): re-estimate the parameters of each Gaussian from the responsibilities:

$$\mu_k = \frac{\sum_{i=1}^{M} r_k^i\, x^i}{\sum_{i=1}^{M} r_k^i} \qquad \sigma_k^2 = \frac{\sum_{i=1}^{M} r_k^i\, \left\| x^i - \mu_k \right\|^2}{\sum_{i=1}^{M} r_k^i} \qquad \alpha_k = \frac{\sum_{i=1}^{M} r_k^i}{\sum_{k'} \sum_{i=1}^{M} r_{k'}^i}$$

This fits only a mixture of spherical Gaussians!
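A sketch of the soft K-means loop on synthetic data, following the update equations above (dividing the weighted squared distance by the dimension N to get a per-dimension variance is one possible convention, assumed here):

```matlab
% Sketch: soft K-means = E-M on a mixture of spherical Gaussians.
X = [randn(2, 50), randn(2, 50) + 4];          % synthetic 2-D data, M = 100
[N, M] = size(X);  K = 2;
Mu = X(:, randperm(M, K));                     % random initial centers
s2 = ones(1, K);  alpha = ones(1, K) / K;      % variances and priors

for iter = 1:30
    % E-step: responsibilities r(k,i) ~ alpha_k * N(x^i; mu_k, s2_k * I)
    R = zeros(K, M);
    for k = 1:K
        d2 = sum((X - repmat(Mu(:,k), 1, M)).^2, 1);
        R(k,:) = alpha(k) * exp(-d2 / (2*s2(k))) / ((2*pi*s2(k))^(N/2));
    end
    R = R ./ repmat(sum(R, 1), K, 1);          % normalize over the clusters

    % M-step: re-estimate means, variances and priors
    for k = 1:K
        rk = R(k,:);
        Mu(:,k)  = (X * rk') / sum(rk);
        d2       = sum((X - repmat(Mu(:,k), 1, M)).^2, 1);
        s2(k)    = (d2 * rk') / (N * sum(rk)); % per-dimension variance
        alpha(k) = sum(rk) / M;
    end
end
Mu, s2, alpha
```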
The same E-step applies when fitting diagonal Gaussians; the responsibilities and the updates for the means and priors are unchanged. The variance is now re-estimated separately for each dimension j:

$$\sigma_{k,j}^2 = \frac{\sum_{i=1}^{M} r_k^i \left( x_j^i - \mu_{k,j} \right)^2}{\sum_{i=1}^{M} r_k^i}, \qquad j = 1, ..., N \;\; \text{(dimension of the dataset)}$$

One covariance element per dimension, but still aligned with the axes of the original frame of reference.
A mixture of diagonal Gaussians can only fit Gaussians whose axes are aligned with the data axes, i.e. the covariance matrices of the Gaussians are diagonal.

(Figure: diagonal-Gaussian fit in the (x1, x2) plane.)
Gaussian Mixture Models (GMM) can learn mixtures of Gaussians with arbitrary (full) covariance matrices. A GMM can exploit local correlations and adapt the covariance matrix of each Gaussian so that it aligns with the local direction of correlation: each Gaussian performs a local linear PCA.

(Figure: full-covariance GMM fit in the (x1, x2) plane.)
Using full covariance matrices reduces the number of Gaussians required to fit the data, but each full covariance matrix requires N(N+1)/2 parameters, against N for a diagonal matrix. How can we derive an algorithm for fitting data with such complex mixtures?
Likelihood of the mixture of Gaussians:

$$L\left( \{\alpha_k, \mu_k, \Sigma_k\}_{k=1}^{K} \mid X \right) = \sum_{k=1}^{K} \alpha_k\, p(X; \mu_k, \Sigma_k)$$

with $p(X; \mu_k, \Sigma_k) \sim \prod_{i=1}^{M} e^{-(x^i - \mu_k)^T \Sigma_k^{-1} (x^i - \mu_k)}$ (unnormalized likelihood of Gaussian k), and $\mu_k, \Sigma_k$: mean and covariance matrix of Gaussian k.

The mixing coefficients $\alpha_k$ are normalized: $\sum_{k=1}^{K} \alpha_k = 1$.

(Figure: a two-Gaussian mixture with mixing coefficients α1 = 0.2 and α2 = 0.8.)
The parameters of a GMM are the means, covariance matrices and priors, $\theta = \{\mu_k, \Sigma_k, \alpha_k\}_{k=1}^{K}$. Estimation of all the parameters can be done through Expectation-Maximization (E-M). E-M tries to find the optimum of the likelihood of the model given the data, i.e.:

$$\max_{\theta}\, L(\theta \mid X)$$
One can usually safely assume that the datapoints are i.i.d. (independently and identically distributed):

$$L(\theta \mid X) = \prod_{i=1}^{M} \sum_{k=1}^{K} \alpha_k\, p(x^i \mid \mu_k, \Sigma_k)$$

Computing the log of the likelihood yields the same optimum:

$$\ln L(\theta \mid X) = \sum_{i=1}^{M} \ln \sum_{k=1}^{K} \alpha_k\, p(x^i \mid \mu_k, \Sigma_k)$$

There is no closed-form solution, unlike in the case of a single Gaussian. See the derivation of E-M for GMM in the annexes posted on the website.
Estimation step (E-step). Initialization: start from an initial estimate $\theta^0 = \{\mu_k^0, \Sigma_k^0, \alpha_k^0\}_{k=1}^{K}$ (e.g. with uniform priors $\alpha_k^0 = 1/K$).

At each step t, one can compute the likelihood of the current model given the current estimate of the parameters,

$$L(\theta^t \mid X) = \prod_{i=1}^{M} \sum_{k=1}^{K} \alpha_k^t\, p(x^i \mid \mu_k^t, \Sigma_k^t),$$

and use this value to determine a stopping criterion: we stop E-M once the increase in the likelihood (or, equivalently, in the log-likelihood) reaches a plateau.
Update step (M-step): recompute the means, covariance matrices and prior probabilities so as to maximize the log-likelihood of the current estimate, $\log L(\theta^t \mid X)$, using the current estimate of the posterior probabilities $p(k \mid x^j, \theta^t)$:

$$\alpha_k^{t+1} = \frac{1}{M} \sum_{j=1}^{M} p(k \mid x^j, \theta^t)$$

$$\mu_k^{t+1} = \frac{\sum_{j=1}^{M} p(k \mid x^j, \theta^t)\; x^j}{\sum_{j=1}^{M} p(k \mid x^j, \theta^t)}$$

$$\Sigma_k^{t+1} = \frac{\sum_{j=1}^{M} p(k \mid x^j, \theta^t)\; (x^j - \mu_k^{t+1})(x^j - \mu_k^{t+1})^T}{\sum_{j=1}^{M} p(k \mid x^j, \theta^t)}$$
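A compact sketch of the full E-M loop for a GMM with full covariance matrices, following the equations above (synthetic data; a production implementation would regularize Σ_k and use a better initialization, e.g. K-means):

```matlab
% Sketch: E-M for a K-component GMM with full covariances, no toolbox.
X = [randn(2, 100), [1 0.8; 0 0.5] * randn(2, 100) + 4];   % synthetic data
[N, M] = size(X);  K = 2;

Mu    = X(:, randperm(M, K));             % initial means: random datapoints
Sigma = repmat(eye(N), [1 1 K]);          % initial covariances: identity
alpha = ones(1, K) / K;                   % initial priors: uniform

oldLL = -inf;
for t = 1:200
    % E-step: posteriors p(k | x^j, theta_t)
    P = zeros(K, M);
    for k = 1:K
        d  = X - repmat(Mu(:,k), 1, M);
        Si = Sigma(:,:,k);
        q  = sum(d .* (Si \ d), 1);       % squared Mahalanobis distances
        P(k,:) = alpha(k) * exp(-q/2) / sqrt((2*pi)^N * det(Si));
    end
    LL = sum(log(sum(P, 1)));             % log-likelihood of current model
    R  = P ./ repmat(sum(P, 1), K, 1);    % normalize into posteriors

    % M-step: update priors, means and covariance matrices
    for k = 1:K
        rk = R(k,:);  s = sum(rk);
        alpha(k) = s / M;
        Mu(:,k)  = (X * rk') / s;
        d = X - repmat(Mu(:,k), 1, M);
        Sigma(:,:,k) = (d .* repmat(rk, N, 1)) * d' / s;
    end

    if LL - oldLL < 1e-8, break; end      % stop when the likelihood plateaus
    oldLL = LL;
end
LL, alpha
```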
(Figure: clustering with mixtures of Gaussians using spherical Gaussians (left) and non-spherical Gaussians, i.e. with full covariance matrices (right). Notice how the Gaussians become elongated along the directions of the clusters; the grey circles represent the 1σ and 2σ contours of the distributions.)
(Figure: GMM using 4 Gaussians with random initialization.)
Expectation-Maximization is very sensitive to initial conditions.

(Figure: GMM using 4 Gaussians with a new random initialization.)
APPLIED MACHINE LEARNING
Very sensitive to choice of number of Gaussians. Number of Gaussians can be optimized iteratively using AIC or BIC (see later slides):
Here, GMM using 8 Gaussians
55
This class revisited some basic notions of statistics, with standard definitions of pdf, cdf, marginal and conditional distributions. It emphasized the notion of statistical independence and how one can recognize it numerically and geometrically by looking at the distributions. It exemplified these concepts with multi-dimensional Gaussian functions. Finally, it introduced the notion of a maximum-likelihood fit, first with one Gaussian and then with a mixture of Gaussians.