

SLIDE 1

Statistical Machine Learning

Lecture 06: Probability Density Estimation

Kristian Kersting, TU Darmstadt
Summer Term 2020

SLIDE 2

Today's Objectives

Make you understand how to find p(x).

Covered Topics
  • Density Estimation
  • Maximum Likelihood Estimation
  • Non-Parametric Methods
  • Mixture Models
  • Expectation Maximization

SLIDE 3

Outline

  • 1. Probability Density
  • 2. Parametric models
        Maximum Likelihood Method
  • 3. Non-Parametric Models
        Histograms
        Kernel Density Estimation
        K-nearest Neighbors
  • 4. Mixture models
  • 5. Wrap-Up

SLIDE 4

Outline (repeated; entering Section 1: Probability Density)

SLIDE 5

Training Data

[Figure: scatter plot of training data]

How do we get the probability distributions from this so that we can classify with them?

SLIDE 6

Probability Density Estimation

So far we have seen: Bayes optimal classification, based on probability distributions p(x | Ck) p(Ck).

The prior p(Ck) is easy to deal with: we can "just count" the number of occurrences of each class in the training data. We need to estimate (learn) the class-conditional probability density p(x | Ck).

Supervised training: we know the input data points and their true labels (classes). Estimate the density separately for each class Ck. "Abbreviation": p(x) = p(x | Ck)

SLIDE 7

Probability Density Estimation

Training data: x1, x2, x3, . . .
Estimation: p(x)

Methods
  • Parametric models
  • Non-parametric models
  • Mixture models

SLIDE 8

Outline (repeated; entering Section 2: Parametric Models)

SLIDE 9

Parametric Models

Simple case: Gaussian distribution

p(x | µ, σ) = (1/√(2πσ²)) exp(−(x − µ)² / (2σ²))

It is governed by two parameters: mean and variance. That is, if we know these parameters we can fully describe p(x).

SLIDE 10

Parametric Models

Notation for parametric density models: x ∼ p(x | θ)

For the Gaussian distribution: θ = (µ, σ), so x ∼ p(x | µ, σ)

SLIDE 11

Maximum Likelihood Method

Learning means to estimate the parameters θ given the training data X = {x1, x2, . . .}.

The likelihood of θ is defined as the probability that the data X was generated from the probability density function with parameters θ:

L(θ) = p(X | θ)

SLIDE 12

Maximum Likelihood Method

Consider a set of points X = {x1, . . . , xN}. Computing the likelihood:
  • Of a single datum: p(xn | θ)
  • Of all data?

Assumption: the data is i.i.d. (independent and identically distributed)
  • The random variables x1 and x2 are independent if P(x1 ≤ α, x2 ≤ β) = P(x1 ≤ α) P(x2 ≤ β) ∀ α, β ∈ R
  • The random variables x1 and x2 are identically distributed if P(x1 ≤ α) = P(x2 ≤ α) ∀ α ∈ R

SLIDE 13

Maximum Likelihood Method

Likelihood:

L(θ) = p(X | θ) = p(x1, . . . , xN | θ)
     = p(x1 | θ) · . . . · p(xN | θ)   (using the i.i.d. assumption)
     = ∏_{n=1}^N p(xn | θ)

SLIDE 14

Maximum log-Likelihood Method

Maximize the (log-)likelihood w.r.t. θ:

log L(θ) = log p(X | θ) = log ∏_{n=1}^N p(xn | θ) = Σ_{n=1}^N log p(xn | θ)

SLIDE 15

Maximum Likelihood Method - Gaussian

Maximum likelihood estimation of a Gaussian:

µ̂, σ̂ = argmax_{µ,σ} log L(θ) = argmax_{µ,σ} Σ_{n=1}^N log p(xn | µ, σ)

Take the partial derivatives and set them to zero:

∂L/∂µ = 0,   ∂L/∂σ = 0

This leads to a closed-form solution:

µ̂ = (1/N) Σ_{n=1}^N xn,   σ̂² = (1/N) Σ_{n=1}^N (xn − µ̂)²
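A minimal NumPy sketch of the two closed-form estimators above; the synthetic data, function names, and printout are illustrative additions, not from the slides:

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma):
    # sum_n log N(x_n | mu, sigma^2), using the i.i.d. assumption
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

def gaussian_mle(x):
    # closed-form ML estimates from the slide (note the 1/N variance, not 1/(N-1))
    mu_hat = np.mean(x)
    sigma_hat = np.sqrt(np.mean((x - mu_hat)**2))
    return mu_hat, sigma_hat

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=1000)   # synthetic data for illustration
mu_hat, sigma_hat = gaussian_mle(x)
print(mu_hat, sigma_hat, gaussian_log_likelihood(x, mu_hat, sigma_hat))
```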

SLIDE 16

Maximum Likelihood Method - Gaussian

[Figure: Gaussian fitted to data by maximum likelihood]

SLIDE 17

Likelihood

L(θ) = p(X | θ) = ∏_{n=1}^N p(xn | θ)

SLIDE 18

Degenerate Case

If N = 1, X = {x1}, the resulting Gaussian looks like a spike at x1: the ML estimates are µ̂ = x1 and σ̂² = 0, a degenerate density.

SLIDE 19

Degenerate Case

What can we do to still get a useful estimate? We can put a prior on the mean!

SLIDE 20

Bayesian Estimation

Bayesian estimation / learning of parametric distributions assumes that the parameters are not fixed but are random variables too. This allows us to use prior knowledge about the parameters. How do we achieve that?

What do we want? A density model for x, p(x)
What do we have? Data X

SLIDE 21

Bayesian Estimation

Formalize this as a conditional probability:

p(x | X) = ∫ p(x, θ | X) dθ

p(x, θ | X) = p(x | θ, X) p(θ | X)

p(x) can be fully determined with the parameters θ, i.e., θ is a sufficient statistic. Hence, we have

p(x | θ, X) = p(x | θ)

p(x | X) = ∫ p(x | θ) p(θ | X) dθ

SLIDE 22

Bayesian Estimation

p(x | X) = ∫ p(x | θ) p(θ | X) dθ

p(θ | X) = p(X | θ) p(θ) / p(X) = L(θ) p(θ) / p(X)

p(X) = ∫ p(X | θ) p(θ) dθ = ∫ L(θ) p(θ) dθ

p(x | X) = (1 / p(X)) ∫ p(x | θ) L(θ) p(θ) dθ

SLIDE 23

Bayesian Estimation

p(x | X) = ∫ p(x | θ) p(θ | X) dθ

The probability p(θ | X) makes it explicit how the parameter estimation depends on the training data.

If p(θ | X) is small in most places, but large for a single θ̂, then we can approximate

p(x | X) ≈ p(x | θ̂)

Sometimes θ̂ is referred to as the Bayes point. The more uncertain we are about estimating θ̂, the more we average.

SLIDE 24

Bayesian Estimation

Problem: in general, it is intractable to integrate out the parameters θ (or only possible to do so numerically).

Example with a closed-form solution: Gaussian data distribution, where the variance is known and fixed; we estimate the distribution of the mean

p(µ | X) = p(X | µ) p(µ) / p(X)

with prior p(µ) = N(µ0, σ0²)

SLIDE 25

Bayesian Estimation

Sample mean: x̄ = (1/N) Σ_{n=1}^N xn

Bayesian estimation:

p(µ | X) = N(µN, σN²)

µN = (Nσ0² x̄ + σ² µ0) / (Nσ0² + σ²),   1/σN² = N/σ² + 1/σ0²

Check what happens when N grows to infinity...
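Because the posterior is available in closed form, it is short to check numerically. A sketch under the slide's assumptions (known σ, prior N(µ0, σ0²)); the function name and data are mine:

```python
import numpy as np

def posterior_mean_params(x, sigma, mu0, sigma0):
    # posterior N(mu_N, sigma_N^2) over the mean, formulas from the slide
    N = len(x)
    xbar = np.mean(x)
    mu_N = (N * sigma0**2 * xbar + sigma**2 * mu0) / (N * sigma0**2 + sigma**2)
    var_N = 1.0 / (N / sigma**2 + 1.0 / sigma0**2)
    return mu_N, var_N

rng = np.random.default_rng(1)
for N in (1, 10, 1000):   # posterior concentrates around the sample mean as N grows
    x = rng.normal(loc=2.0, scale=0.5, size=N)
    print(N, posterior_mean_params(x, sigma=0.5, mu0=0.0, sigma0=1.0))
```

Running it with growing N shows µN → x̄ and σN² → 0, which answers the slide's closing question.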

SLIDE 26

Conjugate Priors

Conjugate priors are prior distributions for the parameters that do not "change" the type of the parametric model. For example, as we saw, a Gaussian prior on the mean is conjugate to the Gaussian model. This works here because:
  • The product of two Gaussians is a Gaussian
  • The marginal of a Gaussian is a Gaussian

In general, it is not as easy!

SLIDE 27

Outline (repeated; entering Section 3: Non-Parametric Models)

SLIDE 28

Non-Parametric Models

Why use non-parametric representations? Often we do not know what functional form the class-conditional density takes (or we do not know what class of function we need). The probability density is estimated directly from the data (i.e., without an explicit parametric model):
  • Histograms
  • Kernel density estimation (Parzen windows)
  • K-nearest neighbors

SLIDE 29

Histograms

Discretize the feature space into bins.

[Figure: histograms with three bin sizes: not smooth enough, about right, too smooth]

SLIDE 30

Histograms

Properties
  • They are very general: in the infinite-data limit, any probability density can be approximated arbitrarily well
  • At the same time, it is a brute-force method

Problems
  • High-dimensional feature spaces: an exponential increase in the number of bins, hence exponentially much data is required (commonly known as the curse of dimensionality)
  • How to choose the size of the bins?
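A small sketch of a histogram used as a density estimate: bin counts are normalized so the estimate integrates to 1, which directly instantiates the p(x) ≈ K/(NV) idea formalized two slides below. Data and function names are illustrative:

```python
import numpy as np

def histogram_density(x, bins):
    # normalized histogram: per bin, p(x) ≈ K / (N * V) with V the bin width
    counts, edges = np.histogram(x, bins=bins)
    widths = np.diff(edges)
    p = counts / (counts.sum() * widths)
    return p, edges

rng = np.random.default_rng(2)
x = rng.normal(size=500)
for bins in (5, 30, 200):   # too smooth / about right / not smooth enough
    p, edges = histogram_density(x, bins)
    print(bins, np.sum(p * np.diff(edges)))   # each estimate integrates to 1.0
```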

SLIDE 31

Curse of Dimensionality

Shown here for histograms, but we will see that it is a general issue that we have to keep in mind.

SLIDE 32

More Formally

A data point x is sampled from probability density p(x). The probability that x falls in region R is

P(x ∈ R) = ∫_R p(x) dx

If R is sufficiently small, with volume V, then p(x) is almost constant inside R:

P(x ∈ R) = ∫_R p(x) dx ≈ p(x) V

If R is sufficiently large (so that enough samples fall into it):

P(x ∈ R) ≈ K/N   ⇒   p(x) ≈ K/(NV)

where N is the total number of points and K is the number of points falling in the region R.

SLIDE 33

More Formally

p(x) ≈ K/(NV)

Kernel density estimation: fix V and determine K
  • Example: determine the number of data points K in a fixed hypercube

K-nearest neighbor: fix K and determine V
  • Example: increase the size of a sphere until K data points fall into the sphere

SLIDE 34

Parzen Window

Hypercubes in d dimensions with edge length h:

H(u) = 1 if |uj| ≤ h/2 for all j = 1, . . . , d, and 0 otherwise

V = ∫ H(u) du = h^d

K(x) = Σ_{n=1}^N H(x − x(n))

p(x) ≈ K(x)/(NV) = (1/(N h^d)) Σ_{n=1}^N H(x − x(n))

SLIDE 35

Gaussian Kernel

H(u) = (1/√(2πh²))^d exp(−‖u‖²/(2h²))

V = ∫ H(u) du = 1

K(x) = Σ_{n=1}^N H(x − x(n))

p(x) ≈ K(x)/(NV) = (1/(N (√(2πh²))^d)) Σ_{n=1}^N exp(−‖x − x(n)‖²/(2h²))

SLIDE 36

General Formulation - Arbitrary Kernel

k(u) ≥ 0,   ∫ k(u) du = 1

V = h^d,   K(x) = Σ_{n=1}^N k((x − x(n))/h)

p(x) ≈ K(x)/(NV) = (1/(N h^d)) Σ_{n=1}^N k((x − x(n))/h)
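The general formula translates directly to code. A 1-d sketch using the Gaussian kernel and the box (Parzen) kernel discussed on the next slide; function names and data are my own illustrative choices:

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def box_kernel(u):   # Parzen window
    return (np.abs(u) <= 0.5).astype(float)

def kde(x_query, data, h, kernel=gaussian_kernel):
    # 1-d kernel density estimate: p(x) ≈ 1/(N h) * sum_n k((x - x_n)/h)
    u = (x_query[:, None] - data[None, :]) / h
    return kernel(u).sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(3)
data = rng.normal(size=300)
grid = np.linspace(-4, 4, 9)
print(kde(grid, data, h=0.3))   # try h = 0.05 / 0.3 / 2.0 to see the bandwidth effect
```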

SLIDE 37

Common Kernels

Gaussian kernel:

k(u) = (1/√(2π)) exp(−u²/2)

Problem: the kernel has infinite support, which requires a lot of computation.

Parzen window:

k(u) = 1 if |u| ≤ 1/2, and 0 otherwise

Problem: not very smooth results.

SLIDE 38

Common Kernels

Epanechnikov kernel:

k(u) = max(0, (3/4)(1 − u²))

Smoother, but finite support.

Problem with kernel methods: we have to select the kernel bandwidth h appropriately.

SLIDE 39

Gaussian KDE Example

[Figure: Gaussian KDE with three bandwidths: not smooth enough, about right, too smooth]

SLIDE 40

Again to Our Definition

p(x) ≈ K/(NV)

Kernel density estimation: fix V and determine K
  • Example: determine the number of data points K in a fixed hypercube

K-nearest neighbor: fix K and determine V
  • Example: increase the size of a sphere until K data points fall into the sphere
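A 1-d sketch of the "fix K, grow V" idea: the volume is the interval that just reaches the K-th nearest neighbor of the query point. Names and data are illustrative:

```python
import numpy as np

def knn_density(x_query, data, K):
    # kNN density estimate in 1-d: p(x) ≈ K / (N * V), V = length of the
    # interval [x - r, x + r] reaching the K-th nearest neighbor
    N = len(data)
    dists = np.sort(np.abs(x_query[:, None] - data[None, :]), axis=1)
    r = dists[:, K - 1]
    V = 2 * r
    return K / (N * V)

rng = np.random.default_rng(4)
data = rng.normal(size=500)
grid = np.linspace(-3, 3, 7)
print(knn_density(grid, data, K=20))
```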

SLIDE 41

K-Nearest Neighbors (kNN)

[Figure: kNN density estimates for three values of K: not smooth enough, about right, too smooth. Note: blue curve rescaled for visualization.]

SLIDE 42

K-Nearest Neighbors (kNN)

Bayesian classification:

P(Cj | x) = p(x | Cj) P(Cj) / p(x)

k-Nearest Neighbors classification: assume we have a dataset of N points, where Nj is the number of data points in class Cj and Σj Nj = N. To classify a point x, we draw a sphere centered at x that contains K points (from any class). Assume the sphere has volume V and contains Kj points of class Cj. Then

p(x) ≈ K/(NV),   p(x | Cj) ≈ Kj/(Nj V),   P(Cj) ≈ Nj/N

P(Cj | x) ≈ (Kj/(Nj V)) · (Nj/N) · (NV/K) = Kj/K
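The Kj/K rule above in code, for 1-d features; the dataset and function name are illustrative:

```python
import numpy as np

def knn_classify(x_query, data, labels, K):
    # kNN classification via the K_j / K rule from the slide
    dists = np.abs(data - x_query)
    nearest = labels[np.argsort(dists)[:K]]   # labels of the K nearest points
    classes, counts = np.unique(nearest, return_counts=True)
    # counts / K are the empirical P(Cj | x) ≈ Kj / K for the classes present
    return classes[np.argmax(counts)], counts / K

rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
labels = np.array([0] * 100 + [1] * 100)
print(knn_classify(1.5, data, labels, K=15))
```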

SLIDE 43

Bias-Variance Problem

Nonparametric probability density estimation:
  • Histograms: size of the bins? Too large: too smooth; too small: not smooth enough
  • Kernel density estimation: kernel bandwidth? h too large: too smooth; h too small: not smooth enough
  • K-nearest neighbor: number of neighbors? K too large: too smooth; K too small: not smooth enough

This is a general problem of many density estimation approaches, including parametric and mixture models.

SLIDE 44

Outline (repeated; entering Section 4: Mixture Models)

SLIDE 45

Mixture Models

Parametric models (Gaussian, neural networks, ...)
  • Good analytic properties
  • Simple, small memory requirements, fast

Nonparametric models (kernel density estimation, k-nearest neighbors, ...)
  • General
  • Large memory requirements, slow

Mixture models are a mix of parametric and nonparametric models.

SLIDE 46

Mixture of Gaussians (MoG)

Sum of individual Gaussian distributions.

SLIDE 47

Mixture of Gaussians

Sum of individual Gaussian distributions. In the limit (i.e., with many mixture components) this can approximate every (smooth) density:

p(x) = Σ_{j=1}^M p(x | j) p(j)

SLIDE 48

Mixture of Gaussians

p(x) = Σ_{j=1}^M p(x | j) p(j)

p(x | j) = N(x | µj, σj) = (1/√(2πσj²)) exp(−(x − µj)²/(2σj²))

p(j) = πj with 0 ≤ πj ≤ 1, Σ_{j=1}^M πj = 1

Remarks
  • The mixture density integrates to 1: ∫ p(x) dx = 1
  • The mixture parameters are θ = {µ1, σ1, π1, . . . , µM, σM, πM}
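Evaluating this mixture density is one line per component; a sketch with hypothetical parameters:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def mog_density(x, mus, sigmas, pis):
    # p(x) = sum_j pi_j * N(x | mu_j, sigma_j), evaluated pointwise
    return sum(pi * gaussian_pdf(x, mu, s) for mu, s, pi in zip(mus, sigmas, pis))

x = np.linspace(-5, 8, 5)
print(mog_density(x, mus=[0.0, 4.0], sigmas=[1.0, 0.5], pis=[0.7, 0.3]))
```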

SLIDE 49

Mixture of Gaussians - MLE

Maximum (log-)likelihood estimation with a dataset of N i.i.d. points {x1, . . . , xN}:

L = log L(θ) = Σ_{n=1}^N log p(xn | θ)

∂L/∂µj = 0   ⇒   µj = (Σ_{n=1}^N p(j | xn) xn) / (Σ_{n=1}^N p(j | xn))

What is the problem with this approach? A circular dependency: p(j | xn) itself depends on µj, so there is no analytical solution!

SLIDE 50

Mixture of Gaussians - MLE Gradient Ascent

Maximum (log-)likelihood estimation with a dataset of N i.i.d. points {x1, . . . , xN}:

L = log L(θ) = Σ_{n=1}^N log p(xn | θ),   ∂L/∂µj = 0

Gradient ascent:
  • Complex gradient (nonlinear, circular dependencies)
  • Optimization of one Gaussian component depends on all other components

SLIDE 51

Mixture of Gaussians - Different Strategy

Unobserved := hidden or latent variables (j | x)

SLIDE 52

Mixture of Gaussians - Different Strategy

Suppose we knew the observed and unobserved dataset (also called the complete dataset). Then we can compute the maximum likelihood solution of components 1 and 2:

µ1 = (Σ_{n=1}^N p(1 | xn) xn) / (Σ_{n=1}^N p(1 | xn))

µ2 = (Σ_{n=1}^N p(2 | xn) xn) / (Σ_{n=1}^N p(2 | xn))

SLIDE 53

Mixture of Gaussians - Different Strategy

Suppose we knew the distributions. We can infer the unobserved data using Bayes' decision rule, namely we decide component 1 if

p(j = 1 | x) > p(j = 2 | x)

SLIDE 54

Mixture of Gaussians - Chicken and Egg Problem

We have a big problem at hand... we neither know the distribution nor the unobserved data! To break this loop, we need some estimate of the unobserved data j. Temporary solution: clustering (to be replaced soon).

SLIDE 55

Estimation using Clustering

Clustering with hard assignments: somehow assign mixture labels to each data point, then estimate each mixture component only from its own data.

SLIDE 56

Mixture of Gaussians

Suppose we had a guess about the distribution, but did not know the unobserved data. Compute the probability for each mixture component:

p(j = 1 | x) = p(x | 1) p(1) / p(x) = p(x | 1) π1 / Σ_{j=1}^M p(x | j) πj

p(j = 2 | x) = p(x | 2) p(2) / p(x) = p(x | 2) π2 / Σ_{j=1}^M p(x | j) πj

SLIDE 57

Expectation Maximization - Clustering

Clustering with soft assignments: this is the expectation step of the EM algorithm (shortly). We can determine the means by maximum likelihood estimation:

µj = (Σ_{n=1}^N p(j | xn) xn) / (Σ_{n=1}^N p(j | xn))

SLIDE 58

Expectation Maximization Algorithm

Algorithm: initialize with (random) means µ1, µ2, . . . , µM. While the stop condition is not met:
  • E-step: compute the posterior distribution p(j | xn) for each mixture component and for all data points
  • M-step: compute the new means as the weighted means of all data points:

µj = (Σ_{n=1}^N p(j | xn) xn) / (Σ_{n=1}^N p(j | xn))
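A sketch of exactly this means-only loop for a 1-d mixture, assuming fixed equal weights and a fixed shared variance (the slide's algorithm only updates the means); initialization, data, and names are my own:

```python
import numpy as np

def em_means_only(x, M, iters=50, sigma=1.0, seed=0):
    # means-only EM for a 1-d MoG with fixed equal variances and weights
    rng = np.random.default_rng(seed)
    mus = rng.choice(x, size=M, replace=False)        # random initialization
    for _ in range(iters):
        # E-step: responsibilities p(j | x_n), uniform pi_j and shared sigma
        logp = -(x[None, :] - mus[:, None])**2 / (2 * sigma**2)
        resp = np.exp(logp - logp.max(axis=0))
        resp /= resp.sum(axis=0)
        # M-step: weighted means
        mus = (resp * x[None, :]).sum(axis=1) / resp.sum(axis=1)
    return mus

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
print(em_means_only(x, M=2))
```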

SLIDE 59

Expectation Maximization

[Figure: EM iterations on example data]

SLIDE 60

Expectation Maximization (EM) Algorithm

A method for performing maximum likelihood estimation even when the data is incomplete (i.e., we only have access to the observed variables). Idea: if we have unknown values in our estimation problem (so-called hidden variables), we can use EM.

Assume:
  • Observed (incomplete) data: X = {x1, . . . , xN}
  • Unobserved (hidden) data: Y = {y1, . . . , yN}

In the case of Gaussian mixtures, the hidden data is the association of every data point with one of the mixture components.

SLIDE 61

Properties of EM

Incomplete (observed) data: X = {x1, . . . , xN}
Hidden (unobserved) data: Y = {y1, . . . , yN}
Complete data: Z = (X, Y)

Joint density: p(Z) = p(X, Y) = p(Y | X) p(X)

With parameters: p(Z | θ) = p(X, Y | θ) = p(Y | X, θ) p(X | θ)

In the case of Gaussian mixtures:
  • p(X | θ): likelihood of the mixture model
  • p(Y | X, θ): predictions of the mixture components

SLIDE 62

Properties of EM

Incomplete likelihood:

L(θ | X) = p(X | θ) = ∏_{n=1}^N p(xn | θ)

Complete likelihood:

L(θ | Z) = p(Z | θ) = p(X, Y | θ) = p(Y | X, θ) p(X | θ) = ∏_{n=1}^N p(yn | xn, θ) p(xn | θ)

SLIDE 63

EM Algorithm

We don't know Y, but if we have the current guess θ^{i−1} of the parameters θ, we can use it to predict Y. Formally, we compute the expected value of the (complete) log-likelihood given the data X and the current estimate of θ:

E_Y[ log p(X, Y | θ) | X, θ^{i−1} ] =: Q(θ, θ^{i−1})

Here X is fixed; Y is a random variable; θ is a variable; θ^{i−1} is the current estimate of the parameters (fixed).

SLIDE 64

Properties of the EM Algorithm

Maximize the expected complete log-likelihood:

Q(θ, θ^{i−1}) = E_Y[ log p(X, Y | θ) | X, θ^{i−1} ] = ∫ p(y | X, θ^{i−1}) log p(X, y | θ) dy

SLIDE 65

Properties of the EM Algorithm

Q(θ, θ^{i−1}) = ∫ p(y | X, θ^{i−1}) log p(X, y | θ) dy

E-step (expectation): compute p(y | X, θ^{i−1}) to be able to compute the expectation Q(θ, θ^{i−1})

M-step (maximization): maximize the expected value of the complete log-likelihood:

θ^i = argmax_θ Q(θ, θ^{i−1})

SLIDE 66

Formal Properties of the EM Algorithm

Main result from Dempster et al., "Maximum Likelihood from Incomplete Data via the EM Algorithm", 1977:

The expected complete log-likelihood of the i-th iteration is at least as good as that of the (i−1)-th iteration:

Q(θ^i, θ^{i−1}) ≥ Q(θ^{i−1}, θ^{i−1})

If this expectation is maximized w.r.t. θ^i, then it holds that:

L(θ^i | X) ≥ L(θ^{i−1} | X)

[Portraits: Dempster (1929-), Laird (1943-), Rubin (1942-)]

SLIDE 67

Formal Properties of the EM Algorithm

Consequences of the previous statements:
  • The incomplete log-likelihood increases in every iteration (or at least stays the same)
  • The incomplete log-likelihood is maximized (locally)

In practice:
  • The quality of the results depends on the initialization
  • If we initialize poorly, we may get stuck in poor local optima
  • EM relies on good initialization of the parameters

SLIDE 68

Special Case - Gaussian Mixtures

For mixtures of Gaussians there is a closed-form solution for this maximization. Look at the fully general case: also estimate the variances of the mixture components and the prior distribution over the mixture components.

θ^i = argmax_θ Q(θ, θ^{i−1})

SLIDE 69

EM for Gaussian Mixtures

Algorithm: initialize the parameters µ1, σ1, π1, . . . While the stop condition is not met:

E-step: compute the posterior distribution, also called the responsibility, for each mixture component and for all data points:

αnj = p(j | xn) = πj N(xn | µj, σj) / Σ_{i=1}^M πi N(xn | µi, σi)

M-step: compute the new parameters using weighted estimates:

µj^new = (1/Nj) Σ_{n=1}^N αnj xn   with Nj = Σ_{n=1}^N αnj

(σj^new)² = (1/Nj) Σ_{n=1}^N αnj (xn − µj^new)²

πj^new = Nj / N
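The E- and M-step formulas above transcribe directly to code. A 1-d sketch (synthetic data, my own function names):

```python
import numpy as np

def em_mog(x, M, iters=100, seed=0):
    # full EM for a 1-d Gaussian mixture: means, variances, and weights
    rng = np.random.default_rng(seed)
    N = len(x)
    mus = rng.choice(x, size=M, replace=False)
    sigmas = np.full(M, x.std())
    pis = np.full(M, 1.0 / M)
    for _ in range(iters):
        # E-step: responsibilities alpha_nj, shape (N, M)
        dens = (np.exp(-(x[:, None] - mus)**2 / (2 * sigmas**2))
                / np.sqrt(2 * np.pi * sigmas**2))
        alpha = pis * dens
        alpha /= alpha.sum(axis=1, keepdims=True)
        # M-step: weighted updates from the slide
        Nj = alpha.sum(axis=0)
        mus = (alpha * x[:, None]).sum(axis=0) / Nj
        sigmas = np.sqrt((alpha * (x[:, None] - mus)**2).sum(axis=0) / Nj)
        pis = Nj / N
    return mus, sigmas, pis

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 2, 300)])
print(em_mog(x, M=2))
```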

SLIDE 70

Expectation Maximization

[Figure: EM iterations fitting a Gaussian mixture]

SLIDE 71

How Many Components?

How many mixture components do we need? More components will typically lead to a better likelihood. But are more components necessarily better? Not always, because of overfitting!

(Simple) automatic selection:

Find the number of components that maximizes the Akaike information criterion (AIC)

log p(X | θML) − K

where K is the number of parameters. Or find the number of components that maximizes the Bayesian information criterion (BIC)

log p(X | θML) − (1/2) K log N

where N is the number of data points.
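A sketch of this selection loop. It reuses em_mog from the EM sketch above and assumes each 1-d component contributes a mean, a standard deviation, and a weight (with one constraint, since the πj sum to 1); the data and scoring loop are illustrative:

```python
import numpy as np

def mog_loglik(x, mus, sigmas, pis):
    # log p(X | theta) for a 1-d Gaussian mixture
    dens = (np.exp(-(x[:, None] - mus)**2 / (2 * sigmas**2))
            / np.sqrt(2 * np.pi * sigmas**2))
    return np.log((pis * dens).sum(axis=1)).sum()

rng = np.random.default_rng(8)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 2, 300)])
for M in (1, 2, 3, 4):
    mus, sigmas, pis = em_mog(x, M)          # em_mog from the sketch after SLIDE 69
    ll = mog_loglik(x, mus, sigmas, pis)
    K = 3 * M - 1                            # free parameters: M means, M sigmas, M-1 weights
    aic = ll - K
    bic = ll - 0.5 * K * np.log(len(x))
    print(M, round(aic, 1), round(bic, 1))
```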

SLIDE 72

Before We Move On...

It is important to understand:

Mixture models are much more general than mixtures of Gaussians
  • One can have mixtures of any parametric distribution, and even mixtures of different parametric distributions
  • Gaussian mixtures are only one of many possibilities, though by far the most common one

Expectation maximization is not just for fitting mixtures of Gaussians
  • One can fit other mixture models with EM
  • EM is still more general, in that it applies to many other hidden-variable models

SLIDE 73

Outline (repeated; entering Section 5: Wrap-Up)

SLIDE 74

Wrap-Up

You now know:
  • The difference between parametric and non-parametric models
  • More about the likelihood function and how to derive the maximum likelihood estimators for the Gaussian distribution
  • What Bayesian estimation is
  • Different non-parametric models (histograms, kernel density estimation, and k-nearest neighbors)
  • What mixture models are
  • What the Expectation-Maximization idea and algorithm are

SLIDE 75

Self-Test Questions

  • Where do we get the probability of data from?
  • What are parametric methods and how do we obtain their parameters?
  • How many parameters do non-parametric methods have?
  • What are mixture models?
  • Should gradient methods be used for training mixture models?
  • How does the EM algorithm work?
  • What is the biggest problem of mixture models?

SLIDE 76

Homework

Reading assignment for the next lecture:
  • Clustering: Murphy ch. 25
  • Bias & Variance: Bishop ch. 3.2, Murphy ch. 6.4

SLIDE 77

References

EM standard reference:
A.P. Dempster, N.M. Laird, D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, Vol. 39, 1977.

EM tutorial:
Jeff A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models", TR-97-021, ICSI, U.C. Berkeley, CA, USA.

Modern interpretation:
R.M. Neal and G.E. Hinton, "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants", in Learning in Graphical Models, M.I. Jordan (editor).