

slide-1
SLIDE 1

Supervised and unsupervised learning.

Petr Pošík
Czech Technical University in Prague, Faculty of Electrical Engineering, Dept. of Cybernetics

This lecture is based on the book Ten Lectures on Statistical and Structural Pattern Recognition by Michail I. Schlesinger and Václav Hlaváč (Kluwer, 2002). (The Czech edition was published by ČVUT Press in 1999 under the title Deset přednášek z teorie statistického a strukturálního rozpoznávání.)

slide-2
SLIDE 2

Learning

  • Learning: Decision strategy design, Learning as parameter estimation, Learning as optimal strategy selection, Several surrogate criteria, Learning revisited
  • Unsupervised Learning
  • Clustering
  • Summary


slide-3
SLIDE 3

Decision strategy design


Using an observation x ∈ X of an object of interest with a hidden state k ∈ K, we should design a decision strategy q : X → D that is optimal with respect to a certain criterion.

slide-4
SLIDE 4

Decision strategy design


Bayesian decision theory requires the complete statistical information pXK(x, k) about the object of interest to be known, and a suitable penalty function W : K × D → R must be provided.

Non-Bayesian decision theory studies tasks for which some of the above information is not available.

In practical applications, typically, none of the probabilities are known! The designer is only provided with the training (multi)set T = {(x1, k1), (x2, k2), . . . , (xl, kl)} of examples.

It is simpler to provide good examples than to obtain a complete or partial statistical model, build general theories, or create explicit descriptions of concepts (hidden states).

The aim is to find definitions of concepts (classes, hidden states) which are
  • complete (all positive examples are covered), and
  • consistent (no negative examples are covered).

Since the training (multi)set is finite, the concept description found is only a hypothesis.

slide-5
SLIDE 5

Decision strategy design



When do we need to use learning?

When our knowledge about the recognized object is insufficient to solve the pattern recognition (PR) task.

Most often, we have insufficient knowledge about pX|K(x|k).

slide-6
SLIDE 6

Decision strategy design



How do we proceed?

slide-7
SLIDE 7

Learning as parameter estimation


1. Assume pXK(x, k) has a particular form (e.g. Gaussian, mixture of Gaussians, piece-wise constant) with a small number of parameters Θk.
2. Estimate the values of the parameters Θk using the training set T.
3. Solve the classifier design problem as if the estimated p̂XK(x, k) were the true (and unknown) pXK(x, k).
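To make steps 2 and 3 concrete, the following is a minimal sketch of this plug-in approach, assuming Gaussian class-conditional densities pX|K(x|k) = N(x|µk, Σk); the function names and the use of NumPy are illustrative choices, not part of the original lecture.

```python
import numpy as np

def fit_gaussian_per_class(X, k, n_classes):
    """Step 2: ML estimates of the prior, mean and covariance for each class."""
    params = []
    for c in range(n_classes):
        Xc = X[k == c]
        prior = len(Xc) / len(X)                      # estimate of pK(c)
        mu = Xc.mean(axis=0)                          # estimate of mu_c
        Sigma = np.cov(Xc, rowvar=False, bias=True)   # ML estimate (divides by n)
        params.append((prior, mu, Sigma))
    return params

def log_gaussian(x, mu, Sigma):
    """Log density of N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(Sigma, diff))

def classify(x, params):
    """Step 3: plug-in Bayes decision maximizing the estimated pK(k) pX|K(x|k)."""
    scores = [np.log(prior) + log_gaussian(x, mu, Sigma) for prior, mu, Sigma in params]
    return int(np.argmax(scores))
```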

slide-8
SLIDE 8

Learning as parameter estimation


Pros and cons:

If the true pXK(x, k) does not have the assumed form, the resulting strategy q′(x) can be arbitrarily bad, even if the training set size l approaches infinity.

Implementation is often straightforward, especially if the features are assumed to be conditionally independent within each class (naive Bayes classifier).

slide-9
SLIDE 9

Learning as optimal strategy selection


Choose a class Q of strategies qΘ : X → D. The class Q is usually given as a parametrized set of strategies of the same kind, i.e. qΘ(x, Θ1, . . . , Θ|K|).

The problem can be formulated as a non-Bayesian task with non-random interventions:

The unknown parameters Θk are the non-random interventions.

The probabilities pX|K,Θ(x|k, Θk) must be known.

The solution may be, e.g., a strategy that minimizes the maximal probability of incorrect decision over all Θk, i.e. a strategy that minimizes the probability of incorrect decision under the worst possible parameter setting.

But even this minimal probability may not be low enough—this happens especially in cases when the class Q of strategies is too broad.

It is necessary to narrow the set of possible strategies using additional information—the training (multi)set T.

Learning then amounts to selecting a particular strategy q∗Θ from the a priori known set Q using the information provided by the training set T.

A natural criterion for the selection of one particular strategy is the risk R(qΘ), but it cannot be computed because pXK(x, k) is unknown.

The strategy q∗Θ ∈ Q is thus chosen by minimizing some other, surrogate criterion on the training set which approximates R(qΘ).

The choice of the surrogate criterion determines the learning paradigm.

slide-10
SLIDE 10

Several surrogate criteria


All the following surrogate criteria can be computed using the training data T.

Learning as parameter estimation
  • according to the maximum likelihood,
  • according to a non-random training set.

Learning as optimal strategy selection
  • by minimization of the empirical risk,
  • by minimization of the structural risk.

slide-11
SLIDE 11

Several surrogate criteria


Learning as parameter estimation according to the maximum likelihood:

The likelihood of an instance of the parameters Θ = (Θk : k ∈ K) is the probability of T given Θ:

L(Θ) = p(T|Θ) = ∏_{(xi,ki)∈T} pK(ki) pX|K(xi|ki, Θki)

Learning then means finding Θ∗ that maximizes the probability of T:

Θ∗ = (Θ∗k : k ∈ K) = arg max_Θ L(T, Θ),

which can be decomposed into

Θ∗k = arg max_{Θk} ∑_{x∈X} α(x, k) log pX|K(x|k, Θk),

where α(x, k) is the frequency of the pair (x, k) in T (i.e. T is a multiset).

The recognition is then performed according to qΘ(x, Θ∗).
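For a discrete observation space the maximization above has a closed-form solution: the ML estimate of a categorical pX|K(x|k) is the relative frequency of x among the training examples of state k, and similarly for pK(k). A tiny illustrative sketch with hypothetical helper names:

```python
from collections import Counter

def ml_estimate_discrete(T):
    """T is a list of (x, k) pairs with discrete x.
    Returns ML estimates of pK(k) and pX|K(x|k) as dictionaries."""
    n = len(T)
    count_k = Counter(k for _, k in T)    # number of examples of each state k
    count_xk = Counter(T)                 # alpha(x, k): frequency of the pair in T
    p_k = {k: c / n for k, c in count_k.items()}
    p_x_given_k = {(x, k): c / count_k[k] for (x, k), c in count_xk.items()}
    return p_k, p_x_given_k
```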


slide-12
SLIDE 12

Several surrogate criteria


Learning as parameter estimation according to a non-random training set:

Used when random examples are not easy to obtain, e.g. in the recognition of images.

T is carefully crafted by the designer:
  ✓ it should cover the whole recognized domain,
  ✓ the examples should be typical (“quite probable”) prototypes.

Let T(k), k ∈ K, be the subset of the training set T with the examples for state k. Then

Θ∗k = arg max_{Θk} min_{x∈T(k)} pX|K(x|k, Θk)

Note that Θ∗ does not depend on the frequencies of (x, k) in T (i.e. T is a set).


slide-13
SLIDE 13

Several surrogate criteria


Learning as optimal strategy selection by minimization of the empirical risk:

Given: a set Q of parametrized strategies q(x, Θ) and a penalty function W(k, d).

The quality of each strategy q ∈ Q (i.e. the quality of each parameter set Θ) could be described by the risk

R(Θ) = R(q) = ∑_{k∈K} ∑_{x∈X} pXK(x, k) W(k, q(x, Θ)),

but pXK is unknown.

We thus use the empirical risk Remp (the training set error):

Remp(Θ) = Remp(q) = (1/|T|) ∑_{(xi,ki)∈T} W(ki, q(xi, Θ)).

The strategy q(x, Θ∗) with Θ∗ = arg min_Θ Remp(Θ) is then used.

Examples: Perceptron, neural networks (backprop.), classification trees, ...
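As an illustration, the empirical risk is straightforward to compute from the training set once a strategy and a penalty function are fixed; the sketch below uses the 0/1 penalty, for which Remp is simply the training error rate (the names are illustrative, not from the slides).

```python
import numpy as np

def empirical_risk(strategy, X, k, W):
    """Remp(q) = (1/|T|) * sum over (x_i, k_i) in T of W(k_i, q(x_i))."""
    return float(np.mean([W(ki, strategy(xi)) for xi, ki in zip(X, k)]))

def zero_one_penalty(k, d):
    """W(k, d) = 0 for a correct decision, 1 for an incorrect one."""
    return 0.0 if d == k else 1.0
```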


slide-14
SLIDE 14

Several surrogate criteria


Learning as optimal strategy selection by minimization of the structural risk:

Based on Vapnik-Chervonenkis theory

Examples: Optimal separating hyperplane, support vector machine (SVM)

slide-15
SLIDE 15

Learning revisited


Do we need learning? When?

If we are about to solve one particular task which is sufficiently known to us, we should try to develop a recognition method without learning.

If we are about to solve a task belonging to a well-defined class (we just do not know which particular task from the class we shall solve), we should develop a recognition method with learning.

slide-16
SLIDE 16

Learning revisited


The designer

should understand all the varieties of the task class, i.e.

should find a solution to the whole class of problems.

slide-17
SLIDE 17

Learning revisited


The solution

is a parametrized strategy and

its parameters are learned from the training (multi)set.

slide-18
SLIDE 18

Learning revisited


Supervised learning is the topic of several upcoming lectures:

Decision trees and decision rules.

Linear classifiers.

AdaBoost.

slide-19
SLIDE 19

Unsupervised Learning


slide-20
SLIDE 20

Do we need the teacher?


No learning:

Use q for recognition.

x → q(x) → d

slide-21
SLIDE 21

Do we need the teacher?



Supervised learning:

First, use T to learn Θ.

Then, use qΘ for recognition.

(xT, kT) → Learning → Θ;   x → q(x, Θ) → d

slide-22
SLIDE 22

Do we need the teacher?



Unsupervised learning???

First, predict the sequence K̃0, i.e. k0i = R(xi, Θ0) for xi ∈ X̃.

Then iteratively update Θt = L(X̃, K̃t−1).

x → R(x, Θ) → k;   (x, k) → L(x, k) → Θ, fed back into R

Does not work as expected if models like the perceptron are used.

Works for quadratic clustering (k-means, ISODATA, EM).

slide-23
SLIDE 23

K-means (ISODATA)


Assume:

An object can be in one of the |K| states with equal probabilities.

All pX|K(x|k) are isotropic Gaussians: pX|K(x|k) = N (x|µk, σI).

slide-24
SLIDE 24

K-means (ISODATA)


Recognition:

The task is to decide the state k for each x, assuming all µk are known.

The Bayesian strategy (minimizing the probability of error):

q∗(x) = arg min_{k∈K} (x − µk)²

If µk, k ∈ K, are not known, it is a parametrized strategy qΘ(x), where Θ = (µ1, . . . , µK).

slide-25
SLIDE 25

K-means (ISODATA)



Learning:

Find the maximum-likelihood estimates of µk based on the known (x1, k1), . . . , (xl, kl):

µ∗k = (1/|Ik|) ∑_{i∈Ik} xi,

where Ik is the set of indices of the training examples belonging to state k.

slide-26
SLIDE 26

K-means algorithm


Algorithm K-means [Mac67]

K is the a priori given number of clusters.

Algorithm:
1. Choose K centroids µk (in almost any way, but every cluster should have at least one example).
2. For all x, assign x to its closest µk.
3. Compute the new position of each centroid µk based on all examples xi, i ∈ Ik, in cluster k.
4. If the positions of the centroids changed, repeat from step 2.
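A minimal NumPy sketch of steps 1–4; the particular initialization (K distinct training points chosen at random) and the handling of empty clusters are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Choose K centroids, here K distinct training points picked at random.
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(max_iter):
        # 2. Assign every example to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid from the examples in its cluster
        #    (an empty cluster keeps its previous centroid).
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        # 4. Stop when the centroid positions no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```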

slide-27
SLIDE 27

K-means algorithm


Algorithm features:

The algorithm minimizes the intracluster variance

J = ∑_{j=1}^{K} ∑_{i=1}^{nj} ‖xi,j − cj‖²,   (1)

where nj is the number of examples in cluster j and cj is its centroid.

The algorithm is fast, but each run can converge to a different local optimum of J.

[Mac67] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297, Berkeley, 1967. University of California Press.

slide-28
SLIDE 28

Illustration

[Figure: K-means clustering, iteration 1]

slide-29
SLIDE 29

Illustration

[Figure: K-means clustering, iteration 2]

slide-30
SLIDE 30

Illustration

[Figure: K-means clustering, iteration 3]

slide-31
SLIDE 31

Illustration

[Figure: K-means clustering, iteration 4]

slide-32
SLIDE 32

Illustration

[Figure: K-means clustering, iteration 5]

slide-33
SLIDE 33

Illustration

[Figure: K-means clustering, iteration 6]

slide-34
SLIDE 34

General mixture distributions


Assume the data are samples from a distribution factorized as pXK(x, k) = pK(k) pX|K(x|k), i.e. pX(x) = ∑_{k∈K} pK(k) pX|K(x|k), and that the distribution is known.

slide-35
SLIDE 35

General mixture distributions


Recognition:

Let's define the result of recognition not as a single decision for some state k, but rather as the set of posterior probabilities (sometimes called responsibilities) for all k given xi,

γk(xi) = pK|X(k|xi) = pK(k) pX|K(xi|k) / ∑_{j∈K} pK(j) pX|K(xi|j),

i.e. the probability that the object was in state k when observation xi was made.

The γk(x) functions can be viewed as discriminant functions.

slide-36
SLIDE 36

General mixture distributions (cont.)


Learning:

Given the training multiset T = ((xi, ki))_{i=1}^{n} (or the respective γk(xi) instead of ki), assume γk(x) is known, pK(k) are not known, and pX|K(x|k) are known except for the parameter values Θk, i.e. we shall write pX|K(x|k, Θk).

Let the object model m be the “set” of all unknown parameters, m = (pK(k), Θk), k ∈ K.

slide-37
SLIDE 37

General mixture distributions (cont.)



The log-likelihood of the model m:

log L(m) = log ∏_{i=1}^{n} pXK(xi, ki) = ∑_{i=1}^{n} log pK(ki) + ∑_{i=1}^{n} log pX|K(xi|ki, Θki)

The log-likelihood using γ:

log L(m) = ∑_{i=1}^{n} ∑_{k∈K} γk(xi) log pK(k) + ∑_{i=1}^{n} ∑_{k∈K} γk(xi) log pX|K(xi|k, Θk)

We search for the optimal model using maximum likelihood:

m∗ = (p∗K(k), Θ∗k) = arg max_m log L(m),

i.e. we compute

p∗K(k) = (1/n) ∑_{i=1}^{n} γk(xi)

and solve |K| independent tasks

Θ∗k = arg max_{Θk} ∑_{i=1}^{n} γk(xi) log pX|K(xi|k, Θk).

slide-38
SLIDE 38

Expectation Maximization Algorithm


Unsupervised learning algorithm [DLR77] for general mixture distributions:
1. Initialize the model parameters m = (pK(k), Θk) for all k.
2. Perform the recognition task, i.e. assuming m is known, compute
   γk(xi) = p̂K|X(k|xi) = pK(k) pX|K(xi|k, Θk) / ∑_{j∈K} pK(j) pX|K(xi|j, Θj).
3. Perform the learning task, i.e. assuming γk(xi) are known, update the ML estimates of the model parameters pK(k) and Θk for all k:
   pK(k) = (1/n) ∑_{i=1}^{n} γk(xi),
   Θk = arg max_{Θk} ∑_{i=1}^{n} γk(xi) log pX|K(xi|k, Θk).
4. Iterate steps 2 and 3 until the model stabilizes.

slide-39
SLIDE 39

Expectation Maximization Algorithm


Features:

The algorithm does not specify how to update Θk in step 3; that depends on the chosen form of pX|K.

The model created in iteration t is always at least as good as the model from iteration t − 1, i.e. L(m) = p(T|m) does not decrease.

[DLR77] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38, 1977.

slide-40
SLIDE 40

Special Case: Gaussian Mixture Model


Each kth component is a Gaussian distribution:

N(x|µk, Σk) = (2π)^(−D/2) |Σk|^(−1/2) exp{−(1/2)(x − µk)ᵀ Σk⁻¹ (x − µk)}

Gaussian Mixture Model (GMM):

p(x) = ∑_{k=1}^{K} pK(k) pX|K(x|k, Θk) = ∑_{k=1}^{K} αk N(x|µk, Σk),

assuming ∑_{k=1}^{K} αk = 1 and 0 ≤ αk ≤ 1.

slide-41
SLIDE 41

EM for GMM


1. Initialize the model parameters m = (pK(k), µk, Σk) for all k.
2. Perform the recognition task as in the general case, i.e. assuming m is known, compute
   γk(xi) = p̂K|X(k|xi) = pK(k) pX|K(xi|k, Θk) / ∑_{j∈K} pK(j) pX|K(xi|j, Θj) = αk N(xi|µk, Σk) / ∑_{j∈K} αj N(xi|µj, Σj).
3. Perform the learning task, i.e. assuming γk(xi) are known, update the ML estimates of the model parameters αk, µk and Σk for all k:
   αk = pK(k) = (1/n) ∑_{i=1}^{n} γk(xi),
   µk = ∑_{i=1}^{n} γk(xi) xi / ∑_{i=1}^{n} γk(xi),
   Σk = ∑_{i=1}^{n} γk(xi)(xi − µk)(xi − µk)ᵀ / ∑_{i=1}^{n} γk(xi).
4. Iterate steps 2 and 3 until the model stabilizes.

slide-42
SLIDE 42

Example: Source data

Source data generated from 3 Gaussians.

slide-43
SLIDE 43

Example: Input to EM algorithm

The data were given to the EM algorithm as an unlabeled dataset.

slide-44
SLIDE 44

Example: EM Iterations


slide-45
SLIDE 45

Example: EM Iterations


slide-46
SLIDE 46

Example: EM Iterations


slide-47
SLIDE 47

Example: EM Iterations


slide-48
SLIDE 48

Example: EM Iterations


slide-49
SLIDE 49

Example: EM Iterations


slide-50
SLIDE 50

Example: EM Iterations


slide-51
SLIDE 51

Example: EM Iterations


slide-52
SLIDE 52

Example: EM Iterations


slide-53
SLIDE 53

Example: EM Iterations


slide-54
SLIDE 54

Example: EM Iterations


slide-55
SLIDE 55

Example: EM Iterations


slide-56
SLIDE 56

Example: Ground Truth and EM Estimate



The ground truth (left) and the EM estimate (right) are very close because

we have enough data,

we know the right number of components, and

we were lucky that EM converged to the right local optimum of the likelihood function.

slide-57
SLIDE 57

What is unsupervised learning?


A strict view:

Only those algorithms conforming to the scheme Θt = L(X̃, R(X̃, Θt−1)), i.e.
only k-means, ISODATA, the EM algorithm, ...
slide-58
SLIDE 58

What is unsupervised learning?



A broader view:

any algorithm that analyses a dataset and extracts potentially usable information just on the basis of x, i.e. without knowing k.

Clustering creates dissimilar groups of similar objects.

Vector quantization searches for several typical prototypes, reduces amount of data.

Outlier detection searches for unusual (non-probable) examples.

Dimensionality reduction chooses/creates a low number of (artificial) variables that describe the data sufficiently well.

Feature extraction derives new features describing the data.

slide-59
SLIDE 59

What is unsupervised learning?


Intuition:

based on the “structure” hidden in the data, we want to derive new features, but

we often do not know what these features should mean.

slide-60
SLIDE 60

Clustering


slide-61
SLIDE 61

Clustering


The goal is to assign a set of objects into groups (called clusters) so that objects in the same cluster are more similar to each other than to objects from different clusters. (The result of clustering does not need to be a rule for how to assign points to clusters.)

A cluster is a set of objects which
1. are similar to each other, and
2. are dissimilar to objects from other clusters.

Issues:

What do we mean by “similar” and “dissimilar”?

What is the right number of clusters?

Cluster analysis studies algorithms for

cluster formation, i.e. how to divide a set of objects into clusters,

segmentation, i.e. how to describe borders of individual clusters or their prototypes (centroids),

cluster labeling, i.e. how to assign meaningful labels to individual clusters.

slide-62
SLIDE 62

Similarity


Similarity (or rather dissimilarity) is usually expressed by a distance:

Minkowski metric: d(x1, x2) = (∑_{i=1}^{D} |x1,i − x2,i|^q)^{1/q}
  • q = 1: L1, Manhattan, city-block, postman metric
  • q = 2: L2, Euclidean distance
  • q = ∞: L∞

Cosine distance: e.g. to assess the similarity of 2 documents (word frequencies): d(x1, x2) = x1ᵀx2 / (|x1| |x2|)

Mahalanobis distance: a covariance-matrix-driven metric (see the multivariate Gaussian distribution): d(x1, x2) = √((x1 − x2)ᵀ Σ⁻¹ (x1 − x2))
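The three (dis)similarity measures above translate directly into code; a small illustrative sketch (note that the cosine expression on the slide is a similarity, so 1 minus it is commonly used as the actual distance):

```python
import numpy as np

def minkowski(x1, x2, q=2):
    """Minkowski distance; q=1 is the Manhattan and q=2 the Euclidean distance."""
    return np.sum(np.abs(x1 - x2) ** q) ** (1.0 / q)

def cosine_similarity(x1, x2):
    """Cosine of the angle between x1 and x2 (e.g. word-frequency vectors of two documents)."""
    return float(x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))

def mahalanobis(x1, x2, Sigma):
    """Mahalanobis distance driven by the covariance matrix Sigma."""
    diff = x1 - x2
    return float(np.sqrt(diff @ np.linalg.solve(Sigma, diff)))
```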
slide-63
SLIDE 63

Clustering algorithms


Hierarchical algorithms build the clusters incrementally using already found clusters, changing their number.
  • Agglomerative (bottom-up)
  • Divisive (top-down)

Partitioning algorithms iteratively refine an a priori given number of clusters.
  • K-means
  • ISODATA
  • fuzzy C-means
  • EM algorithm for a mixture of Gaussians

Graph-theory methods: minimum spanning tree.

Spectral clustering: based on the distance matrix, these methods reduce the space dimensionality; clustering is performed in the lower-dimensional space.

Kohonen maps (self-organizing maps).

slide-64
SLIDE 64

Hierarchical clustering


Agglomerative clustering [Joh67]

n objects, an n × n distance matrix

Algorithm:
1. Create n clusters, each with 1 object.
2. Find the closest pair of clusters and join them.
3. Update the distance matrix for the newly created cluster.
4. Iterate steps 2 and 3 until there is only 1 cluster left.

We need to specify what a distance between 2 clusters means!!!
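A compact pure-Python sketch of the agglomerative procedure, parameterized by the cluster-to-cluster distance (min over pairwise object distances gives single linkage, max gives complete linkage). For simplicity it recomputes the linkage from the original distance matrix D instead of updating the matrix incrementally as in step 3; the function and parameter names are illustrative.

```python
def agglomerative(D, linkage=min):
    """D is an n x n matrix of pairwise object distances.
    Returns the sequence of merges as (cluster_a, cluster_b, distance)."""
    n = len(D)
    clusters = {i: [i] for i in range(n)}       # 1. one cluster per object
    merges = []
    while len(clusters) > 1:                     # 4. repeat until one cluster is left
        # 2. find the closest pair of clusters under the chosen linkage
        (a, b), d = min(
            (((a, b), linkage(D[i][j] for i in clusters[a] for j in clusters[b]))
             for a in clusters for b in clusters if a < b),
            key=lambda pair: pair[1])
        # 3. join the pair; distances to the new cluster follow from D above
        clusters[a].extend(clusters[b])
        del clusters[b]
        merges.append((a, b, d))
    return merges
```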

slide-65
SLIDE 65

Hierarchical clustering



Divisive clustering:
1. Assign all points to 1 big cluster.
2. Divide the cluster into several (2?) smaller clusters, e.g. using k-means.
3. Apply step 2 to each cluster that is still a candidate for splitting.

[Joh67] Stephen Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241–254, 1967.

slide-66
SLIDE 66

Example


[Figure: dendrogram for 22 cases (complete linkage, Euclidean distances) over car models: Porsche, Corvette, Eagle, Isuzu, Ford, Buick, Toyota, Mazda, Volvo, Saab, BMW, Mercedes, Audi, Nissan, Mitsubishi, Pontiac, Honda, VW, Dodge, Chrysler, Olds, Acura; the axis shows the linkage distance, and cutting the tree yields 5 or 4 clusters.]

slide-67
SLIDE 67

Example


[Figure: result of the hierarchical-linkage clustering; the car models plotted in the plane of Factor 1 vs. Factor 2.]

slide-68
SLIDE 68

Summary


slide-69
SLIDE 69

Summary


Learning: Needed when we do not have sufficient statistical info for recognition. Approaches:

Assume pXK has a certain form and use T to estimate its parameters.

Assume the right strategy is in a particular set and use T to choose it.

There are several learning paradigms depending on the choice of criterion used instead of Bayesian risk.

slide-70
SLIDE 70

Summary


Unsupervised learning: What if we use the prediction from the recognition as the information from the teacher and re-estimate the recognizer iteratively?

Fails for algorithms like Rosenblatt's perceptron.

Works for the problem of quadratic clustering.

Expectation-Maximization algorithm!!!

slide-71
SLIDE 71

Summary


Unsupervised learning – a broader view: any analytical procedure that does not use the labels of the training data.

Vector quantization, outlier/novelty detection, dimensionality reduction, feature extraction, ...

Clustering

hierarchical vs. partitioning

slide-72
SLIDE 72

Reference


[DLR77] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38, 1977.
[Joh67] Stephen Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241–254, 1967.
[Mac67] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297, Berkeley, 1967. University of California Press.