

SLIDE 1

EE613 Machine Learning for Engineers

SUBSPACE CLUSTERING

Sylvain Calinon, Robot Learning & Interaction Group, Idiap Research Institute

Oct. 25, 2017

SLIDE 2

  • SUBSPACE CLUSTERING (Wed, Oct. 25)
  • HIDDEN MARKOV MODELS (Wed, Nov. 1)
  • LINEAR REGRESSION (Thu, Nov. 9)
  • GAUSSIAN MIXTURE REGRESSION (Wed, Dec. 13)
  • GAUSSIAN PROCESS REGRESSION (Wed, Dec. 20)

Time series analysis and synthesis, multivariate data processing

SLIDE 3

Outline

  • High-dimensional data clustering (HDDC)
    Matlab code: demo_HDDC01.m
  • Mixture of factor analyzers (MFA)
    Matlab code: demo_MFA01.m
  • Mixture of probabilistic principal component analyzers (MPPCA)
    Matlab code: demo_MPPCA01.m
  • GMM with semi-tied covariance matrices
    Matlab code: demo_semitiedGMM01.m

SLIDE 4

Introduction

Subspace clustering aims at clustering data while reducing the dimension of each cluster (cluster-dependent subspace). Considering the two problems separately (clustering, then subspace projection) can be inefficient and can produce poor local optima, especially when high-dimensional datapoints are considered.

Notation: K clusters; N datapoints; D dimensions (original space); d dimensions (latent space)

SLIDE 5

Example of application: Whole-body motion

  • About 90% of the variance in walking motion can be explained by 2 principal components
  • Each type of periodic motion can be characterized by a different subspace
  • Requires clustering of the complete motion into different locomotion phases
  • Requires extraction of coordination patterns for each cluster

(Figure panels: Walking, Walking, Running)
Image: Dominici et al. (2010), J NEUROPHYSIOL

SLIDE 6

Curse of dimensionality in GMM encoding

Notation: K clusters; N datapoints; D dimensions (original space); d dimensions (latent space)

Image: datasciencecentral.com

SLIDE 7

Curse of dimensionality

Some characteristics of high-dimensional spaces can ease the classification of data: having different groups living in different subspaces may be a useful property for discriminating between the groups. Subspace clustering exploits the phenomenon that high-dimensional spaces are mostly empty to ease the discrimination between groups of points.

Curse of dimensionality… or blessing of dimensionality?

SLIDE 8

Curse of dimensionality

Bouveyron and Brunet (2014, COMPUT STAT DATA AN) reviewed various ways of handling the problem of high-dimensional data in clustering problems:

1. Since D is too large w.r.t. N, a global dimensionality reduction should be applied as a pre-processing step to reduce D.
2. Since D is too large w.r.t. N, the solution space contains many poor local optima. The solution space should be smoothed by introducing ridge or lasso regularization in the estimation of the covariances (avoiding numerical problems and singular solutions when inverting the covariances). A simple form of regularization can be achieved after the maximization step of each EM loop.
3. Since D is too large w.r.t. N, the model is probably over-parametrized, and a more parsimonious model should be used (thus estimating fewer parameters).

Notation: N datapoints; D dimensions (original space); d dimensions (latent space)

SLIDE 9

Gaussian Mixture Model (GMM)

Notation: K Gaussians; N datapoints of dimension D

(Figure: equidensity contours of one standard deviation)
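
As a reminder of the model that the following slides build on, a GMM with mixing coefficients $\pi_k$ defines the density

$$p(\boldsymbol{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad \pi_k \geq 0, \quad \sum_{k=1}^{K} \pi_k = 1.$$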

SLIDE 10

Covariance structures in GMM

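As a point of reference for the parsimonious models that follow, the usual covariance structures in a GMM, with the number of covariance parameters per Gaussian, are:

  • Full: $\boldsymbol{\Sigma}_k$ unconstrained, $D(D+1)/2$ parameters
  • Diagonal: $\boldsymbol{\Sigma}_k = \mathrm{diag}(\sigma_{k,1}^2, \ldots, \sigma_{k,D}^2)$, $D$ parameters
  • Isotropic (spherical): $\boldsymbol{\Sigma}_k = \sigma_k^2 \boldsymbol{I}$, $1$ parameter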

SLIDE 11

Multivariate normal distribution - Stochastic sampling

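A minimal Matlab sketch of stochastic sampling from a multivariate normal through its Cholesky factor (variable names are illustrative, not taken from the course demos):

    % If Sigma = L*L' with L lower-triangular, then x = Mu + L*u with
    % u ~ N(0,I) is distributed as N(Mu, Sigma).
    D = 2;                                % dimension of the data
    Mu = [1; 2];                          % mean vector
    Sigma = [1.0, 0.6; 0.6, 0.5];         % covariance (positive definite)
    nbSamples = 500;
    L = chol(Sigma, 'lower');             % lower-triangular Cholesky factor
    X = repmat(Mu, 1, nbSamples) + L * randn(D, nbSamples);
    plot(X(1,:), X(2,:), '.');            % scatter plot of the samples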

SLIDE 12

Expectation-maximization (EM)


SLIDE 13

Expectation-maximization (EM)

(Flowchart: initial guess -> E-step -> M-step -> converged? If not, loop back to the E-step; if yes, stop.)

SLIDE 14

EM for GMM

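For reference, the standard EM updates for a GMM, writing $h_{n,k}$ for the responsibility of component $k$ for datapoint $\boldsymbol{x}_n$, are:

E-step:
$$h_{n,k} = \frac{\pi_k \, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$

M-step:
$$\pi_k = \frac{1}{N} \sum_{n=1}^{N} h_{n,k}, \qquad \boldsymbol{\mu}_k = \frac{\sum_{n=1}^{N} h_{n,k} \, \boldsymbol{x}_n}{\sum_{n=1}^{N} h_{n,k}}, \qquad \boldsymbol{\Sigma}_k = \frac{\sum_{n=1}^{N} h_{n,k} (\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^\top}{\sum_{n=1}^{N} h_{n,k}}$$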

SLIDE 15

EM for GMM


SLIDE 16

EM for GMM


SLIDE 17

EM for GMM


SLIDE 18

EM for GMM: Resulting procedure

Notation: K Gaussians; N datapoints

These results can be intuitively interpreted in terms of normalized counts; EM provides a systematic approach to derive such a procedure. The updates are weighted averages taking into account the responsibility of each datapoint in each cluster.
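
A compact Matlab sketch of this procedure (illustrative toy data and variable names; the course demo files implement the same loop with more careful initialization and stopping criteria):

    % EM for a GMM on data X (D x N) with K components.
    D = 2; N = 400; K = 2;
    X = [randn(D,N/2), randn(D,N/2)+3];       % toy dataset (D x N)
    Mu = X(:, randi(N, 1, K));                % random datapoints as means
    Sigma = repmat(eye(D), [1, 1, K]);        % identity covariances
    Prior = ones(1, K) / K;                   % uniform mixing coefficients
    for it = 1:100
      % E-step: responsibilities h(n,k)
      h = zeros(N, K);
      for k = 1:K
        h(:,k) = Prior(k) * mvnpdf(X', Mu(:,k)', Sigma(:,:,k));
      end
      h = h ./ repmat(sum(h, 2), 1, K);
      % M-step: weighted averages (normalized counts)
      for k = 1:K
        w = h(:,k)' / sum(h(:,k));            % normalized weights (1 x N)
        Prior(k) = sum(h(:,k)) / N;
        Mu(:,k) = X * w';
        Xc = X - repmat(Mu(:,k), 1, N);       % centered data
        Sigma(:,:,k) = Xc * diag(w) * Xc' + 1E-6 * eye(D);  % small ridge
      end
    end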

SLIDE 19

EM for GMM


SLIDE 20

EM for GMM: Local optima issue


SLIDE 21

Local optima in EM

(Figure: log-likelihood over the parameter space; the solution space is unknown.)

EM improves the likelihood at each iteration, but it can get trapped in poor local optima of the solution space.

Parameter initialization is important!
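
A common remedy is to initialize EM from a k-means partition of the data. A minimal Matlab sketch, reusing X and K from the sketch above (kmeans is from the Statistics and Machine Learning Toolbox):

    % Initialize GMM parameters from a k-means partition of X (D x N).
    [idx, C] = kmeans(X', K);                 % cluster the N datapoints
    Mu = C';                                  % centroids as means (D x K)
    for k = 1:K
      Xk = X(:, idx == k);                    % datapoints of cluster k
      Prior(k) = size(Xk, 2) / size(X, 2);
      Sigma(:,:,k) = cov(Xk') + 1E-6 * eye(size(X, 1));  % regularized
    end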

SLIDE 22

Parameter estimation in GMM… in 1893

54 pages! The proposed solution was a moment-based approach requiring the solution of a polynomial of degree 9… which does not mean that moment-based approaches are old-fashioned: they are popular again today, with new developments related to spectral decomposition.

SLIDE 23

High-dimensional data clustering (HDDC)

Matlab code: demo_HDDC01.m


[C. Bouveyron and C. Brunet. Model-based clustering of high-dimensional data: A review. Computational Statistics and Data Analysis, 71:52–78, March 2014]

SLIDE 24

Curse of dimensionality

Bouveyron and Brunet (2014, COMPUT STAT DATA AN) reviewed various ways of viewing the problem and coping with high-dimensional data in clustering problems:

1. Since D is too large w.r.t. N, a global dimensionality reduction should be applied as a pre-processing step to reduce D.
2. Since D is too large w.r.t. N, the solution space contains many poor local optima; the solution space should be smoothed by introducing ridge or lasso regularization in the estimation of the covariances (avoiding numerical problems and singular solutions when inverting the covariances). A simple form of regularization can be achieved after the maximization step of each EM loop.
3. Since D is too large w.r.t. N, the model is probably over-parametrized, and a more parsimonious model should be used (thus estimating fewer parameters).

SLIDE 25

Regularization of the GMM parameters

(Figure: log-likelihood over the parameter space; the solution space is unknown.)

The introduction of a regularization term can change the shape of the solution space.

SLIDE 26

Regularization of the GMM parameters

  • Tikhonov regularization with diagonal isotropic covariance
  • Regularization with minimal admissible eigenvalue
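
A minimal Matlab sketch of these two schemes, assuming the usual forms (an isotropic ridge added to the covariance, and a flooring of its eigenvalues); rho and lambdaMin are illustrative values, with Sigma, D and k as in the EM sketch above:

    % Tikhonov regularization: add an isotropic diagonal term.
    rho = 1E-3;                               % regularization weight
    SigmaReg = Sigma(:,:,k) + rho * eye(D);

    % Minimal admissible eigenvalue: floor the eigenvalues of Sigma.
    lambdaMin = 1E-3;                         % smallest admissible eigenvalue
    [V, Lmb] = eig(Sigma(:,:,k));
    Lmb = diag(max(diag(Lmb), lambdaMin));
    SigmaReg = V * Lmb * V';

Applied after the M-step of each EM loop, either scheme keeps the covariances invertible.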

slide-27
SLIDE 27

High-dimensional data clustering (HDDC)

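For reference, the HDDC family models each cluster covariance through its eigendecomposition, tying the trailing eigenvalues to a common noise level (a sketch of the parametrization, in the notation of this lecture):

$$\boldsymbol{\Sigma}_k = \boldsymbol{Q}_k \boldsymbol{\Delta}_k \boldsymbol{Q}_k^\top, \qquad \boldsymbol{\Delta}_k = \mathrm{diag}(a_{k,1}, \ldots, a_{k,d_k}, b_k, \ldots, b_k)$$

where $\boldsymbol{Q}_k$ is orthogonal, the $a_{k,j} \geq b_k$ capture the variance in the cluster-specific subspace of dimension $d_k$, and $b_k$ models the residual noise in the remaining $D - d_k$ directions.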

SLIDE 28

Mixture of factor analyzers (MFA)

Matlab code: demo_MFA01.m


[P. D. McNicholas and T. B. Murphy. Parsimonious Gaussian mixture models. Statistics and Computing, 18(3):285–296, September 2008]

SLIDE 29

Mixture of factor analyzers (MFA)

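The standard MFA generative model, with $d$-dimensional latent factors $\boldsymbol{z}$:

$$\boldsymbol{x} = \boldsymbol{\mu}_k + \boldsymbol{\Lambda}_k \boldsymbol{z} + \boldsymbol{\epsilon}, \qquad \boldsymbol{z} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}_d), \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Psi}_k)$$

with $\boldsymbol{\Lambda}_k$ a $D \times d$ matrix of factor loadings and $\boldsymbol{\Psi}_k$ diagonal, which yields the marginal covariance $\boldsymbol{\Sigma}_k = \boldsymbol{\Lambda}_k \boldsymbol{\Lambda}_k^\top + \boldsymbol{\Psi}_k$.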

SLIDE 30

Mixture of factor analyzers (MFA)


SLIDE 31

Mixture of factor analyzers (MFA): graphical model


SLIDE 32

Mixture of factor analyzers (MFA)


SLIDE 33

Mixture of factor analyzers (MFA)

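A Matlab sketch of this generative view, sampling from a single MFA component (the loadings and noise values are illustrative placeholders, not taken from demo_MFA01.m):

    % Sample from one MFA component: x = Mu + Lambda*z + eps.
    D = 3; d = 1; nbSamples = 500;
    Mu = zeros(D, 1);
    Lambda = [1; 0.8; -0.5];                  % D x d factor loadings
    Psi = 0.05 * eye(D);                      % diagonal noise covariance
    Z = randn(d, nbSamples);                  % latent factors z ~ N(0, I_d)
    E = diag(sqrt(diag(Psi))) * randn(D, nbSamples);  % noise eps ~ N(0, Psi)
    X = repmat(Mu, 1, nbSamples) + Lambda * Z + E;
    SigmaMarginal = Lambda * Lambda' + Psi;   % marginal covariance of x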

SLIDE 34

Estimation of parameters in MFA

SLIDE 35

Alternating Expectation Conditional Maximization (AECM)
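
In brief, AECM partitions the parameters into blocks and cycles through them, alternating an E-step with a conditional maximization of one block while the other blocks are held fixed; for MFA, this separates the updates of the mixing coefficients and means from those of the factor loadings and noise terms.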

SLIDE 36

AECM for MFA (UUU model in McNicholas and Murphy, 2008)

(Annotation on the updates: covariance estimated as in a standard GMM.)

SLIDE 37

AECM for MFA (UUU model in McNicholas and Murphy, 2008)

(Annotations on the updates: same as standard GMM; covariance estimated as in a standard GMM.)

SLIDE 38

Mixture of probabilistic PCA (MPPCA)

Matlab code: demo_MPPCA01.m


[M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443–482, 1999]

SLIDE 39

Mixture of probabilistic PCA (MPPCA)

(Annotation on the updates: covariance estimated as in a standard GMM.)
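
MPPCA constrains each component's noise to be isotropic. In Tipping and Bishop's formulation,

$$\boldsymbol{\Sigma}_k = \boldsymbol{W}_k \boldsymbol{W}_k^\top + \sigma_k^2 \boldsymbol{I},$$

and, given the eigenvalues $\lambda_1 \geq \ldots \geq \lambda_D$ and eigenvectors $\boldsymbol{u}_j$ of a component's weighted sample covariance, the maximum-likelihood solution keeps the top $d$ eigenvectors:

$$\sigma_k^2 = \frac{1}{D-d} \sum_{j=d+1}^{D} \lambda_j, \qquad \boldsymbol{W}_k = \boldsymbol{U}_d (\boldsymbol{\Lambda}_d - \sigma_k^2 \boldsymbol{I})^{1/2}$$

(up to an arbitrary rotation of the latent space).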

SLIDE 40

A taxonomy of parsimonious GMMs

[C. Bouveyron and C. Brunet. Model-based clustering of high-dimensional data: A review. Computational Statistics and Data Analysis, 71:52–78, March 2014]

(Notation: the original-space dimension is denoted D in the slides of this lecture.)

SLIDE 41

GMM with semi-tied covariance matrices

Matlab code: demo_semitiedGMM01.m


[M. J. F. Gales. Semi-tied covariance matrices for hidden Markov models. IEEE Trans. on Speech and Audio Processing, 7(3):272–281, 1999]

SLIDE 42

Sharing of parameters in mixture models


SLIDE 43

GMM with semi-tied covariance matrices
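
In Gales' semi-tied formulation, all components share a common (non-diagonal) transform $\boldsymbol{H}$ while keeping component-specific diagonal variances:

$$\boldsymbol{\Sigma}_k = \boldsymbol{H} \, \boldsymbol{\Sigma}_k^{\mathrm{diag}} \, \boldsymbol{H}^\top$$

with $\boldsymbol{H}$ a $D \times D$ matrix shared by the $K$ components and $\boldsymbol{\Sigma}_k^{\mathrm{diag}}$ diagonal, so the covariance parameters grow as one $D \times D$ matrix plus $K D$ variances instead of $K \, D(D+1)/2$.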

SLIDE 44

GMM with semi-tied covariance matrices

SLIDE 45

GMM with semi-tied covariance matrices


SLIDE 46

GMM with semi-tied covariance matrices

(Annotation on the updates: covariance estimated as in a standard GMM.)
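
A Matlab sketch of how the semi-tied structure couples the component covariances (H and the diagonal terms are illustrative placeholders, not estimated as in demo_semitiedGMM01.m):

    % Construct K semi-tied covariances Sigma_k = H * diag(s_k) * H'.
    D = 2; K = 3;
    H = orth(randn(D));                       % shared transform (given here)
    s = rand(D, K) + 0.1;                     % component-specific variances
    Sigma = zeros(D, D, K);
    for k = 1:K
      Sigma(:,:,k) = H * diag(s(:,k)) * H';   % full covariance, tied basis
    end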

SLIDE 47

Summary of relevant covariance structures
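
A compact recap of the structures covered, in the notation of this lecture (covariance parameter counts are indicative, ignoring rotational indeterminacies):

    Model       Covariance structure                              Covariance parameters
    GMM (full)  Sigma_k unconstrained                             K D(D+1)/2
    HDDC        Sigma_k = Q_k Delta_k Q_k', trailing
                eigenvalues tied to a noise level b_k             depends on the sub-model
    MFA         Sigma_k = Lambda_k Lambda_k' + Psi_k (diagonal)   K (D d + D)
    MPPCA       Sigma_k = W_k W_k' + sigma_k^2 I                  K (D d + 1)
    Semi-tied   Sigma_k = H Sigma_k^diag H' (H shared)            D^2 + K D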

SLIDE 48

Main references

Parsimonious GMM
  • C. Bouveyron and C. Brunet. Model-based clustering of high-dimensional data: A review. Computational Statistics and Data Analysis, 71:52–78, March 2014
  • P. D. McNicholas and T. B. Murphy. Parsimonious Gaussian mixture models. Statistics and Computing, 18(3):285–296, September 2008

MFA
  • G. J. McLachlan, D. Peel, and R. W. Bean. Modelling high-dimensional data by mixtures of factor analyzers. Computational Statistics and Data Analysis, 41(3-4):379–388, 2003
  • G. E. Hinton, P. Dayan, and M. Revow. Modeling the manifolds of images of handwritten digits. IEEE Trans. on Neural Networks, 8(1):65–74, 1997

MPPCA
  • M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443–482, 1999

GMM with semi-tied covariances
  • M. J. F. Gales. Semi-tied covariance matrices for hidden Markov models. IEEE Trans. on Speech and Audio Processing, 7(3):272–281, 1999

SLIDE 49

General textbooks


SLIDE 50

Advanced related research topics (not covered in the course)

Coordinated MFA by using common factor loadings
  • J. Baek, G. J. McLachlan, and L. K. Flack. Mixtures of factor analyzers with common factor loadings: Applications to the clustering and visualization of high-dimensional data. IEEE Trans. Pattern Anal. Mach. Intell., 32(7):1298–1309, 2010
  • J. Verbeek. Learning nonlinear image manifolds by global alignment of local linear models. IEEE Trans. on Pattern Analysis & Machine Intelligence, 28(8):1236–1250, August 2006

Estimation of K and d_k in MFA with Bayesian nonparametrics
  • Y. Wang and J. Zhu. DP-space: Bayesian nonparametric subspace clustering with small-variance asymptotics. In Proc. Intl Conf. on Machine Learning (ICML), pages 1–9, Lille, France, 2015

Online parameter estimation in MPPCA
  • A. Bellas, C. Bouveyron, M. Cottrell, and J. Lacaille. Model-based clustering of high-dimensional data streams with online mixture of probabilistic PCA. Advances in Data Analysis and Classification, 7(3):281–300, 2013

SLIDE 51

Advanced related research topics (not covered in the course)

Sparse subspace clustering with L1 regularization
  • H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006
  • Y. Guan and J. G. Dy. Sparse probabilistic principal component analysis. In Intl Conf. on Artificial Intelligence and Statistics, pages 185–192, 2009

Deep MFA
  • Y. Tang, R. Salakhutdinov, and G. Hinton. Deep mixtures of factor analysers. In Proc. Intl Conf. on Machine Learning (ICML), Edinburgh, Scotland, 2012

Mixture of tensor analyzers (MTA)
  • Y. Tang, R. Salakhutdinov, and G. Hinton. Tensor analyzers. In Proc. Intl Conf. on Machine Learning (ICML), Atlanta, USA, 2013