
Going further in cluster analysis and classification: Bi-clustering and co-clustering



  1. HAL Id: hal-01810380, https://hal.inria.fr/hal-01810380. Submitted on 7 Jun 2018. To cite this version: C. Biernacki. Going further in cluster analysis and classification: Bi-clustering and co-clustering. Summer School on Clustering, Data Analysis and Visualization of Complex Data, May 2018, Catania, Italy. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

  2. Going further in cluster analysis and classification: Bi-clustering and co-clustering. C. Biernacki. Summer School on Clustering, Data Analysis and Visualization of Complex Data, May 21-25 2018, University of Catania, Italy.

  3. Outline: 1 HD clustering; 2 Modeling; 3 Estimating; 4 Selecting; 5 BlockCluster in MASSICCC; 6 To go further.

  4. Motivation. High-dimensional (HD) data sets are now frequent: marketing (d ∼ 10^2), microarray gene expression (d ∼ 10^2–10^4), SNP data (d ∼ 10^6), curves (dimension depends on the discretization but can be very high), text mining, etc. Clustering has to be applied to HD data sets for the same reasons as to lower-dimensional ones: data summary, data exploration, and preprocessing that gives more flexibility to a forthcoming prediction step. But clustering is even more important in the HD setting, since visualization there can be hazardous.

  5. Today: exponential growth of dimension.¹ ¹ S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29.

  6. HD data: definition (1/2). An attempt in the non-parametric case: a dataset x = (x_1, ..., x_n), each observation described by d variables, is high-dimensional when n = o(e^d). Justifications: to approximate within error ε a (Lipschitz) function of d variables, about (1/ε)^d evaluations on a grid are required [Bellman, 61]; to approximate a Gaussian density with fixed Gaussian kernels, with a relative error of about 10%, requires [Silverman, 86] log_10 n(d) ≈ 0.6 (d − 0.25). For instance, n(10) ≈ 7 × 10^5.
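As a quick numerical check of Silverman's figure, a minimal sketch in Python (the formula is the one on the slide; the function name is illustrative):

```python
import math

def silverman_sample_size(d: int) -> float:
    """Sample size needed to estimate a d-variate Gaussian density
    with ~10% relative error using fixed Gaussian kernels,
    following the slide's rule log10 n(d) ≈ 0.6 (d − 0.25)."""
    return 10 ** (0.6 * (d - 0.25))

for d in (1, 2, 5, 10):
    print(d, f"{silverman_sample_size(d):.2e}")
# d = 10 gives roughly 7e5, matching n(10) ≈ 7 × 10^5 on the slide.
```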

  7. HD data: definition (2/2). An attempt in the parametric case: a dataset x = (x_1, ..., x_n), each observation described by d variables, together with a model m having ν parameters, is high-dimensional when n = o(g(ν)), with g a given function. Justification: consider the heteroscedastic Gaussian mixture with true parameter θ* and K* components, and let θ̂ be the Gaussian MLE with K* components. Then g is linear, by the following result [Michel, 08]: there exist constants κ, A and C such that
  E_x[Hellinger²(p_θ*, p_θ̂)] ≤ C [ κ (ν/n) (2A ln d + 1 − ln(1 ∧ (ν/n) A ln d)) + 1/n ].
But ν can be high, since ν ∼ d²/2, combined with potentially large constants.
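To see why ν ∼ d²/2, here is a small sketch counting the free parameters of a K-component heteroscedastic (full-covariance) Gaussian mixture; the counting rule is standard, not taken from the slides, and the function name is illustrative:

```python
def gmm_param_count(K: int, d: int) -> int:
    """Free parameters of a d-variate, K-component Gaussian mixture with
    unrestricted (heteroscedastic) covariances: K - 1 mixing proportions,
    K*d means and K*d*(d+1)/2 covariance terms."""
    return (K - 1) + K * d + K * d * (d + 1) // 2

for d in (10, 100, 1000):
    print(d, gmm_param_count(2, d))
# For fixed K the dominant term is K*d^2/2, i.e. nu grows like d^2/2 per component.
```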

  8. HD density estimation: curse. A two-component d-variate Gaussian mixture: π_1 = π_2 = 1/2, X_1 | z_11 = 1 ∼ N_d(0, I), X_1 | z_12 = 1 ∼ N_d(1, I). Components are more and more separated when d grows: ‖µ_2 − µ_1‖_I = √d ... [figure: left, scatter of the two components in the (x1, x2) plane; right, Kullback-Leibler divergence of the estimated density versus d] ... but density estimation quality decreases with d.

  9. HD clustering: blessing (1/2). The same two-component d-variate Gaussian mixture: π_1 = π_2 = 1/2, X_1 | z_11 = 1 ∼ N_d(0, I), X_1 | z_12 = 1 ∼ N_d(1, I). Each variable provides an equal and own amount of separation information. The theoretical error decreases when d grows: err_theo = Φ(−√d / 2) ... [figure: left, scatter of the two components in the (x1, x2) plane; right, empirical and theoretical error rates versus d] ... and the empirical error rate also decreases with d!
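A minimal simulation sketch of this blessing, comparing the theoretical error Φ(−√d/2) with the empirical error of the optimal (known-parameter) rule; the sample size and seed are arbitrary choices:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 5000  # test points per component (arbitrary)

for d in (1, 2, 5, 10):
    # Theoretical Bayes error for N_d(0, I) vs N_d(1, I) with equal proportions.
    theo = norm.cdf(-np.sqrt(d) / 2)
    # Bayes rule with known parameters: assign to component 2 iff sum(x) > d/2.
    x1 = rng.normal(0.0, 1.0, size=(n, d))   # drawn from component 1
    x2 = rng.normal(1.0, 1.0, size=(n, d))   # drawn from component 2
    mis1 = np.mean(x1.sum(axis=1) > d / 2)   # wrongly assigned to component 2
    mis2 = np.mean(x2.sum(axis=1) < d / 2)   # wrongly assigned to component 1
    print(d, round(theo, 3), round((mis1 + mis2) / 2, 3))
# Both columns shrink as d grows: more variables, more separation.
```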

  10. HD clustering: blessing (2/2). [figure: FDA projections of the simulated data on the first two FDA axes, for d = 2, 20, 200 and 400]

  11. HD clustering: curse (1/2). Many variables provide no separation information. Same parameter setting except: X_1 | z_12 = 1 ∼ N_d((1 0 ... 0)', I). The groups are not separated more when d grows: ‖µ_2 − µ_1‖_I = 1 ... [figure: left, scatter of the two components in the (x1, x2) plane; right, empirical and theoretical error rates versus d] ... thus the theoretical error is constant (= Φ(−1/2)) and the empirical error increases with d.
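A sketch of this curse under the assumption that the rule must be estimated from data: a nearest-estimated-mean classifier is trained on a modest sample (training size, seed and the choice of classifier are illustrative, not from the slide):

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test = 30, 5000  # per component (arbitrary)

def sample(d, n, informative_mean):
    mean = np.zeros(d)
    mean[0] = informative_mean  # only the first variable separates the groups
    return rng.normal(mean, 1.0, size=(n, d))

for d in (1, 2, 5, 10, 50):
    m1 = sample(d, n_train, 0.0).mean(axis=0)   # estimated mean, group 1
    m2 = sample(d, n_train, 1.0).mean(axis=0)   # estimated mean, group 2
    x1, x2 = sample(d, n_test, 0.0), sample(d, n_test, 1.0)
    # Nearest estimated mean classification.
    err1 = np.mean(np.linalg.norm(x1 - m2, axis=1) < np.linalg.norm(x1 - m1, axis=1))
    err2 = np.mean(np.linalg.norm(x2 - m1, axis=1) < np.linalg.norm(x2 - m2, axis=1))
    print(d, round((err1 + err2) / 2, 3))
# The test error drifts above the constant Bayes error Phi(-1/2) ~ 0.31 as d grows:
# the extra noise variables only add estimation variance.
```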

  12. HD clustering: curse (2/2). Many variables provide redundant separation information. Same parameter setting except: X_1^j = X_1^1 + N_1(0, 1) for j = 2, ..., d. The groups are not separated more when d grows: ‖µ_2 − µ_1‖_Σ = 1 ... [figure: left, scatter of the two components in the (x1, x2) plane; right, empirical and theoretical error rates versus d] ... thus err_theo is constant (= Φ(−1/2)) and the empirical error increases (less) with d.
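A small check, under the slide's data-generating scheme, that redundancy adds no separation: the Mahalanobis distance between the two means stays equal to 1 whatever d (the covariance below simply mirrors X^j = X^1 + independent N(0,1) noise):

```python
import numpy as np

for d in (2, 5, 20, 100):
    # Covariance of (X^1, ..., X^d) when X^j = X^1 + eps_j, eps_j ~ N(0,1) iid:
    # Var(X^1) = 1, Var(X^j) = 2 for j >= 2, and every covariance equals 1.
    Sigma = np.ones((d, d)) + np.eye(d)
    Sigma[0, 0] = 1.0
    delta = np.ones(d)  # mu_2 - mu_1: every coordinate shifts by 1
    maha = np.sqrt(delta @ np.linalg.solve(Sigma, delta))
    print(d, round(maha, 6))
# Prints 1.0 for every d: the theoretical error stays at Phi(-1/2).
```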

  13. The trade-off bias/variance. The fundamental statistical principle: always minimize an error err between the truth (z) and the estimate (ẑ). Gap between the true partition (z) and the model-based partitions (Z_P): z* = argmin_{z̃ ∈ Z_P} ∆(z, z̃). Estimation ẑ of z* in Z_P: any relevant method (bias, consistency, efficiency, ...). Fundamental decomposition of the observed error err(z, ẑ):
  err(z, ẑ) = [err(z, z*) − err(z, z)] + [err(z, ẑ) − err(z, z*)]
            = bias + variance
            = error of approximation + error of estimation.

  14. Bias/variance in HD: reduce variance, accept bias. A two-component d-variate Gaussian mixture with intra-dependency: π_1 = π_2 = 1/2, X_1 | z_11 = 1 ∼ N_d(0, Σ), X_1 | z_12 = 1 ∼ N_d(1, Σ). Each variable provides an equal and own amount of separation information. The theoretical error decreases when d grows: err_theo = Φ(−‖µ_2 − µ_1‖_{Σ^{-1}} / 2). The empirical error rate with the (true) intra-correlated model worsens with d, while the empirical error rate with the (false) intra-independent model improves with d! [figure: left, empirical error rates of the correlated and independent models and the theoretical error versus d; right, scatter of the two components in the (x1, x2) plane]
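A sketch of this bias/variance effect using scikit-learn's GaussianMixture with a full versus diagonal covariance model (the correlation level, sample sizes and seed are arbitrary; the slide's exact Σ is not specified, so an equicorrelated matrix is used for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
n = 100  # training points per component (arbitrary)

def misclassification(labels, pred):
    # Clustering labels are defined up to a permutation of the two groups.
    err = np.mean(labels != pred)
    return min(err, 1 - err)

for d in (2, 5, 10, 20):
    Sigma = 0.5 * np.ones((d, d)) + 0.5 * np.eye(d)   # equicorrelated (assumed)
    X = np.vstack([rng.multivariate_normal(np.zeros(d), Sigma, n),
                   rng.multivariate_normal(np.ones(d), Sigma, n)])
    z = np.repeat([0, 1], n)
    for cov in ("full", "diag"):
        gm = GaussianMixture(n_components=2, covariance_type=cov,
                             n_init=5, random_state=0).fit(X)
        print(d, cov, round(misclassification(z, gm.predict(X)), 3))
# With moderate n, the misspecified 'diag' model often clusters better in
# higher d than the true 'full' model: less variance, at the price of bias.
```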

  15. Some alternatives for reducing variance: dimension reduction in a non-canonical space (typically PCA-like); dimension reduction in the canonical space (variable selection); model parsimony in the initial HD space (constraints on the model parameters). But which kind of parsimony? Remember that clustering is a way of dealing with large n; why not reuse this idea for large d? Co-clustering does exactly this: it makes the row clustering parsimonious through a clustering of the variables, as the parameter count below illustrates.
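To make the parsimony concrete, here is a rough count of parameters for a plain Gaussian mixture over d variables versus a Gaussian latent block model with K row groups and L column groups; the block-model parameterization used here (one mean and variance per block, plus row and column proportions) is a standard formulation, not taken from these slides:

```python
def mixture_params(K: int, d: int) -> int:
    """Diagonal Gaussian mixture: K-1 proportions, K*d means, K*d variances."""
    return (K - 1) + 2 * K * d

def latent_block_params(K: int, L: int) -> int:
    """Gaussian latent block model: K-1 row and L-1 column proportions,
    plus one mean and one variance per (row group, column group) block."""
    return (K - 1) + (L - 1) + 2 * K * L

print(mixture_params(K=3, d=1000))     # 6002 parameters
print(latent_block_params(K=3, L=5))   # 36 parameters
# Clustering the d = 1000 variables into L = 5 groups collapses the
# per-variable parameters into per-block parameters.
```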

  16. From clustering to co-clustering [Govaert, 2011].
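As an executable illustration of the idea, a minimal sketch using scikit-learn's SpectralCoclustering; this is not the latent block model behind BlockCluster that the later slides develop, only a stand-in for jointly grouping rows and columns (data sizes and number of clusters are arbitrary):

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering
from sklearn.datasets import make_biclusters

# Simulate a 300 x 50 matrix with a hidden block structure.
X, rows, cols = make_biclusters(shape=(300, 50), n_clusters=4,
                                noise=5, random_state=0)
model = SpectralCoclustering(n_clusters=4, random_state=0).fit(X)

# Reorder rows and columns by their cluster labels: the block pattern appears.
reordered = X[np.argsort(model.row_labels_)][:, np.argsort(model.column_labels_)]
print(model.row_labels_[:10], model.column_labels_[:10])
```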

  17. Bi-clustering. A generalization of co-clustering: look for submatrices of x which are homogeneous. We do not consider bi-clustering here.

  18. Outline: 1 HD clustering; 2 Modeling; 3 Estimating; 4 Selecting; 5 BlockCluster in MASSICCC; 6 To go further.
