

SLIDE 1

Co-clustering for large datasets

Mohamed Nadif

LIPADE, Université Paris Descartes, France

Joint work with G. Govaert and L. Lazhar

Nadif (LIPADE) AAFD’14, April 29-30, 2014 Co-clustering 1 / 35

SLIDE 2

Introduction

Outline

1. Introduction
   - Co-clustering methods
   - Binary data
   - Continuous data
2. Latent block model and CML approach
   - Bernoulli latent block models
   - Gaussian latent block models
   - Asymmetric Gaussian model
3. Factorization
   - Nonnegative Matrix Factorization
   - Nonnegative Matrix Tri-Factorization
4. Conclusion

SLIDE 3

Introduction Co-clustering methods

Simultaneous clustering on both dimensions
- Co-clustering methods have attracted much attention in recent years
- Block clustering has influenced applied mathematics since the sixties (Jennings, 1968)
- First works: J.A. Hartigan, Direct Clustering of a Data Matrix (1972)
- Works of Govaert (1983)
- Referred to in the literature as bi-clustering, co-clustering, double clustering, direct clustering, coupled clustering
- Different approaches exist (see for instance chapter 1, Govaert and Nadif 2013); they differ in the patterns they seek and the types of data they apply to
- Organization of the data matrix into homogeneous blocks, or extraction of co-clusters

- non-overlapping co-clustering
- overlapping co-clustering

Aim: cluster the sets of rows and columns simultaneously in order to obtain homogeneous blocks

SLIDE 4

Introduction Co-clustering methods

Example of co-clustering

[Figure: the data3 matrix before and after reordering by the co-clustering result]

Why co-clustering?
1. Utilizing the duality of clustering
2. Reducing running time
3. Discovering hidden latent patterns and generating a compact representation
4. Reducing dimensionality implicitly
5. Coping with high-dimensional data

SLIDE 5

Introduction Co-clustering methods

Applications and approaches

Fields
- Text mining: clustering documents and words simultaneously
- Bioinformatics: clustering genes and tissues simultaneously
- Collaborative filtering
- Social network analysis

Approaches
- Spectral
- Factorization
- Latent block models
- etc.

Software
- R package {biclust}, Bicat, etc.
- R package {blockcluster}

SLIDE 6

Introduction Co-clustering methods

Notations

Let x = (x_ij) be a data matrix of size n × d, with i ∈ I, the set of n rows, and j ∈ J, the set of d columns.

Partition z of I into g clusters: z = (z_1, ..., z_n) → (z_ik), where z_i is the cluster indicator of i, i.e. z_ik = 1 if i belongs to the kth cluster and z_ik = 0 otherwise; z.k denotes the cardinality of the kth cluster, k ∈ {1, ..., g}.

[Small table: example labels z_i and the corresponding binary indicators (z_i1, z_i2, z_i3)]

Partition w of J into m clusters: w = (w_1, ..., w_d) → (w_jℓ), where w_j is the cluster indicator of j, i.e. w_jℓ = 1 if j belongs to the ℓth cluster and w_jℓ = 0 otherwise; w.ℓ denotes the cardinality of the ℓth cluster, ℓ ∈ {1, ..., m}.

From z and w, the block formed by the couple of the kth and ℓth clusters is defined by the x_ij's such that z_ik w_jℓ = 1.
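The indicator notation above can be sketched in numpy; the labels used here are illustrative, not taken from the talk.

```python
import numpy as np

def indicator(labels, n_clusters):
    """Binary classification matrix: row i has a 1 in the column of its cluster."""
    z = np.zeros((len(labels), n_clusters), dtype=int)
    z[np.arange(len(labels)), labels] = 1
    return z

# illustrative labels z = (3, 1, 2, 1), stored 0-based
z = indicator(np.array([2, 0, 1, 0]), 3)
# each row of z is (z_i1, z_i2, z_i3); column sums give the cluster sizes z.k
sizes = z.sum(axis=0)
```

The same construction gives the column partition w, with d rows and m columns.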

SLIDE 7

Introduction Co-clustering methods

General principle: binary data, contingency tables, continuous data. For each data type, the block representative a_kℓ and the associated criterion are:

Data          a_kℓ    Criterion
Binary        Mode    Σ_{i,j,k,ℓ} z_ik w_jℓ |x_ij − a_kℓ|
Contingency   Sum     I(z, w) = Σ_{k,ℓ} p_kℓ log( p_kℓ / (p_k. p_.ℓ) ), or χ²(z, w)
Continuous    Mean    Σ_{i,j,k,ℓ} z_ik w_jℓ (x_ij − a_kℓ)² = ||x − z a wᵀ||²

SLIDE 8

Introduction Binary data

Notations and example

[Example: a 10 × 10 binary data matrix x with rows a–j, and the same matrix reorganized according to row clusters A, B, C and column clusters 1, 2; the summary matrix a gives the majority value of each block]

From z and w, three reduced matrices can be defined:

Matrix             Size     Definition
x^z = (x^z_kj)     g × d    x^z_kj = Σ_i z_ik x_ij
x^w = (x^w_iℓ)     n × m    x^w_iℓ = Σ_j w_jℓ x_ij
x^zw = (x^zw_kℓ)   g × m    x^zw_kℓ = Σ_{i,j} z_ik w_jℓ x_ij
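The three reduced matrices are plain matrix products of the indicator matrices with x; a minimal numpy sketch with illustrative random data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = (rng.random((10, 10)) < 0.3).astype(int)      # toy binary matrix
z = np.eye(3, dtype=int)[rng.integers(0, 3, 10)]  # row indicators (10 x 3)
w = np.eye(2, dtype=int)[rng.integers(0, 2, 10)]  # column indicators (10 x 2)

xz = z.T @ x          # (g x d): xz_kj  = sum_i z_ik x_ij
xw = x @ w            # (n x m): xw_il  = sum_j w_jl x_ij
xzw = z.T @ x @ w     # (g x m): xzw_kl = sum_{i,j} z_ik w_jl x_ij
```

Note that x^zw can equivalently be obtained by reducing x^w over rows or x^z over columns, which is what the algorithms below exploit.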

SLIDE 9

Introduction Binary data

Intermediate data matrices x^z, x^w and x^zw

[Example: the reorganized binary matrix of the previous slide with its reduced matrices x^w (10 × 2), x^z (3 × 10) and x^zw (3 × 2)]

Minimization of the following criterion:

C(z, w, a) = Σ_{i,j,k,ℓ} z_ik w_jℓ |x_ij − a_kℓ|, where a_kℓ ∈ {0, 1}

SLIDE 10

Introduction Binary data

Algorithm

Minimization of C(z, w, a) by alternated minimization of C(z, a|w) and C(w, a|z).

Crobin (here ⌊x⌉ is the nearest-integer function)
input: x, g, m
initialization: z, w, a_kℓ = ⌊ x^zw_kℓ / (z.k w.ℓ) ⌉
repeat
    x^w_iℓ = Σ_j w_jℓ x_ij
    repeat
        step 1. z_i = argmin_k Σ_ℓ |x^w_iℓ − w.ℓ a_kℓ|
        step 2. a_kℓ = ⌊ Σ_i z_ik x^w_iℓ / (z.k w.ℓ) ⌉
    until convergence
    x^z_kj = Σ_i z_ik x_ij
    repeat
        step 3. w_j = argmin_ℓ Σ_k |x^z_kj − z.k a_kℓ|
        step 4. a_kℓ = ⌊ Σ_j w_jℓ x^z_kj / (z.k w.ℓ) ⌉
    until convergence
until convergence
return z, w, a
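The Crobin loop can be sketched in numpy as follows; this is an illustrative implementation (function and variable names are ours, not the author's code), with a single outer loop count standing in for the convergence tests.

```python
import numpy as np

def crobin(x, g, m, n_iter=20, seed=0):
    """Sketch of Crobin: alternating minimization of
    sum_{i,j,k,l} z_ik w_jl |x_ij - a_kl| with binary representatives a_kl."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    zi = rng.integers(0, g, n)              # row cluster labels
    wj = rng.integers(0, m, d)              # column cluster labels
    for _ in range(n_iter):
        zk = np.bincount(zi, minlength=g)   # z.k
        wl = np.bincount(wj, minlength=m)   # w.l
        # a_kl: block means rounded to the nearest integer (0 or 1)
        xw = np.stack([x[:, wj == l].sum(axis=1) for l in range(m)], axis=1)
        xzw = np.stack([xw[zi == k].sum(axis=0) for k in range(g)], axis=0)
        a = np.rint(xzw / np.maximum(np.outer(zk, wl), 1))
        # step 1: reassign rows using the reduced matrix xw (n x m)
        zi = np.abs(xw[:, None, :] - (wl * a)[None, :, :]).sum(axis=2).argmin(axis=1)
        # step 3: reassign columns using the reduced matrix xz (g x d)
        zk = np.bincount(zi, minlength=g)
        xz = np.stack([x[zi == k].sum(axis=0) for k in range(g)], axis=0)
        wj = np.abs(xz.T[:, :, None] - (zk[:, None] * a)[None, :, :]).sum(axis=1).argmin(axis=1)
    return zi, wj

x = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]])
zi, wj = crobin(x, 2, 2)
```

The row and column steps only touch the reduced matrices x^w and x^z, never the full n × d matrix, which is the point of the algorithm for large datasets.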

SLIDE 11

Introduction Continuous data

Two geometrical representations. Each row i is weighted by p_i and each column j is weighted by q_j:

d²(i, i′) = Σ_{j=1}^d q_j (x_ij − x_i′j)²  and  d²(j, j′) = Σ_{i=1}^n p_i (x_ij − x_ij′)²

In the sequel, and only to simplify the notation, we assume that p_i = 1/n for all i and q_j = 1 for all j. Using a partition z of I and a partition w of J, the initial data is summarized by two sets of weights p^z = (p^z_1, ..., p^z_g) and q^w = (q^w_1, ..., q^w_m) and a g × m matrix x^zw = (x^zw_kℓ) defined by

p^z_k = Σ_i z_ik / n = z.k / n,  q^w_ℓ = Σ_j w_jℓ = w.ℓ,  and
x^zw_kℓ = Σ_{i,j} z_ik w_jℓ p_i q_j x_ij / Σ_{i,j} z_ik w_jℓ p_i q_j = Σ_{i,j} z_ik w_jℓ x_ij / (z.k w.ℓ)

SLIDE 12

Introduction Continuous data

Example. Let

x = ( 1 2 8 ; 2 1 7 ; 2 4 7 ; 4 4 6 ),  p = (1/4, 1/4, 1/4, 1/4) and q = (1, 1, 1)

Let z = (1, 1, 2, 2) and w = (1, 1, 2); we obtain the summary x^zw with weights p^z = (1/2, 1/2) and q^w = (2, 1). The matrices x^w = (x^w_iℓ) of size 4 × 2 and x^z = (x^z_kj) of size 2 × 3 can be defined by

x^w_iℓ = Σ_j w_jℓ q_j x_ij / Σ_j w_jℓ q_j = Σ_j w_jℓ x_ij / w.ℓ  and  x^z_kj = Σ_i z_ik p_i x_ij / Σ_i z_ik p_i = Σ_i z_ik x_ij / z.k

which gives

x^z = ( 1.5 1.5 7.5 ; 3 4 6.5 ),  x^w = ( 1.5 8 ; 1.5 7 ; 3 7 ; 4 6 )  and  x^zw = ( 1.5 7.5 ; 3.5 6.5 )
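The reduced matrices of this example can be checked with a few matrix products (a sketch; the indicator construction is ours):

```python
import numpy as np

x = np.array([[1, 2, 8],
              [2, 1, 7],
              [2, 4, 7],
              [4, 4, 6]], dtype=float)
z = np.eye(2)[[0, 0, 1, 1]]      # z = (1, 1, 2, 2)
w = np.eye(2)[[0, 0, 1]]         # w = (1, 1, 2)
zk, wl = z.sum(axis=0), w.sum(axis=0)      # z.k = (2, 2), w.l = (2, 1)

xw = (x @ w) / wl                          # 4 x 2 matrix of row-wise block means
xz = (z.T @ x) / zk[:, None]               # 2 x 3
xzw = (z.T @ x @ w) / np.outer(zk, wl)     # 2 x 2
# xzw -> [[1.5, 7.5], [3.5, 6.5]]
```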

SLIDE 13

Introduction Continuous data

Information measures. Let (x^zw, p^z, q^w) be associated with (z, w), having the same structure as the initial data (x, p, q). We can define the information measure

I(x^zw, p^z, q^w) = Σ_{k,ℓ} p^z_k q^w_ℓ (x^zw_kℓ)² = (1/n) Σ_{k,ℓ} z.k w.ℓ (x^zw_kℓ)²

and the information to approximate

I(x, p, q) = Σ_{i,j} p_i q_j x_ij² = (1/n) Σ_{i,j} x_ij²

When x is "column-centered", this information represents in R^d the inertia of the set I relative to the center of gravity, and in R^n the inertia of the set J relative to the origin. This is the information measure used by PCA.

Objective function:

I(x, p, q) − I(x^zw, p^z, q^w) = (1/n) Σ_{i,j,k,ℓ} z_ik w_jℓ (x_ij − x^zw_kℓ)²
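The loss of information equals the within-block sum of squares; this can be checked numerically on the example of the previous slide (re-declared here for self-containment):

```python
import numpy as np

# slide-12 example with p_i = 1/n and q_j = 1
x = np.array([[1, 2, 8], [2, 1, 7], [2, 4, 7], [4, 4, 6]], dtype=float)
z = np.eye(2)[[0, 0, 1, 1]]
w = np.eye(2)[[0, 0, 1]]
n = x.shape[0]
zk, wl = z.sum(axis=0), w.sum(axis=0)
xzw = (z.T @ x @ w) / np.outer(zk, wl)

I_x = (x ** 2).sum() / n                          # I(x, p, q)
I_xzw = (np.outer(zk, wl) * xzw ** 2).sum() / n   # I(xzw, pz, qw)
within = ((x - z @ xzw @ w.T) ** 2).sum() / n     # (1/n) sum z_ik w_jl (x_ij - xzw_kl)^2
```

Here z @ xzw @ w.T expands x^zw back to a 4 × 3 matrix of block means, so `within` is exactly the objective function above.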

SLIDE 14

Introduction Continuous data

Let (x^w, p, q^w) be obtained when z is the partition of singletons, and (x^z, p^z, q) be obtained when w is the partition of singletons. Hence, we obtain the associated measures

I(x^z, p^z, q) = (1/n) Σ_{k,j} z.k (x^z_kj)²  and  I(x^w, p, q^w) = (1/n) Σ_{i,ℓ} w.ℓ (x^w_iℓ)²

When w is the partition of singletons, the criterion can be expressed as the loss of information due to z and, by using the Huygens theorem, it can be shown that

I(x, p, q) − I(x^z, p^z, q) = (1/n) W(z|J)

where W(z|J) = Σ_{i,k} z_ik Σ_j (x_ij − x^z_kj)² is the intra-class inertia, or within-group sum of squares, minimized by the classical k-means algorithm. Similarly, when z is the partition of singletons, we have

I(x, p, q) − I(x^w, p, q^w) = (1/n) W(w|I)

where W(w|I) = Σ_{j,ℓ} w_jℓ Σ_i (x_ij − x^w_iℓ)²

SLIDE 15

Introduction Continuous data

The minimization of the objective function can be solved by an iterative alternating least-squares optimization procedure. Several equivalent variants of double k-means exist.

Double k-means
input: x, g, m
initialization: z, w, x^zw_kℓ = Σ_{i,j} z_ik w_jℓ x_ij / (z.k w.ℓ)
repeat
    step 1. z_i = argmin_k Σ_{j,ℓ} w_jℓ (x_ij − x^zw_kℓ)²
    step 2. w_j = argmin_ℓ Σ_{i,k} z_ik (x_ij − x^zw_kℓ)²
    step 3. x^zw_kℓ = Σ_{i,j} z_ik w_jℓ x_ij / (z.k w.ℓ)
until convergence
return z, w

Croeuc algorithm (Govaert, 1983). As for Crobin, Croeuc is based on the reduced intermediate matrices x^w = (x^w_iℓ) and x^z = (x^z_kj).
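A minimal numpy sketch of double k-means (illustrative, not the author's code; a fixed iteration count replaces the convergence tests):

```python
import numpy as np

def double_kmeans(x, g, m, n_iter=30, seed=0):
    """Alternate row and column assignments to the nearest block mean xzw_kl."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    zi = rng.integers(0, g, n)
    wj = rng.integers(0, m, d)
    for _ in range(n_iter):
        Z, W = np.eye(g)[zi], np.eye(m)[wj]
        zk = np.maximum(Z.sum(axis=0), 1)            # z.k (guarded against 0)
        wl = np.maximum(W.sum(axis=0), 1)            # w.l
        xzw = (Z.T @ x @ W) / np.outer(zk, wl)       # step 3: block means
        # step 1: z_i = argmin_k sum_{j,l} w_jl (x_ij - xzw_kl)^2
        zi = ((x[:, None, :] - (xzw @ W.T)[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # step 2: w_j = argmin_l sum_{i,k} z_ik (x_ij - xzw_kl)^2
        Z = np.eye(g)[zi]
        wj = ((x.T[:, None, :] - (Z @ xzw).T[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return zi, wj

x = np.kron(np.array([[0.0, 5.0], [5.0, 0.0]]), np.ones((5, 5)))
zi, wj = double_kmeans(x, 2, 2)
```

Here `xzw @ W.T` broadcasts each block mean back to the columns of its cluster, so each row is compared to g candidate mean profiles.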

SLIDE 16

Introduction Continuous data

Croeuc
input: x, g, m
initialization: z, w
repeat
    x^w_iℓ = (1/w.ℓ) Σ_j w_jℓ x_ij,  x^zw_kℓ = (1/z.k) Σ_i z_ik x^w_iℓ
    repeat
        step 1. z_i = argmin_k Σ_ℓ w.ℓ (x^w_iℓ − x^zw_kℓ)²
        step 2. x^zw_kℓ = Σ_i z_ik x^w_iℓ / z.k
    until convergence
    x^z_kj = (1/z.k) Σ_i z_ik x_ij,  x^zw_kℓ = (1/w.ℓ) Σ_j w_jℓ x^z_kj
    repeat
        step 3. w_j = argmin_ℓ Σ_k z.k (x^z_kj − x^zw_kℓ)²
        step 4. x^zw_kℓ = Σ_j w_jℓ x^z_kj / w.ℓ
    until convergence
until convergence
return z, w
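Unlike plain double k-means, the inner loops of Croeuc only ever touch the reduced matrices x^w (n × m) and x^z (g × d). A sketch under the same illustrative conventions as the previous block:

```python
import numpy as np

def croeuc(x, g, m, n_iter=20, seed=0):
    """Double k-means alternations run on the reduced matrices xw and xz."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    zi = rng.integers(0, g, n)
    wj = rng.integers(0, m, d)
    for _ in range(n_iter):
        wl = np.maximum(np.bincount(wj, minlength=m), 1)   # w.l
        xw = np.stack([x[:, wj == l].sum(axis=1) for l in range(m)], axis=1) / wl
        zk = np.maximum(np.bincount(zi, minlength=g), 1)   # z.k
        xzw = np.stack([xw[zi == k].sum(axis=0) for k in range(g)], axis=0) / zk[:, None]
        # step 1: z_i = argmin_k sum_l w.l (xw_il - xzw_kl)^2
        zi = (((xw[:, None, :] - xzw[None, :, :]) ** 2) * wl).sum(axis=2).argmin(axis=1)
        zk = np.maximum(np.bincount(zi, minlength=g), 1)
        xz = np.stack([x[zi == k].sum(axis=0) for k in range(g)], axis=0) / zk[:, None]
        xzw = np.stack([xz[:, wj == l].sum(axis=1) for l in range(m)], axis=1) / wl
        # step 3: w_j = argmin_l sum_k z.k (xz_kj - xzw_kl)^2
        wj = (((xz.T[:, None, :] - xzw.T[None, :, :]) ** 2) * zk).sum(axis=2).argmin(axis=1)
    return zi, wj

x = np.kron(np.array([[0.0, 5.0], [5.0, 0.0]]), np.ones((5, 5)))
zi, wj = croeuc(x, 2, 2)
```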

SLIDE 17

Introduction Continuous data

Weaknesses

Limits of classical co-clustering methods. The criteria

Σ_{i,j,k,ℓ} z_ik w_jℓ |x_ij − a_kℓ|,  Σ_{i,j,k,ℓ} z_ik w_jℓ (x_ij − a_kℓ)²,  I(z, w) = Σ_{k,ℓ} p_kℓ log( p_kℓ / (p_k. p_.ℓ) )

raise several difficulties:
- the choice of the criterion is not always easy, and its implicit hypotheses are unknown
- the algorithms are not able to propose a solution when the clusters are not well separated, when the degrees of homogeneity of the blocks are dramatically different, or when the proportions of the clusters are dramatically different

[Figures: co-clustering results on balanced data (data2, data3) and unbalanced data (data1)]

Aim. Propose a general framework, the latent block model, able to formalize the hypotheses of co-clustering algorithms, to overcome the defects of the criteria and therefore propose other criteria, and to develop other efficient algorithms.

SLIDE 18

Latent block model and CML approach

Outline

1. Introduction
   - Co-clustering methods
   - Binary data
   - Continuous data
2. Latent block model and CML approach
   - Bernoulli latent block models
   - Gaussian latent block models
   - Asymmetric Gaussian model
3. Factorization
   - Nonnegative Matrix Factorization
   - Nonnegative Matrix Tri-Factorization
4. Conclusion

SLIDE 19

Latent block model and CML approach

Definition. The pdf of x:

f(x; θ) = Σ_{(z,w) ∈ Z×W} Π_i π_{z_i} Π_j ρ_{w_j} Π_{i,j} φ(x_ij; α_{z_i w_j})

where θ = (π_1, ..., π_g, ρ_1, ..., ρ_m, α_11, ..., α_gm)

[Graphical model: x depends on the latent labels z and w, with parameters π, ρ and α]

Advantages
- Parsimonious models
- Gives probabilistic interpretations of classical criteria via the classification ML approach
- Allows a rigorous simulation (degree of mixture, proportions)
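The generative process of the latent block model is easy to simulate; a sketch with illustrative parameter values (a Bernoulli φ is used here, matching the next slide):

```python
import numpy as np

# draw row labels from pi, column labels from rho, then x_ij ~ B(alpha_{z_i w_j})
rng = np.random.default_rng(1)
pi, rho = np.array([0.6, 0.4]), np.array([0.5, 0.3, 0.2])
alpha = np.array([[0.9, 0.1, 0.2],
                  [0.1, 0.8, 0.7]])
n, d = 200, 100
zi = rng.choice(2, size=n, p=pi)            # latent row labels
wj = rng.choice(3, size=d, p=rho)           # latent column labels
x = rng.binomial(1, alpha[np.ix_(zi, wj)])  # one Bernoulli draw per cell
```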

SLIDE 20

Latent block model and CML approach Bernoulli Latent block models

Binary data: classical Bernoulli mixture model. We have

f(x_i; θ) = Σ_k π_k Π_j α_kj^{x_ij} (1 − α_kj)^{1 − x_ij}

α_kj can be replaced by the two parameters a_kj and ε_kj:

f(x_i; θ) = Σ_k π_k Π_j ε_kj^{|x_ij − a_kj|} (1 − ε_kj)^{1 − |x_ij − a_kj|}

where a_kj = 0, ε_kj = α_kj if α_kj ≤ 0.5, and a_kj = 1, ε_kj = 1 − α_kj if α_kj > 0.5, so that

p(x_ij = 1 | a_kj = 0) = p(x_ij = 0 | a_kj = 1) = ε_kj
p(x_ij = 0 | a_kj = 0) = p(x_ij = 1 | a_kj = 1) = 1 − ε_kj

Bernoulli latent block model B(α_kℓ): α_kℓ = (a_kℓ, ε_kℓ), where a_kℓ ∈ {0, 1} and ε_kℓ ∈ ]0, 1/2[, with a_kℓ = 0, ε_kℓ = α_kℓ if α_kℓ ≤ 0.5 and a_kℓ = 1, ε_kℓ = 1 − α_kℓ if α_kℓ > 0.5.

More parsimonious than classical mixture models on I and J. For n = 10000, d = 5000, g = 4, m = 3: the Bernoulli latent block model has 4 × 3 + 3 + 2 = 17 parameters, whereas the two mixture models have (4 × 5000 + 3) + (3 × 10000 + 2) parameters.
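The (a, ε) reparametrization is a simple elementwise transformation; a sketch with illustrative values:

```python
import numpy as np

def to_a_eps(alpha):
    """Reparametrize Bernoulli probabilities alpha_kl as (a_kl, eps_kl):
    a_kl is the block's majority value, eps_kl in ]0, 1/2[ its noise level."""
    a = (alpha > 0.5).astype(int)
    eps = np.where(alpha > 0.5, 1 - alpha, alpha)
    return a, eps

alpha = np.array([[0.9, 0.1], [0.3, 0.75]])
a, eps = to_a_eps(alpha)
# the original parameters are recovered as alpha = a (1 - eps) + (1 - a) eps
```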

SLIDE 21

Latent block model and CML approach Bernoulli Latent block models

Classification likelihood

The criterion. Complete data: (x, z, w). Complete (or classification) log-likelihood:

LC(θ, z, w) = L(θ; x, z, w) = log [ Π_i π_{z_i} Π_j ρ_{w_j} Π_{i,j} φ(x_ij; α_{z_i w_j}) ]
            = Σ_i log π_{z_i} + Σ_j log ρ_{w_j} + Σ_{i,j} log φ(x_ij; α_{z_i w_j})
            = Σ_k z.k log π_k + Σ_ℓ w.ℓ log ρ_ℓ + Σ_{i,j,k,ℓ} z_ik w_jℓ log φ(x_ij; α_kℓ)

Find the partitions z and w and the parameter θ maximizing LC. Various alternated maximizations of LC start from an initial position (z, w, θ) and repeat the three steps:

a) argmax_z LC(θ, z, w)   b) argmax_w LC(θ, z, w)   c) argmax_θ LC(θ, z, w)

SLIDE 22

Latent block model and CML approach Bernoulli Latent block models

Link between LBCEM and Crobin

Parsimonious models. As for classical mixture models, it is possible to impose various constraints:
- fixed proportions: π_1 = ... = π_g and ρ_1 = ... = ρ_m
- Bernoulli latent model: α_kℓ → (a_kℓ, ε_kℓ) where a_kℓ ∈ {0, 1} and ε ∈ ]0, 1/2[
- different models with ε, ε_k, ε_ℓ, ε_kℓ

Aim: find the partitions z and w and the parameter θ maximizing LC under these constraints. With a single ε, the complete log-likelihood becomes

LC(θ, z, w) = log( ε / (1 − ε) ) Σ_{i,j,k,ℓ} z_ik w_jℓ |x_ij − a_kℓ| + cst

Summary. Since log(ε/(1 − ε)) < 0 for ε < 1/2, maximizing LC is equivalent to minimizing Σ_{i,j,k,ℓ} z_ik w_jℓ |x_ij − a_kℓ|. The optimization of C by Crobin thus assumes strong constraints on the heterogeneity of the blocks and their proportions: LBCEM = Crobin.

SLIDE 23

Latent block model and CML approach Gaussian latent block models

Continuous data. We assume that for each block kℓ the values x_ij are distributed according to a Gaussian distribution N(µ_kℓ, σ²_kℓ) with µ_kℓ ∈ R and σ²_kℓ ∈ R⁺. We obtain the Gaussian latent block model with the pdf f(x; θ) taking the form

f(x; θ) = Σ_{(z,w) ∈ Z×W} Π_{i,k} π_k^{z_ik} Π_{j,ℓ} ρ_ℓ^{w_jℓ} Π_{i,j,k,ℓ} [ (1/√(2πσ²_kℓ)) exp( −(x_ij − µ_kℓ)² / (2σ²_kℓ) ) ]^{z_ik w_jℓ}   (1)

With this model, the complete-data log-likelihood is, up to the constant −(nd/2) log 2π, given by

LC(θ, z, w) = Σ_{i,k} z_ik log π_k + Σ_{j,ℓ} w_jℓ log ρ_ℓ − (1/2) Σ_{k,ℓ} [ z.k w.ℓ log σ²_kℓ + (1/σ²_kℓ) Σ_{i,j} z_ik w_jℓ (x_ij − µ_kℓ)² ]

SLIDE 24

Latent block model and CML approach Gaussian latent block models

Gaussian LBCEM
input: x, g, m
initialization: z, w, π_k = z.k / n, ρ_ℓ = w.ℓ / d, µ_kℓ = x^zw_kℓ / (z.k w.ℓ), σ²_kℓ = Σ_{i,j} z_ik w_jℓ x²_ij / (z.k w.ℓ) − µ²_kℓ
repeat
    x^w_iℓ = (1/w.ℓ) Σ_j w_jℓ x_ij,  u^w_iℓ = (1/w.ℓ) Σ_j w_jℓ x²_ij
    repeat
        step 1. z_i = argmax_k [ log π_k − (1/2) Σ_ℓ w.ℓ ( log σ²_kℓ + (u^w_iℓ − 2µ_kℓ x^w_iℓ + µ²_kℓ) / σ²_kℓ ) ]
        step 2. π_k = z.k / n,  µ_kℓ = Σ_i z_ik x^w_iℓ / z.k,  σ²_kℓ = Σ_i z_ik u^w_iℓ / z.k − µ²_kℓ
    until convergence
    x^z_kj = (1/z.k) Σ_i z_ik x_ij,  v^z_kj = (1/z.k) Σ_i z_ik x²_ij
    repeat
        step 3. w_j = argmax_ℓ [ log ρ_ℓ − (1/2) Σ_k z.k ( log σ²_kℓ + (v^z_kj − 2µ_kℓ x^z_kj + µ²_kℓ) / σ²_kℓ ) ]
        step 4. ρ_ℓ = w.ℓ / d,  µ_kℓ = Σ_j w_jℓ x^z_kj / w.ℓ,  σ²_kℓ = Σ_j w_jℓ v^z_kj / w.ℓ − µ²_kℓ
    until convergence
until convergence
return z, w, π, ρ, µ, σ²
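The key point of step 1 is that the Gaussian score of a row only requires the reduced sufficient statistics x^w and u^w. A sketch of this single step with illustrative data and parameters (not the author's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, g, m = 50, 20, 2, 2
x = rng.normal(size=(n, d))
wj = rng.integers(0, m, d)                  # current column labels
W = np.eye(m)[wj]
wl = np.maximum(W.sum(axis=0), 1)           # w.l
xw = (x @ W) / wl                           # xw_il = (1/w.l) sum_j w_jl x_ij
uw = (x ** 2 @ W) / wl                      # uw_il = (1/w.l) sum_j w_jl x_ij^2
pi, mu, sig2 = np.full(g, 1 / g), rng.normal(size=(g, m)), np.ones((g, m))

# step 1: z_i = argmax_k log pi_k - (1/2) sum_l w.l ( log sig2_kl
#                + (uw_il - 2 mu_kl xw_il + mu_kl^2) / sig2_kl )
quad = uw[:, None, :] - 2 * mu[None] * xw[:, None, :] + (mu ** 2)[None]
score = np.log(pi)[None, :] - 0.5 * ((np.log(sig2) + quad / sig2) * wl).sum(axis=2)
zi = score.argmax(axis=1)
```

Step 3 is symmetric, using x^z and v^z over the rows.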

SLIDE 25

Latent block model and CML approach Gaussian latent block models

Link between LBCEM and Croeuc

Criterion. Parsimonious models can be defined by imposing constraints on the variances: we obtain the [σ], [σ_k], [σ_ℓ], ... models. In the simplest case, the [σ] model with identical proportions (π_k = 1/g, ρ_ℓ = 1/m),

LC(z, w, α) = −(nd/2) log σ² − (1/(2σ²)) Σ_{i,j,k,ℓ} z_ik w_jℓ (x_ij − µ_kℓ)² − n log g − d log m

and it is easy to see that maximizing LC is equivalent to minimizing W(z, w), where

W(z, w) = Σ_{i,j,k,ℓ} z_ik w_jℓ (x_ij − x^zw_kℓ)²

is the criterion minimized by Croeuc.

Assignment steps. It suffices to remark that in step 1 of LBCEM we have

z_i = argmax_k [ log π_k − (1/2) Σ_ℓ w.ℓ ( log σ²_kℓ + (u^w_iℓ − 2µ_kℓ x^w_iℓ + µ²_kℓ) / σ²_kℓ ) ]

For the [σ] model, this leads to z_i = argmin_k Σ_ℓ w.ℓ (x^w_iℓ − µ_kℓ)². In the same way, we can prove that in step 3 of LBCEM we have w_j = argmin_ℓ Σ_k z.k (x^z_kj − µ_kℓ)².

SLIDE 26

Latent block model and CML approach Asymmetric Gaussian model

Model

Hereafter, we use a classical mixture model in which the partition w of the variables is considered as a parameter of the model. The pdf is therefore

f(x_i; θ) = Σ_k π_k f(x_i; w, α_k)  with  f(x_i; w, α_k) = Π_{j,ℓ} [ (1/√(2πσ²_kℓ)) exp( −(x_ij − a_kℓ)² / (2σ²_kℓ) ) ]^{w_jℓ}

The unknown parameter θ is now formed by π, w and α, where α = (a, Σ), with a and Σ being g × m matrices representing the means and the variances of the blocks:

a = ( a_11 ... a_1m ; ... ; a_g1 ... a_gm ),  Σ = ( σ²_11 ... σ²_1m ; ... ; σ²_g1 ... σ²_gm ),

so that α = ( (a_11, σ²_11) ... (a_1m, σ²_1m) ; ... ; (a_g1, σ²_g1) ... (a_gm, σ²_gm) )

SLIDE 27

Latent block model and CML approach Asymmetric Gaussian model

Asymmetric Gaussian LBCEM
input: x, g, m
initialization: z, w, π_k = z.k / n, ρ_ℓ = w.ℓ / d, µ_kℓ = x^zw_kℓ / (z.k w.ℓ), σ²_kℓ = Σ_{i,j} z_ik w_jℓ x²_ij / (z.k w.ℓ) − µ²_kℓ
repeat
    x^w_iℓ = (1/w.ℓ) Σ_j w_jℓ x_ij,  u^w_iℓ = (1/w.ℓ) Σ_j w_jℓ x²_ij
    repeat
        step 1. z_i = argmax_k [ log π_k − (1/2) Σ_ℓ w.ℓ ( log σ²_kℓ + (u^w_iℓ − 2µ_kℓ x^w_iℓ + µ²_kℓ) / σ²_kℓ ) ]
        step 2. π_k = z.k / n,  µ_kℓ = Σ_i z_ik x^w_iℓ / z.k,  σ²_kℓ = Σ_i z_ik u^w_iℓ / z.k − µ²_kℓ
    until convergence
    x^z_kj = (1/z.k) Σ_i z_ik x_ij,  v^z_kj = (1/z.k) Σ_i z_ik x²_ij
    repeat
        step 3. w_j = argmax_ℓ [ log ρ_ℓ − (1/2) Σ_k z.k ( log σ²_kℓ + (v^z_kj − 2µ_kℓ x^z_kj + µ²_kℓ) / σ²_kℓ ) ]
        step 4. ρ_ℓ = w.ℓ / d,  µ_kℓ = Σ_j w_jℓ x^z_kj / w.ℓ,  σ²_kℓ = Σ_j w_jℓ v^z_kj / w.ℓ − µ²_kℓ
    until convergence
until convergence
return z, w, π, ρ, µ, σ²

SLIDE 28

Latent block model and CML approach Asymmetric Gaussian model

Comparisons
- LBVEM: variational EM
- LBCEM: classification version of LBVEM
- EM: EM applied only on the rows
- CEM: classification version of EM applied on the rows and columns separately
- EM-w: classical EM applied with the optimal partition w obtained by CEM
- CEM-w: classification version of EM-w

Comparison on 5000 × 2000 data with different degrees of mixture:

error       model   LBVEM  LBCEM  CEM   EM    EM-w  CEM-w
δ(z, z′)    M1      1      1      1     1
            M2      11     12     21    19    15    15
            M3      29     41     41    39    44    42
δ(w, w′)    M1      −
            M2      5      5      30    −     30    30
            M3      20     35     48    −     47    48

LBCEM > CEM, CEM-w; LBVEM > EM, EM-w; LBVEM outperforms all the other variants.

SLIDE 29

Factorization

Outline

1. Introduction
   - Co-clustering methods
   - Binary data
   - Continuous data
2. Latent block model and CML approach
   - Bernoulli latent block models
   - Gaussian latent block models
   - Asymmetric Gaussian model
3. Factorization
   - Nonnegative Matrix Factorization
   - Nonnegative Matrix Tri-Factorization
4. Conclusion

SLIDE 30

Factorization Nonnegative Matrix Factorization

NMF: Nonnegative Matrix Factorization (Lee and Seung, 1999, 2001)

Problem: argmin_{U,V ≥ 0} ||X − UVᵀ||², where the factor matrices are U ∈ R₊^{n×g} and V ∈ R₊^{d×m}
- Other measures can be used as error measures (for instance, the KL divergence)
- The clustering problem is not the main objective of NMF

NMF as clustering:
- Each column of X is treated as a data point in an n-dimensional space
- Each u_ik of U corresponds to the degree to which row i belongs to the kth cluster
- Each column of U is associated with a prototype vector for the kth cluster
- Problems: uniqueness, initialization

SLIDE 31

Factorization Nonnegative Matrix Factorization

Expressions of U and V. This is a typical constrained optimization problem and can be solved using the Lagrange multiplier method, leading to the multiplicative updates

u_ik ← u_ik (XV)_ik / (UVᵀV)_ik  and  v_jk ← v_jk (XᵀU)_jk / (VUᵀU)_jk

Uniqueness. If U and V are solutions, then UD and VD⁻¹ also form a solution for any positive diagonal matrix D. To eliminate this uncertainty, in practice one further requires that the Euclidean length of each column vector of U (or V) is 1:

u_ik ← u_ik / √(Σ_i u²_ik)  and  v_jk ← v_jk √(Σ_i u²_ik)

NMF towards clustering:
1. Perform the NMF on X to obtain U and V
2. Normalize U and V
3. Use the matrix V to determine the cluster label of each column: assign column j to cluster k* if k* = argmax_k v_jk

Orthogonal NMF: argmin_{U,V ≥ 0} ||X − UVᵀ||², where U ∈ R₊^{n×g}, V ∈ R₊^{d×m} and VᵀV = I
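The multiplicative updates above can be sketched in a few lines of numpy (an illustrative implementation; `eps` is a small constant added to guard against division by zero):

```python
import numpy as np

def nmf(X, g, n_iter=200, seed=0, eps=1e-9):
    """Lee-Seung multiplicative updates for min ||X - U V^T||^2 with U, V >= 0."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    U, V = rng.random((n, g)), rng.random((d, g))
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
    return U, V

X = np.random.default_rng(1).random((30, 10))
U, V = nmf(X, 3)
labels = V.argmax(axis=1)             # cluster label of each column of X
err = np.linalg.norm(X - U @ V.T)
```

Because the updates only multiply by nonnegative ratios, U and V stay nonnegative throughout.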

SLIDE 32

Factorization Nonnegative Matrix Tri-Factorization

NBVD: Nonnegative Block Value Decomposition (Long et al., 2005). For co-clustering, it consists in seeking a 3-factor decomposition

argmin_{R,A,C ≥ 0} ||X − RACᵀ||²  where R ∈ R₊^{n×g}, A ∈ R₊^{g×m}, C ∈ R₊^{d×m}

R and C play the roles of row and column memberships; A makes it possible to absorb the scales of R, C and X.

NMTF: Nonnegative Matrix Tri-Factorization (Ding et al., 2006; Wang et al., 2011)

argmin_{R,A,C ≥ 0, RᵀR = I_g, CᵀC = I_m} ||X − RACᵀ||²

Double k-means towards NMTF (Lazhar and Nadif, 2011). Convert the double k-means criterion into an optimization problem under NMF, with R and C cluster indicators:

argmin ||X − R̃R̃ᵀ X C̃C̃ᵀ||²  with R̃ = R D_r^{−0.5} and C̃ = C D_c^{−0.5}

where D_r^{−0.5} = Diag(1/√r_1, ..., 1/√r_g) and D_c^{−0.5} = Diag(1/√c_1, ..., 1/√c_m)
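The scaled indicators make the link concrete: the columns of R̃ are orthonormal, and R̃R̃ᵀ X C̃C̃ᵀ replaces every entry of X by the mean of its block, i.e. the double k-means approximation. A small check with illustrative labels:

```python
import numpy as np

zi = np.array([0, 0, 1, 1, 1])       # row labels (illustrative)
wj = np.array([0, 1, 1])             # column labels
R, C = np.eye(2)[zi], np.eye(2)[wj]
Rt = R / np.sqrt(R.sum(axis=0))      # R Dr^{-0.5}
Ct = C / np.sqrt(C.sum(axis=0))      # C Dc^{-0.5}
X = np.arange(15, dtype=float).reshape(5, 3)
approx = Rt @ Rt.T @ X @ Ct @ Ct.T   # block-mean approximation of X
```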

SLIDE 33

Factorization Nonnegative Matrix Tri-Factorization

Dyadic analysis: document clustering, term-document co-clustering. Even if the objective is the clustering of documents, co-clustering is beneficial.

TF-IDF: x_ij ← x_ij log(n / n_j), where n_j = #{ i | x_ij ≠ 0 }

Datasets
- Classic30 is an extract of Classic3, which contains three classes denoted Medline, Cisi and Cranfield after their original database sources. It consists of 30 random documents described by 1000 words
- Classic150 consists of 150 random documents described by 3652 words
- NG2 is a subset of the 20-Newsgroups data NG20, composed of 500 documents concerning talk.politics.mideast and talk.politics.misc, described by 2000 words

Results

dataset      measure  DNMF   ODNMF  ONM3F  ONMTF  NBVD
Classic30    Acc      96.67  100    100    100    96.67
             NMI      89.97  100    100    100    89.97
Classic150   Acc      98.66  98.66  99.33  98.66  98.66
             NMI      94.04  94.04  97.02  94.04  94.04
NG2          Acc      77.6   86.2   74.6   74.2   77.4
             NMI      19.03  43.47  18.27  16.03  23.31
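The TF-IDF reweighting used above is a one-liner on a document-term matrix; a sketch on a toy matrix (illustrative data):

```python
import numpy as np

# x_ij <- x_ij log(n / n_j), with n_j the number of documents containing word j
X = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
n = X.shape[0]
nj = (X != 0).sum(axis=0)        # document frequency of each word
X_tfidf = X * np.log(n / nj)     # a word present in every document gets weight 0
```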

SLIDE 34

Conclusion

Outline

1. Introduction
   - Co-clustering methods
   - Binary data
   - Continuous data
2. Latent block model and CML approach
   - Bernoulli latent block models
   - Gaussian latent block models
   - Asymmetric Gaussian model
3. Factorization
   - Nonnegative Matrix Factorization
   - Nonnegative Matrix Tri-Factorization
4. Conclusion

SLIDE 35

Conclusion

Conclusion

Principal points
- Different approaches exist
- Latent block models offer different co-clustering algorithms: LBCEM, LBVEM
- LBVEM is more efficient in terms of clustering and estimation
- Document clustering: LBVEM, LBCEM on the document-term matrix without any normalization
- Case of continuous data: connections between LBCEM and NMTF

Works related to co-clustering
- KL divergence as an error measure: connections between NMF and PLSA (Gaussier and Goutte, 2005), and between NMTF and the aspect model (Yoo and Choi, 2012)
- Visualization by GTM using the latent block model (Priam et al., 2013, 2014)
- Constrained co-clustering in bioinformatics and document clustering
