Statistical network clustering: some recent advances and applications to digital humanities
Charles Bouveyron
Laboratoire MAP5, UMR CNRS 8145, Université Paris Descartes
charles.bouveyron@parisdescartes.fr, @cbouveyron

Disclaimer
“Essentially, all models are wrong but some are useful” (George E.P. Box)

Outline
- Introduction
- The stochastic block model (SBM)
- The random subgraph model (RSM)
- Analysis of an ecclesiastical network
- Extension to dynamic networks
- Conclusion

Introduction
The analysis of networks:
- is a recent but increasingly important field in statistical learning,
- with applications in domains ranging from biology to history:
  - biology: analysis of gene regulation processes,
  - social sciences: analysis of political blogs,
  - history: visualization of medieval social networks.
Two main problems are currently well addressed:
- visualization of the networks,
- clustering of the network nodes.
Network comparison:
- is a still-emerging problem in statistical learning,
- which is mainly addressed using graph structure comparison,
- but limited to binary networks.

Introduction
Figure: Clustering of network nodes: communities (left) vs. structures with hubs (right).

Introduction
Key works in probabilistic models:
- stochastic block model (SBM) by Nowicki and Snijders (2001),
- latent space model by Hoff, Handcock and Raftery (2002),
- latent cluster model by Handcock, Raftery and Tantrum (2007),
- mixed membership SBM (MMSBM) by Airoldi et al. (2008),
- mixture of experts for LCM by Gormley and Murphy (2010),
- MMSBM for dynamic networks by Xing et al. (2010),
- overlapping SBM (OSBM) by Latouche et al. (2011).
A good overview is given in:
- M. Salter-Townshend, A. White, I. Gollini and T. B. Murphy, “Review of Statistical Network Analysis: Models, Algorithms, and Software”, Statistical Analysis and Data Mining, Vol. 5(4), pp. 243–264, 2012.

Introduction: a historical problem
Our colleagues from the LAMOP team were interested in answering the following question:
Was the Church organized in the same way within the different kingdoms in Merovingian Gaul?
To this end, they have built a relational database:
- from written acts of ecclesiastical councils that took place in Gaul during the 6th century (480-614),
- those acts report who attended (bishops, kings, dukes, priests, monks, ...) and what questions (regarding Church, faith, ...) were discussed,
- they also make it possible to characterize the type of relationship between the individuals,
- it took 18 months to build the database.

Introduction: a historical problem
The database contains:
- 1331 individuals (mostly clergymen) who participated in ecclesiastical councils in Gaul between 480 and 614,
- 4 types of relationships between individuals have been identified (positive, negative, variable or neutral),
- each individual belongs to one of the 5 regions of Gaul:
  - 3 kingdoms: Austrasia, Burgundy and Neustria,
  - 2 provinces: Aquitaine and Provence.
- additional information is also available: social positions, family relationships, birth and death dates, offices held, council dates, ...

Introduction: a historical problem
Figure: Adjacency matrix of the ecclesiastical network (sorted by regions). Region labels: Neustria, Provence, Unknown, Aquitaine, Austrasia, Burgundy.

The stochastic block model (SBM)
The SBM model (Nowicki and Snijders, 2001) assumes that the network (represented by its adjacency matrix $X$) is generated as follows:
- each node $i$ is associated with an (unobserved) group among $K$ according to: $Z_i \sim \mathcal{M}(\alpha)$, where $\alpha \in [0,1]^K$ and $\sum_{k=1}^{K} \alpha_k = 1$,
- then, each edge $X_{ij}$ is drawn according to: $X_{ij} \mid Z_{ik} Z_{jl} = 1 \sim \mathcal{B}(\pi_{kl})$, where $\pi_{kl} \in [0,1]$,
- this model is therefore a mixture model: $X_{ij} \sim \sum_{k=1}^{K} \sum_{l=1}^{K} \alpha_k \alpha_l \, \mathcal{B}(\pi_{kl})$.
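As an illustration only, the generative process above can be sketched in Python with NumPy. The function name `simulate_sbm`, the `seed` argument, the example parameter values and the choice to forbid self-loops are assumptions of this sketch, not part of the original slides; groups are coded 0..K-1 for convenience.

```python
import numpy as np

def simulate_sbm(n_nodes, alpha, pi, seed=0):
    """Draw group labels and a directed binary adjacency matrix from an SBM."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    pi = np.asarray(pi, dtype=float)
    # Z_i ~ M(alpha): one (normally unobserved) group label per node
    z = rng.choice(len(alpha), size=n_nodes, p=alpha)
    # X_ij | Z_ik Z_jl = 1 ~ B(pi_kl): look up the edge probability of each pair
    probs = pi[z[:, None], z[None, :]]
    x = (rng.random((n_nodes, n_nodes)) < probs).astype(int)
    np.fill_diagonal(x, 0)  # assumption of this sketch: no self-loops
    return z, x

# Hypothetical parameters: three communities with dense within-group links
alpha_example = [0.4, 0.3, 0.3]
pi_example = [[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]]
z_example, x_example = simulate_sbm(9, alpha_example, pi_example)
```

A matrix `pi` with large diagonal entries yields communities, while a group that connects strongly to all the others yields a hub-like structure.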

The stochastic block model (SBM)
Table: An SBM network (9 nodes; the edge labels $\pi_{\bullet\bullet}$ denote the connection probability between the groups of the two endpoints).

The stochastic block model (SBM)
Inference of the SBM model (maximum likelihood):
- log-likelihood: $\log p(X \mid \alpha, \Pi) = \log \sum_{Z} p(X, Z \mid \alpha, \Pi)$, $\hookrightarrow$ $K^N$ terms!
- the Expectation-Maximization (EM) algorithm requires the knowledge of $p(Z \mid X, \alpha, \Pi)$,
- Problem: $p(Z \mid X, \alpha, \Pi)$ is not tractable (no conditional independence)!
Solutions:
- Variational EM (Daudin et al., 2008) + ICL (Biernacki et al., 2003),
- Variational Bayes EM + ILvb criterion (Latouche et al., 2012).
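To make the $K^N$ blow-up concrete, here is a minimal brute-force sketch that evaluates the marginal log-likelihood by enumerating every labeling $Z$. It is not the variational approach cited above, only a toy check that is feasible for a handful of nodes. The function name `sbm_log_likelihood_bruteforce`, the example values, and the assumptions of a directed graph without self-loops and of probabilities strictly between 0 and 1 belong to this sketch.

```python
import itertools
import numpy as np

def sbm_log_likelihood_bruteforce(x, alpha, pi):
    """Return (log p(X | alpha, Pi), number of labelings summed over)."""
    x = np.asarray(x)
    alpha = np.asarray(alpha, dtype=float)
    pi = np.asarray(pi, dtype=float)
    n, k = x.shape[0], len(alpha)
    off_diag = ~np.eye(n, dtype=bool)  # assumption: self-loops are ignored
    log_terms = []
    # One term per labeling Z, hence K**N terms in total
    for labels in itertools.product(range(k), repeat=n):
        z = np.array(labels)
        p = pi[z[:, None], z[None, :]]
        # log p(X | Z): Bernoulli log-probability of every ordered pair (i, j)
        log_edges = np.where(x == 1, np.log(p), np.log1p(-p))[off_diag].sum()
        # log p(Z): sum of log alpha over the node labels
        log_terms.append(np.log(alpha[z]).sum() + log_edges)
    # log of the sum over Z, computed stably in log space
    return np.logaddexp.reduce(log_terms), len(log_terms)

# Hypothetical toy network: N = 3 nodes, K = 2 groups, so 2**3 = 8 terms
x_toy = np.array([[0, 1, 0],
                  [1, 0, 0],
                  [0, 0, 0]])
loglik_toy, n_terms_toy = sbm_log_likelihood_bruteforce(
    x_toy, alpha=[0.6, 0.4], pi=[[0.7, 0.2], [0.3, 0.5]])
```

With $K = 3$ groups, $N = 20$ nodes already require more than three billion terms, which is why approximate schemes such as variational EM are used in practice.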

The random subgraph model (RSM)
Before the maths, an example of an RSM network.
We observe:
- the partition of the network into $S = 2$ subgraphs (node shape),
- the presence $A_{ij}$ of directed edges between the $N$ nodes,
- the type $X_{ij} \in \{1, ..., C\}$ of the edges ($C = 3$, edge color).
We look for:
- a partition of the nodes into $K = 3$ groups (node color),
- which overlaps with the partition into subgraphs.
Figure: Example of an RSM network.

The random subgraph model (RSM)
The network (represented by its adjacency matrix $X$) is assumed to be generated as follows:
- the presence of an edge between nodes $i$ and $j$ is such that: $A_{ij} \sim \mathcal{B}(\gamma_{s_i s_j})$, where $s_i \in \{1, ..., S\}$ indicates the (observed) subgraph of node $i$,
- each node $i$ is also associated with an (unobserved) group among $K$ according to: $Z_i \sim \mathcal{M}(\alpha_{s_i})$, where $\alpha_s \in [0,1]^K$ and $\sum_{k=1}^{K} \alpha_{sk} = 1$,
- finally, each edge $X_{ij}$ can be of one of $C$ different (observed) types, such that: $X_{ij} \mid A_{ij} Z_{ik} Z_{jl} = 1 \sim \mathcal{M}(\Pi_{kl})$, where $\Pi_{kl} \in [0,1]^C$ and $\sum_{c=1}^{C} \Pi_{klc} = 1$.
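The three steps above can be sketched as a small simulator, again in Python with NumPy. This is only an illustration under stated assumptions: the function name `simulate_rsm`, the `seed` argument, the example values, the absence of self-loops, the 0-based coding of subgraphs and groups, and the convention that `x[i, j] = 0` means "no edge" are all choices of this sketch.

```python
import numpy as np

def simulate_rsm(s, gamma, alpha, Pi, seed=0):
    """Draw edge presence A, group labels Z and edge types X from an RSM.

    s     : length-N array of observed subgraph labels, coded 0..S-1
    gamma : (S, S) edge-presence probabilities between subgraphs
    alpha : (S, K) group proportions, one row per subgraph
    Pi    : (K, K, C) edge-type probabilities for each pair of groups
    """
    rng = np.random.default_rng(seed)
    s = np.asarray(s)
    gamma = np.asarray(gamma, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    Pi = np.asarray(Pi, dtype=float)
    n, n_groups, n_types = len(s), alpha.shape[1], Pi.shape[2]
    # A_ij ~ B(gamma_{s_i s_j}): edge presence depends only on the subgraphs
    a = (rng.random((n, n)) < gamma[s[:, None], s[None, :]]).astype(int)
    np.fill_diagonal(a, 0)  # assumption of this sketch: no self-loops
    # Z_i ~ M(alpha_{s_i}): group proportions are specific to each subgraph
    z = np.array([rng.choice(n_groups, p=alpha[si]) for si in s])
    # X_ij | A_ij Z_ik Z_jl = 1 ~ M(Pi_kl): a type is drawn for present edges only
    x = np.zeros((n, n), dtype=int)  # 0 codes "no edge", types are 1..C
    for i, j in zip(*np.nonzero(a)):
        x[i, j] = 1 + rng.choice(n_types, p=Pi[z[i], z[j]])
    return a, z, x

# Hypothetical setting matching the example sizes: S = 2, K = 3, C = 3
s_example = [0, 0, 0, 0, 1, 1, 1, 1, 1]
gamma_example = [[0.6, 0.1], [0.1, 0.6]]
alpha_example = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
Pi_example = np.full((3, 3, 3), 1.0 / 3.0)
a_example, z_example, x_example = simulate_rsm(
    s_example, gamma_example, alpha_example, Pi_example)
```

Unlike in the SBM, the groups here do not drive whether an edge exists, only which type it takes; edge presence is governed by the observed subgraphs through `gamma`.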

The random subgraph model (RSM)
Table: An RSM network (9 nodes; the edge labels pair a subgraph connection probability $\gamma$ with an edge-type probability $\pi_{\bullet\bullet}$ for the groups of the two endpoints).

The random subgraph model (RSM)
Figure: SBM model vs. RSM model, with panels (a) SBM and (b) RSM.
