Statistical network clustering: some recent advances and applications to digital humanities
Charles Bouveyron
Laboratoire MAP5, UMR CNRS 8145 Université Paris Descartes charles.bouveyron@parisdescartes.fr – @cbouveyron
Network analysis is a recent but increasingly important field in statistical learning, with applications in domains ranging from biology to history:
biology: analysis of gene regulation processes,
social sciences: analysis of political blogs,
history: visualization of medieval social networks.
visualization of the networks,
clustering of the network nodes.
Network comparison is a still emerging problem in statistical learning, which is mainly addressed using graph structure comparison, but limited to binary networks.
stochastic block model (SBM) by Nowicki and Snijders (2001),
latent space model by Hoff, Handcock and Raftery (2002),
latent cluster model by Handcock, Raftery and Tantrum (2007),
mixed membership SBM (MMSBM) by Airoldi et al. (2008),
mixture of experts for LCM by Gormley and Murphy (2010),
MMSBM for dynamic networks by Xing et al. (2010),
from written acts of ecclesiastical councils that took place in Gaul during
those acts report who attended (bishops, kings, dukes, priests, monks, ...)
they also allowed to characterize the type of relationship between the
it took 18 months to build the database.
1331 individuals (mostly clergymen) who
4 types of relationships between
each individual belongs to one of the 5
3 kingdoms: Austrasia, Burgundy and
2 provinces: Aquitaine and Provence.
additional information is also available: social positions, family
[Figure legend: Neustria, Provence, Unknown, Aquitaine, Austrasia, Burgundy]
each node i is associated with an (unobserved) group among K: $Z_i \sim \mathcal{M}(1, \alpha)$, where $\alpha \in [0, 1]^K$ and $\sum_{k=1}^{K} \alpha_k = 1$,
then, each edge Xij is drawn according to: $X_{ij} \mid Z_{ik} Z_{jl} = 1 \sim \mathcal{B}(\pi_{kl})$, with $\Pi = (\pi_{kl})$,
this model is therefore a mixture model: $X_{ij} \sim \sum_{k=1}^{K} \sum_{l=1}^{K} \alpha_k \alpha_l \, \mathcal{B}(\pi_{kl})$.
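The generative process of the SBM can be illustrated with a minimal simulation sketch. It assumes a directed binary network without self-loops; the function name `sample_sbm` and the toy values of `alpha` and `pi` are hypothetical and only serve the illustration.

```python
import random

def sample_sbm(n, alpha, pi, seed=0):
    """Draw latent groups z and a directed binary adjacency matrix x from an SBM."""
    rng = random.Random(seed)
    n_groups = len(alpha)
    # Z_i ~ M(1, alpha): one latent group per node.
    z = rng.choices(range(n_groups), weights=alpha, k=n)
    # X_ij | Z_i = k, Z_j = l ~ Bernoulli(pi[k][l]); self-loops are excluded (assumption).
    x = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and rng.random() < pi[z[i]][z[j]]:
                x[i][j] = 1
    return z, x

# Toy parameters (hypothetical): three groups, mostly within-group connections.
alpha = [0.5, 0.3, 0.2]
pi = [[0.8, 0.1, 0.1],
      [0.1, 0.7, 0.1],
      [0.1, 0.1, 0.9]]
z, x = sample_sbm(30, alpha, pi)
```

Only `x` would be observed in practice; recovering `z` from `x` is the clustering problem.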
log-likelihood: $\log p(X \mid \alpha, \Pi) = \log \sum_{Z} p(X, Z \mid \alpha, \Pi)$,
Expectation Maximization (EM) algorithm requires the knowledge of p(Z|X, α, Π),
Problem: p(Z|X, α, Π) is not tractable (no conditional independence)!
Variational EM (Daudin et al., 2008) + ICL (Biernacki et al., 2003),
Variational Bayes EM + ILvb criterion (Latouche et al., 2012).
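To make the variational idea concrete, here is a minimal sketch of a mean-field E-step for a binary SBM, in which p(Z|X, α, Π) is replaced by a factorized distribution with parameters τ_ik. It assumes an undirected network, fixed parameters and synchronous updates; it is an illustration of the general technique, not the exact algorithm of the cited papers, and all names are hypothetical.

```python
import math

def variational_e_step(x, tau, alpha, pi, n_iter=20):
    """Synchronous mean-field updates of tau for an undirected binary SBM."""
    n, n_groups = len(x), len(alpha)
    for _ in range(n_iter):
        new_tau = []
        for i in range(n):
            log_w = []
            for k in range(n_groups):
                # log tau_ik = log alpha_k + sum_{j != i} sum_l tau_jl log b(x_ij; pi_kl) + const
                s = math.log(alpha[k])
                for j in range(n):
                    if j == i:
                        continue
                    for l in range(n_groups):
                        p = pi[k][l]
                        log_b = math.log(p) if x[i][j] else math.log(1.0 - p)
                        s += tau[j][l] * log_b
                log_w.append(s)
            # Normalize in a numerically stable way.
            top = max(log_w)
            w = [math.exp(v - top) for v in log_w]
            total = sum(w)
            new_tau.append([v / total for v in w])
        tau = new_tau
    return tau

# Toy graph: two triangles (nodes 0-2 and 3-5) with no edge between them.
x = [[1 if i != j and (i < 3) == (j < 3) else 0 for j in range(6)] for i in range(6)]
alpha = [0.5, 0.5]
pi = [[0.9, 0.1], [0.1, 0.9]]
# Slightly asymmetric start, so that the symmetric fixed point is avoided.
tau0 = [[0.9, 0.1]] + [[0.5, 0.5] for _ in range(5)]
tau = variational_e_step(x, tau0, alpha, pi)
```

The asymmetric initialization matters here: starting from a perfectly uniform τ, the updates stay at the uninformative solution.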
the partition of the network into
the presence Aij of directed edges
the type Xij ∈ {1, ..., C} of the
a partition of the nodes into K = 3
which overlap with the partition
the presence of an edge between nodes i and j is such that: $A_{ij} \sim \mathcal{B}(\gamma_{s_i s_j})$, where $s_i \in \{1, ..., S\}$ is the (known) subgraph of node i,
each node i is as well associated with an (unobserved) group among K, depending on its subgraph: $Z_i \sim \mathcal{M}(1, \alpha_{s_i})$, where $\sum_{k=1}^{K} \alpha_{sk} = 1$,
each edge Xij can finally be of C different (observed) types and such that: $X_{ij} \mid A_{ij} Z_{ik} Z_{jl} = 1 \sim \mathcal{M}(1, \Pi_{kl})$, where $\sum_{c=1}^{C} \Pi_{klc} = 1$.
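The three stages of the RSM (edge presence from the subgraphs, latent groups from the subgraphs, edge types from the groups) can be sketched as a short simulation. This is a sketch under assumptions: edge type 0 is used here to encode an absent edge, self-loops are excluded, and the function name and toy parameter values are hypothetical.

```python
import random

def sample_rsm(subgraph, gamma, alpha, pi, seed=0):
    """Draw groups z, edge presences a and edge types x from a random subgraph model."""
    rng = random.Random(seed)
    n = len(subgraph)
    n_groups = len(alpha[0])
    n_types = len(pi[0][0])
    # Z_i ~ M(1, alpha_{s_i}): group proportions depend on the known subgraph of node i.
    z = [rng.choices(range(n_groups), weights=alpha[s])[0] for s in subgraph]
    a = [[0] * n for _ in range(n)]
    x = [[0] * n for _ in range(n)]  # 0 encodes "no edge" (assumption of this sketch)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # A_ij ~ Bernoulli(gamma_{s_i s_j}): presence depends only on the subgraphs.
            if rng.random() < gamma[subgraph[i]][subgraph[j]]:
                a[i][j] = 1
                # X_ij | A_ij = 1, Z_i = k, Z_j = l ~ M(1, Pi_kl): type in 1..C.
                x[i][j] = 1 + rng.choices(range(n_types), weights=pi[z[i]][z[j]])[0]
    return z, a, x

# Toy setting (hypothetical): S = 2 subgraphs, K = 2 groups, C = 4 edge types.
subgraph = [0] * 5 + [1] * 5
gamma = [[0.6, 0.1], [0.1, 0.5]]
alpha = [[0.7, 0.3], [0.2, 0.8]]
pi = [[[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1]],
      [[0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]]
z, a, x = sample_rsm(subgraph, gamma, alpha, pi)
```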
the RSM model separates the roles of the known partition and the latent
this was motivated by historical assumptions on the creation of
indeed, the possibilities of connection were preponderant over the type of
an alternative approach would consist in allowing Xij to directly depend
however, this would dramatically increase the number of model
if S = 6, K = 6 and C = 4, then the alternative approach has 6 516
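The figure of 6 516 can be reproduced with a short calculation. The counting rule below is an assumption: the alternative model is taken to have one probability vector over C + 1 edge types (including absence) for every pair of subgraphs and pair of groups, plus the S × K group proportions, while the RSM is counted as S² connection probabilities, S × K proportions and K² × C type probabilities. The function names are hypothetical.

```python
def alternative_parameter_count(S, K, C):
    """Edge types depend directly on both subgraphs and both groups (assumed counting)."""
    return S * S * K * K * (C + 1) + S * K

def rsm_parameter_count(S, K, C):
    """gamma (S x S), alpha (S x K) and Pi (K x K x C), under the same counting rule."""
    return S * S + S * K + K * K * C

S, K, C = 6, 6, 4
alternative = alternative_parameter_count(S, K, C)
rsm = rsm_parameter_count(S, K, C)
print(alternative, rsm)
```

Under this counting rule, separating edge presence from edge type reduces the number of parameters by more than an order of magnitude.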
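the previous model is fully defined by its joint distribution: $p(X, A, Z \mid \alpha, \gamma, \Pi) = p(X \mid A, Z, \Pi) \, p(A \mid \gamma) \, p(Z \mid \alpha)$,
which we complete with conjugate prior distributions for model parameters:
the prior distribution for α is: $p(\alpha) = \prod_{s=1}^{S} \mathrm{Dir}(\alpha_s; \chi_s^0)$,
the prior distribution for γ is: $p(\gamma) = \prod_{r,s=1}^{S} \mathrm{Beta}(\gamma_{rs}; a_{rs}^0, b_{rs}^0)$,
the prior distribution for Π is: $p(\Pi) = \prod_{k,l=1}^{K} \mathrm{Dir}(\Pi_{kl}; \Xi_{kl}^0)$.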
we aim at estimating the posterior distribution p(Z, α, γ, Π|X, A), which
as expected, this distribution is not tractable and approximate inference
the use of MCMC methods is obviously an option but MCMC methods
because they can handle large networks (N > 1000),
recent theoretical results (Celisse et al., 2012; Mariadassou and Matias,
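we use the decomposition of the marginal log-likelihood: $\log p(X) = \mathcal{L}(q(Z, \theta)) + \mathrm{KL}(q(Z, \theta) \,\|\, p(Z, \theta \mid X))$, where
$\mathcal{L}(q(Z, \theta)) = \sum_{Z} \int q(Z, \theta) \log \frac{p(X, Z, \theta)}{q(Z, \theta)} \, d\theta$,
$\mathrm{KL}(q(Z, \theta) \,\|\, p(Z, \theta \mid X)) = -\sum_{Z} \int q(Z, \theta) \log \frac{p(Z, \theta \mid X)}{q(Z, \theta)} \, d\theta$,
we also assume that q factorizes over Z and θ: $q(Z, \theta) = q_\theta(\theta) \prod_{i=1}^{N} q_i(Z_i)$,
VB-E step: $q_\theta(\theta)$ is fixed and L is maximized over the $q_i$: $\log q_j^*(Z_j) = E_{i \neq j, \theta}[\log p(X, Z, \theta)] + c$,
VB-M step: all $q_i(Z_i)$ are now fixed and L is maximized over $q_\theta$: $\log q_\theta^*(\theta) = E_Z[\log p(X, Z, \theta)] + c$.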
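The decomposition of the marginal log-likelihood into a lower bound and a KL divergence can be checked numerically on a toy model. The sketch below assumes a single binary latent variable and no continuous parameter θ, so the integral disappears; the joint probabilities are hypothetical values chosen for the illustration.

```python
import math

# Joint probabilities p(X = x_obs, Z = z) for z in {0, 1} (hypothetical toy values).
joint = [0.3, 0.1]
evidence = sum(joint)                      # p(X)
posterior = [p / evidence for p in joint]  # p(Z | X)

def lower_bound(q):
    """L(q) = sum_z q(z) log( p(X, z) / q(z) )."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint))

def kl_to_posterior(q):
    """KL(q || p(Z | X)) = - sum_z q(z) log( p(z | X) / q(z) )."""
    return -sum(qz * math.log(pz / qz) for qz, pz in zip(q, posterior))

q = [0.6, 0.4]  # an arbitrary approximating distribution
gap = math.log(evidence) - (lower_bound(q) + kl_to_posterior(q))
print(lower_bound(q), kl_to_posterior(q), gap)
```

Since the KL term is non-negative and log p(X) does not depend on q, maximizing L(q) is equivalent to minimizing the KL divergence to the true posterior, which is what the VB-E and VB-M steps do in turn.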
the VBEM is known to be sensitive to its initialization,
we propose a strategy based on several k-means algorithms with a
we can thus use L(q) as a model selection criterion for choosing K,
if computed right after the M step,
$\mathcal{L}(q) = \sum_{r,s=1}^{S} \log \frac{B(a_{rs}, b_{rs})}{B(a_{rs}^0, b_{rs}^0)} + \sum_{s=1}^{S} \log \frac{C(\chi_s)}{C(\chi_s^0)} + \sum_{k,l=1}^{K} \log \frac{C(\Xi_{kl})}{C(\Xi_{kl}^0)} - \sum_{i=1}^{N} \sum_{k=1}^{K} \tau_{ik} \log(\tau_{ik})$.
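The lower bound above is cheap to evaluate once the variational parameters are known. The sketch below assumes that B(·,·) is the Beta function and that C(·) is the normalizing constant of the Dirichlet distribution, $C(x) = \prod_d \Gamma(x_d) / \Gamma(\sum_d x_d)$; both definitions, the flat list layout of the arguments and the function names are assumptions of this illustration.

```python
import math

def log_beta(a, b):
    """log B(a, b), with B the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_dirichlet_norm(x):
    """log C(x), with C(x) = prod_d Gamma(x_d) / Gamma(sum_d x_d) (assumed definition)."""
    return sum(math.lgamma(v) for v in x) - math.lgamma(sum(x))

def lower_bound(beta_post, beta_prior, chi_post, chi_prior, xi_post, xi_prior, tau):
    """Evaluate L(q) from posterior and prior hyperparameters and the tau_ik.

    beta_*: list of (a_rs, b_rs) pairs, one per pair of subgraphs (r, s).
    chi_*:  list of vectors chi_s, one per subgraph s.
    xi_*:   list of vectors Xi_kl, one per pair of clusters (k, l).
    tau:    list of rows (tau_i1, ..., tau_iK), one per node i.
    """
    bound = 0.0
    for (a, b), (a0, b0) in zip(beta_post, beta_prior):
        bound += log_beta(a, b) - log_beta(a0, b0)
    for chi, chi0 in zip(chi_post, chi_prior):
        bound += log_dirichlet_norm(chi) - log_dirichlet_norm(chi0)
    for xi, xi0 in zip(xi_post, xi_prior):
        bound += log_dirichlet_norm(xi) - log_dirichlet_norm(xi0)
    # Entropy term; entries with tau_ik = 0 contribute nothing.
    for row in tau:
        bound -= sum(t * math.log(t) for t in row if t > 0.0)
    return bound

# Sanity check on a degenerate case: posterior hyperparameters equal to the priors.
prior_beta = [(1.0, 1.0)]
prior_chi = [[1.0, 1.0]]
prior_xi = [[1.0, 1.0]] * 4
value = lower_bound(prior_beta, prior_beta, prior_chi, prior_chi,
                    prior_xi, prior_xi, [[0.5, 0.5]])
```

For model selection, the bound would be evaluated for each candidate K and the value of K with the largest bound retained.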
Z allows to characterize the found clusters through social positions of the
parameter Π describes the relations between the found clusters,
parameter γ describes the connections between the subgraphs,
parameter α describes the distribution of the clusters within the subgraphs.
[Figure: bar plots of the social positions (Bishop, Priest, Abbot, Earl, Duke, Monk, Deacon, King, Queen, Archdeacon) of the individuals in each of Clusters 1 to 6]
clusters 1 and 3 correspond to local, provincial or diocesan councils,
clusters 2 and 6 correspond to councils dedicated to political questions,
clusters 4 and 5 correspond to aristocratic assemblies, where queens and
[Figure: relations between clusters 1 to 6, one panel per relation type: positive, negative, variable, neutral]
positive relations between clusters 3, 5 and 6 mainly correspond to
negative and variable relations between clusters 4, 5 and 6 report the
neutral relations between clusters 1, 3 and 6 were expected because they
[Figure: matrix between the regions (Neustria, Provence, Unknown, Aquitaine, Austrasia, Burgundy), with a scale from -3.5 to -1.5]
[Figure: proportions of clusters 1 to 6 in each region (Neustria, Provence, Unknown, Aquitaine, Austrasia, Burgundy) and in total]
[Figure: the regions (Neustria, Provence, Unknown, Aquitaine, Austrasia, Burgundy) on the axes Comp.1 and Comp.2]
[Figure: network in 1890 and network in 2008; regions: Europe-Atlantic, Asia-Pacific, Middle East & Indian Ocean, Med & Black Sea]
dynamic MMSBM by Xing et al.,
dynamic SBM by Yang et al.,
another dynamic SBM by Xu et al.,
dynamic LPCM by Sarkar et al.,
and a few others...
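each node i is associated with an (unobserved) group among K, depending on its subgraph $s_i$: $Z_i^{(t)} \sim \mathcal{M}(1, \alpha_{s_i}^{(t)})$, where $\sum_{k=1}^{K} \alpha_{sk}^{(t)} = 1$,
each edge $X_{ij}^{(t)}$ can have C + 1 different (observed) types (0 denotes the absence of an edge): $X_{ij}^{(t)} \mid Z_{ik}^{(t)} Z_{jl}^{(t)} = 1 \sim \mathcal{M}(1, \Pi_{kl})$, where $\sum_{c=0}^{C} \Pi_{klc} = 1$.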
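we introduce the latent variable $\gamma_s^{(t)}$ such that: $\alpha_{sk}^{(t)} = \exp(\gamma_{sk}^{(t)}) / C(\gamma_s^{(t)})$, with $\gamma_{sK}^{(t)} = 0$ and $C(\gamma_s^{(t)}) = \sum_{\ell=1}^{K} \exp(\gamma_{s\ell}^{(t)})$,
$\gamma_{s \setminus K}^{(t)}$ is further assumed to be distributed according to a normal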
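The mapping from the latent variable γ to the group proportions α is a logistic (softmax) transform with the last component fixed to 0. A minimal sketch, in which the function name and the input values are hypothetical:

```python
import math

def proportions_from_gamma(gamma_free):
    """Map the K - 1 free components of gamma_s to the K proportions alpha_s."""
    # The K-th component is fixed to 0, so that the mapping is identifiable.
    gamma = list(gamma_free) + [0.0]
    normalizer = sum(math.exp(g) for g in gamma)  # C(gamma_s)
    return [math.exp(g) / normalizer for g in gamma]

alpha_uniform = proportions_from_gamma([0.0, 0.0])
alpha_skewed = proportions_from_gamma([2.0, -1.0])
```

Working with an unconstrained γ is what makes a Gaussian model for the temporal evolution possible, since the proportions themselves are constrained to sum to 1.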
νt depends on νt−1 such that:
ω(t) ∼ N(0, Φ), ν1 = µ0 + u, u ∼ N(0, v0).
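The temporal dependence can be illustrated with a small simulation of a state-space model. This sketch rests on strong assumptions that are not stated in the text: the state is scalar, and the transition is taken to be linear, ν_t = A ν_{t−1} + ω(t), with a hypothetical coefficient `transition` playing the role of A. Only the noise and initial-state distributions come from the lines above.

```python
import math
import random

def simulate_states(n_steps, mu0, v0, phi, transition=1.0, seed=0):
    """Simulate a scalar state sequence nu_1, ..., nu_T (assumed linear transition)."""
    rng = random.Random(seed)
    # nu_1 = mu0 + u, with u ~ N(0, v0).
    nu = [mu0 + rng.gauss(0.0, math.sqrt(v0))]
    for _ in range(n_steps - 1):
        # Assumed form: nu_t = transition * nu_{t-1} + omega_t, with omega_t ~ N(0, phi).
        nu.append(transition * nu[-1] + rng.gauss(0.0, math.sqrt(phi)))
    return nu

# Nine time points, as many as a network observed at nine dates would require.
states = simulate_states(9, mu0=0.0, v0=1.0, phi=0.1)
constant = simulate_states(9, mu0=1.5, v0=0.0, phi=0.0)
```

With zero variances the sequence stays at µ0, which corresponds to group proportions that do not evolve over time.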
data from Lloyd’s List (Voyage Record) covering the period 1890-2008 at
a huge amount of work was needed to extract the data from the paper versions and fill in the gaps
the data contains 176 095 vessels between 4472 ports but we had to
4 types of relations between ports are considered: liquid bulk, passengers,
[Figure: choice of K, BIC as a function of K for K from 4 to 8]
Connection probabilities between clusters:

           Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6  Cluster 7
Cluster 1       0.2%       1.3%       0.4%       0.6%       1.2%       9.3%       2.8%
Cluster 2       1.3%      11.8%         2%         6%         6%      51.9%      23.6%
Cluster 3       0.4%         2%       3.3%       0.3%       9.4%        37%      11.6%
Cluster 4       0.6%         6%       0.3%       3.3%       1.2%      25.4%         9%
Cluster 5       1.2%         6%       9.4%       1.2%      23.7%      68.9%      30.5%
Cluster 6       9.3%      51.9%        37%      25.4%      68.9%      86.6%      83.6%
Cluster 7       2.8%      23.6%      11.6%         9%      30.5%      83.6%      50.6%
[Figure: group proportions (G1 to G7) over time (1890, 1930, 1940, 1951, 1965, 1975, 1985, 1995, 2008), one panel per subgraph: Subgraph 1 (Asia - Pacific), Subgraph 2 (Europe - Atlantic), Subgraph 3 (Medit. - Black Sea), Subgraph 4 (Middle East - India)]
the RSM model takes into account an existing partition into subgraphs,
this modeling then allows the subgraphs to be compared,
the dRSM model handles evolving networks.