The ground truth about metadata and community detection in networks
Leto Peel
Université catholique de Louvain
arXiv:1608.05878
Community detection: Split nodes into groups based on their pattern of links
Data generating process, g(T): Generate nodes and assign them to communities, T. Generate links in G dependent on community membership.
Community detection, f(G): Observe G and infer T. Assess performance by how well we recover T.
Ground truth in real networks?
Networks can have metadata that describe the nodes
food webs: feeding mode, species body mass, etc.
internet: data capacity, physical location, etc.
social networks: age, sex, ethnicity, race, etc.
protein interactions: molecular weight, association with cancer, etc.
Recovering metadata implies sensible methods
[Figure panels: stochastic block model; stochastic block model with degree correction]
Karrer, Newman. Stochastic blockmodels and community structure in networks. Phys. Rev. E 83, 016107 (2011).
Adamic, Glance. The political blogosphere and the 2004 US election: divided they blog. 36–43 (2005).
Metadata often treated as ground truth
Yang & Leskovec. Overlapping community detection at scale: a nonnegative matrix factorization approach (2013).
Do you think that's ground truth you're detecting?
Communities, $C = f(G)$
Ground truth, $T$
Metadata, $M$
Distances: $d(T, f(G))$, $d(M, f(G))$, $d(M, T)$
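One way to make a distance such as d(M, f(G)) concrete is to compare two labelings of the same nodes. The sketch below is an illustration only: it assumes plain normalized mutual information (a simpler stand-in, not the adjusted variant) and takes the distance to be one minus that score. The function names and the toy labels are hypothetical and do not come from the slides.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (nats) of the group sizes in one labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (nats) between two labelings of the same nodes."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    mi = 0.0
    for (x, y), n_xy in Counter(zip(a, b)).items():
        mi += (n_xy / n) * math.log(n_xy * n / (count_a[x] * count_b[y]))
    return mi


def partition_distance(a, b):
    """Assumed distance: 1 - NMI, with NMI = 2 I(a; b) / (H(a) + H(b))."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        return 0.0  # both labelings put every node in a single group
    return 1.0 - 2.0 * mutual_information(a, b) / (h_a + h_b)


# Toy example: metadata M versus detected communities C = f(G).
metadata = ["a", "a", "a", "b", "b", "b"]
communities = [0, 0, 0, 1, 1, 1]
print(partition_distance(metadata, communities))
```

Only d(M, f(G)) can be computed this way in practice, because T is not observed in real networks.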
When communities ≠ metadata...
(i) the metadata do not relate to the network structure,
(ii) the detected communities and the metadata capture different aspects of the network’s structure,
(iii) the network contains no structure (e.g., an E-R random graph),
(iv) the community detection algorithm does not perform well.
Typically we assume (iv) is the only possible cause.
The Karate Club network
[Figure: the club split into factions; labelled nodes: President, Instructor]
‘This can be explained by noting that he was only three weeks away from a test for black belt (master status) when the split in the club occurred. Had he joined the officers’ [President's] club he would have had to give up his rank and begin again in a new style of karate with a white (beginner’s) belt, since the officers had decided to change the style of karate practiced in their new club’
- Zachary 1977
You only see what you look for...
US politics is more than two opposing views
Peixoto, T. P. Hierarchical Block Structures and High-Resolution Model Selection in Large Networks. Phys. Rev. X 4, 011047 (2014).
Adamic, Glance. The political blogosphere and the 2004 US election: divided they blog. 36–43 (2005).
Different generative processes = different community structures
Many good partitions...
Evans, T. S. Clique graphs and overlapping communities. J. Stat. Mech. 2010, P12037–22 (2010).
Metadata are not ground truth for community detection
No interpretability of negative results:
(i) M unrelated to network structure
(ii) C and M capture different aspects of network structure
(iii) the network has no structure
(iv) the algorithm does not perform well
Multiple sets of metadata exist. Which set is ground truth?
We see what we look for. Confirmation bias. Publication bias.
“Community” is model dependent. Do we expect all networks across all domains to have the same relationship with communities?
Community detection is an inverse problem
Data generation, g(T): communities T → network G
Community detection, f(G): network G → communities T
The community detection problem is ill-posed (no unique solution)
However, in real networks both T and g are unknown.
For any graph there exists a (Bell) number of possible “ground truth” partitions, and an infinite number of capable generative models.
{generative models, g} × {partitions, T} → {graph G} (many to one)
(see here for proof: arXiv:1608.05878)
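To give a sense of how quickly the number of candidate partitions grows, the following sketch computes Bell numbers with the Bell triangle. The Bell number B_N counts the ways to partition a set of N labelled nodes into non-empty groups. The function name is an illustrative choice.

```python
def bell_numbers(n_max):
    """Return [B_0, ..., B_{n_max}] computed with the Bell triangle."""
    bells = [1]  # B_0 = 1 (the empty set has one partition)
    row = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row;
        # every following entry adds the entry to its left and the one above.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


for n, b in enumerate(bell_numbers(10)):
    print(n, b)
```

Even a network of ten nodes already admits more than a hundred thousand partitions, so exhaustive comparison against every possible "ground truth" is hopeless for real networks.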
A No Free Lunch Theorem for community detection?
The NFL theorem (supervised learning) states that there cannot exist a classifier that is a priori better than any other, averaged over all possible problems.
Wolpert, D. H. The lack of a priori distinctions between learning algorithms. Neural Computation 8, 1341–1390 (1996).
A No Free Lunch Theorem for community detection
NFL Theorem for community detection (paraphrased):
For the community detection problem, with accuracy measured by adjusted mutual information, the uniform average of the accuracy of any method f over all possible community detection problems is a constant which is independent of f.
On average, no community detection algorithm performs better than any other.
(see here for proof: arXiv:1608.05878)
So, what about metadata?
Metadata = types of nodes
Communities = how nodes interact
Metadata + Communities = how different types of nodes interact with each other
We require new methods to understand the relationship between metadata and structure.
(i) the metadata do not relate to the network structure
Are the metadata related to the network structure? → Blockmodel Entropy Significance Test
(ii) communities and metadata capture different aspects of network structure
Do metadata and detected communities capture different aspects of network structure? → neoSBM
The Stochastic Blockmodel
Edges are conditionally independent given community membership:
$p_{ij} = p(e_{ij} \mid z_i, z_j, \omega) = \omega_{z_i, z_j}$
[Figure: block structures ordered by increasing density; labels: intra-community density, inter-community density]
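A minimal sketch of sampling from this model, assuming an undirected simple graph (no self-loops) and a Bernoulli edge per node pair. The function name `sample_sbm` and the example parameters are illustrative assumptions, not part of the talk.

```python
import numpy as np


def sample_sbm(z, omega, rng):
    """Sample a symmetric 0/1 adjacency matrix from a Bernoulli SBM.

    z     : length-N sequence of community indices (0 .. K-1)
    omega : K x K symmetric matrix, omega[r, s] = edge probability
    rng   : numpy.random.Generator
    """
    z = np.asarray(z)
    omega = np.asarray(omega, dtype=float)
    n = len(z)
    # p[i, j] = omega[z_i, z_j]: edges depend only on community membership.
    p = omega[z[:, None], z[None, :]]
    # Draw each unordered pair once (strict upper triangle), then mirror it.
    upper = np.triu(rng.random((n, n)) < p, k=1)
    return (upper | upper.T).astype(int)


rng = np.random.default_rng(0)
z = [0] * 10 + [1] * 10
omega = [[0.8, 0.05],   # dense within communities,
         [0.05, 0.8]]   # sparse between them (assortative structure)
adjacency = sample_sbm(z, omega, rng)
print(adjacency.sum() // 2, "edges")
```

Swapping the large and small entries of `omega` gives disassortative structure with the same code, which is one reason the meaning of "community" depends on the model.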
Blockmodel Entropy Significance Test
How well do the metadata explain the network?
metadata is randomly assigned → model gives no explanation, high H
metadata correlates with structure → model gives good explanation, low H
1. Divide the network G into groups according to metadata labels M.
2. Fit the parameters of an SBM and compute the entropy H(G, M).
3. Compare this entropy to a distribution of entropies of networks partitioned using permutations of the metadata labels.
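The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: a Bernoulli SBM on an undirected simple graph, the entropy taken as the negative maximized log likelihood for the given partition, and the p-value taken as the fraction of label permutations whose entropy is no larger than the observed one. The function names are hypothetical.

```python
import numpy as np


def _xlogx_ratio(m, n):
    """m * log(m / n), elementwise, with the convention 0 * log 0 = 0."""
    out = np.zeros_like(m, dtype=float)
    mask = m > 0
    out[mask] = m[mask] * np.log(m[mask] / n[mask])
    return out


def sbm_entropy(adj, labels):
    """Step 2: negative maximized Bernoulli-SBM log likelihood (nats)."""
    adj = np.asarray(adj, dtype=float)
    _, z = np.unique(labels, return_inverse=True)
    k = z.max() + 1
    onehot = np.eye(k)[z]                       # N x K membership matrix
    sizes = onehot.sum(axis=0)
    edges = onehot.T @ adj @ onehot             # diagonal blocks counted twice
    pairs = np.outer(sizes, sizes)
    pairs[np.diag_indices(k)] = sizes * (sizes - 1)
    iu = np.triu_indices(k)                     # each block pair r <= s once
    m, n = edges[iu].copy(), pairs[iu].copy()
    diag = iu[0] == iu[1]
    m[diag] /= 2                                # undo the double counting
    n[diag] /= 2
    return -(_xlogx_ratio(m, n) + _xlogx_ratio(n - m, n)).sum()


def entropy_permutation_p_value(adj, labels, n_perm, rng):
    """Steps 1 and 3: compare H(G, M) with entropies under permuted labels."""
    labels = np.asarray(labels)
    observed = sbm_entropy(adj, labels)
    null = np.array([sbm_entropy(adj, rng.permutation(labels))
                     for _ in range(n_perm)])
    return float(np.mean(null <= observed))


# Toy network: two disjoint cliques, with metadata that match them exactly.
clique = np.ones((8, 8), dtype=int) - np.eye(8, dtype=int)
empty = np.zeros((8, 8), dtype=int)
adj = np.block([[clique, empty], [empty, clique]])
metadata = ["a"] * 8 + ["b"] * 8
print(entropy_permutation_p_value(adj, metadata, 200, np.random.default_rng(0)))
```

A small p-value indicates that the metadata explain the network better than randomly assigned labels with the same group sizes; it does not by itself say the metadata are the best possible partition.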
Multiple networks; multiple metadata attributes
Multiple sets of metadata provide a significant explanation for multiple networks.
Do metadata and detected communities capture different aspects of the network?
Choose between the red (SBM) partition and the blue (metadata) partition
Network with multiple 4-group optima
core-periphery (“metadata”, M)
assortative (SBM comms., C)
As θ increases the cost of freeing a node decreases
[Figure: metadata partition and SBM partition; labels: µ and 1, 2, 3, 4; curves: neoSBM log likelihood, SBM log likelihood]
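The slides do not spell out the form of the cost, so the following is only a sketch under an explicit assumption: each node is independently "free" with prior probability θ (a Bernoulli prior), otherwise it keeps its metadata label. Under that assumption, q free nodes out of N contribute q log θ + (N − q) log(1 − θ) to the log prior, so each freed node changes it by log(θ / (1 − θ)) and the cost of freeing a node is the negative of that. The function names are hypothetical.

```python
import math


def free_node_log_prior(theta):
    """Assumed per-free-node term: log(theta / (1 - theta))."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    return math.log(theta) - math.log(1.0 - theta)


def penalized_log_likelihood(sbm_log_likelihood, n_free, theta):
    """SBM log likelihood plus the assumed prior term for n_free free nodes.

    The constant N * log(1 - theta) is omitted because it does not depend
    on which nodes are free.
    """
    return sbm_log_likelihood + n_free * free_node_log_prior(theta)


# The cost of freeing one node (negative prior term) shrinks as theta grows.
for theta in (0.01, 0.1, 0.3, 0.5):
    print(theta, -free_node_log_prior(theta))
```

With a small θ, nodes stay on their metadata labels unless moving them improves the SBM fit substantially; as θ approaches 0.5 the penalty vanishes and the partition is driven by the SBM likelihood alone.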
There is no ground truth
“I don’t know the future. I didn’t come here to tell you how this is going to end. I came here to tell you how it’s going to begin… Where we go from there is a choice I leave to you.” – Neo, The Matrix