The ground truth about metadata and community detection in networks



SLIDE 1

arXiv:1608.05878

The ground truth about metadata and community detection in networks

Leto Peel

Université catholique de Louvain

SLIDE 2

Community detection: Split nodes into groups based on their pattern of links

SLIDE 3

Data generating process: Generate nodes and assign to communities

SLIDE 4

Data generating process: Generate nodes and assign to communities, T. Generate links in G dependent on community membership.

g(T)
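
A minimal sketch of one possible g(T), assuming a planted-partition model: each pair of nodes is linked with probability `p_in` if the two nodes share a community and `p_out` otherwise. The function name and both parameters are illustrative assumptions and do not come from the slides.

```python
import random

def generate_graph(T, p_in, p_out, seed=None):
    """Hypothetical g(T): sample an undirected graph from community labels T."""
    rng = random.Random(seed)
    n = len(T)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            # Link probability depends only on community membership.
            p = p_in if T[i] == T[j] else p_out
            if rng.random() < p:
                edges.add((i, j))
    return edges

# Two planted communities of 10 nodes each.
T = [0] * 10 + [1] * 10
G = generate_graph(T, p_in=0.8, p_out=0.05, seed=1)
```

Community detection then runs in the opposite direction: only `G` is observed and `T` has to be inferred.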

SLIDE 5

Community detection: Observe G. Infer T using f(G). Assess performance on how well we recover T.

SLIDE 6

Ground truth in real networks?

SLIDE 7

Networks can have metadata that describe the nodes

food webs: feeding mode, species body mass, etc.
internet: data capacity, physical location, etc.
social networks: age, sex, ethnicity, race, etc.
protein interactions: molecular weight, association with cancer, etc.

SLIDE 8

Recovering metadata implies sensible methods

stochastic block model
stochastic block model with degree correction

Karrer, Newman. Stochastic blockmodels and community structure in networks. Phys. Rev. E 83, 016107 (2011).
Adamic, Glance. The political blogosphere and the 2004 US election: divided they blog. 36–43 (2005).

SLIDE 10

Metadata often treated as ground truth

Yang & Leskovec. Overlapping community detection at scale: a nonnegative matrix factorization approach (2013).

Do you think that's ground truth you're detecting?

SLIDE 11

Communities, C = f(G)
Ground truth, T

d(T, f(G))

SLIDE 12

Communities, C = f(G)
Metadata, M
Ground truth, T

d(M, f(G)), d(T, f(G)), d(M, T)
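
The slides leave the distance d unspecified. As a sketch, assuming variation of information is used for d (one common choice, not necessarily the one behind these slides), the three distances can be computed as follows. The example partitions are made up for illustration.

```python
from collections import Counter
from math import log

def variation_of_information(a, b):
    """VI(a, b) = H(a|b) + H(b|a) for two labelings of the same nodes."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    count_ab = Counter(zip(a, b))
    vi = 0.0
    for (x, y), n_xy in count_ab.items():
        p_xy = n_xy / n
        vi -= p_xy * (log(p_xy / (count_a[x] / n)) + log(p_xy / (count_b[y] / n)))
    return vi

# Hypothetical partitions of six nodes.
T = [0, 0, 0, 1, 1, 1]  # ground truth (unknown in real networks)
M = [0, 0, 1, 1, 2, 2]  # metadata
C = [0, 0, 0, 0, 1, 1]  # detected communities, f(G)

d_MC = variation_of_information(M, C)
d_TC = variation_of_information(T, C)
d_MT = variation_of_information(M, T)
```

In practice only `d_MC` can be measured, since T is not observed; it is a good proxy for `d_TC` only when `d_MT` is small.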

SLIDE 13

When communities ≠ metadata...

(i) the metadata do not relate to the network structure,

SLIDE 14

When communities ≠ metadata...

(ii) the detected communities and the metadata capture different aspects of the network’s structure,

SLIDE 15

When communities ≠ metadata...

(iii) the network contains no structure (e.g., an E-R random graph),

SLIDE 16

When communities ≠ metadata...

(iv) the community detection algorithm does not perform well. Typically we assume this is the only possible cause.

SLIDE 17

The Karate Club network

[Figure: the network, with the President and the Instructor marked. Split into factions.]

SLIDE 19

‘This can be explained by noting that he was only three weeks away from a test for black belt (master status) when the split in the club occurred. Had he joined the officers’ [President's] club he would have had to give up his rank and begin again in a new style of karate with a white (beginner’s) belt, since the officers had decided to change the style of karate practiced in their new club’

– Zachary 1977

SLIDE 20

You only see what you look for...

US politics is more than two opposing views

Peixoto, T. P. Hierarchical Block Structures and High-Resolution Model Selection in Large Networks. Phys. Rev. X 4, 011047 (2014).
Adamic, Glance. The political blogosphere and the 2004 US election: divided they blog. 36–43 (2005).

SLIDE 21

Different generative processes = different community structures

SLIDE 22

Many good partitions...

Evans, T. S. Clique graphs and overlapping communities. J. Stat. Mech. 2010, P12037–22 (2010).

SLIDE 23

Metadata are not ground truth for community detection

SLIDE 27

Metadata are not ground truth for community detection

No interpretability of negative results.
(i) M unrelated to network structure
(ii) C and M capture different aspects of network structure
(iii) the network has no structure
(iv) the algorithm does not perform well

Multiple sets of metadata exist. Which set is ground truth?

We see what we look for. Confirmation bias. Publication bias.

“Community” is model dependent. Do we expect all networks across all domains to have the same relationship with communities?

SLIDE 28

Community detection is an inverse problem

Communities, T → Network, G: data generation, g(T)
Network, G → Communities, T: community detection, f(G)

SLIDE 29

For any graph there exist a (Bell) number of possible “ground truth” partitions, and an infinite number of capable generative models.

{generative models, g} x {partitions, T} → {graph G} (many to one)

However, in real networks both T and g are unknown.

The community detection problem is ill-posed (no unique solution).

see here for proof
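
To give a sense of how quickly the number of candidate partitions grows, here is a small sketch that computes Bell numbers with the Bell triangle. The function name is an illustrative choice.

```python
def bell_number(n):
    """Number of ways to partition a set of n labelled nodes into groups."""
    row = [1]  # first row of the Bell triangle; row[0] is B(0)
    for _ in range(n):
        # Each new row starts with the last entry of the previous row,
        # and each further entry adds the entry above-left.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

print(bell_number(5))   # 52
print(bell_number(10))  # 115975
```

Python integers have arbitrary precision, so `bell_number(34)` (the number of nodes in the Karate Club network) can be evaluated exactly in the same way.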

SLIDE 30

A No Free Lunch Theorem for community detection?

NFL theorem (supervised learning) states that there cannot exist a classifier that is a priori better than any other, averaged over all possible problems.

Wolpert, D. H. The lack of a priori distinctions between learning algorithms. Neural Computation 8, 1341–1390 (1996).

SLIDE 31

A No Free Lunch Theorem for community detection

NFL Theorem for community detection (paraphrased):

For the community detection problem, with accuracy measured by adjusted mutual information, the uniform average of the accuracy of any method f over all possible community detection problems is a constant which is independent of f.

On average, no community detection algorithm performs better than any other.

see here for proof

SLIDE 33

So, what about metadata?

Metadata = types of nodes
Communities = how nodes interact
Metadata + Communities = how different types of nodes interact with each other

We require new methods to understand the relationship between metadata and structure.

SLIDE 34

Are the metadata related to the network structure?
Blockmodel Entropy Significance Test

Do metadata and detected communities capture different aspects of network structure?
neoSBM

SLIDE 35

(i) the metadata do not relate to the network structure,
Are the metadata related to the network structure?
Blockmodel Entropy Significance Test

(ii) communities and metadata capture different aspects of network structure,
Do metadata and detected communities capture different aspects of network structure?
neoSBM

SLIDE 36

The Stochastic Blockmodel

Edges are conditionally independent given community membership

$p_{ij} = p(e_{ij} \mid z_i, z_j, \omega) = \omega_{z_i, z_j}$

[Figure: block structure showing intra-community density and inter-community density, shaded by increasing density]
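
Because edges are conditionally independent, the log-likelihood of a graph under this model is a sum over node pairs. The following is a minimal sketch for a simple undirected Bernoulli SBM; the function name and the toy values of `z` and `omega` are assumptions made for illustration.

```python
from math import log

def sbm_log_likelihood(edges, z, omega):
    """Log-likelihood of an undirected simple graph under a Bernoulli SBM.

    edges: set of (i, j) pairs with i < j
    z:     z[i] is the community of node i
    omega: omega[r][s] is the link probability between groups r and s,
           assumed to lie strictly between 0 and 1
    """
    n = len(z)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            p_ij = omega[z[i]][z[j]]
            total += log(p_ij) if (i, j) in edges else log(1.0 - p_ij)
    return total

# Toy example: dense within communities, sparse between them.
z = [0, 0, 1, 1]
omega = [[0.9, 0.1],
         [0.1, 0.9]]
edges = {(0, 1), (2, 3)}
ll = sbm_log_likelihood(edges, z, omega)
```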

SLIDE 37

Blockmodel Entropy Significance Test

How well do the metadata explain the network?
metadata is randomly assigned → model gives no explanation, high H
metadata correlates with structure → model gives good explanation, low H

1. Divide the network G into groups according to metadata labels M.
2. Fit the parameters of an SBM and compute the entropy H(G,M).
3. Compare this entropy to a distribution of entropies of networks partitioned using permutations of the metadata labels.
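
The three steps above can be sketched as a permutation test. This is an illustrative version under stated assumptions, not the authors' implementation: it uses a simple undirected Bernoulli SBM, takes H to be the negative maximum log-likelihood, and reports the fraction of permutations whose entropy is at most the observed one. All function and parameter names are hypothetical.

```python
import random
from collections import Counter
from math import log

def sbm_entropy(edges, labels):
    """H(G, labels): negative max log-likelihood of a Bernoulli SBM."""
    sizes = Counter(labels)
    links = Counter()
    for i, j in edges:
        links[tuple(sorted((labels[i], labels[j])))] += 1
    groups = sorted(sizes)
    h = 0.0
    for idx, r in enumerate(groups):
        for s in groups[idx:]:
            # Number of node pairs that could be linked in this block.
            if r == s:
                pairs = sizes[r] * (sizes[r] - 1) // 2
            else:
                pairs = sizes[r] * sizes[s]
            present = links[(r, s)]
            for count in (present, pairs - present):
                if count > 0:
                    h -= count * log(count / pairs)
    return h

def entropy_significance_test(edges, metadata, n_perm=1000, seed=None):
    """Return (observed entropy, permutation p-value)."""
    rng = random.Random(seed)
    observed = sbm_entropy(edges, metadata)      # steps 1 and 2
    shuffled = list(metadata)
    hits = 0
    for _ in range(n_perm):                      # step 3
        rng.shuffle(shuffled)
        if sbm_entropy(edges, shuffled) <= observed:
            hits += 1
    return observed, hits / n_perm

# Two 4-node cliques whose metadata match the structure exactly.
edges = {(i, j) for i in range(4) for j in range(i + 1, 4)}
edges |= {(i, j) for i in range(4, 8) for j in range(i + 1, 8)}
metadata = [0] * 4 + [1] * 4
H, p_value = entropy_significance_test(edges, metadata, n_perm=200, seed=0)
```

A small p-value suggests that the metadata explain the network better than a random relabelling of the same group sizes would.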

SLIDE 38

Multiple networks; multiple metadata attributes

Multiple sets of metadata provide a significant explanation for multiple networks.

SLIDE 40

Do metadata and detected communities capture different aspects of the network?

Choose between the red (SBM) partition and the blue (metadata) partition

SLIDE 42

Network with multiple 4-group optima

core-periphery ("metadata", M)
assortative (SBM comms., C)

SLIDE 43

[Figure: metadata partition and SBM partition, groups 1 to 4]

As θ increases the cost of freeing a node decreases

SLIDE 44

[Figure: neoSBM log likelihood and SBM log likelihood]

As θ increases the cost of freeing a node decreases
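
One way to see why the cost falls with θ is the sketch below. It assumes each node is freed from its metadata label independently with probability θ, so the log-prior of freeing q out of N nodes is q log θ + (N − q) log(1 − θ). This Bernoulli prior and all names are assumptions for illustration and may differ from the exact neoSBM objective.

```python
from math import log

def log_prior(n_free, n_nodes, theta):
    """Log-probability of freeing n_free of n_nodes, each w.p. theta (assumed)."""
    return n_free * log(theta) + (n_nodes - n_free) * log(1.0 - theta)

def cost_of_freeing(theta):
    """Drop in log-prior caused by freeing one additional node."""
    return log(1.0 - theta) - log(theta)

def penalised_log_likelihood(sbm_log_likelihood, n_free, n_nodes, theta):
    """SBM log-likelihood plus the assumed log-prior on freed nodes."""
    return sbm_log_likelihood + log_prior(n_free, n_nodes, theta)

# The per-node cost shrinks as theta grows towards 0.5.
costs = [cost_of_freeing(theta) for theta in (0.01, 0.1, 0.3, 0.5)]
```

Under this assumption, a node is worth freeing only when the gain in SBM log-likelihood exceeds `cost_of_freeing(theta)`.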

SLIDE 45

There is no ground truth

SLIDE 46

The future of community detection

"I don’t know the future. I didn’t come here to tell you how this is going to end. I came here to tell you how it’s going to begin… Where we go from there is a choice I leave to you." – Neo, The Matrix

SLIDE 47

In collaboration with...

Dan Larremore, Aaron Clauset

SLIDE 48

Questions?