SLIDE 1

What is the core distribution of a graph telling us?

Sonja Petrović
Illinois Institute of Technology, Chicago

Joint work with: Vishesh Karwa (Carnegie Mellon / Harvard), Michael Pelsmajer (IIT), Despina Stasi (Univ. of Cyprus / IIT), Dane Wilburne (IIT)

arXiv:1410.7357 - v2 soon.

AS2015 - Genova, June 2015

(Karwa,Petrovic,Pelsmajer,Stasi,Wilburne) ERGM for network shell structure as2015 0 / 12

SLIDE 2

Motivation: Degree-centric models not enough for statistical network analysis

Setting: statistical models for random graphs

How to capture node importance?

In some applications, it matters not just to how many other nodes a particular node in the network is connected, but also to which other nodes it is connected.

→ Is degree-centric analysis suitable? ←

Examples: information dispersal, the spread of disease or viruses, or robustness to node failure...

Social network setting: record 'node celebrity status'.

SLIDE 4

Discrete math tool: Cores decomposition of a graph

Classifying vertices: coreness (a.k.a. shell index)

[Seidman83]: A k-core decomposition of a graph captures precisely this:
Any vertex may live in many cores, but only one shell.

Vast literature on:
  • Fast computation of shell indices;
  • Interesting applications and heuristic studies.

Not surfaced in stats literature so far:
A rigorous statistical model for networks relying on core structure.

→ Core structure is summarized by shell distribution. ←
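As a minimal sketch of how shell indices and the shell distribution can be computed, the following uses the standard peeling procedure (repeatedly delete vertices of smallest remaining degree). The function names and the adjacency-dictionary input are choices made for this illustration, not taken from the slides.

```python
def shell_indices(adj):
    """Map each vertex to its shell index (coreness).

    adj: dict mapping every vertex to the set of its neighbours
         (undirected simple graph; an illustrative input format).
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # Vertices with remaining degree <= k cannot belong to the (k+1)-core.
        peel = [v for v in remaining if degree[v] <= k]
        if not peel:
            k += 1  # everything left has degree > k, so it is the (k+1)-core
            continue
        for v in peel:
            shell[v] = k
            remaining.discard(v)
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
    return shell


def shell_distribution(adj):
    """Return [n_0, ..., n_{n-1}], where n_i counts the vertices in shell i."""
    counts = [0] * len(adj)
    for index in shell_indices(adj).values():
        counts[index] += 1
    return counts


# A triangle {0, 1, 2} with a pendant vertex 3 attached to vertex 0.
example = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(shell_indices(example))
print(shell_distribution(example))
```

In this example the pendant vertex lies only in the 1-core (shell 1), while the triangle's vertices lie in both the 1-core and the 2-core but belong to shell 2 alone, so the shell distribution is (0, 1, 3, 0).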

SLIDE 5

The ERGM: Shell distribution as sufficient statistic

The shell distribution model.

G = g: a random instance of a graph on n nodes
$n_i(g)$: number of vertices in shell i; $p_i$: the "shell parameter"

$$P(G = g; p) = \varphi(p) \prod_{i=0}^{n-1} p_i^{n_i(g)}$$

Exponential family form:

$$P(G = g; p) = \exp\Big\{ \sum_{i=0}^{n-2} n_i(g)\,\theta_i - \psi(\theta) \Big\}.$$

Shell ↔ degree distribution: Erdős-Rényi not a formal submodel.
Log-linear structure only on 'atomic level'.
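One way to pass from the product form to the exponential family form is sketched here. It assumes all $p_i > 0$ and takes $\theta_i = \log(p_i / p_{n-1})$, a natural choice that the slide does not spell out. Every vertex lies in exactly one shell, so $\sum_i n_i(g) = n$ and one coordinate is redundant:

```latex
\begin{align*}
\varphi(p)\prod_{i=0}^{n-1} p_i^{n_i(g)}
  &= \varphi(p)\, p_{n-1}^{\,n} \prod_{i=0}^{n-2} \left(\frac{p_i}{p_{n-1}}\right)^{n_i(g)}
  && \text{using } n_{n-1}(g) = n - \sum_{i=0}^{n-2} n_i(g) \\
  &= \exp\Big\{ \sum_{i=0}^{n-2} n_i(g)\,\theta_i - \psi(\theta) \Big\},
  && \theta_i = \log\frac{p_i}{p_{n-1}},\quad
     \psi(\theta) = -\log\varphi(p) - n\log p_{n-1}.
\end{align*}
```

This accounts for the sum stopping at n−2 while the product runs to n−1.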

SLIDE 6

Motivating example: Authorship dataset

Sampling from the model - Authorship dataset

The largest connected component of the network science co-authorship network (379 nodes)

[Figure: six panels. Count against Edges, Triangles, Centrality and Largest Core Size; Number of Nodes against Shell Index and against Degree.]

SLIDE 7

Motivating example: Authorship dataset

Typical graphs from the model - Authorship dataset

... what to do with this??

SLIDE 8

Motivating example: Authorship dataset

Exploring structure of graphs within a fiber

The author network component core distribution can be realized with graphs that have from about 250 to 500 triangles.

Simulations:
  • examples from n = 18 to n = 57 nodes,
  • algorithm never visited the same graph twice,
  • min and max number of triangles differ by a factor of 2 or 3.

A typical histogram of number of triangles:

[Figure: histogram of Frequency against Number of triangles; shell distr. of the big component of author dataset, 52,536 steps.]

So what do we have?
  • Model that provides necessary formalism for using k-cores in statistical considerations
  • Algorithm for constructing all graphs with given shell structure
  • MCMC algorithm for simulating from the model

SLIDE 9

Sampling from the model ... and sampling from fibers

3 problems

(... or: the usual ERGM suspects)

Model fitting questions lead to three important subproblems.
* Solving these is crucial for MLE estimates and goodness of fit tests *

1) Existence of MLE - captured by the model polytope:
Theorem. The polytope of all shell distribution vectors is a dilate of a simplex. All realizable lattice points lie on the boundary of this polytope. The MLE never exists for a sample of size 1.

2) Sampling from the fibers (via the Metropolis algorithm):
Algorithm. Randomly construct a graph with a given shell distribution. Constructs all graphs with positive probability.
Experiments: fast graph discovery.
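The boundary claim in the theorem can be sanity-checked by brute force on very small graphs. The sketch below is not the authors' proof. It enumerates all graphs on n labelled nodes, computes each shell distribution, and checks that the vectors sum to n (so they lie in the n-th dilate of the standard simplex) and always have at least one zero coordinate (so none lies in the relative interior). Helper names are illustrative.

```python
from itertools import combinations


def shell_distribution(n, edges):
    """Shell distribution [n_0, ..., n_{n-1}] of a graph on nodes 0..n-1."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    degree = {v: len(adj[v]) for v in adj}
    remaining, counts, k = set(adj), [0] * n, 0
    while remaining:
        # Peel vertices that cannot belong to the (k+1)-core.
        peel = [v for v in remaining if degree[v] <= k]
        if not peel:
            k += 1
            continue
        for v in peel:
            counts[k] += 1
            remaining.discard(v)
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
    return counts


def realizable_shell_distributions(n):
    """Set of shell distributions over all 2^(n choose 2) labelled graphs."""
    dyads = list(combinations(range(n), 2))
    found = set()
    for mask in range(2 ** len(dyads)):
        edges = [d for bit, d in enumerate(dyads) if mask >> bit & 1]
        found.add(tuple(shell_distribution(n, edges)))
    return found


for n in range(2, 6):
    vectors = realizable_shell_distributions(n)
    assert all(sum(vec) == n for vec in vectors)  # on the dilated simplex
    assert all(0 in vec for vec in vectors)       # never in its interior
    print(n, len(vectors))
```

A zero coordinate is forced for n ≥ 2: if every shell 0, ..., n−1 were occupied, shell n−1 would hold a single vertex, yet a vertex of shell index n−1 needs n−1 neighbours that also sit in the (n−1)-core.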

SLIDE 10

Sampling from the model ... and sampling from fibers

3 problems (continued)

(... or: the usual ERGM suspects)

3) Sampling from the model: direct sampling intractable!

Sampson data set: 18 monks in a New England Monastery: $n_S(g) = (0, 2, 3, 15, 0, 0, \ldots)$

MCMC scheme: "tie-no-tie" proposal [Caimo et al]
  • good mixing

Probability of accepting:

$$\pi = \min\Big\{ 1, \prod_i p_i^{\,n_i(g') - n_i(g)} \Big\}.$$
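A minimal sketch of one Metropolis step with this acceptance probability, reading g as the current graph and g′ as the proposal. To stay short it swaps in a simpler proposal than tie-no-tie: toggle one uniformly chosen dyad, which is symmetric and so needs no correction factor. The function names, the edge-set representation and the parameter values are illustrative assumptions.

```python
import math
import random


def shell_distribution(n, edges):
    """Shell distribution [n_0, ..., n_{n-1}] of a graph on nodes 0..n-1."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    degree = {v: len(adj[v]) for v in adj}
    remaining, counts, k = set(adj), [0] * n, 0
    while remaining:
        peel = [v for v in remaining if degree[v] <= k]
        if not peel:
            k += 1
            continue
        for v in peel:
            counts[k] += 1
            remaining.discard(v)
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
    return counts


def acceptance_probability(p, n_old, n_new):
    """min{1, prod_i p_i^(n_i(g') - n_i(g))}, computed on the log scale."""
    log_ratio = sum((new - old) * math.log(p_i)
                    for p_i, old, new in zip(p, n_old, n_new) if new != old)
    return math.exp(min(0.0, log_ratio))


def metropolis_step(n, edges, p, rng):
    """Propose toggling one uniformly chosen dyad; accept or stay put."""
    dyad = frozenset(rng.sample(range(n), 2))
    proposal = edges ^ {dyad}  # add the dyad if absent, remove it if present
    alpha = acceptance_probability(
        p, shell_distribution(n, edges), shell_distribution(n, proposal))
    return proposal if rng.random() < alpha else edges


rng = random.Random(0)
n = 6
p = [0.5, 1.0, 2.0, 1.0, 1.0, 1.0]  # hypothetical shell parameters, all positive
graph = frozenset()                  # start from the empty graph
for _ in range(1000):
    graph = metropolis_step(n, graph, p, rng)
print(sorted(tuple(sorted(e)) for e in graph), shell_distribution(n, graph))
```

Only the shell distributions of g and g′ enter the ratio, so the normalizing constant ϕ(p) never has to be computed.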

SLIDE 11

Sampling from the model ... and sampling from fibers


[Figure: six panels. Count against Edges, Triangles, Centrality and Largest Core Size; Number of Nodes against Shell Index and against Degree.]

SLIDE 12

Dealing with model degeneracy: A nested family of models / polytopes

Model degeneracy! - example

SLIDE 13

Dealing with model degeneracy: A nested family of models / polytopes

Extending the model family

Introduce a parameter for the degeneracy of a graph:

$$P(G = g; p, m) = \varphi(p) \prod_{i=0}^{n-1} p_i^{n_i(g)}, \quad \text{if } g \in \mathcal{G}_{n,m} := \{G : \operatorname{dgen}(G) \le m\}.$$

It means that all graphs under this model will have degeneracy at most m.

Treat m as a parameter (that needs to be estimated)
  • analogous to choosing the number of components in a mixture model
  • vs. assuming that it is known.

We treat m as fixed - select the observed value of m.
Estimation - open; but at least the new model is not degenerate.
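Since the degeneracy of a graph equals its largest non-empty shell index (a standard fact, not stated on the slide), membership in the restricted support can be read off the shell distribution. The sketch below shows this together with the unnormalised weight, which vanishes for graphs whose degeneracy exceeds m. Function names and parameter values are illustrative.

```python
import math


def degeneracy(shell_dist):
    """Largest shell index i with n_i > 0 (0 for a graph with no edges)."""
    return max(i for i, count in enumerate(shell_dist) if count > 0)


def restricted_log_weight(shell_dist, p, m):
    """Unnormalised log-probability under the restricted model.

    Returns -inf when the graph has degeneracy greater than m.
    """
    if degeneracy(shell_dist) > m:
        return -math.inf
    return sum(count * math.log(p_i)
               for count, p_i in zip(shell_dist, p) if count > 0)


# Triangle with a pendant vertex: shell distribution (0, 1, 3, 0).
dist = [0, 1, 3, 0]
p = [1.0, 2.0, 0.5, 1.0]  # hypothetical shell parameters
print(degeneracy(dist))                      # 2
print(restricted_log_weight(dist, p, m=1))   # -inf
print(restricted_log_weight(dist, p, m=2))
```

In a Metropolis sampler, a weight of -inf means that any proposal leaving the restricted support is always rejected.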

SLIDE 14

Dealing with model degeneracy: A nested family of models / polytopes

Simulations - Sampson network

Two submodels: support graphs with degeneracy ≤ 3, or = 3 (observed).

[Figure: two sets of four panels, one set per submodel. Count against Edges, Triangles, Centrality and Largest Core Size.]

Note heavier tails in one model.
Parameter used = good estimate of MLE (moment equations)
(expected shell distrib. under the MLE very close to observed)
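For reference, the moment equations mentioned here are the usual likelihood equations of an exponential family: at the MLE, the expected sufficient statistic matches the observed one. In the notation of the shell distribution model, with $g_{\mathrm{obs}}$ denoting the observed graph (a label introduced here):

```latex
\[
\mathbb{E}_{\hat\theta}\big[\, n_i(G) \,\big] = n_i(g_{\mathrm{obs}})
\qquad \text{for every shell index } i .
\]
```

A parameter value whose expected shell distribution is very close to the observed one is therefore close to solving these equations.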

SLIDE 15

Dealing with model degeneracy: A nested family of models / polytopes

Simulations - Sampson network - typical graphs
