Subgraph Frequencies: The Empirical and Extremal Geography of Large Graph Collections


Johan Ugander, Lars Backstrom, Jon Kleinberg. World Wide Web Conference, May 16, 2013.


SLIDE 1

Subgraph Frequencies: The Empirical and Extremal Geography of Large Graph Collections

Johan Ugander, Lars Backstrom, Jon Kleinberg
World Wide Web Conference, May 16, 2013

SLIDE 2

Graph collections

▪ Neighborhoods: graph induced by friends of a single ego, excluding ego

[Figure: ‘All Friends’]

SLIDE 3

Graph collections

▪ Neighborhoods: graph induced by friends of a single ego, excluding ego
▪ Groups: graph induced by members of a Facebook ‘group’
▪ Events: graph induced by ‘Yes’ respondents to a Facebook ‘event’

[Figure: ‘All Friends’]

SLIDE 4

Graph collections

▪ Neighborhoods: graph induced by friends of a single ego, excluding ego
▪ Groups: graph induced by members of a Facebook ‘group’
▪ Events: graph induced by ‘Yes’ respondents to a Facebook ‘event’

[Figure: ‘All Friends’]

Seeking a ‘coordinate system’ on these graphs

SLIDE 5

Subgraphs

All Friends

SLIDE 8

Subgraphs

All Friends

Compute frequencies

SLIDE 9

Subgraph Frequencies

▪ Definition: The subgraph frequency s(F,G) of a k-node subgraph F in a graph G is the fraction of k-tuples of nodes in G that induce a copy of F.

Motifs/Frequent subgraphs: Inokuchi et al. 2000, Milo et al. 2002, Yan-Han 2002, Kuramochi-Karypis 2004
Triad census: Davis-Leinhardt 1971, Wasserman-Faust 1994
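As a rough illustration of the definition, the sketch below counts induced 3-node subgraphs by brute force. It assumes that ‘k-tuples’ means unordered k-node subsets, and it relies on the fact that a 3-node graph is determined up to isomorphism by its number of edges. The function name and the toy graph are hypothetical and do not come from the talk.

```python
from itertools import combinations

def triad_frequencies(nodes, edges):
    """Return (s_empty, s_one_edge, s_wedge, s_triangle) for an undirected graph."""
    edge_set = {frozenset(e) for e in edges}
    counts = [0, 0, 0, 0]  # index = number of edges among the three nodes
    for trio in combinations(nodes, 3):
        m = sum(frozenset(pair) in edge_set for pair in combinations(trio, 2))
        counts[m] += 1
    total = sum(counts)
    return tuple(c / total for c in counts)

# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
freqs = triad_frequencies(nodes, edges)
print(freqs)  # (0.0, 0.25, 0.5, 0.25)
```

Brute force costs on the order of n^3 steps, which is fine for graphs of the sizes discussed here (tens to hundreds of nodes) but would need sampling or smarter counting for much larger graphs.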

SLIDE 10

Subgraph Frequencies

▪ Definition: The subgraph frequency s(F,G) of a k-node subgraph F in a graph G is the fraction of k-tuples of nodes in G that induce a copy of F.
▪ Subgraph frequency vectors:

s(·, G) = (x1, x2, x3, x4) = (0.18, 0.37, 0.14, 0.31)

Motifs/Frequent subgraphs: Inokuchi et al. 2000, Milo et al. 2002, Yan-Han 2002, Kuramochi-Karypis 2004
Triad census: Davis-Leinhardt 1971, Wasserman-Faust 1994

SLIDE 11

Subgraph Frequencies

▪ Definition: The subgraph frequency s(F,G) of a k-node subgraph F in a graph G is the fraction of k-tuples of nodes in G that induce a copy of F.
▪ Subgraph frequency vectors:

s(·, G) = (x1, x2, x3, x4) = (0.18, 0.37, 0.14, 0.31)
s(·, G) = (y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11)

Motifs/Frequent subgraphs: Inokuchi et al. 2000, Milo et al. 2002, Yan-Han 2002, Kuramochi-Karypis 2004
Triad census: Davis-Leinhardt 1971, Wasserman-Faust 1994
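The 11-entry vector corresponds to the 11 unlabeled graphs on 4 nodes. A minimal sketch of how such a vector could be computed follows; it keys each induced 4-node subgraph by its sorted degree sequence, a shortcut that happens to separate all 11 classes for k = 4 but does not generalize to larger k. The function name and the example graph are hypothetical.

```python
from collections import Counter
from itertools import combinations

def quad_frequencies(nodes, edges):
    """Map each 4-node class (keyed by sorted degree sequence) to its frequency."""
    edge_set = {frozenset(e) for e in edges}
    counts = Counter()
    for quad in combinations(nodes, 4):
        degree = dict.fromkeys(quad, 0)
        for u, v in combinations(quad, 2):
            if frozenset((u, v)) in edge_set:
                degree[u] += 1
                degree[v] += 1
        counts[tuple(sorted(degree.values(), reverse=True))] += 1
    total = sum(counts.values())
    return {key: c / total for key, c in counts.items()}

# Hypothetical example: the 5-cycle 0-1-2-3-4-0. Dropping any node leaves a 4-node path.
cycle_edges = [(i, (i + 1) % 5) for i in range(5)]
quad = quad_frequencies(range(5), cycle_edges)
print(quad)  # {(2, 2, 1, 1): 1.0}

# Sanity check: enumerate all 2^6 labeled 4-node graphs and count distinct keys.
all_pairs = list(combinations(range(4), 2))
classes = set()
for mask in range(2 ** 6):
    chosen = [p for i, p in enumerate(all_pairs) if mask >> i & 1]
    classes.update(quad_frequencies(range(4), chosen))
print(len(classes))  # 11
```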

SLIDE 12

Empirical/Extremal Questions

▪ Consider the subgraph frequencies as a ‘coordinate system’
▪ Empirical Geography:
  ▪ What subgraph frequencies do social graphs exhibit?
  ▪ Is there a good model?
▪ Extremal Geography:
  ▪ How much of this space is even feasible, combinatorially?
  ▪ Do empirical graphs fill the feasible space?

SLIDE 13

Empirical/Extremal Questions

▪ What’s a property of graphs and what’s a property of people?
▪ Consider the subgraph frequencies as a ‘coordinate system’
▪ Empirical Geography:
  ▪ What subgraph frequencies do social graphs exhibit?
  ▪ Is there a good model?
▪ Extremal Geography:
  ▪ How much of this space is even feasible, combinatorially?
  ▪ Do empirical graphs fill the feasible space?

SLIDE 14

What do we expect?

SLIDE 15

What do we expect?

[Figure: diagram with three ‘triadic closure’ labels]

SLIDE 16

What do we expect?

We expect few wedges, many triangles for social networks.

SLIDE 17

The triad space

SLIDE 18

The triad space

[Figure: 50-node graphs, annotated ‘You are here’. Legend: orange = Neighborhoods, green = Groups, lavender = Events]

SLIDE 19

The triad space

[Figure: 50-node graphs, annotated ‘You are here’ and ‘G(n,p)’. Legend: orange = Neighborhoods, green = Groups, lavender = Events]
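The G(n,p) reference in this figure can be traced with a short calculation, assuming ‘Gn,p’ denotes the Erdős–Rényi random graph in which each pair of nodes is an edge independently with probability p. Under that assumption the three pairs inside a triple are independent, so the expected triad frequencies are binomial in p. The function name below is hypothetical.

```python
def gnp_triad_frequencies(p):
    """Expected (empty, one edge, wedge, triangle) frequencies in G(n,p)."""
    q = 1.0 - p
    return (q ** 3, 3 * p * q ** 2, 3 * p ** 2 * q, p ** 3)

# Sweep p from 0 to 1 to trace the one-parameter curve through the triad space.
curve = [gnp_triad_frequencies(i / 10) for i in range(11)]
for i, point in enumerate(curve):
    print(f"p={i / 10:.1f}", tuple(round(x, 3) for x in point))
```

Every point on this curve sums to 1, so G(n,p) traces a one-dimensional path through the space of triad frequencies.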

SLIDE 20

Subgraph frequency of

[Figure: 50-node graphs. Legend: orange = Neighborhoods, green = Groups, lavender = Events]

SLIDE 22

Subgraph frequency of

[Figure: 50-node graphs, annotated ‘G(n,p)’. Legend: orange = Neighborhoods, green = Groups, lavender = Events]

SLIDE 23

Subgraph frequency of

[Figure: 50-node graphs, annotated ‘Extremal Graph Theory’. Legend: orange = Neighborhoods, green = Groups, lavender = Events]

SLIDE 24

Subgraph frequency of

Frequency of the ‘forbidden triad’ is bounded at ≤ 3/4 in the large-n limit. Asymptotically sharp for K_{n/2,n/2} (complete bipartite graph) when n is even.

[Figure: 50-node graphs. Legend: orange = Neighborhoods, green = Groups, lavender = Events]
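A quick brute-force check of the bipartite example, assuming the ‘forbidden triad’ is the wedge (the induced 3-node path). Counting directly, the wedge frequency of K_{n/2,n/2} is 3n/(4(n−1)): slightly above 3/4 for finite n and converging to 3/4 as n grows. The function name is hypothetical.

```python
from itertools import combinations

def wedge_frequency(n):
    """Brute-force wedge frequency of K_{n/2,n/2} for even n >= 4."""
    half = n // 2  # nodes 0..half-1 on one side, half..n-1 on the other
    wedges = total = 0
    for trio in combinations(range(n), 3):
        # Two nodes are adjacent exactly when they lie on opposite sides.
        m = sum((u < half) != (v < half) for u, v in combinations(trio, 2))
        wedges += (m == 2)
        total += 1
    return wedges / total

for n in (4, 10, 50):
    print(n, wedge_frequency(n), 3 * n / (4 * (n - 1)))
```

For n = 4 the graph is a 4-cycle and every triple induces a wedge, so the finite-n value can reach 1; the 3/4 figure describes the limit.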

SLIDE 25

Subgraph frequencies

SLIDE 26

‘Crowd-sourced’ inner bounds

Consider all social graphs together with their complements, the ‘anti-social’ graphs (which are also graphs!).
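Why complements help can be seen with a small check: a triple with m edges in G has 3 − m edges in the complement of G, so the triad count vector of the complement is the reverse of the original. The sketch below verifies this on a hypothetical toy graph; the helper names are illustrative only.

```python
from itertools import combinations

def triad_counts(n, edges):
    """Counts of triples by number of induced edges: [empty, one edge, wedge, triangle]."""
    edge_set = {frozenset(e) for e in edges}
    counts = [0, 0, 0, 0]
    for trio in combinations(range(n), 3):
        counts[sum(frozenset(p) in edge_set for p in combinations(trio, 2))] += 1
    return counts

def complement(n, edges):
    """Edge list of the complement graph on nodes 0..n-1."""
    edge_set = {frozenset(e) for e in edges}
    return [p for p in combinations(range(n), 2) if frozenset(p) not in edge_set]

n = 5
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]  # hypothetical toy graph
social = triad_counts(n, edges)
anti_social = triad_counts(n, complement(n, edges))
print(social, anti_social)
```

Each observed graph therefore contributes two points to the feasible region: its own frequency vector and the mirrored one.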

SLIDE 27

What graphs are missing?

SLIDE 28

Triadic Closure and Squares

▪ Square unlikely to form:

SLIDE 30

Triadic Closure and Squares

▪ Square unlikely to form:
▪ Square has very short ‘half-life’:

SLIDE 31

Continuous Time Markov Chain Model

SLIDE 32

Continuous Time Markov Chain Model

[Figure: diagram with three ‘triadic closure’ labels]

SLIDE 33

Edge Formation Random Walk (EFRW)

▪ Continuous-time Markov chain
▪ Transitions between unlabeled, undirected graphs based on edge formation.
▪ Independent Poisson processes for all node pairs:
  ▪ Arbitrary formation: rate γ > 0
  ▪ Arbitrary deletion: rate δ > 0
  ▪ Triadic closure formation for each wedge: rate λ ≥ 0

SLIDE 34

Edge Formation Random Walk (EFRW)

▪ Continuous-time Markov chain
▪ Transitions between unlabeled, undirected graphs based on edge formation.
▪ Independent Poisson processes for all node pairs:
  ▪ Arbitrary formation: rate γ > 0
  ▪ Arbitrary deletion: rate δ > 0
  ▪ Triadic closure formation for each wedge: rate λ ≥ 0
▪ For 4-node graphs, succinct Markov chain state transition diagram:

[Figure: state transition diagram over 4-node graphs. Rate labels on the transitions: 6γ, 4γ, 3γ, 2γ, γ, γ+λ, 2(γ+λ), 3(γ+λ), 2(γ+2λ), δ, 2δ, 3δ, 4δ, 6δ]
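A minimal sketch of the transition rates, under one reading of the bullets above: each absent pair forms at rate γ plus λ for every wedge it would close (every common neighbor), and each existing edge is deleted at rate δ. That reading is an assumption, as are the function name and the rate values. The totals it produces, such as 6γ for the empty graph, 3(γ+λ) for the star, and 2(γ+2λ) for the 4-cycle, are consistent with labels that appear in the diagram.

```python
from itertools import combinations

def efrw_rates(n, edges, gamma, delta, lam):
    """Return (formation, deletion): per-pair transition rates out of the current graph."""
    edge_set = {frozenset(e) for e in edges}
    neighbors = {v: set() for v in range(n)}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    formation, deletion = {}, {}
    for u, v in combinations(range(n), 2):
        if frozenset((u, v)) in edge_set:
            deletion[(u, v)] = delta  # existing edge: arbitrary deletion
        else:
            closes = len(neighbors[u] & neighbors[v])  # wedges this edge would close
            formation[(u, v)] = gamma + lam * closes
    return formation, deletion

gamma, delta, lam = 1.0, 2.0, 0.5  # hypothetical rate values
examples = {
    "empty": [],
    "star": [(0, 1), (0, 2), (0, 3)],
    "cycle": [(0, 1), (1, 2), (2, 3), (3, 0)],
    "complete": list(combinations(range(4), 2)),
}
totals = {}
for name, example_edges in examples.items():
    formation, deletion = efrw_rates(4, example_edges, gamma, delta, lam)
    totals[name] = (sum(formation.values()), sum(deletion.values()))
    print(name, totals[name])
```

Simulating the chain would then amount to drawing an exponential waiting time from the sum of all rates and picking a pair with probability proportional to its rate.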

SLIDE 36

Fitting λ to subgraph data

▪ How well can we fit λ?
▪ Subgraph frequencies are modeled very well by triadic closure.

[Figure: subgraph frequency (log-scale y-axis, 0.001 to 1.000) for Neighborhoods, n=50. Series: Neighborhoods data (mean); fit model (legend: ‘λ ν = 19.37’)]

SLIDE 37

Extremal graph theory

▪ Subgraph frequencies s(F,G) closely related to homomorphism density t(F,G).
▪ Frequency of cliques, lower bounds: Moon-Moser 1962, Razborov 2008
▪ Frequency of cliques, upper bounds: Kruskal-Katona Theorem
▪ Frequency of trees: Sidorenko Conjecture (‘Theorem for trees’)
▪ Also linear relationships across sizes.
▪ => Linear Program!

[Borgs et al. 2006, Lovász 2009]
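For a concrete feel for homomorphism density, the sketch below computes t(F,G) by brute force, taking it in the standard sense as the fraction of all maps from the nodes of F to the nodes of G that send every edge of F to an edge of G. It then checks the simplest tree case of the Sidorenko-type bound, t(wedge, G) ≥ t(edge, G)², on a hypothetical 4-node path. The function name and the example graph are assumptions for illustration.

```python
from itertools import product

def hom_density(pattern_edges, k, n, edges):
    """t(F,G) for a pattern F on nodes 0..k-1 and a graph G on nodes 0..n-1."""
    adjacent = {frozenset(e) for e in edges}
    hits = 0
    for image in product(range(n), repeat=k):  # all n^k maps, not only injective ones
        if all(frozenset((image[a], image[b])) in adjacent for a, b in pattern_edges):
            hits += 1
    return hits / n ** k

path_edges = [(0, 1), (1, 2), (2, 3)]  # hypothetical G: the path 0-1-2-3
t_edge = hom_density([(0, 1)], 2, 4, path_edges)
t_wedge = hom_density([(0, 1), (1, 2)], 3, 4, path_edges)
print(t_edge, t_wedge, t_edge ** 2)  # 0.375 0.15625 0.140625
```

Unlike s(F,G), this quantity counts non-induced and non-injective copies, which is why the two are related but not identical.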

SLIDE 38

Extremal graph theory

▪ A proposition for all subgraphs:

Proposition. For every k, there exist constants ε and n0 such that the following holds. If F is a k-node subgraph that is not a clique and not empty, and G is any graph on n ≥ n0 nodes, then s(F, G) < 1 − ε.

SLIDE 39

Audience graph classification

▪ How do different audience graphs differ?

[Figure: average edge density (0.05 to 1.00) vs. size (20 to 1000) for Neighborhoods, Neighborhoods + ego, Groups, and Events; sizes 75 and 400 are marked]

SLIDE 40

Audience graph classification

▪ How do different audience graphs differ?
▪ Classification challenges
  A) 75-node neighborhoods vs. 75-node events
  B) 400-node neighborhoods vs. 400-node groups

[Figure: average edge density (0.05 to 1.00) vs. size (20 to 1000) for Neighborhoods, Neighborhoods + ego, Groups, and Events; sizes 75 and 400 are marked]

SLIDE 41

Audience graph classification

▪ How do different audience graphs differ?
▪ Classification challenges
  A) 75-node neighborhoods vs. 75-node events
  B) 400-node neighborhoods vs. 400-node groups
▪ Features:
  ▪ Quad frequencies: 76% / 76% accuracy
  ▪ Global features: 69% / 76% accuracy
  ▪ Quad frequencies + Global features: 81% / 82% accuracy

[Figure: average edge density (0.05 to 1.00) vs. size (20 to 1000) for Neighborhoods, Neighborhoods + ego, Groups, and Events; sizes 75 and 400 are marked]

SLIDE 42

Conclusions

▪ Subgraph frequencies usefully characterize social graphs, have extremal limits!
▪ Edge Formation Random Walk model of dense social graphs:
▪ Homomorphism density bounds yield subgraph density bounds:

[Figure: EFRW state transition diagram for 4-node graphs. Rate labels on the transitions: 6γ, 4γ, 3γ, 2γ, γ, γ+λ, 2(γ+λ), 3(γ+λ), 2(γ+2λ), δ, 2δ, 3δ, 4δ, 6δ]