Graph Summarisation, Jilles Vreeken, 10 July 2015



slide-1
SLIDE 1

Graph Summarisation

Jilles Vreeken

10 July 2015

slide-2
SLIDE 2

The Case of The Lost Pen

- or -

The Case of the Found Pen

Service Announcement #0

slide-3
SLIDE 3

Next week, a guest lecture

Mining Data that Changes

by dr. Pauli Miettinen (MPI-INF)

Service Announcement #1

slide-4
SLIDE 4

Exam. Oral. 3rd and 4th of August. Timeslots to be decided. Mail me if you want to participate, let me know if you have a preferred time/day.

Service Announcement #2

slide-5
SLIDE 5

Service Announcement #3

Introduction Patterns Correlation and Causation Graphs Wrap-up + <ask-me-anything> (Subjective) Interestingness

slide-6
SLIDE 6

Service Announcement #2

Introduction Patterns Correlation and Causation Graphs Wrap-up + <ask-us-anything> (Subjective) Interestingness

<ask-me-anything>? Yes! Prepare questions on anything* you’ve always wanted to ask me. Mail them to me in advance, or have me answer on the spot.
* preferably related to TADA, data mining, machine learning, science, the world, etc.
slide-7
SLIDE 7

Question of the day

How can we summarise the main structure of a graph in easily understandable terms?

slide-8
SLIDE 8

Graphs

Graphs are everywhere ↔ Everything* can be represented as a graph

* almost
slide-9
SLIDE 9

Graphs, formally

We consider graphs $G = (V, E)$ with $V$ the set of $n$ nodes, and $E$ a set of $m$ edges between nodes. In general, nodes can have labels, and edges can have labels, weights and can be directed.

slide-10
SLIDE 10

Real world graphs

road networks social networks biological networks cellular networks relational databases

slide-11
SLIDE 11

Real world graphs

the internet

slide-12
SLIDE 12

Graphs, formally

Today we consider unlabeled, unweighted, undirected graphs. The adjacency matrix $A$ then is an $n \times n$ matrix $A \in \{0,1\}^{n \times n}$ where a cell $a_{i,j} = 1$ iff $(i, j) \in E$ and 0 otherwise. We call the number of edges $d_i$ of a node $i$ its degree.
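As a small illustration of these definitions, the sketch below builds the adjacency matrix of a made-up toy graph and reads the degrees off its rows. The function name and the example edges are assumptions for illustration, not part of the lecture.

```python
def adjacency_matrix(n, edges):
    """Build the symmetric 0/1 adjacency matrix of an undirected graph."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1  # undirected: mirror every edge
    return A

# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
A = adjacency_matrix(4, edges)

# d_i is the number of edges incident to node i, i.e. the sum of row i.
degrees = [sum(row) for row in A]
print(degrees)
```

Storing the full matrix takes $n^2$ cells, which is exactly the cost that becomes a problem for large graphs later in the lecture.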

slide-13
SLIDE 13

Why summarisation?

Visualization Guiding attention

slide-14
SLIDE 14

Why summarisation?

Visualization Guiding attention

slide-15
SLIDE 15

Staring at an Adjacency Matrix

slide-16
SLIDE 16

Nodes: wiki editors Edges: co-edited I don’t see anything! 

Staring at a Hairball

slide-17
SLIDE 17

Stars: admins, bots, heavy users Bipartite cores: edit wars

Nodes: wiki editors Edges: co-edited

Kiev vs. Kyiv vandals

Example: Wikipedia Controversy

slide-18
SLIDE 18

Summary Statistics

For ‘normal’ data, we can get insight by taking an average. What kind of summary statistics do we have for graphs? Average degree.

Not very insightful.
slide-19
SLIDE 19

Summary Statistics

For ‘normal’ data, we can get insight by taking an average. What kind of summary statistics do we have for graphs? Degree plots

slide-20
SLIDE 20
slide-21
SLIDE 21

Power laws
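A degree plot is drawn from a degree histogram: for each degree, how many nodes have it. A minimal sketch, assuming the graph is given as an edge list (the function name and the toy edges are made up):

```python
from collections import Counter

def degree_distribution(edges):
    """Map each degree to the number of nodes that have it.

    Nodes that appear in no edge are not counted.
    """
    degree = Counter()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return Counter(degree.values())

# Hypothetical toy graph: node 0 is a small hub.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2)]
dist = degree_distribution(edges)
for d in sorted(dist):
    print(d, dist[d])
```

Plotting these counts against degree on log-log axes is the usual way to eyeball whether a distribution looks power-law-like.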

slide-22
SLIDE 22

Summary Statistics

For ‘normal’ data, we can get insight by taking an average. What kind of summary statistics do we have for graphs? Cluster coefficient (global). How clustered are the nodes in the graph?

$C = \frac{\text{number of closed triangles}}{\text{number of connected triplets of vertices}}$

Counting triangles requires matrix multiplication, which takes $O(n^\omega)$ where $\omega < 2.376$, but takes $O(n^2)$ space. (but fast estimators exist)
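To make the ratio concrete, here is a naive counting sketch (not the matrix-multiplication approach mentioned on the slide). It assumes the common convention in which the numerator counts closed triplets, so each triangle contributes three; the function name and toy graph are illustrative assumptions.

```python
from itertools import combinations

def global_clustering(adj):
    """adj maps each node to the set of its neighbours (undirected, no self-loops)."""
    closed_triplets = 0
    triplets = 0
    for v, nbrs in adj.items():
        d = len(nbrs)
        triplets += d * (d - 1) // 2      # connected triplets centred at v
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                closed_triplets += 1      # a triangle is counted once per corner
    return closed_triplets / triplets if triplets else 0.0

# Hypothetical toy graph: triangle 0-1-2 plus a pendant node 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
C = global_clustering(adj)
print(C)
```

The loop over neighbour pairs costs on the order of $d_v^2$ per node, so this sketch is only suitable for small graphs.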

slide-23
SLIDE 23

Summary Statistics

For ‘normal’ data, we can get insight by taking an average. What kind of summary statistics do we have for graphs? Cluster coefficient (local). How close is the neighborhood of node $i$ to being a clique?

$C_i = \frac{2\,|\{(j,k) \in E : j,k \in N_i\}|}{d_i (d_i - 1)}$

which is $O(d_i^2)$ at $O(n^2)$ space
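The local coefficient can be sketched directly from its definition: count the edges among the neighbours of a node and divide by the number of neighbour pairs. Returning 0 for nodes with fewer than two neighbours is a convention assumed here, and the names are illustrative.

```python
from itertools import combinations

def local_clustering(adj, i):
    """Fraction of pairs of neighbours of i that are themselves connected."""
    nbrs = adj[i]
    d = len(nbrs)
    if d < 2:
        return 0.0  # assumed convention: undefined case mapped to 0
    links = sum(1 for j, k in combinations(nbrs, 2) if k in adj[j])
    return 2 * links / (d * (d - 1))

# Hypothetical toy graph: triangle 0-1-2 plus a pendant node 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
for node in adj:
    print(node, local_clustering(adj, node))
```

This sketch uses adjacency sets rather than a full matrix, so membership tests are cheap without the $O(n^2)$ storage.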
slide-24
SLIDE 24

Summary Statistics

For ‘normal’ data, we can get insight by taking an average. What kind of summary statistics do we have for graphs? Diameter. The longest shortest path between two nodes. Requires calculating all shortest paths. Calculating shortest path takes $O(n^2)$. So, no.
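For an unweighted graph, the exact diameter can be sketched as one breadth-first search per node, which shows why it does not scale: the work grows with the number of nodes times the size of the graph. The sketch assumes a connected graph; names and the toy graph are illustrative.

```python
from collections import deque

def eccentricity(adj, source):
    """Longest shortest-path distance from source, found by BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter(adj):
    """Exact diameter of a connected, unweighted graph: one BFS per node."""
    return max(eccentricity(adj, s) for s in adj)

# Hypothetical toy graph: the path 0-1-2-3.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(diameter(adj))
```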

slide-25
SLIDE 25

Scalability

Many real world graphs are big, with $n$ in the order of millions. $O(n^2)$ is very scary for a graph miner. Current-day graph mining algorithms need to be linear in the number of edges, or else your paper will almost surely be rejected. What are the implications?

slide-26
SLIDE 26

Summarising a Graph

Given: a graph

slide-27
SLIDE 27

Summarising a Graph

Given: a graph
Find: a succinct summary with possibly overlapping subgraphs
slide-28
SLIDE 28

Summarising a Graph

Given: a graph
Find: a succinct summary with possibly overlapping subgraphs
slide-29
SLIDE 29

Summarising a Graph

Given: a graph
Find: a succinct summary with possibly overlapping subgraphs

important graph structures.
slide-30
SLIDE 30

Community Detection

Adjacency Matrix Assumed graph

slide-31
SLIDE 31

Community Detection

Adjacency Matrix Real graph

slide-32
SLIDE 32

Summarising a Graph

Fully Automatic Cross Associations is a nice MDL-based algorithm to summarise a matrix.

1) REASSIGN: Given a grid, assign rows and columns s.t. entropy within the grid is minimal.

(Chakrabarti et al. 2004)
slide-33
SLIDE 33

Summarising a Graph

Fully Automatic Cross Associations is a nice MDL-based algorithm to summarise a matrix.

1) REASSIGN: Given a grid, assign rows and columns s.t. entropy within the grid is minimal.

2) CROSSASSOC: Find cluster with highest entropy, split it, run REASSIGN. Stop when no split reduces the MDL score.

(Chakrabarti et al. 2004)
slide-34
SLIDE 34

Summarising a Graph

Fully Automatic Cross Associations is a nice MDL-based algorithm to summarise a matrix.

1) REASSIGN: Given a grid, assign rows and columns s.t. entropy within the grid is minimal.

2) CROSSASSOC: Find cluster with highest entropy, split it, run REASSIGN. Stop when no split reduces the MDL score.

(Chakrabarti et al. 2004)
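The quantity REASSIGN tries to reduce can be sketched as the total entropy of the blocks a grid induces. The sketch below covers only that data-cost part and leaves out the model-cost terms of the full MDL score; the function name, the grouping format and the toy matrix are assumptions for illustration, not the published algorithm.

```python
import math

def block_entropy_bits(A, row_groups, col_groups):
    """Bits needed to encode the cells of each block with its own density.

    row_groups and col_groups are lists of non-empty lists of indices.
    """
    total = 0.0
    for rows in row_groups:
        for cols in col_groups:
            cells = len(rows) * len(cols)
            ones = sum(A[r][c] for r in rows for c in cols)
            p = ones / cells
            if 0 < p < 1:  # pure blocks (all 0 or all 1) cost nothing here
                total += cells * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))
    return total

# Hypothetical 4x4 matrix with two perfect diagonal blocks.
A = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]

one_block = block_entropy_bits(A, [[0, 1, 2, 3]], [[0, 1, 2, 3]])
two_by_two = block_entropy_bits(A, [[0, 1], [2, 3]], [[0, 1], [2, 3]])
print(one_block, two_by_two)
```

A grid that aligns with the true blocks makes every block pure, so the entropy drops to zero, while treating the matrix as one block costs a full bit per cell.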
slide-35
SLIDE 35

Beyond Cave-men Communities

Traditional community detection algorithms assume that you interact only with people in your ‘cave’.

You are assumed not to interact with others, except if you are one of few ‘messengers’ between ‘caves’.

That is not very realistic.

(Kang & Faloutsos, ICDM 2011)
slide-36
SLIDE 36

Slash’n’Burn

Slash’n’Burn finds the node $i$ with highest $d_i$ and removes its edges $N_i$ and recurses. SLASHBURN:

  1. Slash top-$k$ hubs, burn edges
  2. Repeat on the remaining GCC

Before

(Kang & Faloutsos, ICDM 2011)
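The two steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation: it only returns the removed hubs and the small components that fall off, it breaks degree ties by node label, and it omits the node reordering that the real method produces. All names are made up.

```python
def connected_components(nodes, adj):
    """Connected components of the subgraph induced by the set `nodes`."""
    seen, comps = set(), []
    for s in nodes:
        if s in seen:
            continue
        comp, stack = set(), [s]
        seen.add(s)
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps

def slashburn(adj, k=1):
    """Return (hubs in removal order, components split off along the way)."""
    remaining = set(adj)
    hubs, pieces = [], []
    while len(remaining) > k:
        # Slash: remove the k highest-degree nodes of the current subgraph.
        degree = {u: sum(1 for v in adj[u] if v in remaining) for u in remaining}
        top = sorted(remaining, key=lambda u: (-degree[u], u))[:k]
        hubs.extend(top)
        remaining -= set(top)
        # Burn: keep only the giant connected component (GCC) for the next round.
        comps = sorted(connected_components(remaining, adj), key=len, reverse=True)
        pieces.extend(comps[1:])
        remaining = comps[0]
    if remaining:
        pieces.append(remaining)
    return hubs, pieces

# Hypothetical toy graph: hub 0 with leaves 1, 2, 3, bridged to a triangle 4-5-6.
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0},
       4: {0, 5, 6}, 5: {4, 6}, 6: {4, 5}}
hubs, pieces = slashburn(adj, k=1)
print(hubs, pieces)
```

On the toy graph, removing hub 0 detaches the three leaves at once, and the triangle is then taken apart in later rounds.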
slide-37
SLIDE 37

Slash’n’Burn

Slash’n’Burn finds the node 𝑗 with highest 𝑒𝑗 and removes its edges 𝑂𝑗 and recurses. SLASHBURN:

  • 1. Slash

ash top-𝑙 hubs, burn rn edges

  • 2. Repeat on the remaining GCC
(Kang & Faloutsos, ICDM 2011)
slide-38
SLIDE 38

Slash’n’Burn

Slash’n’Burn finds the node 𝑗 with highest 𝑒𝑗 and removes its edges 𝑂𝑗 and recurses. SLASHBURN:

  • 1. Slash

ash top-𝑙 hubs, burn rn edges

  • 2. Repeat on the remaining GCC

After

(Kang & Faloutsos, ICDM 2011)
slide-39
SLIDE 39

Beyond Cave-men Communities

Slash’n’Burn applied on the AS-Oregon graphs shows that real graphs indeed have structure beyond cave-men communities! – but also include those! A nice side-result is that the Slash’n’Burned ordered matrix has lots of ‘empty space’ and can hence be stored efficiently.

(Kang & Faloutsos, ICDM 2011)
slide-40
SLIDE 40

Carnegie Mellon University, Korea Advanced Institute of Science and Technology

VoG: Summarizing and Understanding Large Graphs

Danai Koutra Jilles Vreeken U Kang Christos Faloutsos

SDM, 25 April 2014, Philadelphia, USA

slide-41
SLIDE 41

Main Idea

1)

Use a graph vocabulary:

2)

Best graph summary → optimal compression (MDL)

slide-42
SLIDE 42

Main Idea

1)

Use a graph vocabulary:

2)

Shortest lossless description → optimal compression (MDL)

slide-43
SLIDE 43

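Given a set of models $\mathcal{M}$, the best model $M \in \mathcal{M}$ is

$\arg\min_{M \in \mathcal{M}} \; L(M) + L(D \mid M)$

where $L(M)$ is the # bits for $M$, and $L(D \mid M)$ is the # bits for the data using $M$.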

Minimum Description Length

slide-44
SLIDE 44

$a_1 x + a_0$

$L(M) + L(D \mid M)$

$a_{10} x^{10} + a_9 x^9 + \dots + a_0$

errors

MDL example
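The trade-off in this example can be sketched with a deliberately crude two-part code. Everything below is an assumption for illustration: the data is made up, each parameter is charged a flat 8 bits, residuals are charged with an arbitrary monotone cost, and a lookup table stands in for the over-fitted degree-10 polynomial.

```python
import math

def residual_bits(r):
    """Assumed cost of one integer residual; grows with its magnitude."""
    return 1 + 2 * math.log2(abs(r) + 1)

def two_part_cost(n_params, predict, xs, ys, bits_per_param=8):
    model_bits = n_params * bits_per_param                                   # L(M)
    data_bits = sum(residual_bits(y - predict(x)) for x, y in zip(xs, ys))   # L(D | M)
    return model_bits + data_bits

xs = list(range(10))
ys = [1, 4, 8, 10, 13, 15, 19, 22, 25, 29]   # made-up data, roughly 3x + 1
table = dict(zip(xs, ys))

candidates = {
    "constant": (1, lambda x: 15),
    "linear": (2, lambda x: 3 * x + 1),
    "memorise": (len(xs), lambda x: table[x]),  # one parameter per data point
}
costs = {name: two_part_cost(k, f, xs, ys) for name, (k, f) in candidates.items()}
best = min(costs, key=costs.get)
print(costs, best)
```

The constant model is cheap to describe but pays for large errors, the memorising model has no errors but is expensive to describe, and the line minimises the sum of both parts.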

slide-45
SLIDE 45

Given:

  • a graph $G$ with adjacency matrix $A$
  • vocabulary $\Omega$

Find: model $M$ s.t. $L(G, M) = \min L(M) + L(E)$

Minimum Graph Description

Model $M$, Adjacency $A$, Error $E$

slide-46
SLIDE 46

VoG: Overview

argmin

≈ ≈?

slide-47
SLIDE 47

VoG: Overview

slide-48
SLIDE 48

VoG: Overview

some criterion

slide-49
SLIDE 49

VoG: Overview

slide-50
SLIDE 50

VoG: Overview

slide-51
SLIDE 51

Summary

VoG: Overview

slide-52
SLIDE 52

We need candidate structures…

… How can we get them?

slide-53
SLIDE 53

Step 1: Graph Decomposition

We can use: any decomposition method
We did use/adapt: SLASHBURN

slide-54
SLIDE 54

Slash top-k hubs, burn edges

Before

SnB Graph Decomposition

slide-55
SLIDE 55

Slash top-k hubs, burn edges

SnB Graph Decomposition

slide-56
SLIDE 56

Slash top-k hubs, burn edges

candidate structures

After

SnB Graph Decomposition

slide-57
SLIDE 57

Slash top-k hubs, burn edges

candidate structures

After

SnB Graph Decomposition

Notice that the structures can overlap!

slide-58
SLIDE 58

Slash top-k hubs, burn edges

candidate structures

After

SnB Graph Decomposition

slide-59
SLIDE 59

Slash top-k hubs, burn edges
Repeat on the remaining GCC

GCC

SnB Graph Decomposition

slide-60
SLIDE 60

Now, how can we ‘label’ them?

We got candidate structures.

slide-61
SLIDE 61

≈?

argmin

1 2

Step 2: Graph Labeling

slide-62
SLIDE 62

hub?
“best” node split?
“best” node ordering?
missing edges?

Graph Representations

slide-63
SLIDE 63

hub

Hub: top-degree node Spokes: the rest

DETAILS

Graph Representations

slide-64
SLIDE 64

hub

Hub: top-degree node Spokes: the rest

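$L_{\mathbb{N}}(|st| - 1) + \log n + \log \binom{n-1}{|st|-1} + L(E^+) + L(E^-)$

Star structure: # of spokes, $L_{\mathbb{N}}(|st| - 1)$; hub ID, $\log n$; spokes IDs, $\log \binom{n-1}{|st|-1}$
Errors: extra, $L(E^+)$; missing, $L(E^-)$

$n = 7$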

DETAILS

Graph Representations
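The structure part of the star cost (everything except the two error terms) can be evaluated numerically. The sketch assumes that $L_{\mathbb{N}}$ is Rissanen's universal code for integers, which the slide does not spell out, and that logarithms are base 2; the function names are illustrative.

```python
import math

def universal_int_bits(n):
    """Assumed form of L_N for n >= 1:
    log2(c0) + log2(n) + log2(log2(n)) + ..., keeping only positive terms."""
    if n < 1:
        raise ValueError("n must be at least 1")
    bits = math.log2(2.865064)  # normalising constant c0
    term = math.log2(n)
    while term > 0:
        bits += term
        term = math.log2(term)
    return bits

def star_bits(n, n_spokes):
    """Bits for a star without its error terms L(E+) and L(E-)."""
    return (universal_int_bits(n_spokes)            # number of spokes
            + math.log2(n)                          # which node is the hub
            + math.log2(math.comb(n - 1, n_spokes)))  # which nodes are the spokes

# The slide's example: n = 7 nodes, so a star covering all of them has 6 spokes.
cost = star_bits(7, 6)
print(round(cost, 3))
```

When the star spans the whole graph the spoke IDs cost nothing, because there is only one way to choose 6 spokes out of the 6 remaining nodes.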

slide-65
SLIDE 65

hub

Hub: top-degree node Spokes: the rest

$L_{\mathbb{N}}(|st| - 1) + \log n + \log \binom{n-1}{|st|-1} + L(E^+) + L(E^-)$

# of spokes; hub ID; spokes IDs
Errors: extra, missing

$n = 7$

DETAILS

Graph Representations

slide-66
SLIDE 66

hub

Hub: top-degree node Spokes: the rest

$L_{\mathbb{N}}(|st| - 1) + \log n + \log \binom{n-1}{|st|-1} + L(E^+) + L(E^-)$

hub ID; spokes IDs
Errors: extra, missing

6 spokes, $n = 7$

DETAILS

Graph Representations

slide-67
SLIDE 67

hub

Hub: top-degree node Spokes: the rest

$L_{\mathbb{N}}(|st| - 1) + \log n + \log \binom{n-1}{|st|-1} + L(E^+) + L(E^-)$

spokes IDs
Errors: extra, missing

6 spokes, $n = 7$

DETAILS

Graph Representations

slide-68
SLIDE 68

hub

Hub: top-degree node Spokes: the rest

$L_{\mathbb{N}}(|st| - 1) + \log n + \log \binom{n-1}{|st|-1} + L(E^+) + L(E^-)$

Errors: extra, missing

6 spokes, $n = 7$

DETAILS

Graph Representations

slide-69
SLIDE 69

hub

Hub: top-degree node Spokes: the rest

$L_{\mathbb{N}}(|st| - 1) + \log n + \log \binom{n-1}{|st|-1} + L(E^+) + L(E^-)$

extra, missing

6 spokes, $n = 7$

DETAILS

Graph Representations

slide-70
SLIDE 70

Max bipartite graph: NP-hard Heuristic: Belief Propagation with heterophily for node classification (blue/red)

DETAILS

Graph Representations

slide-71
SLIDE 71

Max bipartite graph: NP-hard
Heuristic: Belief Propagation with heterophily for node classification (blue/red)

$+ \log n + \log \binom{n-1}{|st|-1} + L(E^+) + L(E^-)$

Annotations: # of blue nodes; # of red nodes; their IDs
Errors: extra, missing

Bipartite graph structure DETAILS

Graph Representations

slide-72
SLIDE 72

Max bipartite graph: NP-hard
Heuristic: Belief Propagation with heterophily for node classification (blue/red)

$+ \log n + \log \binom{n-1}{|st|-1} + L(E^+) + L(E^-)$

Annotations: # of blue nodes; # of red nodes; their IDs
Errors: extra, missing

DETAILS

Graph Representations

slide-73
SLIDE 73

Max bipartite graph: NP-hard
Heuristic: Belief Propagation with heterophily for node classification (blue/red)

$+ \log n + \log \binom{n-1}{|st|-1} + L(E^+) + L(E^-)$

Annotations: # of blue nodes; # of red nodes; their IDs; extra, missing

DETAILS

Graph Representations

slide-74
SLIDE 74


Longest path: NP-hard Heuristic: BFS + local search

Graph Representations

slide-75
SLIDE 75

Longest path: NP-hard
Heuristic: BFS + local search

Errors: extra, missing

Chain structure

Graph Representations

slide-76
SLIDE 76

≈?

Step 2: Graph Labeling

slide-77
SLIDE 77

≈?

argmin

Step 2: Graph Labeling

slide-78
SLIDE 78

≈?

argmin

Step 2: Graph Labeling

slide-79
SLIDE 79

Step 3: Summary Assembly

slide-80
SLIDE 80

Step 3: Summary Assembly

slide-81
SLIDE 81

Step 3: Summary Assembly

Summary

slide-82
SLIDE 82

Concepts

= # bits as structure - # bits as noise

compression gain

Savings

DETAILS

slide-83
SLIDE 83

Step 3: Summary Assembly

slide-84
SLIDE 84

Step 3: Summary Assembly

slide-85
SLIDE 85

Step 3: Summary Assembly

slide-86
SLIDE 86

Summary

Step 3: Summary Assembly

slide-87
SLIDE 87

Concepts

Summary encoding cost

$L(M) = L_{\mathbb{N}}(|M| + 1) + \log \binom{|M| + 1}{|\Omega| + 1} + \sum_{s \in M} \left( -\log P(x(s) \mid M) + L(s) \right)$

  • # of structures
  • # of structures per type
  • for each structure: its type, and its encoding length (its connectivity)

Example: 3 structures, with counts 1, 1, 1 per type.

slide-88
SLIDE 88

Step 3: Summary Assembly

𝑀(𝐸, 𝑁) structures

DETAILS

slide-89
SLIDE 89

[Bar chart: bits needed and unexplained edges, as percentages (0% to 100%), for Plain, Top-10, Top-100 and G&F. Values shown: 75%, 98%, 93%, 75%, 2%, 77%, 46%, 60%.]

4292729 bits as noise

Real graphs have structure!

(we can save bits by encoding with structures!)

Quantitative Analysis

slide-90
SLIDE 90

[Bar chart, log scale (1 to 100): structure types (Star, Near-Bipartite, Full clique, Full Bipartite, Chain) for Plain, Top-10, Top-100 and G&F.]

Main structure types:

Quantitative Analysis

slide-91
SLIDE 91

Quantitative Analysis

[Bar chart, log scale (1 to 10000): structure types (Star, Near-Bipartite, Full clique, Full Bipartite) for Plain, Top-10, Top-100 and G&F.]

Main structure types: Stars, near- and full-bipartite cores.

slide-92
SLIDE 92

Top-3 Stars

klay kenneth.lay@enron.com

Top-1 NBC

Ski excursion

jeff.skilling@enron.com

Qualitative Analysis: Enron

slide-93
SLIDE 93

VoG is near-linear in the number of edges of the input graph.

Runtime

slide-94
SLIDE 94

“jellyfish”

(Tauro, 2001)

Future Work

Those of you interested in an MSc or RIL project…

Our current vocabulary is

But many other structures make sense, for example

slide-95
SLIDE 95

Future Work

Those of you who might be interested in an MSc or RIL project…

It would be great if we could mine summaries directly from data
… without pre-mining all candidate structures

Real graphs show powerlaw-ish degree distributions,
… would be great if VoG could take that into account

slide-96
SLIDE 96

Conclusions

Graphs need Summaries

  • graphs are powerful – but difficult to interpret
  • far too few (efficient) summary methods available

Cross-Associations

  • powerful technique to find bi-clusters
  • heuristic, improvements exist

Slash’n’Burn

  • reorders nodes of a graph
  • finds sub-graphs ‘beyond’ cave-men communities

VoG

  • summarises graphs with a graph-theoretic vocabulary
  • first of its kind – but a big stack of heuristics
  • fast, good results.
slide-97
SLIDE 97

Thank you!
