Graph Summarisation
Jilles Vreeken
10 July 2015
Service Announcement #0
The Case of the Lost Pen, or: The Case of the Found Pen
Service Announcement #1
Next week, a guest lecture: Mining Data that Changes, by dr. Pauli Miettinen (MPI-INF)
Service Announcement #2
Exam. Oral. 3rd and 4th of August. Timeslots to be decided. Mail me if you want to participate, and let me know if you have a preferred time/day.
Service Announcement #3
Introduction, Patterns, Correlation and Causation, Graphs, Wrap-up + <ask-me-anything>, (Subjective) Interestingness
<ask-me-anything>? Yes! Prepare questions on anything* you’ve always wanted to ask me. Mail them to me in advance,
Question of the day
How can we summarise the main structure of a graph in easily understandable terms?
Graphs
Graphs are everywhere ↔ Everything* can be represented as a graph
* almost
Graphs, formally
We consider graphs $G = (V, E)$ with $V$ the set of $n$ nodes, and $E$ a set of $m$ edges between nodes. In general, nodes can have labels, and edges can have labels and weights and can be directed.
Real world graphs
road networks, social networks, biological networks, cellular networks, relational databases, the internet
Graphs, formally
Today we consider unlabeled, unweighted, undirected graphs. The adjacency matrix $A$ then is an $n \times n$ matrix $A \in \{0,1\}^{n \times n}$ where a cell $a_{i,j} = 1$ iff $(i, j) \in E$ and 0 otherwise. We call the number of edges $d_i$ of node $i$ its degree.
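As a minimal sketch of the definitions above (the helper names `adjacency_matrix` and `degrees` and the toy edge list are illustrative assumptions, not from the lecture), the adjacency matrix and the degrees of a small undirected graph could be built like this:

```python
def adjacency_matrix(n, edges):
    """Build the n x n 0/1 adjacency matrix of an undirected, unweighted graph."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1  # undirected: the matrix is symmetric
    return A


def degrees(A):
    """The degree d_i of node i is the number of ones in row i."""
    return [sum(row) for row in A]


# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
A = adjacency_matrix(4, edges)
d = degrees(A)
```

Note that the dense matrix costs $n^2$ cells no matter how few edges there are, which is why it is only workable for small graphs.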
Why summarisation?
Visualization Guiding attention
Staring at an Adjacency Matrix
Nodes: wiki editors Edges: co-edited I don’t see anything!
Staring at a Hairball
Stars: admins, bots, heavy users Bipartite cores: edit wars
Nodes: wiki editors Edges: co-edited
Kiev vs. Kyiv vandals
Example: Wikipedia Controversy
Summary Statistics
For ‘normal’ data, we can get insight by taking an average. What kind of summary statistics do we have for graphs?
Average degree. Not very insightful.
Degree plots
laws
Cluster coefficient (global). How clustered are the nodes in the graph?
$$C = \frac{\text{number of closed triangles}}{\text{number of connected triplets of vertices}}$$
Counting triangles requires matrix multiplication, which takes $O(n^\omega)$ time where $\omega < 2.376$, but takes $O(n^2)$ space. (but fast estimators exist)
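To make the ratio concrete, here is a small sketch that counts triplets directly on adjacency sets instead of using matrix multiplication. It assumes the common "transitivity" convention in which every triangle is counted once per centre node, so three times in total; that factor and the function name `global_clustering` are assumptions of this sketch.

```python
from itertools import combinations


def global_clustering(adj):
    """Fraction of connected triplets (a - v - b) that are closed by an edge a - b."""
    closed = 0    # closed triplets; each triangle contributes three of these
    triplets = 0  # all connected triplets, centred at any node v
    for v, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            triplets += 1
            if b in adj[a]:
                closed += 1
    return closed / triplets if triplets else 0.0


# Triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
c = global_clustering(adj)
```

This brute-force version is quadratic in the degree of every node, so it only serves to illustrate the definition.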
Cluster coefficient (local). How close is the neighborhood of node $i$ to being a clique?
$$C_i = \frac{2\,\left|\{(j, k) \in E : j, k \in N_i\}\right|}{d_i (d_i - 1)}$$
which is $O(d_i^2)$ at $O(n^2)$ space
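The local coefficient can be sketched in a few lines on adjacency sets. The function name `local_clustering` and the convention of returning 0 for nodes with fewer than two neighbours are assumptions of this sketch.

```python
from itertools import combinations


def local_clustering(adj, i):
    """C_i = 2 * |edges among the neighbours of i| / (d_i * (d_i - 1))."""
    nbrs = adj[i]
    d = len(nbrs)
    if d < 2:
        return 0.0  # assumed convention: undefined cases count as 0
    links = sum(1 for j, k in combinations(nbrs, 2) if k in adj[j])
    return 2 * links / (d * (d - 1))


# Triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
c0 = local_clustering(adj, 0)
c2 = local_clustering(adj, 2)
```

Checking all pairs of neighbours is where the $O(d_i^2)$ comes from.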
Diameter. The longest shortest path between two nodes. Requires calculating all shortest paths. Calculating shortest path takes $O(n^2)$. So, no.
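A brute-force sketch shows why the diameter is expensive: it runs one breadth-first search per node. The sketch assumes a connected, unweighted graph given as adjacency sets; the function names are illustrative.

```python
from collections import deque


def eccentricity(adj, source):
    """Length of the longest shortest path starting at `source` (BFS, unweighted)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def diameter(adj):
    """Longest shortest path over all pairs; assumes the graph is connected."""
    return max(eccentricity(adj, v) for v in adj)


# Path graph 0 - 1 - 2 - 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
diam = diameter(path)
```

With millions of nodes, one full traversal per node is exactly the kind of cost the next slide warns about.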
Scalability
Many real world graphs are big, with $n$ in the order of millions. $O(n^2)$ is very scary for a graph miner. Current-day graph mining algorithms need to be linear in the number of edges,
cted. What are the implications?
Summarising a Graph
Given: a graph
Find: a succinct summary with possibly
≈ important graph structures.
Community Detection
[Figure: adjacency matrix of the assumed graph]
[Figure: adjacency matrix of the real graph]
Summarising a Graph
Fully Automatic Cross Associations is a nice MDL based algorithm to summarise a matrix.
1) REASSIGN: Given a grid, assign rows and columns s.t. entropy within the grid is minimal.
2) CROSSASSOC: Find cluster with highest entropy, split it, run REASSIGN. Stop when no split reduces the MDL score.
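The quantity that REASSIGN tries to lower can be sketched as the entropy cost of the blocks that a row and column grouping induces. This is only the data part of the score: a full MDL score would also have to pay for describing the grid itself, which this sketch ignores. The function names and the toy matrix are assumptions.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a 0/1 source with density p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def grid_cost_bits(matrix, row_groups, col_groups):
    """Sum over all blocks of (number of cells) * H(density of ones in the block)."""
    total = 0.0
    for rows in row_groups:
        for cols in col_groups:
            cells = [matrix[r][c] for r in rows for c in cols]
            if cells:
                total += len(cells) * binary_entropy(sum(cells) / len(cells))
    return total


# Two dense diagonal blocks: homogeneous groups cost nothing, mixed groups cost a lot.
M = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
good = grid_cost_bits(M, [[0, 1], [2, 3]], [[0, 1], [2, 3]])
bad = grid_cost_bits(M, [[0, 2], [1, 3]], [[0, 2], [1, 3]])
```

A grouping whose blocks are all-ones or all-zeros has zero entropy, which is why the well-aligned grid is preferred.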
(Chakrabarti et al. 2004)
Beyond Cave-men Communities
Traditional community detection algorithms assume that you interact
You are assumed not to interact with others, except if you are one
That is not very realistic.
(Kang & Faloutsos, ICDM 2011)
Slash’n’Burn
Slash’n’Burn finds the node $i$ with highest $d_i$ and removes its edges $N_i$ and recurses.
SLASHBURN: Slash top-$k$ hubs, burn edges
[Figures: the graph before and after]
(Kang & Faloutsos, ICDM 2011)
Beyond Cave-men Communities
Slash’n’Burn applied on the AS-Oregon graphs shows that real graphs indeed have structure beyond cave-men communities! – but also include those! A nice side-result is that the Slash’n’Burned ordered matrix has lots of ‘empty space’ and can hence be stored efficiently.
(Kang & Faloutsos, ICDM 2011)
VoG: Summarizing and Understanding Large Graphs
Danai Koutra Jilles Vreeken U Kang Christos Faloutsos
SDM, 25 April 2014, Philadelphia, USA
Main Idea
1) Use a graph vocabulary:
2) Best graph summary: shortest lossless description, i.e. optimal compression (MDL)
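Minimum Description Length
Given a set of models $\mathcal{M}$, the best model $M \in \mathcal{M}$ is
$$\arg\min_{M \in \mathcal{M}} \; L(M) + L(D \mid M)$$
where $L(M)$ is the # of bits for $M$, and $L(D \mid M)$ is the # of bits for the data using $M$.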
MDL example
Candidate models: $a_1 x + a_0$ and $a_{10} x^{10} + a_9 x^9 + \dots + a_0$
Cost: $L(M) + L(D \mid M)$, where $L(D \mid M)$ pays for the errors
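The two-part trade-off can be sketched with a smaller pair of models than the one on the slide: a constant versus a line, so that both fits have a closed form. Everything about the encoding here is an assumption made for illustration: 32 bits per coefficient, residuals rounded to integers, and an Elias-gamma-style code plus a sign bit for each residual.

```python
import math


def int_bits(r):
    """Assumed residual code: one sign bit plus an Elias-gamma-style length for |r| + 1."""
    return 1 + 2 * int(math.log2(abs(r) + 1)) + 1


def fit_constant(xs, ys):
    return [sum(ys) / len(ys)]


def fit_line(xs, ys):
    """Closed-form least squares; returns [intercept, slope]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return [my - slope * mx, slope]


def predict(coeffs, x):
    return sum(c * x ** p for p, c in enumerate(coeffs))


def total_bits(coeffs, xs, ys, bits_per_coeff=32):
    """L(M) + L(D | M): bits for the coefficients plus bits for the rounded errors."""
    model_bits = bits_per_coeff * len(coeffs)
    data_bits = sum(int_bits(round(y - predict(coeffs, x))) for x, y in zip(xs, ys))
    return model_bits + data_bits


xs = list(range(20))
ys = [3 * x + 5 + (1 if x % 4 == 0 else 0) for x in xs]  # a line plus small noise
costs = {
    "constant": total_bits(fit_constant(xs, ys), xs, ys),
    "line": total_bits(fit_line(xs, ys), xs, ys),
}
best = min(costs, key=costs.get)
```

The line pays more bits for its model but far fewer for its errors, so it wins on total description length; a degree-10 polynomial would pay for eleven coefficients without reducing the errors much further.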
Minimum Graph Description
Given: a graph $G$ with adjacency matrix $A$
Find: model $M$ s.t. $L(G, M) = \min\, L(M) + L(E)$
(Model $M$, adjacency $A$, error $E$)
VoG: Overview
[Figure: graph ≈? candidate structures; argmin over some criterion gives the summary]
We need candidate structures… How can we get them?
Step 1: Graph Decomposition
We can use: any decomposition method
We did use/adapt: SLASHBURN
SnB Graph Decomposition
Slash top-k hubs, burn edges. What remains are the candidate structures.
Notice that the structures can overlap!
Repeat on the remaining GCC
[Figures: the graph before and after]
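The decomposition loop can be sketched as follows. This is a simplified reading, not the authors' implementation: it assumes integer node IDs, takes each slashed hub together with its current neighbours as one candidate, takes every small disconnected component as another, and then recurses on the giant connected component (GCC). All function names are hypothetical.

```python
def connected_components(nodes, adj):
    """Connected components of the subgraph induced by `nodes`."""
    seen, comps = set(), []
    for s in nodes:
        if s in seen:
            continue
        comp, stack = set(), [s]
        seen.add(s)
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def slash_burn_candidates(adj, k=1):
    """Slash the top-k hubs, burn their edges, collect candidates, repeat on the GCC."""
    remaining = set(adj)
    candidates = []
    while remaining:
        deg = {v: sum(1 for u in adj[v] if u in remaining) for v in remaining}
        hubs = sorted(remaining, key=lambda v: (-deg[v], v))[:k]
        for h in hubs:
            ego = {h} | {u for u in adj[h] if u in remaining}
            if len(ego) > 1:
                candidates.append(ego)  # assumed candidate: hub plus its neighbours
        remaining -= set(hubs)
        comps = connected_components(remaining, adj)
        if not comps:
            break
        gcc = max(comps, key=len)
        candidates.extend(c for c in comps if c is not gcc and len(c) > 1)
        remaining = gcc  # repeat on the remaining giant connected component
    return candidates


# A star around node 0 whose spoke 4 is also attached to the triangle 5-6-7.
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0, 5},
       5: {4, 6, 7}, 6: {5, 7}, 7: {5, 6}}
cands = slash_burn_candidates(adj, k=1)
```

In this toy graph node 4 ends up in two candidates, which is the overlap the slide points out.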
SnB Graph Decomposition
We got candidate structures. Now can we ‘label’ them?
Step 2: Graph Labeling
hub? “best” node split? “best” node ordering? missing edges?
Graph Representations
DETAILS
Star structure. Hub: top-degree node. Spokes: the rest.
$$L_{\mathbb{N}}(|st| - 1) + \log n + \log \binom{n - 1}{|st| - 1} + L(E^+) + L(E^-)$$
$L_{\mathbb{N}}(|st| - 1)$: # of spokes
$\log n$: hub ID
$\log \binom{n - 1}{|st| - 1}$: spokes IDs
$L(E^+)$, $L(E^-)$: errors (extra and missing edges)
Example: $n = 7$, # of spokes is 6
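The structural part of the star cost can be evaluated numerically. The sketch assumes that $L_{\mathbb{N}}$ is Rissanen's universal code for integers, $L_{\mathbb{N}}(k) = \log^*(k) + \log c_0$ with $c_0 \approx 2.865064$, and it leaves out the error terms $L(E^+)$ and $L(E^-)$. The function names are illustrative.

```python
import math


def universal_int_bits(k):
    """L_N(k) for an integer k >= 1: log2(c0) + log2 k + log2 log2 k + ... (positive terms)."""
    if k < 1:
        raise ValueError("L_N is defined for integers >= 1")
    bits = math.log2(2.865064)
    term = math.log2(k)
    while term > 0:
        bits += term
        term = math.log2(term)
    return bits


def star_bits(n, size):
    """Bits for a star on `size` nodes in a graph of n nodes, without the error terms."""
    spokes = size - 1
    return (universal_int_bits(spokes)            # number of spokes
            + math.log2(n)                        # hub ID
            + math.log2(math.comb(n - 1, spokes)))  # spoke IDs


small = star_bits(n=7, size=7)    # the slide's example: 6 spokes among 7 nodes
large = star_bits(n=100, size=7)  # same star inside a bigger graph
```

When the star covers the whole graph, the spoke IDs cost nothing because there is only one way to choose them; in a larger graph the same star costs more bits to point at.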
Graph Representations
DETAILS
Bipartite graph structure
Max bipartite graph: NP-hard. Heuristic: Belief Propagation with heterophily for node classification (blue/red)
$$\dots + \log n + \log \binom{n - 1}{|st| - 1} + L(E^+) + L(E^-)$$
Terms: # of blue nodes, # of red nodes, their IDs, errors (extra and missing edges)
Graph Representations
Chain structure
Longest path: NP-hard. Heuristic: BFS + local search
Errors: extra and missing edges
Step 2: Graph Labeling
[Figure: each candidate ≈? a vocabulary structure; argmin picks the label]
Step 3: Summary Assembly
Summary
Concepts (DETAILS)
Savings (compression gain) = # bits as structure - # bits as noise
Step 3: Summary Assembly
Concepts
Summary encoding cost:
$$L(M) = L_{\mathbb{N}}(|M| + 1) + \log \binom{|M| + 1}{|\Omega| + 1} + \sum_{s \in M} \big( -\log \Pr(x(s) \mid M) + L(s) \big)$$
Terms: # of structures; # of structures per type; for each structure, its type, its encoding length, and its connectivity
Step 3: Summary Assembly
$L(D, M)$ structures
…
DETAILS
Quantitative Analysis
[Bar chart: bits needed and unexplained edges, in %, for Plain, Top-10, Top-100 and G&F]
4292729 bits as noise
Real graphs have structure!
(we can save bits by encoding with structures!)
Quantitative Analysis
[Bar chart, log scale: structure types (Star, Near-Bipartite, Full clique, Full Bipartite, Chain) for Plain, Top-10, Top-100 and G&F]
Main structure types: Stars, near- and full-bipartite cores.
Qualitative Analysis: Enron
Top-3 Stars
klay, kenneth.lay@enron.com
Top-1 NBC
Ski excursion
jeff.skilling@enron.com
Runtime
VoG is near-linear in the number of edges of the input graph.
Future Work
Those of you interested in a MSc or RIL project… Our current vocabulary is
But many other structures make sense, for example “jellyfish” (Tauro, 2001)
Future Work
Those of you who might be interested in a MSc or RIL project…
It would be great if we could mine summaries directly from data … without pre-mining all candidate structures
Real graphs show powerlaw-ish degree distributions, … would be great if VoG could take that into account
Conclusions
Graphs need Summaries
graphs are powerful, but difficult to interpret; far too few (efficient) summary methods available
Cross-Associations
powerful technique to find bi-clusters; heuristic, improvements exist
Slash’n’Burn
reorders nodes of a graph; finds sub-graphs ‘beyond’ cave-men communities
VoG
summarises graphs with a graph-theoretic vocabulary; first of its kind, but a big stack of heuristics; fast, good results.
Thank you!