Graph Summarisation Jilles Vreeken 10 July 2015

Service Announcement #0 The Case of The Lost Pen -- or – The Case of the Found Pen

Service Announcement #1 Next week, a guest lecture Mining Data that Changes by dr. Pauli Miettinen (MPI-INF)

Service Announcement #2 Exam. Oral. 3rd and 4th of August. Timeslots to be decided. Mail me if you want to participate, and let me know if you have a preferred time/day.

Service Announcement #3 Course outline: Introduction; Patterns; Correlation and Causation; (Subjective) Interestingness; Graphs; Wrap-up + <ask-me-anything>

Service Announcement #2 <ask-me-anything>? Yes! Prepare questions on anything* you've always wanted to ask me. Mail them to me in advance, or have me answer on the spot. (* preferably related to TADA, data mining, machine learning, science, the world, etc.) Outline: Introduction; Patterns; Correlation and Causation; (Subjective) Interestingness; Graphs; Wrap-up + <ask-us-anything>

Question of the day How can we summarise the main structure of a graph in easily understandable terms?

Graphs Graphs are everywhere ↔ Everything* can be represented as a graph * almost

Graphs, formally We consider graphs $G = (V, E)$ with $V$ the set of $n$ nodes, and $E$ a set of $m$ edges between nodes. In general, nodes can have labels, and edges can have labels and weights, and can be directed.

Real world graphs social networks road networks relational databases cellular networks biological networks

Real world graphs the internet

Graphs, formally Today we consider unlabeled, unweighted, undirected graphs. The adjacency matrix $A$ then is an $n \times n$ matrix $A \in \{0,1\}^{n \times n}$ where a cell $a_{i,j} = 1$ iff $(i, j) \in E$ and 0 otherwise. We call the number of edges $d_i$ of a node $i$ its degree.
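The definitions above can be made concrete with a small sketch. The helper names `adjacency_matrix` and `degrees` and the toy edge list are assumptions made for illustration only; they do not come from the slides.

```python
def adjacency_matrix(n, edges):
    """Build the n x n 0/1 adjacency matrix of an undirected graph."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1  # undirected: the matrix is symmetric
    return A


def degrees(A):
    """The degree d_i is the number of ones in row i."""
    return [sum(row) for row in A]


# Toy graph (hypothetical): a triangle 0-1-2 with a tail 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
A = adjacency_matrix(4, edges)
d = degrees(A)
```

Storing the full matrix costs $n^2$ cells, which is why an adjacency list is normally preferred for large graphs.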

Why summarisation? Visualization Guiding attention

Staring at an Adjacency Matrix

Staring at a Hairball I don't see anything! Nodes: wiki editors. Edges: co-edited.

Example: Wikipedia Controversy Stars: admins, bots, heavy users. Bipartite cores: edit wars (Kiev vs. Kyiv), vandals. Nodes: wiki editors. Edges: co-edited.

Summary Statistics For ‘normal’ data, we can get insight by taking an average. What kind of summary statistics do we have for graphs? Average degree. Not very insightful.

Summary Statistics What kind of summary statistics do we have for graphs? Degree plots

Power laws

Summary Statistics What kind of summary statistics do we have for graphs? Cluster coefficient (global) How clustered are the nodes in the graph? $C = \frac{\text{number of closed triangles}}{\text{number of connected triplets of vertices}}$ Counting triangles requires matrix multiplication, which takes $O(n^{\omega})$ time where $\omega < 2.376$, but takes $O(n^2)$ space (but fast estimators exist).
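As a rough illustration, the global coefficient can be computed on a small graph without any matrix multiplication by enumerating neighbour pairs. The sketch below assumes the usual transitivity convention, where a triangle is counted once per centre node, that is, as three closed triplets. That convention, the function name `global_clustering` and the toy graph are assumptions, not taken from the slides.

```python
def global_clustering(adj):
    """adj maps each node to the set of its neighbours (undirected, no self-loops)."""
    closed = 0    # triplets j-i-k whose end points j and k are also linked
    triplets = 0  # all connected triplets, counted at their centre node i
    for i, nbrs in adj.items():
        nb = sorted(nbrs)
        d = len(nb)
        triplets += d * (d - 1) // 2
        for a in range(d):
            for b in range(a + 1, d):
                if nb[b] in adj[nb[a]]:
                    closed += 1
    return closed / triplets if triplets else 0.0


# Toy graph (hypothetical): a triangle 0-1-2 with a tail 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
C = global_clustering(adj)
```

Here the triangle contributes three closed triplets out of five connected triplets in total, so `C` is 0.6. The enumeration costs on the order of the sum of $d_i^2$, which is fine for a toy graph but not for hubs in a graph with millions of nodes.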

Summary Statistics What kind of summary statistics do we have for graphs? Cluster coefficient (local) How close is the neighborhood of node $i$ to being a clique? $C_i = \frac{2\,|\{(j,k) \in E : j, k \in N_i\}|}{d_i (d_i - 1)}$, with $N_i$ the neighbours of node $i$, which is $O(d_i^2)$ at $O(n^2)$ space.
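A minimal sketch of the local coefficient follows, using neighbour sets instead of a dense matrix. Returning 0 for nodes with fewer than two neighbours is a convention assumed here, and the name `local_clustering` and the toy graph are illustrative.

```python
def local_clustering(adj, i):
    """Fraction of pairs of neighbours of i that are themselves connected."""
    nbrs = sorted(adj[i])
    d = len(nbrs)
    if d < 2:
        return 0.0  # assumed convention: undefined cases are reported as 0
    links = sum(
        1
        for a in range(d)
        for b in range(a + 1, d)
        if nbrs[b] in adj[nbrs[a]]
    )
    return 2 * links / (d * (d - 1))


# Toy graph (hypothetical): a triangle 0-1-2 with a tail 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
c0 = local_clustering(adj, 0)
c2 = local_clustering(adj, 2)
c3 = local_clustering(adj, 3)
```

Node 0 sits in a clique with its two neighbours, node 2 has only one of its three neighbour pairs connected, and node 3 has a single neighbour.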

Summary Statistics What kind of summary statistics do we have for graphs? Diameter The longest shortest path between two nodes. Requires calculating all shortest paths. Calculating shortest paths takes $O(n^2)$. So, no.
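To see where the cost comes from, here is a sketch that computes the exact diameter of a small, connected, unweighted graph by running a breadth-first search from every node. The function names and the toy graph are assumptions. With $n$ searches that each touch every edge, this is far from linear in the number of edges.

```python
from collections import deque


def eccentricity(adj, source):
    """Length of the longest shortest path starting at `source` (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def diameter(adj):
    """Exact diameter of a connected graph: one BFS per node."""
    return max(eccentricity(adj, s) for s in adj)


# Toy graphs (hypothetical): a path 0-1-2-3, and a triangle with a tail.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
tailed_triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

On a disconnected graph this sketch only reports the largest distance found within components, so it should not be read as a general-purpose implementation.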

Scalability Many real world graphs are big, with $n$ in the order of millions. $O(n^2)$ is very scary for a graph miner. Current-day graph mining algorithms need to be linear in the number of edges, or else your paper will almost surely be rejected. What are the implications?

Summarising a Graph Given: a graph. Find: a succinct summary with possibly overlapping subgraphs ≈ important graph structures.

Community Detection Assumed graph Adjacency Matrix

Community Detection Real graph Adjacency Matrix

Summarising a Graph Fully Automatic Cross Associations is a nice MDL-based algorithm to summarise a matrix. 1) REASSIGN: Given a grid, assign rows and columns s.t. entropy within the grid is minimal. 2) CROSSASSOC: Find the cluster with highest entropy, split it, run REASSIGN. Stop when no split reduces the MDL score. (Chakrabarti et al. 2004)
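The quantity that REASSIGN tries to reduce can be sketched as the entropy cost of the blocks induced by a row and column grouping. The code below is a simplified illustration only. It computes just the data cost (block size times binary entropy of the block density, in bits) and leaves out the model cost of the full MDL score. The name `block_cost` and the example matrix are assumptions.

```python
from math import log2


def block_cost(matrix, row_group, col_group):
    """Sum over all blocks of (block size) * H(block density), in bits."""
    ones, size = {}, {}
    for r, row in enumerate(matrix):
        for c, x in enumerate(row):
            key = (row_group[r], col_group[c])
            ones[key] = ones.get(key, 0) + x
            size[key] = size.get(key, 0) + 1
    total = 0.0
    for key, n in size.items():
        p = ones[key] / n
        if 0 < p < 1:  # all-zero or all-one blocks cost nothing
            total -= n * (p * log2(p) + (1 - p) * log2(1 - p))
    return total


# Hypothetical 4 x 4 binary matrix with two dense diagonal blocks.
M = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
good = block_cost(M, [0, 0, 1, 1], [0, 0, 1, 1])  # homogeneous blocks
bad = block_cost(M, [0, 1, 0, 1], [0, 1, 0, 1])   # every block is half ones
```

The grouping that follows the true blocks costs 0 bits, while the interleaved grouping makes each of its four blocks half ones and half zeros, costing 4 bits per block.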

Beyond Cave-men Communities Traditional community detection algorithms assume that you interact only with people in your ‘cave’. You are assumed not to interact with others, except if you are one of few ‘messengers’ between ‘caves’. That is not very realistic. (Kang & Faloutsos, ICDM 2011)

Slash’n’Burn Slash’n’Burn finds the node $i$ with highest $d_i$, removes its edges $N_i$, and recurses. SLASHBURN: 1. Slash top-$k$ hubs, burn edges. 2. Repeat on the remaining GCC. [Figures: before and after] (Kang & Faloutsos, ICDM 2011)
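The two steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the published algorithm: hubs are chosen by current degree with ties broken by node id, the loop runs until no nodes remain, and the non-giant components are simply collected as "spokes". The names `components` and `slashburn` and the toy graph are hypothetical.

```python
def components(adj, nodes):
    """Connected components of the subgraph induced by the set `nodes`."""
    seen, comps = set(), []
    for s in sorted(nodes):
        if s in seen:
            continue
        comp, stack = set(), [s]
        seen.add(s)
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def slashburn(adj, k=1):
    """Return the hubs in removal order and the small components cut loose."""
    alive = set(adj)
    hubs, spokes = [], []
    while alive:
        # Slash: remove the k highest-degree nodes of the remaining graph.
        deg = {u: sum(1 for v in adj[u] if v in alive) for u in alive}
        top = sorted(alive, key=lambda u: (-deg[u], u))[:k]
        hubs.extend(top)
        alive -= set(top)
        if not alive:
            break
        # Burn: with the hub edges gone, the rest falls into components.
        comps = sorted(components(adj, alive), key=len, reverse=True)
        spokes.extend(comps[1:])
        alive = comps[0]  # repeat on the giant connected component
    return hubs, spokes


# Toy graph (hypothetical): hub 0 linked to nodes 1..5, plus the edge 4-5.
adj = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0, 5}, 5: {0, 4}}
hubs, spokes = slashburn(adj, k=1)
```

On the toy graph the first slash removes node 0, which cuts nodes 1, 2 and 3 loose as singleton spokes and leaves the pair 4-5 as the giant component for the next round.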

Beyond Cave-men Communities Slash’n’Burn applied on the AS-Oregon graphs shows that real graphs indeed have structure beyond cave-men communities! – but also include those! A nice side-result is that the Slash’n’Burned ordered matrix has lots of ‘empty space’ and can hence be stored efficiently. (Kang & Faloutsos, ICDM 2011)

VoG: Summarizing and Understanding Large Graphs Danai Koutra, U Kang, Jilles Vreeken, Christos Faloutsos (Carnegie Mellon University; Korea Advanced Institute of Science and Technology) SDM, 25 April 2014, Philadelphia, USA

Main Idea 1) Use a graph vocabulary. 2) Best graph summary: the shortest lossless description, i.e. optimal compression (MDL).

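Minimum Description Length Given a set of models $\mathcal{M}$, the best model $M \in \mathcal{M}$ is $\arg\min_{M \in \mathcal{M}} L(M) + L(D \mid M)$, with $L(M)$ the number of bits for $M$ and $L(D \mid M)$ the number of bits for the data using $M$.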

MDL example $L(M) + L(D \mid M)$: compare the models $a_1 x + a_0$ and $a_{10} x^{10} + a_9 x^9 + \dots + a_0$, with the errors encoded in $L(D \mid M)$.

Minimum Graph Description Given: a graph $G$ with adjacency matrix $A$, and a vocabulary $\Omega$. Find: model $M$ s.t. $\min L(G, M) = \min L(M) + L(E)$. [Figure: adjacency $A$, model $M$, error $E$]
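The trade-off between $L(M)$ and $L(E)$ can be illustrated with a toy encoding. Everything about the encoding below is an assumption made for this sketch and is not the encoding used by VoG: the vocabulary contains only cliques, each clique costs $\log_2 n$ bits per member node, and each erroneous cell (an edge the model claims but the graph lacks, or the reverse) costs $2 \log_2 n$ bits to name its end points.

```python
from itertools import combinations
from math import log2


def edges_of_model(cliques):
    """All edges implied by a model that consists of cliques."""
    covered = set()
    for nodes in cliques:
        covered.update(combinations(sorted(nodes), 2))
    return covered


def total_cost(n, edges, cliques):
    """Toy L(G, M) = L(M) + L(E) in bits, under the assumed encoding."""
    edges = {tuple(sorted(e)) for e in edges}
    node_bits = log2(n)
    model_bits = sum(len(c) * node_bits for c in cliques)  # L(M)
    error = edges ^ edges_of_model(cliques)                # E = A xor M
    error_bits = len(error) * 2 * node_bits                # L(E)
    return model_bits + error_bits


# Toy graph (hypothetical): 8 nodes, a 5-clique on 0..4 plus the edge 5-6.
n = 8
edges = list(combinations(range(5), 2)) + [(5, 6)]
cost_empty = total_cost(n, edges, [])
cost_clique = total_cost(n, edges, [{0, 1, 2, 3, 4}])
```

With the empty model all 11 edges are errors, costing 66 bits. Naming the clique costs 15 bits and leaves a single error of 6 bits, 21 bits in total, so under this toy encoding the clique model is the better summary.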

VoG: Overview [Figure: candidate structures are matched to vocabulary terms (argmin), ordered by some criterion, and combined into a summary]

We need candidate structures … … How can we get them?

Step 1: Graph Decomposition We can use: any decomposition method. We did use/adapt: SLASHBURN.

SnB Graph Decomposition Slash top-k hubs, burn edges. This gives candidate structures. Notice that the structures can overlap! Repeat on the remaining GCC. [Figures: before and after]

We got candidate structures. Now, how can we ‘label’ them?
