IRDM ‘15/16
Jilles Vreeken
Chapter 8: Graph Mining
1 Dec 2015. Revision 1, December 4th: typos fixed (edge order)
Chapter 8, overview
1. The basics
2. Properties of Graphs
3. Frequent Subgraphs
4. Community Detection
5. Graph Clustering
You’ll find this covered in: Aggarwal, Ch. 17, 19; Zaki & Meira, Ch. 4, 11, 16
Aggarwal Ch. 17.1
Figures: Human Disease Network [Barabasi 2007]; Gene Regulatory Network [Decourty 2008]; Facebook Network [2010]; The Internet [2005]
Skewed Degrees Robustness
(Bearman et al., Am. Jnl. of Sociology, 2004. Image: Mark Newman)
Blue: Male Pink: Female Interesting
How many of you think that your friends have more friends than you?
A recent Facebook study
examined all of FB’s users: 721 million people with 69 billion friendships
about 10% of the world’s population
found that 93 percent of the time a user’s friend count was less than the average friend count of his or her friends
users had an average of 190 friends, while their friends averaged 635 friends of their own
You are a loner? Your friends are extraverts? There are more extraverts than introverts in the world?
Average number of friends?
$(1 + 3 + 2 + 2)/4 = 2$
(Strogatz, NYT 2012)
Average number of friends of friends?
$(3 + 1 + 2 + 2 + 3 + 2 + 3 + 2)/8 = (1 \times 1 + 3 \times 3 + 2 \times 2 + 2 \times 2)/8 = 18/8 = 2.25$
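Proof?
$E[X] = \sum_i x_i / N$
$\mathrm{Var}[X] = E[(X - E[X])^2] = E[X^2] - E[X]^2$
A person with $x_i$ friends is counted $x_i$ times as somebody's friend, so the average number of friends of friends is $\sum_i x_i^2 / \sum_i x_i = E[X^2] / E[X]$
$E[X^2] / E[X] = E[X] + \mathrm{Var}[X] / E[X]$
Essentially, it’s true if there is any spread in the number of friends (i.e. whenever there’s a non-zero variance).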
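A small sketch of the two averages, on a hypothetical four-person graph chosen to match the degrees 1, 3, 2, 2 used above. The names `friends`, `mean_friends` and `mean_friends_of_friends` are illustrative only.

```python
from statistics import mean, pvariance

# Hypothetical friendship graph with degrees 1, 3, 2, 2 (an assumption for illustration).
friends = {
    "a": ["b"],
    "b": ["a", "c", "d"],
    "c": ["b", "d"],
    "d": ["b", "c"],
}

def mean_friends(graph):
    """Average number of friends per person, E[X]."""
    return mean(len(nbrs) for nbrs in graph.values())

def mean_friends_of_friends(graph):
    """Average friend count taken over every (person, friend) pair, E[X^2] / E[X]."""
    return mean(len(graph[f]) for nbrs in graph.values() for f in nbrs)

degrees = [len(nbrs) for nbrs in friends.values()]
# Right-hand side of the identity: E[X] + Var[X] / E[X]
identity = mean(degrees) + pvariance(degrees) / mean(degrees)

print(mean_friends(friends), mean_friends_of_friends(friends), identity)
```

Both the direct count and the variance identity give the same friends-of-friends average for this graph.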
Many real-world data sets are in the form of graphs
social networks, hyperlinks, protein–protein interaction, XML parse trees, …
Many of these graphs are enormous
humans cannot understand them → a task for data mining!
A graph $G$ is a pair $(V, E \subseteq V^2)$
elements in $V$ are vertices or nodes of the graph
pairs $(v, u) \in E$ are edges or arcs of the graph
for undirected graphs pairs are unordered, for directed graphs pairs are ordered
The graphs can be labelled
vertices can have labeling $L(v)$
edges can have labeling $L(v, u)$
A tree is a rooted, connected, and acyclic graph
Graphs can be represented using adjacency matrices
$|V| \times |V|$ matrix $A$ with $A_{ij} = 1$ if $(v_i, v_j) \in E$
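As a minimal sketch of the adjacency-matrix representation, assuming vertices are numbered 0 to n−1 and edges are given as index pairs (the function name and the example edge list are made up for illustration):

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n 0/1 matrix A with A[i][j] = 1 iff (v_i, v_j) is an edge."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        if not directed:
            A[j][i] = 1  # unordered pair: store both directions
    return A

# Hypothetical undirected example graph
edges = [(0, 1), (1, 2), (1, 3), (2, 3)]
A = adjacency_matrix(4, edges)
degrees = [sum(row) for row in A]  # row sums give the vertex degrees
print(A, degrees)
```

For undirected graphs the matrix is symmetric; for directed graphs only the ordered pair is stored.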
The distance $d(v_i, v_j)$ between two vertices is the (weighted) length of the shortest path between them
The eccentricity of a vertex $v_i$, $e(v_i)$, is its maximum distance to any other vertex, $\max_j \{d(v_i, v_j)\}$
The radius of a connected graph, $r(G)$, is the minimum eccentricity of any vertex, $\min_i \{e(v_i)\}$
The diameter of a connected graph, $d(G)$, is the maximum eccentricity of any vertex, $\max_i \{e(v_i)\} = \max_{i,j} \{d(v_i, v_j)\}$
the effective diameter of a graph is the smallest number that is larger than the eccentricity of a large fraction of the vertices in the graph
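A sketch of these quantities for the unweighted case, assuming a connected, undirected graph stored as an adjacency-list dictionary. Breadth-first search stands in for the shortest-path computation; the example graph and function names are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances d(source, v) for every vertex reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def eccentricities(adj):
    """e(v) = maximum distance from v to any other vertex."""
    return {v: max(bfs_distances(adj, v).values()) for v in adj}

# Hypothetical connected example graph
adj = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
ecc = eccentricities(adj)
radius = min(ecc.values())    # minimum eccentricity
diameter = max(ecc.values())  # maximum eccentricity
print(ecc, radius, diameter)
```

For weighted graphs the BFS would have to be replaced by, for example, Dijkstra's algorithm.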
The clustering coefficient of vertex $v_i$, $C(v_i)$, tells how clique-like the neighbourhood of $v_i$ is
let $n_i$ be the number of neighbours of $v_i$ and $m_i$ the number of edges between the neighbours of $v_i$, excluding $v_i$ itself
$C(v_i) = m_i / \binom{n_i}{2} = 2 m_i / (n_i (n_i - 1))$
well-defined only for $v_i$ with at least two neighbours
for others, let $C(v_i) = 0$
The clustering coefficient of the graph is the average clustering coefficient of the vertices: $C(G) = n^{-1} \sum_i C(v_i)$
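The definition translates almost literally into code. A minimal sketch, assuming an undirected graph stored as a dictionary of neighbour sets (names and example graph are illustrative):

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """C(v) = 2 m / (n (n - 1)), with n neighbours and m edges among them."""
    nbrs = adj[v]
    n = len(nbrs)
    if n < 2:
        return 0.0  # convention for vertices with fewer than two neighbours
    m = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * m / (n * (n - 1))

def graph_clustering_coefficient(adj):
    """C(G): average of C(v) over all vertices."""
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)

# Hypothetical example: a triangle 1-2-3 with a pendant vertex 0 attached to 1
adj = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2}}
print({v: clustering_coefficient(adj, v) for v in adj})
print(graph_clustering_coefficient(adj))
```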
There is a lot of interesting data one can mine from graphs and sets of graphs
cliques of friends from social networks
hubs and authorities from link graphs
who is the centre of Hollywood
subgraphs that appear frequently in (a set of) graph(s)
areas with higher intra-connectivity than inter-connectivity
…
Graph mining is perhaps the most popular topic in contemporary data mining research
though not necessarily called as such…
Aggarwal Ch. 17.1, 19.2; Zaki & Meira Ch 4
Six degrees of Kevin Bacon
“Every actor is related to Kevin Bacon by no more than 6 hops”
Kevin Bacon has acted with many, who have acted with many others
this makes Kevin Bacon a centre of the co-acting graph
Kevin, however, is not the centre:
the average distance to him is 2.998 but to Harvey Keitel it is only 2.848
(http://oracleofbacon.org)
Centrality is a function $c : V \to \mathbb{R}$ inducing a total order in $V$
the higher the centrality of a vertex, the more important it is
In degree centrality $c(v_i) = d(v_i)$, the degree of the vertex
In eccentricity centrality the least eccentric vertex is the most central one, $c(v_i) = 1 / e(v_i)$
the least eccentric vertex is central
the most eccentric vertex is peripheral
In closeness centrality the vertex with least distance to all other vertices is the centre, $c(v_i) = \left( \sum_j d(v_i, v_j) \right)^{-1}$
In eccentricity centrality we aim to minimize the maximum distance
In closeness centrality we aim to minimize the average distance
this is the distance used to measure the centre of Hollywood
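Degree, eccentricity and closeness centrality can be sketched together. This assumes a connected, unweighted, undirected graph with at least two vertices; the helper names and the path graph are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def centralities(adj):
    """Degree, eccentricity (1 / max distance) and closeness (1 / sum of distances)."""
    result = {}
    for v in adj:
        others = [d for u, d in bfs_distances(adj, v).items() if u != v]
        result[v] = {
            "degree": len(adj[v]),
            "eccentricity": 1 / max(others),
            "closeness": 1 / sum(others),
        }
    return result

# Hypothetical path graph 0 - 1 - 2: the middle vertex is the most central in all three senses
adj = {0: [1], 1: [0, 2], 2: [1]}
c = centralities(adj)
print(c)
```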
Betweenness centrality measures the number of shortest paths that travel through $v_i$
measures the “monitoring” role of the vertex
“all roads lead to Rome”
Let $\eta_{jk}$ be the number of shortest paths between $v_j$ and $v_k$ and let $\eta_{jk}(v_i)$ be the number of those that include $v_i$
let $\gamma_{jk}(v_i) = \eta_{jk}(v_i) / \eta_{jk}$
betweenness centrality is defined as $c(v_i) = \sum_{j \neq i} \sum_{k \neq i,\, k > j} \gamma_{jk}(v_i)$
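A direct, unoptimised sketch of this definition, assuming a connected, unweighted, undirected graph. It counts shortest paths with BFS and uses the fact that the number of shortest j–k paths through i is the product of the path counts j–i and i–k whenever i lies on a shortest path. Function names are illustrative; faster methods such as Brandes' algorithm exist and are not shown.

```python
from collections import deque
from itertools import combinations

def shortest_path_counts(adj, source):
    """BFS from source: distance and number of shortest paths to every vertex."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                count[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                count[w] += count[u]  # every shortest path to u extends to w
    return dist, count

def betweenness(adj):
    """c(v_i) = sum over unordered pairs {j, k}, both different from i, of gamma_jk(v_i)."""
    info = {v: shortest_path_counts(adj, v) for v in adj}
    c = {v: 0.0 for v in adj}
    for j, k in combinations(adj, 2):
        dist_j, count_j = info[j]
        for i in adj:
            if i in (j, k):
                continue
            dist_i, count_i = info[i]
            if dist_j[i] + dist_i[k] == dist_j[k]:  # i lies on a shortest j-k path
                c[i] += count_j[i] * count_i[k] / count_j[k]
    return c

path = {0: [1], 1: [0, 2], 2: [1]}
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(betweenness(path), betweenness(cycle))
```

On the 4-cycle every opposite pair has two shortest paths, so each intermediate vertex receives a share of one half.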
In prestige, the vertex is more central if it has many incoming edges from other vertices of high prestige
$A$ is the adjacency matrix of the directed graph $G$
$p$ is the $n$-dimensional vector giving the prestige of the vertices
$p = A^T p$
starting from an initial prestige vector $p_0$, we get
$p_k = A^T p_{k-1} = A^T (A^T p_{k-2}) = (A^T)^2 p_{k-2} = (A^T)^3 p_{k-3} = \cdots = (A^T)^k p_0$
Vector $p$ converges to the dominant eigenvector of $A^T$
under some assumptions
(PageRank is based on (normalized) prestige)
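The iteration $p_k = A^T p_{k-1}$ can be sketched as plain power iteration. Rescaling by the largest entry in every step is an added assumption to keep the numbers bounded; it does not change the direction of the vector. The example graph (strongly connected and aperiodic, so the iteration converges) and the function name are hypothetical.

```python
def prestige(A, iterations=100):
    """Power iteration p <- A^T p, rescaled so that the largest entry is 1."""
    n = len(A)
    p = [1.0] * n
    for _ in range(iterations):
        # (A^T p)[j]: vertex j collects the prestige of all vertices i with an edge i -> j
        new = [sum(A[i][j] * p[i] for i in range(n)) for j in range(n)]
        norm = max(new)
        if norm == 0:  # no edges carry prestige any more (e.g. acyclic graph)
            break
        p = [x / norm for x in new]
    return p

# Hypothetical directed graph with edges 0->1, 0->2, 1->2, 2->0
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
p = prestige(A)
print(p)
```

Vertex 2 has two incoming edges and ends up with the highest prestige; vertex 0 inherits prestige from vertex 2 and ranks above vertex 1.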
Several real-world graphs exhibit certain characteristics
studying what these are and explaining why they appear is an important area of network research
As data miners, we need to understand the consequences of these characteristics
finding a result that can be explained merely by one …
We also want to model graphs with these characteristics
A graph $G$ is said to exhibit a small-world property if its average path length scales logarithmically, $\mu_L \propto \log n$
six degrees of Kevin Bacon is based on this property
similarly so for Erdős numbers
how far a mathematician is from Hungarian combinatorist Paul Erdős
radius of a large, connected mathematical co-authorship network (268K authors) is 12 and diameter 23
The degree distribution of a graph is the distribution of its vertex degrees
how many vertices with degree 1, how many with degree 2, etc.
$f(k)$ is the number of vertices with degree $k$
A graph $G$ is said to exhibit a scale-free property if $f(k) \propto k^{-\gamma}$
follows a so-called power-law distribution
majority of vertices have low degree, few with very high degree
scale-free: $f(ck) = \alpha (ck)^{-\gamma} = (\alpha c^{-\gamma}) k^{-\gamma} \propto k^{-\gamma}$
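A sketch of computing $f(k)$ and of a rough exponent estimate. The estimate fits a straight line to $\log f(k)$ against $\log k$ by least squares, which is a crude method chosen here only for illustration; the function names and the synthetic data are assumptions.

```python
import math
from collections import Counter

def degree_distribution(adj):
    """f(k): number of vertices with degree k."""
    return Counter(len(nbrs) for nbrs in adj.values())

def fit_power_law_exponent(f):
    """Least-squares slope of log f(k) against log k; returns the estimate of gamma."""
    points = [(math.log(k), math.log(c)) for k, c in f.items() if k > 0 and c > 0]
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x, _ in points))
    return -slope

# Hypothetical star graph: one hub of degree 3 and three leaves of degree 1
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
f_star = degree_distribution(star)

# Synthetic, exactly scale-free counts f(k) = 1000 * k^(-2.5)
f_synthetic = {k: 1000.0 * k ** -2.5 for k in range(1, 50)}
gamma = fit_power_law_exponent(f_synthetic)
print(f_star, gamma)
```

On real data the tail is noisy, so a log-log line fit should be read as a rough indication only.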
Figure (Broder et al., ’00): in-degree $s = 2.09$; out-degree $s = 2.72$
A graph exhibits the clustering effect if the distribution of average clustering coefficient (per degree) follows a power law
if $C(k)$ is the average clustering coefficient of all vertices of degree $k$, then $C(k) \propto k^{-\gamma}$
The vertices with small degrees are part of highly clustered areas (high clustering coefficient) while “hub vertices” have smaller clustering coefficients
Aggarwal Ch. 17.2, 17.4; Zaki & Meira Ch 11
Graph $(V', E')$ is a subgraph of graph $(V, E)$ iff
$V' \subseteq V$
$E' \subseteq E$
Note that subgraphs don’t have to be connected
today we consider only connected subgraphs
Checking whether a graph is a subgraph of another is trivial
but, in most real-world applications there are no direct subgraphs
two graphs might be similar even if their vertex sets are disjoint
Graphs $G = (V, E)$ and $G' = (V', E')$ are isomorphic if there exists a bijective function $\varphi : V \to V'$ such that
$(u, v) \in E$ if and only if $(\varphi(u), \varphi(v)) \in E'$
$L(v) = L(\varphi(v))$ for all $v \in V$
$L(u, v) = L(\varphi(u), \varphi(v))$ for all $(u, v) \in E$
Graph $G'$ is subgraph isomorphic to $G$ if there exists a subgraph of $G$ which is isomorphic to $G'$
No polynomial-time algorithm is known for determining if $G$ and $G'$ are isomorphic
Determining if $G'$ is subgraph isomorphic to $G$ is NP-hard
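To make the definition concrete, here is a brute-force check that simply tries every bijection $\varphi$. It is exponential and only meant for tiny graphs. As a simplification it assumes undirected graphs without self-loops and with vertex labels only (edge labels are left out); all names and example graphs are hypothetical.

```python
from itertools import permutations

def are_isomorphic(labels1, edges1, labels2, edges2):
    """Try every bijection phi: V1 -> V2; True if one preserves labels and edges."""
    E1 = {frozenset(e) for e in edges1}
    E2 = {frozenset(e) for e in edges2}
    if len(labels1) != len(labels2) or len(E1) != len(E2):
        return False
    V1, V2 = list(labels1), list(labels2)
    for image in permutations(V2):
        phi = dict(zip(V1, image))
        if any(labels1[v] != labels2[phi[v]] for v in V1):
            continue  # vertex labels must be preserved
        # equal edge counts + injective phi: mapping every edge into E2 gives "iff"
        if all(frozenset((phi[u], phi[v])) in E2 for u, v in E1):
            return True
    return False

# Triangle with a pendant vertex labelled "b", written with two different vertex names
labels_g = {1: "a", 2: "a", 3: "a", 4: "b"}
edges_g = [(1, 2), (2, 3), (3, 1), (2, 4)]
labels_h = {"w": "b", "x": "a", "y": "a", "z": "a"}
edges_h = [("x", "y"), ("y", "z"), ("z", "x"), ("w", "z")]
# Same structure, but label "b" sits on a triangle vertex instead of the pendant one
labels_k = {"w": "a", "x": "b", "y": "a", "z": "a"}

print(are_isomorphic(labels_g, edges_g, labels_h, edges_h))
print(are_isomorphic(labels_g, edges_g, labels_k, edges_h))
```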
Isomorphism defines an equivalence relation
$id : V \to V$, $id(v) = v$ shows $G$ is isomorphic to itself
if $G$ is isomorphic to $G'$ via $\varphi$, then $G'$ is isomorphic to $G$ via $\varphi^{-1}$
if $G$ is isomorphic to $H$ via $\varphi$, and $H$ to $I$ via $\psi$, then $G$ is isomorphic to $I$ via $\psi \circ \varphi$
A canonization of a graph $G$, $canon(G)$, produces another graph $C$ such that if $H$ is a graph that is isomorphic to $G$, $canon(G) = canon(H)$
two graphs are isomorphic if and only if their canonical versions are the same
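One possible canonization, sketched by brute force: take the lexicographically smallest encoding over all vertex orderings. This is not the DFS-code canonization used by gSpan, only an illustration of the idea, and it is exponential in the number of vertices. It assumes undirected graphs with comparable vertex labels and no edge labels; names and graphs are hypothetical.

```python
from itertools import permutations

def canon(labels, edges):
    """Smallest (label sequence, upper-triangle adjacency bits) over all vertex orders."""
    E = {frozenset(e) for e in edges}
    best = None
    for order in permutations(labels):
        n = len(order)
        code = (
            tuple(labels[v] for v in order),
            tuple(int(frozenset((order[i], order[j])) in E)
                  for i in range(n) for j in range(i + 1, n)),
        )
        if best is None or code < best:
            best = code
    return best

# Two namings of the same labelled graph, and one graph with the labels placed differently
labels_g = {1: "a", 2: "a", 3: "a", 4: "b"}
edges_g = [(1, 2), (2, 3), (3, 1), (2, 4)]
labels_h = {"w": "b", "x": "a", "y": "a", "z": "a"}
edges_h = [("x", "y"), ("y", "z"), ("z", "x"), ("w", "z")]
labels_k = {"w": "a", "x": "b", "y": "a", "z": "a"}

print(canon(labels_g, edges_g) == canon(labels_h, edges_h))
print(canon(labels_g, edges_g) == canon(labels_k, edges_h))
```

Isomorphic graphs generate the same set of encodings and hence the same minimum, so comparing canonical forms replaces a pairwise isomorphism test.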
Given a set $\mathbf{D}$ of $n$ graphs and a minimum support $\sigma$, find all connected graphs that are subgraph isomorphic to at least $\sigma$ graphs in $\mathbf{D}$
enormously complex problem
For graphs that have $m$ vertices there are $2^{O(m^2)}$ subgraphs (not all are connected)
If we have $s$ labels for vertices and edges we have $O((2s)^{O(m^2)})$ labelings of the different graphs
Counting support means solving multiple NP-hard problems
Like for itemsets, the subgraph definition of support is anti-monotone: a graph can only be frequent if all of its subgraphs are frequent
we can employ level-wise search!
We can modify
APRIORI to get to AGM (Inokuchi, Washio, Motoda, 2000)
ECLAT to get FFSM (Huan, Wang, Prins, 2003)
FP-GROWTH to get GSPAN (Yan & Han, 2002)
Algorithm GRAPHAPRIORI(graph db $\mathbf{D}$, minsup $\sigma$)
begin
  $k \leftarrow 1$; $\mathcal{F}_k \leftarrow$ {all frequent singleton graphs}
  while $\mathcal{F}_k$ is not empty do
    Generate $\mathcal{C}_{k+1}$ by joining pairs of graphs in $\mathcal{F}_k$ that have in common a subgraph of size $(k - 1)$
    Prune subgraphs from $\mathcal{C}_{k+1}$ that violate downward closure
    Determine $\mathcal{F}_{k+1}$ by support counting on $(\mathcal{C}_{k+1}, \mathbf{D})$ and retaining subgraphs from $\mathcal{C}_{k+1}$ with support at least $\sigma$
    $k \leftarrow k + 1$
  end
  return $\bigcup_{i=1}^{k} \mathcal{F}_i$
end
(Inokuchi et al. 2000; Kuramochi & Karypis 2001)
Figure: candidate generation using an edge-based join, and candidate generation using a node-based join; in each case two graphs in $\mathcal{F}_k$ are joined (+) into candidates in $\mathcal{C}_{k+1}$
We can improve the running time of frequent subgraph mining by either
speeding up the computation of support
lots of efforts in faster isomorphism checking but only little progress
creating fewer candidates that we need to check
level-wise algorithms generate huge numbers of candidates, all of which must be checked for isomorphism with others
The gSPAN algorithm is the frequent subgraph mining equivalent of FP-growth; it uses a depth-first approach
(Zaki & Meira chapter 11; Yan & Han 2002)
A depth-first spanning (DFS) tree of a graph $G$
is a connected tree
contains all the vertices of $G$
is built in depth-first order
selection between the siblings is e.g. based on the vertex index
Edges of the DFS tree are forward edges
Edges not in the DFS tree are backward edges
A rightmost path in the DFS tree is the path that travels from the root to the rightmost vertex by always taking the rightmost child (last added)
VIII-1: 42
IRDM ‘15/16
VIII-1: 43
𝑤1 𝑤4 𝑤6 𝑤2 𝑤3 𝑤5 𝑤7 𝑤8
IRDM ‘15/16
VIII-1: 44
𝑤1 𝑤4 𝑤6 𝑤2 𝑤3 𝑤5 𝑤7 𝑤8 The right most path
IRDM ‘15/16
Given graph $G$, we extend it only from the vertices in the rightmost path
we can add a backward edge from the rightmost vertex to some other vertex in the rightmost path
we can add a forward edge from any vertex in the rightmost path
this increases the number of vertices by 1
The order of generating the candidates is
first backward extensions
first to root, then to root’s child, …
then forward extensions
first from the leaf, then from leaf’s father, …
A DFS code is a sequence of tuples of type $\langle v_i, v_j, L(v_i), L(v_j), L(v_i, v_j) \rangle$
tuples are given in DFS order
backward edges are listed before forward edges
vertices are numbered in DFS order
A DFS code is canonical if it is the smallest of the codes in the ordering
$\langle v_i, v_j, L(v_i), L(v_j), L(v_i, v_j) \rangle < \langle v_x, v_y, L(v_x), L(v_y), L(v_x, v_y) \rangle$ if
$\langle v_i, v_j \rangle <_e \langle v_x, v_y \rangle$; or
$\langle v_i, v_j \rangle = \langle v_x, v_y \rangle$ and $\langle L(v_i), L(v_j), L(v_i, v_j) \rangle <_l \langle L(v_x), L(v_y), L(v_x, v_y) \rangle$
the ordering of the label tuples is lexicographical
Let $e_{ij} = (v_i, v_j)$ and $e_{xy} = (v_x, v_y)$
$e_{ij} <_e e_{xy}$ if
if $e_{ij}$ and $e_{xy}$ are both forward edges, then
$j < y$; or $j = y$ and $i > x$
if $e_{ij}$ and $e_{xy}$ are both backward edges, then
$i < x$; or $i = x$ and $j < y$
if $e_{ij}$ is forward and $e_{xy}$ is backward, then $j \leq x$
if $e_{ij}$ is backward and $e_{xy}$ is forward, then $i < y$
(typo fixed, edge order now in sync with Zaki & Meira)
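The four cases of the edge order can be written as one comparison function. This sketch assumes edges are given as pairs of DFS vertex numbers $(i, j)$, so that an edge is forward exactly when $i < j$; the function names are illustrative.

```python
def is_forward(edge):
    """Forward edges lead to a newly discovered vertex, i.e. one with a higher DFS number."""
    i, j = edge
    return i < j

def edge_less(e1, e2):
    """True if e1 = (i, j) precedes e2 = (x, y) in the DFS edge order <_e."""
    (i, j), (x, y) = e1, e2
    f1, f2 = is_forward(e1), is_forward(e2)
    if f1 and f2:            # both forward
        return j < y or (j == y and i > x)
    if not f1 and not f2:    # both backward
        return i < x or (i == x and j < y)
    if f1 and not f2:        # forward vs backward
        return j <= x
    return i < y             # backward vs forward

# Edge sequence (v1,v2), (v2,v3), (v3,v1), (v2,v4): each edge precedes the next one
code = [(1, 2), (2, 3), (3, 1), (2, 4)]
print(all(edge_less(a, b) for a, b in zip(code, code[1:])))
# Forward edges into the same vertex v4: the one starting deeper comes first
print(edge_less((2, 4), (1, 4)), edge_less((1, 4), (2, 4)))
```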
Example: DFS codes for graphs $G_1$, $G_2$, $G_3$ on vertices $v_1, \dots, v_4$ (vertex labels a and b, edge labels q and r)
$G_1$: $t_{11} = \langle v_1, v_2, a, a, q \rangle$, $t_{12} = \langle v_2, v_3, a, a, r \rangle$, $t_{13} = \langle v_3, v_1, a, a, r \rangle$, $t_{14} = \langle v_2, v_4, a, b, r \rangle$
$G_2$: $t_{21} = \langle v_1, v_2, a, a, q \rangle$, $t_{22} = \langle v_2, v_3, a, b, r \rangle$, $t_{23} = \langle v_2, v_4, a, a, r \rangle$, $t_{24} = \langle v_4, v_1, a, a, r \rangle$
$G_3$: $t_{31} = \langle v_1, v_2, a, a, q \rangle$, $t_{32} = \langle v_2, v_3, a, a, r \rangle$, $t_{33} = \langle v_3, v_1, a, a, r \rangle$, $t_{34} = \langle v_1, v_4, a, b, r \rangle$
First rows are identical
In second row, $G_2$ is bigger in label order
Last rows of $G_1$ and $G_3$ are forward edges, and 4 = 4 but 2 > 1 → $G_1$ is smallest
The general idea
use the DFS codes to create candidates
extend only canonical and frequent candidates
There can be very, very many extensions
we need to see them all, and all of their isomorphisms, to count their supports
(Yan …)
The candidates are built in a DFS code tree
a DFS code $a$ is an ancestor of DFS code $b$ if $a$ is a proper prefix of $b$
the siblings in the tree follow the DFS code order
A graph can be frequent only if all of the graphs representing its ancestors in the DFS code tree are frequent
The DFS code tree contains all the canonical codes for all subgraphs of the graphs in the data
but not all vertices in the code tree correspond to canonical codes
We (implicitly) traverse this tree
GSPAN(graph $G$, minsup $\sigma$)
  for each frequent 1-edge graph do
    call SUBGRM to grow all nodes in the tree rooted in this edge-graph
    remove this edge from the graph
SUBGRM(frequent subgraph $X$, minsup $\sigma$)
  if the code is not canonical then return
  add $X$ to the set of frequent graphs
  create all super-graphs $Y \supset X$, extending $X$ with one more edge
  compute frequencies of all $Y$’s
  call SUBGRM for canonical representation of all frequent $Y$’s
gSPAN merges extension generation and support computation
For each graph in the database
gSPAN computes all isomorphisms of the current candidate
can mean solving NP-complete problems…
for all isomorphisms, computes all backward and forward extensions
these extensions are stored together with the graph they appear in
The support of each extension is the number of distinct graphs we’ve stored it with
Given a DFS code of an extension, we need to check if the code is canonical
This can be done by re-creating the code
at every step, choose the smallest of the rightmost path extensions in the graph corresponding to the extension
If at any step we get a code that is smaller than the corresponding prefix of the extension’s code, the extension’s code is not canonical
if after $k$ steps we arrive at the extension’s code, it is canonical
Much of the complexity of subgraph mining lies in (checking for) isomorphisms
For some types of graphs isomorphism is easy
different types of trees
ordered and unordered
rooted and unrooted
graphs where every node has a distinct label
Graphs are everywhere
many interesting problems
real graphs often exhibit power-law-like behaviour
Graphs generalise many data settings
makes it possible to create general algorithms
Many problems in graphs are very difficult
subgraph isomorphism
Frequent subgraph mining
involves multiple NP-hard problems