

SLIDE 1

IRDM ‘15/16

Jilles Vreeken

Chapter 8: Graph Mining

1 Dec 2015. Revision 1, December 4th: typos fixed (edge order)

SLIDE 2

IRDM Chapter 8, overview

1. The basics
2. Properties of Graphs
3. Frequent Subgraphs
4. Community Detection
5. Graph Clustering

You’ll find this covered in: Aggarwal, Ch. 17, 19; Zaki & Meira, Ch. 4, 11, 16

SLIDE 3

IRDM Chapter 8, today

1. The basics
2. Properties of Graphs
3. Frequent Subgraphs
4. Community Detection
5. Graph Clustering

You’ll find this covered in: Aggarwal, Ch. 17, 19; Zaki & Meira, Ch. 4, 11, 16

SLIDE 4

Chapter 8.1: The Basics

Aggarwal Ch. 17.1

SLIDE 5

Networks are everywhere!

Examples: Human Disease Network [Barabasi 2007], Gene Regulatory Network [Decourty 2008], the Facebook network [2010], the Internet [2005]

SLIDE 6

The Internet

[figure: map of the Internet, exhibiting skewed degrees and robustness]

SLIDE 7

High school dating network

(Bearman et al., Am. Jnl. of Sociology, 2004. Image: Mark Newman)

Blue: male, pink: female. Interesting observations?
SLIDE 8

Karate club network

SLIDE 9

Friends

How many of you think that your friends have more friends than you? A recent Facebook study

 examined all of FB’s users: 721 million people with 69 billion friendships
 that is about 10% of the world’s population
 found that 93 percent of the time a user’s friend count was less than the average friend count of his or her friends
 users had an average of 190 friends, while their friends averaged 635 friends of their own

SLIDE 10

Reasons?

You are a loner? Your friends are extraverts? There are more extraverts than introverts in the world?

SLIDE 11

Example

Average number of friends?
= (1 + 3 + 2 + 2)/4 = 2

Average number of friends of friends?
= (3 + 1 + 2 + 2 + 3 + 2 + 3 + 2)/8 = (1×1 + 3×3 + 2×2 + 2×2)/8 = 2.25

(Strogatz, NYT 2012)

SLIDE 12

Always true (almost)!

Proof? Let 𝑋 be the degree of a random vertex, with 𝐸[𝑋] = ∑𝑗 𝑥𝑗/𝑛 and Var[𝑋] = 𝐸[(𝑋 − 𝐸[𝑋])²] = 𝐸[𝑋²] − 𝐸[𝑋]². The average number of friends of friends is

𝐸[𝑋²]/𝐸[𝑋] = 𝐸[𝑋] + Var[𝑋]/𝐸[𝑋] ≥ 𝐸[𝑋]

Essentially, it’s true whenever there is any spread in the number of friends (i.e. whenever there’s a non-zero variance).
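As a minimal sketch (in Python, with a hypothetical edge list matching the four-person example above), the two averages can be verified numerically:

```python
from collections import defaultdict

# Example graph from the Strogatz (NYT 2012) slide: degrees 1, 3, 2, 2.
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")]

adj = defaultdict(list)
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

deg = {v: len(ns) for v, ns in adj.items()}

# Average number of friends: the mean degree E[X].
avg_friends = sum(deg.values()) / len(deg)

# Average number of friends of friends: for every vertex, list the
# degrees of its neighbours and average; equivalently E[X^2]/E[X].
fof = [deg[w] for v in adj for w in adj[v]]
avg_fof = sum(fof) / len(fof)

print(avg_friends)  # 2.0
print(avg_fof)      # 2.25
```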

SLIDE 13

Why graphs?

Many real-world data sets come in the form of graphs

 social networks  hyperlinks  protein–protein interaction  XML parse trees  …

Many of these graphs are enormous

 humans cannot understand them → a task for data mining!

SLIDE 14

What is a graph?

A graph 𝐺 is a pair (𝑉, 𝐸 ⊆ 𝑉²)

 elements of 𝑉 are the vertices (or nodes) of the graph
 pairs (𝑢, 𝑣) ∈ 𝐸 are the edges (or arcs) of the graph
 for undirected graphs the pairs are unordered, for directed graphs the pairs are ordered

The graphs can be labelled

 vertices can have labels ℓ(𝑣)
 edges can have labels ℓ(𝑢, 𝑣)

A tree is a rooted, connected, and acyclic graph. Graphs can be represented using adjacency matrices

 a |𝑉| × |𝑉| matrix 𝐴 with 𝐴𝑖𝑗 = 1 if (𝑣𝑖, 𝑣𝑗) ∈ 𝐸

SLIDE 15

Eccentricity, radius, and diameter

The distance 𝑑(𝑣𝑖, 𝑣𝑗) between two vertices is the (weighted) length of the shortest path between them.

The eccentricity of a vertex 𝑣𝑖, 𝑒(𝑣𝑖), is its maximum distance to any other vertex, max𝑗 {𝑑(𝑣𝑖, 𝑣𝑗)}.

The radius of a connected graph, 𝑟(𝐺), is the minimum eccentricity of any vertex, min𝑖 {𝑒(𝑣𝑖)}.

The diameter of a connected graph, 𝑑(𝐺), is the maximum eccentricity of any vertex, max𝑖 {𝑒(𝑣𝑖)} = max𝑖,𝑗 {𝑑(𝑣𝑖, 𝑣𝑗)}.

 the effective diameter of a graph is the smallest number that is larger than the eccentricity of a large fraction of the vertices in the graph
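For an unweighted graph these quantities follow directly from BFS distances; a minimal sketch (the adjacency-list dict and the 4-vertex path graph below are hypothetical):

```python
from collections import deque

def distances(adj, src):
    """BFS from src; returns unweighted shortest-path distances."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def radius_and_diameter(adj):
    # Eccentricity of v = maximum distance from v to any other vertex;
    # radius = min eccentricity, diameter = max eccentricity.
    ecc = {v: max(distances(adj, v).values()) for v in adj}
    return min(ecc.values()), max(ecc.values())

# Hypothetical example: the path a - b - c - d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(radius_and_diameter(adj))  # (2, 3)
```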

SLIDE 16

Clustering Coefficient

The clustering coefficient of vertex 𝑣𝑖, 𝐶(𝑣𝑖), tells how clique-like the neighbourhood of 𝑣𝑖 is.

Let 𝑛𝑖 be the number of neighbours of 𝑣𝑖 and 𝑚𝑖 the number of edges between the neighbours of 𝑣𝑖, excluding 𝑣𝑖 itself:

𝐶(𝑣𝑖) = 𝑚𝑖 / (𝑛𝑖 choose 2) = 2𝑚𝑖 / (𝑛𝑖(𝑛𝑖 − 1))

 well-defined only for 𝑣𝑖 with at least two neighbours
 for others, let 𝐶(𝑣𝑖) = 0

The clustering coefficient of the graph is the average clustering coefficient of the vertices: 𝐶(𝐺) = (1/𝑛) ∑𝑖 𝐶(𝑣𝑖)
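A minimal sketch of these two definitions (the adjacency-list dict and the triangle-plus-pendant graph are hypothetical):

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """C(v) = 2*m / (n*(n-1)), where n = #neighbours of v and m = #edges
    among those neighbours; 0 if v has fewer than two neighbours."""
    nbrs = adj[v]
    n = len(nbrs)
    if n < 2:
        return 0.0
    m = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * m / (n * (n - 1))

def graph_clustering_coefficient(adj):
    # Average of the per-vertex coefficients.
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)

# Hypothetical example: triangle a-b-c with a pendant vertex d on c.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
print(clustering_coefficient(adj, "c"))   # 1 edge among 3 neighbours: ~0.333
print(graph_clustering_coefficient(adj))  # (1 + 1 + 1/3 + 0)/4: ~0.583
```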

SLIDE 17

What to do with a graph?

There are many interesting data one can mine from graphs and sets of graphs

 cliques of friends from social networks  hubs and authorities from link graphs  who is the centre of Hollywood  subgraphs that appear frequently in (a set of) graph(s)  areas with higher intra-connectivity than inter-connectivity  …

Graph mining is perhaps the most popular topic in contemporary data mining research

 though not necessarily called as such…

SLIDE 18

Chapter 8.2: Properties of Graphs

Aggarwal Ch. 17.1, 19.2; Zaki & Meira Ch. 4

SLIDE 19

Centrality

Six degrees of Kevin Bacon

 ”Every actor is related to Kevin Bacon by no more than 6 hops”
 Kevin Bacon has acted with many, that have acted with many others, that have acted with many others…
 this makes Kevin Bacon a centre of the co-acting graph

Kevin, however, is not the centre:

 the average distance to him is 2.998
 but to Harvey Keitel it is only 2.848

(http://oracleofbacon.org)

SLIDE 20

Degree and eccentricity centrality

Centrality is a function 𝑐 : 𝑉 → ℝ inducing a total order on 𝑉

 the higher the centrality of a vertex, the more important it is

In degree centrality, 𝑐(𝑣𝑖) = 𝑑(𝑣𝑖), the degree of the vertex. In eccentricity centrality the least eccentric vertex is the most central one, 𝑐(𝑣𝑖) = 1/𝑒(𝑣𝑖)

 the least eccentric vertex is the centre
 the most eccentric vertex is peripheral

SLIDE 21

Closeness centrality

In closeness centrality the vertex with the least total distance to all other vertices is the centre, 𝑐(𝑣𝑖) = (∑𝑗 𝑑(𝑣𝑖, 𝑣𝑗))⁻¹

In eccentricity centrality we aim to minimise the maximum distance; in closeness centrality we aim to minimise the average distance

 this is the distance used to measure the centre of Hollywood

SLIDE 22

Betweenness centrality

Betweenness centrality measures the number of shortest paths that travel through 𝑣𝑖

 measures the “monitoring” role of the vertex
 “all roads lead to Rome”

Let 𝜂𝑗𝑘 be the number of shortest paths between 𝑣𝑗 and 𝑣𝑘, and let 𝜂𝑗𝑘(𝑣𝑖) be the number of those that include 𝑣𝑖

 let 𝛾𝑗𝑘(𝑣𝑖) = 𝜂𝑗𝑘(𝑣𝑖)/𝜂𝑗𝑘
 betweenness centrality is defined as 𝑐(𝑣𝑖) = ∑𝑗≠𝑖 ∑𝑘>𝑗, 𝑘≠𝑖 𝛾𝑗𝑘(𝑣𝑖)
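One standard way to compute this sum efficiently, not covered on the slide, is Brandes’ algorithm; a sketch for unweighted, undirected graphs (the 3-vertex path graph is hypothetical):

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm; each unordered pair (j, k) with j < k is
    counted once, as in the slide's definition."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and predecessors.
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if dist[w] < 0:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints: halve.
    return {v: c / 2 for v, c in bc.items()}

# Hypothetical example: path a - b - c; the only shortest path between
# a and c runs through b.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness(adj))  # {'a': 0.0, 'b': 1.0, 'c': 0.0}
```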

SLIDE 23

Prestige

In prestige, a vertex is more central if it has many incoming edges from other vertices of high prestige

 𝐴 is the adjacency matrix of the directed graph 𝐺
 𝑝 is an 𝑛-dimensional vector giving the prestige of the vertices
 𝑝 = 𝐴ᵀ𝑝
 starting from an initial prestige vector 𝑝0, we get

𝑝𝑘 = 𝐴ᵀ𝑝𝑘−1 = 𝐴ᵀ(𝐴ᵀ𝑝𝑘−2) = (𝐴ᵀ)²𝑝𝑘−2 = (𝐴ᵀ)³𝑝𝑘−3 = ⋯ = (𝐴ᵀ)ᵏ𝑝0

The vector 𝑝𝑘 converges to the dominant eigenvector of 𝐴ᵀ

 under some assumptions

(PageRank is based on (normalized) prestige)
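The iteration 𝑝 ← 𝐴ᵀ𝑝 can be sketched as plain power iteration, normalising at every step so the vector does not blow up (the 3-vertex directed graph below is hypothetical; convergence assumes a unique dominant eigenvalue):

```python
def power_iteration(A_t, steps=100):
    """Iterate p <- A^T p with L2 normalisation; converges to the
    dominant eigenvector of A^T (given a unique dominant eigenvalue)."""
    n = len(A_t)
    p = [1.0] * n
    for _ in range(steps):
        q = [sum(A_t[i][j] * p[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in q) ** 0.5
        p = [x / norm for x in q]
    return p

# Hypothetical directed graph on vertices 0, 1, 2 with
# edges 0->1, 0->2, 1->2, 2->0; A[i][j] = 1 iff edge i->j.
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
A_t = [list(col) for col in zip(*A)]  # transpose

p = power_iteration(A_t)
print(p)  # vertex 2, with two incoming edges, gets the highest prestige
```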

SLIDE 24

Graph properties

Several real-world graphs exhibit certain characteristics

 studying what these are and explaining why they appear is an important area of network research

As data miners, we need to understand the consequences of these characteristics

 finding a result that can be explained merely by one of these characteristics is not interesting

We also want to model graphs with these characteristics

SLIDE 25

It’s a small world after all

A graph 𝐺 is said to exhibit the small-world property if its average path length scales logarithmically, 𝜇𝐿 ∝ log 𝑛

 six degrees of Kevin Bacon is based on this property
 similarly so for Erdős numbers
 how far a mathematician is from Hungarian combinatorist Paul Erdős
 the radius of a large, connected mathematical co-authorship network (268K authors) is 12 and its diameter is 23

SLIDE 26

Scale-free property

The degree distribution of a graph is the distribution of its vertex degrees

 how many vertices have degree 1, how many degree 2, etc.
 𝑓(𝑘) is the number of vertices with degree 𝑘

A graph 𝐺 is said to exhibit the scale-free property if 𝑓(𝑘) ∝ 𝑘⁻𝛾

 follows a so-called power-law distribution
 the majority of vertices have low degree, only a few have very high degree
 scale-free: 𝑓(𝑐𝑘) = 𝛼(𝑐𝑘)⁻𝛾 = 𝛼𝑐⁻𝛾 𝑘⁻𝛾 ∝ 𝑘⁻𝛾
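Computing 𝑓(𝑘) from an adjacency list is a one-liner; a sketch (the star graph below is a hypothetical, extreme illustration of “many low-degree, few high-degree” vertices):

```python
from collections import Counter

def degree_distribution(adj):
    """f(k): number of vertices with degree k."""
    return Counter(len(ns) for ns in adj.values())

# Hypothetical star graph: the hub has degree 4, the four leaves degree 1.
adj = {"hub": ["a", "b", "c", "d"],
       "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"]}

dd = degree_distribution(adj)
print(dd)  # Counter({1: 4, 4: 1})
```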

SLIDE 27

Example: WWW links

[figure: in-degree distribution, 𝛾 = 2.09; out-degree distribution, 𝛾 = 2.72]

(Broder et al., 2000)

SLIDE 28

Clustering effect

A graph exhibits the clustering effect if the distribution of the average clustering coefficient (per degree) follows a power law

 if 𝐶(𝑘) is the average clustering coefficient of all vertices of degree 𝑘, then 𝐶(𝑘) ∝ 𝑘⁻𝛾

The vertices with small degrees are part of highly clustered areas (high clustering coefficient), while “hub vertices” have smaller clustering coefficients

SLIDE 29

Chapter 8.3: Frequent Subgraph Mining

Aggarwal Ch. 17.2, 17.4; Zaki & Meira Ch. 11

SLIDE 30

Subgraphs

Graph (𝑉′, 𝐸′) is a subgraph of graph (𝑉, 𝐸) iff

 𝑉′ ⊆ 𝑉
 𝐸′ ⊆ 𝐸

Note that subgraphs don’t have to be connected

 today we consider only connected subgraphs

Checking whether one graph is a subgraph of another is trivial

 but in most real-world applications there are no direct subgraphs
 two graphs might be similar even if their vertex sets are disjoint

SLIDE 31

Graph isomorphism

Graphs 𝐺 = (𝑉, 𝐸) and 𝐺′ = (𝑉′, 𝐸′) are isomorphic if there exists a bijective function 𝜑: 𝑉 → 𝑉′ such that

 (𝑢, 𝑣) ∈ 𝐸 if and only if (𝜑(𝑢), 𝜑(𝑣)) ∈ 𝐸′
 ℓ(𝑣) = ℓ(𝜑(𝑣)) for all 𝑣 ∈ 𝑉
 ℓ(𝑢, 𝑣) = ℓ(𝜑(𝑢), 𝜑(𝑣)) for all (𝑢, 𝑣) ∈ 𝐸

Graph 𝐺′ is subgraph isomorphic to 𝐺 if there exists a subgraph of 𝐺 that is isomorphic to 𝐺′. No polynomial-time algorithm is known for determining whether 𝐺 and 𝐺′ are isomorphic. Determining whether 𝐺′ is subgraph isomorphic to 𝐺 is NP-hard.
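For very small graphs the definition can be checked directly by brute force over all bijections; a sketch (the two relabelled triangles below are hypothetical, and this factorial-time approach is only meant to make the definition concrete):

```python
from itertools import permutations

def isomorphic(adj1, adj2, labels1, labels2):
    """Brute-force isomorphism check: try every bijection V -> V' and
    test that edges and vertex labels are preserved."""
    v1, v2 = sorted(adj1), sorted(adj2)
    if len(v1) != len(v2):
        return False
    for perm in permutations(v2):
        phi = dict(zip(v1, perm))
        if all(labels1[v] == labels2[phi[v]] for v in v1) and \
           all((phi[u] in adj2[phi[v]]) == (u in adj1[v])
               for u in v1 for v in v1 if u != v):
            return True
    return False

# Hypothetical example: two relabelled triangles, vertex labels {a, a, b}.
adj1 = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
adj2 = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}
labels1 = {1: "a", 2: "a", 3: "b"}
labels2 = {"x": "b", "y": "a", "z": "a"}
print(isomorphic(adj1, adj2, labels1, labels2))  # True
```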

SLIDE 32

Equivalence and canonical graphs

Isomorphism defines an equivalence relation

 id: 𝑉 → 𝑉, id(𝑣) = 𝑣 shows 𝐺 is isomorphic to itself
 if 𝐺 is isomorphic to 𝐺′ via 𝜑, then 𝐺′ is isomorphic to 𝐺 via 𝜑⁻¹
 if 𝐺 is isomorphic to 𝐻 via 𝜑, and 𝐻 to 𝐽 via 𝜓, then 𝐺 is isomorphic to 𝐽 via 𝜓 ∘ 𝜑

A canonization of a graph 𝐺, canon(𝐺), produces a graph 𝐶 such that for any graph 𝐻 that is isomorphic to 𝐺, canon(𝐺) = canon(𝐻)

 two graphs are isomorphic if and only if their canonical versions are the same

SLIDE 33

Example of isomorphic graphs

[figure: a labelled graph with vertex labels a, b, a, c, b, a]

SLIDE 34

Example of isomorphic graphs

[figure: the same labelled graph, vertex labels a, b, a, c, b, a]

SLIDE 35

Example of isomorphic graphs

[figure: two isomorphic labelled graphs, each with vertex labels a, b, a, c, b, a]

SLIDE 36

Frequent subgraph mining

Given a set 𝑫 of 𝑛 graphs and a minimum support 𝜏, find all connected graphs that are subgraph isomorphic to at least 𝜏 graphs in 𝑫

 an enormously complex problem

For graphs that have 𝑚 vertices there are on the order of 2^(𝑚²) subgraphs (not all of them connected); with 𝑙 labels for vertices and edges the number of distinct labelled graphs is larger still.

Counting support means solving multiple NP-hard problems

SLIDE 37

An example

[figure: an example database of three labelled graphs with vertex labels a, b, c]

SLIDE 38

Mining frequent subgraph patterns

Like for itemsets, the subgraph definition of support is monotone

 we can employ level-wise search!

We can modify

 APRIORI to get AGM (Inokuchi, Washio & Motoda, 2000)
 ECLAT to get FFSM (Huan, Wang & Prins, 2003)
 FP-GROWTH to get GSPAN (Yan & Han, 2002)

SLIDE 39

GraphApriori

Algorithm GRAPHAPRIORI(graph db 𝑫, minsup 𝜏)
begin
  𝑘 ← 1; ℱ1 ← {all frequent singleton graphs}
  while ℱ𝑘 is not empty do
    generate 𝒞𝑘+1 by joining pairs of graphs in ℱ𝑘 that have in common a subgraph of size (𝑘 − 1)
    prune subgraphs from 𝒞𝑘+1 that violate downward closure
    determine ℱ𝑘+1 by support counting on (𝒞𝑘+1, 𝑫), retaining the subgraphs from 𝒞𝑘+1 with support at least 𝜏
    𝑘 ← 𝑘 + 1
  end
  return ℱ1 ∪ … ∪ ℱ𝑘
end

(Inokuchi et al. 2000; Kuramochi & Karypis 2001)

SLIDE 40

GraphApriori

(the GRAPHAPRIORI pseudocode repeated, now illustrating candidate generation)

[figure: candidate generation using edge-based join — two graphs of ℱ𝑘 sharing a (𝑘 − 1)-subgraph join into two candidates of 𝒞𝑘+1]

[figure: candidate generation using node-based join]

(Inokuchi et al. 2000; Kuramochi & Karypis 2001)

SLIDE 41

Canonical codes

We can improve the running time of frequent subgraph mining by either

 speeding up the computation of support
 lots of effort has gone into faster isomorphism checking, with only little progress
 creating fewer candidates that we need to check
 level-wise algorithms generate huge numbers of candidates, all of which must be checked for isomorphism with the others

The GSPAN algorithm is the frequent subgraph mining equivalent of FP-GROWTH; it uses a depth-first approach

(Zaki & Meira Ch. 11; Yan & Han 2002)

SLIDE 42

Depth-First Spanning tree

A depth-first spanning (DFS) tree of a graph 𝐺

 is a connected tree
 contains all the vertices of 𝐺
 is built in depth-first order
 selection between siblings is e.g. based on the vertex index

Edges in the DFS tree are forward edges; edges not in the DFS tree are backward edges. The rightmost path in the DFS tree is the path that travels from the root to the rightmost vertex by always taking the rightmost (last added) child.

SLIDE 43

An example – DFS traversal

[figure: DFS traversal of an example graph on vertices 𝑣1…𝑣8 with labels a, a, d, c, c, b, a, b]

SLIDE 44

An example – the DFS tree

[figure: the DFS tree of the example graph, with the rightmost path highlighted]

SLIDE 45

Candidates from the DFS tree

Given graph 𝐺, we extend it only from the vertices on the rightmost path

 we can add a backward edge from the rightmost vertex to some other vertex on the rightmost path
 we can add a forward edge from any vertex on the rightmost path
 this increases the number of vertices by 1

The order of generating the candidates is

 first backward extensions
 first to the root, then to the root’s child, …
 then forward extensions
 first from the leaf, then from the leaf’s father, …

SLIDE 46

An example – the DFS tree

[figure: the example DFS tree and its rightmost-path extensions]

SLIDE 47

DFS codes and their orders

A DFS code is a sequence of tuples of type ⟨𝑣𝑖, 𝑣𝑗, ℓ(𝑣𝑖), ℓ(𝑣𝑗), ℓ(𝑣𝑖, 𝑣𝑗)⟩

 tuples are given in DFS order
 backward edges are listed before forward edges
 vertices are numbered in DFS order

A DFS code is canonical if it is the smallest of the codes in the ordering

 ⟨𝑣𝑖, 𝑣𝑗, ℓ(𝑣𝑖), ℓ(𝑣𝑗), ℓ(𝑣𝑖, 𝑣𝑗)⟩ < ⟨𝑣𝑥, 𝑣𝑦, ℓ(𝑣𝑥), ℓ(𝑣𝑦), ℓ(𝑣𝑥, 𝑣𝑦)⟩ if
 ⟨𝑣𝑖, 𝑣𝑗⟩ <𝑒 ⟨𝑣𝑥, 𝑣𝑦⟩; or
 ⟨𝑣𝑖, 𝑣𝑗⟩ = ⟨𝑣𝑥, 𝑣𝑦⟩ and ⟨ℓ(𝑣𝑖), ℓ(𝑣𝑗), ℓ(𝑣𝑖, 𝑣𝑗)⟩ <𝑙 ⟨ℓ(𝑣𝑥), ℓ(𝑣𝑦), ℓ(𝑣𝑥, 𝑣𝑦)⟩
 the ordering of the label tuples is lexicographical

SLIDE 48

Ordering the edges

Let 𝑒𝑖𝑗 = (𝑣𝑖, 𝑣𝑗) and 𝑒𝑥𝑦 = (𝑣𝑥, 𝑣𝑦). Then 𝑒𝑖𝑗 <𝑒 𝑒𝑥𝑦 if

 𝑒𝑖𝑗 and 𝑒𝑥𝑦 are both forward edges, and
 𝑗 < 𝑦; or
 𝑗 = 𝑦 and 𝑖 > 𝑥
 𝑒𝑖𝑗 and 𝑒𝑥𝑦 are both backward edges, and
 𝑖 < 𝑥; or
 𝑖 = 𝑥 and 𝑗 < 𝑦
 𝑒𝑖𝑗 is forward and 𝑒𝑥𝑦 is backward, and 𝑗 ≤ 𝑥
 𝑒𝑖𝑗 is backward and 𝑒𝑥𝑦 is forward, and 𝑖 < 𝑦

(typo fixed, edge order now in sync with Zaki & Meira)
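The four cases above translate directly into a comparator; a sketch (edges are hypothetical pairs of DFS vertex indices, forward meaning 𝑖 < 𝑗 and backward meaning 𝑖 > 𝑗):

```python
def edge_lt(e1, e2):
    """The DFS-code edge order: e = (i, j) is a forward edge if i < j,
    a backward edge if i > j (indices in DFS discovery order)."""
    (i, j), (x, y) = e1, e2
    f1, f2 = i < j, x < y  # forward flags
    if f1 and f2:                      # both forward
        return j < y or (j == y and i > x)
    if not f1 and not f2:              # both backward
        return i < x or (i == x and j < y)
    if f1 and not f2:                  # forward vs backward
        return j <= x
    return i < y                       # backward vs forward

# Two forward edges ending at the same vertex: 4 = 4 and 2 > 1,
# so (2, 4) precedes (1, 4).
print(edge_lt((2, 4), (1, 4)))  # True
# A backward edge precedes a forward extension: (3, 1) vs (3, 4).
print(edge_lt((3, 1), (3, 4)))  # True
```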

SLIDE 49

Example

[figure: three isomorphic graphs 𝐺1, 𝐺2, 𝐺3, each on vertices 𝑣1…𝑣4 with vertex labels a, a, a, b and edge labels q, r]

𝑢11 = ⟨𝑣1, 𝑣2, a, a, q⟩  𝑢12 = ⟨𝑣2, 𝑣3, a, a, r⟩  𝑢13 = ⟨𝑣3, 𝑣1, a, a, r⟩  𝑢14 = ⟨𝑣2, 𝑣4, a, b, r⟩
𝑢21 = ⟨𝑣1, 𝑣2, a, a, q⟩  𝑢22 = ⟨𝑣2, 𝑣3, a, b, r⟩  𝑢23 = ⟨𝑣2, 𝑣4, a, a, r⟩  𝑢24 = ⟨𝑣4, 𝑣1, a, a, r⟩
𝑢31 = ⟨𝑣1, 𝑣2, a, a, q⟩  𝑢32 = ⟨𝑣2, 𝑣3, a, a, r⟩  𝑢33 = ⟨𝑣3, 𝑣1, a, a, r⟩  𝑢34 = ⟨𝑣1, 𝑣4, a, b, r⟩

SLIDE 50

Example

[same three graphs 𝐺1, 𝐺2, 𝐺3 and their DFS codes as on the previous slide]

The first tuples are identical: 𝑢11 = 𝑢21 = 𝑢31 = ⟨𝑣1, 𝑣2, a, a, q⟩

SLIDE 51

Example

[same three graphs and DFS codes]

In the second tuple, 𝐺2 is larger in the label order: ⟨a, b, r⟩ > ⟨a, a, r⟩, so 𝐺2’s code cannot be canonical

SLIDE 52

Example

[same three graphs and DFS codes]

The last tuples are forward edges with 4 = 4 but 2 > 1, so 𝑢14 < 𝑢34 → 𝐺1’s code is the smallest

SLIDE 53

gSpan: graph-based substructure pattern mining

The general idea

 use the DFS codes to create candidates
 extend only canonical and frequent candidates

There can be very, very many extensions

 we need to see them all, and all of their isomorphisms, to count their supports

(Yan & Han 2002)

SLIDE 54

Constructing candidates

The candidates are built in a DFS code tree

 a DFS code 𝑎 is an ancestor of DFS code 𝑏 if 𝑎 is a proper prefix of 𝑏
 the siblings in the tree follow the DFS code order

A graph can be frequent only if all of the graphs representing its ancestors in the DFS code tree are frequent. The DFS code tree contains the canonical codes of all subgraphs of the graphs in the data

 but not all vertices in the code tree correspond to canonical codes

We (implicitly) traverse this tree

SLIDE 55

The gSPAN algorithm (sketch)

GSPAN(graph db 𝑫, minsup 𝜏)
  for each frequent 1-edge graph do
    call SUBGRM to grow all nodes in the tree rooted at this edge-graph
    remove this edge from the graph

SUBGRM(frequent subgraph 𝑋, minsup 𝜏)
  if the code is not canonical then return
  add 𝑋 to the set of frequent graphs
  create all super-graphs 𝑍 ⊃ 𝑋, extending 𝑋 with one more edge
  compute the frequencies of all 𝑍
  call SUBGRM for the canonical representations of all frequent 𝑍

SLIDE 56

Computing frequencies

GSPAN merges extension generation and support computation. For each graph in the database

 GSPAN computes all isomorphisms of the current candidate
 this can mean solving NP-complete problems…
 for each isomorphism, it computes all backward and forward extensions
 these extensions are stored together with the graph they appear in

The support of each extension is the number of times we’ve stored it

SLIDE 57

Checking canonicity

Given the DFS code of an extension, we need to check whether the code is canonical. This can be done by re-creating the code

 at every step, choose the smallest of the rightmost-path extensions of the current code in the graph corresponding to the extension

If at any step we get a code that is smaller than the corresponding prefix of the extension’s code, we do not have a canonical code

 if after 𝑘 steps we arrive at the extension’s code, it is canonical

SLIDE 58

Easier problems

Much of the complexity of subgraph mining lies in (checking for) isomorphism. For some types of graphs isomorphism is easy

 different types of trees
 ordered and unordered
 rooted and unrooted
 graphs where every node has a distinct label

SLIDE 59

Conclusions

Graphs are everywhere

 many interesting problems  real graphs often exhibit power-law-like behaviour

Graphs generalise many data settings

 makes it possible to create general algorithms

Many problems in graphs are very difficult

 subgraph isomorphism

Frequent subgraph mining

 involves multiple NP-hard problems
