Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining


SLIDE 1

Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

Data Mining in Bioinformatics
Day 5: Frequent Subgraph Mining

Chloé-Agathe Azencott & Karsten Borgwardt
February 10 to February 21, 2014
Machine Learning & Computational Biology Research Group
Max Planck Institutes Tübingen and Eberhard Karls Universität Tübingen

SLIDE 2

Graphs are everywhere

[Figure panels: coexpression network, social network, program flow, protein structure, chemical compound]

SLIDE 3

Mining graph data

Graph comparison
  E.g. compare protein-protein interaction networks (PPIN) between species
Graph classification / regression
  Predict properties of objects represented as graphs
  E.g. predict the toxicity of a molecular compound, the functionality of a protein
Graph node classification / regression
  Predict properties of objects connected on a graph
  E.g. predict the functionality of a protein, classify pixels in remote sensing images

SLIDE 4

Mining graph data

Graph compression
  Representing graphs compactly
  E.g. store and mine web data
Graph clustering
  Finding dense subnetworks of graphs
  E.g. find groups in social networks
Link prediction
  Predicting relationships between nodes of the graph
  E.g. predict who should be added to your social network, predict interactions between proteins

SLIDE 5

Graph pattern mining

Graph pattern mining
  Find frequent / informative graph patterns
  Summarize patterns
  Approximate patterns
Applications
  Finding biologically conserved subnetworks
  Finding functional modules
  Program control flow analysis
  Intrusion detection
  Building blocks for graph classification, clustering, compression, comparison

SLIDE 6

Frequent Pattern Mining

SLIDE 7

Frequent pattern mining

Frequent item set mining
  Market basket analysis: find items that are frequently purchased together
Given
  a set $B = \{i_1, i_2, \ldots, i_n\}$ of items
  a list $T = \{t_1, t_2, \ldots, t_m\}$ of transactions, $t_j \subseteq B$
  a minimum number of occurrences $s_{\min} \in \mathbb{N}$
Find the set of frequent item sets, i.e.

$$F(s_{\min}) = \{ I \subseteq B : |\{ k : I \subseteq t_k \}| \geq s_{\min} \}$$
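The definition of F(s_min) can be read directly as a (very slow) program. The sketch below is one possible brute-force reading, not code from the course: the function names and the toy baskets are hypothetical, and the empty item set is skipped for readability even though the definition formally admits it.

```python
from itertools import combinations


def support(itemset, transactions):
    """Count the transactions t_k that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets_bruteforce(items, transactions, s_min):
    """Return {item set: support} for every non-empty I with support >= s_min."""
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(sorted(items), size):
            candidate = frozenset(combo)
            count = support(candidate, transactions)
            if count >= s_min:
                frequent[candidate] = count
    return frequent


# Hypothetical market baskets, not taken from the slides.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
all_items = frozenset().union(*baskets)
frequent = frequent_itemsets_bruteforce(all_items, baskets, s_min=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Every one of the 2^n - 1 non-empty subsets is tested against every transaction, which is why this only works for a handful of items.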

SLIDE 8

A Priori

[Agrawal et al., 1994]

Brute force approach
  Enumerate all $2^n$ subsets of $B$
  Count how many of the transactions $t_1, \ldots, t_m$ contain each of them
  Generally infeasible
The a-priori property
  No superset of an infrequent item set can be frequent
  All subsets of a frequent item set are frequent

SLIDE 9

A Priori

The a-priori algorithm
  List all singletons, discard the infrequent ones
  Form pairs of frequent elements, discard the infrequent ones
  ...
  Augment the sets of size $k - 1$ to form all sets of size $k$ of frequent elements, discard the infrequent ones
Alternate between candidate generation and pruning.
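A minimal level-wise sketch of this alternation between candidate generation and pruning is given below. It is an illustration under assumptions (the function name `apriori`, the join strategy, and the toy baskets are all hypothetical), not the implementation of [Agrawal et al., 1994].

```python
from itertools import combinations


def apriori(transactions, s_min):
    """Level-wise search: return {frozenset: support} of all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Level 1: all singletons, keeping only the frequent ones.
    items = set().union(*transactions)
    singletons = {frozenset([i]) for i in items}
    level = {c: s for c, s in count(singletons).items() if s >= s_min}
    frequent = dict(level)

    k = 2
    while level:
        # Candidate generation: join two frequent (k-1)-sets that differ in one item.
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) != k:
                continue
            # A-priori pruning: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in level for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Counting pass: discard the infrequent candidates.
        level = {c: s for c, s in count(candidates).items() if s >= s_min}
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical market baskets, not taken from the slides.
baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
result = apriori(baskets, s_min=2)
```

In this toy run the candidate {bread, milk, butter} is never counted, because its subset {milk, butter} was already found infrequent at level 2.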

SLIDE 10

A Priori

Generating unique candidates
  There are $k!$ ways of generating a single set of $k$ items
  Ensure we do it only once
  ⇒ Idea: assign a unique parent set to each set
Canonical form
  The set of possible parents of an item set $I$ is the set of its maximal proper subsets: $\{ J \subset I \mid \nexists K : J \subset K \subset I \}$
  Put an ordering on $B$: $i_1 < i_2 < \cdots < i_n$
  Define the canonical parent of $I$ as

$$p_c(I) = I \setminus \{ \max_{a \in I} a \}$$
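As a small illustration of these two definitions, the sketch below lists the possible parents of an item set and picks the canonical one. It assumes that Python's default ordering on the items stands in for the ordering on B; both function names are hypothetical.

```python
def possible_parents(itemset):
    """Maximal proper subsets of I: remove exactly one item."""
    return [itemset - {a} for a in sorted(itemset)]


def canonical_parent(itemset):
    """p_c(I): remove the largest item of I under the ordering on B."""
    return itemset - {max(itemset)}


I = frozenset("abce")
parents = possible_parents(I)       # four candidates, one per removed item
parent = canonical_parent(I)        # the single parent that may generate I
```

Only the canonical parent is allowed to generate I, so each item set is produced exactly once.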

SLIDE 11

A Priori

Canonical code words
  Code word for $I \subseteq B$: any word $w$ on the alphabet $B$ made of the items of $I$
  Canonical code word of $I$, $w_c(I)$: smallest of these words, in lexicographic order
  E.g. {a, c, b, e} → abce
  The canonical parent $p_c(I)$ of $I$ is described by the longest proper prefix of $w_c(I)$.
  Prefix property: the longest proper prefix of a canonical code word is a canonical code word itself.
  Equivalently, any prefix of a canonical code word is a canonical code word itself.

SLIDE 12

A Priori

Candidate set generation
  From frequent item sets of size $k - 1$, construct item sets of size $k$ by appending (frequent) items to their canonical code words
  Only do so for items greater than the last letter of the canonical code word
  E.g. abe → abef, abeg, but not abec (struck out on the slide, since c < e)
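The extension rule can be sketched in a few lines. The example assumes single-character items ordered alphabetically; `canonical_word` and `extend` are hypothetical helper names.

```python
def canonical_word(itemset):
    """Smallest code word of an item set: its items in increasing order."""
    return "".join(sorted(itemset))


def extend(word, frequent_items):
    """Append only items greater than the last letter of the canonical code word."""
    last = word[-1]
    return [word + item for item in sorted(frequent_items) if item > last]


word = canonical_word({"a", "c", "b", "e"})   # "abce"
children = extend("abe", "abcefg")            # ["abef", "abeg"]; "abec" is never built
```

Because an item is only ever appended after a smaller last letter, every generated word is already canonical and no duplicate check is needed.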

SLIDE 13

A Priori

Prefix tree

[Figure: full prefix tree for B = {a, b, c, d}, with levels
  a, b, c, d
  ab, ac, ad, bc, bd, cd
  abc, abd, acd, bcd
  abcd]

SLIDE 14

A Priori

Pruning the prefix tree
  Only generate unique item sets
  A-priori property ⇒ prune branches at infrequent items
  Size-based pruning

[Figure: prefix tree for B = {a, b, c, d} with each node annotated by its support in T; the singleton supports are a: 4, b: 11, c: 9, d: 7]

T = {{a, b}, {a, b, c}, {b, c}, {b}, {b, d}, {d}, {a, c}, {b, c}, {d}, {a, c}, {b, c}, {b, c, d}, {d}, {b}, {b, c, d}, {b, c, d}}
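The pruned prefix tree can be reproduced for the transaction list T given on this slide. The sketch below is a hedged illustration: the threshold `s_min=3` is an assumption (the slide does not state one), and the function names are hypothetical.

```python
# Transaction list T from the slide, written as strings of items.
T = [set(t) for t in ("ab", "abc", "bc", "b", "bd", "d", "ac", "bc",
                      "d", "ac", "bc", "bcd", "d", "b", "bcd", "bcd")]


def support(word, transactions):
    """Number of transactions containing all letters of `word`."""
    return sum(1 for t in transactions if set(word) <= t)


def mine_prefix_tree(items, transactions, s_min):
    """Walk the prefix tree and stop descending below infrequent nodes."""
    found = {}

    def grow(word, start):
        # Only append items that come after the last letter: unique generation.
        for idx in range(start, len(items)):
            child = word + items[idx]
            count = support(child, transactions)
            if count >= s_min:          # a-priori pruning of the whole branch otherwise
                found[child] = count
                grow(child, idx + 1)

    grow("", 0)
    return found


found = mine_prefix_tree("abcd", T, s_min=3)   # s_min=3 is an assumed threshold
```

With this assumed threshold the branch below ab is cut (ab occurs only twice), so abc and abd are never counted.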

SLIDE 15

Frequent pattern mining

Exploring the search tree
  Breadth-first search: find all frequent sets of size $k$ before moving on to size $k + 1$
    → A-priori
  Depth-first search: find all frequent sets containing element a before moving on to those that contain b but do not contain a
    Advantage: divide-and-conquer strategy, requires less memory
    → Eclat, FP-growth...
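To make the depth-first strategy concrete, here is a small Eclat-style sketch that stores, for each item, the set of transaction ids containing it and intersects those sets while descending. It is a simplified illustration under assumptions (hypothetical function name and toy baskets), not a faithful reimplementation of Eclat or FP-growth.

```python
def eclat(transactions, s_min):
    """Depth-first mining with transaction-id sets; returns {tuple of items: support}."""
    # Vertical layout: item -> ids of the transactions that contain it.
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def grow(prefix, candidates):
        # candidates: (item, tidset of prefix + item) pairs, in item order.
        for i, (item, tids) in enumerate(candidates):
            if len(tids) < s_min:
                continue
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            # Everything containing `itemset` is explored before the next sibling.
            rest = [(other, tids & other_tids) for other, other_tids in candidates[i + 1:]]
            grow(itemset, rest)

    grow((), sorted(tidsets.items()))
    return frequent


# Hypothetical market baskets, not taken from the slides.
baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
patterns = eclat(baskets, s_min=2)
```

Only the id sets along the current branch are alive at any time, which is the memory advantage mentioned on the slide.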

SLIDE 16

Frequent Subgraph Mining

SLIDE 17

Graphs

A graph is an ordered pair $G = (V, E)$
  $V$ is a set of vertices (or nodes)
  $E \subseteq V \times V$ is a set of edges (or links)
Edges can be
  ordered → $G$ is directed
  or not → $G$ is undirected
A labeled graph is an ordered triplet $G = (V, E, l)$
  $V$ is a set of vertices (or nodes)
  $E \subseteq V \times V$ is a set of edges (or links)
  $l : V \cup E \to A^*$ assigns labels to vertices and edges
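One possible in-memory representation of an undirected labeled graph is sketched below. The class name, its fields, and the three-atom example are assumptions made for illustration; labels are plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Undirected labeled graph G = (V, E, l)."""
    vertex_labels: dict = field(default_factory=dict)  # vertex -> label
    edge_labels: dict = field(default_factory=dict)    # frozenset({u, v}) -> label

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("both endpoints must be added before the edge")
        # An unordered pair, so (u, v) and (v, u) denote the same edge.
        self.edge_labels[frozenset((u, v))] = label

    def neighbors(self, v):
        return sorted(w for e in self.edge_labels if v in e for w in e if w != v)


# Hypothetical three-atom fragment C-N=C.
g = LabeledGraph()
for vertex, label in [(1, "C"), (2, "N"), (3, "C")]:
    g.add_vertex(vertex, label)
g.add_edge(1, 2, "single")
g.add_edge(2, 3, "double")
```

A directed graph would store ordered pairs `(u, v)` as keys instead of `frozenset` objects.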

SLIDE 18

Frequent subgraph mining

Frequent subgraphs
Given
  a set $D = \{G_1, G_2, \ldots, G_N\}$ of graphs
  a minimum frequency $\theta_{\min} \in [0, 1]$
Find the set of frequent subgraphs, i.e.

$$F(\theta_{\min}) = \{ H : |\{ i : H \text{ subgraph of } G_i \}| \geq N \theta_{\min} \}$$

The frequency of subgraph $H$ is called the support of $H$:

$$\mathrm{supp}(H) = |\{ i : H \text{ subgraph of } G_i \}|$$

$\theta_{\min}$ is called the minimum support
Often focus on connected subgraphs

SLIDE 19

Frequent subgraph mining

Example: call graphs
[Figure: call graphs and their frequent subgraphs]

SLIDE 20

Frequent subgraph mining

Example: chemical compounds
  [Figure: structures of caffeine, theobromine, sildenafil, adenine]
  Frequent subgraphs: imidazole, purine

SLIDE 21

Frequent subgraph mining

Subgraph isomorphism
  Let $G = (V_G, E_G, l_G)$ and $H = (V_H, E_H, l_H)$ be two labeled graphs.
  A subgraph isomorphism from $H$ to $G$ (or an occurrence of $H$ in $G$) is an injective function $f : V_H \to V_G$ such that:
    $\forall v \in V_H : l_H(v) = l_G(f(v))$
    $\forall (u, v) \in E_H : (f(u), f(v)) \in E_G$ and $l_H(u, v) = l_G(f(u), f(v))$
  There may be several (many) ways to map $H$ to $G$
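The two conditions translate into a simple backtracking search. The sketch below is a naive illustration, exponential in the worst case, and rests on several assumptions: graphs are undirected, have no self-loops, carry string labels, and are stored as a pair `(vertex_labels, edge_labels)` with edges keyed by `frozenset`; all names are hypothetical.

```python
def find_embedding(H, G):
    """Return one injective map V_H -> V_G preserving labels and edges, or None."""
    h_vlab, h_elab = H
    g_vlab, g_elab = G
    order = list(h_vlab)

    def consistent(mapping, u, target):
        if h_vlab[u] != g_vlab[target]:
            return False
        # Each pattern edge from u to an already mapped vertex must exist in G
        # with the same label.
        for edge, label in h_elab.items():
            if u in edge:
                (w,) = edge - {u}
                if w in mapping and g_elab.get(frozenset((target, mapping[w]))) != label:
                    return False
        return True

    def backtrack(i, mapping):
        if i == len(order):
            return dict(mapping)
        u = order[i]
        for target in g_vlab:
            if target not in mapping.values() and consistent(mapping, u, target):
                mapping[u] = target
                found = backtrack(i + 1, mapping)
                if found is not None:
                    return found
                del mapping[u]
        return None

    return backtrack(0, {})


def support(H, database):
    """supp(H): number of database graphs in which H occurs."""
    return sum(1 for G in database if find_embedding(H, G) is not None)


# Hypothetical data: G is the chain C -s- N -d- C, H is the fragment C -d- N.
G = ({1: "C", 2: "N", 3: "C"}, {frozenset((1, 2)): "s", frozenset((2, 3)): "d"})
G2 = ({1: "C", 2: "C"}, {frozenset((1, 2)): "s"})
H = ({"x": "C", "y": "N"}, {frozenset(("x", "y")): "d"})
embedding = find_embedding(H, G)
```

The search first tries to map x to vertex 1, fails on the edge label, backtracks, and succeeds with x mapped to vertex 3.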

SLIDE 22

Frequent subgraph mining

Graph isomorphism
  $G$ and $H$ are isomorphic if there exists a subgraph isomorphism from $G$ to $H$ and from $H$ to $G$
  [Figure: two isomorphic graphs, with the mapping f(1) = A, f(2) = C, f(3) = D, f(4) = B, f(5) = F, f(6) = E]

SLIDE 23

Frequent subgraph mining

Subgraph isomorphism
  Testing whether there is a subgraph isomorphism between two graphs is generally NP-complete
  Special cases: linear complexity for planar graphs (e.g. paths, trees, grids) when the pattern has fixed size
Therefore:
  Testing whether a subgraph occurs in the database is NP-complete
  Testing whether a subgraph is isomorphic to an already identified subgraph is costly as well: no polynomial-time algorithm is known for graph isomorphism in general

SLIDE 24

Frequent subgraph mining

The a-priori property
  No supergraph of an infrequent graph can be frequent
  All subgraphs of a frequent graph are frequent
AGM [Inokuchi et al., 2000], FSG [Kuramochi and Karypis, 2001]
  Growing from $k$ to $k + 1$ isn't trivial
  Eliminating non-frequent subgraphs of size $k + 1$ involves costly subgraph isomorphism tests
Canonical representations of graphs
  More difficult than with item sets
  Spanning trees
  Adjacency matrices

SLIDE 25

gSpan

[Yan and Han, 2002]

Spanning tree
  A graph $G$ is called a tree if for any pair of vertices of $G$, there exists one and only one path connecting them in $G$
  A spanning tree of $G$ is a subgraph $S$ of $G$ that is a tree whose vertices are the vertices of $G$, i.e. $V_S = V_G$

[Figure: a graph G and two spanning trees of G]

SLIDE 26

gSpan

DFS trees
  Explore $G$ in DFS order
  One graph can have several DFS trees
  Order vertices in discovery order $<_V$
  $v_0$ is called the root
  $v_n$ is called the right-most vertex
  Right-most path: the path from $v_0$ to $v_n$ in the DFS tree
  Forward edges: edges in the DFS tree, $(i, j) : v_i <_V v_j$
  Backward edges: edges not in the DFS tree, $(i, j) : v_j <_V v_i$
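These notions can be computed with an ordinary depth-first search. The sketch below assumes a simple, connected, undirected graph given as adjacency lists; the function name and the four-vertex example are hypothetical. Changing the order of the neighbour lists yields a different DFS tree of the same graph.

```python
def dfs_tree(adj, root):
    """Return (discovery order, forward edges, backward edges, right-most path).

    Edges are reported as pairs of discovery indices (i, j).
    """
    index = {root: 0}
    parent = {root: None}
    forward, backward = [], []

    def visit(u):
        for w in adj[u]:
            if w not in index:
                # Tree edge: w is discovered from u, so index[u] < index[w].
                index[w] = len(index)
                parent[w] = u
                forward.append((index[u], index[w]))
                visit(w)
            elif w != parent[u] and index[w] < index[u]:
                # Non-tree edge back to an earlier vertex: index[u] > index[w].
                backward.append((index[u], index[w]))

    visit(root)
    order = sorted(index, key=index.get)

    # Right-most path: climb from the last discovered vertex up to the root.
    path, v = [], order[-1]
    while v is not None:
        path.append(index[v])
        v = parent[v]
    return order, forward, backward, path[::-1]


# Hypothetical graph: triangle a-b-c with a pendant vertex d attached to b.
adj = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
order, forward, backward, rightmost_path = dfs_tree(adj, "a")
```

Here d is the right-most vertex, and the right-most path visits the discovery indices 0, 1, 3.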

SLIDE 27

gSpan

Ordering edges
  $v_{i_1} = v_{i_2}$ and $v_{j_1} <_V v_{j_2}$ ⇒ $(v_{i_1}, v_{j_1}) <_E (v_{i_2}, v_{j_2})$
  $v_{i_1} <_V v_{j_1}$ and $v_{j_1} = v_{i_2}$ ⇒ $(v_{i_1}, v_{j_1}) <_E (v_{i_2}, v_{j_2})$
  $(v_{i_1}, v_{j_1}) <_E (v_{i_2}, v_{j_2})$ and $(v_{i_2}, v_{j_2}) <_E (v_{i_3}, v_{j_3})$ ⇒ $(v_{i_1}, v_{j_1}) <_E (v_{i_3}, v_{j_3})$
Examples
  $(v_0, v_1) <_E (v_0, v_2)$
  $(v_0, v_2) <_E (v_2, v_3)$
  $(v_0, v_1) <_E (v_2, v_3)$

SLIDE 28

gSpan

DFS lexicographic order
  $\mathrm{code}(G, T) = (e_k)_{k = 1, \ldots, m}$ s.t. $e_k <_E e_{k+1}$ is the DFS code of the DFS tree $T$
  If $<_L$ is a linear order on the labels, the lexicographic combination of $<_E$ and $<_L$ is a linear order $\prec_T$ over $E \times L \times L \times L$
  Let $\alpha = (a_1, a_2, \ldots, a_{m_\alpha})$ and $\beta = (b_1, b_2, \ldots, b_{m_\beta})$ be 2 DFS codes. $\alpha \leq \beta$ iff
    $\exists t$, $0 \leq t \leq \min(m_\alpha, m_\beta)$ s.t. $a_k = b_k \ \forall k < t$ and $a_t \prec_T b_t$
    or $a_k = b_k \ \forall k \leq m_\alpha$ and $m_\alpha \leq m_\beta$
Minimum DFS code
  The minimum DFS code is a canonical label of $G$:

$$\min \{ \mathrm{code}(G, T) : T \text{ spanning tree of } G \}$$
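The comparison of two DFS codes follows the usual lexicographic pattern: the first position where the codes differ decides, and a prefix is smaller than its extensions. The sketch below implements only that outer rule. The element order is passed in as a function; the demo uses plain Python tuple comparison as a stand-in, which is an assumption for illustration and not gSpan's actual order on edges.

```python
def dfs_lex_leq(alpha, beta, edge_less):
    """alpha <= beta in the lexicographic order induced by the strict order edge_less."""
    for a, b in zip(alpha, beta):
        if a != b:
            # First differing position t: alpha <= beta iff a_t precedes b_t.
            return edge_less(a, b)
    # One code is a prefix of the other: the shorter one is smaller.
    return len(alpha) <= len(beta)


def tuple_less(a, b):
    """Stand-in element order (assumed for the demo only)."""
    return a < b


# Hypothetical codes; each entry is (i, j, label_i, edge_label, label_j).
alpha = [(0, 1, "A", "a", "A"), (1, 2, "A", "a", "B")]
beta = [(0, 1, "A", "a", "A"), (1, 2, "A", "b", "B")]

smaller = alpha if dfs_lex_leq(alpha, beta, tuple_less) else beta
```

Taking the minimum of all DFS codes of a graph under the real order would give its canonical label; enumerating those codes is not attempted here.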

SLIDE 29

gSpan

Valid minimum DFS codes
  $(e_1, \ldots, e_m, e)$ is a child of $(e_1, \ldots, e_m)$
  $(e_1, \ldots, e_m, e)$ is a minimum DFS code only if $(e_1, \ldots, e_m)$ is a minimum DFS code and $e_m \prec_T e$
  i.e. $e$ must grow from a vertex on the rightmost path of the tree coded by $(e_1, \ldots, e_m)$.
  Backward edges can only grow from the rightmost vertex.
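The restriction to the rightmost path can be illustrated by enumerating the positions where a new edge may be attached. The sketch below ignores labels and does not check minimality of the resulting code; the function name and the example pattern are hypothetical, and the order in which candidates are listed is arbitrary.

```python
def rightmost_extensions(rightmost_path, existing_edges):
    """List candidate extension edges as pairs of DFS indices.

    rightmost_path: DFS indices from the root to the rightmost vertex, e.g. [0, 1, 3].
    existing_edges: set of frozenset({i, j}) for the edges already in the pattern.
    """
    rightmost = rightmost_path[-1]
    new_vertex = 1 + max(v for edge in existing_edges for v in edge)

    # Backward edges start at the rightmost vertex and end on the rightmost path.
    backward = [(rightmost, v) for v in rightmost_path[:-1]
                if frozenset((rightmost, v)) not in existing_edges]
    # Forward edges start anywhere on the rightmost path and create a new vertex.
    forward = [(v, new_vertex) for v in reversed(rightmost_path)]
    return backward, forward


# Hypothetical pattern: tree edges (0,1), (1,2), (1,3) plus the backward edge (2,0).
edges = {frozenset(e) for e in [(0, 1), (1, 2), (1, 3), (2, 0)]}
backward, forward = rightmost_extensions([0, 1, 3], edges)
```

Vertex 2 is not on the rightmost path, so no extension is generated from it.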

SLIDE 30

gSpan

Extending subgraphs
  If the extension edge is not a rightmost path extension, then the resulting code word is certainly not canonical.
  If the extension edge is a rightmost path extension, then the resulting code word may or may not be canonical.
DFS code tree
  Analogous to prefix tree
  Each node is a DFS code
  As above, $(e_1, \ldots, e_m, e)$ is a child of $(e_1, \ldots, e_m)$
  DFS traversal of the DFS code tree ⇒ DFS lexicographic order

SLIDE 31

gSpan

gSpan idea
  From the set of vertex and edge labels, build the DFS code tree of frequent subgraphs
  If vertices are labeled by {A, B, C, ...} and edges by {a, b, c, ...}:
    The 1st iteration looks for all frequent subgraphs containing AaA
    The 2nd iteration looks for all frequent subgraphs containing AaB
    ...
  At each iteration, subgraph_mining is called to grow subgraphs
  Growing stops when (a) frequency drops below $\theta_{\min}$ or (b) a non-minimal code is created

SLIDE 32

gSpan

subgraph_mining

subgraph_mining(D = {G1, G2, ..., GN}, S, s):
    if s is not minimal:
        return
    S ← S ∪ {s}
    for G ∈ D:
        for each instance of s in G:
            for each child c of this instance of s:
                supp(c)++        (at most once per graph G)
    for each child c:
        if supp(c) ≥ minsupp:
            s ← c
            subgraph_mining(D, S, s)

SLIDE 33

gSpan

Runtime comparison of FSG and gSpan
  N: number of labels
  I: average size of potentially frequent subgraphs
  T: average number of edges per frequent subgraph
  200 potentially frequent subgraphs, $10^4$ graphs, $\theta_{\min} = 0.01$
[Figure: runtime plot, FSG vs. gSpan]

SLIDE 34

Enumerating subgraphs

Canonical form
  Adjacency matrix: AGM, FSG, FFSM [Huan et al., 2003]
  Spanning tree: gSpan
Graph exploration
  BFS ("level-wise" search): MoSS/MoFa [Borgelt and Berthold, 2002], AGM
  DFS: gSpan
  "Easy" subgraphs (paths, trees) first: GASTON [Nijssen and Kok, 2005]
Avoiding redundancy
  Canonical form pruning
  Repository of processed subgraphs: MoSS/MoFa, GASTON

SLIDE 35

Enumerating subgraphs

[Figure: runtime per pattern (ms) vs. minimum support (%)] [Wörlein et al., 2005]

SLIDE 36

Enumerating subgraphs

[Figure: memory usage (GB) vs. minimum support (%)] [Wörlein et al., 2005]

SLIDE 37

Pattern summarization

Large number of frequent patterns
  Remember: all subgraphs of a frequent subgraph are frequent
  AIDS antiviral screen dataset, ∼400 compounds, support 5% ⇒ more than $10^6$ frequent subgraphs
Problems:
  Interpreting frequent patterns
  Reducing the number of frequent patterns
  Setting the minimum support

SLIDE 38

Pattern summarization

Representative patterns
  Top k patterns [Xin et al., 2006]
  Cluster centroids [Chen et al., 2008]
    Cluster based on pattern similarity
    Cluster based on data similarity

SLIDE 39

Closed and maximal subgraphs

Closed graph
  A frequent graph $G$ is closed if there exists no supergraph of $G$ that carries the same support as $G$
  If some of $G$'s subgraphs have the same support, it is unnecessary to output these subgraphs (non-closed graphs)
  Lossless compression: still ensures that the mining result is complete
Maximal frequent graph
  A frequent graph $G$ is maximal if there exists no supergraph of $G$ that is frequent
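The two definitions can be tried out on item sets, where "is a proper subset of" plays the role of "is a proper subgraph of". This substitution is an assumption made to keep the example short. The supports below were computed from the transaction list T of slide 14 with an assumed threshold of 3; the function name is hypothetical.

```python
def closed_and_maximal(supports):
    """supports: {frozenset: support} of ALL frequent patterns.

    Returns (closed patterns, maximal patterns).
    """
    closed, maximal = set(), set()
    for pattern, count in supports.items():
        # A super-pattern with the same support is itself frequent, so it is
        # enough to look at the frequent super-patterns.
        supers = [q for q in supports if pattern < q]
        if not supers:
            maximal.add(pattern)
        if all(supports[q] != count for q in supers):
            closed.add(pattern)
    return closed, maximal


supports = {frozenset(k): v for k, v in {
    "a": 4, "ac": 3, "b": 11, "bc": 7, "bcd": 3,
    "bd": 4, "c": 9, "cd": 3, "d": 7,
}.items()}
closed, maximal = closed_and_maximal(supports)
```

Here cd is not closed because bcd has the same support, and the only maximal patterns are ac and bcd. Every maximal pattern is closed, but the closed set also keeps the exact supports of all frequent patterns recoverable.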

SLIDE 40

Closed and maximal subgraphs

[Figure: three graphs (A), (B), (C) and three patterns (D), (E), (F)]
  (D) is a subgraph of A, B, C, but so is its supergraph E: D and E have the same support (3). D is not closed.
  (E) No supergraph of E is a subgraph of all 3 graphs, therefore E is closed.
  (F) is a subgraph of A and B. F is closed as none of its supergraphs has support 2.
If $\theta_{\min} = 70\%$, E is maximal: it is frequent and none of its supergraphs is frequent.

SLIDE 41

CloseGraph

[Yan and Han, 2003]

Extension of gSpan to avoid growing subgraphs guaranteed to have only non-closed descendants

Early termination
  If wherever graph $H_1$ occurs in the data, graph $H_2 = H_1 \diamond e$ occurs as well, then for any graph $H$, if $H_1$ is a subgraph of $H$ and $H_2$ is not, then $H$ is not closed.
  [Figure: graphs (1) to (4)] (1) and (2) systematically co-occur in D. Therefore (3) cannot be closed: indeed (4) is a supergraph of (3) with identical support. We need to grow from (2) and not from (1).

SLIDE 42

CloseGraph

Failure of early termination
  [Figure: graphs (1) to (3)] x − a − y and y − b − x co-occur in (1) and (2)
  If we only extend from x − a − y − b − x, then we miss pattern (3), which also occurs in (1) and (2)
  Need to distinguish between $H \diamond_e e$ (creates a new vertex) and $H \diamond_b e$ (does not create a new vertex)

SLIDE 43

References and further reading

[Agrawal et al., 1994] Agrawal, R., Srikant, R. et al. (1994). Fast algorithms for mining association rules. In VLDB, vol. 1215, pp. 487–499. (slide 8)
[Borgelt and Berthold, 2002] Borgelt, C. and Berthold, M. R. (2002). Mining molecular fragments: Finding relevant substructures of molecules. In ICDM, pp. 51–58. (slide 34)
[Chen et al., 2008] Chen, C., Lin, C. X., Yan, X. and Han, J. (2008). On effective presentation of graph patterns: a structural representative approach. In CIKM, pp. 299–308. (slide 38)
[Huan et al., 2003] Huan, J., Wang, W. and Prins, J. (2003). Efficient mining of frequent subgraphs in the presence of isomorphism. In ICDM, pp. 549–552. (slide 34)
[Inokuchi et al., 2000] Inokuchi, A., Washio, T. and Motoda, H. (2000). An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data. In Principles of Data Mining and Knowledge Discovery, vol. 1910 of LNCS, pp. 13–23. Springer. (slide 24)
[Kuramochi and Karypis, 2001] Kuramochi, M. and Karypis, G. (2001). Frequent subgraph discovery. In ICDM, pp. 313–320. (slide 24)
[Nijssen and Kok, 2005] Nijssen, S. and Kok, J. N. (2005). Frequent graph mining and its application to molecular databases. Electronic Notes in Theoretical Computer Science 127. (slide 34)
[Wörlein et al., 2005] Wörlein, M., Meinl, T., Fischer, I. and Philippsen, M. (2005). A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston. In PKDD, pp. 392–403. Springer. (slides 35, 36)
[Xin et al., 2006] Xin, D., Cheng, H., Yan, X. and Han, J. (2006). Extracting redundancy-aware top-k patterns. In SIGKDD, pp. 444–453. (slide 38)
[Yan and Han, 2002] Yan, X. and Han, J. (2002). gSpan: Graph-based substructure pattern mining. In ICDM, pp. 721–724. (slide 25)
[Yan and Han, 2003] Yan, X. and Han, J. (2003). CloseGraph: mining closed frequent graph patterns. In SIGKDD, pp. 286–295. (slide 41)

SLIDE 44

The end

Next topic: Classification in Bioinformatics