SLIDE 1

2.5 Association Rule Mining

based on: Chloé-Agathe Azencott and Karsten Borgwardt. Course 'Data Mining in Bioinformatics'. Chapter Graph Mining. 2012

Department Biosysteme Karsten Borgwardt Data Mining 2 Course, Basel Spring Semester 2016 179 / 230

SLIDE 2

Keyword co-occurrence

Goals

To understand the link between keyword co-occurrence, association rule mining, and frequent itemset mining
To understand how the computation of frequent itemsets can be sped up

SLIDE 3

Keyword co-occurrence

Problem

Find sets of keywords that often co-occur
Common problem in biomedical literature: find associations between genes, proteins or other entities using co-occurrence search

Keyword co-occurrence search is an instance of a more general problem in data mining, called association rule mining.

SLIDE 4

Association Rules

Definitions

Let I = {I1, I2, . . . , Im} be a set of items (keywords)
Let D be the database of transactions T (a collection of documents)
A transaction T ∈ D is a set of items: T ⊆ I (a document is a set of keywords)
Let A be a set of items with A ⊆ T. An association rule is an implication of the form
A ⊆ T ⇒ B ⊆ T, (1)
where A, B ⊆ I and A ∩ B = ∅

SLIDE 5

Association Rules

Support and Confidence

The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B:

support(A ⇒ B) = |{T ∈ D | A ⊆ T ∧ B ⊆ T}| / |{T ∈ D}| (2)

The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B:

confidence(A ⇒ B) = |{T ∈ D | A ⊆ T ∧ B ⊆ T}| / |{T ∈ D | A ⊆ T}| (3)
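Equations (2) and (3) can be computed directly; a minimal Python sketch (not part of the slides; the toy keyword database is invented for illustration):

```python
# Hypothetical toy database D: each transaction (document) is a set of keywords.
D = [
    {"gene", "protein"},
    {"gene", "protein", "pathway"},
    {"gene"},
    {"protein", "pathway"},
]

def support(A, B, D):
    """Eq. (2): fraction of transactions containing A ∪ B."""
    return sum(1 for T in D if A <= T and B <= T) / len(D)

def confidence(A, B, D):
    """Eq. (3): among transactions containing A, the fraction also containing B."""
    n_A = sum(1 for T in D if A <= T)
    n_AB = sum(1 for T in D if A <= T and B <= T)
    return n_AB / n_A

print(support({"gene"}, {"protein"}, D))     # 0.5
print(confidence({"gene"}, {"protein"}, D))  # 0.666...
```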

SLIDE 6

Association Rules

Strong rules

Rules that satisfy both a minimum support threshold (minsup) and a minimum confidence threshold (minconf) are called strong association rules; these are the ones we are after!

Finding strong rules

1. Search for all frequent itemsets (sets of items that occur in at least minsup % of all transactions)
2. Generate strong association rules from the frequent itemsets

SLIDE 7

Association Rules: Frequent pattern mining

Frequent item set mining

Market basket analysis: find items that are frequently purchased together

Given:
a set B = {i1, i2, . . . , in} of items
a list T = {t1, t2, . . . , tm} of transactions tj ⊆ B
a minimum number of occurrences smin ∈ N

Find the set of frequent item sets, i.e. F(smin) = {I ⊆ B : |{k : I ⊆ tk}| ≥ smin}
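Naively, F(smin) can be found by testing every non-empty subset of B against every transaction; a brief Python sketch with an invented toy transaction list:

```python
from itertools import combinations

# Toy data (invented): four transactions over the items {a, b, c}.
transactions = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"b"}]

def frequent_itemsets(transactions, smin):
    """Enumerate F(smin): non-empty I ⊆ B occurring in at least smin transactions."""
    B = sorted(set().union(*transactions))
    F = []
    for k in range(1, len(B) + 1):
        for I in combinations(B, k):
            if sum(1 for t in transactions if set(I) <= t) >= smin:
                F.append(frozenset(I))
    return F

# {a}, {b}, {c}, {a,b} and {b,c} each occur at least twice
print(sorted("".join(sorted(I)) for I in frequent_itemsets(transactions, 2)))
# ['a', 'ab', 'b', 'bc', 'c']
```

This brute force is exponential in |B|, which is exactly what the speed-ups on the following slides address.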

SLIDE 8

Association Rules: Apriori [Agrawal et al., 1994]

Brute force approach

Enumerate all 2^n subsets of B
Count how often each of them is included in each of t1, . . . , tm
Generally infeasible

The Apriori property

If an itemset A is frequent, then any subset B of A (B ⊆ A) is frequent as well. If B is infrequent, then any superset A of B (A ⊇ B) is infrequent as well.

SLIDE 9

Association Rules

Apriori Pseudocode

1. Determine the frequent items (k-itemsets with k = 1)
2. Join all pairs of frequent k-itemsets that differ in at most one item, giving candidates Ck+1 for being frequent (k+1)-itemsets
3. Check the frequency of these candidates Ck+1: the frequent ones form the frequent (k+1)-itemsets (trick: immediately discard any candidate that contains an infrequent k-itemset)
4. Repeat from Step 2 until no more candidate is frequent.
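The four steps above can be sketched in a few lines of Python (a minimal illustration, not an optimized implementation; minsup is taken as an absolute occurrence count here, and the toy transactions are invented):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise Apriori sketch; minsup is an absolute occurrence count."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Step 1: frequent 1-itemsets
    items = set().union(*transactions)
    levels = [{frozenset([i]) for i in items if count(frozenset([i])) >= minsup}]
    while levels[-1]:
        # Step 2: join pairs of frequent k-itemsets differing in one item
        candidates = {a | b for a in levels[-1] for b in levels[-1]
                      if len(a | b) == len(a) + 1}
        # Trick from Step 3: discard candidates with an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in levels[-1]
                             for s in combinations(c, len(c) - 1))}
        # Step 3: count the surviving candidates; Step 4: repeat until empty
        levels.append({c for c in candidates if count(c) >= minsup})
    return [s for level in levels for s in level]

transactions = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"b"}]
print(sorted("".join(sorted(s)) for s in apriori(transactions, 2)))
# ['a', 'ab', 'b', 'bc', 'c']
```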

SLIDE 10

Association Rules: A Priori

Generating unique candidates

There are k! ways of generating a single set of k items
Ensure we generate each set only once ⇒ Idea: assign a unique parent set to each set

Canonical form

The set of possible parents of an item set I is the set of its maximal proper subsets: {J ⊂ I | ∄K : J ⊂ K ⊂ I}
Put an ordering on B: i1 < i2 < · · · < in
Define the canonical parent of I as pc(I) = I \ {max_{a∈I} a}

SLIDE 11

Association Rules: A Priori

Canonical code words

Code word for I ⊆ B: any word w on the alphabet B
Canonical code word wc(I) of I: the smallest of these words in lexicographic order, e.g. {a, c, b, e} → abce
The canonical parent pc(I) of I is described by the longest proper prefix of wc(I)
Prefix property: the longest proper prefix of a canonical code word is itself a canonical code word. Equivalently, any prefix of a canonical code word is a canonical code word.
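For item sets this machinery reduces to sorting; a small sketch (function names are invented for illustration):

```python
def canonical_word(itemset):
    """wc(I): the lexicographically smallest code word, i.e. the sorted items."""
    return "".join(sorted(itemset))

def canonical_parent_word(itemset):
    """pc(I) = I minus its maximal element: the longest proper prefix of wc(I)."""
    return canonical_word(itemset)[:-1]

print(canonical_word({"c", "a", "e", "b"}))         # abce
print(canonical_parent_word({"c", "a", "e", "b"}))  # abc
```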

SLIDE 12

Association Rules: A Priori

Candidate set generation

From frequent item sets of size k − 1, construct item sets of size k by appending (frequent) items to their canonical code words
Only do so for items greater than the last letter of the canonical code word
abe → abef, abeg, but not abec
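The "only extend with larger items" rule, as a one-line sketch (the alphabet of frequent items is invented for the example):

```python
def extensions(code_word, frequent_items):
    """Append only items greater than the last letter, so each set arises once."""
    return [code_word + x for x in frequent_items if x > code_word[-1]]

print(extensions("abe", "abcefg"))  # ['abef', 'abeg'] -- 'abec' is never generated
```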

SLIDE 13

Association Rules: A Priori

Prefix tree

[Figure: full prefix tree for B = {a, b, c, d}, with nodes a, b, c, d, ab, ac, ad, bc, bd, cd, abc, abd, acd, bcd, abcd]

SLIDE 14

Association Rules: A Priori

Pruning the prefix tree
Only generate unique item sets
A-priori property ⇒ prune branches at infrequent items
Size-based pruning

[Figure: pruned prefix tree over B = {a, b, c, d} with the occurrence count of each itemset at its node]

T = {{a, b}, {a, b, c}, {b, c}, {b}, {b, d}, {d}, {a, c}, {b, c}, {d}, {a, c}, {b, c}, {b, c, d}, {d}, {b}, {b, c, d}, {b, c, d}}

SLIDE 15

Association Rules: Frequent Pattern Mining

Exploring the search tree

Breadth-First Search: find all frequent sets of size k before moving on to size k + 1 → A-priori
Depth-First Search: find all frequent sets containing element a before moving on to those that contain b but do not contain a
Advantage: divide-and-conquer strategy, requires less memory → Eclat, FP-growth, ...

SLIDE 16

Association Rules

Summary

Keyword co-occurrence is a way to mine relationships between concepts from text databases.
It is an instance of association rule mining, which tries to find associations between the occurrences of sets of words.
The classic algorithm for finding association rules is the Apriori algorithm, which enumerates all frequent itemsets in a branch-and-bound fashion.

SLIDE 17
3. Graph Mining

based on: Chloé-Agathe Azencott and Karsten Borgwardt. Course 'Data Mining in Bioinformatics'. Chapter Graph Mining. 2012

SLIDE 18

Graphs are everywhere

Coexpression networks, social networks, program flow, protein structures, chemical compounds

SLIDE 19

Mining graph data

Graph comparison

Example: compare protein-protein interaction networks (PPIN) between species

Graph classification / regression

Predict properties of objects represented as graphs
Example: predict the toxicity of a molecular compound, the functionality of a protein

Graph node classification / regression

Predict properties of objects connected on a graph
Example: predict the functionality of a protein, classify pixels in remote sensing images

SLIDE 20

Mining graph data

Graph compression

Representing graphs compactly
Example: store and mine web data

Graph clustering

Finding dense subnetworks of graphs
Example: find groups in social networks

Link prediction

Predicting relationships between nodes of the graph
Example: predict who should be added to your social network, predict interactions

SLIDE 21

Graph pattern mining

Graph pattern mining

Find frequent / informative graph patterns
Summarize patterns
Approximate patterns

Applications

Finding conserved biological subnetworks
Finding functional modules
Program control flow analysis
Intrusion detection
Building blocks for graph classification, clustering, compression, comparison

SLIDE 22

3.1 Frequent Subgraph Mining

SLIDE 23

Graphs

Definitions

A graph is an ordered pair G = (V , E)

V is a set of vertices (or nodes)
E ⊆ V × V is a set of edges (or links)

Edges can be
ordered → G is directed
or not → G is undirected

A labeled graph is an ordered triplet G = (V , E, l)

V is a set of vertices (or nodes)
E ⊆ V × V is a set of edges (or links)
l : V ∪ E → A∗ assigns labels to vertices and edges

SLIDE 24

Frequent subgraph mining

Frequent subgraphs

Given:

a set D = {G1, G2, . . . , GN} of graphs
a minimum frequency θmin ∈ [0, 1]

Find the set of frequent subgraphs, i.e. F(θmin) = {H : |{i : H subgraph of Gi}| ≥ N · θmin}
The frequency of subgraph H is called the support of H: supp(H) = |{i : H subgraph of Gi}|
θmin is called the minimum support
Often the focus is on connected subgraphs

SLIDE 25

Frequent subgraph mining

Example: call graphs and the frequent subgraphs mined from them [figures omitted]

SLIDE 26

Frequent subgraph mining

Example: chemical compounds (caffeine, theobromine, sildenafil, adenine); frequent subgraphs: imidazole, purine [structure figures omitted]

SLIDE 27

Frequent subgraph mining

Subgraph isomorphism

Let G = (VG, EG, lG) and H = (VH, EH, lH) be two labeled graphs. A subgraph isomorphism from H to G (or an occurrence of H in G) is an injective function f : VH → VG such that:

∀v ∈ VH : lH(v) = lG(f(v))
∀(u, v) ∈ EH : (f(u), f(v)) ∈ EG and lH(u, v) = lG(f(u), f(v))

There may be several (many) ways to map H to G
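The definition can be turned into a brute-force occurrence search for tiny graphs (illustrative only; the dictionary-based graph encoding and the toy C-O-C example are invented, and the loop over all injective maps is exponential, in line with the NP-completeness discussed on the next slide):

```python
from itertools import permutations

def occurrences(H_nodes, H_edges, G_nodes, G_edges):
    """All injective f: V_H -> V_G with matching vertex and edge labels."""
    found = []
    for image in permutations(G_nodes, len(H_nodes)):
        f = dict(zip(H_nodes, image))
        if any(H_nodes[v] != G_nodes[f[v]] for v in H_nodes):
            continue  # vertex labels must agree
        if all((f[u], f[v]) in G_edges and lbl == G_edges[(f[u], f[v])]
               for (u, v), lbl in H_edges.items()):
            found.append(f)  # every H edge maps to a G edge with the same label
    return found

# G: a C-O-C path (undirected, so edges stored in both directions);
# H: a single C-O edge. H occurs twice in G.
G_nodes = {1: "C", 2: "O", 3: "C"}
G_edges = {(1, 2): "s", (2, 1): "s", (2, 3): "s", (3, 2): "s"}
H_nodes = {"x": "C", "y": "O"}
H_edges = {("x", "y"): "s"}
print(occurrences(H_nodes, H_edges, G_nodes, G_edges))  # two occurrences
```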

SLIDE 28

Frequent subgraph mining

Graph isomorphism

G and H are isomorphic if there exists a subgraph isomorphism from G to H and from H to G
[Figure: example mapping f(1) = A, f(2) = C, f(3) = D, f(4) = B, f(5) = F, f(6) = E]

SLIDE 29

Frequent subgraph mining

Subgraph isomorphism

Testing whether there is a subgraph isomorphism between two graphs is generally NP-complete
Special cases: linear complexity for planar graphs (e.g. paths, trees, grids)
Therefore:
Testing whether a subgraph occurs in the database is NP-complete
Testing whether a subgraph is isomorphic to an already identified subgraph is NP-complete as well

SLIDE 30

Frequent subgraph mining

The a-priori property

No supergraph of an infrequent graph can be frequent
All subgraphs of a frequent graph are frequent
AGM [Inokuchi et al., 2000], FSG [Kuramochi and Karypis, 2001]

Growing from k to k + 1 is not trivial
Eliminating non-frequent subgraphs of size k + 1 involves costly subgraph isomorphism tests

Canonical representations of graphs

More difficult than with item sets. Two options:
spanning trees
adjacency matrices

SLIDE 31

gSpan

[Yan and Han, 2002]

Spanning tree

A graph G is called a tree if for any pair of vertices of G there exists one and only one path connecting them in G
A spanning tree of G is a subgraph S of G that is a tree whose vertices are the vertices of G, i.e. VS = VG

SLIDE 32

gSpan

DFS trees

Explore G in DFS order
One graph can have several DFS trees
Order vertices in discovery order <V
v0 is called the root; vn is called the right-most vertex
right-most path: straight path v0 → vn
forward edges: edges in the DFS tree, (i, j) : vi <V vj
backward edges: edges not in the DFS tree, (i, j) : vj <V vi

SLIDE 33

gSpan

Ordering edges

vi1 = vi2 and vj1 <V vj2 ⇒ (vi1, vj1) <E (vi2, vj2)
vi1 <V vj1 and vj1 = vi2 ⇒ (vi1, vj1) <E (vi2, vj2)
(vi1, vj1) <E (vi2, vj2) and (vi2, vj2) <E (vi3, vj3) ⇒ (vi1, vj1) <E (vi3, vj3)
Example: (v0, v1) <E (v0, v2), (v0, v2) <E (v2, v3), hence (v0, v1) <E (v2, v3)

SLIDE 34

gSpan

DFS lexicographic order

code(G, T) = (ek)k=1,...,m s.t. ek <E ek+1 is the DFS code of the DFS tree T
If <L is a linear order on the labels, the lexicographic combination of <E and <L is a linear order ≺T over E × L × L × L
Let α = (a1, a2, . . . , amα) and β = (b1, b2, . . . , bmβ) be two DFS codes. α ≺ β iff
∃t, 0 ≤ t ≤ min(mα, mβ) s.t. ak = bk ∀k < t and at ≺T bt
or ak = bk ∀k ≤ mα and mα ≤ mβ

Minimum DFS code

The minimum DFS code is a canonical label of G: min{code(G, T) : T spanning tree of G}

SLIDE 35

gSpan

Valid minimum DFS codes

(e1, . . . , em, e) is a child of (e1, . . . , em)
(e1, . . . , em, e) is a minimum DFS code if (e1, . . . , em) is a minimum DFS code and em ≺T e,
i.e. e must grow from a vertex on the rightmost path of the tree coded by (e1, . . . , em). Backward edges can only grow from the rightmost vertex.

SLIDE 36

gSpan

Extending subgraphs

If the extension edge is not a rightmost path extension, then the resulting code word is certainly not canonical. If the extension edge is a rightmost path extension, then the resulting code word may or may not be canonical.

DFS code tree

Analogous to the prefix tree
Each node is a DFS code
As above, (e1, . . . , em, e) is a child of (e1, . . . , em)
DFS traversal of the DFS code tree ⇒ DFS lexicographic order

SLIDE 37

gSpan

gSpan idea

From the set of vertex and edge labels, build the DFS tree of frequent subgraphs
If vertices are labeled by {A, B, C, . . . } and edges by {a, b, c, . . . }:
The 1st iteration looks for all frequent subgraphs containing AaA
The 2nd iteration looks for all frequent subgraphs containing AaB
. . .
At each iteration, subgraph mining is called to grow the subgraphs
Growing stops when (a) the frequency drops below θmin or (b) a non-minimal code is created

SLIDE 38

gSpan

subgraph_mining

subgraph_mining(D = {G1, G2, . . . , GN}, S, s):
    if s is not minimal:
        return
    S ← S ∪ {s}
    for each G ∈ D:
        for each instance of s in G:
            for each child c of this instance of s:
                supp(c)++
    for each child c:
        if supp(c) > minsupp:
            s ← c
            subgraph_mining(D, S, s)

SLIDE 39

gSpan

[Figure: runtime comparison of FSG and gSpan; N denotes the number of labels]

SLIDE 40

Enumerating subgraphs

Canonical form

Adjacency matrix: AGM, FSG, FFSM [Huan et al., 2003]
Spanning tree: gSpan

Graph exploration

BFS (“level-wise” search): MoSS/MoFa [Borgelt and Berthold, 2002], AGM
DFS: gSpan
“Easy” subgraphs (paths, trees) first: GASTON [Nijssen and Kok, 2005]

Avoiding redundancy

Canonical form pruning; repository of processed subgraphs: MoSS/MoFa, GASTON

SLIDE 41

Enumerating subgraphs

Runtime per pattern (ms) vs. minimum support (%)

Source: [Wörlein et al., 2005]

SLIDE 42

Enumerating subgraphs

Memory usage (GB) vs. minimum support (%)

Source: [Wörlein et al., 2005]

SLIDE 43

Pattern summarization

Large number of frequent patterns

Remember: all subgraphs of a frequent subgraph are frequent
AIDS antiviral screen dataset, ∼400 compounds, support 5% ⇒ more than 10^6 frequent subgraphs
Problems:
Interpreting the frequent patterns
Reducing the number of frequent patterns
Setting the minimum support

SLIDE 44

Pattern summarization

Representative Patterns

Top-k patterns [Xin et al., 2006]
Cluster centroids [Chen et al., 2008]:
cluster based on pattern similarity
cluster based on data similarity

SLIDE 45

Closed and maximal subgraphs

Closed graph

A frequent graph G is closed if there exists no supergraph of G that has the same support as G
If some of G’s subgraphs have the same support as G, it is unnecessary to output these subgraphs (non-closed graphs)
Lossless compression: still ensures that the mining result is complete
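The closedness condition is easy to state on itemsets, where supersets play the role of supergraphs; a minimal sketch with invented support counts:

```python
def closed_only(freq):
    """freq: dict itemset -> support. Keep sets with no superset of equal support."""
    return {I: s for I, s in freq.items()
            if not any(I < J and s == sJ for J, sJ in freq.items())}

# Invented supports: "a" and its superset "ab" both occur 3 times,
# so "a" is not closed and can be dropped without losing information.
freq = {frozenset("b"): 11, frozenset("c"): 9, frozenset("bc"): 7,
        frozenset("a"): 3, frozenset("ab"): 3}
print(sorted("".join(sorted(I)) for I in closed_only(freq)))
# ['ab', 'b', 'bc', 'c']
```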

Maximal frequent graph

A frequent graph G is maximal if there exists no supergraph of G that is frequent

SLIDE 46

Closed and maximal subgraphs

[Figure: three example graphs (A), (B), (C)]

SLIDE 47

Closed and maximal subgraphs

(D)

is a subgraph of A, B, C.

(E)

No supergraph of E is a subgraph of all 3 graphs, therefore E is closed. D and E have the same support (3). Hence D is not closed.

(F)

is a subgraph of A and B. F is closed as none of its supergraphs has support 2.

If θmin = 70%, E is maximal: it is frequent and none of its supergraphs is frequent.

SLIDE 48

CloseGraph

[Yan and Han, 2003]

Extension of gSpan

Extension of gSpan to avoid growing subgraphs guaranteed to have only non-closed descendants

SLIDE 49

CloseGraph

[Yan and Han, 2003]

Early termination

If, wherever graph H1 occurs in the data, graph H2 = H1 ⋄ e occurs as well, then for any graph H: if H1 is a subgraph of H and H2 is not, then H is not closed.
Example: (1) and (2) systematically co-occur in D. Therefore (3) cannot be closed; indeed (4) is a supergraph of (3) with identical support. We need to grow from (2) and not from (1).

SLIDE 50

CloseGraph

Failure of early termination

x − a − y and y − b − x co-occur in (1) and (2)
If we only extend from x − a − y − b − x, then we miss pattern (3), which also co-occurs in (1) and (2)
Need to distinguish between H ⋄e e (creates a new vertex) and H ⋄b e (does not create a new vertex)

SLIDE 51

References and further reading I

Agrawal, R., Srikant, R. et al. (1994). Fast algorithms for mining association rules. In VLDB, vol. 1215, pp. 487–499.
Borgelt, C. and Berthold, M. R. (2002). Mining molecular fragments: Finding relevant substructures of molecules. In ICDM, pp. 51–58.
Chen, C., Lin, C. X., Yan, X. and Han, J. (2008). On effective presentation of graph patterns: a structural representative approach. In CIKM, pp. 299–308.
Huan, J., Wang, W. and Prins, J. (2003). Efficient mining of frequent subgraphs in the presence of isomorphism. In ICDM, pp. 549–552.
Inokuchi, A., Washio, T. and Motoda, H. (2000). An Apriori-based algorithm for mining frequent substructures from graph data. In Principles of Data Mining and Knowledge Discovery, vol. 1910 of LNCS, pp. 13–23. Springer.
Kuramochi, M. and Karypis, G. (2001). Frequent subgraph discovery. In ICDM, pp. 313–320.

SLIDE 52

References and further reading II

Nijssen, S. and Kok, J. N. (2005). Frequent graph mining and its application to molecular databases. Electronic Notes in Theoretical Computer Science 127.
Wörlein, M., Meinl, T., Fischer, I. and Philippsen, M. (2005). A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston. In PKDD, pp. 392–403. Springer.
Xin, D., Cheng, H., Yan, X. and Han, J. (2006). Extracting redundancy-aware top-k patterns. In SIGKDD, pp. 444–453.
Yan, X. and Han, J. (2002). gSpan: Graph-based substructure pattern mining. In ICDM, pp. 721–724.
Yan, X. and Han, J. (2003). CloseGraph: mining closed frequent graph patterns. In SIGKDD, pp. 286–295.