Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining - PowerPoint PPT Presentation

Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining Chloé-Agathe Azencott & Karsten Borgwardt February 10 to February 21, 2014 Machine Learning & Computational Biology Research Group Max Planck Institutes Tübingen and Eberhard Karls Universität Tübingen Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

Graphs are everywhere Coexpression network Social network Program flow Protein structure Chemical compound Karsten Borgwardt: Data Mining in Bioinformatics, Page 2

Mining graph data Graph comparison E.g. Compare PPIN between species Graph classification / regression Predict properties of objects represented as graphs E.g. Predict toxicity of molecular compound, functionality of protein Graph nodes classification / regression Predict properties of objects connected on a graph E.g. Predict functionality of protein, classify pixels in re- mote sensing images Karsten Borgwardt: Data Mining in Bioinformatics, Page 3

Mining graph data Graph compression Representing graphs compactly E.g. Store and mine web data Graph clustering Finding dense subnetworks of graphs E.g. Find groups in social networks Link prediction Predicting relationships between nodes of the graph E.g. Predict who should be added to your social network, predict interactions between proteins Karsten Borgwardt: Data Mining in Bioinformatics, Page 4

Graph pattern mining Graph pattern mining Find frequent / informative graph patterns Summarize patterns Approximate patterns Applications Finding biological conserved subnetworks Finding functional modules Program control flow analysis Intrusion detection Building blocks for graph classification, clustering, compression, comparison Karsten Borgwardt: Data Mining in Bioinformatics, Page 5

Frequent Pattern Mining Karsten Borgwardt: Data Mining in Bioinformatics, Page 6

Frequent pattern mining Frequent item set mining Market basket analysis Find items that are frequently purchased together Given a set B = { i 1 , i 2 , . . . , i n } of items a list T = { t 1 , t 2 , . . . , t m } of transactions t j ⊆ B a minimum number of occurences s min ∈ N Find the set of frequent item sets , i.e. F ( s min ) = { I ⊆ B : |{ k : I ⊆ t k } ≥ s min } Karsten Borgwardt: Data Mining in Bioinformatics, Page 7

A Priori [Agrawal et al. , 1994] Brute force approach Enumerate all 2 n subsets of B Count how often each of them is included in each of t 1 , . . . , t m Generally infeasible The a-priori property No superset of an infrequent item set can be frequent All subsets of a frequent item set are frequent Karsten Borgwardt: Data Mining in Bioinformatics, Page 8

A Priori The a-priori algorithm List all singletons, discard the infrequent ones Form pairs of frequent elements, discard infrequent ones ... Augment the sets of size k − 1 to form all sets of size k of frequent elements, discard infrequent ones Alternate between candidate generation and pruning . Karsten Borgwardt: Data Mining in Bioinformatics, Page 9

A Priori Generating unique candidates There are k ! ways of generating a single set of k items Ensure we do it only once ⇒ Idea: assign a unique parent set to each set Canonical form The set of possible parents of an item set I is the set of its maximal proper subsets : { J ⊂ I | ∄ K : J ⊂ K ⊂ I } Put an ordering on B : i 1 < i 2 < · · · < i n Define the canonical parent of I as p c ( I ) = I \ { max a ∈ I a } Karsten Borgwardt: Data Mining in Bioinformatics, Page 10

A Priori Canonical code words code word for I ⊆ B : any word w on the alphabet B canonical code word of I w c ( I ) : smallest of these words, in lexicographic order E.g. { a, c, b, e } → abce The canonical parent of I p c ( I ) is described by the longest proper prefix of w c ( I ) . Prefix property : The longest proper prefix of a canonical code word is a canonical code word itself. Equivalently, any prefix of a canonical code word is a canonical code word itself. Karsten Borgwardt: Data Mining in Bioinformatics, Page 11

A Priori Candidate set generation From frequent item sets of size k − 1 , construct item sets of size k by appending (frequent) items to their canonical code words Only do so for items greater than the last letter of the canonical code word abe → abef, abeg, ✘✘✘✘ abec Karsten Borgwardt: Data Mining in Bioinformatics, Page 12

A Priori Prefix tree a b c d ab ac ad bc bd cd abc abd acd bcd abcd Full prefix tree for B = { a, b, c, d } Karsten Borgwardt: Data Mining in Bioinformatics, Page 13

A Priori Pruning the prefix tree Only generate unique item sets A-priori property ⇒ Prune branches at infrequent items Size-based pruning T = {{ a, b } , { a, b, c } , { b, c } , a b c d 9 7 4 11 { b } , { b, d } , { d } , { a, c } , { b, c } , { d } , { a, c } , { b, c } , { b, c, d } , ab ac ad bc bd cd 2 7 4 3 2 0 { d } , { b } , { b, c, d } , { b, c, d }} abc abd acd bcd 3 1 0 0 abcd 0 Karsten Borgwardt: Data Mining in Bioinformatics, Page 14

Frequent pattern mining Exploring the search tree Breadth-First Search : find all frequent sets of size k before moving on to size k + 1 → A-priori Depth-First Search : find all frequent sets containing el- ement a before moving on to those that contain b but do not contain a Advantage : divide-and-conquer strategy, requires less memory → Eclat , FP-growth ... Karsten Borgwardt: Data Mining in Bioinformatics, Page 15

Frequent Subgraph Mining Karsten Borgwardt: Data Mining in Bioinformatics, Page 16

Graphs A graph is an ordered pair G = ( V, E ) V is a set of vertices (or nodes ) E ⊆ V × V is a set of edges (or links ) Edges can be ordered → G is directed or not → G is undirected A labeled graph is an ordered triplet G = ( V, E, l ) V is a set of vertices (or nodes ) E ⊆ V × V is a set of edges (or links ) l : V ∪ E → A ∗ assigns labels to vertices and edges Karsten Borgwardt: Data Mining in Bioinformatics, Page 17

Frequent subgraph mining Frequent subgraphs Given a set D = { G 1 , G 2 , . . . , G N } of graphs a minimum frequency θ min ∈ [0 , 1] Find the set of frequent subgraphs , i.e. F ( θ min ) = { H | |{ i : H subgraph of G i }| ≥ Nθ min } The frequency of subgraph H is called the support of H supp ( H ) = |{ i : H subgraph of G i }| θ min is called the minimimum support Often focus on connected subgraphs Karsten Borgwardt: Data Mining in Bioinformatics, Page 18

Frequent subgraph mining Example: Call graphs Frequent subgraphs: Karsten Borgwardt: Data Mining in Bioinformatics, Page 19

Frequent subgraph mining Example: Chemical compounds Caffeine Theobromine Sildenafil Adenine Frequent subgraphs: Imidazole Purine Karsten Borgwardt: Data Mining in Bioinformatics, Page 20

Frequent subgraph mining Subgraph isomorphism Let G = ( V G , E G , l G ) and H = ( V H , E H , l H ) be two labeled graphs. A subgraph isomorphism from H to G (or an occur- rence of H in G ) is an injective function f : V H → V G such that: ∀ v ∈ V H : l H ( v ) = l G ( f ( v )) ∀ ( u, v ) ∈ E H : ( f ( u ) , f ( v )) ∈ E G and l H ( u, v ) = l G ( f ( u ) , f ( v )) There may be several (many) ways to map H to G Karsten Borgwardt: Data Mining in Bioinformatics, Page 21

Frequent subgraph mining Graph isomorphism G and H are isomorphic if there exists a subgraph isomorphism from G to H and from H to G f (1) = A f (2) = C f (3) = D f (4) = B f (5) = F f (6) = E Karsten Borgwardt: Data Mining in Bioinformatics, Page 22

Frequent subgraph mining Subgraph isomorphism Testing whether there is a subgraph isomorphism between two graphs is generally NP-complete Special cases: linear complexity for planar graphs (e.g. paths, trees, grids) Therefore: Testing whether a subgraph occurs in the database is NP-complete Testing whether a subgraph is isomorphic to an al- ready identified subgraph requires exponential run- time in general as well Karsten Borgwardt: Data Mining in Bioinformatics, Page 23

Frequent subgraph mining The a-priori property No supergraph of an infrequent graph can be frequent All subgraphs of a frequent graph are frequent AGM [Inokuchi et al. , 2000] , FSG [Kuramochi and Karypis, 2001] Growing from k to k + 1 isn’t trivial Eliminating non-frequent subgraphs of size k + 1 in- volves costly subgraph isomorphisms Canonical representations of graphs More difficult than with item sets. spanning trees adjacency matrices Karsten Borgwardt: Data Mining in Bioinformatics, Page 24

gSpan [Yan and Han, 2002] Spanning tree A graph G is called a tree if for any pair of vertices of G , there exists one and only one path connecting them in G A spanning tree of G is a subgraph S of G that that is a tree whose vertices are the vertices of G , ie. V S = V G Two spanning trees of G G Karsten Borgwardt: Data Mining in Bioinformatics, Page 25

gSpan DFS trees Explore G in DFS order one graph can have several DFS trees Order vertices in discovery order < V v 0 is called the root v n is called the right-most vertex right-most path : straight path v 0 → v n forward edges : edges in the DFS tree ( i, j ) : v i < V v j backward edges : edges not in the DFS tree ( i, j ) : v j < V v i Karsten Borgwardt: Data Mining in Bioinformatics, Page 26

Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining - PowerPoint PPT Presentation

Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining Chlo-Agathe Azencott & Karsten Borgwardt February 10 to February 21, 2014 Machine Learning & Computational Biology Research Group Max Planck Institutes Tbingen and

Frequent Pattern Mining Frequent Sequence Mining Frequent Tree Mining Christian Borgelt

Frequent Subgraph Mining Frequent Subgraph Mining (FSM) Outline FSM Preliminaries FSM

Topic II.1: Frequent Subgraph Mining Discrete Topics in Data Mining Universitt des Saarlandes,

Data Mining in Bioinformatics Day 9: String & Text Mining in Bioinformatics Karsten Borgwardt

Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics Karsten Borgwardt February 25

Mining Large Single Networks under Subgraph Mining Large Single Networks under Subgraph

Data Mining 2020 Frequent Pattern Mining (2) Ad Feelders Universiteit Utrecht October 2, 2020

Data Mining in Bioinformatics Day 4: Text Mining Karsten Borgwardt February 25 to March 10

Scope Constrained Frequent Pattern Mining: Constrained Frequent Pattern Mining: A A

Frequent Itemset Mining Stony Brook University CSE545, Fall 2016 Frequent Itemset Mining aka

Data Mining 2018 Frequent Pattern Mining (2) Ad Feelders Universiteit Utrecht October 10, 2018

Data Mining in Bioinformatics Day 6: Classification in Bioinformatics Karsten Borgwardt February

Data Mining in Bioinformatics Day 8: Feature Selection in Bioinformatics Karsten Borgwardt

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

Introduction to Data Mining Frequent Pattern Mining and Association Analysis Li Xiong Slide

Introduction to Data Mining Frequent Pattern Mining and Association Analysis Li Xiong Slide

TITLE PAGE: Is protein sequence evolution constant over time? Carolin Kosiol & Nick

CS CS 466 466 In Introduct ctio ion t to B Bio ioin informatics ics Lecture 2 Part 2

Parsimony 123456789... Taxon1 CGACC A GGT... Taxon2 CGACC A GGT... Taxon3 CGGTC C GGT... Taxon4

L ear ning and Vision R esear ch Gr oup Shuic he ng YAN Natio nal U nive rsity o f Singapo

A Docker-Based Replicability Study of a Neural Information Retrieval Model Nicola Ferro, Stefano

Human-Computer Interaction CS5340 Round 4 Homework I3: Update Now due next week Still

Why decoding? Understanding the neural code. Neural Decoding Given spikes, what was the

Computational models of of biological biological Computational models systems systems

Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining - PowerPoint PPT Presentation

Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining Chlo-Agathe Azencott & Karsten Borgwardt February 10 to February 21, 2014 Machine Learning & Computational Biology Research Group Max Planck Institutes Tbingen and

Frequent Pattern Mining Frequent Sequence Mining Frequent Tree Mining Christian Borgelt

Frequent Subgraph Mining Frequent Subgraph Mining (FSM) Outline FSM Preliminaries FSM

Topic II.1: Frequent Subgraph Mining Discrete Topics in Data Mining Universitt des Saarlandes,

Data Mining in Bioinformatics Day 9: String &amp; Text Mining in Bioinformatics Karsten Borgwardt

Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics Karsten Borgwardt February 25

Mining Large Single Networks under Subgraph Mining Large Single Networks under Subgraph

Data Mining 2020 Frequent Pattern Mining (2) Ad Feelders Universiteit Utrecht October 2, 2020

Data Mining in Bioinformatics Day 4: Text Mining Karsten Borgwardt February 25 to March 10

Scope Constrained Frequent Pattern Mining: Constrained Frequent Pattern Mining: A A

Frequent Itemset Mining Stony Brook University CSE545, Fall 2016 Frequent Itemset Mining aka

Data Mining 2018 Frequent Pattern Mining (2) Ad Feelders Universiteit Utrecht October 10, 2018

Data Mining in Bioinformatics Day 6: Classification in Bioinformatics Karsten Borgwardt February

Data Mining in Bioinformatics Day 8: Feature Selection in Bioinformatics Karsten Borgwardt

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

Introduction to Data Mining Frequent Pattern Mining and Association Analysis Li Xiong Slide

Introduction to Data Mining Frequent Pattern Mining and Association Analysis Li Xiong Slide

TITLE PAGE: Is protein sequence evolution constant over time? Carolin Kosiol &amp; Nick

CS CS 466 466 In Introduct ctio ion t to B Bio ioin informatics ics Lecture 2 Part 2

Parsimony 123456789... Taxon1 CGACC A GGT... Taxon2 CGACC A GGT... Taxon3 CGGTC C GGT... Taxon4

L ear ning and Vision R esear ch Gr oup Shuic he ng YAN Natio nal U nive rsity o f Singapo

A Docker-Based Replicability Study of a Neural Information Retrieval Model Nicola Ferro, Stefano

Human-Computer Interaction CS5340 Round 4 Homework I3: Update Now due next week Still

Why decoding? Understanding the neural code. Neural Decoding Given spikes, what was the

Computational models of of biological biological Computational models systems systems

Data Mining in Bioinformatics Day 9: String & Text Mining in Bioinformatics Karsten Borgwardt

TITLE PAGE: Is protein sequence evolution constant over time? Carolin Kosiol & Nick