
Fast Frequent Free Tree Mining in Graph Databases
Peixiang Zhao, Jeffrey Xu Yu
The Chinese University of Hong Kong
December 18th, 2006, ICDM Workshop MCD06

Synopsis
• Introduction
• Existing Approaches
• Our Algorithm: F3TM
• Performance Studies
• Conclusions

Introduction
• A graph, a general data structure for representing relations among entities, is widely used in a broad range of areas
  • Computational biology
  • Chemistry
  • Pattern recognition
  • Computer networks
  • etc.
• Mining frequent sub-graphs in a graph database
  • Whether a large graph contains another small graph: the sub-graph isomorphism problem (NP-complete)
  • Whether two graphs are isomorphic: the graph isomorphism problem (in NP, but not known to be either in P or NP-complete)

Introduction
• Free Tree (ftree)
  • Connected, acyclic and undirected graph
  • Widely used in bioinformatics, computer vision, networks, etc.
  • A specialization of the general graph that avoids the undesirable theoretical properties and algorithmic complexity incurred by graphs
    – determining whether a tree $t_1$ is contained in another tree $t_2$ can be solved in $O(m^{3/2} n / \log m)$ time
    – determining whether $t_1$ is isomorphic to $t_2$ can be solved in $O(n)$ time
    – determining whether a tree is isomorphic to some sub-tree of a graph requires costly tree-in-graph testing, which is still NP-complete
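The definition of a free tree (connected, acyclic, undirected) can be checked directly. The following is a minimal sketch, not part of the original slides; the function name `is_free_tree` and the convention that vertices are numbered 0 to n-1 are assumptions made for illustration. It relies on the standard fact that a graph on n vertices with exactly n-1 edges is acyclic exactly when it is connected.

```python
from typing import Iterable, Tuple


def is_free_tree(num_vertices: int, edges: Iterable[Tuple[int, int]]) -> bool:
    """Return True if the undirected graph on vertices 0..num_vertices-1 is a free tree."""
    edge_list = list(edges)
    # A tree on n vertices has exactly n - 1 edges.
    if num_vertices < 1 or len(edge_list) != num_vertices - 1:
        return False
    adjacency = {v: [] for v in range(num_vertices)}
    for u, v in edge_list:
        adjacency[u].append(v)
        adjacency[v].append(u)
    # With n - 1 edges, the graph is acyclic exactly when it is connected,
    # so a single depth-first traversal from vertex 0 settles both properties.
    seen = {0}
    stack = [0]
    while stack:
        u = stack.pop()
        for w in adjacency[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == num_vertices


if __name__ == "__main__":
    print(is_free_tree(4, [(0, 1), (1, 2), (1, 3)]))  # a star-like tree
    print(is_free_tree(3, [(0, 1), (1, 2), (2, 0)]))  # a triangle (cyclic)
```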

Introduction
• Frequent free tree mining
  • Given a graph database $D = \{g_1, g_2, \ldots, g_N\}$, the problem of frequent free tree mining is to find the set of all frequent free trees, where a ftree t is frequent if the fraction of graphs in D that contain t as a sub-tree is greater than or equal to a user-given threshold Φ
  • Two key concepts
    – Candidate generation
    – Frequency counting
• Our focus
  • The fewer candidates generated, the fewer times the costly tree-in-graph testing has to be applied
  • The cost of candidate generation itself can be high
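To make the frequency definition concrete, here is a small sketch of frequency counting with a brute-force tree-in-graph test. It is an illustration only and not the method of the slides: the representation (a dict of vertex labels plus an edge list), the names `contains_tree`, `support` and `is_frequent`, and the choice to ignore edge labels are all assumptions. The backtracking search is exponential in the worst case, which is consistent with the test being NP-complete.

```python
def _adjacency(labels, edges):
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def contains_tree(g_labels, g_edges, t_labels, t_edges):
    """Brute-force test: does labeled graph g contain labeled tree t as a sub-tree?"""
    g_adj = _adjacency(g_labels, g_edges)
    t_adj = _adjacency(t_labels, t_edges)
    # Order the tree vertices so that every vertex appears after its parent.
    root = next(iter(t_labels))
    order, parent, stack = [root], {root: None}, [root]
    while stack:
        u = stack.pop()
        for w in t_adj[u]:
            if w not in parent:
                parent[w] = u
                order.append(w)
                stack.append(w)

    def extend(i, mapping, used):
        if i == len(order):
            return True
        u = order[i]
        # The root may go anywhere; other vertices must sit next to their parent's image.
        candidates = g_labels if parent[u] is None else g_adj[mapping[parent[u]]]
        for x in candidates:
            if x not in used and g_labels[x] == t_labels[u]:
                mapping[u] = x
                used.add(x)
                if extend(i + 1, mapping, used):
                    return True
                used.discard(x)
                del mapping[u]
        return False

    return extend(0, {}, set())


def support(database, t_labels, t_edges):
    """Fraction of graphs in the database that contain the tree."""
    hits = sum(contains_tree(gl, ge, t_labels, t_edges) for gl, ge in database)
    return hits / len(database)


def is_frequent(database, t_labels, t_edges, phi):
    return support(database, t_labels, t_edges) >= phi
```

For example, with a hypothetical two-graph database (a triangle labeled a, b, c and a single edge a-b), the tree a-b has support 1.0 while the tree a-c has support 0.5.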

Existing Approaches
• FT-Algorithm
  • Apriori-based algorithm
  • Builds a conceptual enumeration lattice to enumerate frequent ftrees in the database
  • Follows a pattern-join approach to generate candidate frequent ftrees
• FG-Algorithm
  • A vertical mining algorithm
  • Builds an enumeration tree and traverses it in a depth-first fashion
  • Takes a pattern-growth approach to generate candidate frequent ftrees

Our Algorithm: F3TM
• F3TM (Fast Frequent Free Tree Mining)
  • A vertical mining algorithm
    – Requires relatively little memory to maintain the frequent ftrees being found
  • Uses the pattern-growth approach for candidate generation
  • Two pruning algorithms are proposed to facilitate candidate generation, and they contribute a dramatic speedup to the final performance of our ftree mining algorithm
    – Automorphism-based pruning
    – Canonical mapping-based pruning

Canonical Form of Free Tree
• A unique representation of a ftree
  • Two ftrees, $t_1$ and $t_2$, share the same canonical form if and only if $t_1$ is isomorphic to $t_2$
  • Only free trees in their canonical form need to be considered in the frequent ftree mining process
• A two-step algorithm
  • Normalizing a ftree to be a rooted ordered tree
  • Assigning a string, as its code, to represent the normalized rooted ordered tree
• Both steps of the algorithm are $O(n)$ for an n-ftree
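The two steps can be sketched with a well-known textbook construction: root the tree at its center (found by repeatedly peeling leaves), then encode the rooted tree as a parenthesized string with child codes sorted. This is an illustrative stand-in, not necessarily the canonical form used by F3TM; the function names are hypothetical, vertex labels are assumed to contain no parentheses, and the sorting makes this version slower than the $O(n)$ bound stated above.

```python
def tree_centers(adjacency):
    """Return the one or two center vertices of a free tree by peeling leaves."""
    remaining = len(adjacency)
    if remaining <= 2:
        return sorted(adjacency)
    degree = {v: len(adjacency[v]) for v in adjacency}
    leaves = [v for v in adjacency if degree[v] == 1]
    while remaining > 2:
        remaining -= len(leaves)
        next_leaves = []
        for leaf in leaves:
            for w in adjacency[leaf]:
                degree[w] -= 1
                if degree[w] == 1:
                    next_leaves.append(w)
        leaves = next_leaves
    return sorted(leaves)


def rooted_code(adjacency, labels, root, parent=None):
    """Step 2: encode the tree rooted at `root`; sorting child codes fixes the order."""
    child_codes = sorted(
        rooted_code(adjacency, labels, child, root)
        for child in adjacency[root]
        if child != parent
    )
    return "(" + labels[root] + "".join(child_codes) + ")"


def canonical_form(labels, edges):
    """Step 1 roots the tree at a center; with two centers, keep the smaller code."""
    adjacency = {v: [] for v in labels}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    return min(rooted_code(adjacency, labels, c) for c in tree_centers(adjacency))


if __name__ == "__main__":
    # The same path a-b-b under two different vertex numberings.
    print(canonical_form({0: "a", 1: "b", 2: "b"}, [(0, 1), (1, 2)]))
    print(canonical_form({5: "b", 6: "b", 7: "a"}, [(5, 6), (6, 7)]))
```

Because the center of a tree is preserved by every isomorphism, two free trees receive the same string exactly when they are isomorphic.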

Candidate Generation
• Theorem: the completeness of frequent ftrees is ensured if we grow vertices from predefined positions of a ftree, called the extension frontier
• The extension frontier represents all legal positions of an n-ftree t' at which a new vertex can be appended to obtain a new (n+1)-ftree t, such that no ftrees are omitted during this frontier-extending process
[Figure: example free tree with vertices labeled a to g]

Automorphism-Based Pruning
• Given a candidate ftree t in T (the candidate set), in order to reduce the cost of frequency counting, we first check whether there is a candidate ftree t' in T such that t = t'
  • There is no need to count redundant candidates
  • When T becomes large, the cost of checking t = t' for every t' in T can become the dominating cost
[Figure: example candidate ftrees with vertices labeled a, b, c and d]
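The redundancy described above is easy to reproduce. The sketch below is a naive baseline, not the F3TM procedure: it grows a new vertex from every vertex of a ftree (rather than from the extension frontier) and removes duplicates afterwards by comparing canonical codes. The names `canonical_code` and `grow_naively`, the dict-and-edge-list representation, and the simple quadratic canonical code (the smallest rooted encoding over all roots) are assumptions made for illustration.

```python
def canonical_code(labels, edges):
    """Smallest rooted encoding over all choices of root (simple, not linear time)."""
    adjacency = {v: [] for v in labels}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    def rooted(v, parent):
        children = sorted(rooted(c, v) for c in adjacency[v] if c != parent)
        return "(" + labels[v] + "".join(children) + ")"

    return min(rooted(v, None) for v in labels)


def grow_naively(labels, edges, alphabet):
    """Attach one new labeled vertex at every position, then deduplicate by code."""
    new_vertex = max(labels) + 1
    distinct = {}
    generated = 0
    for v in labels:
        for label in alphabet:
            cand_labels = {**labels, new_vertex: label}
            cand_edges = list(edges) + [(v, new_vertex)]
            generated += 1
            code = canonical_code(cand_labels, cand_edges)
            # Isomorphic candidates share a code, so only the first one is kept.
            distinct.setdefault(code, (cand_labels, cand_edges))
    return generated, list(distinct.values())


if __name__ == "__main__":
    # A vertex labeled a with two neighbors labeled b, grown with the label c.
    generated, candidates = grow_naively(
        {1: "a", 2: "b", 3: "b"}, [(1, 2), (1, 3)], ["c"]
    )
    print(generated, len(candidates))
```

In this example, attaching c to either of the two b vertices yields isomorphic candidates, so one of the three generated candidates is redundant. Detecting this after the fact requires a code comparison per candidate, which is the cost that pruning during generation aims to avoid.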

Automorphism-Based Pruning
• Automorphism-based pruning
  • Efficiently prunes redundant candidates in T while avoiding repeated checks of whether a ftree already exists in T
  • All vertices of a free tree can be partitioned into different equivalence classes based on automorphism
  • We only need to grow vertices from one representative of an equivalence class, if the vertices of the equivalence class are in the extension frontier of the ftree
[Figure: example ftrees with vertices labeled a, b, c and d, annotated with equivalence-class numbers]
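The partition into equivalence classes can be sketched as follows. Two vertices u and v of a tree are mapped onto each other by some automorphism exactly when the tree rooted at u is isomorphic to the tree rooted at v, so grouping vertices by the code of the tree rooted at them gives the classes. This is a simple quadratic illustration under that observation, not the procedure used by F3TM; the function names and the input representation are hypothetical.

```python
def rooted_code(adjacency, labels, root, parent=None):
    """Encode the tree as seen from `root`; sorting child codes removes ordering."""
    child_codes = sorted(
        rooted_code(adjacency, labels, child, root)
        for child in adjacency[root]
        if child != parent
    )
    return "(" + labels[root] + "".join(child_codes) + ")"


def automorphism_classes(labels, edges):
    """Group vertices that some automorphism of the labeled tree maps to each other."""
    adjacency = {v: [] for v in labels}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    classes = {}
    for v in labels:
        classes.setdefault(rooted_code(adjacency, labels, v), []).append(v)
    return sorted(sorted(members) for members in classes.values())


if __name__ == "__main__":
    # Hypothetical example: a root labeled a with two b children,
    # each of which has one c child and one d child.
    labels = {0: "a", 1: "b", 2: "b", 3: "c", 4: "d", 5: "c", 6: "d"}
    edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6)]
    print(automorphism_classes(labels, edges))
```

Growing from only one representative of each class (for instance vertex 1 but not vertex 2 in the example) avoids generating the isomorphic duplicate in the first place.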

Canonical Mapping-based Pruning
• How do we select the potential labels to be grown on the frequent ftrees during candidate generation?
  • Existing algorithms maintain mappings from a ftree t to all its k occurrences in $g_i$
  • Based on these mappings, it is possible to know which labels that appear in graph $g_i$ can be selected and assigned to generate a candidate (n+1)-ftree
  • However, there are many redundant mappings between a ftree t and its occurrences in $g_i$

Canonical Mapping-based Pruning
[Figure: two graphs $g_1$ and $g_2$ with vertices labeled a and b, a ftree t (vertex 1 labeled a; vertices 2 and 3 labeled b), and the mapping list of t]
Mapping list: (1;1,2,4), (1;1,4,2), (1;3,2,4), (1;3,4,2), (2;2,3,4), (2;2,4,3)

Canonical Mapping-based Pruning
• Canonical mapping
  • Efficiently avoids multiple mappings from a ftree to the same occurrence of the tree in a graph $g_i$ of D
  • After orienting a frequent ftree t to its canonical mapping t' in $g_i$ of D, we can select potential labels from graph $g_i$ for candidate generation
  • Given an n-ftree t, assume that the number of equivalence classes of t is c, and the number of vertices in each equivalence class $C_i$ is $n_i$ ($1 \le i \le c$)
    – The number of mappings between t and an occurrence t' in graph $g_i$ is up to $\prod_{i=1}^{c} (n_i)!$
    – With canonical mapping, we only need to consider one out of the $\prod_{i=1}^{c} (n_i)!$ mappings for candidate generation
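A short sketch can check the bound against the mapping list shown on the previous slide. Several things here are assumptions rather than facts from the slides: an entry such as (1;1,2,4) is read as "in graph 1, tree vertices 1, 2, 3 map to graph vertices 1, 2, 4"; the tree t is taken to have edges 1-2 and 1-3; an occurrence is identified by the set of graph edges it covers; and keeping the first mapping seen per occurrence is only a stand-in for the authors' canonical mapping, whose exact orientation rule is not given here.

```python
from math import factorial, prod


def max_mappings(class_sizes):
    """Upper bound on mappings per occurrence: the product of n_i! over all classes."""
    return prod(factorial(n) for n in class_sizes)


def one_mapping_per_occurrence(mappings, tree_edges):
    """Keep the first mapping seen for each (graph id, set of matched graph edges)."""
    kept = {}
    for graph_id, images in mappings:
        # images[k] is the graph vertex matched to tree vertex k + 1 (assumed notation).
        occurrence = frozenset(
            frozenset((images[u - 1], images[v - 1])) for u, v in tree_edges
        )
        kept.setdefault((graph_id, occurrence), (graph_id, images))
    return list(kept.values())


if __name__ == "__main__":
    mapping_list = [
        (1, (1, 2, 4)), (1, (1, 4, 2)),
        (1, (3, 2, 4)), (1, (3, 4, 2)),
        (2, (2, 3, 4)), (2, (2, 4, 3)),
    ]
    tree_edges = [(1, 2), (1, 3)]
    # t has two equivalence classes: {1} and {2, 3}.
    print(max_mappings([1, 2]))
    print(one_mapping_per_occurrence(mapping_list, tree_edges))
```

Under these assumptions the six listed mappings collapse to three occurrences, each reached by $1! \cdot 2! = 2$ mappings, which agrees with the bound. The bound is not always tight: a tree whose class sizes are 1, 2, 2, 2 has a bound of 8 even if it has fewer automorphisms.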

Performance Studies
• The Real Dataset
  • The AIDS antiviral screen dataset from the Developmental Therapeutics Program (DTP) at NCI/NIH
  • 42390 compounds retrieved from DTP's Drug Information System
  • 63 kinds of atoms in this dataset, most of which are C, H, O, S, etc.
  • Three kinds of bonds are common in these compounds: single-bond, double-bond and aromatic-bond
  • On average, compounds in the dataset have 43 vertices and 45 edges
  • The graph of maximum size has 221 vertices and 234 edges

Real Data Set
• Performance comparisons (with different minimum thresholds: 10%, 20%, 50%)
[Figure: three plots of total running time (sec) versus size of datasets (0 to 10000) for F3TM, FG and FT]

Conclusion
• Free trees have computational advantages over general graphs, which makes them a suitable candidate for computational biology, pattern recognition, computer networks, XML databases, etc.
• F3TM discovers all frequent free trees in a graph database, with a focus on reducing the cost of candidate generation
• F3TM outperforms the up-to-date existing free tree mining algorithms by an order of magnitude
• F3TM scales to mining frequent free trees in a large graph dataset with a low minimum support threshold

Thank you
