gSpan: Graph-Based Substructure Pattern Mining Xifeng Yan Jiawei - PDF document

gSpan: Graph-Based Substructure Pattern Mining

Xifeng Yan, Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
{xyan, hanj}@uiuc.edu

Abstract

We investigate new approaches for frequent graph-based pattern mining in graph datasets and propose a novel algorithm called gSpan (graph-based Substructure pattern mining), which discovers frequent substructures without candidate generation. gSpan builds a new lexicographic order among graphs, and maps each graph to a unique minimum DFS code as its canonical label. Based on this lexicographic order, gSpan adopts the depth-first search strategy to mine frequent connected subgraphs efficiently. Our performance study shows that gSpan substantially outperforms previous algorithms, sometimes by an order of magnitude.

1. Introduction

Frequent substructure pattern mining has been an emerging data mining problem with many scientific and commercial applications. As a general data structure, labeled graph can be used to model much complicated substructure patterns among data. Given a graph dataset, $D = \{G_0, G_1, \ldots, G_n\}$, $\mathit{support}(g)$ denotes the number of graphs (in $D$) in which $g$ is a subgraph. The problem of frequent subgraph mining is to find any subgraph $g$ s.t. $\mathit{support}(g) \geq \mathit{minSup}$ (a minimum support threshold). To reduce the complexity of the problem (meanwhile considering the connectivity property of hidden structures in most situations), only frequent connected subgraphs are studied in this paper.

The kernel of frequent subgraph mining is subgraph isomorphism test. Lots of well-known pair-wise isomorphism testing algorithms were developed. However, the frequent subgraph mining problem was not explored well. Recently, Inokuchi et al. [4] proposed an Apriori-based algorithm, called AGM, to discover all frequent (both connected and disconnected) substructures. Kuramochi and Karypis [5] further developed the idea using adjacent representation of graph and an edge-growing strategy. Their algorithm, called FSG, is able to find all frequent connected subgraphs from a chemical compound dataset in 10 minutes with 6.5% minimum support. For the same dataset, our novel algorithm can complete the same task in 10 seconds.

AGM and FSG both take advantage of the Apriori level-wise approach [1]. In the context of frequent subgraph mining, the Apriori-like algorithms meet two challenges: (1) candidate generation: the generation of size $(k+1)$ subgraph candidates from size $k$ frequent subgraphs is more complicated and costly than that of itemsets; and (2) pruning false positives: subgraph isomorphism test is an NP-complete problem, thus pruning false positives is costly.

Contribution. In this paper, we develop gSpan, which targets to reduce or avoid the significant costs mentioned above. If the entire graph dataset can fit in main memory, gSpan can be applied directly; otherwise, one can first perform graph-based data projection as in [6], and then apply gSpan. To the best of our knowledge, gSpan is the first algorithm that explores depth-first search (DFS) in frequent subgraph mining. Two techniques, DFS lexicographic order and minimum DFS code, are introduced here, which form a novel canonical labeling system to support DFS search. gSpan discovers all the frequent subgraphs without candidate generation and false positives pruning. It combines the growing and checking of frequent subgraphs into one procedure, thus accelerates the mining process.

2. DFS Lexicographic Order

This section introduces several techniques developed in gSpan, including mapping each graph to a DFS code (a sequence), building a novel lexicographic ordering among these codes, and constructing a search tree based on this lexicographic order.

DFS Subscripting. When performing a depth-first search [3] in a graph, we construct a DFS tree. One graph can have several different DFS trees. For example, graphs in Fig. 1(b)-(d) are isomorphic to that in Fig. 1(a). The thickened edges in Fig. 1(b)-(d) represent three different DFS trees for the graph in Fig. 1(a). The depth-first discovery of the vertices forms a linear order. We use subscripts to label this
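
The DFS subscripting idea described above (one graph, several possible DFS trees, each inducing a linear discovery order on the vertices) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the graph `GRAPH` is a hypothetical four-vertex example, not the graph of Fig. 1; the helper names `dfs_subscripts` and `non_tree_edges` do not come from the paper; vertex and edge labels are ignored; and the order in which neighbors are explored is simply the order of the adjacency lists.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical undirected graph (NOT the graph of Fig. 1), stored as
# adjacency lists keyed by an arbitrary vertex name.
GRAPH: Dict[str, List[str]] = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b"],
}


def dfs_subscripts(
    graph: Dict[str, List[str]], start: str
) -> Tuple[Dict[str, int], List[Tuple[str, str]]]:
    """Run a depth-first search from `start`.

    Returns (subscript, tree_edges): `subscript` maps each reachable vertex
    to its discovery index 0, 1, 2, ..., and `tree_edges` lists the DFS tree
    edges as (parent, child) pairs in discovery order.
    """
    subscript: Dict[str, int] = {}
    tree_edges: List[Tuple[str, str]] = []

    def visit(v: str) -> None:
        subscript[v] = len(subscript)  # next free discovery index
        for w in graph[v]:
            if w not in subscript:
                tree_edges.append((v, w))
                visit(w)

    visit(start)
    return subscript, tree_edges


def non_tree_edges(
    graph: Dict[str, List[str]], tree_edges: List[Tuple[str, str]]
) -> Set[FrozenSet[str]]:
    """Edges of the undirected graph that are not part of the DFS tree."""
    in_tree = {frozenset(e) for e in tree_edges}
    all_edges = {frozenset((u, v)) for u in graph for v in graph[u]}
    return all_edges - in_tree


if __name__ == "__main__":
    # Different start vertices give different DFS trees for the same graph.
    for start in ("a", "d"):
        subs, tree = dfs_subscripts(GRAPH, start)
        print(start, subs, tree, non_tree_edges(GRAPH, tree))
```

Starting from `"a"` and from `"d"` yields two different DFS trees and two different vertex orderings for the same graph, which is why a mining algorithm that identifies graphs by their DFS traversal needs a rule for choosing one canonical traversal.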


