IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 9, SEPTEMBER 2004, p. 1038

An Efficient Algorithm for Discovering Frequent Subgraphs

Michihiro Kuramochi and George Karypis

Abstract: Over the years, frequent itemset discovery algorithms have been used to find interesting patterns in various application areas. However, as data mining techniques are being increasingly applied to nontraditional domains, existing frequent pattern discovery approaches cannot be used. This is because the transaction framework that is assumed by these algorithms cannot be used to effectively model the data sets in these domains. An alternate way of modeling the objects in these data sets is to represent them using graphs. Within that model, one way of formulating the frequent pattern discovery problem is that of discovering subgraphs that occur frequently over the entire set of graphs. In this paper, we present a computationally efficient algorithm, called FSG, for finding all frequent subgraphs in large graph data sets. We experimentally evaluate the performance of FSG using a variety of real and synthetic data sets. Our results show that despite the underlying complexity associated with frequent subgraph discovery, FSG is effective in finding all frequently occurring subgraphs in data sets containing more than 200,000 graph transactions and scales linearly with respect to the size of the data set.

Index Terms: Data mining, scientific data sets, frequent pattern discovery, chemical compound data sets.

1 INTRODUCTION

Efficient algorithms for finding frequent patterns, both sequential and nonsequential, in very large data sets have been one of the key success stories of data mining research [2], [41], [1], [49], [20], [36]. Nevertheless, as data mining techniques have been increasingly applied to nontraditional domains, there is a need to develop efficient and general-purpose frequent pattern discovery algorithms that are capable of capturing the spatial, topological, geometric, and/or relational nature of the data sets that characterize these domains.

In recent years, labeled topological graphs have emerged as a promising abstraction to capture the characteristics of these data sets. In this approach, each object to be analyzed is represented via a separate graph whose vertices correspond to the entities in the object and whose edges correspond to the relations between them. Within that model, one way of formulating the frequent pattern discovery problem is that of discovering subgraphs that occur frequently over the entire set of graphs.

The power of graphs to model complex data sets has been recognized by various researchers [26], [23], [30], [46], [3], [37], [43], [6], [10], [14], [19], as it allows us to represent arbitrary relations among entities and solve problems that we could not previously solve. For instance, consider the problem of mining chemical compounds to find recurrent substructures. We can achieve that with a graph-based pattern discovery algorithm by creating a graph for each one of the compounds, whose vertices correspond to different atoms and whose edges correspond to bonds between them. We can assign to each vertex a label corresponding to the atom involved (and potentially its charge), and assign to each edge a label corresponding to the type of the bond (and potentially information about their relative 3D orientation). Once these graphs have been created, recurrent substructures across different compounds become frequently occurring subgraphs. In fact, within the context of chemical compound classification, such techniques have been used to mine chemical compounds and identify the substructures that best discriminate between the different classes [27], [42], [5], [11], and were shown to produce classifiers superior to those of more traditional methods [21].

Developing algorithms that discover all frequently occurring subgraphs in a large graph data set is particularly challenging and computationally intensive, as graph and subgraph isomorphisms play a key role throughout the computations. In this paper, we present a new algorithm, called FSG, for finding all connected subgraphs that appear frequently in a large graph data set. Our algorithm finds frequent subgraphs using the level-by-level expansion strategy adopted by Apriori [2]. The key features of FSG are the following:

1. it uses a sparse graph representation that minimizes both storage and computation;
2. it increases the size of frequent subgraphs by adding one edge at a time, allowing it to generate the candidates efficiently;
3. it incorporates various optimizations for candidate generation and frequency counting which enable it to scale to large graph data sets; and
4. it uses sophisticated algorithms for canonical labeling to uniquely identify the various generated subgraphs without having to resort to computationally expensive graph and subgraph-isomorphism computations.

The authors are with the Department of Computer Science, University of Minnesota, 4-192 EE/CS Building, 200 Union St. SE, Minneapolis, MN 55455. E-mail: {kuram, karypis}@cs.umn.edu.
Manuscript received 28 June 2002; revised 28 Apr. 2003; accepted 2 July 2003.
For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number 116863.
1041-4347/04/$20.00 © 2004 IEEE. Published by the IEEE Computer Society.
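To make the graph model concrete, the following is a minimal sketch of the first level of a level-by-level search: counting which labeled single-edge subgraphs are frequent across a set of graph transactions. It is an illustration under stated assumptions, not the FSG implementation. The function names (edge_pattern, frequent_single_edges), the tuple-based graph encoding, the toy molecules, and the rule that a pattern is counted at most once per transaction are all assumptions introduced here.

```python
from collections import defaultdict

# Assumed encoding: a graph transaction is (vertex_labels, edges), where
# edges is a sparse list of (u, v, edge_label) over vertex indices.

def edge_pattern(vertex_labels, u, v, edge_label):
    """Label-only key for one undirected edge; endpoint labels are sorted
    so that C-O and O-C map to the same pattern."""
    a, b = sorted((vertex_labels[u], vertex_labels[v]))
    return (a, edge_label, b)

def frequent_single_edges(graphs, min_support):
    """Return {pattern: support} for single-edge patterns that occur in at
    least min_support transactions (counted once per transaction)."""
    support = defaultdict(int)
    for vertex_labels, edges in graphs:
        patterns = {edge_pattern(vertex_labels, u, v, lab) for u, v, lab in edges}
        for pattern in patterns:
            support[pattern] += 1
    return {p: s for p, s in support.items() if s >= min_support}

# Toy "compounds" (hydrogens omitted): vertices are atoms, edges are bonds.
GRAPHS = [
    (["C", "C", "O"], [(0, 1, "single"), (1, 2, "single")]),
    (["C", "O"], [(0, 1, "single")]),
    (["C", "C", "O", "O"], [(0, 1, "single"), (1, 2, "double"), (1, 3, "single")]),
]

RESULT = frequent_single_edges(GRAPHS, min_support=2)
```

With min_support=2, the C=O double bond is dropped because it appears in only one transaction. Larger patterns would then be grown from the surviving edges one edge at a time, which is where candidate generation and canonical labeling come in; those steps are not shown here.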
