Discovering Frequent Topological Structures from Graph Datasets

R. Jin, C. Wang, D. Polshakov, S. Parthasarathy, G. Agrawal
Department of Computer Science and Engineering
Ohio State University, Columbus OH 43210
{jinr,wachao,polshako,srini,agrawal}@cse.ohio-state.edu

ABSTRACT

The problem of finding frequent patterns from graph-based datasets is an important one that finds applications in drug discovery, protein structure analysis, XML querying, and social network analysis, among others. In this paper we propose a framework to mine frequent large-scale structures, formally defined as frequent topological structures, from graph datasets. Key elements of our framework include fast algorithms for discovering frequent topological patterns based on the well-known notion of a topological minor, algorithms for specifying and pushing constraints deep into the mining process for discovering constrained topological patterns, and mechanisms for specifying approximate matches when discovering frequent topological patterns in noisy datasets. We demonstrate the viability and scalability of the proposed algorithms on real and synthetic datasets and also discuss the use of the framework to discover meaningful topological structures from protein structure data.

1. INTRODUCTION

Recently, there has been a lot of interest in mining frequent patterns from structured datasets, such as chemical compounds, proteins, web-logs, and XML datasets. Such patterns can effectively summarize the data, provide key insights, and often serve as a preprocessing step for further analysis. Since such datasets can often be modeled as graphs, a majority of research in this area has focused on developing efficient algorithms for mining frequently occurring (connected) subgraphs [9, 10, 18, 13].

However, in many real-world applications, such as biology, social networks, and telecommunication, discovering large-scale structures, which provide high-level topological information about graphs, may be as important as, or more important than, discovering the basic components. For instance, the discovery of non-local or tertiary structural information is an important problem in protein structure analysis. Similarly, in the analysis of social or communication networks, the direct connection between a pair of nodes is often not the focus; instead, the patterns where several nodes are connected through a set of independent paths are of greater interest. Such frequent large-scale structures can be very hard to discover using current frequent subgraph mining approaches. This is not only because the subgraphs sharing these kinds of structures can be infrequent (i.e., the traditional anti-monotone property leveraged by most such algorithms does not hold), but also because the individual subgraphs are not adequately abstracted or represented.

As an example of a large-scale structure we are focusing on, consider mining a protein dataset where each protein is represented as a graph. The vertices of each graph are protein secondary structures, and an edge connects two protein secondary structures if their distance in three-dimensional space is within a certain range. A frequent large-scale topological structure in such a dataset can be as follows: three α-helices that are not direct neighbors of each other, but form a triangle in three-dimensional space. Specifically, in the graphs for different proteins, each pair of the above α-helices is connected through independent paths formed by other secondary structures, possibly including α-helices, β-sheets, or loops. The triangle information can be useful for understanding the functionalities of these proteins. For instance, two DNA-binding regulatory proteins (1ALI and 1E31), though seemingly different from the local-structure perspective, share such an α-helix triangle and perform similar functionalities [5]. In fact, both belong to the class of zinc finger proteins. However, because this kind of structure is hidden under the pair-wise relationship, it is very unlikely to be identified using the existing frequent subgraph mining approaches. In particular, even if some subgraphs which embed the three α-helices appear to be frequent, the triangle structure can easily be missed.

The main contribution of this paper is a framework to mine frequent large-scale structures from graphs. Our work is inspired by a well-established mathematical concept, the topological minor [4]. A topological minor of a graph is an abstraction that focuses on its structural information. Intuitively, such an abstraction is achieved by replacing or contracting independent paths in a subgraph with individual edges.

An important notion in our framework is that of a relabeling function. Real datasets are often best represented as labeled graphs, so when we replace independent paths in a subgraph with edges, the label information on such paths is lost. However, in many applications, summarized information about the contracted paths can be useful to categorize these topological structures. For example, we may prefer to distinguish α-helix triangles of different sizes, and the length of each independent path connecting these α-helices can help to provide such a measurement. Our framework supports this notion through user-defined relabeling functions, which recover some of the information lost from the contracted paths. Such a function maps an entire labeled path to a single edge label. In other words, an edge label carries the desired information about its corresponding contracted path. For instance, in the above example, the relabeling function can use the length of each contracted path as its corresponding edge label. An additional benefit of the relabeling function is that it can be used to support the mining of constrained topological structures.

To summarize, the main contributions of this paper are as follows:

1. We introduce a novel framework for discovering frequent topological structures from graph datasets based on a vertical mining approach.

2. We study the basic properties of relabeling functions, and demonstrate their use for summarization and discovery of
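
The path contraction and relabeling idea described in the introduction can be illustrated with a small sketch. This is not the paper's mining algorithm: it does not search for paths or count support. It only shows how a set of independent paths, given explicitly, might be contracted into single edges whose labels come from a relabeling function. The names `contract_independent_paths` and `path_length`, the graph encoding, and the toy protein graph are all assumptions made for illustration.

```python
def path_length(path_labels):
    """Hypothetical relabeling function: label an edge with its path's edge count."""
    return len(path_labels) - 1


def contract_independent_paths(adj, labels, paths, relabel=path_length):
    """Replace each path (a list of vertices) by a single labeled edge.

    adj    : dict mapping a vertex to the set of its neighbors
    labels : dict mapping a vertex to its label
    paths  : vertex lists; internal vertices must not occur on any other path
    Returns {(first_vertex, last_vertex): relabel(labels along the path)}.
    """
    endpoints = {p[0] for p in paths} | {p[-1] for p in paths}
    used_internal = set()
    contracted = {}
    for p in paths:
        # Every consecutive pair of vertices must be an edge of the graph.
        if any(v not in adj.get(u, ()) for u, v in zip(p, p[1:])):
            raise ValueError(f"{p} is not a path in the graph")
        # Independence: internal vertices are shared with no other path.
        internal = set(p[1:-1])
        if internal & endpoints or internal & used_internal:
            raise ValueError(f"{p} is not independent of the other paths")
        used_internal |= internal
        contracted[(p[0], p[-1])] = relabel([labels[v] for v in p])
    return contracted


# Toy protein graph (made-up data): three helices a1, a2, a3 that are not
# direct neighbors, linked through sheets (b*) and a loop (l1).
edges = [("a1", "b1"), ("b1", "a2"), ("a2", "l1"), ("l1", "b2"),
         ("b2", "a3"), ("a3", "b3"), ("b3", "a1")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
labels = {"a1": "helix", "a2": "helix", "a3": "helix",
          "b1": "sheet", "b2": "sheet", "b3": "sheet", "l1": "loop"}

paths = [["a1", "b1", "a2"], ["a2", "l1", "b2", "a3"], ["a3", "b3", "a1"]]
triangle = contract_independent_paths(adj, labels, paths)
print(triangle)
```

In this toy graph the three helices end up as a triangle whose edges carry the path lengths 2, 3, and 2. Passing a different `relabel` function (for example, one that returns the sequence of labels along the path) would keep a different summary of each contracted path.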
