gApprox: Mining Frequent Approximate Patterns from a Massive Network

Chen Chen†, Xifeng Yan‡, Feida Zhu†, Jiawei Han†
† University of Illinois at Urbana-Champaign
‡ IBM T. J. Watson Research Center
{cchen37, feidazhu, hanj}@cs.uiuc.edu, xifengyan@us.ibm.com

Abstract

Recently, there arise a large number of graphs with massive sizes and complex structures in many new applications, such as biological networks, social networks, and the Web, demanding powerful data mining methods. Due to inherent noise or data diversity, it is crucial to address the issue of approximation, if one wants to mine patterns that are potentially interesting with tolerable variations.

In this paper, we investigate the problem of mining frequent approximate patterns from a massive network and propose a method called gApprox. gApprox not only finds approximate network patterns, which is the key for many knowledge discovery applications on structural data, but also enriches the library of graph mining methodologies by introducing several novel techniques such as: (1) a complete and redundancy-free strategy to explore the new pattern space faced by gApprox; and (2) transform “frequent in an approximate sense” into an anti-monotonic constraint so that it can be pushed deep into the mining process. Systematic empirical studies on both real and synthetic data sets show that frequent approximate patterns mined from the worm protein-protein interaction network are biologically interesting and gApprox is both effective and efficient.

1 Introduction

In the past, there have been a set of interesting algorithms [4, 10, 6] that mine frequent patterns in a set of graphs. Recently, there arise a large number of graphs with massive sizes and complex structures in many new applications, such as biological networks, social networks, and the Web, demanding powerful data mining methods. Because of their characteristics, we are now interested in patterns that frequently appear at many different places of a single network.

Example 1. Let us consider a Protein-Protein Interaction (PPI) network in Biology. A PPI network is a huge graph whose vertices are individual proteins, where an edge exists between two vertices if and only if there is a significant protein-protein interaction. Due to some underlying biological process, occasionally we may observe two subnets P_a and P_b, which are quite similar in the sense that, after proper correspondence, discernable resemblance exists between individual proteins, e.g., with regard to their amino acids, secondary structures, etc., and the interactions within P_a and P_b are nearly identical to each other.¹

Figure 1. Two subnets extracted from the worm PPI network, where proteins at the corresponding positions of (a) and (b) are biologically quite similar, and 2 PPI deletions plus 3 PPI insertions transform (a) into (b).
Protein labels shown in the figure: pqn-57, abu-1, ubc-18, ubc-1, M02G9.1, F46F11.7, unc-97, F30H5.3, Y65B4A.7, pqn-54, abu-11, pqn-5, lys-1, abu-8, pqn-71, lys-2, M195.2, F35A5.4.

There are in general two major complications to mine such massive and highly complex networks:

First, compared to algorithms targeting a set of graphs, mining frequent patterns in a single network needs to partition the network into regions, where each region contains one occurrence of the pattern. This partition changes from one pattern to another; whereas for any given partition, regions may overlap with each other as well. All these problems are not solved by existing technologies for mining a set of graphs.

Second, due to various inherent noise or data diversity, it is crucial to account for approximations so that all potentially interesting patterns can be captured. Cast to the PPI network we described in Example 1 (see Fig. 1), as long as their similarity is above some threshold, it is ideal to detect P_b as a place where P_a approximately appears.

In retrospect, compared to the rich literature on mining frequent patterns in a set of graphs, single network based algorithms have been examined to a minor extent. [5, 7, 1]

¹ In Biology, this might represent a mechanism to backup a set of proteins whose mutual interactions support a vital function of the network, so that in case of any unexpected events, the “copy” can switch in.
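As a rough illustration of the edge-level comparison mentioned in the caption of Figure 1, the following Python sketch counts how many edge deletions and insertions turn one subnet into another once a vertex correspondence is fixed. It assumes the correspondence is already given and ignores any dissimilarity between the matched proteins themselves. The function names, the toy vertex labels, and the edge lists are hypothetical and are not taken from the paper, and this count is not claimed to be the approximation measure that gApprox uses.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Edge = FrozenSet[str]


def edge_set(pairs: Iterable[Tuple[str, str]]) -> Set[Edge]:
    """Build a set of undirected edges from (u, v) pairs."""
    return {frozenset(pair) for pair in pairs}


def edge_edits(
    edges_a: Set[Edge],
    edges_b: Set[Edge],
    correspondence: Dict[str, str],
) -> Tuple[int, int]:
    """Return (deletions, insertions) needed to turn subnet A into subnet B.

    `correspondence` maps each vertex of A to its counterpart in B and is
    assumed to be one-to-one.
    """
    # Relabel every edge of A with the names of the matched vertices in B.
    mapped_a = {frozenset(correspondence[v] for v in edge) for edge in edges_a}
    deletions = len(mapped_a - edges_b)   # edges of A with no counterpart in B
    insertions = len(edges_b - mapped_a)  # edges of B with no counterpart in A
    return deletions, insertions


# Hypothetical toy subnets on five vertices each (not the worm PPI data).
subnet_a = edge_set([("a1", "a2"), ("a2", "a3"), ("a3", "a4"), ("a4", "a5")])
subnet_b = edge_set(
    [("b1", "b2"), ("b2", "b3"), ("b1", "b3"), ("b2", "b4"), ("b3", "b5")]
)
correspondence = {f"a{i}": f"b{i}" for i in range(1, 6)}

deletions, insertions = edge_edits(subnet_a, subnet_b, correspondence)
print(deletions, insertions)  # prints: 2 3
```

In this toy case the two subnets differ by 2 deletions and 3 insertions, the same kind of count the caption reports for panels (a) and (b). Comparing such a count against a tolerance would be one simple way to decide whether one subnet counts as an approximate occurrence of the other, though the threshold and the exact similarity measure are left open here.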
