Biological Network Analysis: Graph Mining in Bioinformatics
Karsten Borgwardt, Interdepartmental Bioinformatics Group, MPIs Tübingen
With permission from Xifeng Yan and Xianghong Jasmine Zhou

Mining coherent dense subgraphs across massive biological networks for functional discovery
H. Hu (1), X. Yan (2), Y. Huang (1), J. Han (2), and X. J. Zhou (1)
(1) University of Southern California, (2) University of Illinois at Urbana-Champaign

Biological Networks
• Protein-protein interaction network
• Metabolic network
• Transcriptional regulatory network
• Co-expression network
• Genetic interaction network
• …

Data Mining Across Multiple Networks
[Figure: multiple networks over the shared node set a, b, c, d, e, f, g, h, i, j, k]

Identify frequent co-expression clusters across multiple microarray data sets
[Figure: expression matrices (genes g1, g2, … by conditions c1 … cm), each shown with its co-expression network over nodes a to k]

Frequent Subgraph Mining
The problem is hard!
Problem formulation: Given n graphs, identify subgraphs which occur in at least m graphs (m ≤ n).
Efficient modeling of biological networks: each gene occurs once and only once in a graph. Consequently, every edge (gene pair) is uniquely labeled within a graph.
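Because every gene appears at most once per network, a graph can be stored as a set of gene-pair edges, and checking whether a candidate subgraph occurs in a graph reduces to a subset test. The following is a minimal sketch of the problem formulation under that assumption; the helper names (`edge`, `support`, `is_frequent`) and the toy graphs are illustrative and not taken from the slides.

```python
from typing import FrozenSet, List, Set

Edge = FrozenSet[str]


def edge(u: str, v: str) -> Edge:
    """Undirected edge between two uniquely labeled genes."""
    return frozenset((u, v))


def support(pattern: Set[Edge], graphs: List[Set[Edge]]) -> int:
    """Number of graphs that contain every edge of the pattern."""
    return sum(1 for g in graphs if pattern <= g)


def is_frequent(pattern: Set[Edge], graphs: List[Set[Edge]], m: int) -> bool:
    """A pattern is frequent if it occurs in at least m of the n graphs."""
    return support(pattern, graphs) >= m


# Toy input (hypothetical): n = 3 graphs over genes a, b, c, d.
graphs = [
    {edge("a", "b"), edge("b", "c"), edge("c", "a")},
    {edge("a", "b"), edge("b", "c")},
    {edge("a", "b"), edge("c", "d")},
]
pattern = {edge("a", "b"), edge("b", "c")}

print(support(pattern, graphs))         # 2
print(is_frequent(pattern, graphs, 2))  # True
```

The subset test works only because labels are unique; with repeated labels, the check would become a subgraph isomorphism problem.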

The common pattern growth approach
Find a frequent subgraph of k edges, and expand it to k+1 edges to check its occurrence frequency.
– Koyuturk M., Grama A. & Szpankowski W. An efficient algorithm for detecting frequent subgraphs in biological networks. ISMB 2004
– Yan, Zhou, and Han. Mining Closed Relational Graphs with Connectivity Constraints. ICDE 2005

Problem of the Pattern-growth approach
The time and memory requirements increase exponentially with increasing pattern size and increasing number of networks. The number of frequent dense subgraphs explodes when there are very large frequent dense subgraphs, e.g., subgraphs with hundreds of edges.

Problem of the Pattern-growth approach
[Figure: pattern expansion from k to k+1 edges, shown as a growing collection of candidate subgraphs over nodes a to k]

Our solution
We develop a novel algorithm, called CODENSE, to mine frequent coherent dense subgraphs. The target subgraphs have three characteristics:
(1) Frequency: all edges occur in ≥ k graphs.
(2) Coherency: all edges exhibit correlated occurrences in the given graph set.
(3) Density: the density d = 2m/(n(n-1)) is higher than a threshold γ, where m is the number of edges and n is the number of nodes.
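The density criterion can be checked directly from the formula d = 2m/(n(n-1)). The sketch below is only an illustration: the function names and the toy edge set are assumptions, and the threshold value is arbitrary.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]


def density(num_edges: int, num_nodes: int) -> float:
    """d = 2m / (n(n-1)): fraction of possible node pairs that are connected."""
    if num_nodes < 2:
        return 0.0
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


def is_dense(edges: Set[Edge], gamma: float) -> bool:
    """True if the subgraph spanned by `edges` has density above gamma."""
    nodes = {v for e in edges for v in e}
    return density(len(edges), len(nodes)) > gamma


# Toy subgraph (hypothetical): a triangle a-b-c with a pendant node d.
sub = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}

print(density(3, 3))       # 1.0 for a triangle
print(is_dense(sub, 0.5))  # True: 4 edges on 4 nodes gives d = 8/12
```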

CODENSE: Mine coherent dense subgraph
[Figure: six input graphs G1 to G6 and the summary graph Ĝ derived from them]

CODENSE: Mine coherent dense subgraph, Step 2
[Figure: MODES applied to the summary graph Ĝ, yielding the dense subgraph Sub(Ĝ)]
Observation: If a frequent subgraph is dense, it must be a dense subgraph in the summary graph. However, the converse is not true.
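The slides do not spell out how the summary graph Ĝ is constructed. The sketch below assumes the simplest reading, that Ĝ keeps every edge occurring in at least k of the input graphs; the function name and toy data are hypothetical. The example also illustrates why the converse of the observation fails: a subgraph of Ĝ need not be frequent as a whole.

```python
from collections import Counter
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def summary_graph(graphs: List[Set[Edge]], k: int) -> Set[Edge]:
    """Keep each edge that occurs in at least k graphs (assumed rule)."""
    counts = Counter(e for g in graphs for e in g)
    return {e for e, c in counts.items() if c >= k}


# Toy input (hypothetical): three graphs over genes a, b, c, d.
graphs = [
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("b", "c"), ("c", "d")},
    {("a", "b"), ("c", "d")},
]

summary = summary_graph(graphs, k=2)
print(sorted(summary))  # [('a', 'b'), ('b', 'c'), ('c', 'd')]

# All three edges are individually frequent, yet the path a-b-c-d
# occurs as a whole in only one input graph.
print(sum(1 for g in graphs if summary <= g))  # 1
```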

CODENSE: Mine coherent dense subgraph, Step 3
[Figure: Sub(Ĝ) on nodes c, e, f, g, h, i]
Edge occurrence profiles:
E     G1  G2  G3  G4  G5  G6
c-e   0   0   1   1   0   1
c-f   0   1   0   1   1   1
c-h   0   0   0   1   1   1
c-i   0   0   1   1   1   0
e-f   0   0   0   1   1   1
…     …   …   …   …   …   …
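An edge occurrence profile is a binary vector with one entry per input graph. A minimal sketch of how such profiles could be computed is given below; the function name and the three toy graphs are assumptions made for illustration and do not reproduce the table.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def occurrence_profiles(graphs: List[Set[Edge]]) -> Dict[Edge, List[int]]:
    """Map each edge to a 0/1 vector: entry i is 1 if the edge is in graph i."""
    all_edges = sorted(set().union(*graphs))
    return {e: [1 if e in g else 0 for g in graphs] for e in all_edges}


# Toy input (hypothetical): three graphs.
graphs = [
    {("c", "f")},
    {("c", "f"), ("c", "h")},
    {("c", "h"), ("e", "f")},
]

for e, profile in occurrence_profiles(graphs).items():
    print(e, profile)
# ('c', 'f') [1, 1, 0]
# ('c', 'h') [0, 1, 1]
# ('e', 'f') [0, 0, 1]
```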

CODENSE: Mine coherent dense subgraph, Step 4
Edge occurrence profiles:
E     G1  G2  G3  G4  G5  G6
c-e   0   0   1   1   1   1
c-f   0   1   0   1   1   1
c-h   0   0   0   1   1   1
c-i   0   0   1   1   1   0
e-f   0   0   0   1   1   1
…     …   …   …   …   …   …
[Figure: second-order graph S with vertices g-h, f-i, e-i, h-i, g-i, e-g, e-h, c-h, f-h, c-f, e-f, c-i, c-e]

CODENSE: Mine coherent dense subgraph, Step 4
[Figure: second-order graph S and its dense subgraph Sub(S)]
Observation: If a subgraph is coherent (its edges show high correlation in their occurrences across a graph set), then its second-order graph must be dense.
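In the second-order graph, each vertex is an edge of the original networks, and two vertices are connected when their occurrence profiles are similar. The slides do not state which similarity measure or cutoff is used, so the sketch below assumes cosine similarity with an arbitrary threshold of 0.8. The four profiles are rows from the edge occurrence profile table; the resulting graph is an illustration and is not claimed to match the figure.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, FrozenSet, List, Set


def cosine(p: List[int], q: List[int]) -> float:
    """Cosine similarity of two occurrence profiles (assumed measure)."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = sqrt(sum(x * x for x in p)) * sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0


def second_order_graph(
    profiles: Dict[str, List[int]], threshold: float
) -> Set[FrozenSet[str]]:
    """Connect two first-order edges if their profiles are similar enough."""
    return {
        frozenset((e1, e2))
        for e1, e2 in combinations(profiles, 2)
        if cosine(profiles[e1], profiles[e2]) >= threshold
    }


profiles = {
    "c-f": [0, 1, 0, 1, 1, 1],
    "c-h": [0, 0, 0, 1, 1, 1],
    "c-i": [0, 0, 1, 1, 1, 0],
    "e-f": [0, 0, 0, 1, 1, 1],
}

S = second_order_graph(profiles, threshold=0.8)
for pair in sorted(sorted(p) for p in S):
    print(pair)
# ['c-f', 'c-h']
# ['c-f', 'e-f']
# ['c-h', 'e-f']
```

With this threshold, c-f, c-h and e-f form a triangle in S, while c-i stays isolated because its profile overlaps less with the others.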

CODENSE: Mine coherent dense subgraph, Step 5
[Figure: the dense subgraph Sub(S) of the second-order graph alongside the corresponding first-order subgraph Sub(G)]

CODENSE: Mine coherent dense subgraph (overview)
[Figure: overview of the six CODENSE steps, showing in order the input graphs G1 to G6, the summary graph Ĝ, Sub(Ĝ), the edge occurrence profiles, the second-order graph S, Sub(S), and Sub(G); the step labels include Add/Cut, MODES, and Restore G and MODES]

CODENSE
The design of CODENSE addresses the scalability issue. Instead of mining each biological network individually, CODENSE compresses the networks into two meta-graphs (the summary graph and the second-order graph) and performs clustering in these two graphs only. Thus, CODENSE can handle a large number of networks.

MODES: Mine overlapped dense subgraph
[Figure: example run of MODES on a graph G with intermediate results Sub(G), G' and Sub(G'); the four step labels, in order, are HCS', condense, HCS', and restore]

Comparison with other Methods
• By transforming all necessary information of the n graphs into two graphs, CODENSE achieves significant time and memory efficiency.
• CODENSE can mine both exact and approximate patterns. (Approximate frequent subgraph mining is an important but previously unaddressed problem.)
• CODENSE can be extended to pattern mining on weighted graphs.

Applying CODENSE to 39 yeast microarray data sets
[Figure: expression matrices (genes g1, g2, … by conditions c1 … cm), each shown with its co-expression network over nodes a to k]

[Figure: subgraph on the genes YDR115W, MRP49, PHB1, MRPL51, PET100, ATP12, MRPL37, ATP17, MRPL38, ACN9, MRPL32, MRPL39, FMC1, MRPS18]

Yellow: YDR115W, FMC1, ATP12, MRPL37, MRPS18
GO:0019538 (protein metabolism; p-value = 0.001122)

Red: PHB1, ATP17, MRPL51, MRPL39, MRPL49, MRPL51, PET100
GO:0006091 (generation of precursor metabolites and energy; p-value = 0.001339)

Functional annotation

Functional Annotation (Validation)
Method: leave-one-out approach, masking a known gene as unknown and assigning its function based on the other genes in the subgraph pattern.
Functional categories: 166 functional categories at GO level 6 or deeper.
Results: 448 predictions with an accuracy of 50%.
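The leave-one-out validation can be sketched as follows. The slides do not say how a function is assigned from the remaining genes, so this sketch assumes a simple majority vote and a single functional category per gene; the gene names, categories, and function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def predict_function(
    gene: str, pattern: List[str], annotation: Dict[str, str]
) -> Optional[str]:
    """Majority category among the other annotated genes in the pattern."""
    votes = Counter(
        annotation[g] for g in pattern if g != gene and g in annotation
    )
    return votes.most_common(1)[0][0] if votes else None


def leave_one_out_accuracy(
    patterns: List[List[str]], annotation: Dict[str, str]
) -> float:
    """Mask each annotated gene in turn and score the prediction."""
    correct = total = 0
    for pattern in patterns:
        for gene in pattern:
            if gene not in annotation:
                continue  # nothing to validate against
            predicted = predict_function(gene, pattern, annotation)
            if predicted is None:
                continue
            total += 1
            correct += predicted == annotation[gene]
    return correct / total if total else 0.0


# Toy data (hypothetical): one pattern, g5 has no known function.
annotation = {"g1": "A", "g2": "A", "g3": "A", "g4": "B"}
patterns = [["g1", "g2", "g3", "g4", "g5"]]

print(leave_one_out_accuracy(patterns, annotation))      # 0.75
print(predict_function("g5", patterns[0], annotation))   # A
```

The same `predict_function` call covers both uses: validation on masked known genes, and prediction for genes such as `g5` that have no annotation.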

Functional Annotation (Prediction)
We made functional predictions for 169 genes, covering a wide range of functional categories, e.g., amino acid biosynthesis, ATP biosynthesis, ribosome biogenesis, and vitamin biosynthesis. A significant number of our predictions are supported by the literature.

[Figure: subgraph on the genes NOP16, YGR172C, RRP15, LCP5, POP6]
We predicted RRP15 to participate in "ribosome biogenesis". Based on a recent publication (De Marchis et al., RNA 2005), this gene is involved in pre-rRNA processing.
