Karsten Borgwardt: Graph Mining in Bioinformatics, Page 1
Biological Network Analysis: Graph Mining in Bioinformatics
Karsten Borgwardt Interdepartmental Bioinformatics Group MPIs Tübingen
with permission from Xifeng Yan and Xianghong Jasmine Zhou
1 University of Southern California; 2 University of Illinois at Urbana-Champaign
[Figure: gene expression matrices (rows g1, g2, …; columns c1, c2, …, cm) and graphs over the vertices a to k]
[Figure: pattern expansion, k → k+1]
[Figure: six-step example on graphs G1 to G6 (vertices a to i). Panels: summary graph Ĝ; Sub(Ĝ) on vertices c, e, f, g, h, i; edge occurrence profiles E, a 0/1 table with one row per edge (c-e, c-f, c-h, c-i, e-f, …) and one column per graph G1 to G6; second-order graph S, whose vertices are the edges c-e, c-f, c-h, c-i, e-f, e-g, e-h, e-i, f-h, f-i, g-h, g-i, h-i; Sub(S), which omits c-i and f-i; Sub(G), with vertex sets {c, e, f, h} and {e, g, h, i}. Step labels: MODES, Add/Cut, MODES, Restore G and MODES.]
[Figure: four-step example on a graph over vertices a to j; labels: HCS’, condense, HCS’, restore, HCS’]
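The edge occurrence profiles and the second-order graph named in this example can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the Jaccard similarity, the `min_similarity` threshold, the helper names, and the toy graphs are all assumptions made for the sketch.

```python
from itertools import combinations


def edge_key(u, v):
    """Return an undirected edge as a sorted tuple."""
    return (u, v) if u <= v else (v, u)


def edge_occurrence_profiles(graphs, edges):
    """Map each edge to a 0/1 vector: 1 where the edge occurs in that graph."""
    return {e: [1 if e in g else 0 for g in graphs] for e in edges}


def profile_similarity(p, q):
    """Assumed similarity measure: Jaccard index of two occurrence profiles."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def second_order_graph(profiles, min_similarity):
    """Link two edges when their profiles are at least min_similarity alike."""
    return {
        (e1, e2)
        for e1, e2 in combinations(sorted(profiles), 2)
        if profile_similarity(profiles[e1], profiles[e2]) >= min_similarity
    }


# Hypothetical toy data: each graph is a set of undirected edges.
graphs = [
    {edge_key("c", "e"), edge_key("c", "f"), edge_key("e", "f")},
    {edge_key("c", "e"), edge_key("c", "f"), edge_key("g", "h")},
    {edge_key("c", "e"), edge_key("c", "f"), edge_key("e", "f")},
    {edge_key("g", "h")},
]
edges = set().union(*graphs)
profiles = edge_occurrence_profiles(graphs, edges)
S = second_order_graph(profiles, min_similarity=0.6)
```

Each vertex of `S` is an edge of the original graphs, so a dense region of `S` is a group of edges that tend to appear and disappear together across the graphs.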
Genes: ATP17, ATP12, MRPL38, MRPL37, MRPL39, FMC1, MRPS18, MRPL32, ACN9, MRPL51, MRP49, YDR115W, PHB1, PET100
Yellow: YDR115W, FMC1, ATP12, MRPL37, MRPS18; GO:0019538 (protein metabolism; p-value = 0.001122)
Red: PHB1, ATP17, MRPL51, MRPL39, MRPL49, MRPL51, PET100; GO:0006091 (generation of precursor metabolites and energy; p-value = 0.001339)
Annotation
Genes: POP6, YGR172C, LCP5, NOP16, RRP15
We predicted RRP15 to participate in "ribosome biogenesis". Based on a recent publication (De Marchis et al., RNA 2005), this gene is involved in pre-rRNA processing.
Genes: MRPL27, MRPS18, MRPL32, MRP49, QRI5
We predicted QRI5 to be involved in "protein biosynthesis"; QRI5 has been shown to participate in a common regulatory process together with MSS51 (Simon et al., 1992) and the GO annotation of MSS51 is "positive regulation of translation and protein biosynthesis".
NeMo | Network Module Mining
NCBI Gene Expression Omnibus: 137231 experiments
EBI ArrayExpress: 55228 experiments
Public microarray data grows threefold per year
[Figure: microarray (genes × conditions) → coexpression network → module; genes shown: MCM3, MCM7, NASP, FEN1, SNRPG, CDC2, CCNB1, UNG]
~9000 genes; 105 × ~(9000 × 9000) ≈ 8 billion edges
Transform the microarray data sets into graphs, then apply graph mining
Patterns discovered in multiple graphs are more reliable and significant
Dense vertexset
Mining poor quality data!
Transcriptional annotation
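The slides do not say how a microarray data set is turned into a graph. One common choice, assumed here purely for illustration, is to link two genes when the absolute Pearson correlation of their expression profiles reaches a cutoff. The function names, the `min_abs_corr` cutoff, and the toy expression values below are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def coexpression_graph(expression, min_abs_corr):
    """Return the gene pairs whose |correlation| is at least min_abs_corr."""
    return {
        (g1, g2)
        for g1, g2 in combinations(sorted(expression), 2)
        if abs(pearson(expression[g1], expression[g2])) >= min_abs_corr
    }


# Hypothetical expression matrix: gene -> values across conditions c1..c4.
expression = {
    "g1": [0.1, 0.2, 0.9, 0.8],
    "g2": [0.2, 0.3, 0.8, 0.9],
    "g3": [0.9, 0.1, 0.5, 0.2],
}
graph = coexpression_graph(expression, min_abs_corr=0.8)

# The slide's rough size estimate: 105 data sets, about 9000 x 9000 gene pairs each.
edge_slots = 105 * 9000 * 9000  # 8,505,000,000, which the slide rounds to 8 billion
```

Comparing all gene pairs is quadratic in the number of genes, which is where the scalability concern on this slide comes from.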
Bottom-up approach (small → large): frequent maximum dense (KDD’05)
Top-down approach (large → small): consensus clustering (Filkov and Skiena 04), summary graph (Lee et al. 04)
Coherent clustering (Hu et al. ISMB’05)
Partition and neighbor association (this work)
Scale down: M networks → ONE graph → clustering
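Collapsing M networks into one graph can be sketched as counting, for every edge, the number of networks that contain it. This is a minimal sketch under assumptions: the `min_support` cutoff and the toy networks are hypothetical, and the slides do not specify how edges are weighted or filtered.

```python
from collections import Counter


def summary_graph(graphs, min_support):
    """Count the graphs containing each edge; keep edges seen in at least min_support graphs."""
    counts = Counter(edge for graph in graphs for edge in graph)
    return {edge: n for edge, n in counts.items() if n >= min_support}


# Hypothetical toy networks: each is a set of undirected edges stored as sorted tuples.
graphs = [
    {("c", "e"), ("c", "f"), ("a", "b")},
    {("c", "e"), ("c", "f")},
    {("c", "e"), ("g", "h")},
]
summary = summary_graph(graphs, min_support=2)
print(summary)
```

The result is a single weighted graph that an ordinary clustering method can process, at the cost of forgetting which networks each edge came from.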
Dense subgraphs are accidentally formed by noise edges
They are false frequent dense vertexsets
Noise edges will also interfere with true modules
[Figure labels: dense subgraphs in summary graph; frequent dense vertexsets]
[Plot: noise edge ratio in summary graph vs. noise edge ratio in individual graph]
[Plot: number of false patterns]
How to choose a subset of networks? Randomly select?
Unsupervised partition
Supervised partition
Reduce the noise edge ratio (b) in summary graph
Use a subset of graphs: if m ↓, then b ↓
Reduce the number of false patterns
[Figure: three-step procedure, (1) to (3); surviving labels: clustering, identify, group, mining, together, seed]
Change the structure of summary graph: if p ↓, then N ↓
Summary graph measures the association of vertices. In
More stringent definition: the number of small frequent
Edge between u and v:
[…]: # of frequent dense vertexlets with k-1 nodes including u and v
[…]: # of frequent dense vertexlets with k nodes including u
If […] is larger, u and v are more likely from the same module
Normalization
105 human microarray data sets → NeMo → 4727 recurrent coexpression clusters (density > 0.7 and support > 10)
Validation based on ChIP-chip data (9521 target genes for 20 TFs): 15.4% homogeneous clusters (vs. 0.2% by randomization test)
Validation based on human-mouse conserved Transfac prediction (7720 target genes for 407 TFs): 12.5% homogeneous clusters (vs. 3.3% by randomization test)
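The thresholds "density > 0.7 and support > 10" can be read as a test on a vertex set: it must be dense in more than 10 of the networks. The sketch below assumes the usual definition of density (present edges divided by possible vertex pairs) and counts support as the number of networks passing the density cutoff; both readings, the function names, and the toy data are assumptions, not taken from the slides.

```python
from itertools import combinations


def density(vertices, graph):
    """Fraction of vertex pairs in `vertices` that are edges of `graph`."""
    pairs = list(combinations(sorted(vertices), 2))
    if not pairs:
        return 0.0
    return sum(1 for pair in pairs if pair in graph) / len(pairs)


def support(vertices, graphs, min_density):
    """Number of graphs in which the vertex set exceeds the density cutoff."""
    return sum(1 for graph in graphs if density(vertices, graph) > min_density)


def is_frequent_dense(vertices, graphs, min_density=0.7, min_support=10):
    """True if the vertex set is dense in more than min_support graphs."""
    return support(vertices, graphs, min_density) > min_support


# Hypothetical toy data: 11 graphs contain the triangle a-b-c, one contains only a-b.
triangle = {("a", "b"), ("a", "c"), ("b", "c")}
graphs = [set(triangle) for _ in range(11)] + [{("a", "b")}]

recurrent = is_frequent_dense({"a", "b", "c"}, graphs)
not_recurrent = is_frequent_dense({"a", "b", "d"}, graphs)
```

Edges are stored as sorted tuples so that the pairs generated by `combinations` match them directly.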
Percentage of potential transcription modules validated by ChIP-Chip data increases with cluster density and recurrence
[Bar chart: percentage (axis ticks 20%, 40%) for individual, summary, partition, NeMo]
NeMo = partition + neighbor-association
individual < multiple
partition works
NeMo is better!
Microarray data integration is important: it overcomes the noise issue
Microarray data integration is hard: it has a scalability issue
NeMo: a graph-based approach (partitioning, neighbor association, summary graph)