Peiying (Colleen) Ruan, PhD, Deep Learning Solution Architect 3/26/2018
PREDICTION OF HETERODIMERIC PROTEIN COMPLEXES FROM PROTEIN-PROTEIN INTERACTION NETWORKS USING DEEP LEARNING
OUTLINE
▪ Background
▪ Method
▪ Computational Experiments and Results
▪ Conclusions
BACKGROUND
[Figure: within a human cell, DNA is transcribed into mRNA and mRNA is translated into protein; proteins form complexes through protein-protein interactions (our work) and perform functions in the biological system, which relate to keeping healthy or to disease.]
BACKGROUND
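What is a heterodimer, and why predict it?
A heterodimer is a protein complex formed by two different proteins. Heterodimers account for about 40% of cataloged protein complexes (42% in CYC2008).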
BACKGROUND
[Figure: two interacting proteins, P1 and P2, each shown with its structure and its domain composition (domains D1, D2, D3). Pi: protein, Di: domain.]
BACKGROUND
[Figure: the interaction between proteins P1 and P2 (domain composition D1, D2, D3) is represented as an edge with weight w12 in a weighted PPI network. Pi: protein, Di: domain.]
METHOD
OVERVIEW OF THE PROBLEM
Input: a weighted PPI network.
Question: do two interacting proteins Pi and Pj form a heterodimer?
MULTIPLE INFORMATION + MULTIPLE DL MODELS
▪ Input data involving biological information
Protein-protein interaction (PPI)
Domain
Phylogenetic profile
▪ Deep neural network models including
Convolutional neural network (CNN)
Recurrent neural network (RNN)
CNN + RNN
PROTEIN-PROTEIN INTERACTION (PPI)
Figure 1. Example of a subgraph with an interacting protein pair and their neighboring proteins. [Figure: proteins Pi and Pj joined by an edge of weight wij, with a neighboring protein Pk connected by edges of weights wik and wjk.]
Table 1. Feature space mapping from two interacting proteins Pi, Pj and neighbors.
▪ The weight of the interaction between the focused proteins.
▪ The maximum weight of the interactions between either of the focused proteins and a neighboring protein.
▪ The minimum weight of the interactions between either of the focused proteins and a neighboring protein.
▪ The maximum of the smaller weights of interactions with neighboring proteins.
▪ The maximum difference among the neighboring weights.
…
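The PPI features listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the example weights, and the protein Pm are hypothetical, and the last two features are computed over neighbors shared by both focused proteins, which is one possible reading of the slide rather than a confirmed definition.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build a symmetric adjacency map from (protein, protein, weight) triples."""
    adj = defaultdict(dict)
    for a, b, w in edges:
        adj[a][b] = w
        adj[b][a] = w
    return adj


def ppi_features(adj, pi, pj):
    """Compute five weight-based features for the focused pair (pi, pj).

    Assumption: "smaller weights" and "differences" are taken over
    neighbors that interact with both pi and pj.
    """
    w_ij = adj[pi].get(pj, 0.0)
    # Neighbor weights of each focused protein, excluding the pair itself.
    nb_i = {k: w for k, w in adj[pi].items() if k != pj}
    nb_j = {k: w for k, w in adj[pj].items() if k != pi}
    neighbor_weights = list(nb_i.values()) + list(nb_j.values())
    common = set(nb_i) & set(nb_j)
    return {
        "pair_weight": w_ij,
        "max_neighbor_weight": max(neighbor_weights, default=0.0),
        "min_neighbor_weight": min(neighbor_weights, default=0.0),
        "max_smaller_weight": max(
            (min(nb_i[k], nb_j[k]) for k in common), default=0.0
        ),
        "max_weight_difference": max(
            (abs(nb_i[k] - nb_j[k]) for k in common), default=0.0
        ),
    }


# Hypothetical toy network; the weights are illustrative, not from WI-PHI.
edges = [("Pi", "Pj", 0.9), ("Pi", "Pk", 0.4), ("Pj", "Pk", 0.7), ("Pj", "Pm", 0.2)]
adj = build_adjacency(edges)
features = ppi_features(adj, "Pi", "Pj")
```

When a focused protein has no neighbors, the sketch falls back to 0.0; a real pipeline might prefer a different default.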
DOMAIN
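The whole domain pair set for all complexes in the dataset (5295 domain pairs):
{(D1, D1), (D1, D2), …, (D3, D3), …, (D9, D10), …, (Dn, Dn)}
Sample: a complex Cj of proteins P1 and P2, in which one protein has domains D3, D8, D9 and the other has domains D3, D3, D10.
Domain pairs of protein complex Cj: (D3, D3), (D3, D3), (D3, D10), (D8, D3), (D8, D3), (D8, D10), (D9, D3), (D9, D3), (D9, D10)
Each entry of the feature vector counts how often the corresponding domain pair occurs in the complex:
[Cj] = [ , …, 2, …, 1, …, 0 ]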
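A minimal sketch of the domain-pair encoding, assuming that pairs are unordered (so (D8, D3) and (D3, D8) share one entry) and that domain names have the form "D" followed by a number. The function names and the tiny vocabulary are hypothetical; the real vocabulary has 5295 entries.

```python
from collections import Counter
from itertools import product


def domain_index(name):
    """Numeric part of a domain name such as 'D10' (assumed naming scheme)."""
    return int(name[1:])


def domain_pair_counts(domains_a, domains_b):
    """Count the unordered domain pairs formed across the two proteins."""
    return Counter(
        tuple(sorted(pair, key=domain_index))
        for pair in product(domains_a, domains_b)
    )


def to_vector(counts, vocabulary):
    """Map pair counts onto a fixed vocabulary of domain pairs."""
    return [counts.get(pair, 0) for pair in vocabulary]


# The sample complex from the slide: D3, D8, D9 on one protein and
# D3, D3, D10 on the other.
counts = domain_pair_counts(["D3", "D8", "D9"], ["D3", "D3", "D10"])

# Hypothetical five-entry vocabulary, for illustration only.
vocabulary = [("D1", "D1"), ("D3", "D3"), ("D3", "D10"), ("D9", "D10"), ("D10", "D10")]
vector = to_vector(counts, vocabulary)
```

With this vocabulary the vector has the same shape as the slide's example: mostly zeros, with a 2 for (D3, D3) and a 1 for (D3, D10).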
PHYLOGENETIC PROFILE
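The whole organism set for all complexes in the dataset (2717 organisms):
{SC, BS, EC, …}, e.g., S. cerevisiae (SC), B. subtilis (BS), E. coli (EC)
Each protein has a binary phylogenetic profile: 1 if the protein is present in an organism, 0 otherwise.
[Table: presence of proteins P1 to P4 in SC, BS, and EC; the individual cell values are not recoverable from the extracted text.]
The profile of a complex Cj of proteins P1 and P2 combines the two protein profiles element-wise:
[Cj] = [Q(P1, P2)] = [ 1, … ], where Q(a, b) = min(a, b)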
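The combination rule Q(a, b) = min(a, b) can be sketched as an element-wise minimum over two profiles. The function name and the example profile values below are hypothetical and are not taken from the slide's table.

```python
def complex_profile(profile_a, profile_b):
    """Combine two phylogenetic profiles with Q(a, b) = min(a, b)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same organisms")
    return [min(a, b) for a, b in zip(profile_a, profile_b)]


# Hypothetical binary profiles over three organisms (SC, BS, EC).
p1 = [1, 0, 1]
p2 = [1, 1, 0]
cj = complex_profile(p1, p2)
```

For binary profiles the minimum acts as a logical AND: an organism contributes 1 only when both proteins of the pair are present in it.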
COMPUTATIONAL EXPERIMENTS
▪ Databases
CYC2008: A manually curated comprehensive catalogue of yeast protein complexes, including 172 (42%) heterodimers.
WI-PHI: A weighted PPI database containing 49,607 interacting protein pairs, excluding self-interactions.
▪ Positives and Negatives
[Figure: proteins P1 to P4 and complexes C1 and C2.]
Positives: (P1, P2)
Negatives: (P1, P3), (P2, P4), (P3, P4), and (P1, P4)
Number of samples: 5497
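One way to reproduce the labeling in this example is sketched below. The selection rule is an assumption, since the slide only gives the example: every interacting pair that is a known heterodimer is labeled 1, and every other interacting pair is labeled 0. The function name is hypothetical.

```python
def label_pairs(interacting_pairs, heterodimers):
    """Label each interacting pair: 1 for a known heterodimer, 0 otherwise.

    Assumed rule, not confirmed by the slide: all interacting pairs that
    are not known heterodimers are treated as negatives.
    """
    known = {frozenset(pair) for pair in heterodimers}
    samples = []
    for a, b in interacting_pairs:
        if a == b:
            continue  # self-interactions are excluded
        samples.append(((a, b), 1 if frozenset((a, b)) in known else 0))
    return samples


# The example from the slide.
interacting_pairs = [("P1", "P2"), ("P1", "P3"), ("P2", "P4"), ("P3", "P4"), ("P1", "P4")]
heterodimers = [("P2", "P1")]
samples = label_pairs(interacting_pairs, heterodimers)
labels = [label for _, label in samples]
```

Using `frozenset` makes the lookup independent of the order in which the two proteins are written.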
INPUT DATA
e.g., domain property
The whole domain pair set for all complexes in the dataset:
{(D1, D1), (D1, D2), …, (D3, D3), …, (D9, D10), …, (Dn, Dn)}
Input data:
[C1] = [ , …, 2, …, 1, …, 0 ]
[C2] = [ 1, …, , …, 0, …, 1 ]
…
[C5497] = [ 0, …, 2, …, 1, …, 0 ]
Label: [ 1, … ]
INPUT DATA
e.g., domain + phylogenetic profile
The whole (domain pair set + organism set) for all complexes in the dataset (5295 + 2717 features):
{(D1, D1), (D1, D2), …, (Dn, Dn), SC, BS, EC, …}
Input data:
[C1] = [ , …, , 0, 0, 1, … ]
[C2] = [ 1, …, 1, 1, 0, 0, … ]
…
[C5497] = [ 0, …, , 0, 1, 1, … ]
Label: [ 1, … ]
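Combining the two kinds of information amounts to concatenating the per-complex vectors, which gives 5295 + 2717 = 8012 features per sample. A minimal sketch follows; the function name and the placeholder values are hypothetical.

```python
N_DOMAIN_PAIRS = 5295  # size of the domain pair set (from the slides)
N_ORGANISMS = 2717     # size of the organism set (from the slides)


def combine_features(domain_vector, profile_vector):
    """Concatenate the domain-pair counts and the phylogenetic profile."""
    if len(domain_vector) != N_DOMAIN_PAIRS:
        raise ValueError("unexpected domain vector length")
    if len(profile_vector) != N_ORGANISMS:
        raise ValueError("unexpected profile vector length")
    return list(domain_vector) + list(profile_vector)


# Placeholder vectors; the non-zero positions are arbitrary.
domain_vector = [0] * N_DOMAIN_PAIRS
domain_vector[10] = 2
profile_vector = [0] * N_ORGANISMS
profile_vector[2] = 1

x = combine_features(domain_vector, profile_vector)
```

Stacking one such vector per complex yields an input matrix with 5497 rows, paired with a label vector of the same length.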
MODELS
▪ CNN: Input data → Convolutional neural network → Output
▪ RNN: Input data → Recurrent neural network → Output
▪ CNN + RNN: Input data → Convolutional neural network → Recurrent neural network → Output
- D. Quang et al., DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences, Nucleic Acids Research, 2016
RESULTS
PERFORMANCE MEASURES
tp: true positive, tn: true negative, fp: false positive, fn: false negative
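$\mathrm{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$
$\mathrm{Precision} = \frac{tp}{tp + fp}$
$\mathrm{Recall} = \frac{tp}{tp + fn}$
$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$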
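The four measures can be computed directly from the confusion counts, as in the sketch below. The function name and the example counts are hypothetical and are not results from the experiments; returning 0.0 when a denominator is zero is a design choice of this sketch.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

Because heterodimers are the minority class in the example of positives and negatives, F1 is a useful complement to accuracy: it ignores true negatives and so is not inflated by the many negative pairs.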
COMPARISON OF MODEL + INFORMATION
Models                                      Training accuracy  Training loss  Test accuracy  Evaluation score (F1)
CNN (domain)                                0.80               1.311          0.79           0.68
CNN (domain+PPI)                            0.84               1.124          0.81           0.69
CNN (domain+PPI+Phylogenetic profile)       0.83               0.912          0.81           0.69
RNN (domain+PPI+Phylogenetic profile)       0.71               2.334          0.72           0.66
CNN+RNN (domain+PPI+Phylogenetic profile)   0.86               0.865          0.85           0.72
Baseline method* SVM (PPI+domain)           0.65               -              0.73           0.63
*P. Ruan et al., Prediction of Heterodimeric Protein Complexes from Weighted Protein-Protein Interaction Networks Using Novel Features and Kernel Functions, PLoS One, 2013
CPU VS GPU
[Figure: bar chart of training time per epoch (sec), CPU vs. DGX Station; axis from 100 to 600 sec, with one bar labeled 8 min 12 sec.]
DGX Station is 40 times faster!!