Logic Programming for Big Data in Computational Biology
Nicos Angelopoulos
Wellcome Sanger Institute, Hinxton, Cambridge
nicos.angelopoulos@sanger.ac.uk
18.9.18

overview
◮ knowledge for Bayesian machine learning over model structure
◮ applied knowledge representation for biological data analytics
Bayesian inference of model structure (Bims)
A Bayesian machine learning system that can model prior knowledge by means of probabilistic logic programming.
Nomenclature
◮ DLPs = Distributional logic programs
◮ Bims = Bayesian inference of model structure
Timeline
◮ Theory (York, 2000-5)
◮ Applications (Edinburgh, 2006-8, IAH 2009, NKI 2013)
◮ Bims library and theory paper 2015-2017
Bims Overview
◮ syntax of DLPs
◮ a succinct classification tree prior program
◮ Bayesian learning of model structure
◮ learning classification and regression trees
◮ Bayesian learning of Bayesian networks
◮ the bims library
DLPs description
We extend LP’s clausal syntax with probabilistic guards that associate a resolution step using a particular clause with a probability whose value is computed on-the-fly. The intuition is that this value can be used as the probability with which the clause is selected for resolution. Thus, in addition to the logical relation that a clause defines over the objects that appear as arguments in its head, the clause also defines a probability distribution over aspects of this relation.
DLPs example
member(H, [H|T]).
member(El, [H|T]) :-                              (C1)
    member(El, T).

L :: length(List, L) ∼ El :: umember(El, List)    (G1)

1/L :: L :: umember(El, [El|Tail]).               (C2)
1 - 1/L :: L :: umember(El, [H|Tail]) :-          (C3)
    umember(El, Tail).
DLPs probabilistic goals
1/L :: L :: umember(El, [El|Tail]).               (C4)
1 - 1/L :: L :: umember(El, [H|Tail]) :-          (C5)
    K is L - 1,
    K :: umember(El, Tail).
DLPs query
?- umember(X, [a, b, c]).
X = a (1/3 of the times = 1/3);
X = b (1/3 of the times = 2/3 * 1/2);
X = c (1/3 of the times = 2/3 * 1/2 * 1).
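The arithmetic in the query above can be checked with a small Python sketch. This is not DLP code and not part of the bims library: the function names `umember_distribution` and `sample_umember` are made up for illustration, and the sketch assumes that each resolution step either stops at the head of the list with probability 1/L or moves on to the tail with probability 1 − 1/L, where L is the length of the remaining list.

```python
from fractions import Fraction
import random


def umember_distribution(items):
    """Exact probability of each answer, following the guards 1/L and 1 - 1/L."""
    if not items:
        raise ValueError("umember has no answers for an empty list")
    dist = {}
    reach = Fraction(1)  # probability of reaching the current position
    for index, item in enumerate(items):
        remaining = len(items) - index      # plays the role of L
        stop = Fraction(1, remaining)       # guard 1/L: select the head
        dist[item] = dist.get(item, Fraction(0)) + reach * stop
        reach *= 1 - stop                   # guard 1 - 1/L: continue with the tail
    return dist


def sample_umember(items, rng=random):
    """Draw one answer by simulating the guarded clause choice (illustrative)."""
    if not items:
        raise ValueError("umember has no answers for an empty list")
    remaining = len(items)
    for item in items:
        # The guard equals 1 on the last element, so the loop always returns.
        if rng.random() < 1 / remaining:
            return item
        remaining -= 1


if __name__ == "__main__":
    print(umember_distribution(["a", "b", "c"]))
```

For the list `[a, b, c]` every element ends up with probability 1/3, which is the uniform behaviour the slide describes.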
simple tree prior
?- cart(ζ, ξ, A, M).
M = nd(x2, 1, nd(x1, 0, lf, lf), lf)

(C0) cart(ζ, ξ, M, Cart) :-
         ψ0 is ζ,
         ψ0 : split(0, ζ, ξ, M, Cart).
(C1) ψD : split(D, ζ, ξ, MB, nd(F, Val, L, R)) :-
         ψD+1 is ζ * (1 + D)^(-ξ),
         D1 is D + 1,
         r_select(F, Val, MB, LB, RB),
         ψD+1 : split(D1, ζ, ξ, LB, L),
         ψD+1 : split(D1, ζ, ξ, RB, R).
(C2) 1 - ψD : split(D, ζ, ξ, MB, lf).
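To make the shape of this prior concrete, here is a minimal Python sketch of sampling a tree whose split probability decays with depth as ζ(1 + d)^(−ξ). It is an illustration only, not the bims implementation. The names `split_probability` and `sample_tree` are hypothetical. The uniform choice of feature and value is a stand-in for `r_select`, which in the slide also partitions the data. The sketch takes d to be the depth of the node being split, which is an assumption.

```python
import random


def split_probability(depth, zeta, xi):
    """Probability that a node at the given depth is split (assumed form)."""
    return zeta * (1 + depth) ** (-xi)


def sample_tree(zeta, xi, features, rng, depth=0):
    """Sample a tree as nested tuples ('nd', feature, value, left, right) or 'lf'."""
    if rng.random() < split_probability(depth, zeta, xi):
        # Stand-in for r_select/5: pick a feature and a binary split value.
        feature = rng.choice(features)
        value = rng.choice([0, 1])
        left = sample_tree(zeta, xi, features, rng, depth + 1)
        right = sample_tree(zeta, xi, features, rng, depth + 1)
        return ("nd", feature, value, left, right)
    return "lf"


def tree_depth(tree):
    """Depth of a sampled tree; a lone leaf has depth 0."""
    if tree == "lf":
        return 0
    return 1 + max(tree_depth(tree[3]), tree_depth(tree[4]))


if __name__ == "__main__":
    rng = random.Random(42)
    tree = sample_tree(0.95, 1.0, ["x1", "x2"], rng)
    print(tree, tree_depth(tree))
```

Larger ξ makes deep trees rarer, while ζ sets how likely the root is to be split at all.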
Bims theory
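Bayes’ Theorem
\[ p(M \mid D) = \frac{p(D \mid M)\, p(M)}{\sum_{M} p(D \mid M)\, p(M)} \]
Metropolis-Hastings
\[ \alpha(M_i, M_{\star}) = \min\left\{ \frac{q(M_{\star}, M_i)\, P(D \mid M_{\star})\, P(M_{\star})}{q(M_i, M_{\star})\, P(D \mid M_i)\, P(M_i)},\ 1 \right\} \]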
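The Metropolis-Hastings acceptance probability α(Mi, M⋆) can be sketched in Python as follows. This is a generic illustration rather than the Bims code: the function names and the log-space arguments are assumptions introduced here, and the caller is expected to supply the log likelihoods, log priors and log proposal probabilities for the current and proposed models.

```python
import math
import random


def acceptance_probability(log_lik_cur, log_prior_cur,
                           log_lik_new, log_prior_new,
                           log_q_cur_to_new, log_q_new_to_cur):
    """min(1, ratio), with the ratio computed in log space for stability.

    log_q_cur_to_new is log q(Mi, M*): proposing M* when Mi is current.
    log_q_new_to_cur is log q(M*, Mi): proposing Mi when M* is current.
    """
    log_ratio = ((log_q_new_to_cur + log_lik_new + log_prior_new)
                 - (log_q_cur_to_new + log_lik_cur + log_prior_cur))
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def mh_accept(alpha, rng=random):
    """Accept the proposed model with probability alpha."""
    return rng.random() < alpha


if __name__ == "__main__":
    # Hypothetical numbers: the proposal halves the likelihood, all else equal.
    alpha = acceptance_probability(0.0, 0.0, math.log(0.5), 0.0, 0.0, 0.0)
    print(alpha, mh_accept(alpha, random.Random(0)))
```

Working in log space avoids underflow when the likelihood is a product over many data points.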
DLP defined model space
From Mi identify Gi then sample forward to M⋆. q(Mi, M⋆) is the probability of proposing M⋆ when Mi is the current model.
Pyruvate kinase interactors
objective
Improve the chances of discovering binding molecules based on examples from screened chemical libraries.
pyruvate kinase affinity data
582 active and 582 inactive molecules. Dragon software produces 1500 property descriptors for each molecule, of which about 1100 were used.
ten-fold cross-validation
Compared to feed-forward neural networks and support vector machines by splitting the data into ten train/test segments.
best likelihood model
ten-fold validation
\[ \text{Sensitivity} = \frac{T^{+}}{T^{+} + F^{-}} \qquad \text{Specificity} = \frac{T^{-}}{T^{-} + F^{+}} \]
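As a small worked example of these two measures, the Python sketch below computes them from confusion-matrix counts. The counts are hypothetical and are not results from the pyruvate kinase study. The function names are chosen here for illustration.

```python
def sensitivity(true_pos, false_neg):
    """T+ / (T+ + F-): fraction of actives that are predicted active."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """T- / (T- + F+): fraction of inactives that are predicted inactive."""
    return true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # Hypothetical counts for one test fold.
    print(sensitivity(true_pos=45, false_neg=5))
    print(specificity(true_neg=40, false_pos=10))
```

In a ten-fold setting, these would typically be computed per test segment and then averaged.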
molecules of Eduliss according to BCarts
Bims: Bayesian inference of model structure
Released in 2016 as an easily installable SWI-Prolog library (IJAR paper in 2017)
Includes
◮ priors and likelihoods for: CARTs and Bayesian networks
◮ hooks for user defined models
Probabilistic logic programming
◮ thesis: probabilistic finite domains
◮ PLP workshop and IJAR associated issues (5th edition)
knowledge-based computational biology
◮ graphical models (focal adhesion dynamics, NKI, 2011-3)
◮ proteomics functional analysis (TKSilac, KSR1, ATG9A, Imperial, 2014-5)
◮ mutational profiling (14MG, Sanger, 2016-8)
Graphical models of FAD
Graphical models (aka Bayesian networks) can provide a network view of dependencies among variables, capturing much richer information than pairwise correlations. In this project, microscopy-based variables characterising focal adhesion over time are connected for a number of conditions in the HGF pathway.
tkSilac: tyrosine kinase screen
◮ MCF7 cell line
◮ 33 SILAC runs
◮ 65/66 expressed tyrosine kinases
◮ 4739 quantified in some experiment
◮ 1000 quantified in 60 or more TK KO
Figure 2
[Heatmap residue: rows and labels are the silenced tyrosine kinases; colour key "Value" with ticks at 0.5 and 2.]
Fig. 2. Heatmap of quantified proteins after TK silencing. The overall pattern of regulation is shown in the heatmap of quantified values. After normalization to siControl, fold-change values are all above 0, with a value of 1 showing that the expression level of the specific protein is not altered after silencing TKs. For each knockdown (rows)
Figure 4 (panels A to D)
[Figure residue. Recoverable labels: counts of up- and down-regulated proteins per cluster (clusters 1 to 10, count axis 25 to 100).]
Labelled kinases: ABL1, AXL, EPHA2, EPHA4, LMTK2, MST1R, NTRK1, RYK
Labelled proteins: CD44, TMEM164, EPB41L1, CA12, PGR, NRCAM, FREM2, SDC1, CELSR2, LXN, MYOF, GFRA1, GLA, GUSB, FARP1, GREB1, FBP1, CYB5R1, GSTM3, NCAM2, GGH, CA2, MUC1, SLC38A1, FLNB, ERMP1, PBXIP1, PODXL, ADAM10, BLOC1S5, CDH1, AKR1C2, BASP1, SLC12A2, MVP, AGR2, FTH1, LCP1, LASP1, BCAS1, PYGB, STAT1, PREX1, ME1, SLC7A1, MYBBP1A, HEATR6, CAD, KRT18, MC1R, GALNT7, GEMIN4, PHGDH, TXNRD1, USP32, MON2, LTN1, RNF213, VDAC1, TUBA1C, HUWE1, ASMTL, GYG1, MROH1, PRKDC, ASS1, NUP188, ASNS, MUC5B, LDHB, XPOT, HEATR1

Panel C: tyrosine kinases (TKs) per cluster
Cluster  No.  TKs
1        8    ABL1, AXL, EPHA2, EPHA4, LMTK2, MST1R, NTRK1, RYK
2        15   ABL2, BTK, CSF1R, CSK, DDR1, EPHA3, EPHA6, EPHA7, EPHB2, EPHB6, ERBB3, FES, FGFR1, FLT3, FYN
3        3    EGFR, EPHA1, NTRK3
4        6    EPHB1, EPHB4, FGFR2, FLT1, FRK, INSR
5        2    EPHB3, ERBB2
6        6    ERBB4, LMTK3, MATK, ROR2, TYRO3, ZAP70
7        7    HCK, IGF1R, MERTK, MET, PDGFRB, PTK6, SRC
8        3    JAK1, LCK, PTK2
9        12   JAK2, KDR, NTRK2, PTK2B, PTK7, RET, ROR1, SYK, TEC, TNK1, TYK2, YES1
10       3    LYN, STYK1, TNK2
Figure 5 (panels A and B)
[Figure residue. Both panels show clusters 1 to 10 against a percentage (%) axis.]
Panel A terms: Cell adhesion, Apoptosis, Development, Growth, Immune system, Transport, Metabolic process, Reproduction, Cell cycle, Cell communication
Panel B terms: Antioxidant activity, Binding, Catalytic activity, Chemoattractant activity, Enzyme regulator activity, Molecular transducer activity, Protein binding, TF activity, Receptor activity, Structural molecule activity, Transporter activity
Fig. 5. Characterization of a functional portrait for each cluster. A, A functional profile of top GO biologic processes that the up- and downregulated proteins belong to is presented. x-axis shows the percentage of hits in each cluster that belong to a GO biologic process term. The color coding and the number for each cluster are indicated as above. B, A functional profile of top GO molecular functions that the up- and downregulated proteins belong to is presented. x-axis shows the percentage of hits in each cluster that belong to a GO molecular function term.
Figure 6
[Figure residue: protein labels grouped by cluster and functional term. Where a cluster carries two terms, the two label groups are listed in the order they appear on the slide.]
cluster 1 (development): ADAM10, AGR2, AKR1C2, ASNS, ASS1, BASP1, BLOC1S5, CA2, CAD, CD44, CDH1, CELSR2, FARP1, FLNB, FREM2, GFRA1, GSTM3, HUWE1, KRT18, LCP1, MC1R, MUC1, MUC5B, MYBBP1A, NCAM2, NRCAM, PBXIP1, PGR, PHGDH, PODXL, PREX1, PRKDC, SDC1, SLC12A2, STAT1, TXNRD1, VDAC1
cluster 2 (response to cytokine; cytoskeleton organization):
  ACSL1, ASS1, CD44, FLNB, KPNA2, KRT18, NUP210, PRKCD, PXDN, RPS27A
  ACTN1, ACTN4, EPB41L1, FLNB, KRT18, LMNA, MAP4, NUMA1, PREX1, PRKCD, SPTBN2
cluster 3 (immune system process): ADAR, ANXA3, APOB, ASS1, CD44, CRIP2, DDX58, DDX60, EIF2AK2, FTH1, HERC6, IFI35, IFIH1, IFIT1, IFIT2, IFIT3, IRS1, ISG15, ITGA2, MIR7703, MX1, NFKB2, OAS1, OAS2, OAS3, PARP9, PML, PRKDC, PSME1, RPL22, SAMHD1, SLC3A2, STAT1, STAT2, TAP1, TAP2, TAPBP, TRIM25
cluster 4 (ion transmembrane transport; response to hormone):
  AKR1C2, CA2, CAD, GBA, MUC5B, PCK2, RBM14, SDC1, STAT1
  ACTN4, ANXA6, LASP1, SLC38A2, SLC3A2, SLC6A14, SLC7A1, SLC7A2, SLC7A5
cluster 5 (RNA transport; catabolic process):
  LRPPRC, MVP, NUP210, PNPT1
  ACSL1, AGRN, BLVRA, CD44, EPHX1, FBP1, G6PD, GSTM3, HSP90B1, KIAA1324, LRPPRC, MYH14, PNPT1, PXDN, SDC1
cluster 6 (cell motility): CAV1, CD63, CELSR1, CTNNB1, DST, GPR56, HMGB1, HMGB2, ITGA5, L1CAM, LGALS3, NRP1, SCARB1, SLC3A2, SLC7A5
cluster 7 (metabolic process): ACSL1, AHSA1, AKR1C2, APRT, ASS1, CA12, CAD, CERS2, CYB5R3, DDX20, ENO1, ERO1L, GBA, GLA, GPD2, HK1, INPP4B, MON2, MYH14, PFKL, PHGDH, PYGB, SLC44A2, SLC5A6, SLC7A2, SLC7A5, SLC9A1, SORD, TUBA1A, TUBA1C, TUBB4A, TUBB8
cluster 8 (DNA replication): DNMT1, DUT, FANCD2, FEN1, H1FX, KPNA2, MCM2, MCM3, MCM4, MCM5, MCM6, MCM7, NCAPD2, PCNA, POLD1, RAN, SMC2, SMC4, SSRP1, SUPT16H
cluster 9 (protein localization): ACTN4, AGRN, ARMCX3, BAG6, CELSR2, FLNB, JUP, LCP1, MGEA5, MLPH, MVP, MYH9, PDCD6IP, RPS27A, SLC9A3R1, SYNE2, TFRC, XPO1, XPOT
cluster 10 (cell death): ACTN4, ASNS, BAG6, CD44, CDH1, MUC1, NQO1, PRDX2, PRDX3, PRKDC, SLC25A24, SLC9A3R1, SORT1, VDAC1, VDAC2
volcano plot (BT474HR H/M)
[Volcano plot residue. x-axis: Log2 Ratio Total Protein (R10K8/R0604), ticks −4 to 4. y-axis: Log10 Intensity (R10K8/R0604), ticks 6 to 10. Significance legend: < 0.001, < 0.01, < 0.05, > 0.05.]
Labelled proteins: CLEC1B, CNOT8, SCAP, SYF2, ATG9A, STOML2, LDHB, TRIP6, DHX37, STK3, KATNAL2, VCP, RP1, YBX2, BLMH, ANO1, SFN, DCTN5, PEX3, MCM6, MCM3, LGALS1, PEG10, GBP2, VIM, SLC6A14, SH3BGRL, GSTM3, PGM5, KYNU, CRIP1, S100A6, SSNA1, CA5B, IGFBP3, RGN, BSG, SEMA4C, EADS2
Autophagy
ATG16L1, ATG3, ATG4B, ATG7, ATG9A, BAG1, BAG3, BNIP1, CALCOCO2, CANX, CASP3, CASP8, CD46, CDKN1B, CHMP2B, DNAJB1, ERBB2, FOXO3, GAA, GABARAPL2, HDAC6, HSPB8, ITGA6, ITGB4, MAP1LC3B, MAPK3, MTOR, NBR1, NFKB1, PEX3, PRKCD, RAB7A, RB1, RELA, SERPINA1, SQSTM1, ST13, TBK1, TP53, TSC1, TSC2, ULK1, WDFY3, WIPI2
Myelodysplastic syndrome, NGS somatic mutation profiling
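AUC (1y) for Clinical vs Lasso_min model
[Plot residue: y-axis AUC (1y), ticks 0.5 to 1.0; annotated value 0.0037.]
Summary values (two per metric, in the order printed on the slide):
AUC (1y)      0.679351   0.78364
Harrell's C   0.634512   0.712633
R square      0.179789   0.415647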
5 year: Clinical vs Lasso (Optimal)
[Figure residue: two panels, Clinical and Lasso_min, each with an x-axis from 1 to 5 and a y-axis from 0.0 to 1.0.]
myeloma structural variations
Events: t_14_16, del13q14, HDR, t_4_14, delCYLD, t_11_14, del17p13, NRAS, gain1q21, TP53, delFAM46C, delTRAF3, DIS3, KRAS
Legend: # events (shown med=225, max=643); Co-occur (shown odds=4); Mut.excl (shown odds=0.25); Fisher test odds
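For readers unfamiliar with the odds shown in this kind of plot, the Python sketch below computes an odds ratio and a one-sided Fisher exact p-value for a 2x2 co-occurrence table of two events. It is an illustration under stated assumptions and not the analysis behind the slide: the counts are hypothetical, the function names are made up here, and the odds ratio is the simple sample estimate rather than the conditional estimate that some statistics packages report.

```python
import math
from math import comb


def odds_ratio(both, only_a, only_b, neither):
    """Sample odds ratio; > 1 suggests co-occurrence, < 1 mutual exclusivity."""
    if only_a * only_b == 0:
        return math.inf
    return (both * neither) / (only_a * only_b)


def fisher_one_sided(both, only_a, only_b, neither):
    """P(at least `both` co-occurrences) with the table margins held fixed."""
    row_a = both + only_a            # samples carrying event A
    col_b = both + only_b            # samples carrying event B
    total = both + only_a + only_b + neither
    upper = min(row_a, col_b)
    favourable = sum(comb(row_a, k) * comb(total - row_a, col_b - k)
                     for k in range(both, upper + 1))
    return favourable / comb(total, col_b)


if __name__ == "__main__":
    # Hypothetical counts: 3 samples with both events, 1 with A only,
    # 1 with B only, 3 with neither.
    print(odds_ratio(3, 1, 1, 3))
    print(fisher_one_sided(3, 1, 1, 3))
```

With these hypothetical counts the odds ratio is 9 and the one-sided p-value is 17/70, so the apparent co-occurrence would not be significant in a sample this small.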