Logic Programming for Big Data in Computational Biology Nicos - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Logic Programming for Big Data in Computational Biology

Nicos Angelopoulos

Wellcome Sanger Institute
Hinxton, Cambridge
nicos.angelopoulos@sanger.ac.uk

18.9.18

slide-2
SLIDE 2
Overview

◮ knowledge for Bayesian machine learning over model structure
◮ applied knowledge representation for biological data analytics

slide-3
SLIDE 3

Bayesian inference of model structure (Bims)

A Bayesian machine learning system that can model prior knowledge by means of probabilistic logic programming.

Nomenclature
◮ DLPs = Distributional logic programs
◮ Bims = Bayesian inference of model structure

Timeline
◮ Theory (York, 2000-5)
◮ Applications (Edinburgh, 2006-8, IAH 2009, NKI 2013)
◮ Bims library and theory paper 2015-2017

slide-4
SLIDE 4

Bims Overview

◮ syntax of DLPs
◮ a succinct classification tree prior program
◮ Bayesian learning of model structure
◮ learning classification and regression trees
◮ Bayesian learning of Bayesian networks
◮ the bims library

slide-5
SLIDE 5

DLPs: description

We extend LP's clausal syntax with probabilistic guards that associate a resolution step using a particular clause with a probability whose value is computed on-the-fly. The intuition is that this value can be used as the probability with which the clause is selected for resolution. Thus, in addition to the logical relation that a clause defines over the objects appearing as arguments in its head, it also defines a probability distribution over aspects of this relation.

slide-6
SLIDE 6

DLPs example

member(H, [H|T]).
member(El, [H|T]) :-                                  (C1)
    member(El, T).

L :: length(List, L) ∼ El :: umember(El, List)        (G1)

1/L :: L :: umember(El, [El|Tail]).                   (C2)
1 - 1/L :: L :: umember(El, [H|Tail]) :-              (C3)
    umember(El, Tail).

slide-7
SLIDE 7

DLPs probabilistic goals

1/L :: L :: umember(El, [El|Tail]).                   (C4)
1 - 1/L :: L :: umember(El, [H|Tail]) :-              (C5)
    K is L - 1,
    K :: umember(El, Tail).

slide-8
SLIDE 8

DLPs query

?- umember(X, [a, b, c]).
X = a (1/3 of the times = 1/3);
X = b (1/3 of the times = 2/3 * 1/2);
X = c (1/3 of the times = 2/3 * 1/2 * 1).
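The arithmetic in the answers above can be checked with a short Python sketch. It is not DLP code and does not use the Bims library; the function name `umember_distribution` is hypothetical. It follows the reading of clauses (C4) and (C5) in which, with L elements remaining, the head is selected with probability 1/L and the tail is visited with probability 1 - 1/L.

```python
from fractions import Fraction


def umember_distribution(items):
    """Exact probability with which each element of `items` is selected.

    With L elements remaining, the head is picked with probability 1/L
    (clause C4); otherwise, with probability 1 - 1/L, the search moves to
    the tail with L - 1 elements remaining (clause C5).
    """
    dist = {}
    reach = Fraction(1)        # probability of reaching the current position
    remaining = len(items)
    for item in items:
        pick = Fraction(1, remaining)
        dist[item] = dist.get(item, Fraction(0)) + reach * pick
        reach *= 1 - pick
        remaining -= 1
    return dist


if __name__ == "__main__":
    for element, prob in umember_distribution(["a", "b", "c"]).items():
        print(element, prob)
```

Each of the three elements comes out with probability 1/3, matching the products shown for the query.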

slide-9
SLIDE 9

simple tree prior

?- cart(ζ, ξ, A, M).
M = nd(x2,1,nd(x1,0,lf,lf),lf)

(C0) cart(ζ, ξ, M, Cart) :-
         ψ_0 is ζ,
         ψ_0 : split(0, ζ, ξ, M, Cart).

(C1) ψ_D : split(D, ζ, ξ, MB, nd(F, Val, L, R)) :-
         ψ_{D+1} is ζ * (1 + D)^(-ξ),
         D1 is D + 1,
         r_select(F, Val, MB, LB, RB),
         ψ_{D+1} : split(D1, ζ, ξ, LB, L),
         ψ_{D+1} : split(D1, ζ, ξ, RB, R).

(C2) 1 - ψ_D : split(D, ζ, ξ, MB, lf).
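To make the generative reading of this prior concrete, here is a minimal Python sketch that samples a tree skeleton by following clauses (C0) to (C2) as written on the slide. It is an illustration under assumptions, not the Bims implementation: `sample_tree` is a hypothetical name, the data-dependent feature and threshold selection is replaced by a uniform choice of feature name, and split values are omitted.

```python
import random


def sample_tree(zeta, xi, features, rng, depth=0, psi=None):
    """Sample a tree skeleton from the split prior.

    Returns "lf" for a leaf or ("nd", feature, left, right) for a split.
    """
    if psi is None:
        psi = zeta                      # C0: the root splits with probability zeta
    if rng.random() >= psi:
        return "lf"                     # C2: leaf with probability 1 - psi
    # C1: probability handed to both children, as written on the slide
    child_psi = zeta * (1 + depth) ** (-xi)
    feature = rng.choice(features)      # assumption: stand-in for data-driven selection
    left = sample_tree(zeta, xi, features, rng, depth + 1, child_psi)
    right = sample_tree(zeta, xi, features, rng, depth + 1, child_psi)
    return ("nd", feature, left, right)


if __name__ == "__main__":
    print(sample_tree(0.95, 1.0, ["x1", "x2"], random.Random(1)))
```

Larger ξ makes the split probability decay faster with depth, so the prior favours shallower trees; ζ sets the probability that the root splits at all.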

slide-10
SLIDE 10

Bims theory

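Bayes' Theorem

\[ p(M \mid D) = \frac{p(D \mid M)\, p(M)}{\sum_{M} p(D \mid M)\, p(M)} \]

Metropolis-Hastings

\[ \alpha(M_i, M^{\star}) = \min\left\{ \frac{q(M^{\star}, M_i)\, P(D \mid M^{\star})\, P(M^{\star})}{q(M_i, M^{\star})\, P(D \mid M_i)\, P(M_i)},\ 1 \right\} \]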

slide-11
SLIDE 11

DLP defined model space

From Mi identify Gi then sample forward to M⋆. q(Mi, M⋆) is the probability of proposing M⋆ when Mi is the current model.
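A generic Metropolis-Hastings step over models can be sketched in Python as below. This is a textbook-style illustration, not the Bims sampler: the function names and the `propose` interface (returning a candidate together with the log proposal probabilities in both directions) are assumptions. Working in log space avoids underflow when likelihoods are small.

```python
import math
import random


def acceptance_probability(log_lik_cur, log_prior_cur, log_lik_new, log_prior_new,
                           log_q_new_to_cur, log_q_cur_to_new):
    """min{ q(M*,Mi) P(D|M*) P(M*) / (q(Mi,M*) P(D|Mi) P(Mi)), 1 }, in log space."""
    log_ratio = ((log_q_new_to_cur + log_lik_new + log_prior_new)
                 - (log_q_cur_to_new + log_lik_cur + log_prior_cur))
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def mh_step(current, propose, log_lik, log_prior, rng):
    """One step: propose a candidate, then accept it with probability alpha.

    `propose(current, rng)` is a hypothetical callback returning
    (candidate, log q(current -> candidate), log q(candidate -> current)).
    """
    candidate, log_q_fwd, log_q_bwd = propose(current, rng)
    alpha = acceptance_probability(log_lik(current), log_prior(current),
                                   log_lik(candidate), log_prior(candidate),
                                   log_q_bwd, log_q_fwd)
    return candidate if rng.random() < alpha else current
```

With a symmetric proposal the two q terms cancel and the acceptance probability reduces to the posterior ratio capped at 1.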

slide-12
SLIDE 12

Pyruvate kinase interactors

Objective

Improve the chances of discovering binding molecules based on examples from screened chemical libraries.

Pyruvate kinase affinity data

582 Active and 582 Inactive molecules. Dragon software produces 1500 property descriptors for each molecule, of which about 1100 were used.

Ten-fold cross-validation

Compared to Feed Forward Neural Networks and Support Vector Machines by splitting the data into ten train/test segments.
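Splitting the data into ten train/test segments can be sketched as follows. This is a minimal Python illustration, not the code used in the study: the function name, the shuffling seed and the absence of stratification by activity class are assumptions.

```python
import random


def ten_fold_splits(n_items, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds disjoint folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


if __name__ == "__main__":
    # 582 active + 582 inactive molecules
    for train, test in ten_fold_splits(582 + 582):
        print(len(train), len(test))
```

Every molecule appears in exactly one test segment, so each model is always evaluated on data it was not trained on.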

slide-13
SLIDE 13

best likelihood model

slide-14
SLIDE 14

ten-fold validation

\[ \text{Sensitivity} = \frac{T^{+}}{T^{+} + F^{-}}, \qquad \text{Specificity} = \frac{T^{-}}{T^{-} + F^{+}} \]
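Both measures follow directly from confusion-matrix counts. The Python sketch below uses hypothetical counts for one test fold purely for illustration; they are not results from the study.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives that are predicted positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of actual negatives that are predicted negative."""
    return true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # Hypothetical counts for a single test fold (illustration only).
    print(sensitivity(true_pos=45, false_neg=15))
    print(specificity(true_neg=40, false_pos=10))
```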

slide-15
SLIDE 15

molecules of Eduliss according to BCarts

slide-16
SLIDE 16

Bims: Bayesian inference of model structure

Released in 2016 as an easily installable SWI-Prolog library (IJAR paper in 2017)

Includes
◮ priors and likelihoods for: CARTs and Bayesian networks
◮ hooks for user defined models

Probabilistic logic programming
◮ thesis: probabilistic finite domains
◮ PLP workshop and IJAR associated issues (5th edition)

slide-17
SLIDE 17

knowledge-based computational biology

◮ graphical models (focal adhesion dynamics, NKI, 2011-3)
◮ proteomics functional analysis (TKSilac, KSR1, ATG9A, Imperial, 2014-5)
◮ mutational profiling (14MG, Sanger, 2016-8)

slide-18
SLIDE 18

Graphical models of FAD

Graphical models (aka Bayesian networks) can provide a network view of dependencies among variables, capturing much richer information than pairwise correlations. In this project, microscopy based variables characterising focal adhesion in time are connected for a number of conditions in the HGF pathway.

slide-19
SLIDE 19
slide-20
SLIDE 20

tkSilac: tyrosine kinase screen

◮ MCF7 cell line
◮ 33 SILAC runs
◮ 65/66 expressed tyrosine kinases
◮ 4739 quantified in some experiment
◮ 1000 quantified in 60 or more TK KO

slide-21
SLIDE 21

Figure 2

Knockdown labels, in plotted order: LMTK3, ERBB4, MATK, ZAP70, ROR2, TYRO3, FLT1, FGFR2, FRK, EPHB4, EPHB1, INSR, IGF1R, MERTK, PTK6, PDGFRB, SRC, HCK, MET, EPHA1, EGFR, NTRK3, NTRK1, LMTK2, MST1R, RYK, EPHA2, ABL1, AXL, EPHA4, EPHB3, ERBB2, CSF1R, ABL2, EPHA6, BTK, FGFR1, EPHB2, ERBB3, EPHA7, CSK, DDR1, FLT3, EPHA3, FYN, EPHB6, FES, LYN, STYK1, TNK2, PTK2, JAK1, LCK, TYK2, YES1, KDR, JAK2, ROR1, TNK1, SYK, TEC, PTK7, RET, NTRK2, PTK2B.

Color key: Value (ticks at 0.5 and 2).

Fig. 2. Heatmap of quantified proteins after TK silencing. The overall pattern of regulation is shown in the heatmap of quantified values. After normalization to siControl, fold-change values are all above 0, with a value of 1 showing that the expression level of the specific protein is not altered after silencing TKs. For each knockdown (rows)

slide-22
SLIDE 22

Figure 4 (panels A, B, C, D)

Bar chart panel: counts of Up and Down proteins per cluster (clusters 1 to 10; count axis 25 to 100).

Heatmap panel for ABL1, AXL, EPHA2, EPHA4, LMTK2, MST1R, NTRK1, RYK. Protein labels: CD44, TMEM164, EPB41L1, CA12, PGR, NRCAM, FREM2, SDC1, CELSR2, LXN, MYOF, GFRA1, GLA, GUSB, FARP1, GREB1, FBP1, CYB5R1, GSTM3, NCAM2, GGH, CA2, MUC1, SLC38A1, FLNB, ERMP1, PBXIP1, PODXL, ADAM10, BLOC1S5, CDH1, AKR1C2, BASP1, SLC12A2, MVP, AGR2, FTH1, LCP1, LASP1, BCAS1, PYGB, STAT1, PREX1, ME1, SLC7A1, MYBBP1A, HEATR6, CAD, KRT18, MC1R, GALNT7, GEMIN4, PHGDH, TXNRD1, USP32, MON2, LTN1, RNF213, VDAC1, TUBA1C, HUWE1, ASMTL, GYG1, MROH1, PRKDC, ASS1, NUP188, ASNS, MUC5B, LDHB, XPOT, HEATR1.

Panel C: cluster membership

Cluster  No.  TKs
1        8    ABL1, AXL, EPHA2, EPHA4, LMTK2, MST1R, NTRK1, RYK
2        15   ABL2, BTK, CSF1R, CSK, DDR1, EPHA3, EPHA6, EPHA7, EPHB2, EPHB6, ERBB3, FES, FGFR1, FLT3, FYN
3        3    EGFR, EPHA1, NTRK3
4        6    EPHB1, EPHB4, FGFR2, FLT1, FRK, INSR
5        2    EPHB3, ERBB2
6        6    ERBB4, LMTK3, MATK, ROR2, TYRO3, ZAP70
7        7    HCK, IGF1R, MERTK, MET, PDGFRB, PTK6, SRC
8        3    JAK1, LCK, PTK2
9        12   JAK2, KDR, NTRK2, PTK2B, PTK7, RET, ROR1, SYK, TEC, TNK1, TYK2, YES1
10       3    LYN, STYK1, TNK2

slide-23
SLIDE 23

Figure 5 (panels A and B; clusters 1 to 10; axes in %)

Panel A categories: Cell adhesion, Apoptosis, Development, Growth, Immune system, Transport, Metabolic process, Reproduction, Cell cycle, Cell communication.

Panel B categories: Antioxidant activity, Binding, Catalytic activity, Chemoattractant activity, Enzyme regulator activity, Molecular transducer activity, Protein binding, TF activity, Receptor activity, Structural molecule activity, Transporter activity.

Fig. 5. Characterization of a functional portrait for each cluster. A, A functional profile of top GO biologic processes that the up- and downregulated proteins belong to is presented. x-axis shows the percentage of hits in each cluster that belong to a GO biologic process term. The color coding and the number for each cluster are indicated as above. B, A functional profile of top GO molecular functions that the up- and downregulated proteins belong to is presented. x-axis shows the percentage of hits in each cluster that belong to a GO molecular function term.

slide-24
SLIDE 24
Figure 6

cluster 1: development
ADAM10, AGR2, AKR1C2, ASNS, ASS1, BASP1, BLOC1S5, CA2, CAD, CD44, CDH1, CELSR2, FARP1, FLNB, FREM2, GFRA1, GSTM3, HUWE1, KRT18, LCP1, MC1R, MUC1, MUC5B, MYBBP1A, NCAM2, NRCAM, PBXIP1, PGR, PHGDH, PODXL, PREX1, PRKDC, SDC1, SLC12A2, STAT1, TXNRD1, VDAC1

cluster 2: response to cytokine; cytoskeleton organization (two protein groups)
ACSL1, ASS1, CD44, FLNB, KPNA2, KRT18, NUP210, PRKCD, PXDN, RPS27A
ACTN1, ACTN4, EPB41L1, FLNB, KRT18, LMNA, MAP4, NUMA1, PREX1, PRKCD, SPTBN2

cluster 3: immune system process
ADAR, ANXA3, APOB, ASS1, CD44, CRIP2, DDX58, DDX60, EIF2AK2, FTH1, HERC6, IFI35, IFIH1, IFIT1, IFIT2, IFIT3, IRS1, ISG15, ITGA2, MIR7703, MX1, NFKB2, OAS1, OAS2, OAS3, PARP9, PML, PRKDC, PSME1, RPL22, SAMHD1, SLC3A2, STAT1, STAT2, TAP1, TAP2, TAPBP, TRIM25

cluster 4: ion transmembrane transport; response to hormone (two protein groups)
AKR1C2, CA2, CAD, GBA, MUC5B, PCK2, RBM14, SDC1, STAT1
ACTN4, ANXA6, LASP1, SLC38A2, SLC3A2, SLC6A14, SLC7A1, SLC7A2, SLC7A5

cluster 5: RNA transport; catabolic process (two protein groups)
LRPPRC, MVP, NUP210, PNPT1
ACSL1, AGRN, BLVRA, CD44, EPHX1, FBP1, G6PD, GSTM3, HSP90B1, KIAA1324, LRPPRC, MYH14, PNPT1, PXDN, SDC1

cluster 6: cell motility
CAV1, CD63, CELSR1, CTNNB1, DST, GPR56, HMGB1, HMGB2, ITGA5, L1CAM, LGALS3, NRP1, SCARB1, SLC3A2, SLC7A5

cluster 7: metabolic process
ACSL1, AHSA1, AKR1C2, APRT, ASS1, CA12, CAD, CERS2, CYB5R3, DDX20, ENO1, ERO1L, GBA, GLA, GPD2, HK1, INPP4B, MON2, MYH14, PFKL, PHGDH, PYGB, SLC44A2, SLC5A6, SLC7A2, SLC7A5, SLC9A1, SORD, TUBA1A, TUBA1C, TUBB4A, TUBB8

cluster 8: DNA replication
DNMT1, DUT, FANCD2, FEN1, H1FX, KPNA2, MCM2, MCM3, MCM4, MCM5, MCM6, MCM7, NCAPD2, PCNA, POLD1, RAN, SMC2, SMC4, SSRP1, SUPT16H

cluster 9: protein localization
ACTN4, AGRN, ARMCX3, BAG6, CELSR2, FLNB, JUP, LCP1, MGEA5, MLPH, MVP, MYH9, PDCD6IP, RPS27A, SLC9A3R1, SYNE2, TFRC, XPO1, XPOT

cluster 10: cell death
ACTN4, ASNS, BAG6, CD44, CDH1, MUC1, NQO1, PRDX2, PRDX3, PRKDC, SLC25A24, SLC9A3R1, SORT1, VDAC1, VDAC2

slide-25
SLIDE 25

volcano plot (BT474HR H/M)

[Figure: volcano plot. Axes: Log2 Ratio Total Protein (R10K8/R0604) and Log10 Intensity (R10K8/R0604). Legend: < 0.001, < 0.01, < 0.05, > 0.05.]

Labelled proteins: CLEC1B, CNOT8, SCAP, SYF2, ATG9A, STOML2, LDHB, TRIP6, DHX37, STK3, KATNAL2, VCP, RP1, YBX2, BLMH, ANO1, SFN, DCTN5, PEX3, MCM6, MCM3, LGALS1, PEG10, GBP2, VIM, SLC6A14, SH3BGRL, GSTM3, PGM5, KYNU, CRIP1, S100A6, SSNA1, CA5B, IGFBP3, RGN, BSG, SEMA4C, EADS2

slide-26
SLIDE 26

Autophagy

ATG16L1, ATG3, ATG4B, ATG7, ATG9A, BAG1, BAG3, BNIP1, CALCOCO2, CANX, CASP3, CASP8, CD46, CDKN1B, CHMP2B, DNAJB1, ERBB2, FOXO3, GAA, GABARAPL2, HDAC6, HSPB8, ITGA6, ITGB4, MAP1LC3B, MAPK3, MTOR, NBR1, NFKB1, PEX3, PRKCD, RAB7A, RB1, RELA, SERPINA1, SQSTM1, ST13, TBK1, TP53, TSC1, TSC2, ULK1, WDFY3, WIPI2

slide-27
SLIDE 27

Myelodysplastic syndrome, NGS somatic mutation profiling

[Figure: AUC (1y) for Clinical vs Lasso_min model; AUC (1y) axis from 0.5 to 1.0; plot annotation: 0.0037]

              Clinical   Lasso_min
AUC (1y)      0.679351   0.78364
Harrell's C   0.634512   0.712633
R square      0.179789   0.415647

slide-28
SLIDE 28

5 year: Clinical vs Lasso (Optimal)

[Figure: two panels, Clinical and Lasso_min, each with axis ticks 1 to 5 and 0.0 to 1.0]

slide-29
SLIDE 29

myeloma structural variations

[Figure: events shown are t_14_16, del13q14, HDR, t_4_14, delCYLD, t_11_14, del17p13, NRAS, gain1q21, TP53, delFAM46C, delTRAF3, DIS3, KRAS]

Legend
◮ # events (shown med=225, max=643)
◮ Co-occur (shown odds=4)
◮ Mut.excl (shown odds=0.25)
◮ Fisher test odds
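For readers unfamiliar with the legend, the odds ratio and Fisher's exact test for a pair of events can be computed from a 2x2 table of sample counts. The Python sketch below is a standard-library illustration with hypothetical function names and made-up counts; it is not the code behind the figure. An odds ratio above 1 points towards co-occurrence and one below 1 towards mutual exclusivity.

```python
from math import comb


def odds_ratio(both, only_a, only_b, neither):
    """Sample odds ratio of the 2x2 table; infinite if a denominator cell is zero."""
    if only_a * only_b == 0:
        return float("inf")
    return (both * neither) / (only_a * only_b)


def fisher_exact_two_sided(both, only_a, only_b, neither):
    """Two-sided Fisher exact p-value: total probability of all tables
    (with the same margins) that are no more likely than the observed one."""
    with_a = both + only_a
    with_b = both + only_b
    n = both + only_a + only_b + neither

    def prob(x):
        # Hypergeometric probability of x samples carrying both events.
        return comb(with_b, x) * comb(n - with_b, with_a - x) / comb(n, with_a)

    lowest = max(0, with_a + with_b - n)
    highest = min(with_a, with_b)
    p_obs = prob(both)
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


if __name__ == "__main__":
    # Made-up counts for two events across 8 samples (illustration only).
    print(odds_ratio(3, 1, 1, 3), fisher_exact_two_sided(3, 1, 1, 3))
```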

slide-30
SLIDE 30

logic programming for (biological) data analytics

Positives
◮ interpreted
◮ memory management
◮ clean and high level
◮ probabilistic ML & reasoning (Prism, Bims, Pepl)
◮ intuitive database integration (db_facts, bio_db)
◮ multi-threaded and web-capable
◮ talking to other systems (R: Real, ODBC, proSQLite)
◮ (largely) OS independence

Negatives
◮ graphics
◮ SWI-Prolog, at core a one-person project
◮ code sharing in toddler stage (but showing promise)
◮ in-browser interaction with other technologies

slide-31
SLIDE 31

KR bottom line

(probabilistic) logic programming and Bayesian networks are powerful tools for explainable, accountable, open and shareable AI & ML

slide-32
SLIDE 32

KR bottom line

◮ (probabilistic) logic programming and Bayesian networks are powerful tools for explainable, accountable, open and shareable AI & ML
◮ symbolic AI education can be a central player in contributing tangibly to the current AI resurgence, while managing expectations of modern AI (see media coverage of Facebook/Cambridge Analytica and Uber/Tesla driverless accidents)

slide-33
SLIDE 33

KR bottom line

◮ (probabilistic) logic programming and Bayesian networks are powerful tools for explainable, accountable, open and shareable AI & ML
◮ symbolic AI education can be a central player in contributing tangibly to the current AI resurgence, while managing expectations of modern AI (see media coverage of Facebook/Cambridge Analytica and Uber/Tesla driverless accidents)
◮ biology presents a unique application area, where
  ◮ unprecedented volumes of data are generated
  ◮ knowledge is a crucial concept, currently being shaped
  ◮ transferable to other big data areas