Bayesian matrix factorization for drug-target activity prediction
Yves Moreau
University of Leuven – ESAT-STADIUS SymBioSys Center for Computational Biology

[Figure: number of new drugs per billion US$, 1950 to 2010, on a logarithmic axis from 100 down to 0.1.] Scannell et al. 2012

The curse of attrition …
Phase success rates (Hay et al. 2014):
• Phase 1 to phase 2: 64%
• Phase 2 to phase 3: 32%
• Phase 3 to NDA/BLA: 60%
• NDA/BLA to approval: 83%

… mainly due to safety and efficacy issues
[Figure: causes of failure between Phase 2 and submission in 2011 and 2012, split into efficacy, safety, and other.] Arrowsmith & Miller 2013

Chemoinformatics
● Goal: estimate interaction between compounds and protein targets
● Activity measured by high-throughput screening
● Activity depends on match between shape of compound and shape of protein
● 3D modeling is challenging
[Figure: a compound (ex: Viagra) and an enzyme (ex: ACE2).]

Drug–target activities
• IC50 – concentration of compound needed for half-maximal inhibition
• pIC50 = -log10(IC50), with IC50 expressed in mol/L
• EC50 – concentration of compound needed for half-maximal effect
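
The conversion between IC50 and pIC50 can be sketched as follows. This is a minimal illustration, assuming IC50 values are reported in nanomolar and that pIC50 uses the usual molar convention; the function names are hypothetical and not part of the original slides.

```python
import math


def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nm * 1e-9)


def ic50_nm_from_pic50(pic50: float) -> float:
    """Inverse conversion: pIC50 back to IC50 in nanomolar."""
    return 10.0 ** (-pic50) * 1e9


# A more potent compound has a lower IC50 and therefore a higher pIC50.
for ic50_nm in (1.0, 200.0, 10_000.0):
    print(f"IC50 = {ic50_nm:8.1f} nM -> pIC50 = {pic50_from_nm(ic50_nm):.2f}")
```

On this scale, the 200 nM activity threshold used for the ChEMBL experiments in this deck corresponds to a pIC50 of about 6.7.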

High-throughput screening
• Hit discovery in early drug discovery
• Identify compounds active against a protein drug target of interest
• Activity measured by high-throughput screening
• Activity = “scarce” data
[Figure: compound × target matrix of IC50 values, rows Comp 1 … Comp N (millions of compounds), columns P 1 … P m (thousands of targets), with only a few observed entries (1-2% fill rate).]

Molecular fingerprints
• High-dimensional fingerprints of 2D compound structures
• Sparse vectors
Key-based fingerprints (FP2 & MACCS): a bit string represents the presence or absence of particular substructures
Circular fingerprints (MNA & MPD & ECFP): each fingerprint represents a central atom and its neighbors

Quantitative Structure–Activity Relationship (QSAR)
• Finds optimal model α based on predictive features
• $\mathrm{IC50}(x) = \alpha_1 x_1 + \alpha_2 x_2 + \dots + \alpha_F x_F$
• Minimize error loss
• PLS, ridge regression
• Good performance if enough training examples
• Does not share information across tasks!
[Figure: IC50 matrix (rows Comp 1 … Comp N, columns P 1 … P m) with a binary fingerprint, e.g. 00100010, attached to each compound row.]
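
A per-target QSAR model of the linear form on the slide above can be sketched with ridge regression. The data here are synthetic (random binary "fingerprints" and a made-up coefficient vector), the penalty `lam` is an arbitrary choice, and there is no intercept, mirroring the slide's formula; none of this comes from the original experiments.

```python
import numpy as np


def fit_ridge(X, y, lam=1.0):
    """Ridge regression: alpha = (X'X + lam*I)^-1 X'y, with one compound per row of X."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


rng = np.random.default_rng(0)
n_compounds, n_bits = 200, 32

# Toy binary fingerprints and synthetic activities for ONE target.
X = rng.integers(0, 2, size=(n_compounds, n_bits)).astype(float)
alpha_true = rng.normal(size=n_bits)
y = X @ alpha_true + 0.1 * rng.normal(size=n_compounds)

# Fit on the first 150 compounds, evaluate on the remaining 50.
alpha_hat = fit_ridge(X[:150], y[:150], lam=1.0)
rmse = np.sqrt(np.mean((X[150:] @ alpha_hat - y[150:]) ** 2))
print(f"held-out RMSE: {rmse:.3f}")
```

Each target gets its own independent fit in this setup, which is why nothing learned for one protein helps with another.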

Multitask learning
• From fingerprints and available activities, predict missing activities
• Approaches
  1. Supervised learning per target (QSAR)
  2. Matrix factorization - Netflix style
  3. MF + supervised - Macau
[Figure: IC50 matrix of 3M compounds × 1500 targets, 1-2% fill rate, with fingerprint features (6K-4M) attached to each compound row.]

The Netflix Challenge
• Goal: predict user movie ratings
• 480K users, 18K movies
• 100 million ratings
• 1% fill rate
• → Predict 99% missing
• How can this work?
[Figure: users × movies rating matrix (480K users, 18K movies) in which almost all entries are unknown (“?”) and only a few ratings between 1 and 5 are observed.]

Factor analysis
• Low-rank approximation of full matrix: Y ≈ U * V, with loadings U and factors V

Factor analysis
• Row by row: $Y_{i\cdot} \approx U_{i\cdot} V$ (row i of the loadings U times the factors V)

Factor analysis
• Individual response (= row) modeled as individual mixture (= loading) of a small number of latent responses (= factor)
[Figure: one row of Y written as a weighted sum of the factor rows, with the loadings as weights.]
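
The "row as a mixture of factors" reading can be checked numerically. This is a small sketch on random matrices, with sizes chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols, rank = 6, 5, 2

U = rng.normal(size=(n_rows, rank))  # loadings: one row of mixture weights per response
V = rng.normal(size=(rank, n_cols))  # factors: one latent response per row
Y = U @ V                            # low-rank matrix built from loadings and factors

# Row i of Y is the sum of the factor rows, weighted by the loadings of row i.
i = 3
row_as_mixture = sum(U[i, k] * V[k] for k in range(rank))

print(np.allclose(Y[i], row_as_mixture))
print(np.linalg.matrix_rank(Y))
```

Only `rank * (n_rows + n_cols)` numbers are needed to describe `Y`, which is what makes it possible to estimate the model from a small fraction of the entries.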

Alternating Least Squares
[Figure: $Y_{i\cdot} \approx U_{i\cdot} V$, with the loadings row $U_{i\cdot}$ unknown.]

Alternating Least Squares
• If V were known, U could be found by linear regression
[Figure: Y ≈ U * V with all entries of the loadings U unknown.]

Alternating Least Squares
• If U were known, V could also be found by linear regression
[Figure: Y ≈ U * V with all entries of the factors V unknown.]
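
The two alternating regressions can be sketched for a fully observed matrix as follows. The function name `als_dense`, the random initialization, and the iteration count are illustrative choices, not taken from the slides.

```python
import numpy as np


def als_dense(Y, rank, n_iters=20, seed=0):
    """Alternating least squares for Y ~ U @ V on a fully observed matrix."""
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(rank, Y.shape[1]))  # random starting factors
    for _ in range(n_iters):
        # V fixed: every row of U is a least-squares fit of a row of Y on the factors.
        U = np.linalg.lstsq(V.T, Y.T, rcond=None)[0].T
        # U fixed: every column of V is a least-squares fit of a column of Y on the loadings.
        V = np.linalg.lstsq(U, Y, rcond=None)[0]
    return U, V


rng = np.random.default_rng(1)
Y = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))  # exactly rank 2
U, V = als_dense(Y, rank=2)
rel_err = np.linalg.norm(Y - U @ V) / np.linalg.norm(Y)
print(f"relative reconstruction error: {rel_err:.2e}")
```

Each half-step is an ordinary multi-output linear regression, so the whole procedure reduces to repeated calls to a least-squares solver.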

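Scarce matrix factorization
• Only observed values are used in regressions
  $(U^*, V^*) = \arg\min_{U,V} \| W \circ (Y - U V) \|^2$
  where W marks the observed entries of Y (1 if observed, 0 otherwise) and $\circ$ is the element-wise product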

Scarce matrix factorization
• Once factors are obtained, other entries can be predicted: $\hat{Y} = U^* V^*$
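
Restricting each regression to the observed entries, and then predicting the missing ones, can be sketched as below. The boolean mask `W`, the small ridge penalty `lam` (added here so that rows with few observations stay well-posed), and the synthetic data are assumptions of this example rather than details from the slides.

```python
import numpy as np


def als_masked(Y, W, rank, lam=0.1, n_iters=50, seed=0):
    """ALS for Y ~ U @ V using only entries where the boolean mask W is True."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = rng.normal(scale=0.1, size=(n, rank))
    V = rng.normal(scale=0.1, size=(rank, m))
    reg = lam * np.eye(rank)
    for _ in range(n_iters):
        for i in range(n):
            obs = W[i]                      # columns observed in row i
            Vo = V[:, obs]
            U[i] = np.linalg.solve(Vo @ Vo.T + reg, Vo @ Y[i, obs])
        for j in range(m):
            obs = W[:, j]                   # rows observed in column j
            Uo = U[obs]
            V[:, j] = np.linalg.solve(Uo.T @ Uo + reg, Uo.T @ Y[obs, j])
    return U, V


rng = np.random.default_rng(1)
n, m, rank = 60, 40, 2
Y_true = rng.normal(size=(n, rank)) @ rng.normal(size=(rank, m))
W = rng.random((n, m)) < 0.5                # True where an entry is observed

U, V = als_masked(Y_true, W, rank)
Y_hat = U @ V                               # predictions for every entry
rmse_missing = np.sqrt(np.mean((Y_hat[~W] - Y_true[~W]) ** 2))
print(f"RMSE on unobserved entries: {rmse_missing:.3f}")
```

The fit never reads `Y_true` outside the mask, so the error on `~W` is a genuine held-out measure. It is still a single point estimate, with no indication of how reliable each predicted entry is.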

Uncertainty
• Given scarce data, is a single solution $(U^*, V^*)$ meaningful?
[Figure: $\hat{Y} \approx U^* V^*$ with loadings $U^*$ and factors $V^*$.]

Bayesian modeling
• Given uncertainty from scarce data, Bayesian inference is desirable
• Instead of $(U^*, V^*) = \arg\min_{U,V} \| W \circ (Y - U V) \|^2$, we want to consider the Bayesian posterior distribution $p(U, V \mid Y)$
• Posterior predictive distribution $p(\hat{Y} \mid Y)$ is more informative than any optimal estimator

Ordinary least squares
• ALS involves successive regressions solved by OLS
[Figure: $Y_{i\cdot} \approx U_{i\cdot} V$, with the loadings row $U_{i\cdot}$ unknown.]

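Ordinary least squares
• Model: $y = X'\beta + \varepsilon$
• Solution: $\hat{\beta} = (XX')^{-1} X y$
• Setup = transpose of previous notation (X has one column per observation)
• If the noise is Gaussian, then OLS is the maximum likelihood estimate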

Block Gibbs sampler
• The Gibbs sampler is a Markov Chain Monte Carlo (MCMC) method
• MCMC for model inference generates samples from complex posterior distributions of model parameters by iteratively sampling from simpler distributions
• The following scheme is a block Gibbs sampler
  $U^{(i+1)} \sim p(U \mid V^{(i)}, Y)$
  $V^{(i+1)} \sim p(V \mid U^{(i+1)}, Y)$
• Under mild conditions of ergodicity, after burn-in, the samples are drawn from the joint posterior distribution (dependently, since successive samples are correlated)
  For $i$ sufficiently large, $(U^{(i)}, V^{(i)}) \sim p(U, V \mid Y)$
• Similar to alternating least squares, but global optimization

Markov Chain Monte Carlo
• We do not get the posterior distribution analytically, only samples from it
• Samples are sufficient to characterize posterior distribution
  • e.g., average solutions to get posterior mean estimate
  • e.g., marginal variance of individual predictions to characterize uncertainty
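
The mechanics of alternating conditional draws, burn-in, and sample-based summaries can be seen on a toy problem. The sketch below runs a two-variable Gibbs sampler on a standard bivariate Gaussian with correlation `rho`; this target is chosen only because its conditionals are known in closed form, and it is not the matrix factorization model.

```python
import numpy as np


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate Gaussian with correlation rho."""
    rng = np.random.default_rng(seed)
    u, v = 5.0, -5.0                       # deliberately poor starting point
    cond_std = np.sqrt(1.0 - rho**2)
    samples = []
    for it in range(burn_in + n_samples):
        u = rng.normal(rho * v, cond_std)  # u ~ p(u | v)
        v = rng.normal(rho * u, cond_std)  # v ~ p(v | u)
        if it >= burn_in:                  # discard the burn-in phase
            samples.append((u, v))
    return np.array(samples)


samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)

# The distribution is never written down; it is summarized from the samples.
post_mean = samples.mean(axis=0)
post_var = samples.var(axis=0)
post_corr = np.corrcoef(samples.T)[0, 1]
print(post_mean, post_var, post_corr)
```

Successive samples are correlated, so the effective number of independent draws is smaller than the number of iterations; the sample mean, variance, and correlation still converge to the true values (0, 1, and 0.8 here).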

Bayesian linear regression
• The distribution of β as a function of the data X and y can be modeled as a multivariate Gaussian distribution over β
• Model: $y = X'\beta + \varepsilon$, $\varepsilon \sim N(0, \sigma^2 I)$
• Assume a Gaussian prior for β (mean $\mu_0$, precision $\Lambda_0$) and an inverse gamma prior for ρ =

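Bayesian linear regression
• Then the posterior distribution of β is also a Gaussian distribution by application of Bayes’ rule: $\beta \mid X, y \sim N(\mu_n, \sigma^2 \Lambda_n^{-1})$ with
  $\Lambda_n = XX' + \Lambda_0$
  $\mu_n = \Lambda_n^{-1} (X y + \Lambda_0 \mu_0)$
• If $\Lambda_0 = 0$ and $\mu_0 = 0$, then the solution for $\mu_n$ is identical to OLS!
• The average solution $\mu_n$ is similar to the ridge regression solution
• The precision matrix $\Lambda_n$ characterizes the variance of the solution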
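
The link between the posterior mean, OLS, and ridge regression can be verified numerically. The sketch assumes the standard conjugate result $\Lambda_n = XX' + \Lambda_0$ and $\mu_n = \Lambda_n^{-1}(Xy + \Lambda_0\mu_0)$, with X holding one observation per column as in the deck's transposed notation; the data and the prior strength `lam` are made up for illustration.

```python
import numpy as np


def blr_posterior(X, y, mu0, Lambda0):
    """Posterior mean and precision for y = X' beta + noise, with X of shape (F, N)."""
    Lambda_n = X @ X.T + Lambda0
    mu_n = np.linalg.solve(Lambda_n, X @ y + Lambda0 @ mu0)
    return mu_n, Lambda_n


rng = np.random.default_rng(0)
F, N = 4, 50
X = rng.normal(size=(F, N))
beta_true = np.array([1.0, -2.0, 0.5, 0.0])
y = X.T @ beta_true + 0.3 * rng.normal(size=N)

# Flat prior (Lambda0 = 0, mu0 = 0): the posterior mean is the OLS solution.
beta_ols = np.linalg.lstsq(X.T, y, rcond=None)[0]
mu_flat, _ = blr_posterior(X, y, np.zeros(F), np.zeros((F, F)))

# Isotropic prior (Lambda0 = lam * I, mu0 = 0): the posterior mean is the ridge solution.
lam = 5.0
beta_ridge = np.linalg.solve(X @ X.T + lam * np.eye(F), X @ y)
mu_ridge, Lambda_n = blr_posterior(X, y, np.zeros(F), lam * np.eye(F))

print(np.allclose(mu_flat, beta_ols), np.allclose(mu_ridge, beta_ridge))
```

The prior precision plays the role of the ridge penalty: a stronger prior pulls the posterior mean further toward the prior mean.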

GAMBLR trick
• Executing the Gibbs sampler requires sampling repeatedly from posterior Gaussian distributions (which change every time U and V change)
• Sampling from a multivariate Gaussian distribution: let $\varepsilon \sim N(0, I)$. If A is such that $\Sigma = AA'$, then $z = \mu + A\varepsilon \sim N(\mu, \Sigma)$
• For Bayesian linear regression, augment the data with the prior:
  $X \leftarrow \begin{pmatrix} X & L_0 \end{pmatrix}$, $y \leftarrow \begin{pmatrix} y \\ L_0' \mu_0 \end{pmatrix}$ with $\Lambda_0 = L_0 L_0'$
  so that $\Lambda_n = XX'$ and $\mu_n = (XX')^{-1} X y$
• It can be shown that $z = (XX')^{-1} X (y + \sigma \varepsilon) \sim N(\mu_n, \sigma^2 \Lambda_n^{-1})$
• This has the same form as OLS!
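
The claim that a regression on noise-injected data yields a draw from the posterior can be checked empirically. The sketch below builds the augmented data with a Cholesky factor of the prior precision, draws many samples, and compares their mean and covariance with $\mu_n$ and $\sigma^2 \Lambda_n^{-1}$; the problem sizes, prior, and noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
F, N, sigma = 3, 40, 0.5
X = rng.normal(size=(F, N))                # one observation per column
y = X.T @ np.array([1.0, -1.0, 2.0]) + sigma * rng.normal(size=N)

mu0 = np.zeros(F)
Lambda0 = 2.0 * np.eye(F)
L0 = np.linalg.cholesky(Lambda0)           # Lambda0 = L0 @ L0.T

# Augment the data with the prior.
X_aug = np.hstack([X, L0])                 # shape (F, N + F)
y_aug = np.concatenate([y, L0.T @ mu0])    # shape (N + F,)

Lambda_n = X_aug @ X_aug.T                 # equals X X' + Lambda0
mu_n = np.linalg.solve(Lambda_n, X_aug @ y_aug)


def sample_posterior(n_samples):
    """Each sample is the least-squares solution for a noise-injected target vector."""
    eps = rng.normal(size=(n_samples, N + F))
    noisy_targets = y_aug + sigma * eps
    return np.linalg.solve(Lambda_n, X_aug @ noisy_targets.T).T


Z = sample_posterior(20000)
emp_mean = Z.mean(axis=0)
emp_cov = np.cov(Z, rowvar=False)
target_cov = sigma**2 * np.linalg.inv(Lambda_n)
print(np.abs(emp_mean - mu_n).max(), np.abs(emp_cov - target_cov).max())
```

No Cholesky factor of the posterior covariance is ever formed: the only linear algebra per sample is the same solve that OLS would need.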

GAMBLR trick
• This means that we can sample from the posterior Gaussian distribution by solving a linear regression on the original data plus injected noise!
• Running the Gibbs sampler then only amounts to solving a sequence of linear regressions with variable noise injection!
• Linear regression is one of the best studied problems in numerical analysis
  • Fast algorithms
  • Scalable code
• One multivariate regression per row or column of Y at each iteration step, hence easy parallelization
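
Putting the pieces together, a block Gibbs sampler for scarce matrix factorization can be sketched as a loop of noise-injected ridge regressions. This is a deliberately simplified illustration, not the full Bayesian PMF model: the noise level `sigma` and the prior precision `lam` are held fixed rather than sampled, the prior mean is zero, and all names and data are hypothetical.

```python
import numpy as np


def sample_row(Xo, yo, lam, sigma, rng):
    """Draw from N(mu_n, sigma^2 Lambda_n^-1) via a regression on noise-injected data.

    Xo has shape (rank, n_obs); the prior on the coefficients is N(0, sigma^2 / lam * I).
    """
    rank, n_obs = Xo.shape
    rhs = (Xo @ (yo + sigma * rng.normal(size=n_obs))
           + np.sqrt(lam) * sigma * rng.normal(size=rank))
    return np.linalg.solve(Xo @ Xo.T + lam * np.eye(rank), rhs)


def gibbs_mf(Y, W, rank, lam=0.1, sigma=0.1, burn_in=100, n_samples=100, seed=0):
    """Block Gibbs sampler for Y ~ U @ V on the entries where mask W is True."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = rng.normal(scale=0.1, size=(n, rank))
    V = rng.normal(scale=0.1, size=(rank, m))
    pred_sum = np.zeros((n, m))
    pred_sq_sum = np.zeros((n, m))
    for it in range(burn_in + n_samples):
        for i in range(n):   # U | V, Y: one regression per row of Y
            U[i] = sample_row(V[:, W[i]], Y[i, W[i]], lam, sigma, rng)
        for j in range(m):   # V | U, Y: one regression per column of Y
            V[:, j] = sample_row(U[W[:, j]].T, Y[W[:, j], j], lam, sigma, rng)
        if it >= burn_in:    # accumulate predictions, not U and V themselves
            pred = U @ V
            pred_sum += pred
            pred_sq_sum += pred**2
    mean = pred_sum / n_samples
    std = np.sqrt(np.maximum(pred_sq_sum / n_samples - mean**2, 0.0))
    return mean, std


rng = np.random.default_rng(1)
n, m, rank = 40, 30, 2
Y_true = rng.normal(size=(n, rank)) @ rng.normal(size=(rank, m))
Y_obs = Y_true + 0.1 * rng.normal(size=(n, m))
W = rng.random((n, m)) < 0.5               # True where an entry is observed

mean, std = gibbs_mf(Y_obs, W, rank)
rmse_missing = np.sqrt(np.mean((mean[~W] - Y_true[~W]) ** 2))
print(f"RMSE on unobserved entries: {rmse_missing:.3f}, mean predictive std: {std[~W].mean():.3f}")
```

Predictions are averaged rather than the factors themselves, because U and V are only identified up to an invertible transformation while their product is not affected by it. The rows within each block are conditionally independent, so the inner loops could be run in parallel.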

Matrix factorization
• One of the best approaches for Netflix challenge
• Prediction of ratings for viewer-movie pairs
• Does not use features, only matrix values
• Two popular versions
  • Probabilistic Matrix Factorization (PMF) = Maximum Likelihood
  • Bayesian PMF = Bayesian inference

Netflix comparison (PMF vs. BPMF)
• Data: 100M ratings from 480K users, 18K movies
• BPMF has advantage for users with few ratings

Motivation for Bayesian PMF
• PMF gives point estimates
• Problematic for compounds that have only a few measurements
• We are interested in uncertainty of estimates
[Figure: example IC50 data set from ChEMBL with 15K compounds.]

Bayesian PMF

Gibbs sampling
• Iteratively samples each parameter
• Obtains posterior samples of the model
  • e.g., sample 200 models after burn-in
• Using the samples one can also measure uncertainty
• Related to Alternating Least Squares
• Blocked Gibbs sampler with large blocks, good sampling behavior

ChEMBL: PMF vs. Bayesian PMF
• ChEMBL public data set of assay activities
• Classified IC50
• 15,118 compounds
• 344 proteins
• 59,451 values
• Discretization at 200 nM
• 20% test
• BPMF outperforms PMF
• Does not use features, only matrix values
[Figure: test classification error for PMF and BPMF.]

ChEMBL: BPMF vs. ridge regression
• 15K compounds
• 344 proteins
• 200 nM threshold
• 20% for test set
• Vary number of dimensions
Matrix factorization is not as good as QSAR, but does capture information.

BPMF (relation view)
Model: 2 entities, 1 relation
Latent variables (green) are learned from the IC50 data.
[Figure: compound and protein entities linked by the IC50 relation.]

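Macau
Can we get the best of both worlds?
Model: 2 entities, 1 relation + features for compounds
Latent variables are learned together with β_comp
[Figure: fingerprints feed the compound latent variables U through β_comp; the compound (U) and protein (V) latent variables are linked by the IC50 relation.]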

Results on ChEMBL
• 15K compounds
• 344 proteins
• 200 nM threshold
• 20% for test set
Compound features improve performance
Multitask modeling improves performance
