  1. Bayesian matrix factorization for drug-target activity prediction Yves Moreau University of Leuven – ESAT-STADIUS SymBioSys Center for Computational Biology

  2. [Figure: Number of new drugs per billion US$ of R&D spending, plotted on a log scale (0.1–100) from 1950 to 2010, showing a steady decline. Scannell et al. 2012]

  3. The curse of attrition… Phase success rates (Hay et al. 2014): Phase 1 to Phase 2: 64%; Phase 2 to Phase 3: 32%; Phase 3 to NDA/BLA: 60%; NDA/BLA to approval: 83%.

  4. …mainly due to safety and efficacy issues. [Figure: Causes of failure between Phase 2 and submission in 2011 and 2012 — efficacy, safety, other. Arrowsmith & Miller 2013]

  5. Chemoinformatics?
  ● Goal: estimate the interaction between compounds (e.g., Viagra) and protein targets (e.g., the enzyme ACE2)
  ● Activity measured by high-throughput screening
  ● Activity depends on the match between the shape of the compound and the shape of the protein
  ● 3D modeling is challenging

  6. Drug–target activities
  • IC50 – concentration of compound needed for half-maximal inhibition
  • pIC50 = −log10(IC50)
  • EC50 – concentration of compound needed for half-maximal effect
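
  A quick worked example (the 200 nM cutoff reappears later as the active/inactive threshold): IC50 = 200 nM = 2×10⁻⁷ M, so pIC50 = −log10(2×10⁻⁷) ≈ 6.7; more potent compounds have higher pIC50.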

  7. High-throughput screening
  • Hit discovery in early drug discovery
  • Identify compounds active against a protein drug target of interest
  • Activity measured by high-throughput screening
  • Activity = “scarce” data: an IC50 matrix of millions of compounds × thousands of targets with only a 1–2% fill rate

  8. Molecular fingerprints
  Ø High-dimensional fingerprints of 2D compound structures
  Ø Sparse vectors
  Ø Key-based fingerprints (FP2 & MACCS): a bit string represents the presence or absence of particular substructures
  Ø Circular fingerprints (MNA & MPD & ECFP): each fingerprint represents a central atom and its neighbors

  9. Quantitative Structure–Activity Relationship (QSAR)
  Ø Finds an optimal model α based on predictive features
  Ø IC50(x) = α1 x1 + α2 x2 + … + αF xF
  Ø Minimize an error loss
  Ø PLS, ridge regression
  Ø Good performance if enough training examples
  Ø Does not share information across tasks!
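
  As a concrete illustration of this per-target QSAR setup, here is a minimal ridge-regression sketch in Python; the fingerprint matrix and activity values are synthetic placeholders, not data from the talk.

    import numpy as np
    from sklearn.linear_model import Ridge

    # Hypothetical data for one protein target: binary fingerprint features
    # (compounds x substructure bits) and measured pIC50-like activities.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(500, 1024)).astype(float)          # 500 compounds, 1024-bit fingerprints
    y = X @ rng.normal(0, 0.1, size=1024) + rng.normal(0, 0.3, size=500)  # synthetic activities

    # One independent ridge model per target: information is NOT shared across targets.
    model = Ridge(alpha=10.0)             # L2 penalty controls overfitting on sparse bit vectors
    model.fit(X[:400], y[:400])           # train on 400 compounds
    print(model.score(X[400:], y[400:]))  # R^2 on the 100 held-out compounds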

  10. Multitask learning
  • From fingerprints and available activities, predict missing activities
  • Setting: ~3M compounds × ~1,500 targets, an IC50 matrix with a 1–2% fill rate, and compound features (6K–4M fingerprint bits)
  • Approaches:
  1. Supervised learning per target (QSAR)
  2. Matrix factorization – Netflix style
  3. MF + supervised – Macau

  11. The Netflix Challenge
  • Goal: predict user movie ratings
  • 440K users, 18K movies
  • 100 million ratings
  • 1% fill rate ⇒ predict the 99% missing entries
  • How can this work?

  12. Factor analysis
  Ø Low-rank approximation of the full matrix: Y ≈ U · V (loadings U times factors V)

  13. Factor analysis
  Ø Row view: Y_i. ≈ U_i. · V (the i-th row of Y is the i-th row of the loadings times the factor matrix)

  14. Factor analysis
  Ø Each individual response (= row of Y) is modeled as an individual mixture (= loadings) of a small number of latent responses (= factors)
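
  A minimal numpy sketch of the low-rank idea on a fully observed matrix (truncated SVD is used purely for illustration; the talk's data are scarce, which the later slides address):

    import numpy as np

    rng = np.random.default_rng(1)
    # Synthetic fully observed matrix with true rank 5 plus noise.
    U_true = rng.normal(size=(300, 5))   # loadings
    V_true = rng.normal(size=(5, 40))    # factors
    Y = U_true @ V_true + 0.1 * rng.normal(size=(300, 40))

    # Rank-5 approximation via truncated SVD: Y ≈ U @ V.
    u, s, vt = np.linalg.svd(Y, full_matrices=False)
    k = 5
    U = u[:, :k] * s[:k]   # loadings
    V = vt[:k, :]          # factors
    print(np.linalg.norm(Y - U @ V) / np.linalg.norm(Y))  # small relative error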

  15. Alternating Least Squares
  Ø Both U (loadings) and V (factors) are unknown and must be estimated from Y

  16. Alternating Least Squares
  Ø If V were known, U could be found by linear regression (one regression per row of Y)

  17. Alternating Least Squares
  Ø If U were known, V could also be found by linear regression (one regression per column of Y)

  18. Scarce matrix factorization
  Ø Only observed values are used in the regressions
  Ø Objective: min over U, V of ‖W ⊙ (Y − U·V)‖², where W is the binary mask of observed entries
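
  A minimal numpy sketch of ALS restricted to observed entries, assuming the masked least-squares objective above (the small ridge term is added only for numerical stability and is not part of the slide's formula; all sizes and values are illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    N, M, K = 200, 50, 8                      # compounds, targets, latent dimensions
    Y = rng.normal(size=(N, M))               # placeholder activity matrix
    W = rng.random((N, M)) < 0.02             # ~2% of entries observed

    U = rng.normal(scale=0.1, size=(N, K))
    V = rng.normal(scale=0.1, size=(K, M))
    lam = 0.1                                 # small ridge term for stability

    for it in range(20):
        # Update each row of U using only the targets observed for that compound.
        for i in range(N):
            obs = W[i]
            if obs.any():
                A = V[:, obs]                 # K x n_obs
                U[i] = np.linalg.solve(A @ A.T + lam * np.eye(K), A @ Y[i, obs])
        # Update each column of V using only the compounds observed for that target.
        for j in range(M):
            obs = W[:, j]
            if obs.any():
                B = U[obs]                    # n_obs x K
                V[:, j] = np.linalg.solve(B.T @ B + lam * np.eye(K), B.T @ Y[obs, j])

    Y_hat = U @ V                             # predictions for every entry, observed or not

  The last line gives a prediction for every missing entry, as the next slide describes.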

  19. Scarce matrix factorization
  Ø Once the factors are obtained, the other (missing) entries can be predicted as Ŷ = U*·V*

  20. Uncertainty
  Ø Given scarce data, is a single solution (U*, V*) meaningful?

  21. Bayesian modeling
  Ø Given the uncertainty from scarce data, Bayesian inference is desirable
  • Instead of the point estimate (U*, V*) = argmin over U, V of ‖W ⊙ (Y − U·V)‖², we want to consider the Bayesian posterior distribution p(U, V | Y)
  • The posterior predictive distribution p(Ŷ | Y) is more informative than any single optimal estimator

  22. Ordinary least squares
  Ø ALS involves successive regressions, each solved by OLS

  23. Ordinary least squares
  Ø Model: y = X'β + ε, with X a features × samples matrix (this setup is the transpose of the previous notation)
  Ø Solution: β̂ = (X X')⁻¹ X y
  Ø If the noise is Gaussian, then OLS is the maximum-likelihood estimate

  24. Block Gibbs sampler
  Ø The Gibbs sampler is a Markov chain Monte Carlo (MCMC) method
  Ø MCMC generates samples from complex posterior distributions of model parameters by iteratively sampling from simpler conditional distributions
  Ø The following scheme is a block Gibbs sampler:
     U(i+1) ~ p(U | V(i), Y)
     V(i+1) ~ p(V | U(i+1), Y)
  Ø Under mild ergodicity conditions, after burn-in the samples are (dependent) draws from the joint posterior: for i sufficiently large, (U(i), V(i)) ~ p(U, V | Y)
  Ø Similar to alternating least squares, but it explores the posterior globally rather than converging to a single optimum
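
  A minimal runnable sketch of such a block Gibbs sampler for the scarce factorization, assuming a simplified model with fixed Gaussian priors on the rows of U and V and fixed noise precision (the BPMF used in the talk additionally places hyperpriors on these; the names, sizes, and values below are illustrative):

    import numpy as np

    def sample_rows(Y, W, other, tau=1.0, alpha=2.0, rng=None):
        # Sample every row of one factor matrix from its Gaussian conditional,
        # given the other factor matrix and the observed entries (mask W).
        # Prior: row ~ N(0, tau^-1 I); likelihood precision alpha = 1/sigma^2.
        K = other.shape[1]
        out = np.empty((Y.shape[0], K))
        for i in range(Y.shape[0]):
            obs = W[i]
            Vo = other[obs]                                  # rows of the other factor with data
            prec = tau * np.eye(K) + alpha * Vo.T @ Vo       # conditional posterior precision
            mean = np.linalg.solve(prec, alpha * Vo.T @ Y[i, obs])
            L = np.linalg.cholesky(np.linalg.inv(prec))
            out[i] = mean + L @ rng.normal(size=K)           # one draw from N(mean, prec^-1)
        return out

    rng = np.random.default_rng(3)
    N, M, K = 100, 30, 4
    Y = rng.normal(size=(N, M))
    W = rng.random((N, M)) < 0.1                             # 10% observed (toy setting)

    U, V = rng.normal(size=(N, K)), rng.normal(size=(M, K))
    samples = []
    for it in range(300):
        U = sample_rows(Y, W, V, rng=rng)                    # block 1: p(U | V, Y)
        V = sample_rows(Y.T, W.T, U, rng=rng)                # block 2: p(V | U, Y)
        if it >= 100:                                        # keep samples after burn-in
            samples.append(U @ V.T)
    Y_mean = np.mean(samples, axis=0)                        # posterior mean prediction
    Y_std = np.std(samples, axis=0)                          # per-entry uncertainty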

  25. Markov Chain Monte Carlo
  Ø We do not get the posterior distribution analytically, only samples from it
  Ø The samples are sufficient to characterize the posterior distribution
  Ø e.g., average the sampled solutions to get the posterior mean estimate
  Ø e.g., use the marginal variance of individual predictions to characterize their uncertainty

  26. Bayesian linear regression
  Ø The distribution of β as a function of the data X and y can be modeled as a multivariate Gaussian distribution over β
  Ø Model: y = X'β + ε, with ε ~ N(0, σ²I) (transposed notation, as before)
  Ø Assume a Gaussian prior β ~ N(µ0, σ²Λ0⁻¹) and an inverse-gamma prior for the noise variance ρ = σ²

  27. Bayesian linear regression
  Ø Then the posterior distribution of β is also Gaussian, by application of Bayes’ rule:
     β | X, y ~ N(µn, σ²Λn⁻¹), with Λn = X X' + Λ0 and µn = Λn⁻¹ (X y + Λ0 µ0)
  Ø If Λ0 = 0 and µ0 = 0, then the solution for µn is identical to OLS!
  Ø The posterior mean µn is similar to the ridge regression solution
  Ø The precision matrix Λn characterizes the variance of the solution
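
  A small numpy check of these formulas, in the slide's transposed notation (X is features × samples); the prior values are illustrative. With Λ0 = 0 and µ0 = 0 the posterior mean collapses to OLS, and with Λ0 = λI and µ0 = 0 it is exactly the ridge solution.

    import numpy as np

    rng = np.random.default_rng(4)
    d, n = 5, 200
    X = rng.normal(size=(d, n))                        # features x samples (transposed setup)
    beta_true = rng.normal(size=d)
    y = X.T @ beta_true + 0.5 * rng.normal(size=n)

    Lam0, mu0 = 2.0 * np.eye(d), np.zeros(d)           # illustrative prior N(mu0, sigma^2 Lam0^-1)
    Lam_n = X @ X.T + Lam0                             # posterior precision (up to sigma^2)
    mu_n = np.linalg.solve(Lam_n, X @ y + Lam0 @ mu0)  # posterior mean

    beta_ols = np.linalg.solve(X @ X.T, X @ y)         # OLS: the Lam0 = 0, mu0 = 0 special case
    print(np.linalg.norm(mu_n - beta_ols))             # small shrinkage toward mu0, as in ridge regression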

  28. GAMBLR trick
  Ø Executing the Gibbs sampler requires sampling repeatedly from posterior Gaussian distributions (which change every time U and V change)
  Ø Sampling from a multivariate Gaussian: if ε ~ N(0, I) and A is such that Σ = A A', then z = µ + A ε ~ N(µ, Σ)
  Ø For Bayesian linear regression, augment the data as X̃ = [X  L0] and ỹ = [y; L0'µ0], with Λ0 = L0 L0'; then µn = (X̃ X̃')⁻¹ X̃ ỹ and Λn = X̃ X̃'
  Ø It can be shown that z = (X̃ X̃')⁻¹ X̃ (ỹ + σ ε) ~ N(µn, σ²Λn⁻¹)
  Ø This has the same form as OLS!

  29. GAMBLR trick
  Ø This means that we can sample from the posterior Gaussian distribution by solving a linear regression on the original data plus injected noise!
  Ø Running the Gibbs sampler then only amounts to solving a sequence of linear regressions with variable noise injection
  Ø Linear regression is one of the best-studied problems in numerical analysis
  Ø Fast algorithms, scalable code
  Ø One multivariate regression per row or column of Y at each iteration step, hence easy parallelization
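
  A minimal numpy sketch of the noise-injection trick, reusing the same illustrative prior as above: augment the data with the prior, solve an ordinary regression on a noise-injected right-hand side, and the result is one exact draw from the Gaussian posterior. The sizes and prior values are placeholders.

    import numpy as np

    rng = np.random.default_rng(5)
    d, n, sigma = 5, 200, 0.5
    X = rng.normal(size=(d, n))
    y = X.T @ rng.normal(size=d) + sigma * rng.normal(size=n)

    Lam0, mu0 = 2.0 * np.eye(d), np.zeros(d)
    L0 = np.linalg.cholesky(Lam0)                        # Lam0 = L0 L0'
    X_aug = np.hstack([X, L0])                           # augment the features with the prior
    y_aug = np.concatenate([y, L0.T @ mu0])

    def gamblr_draw():
        eps = rng.normal(size=n + d)                     # inject unit Gaussian noise into the targets
        return np.linalg.solve(X_aug @ X_aug.T, X_aug @ (y_aug + sigma * eps))

    draws = np.array([gamblr_draw() for _ in range(5000)])
    mu_n = np.linalg.solve(X_aug @ X_aug.T, X_aug @ y_aug)
    print(np.abs(draws.mean(axis=0) - mu_n).max())       # empirical mean matches mu_n
    cov_expected = sigma**2 * np.linalg.inv(X_aug @ X_aug.T)
    print(np.abs(np.cov(draws.T) - cov_expected).max())  # empirical covariance matches sigma^2 Lam_n^-1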

  30. Matrix factorization
  • One of the best approaches in the Netflix challenge
  • Prediction of ratings for viewer–movie pairs
  • Does not use features, only the matrix values
  • Two popular versions:
  • Probabilistic Matrix Factorization (PMF) = maximum likelihood
  • Bayesian PMF (BPMF) = Bayesian inference

  31. Netflix comparison (PMF vs. BPMF)
  Ø Data: 100M ratings from 480K users and 18K movies
  Ø BPMF has an advantage for users with few ratings

  32. Motivation for Bayesian PMF
  • PMF gives point estimates
  • Problematic for compounds that have only a few measurements
  • We are interested in the uncertainty of the estimates
  [Figure: example IC50 data set from ChEMBL with 15K compounds]

  33. Bayesian PMF
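
  The slide itself only shows the graphical model, which is not reproduced here. For reference, a sketch of the standard BPMF generative model as formulated by Salakhutdinov & Mnih (2008), on which this talk builds; the hyperparameter symbols follow that paper, not the slide:

    Y_{ij} \mid U, V \sim \mathcal{N}(u_i^\top v_j,\ \sigma^2) \quad \text{for observed } (i, j)
    u_i \sim \mathcal{N}(\mu_U,\ \Lambda_U^{-1}), \qquad v_j \sim \mathcal{N}(\mu_V,\ \Lambda_V^{-1})
    (\mu_U, \Lambda_U),\ (\mu_V, \Lambda_V) \sim \text{Normal--Wishart hyperpriors}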

  34. Gibbs sampling
  • Iteratively samples each parameter
  • Obtains posterior samples of the model
  • e.g., sample 200 models after burn-in
  • Using the samples, one can also measure uncertainty
  • Related to alternating least squares
  • Blocked Gibbs sampler with large blocks, hence good sampling behavior

  35. ChEMBL: PMF vs. Bayesian PMF
  • ChEMBL public data set of assay activities: 15,118 compounds, 344 proteins, 59,451 IC50 values
  • IC50 values discretized into active/inactive at 200 nM; 20% held out as a test set
  • BPMF outperforms PMF on test classification error
  • Does not use features, only the matrix values

  36. ChEMBL: BPMF vs. ridge regression
  • Same setup: 15K compounds, 344 proteins, 200 nM threshold, 20% test set; the number of latent dimensions is varied
  • Matrix factorization is not as good as QSAR (ridge regression on fingerprints), but it does capture information

  37. BPMF (relation view)
  • Model: 2 entities (compound, protein), 1 relation (IC50)
  • The latent variables (green in the diagram) are learned from the IC50 data

  38. Macau
  • Can we get the best of both worlds?
  • Model: 2 entities (compound, protein), 1 relation (IC50), plus fingerprint features for the compounds linked through β_comp
  • The compound latent variables U and protein latent variables V are learned together with β_comp from the IC50 data
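
  Macau's key change, sketched in the BPMF notation above (a simplified reading of the diagram; the full model also places a prior on β_comp and can attach features to either entity): the compound fingerprints x_i shift the prior mean of the compound latent vectors through a link matrix β_comp that is sampled jointly with U and V.

    u_i \sim \mathcal{N}(\mu_U + \beta_{\mathrm{comp}}^\top x_i,\ \Lambda_U^{-1}), \qquad
    v_j \sim \mathcal{N}(\mu_V,\ \Lambda_V^{-1}), \qquad
    \hat{Y}_{ij} = \mathbb{E}\,[\,u_i^\top v_j \mid Y\,]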

  39. Results on ChEMBL
  • Same setup: 15K compounds, 344 proteins, 200 nM threshold, 20% test set
  • Compound features improve performance
  • Multitask modeling improves performance
