Improved Applicability Domain Determination of QSAR Models Using Local Mapping: RDN
Natalia Aniceto, Alex Freitas, Andreas Bender, Taravat Ghafourian
PhD student ● University of Kent
BACKGROUND
Applicability Domain (AD): cumulative result of data noise and sparseness.
Establishing boundaries for prediction reliability is arguably as important as demonstrating good predictive performance. AD methods proposed so far typically address the data as a whole, often focusing on a single aspect of the data (e.g. similarity to the training set, descriptor span, data density).
[Figure labels: sparse, dense]
To successfully characterize a model’s AD: combining different measures that address the data locally has been shown to improve AD characterization (Sahigara et al.1 and Sheridan2).
STD + Similarity + Predictions
Hypothesis: AD ("Applicability Domain") = f(local density & local reliability)
[Figure: Ideal scenario. Accuracy vs. AD output measure; chemical space coverage]
Density Neighbourhood (dk-NN) (Sahigara et al.1)
Rationale: Each training instance provides coverage of the chemical space according to how densely populated its local vicinity is. A radius of coverage (Di) is placed on each instance, taken as the average Euclidean Distance (ED) to its neighbours within the average ED to the k-th NN.
↑k : ↑span of coverage
Increasing k lets us scan through the chemical space.
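The radius-of-coverage idea can be illustrated with a minimal sketch. It follows the wording of the slide rather than any published implementation: the cutoff is assumed to be the training-set average of the distance to the k-th nearest neighbour, and each Di is the mean distance to the neighbours that fall within that cutoff. The function names `dknn_radii` and `in_domain` are hypothetical.

```python
import math

def dknn_radii(train, k):
    """Per-instance coverage radii Di for a list of descriptor vectors.

    Assumption: the cutoff is the average, over all training instances,
    of the Euclidean distance to the k-th nearest neighbour.
    """
    n = len(train)
    # Distances from each instance to every other training instance.
    dists = [[math.dist(train[i], train[j]) for j in range(n) if j != i]
             for i in range(n)]
    cutoff = sum(sorted(row)[k - 1] for row in dists) / n
    radii = []
    for row in dists:
        near = [d for d in row if d <= cutoff]
        # An isolated instance with no neighbour inside the cutoff gets no coverage.
        radii.append(sum(near) / len(near) if near else 0.0)
    return radii

def in_domain(train, radii, query):
    """A query is inside the AD if it lies within Di of at least one training instance."""
    return any(math.dist(query, x) <= r for x, r in zip(train, radii))
```

In this sketch a larger k raises the cutoff, so more neighbours contribute and the radii grow, which matches the "↑k : ↑span of coverage" behaviour described above.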
Local reliability: precision & bias
Precision: deviation within an ensemble of models (STD) (Tetko et al)
Bias: agreement with observed response
Di → Di*
↑ agreement, ↑ (1 – STD) → ↑W: Di is less penalized
Higher reliability = larger coverage
Tetko et al. J. Chem. Inf. Model. 2008, 48, 1733–1746
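The slide does not give the formula for the weight W, so the sketch below is only one possible reading. It assumes W = agreement × (1 – STD), with both quantities scaled to [0, 1], and Di* = W × Di. The function name `corrected_radii` is hypothetical.

```python
def corrected_radii(radii, agreement, std):
    """Shrink each density radius Di by a reliability weight W to obtain Di*.

    Assumption (not stated on the slide): W = agreement * (1 - STD),
    with agreement and STD both in [0, 1].
    """
    corrected = []
    for d, a, s in zip(radii, agreement, std):
        w = a * (1.0 - s)        # higher agreement and lower STD give a larger W
        corrected.append(d * w)  # a larger W means Di is less penalized
    return corrected
```

Under this assumption, a training instance with perfect agreement and zero STD keeps its full radius, while an unreliable instance covers less of the space around it.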
[Figure: Training set, External set]
Scan through the chemical space: by keeping track of the instances entering the AD, and updating the in-AD accuracy, we build a map of reliability across the model’s space.
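The scan can be sketched as a cumulative tally. The example assumes that, for each external instance, we already know the scan step (for example the smallest k) at which it first enters the AD, and whether the model predicted it correctly. Both inputs and the function name `accuracy_coverage_profile` are hypothetical.

```python
def accuracy_coverage_profile(entry_step, correct):
    """Return (step, % data in AD, accuracy in AD) for each scan step.

    entry_step[i]: step at which instance i first enters the AD (assumed given).
    correct[i]:    True if the model prediction for instance i was right.
    """
    n = len(entry_step)
    profile = []
    for step in sorted(set(entry_step)):
        # Instances that have entered the AD at or before this step.
        inside = [c for e, c in zip(entry_step, correct) if e <= step]
        coverage = 100.0 * len(inside) / n
        accuracy = sum(inside) / len(inside)
        profile.append((step, coverage, accuracy))
    return profile
```

Plotting accuracy against coverage from this profile gives the kind of curve the slides use to compare AD methods.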
Test the RDN algorithm on ❷ benchmark datasets
Compare RDN vs ❸ other AD methods
Benchmark datasets: Ames mutagenicity test; CYP450 (1A2) inhibition test. Class = {+, -}
P-gp dataset: Class = {S, NS}. Build a decision tree model; ensemble model (10-fold bagging)
Build RDN, then Test RDN
[Figure: Accuracy vs. AD output measure; chemical space coverage]
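The STD measure depends on an ensemble of models, such as the bagged ensemble mentioned above. A minimal sketch of the deviation across members is shown here. It assumes each member returns a predicted probability per instance and uses the population standard deviation, neither of which is specified on the slides. The function name `ensemble_std` is hypothetical.

```python
import statistics

def ensemble_std(member_predictions):
    """Standard deviation of predictions across ensemble members, per instance.

    member_predictions: one list per model, each holding that model's
    predicted probability for every instance (assumed input layout).
    """
    # zip(*...) regroups the values so each tuple holds all members'
    # predictions for a single instance.
    return [statistics.pstdev(p) for p in zip(*member_predictions)]
```

A low STD means the bagged models agree on that instance, which is read here as higher precision.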
❷ Characterizing Mispredictions
Kernel Density Estimation (KDE)
Global Density? Descriptor span?
Decision tree descriptor span
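A descriptor-span check is usually a simple bounding-box test. The sketch below assumes the caller passes only the descriptors used by the decision tree and treats an instance as IN when every descriptor value lies within the training range. The function names `descriptor_span` and `inside_span` are hypothetical.

```python
def descriptor_span(train):
    """Return the (min, max) range of each descriptor over the training set."""
    return [(min(col), max(col)) for col in zip(*train)]

def inside_span(span, x):
    """True if every descriptor of x lies within the corresponding training range."""
    return all(lo <= v <= hi for (lo, hi), v in zip(span, x))
```

This global test ignores how densely the range is populated, which is why the slides go on to ask about density as well.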
Test Hypothesis: AD = f(local density & local reliability)
Training set N = 659; Validation set N = 194; Test set N = 195
[Figure: Accuracy in the AD vs. % data in the AD, original dk-NN vs. RDN]
Shrink distances to 1/3 in the beginning ½ Di Di
[Figure: Accuracy inside AD vs. % data in AD, IV set and TE set]
[Figure: Accuracy vs. STD, TR set, TE set and IV set]
[Figure: Agreement (to observed) vs. STD]
Descriptor span? IN vs. OUT: N = 62 (32%), Acc = 72.6% vs. N = 133 (68%), Acc = 67.7%
Density in feature space? Kernel Density Estimation (KDE)
[Figure: Accuracy inside the AD vs. % covered data, TE set and IV set]
[Figure: Kernel Density Estimation (KDE); axis: PC1]
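To illustrate the density question, here is a minimal one-dimensional Gaussian KDE. It assumes the density is estimated along a single axis such as PC1 and that the bandwidth is chosen by the user. The slides state neither the kernel nor the bandwidth, so both are assumptions, and `gaussian_kde_1d` is a hypothetical name.

```python
import math

def gaussian_kde_1d(samples, bandwidth):
    """Return a function estimating density at x from 1-D training samples.

    Assumptions: Gaussian kernel, fixed user-supplied bandwidth.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum the kernel contribution of every training sample at x.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density
```

New instances can then be ranked by their estimated density, with low-density instances treated as further from the model.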
Test Hypothesis: AD = f(local density & local reliability)
P-gp dataset, Class = {S, NS}: Training set N = 689; Validation set N = 194; Test set N = 195. Build RDN, then Test RDN.
Ames dataset and CYP450 dataset: Predictions, STD & Agreement from OChem. Build RDN, then Test RDN.
Ames dataset: training N = 4358; Test #1 + Test #2: N = 1089 + 1090
CYP450 dataset: training N = 3743; Test #1 + Test #2: N = 1870 + 1870
[Figure: Accuracy in AD vs. % data in AD. Ames model (Test #1, Test #2); CYP450 model (Test #1, Test #2); Pgp model (IV set, TE set)]
Test Hypothesis: AD = f(local density & local reliability)
Compare Reliability-Density Neighbourhood (RDN) vs ❸ other AD methods
Benchmark datasets vs AD techniques
[Figure: Accuracy vs. % coverage (implicitly, distance-to-model) for the Ames model and the CYP450 model, Test #1 and Test #2. Panels: RDN, STD, dk-NN (x-axis: # nearest neighbours), KDE]
Local density corrected for local reliability (Precision + Bias) is able to successfully sort new instances according to their predictive performance, by defining a map that identifies regions according to their probability of containing mispredictions.
The RDN method performs robustly on new, unseen data (two external datasets show similar profiles of accuracy across chemical space).
For the optimal establishment of the AD of a QSAR model: RDN + STD, with case-by-case selection of the best candidate.