
Apr 17, 2023


SLIDE 1

Improved Applicability Domain Determination of QSAR Models Using Local Mapping: RDN

Natalia Aniceto, Alex Freitas, Andreas Bender, Taravat Ghafourian

PhD student ● University of Kent

SLIDE 2

BACKGROUND

Applicability Domain (AD): the cumulative result of data noise and sparseness.

Establishing boundaries for prediction reliability is arguably as important as demonstrating good predictive performance. AD methods proposed so far typically address the data as a whole, often focusing on a single aspect of the data (e.g. similarity to the training set, descriptor span, data density).

[Figure: sparse vs. dense regions of the data]

To successfully characterize a model’s AD: combining different measures that address the data locally has been shown to improve AD characterization (Sahigara et al.1 and Sheridan2).

1. Sahigara et al. Journal of Cheminformatics 2013, 5:27
2. Sheridan. J. Chem. Inf. Model. 2012, 52, 814−823

Hypothesis: “Applicability Domain” (AD) = f (local density & local reliability)

STD + Similarity + Predictions

SLIDE 3

METHODS

AD = f (local density & local reliability)

[Figure: ideal scenario, accuracy vs. AD output measure (chemical space coverage)]

❶ Local Density: Density Neighbourhood (dk-NN), Sahigara et al.1

Rationale: Each training instance provides coverage of the chemical space according to how densely populated its local vicinity is. A radius of coverage is placed on each instance, taken as the average Euclidean Distance (ED) to those of its neighbours that lie within the average ED to the k-th nearest neighbour (NN).

↑k : ↑span of coverage (scan through the chemical space)

However, high local density does not imply high reliability.
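The radius-of-coverage rule described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`coverage_radii`, `in_domain`) are hypothetical, the reference distance is read literally from the slide as the training-set average ED to the k-th NN, and an instance with no neighbour inside that reference distance is given a zero radius.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def coverage_radii(train, k):
    """Return (reference distance, per-instance radius of coverage)."""
    n = len(train)
    # Sorted distances from each training instance to all the others.
    sorted_d = [
        sorted(euclidean(train[i], train[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    # Reference distance: average ED to the k-th nearest neighbour.
    reference = sum(d[k - 1] for d in sorted_d) / n
    radii = []
    for d in sorted_d:
        close = [x for x in d if x <= reference]
        # Assumption: no neighbour within the reference distance -> radius 0.
        radii.append(sum(close) / len(close) if close else 0.0)
    return reference, radii

def in_domain(query, train, radii):
    """A query is inside the AD if it falls within any instance's radius."""
    return any(euclidean(query, t) <= r for t, r in zip(train, radii))

# Toy data: three clustered instances and one isolated instance.
train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
reference, radii = coverage_radii(train, k=1)
near = in_domain((0.5, 0.5), train, radii)   # close to the dense cluster
far = in_domain((5.0, 5.0), train, radii)    # in the empty region
```

In this toy set the isolated instance ends up with no coverage at all, which is the density behaviour the rationale describes: sparse regions contribute little or nothing to the domain.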

SLIDE 4

METHODS

AD = f (local density & local reliability)

❷ Local Reliability (bias & precision)

Deviation within an ensemble of models (STD) (Tetko et al.)

Agreement with observed response

Di → Di*

Di* = Di × Wi

↑ agreement, ↑ (1 – STD) → ↑ Wi → Di is less penalized

Higher reliability = larger coverage

Tetko et al. J. Chem. Inf. Model. 2008, 48, 1733–1746
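The correction Di* = Di × Wi can be illustrated with a short sketch. The slide only states that Wi grows with agreement and with (1 – STD); the product form used below is an assumption made for illustration, and both function names are hypothetical.

```python
def reliability_weight(agreement, std):
    """Hypothetical weight Wi in [0, 1].

    Assumption: Wi = agreement * (1 - STD). The slide only says that Wi
    increases with agreement and with (1 - STD); the exact form may differ.
    """
    if not (0.0 <= agreement <= 1.0 and 0.0 <= std <= 1.0):
        raise ValueError("agreement and std are expected in [0, 1]")
    return agreement * (1.0 - std)

def corrected_radius(d_i, agreement, std):
    """Di* = Di x Wi: shrink the density radius by the reliability weight."""
    return d_i * reliability_weight(agreement, std)

# A reliable instance keeps its radius; an unreliable one is penalized.
reliable = corrected_radius(2.0, agreement=1.0, std=0.0)
unreliable = corrected_radius(2.0, agreement=0.5, std=0.5)
```

With any weight of this kind, two instances in equally dense neighbourhoods can end up with different coverage, which is how higher reliability translates into larger coverage.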

SLIDE 5

METHODS

[Figure: training set and external set instances, in vs. out of the AD, for k = 1, k = 3, k = 5, k = 7]

Scan through the chemical space: by keeping track of new instances entering the AD, and updating the in-AD accuracy, we build a map of reliability across the model’s space.
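The scan described above can be sketched as a loop over k that records, for each value, how much of an external set falls inside the AD and how accurate the model is on that part. This is a simplified illustration: it uses the plain density radii (no reliability correction), gives a zero radius to instances with no neighbour inside the reference distance, and all names (`radii_for_k`, `scan`) are hypothetical.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def radii_for_k(train, k):
    """Density radius of each training instance for a given k."""
    n = len(train)
    sorted_d = [
        sorted(dist(train[i], train[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    reference = sum(d[k - 1] for d in sorted_d) / n
    radii = []
    for d in sorted_d:
        close = [x for x in d if x <= reference]
        radii.append(sum(close) / len(close) if close else 0.0)
    return radii

def scan(train, external, correct, ks):
    """Return (k, fraction of external data in AD, accuracy in AD) per k."""
    rows = []
    for k in ks:
        radii = radii_for_k(train, k)
        inside = [
            any(dist(q, t) <= r for t, r in zip(train, radii))
            for q in external
        ]
        hits = [c for c, flag in zip(correct, inside) if flag]
        coverage = len(hits) / len(external)
        accuracy = sum(hits) / len(hits) if hits else None
        rows.append((k, coverage, accuracy))
    return rows

# Toy one-descriptor example; `correct` flags whether each external
# instance was predicted correctly by the model.
train = [(0.0,), (1.0,), (2.0,), (3.0,), (8.0,)]
external = [(0.5,), (4.5,), (6.0,)]
correct = [True, False, True]
rows = scan(train, external, correct, ks=[1, 3])
```

In the toy data, raising k from 1 to 3 brings one more external instance into the AD and the in-AD accuracy drops, which is the kind of accuracy-versus-coverage profile the scan is meant to expose.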

SLIDE 6

METHODS

Part #1: Explore the capabilities of the Reliability-Density Neighbourhood (RDN) algorithm; characterize mispredictions ( ❶ “working” dataset )

Part #2: Test the RDN algorithm on ❷ benchmark datasets
• Ames mutagenicity test
• CYP450 inhibition test (1A2)
Class = { + , - }

Part #3: Compare RDN vs ❸ other AD methods

SLIDE 7

METHODS

Part #1

P-gp dataset, Class = {S, NS}

Training set: N = 659
Validation set: N = 194
Test set: N = 195

Workflow: Build a decision tree model → Ensemble model (10-fold bagging) → Predictions, STD, Agreement → Build RDN → Test RDN

❶ Explore the capabilities of RDN

Test hypothesis: AD = f (local density & local reliability)

[Figure: accuracy vs. AD output measure (chemical space coverage)?]

❷ Characterizing Mispredictions

Global density? Kernel Density Estimation (KDE)

Descriptor span? Decision tree descriptor span
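The per-instance quantities taken from the bagged ensemble (prediction, STD and agreement) can be summarized as follows. This is a sketch under assumptions: the ensemble is represented only by the class probabilities its members output for one instance, STD is taken as the population standard deviation of those probabilities, and agreement is taken as the fraction of members that vote for the observed class. The function name `ensemble_summary` is hypothetical.

```python
import statistics

def ensemble_summary(member_probs, observed=None, threshold=0.5):
    """Summarize the ensemble output for a single instance.

    member_probs: probability of the positive class from each member.
    observed: known class (0 or 1), available for training instances.
    """
    mean_p = statistics.fmean(member_probs)
    # Precision: spread of the members' outputs.
    std = statistics.pstdev(member_probs)
    prediction = int(mean_p >= threshold)
    agreement = None
    if observed is not None:
        # Assumption: agreement = share of members voting the observed class.
        votes = [int(p >= threshold) for p in member_probs]
        agreement = sum(v == observed for v in votes) / len(votes)
    return {"prediction": prediction, "std": std, "agreement": agreement}

# Ten members, as in 10-fold bagging: seven vote positive, three negative.
summary = ensemble_summary([0.9] * 7 + [0.3] * 3, observed=1)
```

For an external instance the observed class is unknown, so only the prediction and STD are available; agreement is a property of the training instances used to build the RDN.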

SLIDE 8

Part #1

RESULTS

P-gp

❶ Explore the capabilities of RDN

[Figure: accuracy in the AD vs. % data in the AD, original dk-NN compared with RDN]

Shrink distances to 1/3 in the beginning → ½ Di → Di

[Figure: accuracy inside the AD vs. % data in the AD, IV set and TE set]

SLIDE 9

Part #1

RESULTS

P-gp

Role of bias-precision correction

[Figure: accuracy vs. STD for the TR set, TE set and IV set]

[Figure: agreement (to observed) vs. STD]

SLIDE 10

Part #1

RESULTS

P-gp

❷ Diagnosing mispredictions

Descriptor span? (IN / OUT)

N = 62 (32%), Acc = 72.6%
N = 133 (68%), Acc = 67.7%
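As a quick consistency check on the two groups reported above, their sizes add up to the test set size (N = 195), and the size-weighted accuracy gives the accuracy over the whole set. The snippet below only reuses the numbers from the slide; the variable names are illustrative.

```python
# (group size, accuracy) pairs as reported on the slide.
groups = [(62, 0.726), (133, 0.677)]

total = sum(n for n, _ in groups)
# Size-weighted average of the two group accuracies.
overall = sum(n * acc for n, acc in groups) / total

print(total, round(overall, 3))
```

The combined accuracy is roughly 69%, so the descriptor-span split separates the two groups by only about five percentage points.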

SLIDE 11

Part #1

RESULTS

P-gp

❷ Diagnosing mispredictions

Density in feature space? Kernel Density Estimation (KDE)

[Figure: accuracy inside the AD vs. % covered data for the TE set and IV set; KDE over PC1]
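A density-based AD of this kind can be sketched with a one-dimensional Gaussian KDE over PC1 scores: external instances are ranked from the densest to the sparsest region, and accuracy is recomputed as coverage grows. The fixed bandwidth, the toy PC1 values and the function names are assumptions for illustration only.

```python
import math

def gaussian_kde_1d(train_values, bandwidth):
    """Return a density function estimated from 1-D training values."""
    n = len(train_values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in train_values
        )

    return density

def coverage_curve(density, values, correct):
    """Rank instances by decreasing density.

    Returns (fraction of data covered, accuracy inside the AD) at each cut.
    """
    order = sorted(
        range(len(values)), key=lambda i: density(values[i]), reverse=True
    )
    curve, hits = [], 0
    for rank, i in enumerate(order, start=1):
        hits += int(correct[i])
        curve.append((rank / len(values), hits / rank))
    return curve

# Toy PC1 scores; bandwidth 0.5 is an arbitrary choice for the example.
train_pc1 = [-1.0, -0.5, 0.0, 0.5, 1.0]
test_pc1 = [0.1, 3.0, -0.8]
correct = [True, False, True]

density = gaussian_kde_1d(train_pc1, bandwidth=0.5)
curve = coverage_curve(density, test_pc1, correct)
```

The resulting list of (coverage, accuracy) pairs has the same axes as the plot on this slide, which makes the KDE profile directly comparable with the RDN one.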

SLIDE 12

Part #2

METHODS

Test the RDN algorithm on ❷ benchmark datasets

Test hypothesis: AD = f (local density & local reliability)

P-gp dataset, Class = {S, NS}

Training set: N = 689
Validation set: N = 194
Test set: N = 195

Build RDN → Test RDN

Ames dataset and CYP450 dataset: Predictions, STD & Agreement from OChem → Build RDN → Test RDN

Ames dataset: training N = 4358; Test #1 + Test #2 N = 1089 + 1090
CYP450 dataset: training N = 3743; Test #1 + Test #2 N = 1870 + 1870

SLIDE 13

Part #2

RESULTS

Test the RDN algorithm on ❷ benchmark datasets

[Figure: accuracy in the AD vs. % data in the AD. Panels: Ames model (Test #1, Test #2), CYP450 model (Test #1, Test #2), P-gp model (IV set, TE set)]

SLIDE 14

Part #3

METHODS

Compare Reliability-Density Neighbourhood (RDN) vs ❸ other AD methods

Test hypothesis: AD = f (local density & local reliability)

Benchmark datasets: Ames dataset, CYP450 dataset

AD techniques: RDN vs STD, KDE, dk-NN

SLIDE 15

Part #3

RESULTS

Accuracy vs. % coverage (implicitly, distance-to-model), Test #1 and Test #2

[Figure: Ames model. Panels: RDN, STD, dk-NN (x-axis: # nearest neighbours), KDE]

[Figure: CYP450 model. Panels: RDN, STD, dk-NN (x-axis: # nearest neighbours), KDE]

SLIDE 16

CONCLUSIONS

• Local density corrected for local reliability (Precision + Bias) is able to successfully sort new instances according to their predictive performance, through the definition of a map that identifies regions according to their probability of containing mispredictions.
• The RDN method performs robustly on new, unseen data (two external datasets show similar profiles of accuracy across chemical space).
• For the optimal establishment of the AD of a QSAR model: RDN + STD → case-by-case selection of the best candidate.

SLIDE 17

Improved Applicability Domain Determination of QSAR Models Using Local Mapping: RDN

Natalia Aniceto, Alex Freitas, Andreas Bender, Taravat Ghafourian

PhD student ● University of Kent


THANK YOU FOR YOUR ATTENTION!