Improved Applicability Domain Determination of QSAR Models Using Local Mapping: RDN



slide-1
SLIDE 1

Improved Applicability Domain Determination of QSAR Models Using Local Mapping: RDN

Natalia Aniceto, Alex Freitas, Andreas Bender, Taravat Ghafourian

PhD student ● University of Kent

slide-2
SLIDE 2

BACKGROUND

Cumulative result of data noise and sparseness

Establishing boundaries for prediction reliability is arguably as important as demonstrating good predictive performance. AD methods proposed so far typically address the data as a whole, often focusing on a single aspect of the data (e.g. similarity to the training set, descriptor span, data density).

[Figure: sparse vs. dense regions of the data]

In order to successfully characterize a model’s AD: combining different measures that address the data locally has been shown to improve AD characterization (Sahigara et al.1 and Sheridan2).

  • 1. Sahigara et al. Journal of Cheminformatics 2013, 5:27
  • 2. Sheridan. J. Chem. Inf. Model. 2012, 52, 814−823

Hypothesis: AD = f (local density & local reliability)

AD: “Applicability Domain”

STD + Similarity + Predictions

slide-3
SLIDE 3

Ideal scenario

[Figure: accuracy vs. chemical space coverage, as a function of the AD output measure]

❶ Local Density

Sahigara et al.1

Rationale: Each training instance provides coverage of chemical space according to how densely populated its local vicinity is. A radius of coverage is placed on each instance, computed as the average Euclidean Distance (ED) to its neighbours that lie within the average ED to the k-th nearest neighbour (NN).

↑k : ↑span of coverage
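The radius definition above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `coverage_radii` and `in_domain` are hypothetical, "the average ED to the k-th NN" is read as the mean over the whole training set, and an instance with no neighbour inside that reference distance is given a radius of zero.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def coverage_radii(train, k):
    """Return one coverage radius per training instance (hypothetical helper)."""
    n = len(train)
    if not 1 <= k <= n - 1:
        raise ValueError("k must be between 1 and len(train) - 1")
    # Sorted distances from each instance to every other training instance.
    dists = [sorted(euclidean(train[i], train[j]) for j in range(n) if j != i)
             for i in range(n)]
    # Reference distance: ED to the k-th NN, averaged over the training set.
    ref = sum(d[k - 1] for d in dists) / n
    radii = []
    for d in dists:
        close = [x for x in d if x <= ref]
        # Assumption: no neighbour within the reference distance -> radius 0.
        radii.append(sum(close) / len(close) if close else 0.0)
    return radii

def in_domain(query, train, radii):
    """A query is covered if it falls inside at least one training radius."""
    return any(euclidean(query, t) <= r for t, r in zip(train, radii))

# Toy data: three clustered instances and one isolated instance.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
radii = coverage_radii(train, k=1)
print(radii)
print(in_domain((0.5, 0.5), train, radii), in_domain((5.0, 5.0), train, radii))
```

In this toy set the isolated instance ends up with no coverage, which is the behaviour the slide describes: sparse regions contribute little to the domain, and raising k widens the reference distance and therefore the span of coverage.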

scan through the chemical space

However, high local density does not imply high reliability.

Di

Density Neighbourhood (dk-NN)

AD = f (local density & local reliability) ❶ ❷

METHODS

slide-4
SLIDE 4

AD = f (local density & local reliability) ❶ ❷

❷ Local Reliability

Precision & bias:

  • Precision: deviation within an ensemble of models (STD) (Tetko et al.)
  • Bias: agreement with the observed response

Di → Di*

Di* = Di × Wi

↑ agreement, ↑ (1 – STD) → ↑ W: Di is less penalized

Higher reliability = larger coverage
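The slide gives the correction Di* = Di × Wi and says only that W grows with agreement and with 1 – STD. The sketch below uses the product `agreement * (1 - std)` as one possible weight; that functional form, and the names `reliability_weight` and `corrected_radius`, are assumptions made for illustration and are not taken from the presentation.

```python
def reliability_weight(agreement, std):
    """Hypothetical weight in [0, 1]; rises with agreement and with 1 - STD."""
    if not (0.0 <= agreement <= 1.0 and 0.0 <= std <= 1.0):
        raise ValueError("agreement and std are expected to lie in [0, 1]")
    # Assumption: simple product form; the slide does not define W explicitly.
    return agreement * (1.0 - std)

def corrected_radius(d_i, agreement, std):
    """Apply the slide's correction Di* = Di x Wi."""
    return d_i * reliability_weight(agreement, std)

# A reliable instance keeps its radius; an unreliable one is shrunk.
full = corrected_radius(2.0, agreement=1.0, std=0.0)
shrunk = corrected_radius(2.0, agreement=0.5, std=0.5)
print(full, shrunk)
```

Whatever form W takes, the intended effect is the same: instances predicted precisely and in agreement with their observed response retain most of their coverage.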

Tetko et al. J. Chem. Inf. Model. 2008, 48, 1733–1746

METHODS

slide-5
SLIDE 5

METHODS

[Figure: training set and external set, with coverage shown for k = 1, k = 3, k = 5 and k = 7; instances are marked as out of or in the AD]

Scan through the chemical space: by keeping track of new instances entering the AD, and updating the in-AD accuracy, we build a map of reliability across the model’s space.
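The scan described above can be sketched as a loop over increasing k that records which external instances are newly covered and recomputes the accuracy inside the AD. The input format (a dictionary of per-k membership flags) and the name `reliability_map` are assumptions chosen to keep the example self-contained.

```python
def reliability_map(membership_by_k, correct):
    """Track coverage and in-AD accuracy while k grows.

    membership_by_k: {k: [True/False per external instance]} (assumed format)
    correct:         [True/False per external instance], prediction == observed
    Returns rows of (k, newly covered, fraction covered, accuracy inside AD).
    """
    covered = set()
    rows = []
    for k in sorted(membership_by_k):
        inside = membership_by_k[k]
        new = [i for i, flag in enumerate(inside) if flag and i not in covered]
        covered.update(new)
        hits = sum(1 for i in covered if correct[i])
        accuracy = hits / len(covered) if covered else None
        rows.append((k, len(new), len(covered) / len(correct), accuracy))
    return rows

# Toy example with four external instances.
membership = {
    1: [True, False, False, False],
    3: [True, True, False, False],
    5: [True, True, True, False],
}
correct = [True, True, False, True]
rows = reliability_map(membership, correct)
for row in rows:
    print(row)
```

Each row links a level of coverage to the accuracy observed so far, which is the kind of accuracy vs. % data in AD profile reported in the results slides.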
slide-6
SLIDE 6

METHODS

Part #1 ( ❶ “working” dataset )

  • Explore the capabilities of the Reliability-Density Neighbourhood (RDN) algorithm
  • Characterize mispredictions

Part #2

Test the RDN algorithm on ❷ benchmark datasets

  • Ames mutagenicity test
  • CYP450 inhibition test (1A2)

Class = { + , - }

Part #3

Compare RDN vs ❸ other AD methods

slide-7
SLIDE 7

P-gp dataset, Class = {S, NS}

Build a decision tree model

Ensemble model (10-fold bagging)

  • STD
  • Agreement
  • Predictions

Build RDN

Test RDN
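The ensemble step can be illustrated with a minimal bagging sketch that yields the three quantities listed above: a prediction, the STD across ensemble members, and the agreement with the observed class. The deck uses decision trees; here a one-descriptor threshold rule (`fit_stump`) stands in for the tree purely to keep the example short, and agreement is assumed to be the fraction of members that predict the observed class. All names are hypothetical.

```python
import random
import statistics

def fit_stump(xs, ys):
    """Stand-in learner (assumption): threshold at the midpoint of class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    if not pos or not neg:
        # Bootstrap sample with a single class: always predict that class.
        label = 1 if pos else 0
        return lambda x: label
    mean_pos, mean_neg = statistics.mean(pos), statistics.mean(neg)
    cut = (mean_pos + mean_neg) / 2
    if mean_pos > mean_neg:
        return lambda x: 1 if x > cut else 0
    return lambda x: 1 if x < cut else 0

def bagged_ensemble(xs, ys, n_models=10, seed=0):
    """Train n_models learners, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def describe(models, x, observed=None):
    """Return (majority prediction, STD of votes, agreement with observed)."""
    votes = [m(x) for m in models]
    prediction = 1 if 2 * sum(votes) >= len(votes) else 0
    std = statistics.pstdev(votes)
    agreement = None
    if observed is not None:
        # Assumption: agreement = fraction of members predicting the observed class.
        agreement = sum(v == observed for v in votes) / len(votes)
    return prediction, std, agreement

xs = [0.1, 0.2, 0.3, 0.4, 2.0, 2.1, 2.2, 2.3]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
models = bagged_ensemble(xs, ys)
prediction, std, agreement = describe(models, 3.0, observed=1)
print(prediction, std, agreement)
```

For training instances the observed class is known, so both STD and agreement are available; for new instances only the prediction and STD can be computed.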

Part #1

[Figure: accuracy vs. chemical space coverage, as a function of the AD output measure (?)]

❶ Explore the capabilities of RDN

Test hypothesis: AD = f (local density & local reliability)

❷ Characterizing mispredictions

  • Global density? Kernel Density Estimation (KDE)
  • Descriptor span? Decision tree descriptor span

METHODS

Training set: N = 659

Validation set: N = 194

Test set: N = 195

slide-8
SLIDE 8

[Figure: Accuracy in the AD vs. % data in the AD, original dk-NN vs. RDN (two panels)]

Shrink distances to 1/3 in the beginning → ½ Di → Di

[Figure: Accuracy inside the AD vs. % data in the AD, IV set and TE set]

P-gp

Part #1

RESULTS

❶ Explore the capabilities of RDN

slide-9
SLIDE 9

[Figure: Accuracy vs. STD, for the TR, TE and IV sets]

[Figure: Agreement (to observed) vs. STD]

Role of bias-precision correction

Part #1

RESULTS

P-gp

slide-10
SLIDE 10

IN OUT Descriptor span ?

❷ Diagnosing mispredictions

N = 62 (32%) Acc = 72.6% Acc = 67.7% N = 133 (68%)
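A descriptor span check of this kind can be sketched as a bounding-box test: an instance is IN when every descriptor value lies within the range seen in the training set, and OUT otherwise. The function names below are hypothetical, and which descriptors enter the check (for example, only those used by the decision tree) is left to the caller.

```python
def descriptor_span(train):
    """Per-descriptor (min, max) range observed in the training set."""
    return [(min(column), max(column)) for column in zip(*train)]

def inside_span(x, span):
    """True if every descriptor value of x lies within the training range."""
    return all(low <= value <= high for value, (low, high) in zip(x, span))

train = [(1.0, 10.0), (2.0, 30.0), (1.5, 20.0)]
span = descriptor_span(train)
print(span)
print(inside_span((1.2, 25.0), span), inside_span((1.2, 31.0), span))
```

Splitting a test set with `inside_span` and computing accuracy on each side gives the IN vs. OUT comparison shown on this slide.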

Part #1

RESULTS

P-gp

slide-11
SLIDE 11

Density in feature space ?

[Figure: Accuracy inside the AD vs. % covered data, TE set and IV set]

Kernel Density Estimation (KDE)

PC1
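As a rough illustration of a density-based check on a single principal component, the sketch below builds a one-dimensional Gaussian kernel density estimate and accepts a query when its density is at least as high as the lowest density found among the training instances. The bandwidth value, the cutoff rule and the function names are assumptions for the example; the presentation does not state how its KDE was configured.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a density function built from Gaussian kernels (1-D)."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in sample)

    return density

# Hypothetical PC1 scores of the training set.
pc1_train = [-1.0, -0.5, 0.0, 0.5, 1.0]
density = gaussian_kde(pc1_train, bandwidth=0.5)

# Assumption: the AD cutoff is the lowest density seen in the training set.
cutoff = min(density(s) for s in pc1_train)

def dense_enough(x):
    return density(x) >= cutoff

print(dense_enough(0.0), dense_enough(5.0))
```

Raising the cutoff shrinks the covered region, so sweeping it produces an accuracy vs. % covered data curve like the one summarized above.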

❷ Diagnosing mispredictions

Part #1

RESULTS

P-gp

slide-12
SLIDE 12

Part #2

METHODS

Test the RDN algorithm on ❷ benchmark datasets

Test hypothesis: AD = f (local density & local reliability)

P-gp dataset, Class = {S, NS}

  • Training set: N = 689
  • Validation set: N = 194
  • Test set: N = 195
  • Build RDN → Test RDN

Ames dataset and CYP450 dataset

  • Predictions, STD & Agreement from OChem
  • Build RDN → Test RDN
  • Ames: training N = 4358; Test #1 and Test #2: N = 1089 + 1090
  • CYP450: training N = 3743; Test #1 and Test #2: N = 1870 + 1870

slide-13
SLIDE 13

[Figure: Accuracy in AD vs. % data in AD. Panels: Ames model (Test #1, Test #2), CYP450 model (Test #1, Test #2), Pgp model (IV set, TE set)]

Test the RDN algorithm on ❷ benchmark datasets

Part #2

RESULTS

slide-14
SLIDE 14

Compare Reliability-Density Neighbourhood (RDN) vs ❸ other AD methods

Test hypothesis: AD = f (local density & local reliability)

Benchmark datasets: Ames dataset, CYP450 dataset

AD techniques: RDN vs STD, KDE, dk-NN

Part #3

METHODS

slide-15
SLIDE 15

[Figure: Accuracy vs. % coverage (implicitly, distance-to-model) for the Ames model and the CYP450 model, Test #1 and Test #2. Panels: RDN, STD, dk-NN (x-axis: # nearest neighbours), KDE]

Part #3

RESULTS
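An accuracy vs. % coverage profile of this kind can be computed for any distance-to-model measure by ranking instances from most to least reliable and accumulating accuracy as coverage grows. The sketch below assumes a lower value means higher reliability (as with STD); the function name and the toy numbers are illustrative only.

```python
def accuracy_coverage_curve(distance_to_model, correct):
    """Return (% coverage, cumulative accuracy) pairs, most reliable first."""
    order = sorted(range(len(distance_to_model)),
                   key=lambda i: distance_to_model[i])
    curve = []
    hits = 0
    for rank, i in enumerate(order, start=1):
        hits += 1 if correct[i] else 0
        curve.append((100.0 * rank / len(order), hits / rank))
    return curve

# Toy example: ensemble STD per instance and whether each prediction was right.
std_values = [0.05, 0.40, 0.10, 0.30]
correct = [True, False, True, True]
curve = accuracy_coverage_curve(std_values, correct)
for point in curve:
    print(point)
```

A useful AD measure gives a curve that starts high and declines as less reliable instances are admitted; a flat curve suggests the measure does not separate reliable from unreliable predictions.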

slide-16
SLIDE 16

CONCLUSIONS

  • Local density corrected for local reliability (precision + bias) is able to successfully sort new instances according to their predictive performance, by defining a map that identifies regions according to their probability of containing mispredictions.
  • The RDN method performs robustly on new, unseen data (two external datasets show similar profiles of accuracy across chemical space).
  • For the optimal establishment of the AD of a QSAR model: RDN + STD → case-by-case selection of the best candidate.

slide-17
SLIDE 17

Improved Applicability Domain Determination of QSAR Models Using Local Mapping: RDN

Natalia Aniceto, Alex Freitas, Andreas Bender, Taravat Ghafourian

PhD student ● University of Kent

THANK YOU FOR YOUR ATTENTION!