RECOGNITION OF PROTEIN FUNCTION - PowerPoint PPT Presentation



slide-1
SLIDE 1

RECOGNITION OF PROTEIN FUNCTION USING THE LOCAL SIMILARITY

Kirill E. Alexandrov, Dmitry A. Filimonov, Boris N. Sobolev, Vladimir V. Poroikov

Institute of Biomedical Chemistry of Russian Academy of Medical Sciences, Moscow, Russia

slide-2
SLIDE 2

Agenda

1. History of Problem
2. Sequence Local Similarity
3. Algorithm of Similarity Calculation
4. Local Similarity Approach Paradigm
5. Algorithm of Protein Function Recognition
6. Prediction Accuracy Estimation
7. Results of Local Similarity Approach Evaluation
8. Acknowledgements

slide-3
SLIDE 3

Property = Function ( Structure )

Continuity hypothesis: the smaller the difference between structures, the smaller the difference between their properties.

The central dogma of SAR/QSAR/QSPR:

$y_{pred} = x_0 + \sum_i x_i F_i(S)$

$F_i(S)$ = LogP, ..., (LogP)^2, ... : traditional QSAR
$F_i(S) = Sim(S, S_i)$ : similarity-based QSAR

MLR: multiple linear regression
PLS: projections to latent structures
ANN: artificial neural network
SVM: support vector machine
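As a rough illustration of the similarity-based form of the equation above, the sketch below evaluates y_pred = x0 + sum_i x_i * Sim(S, S_i) for descriptor sets. The slide does not say which similarity measure is used, so the Tanimoto coefficient here is an assumption, and the names `tanimoto` and `predict` are hypothetical. Fitting the coefficients (by MLR, PLS, and so on) is not shown.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two descriptor sets (an assumed choice of Sim)."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def predict(structure, references, x0, coefficients, sim=tanimoto):
    """Evaluate y_pred = x0 + sum_i x_i * Sim(S, S_i).

    structure    -- descriptor set of the compound S under study
    references   -- descriptor sets of the reference compounds S_i
    coefficients -- previously fitted weights x_i, one per reference
    """
    return x0 + sum(x * sim(structure, ref)
                    for x, ref in zip(coefficients, references))
```

With coefficients already in hand, `predict({"a", "b"}, [{"a", "b"}, {"c"}], 1.0, [2.0, 3.0])` adds the full weight of the identical reference and nothing from the disjoint one.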

slide-4
SLIDE 4

The local similarity principle

QSAR with CoMFA

Tripos' patented Comparative Molecular Field Analysis (CoMFA) has been used as the method of choice in hundreds of published QSAR studies.

slide-5
SLIDE 5

Neighborhoods of atoms descriptors

MOLECULAR BIOLOGY, QUANTUM CHEMISTRY, QUANTUM FIELD THEORY:

M = V + VgM = V + VgV + VgVgV + VgVgVg + …
Mi = Vi + VigM = Vi + Vig(M1 + M2 + … + Mm)

All descriptors describe each atom of a molecule together with its neighborhood:

  • MNA: multilevel neighborhoods of atoms
  • RMNA: reaction multilevel neighborhoods of atoms
  • QNA: quantitative neighborhoods of atoms
  • FNA: fuzzy neighborhoods of atoms

.., .. (2006) , L, (2), 66-75.

slide-6
SLIDE 6

Multilevel neighborhoods of atoms descriptors (MNA)

MNA/0: C
MNA/1: C(CN-H)
MNA/2: C(C(CC-H)N(CC)-H(C))

[Figure: structural formulas of the example molecule]

.., .. (2006) , L, (2), 66-75.

slide-7
SLIDE 7

Multilevel neighborhoods of atoms descriptors (MNA)

MNA/2:

C(C(CC-H)C(CC-C)-H(C))
C(C(CC-H)C(CN-H)-H(C))
C(C(CC-H)C(CN-H)-C(C-O-O))
C(C(CC-H)N(CC)-H(C))
C(C(CC-C)N(CC)-H(C))
N(C(CN-H)C(CN-H))
-H(C(CC-H))
-H(C(CN-H))
-H(-O(-H-C))
-C(C(CC-C)-O(-H-C)-O(-C))
-O(-H(-O)-C(C-O-O))
-O(-C(C-O-O))

[Figure: structural formula of the example molecule]

.., .. (2006) , L, (2), 66-75.

slide-8
SLIDE 8

Prediction of activity spectra for organic compounds

According to the Bayes formula, the probability P(A|S) that compound S has activity A is:

$P(A|S) = P(S|A) \cdot P(A) / P(S)$

If the descriptors D1, ..., Dm of an organic compound are mutually independent, then:

$P(S|A) = P(D_1, ..., D_m|A) = \prod_i P(D_i|A)$

P(A) and P(A|Di) are calculated as sums over all organic compounds of the training set.

.., .. (2006) , L, (2), 66-75.
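The Bayes formula with independent descriptors can be sketched as a small counting classifier. This is a generic naive Bayes illustration, not the estimator from the cited work: it estimates P(A) and P(Di|A) by counting over a training set, applies add-one smoothing (an assumption), and obtains P(S) by normalizing over A and its complement. The function names and the data layout are hypothetical.

```python
from collections import Counter
from math import exp, log


def train(training_set):
    """training_set: list of (descriptor_set, is_active) pairs."""
    n_total = len(training_set)
    n_active = sum(1 for _, active in training_set if active)
    counts = {True: Counter(), False: Counter()}
    for descriptors, active in training_set:
        counts[active].update(set(descriptors))
    return n_total, n_active, counts


def posterior(descriptors, model):
    """Estimate P(A|S) for a compound given by its descriptor set.

    Both classes must be represented in the training set.
    """
    n_total, n_active, counts = model
    n_class = {True: n_active, False: n_total - n_active}
    log_score = {}
    for cls in (True, False):
        score = log(n_class[cls] / n_total)  # log P(class)
        for d in set(descriptors):
            # log P(D_i | class) with add-one smoothing (an assumption)
            score += log((counts[cls][d] + 1) / (n_class[cls] + 2))
        log_score[cls] = score
    # Normalizing over both classes plays the role of dividing by P(S).
    top = max(log_score.values())
    weight = {cls: exp(s - top) for cls, s in log_score.items()}
    return weight[True] / (weight[True] + weight[False])
```

Working in log space keeps the product over many descriptors from underflowing.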

slide-9
SLIDE 9
slide-10
SLIDE 10

Quantitative neighborhoods of atoms descriptors (QNA)

$Q_i = a_i \sum_k [g(C)]_{ik} b_k$

where $a_i$ and $b_k$ are parameters of atoms i and k, and g(C) is a function of the connectivity matrix C.

$P_i = B_i \sum_k (\mathrm{Exp}(-C))_{ik} B_k$

$Q_i = B_i \sum_k (\mathrm{Exp}(-C))_{ik} B_k A_k$

A = (IP + EA), B = IP - EA, where IP is the first ionization potential and EA is the electron affinity.

Feynman R. Ph. Phys. Rev., 1939, 56, 340-343.
Robert G. Parr et al. J. Chem. Phys., 1978, 68(8), 3801-3807.
Gasteiger J., Marsili M. Tetrahedron, 1980, 36, 3219-3228.
Rappe A. K., Goddard W. A. III. J. Phys. Chem., 1991, 95, 3358-3363.

slide-11
SLIDE 11

Quantitative neighborhoods of atoms descriptors (QNA)

ChemNavigator database in QNA space: 976,545,026 QNA descriptors of 24,621,668 molecules

[Figure panels: Initial QNA Space; Normalized QNA Space]

slide-12
SLIDE 12

Quantitative neighborhoods of atoms descriptors (QNA)

[Figure panels: Nicotinic Acid; Aspirin; Sulfathiazole]

slide-13
SLIDE 13

GUSAR: QNA-based prediction of quantitative properties of organic compounds
slide-14
SLIDE 14

GUSAR: QNA-based prediction of quantitative properties of organic compounds

  • Vibrio fischeri
  • Chlorella vulgaris
  • Tetrahymena pyriformis
  • CDK2 inhibitors
  • DHFR inhibitors
  • ACE inhibitors

slide-15
SLIDE 15

GUSAR: QNA-based prediction of quantitative properties of organic compounds

[Figure: bar chart of delta R2 test, delta Q2 and delta R2 (axis from -0.10 to 0.20) for 2D Cerius2, 3D Cerius2, CoMSIA, EVA, CoMFA, HQSAR, GFA, MLR and PLS]

slide-16
SLIDE 16
OK. But how can local similarity be used for recognition of protein function?

slide-17
SLIDE 17

Pairwise sequence alignment

1996, Autumn

For a long time, homology-derived annotation based on pairwise sequence alignment was the general way to predict protein function.

slide-18
SLIDE 18

Sequence Local Similarity. Frame 20, shift from -8 to +8

[Figure labels: Query sequence; The best match]

AANRDPSQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVA 2
ANRDPSQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVAL 1
NRDPSQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALR 1
RDPSQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRA 0
DPSQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRAL 1
PSQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALF 2
SQFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFG 1
QFPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGR 1
FPDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRF 2
PDPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFP 0
DPHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPA 1
PHRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPAL 0
HRFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPALS 9
RFDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPALSL 0
FDVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPALSLG 3
DVTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPALSLGI 1
VTRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPALSLGID 1
TRDTRGHLSFGQGIHFCMGRPLAKLEGEVALRALFGRFPALSLGIDA 2
GTAINKPLSEKMMLFGMGKRRCIGEVLAKWEIFLFLAILLQQLEFSV 9

Ri = 9

slide-19
SLIDE 19

Sequence Local Similarity. Algorithm of Similarity Calculation

i is the position number in the query sequence A;
a and b are amino acid residues in sequence A and sequence B;
m is the current shift between sequence A and sequence B;
F is the frame size;
Ri is the primary similarity value;
Si is the local similarity value for position i in the query sequence A with sequence B.

About 1000 sequences per second.
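The formula itself is not preserved in the slide text, so the following is only a sketch of one plausible reading of the definitions and of the frame/shift example: the primary similarity Ri is taken to be the largest number of identical residues between a frame of length F in sequence A and the correspondingly shifted frame in sequence B, over all shifts m in the allowed range. That reading, the frame starting (rather than being centered) at position i, and the function name are all assumptions; how Si is derived from Ri is not shown.

```python
def primary_similarity(seq_a, seq_b, i, frame=20, max_shift=8):
    """Assumed primary similarity R_i for position i of seq_a against seq_b.

    Counts identical residues between seq_a[i:i+frame] and
    seq_b[i+m:i+m+frame] and keeps the best count over shifts
    m = -max_shift .. +max_shift.
    """
    window_a = seq_a[i:i + frame]
    best = 0
    for m in range(-max_shift, max_shift + 1):
        start_b = i + m
        if start_b < 0:
            continue  # shifted frame would start before sequence B begins
        window_b = seq_b[start_b:start_b + frame]
        # zip stops at the shorter window, so frames near the end are truncated
        matches = sum(1 for a, b in zip(window_a, window_b) if a == b)
        best = max(best, matches)
    return best
```

Evaluating this for every position i of the query against one annotated sequence gives a per-position profile rather than a single alignment score.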

slide-20
SLIDE 20

13.11.1996

Sequence Local Similarity.

slide-21
SLIDE 21

“Is there a correspondence between the similarity of substrates and of protein sequences in the cytochrome P450 superfamily?”

[Figure: two plots of the proportion of homologs (0.1 to 1) versus the number of clusters (25 to 150); legend: real data, average random data, confidence interval; panel labels: CYP4, CYP7]

The results of substrate-based clustering correspond to homology-based classification for families CYP 1, 2, 3, 4, 5, 6, 7, 11.

For other P450 families (CYP 8, 17, 19, 21, 24, 26, 27), substrate-based clustering contradicts the traditional classification.

Borodina Yu.V., Lisitsa A.V., Poroikov V.V., Filimonov D.A., Sobolev B.N., Archakov A.A. Nova Acta Leopoldina, 2003, 87(329), 47-55.

slide-22
SLIDE 22

“Quantifying the Relationships among Drug Classes”

A subset of the MDDR database containing 65 367 compounds organized in 249 sets that associate with a specific biological target.

“By multiple criteria, bioinformatics and chemoinformatics networks differed substantially, and only occasionally did a high sequence similarity correspond to a high ligand-set similarity.”

Hert, J., Keiser, M. J., Irwin, J. J., Oprea, T. I., Shoichet, B. K. “Quantifying the Relationships among Drug Classes”. J. Chem. Inf. Model., 2008, 48(4), 755-765.
slide-23
SLIDE 23

[Diagram labels: Ab initio principles; Learning by example; Unique law of nature; Partial estimate; Fundamental theory; Machine Learning; Molecular Modelling; Homology]

slide-24
SLIDE 24

Protein function recognition based on learning by example

The method is based on a data set of sequences with known properties. This data set must be subdivided into “positive” and “negative” examples: group A and its complement ¬A.

[Diagram: groups A, ¬A, B, C]

slide-25
SLIDE 25

Is a universal similarity reasonable?

slide-26
SLIDE 26

Sequence Local Similarity. It is the descriptor itself!

The descriptor is defined as the similarity value Sik for position i of the sequence under study and the experimentally annotated sequence k.

slide-27
SLIDE 27

Sequence Local Similarity. Algorithm of Classification

i = 1,…,n is the position number in the sequence under study;
k = 1,…,N is the number of the experimentally annotated sequence;
wk(A), wk(¬A) are the weights of the experimentally annotated sequence k in class A and in its complement ¬A;
Sik is the similarity for position i of the sequence under study and the experimentally annotated sequence k.

Whether the sequence under study belongs to class A is estimated using the statistical function B(A).

slide-28
SLIDE 28

General Classification Problem

[Figure: observed value versus calculated value, with a threshold separating Negative from Positive; regions TP, TN, FP, FN]

slide-29
SLIDE 29

General Classification Problem. Criteria of classification accuracy

Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)
Accuracy (Concordance) = (TP+TN)/N
Predictive value positive = TP/(TP+FP)
Predictive value negative = TN/(TN+FN)
False Negative Rate = FN/(TP+FN) = Error1
False Positive Rate = FP/(TN+FP) = Error2
Positive Likelihood = SENS/(1-SPEC)
Negative Likelihood = (1-SENS)/SPEC
...

http://www.intmed.mcw.edu/clincalc/bayes.html
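The criteria listed on this slide follow directly from the four confusion-matrix counts. The helper below is a minimal sketch; the function name and the dictionary keys are illustrative, and it assumes that both observed classes and both predicted classes are non-empty, since otherwise some ratios are undefined.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the slide's accuracy criteria from confusion-matrix counts."""
    n = tp + tn + fp + fn
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "accuracy": (tp + tn) / n,
        "predictive_value_positive": tp / (tp + fp),
        "predictive_value_negative": tn / (tn + fn),
        "false_negative_rate": fn / (tp + fn),  # Error1
        "false_positive_rate": fp / (tn + fp),  # Error2
        # Likelihood ratios are unbounded when the denominator is zero.
        "positive_likelihood": sens / (1 - spec) if spec < 1 else float("inf"),
        "negative_likelihood": (1 - sens) / spec if spec > 0 else float("inf"),
    }
```

All of these values depend on where the threshold is placed, which is the motivation for a threshold-independent criterion.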

slide-30
SLIDE 30

General Classification Problem. Independent Accuracy of Prediction (IAP)

IAP is calculated using the Leave-One-Out Cross-Validation procedure.

Bi is the estimate for sequence i from class A;
Bj is the estimate for sequence j from its complement ¬A;
θ(x) = 1 if x > 0, θ(x) = 1/2 if x = 0, θ(x) = 0 if x < 0;
NA is the number of sequences in class A;
N¬A is the number of sequences in its complement ¬A.

Poroikov V. et al. (2000) J. Chem. Inf. Comput. Sci., 40, 1349-1355.
P. A. Flach, N. Lachiche, Machine Learning, 2004, 57, 233-269.
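The IAP formula is not preserved in the slide text. From the listed definitions, a natural reading is the fraction of (i, j) pairs in which a sequence from class A receives a higher estimate than a sequence from ¬A, with ties counted as one half; this is the usual pairwise form of the area under the ROC curve. The sketch below assumes that reading, assumes the step function equals 1/2 at x = 0, and takes the leave-one-out estimates as already computed. The function names are hypothetical.

```python
def step(x):
    """Three-valued step function: 1 if x > 0, 1/2 if x == 0, 0 if x < 0."""
    if x > 0:
        return 1.0
    if x == 0:
        return 0.5
    return 0.0


def iap(estimates_a, estimates_not_a):
    """Assumed IAP: mean of step(B_i - B_j) over all pairs.

    estimates_a     -- leave-one-out estimates B_i for sequences in class A
    estimates_not_a -- leave-one-out estimates B_j for sequences in the complement
    Both lists must be non-empty.
    """
    total = sum(step(bi - bj) for bi in estimates_a for bj in estimates_not_a)
    return total / (len(estimates_a) * len(estimates_not_a))
```

Under this reading, a value of 1 means every member of class A is ranked above every non-member, and 0.5 corresponds to random ranking.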
slide-31
SLIDE 31

How can sequence local similarity be used?

slide-32
SLIDE 32

Training sets used for the method evaluation

  • Serine proteases: EC 3.4.21, 28 groups of the 4th EC level, 623 sequences.
  • Gold standard, specially composed to test statistical learning methods: 5 enzyme superfamilies and 56 families, 832 sequences.
  • P450 superfamily (CPD database): 242 proteins classified by substrate specificity (579 compounds); 163 proteins classified by inhibitor specificity (272 compounds).

slide-33
SLIDE 33

Serine proteases

The average accuracy reached its maximum (close to one) at a maximal shift of 50 and a frame of 50. At these parameter values, 24 of 28 classes were recognized with 100% accuracy.

slide-34
SLIDE 34

Gold Standard

The average IAP exceeded 0.99. 4 superfamilies were recognized with 100% accuracy. 45 families were recognized with IAP = 1 and 11 families were recognized with IAP > 0.96.

The superfamilies seem to be clearly recognized by alignment-based methods; however, families within the same superfamily are recognized less well by the analysis of aligned sequences with phylogenomics methods.

[Figure panels: Superfamilies; Families]

slide-35
SLIDE 35

CYP450 classification. Frame 20, Band 100.

[Figure: two plots of IAP (0.1 to 1) versus group size (5 to 40); panels: Substrate specificity; Inducer specificity]

slide-36
SLIDE 36

Prediction of activity spectra for organic compounds

.., .. (2006) , L, (2), 66-75.

slide-37
SLIDE 37

GUSAR: QNA-based prediction of quantitative properties of organic compounds
slide-38
SLIDE 38
slide-39
SLIDE 39

Conclusions

  • Our approach showed high efficiency of function prediction with different sequence description types.
  • High prediction accuracy was obtained at different levels of protein functional classification.
  • The projection method is useful both for functional specificity prediction and for sequence mapping, i.e. for revealing the local determinants of functional specificity.
  • The approach “RECOGNITION OF PROTEIN FUNCTION USING THE LOCAL SIMILARITY” will be published in Journal of Bioinformatics and Computational Biology, 2008.

slide-40
SLIDE 40

Acknowledgements

This work was supported by the Russian Foundation for Basic Research (grant N 04-04-49390-). We are grateful to A.V. Lisitsa for providing the data on the substrate and inducer specificity of cytochromes P450.
