
Annotation Error in Public Databases


  1. Annotation Error in Public Databases. Alexandra Schnoes, University of California, San Francisco. October 25, 2010.

  2. New genomes (and metagenomes) sequenced every day...

  12. Computational Function Prediction Needed [Figure: characterized sequences vs. total sequences]

  13. What about the error that results from large-scale function prediction?

  14. Our focus: commonly used protein sequence databases
      • How prevalent is misannotation in common sequence databases?
      • What can we learn about these annotation errors and annotation in general?

  15. What is 'function'? Many possible definitions, for example phenotype or enzymatic reaction.

  17. Why use enzymes?
      • Concrete definition of function: substrate, product, chemical conversion
      • Function can be mapped to specific residues

  20. Functionally Diverse Enzyme Superfamilies
      • Conserved mechanistic step; multi-functional; low % sequence ID
      • Mono-functional; family-specific residues; % ID within families > % ID between families

  22. What is needed for the misannotation analysis? Gold Standard Sequence Set Requirements
      • Organized hierarchy & data: superfamily definitions, family definitions, sequences, sequence alignments, statistical models
      • Functions are experimentally characterized
      • Understand functional mechanism: structure, active site, functionally important residues
      • Large set
      Set shown on the slide: 6 superfamilies, 5 structural folds, 37 families, 5/6 E.C. categories (Genome Biol. 2006;7(1):R8).

  23. Gold Standard Sequence Set (sfld.rbvi.ucsf.edu)
      • Hierarchically organized
      • Hand-curated sequence alignments
      • Sequence models (HMMs)
      • Evidence codes
      • Functionally important residues

  24. Data Source: Commonly Used Sequence Databases
      • NCBI: automated, large
      • TrEMBL: automated, large
      • Swiss-Prot: curated
      • KEGG: automated, small

  25. Analysis Question. Given: a protein sequence annotated to a specific enzyme function. Is that annotation correct?

  26. General Process

  31. [Figure: non-family members vs. family members, with the LC, TC, and NC cutoffs marked]

  34. Variable percent misannotation. Manually curated Swiss-Prot is most accurate.

  35. Misannotation Problem is Getting Worse [Figure: sequences deposited by year and the fraction predicted to be misannotated (NR DB). Axes: year (1993-2005), number of sequences, fraction predicted misannotated. Series: incorrect annotations, correct annotations.]
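The per-year quantity plotted on this slide can be sketched as follows. This is only an illustration: the `(year, is_misannotated)` record format, the function name, and the toy data are assumptions, not the pipeline or the values used in the talk.

```python
from collections import defaultdict

def misannotation_by_year(records):
    """Summarize misannotation per deposition year.

    records: iterable of (year, is_misannotated) pairs (hypothetical format).
    Returns {year: (total_sequences, fraction_misannotated)}, ordered by year.
    """
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for year, is_misannotated in records:
        totals[year] += 1
        if is_misannotated:
            wrong[year] += 1
    return {year: (totals[year], wrong[year] / totals[year])
            for year in sorted(totals)}

# Toy data for illustration only; these are not values from the talk.
toy_records = [(1995, False), (1995, False),
               (2005, True), (2005, False), (2005, True), (2005, True)]
summary = misannotation_by_year(toy_records)
for year, (total, fraction) in summary.items():
    print(year, total, f"{fraction:.2f}")
```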

  36. What are the characteristics of these misannotations?

  38. Sensitivity to threshold change [Figure: non-family members vs. family members, with cutoffs marked]
      • TC: Trusted Cutoff
      • NC: Noise Cutoff
      • LC: Lenient Cutoff
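One way to picture threshold sensitivity is to recount how many annotated sequences fall below each cutoff. The sketch below assumes each sequence has a single score against its family model and that scoring below a cutoff flags it. The slides name the cutoffs but give no values, so the scores, cutoff values, and function names here are all made up for illustration.

```python
def flagged_fraction(scores, cutoff):
    """Fraction of annotated sequences whose model score falls below `cutoff`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(1 for s in scores if s < cutoff) / len(scores)

def threshold_sensitivity(scores, cutoffs):
    """Evaluate the flagged fraction at each named cutoff."""
    return {name: flagged_fraction(scores, value)
            for name, value in cutoffs.items()}

# Hypothetical scores and cutoff values, chosen only to show the effect.
toy_scores = [12.0, 48.5, 95.0, 130.2, 210.7, 305.1, 18.3, 77.7]
toy_cutoffs = {"TC": 100.0, "NC": 60.0, "LC": 30.0}
result = threshold_sensitivity(toy_scores, toy_cutoffs)
print(result)
```

With these toy numbers, loosening the cutoff from TC to LC lowers the flagged fraction, which is the kind of shift the slide asks the audience to consider.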

  45. Types of Misannotation
      • NSA (9%): No Superfamily Association
      • SFA (31%): Superfamily Association Only
      • MFR (6%): Missing Functionally Important Residues
      • BTC (54%): Below Trusted Cutoff
      Legend: misannotations due to overprediction; misannotations not due to overprediction.
      Biggest problem: predicting function without sufficient evidence.
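The four categories can be read as a decision cascade. The sketch below is one possible ordering, not the procedure used in the study: the `Evidence` fields, the order of the checks, and the reading of SFA as "no family-level match" are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Evidence:
    """Hypothetical evidence gathered for one annotated sequence."""
    in_superfamily: bool             # any superfamily-level match?
    family_score: Optional[float]    # score vs. annotated family model; None = no match
    has_functional_residues: bool    # functionally important residues present?

def classify(e: Evidence, trusted_cutoff: float) -> str:
    """Assign a category label; the order of the checks is an assumption."""
    if not e.in_superfamily:
        return "NSA"   # No Superfamily Association
    if e.family_score is None:
        return "SFA"   # Superfamily Association Only
    if not e.has_functional_residues:
        return "MFR"   # Missing Functionally Important Residues
    if e.family_score < trusted_cutoff:
        return "BTC"   # Below Trusted Cutoff
    return "OK"        # no evidence of misannotation

# Toy inputs for illustration only.
toy = [
    Evidence(False, None, False),
    Evidence(True, None, True),
    Evidence(True, 150.0, False),
    Evidence(True, 40.0, True),
    Evidence(True, 220.0, True),
]
labels = [classify(e, trusted_cutoff=100.0) for e in toy]
tally = Counter(labels)
print(tally)
```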

  46. Error Propagation
      >gi|13786715|pdb|1HZY|A Chain A, High Resolution Structure Of The Zinc-Containing Phosphotriesterase From Pseudomonas Diminuta
      GDRINTVRGPITISEAGFTLTHEHICGSSAGFLRAWPEFFGSRKALAEKAVRGLRRARAAGVRTIVDVST
      FDIGRDVSLLAEVSRAADVHIVAATGLWFDPPLSMRLRSVEELTQFFLREIQYGIEDTGIRAGIIKVATT
      GKATPFQELVLKAAARASLATGVPVTTHTAASQRDGEQQAAIFESEGLSPSRVCIGHSDDTDDLSYLTAL
      AARGYLIGLDHIPHSAIGLEDNASASALLGIRSWQTRALLIKALIDQGYMKQILVSNDWLFGFSSYVTNI
      MDVMDRVNPDGMAFIPLRVIPFLREKGVPQETLAGITVTNPARFLSPTLRAS
      >gi|1176259|sp|P45548|PHP_ECOLI Phosphotriesterase homology protein
      MSFDPTGYTLAHEHLHIDLSGFKNNVDCRLDQYAFICQEMNDLMTRGVRNVIEMTNRYMGRNAQFMLDVM
      RETGINVVACTGYYQDAFFPEHVATRSVQELAQEMVDEIEQGIDGTELKAGIIAEIGTSEGKITPLEEKV
      FIAAALAHNQTGRPISTHTSFSTMGLEQLALLQAHGVDLSRVTVGHCDLKDNLDNILKMIDLGAYVQFDT
      IGKNSYYPDEKRIAMLHALRDRGLLNRVMLSMDITRRSHLKANGGYGYDYLLTTFIPQLRQSGFSQADVD
      VMLRENPSQFFQ
      [Diagram labels: "Dipeptide Epimerase", "Unknown Function", "INCORRECT!"]

  50. BLAST sequence similarity network
      • E-value 1 × 10^-30 or lower
      • Distance between nodes reflects level of sequence similarity
      • Legend: sequence similarity; correct annotation; incorrect annotation
      Misannotations
      • Cluster with each other
      • Indication of error propagation
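The clustering observation can be illustrated with a small sketch that links two sequences whenever their pairwise E-value is at or below the 1 × 10^-30 cutoff quoted on the slide, then reports how many members of each connected group are misannotated. The `(seq_a, seq_b, e_value)` hit format, the function names, and the toy data are assumptions; real input would come from parsed BLAST output, and the slide's distance-based layout is not reproduced here.

```python
from collections import deque

E_VALUE_CUTOFF = 1e-30  # threshold quoted on the slide

def build_network(hits, cutoff=E_VALUE_CUTOFF):
    """Build an undirected graph from (seq_a, seq_b, e_value) tuples (hypothetical format)."""
    graph = {}
    for a, b, e_value in hits:
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if a != b and e_value <= cutoff:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def connected_components(graph):
    """Return components as sorted lists, found by breadth-first search."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        queue, members = deque([start]), []
        seen.add(start)
        while queue:
            node = queue.popleft()
            members.append(node)
            for neighbor in sorted(graph[node]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(members))
    return components

def incorrect_fraction(component, incorrect):
    """Fraction of a component's members flagged as misannotated."""
    return sum(1 for n in component if n in incorrect) / len(component)

# Toy hits and labels for illustration only.
toy_hits = [("s1", "s2", 1e-50), ("s2", "s3", 1e-40),
            ("s3", "s4", 1e-10), ("s4", "s5", 1e-35)]
toy_incorrect = {"s4", "s5"}
components = connected_components(build_network(toy_hits))
fractions = [incorrect_fraction(c, toy_incorrect) for c in components]
print(components, fractions)
```

A component made up mostly of misannotated sequences is the pattern the slide describes as an indication of error propagation.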

  51. In Conclusion...
      • Misannotation is a serious problem
      • Automated databases
      • Across multiple folds, functions and superfamilies
      • Hard to predict misannotation a priori
      • Manual curation delivers the highest quality
      • Misannotation problem is getting worse
      • Overprediction is a common problem
      • Error propagation appears to be a common source of misannotation

  52. Acknowledgements
      • Patricia Babbitt & lab; Shoshana Brown
      • Igor Dodevski, University of Zürich
      • Tanja Kortemme & lab; Colin Smith
      • Jim Wells lab; Emily Crawford
      • Funding: Howard Hughes Pre-Doctoral Fellowship; NIH & NSF
      • PLoS Comput Biol. 2009 Dec;5(12):e1000605.
      • Wiki Commons & Science Magazine for some images
