Annotation Error in Public Databases
ALEXANDRA SCHNOES UNIVERSITY OF CALIFORNIA, SAN FRANCISCO OCTOBER 25, 2010
New genomes (and metagenomes) sequenced every day...
Computational Function Prediction Needed
[Figure: Characterized Sequences vs. Total Sequences]
Many Possible Definitions
Phenotype
Enzymatic Reaction
Multifunctional: conserved mechanistic step; low % sequence ID
Monofunctional: family-specific residues; % ID within families > % ID between families
Gold Standard Sequence Set
Requirements
characterized
- 6 Superfamilies
- 5 Structural folds
- 37 Families
- 5/6 E.C. categories
Genome Biol. 2006;7(1):R8.
- Functionally Important Residues
- Evidence Codes
- Hand-Curated Sequence Alignments
- Sequence Models (HMMs)
- Hierarchically Organized
Automated Large
Curated Small
[Figure: Family Members and Non-Family Members, with the TC, LC, and NC cutoffs marked]
Variable percent misannotation
Manually curated Swiss-Prot is most accurate
[Figure: Sequences Deposited by Year and the Fraction Predicted to be Misannotated (NR DB). Axes: Year (1993 to 2005), Number of Sequences, Fraction Predicted Misannotated. Legend: Correct Annotations, Incorrect Annotations]
[Figure: Family Members and Non-Family Members, with cutoffs TC, LC, and NC]
TC: Trusted Cutoff
LC: Lenient Cutoff
NC: Noise Cutoff
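The three cutoffs can be read as thresholds on the score a sequence gets against a family model. The following is a minimal sketch under stated assumptions, not the method used in the talk: it assumes higher scores mean a better match, that TC >= LC >= NC, and the region names and the numbers in the usage lines are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cutoffs:
    """Score thresholds for one family model (values are hypothetical)."""
    trusted: float  # TC, Trusted Cutoff
    lenient: float  # LC, Lenient Cutoff
    noise: float    # NC, Noise Cutoff

    def __post_init__(self):
        # Assumption of this sketch: the cutoffs are ordered TC >= LC >= NC.
        if not (self.trusted >= self.lenient >= self.noise):
            raise ValueError("expected TC >= LC >= NC")


def score_region(score: float, cutoffs: Cutoffs) -> str:
    """Return the band (illustrative labels) that a model score falls into."""
    if score >= cutoffs.trusted:
        return "at_or_above_TC"
    if score >= cutoffs.lenient:
        return "between_LC_and_TC"
    if score >= cutoffs.noise:
        return "between_NC_and_LC"
    return "below_NC"


# Usage with made-up thresholds and scores.
example = Cutoffs(trusted=100.0, lenient=60.0, noise=20.0)
for s in (120.0, 75.0, 30.0, 5.0):
    print(s, score_region(s, example))
```

Keeping the ordering check in the constructor means a mis-entered threshold fails loudly instead of silently shifting sequences between bands.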
NSA: No Superfamily Association
SFA: Superfamily Association Only
MFR: Missing Functionally Important Residues
BTC: Below Trusted Cutoff
Types of Misannotation
NSA (9%)
SFA (31%)
MFR (6%)
BTC (54%)
[Figure legend: Misannotations due to overprediction; Misannotations not due to overprediction]
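The four categories can be pictured as a sequence of checks applied to a sequence that carries a family-level annotation. The sketch below is only an illustration: the order of the checks, the function name, and the boolean inputs are assumptions, not the procedure described in the talk.

```python
from typing import Optional


def misannotation_type(
    superfamily_evidence: bool,
    family_evidence: bool,
    has_functional_residues: bool,
    family_score: float,
    trusted_cutoff: float,
) -> Optional[str]:
    """Return NSA, SFA, MFR, or BTC, or None if no problem is flagged.

    All inputs are hypothetical summaries of the evidence for one sequence;
    the check order is an assumption made for this sketch.
    """
    if not superfamily_evidence:
        return "NSA"  # No Superfamily Association
    if not family_evidence:
        return "SFA"  # Superfamily Association Only
    if not has_functional_residues:
        return "MFR"  # Missing Functionally Important Residues
    if family_score < trusted_cutoff:
        return "BTC"  # Below Trusted Cutoff
    return None


# Usage with made-up evidence for a single annotated sequence.
print(misannotation_type(True, True, True, family_score=42.0, trusted_cutoff=100.0))
```

Returning `None` for the unflagged case keeps "no evidence of misannotation" distinct from the four named categories.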
Predicting function without sufficient evidence
>gi|13786715|pdb|1HZY|A Chain A, High Resolution Structure Of The Zinc-Containing Phosphotriesterase From Pseudomonas Diminuta
GDRINTVRGPITISEAGFTLTHEHICGSSAGFLRAWPEFFGSRKALAEKAVRGLRRARAAGVRTIVDVST
FDIGRDVSLLAEVSRAADVHIVAATGLWFDPPLSMRLRSVEELTQFFLREIQYGIEDTGIRAGIIKVATT
GKATPFQELVLKAAARASLATGVPVTTHTAASQRDGEQQAAIFESEGLSPSRVCIGHSDDTDDLSYLTAL
AARGYLIGLDHIPHSAIGLEDNASASALLGIRSWQTRALLIKALIDQGYMKQILVSNDWLFGFSSYVTNI
MDVMDRVNPDGMAFIPLRVIPFLREKGVPQETLAGITVTNPARFLSPTLRAS
>gi|1176259|sp|P45548|PHP_ECOLI Phosphotriesterase homology protein
MSFDPTGYTLAHEHLHIDLSGFKNNVDCRLDQYAFICQEMNDLMTRGVRNVIEMTNRYMGRNAQFMLDVM
RETGINVVACTGYYQDAFFPEHVATRSVQELAQEMVDEIEQGIDGTELKAGIIAEIGTSEGKITPLEEKV
FIAAALAHNQTGRPISTHTSFSTMGLEQLALLQAHGVDLSRVTVGHCDLKDNLDNILKMIDLGAYVQFDT
IGKNSYYPDEKRIAMLHALRDRGLLNRVMLSMDITRRSHLKANGGYGYDYLLTTFIPQLRQSGFSQADVD
VMLRENPSQFFQ
[Figure: sequences labeled Dipeptide Epimerase or Unknown Function; one annotation is marked INCORRECT!]
[Figure: BLAST sequence similarity network. Legend: Sequence similarity (reflects level of sequence similarity), Correct annotation, Incorrect annotation. Label: propagation]
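The network slides show an incorrect annotation spreading across similarity links. A toy sketch of that effect follows. It assumes that annotation transfer can be modeled as copying a label breadth-first across edges whose similarity is at or above a threshold. The function name, the threshold, the node names, and the similarity values are all hypothetical, and this is not the analysis used in the talk.

```python
from collections import deque
from typing import Dict, List, Tuple


def propagate_annotations(
    edges: Dict[Tuple[str, str], float],
    seeds: Dict[str, str],
    min_similarity: float,
) -> Dict[str, str]:
    """Copy seed labels to unlabeled neighbors across sufficiently similar edges."""
    # Build an undirected adjacency list from edges that pass the threshold.
    neighbors: Dict[str, List[str]] = {}
    for (a, b), similarity in edges.items():
        if similarity >= min_similarity:
            neighbors.setdefault(a, []).append(b)
            neighbors.setdefault(b, []).append(a)

    labels = dict(seeds)
    queue = deque(seeds)  # start from the already annotated sequences
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, []):
            if nxt not in labels:
                # The label is inherited without any new evidence.
                labels[nxt] = labels[node]
                queue.append(nxt)
    return labels


# Usage: one (possibly wrong) seed annotation on a made-up network.
toy_edges = {("A", "B"): 0.9, ("B", "C"): 0.8, ("C", "D"): 0.3}
result = propagate_annotations(toy_edges, {"A": "Dipeptide Epimerase"}, 0.5)
print(result)
```

In this toy network the seed label reaches B and C through strong links, while D stays unannotated because its only link falls below the threshold. If the seed is wrong, every inherited label is wrong too.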
Tanja Kortemme & lab
Colin Smith
Funding: Howard Hughes Pre-Doctoral Fellowship, NIH & NSF
Wiki Commons & Science Magazine for some images
Patricia Babbitt & lab
Shoshana Brown
Igor Dodevski, University of Zürich
PLoS Comput Biol. 2009 Dec;5(12):e1000605.
Jim Wells lab
Emily Crawford