SLIDE 1

Annotation Error in Public Databases

ALEXANDRA SCHNOES UNIVERSITY OF CALIFORNIA, SAN FRANCISCO OCTOBER 25, 2010

SLIDE 2

New genomes (and metagenomes) sequenced every day...

SLIDE 12

[Figure: characterized sequences compared with total sequences]

Computational Function Prediction Needed

SLIDE 13

What about the error that results from large scale function prediction?

SLIDE 14

Our focus: commonly used protein sequence databases

How prevalent is misannotation in common sequence databases? What can we learn about these annotation errors and annotation in general?

SLIDE 15

What is ‘function’?

Many Possible Definitions
Phenotype
Enzymatic Reaction

SLIDE 17

Why use enzymes?

Concrete definition of function
Substrate
Product
Chemical conversion
Function can be mapped to specific residues

SLIDE 20

Functionally Diverse Enzyme Superfamilies

Superfamily level:
  Conserved mechanistic step
  Low % sequence ID
  Multifunctional

Family level:
  Monofunctional
  Family-specific residues
  % ID within families > % ID between families

SLIDE 22

What is needed for the misannotation analysis?

Gold Standard Sequence Set

Requirements

Organized hierarchy & data
  Superfamily definitions
  Family definitions
  Sequences
  Sequence alignments
  Statistical models
Functions are experimentally characterized
Understand functional mechanism
  Structure
  Active site
  Functionally important residues
Large set

6 superfamilies
5 structural folds
37 families
5 of 6 E.C. categories

Genome Biol. 2006;7(1):R8.

SLIDE 23

Gold Standard Sequence Set
sfld.rbvi.ucsf.edu

Functionally important residues
Evidence codes
Hand-curated sequence alignments
Sequence models (HMMs)
Hierarchically organized

SLIDE 24

Data Source: Commonly Used Sequence Databases

Database      Annotation   Size
NCBI          Automated    Large
KEGG          Automated    (not given)
TrEMBL        Automated    Large
Swiss-Prot    Curated      Small

SLIDE 25

Analysis Question

Given: A protein sequence annotated to a specific enzyme function

Is that annotation correct?

SLIDE 26

General Process

SLIDE 31

[Figure: family members and non-family members, with the TC, LC, and NC cutoffs marked]

SLIDE 34

Variable percent misannotation
Manually curated Swiss-Prot is most accurate

SLIDE 35

[Figure: Sequences Deposited by Year and the Fraction Predicted to be Misannotated (NR DB). x-axis: year, 1993 to 2005. y-axes: number of sequences (correct annotations, incorrect annotations) and fraction predicted misannotated.]

Misannotation Problem is Getting Worse
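
The per-year fraction plotted on this slide can be computed from a list of (deposition year, misannotated?) pairs. The following is a minimal sketch under stated assumptions, not the analysis code behind the slide: the function name `fraction_misannotated_by_year` and the toy records are hypothetical, and the numbers are invented purely to exercise the code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def fraction_misannotated_by_year(
    records: Iterable[Tuple[int, bool]]
) -> Dict[int, float]:
    """Return {year: fraction of that year's sequences flagged as misannotated}."""
    totals: Dict[int, int] = defaultdict(int)
    wrong: Dict[int, int] = defaultdict(int)
    for year, is_misannotated in records:
        totals[year] += 1
        if is_misannotated:
            wrong[year] += 1
    # Sort by year so the result reads like the x-axis of the plot.
    return {year: wrong[year] / totals[year] for year in sorted(totals)}


# Toy records (year deposited, predicted misannotated?). Invented values,
# NOT data from the study.
records = [
    (1995, False), (1995, False),
    (2000, True), (2000, False),
    (2005, True), (2005, True), (2005, False), (2005, True),
]

fractions = fraction_misannotated_by_year(records)
for year, fraction in fractions.items():
    print(year, fraction)
```

A rising sequence of values across years would correspond to the "getting worse" trend described on the slide.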

SLIDE 36

What are the characteristics of these misannotations?

SLIDE 38

Sensitivity to threshold change

[Figure: family members and non-family members, with the TC, LC, and NC cutoffs marked]

TC: Trusted Cutoff
LC: Lenient Cutoff
NC: Noise Cutoff
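
How much the misannotation count depends on the chosen cutoff can be illustrated with a short Python sketch. It is an illustration only, not the procedure used in the study: the scores, the cutoff values, and the helper name `count_below` are hypothetical, and it assumes that higher scores mean a better match to the family model and that TC >= LC >= NC.

```python
from typing import Dict, List


def count_below(scores: List[float], cutoffs: Dict[str, float]) -> Dict[str, int]:
    """For each named cutoff, count the sequences scoring strictly below it."""
    return {
        name: sum(1 for score in scores if score < cutoff)
        for name, cutoff in cutoffs.items()
    }


# Hypothetical family-model scores for sequences annotated to one family.
scores = [12.0, 35.5, 48.0, 61.2, 75.0, 90.3]

# Hypothetical cutoff values; the ordering TC >= LC >= NC is an assumption.
cutoffs = {"TC": 70.0, "LC": 50.0, "NC": 30.0}

flagged = count_below(scores, cutoffs)
print(flagged)
```

With these made-up numbers, relaxing the threshold from TC to LC to NC flags fewer sequences, which is the kind of sensitivity the slide title refers to.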

SLIDE 45

Types of Misannotation

NSA (9%): No Superfamily Association
SFA (31%): Superfamily Association Only
MFR (6%): Missing Functionally Important Residues
BTC (54%): Below Trusted Cutoff

[Figure legend: misannotations due to overprediction; misannotations not due to overprediction]

Biggest Problem: predicting function without sufficient evidence
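
The four categories can be read as a sequence of checks applied to each annotated sequence. The sketch below is one possible way to express that in Python, under stated assumptions: the `Evidence` fields, the `classify` function, and especially the order in which the checks are applied are hypothetical choices made for illustration, not the study's actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical evidence for one annotated sequence (illustrative fields)."""
    matches_superfamily: bool  # any detectable superfamily association
    matches_family: bool       # family-level match to the annotated function
    has_key_residues: bool     # functionally important residues are present
    family_score: float        # score against the annotated family's model
    trusted_cutoff: float      # TC for that family model


def classify(e: Evidence) -> str:
    """Return a category code (NSA, SFA, MFR, BTC) or 'OK' if no check fails.

    The order of the checks is an assumption made for this sketch.
    """
    if not e.matches_superfamily:
        return "NSA"  # No Superfamily Association
    if not e.matches_family:
        return "SFA"  # Superfamily Association Only
    if not e.has_key_residues:
        return "MFR"  # Missing Functionally Important Residues
    if e.family_score < e.trusted_cutoff:
        return "BTC"  # Below Trusted Cutoff
    return "OK"


# Made-up examples, one per outcome.
examples = [
    Evidence(False, False, False, 0.0, 70.0),
    Evidence(True, False, False, 20.0, 70.0),
    Evidence(True, True, False, 80.0, 70.0),
    Evidence(True, True, True, 55.0, 70.0),
    Evidence(True, True, True, 95.0, 70.0),
]

labels = [classify(e) for e in examples]
print(labels)
```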

SLIDE 46

Error Propagation

>gi|13786715|pdb|1HZY|A Chain A, High Resolution Structure Of The Zinc-Containing Phosphotriesterase From Pseudomonas Diminuta
GDRINTVRGPITISEAGFTLTHEHICGSSAGFLRAWPEFFGSRKALAEKAVRGLRRARAAGVRTIVDVST
FDIGRDVSLLAEVSRAADVHIVAATGLWFDPPLSMRLRSVEELTQFFLREIQYGIEDTGIRAGIIKVATT
GKATPFQELVLKAAARASLATGVPVTTHTAASQRDGEQQAAIFESEGLSPSRVCIGHSDDTDDLSYLTAL
AARGYLIGLDHIPHSAIGLEDNASASALLGIRSWQTRALLIKALIDQGYMKQILVSNDWLFGFSSYVTNI
MDVMDRVNPDGMAFIPLRVIPFLREKGVPQETLAGITVTNPARFLSPTLRAS
>gi|1176259|sp|P45548|PHP_ECOLI Phosphotriesterase homology protein
MSFDPTGYTLAHEHLHIDLSGFKNNVDCRLDQYAFICQEMNDLMTRGVRNVIEMTNRYMGRNAQFMLDVM
RETGINVVACTGYYQDAFFPEHVATRSVQELAQEMVDEIEQGIDGTELKAGIIAEIGTSEGKITPLEEKV
FIAAALAHNQTGRPISTHTSFSTMGLEQLALLQAHGVDLSRVTVGHCDLKDNLDNILKMIDLGAYVQFDT
IGKNSYYPDEKRIAMLHALRDRGLLNRVMLSMDITRRSHLKANGGYGYDYLLTTFIPQLRQSGFSQADVD
VMLRENPSQFFQ

[Diagram: annotation transfer between numbered sequences (1, 2). Labels: Dipeptide Epimerase, Unknown Function. Outcome marked INCORRECT!]

SLIDE 50

Misannotations
Cluster with each other
Indication of error propagation

[Figure: BLAST sequence similarity network. Legend: sequence similarity, correct annotation, incorrect annotation]

E-value 1×10^-30 or lower
Distance between nodes reflects level of sequence similarity
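
A thresholded similarity network of this kind can be sketched with the Python standard library. This is a simplified illustration, not the tool used to draw the figure: the pairwise hits and the set of misannotated IDs are invented, the function names are hypothetical, and the sketch only keeps or drops edges at the 1×10^-30 E-value threshold rather than laying nodes out by similarity.

```python
from typing import Dict, Iterable, List, Set, Tuple

E_VALUE_THRESHOLD = 1e-30  # threshold stated on the slide


def build_network(
    hits: Iterable[Tuple[str, str, float]],
    threshold: float = E_VALUE_THRESHOLD,
) -> Dict[str, Set[str]]:
    """Build an undirected graph with an edge for each hit at or below the threshold."""
    graph: Dict[str, Set[str]] = {}
    for a, b, e_value in hits:
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if a != b and e_value <= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph: Dict[str, Set[str]]) -> List[Set[str]]:
    """Return the connected components (clusters) of the graph."""
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in graph:
        if start in seen:
            continue
        component: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Invented pairwise hits (query, subject, E-value) and invented labels.
hits = [
    ("s1", "s2", 1e-50),
    ("s2", "s3", 1e-40),
    ("s3", "s4", 1e-10),  # too weak: no edge
    ("s4", "s5", 1e-35),
]
misannotated = {"s4", "s5"}

components = connected_components(build_network(hits))
# Share of misannotated sequences in each cluster.
summary = {
    tuple(sorted(c)): len(c & misannotated) / len(c) for c in components
}
print(summary)
```

Clusters whose share is close to 1.0 are the ones where misannotations sit together, the pattern the slide reads as an indication of error propagation.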

SLIDE 51

In Conclusion...

Misannotation is a serious problem
Automated databases
Across multiple folds, functions and superfamilies
Hard to predict misannotation a priori
Manual curation delivers the highest quality
Misannotation problem is getting worse
Overprediction is a common problem
Error propagation appears to be a common source of misannotation

SLIDE 52

Acknowledgements

Patricia Babbitt & lab
Shoshana Brown
Tanja Kortemme & lab
Colin Smith
Jim Wells lab
Emily Crawford
Igor Dodevski, University of Zürich

Funding: Howard Hughes Pre-Doctoral Fellowship, NIH & NSF

Wiki Commons & Science Magazine for some images

PLoS Comput Biol. 2009 Dec;5(12):e1000605.