Annotation Error in Public Databases



slide-1
SLIDE 1

Annotation Error in Public Databases

Alexandra Schnoes
University of California, San Francisco
October 25, 2010

slide-2
SLIDE 2

New genomes (and metagenomes) sequenced every day...

slide-12
SLIDE 12

[Figure labels: Characterized Sequences; Total Sequences]

Computational Function Prediction Needed

slide-13
SLIDE 13

What about the error that results from large scale function prediction?


slide-14
SLIDE 14

Our focus: commonly used protein sequence databases

  • How prevalent is misannotation in common sequence databases?
  • What can we learn about these annotation errors and annotation in general?

slide-15
SLIDE 15

What is ‘function’?

Many possible definitions:

  • Phenotype
  • Enzymatic reaction

slide-17
SLIDE 17

Why use enzymes?

  • Concrete definition of function
  • Substrate
  • Product
  • Chemical conversion
  • Function can be mapped to specific residues

slide-20
SLIDE 20

Functionally Diverse Enzyme Superfamilies

Multifunctional:

  • Conserved mechanistic step
  • Low % sequence ID

Monofunctional:

  • Family-specific residues
  • % ID within families > % ID between families

slide-22
SLIDE 22

What is needed for the misannotation analysis?

Gold Standard Sequence Set

Requirements

  • Organized hierarchy & data
  • Superfamily definitions
  • Family definitions
  • Sequences
  • Sequence alignments
  • Statistical models
  • Functions are experimentally characterized
  • Understand functional mechanism
  • Structure
  • Active site
  • Functionally important residues
  • Large set

Gold standard set:

  • 6 superfamilies
  • 5 structural folds
  • 37 families
  • 5/6 E.C. categories

Genome Biol. 2006;7(1):R8.

slide-23
SLIDE 23

Gold Standard Sequence Set

sfld.rbvi.ucsf.edu

  • Hierarchically organized
  • Hand-curated sequence alignments
  • Sequence models (HMMs)
  • Functionally important residues
  • Evidence codes

slide-24
SLIDE 24

Data Source: Commonly Used Sequence Databases

  • NCBI: automated, large
  • KEGG: automated
  • TrEMBL: automated, large
  • Swiss-Prot: curated, small

slide-25
SLIDE 25

Analysis Question

Given: A protein sequence annotated to a specific enzyme function

Is that annotation correct?

slide-26
SLIDE 26

General Process


slide-31
SLIDE 31

[Figure labels: Family Members; Non-Family Members; TC, LC, NC]


slide-34
SLIDE 34

  • Variable percent misannotation
  • Manually curated Swiss-Prot is most accurate

slide-35
SLIDE 35

Sequences Deposited by Year and the Fraction Predicted to be Misannotated (NR DB)

[Figure: x-axis Year (1993 to 2005); y-axes Number of Sequences (Correct Annotations, Incorrect Annotations) and Fraction Predicted Misannotated]

Misannotation Problem is Getting Worse
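
The per-year fraction on this slide can be sketched as a simple tally. This is a minimal illustration, not the analysis code behind the talk: the record format (deposit year plus a misannotation flag) and the toy values are assumptions made up for the example.

```python
from collections import defaultdict


def fraction_misannotated_by_year(records):
    """Return {year: fraction of sequences flagged as misannotated}.

    `records` is an iterable of (year, is_misannotated) pairs.
    This record layout is a hypothetical one chosen for the sketch.
    """
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for year, is_misannotated in records:
        totals[year] += 1
        if is_misannotated:
            wrong[year] += 1
    # Sort by year so the result reads like the x-axis of the plot.
    return {year: wrong[year] / totals[year] for year in sorted(totals)}


# Made-up toy records, NOT data from the talk.
toy_records = [
    (1995, False), (1995, False),
    (2003, True), (2003, False),
    (2005, True), (2005, True), (2005, False), (2005, False),
]
fractions = fraction_misannotated_by_year(toy_records)
for year, fraction in fractions.items():
    print(year, fraction)
```

Plotting these fractions next to the per-year sequence counts would give the same two views the figure combines.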


slide-36
SLIDE 36

What are the characteristics of these misannotations?


slide-38
SLIDE 38

Sensitivity to threshold change

[Figure labels: Family Members; Non-Family Members; TC, LC, NC]

  • TC: Trusted Cutoff
  • LC: Lenient Cutoff
  • NC: Noise Cutoff
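
One way to picture sensitivity to the threshold is to count how many annotated sequences fall below each cutoff. The sketch below assumes that a higher score means a better match to the family model and that TC ≥ LC ≥ NC; the scores and cutoff values are invented for illustration and are not taken from the talk.

```python
def flagged_fraction(scores, cutoff):
    """Fraction of annotated sequences whose score falls below `cutoff`."""
    if not scores:
        raise ValueError("scores must not be empty")
    below = sum(1 for score in scores if score < cutoff)
    return below / len(scores)


def threshold_sensitivity(scores, cutoffs):
    """Map each cutoff name to the fraction of sequences it would flag."""
    return {name: flagged_fraction(scores, value)
            for name, value in cutoffs.items()}


# Hypothetical family-model scores and cutoffs (assumed ordering TC >= LC >= NC).
toy_scores = [310.0, 250.0, 180.0, 120.0, 40.0]
toy_cutoffs = {"TC": 200.0, "LC": 150.0, "NC": 100.0}

sensitivity = threshold_sensitivity(toy_scores, toy_cutoffs)
for name, fraction in sensitivity.items():
    print(name, fraction)
```

Under these assumptions, a stricter cutoff can only flag the same or more sequences, so comparing the three fractions shows how much the estimate depends on the threshold chosen.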

slide-45
SLIDE 45

Types of Misannotation

  • NSA (9%): No Superfamily Association
  • SFA (31%): Superfamily Association Only
  • MFR (6%): Missing Functionally Important Residues
  • BTC (54%): Below Trusted Cutoff

[Figure legend: Misannotations due to overprediction; Misannotations not due to overprediction]

Biggest Problem: predicting function without sufficient evidence
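
The four categories can be read as a decision cascade. The sketch below is one possible ordering of the checks, assumed for illustration; the `Evidence` fields and the order in which they are tested are hypothetical and are not stated on the slide.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Hypothetical evidence summary for one annotated sequence."""
    in_superfamily: bool      # any superfamily-level match at all
    matches_family: bool      # any family-level match at all
    has_key_residues: bool    # functionally important residues present
    family_score: float       # score against the family model
    trusted_cutoff: float     # trusted cutoff (TC) for that family


def misannotation_type(ev: Evidence) -> Optional[str]:
    """Return NSA, SFA, MFR, or BTC, or None if no check fails.

    The order of the checks is an assumption made for this sketch.
    """
    if not ev.in_superfamily:
        return "NSA"  # No Superfamily Association
    if not ev.matches_family:
        return "SFA"  # Superfamily Association Only
    if not ev.has_key_residues:
        return "MFR"  # Missing Functionally Important Residues
    if ev.family_score < ev.trusted_cutoff:
        return "BTC"  # Below Trusted Cutoff
    return None


# Made-up examples, one per outcome.
examples = [
    Evidence(False, False, False, 0.0, 200.0),
    Evidence(True, False, False, 0.0, 200.0),
    Evidence(True, True, False, 250.0, 200.0),
    Evidence(True, True, True, 150.0, 200.0),
    Evidence(True, True, True, 250.0, 200.0),
]
labels = [misannotation_type(ev) for ev in examples]
counts = Counter(label for label in labels if label is not None)
print(labels)
print(counts)
```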

slide-46
SLIDE 46

Error Propagation

>gi|13786715|pdb|1HZY|A Chain A, High Resolution Structure Of The Zinc-Containing Phosphotriesterase From Pseudomonas Diminuta
GDRINTVRGPITISEAGFTLTHEHICGSSAGFLRAWPEFFGSRKALAEKAVRGLRRARAAGVRTIVDVST
FDIGRDVSLLAEVSRAADVHIVAATGLWFDPPLSMRLRSVEELTQFFLREIQYGIEDTGIRAGIIKVATT
GKATPFQELVLKAAARASLATGVPVTTHTAASQRDGEQQAAIFESEGLSPSRVCIGHSDDTDDLSYLTAL
AARGYLIGLDHIPHSAIGLEDNASASALLGIRSWQTRALLIKALIDQGYMKQILVSNDWLFGFSSYVTNI
MDVMDRVNPDGMAFIPLRVIPFLREKGVPQETLAGITVTNPARFLSPTLRAS
>gi|1176259|sp|P45548|PHP_ECOLI Phosphotriesterase homology protein
MSFDPTGYTLAHEHLHIDLSGFKNNVDCRLDQYAFICQEMNDLMTRGVRNVIEMTNRYMGRNAQFMLDVM
RETGINVVACTGYYQDAFFPEHVATRSVQELAQEMVDEIEQGIDGTELKAGIIAEIGTSEGKITPLEEKV
FIAAALAHNQTGRPISTHTSFSTMGLEQLALLQAHGVDLSRVTVGHCDLKDNLDNILKMIDLGAYVQFDT
IGKNSYYPDEKRIAMLHALRDRGLLNRVMLSMDITRRSHLKANGGYGYDYLLTTFIPQLRQSGFSQADVD
VMLRENPSQFFQ

[Figure labels: Dipeptide Epimerase; Unknown Function; INCORRECT!]
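
Error propagation can be illustrated with a toy model in which each newly deposited sequence copies the annotation of its best database hit. This is a sketch under that assumption only; the sequence names, the transfer chain, and the "true" functions are hypothetical and do not come from the talk.

```python
def propagate(annotations, transfers):
    """Copy annotations along a chain of best-hit transfers.

    `annotations` maps sequence name -> annotation already in the database.
    `transfers` is an ordered list of (new_sequence, best_hit) pairs.
    """
    result = dict(annotations)
    for new_seq, best_hit in transfers:
        # The new sequence inherits whatever label its best hit carries,
        # whether or not that label was ever experimentally verified.
        result[new_seq] = result[best_hit]
    return result


# Hypothetical example: only seqA is a characterized dipeptide epimerase.
true_function = {
    "seqA": "Dipeptide Epimerase",
    "seqB": "Unknown Function",
    "seqC": "Unknown Function",
    "seqD": "Unknown Function",
}
database = {"seqA": "Dipeptide Epimerase"}
transfers = [("seqB", "seqA"), ("seqC", "seqB"), ("seqD", "seqC")]

annotated = propagate(database, transfers)
errors = sorted(name for name, label in annotated.items()
                if label != true_function[name])
print(annotated)
print(errors)
```

In this toy chain, a single unsupported transfer to seqB is enough to mislabel every sequence annotated after it.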

slide-50
SLIDE 50

BLAST sequence similarity network

  • E-value 1×10^-30 or lower
  • Distance between nodes reflects level of sequence similarity

[Figure legend: Sequence similarity; Correct annotation; Incorrect annotation]

Misannotations

  • Cluster with each other
  • Indication of error propagation
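
A thresholded similarity network like the one on this slide can be sketched with the standard library alone. The example assumes pairwise hits are available as (sequence, sequence, E-value) triples; the hit list and the set of misannotated sequences are made up, and the node layout by similarity is not reproduced here.

```python
E_VALUE_THRESHOLD = 1e-30  # edges are kept at this E-value or lower


def build_network(hits, threshold=E_VALUE_THRESHOLD):
    """Build an undirected graph {node: set(neighbours)} from pairwise hits."""
    graph = {}
    for a, b, e_value in hits:
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if a != b and e_value <= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return the clusters of the network as a list of node sets."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(component)
    return components


# Made-up pairwise hits: (sequence, sequence, E-value).
toy_hits = [
    ("s1", "s2", 1e-50),
    ("s2", "s3", 1e-40),
    ("s4", "s5", 1e-35),
    ("s3", "s4", 1e-10),  # too weak, so no edge is drawn
]
toy_misannotated = {"s4", "s5"}  # hypothetical labels

clusters = connected_components(build_network(toy_hits))
summary = [
    (sorted(c), sum(node in toy_misannotated for node in c) / len(c))
    for c in clusters
]
print(summary)
```

A cluster whose members are mostly flagged, as in the second toy cluster, is the pattern the slide reads as an indication of error propagation.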

slide-51
SLIDE 51

In Conclusion...

  • Misannotation is a serious problem
  • Automated databases
  • Across multiple folds, functions and superfamilies
  • Hard to predict misannotation a priori
  • Manual curation delivers the highest quality
  • Misannotation problem is getting worse
  • Overprediction is a common problem
  • Error propagation appears to be a common source of misannotation

slide-52
SLIDE 52

Acknowledgements

  • Patricia Babbitt & lab
  • Shoshana Brown
  • Tanja Kortemme & lab
  • Colin Smith
  • Jim Wells lab
  • Emily Crawford
  • Igor Dodevski, University of Zürich
  • Funding: Howard Hughes Pre-Doctoral Fellowship, NIH & NSF
  • Wiki Commons & Science Magazine for some images

PLoS Comput Biol. 2009 Dec;5(12):e1000605.