SLIDE 1

Genome Annotation

SLIDE 2

The steps in genome sequencing

Generate genome sequence

– Assembly
– ORF calling
– tRNA identification
– rRNA identification
– Functional annotation

SLIDE 3

Annotating Genomes

Identifying which protein performs which function

SLIDE 4

www.sigmaaldrich.com

SLIDE 5

Why annotate a genome?

Catalog what's there
Identify what's missing – but should be there!

– Things you don't know

In vitro growth

– Mycoplasma pneumoniae

Comparative genomics
Hypothesis generation

SLIDE 6

The goals of annotation

Exchange information with others
Compare annotations between organisms

SLIDE 7

How to annotate a genome?

Sequence
Assemble
Identify open reading frames

– Putative proteins

SLIDE 8

Putative protein

Open Reading Frame (ORF)

– A stretch of codons (DNA sequence) with no stop codon

Coding Sequence (CDS)

– An ORF that could encode a protein

Protein encoding gene (PEG)

– An ORF that could encode a protein

Hypothetical protein = putative protein

– Something that has not been experimentally shown

Polypeptide

– Short stretch of ~50 amino acids. Often a domain
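The ORF definition above can be sketched as a scan for stop-free stretches of codons. This is a minimal illustration, not the method of any particular ORF caller: it assumes the standard stop codons (TAA, TAG, TGA), looks only at the three forward-strand frames, and does not require a start codon. The names `find_orfs` and `min_codons` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for stop-free stretches on the forward strand.

    Coordinates are 0-based and end-exclusive. A stretch is reported only if
    it spans at least `min_codons` complete codons.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = frame
        pos = frame
        while pos + 3 <= len(dna):
            if dna[pos:pos + 3] in STOP_CODONS:
                # A stop codon closes the current stretch.
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos, frame))
                start = pos + 3
            pos += 3
        # Stretch that runs to the end of the sequence without a stop.
        if (pos - start) // 3 >= min_codons:
            orfs.append((start, pos, frame))
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("ATGAAACCCTAAGGG"):
        print(orf)
```

Turning such a stretch into a putative protein (a CDS or PEG) needs further evidence, for example a plausible start codon and a minimum length; those filters are left out here on purpose.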

SLIDE 9

PEGS

E. coli

– 4,391 genes
– 4,288 genes that make proteins (pegs)

SLIDE 10

ORF Calling

SLIDE 11

Genome Annotation

SLIDE 12

The steps in genome sequencing

Generate genome sequence

– Assembly
– ORF calling
– tRNA identification
– rRNA identification
– Functional annotation

SLIDE 13

Traditional genome annotation

SLIDE 14

Traditional genome annotation BLAST Similarities

SLIDE 15

Traditional genome annotation BLAST Similarities

SLIDE 16

Traditional genome annotation BLAST Similarities

SLIDE 17

Traditional genome annotation BLAST Similarities

SLIDE 18

Traditional genome annotation BLAST Similarities

SLIDE 19

Traditional genome annotation BLAST Similarities

SLIDE 20

Traditional genome annotation BLAST Similarities

SLIDE 21

Traditional genome annotation BLAST Similarities

SLIDE 22

Traditional genome annotation BLAST Similarities

SLIDE 23

Traditional genome annotation BLAST Similarities

SLIDE 24

Traditional genome annotation BLAST Similarities

SLIDE 25

Traditional genome annotation BLAST Similarities

SLIDE 26

Traditional genome annotation BLAST Similarities

SLIDE 27

Protein Families

SLIDE 28

Protein Families

SLIDE 29

Protein Families

SLIDE 30

Protein Families

SLIDE 31

Gene Ontology

Ontology

– A “hierarchy” of functions
– Does not need to be linear

Directed Acyclic Graph
Controlled Vocabulary

– Decides which words or phrases to use

SLIDE 32

GO

Gene ontology

– A eukaryotic focus

Drosophila
Mus
Saccharomyces
Homo

SLIDE 33

GO

Cellular component

– The parts of a cell

Molecular function

– e.g. ligand binding

Biological processes

– What things do

SLIDE 34

GO Terms

[GO ID, function]
e.g:

– GO:0004743
– Ontology: molecular function
– Name: pyruvate kinase activity

SLIDE 35

GO Terms

[GO ID, function]
e.g:

– GO:0004743
– Ontology: molecular function
– Name: pyruvate kinase activity

Mainly assigned by BLAST/HMMER/... etc

SLIDE 36

Directed Acyclic Graph

Molecular function
Catalytic activity
Transferase activity
Transferase activity, transferring phosphorous
Kinase activity
phosphotransferase activity, alcohol group as acceptor
Pyruvate kinase activity
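Because a term can have more than one parent, walking "up" the ontology is a graph traversal, not a simple climb up a tree. The sketch below uses the term names from this slide; the parent links in `PARENTS` are an assumption about how those terms connect (pyruvate kinase activity is given two parents), and `ancestors` is a hypothetical helper, not part of any GO library.

```python
# Parent links for the terms on the slide. The exact edges are an assumption
# made for illustration; consult the ontology itself for the real graph.
PARENTS = {
    "Pyruvate kinase activity": [
        "Kinase activity",
        "phosphotransferase activity, alcohol group as acceptor",
    ],
    "Kinase activity": ["Transferase activity, transferring phosphorous"],
    "phosphotransferase activity, alcohol group as acceptor": [
        "Transferase activity, transferring phosphorous"
    ],
    "Transferase activity, transferring phosphorous": ["Transferase activity"],
    "Transferase activity": ["Catalytic activity"],
    "Catalytic activity": ["Molecular function"],
}


def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following parent links from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:  # a shared parent is visited only once
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    for name in sorted(ancestors("Pyruvate kinase activity")):
        print(name)
```

A gene annotated with the most specific term is implicitly annotated with all of its ancestors, which is what makes comparisons at different levels of detail possible.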

SLIDE 37

Problems

Annotation by committee
Eukaryotic focus

– Some efforts to counter that

Owen White
Ariane Toussaint
Not very deep
Strict controlled vocabulary

SLIDE 38

Alternatives

SLIDE 39

lacZ lacI lacY lacA

Jacob & Monod, 1961 Basic biology

SLIDE 40

lacZ lacI lacY lacA

Basic biology

SLIDE 41

< 80 % < 80 % < 80%

Different types of clustering

SLIDE 42

< 80 % < 80 % < 80%

Different types of clustering

SLIDE 43

Purine metabolism

SLIDE 44

< 80 % < 80 % < 80%

Different types of clustering

SLIDE 45

Heme / chlorophyll metabolism is conserved

They are both porphyrins

SLIDE 46

[Figure: fraction of genes in clusters (axis 0.2 to 1) and number of genomes (axis 40 to 120) for Aquificae, Bacteroidetes, Chlamydiae, Chloroflexi, Cyanobacteria, Deinococcus-Thermus, Spirochaetes, and Thermotogae. Series: clusters of genes w/ maximum 80% identity; genes in subsystems in clusters; total number of genomes in group]

Occurrence of clustering in different genomes

SLIDE 47

Subsystem is a generalization of “pathway”

– collection of functional roles jointly involved in a biological process or complex

Functional Role is the abstract biological function of a gene product

– atomic, or user-defined, examples:

6-phosphofructokinase (EC 2.7.1.11)
LSU ribosomal protein L31p
Streptococcal virulence factors
Should not contain “putative”, “thermostable”, etc
Populated subsystem is complete spreadsheet of functions and roles

The Subsystems Approach to Annotation

SLIDE 48

1 HutH Histidine ammonia-lyase (EC 4.3.1.3)
2 HutU Urocanate hydratase (EC 4.2.1.49)
3 HutI Imidazolonepropionase (EC 3.5.2.7)
4 GluF Glutamate formiminotransferase (EC 2.1.2.5)
5 HutG Formiminoglutamase (EC 3.5.3.8)
6 NfoD N-formylglutamate deformylase (EC 3.5.1.68)
7 ForI Formiminoglutamic iminohydrolase (EC 3.5.3.13)

Subsystem: Histidine Degradation

Conversion of histidine to glutamate
Functional roles defined in table
Inclusion in subsystem is only by functional role
Controlled vocabulary …

Histidine Degradation

SLIDE 49

Column headers taken from table of functional roles
Rows are selected genomes or organisms
Cells are populated with specific, annotated genes
Functional variants defined by the annotated roles
Variant code -1 indicates subsystem is not functional
Clustering shown by color
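The spreadsheet layout described above maps naturally onto a small data structure: one row per genome, a variant code, and a cell per functional role holding zero or more genes. The sketch below is an illustration only and is not the SEED's own data model; the class name `SubsystemRow`, its fields, and the gene identifiers in the example are hypothetical.

```python
from dataclasses import dataclass, field

# Column headers: the functional-role abbreviations of the subsystem.
ROLES = ["HutH", "HutU", "HutI", "GluF", "HutG", "NfoD", "ForI"]


@dataclass
class SubsystemRow:
    organism: str
    variant: str  # variant code; "-1" means the subsystem is not functional
    cells: dict = field(default_factory=dict)  # role abbreviation -> gene ids

    def is_functional(self):
        return self.variant != "-1"

    def as_line(self, roles=ROLES):
        """Render the row as tab-separated text, with '-' for an empty cell."""
        cols = [",".join(self.cells.get(role, [])) or "-" for role in roles]
        return "\t".join([self.organism, self.variant] + cols)


if __name__ == "__main__":
    # Placeholder genes; which role each real gene fills is not shown here.
    row = SubsystemRow(
        "Example organism", "1", {"HutH": ["gene_1"], "HutU": ["gene_2"]}
    )
    print("\t".join(["Organism", "Variant"] + ROLES))
    print(row.as_line())
```

Keeping the cell value as a list allows for a role filled by more than one gene in the same genome.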

Columns: Organism, Variant, HutH, HutU, HutI, GluF, HutG, NfoD, ForI

Bacteroides thetaiotaomicron (variant 1): Q8A4B3 Q8A4A9 Q8A4B1 Q8A4B0
Desulfotalea psychrophila (variant 1): gi51246205 gi51246204 gi51246203 gi51246202
Halobacterium sp. (variant 2): Q9HQD5 Q9HQD8 Q9HQD6 Q9HQD7
Deinococcus radiodurans (variant 2): Q9RZ06 Q9RZ02 Q9RZ05 Q9RZ04
Bacillus subtilis (variant 2): P10944 P25503 P42084 P42068
Caulobacter crescentus (variant 3): P58082 Q9A9MI P58079 Q9A9M0 Q9A9L9
Pseudomonas putida (variant 3): Q88CZ7 Q88CZ6 Q88CZ9 Q88D00 Q88CZ3
Xanthomonas campestris (variant 3): Q8PAA7 P58988 Q8PAA6 Q8PAA8 Q8PAA5
Listeria monocytogenes (variant 1)

Subsystem Spreadsheet

SLIDE 50

Subsystem: Histidine Degradation

Subsystem Spreadsheet

“The Populated Subsystem”

SLIDE 51

Microbial sialic acid metabolism has now been firmly established as a virulence determinant in a range of infectious diseases

Nan-operon within Sialic Acid Metabolism

SLIDE 52

The nan-operon

SLIDE 53

Genomes: Escherichia coli O157:H7 EDL933; Salmonella enterica subsp. enterica serovar Typhi Ty2; Shigella sonnei 53G; Yersinia pseudotuberculosis IP

No. in cluster, Abbr., Functional role in subsystem, then the annotation of each gene in the cluster:

1 NanK N-acetylmannosamine kinase (EC 2.7.1.60)
– ABH-0028250: putative NAGC-like transcriptional regulator
– ABS-0084973: possible kinase
– ADD-0003671: putative NAGC-like transcriptional regulator
– ACZ-0002834: putative sugar kinase

2 NanE N-acetylmannosamine-6-phosphate 2-epimerase (EC 5.1.3.9)
– ABH-0028251: putative enzyme
– ABS-0083505: conserved hypothetical protein
– ADD-0003672: putative enzyme
– ACZ-0002836: conserved hypothetical protein

3 NanA N-acetylneuraminate lyase (EC 4.1.3.3)
– ABH-0028253: N-acetylneuraminate lyase
– ABS-0084976: N-acetylneuraminate lyase
– ADD-0003674: N-acetylneuraminate lyase
– ACZ-0002837: probable N-acetylneuraminate lyase

4 YhcH Putative sugar isomerase involved in processing of exogenous sialic acid*
– ABH-0028249: orf, hypothetical protein
– ABS-0084972: conserved hypothetical protein
– ADD-0003670: conserved hypothetical protein
– ACZ-0002833: conserved hypothetical protein

5 NanT Sialic acid transporter (permease) NanT
– ABH-0028252: sialic acid transporter
– ABS-0084975: putative sialic acid transporter
– ADD-0003673: sialic acid transporter
– ACZ-0002831: MFS family sialic acid transporter

10 NanR Transcriptional regulator NanR**
– ABH-0028254: putative FADA-type transcriptional regulator
– ABS-0084977: putative GntR-family transcriptional regulator
– ADD-0003675: putative FADA-type transcriptional regulator
– NOT PRESENT (likely replaced by a clustered member of RpiR family)

* proposed by: Teplyakov, A., Obmolova, G., Toedt, J., Galperin, M. Y., Gilliland, G. L. (2005). Crystal Structure of the Bacterial YhcH Protein Indicates a Role in Sialic Acid Catabolism. J. Bacteriol. 187: 5520-5527

** K. A. Kalivoda, S. M. Steenbergen, E. R. Vimr, and J. Plumbridge. Regulation of Sialic Acid Catabolism by the DNA Binding Protein NanR in Escherichia coli. J. Bacteriol., August 15, 2003; 185(16): 4806-4815

Color coding for annotations:

green: consistent
yellow: general class
gray: inconsistent or not informative

Annotations in conserved cluster (nan-operon)

SLIDE 54

Methionine Biosynthesis

You need to get to here From here

SLIDE 55

Pathway step labels: Homoserine activation, Sulfhydrylation, Transsulfuration, Methylation

Columns: Organism, Variant Code, HSDH, HK, HSST, HSAT, AHSH/SHSH, CTGS, CTBL, MetH, MetE, BhmT, MTHFR

Nostoc sp. PCC 7120: 4427 657 619 1093
Synechocystis sp. PCC 6803: 2356 1112 2469 1144
Thermosynechococcus elongatus BP-1: 277 1764 1027 1090 1770
Trichodesmium erythraeum IMS101: 415, 4266 6167 106, 1229 2279 4433
Gloeobacter violaceus PCC 7421: 4295 1127 2500 477 789
Anabaena variabilis ATCC 29413: 33 2331 5519 3872 3873 4254, 6365 6434
Nostoc punctiforme: 33 2895 6648 5301 5302 4055 1885
Prochlorococcus marinus MED4: 66 1204 1764 1714 1715 2 1 1421 295
Prochlorococcus marinus str. MIT 9313: 66 1141 426 875 874 225 226 728 2005
Prochlorococcus marinus subsp. marinus str. CCMP1375: 66 1148 1064 799 798 404 405 957 176
Prochlorococcus marinus subsp. pastoris str. CCMP1986: 66 1047 592 640 639 405 406 874 153
Synechococcus sp. WH 8102: 66 706 1476 845 846 669 670 1233 2258
Synechococcus elongatus PCC 7942: 1397 769 2172 1030 2173 702 639

SLIDE 56

[Methionine biosynthesis spreadsheet, identical to slide 55]

SLIDE 57

[Methionine biosynthesis spreadsheet, identical to slide 55]

?

SLIDE 58

[Methionine biosynthesis spreadsheet, identical to slide 55]

? ? Missing genes
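Spotting the question marks in a spreadsheet like this amounts to asking, for each genome, which required roles have no gene assigned. The sketch below shows that check under loud assumptions: which roles a given variant requires is not stated on the slides, so `required_roles` and the example cell contents are hypothetical, and `missing_roles` / `missing_by_genome` are illustrative helper names.

```python
def missing_roles(cells, required_roles):
    """Return the required roles that have no gene assigned in `cells`."""
    return [role for role in required_roles if not cells.get(role)]


def missing_by_genome(spreadsheet, required_roles):
    """Map each genome to its missing roles, skipping complete genomes."""
    report = {}
    for genome, cells in spreadsheet.items():
        missing = missing_roles(cells, required_roles)
        if missing:
            report[genome] = missing
    return report


if __name__ == "__main__":
    # Hypothetical requirement and hypothetical cell contents.
    required_roles = ["HSDH", "HSST", "MetH"]
    spreadsheet = {
        "Genome A": {"HSDH": ["gene_10"], "HSST": ["gene_11"], "MetH": ["gene_12"]},
        "Genome B": {"HSDH": ["gene_20"], "MetH": ["gene_22"]},
    }
    for genome, missing in missing_by_genome(spreadsheet, required_roles).items():
        print(genome, "is missing", ", ".join(missing))
```

Each reported gap is a candidate "missing gene": either the organism uses an alternative route, or an unannotated gene fills the role.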

SLIDE 59

Hypothesis generation that leads to the wet lab...

SLIDE 60

Wet lab
Chromosomal context
Metabolic context
Phylogenetic context
Microarray data
Proteomics data
…

Subsystems developed based on

SLIDE 61

How can we compare annotations

There are several groups doing annotations of microbial genomes
How do we compare them?

SLIDE 62

Caveat emptor!

SLIDE 63

Number of subsystems defined
Number of functional roles defined
Number of genes connected to functional roles

Natural Metrics

SLIDE 64

Annotations for some genomes

SLIDE 65

Number of solid connections of gene to functional role where “solid” is

1. supported by experimental data
2. connected to functional role and in chromosomal cluster with genes implementing functional roles from the same subsystem
3. only gene in genome connected to a functional role in an active variant of a subsystem

Reactions, GO terms, Articles, Other databases cross references (number and diversity)

Applied Metrics

SLIDE 66

Applied Metrics

SLIDE 67

Talmudic question*

If I find the identical protein sequence in two different organisms, is it doing the same function in both organisms?

* Per: Elio Schaechter, Small Things Considered. A talmudic question is unanswerable

SLIDE 68

FIG function:

Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC 5.3.1.16)

Other functions in RefSeq:

phosphoribosylformimino-5-aminoimidazole carboxamide
phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase
phosphoribosylformimino-5-aminoimidazole carboxamide ribotide...
1-(5-phosphoribosyl)-5-[(5- phosphoribosylamino)methylideneamino] imidazole-4-carboxamide isomerase
N-(5-phospho-L-ribosyl-formimino)-5-amino-1-(5- phosphoribosyl)-4-imidazolecarboxamide isomerase
N-(5'-phospho-L-ribosyl-formimino)-5-amino-1-(5'-phosphoribosyl)-4-imidazolecarboxamide isomerase
N-(5'-phospho-L-ribosyl-formimino)-5-amino-1- (5'- phosphoribosyl)-4-imidazolecarboxamide isomerase
N-(5'-phospho-L-ribosyl-formimino)-5-amino-1- (5'-phosphoribosyl)-4-imidazolecarboxamide isomerase
N-(5'-phospho-L-ribosyl-formimino)-5-amino-1- (5'-phosphoribosyl)-4- imidazolecarboxamide isomerase
Phosphoribosyl isomerase A [1-[5-phosphoribosyl]-5-[[5-phosphoribosylamino]methylideneamino] imidazole-4-carboxamide isomerase]

hisA

SLIDE 69

Define a set of protein families such that each family contains genes playing the same function
Attach functional roles to protein families
Measure the consistency of the annotations made to genes within each family

1. "consistency" is the odds that two proteins from the same family have the same function
2. Evaluate both families and functions.

Measuring Consistency
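One way to make this definition concrete is to count, within a family, the fraction of protein pairs that carry the same annotation. The sketch below reads "odds" as a probability over pairs of distinct proteins, which is an interpretation and not necessarily the exact formula behind the published numbers; `family_consistency` and the annotation strings are hypothetical.

```python
from collections import Counter


def family_consistency(annotations):
    """Probability that two distinct proteins in a family share an annotation.

    `annotations` holds one function string per protein in the family.
    A family with fewer than two proteins is treated as fully consistent.
    """
    n = len(annotations)
    if n < 2:
        return 1.0
    counts = Counter(annotations)
    # Ordered pairs of distinct proteins that carry the same annotation.
    same_pairs = sum(c * (c - 1) for c in counts.values())
    return same_pairs / (n * (n - 1))


if __name__ == "__main__":
    family = ["function A", "function A", "function A", "function B"]
    print(family_consistency(family))
```

Exact string matching is deliberately strict here: two spellings of the same function count as a disagreement, which is the problem the hisA example illustrates.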

SLIDE 70

Consistency among databases (2008)

SLIDE 71

Number of RefSeq proteins in families

SLIDE 72

If everything was called “hypothetical protein” the database would be 100% consistent
Need to measure accuracy (specificity) as well as consistency
Sample 100 proteins at random from “curated” set (i.e. that are believed to be correct)

Manually inspect annotations to score correctness

How to measure accuracy
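The sampling step can be sketched with the standard library. The manual inspection itself cannot be automated, so the code only draws the sample and turns the reviewer's verdicts into a fraction; `sample_for_review`, `accuracy`, and the identifier list are hypothetical names used for illustration.

```python
import random


def sample_for_review(curated_ids, k=100, seed=None):
    """Draw up to `k` distinct proteins at random from the curated set."""
    ids = list(curated_ids)
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(ids, min(k, len(ids)))


def accuracy(verdicts):
    """Fraction of manually inspected annotations judged correct."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(1 for v in verdicts if v) / len(verdicts)


if __name__ == "__main__":
    curated = [f"protein_{i}" for i in range(1000)]  # placeholder identifiers
    to_review = sample_for_review(curated, k=100, seed=0)
    print(len(to_review), "proteins selected for manual inspection")
```

With a sample of 100, the score moves in steps of one percentage point, so small differences between databases should be read with caution.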

SLIDE 73

Problems

Subsystems are biased!
Subsystems are inaccurate!
Merging annotations between different groups is political/psychological not technical!

SLIDE 74

Problems

E. coli

– 4,391 genes
– 4,288 genes that make proteins (pegs)
– 676 genes that make enzymes

15% of genes encode enzymes!
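The 15% figure follows directly from the counts on this slide. A quick check, using only those three numbers (the variable names are illustrative):

```python
total_genes = 4391    # all E. coli genes
protein_genes = 4288  # protein-encoding genes (pegs)
enzyme_genes = 676    # genes that make enzymes

share_of_all_genes = enzyme_genes / total_genes
share_of_pegs = enzyme_genes / protein_genes

print(f"{share_of_all_genes:.1%} of all genes encode enzymes")
print(f"{share_of_pegs:.1%} of pegs encode enzymes")
```

Either denominator gives roughly 15 to 16 percent, so most genes fall outside anything an EC number can describe.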

SLIDE 75

The SEED Family

www.nmpdr.org www.theseed.org

SLIDE 76

Three level “hierarchy”

Amino Acids and Derivatives

– Alanine, serine, and glycine

Serine Biosynthesis
Amino Acids and Derivatives

– Lysine, threonine, methionine, and cysteine

Methionine Biosynthesis

Make your own subsystems!

About 2,500 Subsystems

SLIDE 77

Classification: # SS

Experimental Subsystems: 498
Clustering-based subsystems: 352
Carbohydrates: 160
Cofactors, Vitamins, Prosthetic Groups, Pigments: 123
Amino Acids and Derivatives: 96
Protein Metabolism: 95
Virulence, Disease, Defense: 70
Miscellaneous: 70
RNA Metabolism: 65
Membrane Transport: 65
Respiration: 62
Cell Wall and Capsule: 62
Fatty Acids, Lipids, and: 60
Regulation and Cell signaling: 51
Virulence: 49
Stress Response: 43
DNA Metabolism: 41
Aromatic Compounds: 38
Phages: 36
Secondary Metabolism: 34
Iron acquisition and metabolism: 31
Nucleosides and Nucleotides: 24
Sulfur Metabolism: 20
Dormancy and Sporulation: 17
Plant-prokaryote: 12
Nitrogen Metabolism: 12
Motility and Chemotaxis: 11
Plant cell walls and outer surfaces: 10
Phages: 10
Cell Division and Cell Cycle: 10
Photosynthesis: 9
Metabolite damage: 8
Phosphorus Metabolism: 7
Potassium metabolism: 4
Transcriptional regulation: 2
Plasmids: 2
Central metabolism: 2
Autotrophy: 2
Arabinose Transport: 1

SLIDE 78

http://rast.nmpdr.org
Rapid Annotation using Subsystem Technology

Started in 2008
Designed for annotating bacterial and archaeal genomes
As of Monday, May 11th 248,822 annotation jobs

19,918 registered users

SLIDE 79

Find the phylogenetic neighborhood of your genome
Look for proteins that related organisms have

– Core proteins
– Subset of all subsystems

Use those calls as a training set for critica/glimmer

– Intrinsic training set!

The annotation process (complete genomes)

SLIDE 80

This one’s for Gary

SLIDE 81

Subsystem, GO, and KEGG connections

– KEGG EC numbers
– KEGG reaction numbers
– SEED reaction numbers (Chris Henry)

Metabolic flux models

– Automatically generate FBA matrices (Aaron Best/Matt DeJongh; Hope College)
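The matrix at the heart of flux balance analysis is the stoichiometric matrix: one row per metabolite, one column per reaction. The sketch below shows how such a matrix can be assembled once genes have been linked to reactions. It is a generic illustration and not the pipeline referred to on the slide; the function names, the reaction identifiers `PYK` and `EX_pyr`, and the drain reaction are assumptions. The pyruvate kinase stoichiometry (PEP + ADP gives pyruvate + ATP) is standard biochemistry.

```python
def stoichiometric_matrix(reactions):
    """Build a matrix with one row per metabolite and one column per reaction.

    `reactions` maps a reaction name to {metabolite: coefficient}, where a
    negative coefficient means the metabolite is consumed.
    """
    metabolites = sorted({m for stoich in reactions.values() for m in stoich})
    names = list(reactions)
    matrix = [[reactions[r].get(m, 0) for r in names] for m in metabolites]
    return metabolites, names, matrix


def net_production(matrix, fluxes):
    """Multiply the matrix by a flux vector: net change of each metabolite."""
    return [sum(c * v for c, v in zip(row, fluxes)) for row in matrix]


if __name__ == "__main__":
    reactions = {
        "PYK": {"PEP": -1, "ADP": -1, "pyruvate": 1, "ATP": 1},
        "EX_pyr": {"pyruvate": -1},  # hypothetical drain on pyruvate
    }
    metabolites, names, matrix = stoichiometric_matrix(reactions)
    for metabolite, row in zip(metabolites, matrix):
        print(metabolite, row)
```

A flux vector is at steady state for a metabolite when its entry in `net_production` is zero; an FBA solver searches for such vectors while optimizing an objective, which is beyond this sketch.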

Automatic metabolic reconstruction

SLIDE 82

SLIDE 83

The Populated Subsystem

SLIDE 84

Automatically compare metabolic reconstructions

SLIDE 85

Rapidly correct missing annotations
Add more members to subsystems
Improves future genome annotations! (especially with new subsystems)

Find and suggest candidate functions

SLIDE 86

10 genomes submitted on Thursday at 6 pm
First annotation complete before 8 am Friday
Remaining annotations completed Friday before noon

(there were others in the pipeline too!)
Presentation ASM 2009 Tuesday, 8pm

The Live ASM Test

Philadelphia, 2009

SLIDE 87

Genome Percent of Proteins in Subsystems

Haloferax denitrificans 20%
Haloferax mediterranei 19%
Haloferax sulfurifontis 19%
Haloferax volcanii DS2 19%
Haloarcula sp 33800 19%
Haloarcula sp 33799 18%

Subsystems coverage of sequenced Archaea

SLIDE 88

Phage talk Work by Sajia Akhter

Haloferax sulfurifontis prophage

Prophages

SLIDE 89

Metagenomics RAST had 300 public metagenomes
Compared using tblastx
Comparing complete genomes to metagenomes

SLIDE 90

Human Poop

SLIDE 91

Thanks Nick Celms, Beltran Rodriguez-Mueller, Mya Breitbart, & Forest Rohwer

High Salinity Salterns

San Diego, July 2004

SLIDE 92

Low salinity salterns High salinity salterns July 2004 Nov 2005

SLIDE 93

RAST usage grows...

SLIDE 94

RAST coverage....

SLIDE 95

RASTtk

RAST2.0
Customizable choice of pipelines to run
Same behind the scenes infrastructure

SLIDE 96

RASTtk

SLIDE 97

Vibrio genomes

SLIDE 98