Genome Annotation
The steps in genome sequencing
- Generate genome sequence
– Assembly
– ORF calling
– tRNA identification
– rRNA identification
– Functional annotation
Annotating Genomes
- Identifying which protein performs which function
Why annotate a genome?
- Catalog what's there
- Identify what's missing – but should be there!
– Things you don't know
- In vitro growth
– Mycoplasma pneumoniae
- Comparative genomics
- Hypothesis generation
The goals of annotation
- Exchange information with others
- Compare annotations between organisms
How to annotate a genome?
- Sequence
- Assemble
- Identify open reading frames
– Putative proteins
Putative protein
- Open Reading Frame (ORF)
– A stretch of codons (DNA) with no in-frame stop codon
- Coding Sequence (CDS)
– An ORF that could encode a protein
- Protein encoding gene (PEG)
– An ORF that could encode a protein
- Hypothetical protein = putative protein
– Something that has not been experimentally shown
- Polypeptide
– Short stretch of ~50 amino acids. Often a domain
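The ORF definition above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular ORF caller: it assumes the standard genetic code (some organisms, such as Mycoplasma, read TGA as tryptophan), treats an ORF simply as a stop-free stretch in one of the six reading frames, and does not look for start codons. The function names and the `min_codons` parameter are invented for this example.

```python
# Stop codons in the standard genetic code (an assumption, see above).
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=3):
    """Yield (strand, frame, start, end) for each stop-free stretch.

    Coordinates are 0-based, end-exclusive, and relative to the strand that
    was scanned. Only stretches of at least `min_codons` codons are reported.
    """
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = frame
            for pos in range(frame, len(seq) - 2, 3):
                if seq[pos:pos + 3] in STOP_CODONS:
                    if (pos - start) // 3 >= min_codons:
                        yield strand, frame, start, pos
                    start = pos + 3
            # Trailing stretch that runs to the end of the sequence.
            end = start + ((len(seq) - start) // 3) * 3
            if (end - start) // 3 >= min_codons:
                yield strand, frame, start, end
```

Calling `list(find_orfs("ATGAAACCCGGGTAA"))` reports stop-free stretches on both strands; deciding which of them is a plausible CDS or PEG needs further evidence, such as a start codon, length, and similarity to known proteins.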
PEGS
- E. coli
– 4,391 genes
– 4,288 genes that make proteins (pegs)
ORF Calling
Traditional genome annotation
Traditional genome annotation BLAST Similarities
Protein Families
Gene Ontology
- Ontology
– A “hierarchy” of functions
– Does not need to be linear
- Directed Acyclic Graph
- Controlled Vocabulary
– Decides which words or phrases to use
GO
- Gene ontology
– A eukaryotic focus
- Drosophila
- Mus
- Saccharomyces
- Homo
GO
- Cellular component
– The parts of a cell
- Molecular function
– e.g. ligand binding
- Biological processes
– What things do
GO Terms
- [GO ID, function]
- e.g.:
– GO:0004743
– Ontology: molecular function
– Name: pyruvate kinase activity
- Mainly assigned by BLAST/HMMER/... etc
Directed Acyclic Graph
- Molecular function
- Catalytic activity
- Transferase activity
- Transferase activity, transferring phosphorous
- Kinase activity
- Phosphotransferase activity, alcohol group as acceptor
- Pyruvate kinase activity
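A directed acyclic graph differs from a tree in that a term may have more than one parent. The sketch below encodes the terms from this slide as parent links and walks upward from pyruvate kinase activity. The edges are written from my understanding of the Gene Ontology (where the fourth term's full name is "transferase activity, transferring phosphorus-containing groups") and should be checked against a current GO release; the `PARENTS` dictionary and `ancestors` function are names invented for this example.

```python
# Parent ("is_a") links for the terms named on the slide. The two parents of
# pyruvate kinase activity are what make this a DAG rather than a tree.
# Edges are an assumption to verify against a current GO release.
PARENTS = {
    "pyruvate kinase activity": [
        "kinase activity",
        "phosphotransferase activity, alcohol group as acceptor",
    ],
    "kinase activity": [
        "transferase activity, transferring phosphorus-containing groups"
    ],
    "phosphotransferase activity, alcohol group as acceptor": [
        "transferase activity, transferring phosphorus-containing groups"
    ],
    "transferase activity, transferring phosphorus-containing groups": [
        "transferase activity"
    ],
    "transferase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen
```

A gene annotated with the most specific term is implicitly annotated with all of its ancestors, which is why `ancestors("pyruvate kinase activity")` includes both kinase activity and the phosphotransferase term.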
Problems
- Annotation by committee
- Eukaryotic focus
– Some efforts to counter that
- Owen White
- Ariane Toussaint
- Not very deep
- Strict controlled vocabulary
Alternatives
lacZ lacI lacY lacA
Jacob & Monod, 1961
Basic biology
Different types of clustering (genes at < 80% identity)
Purine metabolism
Heme / chlorophyll metabolism is conserved
They are both porphyrins
Occurrence of clustering in different genomes
[Figure: fraction of genes in clusters (axis 0.2 to 1) and number of genomes (axis 40 to 120) for Aquificae, Bacteroidetes, Chlamydiae, Chloroflexi, Cyanobacteria, Deinococcus-Thermus, Spirochaetes, and Thermotogae. Series: clusters of genes with maximum 80% identity; genes in subsystems in clusters; total number of genomes in group.]
- Subsystem is a generalization of “pathway”
– A collection of functional roles jointly involved in a biological process or complex
- Functional role is the abstract biological function of a gene product
– Atomic, or user-defined; examples:
- 6-phosphofructokinase (EC 2.7.1.11)
- LSU ribosomal protein L31p
- Streptococcal virulence factors
- Should not contain “putative”, “thermostable”, etc.
- Populated subsystem is the complete spreadsheet of functions and roles
The Subsystems Approach to Annotation
1 HutH Histidine ammonia-lyase (EC 4.3.1.3)
2 HutU Urocanate hydratase (EC 4.2.1.49)
3 HutI Imidazolonepropionase (EC 3.5.2.7)
4 GluF Glutamate formiminotransferase (EC 2.1.2.5)
5 HutG Formiminoglutamase (EC 3.5.3.8)
6 NfoD N-formylglutamate deformylase (EC 3.5.1.68)
7 ForI Formiminoglutamic iminohydrolase (EC 3.5.3.13)
Subsystem: Histidine Degradation
- Conversion of histidine to glutamate
- Functional roles defined in table
- Inclusion in subsystem is only by functional role
- Controlled vocabulary …
Histidine Degradation
- Column headers taken from table of functional roles
- Rows are selected genomes or organisms
- Cells are populated with specific, annotated genes
- Functional variants defined by the annotated roles
- Variant code -1 indicates subsystem is not functional
- Clustering shown by color
Organism | Variant | Genes (columns: HutH, HutU, HutI, GluF, HutG, NfoD, ForI)
Bacteroides thetaiotaomicron | 1 | Q8A4B3 Q8A4A9 Q8A4B1 Q8A4B0
Desulfotalea psychrophila | 1 | gi51246205 gi51246204 gi51246203 gi51246202
Halobacterium sp. | 2 | Q9HQD5 Q9HQD8 Q9HQD6 Q9HQD7
Deinococcus radiodurans | 2 | Q9RZ06 Q9RZ02 Q9RZ05 Q9RZ04
Bacillus subtilis | 2 | P10944 P25503 P42084 P42068
Caulobacter crescentus | 3 | P58082 Q9A9MI P58079 Q9A9M0 Q9A9L9
Pseudomonas putida | 3 | Q88CZ7 Q88CZ6 Q88CZ9 Q88D00 Q88CZ3
Xanthomonas campestris | 3 | Q8PAA7 P58988 Q8PAA6 Q8PAA8 Q8PAA5
Listeria monocytogenes | -1 |
Subsystem Spreadsheet
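As a rough illustration of how such a spreadsheet can be held in code, the sketch below builds one row per organism and one cell per functional role from a list of annotations. It is not the SEED data model; the `populate` function, the organism name, and the gene identifiers are all invented for the example, and only the role abbreviations come from the Histidine Degradation table.

```python
from collections import defaultdict

# Functional roles (column headers) from the Histidine Degradation table.
ROLES = ["HutH", "HutU", "HutI", "GluF", "HutG", "NfoD", "ForI"]


def populate(annotations, roles=ROLES):
    """Build {organism: {role: [gene, ...]}} from (organism, gene, role) triples.

    Genes whose role is not a column of the subsystem are ignored, because
    inclusion in a subsystem is only by functional role.
    """
    sheet = defaultdict(lambda: {role: [] for role in roles})
    for organism, gene, role in annotations:
        if role in roles:
            sheet[organism][role].append(gene)
    return dict(sheet)


# Hypothetical annotations; organism and gene identifiers are invented.
annotations = [
    ("Organism A", "geneA1", "HutH"),
    ("Organism A", "geneA2", "HutU"),
    ("Organism A", "geneA3", "HutI"),
    ("Organism A", "geneA4", "HutG"),
    ("Organism A", "geneA5", "DnaK"),  # not a role of this subsystem
]
sheet = populate(annotations)
```

Each cell is a list because a genome can carry more than one gene for the same role; an empty list is an empty cell.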
“The Populated Subsystem”
Microbial sialic acid metabolism has now been firmly established as a virulence determinant in a range of infectious diseases
Nan-operon within Sialic Acid Metabolism
The nan-operon
No. in cluster | Abbr. | Functional role in subsystem, followed by the annotation of each gene in the cluster
1 NanK: N-acetylmannosamine kinase (EC 2.7.1.60)
– ABH-0028250: putative NAGC-like transcriptional regulator
– ABS-0084973: possible kinase
– ADD-0003671: putative NAGC-like transcriptional regulator
– ACZ-0002834: putative sugar kinase
2 NanE: N-acetylmannosamine-6-phosphate 2-epimerase (EC 5.1.3.9)
– ABH-0028251: putative enzyme
– ABS-0083505: conserved hypothetical protein
– ADD-0003672: putative enzyme
– ACZ-0002836: conserved hypothetical protein
3 NanA: N-acetylneuraminate lyase (EC 4.1.3.3)
– ABH-0028253: N-acetylneuraminate lyase
– ABS-0084976: N-acetylneuraminate lyase
– ADD-0003674: N-acetylneuraminate lyase
– ACZ-0002837: probable N-acetylneuraminate lyase
4 YhcH: Putative sugar isomerase involved in processing of exogenous sialic acid*
– ABH-0028249: orf, hypothetical protein
– ABS-0084972: conserved hypothetical protein
– ADD-0003670: conserved hypothetical protein
– ACZ-0002833: conserved hypothetical protein
5 NanT: Sialic acid transporter (permease) NanT
– ABH-0028252: sialic acid transporter
– ABS-0084975: putative sialic acid transporter
– ADD-0003673: sialic acid transporter
– ACZ-0002831: MFS family sialic acid transporter
10 NanR: Transcriptional regulator NanR**
– ABH-0028254: putative FADA-type transcriptional regulator
– ABS-0084977: putative GntR-family transcriptional regulator
– ADD-0003675: putative FADA-type transcriptional regulator
– NOT PRESENT (likely replaced by a clustered member of RpiR family)
Genomes: Escherichia coli O157:H7 EDL933; Salmonella enterica subsp. enterica serovar Typhi Ty2; Shigella sonnei 53G; Yersinia pseudotuberculosis IP
* Proposed by: Teplyakov, A., Obmolova, G., Toedt, J., Galperin, M. Y., Gilliland, G. L. (2005). Crystal Structure of the Bacterial YhcH Protein Indicates a Role in Sialic Acid Catabolism. J. Bacteriol. 187: 5520-5527
** Kalivoda, K. A., Steenbergen, S. M., Vimr, E. R., Plumbridge, J. (2003). Regulation of Sialic Acid Catabolism by the DNA Binding Protein NanR in Escherichia coli. J. Bacteriol. 185(16): 4806-4815
Color coding for annotations:
- green: consistent
- yellow: general class
- gray: inconsistent or not informative
Annotations in conserved cluster (nan-operon)
Methionine Biosynthesis
[Pathway diagram labels: “From here”, “You need to get to here”]
Column groups: Homoserine activation; Sulfhydrylation; Transsulfuration; Methylation
Columns: Organism, Variant Code, HSDH, HK, HSST, HSAT, AHSH/SHSH, CTGS, CTBL, MetH, MetE, BhmT, MTHFR
Nostoc sp. PCC 7120: 4427 | 657 | 619 | 1093
Synechocystis sp. PCC 6803: 2356 | 1112 | 2469 | 1144
Thermosynechococcus elongatus BP-1: 277 | 1764 | 1027 | 1090 | 1770
Trichodesmium erythraeum IMS101: 415, 4266 | 6167 | 106, 1229 | 2279 | 4433
Gloeobacter violaceus PCC 7421: 4295 | 1127 | 2500 | 477 | 789
Anabaena variabilis ATCC 29413: 33 | 2331 | 5519 | 3872 | 3873 | 4254, 6365 | 6434
Nostoc punctiforme: 33 | 2895 | 6648 | 5301 | 5302 | 4055 | 1885
Prochlorococcus marinus MED4: 66 | 1204 | 1764 | 1714 | 1715 | 2 | 1 | 1421 | 295
Prochlorococcus marinus str. MIT 9313: 66 | 1141 | 426 | 875 | 874 | 225 | 226 | 728 | 2005
Prochlorococcus marinus subsp. marinus str. CCMP1375: 66 | 1148 | 1064 | 799 | 798 | 404 | 405 | 957 | 176
Prochlorococcus marinus subsp. pastoris str. CCMP1986: 66 | 1047 | 592 | 640 | 639 | 405 | 406 | 874 | 153
Synechococcus sp. WH 8102: 66 | 706 | 1476 | 845 | 846 | 669 | 670 | 1233 | 2258
Synechococcus elongatus PCC 7942: 1397 | 769 | 2172 | 1030 | 2173 | 702 | 639
Missing genes
Hypothesis generation that leads to the wet lab...
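One way to turn "missing genes" into testable hypotheses is to flag roles that are empty in one genome but filled in most of the others. The sketch below does this for a spreadsheet held as a dictionary. It is only an illustration under stated assumptions: the `candidate_missing_roles` function and its `min_fraction` threshold are invented here, the genome names and gene numbers are hypothetical, and only the role abbreviations come from the methionine table.

```python
def candidate_missing_roles(sheet, min_fraction=0.75):
    """Flag roles that are empty in one genome but filled in most others.

    `sheet` maps organism -> {role: [gene, ...]}. A role counts as "expected"
    when at least `min_fraction` of the organisms have a gene for it.
    Returns {organism: [expected roles with no gene]}.
    """
    organisms = list(sheet)
    roles = sorted({role for row in sheet.values() for role in row})
    expected = [
        role for role in roles
        if sum(bool(sheet[org].get(role)) for org in organisms)
        >= min_fraction * len(organisms)
    ]
    gaps = {}
    for org in organisms:
        empty = [role for role in expected if not sheet[org].get(role)]
        if empty:
            gaps[org] = empty
    return gaps


# Hypothetical spreadsheet; genome names and gene numbers are invented.
example = {
    "Genome A": {"HSDH": [101], "MetH": [202], "MTHFR": [303]},
    "Genome B": {"HSDH": [111], "MetH": [212], "MTHFR": [313]},
    "Genome C": {"HSDH": [121], "MetH": [], "MTHFR": [323]},
    "Genome D": {"HSDH": [131], "MetH": [232], "MTHFR": [333]},
}
gaps = candidate_missing_roles(example)
```

A flagged role is a starting point rather than a conclusion: the gene may be present but misannotated, replaced by a non-homologous enzyme, or truly absent, and the chromosomal, metabolic, and phylogenetic context helps decide which to test in the wet lab.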
- Wet lab
- Chromosomal context
- Metabolic context
- Phylogenetic context
- Microarray data
- Proteomics data
- …
Subsystems developed based on
How can we compare annotations
- There are several groups doing annotations of microbial genomes
- How do we compare them?
Caveat emptor!
- Number of subsystems defined
- Number of functional roles defined
- Number of genes connected to functional roles
Natural Metrics
Annotations for some genomes
Number of solid connections of gene to functional role, where “solid” is
- 1. supported by experimental data
- 2. connected to a functional role and in a chromosomal cluster with genes implementing functional roles from the same subsystem
- 3. the only gene in the genome connected to a functional role in an active variant of a subsystem
Reactions, GO terms, articles, cross-references to other databases (number and diversity)
Applied Metrics
Talmudic question*
If I find the identical protein sequence in two different organisms, is it doing the same function in both organisms?
* Per Elio Schaechter, Small Things Considered: a talmudic question is unanswerable.
FIG function:
Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC 5.3.1.16)
Other functions in RefSeq:
- phosphoribosylformimino-5-aminoimidazole carboxamide
- phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase
- phosphoribosylformimino-5-aminoimidazole carboxamide ribotide...
- 1-(5-phosphoribosyl)-5-[(5- phosphoribosylamino)methylideneamino] imidazole-4-carboxamide isomerase
- N-(5-phospho-L-ribosyl-formimino)-5-amino-1-(5- phosphoribosyl)-4-imidazolecarboxamide isomerase
- N-(5'-phospho-L-ribosyl-formimino)-5-amino-1-(5'-phosphoribosyl)-4-imidazolecarboxamide isomerase
- N-(5'-phospho-L-ribosyl-formimino)-5-amino-1- (5'- phosphoribosyl)-4-imidazolecarboxamide isomerase
- N-(5'-phospho-L-ribosyl-formimino)-5-amino-1- (5'-phosphoribosyl)-4-imidazolecarboxamide isomerase
- N-(5'-phospho-L-ribosyl-formimino)-5-amino-1- (5'-phosphoribosyl)-4- imidazolecarboxamide isomerase
- Phosphoribosyl isomerase A [1-[5-phosphoribosyl]-5-[[5-phosphoribosylamino]methylideneamino] imidazole-4-carboxamide isomerase]
hisA
- Define a set of protein families such that each family contains genes performing the same function
- Attach functional roles to protein families
- Measure the consistency of the annotations made to genes within each family
- 1. “Consistency” is the probability that two proteins from the same family have the same function
- 2. Evaluate both families and functions
Measuring Consistency
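The consistency measure can be written down directly if "the odds that two proteins from the same family have the same function" is read as the fraction of member pairs that carry an identical function string. That reading, and the `family_consistency` name, are assumptions made for this sketch; the slides do not give a formula.

```python
from collections import Counter
from math import comb


def family_consistency(functions):
    """Probability that two distinct members of one family share a function.

    `functions` is the list of functional roles assigned to the members.
    Returns None for families with fewer than two members.
    """
    n = len(functions)
    if n < 2:
        return None
    same = sum(comb(count, 2) for count in Counter(functions).values())
    return same / comb(n, 2)
```

For example, a family of four proteins where three share one annotation gives 3 matching pairs out of 6, or 0.5. Because the comparison is on exact strings, spelling variants of the same enzyme name count as disagreements unless they are normalized first.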
Consistency among databases (2008)
Number of RefSeq proteins in families
- If everything was called “hypothetical protein”, the database would be 100% consistent
- Need to measure accuracy (specificity) as well as consistency
- Sample 100 proteins at random from the “curated” set (i.e. those believed to be correct)
- Manually inspect annotations to score correctness
How to measure accuracy
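The sampling step is simple to make reproducible. The sketch below draws the sample and, once a curator has scored each annotation by hand, reports the fraction scored correct. The function names, the fixed `seed`, and the true/false scoring format are assumptions for illustration.

```python
import random


def draw_sample(curated_ids, size=100, seed=0):
    """Pick `size` protein IDs at random, without replacement."""
    rng = random.Random(seed)
    # Sorting first makes the draw repeatable even if the input is a set.
    return rng.sample(sorted(curated_ids), size)


def accuracy(scores):
    """Fraction of manually inspected annotations scored as correct.

    `scores` maps protein ID -> True (correct) or False (incorrect).
    """
    return sum(scores.values()) / len(scores)
```

With only 100 proteins the estimate is coarse, so small differences in accuracy between databases should be read with caution.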
Problems
- Subsystems are biased!
- Subsystems are inaccurate!
- Merging annotations between different groups is political/psychological, not technical!
- E. coli
– 4,391 genes
– 4,288 genes that make proteins (pegs)
– 676 genes that make enzymes
15% of genes encode enzymes!
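The 15% figure can be checked from the three counts on the slide. The short calculation below shows that it holds for either denominator, all genes or protein-encoding genes only; the variable names are just labels for this example.

```python
total_genes = 4391
protein_genes = 4288
enzyme_genes = 676

# Share of enzyme-encoding genes among all genes and among pegs.
share_of_all = enzyme_genes / total_genes      # about 0.154
share_of_pegs = enzyme_genes / protein_genes   # about 0.158
```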
The SEED Family
www.nmpdr.org www.theseed.org
Three level “hierarchy”
- Amino Acids and Derivatives
– Alanine, serine, and glycine
- Serine Biosynthesis
- Amino Acids and Derivatives
– Lysine, threonine, methionine, and cysteine
- Methionine Biosynthesis
Make your own subsystems!
About 2,500 Subsystems
Classification | # SS
Experimental Subsystems | 498
Clustering-based subsystems | 352
Carbohydrates | 160
Cofactors, Vitamins, Prosthetic Groups, Pigments | 123
Amino Acids and Derivatives | 96
Protein Metabolism | 95
Virulence, Disease, Defense | 70
Miscellaneous | 70
RNA Metabolism | 65
Membrane Transport | 65
Respiration | 62
Cell Wall and Capsule | 62
Fatty Acids, Lipids, and | 60
Regulation and Cell signaling | 51
Virulence | 49
Stress Response | 43
DNA Metabolism | 41
Aromatic Compounds | 38
Phages | 36
Secondary Metabolism | 34
Iron acquisition and metabolism | 31
Nucleosides and Nucleotides | 24
Sulfur Metabolism | 20
Dormancy and Sporulation | 17
Plant-prokaryote | 12
Nitrogen Metabolism | 12
Motility and Chemotaxis | 11
Plant cell walls and outer surfaces | 10
Phages | 10
Cell Division and Cell Cycle | 10
Photosynthesis | 9
Metabolite damage | 8
Phosphorus Metabolism | 7
Potassium metabolism | 4
Transcriptional regulation | 2
Plasmids | 2
Central metabolism | 2
Autotrophy | 2
Arabinose Transport | 1
- http://rast.nmpdr.org
- Rapid Annotation using Subsystem Technology
- Started in 2008
- Designed for annotating bacterial and archaeal genomes
- As of Monday, May 11th: 248,822 annotation jobs
- 19,918 registered users
- Find the phylogenetic neighborhood of your genome
- Look for proteins that related organisms have
– Core proteins
– Subset of all subsystems
- Use those calls as a training set for CRITICA/GLIMMER
– Intrinsic training set!
The annotation process (complete genomes)
This one’s for Gary
- Subsystem, GO, and KEGG connections
– KEGG EC numbers
– KEGG reaction numbers
– SEED reaction numbers (Chris Henry)
- Metabolic flux models
– Automatically generate FBA matrices (Aaron Best/Matt DeJongh; Hope College)
Automatic metabolic reconstruction
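The core of an FBA matrix is the stoichiometric matrix, with one row per metabolite and one column per reaction. The sketch below builds one from a small dictionary of reactions. It is not the SEED model-building pipeline: the `stoichiometric_matrix` function and the reaction IDs are invented, the two reactions are written from the HutH and HutU role names in the histidine degradation subsystem, and the flux optimization itself (a linear program over this matrix) is not shown.

```python
def stoichiometric_matrix(reactions):
    """Return (metabolites, reaction_ids, S) for a dict of reactions.

    `reactions` maps a reaction ID to {metabolite: coefficient}, with negative
    coefficients for substrates and positive ones for products. S[i][j] is the
    coefficient of metabolite i in reaction j.
    """
    reaction_ids = list(reactions)
    metabolites = sorted({m for stoich in reactions.values() for m in stoich})
    matrix = [
        [reactions[r].get(m, 0) for r in reaction_ids]
        for m in metabolites
    ]
    return metabolites, reaction_ids, matrix


# Two steps of histidine degradation. Reaction IDs are invented, and water is
# left out to keep the example small.
reactions = {
    "R_HutH": {"histidine": -1, "urocanate": 1, "ammonia": 1},
    "R_HutU": {"urocanate": -1, "imidazolone_propionate": 1},
}
metabolites, reaction_ids, S = stoichiometric_matrix(reactions)
```

Each annotated gene that maps to a reaction contributes a column, which is why gaps in annotation show up as gaps in the model.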
The Populated Subsystem
Automatically compare metabolic reconstructions
- Rapidly correct missing annotations
- Add more members to subsystems
- Improves future genome annotations! (especially with new subsystems)
Find and suggest candidate functions
- 10 genomes submitted on Thursday at 6 pm
- First annotation complete before 8 am Friday
- Remaining annotations completed Friday before noon
- (there were others in the pipeline too!)
- Presentation ASM 2009 Tuesday, 8pm
The Live ASM Test
Philadelphia, 2009
Genome | Percent of Proteins in Subsystems
Haloferax denitrificans | 20%
Haloferax mediterranei | 19%
Haloferax sulfurifontis | 19%
Haloferax volcanii DS2 | 19%
Haloarcula sp 33800 | 19%
Haloarcula sp 33799 | 18%
Subsystems coverage of sequenced Archaea
Phage talk Work by Sajia Akhter
Haloferax sulfurifontis prophage
Prophages
- Metagenomics RAST had 300 public metagenomes
- Compared using tblastx
Comparing complete genomes to metagenomes
Human Poop
Thanks Nick Celms, Beltran Rodriguez-Mueller, Mya Breitbart, & Forest Rohwer
High Salinity Salterns
San Diego, July 2004
[Figure panels: low salinity salterns and high salinity salterns, July 2004 and Nov 2005]
RAST usage grows...
RAST coverage....
RASTtk
- RAST2.0
- Customizable choice of pipelines to run
- Same behind the scenes infrastructure