IGS Annotation Engine and Manatee - Michelle Gwinn Giglio

IGS Annotation Engine and Manatee
Michelle Gwinn Giglio
Pathway Tools Workshop
October 2010

IGS Annotation Engine
• A free service to anyone with a prokaryotic sequence they wish to annotate that provides:
  – Automated output of the IGS prokaryotic annotation pipeline
  – The Manatee curation tool
• Can be used with complete or draft genomes

The need for services like the AE
[Figure: workflow diagrams combining the stages Sequence Generation, Automatic Annotation, Manual Annotation, and Further Analysis]

More is on the way!!!
Third Generation of Sequencing Technology
Poised to provide insane amounts of sequence data.

Annotation Engine web page
http://ae.igs.umaryland.edu

IGS Annotation Engine Growth
• Current stats (from the two years of the project at IGS)
  – Submitters: 90
    • From all over the United States and 17 other countries
  – Users: >> 90
  – Genomes/sequences: >225

Data Flow
[Figure: pipeline diagram with the following labeled stages]
• DNA sequence
• Gene prediction with Glimmer3
• Predicted coding genes
• Automated start site and gene overlap correction
• Protein translation
• RNA finding: tRNAscan-SE, RNAmmer, similarity searches
• Predicted RNA genes
• Searches:
  – Pairwise BER searches against UniRef100
  – HMM searches against Pfam and TIGRFAMs
  – Motif searches with LipoP, TMHMM, PROSITE
  – NCBI COGs
  – Priam profiles
• Automatic annotation using the evidence hierarchy of pFunc
• MySQL database using the Chado schema
• Flat files of annotation information
• Manatee

Sequence-based searches
• Pairwise protein alignments
• HMM searches
• Motif searches
  – PROSITE
  – TMHMM
  – SignalP
  – LipoP
• COGs
• Priam profiles

Blast-Extend-Repraze (BER)
• a pairwise alignment tool
• initial BLAST with liberal cutoff for each protein in the genome
• modified Smith-Waterman alignment generated between search protein and each BLAST result
• result is a file containing one pairwise alignment for each match protein from the BLAST
• view alignments in our Manatee annotation tool
• we do the 2-step process because BLAST is fast and Smith-Waterman is slow, so it saves CPU time to only do the Smith-Waterman alignments on things that have any hope of matching
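The two-step idea (cheap filter first, expensive alignment only on survivors) can be sketched in a few lines. This is a minimal illustration, not the BER implementation: the shared k-mer count is a stand-in for the liberal BLAST step, the alignment is plain Smith-Waterman with a linear gap penalty rather than BER's modified version, and every function name, scoring value, and sequence is made up for the example.

```python
def shared_kmers(a, b, k=3):
    """Cheap prefilter (stand-in for BLAST): count distinct k-mers both sequences share."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score; plain Smith-Waterman with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def two_step_search(query, database, min_shared=1):
    """Run the slow alignment only on candidates that pass the liberal prefilter."""
    hits = {}
    for name, seq in database.items():
        if shared_kmers(query, seq) >= min_shared:
            hits[name] = smith_waterman_score(query, seq)
    return hits


# Toy sequences, invented for illustration only.
toy_db = {"close": "MKTAYIAKQR", "far": "GGGGGGGGGG", "partial": "XXTAYIAXX"}
hits = two_step_search("MKTAYIAKQR", toy_db)
```

Here the "far" sequence shares no 3-mer with the query, so it never reaches the quadratic-time alignment step; that skipped work is the CPU saving the slide describes.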

HMMs
• Our Hidden Markov Model database consists of TIGRFAMs and Pfam
• An HMM is a statistical model of the patterns of amino acids in a multiple alignment of proteins (called the “seed”) which share sequence and functional similarity
• Each TIGRFAM HMM is assigned to a category which describes the type of relationship the proteins in the model have to each other
  – equivalog
  – superfamily
  – subfamily
  – domain
• When proteins are searched against HMMs, they receive a score indicating how well they match the model
• By comparing this score to the cutoff scores assigned to each model, one can determine whether or not the search protein is a member of the group defined by the HMM
  – “trusted cutoff”: proteins scoring above this score are considered a member of the group defined by the HMM
  – “noise cutoff”: proteins scoring below this score are considered NOT to be a member of the group defined by the HMM
  – for proteins scoring between trusted and noise, the HMM evidence is not sufficient to determine whether the protein is a member of the functional group or not

Annotation is attached to HMMs
• TIGR00433
  – category: equivalog
  – name: biotin synthase
  – EC: 2.8.1.6
  – gene symbol: bioB
  – GO terms: GO:0004076 biotin synthase activity; GO:0009102 biotin biosynthesis
• PF04055
  – category: domain
  – name: radical SAM domain protein
  – EC: not applicable
  – gene symbol: not applicable
  – GO terms: GO:0003824 catalytic activity; GO:0008152 metabolism
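One way to picture annotation attached to HMMs is as a small lookup table keyed by accession. The sketch below stores the two records from this slide and picks the most specific one when a protein hits both. The record layout, the `most_specific` helper, and in particular the category ranking are assumptions made for illustration; the slides do not state how categories are ordered in the real pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HmmAnnotation:
    accession: str
    category: str
    name: str
    ec: Optional[str]            # None where the slide says "not applicable"
    gene_symbol: Optional[str]
    go_terms: List[str]


# Values copied from the slide.
HMM_ANNOTATIONS = {
    "TIGR00433": HmmAnnotation("TIGR00433", "equivalog", "biotin synthase",
                               "2.8.1.6", "bioB", ["GO:0004076", "GO:0009102"]),
    "PF04055": HmmAnnotation("PF04055", "domain", "radical SAM domain protein",
                             None, None, ["GO:0003824", "GO:0008152"]),
}

# ASSUMPTION: lower rank means more specific. This ordering is illustrative only.
CATEGORY_RANK = {"equivalog": 0, "subfamily": 1, "superfamily": 2, "domain": 3}


def most_specific(accessions):
    """Return the annotation record with the most specific category among the hits."""
    records = [HMM_ANNOTATIONS[acc] for acc in accessions]
    return min(records, key=lambda rec: CATEGORY_RANK[rec.category])


best = most_specific(["PF04055", "TIGR00433"])
```

Under this assumed ranking, a protein matching both models would take its name, EC number, and gene symbol from the equivalog record rather than from the broader domain record.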

Evaluating HMM scores
• Above trusted: the protein is a member of the family the HMM models
• Below noise: the protein is not a member of the family the HMM models
• In between noise and trusted: the protein MAY be a member of the family the HMM models
[Figure: each case is shown as a position on a score scale running from 0 to 100]
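The three outcomes above amount to a simple comparison against two per-model thresholds. The following is a minimal sketch: the `HmmModel` class, the accession, and the cutoff values are hypothetical, and treating a score exactly equal to a cutoff as undecided is an assumption, since the slides only say "above" and "below".

```python
from dataclasses import dataclass


@dataclass
class HmmModel:
    accession: str          # hypothetical identifier, not a real model
    trusted_cutoff: float
    noise_cutoff: float


def classify_hit(model, score):
    """Map a protein's score against one HMM to one of three outcomes."""
    if score > model.trusted_cutoff:
        return "member"
    if score < model.noise_cutoff:
        return "not a member"
    # Between the cutoffs (or exactly on one, by assumption): evidence is insufficient.
    return "uncertain"


# Invented cutoffs on the 0 to 100 scale used in the slide's illustration.
example_model = HmmModel("EXAMPLE0001", trusted_cutoff=50.0, noise_cutoff=20.0)
results = [classify_hit(example_model, s) for s in (75.0, 35.0, 10.0)]
```

Keeping "uncertain" as an explicit third value, rather than folding it into "not a member", lets a curator see which proteins deserve a manual look.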

[Figure: the data flow diagram is shown again]

The Pitfalls of Transitive Annotation
Protein A ~ Protein B ~ Protein C ~ Protein D
But is Protein A similar to Protein D? If not, a transitive annotation error has occurred. To prevent, or at least minimize, such errors, we require that a match protein be “trusted” if specific functional annotations are made from it.
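The "trusted" requirement can be read as a gate on annotation transfer: a function is copied only from a trusted match, and a protein that merely inherited its annotation does not itself become trusted. The sketch below illustrates that reading. The `Protein` class, the `trusted` flag, and the function name are hypothetical, and how IGS actually decides that a protein is trusted is not described in the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Protein:
    name: str
    function: Optional[str] = None
    trusted: bool = False   # ASSUMPTION: set by curators, never inherited


def annotate_from_match(query, match):
    """Copy a specific function to the query only when the match protein is trusted."""
    if match.trusted and match.function:
        query.function = match.function
    return query


# Protein A is the only trusted source in this toy chain.
a = Protein("Protein A", function="biotin synthase", trusted=True)
b = annotate_from_match(Protein("Protein B"), a)   # allowed: A is trusted
c = annotate_from_match(Protein("Protein C"), b)   # blocked: B only inherited its name
```

Because B stays untrusted, the chain A to B to C cannot carry A's function any further, which is the transitive error the slide warns about.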

Prokaryotic protein functional prediction (pFunc)

Protein names are adjusted to reflect functional confidence/specificity
• High confidence in specific function
  – “adenylosuccinate lyase” with EC/gene symbol
• General knowledge of function or subfamily
  – “carbohydrate kinase, FGGY family”
• Family/Domain membership
  – “cbbY family protein”
• Hypotheticals
  – “hypothetical protein”
  – “conserved hypothetical protein”
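The naming tiers above can be sketched as a function from an evidence level to a name string. This is illustrative only: the level labels, the parameter names, and the rule that "conserved" is added when the protein has matches in other genomes are assumptions, not the documented pFunc logic.

```python
def protein_name(level, base_name=None, family=None, has_matches=False):
    """Build a protein name whose wording reflects the confidence of the evidence.

    level is one of "specific", "general", "family", or "none" (hypothetical labels).
    """
    if level == "specific":
        # High confidence: use the specific name (EC and gene symbol are stored separately).
        return base_name
    if level == "general":
        return f"{base_name}, {family} family"
    if level == "family":
        return f"{family} family protein"
    # No functional evidence. ASSUMPTION: "conserved" marks proteins that still
    # have similarity matches to proteins in other genomes.
    return "conserved hypothetical protein" if has_matches else "hypothetical protein"


names = [
    protein_name("specific", base_name="adenylosuccinate lyase"),
    protein_name("general", base_name="carbohydrate kinase", family="FGGY"),
    protein_name("family", family="cbbY"),
    protein_name("none"),
    protein_name("none", has_matches=True),
]
```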

Options for Data Access
• Option 1
  – We place a MySQL version of your database and files onto an ftp site. You download it and Manatee for local installation.
• Option 2
  – Your database resides at IGS. We provide you a password-protected account to Manatee installed at IGS.
  – By far the most popular option.
• Option 3
  – File downloads
    • gff3
    • gbk
    • Simple tab-delimited with functional information
    • Multifasta protein/nucleotide

manatee.sourceforge.net 

Pathway Tools
• All AE genomes now get Pathway Tools analysis
• A PGDB is created for each genome
• The PGDB is available to the users via a protected web site
• We are just beginning to form links between Manatee and the PGDBs

Future directions
• We are working on grant renewal now
  – Just entered our 4th and last year of the current grant
• We plan several more enhancements
  – More search options in Manatee
  – More customizable download/viewing options
  – Incorporation of new datatypes such as RNAseq
• Integration with other tools
  – Artemis
  – Apollo
  – IGS resources
    • Sybil
    • Mummer-remap

Future directions of Annotation Engine and Pathway Tools
• Communication between Manatee/PGDBs
  – Lists of/links to pathways on Manatee GCPs
  – Links to pathways from Manatee GCPs
• Use PT analysis to inform automatic annotation process in an iterative fashion
• Changes in Manatee propagate to PGDB and back again, automatic refresh of pathway predictions.

http://gscid.igs.umaryland.edu
IGS Genomics Workshop - 4 times per year
http://ae/cgi/workshop_info.cgi
Topics
- sequencing
- gene finding (prok and euk)
- functional annotation
- Gene Ontology
- Manatee demo and hands-on
- comparative genomics, Sybil demo
- Artemis demo
- expression analysis
- metagenomics
- Human Microbiome Project
- databases
- pipeline management
Please check out the IGS careers page at: http://www.igs.umaryland.edu

Acknowledgements
• Kevin Galens, Joshua Orvis
• Todd Creasy
• Sean Daugherty, Heather Creasy
• Jennifer Wortman, Anup Mahurkar
• Tanja Davidsen, Owen White
• Especially:      for funding this project
