IGS Annotation Engine and Manatee - Michelle Gwinn Giglio



slide-1
SLIDE 1

IGS Annotation Engine and Manatee

Michelle Gwinn Giglio
Pathway Tools Workshop
October 2010


slide-2
SLIDE 2

IGS Annotation Engine

  • A free service to anyone with a prokaryotic sequence they wish to annotate that provides:

– Automated output of the IGS prokaryotic annotation pipeline
– The Manatee curation tool

  • Can be used with complete or draft genomes

slide-3
SLIDE 3

The need for services like the AE

[Diagram with the labels: Sequence Generation, Manual Annotation, Automatic Annotation, Further Analysis]


slide-4
SLIDE 4

More is on the way!!!

Third generation of sequencing technology poised to provide insane amounts of sequence data.



slide-5
SLIDE 5

Annotation Engine web page

http://ae.igs.umaryland.edu



slide-6
SLIDE 6

IGS Annotation Engine Growth

  • Current stats (from the two years of the project at IGS)

– Submitters: 90

      • From all over the United States and 17 other countries

– Users: >> 90
– Genomes/sequences: >225


slide-7
SLIDE 7

Data Flow

[Diagram: annotation pipeline data flow, with the following labels]

  • DNA sequence
  • Gene prediction with Glimmer3, giving predicted protein coding genes
  • RNA finding with tRNAscan-SE, RNAmmer, and similarity searches, giving predicted RNA genes
  • Automated start site and gene overlap correction
  • Translation
  • Searches:

– Pairwise BER searches against UniRef100
– HMM searches against Pfam and TIGRFAM
– Motif searches with LipoP, TMHMM, PROSITE
– NCBI COGs
– Priam profiles

  • Automatic annotation using the evidence hierarchy of pFunc
  • MySQL database using the Chado schema
  • Manatee
  • Flat files of annotation information


slide-8
SLIDE 8

Sequence-based searches

  • Pairwise protein alignments
  • HMM searches
  • Motif searches

– PROSITE
– TMHMM
– SignalP
– LipoP

  • COGs
  • Priam profiles

slide-9
SLIDE 9

Blast-Extend-Repraze (BER)

  • a pairwise alignment tool
  • initial BLAST with liberal cutoff for each protein in the genome
  • modified Smith-Waterman alignment generated between search protein and each BLAST result
  • result is a file containing one pairwise alignment for each match protein from the BLAST
  • view alignments in our Manatee annotation tool
  • we do the 2-step process because BLAST is fast and Smith-Waterman is slow, so it saves CPU time to only do the Smith-Waterman alignments on things that have any hope of matching
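
The two-step idea on this slide can be sketched in Python. This is only an illustration, not the BER implementation: a shared k-mer check stands in for the liberal BLAST pass, a plain Smith-Waterman score (not BER's modified version) plays the slow step, and the function names, scoring values, and toy sequences are all assumptions made for the example.

```python
def shares_kmer(query, subject, k=3):
    """Cheap prefilter standing in for the liberal BLAST pass (an assumption)."""
    kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    return any(subject[i:i + k] in kmers for i in range(len(subject) - k + 1))


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (plain Smith-Waterman)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def two_step_search(query, database):
    """Run the slow alignment only on subjects that pass the fast prefilter."""
    hits = {}
    for name, subject in database.items():
        if shares_kmer(query, subject):
            hits[name] = smith_waterman_score(query, subject)
    return hits


# Toy sequences, invented for illustration only.
database = {
    "protA": "MKTAYIAKQRQISFVK",
    "protB": "GGGGGGGGGG",
}
print(two_step_search("MKTAYIAKQR", database))
```

Only `protA` reaches the quadratic-time alignment step; `protB` is discarded by the cheap check, which is where the CPU saving comes from.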

slide-10
SLIDE 10
slide-11
SLIDE 11
slide-12
SLIDE 12
slide-13
SLIDE 13
slide-14
SLIDE 14

HMMs

  • Our Hidden Markov Model database consists of TIGRFAMs and Pfam
  • An HMM is a statistical model of the patterns of amino acids in a multiple alignment of proteins (called the “seed”) which share sequence and functional similarity
  • Each TIGRFAM HMM is assigned to a category which describes the type of relationship the proteins in the model have to each other

– equivalog
– superfamily
– subfamily
– domain

  • One can search proteins against HMMs; each protein receives a score indicating how well it matches the model
  • By comparing this score to the cutoff scores assigned to each model, one can determine whether or not the search protein is a member of the group defined by the HMM

– “trusted cutoff”: proteins scoring above this score are considered members of the group defined by the HMM
– “noise cutoff”: proteins scoring below this score are considered NOT to be members of the group defined by the HMM
– for proteins scoring between trusted and noise, the HMM evidence is not sufficient to determine whether the protein is a member of the functional group or not
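
The cutoff logic on this slide can be written as a small Python function. This is a minimal sketch: the function name, the return labels, and the cutoff values in the usage lines are hypothetical, and treating a score exactly equal to a cutoff as "in between" is an assumption that follows the slide's "above" and "below" wording literally.

```python
def classify_hmm_hit(score, trusted_cutoff, noise_cutoff):
    """Interpret an HMM score against a model's trusted and noise cutoffs."""
    if noise_cutoff > trusted_cutoff:
        raise ValueError("noise cutoff should not exceed trusted cutoff")
    if score > trusted_cutoff:
        return "member"
    if score < noise_cutoff:
        return "not a member"
    # Between noise and trusted: HMM evidence alone is not sufficient.
    return "insufficient evidence"


# Hypothetical cutoffs and scores, for illustration only.
for score in (250.0, 150.0, 40.0):
    print(score, classify_hmm_hit(score, trusted_cutoff=200.0, noise_cutoff=100.0))
```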

slide-15
SLIDE 15

Annotation is attached to HMMs

  • TIGR00433

– category: equivalog
– name: biotin synthase
– EC: 2.8.1.6
– gene symbol: bioB
– GO terms: GO:0004076 biotin synthase activity; GO:0009102 biotin biosynthesis

  • PF04055

– category: domain
– name: radical SAM domain protein
– EC: not applicable
– gene symbol: not applicable
– GO terms: GO:0003824 catalytic activity; GO:0008152 metabolism

slide-16
SLIDE 16

Evaluating HMM scores

  • above trusted: the protein is a member of the family the HMM models
  • below noise: the protein is not a member of the family the HMM models
  • in between noise and trusted: the protein MAY be a member of the family the HMM models

slide-17
SLIDE 17

[Diagram: annotation pipeline data flow, with the following labels]

  • DNA sequence
  • Gene prediction with Glimmer, giving predicted protein coding genes
  • RNA finding with tRNAscan, RNAmmer, and homology searches, giving predicted RNA genes
  • Automated start site and gene overlap correction
  • Translation
  • Searches:

– Pairwise BER searches against UniRef100
– HMM searches against Pfam and TIGRFAM
– Motif searches with LipoP, TMHMM, PROSITE
– NCBI COGs
– Priam profiles

  • Automatic annotation using the evidence hierarchy of pFunc
  • MySQL database using the Chado schema
  • Manatee
  • Flat files of annotation information


slide-18
SLIDE 18

The Pitfalls of Transitive Annotation

Protein A ~ Protein B ~ Protein C ~ Protein D

But, is Protein A similar to Protein D?

If not, a transitive annotation error has occurred. To prevent, or at least minimize, such errors we require that a match protein be “trusted” if specific functional annotations are made from it.
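
The "trusted match" rule can be illustrated with a short Python sketch. The slide does not say how a match protein becomes trusted, so the `trusted` flag, the class and function names, and the fallback name below are all assumptions made for the example; matches are assumed to be ordered best-first.

```python
from dataclasses import dataclass


@dataclass
class MatchProtein:
    accession: str
    name: str
    trusted: bool  # assumption: a flag marking a trusted match protein


def name_from_matches(matches, default="hypothetical protein"):
    """Return the name of the first trusted match, else a generic default."""
    for match in matches:
        if match.trusted:
            return match.name
    # No trusted match: do not propagate a specific name transitively.
    return default


# Hypothetical matches, for illustration only.
matches = [
    MatchProtein("X1", "biotin synthase", trusted=False),
    MatchProtein("X2", "radical SAM domain protein", trusted=True),
]
print(name_from_matches(matches))
```

The untrusted best match is skipped even though its name is more specific, which is the point of the rule: a specific name is copied only from a protein whose own annotation can be relied on.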

slide-19
SLIDE 19

prokaryotic protein functional prediction (pFunc)


slide-20
SLIDE 20

Protein names are adjusted to reflect functional confidence/specificity

  • High confidence in specific function

– “adenylosuccinate lyase” with EC/gene symbol

  • General knowledge of function or subfamily

– “carbohydrate kinase”, FGGY family

  • Family/Domain membership

– “cbbY family protein”

  • Hypotheticals

– “hypothetical protein”
– “conserved hypothetical protein”
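
The naming tiers on this slide can be sketched as a lookup in Python. This is not pFunc itself: the enum, the function name, and the rule that EC number and gene symbol are kept only for the top tier are assumptions made for the example. The biotin synthase values are the ones shown on the TIGR00433 slide.

```python
from enum import Enum


class Confidence(Enum):
    SPECIFIC_FUNCTION = "high confidence in specific function"
    GENERAL_FUNCTION = "general knowledge of function or subfamily"
    FAMILY_OR_DOMAIN = "family/domain membership"
    CONSERVED_HYPOTHETICAL = "conserved hypothetical"
    HYPOTHETICAL = "hypothetical"


def adjust_name(confidence, label="", ec=None, gene_symbol=None):
    """Return a name record whose specificity matches the confidence tier."""
    if confidence is Confidence.SPECIFIC_FUNCTION:
        return {"name": label, "ec": ec, "gene_symbol": gene_symbol}
    if confidence is Confidence.GENERAL_FUNCTION:
        name = label
    elif confidence is Confidence.FAMILY_OR_DOMAIN:
        name = f"{label} family protein"
    elif confidence is Confidence.CONSERVED_HYPOTHETICAL:
        name = "conserved hypothetical protein"
    else:
        name = "hypothetical protein"
    # Assumption: EC number and gene symbol are kept only for the top tier.
    return {"name": name, "ec": None, "gene_symbol": None}


print(adjust_name(Confidence.SPECIFIC_FUNCTION, "biotin synthase", "2.8.1.6", "bioB"))
print(adjust_name(Confidence.FAMILY_OR_DOMAIN, "cbbY"))
print(adjust_name(Confidence.HYPOTHETICAL))
```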


slide-21
SLIDE 21

Options for Data Access

  • Option 1

– We place a MySQL version of your database and files onto an ftp site. You download it and Manatee for local installation.

  • Option 2

– Your database resides at IGS. We provide you a password-protected account to Manatee installed at IGS.
– By far the most popular option.

  • Option 3

– File downloads

      • gff3
      • gbk
      • Simple tab-delimited with functional information
      • Multifasta protein/nucleotide
slide-22
SLIDE 22

manatee.sourceforge.net


slide-23
SLIDE 23
slide-24
SLIDE 24
slide-25
SLIDE 25
slide-26
SLIDE 26
slide-27
SLIDE 27
slide-28
SLIDE 28
slide-29
SLIDE 29
slide-30
SLIDE 30

Pathway Tools

  • All AE genomes now get Pathway Tools analysis
  • A PGDB is created for each genome
  • The PGDB is available to the users via protected web site
  • We are just beginning to form links between Manatee and the PGDBs


slide-31
SLIDE 31
slide-32
SLIDE 32

Future directions

  • We are working on grant renewal now

– Just entered our 4th and last year of the current grant

  • We plan several more enhancements

– More search options in Manatee
– More customizable download/viewing options
– Incorporation of new datatypes such as RNAseq

  • Integration with other tools

– Artemis
– Apollo
– IGS resources

      • Sybil
      • Mummer-remap

slide-33
SLIDE 33

Future directions of Annotation Engine and Pathway Tools

  • Communication between Manatee/PGDBs

– Lists of/links to pathways on Manatee GCPs
– Links to pathways from Manatee GCPs

  • Use PT analysis to inform automatic annotation process in an iterative fashion
  • Changes in Manatee propagate to PGDB and back again, automatic refresh of pathway predictions.


slide-34
SLIDE 34

IGS Genomics Workshop - 4 times per year

http://ae/cgi/workshop_info.cgi

Topics
- sequencing
- gene finding (prok and euk)
- functional annotation
- Gene Ontology
- Manatee demo and hands-on
- comparative genomics, Sybil demo
- Artemis demo
- expression analysis
- metagenomics
- Human Microbiome Project
- databases
- pipeline management

http://gscid.igs.umaryland.edu

Please check out the IGS careers page at: http://www.igs.umaryland.edu


slide-35
SLIDE 35

Acknowledgements

  • Kevin Galens, Joshua Orvis
  • Todd Creasy
  • Sean Daugherty, Heather Creasy
  • Jennifer Wortman, Anup Mahurkar
  • Tanja Davidsen, Owen White
  • Especially: [logo] for funding this project