SLIDE 1

Martin Gollery
Associate Director of Bioinformatics
University of Nevada, Reno
Mgollery@unr.edu

Developing and Using Special Purpose Hidden Markov Model Databases

SLIDE 2

Today’s Tutorial

Instructor: Martin Gollery
Associate Director of Bioinformatics, University of Nevada, Reno

Consultant to several organizations
Formerly with TimeLogic
Developed several HMM databases

SLIDE 3

Hidden Markov Models

What HMMs are
Which HMM programs are commonly used
What HMM databases are available
Why you would use one DB over another
Integrated Resources- InterPro and more
How you can build your own HMM DB
Problems with building your own
Live demonstration

SLIDE 4

Hidden Markov Models: What are they, anyway?

Statistical description of a protein family's consensus sequence

Conserved regions receive highest scores
Can be seen as a Finite State Machine

SLIDE 5

Representation of Family Members

yciH     KDGII
ZyciH    KDGVI
VCA0570  KDGDI
HI1225   KNGII
sll0546  KEDCV

Residue frequency at each position:

Pos   V     N     K     I     G     E     D     C
 5   0.2               0.8
 4   0.2               0.4               0.2   0.2
 3                           0.8         0.2
 2         0.2                     0.2   0.6
 1               1.0
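The per-position frequencies on this slide can be recomputed directly from the five aligned sequences. The following is a minimal sketch, not code from the talk; the function name `column_frequencies` is made up for illustration.

```python
from collections import Counter

# The five aligned family members shown on the slide.
ALIGNMENT = {
    "yciH": "KDGII",
    "ZyciH": "KDGVI",
    "VCA0570": "KDGDI",
    "HI1225": "KNGII",
    "sll0546": "KEDCV",
}


def column_frequencies(sequences):
    """Return one {residue: frequency} dict per alignment column."""
    seqs = list(sequences)
    n = len(seqs)
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*seqs)
    ]


freqs = column_frequencies(ALIGNMENT.values())
for position, column in enumerate(freqs, start=1):
    print(position, column)
```

Position 1 contains only K, while position 4 spreads its weight over I, V, D and C, which is what the table shows.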

SLIDE 6

Representation of Gaps in Family Members

yciH     KDGII
ZyciH    KDGVI
VCA0570  KDGDI
HI1225   KNGII
sll0546  KED-V

Residue frequency at each position (gap = no residue at that position):

Pos   V     N     K     I     G     E     D     C    gap
 5   0.2               0.8
 4   0.2               0.4               0.2         0.2
 3                           0.8         0.2
 2         0.2                     0.2   0.6
 1               1.0

SLIDE 7

For Maximum Sensitivity

[Figure: the position-specific probability table for the gapped alignment, as on the previous slide]

No residue at any position should have a zero probability, even if it was not seen in the training data.
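One simple way to guarantee nonzero probabilities is to add a pseudocount to every residue before normalizing. The sketch below uses add-one (Laplace) smoothing purely as an illustration; this is an assumption on my part, since HMM packages typically use more refined priors such as Dirichlet mixtures. The name `smoothed_column` is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def smoothed_column(column, pseudocount=1.0):
    """Emission probabilities for one alignment column with add-k smoothing.

    Every residue gets `pseudocount` extra observations, so residues
    never seen in the training data still receive a small probability.
    """
    counts = Counter(column)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


# Fourth column of the ungapped example alignment: I, V, D, I, C.
probs = smoothed_column("IVDIC")
print(probs["I"], probs["W"])
```

With five observed residues and one pseudocount for each of the 20 amino acids, an observed I scores 3/25 and an unobserved W still scores 1/25 instead of zero.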

SLIDE 8

Start with an MSA…

CLUSTAL W (1.7) multiple sequence alignment
yciH     KDGVIEIQGDKRDLLKSLLEAKGMKVKLAGG
ZyciH    KDGVIEIQGDKRDLLKSLLEAKGMKVKLAGG
VCA0570  KDGDIEIQGDVRDQLKTLLESKGHKVKLAGG
HI1225   KNGIIEIQGEKRDLLKQLLEQKGFKVKLSGG
sll0546  KEDCVEIQGDQREKILAYLLKQGYKAKISGG
PA4840   KDGVVEIQGEHVELLIDELLKRGFKAKKSGG
AF0914   KNGVIELQGNHVNRVKELLIKKGFNPERIKT
*:. :*:**: : : * :* : :

SLIDE 9

Hidden Markov Models

HMMER2.0
NAME example2
DESC Small example for demonstration purposes
LENG 31
ALPH Amino
COM hmmbuild example2 example2.aln
NSEQ 7
DATE Wed Jan 08 13:33:06 2003
HMM A C D E F G H I K …

1 -3217 -3413 -3082 -2664 -4291 -3257 -2104 -4231 3883…
2 -1938 -3859 2747 1592 -4024 -1857 -1206 -3953 -1455…
3 -2160 -3144 1834 -953 -4284 3247 -2013 -4362 -2365…
4 -1255 2750 436 -2789 -1273 -2972 -2049 1510 -2543…
5 -2035 -1558 -4660 -4320 -2085 -4409 -4229 3081 -4224…
6 -3264 -3765 -1447 3822 -4535 -2948 -2636 -4814 -2810…
7 -2423 -1951 -4843 -4395 -1156 -4544 -3680 3291 -4151…
8 -3220 -3396 -2530 -2667 -3851 -3171 -2735 -4442 -2277…
9 -3196 -3194 -3915 -4259 -4867 3789 -4005 -5414 -4591…
10 -1923 -3837 2743 2134 -4005 -1854 -1196 -3929 -1434…
11 -999 -2164 -952 -353 -2483 -1909 3321 -2139 1730…
12 -1629 -1909 -2827 -2102 -2279 -2588 -1442 -1012 -488…

SLIDE 10

Emission Probabilities

What is the likelihood that sequence X was emitted by HMM Y?
Likelihood is calculated by multiplying the probability of each residue at each position by each of the transition probabilities (equivalently, by adding their log scores)
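Because the probabilities along a path multiply, scores are usually accumulated as sums of logarithms. The sketch below scores a sequence against a toy model built from the five-member example family. It rests on several assumptions of mine: add-one pseudocounts, a uniform background, no gaps, and transition probabilities left out (treated as 1). It is an illustration, not the scoring used by any particular HMM package.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
FAMILY = ["KDGII", "KDGVI", "KDGDI", "KNGII", "KEDCV"]


def build_model(seqs, pseudocount=1.0):
    """One smoothed emission distribution per alignment column."""
    model = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        model.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return model


def log_odds_score(model, seq, background=1 / len(AMINO_ACIDS)):
    """Sum of log2(emission / background) over the positions (in bits).

    Transitions are omitted, and `seq` is assumed to have the same
    length as the model.
    """
    return sum(
        math.log2(column[residue] / background)
        for column, residue in zip(model, seq)
    )


model = build_model(FAMILY)
print(log_odds_score(model, "KDGII"))  # a family member: positive score
print(log_odds_score(model, "WWWWW"))  # unrelated residues: negative score
```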

SLIDE 11

Plan7 from Outer Space

(Well, from St. Louis, anyway!)

SLIDE 12

HMMs vs BLAST

Position specific scoring vs. general matrix
Example:
– dDGVIvIddDKRDLLKSLiEAKkMKVKLAGG
– KDGVIEIQGDKRDLLKSLLEAKGMKVKLAGG
has 80% BLAST similarity, but misses highly conserved regions

Scoring emphasizes important locations
Clearer score cutoffs
However, it is MUCH slower!
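The example pair can be checked by hand. The sketch below counts identical positions, assuming (my reading of the slide) that the lowercase letters mark the substituted residues. Percent identity is only a rough proxy for the BLAST similarity quoted above.

```python
QUERY = "dDGVIvIddDKRDLLKSLiEAKkMKVKLAGG"
FAMILY_MEMBER = "KDGVIEIQGDKRDLLKSLLEAKGMKVKLAGG"


def percent_identity(a, b):
    """Percentage of positions with the same residue (case-insensitive)."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    matches = sum(x.upper() == y.upper() for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# 1-based positions where the two sequences differ.
mismatches = [
    i
    for i, (x, y) in enumerate(zip(QUERY, FAMILY_MEMBER), start=1)
    if x.upper() != y.upper()
]

print(round(percent_identity(QUERY, FAMILY_MEMBER), 1), mismatches)
```

A general substitution matrix weights these six mismatches the same wherever they fall. A position-specific model penalizes them heavily when they land on conserved columns.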

SLIDE 13

HMM programs

HMMer - Sean Eddy, Wash U
SAM - Haussler, UCSC
Wise tools - Birney, EBI
SledgeHMMer - Subramaniam, SDSC
Meta-MEME - Noble & Bailey
PSI-BLAST - NCBI
SPSpfam - Southwest Parallel Software
Ldhmmer - Logical Depth
DeCypherHMM - TimeLogic

SLIDE 14

What exactly do you want?

Are you searching thousands of sequences with one or a few models?
Use hmmsearch
Searching a few sequences with thousands of models?
Use hmmpfam
Thousands of sequences vs. Thousands of models?
Use an accelerator, if you do it very often
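The rule of thumb above can be written down as a small helper. This is a sketch only: the cutoff of 10 for "a few" is an arbitrary assumption, and the file names are placeholders. Both HMMER 2 programs take the HMM file first and the sequence file second.

```python
def choose_program(n_sequences, n_models, few=10):
    """Pick a search strategy from the sizes of the two inputs.

    `few` is an illustrative threshold, not a documented limit.
    """
    if n_models <= few:
        return "hmmsearch"    # one or a few models vs. many sequences
    if n_sequences <= few:
        return "hmmpfam"      # a few sequences vs. a model database
    return "accelerator"      # many vs. many: consider dedicated hardware


def build_command(program, hmm_file, seq_file):
    """Argument list for hmmsearch or hmmpfam (HMM file, then sequences)."""
    return [program, hmm_file, seq_file]


program = choose_program(n_sequences=5000, n_models=1)
print(build_command(program, "kinase.hmm", "proteome.fasta"))
```

The returned list can be passed to `subprocess.run` on a machine where HMMER is installed.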

SLIDE 15

HMM databases

PFAM
TIGRFAM
Superfamily
SMART
Panther
PRED-GPCR

SLIDE 16

HMM databases at the CFB

COGfam
KinFam
HydroHMMer
NVfam-pro
NVfam-arc
NVfam-fun
NVfam-pln

SLIDE 17

PFAM

From Sanger, WashU, KI, INRA
Version 17 has 7868 families
Most widely used HMM database
Good annotation team

SLIDE 18

PFAM

PFAM-A is hand curated
From high quality multiple Alignments
PFAM-B is built automatically from ProDom
Generated using the Domainer algorithm
ProDom is built from SP/TREMBL

SLIDE 19

PFAM

Pfam-ls = global alignments
Pfam-fs = local alignments, so that matches may include only part of the model
Both the –ls and –fs versions are local with respect to the sequence

SLIDE 20

PFAM

Note ‘type’ annotation
Labeled TP
Family
Domain
Repeat
Motif

SLIDE 21

TIGRFAMs

Available at (www.tigr.org/TIGRFAMs/)
Organized by functional role
Equivalogs: a set of homologous proteins that are conserved with respect to function since their last common ancestor
Equivalog domains: domains of conserved function

SLIDE 22

TIGRFAMs

2453 models in release 4.1
Complementary to PFAM, so run both
Part of the Comprehensive Microbial Resource (CMR)

SLIDE 23

TIGRFAMs

TIGRfam and PFAM alignments for Pyruvate carboxylase. The thin line represents the sequence. The bars represent hit regions.

SLIDE 24

SuperFamily

By Julian Gough, formerly MRC, now Riken GSC
www.supfam.org
Provides structural (and hence implied functional) assignments to protein sequences at the superfamily level
Built from SCOP (Structural Classification of Proteins) database, which is built from PDB
Available in HMMer, SAM, and PSI-BLAST formats

SLIDE 25

SuperFamily

1447 SCOP Superfamilies
Each represented by a group of HMMs
Over 8500 models total
Table provides comparison to GO, Interpro, PFAM

SLIDE 26

SMART

Simple Modular Architecture Research Tool
Version 3.4 contains 654 HMMs
Emphasis on mobile eukaryotic domains
smart.embl-heidelberg.de
Annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues

SLIDE 27

SMART

Use for signaling domains or extracellular domains

Normal and Genomic mode

SLIDE 28

PRED-GPCR

Papasaikas et al, U of Athens
265 HMMs in 67 GPCR families
Based on TiPs Pharmacological classification.
Filters with CAST
Signatures regularly updated
Entire system redone each year

SLIDE 29

PRED-GPCR webserver

SLIDE 30

Panther

Protein ANalysis THrough Evolutionary Relationships
Family and subfamily: families are evolutionarily related proteins; subfamilies are related proteins with the same function
Molecular function: the function of the protein by itself or with directly interacting proteins at a biochemical level, e.g. a protein kinase
Biological process: the function of the protein in the context of a larger network of proteins that interact to accomplish a process at the level of the cell or organism, e.g. mitosis.
Pathway: similar to biological process, but a pathway also explicitly specifies the relationships between the interacting molecules.

SLIDE 31

Panther

(Thomas et al., Genome Research 2003; Mi et al. NAR 2005)
6683 protein families
31,705 functionally distinct protein subfamilies.

SLIDE 32

Panther

Due to the size, searches could be slow
First, BLAST against consensus seqs
Then, search against models represented by those hits
With an accelerator, you don’t have to do that…
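The two-step strategy is a prefilter followed by a full search. In the sketch below, shared 3-mers stand in for the BLAST step and `toy_score` stands in for the HMM search; both are assumptions chosen to keep the example self-contained, and all names and sequences are made up.

```python
def kmers(seq, k=3):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prefilter(query, consensus_by_model, min_shared=2, k=3):
    """Cheap first pass: keep models whose consensus shares k-mers with the query."""
    query_kmers = kmers(query, k)
    return [
        model_id
        for model_id, consensus in consensus_by_model.items()
        if len(query_kmers & kmers(consensus, k)) >= min_shared
    ]


def two_stage_search(query, consensus_by_model, score_model, threshold):
    """Run the expensive scorer only on models that pass the prefilter."""
    candidates = prefilter(query, consensus_by_model)
    scores = {model_id: score_model(model_id, query) for model_id in candidates}
    return {m: s for m, s in scores.items() if s >= threshold}


# Toy data: two made-up consensus sequences and a dummy scorer.
CONSENSUS = {"famA": "KDGVIEIQGD", "famB": "MSTNPKPQRK"}
calls = []


def toy_score(model_id, query):
    calls.append(model_id)  # record which models reached the slow step
    return 42.0


hits = two_stage_search("KDGIIEIQGE", CONSENSUS, toy_score, threshold=10.0)
print(hits, calls)
```

Only `famA` reaches the slow scorer here. The trade-off is that a model whose consensus is missed by the prefilter is never searched at all.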

SLIDE 33

Panther

So, how does it perform?
I took 3451 Arabidopsis proteins with no hit to PFAM, Superfamily, SMART or TIGRfam

Ran it against Panther
Found 160 significant hits!

SLIDE 34

COG-HMMs

Clusters of Orthologous Groups of proteins
www.ncbi.nlm.nih.gov/cog/
Each COG is from at least 3 lineages
Ancient conserved domain
4873 alignments available
Alignments from NCBI, HMMs from me at mgollery@unr.edu

SLIDE 35

CDD

Conserved Domain Database (NCBI)
Psi-BLAST profiles are similar to HMMs
10991 PSSMs - SMART + COG + KOG + Pfam + CD

Runs with RPS-BLAST
Much faster searches

SLIDE 36

KinFam

KinFam models represent 53 different classes of protein kinases (PKs)

Assigns Kinase Class and Group
Based on Hanks’ classification scheme
Database is small, so searches are fast

SLIDE 37

KinFam

Categorizes Kinase data
Available for download from bioinformatics.unr.edu

RANK  SCORE   QF  TARGET|ACCESSION  E_VALUE   DESCRIPTION
1     852.93  1   KinFam||ptkgrp15  9.3e-256  Fibroblast GF recept
2     479.14  1   KinFam||ptkgrp14  3.1e-143  Platelet derived GF
3     423.33  1   KinFam||ptkother  1.9e-126  Other membrane-span

SLIDE 38

HydroHmmer

HydroHmmer finds LEAs, other hydrophilin classes
Small target size makes for very fast searches

SLIDE 39

NVFAMs

HMMs reflect the training data
Specific training sets provide better results
So… use Archaeal data to study Archaeons, Fungal data to study Fungi, etc.
Designed for use with PFAM, not stand alone

Recent redesign, name change

SLIDE 40

NVFAMs

NVFAM-pro used to study E. faecalis
Demonstrated higher scores, better alignments
However, PFAM had more total hits
P. falciparum used as negative control
PFAM showed better scores and alignments, as predicted
Automated design by Garrett Taylor; scripts are available!
Contact me for input, collaboration, or help to build your own

SLIDE 41

Which database to use?

One Comparison Test (Your results may vary…)

Compare 563 I. pini sequences to COGhmm, PFAM, PFAMfrag, SMART, TIGRfam, TIGRfamfrag, Superfamily

COGs- 9
PFAM- 22
PFAMfrag- 57
SMART- 4
Superfamily- 30
TIGRfam- 6
TIGRfamfrag- 12
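To compare the databases on a common scale, the counts can be expressed as a fraction of the 563 query sequences. The sketch below assumes each number is a count of query sequences with a hit, which the slide does not state explicitly.

```python
TOTAL_SEQUENCES = 563
HITS = {
    "COGs": 9,
    "PFAM": 22,
    "PFAMfrag": 57,
    "SMART": 4,
    "Superfamily": 30,
    "TIGRfam": 6,
    "TIGRfamfrag": 12,
}


def rate(database):
    """Percentage of the query set with a hit in `database`."""
    return 100.0 * HITS[database] / TOTAL_SEQUENCES


# Rank the databases from most to fewest hits.
ranked = sorted(HITS.items(), key=lambda item: item[1], reverse=True)
for name, count in ranked:
    print(f"{name:12s} {count:3d}  {rate(name):5.1f}%")
```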

SLIDE 42

Integrated Resources

InterProscan
MAGPIE
PANAL
Make your own!

SLIDE 43

InterPro

Database built from PFAM, Prints, Prosite, SuperFamily, ProDom, SMART, TIGRFAMs, PANTHER, PIRsf, Gene3D & SP/TrEMBL

Version 10.0
Nearly 12,000 entries
http://www.ebi.ac.uk/interpro/
InterProScan can be installed locally

SLIDE 44

InterProScan

Splits up big jobs & reassembles them
Works with SGE, PBS, LSF
A free analysis pipeline!
Provides GO mappings
Written in PERL, so it’s easy to modify
Average 4 min. per NT sequence per CPU

SLIDE 45

InterPro

InterPro release 10.0 contains 11972 entries, representing 3079 domains, 8597 families, 228 repeats, 27 active sites, 21 binding sites and 20 post-translational modification sites. Overall, there are 7521179 InterPro hits from 1466570 UniProt protein sequences. A complete list is available from the ftp site.

DATABASE             VERSION  ENTRIES
SWISS-PROT           46.5     180652
PRINTS               37.0     1850
TrEMBL               29.5     1689375
Pfam                 17.0     7868
PROSITE patterns     18.45    1800
PROSITE preprofiles  N/A      120
ProDom               2004.1   1522
InterPro             10.0     11972
SMART                4.0      663
TIGRFAMs             4.1      2454
PIRSF                2.52     962
PANTHER              5.0      438
SUPERFAMILY          1.65     1160
Gene3D               3.0      117
GO Classification    N/A      18705

SLIDE 46

SLIDE 47

SLIDE 48

SLIDE 49

SLIDE 50

Modifying InterProScan

Two ways to add your own HMM database to InterProScan:
Modify PERL scripts
Concatenate your models onto PFAM
Similarly, if you are looking for a specific target, delete all the rest to speed up searches
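The concatenation route is plain file appending, because a text-format HMMER 2 model database is a series of model records that each end in a `//` line. The sketch below assumes text (not binary) HMM files. The file names and stub contents are placeholders, not real models.

```python
import tempfile
from pathlib import Path


def concatenate_hmm_files(sources, destination):
    """Write each file in `sources` to `destination`, in order.

    Returns the number of "//" terminator lines written, which should
    equal the number of models in the combined file.
    """
    n_models = 0
    with open(destination, "w") as out:
        for source in sources:
            text = Path(source).read_text()
            if text and not text.endswith("\n"):
                text += "\n"  # keep records from running together
            out.write(text)
            n_models += sum(
                1 for line in text.splitlines() if line.strip() == "//"
            )
    return n_models


# Demonstration with tiny stub files standing in for real HMM files.
with tempfile.TemporaryDirectory() as tmp:
    pfam = Path(tmp, "Pfam_stub.hmm")
    mine = Path(tmp, "my_models.hmm")
    combined = Path(tmp, "combined.hmm")
    pfam.write_text("HMMER2.0\nNAME stubA\n//\nHMMER2.0\nNAME stubB\n//\n")
    mine.write_text("HMMER2.0\nNAME example2\n//")
    total = concatenate_hmm_files([pfam, mine], combined)
    names = [
        line.split()[1]
        for line in combined.read_text().splitlines()
        if line.startswith("NAME")
    ]

print(total, names)
```

Deleting unwanted models to speed up a search is the reverse operation: keep only the records whose NAME you care about.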

SLIDE 51

PANAL

Simultaneously searches several targets
Produces a nice graphical overview
Databases-
– PFAM
– SMART
– TIGRFAM
– Prosite
– PRINTS
– BLOCKS

SLIDE 52

PANAL

SLIDE 53

MAGPIE

BLOCKS
NCBI public non-redundant DNA and protein
NCBI EST databases
NCBI Conserved Domain Database (CDD)
Protein Identification Resource SuperFamilies
PFAM
ProDom
SCOP SuperFamilies
SMART
TIGRFam
ProSite

SLIDE 54

MAGPIE

Gives a putative description of the gene
Database search result ranking based on user defined tool precedence and score thresholds.
A single graphical summary of the various search results

Links to the database source entries

SLIDE 55

MAGPIE

Gene taxonomic distribution information
Reporting of similar sequences in the dataset based on hits to similar database entries

Annotated metabolic pathway diagrams
Gene Ontology (GO) term assignments

SLIDE 56

MAGPIE

Terry Gaasterland et al. Genome Res. 2000; 10: 502-510

SLIDE 57

Building Your Own HMM Database

Why do it?
Greater Specificity
Represent your training set
Faster searches
Focus on the particular aspects that you want

SLIDE 58

[Flowchart. Box labels: PFAM; Your Data; Public DB; HMMsearch or BLAST; Cluster Sequences; Build Multiple Sequence Alignments; HMMbuild; HMMcalibrate; Discard Singletons; Annotate; Check Alignments; Add Description Line]

SLIDE 59

First, search against a target…

SLIDE 60

Select the hits for the model

SLIDE 61

Build the Multiple Sequence Alignment

SLIDE 62

Run HMMbuild to make the model

SLIDE 63

Iterate Search to Add More Distant Members

SLIDE 64

Design Decisions:

Local or global models?
Which sequence weighting scheme?
What type of Prior?

SLIDE 65

Calibration

Hmmcalibrate
Improves scoring
Compares to random data
Can be done on each model, or on the entire collection
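The build and calibrate steps can be scripted as two commands per model. The sketch below only assembles the argument lists. The file names follow the `hmmbuild example2 example2.aln` line shown in the example HMM file earlier. The `-f` option (fragment-style, multi-hit local models) is how I recall HMMER 2 exposing the local/global choice, so treat it as an assumption and confirm with `hmmbuild -h`.

```python
def build_commands(hmm_file, alignment_file, fragments=False):
    """Return the hmmbuild and hmmcalibrate argument lists for one model.

    fragments=True adds "-f" to request a local (fragment) model;
    verify this option against your installed HMMER version.
    """
    build = ["hmmbuild"]
    if fragments:
        build.append("-f")
    build += [hmm_file, alignment_file]
    calibrate = ["hmmcalibrate", hmm_file]
    return [build, calibrate]


for command in build_commands("example2", "example2.aln"):
    print(" ".join(command))
```

Each list can be handed to `subprocess.run(command, check=True)` where HMMER is installed. Calibrating the concatenated collection instead means running `hmmcalibrate` once on the combined file.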

SLIDE 66

Calibration

Very time consuming for the CPU, not for the researcher

No acceleration available
Not necessary with SAM

SLIDE 67

Meme and Meta-Meme

Meme discovers motifs in a group of related DNA or protein sequences

Motifs contain no gaps- split in two instead

SLIDE 68

Meta-meme

Meta-meme takes meme motifs & related seqs as input
Combines motifs into HMMs
Regions between motifs are modeled imprecisely

Reduction in parameter space
Accurate models with fewer training seqs

SLIDE 69

Meta-meme

mhmm: Build a motif-based HMM from Meme motifs.
mhmms: Search a sequence database using a motif-based HMM
mhmmscan: Like mhmms, but allows long seqs and multiple matches.

SLIDE 70

Using RPS-BLAST

Start with PSI-BLAST using –C
Prepare files with makemat and copymat
Compile target
Annotate
Search with RPS-BLAST

SLIDE 71

IMPALA

Also uses profiles database
Alignments generated by Smith-Waterman instead of word hit initiated
10-100x slower, might be better than RPS-BLAST

SLIDE 72

SPEED

PVM version of HMMer is available, MPI is on the way (?)
Other Solutions- use PSSMs?
SPSpfam can speed searches 3-60X
SledgeHMMer claims 10X Speedup
Accelerators
Target Triage

SLIDE 73

SPSpfam

From Southwest Parallel Software
Optimized HMMer code
Up to 60X faster
Works well on cluster
Uses binary Pfam, so you can’t drop it into InterProScan

This may change soon

SLIDE 74

HMM Accelerators

Can provide speedups of 100s to 1000s X
TimeLogic is the only commercial one left
HokieGene from Virginia Tech
StarBridge - No HMMs yet
Others coming soon
An open-source project is in the works: BioFPGA

SLIDE 75

HMMs on the Web

SAM http://www.cse.ucsc.edu/research/compbio/
HMMer http://hmmer.wustl.edu/
Several other HMMer servers…
SledgeHMMer.sdsc.edu is the only unlimited webserver; most restrict you to one sequence at a time.

SLIDE 76

Resources

Online Applications:
HMMer http://hmmer.wustl.edu/
SAM-T02 http://www.soe.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html

Pfam http://pfam.wustl.edu/
SledgeHMMer sledgehmmer.sdsc.edu
Meta-MEME http://metameme.sdsc.edu/
PANAL http://web.ahc.umn.edu/panal/

SLIDE 77

Resources

Commercial vendors of HMM systems
SPSpfam (www.spsoft.com)
Ldhmmer (www.logicaldepth.com)
DeCypherHMM (www.timelogic.com)

SLIDE 78

References

S. Altschul, et al. Basic Local Alignment Search Tool. JMB, 215:403-410, 1990.
C. Barrett, et al. Scoring hidden Markov models. CABIOS, 13(2):191-199, 1997.
S. R. Eddy. Profile hidden Markov models. Bioinformatics, 14(9):755-63, 1998.
W. N. Grundy, et al. Meta-MEME: Motif-based hidden Markov models of protein families. CABIOS, 13(4):397-406, 1997.
M. Gribskov, et al. Profile analysis: Detection of distantly related proteins. PNAS, 84:4355-4358, July 1987.
S. Henikoff and Jorja G. Henikoff. Amino acid substitution matrices from protein blocks. PNAS, 89:10915-10919, November 1992.
Steven Henikoff and Jorja G. Henikoff. Position-based sequence weights. JMB, 243(4):574-578, November 1994.

SLIDE 79

Jeffrey D. et al. Kestrel: A programmable array for sequence analysis. In Application-Specific Array Processors, pages 25-34, Los Alamitos, CA, July 1996. IEEE Computer Society.
R. Hughey and A. Krogh. Hidden Markov models for sequence analysis: Extension and analysis of the basic method. CABIOS, 12(2):95-107, 1996.
T. Hubbard, et al. SCOP: a structural classification of proteins database. NAR, 25(1):236-9, January 1997.
L. Holm and C. Sander. Dali/fssp classification of three-dimensional protein folds. NAR, 25:231-234, 1 Jan 1997.
K. Karplus, et al. Predicting protein structure using only sequence information. Proteins: Structure, Function, and Genetics
K. Karplus, et al. Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14(10):846-856, 1998.

SLIDE 80

A. Krogh, et al. Hidden Markov models in computational biology: Applications to protein modeling. JMB, 235:1501-1531, February 1994.
Kevin Karplus, et al. Predicting protein structure using hidden Markov models. Proteins: Str, Func, and Genetics, Suppl. 1:134-139, 1997.
C. A. Orengo, et al. Cath- a hierarchic classification of protein domain structures. Structure, 5(8):1093-108, August 1997.
J. Park, et al. Sequence comparisons using multiple sequences detect twice as many remote homologues as pairwise methods. JMB, 284(4):1201-1210
E.L.L. Sonnhammer, et al. Pfam: A comprehensive database of protein families. Proteins, 28:405-420, 1997.
K. Sjolander, et al. Dirichlet mixtures: A method for improving detection of weak but significant protein sequence homology. CABIOS, 12(4):327-345, August 1996.
Reinhard Schneider and Chris Sander. The HSSP database of protein structure-sequence alignments. NAR, 24(1):201-205, 1 Jan 1996.
Chukkapalli G., Guda, C. and Subramaniam S. SledgeHMMER: A web server for batch searching Pfam database, Nucleic Acids Res., 32:W542-544
Schaffer, A.A., Wolf, Y.I., Ponting, C.P., Koonin, E.V., Aravind, L., Altschul, S.F. IMPALA: Matching a Protein Sequence Against a Collection of PSI-BLAST-Constructed Position-Specific Score Matrices, Bioinformatics,
P. K. Papasaikas, P. G. Bagos, Z. I. Litou, V. J. Promponas and S. J. Hamodrakas. PRED-GPCR: GPCR recognition and family classification server. Nucleic Acids Research 2004 32(Web Server issue):W380-W382; doi:10.1093/nar/gkh431
Silverstein, K.A.T., A. Kilian, J.L. Freeman, and E.F. Retzel. "PANAL: an integrated resource for Protein sequence ANALysis," Bioinformatics, 16:1157-1158, 2000

SLIDE 81

Thanks!

Garrett Taylor, Brian Beck, Taliah Mittler, Barrett Abel, John Cushman, Lee Weber

Contact me at- mgollery@unr.edu
Bioinformatics.unr.edu