An introduction to Patterns, Profiles, HMMs and PSI-BLAST
Marco Pagni and Lorenzo Cerutti Swiss Institute of Bioinformatics Course, 2003
Patterns, Profiles, HMMs, PSI-BLAST Course 2003
Color code: Keywords, Databases, Software
regions in protein or DNA sequences.
These particular regions are usually associated with:
sequences, search similar sequences in the databases or annotate new sequences.
[Figure: multiple sequence alignment of 16 sequences (STA3_MOUSE, ZA70_MOUSE, ZA70_HUMAN, PIG2_RAT, MATK_HUMAN, SEM5_CAEEL, P85B_BOVIN, VAV_MOUSE, YES_XIPHE, TXK_HUMAN, PIG2_HUMAN, YKF1_CAEEL, SPK1_DUGTI, STA6_HUMAN, STA4_MOUSE, SPT6_YEAST), with alignment columns 10 to 90 marked.]
from a multiple sequence alignment.
Multiple alignment (5 sequences, columns 1 to 15):
GHEGVGKVVKLGAGA
GHEKKGYFEDRGPSA
GHEGYGGRSRGGGYS
GHEFEGPKGCGALYI
GHELRGTTFMPALEC
Consensus:
GHE**G*****G***
The consensus is then used to search databases.
DNA.
In computer science, patterns are known as regular expressions.
brackets ’{ }’ ({AG} means any amino acid except Ala and Gly),
2 and 4 times, X(2) means any amino acid twice),
<A-x-[ST](2)-x(0,1)-{V} means:
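Because patterns are regular expressions, the notation can be translated mechanically. The following is a minimal sketch, not an official tool: the function name `prosite_to_regex` is hypothetical, and it assumes the usual conventions (`-` separates positions, `x` is any residue, `[...]` lists allowed residues, `{...}` lists forbidden residues, `(n)` or `(n,m)` is a repeat, `<` and `>` anchor the pattern at the N- and C-terminus).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchor at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchor at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat, e.g. (2) or (0,1).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core.lower() == "x":          # any amino acid
            core = "."
        elif core.startswith("{"):       # any amino acid except those listed
            core = "[^" + core[1:-1] + "]"
        # [...] alternatives and single letters are already valid regex syntax.
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)

regex = prosite_to_regex("<A-x-[ST](2)-x(0,1)-{V}")
print(regex)
print(bool(re.search(regex, "ASSTK")), bool(re.search(regex, "ASSTV")))
```

The translated expression can be used directly with `re.search` to scan a sequence; the two test sequences are invented for illustration.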
Multiple alignment (5 sequences, columns 1 to 15):
GHEGVGKVVKLGAGA
GHEKKGYFEDRGPSA
GHEGYGGRSRGGGYS
GHEFEGPKGCGALYI
GHELRGTTFMPALEC
Pattern:
G-H-E-X(2)-G-X(5)-[GA]-X(3)
The pattern is then used to search databases.
[DNEG]-x-[LIVFA]-[LIVMY]-[LVAST]-H-N-[STC]
[RK]-x(2)-[DE]-x(3)-Y or [RK]-x(3)-[DE]-x(2)-Y
G-A-K-R-H
M-C-N-S-S-C-[MV]-G-G-M-N-R-R
[LIVMA]-G-[EQ]-H-G-[DN]-[ST]
P-[LIVM]-C-T-[LIVM]-[KRH]-x-[FT]-P
tures.
sequences by specialized programs.
with some knowledge of the biochemical literature.
Alignment (5 sequences, columns 1 to 15):
GHEGVGKVVKLGAGA
GHEKKGYFEDRGPSA
GHEGYGGRSRGGGYS
GHEFEGPKGCGALYI
GHELRGTTFMPALEC

Residue counts per column, tallied from the alignment:
     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
A    0  0  0  0  0  0  0  0  0  0  0  2  1  0  2
C    0  0  0  0  0  0  0  0  0  1  0  0  0  0  1
D    0  0  0  0  0  0  0  0  0  1  0  0  0  0  0
E    0  0  5  0  1  0  0  0  1  0  0  0  0  1  0
F    0  0  0  1  0  0  0  1  1  0  0  0  0  0  0
G    5  0  0  2  0  5  1  0  1  0  2  3  1  1  0
H    0  5  0  0  0  0  0  0  0  0  0  0  0  0  0
I    0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
K    0  0  0  1  1  0  1  1  0  1  0  0  0  0  0
L    0  0  0  1  0  0  0  0  0  0  1  0  2  0  0
M    0  0  0  0  0  0  0  0  0  1  0  0  0  0  0
N    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
P    0  0  0  0  0  0  1  0  0  0  1  0  1  0  0
Q    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
R    0  0  0  0  1  0  0  1  0  1  1  0  0  0  0
S    0  0  0  0  0  0  0  0  1  0  0  0  0  1  1
T    0  0  0  0  0  0  1  1  0  0  0  0  0  0  0
V    0  0  0  0  1  0  0  1  1  0  0  0  0  0  0
W    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
Y    0  0  0  0  1  0  1  0  0  0  0  0  0  2  0
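$f_{A,1} = \frac{0}{5} = 0$, $f_{G,1} = \frac{5}{5} = 1$, ...
$f_{A,2} = \frac{0}{5} = 0$, $f_{H,2} = \frac{5}{5} = 1$, ...
$f_{A,15} = \frac{2}{5} = 0.4$, $f_{C,15} = \frac{1}{5} = 0.2$, ...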
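The relative frequencies above can be computed column by column. The sketch below assumes the five 15-residue sequences as read off the columns of the example alignment; the function name `column_frequencies` is an illustrative choice.

```python
from collections import Counter

# Example alignment, one string per sequence (rows reconstructed from the slide's columns).
alignment = [
    "GHEGVGKVVKLGAGA",
    "GHEKKGYFEDRGPSA",
    "GHEGYGGRSRGGGYS",
    "GHEFEGPKGCGALYI",
    "GHELRGTTFMPALEC",
]

def column_frequencies(msa):
    """Return one dict per column mapping residue -> relative frequency."""
    n_seqs = len(msa)
    freqs = []
    for column in zip(*msa):             # iterate over alignment columns
        counts = Counter(column)
        freqs.append({res: n / n_seqs for res, n in counts.items()})
    return freqs

freqs = column_frequencies(alignment)
# Column indices are 0-based here: index 0 is position 1, index 14 is position 15.
print(freqs[0].get("G", 0.0), freqs[14].get("A", 0.0), freqs[14].get("C", 0.0))
```

Residues that never occur in a column are simply absent from that column's dictionary, which corresponds to a frequency of 0.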
number of sequences that is present in a MSA.
corresponding residue at this position (this was the case with patterns).
small non-observed frequencies are referred to as pseudo-counts.
$f'_{A,1} = \frac{0+1}{5+20} = 0.04$, $f'_{G,1} = \frac{5+1}{5+20} = 0.24$, ...
$f'_{A,2} = \frac{0+1}{5+20} = 0.04$, $f'_{H,2} = \frac{5+1}{5+20} = 0.24$, ...
$f'_{A,15} = \frac{2+1}{5+20} = 0.12$, $f'_{C,15} = \frac{1+1}{5+20} = 0.08$, ...
counts, and which are based on substitution matrix or Dirichlet mixtures.
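The simple correction used in the example adds one pseudo-count per residue type, so the denominator grows by 20. A minimal sketch of that arithmetic follows; the function name and its default parameters are illustrative assumptions, and the more sophisticated schemes mentioned above are not covered.

```python
def pseudocount_frequency(count: int, n_seqs: int,
                          pseudo: int = 1, alphabet_size: int = 20) -> float:
    """Corrected frequency f' = (count + pseudo) / (n_seqs + pseudo * alphabet_size)."""
    return (count + pseudo) / (n_seqs + pseudo * alphabet_size)

# Values from the example: 5 sequences, 20 amino acids, one pseudo-count each.
print(pseudocount_frequency(0, 5))   # residue not observed in the column
print(pseudocount_frequency(5, 5))   # residue observed in all 5 sequences
print(pseudocount_frequency(2, 5))   # residue observed twice
```

With this correction the 20 corrected frequencies of a column still sum to 1, and no residue ever gets a frequency of exactly zero.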
with the frequency at which any residue can be expected in a random
sequence.
frequency in a random sequence. This is a quite simplistic null model.
More precisely, the logarithm of this ratio is taken and referred to as the log-likelihood ratio:
$$\mathrm{Score}_{ij} = \log\left(\frac{f'_{ij}}{q_i}\right)$$
where $\mathrm{Score}_{ij}$ is the score for residue $i$ at position $j$, $f'_{ij}$ is the relative frequency of residue $i$ at position $j$ (corrected with pseudo-counts) and $q_i$ is the expected relative frequency of residue $i$ in a random sequence.
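As a hedged illustration of the formula, the sketch below scores the corrected frequencies of the 5-sequence example. Two choices are assumptions rather than statements from the course: a uniform null model $q_i = 1/20$ and a base-2 logarithm. They are used here because together they reproduce the example scores 2.3, 1.7, 1.3 and 0.7.

```python
import math

def log_odds(f_prime: float, q: float = 1 / 20, base: float = 2.0) -> float:
    """Log-likelihood ratio score log(f' / q); q and base are assumed defaults."""
    return math.log(f_prime / q, base)

# Residue observed in 5, 3, 2 or 1 of the 5 sequences (one pseudo-count per residue type).
for count in (5, 3, 2, 1):
    f_prime = (count + 1) / (5 + 20)
    print(count, round(log_odds(f_prime), 1))
```

A positive score means the residue is more frequent at that position than expected by chance; a residue less frequent than the null model expects would receive a negative score.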
example: scores for the 15-column alignment (rows are residues A to Y, columns are positions 1 to 15; only residues observed at a position carry a score in this table).
Observed in   Score
5 of 5        2.3
3 of 5        1.7
2 of 5        1.3
1 of 5        0.7
Row G: position 1: 2.3, 4: 1.3, 6: 2.3, 7: 0.7, 9: 0.7, 11: 1.3, 12: 1.7, 13: 0.7, 14: 0.7
Row A: position 12: 1.3, 13: 0.7, 15: 1.3
Row H: position 2: 2.3
[Figure: the PSSM is slid along the sequence TSGHELVGGVAFPARCAS one position at a time (Position +1). The three placements shown give Score = 0.3, Score = 0.6 and Score = 16.1.]
thus influencing observed residue frequencies.
attempt to compensate for this sequence sampling bias.
[Figure: multiple alignment of 21 sequences, shown in two different sequence orders, with the labels "Low weights" and "High weights". Sequences: SW_PDA6_MESAU, SW_PDI1_ARATH, SW_PDI_CHICK, SW_PDA6_ARATH, SW_PDA2_HUMAN, SW_THIO_ECOLI, SW_THIM_CHLRE, SW_THIO_CHLTR, SW_THI1_SYNY3, SW_THI3_CORNE, SW_THI2_CAEEL, SW_THIO_MYCGE, SW_THIO_BORBU, SW_THIO_EMENI, SW_THIO_NEUCR, SW_TRX3_YEAST, SW_THIO_OPHHA, SW_THH4_ARATH, SW_THI3_DICDI, SW_THIO_CLOLI, SW_THF2_ARATH.]
the observed score that are expected to occur by chance.
false positives expected above a given score threshold increases proportionately with the size of the database.
aligned sequences by specialized programs. MEME is one such tool, based on the expectation-maximization algorithm (http://meme.sdsc.edu/meme/website/).
that purpose is available (Prints).
However, this raises the difficult problem of defining and computing an optimal alignment with gaps.
define and compute the optimal alignment between a pair of sequences, e.g. by the Smith-Waterman algorithm, and generalize it by the introduction of:
states, just as in the PSSM.
that receives a position-dependent penalty.
states. These insertion states are given a position-dependent penalty that might also depend upon the inserted residues.
position-dependent penalty. This is primarily to model the cost of opening and closing a gap.
the alignment, which can be forced to be ’local’ or ’global’ at either end of the profile and
[Figure: generalized profile for the 15-position example. Each position 1 to 15 carries MATCH scores for the 20 residues A to Y (2.3, 1.7, 1.3 or 0.7 for observed residues, −0.2 otherwise) and a DELETION penalty −d1 to −d15; INSERTION positions are labelled I1 to I14.]
ID THIOREDOXIN_2; MATRIX.
AC PS50223;
DT ? (CREATED); MAY-1999 (DATA UPDATE); ? (INFO UPDATE).
DE Thioredoxin-domain (does not find all).
MA /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=103;
MA /DISJOINT: DEFINITION=PROTECT; N1=6; N2=98;
MA /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=1.9370; R2=0.01816483; TEXT='-LogE';
MA /CUT_OFF: LEVEL=0; SCORE=361; N_SCORE=8.5; MODE=1; TEXT='!';
MA /DEFAULT: D=-20; I=-20; B1=-100; E1=-100; MM=1; MI=-105; MD=-105; IM=-105; DM=-105; M0=-6;
MA /I: B1=0; BI=-105; BD=-105;
... many lines deleted ...
MA /M: SY='K'; M=-8,0,-25,1,8,-24,-14,-9,-22,19,-20,-11,0,-9,5,13,-3,-4,-16,-24,-13,6; D=-3;
MA /I: I=-3; DM=-16;
MA /M: SY='P'; M=-6,-13,-26,-12,-9,-12,-19,-14,-5,-11,-5,-4,-12,8,-11,-13,-9,-6,-6,-25,-11,-12;
MA /M: SY='V'; M=-4,-22,-19,-24,-20,-2,-25,-21,11,-15,2,3,-20,-23,-17,-14,-9,-1,19,-11,-4,-19;
MA /M: SY='A'; M=28,-7,-15,-13,-6,-20,-2,-15,-15,-6,-14,-11,-5,-12,-6,-11,9,1,-6,-21,-17,-6;
MA /M: SY='P'; M=-6,-3,-27,2,2,-22,-14,-11,-20,-6,-24,-17,-5,25,-4,-11,3,1,-19,-29,-17,-3;
MA /M: SY='W'; M=-16,-27,-41,-28,-21,2,-13,-20,-20,-16,-19,-17,-26,-25,-15,-15,-26,-20,-26,93,19,-15;
MA /M: SY='C'; M=-9,-17,106,-26,-27,-20,-27,-28,-29,-28,-20,-20,-17,-37,-28,-28,-8,-9,-10,-48,-29,-27;
MA /M: SY='G'; M=-4,-12,-31,-9,-9,-27,24,-18,-27,-13,-25,-17,-7,14,-13,-17,-3,-13,-24,-24,-26,-13;
MA /M: SY='H'; M=-12,-10,-30,-8,-4,-14,-18,18,-17,-10,-18,-8,-7,16,-5,-11,-8,-10,-20,-22,-1,-8;
MA /M: SY='C'; M=-9,-19,111,-28,-28,-20,-29,-29,-28,-29,-20,-19,-18,-38,-28,-29,-8,-8,-9,-49,-29,-28;
MA /M: SY='R'; M=-12,-4,-27,-4,3,-22,-20,-2,-21,22,-19,-6,-2,-13,9,23,-9,-8,-16,-20,-6,4;
... many lines deleted ...
//
THIO_ECOLI SFDTDVLKADGAILVDFWAEWCGPCKMIAPILDEIADEYQ------GKLTVAKLNIDQNP
           :.   :.  :  .:..:.: ::: :: .::  ::.:  :       .:.:.::..   :
PDI_ASPNG  SYKDLVIDNDKDVLLEFYAPWCGHCKALAPKYDELAALYADHPDLAAKVTIAKIDATAND

THIO_ECOLI GTAPKYGIRGIPTLLLFKNG
              :   : :.::: :.  :
PDI_ASPNG  VPDP---ITGFPTLRLYPAG
consensus    1 XVXVLSDENFDEXVXDSDKPVLVDFYAPWCGHCRALAPVFEELAEEYK----DBVKFVKV
PDI_ASPNG  360 PVTVVVAHSYKDLVIDNDKDVLLEFYAPWCGHCKALAPKYDELAALYAdhpdLAAKVTIA

consensus   57 DVDENXELAEEYGVRGFPTIMFF--KBGEXVERYSGARBKEDLXEFIEK
PDI_ASPNG  420 KID-ATANDVPDPITGFPTLRLYpaGAKDSPIEYSGSRTVEDLANFVKE
by Philipp Bucher (http://www.isrec.isb-sib.ch/ftp-server/pftools/).
theory, which is part of the theory of probabilities.
P(A|G) = 0.18, P(C|G) = 0.38, P(G|G) = 0.32, P(T|G) = 0.12
P(A|C) = 0.15, P(C|C) = 0.35, P(G|C) = 0.34, P(T|C) = 0.15
[Figure: Markov chain with a Start state and the four states A, C, G and T.]
The probability of sequence x = GCCT is:
P(GCCT) = P(T|C)P(C|C)P(C|G)P(G)
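The product above can be evaluated directly from a table of transition probabilities. In the sketch below the rows for G and C are the values quoted in the course; the start probability P(G) is not given, so a uniform value of 0.25 is used purely as an assumption, and the function name is illustrative.

```python
# Transition probabilities P(next | previous); only the two rows quoted in the text.
transition = {
    "G": {"A": 0.18, "C": 0.38, "G": 0.32, "T": 0.12},
    "C": {"A": 0.15, "C": 0.35, "G": 0.34, "T": 0.15},
}
# Assumption: uniform start probabilities (the source does not state them).
start = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def sequence_probability(seq, start, transition):
    """P(x) = P(x1) * product over i of P(x_i | x_{i-1}) for a first-order Markov chain."""
    prob = start[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= transition[prev][cur]
    return prob

print(sequence_probability("GCCT", start, transition))
```

For long sequences the product underflows quickly, which is why such probabilities are usually accumulated as sums of logarithms.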
a finite number of states connected by
transitions.
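a symbol but a distribution of symbols. Each state can emit a symbol with a probability given by the distribution.
[Figure: a two-state HMM between Start and End. The states and transitions are "Hidden"; the emitted symbols are "Visible". One state emits from the pool 1xA, 1xT, 2xC, 2xG, the other from the pool 1xA, 1xT, 1xC, 1xG. Transition probabilities shown: 0.5, 0.5, 0.1, 0.7, 0.2, 0.4, 0.5, 0.1.]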
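"Hidden" state path:  START 1 1 1 1 2 2 1 1 1 2 END
"Visible" sequence:         G C A G C T G G C T
[Figure: two-state HMM (Start, State 1, State 2, End). Emission probabilities: A 0.25, C 0.25, G 0.25, T 0.25 in one state; A 0.17, C 0.33, G 0.33, T 0.17 in the other. Transition probabilities shown: 0.5, 0.5, 0.7, 0.2, 0.5, 0.1, 0.1, 0.4.]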
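The joint probability of a hidden state path and the visible sequence it emits is the product of the transition probabilities along the path and the emission probabilities at each step. The sketch below uses the numbers from the example, but the extracted text does not say which arrow or state each number belongs to, so the assignment is an assumption: State 1 is taken to be the C/G-rich state with self-transition 0.7, and State 2 the uniform state with self-transition 0.5.

```python
# Assumed assignment of the figure's numbers to transitions (each row sums to 1).
transitions = {
    ("START", 1): 0.5, ("START", 2): 0.5,
    (1, 1): 0.7, (1, 2): 0.2, (1, "END"): 0.1,
    (2, 2): 0.5, (2, 1): 0.4, (2, "END"): 0.1,
}
# Assumed assignment of the two emission distributions to the two states.
emissions = {
    1: {"A": 0.17, "T": 0.17, "C": 0.33, "G": 0.33},
    2: {"A": 0.25, "T": 0.25, "C": 0.25, "G": 0.25},
}

def joint_probability(path, symbols):
    """P(path, symbols): product of transition and emission probabilities."""
    states = ["START"] + list(path) + ["END"]
    prob = 1.0
    for a, b in zip(states, states[1:]):      # transitions, including START and END
        prob *= transitions[(a, b)]
    for state, symbol in zip(path, symbols):  # one emission per visited state
        prob *= emissions[state][symbol]
    return prob

path = [1, 1, 1, 1, 2, 2, 1, 1, 1, 2]
symbols = "GCAGCTGGCT"
print(joint_probability(path, symbols))
```

In practice only the symbols are observed; the path is what algorithms such as Viterbi try to recover.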
in state q. E(x|q)
PSSMs.
T (r|q)
Welch training algorithm.
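[Figure: a training set (fragment shown: W A E - C) and the profile HMM built from it, with states BEGIN, I0, M1, I1, D1, M2, I2, D2, M3, I3, D3 and END. Emission probabilities shown for three states: A 0.74, C 0.01, D 0.03, E 0.03, ...; A 0.01, C 0.01, D 0.41, E 0.44, ...; A 0.01, C 0.92, D 0.01, E 0.01, ...]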
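[Figure: the sequence A R A E S P D C I aligned to the profile HMM (states BEGIN, I0, M1, I1, D1, M2, I2, D2, M3, I3, D3, END); the path shown uses states I0, M1, I2, M2, M3 and I3. Emission probabilities shown for three states: A 0.74, C 0.01, D 0.03, E 0.03, ...; A 0.01, C 0.01, D 0.41, E 0.44, ...; A 0.01, C 0.92, D 0.01, E 0.01, ...]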
the Baum-Welch expectation maximization.
Durbin, Eddy, Krogh, Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
(http://hmmer.wustl.edu/).
Anders Krogh (http://www.cse.ucsc.edu/research/compbio/sam.html).
[Figure: HMM architecture with special states S, N, B, E, J, C and T, match states M1 to M4, insert states I1 to I3 and delete states D1 to D4.]
architecture).
the Viterbi path on an HMM.
is very difficult with HMMs.
[Figure: workflow. Boxes: Training set (a collection of trusted sequences), Multiple Alignment, HMM/Profile, Protein Database, Search output. Programs shown: hmmbuild, hmmcalibrate, hmmsearch, hmmalign, pfw, pfmake, pfcalibrate, pfsearch, psa2msa.]
domains.
independent affine gap cost model. This is less sophisticated than the generalized profiles, but it is just this principle that is behind PSI-BLAST.
1 A standard BLAST search is performed against a database using a substitution matrix (e.g. BLOSUM62).
2 A PSSM (checkpoint) is constructed automatically from a multiple alignment of the highest scoring hits of the initial BLAST search. Highly conserved positions receive high scores and weakly conserved positions receive low scores.
3 The PSSM replaces the initial matrix (e.g. BLOSUM62) to perform a second BLAST search.
4 Steps 2 and 3 can be repeated, with the newly found sequences included to build a new PSSM.
5 PSI-BLAST is said to have converged if no new sequences are included in the last cycle.
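The control flow of these steps can be sketched as a loop that stops when a cycle adds no new sequences. This is only an illustration of the iteration and the convergence test, not of BLAST itself: `iterate_search` and the three toy functions are hypothetical stand-ins, and the "database" is a chain of five invented names in which each entry is detectable only from its neighbours.

```python
def iterate_search(initial_search, build_pssm, pssm_search, max_iterations=10):
    """Generic PSI-BLAST-style iteration; returns (hits, number of PSSM searches run)."""
    hits = set(initial_search())                  # step 1: ordinary search
    for iteration in range(1, max_iterations + 1):
        pssm = build_pssm(hits)                   # step 2: model from current hits
        new_hits = set(pssm_search(pssm)) - hits  # step 3: search with the model
        if not new_hits:                          # step 5: converged
            return hits, iteration
        hits |= new_hits                          # step 4: include new sequences
    return hits, max_iterations

# Toy stand-ins (assumptions for illustration only).
chain = ["s0", "s1", "s2", "s3", "s4"]

def toy_initial_search():
    return {"s0", "s1"}

def toy_build_pssm(hits):
    return frozenset(hits)        # the "model" is just the set of sequences

def toy_pssm_search(model):
    found = set(model)
    for name in model:
        i = chain.index(name)
        found.update(chain[max(i - 1, 0): i + 2])   # each member also finds its neighbours
    return found

hits, iterations = iterate_search(toy_initial_search, toy_build_pssm, toy_pssm_search)
print(sorted(hits), iterations)
```

The toy shows why iteration helps: distant members of the chain are reached only after intermediate sequences have been added to the model.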
[Figure: workflow. Boxes: Training set (a single trusted sequence), PSI−blast, Multiple Alignment, HMM/Profile, Protein Database, Search output. Programs shown: hmmbuild, hmmcalibrate, hmmsearch, hmmalign, pfw, pfmake, pfcalibrate, pfsearch, psa2msa.]
The default is 10 as in the standard BLAST;
is 0.001).
not to the original sequence!
and false positives (FP) in SWISS-PROT.
A: Archaea; B: Bacteriophages; E: Eukaryota; P: Procaryota; V: Viruses
to return significant results (SKIP FLAG = TRUE).
ID UCH_2_1; PATTERN.
AC PS00972;
DT JUN-1994 (CREATED); DEC-2001 (DATA UPDATE); DEC-2001 (INFO UPDATE).
DE Ubiquitin carboxyl-terminal hydrolases family 2 signature 1.
PA G-[LIVMFY]-x(1,3)-[AGC]-[NASM]-x-C-[FYW]-[LIVMFC]-[NST]-[SACV]-x-[LIVMS]-
PA Q.
NR /RELEASE=40.7,103373;
NR /TOTAL=58(58); /POSITIVE=58(58); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR /FALSE_NEG=5; /PARTIAL=1;
CC /TAXO-RANGE=??E??; /MAX-REPEAT=1;
CC /SITE=7,active_site(?);
DR P55824, FAF_DROME , T; Q93008, FAFX_HUMAN, T; P70398, FAFX_MOUSE, T;
DR O00507, FAFY_HUMAN, T; P54578, TGT_HUMAN , T; P40826, TGT_RABIT , T;
(...)
DR Q99MX1, UBPQ_MOUSE, T; Q61068, UBPW_MOUSE, T; P34547, UBPX_CAEEL, T;
DR Q09931, UBPY_CAEEL, T;
DR Q01988, UBPB_CANFA, P;
DR P53874, UBPA_YEAST, N; Q9UMW8, UBPI_HUMAN, N; Q9WTV6, UBPI_MOUSE, N;
DR Q9UPU5, UBPO_HUMAN, N; Q17361, UBPT_CAEEL, N;
DO PDOC00750;
//
{PDOC00750}
{PS00972; UCH_2_1}
{PS00973; UCH_2_2}
{PS50235; UCH_2_3}
{BEGIN}
**********************************************************************
* Ubiquitin carboxyl-terminal hydrolases family 2 signatures/profile *
**********************************************************************
Ubiquitin carboxyl-terminal hydrolases (EC 3.1.2.15) (UCH) (deubiquitinating enzymes) [1,2] are thiol proteases that recognize and hydrolyze the peptide bond at the C-terminal glycine of ubiquitin. These enzymes are involved in the processing of poly-ubiquitin precursors as well as that
ubiquinated
two distinct families
proteins (800 to 2000 residues) and is currently represented by:
UBP11, UBP12, UBP13, UBP14, UBP15 and UBP16.
These proteins only share two regions of similarity. The first region contains a conserved cysteine which is probably implicated in the catalytic mechanism. The second region contains two conserved histidines residues,
(...)
(http://www.expasy.org/tools/scanprosite/):
motifs.
(http://www.blocks.fhcrc.org/).
(http://bioinf.man.ac.uk/dbbrowser/PRINTS):
profiles/matrices in Prosite release 17.34).
family).
database and a stand-alone package (iprscan) are available for UNIX platforms to locally run a complete Interpro analysis: ftp://ftp.ebi.ac.uk/pub/databases/interpro.