SLIDE 1

Swiss Institute of Bioinformatics, Institut Suisse de Bioinformatique

CN+LF-2006.01

An introduction to multiple alignments

Original version by Cédric Notredame, updated by Laurent Falquet


Overview

Multiple alignments

How-to, Goal, problems, use

Patterns

PROSITE database, syntax, use

PSI-BLAST

BLAST, matrices, use

[ Profiles/HMMs ] …

SLIDE 2

Overview

  • What are multiple alignments?
  • How can I use my alignments?
  • How does the computer align the sequences? The progressive alignment algorithm
  • What are the difficulties?
  • Pre-requisites: How can we compare sequences? How can we align sequences?


Sometimes two sequences are not enough

The man with TWO watches NEVER knows the exact time

SLIDE 3

What is a multiple sequence alignment?

What can it do for me? How can I produce one of these? How can I use it?

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
      * : .* . :


What is a multiple sequence alignment?

Structural/biochemical criteria

Residues playing a similar role end up in the same column.

Evolution criteria

Residues having the same ancestor end up in the same column.

SLIDE 4

How can I use a multiple alignment?

chite   ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat   --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr   KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
unknown -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
        ***. ::: .: .. . : . . * . *: *

chite   AATAKQNYIRALQEYERNGG-
wheat   ANKLKGEYNKAIAAYNKGESA
trybr   AEKDKERYKREM---------
unknown AKDDRIRYDNEMKSWEEQMAE
        * : .* . :

Extrapolation

SwissProt Unknown Sequence

Homology? Less than 30% identity, BUT conserved where it MATTERS

SLIDE 5

How can I use a multiple alignment?


Extrapolation Prosite Patterns

P-K-R-[PA]-x(1)-[ST]…
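A PROSITE pattern of this kind can be scanned for with an ordinary regular expression. The sketch below is a minimal translator, assuming only the basic syntax elements (single residues, `[..]` alternatives, `{..}` exclusions, `x`, and `(n)` or `(n,m)` repeats). The function name `prosite_to_regex` is made up for illustration, and only the part of the pattern shown on the slide is used, since the rest is elided.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Split an element such as x(1) or x(2,4) into its core and repeat counts.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                      # x means any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {AB} means any residue except A or B
        # [AB] and single letters are already valid regex syntax.
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)

# Ungapped N-terminal fragments copied from the alignment on the slide.
fragments = {
    "chite": "ADKPKRPLSAYMLWLNSARESIKRENPDFK",
    "wheat": "DPNKPKRAPSAFFVFMGEFREEFKQKNPKNK",
    "trybr": "KKDSNAPKRAMTSFMFFSSDFRS",
    "mouse": "KPKRPRSAYNIYVSESFQ",
}

regex = re.compile(prosite_to_regex("P-K-R-[PA]-x(1)-[ST]"))
for name, seq in fragments.items():
    hit = regex.search(seq)
    print(name, hit.start() if hit else None, hit.group() if hit else None)
```

A pattern either matches or it does not, which is why the profiles introduced next are described as more sensitive.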


How can I use a multiple alignment?

Extrapolation Prosite Patterns

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-IQGKLKLVNEAWKNLSP
      ***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
      * : .* . :

L? K>R

Prosite Profiles

  • More Sensitive
  • More Specific

A F D E F G H Q I V L W

SLIDE 6

PROSITE profile (see also HMMs)

A Substitution Cost For Every Amino Acid, At Every Position
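The idea of "a cost for every amino acid, at every position" can be sketched as a small position-specific scoring table built from an ungapped block of the alignment. This is a simplified illustration, not the PROSITE profile format: it assumes a uniform background frequency and a flat pseudocount, and it ignores gaps and substitution matrices. The names `build_profile` and `score` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(block, pseudocount=1.0):
    """Return one {residue: log-odds score} table per column of an ungapped block."""
    n_seqs = len(block)
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    profile = []
    for column in zip(*block):
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / (n_seqs + pseudocount * len(AMINO_ACIDS))
            scores[aa] = math.log2(freq / background)
        profile.append(scores)
    return profile

def score(profile, segment):
    """Sum the position-specific scores of a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))

# Columns 6 to 14 of the chite/wheat/trybr/mouse alignment (no gaps in this block).
block = ["KPKRPLSAY", "KPKRAPSAF", "APKRAMTSF", "KPKRPRSAY"]
profile = build_profile(block)
print(round(score(profile, "KPKRPLSAY"), 2), round(score(profile, "GGGGGGGGG"), 2))
```

Unlike a pattern, every residue at every position receives a score, so a near miss still scores higher than an unrelated segment.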


How can I use a multiple alignment?

Phylogeny

[Figure: alignment of chite, wheat, trybr and mouse, with a phylogenetic tree of the four sequences]

  • Evolution
  • Paralogy/Orthology

SLIDE 7

How can I use a multiple alignment?

Phylogeny


  • Structure Prediction

Column Constraint

  • Evolution Constraint
  • Structure Constraint


How can I use a multiple alignment?

Phylogeny


  • Structure Prediction

PSIPRED or PHD for secondary structure prediction: 75% accurate. Threading is improving but is not yet as good.

SLIDE 8

How can I use a multiple alignment?

Phylogeny

  • Structure Prediction


Caution!

Automatic Multiple Sequence Alignment methods are not always perfect…

SLIDE 9

The problem

Why is it difficult to compute a multiple sequence alignment?

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. ::: .: .. . : . . * . *: *

Computation: What is the good alignment?
Biology: What is a good alignment?


The problem

Why is it difficult to compute a multiple sequence alignment?

CIRCULAR PROBLEM...
Good Sequences <-> Good Alignment

SLIDE 10

The problem

Same as pairwise alignment problem
We do NOT know how sequences evolve.
We do NOT understand the relation between structures and sequences.
We would NOT recognize the “correct” alignment if we had it IN FRONT of our eyes…


The Charlie Chaplin paradox

SLIDE 11

What do I need to know to make a good multiple alignment?

  • How do sequences evolve?
  • How does the computer align the sequences?
  • How can I choose my sequences?
  • What is the best program?
  • How can I use my alignment?


An alignment is a story

ADKPKRPLSAYMLWLN
ADKPKRPLSAYMLWLN
ADKPKRPLSAYMLWLN
ADKPRRPLS-YMLWLN
ADKPKRPKPRLSAYMLWLN

Mutations + Selection

ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN

Insertion, Deletion, Mutation

SLIDE 12

Homology

Same sequences -> same origin? -> same function? -> same 3D fold?

[Figure: % sequence identity versus alignment length, with axis marks at 30% and 100; regions labeled "Same 3D Fold" and "Twilight Zone"]
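Percent identity, the quantity on the figure's axis, is easy to compute from two aligned rows. The sketch below uses one of several possible conventions, counting only columns where both sequences have a residue; that choice and the function name `percent_identity` are assumptions of this example, and other tools may divide by a different length.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent identity over the columns where neither aligned row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)

# First block of the chite and mouse rows from the alignment shown earlier.
chite = "---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD"
mouse = "-----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP"
print(round(percent_identity(chite, mouse), 1))
```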


Residues and mutations

All residues are equal, but some more than others…

[Figure: the amino acids (P G S C L I T V A W Y F Q H K R E D N M) grouped into the classes Aliphatic, Aromatic, Hydrophobic, Polar and Small]

Accurate matrices are data driven rather than knowledge driven

SLIDE 13

Substitution matrices

Different Flavors:

  • PAM: 250, 350
  • BLOSUM: 45, 62


What is the best substitution matrix?

Mutation rates depend on families
Choosing the right matrix may be tricky

Gonnet250 > BLOSUM62 > PAM250
Depends on the family, the program used and its tuning

Family             S      N
Histone3           6.4
Insulin            4.0    0.1
Interleukin I      4.6    1.4
Globin             5.1    0.6
Apolipoprot. AI    4.5    1.6
Interferon G       8.6    2.8

Rates in substitutions/site/billion years, as measured on mouse vs. human (0.08 billion years)

SLIDE 14

Insertions and deletions?

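[Figure: indel cost plotted against gap length L for different gap penalty schemes]

Affine Gap Penalty: Cost = GOP + GEP * L (GOP: gap opening penalty, GEP: gap extension penalty, L: gap length)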
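The affine formula is small enough to write out directly. In this sketch the default values of `gop` and `gep` are arbitrary illustration values, not the defaults of any particular program.

```python
def affine_gap_cost(length: int, gop: float = 10.0, gep: float = 0.5) -> float:
    """Cost of one gap of the given length: GOP + GEP * L (no gap costs nothing)."""
    if length <= 0:
        return 0.0
    return gop + gep * length

# One indel of length 4 is much cheaper than four separate indels of length 1.
print(affine_gap_cost(4), 4 * affine_gap_cost(1))
```

The comparison shows why the scheme is used: a single long insertion or deletion event is penalized less than many scattered short ones.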


How to align many sequences?

Exact algorithms are computationally expensive

Needleman & Wunsch, Smith & Waterman

2 Globins =>1 sec
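For two sequences the exact approach is dynamic programming. The following is a minimal sketch of global (Needleman & Wunsch style) alignment, assuming a plain match/mismatch score and a linear gap penalty instead of a substitution matrix and affine gaps; the scoring values are arbitrary and the function name is illustrative.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

# The ancestral sequence and one mutated descendant from "An alignment is a story".
best, row_a, row_b = needleman_wunsch("ADKPKRPLSAYMLWLN", "ADKPRRPLSYMLWLN")
print(best)
print(row_a)
print(row_b)
```

The table has (len(a)+1) x (len(b)+1) cells, which is what makes the pairwise case fast and the multiple case expensive.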

SLIDE 15

3 Globins => 2 min

How to align many sequences?

Exact algorithms are computationally expensive

Needleman & Wunsch, Smith & Waterman


4 Globins =>5 hours

How to align many sequences?

Exact algorithms are computationally expensive

Needleman & Wunsch, Smith & Waterman

-> heuristic desirable

SLIDE 16

5 Globins =>3 weeks

How to align many sequences?

Exact algorithms are computationally expensive

Needleman & Wunsch, Smith & Waterman

-> heuristic really desirable!


6 Globins =>9 years

How to align many sequences?

Exact algorithms are computationally expensive

Needleman & Wunsch, Smith & Waterman

-> heuristic required!

SLIDE 17

How to align many sequences?

Exact algorithms are computationally expensive

Needleman & Wunsch, Smith & Waterman

-> heuristic definitely required!

7 Globins =>1000 years
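The jump from seconds to centuries follows from the size of the dynamic programming lattice, which has one axis per sequence. A rough sketch, assuming all sequences have the same length and taking about 150 residues as an illustrative globin length (an assumption of this example, not a figure from the slides):

```python
def dp_cells(seq_length: int, n_sequences: int) -> int:
    """Number of cells in the exact N-dimensional dynamic programming lattice."""
    return (seq_length + 1) ** n_sequences

GLOBIN_LENGTH = 150  # assumed, for illustration only

for n in range(2, 8):
    growth = dp_cells(GLOBIN_LENGTH, n) // dp_cells(GLOBIN_LENGTH, n - 1)
    print(n, "globins:", dp_cells(GLOBIN_LENGTH, n), "cells, x%d per added sequence" % growth)
```

Each added sequence multiplies the work by roughly the sequence length, which is the same order of magnitude as the step between successive timings quoted on the slides (1 sec, 2 min, 5 hours, ...).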


Existing methods

1-Carrillo and Lipman:

  • MSA, DCA.
  • Few, small, closely related sequences.

2-Segment Based:

  • DIALIGN, MACAW.
  • May Align Too Few Residues
  • Do Well When They Can Run.

3-Iterative:

  • HMMs, HMMER, SAM.
  • Slow, Sometimes Inaccurate
  • Good Profile Generators

4-Progressive:

  • ClustalW, Pileup, Multalign…
  • Fast and Sensitive

5-Mixtures:

  • T-Coffee, MAFFT, MUSCLE, ProbCons, Psi-Praline
  • Very sensitive

SLIDE 18

Progressive alignment

Feng and Doolittle, 1980; Taylor 1981

Dynamic Programming Using A Substitution Matrix


Progressive alignment

Feng and Doolittle, 1980; Taylor 1981

  • Depends on the ORDER of the sequences (Tree).
  • Depends on the CHOICE of the sequences.
  • Depends on the PARAMETERS:
      • Substitution Matrix.
      • Penalties (Gop, Gep).
      • Sequence Weight.
      • Tree making Algorithm.
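The ORDER in which a progressive method joins sequences comes from a guide tree built on pairwise distances. The sketch below illustrates the idea with average-linkage clustering. As a loud simplification, the distance is taken from Python's `difflib.SequenceMatcher` rather than from real pairwise alignments, so this is not how ClustalW or any other program computes its tree; `distance` and `guide_tree` are hypothetical names.

```python
from difflib import SequenceMatcher
from itertools import combinations

def distance(a: str, b: str) -> float:
    """Stand-in distance: 1 minus difflib's similarity ratio (not an alignment score)."""
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()

def guide_tree(seqs: dict):
    """Join the closest clusters first (average linkage); returns nested tuples of names."""
    clusters = [(name, [name]) for name in seqs]  # (subtree, member names)

    def average_distance(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(distance(seqs[x], seqs[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Ungapped N-terminal fragments from the alignment used throughout these slides.
tree = guide_tree({
    "chite": "ADKPKRPLSAYMLWLNSARESIKRENPDFK",
    "wheat": "DPNKPKRAPSAFFVFMGEFREEFKQKNPKNK",
    "trybr": "KKDSNAPKRAMTSFMFFSSDFRS",
    "mouse": "KPKRPRSAYNIYVSESFQ",
})
print(tree)
```

The innermost pair of the nested tuple would be aligned first, and errors made at that step are carried into every later step, which is why the tree and the choice of sequences matter.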

SLIDE 19

Progressive alignment

Works well when phylogeny is dense
No outlier sequence
Example: river crossing


Selecting sequences from a BLAST output

SLIDE 20

A common mistake

Sequences too closely related

Identical sequences bring no information
Multiple sequence alignments thrive on diversity

PRVA_MACFU SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_HUMAN SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE
PRVA_GERSP SMTDLLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKTPDDVKKVFHILDKDKSGFIEE
PRVA_MOUSE SMTDVLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKNPDEVKKVFHILDKDKSGFIEE
PRVA_RAT   SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_RABIT AMTELLNAEDIKKAIGAFAAAESFDHKKFFQMVGLKKKSTEDVKKVFHILDKDKSGFIEE
           :**::*.*******:***:* :****************..::******:***********

PRVA_MACFU DELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_HUMAN DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_GERSP DELGFILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES
PRVA_MOUSE DELGSILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES
PRVA_RAT   DELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES
PRVA_RABIT EELGFILKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES
           :*** ******.******.**** *:************.:******:**
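One simple way to avoid a set like the PRVA alignment above is to drop sequences that are nearly identical to one already kept. The sketch below is a greedy filter over aligned rows; the 90% threshold and the function names are assumptions chosen for illustration, not a recommendation from the slides.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent identity over the columns where neither aligned row has a gap."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(1 for a, b in pairs if a == b) / len(pairs)

def filter_redundant(aligned: dict, max_identity: float = 90.0) -> dict:
    """Keep a sequence only if it is at most max_identity % identical to every kept one."""
    kept = {}
    for name, row in aligned.items():
        if all(percent_identity(row, other) <= max_identity for other in kept.values()):
            kept[name] = row
    return kept

# First 30 columns of three rows from the alignment above.
subset = filter_redundant({
    "PRVA_MACFU": "SMTDLLNAEDIKKAVGAFSAIDSFDHKKFF",
    "PRVA_HUMAN": "SMTDLLNAEDIKKAVGAFSATDSFDHKKFF",
    "PRVA_RABIT": "AMTELLNAEDIKKAIGAFAAAESFDHKKFF",
})
print(list(subset))
```

The result depends on input order, because the first sequence of each near-identical group is the one that survives.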

SLIDE 21

Respect information!

  • This alignment is not informative about the relation between TPCC_MOUSE and the rest of the sequences.
  • A better spread of the sequences is needed.

PRVA_MACFU ------------------------------------------SMTDLLN----AEDIKKA
PRVA_HUMAN ------------------------------------------SMTDLLN----AEDIKKA
PRVA_GERSP ------------------------------------------SMTDLLS----AEDIKKA
PRVA_MOUSE ------------------------------------------SMTDVLS----AEDIKKA
PRVA_RAT   ------------------------------------------SMTDLLS----AEDIKKA
PRVA_RABIT ------------------------------------------AMTELLN----AEDIKKA
TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM
           : :*. .*::::

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_HUMAN VGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVFHMLDKDKSGFIEEDELGFI
PRVA_GERSP IGAFAAADS--FDHKKFFQMVG------LKKKTPDDVKKVFHILDKDKSGFIEEDELGFI
PRVA_MOUSE IGAFAAADS--FDHKKFFQMVG------LKKKNPDEVKKVFHILDKDKSGFIEEDELGSI
PRVA_RAT   IGAFTAADS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGSI
PRVA_RABIT IGAFAAAES--FDHKKFFQMVG------LKKKSTEDVKKVFHILDKDKSGFIEEEELGFI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
           :. . * .*..:*: *: * *. :::..:*:::**: .*:*: :** :

PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_HUMAN LKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_GERSP LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES-
PRVA_MOUSE LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES-
PRVA_RAT   LKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES-
PRVA_RABIT LKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES-
TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE
           *: . .. :: .: : *: ***:.**:*. :** ::


Selecting diverse sequences

  • A REASONABLE model now exists.
  • Going further: remote homologues.

PRVB_CYPCA -AFAGVLNDADIAAALEACKAADSFNHKAFFAKVGLTSKSADDVKKAFAIIDQDKSGFIE
PRVB_BOACO -AFAGILSDADIAAGLQSCQAADSFSCKTFFAKSGLHSKSKDQLTKVFGVIDRDKSGYIE
PRV1_SALSA MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE
PRVB_LATCH -AVAKLLAAADVTAALEGCKADDSFNHKVFFQKTGLAKKSNEELEAIFKILDQDKSGFIE
PRVB_RANES -SITDIVSEKDIDAALESVKAAGSFNYKIFFQKVGLAGKSAADAKKVFEILDRDKSGFIE
PRVA_MACFU -SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIE
PRVA_ESOLU --AKDLLKADDIKKALDAVKAEGSFNHKKFFALVGLKAMSANDVKKVFKAIDADASGFIE
           : *: .: . .* .:*. * ** *: * : * :* * **:**

PRVB_CYPCA EDELKLFLQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA-
PRVB_BOACO EDELKKFLQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG
PRV1_SALSA VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ-
PRVB_LATCH DEELELFLQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA-
PRVB_RANES QDELGLFLQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA-
PRVA_MACFU EDELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_ESOLU EEELKFVLKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA
           :** .*:.* .* *: ** :: .* **** **::** **

SLIDE 22

Aligning remote homologues

PRVA_MACFU ------------------------------------------SMTDLLNA----EDIKKA
PRVA_ESOLU -------------------------------------------AKDLLKA----DDIKKA
PRVB_CYPCA ------------------------------------------AFAGVLND----ADIAAA
PRVB_BOACO ------------------------------------------AFAGILSD----ADIAAG
PRV1_SALSA -----------------------------------------MACAHLCKE----ADIKTA
PRVB_LATCH ------------------------------------------AVAKLLAA----ADVTAA
PRVB_RANES ------------------------------------------SITDIVSE----KDIDAA
TPCS_RABIT -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCS_PIG   -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM
           : ::

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_ESOLU LDAVKAEGS--FNHKKFFALVG------LKAMSANDVKKVFKAIDADASGFIEEEELKFV
PRVB_CYPCA LEACKAADS--FNHKAFFAKVG------LTSKSADDVKKAFAIIDQDKSGFIEEDELKLF
PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
PRVB_LATCH LEGCKADDS--FNHKVFFQKTG------LAKKSNEELEAIFKILDQDKSGFIEDEELELF
PRVB_RANES LESVKAAGS--FNYKIFFQKVG------LAGKSAADAKKVFEILDRDKSGFIEQDELGLF
TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
TPCS_PIG   IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
           : . .: .. . *: * : * :* : .*:*: :** .

PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_ESOLU LKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA-
PRVB_CYPCA LQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA--
PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG-
PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ--
PRVB_LATCH LQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA--
PRVB_RANES LQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA--
TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCS_PIG   FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE
           :: .. :: : :: .* :.** *. :** ::


Going further…

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
TPCS_PIG   IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
TPC_PATYE  SDEMDEEATGRLNCDAWIQLFER---KLKEDLDERELKEAFRVLDKEKKGVIKVDVLRWI
           . : .. . :: . : * :* : .* *. : * .

PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES--
PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG--
PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ---
TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ-
TPCS_PIG   FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ-
TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE-
TPC_PATYE  LS---SLGDELTEEEIENMIAETDTDGSGTVDYEEFKCLMMSSDA
           : . :: : :: * :..* :. :** ::

SLIDE 23

What makes a good alignment…

The more divergent the sequences, the better
The fewer indels, the better
Nice ungapped blocks separated with indels
Different classes of residues within a block:

  • Completely conserved (*)
  • Size and hydropathy conserved (:)
  • Size or hydropathy conserved (.)

The ultimate evaluation is a matter of personal judgment and knowledge
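The "completely conserved" class is the easiest one to compute: a column gets a `*` when every sequence has the same residue there. The sketch below marks only that class. The `:` and `.` classes depend on residue groupings defined by the alignment program and are deliberately left out. The function name is illustrative.

```python
def conservation_line(rows):
    """Return '*' under columns where all rows share one residue, ' ' elsewhere."""
    marks = []
    for column in zip(*rows):
        if "-" not in column and len(set(column)) == 1:
            marks.append("*")
        else:
            marks.append(" ")
    return "".join(marks)

# Columns 6 to 10 of the chite, wheat, trybr and mouse rows.
rows = ["KPKRP", "KPKRA", "APKRA", "KPKRP"]
for row in rows:
    print(row)
print(conservation_line(rows))
```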


Avoiding pitfalls

SLIDE 24

Naming your sequences the right way

  • Never use white spaces in your sequence names.
  • Never use special symbols. Stick to plain letters, numbers and the underscore sign (_) to replace spaces. Avoid ALL other signs, especially the most tempting ones like @, #, |, *, >, <…
  • Never use names longer than 15 characters.
  • Never give the same name to 2 different sequences in your set. Some programs accept it, others like ClustalW don’t.
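These naming rules can be applied mechanically before running any aligner. A minimal sketch, assuming that replacing every forbidden character with an underscore and appending a numeric suffix to duplicates is acceptable for your data; `sanitize_names` is a hypothetical helper, not part of any alignment package.

```python
import re

def sanitize_names(names, max_len=15):
    """Make names safe: letters, digits and '_' only, at most max_len chars, all unique."""
    seen = set()
    result = []
    for name in names:
        clean = re.sub(r"[^A-Za-z0-9_]", "_", name.strip())[:max_len]
        candidate, counter = clean, 1
        while candidate in seen:
            # Make room for the suffix so the name never exceeds max_len.
            suffix = "_%d" % counter
            candidate = clean[: max_len - len(suffix)] + suffix
            counter += 1
        seen.add(candidate)
        result.append(candidate)
    return result

print(sanitize_names(["PRVA RAT (rat)", "my seq", "my seq", "a_very_long_sequence_name"]))
```

Keep a table of old and new names, since the truncated names are what will appear in the alignment.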


Do not use too many sequences!

SLIDE 25

Beware of Repeats

  • There is a problem when two sequences do not contain the same number of repeats.
  • It is then better to manually extract the repeats and to align them separately.
  • Individual repeats can be recognized using Dotlet or Dotter.
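The principle behind a dot plot can be shown in a few lines: compare every window of one sequence with every window of the other and record the matches. This sketch uses exact window matches only, which is much cruder than the scored windows of Dotlet or Dotter, and the toy sequence is invented for the example.

```python
def dot_plot(a: str, b: str, window: int = 3):
    """Return (i, j) positions where a[i:i+window] equals b[j:j+window]."""
    hits = []
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            if a[i:i + window] == b[j:j + window]:
                hits.append((i, j))
    return hits

# Comparing a sequence with itself: repeats show up as off-diagonal hits.
seq = "ABCXXABC"  # toy sequence with the unit "ABC" repeated
off_diagonal = [(i, j) for i, j in dot_plot(seq, seq) if i != j]
print(off_diagonal)
```

In a real plot, each repeat appears as a short line parallel to the main diagonal, which makes the repeat boundaries easy to read off before extracting them.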


Keep a biological perspective

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
      * : .* . :

chite AD--K----PKR-PLYMLWLNS-ARESIKRENPDFK-VT-EVAKKGGELWRGL-
wheat -DPNK----PKRAP-FFVFMGE-FREEFKQKNPKNKSVA-AVGKAAGERWKSLS
trybr -K--KDSNAPKR-AMT-MFFSSDFR-S-KH-S-DLS-IV-EMSKAAGAAWKELG
mouse ----K----PKR-PRYNIYVSESFQEA-K--D-D-S-AQGKL-KLVNEAWKNLS
      * *** .:: ::... : * . . . : * . *: *

chite KSEWEAKAATAKQNY-I--RALQE-YERNG-G-
wheat KAPYVAKANKLKGEY-N--KAIAA-YNK-GESA
trybr RKVYEEMAEKDKERY----K--RE-M-------
mouse KQAYIQLAKDDRIRYDNEMKSWEEQMAE-----
      : : * : .* :

DIFFERENT PARAMETERS

SLIDE 26

Do not overtune!!!

DO NOT PLAY WITH PARAMETERS! IF YOU KNOW THE ALIGNMENT YOU WANT: MAKE IT YOURSELF!

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
      * : .* . :

chite ---ADKPKRPL-SAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAP-SAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS-----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPR-SAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. * .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
      * : .* . :


BaliBase classification and benchmark

PROBLEM (Description)

  • Even Phylogenetic Spread
  • One Outlier Sequence
  • Two Distantly Related Groups
  • Long Internal Indel
  • Long Terminal Indel

SLIDE 27

Choosing the right method

Source: BAliBASE, Thompson et al., NAR (1999); Do et al., Genome Res. (2005)

PROBLEM Program Strategy

ClustalW, T-Coffee, MUSCLE, ProbCons
ProbCons, MUSCLE, MAFFT
Dialign II, ProbCons, T-Coffee
T-Coffee, MUSCLE, ProbCons
Dialign II, ProbCons, MAFFT


Some interesting links

SLIDE 28

More links

MUSCLE

http://www.drive5.com/muscle

MAFFT

http://timpani.genome.ad.jp/~mafft/server

PROBCONS

http://probcons.stanford.edu

PSI-PRALINE

http://ibivu.cs.vu.nl/programs/pralinewww

3D-COFFEE

http://www.igs.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi


Conclusion

The best alignment method:

  • Your brain
  • The right data

The best evaluation method:

  • Your eyes
  • Experimental information (SwissProt)

Choosing the sequences well is important

Beware of repeated elements

What can I conclude?

  • Homology -> information extrapolation

How can I go further?

  • Patterns
  • Profiles
  • HMMs
  • …