CSCI 490 Bioinformatics: Multiple Sequence Alignment



slide-1
SLIDE 1

CSCI 490 Bioinformatics

Multiple Sequence Alignment

slide-2
SLIDE 2

Multiple Sequence Alignment

  • Motivation:

– A faint similarity between two sequences becomes very significant if present in many sequences

  • Definition

– Given N sequences x1, x2, …, xN: insert gaps (-) in each sequence xi such that

  • All sequences have the same length L
  • The score of the alignment is maximum

  • Two issues

– How to score an alignment?
– How to find a (nearly) optimal alignment?

slide-3
SLIDE 3

Scoring Function: Sum Of Pairs

Definition: Induced pairwise alignment

A pairwise alignment induced by the multiple alignment: keep the two rows of interest and drop every column in which both rows have a gap.

Example:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
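The projection described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `induced_pairwise` and the representation of an alignment as a list of equal-length strings are assumptions, not part of the slides.

```python
def induced_pairwise(msa, k, l):
    """Project a multiple alignment onto rows k and l.

    Columns in which both rows hold a gap carry no information
    about the pair, so they are dropped.
    """
    cols = [(a, b) for a, b in zip(msa[k], msa[l])
            if not (a == "-" and b == "-")]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)


# The three-row alignment from the slide (rows x, y, z).
msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]

print(induced_pairwise(msa, 0, 1))  # ('ACGCGG-C', 'ACGC-GAG')
print(induced_pairwise(msa, 0, 2))  # ('AC-GCGG-C', 'GCCGC-GAG')
print(induced_pairwise(msa, 1, 2))  # ('AC-GCGAG', 'GCCGCGAG')
```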

slide-4
SLIDE 4

Sum Of Pairs (cont’d)

  • The sum-of-pairs (SP) score of an alignment is the sum of the scores of all induced pairwise alignments:

S(m) = \sum_{k<l} s(m^k, m^l)

where s(m^k, m^l) is the score of the induced alignment of sequences k and l.

slide-5
SLIDE 5

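Example:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Scoring matrix:

      A   C   G   T   -
A     1  -1  -1  -1  -1
C    -1   1  -1  -1  -1
G    -1  -1   1  -1  -1
T    -1  -1  -1   1  -1
-    -1  -1  -1  -1   0

(The (-,-) entry did not survive in the source; 0 is the value implied by the column scores below, where columns 3 and 6 each score -2.)

Column 1: (A,A) + (A,G) x 2 = -1
Columns 4 and 7: (G,G) x 3 = 3
Column 8: (-,A) x 2 + (A,A) = -1

Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5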
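The column-by-column arithmetic of this example can be checked with a short Python sketch. The function names are illustrative, and the value s(-,-) = 0 is an assumption inferred from the column totals on the slide rather than something the slide states.

```python
from itertools import combinations


def pair_score(a, b):
    """Match = 1, mismatch = -1, residue against gap = -1."""
    if a == "-" and b == "-":
        return 0  # assumption: implied by the slide's column totals
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def column_score(col):
    """Sum of the scores of all pairs of symbols in one alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))


def sp_score(msa):
    """Sum-of-pairs score of an alignment given as equal-length strings."""
    return sum(column_score(col) for col in zip(*msa))


msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print([column_score(col) for col in zip(*msa)])
# [-1, 3, -2, 3, 3, -2, 3, -1, -1]
print(sp_score(msa))  # 5
```

Summing over columns gives the same total as summing the three induced pairwise alignment scores, because a column with a gap in both rows of a pair contributes 0 to that pair.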

slide-6
SLIDE 6

Multiple Sequence Alignments Algorithms

  • Can also be global or local

– We only talk about global for now

  • A simple method

– Do pairwise alignment between all pairs
– Combine the pairwise alignments into a single multiple alignment
– Is this going to work?

slide-7
SLIDE 7

Compatible pairwise alignments

Sequences: AAAATTTT, TTTTGGGG, AAAAGGGG

Pairwise alignments:

AAAATTTT----
----TTTTGGGG

AAAATTTT----
AAAA----GGGG

----TTTTGGGG
AAAA----GGGG

Combined multiple alignment:

AAAATTTT----
----TTTTGGGG
AAAA----GGGG

slide-8
SLIDE 8

Incompatible pairwise alignments

Sequences: AAAATTTT, TTTTGGGG, GGGGAAAA

Pairwise alignments:

AAAATTTT----
----TTTTGGGG

----AAAATTTT
GGGGAAAA----

TTTTGGGG----
----GGGGAAAA

Combined multiple alignment: ?

slide-9
SLIDE 9

Multidimensional Dynamic Programming (MDP)

Generalization of Needleman-Wunsch:

  • Find the longest (highest-scoring) path in a high-dimensional cube

– As opposed to a two-dimensional grid

  • Uses an N-dimensional matrix

– As opposed to a two-dimensional array

  • Entry F(i1, …, iN) represents the score of the optimal alignment of the prefixes s1[1..i1], …, sN[1..iN]

F(i1, i2, …, iN) = max over all neighbors nbr of the cell of (F(nbr) + S(current))

slide-10
SLIDE 10
Multidimensional Dynamic Programming (MDP)

  • Example: in 3D (three sequences):
  • 2^3 – 1 = 7 neighbors/cell

F(i,j,k) = max of
    F(i-1,j-1,k-1) + S(xi, yj, zk),
    F(i-1,j-1,k  ) + S(xi, yj, -),
    F(i-1,j  ,k-1) + S(xi, -, zk),
    F(i  ,j-1,k-1) + S(-, yj, zk),
    F(i-1,j  ,k  ) + S(xi, -, -),
    F(i  ,j-1,k  ) + S(-, yj, -),
    F(i  ,j  ,k-1) + S(-, -, zk)

[Figure: cube showing cell (i,j,k) and its seven neighbors (i-1,j,k), (i,j-1,k), (i,j,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j-1,k-1)]
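The three-sequence recurrence can be written out directly. The sketch below is a minimal Python illustration that returns only the optimal score (no traceback). It assumes the column score S is the sum-of-pairs score with match = 1, mismatch = -1, residue/gap = -1 and gap/gap = 0; the function names are hypothetical.

```python
from itertools import combinations, product


def pair_score(a, b):
    if a == "-" and b == "-":
        return 0  # assumption: gap/gap pairs are free
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def column_score(col):
    return sum(pair_score(a, b) for a, b in combinations(col, 2))


def align3_score(x, y, z):
    """Optimal sum-of-pairs score of a global alignment of three sequences."""
    neg_inf = float("-inf")
    F = [[[neg_inf] * (len(z) + 1) for _ in range(len(y) + 1)]
         for _ in range(len(x) + 1)]
    F[0][0][0] = 0
    # The 2^3 - 1 = 7 moves: each sequence either consumes a residue (1)
    # or receives a gap (0); the all-gap move (0, 0, 0) is excluded.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for k in range(len(z) + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue  # neighbor lies outside the cube
                    col = (x[i - 1] if di else "-",
                           y[j - 1] if dj else "-",
                           z[k - 1] if dk else "-")
                    F[i][j][k] = max(F[i][j][k],
                                     F[pi][pj][pk] + column_score(col))
    return F[len(x)][len(y)][len(z)]


print(align3_score("ACG", "ACG", "ACG"))  # 9
```

Three identical sequences of length 3 score 9 because each of the three columns contributes three matching pairs.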

slide-11
SLIDE 11

Multidimensional Dynamic Programming (MDP)

Running Time:

1. Size of matrix: L^N, where L = length of each sequence and N = number of sequences
2. Neighbors/cell: 2^N – 1

Therefore: O(2^N L^N)
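To get a feel for how quickly this bound grows, the two factors can be multiplied out for a few values of N. The helper below is purely illustrative arithmetic; the name `mdp_work` and the choice L = 300 are assumptions made for the example.

```python
def mdp_work(L, N):
    """Return (matrix cells, neighbors per cell, cells * neighbors)."""
    cells = L ** N
    neighbors = 2 ** N - 1
    return cells, neighbors, cells * neighbors


for n in (2, 3, 6):
    cells, neighbors, total = mdp_work(300, n)
    print(f"N={n}: {cells:.2e} cells x {neighbors} neighbors = {total:.2e} updates")
```

Going from two to six sequences of length 300 raises the cell count from 300^2 to 300^6, which is why the plain cube is only usable for a handful of short sequences.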

slide-12
SLIDE 12

Faster MDP

  • Carrillo & Lipman, 1988

– Branch and bound
– Other heuristics

  • Implemented in a tool called MSA
  • Practical for about 6 sequences of length about 200-300

slide-13
SLIDE 13

Faster MDP

  • Basic idea: bounds on the optimal score of a multiple alignment can be pre-computed

– Upper bound: sum of optimal pairwise alignment scores, i.e. S(m) = \sum_{k<l} s(m^k, m^l) \le \sum_{k<l} s(k, l), where m is the optimal multiple alignment, s(m^k, m^l) is the score of the alignment between k and l induced by m, and s(k, l) is the score of the optimal alignment between k and l
– Lower bound: the score computed by any approximate algorithm
– For any partial path, if S_current + S_prospective < lower bound, that path can be abandoned (S_prospective being an upper bound on the score still attainable along the rest of the path)
– Guarantees optimality
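The upper bound can be computed with ordinary pairwise dynamic programming. The sketch below is a minimal Python illustration, not the MSA tool's implementation: it assumes a linear gap cost of -1, match = 1, mismatch = -1, and the names `nw_score` and `sp_upper_bound` are hypothetical.

```python
from itertools import combinations


def pair_score(a, b):
    if a == "-" or b == "-":
        return -1  # residue against gap (linear gap cost, an assumption)
    return 1 if a == b else -1


def nw_score(x, y):
    """Optimal global pairwise alignment score (Needleman-Wunsch)."""
    prev = [-j for j in range(len(y) + 1)]  # row 0: j leading gaps
    for i in range(1, len(x) + 1):
        cur = [-i]  # column 0: i leading gaps
        for j in range(1, len(y) + 1):
            cur.append(max(
                prev[j - 1] + pair_score(x[i - 1], y[j - 1]),
                prev[j] + pair_score(x[i - 1], "-"),
                cur[j - 1] + pair_score("-", y[j - 1]),
            ))
        prev = cur
    return prev[-1]


def sp_upper_bound(seqs):
    """Sum of optimal pairwise scores: no multiple alignment can score higher."""
    return sum(nw_score(a, b) for a, b in combinations(seqs, 2))


seqs = ["ACGCGGC", "ACGCGAG", "GCCGCGAG"]
print(sp_upper_bound(seqs))
```

For these three sequences the bound is at least 5, the sum-of-pairs score of the three-row example alignment shown earlier, since each induced pairwise alignment can score no more than the optimal alignment of that pair.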

slide-14
SLIDE 14

Progressive Alignment

  • Multiple alignment is NP-hard
  • Most used heuristic: progressive alignment

Algorithm:

1. Align two of the sequences xi, xj
2. Fix that alignment
3. Align a third sequence xk to the alignment xi,xj
4. Repeat until all sequences are aligned

Running Time: O(N L^2)

Each alignment takes O(L^2); repeat N times
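The four steps can be sketched as a loop that keeps adding one sequence to a frozen alignment. This is a simplified Python illustration under stated assumptions: sequences are added in input order rather than by similarity, a column is scored against a new residue by summing pairwise scores (match = 1, mismatch = -1, residue/gap = -1, gap/gap = 0), and the function names are hypothetical.

```python
def pair_score(a, b):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def add_sequence(rows, s):
    """Align sequence s to a fixed alignment (list of equal-length rows)."""
    cols = list(zip(*rows))
    n, m, k = len(cols), len(s), len(rows)

    def col_vs(col, ch):  # score of one existing column against symbol ch
        return sum(pair_score(c, ch) for c in col)

    F = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0], back[i][0] = F[i - 1][0] + col_vs(cols[i - 1], "-"), "up"
    for j in range(1, m + 1):
        F[0][j], back[0][j] = F[0][j - 1] - k, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j], back[i][j] = max(
                (F[i - 1][j - 1] + col_vs(cols[i - 1], s[j - 1]), "diag"),
                (F[i - 1][j] + col_vs(cols[i - 1], "-"), "up"),
                (F[i][j - 1] - k, "left"),  # new all-gap column in old rows
            )
    new_cols, new_s, i, j = [], [], n, m
    while i > 0 or j > 0:  # traceback from the bottom-right corner
        move = back[i][j]
        if move == "diag":
            new_cols.append(cols[i - 1]); new_s.append(s[j - 1]); i -= 1; j -= 1
        elif move == "up":
            new_cols.append(cols[i - 1]); new_s.append("-"); i -= 1
        else:
            new_cols.append(("-",) * k); new_s.append(s[j - 1]); j -= 1
    new_cols.reverse(); new_s.reverse()
    out = ["".join(col[r] for col in new_cols) for r in range(k)]
    return out + ["".join(new_s)]


def progressive_align(seqs):
    rows = [seqs[0]]
    for s in seqs[1:]:
        rows = add_sequence(rows, s)  # earlier rows are never re-examined
    return rows


for row in progressive_align(["ACGCGGC", "ACGCGAG", "GCCGCGAG"]):
    print(row)
```

Scoring a column against a residue by summing over rows makes each step cost more than a plain pairwise alignment; the O(L^2) per step quoted on the slide corresponds to treating the column score as a constant-time lookup.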

slide-15
SLIDE 15

Progressive Alignment

  • When the evolutionary tree is known:

– Align closest first, in the order of the tree

Example: Order of alignments:

  1. (x,y)
  2. (z,w)
  3. (xy, zw)

[Figure: evolutionary tree with leaves x, y, z, w]

slide-16
SLIDE 16

Progressive Alignment: CLUSTALW

CLUSTALW: the most popular multiple protein alignment tool

Algorithm:

1. Find all dij: alignment distance (xi, xj)

  • High alignment score => short distance

2. Construct a tree (similar to hierarchical clustering)
3. Align nodes in order of decreasing similarity

+ a large number of heuristics

slide-17
SLIDE 17

CLUSTALW example

  • S1 ALSK
  • S2 TNSD
  • S3 NASK
  • S4 NTSD
slide-18
SLIDE 18

CLUSTALW example

  • S1 ALSK
  • S2 TNSD
  • S3 NASK
  • S4 NTSD

Distance matrix:

      s1  s2  s3  s4
s1         9   4   7
s2             8   3
s3                 7
s4

slide-19
SLIDE 19

CLUSTALW example

  • S1 ALSK
  • S2 TNSD
  • S3 NASK
  • S4 NTSD

Distance matrix:

      s1  s2  s3  s4
s1         9   4   7
s2             8   3
s3                 7
s4

[Figure: guide tree pairing s1 with s3 and s2 with s4]
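How a tree with this shape falls out of the distance matrix can be sketched with a small clustering routine. The Python below uses average-linkage agglomerative clustering as a stand-in for the "similar to hierarchical clustering" step; that linkage choice, the function name `guide_tree`, and the data layout are assumptions made for illustration, not CLUSTALW's actual tree-building code.

```python
from itertools import combinations


def guide_tree(names, dist):
    """Average-linkage clustering.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    Returns the nested-tuple tree and the list of merges in order.
    """
    def avg(c1, c2):  # mean leaf-to-leaf distance between two clusters
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    clusters = [(name, (name,)) for name in names]  # (subtree, leaves)
    merges = []
    while len(clusters) > 1:
        d, i, j = min((avg(clusters[i], clusters[j]), i, j)
                      for i, j in combinations(range(len(clusters)), 2))
        a, b = clusters[i], clusters[j]
        merges.append((a[0], b[0], d))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0], merges


names = ["s1", "s2", "s3", "s4"]
dist = {frozenset(pair): d for pair, d in {
    ("s1", "s2"): 9, ("s1", "s3"): 4, ("s1", "s4"): 7,
    ("s2", "s3"): 8, ("s2", "s4"): 3, ("s3", "s4"): 7,
}.items()}

tree, merges = guide_tree(names, dist)
print(tree)    # (('s2', 's4'), ('s1', 's3'))
print(merges)  # [('s2', 's4', 3.0), ('s1', 's3', 4.0), (('s2', 's4'), ('s1', 's3'), 7.75)]
```

The closest pair (s2, s4, distance 3) is merged first and (s1, s3, distance 4) second, which gives the same two groups as the tree on the slide.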

slide-20
SLIDE 20

CLUSTALW example

  • S1 ALSK
  • S2 TNSD
  • S3 NASK
  • S4 NTSD

Distance matrix:

      s1  s2  s3  s4
s1         9   4   7
s2             8   3
s3                 7
s4

[Figure: guide tree pairing s1 with s3 and s2 with s4]

S1: -ALSK
S3: NA-SK

slide-21
SLIDE 21

CLUSTALW example

  • S1 ALSK
  • S2 TNSD
  • S3 NASK
  • S4 NTSD

Distance matrix:

      s1  s2  s3  s4
s1         9   4   7
s2             8   3
s3                 7
s4

[Figure: guide tree pairing s1 with s3 and s2 with s4]

S1: -ALSK
S3: NA-SK

S2: -TNSD
S4: NT-SD

slide-22
SLIDE 22

CLUSTALW example

  • S1 ALSK
  • S2 TNSD
  • S3 NASK
  • S4 NTSD

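Distance matrix:

      s1  s2  s3  s4
s1         9   4   7
s2             8   3
s3                 7
s4

[Figure: guide tree pairing s1 with s3 and s2 with s4]

S1: -ALSK
S3: NA-SK

S2: -TNSD
S4: NT-SD

Final alignment:

S1: -ALSK
S2: -TNSD
S3: NA-SK
S4: NT-SD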

slide-23
SLIDE 23

Iterative Refinement

Problems with progressive alignment:

  • Depends on pairwise alignments
  • If sequences are very distantly related, there is a much higher likelihood of errors
  • Initial alignments are “frozen” even when new evidence comes

Example:

x: GAAGTT
y: GAC-TT    (frozen!)
z: GAACTG
w: GTACTG

Once z and w are added it is clear that the correct y should be GA-CTT

slide-24
SLIDE 24

Iterative Refinement

Algorithm (Barton-Sternberg):

  1. Align most similar xi, xj
  2. Align xk most similar to (xixj)
  3. Repeat 2 until (x1…xN) are aligned
  4. For j = 1 to N, remove xj, and realign it to x1…xj-1xj+1…xN
  5. Repeat 4 until convergence

Steps 1-3 are progressive alignment.
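Steps 4 and 5 can be sketched on top of any routine that aligns one sequence to a fixed alignment. The Python below is a minimal illustration, not the Barton-Sternberg implementation: it assumes sum-of-pairs scoring with match = 1, mismatch = -1, residue/gap = -1 and gap/gap = 0, treats "convergence" as "no realignment raises the sum-of-pairs score", and all function names are hypothetical.

```python
from itertools import combinations


def pair_score(a, b):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def sp_score(rows):
    return sum(pair_score(a, b)
               for col in zip(*rows) for a, b in combinations(col, 2))


def strip_gap_columns(rows):
    cols = [c for c in zip(*rows) if any(ch != "-" for ch in c)]
    return ["".join(c[r] for c in cols) for r in range(len(rows))]


def add_sequence(rows, s):
    """Align s to the fixed rows; returns the rows with s appended."""
    cols, k = list(zip(*rows)), len(rows)
    n, m = len(cols), len(s)
    vs = lambda col, ch: sum(pair_score(c, ch) for c in col)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0], back[i][0] = F[i - 1][0] + vs(cols[i - 1], "-"), "up"
    for j in range(1, m + 1):
        F[0][j], back[0][j] = F[0][j - 1] - k, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j], back[i][j] = max(
                (F[i - 1][j - 1] + vs(cols[i - 1], s[j - 1]), "diag"),
                (F[i - 1][j] + vs(cols[i - 1], "-"), "up"),
                (F[i][j - 1] - k, "left"))
    out_cols, out_s, i, j = [], [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            out_cols.append(cols[i - 1]); out_s.append(s[j - 1]); i -= 1; j -= 1
        elif move == "up":
            out_cols.append(cols[i - 1]); out_s.append("-"); i -= 1
        else:
            out_cols.append(("-",) * k); out_s.append(s[j - 1]); j -= 1
    out_cols.reverse(); out_s.reverse()
    return ["".join(c[r] for c in out_cols) for r in range(k)] + ["".join(out_s)]


def refine(rows, max_rounds=10):
    """Remove each row in turn and realign it; keep changes that raise SP."""
    best = sp_score(rows)
    for _ in range(max_rounds):
        improved = False
        for j in range(len(rows)):
            seq = rows[j].replace("-", "")
            rest = strip_gap_columns(rows[:j] + rows[j + 1:])
            cand = add_sequence(rest, seq)
            cand.insert(j, cand.pop())  # put the realigned row back at index j
            if sp_score(cand) > best:
                rows, best, improved = cand, sp_score(cand), True
        if not improved:
            break
    return rows


start = ["GAAGTTA", "GAC-TTA", "GAACTGA", "GTACTGA"]
for row in refine(start):
    print(row)
```

Accepting only score-raising realignments guarantees termination; the `max_rounds` cap is an extra safeguard chosen for this sketch.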

slide-25
SLIDE 25

Iterative Refinement (cont’d)

For each sequence y

  1. Remove y
  2. Realign y (while the rest stay fixed)

[Figure: sequences x, y, z; the projection onto x and z is held fixed while y is allowed to vary]

slide-26
SLIDE 26

Iterative Refinement

Example: align (x,y), (z,w), (xy, zw):

x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA

After realigning y:

x: GAAGTTA
y: G-ACTTA    (+ 3 matches)
z: GAACTGA
w: GTACTGA

slide-27
SLIDE 27

Iterative Refinement

  • Example not handled well:

x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA
z:  GAACTGA
w:  GTACTGA

Realigning any single yi changes nothing

slide-28
SLIDE 28

Other approaches

  • Statistical learning methods

– Profile Hidden Markov Models

  • Consistency-based methods

– Still rely on pairwise alignment

  • But consider a third seq when aligning two seqs
  • If block A in seq x aligns to block B in seq y, and both align to block C in seq z, we have higher confidence that the alignment between A and B is reliable

  • Essentially: change scoring system according to consistency
  • Then apply DP as in other approaches

– Pioneered by a tool called T-Coffee
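The idea of rescoring a pair by its support from third sequences can be sketched as follows. This Python example is loosely modeled on the library-extension step of consistency-based aligners; the data layout, the use of the minimum of the two supporting weights, and the name `extend_library` are assumptions for illustration. It only reweights pairs already present in the pairwise library and should not be read as T-Coffee's actual code.

```python
from collections import defaultdict


def extend_library(primary):
    """Raise the weight of residue pairs that third sequences agree with.

    primary maps (res_a, res_b) to a weight, where a residue is a tuple
    (sequence_name, position) and the pair was aligned in some pairwise
    alignment. A pair (a, b) gains min(W(a, c), W(c, b)) for every residue c
    of another sequence that is aligned to both a and b.
    """
    nbrs = defaultdict(dict)
    for (a, b), w in primary.items():
        nbrs[a][b] = w
        nbrs[b][a] = w
    extended = {}
    for (a, b), w in primary.items():
        support = sum(min(nbrs[a][c], nbrs[b][c])
                      for c in nbrs[a]
                      if c in nbrs[b] and c[0] not in (a[0], b[0]))
        extended[(a, b)] = w + support
    return extended


# Hypothetical library: x3-y3 is confirmed through z4, x5-y6 is not.
primary = {
    (("x", 3), ("y", 3)): 1,
    (("x", 3), ("z", 4)): 1,
    (("y", 3), ("z", 4)): 1,
    (("x", 5), ("y", 6)): 1,
}
extended = extend_library(primary)
print(extended[(("x", 3), ("y", 3))])  # 2
print(extended[(("x", 5), ("y", 6))])  # 1
```

The extended weights would then replace the substitution scores in the dynamic-programming step, so that consistent pairs are preferred.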

slide-29
SLIDE 29

Multiple alignment tools

  • Clustal W (Thompson, 1994)

– Most popular

  • T-Coffee (Notredame, 2000)

– Another popular tool
– Consistency-based
– Slower than clustalW, but generally more accurate for more distantly related sequences

  • MUSCLE (Edgar, 2004)

– Iterative refinement
– More efficient than most others

  • DIALIGN (Morgenstern, 1998, 1999, 2005)

– “local”

  • Align-m (Walle, 2004)

– “local”

  • PROBCONS (Do, 2004)

– Probabilistic consistency-based
– Best accuracy on benchmarks

  • ProDA (Phuong, 2006)

– Allow repeated and shuffled regions

slide-30
SLIDE 30

In summary

  • Multiple alignment scoring functions

– Sum of pairs
– Other functions exist, but are less used

  • Multiple alignment algorithms:

– MDP

  • Optimal
  • Too slow
  • Branch & bound doesn’t solve the problem entirely

– Heuristics

  • Progressive alignment: clustalW
  • Iterative refinement
  • Consistency-based