CSCI 490 Bioinformatics Multiple Sequence Alignment Multiple - PowerPoint PPT Presentation

CSCI 490 Bioinformatics Multiple Sequence Alignment

Multiple Sequence Alignment
• Motivation:
  – A faint similarity between two sequences becomes very significant if present in many sequences
• Definition:
  – Given N sequences $x_1, x_2, \dots, x_N$: insert gaps (-) in each sequence $x_i$, such that
    • All sequences have the same length L
    • Score of the alignment is maximum
• Two issues:
  – How to score an alignment?
  – How to find a (nearly) optimal alignment?

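Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment: keep the two rows and drop every column in which both rows have a gap.
Example: the sequences
  x: ACGCGGC
  y: ACGCGAG
  z: GCCGCGAG
with the multiple alignment
  x: AC-GCGG-C
  y: AC-GC-GAG
  z: GCCGC-GAG
Induces:
  x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
  y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG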
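The projection step can be sketched in a few lines of Python. This is only an illustration: the function name `induced_pairwise` is hypothetical, and the multiple alignment used is the one from the scoring example in these slides.

```python
def induced_pairwise(msa, k, l, gap="-"):
    """Project rows k and l of a multiple alignment.

    Columns where both rows hold a gap carry no information for the
    pair, so they are dropped.
    """
    cols = [(a, b) for a, b in zip(msa[k], msa[l])
            if not (a == gap and b == gap)]
    row_k = "".join(a for a, _ in cols)
    row_l = "".join(b for _, b in cols)
    return row_k, row_l


# Multiple alignment of x, y, z taken from the slides.
msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]

xy = induced_pairwise(msa, 0, 1)  # ('ACGCGG-C', 'ACGC-GAG')
xz = induced_pairwise(msa, 0, 2)  # ('AC-GCGG-C', 'GCCGC-GAG')
yz = induced_pairwise(msa, 1, 2)  # ('AC-GCGAG', 'GCCGCGAG')
```

Only the (x, y) and (y, z) projections lose a column, because those are the pairs that share an all-gap column.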

Sum Of Pairs (cont’d)
• The sum-of-pairs (SP) score of an alignment is the sum of the scores of all induced pairwise alignments:
  $S(m) = \sum_{k<l} s(m_k, m_l)$
  where $s(m_k, m_l)$ is the score of the induced alignment (k, l)

Example:
  x: AC-GCGG-C
  y: AC-GC-GAG
  z: GCCGC-GAG

Scoring matrix:
        A   C   G   T   -
  A     1  -1  -1  -1  -1
  C    -1   1  -1  -1  -1
  G    -1  -1   1  -1  -1
  T    -1  -1  -1   1  -1
  -    -1  -1  -1  -1   0

Column scores:
  Column 1 (A, A, G): (A,A) + (A,G) x 2 = -1
  Column 4 (G, G, G): (G,G) x 3 = 3
  Column 8 (-, A, A): (-,A) x 2 + (A,A) = -1

Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5
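The arithmetic of this example can be checked with a short Python sketch. The helper names are hypothetical; the scoring (match 1, mismatch -1, letter against gap -1, gap against gap 0) is read off the matrix on the slide.

```python
from itertools import combinations

GAP = "-"


def pair_score(a, b):
    """Score one pair of symbols using the slide's matrix."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -1
    return 1 if a == b else -1


def column_scores(msa):
    """Sum-of-pairs score of every column of the alignment."""
    return [sum(pair_score(a, b) for a, b in combinations(col, 2))
            for col in zip(*msa)]


def sp_score(msa):
    """Total sum-of-pairs score of the alignment."""
    return sum(column_scores(msa))


msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(column_scores(msa))  # [-1, 3, -2, 3, 3, -2, 3, -1, -1]
print(sp_score(msa))       # 5
```

Scoring column by column gives the same total as scoring each induced pairwise alignment, because a dropped gap-gap column contributes 0 either way.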

Multiple Sequence Alignment Algorithms
• Can also be global or local
  – We only talk about global for now
• A simple method
  – Do pairwise alignment between all pairs
  – Combine the pairwise alignments into a single multiple alignment
  – Is this going to work?

Compatible pairwise alignments
Sequences: AAAATTTT, TTTTGGGG, AAAAGGGG
Pairwise alignments:
  AAAATTTT----    AAAATTTT----    ----TTTTGGGG
  ----TTTTGGGG    AAAA----GGGG    AAAA----GGGG
Combined multiple alignment:
  AAAATTTT----
  ----TTTTGGGG
  AAAA----GGGG

Incompatible pairwise alignments
Sequences: AAAATTTT, TTTTGGGG, GGGGAAAA
Pairwise alignments:
  AAAATTTT----    ----AAAATTTT    TTTTGGGG----
  ----TTTTGGGG    GGGGAAAA----    ----GGGGAAAA
Combined multiple alignment: ?

Multidimensional Dynamic Programming (MDP)
Generalization of Needleman-Wunsch:
• Find the longest path in a high-dimensional cube
  – As opposed to a two-dimensional grid
• Uses an N-dimensional matrix
  – As opposed to a two-dimensional array
• Entry $F(i_1, \dots, i_N)$ represents the score of the optimal alignment for $s_1[1..i_1], \dots, s_N[1..i_N]$
  $F(i_1, i_2, \dots, i_N) = \max_{\text{nbr}} \left( F(\text{nbr}) + S(\text{current}) \right)$, where the maximum runs over all neighbors of the cell

Multidimensional Dynamic Programming (MDP)
• Example: in 3D (three sequences):
• $2^3 - 1 = 7$ neighbors/cell: (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j,k), (i,j-1,k), (i,j,k-1)

$$
F(i,j,k) = \max \begin{cases}
F(i-1,j-1,k-1) + S(x_i, y_j, z_k) \\
F(i-1,j-1,k) + S(x_i, y_j, -) \\
F(i-1,j,k-1) + S(x_i, -, z_k) \\
F(i,j-1,k-1) + S(-, y_j, z_k) \\
F(i-1,j,k) + S(x_i, -, -) \\
F(i,j-1,k) + S(-, y_j, -) \\
F(i,j,k-1) + S(-, -, z_k)
\end{cases}
$$
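A minimal Python sketch of the three-sequence recurrence follows. It is an illustration under stated assumptions, not a reference implementation: the column score S is taken to be the sum-of-pairs score with match 1, mismatch -1, letter against gap -1 and gap against gap 0, and the function name `align3_score` is hypothetical. It returns only the optimal score, without a traceback.

```python
from itertools import combinations, product

GAP = "-"


def pair_score(a, b):
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -1
    return 1 if a == b else -1


def column_score(col):
    """S(.,.,.): sum-of-pairs score of one alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))


def align3_score(x, y, z):
    """Optimal SP score of a global alignment of three sequences."""
    n, m, p = len(x), len(y), len(z)
    neg_inf = float("-inf")
    F = [[[neg_inf] * (p + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    F[0][0][0] = 0
    # The 2^3 - 1 = 7 moves: each sequence either consumes a letter (1)
    # or receives a gap (0); the all-zero move is excluded.
    moves = [d for d in product((0, 1), repeat=3) if any(d)]
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(p + 1):
                if i == j == k == 0:
                    continue
                best = neg_inf
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    col = (x[i - 1] if di else GAP,
                           y[j - 1] if dj else GAP,
                           z[k - 1] if dk else GAP)
                    best = max(best, F[pi][pj][pk] + column_score(col))
                F[i][j][k] = best
    return F[n][m][p]


print(align3_score("ACGT", "ACGT", "ACGT"))  # 12: four columns scoring 3 each
```

Generalizing to N sequences only changes `repeat=3` and the nesting depth of the table, which is exactly where the exponential cost comes from.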

Multidimensional Dynamic Programming (MDP)
Running Time:
1. Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences
2. Neighbors/cell: $2^N - 1$
Therefore: $O(2^N L^N)$
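To get a feel for the growth, here is the same count worked out for two cases. The value L = 300 is only an illustrative choice.

```latex
\begin{align*}
N = 3,\ L = 300: &\quad L^N (2^N - 1) = 2.7 \times 10^{7} \cdot 7 \approx 1.9 \times 10^{8} \\
N = 6,\ L = 300: &\quad L^N (2^N - 1) = 7.29 \times 10^{14} \cdot 63 \approx 4.6 \times 10^{16}
\end{align*}
```

Doubling the number of sequences from 3 to 6 multiplies the work by roughly $2.4 \times 10^{8}$.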

Faster MDP
• Carrillo & Lipman, 1988
  – Branch and bound
  – Other heuristics
• Implemented in a tool called MSA
• Practical for about 6 sequences of length about 200-300

Faster MDP
• Basic idea: bounds on the optimal score of a multiple alignment can be pre-computed
  – Upper bound: sum of optimal pairwise alignment scores, i.e.
    $S(m) = \sum_{k<l} s(m_k, m_l) \le \sum_{k<l} s(k, l)$
    where m is the optimal multiple alignment, $s(m_k, m_l)$ is the score of the alignment between k and l induced by m, and $s(k, l)$ is the score of the optimal alignment between k and l
  – Lower bound: score computed by any approximate algorithm
  – For any partial path, if $S_{\text{current}} + S_{\text{prospective}} <$ lower bound, can give up that path ($S_{\text{prospective}}$ is an upper bound on what the rest of the path can still score)
  – Guarantees optimality
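The upper bound can be pre-computed with ordinary pairwise dynamic programming. The sketch below assumes a linear gap penalty and the scores match 1, mismatch -1, gap -1; the names `nw_score` and `sp_upper_bound` are hypothetical and are not taken from the MSA tool.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global pairwise alignment score (Needleman-Wunsch),
    keeping only two rows of the table."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                           prev[j] + gap,     # a[i-1] against a gap
                           cur[j - 1] + gap)) # b[j-1] against a gap
        prev = cur
    return prev[-1]


def sp_upper_bound(seqs):
    """Sum of optimal pairwise scores: no multiple alignment of seqs
    can have a higher sum-of-pairs score under the same scoring."""
    return sum(nw_score(a, b) for a, b in combinations(seqs, 2))


seqs = ["ACGCGGC", "ACGCGAG", "GCCGCGAG"]
bound = sp_upper_bound(seqs)
```

Any multiple alignment of these three sequences, including the one scored 5 in the sum-of-pairs example, has an SP score no greater than `bound`.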

Progressive Alignment
• Multiple alignment is NP-hard
• Most used heuristic: Progressive Alignment
Algorithm:
1. Align two of the sequences $x_i$, $x_j$
2. Fix that alignment
3. Align a third sequence $x_k$ to the alignment $x_i, x_j$
4. Repeat until all sequences are aligned
Running Time: $O(NL^2)$
• Each alignment takes $O(L^2)$
• Repeat N times
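The four steps can be sketched as follows. This is a simplified illustration with several assumptions: sequences are added in the order given rather than by similarity, a new sequence is scored against each fixed column by summing pair scores (match 1, mismatch -1, gap -1, gap against gap 0), and the names `add_sequence` and `progressive` are hypothetical.

```python
GAP = "-"


def pair_score(a, b):
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -1
    return 1 if a == b else -1


def add_sequence(aln, seq):
    """Align seq to a fixed alignment aln (list of equal-length rows)."""
    cols = list(zip(*aln))
    L, n, k = len(cols), len(seq), len(aln)
    gap_col = (GAP,) * k

    def col_vs(col, c):
        # Score of placing symbol c under an existing column.
        return sum(pair_score(a, c) for a in col)

    # F[i][j]: best score for the first i columns vs the first j letters.
    F = [[0] * (n + 1) for _ in range(L + 1)]
    for i in range(1, L + 1):
        F[i][0] = F[i - 1][0] + col_vs(cols[i - 1], GAP)
    for j in range(1, n + 1):
        F[0][j] = F[0][j - 1] + col_vs(gap_col, seq[j - 1])
    for i in range(1, L + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + col_vs(cols[i - 1], seq[j - 1]),
                          F[i - 1][j] + col_vs(cols[i - 1], GAP),
                          F[i][j - 1] + col_vs(gap_col, seq[j - 1]))

    # Traceback: rebuild the old rows (possibly with new gap columns)
    # and the new row, from the end to the start.
    rows, new = [[] for _ in range(k)], []
    i, j = L, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                F[i][j] == F[i - 1][j - 1] + col_vs(cols[i - 1], seq[j - 1])):
            col, c = cols[i - 1], seq[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + col_vs(cols[i - 1], GAP):
            col, c = cols[i - 1], GAP
            i -= 1
        else:
            col, c = gap_col, seq[j - 1]
            j -= 1
        for row, a in zip(rows, col):
            row.append(a)
        new.append(c)
    return ["".join(reversed(r)) for r in rows] + ["".join(reversed(new))]


def progressive(seqs):
    """Add sequences one at a time to a growing, frozen alignment."""
    aln = [seqs[0]]
    for s in seqs[1:]:
        aln = add_sequence(aln, s)
    return aln


for row in progressive(["ALSK", "NASK", "TNSD", "NTSD"]):
    print(row)
```

Earlier rows are only ever padded with whole gap columns, never rearranged, which is the "fix that alignment" step. In this sketch each `add_sequence` call also pays a factor for the number of rows already aligned, so it is somewhat slower than the $O(L^2)$ per step quoted above.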

Progressive Alignment
• When evolutionary tree is known:
  – Align closest first, in the order of the tree
Example: [Figure: tree with leaves x, y, z, w]
Order of alignments:
1. (x,y)
2. (z,w)
3. (xy, zw)

Progressive Alignment: CLUSTALW
CLUSTALW: most popular multiple protein alignment
Algorithm:
1. Find all $d_{ij}$: alignment distance $(x_i, x_j)$
   • High alignment score => short distance
2. Construct a tree (similar to hierarchical clustering)
3. Align nodes in order of decreasing similarity
+ a large number of heuristics

CLUSTALW example
• S1: ALSK
• S2: TNSD
• S3: NASK
• S4: NTSD

Distance matrix:
        s1  s2  s3  s4
  s1     0   9   4   7
  s2         0   8   3
  s3             0   7
  s4                 0

Tree: s1 is joined with s3, and s2 is joined with s4.

Align s1 and s3:
  -ALSK
  NA-SK

Align s2 and s4:
  -TNSD
  NT-SD

Align the two alignments:
  -ALSK
  -TNSD
  NA-SK
  NT-SD
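The grouping in this example can be reproduced from the distance matrix with a small clustering sketch. It uses plain average-linkage agglomerative clustering as a stand-in for the tree-building step; that choice, and the name `merge_order`, are assumptions made for illustration and are not a description of CLUSTALW's actual tree construction.

```python
from itertools import combinations


def merge_order(names, dist):
    """Repeatedly merge the two closest clusters (average linkage).

    dist maps frozenset({a, b}) to the distance between a and b.
    Returns the list of merges in the order they happen.
    """
    def avg(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg(*p))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2))
    return merges


# Distance matrix from the slide.
dist = {
    frozenset(("s1", "s2")): 9, frozenset(("s1", "s3")): 4,
    frozenset(("s1", "s4")): 7, frozenset(("s2", "s3")): 8,
    frozenset(("s2", "s4")): 3, frozenset(("s3", "s4")): 7,
}
merges = merge_order(["s1", "s2", "s3", "s4"], dist)
for left, right in merges:
    print(left, "+", right)
```

The sketch merges s2 with s4 first (distance 3), then s1 with s3 (distance 4), then the two pairs, which gives the same tree as the example. The two pair alignments are independent, so doing them in either order leads to the same final step.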

Iterative Refinement
Problems with progressive alignment:
• Depends on pairwise alignments
• If sequences are very distantly related, much higher likelihood of errors
• Initial alignments are “frozen” even when new evidence comes
Example:
  x: GAAGTT
  y: GAC-TT    Frozen!
  z: GAACTG
  w: GTACTG
Now clear: correct y should be GA-CTT

Iterative Refinement
Algorithm (Barton-Sternberg):
1. Align most similar $x_i$, $x_j$
2. Align $x_k$ most similar to $(x_i x_j)$
3. Repeat 2 until $(x_1 \dots x_N)$ are aligned
   (Steps 1 to 3 are progressive alignment.)
4. For j = 1 to N, remove $x_j$, and realign it to $x_1 \dots x_{j-1} x_{j+1} \dots x_N$
5. Repeat 4 until convergence

Iterative Refinement (cont’d)
For each sequence y:
1. Remove y
2. Realign y (while rest fixed)
[Figure: projection with x and z fixed; y is allowed to vary]

Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
  x: GAAGTTA
  y: GAC-TTA
  z: GAACTGA
  w: GTACTGA
After realigning y:
  x: GAAGTTA
  y: G-ACTTA    + 3 matches
  z: GAACTGA
  w: GTACTGA
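The "+ 3 matches" figure can be verified by counting, for each version of y, the columns where y agrees with each of the other three rows. The helper names below are hypothetical.

```python
def matches(a, b, gap="-"):
    """Number of columns where two aligned rows hold the same letter."""
    return sum(1 for p, q in zip(a, b) if p == q and p != gap)


def matches_with_rest(row, others):
    """Total matches between one row and every other row."""
    return sum(matches(row, o) for o in others)


others = ["GAAGTTA", "GAACTGA", "GTACTGA"]  # x, z, w stay fixed

before = matches_with_rest("GAC-TTA", others)  # 5 + 4 + 3 = 12
after = matches_with_rest("G-ACTTA", others)   # 5 + 5 + 5 = 15
print(after - before)                          # 3
```

Moving the gap one position to the left leaves the agreement with x unchanged and gains matches against z and w, which is the new evidence that the frozen (x, y) alignment could not use.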

Iterative Refinement
• Example not handled well:
  x:   GAAGTTA
  y_1: GAC-TTA
  y_2: GAC-TTA
  y_3: GAC-TTA
  z:   GAACTGA
  w:   GTACTGA
  Realigning any single y_i changes nothing

Other approaches
• Statistical learning methods
  – Profile Hidden Markov Models
• Consistency-based methods
  – Still rely on pairwise alignment
    • But consider a third sequence when aligning two sequences
    • If block A in sequence x aligns to block B in sequence y, and both align to block C in sequence z, we have higher confidence that the alignment between A and B is reliable
    • Essentially: change the scoring system according to consistency
    • Then apply DP as in other approaches
  – Pioneered by a tool called T-Coffee
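The consistency idea can be illustrated with a toy count of how many ways a residue pair is supported. This is a deliberately simplified sketch and not T-Coffee's actual weighting scheme: pairwise alignments are represented as sets of matched index pairs, and the function name `consistency_support` is hypothetical.

```python
def consistency_support(xy, xz, zy):
    """Count support for aligning position i of x with position j of y.

    xy, xz, zy are sets of (index, index) pairs matched in the pairwise
    alignments x-y, x-z and z-y. A pair gets 1 for a direct match in xy,
    plus 1 if some position k of z is matched to both x[i] and y[j].
    """
    z_to_y = {k: j for k, j in zy}
    support = {pair: 1 for pair in xy}
    for i, k in xz:
        j = z_to_y.get(k)
        if j is not None:
            support[(i, j)] = support.get((i, j), 0) + 1
    return support


# Toy data: x[0]-y[0] is confirmed through z[0]; x[1]-y[1] is not,
# because z suggests x[1]-y[2] instead.
xy = {(0, 0), (1, 1)}
xz = {(0, 0), (1, 2)}
zy = {(0, 0), (2, 2)}
support = consistency_support(xy, xz, zy)
print(support)
```

The resulting counts could then replace the substitution scores in the pairwise DP, so that matches agreed on by a third sequence are preferred.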

Multiple alignment tools
• Clustal W (Thompson, 1994)
  – Most popular
• T-Coffee (Notredame, 2000)
  – Another popular tool
  – Consistency-based
  – Slower than Clustal W, but generally more accurate for more distantly related sequences
• MUSCLE (Edgar, 2004)
  – Iterative refinement
  – More efficient than most others
• DIALIGN (Morgenstern, 1998, 1999, 2005)
  – “local”
• Align-m (Walle, 2004)
  – “local”
• PROBCONS (Do, 2004)
  – Probabilistic consistency-based
  – Best accuracy on benchmarks
• ProDA (Phuong, 2006)
  – Allows repeated and shuffled regions

In summary
• Multiple alignment scoring functions
  – Sum of pairs
  – Other functions exist, but are less used
• Multiple alignment algorithms:
  – MDP
    • Optimal
    • Too slow
    • Branch & Bound doesn’t solve the problem entirely
  – Heuristic:
    • Progressive alignment: Clustal W
    • Iterative refinement
    • Consistency-based
