CSCI 490 Bioinformatics Part II: Pair-wise Sequence Alignment - - PowerPoint PPT Presentation

csci 490
SMART_READER_LITE
LIVE PREVIEW

CSCI 490 Bioinformatics Part II: Pair-wise Sequence Alignment - - PowerPoint PPT Presentation

CSCI 490 Bioinformatics Part II: Pair-wise Sequence Alignment Outline Whats the biological problem Algorithm issues Intro to dynamic programming Global sequence alignment Local sequence alignment More efficient


slide-1
SLIDE 1

CSCI 490 Bioinformatics

Part II: Pair-wise Sequence Alignment

slide-2
SLIDE 2

Outline

  • What’s the biological problem
  • Algorithm issues

– Intro to dynamic programming – Global sequence alignment – Local sequence alignment – More efficient algorithms

  • Interpretation issues

– Model gaps more accurately

  • Practical tools: BLAST
slide-3
SLIDE 3

Evolution at the DNA level

…ACGGTGCAGTCACCA… …ACGTTGC-GTCCACCA… C DNA evolutionary events (sequence edits): Mutation, deletion, insertion

slide-4
SLIDE 4

Sequence conservation implies function

OK OK OK X X Still OK?

next generation

slide-5
SLIDE 5

Why sequence alignment?

  • Conserved regions are more likely to be

functional

– Can be used for finding genes, regulatory elements, etc.

  • Similar sequences often have similar origin and

function

– Can be used to predict functions for new genes / proteins

  • Sequence alignment is one of the most widely

used computational tools in biology

slide-6
SLIDE 6

Global Sequence Alignment

  • AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition An alignment of two strings S, T is a pair of strings S’, T’ (with spaces) s.t. (1) |S’| = |T’|, and (|S| = “length of S”) (2) removing all spaces in S’, T’ leaves S, T AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC

S T S’ T’

slide-7
SLIDE 7

What is a good alignment?

Alignment: The “best” way to match the letters of one sequence with those of the other How do we define “best”?

slide-8
SLIDE 8
  • The score of aligning (characters or

spaces) x & y is σ (x,y).

  • Score of an alignment:
  • An optimal alignment: one with max score

S’: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- T’: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

slide-9
SLIDE 9

Scoring Function

  • Sequence edits:

AGGCCTC – Mutations AGGACTC – Insertions AGGGCCTC – Deletions AGG-CTC

Scoring Function: Match: +m ~~~AAC~~~ Mismatch:

  • s

~~~A-A~~~ Gap (indel):

  • d
slide-10
SLIDE 10
  • Match = 2, mismatch = -1, gap = -1
  • Score = 3 x 2 – 2 x 1 – 1 x 1 = 3
slide-11
SLIDE 11

More complex scoring function

  • Substitution matrix

– Similarity score of matching two letters a, b should reflect the probability of a, b derived from the same ancestor – It is usually defined by log likelihood ratio – Active research area. Especially for proteins. – Commonly used: PAM, BLOSUM

slide-12
SLIDE 12

More complex scoring function

  • Substitution matrix

BLOSUM (BLOcks SUbstitution Matrix)

slide-13
SLIDE 13

An example substitution matrix

A C G T A 3

  • 2
  • 1
  • 2

C 3

  • 2
  • 1

G 3

  • 2

T 3

slide-14
SLIDE 14

How to find an optimal alignment?

  • A naïve algorithm:

for all subseqs A of S, B of T s.t. |A| = |B| do align A[i] with B[i], 1 ≤i ≤|A| align all other chars to spaces compute its value retain the max end

  • utput the retained alignment
slide-15
SLIDE 15

How to find an optimal alignment?

  • A naïve algorithm:

for all subseqs A of S, B of T s.t. |A| = |B| do align A[i] with B[i], 1 ≤i ≤|A| align all other chars to spaces compute its value retain the max end

  • utput the retained alignment

S = abcd A = cd T = wxyz B = xz

slide-16
SLIDE 16

How to find an optimal alignment?

  • A naïve algorithm:

for all subseqs A of S, B of T s.t. |A| = |B| do align A[i] with B[i], 1 ≤i ≤|A| align all other chars to spaces compute its value retain the max end

  • utput the retained alignment

S = abcd A = cd T = wxyz B = xz

  • abc-d a-bc-d

w--xyz -w-xyz

slide-17
SLIDE 17

Analysis

  • Assume |S| = |T| = n
  • Cost of evaluating one alignment: ≥n
  • How many alignments are there:

– pick n chars of S,T together – say k of them are in S – match these k to the k unpicked chars of T

  • Total time:
  • E.g., for n = 20, time is > 240 >1012 operations
slide-18
SLIDE 18

Intro to Dynamic Programming

slide-19
SLIDE 19

Dynamic programming

  • What is dynamic programming?

– A method for solving problems exhibiting the properties of overlapping subproblems and optimal substructure – Key idea: tabulating sub-problem solutions rather than re-computing them repeatedly

  • Two simple examples:

– Computing Fibonacci numbers – Find the special shortest path in a grid

slide-20
SLIDE 20

Example 1: Fibonacci numbers

  • 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …

F(0) = 1; F(1) = 1; F(n) = F(n-1) + F(n-2)

  • How to compute F(n)?
slide-21
SLIDE 21

A recursive algorithm

function fib(n) if (n == 0 or n == 1) return 1; else return fib(n-1) + fib(n-2); F(9) F(8) F(7) F(7) F(6) F(6) F(5) F(6) F(5) F(5) F(4) F(5) F(4) F(4) F(3)

slide-22
SLIDE 22
  • Time complexity:

– Between 2n/2 and 2n – O(1.62n), i.e. exponential

  • Why recursive Fib algorithm is inefficient?

– Overlapping subproblems

F(9) F(8) F(7) F(9) F(8) F(7) F(7) F(6) F(6) F(5) F(7) F(6) F(6) F(5) F(6) F(5) F(5) F(4) F(5) F(4) F(4) F(3) F(6) F(5) F(5) F(4) F(5) F(4) F(4) F(3)

n/2 n

slide-23
SLIDE 23

An iterative algorithm

function fib(n) F[0] = 1; F[1] = 1; for i = 2 to n F[i] = F[i-1] + F[i-2]; Return F[n];

55 34 21 13 8 5 3 2 1 1 55 34 21 13 8 5 3 2 1 1

Time complexity: Time: O(n), space: O(n)

slide-24
SLIDE 24

Example 2: shortest path in a grid

S G

m n

Each edge has a length (cost). We need to get to G from S. Can only move right or down. Aim: find a path with the minimum total length

slide-25
SLIDE 25

Optimal substructures

  • Naïve algorithm: enumerate all possible

paths and compare costs

– Exponential number of paths

  • Key observation:

– If a path P(S, G) is the shortest from S to G, any of its sub-path P(S,x), where x is on P(S,G), is the shortest from S to x

slide-26
SLIDE 26

Proof

  • Proof by contradiction

– If the path between P(S,x) is not the shortest, i.e., P’(S,x) < P(S,x) – Construct a new path P’(S,G) = P’(S,x) + P(x, G) – P’(S,G) < P(S,G) => P(S,G) is not the shortest – Contradiction – Therefore, P(S, x) is the shortest

S G x

slide-27
SLIDE 27

Recursive solution

  • Index each intersection by

two indices, (i, j)

  • Let F(i, j) be the total

length of the shortest path from (0, 0) to (i, j). Therefore, F(m, n) is the shortest path we wanted.

  • To compute F(m, n), we

need to compute both F(m-1, n) and F(m, n-1) m n

(0,0) (m, n)

F(m-1, n) + length((m-1, n), (m, n)) F(m, n) = min F(m, n-1) + length((m, n-1), (m, n))

slide-28
SLIDE 28

Recursive solution

  • But: if we use recursive

call, many subpaths will be recomputed for many times

  • Strategy: pre-compute F

values starting from the upper-left corner. Fill in row by row (what other

  • rder will also do?)

m n F(i-1, j) + length((i-1, j), (i, j)) F(i, j) = min F(i, j-1) + length((i, j-1), (i, j))

(0,0) (m, n) (i, j) (i-1, j) (i, j-1)

slide-29
SLIDE 29

Dynamic programming illustration

3 9 1 2 3 2 5 2 2 4 2 3 3 6 3 3 1 2 3 2 5 3 3 3 3 2 3 3 9 3 6 2 3 7 4 4 6 3 1 3 3

12 13 15

6

8 13 15

9

11 13 16

11

14 17 20

17

17 18 20 5 7 13 17

S G

F(i-1, j) + length(i-1, j, i, j) F(i, j) = min F(i, j-1) + length(i, j-1, i, j)

slide-30
SLIDE 30

Trackback

3 9 1 2 3 2 5 2 2 4 2 3 3 6 3 3 1 2 3 2 5 3 3 3 3 2 3 3 9 3 6 2 3 7 4 4 6 3 1 3 3

12 13 15

6

8 13 15

9

11 13 16

11

14 17 20

17

17 18 20 5 7 13 17

slide-31
SLIDE 31

Elements of dynamic programming

  • Optimal sub-structures

– Optimal solutions to the original problem contains optimal solutions to sub-problems

  • Overlapping sub-problems

– Some sub-problems appear in many solutions

  • Memorization and reuse

– Carefully choose the order that sub-problems are solved

slide-32
SLIDE 32

Dynamic Programming for sequence alignment

Suppose we wish to align x1……xM y1……yN Let F(i,j) = optimal score of aligning x1……xi y1……yj Scoring Function: Match: +m Mismatch:

  • s

Gap (indel):

  • d
slide-33
SLIDE 33

Elements of dynamic programming

  • Optimal sub-structures

– Optimal solutions to the original problem contains optimal solutions to sub-problems

  • Overlapping sub-problems

– Some sub-problems appear in many solutions

  • Memorization and reuse

– Carefully choose the order that sub-problems are solved

slide-34
SLIDE 34

Optimal substructure

  • If x[i] is aligned to y[j] in the optimal alignment

between x[1..M] and y[1..N], then

  • The alignment between x[1..i] and y[1..j] is also
  • ptimal
  • Easy to prove by contradiction

...

1 2 i M

...

1 2 j N

x: y:

slide-35
SLIDE 35

Recursive solution

Notice three possible cases: 1. xM aligns to yN

~~~~~~~ xM ~~~~~~~ yN

2. xM aligns to a gap

~~~~~~~ xM ~~~~~~~ ⎯

3. yN aligns to a gap

~~~~~~~ ⎯ ~~~~~~~ yN

m, if xM = yN F(M,N) = F(M-1, N-1) +

  • s, if not

F(M,N) = F(M-1, N) - d F(M,N) = F(M, N-1) - d

max

slide-36
SLIDE 36

Recursive solution

  • Generalize:

F(i-1, j-1) + (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d (Xi,Yj) = m if Xi = Yj, and –s otherwise

  • Boundary conditions:

– F(0, 0) = 0. – F(0, j) = ? – F(i, 0) = ?

  • jd: y[1..j] aligned to gaps.
  • id: x[1..i] aligned to gaps.
slide-37
SLIDE 37

What order to fill?

F(0,0) F(M,N)

F(i, j) F(i, j-1) F(i-1, j) F(i-1, j-1) 1 1 2 3 i j

slide-38
SLIDE 38

What order to fill?

F(0,0) F(M,N)

slide-39
SLIDE 39

Example

x = AGTA m = 1 y = ATA s = 1 d = 1

A G T A A T A

F(i,j) i = 0 1 2 3 4 j = 0 1 2 3

F(i-1, j-1) + (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d

slide-40
SLIDE 40

Example

x = AGTA m = 1 y = ATA s = 1 d = 1

A G T A

  • 1
  • 2
  • 3
  • 4

A

  • 1

T

  • 2

A

  • 3

j = 0 1 2 3 F(i,j) i = 0 1 2 3 4

F(i-1, j-1) + (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d

slide-41
SLIDE 41

Example

x = AGTA m = 1 y = ATA s = 1 d = 1

A G T A

  • 1
  • 2
  • 3
  • 4

A

  • 1

1

  • 1
  • 2

T

  • 2

A

  • 3

j = 0 1 2 3 F(i,j) i = 0 1 2 3 4

F(i-1, j-1) + (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d

slide-42
SLIDE 42

Example

x = AGTA m = 1 y = ATA s = 1 d = 1

A G T A

  • 1
  • 2
  • 3
  • 4

A

  • 1

1

  • 1
  • 2

T

  • 2

1 A

  • 3

j = 0 1 2 3 F(i,j) i = 0 1 2 3 4

F(i-1, j-1) + (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d

slide-43
SLIDE 43

Example

x = AGTA m = 1 y = ATA s = 1 d = 1

A G T A

  • 1
  • 2
  • 3
  • 4

A

  • 1

1

  • 1
  • 2

T

  • 2

1 A

  • 3
  • 1
  • 1

2

j = 0 1 2 3 Optimal Alignment: F(4,3) = 2 F(i,j) i = 0 1 2 3 4

F(i-1, j-1) + (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d

slide-44
SLIDE 44

Example

x = AGTA m = 1 y = ATA s = 1 d = 1

A G T A

  • 1
  • 2
  • 3
  • 4

A

  • 1

1

  • 1
  • 2

T

  • 2

1 A

  • 3
  • 1
  • 1

2

j = 0 1 2 3 Optimal Alignment: F(4,3) = 2 This only tells us the best score F(i,j) i = 0 1 2 3 4

F(i-1, j-1) + (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d

slide-45
SLIDE 45

Trace-back

x = AGTA m = 1 y = ATA s = 1 d = 1

A G T A

  • 1
  • 2
  • 3
  • 4

A

  • 1

1

  • 1
  • 2

T

  • 2

1 A

  • 3
  • 1
  • 1

2

j = 0 1 2 3

F(i-1, j-1) + (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d

F(i,j) i = 0 1 2 3 4

A A

slide-46
SLIDE 46

Trace-back

A G T A

  • 1
  • 2
  • 3
  • 4

A

  • 1

1

  • 1
  • 2

T

  • 2

1 A

  • 3
  • 1
  • 1

2

F(i-1, j-1) + (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d

x = AGTA m = 1 y = ATA s = 1 d = 1

j = 0 1 2 3 F(i,j) i = 0 1 2 3 4

T A T A

slide-47
SLIDE 47

Trace-back

x = AGTA m = 1 y = ATA s = 1 d = 1

A G T A

  • 1
  • 2
  • 3
  • 4

A

  • 1

1

  • 1
  • 2

T

  • 2

1 A

  • 3
  • 1
  • 1

2

j = 0 1 2 3

F(i-1, j-1) + (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d

F(i,j) i = 0 1 2 3 4

G T A

  • T A
slide-48
SLIDE 48

Trace-back

x = AGTA m = 1 y = ATA s = 1 d = 1

A G T A

  • 1
  • 2
  • 3
  • 4

A

  • 1

1

  • 1
  • 2

T

  • 2

1 A

  • 3
  • 1
  • 1

2

j = 0 1 2 3

F(i-1, j-1) + (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d

F(i,j) i = 0 1 2 3 4

A G T A A - T A

slide-49
SLIDE 49

Trace-back

x = AGTA m = 1 y = ATA s = 1 d = 1

A G T A

  • 1
  • 2
  • 3
  • 4

A

  • 1

1

  • 1
  • 2

T

  • 2

1 A

  • 3
  • 1
  • 1

2

j = 0 1 2 3 Optimal Alignment: F(4,3) = 2 AGTA A−TA

F(i-1, j-1) + (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d

F(i,j) i = 0 1 2 3 4

slide-50
SLIDE 50

Using trace-back pointers

x = AGTA m = 1 y = ATA s = 1 d = 1

A G T A

  • 1
  • 2
  • 3
  • 4

A

  • 1

T

  • 2

A

  • 3

j = 0 1 2 3 F(i,j) i = 0 1 2 3 4

slide-51
SLIDE 51

Using trace-back pointers

x = AGTA m = 1 y = ATA s = 1 d = 1

A G T A

  • 1
  • 2
  • 3
  • 4

A

  • 1

1

  • 1
  • 2

T

  • 2

A

  • 3

j = 0 1 2 3 F(i,j) i = 0 1 2 3 4

slide-52
SLIDE 52

Using trace-back pointers

x = AGTA m = 1 y = ATA s = 1 d = 1

A G T A

  • 1
  • 2
  • 3
  • 4

A

  • 1

1

  • 1
  • 2

T

  • 2

1 A

  • 3

j = 0 1 2 3 F(i,j) i = 0 1 2 3 4

slide-53
SLIDE 53

Using trace-back pointers

x = AGTA m = 1 y = ATA s = 1 d = 1

A G T A

  • 1
  • 2
  • 3
  • 4

A

  • 1

1

  • 1
  • 2

T

  • 2

1 A

  • 3
  • 1
  • 1

2

j = 0 1 2 3 F(i,j) i = 0 1 2 3 4

slide-54
SLIDE 54

Using trace-back pointers

x = AGTA m = 1 y = ATA s = 1 d = 1

A G T A

  • 1
  • 2
  • 3
  • 4

A

  • 1

1

  • 1
  • 2

T

  • 2

1 A

  • 3
  • 1
  • 1

2

j = 0 1 2 3 F(i,j) i = 0 1 2 3 4

slide-55
SLIDE 55

Using trace-back pointers

x = AGTA m = 1 y = ATA s = 1 d = 1

A G T A

  • 1
  • 2
  • 3
  • 4

A

  • 1

1

  • 1
  • 2

T

  • 2

1 A

  • 3
  • 1
  • 1

2

j = 0 1 2 3 F(i,j) i = 0 1 2 3 4

slide-56
SLIDE 56

Using trace-back pointers

x = AGTA m = 1 y = ATA s = 1 d = 1

A G T A

  • 1
  • 2
  • 3
  • 4

A

  • 1

1

  • 1
  • 2

T

  • 2

1 A

  • 3
  • 1
  • 1

2

j = 0 1 2 3 F(i,j) i = 0 1 2 3 4

slide-57
SLIDE 57

Using trace-back pointers

x = AGTA m = 1 y = ATA s = 1 d = 1

A G T A

  • 1
  • 2
  • 3
  • 4

A

  • 1

1

  • 1
  • 2

T

  • 2

1 A

  • 3
  • 1
  • 1

2

j = 0 1 2 3 Optimal Alignment: F(4,3) = 2 AGTA A−TA F(i,j) i = 0 1 2 3 4

slide-58
SLIDE 58

The Needleman-Wunsch Algorithm

1. Initialization.

a. F(0, 0) = 0 b. F(0, j) = - j  d c. F(i, 0) = - i  d

2. Main Iteration. Filling in scores

a. For each i = 1……M For each j = 1……N F(i-1,j) – d [case 1] F(i, j) = max F(i, j-1) – d [case 2] F(i-1, j-1) + σ(xi, yj) [case 3] UP, if [case 1] Ptr(i,j) = LEFT if [case 2] DIAG if [case 3]

3.

  • Termination. F(M, N) is the optimal score, and from Ptr(M, N) can

trace back optimal alignment

slide-59
SLIDE 59

Complexity

  • Time:

O(NM)

  • Space:

O(NM)

  • Linear-space algorithms do exist (with the

same time complexity)

slide-60
SLIDE 60

Equivalent graph problem

(0,0) (3,4) A G T A A A T 1 1 1 1 S1 = S2 =

  • Number of steps: length of the alignment
  • Path length: alignment score
  • Optimal alignment: find the longest path from (0, 0) to (3, 4)
  • General longest path problem cannot be found with DP. Longest path on this

graph can be found by DP since no cycle is possible.

→: a gap in the 2nd sequence : a gap in the 1st sequence : match / mismatch Value on vertical/horizontal line: -d Value on diagonal: m or -s

1

slide-61
SLIDE 61

Question

  • If we change the scoring scheme, will the
  • ptimal alignment be changed?

– Old: Match = 1, mismatch = gap = -1 – New: match = 2, mismatch = gap = 0 – New: Match = 2, mismatch = gap = -2?

slide-62
SLIDE 62

Question

  • What kind of alignment is represented by

these paths?

A B C A B C A B C A B C A B C

A- BC A--

  • BC
  • -A

BC-

  • A-

B-C

  • A

BC Alternating gaps are impossible if –s > -2d

slide-63
SLIDE 63

A variant of the basic algorithm

Scoring scheme: m = s = d: 1

Seq1: CAGCA-CTTGGATTCTCGG || |:||| Seq2: ---CAGCGTGG-------- Seq1: CAGCACTTGGATTCTCGG |||| | | || Seq2: CAGC-----G-T----GG The first alignment may be biologically more realistic in some cases (e.g. if we know s2 is a subsequence of s1) Score = -7 Score = -2

slide-64
SLIDE 64

A variant of the basic algorithm

  • Maybe it is OK to have an unlimited # of

gaps in the beginning and end:

  • ---------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC

GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------

  • Then, we don’t want to penalize gaps in

the ends

slide-65
SLIDE 65

The Overlap Detection variant

Changes: 1. Initialization

For all i, j, F(i, 0) = 0 F(0, j) = 0

2. Termination

maxi F(i, N) FOPT = max maxj F(M, j) x1 ……………………………… xM yN ……………………………… y1

slide-66
SLIDE 66

Different types of overlaps

x y x y

slide-67
SLIDE 67

The local alignment problem

Given two strings X = x1……xM, Y = y1……yN Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum e.g. X = abcxdex X’ = cxde Y = xxxcde Y’ = c-de

x y

slide-68
SLIDE 68

Why local alignment

  • Conserved regions may be a small part of the

whole

– Global alignment might miss them if flanking “junk”

  • utweighs similar regions
  • Genes are shuffled between genomes

A A B C D B C D

slide-69
SLIDE 69

Naïve algorithm

for all substrings X’ of X and Y’ of Y Align X’ & Y’ via dynamic programming Retain pair with max value end ; Output the retained pair

  • Time: O(n2) choices for A, O(m2) for B,

O(nm) for DP, so O(n3m3 ) total.

slide-70
SLIDE 70

Reminder

  • The overlap detection algorithm

– We do not give penalty to gaps at either end

Free gap Free gap

slide-71
SLIDE 71

The local alignment idea

  • Do not penalize the unaligned regions (gaps or

mismatches)

  • The alignment can start anywhere and ends anywhere
  • Strategy: whenever we get to some low similarity region

(negative score), we restart a new alignment

– By resetting alignment score to zero

slide-72
SLIDE 72

The Smith-Waterman algorithm

Initialization: F(0, j) = F(i, 0) = 0 F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + (xi, yj) Iteration: F(i, j) = max

slide-73
SLIDE 73

The Smith-Waterman algorithm

Termination:

  • 1. If we want the best local alignment…

FOPT = maxi,j F(i, j)

  • 2. If we want all local alignments scoring > t

For all i, j find F(i, j) > t, and trace back

slide-74
SLIDE 74

x x x c d e a b c x d e x

Match: 2 Mismatch: -1 Gap: -1

slide-75
SLIDE 75

x x x c d e a b c x d e x

Match: 2 Mismatch: -1 Gap: -1

slide-76
SLIDE 76

x x x c d e a b c 2 1 x d e x

Match: 2 Mismatch: -1 Gap: -1

slide-77
SLIDE 77

x x x c d e a b c 2 1 x 2 2 2 1 1 d e x

Match: 2 Mismatch: -1 Gap: -1

slide-78
SLIDE 78

x x x c d e a b c 2 1 x 2 2 2 1 1 d 1 1 1 1 3 2 e x

Match: 2 Mismatch: -1 Gap: -1

slide-79
SLIDE 79

x x x c d e a b c 2 1 x 2 2 2 1 1 d 1 1 1 1 3 2 e 2 5 x

Match: 2 Mismatch: -1 Gap: -1

slide-80
SLIDE 80

x x x c d e a b c 2 1 x 2 2 2 1 1 d 1 1 1 1 3 2 e 2 5 x 2 2 2 1 1 4

Match: 2 Mismatch: -1 Gap: -1

slide-81
SLIDE 81

Trace back

x x x c d e a b c 2 1 x 2 2 2 1 1 d 1 1 1 1 3 2 e 2 5 x 2 2 2 1 1 4

Match: 2 Mismatch: -1 Gap: -1

slide-82
SLIDE 82

Trace back

x x x c d e a b c 2 1 x 2 2 2 1 1 d 1 1 1 1 3 2 e 2 5 x 2 2 2 1 1 4

Match: 2 Mismatch: -1 Gap: -1

cxde | || c-de x-de | || xcde

slide-83
SLIDE 83
  • No negative values in local alignment DP

array

  • Optimal local alignment will never have a

gap on either end

  • Local alignment: “Smith-Waterman”
  • Global alignment: “Needleman-Wunsch”
slide-84
SLIDE 84

Analysis

  • Time:

– O(MN) for finding the best alignment – Time to report all alignments depends on the number of sub-opt alignments

  • Memory:

– O(MN) – O(M+N) possible

slide-85
SLIDE 85

More efficient alignment algorithms

slide-86
SLIDE 86
  • Given two sequences of length M, N
  • Time: O(MN)

– Ok, but still slow for long sequences

  • Space: O(MN)

– bad – 1Mb seq x 1Mb seq = 1TB memory

  • Can we do better?
slide-87
SLIDE 87

Bounded alignment

Good alignment should appear near the diagonal

slide-88
SLIDE 88

Bounded Dynamic Programming

If we know that x and y are very similar Assumption: # gaps(x, y) < k xi Then, | implies | i – j | < k yj

slide-89
SLIDE 89

Bounded Dynamic Programming

Initialization: F(i,0), F(0,j) undefined for i, j > k Iteration:

For i = 1…M For j = max(1, i – k)…min(N, i+k) F(i – 1, j – 1)+ (xi, yj) F(i, j) = max F(i, j – 1) – d, if j > i – k F(i – 1, j) – d, if j < i + k Termination: same x1 ………………………… xM yN ………………………… y1

k

slide-90
SLIDE 90

Analysis

  • Time: O(kM) << O(MN)
  • Space: O(kM) with some tricks

2k M 2k

=>

M

slide-91
SLIDE 91
slide-92
SLIDE 92
  • Given two sequences of length M, N
  • Time: O(MN)

– ok

  • Space: O(MN)

– bad – 1mb seq x 1mb seq = 1TB memory

  • Can we do better?
slide-93
SLIDE 93

Linear space algorithm

  • If all we need is the alignment score but

not the alignment, easy! We only need to keep two rows

(You only need one row, with a little trick)

But how do we get the alignment?

slide-94
SLIDE 94

Linear space algorithm

  • When we finish, we know how we have

aligned the ends of the sequences

Naïve idea: Repeat on the smaller subproblem F(M-1, N-1) Time complexity: O((M+N)(MN))

XM YN

slide-95
SLIDE 95

(0, 0) (M, N) M/2

Key observation: optimal alignment (longest path) must use an intermediate point on the M/2-th row. Call it (M/2, k), where k is unknown.

slide-96
SLIDE 96
  • Longest path from (0, 0) to (6, 6) is

max_k (LP(0,0,3,k) + LP(6,6,3,k))

(0,0) (6,6) (3,2) (3,4) (3,6) (3,0)

slide-97
SLIDE 97

Hirschberg’s idea

  • Divide and conquer!

M/2

F(M/2, k) represents the best alignment between x1x2…xM/2 and y1y2…yk

Forward algorithm

Align x1x2…xM/2 with Y

X Y

slide-98
SLIDE 98

Backward Algorithm

M/2

B(M/2, k) represents the best alignment between reverse(xM/2+1…xM) and reverse(ykyk+1…yN )

Backward algorithm

Align reverse(xM/2+1…xM) with reverse(Y)

Y X

slide-99
SLIDE 99

Linear-space alignment

Using 2 (4) rows of space, we can compute for k = 1…N, F(M/2, k), B(M/2, k)

M/2

slide-100
SLIDE 100

Linear-space alignment

Now, we can find k* maximizing F(M/2, k) + B(M/2, k) Also, we can trace the path exiting column M/2 from k* Conclusion: In O(NM) time, O(N) space, we found

  • ptimal alignment path at row M/2
slide-101
SLIDE 101

Linear-space alignment

  • Iterate this procedure to the two sub-problems!

N-k* M/2 M/2 k*

slide-102
SLIDE 102

Analysis

  • Memory: O(N) for computation, O(N+M) to

store the optimal alignment

  • Time:

– MN for first iteration – k M/2 + (N-k) M/2 = MN/2 for second – …

k N-k M/2 M/2

slide-103
SLIDE 103

MN MN/2 MN/4 MN/8

MN + MN/2 + MN/4 + MN/8 + … = MN (1 + ½ + ¼ + 1/8 + 1/16 + …) = 2MN = O(MN)

slide-104
SLIDE 104

Outline

  • Part I: Algorithms

– Biological problem – Intro to dynamic programming – Global sequence alignment – Local sequence alignment – More efficient algorithms

  • Part II: Biological issues

– Model gaps more accurately

  • Part III: BLAST
slide-105
SLIDE 105

How to model gaps more accurately

slide-106
SLIDE 106

What’s a better alignment?

GACGCCGAACG ||||| ||| GACGC---ACG GACGCCGAACG |||| | | || GACG-C-A-CG

Score = 8 x m – 3 x d Score = 8 x m – 3 x d However, gaps usually occur in bunches.

  • During evolution, chunks of DNA may be lost entirely
  • Aligning genomic sequences vs. cDNAs (reverse

complimentary to mRNAs)

slide-107
SLIDE 107

Model gaps more accurately

  n n

slide-108
SLIDE 108

General gap dynamic programming

Initialization: same Iteration: F(i-1, j-1) + s(xi, yj) F(i, j) = max maxk=0…i-1F(k,j) – (i-k) maxk=0…j-1F(i,k) – (j-k) Termination: same Running Time: O((M+N)MN) (cubic) Space: O(NM) (linear-space algorithm not applicable)

slide-109
SLIDE 109

Compromise: affine gaps

(n) = d + (n – 1)e | | gap gap

  • pen

extension d e

(n) Match: 2 Gap open: -5 Gap extension: -1 GACGCCGAACG ||||| ||| GACGC---ACG GACGCCGAACG |||| | | || GACG-C-A-CG

8x2-5-2 = 9 8x2-3x5 = 1

We want to find the optimal alignment with affine gap penalty in

  • O(MN) time
  • O(MN) or better O(M+N) memory
slide-110
SLIDE 110

Allowing affine gap penalties

  • Still three cases

– xi aligned with yj – xi aligned to a gap

  • Are we continuing a gap in x? (if no, start is more expensive)

– yj aligned to a gap

  • Are we continuing a gap in y? (if no, start is more expensive)
  • We can use a finite state machine to represent the three

cases as three states

– The machine has two heads, reading the chars on the two strings separately – At every step, each head reads 0 or 1 char from each sequence – Depending on what it reads, goes to a different state, and produces different scores

slide-111
SLIDE 111

Finite State Machine

F: have just read 1 char from each seq (xi aligned to yj ) Ix: have read 0 char from x. (yj aligned to a gap) Iy: have read 0 char from y (xi aligned to a gap)

F Ix Iy

? / ? ? / ? ? / ? ? / ? ? / ? ? / ? ? / ?

Input Output State

slide-112
SLIDE 112

F Ix Iy

(xi,yj) /  (xi,yj) /  (xi,yj) /  (xi,-) / d (xi,-) / e (-, yj) / d (-, yj) / e

Input Output Start state Current state Input Output Next state F

(xi,yj) 

F F

(-,yj) d

Ix F

(xi,-) d

Iy Ix

(-,yj) e

Ix …

… …

slide-113
SLIDE 113

AAC ACT F-F-F-F AAC ||| ACT F-Iy-F-F-Ix AAC- ||

  • ACT

F-F-Iy-F-Ix AAC- | | A-CT F Ix Iy

(xi,yj) /  (xi,yj) /  (xi,yj) /  (xi,-) / d (xi,-) / e (-, yj) / d (-, yj) / e

start state

Given a pair of sequences, an alignment (not necessarily optimal) corresponds to a state path in the FSM. Optimal alignment: find a state path to read the two sequences such that the total output score is the highest

slide-114
SLIDE 114

Dynamic programming

  • We encode this information in three different

matrices

  • For each element (i,j) we use three variables

– F(i,j): best alignment (score) of x1..xi & y1..yj if xi aligns to yj – Ix(i,j): best alignment of x1..xi & y1..yj if yj aligns to gap – Iy(i,j): best alignment of x1..xi & y1..yj if xi aligns to gap

xi yj xi yj xi yj F(i, j) Ix(i, j) Iy(i, j)

slide-115
SLIDE 115

F Ix Iy

(xi,yj) / (xi,yj) / (xi,yj) / (xi,-) /d (xi,-)/e (-, yj) /d (-, yj)/e

F(i-1, j-1) + (xi, yj) F(i, j) = max Ix(i-1, j-1) + (xi, yj) Iy(i-1, j-1) + (xi, yj)

xi yj

slide-116
SLIDE 116

F Ix Iy

(xi,yj) / (xi,yj) / (xi,yj) / (xi,-) /d (xi,-)/e (-, yj) /d (-, yj)/e

F(i, j-1) + d Ix(i, j) = max Ix(i, j-1) + e

xi yj Ix(i, j)

slide-117
SLIDE 117

F Ix Iy

(xi,yj) / (xi,yj) / (xi,yj) / (xi,-) /d (xi,-)/e (-, yj) /d (-, yj)/e

F(i-1, j) + d Iy(i, j) = max Iy(i-1, j) + e

xi yj Iy(i, j)

slide-118
SLIDE 118

F(i – 1, j – 1) F(i, j) = (xi, yj) + max Ix(i – 1, j – 1) Iy(i – 1, j – 1) F(i, j – 1) + d Ix(i, j) = max Ix(i, j – 1) + e F(i – 1, j) + d Iy(i, j) = max Iy(i – 1, j) + e

Continuing alignment Closing gaps in x Closing gaps in y Opening a gap in x Gap extension in x Opening a gap in y Gap extension in y

slide-119
SLIDE 119

Data dependency

F Ix Iy

i j i-1 j-1 i-1 j-1

slide-120
SLIDE 120

Data dependency

Iy Ix F

i j i j i j

slide-121
SLIDE 121

Data dependency

  • If we stack all three matrices

– No cyclic dependency – Therefore, we can fill in all three matrices in order

slide-122
SLIDE 122

Algorithm

  • for i = 1:m

– for j = 1:n

  • Fill in F(i, j), Ix(i, j), Iy(i, j)

– end

end

  • F(M, N) = max (F(M, N), Ix(M, N), Iy(M, N))
  • Time: O(MN)
  • Space: O(MN) or O(N) when combined with the

linear-space algorithm

slide-123
SLIDE 123

Exercise

  • x = GCAC
  • y = GCC
  • m = 2
  • s = -2
  • d = -5
  • e = -1
slide-124
SLIDE 124
  • 5
  • 6
  • 7
  • 8
  • 5
  • 6
  • 7

F: aligned on both Iy: Insertion on y F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) G C C G C A C G C A C G C A C G C C G C C x = y = x = y = x = y = Ix: Insertion on x (xi, yj) d e d e m = 2 s = -2 d = -5 e = -1

slide-125
SLIDE 125

2

  • 5
  • 6
  • 7
  • 8
  • 5
  • 6
  • 7

F Iy Ix G C C G C A C G C A C G C A C G C C G C C x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) (xi, yj) = 2 m = 2 s = -2 d = -5 e = -1

slide-126
SLIDE 126

2

  • 7
  • 5
  • 6
  • 7
  • 8
  • 5
  • 6
  • 7

F Iy Ix G C C G C A C G C A C G C A C G C C G C C x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) (xi, yj) = -2 m = 2 s = -2 d = -5 e = -1

slide-127
SLIDE 127

2

  • 7
  • 8
  • 5
  • 6
  • 7
  • 8
  • 5
  • 6
  • 7

F Iy Ix G C C G C A C G C A C G C A C G C C G C C x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) (xi, yj) = -2 m = 2 s = -2 d = -5 e = -1

slide-128
SLIDE 128

2

  • 7
  • 8
  • 5
  • 6
  • 7
  • 8
  • 5
  • 6
  • 7
  • 3

F Iy Ix G C C G C A C G C A C G C A C G C C G C C x = y = x = y = x = y = Ix(i,j) Ix(i,j-1) F(i,j-1) d = -5 e = -1 m = 2 s = -2 d = -5 e = -1

slide-129
SLIDE 129

2

  • 7
  • 8
  • 5
  • 6
  • 7
  • 8
  • 5
  • 6
  • 7
  • 3
  • 4

F Iy Ix G C C G C A C G C A C G C A C G C C G C C x = y = x = y = x = y = Ix(i,j) Ix(i,j-1) F(i,j-1) d = -5 e = -1 m = 2 s = -2 d = -5 e = -1

slide-130
SLIDE 130

2

  • 7
  • 8
  • 5
  • 6
  • 7
  • 8
  • 5
  • 6
  • 7
  • 3
  • 4

F Iy Ix G C C G C A C G C A C G C A C G C C G C C x = y = x = y = x = y = Iy(i,j) Iy(i-1,j) F(i-1,j) d=-5 e=-1 m = 2 s = -2 d = -5 e = -1

slide-131
SLIDE 131

2

  • 7
  • 8
  • 7
  • 5
  • 6
  • 7
  • 8
  • 5
  • 6
  • 7
  • 3
  • 4

F Iy Ix G C C G C A C G C A C G C A C G C C G C C x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) (xi, yj) = -2 m = 2 s = -2 d = -5 e = -1

slide-132
SLIDE 132

2

  • 7
  • 8
  • 7

4

  • 5
  • 6
  • 7
  • 8
  • 5
  • 6
  • 7
  • 3
  • 4

F Iy Ix G C C G C A C G C A C G C A C G C C G C C x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) (xi, yj) = 2 m = 2 s = -2 d = -5 e = -1

slide-133
SLIDE 133

2

  • 7
  • 8
  • 7

4

  • 1
  • 5
  • 6
  • 7
  • 8
  • 5
  • 6
  • 7
  • 3
  • 4

F Iy Ix G C C G C A C G C A C G C A C G C C G C C x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) (xi, yj) = 2 m = 2 s = -2 d = -5 e = -1

slide-134
SLIDE 134

2

  • 7
  • 8
  • 7

4

  • 1
  • 5
  • 6
  • 7
  • 8
  • 5
  • 6
  • 7
  • 3
  • 4
  • 12
  • 1

F Iy Ix G C C G C A C G C A C G C A C G C C G C C x = y = x = y = x = y = Ix(i,j) Ix(i,j-1) F(i,j-1) d = -5 e = -1 m = 2 s = -2 d = -5 e = -1

slide-135
SLIDE 135

2

  • 7
  • 8
  • 7

4

  • 1
  • 5
  • 6
  • 3
  • 7
  • 8
  • 5
  • 6
  • 7
  • 3
  • 4
  • 12
  • 1

F Iy Ix G C C G C A C G C A C G C A C G C C G C C x = y = x = y = x = y = Iy(i,j) Iy(i-1,j) F(i-1,j) d=-5 e=-1 m = 2 s = -2 d = -5 e = -1

slide-136
SLIDE 136

2

  • 7
  • 8
  • 7

4

  • 1
  • 5
  • 6
  • 3
  • 12
  • 13
  • 7
  • 8
  • 5
  • 6
  • 7
  • 3
  • 4
  • 12
  • 1

F Iy Ix G C C G C A C G C A C G C A C G C C G C C x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) (xi, yj) d e d e m = 2 s = -2 d = -5 e = -1

slide-137
SLIDE 137

2

  • 7
  • 8
  • 7

4

  • 5
  • 8
  • 5

2

  • 5
  • 6
  • 3
  • 12
  • 13
  • 7
  • 8
  • 5
  • 6
  • 7
  • 3
  • 4
  • 12
  • 1
  • 13
  • 10

F Iy Ix G C C G C A C G C A C G C A C G C C G C C x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) (xi, yj) d e d e m = 2 s = -2 d = -5 e = -1

slide-138
SLIDE 138

2

  • 7
  • 8
  • 7

4

  • 1
  • 8
  • 5

2

  • 5
  • 6
  • 3
  • 12
  • 13
  • 7
  • 4
  • 1
  • 8
  • 5
  • 6
  • 7
  • 3
  • 4
  • 12
  • 1
  • 13
  • 10

F Iy Ix G C C G C A C G C A C G C A C G C C G C C x = y = x = y = x = y = Iy(i,j) Iy(i-1,j) F(i-1,j) d=-5 e=-1 m = 2 s = -2 d = -5 e = -1

slide-139
SLIDE 139

2

  • 7
  • 8
  • 7

4

  • 1
  • 8
  • 5

2

  • 5
  • 6
  • 3
  • 12
  • 13
  • 7
  • 4
  • 1
  • 6
  • 8
  • 5
  • 6
  • 7
  • 3
  • 4
  • 12
  • 1
  • 13
  • 10

F Iy Ix G C C G C A C G C A C G C A C G C C G C C x = y = x = y = x = y = Iy(i,j) Iy(i-1,j) F(i-1,j) d=-5 e=-1 m = 2 s = -2 d = -5 e = -1

slide-140
SLIDE 140

2

  • 7
  • 8
  • 7

4

  • 1
  • 8
  • 5

2

  • 9
  • 2

1

  • 5
  • 6
  • 3
  • 12
  • 13
  • 7
  • 4
  • 1
  • 6
  • 8
  • 5
  • 2
  • 3
  • 5
  • 6
  • 7
  • 3
  • 4
  • 12
  • 1
  • 13
  • 10
  • 14
  • 7

F Iy Ix G C C G C A C G C A C G C A C G C C G C C x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) (xi, yj) d e d e m = 2 s = -2 d = -5 e = -1

slide-141
SLIDE 141

2

  • 7
  • 8
  • 7

4

  • 1
  • 8
  • 5

2

  • 9
  • 2

1

  • 5
  • 6
  • 3
  • 12
  • 13
  • 7
  • 4
  • 1
  • 6
  • 8
  • 5
  • 2
  • 3
  • 5
  • 6
  • 7
  • 3
  • 4
  • 12
  • 1
  • 13
  • 10
  • 14
  • 7

F Iy Ix G C C G C A C G C A C G C A C G C C G C C x = y = x = y = x = y = F(i, j) F(i-1, j-1) Ix(i-1, j-1) Iy(i-1, j-1) Ix(i,j) Ix(i,j-1) F(i,j-1) Iy(i,j) Iy(i-1,j) F(i-1,j) (xi, yj) d e d e m = 2 s = -2 d = -5 e = -1

slide-142
SLIDE 142

2

  • 7
  • 8
  • 7

4

  • 1
  • 8
  • 5

2

  • 9
  • 2

1

  • 5
  • 6
  • 3
  • 12
  • 13
  • 7
  • 4
  • 1
  • 6
  • 8
  • 5
  • 2
  • 3
  • 5
  • 6
  • 7
  • 3
  • 4
  • 12
  • 1
  • 13
  • 10
  • 14
  • 7

F Iy Ix G C C G C A C G C A C G C A C G C C G C C

GCAC GCAC || | || | GC GC-C

x = y = x = y = x = y = x y G C A C G C C x = y = m = 2 s = -2 d = -5 e = -1

slide-143
SLIDE 143

Protein Substitution Matrices

  • Two popular sets of matrices for protein

sequences

– PAM matrices [Dayhoff et al, 1978]

  • Percent Accepted Mutation(PAM)
  • Better for aligning closely related sequences
  • Based on global alignments of closely related

proteins.

– BLOSUM matrices [Henikoff & Henikoff, 1992]

  • BLOcks SUbstitution Matrix(BLOSUM)
  • For both closely or remotely related sequences
  • Based on local alignments.
slide-144
SLIDE 144

Protein Substitution Matrices

The relationship between BLOSUM and PAM substitution matrices. BLOSUM matrices with higher numbers and PAM matrices with low numbers are both designed for comparisons of closely related sequences. BLOSUM matrices with low numbers and PAM matrices with high numbers are designed for comparisons of distantly related proteins.

slide-145
SLIDE 145

Amino Acids

Sometimes it is not possible two differentiate two closely related amino acids, therefore we have the special cases:

  • asparagine/aspartic acid - asx - B
  • glutamine/glutamic acid - glx - Z
slide-146
SLIDE 146

Protein Substitution Matrices

slide-147
SLIDE 147

Protein Substitution Matrices

slide-148
SLIDE 148

Positive for chemically similar substitution Common amino acids get low weights Rare amino acids get high weights

slide-149
SLIDE 149

BLOSUM-N matrices

  • If you want to detect homologous genes with

high identity, you may want a BLOSUM matrix with higher N. say BLOSUM75

  • On the other hand, if you want to detect remote

homology, you may want to use lower N, say BLOSUM50

  • BLOSUM-62: good for most purposes

45 62 90 Weak homology Strong homology

slide-150
SLIDE 150

Part III: Heuristic Local Sequence Alignment: BLAST

slide-151
SLIDE 151

Some useful applications of alignments

Given a newly discovered gene,

  • Does it occur in other species?

Assume we try Smith-Waterman:

The entire genomic database Our new gene 104 1010 - 1011

May take several weeks!

slide-152
SLIDE 152

Some useful applications of alignments

Given a newly sequenced organism,

  • Which subregions align with other organisms?
  • Potential genes
  • Other functional units

Assume we try Smith-Waterman:

The entire genomic database Our newly sequenced mammal 3109 1010 - 1011

> 1000 years ???

slide-153
SLIDE 153

BLAST

  • Basic Local Alignment Search Tool

– Altschul, Gish, Miller, Myers, Lipman, J Mol Biol 1990 – The most widely used bioinformatics tool

  • Which is better: long mediocre match or a few nearby,

short, strong matches with the same total score?

– Score-wise, exactly equivalent – Biologically, later may be more interesting, & is common – At least, if must miss some, rather miss the former

  • BLAST is a heuristic algorithm emphasizing the later

– speed/sensitivity tradeoff: BLAST may miss former, but gains greatly in speed

slide-154
SLIDE 154

BLAST

  • Available at NCBI (National Center for

Biotechnology Information) for download and

  • nline use. http://blast.ncbi.nlm.nih.gov/
  • Along with many sequence databases

Main idea: 1.Construct a dictionary of all the words in the query 2.Initiate a local alignment for each word match between query and DB Running Time: O(MN) However, orders of magnitude faster than Smith-Waterman

query DB

slide-155
SLIDE 155

BLAST ⎯ Original Version

Dictionary: All words of length k (~11 for DNA, 3 for proteins) Alignment initiated between words of alignment score  T (typically T = k) Alignment: Ungapped extensions until score below statistical threshold Output: All local alignments with score > statistical threshold

…… …… query DB query scan

slide-156
SLIDE 156

BLAST ⎯ Original Version

A C G A A G T A A G G T C C A G T C C C T T C C T G G A T T G C G A

Example: k = 4, T = 4 The matching word GGTC initiates an alignment Extension to the left and right with no gaps until alignment falls < 50% Output: GTAAGGTCC GTTAGGTCC

slide-157
SLIDE 157

Gapped BLAST

A C G A A G T A A G G T C C A G T C T G A T C C T G G A T T G C G A

Added features:

  • Pairs of words can

initiate alignment

  • Extensions with

gaps in a band around anchor Output: GTAAGGTCCAGT GTTAGGTC-AGT

slide-158
SLIDE 158

Example

Query: gattacaccccgattacaccccgattaca (29 letters)

[2 mins]

Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences) 1,726,556 sequences; 8,074,398,388 total letters >gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence Length = 144487 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus Query: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| Sbjct: 125138 tacacccagattacaccccga 125158 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus Query: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| Sbjct: 125104 tacacccagattacaccccga 125124 >gi|28173089|gb|AC104321.7| Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence Length = 139823 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus Query: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| Sbjct: 3891 tacacccagattacaccccga 3911

slide-159
SLIDE 159

Example

Query: Human atoh enhancer, 179 letters [1.5 min] Result: 57 blast hits

1.

gi|7677270|gb|AF218259.1|AF218259 Homo sapiens ATOH1 enhanc... 355 1e-95 2. gi|22779500|gb|AC091158.11| Mus musculus Strain C57BL6/J ch... 264 4e-68 3. gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhanc... 256 9e-66 4. gi|28875397|gb|AF467292.1| Gallus gallus CATH1 (CATH1) gene... 78 5e-12 5. gi|27550980|emb|AL807792.6| Zebrafish DNA sequence from clo... 54 7e-05 6. gi|22002129|gb|AC092389.4| Oryza sativa chromosome 10 BAC O... 44 0.068 7. gi|22094122|ref|NM_013676.1| Mus musculus suppressor of Ty ... 42 0.27 8. gi|13938031|gb|BC007132.1| Mus musculus, Similar to suppres... 42 0.27

gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhancer sequence Length = 1517 Score = 256 bits (129), Expect = 9e-66 Identities = 167/177 (94%), Gaps = 2/177 (1%) Strand = Plus / Plus

Query: 3 tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62 ||||||||||||| ||||||||||||||||||| |||||||||||||||||||||||||| Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203 Query: 63 cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122 |||||||||||||||||||||||||| ||||||||| |||||||||||||||| ||||| Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262 Query: 123 cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179 ||||||||||||| || ||| |||||||||||||||||||| ||||||||||||||| Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318

slide-160
SLIDE 160

Different types of BLAST

  • blastn: search nucleic acid databases
  • blastp: search protein databases
  • blastx: you give a nucleic acid sequence,

search protein databases

  • tblastn: you give a protein sequence,

search nucleic acid databases

  • tblastx: you give a nucleic sequence,

search nucleic acid database, implicitly translate both into protein sequences

slide-161
SLIDE 161

BLAST cons and pros

  • Advantages

– Fast!!!! – A few minutes to search a database of 1011 bases

  • Disadvantages

– Sensitivity may be low – Often misses weak homologies

  • New improvement

– Make it even faster

  • Mainly for aligning very similar sequences or really long sequences

– E.g. whole genome vs whole genome

– Make it more sensitive

  • PSI-BLAST: iteratively add more homologous sequences
  • PatternHunter: discontinuous seeds
slide-162
SLIDE 162

Variants of BLAST

NCBI-BLAST: most widely used version WU-BLAST: (Washington University BLAST): another popular version

Optimized, added features

MEGABLAST:

Optimized to align very similar sequences. Linear gap penalty

BLAT: Blast-Like Alignment Tool BlastZ:

Optimized for aligning two genomes

PSI-BLAST:

BLAST produces many hits Those are aligned, and a pattern is extracted Pattern is used for next search; above steps iterated Sensitive for weak homologies Slower

slide-163
SLIDE 163

Summary

  • Part I: Algorithms

– Global sequence alignment: Needleman-Wunsch – Local sequence alignment: Smith-Waterman – Improvement on space and time

  • Part II: Biological issues

– Model gaps more accurately: affine gap penalty

  • Part III: Heuristic algorithms – BLAST family