CS 466 Introduction to Bioinformatics: Lecture 2 Part 2

SLIDE 1

CS 466 Introduction to Bioinformatics

Lecture 2 Part 2

Mohammed El-Kebir January 28, 2020

SLIDE 2

Outline

  1. Edit distance recap
  2. Global alignment
  3. Fitting alignment
  4. Local alignment
  5. Gapped alignment

Reading:

  • Jones and Pevzner. Chapters 6.6, 6.8 and 6.9
  • Lecture notes

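SLIDE 3

Weighted Edit Distance – Practice Problem

  • Compute the weighted edit distance between $\mathbf{v}$ = AGT and $\mathbf{w}$ = ATCG.

[Figure: empty dynamic programming table with rows 0 to 3 labeled by $\mathbf{v}$ = AGT and columns 0 to 4 labeled by $\mathbf{w}$ = ATCG]

$$
d[i, j] = \min \begin{cases}
0, & \text{if } i = 0 \text{ and } j = 0,\\
d[i-1, j] + 1, & \text{if } i > 0,\\
d[i, j-1] + 1, & \text{if } j > 0,\\
d[i-1, j-1] + 2, & \text{if } i > 0,\ j > 0 \text{ and } v_i \neq w_j,\\
d[i-1, j-1], & \text{if } i > 0,\ j > 0 \text{ and } v_i = w_j.
\end{cases}
$$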

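SLIDE 4

Weighted Edit Distance – Practice Problem

  • Compute the weighted edit distance between $\mathbf{v}$ = AGT and $\mathbf{w}$ = ATCG.

Filled table $d[i, j]$ (rows: $\mathbf{v}$, columns: $\mathbf{w}$):

| $\mathbf{v}$ \ $\mathbf{w}$ |   | A | T | C | G |
|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 |
| A | 1 | 0 | 1 | 2 | 3 |
| G | 2 | 1 | 2 | 3 | 2 |
| T | 3 | 2 | 1 | 2 | 3 |

$$
d[i, j] = \min \begin{cases}
0, & \text{if } i = 0 \text{ and } j = 0,\\
d[i-1, j] + 1, & \text{if } i > 0,\\
d[i, j-1] + 1, & \text{if } j > 0,\\
d[i-1, j-1] + 2, & \text{if } i > 0,\ j > 0 \text{ and } v_i \neq w_j,\\
d[i-1, j-1], & \text{if } i > 0,\ j > 0 \text{ and } v_i = w_j.
\end{cases}
$$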
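
The recurrence above can be sketched in Python as follows. This is a minimal illustration, not the course's reference code: the function name `weighted_edit_distance_table` and its parameters are assumptions, and it is run on AGT and ATCG, the row and column labels of the table.

```python
def weighted_edit_distance_table(v: str, w: str,
                                 indel: int = 1,
                                 mismatch: int = 2) -> list[list[int]]:
    """Fill d[i][j] for the prefixes v[:i] and w[:j] (hypothetical helper)."""
    m, n = len(v), len(w)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue  # base case: d[0][0] = 0
            candidates = []
            if i > 0:
                candidates.append(d[i - 1][j] + indel)      # deletion
            if j > 0:
                candidates.append(d[i][j - 1] + indel)      # insertion
            if i > 0 and j > 0:
                cost = 0 if v[i - 1] == w[j - 1] else mismatch
                candidates.append(d[i - 1][j - 1] + cost)   # match/mismatch
            d[i][j] = min(candidates)
    return d


table = weighted_edit_distance_table("AGT", "ATCG")
for row in table:
    print(row)
print("distance:", table[-1][-1])
```

The bottom-right entry of the table is the weighted edit distance; every other entry is the distance between two prefixes.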

SLIDE 6

Edit Distance – Additional Insights

  • An alignment corresponds to a series of elementary operations
  • But not every series of elementary operations corresponds to an alignment! Why?

Examples from http://profs.scienze.univr.it/~liptak/ACB/files/StringDistance_6up.pdf

SLIDE 7

Distance Function / Metric

A distance function (metric) on a set $X$ is a function $d : X \times X \to \mathbb{R}$ such that for all $x, y, z \in X$:

  • i. $d(x, y) \geq 0$ [non-negativity]
  • ii. $d(x, y) = 0$ if and only if $x = y$ [identity of indiscernibles]
  • iii. $d(x, y) = d(y, x)$ [symmetry]
  • iv. $d(x, y) \leq d(x, z) + d(z, y)$ [triangle inequality]

Question: Is edit distance a distance function?

SLIDE 8

Edit Distance is a Distance Function

Edit distance $d(\mathbf{v}, \mathbf{w})$ is the minimum number of elementary operations to transform $\mathbf{v} \in \Sigma^*$ into $\mathbf{w} \in \Sigma^*$.

Claim: edit distance is a distance function.

Proof: Let $\mathbf{u}, \mathbf{v}, \mathbf{w} \in \Sigma^*$.

i. $d(\mathbf{v}, \mathbf{w}) \geq 0$ [non-negativity]

Edit distance is defined by an alignment. This in turn uniquely determines a series of elementary operations, each with cost either 0 (match) or 1 (otherwise). Thus, $d(\mathbf{v}, \mathbf{w}) \geq 0$.

SLIDE 9

Edit Distance is a Distance Function

Proof (continued):

ii. $d(\mathbf{v}, \mathbf{w}) = 0$ if and only if $\mathbf{v} = \mathbf{w}$ [identity of indiscernibles]

($\Rightarrow$) By the premise, $d(\mathbf{v}, \mathbf{w}) = 0$. By definition, the optimal alignment can only consist of operations with cost 0. That is, the alignment consists of only matches. Thus, $\mathbf{v} = \mathbf{w}$.

($\Leftarrow$) By the premise, $\mathbf{v} = \mathbf{w}$. Thus, there exists an alignment where every column is a match. This means that $|\mathbf{v}| = |\mathbf{w}|$ and each letter $v_i$ equals $w_i$ (where $i \in [|\mathbf{v}|]$). Moreover, only the match operation has cost 0; the other operations have cost 1. Hence, this is the optimal alignment, with cost $d(\mathbf{v}, \mathbf{w}) = 0$.

SLIDE 10

Edit Distance is a Distance Function

Proof (continued):

iii. $d(\mathbf{v}, \mathbf{w}) = d(\mathbf{w}, \mathbf{v})$ [symmetry]

Let $\mathbf{A} = [a_{i,j}]$ be the optimal alignment corresponding to $d(\mathbf{v}, \mathbf{w})$, i.e. $\mathbf{A}$ is a $2 \times k$ matrix where $k \in \{\max(|\mathbf{v}|, |\mathbf{w}|), \ldots, |\mathbf{v}| + |\mathbf{w}|\}$. Define the function $f(\mathbf{A}) = \mathbf{B}$ such that $\mathbf{B}$ is obtained by interchanging the two rows of $\mathbf{A}$. Since the cost of any insertion, deletion and mismatch is 1, alignment $\mathbf{B}$ has cost $d(\mathbf{v}, \mathbf{w})$. The existence of an alignment from $\mathbf{w}$ to $\mathbf{v}$ with cost less than $d(\mathbf{v}, \mathbf{w})$ yields a contradiction: interchanging its rows would give an alignment from $\mathbf{v}$ to $\mathbf{w}$ of the same lower cost, which implies that $\mathbf{A}$ is not an optimal alignment from $\mathbf{v}$ to $\mathbf{w}$. Hence, $d(\mathbf{w}, \mathbf{v}) = d(\mathbf{v}, \mathbf{w})$.

SLIDE 11

Edit Distance is a Distance Function

Proof (continued):

iv. $d(\mathbf{v}, \mathbf{w}) \leq d(\mathbf{v}, \mathbf{u}) + d(\mathbf{u}, \mathbf{w})$ [triangle inequality]

Assume for a contradiction that $d(\mathbf{v}, \mathbf{w}) > d(\mathbf{v}, \mathbf{u}) + d(\mathbf{u}, \mathbf{w})$ for some $\mathbf{u} \in \Sigma^*$. Let $S$ be a shortest sequence of elementary operations for transforming $\mathbf{v}$ into $\mathbf{u}$. Let $S'$ be a shortest sequence of elementary operations for transforming $\mathbf{u}$ into $\mathbf{w}$. Note that $d(\mathbf{v}, \mathbf{u}) = |S|$ and $d(\mathbf{u}, \mathbf{w}) = |S'|$. Concatenate $S$ and $S'$ and remove redundant operations, yielding sequence $S''$. By definition, $|S''| \leq |S| + |S'|$. We can obtain an alignment of $\mathbf{v}$ and $\mathbf{w}$ from $S''$ with cost $|S''| \leq d(\mathbf{v}, \mathbf{u}) + d(\mathbf{u}, \mathbf{w})$. This contradicts $d(\mathbf{v}, \mathbf{w}) > d(\mathbf{v}, \mathbf{u}) + d(\mathbf{u}, \mathbf{w})$ being the cost of the optimal alignment of $\mathbf{v}$ and $\mathbf{w}$.

Claim: edit distance is a distance function.
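
As a sanity check that complements the proof (it does not replace it), the four axioms can be tested by brute force on a small set of strings. The sketch below assumes unit costs for insertion, deletion and mismatch; the helper names `edit_distance`, `all_strings` and `check_metric_axioms` are illustrative only.

```python
from itertools import product


def edit_distance(v: str, w: str) -> int:
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(w) + 1))
    for i in range(1, len(v) + 1):
        curr = [i] + [0] * len(w)
        for j in range(1, len(w) + 1):
            curr[j] = min(prev[j] + 1,                           # deletion
                          curr[j - 1] + 1,                       # insertion
                          prev[j - 1] + (v[i - 1] != w[j - 1]))  # match/mismatch
        prev = curr
    return prev[-1]


def all_strings(alphabet: str, max_len: int) -> list[str]:
    """All strings over the alphabet with length 0..max_len."""
    return ["".join(p) for k in range(max_len + 1)
            for p in product(alphabet, repeat=k)]


def check_metric_axioms(strings: list[str]) -> bool:
    d = {(x, y): edit_distance(x, y) for x in strings for y in strings}
    for x in strings:
        for y in strings:
            if d[x, y] < 0:                    # i. non-negativity
                return False
            if (d[x, y] == 0) != (x == y):     # ii. identity of indiscernibles
                return False
            if d[x, y] != d[y, x]:             # iii. symmetry
                return False
            for z in strings:                  # iv. triangle inequality
                if d[x, y] > d[x, z] + d[z, y]:
                    return False
    return True


print(check_metric_axioms(all_strings("AC", 3)))
```

A finite check like this can only expose a counterexample; the general statement still rests on the argument above.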

SLIDE 12

Outline

  1. Edit distance recap
  2. Global alignment
  3. Fitting alignment
  4. Local alignment
  5. Gapped alignment

Reading:

  • Jones and Pevzner. Chapters 6.6, 6.8 and 6.9

SLIDE 13

Biological Sequence Alignment

  • Weighted edit distance: find alignment with minimum distance
      • Shortest path in weighted edit graph
  • Sequence alignment: find alignment with maximum similarity
      • Longest path in weighted edit graph
  • Score function: $\delta : (\Sigma \cup \{-\})^2 \to \mathbb{R}$

[Figure: edit graph for $\mathbf{v}$ = ATGT (rows 0 to 4) and $\mathbf{w}$ = ATCG (columns 0 to 4) with match, mismatch, insertion and deletion edges, scored $\delta(v_i, -)$, $\delta(-, w_j)$ and $\delta(v_i, w_j)$]

Question: What is an example of $\delta$?

SLIDE 15

Scoring Matrices

Transitions: interchanges among purines (two rings) or pyrimidines (one ring)

  • A <--> G
  • C <--> T

Transversions: interchanges between purines (two rings) and pyrimidines (one ring)

  • A <--> C, A <--> T
  • G <--> C, G <--> T

Transitions more likely than transversions!

| $\delta$ | A | T | C | G | − |
|---|---|---|---|---|---|
| A | 1 | −2 | −2 | −1 | −1 |
| T | −2 | 1 | −1 | −2 | −1 |
| C | −2 | −1 | 1 | −2 | −1 |
| G | −1 | −2 | −2 | 1 | −1 |
| − | −1 | −1 | −1 | −1 | −∞ |
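
One way to encode such a scoring function in Python is sketched below. It assumes the values +1 (match), −1 (transition or indel), −2 (transversion) and −∞ (gap against gap); the names `delta`, `PURINES`, `PYRIMIDINES` and `GAP` are illustrative.

```python
PURINES = {"A", "G"}        # two rings
PYRIMIDINES = {"C", "T"}    # one ring
GAP = "-"


def delta(a: str, b: str) -> float:
    """Score of aligning symbol a with symbol b (either may be a gap)."""
    if a == GAP and b == GAP:
        return float("-inf")      # a column of two gaps is never allowed
    if a == GAP or b == GAP:
        return -1                 # indel
    if a == b:
        return 1                  # match
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return -1 if same_class else -2   # transition vs. transversion


symbols = "ATCG" + GAP
for a in symbols:
    print(a, [delta(a, b) for b in symbols])
```

Penalizing transversions more heavily than transitions reflects the observation that transitions are more likely.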

SLIDE 16

Global Alignment – Needleman-Wunsch Algorithm

  • An alignment is a source-to-sink path in the edit graph
  • An alignment $\mathbf{A} = [a_{i,j}]$ is a $2 \times k$ matrix such that (i) $k \in \{\max(m, n), \ldots, m + n\}$, (ii) $a_{i,j} \in \Sigma \cup \{-\}$ and (iii) there is no $j \in [k]$ where $a_{1,j} = a_{2,j} = -$

Global Alignment problem: Given strings $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$ and scoring function $\delta$, find alignment with maximum score.

$$
s[i, j] = \max \begin{cases}
0, & \text{if } i = 0 \text{ and } j = 0,\\
s[i-1, j] + \delta(v_i, -), & \text{if } i > 0 \quad \text{(deletion)},\\
s[i, j-1] + \delta(-, w_j), & \text{if } j > 0 \quad \text{(insertion)},\\
s[i-1, j-1] + \delta(v_i, w_j), & \text{if } i > 0 \text{ and } j > 0 \quad \text{(match/mismatch)}.
\end{cases}
$$
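
A minimal Python sketch of this recurrence, with a traceback that recovers one optimal alignment, is shown below. The function names and the default scoring function `simple_delta` (+1 match, −1 mismatch or indel) are assumptions made for illustration; any function `delta(a, b)` over $\Sigma \cup \{-\}$ could be passed instead.

```python
def simple_delta(a: str, b: str) -> int:
    """Hypothetical scoring: +1 for a match, -1 for a mismatch or indel."""
    return 1 if a == b else -1


def global_alignment(v: str, w: str, delta=simple_delta):
    """Return (score, aligned v, aligned w) for one optimal global alignment."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    move = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue  # source node, score 0
            options = []
            if i > 0:
                options.append((s[i - 1][j] + delta(v[i - 1], "-"), "deletion"))
            if j > 0:
                options.append((s[i][j - 1] + delta("-", w[j - 1]), "insertion"))
            if i > 0 and j > 0:
                options.append((s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]), "diagonal"))
            s[i][j], move[i][j] = max(options)
    # Walk back from the sink (m, n) to the source (0, 0).
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if move[i][j] == "diagonal":
            top.append(v[i - 1]); bottom.append(w[j - 1]); i -= 1; j -= 1
        elif move[i][j] == "deletion":
            top.append(v[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(w[j - 1]); j -= 1
    return s[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = global_alignment("AGGT", "ACGGC")
print(score)
print(top)
print(bottom)
```

Ties between equally good moves are broken arbitrarily here, so the printed alignment is only one of possibly several optimal ones.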

SLIDE 17

Demonstration

  • http://alfehrest.org/sub/nwa/index.html
  • $\mathbf{v}$ = ATGTTAT and $\mathbf{w}$ = ATCGTAC.

| $\delta$ | A | T | C | G | − |
|---|---|---|---|---|---|
| A | 1 | −2 | −2 | −1 | −1 |
| T | −2 | 1 | −1 | −2 | −1 |
| C | −2 | −1 | 1 | −2 | −1 |
| G | −1 | −2 | −2 | 1 | −1 |
| − | −1 | −1 | −1 | −1 | −∞ |

SLIDE 18

Outline

  1. Edit distance recap
  2. Global alignment
  3. Fitting alignment
  4. Local alignment
  5. Gapped alignment

Reading:

  • Jones and Pevzner. Chapters 6.6, 6.7 and 6.9
  • Lecture notes

SLIDE 19

Next Generation Sequencing (NGS) Technology

[Figure: log-scale chart (axis ticks from 1,000 to 100,000,000), dated November 2017, with the NGS era marked]

SLIDE 21

NGS Characterized by Short Reads

  • Genome: millions to billions of nucleotides
  • Next-generation DNA sequencing: tens to hundreds of millions of short reads
  • Short read: 100 nucleotides

[Figure: short reads sampled from the genome, e.g. … GGTAGTTAG …, … TATAATTAG …, … AGCCATTAG …, … CGTACCTAG …, … CATTCAGTAG …, … GGTAAACTAG …]

Allow for inexact matches due to:

  • Sequencing errors
  • Polymorphisms/mutations in reference genome

Question: How to account for the discrepancy between the lengths of the reference and a short read?

The human reference genome is 3,300,000,000 nucleotides long, while a short read is 100 nucleotides. Global sequence alignment will not work!

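SLIDE 22

Fitting Alignment

For short read alignment, we want to align the complete short read $\mathbf{v} \in \Sigma^m$ to a substring of the reference genome $\mathbf{w} \in \Sigma^n$. Note that $m \ll n$.

Fitting Alignment problem: Given strings $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$ and scoring function $\delta$, find an alignment of $\mathbf{v}$ and a substring of $\mathbf{w}$ with maximum global alignment score $s^*$ among all global alignments of $\mathbf{v}$ and all substrings of $\mathbf{w}$.

[Figure: short string $\mathbf{v} \in \Sigma^m$ aligned against a substring of the longer string $\mathbf{w} \in \Sigma^n$]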

SLIDE 24

Fitting Alignment – Naive Approach

  • Consider all contiguous non-empty substrings of $\mathbf{w}$, defined by $1 \leq i \leq j \leq n$
  • How many? Answer: $n + \binom{n}{2}$
  • What are their total lengths?
  • What is the running time?

Fitting Alignment problem: Given strings $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$ and scoring function $\delta$, find an alignment of $\mathbf{v}$ and a substring of $\mathbf{w}$ with maximum global alignment score $s^*$ among all global alignments of $\mathbf{v}$ and all substrings of $\mathbf{w}$.

[Figure: short string $\mathbf{v} \in \Sigma^m$ aligned against a substring of the longer string $\mathbf{w} \in \Sigma^n$]
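
The counting questions above can be explored numerically. The sketch below enumerates all index pairs $1 \leq i \leq j \leq n$ and compares the count with $n + \binom{n}{2}$; the closed form $n(n+1)(n+2)/6$ for the total length is not stated on the slide and is included as a derived assumption that the loop verifies. The name `substring_stats` is illustrative.

```python
from math import comb


def substring_stats(n: int) -> tuple[int, int]:
    """Count the non-empty substrings w[i..j] of a length-n string and sum their lengths."""
    count = 0
    total_length = 0
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            count += 1
            total_length += j - i + 1
    return count, total_length


for n in (1, 5, 10):
    count, total = substring_stats(n)
    assert count == n + comb(n, 2)
    assert total == n * (n + 1) * (n + 2) // 6
    print(n, count, total)
```

If each global alignment of $\mathbf{v}$ against a substring costs time proportional to $m$ times the substring length, the total length above suggests a naive running time on the order of $m n^3$.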

SLIDE 27

Fitting Alignment – Dynamic Programming

Fitting Alignment problem: Given strings $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$ and scoring function $\delta$, find an alignment of $\mathbf{v}$ and a substring of $\mathbf{w}$ with maximum global alignment score $s^*$ among all global alignments of $\mathbf{v}$ and all substrings of $\mathbf{w}$.

$$
s[i, j] = \max \begin{cases}
0, & \text{if } i = 0,\\
s[i-1, j] + \delta(v_i, -), & \text{if } i > 0,\\
s[i, j-1] + \delta(-, w_j), & \text{if } i > 0 \text{ and } j > 0,\\
s[i-1, j-1] + \delta(v_i, w_j), & \text{if } i > 0 \text{ and } j > 0.
\end{cases}
\qquad
s^* = \max\{s[m, 0], \ldots, s[m, n]\}
$$

  • Start anywhere on first row
  • End anywhere on last row

[Figure: dynamic programming table for the strings AGGT and ACGGC, with the corresponding alignments]

Question: Let match score be 1, mismatch/indel score be -1. What is $s^*$?

Question: Same scores. What is optimal global alignment and score?
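
The fitting recurrence differs from global alignment only in its first row (all zeros) and in where the optimum is read off (the best entry of the last row). A minimal Python sketch follows. It assumes a simple match/mismatch/indel scoring, and it assumes that AGGT plays the role of $\mathbf{v}$ and ACGGC the role of $\mathbf{w}$, since the orientation of the figure cannot be recovered from the text. The name `fitting_score` is illustrative.

```python
def fitting_score(v: str, w: str, match: int = 1,
                  mismatch: int = -1, indel: int = -1) -> tuple[int, int]:
    """Return (s*, j*) where j* is the end position in w of a best fit of all of v."""
    m, n = len(v), len(w)
    # Row 0 stays 0: the alignment may start anywhere in w at no cost.
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        s[i][0] = s[i - 1][0] + indel          # only deletions are possible at j = 0
        for j in range(1, n + 1):
            diag = s[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            s[i][j] = max(s[i - 1][j] + indel,  # deletion
                          s[i][j - 1] + indel,  # insertion
                          diag)                 # match/mismatch
    # End anywhere on the last row.
    best_j = max(range(n + 1), key=lambda j: s[m][j])
    return s[m][best_j], best_j


print(fitting_score("AGGT", "ACGGC"))
print(fitting_score("GG", "ACGGC"))
```

Only the initialization and the final maximum change relative to global alignment, so the running time stays proportional to $mn$.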

SLIDE 28

Fitting Alignment – Dynamic Programming

  • Online: https://valiec.github.io/AlignmentVisualizer/index.html

SLIDE 29

Outline

  1. Edit distance
  2. Global alignment
  3. Fitting alignment
  4. Local alignment
  5. Gapped alignment

Reading:

  • Jones and Pevzner. Chapters 6.6, 6.8 and 6.9
  • Lecture notes

SLIDE 30

Local Alignment – Biological Motivation

Proteins are composed of functional units called domains. Such domains may occur in different proteins even across species.

[Figure: domain architectures of the proteins ABL1 and SHKA, from the Pfam database (http://pfam.sanger.ac.uk/)]

Local Alignment problem: Given strings $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$ and scoring function $\delta$, find a substring of $\mathbf{v}$ and a substring of $\mathbf{w}$ whose alignment has maximum global alignment score $s^*$ among all global alignments of all substrings of $\mathbf{v}$ and $\mathbf{w}$.

SLIDE 31

Global, Fitting and Local Alignment

Global Alignment problem: Given strings $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$ and scoring function $\delta$, find alignment of $\mathbf{v}$ and $\mathbf{w}$ with maximum score.

Fitting Alignment problem: Given strings $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$ and scoring function $\delta$, find an alignment of $\mathbf{v}$ and a substring of $\mathbf{w}$ with maximum global alignment score $s^*$ among all global alignments of $\mathbf{v}$ and all substrings of $\mathbf{w}$.

Local Alignment problem: Given strings $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$ and scoring function $\delta$, find a substring of $\mathbf{v}$ and a substring of $\mathbf{w}$ whose alignment has maximum global alignment score $s^*$ among all global alignments of all substrings of $\mathbf{v}$ and $\mathbf{w}$.

SLIDE 32

Local Alignment – Naive Algorithm

Brute force:

  1. Generate all pairs $(\mathbf{v}', \mathbf{w}')$ of substrings of $\mathbf{v}$ and $\mathbf{w}$
  2. For each pair $(\mathbf{v}', \mathbf{w}')$, solve global alignment problem.

Question: There are $\binom{m}{2}\binom{n}{2}$ pairs of substrings. But they have different lengths. What is the running time?

Local Alignment problem: Given strings $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$ and scoring function $\delta$, find a substring of $\mathbf{v}$ and a substring of $\mathbf{w}$ whose alignment has maximum global alignment score $s^*$ among all global alignments of all substrings of $\mathbf{v}$ and $\mathbf{w}$.

SLIDE 33

Key Idea

Local alignment:

  • Start and end anywhere

Global alignment:

  • Start at $(0, 0)$ and end at $(m, n)$

SLIDE 35

Local Alignment Recurrence

Local Alignment problem: Given strings $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$ and scoring function $\delta$, find a substring of $\mathbf{v}$ and a substring of $\mathbf{w}$ whose alignment has maximum global alignment score $s^*$ among all global alignments of all substrings of $\mathbf{v}$ and $\mathbf{w}$.

$$
s[i, j] = \max \begin{cases}
0, & \\
s[i-1, j] + \delta(v_i, -), & \text{if } i > 0,\\
s[i, j-1] + \delta(-, w_j), & \text{if } j > 0,\\
s[i-1, j-1] + \delta(v_i, w_j), & \text{if } i > 0 \text{ and } j > 0.
\end{cases}
\qquad
s^* = \max_{i, j} s[i, j]
$$

The 0 option is available at every cell, not only at $(0, 0)$: this is what lets an alignment start anywhere. Taking the maximum over all cells lets it end anywhere.

  • Start anywhere
  • End anywhere

Running time: $O(mn)$

SLIDE 36

Local Alignment – Dynamic Programming

  • Online: https://valiec.github.io/AlignmentVisualizer/index.html

Question: Let match score be 2, mismatch score be -2 and indel be -4. What is $s^*$?

[Figure: dynamic programming table for the strings AGGT and ACGGC, with the resulting local alignment of GG against GG]

SLIDE 38

Outline

  1. Edit distance
  2. Global alignment
  3. Fitting alignment
  4. Local alignment
  5. Gapped alignment

Reading:

  • Jones and Pevzner. Chapters 6.6, 6.8 and 6.9
  • Lecture notes

SLIDE 40

Scoring Gaps

Let $\mathbf{v}$ = AAC and $\mathbf{w}$ = ACAAC.

Match $\delta(c, c) = 1$; Mismatch $\delta(c, d) = -1$ (where $c \neq d$); Indel $\delta(c, -) = \delta(-, c) = -2$

Left alignment:

  $\mathbf{v}$: A - - A C
  $\mathbf{w}$: A C A A C

Right alignment:

  $\mathbf{v}$: A - A - C
  $\mathbf{w}$: A C A A C

Both alignments have 3 matches and 2 indels. Score: $3 \cdot 1 + 2 \cdot (-2) = -1$

Question: Which alignment is better?

SLIDE 43

Scoring Gaps – Affine Gap Penalties

Desired: Lower penalty for consecutive gaps than interspersed gaps.

Why: Consecutive gaps are more likely due to slippage errors in DNA replication (2-3 nucleotides), codons for protein sequences, etc.

Affine gap penalty: Two penalties: (i) gap open penalty $\rho \geq 0$ and (ii) gap extension penalty $\sigma \geq 0$. A stretch of $k$ consecutive gaps has score $-(\rho + \sigma k)$.

Left alignment:

  $\mathbf{v}$: A - - A C
  $\mathbf{w}$: A C A A C

Right alignment:

  $\mathbf{v}$: A - A - C
  $\mathbf{w}$: A C A A C

Let $\rho = 10$ and $\sigma = 1$.

  • Left: $3 \cdot 1 - (10 + 1 \cdot 2) = -9$.
  • Right: $3 \cdot 1 - (10 + 1 \cdot 1) - (10 + 1 \cdot 1) = -19$.
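
The arithmetic above can be reproduced by scoring a given alignment column by column. In the sketch below the two alignment rows are written with `-` for gaps; the function name `affine_alignment_score` and its parameter names are assumptions for illustration.

```python
def affine_alignment_score(top: str, bottom: str, rho: int = 10, sigma: int = 1,
                           match: int = 1, mismatch: int = -1) -> int:
    """Score an alignment where a stretch of k gaps costs rho + sigma * k."""
    assert len(top) == len(bottom)
    score = 0
    prev_gap_row = None  # row ("top"/"bottom") holding the gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            if gap_row != prev_gap_row:
                score -= rho          # a new stretch of gaps opens
            score -= sigma            # every gap symbol pays the extension penalty
            prev_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap_row = None
    return score


print(affine_alignment_score("A--AC", "ACAAC"))   # left alignment
print(affine_alignment_score("A-A-C", "ACAAC"))   # right alignment
```

With these parameters the single stretch of two gaps is penalized far less than two separate stretches of one gap each, which is the desired behavior.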

SLIDE 45

Affine Gap Penalty Alignment – Naive Approach

Affine gap penalty: Two penalties: (i) gap open penalty $\rho \geq 0$ and (ii) gap extension penalty $\sigma \geq 0$. A stretch of $k$ consecutive gaps has score $-(\rho + \sigma k)$.

Idea: Insert horizontal (deletion) and vertical (insertion) edges spanning $k > 1$ gaps with score $-(\rho + \sigma k)$.

[Figure: edit graph in which new edges spanning several gaps are added alongside the old edges]

Question: What's the running time?

Question: What's the recurrence?

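SLIDE 48

Affine Gap Penalty Alignment

Idea: Three separate recurrences:

  • (i) Gap in first sequence: $s^{\rightarrow}[i, j]$
  • (ii) Match/mismatch: $s^{\searrow}[i, j]$
  • (iii) Gap in second sequence: $s^{\downarrow}[i, j]$

$$
s^{\rightarrow}[i, j] = \max \begin{cases}
s^{\rightarrow}[i, j-1] - \sigma, & \text{if } j > 1,\\
s^{\searrow}[i, j-1] - (\sigma + \rho), & \text{if } j > 0,
\end{cases}
$$

$$
s^{\searrow}[i, j] = \max \begin{cases}
0, & \text{if } i = 0 \text{ and } j = 0,\\
s^{\rightarrow}[i, j], & \text{if } j > 0,\\
s^{\downarrow}[i, j], & \text{if } i > 0,\\
s^{\searrow}[i-1, j-1] + \delta(v_i, w_j), & \text{if } i > 0 \text{ and } j > 0,
\end{cases}
$$

$$
s^{\downarrow}[i, j] = \max \begin{cases}
s^{\downarrow}[i-1, j] - \sigma, & \text{if } i > 1,\\
s^{\searrow}[i-1, j] - (\sigma + \rho), & \text{if } i > 0.
\end{cases}
$$

Running time: $O(mn)$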
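
The three recurrences can be sketched in Python with three tables. The table names `right`, `diag` and `down` (standing for $s^{\rightarrow}$, $s^{\searrow}$ and $s^{\downarrow}$) and the function name are assumptions; undefined entries are represented by $-\infty$, which has the same effect as the guards $j > 1$ and $i > 1$ in the recurrences.

```python
NEG_INF = float("-inf")


def affine_global_score(v: str, w: str, rho: int = 10, sigma: int = 1,
                        match: int = 1, mismatch: int = -1):
    """Optimal global alignment score under an affine gap penalty rho + sigma * k."""
    m, n = len(v), len(w)
    right = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with a gap in v
    down = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # ends with a gap in w
    diag = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # best score overall
    for i in range(m + 1):
        for j in range(n + 1):
            if j > 0:
                right[i][j] = max(right[i][j - 1] - sigma,         # extend a gap
                                  diag[i][j - 1] - (sigma + rho))  # open a gap
            if i > 0:
                down[i][j] = max(down[i - 1][j] - sigma,
                                 diag[i - 1][j] - (sigma + rho))
            if i == 0 and j == 0:
                diag[i][j] = 0
                continue
            best = max(right[i][j], down[i][j])
            if i > 0 and j > 0:
                pair = match if v[i - 1] == w[j - 1] else mismatch
                best = max(best, diag[i - 1][j - 1] + pair)
            diag[i][j] = best
    return diag[m][n]


print(affine_global_score("AAC", "ACAAC"))
```

Each of the three tables has $(m + 1)(n + 1)$ cells filled in constant time, consistent with the $O(mn)$ running time.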

SLIDE 49

Affine Gap Penalty Alignment – Example

$\mathbf{v}$ = AAC, $\mathbf{w}$ = ACAAC

Let $\rho = 10$ and $\sigma = 1$. Match = 1. Mismatch = -1

SLIDE 50

Gapped Alignment – Additional Insights

  • Naive approach supports arbitrary gap penalties given two sequences $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$. This results in an $O(mn(m + n))$ algorithm.
  • Alignment with convex gap penalties given two sequences $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$ can be computed in $O(mn \log m)$ time.

See: Dan Gusfield. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, NY, USA.
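
The naive approach for arbitrary gap penalties can be sketched as a single table in which every cell also looks back along its whole row and column. The sketch assumes a caller-supplied penalty function `gap_penalty(k)` for a stretch of $k$ gaps (a hypothetical parameter) and simple match/mismatch scores; it is an illustration of the $O(mn(m + n))$ idea rather than an optimized implementation.

```python
def general_gap_global_score(v: str, w: str, gap_penalty,
                             match: int = 1, mismatch: int = -1):
    """Global alignment score where a stretch of k gaps costs gap_penalty(k)."""
    m, n = len(v), len(w)
    neg_inf = float("-inf")
    s = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    s[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                pair = match if v[i - 1] == w[j - 1] else mismatch
                best = s[i - 1][j - 1] + pair
            for k in range(1, j + 1):      # horizontal edge spanning k gaps
                best = max(best, s[i][j - k] - gap_penalty(k))
            for k in range(1, i + 1):      # vertical edge spanning k gaps
                best = max(best, s[i - k][j] - gap_penalty(k))
            s[i][j] = best
    return s[m][n]


# Affine penalty with gap open 10 and gap extension 1.
print(general_gap_global_score("AAC", "ACAAC", lambda k: 10 + k))
```

Each of the $O(mn)$ cells inspects up to $m + n$ predecessors, which is where the extra factor in the running time comes from.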

SLIDE 51

Take Home Messages

  1. Edit distance
  2. Global alignment
  3. Fitting alignment
  4. Local alignment
  5. Gapped alignment

Key points:

  • Edit distance is shortest path in DAG
  • Global alignment is longest path in DAG
  • Small tweaks enable different extensions

Reading:

  • Jones and Pevzner. Chapters 6.6, 6.8 and 6.9
  • Lecture notes