CS CS 466 466 In Introduct ctio ion t to B Bio ioin - - PowerPoint PPT Presentation

cs cs 466 466 in introduct ctio ion t to b bio ioin
SMART_READER_LITE
LIVE PREVIEW

CS CS 466 466 In Introduct ctio ion t to B Bio ioin - - PowerPoint PPT Presentation

CS CS 466 466 In Introduct ctio ion t to B Bio ioin informatics ics Lecture 5 Mohammed El-Kebir February 4, 2020 Outline 1. Fitting alignment 2. Local alignment 3. Gapped alignment 4. BLOSUM scoring matrix Reading: Jones and


slide-1
SLIDE 1

CS CS 466 466 In Introduct ctio ion t to B Bio ioin informatics ics

Lecture 5

Mohammed El-Kebir February 4, 2020

slide-2
SLIDE 2

Outline

  • 1. Fitting alignment
  • 2. Local alignment
  • 3. Gapped alignment
  • 4. BLOSUM scoring matrix

Reading:

  • Jones and Pevzner. Chapters 6.6-6.9
  • Lecture notes

2

slide-3
SLIDE 3

Allow for inexact matches due to:

  • Sequencing errors
  • Polymorphisms/mutations in

reference genome

3

NGS Characterized by Short Reads

Genome Millions -billions nucleotides Next-generation DNA sequencing 10-100’s million short reads Short read: 100 nucleotides

… GGTAGTTAG … … TATAATTAG … … AGCCATTAG … … CGTACCTAG … … CATTCAGTAG … … GGTAAACTAG …

Question: How to account for discrepancy between lengths of reference and short read? Human reference genome is 3,300,000,000 nucleotides, while a short read is 100 nucleotides. Global sequence alignment will not work!

slide-4
SLIDE 4

Fitting Alignment

4

For short read alignment, we want to align complete short read 𝐰 ∈ Σ$ to substring of reference genome 𝐱 ∈ Σ&. Note that 𝑛 ≪ 𝑜. Fitting Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find a alignment of 𝐰 and a substring of 𝐱 with maximum global alignment score 𝑡∗ among all global alignments of 𝐰 and all substrings of 𝐱

𝐰 ∈ Σ$ 𝐱 ∈ Σ&

slide-5
SLIDE 5

Fitting Alignment – Naive Approach

  • Consider all contiguous non-empty substrings of 𝐱, defined by 1 ≤ 𝑗 ≤ 𝑘 ≤ 𝑜
  • How many?

5

Fitting Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find an alignment of 𝐰 and a substring of 𝐱 with maximum global alignment score 𝑡∗ among all global alignments of 𝐰 and all substrings of 𝐱

𝐰 ∈ Σ$ 𝐱 ∈ Σ&

slide-6
SLIDE 6

Fitting Alignment – Naive Approach

  • Consider all contiguous non-empty substrings of 𝐱, defined by 1 ≤ 𝑗 ≤ 𝑘 ≤ 𝑜
  • How many? Answer: 𝑜 +

& 2

  • What are their total lengths?
  • What is the running time?

6

Fitting Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find an alignment of 𝐰 and a substring of 𝐱 with maximum global alignment score 𝑡∗ among all global alignments of 𝐰 and all substrings of 𝐱

𝐰 ∈ Σ$ 𝐱 ∈ Σ&

slide-7
SLIDE 7

Fitting Alignment – Dynamic Programming

7

Fitting Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find an alignment of 𝐰 and a substring of 𝐱 with maximum global alignment score 𝑡∗ among all global alignments of 𝐰 and all substrings of 𝐱

s[i, j] = max          0, if i = 0, s[i − 1, j] + δ(vi, −), if i > 0, s[i, j − 1] + δ(−, wj), if i > 0 and j > 0, s[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0. s∗ = max{s[m, 0], . . . , s[m, n]}

A G G T A C G G C

𝐰\𝐱

slide-8
SLIDE 8

Fitting Alignment – Dynamic Programming

8

Fitting Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find an alignment of 𝐰 and a substring of 𝐱 with maximum global alignment score 𝑡∗ among all global alignments of 𝐰 and all substrings of 𝐱

s[i, j] = max          0, if i = 0, s[i − 1, j] + δ(vi, −), if i > 0, s[i, j − 1] + δ(−, wj), if i > 0 and j > 0, s[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0. s∗ = max{s[m, 0], . . . , s[m, n]}

A G G T A C G G C

𝐰\𝐱

Start anywhere on first row End anywhere on last row

slide-9
SLIDE 9

Fitting Alignment – Dynamic Programming

9

Fitting Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find an alignment of 𝐰 and a substring of 𝐱 with maximum global alignment score 𝑡∗ among all global alignments of 𝐰 and all substrings of 𝐱

s[i, j] = max          0, if i = 0, s[i − 1, j] + δ(vi, −), if i > 0, s[i, j − 1] + δ(−, wj), if i > 0 and j > 0, s[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0. s∗ = max{s[m, 0], . . . , s[m, n]}

Question: Let match score be 1, mismatch/indel score be -1. What is 𝑡∗? Question: Same scores. What is optimal global alignment and score? A G G T A C G G C

  • A
  • G

G

  • T

A C G G C

𝐰 𝐱

𝐰\𝐱

Start anywhere on first row End anywhere on last row

slide-10
SLIDE 10

Fitting Alignment – Dynamic Programming

  • Online:

https://valiec.github.io/AlignmentVisualizer/index.html

10

Question: Let match score be 1, mismatch/indel score be -1. What is 𝑡∗? Question: Same scores. What is optimal global alignment and score? A G G T A C G G C

  • A
  • G

G

  • T

A C G G C

𝐰 𝐱

𝐰\𝐱

s[i, j] = max          0, if i = 0, s[i − 1, j] + δ(vi, −), if i > 0, s[i, j − 1] + δ(−, wj), if i > 0 and j > 0, s[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0. s∗ = max{s[m, 0], . . . , s[m, n]}

slide-11
SLIDE 11

Outline

  • 1. Fitting alignment
  • 2. Local alignment
  • 3. Gapped alignment
  • 4. BLOSUM scoring matrix

Reading:

  • Jones and Pevzner. Chapters 6.6-6.9
  • Lecture notes

11

slide-12
SLIDE 12

Local Alignment – Biological Motivation

12

ABL1 SHKA

From Pfam database (http://pfam.sanger.ac.uk/)

Proteins are composed of functional units called domains. Such domains may occur in different proteins even across species.

Local Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find a substring of 𝐰 and a substring of 𝐱 whose alignment has maximum global alignment score 𝑡∗ among all global alignments of all substrings of 𝐰 and 𝐱

slide-13
SLIDE 13

Global, Fitting and Local Alignment

13

Local Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find a substring of 𝐰 and a substring of 𝐱 whose alignment has maximum global alignment score 𝑡∗ among all global alignments of all substrings of 𝐰 and 𝐱 Fitting Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find an alignment of 𝐰 and a substring of 𝐱 with maximum global alignment score 𝑡∗ among all global alignments of 𝐰 and all substrings of 𝐱 Global Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find alignment of 𝐰 and 𝐱 with maximum score.

slide-14
SLIDE 14

Local Alignment – Naive Algorithm

Brute force:

  • 1. Generate all pairs (𝐰4, 𝐱4) of substrings of 𝐰 and 𝐱
  • 2. For each pair (𝐰4, 𝐱4), solve global alignment problem.

14

Question: There are $

2 & 2 pairs of substrings.

But they have different lengths. What is the running time?

Local Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find a substring of 𝐰 and a substring of 𝐱 whose alignment has maximum global alignment score 𝑡∗ among all global alignments of all substrings of 𝐰 and 𝐱

slide-15
SLIDE 15

Key Idea

15

Local alignment:

  • Start and end anywhere

Global alignment:

  • Start at (0,0) and end at (𝑛, 𝑜)
slide-16
SLIDE 16

Local Alignment Recurrence

16

s[i, j] = max          0, if i = 0 and j = 0, s[i − 1, j] + δ(vi, −), if i > 0, s[i, j − 1] + δ(−, wj), if j > 0, s[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0. s∗ = max

i,j s[i, j]

Local Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find a substring

  • f 𝐰 and a substring of 𝐱 whose alignment has

maximum global alignment score 𝑡∗ among all global alignments of all substrings of 𝐰 and 𝐱

slide-17
SLIDE 17

Local Alignment Recurrence

17

s[i, j] = max          0, if i = 0 and j = 0, s[i − 1, j] + δ(vi, −), if i > 0, s[i, j − 1] + δ(−, wj), if j > 0, s[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0. s∗ = max

i,j s[i, j]

Local Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find a substring

  • f 𝐰 and a substring of 𝐱 whose alignment has

maximum global alignment score 𝑡∗ among all global alignments of all substrings of 𝐰 and 𝐱

Start anywhere End anywhere

Running time: 𝑃(𝑛𝑜)

slide-18
SLIDE 18

Local Alignment – Dynamic Programming

  • Online:

https://valiec.github.io/AlignmentVisualizer/index.html

18

Question: Let match score be 2, mismatch score be -2 and indel be -4. What is 𝑡∗? A G G T A C G G C G G G G

𝐰 𝐱

𝐰\𝐱

s[i, j] = max          0, if i = 0 and j = 0, s[i − 1, j] + δ(vi, −), if i > 0, s[i, j − 1] + δ(−, wj), if j > 0, s[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0. s∗ = max

i,j s[i, j]

slide-19
SLIDE 19

Global, Fitting and Local Alignment

19

Local Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find a substring of 𝐰 and a substring of 𝐱 whose alignment has maximum global alignment score 𝑡∗ among all global alignments of all substrings of 𝐰 and 𝐱 Fitting Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find an alignment of 𝐰 and a substring of 𝐱 with maximum global alignment score 𝑡∗ among all global alignments of 𝐰 and all substrings of 𝐱 Global Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find alignment of 𝐰 and 𝐱 with maximum score.

slide-20
SLIDE 20

Outline

  • 1. Fitting alignment
  • 2. Local alignment
  • 3. Gapped alignment
  • 4. BLOSUM scoring matrix

Reading:

  • Jones and Pevzner. Chapters 6.6-6.9
  • Lecture notes

20

slide-21
SLIDE 21

Scoring Gaps

21

Let 𝐰 = AAC and 𝐱 = ACAGGC Match 𝜀 𝑑, 𝑑 = 1; Mismatch 𝜀 𝑑, 𝑒 = −1 (where 𝑑 ≠ 𝑒); Indel 𝜀 𝑑, − = 𝜀 −, 𝑑 = −2 Both alignments have 3 matches and 2 indels. Score: 3 ∗ 1 + 2 ∗ −2 = −1

A

  • A

C A C A A C

𝐰 𝐱

A

  • A
  • C

A C A A C

𝐰 𝐱

slide-22
SLIDE 22

Scoring Gaps

22

Let 𝐰 = AAC and 𝐱 = ACAGGC Match 𝜀 𝑑, 𝑑 = 1; Mismatch 𝜀 𝑑, 𝑒 = −1 (where 𝑑 ≠ 𝑒); Indel 𝜀 𝑑, − = 𝜀 −, 𝑑 = −2 Question: Which alignment is better?

A

  • A

C A C A A C

𝐰 𝐱

A

  • A
  • C

A C A A C

𝐰 𝐱 Both alignments have 3 matches and 2 indels. Score: 3 ∗ 1 + 2 ∗ −2 = −1

slide-23
SLIDE 23

Scoring Gaps – Affine Gap Penalties

23

Desired: Lower penalty for consecutive gaps than interspersed gaps. Why: Consecutive gaps are more likely due to slippage errors in DNA replication (2-3 nucleotides), codons for protein sequences, etc.

A

  • A

C A C A A C

𝐰 𝐱

A

  • A
  • C

A C A A C

𝐰 𝐱

slide-24
SLIDE 24

Scoring Gaps – Affine Gap Penalties

24

Desired: Lower penalty for consecutive gaps than interspersed gaps. Why: Consecutive gaps are more likely due to slippage errors in DNA replication (2-3 nucleotides), codons for protein sequences, etc. Affine gap penalty: Two penalties: (i) gap open penalty 𝜍 ≥ 0 and (ii) gap extension penalty 𝜏 ≥ 0. Stretch of 𝑙 consecutive gaps has score −(𝜍 + 𝜏𝑙).

A

  • A

C A C A A C

𝐰 𝐱

A

  • A
  • C

A C A A C

𝐰 𝐱

slide-25
SLIDE 25

Scoring Gaps – Affine Gap Penalties

25

Desired: Lower penalty for consecutive gaps than interspersed gaps. Why: Consecutive gaps are more likely due to slippage errors in DNA replication (2-3 nucleotides), codons for protein sequences, etc. Affine gap penalty: Two penalties: (i) gap open penalty 𝜍 ≥ 0 and (ii) gap extension penalty 𝜏 ≥ 0. Stretch of 𝑙 consecutive gaps has score −(𝜍 + 𝜏𝑙). Let 𝜍 = 10 and 𝜏 = 1. Left: 3 ∗ 1 − 10 + 1 ∗ 2 = −9. Right: 3 ∗ 1 − (10 + 1 ∗ 1) − 10 + 1 ∗ 1 = −19.

A

  • A

C A C A A C

𝐰 𝐱

A

  • A
  • C

A C A A C

𝐰 𝐱

slide-26
SLIDE 26

Affine Gap Penalty Alignment – Naive Approach

26

Idea: Insert horizontal (deletion) and vertical (insertion) edges spanning 𝑙 > 1 gaps with score − (𝜍 + 𝜏𝑙).

new edges

  • ld edges

Affine gap penalty: Two penalties: (i) gap open penalty 𝜍 ≥ 0 and (ii) gap extension penalty 𝜏 ≥ 0. Stretch of 𝑙 consecutive gaps has score −(𝜍 + 𝜏𝑙).

... ... ... ... ... ... ... ... ...

slide-27
SLIDE 27

Affine Gap Penalty Alignment – Naive Approach

27

Idea: Insert horizontal (deletion) and vertical (insertion) edges spanning 𝑙 > 1 gaps with score − (𝜍 + 𝜏𝑙).

new edges

  • ld edges

Question: What’s the running time? Question: What’s the recurrence? Affine gap penalty: Two penalties: (i) gap open penalty 𝜍 ≥ 0 and (ii) gap extension penalty 𝜏 ≥ 0. Stretch of 𝑙 consecutive gaps has score −(𝜍 + 𝜏𝑙).

... ... ... ... ... ... ... ... ...

slide-28
SLIDE 28

28

Affine Gap Penalty Alignment

Idea: Three separate recurrences: (i) Gap in first sequence 𝑡→ 𝑗, 𝑘 (ii) Match/mismatch 𝑡↘[𝑗, 𝑘] (iii) Gap in second sequence 𝑡↓[𝑗, 𝑘]

slide-29
SLIDE 29

29

Affine Gap Penalty Alignment

Idea: Three separate recurrences: (i) Gap in first sequence 𝑡→ 𝑗, 𝑘 (ii) Match/mismatch 𝑡↘[𝑗, 𝑘] (iii) Gap in second sequence 𝑡↓[𝑗, 𝑘]

s![i, j] = max ( s![i, j − 1] − σ, if j > 1, s&[i, j − 1] − (σ + ρ), if j > 0, s&[i, j] = max 8 > > > < > > > : 0, if i = 0 and j = 0, s![i, j], if j > 0, s#[i, j], if i > 0, s&[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0, s#[i, j] = max ( s#[i − 1, j] − σ, if i > 1, s&[i − 1, j] − (σ + ρ), if i > 0.

slide-30
SLIDE 30

30

Affine Gap Penalty Alignment

Idea: Three separate recurrences: (i) Gap in first sequence 𝑡→ 𝑗, 𝑘 (ii) Match/mismatch 𝑡↘[𝑗, 𝑘] (iii) Gap in second sequence 𝑡↓[𝑗, 𝑘] Running time: 𝑃(𝑛𝑜)

s![i, j] = max ( s![i, j − 1] − σ, if j > 1, s&[i, j − 1] − (σ + ρ), if j > 0, s&[i, j] = max 8 > > > < > > > : 0, if i = 0 and j = 0, s![i, j], if j > 0, s#[i, j], if i > 0, s&[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0, s#[i, j] = max ( s#[i − 1, j] − σ, if i > 1, s&[i − 1, j] − (σ + ρ), if i > 0.

slide-31
SLIDE 31

Affine Gap Penalty Alignment – Example

31

𝐰 = AAC 𝐱 = ACAAC

Let 𝜍 = 10 and 𝜏 = 1. Match = 1. Mismatch = -1

s![i, j] = max ( s![i, j − 1] − σ, if j > 1, s&[i, j − 1] − (σ + ρ), if j > 0, s&[i, j] = max 8 > > > < > > > : 0, if i = 0 and j = 0, s![i, j], if j > 0, s#[i, j], if i > 0, s&[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0, s#[i, j] = max ( s#[i − 1, j] − σ, if i > 1, s&[i − 1, j] − (σ + ρ), if i > 0.

slide-32
SLIDE 32

Gapped Alignment – Additional Insights

  • Naive approach supports arbitrary gap penalties given two

sequences 𝐰 ∈ Σ$ and 𝐱 ∈ Σ&. This results in an 𝑃(𝑛𝑜 𝑛 + 𝑜 ) algorithm.

  • Alignment with convex gap

penalties given two sequences 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& can be computed in 𝑃(𝑛𝑜 log 𝑛) time.

See: Dan Gusfield. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, NY, USA.

32

slide-33
SLIDE 33

Outline

  • 1. Fitting alignment
  • 2. Local alignment
  • 3. Gapped alignment
  • 4. BLOSUM scoring matrix

Reading:

  • Jones and Pevzner. Chapters 6.6-6.9
  • Lecture notes

33

slide-34
SLIDE 34

Substitution Matrices

  • Given a pair (𝐰, 𝐱) of aligned sequences, we want to assign a score that

measure the relative likelihood that the sequences are related as opposed to being unrelated

  • We need two models:
  • Random model 𝑆: each letter 𝑏 ∈ Σ occurs independently with probability 𝑟U
  • Match model 𝑁: aligned pair 𝑏, 𝑐 ∈ Σ × Σ occur with joint probability 𝑞U,Z

34

Pr 𝐰, 𝐱 𝑆 = ]

^

𝑟_` ⋅ ]

^

𝑟b` Pr 𝐰, 𝐱 𝑁 = ]

^

𝑞_`,b`

log

cd 𝐰, 𝐱 𝑁 cd 𝐰, 𝐱 𝑆

= ∑^ 𝑡 𝑤^, 𝑥^ where 𝑡 𝑏, 𝑐 = log

hi,j kikj

slide-35
SLIDE 35

BLOSUM (Blocks Substitution Matrices)

  • Henikoff and Henikoff, 1992
  • Computed using ungapped alignments of protein

segments (blocks) from BLOCKS database

  • Thousands of such blocks go into computing a

single BLOSUM matrix

  • Example of a one such block (right):
  • 31 positions (columns)
  • 61 sequences (rows)
  • Given threshold 𝑀, block is pruned down to largest

set 𝐷 of sequences that have at least 𝑀% sequence identity to another sequence in 𝐷

  • How to compute 𝐷?

35

slide-36
SLIDE 36

BLOSUM (Blocks Substitution Matrices)

36

  • Null model frequencies 𝑟U𝑟Z of letters 𝑏 and 𝑐:
  • Count the number of occurrences of 𝑏 (𝑐) in all

blocks

  • Divide by sum of lengths of each block

(sequences * positions)

  • Match model frequency 𝑞U,Z:
  • Count the number of pairs (𝑏, 𝑐) in all columns
  • f all blocks
  • Divide by the total number of pairs of columns:
  • ∑n 𝑜(𝐷) $(n)

2

  • 𝑛(𝐷) is the number of sequences in block 𝐷
  • 𝑜(𝐷) is the number of positions in block 𝐷

log

cd 𝐰, 𝐱 𝑁 cd 𝐰, 𝐱 𝑆

= ∑^ 𝑡 𝑤^, 𝑥^ where 𝑡 𝑏, 𝑐 =

  • p log

hi,j kikj

slide-37
SLIDE 37

BLOSUM (Blocks Substitution Matrices)

37

log

cd 𝐰, 𝐱 𝑁 cd 𝐰, 𝐱 𝑆

= ∑^ 𝑡 𝑤^, 𝑥^ where 𝑡 𝑏, 𝑐 =

  • p log

hi,j kikj

  • Null model frequencies 𝑟U𝑟Z of letters 𝑏 and 𝑐:
  • Count the number of occurrences of 𝑏 (𝑐) in all

blocks

  • Divide by sum of lengths of each block

(sequences * positions)

  • Match model frequency 𝑞U,Z:
  • Count the number of pairs (𝑏, 𝑐) in all columns
  • f all blocks
  • Divide by the total number of pairs of columns:
  • ∑n 𝑜(𝐷) $(n)

2

  • 𝑛(𝐷) is the number of sequences in block 𝐷
  • 𝑜(𝐷) is the number of positions in block 𝐷

Example: (𝜇 = 0.5) A A T S A L T A L T A V A A L

𝑟s = 7 15 𝑟u = 3 15 𝑞s,u = 4 30 𝑡 𝐵, 𝑈 = 2 ⋅ log 4 30 7 15 ⋅ 3 15 ≈ 0.3

slide-38
SLIDE 38

BLOSUM62

38

log

cd 𝐰, 𝐱 𝑁 cd 𝐰, 𝐱 𝑆

= ∑^ 𝑡 𝑤^, 𝑥^ where 𝑡 𝑏, 𝑐 =

  • p log

hi,j kikj

https://doi.org/10.1038/nbt0804-1035

slide-39
SLIDE 39

Take Home Messages

  • 1. Edit distance
  • 2. Global alignment
  • 3. Fitting alignment
  • 4. Local alignment
  • 5. Gapped alignment
  • 6. BLOSUM substitution matrix

Reading:

  • Jones and Pevzner. Chapters 6.6-6.9
  • Lecture notes

39

Global alignment is longest path in DAG Small tweaks enable different extensions Edit distance is shortest path in DAG