CS CS 466 466 In Introduct ctio ion t to B Bio ioin - - PowerPoint PPT Presentation
CS CS 466 466 In Introduct ctio ion t to B Bio ioin - - PowerPoint PPT Presentation
CS CS 466 466 In Introduct ctio ion t to B Bio ioin informatics ics Lecture 5 Mohammed El-Kebir February 4, 2020 Outline 1. Fitting alignment 2. Local alignment 3. Gapped alignment 4. BLOSUM scoring matrix Reading: Jones and
Outline
- 1. Fitting alignment
- 2. Local alignment
- 3. Gapped alignment
- 4. BLOSUM scoring matrix
Reading:
- Jones and Pevzner. Chapters 6.6-6.9
- Lecture notes
2
Allow for inexact matches due to:
- Sequencing errors
- Polymorphisms/mutations in
reference genome
3
NGS Characterized by Short Reads
Genome Millions -billions nucleotides Next-generation DNA sequencing 10-100’s million short reads Short read: 100 nucleotides
… GGTAGTTAG … … TATAATTAG … … AGCCATTAG … … CGTACCTAG … … CATTCAGTAG … … GGTAAACTAG …
Question: How to account for discrepancy between lengths of reference and short read? Human reference genome is 3,300,000,000 nucleotides, while a short read is 100 nucleotides. Global sequence alignment will not work!
Fitting Alignment
4
For short read alignment, we want to align complete short read 𝐰 ∈ Σ$ to substring of reference genome 𝐱 ∈ Σ&. Note that 𝑛 ≪ 𝑜. Fitting Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find a alignment of 𝐰 and a substring of 𝐱 with maximum global alignment score 𝑡∗ among all global alignments of 𝐰 and all substrings of 𝐱
𝐰 ∈ Σ$ 𝐱 ∈ Σ&
Fitting Alignment – Naive Approach
- Consider all contiguous non-empty substrings of 𝐱, defined by 1 ≤ 𝑗 ≤ 𝑘 ≤ 𝑜
- How many?
5
Fitting Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find an alignment of 𝐰 and a substring of 𝐱 with maximum global alignment score 𝑡∗ among all global alignments of 𝐰 and all substrings of 𝐱
𝐰 ∈ Σ$ 𝐱 ∈ Σ&
Fitting Alignment – Naive Approach
- Consider all contiguous non-empty substrings of 𝐱, defined by 1 ≤ 𝑗 ≤ 𝑘 ≤ 𝑜
- How many? Answer: 𝑜 +
& 2
- What are their total lengths?
- What is the running time?
6
Fitting Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find an alignment of 𝐰 and a substring of 𝐱 with maximum global alignment score 𝑡∗ among all global alignments of 𝐰 and all substrings of 𝐱
𝐰 ∈ Σ$ 𝐱 ∈ Σ&
Fitting Alignment – Dynamic Programming
7
Fitting Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find an alignment of 𝐰 and a substring of 𝐱 with maximum global alignment score 𝑡∗ among all global alignments of 𝐰 and all substrings of 𝐱
s[i, j] = max 0, if i = 0, s[i − 1, j] + δ(vi, −), if i > 0, s[i, j − 1] + δ(−, wj), if i > 0 and j > 0, s[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0. s∗ = max{s[m, 0], . . . , s[m, n]}
A G G T A C G G C
𝐰\𝐱
Fitting Alignment – Dynamic Programming
8
Fitting Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find an alignment of 𝐰 and a substring of 𝐱 with maximum global alignment score 𝑡∗ among all global alignments of 𝐰 and all substrings of 𝐱
s[i, j] = max 0, if i = 0, s[i − 1, j] + δ(vi, −), if i > 0, s[i, j − 1] + δ(−, wj), if i > 0 and j > 0, s[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0. s∗ = max{s[m, 0], . . . , s[m, n]}
A G G T A C G G C
𝐰\𝐱
Start anywhere on first row End anywhere on last row
Fitting Alignment – Dynamic Programming
9
Fitting Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find an alignment of 𝐰 and a substring of 𝐱 with maximum global alignment score 𝑡∗ among all global alignments of 𝐰 and all substrings of 𝐱
s[i, j] = max 0, if i = 0, s[i − 1, j] + δ(vi, −), if i > 0, s[i, j − 1] + δ(−, wj), if i > 0 and j > 0, s[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0. s∗ = max{s[m, 0], . . . , s[m, n]}
Question: Let match score be 1, mismatch/indel score be -1. What is 𝑡∗? Question: Same scores. What is optimal global alignment and score? A G G T A C G G C
- A
- G
G
- T
A C G G C
𝐰 𝐱
𝐰\𝐱
Start anywhere on first row End anywhere on last row
Fitting Alignment – Dynamic Programming
- Online:
https://valiec.github.io/AlignmentVisualizer/index.html
10
Question: Let match score be 1, mismatch/indel score be -1. What is 𝑡∗? Question: Same scores. What is optimal global alignment and score? A G G T A C G G C
- A
- G
G
- T
A C G G C
𝐰 𝐱
𝐰\𝐱
s[i, j] = max 0, if i = 0, s[i − 1, j] + δ(vi, −), if i > 0, s[i, j − 1] + δ(−, wj), if i > 0 and j > 0, s[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0. s∗ = max{s[m, 0], . . . , s[m, n]}
Outline
- 1. Fitting alignment
- 2. Local alignment
- 3. Gapped alignment
- 4. BLOSUM scoring matrix
Reading:
- Jones and Pevzner. Chapters 6.6-6.9
- Lecture notes
11
Local Alignment – Biological Motivation
12
ABL1 SHKA
From Pfam database (http://pfam.sanger.ac.uk/)
Proteins are composed of functional units called domains. Such domains may occur in different proteins even across species.
Local Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find a substring of 𝐰 and a substring of 𝐱 whose alignment has maximum global alignment score 𝑡∗ among all global alignments of all substrings of 𝐰 and 𝐱
Global, Fitting and Local Alignment
13
Local Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find a substring of 𝐰 and a substring of 𝐱 whose alignment has maximum global alignment score 𝑡∗ among all global alignments of all substrings of 𝐰 and 𝐱 Fitting Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find an alignment of 𝐰 and a substring of 𝐱 with maximum global alignment score 𝑡∗ among all global alignments of 𝐰 and all substrings of 𝐱 Global Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find alignment of 𝐰 and 𝐱 with maximum score.
Local Alignment – Naive Algorithm
Brute force:
- 1. Generate all pairs (𝐰4, 𝐱4) of substrings of 𝐰 and 𝐱
- 2. For each pair (𝐰4, 𝐱4), solve global alignment problem.
14
Question: There are $
2 & 2 pairs of substrings.
But they have different lengths. What is the running time?
Local Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find a substring of 𝐰 and a substring of 𝐱 whose alignment has maximum global alignment score 𝑡∗ among all global alignments of all substrings of 𝐰 and 𝐱
Key Idea
15
Local alignment:
- Start and end anywhere
Global alignment:
- Start at (0,0) and end at (𝑛, 𝑜)
Local Alignment Recurrence
16
s[i, j] = max 0, if i = 0 and j = 0, s[i − 1, j] + δ(vi, −), if i > 0, s[i, j − 1] + δ(−, wj), if j > 0, s[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0. s∗ = max
i,j s[i, j]
Local Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find a substring
- f 𝐰 and a substring of 𝐱 whose alignment has
maximum global alignment score 𝑡∗ among all global alignments of all substrings of 𝐰 and 𝐱
Local Alignment Recurrence
17
s[i, j] = max 0, if i = 0 and j = 0, s[i − 1, j] + δ(vi, −), if i > 0, s[i, j − 1] + δ(−, wj), if j > 0, s[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0. s∗ = max
i,j s[i, j]
Local Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find a substring
- f 𝐰 and a substring of 𝐱 whose alignment has
maximum global alignment score 𝑡∗ among all global alignments of all substrings of 𝐰 and 𝐱
Start anywhere End anywhere
Running time: 𝑃(𝑛𝑜)
Local Alignment – Dynamic Programming
- Online:
https://valiec.github.io/AlignmentVisualizer/index.html
18
Question: Let match score be 2, mismatch score be -2 and indel be -4. What is 𝑡∗? A G G T A C G G C G G G G
𝐰 𝐱
𝐰\𝐱
s[i, j] = max 0, if i = 0 and j = 0, s[i − 1, j] + δ(vi, −), if i > 0, s[i, j − 1] + δ(−, wj), if j > 0, s[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0. s∗ = max
i,j s[i, j]
Global, Fitting and Local Alignment
19
Local Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find a substring of 𝐰 and a substring of 𝐱 whose alignment has maximum global alignment score 𝑡∗ among all global alignments of all substrings of 𝐰 and 𝐱 Fitting Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find an alignment of 𝐰 and a substring of 𝐱 with maximum global alignment score 𝑡∗ among all global alignments of 𝐰 and all substrings of 𝐱 Global Alignment problem: Given strings 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& and scoring function 𝜀, find alignment of 𝐰 and 𝐱 with maximum score.
Outline
- 1. Fitting alignment
- 2. Local alignment
- 3. Gapped alignment
- 4. BLOSUM scoring matrix
Reading:
- Jones and Pevzner. Chapters 6.6-6.9
- Lecture notes
20
Scoring Gaps
21
Let 𝐰 = AAC and 𝐱 = ACAGGC Match 𝜀 𝑑, 𝑑 = 1; Mismatch 𝜀 𝑑, 𝑒 = −1 (where 𝑑 ≠ 𝑒); Indel 𝜀 𝑑, − = 𝜀 −, 𝑑 = −2 Both alignments have 3 matches and 2 indels. Score: 3 ∗ 1 + 2 ∗ −2 = −1
A
- A
C A C A A C
𝐰 𝐱
A
- A
- C
A C A A C
𝐰 𝐱
Scoring Gaps
22
Let 𝐰 = AAC and 𝐱 = ACAGGC Match 𝜀 𝑑, 𝑑 = 1; Mismatch 𝜀 𝑑, 𝑒 = −1 (where 𝑑 ≠ 𝑒); Indel 𝜀 𝑑, − = 𝜀 −, 𝑑 = −2 Question: Which alignment is better?
A
- A
C A C A A C
𝐰 𝐱
A
- A
- C
A C A A C
𝐰 𝐱 Both alignments have 3 matches and 2 indels. Score: 3 ∗ 1 + 2 ∗ −2 = −1
Scoring Gaps – Affine Gap Penalties
23
Desired: Lower penalty for consecutive gaps than interspersed gaps. Why: Consecutive gaps are more likely due to slippage errors in DNA replication (2-3 nucleotides), codons for protein sequences, etc.
A
- A
C A C A A C
𝐰 𝐱
A
- A
- C
A C A A C
𝐰 𝐱
Scoring Gaps – Affine Gap Penalties
24
Desired: Lower penalty for consecutive gaps than interspersed gaps. Why: Consecutive gaps are more likely due to slippage errors in DNA replication (2-3 nucleotides), codons for protein sequences, etc. Affine gap penalty: Two penalties: (i) gap open penalty 𝜍 ≥ 0 and (ii) gap extension penalty 𝜏 ≥ 0. Stretch of 𝑙 consecutive gaps has score −(𝜍 + 𝜏𝑙).
A
- A
C A C A A C
𝐰 𝐱
A
- A
- C
A C A A C
𝐰 𝐱
Scoring Gaps – Affine Gap Penalties
25
Desired: Lower penalty for consecutive gaps than interspersed gaps. Why: Consecutive gaps are more likely due to slippage errors in DNA replication (2-3 nucleotides), codons for protein sequences, etc. Affine gap penalty: Two penalties: (i) gap open penalty 𝜍 ≥ 0 and (ii) gap extension penalty 𝜏 ≥ 0. Stretch of 𝑙 consecutive gaps has score −(𝜍 + 𝜏𝑙). Let 𝜍 = 10 and 𝜏 = 1. Left: 3 ∗ 1 − 10 + 1 ∗ 2 = −9. Right: 3 ∗ 1 − (10 + 1 ∗ 1) − 10 + 1 ∗ 1 = −19.
A
- A
C A C A A C
𝐰 𝐱
A
- A
- C
A C A A C
𝐰 𝐱
Affine Gap Penalty Alignment – Naive Approach
26
Idea: Insert horizontal (deletion) and vertical (insertion) edges spanning 𝑙 > 1 gaps with score − (𝜍 + 𝜏𝑙).
new edges
- ld edges
Affine gap penalty: Two penalties: (i) gap open penalty 𝜍 ≥ 0 and (ii) gap extension penalty 𝜏 ≥ 0. Stretch of 𝑙 consecutive gaps has score −(𝜍 + 𝜏𝑙).
... ... ... ... ... ... ... ... ...
Affine Gap Penalty Alignment – Naive Approach
27
Idea: Insert horizontal (deletion) and vertical (insertion) edges spanning 𝑙 > 1 gaps with score − (𝜍 + 𝜏𝑙).
new edges
- ld edges
Question: What’s the running time? Question: What’s the recurrence? Affine gap penalty: Two penalties: (i) gap open penalty 𝜍 ≥ 0 and (ii) gap extension penalty 𝜏 ≥ 0. Stretch of 𝑙 consecutive gaps has score −(𝜍 + 𝜏𝑙).
... ... ... ... ... ... ... ... ...
28
Affine Gap Penalty Alignment
Idea: Three separate recurrences: (i) Gap in first sequence 𝑡→ 𝑗, 𝑘 (ii) Match/mismatch 𝑡↘[𝑗, 𝑘] (iii) Gap in second sequence 𝑡↓[𝑗, 𝑘]
29
Affine Gap Penalty Alignment
Idea: Three separate recurrences: (i) Gap in first sequence 𝑡→ 𝑗, 𝑘 (ii) Match/mismatch 𝑡↘[𝑗, 𝑘] (iii) Gap in second sequence 𝑡↓[𝑗, 𝑘]
s![i, j] = max ( s![i, j − 1] − σ, if j > 1, s&[i, j − 1] − (σ + ρ), if j > 0, s&[i, j] = max 8 > > > < > > > : 0, if i = 0 and j = 0, s![i, j], if j > 0, s#[i, j], if i > 0, s&[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0, s#[i, j] = max ( s#[i − 1, j] − σ, if i > 1, s&[i − 1, j] − (σ + ρ), if i > 0.
30
Affine Gap Penalty Alignment
Idea: Three separate recurrences: (i) Gap in first sequence 𝑡→ 𝑗, 𝑘 (ii) Match/mismatch 𝑡↘[𝑗, 𝑘] (iii) Gap in second sequence 𝑡↓[𝑗, 𝑘] Running time: 𝑃(𝑛𝑜)
s![i, j] = max ( s![i, j − 1] − σ, if j > 1, s&[i, j − 1] − (σ + ρ), if j > 0, s&[i, j] = max 8 > > > < > > > : 0, if i = 0 and j = 0, s![i, j], if j > 0, s#[i, j], if i > 0, s&[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0, s#[i, j] = max ( s#[i − 1, j] − σ, if i > 1, s&[i − 1, j] − (σ + ρ), if i > 0.
Affine Gap Penalty Alignment – Example
31
𝐰 = AAC 𝐱 = ACAAC
Let 𝜍 = 10 and 𝜏 = 1. Match = 1. Mismatch = -1
s![i, j] = max ( s![i, j − 1] − σ, if j > 1, s&[i, j − 1] − (σ + ρ), if j > 0, s&[i, j] = max 8 > > > < > > > : 0, if i = 0 and j = 0, s![i, j], if j > 0, s#[i, j], if i > 0, s&[i − 1, j − 1] + δ(vi, wj), if i > 0 and j > 0, s#[i, j] = max ( s#[i − 1, j] − σ, if i > 1, s&[i − 1, j] − (σ + ρ), if i > 0.
Gapped Alignment – Additional Insights
- Naive approach supports arbitrary gap penalties given two
sequences 𝐰 ∈ Σ$ and 𝐱 ∈ Σ&. This results in an 𝑃(𝑛𝑜 𝑛 + 𝑜 ) algorithm.
- Alignment with convex gap
penalties given two sequences 𝐰 ∈ Σ$ and 𝐱 ∈ Σ& can be computed in 𝑃(𝑛𝑜 log 𝑛) time.
See: Dan Gusfield. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, NY, USA.
32
Outline
- 1. Fitting alignment
- 2. Local alignment
- 3. Gapped alignment
- 4. BLOSUM scoring matrix
Reading:
- Jones and Pevzner. Chapters 6.6-6.9
- Lecture notes
33
Substitution Matrices
- Given a pair (𝐰, 𝐱) of aligned sequences, we want to assign a score that
measure the relative likelihood that the sequences are related as opposed to being unrelated
- We need two models:
- Random model 𝑆: each letter 𝑏 ∈ Σ occurs independently with probability 𝑟U
- Match model 𝑁: aligned pair 𝑏, 𝑐 ∈ Σ × Σ occur with joint probability 𝑞U,Z
34
Pr 𝐰, 𝐱 𝑆 = ]
^
𝑟_` ⋅ ]
^
𝑟b` Pr 𝐰, 𝐱 𝑁 = ]
^
𝑞_`,b`
log
cd 𝐰, 𝐱 𝑁 cd 𝐰, 𝐱 𝑆
= ∑^ 𝑡 𝑤^, 𝑥^ where 𝑡 𝑏, 𝑐 = log
hi,j kikj
BLOSUM (Blocks Substitution Matrices)
- Henikoff and Henikoff, 1992
- Computed using ungapped alignments of protein
segments (blocks) from BLOCKS database
- Thousands of such blocks go into computing a
single BLOSUM matrix
- Example of a one such block (right):
- 31 positions (columns)
- 61 sequences (rows)
- Given threshold 𝑀, block is pruned down to largest
set 𝐷 of sequences that have at least 𝑀% sequence identity to another sequence in 𝐷
- How to compute 𝐷?
35
BLOSUM (Blocks Substitution Matrices)
36
- Null model frequencies 𝑟U𝑟Z of letters 𝑏 and 𝑐:
- Count the number of occurrences of 𝑏 (𝑐) in all
blocks
- Divide by sum of lengths of each block
(sequences * positions)
- Match model frequency 𝑞U,Z:
- Count the number of pairs (𝑏, 𝑐) in all columns
- f all blocks
- Divide by the total number of pairs of columns:
- ∑n 𝑜(𝐷) $(n)
2
- 𝑛(𝐷) is the number of sequences in block 𝐷
- 𝑜(𝐷) is the number of positions in block 𝐷
log
cd 𝐰, 𝐱 𝑁 cd 𝐰, 𝐱 𝑆
= ∑^ 𝑡 𝑤^, 𝑥^ where 𝑡 𝑏, 𝑐 =
- p log
hi,j kikj
BLOSUM (Blocks Substitution Matrices)
37
log
cd 𝐰, 𝐱 𝑁 cd 𝐰, 𝐱 𝑆
= ∑^ 𝑡 𝑤^, 𝑥^ where 𝑡 𝑏, 𝑐 =
- p log
hi,j kikj
- Null model frequencies 𝑟U𝑟Z of letters 𝑏 and 𝑐:
- Count the number of occurrences of 𝑏 (𝑐) in all
blocks
- Divide by sum of lengths of each block
(sequences * positions)
- Match model frequency 𝑞U,Z:
- Count the number of pairs (𝑏, 𝑐) in all columns
- f all blocks
- Divide by the total number of pairs of columns:
- ∑n 𝑜(𝐷) $(n)
2
- 𝑛(𝐷) is the number of sequences in block 𝐷
- 𝑜(𝐷) is the number of positions in block 𝐷
Example: (𝜇 = 0.5) A A T S A L T A L T A V A A L
𝑟s = 7 15 𝑟u = 3 15 𝑞s,u = 4 30 𝑡 𝐵, 𝑈 = 2 ⋅ log 4 30 7 15 ⋅ 3 15 ≈ 0.3
BLOSUM62
38
log
cd 𝐰, 𝐱 𝑁 cd 𝐰, 𝐱 𝑆
= ∑^ 𝑡 𝑤^, 𝑥^ where 𝑡 𝑏, 𝑐 =
- p log
hi,j kikj
https://doi.org/10.1038/nbt0804-1035
Take Home Messages
- 1. Edit distance
- 2. Global alignment
- 3. Fitting alignment
- 4. Local alignment
- 5. Gapped alignment
- 6. BLOSUM substitution matrix
Reading:
- Jones and Pevzner. Chapters 6.6-6.9
- Lecture notes
39