CS 466: Introduction to Bioinformatics
Lecture 6
Mohammed El-Kebir
February 6, 2020
Course Announcements
Instructor:
- Mohammed El-Kebir (melkebir)
- Office hours: Wednesdays, 3:15-4:15pm
TA:
- Aswhin Ramesh (aramesh7)
- Office hours: Fridays, 11:00-11:59am in SC 3405
Homework 1: Due on Sept. 18 (11:59pm)
Midterm: 10/4, 11-1pm @ Transportation Building 103 (conflict: 10/7, 7-9pm @ Siebel 1302; to sign up, email me)
Global, Fitting and Local Alignment
Global Alignment problem: Given strings $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$ and scoring function $\delta$, find an alignment of $\mathbf{v}$ and $\mathbf{w}$ with maximum score. [Needleman-Wunsch algorithm]
Fitting Alignment problem: Given strings $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$ and scoring function $\delta$, find an alignment of $\mathbf{v}$ and a substring of $\mathbf{w}$ with maximum global alignment score $s^*$ among all global alignments of $\mathbf{v}$ and all substrings of $\mathbf{w}$.
Local Alignment problem: Given strings $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$ and scoring function $\delta$, find a substring of $\mathbf{v}$ and a substring of $\mathbf{w}$ whose alignment has maximum global alignment score $s^*$ among all global alignments of all substrings of $\mathbf{v}$ and $\mathbf{w}$. [Smith-Waterman algorithm]
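The three problems differ only in where an alignment may start and end in the table. The sketch below computes just the optimal score for each of them with a full quadratic-space table. The function name `alignment_score`, the `mode` flag, and the match/mismatch/gap values standing in for the scoring function are assumptions made for this illustration, not part of the lecture.

```python
def alignment_score(v, w, mode="global", match=1, mismatch=-1, gap=-1):
    """Optimal score for global, fitting (all of v vs. a substring of w) or local alignment."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if mode == "local":
                cands.append(0)            # a local alignment may start anywhere
            if mode == "fitting" and i == 0:
                cands.append(0)            # skipping a prefix of w is free
            if i > 0:
                cands.append(s[i - 1][j] + gap)
            if j > 0:
                cands.append(s[i][j - 1] + gap)
            if i > 0 and j > 0:
                diag = match if v[i - 1] == w[j - 1] else mismatch
                cands.append(s[i - 1][j - 1] + diag)
            s[i][j] = max(cands)
    if mode == "global":
        return s[m][n]                     # must end in the target corner
    if mode == "fitting":
        return max(s[m])                   # skipping a suffix of w is free
    return max(max(row) for row in s)      # a local alignment may end anywhere


print(alignment_score("ATG", "CCATGCC", "global"))
print(alignment_score("ATG", "CCATGCC", "fitting"))
print(alignment_score("ATG", "CCATGCC", "local"))
```

Only the initialization and the cell from which the answer is read change between the three modes; the inner recurrence is shared.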
Question: How to assess resulting algorithms?
Time Complexity
[Figure: edit graph, a grid of vertices for m = 4 and n = 4]
Alignment is a path from source $(0, 0)$ to target $(m, n)$ in the edit graph.
The edit graph is a weighted, directed grid graph $G = (V, E)$ with source vertex $(0, 0)$ and target vertex $(m, n)$. Each edge $((i, j), (k, l))$ has a weight depending on its direction.
Running time is $O(mn)$ [quadratic time]
Question: Compute alignment faster than $O(mn)$ time? [subquadratic time]
Space Complexity
[Figure: DP table for m = 4 and n = 4]
Size of DP table is $(m + 1) \times (n + 1)$. Thus, space complexity is $O(mn)$ [quadratic space]
Example: To align a short read ($m = 100$) to the human genome ($n = 3 \cdot 10^9$), the table has about $3 \cdot 10^{11}$ entries, so we need 300 GB of memory at one byte per entry.
Question: How long is an alignment?
Question: Compute alignment in $O(m)$ space? [linear space]
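The memory estimate in the example can be checked with a few lines of arithmetic. The helper name `dp_table_bytes` and the one-byte-per-entry default are assumptions of this sketch; a real implementation would likely need wider entries and therefore even more memory.

```python
def dp_table_bytes(m, n, bytes_per_entry=1):
    """Bytes needed to store the full (m + 1) x (n + 1) DP table."""
    return (m + 1) * (n + 1) * bytes_per_entry


read_length = 100            # m: short read
genome_length = 3 * 10**9    # n: human genome
gigabytes = dp_table_bytes(read_length, genome_length) / 10**9
print(f"full DP table: about {gigabytes:.0f} GB")
```

The exact count includes the extra row and column of the table, so it comes out slightly above the rounded figure of $m \cdot n$ entries.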
Outline
- 1. Recap of global, fitting, local and gapped alignment
- 2. Space-efficient alignment
- 3. Subquadratic time alignment
Reading:
- Jones and Pevzner. Chapters 7.1-7.4
- Lecture notes
Space Efficient Alignment
\[
s[i, j] = \max
\begin{cases}
0, & \text{if } i = 0 \text{ and } j = 0,\\
s[i-1, j] + \delta(v_i, -), & \text{if } i > 0,\\
s[i, j-1] + \delta(-, w_j), & \text{if } j > 0,\\
s[i-1, j-1] + \delta(v_i, w_j), & \text{if } i > 0 \text{ and } j > 0.
\end{cases}
\]
Computing $s[i, j]$ requires access to $s[i-1, j]$, $s[i, j-1]$ and $s[i-1, j-1]$.
Thus it suffices to store only two columns to compute the optimal alignment score $s[m, n]$, i.e., $2(m + 1) = O(m)$ space.
Question: What if we want the alignment itself?
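The two-column idea can be sketched as follows. The function name `linear_space_score` and the simple match/mismatch/gap scores used in place of a general scoring function are assumptions of this example.

```python
def linear_space_score(v, w, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score s[m, n] using two columns of length m + 1."""
    m = len(v)
    prev = [i * gap for i in range(m + 1)]       # column j = 0
    for j in range(1, len(w) + 1):
        cur = [j * gap] + [0] * m                # cur[0] is s[0, j]
        for i in range(1, m + 1):
            diag = match if v[i - 1] == w[j - 1] else mismatch
            cur[i] = max(cur[i - 1] + gap,       # s[i-1, j] + delta(v_i, -)
                         prev[i] + gap,          # s[i, j-1] + delta(-, w_j)
                         prev[i - 1] + diag)     # s[i-1, j-1] + delta(v_i, w_j)
        prev = cur                               # the older column is discarded
    return prev[m]


print(linear_space_score("ATGT", "ATCG"))
```

Only the score survives; because older columns are thrown away, there is nothing left to trace back through, which is what the question about recovering the alignment is getting at.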
Space Efficient Alignment: First Attempt
- What if we also want the optimal alignment?
- Easy: keep the best pointers as we fill in the table.
- No! We do not know which path to keep until we have computed the recurrence at each step.
Best score for a column might not be part of the best alignment!
Space Efficient Alignment: Second Attempt
[Figure: edit graph with the middle column $n/2$ and the vertex $(i^*, n/2)$ marked]
Alignment is a path from source $(0, 0)$ to target $(m, n)$ in the edit graph.
Maximum weight path from $(0, 0)$ to $(m, n)$ passes through $(i^*, n/2)$.
Question: What is $i^*$?
Hirschberg($i, j, i', j'$)
1. if $j' - j > 1$
2.     $i^* \leftarrow \arg\max_{i \le i'' \le i'} \mathrm{wt}(i'')$
3.     Report $(i^*, j + \frac{j' - j}{2})$
4.     Hirschberg($i, j, i^*, j + \frac{j' - j}{2}$)
5.     Hirschberg($i^*, j + \frac{j' - j}{2}, i', j'$)
Here $\mathrm{wt}(i'')$ is the weight of a maximum weight path from $(i, j)$ to $(i', j')$ that passes through $(i'', j + \frac{j' - j}{2})$.
Time: area + area/2 + area/4 + … = area × (1 + 1/2 + 1/4 + 1/8 + …) ≤ 2 × area = $O(mn)$
Space: $O(m)$
Question: How to reconstruct alignment from reported vertices?
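The recursion above can be sketched in Python as follows. This is an illustration under assumptions rather than the lecture's reference implementation: the helper names `column_scores` and `reported_vertices`, the match/mismatch/gap scores, and the use of reversed substrings to obtain the second half of each path weight are all choices made for this sketch. String slices are copied for readability, so the working memory is linear in the input size rather than exactly two columns.

```python
def column_scores(v, w, match=1, mismatch=-1, gap=-1):
    """Scores of aligning every prefix of v with all of w, keeping two columns."""
    prev = [i * gap for i in range(len(v) + 1)]
    for j in range(1, len(w) + 1):
        cur = [j * gap] + [0] * len(v)
        for i in range(1, len(v) + 1):
            diag = match if v[i - 1] == w[j - 1] else mismatch
            cur[i] = max(cur[i - 1] + gap, prev[i] + gap, prev[i - 1] + diag)
        prev = cur
    return prev


def hirschberg(v, w, i, j, i2, j2, report):
    """Report one vertex per interior column between (i, j) and (i2, j2)."""
    if j2 - j > 1:
        mid = j + (j2 - j) // 2
        sub = v[i:i2]
        # Weight of the best path from (i, j) to each row of the middle column.
        to_mid = column_scores(sub, w[j:mid])
        # Weight from each row of the middle column to (i2, j2), via reversal.
        from_mid = column_scores(sub[::-1], w[mid:j2][::-1])[::-1]
        k = max(range(len(sub) + 1), key=lambda r: to_mid[r] + from_mid[r])
        report.append((i + k, mid))
        hirschberg(v, w, i, j, i + k, mid, report)
        hirschberg(v, w, i + k, mid, i2, j2, report)


def reported_vertices(v, w):
    """All reported (row, column) vertices, sorted by column."""
    report = []
    hirschberg(v, w, 0, 0, len(v), len(w), report)
    return sorted(report, key=lambda vertex: vertex[1])


print(reported_vertices("ATGT", "ATCG"))
```

Each call reports exactly one vertex in the middle column of its sub-rectangle, so the output contains one vertex for every interior column; the endpoints $(0, 0)$ and $(m, n)$ are known in advance.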
Hirschberg Algorithm: Reversing Edges Necessary?
[Figure: edit graph with the middle column and the vertex $(i^*, n/2)$ marked]
Max weight path from $(0, 0)$ to $(m, n)$ through $(i^*, n/2)$:
\[
i^* = \arg\max_{0 \le i \le m} \{ \mathrm{prefix}(i) + \mathrm{suffix}(i) \}
\]
Compute $\{\mathrm{prefix}(i) \mid 0 \le i \le m\}$ in $O(mn)$ time and $O(m)$ space, by sweeping from $(0, 0)$ to $(m, n/2)$ while keeping only two columns in memory. [single-source multiple destinations]
Want: Compute $\{\mathrm{suffix}(i) \mid 0 \le i \le m\}$ in $O(mn)$ time and $O(m)$ space.
Doing a longest path computation from each $(i, n/2)$ to $(m, n)$ (for all $0 \le i \le m$) will not achieve the desired running time!
Reversing edges enables a single-source multiple-destination computation within the desired time and space bounds!
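To make the role of edge reversal concrete, the sketch below computes the suffix weights in two ways: naively, with one separate DP per start row, and with a single DP on the reversed strings, which corresponds to reversing all edges of the edit graph. The function names and the match/mismatch/gap scores are assumptions of this example.

```python
def column_scores(v, w, match=1, mismatch=-1, gap=-1):
    """Scores of aligning every prefix of v with all of w, keeping two columns."""
    prev = [i * gap for i in range(len(v) + 1)]
    for j in range(1, len(w) + 1):
        cur = [j * gap] + [0] * len(v)
        for i in range(1, len(v) + 1):
            diag = match if v[i - 1] == w[j - 1] else mismatch
            cur[i] = max(cur[i - 1] + gap, prev[i] + gap, prev[i - 1] + diag)
        prev = cur
    return prev


def prefix_weights(v, w, mid):
    """prefix(i): best path weight from (0, 0) to (i, mid), for all i."""
    return column_scores(v, w[:mid])


def suffix_weights_naive(v, w, mid):
    """suffix(i) with one full DP per start row: too slow, shown for comparison."""
    return [column_scores(v[i:], w[mid:])[-1] for i in range(len(v) + 1)]


def suffix_weights_reversed(v, w, mid):
    """suffix(i) from a single DP on the reversed strings (reversed edges)."""
    return column_scores(v[::-1], w[mid:][::-1])[::-1]


v, w = "ATGT", "ATCG"
mid = len(w) // 2
prefix = prefix_weights(v, w, mid)
suffix = suffix_weights_reversed(v, w, mid)
i_star = max(range(len(v) + 1), key=lambda i: prefix[i] + suffix[i])
print(i_star, prefix[i_star] + suffix[i_star])
```

Both suffix functions return the same list, but the naive version repeats the DP for each of the $m + 1$ start rows, while the reversed version runs it once.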
Hirschberg Algorithm: Reconstructing Alignment
Problem: Given reported vertices and scores $\{(i_0, 0, s_0), \ldots, (i_n, n, s_n)\}$, find intermediary vertices.
[Figure: worked DP table example with the reported vertices]
Transposing the matrix does not help, because gaps could occur in both input sequences.
Linear Space Alignment: The Hirschberg Algorithm
Banded Alignment
[Figure: DP matrix over sequences $x_1, \ldots, x_M$ and $y_1, \ldots, y_N$; traceback is constrained to a band of the DP matrix (penalize big gaps)]
Figure source: http://jinome.stanford.edu/stat366/pdfs/stat366_win0607_lecture04.pdf
Constrain the path to a band of width $k$ around the diagonal.
Running time: $O(nk)$
Gives a good approximation for highly similar sequences.
Question: How to change recurrence to accomplish this?
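One possible way to restrict the recurrence to a band is to evaluate only cells with $|i - j| \le k$ and to treat every other cell as unreachable. The sketch below does this with a dictionary of in-band cells; the function name `banded_score`, the convention that the band reaches $k$ cells to either side of the diagonal, and the match/mismatch/gap scores are assumptions of this example.

```python
def banded_score(v, w, k, match=1, mismatch=-1, gap=-1):
    """Best global alignment score among paths that stay within |i - j| <= k."""
    m, n = len(v), len(w)
    if abs(m - n) > k:
        return None                        # target (m, n) lies outside the band
    s = {(0, 0): 0}                        # only in-band cells are ever stored
    for i in range(m + 1):
        for j in range(max(0, i - k), min(n, i + k) + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if (i - 1, j) in s:            # predecessor above, if inside the band
                cands.append(s[i - 1, j] + gap)
            if (i, j - 1) in s:            # predecessor to the left, if inside the band
                cands.append(s[i, j - 1] + gap)
            if (i - 1, j - 1) in s:        # diagonal predecessor stays on the same diagonal
                diag = match if v[i - 1] == w[j - 1] else mismatch
                cands.append(s[i - 1, j - 1] + diag)
            s[i, j] = max(cands)
    return s[m, n]


for k in (0, 1, 4):
    print(k, banded_score("ATGT", "ATCG", k))
```

Each row touches at most $2k + 1$ cells, so the work is proportional to the sequence length times the band width. A narrow band can miss the true optimum, which is why the result is only an approximation unless the sequences are very similar.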
Block Alignment
Divide input sequences into blocks of length $t$:
$v_1, \ldots, v_t \mid v_{t+1}, \ldots, v_{2t} \mid \ldots \mid v_{m-t+1}, \ldots, v_m$
$w_1, \ldots, w_t \mid w_{t+1}, \ldots, w_{2t} \mid \ldots \mid w_{n-t+1}, \ldots, w_n$
Require that paths in the edit graph pass through corners of blocks.
\[
s[i, j] = \max
\begin{cases}
0, & \text{if } i = 0 \text{ and } j = 0,\\
s[i-1, j] - \sigma, & \text{if } i > 0,\\
s[i, j-1] - \sigma, & \text{if } j > 0,\\
s[i-1, j-1] + \beta(i, j), & \text{if } i > 0 \text{ and } j > 0.
\end{cases}
\]
Here $i$ and $j$ index blocks, $0 \le i, j \le n/t$ (taking $m = n$), and $\beta(i, j)$ is the max score of an alignment between block $i$ of $\mathbf{v}$ and block $j$ of $\mathbf{w}$.
Block Alignment, First Attempt: Pre-compute $\beta(i, j)$
In the block alignment recurrence, $0 \le i, j \le n/t$ and $\beta(i, j)$ is the max score of an alignment between block $i$ of $\mathbf{v}$ and block $j$ of $\mathbf{w}$.
Question: How much time to compute all $\beta(i, j)$?
- Computing one $\beta(i, j)$ takes $O(t^2)$ time.
- There are $\frac{n}{t} \times \frac{n}{t}$ values $\beta(i, j)$.
- Total: $O\left(\frac{n}{t} \times \frac{n}{t} \times t^2\right) = O(n^2)$ time.
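The first attempt can be sketched as follows: every $\beta(i, j)$ is computed with an ordinary global alignment of two blocks, and the block-level recurrence is then filled in. The function names, the match/mismatch/gap scores inside blocks, the default block penalty `sigma=1`, and the requirement that both lengths are multiples of `t` are assumptions of this example.

```python
def global_score(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment score of two strings, O(len(x) * len(y)) time."""
    prev = [i * gap for i in range(len(x) + 1)]
    for j in range(1, len(y) + 1):
        cur = [j * gap] + [0] * len(x)
        for i in range(1, len(x) + 1):
            diag = match if x[i - 1] == y[j - 1] else mismatch
            cur[i] = max(cur[i - 1] + gap, prev[i] + gap, prev[i - 1] + diag)
        prev = cur
    return prev[-1]


def block_alignment_score(v, w, t, sigma=1):
    """Block alignment score with every beta(i, j) computed up front."""
    assert len(v) % t == 0 and len(w) % t == 0, "sketch assumes whole blocks"
    v_blocks = [v[a:a + t] for a in range(0, len(v), t)]
    w_blocks = [w[a:a + t] for a in range(0, len(w), t)]
    # (n/t) * (n/t) block pairs at O(t^2) each: O(n^2) in total, so no speedup yet.
    beta = [[global_score(x, y) for y in w_blocks] for x in v_blocks]
    s = [[0] * (len(w_blocks) + 1) for _ in range(len(v_blocks) + 1)]
    for i in range(len(v_blocks) + 1):
        for j in range(len(w_blocks) + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0:
                cands.append(s[i - 1][j] - sigma)       # block of v against a gap
            if j > 0:
                cands.append(s[i][j - 1] - sigma)       # block of w against a gap
            if i > 0 and j > 0:
                cands.append(s[i - 1][j - 1] + beta[i - 1][j - 1])
            s[i][j] = max(cands)
    return s[-1][-1]


print(block_alignment_score("ACGTACGT", "ACGTACGT", t=4))
```

The block-level table has only $(n/t + 1)^2$ cells, so the cost is dominated by filling `beta`, which matches the $O(n^2)$ total derived above.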
Block Alignment: Four Russians Technique
Instead of pre-computing and storing all $\beta(i, j)$: pre-compute and store the max weight alignment scores $S[\mathbf{v}', \mathbf{w}']$ of all pairs $(\mathbf{v}', \mathbf{w}')$ of length-$t$ strings.
Algorithm:
- 1. Precompute $S[\mathbf{v}', \mathbf{w}']$ where $\mathbf{v}', \mathbf{w}' \in \Sigma^t$
- 2. Compute block alignment between $\mathbf{v}$ and $\mathbf{w}$ using $S$
\[
s[i, j] = \max
\begin{cases}
0, & \text{if } i = 0 \text{ and } j = 0,\\
s[i-1, j] - \sigma, & \text{if } i > 0,\\
s[i, j-1] - \sigma, & \text{if } j > 0,\\
s[i-1, j-1] + S[\mathbf{v}(i), \mathbf{w}(j)], & \text{if } i > 0 \text{ and } j > 0,
\end{cases}
\]
where $\mathbf{v}(i)$ is block $i$ of $\mathbf{v}$ and $\mathbf{w}(j)$ is block $j$ of $\mathbf{w}$.
Question: How to choose $t$ for DNA?
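The two steps of the algorithm can be sketched as below for the DNA alphabet. The lookup table is a plain dictionary keyed by pairs of length-`t` strings; the function names, the match/mismatch/gap scores, the block penalty `sigma=1`, and the assumption that both lengths are multiples of `t` are choices made for this illustration.

```python
from itertools import product


def global_score(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment score of two strings, O(len(x) * len(y)) time."""
    prev = [i * gap for i in range(len(x) + 1)]
    for j in range(1, len(y) + 1):
        cur = [j * gap] + [0] * len(x)
        for i in range(1, len(x) + 1):
            diag = match if x[i - 1] == y[j - 1] else mismatch
            cur[i] = max(cur[i - 1] + gap, prev[i] + gap, prev[i - 1] + diag)
        prev = cur
    return prev[-1]


def precompute_table(t, alphabet="ACGT"):
    """Step 1: S[v', w'] for all pairs of length-t strings over the alphabet."""
    strings = ["".join(p) for p in product(alphabet, repeat=t)]
    return {(x, y): global_score(x, y) for x in strings for y in strings}


def block_alignment_with_table(v, w, t, S, sigma=1):
    """Step 2: block alignment where each block pair costs one table lookup."""
    assert len(v) % t == 0 and len(w) % t == 0, "sketch assumes whole blocks"
    v_blocks = [v[a:a + t] for a in range(0, len(v), t)]
    w_blocks = [w[a:a + t] for a in range(0, len(w), t)]
    s = [[0] * (len(w_blocks) + 1) for _ in range(len(v_blocks) + 1)]
    for i in range(len(v_blocks) + 1):
        for j in range(len(w_blocks) + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0:
                cands.append(s[i - 1][j] - sigma)
            if j > 0:
                cands.append(s[i][j - 1] - sigma)
            if i > 0 and j > 0:
                cands.append(s[i - 1][j - 1] + S[v_blocks[i - 1], w_blocks[j - 1]])
            s[i][j] = max(cands)
    return s[-1][-1]


S = precompute_table(2)
print(len(S), block_alignment_with_table("ACGTACGT", "ACGTACGT", 2, S))
```

The table has $|\Sigma|^{2t}$ entries, independent of the input length, so its size grows very quickly with $t$; this trade-off between table size and the number of blocks is what the question about choosing $t$ is probing.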
Fastest Subquadratic Alignment* Algorithm
*for edit distance
Edit distance in $O(n^2 / \log n)$ time
Barely subquadratic!
Want: $O(n^{2-\varepsilon})$ time where $\varepsilon > 0$
Question: Is $n^{2-\varepsilon}$ in $O(n^2 / \log n)$ for any $\varepsilon > 0$?
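One way to approach the question is to compare the two growth rates directly. The short derivation below uses only the standard fact that $\log n$ grows more slowly than any positive power of $n$.

```latex
\[
\lim_{n \to \infty} \frac{n^{2-\varepsilon}}{n^{2} / \log n}
  = \lim_{n \to \infty} \frac{\log n}{n^{\varepsilon}}
  = 0
  \qquad \text{for every fixed } \varepsilon > 0 .
\]
\[
\text{Hence } n^{2-\varepsilon} \in O\!\left(\frac{n^{2}}{\log n}\right),
\quad \text{while} \quad
\frac{n^{2}}{\log n} \notin O\!\left(n^{2-\varepsilon}\right).
\]
```

So a running time of $O(n^{2-\varepsilon})$ would be a strictly stronger guarantee than $O(n^2 / \log n)$, which is why the latter is described as barely subquadratic.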
Hardness Result for Edit Distance [STOC 2015]
Take Home Messages
- 1. Global alignment in $O(mn)$ time and $O(m)$ space
  - Hirschberg algorithm
- 2. Block alignment can be done in subquadratic time
  - Four Russians Technique: $O(n^2 / \log n)$ time
- 3. Global alignment cannot be done in $O(n^{2-\varepsilon})$ time under SETH (the Strong Exponential Time Hypothesis)
Reading:
- Jones and Pevzner. Chapters 7.1-7.4
- Lecture notes