CS CS 466 466 In Introduct ctio ion t to B Bio ioin - - PowerPoint PPT Presentation

β–Ά
cs cs 466 466 in introduct ctio ion t to b bio ioin
SMART_READER_LITE
LIVE PREVIEW

CS CS 466 466 In Introduct ctio ion t to B Bio ioin - - PowerPoint PPT Presentation

CS CS 466 466 In Introduct ctio ion t to B Bio ioin informatics ics Lecture 6 Mohammed El-Kebir February 6, 2020 Course Announcements Instructor: Mohammed El-Kebir (melkebir) Office hours: Wednesdays, 3:15-4:15pm TA:


slide-1
SLIDE 1

CS CS 466 466 In Introduct ctio ion t to B Bio ioin informatics ics

Lecture 6

Mohammed El-Kebir February 6, 2020

slide-2
SLIDE 2

Course Announcements

Instructor:

  • Mohammed El-Kebir (melkebir)
  • Office hours: Wednesdays, 3:15-4:15pm

TA:

  • Aswhin Ramesh (aramesh7)
  • Office hours: Fridays, 11:00-11:59am in SC 3405

2

Homework 1: Due on Sept. 18 (11:59pm) Midterm: 10/4, 11-1pm @Transportation Building 103 (conflict: 10/7, 7-9pm @Siebel 1302 -- to sign up email me)

slide-3
SLIDE 3

Global, Fitting and Local Alignment

3

Local Alignment problem: Given strings 𝐰 ∈ Ξ£$ and 𝐱 ∈ Ξ£& and scoring function πœ€, find a substring

  • f 𝐰 and a substring of 𝐱 whose alignment has

maximum global alignment score π‘‘βˆ— among all global alignments of all substrings of 𝐰 and 𝐱 [Smith-Waterman algorithm] Fitting Alignment problem: Given strings 𝐰 ∈ Ξ£$ and 𝐱 ∈ Ξ£& and scoring function πœ€, find an alignment of 𝐰 and a substring of 𝐱 with maximum global alignment score π‘‘βˆ— among all global alignments of 𝐰 and all substrings of 𝐱 Global Alignment problem: Given strings 𝐰 ∈ Ξ£$ and 𝐱 ∈ Ξ£& and scoring function πœ€, find alignment

  • f 𝐰 and 𝐱 with maximum score.

[Needleman-Wunsch algorithm]

A G G T A C G G C

𝐰\𝐱

Question: How to assess resulting algorithms?

slide-4
SLIDE 4

Time Complexity

4

1 2 3 n = 4 O O O O O 1 O O O O O 2 O O O O O 3 O O O O O m = 4 O O O O O

W A T C G A T G T V Alignment is a path from source (0, 0) to target (𝑛, π‘œ) in edit graph Edit graph is a weighed, directed grid graph 𝐻 = (π‘Š, 𝐹) with source vertex (0, 0) and target vertex (𝑛, π‘œ). Each edge ((𝑗, π‘˜), (𝑙, π‘š)) has weight depending on direction. Running time is 𝑃(π‘›π‘œ) [quadratic time]

slide-5
SLIDE 5

Time Complexity

5

1 2 3 n = 4 O O O O O 1 O O O O O 2 O O O O O 3 O O O O O m = 4 O O O O O

W A T C G A T G T V Alignment is a path from source (0, 0) to target (𝑛, π‘œ) in edit graph Edit graph is a weighed, directed grid graph 𝐻 = (π‘Š, 𝐹) with source vertex (0, 0) and target vertex (𝑛, π‘œ). Each edge ((𝑗, π‘˜), (𝑙, π‘š)) has weight depending on direction. Running time is 𝑃(π‘›π‘œ) [quadratic time] Question: Compute alignment faster than 𝑃(π‘›π‘œ) time? [subquadratic time]

slide-6
SLIDE 6

Space Complexity

6

1 2 3 n = 4 1 2 3 m = 4

W A T C G A T G T V Thus, space complexity is 𝑃(π‘›π‘œ) [quadratic space] Size of DP table is 𝑛 + 1 Γ— (π‘œ + 1) Example: To align a short read (𝑛 = 100) to human genome (π‘œ = 3 = 10>), we need 300 GB memory.

slide-7
SLIDE 7

Space Complexity

7

1 2 3 n = 4 1 2 3 m = 4

W A T C G A T G T V Thus, space complexity is 𝑃(π‘›π‘œ) [quadratic space] Size of DP table is 𝑛 + 1 Γ— (π‘œ + 1) Example: To align a short read (𝑛 = 100) to human genome (π‘œ = 3 = 10>), we need 300 GB memory. Question: How long is an alignment?

slide-8
SLIDE 8

Space Complexity

8

1 2 3 n = 4 1 2 3 m = 4

W A T C G A T G T V Thus, space complexity is 𝑃(π‘›π‘œ) [quadratic space] Size of DP table is 𝑛 + 1 Γ— (π‘œ + 1) Example: To align a short read (𝑛 = 100) to human genome (π‘œ = 3 = 10>), we need 300 GB memory. Question: How long is an alignment? Question: Compute alignment in 𝑃(𝑛) space? [linear space]

slide-9
SLIDE 9

Outline

  • 1. Recap of global, fitting, local and gapped alignment
  • 2. Space-efficient alignment
  • 3. Subquadratic time alignment

Reading:

  • Jones and Pevzner. Chapters 7.1-7.4
  • Lecture notes

9

slide-10
SLIDE 10

𝑛 π‘œ

Space Efficient Alignment

10

Computing 𝑑[𝑗, π‘˜] requires access to: 𝑑 𝑗 βˆ’ 1, π‘˜ , 𝑑[𝑗, π‘˜ βˆ’ 1] and 𝑑[𝑗 βˆ’ 1, π‘˜ βˆ’ 1] 𝑗 π‘˜

s[i, j] = max ο£±    ο£²    ο£³ 0, if i = 0 and j = 0, s[i βˆ’ 1, j] + Ξ΄(vi, βˆ’), if i > 0, s[i, j βˆ’ 1] + Ξ΄(βˆ’, wj), if j > 0, s[i βˆ’ 1, j βˆ’ 1] + Ξ΄(vi, wj), if i > 0 and j > 0.

slide-11
SLIDE 11

𝑛 π‘œ

Space Efficient Alignment

11

Computing 𝑑[𝑗, π‘˜] requires access to: 𝑑 𝑗 βˆ’ 1, π‘˜ , 𝑑[𝑗, π‘˜ βˆ’ 1] and 𝑑[𝑗 βˆ’ 1, π‘˜ βˆ’ 1] 𝑗 π‘˜ Thus it suffices to store only two columns to compute optimal alignment score 𝑑 𝑛, π‘œ , i.e., 2 𝑛 + 1 = 𝑃(𝑛) space.

s[i, j] = max ο£±    ο£²    ο£³ 0, if i = 0 and j = 0, s[i βˆ’ 1, j] + Ξ΄(vi, βˆ’), if i > 0, s[i, j βˆ’ 1] + Ξ΄(βˆ’, wj), if j > 0, s[i βˆ’ 1, j βˆ’ 1] + Ξ΄(vi, wj), if i > 0 and j > 0.

slide-12
SLIDE 12

𝑛 π‘œ

Space Efficient Alignment

12

Question: What if we want alignment itself? Computing 𝑑[𝑗, π‘˜] requires access to: 𝑑 𝑗 βˆ’ 1, π‘˜ , 𝑑[𝑗, π‘˜ βˆ’ 1] and 𝑑[𝑗 βˆ’ 1, π‘˜ βˆ’ 1] 𝑗 π‘˜ Thus it suffices to store only two columns to compute optimal alignment score 𝑑 𝑛, π‘œ , i.e., 2 𝑛 + 1 = 𝑃(𝑛) space.

s[i, j] = max ο£±    ο£²    ο£³ 0, if i = 0 and j = 0, s[i βˆ’ 1, j] + Ξ΄(vi, βˆ’), if i > 0, s[i, j βˆ’ 1] + Ξ΄(βˆ’, wj), if j > 0, s[i βˆ’ 1, j βˆ’ 1] + Ξ΄(vi, wj), if i > 0 and j > 0.

slide-13
SLIDE 13

Space Efficient Alignment – First Attempt

  • What if also want optimal alignment?
  • Easy: keep best pointers as fill in table.
  • No! Do not know which path to keep until computing

recurrence at each step.

w v w v

slide-14
SLIDE 14

Space Efficient Alignment – First Attempt

  • What if also want optimal alignment?
  • Easy: keep best pointers as fill in table.
  • No! Do not know which path to keep until computing

recurrence at each step.

w v w v

slide-15
SLIDE 15

Space Efficient Alignment – First Attempt

  • What if also want optimal alignment?
  • Easy: keep best pointers as fill in table.
  • No! Do not know which path to keep until computing

recurrence at each step.

w v w v

Best score for column might not be part of best alignment!

slide-16
SLIDE 16

Space Efficient Alignment – Second Attempt

16

π‘œ/2 𝑛 π‘—βˆ— 𝑗

Maximum weight path from (0,0) to (𝑛, π‘œ) passes through (π‘—βˆ—, π‘œ/2) Question: What is π‘—βˆ—? Alignment is a path from source (0, 0) to target (𝑛, π‘œ) in edit graph

slide-17
SLIDE 17

17

Hirschberg(𝑗, π‘˜, 𝑗E, π‘˜β€²) 1. if π‘˜E βˆ’ π‘˜ > 1 2. π‘—βˆ— ß arg max

MNMOONMO wt(𝑗′′)

3. Report (π‘—βˆ—, π‘˜ + ROSR

T )

4. Hirschberg(𝑗, π‘˜, π‘—βˆ—, π‘˜ + ROSR

T )

5. Hirschberg(π‘—βˆ—, π‘˜ + ROSR

T , 𝑗E, π‘˜β€²)

slide-18
SLIDE 18

18

Time: area + area/2 + area/4 + … = area (1 + Β½ + ΒΌ + β…› + …) ≀ 2 Γ— area = O(mn) Space: O(m) Hirschberg(𝑗, π‘˜, 𝑗E, π‘˜β€²) 1. if π‘˜E βˆ’ π‘˜ > 1 2. π‘—βˆ— ß arg max

MNMOONMO wt(𝑗′′)

3. Report (π‘—βˆ—, π‘˜ + ROSR

T )

4. Hirschberg(𝑗, π‘˜, π‘—βˆ—, π‘˜ + ROSR

T )

5. Hirschberg(π‘—βˆ—, π‘˜ + ROSR

T , 𝑗E, π‘˜β€²)

slide-19
SLIDE 19

19

Time: area + area/2 + area/4 + … = area (1 + Β½ + ΒΌ + β…› + …) ≀ 2 Γ— area = O(mn) Space: O(m) Question: How to reconstruct alignment from reported vertices? Hirschberg(𝑗, π‘˜, 𝑗E, π‘˜β€²) 1. if π‘˜E βˆ’ π‘˜ > 1 2. π‘—βˆ— ß arg max

MNMOONMO wt(𝑗′′)

3. Report (π‘—βˆ—, π‘˜ + ROSR

T )

4. Hirschberg(𝑗, π‘˜, π‘—βˆ—, π‘˜ + ROSR

T )

5. Hirschberg(π‘—βˆ—, π‘˜ + ROSR

T , 𝑗E, π‘˜β€²)

slide-20
SLIDE 20

Hirschberg Algorithm: Reversing Edges Necessary?

20

π‘—βˆ— = arg max{

VNMN$

preYix 𝑗 + sufYix 𝑗 } Max weight path from (0,0) to (𝑛, π‘œ) through (π‘—βˆ—, π‘œ/2) Compute preYix 𝑗 0 ≀ 𝑗 ≀ 𝑛} in O(π‘›π‘˜) time and O(𝑛) space, by starting from (0,0) to 𝑛, π‘˜ keeping only two columns in memory. [single-source multiple destinations]

π‘˜ 𝑛 π‘—βˆ— 𝑗

slide-21
SLIDE 21

Hirschberg Algorithm: Reversing Edges Necessary?

21

π‘—βˆ— = arg max{

VNMN$

preYix 𝑗 + sufYix 𝑗 } Max weight path from (0,0) to (𝑛, π‘œ) through (π‘—βˆ—, π‘œ/2) Want: Compute sufYix 𝑗 0 ≀ 𝑗 ≀ 𝑛} in O(π‘›π‘˜) time and O(𝑛) space Doing a longest path from each 𝑗, π‘˜ to 𝑛, π‘œ (for all 0 ≀ 𝑗 ≀ 𝑛) will not achieve desired running time! Reversing edges enables single-source multiple destination computation in desired time and space bound!

π‘˜ 𝑛 π‘—βˆ— 𝑗

Compute preYix 𝑗 0 ≀ 𝑗 ≀ 𝑛} in O(π‘›π‘˜) time and O(𝑛) space, by starting from (0,0) to 𝑛, π‘˜ keeping only two columns in memory. [single-source multiple destinations]

slide-22
SLIDE 22

Hirschberg Algorithm: Reconstructing Alignment

22

Problem: Given reported vertices and scores { 𝑗V, 0, 𝑑V , … , 𝑗&, π‘œ, 𝑑& }, find intermediary vertices. Hirschberg(𝑗, π‘˜, 𝑗E, π‘˜β€²) 1. if π‘˜E βˆ’ π‘˜ > 1 2. π‘—βˆ— ß arg max

VNMN$

wt(𝑗) 3. Report (π‘—βˆ—, π‘˜ + ROSR

T )

4. Hirschberg(𝑗, π‘˜, π‘—βˆ—, π‘˜ + ROSR

T )

5. Hirschberg(π‘—βˆ—, π‘˜ + ROSR

T , 𝑗E, π‘˜β€²)

1 2 3 4 5

  • 1
  • 2
  • 3
  • 4
  • 5

1

  • 1

1

  • 1
  • 2
  • 3

2

  • 2

2 1

  • 1

3

  • 3
  • 1

1 1 2 1 4

  • 4
  • 2

1 1 5

  • 5
  • 3
  • 1

1 2

W A T C G A T G T V A T

  • G

T C A T C G

  • C

C C

Transposing matrix does not help, because gaps could occur in both input sequences

slide-23
SLIDE 23

Linear Space Alignment – The Hirschberg Algorithm

23

slide-24
SLIDE 24

Outline

  • 1. Recap of global, fitting, local and gapped alignment
  • 2. Space-efficient alignment
  • 3. Subquadratic time alignment

Reading:

  • Jones and Pevzner. Chapters 7.1-7.4
  • Lecture notes

24

slide-25
SLIDE 25

Banded Alignment

25

x1 x2 x3 . . . . . . xM y1 y2 y3 ... ... ... ... yN

Constrain traceback to band of DP matrix (penalize big gaps)

Figure source: http://jinome.stanford.edu/stat366/pdfs/stat366_win0607_lecture04.pdf

Constraint path to band of width 𝑙 around diagonal Running time: O(π‘œπ‘™) Gives a good approximation of highly identical sequences

slide-26
SLIDE 26

Banded Alignment

26

x1 x2 x3 . . . . . . xM y1 y2 y3 ... ... ... ... yN

Constrain traceback to band of DP matrix (penalize big gaps)

Figure source: http://jinome.stanford.edu/stat366/pdfs/stat366_win0607_lecture04.pdf

Constraint path to band of width 𝑙 around diagonal Running time: O(π‘œπ‘™) Gives a good approximation of highly identical sequences Question: How to change recurrence to accomplish this?

slide-27
SLIDE 27

Block Alignment

27

Divide input sequences into blocks of length 𝑒

v1, …, vt vt+1, …, v2t … vm-t+1, …, vm w1, …, wt wt+1, …, w2t … wn-t+1, …, wn

slide-28
SLIDE 28

Block Alignment

28

Divide input sequences into blocks of length 𝑒

v1, …, vt vt+1, …, v2t … vm-t+1, …, vm w1, …, wt wt+1, …, w2t … wn-t+1, …, wn

Require that paths in edit graph pass through corners of blocks

slide-29
SLIDE 29

Block Alignment

29

Divide input sequences into blocks of length 𝑒

v1, …, vt vt+1, …, v2t … vm-t+1, …, vm w1, …, wt wt+1, …, w2t … wn-t+1, …, wn

Require that paths in edit graph pass through corners of blocks

s[i, j] = max ο£±    ο£²    ο£³ 0, if i = 0 and j = 0, s[i βˆ’ 1, j] βˆ’ Οƒ, if i > 0, s[i, j βˆ’ 1] βˆ’ Οƒ, if j > 0, s[i βˆ’ 1, j βˆ’ 1] + Ξ²(i, j), if i > 0 and j > 0.

0 ≀ 𝑗, π‘˜ ≀ 𝑒 and 𝛾(𝑗, π‘˜) is max score alignment between block 𝑗 of 𝐰 and block π‘˜ of 𝐱

slide-30
SLIDE 30

Block Alignment – First Attempt: Pre-compute 𝛾 𝑗, π‘˜

30

s[i, j] = max ο£±    ο£²    ο£³ 0, if i = 0 and j = 0, s[i βˆ’ 1, j] βˆ’ Οƒ, if i > 0, s[i, j βˆ’ 1] βˆ’ Οƒ, if j > 0, s[i βˆ’ 1, j βˆ’ 1] + Ξ²(i, j), if i > 0 and j > 0.

0 ≀ 𝑗, π‘˜ ≀ π‘œ/𝑒 and 𝛾(𝑗, π‘˜) is max score alignment between block 𝑗 of 𝐰 and block π‘˜ of 𝐱

t 2t nt t 2t nt

slide-31
SLIDE 31

Block Alignment – First Attempt: Pre-compute 𝛾 𝑗, π‘˜

31

s[i, j] = max ο£±    ο£²    ο£³ 0, if i = 0 and j = 0, s[i βˆ’ 1, j] βˆ’ Οƒ, if i > 0, s[i, j βˆ’ 1] βˆ’ Οƒ, if j > 0, s[i βˆ’ 1, j βˆ’ 1] + Ξ²(i, j), if i > 0 and j > 0.

0 ≀ 𝑗, π‘˜ ≀ π‘œ/𝑒 and 𝛾(𝑗, π‘˜) is max score alignment between block 𝑗 of 𝐰 and block π‘˜ of 𝐱 Question: How much time to compute all 𝛾(𝑗, π‘˜)?

t 2t nt t 2t nt

slide-32
SLIDE 32

Block Alignment – First Attempt: Pre-compute 𝛾 𝑗, π‘˜

32

s[i, j] = max ο£±    ο£²    ο£³ 0, if i = 0 and j = 0, s[i βˆ’ 1, j] βˆ’ Οƒ, if i > 0, s[i, j βˆ’ 1] βˆ’ Οƒ, if j > 0, s[i βˆ’ 1, j βˆ’ 1] + Ξ²(i, j), if i > 0 and j > 0.

0 ≀ 𝑗, π‘˜ ≀ π‘œ/𝑒 and 𝛾(𝑗, π‘˜) is max score alignment between block 𝑗 of 𝐰 and block π‘˜ of 𝐱 Question: How much time to compute all 𝛾(𝑗, π‘˜)?

t 2t nt t 2t nt

Computing 𝛾 𝑗, π‘˜ takes 𝑃 𝑒T time There are π‘œ/𝑒 Γ— π‘œ/𝑒 values 𝛾 𝑗, π‘˜ Total: 𝑃

& d Γ— & d Γ— 𝑒T = 𝑃 π‘œT time

slide-33
SLIDE 33

Block Alignment – Four Russians Technique

33

t 2t nt t 2t nt

Pre-compute and store all Ξ²ij Pre-compute and store all max weight alignments 𝑇[𝐰′, 𝐱′] of all pairs (𝐰E, 𝐱E) of length t strings

Algorithm:

  • 1. Precompute 𝑇[𝐰′, 𝐱′] where

𝐰E, 𝐱E ∈ Σd

  • 2. Compute block alignment

between 𝐰 and 𝐱 using 𝑇

s[i, j] = max ο£±    ο£²    ο£³ 0, if i = 0 and j = 0, s[i βˆ’ 1, j] βˆ’ Οƒ, if i > 0, s[i, j βˆ’ 1] βˆ’ Οƒ, if j > 0, s[i βˆ’ 1, j βˆ’ 1] + S[v(i), w(j)], if i > 0 and j > 0.

slide-34
SLIDE 34

Block Alignment – Four Russians Technique

34

t 2t nt t 2t nt

Pre-compute and store all Ξ²ij Pre-compute and store all max weight alignments 𝑇[𝐰′, 𝐱′] of all pairs (𝐰E, 𝐱E) of length t strings

Algorithm:

  • 1. Precompute 𝑇[𝐰′, 𝐱′] where

𝐰E, 𝐱E ∈ Σd

  • 2. Compute block alignment

between 𝐰 and 𝐱 using 𝑇

s[i, j] = max ο£±    ο£²    ο£³ 0, if i = 0 and j = 0, s[i βˆ’ 1, j] βˆ’ Οƒ, if i > 0, s[i, j βˆ’ 1] βˆ’ Οƒ, if j > 0, s[i βˆ’ 1, j βˆ’ 1] + S[v(i), w(j)], if i > 0 and j > 0.

Question: How to choose 𝑒 for DNA?

slide-35
SLIDE 35

Fastest Subquadratic Alignment* Algorithm

35

*for edit distance

Edit distance in O(π‘œT/ log π‘œ) time Barely subquadratic! Want: O(π‘œTSh) time where 𝜁 > 0

slide-36
SLIDE 36

Fastest Subquadratic Alignment* Algorithm

36

*for edit distance

Edit distance in O(π‘œT/ log π‘œ) time Barely subquadratic! Want: O(π‘œTSh) time where 𝜁 > 0 Question: Is π‘œTSh in O(π‘œT/ log π‘œ) for any 𝜁 > 0?

slide-37
SLIDE 37

Hardness Result for Edit Distance [STOC 2015]

37

slide-38
SLIDE 38

38

slide-39
SLIDE 39

Take Home Messages

  • 1. Global alignment in O(π‘›π‘œ) time and O(𝑛) space
  • Hirschberg algorithm
  • 2. Block alignment can be done in subquadratic time
  • Four Russians Technique: O(π‘œT/ log π‘œ) time
  • 3. Global alignment cannot be done in O(π‘œTSh) time under SETH

Reading:

  • Jones and Pevzner. Chapters 7.1-7.4
  • Lecture notes

39