CS 466 Introduction to Bioinformatics



  1. CS 466 Introduction to Bioinformatics, Lecture 6, Mohammed El-Kebir, February 6, 2020

  2. Course Announcements
     Instructor:
     • Mohammed El-Kebir (melkebir)
     • Office hours: Wednesdays, 3:15-4:15pm
     TA:
     • Aswhin Ramesh (aramesh7)
     • Office hours: Fridays, 11:00-11:59am in SC 3405
     Homework 1: Due on Sept. 18 (11:59pm)
     Midterm: 10/4, 11-1pm @ Transportation Building 103 (conflict: 10/7, 7-9pm @ Siebel 1302; to sign up, email me)

  3. Global, Fitting and Local Alignment
     Global Alignment problem: Given strings $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$ and scoring function $\delta$, find an alignment of $\mathbf{v}$ and $\mathbf{w}$ with maximum score. [Needleman-Wunsch algorithm]
     Fitting Alignment problem: Given strings $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$ and scoring function $\delta$, find an alignment of $\mathbf{v}$ and a substring of $\mathbf{w}$ with maximum global alignment score $s^*$ among all global alignments of $\mathbf{v}$ and all substrings of $\mathbf{w}$.
     Local Alignment problem: Given strings $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$ and scoring function $\delta$, find a substring of $\mathbf{v}$ and a substring of $\mathbf{w}$ whose alignment has maximum global alignment score $s^*$ among all global alignments of all substrings of $\mathbf{v}$ and $\mathbf{w}$. [Smith-Waterman algorithm]
     Question: How to assess the resulting algorithms?
     [Figure: DP table with $\mathbf{v}$ = AGG down the rows and $\mathbf{w}$ = TACGGC across the columns]
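
The three problems above differ only in where an alignment may start and end in the DP table. The following is a minimal sketch of that idea, not the lecture's own code. The function name `align_score`, the `mode` parameter, and the unit scores (match +1, mismatch -1, gap -1) are assumptions made for illustration.

```python
def align_score(v, w, mode="global", match=1, mismatch=-1, gap=-1):
    """Score of the best global, fitting, or local alignment of v and w.

    Fitting aligns all of v against a substring of w.
    The unit scores are illustrative assumptions, not taken from the slides.
    """
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0:
                options.append(s[i - 1][j] + gap)      # v_i against a gap
            if j > 0:
                options.append(s[i][j - 1] + gap)      # w_j against a gap
            if i > 0 and j > 0:
                delta = match if v[i - 1] == w[j - 1] else mismatch
                options.append(s[i - 1][j - 1] + delta)
            # Free start: anywhere for local, anywhere in row 0 for fitting.
            if mode == "local" or (mode == "fitting" and i == 0):
                options.append(0)
            s[i][j] = max(options)
    if mode == "global":
        return s[m][n]                      # must end at the target vertex
    if mode == "fitting":
        return max(s[m])                    # may end anywhere in the last row
    return max(max(row) for row in s)       # local: may end anywhere


print(align_score("ATGTC", "ATCGC", "global"))
print(align_score("TCG", "ATCGC", "fitting"))
print(align_score("ACGT", "TTACGTTT", "local"))
```

All three modes fill the same table in the same order, so each runs in $O(mn)$ time.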

  5. Time Complexity
     The edit graph is a weighted, directed grid graph $G = (V, E)$ with source vertex $(0, 0)$ and target vertex $(m, n)$. Each edge $((i, j), (k, l))$ has a weight depending on its direction.
     An alignment is a path from source $(0, 0)$ to target $(m, n)$ in the edit graph.
     Running time is $O(mn)$ [quadratic time].
     Question: Can we compute an alignment faster than $O(mn)$ time? [subquadratic time]
     [Figure: edit graph grid with W = ATCG across the columns (n = 4) and V = ATGT down the rows (m = 4)]
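
To see where the quadratic bound comes from, it can help to count the vertices and edges of the edit graph directly. The sketch below does only that counting; `edit_graph_size` is a hypothetical helper name, and it assumes the usual three edge directions (vertical, horizontal, diagonal).

```python
def edit_graph_size(m, n):
    """Count vertices and edges of the edit graph for strings of length m and n."""
    vertices = (m + 1) * (n + 1)
    vertical = m * (n + 1)      # (i-1, j) -> (i, j): v_i aligned to a gap
    horizontal = (m + 1) * n    # (i, j-1) -> (i, j): w_j aligned to a gap
    diagonal = m * n            # (i-1, j-1) -> (i, j): v_i aligned to w_j
    return vertices, vertical + horizontal + diagonal


print(edit_graph_size(4, 4))  # the 4 x 4 grid from the figure: (25, 56)
```

Both counts grow proportionally to $mn$, and a longest path in a directed acyclic graph can be found in time linear in its vertices plus edges, which gives the $O(mn)$ running time.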

  8. Space Complexity
     Size of the DP table is $(m + 1) \times (n + 1)$. Thus, space complexity is $O(mn)$ [quadratic space].
     Example: To align a short read ($m = 100$) to the human genome ($n = 3 \cdot 10^9$), we need 300 GB memory.
     Question: How long is an alignment?
     Question: Can we compute an alignment in $O(m)$ space? [linear space]
     [Figure: empty DP table with W = ATCG across the columns (n = 4) and V = ATGT down the rows (m = 4)]
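
The 300 GB figure can be checked with a few lines of arithmetic. This sketch assumes one byte per table cell, which the slide does not state; the helper names `full_table_cells` and `table_gigabytes` are illustrative.

```python
def full_table_cells(m, n):
    """Number of entries in the (m + 1) x (n + 1) DP table."""
    return (m + 1) * (n + 1)


def table_gigabytes(m, n, bytes_per_cell=1):
    """Table size in GB (10**9 bytes); bytes_per_cell = 1 is an assumption."""
    return full_table_cells(m, n) * bytes_per_cell / 1e9


m, n = 100, 3 * 10**9                 # short read vs. human genome
print(full_table_cells(m, n))         # 303000000101 cells
print(table_gigabytes(m, n))          # roughly 303 GB at one byte per cell
```

With wider cells, for example 4-byte integers, the same table would need about four times as much memory.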

  9. Outline
     1. Recap of global, fitting, local and gapped alignment
     2. Space-efficient alignment
     3. Subquadratic time alignment
     Reading:
     • Jones and Pevzner. Chapters 7.1-7.4
     • Lecture notes

  12. Space Efficient Alignment
      Computing $s[i, j]$ requires access to $s[i-1, j]$, $s[i, j-1]$ and $s[i-1, j-1]$:
      $$s[i, j] = \max \begin{cases} 0, & \text{if } i = 0 \text{ and } j = 0, \\ s[i-1, j] + \delta(v_i, -), & \text{if } i > 0, \\ s[i, j-1] + \delta(-, w_j), & \text{if } j > 0, \\ s[i-1, j-1] + \delta(v_i, w_j), & \text{if } i > 0 \text{ and } j > 0. \end{cases}$$
      Thus it suffices to store only two columns to compute the optimal alignment score $s[m, n]$, i.e., $2(m + 1) = O(m)$ space.
      Question: What if we want the alignment itself?
      [Figure: DP table with column index $j$ from 0 to $n$ and row index $i$ from 0 to $m$]
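
The two-column idea can be sketched as follows. This is an illustration under assumed unit scores (match +1, mismatch -1, gap -1), and `global_score_two_columns` is a hypothetical name; it returns only the score $s[m, n]$, not the alignment.

```python
def global_score_two_columns(v, w, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score using two columns of length len(v) + 1."""
    m = len(v)
    prev = [i * gap for i in range(m + 1)]          # column j = 0
    for j in range(1, len(w) + 1):
        curr = [prev[0] + gap] + [0] * m            # s[0, j] = j * gap
        for i in range(1, m + 1):
            delta = match if v[i - 1] == w[j - 1] else mismatch
            curr[i] = max(curr[i - 1] + gap,        # s[i-1, j]   + delta(v_i, -)
                          prev[i] + gap,            # s[i, j-1]   + delta(-, w_j)
                          prev[i - 1] + delta)      # s[i-1, j-1] + delta(v_i, w_j)
        prev = curr                                 # older column is discarded
    return prev[m]


print(global_score_two_columns("ATGTC", "ATCGC"))
```

Because each column is overwritten once the next one is complete, the back-pointers needed for a traceback are lost, which is why recovering the alignment itself needs more work.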

  15. Space Efficient Alignment – First Attempt
      • What if we also want the optimal alignment?
      • Easy: keep the best pointers as we fill in the table.
      • No! We do not know which path to keep until computing the recurrence at each step.
      Best score for a column might not be part of the best alignment!
      [Figure: two DP tables, each with w across the columns and v down the rows]

  16. Space Efficient Alignment – Second Attempt
      An alignment is a path from source $(0, 0)$ to target $(m, n)$ in the edit graph.
      The maximum weight path from $(0, 0)$ to $(m, n)$ passes through $(i^*, n/2)$.
      Question: What is $i^*$?
      [Figure: DP table with the middle column $n/2$ and row $i^*$ marked]

  19. Hirschberg($i, j, i', j'$)
      1. if $j' - j > 1$
      2.     $i^* \leftarrow \arg\max_{i \le i'' \le i'} \mathrm{wt}(i'')$
      3.     Report $(i^*, j + \frac{j' - j}{2})$
      4.     Hirschberg($i, j, i^*, j + \frac{j' - j}{2}$)
      5.     Hirschberg($i^*, j + \frac{j' - j}{2}, i', j'$)
      Time: area + area/2 + area/4 + … = area (1 + ½ + ¼ + ⅛ + …) ≤ 2 × area = O(mn)
      Space: O(m)
      Question: How to reconstruct the alignment from the reported vertices?
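
The halving in the time bound can be spelled out. The derivation below is a sketch that assumes $j' - j$ is even, so that the middle column splits the rectangle exactly; the constant $c$ is an assumed per-cell cost. The two recursive calls work on the rectangles to the upper left and lower right of the reported vertex, and together they cover half the area of the parent call:

```latex
\begin{align*}
\underbrace{(i^* - i)\cdot\frac{j' - j}{2}}_{\text{first recursive call}}
+ \underbrace{(i' - i^*)\cdot\frac{j' - j}{2}}_{\text{second recursive call}}
  &= \frac{1}{2}\,(i' - i)(j' - j), \\
T(m, n) \;\le\; c \sum_{k \ge 0} \frac{mn}{2^{k}}
  &= 2c\,mn = O(mn).
\end{align*}
```

Here $k$ indexes the recursion depth: the total area processed at depth $k$ is at most $mn / 2^k$, whatever the values of $i^*$ turn out to be.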

  21. Hirschberg Algorithm: Reversing Edges Necessary?
      Max weight path from $(0, 0)$ to $(m, n)$ through $(i^*, n/2)$:
      $i^* = \arg\max_{0 \le i \le m} \{\mathrm{prefix}(i) + \mathrm{suffix}(i)\}$
      Compute $\{\mathrm{prefix}(i) \mid 0 \le i \le m\}$ in $O(mj)$ time and $O(m)$ space, by starting from $(0, 0)$ to $(m, j)$ keeping only two columns in memory. [single-source multiple destinations]
      Want: Compute $\{\mathrm{suffix}(i) \mid 0 \le i \le m\}$ in $O(mj)$ time and $O(m)$ space.
      Doing a longest path from each $(i, j)$ to $(m, n)$ (for all $0 \le i \le m$) will not achieve the desired running time!
      Reversing edges enables single-source multiple destination computation in the desired time and space bound!
      [Figure: DP table with column $j$ and row $i^*$ marked, rows indexed by $i$ from 0 to $m$]
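
One way to realize the reversed-edge computation is to run the same forward two-column DP on the reversed strings. The sketch below does this for a single middle column; `last_column` and `best_split_row` are hypothetical names, and the unit scores (match +1, mismatch -1, gap -1) are assumptions.

```python
def last_column(v, w, match=1, mismatch=-1, gap=-1):
    """col[i] = best score of aligning v[:i] with all of w, in O(len(v)) space."""
    col = [i * gap for i in range(len(v) + 1)]
    for c in w:
        new = [col[0] + gap]
        for i, a in enumerate(v, 1):
            delta = match if a == c else mismatch
            new.append(max(new[i - 1] + gap, col[i] + gap, col[i - 1] + delta))
        col = new
    return col


def best_split_row(v, w, mid):
    """Return (i_star, wt) for column mid, where wt[i] = prefix(i) + suffix(i)."""
    prefix = last_column(v, w[:mid])
    # Reversing both strings plays the role of reversing all edges:
    # one pass from the target gives suffix(i) for every row i.
    suffix = last_column(v[::-1], w[mid:][::-1])[::-1]
    wt = [p + s for p, s in zip(prefix, suffix)]
    i_star = max(range(len(wt)), key=wt.__getitem__)
    return i_star, wt


i_star, wt = best_split_row("ATGTC", "ATCGC", 2)
print(i_star, wt[i_star])
```

The value `wt[i_star]` equals the optimal global score, since some optimal path must cross the middle column.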

  22. Hirschberg Algorithm: Reconstructing Alignment
      Hirschberg($i, j, i', j'$)
      1. if $j' - j > 1$
      2.     $i^* \leftarrow \arg\max_{0 \le i \le m} \mathrm{wt}(i)$
      3.     Report $(i^*, j + \frac{j' - j}{2})$
      4.     Hirschberg($i, j, i^*, j + \frac{j' - j}{2}$)
      5.     Hirschberg($i^*, j + \frac{j' - j}{2}, i', j'$)
      Problem: Given reported vertices and scores $\{(i_0, 0, s_0), \ldots, (i_n, n, s_n)\}$, find intermediary vertices.
      Transposing the matrix does not help, because gaps could occur in both input sequences.
      DP table (W across the columns, V down the rows):
      V \ W        A    T    C    G    C
              0    1    2    3    4    5
         0    0   -1   -2   -3   -4   -5
      A  1   -1    1    0   -1   -2   -3
      T  2   -2    0    2    1    0   -1
      G  3   -3   -1    1    1    2    1
      T  4   -4   -2    0    0    1    1
      C  5   -5   -3   -1    1    0    2
      Alignment:
      A T - G T C
      A T C G - C
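
A common textbook variant sidesteps the intermediary-vertex problem by letting the recursion return alignment pieces and concatenating them, with a small full-table alignment as the base case. The sketch below shows that variant, which is not necessarily the reconstruction intended in the lecture. All function names and the unit scores (match +1, mismatch -1, gap -1) are assumptions.

```python
GAP = -1  # assumed gap score


def sub(a, b):
    """Assumed substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def last_column(v, w):
    """col[i] = best score of aligning v[:i] with all of w, in O(len(v)) space."""
    col = [i * GAP for i in range(len(v) + 1)]
    for c in w:
        new = [col[0] + GAP]
        for i, a in enumerate(v, 1):
            new.append(max(new[i - 1] + GAP, col[i] + GAP, col[i - 1] + sub(a, c)))
        col = new
    return col


def small_align(v, w):
    """Full table plus traceback; only called when len(w) <= 1 or v is empty."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            options = []
            if i > 0:
                options.append(s[i - 1][j] + GAP)
            if j > 0:
                options.append(s[i][j - 1] + GAP)
            if i > 0 and j > 0:
                options.append(s[i - 1][j - 1] + sub(v[i - 1], w[j - 1]))
            if options:
                s[i][j] = max(options)
    top, bottom, i, j = [], [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + sub(v[i - 1], w[j - 1]):
            top.append(v[i - 1]); bottom.append(w[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + GAP:
            top.append(v[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(w[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


def hirschberg(v, w):
    """Optimal global alignment of v and w as two gapped strings."""
    if len(w) <= 1 or len(v) == 0:
        return small_align(v, w)
    mid = len(w) // 2
    prefix = last_column(v, w[:mid])
    suffix = last_column(v[::-1], w[mid:][::-1])[::-1]
    i_star = max(range(len(v) + 1), key=lambda i: prefix[i] + suffix[i])
    left = hirschberg(v[:i_star], w[:mid])      # upper-left rectangle
    right = hirschberg(v[i_star:], w[mid:])     # lower-right rectangle
    return left[0] + right[0], left[1] + right[1]


def score_alignment(top, bottom):
    """Score a gapped alignment under the same assumed scores."""
    return sum(GAP if "-" in (a, b) else sub(a, b) for a, b in zip(top, bottom))


top, bottom = hirschberg("ATGTC", "ATCGC")
print(top)
print(bottom)
print(score_alignment(top, bottom))
```

The base case fills a table with at most two columns, so working memory stays linear in the input length apart from the string slices and the output alignment.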

  23. Linear Space Alignment – The Hirschberg Algorithm
