CS 466: Introduction to Bioinformatics

CS 466: Introduction to Bioinformatics, Lecture 6
Mohammed El-Kebir
February 6, 2020

Course Announcements
Instructor:
• Mohammed El-Kebir (melkebir)
• Office hours: Wednesdays, 3:15-4:15pm
TA:
• Aswhin Ramesh (aramesh7)
• Office hours: Fridays, 11:00-11:59am in SC 3405
Homework 1: Due on Sept. 18 (11:59pm)
Midterm: 10/4, 11-1pm @ Transportation Building 103 (conflict: 10/7, 7-9pm @ Siebel 1302; to sign up, email me)

Global, Fitting and Local Alignment
Global Alignment problem: Given strings v ∈ Σ^m and w ∈ Σ^n and scoring function δ, find an alignment of v and w with maximum score. [Needleman-Wunsch algorithm]
Fitting Alignment problem: Given strings v ∈ Σ^m and w ∈ Σ^n and scoring function δ, find an alignment of v and a substring of w with maximum global alignment score s* among all global alignments of v and all substrings of w.
Local Alignment problem: Given strings v ∈ Σ^m and w ∈ Σ^n and scoring function δ, find a substring of v and a substring of w whose alignment has maximum global alignment score s* among all global alignments of all substrings of v and w. [Smith-Waterman algorithm]
[Figure: DP table for v = AGG (rows) and w = TACGGC (columns)]
Question: How to assess resulting algorithms?
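The three problems differ only in where an alignment may start and end. The sketch below is a minimal illustration, not the lecture's code: it assumes a simple scoring function (match +1, mismatch -1, gap -1) and the usual textbook boundary conditions, neither of which is stated on the slide. The function name `alignment_scores` is hypothetical.

```python
def alignment_scores(v, w, match=1, mismatch=-1, gap=-1):
    """Return (global, fitting, local) scores for v and w.

    Assumed scoring: match=+1, mismatch=-1, gap=-1 (not given on the slide).
    """
    m, n = len(v), len(w)

    def fill(free_start_w, allow_restart):
        s = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            # Local alignment may start anywhere, so the first column stays 0.
            s[i][0] = 0 if allow_restart else s[i - 1][0] + gap
        for j in range(1, n + 1):
            # Fitting and local alignment may skip a prefix of w for free.
            s[0][j] = 0 if (free_start_w or allow_restart) else s[0][j - 1] + gap
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = match if v[i - 1] == w[j - 1] else mismatch
                best = max(s[i - 1][j] + gap, s[i][j - 1] + gap, s[i - 1][j - 1] + sub)
                # Local alignment may also restart at any cell (score 0).
                s[i][j] = max(best, 0) if allow_restart else best
        return s

    global_score = fill(False, False)[m][n]            # must end at (m, n)
    fitting_score = max(fill(True, False)[m])          # all of v, any end in w
    local_score = max(max(row) for row in fill(True, True))  # any end cell
    return global_score, fitting_score, local_score


if __name__ == "__main__":
    print(alignment_scores("AGG", "TACGGC"))
```

All three variants fill the same (m + 1) × (n + 1) table, which is why the time and space questions on the following slides apply to each of them.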

Time Complexity
Edit graph is a weighted, directed grid graph G = (V, E) with source vertex (0, 0) and target vertex (m, n). Each edge ((i, j), (k, l)) has a weight depending on its direction.
Alignment is a path from source (0, 0) to target (m, n) in the edit graph.
Running time is O(mn) [quadratic time]
[Figure: edit graph grid for v = ATGT (rows 0 to m = 4) and w = ATCG (columns 0 to n = 4)]

Question: Can we compute an alignment faster than O(mn) time? [subquadratic time]

Space Complexity
Size of DP table is (m + 1) × (n + 1).
Thus, space complexity is O(mn) [quadratic space]
Example: To align a short read (m = 100) to the human genome (n = 3 · 10^9), we need roughly 300 GB of memory (at one byte per cell).
[Figure: DP table for v = ATGT (rows 0 to m = 4) and w = ATCG (columns 0 to n = 4)]
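The 300 GB figure can be checked with a few lines of arithmetic. This sketch assumes one byte per table cell, which the slide does not state; the helper names are hypothetical.

```python
def full_table_bytes(m, n, bytes_per_cell=1):
    """Memory for the full (m + 1) x (n + 1) DP table.

    bytes_per_cell=1 is an assumption; wider score types scale this linearly.
    """
    return (m + 1) * (n + 1) * bytes_per_cell


def two_column_bytes(m, bytes_per_cell=1):
    """Memory when only two columns of m + 1 entries are kept."""
    return 2 * (m + 1) * bytes_per_cell


if __name__ == "__main__":
    m, n = 100, 3 * 10**9  # short read vs. human genome, as on the slide
    print(f"full table:  {full_table_bytes(m, n) / 1e9:.1f} GB")
    print(f"two columns: {two_column_bytes(m)} bytes")
```

The slide's estimate uses the approximation m · n = 3 · 10^11 cells; the exact table is slightly larger because of the extra row and column.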

Question: How long is an alignment?

Question: Can we compute an alignment in O(m) space? [linear space]

Outline
1. Recap of global, fitting, local and gapped alignment
2. Space-efficient alignment
3. Subquadratic time alignment
Reading:
• Jones and Pevzner. Chapters 7.1-7.4
• Lecture notes

Space Efficient Alignment
Computing s[i, j] requires access to s[i − 1, j], s[i, j − 1] and s[i − 1, j − 1]:
\[
s[i, j] = \max \begin{cases}
0, & \text{if } i = 0 \text{ and } j = 0,\\
s[i-1, j] + \delta(v_i, -), & \text{if } i > 0,\\
s[i, j-1] + \delta(-, w_j), & \text{if } j > 0,\\
s[i-1, j-1] + \delta(v_i, w_j), & \text{if } i > 0 \text{ and } j > 0.
\end{cases}
\]
[Figure: DP table with columns j = 0, ..., n and rows i = 0, ..., m]

Thus it suffices to store only two columns to compute the optimal alignment score s[m, n], i.e., 2(m + 1) = O(m) space.
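A minimal sketch of the two-column idea follows. It assumes a simple scoring function δ (match +1, mismatch -1, gap -1), chosen here because it reproduces the example table shown later in the deck; the function name `global_alignment_score` is hypothetical.

```python
def global_alignment_score(v, w, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score s[m, n] using two columns of size m + 1.

    Assumed scoring: match=+1, mismatch=-1, gap=-1.
    """
    m, n = len(v), len(w)
    # Column j = 0: s[i, 0] is i gaps in w.
    prev = [i * gap for i in range(m + 1)]
    for j in range(1, n + 1):
        curr = [0] * (m + 1)
        curr[0] = prev[0] + gap  # s[0, j] = s[0, j-1] + delta(-, w_j)
        for i in range(1, m + 1):
            sub = match if v[i - 1] == w[j - 1] else mismatch
            curr[i] = max(
                curr[i - 1] + gap,   # s[i-1, j]   + delta(v_i, -)
                prev[i] + gap,       # s[i, j-1]   + delta(-, w_j)
                prev[i - 1] + sub,   # s[i-1, j-1] + delta(v_i, w_j)
            )
        prev = curr  # the older column is no longer needed
    return prev[m]


if __name__ == "__main__":
    print(global_alignment_score("ATGTC", "ATCGC"))
```

Only the score survives this computation: the backtrace pointers of all discarded columns are gone, so the alignment itself cannot be read off directly.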

Question: What if we want the alignment itself?

Space Efficient Alignment – First Attempt
• What if we also want the optimal alignment?
• Easy: keep the best pointers as we fill in the table.
• No! We do not know which path to keep until the recurrence has been computed at each step.
[Figure: two DP tables with rows v and columns w]

The best score in a column might not be part of the best alignment!

Space Efficient Alignment – Second Attempt
Alignment is a path from source (0, 0) to target (m, n) in the edit graph.
Maximum weight path from (0, 0) to (m, n) passes through (i*, n/2).
Question: What is i*?
[Figure: DP table with middle column n/2 marked and vertex i* on that column, rows i = 0, ..., m]

Hirschberg(i, j, i′, j′)
1. if j′ − j > 1
2.     i* ← arg max_{i ≤ i″ ≤ i′} wt(i″)
3.     Report (i*, j + (j′ − j)/2)
4.     Hirschberg(i, j, i*, j + (j′ − j)/2)
5.     Hirschberg(i*, j + (j′ − j)/2, i′, j′)

Time: area + area/2 + area/4 + … = area (1 + ½ + ¼ + ⅛ + …) ≤ 2 × area = O(mn)
Space: O(m)
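One way to see where the halving comes from, sketched under the assumption that each call does work proportional to the area of its rectangle and splits it at the middle column and at row i* (the symbols h, c, h_1 and h_2 are introduced here for illustration):

```latex
% Assumption: a call on a rectangle of height h and width c splits it at the
% middle column and at row i^*, giving sub-rectangles of heights h_1 + h_2 = h.
\begin{align*}
\text{area of the two subproblems}
  &= h_1 \cdot \tfrac{c}{2} + h_2 \cdot \tfrac{c}{2} = \tfrac{1}{2}\, h c, \\
\text{total work}
  &\le \sum_{k \ge 0} \frac{mn}{2^k} = 2mn = O(mn).
\end{align*}
```

The total area handled at recursion depth k is therefore mn / 2^k, no matter where i* falls in each call.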

Question: How to reconstruct the alignment from the reported vertices?

Hirschberg Algorithm: Reversing Edges Necessary?
Max weight path from (0, 0) to (m, n) through (i*, n/2):
i* = arg max_{0 ≤ i ≤ m} { prefix(i) + suffix(i) }
Compute { prefix(i) | 0 ≤ i ≤ m } in O(mj) time and O(m) space, by going from (0, 0) to (m, j) while keeping only two columns in memory. [single-source multiple destinations]
[Figure: DP table with middle column j marked and vertex i* on that column, rows i = 0, ..., m]

Want: Compute { suffix(i) | 0 ≤ i ≤ m } in O(mj) time and O(m) space.
Doing a longest path computation from each (i, j) to (m, n) (for all 0 ≤ i ≤ m) will not achieve the desired running time!
Reversing edges enables a single-source multiple-destination computation within the desired time and space bounds!
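Reversing every edge of the edit graph amounts to aligning the reversed strings, so the same two-column pass yields both prefix(i) and suffix(i). The sketch below assumes match +1, mismatch -1, gap -1 and splits at the middle column `len(w) // 2`; the function names are hypothetical.

```python
def last_column(v, w, match=1, mismatch=-1, gap=-1):
    """Return [s[i, len(w)] for i = 0..len(v)], keeping two columns in memory."""
    prev = [i * gap for i in range(len(v) + 1)]
    for j in range(1, len(w) + 1):
        curr = [prev[0] + gap]
        for i in range(1, len(v) + 1):
            sub = match if v[i - 1] == w[j - 1] else mismatch
            curr.append(max(curr[i - 1] + gap, prev[i] + gap, prev[i - 1] + sub))
        prev = curr
    return prev


def middle_vertex(v, w):
    """Return (i_star, mid, score) for the middle column mid = len(w) // 2."""
    m, mid = len(v), len(w) // 2
    # prefix[i]: best score of a path from (0, 0) to (i, mid).
    prefix = last_column(v, w[:mid])
    # suffix[i]: best score of a path from (i, mid) to (m, n), obtained by
    # aligning the reversed strings and flipping the resulting column.
    suffix = last_column(v[::-1], w[mid:][::-1])[::-1]
    totals = [prefix[i] + suffix[i] for i in range(m + 1)]
    i_star = max(range(m + 1), key=totals.__getitem__)
    return i_star, mid, totals[i_star]


if __name__ == "__main__":
    print(middle_vertex("ATGTC", "ATCGC"))
```

The returned score equals the optimal global alignment score, since some optimal path must cross the middle column at the best row.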

Hirschberg Algorithm: Reconstructing Alignment
Hirschberg(i, j, i′, j′)
1. if j′ − j > 1
2.     i* ← arg max_{0 ≤ i ≤ m} wt(i)
3.     Report (i*, j + (j′ − j)/2)
4.     Hirschberg(i, j, i*, j + (j′ − j)/2)
5.     Hirschberg(i*, j + (j′ − j)/2, i′, j′)
v \ w        A    T    C    G    C
        0    1    2    3    4    5
   0    0   -1   -2   -3   -4   -5
A  1   -1    1    0   -1   -2   -3
T  2   -2    0    2    1    0   -1
G  3   -3   -1    1    1    2    1
T  4   -4   -2    0    0    1    1
C  5   -5   -3   -1    1    0    2
Problem: Given reported vertices and scores {(i_0, 0, s_0), …, (i_n, n, s_n)}, find the intermediary vertices.
Transposing the matrix does not help, because gaps could occur in both input sequences:
A T - G T C
A T C G - C

Linear Space Alignment – The Hirschberg Algorithm
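A compact sketch of the whole procedure is given below. It follows the common textbook formulation, which returns the aligned strings directly through explicit base cases rather than reporting vertices as in the lecture's pseudocode. Scoring (match +1, mismatch -1, gap -1) is an assumption, and string slicing is used for brevity where a careful implementation would pass indices.

```python
def last_column(v, w, match=1, mismatch=-1, gap=-1):
    """Return [s[i, len(w)] for i = 0..len(v)], keeping two columns in memory."""
    prev = [i * gap for i in range(len(v) + 1)]
    for j in range(1, len(w) + 1):
        curr = [prev[0] + gap]
        for i in range(1, len(v) + 1):
            sub = match if v[i - 1] == w[j - 1] else mismatch
            curr.append(max(curr[i - 1] + gap, prev[i] + gap, prev[i - 1] + sub))
        prev = curr
    return prev


def hirschberg(v, w, match=1, mismatch=-1, gap=-1):
    """Return an optimal global alignment (aligned_v, aligned_w)."""
    m = len(v)
    if m == 0:
        return "-" * len(w), w
    if len(w) == 0:
        return v, "-" * m
    if len(w) == 1:
        # Base case: pair w[0] with the best v[k], or gap everything.
        best, best_k = (m + 1) * gap, None
        for k in range(m):
            score = (m - 1) * gap + (match if v[k] == w else mismatch)
            if score > best:
                best, best_k = score, k
        if best_k is None:
            return v + "-", "-" * m + w
        return v, "-" * best_k + w + "-" * (m - best_k - 1)
    mid = len(w) // 2
    prefix = last_column(v, w[:mid], match, mismatch, gap)
    suffix = last_column(v[::-1], w[mid:][::-1], match, mismatch, gap)[::-1]
    i_star = max(range(m + 1), key=lambda i: prefix[i] + suffix[i])
    # Recurse on the upper-left and lower-right rectangles.
    left = hirschberg(v[:i_star], w[:mid], match, mismatch, gap)
    right = hirschberg(v[i_star:], w[mid:], match, mismatch, gap)
    return left[0] + right[0], left[1] + right[1]


def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score two equal-length aligned strings column by column."""
    total = 0
    for x, y in zip(a, b):
        total += gap if "-" in (x, y) else (match if x == y else mismatch)
    return total


if __name__ == "__main__":
    top, bottom = hirschberg("ATGTC", "ATCGC")
    print(top, bottom, alignment_score(top, bottom), sep="\n")
```

Because each recursive call keeps only a constant number of columns of length at most m + 1, the working memory stays linear even though the full table is never stored.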

