CS 466 Introduction to Bioinformatics

Lecture 2, Part 2. Mohammed El-Kebir. January 28, 2020.


  1. CS 466 Introduction to Bioinformatics. Lecture 2, Part 2. Mohammed El-Kebir. January 28, 2020

  2. Outline
     1. Edit distance recap
     2. Global alignment
     3. Fitting alignment
     4. Local alignment
     5. Gapped alignment

     Reading:
     • Jones and Pevzner. Chapters 6.6, 6.8 and 6.9
     • Lecture notes 2

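  3. Weighted Edit Distance – Practice Problem
     • Compute the weighted edit distance between $\mathbf{v} = \mathrm{AGT}$ and $\mathbf{w} = \mathrm{ATCG}$.

     $$
     d[i, j] = \min \begin{cases}
     0, & \text{if } i = 0 \text{ and } j = 0,\\
     d[i-1, j] + 1, & \text{if } i > 0,\\
     d[i, j-1] + 1, & \text{if } j > 0,\\
     d[i-1, j-1] + 2, & \text{if } i > 0,\ j > 0 \text{ and } v_i \neq w_j,\\
     d[i-1, j-1], & \text{if } i > 0,\ j > 0 \text{ and } v_i = w_j.
     \end{cases}
     $$

     Table to fill in: rows $i = 0, \dots, 3$ (labeled A, G, T from $\mathbf{v}$) and columns $j = 0, \dots, 4$ (labeled A, T, C, G from $\mathbf{w}$).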

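  4. Weighted Edit Distance – Practice Problem
     • Compute the weighted edit distance between $\mathbf{v} = \mathrm{AGT}$ and $\mathbf{w} = \mathrm{ATCG}$.

     $$
     d[i, j] = \min \begin{cases}
     0, & \text{if } i = 0 \text{ and } j = 0,\\
     d[i-1, j] + 1, & \text{if } i > 0,\\
     d[i, j-1] + 1, & \text{if } j > 0,\\
     d[i-1, j-1] + 2, & \text{if } i > 0,\ j > 0 \text{ and } v_i \neq w_j,\\
     d[i-1, j-1], & \text{if } i > 0,\ j > 0 \text{ and } v_i = w_j.
     \end{cases}
     $$

     | v \ w   | j = 0 | A (1) | T (2) | C (3) | G (4) |
     |---------|-------|-------|-------|-------|-------|
     | i = 0   | 0     | 1     | 2     | 3     | 4     |
     | A (1)   | 1     | 0     | 1     | 2     | 3     |
     | G (2)   | 2     | 1     | 2     | 3     | 2     |
     | T (3)   | 3     | 2     | 1     | 2     | 3     |

     The weighted edit distance is the bottom-right entry, $d[3, 4] = 3$.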
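
The table on this slide can be reproduced with a short dynamic program. The following is a minimal sketch, assuming the costs in the recurrence (1 per insertion or deletion, 2 per mismatch, 0 per match) and taking the strings from the table labels (rows A, G, T and columns A, T, C, G). The function name `weighted_edit_distance` and its parameters are illustrative and not part of the lecture.

```python
def weighted_edit_distance(v, w, indel=1, mismatch=2):
    """Return the full table d, where d[i][j] is the weighted edit
    distance between the prefixes v[:i] and w[:j]."""
    m, n = len(v), len(w)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue  # base case: d[0][0] = 0
            candidates = []
            if i > 0:
                candidates.append(d[i - 1][j] + indel)  # deletion
            if j > 0:
                candidates.append(d[i][j - 1] + indel)  # insertion
            if i > 0 and j > 0:
                # The slides index strings from 1, Python from 0.
                cost = 0 if v[i - 1] == w[j - 1] else mismatch
                candidates.append(d[i - 1][j - 1] + cost)  # match/mismatch
            d[i][j] = min(candidates)
    return d


table = weighted_edit_distance("AGT", "ATCG")
for row in table:
    print(row)
```

Each printed row corresponds to one row of the table, and the last entry of the last row is the weighted edit distance.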

  5. Edit Distance – Additional Insights
     • An alignment corresponds to a series of elementary operations

     Examples from http://profs.scienze.univr.it/~liptak/ACB/files/StringDistance_6up.pdf

  6. Edit Distance – Additional Insights
     • An alignment corresponds to a series of elementary operations
     • But not every series of elementary operations corresponds to an alignment! Why?

     Examples from http://profs.scienze.univr.it/~liptak/ACB/files/StringDistance_6up.pdf

  7. Distance Function / Metric
     A distance function (metric) on a set $X$ is a function $d : X \times X \to \mathbb{R}$ such that for all $x, y, z \in X$:
     i. $d(x, y) \ge 0$ [non-negativity]
     ii. $d(x, y) = 0$ if and only if $x = y$ [identity of indiscernibles]
     iii. $d(x, y) = d(y, x)$ [symmetry]
     iv. $d(x, y) \le d(x, z) + d(z, y)$ [triangle inequality]

     Question: Is edit distance a distance function?

  8. Edit Distance is a Distance Function
     Edit distance $d(\mathbf{v}, \mathbf{w})$ is the minimum number of elementary operations to transform $\mathbf{v} \in \Sigma^*$ into $\mathbf{w} \in \Sigma^*$.
     Claim: edit distance is a distance function.
     Proof: Let $\mathbf{u}, \mathbf{v}, \mathbf{w} \in \Sigma^*$.
     i. $d(\mathbf{v}, \mathbf{w}) \ge 0$ [non-negativity]
        Edit distance is defined by an alignment. This in turn uniquely determines a series of elementary operations, each with cost either 0 (match) or 1 (otherwise). Thus, $d(\mathbf{v}, \mathbf{w}) \ge 0$.

  9. Edit Distance is a Distance Function
     Proof (continued): Let $\mathbf{u}, \mathbf{v}, \mathbf{w} \in \Sigma^*$.
     ii. $d(\mathbf{v}, \mathbf{w}) = 0$ if and only if $\mathbf{v} = \mathbf{w}$ [identity of indiscernibles]
        (=>) By the premise, $d(\mathbf{v}, \mathbf{w}) = 0$. Since every operation has non-negative cost, the optimal alignment can only consist of operations with cost 0. That is, the alignment consists of only matches. Thus, $\mathbf{v} = \mathbf{w}$.
        (<=) By the premise, $\mathbf{v} = \mathbf{w}$. This means that $|\mathbf{v}| = |\mathbf{w}|$ and each letter $v_i$ equals $w_i$ (where $i \in [|\mathbf{v}|]$). Thus, there exists an alignment in which every column is a match. Moreover, only match operations have cost 0; all other operations have cost 1. Hence, this is an optimal alignment, with cost $d(\mathbf{v}, \mathbf{w}) = 0$.

  10. Edit Distance is a Distance Function
      Proof (continued): Let $\mathbf{u}, \mathbf{v}, \mathbf{w} \in \Sigma^*$.
      iii. $d(\mathbf{v}, \mathbf{w}) = d(\mathbf{w}, \mathbf{v})$ [symmetry]
         Let $\mathbf{A} = [a_{i,j}]$ be the optimal alignment corresponding to $d(\mathbf{v}, \mathbf{w})$, i.e. $\mathbf{A}$ is a $2 \times k$ matrix where $k \in \{\max(|\mathbf{v}|, |\mathbf{w}|), \dots, |\mathbf{v}| + |\mathbf{w}|\}$. Define the function $f(\mathbf{A}) = \mathbf{B}$ such that $\mathbf{B}$ is obtained by interchanging the two rows of $\mathbf{A}$. Since the cost of any insertion, deletion and mismatch is 1, alignment $\mathbf{B}$ has cost $d(\mathbf{v}, \mathbf{w})$. The existence of an alignment from $\mathbf{w}$ to $\mathbf{v}$ with cost less than $d(\mathbf{v}, \mathbf{w})$ yields a contradiction: interchanging its rows would give an alignment from $\mathbf{v}$ to $\mathbf{w}$ of the same cost, so $\mathbf{A}$ would not be an optimal alignment from $\mathbf{v}$ to $\mathbf{w}$. Hence, $d(\mathbf{w}, \mathbf{v}) = d(\mathbf{v}, \mathbf{w})$.

  11. Edit Distance is a Distance Function
      Proof (continued): Let $\mathbf{u}, \mathbf{v}, \mathbf{w} \in \Sigma^*$.
      iv. $d(\mathbf{v}, \mathbf{w}) \le d(\mathbf{v}, \mathbf{u}) + d(\mathbf{u}, \mathbf{w})$ [triangle inequality]
         Assume for a contradiction that $d(\mathbf{v}, \mathbf{w}) > d(\mathbf{v}, \mathbf{u}) + d(\mathbf{u}, \mathbf{w})$. Let $S$ be the sequence of elementary operations for transforming $\mathbf{v}$ into $\mathbf{u}$. Let $S'$ be the sequence of elementary operations for transforming $\mathbf{u}$ into $\mathbf{w}$. Note that $d(\mathbf{v}, \mathbf{u}) = |S|$ and $d(\mathbf{u}, \mathbf{w}) = |S'|$. Concatenate $S$ and $S'$ and remove redundant operations, yielding sequence $S''$. By definition, $|S''| \le |S| + |S'|$. We can obtain an alignment of $\mathbf{v}$ and $\mathbf{w}$ from $S''$ with cost $|S''| \le d(\mathbf{v}, \mathbf{u}) + d(\mathbf{u}, \mathbf{w})$. This contradicts $d(\mathbf{v}, \mathbf{w}) > d(\mathbf{v}, \mathbf{u}) + d(\mathbf{u}, \mathbf{w})$ being the cost of the optimal alignment of $\mathbf{v}$ and $\mathbf{w}$.
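
The four metric properties can also be spot-checked numerically. The sketch below is a sanity check on a small set of strings, not a substitute for the proof: it assumes unit costs (match 0, everything else 1) and exhaustively tests all strings of length at most 3 over the two-letter alphabet {A, C}. The names `edit_distance` and `is_metric_on` are illustrative.

```python
from itertools import product


def edit_distance(v, w):
    """Unit-cost edit distance: match costs 0, every other operation costs 1."""
    m, n = len(v), len(w)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of v[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of w[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if v[i - 1] == w[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,                 # deletion
                d[i][j - 1] + 1,                 # insertion
                d[i - 1][j - 1] + substitution,  # match/mismatch
            )
    return d[m][n]


def is_metric_on(strings):
    """Check properties i-iv for every pair and triple drawn from `strings`."""
    for x in strings:
        for y in strings:
            dxy = edit_distance(x, y)
            if dxy < 0:                         # i. non-negativity
                return False
            if (dxy == 0) != (x == y):          # ii. identity of indiscernibles
                return False
            if dxy != edit_distance(y, x):      # iii. symmetry
                return False
            for z in strings:                   # iv. triangle inequality
                if dxy > edit_distance(x, z) + edit_distance(z, y):
                    return False
    return True


# All strings of length 0 to 3 over the alphabet {A, C}.
strings = ["".join(p) for k in range(4) for p in product("AC", repeat=k)]
print(is_metric_on(strings))
```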

  12. Outline
      1. Edit distance recap
      2. Global alignment
      3. Fitting alignment
      4. Local alignment
      5. Gapped alignment

      Reading:
      • Jones and Pevzner. Chapters 6.6, 6.8 and 6.9

  13. Biological Sequence Alignment
      • Weighted edit distance: find alignment with minimum distance
        • Shortest path in weighted edit graph
      • Sequence alignment: find alignment with maximum similarity
        • Longest path in weighted edit graph
      • Score function: $\delta : (\Sigma \cup \{-\})^2 \to \mathbb{R}$

      | Operation      | Alignment column    | Score              |
      |----------------|---------------------|--------------------|
      | deletion       | $v_i$ over $-$      | $\delta(v_i, -)$   |
      | insertion      | $-$ over $w_j$      | $\delta(-, w_j)$   |
      | mismatch/match | $v_i$ over $w_j$    | $\delta(v_i, w_j)$ |

      [Figure: edit graph with rows labeled by $\mathbf{v}$ = ATGT (indices 0 to 4) and columns labeled by $\mathbf{w}$ = ATCG (indices 0 to 4)]

      Question: What is an example of $\delta$?

  14. Scoring Matrices
      Transitions: interchanges among purines (two rings) or pyrimidines (one ring)
      • A <--> G
      • C <--> T
      Transversions: interchanges between purines (two rings) and pyrimidines (one ring)
      • A <--> C, A <--> T
      • G <--> C, G <--> T
      Transitions are more likely than transversions!

      [Figure: diagram of the four nucleotides A, C, G and T]

  15. Scoring Matrices
      Transitions: interchanges among purines (two rings) or pyrimidines (one ring)
      • A <--> G
      • C <--> T
      Transversions: interchanges between purines (two rings) and pyrimidines (one ring)
      • A <--> C, A <--> T
      • G <--> C, G <--> T
      Transitions are more likely than transversions!

      | $\delta$ | A  | T  | C  | G  | −         |
      |----------|----|----|----|----|-----------|
      | A        | 1  | -2 | -2 | -1 | -1        |
      | T        | -2 | 1  | -1 | -2 | -1        |
      | C        | -2 | -1 | 1  | -2 | -1        |
      | G        | -1 | -2 | -2 | 1  | -1        |
      | −        | -1 | -1 | -1 | -1 | $-\infty$ |
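
The scoring matrix above follows a simple rule: matches score 1, transitions -1, transversions -2, and a gap against a letter -1. A minimal sketch of that rule as a function is given below; the names `delta`, `PURINES` and `PYRIMIDINES` are illustrative, and the values are read off the table on this slide.

```python
PURINES = {"A", "G"}      # two rings
PYRIMIDINES = {"C", "T"}  # one ring


def delta(a, b):
    """Score of aligning symbol a (top row) with symbol b (bottom row)."""
    if a == "-" and b == "-":
        return float("-inf")  # a column of two gaps is not allowed
    if a == "-" or b == "-":
        return -1             # insertion or deletion
    if a == b:
        return 1              # match
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return -1             # transition
    return -2                 # transversion


for a in "ATCG-":
    print(a, [delta(a, b) for b in "ATCG-"])
```

Deriving the scores from the purine/pyrimidine classes, rather than hard-coding 25 entries, keeps the biological motivation visible; a lookup table would work equally well.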

  16. Global Alignment – Needleman-Wunsch Algorithm
      Global Alignment problem: Given strings $\mathbf{v} \in \Sigma^m$ and $\mathbf{w} \in \Sigma^n$ and scoring function $\delta$, find an alignment with maximum score.
      • An alignment is a source-to-sink path in the edit graph
      • An alignment $\mathbf{A} = [a_{i,j}]$ is a $2 \times k$ matrix s.t. (i) $k \in \{\max(m, n), \dots, m + n\}$, (ii) $a_{i,j} \in \Sigma \cup \{-\}$ and (iii) there is no $j \in [k]$ where $a_{1,j} = a_{2,j} = -$

      $$
      s[i, j] = \max \begin{cases}
      0, & \text{if } i = 0 \text{ and } j = 0,\\
      s[i-1, j] + \delta(v_i, -), & \text{if } i > 0 \quad \text{(deletion)},\\
      s[i, j-1] + \delta(-, w_j), & \text{if } j > 0 \quad \text{(insertion)},\\
      s[i-1, j-1] + \delta(v_i, w_j), & \text{if } i > 0 \text{ and } j > 0 \quad \text{(match/mismatch)}.
      \end{cases}
      $$
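
The recurrence on this slide translates directly into a table-filling loop. The following is a minimal sketch that returns only the optimal score; the default score function `simple_delta` (match +1, mismatch -1, gap -1) is a hypothetical placeholder chosen for illustration, and any score function with the same two-argument signature can be passed instead.

```python
def simple_delta(a, b):
    """Hypothetical score function: +1 match, -1 mismatch, -1 gap."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def needleman_wunsch_score(v, w, delta=simple_delta):
    """Return the maximum global alignment score of v and w under delta."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue  # base case: s[0][0] = 0
            options = []
            if i > 0:  # deletion: v_i aligned with a gap
                options.append(s[i - 1][j] + delta(v[i - 1], "-"))
            if j > 0:  # insertion: a gap aligned with w_j
                options.append(s[i][j - 1] + delta("-", w[j - 1]))
            if i > 0 and j > 0:  # match or mismatch
                options.append(s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]))
            s[i][j] = max(options)
    return s[m][n]


print(needleman_wunsch_score("AGT", "AGT"))
```

The table has (m + 1)(n + 1) entries and each is computed in constant time, so the running time and memory are both proportional to m times n.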

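  17. Demonstration
      • http://alfehrest.org/sub/nwa/index.html
      • $\mathbf{v} = \mathrm{ATGTTAT}$ and $\mathbf{w} = \mathrm{ATCGTAC}$.

      | $\delta$ | A  | T  | C  | G  | −         |
      |----------|----|----|----|----|-----------|
      | A        | 1  | -2 | -2 | -1 | -1        |
      | T        | -2 | 1  | -1 | -2 | -1        |
      | C        | -2 | -1 | 1  | -2 | -1        |
      | G        | -1 | -2 | -2 | 1  | -1        |
      | −        | -1 | -1 | -1 | -1 | $-\infty$ |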
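
For readers without access to the online demonstration, the example can be run locally. The sketch below fills the Needleman-Wunsch table for the two strings on this slide using the scoring matrix shown, then traces back one optimal alignment. The names `SCORES` and `global_alignment` are illustrative, and when several alignments are optimal the tie-breaking order (diagonal, then deletion, then insertion) is an arbitrary choice of this sketch.

```python
NEG_INF = float("-inf")

# Scoring matrix from the slide: SCORES[top symbol][bottom symbol].
SCORES = {
    "A": {"A": 1, "T": -2, "C": -2, "G": -1, "-": -1},
    "T": {"A": -2, "T": 1, "C": -1, "G": -2, "-": -1},
    "C": {"A": -2, "T": -1, "C": 1, "G": -2, "-": -1},
    "G": {"A": -1, "T": -2, "C": -2, "G": 1, "-": -1},
    "-": {"A": -1, "T": -1, "C": -1, "G": -1, "-": NEG_INF},
}


def global_alignment(v, w, scores=SCORES):
    """Return (score, top row, bottom row) of one optimal global alignment."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):  # first column: only deletions
        s[i][0] = s[i - 1][0] + scores[v[i - 1]]["-"]
    for j in range(1, n + 1):  # first row: only insertions
        s[0][j] = s[0][j - 1] + scores["-"][w[j - 1]]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s[i][j] = max(
                s[i - 1][j] + scores[v[i - 1]]["-"],           # deletion
                s[i][j - 1] + scores["-"][w[j - 1]],           # insertion
                s[i - 1][j - 1] + scores[v[i - 1]][w[j - 1]],  # match/mismatch
            )

    # Trace back from the sink (m, n) to the source (0, 0).
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + scores[v[i - 1]][w[j - 1]]:
            top.append(v[i - 1])
            bottom.append(w[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and s[i][j] == s[i - 1][j] + scores[v[i - 1]]["-"]:
            top.append(v[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(w[j - 1])
            j -= 1
    return s[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = global_alignment("ATGTTAT", "ATCGTAC")
print(score)
print(top)
print(bottom)
```

The two printed rows form the 2 × k alignment matrix: removing the gap symbols from each row recovers the corresponding input string.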

  18. Outline
      1. Edit distance recap
      2. Global alignment
      3. Fitting alignment
      4. Local alignment
      5. Gapped alignment

      Reading:
      • Jones and Pevzner. Chapters 6.6, 6.7 and 6.9
      • Lecture notes

  19. Next Generation Sequencing (NGS) Technology
      [Figure: plot on a log scale (axis ticks from 1,000 to 100,000,000) with the NGS era marked; dated November, 2017]

  20. NGS Characterized by Short Reads
      Genome (millions to billions of nucleotides) → next-generation DNA sequencing → 10-100's of millions of short reads
      Short read: 100 nucleotides
      Example reads: … CATTCAGTAG …, … AGCCATTAG …, … GGTAGTTAG …, … GGTAAACTAG …, … TATAATTAG …, … CGTACCTAG …
      Allow for inexact matches due to:
      • Sequencing errors
      • Polymorphisms/mutations in reference genome

  21. NGS Characterized by Short Reads (continued)
      The human reference genome is 3,300,000,000 nucleotides, while a short read is 100 nucleotides. Global sequence alignment will not work!
      Question: How to account for the discrepancy between the lengths of the reference and a short read?
