  1. Maximal Common Subsequence Enumeration [1]: How Graph Structure Helped Solve a String Problem. Giulia Punzi, PhD Student in Computer Science, Department of Computer Science. Mauriana Pesaresi PhD Seminars – April 20th 2020. [1] A. Conte, R. Grossi, G. Punzi, T. Uno; “Maximal Common Subsequence Enumeration”, SPIRE 2019. Giulia Punzi MCS Enumeration Pesaresi Seminar – April 20th 2020 1 / 24

  2. Introduction: Outline. 1. Strings and Graphs 2. Our String Problem: Enumerating Maximal Common Subsequences 3. Why is it hard? 4. A Change of Perspective: Graphs 5. Conclusions and Future Work

  3. Introduction: Strings and Graphs. Strings and graphs are both ubiquitous in Computer Science. Strings: most information is textual. Graphs: essential to represent relationships and network structure.

  5. Introduction: Combining Strings and Graphs. Oftentimes, the two structures are combined: ◮ Bioinformatics: DNA sequences are represented with de Bruijn graphs; ◮ Search Engines: textual information is naturally linked with a graph structure; ◮ DFAs: graphs which correspond to regular languages. ↓ We will study one instance where a difficult string problem was solved using the underlying graph structure: Maximal Common Subsequence Enumeration.

  8. Introduction: Maximal Common Subsequences. Given an alphabet Σ, a string is a concatenation of any number of its characters. A subsequence S of a string X, denoted S ⊂ X, is a string obtained from X by removing any number of not necessarily contiguous characters. Definition: Given X, Y over Σ, a Longest Common Subsequence (LCS) between them is a common subsequence of maximum length. Definition (Sakai 2018): Given X, Y over Σ, a string S is a Maximal Common Subsequence of X and Y, denoted S ∈ MCS(X, Y), if 1. S ⊂ X and S ⊂ Y; 2. S ⊂ W with W ⊂ X, W ⊂ Y ⇒ S = W.
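The two conditions in the MCS definition can be checked directly. The sketch below is only an illustration of the definition, not the algorithm from the talk; the function names `is_subsequence` and `is_mcs` are hypothetical. It relies on the observation that if S has a strictly longer common supersequence W, then some string between S and W that is exactly one character longer than S is also common, so testing single-character insertions is enough.

```python
def is_subsequence(s: str, x: str) -> bool:
    """Return True if s can be obtained from x by deleting characters."""
    it = iter(x)
    # `ch in it` advances the iterator, so matches must occur left to right.
    return all(ch in it for ch in s)


def is_mcs(s: str, x: str, y: str) -> bool:
    """Check both conditions of the MCS definition for the string s."""

    def common(w: str) -> bool:
        return is_subsequence(w, x) and is_subsequence(w, y)

    # Condition 1: s must be a common subsequence.
    if not common(s):
        return False
    # Condition 2: no single-character insertion may stay common.
    alphabet = set(x) & set(y)
    return not any(
        common(s[:i] + c + s[i:])
        for i in range(len(s) + 1)
        for c in alphabet
    )


# Strings taken from the example used in the talk.
X, Y = "ATCAGGT", "GACTAT"
print(is_mcs("ACT", X, Y))  # common, but contained in ACAT
print(is_mcs("GT", X, Y))   # short, yet maximal
```

Note that maximality is about inclusion, not length: GT is much shorter than ACAT, yet nothing can be inserted into it.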

  10. Introduction: Maximal Common Subsequences. Example: Let Σ = { A, C, G, T } and consider X = ATCAGGT, Y = GACTAT; then: 1. S = ACT is a common subsequence of X and Y; 2. MCS(X, Y) = { ACAT, ATAT, GT }.
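The example above is small enough to verify by exhaustive search. The following brute-force sketch (exponential time, so suitable only for very short strings, and unrelated to the efficient algorithm the talk is about) lists every common subsequence and keeps those not contained in another one. The function names are hypothetical.

```python
from itertools import combinations


def is_subsequence(s: str, x: str) -> bool:
    """Return True if s can be obtained from x by deleting characters."""
    it = iter(x)
    return all(ch in it for ch in s)


def common_subsequences(x: str, y: str) -> set:
    """All distinct common subsequences, by enumerating subsequences of x."""
    result = set()
    for k in range(len(x) + 1):
        # combinations() keeps the original order, so each tuple is a subsequence.
        for chars in combinations(x, k):
            s = "".join(chars)
            if is_subsequence(s, y):
                result.add(s)
    return result


def mcs_bruteforce(x: str, y: str) -> set:
    """Common subsequences that are not contained in a different common one."""
    common = common_subsequences(x, y)
    return {
        s for s in common
        if not any(s != w and is_subsequence(s, w) for w in common)
    }


print(sorted(mcs_bruteforce("ATCAGGT", "GACTAT")))  # ['ACAT', 'ATAT', 'GT']
```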

  13. Introduction: MCS vs LCS. LCS: one of the main string comparison tools. ↓ Limitation: LCS has a quadratic conditional lower bound (Abboud et al., 2015). MCS are a natural generalization of LCS. ◮ One MCS can be found in O(n log log n) time (Sakai 2018); ◮ MCS might reveal alternative smaller alignments.
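For comparison, the textbook dynamic program for the LCS length is sketched below. It is the standard method rather than anything specific to the talk, and its O(|X|·|Y|) running time is the kind of quadratic cost the conditional lower bound refers to. The name `lcs_length` is hypothetical.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence, using two rows of the DP table."""
    # prev[j] holds the LCS length of the prefix of x read so far and y[:j].
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# On the strings from the talk's example the LCS has length 4 (e.g. ACAT),
# while the maximal common subsequence GT has length only 2.
print(lcs_length("ATCAGGT", "GACTAT"))  # 4
```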

  17. Our Aim: Efficient MCS Enumeration. Enumeration algorithm: it lists every element of a given set exactly once. Polynomial delay: the delay between the output of consecutive solutions is polynomial. Problem (MCS Enumeration): List all distinct maximal common subsequences S ∈ MCS(X, Y), for X, Y of length O(n) over Σ of size σ, with polynomial delay. Note that by distinct we mean as elements of the set MCS(X, Y): strings with multiple occurrences need to be output only once.

  26. Our Aim: MCS Enumeration. Example (Enumeration): X = TAAGCC, Y = TAGACT. Output: ◮ TAGC ◮ TAAC
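The requirement that each string be output once matters because the same string can occur in X or Y in several ways. As a rough illustration (the helper name `count_embeddings` is hypothetical and not part of the talk), the sketch below counts the distinct ways a string can be matched inside another one.

```python
def count_embeddings(s: str, x: str) -> int:
    """Number of distinct position choices that spell s as a subsequence of x."""
    # ways[k] = number of ways to match the prefix s[:k] in the part of x read so far.
    ways = [1] + [0] * len(s)
    for ch in x:
        # Go right to left so one character of x extends each match at most once.
        for k in range(len(s), 0, -1):
            if s[k - 1] == ch:
                ways[k] += ways[k - 1]
    return ways[len(s)]


# TAGC can be read off X = TAAGCC in several ways (two choices of A, two of C),
# but an enumeration algorithm should still report it a single time.
print(count_embeddings("TAGC", "TAAGCC"))  # 4
print(count_embeddings("TAGC", "TAGACT"))  # 1
```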

  29. Pitfalls of MCS Enumeration. 1. Using a divide and conquer approach: MCS do not naturally combine. Example: X = AGA | TGA, Y = TAG | GAT, MCS(X, Y) = { AGGA, AGAT, TGA }: the combination AGT of the two submaximals AG and T (shown in blue on the slide) is not maximal.
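The divide and conquer pitfall can be reproduced with a small brute-force experiment on the strings of this example. The code is only a sketch for short inputs, with hypothetical function names: it computes the MCS of the left halves and of the right halves, concatenates them, and compares the result with the MCS of the whole strings.

```python
from itertools import combinations


def is_subsequence(s: str, x: str) -> bool:
    """Return True if s can be obtained from x by deleting characters."""
    it = iter(x)
    return all(ch in it for ch in s)


def mcs_bruteforce(x: str, y: str) -> set:
    """Exhaustive MCS computation; exponential, for tiny strings only."""
    common = {
        "".join(chars)
        for k in range(len(x) + 1)
        for chars in combinations(x, k)
        if is_subsequence("".join(chars), y)
    }
    return {
        s for s in common
        if not any(s != w and is_subsequence(s, w) for w in common)
    }


left = mcs_bruteforce("AGA", "TAG")    # MCS of the left halves
right = mcs_bruteforce("TGA", "GAT")   # MCS of the right halves
combined = {l + r for l in left for r in right}
whole = mcs_bruteforce("AGATGA", "TAGGAT")

print(sorted(combined))  # ['AGGA', 'AGT']
print(sorted(whole))     # ['AGAT', 'AGGA', 'TGA']
```

Concatenating the halves produces AGT, which is contained in AGAT and hence not maximal, and it misses both AGAT and TGA.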

  31. Pitfalls of MCS Enumeration. 1. Using a divide and conquer approach: MCS do not naturally combine. 2. Thinking that the number of MCS is small: MCS can be exponential even for |Σ| = 2. Example: The two strings X = A ◦ (CCA)^n and Y = A ◦ (CA)^⌊3n/2⌋ have an exponential number of MCS.
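The family of strings in this example is easy to generate, and for very small n the MCS can be counted by exhaustive search. The sketch below does exactly that. It is a brute-force experiment with hypothetical function names, it does not prove the exponential growth stated on the slide, and it becomes impractical after a handful of values of n.

```python
from itertools import combinations


def is_subsequence(s: str, x: str) -> bool:
    """Return True if s can be obtained from x by deleting characters."""
    it = iter(x)
    return all(ch in it for ch in s)


def hard_pair(n: int):
    """Build X = A(CCA)^n and Y = A(CA)^floor(3n/2) from the example."""
    return "A" + "CCA" * n, "A" + "CA" * (3 * n // 2)


def mcs_bruteforce(x: str, y: str) -> set:
    """Exhaustive MCS computation; exponential, for tiny strings only."""

    def common(w: str) -> bool:
        return is_subsequence(w, x) and is_subsequence(w, y)

    candidates = {
        "".join(chars)
        for k in range(len(y) + 1)
        for chars in combinations(y, k)
        if is_subsequence("".join(chars), x)
    }
    alphabet = set(x) & set(y)
    # A common subsequence is maximal if no single insertion keeps it common.
    return {
        s for s in candidates
        if not any(
            common(s[:i] + c + s[i:])
            for i in range(len(s) + 1)
            for c in alphabet
        )
    }


for n in range(1, 5):
    x, y = hard_pair(n)
    print(n, len(x), len(y), len(mcs_bruteforce(x, y)))
```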

  32. Pitfalls of MCS Enumeration. 1. Using a divide and conquer approach: MCS do not naturally combine. 2. Thinking that the number of MCS is small: MCS can be exponential even for |Σ| = 2. 3. Using an incremental approach? Let X and Y be any two strings; is it true that MCS(X, Y) ◦ c ↔ MCS(X, Y ◦ c)?

  33. Pitfalls of MCS Enumeration: Incremental Approach is Inefficient. Some incremental properties can be derived, but they are intrinsically inefficient.
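To see why the incremental question has no simple answer, one can compare MCS(X, Y) ◦ c with MCS(X, Y ◦ c) on a tiny instance. The strings X = CA, Y = A and the character c = C below are an illustrative choice made for this sketch and do not come from the talk; the function names are hypothetical as well.

```python
from itertools import combinations


def is_subsequence(s: str, x: str) -> bool:
    """Return True if s can be obtained from x by deleting characters."""
    it = iter(x)
    return all(ch in it for ch in s)


def mcs_bruteforce(x: str, y: str) -> set:
    """Exhaustive MCS computation; exponential, for tiny strings only."""
    common = {
        "".join(chars)
        for k in range(len(x) + 1)
        for chars in combinations(x, k)
        if is_subsequence("".join(chars), y)
    }
    return {
        s for s in common
        if not any(s != w and is_subsequence(s, w) for w in common)
    }


# Illustrative instance (an assumption of this sketch, not from the talk).
X, Y, c = "CA", "A", "C"

before = mcs_bruteforce(X, Y)        # MCS(X, Y)
appended = {s + c for s in before}   # MCS(X, Y) with c appended
after = mcs_bruteforce(X, Y + c)     # MCS(X, Y + c)

print(sorted(before), sorted(appended), sorted(after))
```

Here appending c to the old solution yields AC, which is not even a subsequence of X, while the new instance gains the solution C, which extends none of the old ones.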
