slide-1
SLIDE 1

Maximal Common Subsequence Enumeration [1]

How Graph Structure Helped Solve a String Problem

Giulia Punzi, PhD Student in Computer Science

Department of Computer Science

Mauriana Pesaresi PhD Seminars – April 20th 2020

  • [1] A. Conte, R. Grossi, G. Punzi, T. Uno; “Polynomial-Delay Enumeration of Maximal Common Subsequences”, SPIRE 2019.

Giulia Punzi MCS Enumeration Pesaresi Seminar – April 20th 2020 1 / 24

slide-2
SLIDE 2

Introduction

Outline

  1. Strings and Graphs
  2. Our String Problem: Enumerating Maximal Common Subsequences
  3. Why is it hard?
  4. A Change of Perspective: Graphs
  5. Conclusions and Future Work

slide-3
SLIDE 3

Introduction

Strings and Graphs

(figure: example string a b a c b)

Strings and graphs are both ubiquitous in Computer Science.

  • Strings: most information is textual.
  • Graphs: essential to represent relationships and network structure.

slide-5
SLIDE 5

Introduction

Combining Strings and Graphs

Oftentimes, the two structures are combined:

◮ Bioinformatics: DNA sequences are represented with de Bruijn graphs;
◮ Search engines: textual information is naturally linked through a graph structure;
◮ DFAs: graphs that correspond to regular languages.

↓

We will study one instance where a difficult string problem was solved using the underlying graph structure: Maximal Common Subsequence Enumeration.

slide-8
SLIDE 8

Introduction

Maximal Common Subsequences

Given an alphabet Σ, a string is a concatenation of any number of its characters.

A subsequence of a string X, denoted S ⊂ X, is a string obtained from X by removing any number of characters, which need not be contiguous.

Definition

Given X, Y over Σ, a Longest Common Subsequence (LCS) between them is a common subsequence of maximum length.

Definition (Sakai 2018)

Given X, Y over Σ, a string S is a Maximal Common Subsequence of X and Y, denoted S ∈ MCS(X, Y), if

  1. S ⊂ X and S ⊂ Y;
  2. S ⊂ W with W ⊂ X, W ⊂ Y ⇒ S = W.
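
The two conditions of this definition can be checked directly on tiny inputs. The following is a minimal brute-force sketch in Python, not the method of the talk: the function names are illustrative assumptions, and the code enumerates every subsequence of X, so it takes exponential time.

```python
from itertools import combinations


def is_subsequence(s, x):
    """Return True if s can be obtained from x by deleting characters."""
    it = iter(x)
    return all(ch in it for ch in s)  # each lookup consumes the iterator


def mcs_bruteforce(x, y):
    """Set of maximal common subsequences of x and y (exponential time)."""
    # All distinct subsequences of x, including the empty string.
    subs = {"".join(c) for r in range(len(x) + 1) for c in combinations(x, r)}
    # Condition 1: keep those that are also subsequences of y.
    common = {s for s in subs if is_subsequence(s, y)}
    # Condition 2: drop any s strictly contained in another common subsequence.
    return {s for s in common
            if not any(s != w and is_subsequence(s, w) for w in common)}
```

For instance, `mcs_bruteforce("ATCAGGT", "GACTAT")` evaluates to the set containing ACAT, ATAT and GT.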

slide-10
SLIDE 10

Introduction

Maximal Common Subsequences

Example

Let Σ = {A, C, G, T} and consider

X = ATCAGGT
Y = GACTAT

then:

  1. S = ACT is a common subsequence of X and Y;
  2. MCS(X, Y) = {ACAT, ATAT, GT}.

slide-13
SLIDE 13

Introduction

MCS vs LCS

LCS: one of the main string comparison tools

↓

Limitation: LCS has a quadratic conditional lower bound (Abboud et al., 2015).

MCS are a natural generalization of LCS:

◮ One MCS can be found in O(n log log n) time (Sakai 2018);
◮ MCS might reveal alternative smaller alignments.
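
For reference, the quadratic cost usually associated with LCS is that of the textbook dynamic program, sketched here in Python. This is the standard technique and is not taken from the slides; `lcs_length` is an illustrative name.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence, in O(|x| * |y|) time."""
    prev = [0] * (len(y) + 1)  # row for the empty prefix of x
    for a in x:
        cur = [0]  # column for the empty prefix of y
        for j, b in enumerate(y, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)  # extend a match
            else:
                cur.append(max(prev[j], cur[j - 1]))  # skip a character
        prev = cur
    return prev[-1]
```

Every cell of the |x| by |y| table is filled once, which is where the quadratic running time comes from.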

slide-17
SLIDE 17

Our Aim: Efficient MCS Enumeration

Enumeration algorithm: lists every element of a given set exactly once.

Polynomial delay: the time between the output of two consecutive solutions is polynomial in the input size.

Problem (MCS Enumeration)

List all distinct maximal common subsequences S ∈ MCS(X, Y), for X, Y of length O(n) over an alphabet Σ of size σ, with polynomial delay.

Note that “distinct” refers to elements of the set MCS(X, Y): a string with multiple occurrences must be output only once.

slide-26
SLIDE 26

Our Aim: MCS Enumeration

Example (Enumeration)

X = TAAGCC
Y = TAGACT

Output:

◮ TAGC
◮ TAAC

slide-29
SLIDE 29

Pitfalls of MCS Enumeration

  1. Using a divide-and-conquer approach

MCS do not naturally combine.

Example

X = AGA|TGA
Y = TAG|GAT

MCS(X, Y) = {AGGA, AGAT, TGA}: the combination AGT of the two highlighted maximal subsequences of the halves, AG ∈ MCS(AGA, TAG) and T ∈ MCS(TGA, GAT), is not maximal, since AGT ⊂ AGAT.

slide-31
SLIDE 31

Pitfalls of MCS Enumeration

  2. Thinking that there are only a few MCS

The number of MCS can be exponential even for |Σ| = 2.

Example

The two strings

X = A ◦ (CCA)^n,  Y = A ◦ (CA)^⌊3n/2⌋

have an exponential number of MCS.

slide-32
SLIDE 32

Pitfalls of MCS Enumeration

  3. Using an incremental approach?

Let X and Y be any two strings; can MCS(X, Y ◦ c) be derived directly from MCS(X, Y), that is, MCS(X, Y) ◦ c ↔ MCS(X, Y ◦ c)?

slide-35
SLIDE 35

Pitfalls of MCS Enumeration

Incremental Approach is Inefficient

Some incremental properties can be derived, but they are intrinsically inefficient.

Example

X = ACCACCACCA
Z = ACACACACA

Consider X and Y = Z ◦ Z, and proceed incrementally over Y. Since X ⊂ Y, MCS(X, Y) = {X}; but at half length, |MCS(X, Z)| is exponential.

→ This leads to an exponential-delay algorithm!

slide-39
SLIDE 39

Challenge: Polynomial Delay?

Goal: design a polynomial-delay enumeration algorithm for MCS.

Idea: instead of finding maximals of the prefixes, we find prefixes of maximals.

Definition

A common subsequence P ⊂ X, Y is called a valid prefix if ∃W such that P ◦ W ∈ MCS(X, Y).

If we have a characterization of valid prefixes, we can build increasingly long prefixes of maximals by appending valid characters, until we have generated all MCS.

slide-43
SLIDE 43

Unshiftable Edges

Bipartite String Graph

Definition (String Graphs and Mappings)

Given two strings X, Y, the corresponding Bipartite String Graph (BSG) is the bipartite graph G(X, Y) that has one vertex for each position of X and of Y, and edge set E = {(i, j) | X[i] = Y[j]}.

A mapping of a string graph is a subset of the edges P ⊆ E such that ∀(i, j), (h, k) ∈ P we have i ≤ h ⇔ j ≤ k.

Example

The BSG for the two strings X = CGATA and Y = GCTGA is given by

(figure: BSG with top row C G A T A and bottom row G C T G A)

A mapping of the graph is shown in blue.
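
The definitions of BSG and mapping translate almost literally into code. The sketch below is a minimal Python illustration under stated assumptions: positions are 0-indexed, and the names `bsg_edges`, `is_mapping` and `spelled_word` are hypothetical helpers, not part of the talk.

```python
def bsg_edges(x, y):
    """Edge set E = {(i, j) | x[i] == y[j]} of the bipartite string graph."""
    return {(i, j) for i, a in enumerate(x) for j, b in enumerate(y) if a == b}


def is_mapping(edges):
    """True if no two edges cross or share an endpoint."""
    ordered = sorted(edges)
    # After sorting, it suffices to compare consecutive edges: both
    # coordinates must strictly increase along the whole sequence.
    return all(i < h and j < k for (i, j), (h, k) in zip(ordered, ordered[1:]))


def spelled_word(x, mapping):
    """Common subsequence spelled by a mapping, read along x."""
    return "".join(x[i] for i, _ in sorted(mapping))
```

With X = CGATA and Y = GCTGA the graph has six edges; one possible mapping (chosen here only for illustration) is {(0, 1), (1, 3), (2, 4)}, which spells CGA.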

slide-45
SLIDE 45

Unshiftable Edges

Maximal Mappings and MCS

Definition

A mapping P of a BSG is said to be maximal if adding any edge (i, j) to P no longer yields a mapping.

Example (MCS ≠ maximal mappings)

Every MCS corresponds to a maximal mapping, but the converse does not hold. Consider X = AGG and Y = AGAG:

(figure: BSG with top row A G G and bottom row A G A G, with one mapping in blue)

The blue mapping is maximal, but it does not correspond to any MCS.
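
This gap between maximal mappings and MCS can be reproduced on the same two strings. The Python sketch below uses 0-indexed positions and hypothetical helper names; the specific mapping tested is an assumption made for illustration, since the highlighted mapping of the slide cannot be recovered from the text.

```python
def bsg_edges(x, y):
    """Edge set E = {(i, j) | x[i] == y[j]} of the bipartite string graph."""
    return {(i, j) for i, a in enumerate(x) for j, b in enumerate(y) if a == b}


def is_mapping(edges):
    """True if no two edges cross or share an endpoint."""
    ordered = sorted(edges)
    return all(i < h and j < k for (i, j), (h, k) in zip(ordered, ordered[1:]))


def is_maximal_mapping(mapping, x, y):
    """True if mapping is a mapping and no further edge can be added to it."""
    mapping = set(mapping)
    others = bsg_edges(x, y) - mapping
    return is_mapping(mapping) and not any(
        is_mapping(mapping | {e}) for e in others)


x, y = "AGG", "AGAG"
candidate = {(0, 0), (1, 3)}  # assumed example: spells "AG"
print(is_maximal_mapping(candidate, x, y))
```

The candidate mapping cannot be extended, yet it spells AG, which is strictly contained in the common subsequence AGG and is therefore not an MCS.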

slide-48
SLIDE 48

Unshiftable Edges

Definition

Definition

Let I_X(i) be the substring X[i + 1, ..., next_X(i)] (analogously for Y).

An edge (i, j) is unshiftable ((i, j) ∈ U) if and only if either

◮ (Base case) it corresponds to the last pairwise occurrence in the strings of the character X[i] = Y[j]; or
◮ (Otherwise) there is at least one unshiftable edge in G(I_X(i), I_Y(j)).

(figure: an edge (i, j) for a character c, with the following occurrences of c in X and Y delimiting I_X(i) and I_Y(j))

slide-51
SLIDE 51

Unshiftable Edges

Example

Intuition: every unshiftable edge belongs to a maximal mapping in which it cannot be “pushed further right” while spelling the same word.

Example

(figure: BSG with unshiftable edges highlighted; character labels as extracted: G A T A C A A G G A T C A T)

slide-53
SLIDE 53

Unshiftable Edges

Still not enough

Example (MCS ≠ maximal unshiftable mappings)

Consider X = AAGAAG, Y = AAGA. In the corresponding graph, we have a maximal rightmost unshiftable mapping for the string AAG:

(figure: BSG with top row A A G A A G and bottom row A A G A)

even though this word is not maximal: the only MCS is the whole string Y = AAGA.

slide-56
SLIDE 56

Extending the Prefix

Candidate Extensions

For a valid prefix P, we formally define a set Ext_P of candidate extensions: being a candidate is necessary for being a valid extension.

Intuition: “first unshiftable edges after P”.

Example (The condition is not sufficient)

Consider X = AGAGC, Y = AAGCAG. We have MCS(X, Y) = {AGAG, AAGC}. Clearly, P = AG is a valid prefix.

(figure: BSG with top row A G A G C and bottom row A A G C A G)

The edge for C is in Ext_P, but AGC is not a valid prefix.

slide-57
SLIDE 57

Correctness of Extension

Theorem (Correctness)

Let P be a valid prefix of some M ∈ MCS(X, Y). Then P ◦ c is still a valid prefix if and only if the following two conditions hold:

  1. ∃(i, j) ∈ Ext_P corresponding to character c;
  2. P ∈ MCS(X_{<i}, Y_{<j}).

(figure: strings X and Y with the mapping of P, followed by the edge (i, j) for character c)

slide-61
SLIDE 61

The Algorithm

Binary Partition Paradigm

Binary partition: an enumeration scheme based on iteratively partitioning the set of solutions.

Partition the solutions into smaller sets characterized by disjoint properties, until we get to singletons → we obtain a tree whose leaves are all and only the feasible solutions.

Property P ≡ having string P as a prefix.

↓

Branching into possibly |Σ| partitions.

Complexity: if the partition oracle takes polynomial time and the height of the tree is polynomial, then the algorithm has polynomial delay.
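
The shape of this recursion can be sketched in a few lines of Python. The code below only illustrates the partition tree: its oracle is an exponential brute-force stand-in (an assumption made to keep the example self-contained), not the polynomial-time test of the talk, and all function names are hypothetical.

```python
from itertools import combinations


def is_subsequence(s, x):
    """Return True if s can be obtained from x by deleting characters."""
    it = iter(x)
    return all(ch in it for ch in s)


def mcs_bruteforce(x, y):
    """All maximal common subsequences of x and y (exponential time)."""
    subs = {"".join(c) for r in range(len(x) + 1) for c in combinations(x, r)}
    common = {s for s in subs if is_subsequence(s, y)}
    return {s for s in common
            if not any(s != w and is_subsequence(s, w) for w in common)}


def enumerate_mcs_by_prefix(x, y):
    """Yield each MCS exactly once by growing valid prefixes."""
    solutions = mcs_bruteforce(x, y)  # stand-in oracle, NOT polynomial
    alphabet = sorted(set(x) & set(y))

    def is_valid_prefix(p):
        return any(m.startswith(p) for m in solutions)

    def recurse(p):
        # One child per character that keeps the prefix valid.
        children = [p + c for c in alphabet if is_valid_prefix(p + c)]
        if not children:
            yield p  # leaf: no valid extension, so p is itself an MCS
        for child in children:
            yield from recurse(child)

    yield from recurse("")
```

Each internal node has at most |Σ| children and every root-to-leaf path has length at most n, so with a polynomial-time oracle in place of the brute-force stand-in, the time between two consecutive outputs stays polynomial.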

slide-62
SLIDE 62

The EnumerateMCS Algorithm

procedure EnumerateMCS(X, Y, Σ)
    U = FindUnshiftables((|X|, |Y|))
    BinaryPartition(#, {(−1, −1)})
end procedure

procedure BinaryPartition(P, L_P)
    compute the set of extensions Ext_P using U
    if Ext_P = ∅ then
        Output P
    else
        for (i, j) ∈ Ext_P corresponding to some c ∈ Σ do
            if P ∈ MCS(X_{<i}, Y_{<j}) then
                let (l, m) be the last edge of L_P
                find the leftmost mapping edge (l_c, m_c) for c in G(X_{>l}, Y_{>m})
                BinaryPartition(P ◦ c, L_P ∪ (l_c, m_c))
            end if
        end for
    end if
end procedure

Cost of the individual steps:

◮ FindUnshiftables: for each e ∈ U, find the previous pairwise occurrences of every c ∈ Σ, then add the edge to U if not already present: O(σ n^2 log n) time.
◮ Computing Ext_P: parse the unshiftable edges: O(n^2) time.
◮ Testing P ∈ MCS(X_{<i}, Y_{<j}): this can be done in O(|P|) = O(n) time (Sakai 2018).
◮ Finding the leftmost mapping edge: logarithmic in the length of the strings: O(log n) time.
◮ Partition oracle: O(n^2 + |Ext_P| (n + log n)) = O(n^2) time.

slide-69
SLIDE 69

The EnumerateMCS Algorithm

Final Complexity

Height of the partition tree: O(n) (length of the longest MCS)

⇒ O(n^3) delay, O(σ n^2 log n) preprocessing time, O(n^2) space.

Theorem

There is an O(nσ(σ + log n)) polynomial-delay algorithm for MCS enumeration, with O(n^2(σ + log n)) preprocessing time and O(n^2) space.

slide-73
SLIDE 73

Conclusions and Future Work

◮ We investigated the string problem of enumerating MCS for the first time; it turned out to be hard to approach with standard techniques.
◮ Changing our perspective by looking at the strings as a graph was crucial to derive fundamental properties, and eventually solve the problem.
◮ MCS enumeration is just one of many string problems with interesting applications: a similar shift in perspective might help solve other difficult problems.

Future Work:

◮ Explore further connections between LCS and MCS.
◮ Find other applications of graph-theoretic tools to string problems.

slide-74
SLIDE 74

Thank you for your attention!

Any Questions?

Feel free to email me at giulia.punzi@phd.unipi.it

slide-75
SLIDE 75

References

  • Y. Sakai, “Maximal Common Subsequence Algorithms”; in: 29th Annual Symposium on Combinatorial Pattern Matching, 1-10, 2018.
  • A. Conte, R. Grossi, G. Punzi, T. Uno, “Polynomial-Delay Enumeration of Maximal Common Subsequences”; in: Brisaboa N., Puglisi S. (eds) String Processing and Information Retrieval (SPIRE 2019), Lecture Notes in Computer Science, vol 11811, 2019.

slide-76
SLIDE 76

Pitfalls of MCS

Incremental Approach

Let X and Y′ be any two strings. Consider Y = Y′ ◦ c; MCS(X, Y′) ◦ c ↔ MCS(X, Y)?

Example

◮ (Some MCS are not found) Let

X = AGCG
Y = ACG|C   (Y′ = ACG, c = C)

MCS(X, Y′) = {ACG}: AGC ∈ MCS(X, Y) was not found.

◮ (Some strings found are not MCS) Instead, in

X = AAGACT
Y = AGCAG|C   (Y′ = AGCAG, c = C)

we have AGC ∈ MCS(X, Y′), but AGC ∉ MCS(X, Y).

slide-77
SLIDE 77

Extending the Prefix

The Cross

Let P be a prefix of some W ∈ MCS(X, Y). Given a character c ∈ Σ, we would like to find a necessary and sufficient condition for P ◦ c to still be a valid prefix.

(figure: edge (l, m) followed by the two edges e = (e_1, e_2) and f = (f_1, f_2) of its cross)

Definition (Cross)

Given an edge (l, m), its following cross χ_(l,m) = {e, f} is given by (at most) two edges such that:

◮ e = (e_1, e_2) ∈ U is such that e_1 = min{h_1 > l | ∃h_2 > m : (h_1, h_2) ∈ U};
◮ f = (f_1, f_2) ∈ U is such that f_2 = min{h_2 > m | ∃h_1 > l : (h_1, h_2) ∈ U}.

slide-78
SLIDE 78

Extending the Prefix

Candidate Extensions

Definition

Let P be a prefix of some MCS, with its leftmost mapping L_P ending at edge (l, m), and let χ_(l,m) = {e, f} be its cross. We define the set of the “Mikado” edges after P as

Mk_P = {(i, j) ∈ U | e_1 ≤ i ≤ f_1 and f_2 ≤ j ≤ e_2}.

From these we extract the candidate extensions for P as follows:

Ext_P = {(i, j) ∈ Mk_P | ∄(h, k) ∈ Mk_P \ {(i, j)} such that h ≤ i and k ≤ j}.

Figure: the set Mk_P transformed into Ext_P
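
Reading the candidate extensions as the “first” Mikado edges, that is, assuming the condition means that no other Mikado edge lies weakly before an edge on both strings, the extraction step is a simple dominance filter. A minimal Python sketch under that assumption (the function name and the sample edges are illustrative, not from the talk):

```python
def candidate_extensions(mikado):
    """Keep the edges that no other edge precedes on both strings."""
    return {
        (i, j)
        for (i, j) in mikado
        # (h, k) dominates (i, j) if it is a different edge with h <= i, k <= j
        if not any((h, k) != (i, j) and h <= i and k <= j for (h, k) in mikado)
    }


# Made-up Mikado set: (4, 4) is preceded by (3, 3) on both strings.
sample = {(2, 5), (3, 3), (4, 4), (5, 2)}
print(sorted(candidate_extensions(sample)))
```

This quadratic scan over the edge set is enough for illustration; it makes no attempt to match the running time stated for the algorithm of the talk.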

slide-79
SLIDE 79

Correctness

findR procedure

Given (i, j) ∈ U, findR(i, j) returns a maximal mapping in G(X_{>i}, Y_{>j}).

Lemma 1

Let P be a valid prefix with leftmost mapping ending with edge (l, m), and let (i, j) ∈ Ext_P. Then, findR(i, j) returns a mapping whose corresponding subsequence is M ∈ MCS(X_{>l}, Y_{>m}).

(figure: strings X and Y with the mapping L_P ending at (l, m), followed by the edge (i, j) and the mapping returned by findR(i, j))

slide-80
SLIDE 80

Correctness

MCS Combination

Theorem (MCS Combination)

Let P and C be common subsequences of X, Y. Let (l, m) be the last edge of the leftmost mapping of P, and (i, j) be the first edge of the rightmost mapping of C. Then:

P ◦ C ∈ MCS(X, Y) ⇔ P ∈ MCS(X_{<i}, Y_{<j}) and C ∈ MCS(X_{>l}, Y_{>m}).

(figure: strings X and Y with the leftmost mapping L_P ending at (l, m) and the rightmost mapping R_C starting at (i, j))

slide-81
SLIDE 81

Correctness

Correctness Theorem

Theorem (Correctness)

Let P be a valid prefix of some M ∈ MCS(X, Y), with leftmost mapping L_P ending with edge (l, m). Then P ◦ c is still a valid prefix if and only if the following two conditions hold:

  1. ∃(i, j) ∈ Ext_P corresponding to character c;
  2. P ∈ MCS(X_{<i}, Y_{<j}).

Proof.

The necessity of the two conditions was discussed in the previous sections. For sufficiency, we know that findR(i, j) = C ∈ MCS(X_{>l}, Y_{>m}). By hypothesis P ∈ MCS(X_{<i}, Y_{<j}), therefore by the MCS combination theorem we have P ◦ C ∈ MCS(X, Y). This latter string starts with P ◦ c, which is therefore a valid prefix.
