Superstring Graph in compact space Bastien Cazaux and Eric Rivals - - PowerPoint PPT Presentation




slide-1
SLIDE 1

Superstring Graph in compact space

Bastien Cazaux and Eric Rivals∗

∗ LIRMM & IBC, Montpellier

February 5, 2020 (DSB 2020)

Bastien Cazaux and Eric Rivals∗ 1 / 1

slide-2
SLIDE 2

Superstring Problems


slide-3
SLIDE 3

Linear and Cyclic words

a b b a b b b a b b a b


slide-4
SLIDE 4

Notation

Definition [Gusfield 1997] Let w be a string.

◮ A substring of w is a string included in w.
◮ A prefix of w is a substring that begins w.
◮ A suffix of w is a substring that ends w.
◮ An overlap from w over v is a suffix of w that is also a prefix of v.

w = ababbabaaa



slide-11
SLIDE 11

Notation

Definition [Gusfield 1997] Let w be a string.

◮ A substring of w is a string included in w.
◮ A prefix of w is a substring that begins w.
◮ A suffix of w is a substring that ends w.
◮ An overlap from w over v is a suffix of w that is also a prefix of v.

w = ababbabaaa
v = abaaabbbb
u = ov(w,v) = abaaa

Merge of w and v: pr(w,v) ov(w,v) su(w,v) = ababb abaaa bbbb, with pr(w,v) = ababb and su(w,v) = bbbb.
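The notation above can be checked on the slide's own example with a short Python sketch. The function names ov, pr, su, and merge mirror the slide's symbols; taking ov as the longest overlap and the quadratic scan are assumptions made for illustration, not the authors' implementation.

```python
def ov(w, v):
    """Longest suffix of w that is also a prefix of v (empty string if none)."""
    for k in range(min(len(w), len(v)), 0, -1):
        if w.endswith(v[:k]):
            return v[:k]
    return ""

def pr(w, v):
    """Part of w that precedes the overlap with v."""
    return w[:len(w) - len(ov(w, v))]

def su(w, v):
    """Part of v that follows the overlap with w."""
    return v[len(ov(w, v)):]

def merge(w, v):
    """Shortest string that has w as a prefix and v as a suffix."""
    return pr(w, v) + ov(w, v) + su(w, v)

w, v = "ababbabaaa", "abaaabbbb"
print(ov(w, v), pr(w, v), su(w, v), merge(w, v))
```

On this input the overlap is abaaa, so the merge saves five symbols compared with plain concatenation.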


slide-12
SLIDE 12

Superstring

Definition Let P = {s1, s2, ..., sp} be a set of strings. A superstring of P is a string w such that every si is a substring of w.

Example: w = abaaba (positions 1 to 6), with s1, s2, and s3 aligned under w as substrings (letters shown: a a b a b a b a).
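As a minimal sketch, the definition translates directly into a membership test. The set P used here is a hypothetical example chosen for illustration; the flattened figure does not show unambiguously how the letters split into s1, s2, and s3.

```python
def is_superstring(w, words):
    """True if every string in `words` occurs as a substring of w."""
    return all(s in w for s in words)

# Hypothetical input set, not taken from the slide.
P = ["aab", "aba", "ba"]
print(is_superstring("abaaba", P))   # every word of P occurs in abaaba
print(is_superstring("abab", P))     # aab does not occur in abab
```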


slide-18
SLIDE 18

Shortest Superstrings problems

Input: a set of strings P

Shortest Linear Superstring

◮ NP-hard [Gallant 1980]
◮ APX-hard [Blum 1991]
◮ Approximation ratio 2 + 11/30 [Paluch 2015]

Shortest Cyclic Superstring

◮ NP-hard [Cazaux PhD 2016]

Shortest Cyclic Cover of Strings

◮ In O(|P|^3 + ||P||) time [Papadimitriou 1982]
◮ Linear time in ||P|| [Cazaux, Rivals JDA 2016]

slide-19
SLIDE 19

Superstring Graph

One graph to rule them all


slide-20
SLIDE 20

Greedy Algorithm for SCCS

Input words: aab, abba, abaa, ababb

◮ |ov(ababb,abba)| = |abb| = 3: merge ababb and abba into ababba.
◮ |ov(abaa,aab)| = |aa| = 2: merge abaa and aab into abaab.
◮ |ov(abaab,abaab)| = |ab| = 2: close abaab into a cyclic word through its overlap ab with itself.

slide-28
SLIDE 28

Greedy Algorithm for SCCS

a b a a b a b b

Another solution: a b a a b a b b

Theorem [Cazaux et al. 2014] The greedy algorithm solves the Shortest Cyclic Cover of Strings problem exactly.
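The greedy strategy can be sketched in a few lines of Python. This is a naive quadratic illustration under stated assumptions, not the linear-time algorithm cited in the talk: the helper names are hypothetical, overlaps are restricted to proper ones (shorter than both words), ties are broken by input order, and no input word is assumed to be a substring of another.

```python
def proper_overlap(u, v):
    """Length of the longest suffix of u, shorter than both words, that is a prefix of v."""
    for k in range(min(len(u), len(v)) - 1, 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0

def greedy_cyclic_cover(words):
    """Pick pairs by decreasing overlap; return one string per cyclic word of the cover."""
    pairs = sorted(
        ((proper_overlap(u, v), i, j)
         for i, u in enumerate(words) for j, v in enumerate(words)),
        key=lambda t: -t[0],          # stable sort: ties keep input order
    )
    succ, has_pred = {}, set()
    for _, i, j in pairs:
        if i not in succ and j not in has_pred:
            succ[i] = j               # word j follows word i (i == j closes a word on itself)
            has_pred.add(j)
    cover, seen = [], set()
    for start in range(len(words)):
        parts, i = [], start
        while i not in seen:          # walk one cycle of the successor permutation
            seen.add(i)
            j = succ[i]
            k = proper_overlap(words[i], words[j])
            parts.append(words[i][:len(words[i]) - k])
            i = j
        if parts:
            cover.append("".join(parts))
    return cover

cover = greedy_cyclic_cover(["ababb", "abba", "aab", "abaa"])
print(cover, sum(len(c) for c in cover))
```

With this tie-breaking the four words end up in a single cyclic word of length 8, the same total length as the solutions shown on the slide; other tie-breaking choices can give a different cover of the same total length.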


slide-29
SLIDE 29

Extended Hierarchical Overlap Graph (EHOG)

ababb aab abba abaa


slide-30
SLIDE 30

Extended Hierarchical Overlap Graph (EHOG)

Nodes: ababb, aab, abba, abaa (input words); abb, aa, ab, a, ε (overlaps)
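The node set shown on this slide can be reproduced with a small sketch. It assumes that the EHOG nodes are the input words, every non-empty string that is both a proper suffix of one word and a proper prefix of a word, and the empty string; the function name ehog_nodes is hypothetical and the edges of the graph are not built here.

```python
def ehog_nodes(words):
    """Input words, all their pairwise overlaps, and the empty string."""
    prefixes = {w[:k] for w in words for k in range(1, len(w))}   # proper, non-empty
    suffixes = {w[k:] for w in words for k in range(1, len(w))}   # proper, non-empty
    return set(words) | (prefixes & suffixes) | {""}

nodes = ehog_nodes(["ababb", "aab", "abba", "abaa"])
print(sorted(nodes, key=lambda s: (-len(s), s)))
```

For the four example words this yields the overlaps abb, aa, ab, and a, plus ε, matching the nodes listed above.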


slide-34
SLIDE 34

Permutation of words on the EHOG

ababb abba aab abaa ababb aab abba abaa abb aa ab a

ε


slide-36
SLIDE 36

Results on the Superstring Graph

Definition All the solutions of the greedy algorithm for SCCS give the same graph on the EHOG, which is called the Superstring Graph.

Propositions [Cazaux et al. 2015]

◮ The size of the Superstring Graph is linear in the size of the input.
◮ The Superstring Graph can be built in linear time in the size of the input.
◮ A labeled Eulerian cycle of the Superstring Graph is a solution of the greedy algorithm for the Shortest Cyclic Cover of Strings problem.

slide-37
SLIDE 37

Superstring Graph on the EHOG

ababb aab abba abaa abb aa ab a

ε

b a abb aa b


slide-38
SLIDE 38

Why do we want to compute the Superstring Graph?


slide-44
SLIDE 44

Application 1: Compute bounds for optimal solution of SLS

A B C a b c

With a ≥ b and a ≥ c,

A + a + B + C ≤ |Optimal solution of SLS| ≤ A + a + B + b + C + c

Results on real data [Cazaux et al. 2018]

Input: E. coli genome at 50x coverage: 4 503 422 reads (454 845 622 symbols)
Result: length of optimal solutions between 187 250 434 and 187 250 672
A difference of 710 symbols (0.00038%)

slide-45
SLIDE 45

Application 2: Use SG as a genome assembly graph

aa da ae ac cb bb bd eb bf fa a b c d e f ε

Results [Cazaux et al. 2016] We can build a mixed cover that includes the unitigs of the dBG (or variable-order dBG) in O(||P||) time and in space linear in the size of the de Bruijn Graph.

slide-46
SLIDE 46

How can we store the Superstring Graph in compact space?


slide-47
SLIDE 47

Difficulty

The Superstring Graph is a graph with integer values (each value takes log ||P|| bits for a set of strings P).

slide-48
SLIDE 48

Relation between ST and EHOG

[Figure: two trees with numbered nodes and edges labeled a, b, and bb]

Proposition Each node of EHOG(P) is an explicit node of the Suffix Tree of P.


slide-53
SLIDE 53

Use the BWT-AC decomposition

S = {aaa$,aac$,baa$,bbc$,bbb$} 1 6 2 3 9 5 7 4 10 8 a a a c b a a b c b

$ $ $ $ $

b a b a b a a c a

$ $

a b a b c b

$ $ $

BWT(mP)

mP = aaa$caa$aab$bbb$cbb$ 1 2 3 4 5 6 7 8 9 10 1 1 1 1 1 1 1 1 1 1 1

1 1 1 01 01 01 01 1 01 01 1 1 1 1 01 01 1 1 1 01 01 01 Blue(mP) Red(mP)
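The slide does not spell out how mP is obtained from S. One reading that is consistent with the example shown is: sort the strings, reverse each one, and concatenate them with the terminator $. The sketch below only illustrates that reading; the function name build_mp is hypothetical, and neither the BWT itself nor the Blue and Red tables are computed.

```python
def build_mp(strings, end="$"):
    """Sort the strings, reverse each one, and concatenate them with terminators."""
    bare = sorted(s.rstrip(end) for s in strings)   # drop the trailing terminator
    return "".join(w[::-1] + end for w in bare)

S = ["aaa$", "aac$", "baa$", "bbc$", "bbb$"]
print(build_mp(S))
```

On the slide's set S this gives aaa$caa$aab$bbb$cbb$, the string labeled mP above.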


slide-54
SLIDE 54

Use the BWT-AC decomposition

Result For a set of strings P, the sizes of the tables Blue and Red are bounded by ||P|| (= ∑s∈P |s|).

→ By using BWT and BP, we can store the Superstring Graph in O(n log σ) bits.

slide-55
SLIDE 55

How can we build the Superstring Graph in compact space?


slide-58
SLIDE 58

Difficulty

If we only want to compute the tables Blue and Red:

[Figure: nodes t, w, v, v1, v2, u1, u2 with values n(v), n(u1), n(u2), d(v), d(v1), d(v2)]

For now, the algorithm runs in O(||P||) time and O(||P|| log ||P||) bits of space.

Can we find an algorithm that runs in O(||P||) time and O(||P||) bits of space?

slide-59
SLIDE 59

Thank you for your attention
