SLIDE 1

Practical fast on-line exact pattern matching algorithms for highly similar sequences

Nadia Ben Nsira, Thierry Lecroq, Élise Prieur-Gaston

LITIS EA 4108, Normastic FR3638, IRIB, Université de Rouen Normandie, Normandie Université, France

Workshop SeqBio 2018, November 19th, 2018

Ben Nsira, Lecroq, Prieur (LITIS) Similar Sequences SeqBio 2018 1 / 26

SLIDE 2

Table of contents

1. Introduction and notations
2. Search in highly similar sequences

SLIDE 4

Big data

NGS technologies output numerous individual genomes of the same species
More than 99% similar

SLIDE 5

Highly similar sequences

Differ from the reference by: SNVs (SNPs), indels, CNVs, translocations, ...
Common and non-common parts

SLIDE 6

Efficient solutions

Strong need for efficient indexing and pattern matching

SLIDE 7

Pattern matching

Find one (or all the) position(s) of a pattern of length m in a sequence of length n:
with index → O(m)
without index → O(n)

SLIDE 8

Notations

finite alphabet Σ
string x[0 . . m − 1] on Σ∗
length |x| = m
x̃ is the reverse of x (x[m − 1]x[m − 2] · · · x[1]x[0])
x[i . . j] is a factor (substring) of x from position i to position j (both inclusive)
x[0 . . i] is a prefix
x[i . . m − 1] is a suffix
u is a border of x if u is both a prefix and a suffix of x
Border(x) is the longest border of x
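
The border notation can be made concrete with a short sketch. It assumes that borders are proper (u ≠ x), and the names border_table and longest_border are illustrative, not taken from the talk.

```python
def border_table(x: str) -> list[int]:
    """border[i] is the length of the longest proper border of x[0 . . i-1].

    By convention border[0] = -1 (the empty prefix has no border).
    """
    m = len(x)
    border = [-1] * (m + 1)
    k = -1  # length of the border currently being extended
    for i in range(m):
        # Fall back to shorter borders until one can be extended by x[i].
        while k >= 0 and x[k] != x[i]:
            k = border[k]
        k += 1
        border[i + 1] = k
    return border


def longest_border(x: str) -> str:
    """Border(x), under the assumption that a border is proper."""
    return x[: border_table(x)[len(x)]] if x else ""


print(border_table("abaab"))    # [-1, 0, 0, 1, 1, 2]
print(longest_border("abaab"))  # ab
```

The table is computed in O(m) time, since k increases by at most one per symbol.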

SLIDE 9

Sliding window

[Figure: a window of length m, holding the pattern x, slides along the text y of length n]

SLIDE 10

Knuth-Morris-Pratt algorithm (1977)

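[Figure: x is aligned with y at position j; after the comparisons a prefix u of x matches y, followed by b in x and by a in y; x is then shifted so that a border z of u, followed by c, is aligned with the end of u in y]

k = min{ℓ | x[|Border^ℓ(u)|] = a} and z = Border^k(u)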
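
The shift rule can be sketched as follows. This is the Morris-Pratt flavour (plain borders, without the extra mismatch condition of Knuth-Morris-Pratt), and the name mp_search is an assumption made for illustration. The inner while loop plays the role of the iterated Border^ℓ(u) in the formula above.

```python
def mp_search(x: str, y: str) -> list[int]:
    """Return all starting positions of x in y (0-based)."""
    m = len(x)
    if m == 0:
        return []
    # border[i] = length of the longest proper border of x[0 . . i-1]
    border = [-1] * (m + 1)
    k = -1
    for i in range(m):
        while k >= 0 and x[k] != x[i]:
            k = border[k]
        k += 1
        border[i + 1] = k

    occurrences = []
    i = 0  # length of the prefix of x currently matched
    for j, a in enumerate(y):
        # Replace the matched prefix by shorter borders until x[i] == a.
        while i >= 0 and x[i] != a:
            i = border[i]
        i += 1
        if i == m:
            occurrences.append(j - m + 1)
            i = border[m]
    return occurrences


print(mp_search("abaab", "abaabaabab"))  # [0, 3]
```

Each text symbol is read once, so the search phase runs in O(n) time.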

SLIDE 11

Boyer-Moore algorithm (1977)

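[Figure: comparisons go from right to left; a suffix v of x matches y, preceded by b in x and by a in y; x is then shifted so that another occurrence of v in x, preceded by c, is aligned with v in y]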
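
A right-to-left window scan can be sketched with Horspool's simplification of Boyer-Moore. It uses only a bad-character shift on the last window symbol, not the matching-suffix shift drawn on the slide, and the name horspool_search is an assumption made for illustration.

```python
def horspool_search(x: str, y: str) -> list[int]:
    """Return all starting positions of x in y (0-based)."""
    m, n = len(x), len(y)
    if m == 0 or m > n:
        return []
    # shift[c] = distance from the rightmost occurrence of c in x[0 . . m-2]
    # to the end of x; symbols absent from that prefix shift by m.
    shift = {}
    for i in range(m - 1):
        shift[x[i]] = m - 1 - i

    occurrences = []
    j = 0  # left end of the current window in y
    while j <= n - m:
        i = m - 1
        while i >= 0 and x[i] == y[j + i]:  # compare from right to left
            i -= 1
        if i < 0:
            occurrences.append(j)
        j += shift.get(y[j + m - 1], m)
    return occurrences


print(horspool_search("GCAGAGAG", "GCATCGCAGAGAGTATACAGTACG"))  # [5]
```

The worst case is O(mn), but windows are often skipped in large steps on texts over alphabets that are not too small.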

SLIDE 13

Off-line with an index

Huang et al. 2010: O(n + N log N) bits, where n is the total length of common parts in one string and N is the total length of non-common parts in all sequences
Kuruppu et al. 2010: Relative Lempel-Ziv index
Na et al. 2018: FM-index of an alignment
BWBBLE, Huang et al. 2013: practical solution

SLIDE 17

Highly similar sequences

r sequences

     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
y0   A  T  G  C  T  A  G  C  A  A  G  A  T  A  C  A  G
y1   A  T  G  C  T  A  G  C  A  A  C  A  T  A  C  A  G
y2   A  T  G  C  G  A  G  C  A  A  G  A  T  A  C  A  G
y3   A  T  G  C  T  A  G  C  A  A  C  A  T  A  C  A  T

y = A T G C {G, T} A G C A A {C, G} A T A C A {G, T}

G A G C A A C matches y at positions 4 to 10, although it occurs in none of y0, ..., y3

  • R. Grossi, C. S. Iliopoulos, C. Liu, N. Pisanti, S. P. Pissis, A. Retha, G. Rosone, F. Vayani, L. Versari. On-Line Pattern Matching on Similar Texts. 28th Combinatorial Pattern Matching (CPM), Warsaw, Poland (2017) 9:1–9:14

SLIDE 18

Highly similar sequences

r sequences

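     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
y0   A  T  G  C  T  A  G  C  A  A  G  A  T  A  C  A  G
y1   A  T  G  C  T  A  G  C  A  A  C  A  T  A  C  A  G
y2   A  T  G  C  G  A  G  C  A  A  G  A  T  A  C  A  G
y3   A  T  G  C  T  A  G  C  A  A  C  A  T  A  C  A  T

y0 and Z = (({2}, 4, G), ({1, 3}, 10, C), ({3}, 16, T))

Each triple (G, p, c) in Z states that the sequences whose indices are in G have symbol c at position p instead of y0[p].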
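
The (y0, Z) representation can be sketched as below. The sketch assumes 0-based positions and substitutions only, and the name reconstruct is hypothetical.

```python
def reconstruct(y0: str, Z: list[tuple[set[int], int, str]], h: int) -> str:
    """Rebuild sequence y_h from the reference y0 and the variation list Z.

    Assumes every triple (G, pos, c) is a substitution: the sequences whose
    indices are in G carry symbol c at position pos.
    """
    seq = list(y0)
    for G, pos, c in Z:
        if h in G:
            seq[pos] = c
    return "".join(seq)


y0 = "ATGCTAGCAAGATACAG"
Z = [({2}, 4, "G"), ({1, 3}, 10, "C"), ({3}, 16, "T")]

for h in range(4):
    print(h, reconstruct(y0, Z, h))
```

Only y0 and the three triples are stored, instead of four sequences of length 17.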

SLIDE 19

For highly similar sequences

Hamming distance

For u, v ∈ Σ∗ such that |u| = |v|: Ham(u, v) = #{i | u[i] ≠ v[i]}

Longest Common Extension

For x ∈ Σ∗ and 0 ≤ i ≤ j ≤ |x| − 1: LCE^k_x(i, j) = max{ℓ | Ham(x[i . . i + ℓ − 1], x[j . . j + ℓ − 1]) ≤ k}
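
Both definitions can be checked with a direct, brute-force sketch. The names ham and lce_k are illustrative, and the loop runs in time linear in the answer rather than in O(k).

```python
def ham(u: str, v: str) -> int:
    """Hamming distance: number of positions where u and v differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(u, v))


def lce_k(x: str, i: int, j: int, k: int) -> int:
    """Largest l such that x[i . . i+l-1] and x[j . . j+l-1] stay inside x
    and differ in at most k positions."""
    n = len(x)
    length = 0
    mismatches = 0
    while max(i, j) + length < n:
        if x[i + length] != x[j + length]:
            if mismatches == k:
                break  # one more mismatch would exceed the budget k
            mismatches += 1
        length += 1
    return length


print(ham("karolin", "kathrin"))     # 3
print(lce_k("abcabdabc", 0, 3, 0))   # 2
print(lce_k("abcabdabc", 0, 3, 1))   # 5
```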

SLIDE 26

Kangaroo jumps

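[Figure: starting from positions i and j, each jump skips a maximal run of matching symbols and lands on the next mismatch; mismatches 1, 2 and 3 are passed in turn]

LCE^k_x(i, j) can be computed in O(k) time after O(n) preprocessing time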
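
The jumping scheme can be sketched as follows. The exact LCE oracle used here, lce_exact, is a naive stand-in written for this example; the O(k) bound quoted above needs a constant-time oracle (for instance one built on a suffix tree with lowest common ancestor queries), which this sketch does not implement.

```python
def lce_exact(x: str, i: int, j: int) -> int:
    """Naive stand-in for a constant-time exact LCE query (assumption)."""
    n = len(x)
    length = 0
    while i + length < n and j + length < n and x[i + length] == x[j + length]:
        length += 1
    return length


def lce_k_kangaroo(x: str, i: int, j: int, k: int, lce=lce_exact) -> int:
    """LCE with at most k mismatches, using at most k + 1 exact LCE queries."""
    limit = len(x) - max(i, j)  # the extension cannot run past the end of x
    length = 0
    for _ in range(k + 1):
        length += lce(x, i + length, j + length)  # jump over a matching run
        if length >= limit:
            return limit
        length += 1  # step over the mismatch that stopped the jump
    # The last step crossed mismatch number k + 1, which is not allowed.
    return length - 1


print(lce_k_kangaroo("abcabdabc", 0, 3, 1))  # 5
```

Any function with the same signature can be passed as lce, so a faster oracle can be swapped in without touching the jumping loop.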

SLIDE 27

References

Restriction: 1 variation on a window of size m

Adaptations of KMP and BM without LCE by adapting the shift functions

  • N. Ben Nsira, T. Lecroq and M. Elloumi. A fast Boyer-Moore type pattern matching algorithm for highly similar sequences. International Journal of Data Mining and Bioinformatics 13(3) (2015) 266-288
  • N. Ben Nsira, T. Lecroq and M. Elloumi. On-line String Matching in Highly Similar DNA Sequences. Mathematics in Computer Science 11(2) (2017) 113–126

SLIDE 28

2 variants

relaxing the restriction from 1 to k variations in a window of size m
searching for a finite set of patterns (still with 1 variation in a window of size m)

SLIDE 29

Single pattern with at most k variations

Applying the Landau-Vishkin algorithm as a filter

Searching with k mismatches in O(kn)
When Ham(x, y0[j . . j + m − 1]) = ℓ ≤ k:
ℓ = 0: an exact occurrence of the pattern has been found in y0 and in all the other sequences that have no variation with respect to y0 between position j and position j + m − 1 (both included).
ℓ > 0: let W = {i0, . . . , iℓ−1} be the set of the ℓ positions such that y0[j + ip] ≠ x[ip] with 0 ≤ p < ℓ. Then x occurs exactly in yh if:

◮ (G, j + ip, x[ip]) ∈ Z with h ∈ G for all 0 ≤ p < ℓ;
◮ there is no (G, q, c) ∈ Z with h ∈ G and j ≤ q ≤ j + m − 1 such that q − j ∉ W.
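
The filter-then-verify idea can be sketched as below. The mismatch scan is a naive O(m) check per window standing in for the Landau-Vishkin algorithm, Z is assumed to hold substitutions only with 0-based positions, and the names mismatch_offsets, occurs_in and search are hypothetical.

```python
def mismatch_offsets(x, y0, j, k):
    """Offsets i with y0[j + i] != x[i], or None if there are more than k."""
    W = []
    for i, c in enumerate(x):
        if y0[j + i] != c:
            W.append(i)
            if len(W) > k:
                return None
    return W


def occurs_in(x, j, W, Z, h):
    """True if x occurs exactly at position j of sequence y_h."""
    pending = set(W)  # mismatches with y0 that y_h still has to repair
    for G, pos, c in Z:
        if h in G and j <= pos < j + len(x):
            i = pos - j
            if x[i] != c:
                return False  # y_h differs from x at this position
            pending.discard(i)
    return not pending


def search(x, y0, Z, r, k):
    """All pairs (h, j) such that x occurs exactly in y_h at position j."""
    hits = []
    for j in range(len(y0) - len(x) + 1):
        W = mismatch_offsets(x, y0, j, k)
        if W is None:
            continue  # filtered out: more than k mismatches with y0
        for h in range(r):
            if occurs_in(x, j, W, Z, h):
                hits.append((h, j))
    return hits


y0 = "ACCTACGACTA"
Z = [({1}, 6, "T"), ({1}, 10, "T")]
print(search("CTACTT", y0, Z, r=2, k=2))  # [(1, 5)]
```

With this input the windows at j = 2 and j = 5 pass the filter, and only the second one is confirmed, in y1.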

SLIDE 31

Single pattern with at most k variations

r = 2 and k = 2

     0  1  2  3  4  5  6  7  8  9 10
y0   A  C  C  T  A  C  G  A  C  T  A
x          C  T  A  C  T  T              j = 2 and W = (4, 5)
x                   C  T  A  C  T  T     j = 5 and W = (1, 5)
y1   A  C  C  T  A  C  T  A  C  T  T

Z = (({1}, 6, T), ({1}, 10, T))

Our solution runs in time O(knr)

SLIDE 32

Multiple patterns with at most 1 variation

Build a classical trie of the patterns
Scan the highly similar sequences with at most 2 active states

SLIDE 33

Multiple patterns with at most 1 variation

X = {ACGA, ACTA, CTA} and r = 2 sequences

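[Figure: trie of X with states 0 to 9 and transitions 0 -A-> 1 -C-> 2 -G-> 3 -A-> 4, 2 -T-> 5 -A-> 6 and 0 -C-> 7 -T-> 8 -A-> 9; outputs {ACGA} at state 4, {ACTA, CTA} at state 6 and {CTA} at state 9; state 0 loops on Σ \ {A, C}]

y0 = A C C T A C G A C T A
y1 differs from y0 by two variations, both T
active states on y0 (initial state, then one state per symbol read): 0 1 2 7 8 9 2 3 4 2 5 6
second active states: 7 1 1 5 6 2 1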

SLIDE 34

Multiple patterns with at most 1 variation

Our solution runs in time O(n) for the searching phase and in time O(s) for the preprocessing phase, where s is the sum of the lengths |x| over all x ∈ X
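
The trie-based scan can be sketched with a classical Aho-Corasick automaton on a single sequence. The second active state used to follow a variation is not modelled here, and the names build_automaton and scan are assumptions made for illustration.

```python
from collections import deque


def build_automaton(patterns):
    """Trie of the patterns with failure links and merged outputs."""
    goto = [{}]    # goto[s][c] = target state of the trie edge labelled c
    out = [set()]  # out[s] = patterns recognised when entering state s
    for p in patterns:
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(p)

    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] |= out[fail[t]]
            queue.append(t)
    return goto, fail, out


def scan(patterns, y):
    """Return the visited states and the occurrences (position, pattern)."""
    goto, fail, out = build_automaton(patterns)
    s = 0
    states = [0]
    hits = []
    for j, c in enumerate(y):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        states.append(s)
        for p in sorted(out[s]):
            hits.append((j - len(p) + 1, p))
    return states, hits


states, hits = scan(["ACGA", "ACTA", "CTA"], "ACCTACGACTA")
print(states)  # [0, 1, 2, 7, 8, 9, 2, 3, 4, 2, 5, 6]
print(hits)    # [(2, 'CTA'), (4, 'ACGA'), (7, 'ACTA'), (8, 'CTA')]
```

Inserting the patterns in the order ACGA, ACTA, CTA gives states numbered 0 to 9, and the construction takes time linear in the total pattern length for a fixed alphabet.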

SLIDE 35

Experiments

Similar sequences of different lengths with patterns of length 16

[Figure: running time in seconds (axis from 0.1 to 0.8) as a function of sequence length (axis from 200000 to 1.6x10^6) for EDSM, LVsim and ACsim]

SLIDE 36

Perspectives

Do more experiments
Adapt other pattern matching techniques
Relax the restrictions
Adaptive analysis

SLIDE 37

Thank you for your attention!
