SLIDE 1

On-line Searching in IUPAC Nucleotide Sequences

Jan Holub

(joint work with Petr Procházka)

The Prague Stringology Club
Faculty of Information Technology
Czech Technical University in Prague

PSC

LSD & LAW 2019 February 7, 2019

SLIDE 2

Outline

LSD & LAW 2019: J. Holub: On-line Searching in IUPAC Nucleotide Sequences – 2 / 21

1. Motivation
2. Basic Concepts
3. BADPM data structures
4. BADPM pattern preprocessing
5. BADPM searching
6. BADPM complexities
7. Experiments

SLIDE 3

Motivation

DNA sequencing of populations of many individuals.

1000 Genomes Project, UK10K project.

Pan-genomics: a consensus sequence is a way of representing the sequenced population.

A consensus sequence can be expressed as a so-called degenerate string.

Need for fast on-line algorithms searching for different patterns in the consensus sequence.

SLIDE 4

Basic Concepts: IUPAC alphabet

IUPAC symbol   Subset          Bit coding
A              {A}             0001
C              {C}             0010
G              {G}             0100
T              {T}             1000
R              {A, G}          0101
Y              {C, T}          1010
S              {C, G}          0110
W              {A, T}          1001
K              {G, T}          1100
M              {A, C}          0011
B              {C, G, T}       1110
D              {A, G, T}       1101
H              {A, C, T}       1011
V              {A, C, G}       0111
N              {A, C, G, T}    1111
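The bit coding makes the subset test a single bitwise AND. A minimal sketch of that idea follows; the names IUPAC_BITS, symbols_match, and subset are illustrative and do not come from the slides.

```python
# IUPAC nucleotide symbols as 4-bit masks (bit order T G C A, as in the table).
IUPAC_BITS = {
    "A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000,
    "R": 0b0101, "Y": 0b1010, "S": 0b0110, "W": 0b1001,
    "K": 0b1100, "M": 0b0011, "B": 0b1110, "D": 0b1101,
    "H": 0b1011, "V": 0b0111, "N": 0b1111,
}


def symbols_match(x, y):
    """Return True if the base subsets of two IUPAC symbols intersect."""
    return (IUPAC_BITS[x] & IUPAC_BITS[y]) != 0


def subset(symbol):
    """Expand an IUPAC symbol into its set of solid bases."""
    return {b for b in "ACGT" if IUPAC_BITS[b] & IUPAC_BITS[symbol]}
```

For example, R = {A, G} and Y = {C, T} share no base, so their masks 0101 and 1010 have an empty AND.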

SLIDE 5

Basic Concepts: DNA Consensus Sequence

homo sapiens:          T C C A G C G C T T A C T C T A T A C C T A A
pan paniscus:          T C C A G C A C T T A C T C T G T G C C C G C
chlorocebus sabaeus:   T C C A G C A C T T A C T C T G T G C C C A C
macaca fascicularis:   T C C A G C A C T T A C T C T G T G C C C A C
macaca mulatta:        T C C A G C A C T T A C T C T G T G C C C G C
papio anubis:          T C T A G C A C T T A C T C T A T G C C T G C
callithrix jacchus:    T C T A G C A C T T A C T C T A T G C C T G C
CONSENSUS:             T C Y A G C R C T T A C T C T R T R C C Y R M

Figure 1: Consensus sequence over IUPAC alphabet for different species (chromosome 7: 55 187 593 – 55 187 615).
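The consensus row of Figure 1 can be reproduced by OR-ing the bit codes of each alignment column. The sketch below assumes the seven rows appear in the order of the species labels. The helper name consensus and the lookup tables are illustrative.

```python
# Bit codes of the solid bases and the reverse map from mask to IUPAC symbol.
BITS = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000}
MASK_TO_IUPAC = {
    0b0001: "A", 0b0010: "C", 0b0100: "G", 0b1000: "T",
    0b0101: "R", 0b1010: "Y", 0b0110: "S", 0b1001: "W",
    0b1100: "K", 0b0011: "M", 0b1110: "B", 0b1101: "D",
    0b1011: "H", 0b0111: "V", 0b1111: "N",
}

# The seven aligned rows of Figure 1.
ROWS = [
    "TCCAGCGCTTACTCTATACCTAA",
    "TCCAGCACTTACTCTGTGCCCGC",
    "TCCAGCACTTACTCTGTGCCCAC",
    "TCCAGCACTTACTCTGTGCCCAC",
    "TCCAGCACTTACTCTGTGCCCGC",
    "TCTAGCACTTACTCTATGCCTGC",
    "TCTAGCACTTACTCTATGCCTGC",
]


def consensus(rows):
    """OR the bit codes of every column and map the mask back to a symbol."""
    out = []
    for column in zip(*rows):
        mask = 0
        for base in column:
            mask |= BITS[base]
        out.append(MASK_TO_IUPAC[mask])
    return "".join(out)
```

Calling consensus(ROWS) yields the CONSENSUS row of the figure without spaces.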

SLIDE 6

Basic Concepts: Degenerate Pattern Matching

Problem: Given a degenerate text T and a degenerate pattern P of length m, find all occurrences of P in T, i.e., all positions i such that $T_{i+j-1} \cap P_j \neq \emptyset$ for all j in [1, m].
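As a reference point for the problem statement, here is a naive matcher that tests the non-empty intersection symbol by symbol. It is only a baseline sketch, not BADPM. Positions are 0-based here, and the name find_occurrences is illustrative.

```python
# IUPAC symbols as 4-bit masks (bit order T G C A).
IUPAC_BITS = {
    "A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000,
    "R": 0b0101, "Y": 0b1010, "S": 0b0110, "W": 0b1001,
    "K": 0b1100, "M": 0b0011, "B": 0b1110, "D": 0b1101,
    "H": 0b1011, "V": 0b0111, "N": 0b1111,
}


def find_occurrences(text, pattern):
    """Return all 0-based positions i where every pattern symbol
    shares at least one base with the aligned text symbol."""
    t = [IUPAC_BITS[c] for c in text]
    p = [IUPAC_BITS[c] for c in pattern]
    m = len(p)
    return [
        i
        for i in range(len(t) - m + 1)
        if all(t[i + j] & p[j] for j in range(m))
    ]
```

On the consensus sequence of Figure 1, the solid pattern CTAG matches at position 1 because Y = {C, T} contains T.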

SLIDE 7

BADPM: Basic Properties

Byte-Aligned Degenerate Pattern Matching (BADPM).

Sublinear average time complexity in searching over consensus DNA sequences.

Extremely fast for long patterns because of long shifts.

Simple pattern preprocessing: tabulating all pattern factors.

Processing at the byte level (omitting most of the bitwise operations).

Easy cooperation with an n-gram inverted index.

SLIDE 8

BADPM: Data Structures

[Figure: BADPM data structures. The source sequence (e.g. T A R T ...) is encoded with A → 00, C → 01, G → 10, T → 11 into the byte sequence B (bytes Bi, Bi+1, Bi+2) stored in baseSeq. The arrays variantPos (entries i, i + 2) and variantNum (entries 3, 6) refer to the byte variants stored in variants. For the preprocessed pattern A C V T A A T ..., the dictionary keys 4 879, 5 903, and 6 927 are shown.]

SLIDE 9

BADPM: Data structures (2)

Consensus sequence divided into:

  • Base sequence. Consisting of only solid symbols.

  • Variants. Encoded variants (given by the degenerate symbols) in terms of a whole byte.

Base sequence and variants encoded using bytes substituting 4-grams of symbols/bases.

Auxiliary array variantPos storing positions of “degenerate bytes” in base sequence.

Auxiliary array variantNum storing number of “byte variants” for a given byte.
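The split into base sequence and variants can be sketched as follows. The two-bit coding (A → 00, C → 01, G → 10, T → 11) is taken from the slides. Several details are assumptions made for illustration: the first variant is used as the base byte, every variant (including that one) is stored in variants, variantNum holds plain counts, and the input length is a multiple of four. The names pack4, byte_variants, and encode are hypothetical.

```python
from itertools import product

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
SUBSET = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pack4(gram):
    """Pack four solid bases into one byte, first base in the high bits."""
    value = 0
    for base in gram:
        value = (value << 2) | CODE[base]
    return value


def byte_variants(gram):
    """All byte values a (possibly degenerate) 4-gram can stand for."""
    return [pack4(choice) for choice in product(*(SUBSET[s] for s in gram))]


def encode(consensus):
    """Split a consensus sequence into baseSeq, variants, variantPos, variantNum.

    Assumption: len(consensus) is a multiple of 4.
    """
    base_seq, variants, variant_pos, variant_num = bytearray(), [], [], []
    for i in range(0, len(consensus), 4):
        options = byte_variants(consensus[i:i + 4])
        base_seq.append(options[0])          # assumed choice of base byte
        if len(options) > 1:                 # a "degenerate byte"
            variant_pos.append(i // 4)
            variant_num.append(len(options))
            variants.extend(options)
    return bytes(base_seq), variants, variant_pos, variant_num
```

Under this coding, A C V T followed by A A T T gives the two-byte values 4 879, 5 903, and 6 927, which are the dictionary keys that appear on the data-structures slide.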

SLIDE 10

BADPM: Data structures (3)

Dictionary of all possible two-byte values ($256^2 = 65\,536$ values).

Dictionary entries point to lists of occurrences (of a two-byte value) in the encoded pattern PC.

List elements:

  • Byte offset in terms of the encoded pattern PC.

  • Alignment to the encoded pattern PC.

SLIDE 11

BADPM: Pattern Preprocessing

[Figure: Pattern preprocessing. The pattern (A C G T A A T ...) is encoded with A → 00, C → 01, G → 10, T → 11 at each of the four alignments (alignment = 0, 1, 2, 3). Each double-byte value found (dictionary keys shown: 6 927, 27 708, 32 575, 45 296, 50 115, 53 185, 62 448, 64 764) points to a list of (offset, alignment) elements in the preprocessed pattern. The example offsets shown include nB − 2 and nB − 1.]

SLIDE 12

BADPM: Pattern Preprocessing (2)

For different alignments a ∈ {0, 1, 2, 3}:

1. Scan all relevant double-byte values.
2. Store byte offset (in terms of the encoded pattern PE) and alignment a to the corresponding list (a dictionary entry corresponding to the double-byte value).
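The two steps can be sketched as below. This is an illustration under assumptions, not the authors' code: alignment a is taken to mean the number of pattern bases skipped before the first whole byte, the offset is the byte index within that alignment's encoding, and a Python dict stands in for the 65 536-entry table. The names pack4, byte_variants, and preprocess are hypothetical.

```python
from collections import defaultdict
from itertools import product

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
SUBSET = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pack4(gram):
    """Pack four solid bases into one byte, first base in the high bits."""
    value = 0
    for base in gram:
        value = (value << 2) | CODE[base]
    return value


def byte_variants(gram):
    """All byte values a (possibly degenerate) 4-gram can stand for."""
    return [pack4(choice) for choice in product(*(SUBSET[s] for s in gram))]


def preprocess(pattern):
    """Map every double-byte value of the pattern to (offset, alignment) pairs."""
    dictionary = defaultdict(list)
    for a in range(4):
        # Whole 4-grams of the pattern when its first a bases are skipped.
        grams = [pattern[i:i + 4] for i in range(a, len(pattern) - 3, 4)]
        per_byte = [byte_variants(g) for g in grams]
        # Step 1: scan all double-byte values; step 2: store offset and alignment.
        for offset in range(len(per_byte) - 1):
            for hi, lo in product(per_byte[offset], per_byte[offset + 1]):
                dictionary[(hi << 8) | lo].append((offset, a))
    return dictionary
```

With the pattern prefix A C G T A A T T A, alignment 0 yields the key 6 927 and alignment 1 yields 27 708, both of which appear among the keys on the preprocessing slide.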

SLIDES 13-16

BADPM: Pattern Preprocessing Space

[Figure: the dictionary (indices up to 65 535), with entry i pointing to a list of l_i elements (o_1, a_1), ..., (o_{l_i}, a_{l_i}) of offset and alignment in the preprocessed pattern.]

$O(m \alpha^2 \log m)$

Factors annotated in the figure: $O(\alpha^2)$, $O(m)$, $O(\log m)$.

SLIDE 17

BADPM: Pattern Preprocessing Time

$O(m\alpha^2)$

Scan O(m) bytes of the encoded pattern PE.

Check $O(\alpha^2)$ double-byte values at each position (pathological patterns ...NNNNNNNN...).

Store offset and alignment for each double-byte value to the corresponding list (O(1) time).
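Read as multiplicative factors, the three items above give the stated bound. A short worked form, assuming the factors simply multiply:

```latex
\[
  \underbrace{O(m)}_{\text{bytes of } P_E}
  \cdot
  \underbrace{O(\alpha^2)}_{\text{double-byte values per position}}
  \cdot
  \underbrace{O(1)}_{\text{store offset and alignment}}
  = O(m\alpha^2)
\]
```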

SLIDE 18

BADPM Searching

1. Read short value and check the dictionary.
2. Byte-level check according to the offset.
3. Prefix and suffix check according to the alignment.

Figure 2: BADPM: Conceptual schema of searching (baseSeq, dictionary, list elements with offset and alignment).
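The three steps can be illustrated with a simplified filter-and-verify sketch. It makes strong assumptions that the slides do not state. Text and pattern contain only solid bases, so there are no variants. The sampling step is chosen so that every occurrence covers at least one sampled double byte. Steps 2 and 3 are merged into one base-level comparison. The names pack, preprocess, and search are hypothetical.

```python
from collections import defaultdict

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq):
    """Pack whole 4-grams of a solid sequence into bytes (trailing bases dropped)."""
    out = bytearray()
    for i in range(0, len(seq) - 3, 4):
        value = 0
        for base in seq[i:i + 4]:
            value = (value << 2) | CODE[base]
        out.append(value)
    return bytes(out)


def preprocess(pattern):
    """Map each double-byte value of the pattern to (offset, alignment) pairs."""
    table = defaultdict(list)
    for a in range(4):
        enc = pack(pattern[a:])
        for offset in range(len(enc) - 1):
            table[(enc[offset] << 8) | enc[offset + 1]].append((offset, a))
    return table


def search(text, pattern):
    """Return sorted 0-based occurrences of a solid pattern in a solid text."""
    m = len(pattern)
    # Any alignment leaves at least (m - 3) // 4 whole bytes, hence at least
    # this many consecutive double-byte start positions inside an occurrence.
    step = (m - 3) // 4 - 1
    if step < 1:
        raise ValueError("this sketch needs a pattern of at least 11 bases")
    table = preprocess(pattern)
    base_seq = pack(text)
    hits = set()
    for j in range(0, len(base_seq) - 1, step):
        short = (base_seq[j] << 8) | base_seq[j + 1]   # step 1: read short value
        for offset, a in table.get(short, ()):
            start = 4 * (j - offset) - a               # candidate base position
            # Steps 2 and 3 merged: compare all bases of the candidate.
            if start >= 0 and text[start:start + m] == pattern:
                hits.add(start)
    return sorted(hits)
```

Longer patterns allow a larger step, which is the intuition behind the long shifts mentioned earlier. The real algorithm additionally handles degenerate bytes through variants, variantPos, and variantNum.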

SLIDES 19-22

BADPM Searching: Example

[Figure, shown in four steps, with baseSeq, dictionary, variants, variantPos, variantNum, and pattern. The baseSeq bytes A C A A and G T T A at byte i give the dictionary key 4 284. variantPos = i and variantNum = 1 refer to the variant byte A C G A, which together with G T T A gives the key 6 332. Later steps also show the bytes T A T A and G G C T aligned against the pattern.]

SLIDE 23

BADPM Searching: Time Complexity

$O(n m^2 \alpha^4)$

Scan O(n) bytes of the base sequence.

Check $O(\alpha^2)$ double-byte values at each position (pathological sequences ...NNNNNNNN...).

Check up to O(m) offsets for each double-byte value.

Sequential byte-by-byte comparison with the encoded pattern PE (O(m) bytes).

Considering O(α) variants for each byte of the sequence and O(α) variants for each byte of the encoded pattern PE (pathological sequences and patterns ...NNNNNNNN...).
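The worst-case bound follows if the items above are read as multiplicative factors. A worked form under that assumption:

```latex
\[
  \underbrace{O(n)}_{\text{bytes scanned}}
  \cdot
  \underbrace{O(\alpha^2)}_{\text{double-byte values}}
  \cdot
  \underbrace{O(m)}_{\text{offsets per value}}
  \cdot
  \underbrace{O(m)}_{\text{bytes compared}}
  \cdot
  \underbrace{O(\alpha)\,O(\alpha)}_{\text{sequence and pattern variants}}
  = O(n m^2 \alpha^4)
\]
```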

SLIDE 24

Experiments: Locate time

[Plot: locate time [sec] (axis 0.05 to 0.4) against pattern length (200 to 1000) for BADPM, PNS, BMH, and BNDM.]

Figure 3: Human chromosome 7: Locate time depending on the length of the searched pattern m.

SLIDE 25

Experiments: Locate time for chromosomes

[Plot: locate time [sec] (axis 0.1 to 0.5) and file size [MiB] (axis 50 to 250) per chromosome (1 to 22, X, Y) for BADPM, PNS, BMH, and BNDM.]

Figure 4: Locate time for different human chromosomes for m = 16. The second vertical axis represents the chromosome file size.

SLIDE 26

Experiments: Inverted index

[Plot: locate time [sec] (axis ticks 0.0001 to 0.1) against pattern length (200 to 1000) for BADPM, PNS, BMH, and BNDM.]

Figure 5: Human chromosome 7: Locate time using inverted index, block size = 102 400 bases.

SLIDE 27

Thank you!

Any questions?

Prague Stringology Conference 2019 (August 26–28, 2019)

postdoc position on succinct data structures in Prague (2019–2022)