On-line Searching in IUPAC Nucleotide Sequences
Jan Holub
(joint work with Petr Procházka)
The Prague Stringology Club, Faculty of Information Technology, Czech Technical University in Prague
LSD & LAW 2019 February 7, 2019
LSD & LAW 2019: J. Holub: On-line Searching in IUPAC Nucleotide Sequences – 2 / 21
■ DNA sequencing of populations of many individuals.
■ 1000 Genomes Project, UK10K project.
■ Pan-genomics: a consensus sequence is a way of representing the sequenced population.
■ A consensus sequence can be expressed as a so-called degenerate string.
■ Need for fast on-line algorithms searching for different patterns in the consensus sequence.
IUPAC symbol   Subset          Bit coding
A              {A}             0001
C              {C}             0010
G              {G}             0100
T              {T}             1000
R              {A, G}          0101
Y              {C, T}          1010
S              {C, G}          0110
W              {A, T}          1001
K              {G, T}          1100
M              {A, C}          0011
B              {C, G, T}       1110
D              {A, G, T}       1101
H              {A, C, T}       1011
V              {A, C, G}       0111
N              {A, C, G, T}    1111
homo sapiens:          T C C A G C G C T T A C T C T A T A C C T A A
pan paniscus:          T C C A G C A C T T A C T C T G T G C C C G C
chlorocebus sabaeus:   T C C A G C A C T T A C T C T G T G C C C A C
macaca fascicularis:   T C C A G C A C T T A C T C T G T G C C C A C
macaca mulatta:        T C C A G C A C T T A C T C T G T G C C C G C
papio anubis:          T C T A G C A C T T A C T C T A T G C C T G C
callithrix jacchus:    T C T A G C A C T T A C T C T A T G C C T G C
CONSENSUS:             T C Y A G C R C T T A C T C T R T R C C Y R M
Figure 1: Consensus sequence over the IUPAC alphabet for different species (chromosome 7: 55 187 593 – 55 187 615).
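With the 4-bit coding from the IUPAC table, a consensus column is the union of the symbols in that column, i.e. the bitwise OR of their codes. The following is a minimal sketch of that idea, not code from the talk; the names `IUPAC_BITS`, `consensus` and `ALIGNED` are illustrative, and the row-to-species assignment follows the order of the labels in Figure 1.

```python
# 4-bit IUPAC codes as in the table: bit 0 = A, bit 1 = C, bit 2 = G, bit 3 = T.
IUPAC_BITS = {
    "A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000,
    "R": 0b0101, "Y": 0b1010, "S": 0b0110, "W": 0b1001,
    "K": 0b1100, "M": 0b0011, "B": 0b1110, "D": 0b1101,
    "H": 0b1011, "V": 0b0111, "N": 0b1111,
}
BITS_TO_IUPAC = {bits: symbol for symbol, bits in IUPAC_BITS.items()}


def consensus(sequences):
    """Return the column-wise union of equally long, aligned sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    result = []
    for column in zip(*sequences):
        bits = 0
        for symbol in column:
            bits |= IUPAC_BITS[symbol]  # set union = bitwise OR
        result.append(BITS_TO_IUPAC[bits])
    return "".join(result)


# The seven aligned rows of Figure 1 (hypothetical variable name).
ALIGNED = [
    "TCCAGCGCTTACTCTATACCTAA",
    "TCCAGCACTTACTCTGTGCCCGC",
    "TCCAGCACTTACTCTGTGCCCAC",
    "TCCAGCACTTACTCTGTGCCCAC",
    "TCCAGCACTTACTCTGTGCCCGC",
    "TCTAGCACTTACTCTATGCCTGC",
    "TCTAGCACTTACTCTATGCCTGC",
]

print(consensus(ALIGNED))
```

Applied to the rows of Figure 1, this reproduces the CONSENSUS row.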
Problem: Given a degenerate text T of length n and a degenerate pattern P of length m, find all the occurrences of P in T, i.e., find all i such that for all j in [1, m], $T_{i+j-1} \cap P_j \neq \emptyset$.
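As a baseline for the problem statement, two degenerate symbols match exactly when their subsets intersect, which with the 4-bit coding is a non-zero bitwise AND. Below is a naive O(nm) sketch of this definition; it is not the BADPM algorithm, and the function names are illustrative assumptions.

```python
# 4-bit IUPAC codes: bit 0 = A, bit 1 = C, bit 2 = G, bit 3 = T.
IUPAC_BITS = {
    "A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000,
    "R": 0b0101, "Y": 0b1010, "S": 0b0110, "W": 0b1001,
    "K": 0b1100, "M": 0b0011, "B": 0b1110, "D": 0b1101,
    "H": 0b1011, "V": 0b0111, "N": 0b1111,
}


def symbols_match(text_symbol, pattern_symbol):
    """True if the two symbol subsets have a non-empty intersection."""
    return (IUPAC_BITS[text_symbol] & IUPAC_BITS[pattern_symbol]) != 0


def naive_search(text, pattern):
    """Return all 1-based positions i where pattern matches text."""
    n, m = len(text), len(pattern)
    return [
        i + 1
        for i in range(n - m + 1)
        if all(symbols_match(text[i + j], pattern[j]) for j in range(m))
    ]


print(naive_search("TCYAGCRC", "CR"))
```

Positions are reported 1-based to agree with the indexing used in the problem definition.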
■ Byte-Aligned Degenerate Pattern Matching (BADPM).
■ Sublinear average time complexity in searching over consensus DNA sequences.
■ Extremely fast for long patterns because of long shifts.
■ Simple pattern preprocessing: tabulating all pattern factors.
■ Processing at the byte level (omitting most of the bitwise operations).
■ Easy cooperation with an n-gram inverted index.
[Figure: encoding schema. Each solid base is coded in 2 bits (A → 00, C → 01, G → 10, T → 11), so every byte B_i of the encoded sequence holds four bases. The schema shows the arrays baseSeq, variants, variantPos and variantNum. In the example, the text A C V T A A T ... yields the dictionary values 4 879, 5 903 and 6 927.]
■ Consensus sequence divided into:
  ◆ Base sequence. Consisting of only solid symbols.
  ◆ Variants. Expressed in terms of a whole byte.
■ Base sequence and variants are encoded using bytes substituting 4-grams of symbols/bases.
■ Auxiliary array variantPos stores the positions of “degenerate bytes” in the base sequence.
■ Auxiliary array variantNum stores the number of “byte variants” for a given byte.
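One plausible reading of this layout can be sketched as follows. The sketch makes several assumptions that the slides do not confirm: the first expansion of a degenerate 4-gram is kept in the base sequence, the remaining expansions are appended to `variants`, `variant_num` counts only those remaining expansions, and the sequence length is divisible by 4. All function and variable names are illustrative.

```python
from itertools import product

# 2-bit codes for solid bases, as in the slides: A=00, C=01, G=10, T=11.
BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pack4(bases):
    """Pack four solid bases into one byte, first base in the highest bits."""
    value = 0
    for base in bases:
        value = (value << 2) | BASE_CODE[base]
    return value


def encode_consensus(sequence):
    """Split a consensus sequence into baseSeq, variants, variantPos, variantNum."""
    if len(sequence) % 4 != 0:
        raise ValueError("this sketch assumes a length divisible by 4")
    base_seq, variants = bytearray(), bytearray()
    variant_pos, variant_num = [], []
    for k in range(0, len(sequence), 4):
        choices = [IUPAC_SETS[symbol] for symbol in sequence[k:k + 4]]
        expansions = [pack4(combo) for combo in product(*choices)]
        base_seq.append(expansions[0])          # assumption: first expansion is the base byte
        if len(expansions) > 1:                 # a "degenerate byte"
            variant_pos.append(k // 4)          # byte position in the base sequence
            variant_num.append(len(expansions) - 1)
            variants.extend(expansions[1:])
    return bytes(base_seq), bytes(variants), variant_pos, variant_num


print(encode_consensus("ACGTTART"))
```

For the 4-gram TART, the two expansions TAAT and TAGT become one base byte and one variant byte.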
■ Dictionary of all possible two-byte values ($256^2 = 65\,536$ values).
■ Dictionary entries point to lists of occurrences (of a two-byte value) in the encoded pattern PC.
■ List elements:
  ◆ Byte offset in terms of the encoded pattern PC.
  ◆ Alignment to the encoded pattern PC.
[Figure: pattern preprocessing. The pattern is encoded (A → 00, C → 01, G → 10, T → 11) under each of the four alignments (alignment = 0, 1, 2, 3). Every two-byte value found, e.g. 6 927 for A C G T A A T T, indexes a dictionary list whose elements hold a byte offset and an alignment.]
For each alignment a ∈ {0, 1, 2, 3}:
1. Scan all relevant double-byte values.
2. Store the byte offset (in terms of the encoded pattern PE) and the alignment a in the corresponding list (the dictionary entry for that double-byte value).
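The two steps can be sketched as below. This is an illustrative reading, not the authors' implementation: here alignment a is taken to mean that the first a pattern symbols are skipped before packing whole bytes, offsets are counted in bytes from that point, and degenerate symbols are handled by enumerating every solid expansion of each 8-symbol window. The names `preprocess_pattern`, `pack` and `IUPAC_SETS` are assumptions.

```python
from collections import defaultdict
from itertools import product

BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pack(bases):
    """Pack solid bases into an integer, 2 bits per base, first base highest."""
    value = 0
    for base in bases:
        value = (value << 2) | BASE_CODE[base]
    return value


def preprocess_pattern(pattern):
    """Map each double-byte value to a list of (byte offset, alignment) pairs."""
    dictionary = defaultdict(list)
    for alignment in range(4):
        # Windows of 8 symbols (= 2 bytes) starting on byte boundaries
        # of the pattern shifted by `alignment` symbols.
        for start in range(alignment, len(pattern) - 7, 4):
            offset = (start - alignment) // 4
            window = pattern[start:start + 8]
            for combo in product(*(IUPAC_SETS[s] for s in window)):
                dictionary[pack(combo)].append((offset, alignment))
    return dictionary


table = preprocess_pattern("ACVTAATT")
print(sorted(table))
```

For the window ACVTAATT the three keys are the values 4 879, 5 903 and 6 927 that appear in the encoding figure. A window consisting only of N would enumerate all 65 536 values, which is the pathological case noted in the complexity analysis.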
[Figure: the dictionary has 65 536 entries (indices 0 to 65 535); entry i points to a list of l_i elements in the preprocessed pattern, each carrying an alignment (a_1, ..., a_{l_i}).]
■ Scan $O(m)$ bytes of the encoded pattern PE.
■ Check $O(\alpha^2)$ double-byte values at each position (pathological patterns ...NNNNNNNN...).
■ Store offset and alignment for each double-byte value to the corresponding list ($O(1)$ time).
Figure 2: BADPM: Conceptual schema of searching (baseSeq and dictionary).
[Figure: searching example in four steps. The double-byte value at position i of baseSeq (4 284 in the first steps, 6 332 in the last) is looked up in the dictionary and compared with the pattern, using the arrays variants, variantPos and variantNum.]
■ Scan $O(n)$ bytes of the base sequence.
■ Check $O(\alpha^2)$ double-byte values at each position (pathological sequences ...NNNNNNNN...).
■ Check up to $O(m)$ offsets for each double-byte value.
■ Sequential byte-by-byte comparison with the encoded pattern PE ($O(m)$ bytes).
■ Considering $O(\alpha)$ variants for each byte of the sequence and $O(\alpha)$ variants for each byte of the encoded pattern PE (pathological sequences and patterns ...NNNNNNNN...).
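The search loop can be illustrated with a deliberately simplified sketch. It assumes a solid text (no variants, so `variantPos` and `variantNum` are not modelled), a pattern of at least 11 symbols so that every alignment contains a full double byte, and symbol-by-symbol verification instead of the byte-by-byte comparison described above. The sampling step and all names are assumptions of this sketch, not details taken from the talk.

```python
from collections import defaultdict
from itertools import product

BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pack(bases):
    """Pack solid bases into an integer, 2 bits per base, first base highest."""
    value = 0
    for base in bases:
        value = (value << 2) | BASE_CODE[base]
    return value


def preprocess_pattern(pattern):
    """Double-byte value -> list of (byte offset, alignment) in the pattern."""
    dictionary = defaultdict(list)
    for a in range(4):
        for start in range(a, len(pattern) - 7, 4):
            window = pattern[start:start + 8]
            for combo in product(*(IUPAC_SETS[s] for s in window)):
                dictionary[pack(combo)].append(((start - a) // 4, a))
    return dictionary


def search(text, pattern):
    """Return 0-based start positions of a degenerate pattern in a solid text."""
    n, m = len(text), len(pattern)
    if m < 11:
        raise ValueError("this sketch needs a pattern of at least 11 symbols")
    dictionary = preprocess_pattern(pattern)
    # Only complete bytes of the text are packed.
    text_bytes = [pack(text[k:k + 4]) for k in range(0, n - 3, 4)]
    # Every occurrence covers at least (m - 3) // 4 whole text bytes, so
    # sampling double bytes with this step cannot skip an occurrence.
    step = (m - 3) // 4 - 1
    hits = set()
    for b in range(0, len(text_bytes) - 1, step):
        value = (text_bytes[b] << 8) | text_bytes[b + 1]
        for offset, a in dictionary.get(value, ()):
            s = 4 * (b - offset) - a  # candidate start position in the text
            if 0 <= s <= n - m and all(
                text[s + j] in IUPAC_SETS[pattern[j]] for j in range(m)
            ):
                hits.add(s)
    return sorted(hits)


print(search("TTACGTAATTCGAGGGACCTAATTCGTA", "ACVTAATTCGN"))
```

The step grows with the pattern length, which is the source of the long shifts for long patterns: for m = 1000 the sketch inspects only every 248th byte of the text before verification.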
[Plot: locate time [sec] against pattern length (200 to 1000) for BADPM, PNS, BMH and BNDM.]
Figure 3: Human chromosome 7: Locate time depending on the length of the searched pattern m.
[Plot: locate time [sec] per chromosome (1 to 22, X, Y) for BADPM, PNS, BMH and BNDM, with file size [MiB] on the second axis.]
Figure 4: Locate time for different human chromosomes for m = 16. The second vertical axis represents the chromosome file size.
[Plot: locate time [sec] (0.0001 to 0.1) against pattern length (200 to 1000) for BADPM, PNS, BMH and BNDM.]
Figure 5: Human chromosome 7: Locate time using inverted index, block size = 102 400 bases.
■ Any questions?
■ Prague Stringology Conference 2019 (August 26–28, 2019)
■ Postdoc position on succinct data structures in Prague (2019–2022)