


  1. On-line Searching in IUPAC Nucleotide Sequences. Jan Holub (joint work with Petr Procházka), The Prague Stringology Club, Faculty of Information Technology, Czech Technical University in Prague. PSC LSD & LAW 2019, February 7, 2019.

  2. Outline
     1. Motivation
     2. Basic Concepts
     3. BADPM data structures
     4. BADPM pattern preprocessing
     5. BADPM searching
     6. BADPM complexities
     7. Experiments
     LSD & LAW 2019: J. Holub: On-line Searching in IUPAC Nucleotide Sequences – 2 / 21

  3. Motivation
     ■ DNA sequencing of populations of many individuals.
     ■ 1000 Genomes Project, UK10K project.
     ■ Pan-genomics: a consensus sequence is a way of representing the sequenced population.
     ■ A consensus sequence can be expressed as a so-called degenerate string.
     ■ Need for fast on-line algorithms searching for different patterns in the consensus sequence.

  4. Basic Concepts: IUPAC alphabet

     IUPAC symbol   Subset            Bit coding
     A              { A }             0001
     C              { C }             0010
     G              { G }             0100
     T              { T }             1000
     R              { A, G }          0101
     Y              { C, T }          1010
     S              { C, G }          0110
     W              { A, T }          1001
     K              { G, T }          1100
     M              { A, C }          0011
     B              { C, G, T }       1110
     D              { A, G, T }       1101
     H              { A, C, T }       1011
     V              { A, C, G }       0111
     N              { A, C, G, T }    1111

  5. Basic Concepts: DNA Consensus Sequence

     homo sapiens:          T C T A G C A C T T A C T C T A T G C C T G C
     pan paniscus:          T C T A G C A C T T A C T C T A T G C C T G C
     chlorocebus sabaeus:   T C C A G C A C T T A C T C T G T G C C C G C
     macaca fascicularis:   T C C A G C A C T T A C T C T G T G C C C A C
     macaca mulatta:        T C C A G C A C T T A C T C T G T G C C C A C
     papio anubis:          T C C A G C A C T T A C T C T G T G C C C G C
     callithrix jacchus:    T C C A G C G C T T A C T C T A T A C C T A A
     CONSENSUS:             T C Y A G C R C T T A C T C T R T R C C Y R M

     Figure 1: Consensus sequence over IUPAC alphabet for different species (chromosome 7: 55 187 593 – 55 187 615).
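The consensus row of Figure 1 can be reproduced from the bit coding of the IUPAC alphabet: OR-ing the bit codes of the bases in one alignment column yields the code of the consensus symbol. The following is a minimal sketch of that idea; the function name `consensus` and the helper tables are illustrative and do not come from the slides.

```python
# Bit coding of the four bases, as in the IUPAC table (A=0001, C=0010, G=0100, T=1000).
BITS = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000}

# IUPAC symbol -> subset of bases it stands for.
SUBSETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Symbol -> 4-bit mask, and the reverse lookup mask -> symbol.
MASK = {sym: sum(BITS[b] for b in bases) for sym, bases in SUBSETS.items()}
SYMBOL = {mask: sym for sym, mask in MASK.items()}


def consensus(rows):
    """Return the IUPAC consensus of equally long, aligned sequences."""
    out = []
    for column in zip(*rows):
        mask = 0
        for base in column:
            mask |= MASK[base]      # union of subsets = bitwise OR
        out.append(SYMBOL[mask])
    return "".join(out)


rows = [
    "TCTAGCACTTACTCTATGCCTGC",
    "TCTAGCACTTACTCTATGCCTGC",
    "TCCAGCACTTACTCTGTGCCCGC",
    "TCCAGCACTTACTCTGTGCCCAC",
    "TCCAGCACTTACTCTGTGCCCAC",
    "TCCAGCACTTACTCTGTGCCCGC",
    "TCCAGCGCTTACTCTATACCTAA",
]
print(consensus(rows))  # TCYAGCRCTTACTCTRTRCCYRM
```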

  6. Basic Concepts: Degenerate Pattern Matching
     Problem: Given a degenerate text $T$ and a degenerate pattern $P$ of length $m$, find all occurrences of $P$ in $T$, i.e., all $i$ such that $T_{i+j-1} \cap P_j \neq \emptyset$ for all $j \in [1, m]$.
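The problem statement can be checked directly with the bit coding: two IUPAC symbols have a non-empty intersection exactly when the bitwise AND of their codes is non-zero. The sketch below is a naive quadratic-time reference matcher, not BADPM; the name `degenerate_occurrences` is illustrative, and positions are reported 0-based rather than 1-based as in the definition.

```python
# IUPAC symbol -> 4-bit code (A=0001, C=0010, G=0100, T=1000; others are unions).
MASK = {
    "A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000,
    "R": 0b0101, "Y": 0b1010, "S": 0b0110, "W": 0b1001,
    "K": 0b1100, "M": 0b0011, "B": 0b1110, "D": 0b1101,
    "H": 0b1011, "V": 0b0111, "N": 0b1111,
}


def degenerate_occurrences(text, pattern):
    """Return all 0-based i such that text[i + j] and pattern[j] intersect for every j."""
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(MASK[text[i + j]] & MASK[pattern[j]] for j in range(m))
    ]


# The solid pattern AGCGC matches the consensus at index 3, because R = {A, G} contains G.
print(degenerate_occurrences("TCYAGCRCTTACTCTRTRCCYRM", "AGCGC"))  # [3]
```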

  7. BADPM: Basic Properties
     ■ Byte-Aligned Degenerate Pattern Matching (BADPM).
     ■ Sublinear average time complexity in searching over consensus DNA sequences.
     ■ Extremely fast for long patterns because of long shifts.
     ■ Simple pattern preprocessing: tabulating all pattern factors.
     ■ Processing at the byte level (omitting most of the bitwise operations).
     ■ Easy cooperation with an n-gram inverted index.

  8. BADPM: Data Structures
     [Figure: a source sequence (... A C V T A A T ... T A R T B ...) and its encoded form, consisting of baseSeq (bytes B_i, B_{i+1}, B_{i+2}), the variants array, variantPos and variantNum, shown next to the preprocessed pattern dictionary (entries 4 879, 5 903, 6 927). Base encoding: A → 00, C → 01, G → 10, T → 11.]

  9. BADPM: Data structures (2)
     ■ Consensus sequence divided into:
       ◆ Base sequence. Consisting of only solid symbols.
       ◆ Variants. Encoded variants (given by the degenerate symbols) in terms of a whole byte.
     ■ Base sequence and variants encoded using bytes substituting 4-grams of symbols/bases.
     ■ Auxiliary array variantPos storing positions of “degenerate bytes” in base sequence.
     ■ Auxiliary array variantNum storing number of “byte variants” for a given byte.
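The split into a base sequence and byte variants can be sketched as follows. The slides do not spell out which expansion of a degenerate 4-gram goes into the base sequence, so this sketch assumes the first expansion (taking the alphabetically smallest base of each degenerate symbol) is the base byte and all remaining expansions are stored as variants; it also assumes the input length is a multiple of 4. The function names `pack` and `encode` are illustrative.

```python
from itertools import product

# 2-bit codes of the bases: A -> 00, C -> 01, G -> 10, T -> 11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

# IUPAC symbol -> subset of bases, listed in alphabetical order.
SUBSETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pack(gram):
    """Pack four solid bases into one byte value, first base in the top bits."""
    value = 0
    for base in gram:
        value = (value << 2) | CODE[base]
    return value


def encode(consensus):
    """Return (baseSeq, variants, variantPos, variantNum) for a consensus sequence."""
    if len(consensus) % 4:
        raise ValueError("this sketch assumes a length that is a multiple of 4")
    base_seq, variants, variant_pos, variant_num = [], [], [], []
    for k in range(0, len(consensus), 4):
        gram = consensus[k:k + 4]
        # All solid 4-grams the degenerate 4-gram can stand for.
        expansions = [pack(g) for g in product(*(SUBSETS[s] for s in gram))]
        base_seq.append(expansions[0])              # assumed choice of base byte
        if len(expansions) > 1:                     # a "degenerate byte"
            variant_pos.append(k // 4)
            variant_num.append(len(expansions) - 1)
            variants.extend(expansions[1:])
    return base_seq, variants, variant_pos, variant_num


base_seq, variants, variant_pos, variant_num = encode("ACVTAATT")
print([format(b, "08b") for b in base_seq])   # base bytes for ACAT and AATT
print([format(b, "08b") for b in variants])   # variant bytes for ACCT and ACGT
print(variant_pos, variant_num)
```

Under these assumptions the 4-gram A C V T yields the base byte for ACAT and two variant bytes, for ACCT and ACGT.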

  10. BADPM: Data structures (3)
      ■ Dictionary of all possible two-byte values ($256^2 = 65\,536$ values).
      ■ Dictionary entries point to lists of occurrences (of a two-byte value) in the encoded pattern $P_C$.
      ■ List elements:
        ◆ Byte offset in terms of the encoded pattern $P_C$.
        ◆ Alignment to the encoded pattern $P_C$.

  11. BADPM: Pattern Preprocessing
      [Figure: preprocessing process. The pattern A C G T A A T T A A T ... is encoded (A → 00, C → 01, G → 10, T → 11) at each alignment 0, 1, 2 and 3. Every two-byte value of the encoded pattern selects a dictionary entry (shown: 6 927, 27 708, 32 575, 45 296, 50 115, 53 185, 62 448, 64 764), which stores the offset and alignment of that occurrence in the preprocessed pattern.]

  12. BADPM: Pattern Preprocessing (2)
      For different alignments $a \in \{0, 1, 2, 3\}$:
      1. Scan all relevant double-byte values.
      2. Store byte offset (in terms of the encoded pattern $P_E$) and alignment $a$ to the corresponding list (a dictionary entry corresponding to the double-byte value).
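The two preprocessing steps can be sketched for a solid pattern as follows. This is a simplified illustration under stated assumptions: the pattern contains only A, C, G and T (a degenerate pattern would contribute several double-byte values per position), the dictionary is a Python `dict` rather than a 65 536-entry table, and the names `pack_bytes` and `preprocess` are hypothetical. The key is formed as first byte × 256 + second byte, which agrees with the entry 6 927 shown for the leading bytes A C G T, A A T T in the preprocessing figure.

```python
from collections import defaultdict

# 2-bit codes of the bases: A -> 00, C -> 01, G -> 10, T -> 11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_bytes(seq):
    """Pack the complete 4-grams of seq into byte values; trailing bases are ignored."""
    out = []
    for k in range(0, len(seq) - len(seq) % 4, 4):
        value = 0
        for base in seq[k:k + 4]:
            value = (value << 2) | CODE[base]
        out.append(value)
    return out


def preprocess(pattern):
    """Map each double-byte value to the list of (byte offset, alignment) pairs."""
    dictionary = defaultdict(list)
    for alignment in range(4):
        # Skipping `alignment` leading bases shifts the byte boundaries.
        encoded = pack_bytes(pattern[alignment:])
        for offset in range(len(encoded) - 1):
            key = (encoded[offset] << 8) | encoded[offset + 1]
            dictionary[key].append((offset, alignment))
    return dictionary


table = preprocess("ACGTAATTAAT")
print(table[6927])  # [(0, 0)]: ACGT|AATT at byte offset 0, alignment 0
```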

  16. BADPM: Pattern Preprocessing Space
      Preprocessed pattern: $O(m\alpha^2 \log m)$.
      [Figure: dictionary with entries 0 ... 65 535, each pointing to a list of (offset, alignment) elements. The figure is annotated with the factors $O(\alpha^2)$, $O(m)$ and $O(\log m)$ of the bound.]

  17. BADPM: Pattern Preprocessing Time
      $O(m\alpha^2)$
      ■ Scan $O(m)$ bytes of the encoded pattern $P_E$.
      ■ Check $O(\alpha^2)$ double-byte values at each position (pathological patterns ...NNNNNNNN...).
      ■ Store offset and alignment for each double-byte value to the corresponding list ($O(1)$ time).
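The pathological case can be made concrete by enumerating the double-byte values that a degenerate 8-gram of the pattern stands for. The sketch below (the function name `double_byte_values` is illustrative) expands each IUPAC symbol into its bases and packs every combination into a 16-bit value; eight consecutive N symbols then yield all $4^8 = 65\,536$ values, while a mostly solid 8-gram yields only a few.

```python
from itertools import product

# 2-bit codes of the bases: A -> 00, C -> 01, G -> 10, T -> 11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

# IUPAC symbol -> subset of bases.
SUBSETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def double_byte_values(gram8):
    """Return the set of 16-bit values covered by a degenerate 8-gram."""
    values = set()
    for bases in product(*(SUBSETS[s] for s in gram8)):
        value = 0
        for base in bases:
            value = (value << 2) | CODE[base]   # 8 bases * 2 bits = 16 bits
        values.add(value)
    return values


print(len(double_byte_values("NNNNNNNN")))   # 65536
print(sorted(double_byte_values("ACVTAATT")))  # [4879, 5903, 6927]
```

The three values for A C V T A A T T are the dictionary entries 4 879, 5 903 and 6 927 that appear in the data-structure figure.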

  18. BADPM Searching
      1. Read short value and check the dictionary.
      2. Byte-level check according to the offset.
      3. Prefix and suffix check according to the alignment.
      Figure 2: BADPM: Conceptual schema of searching.
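The three steps can be sketched for the simplest setting. This is not the authors' implementation: it assumes a solid text and a solid pattern (no variants), inspects every byte position instead of taking long shifts, and requires a pattern of at least 11 symbols so that every alignment contributes at least one double-byte value. The names `pack_bytes` and `search` are hypothetical.

```python
from collections import defaultdict

# 2-bit codes of the bases: A -> 00, C -> 01, G -> 10, T -> 11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_bytes(seq):
    """Pack the complete 4-grams of seq into byte values; trailing bases are ignored."""
    out = []
    for k in range(0, len(seq) - len(seq) % 4, 4):
        value = 0
        for base in seq[k:k + 4]:
            value = (value << 2) | CODE[base]
        out.append(value)
    return out


def search(text, pattern):
    """Return sorted 0-based occurrences of a solid pattern (length >= 11) in a solid text."""
    m = len(pattern)
    # Preprocessing: encoded pattern for each alignment plus the two-byte dictionary.
    encoded = [pack_bytes(pattern[a:]) for a in range(4)]
    dictionary = defaultdict(list)
    for a in range(4):
        for off in range(len(encoded[a]) - 1):
            key = (encoded[a][off] << 8) | encoded[a][off + 1]
            dictionary[key].append((off, a))

    base_seq = pack_bytes(text)
    hits = set()
    for i in range(len(base_seq) - 1):
        # Step 1: read a two-byte value and check the dictionary.
        key = (base_seq[i] << 8) | base_seq[i + 1]
        for off, a in dictionary.get(key, ()):
            first = i - off            # text byte aligned with encoded[a][0]
            start = 4 * first - a      # text position of pattern[0]
            if start < 0 or start + m > len(text):
                continue
            # Step 2: byte-level check of the whole encoded pattern.
            if base_seq[first:first + len(encoded[a])] != encoded[a]:
                continue
            # Step 3: symbol-level check of the unaligned prefix and suffix.
            tail = a + 4 * len(encoded[a])
            if (text[start:start + a] == pattern[:a]
                    and text[start + tail:start + m] == pattern[tail:]):
                hits.add(start)
    return sorted(hits)


pattern = "ACAAGTTATATATGGC"
text = "G" + pattern + "TTA" + pattern
print(search(text, pattern))  # [1, 20]
```

The first occurrence is found with alignment 3 and the second with alignment 0, which shows why the pattern is tabulated at all four alignments.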

  19. BADPM Searching: Example
      [Figure, shown in four animation steps: the pattern A C A A G T T A T A T A T G G C is aligned against baseSeq at byte position i. The steps show the dictionary entries 4 284 and 6 332, the variant byte A C G A, and the variants, variantPos and variantNum arrays.]
