Using k-mers with errors for Nanopore read analysis


  1. Using k-mers with errors for Nanopore read analysis. Quentin Bonenfant, Laurent Noé, Hélène Touzet ({quentin.bonenfant, laurent.noe, helene.touzet}@univ-lille.fr). Centre de Recherche en Informatique, Signal et Automatique de Lille (CRIStAL) – UMR CNRS 9189 – BONSAI team

  2. Overview: 1) K-mers 2) Long read sequencing 3) K-mers with errors 4) Use case: Nanopore adapters 5) Results 6) Conclusion (Quentin Bonenfant - SeqBio 2018)

  3. K-mers
     ● Substring of size k (example with k = 8): ATCAGTCAGCGGGTATCTACTGCACCTATCGAGCTTTTTT
     ● Used for:
       – Assembly (SPAdes → De Bruijn graph)
       – Mapping (Bowtie2 → Burrows-Wheeler Transform)
       – Overlapping (Minimap2 → minimizers)
       – …
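
As a minimal sketch of the definition above, the following Python snippet enumerates the k-mers of the sequence shown on the slide. The function name `kmers` is only illustrative and does not come from the presentation.

```python
def kmers(sequence, k):
    """Return every substring of length k, in left-to-right order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


read = "ATCAGTCAGCGGGTATCTACTGCACCTATCGAGCTTTTTT"  # sequence from the slide
eight_mers = kmers(read, 8)
print(eight_mers[:3])
print(len(eight_mers), "k-mers of size 8")
```

A sequence of length n contains n - k + 1 overlapping k-mers, so a sequence shorter than k yields none.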

  4. Long read sequencing: 10-15% of errors. Example alignment around a k-mer (slide labels: k-mer, insertion, deletion, substitution):

     ATCAGTCAGCGGGGTATCTACTC---CACCTATCGAGCTTTTTTATCT
     |||||||||||   |||| ||||   |||||||||||||||  |||||
     ATCAGTCAGCG---TATCGACTCTAGCACCTATCGAGCTTT--TATCT

  5. Long read sequencing: how to account for sequencing errors? → k-mers with errors → d: maximum number of errors

  6. K-mers with errors? Example with k = 8 and d = 1: AATTCCGG and some of its variants with one error: ACTTCCGG (substitution), AATTC-GG (deletion), AATTTCCGG (insertion), ...
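
To make "at most d errors" concrete, here is a small sketch that computes the edit (Levenshtein) distance between two strings and checks the variants from the slide against AATTCCGG. Reading AATTCCGG as the reference k-mer is an assumption, and `edit_distance` is an illustrative helper, not code from the talk.

```python
def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions and deletions all cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # drop a character of a
                            curr[j - 1] + 1,            # drop a character of b
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


kmer = "AATTCCGG"  # assumed to be the reference k-mer of the slide
variants = ["ACTTCCGG", "AATTCGG", "AATTTCCGG"]
distances = {v: edit_distance(kmer, v) for v in variants}
print(distances)
```

Each variant is within distance d = 1 of the k-mer, even though two of them no longer have length k.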

  7. How?
     ● Using dynamic programming → large computational cost
     ● Indexing all neighbours → memory expensive / long to compute
     ● Searching with errors in an index → 01*0 seeds
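
The cost of the second option can be illustrated with a brute-force sketch that generates every string within edit distance d of one k-mer. The DNA alphabet and the helper names are assumptions made for this example; the talk does not show how neighbours would be enumerated.

```python
ALPHABET = "ACGT"  # assumed alphabet


def one_edit(s):
    """All strings obtained from s by one substitution, deletion or insertion."""
    out = set()
    for i in range(len(s)):
        out.add(s[:i] + s[i + 1:])              # deletion
        for c in ALPHABET:
            out.add(s[:i] + c + s[i + 1:])      # substitution (may give s itself)
    for i in range(len(s) + 1):
        for c in ALPHABET:
            out.add(s[:i] + c + s[i:])          # insertion
    return out


def neighbourhood(kmer, d):
    """All strings within edit distance d of kmer, including kmer itself."""
    ball = {kmer}
    frontier = {kmer}
    for _ in range(d):
        frontier = {t for s in frontier for t in one_edit(s)} - ball
        ball |= frontier
    return ball


sizes = [len(neighbourhood("AATTCCGG", d)) for d in range(3)]
print(sizes)
```

The set grows quickly with d, and an index would have to store such a set for every k-mer, which is what makes this option memory hungry.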

  8. 01*0 seeds
     ● Approximate seeds
     ● Lossless
     ● Principle:
       – Choose a value for d
       – Split the k-mer in d + 2 blocks
       – Search the blocks in the index

  9. 01*0 seeds: pigeonhole principle. With 4 pigeons (d errors) and 6 holes (d + 2 blocks), at least 2 holes are empty, so at least 2 blocks contain no error.
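
The pigeonhole argument can be checked exhaustively for small d. The sketch below enumerates every way to spread at most d errors over d + 2 blocks and verifies two things: at least two blocks are error-free, and some run of consecutive blocks has error counts 0, 1, ..., 1, 0, which is the pattern the seed name refers to. It is a brute-force illustration with made-up helper names, not the search algorithm itself.

```python
import re
from itertools import product

# An error-free block, then blocks with exactly one error, then an error-free block.
SEED = re.compile(r"01*0")


def error_distributions(d):
    """Yield every way to spread at most d errors over d + 2 blocks."""
    for counts in product(range(d + 1), repeat=d + 2):
        if sum(counts) <= d:
            yield counts


def has_seed(counts):
    """True if consecutive blocks have error counts 0, 1, ..., 1, 0.

    Counts are written as single digits, so this helper assumes counts < 10.
    """
    return SEED.search("".join(str(c) for c in counts)) is not None


def lossless(d):
    """Check the pigeonhole bound and the 01*0 pattern for one value of d."""
    return all(counts.count(0) >= 2 and has_seed(counts)
               for counts in error_distributions(d))


results = {d: lossless(d) for d in range(1, 5)}
print(results)
```

Because the pattern is always present, searching for it cannot miss an occurrence with at most d errors, which is what "lossless" means here.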

  10. 01*0 seeds: example. Finding “AUCAGUGCAAAUGCUCAAGA” with d = 3 and k = 20 → split in 5 blocks of size 4
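
A short sketch of the splitting step for this example follows. How to handle a k that is not a multiple of d + 2 is not covered on the slide; giving the extra characters to the first blocks is an assumption of this sketch.

```python
def split_blocks(kmer, d):
    """Split kmer into d + 2 consecutive blocks of (almost) equal size."""
    n_blocks = d + 2
    base, extra = divmod(len(kmer), n_blocks)
    blocks, start = [], 0
    for i in range(n_blocks):
        size = base + (1 if i < extra else 0)  # assumption: first blocks absorb the remainder
        blocks.append(kmer[start:start + size])
        start += size
    return blocks


blocks = split_blocks("AUCAGUGCAAAUGCUCAAGA", d=3)
print(blocks)
```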

  11. 01*0 seeds: three alignments with d = 3 errors; the digits mark the number of errors in each block of the 01*0 seed

      1)  AUCA GUGC AAAU GCUC AAGA
          ||||  ||| | || || | ||||
          AUCA AUGC A-AU GCGC AAGA
           0    1    1    1    0

      2)  AUCA GUGC AAAU GCUC -AAGA
          |||  |||| | || ||||  ||||
          AUC- GUGC AUAU GCUC AAAGA
                0    1    0

      3)  AUCA GUGC AAAU GCUC AAGA
          |||| | |  |||| || | ||||
          AUCA GAGA AAAU GC-C AAGA
                     0    1    0

  12. 01*0 seeds
      ● First implementation: BWOLO (2014), based on the BWT
        Vroland C, Salson M, Bini S, Touzet H. Approximate search of short patterns with high error rates using the 01*0 lossless seeds. Journal of Discrete Algorithms 37, 2016
      ● SeqAn implementation: Optimum Search Scheme (2018), based on the bidirectional BWT
        Kianfar K, Pockrandt C, Torkamandi B, Luo H, Reinert K. FAMOUS: Fast Approximate String Matching Using OptimUm Search Schemes. RECOMB-Seq 2018

  13. Use case: Motif inference for Nanopore adapters

  14. Nanopore adapters sequence

  15. Nanopore adapters sequence
      ● Sequencing adapters sequence
      ● Porechop (https://github.com/rrwick/Porechop/)
        – Adapter trimming
        – Known adapters database
      ● Can we guess the adapter sequence from the reads?

  16. Our method
      ● Identify the k-mers composing the adapter
        – Higher frequency at the start / end of reads
      ● Reconstruct the adapter from the k-mers

  17. Frequency of k-mers (k = 16)

  18. Counting k-mers with errors
      ● Select the 500 most frequent 16-mers
      ● Count all occurrences with d = 2 errors
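
The two counting steps can be sketched in plain Python on toy reads. This is not the authors' implementation, which relies on an index rather than a scan. Counting each read at most once per k-mer, the `window` parameter and all function names are assumptions of this sketch.

```python
from collections import Counter


def count_exact(reads, k, window):
    """Count exact k-mers found in the first `window` bases of each read."""
    counts = Counter()
    for read in reads:
        prefix = read[:window]
        for i in range(len(prefix) - k + 1):
            counts[prefix[i:i + k]] += 1
    return counts


def occurs_with_errors(pattern, text, d):
    """True if some substring of text is within edit distance d of pattern."""
    column = list(range(len(pattern) + 1))
    for c in text:
        new = [0]  # an occurrence may start anywhere in the text
        for i, p in enumerate(pattern, start=1):
            new.append(min(column[i] + 1, new[i - 1] + 1, column[i - 1] + (p != c)))
        column = new
        if column[-1] <= d:
            return True
    return False


def count_with_errors(reads, kmers, d, window):
    """For each k-mer, count the reads whose prefix contains it with at most d errors."""
    return {kmer: sum(occurs_with_errors(kmer, read[:window], d) for read in reads)
            for kmer in kmers}


reads = [
    "TTCAGTTACGTATTGCAAAA",
    "TTCAGTTACGTATTGCAAAA",
    "TTCAGTAACGTATTGCAAAA",  # one substitution inside the adapter-like prefix
    "ACGTACGTGGCCAATTGGCC",  # unrelated toy read
]
exact = count_exact(reads, k=16, window=100)
top = [kmer for kmer, _ in exact.most_common(3)]  # the slide selects the top 500
approx = count_with_errors(reads, top, d=2, window=100)
print(approx)
```

On the toy data, the read carrying a substitution is invisible to exact counting but is recovered once errors are allowed.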

  19. Counting result example (k = 16)

      Exact k-mers (d = 0):
      K-mer             Count
      TTCAGTTACGTATTGC  2761
      TCAGTTACGTATTGCT  2716
      CTATCTTCGGCGTCTG  2628
      TCTATCTTCGGCGTCT  2628
      CTTCGTTCAGTTACGT  2626
      CGTTCAGTTACGTATT  2612
      GTTCAGTTACGTATTG  2567
      CTCTATCTTCGGCGTC  2563
      GCTCTATCTTCGGCGT  2509
      CTGTCGCTCTATCTTC  2491

      K-mers with errors (d = 2):
      K-mer             0 err  1 err  2 err
      TTCAGTTACGTATTGC  2761   4403   5844
      CTTCGTTCAGTTACGT  2626   4324   6002
      CGTTCAGTTACGTATT  2612   4420   5905
      TCAGTTACGTATTGCT  2716   4361   5813
      GTTCAGTTACGTATTG  2567   4423   5837
      TCGTTCAGTTACGTAT  2359   4276   5895
      TTCGTTCAGTTACGTA  2447   4048   5591
      CTATCTTCGGCGTCTG  2628   3999   4775
      TCTATCTTCGGCGTCT  2628   3934   4748
      CTCTATCTTCGGCGTC  2563   3900   4649

  20. k-mer ranks plot

  21. How the adapter sequence is built

      Ranked k-mers (k = 16):
      Rank  K-mer
      1     TTCAGTTACGTATTGC
      2     CTTCAGTTACGTATTG
      3     CGTTCAGTTACGTATT
      4     TCAGTTACGTATTGCT
      5     GTTCAGTTACGTATTG
      6     GTTACGTATTGCTGTT
      7     TTACGTATTGCTGTTC
      8     CAGTTACGTATTGCTG
      9     TACGTATTGCTGTTCT
      10    CTCTATCTTCGGCGTC
      11    AGTTACGTATTGCTGT

      Overlapping k-mers, each shifted by one base, and the resulting sequence:
      CTTCAGTTACGTATTG
       TTCAGTTACGTATTGC
        TCAGTTACGTATTGCT
         CAGTTACGTATTGCTG
          AGTTACGTATTGCTGT
           GTTACGTATTGCTGTT
            TTACGTATTGCTGTTC
             TACGTATTGCTGTTCT
      CTTCAGTTACGTATTGCTGTTCT
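
One plausible way to obtain such a sequence is a greedy extension: start from the best-ranked k-mer and repeatedly attach the best-ranked unused k-mer that overlaps the current sequence by k - 1 bases, on either side. This is a sketch under that assumption; the slide does not spell out the exact procedure, and `build_adapter` is a hypothetical name.

```python
def build_adapter(ranked_kmers):
    """Greedily extend the top k-mer with k-mers overlapping it by k - 1 bases."""
    k = len(ranked_kmers[0])
    adapter = ranked_kmers[0]
    unused = list(ranked_kmers[1:])  # kept in rank order
    extended = True
    while extended:
        extended = False
        for kmer in unused:
            if kmer[:-1] == adapter[-(k - 1):]:   # extends to the right
                adapter += kmer[-1]
            elif kmer[1:] == adapter[:k - 1]:     # extends to the left
                adapter = kmer[0] + adapter
            else:
                continue
            unused.remove(kmer)
            extended = True
            break  # restart from the best-ranked remaining k-mer
    return adapter


ranked = [
    "TTCAGTTACGTATTGC", "CTTCAGTTACGTATTG", "CGTTCAGTTACGTATT",
    "TCAGTTACGTATTGCT", "GTTCAGTTACGTATTG", "GTTACGTATTGCTGTT",
    "TTACGTATTGCTGTTC", "CAGTTACGTATTGCTG", "TACGTATTGCTGTTCT",
    "CTCTATCTTCGGCGTC", "AGTTACGTATTGCTGT",
]
adapter = build_adapter(ranked)
print(adapter)
```

When two k-mers could extend the same side, the rank order decides, so lower-ranked alternatives and k-mers from another adapter are simply left unused.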

  22. Dataset 1
      ● Consortium ANR ASTER
        – Algorithms and software for third generation sequencing
      ● Prep and sequencing: Genoscope
        – Species: Mouse (Mus musculus)
        – Tissue: brain
        – Sample type: 1D cDNA
        – Flowcell: R9.4
        – ENA/SRA: PRJEB25574

  23. Experiment
      ● Sample size
        – 10,000 reads, first 100 bases
        – k = 16, d = 2
      ● Run the workflow on 100 samples
      ● Compare results for both counting methods (k-mers with and without errors)
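
The sampling protocol can be sketched as follows, on toy reads and with small numbers so that it runs instantly. Drawing reads without replacement inside a sample and fixing a random seed are assumptions; the slide only gives the sample size, the prefix length and the number of samples.

```python
import random


def draw_samples(reads, n_samples, sample_size, prefix_len, seed=0):
    """Draw n_samples random samples and keep only the first bases of each read."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [[read[:prefix_len] for read in rng.sample(reads, sample_size)]
            for _ in range(n_samples)]


# Toy reads standing in for a FASTQ file.
toy_rng = random.Random(1)
toy_reads = ["".join(toy_rng.choice("ACGT") for _ in range(150)) for _ in range(50)]

# The experiment uses 100 samples of 10,000 reads; smaller values are used here.
samples = draw_samples(toy_reads, n_samples=3, sample_size=10, prefix_len=100)
print(len(samples), len(samples[0]), len(samples[0][0]))
```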

  24. Results with exact k-mers

  25. Results with approximate k-mers

  26. The MEME approach
      ● MEME: Multiple EM for Motif Elicitation
        Bailey and Elkan, Fitting a mixture model by expectation maximization to discover motifs in biopolymers, ISMB 1994.
      ● Experiment
        – 1000 random reads, first 100 bases
        – Repeated on 5 samples

  27. MEME results

  28. Results: exact k-mers, approximate k-mers and MEME

  29. Dataset 2
      ● Nanopore WGS consortium
        – Oxford Nanopore human reference datasets: https://github.com/nanopore-wgs-consortium/NA12878
      ● Data from the RNA project
        – https://github.com/nanopore-wgs-consortium/NA12878/blob/master/nanopore-human-transcriptome/fastq_fast5_bulk.md
        – Sample type: 1D cDNA
        – Cell line: GM12878 human cell line (Ceph/Utah pedigree)
        – Kit: SQK-PCS108
        – Flowcell: R9.4
        – File: Bham_Run1_20171115_1D.pass.depup.fastq
      ● Same experiment

  30. Exact k-mers results: 67 different adapters found

  31. Approximate k-mers results (WGS): PCR adapters 3 (start) ← T-ACTTGCCTGTCGCTCTATCTTC

  32. Approximate k-mers results (WGS): approximate k-mers vs. exact k-mers

  33. Implementation
      ● C++ with the SeqAn library (optimal search schemes)
      ● Computation time for 10k reads
        – Exact k-mers: < 1 second
        – K-mers with errors: 10-20 seconds
        – MEME: > 180 h

  34. Porechop wrapper
      ● Integration in Porechop (Python): a custom wrapper fits into the Porechop workflow by adding the inferred adapters to the adapter database
      ● Two case studies:
        → discovering the adapter sequence if it is unknown
        → checking that a known adapter is present and correctly sequenced (quality check)
      ● C++ and Python code available on demand

  35. Conclusion
      ● Our goal was to test the efficiency of 01*0 seeds in a k-mer approach on noisy reads
      ● Our experiment showed
        – More consistent results with approximate k-mers
        – Practical running time for real data
        → k-mers with errors can improve results at low cost
      ● Could they be used as an alternative to minimizers?

  36. End. Thank you for your attention
