Using k-mers with errors for Nanopore read analysis

Centre de Recherche en Informatique, Signal et Automatique de Lille
Using k-mers with errors for Nanopore read analysis
Quentin Bonenfant, Laurent Noé, Hélène Touzet
{quentin.bonenfant, laurent.noe, helene.touzet}@univ-lille.fr
CRIStAL – UMR CNRS 9189 – BONSAI team

Overview
1) K-mers
2) Long read sequencing
3) K-mers with errors
4) Use case: Nanopore adapters
5) Results
6) Conclusion
Quentin Bonenfant - SeqBio 2018

K-mers
● Substring of size k (example with k = 8)
ATCAGTCAGCGGGTATCTACTGCACCTATCGAGCTTTTTT
● Used for:
– Assembly (SPAdes → De Bruijn graph)
– Mapping (Bowtie2 → Burrows-Wheeler Transform)
– Overlapping (Minimap2 → minimizers)
– …
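As a minimal sketch of the definition above, the k-mers of a sequence can be listed with a sliding window. The function name `kmers` is an illustrative choice, not something taken from the talk.

```python
def kmers(seq, k):
    """Return every substring of length k, in left-to-right order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


seq = "ATCAGTCAGCGGGTATCTACTGCACCTATCGAGCTTTTTT"  # sequence shown on the slide
words = kmers(seq, 8)
print(len(words), words[0], words[-1])
```

A sequence of length n yields n - k + 1 k-mers, and none at all when n < k.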

Long read sequencing
10-15% error rate
ATCAGTCAGCGGGGTATCTACTC---CACCTATCGAGCTTTTTTATCT
|||||||||||   |||| ||||   |||||||||||||||  |||||
ATCAGTCAGCG---TATCGACTCTAGCACCTATCGAGCTTT--TATCT
Labels: k-mer, insertion (TAG), deletion (GGG, TT), substitution (T → G)

Long read sequencing
How to account for sequencing errors?
→ k-mers with errors
→ d: max number of errors

K-mers with errors?
k = 8, d = 1
AATTCCGG (k-mer)
ACTTCCGG (substitution)
AATTC-GG (deletion)
AATTTCCGG (insertion)
...
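To make the d = 1 case concrete, here is a small sketch that enumerates every string reachable from a k-mer with one substitution, deletion or insertion. The function name and the DNA alphabet default are assumptions made for illustration.

```python
def neighbours_d1(kmer, alphabet="ACGT"):
    """Return the set of strings at edit distance exactly 1 from kmer."""
    out = set()
    for i in range(len(kmer) + 1):
        for c in alphabet:
            out.add(kmer[:i] + c + kmer[i:])          # insertion before position i
        if i < len(kmer):
            out.add(kmer[:i] + kmer[i + 1:])          # deletion of position i
            for c in alphabet:
                out.add(kmer[:i] + c + kmer[i + 1:])  # substitution at position i
    out.discard(kmer)  # substituting a base by itself gives the k-mer back
    return out


variants = neighbours_d1("AATTCCGG")
print("ACTTCCGG" in variants, "AATTCGG" in variants, "AATTTCCGG" in variants)
```

The neighbourhood grows quickly with k and d, which is why indexing all neighbours becomes expensive.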

How?
● Using dynamic programming → large computational cost
● Indexing all neighbours → memory expensive / long to compute
● Searching with errors in an index → 01*0 seeds

01*0 seeds
● Approximate seeds
● Lossless
● Principle:
– Choose a value for d
– Split the k-mer into d + 2 blocks
– Search the blocks in the index

01*0 seeds
Pigeonhole principle: 4 pigeons (d), 6 holes (d + 2) → at least 2 holes are empty
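The pigeonhole argument can be pushed one step further: with at most d errors spread over d + 2 blocks, two error-free blocks always exist that are separated only by blocks carrying exactly one error each, which is the 0 1* 0 shape that gives the seeds their name. The sketch below checks this exhaustively for small d; the function names are illustrative and the brute-force check is not taken from the talk.

```python
from itertools import product


def has_seed(errors):
    """True if two error-free blocks are separated only by blocks with exactly one error."""
    for i, e in enumerate(errors):
        if e != 0:
            continue
        j = i + 1
        while j < len(errors) and errors[j] == 1:
            j += 1
        if j < len(errors) and errors[j] == 0:
            return True
    return False


def lossless(d):
    """Check every way of placing at most d errors into d + 2 blocks."""
    return all(
        has_seed(dist)
        for dist in product(range(d + 1), repeat=d + 2)
        if sum(dist) <= d
    )


print(all(lossless(d) for d in range(5)))
```

With more than d errors the guarantee disappears: the distribution (0, 2, 0, 2, 0) over five blocks contains no such pair.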

01*0 seeds
Example: finding "AUCAGUGCAAAUGCUCAAGA" with d = 3, k = 20 → split into 5 blocks of size 4
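The splitting step of this example can be sketched as follows. How leftover bases are distributed when k is not a multiple of d + 2 is an assumption of this sketch (the first blocks get one extra base); the slide only covers the even case.

```python
def split_blocks(kmer, d):
    """Split kmer into d + 2 contiguous blocks whose sizes differ by at most one."""
    n = d + 2
    base, extra = divmod(len(kmer), n)
    blocks, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)  # assumption: first blocks absorb the remainder
        blocks.append(kmer[start:start + size])
        start += size
    return blocks


print(split_blocks("AUCAGUGCAAAUGCUCAAGA", 3))
```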

01*0 seeds
1) AUCA GUGC AAAU GCUC AAGA
   ||||  ||| | || || | ||||
   AUCA AUGC A-AU GCGC AAGA
    0    1    1    1    0
2) AUCA GUGC AAAU GCUC -AAGA
   |||  |||| | || ||||  ||||
   AUC- GUGC AUAU GCUC AAAGA
         0    1    0
3) AUCA GUGC AAAU GCUC AAGA
   |||| | |  |||| || | ||||
   AUCA GAGA AAAU GC-C AAGA
              0    1    0
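The three alignments above can be replayed in a few lines: count the mismatching or gapped columns in each block, then look for a 0 1* 0 run in the resulting counts. This is a sketch for illustration only; the helper names are hypothetical, and gaps are assumed to be written as "-" so that aligned blocks have equal length.

```python
import re


def block_errors(pattern_blocks, text_blocks):
    """Count mismatching or gapped columns in each pair of aligned blocks."""
    return [sum(a != b for a, b in zip(p, t))
            for p, t in zip(pattern_blocks, text_blocks)]


def seed_span(errors):
    """Return (first, last) block indices of the first 0 1* 0 run, or None."""
    digits = "".join(str(min(e, 9)) for e in errors)  # one character per block
    m = re.search(r"01*0", digits)
    return (m.start(), m.end() - 1) if m else None


examples = [
    ("AUCA GUGC AAAU GCUC AAGA", "AUCA AUGC A-AU GCGC AAGA"),
    ("AUCA GUGC AAAU GCUC -AAGA", "AUC- GUGC AUAU GCUC AAAGA"),
    ("AUCA GUGC AAAU GCUC AAGA", "AUCA GAGA AAAU GC-C AAGA"),
]
for pattern, text in examples:
    errs = block_errors(pattern.split(), text.split())
    print(errs, seed_span(errs))
```

In the second and third alignments the seed does not start at the first block, which is why every pair of candidate blocks has to be considered during the search.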

01*0 seeds
● First implementation
– BWOLO (2014)
– BWT
Vroland C, Salson M, Bini S, Touzet H. Approximate search of short patterns with high error rates using the 01*0 lossless seeds. Journal of Discrete Algorithms 37, 2016
● SeqAn implementation
– Optimum Search Scheme (2018)
– Bidirectional BWT
Kianfar K, Pockrandt C, Torkamandi B, Luo H, Reinert K. FAMOUS: Fast Approximate String Matching Using OptimUm Search Schemes. RECOMB-Seq 2018

Use case: Motif inference for Nanopore adapters

Nanopore adapter sequence

Nanopore adapter sequence
● Sequence of the sequencing adapters
● Porechop
– https://github.com/rrwick/Porechop/
– Adapter trimming
– Known adapters database
● Can we guess the adapter sequence from the reads?

Our method
● Identify the k-mers composing the adapter
– Higher frequency at the start / end of reads
● Reconstruct the adapter from the k-mers
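The first step, spotting k-mers that are over-represented at read starts, might look like the sketch below. The function name, the `window` parameter and the toy reads are assumptions made for illustration; only the idea of counting k-mers near the read ends comes from the slide.

```python
from collections import Counter


def count_start_kmers(reads, k=16, window=100):
    """Count exact k-mers found in the first `window` bases of each read."""
    counts = Counter()
    for read in reads:
        head = read[:window]
        counts.update(head[i:i + k] for i in range(len(head) - k + 1))
    return counts


# Toy reads sharing a 12-base prefix that stands in for an adapter.
reads = ["AAACCCGGGTTT" + tail for tail in ("ACAC", "GTGT", "TTAA")]
print(count_start_kmers(reads, k=4, window=12).most_common(3))
```

Read ends can be handled the same way by passing the reversed or reverse-complemented reads.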

Frequency of k-mers (k = 16)

Counting k-mers with errors
● Select the 500 most frequent 16-mers
● Count all occurrences with d = 2 errors
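A naive way to reproduce this counting step is plain dynamic programming over each read start. The talk uses an index with 01*0 seeds instead, so the sketch below swaps in a slower technique purely for illustration. It also assumes that a count means "number of reads with at least one occurrence within e errors", reported cumulatively for e = 0..d; the slides do not state the exact counting convention.

```python
from collections import Counter


def min_edits(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    prev = list(range(len(pattern) + 1))
    best = prev[-1]
    for c in text:
        cur = [0]  # an occurrence may start at any text position
        for i in range(1, len(pattern) + 1):
            cur.append(min(prev[i - 1] + (pattern[i - 1] != c),  # match / substitution
                           prev[i] + 1,                          # extra base in the text
                           cur[i - 1] + 1))                      # base missing from the text
        best = min(best, cur[-1])
        prev = cur
    return best


def count_with_errors(reads, exact_counts, top=500, d=2, window=100):
    """For the `top` most frequent k-mers, count reads matching within 0..d errors."""
    result = {}
    for kmer, _ in exact_counts.most_common(top):
        per_level = [0] * (d + 1)
        for read in reads:
            e = min_edits(kmer, read[:window])
            for level in range(e, d + 1):  # cumulative: e errors also counts for e+1, ...
                per_level[level] += 1
        result[kmer] = per_level
    return result


reads = ["AATTCCGGAC", "AATTCGGACT", "ACTTCCGGAA", "GGGGGGGGGG"]
print(count_with_errors(reads, Counter({"AATTCCGG": 1}), top=1, d=1, window=10))
```

This costs O(k × window) per k-mer and per read, which illustrates why dynamic programming alone is listed as computationally expensive.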

Counting result example
Exact k-mers (k = 16, d = 0)
K-mer             Count
TTCAGTTACGTATTGC  2761
TCAGTTACGTATTGCT  2716
CTATCTTCGGCGTCTG  2628
TCTATCTTCGGCGTCT  2628
CTTCGTTCAGTTACGT  2626
CGTTCAGTTACGTATT  2612
GTTCAGTTACGTATTG  2567
CTCTATCTTCGGCGTC  2563
GCTCTATCTTCGGCGT  2509
CTGTCGCTCTATCTTC  2491
K-mers with errors (k = 16, d = 2)
K-mer             0 err  1 err  2 err
TTCAGTTACGTATTGC  2761   4403   5844
CTTCGTTCAGTTACGT  2626   4324   6002
CGTTCAGTTACGTATT  2612   4420   5905
TCAGTTACGTATTGCT  2716   4361   5813
GTTCAGTTACGTATTG  2567   4423   5837
TCGTTCAGTTACGTAT  2359   4276   5895
TTCGTTCAGTTACGTA  2447   4048   5591
CTATCTTCGGCGTCTG  2628   3999   4775
TCTATCTTCGGCGTCT  2628   3934   4748
CTCTATCTTCGGCGTC  2563   3900   4649

k-mer ranks plot

How adapter sequence is built
Rank  K-mer (k = 16)
1     TTCAGTTACGTATTGC
2     CTTCAGTTACGTATTG
3     CGTTCAGTTACGTATT
4     TCAGTTACGTATTGCT
5     GTTCAGTTACGTATTG
6     GTTACGTATTGCTGTT
7     TTACGTATTGCTGTTC
8     CAGTTACGTATTGCTG
9     TACGTATTGCTGTTCT
10    CTCTATCTTCGGCGTC
11    AGTTACGTATTGCTGT
Overlapping k-mers:
CTTCAGTTACGTATTG
 TTCAGTTACGTATTGC
  TCAGTTACGTATTGCT
   CAGTTACGTATTGCTG
    AGTTACGTATTGCTGT
     GTTACGTATTGCTGTT
      TTACGTATTGCTGTTC
       TACGTATTGCTGTTCT
Resulting adapter: CTTCAGTTACGTATTGCTGTTCT
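One way to picture the construction is a greedy extension: start from the top-ranked k-mer and repeatedly attach the best-ranked remaining k-mer that overlaps the current sequence by k - 1 bases on either side. This is only a sketch under that assumption; the slides do not spell out the authors' exact procedure, and the function name is hypothetical.

```python
def assemble(ranked_kmers):
    """Greedily extend the top-ranked k-mer using (k-1)-base overlaps."""
    contig = ranked_kmers[0]
    k = len(contig)
    remaining = list(ranked_kmers[1:])
    progress = True
    while progress:
        progress = False
        for km in remaining:                   # best-ranked candidates first
            if km[:-1] == contig[-(k - 1):]:   # km extends the right end
                contig += km[-1]
            elif km[1:] == contig[:k - 1]:     # km extends the left end
                contig = km[0] + contig
            else:
                continue
            remaining.remove(km)
            progress = True
            break                              # rescan from the best rank
    return contig


ranked = [
    "TTCAGTTACGTATTGC", "CTTCAGTTACGTATTG", "CGTTCAGTTACGTATT",
    "TCAGTTACGTATTGCT", "GTTCAGTTACGTATTG", "GTTACGTATTGCTGTT",
    "TTACGTATTGCTGTTC", "CAGTTACGTATTGCTG", "TACGTATTGCTGTTCT",
    "CTCTATCTTCGGCGTC", "AGTTACGTATTGCTGT",
]
print(assemble(ranked))
```

Scanning in rank order resolves ties: both CTTCAGTTACGTATTG and GTTCAGTTACGTATTG could extend the first k-mer to the left, and the better-ranked one is used.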

Dataset 1
● Consortium ANR ASTER
– Algorithms and software for third generation sequencing
● Prep and sequencing: Genoscope
– Species: Mouse (Mus musculus)
– Tissue: brain
– Sample type: 1D cDNA
– Flowcell: R9.4
– ENA/SRA: PRJEB25574

Experiment
● Sample size
– 10,000 reads, first 100 bases
– k = 16, d = 2
● Run the workflow on 100 samples
● Compare results for both counting methods (k-mers with and without errors)
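The repeated-sampling protocol can be sketched as a small harness that draws random read samples, runs an inference function on each, and tallies how often each adapter comes out. Here `infer` is a hypothetical callable standing in for the whole workflow, and the seed and function name are assumptions; fewer distinct keys in the tally means more consistent results.

```python
import random
from collections import Counter


def tally_inferred(reads, infer, n_samples=100, sample_size=10_000, seed=0):
    """Run `infer` on random read samples and count each distinct result."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    tally = Counter()
    for _ in range(n_samples):
        sample = rng.sample(reads, min(sample_size, len(reads)))
        tally[infer(sample)] += 1
    return tally


# Toy usage: a dummy inference that always reports the same adapter.
print(tally_inferred(list("abcdef"), lambda sample: "ADAPTER", n_samples=5, sample_size=3))
```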

Results with exact k-mers

Results with approximate k-mers

The MEME approach
● MEME: Multiple EM for Motif Elicitation
Bailey and Elkan, Fitting a mixture model by expectation maximization to discover motifs in biopolymers, ISMB 1994.
● Experiment
– 1000 random reads, first 100 bases
– Repeated on 5 samples

MEME results

Results (panels: Exact k-mers, Approximate k-mers, MEME)

Dataset 2
● Nanopore WGS consortium
– Oxford Nanopore human reference datasets
– https://github.com/nanopore-wgs-consortium/NA12878
● Data from RNA project
– https://github.com/nanopore-wgs-consortium/NA12878/blob/master/nanopore-human-transcriptome/fastq_fast5_bulk.md
– Sample type: 1D cDNA
– Cell line: GM12878 human cell line (Ceph/Utah pedigree)
– Kit: SQK-PCS108
– Flowcell: R9.4
– File: Bham_Run1_20171115_1D.pass.depup.fastq
● Same experiment

Exact k-mers results
67 different adapters found

Approximate k-mers results (WGS)
PCR adapters 3 (start) ← T-ACTTGCCTGTCGCTCTATCTTC

Approximate k-mers results (WGS) (panels: Approximate k-mers, Exact k-mers)

Implementation
● C++ with SeqAn library (optimal search schemes)
● Computation time for 10k reads
– Exact k-mers: < 1 second
– K-mers with errors: 10-20 seconds
– MEME: > 180 h

Porechop wrapper
● Integration in Porechop (Python)
– A custom wrapper* integrates easily into the Porechop workflow by adding the inferred adapters to the adapter database
● Two case studies:
→ discovering the adapter sequence if unknown
→ checking that a known adapter is present and correctly sequenced (quality check)
● C++ and Python code available on demand

Conclusion
● Our goal was to test the efficiency of 01*0 seeds using a k-mer approach on noisy reads
● Our experiment showed
– More consistent results with approximate k-mers
– Practical running time for real data
→ k-mers with errors can improve results at low cost
● Could be used as an alternative to minimizers?

End
Thank you for your attention
