Using k-mers with errors for Nanopore read analysis



SLIDE 1

Using k-mers with errors for Nanopore read analysis

Quentin Bonenfant, Laurent Noé, Hélène Touzet

{quentin.bonenfant, laurent.noe, helene.touzet}@univ-lille.fr

CRIStAL – UMR CNRS 9189 – BONSAI team

Centre de Recherche en Informatique, Signal et Automatique de Lille

SLIDE 2

Quentin Bonenfant - SeqBio 2018

Overview

1) K-mers
2) Long read sequencing
3) K-mers with errors
4) Use case: Nanopore adapters
5) Results
6) Conclusion

SLIDE 3

K-mers

ATCAGTCAGCGGGTATCTACTGCACCTATCGAGCTTTTTT

k=8

  • Substring of size k
  • Used for:

– Assembly (SPAdes → De Bruijn Graph)
– Mapping (Bowtie2 → Burrows-Wheeler Transform)
– Overlapping (Minimap2 → Minimizers)
– …
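As a minimal illustration of the definition above, the sketch below enumerates every substring of size k from the example read. The function name `kmers` is only an illustrative choice, not something taken from the talk.

```python
def kmers(sequence, k):
    """Yield every substring of length k, from left to right."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


read = "ATCAGTCAGCGGGTATCTACTGCACCTATCGAGCTTTTTT"
all_8mers = list(kmers(read, 8))

# A read of length n contains n - k + 1 overlapping k-mers.
first_three = all_8mers[:3]
```

Consecutive k-mers overlap by k-1 bases, which is why one sequencing error can affect up to k of them.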

SLIDE 4

Long read sequencing

Error rate: 10-15%

ATCAGTCAGCGGGGTATCTACTC---CACCTATCGAGCTTTTTTATCT
|||||||||||   |||| ||||   |||||||||||||||  |||||
ATCAGTCAGCG---TATCGACTCTAGCACCTATCGAGCTTT--TATCT

Figure labels: insertion, deletion, substitution; a k-mer window is highlighted.

SLIDE 5

Long read sequencing

How to account for sequencing errors?

→ k-mers with errors
→ d: maximum number of errors

SLIDE 6

K-mers with errors?

AATTCCGG (k=8, d=1)

ACTTCCGG (substitution)
AATTC-GG (deletion)
AATTTCCGG (insertion)
...
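To make the d=1 neighbourhood concrete, here is a small sketch that generates every string reachable from a k-mer by one substitution, one deletion or one insertion. The function name and the DNA alphabet constant are assumptions made for the illustration.

```python
ALPHABET = "ACGT"  # assumed DNA alphabet


def neighbours_d1(kmer, alphabet=ALPHABET):
    """Return all strings at edit distance exactly 1 from kmer."""
    out = set()
    for i in range(len(kmer)):
        # Deletion of the character at position i.
        out.add(kmer[:i] + kmer[i + 1:])
        # Substitution of position i by any other letter.
        for c in alphabet:
            if c != kmer[i]:
                out.add(kmer[:i] + c + kmer[i + 1:])
    for i in range(len(kmer) + 1):
        # Insertion of any letter before position i (or at the end).
        for c in alphabet:
            out.add(kmer[:i] + c + kmer[i:])
    return out


neighbourhood = neighbours_d1("AATTCCGG")
```

The neighbourhood grows quickly with k and d, which is the memory problem behind the "indexing all neighbours" option discussed in the talk.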

SLIDE 7

How?

  • Using dynamic programming

→ Large computational cost

  • Indexing all neighbours

→ Memory expensive / long to compute

  • Searching with errors in an index

→ 01*0 seeds

SLIDE 8

01*0 seeds

  • Approximate seeds
  • Lossless
  • Principle:

– Choose a value for d
– Split the k-mer into d+2 blocks
– Search the blocks in the index

SLIDE 9

01*0 seeds

Pigeonhole principle

4 pigeons (d)
6 holes (d+2)
→ At least 2 holes are empty

SLIDE 10

01*0 seeds

Example: finding “AUCAGUGCAAAUGCUCAAGA”

k=20, d=3
→ Split into d+2 = 5 blocks of size 4
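The splitting step can be sketched as follows. The helper name `split_blocks` is hypothetical, and the handling of lengths that do not divide evenly (leftover bases go to the first blocks) is an assumption; the slide only covers the even case.

```python
def split_blocks(kmer, d):
    """Split kmer into d + 2 contiguous blocks of near-equal size."""
    n_blocks = d + 2
    base, extra = divmod(len(kmer), n_blocks)
    blocks, start = [], 0
    for i in range(n_blocks):
        # Assumption: the first `extra` blocks get one additional base.
        size = base + (1 if i < extra else 0)
        blocks.append(kmer[start:start + size])
        start += size
    return blocks


blocks = split_blocks("AUCAGUGCAAAUGCUCAAGA", d=3)
# blocks == ["AUCA", "GUGC", "AAAU", "GCUC", "AAGA"]
```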

SLIDE 11

01*0 seeds

1) AUCA GUGC AAAU GCUC AAGA
   ||||  ||| | || || | ||||
   AUCA AUGC A-AU GCGC AAGA

   Errors per block: 0 1 1 1 0

2) AUCA GUGC AAAU GCUC -AAGA
   |||  |||| | || ||||  ||||
   AUC- GUGC AUAU GCUC AAAGA

   Errors per block: 1 0 1 0 1

3) AUCA GUGC AAAU GCUC AAGA
   |||| | |  |||| || | ||||
   AUCA GAGA AAAU GC-C AAGA

   Errors per block: 0 2 0 1 0
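Counting the errors in each block of the three alignments gives 0 1 1 1 0, then 1 0 1 0 1, then 0 2 0 1 0. Each of these contains a run shaped 0 1* 0: two error-free blocks separated by blocks holding exactly one error each. The sketch below locates that run and checks by brute force, for small d, that it exists whenever at most d errors fall into d+2 blocks. The function names are illustrative, and the exhaustive check is a sanity test rather than the proof given in the cited paper.

```python
from itertools import product


def find_seed(errors):
    """Return (i, j) for the leftmost window errors[i..j] shaped 0 1* 0, else None."""
    n = len(errors)
    for i in range(n):
        if errors[i] != 0:
            continue
        j = i + 1
        # Skip over blocks that contain exactly one error.
        while j < n and errors[j] == 1:
            j += 1
        if j < n and errors[j] == 0:
            return i, j
    return None


def always_has_seed(d):
    """Check every way of placing at most d errors into d + 2 blocks."""
    n_blocks = d + 2
    return all(
        find_seed(e) is not None
        for e in product(range(d + 1), repeat=n_blocks)
        if sum(e) <= d
    )


seeds = [find_seed(e) for e in ([0, 1, 1, 1, 0], [1, 0, 1, 0, 1], [0, 2, 0, 1, 0])]
```

This is what makes the seed lossless: searching for the 0 1* 0 shapes cannot miss an occurrence with at most d errors.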

SLIDE 12

01*0 seeds

  • First implementation

– BWOLO (2014)
– BWT

Vroland C, Salson M, Bini S, Touzet H. Approximate search of short patterns with high error rates using the 01*0 lossless seeds. Journal of Discrete Algorithms 37, 2016

  • SeqAn implementation

– Optimum Search Scheme (2018)
– Bidirectional BWT

Kianfar K, Pockrandt C, Torkamandi B, Luo H, Reinert K. FAMOUS: Fast Approximate String Matching Using OptimUm Search Schemes. Recomb-Seq 2018

SLIDE 13

Use case: Motif inference for Nanopore adapters

SLIDE 14

Nanopore adapters sequence

SLIDE 15

Nanopore adapters sequence

  • Sequencing adapter sequences
  • Porechop

– https://github.com/rrwick/Porechop/
– Adapter trimming
– Known adapters database

  • Can we guess the adapter sequence from the reads?

SLIDE 16

Our method

  • Identify k-mers composing the adapter

– Higher frequency at the start / end of reads

  • Reconstruct adapter from k-mers
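The first step relies on adapter k-mers being over-represented at read extremities. A minimal sketch of that counting is shown below, assuming the reads are already loaded as plain strings. The function name, the default window of 100 bases and the separate start and end counters are assumptions for the illustration.

```python
from collections import Counter


def count_end_kmers(reads, k=16, window=100):
    """Count k-mers found in the first and last `window` bases of each read."""
    starts, ends = Counter(), Counter()
    for read in reads:
        head, tail = read[:window], read[-window:]
        # range() is empty when the fragment is shorter than k.
        starts.update(head[i:i + k] for i in range(len(head) - k + 1))
        ends.update(tail[i:i + k] for i in range(len(tail) - k + 1))
    return starts, ends


# Toy reads sharing a 4-base prefix, with k and window shrunk to match.
toy_reads = ["ACGTAAAA", "ACGTCCCC", "ACGTGGGG"]
starts, ends = count_end_kmers(toy_reads, k=4, window=4)
```

On the toy input the shared prefix `ACGT` is the only k-mer seen more than once at read starts, which is the signal the method looks for.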
SLIDE 17

Frequency of k-mers

k=16

SLIDE 18

Counting k-mers with errors

  • Select the 500 most frequent 16-mers
  • Count all occurrences with at most d=2 errors
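To show what counting with errors means on a small scale, the sketch below uses plain dynamic programming (Sellers' algorithm for approximate substring matching), which is the costly baseline mentioned earlier in the talk and not the 01*0 seed index used by the authors. How the talk counts overlapping approximate hits is not stated, so this sketch makes the assumption of counting reads that contain the k-mer at least once. All names are illustrative.

```python
def occurs_with_errors(kmer, text, d):
    """True if some substring of text is within edit distance d of kmer."""
    m = len(kmer)
    col = list(range(m + 1))  # distances against the empty text prefix
    for c in text:
        prev_diag = col[0]
        col[0] = 0  # an occurrence may start at any text position
        for i in range(1, m + 1):
            cur = min(
                col[i] + 1,                      # extra character in the text
                col[i - 1] + 1,                  # character missing from the text
                prev_diag + (kmer[i - 1] != c),  # match or substitution
            )
            prev_diag, col[i] = col[i], cur
        if col[m] <= d:
            return True
    return False


def count_reads_with_kmer(kmer, reads, d):
    """Number of reads containing kmer with at most d errors."""
    return sum(occurs_with_errors(kmer, r, d) for r in reads)


toy_reads = ["GGAATTCCGGAA", "GGACTTCCGGAA", "GGAATTTCCGGA", "CCCCCCCCCCCC"]
exact = count_reads_with_kmer("AATTCCGG", toy_reads, d=0)
approx = count_reads_with_kmer("AATTCCGG", toy_reads, d=1)
```

The cost is proportional to k times the total text length for each k-mer, which is why an index-based search is attractive when 500 k-mers are counted.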
SLIDE 19

Counting result example

Exact k-mers (k=16, d=0)

K-mer             count
TTCAGTTACGTATTGC  2761
TCAGTTACGTATTGCT  2716
CTATCTTCGGCGTCTG  2628
TCTATCTTCGGCGTCT  2628
CTTCGTTCAGTTACGT  2626
CGTTCAGTTACGTATT  2612
GTTCAGTTACGTATTG  2567
CTCTATCTTCGGCGTC  2563
GCTCTATCTTCGGCGT  2509
CTGTCGCTCTATCTTC  2491

K-mers with errors (k=16, d=2)

0 err  1 err  2 err  K-mer
2761   4403   5844   TTCAGTTACGTATTGC
2626   4324   6002   CTTCGTTCAGTTACGT
2612   4420   5905   CGTTCAGTTACGTATT
2716   4361   5813   TCAGTTACGTATTGCT
2567   4423   5837   GTTCAGTTACGTATTG
2359   4276   5895   TCGTTCAGTTACGTAT
2447   4048   5591   TTCGTTCAGTTACGTA
2628   3999   4775   CTATCTTCGGCGTCTG
2628   3934   4748   TCTATCTTCGGCGTCT
2563   3900   4649   CTCTATCTTCGGCGTC

SLIDE 20

k-mer ranks plot

SLIDE 21

How the adapter sequence is built

K-mers (k=16), listed by rank:

 1  TTCAGTTACGTATTGC
 2  CTTCAGTTACGTATTG
 3  CGTTCAGTTACGTATT
 4  TCAGTTACGTATTGCT
 5  GTTCAGTTACGTATTG
 6  GTTACGTATTGCTGTT
 7  TTACGTATTGCTGTTC
 8  CAGTTACGTATTGCTG
 9  TACGTATTGCTGTTCT
10  CTCTATCTTCGGCGTC
11  AGTTACGTATTGCTGT

Overlap layout of the k-mers that chain together, with the resulting sequence on the first line:

CTTCAGTTACGTATTGCTGTTCT
CTTCAGTTACGTATTG
 TTCAGTTACGTATTGC
  TCAGTTACGTATTGCT
   CAGTTACGTATTGCTG
    AGTTACGTATTGCTGT
     GTTACGTATTGCTGTT
      TTACGTATTGCTGTTC
       TACGTATTGCTGTTCT
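One way to read this slide is as a greedy extension: start from the top-ranked k-mer and repeatedly attach the best-ranked remaining k-mer that overlaps the current sequence by k-1 bases on either side. The sketch below implements that reading. It is an assumption about the procedure, since the slide does not spell out the exact rules, and the function name is illustrative.

```python
def assemble(ranked_kmers):
    """Greedily extend the top-ranked k-mer using (k-1)-base overlaps."""
    k = len(ranked_kmers[0])
    contig = ranked_kmers[0]
    unused = list(ranked_kmers[1:])  # kept in rank order
    extended = True
    while extended:
        extended = False
        for kmer in unused:
            if kmer[:-1] == contig[-(k - 1):]:   # extends the right end
                contig += kmer[-1]
            elif kmer[1:] == contig[:k - 1]:     # extends the left end
                contig = kmer[0] + contig
            else:
                continue
            unused.remove(kmer)
            extended = True
            break  # restart from the best-ranked remaining k-mer
    return contig


ranked = [
    "TTCAGTTACGTATTGC", "CTTCAGTTACGTATTG", "CGTTCAGTTACGTATT",
    "TCAGTTACGTATTGCT", "GTTCAGTTACGTATTG", "GTTACGTATTGCTGTT",
    "TTACGTATTGCTGTTC", "CAGTTACGTATTGCTG", "TACGTATTGCTGTTCT",
    "CTCTATCTTCGGCGTC", "AGTTACGTATTGCTGT",
]
adapter = assemble(ranked)
```

With the eleven k-mers of this slide, three k-mers never fit a (k-1)-base overlap and are left out, and the remaining eight chain into a 23-base sequence.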

SLIDE 22

Dataset 1

  • ANR ASTER consortium

– Algorithms and software for third generation sequencing

  • Prep and sequencing: Genoscope

– Species: Mouse (Mus musculus)
– Tissue: brain
– Sample type: 1D cDNA
– Flowcell: R9.4
– ENA/SRA: PRJEB25574

SLIDE 23

Experiment

  • Sample size

– 10,000 reads, first 100 bases
– k = 16, d = 2

  • Run the workflow on 100 samples
  • Compare results for both counting methods (k-mers with and without errors)
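Drawing one sample for this experiment can be sketched as below, assuming the reads are already in memory as strings. The function name, the seed parameter and the choice of sampling without replacement are assumptions; the slides do not say how the 100 samples were drawn.

```python
import random


def sample_read_starts(reads, n_reads=10_000, prefix_len=100, seed=None):
    """Pick n_reads reads at random and keep their first prefix_len bases."""
    rng = random.Random(seed)
    # Assumption: sampling without replacement, capped at the number of reads.
    chosen = rng.sample(reads, min(n_reads, len(reads)))
    return [read[:prefix_len] for read in chosen]


toy_reads = ["A" * 150, "C" * 50, "G" * 200]
sample = sample_read_starts(toy_reads, n_reads=2, prefix_len=100, seed=0)
```

Repeating the call with 100 different seeds would give 100 samples to run the workflow on.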

SLIDE 24

Results with exact k-mers

SLIDE 25

Results with approximate k-mers

SLIDE 26

The MEME approach

  • MEME: Multiple EM for Motif Elicitation

Bailey and Elkan, Fitting a mixture model by expectation maximization to discover motifs in biopolymers, ISMB 1994.

  • Experiment

– 1000 random reads, first 100 bases
– Repeated on 5 samples

SLIDE 27

MEME results

SLIDE 28

Results

Figure panels: exact k-mers, approximate k-mers, MEME

SLIDE 29

Dataset 2

  • Nanopore WGS consortium

– Oxford Nanopore human reference datasets
– https://github.com/nanopore-wgs-consortium/NA12878

  • Data from RNA project

– https://github.com/nanopore-wgs-consortium/NA12878/blob/master/nanopore-human-transcriptome/fastq_fast5_bulk.md
– Sample type: 1D cDNA
– Cell line: GM12878 human cell line (Ceph/Utah pedigree)
– Kit: SQK-PCS108
– Flowcell: R9.4
– File: Bham_Run1_20171115_1D.pass.depup.fastq

  • Same experiment
SLIDE 30

Exact k-mers results

67 different adapters found

SLIDE 31

Approximate k-mers results (WGS)

T-ACTTGCCTGTCGCTCTATCTTC

PCR adapters 3 (start) ←

SLIDE 32

Approximate k-mers results (WGS)

Figure panels: exact k-mers, approximate k-mers

SLIDE 33

Implementation

  • C++ with SeqAn library (optimal search schemes)
  • Computation time for 10k reads

– Exact k-mers: <1 second
– K-mers with errors: 10-20 seconds
– MEME: >180 h

SLIDE 34

Porechop wrapper

  • Integration in Porechop (Python)

A custom wrapper* plugs into the Porechop workflow by adding the inferred adapters to its adapter database

  • Two use cases:

→ discovering the adapter sequence when it is unknown
→ checking that a known adapter is present and correctly sequenced (quality check)

  • C++ and Python code available on request
SLIDE 35

Conclusion

  • Our goal was to test the efficiency of 01*0 seeds with a k-mer approach on noisy reads
  • Our experiment showed

– More consistent results with approximate k-mers
– Practical running time for real data

→ k-mers with errors can improve results at low cost

  • Could they be used as an alternative to minimizers?
SLIDE 36

End

Thank you for your attention

SLIDE 37

Potential adapters

Exact k-mers (sequence, count)

TCAGTTACGTATTGCTTTTCTGTTGGTGCTGATATTGCGGCGTCTGGCAGGTGTTTAACCTTTTTG 39
TCAGTTACGTATTGCTTTTCTGTTGGTGCTGATATTGCGGCGTCTGGCGGGTGTTTAACCTG 16
TCAGTTACGTATTGCTTTTCTGTTGGTGCTGATATTGCGGCGTCTGGCGGGTGTTTAACCTC 14
TCAGTTACGTATTGCTTTTCTGTTGGTGCTGATATTGCGGCGTCTGGCAGGTGTTTAACCTG 13
TCAGTTACGTATTGCTCTTGCCTGTCGCTCTATCTTCGGCGTCTGCTTGGGTGTTTAACCTTTTTG 6
TCAGTTACGTATTGCTTTTCTGTTGGTGCTGATATTGCGGCGTCTGGCAGGTGTTTAACCTC 4
TTTGTACTTCGTTCAGTTACGTATTGCTTTTCTGTTGGTGCTGATATTGCGGCGTCTGGCAGGTGTTTAACCTG 2
CTTGTACTTCGTTCAGTTACGTATTG 2
CTTGTACTTCGTTCAGTTACGTATTGCTTTTCTGTTGGTGCTGATATTGCGGCGTCTGGCAGGTGTTTAACCTG 1
CTTGTACTTCGTTCAGTTACGTATTGCTTTTCTGTTGGTGCTGATATTGCGGCGTCTGGCGGGTGTTTAACCTC 1
CTTGTACTTCGTTCAGTTACGTATTGCTTTTCTGTTGGTGCTGATATTGCGGCGTCTGGCAGGTGTTTAACCTTTTTG 1
TTTGTACTTCGTTCAGTTACGTATTG 1

K-mers with errors (sequence, count)

ATGTACTTCGTTCAGTTACGTATTGCTTTTCTGTTGGTGCTGATATTGCGGCGTCTGCTTGGGTGTTTAACCTTTTTG 83
ATGTACTTCGTTCAGTTACGTATTGCTTTTCTGTTGGTGCTGATATTGCGGCGTCTGCTTGGGTGTTTAACCTCT 9
ATGTACTTCGTTCAGTTACGTATTGCTTTTCTGTTGGTGCTGATATTGCGGCGTCTGCTTGGGTGTTTAACCTTTTA 4
ATGTACTTCGTTCAGTTACGTATTGCTTTTCTGTTGGTGCTGATATTGCGGCGTCTGCTTGGGTGTTTAACCTTTTTA 1
ATGTACTTCGTTCAGTTACGTATTGCTTTTCTGTTGGTGCTGATATTGCGGCGTCTGCTTGGGTGTTTAACCTCTT 1
ATGTACTTCGTTCAGTTACGTATTGCTTTTCTGTTGGTGCTGATATTGCGGCGTCTGCTTGGGTGTTTAACCTGTTT 1
ATGTACTTCGTTCAGTTACGTATTGCTTTTCTGTTGGTGCTGATATTGCGGCGTCTGCTTGGGTGTTTAACCTTTTG 1

SLIDE 38

Workflow

SLIDE 39

Porechop analysis

Set                Best read start  Best read end   (% identity)
K-mer_Error        98.7             84.6
SQK-NSK007         100.0            81.8
SQK-MAP006         96.6             100.0
SQK-MAP006 Short   100.0            100.0
PCR adapters 1     100.0            100.0
PCR tail 1         96.4             92.9
PCR tail 2         93.3             93.1
1D^2 part 2        100.0            100.0

SLIDE 40

Aligning adapters

Kmer_error_start (top) / SQK-NSK007_Y_Top (bottom)

-ATGTACTTCGTTCAGTTACGTATTGCTTTTCTGTTGGTGCTG
AATGTACTTCGTTCAGTTACGTATTGCT---------------

ATATTGCGGCGTCTGCTTGGGTGTTTAACCTTTTTG

SLIDE 41

Aligning adapters

Kmer_error_start (top) / SQK-MAP006_Short_Y_Top_LI32 (bottom)

ATGTACTTCGTTCAGTTACGTATTGCTTTTCTGTTGGTGCTG

ATATTGCGGCGTCTGCTTGGGTGTTTAACCTTTTTG
------CGGCGTCTGCTTGGGTGTTTAACCT-----
SLIDE 42

Aligning adapters

Kmer_error_start (top) / PCR_tail_1_start (bottom)

ATGTACTTCGTTCAGTTACGTATT-GCTTTTCTGTTGGTG
----------------------TTAACCTTTCTGTTGGTG

CTGATATTGCGGCGTCTGCTTGGGTGTTTAACCTTTTTG
CTGATATTGC-----------------------------