HG-CoLoR: enHanced de bruijn Graph for the error COrrection of LOng Reads



slide-1
SLIDE 1

HG-CoLoR: enHanced de bruijn Graph for the error COrrection of LOng Reads

Pierre Morisse, Thierry Lecroq and Arnaud Lefebvre

pierre.morisse2@univ-rouen.fr

Laboratoire d’Informatique, de Traitement de l’Information et des Systèmes

November 7, 2017

slide-2
SLIDE 2

Introduction Main idea Enhanced de Bruijn graph Workflow Experimental results Conclusion

Plan

  1. Introduction
  2. Main idea
  3. Enhanced de Bruijn graph
  4. Workflow
  5. Experimental results
  6. Conclusion

  • P. Morisse, T. Lecroq, A. Lefebvre

HG-CoLoR 2/33

slide-4
SLIDE 4

Third Generation Sequencing

  • Third Generation Sequencing technologies have recently started to develop
  • Two main technologies: Pacific Biosciences and Oxford Nanopore
  • Allow the sequencing of longer reads (several thousand bases)
  • Very useful to resolve assembly problems for large and complex genomes
  • Much higher error rate: around 15% for Pacific Biosciences and up to 30% for Oxford Nanopore

slide-9
SLIDE 9

Problem

  • Due to their high error rate, error correction of long reads is mandatory
  • Various methods already exist for the correction of short reads, but they are not applicable to long reads
  • This forces the development of new error correction methods
  • Two main categories: self-correction and hybrid correction

slide-14
SLIDE 14

Motivations

  • Most hybrid methods focus on reducing the error rate...
  • ...but yield bad assembly results

⇒ Focus more on assembly results

slide-17
SLIDE 17

Inspiration

NaS [Madoui et al., 2015]

  • Yields highly contiguous assembly results
  • Does not locally correct erroneous regions
  • Uses long reads as templates to generate corrected long reads from assemblies of short reads
  • Requires the mapping of the short reads both on the long reads and against each other

slide-22
SLIDE 22

NaS overview

NaS corrects a long read as follows:

  1. First step: align the short reads to the long reads (figure: the long read and its seeds)
  2. Second step: for each long read, recruit short reads that are similar to the seeds (figure: the seeds and the similar short reads)
  3. Third step: assemble the obtained subset of short reads
  4. Fourth step: use the obtained contig as the correction of the initial long read (figure: the contig)

slide-27
SLIDE 27

Main idea

  • Use long reads as templates
  • Get rid of the time-consuming step of aligning the short reads against each other
  • Focus on a seed-and-extend approach
  • Rely on an enhanced de Bruijn graph, built from the short reads

slide-32
SLIDE 32

Enhanced de Bruijn graph

Problem

  • de Bruijn graphs are widely used for correction and assembly...
  • ...but face difficulties with locally insufficient coverage

Usual solutions

  • Usually, multiple de Bruijn graphs of different orders are built
  • Requires a different graph for each order
  • Consumes large amounts of time and memory

slide-34
SLIDE 34

Enhanced de Bruijn graph

Idea

Enhance the de Bruijn graph with the capability of computing overlaps of variable lengths between the k-mers, in an overlap graph fashion, in order to avoid building multiple de Bruijn graphs of different orders.

slide-35
SLIDE 35

Enhanced de Bruijn graph

Example

With the set of reads S = {AGCTTACA, CTTACGTA, GTATACTG}, we obtain the following enhanced de Bruijn graph of order 6:

[Figure: graph over the k-mers AGCTTA, GCTTAC, CTTACA, CTTACG, TTACGT, TACGTA, GTATAC, TATACT and ATACTG, with seven edges labeled 5, five edges labeled 4 and four edges labeled 3]
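The example can be reproduced with a short sketch. This is not the HG-CoLoR implementation: it builds the graph explicitly by brute force, and the names `kmers_of`, `enhanced_dbg` and `min_overlap` are assumptions made for illustration. The minimum overlap of 3 is taken from the smallest edge label in the figure.

```python
from collections import defaultdict


def kmers_of(reads, k):
    """Return the sorted list of distinct k-mers found in the reads."""
    return sorted({r[i:i + k] for r in reads for i in range(len(r) - k + 1)})


def enhanced_dbg(reads, k, min_overlap):
    """Map each k-mer to its successors as (successor, overlap length) pairs.

    An edge u -> v of weight n exists when the last n characters of u
    equal the first n characters of v, for min_overlap <= n <= k - 1.
    """
    nodes = kmers_of(reads, k)
    edges = defaultdict(list)
    for u in nodes:
        # Longest overlaps first, as in a classic de Bruijn graph (n = k - 1).
        for n in range(k - 1, min_overlap - 1, -1):
            suffix = u[-n:]
            for v in nodes:
                if v != u and v.startswith(suffix):
                    edges[u].append((v, n))
    return nodes, dict(edges)


reads = ["AGCTTACA", "CTTACGTA", "GTATACTG"]
nodes, edges = enhanced_dbg(reads, k=6, min_overlap=3)
for u in nodes:
    print(u, edges.get(u, []))
```

On these three reads the sketch yields nine k-mers and sixteen edges (seven of weight 5, five of weight 4, four of weight 3), which matches the edge labels of the figure.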

slide-38
SLIDE 38

Traversal

  • The enhanced de Bruijn graph does not need to be explicitly built
  • It can be traversed with the help of PgSA [Kowalski et al., 2015]:
      • The k-mers from the reads are stored in the index
      • The index is queried in order to retrieve the edges
  • Makes backwards traversal easy
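Backward traversal amounts to asking which k-mers end with a prefix of the current k-mer. The following is a minimal sketch under stated assumptions: a plain list scan stands in for the index (the actual PgSA interface is not shown in the slides and is not reproduced here), and `predecessors` and `min_overlap` are hypothetical names.

```python
def predecessors(kmers, v, min_overlap):
    """Return the edges entering v as (predecessor, overlap length) pairs.

    A k-mer u precedes v with overlap n when the last n characters of u
    equal the first n characters of v.
    """
    k = len(v)
    found = []
    for n in range(k - 1, min_overlap - 1, -1):
        prefix = v[:n]
        for u in kmers:
            if u != v and u.endswith(prefix):
                found.append((u, n))
    return found


# k-mers of order 6 from the reads AGCTTACA, CTTACGTA and GTATACTG.
kmers = ["AGCTTA", "ATACTG", "CTTACA", "CTTACG", "GCTTAC",
         "GTATAC", "TACGTA", "TATACT", "TTACGT"]
print(predecessors(kmers, "TTACGT", 3))
```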

slide-43
SLIDE 43

Traversal

Example

Traversing the previous enhanced de Bruijn graph:

[Figure: the enhanced de Bruijn graph of order 6 from the previous example, over the same nine k-mers, with edges labeled 5, 4 and 3]

slide-44
SLIDE 44

Traversal

Example

k-mers set, stored in the PgSA index:

  1: AGCTTA
  2: ATACTG
  3: CTTACA
  4: CTTACG
  5: GCTTAC
  6: GTATAC
  7: TACGTA
  8: TATACT
  9: TTACGT

[Figure: AGCTTA linked to GCTTAC, CTTACA, CTTACG and TTACGT by edges labeled 5, 4, 4 and 3]

Occurrences positions? Three successive queries to the index return:

  • {(1,1) ; (5,0)}
  • {(1,2) ; (3,0) ; (4,0) ; (5,1)}
  • {(1,3) ; (3,1) ; (4,1) ; (5,2) ; (9,0)}
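The three result sets can be reproduced with a toy index. This sketch rests on an interpretation of the slide, not on the PgSA API: each pair is read as (k-mer number, 0-based position), and the queried strings are assumed to be the suffixes of AGCTTA of lengths 5, 4 and 3 (GCTTA, CTTA, TTA). The functions `occurrences` and `edges_from` are hypothetical helpers written for this example.

```python
def occurrences(kmers, pattern):
    """Return (k-mer number, position) pairs, numbering k-mers from 1."""
    hits = []
    for number, kmer in enumerate(kmers, start=1):
        pos = kmer.find(pattern)
        while pos != -1:
            hits.append((number, pos))
            pos = kmer.find(pattern, pos + 1)
    return hits


def edges_from(kmers, source, min_overlap):
    """Derive the outgoing edges of source from occurrence queries.

    An occurrence at position 0 of another k-mer means that k-mer starts
    with the queried suffix, hence an edge weighted by the suffix length.
    """
    found = []
    for n in range(len(source) - 1, min_overlap - 1, -1):
        for number, pos in occurrences(kmers, source[-n:]):
            target = kmers[number - 1]
            if pos == 0 and target != source:
                found.append((target, n))
    return found


kmers = ["AGCTTA", "ATACTG", "CTTACA", "CTTACG", "GCTTAC",
         "GTATAC", "TACGTA", "TATACT", "TTACGT"]
for query in ("GCTTA", "CTTA", "TTA"):
    print(query, occurrences(kmers, query))
print(edges_from(kmers, "AGCTTA", 3))
```

Only the occurrences at position 0 produce edges; the others show the queried string inside a k-mer and are ignored.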

slide-64
SLIDE 64

Workflow

5 steps:

  1. Correct the short reads (with QuorUM [Marçais et al., 2015])
  2. Filter out corrected short reads containing weak k-mers, and index solid k-mers with PgSA
  3. Align the remaining short reads to the long reads, to find seeds (with BLASR [Chaisson and Tesler, 2012])
  4. Merge the overlapping seeds, and link them together, by traversing the graph
  5. Extend the obtained corrected long read, on the left (resp. right) of the leftmost (resp. rightmost) seed
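Step 2 can be illustrated with a small sketch. It assumes that a k-mer is solid when it occurs at least `min_count` times across the short reads and weak otherwise; the threshold value, the function names and the toy reads are assumptions, and the real pipeline may count k-mers differently.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_reads(reads, k, min_count):
    """Drop reads containing a weak k-mer and list the solid k-mers."""
    counts = count_kmers(reads, k)
    kept = [
        read for read in reads
        if all(counts[read[i:i + k]] >= min_count
               for i in range(len(read) - k + 1))
    ]
    solid = sorted(kmer for kmer, c in counts.items() if c >= min_count)
    return kept, solid


reads = ["ACGTAC", "ACGTAC", "ACGTTT"]
kept, solid = filter_reads(reads, k=4, min_count=2)
print(kept)   # reads made of solid k-mers only
print(solid)  # k-mers that would be handed to the index
```

Here the third read is discarded because CGTT and GTTT each occur only once.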

slide-69
SLIDE 69

Step 4: Seeds merging and linking

  • Seeds with overlapping mapping positions are merged
      • Perfect overlap: merge
      • Otherwise: keep the best seed
  • Seeds are used as anchor points on the graph
  • The graph is traversed to link the seeds together and assemble the k-mers
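The merging rule can be sketched as follows. Several simplifications are assumed and are not taken from the slides: a seed's sequence has the same length as its mapped interval, "best" means the highest alignment score, a merged seed keeps the higher of the two scores, and a seed contained in the previous one is handled by the score rule. The `Seed` type and `merge_seeds` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Seed:
    start: int  # first mapped position on the long read
    end: int    # one past the last mapped position
    seq: str    # sequence of the seed (assumed len(seq) == end - start)
    score: int  # alignment score, higher is better (assumption)


def merge_seeds(seeds):
    """Merge seeds whose mapping positions overlap, left to right."""
    merged = []
    for seed in sorted(seeds, key=lambda s: s.start):
        if not merged or seed.start >= merged[-1].end:
            merged.append(seed)  # no positional overlap with the previous seed
            continue
        prev = merged[-1]
        n = prev.end - seed.start  # overlap length in positions
        if seed.end > prev.end and prev.seq[-n:] == seed.seq[:n]:
            # Perfect overlap: concatenate, keeping the shared part once.
            merged[-1] = Seed(prev.start, seed.end,
                              prev.seq + seed.seq[n:],
                              max(prev.score, seed.score))
        elif seed.score > prev.score:
            merged[-1] = seed  # imperfect overlap: keep the best seed
    return merged


seeds = [
    Seed(0, 6, "ACGTAC", 50),
    Seed(4, 10, "ACGGTT", 40),   # overlaps the first seed perfectly on "AC"
    Seed(8, 13, "CCCCC", 10),    # overlaps imperfectly and scores lower
    Seed(20, 25, "GGGGG", 30),   # no overlap
]
result = merge_seeds(seeds)
print(result)
```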

slide-74
SLIDE 74

Step 4: Seeds linking

[Figure, animated over several slides: seed1, seed2 and seed3 are mapped on the long read. seed1 is taken as the source (src) and seed2 as the destination (dst), and the graph is traversed from src to dst along edges labelled k − 1, k − 2 and k − 3. The linked seeds then act as the new source, with seed3 as the destination. Once every seed has been linked, the result is the corrected long read.]
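
The traversal from a source seed to a destination seed can be sketched as below. This is a simplified illustration under stated assumptions, not the tool's implementation: the names `successors`, `link_seeds`, `min_overlap` and `max_len` are hypothetical, the k-mer set is scanned linearly where a real tool would query an index, and trying the longest overlaps first in a depth-first search with backtracking is an assumed strategy.

```python
def successors(kmer, kmers, min_overlap):
    """Yield (next_kmer, overlap) pairs, longest overlaps first."""
    k = len(kmer)
    for overlap in range(k - 1, min_overlap - 1, -1):
        suffix = kmer[k - overlap:]
        for candidate in sorted(kmers):
            if candidate != kmer and candidate.startswith(suffix):
                yield candidate, overlap


def link_seeds(src, dst, kmers, min_overlap, max_len):
    """Search a path from k-mer src to k-mer dst and assemble its k-mers.

    Returns the assembled sequence, or None if dst cannot be reached
    without exceeding max_len bases.
    """
    def dfs(kmer, seq, visited):
        if kmer == dst:
            return seq
        if len(seq) >= max_len:
            return None  # path too long: backtrack
        for nxt, overlap in successors(kmer, kmers, min_overlap):
            if nxt in visited:
                continue
            # Append only the bases of nxt that are not part of the overlap.
            result = dfs(nxt, seq + nxt[overlap:], visited | {nxt})
            if result is not None:
                return result
        return None

    return dfs(src, src, {src})
```

For example, with k = 4 and the k-mers `ACGT` and `GTCC`, no overlap of length k − 1 exists, so the search falls back to the overlap `GT` of length k − 2 and assembles `ACGTCC`.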

slide-96
SLIDE 96

Step 5: Tips extension

Seeds don’t always map right at the beginning or all the way to the end of the long read
Once all the seeds have been linked, HG-CoLoR keeps on traversing the graph
The traversal stops when the borders of the long read or a branching path are reached
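
The stopping rule for tips extension can be sketched as follows. This is a hedged illustration only: the function name `extend_tip`, the parameter `missing` (number of bases between the last seed and the border of the long read), the restriction to overlaps of length k − 1, and the right-hand direction are all assumptions made for the example.

```python
def extend_tip(seq, kmers, k, missing):
    """Extend seq to the right by at most `missing` bases.

    The extension stops at the border of the long read (`missing` bases
    added), at a dead end, or at a branching path (several successors).
    """
    added = 0
    while added < missing:
        suffix = seq[-(k - 1):]
        nexts = [kmer for kmer in kmers if kmer.startswith(suffix)]
        if len(nexts) != 1:
            break  # dead end or branching path
        seq += nexts[0][-1]  # append the single new base
        added += 1
    return seq
```

The left tip could be handled symmetrically by looking for k-mers whose suffix matches the first k − 1 bases of the sequence.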


slide-99
SLIDE 99

Remark

Some seeds might be impossible to link together
⇒ Production of a corrected long read fragmented in multiple parts
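
How unlinkable seeds lead to a fragmented output can be sketched as below. This is an illustrative assumption rather than the tool's code: `assemble_fragments` is a hypothetical name, and `link` stands for any function that returns the sequence filling the gap between two consecutive seeds, or `None` when no path is found.

```python
def assemble_fragments(seeds, link):
    """Join consecutive seeds; start a new fragment when linking fails."""
    if not seeds:
        return []
    fragments = []
    current = seeds[0]
    for prev, nxt in zip(seeds, seeds[1:]):
        gap = link(prev, nxt)
        if gap is None:
            fragments.append(current)  # close the current fragment
            current = nxt
        else:
            current += gap + nxt
    fragments.append(current)
    return fragments
```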


slide-102
SLIDE 102

Datasets

HG-CoLoR was compared to NaS and to two other state-of-the-art hybrid long-read correction methods: CoLoRMap [Haghshenas et al., 2016] and Jabba [Miclotte et al., 2016]
The different tools were compared on the following datasets:

               Reference genome                                      Oxford Nanopore data               Illumina data
Dataset        Strain               Reference sequence  Genome size  # Reads  Average length  Coverage  # Reads    Read length  Coverage
A. baylyi      ADP1                 CR543861            3.6 Mbp      89,011   4,284           106x      900,000    250          50x
E. coli        K-12 substr. MG1655  NC_000913           4.6 Mbp      22,270   5,999           29x       775,500    300          50x
S. cerevisiae  S288C                NC_001133-001148    12.2 Mbp     205,923  5,698           96x       2,500,000  250          50x
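
The Oxford Nanopore coverage values in the table are consistent with the usual definition of sequencing depth, total sequenced bases divided by genome size. The short check below is only a sketch: the function name `coverage` is hypothetical, and because the average lengths and genome sizes are rounded, the results are approximate.

```python
def coverage(num_reads, average_length, genome_size):
    """Sequencing depth: total sequenced bases divided by genome size."""
    return num_reads * average_length / genome_size


# (number of reads, average read length, genome size in bp), from the table
nanopore = {
    "A. baylyi": (89_011, 4_284, 3_600_000),
    "E. coli": (22_270, 5_999, 4_600_000),
    "S. cerevisiae": (205_923, 5_698, 12_200_000),
}

depths = {name: round(coverage(*row)) for name, row in nanopore.items()}
```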


slide-103
SLIDE 103

Alignment-based comparison

Dataset        Method    # Reads  Average length  Average identity  Genome coverage  Runtime
A. baylyi      Original  89,011   4,284           70.09%            100%             N/A
A. baylyi      CoLoRMap  89,011   4,355           67.93%            100%             14h33min
A. baylyi      Jabba     17,476   10,260          99.40%            99.80%           12min30
A. baylyi      NaS       28,492   9,530           99.83%            100%             128h55min
A. baylyi      HG-CoLoR  25,436   11,619          99.70%            100%             1h59min
E. coli        Original  22,270   5,999           79.46%            100%             N/A
E. coli        CoLoRMap  22,270   6,219           89.02%            100%             8h26min
E. coli        Jabba     22,065   5,794           99.81%            99.41%           12min56
E. coli        NaS       22,144   8,307           99.86%            100%             81h30min
E. coli        HG-CoLoR  21,969   6,125           99.80%            100%             1h17min
S. cerevisiae  Original  205,923  5,698           55.49%            99.90%           N/A
S. cerevisiae  CoLoRMap  205,923  5,737           39.93%            99.40%           37h36min
S. cerevisiae  Jabba     36,958   6,613           99.55%            93.21%           44min05
S. cerevisiae  NaS       85,432   6,770           99.16%            99.37%           > 16 days
S. cerevisiae  HG-CoLoR  75,036   6,991           98.81%            99.47%           11h45min


slide-109
SLIDE 109

Assembly-based comparison

Dataset        Method    # Reads  Coverage  # Expected contigs  # Obtained contigs  Genome coverage
A. baylyi      CoLoRMap  89,011   108x      1
A. baylyi      Jabba     17,476   50x       1                   13                  89.43%
A. baylyi      NaS       28,492   75x       1                   1                   100%
A. baylyi      HG-CoLoR  25,436   82x       1                   1                   99.99%
E. coli        CoLoRMap  22,270   30x       1                   29                  97.74%
E. coli        Jabba     22,065   28x       1                   41                  95.76%
E. coli        NaS       22,144   40x       1                   1                   100%
E. coli        HG-CoLoR  21,969   29x       1                   1                   100%
S. cerevisiae  CoLoRMap  205,923  98x       16
S. cerevisiae  Jabba     36,958   20x       16                  134                 70.52%
S. cerevisiae  NaS       85,432   48x       16                  122                 96.72%
S. cerevisiae  HG-CoLoR  75,036   43x       16                  81                  96.11%


slide-115
SLIDE 115

Conclusion

Uses long reads as templates instead of locally correcting them
Exploits the advantages of the enhanced de Bruijn graph
Oriented towards assembly
Several orders of magnitude faster than NaS, while achieving comparable results
Provides the best trade-off between runtime and quality, when compared to state-of-the-art methods
HG-CoLoR is available from: https://github.com/pierre-morisse/HG-CoLoR


slide-121
SLIDE 121

Future work

Run HG-CoLoR on larger genomes
Build a proper assembly tool from the enhanced de Bruijn graph
Adapt HG-CoLoR to self-correction


slide-124
SLIDE 124

References I

Chaisson, M. J. and Tesler, G. (2012). Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics, 13(1):238.

Haghshenas, E., Hach, F., Sahinalp, S. C., and Chauve, C. (2016). CoLoRMap: Correcting Long Reads by Mapping short reads. Bioinformatics, 32(17):i545–i551.

Kowalski, T., Grabowski, S., and Deorowicz, S. (2015). Indexing arbitrary-length k-mers in sequencing reads. PLoS ONE, 10(7):1–14.


slide-125
SLIDE 125

References II

Madoui, M.-A., Engelen, S., Cruaud, C., Belser, C., Bertrand, L., Alberti, A., Lemainque, A., Wincker, P., and Aury, J.-M. (2015). Genome assembly using Nanopore-guided long and error-free DNA reads. BMC Genomics, 16:327.

Marçais, G., Yorke, J. A., and Zimin, A. (2015). QuorUM: An Error Corrector for Illumina Reads. PLoS ONE, 10(6):1–13.

Miclotte, G., Heydari, M., Demeester, P., Rombauts, S., Van de Peer, Y., Audenaert, P., and Fostier, J. (2016). Jabba: hybrid error correction for long sequencing reads. Algorithms Mol Biol, 11:10.


slide-126
SLIDE 126


Thanks for your attention.
