CONSENT: Scalable self-correction of long reads with multiple sequence alignment

  1. CONSENT: Scalable self-correction of long reads with multiple sequence alignment
     Pierre Morisse¹, Camille Marchet², Antoine Limasset², Arnaud Lefebvre¹, Thierry Lecroq¹
     ¹ Normandie Univ, UNIROUEN, LITIS, Rouen 76000, France. ² Lille Univ, CNRS, CRIStAL, Lille 59000, France.
     RECOMB-SEQ, 03 May 2019, Washington D.C.

  2. Introduction Workflow Experiments Conclusion
     Introduction: Context
     - 2011: inception of third-generation sequencing technologies
     - Two main actors: Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT)
     - Sequencing of much longer reads: tens of kbp on average, up to 1 Mbp (ONT ultra-long reads)
     - Expected to solve various problems in the genome assembly field
     Morisse et al. CONSENT 2/33

  3. Introduction: Context
     - Long reads (LR) are very noisy (10-30% error rate)
     - They display complex error profiles (errors are mostly indels)
     - Efficiently handling these error rates is mandatory
     - This can be done via correction: hybrid or self

  4. Introduction: Hybrid correction
     - First efficient approach for LR error correction
     - Makes use of complementary short reads (SR) data
     - Different approaches: alignment of SRs to the LRs, use of a De Bruijn graph (DBG), ...
     - Particularly useful on old sequencing experiments (very high error rates)

  5. Introduction: Self-correction
     - Corrects the LRs solely based on the information they contain
     - Third-generation sequencing technologies evolve fast
     - Error rates of the LRs now reach 10-12% on average
     - Error correction is still the first step of many analysis projects
     - Self-correction is now a viable alternative with such error rates

  6. Introduction: Self-correction
     State of the art:
     1. Compute overlaps between the LRs
     2. Compute consensus from the overlaps

  7. Introduction
     - Pseudo multiple sequence alignment (MSA): build a directed acyclic graph (DAG) to represent the pseudo MSA and compute consensus
     - De Bruijn graph (DBG): divide the alignments into small windows and correct the windows independently with DBGs
     [Figure: example alignments of reads R1, R2 and R3 illustrating the two approaches]

  10. Introduction: Contribution
      We introduce CONSENT, a new self-correction method that:
      - Combines the two previous approaches (MSA + DBG)
      - Computes actual MSA
      - Compares well to the state of the art, and scales better
      - Is also able to polish contigs

  11. Pre-treatment
      - Overlap the long reads
      - Currently with Minimap2 [Li, 2018]
      - But not dependent on the aligner

  12. First step: Retrieve alignment piles
      - Select a long read to correct (read A in the figure)

  13. First step: Retrieve alignment piles
      - Retrieve overlapping long reads

  14. First step: Retrieve alignment piles
      - Get the alignment pile (read A with the overlapping reads R1 to R6)
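
The pile retrieval above can be sketched as follows. This is a minimal illustration, not CONSENT's implementation: it assumes the overlaps are available as PAF lines (the default output format of Minimap2), and the names `Overlap`, `parse_paf_line` and `build_piles` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Overlap(NamedTuple):
    """Subset of the 12 mandatory PAF columns (illustrative type)."""
    query: str
    target: str
    target_start: int  # 0-based, inclusive
    target_end: int    # 0-based, exclusive
    strand: str


def parse_paf_line(line: str) -> Overlap:
    # PAF columns: qname qlen qstart qend strand tname tlen tstart tend ...
    f = line.rstrip("\n").split("\t")
    return Overlap(f[0], f[5], int(f[7]), int(f[8]), f[4])


def build_piles(paf_lines):
    """Map each read to the overlaps that cover it (its alignment pile)."""
    piles = defaultdict(list)
    for line in paf_lines:
        ov = parse_paf_line(line)
        if ov.query != ov.target:  # ignore self-overlaps
            piles[ov.target].append(ov)
    return piles


# Made-up overlaps for a read "A" and two overlapping reads.
example = [
    "R1\t900\t0\t500\t+\tA\t1000\t100\t600\t450\t500\t60",
    "R2\t800\t50\t750\t-\tA\t1000\t300\t1000\t650\t700\t60",
    "A\t1000\t100\t600\t+\tR1\t900\t0\t500\t450\t500\t60",
]
piles = build_piles(example)
```

Here `piles["A"]` holds the overlaps of R1 and R2 on read A, with their coordinates on A.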

  15. First step: Retrieve alignment piles
      - Trim the alignment pile

  17. Second step: Divide piles into windows
      - Definition: a window $w = (beg, end)$ is a "factor" of an alignment pile
      - Example: [Figure: a window from beg to end on the pile of A and R1 to R6]

  19. Second step: Divide piles into windows
      For correction, we only consider windows $w = (beg, end)$ such that:
      - $end - beg + 1 = l$
      - $\forall i,\ beg \le i \le end$: position $i$ is covered by at least $c$ reads
      Example: on the previous example, with $c = 4$ [Figure: selected windows on the pile of A and R1 to R6]
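
The two window conditions can be checked with a short sketch. It assumes each read of the pile is given as a half-open interval on read A, and it selects non-overlapping windows from left to right; that selection policy, like the function names, is an assumption of this example and is not stated on the slides.

```python
def coverage(pile_len, intervals):
    """Per-position coverage from half-open [start, end) intervals."""
    diff = [0] * (pile_len + 1)
    for start, end in intervals:
        diff[start] += 1
        diff[end] -= 1
    cov, running = [], 0
    for d in diff[:pile_len]:
        running += d
        cov.append(running)
    return cov


def valid_windows(cov, l, c):
    """Non-overlapping windows (beg, end), end inclusive, of length l
    in which every position is covered by at least c reads."""
    windows, beg = [], 0
    while beg + l <= len(cov):
        low = [i for i in range(beg, beg + l) if cov[i] < c]
        if low:
            # Restart just after the last under-covered position.
            beg = low[-1] + 1
        else:
            windows.append((beg, beg + l - 1))
            beg += l
    return windows


# Made-up pile of 5 reads on a read of length 20.
cov = coverage(20, [(0, 12), (2, 20), (4, 16), (4, 18), (10, 20)])
windows = valid_windows(cov, l=4, c=4)
```

With these made-up intervals, only positions 4 to 15 reach a coverage of 4, so the selected windows are (4, 7), (8, 11) and (12, 15).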

  21. Third step: Compute consensus of a window
      2. Compute consensus
      - Compute MSA of these sequences
      - Compute consensus from the MSA
      - Unlike other methods, actual MSA is computed ⇒ POA [Lee et al., 2002]
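
To illustrate the "consensus from the MSA" step, here is a deliberately simplified sketch: a column-wise majority vote over rows that are already aligned. It is not POA, which builds the alignment itself as a graph; the function name and the use of `.` as gap symbol are assumptions of this example.

```python
from collections import Counter


def consensus_from_msa(rows, gap="."):
    """Column-wise majority vote over equal-length aligned rows.
    Columns whose most common symbol is a gap are dropped."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all MSA rows must have the same length")
    out = []
    for column in zip(*rows):
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != gap:
            out.append(symbol)
    return "".join(out)


# Made-up alignment of four window sequences.
msa = [
    "ACCA.GGT",
    "AC.AAGGT",
    "ACCAAGGT",
    "ACCAAG.T",
]
result = consensus_from_msa(msa)
```

Each isolated deletion is outvoted by the three other rows, so `result` is `ACCAAGGT`.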

  22. Third step: Compute consensus of a window
      POA (Partial Order Alignment): multiple sequence alignment strategy based on partial order graphs
      Two benefits:
      1. Computes actual multiple sequence alignment
      2. Directly builds the DAG representing the multiple sequence alignment

  25. Third step: Compute consensus of a window
      Segmentation strategy
      - In practice, we use windows of a few hundred bases
      - POA is time-consuming, even on such windows
      - We developed a segmentation strategy
      - Compute MSA and consensus for smaller sequences ⇒ faster

  26. Third step: Compute consensus of a window
      Segmentation strategy
      1. Compute shared anchors between the window's sequences
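
The slides do not define what an anchor is. As a rough sketch, the code below assumes anchors are k-mers that occur exactly once in a sequence and are found, non-repeated, in at least `min_seqs` sequences of the window. Both this definition and the function names are assumptions made for illustration.

```python
from collections import Counter


def unique_kmer_positions(seq, k):
    """Positions of the k-mers that occur exactly once in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {seq[i:i + k]: i
            for i in range(len(seq) - k + 1)
            if counts[seq[i:i + k]] == 1}


def shared_anchors(seqs, k, min_seqs):
    """k-mers that are non-repeated in at least min_seqs sequences."""
    support = Counter()
    for seq in seqs:
        support.update(unique_kmer_positions(seq, k).keys())
    return {kmer for kmer, n in support.items() if n >= min_seqs}


# Made-up window sequences; the second one carries a substitution.
seqs = ["ACGTTGCA", "ACGTAGCA", "ACGTTGCA"]
anchors_all = shared_anchors(seqs, k=4, min_seqs=3)
anchors_most = shared_anchors(seqs, k=4, min_seqs=2)
```

Requiring all three sequences leaves only `ACGT` as an anchor, whereas requiring two also keeps the k-mers that overlap the substituted base.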

  28. Third step: Compute consensus of a window
      Segmentation strategy
      2. Search for the longest chain of anchors such that, for all consecutive anchors $A_i, A_{i+1}$:
         1. $A_i$ is followed by $A_{i+1}$ in at least $N$ sequences
         2. $A_{i+1}$ is never followed by $A_i$
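
One way to search for such a chain is a longest-path computation over the anchor pairs that satisfy both conditions. The sketch below is an illustration under stated assumptions: "followed by" is read as "appears anywhere after in the same sequence", each anchor occurs at most once per sequence, and the function name is hypothetical. It uses `graphlib` from the Python standard library (3.9+).

```python
from collections import Counter
from graphlib import TopologicalSorter  # Python 3.9+
from itertools import combinations


def longest_anchor_chain(anchor_orders, n_min):
    """anchor_orders: one list per sequence, giving its anchors in
    order of appearance (each anchor at most once per sequence)."""
    before = Counter()
    for order in anchor_orders:
        # combinations() preserves input order: a appears before b.
        before.update(combinations(order, 2))

    # Keep a -> b if a precedes b in >= n_min sequences
    # and b never precedes a.
    preds = {a: set() for order in anchor_orders for a in order}
    for (a, b), n in before.items():
        if n >= n_min and before[(b, a)] == 0:
            preds[b].add(a)

    # Longest path by dynamic programming in topological order.
    # static_order() raises graphlib.CycleError (a ValueError)
    # if the kept edges are not acyclic.
    best_len, best_prev = {}, {}
    for node in TopologicalSorter(preds).static_order():
        best_len[node], best_prev[node] = 1, None
        for p in preds[node]:
            if best_len[p] + 1 > best_len[node]:
                best_len[node], best_prev[node] = best_len[p] + 1, p
    if not best_len:
        return []
    node = max(best_len, key=best_len.get)
    chain = []
    while node is not None:
        chain.append(node)
        node = best_prev[node]
    return chain[::-1]


# Made-up anchor orders: a2 and a3 are swapped in the second sequence.
orders = [
    ["a1", "a2", "a3", "a4"],
    ["a1", "a3", "a2", "a4"],
    ["a1", "a2", "a4"],
]
chain = longest_anchor_chain(orders, n_min=3)
```

In this example the pair (a2, a3) is discarded because the two anchors appear in both orders, and a3 lacks support from three sequences, so the chain is a1, a2, a4. When several chains have the same length, which one is returned is arbitrary in this sketch.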

