
CONSENT: Scalable self-correction of long reads with multiple sequence alignment

Pierre Morisse (1), Camille Marchet (2), Antoine Limasset (2), Arnaud Lefebvre (1), Thierry Lecroq (1)

(1) Normandie Univ, UNIROUEN, LITIS, Rouen 76000, France.
(2) Lille Univ, CNRS, CRIStAL, Lille 59000, France.

RECOMB-SEQ, 03 May 2019, Washington D.C.

Outline: Introduction, Workflow, Experiments, Conclusion

Introduction: Context
- 2011: inception of third-generation sequencing technologies
- Two main actors: Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT)
- Sequencing of much longer reads: tens of kbp on average, up to 1 Mbp (ONT ultra-long reads)
- Expected to solve various problems in the genome assembly field

Morisse et al. CONSENT 2/33

Introduction: Context (continued)
- Long reads (LR) are very noisy (10-30% error rate)
- They display complex error profiles (errors are mostly indels)
- Efficiently handling these error rates is mandatory
- This can be done via correction: hybrid correction or self-correction

Introduction: Hybrid correction
- First efficient approach for LR error correction
- Makes use of complementary short-read (SR) data
- Different approaches: alignment of SRs to the LRs, use of a de Bruijn graph (DBG), ...
- Particularly useful on old sequencing experiments (very high error rates)

Introduction: Self-correction
- Corrects the LRs solely based on the information they contain
- Third-generation sequencing technologies evolve fast
- Error rates of the LRs now reach 10-12% on average
- Error correction is still the first step of many analysis projects
- Self-correction is now a viable alternative with such error rates

Introduction: Self-correction
State of the art:
1. Compute overlaps between the LRs
2. Compute consensus from the overlaps

Introduction

Pseudo Multiple Sequence Alignment (MSA)
- Build a directed acyclic graph (DAG) to represent the pseudo MSA and compute consensus

De Bruijn graph
- Divide the alignments into small windows
- Correct the windows independently with DBGs

[Figure: example read alignments (R1, R2, R3) illustrating each approach]
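To make the DBG-based idea concrete, here is a minimal sketch of how a de Bruijn graph could be built from the sequences of one window. It is an illustration only, not the code of any particular tool: the function names, the choice of k, and the `min_count` threshold used to keep only frequent ("solid") k-mers are all assumptions.

```python
from collections import Counter, defaultdict


def solid_kmers(seqs, k, min_count):
    """Return the k-mers seen at least min_count times across all sequences.

    min_count is a hypothetical filter: rare k-mers are assumed to be
    sequencing errors and are discarded.
    """
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    return {kmer for kmer, n in counts.items() if n >= min_count}


def de_bruijn(kmers):
    """Build a de Bruijn graph: nodes are (k-1)-mers, one edge per k-mer."""
    graph = defaultdict(set)
    for kmer in kmers:
        graph[kmer[:-1]].add(kmer[1:])
    return graph


# Three noisy copies of the same short region (made-up data).
window = ["ACCAAGGT", "ACCAAGGT", "ACAAGGGT"]
kmers = solid_kmers(window, k=3, min_count=2)
graph = de_bruijn(kmers)
```

With this toy input, the k-mers that only appear in the third, erroneous copy fall below the threshold, so the graph reduces to a single path spelling the first two copies.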

Introduction: Contribution
We introduce CONSENT, a new self-correction method that:
- Combines the two previous approaches (MSA + DBG)
- Computes an actual MSA
- Compares well to the state of the art, and scales better
- Is also able to polish contigs

Pre-treatment
- Overlap the long reads
- Currently done with Minimap2 [Li, 2018]
- But the method is not dependent on the aligner

First step: Retrieve alignment piles
- Select a long read to correct, A
- Retrieve the overlapping long reads
- Get the alignment pile
- Trim the alignment pile

[Figure: alignment pile of read A with overlapping reads R1 to R6]
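The pile retrieval can be sketched as follows, assuming the overlaps are available as PAF records (the tab-separated format that Minimap2 writes). This is a simplified illustration: the function names are hypothetical, only the first nine PAF columns are read, and the trimming step is left out because the slides do not detail it.

```python
from collections import defaultdict


def parse_paf_line(line):
    """Parse the first nine mandatory PAF columns of one overlap record."""
    f = line.rstrip("\n").split("\t")
    return {
        "query": f[0], "q_start": int(f[2]), "q_end": int(f[3]),
        "strand": f[4],
        "target": f[5], "t_start": int(f[7]), "t_end": int(f[8]),
    }


def build_piles(paf_lines):
    """Map each read to the list of (other_read, start, end) overlapping it.

    Coordinates are on the read that owns the pile (0-based, end-exclusive,
    as in PAF). Each overlap is registered in both directions, assuming the
    overlapper reports each pair only once.
    """
    piles = defaultdict(list)
    for line in paf_lines:
        if not line.strip():
            continue
        rec = parse_paf_line(line)
        if rec["query"] == rec["target"]:
            continue  # ignore a read overlapping itself
        piles[rec["target"]].append((rec["query"], rec["t_start"], rec["t_end"]))
        piles[rec["query"]].append((rec["target"], rec["q_start"], rec["q_end"]))
    return piles


# Made-up overlaps between a read A and two reads R1, R2.
paf = [
    "R1\t1000\t0\t800\t+\tA\t1200\t100\t900\t700\t800\t60",
    "A\t1200\t0\t10\t+\tA\t1200\t0\t10\t10\t10\t60",
    "R2\t900\t50\t900\t-\tA\t1200\t300\t1150\t750\t850\t60",
]
piles = build_piles(paf)
```

Here `piles["A"]` holds the two intervals of A covered by R1 and R2, which is the information the following steps need.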

Second step: Divide piles into windows
Definition: a window w = (beg, end) is a "factor" of an alignment pile.
Example:
[Figure: window (beg, end) marked on the alignment pile of A with reads R1 to R6]

Second step: Divide piles into windows (continued)
For correction, we only consider windows w = (beg, end) such that:
- end − beg + 1 = l (l is the window length)
- ∀ i, beg ≤ i ≤ end, i is covered by at least c reads (c is the minimum coverage)
Example: on the previous example, with c = 4:
[Figure: alignment pile of A with reads R1 to R6]
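The two conditions above can be checked with a short sketch. It assumes the pile is given as intervals on the read to correct (0-based, end-exclusive) and, as a further assumption not stated on the slides, that windows are taken left to right without overlapping each other. The function names are hypothetical.

```python
def coverage(template_len, intervals):
    """Number of overlapping reads at each position of the template."""
    diff = [0] * (template_len + 1)
    for start, end in intervals:
        diff[start] += 1
        diff[end] -= 1
    cov, running = [], 0
    for d in diff[:template_len]:
        running += d
        cov.append(running)
    return cov


def select_windows(cov, l, c):
    """Return inclusive (beg, end) windows with end - beg + 1 == l in which
    every position is covered by at least c reads."""
    result, beg = [], 0
    while beg + l <= len(cov):
        # First under-covered position of the candidate window, if any.
        bad = next((i for i in range(beg, beg + l) if cov[i] < c), None)
        if bad is None:
            result.append((beg, beg + l - 1))
            beg += l          # assumption: windows do not overlap
        else:
            beg = bad + 1     # restart right after the under-covered position
    return result


# Made-up pile: four reads overlapping a template of length 15.
cov = coverage(15, [(0, 10), (2, 12), (4, 14), (5, 9)])
windows_c4 = select_windows(cov, l=2, c=4)
windows_c3 = select_windows(cov, l=3, c=3)
```

Raising c shrinks the region from which windows can be drawn, which is the trade-off between correction quality and the fraction of the read that gets corrected.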

Third step: Compute consensus of a window
2. Compute consensus
- Compute MSA of these sequences
- Compute consensus from the MSA
- Unlike other methods, an actual MSA is computed ⇒ POA [Lee et al., 2002]

Third step: Compute consensus of a window
POA (Partial Order Alignment)
- Multiple sequence alignment strategy based on partial order graphs
- Two benefits:
  1. Computes an actual multiple sequence alignment
  2. Directly builds the DAG representing the multiple sequence alignment
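POA itself is not reproduced here. As a much simpler stand-in for the idea of deriving a consensus from an MSA, the sketch below takes rows that are already aligned (gaps written as `-`) and keeps the most frequent symbol of each column. This column-wise majority vote is a substitute technique chosen for illustration, and the function name is hypothetical.

```python
from collections import Counter


def column_consensus(msa_rows, gap="-"):
    """Majority-vote consensus of an already computed MSA.

    All rows must have the same length. Columns whose most frequent
    symbol is the gap contribute nothing to the consensus.
    """
    if len({len(row) for row in msa_rows}) != 1:
        raise ValueError("MSA rows must be non-empty and of equal length")
    consensus = []
    for column in zip(*msa_rows):
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)


# Made-up alignment: two reads each carry one deletion.
msa = [
    "ACCA-GGT",
    "AC-AAGGT",
    "ACCAAGGT",
]
consensus = column_consensus(msa)
```

Each deletion is outvoted by the other two rows, so the consensus recovers the full sequence.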

Third step: Compute consensus of a window
Segmentation strategy
- In practice, we use windows of a few hundred bases
- POA is time-consuming, even on such windows
- We developed a segmentation strategy
- Compute MSA and consensus for smaller sequences ⇒ faster

Third step: Compute consensus of a window
Segmentation strategy (continued)
1. Compute shared anchors between the window's sequences
2. Search for the longest chain of anchors such that, ∀ A_i, A_{i+1}:
   - A_i is followed by A_{i+1} in at least N sequences
   - A_{i+1} is never followed by A_i
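The two segmentation steps can be sketched as below. Several choices are assumptions made for the illustration and are not taken from the slides: anchors are k-mers that occur exactly once in a sequence, they must appear in the first sequence (used as the reference order), and "followed by" is read as "occurs later without overlapping". The function names are hypothetical.

```python
from collections import Counter


def unique_kmer_positions(seq, k):
    """Map each k-mer occurring exactly once in seq to its position."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return {km: i for i, km in enumerate(kmers) if counts[km] == 1}


def longest_anchor_chain(seqs, k, n_min):
    """Longest chain of anchors, ordered as in seqs[0], such that each
    consecutive pair (a, b) has a before b in at least n_min sequences
    and b before a in none."""
    pos = [unique_kmer_positions(s, k) for s in seqs]
    ref = pos[0]
    # Step 1: anchors are shared by at least n_min sequences.
    anchors = sorted(
        (a for a in ref if sum(a in p for p in pos) >= n_min), key=ref.get
    )
    if not anchors:
        return []

    def valid(a, b):
        both = [p for p in pos if a in p and b in p]
        forward = sum(1 for p in both if p[a] + k <= p[b])
        backward = any(p[b] < p[a] for p in both)
        return forward >= n_min and not backward

    # Step 2: dynamic programming over anchors in reference order.
    best = [1] * len(anchors)
    prev = [None] * len(anchors)
    for j in range(len(anchors)):
        for i in range(j):
            if best[i] + 1 > best[j] and valid(anchors[i], anchors[j]):
                best[j], prev[j] = best[i] + 1, i
    j = max(range(len(anchors)), key=best.__getitem__)
    chain = []
    while j is not None:
        chain.append(anchors[j])
        j = prev[j]
    return chain[::-1]


# Made-up window sequences sharing a common start and end.
chain = longest_anchor_chain(["ACGTTGCA", "ACGATGCA", "ACGTGCA"], k=3, n_min=3)
```

The stretches between consecutive anchors of the chain are the "smaller sequences" on which MSA and consensus would then be computed.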
