Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data
Rayan Chikhi, ENS Cachan Brittany / IRISA (Genscale team)
Advisor: Dominique Lavenier

INTRODUCTION, YEAR 2000: HUMAN GENOME PROJECT
"It’s a giant resource that will change mankind, like the printing press." Dr James Watson, co-discoverer of DNA structure
First achievement: human sequencing
◮ the only way to read DNA is through small fragments (called reads)
Sequencing process:
1) Obtain many copies of the genome
2) Cut them into millions of short fragments
3) Output the sequences of these fragments
Figure: genome (unknown); reads: overlapping sub-sequences, covering the genome redundantly

INTRODUCTION, YEAR 2000: HUMAN GENOME PROJECT
Second achievement: human de novo assembly (thesis topic)
◮ from millions of small fragments of DNA to a single sequence
◮ purely computational process
◮ required a supercomputer with 64 GB memory
◮ result was actually not perfect: assembly was fragmented

CONTEXT, YEAR 2012: STILL DIFFICULT TO SEQUENCE TODAY?

NEXT-GENERATION SEQUENCING TECHNOLOGIES
◮ NGS = massively parallel sequencing
Figure (timeline of technologies): HGP technology: Sanger; 3 main NGS technologies: SOLiD, 454, Illumina; followed by Proton, PacBio, Oxford

NEXT-GENERATION SEQUENCING TECHNOLOGIES
◮ What everyone uses today: Illumina
90 percent of the world’s sequencing output is produced on Illumina instruments. (GenomeWeb, February 14, 2012; verified with http://omicsmaps.com/stats)
◮ read length: ≈ 100 nt, i.e. 0.000003% of the human genome
◮ throughput: equivalent to 1 human genome per day
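The read-length percentage quoted on this slide can be reproduced with a short calculation. This is only a sketch: the genome length of about 3.2 billion nucleotides is an assumption (an approximate haploid human genome size), not a figure stated on the slide.

```python
# Assumed haploid human genome length (approximate; not given on the slide).
GENOME_LENGTH_NT = 3_200_000_000
# Read length quoted on the slide for Illumina reads.
READ_LENGTH_NT = 100

# Share of the genome covered by a single read, as a percentage.
fraction_percent = 100 * READ_LENGTH_NT / GENOME_LENGTH_NT
print(f"{fraction_percent:.6f}% of the genome per read")
```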

HOW COMPUTATIONALLY HARD IS ASSEMBLY TODAY?
Tentative comparison of some software methods (≈ 20 de novo assemblers omitted).
Datasets: whole human genome, Illumina reads (except for Celera: Sanger reads)
◮ We focus on computational difficulty
◮ Quality of results: newer assemblies (≥ 2009) are much more fragmented, because of shorter reads

OUTLINE
Definition of the assembly problem
Contributions
  Contribution 1: localized assembly
    Index
    Traversal
  Contribution 2: incorporation of pairing information
    Monument assembler
    Results
  Contribution 3: ultra-low memory assembly
    Minia
    Results
Perspectives

GENOME ASSEMBLY
Informal problem
Given a set of sequenced reads, retrieve the genome.
In computational terms
Find an algorithm such that:
Input: a set of reads that are sub-strings of the genome
Output: the genome
Toy example
Input: { GAT, ATT, TTA, TAC, ACA, CAT, CAA }
Output: GATTACATCAA
Immediate questions
Q: Is there a single possible output? A: no, s = GATTACATTACAA is another possible output
Q: Then, how to choose? A: need to formulate an optimization problem (the problem of finding the best solution from all feasible solutions)
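The ambiguity in the toy example can be checked directly: both candidate outputs contain every read as a sub-string. The helper name `is_superstring` below is hypothetical and only serves this illustration.

```python
READS = ["GAT", "ATT", "TTA", "TAC", "ACA", "CAT", "CAA"]

def is_superstring(candidate, reads):
    """True if every read occurs somewhere in the candidate sequence."""
    return all(read in candidate for read in reads)

# Both outputs from the slide are consistent with the same set of reads.
for candidate in ("GATTACATCAA", "GATTACATTACAA"):
    print(candidate, len(candidate), is_superstring(candidate, READS))
```

Since the reads alone cannot tell the two sequences apart, some extra criterion is needed to pick one, which is what the optimization formulations on the following slides provide.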

SHORTEST COMMON SUPER-STRING PROBLEM
Shortest common super-string (SCS) problem
Given a set S of strings, construct a string of minimal length which contains all strings of S as sub-strings. (there can be many solutions)
Toy example
S = { GAT, ATT, TTA }
Trivial super-string: GATATTTTA
Super-strings of length 3: none
Super-strings of length 4: none
Super-strings of length 5: GATTA ← solution
Problem with SCS-based assembly
The genome is not an SCS. Genomes contain long repetitions, e.g. GATTACATTACAA (length = 13).
Sequencing yields reads: { GAT, ATT, TTA, TAC, ACA, CAT, CAA }
A shortest common super-string is: GATTACATCAA (length = 11).
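Both toy results on this slide can be reproduced by brute force. The sketch below tries every ordering of the reads and merges consecutive reads with their maximal overlap; it assumes that no read is a sub-string of another (true for distinct reads of equal length) and is only practical for a handful of reads. The function names are hypothetical.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def merge_in_order(strings):
    """Concatenate strings in the given order, collapsing maximal overlaps."""
    merged = strings[0]
    for s in strings[1:]:
        merged += s[overlap(merged, s):]
    return merged

def shortest_common_superstring(strings):
    """Exhaustive search over all orderings (factorial time, toy inputs only)."""
    return min((merge_in_order(p) for p in permutations(strings)), key=len)

small = shortest_common_superstring(["GAT", "ATT", "TTA"])
reads = ["GAT", "ATT", "TTA", "TAC", "ACA", "CAT", "CAA"]
scs = shortest_common_superstring(reads)
print(small, len(small))  # a super-string of length 5
print(scs, len(scs))      # a super-string of length 11, shorter than the genome (13)
```

Several super-strings of length 11 exist for the seven reads, so the exact string printed may differ from the one on the slide; only the length is guaranteed.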

A BETTER PROBLEM FORMULATION
Overlap graph (simplified definition) [Myers 95]
Directed graph,
◮ vertices = reads
◮ edge r1 → r2 if r1 and r2 exactly overlap over ≥ k characters.
String graph
Remove transitively inferable overlaps from the overlap graph.
Toy string graph
S = { GAT, ATT, TTA, TAC, ACA, CAT, CAA }, k = 2
Figure: string graph over the nodes CAT, GAT, ATT, TTA, TAC, ACA, CAA (e.g. edge GAT → ATT)
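The simplified definitions above can be turned into a few lines of code. This is a minimal sketch under the slide's simplifications (exact overlaps, forward strand only); the function names are hypothetical, and the transitive reduction is the naive one that drops r1 → r3 whenever r1 → r2 and r2 → r3 both exist.

```python
def overlap_graph(reads, k):
    """Edges r1 -> r2 where a suffix of r1 of length >= k (k >= 1) is a prefix of r2."""
    edges = set()
    for r1 in reads:
        for r2 in reads:
            if r1 == r2:
                continue
            longest = min(len(r1), len(r2)) - 1  # proper overlaps only
            if any(r1[-n:] == r2[:n] for n in range(k, longest + 1)):
                edges.add((r1, r2))
    return edges

def transitive_reduction(edges):
    """Remove every edge that can be inferred from two consecutive edges."""
    successors = {}
    for a, b in edges:
        successors.setdefault(a, set()).add(b)
    inferable = {(a, c) for a, b in edges
                 for c in successors.get(b, ()) if (a, c) in edges}
    return edges - inferable

reads = ["GAT", "ATT", "TTA", "TAC", "ACA", "CAT", "CAA"]
string_graph = transitive_reduction(overlap_graph(reads, k=2))
for r1, r2 in sorted(string_graph):
    print(r1, "->", r2)
```

With k = 2 no edge of the toy overlap graph is transitively inferable, so the string graph has the same seven edges as the overlap graph.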

ASSEMBLY USING A STRING GRAPH
Assembly in theory [Nagarajan 09]
Return a path of minimal length that traverses each node at least once.
Illustration
For the previous example (nodes CAT, GAT, ATT, TTA, TAC, ACA, CAA), the only solution is GATTACATTACAA. (Recall that SCS was GATTACATCAA)
→ Graphs provide a good framework for assembly.
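The claim that GATTACATTACAA is the only solution can be checked on the toy graph by enumerating walks of increasing length until one visits every node. This is a brute-force sketch for toy inputs only (the number of walks grows exponentially in general); the function names are hypothetical, and edges are built from exact overlaps of length k, which for reads of length 3 and k = 2 is the only possible proper overlap.

```python
def overlap_edges(reads, k):
    """Adjacency lists: r -> s when the last k characters of r start s."""
    return {r: [s for s in reads if s != r and r[-k:] == s[:k]] for r in reads}

def shortest_covering_walks(graph):
    """All walks of minimal length that visit every node at least once."""
    nodes = set(graph)
    walks = [[n] for n in graph]
    for _ in range(len(nodes) ** 2):  # generous cap to guarantee termination
        covering = [w for w in walks if set(w) == nodes]
        if covering:
            return covering
        walks = [w + [nxt] for w in walks for nxt in graph[w[-1]]]
    return []

def spell(walk, k):
    """Sequence spelled by a walk whose consecutive reads overlap by k."""
    return walk[0] + "".join(read[k:] for read in walk[1:])

reads = ["GAT", "ATT", "TTA", "TAC", "ACA", "CAT", "CAA"]
graph = overlap_edges(reads, k=2)
assemblies = [spell(w, 2) for w in shortest_covering_walks(graph)]
print(assemblies)
```

The walk has to pass through ATT, TTA, TAC and ACA twice in order to reach both CAT and CAA, which is how the repeat of the original genome is recovered.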

ASSEMBLY USING A STRING GRAPH
Example of ambiguities
Figure: string graph over the nodes GAG, AGT, GTG, ACT, CTG, TGA, GAC, ACC, GAA, AAT, ATG
Assembly in practice
Return a set of paths covering the graph, such that all possible assemblies contain these paths.
Solution of the example above
The assembly is the following set of paths: { ACTGA, TGACC, TGAGTGA, TGAATGA }

ALMOST EVERY ASSEMBLY ALGORITHM
[Zerbino, Birney 08; Li et al. 09; Simpson et al. 12; ..]
1) The graph is completely constructed.
2) Likely sequencing errors are removed.
3) Known biological events are removed.
4) Finally, simple paths are returned.
Figure: assembly graph with variants & errors

WHOLE-GENOME GRAPHS ARE UNNECESSARY
Practically
Genome graphs are a better framework than SCS, but they
◮ are monolithic, hard to parallelize [Simpson et al. 09], and
◮ require a lot of memory (human: 150+ GB) [Li et al. 09].
Contribution 1: localized assembly
Proposed approach:
◮ Store reads in a redundancy-filtered index
◮ Locally construct one portion of the graph at a time

CONTRIBUTION 1.1: REDUNDANCY-FILTERED READ INDEX
◮ Store reads in a redundancy-filtered index [GC, RC, DL 11]
◮ Locally construct portions of the graph

REDUNDANCY-FILTERED READ INDEX: BENCHMARK
◮ Store reads in a redundancy-filtered index
◮ Locally construct portions of the graph
Figure: memory usage (GB) of indexes
Construction time: SOAP: 41 mins; us: 64 mins

CONTRIBUTION 1.2: LOCALIZED TRAVERSAL
◮ Store reads in a redundancy-filtered index
◮ Locally construct portions of the graph, according to these rules:
Will traverse: variant sub-graphs, from s to t (max breadth b, max depth d)
Won’t traverse: long branches from s (min depth d)
Example (successive figures): 1) whole graph; 2) start with an empty graph; 3) construct the first portion; 4) construct the second portion
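One way to read the two traversal rules is as a bounded breadth-first exploration from s. The sketch below is an illustration only, not the implementation presented in the talk: the function name and the adjacency-dictionary graph are hypothetical, and the layered search only detects a convergence node t when all branches have the same length (as in a single-nucleotide variant).

```python
def local_traverse(graph, s, max_breadth, max_depth):
    """Return the node t where the branches leaving s reconverge, or None.

    The search gives up when the frontier exceeds max_breadth nodes or when
    no convergence is found within max_depth steps (a long branch).
    """
    frontier = {s}
    for _ in range(max_depth):
        frontier = {nxt for node in frontier for nxt in graph.get(node, ())}
        if not frontier or len(frontier) > max_breadth:
            return None
        if len(frontier) == 1:
            return next(iter(frontier))
    return None

# Variant sub-graph: two branches of equal length between s and t.
bubble = {"s": ["a1", "b1"], "a1": ["a2"], "b1": ["b2"], "a2": ["t"], "b2": ["t"]}
# Long branches: the two paths leaving s never reconverge within the depth bound.
fork = {"s": ["a1", "b1"], "a1": ["a2"], "b1": ["b2"],
        "a2": ["a3"], "b2": ["b3"], "a3": ["a4"], "b3": ["b4"]}

print(local_traverse(bubble, "s", max_breadth=2, max_depth=3))  # reaches t
print(local_traverse(fork, "s", max_breadth=2, max_depth=3))    # gives up
```

Bounding both breadth and depth keeps each constructed portion of the graph small, which is the point of localized assembly: only the neighbourhood currently being traversed needs to be held in memory.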
