Towards More Effective Formulations of the Genome Assembly Problem - - PowerPoint PPT Presentation

towards more effective formulations of the genome
SMART_READER_LITE
LIVE PREVIEW

Towards More Effective Formulations of the Genome Assembly Problem - - PowerPoint PPT Presentation

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End Towards More Effective Formulations of the Genome Assembly Problem Alexandru Tomescu Department of Computer Science University of Helsinki, Finland DACS


slide-1
SLIDE 1

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

Towards More Effective Formulations of the Genome Assembly Problem

Alexandru Tomescu

Department of Computer Science University of Helsinki, Finland

DACS June 26, 2015

1 / 25

slide-2
SLIDE 2

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End 2 / 25

slide-3
SLIDE 3

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

CENTRAL DOGMA OF BIOLOGY

DNA transcription binding promoter enhancer silencer protein-protein interaction gene 1 pre-mRNA mature mRNA transcripts intron exon alternative splicing translation proteins TFBS gene 2

...

Image taken from Genome-Scale Algorithm Design, Cambridge University Press, 2015

3 / 25

slide-4
SLIDE 4

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

SEQUENCING ATLAS

DNA gene mature mRNA transcripts proteins RNA sequencing ChIP sequencing methylation Bisulfite sequencing binding primer primer Targeted resequencing De novo sequencing / Whole genome resequencing TFBS Image taken from Genome-Scale Algorithm Design, Cambridge University Press, 2015

4 / 25

slide-5
SLIDE 5

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

HIGH-THROUGHPUT SEQUENCING

amplification

break apart

size selection

sequencing length ~450 100 100 ~250 paired-end reads

DNA

5 / 25

slide-6
SLIDE 6

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

HIGH-THROUGHPUT SEQUENCING

amplification

break apart

size selection

sequencing length ~450 100 100 ~250 paired-end reads

DNA

5 / 25

slide-7
SLIDE 7

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

HIGH-THROUGHPUT SEQUENCING

amplification

break apart

size selection

sequencing length ~450 100 100 ~250 paired-end reads

DNA

We assume here that DNA is a single stranded, single chromosome

5 / 25

slide-8
SLIDE 8

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

ILLUMINA HISEQX

6 / 25

slide-9
SLIDE 9

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

GENOME ASSEMBLY PROBLEM

E.coli 4.6 · 106 Human 3.2 · 109 Spurce 25 · 109

INPUT: A collection of paired-end reads OUTPUT: The genome

7 / 25

slide-10
SLIDE 10

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

GENOME ASSEMBLY PROBLEM

E.coli 4.6 · 106 Human 3.2 · 109 Spurce 25 · 109

INPUT: A collection of paired-end reads OUTPUT: The genome Initial formulations:

◮ Shortest superstring problem (NP-hard) ◮ Build a graph with reads as nodes, and significant overlaps between

reads as directed edges:

◮ Find a walk that passes through every node exactly once (NP-complete) ◮ Find a walk that passes through every node at least once 7 / 25

slide-11
SLIDE 11

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

GENOME ASSEMBLY PROBLEM

E.coli 4.6 · 106 Human 3.2 · 109 Spurce 25 · 109

INPUT: A collection of paired-end reads OUTPUT: The genome Initial formulations:

◮ Shortest superstring problem (NP-hard) ◮ Build a graph with reads as nodes, and significant overlaps between

reads as directed edges:

◮ Find a walk that passes through every node exactly once (NP-complete) ◮ Find a walk that passes through every node at least once

Unrealistic:

◮ Longer repeated regions are collapsed ◮ Genome coverage is not uniform ◮ We cannot choose between multiple solutions

7 / 25

slide-12
SLIDE 12

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

PRACTICAL FORMULATIONS / PIPELINE

  • 1. Contig assembly: assemble the reads into strings (contigs) that are

guaranteed to occur in the genome

GTACGATA ACGTACG GATATCTA CTAGTACCC CTAATTCGA ACGTACGATATCTA contig:

8 / 25

slide-13
SLIDE 13

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

PRACTICAL FORMULATIONS / PIPELINE

  • 1. Contig assembly: assemble the reads into strings (contigs) that are

guaranteed to occur in the genome

GTACGATA ACGTACG GATATCTA CTAGTACCC CTAATTCGA ACGTACGATATCTA contig:

  • 2. Scaffolding: using paired-end reads, chain the contigs into scaffolds that

are guaranteed to occur in the genome

8 / 25

slide-14
SLIDE 14

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

PRACTICAL FORMULATIONS / PIPELINE (2)

  • 3. Gap filling: fill the gaps in the scaffolds

Tens of genome assembly programs available: ABySS, Velvet, Allpaths-LG, Bambus2, MSR-CA, SGA, Cortex, SOAPdenovo, Opera-LG, SPADES, ...

9 / 25

slide-15
SLIDE 15

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

DE BRUIJN GRAPHS

DEFINITION

Given a set R of strings, the de Bruijn graph of order k of R is the directed graph DBk(R) with

◮ node set: the set of k-mers of R ◮ edge set: the set of k + 1-mers of the strings of R

Also edges occur in the strings of R!

10 / 25

slide-16
SLIDE 16

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

DE BRUIJN GRAPHS

DEFINITION

Given a set R of strings, the de Bruijn graph of order k of R is the directed graph DBk(R) with

◮ node set: the set of k-mers of R ◮ edge set: the set of k + 1-mers of the strings of R

Also edges occur in the strings of R!

ATGCGTGGCA ATGCG TGGCA CGTG AT TG GC CG GT GG CA

k = 2

10 / 25

slide-17
SLIDE 17

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

CONTIG ASSEMBLY

joint work with Paul Medvedev

11 / 25

slide-18
SLIDE 18

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

CONTIG ASSEMBLY

◮ No previous formal definition of contig ◮ Usually, contigs are maximal, unary paths (i.e., whose internal nodes

have in-degree and out-degree 1, aka unitigs)

v0 v1 v2 vk vk+1 vt vt+1

12 / 25

slide-19
SLIDE 19

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

CONTIG ASSEMBLY

◮ No previous formal definition of contig ◮ Usually, contigs are maximal, unary paths (i.e., whose internal nodes

have in-degree and out-degree 1, aka unitigs)

v0 v1 v2 vk vk+1 vt vt+1

Given a dBG G:

◮ a genomic walk of G is a circular edge-covering walk of G ◮ a walk is safe if it is a sub-walk of all genomic walks of G

12 / 25

slide-20
SLIDE 20

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

CONTIG ASSEMBLY

◮ No previous formal definition of contig ◮ Usually, contigs are maximal, unary paths (i.e., whose internal nodes

have in-degree and out-degree 1, aka unitigs)

v0 v1 v2 vk vk+1 vt vt+1

Given a dBG G:

◮ a genomic walk of G is a circular edge-covering walk of G ◮ a walk is safe if it is a sub-walk of all genomic walks of G

We now assume that the dBG admits a genomic walk (i.e., is strongly connected) and is not a single cycle.

12 / 25

slide-21
SLIDE 21

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

CONTIG ASSEMBLY (2)

We say that a contig assembly algorithm is

◮ sound: if every output walk is safe ◮ complete: if every safe walk is in the output

13 / 25

slide-22
SLIDE 22

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

CONTIG ASSEMBLY (2)

We say that a contig assembly algorithm is

◮ sound: if every output walk is safe ◮ complete: if every safe walk is in the output

The unitig algorithm is:

◮ outputting all maximal unitigs ◮ sound ◮ not complete

13 / 25

slide-23
SLIDE 23

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

CONTIG ASSEMBLY (2)

We say that a contig assembly algorithm is

◮ sound: if every output walk is safe ◮ complete: if every safe walk is in the output

The unitig algorithm is:

◮ outputting all maximal unitigs ◮ sound ◮ not complete

Is there a sound and complete contig assembly algorithm?

13 / 25

slide-24
SLIDE 24

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

NON-SWITCHING CONTIGS

v0 v1 v2 vk vk+1 vt vt+1 ◮ A path with all out-branching nodes before all in-branching nodes

14 / 25

slide-25
SLIDE 25

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

NON-SWITCHING CONTIGS

v0 v1 v2 vk vk+1 vt vt+1 ◮ A path with all out-branching nodes before all in-branching nodes ◮ Related to transformation-based algorithms of

◮ Kingsford, Schatz, Pop 2010 ◮ Jackson 2009 ◮ Medvedev, Georgiou, Myers, Brudno 2007 14 / 25

slide-26
SLIDE 26

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

NON-SWITCHING CONTIGS

v0 v1 v2 vk vk+1 vt vt+1 ◮ A path with all out-branching nodes before all in-branching nodes ◮ Related to transformation-based algorithms of

◮ Kingsford, Schatz, Pop 2010 ◮ Jackson 2009 ◮ Medvedev, Georgiou, Myers, Brudno 2007

THEOREM

There is an O(|G| + |output|)-time algorithm to output all maximal non-switching contigs of G.

THEOREM

The non-switching contig assembly algorithm is sound but not complete.

14 / 25

slide-27
SLIDE 27

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

OmniTIGS

v0 vi vj vt+1

e0 ei−1 ei ej−1 ej et

We say that a walk w = (v0, e0, v1, e1, . . . , vt, et, vt+1) is an omnitig if for all 1 ≤ i ≤ j ≤ t, there is no proper vj-vi path with first edge different from ej, and last edge different from ei−1.

15 / 25

slide-28
SLIDE 28

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

OmniTIGS

v0 vi vj vt+1

e0 ei−1 ei ej−1 ej et

We say that a walk w = (v0, e0, v1, e1, . . . , vt, et, vt+1) is an omnitig if for all 1 ≤ i ≤ j ≤ t, there is no proper vj-vi path with first edge different from ej, and last edge different from ei−1.

THEOREM

A walk w is safe ⇔ w is an omnitig.

THEOREM

There is a polynomial time algorithm for outputting all maximal omnitigs.

COROLLARY

The omnitig algorithm is sound and complete.

15 / 25

slide-29
SLIDE 29

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

EXPERIMENTAL RESULTS

Algorithm #strings E-size AVG length unitig 41,524 7,806 893 non-switching contigs 32,589 7,822 1,136

  • mnitigs

24,949 7,850 1,479

◮ genome: circularized human chr21 (length 48 · 106) ◮ graph: dBGk(chr21) for k = 55 ◮ e-size: given a set of substrings of genome, their e-size is the average,

  • ver all genomic positions i, of the mean length of the strings spanning

position i

16 / 25

slide-30
SLIDE 30

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

GAP FILLING

joint work with Leena Salmela, Kristoffer Sahlin and Veli M¨ akinen

17 / 25

slide-31
SLIDE 31

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

PROBLEM FORMULATION

Previous formulations (GapCloser 2012, GapFiller 2012):

s t

18 / 25

slide-32
SLIDE 32

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

PROBLEM FORMULATION

Previous formulations (GapCloser 2012, GapFiller 2012):

s t

We formulate it as Exact Path Length problem. Given:

◮ G = dBGk(R), for some k ◮ s, t ∈ V(G), the two k-mers flanking the gap ◮ [d′..d] an estimate on the gap length

For every x ∈ [d′..d] find an s-t path spelling a string of length x (i.e. a path of length x − k).

t s

18 / 25

slide-33
SLIDE 33

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

DYNAMIC PROGRAMMING (DP)

Usually, d < |V(G)|. Can be solved by DP in time O(d · |E(G)|):

◮ for every node v store:

a(v, i) :=

  • 1

if there exists an s-v path of length i,

  • therwise.

◮ initialize a(s, 0) = 1, and compute

a(v, i) :=

  • u∈N−(v)

a(u, i − 1).

v u a(v,i) a(u,i-1)

◮ back-tracking in the DP matrix from a(t, x − k) gives a path (if exists)

19 / 25

slide-34
SLIDE 34

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

ENGINEERED IMPLEMENTATION

◮ k-mers flanking the gaps can have errors

◮ allow paths to start/end at up to t flanking k-mers

◮ we should not explore the entire graph!

◮ meet in the middle optimization

◮ DP matrix rows are sparse

◮ store only the non-zero entries in each row

◮ parallelization on scaffold level ◮ limit the memory usage of the DP matrix

◮ abandon the search on a gap if limit exceeded 20 / 25

slide-35
SLIDE 35

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

EXPERIMENTAL RESULTS (Gap2Seq)

Remaining Gap Length Erroneous Length 1 · 105 2 · 105 3 · 105 4 · 105 Original GapCloser GapFiller-bowtie GapFiller-bwa Gap2Seq

◮ genome: Staphylococcus aureus (length 2.8 · 106) ◮ graph: dBG with k = 31, for R = a collection of real reads ◮ results are totals over all scaffolds from a benchmark dataset (ABySS,

Allpaths-LG, Bambus2, CABOG, MSR-CA, SGA, SOAPdenovo, Velvet)

21 / 25

slide-36
SLIDE 36

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

EXPERIMENTAL RESULTS (Gap2Seq)

Runtime (min) 20 40 60 80 100 120 Peak Memory (GB) 0.2 0.4 0.6 0.8 GapCloser GapFiller-bowtie GapFiller-bwa Gap2Seq

22 / 25

slide-37
SLIDE 37

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

CONCLUSIONS

Potential uses for omnitigs:

◮ longer contigs = better starting point for scaffolding and gap filling ◮ more flanking information around loci of interest

Directions for omnitigs:

◮ robustness to errors, coverage gaps, reverse complements? ◮ faster omnitig algorithm ◮ a sound and complete algorithm when the genomic walk is linear?

———————— Gap Filling:

◮ performance on human data is ‘so and so’ (but all tools have problems) ◮ we need to improve the runtime and memory usage ◮ we need a way to choose between multiple solutions (contig assembly?)

23 / 25

slide-38
SLIDE 38

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End

MULT ¸UMESC

24 / 25