Informed and automated k -mer size selection for genome assembly - - PowerPoint PPT Presentation

informed and automated k mer size selection for genome
SMART_READER_LITE
LIVE PREVIEW

Informed and automated k -mer size selection for genome assembly - - PowerPoint PPT Presentation

Informed and automated k -mer size selection for genome assembly Rayan Chikhi, Paul Medvedev Pennsylvania State University HiTSeq - July 2013 1/15 G ENOME ASSEMBLY Genome assembly is the technique used to reconstruct genome sequences from DNA


slide-1
SLIDE 1

Informed and automated k-mer size selection for genome assembly

Rayan Chikhi, Paul Medvedev Pennsylvania State University HiTSeq - July 2013

1/15

slide-2
SLIDE 2

GENOME ASSEMBLY

Genome assembly is the technique used to reconstruct genome sequences from DNA sequencing.

sequenced reads:

  • verlapping

sub-sequences, covering the genome redundantly

genome (unknown)

assembly

hypothesis of the genome

read contig

high-quality vs low-quality assemblies

2/15

slide-3
SLIDE 3

MOTIVATION

Bioinformaticians routinely run assemblers (Allpaths-LG, Soapdenovo2, Velvet, . . . ) to study novel organisms. Most assemblers cut reads into k-mers (de Bruijn graph method).

ACTGATGAC ACT CTG TGA GAT ATG TGA GAC read k-mers (k=3)

Practical issue: assemblers rely on the user to set the parameter k. → What could go wrong if k is incorrectly set?

3/15

slide-4
SLIDE 4

MOTIVATION: OPTIMAL k NEEDED

Total length and contiguity (NG50) of chr. 14 (88 Mbp) assemblies

NG50: maximum ℓ such that (

|contigi |≥ℓ |contigi |) larger than |genome|/2

Illumina 100bp paired-end 70x coverage, assembled by Velvet with several values of k k NG50 Assembly size

2000 4000 40 60 80 8.0e+07 8.5e+07 40 60 80

Fact: Genome assembly is not robust with respect to k. Our motivation: help bioinformaticians obtain the best possible assembly by finding optimal k automatically

4/15

slide-5
SLIDE 5

EXISTING METHODS TO ESTIMATE BEST k

Velvetk: without looking at the data: koptim = argmink(|Nk G − C|)

where: Nk (total number of k-mers in the reads), G (estimated genome size) and C (desired target coverage). Does not know about genome complexity and error rate. VelvetOptimizer: for a specific assembler (Velvet). Brute-forces all values of k and examines N50. koptim = argmaxk(N50k) Takes in the order of CPU-years for mammalian genomes.

5/15

slide-6
SLIDE 6

EXISTING METHODS TO ESTIMATE BEST k

Velvetk: without looking at the data: koptim = argmink(|Nk G − C|)

where: Nk (total number of k-mers in the reads), G (estimated genome size) and C (desired target coverage). Does not know about genome complexity and error rate. VelvetOptimizer: for a specific assembler (Velvet). Brute-forces all values of k and examines N50. koptim = argmaxk(N50k) Takes in the order of CPU-years for mammalian genomes.

Actually, most of the time:

  • Bioinformaticians run [assembler] many times with k = 21, . . . , 91, or
  • “Our colleagues had good results with k = 51 on [some other bacterial

dataset]”.

5/15

slide-7
SLIDE 7

HYPOTHESIS FOR THE OPTIMAL k

In DNA/RNA/metaDNA/metaRNA assembly:

6/15

sequenced k-mers

genome (unknown)

ideal world: single contig

slide-8
SLIDE 8

HYPOTHESIS FOR THE OPTIMAL k

In DNA/RNA/metaDNA/metaRNA assembly:

  • small k: less chance of missing k-mers

6/15

sequenced k-mers

genome (unknown)

ideal world: single contig missing k-mers break contigs

slide-9
SLIDE 9

HYPOTHESIS FOR THE OPTIMAL k

In DNA/RNA/metaDNA/metaRNA assembly:

  • small k: less chance of missing k-mers
  • large k: less repetitions shorter than k

6/15

sequenced k-mers

genome (unknown)

ideal world: single contig missing k-mers break contigs repeat repeat

repetitions also break contigs and reduce total assembly size

slide-10
SLIDE 10

HYPOTHESIS FOR THE OPTIMAL k

In DNA/RNA/metaDNA/metaRNA assembly:

  • small k: less chance of missing k-mers
  • large k: less repetitions shorter than k
  • Also, larger k-mers: more likely to contain errors (unusable k-mers)

Our hypothesis: use the largest k-mer size possible (to avoid repetitions), such that the genome is sufficiently covered by k-mers. → So, when are sufficiently many (non-erroneous) k-mers seen?

6/15

sequenced k-mers

genome (unknown)

ideal world: single contig missing k-mers break contigs repeat repeat

repetitions also break contigs and reduce total assembly size

slide-11
SLIDE 11

k-MER HISTOGRAMS

Common practice: compute the k-mer abundance histogram.

  • x axis: abundance
  • y axis: number of k-mers having abundance x (seen x times)

Example reads dataset: ACTCA GTCA 3-mers: ACT CTC TCA GTC TCA Abundance of each distinct 3-mer: ACT: 1 CTC: 1 TCA: 2 GTC: 1 3-mer abundance: x y 1 3 2 1 3 4

For a dataset and a value of k, methods that build histograms already exist (k-mer counting, e.g. Jellyfish, DSK, . . .).

7/15

slide-12
SLIDE 12

DISSECTION OF A k-MER HISTOGRAM

Chr 14 (≈ 88 Mbp) GAGE dataset; histogram k = 21

20 40 60 80 120 1e+05 1e+07 1e+09 k = 21

Erroneous k-mers Genomic non-repeated k-mers Genomic repeated k-mers, sequencing artifacts, ..

Genomic area ≈ number of distinct k-mers covering the genome ≈ size of the assembly → How to determine exactly this area?

8/15

slide-13
SLIDE 13

HISTOGRAM MODEL

We use Quake’s model: [DR Kelley 2010] Erroneous k-mers Pareto distribution with shape α, pdf = α xα+1 Genomic k-mers Mixture of n Gaussians, weighted by a Zeta distribution

  • f shape s:

w1X1 + . . . + wnXn Xi ∼ N(iµ1, (iσ1)2) P(wi = k) = k −s/ζ(s) Full model Mixture weighted by (pe, 1 − pe).

20 40 60 80 120 1e+05 1e+07 1e+09 k = 21

Numerical optimization (R) is used to fit the model to actual histograms.

9/15

slide-14
SLIDE 14

SEEN SO FAR

  • Genome is sufficiently covered by k-mers =

⇒ good k value

  • Requires to know the number of genomic k-mers
  • Can be estimated with a k-mer histogram and the Quake model

To find the optimal k, one can compare histograms for different values of k.

Abundance Number of kmers 20 40 60 80 120 1e+05 1e+07 1e+09 k = 21 Abundance Number of kmers 20 40 60 80 120 1e+04 1e+06 1e+08 k = 41 Abundance Number of kmers 20 40 60 80 120 5e+04 5e+06 5e+08 k = 81

Chr 14 (≈ 88 Mbp) GAGE dataset; histograms for three values of k

→ Issue: computing a single histogram (using k-mer counting) is time and memory expensive

10/15

slide-15
SLIDE 15

SAMPLING HISTOGRAMS

Computing exact k-mer histograms is expensive (= k-mer counting). Organism CPU time per k value DSK

  • S. aureus

2min chr14 48min

  • B. impatiens

7.5hour

11/15

slide-16
SLIDE 16

SAMPLING HISTOGRAMS

Computing exact k-mer histograms is expensive (= k-mer counting). Organism CPU time per k value Memory usage of DSK Sampling method Sampling method (GB)

  • S. aureus

2min 11sec 0.1 chr14 48min 7min 0.1

  • B. impatiens

7.5hour 1.2hour 0.4 We developed a fast and memory-efficient histogram sampling technique. Sample 1 k-mer out of r, in k-mer space (the same k-mer seen in two different reads will be either consistently sampled, either consistently ignored)

  • 20

40 60 80 100 1e+04 1e+06 1e+08

Abundance Number of k−mers

  • Chr 14 (≈ 88 Mbp) k = 41
  • continuous line = exact histogram
  • dots = sampled histogram
  • sampling errors are visible for low

number of k-mers (log scale)

11/15

slide-17
SLIDE 17

TOOLS, DATASETS

Software: KmerGenie (http://kmergenie.bx.psu.edu) Evaluation on actual datasets from GAGE (assembly benchmark): [Salzberg 2011] Dataset

  • S. aureus

human chr 14

  • B. impatiens

Genome size 2.9 Mbp 88 Mbp 250 Mbp Coverage 167x 70x 247x Avg read length 101 bp 101 bp 124 bp Selected a typical assembler for each dataset, executed ∀k: Velvet and SOAPdenovo2 [Zerbino 2008, Luo 2013]

12/15

slide-18
SLIDE 18

KMERGENIE RESULTS: ACCURACY

Predicted best k and predicted assembly size vs actual assembly size and NG50 for 3 organisms (GAGE benchmark).

  • S. aureus

k NG50 Predicted size

10000 20000 20 40 60 2500000 3000000 3500000 20 40 60

Predicted Velvet

  • Chr. 14

k NG50 Predicted size

2000 4000 20 40 60 80 8.0e+07 8.5e+07 9.0e+07 20 40 60 80

Predicted Velvet

  • B. impatiens

k NG50 Predicted size

5000 10000 20 40 60 80 2.5e+08 3.0e+08 20 40 60 80

Predicted SOAPdenovo2

vertical lines corresponds to predicted best k

13/15

slide-19
SLIDE 19

CONCLUSION / PERSPECTIVES

  • KmerGenie helps choose the k-mer size for de novo assembly
  • Experiments: choices are close to the best possible
  • Methods:

◮ Best k maximizes the number of genomic k-mers ◮ Quake’s statistical model ◮ Efficient k-mer histogram sampling

Perspectives:

  • Increase robustness (high-coverage, longer reads)
  • Improve statistical model
  • Estimation of Velvet’s cov_cutoff =

⇒ zero-parameter assembler

  • Extract information from histograms for transcriptome and

meta-genomes

14/15

slide-20
SLIDE 20

USING KMERGENIE

curl http://kmergenie.bx.psu.edu/kmergenie-1.5397.tar.gz | tar xz cd kmergenie-1.5397 make Usage for a single file: ./kmergenie reads.fastq Usage for a list of files: ls -1 *.fastq > list_reads ./kmergenie list_reads It returns: best k: 47 As well as a set of kmer histograms to visualize. Thank you for your attention!

15/15