informed and automated k mer size selection for genome
play

Informed and automated k -mer size selection for genome assembly - PowerPoint PPT Presentation

Informed and automated k -mer size selection for genome assembly Rayan Chikhi, Paul Medvedev Pennsylvania State University HiTSeq - July 2013 1/15 G ENOME ASSEMBLY Genome assembly is the technique used to reconstruct genome sequences from DNA


  1. Informed and automated k -mer size selection for genome assembly Rayan Chikhi, Paul Medvedev Pennsylvania State University HiTSeq - July 2013 1/15

  2. G ENOME ASSEMBLY Genome assembly is the technique used to reconstruct genome sequences from DNA sequencing. genome (unknown) sequenced reads : overlapping sub-sequences, covering the genome redundantly read contig assembly hypothesis of the genome high-quality vs low-quality assemblies 2/15

  3. M OTIVATION Bioinformaticians routinely run assemblers (Allpaths-LG, Soapdenovo2, Velvet, . . . ) to study novel organisms. Most assemblers cut reads into k -mers (de Bruijn graph method). read ACTGATGAC ACT CTG TGA k-mers GAT (k=3) ATG TGA GAC Practical issue: assemblers rely on the user to set the parameter k . → What could go wrong if k is incorrectly set? 3/15

  4. M OTIVATION : OPTIMAL k NEEDED Total length and contiguity (NG50) of chr. 14 (88 Mbp) assemblies NG50: maximum ℓ such that ( � | contig i |≥ ℓ | contig i | ) larger than | genome | / 2 Illumina 100bp paired-end 70x coverage, assembled by Velvet with several values of k 8.5e+07 NG50 Assembly size 8.0e+07 40 60 80 4000 2000 0 40 60 80 k Fact: Genome assembly is not robust with respect to k . Our motivation: help bioinformaticians obtain the best possible assembly by finding optimal k automatically 4/15

  5. E XISTING METHODS TO ESTIMATE BEST k Velvetk: without looking at the data: k optim = argmin k ( | N k G − C | ) where: N k (total number of k -mers in the reads), G (estimated genome size) and C (desired target coverage). Does not know about genome complexity and error rate. VelvetOptimizer: for a specific assembler (Velvet). Brute-forces all values of k and examines N50. k optim = argmax k ( N 50 k ) Takes in the order of CPU-years for mammalian genomes. 5/15

  6. E XISTING METHODS TO ESTIMATE BEST k Velvetk: without looking at the data: k optim = argmin k ( | N k G − C | ) where: N k (total number of k -mers in the reads), G (estimated genome size) and C (desired target coverage). Does not know about genome complexity and error rate. VelvetOptimizer: for a specific assembler (Velvet). Brute-forces all values of k and examines N50. k optim = argmax k ( N 50 k ) Takes in the order of CPU-years for mammalian genomes. Actually, most of the time: - Bioinformaticians run [assembler] many times with k = 21 , . . . , 91, or - “Our colleagues had good results with k = 51 on [some other bacterial dataset]” . 5/15

  7. H YPOTHESIS FOR THE OPTIMAL k genome (unknown) sequenced k-mers ideal world: single contig In DNA/RNA/metaDNA/metaRNA assembly: 6/15

  8. H YPOTHESIS FOR THE OPTIMAL k genome (unknown) sequenced k-mers ideal world: single contig missing k-mers break contigs In DNA/RNA/metaDNA/metaRNA assembly: - small k : less chance of missing k -mers 6/15

  9. H YPOTHESIS FOR THE OPTIMAL k genome repeat repeat (unknown) sequenced k-mers ideal world: single contig missing k-mers break contigs repetitions also break contigs and reduce total assembly size In DNA/RNA/metaDNA/metaRNA assembly: - small k : less chance of missing k -mers - large k : less repetitions shorter than k 6/15

  10. H YPOTHESIS FOR THE OPTIMAL k genome repeat repeat (unknown) sequenced k-mers ideal world: single contig missing k-mers break contigs repetitions also break contigs and reduce total assembly size In DNA/RNA/metaDNA/metaRNA assembly: - small k : less chance of missing k -mers - large k : less repetitions shorter than k - Also, larger k -mers: more likely to contain errors (unusable k -mers) Our hypothesis: use the largest k -mer size possible (to avoid repetitions), such that the genome is sufficiently covered by k -mers. → So, when are sufficiently many (non-erroneous) k -mers seen? 6/15

  11. k - MER HISTOGRAMS Common practice: compute the k -mer abundance histogram. - x axis: abundance - y axis: number of k -mers having abundance x (seen x times) Abundance of each distinct 3-mer: Example reads dataset: ACT: 1 ACTCA CTC: 1 GTCA TCA: 2 GTC: 1 3-mers: 3-mer abundance: ACT CTC x y TCA 1 3 GTC 2 1 TCA 3 0 4 0 For a dataset and a value of k , methods that build histograms already exist ( k -mer counting, e.g. Jellyfish, DSK, . . . ). 7/15

  12. D ISSECTION OF A k - MER HISTOGRAM Chr 14 ( ≈ 88 Mbp) GAGE dataset; histogram k = 21 k = 21 1e+09 Erroneous k-mers 1e+07 Genomic non-repeated k-mers Genomic repeated k-mers, sequencing artifacts, .. 1e+05 0 20 40 60 80 120 Genomic area ≈ number of distinct k -mers covering the genome ≈ size of the assembly → How to determine exactly this area? 8/15

  13. H ISTOGRAM MODEL We use Quake’s model: [DR Kelley 2010] Erroneous k -mers Pareto distribution with shape α , α k = 21 1e+09 pdf = x α + 1 Genomic k -mers Mixture of n Gaussians, 1e+07 weighted by a Zeta distribution of shape s : 1e+05 w 1 X 1 + . . . + w n X n X i ∼ N ( i µ 1 , ( i σ 1 ) 2 ) 0 20 40 60 80 120 P ( w i = k ) = k − s /ζ ( s ) Full model Mixture weighted by ( p e , 1 − p e ) . Numerical optimization (R) is used to fit the model to actual histograms. 9/15

  14. S EEN SO FAR ⇒ good k value - Genome is sufficiently covered by k -mers = - Requires to know the number of genomic k -mers - Can be estimated with a k -mer histogram and the Quake model To find the optimal k , one can compare histograms for different values of k . k = 21 k = 41 k = 81 1e+09 5e+08 1e+08 Number of kmers Number of kmers Number of kmers 1e+07 5e+06 1e+06 1e+05 5e+04 1e+04 0 20 40 60 80 120 0 20 40 60 80 120 0 20 40 60 80 120 Abundance Abundance Abundance Chr 14 ( ≈ 88 Mbp) GAGE dataset; histograms for three values of k → Issue: computing a single histogram (using k -mer counting) is time and memory expensive 10/15

  15. S AMPLING HISTOGRAMS Computing exact k -mer histograms is expensive (= k -mer counting). Organism CPU time per k value DSK S. aureus 2min chr14 48min B. impatiens 7.5hour 11/15

  16. S AMPLING HISTOGRAMS Computing exact k -mer histograms is expensive (= k -mer counting). Organism CPU time per k value Memory usage of DSK Sampling method Sampling method (GB) S. aureus 2min 11sec 0.1 chr14 48min 7min 0.1 B. impatiens 7.5hour 1.2hour 0.4 We developed a fast and memory-efficient histogram sampling technique. Sample 1 k -mer out of r , in k -mer space (the same k -mer seen in two different reads will be either consistently sampled, either consistently ignored) ● 1e+08 - Chr 14 ( ≈ 88 Mbp) k = 41 Number of k−mers ● - continuous line = exact histogram ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1e+06 ● - dots = sampled histogram ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● - sampling errors are visible for low ● ● ● ● ● ● ● ● ● ● number of k -mers (log scale) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1e+04 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● 0 20 40 60 80 100 Abundance 11/15

  17. T OOLS , D ATASETS Software: KmerGenie ( http://kmergenie.bx.psu.edu ) Evaluation on actual datasets from GAGE (assembly benchmark): [Salzberg 2011] Dataset S. aureus human chr 14 B. impatiens Genome size 2.9 Mbp 88 Mbp 250 Mbp Coverage 167x 70x 247x Avg read length 101 bp 101 bp 124 bp Selected a typical assembler for each dataset, executed ∀ k : Velvet and SOAPdenovo2 [Zerbino 2008, Luo 2013] 12/15

  18. K MER G ENIE R ESULTS : ACCURACY Predicted best k and predicted assembly size vs actual assembly size and NG50 for 3 organisms (GAGE benchmark). S. aureus Chr. 14 B. impatiens Predicted Predicted Predicted Velvet Velvet SOAPdenovo2 NG50 Predicted size NG50 Predicted size NG50 Predicted size 9.0e+07 3500000 3.0e+08 3000000 8.5e+07 2.5e+08 2500000 8.0e+07 20 40 60 20 40 60 80 20 40 60 80 20000 10000 4000 10000 5000 2000 0 0 0 20 40 60 20 40 60 80 20 40 60 80 k k k vertical lines corresponds to predicted best k 13/15

  19. C ONCLUSION / PERSPECTIVES - KmerGenie helps choose the k-mer size for de novo assembly - Experiments: choices are close to the best possible - Methods: ◮ Best k maximizes the number of genomic k -mers ◮ Quake’s statistical model ◮ Efficient k -mer histogram sampling Perspectives: - Increase robustness (high-coverage, longer reads) - Improve statistical model - Estimation of Velvet’s cov_cutoff = ⇒ zero-parameter assembler - Extract information from histograms for transcriptome and meta-genomes 14/15

  20. U SING K MER G ENIE curl http://kmergenie.bx.psu.edu/kmergenie-1.5397.tar.gz | tar xz cd kmergenie-1.5397 make Usage for a single file: ./kmergenie reads.fastq Usage for a list of files: ls -1 *.fastq > list_reads ./kmergenie list_reads It returns: best k: 47 As well as a set of kmer histograms to visualize. Thank you for your attention! 15/15

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend