Informed and automated k -mer size selection for genome assembly - PowerPoint PPT Presentation

Informed and automated k -mer size selection for genome assembly Rayan Chikhi, Paul Medvedev Pennsylvania State University HiTSeq - July 2013 1/15

G ENOME ASSEMBLY Genome assembly is the technique used to reconstruct genome sequences from DNA sequencing. genome (unknown) sequenced reads : overlapping sub-sequences, covering the genome redundantly read contig assembly hypothesis of the genome high-quality vs low-quality assemblies 2/15

M OTIVATION Bioinformaticians routinely run assemblers (Allpaths-LG, Soapdenovo2, Velvet, . . . ) to study novel organisms. Most assemblers cut reads into k -mers (de Bruijn graph method). read ACTGATGAC ACT CTG TGA k-mers GAT (k=3) ATG TGA GAC Practical issue: assemblers rely on the user to set the parameter k . → What could go wrong if k is incorrectly set? 3/15

M OTIVATION : OPTIMAL k NEEDED Total length and contiguity (NG50) of chr. 14 (88 Mbp) assemblies NG50: maximum ℓ such that ( � | contig i |≥ ℓ | contig i | ) larger than | genome | / 2 Illumina 100bp paired-end 70x coverage, assembled by Velvet with several values of k 8.5e+07 NG50 Assembly size 8.0e+07 40 60 80 4000 2000 0 40 60 80 k Fact: Genome assembly is not robust with respect to k . Our motivation: help bioinformaticians obtain the best possible assembly by finding optimal k automatically 4/15

E XISTING METHODS TO ESTIMATE BEST k Velvetk: without looking at the data: k optim = argmin k ( | N k G − C | ) where: N k (total number of k -mers in the reads), G (estimated genome size) and C (desired target coverage). Does not know about genome complexity and error rate. VelvetOptimizer: for a specific assembler (Velvet). Brute-forces all values of k and examines N50. k optim = argmax k ( N 50 k ) Takes in the order of CPU-years for mammalian genomes. 5/15

E XISTING METHODS TO ESTIMATE BEST k Velvetk: without looking at the data: k optim = argmin k ( | N k G − C | ) where: N k (total number of k -mers in the reads), G (estimated genome size) and C (desired target coverage). Does not know about genome complexity and error rate. VelvetOptimizer: for a specific assembler (Velvet). Brute-forces all values of k and examines N50. k optim = argmax k ( N 50 k ) Takes in the order of CPU-years for mammalian genomes. Actually, most of the time: - Bioinformaticians run [assembler] many times with k = 21 , . . . , 91, or - “Our colleagues had good results with k = 51 on [some other bacterial dataset]” . 5/15

H YPOTHESIS FOR THE OPTIMAL k genome (unknown) sequenced k-mers ideal world: single contig In DNA/RNA/metaDNA/metaRNA assembly: 6/15

H YPOTHESIS FOR THE OPTIMAL k genome (unknown) sequenced k-mers ideal world: single contig missing k-mers break contigs In DNA/RNA/metaDNA/metaRNA assembly: - small k : less chance of missing k -mers 6/15

H YPOTHESIS FOR THE OPTIMAL k genome repeat repeat (unknown) sequenced k-mers ideal world: single contig missing k-mers break contigs repetitions also break contigs and reduce total assembly size In DNA/RNA/metaDNA/metaRNA assembly: - small k : less chance of missing k -mers - large k : less repetitions shorter than k 6/15

H YPOTHESIS FOR THE OPTIMAL k genome repeat repeat (unknown) sequenced k-mers ideal world: single contig missing k-mers break contigs repetitions also break contigs and reduce total assembly size In DNA/RNA/metaDNA/metaRNA assembly: - small k : less chance of missing k -mers - large k : less repetitions shorter than k - Also, larger k -mers: more likely to contain errors (unusable k -mers) Our hypothesis: use the largest k -mer size possible (to avoid repetitions), such that the genome is sufficiently covered by k -mers. → So, when are sufficiently many (non-erroneous) k -mers seen? 6/15

k - MER HISTOGRAMS Common practice: compute the k -mer abundance histogram. - x axis: abundance - y axis: number of k -mers having abundance x (seen x times) Abundance of each distinct 3-mer: Example reads dataset: ACT: 1 ACTCA CTC: 1 GTCA TCA: 2 GTC: 1 3-mers: 3-mer abundance: ACT CTC x y TCA 1 3 GTC 2 1 TCA 3 0 4 0 For a dataset and a value of k , methods that build histograms already exist ( k -mer counting, e.g. Jellyfish, DSK, . . . ). 7/15

D ISSECTION OF A k - MER HISTOGRAM Chr 14 ( ≈ 88 Mbp) GAGE dataset; histogram k = 21 k = 21 1e+09 Erroneous k-mers 1e+07 Genomic non-repeated k-mers Genomic repeated k-mers, sequencing artifacts, .. 1e+05 0 20 40 60 80 120 Genomic area ≈ number of distinct k -mers covering the genome ≈ size of the assembly → How to determine exactly this area? 8/15

H ISTOGRAM MODEL We use Quake’s model: [DR Kelley 2010] Erroneous k -mers Pareto distribution with shape α , α k = 21 1e+09 pdf = x α + 1 Genomic k -mers Mixture of n Gaussians, 1e+07 weighted by a Zeta distribution of shape s : 1e+05 w 1 X 1 + . . . + w n X n X i ∼ N ( i µ 1 , ( i σ 1 ) 2 ) 0 20 40 60 80 120 P ( w i = k ) = k − s /ζ ( s ) Full model Mixture weighted by ( p e , 1 − p e ) . Numerical optimization (R) is used to fit the model to actual histograms. 9/15

S EEN SO FAR ⇒ good k value - Genome is sufficiently covered by k -mers = - Requires to know the number of genomic k -mers - Can be estimated with a k -mer histogram and the Quake model To find the optimal k , one can compare histograms for different values of k . k = 21 k = 41 k = 81 1e+09 5e+08 1e+08 Number of kmers Number of kmers Number of kmers 1e+07 5e+06 1e+06 1e+05 5e+04 1e+04 0 20 40 60 80 120 0 20 40 60 80 120 0 20 40 60 80 120 Abundance Abundance Abundance Chr 14 ( ≈ 88 Mbp) GAGE dataset; histograms for three values of k → Issue: computing a single histogram (using k -mer counting) is time and memory expensive 10/15

S AMPLING HISTOGRAMS Computing exact k -mer histograms is expensive (= k -mer counting). Organism CPU time per k value DSK S. aureus 2min chr14 48min B. impatiens 7.5hour 11/15

S AMPLING HISTOGRAMS Computing exact k -mer histograms is expensive (= k -mer counting). Organism CPU time per k value Memory usage of DSK Sampling method Sampling method (GB) S. aureus 2min 11sec 0.1 chr14 48min 7min 0.1 B. impatiens 7.5hour 1.2hour 0.4 We developed a fast and memory-efficient histogram sampling technique. Sample 1 k -mer out of r , in k -mer space (the same k -mer seen in two different reads will be either consistently sampled, either consistently ignored) ● 1e+08 - Chr 14 ( ≈ 88 Mbp) k = 41 Number of k−mers ● - continuous line = exact histogram ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1e+06 ● - dots = sampled histogram ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● - sampling errors are visible for low ● ● ● ● ● ● ● ● ● ● number of k -mers (log scale) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1e+04 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● 0 20 40 60 80 100 Abundance 11/15

T OOLS , D ATASETS Software: KmerGenie ( http://kmergenie.bx.psu.edu ) Evaluation on actual datasets from GAGE (assembly benchmark): [Salzberg 2011] Dataset S. aureus human chr 14 B. impatiens Genome size 2.9 Mbp 88 Mbp 250 Mbp Coverage 167x 70x 247x Avg read length 101 bp 101 bp 124 bp Selected a typical assembler for each dataset, executed ∀ k : Velvet and SOAPdenovo2 [Zerbino 2008, Luo 2013] 12/15

K MER G ENIE R ESULTS : ACCURACY Predicted best k and predicted assembly size vs actual assembly size and NG50 for 3 organisms (GAGE benchmark). S. aureus Chr. 14 B. impatiens Predicted Predicted Predicted Velvet Velvet SOAPdenovo2 NG50 Predicted size NG50 Predicted size NG50 Predicted size 9.0e+07 3500000 3.0e+08 3000000 8.5e+07 2.5e+08 2500000 8.0e+07 20 40 60 20 40 60 80 20 40 60 80 20000 10000 4000 10000 5000 2000 0 0 0 20 40 60 20 40 60 80 20 40 60 80 k k k vertical lines corresponds to predicted best k 13/15

C ONCLUSION / PERSPECTIVES - KmerGenie helps choose the k-mer size for de novo assembly - Experiments: choices are close to the best possible - Methods: ◮ Best k maximizes the number of genomic k -mers ◮ Quake’s statistical model ◮ Efficient k -mer histogram sampling Perspectives: - Increase robustness (high-coverage, longer reads) - Improve statistical model - Estimation of Velvet’s cov_cutoff = ⇒ zero-parameter assembler - Extract information from histograms for transcriptome and meta-genomes 14/15

U SING K MER G ENIE curl http://kmergenie.bx.psu.edu/kmergenie-1.5397.tar.gz | tar xz cd kmergenie-1.5397 make Usage for a single file: ./kmergenie reads.fastq Usage for a list of files: ls -1 *.fastq > list_reads ./kmergenie list_reads It returns: best k: 47 As well as a set of kmer histograms to visualize. Thank you for your attention! 15/15

Informed and automated k -mer size selection for genome assembly - PowerPoint PPT Presentation

Informed and automated k -mer size selection for genome assembly Rayan Chikhi, Paul Medvedev Pennsylvania State University HiTSeq - July 2013 1/15 G ENOME ASSEMBLY Genome assembly is the technique used to reconstruct genome sequences from DNA

Euclid OU-MER Herv Dole et al. MER tasks 1. MER photometry strategies 2. Pipeline

Mer andia Caroline Izzi, Meghan Loveless, and Gage Markley Creative Process Merlandia Logo Mer

Whole Genome Analysis and Annotation Adam Siepel Biological Statistics & Computational

Genome Reassembly From Fragments 7 January 2019 OSU CSE 1 Genome A genome is the encoding

Genome Sequencing & Analysis Core Resource Olivier Fedrigo Friday, October 19, 12 Reference

Week 3 Video 4 Automated Feature Generation Automated Feature Selection Automated Feature

Self Study: Yeast Genome Comparison SESSION 4 MARTIN KRZYWINSKI Genome Sciences Centre BC

Genome Annotation The steps in genome sequencing Generate genome sequence Assembly ORF

Visualizing ENCODE Data in the UCSC Genome Browser Pauline Fujita, Ph.D. UCSC Genome Bioinformatics

The Mouse Genome The Mouse Genome Database (MGD) Database (MGD) Eppig J.T., et al. (2005). The

Genome 562 February 2015 Week 6 Genome 562 p.1/13 Julian Huxley (1887-1975) Oxford

Genome 562 January 2015 Week 1 Genome 562 p.1/6 Early workers in theoretical population

Genome assembly Mark Stenglein, Todos Santos 2018 Genome assembly is the process of attempting to

Automated Design of Digital Automated Design of Digital Automated Design of Digital Automated

Informed Search strategies AIMA sections 3.5, 3.6 Summary Informed Search strategies

SELECTION Deterministic Stochastic Proportionate selection: Roulette Wheel Selection

Coun try Report I di M India Meteorolog l gical Department i l D by y M K G upta Scient

ISO 20022 messaging in the context of the T2-T2S Consolidation project 1 st meeting of the

eCoMove Project Overview Coen Bresser Technolution B V Technolution B.V. www.ecomove-project.eu

New Intake Presentation Monday 13 th July 2020 Thank you for your participation and comments

Step By Step Instructions On How To Make A Powerpoint Presentation Click Here:

th Global Distributors Meeting 10 th Global Distributors Meeting 10 Acson I nternational Brand

Unlocking the language of the HSC Please note that this document is a work in progress, the

Jimmy: Searching for Free-Will (A competition) Simon Egerton, Marc Davies, Victor Callaghan,

Sambuz

Useful Links

Newsletter

Mail Us

Informed and automated k -mer size selection for genome assembly - PowerPoint PPT Presentation

Informed and automated k -mer size selection for genome assembly Rayan Chikhi, Paul Medvedev Pennsylvania State University HiTSeq - July 2013 1/15 G ENOME ASSEMBLY Genome assembly is the technique used to reconstruct genome sequences from DNA

Euclid OU-MER Herv Dole et al. MER tasks 1. MER photometry strategies 2. Pipeline

Mer andia Caroline Izzi, Meghan Loveless, and Gage Markley Creative Process Merlandia Logo Mer

Whole Genome Analysis and Annotation Adam Siepel Biological Statistics &amp; Computational

Genome Reassembly From Fragments 7 January 2019 OSU CSE 1 Genome A genome is the encoding

Genome Sequencing &amp; Analysis Core Resource Olivier Fedrigo Friday, October 19, 12 Reference

Week 3 Video 4 Automated Feature Generation Automated Feature Selection Automated Feature

Self Study: Yeast Genome Comparison SESSION 4 MARTIN KRZYWINSKI Genome Sciences Centre BC

Genome Annotation The steps in genome sequencing Generate genome sequence Assembly ORF

Visualizing ENCODE Data in the UCSC Genome Browser Pauline Fujita, Ph.D. UCSC Genome Bioinformatics

The Mouse Genome The Mouse Genome Database (MGD) Database (MGD) Eppig J.T., et al. (2005). The

Genome 562 February 2015 Week 6 Genome 562 p.1/13 Julian Huxley (1887-1975) Oxford

Genome 562 January 2015 Week 1 Genome 562 p.1/6 Early workers in theoretical population

Genome assembly Mark Stenglein, Todos Santos 2018 Genome assembly is the process of attempting to

Automated Design of Digital Automated Design of Digital Automated Design of Digital Automated

Informed Search strategies AIMA sections 3.5, 3.6 Summary Informed Search strategies

SELECTION Deterministic Stochastic Proportionate selection: Roulette Wheel Selection

Coun try Report I di M India Meteorolog l gical Department i l D by y M K G upta Scient

ISO 20022 messaging in the context of the T2-T2S Consolidation project 1 st meeting of the

eCoMove Project Overview Coen Bresser Technolution B V Technolution B.V. www.ecomove-project.eu

New Intake Presentation Monday 13 th July 2020 Thank you for your participation and comments

Step By Step Instructions On How To Make A Powerpoint Presentation Click Here:

th Global Distributors Meeting 10 th Global Distributors Meeting 10 Acson I nternational Brand

Unlocking the language of the HSC Please note that this document is a work in progress, the

Jimmy: Searching for Free-Will (A competition) Simon Egerton, Marc Davies, Victor Callaghan,

Sambuz

Useful Links

Newsletter

Mail Us

Whole Genome Analysis and Annotation Adam Siepel Biological Statistics & Computational

Genome Sequencing & Analysis Core Resource Olivier Fedrigo Friday, October 19, 12 Reference