Informed and automated k-mer size selection for genome assembly
Rayan Chikhi, Paul Medvedev
Pennsylvania State University
HiTSeq - July 2013
GENOME ASSEMBLY
Genome assembly is the technique used to reconstruct genome sequences from DNA
sub-sequences, covering the genome redundantly
genome (unknown)
hypothesis of the genome
high-quality vs low-quality assemblies
NG50: maximum ℓ such that $\sum_{i \,:\, |\mathrm{contig}_i| \geq \ell} |\mathrm{contig}_i|$ is larger than $|\mathrm{genome}|/2$
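The NG50 definition above can be sketched in a few lines of Python. This is only an illustration: the function name `ng50` and the convention of returning 0 when the contigs never cover more than half the genome are assumptions, not part of the talk.

```python
def ng50(contig_lengths, genome_size):
    """Largest length l such that contigs of length >= l together
    sum to more than half the genome size (0 if no such l exists)."""
    covered = 0
    # Visit contigs from longest to shortest, accumulating their lengths.
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        # The first length at which the running sum passes half the
        # genome is the largest l that satisfies the definition.
        if covered > genome_size / 2:
            return length
    return 0
```

For example, with hypothetical contigs of lengths 50, 40, 30, 20 and 10 and a genome of size 100, the running sum first exceeds 50 at the contig of length 40, so the NG50 is 40.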
Illumina 100 bp paired-end, 70x coverage, assembled by Velvet with several values of k
[Figure: NG50 and assembly size plotted against k]
sequenced k-mers
genome (unknown)
ideal world: single contig
missing k-mers break contigs
repetitions (repeats) also break contigs and reduce total assembly size
Chr 14 (≈ 88 Mbp) GAGE dataset; histogram k = 21
Erroneous k-mers
Genomic non-repeated k-mers
Genomic repeated k-mers, sequencing artifacts, ...
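To make the histogram concrete, here is a minimal Python sketch of how a k-mer abundance histogram can be computed from reads. It counts every k-mer exactly, which is a naive stand-in and not the efficient sampling approach mentioned in the talk; the function name `kmer_histogram` is an assumption, and reverse complements are deliberately not merged.

```python
from collections import Counter

def kmer_histogram(reads, k):
    """Map each abundance value to the number of distinct k-mers
    that occur exactly that many times across all reads."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k over the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    # Histogram over abundances: abundance -> number of distinct k-mers.
    return Counter(counts.values())
```

In a histogram built this way from real data, low-abundance k-mers are mostly erroneous, the main peak holds genomic non-repeated k-mers, and the high-abundance tail holds repeated k-mers and sequencing artifacts.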
[Figure: k-mer abundance histograms (number of k-mers vs. abundance) for k = 21, k = 41, k = 81]
Chr 14 (≈ 88 Mbp) GAGE dataset; histograms for three values of k
[Figure: k-mer abundance histogram (number of k-mers vs. abundance)]
[Figures: NG50 and predicted size plotted against k for three datasets; panels labeled Predicted / Velvet, Predicted / Velvet, and Predicted / SOAPdenovo2]
Best k maximizes the number of genomic k-mers
Quake’s statistical model
Efficient k-mer histogram sampling
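The selection rule in the summary can be sketched as follows. Note the simplification: the talk relies on Quake's statistical model to separate erroneous from genomic k-mers, whereas this sketch swaps in a fixed abundance cutoff (`min_abundance`, a hypothetical parameter). The function names and the toy histograms in the usage note are likewise assumptions for illustration.

```python
def genomic_kmer_count(histogram, min_abundance):
    """Estimate the number of genomic k-mers as the number of distinct
    k-mers whose abundance reaches the cutoff (a crude stand-in for a
    statistical error model)."""
    return sum(n for abundance, n in histogram.items()
               if abundance >= min_abundance)

def best_k(histograms, min_abundance=3):
    """Given {k: abundance histogram}, return the k whose histogram
    yields the largest estimated number of genomic k-mers."""
    return max(histograms,
               key=lambda k: genomic_kmer_count(histograms[k], min_abundance))
```

With made-up histograms `{21: {1: 100, 10: 50}, 41: {1: 80, 8: 70}, 81: {1: 60, 2: 40}}` and the default cutoff of 3, the estimated genomic k-mer counts are 50, 70 and 0, so `best_k` returns 41.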