SLIDE 1
Generalization of the minimizers schemes
Guillaume Marçais, Dan DeBlasio, Carl Kingsford
Carnegie Mellon University
SLIDE 2
SLIDE 4
Computing read overlaps
Roberts, et al. (2004). Reducing storage requirements for biological sequence comparison.
[Figure labels: Overlaps; Cluster by similarity]
SLIDE 5
Computing minimizers
SLIDE 14
Minimizers definition and properties
Minimizers (k, w, o): In each window of w consecutive k-mers, select the smallest k-mer according to order o.
1. Uniform: distance between consecutive selected k-mers is ≤ w
2. Deterministic: two strings matching on w consecutive k-mers select the same minimizer
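The (k, w, o) definition can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the function name `minimizer_positions`, the default lexicographic order, and the leftmost tie-breaking rule are assumptions made for this example.

```python
def minimizer_positions(seq, k, w, order=None):
    """Return the sorted positions of k-mers selected by a (k, w, o) minimizers scheme.

    `order` maps a k-mer to a sortable key; lexicographic order is assumed
    when it is omitted. Ties go to the leftmost k-mer (an assumption of this sketch).
    """
    if order is None:
        order = lambda kmer: kmer
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    # Slide over every run of w consecutive k-mers.
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # min() returns the first minimal element, i.e. the leftmost one.
        selected.add(min(window, key=lambda i: order(kmers[i])))
    return sorted(selected)


positions = minimizer_positions("ACGTACGTTGCA", k=3, w=4)
# Distances between consecutive selected k-mers (the "Uniform" property bounds these by w).
gaps = [b - a for a, b in zip(positions, positions[1:])]
```

On this toy string the gaps between selected positions never exceed w = 4, which is the "Uniform" property in action.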
SLIDE 15
Computing read overlaps
1. Uniform: no sequence ignored
2. Deterministic: reads with overlap in same bin
[Figure labels: Overlaps; Cluster by minimizer]
SLIDE 16
Many applications of minimizers
- UMDOverlapper (Roberts, 2004): bin sequencing reads by shared minimizers to compute overlaps
- MSPKmerCounter (Li, 2015), KMC2 (Deorowicz, 2015), Gerbil (Erber, 2017): bin input sequences based on minimizer to count k-mers in parallel
- SparseAssembler (Ye, 2012), MSP (Li, 2013), DBGFM (Chikhi, 2014): reduce memory footprint of de Bruijn assembly graph with minimizers
- SamSAMi (Grabowski, 2015): sparse suffix array with minimizers
- MiniMap (Li, 2016), MashMap (Jain, 2017): sparse data structure for sequence alignment
- Kraken (Wood, 2014): taxonomic sequence classifier
- Schleimer et al. (2003): winnowing
SLIDE 18
Improving minimizers by lowering density
Density: the density of a scheme is the expected proportion of selected k-mers in a random sequence:
d = (# of selected k-mers) / (length of sequence)
[Figure label: Cluster by minimizer]
Lower density ⇒ smaller bins ⇒ less computation
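Density can be estimated empirically by running a scheme over a long random sequence. The sketch below is a rough illustration under several assumptions: the helper name `density`, the DNA alphabet, lexicographic order, leftmost tie-breaking, and dividing by the number of k-mers rather than the sequence length (the two differ by only k − 1).

```python
import random


def density(seq, k, w, order=lambda kmer: kmer):
    """Fraction of the k-mers of `seq` selected by a minimizers scheme."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        # Leftmost smallest k-mer in the window of w consecutive k-mers.
        selected.add(min(range(start, start + w), key=lambda i: order(kmers[i])))
    return len(selected) / len(kmers)


rng = random.Random(0)  # fixed seed so the run is repeatable
seq = "".join(rng.choice("ACGT") for _ in range(20000))
d_lex = density(seq, k=7, w=10)
```

Swapping in a different `order` function on the same sequence shows how density depends on the ordering alone, with k and w held fixed.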
SLIDE 19
Minimizers density minimizing problem
For fixed k and w:
- Properties “Uniform” & “Deterministic” unaffected by order
- Density changes with ordering o
- Lower density ⇒ sparser data structures and/or less computation
- Benefit existing and new applications
Density minimization problem: For fixed w, k, find k-mer order o giving the lowest expected density
SLIDE 22
Density and density factor trivial bounds
Density: 1/w ≤ d ≤ 1
- Lower bound 1/w: pick every w-th k-mer
- Upper bound 1: pick every k-mer
d = # of minimizers per base
Density factor: 1 + 1/w ≤ df = (w + 1)·d ≤ w + 1
df ≈ # of minimizers per window
SLIDE 24
Expected and bound on density
For an idealized random order o:
d = 2/(w + 1), df = 2
Expect ≈ 2 minimizers per window
Not valid for w ≫ k
For any order o:
d ≥ (1.5 + 1/(2w)) / (w + 1), df ≥ 1.5 + 1/(2w)
Requires ≥ 1.5 minimizers per window
Valid only for w ≫ k
Schleimer 2003, Roberts 2004
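The df = 2 figure for a random order can be checked numerically when k is comfortably large relative to w. The simulation below is a sketch under assumptions: the function name `random_order_density`, the DNA alphabet, the parameters k = 12 and w = 10, and a "random order" implemented by giving each distinct k-mer a pseudo-random rank the first time it is seen.

```python
import random


def random_order_density(n, k, w, alphabet="ACGT", seed=0):
    """Estimate the density of a minimizers scheme with a pseudo-random k-mer order."""
    rng = random.Random(seed)
    seq = "".join(rng.choice(alphabet) for _ in range(n))
    rank = {}  # lazily assigned random rank for each distinct k-mer

    def order(kmer):
        if kmer not in rank:
            rank[kmer] = rng.random()
        return rank[kmer]

    keys = [order(seq[i:i + k]) for i in range(n - k + 1)]
    selected = set()
    for start in range(len(keys) - w + 1):
        window = keys[start:start + w]
        # index() returns the leftmost position of the minimum rank.
        selected.add(start + window.index(min(window)))
    return len(selected) / len(keys)


w = 10
d = random_order_density(n=50000, k=12, w=w)
df = (w + 1) * d  # density factor, expected to be close to 2 in this regime
```

Re-running with small k and large w (for example k = 2, w = 50) is a quick way to see the estimate drift away from 2, in line with the "not valid for w ≫ k" caveat.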
SLIDE 25
Asymptotic behavior in k and w
What is the best ordering possible when:
- w is fixed and k → ∞
- k is fixed and w → ∞
SLIDE 26
Asymptotic behavior in w
[Plot, k = 3: density factor (w + 1)d versus window length w; curves: trivial bound 1 + 1/w, lower bound (w + 1)/σ^k, optimal solutions]
d ≥ 1/σ^k, df ≥ (w + 1)/σ^k
Density factor is Ω(w), not constant
SLIDE 28
Asymptotic behavior in k
Asymptotically optimal minimizers schemes: There exists a sequence of orders (o_k), k ∈ N, which are asymptotically optimal:
d_{o_k} → 1/w and df_{o_k} → 1 + 1/w as k → ∞
SLIDE 29
Depathing the de Bruijn graph
Optimal vertex cover of the de Bruijn graph (Lichiardopol 2006): There exists a sequence of vertex covers V_k of DB_k which is asymptotically optimal in size: |V_k| / σ^k → 1/2 as k → ∞
Optimal depathing of the de Bruijn graph: For a fixed w, there exists a sequence (U_k), k ∈ N, of sets of k-mers that covers every path of length w in DB_k such that |U_k| / σ^k → 1/w as k → ∞
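For small parameters, the covering condition can be verified by brute force. The sketch below assumes that a "path of length w" means w consecutive vertices of DB_k, i.e. a string of w + k − 1 characters; under that reading, w = 2 reduces to the vertex cover case. The function name `covers_all_paths` is hypothetical.

```python
from itertools import product


def covers_all_paths(U, k, w, alphabet):
    """Return True if every string of length w + k - 1 contains a k-mer from U.

    Such a string spells out w consecutive k-mers, i.e. a path of w vertices
    in the de Bruijn graph of order k (the interpretation assumed here).
    """
    U = set(U)
    for chars in product(alphabet, repeat=w + k - 1):
        s = "".join(chars)
        if not any(s[i:i + k] in U for i in range(w)):
            return False
    return True


# Binary alphabet {A, C}, k = 2, w = 2: the vertex cover case.
is_cover = covers_all_paths({"AA", "CC", "AC"}, k=2, w=2, alphabet="AC")
not_cover = covers_all_paths({"AA", "CC"}, k=2, w=2, alphabet="AC")
```

The second set fails because the string ACA spells the edge AC → CA, and neither endpoint is in the set. The enumeration is exponential in w + k, so this check is only practical for toy sizes.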
SLIDE 31
Bound on density
For all k, w and order o:
d ≥ (1.5 + 1/(2w) + max(0, ⌊(k − w)/w⌋)) / (w + k)
df ≥ 1 + 1/w for large k
df ≥ 1.5 + 1/(2w) for large w
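The bound is easy to evaluate numerically. The helper below reads the slide's formula as d ≥ (1.5 + 1/(2w) + max(0, ⌊(k − w)/w⌋)) / (w + k); that reading of the garbled layout, and the function names, are assumptions of this sketch.

```python
def density_lower_bound(k, w):
    """Lower bound on density d, under the reading of the slide's formula stated above."""
    # Integer floor division implements floor((k - w) / w); it is negative when k < w,
    # in which case max() clamps the term to 0.
    extra = max(0, (k - w) // w)
    return (1.5 + 1 / (2 * w) + extra) / (w + k)


def density_factor_lower_bound(k, w):
    """Matching bound on the density factor df = (w + 1) * d."""
    return (w + 1) * density_lower_bound(k, w)


small_k = density_lower_bound(k=5, w=10)   # floor term is clamped to 0
large_k = density_lower_bound(k=30, w=10)  # floor term equals 2
```

Evaluating `density_factor_lower_bound` for growing k at fixed w, or growing w at fixed k, gives a feel for the two asymptotic regimes stated on the slide.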
SLIDE 33
Density factor of minimizers
Asymptotic behavior of minimizers is fully characterized:
- Minimizers scheme is optimal for large k: df → 1 + 1/w as k → ∞
- Minimizers scheme is not optimal for large w: df = Ω(w)
- Better lower bound on d
Good:
- First example of optimal minimizers scheme
- Constructive proof
Not good:
- Large k less interesting in practice
- Minimizers don’t have constant density factor
SLIDE 36
Generalizing minimizers: local and forward schemes
Local scheme: Given f : Σ^(w+k−1) → [0, w − 1], for each window ω, select the k-mer at position f(ω).
Minimizers scheme with order o is a local scheme where f(ω) = arg min_{i ∈ [0, w−1]} o(ω[i : k])
Forward scheme: Local scheme such that f(ω′) ≥ f(ω) − 1 whenever ω′ is the window following ω (the suffix of ω equals the prefix of ω′)
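These definitions translate directly into code. The sketch below is illustrative only: the names `select_positions`, `minimizer_f`, and `is_forward` are hypothetical, the minimizers function assumes lexicographic order with leftmost tie-breaking, and the forward check enumerates all pairs of consecutive windows, so it is only usable for tiny alphabets and parameters.

```python
from itertools import product


def select_positions(seq, k, w, f):
    """Apply a local scheme: in each window of w + k - 1 characters, pick position f(window)."""
    span = w + k - 1
    return sorted({i + f(seq[i:i + span]) for i in range(len(seq) - span + 1)})


def minimizer_f(k, w, order=lambda kmer: kmer):
    """The minimizers scheme written as a local-scheme function f."""
    def f(window):
        # Leftmost position of the smallest k-mer in the window.
        return min(range(w), key=lambda i: order(window[i:i + k]))
    return f


def is_forward(f, k, w, alphabet):
    """Brute-force check that f(next) >= f(current) - 1 for all consecutive windows."""
    for chars in product(alphabet, repeat=w + k):
        s = "".join(chars)
        current, following = s[:-1], s[1:]
        if f(following) < f(current) - 1:
            return False
    return True


picked = select_positions("ACGTACGTTGCA", k=3, w=4, f=minimizer_f(3, 4))
minimizers_are_forward = is_forward(minimizer_f(2, 3), k=2, w=3, alphabet="AC")
```

Any function f with values in [0, w − 1] can be passed to `select_positions`, which is what makes local schemes a drop-in generalization; `is_forward` then tells whether the selected position can ever move backwards between consecutive windows.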
SLIDE 37
Local & forward as better minimizers schemes
[Figure labels: Minimizers, Forward, Local]
- Properties “Uniform” & “Deterministic” also satisfied
- Drop-in replacement for minimizers
- Potential for lower density
SLIDE 41
Density factor overview
Density factor df
Scheme       k → ∞      w → ∞ (best)   w → ∞ (bound)
Minimizers   1 + 1/w    O(w)           Ω(w)
Forward      1 + 1/w    O(√w)          ∼ 1.5 + 1/(2w)
Local        1 + 1/w    O(√w)          1 + 1/w
SLIDE 42
Conclusion: the quest for constant density factor
- Minimizers schemes can’t achieve constant density factor
- Local and forward schemes may achieve constant density factor
- Design of optimal orders or functions f still open
SLIDE 43