SLIDE 1
Asymptotically optimal minimizers schemes
Guillaume Marçais, Dan DeBlasio, Carl Kingsford
Carnegie Mellon University
SLIDE 2
SLIDE 4
Computing read overlaps
Roberts, et al. (2004). Reducing storage requirements for biological sequence comparison.
[Figure: overlaps; cluster by similarity]
SLIDE 5
Computing minimizers
SLIDE 14
Minimizers definition and properties
Minimizers (k, w, o): in each window of w consecutive k-mers, select the smallest k-mer according to order o.
1. No large gap: distance between selected k-mers is ≤ w
2. Deterministic: two strings matching on w consecutive k-mers select the same minimizer
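A minimal Python sketch of the (k, w, o) definition. The function name `minimizers`, the default lexicographic order, and the leftmost tie-break are assumptions made for illustration; the slides do not prescribe an implementation.

```python
from typing import Callable, List, Tuple


def minimizers(seq: str, k: int, w: int,
               order: Callable[[str], object] = lambda kmer: kmer
               ) -> List[Tuple[int, str]]:
    """Return the (position, k-mer) pairs selected by a (k, w, order) scheme."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    # Slide over every window of w consecutive k-mers.
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Smallest k-mer under the order; ties go to the leftmost position
        # (an assumption, the slides do not state a tie-break rule).
        best = min(window, key=lambda i: (order(kmers[i]), i))
        selected.add(best)
    return [(i, kmers[i]) for i in sorted(selected)]
```

For example, `minimizers("ACGTACGT", k=3, w=2)` selects the k-mers at positions 0, 1, 2 and 4, so consecutive selected positions are never more than w = 2 apart.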
SLIDE 15
Computing read overlaps
1. No large gap: no sequence ignored
2. Deterministic: reads with overlap in same bin
[Figure: overlaps; cluster by minimizer]
SLIDE 16
Many applications of minimizers
- UMDOverlapper (Roberts, 2004): bin sequencing reads by shared minimizers to compute overlaps
- MSPKmerCounter (Li, 2015), KMC2 (Deorowicz, 2015), Gerbil (Erbert, 2017): bin input sequences based on minimizer to count k-mers in parallel
- SparseAssembler (Ye, 2012), MSP (Li, 2013), DBGFM (Chikhi, 2014): reduce memory footprint of de Bruijn assembly graph with minimizers
- SamSAMi (Grabowski, 2015): sparse suffix array with minimizers
- MiniMap (Li, 2016), MashMap (Jain, 2017): sparse data structure for sequence alignment
- Kraken (Wood, 2014): taxonomic sequence classifier
- Schleimer et al. (2003): winnowing
SLIDE 18
Improving minimizers by lowering density
Density: the density of a scheme is the expected proportion of selected k-mers in a random sequence:
$d = \frac{\#\text{ of selected } k\text{-mers}}{\text{length of sequence}}$
[Figure: cluster by minimizer]
Lower density ⇒ smaller bins ⇒ less computation
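The density can be estimated empirically. The sketch below is only illustrative: the helper names `selected_positions`, `density` and `random_order` are hypothetical, the random order is simulated by giving each k-mer an independent random rank, and ties are broken toward the leftmost k-mer.

```python
import random


def selected_positions(seq, k, w, order):
    """Positions of the k-mers selected by the (k, w, order) minimizers scheme."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        # Smallest k-mer of the window, leftmost on ties (assumed tie-break).
        picked.add(min(range(start, start + w),
                       key=lambda i: (order(kmers[i]), i)))
    return picked


def density(seq, k, w, order):
    """Number of selected k-mers divided by the length of the sequence."""
    return len(selected_positions(seq, k, w, order)) / len(seq)


def random_order(seed=0):
    """Simulate a random order: each k-mer gets an independent random rank."""
    rng = random.Random(seed)
    ranks = {}

    def order(kmer):
        if kmer not in ranks:
            ranks[kmer] = rng.random()
        return ranks[kmer]

    return order
```

On a long random DNA sequence, `density(seq, 8, 5, random_order())` should land near 1/3; a different order on the same sequence generally gives a different value.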
SLIDE 19
Minimizers density minimizing problem
For fixed k and w:
- Properties “No large gap” & “Deterministic” unaffected by order
- Density changes with ordering o
- Lower density ⇒ sparser data structures and/or less computation
- Benefits existing and new applications
Density minimization problem: for fixed w and k, find the k-mer order o giving the lowest expected density.
SLIDE 21
Density and density factor trivial bounds
$\frac{1}{w} \le d \le 1$
- Lower bound $\frac{1}{w}$: pick every w-th k-mer
- Upper bound 1: pick every k-mer
Random order usual expected density: $d = \frac{2}{w+1}$
$1 + \frac{1}{w} \le df = (w + 1) \cdot d \le w + 1$
Random order usual expected density factor: df = 2
SLIDE 26
Schleimer’s bound does not apply in general
$d \ge \frac{1.5 + \frac{1}{2w}}{w + 1}$ (Schleimer et al.)
Applies only if w ≫ k, or for random orders
$d \ge \frac{1.5 + \frac{1}{2w} + \max\left(0, \left\lfloor \frac{k-w}{w} \right\rfloor\right)}{w + k} \xrightarrow[k \to \infty]{} \frac{1}{w}$
Valid for any k, w and any order
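To compare the two bounds numerically, they can be written as small Python helpers. The function names are hypothetical; the formulas are transcribed from the slides as read here, so treat this as a sketch for plugging in values rather than a reference implementation.

```python
def schleimer_bound(w):
    """Lower bound of Schleimer et al.: (1.5 + 1/(2w)) / (w + 1)."""
    return (1.5 + 1 / (2 * w)) / (w + 1)


def general_bound(k, w):
    """Bound stated for any k, w and any order:
    (1.5 + 1/(2w) + max(0, floor((k - w) / w))) / (w + k)."""
    return (1.5 + 1 / (2 * w) + max(0, (k - w) // w)) / (w + k)


def random_order_density(w):
    """Usual expected density of a random order: 2 / (w + 1)."""
    return 2 / (w + 1)


def density_factor(d, w):
    """Density factor df = (w + 1) * d."""
    return (w + 1) * d
```

For w = 5 and k = 1000, `general_bound(1000, 5)` evaluates to about 0.1996, already close to the limit 1/w = 0.2.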
SLIDE 27
Asymptotic behavior in k and w
What is the best ordering possible when:
- w is fixed and k → ∞
- k is fixed and w → ∞
SLIDE 29
A universal set defines an ordering
Universal set: a set M of k-mers that intersects every path of w nodes in the de Bruijn graph of order k.
- w = 2 ⇒ M is a vertex cover
- From M, get an order with density $d \le \frac{|M|}{\sigma^k}$
Universal set of size $\frac{\sigma^k}{w}$ ⇒ order with density $\frac{1}{w}$
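The universal-set property can be checked by brute force for small k and w, since a path of w nodes spells a string of k + w - 1 letters. In this sketch the names `is_universal`, `order_from_set` and `selected_kmers` are hypothetical. The example set M was worked out by hand for illustration and is not the set shown on the slides. The order simply ranks every k-mer of M before every k-mer outside M.

```python
from itertools import product


def is_universal(M, alphabet, k, w):
    """True if M hits every path of w nodes in the order-k de Bruijn graph."""
    M = set(M)
    # Enumerate every string of k + w - 1 letters, i.e. every path of w nodes.
    for letters in product(alphabet, repeat=k + w - 1):
        s = "".join(letters)
        if not any(s[i:i + k] in M for i in range(w)):
            return False
    return True


def order_from_set(M):
    """Order in which k-mers of M are smaller than all k-mers outside M."""
    M = set(M)
    return lambda kmer: (kmer not in M, kmer)


def selected_kmers(seq, k, w, order):
    """Set of k-mers selected by the (k, w, order) minimizers scheme."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        picked.add(min(range(start, start + w),
                       key=lambda i: (order(kmers[i]), i)))
    return {kmers[i] for i in picked}


# Illustrative universal set for the binary alphabet, k = 3, w = 3.
M = {"000", "111", "010", "011", "100"}
```

Because every window contains a k-mer of M, and those rank first, only k-mers of M are ever selected. That is why the density is at most $|M| / \sigma^k$.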
SLIDE 30
Creating a universal set, k = 3, w = 3, algorithm overview
- Start with a de Bruijn graph
- Embed into a w-dimensional space using ψ
- An edge corresponds (almost) to a rotation by 2π/w
- After w edges, return to the same sub-volume
- Pick k-mers in the highlighted "wedge"
[Figure: de Bruijn graph on the 3-mers 000, 001, 010, 011, 100, 101, 110, 111, and its embedding]
SLIDE 35
Asymptotic behavior in w
[Plot, k = 3: density factor (w + 1)d versus window length w, showing the trivial bound 1 + 1/w, the lower bound (w + 1)/σ^k, and optimal solutions]
$d \ge \frac{1}{\sigma^k}$, $df \ge \frac{w + 1}{\sigma^k}$
Density factor is Θ(w), not constant
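One way to see where the bound comes from is sketched below. It is not necessarily the authors' argument. It assumes a uniformly random sequence over an alphabet of size σ and a leftmost tie-break. The symbol $m_0$ (the smallest k-mer under the order o) is introduced here for illustration.

```latex
% Every occurrence of m_0 is the minimum of the window that starts at it,
% so every occurrence of m_0 is selected.
\begin{align*}
d &\ge \Pr[\text{a given } k\text{-mer equals } m_0] = \frac{1}{\sigma^k}, \\
df &= (w + 1)\, d \ge \frac{w + 1}{\sigma^k}.
\end{align*}
% Together with the trivial upper bound df <= w + 1, this gives
% df = \Theta(w) when k is fixed and w grows.
```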
SLIDE 37
Summary
Asymptotic behavior of minimizers is fully characterized:
- Minimizers scheme is optimal for large k: $d \xrightarrow[k \to \infty]{} \frac{1}{w}$
- Minimizers scheme is not optimal for large w: df = Θ(w)
- Tighter lower bound: $d \ge \frac{1.5 + \frac{1}{2w} + \max\left(0, \left\lfloor \frac{k-w}{w} \right\rfloor\right)}{w + k}$
- Comparison between k-mers takes O(k)
SLIDE 38
Future work
- Local scheme: $f : \Sigma^{w+k-1} \to [1, w]$
- Local schemes might be optimal for large w
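To make the signature concrete, a minimizers scheme can itself be written as a function of this type: it reads the w + k - 1 letters of a window and returns a position in [1, w]. The name `minimizer_as_local_scheme` and the leftmost tie-break are assumptions for illustration. A general local scheme may use any such function, not only one derived from a k-mer order.

```python
def minimizer_as_local_scheme(k, w, order=lambda kmer: kmer):
    """Wrap a (k, w, order) minimizers scheme as f: window string -> [1, w]."""

    def f(window: str) -> int:
        if len(window) != w + k - 1:
            raise ValueError("window must have w + k - 1 letters")
        kmers = [window[i:i + k] for i in range(w)]
        # Position of the smallest k-mer, leftmost on ties (assumed tie-break).
        best = min(range(w), key=lambda i: (order(kmers[i]), i))
        return best + 1  # positions are 1-based, as in [1, w]

    return f
```

With `f = minimizer_as_local_scheme(3, 2)`, `f("ACGT")` returns 1 (ACG is smaller than CGT) and `f("TACG")` returns 2.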
SLIDE 39