Generalization of the minimizers schemes


slide-1
SLIDE 1

Generalization of the minimizers schemes

Guillaume Marçais, Dan DeBlasio, Carl Kingsford

Carnegie Mellon University

slide-4
SLIDE 4

Computing read overlaps

Roberts, et al. (2004). Reducing storage requirements for biological sequence comparison.

[Figure labels: Overlaps; Cluster by similarity]

slide-5
SLIDE 5

Computing minimizers

slide-14
SLIDE 14

Minimizers definition and properties

Minimizers (k, w, o): in each window of w consecutive k-mers, select the smallest k-mer according to order o.

  1. Uniform: distance between selected k-mers is ≤ w
  2. Deterministic: two strings matching on w consecutive k-mers select the same minimizer
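
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `minimizers`, the default lexicographic order, and the leftmost tie-breaking rule are assumptions made for the example.

```python
def minimizers(seq, k, w, order=None):
    """Return sorted start positions of the k-mers selected by a (k, w, o) scheme.

    `order` maps a k-mer to a sortable key. Lexicographic order is assumed
    when it is None, and ties go to the leftmost k-mer (both are assumptions).
    """
    if order is None:
        order = lambda kmer: kmer
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    # Slide over every run of w consecutive k-mers.
    for start in range(len(kmers) - w + 1):
        # The (key, position) pair picks the smallest k-mer, leftmost on ties.
        best = min(range(start, start + w), key=lambda i: (order(kmers[i]), i))
        selected.add(best)
    return sorted(selected)


positions = minimizers("ACGTACGT", k=3, w=3)
# Uniform property: consecutive selected positions are at most w apart.
gaps = [b - a for a, b in zip(positions, positions[1:])]
```

For the string "ACGTACGT" with k = 3 and w = 3, the selected positions are 0, 1 and 4, and no gap between them exceeds w.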

slide-15
SLIDE 15

Computing read overlaps

  1. Uniform: no sequence ignored
  2. Deterministic: reads with overlap in same bin

[Figure labels: Overlaps; Cluster by minimizer]
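
A rough sketch of the binning idea, assuming reads are plain strings and lexicographic k-mer order. The helper names `window_minimizers`, `bin_reads` and `candidate_overlaps` are hypothetical and not taken from any of the tools cited in this talk.

```python
from collections import defaultdict


def window_minimizers(seq, k, w):
    """Set of k-mers chosen as minimizers of seq (lexicographic order assumed)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(kmers[s:s + w]) for s in range(len(kmers) - w + 1)}


def bin_reads(reads, k, w):
    """Map each minimizer k-mer to the indices of the reads that select it."""
    bins = defaultdict(set)
    for idx, read in enumerate(reads):
        for m in window_minimizers(read, k, w):
            bins[m].add(idx)
    return bins


def candidate_overlaps(reads, k, w):
    """Pairs of reads sharing at least one bin; only these need comparing."""
    pairs = set()
    for members in bin_reads(reads, k, w).values():
        ordered = sorted(members)
        pairs.update((a, b) for i, a in enumerate(ordered) for b in ordered[i + 1:])
    return pairs


reads = ["TTACGGATCCA", "CGGATCCATTG", "GGGGGGGGGGG"]
pairs = candidate_overlaps(reads, k=3, w=3)
```

The first two reads share the substring "CGGATCCA", which spans more than w consecutive k-mers, so the deterministic property puts them in a common bin. The third read shares no minimizer with either and is never compared.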

slide-16
SLIDE 16

Many applications of minimizers

  • UMDOverlapper (Roberts, 2004): bin sequencing reads by shared minimizers to compute overlaps
  • MSPKmerCounter (Li, 2015), KMC2 (Deorowicz, 2015), Gerbil (Erber, 2017): bin input sequences based on minimizer to count k-mers in parallel
  • SparseAssembler (Ye, 2012), MSP (Li, 2013), DBGFM (Chikhi, 2014): reduce memory footprint of de Bruijn assembly graph with minimizers
  • SamSAMi (Grabowski, 2015): sparse suffix array with minimizers
  • MiniMap (Li, 2016), MashMap (Jain, 2017): sparse data structure for sequence alignment
  • Kraken (Wood, 2014): taxonomic sequence classifier
  • Schleimer et al. (2003): winnowing

slide-18
SLIDE 18

Improving minimizers by lowering density

Density: the density of a scheme is the expected proportion of selected k-mers in a random sequence:

$d = \frac{\text{\# of selected k-mers}}{\text{length of sequence}}$

[Figure label: Cluster by minimizer]

Lower density ⇒ smaller bins ⇒ less computation
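
The density of a scheme on one concrete sequence can be measured directly. The sketch below assumes lexicographic order with leftmost tie-breaking and a uniformly random DNA string; the names `selected_positions` and `density` are illustrative only.

```python
import random


def selected_positions(seq, k, w):
    """Start positions of the k-mers picked in at least one window."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(range(s, s + w), key=lambda i: (kmers[i], i))
            for s in range(len(kmers) - w + 1)}


def density(seq, k, w):
    """d = (# of selected k-mers) / (length of sequence), as on the slide."""
    return len(selected_positions(seq, k, w)) / len(seq)


rng = random.Random(0)  # fixed seed so the run is repeatable
random_seq = "".join(rng.choice("ACGT") for _ in range(2000))
d_random = density(random_seq, k=5, w=5)
```

A single measurement like `d_random` only approximates the expected density, which is an average over all random sequences.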

slide-19
SLIDE 19

Minimizers density minimizing problem

For fixed k and w:

  • Properties “Uniform” & “Deterministic” unaffected by order
  • Density changes with ordering o
  • Lower density ⇒ sparser data structures and/or less computation
  • Benefit existing and new applications

Density minimization problem: for fixed w, k, find the k-mer order o giving the lowest expected density

slide-22
SLIDE 22

Density and density factor trivial bounds

Density:

$\frac{1}{w} \le d \le 1$

  • Lower bound 1/w: pick every w-th k-mer
  • Upper bound 1: pick every k-mer

d = # of minimizers per base

Density factor:

$1 + \frac{1}{w} \le df = (w+1) \cdot d \le w + 1$

df ≈ # of minimizers per window

slide-24
SLIDE 24

Expected and bound on density

For an idealized random order o:

$d = \frac{2}{w+1}, \qquad df = 2$

Expect ≈ 2 minimizers per window. Not valid for w ≫ k.

For any order o:

$d \ge \frac{1.5 + \frac{1}{2w}}{w+1}, \qquad df \ge 1.5 + \frac{1}{2w}$

Requires ≥ 1.5 minimizers per window. Valid only for w ≫ k.

Schleimer 2003, Roberts 2004
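
The value df ≈ 2 for a random order can be checked with a small simulation. This is a sketch under stated assumptions: the random order is emulated by giving each distinct k-mer a pseudo-random rank, the sequence is uniform over ACGT, and k is chosen large relative to w so that repeated k-mers inside a window are rare. The function name and parameters are hypothetical.

```python
import random


def density_factor_random_order(n, k, w, alphabet="ACGT", seed=0):
    """Estimate df = (w + 1) * d on one random sequence of n letters."""
    rng = random.Random(seed)
    seq = "".join(rng.choice(alphabet) for _ in range(n))
    rank = {}   # lazily assigned random rank per distinct k-mer (the order o)
    keys = []
    for i in range(n - k + 1):
        kmer = seq[i:i + k]
        if kmer not in rank:
            rank[kmer] = rng.random()
        keys.append(rank[kmer])
    # Leftmost smallest-rank k-mer in each window of w consecutive k-mers.
    selected = {min(range(s, s + w), key=lambda i: (keys[i], i))
                for s in range(len(keys) - w + 1)}
    d = len(selected) / n
    return (w + 1) * d


df_estimate = density_factor_random_order(n=20000, k=12, w=10)
```

With these parameters the estimate should land close to 2, matching d = 2/(w + 1). The exact value depends on the seed.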

slide-25
SLIDE 25

Asymptotic behavior in k and w

What is the best ordering possible when:

  • w is fixed and k → ∞
  • k is fixed and w → ∞


slide-26
SLIDE 26

Asymptotic behavior in w

[Figure: density factor (w + 1)d versus window length w, for k = 3. Curves: trivial bound 1 + 1/w; lower bound (w + 1)/σ^k; optimal solutions]

$d \ge \frac{1}{\sigma^k}, \qquad df \ge \frac{w+1}{\sigma^k}$

Density factor is Ω(w), not constant
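
The growth of the density factor with w can be observed empirically for a small fixed k. The sketch below assumes σ = 4 (DNA alphabet), lexicographic order, leftmost tie-breaking, and k = 2 rather than the k = 3 of the plot, so that the effect shows up at modest window lengths. All names are illustrative.

```python
import random


def density(seq, k, w):
    """Selected k-mers divided by sequence length (lexicographic order)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = {min(range(s, s + w), key=lambda i: (kmers[i], i))
              for s in range(len(kmers) - w + 1)}
    return len(picked) / len(seq)


def density_factor_by_window(k, windows, n=10000, alphabet="ACGT", seed=1):
    """Density factor (w + 1) * d for each window length, on one random sequence."""
    rng = random.Random(seed)
    seq = "".join(rng.choice(alphabet) for _ in range(n))
    return {w: (w + 1) * density(seq, k, w) for w in windows}


df_by_w = density_factor_by_window(k=2, windows=[4, 64])
```

In this sketch every occurrence of the smallest k-mer is the leftmost minimum of the window that starts at it, so it is always selected. That keeps the density near or above 1/σ^k however large w becomes, which is why (w + 1)d keeps growing.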

slide-28
SLIDE 28

Asymptotic behavior in k

Asymptotically optimal minimizers schemes: there exists a sequence of orders $(o_k)_{k \in \mathbb{N}}$ which are asymptotically optimal:

$d_{o_k} \xrightarrow[k \to \infty]{} \frac{1}{w}, \qquad df_{o_k} \xrightarrow[k \to \infty]{} 1 + \frac{1}{w}$

slide-29
SLIDE 29

Depathing the de Bruijn graph

Optimal vertex cover of the de Bruijn graph (Lichiardopol 2006): there exists a sequence of vertex covers $V_k$ of $DB_k$ which is asymptotically optimal in size:

$|V_k| \xrightarrow[k \to \infty]{} \frac{\sigma^k}{2}$

Optimal depathing of the de Bruijn graph: for a fixed w, there exists a sequence $(U_k)_{k \in \mathbb{N}}$ of sets of k-mers that covers every path of length w in $DB_k$ such that

$|U_k| \xrightarrow[k \to \infty]{} \frac{\sigma^k}{w}$
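
For tiny parameters, whether a set of k-mers covers every path can be verified by brute force. The sketch assumes that a "path of length w" means w consecutive k-mers, that is, a string of w + k − 1 letters, and that paths may revisit vertices. The function name `covers_every_window` is hypothetical.

```python
from itertools import product


def covers_every_window(kmer_set, k, w, alphabet):
    """True if every string of w + k - 1 letters contains a k-mer from kmer_set."""
    for letters in product(alphabet, repeat=w + k - 1):
        window = "".join(letters)
        # The window holds w k-mers, starting at offsets 0 .. w - 1.
        if not any(window[i:i + k] in kmer_set for i in range(w)):
            return False
    return True


# Binary alphabet, k = 2, w = 2: a vertex cover of the order-2 de Bruijn graph.
good = covers_every_window({"00", "11", "01"}, k=2, w=2, alphabet="01")
bad = covers_every_window({"00", "11"}, k=2, w=2, alphabet="01")
```

Here `good` is true, while `bad` is false because the string "010" contains only the k-mers "01" and "10". The enumeration is exponential in w + k, so this is only a checking aid for small cases.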

slide-31
SLIDE 31

Bound on density

For all k, w and order o:

$d \ge \frac{1.5 + \frac{1}{2w} + \max\left(0, \left\lfloor \frac{k-w}{w} \right\rfloor\right)}{w + k}$

$df \ge 1 + \frac{1}{w}$ for large k

$df \ge 1.5 + \frac{1}{2w}$ for large w
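
The two limiting cases can be read off numerically by evaluating the bound. The sketch below simply transcribes the formula on this slide; the function names are illustrative.

```python
def density_lower_bound(k, w):
    """(1.5 + 1/(2w) + max(0, floor((k - w) / w))) / (w + k)."""
    # Integer floor division matches the floor in the formula.
    return (1.5 + 1 / (2 * w) + max(0, (k - w) // w)) / (w + k)


def density_factor_lower_bound(k, w):
    """The same bound expressed as a density factor, df = (w + 1) * d."""
    return (w + 1) * density_lower_bound(k, w)


large_k = density_factor_lower_bound(k=100000, w=10)   # approaches 1 + 1/w = 1.1
large_w = density_factor_lower_bound(k=5, w=100000)    # approaches 1.5
```

For large k the floor term dominates and the density factor bound tends to 1 + 1/w. For large w the floor term vanishes and the bound tends to 1.5.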

slide-33
SLIDE 33

Density factor of minimizers

Asymptotic behavior of minimizers is fully characterized:

  • Minimizers scheme is optimal for large k: $df \xrightarrow[k \to \infty]{} 1 + \frac{1}{w}$
  • Minimizers scheme is not optimal for large w: df = Ω(w)
  • Better lower bound on d

Good:

  • First example of optimal minimizers scheme
  • Constructive proof

Not good:

  • Large k less interesting in practice
  • Minimizers don’t have constant density factor

slide-36
SLIDE 36

Generalizing minimizers: local and forward schemes

Local scheme: given $f : \Sigma^{w+k-1} \to [0, w-1]$, for each window ω, select the k-mer at position f(ω).

A minimizers scheme with order o is a local scheme where $f(\omega) = \arg\min_{i \in [0, w-1]} o(\omega[i : k])$.

Forward scheme: a local scheme such that f(ω′) ≥ f(ω) − 1 whenever ω′ is the window following ω, i.e. the suffix of ω equals the prefix of ω′.
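
These definitions can be made concrete with a short sketch. It reads the forward condition as applying to consecutive windows (ω′ starts one position after ω), uses lexicographic order with leftmost tie-breaking for the minimizers case, and checks the forward property by brute force on a tiny alphabet. The names `minimizer_f`, `select`, `is_forward` and `jumpy` are hypothetical.

```python
from itertools import product


def minimizer_f(k, w, order=lambda kmer: kmer):
    """The local-scheme function f of a minimizers scheme with the given order."""
    def f(window):
        return min(range(w), key=lambda i: (order(window[i:i + k]), i))
    return f


def select(seq, k, w, f):
    """Positions selected by the local scheme f over all windows of seq."""
    span = w + k - 1  # letters per window
    return sorted({s + f(seq[s:s + span]) for s in range(len(seq) - span + 1)})


def is_forward(f, k, w, alphabet):
    """Check f(next) >= f(current) - 1 for every pair of consecutive windows."""
    for letters in product(alphabet, repeat=w + k):
        s = "".join(letters)
        if f(s[1:]) < f(s[:-1]) - 1:
            return False
    return True


def jumpy(window):
    """A local scheme that is not forward: the selection can move backward."""
    return 2 if window[0] == "a" else 0


picked = select("ACGTACGT", 3, 3, minimizer_f(3, 3))
```

The minimizers function passes the forward check, while `jumpy` fails it: the selected position in the sequence can move to the left from one window to the next.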

slide-37
SLIDE 37

Local & forward as better minimizers schemes

Minimizers ⊂ Forward ⊂ Local

  • Properties “Uniform” & “Deterministic” also satisfied
  • Drop-in replacement for minimizers
  • Potential for lower density


slide-41
SLIDE 41

Density factor overview

Density factor df

Scheme       k → ∞      w → ∞ (best)   w → ∞ (bound)
Minimizers   1 + 1/w    O(w)           Ω(w)
Forward      1 + 1/w    O(√w)          ∼ 1.5 + 1/(2w)
Local        1 + 1/w    O(√w)          1 + 1/w

slide-42
SLIDE 42

Conclusion: the quest for constant density factor

  • Minimizers schemes can’t achieve constant density factor
  • Local and forward schemes may achieve constant density factor
  • Design of optimal orders or functions f still open

slide-43
SLIDE 43

GBMF4554, CCF-1256087, CCF-1319998, R01HG007104, R01GM122935

Carl Kingsford group: Dan DeBlasio, Heewook Lee, Natalie Sauerwald, Cong Ma, Hongyu Zheng, Laura Tung

Postdoc position open