Asymptotically optimal minimizers schemes


slide-1
SLIDE 1

Asymptotically optimal minimizers schemes

Guillaume Marçais, Dan DeBlasio, Carl Kingsford

Carnegie Mellon University

slide-4
SLIDE 4

Computing read overlaps

Roberts, et al. (2004). Reducing storage requirements for biological sequence comparison.

Overlaps: cluster by similarity

slide-5
SLIDE 5

Computing minimizers

slide-14
SLIDE 14

Minimizers definition and properties

Minimizers (k, w, o): in each window of w consecutive k-mers, select the smallest k-mer according to order o.

  1. No large gap: distance between selected k-mers is ≤ w
  2. Deterministic: two strings matching on w consecutive k-mers select the same minimizer
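The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `minimizers`, the default lexicographic order, and the leftmost tie-breaking rule are all assumptions made for the example.

```python
def minimizers(seq, k, w, order=None):
    """Return the sorted start positions of the k-mers selected by a
    (k, w, o) minimizers scheme on seq.

    `order` maps a k-mer to a sortable key and plays the role of o.
    Assumptions of this sketch: lexicographic order by default, and
    ties between equal k-mers are broken by the leftmost position.
    """
    if order is None:
        order = lambda kmer: kmer
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    # Slide a window over every run of w consecutive k-mers.
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        selected.add(min(window, key=lambda i: (order(kmers[i]), i)))
    return sorted(selected)


print(minimizers("ACGTACGT", k=3, w=3))  # [0, 1, 4]
```

In the output, consecutive selected positions are at most w = 3 apart, which is the "no large gap" property.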

slide-15
SLIDE 15

Computing read overlaps

  1. No large gap: no sequence ignored
  2. Deterministic: reads with overlap in same bin

Overlaps: cluster by minimizer

slide-16
SLIDE 16

Many applications of minimizers

  • UMDOverlapper (Roberts, 2004): bin sequencing reads by shared minimizers to compute overlaps
  • MSPKmerCounter (Li, 2015), KMC2 (Deorowicz, 2015), Gerbil (Erber, 2017): bin input sequences based on minimizer to count k-mers in parallel
  • SparseAssembler (Ye, 2012), MSP (Li, 2013), DBGFM (Chikhi, 2014): reduce memory footprint of de Bruijn assembly graph with minimizers
  • SamSAMi (Grabowski, 2015): sparse suffix array with minimizers
  • MiniMap (Li, 2016), MashMap (Jain, 2017): sparse data structure for sequence alignment
  • Kraken (Wood, 2014): taxonomic sequence classifier
  • Schleimer et al. (2003): winnowing

slide-18
SLIDE 18

Improving minimizers by lowering density

Density: the density of a scheme is the expected proportion of selected k-mers in a random sequence:

$d = \frac{\text{number of selected } k\text{-mers}}{\text{length of sequence}}$

Cluster by minimizer

Lower density ⇒ smaller bins ⇒ less computation
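The density of a given order can be estimated empirically. The sketch below is an illustration under stated assumptions: the helper names `density` and `random_order` are hypothetical, the number of k-mers is used as the denominator (it differs from the sequence length by only k − 1), ties are broken leftmost, and a "random order" is modeled by giving each distinct k-mer a random rank.

```python
import random


def density(seq, k, w, order):
    """Fraction of k-mer positions selected by the minimizers scheme on seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    keys = [order(kmer) for kmer in kmers]
    selected = set()
    for start in range(len(kmers) - w + 1):
        # Smallest key in the window; leftmost position wins ties.
        selected.add(min(range(start, start + w), key=lambda i: (keys[i], i)))
    return len(selected) / len(kmers)


rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(20000))
k, w = 8, 5

ranks = {}


def random_order(kmer):
    """Assign each distinct k-mer a random rank the first time it is seen."""
    if kmer not in ranks:
        ranks[kmer] = rng.random()
    return ranks[kmer]


d = density(seq, k, w, random_order)
print(f"measured density: {d:.4f}, 2/(w+1) = {2 / (w + 1):.4f}")
```

On a long random sequence the measured value should land close to 2/(w + 1), the usual expected density of a random order.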

slide-19
SLIDE 19

Minimizers density minimizing problem

For fixed k and w:

  • Properties “No large gap” & “Deterministic” unaffected by order
  • Density changes with ordering o
  • Lower density ⇒ sparser data structures and/or less computation
  • Benefit existing and new applications

Density minimization problem: for fixed w, k, find the k-mer order o giving the lowest expected density
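For very small parameters, the effect of the order on density can be explored by brute force. The sketch below is an illustration only: it enumerates all 24 orders of binary 2-mers and measures the empirical density of each on one fixed pseudo-random sequence. The parameters, the seed, and the function names are assumptions of the example, and the approach does not scale, since there are (σ^k)! possible orders.

```python
import itertools
import random


def density(seq, k, w, rank):
    """Empirical density of the minimizers scheme whose order is the dict `rank`."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        selected.add(min(range(start, start + w),
                         key=lambda i: (rank[kmers[i]], i)))
    return len(selected) / len(kmers)


k, w = 2, 3
alphabet = "01"
all_kmers = ["".join(p) for p in itertools.product(alphabet, repeat=k)]
rng = random.Random(1)
seq = "".join(rng.choice(alphabet) for _ in range(2000))

results = []
for perm in itertools.permutations(all_kmers):
    # perm[0] is the smallest k-mer in this order, perm[-1] the largest.
    rank = {kmer: r for r, kmer in enumerate(perm)}
    results.append((density(seq, k, w, rank), perm))

best_density, best_order = min(results)
worst_density, worst_order = max(results)
print("lowest density :", round(best_density, 4), best_order)
print("highest density:", round(worst_density, 4), worst_order)
```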

slide-21
SLIDE 21

Density and density factor trivial bounds

$\frac{1}{w} \le d \le 1$

  • Lower bound $\frac{1}{w}$: pick every w-th k-mer
  • Upper bound 1: pick every k-mer

Random order usual expected density: $d = \frac{2}{w+1}$

$1 + \frac{1}{w} \le \mathit{df} = (w + 1) \cdot d \le w + 1$

Random order usual expected density factor: df = 2
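The value 2/(w + 1) for a random order follows from a standard counting argument, sketched here under two assumptions that are not stated on the slide: the w + 1 consecutive k-mers spanned by two adjacent windows are all distinct, and each of them is equally likely to be the smallest. The selection changes between the two windows exactly when the smallest of the w + 1 k-mers sits at the first position (it leaves the window) or at the last position (it enters the window).

```latex
\Pr[\text{selection changes}]
  = \Pr[\text{min at first position}] + \Pr[\text{min at last position}]
  = \frac{1}{w+1} + \frac{1}{w+1}
  = \frac{2}{w+1},
\qquad
\mathit{df} = (w+1) \cdot d = (w+1) \cdot \frac{2}{w+1} = 2 .
```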

slide-26
SLIDE 26

Schleimer’s bound does not apply in general

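$d \ge \frac{1.5 + \frac{1}{2w}}{w + 1}$ (Schleimer et al.)

Applies only if w ≫ k, or for random orders

$d \ge \frac{1.5 + \frac{1}{2w} + \max\left(0, \left\lfloor \frac{k-w}{w} \right\rfloor\right)}{w + k} \xrightarrow[k \to \infty]{} \frac{1}{w}$

Valid for any k, w and any order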
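The two bounds are easy to evaluate numerically, which makes the limit visible. The sketch below transcribes the formulas from the slide; the function names `schleimer_bound` and `general_bound` are chosen for this example.

```python
def schleimer_bound(w):
    """Bound of Schleimer et al.: (1.5 + 1/(2w)) / (w + 1)."""
    return (1.5 + 1 / (2 * w)) / (w + 1)


def general_bound(k, w):
    """Bound valid for any k, w and any order:
    (1.5 + 1/(2w) + max(0, floor((k - w) / w))) / (w + k)."""
    # Python's // is floor division, so it matches the floor in the formula.
    return (1.5 + 1 / (2 * w) + max(0, (k - w) // w)) / (w + k)


w = 5
print(f"Schleimer bound, w={w}: {schleimer_bound(w):.4f}")
for k in (5, 50, 500, 5000):
    print(f"k={k:5d}: general bound = {general_bound(k, w):.4f}  (1/w = {1 / w:.4f})")
```

For fixed w, the printed values approach 1/w as k grows, matching the limit stated above.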

slide-27
SLIDE 27

Asymptotic behavior in k and w

What is the best ordering possible when:

  • w is fixed and k → ∞
  • k is fixed and w → ∞


slide-29
SLIDE 29

A universal set defines an ordering

Universal set: a set M of k-mers that intersects every path of w nodes in the de Bruijn graph of order k.

  • w = 2 ⇒ M is a vertex cover
  • From M, get order with density $d \le \frac{|M|}{\sigma^k}$

Universal set of size $\frac{\sigma^k}{w}$ ⇒ order with density $\frac{1}{w}$
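For tiny parameters, universality can be checked by brute force, using the fact that a path of w nodes in the de Bruijn graph of order k spells a string of length w + k − 1. The sketch below is illustrative only: the function names are hypothetical, and ordering the k-mers of M before all others (lexicographically within each group) is one simple way to derive an order from M, not necessarily the construction used by the authors.

```python
import itertools


def is_universal(M, k, w, alphabet):
    """True if every string of length w + k - 1 (a path of w nodes in the
    de Bruijn graph of order k) contains at least one k-mer of M."""
    for letters in itertools.product(alphabet, repeat=w + k - 1):
        s = "".join(letters)
        if not any(s[i:i + k] in M for i in range(w)):
            return False
    return True


def order_from_universal_set(M):
    """Key function in which every k-mer of M is smaller than every k-mer
    outside M; lexicographic order inside each group (arbitrary choice)."""
    return lambda kmer: (kmer not in M, kmer)


k, w, alphabet = 2, 2, "01"
M = {"00", "01", "11"}
print(is_universal(M, k, w, alphabet))             # True
print(is_universal({"00", "11"}, k, w, alphabet))  # False: "010" is missed

# With an order derived from a universal set, the smallest k-mer of every
# window belongs to M, so only k-mers of M are ever selected.
order = order_from_universal_set(M)
for letters in itertools.product(alphabet, repeat=w + k - 1):
    s = "".join(letters)
    window_kmers = [s[i:i + k] for i in range(w)]
    assert min(window_kmers, key=order) in M
```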

slide-30
SLIDE 30

Creating a universal set, k = 3, w = 3, algorithm overview

  1. Start with a de Bruijn graph
  2. Embed into a w-dimensional space using ψ
  3. An edge corresponds (almost) to a rotation by 2π/w
  4. After w edges, return to the same sub-volume
  5. Pick k-mers in the highlighted “wedge”

[Figure: de Bruijn graph of order 3 on the nodes 000, 001, 010, 011, 100, 101, 110, 111, and its embedding by ψ]

slide-35
SLIDE 35

Asymptotic behavior in w

[Figure, k = 3: density factor $(w + 1)d$ versus window length w, comparing the trivial bound $1 + \frac{1}{w}$, the lower bound $\frac{w + 1}{\sigma^k}$, and optimal solutions]

$d \ge \frac{1}{\sigma^k}$, $\mathit{df} \ge \frac{w + 1}{\sigma^k}$

Density factor is Θ(w), not constant

slide-37
SLIDE 37

Summary

Asymptotic behavior of minimizers is fully characterized:

  • Minimizers scheme is optimal for large k: $d \xrightarrow[k \to \infty]{} \frac{1}{w}$
  • Minimizers scheme is not optimal for large w: df = Θ(w)
  • Tighter lower bound: $d \ge \frac{1.5 + \frac{1}{2w} + \max\left(0, \left\lfloor \frac{k-w}{w} \right\rfloor\right)}{w + k}$
  • Comparison between k-mers takes O(k)

slide-38
SLIDE 38

Future work

  • Local scheme: $f : \Sigma^{w+k-1} \to [1, w]$
  • Local schemes might be optimal for large w
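A local scheme is any function that looks at a window of w + k − 1 letters and returns the position, between 1 and w, of the k-mer to select. The sketch below shows, under assumptions made for this example (hypothetical function names, lexicographic order, leftmost tie-breaking), how a minimizers scheme fits that interface as one special case.

```python
def minimizer_as_local_scheme(k, w, order=lambda kmer: kmer):
    """Return f : Σ^(w+k-1) -> [1, w], the 1-based position of the smallest
    k-mer of the window according to `order` (leftmost on ties)."""
    def f(window):
        assert len(window) == w + k - 1
        return 1 + min(range(w), key=lambda i: (order(window[i:i + k]), i))
    return f


def select(seq, k, w, f):
    """Apply a local scheme f to every window of seq and return the sorted
    0-based start positions of the selected k-mers."""
    span = w + k - 1
    return sorted({start + f(seq[start:start + span]) - 1
                   for start in range(len(seq) - span + 1)})


f = minimizer_as_local_scheme(k=3, w=3)
print(select("ACGTACGT", 3, 3, f))                # [0, 1, 4]

# Any other function with the same signature is also a local scheme,
# for example the trivial one that always picks the first k-mer.
print(select("ACGTACGT", 3, 3, lambda window: 1))  # [0, 1, 2, 3]
```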

slide-39
SLIDE 39

GBMF4554, CCF-1256087, CCF-1319998, R01HG007104, R01GM122935

Carl Kingsford group: Dan DeBlasio, Heewook Lee, Mingfu Shao, Brad Solomon, Natalie Sauerwald, Cong Ma, Hongyu Zheng, Laura Tung

Postdoc position open