Generalization of the minimizers schemes
Guillaume Marçais, Dan DeBlasio, Carl Kingsford
Carnegie Mellon University

Computing read overlaps
Figure: reads are clustered by similarity, then overlaps are computed.
Roberts, et al. (2004). Reducing storage requirements for biological sequence comparison.

Computing minimizers

Minimizers definition and properties
Minimizers (k, w, o): in each window of w consecutive k-mers, select the smallest k-mer according to order o.
1. Uniform: distance between selected k-mers is ≤ w
2. Deterministic: two strings matching on w consecutive k-mers select the same minimizer
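The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `minimizer_positions`, the default lexicographic order, and the leftmost tie-breaking rule are all assumptions made for the example.

```python
def minimizer_positions(seq, k, w, order=None):
    """Start positions of the k-mers selected by a (k, w, o) minimizers scheme.

    Assumptions for this sketch: `order` maps a k-mer to a sortable key
    (lexicographic by default), and ties are broken by the leftmost k-mer.
    """
    if order is None:
        order = lambda kmer: kmer
    n_kmers = len(seq) - k + 1
    selected = set()
    # Slide a window over every run of w consecutive k-mers.
    for start in range(n_kmers - w + 1):
        best = min(range(start, start + w),
                   key=lambda i: (order(seq[i:i + k]), i))
        selected.add(best)
    return sorted(selected)


seq = "ACGTACGTTGCA"
positions = minimizer_positions(seq, k=3, w=4)
gaps = [b - a for a, b in zip(positions, positions[1:])]
print(positions, gaps)
```

On this toy sequence the gaps between consecutive selected positions never exceed w, which is the "Uniform" property.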

Computing read overlaps
Figure: reads are clustered by minimizer, then overlaps are computed.
1. Uniform: no sequence ignored
2. Deterministic: reads with overlap in same bin

Many applications of minimizers
• UMDOverlapper (Roberts, 2004): bin sequencing reads by shared minimizers to compute overlaps
• MSPKmerCounter (Li, 2015), KMC2 (Deorowicz, 2015), Gerbil (Erber, 2017): bin input sequences based on minimizer to count k-mers in parallel
• SparseAssembler (Ye, 2012), MSP (Li, 2013), DBGFM (Chikhi, 2014): reduce memory footprint of de Bruijn assembly graph with minimizers
• SamSAMi (Grabowski, 2015): sparse suffix array with minimizers
• MiniMap (Li, 2016), MashMap (Jain, 2017): sparse data structure for sequence alignment
• Kraken (Wood, 2014): taxonomic sequence classifier
• Schleimer et al. (2003): winnowing

Improving minimizers by lowering density
Density: the density of a scheme is the expected proportion of selected k-mers in a random sequence:
$d = \frac{\#\text{ of selected } k\text{-mers}}{\text{length of sequence}}$
Lower density ⟹ smaller bins ⟹ less computation (figure: cluster by minimizer)

Minimizers density minimization problem
For fixed k and w:
• Properties “Uniform” & “Deterministic” unaffected by order
• Density changes with ordering o
• Lower density ⟹ sparser data structures and/or less computation
• Benefits existing and new applications
Density minimization problem: for fixed w, k, find the k-mer order o giving the lowest expected density
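To make the dependence of density on the order concrete, the sketch below estimates the density of two orders on the same pseudo-random DNA sequence. Everything here is an assumption chosen for illustration: the helper `density`, the parameters k = 8 and w = 10, the sequence length, the seed, and the "random order" built by assigning each k-mer a random rank on first sight.

```python
import random


def density(seq, k, w, order):
    """Fraction of sequence positions selected by a (k, w, order) scheme."""
    n_kmers = len(seq) - k + 1
    selected = set()
    for start in range(n_kmers - w + 1):
        # Smallest k-mer in the window; leftmost wins ties (an assumption).
        selected.add(min(range(start, start + w),
                         key=lambda i: (order(seq[i:i + k]), i)))
    return len(selected) / len(seq)


rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(20000))
k, w = 8, 10

ranks = {}

def random_order(kmer):
    # Hypothetical random order: each distinct k-mer gets a fixed random rank.
    if kmer not in ranks:
        ranks[kmer] = rng.random()
    return ranks[kmer]

d_lex = density(seq, k, w, order=lambda kmer: kmer)
d_rand = density(seq, k, w, order=random_order)
print(f"lexicographic: d = {d_lex:.4f}, df = {(w + 1) * d_lex:.3f}")
print(f"random order:  d = {d_rand:.4f}, df = {(w + 1) * d_rand:.3f}")
```

The two estimates generally differ, while both schemes keep the "Uniform" and "Deterministic" properties, since those do not depend on the order.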

Density and density factor trivial bounds
Density: $\frac{1}{w} \le d \le 1$, where d = # of minimizers per base
Density factor: $1 + \frac{1}{w} \le \mathit{df} = (w + 1) \cdot d \le w + 1$, where df ≈ # of minimizers per window
Upper bounds: pick every k-mer. Lower bounds: pick every w-th k-mer.

Expected and bound on density
For an idealized random order o: $d = \frac{2}{w+1}$, $\mathit{df} = 2$. Expect ≈ 2 minimizers per window. Not valid for w ≫ k.
For any order o: $d \ge \frac{1.5 + \frac{1}{2w}}{w+1}$, $\mathit{df} \ge 1.5 + \frac{1}{2w}$. Requires ≥ 1.5 minimizers per window. Valid only for w ≫ k.
Schleimer 2003, Roberts 2004
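As a worked example, the formulas on this slide and the trivial lower bound can be evaluated for one window length. The choice w = 10 is arbitrary and only for illustration.

```latex
\begin{align*}
d_{\text{random}} &= \frac{2}{w+1} = \frac{2}{11} \approx 0.182, & \mathit{df} &= 2 \\
d &\ge \frac{1.5 + \frac{1}{2w}}{w+1} = \frac{1.55}{11} \approx 0.141, & \mathit{df} &\ge 1.55 \\
d &\ge \frac{1}{w} = 0.1, & \mathit{df} &\ge 1 + \frac{1}{w} = 1.1
\end{align*}
```

The gap between 1.55 and 1.1 is the room that remains below the Schleimer and Roberts bound when that bound does not apply.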

Asymptotic behavior in k and w
What is the best ordering possible when:
• w is fixed and k → ∞
• k is fixed and w → ∞

Asymptotic behavior in w
Figure: density factor (w + 1)d versus window length w for k = 3, showing the optimal solutions, the lower bound $(w + 1)/\sigma^k$, and the trivial bound $1 + 1/w$.
$d \ge \frac{1}{\sigma^k}$, $\mathit{df} \ge \frac{w + 1}{\sigma^k}$
Density factor is Ω(w), not constant

Asymptotic behavior in k
Asymptotically optimal minimizers schemes: there exists a sequence of orders $(o_k)_{k \in \mathbb{N}}$ which are asymptotically optimal:
$d_{o_k} \xrightarrow[k \to \infty]{} \frac{1}{w}$, $\mathit{df}_{o_k} \xrightarrow[k \to \infty]{} 1 + \frac{1}{w}$

Depathing the de Bruijn graph
Optimal vertex cover of the de Bruijn graph (Lichiardopol 2006): there exists a sequence of vertex covers $V_k$ of $DB_k$ which is asymptotically optimal in size:
$|V_k| \xrightarrow[k \to \infty]{} \frac{\sigma^k}{2}$
Optimal depathing of the de Bruijn graph: for a fixed w, there exists a sequence $(U_k)_{k \in \mathbb{N}}$ of sets of k-mers that covers every path of length w in $DB_k$ such that
$|U_k| \xrightarrow[k \to \infty]{} \frac{\sigma^k}{w}$

Bound on density
For all k, w and order o:
$d \ge \frac{1.5 + \frac{1}{2w} + \max\left(0, \left\lfloor \frac{k - w}{w} \right\rfloor\right)}{w + k}$
$\mathit{df} \ge 1 + \frac{1}{w}$ for large k
$\mathit{df} \ge 1.5 + \frac{1}{2w}$ for large w
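The two limiting cases can be checked numerically by evaluating the bound at extreme parameter values. This is only a sketch: the function names and the specific values of k and w are chosen for illustration.

```python
def density_lower_bound(k, w):
    """Lower bound on d from the slide, for integer k and w."""
    # (k - w) // w is the floor of (k - w) / w; max() clips it at 0 when k < w.
    return (1.5 + 1 / (2 * w) + max(0, (k - w) // w)) / (w + k)


def density_factor_lower_bound(k, w):
    """Corresponding bound on the density factor df = (w + 1) * d."""
    return (w + 1) * density_lower_bound(k, w)


# Large k with fixed w: the bound approaches 1 + 1/w.
df_large_k = density_factor_lower_bound(k=10**6, w=10)
# Large w with fixed k: the bound approaches 1.5 + 1/(2w), close to 1.5 here.
df_large_w = density_factor_lower_bound(k=5, w=10**6)
print(df_large_k, 1 + 1 / 10)
print(df_large_w, 1.5 + 1 / (2 * 10**6))
```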

Density factor of minimizers
Asymptotic behavior of minimizers is fully characterized:
• Minimizers scheme is optimal for large k: $\mathit{df} \xrightarrow[k \to \infty]{} 1 + \frac{1}{w}$
• Minimizers scheme is not optimal for large w: df = Ω(w)
• Better lower bound on d
Good:
• First example of optimal minimizers scheme
• Constructive proof
Not good:
• Large k less interesting in practice
• Minimizers don’t have constant density factor

Generalizing minimizers: local and forward schemes
Local scheme: given $f : \Sigma^{w + k - 1} \to [0, w - 1]$, for each window ω, select the k-mer at position f(ω).
A minimizers scheme with order o is a local scheme where $f(\omega) = \arg\min_{i \in [0, w - 1]} o(\omega[i : k])$, with ω[i : k] the k-mer starting at position i of ω.
Forward scheme: a local scheme such that $f(\omega') \ge f(\omega) - 1$ for consecutive windows ω, ω′, that is, whenever a suffix of ω equals a prefix of ω′. The selected position never moves backward in the sequence.
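The difference between local and forward schemes can be illustrated by checking the forward condition on consecutive windows of a sequence. This is a sketch under stated assumptions: `minimizer_f` uses lexicographic order with leftmost tie-breaking, and `jumpy_f` is a made-up local function that exists only to show a violation.

```python
def windows(seq, k, w):
    """All windows of w consecutive k-mers, i.e. substrings of length w + k - 1."""
    size = w + k - 1
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def minimizer_f(window, k, w):
    # Minimizers as a local scheme: offset of the smallest k-mer in the window
    # (lexicographic order, leftmost on ties; both are assumptions).
    return min(range(w), key=lambda i: (window[i:i + k], i))


def jumpy_f(window, k, w):
    # Hypothetical local scheme, valid as a function into [0, w - 1],
    # but the selected position can jump backward between windows.
    return w - 1 if window[0] == "A" else 0


def is_forward_on(seq, f, k, w):
    """Check f(next) >= f(current) - 1 on every pair of consecutive windows."""
    ws = windows(seq, k, w)
    return all(f(nxt, k, w) >= f(cur, k, w) - 1 for cur, nxt in zip(ws, ws[1:]))


seq = "ACCCCCGTACGT"
k, w = 2, 3
print(is_forward_on(seq, minimizer_f, k, w))
print(is_forward_on(seq, jumpy_f, k, w))
```

On the first two windows, "ACCC" and "CCCC", `jumpy_f` returns 2 and then 0, so the selected position moves from index 2 back to index 1 and the forward condition fails. Passing the check on one sequence does not prove that a function is forward in general.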

Local & forward as better minimizers schemes
Minimizers ⊂ Forward ⊂ Local
• Properties “Uniform” & “Deterministic” also satisfied
• Drop-in replacement for minimizers
• Potential for lower density

Density factor overview
Density factor df
Scheme       k → ∞      w → ∞ (best)   w → ∞ (bound)
Minimizers   1 + 1/w    O(w)           Ω(w)
Forward      1 + 1/w    O(√w)          ∼ 1.5 + 1/(2w)
Local        1 + 1/w    O(√w)          1 + 1/w

Conclusion: the quest for constant density factor
• Minimizers schemes can’t achieve constant density factor
• Local and forward schemes may achieve constant density factor
• Design of optimal orders or functions f still open

Carl Kingsford group: Dan DeBlasio, Heewook Lee, Natalie Sauerwald, Cong Ma, Hongyu Zheng, Laura Tung
Postdoc position open
GBMF4554, R01HG007104, CCF-1256087, R01GM122935, CCF-1319998
