Bcash: Best compressible hash
Antoine Limasset, Yoshihiro Shibuya, Rayan Chikhi and Gregory Kucherov
January 30, 2020

Set similarity

Jaccard coefficient

Minhash

Three flavors of minhash
H hash functions: compute H hash functions on each key. O(nH)
H minimum hashes: one hash function, keep the H smallest hashes. O(n log H)
H partitions: one hash function, split the hashes into H partitions according to their first bits. Each partition keeps its smallest hash. O(n)
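The three flavors above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the 32-bit seeded hash `h`, the function names, and the power-of-two requirement on H in the partition variant are all assumptions made for the example.

```python
import hashlib
import heapq

BITS = 32  # assumed hash width, matching the 32-bit hashes used later in the talk


def h(key: str, seed: int = 0) -> int:
    """Illustrative seeded 32-bit hash (BLAKE2b digest); any good hash would do."""
    digest = hashlib.blake2b(
        key.encode(), digest_size=4, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def sketch_h_functions(keys, H):
    """H hash functions: one minimum per function, O(nH)."""
    return [min(h(k, seed) for k in keys) for seed in range(H)]


def sketch_h_minimums(keys, H):
    """One hash function, keep the H smallest distinct hashes, O(n log H)."""
    return heapq.nsmallest(H, {h(k) for k in keys})


def sketch_h_partitions(keys, H):
    """One hash function, H partitions chosen by the leading bits, O(n)."""
    p = H.bit_length() - 1       # assumes H is a power of two
    sketch = [None] * H          # None marks a partition that received no hash
    for k in keys:
        x = h(k)
        bucket = x >> (BITS - p)  # the first p bits select the partition
        if sketch[bucket] is None or x < sketch[bucket]:
            sketch[bucket] = x
    return sketch
```

All three return H values per set; the partition variant does a constant amount of work per key, which is why it is the one used in the experiment later in the deck.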

Application to bioinformatics: MASH

Error bounds according to similarity

Hash size
b-bit hashes
Pros: probability of collision is 1/2^b
Cons: sketch size is H * b bits
b = O(log n)

Minimizer evolution
First hash

Minimizer evolution
Found a smaller hash: we have a new minimizer

Minimizer evolution
And so on

Minimizer evolution
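The evolution shown on these slides amounts to tracking a running minimum while scanning the hashes. The helper below is a small sketch written for illustration (the name `minimizer_history` is hypothetical); it records each value that becomes the current minimizer.

```python
def minimizer_history(hashes):
    """Return the successive minimizers seen while scanning the hashes in order."""
    history = []
    current = None
    for x in hashes:
        # A strictly smaller hash replaces the current minimizer.
        if current is None or x < current:
            current = x
            history.append(x)
    return history
```

For the sequence 5, 7, 3, 9, 1, 4 the minimizer is first 5, then 3, then 1. The last entry of the history is the value that ends up in the sketch.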

Hyperminhash observation

HyperMinHash: MinHash in LogLog space
A HyperMinHash fingerprint is a HyperLogLog fingerprint (6 bits) plus a constant-size fingerprint (10 bits)
Lossy compression from O(log n) to O(log log n)
Allows cardinality estimation and union estimation using the HyperLogLog fingerprint
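One way to picture the 6 + 10 bit layout is the sketch below. It is an assumption-laden illustration rather than the reference HyperMinHash implementation: it assumes a 64-bit hash, stores the number of leading zero bits (saturated at 63) in 6 bits, and keeps the 10 bits that follow the first 1 bit. The function name and parameters are hypothetical.

```python
def hyperminhash_fingerprint(x: int, hash_bits: int = 64, mantissa_bits: int = 10) -> int:
    """Pack a hash into 16 bits: 6-bit leading-zero count + 10-bit mantissa."""
    assert 0 <= x < (1 << hash_bits)
    lz = min(hash_bits - x.bit_length(), 63)   # leading zeros, saturated to fit 6 bits
    rest_len = max(hash_bits - lz - 1, 0)      # bits available after the leading 1
    rest = x & ((1 << rest_len) - 1)
    if rest_len >= mantissa_bits:
        mantissa = rest >> (rest_len - mantissa_bits)   # keep the top bits of the rest
    else:
        mantissa = rest << (mantissa_bits - rest_len)   # pad short remainders with zeros
    return (lz << mantissa_bits) | mantissa
```

The compression is lossy: every hash that shares the same leading-zero count and the same 10 following bits maps to the same 16-bit fingerprint.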

Main ideas
1. We can compress minimizers using the fact that they are selected among a large number of hashes
2. Minhash works with any order relation (minimum, maximum, ...)
3. We could select minimizers to be compressible

Example: optimize run length
We select the hash with the minimal number of bit flips
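A possible score function for this strategy counts the positions where two adjacent bits differ. The sketch below assumes 32-bit hashes; the names `bit_flips` and `pick_min_flips` are illustrative and not taken from the benchmark.

```python
def bit_flips(x: int, bits: int = 32) -> int:
    """Number of positions where two adjacent bits of x differ."""
    mask = (1 << (bits - 1)) - 1          # there are bits - 1 adjacent pairs
    return bin((x ^ (x >> 1)) & mask).count("1")


def pick_min_flips(hashes):
    """Select the hash with the fewest bit flips (first one wins on ties)."""
    return min(hashes, key=bit_flips)
```

A hash such as 0x0000FFFF has a single flip (two long runs), while 0xAAAAAAAA alternates at every position and has 31.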

Terminal time!

Example: optimize the number of 0s
We select the hash with the minimal number of 1s

Experiment
H-partition sketching of 1 billion 32-bit hashes
Gzip compression of the sketches using different strategies
1. minimizing value (vanilla minhash)
2. minimizing the number of 1s
3. minimizing the number of flips
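A scaled-down version of this experiment can be sketched as follows. It is not the benchmark from the talk: the hashes are pseudo-random 32-bit integers, the input is far smaller than 1 billion hashes, H is assumed to be a power of two, and the sketch is stored as raw little-endian 32-bit integers before gzip. All function names are hypothetical.

```python
import gzip
import random
import struct

BITS = 32


def identity(x):
    return x


def number_of_ones(x):
    return bin(x).count("1")


def number_of_flips(x):
    return bin((x ^ (x >> 1)) & 0x7FFFFFFF).count("1")


def partition_sketch(hashes, H, score):
    """In each of H partitions (leading bits), keep the hash with the lowest score."""
    p = H.bit_length() - 1                  # assumes H is a power of two
    best = [None] * H
    for x in hashes:
        b = x >> (BITS - p)
        if best[b] is None or score(x) < score(best[b]):
            best[b] = x
    return [x for x in best if x is not None]


def gzip_size(sketch):
    """Size in bytes of the gzip-compressed sketch, stored as raw 32-bit integers."""
    raw = struct.pack(f"<{len(sketch)}I", *sketch)
    return len(gzip.compress(raw))


rng = random.Random(0)
hashes = [rng.getrandbits(BITS) for _ in range(100_000)]
for name, score in [("IDENTITY", identity),
                    ("NUMBER 1", number_of_ones),
                    ("NUMBER FLIP", number_of_flips)]:
    s = partition_sketch(hashes, 256, score)
    print(name, "raw bytes:", 4 * len(s), "gzip bytes:", gzip_size(s))
```

Swapping in another `score` function is enough to try a different selection strategy. On a sketch this small the fixed gzip header weighs on the result, so the sizes printed here are not comparable to the figures reported in the talk.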

10,000,000 minimizers
A minimizer is chosen among 100 hashes (on average)

Strategy       Size (bytes)   Compression ratio
IDENTITY       37,460,166     1.068
NUMBER 1       35,242,423     1.135
NUMBER FLIP    35,557,655     1.125

The uncompressed sketch file is 40,000,000 bytes

1,000,000 minimizers
A minimizer is chosen among 1000 hashes (on average)

Strategy       Size (bytes)   Compression ratio
IDENTITY       3,422,813      1.169
NUMBER 1       3,198,364      1.251
NUMBER FLIP    3,242,664      1.234

The uncompressed sketch file is 4,000,000 bytes

100,000 minimizers
A minimizer is chosen among 10,000 hashes (on average)

Strategy       Size (bytes)   Compression ratio
IDENTITY       319,662        1.251
NUMBER 1       295,284        1.355
NUMBER FLIP    299,484        1.336

The uncompressed sketch file is 400,000 bytes

10,000 minimizers
A minimizer is chosen among 100,000 hashes (on average)

Strategy       Size (bytes)   Compression ratio
IDENTITY       28,762         1.391
NUMBER 1       26,796         1.493
NUMBER FLIP    27,206         1.470

The uncompressed sketch file is 40,000 bytes

1,000 minimizers
A minimizer is chosen among 1,000,000 hashes (on average)

Strategy       Size (bytes)   Compression ratio
IDENTITY       2,557          1.564
NUMBER 1       2,393          1.672
NUMBER FLIP    2,447          1.635

The uncompressed sketch file is 4,000 bytes

100 minimizers
A minimizer is chosen among 10,000,000 hashes (on average)

Strategy       Size (bytes)   Compression ratio
IDENTITY       280            1.429
NUMBER 1       245            1.633
NUMBER FLIP    256            1.563

The uncompressed sketch file is 400 bytes

Remark
1. Naive compression (gzip)
2. Naive selection
3. Lossless compression

1,000 minimizers
A minimizer is chosen among 1,000,000 hashes (on average)

Strategy                     Size (bytes)   Compression ratio
IDENTITY                     2,557          1.564
NUMBER 1                     2,393          1.672
NUMBER FLIP                  2,447          1.635
NUMBER FLIP+BITWISE RLE      2,184          1.832

100 minimizers
A minimizer is chosen among 10,000,000 hashes (on average)

Strategy                     Size (bytes)   Compression ratio
IDENTITY                     280            1.429
NUMBER 1                     245            1.633
NUMBER FLIP                  256            1.563
NUMBER FLIP+BITWISE RLE      207            1.932
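The slides do not detail the bitwise RLE encoding, so the following is only one plausible reading: each hash is described by its first bit followed by the lengths of its runs of identical bits. The function names and the 32-bit width are assumptions made for this sketch.

```python
def rle_encode(x: int, bits: int = 32):
    """Return (first_bit, run_lengths) for the bits of x, most significant bit first."""
    s = format(x, f"0{bits}b")
    runs = []
    count = 1
    for prev, cur in zip(s, s[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)   # a flip closes the current run
            count = 1
    runs.append(count)
    return int(s[0]), runs


def rle_decode(first_bit: int, runs):
    """Rebuild the integer from its first bit and run lengths."""
    bit = first_bit
    x = 0
    for r in runs:
        for _ in range(r):
            x = (x << 1) | bit
        bit ^= 1                 # runs alternate between 0s and 1s
    return x
```

The number of runs is the number of flips plus one, which is why a hash selected for few flips yields a short run-length description: 0x0000FFFF encodes as first bit 0 and runs [16, 16].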

Lossy compression examples
1. Encode the n first flip lengths
2. Encode the n first 1 positions
3. Encode the n longest flip lengths
4. Encode the number of 0s
How to take collisions into account?
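As an illustration of the second option, and of why collisions become a concern, the sketch below keeps only the positions of the first n set bits of a hash. The function name, the most-significant-bit-first convention, and the 32-bit default are assumptions, not details from the talk.

```python
def first_one_positions(x: int, n: int, bits: int = 32):
    """Positions (0 = most significant bit) of the first n set bits of x."""
    s = format(x, f"0{bits}b")
    return tuple(i for i, c in enumerate(s) if c == "1")[:n]
```

With 4-bit values and n = 2, both 0b1010 and 0b1011 are encoded as (0, 2). Two distinct hashes now share one fingerprint, which is the kind of collision the question on the slide refers to.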

Lossy compression pros and cons
1. Skip hard parts
2. Better control on the fields to compress
3. Harder analysis

Main questions
Hard question: what is a good measure to estimate the compressibility of a 4-byte integer?
Easy question: how to compress a sketch of such integers?

Main questions
Hard question: how to compress a sketch?
Easy question: how to optimize its compressibility by selecting minimizers?

Ideas/collaborations are welcome!
Very easy to test and benchmark
Benchmark available at github.com/Malfoy/Bcash
Write a score function and see how the sketch can be compressed!
