Fast Dictionary-based Compression for Inverted Indexes

  1. Fast Dictionary-based Compression for Inverted Indexes
     Giulio Ermanno Pibiri, The University of Pisa and ISTI-CNR, Pisa, Italy
     Matthias Petri, The University of Melbourne, Melbourne, Australia
     Alistair Moffat, The University of Melbourne, Melbourne, Australia
     12/02/2019

  2. Context: Inverted Indexes
     We focus on compression effectiveness and decoding speed for inverted indexes.
     The inverted index is the de facto data structure at the basis of every large-scale retrieval system.

  3. Context: Inverted Indexes (example)
     Vocabulary V = {always, boy, good, house, hungry, is, red, the}, with terms numbered t1 to t8 in this order.
     Posting lists over a collection of five documents:
     L_t1 = [1, 3] (always)
     L_t2 = [4, 5] (boy)
     L_t3 = [1] (good)
     L_t4 = [2, 3] (house)
     L_t5 = [3, 5] (hungry)
     L_t6 = [1, 2, 3, 4, 5] (is)
     L_t7 = [1, 2, 4] (red)
     L_t8 = [2, 3, 5] (the)
     [Figure: the five documents, numbered 1 to 5, shown next to the posting lists.]
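The example on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `build_inverted_index` is hypothetical, and the contents of the five documents are an assumption, reconstructed so that the resulting posting lists match the ones on the slide.

```python
from collections import defaultdict

# Toy collection. The word lists are an assumption: they are chosen so that
# the posting lists below agree with L_t1 ... L_t8 on the slide.
docs = {
    1: ["red", "is", "always", "good"],
    2: ["the", "house", "is", "red"],
    3: ["the", "house", "is", "always", "hungry"],
    4: ["boy", "is", "red"],
    5: ["the", "boy", "is", "hungry"],
}

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    # Sort the terms so they come out in vocabulary order (t1, t2, ...).
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

index = build_inverted_index(docs)
for term, posting_list in index.items():
    print(term, posting_list)
```

Each posting list is a strictly increasing sequence of document ids, which is the property the compression methods in the following slides exploit.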

  4. Many solutions
     Huge research corpora describing different space/time trade-offs.
     • Elias gamma/delta
     • Variable-Byte family
     • Binary Interpolative Coding
     • Simple family
     • PForDelta
     • Optimized PForDelta
     • Elias-Fano
     • Partitioned Elias-Fano
     • Clustered Elias-Fano
     • Asymmetric Numeral Systems

  5. Many solutions
     Space/time spectrum:
     • Space end: Interpolative, ~3X smaller
     • Time end: Variable-Byte + SIMD, ~4.5X faster

  6. Many solutions
     RQ: Can we inherit both advantages?

  7. A crucial fact
     Patterns of d-gaps are repetitive.

  8. A crucial fact
     Patterns of d-gaps are repetitive.
     [Figure labelled Gov2.]
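To make the notion of a repetitive d-gap pattern concrete, here is a small sketch. The posting list is made up for illustration (it is not taken from Gov2), and the helper names `to_dgaps` and `pattern_counts` are hypothetical.

```python
from collections import Counter

def to_dgaps(postings):
    """Replace each document id by its difference from the previous id."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def pattern_counts(gaps, length):
    """Count how often each run of `length` consecutive d-gaps occurs."""
    windows = (tuple(gaps[i:i + length]) for i in range(len(gaps) - length + 1))
    return Counter(windows)

# Made-up posting list (strictly increasing document ids).
postings = [3, 4, 5, 7, 8, 9, 11, 12, 13, 15, 16, 17]
gaps = to_dgaps(postings)
counts = pattern_counts(gaps, 3)

print(gaps)
print(counts.most_common(3))
```

In this toy list the gap sequence is dominated by a handful of length-3 patterns, each occurring several times. A dictionary-based coder can replace every such occurrence with one short reference.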

  9. DINT: Dictionary of INTegers
     • Encode a whole pattern with a single dictionary reference of b bits
     • Decode a whole pattern with a single dictionary access
     [Figure: fixed-to-fixed arrangement. A dictionary with dimensions labelled l + 1 and 2^b; patterns in the input stream … c1 c2 c3 … c11 … are each mapped to a b-bit reference.]

  12. DINT: Dictionary of INTegers
      • Encode a whole pattern with a single dictionary reference of b bits
      • Decode a whole pattern with a single dictionary access
      1/3 of the time is saved

  13. Refinements
      (1) Packed dictionary structure: exploiting string overlap
      (2) Optimal block parsing
      (3) Multiple dictionaries
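The slides do not say how optimal block parsing is computed. One standard way to obtain a parsing with the fewest codewords is a shortest-path style dynamic program, sketched below under two assumptions: every codeword has the same cost, and the set of dictionary patterns is given. The function name `optimal_parse` and the example dictionary are hypothetical.

```python
def optimal_parse(gaps, patterns, max_len):
    """Split `gaps` into the fewest dictionary patterns via dynamic programming."""
    n = len(gaps)
    inf = float("inf")
    best = [0] + [inf] * n   # best[i]: fewest codewords covering gaps[:i]
    back = [0] * (n + 1)     # back[i]: length of the last pattern in that parsing
    for i in range(1, n + 1):
        for length in range(1, min(max_len, i) + 1):
            candidate = tuple(gaps[i - length:i])
            if candidate in patterns and best[i - length] + 1 < best[i]:
                best[i] = best[i - length] + 1
                back[i] = length
    if best[n] == inf:
        raise ValueError("sequence cannot be parsed with this dictionary")
    # Walk the back pointers from the end to recover the chosen patterns.
    parse, i = [], n
    while i > 0:
        parse.append(tuple(gaps[i - back[i]:i]))
        i -= back[i]
    return parse[::-1]

# Hypothetical dictionary for which greedy longest-match is not optimal.
patterns = {(1,), (2,), (1, 1), (1, 2, 2)}
print(optimal_parse([1, 1, 2, 2], patterns, max_len=3))
```

On this input a greedy longest-match parser would pick (1, 1) first and then need (2,) twice, for three codewords, whereas the dynamic program finds the two-codeword parsing (1,), (1, 2, 2).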

  14. Experimental results: setting
      Datasets: [table]
      Machine: Intel Xeon 6144 processor, 512 GiB RAM, Linux 4.13.0
      Compiler: gcc 7.2.0 (with all optimizations)
      C++ code available at https://github.com/jermp/dint

  15. Experimental results: compression effectiveness

  16. Experimental results: compression effectiveness (l = 16, b = 16)

  17. Experimental results: effectiveness/efficiency plot

  20. Further reading
      Chapters 6 and 7 of my Ph.D. thesis (more datasets, comparisons, query timings):
      http://pages.di.unipi.it/pibiri/papers/phd_thesis.pdf

  21. Thanks for your attention, time, patience! Any questions?
