On Optimally Partitioning Variable-Byte Index Data

Giulio Ermanno Pibiri (University of Pisa and ISTI-CNR, Pisa, Italy, giulio.pibiri@di.unipi.it) and Rossano Venturini (University of Pisa and ISTI-CNR, Pisa, Italy, rossano.venturini@unipi.it)


  1. On Optimally Partitioning Variable-Byte Index Data. Giulio Ermanno Pibiri and Rossano Venturini, University of Pisa and ISTI-CNR, Pisa, Italy (giulio.pibiri@di.unipi.it, rossano.venturini@unipi.it). Melbourne, 17/05/2018.

  2. Context - Inverted Indexes: We focus on compression effectiveness and retrieval speed in inverted indexes. The inverted index is the de facto data structure at the basis of every large-scale retrieval system.

  7. Context - Inverted Indexes: example over five documents, numbered 1 to 5, with terms T = {always, boy, good, house, hungry, is, red, the}, labelled t_1 to t_8 in that order. The inverted list L_t of a term t holds the identifiers of the documents in which t occurs:
 L_t1 = [1, 3] (always)
 L_t2 = [4, 5] (boy)
 L_t3 = [1] (good)
 L_t4 = [2, 3] (house)
 L_t5 = [3, 5] (hungry)
 L_t6 = [1, 2, 3, 4, 5] (is)
 L_t7 = [1, 2, 4] (red)
 L_t8 = [2, 3, 5] (the)
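
As a rough illustration of how such lists come about, the sketch below builds an inverted index in Python. The `documents` dictionary and the function name `build_inverted_index` are hypothetical; the term set of each document is recovered by inverting the lists on the slide, so word order inside a document is not represented.

```python
from collections import defaultdict

# Term sets per document, obtained by inverting the lists shown on the slide.
# This is an illustrative reconstruction, not the documents' original text.
documents = {
    1: {"always", "good", "is", "red"},
    2: {"house", "is", "red", "the"},
    3: {"always", "house", "hungry", "is", "the"},
    4: {"boy", "is", "red"},
    5: {"boy", "hungry", "is", "the"},
}

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):          # visiting ids in order keeps lists sorted
        for term in docs[doc_id]:
            index[term].append(doc_id)
    return dict(index)

index = build_inverted_index(documents)
for term in sorted(index):
    print(term, index[term])
```

Keeping every list sorted by document identifier is what later makes gap-based compression and merge-style intersection possible.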

  8. Context - Inverted Indexes: Inverted indexes owe their popularity to the efficient resolution of queries, such as: “return all documents in which terms {t_1, …, t_k} occur”.

  15. Context - Inverted Indexes: example queries over the same five documents and eight inverted lists: q = {boy, is, the} and q = {good, hungry}.
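
The two example queries can be answered by intersecting the inverted lists of the query terms. The following is a minimal sketch, assuming sorted lists and a simple two-pointer merge; the names `intersect` and `conjunctive_query` are hypothetical, and real systems typically use skipping rather than a plain linear merge.

```python
# Inverted lists from the slide example.
index = {
    "always": [1, 3],
    "boy": [4, 5],
    "good": [1],
    "house": [2, 3],
    "hungry": [3, 5],
    "is": [1, 2, 3, 4, 5],
    "red": [1, 2, 4],
    "the": [2, 3, 5],
}

def intersect(a, b):
    """Intersect two sorted lists with a two-pointer merge."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def conjunctive_query(index, terms):
    """Return the ids of the documents that contain every term in `terms`."""
    lists = sorted((index[t] for t in terms), key=len)  # shortest list first
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result

print(conjunctive_query(index, ["boy", "is", "the"]))   # [5]
print(conjunctive_query(index, ["good", "hungry"]))     # []
```

Only document 5 contains all of boy, is and the, while no document contains both good and hungry.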

  17. Many solutions: A large body of research describes different space/time trade-offs.
 • Elias gamma/delta
 • Variable-Byte
 • Binary Interpolative Coding
 • Simple-9/16
 • PForDelta
 • Optimized PForDelta
 • Elias-Fano
 • Partitioned Elias-Fano
 • Clustered Elias-Fano
 • Asymmetric Numeral Systems
 Space/time spectrum: Binary Interpolative Coding sits at the space end (~3X smaller), the Variable-Byte (VByte) family at the time end (~4.5X faster).

  18. Our research question: Can we improve the space of a VByte-encoded sequence and preserve its query processing speed?

  20. Variable-Byte Encoding: Simple idea: encode each number using as few bytes as possible. Examples from the slide, each byte shown as a control bit followed by 7 data bits:
 6 → 1 0000110
 127 → 1 1111111
 128 → 1 0000001 | 0 0000000
 65790 → 1 0000100 | 1 0000001 | 0 1111110
 Decoding is fast: keep reading bytes until you hit a value smaller than 128. SIMD (Single Instruction, Multiple Data).
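
A minimal VByte sketch in Python may help make the decoding rule concrete. It assumes one common convention, which is not necessarily the layout drawn on the slide: 7-bit groups are written least significant first, and the control bit is set on every byte except the last, so that a byte value smaller than 128 ends a number. The function names are hypothetical.

```python
def vbyte_encode(x):
    """Encode a non-negative integer as a VByte byte string."""
    out = bytearray()
    while x >= 128:
        out.append(128 | (x & 127))  # control bit set: more bytes follow
        x >>= 7
    out.append(x)                    # last byte is < 128: control bit clear
    return bytes(out)

def vbyte_decode(data):
    """Decode a concatenation of VByte-encoded integers."""
    values, x, shift = [], 0, 0
    for byte in data:
        x |= (byte & 127) << shift   # append the 7 data bits
        if byte < 128:               # terminator byte reached
            values.append(x)
            x, shift = 0, 0
        else:
            shift += 7
    return values

numbers = [6, 127, 128, 65790]
stream = b"".join(vbyte_encode(n) for n in numbers)
print([len(vbyte_encode(n)) for n in numbers])  # bytes used per number
print(vbyte_decode(stream))
```

Under this convention 6 and 127 take one byte, 128 takes two and 65790 takes three, matching the byte counts in the slide examples. This is a scalar sketch; it does not attempt the SIMD decoding mentioned on the slide.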

  22. So… what’s “wrong” with VByte? The majority of values are small (very small indeed), yet VByte needs at least 8 bits per integer (bpi). This is considerably far from bit-level effectiveness:
 BIC (Binary Interpolative Coding): 3.8 bpi on Gov2
 PEF (Partitioned Elias-Fano): 4.1 bpi on Gov2

  24. High-level idea: 1. Partition each inverted list into variable-length partitions. 2. Encode dense partitions with their characteristic bitvector. 3. Encode sparse partitions with VByte.
 Example: the dense partition [13, 15, 16, 17, 20, 21, 23, 24] spans the values 13 to 24 and has characteristic bitvector 1 0 1 1 1 0 0 1 1 0 1 1, which takes 24 - 13 + 1 = 12 bits versus 64 bits with VByte (8 integers at 8 bits each), i.e. 5.33X smaller.
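
The bitvector example can be reproduced with a few lines of Python. This is only a sketch of the idea: the helper names are hypothetical, the bitvector is kept as a list of 0/1 values for readability, and the first value of the partition is assumed to be stored elsewhere.

```python
def to_bitvector(partition):
    """Characteristic bitvector over the range [first, last] of a sorted partition."""
    first, last = partition[0], partition[-1]
    members = set(partition)
    return [1 if v in members else 0 for v in range(first, last + 1)]

def from_bitvector(first, bits):
    """Recover the integers from the bitvector and the partition's first value."""
    return [first + i for i, bit in enumerate(bits) if bit]

partition = [13, 15, 16, 17, 20, 21, 23, 24]
bits = to_bitvector(partition)

bitvector_cost = len(bits)        # one bit per value in the range 13..24
vbyte_cost = 8 * len(partition)   # every value here fits in one VByte byte

print(bits)
print(bitvector_cost, vbyte_cost, round(vbyte_cost / bitvector_cost, 2))
```

The partition is dense because its 8 integers cover a range of only 12 values; for a sparse partition the range, and hence the bitvector, grows much larger than the VByte encoding.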

  27. Computing an optimal partition: The encoding has two levels over a list of n integers. The 1st level stores a fixed amount of bits, say F, for each partition. The 2nd level holds the n integers, split into partitions. Dynamic Programming (DP) guarantees an optimal partition.
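
To illustrate why dynamic programming yields an optimal partition, here is a simple quadratic-time sketch in Python. It is not presented as the authors' algorithm. The cost model is an assumption made for illustration: each partition pays F bits (F = 64 is an arbitrary choice) plus the cheaper of a bitvector over its range and VByte over its gaps, with the gap of the first integer of the list taken from 0.

```python
def vbyte_bits(x):
    """Bits VByte spends on x: 8 bits for every 7-bit group."""
    return 8 * max(1, (x.bit_length() + 6) // 7)

def optimal_partition(postings, F=64):
    """Return (total bits, partitions) minimising the assumed cost model."""
    n = len(postings)
    best = [0] + [float("inf")] * n  # best[j]: min bits for the first j integers
    back = [0] * (n + 1)             # back[j]: start index of the last partition
    for i in range(n):               # partition starts at position i
        prev = postings[i - 1] if i > 0 else 0
        vbyte = 0
        for j in range(i, n):        # partition ends at position j (inclusive)
            vbyte += vbyte_bits(postings[j] - prev)
            prev = postings[j]
            bitvector = postings[j] - postings[i] + 1
            cost = best[i] + F + min(vbyte, bitvector)
            if cost < best[j + 1]:
                best[j + 1] = cost
                back[j + 1] = i
    parts, j = [], n
    while j > 0:                     # walk the back-pointers to recover the split
        parts.append(postings[back[j]:j])
        j = back[j]
    parts.reverse()
    return best[n], parts

dense = list(range(1, 201))                        # 200 consecutive integers
sparse = [200 + 1000 * k for k in range(1, 11)]    # 10 integers, gaps of 1000
bits, parts = optimal_partition(dense + sparse)
print(bits, [len(p) for p in parts])
```

On this input the sketch separates the dense run, which is cheapest as a bitvector, from the sparse tail, which is cheapest with VByte. Every pair (i, j) is examined, so the running time grows quadratically with n.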
