

slide-1
SLIDE 1

An Experimental Study of Index Compression and DAAT Query Processing Methods

Antonio Mallia Michał Siedlaczek Torsten Suel Department of Computer Science and Engineering Tandon School of Engineering New York University April 16th, 2019

slide-2
SLIDE 2

Motivations

slide-3
SLIDE 3

Interacting Components

slide-7
SLIDE 7

Main Contributions

◮ Confirmed some established results
◮ New important insights
◮ Modern and generic code base

Source Code: https://github.com/pisa-engine/pisa

slide-10
SLIDE 10

Index Compression

◮ Variable Byte Methods:
  ◮ VarintGB [Dean 2009]
  ◮ Varint-G8IU [Stepanov et al. 2011]
  ◮ StreamVByte [Lemire et al. 2018]
◮ Word-Aligned Methods:
  ◮ Simple16 [Zhang et al. 2008]
  ◮ Simple8b [Anh and Moffat 2010]
  ◮ SIMD-BP128 [Lemire and Boytsov 2015]
  ◮ QMX [Trotman and Lin 2016]
◮ OptPForDelta [Yan et al. 2009]
◮ Partitioned Elias-Fano [Ottaviano and Venturini 2014]
◮ Binary Interpolative [Moffat and Stuiver 2000]
◮ Asymmetric Numeral Systems [Moffat and Petri 2018]
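
As a rough illustration of the variable-byte family, the sketch below encodes the gaps between sorted document IDs with classic byte-at-a-time VByte. It is not VarintGB, Varint-G8IU, or StreamVByte, which group control bits and rely on SIMD decoding, and it is not taken from the PISA code base. The function names (`to_gaps`, `vbyte_encode`, and so on) are hypothetical, and the byte layout (low 7 bits first, high bit set when more bytes follow) is one common convention assumed here.

```python
from itertools import accumulate
from typing import Iterable, List


def to_gaps(doc_ids: List[int]) -> List[int]:
    """Turn a strictly increasing ID list into its first ID plus successive gaps."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def from_gaps(gaps: Iterable[int]) -> List[int]:
    """Inverse of to_gaps: prefix sums restore the original IDs."""
    return list(accumulate(gaps))


def vbyte_encode(values: Iterable[int]) -> bytes:
    """Classic VByte (assumed layout): 7 payload bits per byte, low bits first."""
    out = bytearray()
    for value in values:
        if value < 0:
            raise ValueError("VByte encodes non-negative integers only")
        while value >= 0x80:
            out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
            value >>= 7
        out.append(value)  # high bit clear: last byte of this integer
    return bytes(out)


def vbyte_decode(data: bytes) -> List[int]:
    """Decode a byte string produced by vbyte_encode."""
    values = []
    current, shift = 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


postings = [3, 7, 130, 131, 20000]
encoded = vbyte_encode(to_gaps(postings))
decoded = from_gaps(vbyte_decode(encoded))
print(len(encoded), decoded)
```

Small gaps fit in a single byte, which is why gap encoding pairs naturally with byte-oriented codecs.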

slide-11
SLIDE 11

Query Processing Algorithms

Top-k disjunctive Document-at-a-Time query processing algorithms with safe early-termination.

◮ MaxScore [Turtle and Flood 1995]
◮ WAND [Broder et al. 2003]
◮ Block-Max MaxScore [Chakrabarti et al. 2011]
◮ Block-Max WAND [Ding and Suel 2011]
◮ Variable Block-Max WAND [Mallia et al. 2017]
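
To make "safe early termination" concrete, here is a minimal sketch of the MaxScore idea for top-k disjunctive DAAT processing. It is an illustration under simplifying assumptions, not the implementation evaluated in the talk: posting lists are in-memory Python lists, scores are precomputed positive integers, and the function name `maxscore_top_k` is hypothetical. Lists whose combined upper bounds cannot beat the current k-th best score are treated as non-essential and are only probed for documents found in the essential lists.

```python
import heapq
from bisect import bisect_left
from itertools import accumulate
from typing import List, Tuple

PostingList = Tuple[List[int], List[int]]  # (sorted doc IDs, positive scores)


def maxscore_top_k(lists: List[PostingList], k: int) -> List[Tuple[int, int]]:
    """Return up to k (score, doc_id) pairs with the highest summed scores."""
    # Order lists by increasing score upper bound; prefix[i] bounds lists 0..i.
    lists = sorted((l for l in lists if l[0]), key=lambda l: max(l[1]))
    if not lists or k <= 0:
        return []
    prefix = list(accumulate(max(scores) for _, scores in lists))
    pos = [0] * len(lists)  # one cursor per list
    heap: List[Tuple[int, int]] = []  # min-heap holding the current top k
    threshold = 0  # k-th best score so far (0 until the heap is full)
    first_essential = 0  # lists before this index are non-essential

    while first_essential < len(lists):
        # Next candidate: smallest unprocessed doc ID among essential lists.
        candidates = [lists[i][0][pos[i]]
                      for i in range(first_essential, len(lists))
                      if pos[i] < len(lists[i][0])]
        if not candidates:
            break
        doc = min(candidates)
        score = 0
        for i in range(first_essential, len(lists)):
            docs, scores = lists[i]
            if pos[i] < len(docs) and docs[pos[i]] == doc:
                score += scores[pos[i]]
                pos[i] += 1
        # Probe non-essential lists, stopping once the doc cannot qualify.
        for i in range(first_essential - 1, -1, -1):
            if score + prefix[i] <= threshold:
                break
            docs, scores = lists[i]
            pos[i] = bisect_left(docs, doc, pos[i])  # skip ahead to doc
            if pos[i] < len(docs) and docs[pos[i]] == doc:
                score += scores[pos[i]]
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
        if len(heap) == k:
            threshold = heap[0][0]
            # A higher threshold may turn more lists non-essential.
            while (first_essential < len(lists)
                   and prefix[first_essential] <= threshold):
                first_essential += 1
    return sorted(heap, key=lambda entry: (-entry[0], entry[1]))


query_lists = [
    ([1, 4, 7, 9], [2, 1, 3, 2]),
    ([2, 4, 9], [5, 4, 6]),
    ([4, 5, 9, 12], [1, 1, 1, 1]),
]
top2 = maxscore_top_k(query_lists, k=2)
print(top2)
```

The pruning is safe in the sense that the returned scores match those of exhaustive scoring. WAND and the Block-Max variants use the same threshold in different ways to skip documents.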

slide-12
SLIDE 12

Document Ordering

The impact that document ID assignment has on index compression and query efficiency.

◮ Random – baseline
◮ URL [Silvestri 2007]
◮ Recursive Graph Bisection (BP) [Dhulipala et al. 2016]
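
A toy calculation can show why ID assignment affects index size. The sketch below compares the gap-encoded size of one posting list when its documents receive scattered IDs and when they receive consecutive IDs. Everything in it is an assumption made for illustration: the collection size, the list length, the use of classic VByte as the cost model, and the function names. It does not reproduce the URL or BP orderings themselves, only the effect of clustering similar documents.

```python
import random
from typing import Iterable


def vbyte_len(value: int) -> int:
    """Bytes needed by classic VByte (7 payload bits per byte) for one integer."""
    return max(1, (value.bit_length() + 6) // 7)


def gap_encoded_bytes(doc_ids: Iterable[int]) -> int:
    """Total VByte size of a posting list stored as gaps between sorted IDs."""
    total = 0
    previous = 0
    for doc_id in sorted(doc_ids):
        total += vbyte_len(doc_id - previous)
        previous = doc_id
    return total


NUM_DOCS = 1_000_000  # hypothetical collection size
LIST_LEN = 1_000      # hypothetical posting list length

rng = random.Random(42)
scattered = rng.sample(range(NUM_DOCS), LIST_LEN)  # random-like assignment
clustered = range(5_000, 5_000 + LIST_LEN)         # similar docs get adjacent IDs

scattered_bytes = gap_encoded_bytes(scattered)
clustered_bytes = gap_encoded_bytes(clustered)
print(f"scattered: {scattered_bytes} bytes, clustered: {clustered_bytes} bytes")
```

With consecutive IDs almost every gap is 1 and costs a single byte, whereas scattered IDs produce gaps that mostly need two bytes.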

slide-13
SLIDE 13

Result Set Size

◮ Typically small k in past top-k search studies
◮ Can be significantly larger for candidate retrieval for cascade ranking
◮ Recently shown that large k slows down retrieval [Crane et al. 2017]
◮ Thus, we experiment with values of k between 10 and 10,000

slide-14
SLIDE 14

Experimental Setup

slide-15
SLIDE 15

Implementation

Source Code

◮ https://github.com/pisa-engine/pisa
◮ Fork of ds2i: https://github.com/ot/ds2i

Third Party Libraries

◮ https://github.com/lemire/FastPFor
◮ https://github.com/andrewtrotman/JASSv2
◮ https://github.com/mpetri/partitioned_ef_ans

slide-16
SLIDE 16

Testing Environment

◮ Implemented in C++17 and compiled with GCC 7.3 at the highest optimization level
◮ Intel Core i7-4770 quad-core 3.40GHz CPU
◮ Haswell microarchitecture supporting the AVX2 instruction set
◮ CPU's L1, L2, and L3 cache sizes are 32KB, 256KB, and 8MB, respectively
◮ 32GiB RAM

slide-17
SLIDE 17

Data Sets

             Documents    Terms        Postings
GOV2         24,622,347   35,636,425   5,742,630,292
Clueweb09B   50,131,015   92,094,694   15,857,983,641

◮ HTML content parsed with Apache Tika
◮ Words stemmed with Porter2
◮ Stopwords kept
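
For a sense of scale, the counts in the table can be turned into average postings per document and per term. The snippet below is plain arithmetic on the figures above. The dictionary layout and the name `averages` are illustrative choices, not part of the original material.

```python
from typing import Dict, Tuple

# Collection statistics copied from the table above.
COLLECTIONS: Dict[str, Dict[str, int]] = {
    "GOV2": {
        "documents": 24_622_347,
        "terms": 35_636_425,
        "postings": 5_742_630_292,
    },
    "Clueweb09B": {
        "documents": 50_131_015,
        "terms": 92_094_694,
        "postings": 15_857_983_641,
    },
}


def averages(stats: Dict[str, int]) -> Tuple[float, float]:
    """Return (postings per document, postings per term)."""
    return (stats["postings"] / stats["documents"],
            stats["postings"] / stats["terms"])


for name, stats in COLLECTIONS.items():
    per_doc, per_term = averages(stats)
    print(f"{name}: {per_doc:.1f} postings/document, {per_term:.1f} postings/term")
```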

slide-18
SLIDE 18

Queries

◮ TREC 2005 and TREC 2006 queries from the Terabyte Track Efficiency Task
◮ Queries with non-existent terms removed
◮ Initially sampled 1,000 queries for each query set and collection

[Figure: four panels, Gov2 TREC05, Gov2 TREC06, Clueweb09 TREC05, and Clueweb09 TREC06; axis labels not recoverable.]

◮ Further sampled 1,000 queries for each query length from 2 to 6+

slide-19
SLIDE 19

Results and Discussion

slide-20
SLIDE 20

Compression

Clueweb09-B

[Figure: index size [GiB] by encoding (Interpolative, Packed+ANS2, PEF, OptPFD, Simple16, Simple8b, QMX, SIMD-BP128, Varint-G8IU, VarintGB, StreamVByte) for the Random, URL, and BP orderings.]

slide-25
SLIDE 25

Query Speed

Clueweb09-B (URL ordering, k = 10)

[Figure: average query time [ms] by encoding (Interpolative, Packed+ANS2, PEF, OptPFD, Simple16, Simple8b, QMX, SIMD-BP128, Varint-G8IU, VarintGB, StreamVByte) for MaxScore, BMM, WAND, BMW, and VBMW.]

slide-26
SLIDE 26

Query Speed

Clueweb09-B (URL ordering, k = 10)

[Figure: average query time [ms] by encoding (PEF, OptPFD, Simple16, Simple8b, QMX, SIMD-BP128, Varint-G8IU, VarintGB, StreamVByte) for MaxScore, BMM, WAND, BMW, and VBMW.]

slide-30
SLIDE 30

Query Speed v. Index Size

Clueweb09-B (URL ordering, k = 10)

[Figure: two panels, VBMW and MaxScore, plotting average query time [ms] against index size [GiB] for PEF, OptPFD, Simple16, Simple8b, QMX, SIMD-BP128, Varint-G8IU, VarintGB, and StreamVByte.]

slide-31
SLIDE 31

Query Length

[Figure: query time [ms] against number of query terms (2, 3, 4, 5, 6+) on Gov2 and Clueweb09; legend: VBMW, MaxScore, OptPFD, SIMD-BP128, VarintG8IU, PEF.]

slide-32
SLIDE 32

Result Set Size

[Figure: query time [ms] against number of retrieved documents (10^1 to 10^4) on Gov2 and Clueweb09; legend: VBMW, MaxScore, OptPFD, SIMD-BP128, VarintG8IU, PEF.]

slide-33
SLIDE 33

Conclusions

◮ Clear trade-off between speed and size
◮ Interesting compression insights
  ◮ SIMD-BP128 matches speed of Varint methods while improving compression ratio
  ◮ PEF’s speed competitive when using VBMW
◮ Significant slowdown for large k
◮ MaxScore competitive with VBMW under certain circumstances
◮ Recursive Graph Bisection improves both compression and speed over URL ordering

slide-34
SLIDE 34

Thank you for your time. Any questions?

slide-35
SLIDE 35

References I

Anh, V. N. and Moffat, A. (2010). Index compression using 64-bit words. Software: Practice and Experience, 40(2):131–147.

Broder, A. Z., Carmel, D., Herscovici, M., Soffer, A., and Zien, J. (2003). Efficient query evaluation using a two-level retrieval process. In Proc. of the 12th Intl. Conf. on Information and Knowledge Management, pages 426–434.

Chakrabarti, K., Chaudhuri, S., and Ganti, V. (2011). Interval-based pruning for top-k processing over compressed lists. In Proc. of the 2011 IEEE 27th Intl. Conf. on Data Engineering, pages 709–720.

Crane, M., Culpepper, J. S., Lin, J., Mackenzie, J., and Trotman, A. (2017). A comparison of document-at-a-time and score-at-a-time query evaluation. In Proc. of the 10th ACM Intl. Conf. on Web Search and Data Mining, pages 201–210.

Dean, J. (2009). Challenges in building large-scale information retrieval systems: invited talk. In Proc. of the 2nd ACM Intl. Conf. on Web Search and Data Mining, pages 1–1.

Dhulipala, L., Kabiljo, I., Karrer, B., Ottaviano, G., Pupyrev, S., and Shalita, A. (2016). Compressing graphs and indexes with recursive graph bisection. In Proc. of the 22nd ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pages 1535–1544.

slide-36
SLIDE 36

References II

Ding, S. and Suel, T. (2011). Faster top-k document retrieval using block-max indexes. In Proc. of the 34th Ann. Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 993–1002.

Lemire, D. and Boytsov, L. (2015). Decoding billions of integers per second through vectorization. Software: Practice and Experience, 45(1):1–29.

Lemire, D., Kurz, N., and Rupp, C. (2018). Stream VByte: Faster byte-oriented integer compression. Information Processing Letters, 130:1–6.

Mallia, A., Ottaviano, G., Porciani, E., Tonellotto, N., and Venturini, R. (2017). Faster blockmax WAND with variable-sized blocks. In Proc. of the 40th Ann. Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 625–634.

Moffat, A. and Petri, M. (2018). Index compression using byte-aligned ANS coding and two-dimensional contexts. In Proc. of the 11th ACM Intl. Conf. on Web Search and Data Mining, pages 405–413.

Moffat, A. and Stuiver, L. (2000). Binary interpolative coding for effective index compression. Information Retrieval, 3(1):25–47.

slide-37
SLIDE 37

References III

Ottaviano, G. and Venturini, R. (2014). Partitioned Elias-Fano indexes. In Proc. of the 37th Ann. Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 273–282.

Silvestri, F. (2007). Sorting out the document identifier assignment problem. In Proc. of the 29th European Conf. on IR Research, pages 101–112.

Stepanov, A. A., Gangolli, A. R., Rose, D. E., Ernst, R. J., and Oberoi, P. S. (2011). SIMD-based decoding of posting lists. In Proc. of the 20th Intl. Conf. on Information and Knowledge Management, pages 317–326.

Trotman, A. and Lin, J. (2016). In vacuo and in situ evaluation of SIMD codecs. In Proc. of the 21st Australasian Document Computing Symposium, pages 1–8.

Turtle, H. and Flood, J. (1995). Query evaluation: Strategies and optimizations. Information Processing & Management, 31(6):831–850.

Yan, H., Ding, S., and Suel, T. (2009). Inverted index compression and query processing with optimized document ordering. In Proc. of the 18th Intl. Conf. on World Wide Web, pages 401–410.

slide-38
SLIDE 38

References IV

Zhang, J., Long, X., and Suel, T. (2008). Performance of compressed inverted list caching in search engines. In Proc. of the 17th Intl. Conf. on World Wide Web, pages 387–396.