SLIDE 1

Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study

Joel Mackenzie¹, Antonio Mallia², Mathias Petri³, J. Shane Culpepper¹, Torsten Suel²

¹ RMIT University, Melbourne, Australia
² New York University, New York, USA
³ The University of Melbourne, Melbourne, Australia

April, 2019

ECIR 2019 Reproducibility: Compressing Indexes April, 2019 1 / 37
SLIDE 2

Overview

SLIDE 3

Overview: Text Indexing

◮ Documents can be efficiently represented in an inverted index as a list of postings.

The red dog was found underneath a shady tree. The dog was sleeping. Shady trees are a great place for dogs to sleep. Red dogs like sleeping. Red dogs also like hunting.

[Figure: the resulting postings lists, as (docID, frequency) pairs: red → (1, 1) (3, 2); dog → (1, 2) (2, 1) (3, 2); found → (1, 1); …; hunt → (3, 1)]
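A minimal sketch of building such an index (illustrative only: whitespace tokenization and no stemming, so unlike the slide's example it does not conflate "dog" and "dogs"):

```python
from collections import defaultdict

docs = {
    1: "the red dog was found underneath a shady tree the dog was sleeping",
    2: "shady trees are a great place for dogs to sleep",
    3: "red dogs like sleeping red dogs also like hunting",
}

index = defaultdict(list)  # term -> [(docID, frequency), ...]
for doc_id in sorted(docs):
    counts = defaultdict(int)
    for term in docs[doc_id].split():
        counts[term] += 1
    for term, freq in counts.items():
        index[term].append((doc_id, freq))

print(index["red"])  # [(1, 1), (3, 2)]
```

Because docIDs are visited in increasing order, each postings list is monotonically increasing by construction.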

SLIDE 4

Overview: Postings Lists

◮ A postings list Lt for a term t contains a monotonically increasing list of document identifiers, represented as delta gaps, with a corresponding list of term frequencies (stored separately).

docIDs: 1 3 11 14 17 24 29
d-gaps: 1 2  8  3  3  7  5
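A minimal sketch of d-gap encoding and decoding, using the docIDs above (the first gap is the first docID itself; every later gap is the difference between consecutive docIDs):

```python
def to_dgaps(doc_ids):
    # first gap is the first docID itself, then pairwise differences
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_dgaps(gaps):
    # decode by keeping a running prefix sum
    doc_ids, running = [], 0
    for gap in gaps:
        running += gap
        doc_ids.append(running)
    return doc_ids

print(to_dgaps([1, 3, 11, 14, 17, 24, 29]))  # [1, 2, 8, 3, 3, 7, 5]
```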

SLIDE 5

Motivation

SLIDE 6

Motivation

◮ The space consumption of a postings list can be reduced if the size of the deltas (d-gaps) can be reduced.

◮ Compressors are more effective at compressing smaller integers.

◮ Reducing these d-gaps can be achieved by reordering the space of document identifiers.

◮ Given a collection of documents D with n = |D|, an arrangement of document identifiers can be defined as a bijection π : D → {1, 2, . . . , n}, where document di is mapped to identifier π(di).
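A minimal sketch of applying such a bijection to a postings list (the mapping values here are hypothetical, chosen only for illustration): each docID is remapped through π and the list is re-sorted.

```python
# pi: an explicit bijection from documents to {1, ..., n} (invented values)
pi = {"d1": 3, "d2": 1, "d3": 4, "d4": 2}

postings = ["d1", "d3", "d4"]
remapped = sorted(pi[d] for d in postings)  # remap, then restore sortedness
print(remapped)  # [2, 3, 4]
```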

SLIDE 7

A Basic Example

Initial arrangement:
  t1 docIDs: 2 3 11 14 17 24 29   d-gaps: 2 1 8 3 3 7 5
  t2 docIDs: 3 9 13 14 27         d-gaps: 3 6 4 1 13
  t3 docIDs: 4 8 21 22 28 29      d-gaps: 4 4 13 1 6 1

New arrangement:
  t1 docIDs: 3 5 8 10 12 16 19    d-gaps: 3 2 3 2 2 4 3
  t2 docIDs: 5 6 9 10 11          d-gaps: 5 1 3 1 1
  t3 docIDs: 1 2 11 14 18 19      d-gaps: 1 1 9 3 4 1

Initial arrangement ⟶ New arrangement

SLIDE 8

A Basic Example


Initial arrangement: larger gaps, less compressible ⟶ New arrangement: smaller gaps, better compression.
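The benefit of the new arrangement can be checked directly by summing log2 of each list's d-gaps (a minimal sketch using the numbers from this example; the LogGap cost is formalized later in the deck):

```python
import math

def log_gap_bits(doc_ids):
    # Sum of log2(d-gaps), taking the first docID as its own gap from zero.
    prev, total = 0, 0.0
    for d in doc_ids:
        total += math.log2(d - prev)
        prev = d
    return total

initial = [[2, 3, 11, 14, 17, 24, 29], [3, 9, 13, 14, 27], [4, 8, 21, 22, 28, 29]]
new = [[3, 5, 8, 10, 12, 16, 19], [5, 6, 9, 10, 11], [1, 2, 11, 14, 18, 19]]

cost_initial = sum(log_gap_bits(p) for p in initial)
cost_new = sum(log_gap_bits(p) for p in new)
print(cost_new < cost_initial)  # True: the new arrangement is cheaper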

SLIDE 9

Agenda: Reproducibility

◮ The current state-of-the-art in graph/index reordering was proposed in a KDD paper from 2016.¹

◮ As most of the authors were at Facebook, the primary focus of that work was compressing graphs.

◮ No implementation was made available. Can we reproduce, from scratch, the results of the original work?

¹ L. Dhulipala et al. Compressing Graphs and Indexes with Recursive Graph Bisection. In KDD, 2016.
SLIDE 10

Baselines

SLIDE 11

Random Ordering

◮ Randomly assign a unique identifier in {1, 2, . . . , n} to each document.

◮ Arrangements are poor due to lack of clustering, resulting in larger d-gaps.

◮ Used as a yardstick for comparison, not used in practice.
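A minimal sketch of this baseline (seeded for reproducibility; the document names are illustrative):

```python
import random

def random_arrangement(docs, seed=0):
    # Assign each document a unique identifier from {1, ..., n},
    # uniformly at random.
    ids = list(range(1, len(docs) + 1))
    random.Random(seed).shuffle(ids)
    return dict(zip(docs, ids))

pi = random_arrangement([f"d{i}" for i in range(1, 9)])
```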

SLIDE 12

Natural Orderings

◮ Assign identifiers in the order that is natural to the collection.

◮ Crawl ordering is generally the default ordering of a text collection, as the crawler will assign identifiers as new documents are indexed.

◮ Crawl-order effectiveness can depend on the method of crawling.

◮ URL ordering is usually very effective for document collections.

  ◮ Implicit localized clustering of similar documents.
SLIDE 13

URL Ordering

docID  URL                        π
1      abc.com/a                  1
2      xyz.com/                   5
3      xyz.com/index              6
4      zzz.com/wake_up            8
5      hello.edu/                 3
6      abc.com/b_and_c            2
7      xyz.com/products           7
8      hello.edu/programs/cs_101  4
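The π mapping on this slide can be reproduced with a simple lexicographic sort by URL (a sketch; `urls` transcribes the example above):

```python
urls = {
    1: "abc.com/a", 2: "xyz.com/", 3: "xyz.com/index", 4: "zzz.com/wake_up",
    5: "hello.edu/", 6: "abc.com/b_and_c", 7: "xyz.com/products",
    8: "hello.edu/programs/cs_101",
}

by_url = sorted(urls, key=urls.get)              # docIDs in URL order
pi = {doc: rank for rank, doc in enumerate(by_url, start=1)}
print(pi)  # {1: 1, 6: 2, 5: 3, 8: 4, 2: 5, 3: 6, 7: 7, 4: 8}
```

Pages from the same host (and the same path prefix) receive adjacent identifiers, which is the localized clustering mentioned on the previous slide.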

SLIDE 14

Minhash Ordering

◮ Minhash is an algorithm that approximates the Jaccard similarity of documents.

◮ This means similar documents are clustered together, resulting in smaller d-gaps and improved compression.

◮ This works under the same assumption as URL ordering.

◮ Minhash requires k different hash functions, h1(x), h2(x), . . . , hk(x).
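A minimal MinHash sketch (not the paper's implementation: simulating the k hash functions by salting a single cryptographic hash is an illustrative shortcut, and the token sets are invented). Sorting documents by their signatures places similar documents near each other:

```python
import hashlib

def minhash_signature(tokens, k=8):
    # Simulate k hash functions by salting one hash with the function index.
    return [min(int(hashlib.sha1(f"{i}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for i in range(k)]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of agreeing signature positions estimates Jaccard similarity.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = {"red", "dog", "sleep", "tree"}
b = {"red", "dog", "sleep", "hunt"}
c = {"stock", "market", "index"}

sim_ab = estimated_jaccard(minhash_signature(a), minhash_signature(b))
sim_ac = estimated_jaccard(minhash_signature(a), minhash_signature(c))
```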

SLIDE 15

Preliminaries

SLIDE 16

Preliminaries

◮ Previous approaches implicitly cluster similar documents together through some heuristic.

  ◮ Use the URL of a document as a proxy for its content.
  ◮ Approximate Jaccard distances of document content.

◮ Instead, why not directly optimize this goal?

SLIDE 17

Preliminaries: Graph theory framework

◮ Consider our document index as a graph G = (V, E) with m = |E|.

◮ V is the disjoint union of the terms, T, and the documents, D.

◮ Each edge e ∈ E corresponds to an arc (t, d): term t is contained in document d.

◮ Therefore, m is the number of postings in the collection.

[Figure: bipartite graph with terms T on one side and documents D on the other]

SLIDE 18

Preliminaries: BiMLogA

◮ Bipartite Minimum Logarithmic Arrangement (BiMLogA)¹

◮ NP-Hard.²

◮ Requires a bipartite graph, but can capture non-bipartite graphs via transformation.

Find an arrangement π : D → {1, 2, . . . , n} according to:

  argmin_π Σ_{t∈T} Σ_{i=0}^{d_t−1} log2(π(u_{i+1}) − π(u_i)),

where d_t is the degree of vertex t, t has neighbours {u_1, u_2, . . . , u_{d_t}} with π(u_1) < π(u_2) < · · · < π(u_{d_t}), and u_0 is a sentinel with π(u_0) = 0.

¹ F. Chierichetti et al. On compressing social networks. In KDD, 2009.
² L. Dhulipala et al. Compressing Graphs and Indexes with Recursive Graph Bisection. In KDD, 2016.
SLIDE 22

BiMLogA visualized

t1: 3 5 8 10 12 16 19
t2: 5 6 9 10 11
…
tz: 8 9 41 50 62 70

cost = log2(5 − 3) + log2(8 − 5) + … + log2(70 − 62)

SLIDE 23

Solutions to BiMLogA

◮ BiMLogA is directly optimizing the space required to store d-gaps.

◮ We call the cost of a solution to BiMLogA the LogGap cost.

◮ NP-Hard, so we must approximate: how to do so practically?

SLIDE 24

Recursive Graph Bisection

SLIDE 25

Recursive Graph Bisection (BP)

◮ We split our input graph into two subgraphs, D1 and D2.

◮ For each document d ∈ D, we compute the change in our LogGap cost if we moved d from D1 to D2 (or vice versa).

◮ We sort these gains from high to low, and then, while we continue to yield positive gains, we swap pairs of documents.

◮ This process happens a constant number of times, or can be terminated early if no swaps occur.

◮ Until we reach our maximum depth, we recursively run the same procedure on D1 and D2.
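The steps above can be sketched as follows (an illustrative Python sketch, not the authors' implementation: `term_cost` uses a simplified uniform-gap estimate, and gains are not recomputed within a swapping pass):

```python
import math
from collections import Counter

def term_cost(part_size, degree):
    # Estimated bits for one term's d-gaps inside a partition, assuming
    # uniformly spread identifiers: degree gaps of average size n/degree.
    return degree * math.log2(part_size / degree) if degree else 0.0

def move_gain(doc, from_deg, to_deg, n_from, n_to):
    # Estimated change in cost if `doc` (a set of term IDs) switches side.
    gain = 0.0
    for t in doc:
        before = term_cost(n_from, from_deg[t]) + term_cost(n_to, to_deg[t])
        after = term_cost(n_from, from_deg[t] - 1) + term_cost(n_to, to_deg[t] + 1)
        gain += before - after
    return gain

def bisect(docs, depth, max_iters=20):
    if depth == 0 or len(docs) < 2:
        return docs
    half = len(docs) // 2
    d1, d2 = docs[:half], docs[half:]
    for _ in range(max_iters):
        deg1 = Counter(t for d in d1 for t in d)
        deg2 = Counter(t for d in d2 for t in d)
        # Order both halves by descending move gain, then swap pairs while
        # the combined gain stays positive.
        g1 = sorted(range(len(d1)),
                    key=lambda i: -move_gain(d1[i], deg1, deg2, len(d1), len(d2)))
        g2 = sorted(range(len(d2)),
                    key=lambda j: -move_gain(d2[j], deg2, deg1, len(d2), len(d1)))
        swapped = False
        for i, j in zip(g1, g2):
            if (move_gain(d1[i], deg1, deg2, len(d1), len(d2)) +
                    move_gain(d2[j], deg2, deg1, len(d2), len(d1))) > 0:
                d1[i], d2[j] = d2[j], d1[i]
                swapped = True
        if not swapped:
            break
    return bisect(d1, depth - 1, max_iters) + bisect(d2, depth - 1, max_iters)
```

The final arrangement is simply the concatenation of the leaf subgraphs in left-to-right order.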

SLIDE 26

Recursive Graph Bisection: Local Optimization

[Figure: animation of the swap-based local optimization. Per-document move gains are computed in both halves, and pairs of documents with positive combined gain are swapped across the partition boundary.]
SLIDE 35

Recursive Graph Bisection: Sketch

[Figure: the bisection is applied recursively to each half until the maximum depth is reached; concatenating the leaves left to right yields the final arrangement 1, 2, 3, 4, . . . , n]
SLIDE 44

Efficient Implementation

◮ Swaps are just pointer swaps, and references are used where possible to avoid any move or copy operations.

◮ A global thread pool is allocated once, and jobs are allocated according to the Intel Thread Building Blocks scheduling policy.

  ◮ Each recursive call is independent.
  ◮ Computing term degrees and sort operations.

◮ SIMD intrinsics are used when computing term costs, allowing four values to be computed per CPU cycle.

◮ Branch prediction information is used to avoid pipeline stalls due to branch misprediction.

◮ Precompute and store values of log2(x) for all x ≤ 4,096.
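A sketch of the lookup-table trick (Python for illustration only; gaps are overwhelmingly small, so most calls hit the table):

```python
import math

# Precomputed once; index 0 is an unused placeholder so LOG2_TABLE[x] = log2(x).
LOG2_TABLE = [0.0] + [math.log2(x) for x in range(1, 4097)]

def fast_log2(x):
    # Table lookup for small gaps (the common case); fall back otherwise.
    return LOG2_TABLE[x] if x <= 4096 else math.log2(x)
```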

SLIDE 45

Experiments

SLIDE 46

Experimental Differences

◮ Collections: parsing, stemming, stopping…

  ◮ Collections are somewhat, but not completely, comparable.

◮ Hardware: Facebook's vs ours.

  ◮ CPU: Intel Xeon E5-2660 vs Intel Xeon Gold 6144.
  ◮ Speed: 2.20GHz vs 3.50GHz.
  ◮ Cores: 32 vs 32.
  ◮ Cache (L2): 2MiB vs 8MiB.
  ◮ Cache (L3): 20MiB vs 24.75MiB.
  ◮ RAM: 128GiB vs 512GiB.

◮ Emphasis on Reproducibility, not Repeatability.

  ◮ Different group, different codebase, different collections.
SLIDE 47

Collections: Full

Graph        |D|         |T|          |E|
NYT          1,855,658   2,970,013    501,568,918
Wikipedia    5,652,893   5,604,981    837,439,129
Gov2         25,205,179  39,180,840   5,880,709,591
ClueWeb09-B  50,220,423  90,471,982   16,253,057,031
ClueWeb12-B  52,343,021  165,309,501  15,319,871,265
CC-News      43,530,315  43,844,574   20,150,335,440

SLIDE 48

Collections: terms in ≥ 4,096 documents

Graph        |D|         |T|      |E|
NYT          1,855,658   10,191   457,883,999
Wikipedia    5,652,893   14,038   749,069,767
Gov2         25,205,179  42,842   5,406,607,172
ClueWeb09-B  50,220,423  101,676  15,237,650,447
ClueWeb12-B  52,343,021  88,741   14,130,264,013
CC-News      43,530,315  76,488   19,691,656,440

SLIDE 49

Compression Effectiveness

Index      Algorithm  LogGap  PEF          BIC
NYT        Random     3.79    6.36 / 2.22  6.48 / 2.16
           Natural    3.50    6.31 / 2.20  6.23 / 2.13
           Minhash    3.18    5.91 / 2.19  5.79 / 2.11
           BP         2.61    5.24 / 2.13  5.06 / 2.04
Wikipedia  Random     5.12    8.03 / 2.20  8.01 / 1.98
           Natural    4.76    7.83 / 2.17  7.65 / 1.93
           Minhash    3.94    7.08 / 2.11  6.71 / 1.85
           BP         3.13    6.17 / 2.03  5.74 / 1.77
Gov2       Random     5.05    7.96 / 2.97  7.93 / 2.53
           Natural    1.91    4.37 / 2.31  4.01 / 2.07
           Minhash    1.99    4.57 / 2.34  4.17 / 2.10
           BP         1.54    3.67 / 2.20  3.41 / 2.01

BIC → A. Moffat and L. Stuiver: Binary Interpolative Coding for Effective Index Compression. Inf. Retr. 3(1), 2000.
PEF → G. Ottaviano and R. Venturini: Partitioned Elias-Fano Indexes. In SIGIR, 2014.
SLIDE 50

Compression Effectiveness

Index        Algorithm  LogGap  PEF          BIC
ClueWeb09-B  Random     4.88    7.69 / 2.39  7.68 / 2.08
             Natural    2.71    6.12 / 2.20  5.36 / 1.84
             Minhash    3.00    6.46 / 2.23  5.77 / 1.87
             BP         2.38    5.49 / 2.12  4.84 / 1.79
ClueWeb12-B  Random     5.08    7.99 / 2.39  7.95 / 2.09
             Natural    2.51    6.07 / 2.20  5.11 / 1.81
             Minhash    2.89    6.08 / 2.17  5.49 / 1.86
             BP         2.32    5.20 / 2.07  4.64 / 1.77
CC-News      Random     3.56    6.06 / 2.19  6.16 / 2.06
             Natural    1.49    3.38 / 1.91  3.26 / 1.73
             Minhash    1.95    4.49 / 2.02  4.12 / 1.82
             BP         1.39    3.31 / 1.90  3.11 / 1.72
SLIDE 51

LogGap vs True Cost

[Figure: LogGap vs measured bits per element for bic and pef]

SLIDE 52

Time to generate a BP arrangement

Collection   NYT  Wikipedia  Gov2  ClueWeb09-B  ClueWeb12-B  CC-News
Time [min]   2    5          28    90           86           97

◮ Time taken to process each dataset with recursive graph bisection, in minutes.

◮ Assumes the input is a VarintGB compressed forward index.

◮ Uses up to 32 threads, processing entirely in-memory.

◮ Comparison: Facebook processes Gov2 in 29 minutes, and ClueWeb09-B in 129 minutes.

SLIDE 53

Sensitivity to input order

[Figure: LogGap of the BP arrangement for each collection (NYT, Wikipedia, Gov2, ClueWeb09, ClueWeb12, CC-News) when starting from Minhash, Natural, and Random input orders]

SLIDE 54

Sensitivity to recursion depth

[Figure: LogGap as a function of MaxDepth (5 to 25) for each collection, starting from Minhash, Natural, and Random input orders]

SLIDE 55

Implications

◮ Well-ordered indexes improve space occupancy.

  ◮ Independent of the compression scheme (well, the most commonly used ones).
  ◮ Lossless and free (apart from computing the ordering).

◮ Side effect: well-ordered indexes improve query-time efficiency.¹ ² ³ ⁴

  ◮ Higher throughput.
  ◮ Reduced running costs.

¹ S. Ding and T. Suel: Faster Top-k Document Retrieval using Block-Max Indexes. In SIGIR, 2011.
² D. Hawking and T. Jones: Reordering an Index to Speed Query Processing without Loss of Effectiveness. In ADCS, 2012.
³ A. Kane and F. Wm. Tompa: Split-Lists and Initial Thresholds for WAND-Based Search. In SIGIR, 2018.
⁴ A. Mallia, M. Siedlaczek, and T. Suel: An Experimental Study of Index Compression and DAAT Query Processing Methods. In ECIR, 2019.
SLIDE 56

Challenges and Summary

◮ No codebase: we had to design everything from the ground up.

◮ It took many attempts and rounds of analysis to make things efficient.

◮ The pseudocode in the original paper does not shed much light on implementation details.

◮ We were successful in reproducing the original work and extending the analysis to new text collections.

SLIDE 57

Questions and Acknowledgements

◮ We thank the authors of the original paper for helpful discussions regarding the nuances of their algorithm.

◮ Codebase:

  ◮ https://github.com/pisa-engine/pisa
  ◮ https://github.com/pisa-engine/ecir19-bisection

◮ Funding:

  ◮ National Science Foundation (IIS-1718680)
  ◮ Australian Research Council (DP170102231)
  ◮ Australian Government (RTP Scholarship)
SLIDE 58

Recursive Graph Bisection: Computing Gains

◮ Assume that identifiers are uniformly distributed in the arrangement.

◮ The cost is then related to the average gap between consecutive entries in t's adjacency list, which can be easily computed.

◮ For each document, compute and store the total cost of moving the document from D1 to D2 or vice versa.

◮ While we continue to yield positive gains, swap pairs of candidate documents between the two partitions.

[Figure: a term t with postings in both partitions D1 and D2; the average gaps gap1(t) and gap2(t) within each partition determine its estimated cost]
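Under the uniform-distribution assumption, a term with degree d in a partition of n documents has average gap n/d, giving an estimated cost of d · log2(n/d) bits. A worked check of a move gain (the degrees and partition sizes are invented for illustration):

```python
import math

def est_cost(n, d):
    # A term with degree d in a partition of n docs: d gaps of average size n/d.
    return d * math.log2(n / d) if d else 0.0

# Term t appears once in D1 and 10 times in D2 (both partitions of size 16).
before = est_cost(16, 1) + est_cost(16, 10)
after = est_cost(16, 0) + est_cost(16, 11)   # after moving t's one doc to D2
gain = before - after
print(gain > 0)  # True: consolidating t's postings in D2 reduces the estimate
```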
SLIDE 59

Parameters

◮ Recursion depth = log(n) − 5.

◮ Maximum iterations per recursion = 20.

◮ Default parameters are based on the paper we are reproducing.

◮ We investigate these parameters further in the following experiments.

SLIDE 60

Complexity Analysis

◮ We recurse ⌈log n⌉ times.

◮ Each recursion involves computing move gains in O(m) time.

◮ Each recursion also involves sorting n elements in O(n log n) time.

◮ Summing the subproblems together, the algorithm produces a vertex order in O(m log n + n log² n) time.

◮ Recall that n = |D| and m = |E|.

SLIDE 61

Sensitivity to iterations and input order

[Figure: LogGap as a function of running time (minutes) for varying iteration counts on NYT, Gov2, and ClueWeb09, starting from Minhash, Natural, and Random input orders]