
Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study



  1. Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study
     Joel Mackenzie (1), Antonio Mallia (2), Matthias Petri (3), J. Shane Culpepper (1), Torsten Suel (2)
     (1) RMIT University, Melbourne, Australia
     (2) New York University, New York, USA
     (3) The University of Melbourne, Melbourne, Australia
     April, 2019
     ECIR 2019 Reproducibility: Compressing Indexes April, 2019 1 / 37

  2. Overview

  3. Overview: Text Indexing
     ◮ Documents can be efficiently represented in an inverted index as a list of postings.
     Example documents and the terms extracted from each:
       Doc 1: "The red dog was found underneath a shady tree. The dog was sleeping."  →  red dog found under shady tree dog sleep
       Doc 2: "Shady trees are a great place for dogs to sleep."  →  shady tree great place dog sleep
       Doc 3: "Red dogs like sleeping. Red dogs also like hunting."  →  red dog like sleep red dog like hunt
     Resulting postings lists, as (docID, term frequency) pairs:
       red   → (1, 1) (3, 2)
       dog   → (1, 2) (2, 1) (3, 2)
       found → (1, 1)
       …
       hunt  → (3, 1)
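The mapping from documents to postings lists on this slide can be sketched in a few lines. This is a minimal illustration rather than the authors' indexing pipeline: the function name `build_inverted_index` is hypothetical, and the input is assumed to be already tokenized, stemmed, and stripped of stopwords, as in the slide's term lists.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to a docID-sorted list of (docID, term frequency) pairs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):              # visit docIDs in increasing order
        for term, freq in Counter(docs[doc_id]).items():
            index[term].append((doc_id, freq))
    return dict(index)

# The three example documents from the slide, already reduced to terms.
docs = {
    1: "red dog found under shady tree dog sleep".split(),
    2: "shady tree great place dog sleep".split(),
    3: "red dog like sleep red dog like hunt".split(),
}
index = build_inverted_index(docs)
print(index["dog"])   # [(1, 2), (2, 1), (3, 2)]
```

Because documents are visited in increasing docID order, every postings list comes out sorted, which is the property the d-gap representation relies on.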

  4. Overview: Postings Lists
     ◮ A postings list L_t for a term t contains a monotonically increasing list of document identifiers, represented as delta gaps (d-gaps), with a corresponding list of term frequencies (stored separately).
       docIDs:  1  3  11  14  17  24  29
       d-gaps:  1  2   8   3   3   7   5
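The conversion between docIDs and d-gaps shown above is a running difference and its inverse, a prefix sum. A minimal sketch follows; the helper names `to_gaps` and `from_gaps` are hypothetical, and the first gap is taken relative to 0, which matches the slide's example.

```python
from itertools import accumulate

def to_gaps(doc_ids):
    """Replace each docID by its distance from the previous one (first from 0)."""
    return [cur - prev for prev, cur in zip([0] + doc_ids[:-1], doc_ids)]

def from_gaps(gaps):
    """Recover the original docIDs with a prefix sum over the gaps."""
    return list(accumulate(gaps))

doc_ids = [1, 3, 11, 14, 17, 24, 29]
gaps = to_gaps(doc_ids)
print(gaps)              # [1, 2, 8, 3, 3, 7, 5]
print(from_gaps(gaps))   # [1, 3, 11, 14, 17, 24, 29]
```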

  5. Motivation

  6. Motivation
     ◮ The space consumption of a postings list can be reduced if the size of the deltas (d-gaps) can be reduced.
     ◮ Compressors are more effective at compressing smaller integers.
     ◮ Reducing these d-gaps can be achieved by reordering the space of document identifiers.
     ◮ Given a collection of documents D with n = |D|, an arrangement of document identifiers can be defined as a bijection π : D → {1, 2, ..., n}, where document d_i is mapped to identifier π(d_i).

  7. A Basic Example
     docIDs   Initial arrangement        New arrangement
       t1     2  3 11 14 17 24 29        3  5  8 10 12 16 19
       t2     3  9 13 14 27              5  6  9 10 11
       t3     4  8 21 22 28 29           1  2 11 14 18 19
     d-gaps   Initial arrangement        New arrangement
       t1     2  1  8  3  3  7  5        3  2  3  2  2  4  3
       t2     3  6  4  1 13              5  1  3  1  1
       t3     4  4 13  1  6  1           1  1  9  3  4  1
     Initial arrangement: larger gaps, less compressible. New arrangement: smaller gaps, better compression.
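To put a rough number on "smaller gaps, better compression", the sketch below counts the minimal binary width of every d-gap under both arrangements from the slide. Using `int.bit_length` as the size measure is an assumption made for illustration; real codecs add their own overheads, and the helper names are hypothetical.

```python
def d_gaps(doc_ids):
    """Differences between consecutive docIDs, with the first taken from 0."""
    prev, gaps = 0, []
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def minimal_bits(postings):
    """Total bits if each gap were stored in its minimal binary width."""
    return sum(g.bit_length() for ids in postings.values() for g in d_gaps(ids))

initial = {
    "t1": [2, 3, 11, 14, 17, 24, 29],
    "t2": [3, 9, 13, 14, 27],
    "t3": [4, 8, 21, 22, 28, 29],
}
new = {
    "t1": [3, 5, 8, 10, 12, 16, 19],
    "t2": [5, 6, 9, 10, 11],
    "t3": [1, 2, 11, 14, 18, 19],
}
print(minimal_bits(initial), minimal_bits(new))   # 45 35
```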

  9. Agenda: Reproducibility
     ◮ The current state of the art in graph/index reordering was proposed in a KDD paper from 2016. [1]
     ◮ Given that most of its authors are from Facebook, the primary focus of that work was compressing graphs.
     ◮ No implementation was made available.
     Can we reproduce, from scratch, the results found in their original work?
     [1] L. Dhulipala et al. Compressing Graphs and Indexes with Recursive Graph Bisection. In KDD, 2016.

  10. Baselines

  11. Random Ordering
      ◮ Randomly assign a unique identifier in {1, 2, ..., n} to each document.
      ◮ Arrangements are poor because similar documents are not clustered, which leads to larger d-gaps.
      ◮ Used as a yardstick for comparison; not used in practice.
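A random arrangement is simply a shuffled assignment of the identifiers 1..n. The sketch below is one way to produce it; the function name `random_ordering` and the fixed seed are assumptions made so the example is repeatable.

```python
import random

def random_ordering(doc_names, seed=0):
    """Assign each document a unique identifier in 1..n, in shuffled order."""
    rng = random.Random(seed)                 # seeded only for repeatability
    ids = list(range(1, len(doc_names) + 1))
    rng.shuffle(ids)
    return dict(zip(doc_names, ids))

pi = random_ordering(["a.html", "b.html", "c.html", "d.html"])
print(sorted(pi.values()))   # [1, 2, 3, 4]
```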

  12. Natural Orderings
      ◮ Assign identifiers in the order that is natural to the collection.
      ◮ Crawl ordering is generally the default ordering of a text collection, as the crawler will assign identifiers as new documents are indexed.
      ◮ Crawl order effectiveness can depend on the method of crawling.
      ◮ URL ordering is usually very effective for document collections.
      ◮ Implicit localized clustering of similar documents.

  13. URL Ordering
      The arrangement π maps the initial docIDs (left) to docIDs assigned in sorted URL order (right):
      docID  URL (initial)                 docID  URL (after π)
        1    abc.com/a                       1    abc.com/a
        2    xyz.com/                        2    abc.com/b_and_c
        3    xyz.com/index                   3    hello.edu/
        4    zzz.com/wake_up                 4    hello.edu/programs/cs_101
        5    hello.edu/                      5    xyz.com/
        6    abc.com/b_and_c                 6    xyz.com/index
        7    xyz.com/products                7    xyz.com/products
        8    hello.edu/programs/cs_101       8    zzz.com/wake_up
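The reassignment in this table amounts to sorting the URLs and numbering them in that order. The sketch below uses a plain lexicographic sort, which reproduces the slide's example; the function name `url_ordering` is hypothetical, and any URL normalization a production system might apply first is left out.

```python
def url_ordering(urls_by_doc_id):
    """Return a map old docID -> new docID, numbering documents in sorted URL order."""
    ranked = sorted(urls_by_doc_id, key=lambda doc_id: urls_by_doc_id[doc_id])
    return {old_id: new_id for new_id, old_id in enumerate(ranked, start=1)}

urls = {
    1: "abc.com/a",
    2: "xyz.com/",
    3: "xyz.com/index",
    4: "zzz.com/wake_up",
    5: "hello.edu/",
    6: "abc.com/b_and_c",
    7: "xyz.com/products",
    8: "hello.edu/programs/cs_101",
}
pi = url_ordering(urls)
print(pi)   # {1: 1, 6: 2, 5: 3, 8: 4, 2: 5, 3: 6, 7: 7, 4: 8}
```

Pages from the same host end up with consecutive identifiers, which is the implicit clustering the previous slide refers to.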

  14. Minhash Ordering
      ◮ Minhash is an algorithm that approximates the Jaccard similarity of documents.
      ◮ Ordering documents by their minhash values places similar documents close together, resulting in smaller d-gaps and improved compression.
      ◮ This works under the same assumption as URL ordering.
      ◮ Minhash requires k different hash functions, h_1(x), h_2(x), ..., h_k(x).
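A minimal sketch of the idea follows. It makes several assumptions that are not in the slides: the k hash functions are simulated by salting BLAKE2b with the function index, documents are sorted lexicographically by their signature tuples, and the names `minhash_signature`, `estimated_jaccard`, and `minhash_ordering` are hypothetical.

```python
import hashlib

def _hash(i, term):
    """The i-th hash function h_i, simulated by salting one hash with i."""
    digest = hashlib.blake2b(f"{i}:{term}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(terms, k=4):
    """For each of the k hash functions, keep the minimum hash over the terms."""
    return tuple(min(_hash(i, t) for t in terms) for i in range(k))

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def minhash_ordering(docs, k=4):
    """Number documents 1..n in lexicographic order of their signatures."""
    ranked = sorted(docs, key=lambda name: minhash_signature(docs[name], k))
    return {name: new_id for new_id, name in enumerate(ranked, start=1)}

docs = {
    "d1": {"red", "dog", "sleep"},
    "d2": {"shady", "tree", "place"},
    "d3": {"red", "dog", "sleep"},
}
pi = minhash_ordering(docs)
print(pi)
```

Documents with identical term sets share a signature and therefore receive adjacent identifiers; near-duplicates tend to agree on the leading signature positions and so tend to land close together.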

  15. Preliminaries

  16. Preliminaries
      ◮ Previous approaches look at implicitly clustering similar documents together through some heuristic.
      ◮ Use the URL of a document as a proxy for its content.
      ◮ Approximate Jaccard distances of document content.
      ◮ Instead, why not directly optimize this goal?

  17. Preliminaries: Graph theory framework
      ◮ Consider our document index as a graph G = (V, E) with m = |E|.
      ◮ V is the disjoint union of the set of terms, T, and the set of documents, D.
      ◮ Each edge e ∈ E corresponds to an arc (t, d): term t is contained in document d.
      ◮ Therefore, m is the number of postings in the collection.
      [Figure: bipartite graph with terms T on one side and documents D on the other.]
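The bipartite view can be made concrete with a toy collection. In the sketch below, the function name `build_bipartite_graph` and the three small documents are illustrative assumptions; each distinct (term, document) pair becomes one edge, so the edge count equals the number of postings.

```python
def build_bipartite_graph(docs):
    """Return (T, D, E): term vertices, document vertices, and (t, d) edges."""
    edges = [(t, d) for d, terms in docs.items() for t in sorted(set(terms))]
    term_vertices = {t for t, _ in edges}
    doc_vertices = set(docs)
    return term_vertices, doc_vertices, edges

docs = {
    1: {"red", "dog", "found", "under", "shady", "tree", "sleep"},
    2: {"shady", "tree", "great", "place", "dog", "sleep"},
    3: {"red", "dog", "like", "sleep", "hunt"},
}
T, D, E = build_bipartite_graph(docs)
print(len(T), len(D), len(E))   # 11 3 18
```

Here m = |E| = 18, one edge per posting, and no edge joins two terms or two documents.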

  18. Preliminaries: BiMLogA
      ◮ Bipartite Minimum Logarithmic Arrangement (BiMLogA) [1]
      ◮ NP-Hard. [2]
      ◮ Requires a bipartite graph, but can capture non-bipartite graphs via transformation.
      Find an arrangement π : D → {1, 2, ..., n} according to:
        \[ \operatorname*{argmin}_{\pi} \sum_{t \in T} \sum_{i=0}^{d_t - 1} \log_2\bigl(\pi(u_{i+1}) - \pi(u_i)\bigr), \]
      where d_t is the degree of vertex t, t has neighbours {u_1, u_2, ..., u_{d_t}} with π(u_1) < π(u_2) < ··· < π(u_{d_t}), and π(u_0) = 0.
      [1] F. Chierichetti et al. On compressing social networks. In KDD, 2009.
      [2] L. Dhulipala et al. Compressing Graphs and Indexes with Recursive Graph Bisection. In KDD, 2016.
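The objective can be evaluated directly for any fixed arrangement. The sketch below assumes the postings already hold the identifiers π(d) and, following the definition on this slide, measures the first gap of each list from 0; the function name `log_gap_cost` is hypothetical.

```python
import math

def log_gap_cost(postings):
    """Sum of log2(gap) over every postings list, taking pi(u_0) = 0."""
    total = 0.0
    for doc_ids in postings.values():
        prev = 0
        for doc_id in sorted(doc_ids):
            total += math.log2(doc_id - prev)
            prev = doc_id
    return total

# Gaps are 1, 2, 8, so the cost is log2(1) + log2(2) + log2(8) = 0 + 1 + 3.
print(log_gap_cost({"t": [1, 3, 11]}))   # 4.0
```

Solving BiMLogA means finding the arrangement that minimizes this value, which the code does not attempt; it only scores a given arrangement.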

  19. BiMLogA visualized
      The cost is accumulated one gap at a time across every postings list:
      cost = log2(5 - 3) + log2(8 - 5) + … + log2(70 - 62)
        t1:  3  5  8 10 12 16 19 24 34 67 90
        t2:  5  6  9 10 11 19 33 35 77 81
        ...
        tz:  8  9 41 50 62 70
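As a worked instance of this accumulation, restricted to t1 alone and, like the slide, starting from the gap between the first two postings, the consecutive gaps are 2, 3, 2, 2, 4, 3, 5, 10, 33 and 23. The figure of about 23.38 is computed here for illustration and does not appear in the slides.

```latex
\begin{align*}
\mathrm{cost}(t_1) &= \log_2 2 + \log_2 3 + \log_2 2 + \log_2 2 + \log_2 4
                     + \log_2 3 + \log_2 5 + \log_2 10 + \log_2 33 + \log_2 23 \\
                   &= \log_2(2 \cdot 3 \cdot 2 \cdot 2 \cdot 4 \cdot 3 \cdot 5 \cdot 10 \cdot 33 \cdot 23) \\
                   &= \log_2(10\,929\,600) \approx 23.38
\end{align*}
```

The two largest gaps, 33 and 23, contribute roughly 9.6 of the 23.38 total, which is why an arrangement that removes large gaps lowers the cost so effectively.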

  23. Solutions to BiMLogA
      ◮ BiMLogA directly optimizes the space required to store d-gaps.
      ◮ We call the cost of a solution to BiMLogA the LogGap cost.
      ◮ BiMLogA is NP-Hard, so we must approximate. How can we do so practically?

  24. Recursive Graph Bisection

  25. Recursive Graph Bisection (BP)
      ◮ We split the documents of our input graph into two subsets, D_1 and D_2.
      ◮ For each document d ∈ D, we compute the change in our LogGap cost if we moved d from D_1 to D_2 (or vice versa).
      ◮ We sort these gains from high to low, and then, while we continue to yield positive gains, we swap pairs of documents.
      ◮ This process is repeated a constant number of times, or can be terminated early if no swaps occur.
      ◮ Until we reach our maximum depth, we recursively run the same procedure on D_1 and D_2.
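The steps above can be sketched as follows. This is a simplified illustration, not the implementation evaluated in the study. In particular, the slides do not state how the change in LogGap cost is estimated for a candidate move; the per-term estimate deg · log2(n / (deg + 1)) used in `term_cost` is an assumption, and every function and parameter name here is hypothetical.

```python
import math

def term_cost(deg_a, deg_b, n_a, n_b):
    """Assumed cost estimate for one term with deg_a docs in one half, deg_b in the other."""
    cost = 0.0
    if deg_a:
        cost += deg_a * math.log2(n_a / (deg_a + 1))
    if deg_b:
        cost += deg_b * math.log2(n_b / (deg_b + 1))
    return cost

def degrees(part, docs):
    """Count, per term, how many documents of this half contain it."""
    deg = {}
    for d in part:
        for t in docs[d]:
            deg[t] = deg.get(t, 0) + 1
    return deg

def move_gain(terms, deg_from, deg_to, n_from, n_to):
    """Estimated cost reduction if a document with these terms switches halves."""
    gain = 0.0
    for t in terms:
        a, b = deg_from[t], deg_to.get(t, 0)
        gain += term_cost(a, b, n_from, n_to) - term_cost(a - 1, b + 1, n_from, n_to)
    return gain

def recursive_bisection(ids, docs, depth=0, max_depth=10, max_iters=20):
    """Return the ids reordered so that documents sharing terms sit close together."""
    if len(ids) < 2 or depth >= max_depth:
        return list(ids)
    half = len(ids) // 2
    d1, d2 = list(ids[:half]), list(ids[half:])
    for _ in range(max_iters):                      # constant number of rounds
        deg1, deg2 = degrees(d1, docs), degrees(d2, docs)
        g1 = sorted(((move_gain(docs[d], deg1, deg2, len(d1), len(d2)), d) for d in d1), reverse=True)
        g2 = sorted(((move_gain(docs[d], deg2, deg1, len(d2), len(d1)), d) for d in d2), reverse=True)
        swapped = 0
        for (gain_a, a), (gain_b, b) in zip(g1, g2):
            if gain_a + gain_b <= 0:                # stop once a swap no longer helps
                break
            d1[d1.index(a)], d2[d2.index(b)] = b, a
            swapped += 1
        if swapped == 0:                            # early termination
            break
    return (recursive_bisection(d1, docs, depth + 1, max_depth, max_iters)
            + recursive_bisection(d2, docs, depth + 1, max_depth, max_iters))

# Two groups of documents, with one member of each group on the wrong side.
docs = {0: {"a"}, 1: {"a"}, 2: {"a"}, 3: {"x"}, 4: {"x"}, 5: {"x"}}
order = recursive_bisection([0, 1, 3, 2, 4, 5], docs)
new_ids = {doc: i for i, doc in enumerate(order, start=1)}
print(order)
```

In this toy run the first round swaps documents 3 and 2, after which each half holds a single group and every remaining gain is negative. Gains within a round are computed before any swap is applied, so on very small or highly symmetric inputs the halves can trade documents back and forth until the iteration limit is reached; the sketch makes no attempt to prevent that.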

  26. Recursive Graph Bisection: Local Optimization
      [Figure: animated example of the local optimization step. A move gain is computed for each document in the two halves, one at a time; the gains shown are 3, -2, 0, -4, 0, -2, 0, 2.]
