SLIDE 1

LSH-Based Probabilistic Pruning of Inverted Indices for Sets and Ranked Lists

Koninika Pal and Sebastian Michel
pal@cs.uni-kl.de, smichel@cs.uni-kl.de
TU Kaiserslautern, Germany

K. Pal - WebDB 2017

SLIDE 2

Introduction

  • Top-k Rankings, Preference lists

SLIDE 3
  • Top-k rankings, preference lists
  • Some applications:
    – finding similar queries by their results,
    – mining relations between entities,
    – recommender systems, e.g., business promotion.
  • Similarity search over ranked lists or sets of preferences

SLIDE 4

Inverted Index

  • An inverted index handles set similarity efficiently.
  • Filter: look up the inverted index for each element of the query and collect candidates.
  • Validate: compute the distance between each candidate and the query.

τ1 = [2, 5, 4, 3, 1], τ2 = [1, 4, 7, 5, 2], τ3 = [0, 8, 7, 5, 6]

Simple index (excerpt):
1 → ⟨τ1⟩, ⟨τ2⟩
2 → ⟨τ1⟩, ⟨τ2⟩
3 → ⟨τ1⟩
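The filter-and-validate scheme above can be sketched as follows. This is a minimal illustration, not the paper's implementation: Jaccard distance stands in for the dissimilarity measure (the talk also covers Kendall's tau), and all names are ours.

```python
from collections import defaultdict

def build_index(collection):
    """Simple inverted index: element -> ids of the sets containing it."""
    index = defaultdict(list)
    for sid, s in collection.items():
        for e in s:
            index[e].append(sid)
    return index

def jaccard_distance(a, b):
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def search(collection, index, query, theta):
    # Filter: probe the index with every query element, collect candidates.
    candidates = set()
    for e in query:
        candidates.update(index.get(e, []))
    # Validate: keep only candidates within the distance threshold.
    return {sid for sid in candidates
            if jaccard_distance(collection[sid], query) <= theta}
```

For the three example sets above, querying with τ1 and θ = 0.5 keeps τ1 (distance 0) and τ2 (distance 1/3) but rejects τ3 (distance 8/9).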

SLIDE 5

Motivation

  • Using multiple elements as key gives more precision, but the number of keys increases exponentially:
    – larger index structure,
    – more look-ups at query time (query size 10 → simple index: 10 keys to access; pairwise index: 45 keys to access).
  • Higher similarity means more overlapping elements.

Simple index (excerpt):
1 → ⟨τ1⟩, ⟨τ2⟩
2 → ⟨τ1⟩, ⟨τ2⟩
3 → ⟨τ1⟩

Pairwise index (excerpt):
(1, 2) → ⟨τ1⟩, ⟨τ2⟩
(1, 3) → ⟨τ1⟩
(2, 3) → ⟨τ1⟩
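A sketch of the pairwise index and of the key growth it implies; `comb(k, t)` counts the t-element keys a size-k query must probe (45 for k = 10, t = 2, as on the slide). Helper names are ours.

```python
from collections import defaultdict
from itertools import combinations
from math import comb

def build_pairwise_index(collection):
    """Pairwise index: sorted element pair -> ids of sets containing both."""
    index = defaultdict(list)
    for sid, s in collection.items():
        for pair in combinations(sorted(s), 2):
            index[pair].append(sid)
    return index

def num_query_keys(k, t):
    """Number of keys a size-k query probes when keys are t-element subsets."""
    return comb(k, t)
```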

SLIDE 6

Using multiple elements as key gives more precision, but increases the size of the index structure and the look-ups at query time; more similarity means more overlapping elements.

Simple index (excerpt):
6 → ⟨τ3⟩
7 → ⟨τ2⟩, ⟨τ3⟩
5 → ⟨τ1⟩, ⟨τ2⟩, ⟨τ3⟩

Pairwise index (excerpt):
(7, 5) → ⟨τ2⟩, ⟨τ3⟩
(5, 6) → ⟨τ3⟩

How do we prune the index? How do we measure the effect of pruning in similarity search?

SLIDE 7


Key idea: connect the index structure with locality-sensitive hashing (LSH).

SLIDE 8

Problem Description


  • Collection T of sets τi of size k, e.g., τi = [2, 5, 4, 3]
  • Input at query time:
    – a query q of size k
    – a distance threshold θ
  • Set similarity: compute R = {τi | τi ∈ T and d(τi, q) ≤ θ}
    – d(τi, q): dissimilarity measure between τi and q
    – R: result set when using the complete index structure

SLIDE 9
  • Index pruning factor φ
  • A query on the pruned index returns the result set Rp, with Rp ⊆ R
  • Additional input to similarity search:
    – recall threshold ρ
  • Objective: maximize φ subject to |Rp| / |R| ≥ ρ

SLIDE 10

Content

  • Motivation & Problem
  • Pruning of Inverted Index
  • Query Processing on Pruned Index
  • Experimental Results
  • Conclusions

SLIDE 11

Pruning of Index Structure

Three randomized pruning strategies, each governed by a pruning factor φ:

  – Horizontal: randomly select a φ fraction of the keys and delete their complete entries.
  – Vertical: randomly delete a φ fraction of the elements in each posting list.
  – Diagonal: randomly delete a φ fraction of the elements from each set, then build the index.
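The three randomized strategies above can be sketched as follows; this is our own minimal rendering (helper names and the explicit `rng` parameter are ours), with the horizontal/vertical/diagonal naming taken from the talk.

```python
import random
from collections import defaultdict

def build_index(collection):
    """Simple inverted index: element -> ids of the sets containing it."""
    index = defaultdict(list)
    for sid, s in collection.items():
        for e in s:
            index[e].append(sid)
    return dict(index)

def horizontal_prune(index, phi, rng):
    """Delete a phi fraction of keys together with their whole posting lists."""
    drop = set(rng.sample(sorted(index), int(phi * len(index))))
    return {k: v for k, v in index.items() if k not in drop}

def vertical_prune(index, phi, rng):
    """Delete a phi fraction of the entries in each posting list."""
    return {k: rng.sample(v, len(v) - int(phi * len(v)))
            for k, v in index.items()}

def diagonal_prune(collection, phi, rng):
    """Delete a phi fraction of elements from each set, then rebuild the index."""
    reduced = {sid: rng.sample(sorted(s), len(s) - int(phi * len(s)))
               for sid, s in collection.items()}
    return build_index(reduced)
```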

SLIDE 12

Pruning of Index Structure

  • Similarity with document search:
    – Horizontal pruning: stop-word removal.
    – Vertical pruning: term-based index pruning (scoring models: tf-idf, KL-divergence, etc.).
    – Diagonal pruning: document-centric pruning.
  • Contrast with common document retrieval:
    – Queries and documents have the same size.
    – No score-based document search method is used.


SLIDE 14

Content

  • Motivation & Problem
  • Pruning of Inverted Index
  • Query Processing on Pruned Index
  • Experimental Results
  • Conclusions

SLIDE 15

Connecting Index with LSH Family

  • Example LSH family: project sets onto single elements, hx(τi) = x if x ∈ τi
    – One-to-one mapping between hash buckets and index keys, e.g., for τ2 = [1, 4, 7, 5, 2] and τ3 = [0, 8, 7, 5, 6]:
      h7: 7 → ⟨τ2⟩, ⟨τ3⟩ (since h7(τ2) = 7 and h7(τ3) = 7)
    – Multiple hash functions can be used conjunctively in LSH:
      h7, h5: (7, 5) → ⟨τ2⟩, ⟨τ3⟩
  • Hash tables (LSH index): hash_key1 → objects mapping to key1; hash_key2 → objects mapping to key2; …
  • Inverted index: key1 → posting list; key2 → posting list; …
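The family hx and its conjunctive use can be written down directly. A sketch under our own naming: the bucket of (h7, h5) coincides with the posting list of key (7, 5).

```python
def make_h(x):
    """LSH family member h_x: hashes set tau to x iff x is a member, else None."""
    def h(tau):
        return x if x in tau else None
    return h

def conjunctive_bucket(xs, collection):
    """Objects colliding under every h_x for x in xs --
    exactly the posting list of the conjunctive key tuple(xs)."""
    hs = [make_h(x) for x in xs]
    return {sid for sid, tau in collection.items()
            if all(h(tau) is not None for h in hs)}
```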

SLIDE 16

Properties of LSH

  • Why LSH?
    – Similar objects have a higher probability of colliding in the same bucket.
    – The number of index entries (l) to look up can be tuned to reach recall ρ.
  • What do we need?
    – The collision probability P1 of the hash function.
    – The number l of hash functions used at query time, from

      ρ = 1 − (1 − P1^m)^l
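Solving ρ = 1 − (1 − P1^m)^l for the smallest integer l gives the tuning rule directly. A sketch; the parameter values used below are only illustrative, not from the paper.

```python
import math

def recall(p1, m, l):
    """Probability that at least one of l conjunctive look-ups collides:
    1 - (1 - p1**m)**l."""
    return 1.0 - (1.0 - p1 ** m) ** l

def required_lookups(rho, p1, m):
    """Smallest l with recall(p1, m, l) >= rho."""
    return math.ceil(math.log(1.0 - rho) / math.log(1.0 - p1 ** m))
```

For example, with P1 = 0.3 and m = 2, reaching ρ = 0.99 needs l = 49 look-ups.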

SLIDE 17

Query Processing on Pruned Index

  • Pruning the index → dropping objects from the LSH index.
  • A collision can be missed at query processing time because:
    – the object and the query are not similar, or
    – the object was dropped due to pruning.
  • Therefore more entries must be accessed than the l required by the plain LSH method.
  • How many extra index look-ups are required?

SLIDE 18

Ad-hoc Query Processing

  • Continue index look-ups until l successful accesses.
  • Maximum look-ups → look up all keys from the query.
  • Modifying factor f in the collision probability ρ = 1 − (1 − P1^m)^l.
  • Expected look-ups: E[l] = (1/f) · l.
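If each probe into the pruned index succeeds with probability f, the number of probes until l successes is negative-binomially distributed with mean (1/f)·l, which is the E[l] above. A small Monte-Carlo check (our own sketch, illustrative parameters):

```python
import random

def expected_lookups(l, f):
    """E[l] = (1/f) * l: mean probes until l successes at per-probe rate f."""
    return l / f

def simulate(l, f, trials=2000, seed=42):
    """Monte-Carlo estimate of the same expectation."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        successes = probes = 0
        while successes < l:
            probes += 1
            if rng.random() < f:
                successes += 1
        total += probes
    return total / trials
```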

SLIDE 19

Probabilistic Query Processing

  • Find the modified collision probability:

    ρ = 1 − (1 − fY · P1^m)^(lY)

  • Find the required modified number of accesses lY.
  • Modifying factor fY = function(φ):
    – for horizontal pruning, a pruning factor φ removes a φ fraction of the keys, so fh = 1 − φ.
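The modified number of accesses lY follows by solving the adjusted recall formula, exactly as in the unpruned case. A sketch with illustrative parameter values; the helper names are ours.

```python
import math

def modified_lookups(rho, p1, m, f_y):
    """Smallest l_Y with 1 - (1 - f_y * p1**m)**l_Y >= rho."""
    return math.ceil(math.log(1.0 - rho) / math.log(1.0 - f_y * p1 ** m))

def horizontal_factor(phi):
    """Horizontal pruning removes a phi fraction of keys: f_h = 1 - phi."""
    return 1.0 - phi
```

With fY = 1 (no pruning) this reduces to the plain LSH look-up count; pruning (fY < 1) strictly increases the required accesses.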

SLIDE 20

Optimizing the Pruning Factor

  • Maximum look-ups → look up all keys from the query, so the number of accesses lY is bounded by C(k, t).
  • lY = function(φ), determined by ρ = 1 − (1 − fY · P1^m)^(lY).
  • Optimal pruning factor: φ* = argmax_φ φ subject to C(k, t) − lY ≥ 0.
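The optimization above — the largest φ whose required lY still fits within the C(k, t) keys a size-k query offers — can be realized as a simple grid search. This is our own sketch of that reading, with purely illustrative parameter values.

```python
import math
from math import comb

def lookups_needed(rho, p1, m, f_y):
    """Smallest l_Y with 1 - (1 - f_y * p1**m)**l_Y >= rho."""
    return math.ceil(math.log(1.0 - rho) / math.log(1.0 - f_y * p1 ** m))

def optimal_phi(rho, p1, m, k, t, factor, grid=100):
    """Largest phi (on a grid) whose required l_Y fits within the
    comb(k, t) keys a size-k query offers when keys are t-element subsets."""
    budget = comb(k, t)
    best = 0.0
    for i in range(1, grid):
        phi = i / grid
        f_y = factor(phi)
        if f_y <= 0.0:
            continue  # everything pruned away; recall unreachable
        if lookups_needed(rho, p1, m, f_y) <= budget:
            best = max(best, phi)
    return best
```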

SLIDE 21

Case Studies

  • Case 1: Jaccard distance over sets
    – Use the pairwise index.
    – Relate the LSH index to the pairwise index: P1 = 2θ / (1 + θ), m = 2, with ρ = 1 − (1 − P1^m)^l.
  • Case 2: Kendall's tau distance over rankings [1].

[1] Koninika Pal and Sebastian Michel. Efficient Similarity Search across Top-k Lists under the Kendall's Tau Distance. In SSDBM 2016.

SLIDE 22

Content

  • Motivation & Problem
  • Pruning of Inverted Index
  • Query Processing on Pruned Index
  • Experimental Results
  • Conclusions

SLIDE 23

Experimental Setup

  • Datasets:
    – LiveJ: 100,000 profiles from LiveJournal, truncated to set size = 20.
    – Yago: 25,000 top-20 rankings, Wikipedia-based.
  • 5 consecutive experimental runs over 1,000 queries.
  • Recall threshold ρ = 99%.
  • Baseline approach: the plain LSH method on the non-pruned index structures.
  • Full scan and prefix filtering [2] on the simple index retrieve more than 5 times as many candidates as the baseline approach.

[2] Jiannan Wang, Guoliang Li, and Jianhua Feng. Can we beat the prefix filtering?: An adaptive framework for similarity join and search. In SIGMOD 2012.

SLIDE 24
Theoretically Established Parameters for LiveJ

| θ   | l (no pruning) | φ* (h) | l_h | E[l_h] | φ* (v) | l_v | E[l_v] | φ* (d) | l_d | E[l_d] |
|-----|----------------|--------|-----|--------|--------|-----|--------|--------|-----|--------|
| 0.1 | 2              | 0.8    | 125 | 10     | 0.8    | 125 | 10     | 0.5    | 95  | 8.6    |
| 0.3 | 4              | 0.8    | 167 | 20     | 0.8    | 167 | 20     | 0.5    | 126 | 17.4   |
| 0.5 | 8              | 0.7    | 112 | 26.6   | 0.7    | 112 | 26.6   | 0.4    | 87  | 23.5   |

φ*: optimal pruning factor; Y ∈ {h, v, d}; l_Y: number of scans for probabilistic query processing; E[l_Y]: expected number of scans for successful scans.

SLIDE 25

Experimental Results for Probabilistic Query Processing on LiveJ

| Pruning method | θ   | Time (ms) | #candidates | recall (%) | #successful scans | l_Y | baseline #candidates |
|----------------|-----|-----------|-------------|------------|-------------------|-----|----------------------|
| Horizontal     | 0.1 | 11.17     | 10031.3     | 100        | 24.6              | 125 | 5105.3               |
| Horizontal     | 0.3 | 11.54     | 13257.0     | 100        | 33.9              | 167 | 7360.4               |
| Horizontal     | 0.5 | 13.39     | 14452.2     | 100        | 33.6              | 112 | 9059.5               |
| Vertical       | 0.1 | 14.0      | 11252.9     | 100        | 125               | 125 | 5105.3               |
| Vertical       | 0.3 | 9.8       | 12208.7     | 100        | 167               | 167 | 7360.4               |
| Vertical       | 0.5 | 11.0      | 14001.9     | 100        | 112               | 112 | 9059.5               |
| Diagonal       | 0.1 | 10.38     | 10378.3     | 99.5       | 79.69             | 95  | 5105.3               |
| Diagonal       | 0.3 | 11.06     | 11512.7     | 100        | 104.58            | 126 | 7360.4               |
| Diagonal       | 0.5 | 11.32     | 13003.1     | 99.7       | 76.84             | 87  | 9059.5               |
SLIDE 26

Experimental Results for Ad-hoc Query Processing on LiveJ

| Pruning method | θ   | Time (ms) | #candidates | recall (%) | #total scans | E[l_Y] | baseline #candidates |
|----------------|-----|-----------|-------------|------------|--------------|--------|----------------------|
| Horizontal     | 0.1 | 3.4       | 3806.7      | 100        | 9.45         | 10     | 5105.3               |
| Horizontal     | 0.3 | 4.4       | 5163.3      | 100        | 19.39        | 20     | 7360.4               |
| Horizontal     | 0.5 | 7.4       | 7822.5      | 99.9       | 26.29        | 26.6   | 9059.5               |
| Vertical       | 0.1 | 1.03      | 1142.2      | 51.3       | 2            | 10     | 5105.3               |
| Vertical       | 0.3 | 1.67      | 1926.9      | 64.3       | 4            | 20     | 7360.4               |
| Vertical       | 0.5 | 4.08      | 3998.9      | 93.6       | 8            | 26.6   | 9059.5               |
| Diagonal       | 0.1 | 1.24      | 1309.1      | 37.3       | 2.63         | 8.6    | 5105.3               |
| Diagonal       | 0.3 | 1.99      | 2098.4      | 47.3       | 4.94         | 17.4   | 7360.4               |
| Diagonal       | 0.5 | 3.52      | 3938.9      | 61.0       | 9.61         | 23.5   | 9059.5               |

SLIDE 27

Content

  • Motivation & Problem
  • Pruning of Inverted Index
  • Query Processing on Pruned Index
  • Experimental Results
  • Conclusions

SLIDE 28

Conclusions

  • Savings of up to 80% index pruning for small distance thresholds, θ ≤ 0.3.
  • Probabilistic query processing ensures the recall requirement.
  • Ad-hoc query processing outperforms under horizontal pruning.
  • Future directions:
    – combining the proposed approach with compression techniques,
    – analysis of non-randomized, application-driven pruning.

SLIDE 29

Thank you


SLIDE 30
Extra slides for more experimental data

Theoretically Established Parameters for Yago

| θ   | l (no pruning) | φ* (h) | l_h | E[l_h] | φ* (v) | l_v | E[l_v] | φ* (d) | l_d | E[l_d] |
|-----|----------------|--------|-----|--------|--------|-----|--------|--------|-----|--------|
| 0.1 | 2              | 0.8    | 125 | 10     | 0.8    | 125 | 10     | 0.5    | 95  | 8.6    |
| 0.3 | 4              | 0.8    | 167 | 20     | 0.8    | 167 | 20     | 0.5    | 126 | 17.4   |
| 0.5 | 8              | 0.7    | 112 | 26.6   | 0.7    | 112 | 26.6   | 0.4    | 87  | 23.5   |

φ*: optimal pruning factor; Y ∈ {h, v, d}; l_Y: number of scans for probabilistic query processing; E[l_Y]: expected number of scans for successful scans.

SLIDE 31

Experimental Results for Probabilistic Query Processing on Yago

| Pruning method | θ   | Time (ms) | #candidates | recall (%) | #successful scans | l_Y | baseline #candidates |
|----------------|-----|-----------|-------------|------------|-------------------|-----|----------------------|
| Horizontal     | 0.1 | 11.17     | 10031.3     | 100        | 24.6              | 125 | 5105.3               |
| Horizontal     | 0.3 | 11.54     | 13257.0     | 100        | 33.9              | 167 | 7360.4               |
| Horizontal     | 0.5 | 13.39     | 14452.2     | 100        | 33.6              | 112 | 9059.5               |
| Vertical       | 0.1 | 14.0      | 11252.9     | 100        | 125               | 125 | 5105.3               |
| Vertical       | 0.3 | 9.8       | 12208.7     | 100        | 167               | 167 | 7360.4               |
| Vertical       | 0.5 | 11.0      | 14001.9     | 100        | 112               | 112 | 9059.5               |
| Diagonal       | 0.1 | 10.38     | 10378.3     | 99.5       | 79.69             | 95  | 5105.3               |
| Diagonal       | 0.3 | 11.06     | 11512.7     | 100        | 104.58            | 126 | 7360.4               |
| Diagonal       | 0.5 | 11.32     | 13003.1     | 99.7       | 76.84             | 87  | 9059.5               |

SLIDE 32

Experimental Results for Ad-hoc Query Processing on Yago

| Pruning method | θ   | Time (ms) | #candidates | recall (%) | #total scans | E[l_Y] | baseline #candidates |
|----------------|-----|-----------|-------------|------------|--------------|--------|----------------------|
| Horizontal     | 0.1 | 3.4       | 3806.7      | 100        | 9.45         | 10     | 5105.3               |
| Horizontal     | 0.3 | 4.4       | 5163.3      | 100        | 19.39        | 20     | 7360.4               |
| Horizontal     | 0.5 | 7.4       | 7822.5      | 99.9       | 26.29        | 26.6   | 9059.5               |
| Vertical       | 0.1 | 1.03      | 1142.2      | 51.3       | 2            | 10     | 5105.3               |
| Vertical       | 0.3 | 1.67      | 1926.9      | 64.3       | 4            | 20     | 7360.4               |
| Vertical       | 0.5 | 4.08      | 3998.9      | 93.6       | 8            | 26.6   | 9059.5               |
| Diagonal       | 0.1 | 1.24      | 1309.1      | 37.3       | 2.63         | 8.6    | 5105.3               |
| Diagonal       | 0.3 | 1.99      | 2098.4      | 47.3       | 4.94         | 17.4   | 7360.4               |
| Diagonal       | 0.5 | 3.52      | 3938.9      | 61.0       | 9.61         | 23.5   | 9059.5               |