Top-k Aggregation Using Intersections
Ravi Kumar (Yahoo! Research), Kunal Punera (Yahoo! Research), Torsten Suel (Yahoo! Research / Brooklyn Poly), Sergei Vassilvitskii (Yahoo! Research)
Top-k retrieval

Given a set of documents, and a query: “New York City”, find the k documents best matching the query. Assume a decomposable scoring function:

Score(“New York City”) = Score(“New”) + Score(“York”) + Score(“City”)
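The decomposable-scoring assumption can be sketched in a few lines of Python. The per-term scores below are the illustrative values from the running example, not the output of a real scoring function.

```python
# Decomposable scoring: the score of a document for a multi-term query
# is simply the sum of its per-term scores (illustrative values).
term_scores = {
    "new":  {9: 5.2, 5: 4.0, 7: 3.3, 3: 1.0, 10: 0.0},
    "york": {10: 4.1, 9: 3.1, 7: 1.0, 5: 0.5, 1: 0.2},
    "city": {10: 2.0, 3: 1.5, 7: 1.0, 9: 0.2, 5: 0.1},
}

def score(doc_id, query_terms):
    # Terms missing from a document contribute 0 to the sum.
    return sum(term_scores[t].get(doc_id, 0.0) for t in query_terms)

print(score(9, ["new", "york", "city"]))   # doc 9: 5.2 + 3.1 + 0.2
```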
Data structures behind top-k retrieval. Create posting lists of (Doc ID, Score) pairs, one per term. Query: New York City

New:  9:5.2   5:4.0   7:3.3   3:1.0   10:0.0
York: 10:4.1  9:3.1   7:1.0   5:0.5   1:0.2
City: 10:2.0  3:1.5   7:1.0   9:0.2   5:0.1
(Offline) Sort each list by decreasing score. Retrieval: start with the document with the highest score in any list, and look up its score in the other lists.

Top: doc 9, score 5.2 + 3.1 + 0.2 = 8.5
Continue with the next highest score.

Top: doc 9 (8.5). Candidate: doc 10, score 4.1 + 2.0 + 0.0 = 6.1
Continue with the next highest score.

Top: doc 9 (8.5). Candidate: doc 5, score 4.0 + 0.5 + 0.1 = 4.6
When can we stop? When the best possible score of any remaining document falls below the current top.

Top: doc 9 (8.5). Best possible remaining: 3.3 + 1.5 + 1.0 = 5.8 < 8.5, so we can stop.
Threshold Algorithm (TA) [Fagin et al.]
– Instance optimal (in number of accesses)
– Performs random accesses

No-Random-Access Algorithm (NRA)
– Similar to TA, but keeps a list of all results seen so far
– Also instance optimal
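A minimal Python sketch of TA under these assumptions (lists sorted by decreasing score, plus a random-access lookup per list); this illustrates the algorithm's structure and is not the authors' implementation.

```python
import heapq

def threshold_algorithm(lists, k):
    """TA sketch: round-robin sorted access plus random-access lookups.

    lists: one posting list per query term, each a list of (doc_id, score)
    pairs sorted by decreasing score. Returns top-k (score, doc_id) pairs.
    """
    lookup = [dict(lst) for lst in lists]   # random-access structures
    seen, top = set(), []                   # top: min-heap of (score, doc_id)
    depth = 0
    while depth < max(len(lst) for lst in lists):
        for lst in lists:
            if depth >= len(lst):
                continue
            doc, _ = lst[depth]
            if doc in seen:
                continue
            seen.add(doc)
            total = sum(d.get(doc, 0.0) for d in lookup)  # random accesses
            heapq.heappush(top, (total, doc))
            if len(top) > k:
                heapq.heappop(top)
        depth += 1
        # Threshold: best possible score of any document not yet seen.
        threshold = sum(lst[depth][1] if depth < len(lst) else 0.0
                        for lst in lists)
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True)

new  = [(9, 5.2), (5, 4.0), (7, 3.3), (3, 1.0), (10, 0.0)]
york = [(10, 4.1), (9, 3.1), (7, 1.0), (5, 0.5), (1, 0.2)]
city = [(10, 2.0), (3, 1.5), (7, 1.0), (9, 0.2), (5, 0.1)]
print(threshold_algorithm([new, york, city], k=1))  # doc 9 (5.2 + 3.1 + 0.2)
```

On the running example the walk stops after two rounds of sorted access, well before exhausting the lists.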
Certain words often occur as phrases. Word association:
– Sagrada ...
– Barack ...
– Latent Semantic ...

Pre-compute posting lists for intersections (note: this is not query-result caching). Tradeoffs:
– Space: extra space to store the intersection (though it is smaller than the original lists)
– Time: less time at retrieval
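Building such an intersection list offline can be sketched as follows; the (doc id, score) format and the summed-score convention follow the running example, but the actual index format is an assumption.

```python
# Sketch of building a precomputed intersection ("bigram") posting list:
# keep only documents present in both lists, store the summed score,
# and sort by decreasing score, just like an ordinary posting list.
def intersect_lists(a, b):
    da, db = dict(a), dict(b)
    common = set(da) & set(db)
    merged = [(doc, da[doc] + db[doc]) for doc in common]
    return sorted(merged, key=lambda p: -p[1])

new  = [(9, 5.2), (5, 4.0), (7, 3.3), (3, 1.0), (10, 0.0)]
york = [(10, 4.1), (9, 3.1), (7, 1.0), (5, 0.5), (1, 0.2)]
print(intersect_lists(new, york))  # docs 9, 5, 7, 10, in that order
```

Note the intersection has only 4 entries here, versus 5 in each original list: that shrinkage is exactly the space/time tradeoff of the slide.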
Query: New York City. All aggregations: 6 lists. [New] [York] [City] [New York] [New City] [York City]

New:  9:5.2   5:4.0   7:3.3   3:1.0   10:0.0
York: 10:4.1  9:3.1   7:1.0   5:0.5   1:0.2
City: 10:2.0  3:1.5   7:1.0   9:0.2   5:0.1
NY:   9:8.3   5:4.5   7:4.3   10:4.1
NC:   9:5.4   7:4.3   5:4.1   3:2.5   10:2.0
YC:   10:6.1  9:3.3   7:2.0   5:0.6
Retrieval proceeds as before. Top: doc 9, score 8.5.
Can we stop now?
Top: doc 9 (8.5). Bounds on any unseen element:
– N + Y + C = 4.0 + 4.1 + 2.0 = 10.1
– NY + C = 4.5 + 2.0 = 6.5
– NC + Y = 4.3 + 4.1 = 8.4
– YC + N = 6.1 + 4.0 = 10.1
– 1/2 (NY + YC + NC) = (4.5 + 6.1 + 4.3) / 2 = 7.45

Thus the best unseen element has score at most 6.5 < 8.5. So we are done!
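The stopping test above amounts to taking the minimum over all ways of covering the three terms with single-list and pair-list bounds. A short sketch using the frontier values from the example (the b values are read off the slides' current state, so treat them as illustrative):

```python
# Refined stopping bound for a 3-term query: each unseen document's score
# is bounded by every "cover" of {N, Y, C} built from single-list bounds
# b_i and pair-list bounds b_ij; the tightest bound is the minimum.
b  = {"N": 4.0, "Y": 4.1, "C": 2.0}        # next unseen score, single lists
bp = {"NY": 4.5, "NC": 4.3, "YC": 6.1}     # next unseen score, pair lists

covers = [
    b["N"] + b["Y"] + b["C"],              # N + Y + C
    bp["NY"] + b["C"],                     # NY + C
    bp["NC"] + b["Y"],                     # NC + Y
    bp["YC"] + b["N"],                     # YC + N
    (bp["NY"] + bp["NC"] + bp["YC"]) / 2,  # half of all three pairs
]
best_bound = min(covers)
print(best_bound)  # 6.5: below the current top score 8.5, so we can stop
```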
Can we write the bound on the next element in general?
– x_i : score of document x in list i
– b_i : bound on the score in list i (score of the next unseen document)
– b_ij : bound on the combination x_i + x_j

Simple LP for a bound on unseen elements:
    max   Σ_i x_i
    s.t.  x_i ≤ b_i
          x_i + x_j ≤ b_ij
          x_i ≥ 0

In theory: Easy! Just solve an LP every time. In reality: You’re kidding, right?
Need to solve the LP:
    max   Σ_i x_i
    s.t.  x_i ≤ b_i,   x_i + x_j ≤ b_ij,   x_i ≥ 0

This is the same as solving its dual:
    min   Σ_i b_i y_i + Σ_ij b_ij y_ij
    s.t.  y_i + Σ_j y_ij ≥ 1   for each i
          y_i, y_ij ≥ 0
Interpret the dual as a graph problem:
– Add one node for each b_i, with weight y_i (single lists)
– Add one edge for each b_ij, with weight y_ij (paired lists)

Goal: select a (fractional) subset of edges and vertices, so that each vertex has (in total) a weight of 1 selected.

[Figure: graph with node costs b_i and edge costs b_ij]
Goal: select a subset of edges and vertices so that each vertex has a weight of 1 selected. This looks like the classical edge cover problem, except that vertices themselves may also be selected.

We show how to solve this problem by computing a min-cost matching. Running time: O(nm), versus O(n!) for checking all combinations.
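On tiny instances the dual LP can be brute-forced as a sanity check. Restricting the y variables to {0, 1/2, 1} is an assumption motivated by the half-of-all-pairs bound in the example; the paper's actual method is the min-cost matching reduction, not this enumeration.

```python
from itertools import product

# Brute-force the dual covering LP on the 3-term example, assuming an
# optimal solution is half-integral so y values lie in {0, 0.5, 1}.
singles = {"N": 4.0, "Y": 4.1, "C": 2.0}                     # b_i
pairs = {("N", "Y"): 4.5, ("N", "C"): 4.3, ("Y", "C"): 6.1}  # b_ij

terms = list(singles)
vals = (0.0, 0.5, 1.0)
best = float("inf")
for ys in product(vals, repeat=len(terms)):
    for ye in product(vals, repeat=len(pairs)):
        y = dict(zip(terms, ys))
        ypair = dict(zip(pairs, ye))
        # Feasibility: every term covered with total weight >= 1.
        if all(y[t] + sum(w for e, w in ypair.items() if t in e) >= 1
               for t in terms):
            cost = (sum(singles[t] * y[t] for t in terms) +
                    sum(pairs[e] * ypair[e] for e in pairs))
            best = min(best, cost)
print(best)  # 6.5: matches the tightest combination bound, NY + C
```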
– Introduction to TA
– Solving the ‘upper bound’ problem
– Empirical Results
– Conclusion
Datasets:
– TREC (25M pages), 100k queries
– Yahoo! (16M pages), 10k queries
(a random subset of queries in each)

Metrics:
– Number of random and sequential accesses
– Index size

Which bigrams to select?
– In a query-oblivious manner
– Greedily, based on the size of the intersection versus the size of the original lists
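One plausible reading of the greedy rule is to rank candidate bigrams by how small the intersection is relative to the shorter of the two original lists (a small ratio means a large saving per stored entry). The rule below is an illustrative sketch, not necessarily the exact criterion used in the experiments.

```python
# Hypothetical query-oblivious selection order for bigram intersection
# lists: smaller |A ∩ B| / min(|A|, |B|) ratios are preferred (selected
# first), since they promise the biggest pruning benefit per entry stored.
def selection_order(postings, candidates):
    def ratio(pair):
        a, b = pair
        inter = len(set(postings[a]) & set(postings[b]))
        return inter / min(len(postings[a]), len(postings[b]))
    return sorted(candidates, key=ratio)

postings = {
    "new":  {9, 5, 7, 3, 10},
    "york": {10, 9, 7, 5, 1},
    "city": {10, 3, 7, 9, 5},
}
order = selection_order(
    postings, [("new", "york"), ("new", "city"), ("york", "city")])
print(order)  # ("new", "city") comes last: its intersection is not smaller
```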
– Baseline: traverse the full lists
– INT: use intersection lists, but still no early termination
– ET: use early termination, but without intersection lists
– ET + INT: use both early termination and intersection lists

Total index growth: 25%
[Chart: number of sequential accesses per algorithm; y-axis up to 60,000 accesses]

Immediate benefit, but diminishing returns as extra intersections are added.
[Chart: number of sequential accesses vs. index size increase; y-axis up to 18,000 accesses]
We prove that in the worst case we must examine all of the lists to find the top-k results. But is this just a theoretical result? What if we use simpler heuristics that focus only on the intersection lists?
– For 89% of the queries: ...
– For the remaining 11% of the queries: ...

So the worst case does occur in practice.
– We give a formal analysis of how to use pre-aggregated posting lists; solving an LP directly is unreasonable, so we reduce the problem to min-cost matching
– We show empirically that a simple selection rule for intersections gives performance improvements

Many questions remain:
– Extending the results to tri-grams (solving hyperedge cover)
– Better ways of selecting intersections
– ...