Social top-k @ Joint RuSSIR/EDBT Summer School 2011
Top-k Processing for Search and Information Discovery in Social Applications
Sihem Amer-Yahia, Julia Stoyanovich
Why top-k?
- Practical
– many applications, with new ones emerging every day
- Beautiful
– basic algorithms (TA, NRA) are simple and intuitive
– nice theoretical properties (instance optimality)
- Insightful
– optimizes for I/O operations, a good abstraction of run-time performance
– trades off run-time performance against space overhead
– representative of many data management problems
- Influential
– Optimal Aggregation Algorithms for Middleware, PODS 2001, by Ronald Fagin, Amnon Lotem and Moni Naor
– 733 citations (as of July 11, 2011), PODS 2011 test-of-time award
Why Social?
- Social science is an old discipline
- Social Web: a new opportunity where actions of millions of users online can be gathered and processed (content sites like Delicious, streams like Twitter, and media like ALJ online, in the form of factual and opinion data)
- Computing: large-scale data processing
- Social Computing: the science of gathering, storing and processing social breadcrumbs left by millions of users, in order to enhance their online experience => scaling up social science
About us
- Sihem
– PhD INRIA, France (1999)
– AT&T Research (1999-2006), Yahoo! Research (2006-2011)
– Now: Principal Research Scientist at Qatar Foundation
– Research interests: social computing, large-scale data processing
- Julia
– Data management at several New York start-ups (1998-2003)
– PhD Columbia University, USA (2009)
– Now: Postdoctoral Researcher at the University of Pennsylvania
– Research interests: ranking and data exploration, biological and social applications
Course outline
- Top-k and its applications
– Fundamental top-k algorithms
– Personalized search
– Group recommendation
- Social: user studies
– Group recommendation (MovieLens)
– Travel itinerary extraction (Flickr)
- Open problems in top-k and social
Feel free to interrupt and ask questions!
A plug: Web Data Management
A new book by Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart. Will be published in print in December 2011. Available for free download from http://webdam.inria.fr/Jorge/files/wdmd.pdf
Top-k Processing for Search and Information Discovery in Social Applications
Lecture 1: Fundamental Top-k Algorithms
Sihem Amer-Yahia Julia Stoyanovich
Quote of the day
Life is the sum of all your choices. ~Albert Camus
What is top-k processing?
- Find k items that best answer a user’s query
– As a set, as a sorted list, or as a sorted list with scores
– Usually from among N items, where N >> k
- Application domains
– Search over structured datasets with user-defined preferences
- Find large apartments in a good school district in Brooklyn
- Find cheap hotels that are near a beach
– Web search & other document retrieval / ranking tasks
- Find documents about “Winter Olympics Sochi”
– Search over multimedia repositories (with probability scores)
- Find images that show a palm tree next to a house
– Many others, including in the social domain!
SQL example
Top-k vs. SQL querying
- Compared to standard SQL querying
– Relevance to a degree, not Boolean
– Return only the best items, not all items
– Quality of an item is expressed by a score
– A score is assigned to items by a ranking (scoring) function
Outline
Intro
- Semantics
– Ranking functions
– Dominance and skylines
- Fundamental algorithms
– Fagin algorithm (FA)
– Threshold algorithm (TA)
– No random access algorithm (NRA)
- Extensions and alternatives
– top-k with expensive predicates (MPro)
– TAAT vs. DAAT vs. …
Ranking functions
- Ranking (scoring) functions are used to compute the score of an item
- Item r(x1, …, xm), where the xi are the ranking attributes, e.g., size of a house in m², number of times “Sochi” occurs in the text
- score(r) = g(f1(x1), …, fm(xm))
– where the fi are monotone functions, e.g., f(x) = 2 × x
– g is a monotone aggregation function, e.g., sum, average, max, min
– e.g., score(r) = 2 × size + 3 × quality of school district
Definitions
A function f is monotone if f(x) ≤ f(y) whenever x ≤ y.
An aggregation function g is monotone if g(r) ≤ g(r’) whenever r.xi ≤ r’.xi for all i.
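A minimal sketch of the scoring scheme above, assuming numeric attributes. The helper name `make_score` and the two sample houses are hypothetical; the weights 2 and 3 come from the slide's example.

```python
from typing import Callable, Sequence

def make_score(fs: Sequence[Callable[[float], float]],
               g: Callable[[Sequence[float]], float]) -> Callable[[Sequence[float]], float]:
    """Build score(r) = g(f1(x1), ..., fm(xm)) from per-attribute functions fs and aggregate g."""
    def score(r: Sequence[float]) -> float:
        return g([f(x) for f, x in zip(fs, r)])
    return score

# score(r) = 2 * size + 3 * quality of school district (both fi are monotone, g = sum)
house_score = make_score([lambda size: 2 * size, lambda quality: 3 * quality], sum)

small = (80, 7)    # hypothetical house: 80 m2, district quality 7
large = (120, 7)   # at least as good as `small` in every attribute

print(house_score(small))  # 2*80 + 3*7 = 181
print(house_score(large))  # 2*120 + 3*7 = 261
```

Because every fi and g is monotone, an item that is no worse in every attribute can never receive a lower score, which is what the pruning in the later algorithms relies on.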
Ranking functions and dominance
- Monotone aggregation ranking functions express user preferences
– r(price, distToBeach); score(r) = w1 × price + w2 × distToBeach
– score(r1) < score(r2) whenever r1.price < r2.price AND r1.distToBeach < r2.distToBeach
– This holds for any positive weights w1 and w2! (What if one of the weights was negative? What if both were negative?)
– In this example, we are interested in minimizing the score
- Mapping a multi-dimensional (e.g., 2D) space into R
– In top-k, we look for k tuples with the best score, which is weight-specific, but …
- users may differ on the weights in their preferences
- users may want to understand the structure of the 2D space
– Skylines to the rescue!
- make the structure of the space explicit
- are independent of the weights
Skylines represent dominance
- Item r1 is better than r2 according to its score if
– score(r1) < score(r2), which holds for any positive weights when
– r1.price < r2.price AND r1.distToBeach < r2.distToBeach
- This is equivalent to a notion of dominance (Pareto-optimality)
– Definition: r1 dominates r2 iff r1 is as good as or better than r2 in all dimensions, and strictly better in at least one dimension.
– Some items may be incomparable
- r1: $100, 1.0 km
- r2: $ 90, 1.2 km
- r3: $110, 1.1 km
- r4: $120, 1.1 km
– Observe: dominance is transitive
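The dominance definition and the four sample hotels above can be checked with a short sketch. It assumes lower is better in both dimensions, as in the slide's hotel example; the `Hotel` type and the quadratic-time `skyline` helper are illustrative names, not part of the original material.

```python
from typing import NamedTuple

class Hotel(NamedTuple):
    name: str
    price: float   # dollars, lower is better
    dist: float    # km to the beach, lower is better

def dominates(a: Hotel, b: Hotel) -> bool:
    """a dominates b: no worse in every dimension, strictly better in at least one."""
    no_worse = a.price <= b.price and a.dist <= b.dist
    strictly_better = a.price < b.price or a.dist < b.dist
    return no_worse and strictly_better

def skyline(items):
    """Naive O(n^2) skyline: keep the items that no other item dominates."""
    return [x for x in items if not any(dominates(y, x) for y in items)]

hotels = [
    Hotel("r1", 100, 1.0),
    Hotel("r2", 90, 1.2),
    Hotel("r3", 110, 1.1),
    Hotel("r4", 120, 1.1),
]

print([h.name for h in skyline(hotels)])  # ['r1', 'r2']
```

Here r1 dominates r3 and r4, r3 dominates r4, and r1 and r2 are incomparable (r2 is cheaper, r1 is closer), so both stay on the skyline.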
Skyline examples
Skyline properties
- For any monotone ranking function f
– If point r is best according to f, then r is on the skyline
- A top-1 hotel will always be on the skyline, irrespective of the weights!
– If point r is on the skyline, then there exists an f for which r is best
- Every hotel on the skyline is someone’s favorite
Outline
Intro
Semantics
– Ranking functions
– Dominance and skylines
- Fundamental algorithms
– Fagin algorithm (FA)
– Threshold algorithm (TA)
– No random access algorithm (NRA)
- Extensions and alternatives
– top-k with expensive predicates (MPro)
– TAAT vs. DAAT vs. …
Quantifying top-k algorithm performance
- Execution time
– Sequential access (SA)
- accessing items in order, e.g., by reading from a cursor
- similar concept to a sequential disk read, where seek time is amortized over multiple accesses
– Random access (RA)
- accessing items out of order, e.g., a primary key lookup
- similar to a random disk read
- typically more expensive than an SA (even by orders of magnitude), sometimes impossible
– Why not use wall clock time?
- Buffer size
– How much state do we have to keep during computation?
– Is the size bounded by some constant (e.g., k), or is it linear in the size of the dataset (N)? (recall that k << N)
Example 1
id    income (K$)   netWorth (K$)   score = income + netWorth
r1    150           350             500
r2    150           425             575
r3    125           450             575
r4    100           450             550
r5    100           200             300
r6    80            100             180
r7    75            500             575
r8    75            50              125
r9    50            300             350
r10   50            50              100
Naïve computation of the top-k answers
- Algorithm
– Compute the score of each item
– Sort items in decreasing order of score
– Return k items with the highest score
- Properties of naïve solution
– Advantages?
– Disadvantages?
- Idea: throw space at the problem
– pre-compute inverted lists for components of the score
– aggregate partial scores at run-time
- Main intuition:
– top-k items will appear near the top of sufficiently many lists
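For reference, the naïve algorithm can be sketched in a few lines on the data of Example 1. The function name `naive_top_k` is hypothetical; the sketch uses a bounded heap instead of a full sort, but it still has to score all N items, which is the cost the inverted-list approach tries to avoid.

```python
import heapq

# Example 1: id -> (income, netWorth), both in K$
items = {
    "r1": (150, 350), "r2": (150, 425), "r3": (125, 450), "r4": (100, 450),
    "r5": (100, 200), "r6": (80, 100),  "r7": (75, 500),  "r8": (75, 50),
    "r9": (50, 300),  "r10": (50, 50),
}

def naive_top_k(items, k):
    """Score every item (score = sum of attributes) and keep the k best."""
    scored = ((sum(attrs), item_id) for item_id, attrs in items.items())
    return heapq.nlargest(k, scored)

top3 = naive_top_k(items, 3)
print(top3)  # the three items with score 575: r2, r3 and r7
```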
The basic indexing structure: inverted list
L1                  L2
id    income        id    netWorth
r1    150           r7    500
r2    150           r3    450
r3    125           r4    450
r4    100           r2    425
r5    100           r1    350
r6    80            r9    300
r7    75            r5    200
r8    75            r6    100
r9    50            r8    50
r10   50            r10   50
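The two lists above can be derived from the Example 1 table with a short sketch: one list per score component, each sorted in decreasing order of that component. The function name `build_inverted_lists` is an assumption for illustration.

```python
# Example 1: id -> (income, netWorth), both in K$
items = {
    "r1": (150, 350), "r2": (150, 425), "r3": (125, 450), "r4": (100, 450),
    "r5": (100, 200), "r6": (80, 100),  "r7": (75, 500),  "r8": (75, 50),
    "r9": (50, 300),  "r10": (50, 50),
}

def build_inverted_lists(items):
    """Return one list of (id, value) pairs per attribute, sorted by value, descending."""
    m = len(next(iter(items.values())))
    lists = []
    for i in range(m):
        postings = [(item_id, attrs[i]) for item_id, attrs in items.items()]
        # Python's sort is stable, so ties keep the table's original order
        postings.sort(key=lambda p: p[1], reverse=True)
        lists.append(postings)
    return lists

L1, L2 = build_inverted_lists(items)
print(L1[:3])  # [('r1', 150), ('r2', 150), ('r3', 125)]
print(L2[:3])  # [('r7', 500), ('r3', 450), ('r4', 450)]
```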
Fagin Algorithm (FA)
- Algorithm
– Access all lists sequentially (SA), in parallel
– STOP once k items have been seen sequentially in all lists
– Compute scores of incomplete items by performing a random access (RA)
– Sort on score, return the best k items
- Is this algorithm correct?
- Performance of FA for Example 1 with k = 3
– Number of SA
– Number of RA
– Max buffer size
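One way to work through the FA steps above is the following sketch, which runs on the two inverted lists of Example 1 and counts accesses. It assumes score = sum of components and lists of equal length; random access is simulated with a dictionary lookup, and the function name `fagin` and its return shape are hypothetical.

```python
L1 = [("r1", 150), ("r2", 150), ("r3", 125), ("r4", 100), ("r5", 100),
      ("r6", 80), ("r7", 75), ("r8", 75), ("r9", 50), ("r10", 50)]
L2 = [("r7", 500), ("r3", 450), ("r4", 450), ("r2", 425), ("r1", 350),
      ("r9", 300), ("r5", 200), ("r6", 100), ("r8", 50), ("r10", 50)]

def fagin(lists, k):
    """Fagin's algorithm (FA). Returns (top-k, #SA, #RA, buffer size)."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # stands in for random access by id
    seen = {}                               # id -> {list index: value}
    sa = ra = 0
    for depth in range(len(lists[0])):
        # one round of sequential accesses, one per list
        for i, lst in enumerate(lists):
            item, value = lst[depth]
            seen.setdefault(item, {})[i] = value
            sa += 1
        # stop once k items have been seen in all m lists
        if sum(1 for parts in seen.values() if len(parts) == m) >= k:
            break
    scores = {}
    for item, parts in seen.items():
        for i in range(m):
            if i not in parts:              # fill in missing components by RA
                parts[i] = lookup[i][item]
                ra += 1
        scores[item] = sum(parts.values())
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return top, sa, ra, len(seen)

top, sa, ra, buffer_size = fagin([L1, L2], 3)
print(top, sa, ra, buffer_size)
```

On Example 1 with k = 3, this sketch stops at depth 4: 8 sequential accesses, 2 random accesses (for r1 and r7), and 5 buffered items. Note that the buffer holds every item seen so far, not just k of them.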
Threshold Algorithm (TA)
- Algorithm
– Access all lists sequentially (SA), in parallel
– After each cursor move
- Compute the score of the item r under the cursor with random accesses (RA)
- Record r in the buffer if buffer size < k or r’s score > kth score; in the latter case, remove the kth item from the buffer
- Update the threshold θ = Σ current list scores
- STOP when kth score ≥ θ
– Return the k items currently in the buffer
- Is this algorithm correct?
- Performance of TA for Example 1 with k = 3
– Number of SA
– Number of RA
– Max buffer size
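The TA steps can be sketched as follows on the lists of Example 1. This is a simplified version under stated assumptions: score = sum of components, the threshold is checked once per round rather than after every single cursor move, and the algorithm halts once the kth buffered score is at least the threshold, which is the condition used by Fagin et al. The function name `threshold_algorithm` is hypothetical.

```python
import heapq

L1 = [("r1", 150), ("r2", 150), ("r3", 125), ("r4", 100), ("r5", 100),
      ("r6", 80), ("r7", 75), ("r8", 75), ("r9", 50), ("r10", 50)]
L2 = [("r7", 500), ("r3", 450), ("r4", 450), ("r2", 425), ("r1", 350),
      ("r9", 300), ("r5", 200), ("r6", 100), ("r8", 50), ("r10", 50)]

def threshold_algorithm(lists, k):
    """Threshold algorithm (TA). Returns (top-k as (score, id), #SA, #RA)."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # stands in for random access by id
    top = []                                # min-heap of (score, id), at most k entries
    sa = ra = 0
    for depth in range(len(lists[0])):
        for lst in lists:
            item, _ = lst[depth]
            sa += 1
            if any(item == buffered for _, buffered in top):
                continue                    # already in the buffer
            # complete the score with one random access per other list
            score = sum(lookup[j][item] for j in range(m))
            ra += m - 1
            if len(top) < k:
                heapq.heappush(top, (score, item))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, item))   # evict the current kth item
        # threshold: best possible score of any item not yet seen
        theta = sum(lst[depth][1] for lst in lists)
        if len(top) == k and top[0][0] >= theta:
            break
    return sorted(top, reverse=True), sa, ra

top, sa, ra = threshold_algorithm([L1, L2], 3)
print(top, sa, ra)
```

On Example 1 with k = 3, this sketch stops at depth 3 with 6 sequential accesses, and its buffer never holds more than k items. The price is extra random accesses: an item that was evicted or never admitted is re-probed if it shows up again.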
Comparison between FA and TA
- Example 2 with k = 3
- Theorem 1
– # SA in TA ≤ # SA in FA
- Theorem 2
– TA requires bounded buffers, while buffers in FA are not bounded
Instance optimality of TA
- Theorem 3
– TA is instance-optimal over the class of algorithms A that
1. correctly find the top-k answers and
2. do not make any random guesses
- Does Theorem 3 hold over all algorithms that correctly identify the top-k answers?
- Example 3
What if random accesses were impossible?
- When is this the case?
- Common approach: change semantics of the result
– Sometimes it suffices to output the top-k as a set
– Sometimes we can get away with outputting top-k in sorted order, but with no scores
- Example 4
No Random Access algorithm (NRA)
- Algorithm
– Access all lists sequentially (SA), in parallel
– After each cursor move compute
- Worst-case score W(r), best-case score B(r) for each seen r
- Sort all seen items on W(r), breaking ties by B(r)
- θ = Σ current list scores (this is the best-case score of any unseen object)
- STOP when W(r) of the kth object ≥ θ and ≥ B(r’) of every seen object r’ outside the current top-k
– If RA is possible, compute complete scores of the top-k items
– Return the top-k items
- NRA with Example 1 for k = 3
- Performance
– # SA, # RA, buffer size
– Optimal performance if no RAs are allowed!
– In reality, computation may be slow, because of re-sorting large buffers at each step
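A sketch of NRA on the lists of Example 1, under these assumptions: score = sum of components, all component values are non-negative (so an unseen component contributes 0 to the worst case), and the bounds are recomputed once per round. The sketch halts only when the kth worst-case score is at least θ and also at least the best-case score of every seen item outside the current top-k. Comparing against θ alone would stop too early on this data, at depth 4, while r7 could still reach 600. The function name `nra` is hypothetical.

```python
L1 = [("r1", 150), ("r2", 150), ("r3", 125), ("r4", 100), ("r5", 100),
      ("r6", 80), ("r7", 75), ("r8", 75), ("r9", 50), ("r10", 50)]
L2 = [("r7", 500), ("r3", 450), ("r4", 450), ("r2", 425), ("r1", 350),
      ("r9", 300), ("r5", 200), ("r6", 100), ("r8", 50), ("r10", 50)]

def nra(lists, k):
    """No Random Access algorithm (NRA). Returns (top-k ids, #SA)."""
    m = len(lists)
    seen = {}        # id -> {list index: value}
    ranked = []
    sa = 0
    for depth in range(len(lists[0])):
        for i, lst in enumerate(lists):
            item, value = lst[depth]
            seen.setdefault(item, {})[i] = value
            sa += 1
        cursor = [lst[depth][1] for lst in lists]   # last value read from each list

        def worst(r):
            # W(r): unseen components count as 0
            return sum(seen[r].values())

        def best(r):
            # B(r): unseen components count as the current cursor value
            return sum(seen[r].get(i, cursor[i]) for i in range(m))

        # sort on W(r), breaking ties by B(r)
        ranked = sorted(seen, key=lambda r: (worst(r), best(r)), reverse=True)
        if len(ranked) < k:
            continue
        kth_worst = worst(ranked[k - 1])
        theta = sum(cursor)                         # best case of any unseen item
        if kth_worst >= theta and all(best(r) <= kth_worst for r in ranked[k:]):
            break
    return ranked[:k], sa

top, sa = nra([L1, L2], 3)
print(top, sa)
```

On Example 1 with k = 3, this sketch needs 14 sequential accesses (depth 7) and no random accesses. It re-sorts the whole buffer each round, which illustrates the slide's remark about slow computation on large buffers.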
Outline
Intro
Semantics
– Ranking functions
– Dominance and skylines
Fundamental algorithms
– Fagin algorithm (FA)
– Threshold algorithm (TA)
– No random access algorithm (NRA)
- Extensions and alternatives
– top-k with expensive predicates (MPro)
– TAAT vs. DAAT
A classification of top-k algorithms
- Both sorted and random access: FA, TA, Quick-combine, Multi-Step
- Sorted access only: NRA, Stream-combine
- Random access only: MPro, Upper, Pick
Minimal Probing (MPro)
- Recall TA
– sequentially accesses one list
– probes all other lists for each encountered object
- What if random accesses (probes) were very expensive?
– User-defined functions
– Calls to external services, e.g., web-based
- Idea: execute only the necessary probes
- Assumptions
– One component of the score is accessed sequentially (SA), other components are probed (RA)
– Probing schedule is given, global (same for all items) or per-item
The MPro algorithm
- For an item r
– Worst(r): worst-case score seen so far
– Best(r): best-case score, unseen components have max score (1.0)
- What is a necessary probe?
– The probe on item r is necessary if there do not exist k items r1, …, rk such that Best(r) < Best(ri) for each ri
- The MPro Algorithm
– Access items sequentially
– Maintain a best-case queue (a.k.a. the ceiling queue)
- STOP when k items with complete scores are at the head of the queue
- Probe the first incomplete item, re-sort the queue
An MPro example
SELECT id FROM House
WHERE new(age) s,             -- sequential access
      cheap(price, size) pC,  -- random access, a.k.a. “probe”
      large(size) pL          -- random access
ORDER BY MIN(s, pC, pL)
STOP AFTER 2
id    s = new(age)   pC = cheap(price, size)   pL = large(size)   score = MIN(s, pC, pL)
a     0.90           0.85                      0.75               0.75
b     0.80           0.78                      0.90               0.78
c     0.70           0.75                      0.20               0.20
d     0.60           0.90                      0.90               0.60
e     0.50           0.70                      0.80               0.50
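The example table can be replayed with a small MPro sketch. It is simplified under stated assumptions: the sequential component s of every house is loaded into the ceiling queue up front, the probing schedule is global (pC, then pL), and a probe is simulated by reading the stored value. The function name `mpro` is hypothetical.

```python
import heapq

# id -> (s, pC, pL); only s is known without a probe
houses = {
    "a": (0.90, 0.85, 0.75),
    "b": (0.80, 0.78, 0.90),
    "c": (0.70, 0.75, 0.20),
    "d": (0.60, 0.90, 0.90),
    "e": (0.50, 0.70, 0.80),
}

def mpro(houses, k):
    """Minimal probing with score = MIN of all components. Returns (top-k, #probes)."""
    probes = 0
    # ceiling queue: (-best-case score, id, number of known components)
    queue = [(-vals[0], hid, 1) for hid, vals in houses.items()]
    heapq.heapify(queue)
    result = []
    while queue and len(result) < k:
        neg_ceiling, hid, known = heapq.heappop(queue)
        vals = houses[hid]
        if known == len(vals):
            # a complete item at the head of the queue is a final answer
            result.append((hid, -neg_ceiling))
            continue
        # the head is incomplete: its next probe is necessary
        probes += 1
        ceiling = min(-neg_ceiling, vals[known])
        heapq.heappush(queue, (-ceiling, hid, known + 1))
    return result, probes

top2, probes = mpro(houses, 2)
print(top2, probes)  # [('b', 0.78), ('a', 0.75)] 4
```

Only a and b are probed (4 probes in total). Houses c, d and e are never probed, because their ceilings (0.70, 0.60, 0.50) are already below the complete scores of the top two; probing everything would have cost 10 probes.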
The big picture: top-k is term-at-a-time
Here items are called “documents” (IR terminology)
- Term-at-a-time (TAAT)
– Partial scores for items are kept in accumulators
– May (it better!) terminate before all items are considered, but each item may be considered more than once
– Typically outperforms DAAT when contributions of terms to the score are independent and the dataset is reasonably small
- Document-at-a-time (DAAT)
– Complete scores for items are computed (i.e., typically all items are considered)
– Smaller memory footprint than in TAAT
– Easier to execute in parallel than TAAT
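The difference in traversal order can be illustrated with a sketch over a tiny, made-up set of posting lists (the terms, document ids and integer partial scores are all hypothetical). Both functions below are exhaustive: the early termination that TAAT-style top-k algorithms aim for is deliberately left out, so the sketch only contrasts accumulators per document with one complete score at a time.

```python
import heapq
from collections import defaultdict
from itertools import groupby

# Hypothetical postings: term -> [(doc id, partial score)], sorted by doc id
postings = {
    "winter":   [(1, 5), (2, 1), (4, 3)],
    "olympics": [(1, 4), (3, 6), (4, 2)],
    "sochi":    [(2, 7), (4, 4)],
}

def taat(postings, k):
    """Term-at-a-time: process one posting list after another, one accumulator per document."""
    acc = defaultdict(int)
    for plist in postings.values():
        for doc, partial in plist:
            acc[doc] += partial
    return heapq.nlargest(k, acc.items(), key=lambda kv: kv[1])

def daat(postings, k):
    """Document-at-a-time: merge the lists by doc id and finish one document before the next."""
    merged = heapq.merge(*postings.values())   # all postings in doc-id order
    top = []                                   # min-heap of (score, doc), at most k entries
    for doc, group in groupby(merged, key=lambda p: p[0]):
        score = sum(partial for _, partial in group)
        if len(top) < k:
            heapq.heappush(top, (score, doc))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, doc))
    return sorted(top, reverse=True)

print(taat(postings, 2))
print(daat(postings, 2))
```

Both return documents 1 and 4 (score 9 each). TAAT keeps an accumulator for every document it touches, while DAAT keeps only the k-entry heap, which matches the memory-footprint remark above.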
Summary and outlook
- Semantics of top-k queries
– Items have scores that are made up of components
– Components are aggregated using monotone aggregation
- Fundamental algorithms
– Use the inverted list indexing structure
– Have an access strategy and a stopping condition
– TA: instance-optimal over the class of reasonable algorithms
– NRA: useful when random access is expensive or impossible
- Generalizations and extensions
- Next lecture
– using top-k for personalized search in social tagging sites
References and further reading
- 1. Optimal aggregation algorithms for middleware. Ronald Fagin, Amnon Lotem and Moni Naor, PODS 2001.
- 2. Minimal probing: supporting expensive predicates for top-k queries. Kevin Chen-Chuan Chang and Seung-won Hwang, SIGMOD 2002.
- 3. Evaluating top-k queries over Web-accessible databases. Nicolas Bruno, Luis Gravano and Amélie Marian, ICDE 2002.
- 4. The Skyline operator. Stephan Börzsönyi, Donald Kossmann and Konrad Stocker, ICDE 2001.
- 5. Query evaluation: strategies and optimizations. Howard Turtle and James Flood, Information Processing and Management, 1995.