Top-k Processing for Search and Information Discovery in Social Applications
Sihem Amer-Yahia, Julia Stoyanovich
Social top-k @ Joint RuSSIR/EDBT Summer School 2011

Why top-k?
• Practical
  – many applications, with new ones emerging every day
• Beautiful
  – basic algorithms (TA, NRA) are simple and intuitive
  – nice theoretical properties, notably instance optimality
• Insightful
  – optimizes for I/O operations, a good abstraction of run-time performance
  – trades off run-time performance against space overhead
  – representative of many data management problems
• Influential
  – Optimal Aggregation Algorithms for Middleware, PODS 2001, by Ronald Fagin, Amnon Lotem and Moni Naor
  – 733 citations (as of July 11, 2011), PODS 2011 test-of-time award

Why Social?
• Social science is an old discipline
• Social Web: a new opportunity where the actions of millions of users online can be gathered and processed (content sites like Delicious, streams like Twitter, and media like ALJ online, in the form of factual and opinion data)
• Computing: large-scale data processing
• Social Computing: the science of gathering, storing and processing social breadcrumbs left by millions of users, in order to enhance their online experience => scaling up social science

About us
• Sihem
  – PhD INRIA, France (1999)
  – AT&T Research (1999-2006), Yahoo! Research (2006-2011)
  – Now: Principal Research Scientist at Qatar Foundation
  – Research interests: social computing, large-scale data processing
• Julia
  – Data management at several New York start-ups (1998-2003)
  – PhD Columbia University, USA (2009)
  – Now: Postdoctoral Researcher at the University of Pennsylvania
  – Research interests: ranking and data exploration, biological and social applications

Course outline
• Top-k and its applications
  – Fundamental top-k algorithms
  – Personalized search
  – Group recommendation
• Social: user studies
  – Group recommendation (MovieLens)
  – Travel itinerary extraction (Flickr)
• Open problems in top-k and social
Feel free to interrupt and ask questions!

A plug: Web Data Management
A new book by Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart. Will be published in print in December 2011. Available for free download from http://webdam.inria.fr/Jorge/files/wdmd.pdf

Lecture 1: Fundamental Top-k Algorithms

Quote of the day
Life is the sum of all your choices. ~Albert Camus

What is top-k processing?
• Find k items that best answer a user’s query
  – As a set, as a sorted list, or as a sorted list with scores
  – Usually from among N items, where N >> k
• Application domains
  – Search over structured datasets with user-defined preferences
    • Find large apartments in a good school district in Brooklyn
    • Find cheap hotels that are near a beach
  – Web search & other document retrieval / ranking tasks
    • Find documents about “Winter Olympics Sochi”
  – Search over multimedia repositories (with probability scores)
    • Find images that show a palm tree next to a house
  – Many others, including in the social domain!

SQL example

Top-k vs. SQL querying
• Compared to standard SQL querying
  – Relevance to a degree, not Boolean
  – Return only the best items, not all items
  – Quality of an item is expressed by a score
  – A score is assigned to items by a ranking (scoring) function

Outline
✓ Intro
• Semantics
  – Ranking functions
  – Dominance and skylines
• Fundamental algorithms
  – Fagin algorithm (FA)
  – Threshold algorithm (TA)
  – No random access algorithm (NRA)
• Extensions and alternatives
  – Top-k with expensive predicates (MPro)
  – TAAT vs. DAAT vs. …

Ranking functions
• Ranking (scoring) functions are used to compute the score of an item
• Item r(x_1, …, x_m), where x_i are the ranking attributes, e.g., size of a house in m², number of times “Sochi” occurs in the text
    score(r) = g(f_1(x_1), …, f_m(x_m))
  – where f_i are monotone functions, e.g., f(x) = 2 × x
  – g is a monotone aggregation function, e.g., sum, average, max, min
  – e.g., score(r) = 2 × size + 3 × quality of school district
Definitions
• A function f is monotone if f(x) ≤ f(y) whenever x ≤ y
• An aggregation function g is monotone if g(r) ≤ g(r’) whenever r.x_i ≤ r’.x_i for all i
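The scoring scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name make_score, the attribute order (size, school quality) and the two sample apartments are hypothetical and do not come from the slides; the weights 2 and 3 are the ones used in the slide's example.

```python
from typing import Callable, Sequence

Number = float


def make_score(
    fs: Sequence[Callable[[Number], Number]],
    g: Callable[[Sequence[Number]], Number],
) -> Callable[[Sequence[Number]], Number]:
    """Build score(r) = g(f_1(x_1), ..., f_m(x_m)) from per-attribute
    functions fs and an aggregation function g (hypothetical helper)."""

    def score(xs: Sequence[Number]) -> Number:
        if len(xs) != len(fs):
            raise ValueError("expected one attribute value per function")
        return g([f(x) for f, x in zip(fs, xs)])

    return score


# score(r) = 2 * size + 3 * quality of school district, as on the slide.
# Both per-attribute functions are monotone, and sum is a monotone aggregate.
score = make_score([lambda size: 2 * size, lambda quality: 3 * quality], sum)

# Hypothetical items: (size in m^2, school district quality).
apartment_a = (80, 7)
apartment_b = (95, 9)

print(score(apartment_a))  # 2*80 + 3*7
print(score(apartment_b))  # 2*95 + 3*9
```

Because apartment_b is at least as large as apartment_a in every attribute, monotonicity guarantees that its score is at least as high, whichever monotone f_i and g are plugged in.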

Ranking functions and dominance
• Monotone aggregation ranking functions express user preferences
  – r(price, distToBeach); score(r) = w_1 × price + w_2 × distToBeach
  – score(r_1) < score(r_2) whenever r_1.price < r_2.price AND r_1.distToBeach < r_2.distToBeach
  – This holds for any positive weights w_1 and w_2! (What if one of the weights was negative? What if both were negative?)
  – In this example, we are interested in minimizing the score
• Mapping a multi-dimensional (e.g., 2D) space into R
  – In top-k, we look for k tuples with the best score, which is weight-specific, but …
    • users may differ on the weights in their preferences
    • users may want to understand the structure of the 2D space
  – Skylines to the rescue!
    • make the structure of the space explicit
    • are independent of the weights

Skylines represent dominance
• Item r_1 is better than r_2 according to its score if score(r_1) < score(r_2); for any positive weights this is guaranteed when
  – r_1.price < r_2.price AND r_1.distToBeach < r_2.distToBeach
• This corresponds to a notion of dominance (Pareto-optimality)
  – Definition: r_1 dominates r_2 iff r_1 is as good as or better than r_2 in all dimensions, and strictly better in at least one dimension.
  – Some items may be incomparable
    • r_1: $100, 1.0 km
    • r_2: $ 90, 1.2 km
    • r_3: $110, 1.1 km
    • r_4: $120, 1.1 km
  – Observe: dominance is transitive
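The dominance definition can be checked directly on the four hotels listed above. The following is a minimal sketch, assuming that lower is better in both dimensions (price and distance to the beach); the function names dominates and skyline are illustrative, and the quadratic scan is the simplest possible approach, not an efficient skyline algorithm.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (price in $, distance to beach in km)

hotels: Dict[str, Point] = {
    "r1": (100, 1.0),
    "r2": (90, 1.2),
    "r3": (110, 1.1),
    "r4": (120, 1.1),
}


def dominates(a: Point, b: Point) -> bool:
    """True iff a is as good as or better than b in every dimension
    and strictly better in at least one (lower values are better)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(items: Dict[str, Point]) -> List[str]:
    """Names of items that no other item dominates (naive O(n^2) scan)."""
    return [
        name
        for name, point in items.items()
        if not any(dominates(other, point) for other in items.values())
    ]


print(dominates(hotels["r1"], hotels["r3"]))  # r1 is cheaper and closer
print(dominates(hotels["r1"], hotels["r2"]))  # r2 is cheaper: incomparable
print(skyline(hotels))
```

Here r1 and r2 are incomparable (one is closer, the other cheaper), so both stay on the skyline, while r3 and r4 are dominated by r1.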

Skyline examples

Skyline properties
• For any monotone ranking function f
  – If point r is best according to f, then r is on the skyline
    • A top-1 hotel will always be on the skyline, irrespective of the weights!
  – If point r is on the skyline, then there exists an f for which r is best
    • Every hotel on the skyline is someone’s favorite
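The first property can be spot-checked numerically. The sketch below reuses the four hotels from the dominance slide and assumes a weighted-sum score that is minimized; the weight pairs are arbitrary positive values chosen for illustration, and best_hotel is a hypothetical helper name. Trying a handful of weights is of course not a proof, only a sanity check.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]  # (price in $, distance to beach in km)

hotels: Dict[str, Point] = {
    "r1": (100, 1.0),
    "r2": (90, 1.2),
    "r3": (110, 1.1),
    "r4": (120, 1.1),
}


def dominates(a: Point, b: Point) -> bool:
    """Lower is better: a is nowhere worse than b and somewhere better."""
    return all(x <= y for x, y in zip(a, b)) and any(
        x < y for x, y in zip(a, b)
    )


def best_hotel(w_price: float, w_dist: float) -> str:
    """Top-1 hotel under score = w_price * price + w_dist * distToBeach,
    where a lower score is better."""
    return min(
        hotels,
        key=lambda name: w_price * hotels[name][0] + w_dist * hotels[name][1],
    )


skyline_names = {
    name
    for name, point in hotels.items()
    if not any(dominates(other, point) for other in hotels.values())
}

# Arbitrary positive weight pairs (assumptions for illustration only).
for w_price, w_dist in [(1, 1), (1, 100), (0.01, 1)]:
    winner = best_hotel(w_price, w_dist)
    print(w_price, w_dist, winner, winner in skyline_names)
```

With weights that emphasize price the winner is r2, and with weights that emphasize distance it is r1; in both cases the winner is a skyline hotel.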

Outline
✓ Intro
✓ Semantics
  ✓ Ranking functions
  ✓ Dominance and skylines
• Fundamental algorithms
  – Fagin algorithm (FA)
  – Threshold algorithm (TA)
  – No random access algorithm (NRA)
• Extensions and alternatives
  – Top-k with expensive predicates (MPro)
  – TAAT vs. DAAT vs. …

Quantifying top-k algorithm performance
• Execution time
  – Sequential access (SA)
    • accessing items in order, e.g., by reading from a cursor
    • similar concept to a sequential disk read, where seek time is amortized over multiple accesses
  – Random access (RA)
    • accessing items out of order, e.g., a primary key lookup
    • similar to a random disk read
    • typically more expensive than an SA (even by orders of magnitude), sometimes impossible
  – Why not use wall clock time?
• Buffer size
  – How much state do we have to keep during computation?
  – Is the size bounded by some constant (e.g., k), or is it linear in the size of the dataset (N)? (recall that k << N)
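One way to make the SA/RA distinction concrete is to wrap a sorted list in an object that counts both kinds of access, and to charge each kind a unit cost. The sketch below is an assumption-laden illustration: the class name CountingList, its method names, and the unit costs (1 per sequential access, 10 per random access) are all hypothetical and are not taken from the slides.

```python
from typing import Dict, List, Optional, Tuple

Entry = Tuple[str, float]  # (item id, attribute value)


class CountingList:
    """A sorted list that counts sequential and random accesses."""

    def __init__(self, entries: List[Entry]) -> None:
        self._entries = list(entries)      # assumed sorted, best value first
        self._by_id: Dict[str, float] = dict(entries)
        self._cursor = 0
        self.sequential_accesses = 0
        self.random_accesses = 0

    def next_entry(self) -> Optional[Entry]:
        """Sequential access: read the next entry under the cursor."""
        if self._cursor >= len(self._entries):
            return None
        self.sequential_accesses += 1
        entry = self._entries[self._cursor]
        self._cursor += 1
        return entry

    def lookup(self, item_id: str) -> float:
        """Random access: fetch the value of a specific item by id."""
        self.random_accesses += 1
        return self._by_id[item_id]

    def cost(self, sa_cost: float = 1.0, ra_cost: float = 10.0) -> float:
        """Total access cost under hypothetical unit costs."""
        return (
            self.sequential_accesses * sa_cost
            + self.random_accesses * ra_cost
        )


income = CountingList([("r1", 150), ("r2", 150), ("r3", 125)])
income.next_entry()   # sequential: ("r1", 150)
income.next_entry()   # sequential: ("r2", 150)
income.lookup("r3")   # random: 125
print(income.sequential_accesses, income.random_accesses, income.cost())
```

Counting accesses rather than measuring wall clock time keeps the comparison between algorithms independent of hardware and caching effects.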

Example 1
score = income + netWorth

id     income (K$)   netWorth (K$)   score
r1     150           350             500
r2     150           425             575
r3     125           450             575
r4     100           450             550
r5     100           200             300
r6      80           100             180
r7      75           500             575
r8      75            50             125
r9      50           300             350
r10     50            50             100

Naïve computation of the top-k answers
• Algorithm
  – Compute the score of each item
  – Sort items in decreasing order of score
  – Return k items with the highest score
• Properties of naïve solution
  – Advantages?
  – Disadvantages?
• Idea: throw space at the problem
  – pre-compute inverted lists for components of the score
  – aggregate partial scores at run-time
• Main intuition:
  – top-k items will appear near the top of sufficiently many lists
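The naïve algorithm can be sketched on the data of Example 1 as follows. The function name naive_top_k and the dictionary layout are illustrative assumptions; the scoring function (income + netWorth) is the one from the example.

```python
from typing import Dict, List, Tuple

# Example 1: id -> (income in K$, netWorth in K$).
items: Dict[str, Tuple[int, int]] = {
    "r1": (150, 350), "r2": (150, 425), "r3": (125, 450), "r4": (100, 450),
    "r5": (100, 200), "r6": (80, 100), "r7": (75, 500), "r8": (75, 50),
    "r9": (50, 300), "r10": (50, 50),
}


def naive_top_k(data: Dict[str, Tuple[int, int]], k: int) -> List[Tuple[str, int]]:
    """Score every item, sort by decreasing score, return the first k."""
    scored = [(item_id, income + net_worth)
              for item_id, (income, net_worth) in data.items()]
    # Full sort over all N items: O(N log N) time, O(N) space,
    # even though only k << N results are wanted.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


for item_id, score in naive_top_k(items, 3):
    print(item_id, score)
```

For k = 3 the result consists of r2, r3 and r7, which are tied at a score of 575. The sketch touches every item and every attribute, which is exactly the cost that the pre-computed inverted lists are meant to avoid.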

The basic indexing structure: inverted list

L1                   L2
id     income        id     netWorth
r1     150           r7     500
r2     150           r3     450
r3     125           r4     450
r4     100           r2     425
r5     100           r1     350
r6      80           r9     300
r7      75           r5     200
r8      75           r6     100
r9      50           r8      50
r10     50           r10     50
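The two lists above can be derived from the Example 1 table by sorting once per attribute, in decreasing order of that attribute. The sketch below assumes an in-memory representation; the function name build_inverted_list is hypothetical, and ties are kept in the original id order, which is what Python's stable sort does and which matches the order shown in the lists.

```python
from typing import Dict, List, Tuple

# Example 1: id -> (income in K$, netWorth in K$).
items: Dict[str, Tuple[int, int]] = {
    "r1": (150, 350), "r2": (150, 425), "r3": (125, 450), "r4": (100, 450),
    "r5": (100, 200), "r6": (80, 100), "r7": (75, 500), "r8": (75, 50),
    "r9": (50, 300), "r10": (50, 50),
}


def build_inverted_list(
    data: Dict[str, Tuple[int, int]], attribute: int
) -> List[Tuple[str, int]]:
    """Return (id, value) pairs sorted by decreasing attribute value.
    The sort is stable, so ties keep their original id order."""
    pairs = [(item_id, values[attribute]) for item_id, values in data.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


L1 = build_inverted_list(items, 0)  # sorted by income
L2 = build_inverted_list(items, 1)  # sorted by netWorth

print(L1[:3])
print(L2[:3])
```

The lists are built once, ahead of query time, which is the space-for-time trade the naïve slide calls "throwing space at the problem".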
