Top-k Processing for Search and Information Discovery in Social Applications
Lecture 2: Network-Aware Search in Social Tagging Sites
Sihem Amer-Yahia, Julia Stoyanovich
Social top-k @ Joint RuSSIR/EDBT Summer School 2011

Summary of last lecture
• Semantics of top-k queries
  – Items have scores that are made up of components
  – Components are aggregated using a monotone aggregation function
• Fundamental algorithms
  – Use the inverted list indexing structure
  – Have an access strategy and a stopping condition
  – TA: instance-optimal over the class of reasonable algorithms
  – NRA: useful when random access is expensive or impossible
• Generalizations and extensions

Quote of the day
A city is oneness of the unlike. ~Aristotle

Collaborative tagging sites
• Social content sites are cyber-cities!
• Collaborative tagging sites are one kind of social content site
  – Flickr, YouTube, Delicious, photo tagging in Facebook
• Users
  – contribute content
    • annotate items (photos, videos, URLs, …) with tags
  – form social networks
    • friends/family, interest-based
  – consume content
    • browse own and other users’ items
    • need help discovering relevant content
• Goal
  – Personalize search and information discovery

Outline
✓ Intro
• Semantics
• Personalized ranking functions
• Model and problem statement
• Fundamental indexing structures and algorithms
  – EXACT
  – Global Upper-Bound (GUB)
  – gNRA and gTA
• Performance optimizations
  – Cluster-Seekers
  – Cluster-Taggers

Why network-aware search?
[Figure: two users, Даша and Аня, each shown with the query "shopping"]
Result relevance depends on who is asking the query!

Data model
Tagged(user u, item i, tag t)
  Roger,  i1,  music
  Roger,  i3,  music
  Roger,  i5,  sports
  …
  Hugo,   i1,  music
  Hugo,   i22, music
  …
  Minnie, i2,  sports
  …
  Linda,  i2,  football
  Linda,  i28, news
  …
Link(user u, user v), with u the seeker and v the tagger
Network(u) = { v | Link(u, v) }
Seekers = Π_user Link
Taggers = Π_user Tagged
Items(u) = Π_item(σ_{user=u} Tagged)
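
The relations above can be sketched with plain Python sets. This is only an illustration: the Tagged rows are the ones listed on the slide, while the Link rows and the helper names (`network`, `items`) are assumptions made up for the example.

```python
# Tagged(user, item, tag): rows as listed on the slide.
tagged = {
    ("Roger", "i1", "music"),
    ("Roger", "i3", "music"),
    ("Roger", "i5", "sports"),
    ("Hugo", "i1", "music"),
    ("Hugo", "i22", "music"),
    ("Minnie", "i2", "sports"),
    ("Linda", "i2", "football"),
    ("Linda", "i28", "news"),
}

# Link(seeker, tagger): hypothetical rows, the slide does not list any.
link = {("Roger", "Hugo"), ("Roger", "Linda"), ("Minnie", "Roger")}


def network(u):
    """Network(u) = { v | Link(u, v) }."""
    return {v for (s, v) in link if s == u}


def items(u):
    """Items(u): projection on item of the Tagged rows with user = u."""
    return {i for (user, i, _tag) in tagged if user == u}


# Seekers: projection of Link on its first (seeker) column.
seekers = {u for (u, _v) in link}
# Taggers: projection of Tagged on its user column.
taggers = {u for (u, _i, _t) in tagged}

print(sorted(network("Roger")))
print(sorted(items("Roger")))
```

Sets make the projections duplicate-free, which matches the relational reading of Π.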

Semantics of relevance
• The system may derive any number of networks
  – Are they useful?
  – Which of them are more useful than others?
• Goal: capture user interests based on social behavior
  – Tagging: an implicit social tie
  – Friendship: an explicit social tie
• Validation: modeling tagging patterns in Delicious [AAAI-SIP 2008]
  – Is there overall consensus on the tagging?
  – Is my tagging similar to that of my friends?
  – Is my tagging similar to that of people who use the same tags as I do?
  – Is my tagging similar to that of people who tag the same items as I do?

Quantifying agreement between users
• Let’s forget about top-k for a second
  – Consider items(u) and items(v) as sets
• Directed
    agr(u, v) = |items(u) ∩ items(v)| / |items(u)|
    agr(u, v) ≠ agr(v, u) in general
• Undirected (Jaccard similarity)
    agr(u, v) = |items(u) ∩ items(v)| / |items(u) ∪ items(v)|
    agr(u, v) = agr(v, u)
• Many other options; we will focus on these two for simplicity
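
A minimal sketch of the two agreement measures, assuming items are held in Python sets. The function names and the two example sets are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not something the slide specifies.

```python
def directed_agreement(items_u, items_v):
    """|items(u) ∩ items(v)| / |items(u)|; 0.0 when u has no items (assumed convention)."""
    if not items_u:
        return 0.0
    return len(items_u & items_v) / len(items_u)


def jaccard_agreement(items_u, items_v):
    """|items(u) ∩ items(v)| / |items(u) ∪ items(v)|; 0.0 when both sets are empty."""
    union = items_u | items_v
    if not union:
        return 0.0
    return len(items_u & items_v) / len(union)


# Hypothetical item sets, for illustration only.
u_items = {"a", "b", "c", "d"}
v_items = {"c", "d", "e"}

print(directed_agreement(u_items, v_items))  # 2 shared items out of u's 4
print(directed_agreement(v_items, u_items))  # 2 shared items out of v's 3
print(jaccard_agreement(u_items, v_items))   # 2 shared items out of 5 distinct
```

The directed measure changes when the arguments are swapped because only the denominator depends on the first user; the Jaccard measure does not.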

Take 1: no need for personalization

Global top-10
Rank  URL                 Votes
1     google.com          980
2     facebook.com        820
3     iTunes.com          729
4     twitter.com         720
5     jonasbrothers.com   680
6     cnn.com             678
7     amazon.com          620
8     yahoo.com           525
9     youtube.com         524
10    techcrunch.com      492

Items(Даша)
URL              Tag
jars.com         java
java.sun.com     java
techcrunch.com   news
devshed.com      tutorial

Items(Маша)
URL              Tag
bbc.co.uk        news
pbs.org          news
tomwaits.com     music
nick-cave.com    music
loureed.com      music

Quality: coverage(Global top-10) = 3%
Applicability: scope(Global top-10) = 100%

Take 2: account for tags only
Intuition: if a user tags with “music” she is interested in music

Top-10 for “music”
Rank  URL                 Votes
1     iTunes.com          542
2     eMusic.com          420
3     pandora.com         350
4     thebeatles.com      330
5     jonasbrothers.com   215
6     madonna.com         175
7     rhapsody.com        148
8     tomwaits.com        133
9     lastfm.com          120
10    beyonce.com         107

Top-10 for “news”
Rank  URL                 Votes
1     cnn.com             610
2     bbc.co.uk           503
3     npr.org             427
4     nytimes.com         414
5     slashdot.org        392
6     reuters.com         330
7     news.cnet.com       290
8     msnbc.msn.com       250
9     news.yahoo.com      180
10    digg.com            149

Items(Маша)
URL              Tag
bbc.co.uk        news
pbs.org          news
tomwaits.com     music
nick-cave.com    music
loureed.com      music

Tags used   Coverage   Scope
1 tag       10%        32%
2 tags      14%        14%
3 tags      18%        6%

Take 2: what’s the problem?

Take 3: account for items only
Intuition: interests of users who tag similar items are similar
agr(u, v) = |items(u) ∩ items(v)| / |items(u)|

Items(Маша)
URL              Tag
bbc.co.uk        news
pbs.org          news
tomwaits.com     music
nick-cave.com    music
nirvana.com      music

Items(Аня)
URL              Tag
bbc.co.uk        news
pbs.org          news
nytimes.com      news
nirvana.com      music
metallica.com    music
acdc.com         music
jars.com         work
techcrunch.com   work

Items(Ваня)
URL              Tag
jars.com         work
java.sun.com     work
techcrunch.com   work
devshed.com      work
web2expo.com     work
technorati.com   work
javablogs.com    work
trenitalia.it    play

Items(Даша)
URL              Tag
jars.com         java
java.sun.com     java
techcrunch.com   news
devshed.com      tutorial

Link(u, v, agr)
u      v      agr
Аня    Маша   3/8
Маша   Аня    3/5
Ваня   Даша   1/2
Даша   Ваня   1
Аня    Ваня   1/4
Ваня   Аня    1/4
Аня    Даша   1/4
Даша   Аня    1/2

coverage up to 85%, but scope very low, about 1%
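
As a worked check of the Link(u, v, agr) table, the sketch below recomputes the directed agreement from the four users' item sets as read off this slide. The dictionary layout and the function name `agr` are choices made for the example; exact fractions are used so the values can be compared with the slide directly.

```python
from fractions import Fraction

# Item sets per user, as read from the slide's Items(...) tables (tags omitted).
items = {
    "Маша": {"bbc.co.uk", "pbs.org", "tomwaits.com", "nick-cave.com", "nirvana.com"},
    "Аня": {"bbc.co.uk", "pbs.org", "nytimes.com", "nirvana.com",
            "metallica.com", "acdc.com", "jars.com", "techcrunch.com"},
    "Ваня": {"jars.com", "java.sun.com", "techcrunch.com", "devshed.com",
             "web2expo.com", "technorati.com", "javablogs.com", "trenitalia.it"},
    "Даша": {"jars.com", "java.sun.com", "techcrunch.com", "devshed.com"},
}


def agr(u, v):
    """Directed agreement |items(u) ∩ items(v)| / |items(u)| as an exact fraction."""
    return Fraction(len(items[u] & items[v]), len(items[u]))


# Keep only pairs of distinct users that share at least one item.
link = {
    (u, v): agr(u, v)
    for u in items
    for v in items
    if u != v and items[u] & items[v]
}

for (u, v), a in sorted(link.items()):
    print(u, v, a)
```

With these sets, Маша shares no items with Ваня or Даша, so only eight directed pairs remain, the same eight that appear in the table.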

Other options?
• Take 4: account for tags and items
  – Intuition: multiple interests per user, overlap in items per tag
  – coverage up to 82%, scope up to 7%
• Take 5: account for friendship
  – Intuition: interests of users are similar to those of their friends
  – coverage = 43%, scope = 31%
Social behavior (friendship and tagging) is reflective of a user’s interests. That is, network-aware search makes sense.

Recall the data model
Tagged(user u, item i, tag t)
  Roger,  i1,  music
  Roger,  i3,  music
  Roger,  i5,  sports
  …
  Hugo,   i1,  music
  Hugo,   i22, music
  …
  Minnie, i2,  sports
  …
  Linda,  i2,  football
  Linda,  i28, news
  …
Link(user u, user v), with u the seeker and v the tagger
Network(u) = { v | Link(u, v) }
Seekers = Π_u Link
Taggers = Π_u Tagged

Problem statement
• A query is a set of tags [VLDB 2008]
    Q = {t_1, t_2, …, t_n}
• For a seeker u, a tag t, and an item i
    score(i, u, t) = |Network(u) ∩ { v : Tagged(v, i, t) }|
    score(i, u, Q) = score(i, u, t_1) + score(i, u, t_2) + … + score(i, u, t_n)
Given a query Q issued by a seeker u, we wish to efficiently determine the top k items, i.e., the k items with the highest overall score.
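
The scoring function can be sketched directly from its definition. This is a brute-force illustration under stated assumptions, not one of the lecture's algorithms: the Tagged and Link rows below are partly invented, the function names are hypothetical, and ties are broken by item id only to make the output deterministic.

```python
# Hypothetical toy data in the shape of Tagged(user, item, tag) and Link(seeker, tagger).
tagged = {
    ("Hugo", "i1", "music"),
    ("Linda", "i1", "music"),
    ("Roger", "i1", "music"),
    ("Hugo", "i22", "music"),
    ("Minnie", "i2", "sports"),
    ("Linda", "i2", "sports"),
    ("Linda", "i28", "news"),
}
link = {("Roger", "Hugo"), ("Roger", "Linda"), ("Minnie", "Roger")}


def network(u):
    """Network(u) = { v | Link(u, v) }."""
    return {v for (s, v) in link if s == u}


def score(i, u, t):
    """Number of users in u's network who tagged item i with tag t."""
    taggers = {v for (v, item, tag) in tagged if item == i and tag == t}
    return len(network(u) & taggers)


def query_score(i, u, query):
    """Sum of the per-tag scores over all tags in the query."""
    return sum(score(i, u, t) for t in query)


def top_k(u, query, k):
    """Score every item carrying a query tag, then keep the k best (brute force)."""
    candidates = {i for (_v, i, t) in tagged if t in query}
    ranked = sorted(candidates, key=lambda i: (-query_score(i, u, query), i))
    return [(i, query_score(i, u, query)) for i in ranked[:k]]


print(top_k("Roger", {"music", "sports"}, 2))
print(top_k("Minnie", {"music", "sports"}, 2))
```

The same query gives different rankings for the two seekers because only taggers inside each seeker's network contribute to the score.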

Outline
✓ Intro
✓ Semantics
✓ Personalized ranking functions
✓ Model and problem statement
• Fundamental indexing structures and algorithms
  – EXACT
  – Global Upper-Bound (GUB)
  – gNRA and gTA
• Performance optimizations
  – Cluster-Seekers
  – Cluster-Taggers

Recall standard top-k algorithms
Q = {t_1, t_2, …, t_n}; score(i, Q) = score(i, t_1) + score(i, t_2) + … + score(i, t_n)
Indexing: per-tag inverted lists, each sorted on score [PODS 2001]
The NRA algorithm (no random access)
– access all lists sequentially, in parallel
– maintain a heap sorted on partial scores
– stop when the score of the k-th item > sum of the current list scores

Example with K = 1, two inverted lists (the slide labels them tag = shopping and tag = shoes)

List 1           List 2
item  score      item  score
i5    99         i1    30
i2    80         i5    29
i1    78         i4    27
i7    75         i2    25
i8    72         i3    23
i6    63         i6    20
i4    60         i7    15
i3    50         i9    13

top-K heap after reading two entries from each list
item  score
i5    128
i2    80
i1    30

Stopping condition: 128 > 29 + 80
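
A minimal sketch of NRA on the two example lists, assuming non-negative scores and a score of 0 for an item missing from a list. The function name `nra` and its return shape are hypothetical. Note one deliberate difference from the slide's wording: the check below compares the k-th partial score with the best score any other item could still reach (its seen scores plus the current list scores for the lists where it is unseen), which is slightly stricter than comparing with the sum of the current list scores alone.

```python
def nra(lists, k):
    """Return (top-k [(item, partial score)], rounds of sorted access used).

    lists: inverted lists, each [(item, score), ...] sorted by score descending.
    Assumes at least one non-empty list and non-negative scores.
    """
    seen = {}  # item -> {list index: score seen in that list}
    n = len(lists)
    for depth in range(1, max(len(lst) for lst in lists) + 1):
        # One round of sorted access: read the next entry of every list.
        for j, lst in enumerate(lists):
            if depth <= len(lst):
                item, s = lst[depth - 1]
                seen.setdefault(item, {})[j] = s
        # Score at the current position of each list (0 once a list is exhausted).
        last = [lst[depth - 1][1] if depth <= len(lst) else 0 for lst in lists]
        lower = {i: sum(p.values()) for i, p in seen.items()}
        upper = {i: sum(p.get(j, last[j]) for j in range(n)) for i, p in seen.items()}
        top = sorted(lower, key=lambda i: (-lower[i], i))[:k]
        if len(top) < k:
            continue
        # Best score any item outside the current top-k (seen or unseen) could reach.
        rest = [upper[i] for i in seen if i not in top] + [sum(last)]
        if lower[top[-1]] > max(rest):
            return [(i, lower[i]) for i in top], depth
    # All lists exhausted: partial scores are now exact.
    top = sorted(lower, key=lambda i: (-lower[i], i))[:k]
    return [(i, lower[i]) for i in top], depth


# The two inverted lists from the example on this slide.
list_1 = [("i5", 99), ("i2", 80), ("i1", 78), ("i7", 75),
          ("i8", 72), ("i6", 63), ("i4", 60), ("i3", 50)]
list_2 = [("i1", 30), ("i5", 29), ("i4", 27), ("i2", 25),
          ("i3", 23), ("i6", 20), ("i7", 15), ("i9", 13)]

result, rounds = nra([list_1, list_2], k=1)
print(result, rounds)
```

On this input the run stops after two rounds with i5 at 128: the best any other item could still reach is 110 (i1, seen at 30 in one list plus the current score 80 of the other), and an unseen item is bounded by 29 + 80 = 109, the quantity in the slide's stopping condition.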

Naïve solution: EXACT
• Maintain a single inverted list per (seeker, tag), items ordered by score
  + can use standard top-k algorithms
  -- high space overhead

Conservative example:
– 100K users, 1M items, 1K tags
– 20 tags/item from 5% of the taggers
– 10 bytes per inverted list entry
– 1 Terabyte of storage!
Don’t try this at home!

tag = shoes
seeker Даша      seeker Аня
item  score      item  score
i5    99         i1    30
i2    80         i8    29
i8    78         i4    27
i7    75         i2    25
i1    72         i3    23
i6    63         i6    20
i4    60         i7    15
i3    50         i9    13

tag = shopping
seeker Даша      seeker Аня
item  score      item  score
i5    53         i1    73
i9    36         i2    65
i2    30         i3    62
i6    15         i4    40
i1    14         i5    39
i8    10         i6    18
i7    10         i7    16
i3    5          i8    16
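
The sketch below illustrates the EXACT idea and the space estimate. Both parts rest on assumptions: the index builder and its toy data are hypothetical, and the arithmetic uses one reading of the slide's numbers that reproduces the 1 Terabyte figure (each of the 1M × 20 (item, tag) pairs appears in the lists of about 5% of the 100K users); the slide itself does not spell out the calculation.

```python
from collections import defaultdict


def build_exact_index(tagged, link):
    """Return {(seeker, tag): [(item, score), ...]} with items ordered by score.

    tagged: set of (tagger, item, tag) triples; link: set of (seeker, tagger) pairs.
    """
    by_tagger = defaultdict(list)
    for (v, i, t) in tagged:
        by_tagger[v].append((i, t))
    counts = defaultdict(lambda: defaultdict(int))
    for (u, v) in link:
        # Every tagging by a member of u's network adds 1 to score(i, u, t).
        for (i, t) in by_tagger[v]:
            counts[(u, t)][i] += 1
    return {
        key: sorted(per_item.items(), key=lambda e: (-e[1], e[0]))
        for key, per_item in counts.items()
    }


# Hypothetical toy data.
tagged = {("Hugo", "i1", "music"), ("Linda", "i1", "music"), ("Hugo", "i22", "music")}
link = {("Roger", "Hugo"), ("Roger", "Linda"), ("Minnie", "Hugo")}
index = build_exact_index(tagged, link)
print(index[("Roger", "music")])

# Space estimate under the assumed reading of the slide's numbers.
users, n_items, tags_per_item, bytes_per_entry = 100_000, 1_000_000, 20, 10
seekers_per_pair = users * 5 // 100          # 5% of the users
entries = n_items * tags_per_item * seekers_per_pair
total_bytes = entries * bytes_per_entry
print(total_bytes)                            # 10**12 bytes, i.e. 1 Terabyte
```

The index has one list per (seeker, tag) pair, so its size grows with the number of seekers whose networks touch each tagging, which is where the space overhead comes from.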
