- 9. Social Media
Advanced Topics in Information Retrieval: 9. Social Media
Jannik Strötgen (jtroetge@mpi-inf.mpg.de), Vinay Setty (vsetty@mpi-inf.mpg.de)
Outline
9.1. What is Social Media?
9.2. Tracking Memes
9.3. Opinion Retrieval
9.4. Feed
(no need to know HTML, CSS, JavaScript)
publishers) or collaboratively-edited (as
http://mybiasedcoin.blogspot.de
between users, user profiles, comments, external urls)
happen
(e.g., how to maintain indexes and collection statistics)
from a blog search engine (blogdigger.com) and found that
person, location, organization) mentioned, for instance, to find out
and filtering queries
compared query logs from web search and Twitter, finding that queries on Twitter
words), less diverse, and more often relate to social gossip and recent events
17% of tweets in the analyzed data correspond to questions
time
need (e.g., how is life in Iceland?)
(e.g., a company or celebrity) which express an opinion about it
can subscribe to their posts (e.g., who tweets about C++?)
most important news stories (e.g., to display on front page)
visualize their volume in traditional news and blogs
mentions of the same meme need to be identified
that are reasonably long and occur often enough
(i.e., u can be transformed into v by adding at most ε tokens)
and how often v occurs in the document collection
[Figure: phrase graph of quote variants, including "palling around with terrorists who would target their own country"; "that he s palling around with terrorists who would target their own country"; "pal around with terrorists who targeted their own country"; "palling around with terrorists who target their own country"; "sees america as imperfect enough to pal around with terrorists who targeted their own country"; "terrorists who would target their own country"; "a force for good in the world"; "we see america as a force of good in this world"; "we see an america of exceptionalism"]
construction
having minimum total weight, so that each resulting component is single-rooted
hence addressed by greedy heuristic algorithm
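The phrase-graph idea above can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the edge weight `counts[v] / added`, the toy phrase counts, and the rule of keeping only the heaviest outgoing edge per phrase, which is a simple stand-in and not necessarily the greedy heuristic used in the lecture. An edge (u, v) is created when v equals u plus at most `eps` added tokens.

```python
def is_subsequence(u, v):
    """Return True if token list u can be obtained from v by deleting tokens."""
    it = iter(v)
    return all(tok in it for tok in u)


def build_phrase_graph(counts, eps):
    """Add an edge (u, v) if v equals u plus at most eps added tokens."""
    tokens = {p: p.split() for p in counts}
    edges = {}
    for u in counts:
        for v in counts:
            added = len(tokens[v]) - len(tokens[u])
            if 0 < added <= eps and is_subsequence(tokens[u], tokens[v]):
                # Assumed weight: frequent targets reached with few added tokens
                # get heavy edges.
                edges[(u, v)] = counts[v] / added
    return edges


def greedy_partition(counts, edges):
    """Keep only the heaviest outgoing edge of each phrase, then group by root."""
    best = {}
    for (u, v), w in edges.items():
        if u not in best or w > edges[(u, best[u])]:
            best[u] = v
    clusters = {}
    for p in counts:
        root = p
        # Edges always lead to strictly longer phrases, so this loop terminates.
        while root in best:
            root = best[root]
        clusters.setdefault(root, set()).add(p)
    return clusters


# Hypothetical phrase frequencies, loosely based on the quote variants above.
counts = {
    "pal around with terrorists": 40,
    "pal around with terrorists who targeted their own country": 25,
    "palling around with terrorists who would target their own country": 60,
    "their own country": 15,
    "force for good": 10,
    "a force for good in the world": 30,
}
edges = build_phrase_graph(counts, eps=7)
clusters = greedy_partition(counts, edges)
```

Because each phrase keeps at most one outgoing edge, every resulting component has exactly one root (a phrase without outgoing edges), which is the single-rooted property the partitioning asks for.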
media
Figure 8: Time lag for blogs and news media (x-axis: time relative to peak [hours], t; y-axis: proportion of total volume; series: mainstream media, blogs). Thread volume in
(e.g., a company or celebrity) which express an opinion about it
but how to determine whether a post expresses an opinion?
Title: whole foods
Description: Find opinions on the quality, expense, and value
Narrative: All opinions on the quality, expense and value of Whole Foods purchases are relevant. Comments on business and labor practices or Whole Foods as a stock investment are not relevant. Statements of produce and other merchandise carried by Whole Foods without comment are not relevant.
(e.g., like, good, bad, awesome, terrible, disappointing)
nor too rare (e.g., aardvark) in the post collection D
$D_{relopt} \subset D_{rel}$ be the subset of relevant opinionated posts
with
$$\frac{P[\,v \mid D_{relopt}\,]}{P[\,v \mid D_{rel}\,]}$$
$$\frac{1 + \lambda}{\lambda} + \log_2(1 + \lambda) \qquad \lambda = \frac{tf(v, D_{rel})}{|D_{rel}|}$$
dictionary
according to a (linear) combination of retrieval scores with 0 ≤ α ≤ 1 as a tunable mixing parameter
score(d) = α · score(d, Q) + (1 − α) · score(d, Qopt)
and query-dependent (QD) opinion words; posts are then ranked according to the three-way combination of retrieval scores given below, with 0 ≤ α, β ≤ 1 as tunable mixing parameters and retrieval scores based on language model divergences
score(d) = α · score(d, Q) + β · score(d, QI) + (1 − α − β) · score(d, QD)
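As a small illustration of the three-way mixture above, the following Python sketch combines three per-post retrieval scores. The scores themselves are hypothetical numbers, not language model divergences, and the check that α + β ≤ 1 is an added assumption so that the third weight stays non-negative. Setting β = 0 reduces it to the two-way mixture of score(d, Q) and one opinion score.

```python
def opinion_score(s_q, s_qi, s_qd, alpha, beta):
    """Mix topical, query-independent and query-dependent opinion scores."""
    if not (0 <= alpha <= 1 and 0 <= beta <= 1 and alpha + beta <= 1):
        # Assumption: the remaining weight (1 - alpha - beta) must not be negative.
        raise ValueError("need 0 <= alpha, beta and alpha + beta <= 1")
    return alpha * s_q + beta * s_qi + (1 - alpha - beta) * s_qd


def rank_posts(post_scores, alpha, beta):
    """Return post ids sorted by decreasing mixed score."""
    return sorted(
        post_scores,
        key=lambda p: opinion_score(*post_scores[p], alpha, beta),
        reverse=True,
    )


# Hypothetical scores per post: (score(d, Q), score(d, QI), score(d, QD)).
post_scores = {"p1": (0.9, 0.1, 0.2), "p2": (0.6, 0.8, 0.7)}
topical_only = rank_posts(post_scores, alpha=1.0, beta=0.0)
mixed = rank_posts(post_scores, alpha=0.5, beta=0.3)
```

With these toy numbers the topically stronger post p1 wins when only score(d, Q) counts, while the more opinionated post p2 moves to the top once the opinion scores receive weight.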
frequently co-occur with query terms in pseudo-relevant documents (following the approach by Lavrenko and Croft [6])
documents, and top-n words having highest probability, with parameters set to k = 5 and n = 20 in practice
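$$P[\,w \mid R\,] \propto \sum_{d \in R} P[\,w \mid d\,] \prod_{v \in q} P[\,v \mid d, w\,]$$
$$P[\,v \mid d, w\,] = \begin{cases} \dfrac{tf(v,d)}{\sum_{u} tf(u,d)} & : w \in d \\ 0 & : \text{otherwise} \end{cases}$$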
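The relevance-model estimate above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: documents are plain token lists, P[w | d] is the unsmoothed relative frequency, query terms themselves are excluded from the expansion words, and the function name `expansion_words` is hypothetical. Since the product over query terms does not depend on w for words occurring in d, it is computed once per document.

```python
from collections import Counter


def expansion_words(query, docs, n):
    """Return the top-n words by P[w | R], estimated from pseudo-relevant docs.

    docs holds the top-k pseudo-relevant documents as lists of tokens.
    """
    scores = Counter()
    for doc in docs:
        tf = Counter(doc)
        length = len(doc)
        # prod_v P[v | d, w] is identical for every w that occurs in d.
        query_likelihood = 1.0
        for v in query:
            query_likelihood *= tf[v] / length
        for w, count in tf.items():
            if w not in query:  # assumption: do not expand with query terms
                scores[w] += (count / length) * query_likelihood
    total = sum(scores.values())
    if total == 0:
        return []
    return [(w, s / total) for w, s in scores.most_common(n)]


# Toy pseudo-relevant documents for the query "whole foods" (hypothetical data).
query = ["whole", "foods"]
docs = [
    "whole foods prices are great great value".split(),
    "whole foods is terrible".split(),
    "foods market news".split(),
]
top_words = expansion_words(query, docs, n=3)
```

The third document lacks the query term "whole", so its unsmoothed query likelihood is zero and it contributes nothing; a smoothed estimate would soften this, which is a design choice outside this sketch.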
that are relevant to a specific (typically rather broad) topic
the given topic? How to bridge vocabulary gap to posts?
Title: baseball
Description: Blogs with recurring interests in Major League Baseball, or lesser leagues, for example, giving news or analysis of games or player moves.
Narrative: Relevant blogs will have news or analysis from the major league baseball and other leagues. Blogs listing only product reviews, or with other nonsensical information are not relevant.
with blog-specific smoothing parameter thus smoothing blogs with shorter posts more aggressively
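$$P[\,q \mid \theta_b\,] = \prod_{v \in q} P[\,v \mid \theta_b\,]^{tf(v,q)}$$
$$P[\,v \mid \theta_b\,] = (1 - \lambda_b) \cdot P[\,v \mid b\,] + \lambda_b \cdot P[\,v \mid B\,]$$
$$\lambda_b = \frac{\beta}{\left(\frac{1}{|b|} \cdot \sum_{p \in b} \sum_{v} tf(v, p)\right) + \beta}$$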
assuming conditional independence of terms given blog
$$P[\,v \mid b\,] = \sum_{p \in b} P[\,v \mid p, b\,] \, P[\,p \mid b\,]$$
$$P[\,v \mid b\,] = \sum_{p \in b} P[\,v \mid p\,] \, P[\,p \mid b\,]$$
from blog
from post
$$P[\,p \mid b\,] = \frac{1}{|b|} \qquad P[\,v \mid p\,] = \frac{tf(v, p)}{\sum_{w} tf(w, p)}$$
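To make the blog-level model concrete, here is a minimal Python sketch that scores a blog by its smoothed query log-likelihood, using the uniform post prior P[p | b] = 1/|b| and the blog-specific smoothing weight λb. The function names, the toy blogs and the value β = 4 are assumptions for illustration, and the sketch assumes that every query term occurs at least once in the collection B (otherwise the logarithm is undefined).

```python
import math
from collections import Counter


def term_prob_in_blog(v, posts):
    """P[v | b] = sum over posts of P[v | p] * P[p | b], with P[p | b] = 1/|b|."""
    return sum(post.count(v) / len(post) for post in posts) / len(posts)


def smoothing_weight(posts, beta):
    """lambda_b = beta / (average post length + beta)."""
    avg_post_len = sum(len(post) for post in posts) / len(posts)
    return beta / (avg_post_len + beta)


def blog_log_likelihood(query, posts, collection_tf, beta):
    """log P[q | theta_b], smoothing with the collection of all posts B."""
    coll_len = sum(collection_tf.values())
    lam = smoothing_weight(posts, beta)
    log_p = 0.0
    for v, tf_q in Counter(query).items():
        p = (1 - lam) * term_prob_in_blog(v, posts) + lam * collection_tf[v] / coll_len
        log_p += tf_q * math.log(p)
    return log_p


# Two hypothetical blogs, each given as a list of tokenized posts.
blog_a = [["baseball", "game", "tonight"],
          ["baseball", "trade", "news", "player", "moves"]]
blog_b = [["pasta", "recipe", "tonight"],
          ["garden", "seed", "flower", "crop", "news"]]
collection_tf = Counter(tok for blog in (blog_a, blog_b) for post in blog for tok in post)

score_a = blog_log_likelihood(["baseball"], blog_a, collection_tf, beta=4)
score_b = blog_log_likelihood(["baseball"], blog_b, collection_tf, beta=4)
```

With an average post length of 4 tokens and β = 4, both toy blogs get λb = 0.5; a blog with shorter posts would receive a larger λb and thus lean more on the collection model.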
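$$P[\,v \mid \theta_p\,] = (1 - \lambda_p) \cdot P[\,v \mid p\,] + \lambda_p \cdot P[\,v \mid B\,]$$
$$P[\,v \mid p\,] = \frac{tf(v, p)}{\sum_{w} tf(w, p)} \qquad \lambda_p = \frac{\beta}{\left(\sum_{w} tf(w, p)\right) + \beta}$$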
$$P[\,q \mid \theta_p\,] = \prod_{v \in q} P[\,v \mid \theta_p\,]^{tf(v,q)}$$
$$P[\,q \mid b\,] = \sum_{p \in b} P[\,q \mid \theta_p\,] \, P[\,p \mid b\,] \qquad P[\,p \mid b\,] = \frac{1}{|b|}$$
from blog
from post
Model (~BM) and Small Document Model (~PM) approaches
descriptions (e.g., garden) and posts (e.g., seed, flower, crop)
documents retrieved from different corpora
NO IMPROVEMENT!
and pointing to a document d from the top-k result as favoring frequent anchor phrases pointing to highly ranked articles
$$score(a) = \sum_{(a,d)} (k - rank(d))$$
anchor phrase a from top-n article pointing to top-k article d
http://en.wikipedia.org/wiki/United_States
united states; united states of america; america; land of the free; the states
IMPROVEMENT!
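The anchor-phrase scoring can be written down in a few lines. The sketch below assumes 1-based ranks (the slides do not say whether ranks start at 0 or 1), lower-cases anchor text, and uses a hypothetical link representation of (source article, anchor phrase, target article) triples; the article ids and links are made up.

```python
from collections import Counter


def score_anchor_phrases(ranking, links, k, n):
    """score(a) = sum of (k - rank(d)) over links with anchor a.

    Only links from a top-n article to a top-k article d are counted.
    ranking lists article ids, best first.
    """
    rank = {d: i + 1 for i, d in enumerate(ranking[:k])}  # assumed 1-based ranks
    top_n = set(ranking[:n])
    scores = Counter()
    for source, anchor, target in links:
        if source in top_n and target in rank:
            scores[anchor.lower()] += k - rank[target]
    return scores


# Hypothetical retrieval result and link data.
ranking = ["United_States", "Baseball", "Garden"]
links = [
    ("Baseball", "america", "United_States"),
    ("Garden", "united states", "United_States"),
    ("Baseball", "United States", "United_States"),
    ("Garden", "mlb", "Baseball"),
    ("Other", "usa", "United_States"),  # source is not in the top-n, so it is ignored
]
scores = score_anchor_phrases(ranking, links, k=2, n=3)
```

Phrases that occur often as anchors and point to highly ranked articles accumulate the largest scores and would be the natural candidates for query expansion.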
Thousands of news articles generated each day!
Portal:Current events
track) aims to identify the most important news stories for a specific day d based on their coverage in the blogosphere
using language models estimated from news and blogs
(Note: This is a simplified version of the approach described in [7])
are considered as candidates and ranked by the approach
$$\mathrm{Importance}(n, d) \propto KL(\theta_n \,\|\, \theta_{B_d})$$
$\theta_n$: LM representing news article n
$\theta_{B_d}$: LM representing posts published at day d
$$P[\,v \mid \theta_{B_d}\,] = \frac{tf(v, B_d) + \mu \cdot \frac{tf(v, B)}{\sum_{w} tf(w, B)}}{\left(\sum_{w} tf(w, B_d)\right) + \mu}$$
using Dirichlet smoothing with the entire news collection N
Bn retrieved using headline as query and published within
smoothing with the collection of all posts B
article content and top-k pseudo-relevant blog posts
$$P[\,v \mid \theta_n\,] = \frac{tf(v, n) + \mu \cdot \frac{tf(v, N)}{\sum_{w} tf(w, N)}}{\left(\sum_{w} tf(w, n)\right) + \mu}$$
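The two Dirichlet-smoothed language models and their KL divergence can be sketched as follows. This is a toy sketch under loud assumptions: the news collection N and the blog collection B share one vocabulary (so no probability is zero), µ = 1 is chosen only because the documents are tiny, and all texts are made up. The sketch computes the divergence only; a smaller value means the article's language is closer to what bloggers wrote on that day, and how this maps onto the importance score depends on the sign convention of the ranking formula.

```python
import math
from collections import Counter


def dirichlet_lm(tf, background_tf, mu):
    """Dirichlet-smoothed language model over the background vocabulary."""
    bg_len = sum(background_tf.values())
    length = sum(tf.values())
    return {
        v: (tf[v] + mu * background_tf[v] / bg_len) / (length + mu)
        for v in background_tf
    }


def kl_divergence(p, q):
    """KL(p || q); assumes q[v] > 0 wherever p[v] > 0."""
    return sum(pv * math.log(pv / q[v]) for v, pv in p.items() if pv > 0)


# Hypothetical news articles and blog posts; N and B share the same four words.
n1 = Counter("election vote vote election".split())
n2 = Counter("storm rain storm rain".split())
news_collection = n1 + n2                                               # N
posts_day_d = Counter("election vote election vote election storm".split())  # B_d
posts_other = Counter("rain storm rain storm rain vote".split())
blog_collection = posts_day_d + posts_other                              # B

theta_bd = dirichlet_lm(posts_day_d, blog_collection, mu=1.0)
kl_n1 = kl_divergence(dirichlet_lm(n1, news_collection, mu=1.0), theta_bd)
kl_n2 = kl_divergence(dirichlet_lm(n2, news_collection, mu=1.0), theta_bd)
```

In this toy setting the election article diverges far less from the day's blog posts than the weather article does, matching the intuition that bloggers were mostly writing about the election on day d.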
VOCABULARY GAP?!?
grouping variants of memes to track them over time
finds posts expressing an opinion about a specific named entity
identifies feeds worth following for a given high-level topic
spots most important news articles based on coverage in blogs
are a common obstacle in IR but can often be bridged
are versatile and can be used to address many (if not most) tasks
[1] . Kolari, J. Bai, F. Diaz, Y. Chang, Z. Zheng: Time is of the Essence: Improving Recency Ranking Using Twitter Data, WWW 2010
[2] Information Search and Retrieval in Microblogs, JASIST, 62(6):996–1008, 2011
[3] Retrieval and Feedback Models for Blog Feed Search, SIGIR 2008
[4] An Effective Statistical Approach for Blog Post Opinion Retrieval, CIKM 2008
[5] A Unified Relevance Model for Opinion Retrieval, CIKM 2009
[6] Relevance-Based Language Models, SIGIR 2001
[7] Identifying top news stories based on their popularity in the blogosphere, Information Retrieval 17:326–350, 2014
[8] A Study of Blog Search, ECIR 2006
[9] Information Retrieval on the Blogosphere, FTIR 6(1):1–125, 2012
[10] #TwitterSearch: A Comparison of Microblog Search and Web Search, WSDM 2011
[11] Blog feed search with a post index, Information Retrieval 14:515–545, 2011