IR in Social Media
Andreas Hotho
Data Mining and Information Retrieval Group
University of Würzburg
Data Mining and Information Retrieval Group @ Würzburg
Founded in 2009 as an academic offspring of Kassel
www.wordle.net
Agenda
Introduction
Understanding the Data
Folksonomies and Search
Search IR related approaches
Spam in Folksonomies
Semantics in Folksonomies
Summary and Outlook
Definition: Web 2.0
“The term Web 2.0 is commonly associated with web applications that facilitate interactive information sharing, interoperability, user-centered design, and collaboration on the World Wide Web. Although the term suggests a new version of the World Wide Web, it does not refer to an update to any technical specifications, but rather to cumulative changes in the ways software developers and end-users use the Web.”
Wikipedia http://en.wikipedia.org/wiki/Web_2.0
„Fragmented Future“.
Sep. 2005) and the Web 2.0 conference in 2004.
Definition: Web 2.0
The version number in “Web 2.0”, known from software updates, points to an updated web that includes services like
Folksonomies (collaborative tagging, social classification, social indexing, and social tagging),
Weblogs (blogs), Wikis, ...
and technologies like
RSS feeds,
Podcasts,
Mashups (merging content from different sources, client- and server-side),
Application Programming Interfaces (API),
APIs with REST and/or XML- and/or JSON-based APIs,
AJAX-based web applications.
A Map of the Web 2.0
artwork by R. Munroe http://xkcd.com/
30.08.2011 Andreas Hotho 8
Web 2.0 – Collaborative Tagging
In this lecture we will focus on collaborative tagging, in particular on social bookmarking:
everybody knows (web) bookmarks,
has them in his/her own browser,
uses them on a daily basis;
bookmark repositories emerge totally independently.
An interesting source of data whose structure
can be used to do good ranking,
is similar to click log data, and
shows properties of emergent semantics.
Web Bookmarks
Tags, User, Resource
Audio Streams
Tags, Users, Resource
Photos
Tags, User, Resource
Videos
Tags, User, Resource
Folksonomies
Folksonomies allow users to assign tags to resources.
A folksonomy is a tuple $F := (U, T, R, Y, \prec)$ where
U, T, and R are finite sets, whose elements are called users, tags and resources,
$Y \subseteq U \times T \times R$, called set of tag assignments,
$\prec \subseteq U \times T \times T$ is a user-specific sub-tag/super-tag relation.
The personomy $P_u$ of user u is the restriction of F to u.
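The definition can be made concrete with a small sketch. It assumes tag assignments are stored as plain (user, tag, resource) triples and leaves out the sub-tag/super-tag relation; the names `TagAssignment`, `folksonomy_sets` and `personomy` and the toy data are made up for illustration.

```python
from collections import namedtuple

# A tag assignment (u, t, r): user u attached tag t to resource r.
TagAssignment = namedtuple("TagAssignment", ["user", "tag", "resource"])

# Toy data; users, tags and URLs are invented for this example.
Y = {
    TagAssignment("alice", "python", "http://example.org/a"),
    TagAssignment("alice", "programming", "http://example.org/a"),
    TagAssignment("bob", "python", "http://example.org/b"),
    TagAssignment("bob", "web", "http://example.org/a"),
}


def folksonomy_sets(assignments):
    """Derive the sets U, T and R from the tag assignments Y."""
    users = {a.user for a in assignments}
    tags = {a.tag for a in assignments}
    resources = {a.resource for a in assignments}
    return users, tags, resources


def personomy(assignments, user):
    """Restrict the tag assignments to those of a single user."""
    return {a for a in assignments if a.user == user}


U, T, R = folksonomy_sets(Y)
P_alice = personomy(Y, "alice")
```

Storing only Y and deriving U, T and R from it keeps the three sets consistent with the assignments; a real system would also keep users or resources that have no tag assignments yet.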
Our system: BibSonomy
BibSonomy
for sharing bookmarks,
for managing publication lists
for researchers, for research groups, for projects, ...
http://www.bibsonomy.org/
Dataset
Data from the Delicious folksonomy site
Obtained in July 2005 (14 monthly dumps, June 2004 – July 2005)
Consists of
Data from BibSonomy
Latest obtained in July 2006 (20 monthly dumps)
Consists of
Power Law Distribution in Delicious
tag “unlabeled” occurs 415,950 times
tag “web” occurs 238,891 times
Small World
Milgram introduced the notion of a „small world“:
(Stanley Milgram. The small world problem. Psychology Today, 67(1):61–67, 1967.)
Practical experiment in the US
Any two persons in the US are connected by a very short chain: six degrees of separation
Formal definition of the small world property for graphs:
Characteristic path length comparable to that of an (Erdős) random graph
Large clustering degree
Folksonomies exhibit small world properties:
Small characteristic path lengths
Large clustering degree (connectedness and cliquishness)
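Both quantities in the small world definition can be computed directly. The following is a minimal sketch on an ordinary undirected graph given as an adjacency dictionary; the adapted measures used for the tripartite folksonomy structure are not reproduced here, and the function names and toy graph are assumptions for illustration.

```python
from collections import deque


def characteristic_path_length(adj):
    """Average shortest-path length over all connected ordered node pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def clustering_coefficient(adj):
    """Mean fraction of a node's neighbor pairs that are themselves linked."""
    values = []
    for node, neighbors in adj.items():
        nbs = list(neighbors)
        k = len(nbs)
        if k < 2:
            values.append(0.0)  # fewer than two neighbors: no pairs to link
            continue
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbs[j] in adj[nbs[i]])
        values.append(2.0 * links / (k * (k - 1)))
    return sum(values) / len(values)


# Toy undirected graph: a triangle a-b-c with a pendant node d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
L = characteristic_path_length(graph)
C = clustering_coefficient(graph)
```

To judge the small world property, both values would be compared against a random graph with the same number of nodes and edges: a similar L together with a much larger C indicates a small world.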
Networks of Tag Co-Occurrence
Strength of a node t: total weight of its edges
$P_>(s)$ := probability of node strength exceeding s
between posts
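Node strength and the cumulative strength distribution can be sketched as follows. The sketch assumes that two tags are linked when they appear in the same post and that the edge weight counts such posts; this weighting, the function names and the toy posts are assumptions, not taken from the slides.

```python
from collections import defaultdict
from itertools import combinations

# Toy posts: each post is the set of tags one user gave one resource.
posts = [
    {"web", "design", "css"},
    {"web", "design"},
    {"web", "python"},
    {"python"},
]


def cooccurrence_weights(posts):
    """Edge weight = number of posts in which two tags appear together."""
    weights = defaultdict(int)
    for tags in posts:
        for t1, t2 in combinations(sorted(tags), 2):
            weights[(t1, t2)] += 1
    return weights


def node_strengths(weights):
    """Strength of a tag = total weight of its edges."""
    strength = defaultdict(int)
    for (t1, t2), w in weights.items():
        strength[t1] += w
        strength[t2] += w
    return dict(strength)


def cumulative_strength(strengths, s):
    """P_>(s): fraction of nodes whose strength exceeds s."""
    return sum(1 for v in strengths.values() if v > s) / len(strengths)


weights = cooccurrence_weights(posts)
strengths = node_strengths(weights)
share_above_2 = cumulative_strength(strengths, 2)
```

Evaluating `cumulative_strength` for a range of s values and plotting the result on log-log axes gives the kind of curve the following slides discuss.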
Cumulative Strength Distribution
Fat-tailed distribution
Irregularities due to spamming activity, e.g.
Large number of tags per post
Regular number of tags (10, 50) per post
Same distribution for shuffled tags
Behaviour determined solely by tag frequencies
[Figure: $P_>(s)$ versus s; panels: Delicious, BibSonomy]
Nearest-Neighbor Strength
$S_{nn}$ positively correlated to s
$S_{nn}$ negatively correlated to s
Nearest-Neighbor Strength: Delicious
[Figure: $S_{nn}$ versus s]
Related Work – Analysis of the Folksonomy Graph
More under: http://www.bibsonomy.org/tag/network+folksonomy
Logsonomies allow users to assign resources to query terms.
Folksonomies & Logsonomies
Social Bookmarking Systems: Folksonomies allow users to assign tags to resources.
Query Logs from Search Engines: Logsonomies allow users to query terms and click results.
Example post: “Knowledge and Data Engineering”, Knowledge and Data Engineering Group at the University of Kassel, to knowledge management data engineering by jaeschke and 1 other person
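A logsonomy can be derived from a click log in a few lines. This is a sketch under the assumption that each log entry holds a user id, a query string and a clicked URL; the options mirror the variants named later in the deck (split vs. complete queries, complete vs. host-only URLs). The function name, parameters and log entries are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical click log entries: (user id, query string, clicked URL).
click_log = [
    ("u1", "java tutorial", "http://example.org/java/intro.html"),
    ("u2", "java", "http://example.org/java/intro.html"),
]


def logsonomy(log, split_queries=True, host_only=False):
    """Turn click log entries into (user, query term, resource) triples."""
    triples = set()
    for user, query, url in log:
        # Resource is either the complete URL or only its host part.
        resource = urlparse(url).netloc if host_only else url
        # Query terms play the role of tags.
        terms = query.lower().split() if split_queries else [query.lower()]
        for term in terms:
            triples.add((user, term, resource))
    return triples


Y_split = logsonomy(click_log)
Y_host = logsonomy(click_log, split_queries=False, host_only=True)
```

The resulting triples have the same shape as tag assignments in a folksonomy, so the same network measures and ranking methods can be applied to both.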
Datasets
Logsonomy Dataset
Most frequent Tags
Degree distribution of tags/query words/queries.
Small World Properties
[Figure: example graph of users (User 1–3), tags (Tag 2, Tag 3) and resources (Res 1–3)]
Logsonomies show similar properties.
Results: Average Nearest Neighbor Strength
[Figure panels: Del.icio.us, AOL; split queries, complete queries; complete URLs, host only URLs]
Types of Tags [Golder & Huberman, 2006]
Golder & Huberman identified seven types of tags:
(also called sentiment tags)
Self reference, e.g., myown
purpose tags)
Additionally, we can find
System tags, e.g., for:andrea
Types of Users [Marlow et al., 2006]
A: consistently new tags as new photos are uploaded
B: few tags, sudden growth later
Types of Users [Strohmaier et al., 2010]
Evidence of different ways HOW users tag (Tagging Pragmatics)
Broad distinction by tagging motivation [Strohmaier 2009]:
„Categorizers“ …
tags, for later browsing
„Describers“ …
(synonyms, spelling variants, …)
[Figure: example tags: donuts, duff, marge, beer, bart, barty, Duff-beer, bev, alc, nalc, beer, wine]
Types of Resources
Basically, there are systems to tag anything …
videos, goals in life, bookmarks, photos, news, publication references, contacts
… to name just a few.
Types of Tags - Related Work
S. Golder and B. Huberman, Journal of Information Science 32, 2006.
M. Strohmaier, Purpose Tagging - Capturing User Intent to Assist Goal-Oriented Social Search, SSM'08 Workshop on Search in Social Media, Napa Valley, USA, 2008.
M. Strohmaier, C. Körner, and R. Kern, Why do Users Tag? Detecting Users' Motivation for Tagging in Social Tagging Systems, 4th International AAAI Conference on Weblogs and Social Media (ICWSM2010), Washington, DC, USA, May 23-26, 2010.
Yanbe, Y.; Jatowt, A.; Nakamura, S. & Tanaka, K. (2007), Can social bookmarking enhance search in the web?, in 'JCDL '07: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries', ACM, New York, NY, USA, pp. 107--116.
Rattenbury, T.; Good, N. & Naaman, M. (2007), Towards automatic extraction of event and place semantics from flickr tags, in 'SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval', ACM Press, New York, NY, USA, pp. 103--110.
Types of Users – Related Work
Marlow, C.; Naaman, M.; Boyd, D. & Davis, M. (2006), HT06, tagging paper, taxonomy, Flickr, academic article, to read, in 'HYPERTEXT '06: Proceedings of the seventeenth conference on Hypertext and hypermedia', ACM, New York, NY, USA, pp. 31--40.
Körner, C.; Kern, R.; Grahsl, H.-P. & Strohmaier, M.: Of categorizers and describers: an evaluation of quantitative measures for tagging motivation, HT '10: Proceedings of the 21st ACM Conference on Hypertext and Hypermedia, New York, NY, USA, ACM, 2010.
Strohmaier, M.; Körner, C. & Kern, R. (2010), Why do users tag? Detecting users' motivation for tagging in social tagging systems, in 'International AAAI Conference on Weblogs and Social Media (ICWSM2010)'.
vation_behind_tagging/index.html
Search in Folksonomies
Search engines need
1. to compute the hits for a query
2. and rank them.
PageRank algorithm is very successful in the web (see Google):
Authority values are propagated along the hyperlinks according to
$x \leftarrow d A x + (1-d) p$
where
A is the row-stochastic adjacency matrix of the web graph (each row of A is normalized to 1),
x is the rank vector,
p is the random surfer component (may be used as preference vector),
$d \in [0,1]$ is a weighting factor.
If $\|A\|_1 := \|p\|_1 := 1$ and there are no rank sinks, then the computation of a fixed point equals the computation
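The update rule can be iterated until the rank vector stops changing. The sketch below spreads each node's weight equally along its outgoing links, which in matrix terms multiplies by the transpose of the row-normalized adjacency matrix; that reading of the update, the default d = 0.85, the fixed iteration count and the toy graph are assumptions of this sketch.

```python
def pagerank(out_links, d=0.85, p=None, iterations=100):
    """Iterate the damped weight-spreading update a fixed number of times.

    out_links maps each node to the list of nodes it links to. Every node
    must have at least one outgoing link (no rank sinks), and every link
    target must itself be a key of out_links.
    """
    nodes = sorted(out_links)
    if p is None:
        # Uniform random surfer component.
        p = {v: 1.0 / len(nodes) for v in nodes}
    x = dict(p)
    for _ in range(iterations):
        new_x = {v: (1.0 - d) * p[v] for v in nodes}
        for v in nodes:
            # Node v passes a share of its weight to each link target.
            share = d * x[v] / len(out_links[v])
            for target in out_links[v]:
                new_x[target] += share
        x = new_x
    return x


# Toy web graph: a links to b and c, b links to c, c links back to a.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
```

Passing a non-uniform `p` turns the random surfer component into a preference vector, which is the hook used for topic-specific ranking.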
Search in Folksonomies
Folksonomies have a different structure than the web graph:
[Figure panels: Web graph; Folksonomies (hyperedges connecting User 1–4, Tag 1–3 and Res 1–3)]
What can a ranking algorithm for this structure look like?
First Approach: Adapted PageRank
Split each hyperedge into six directed edges.
$x \leftarrow d A x + (1-d) p$
[Figure: hyperedge (User 1, Tag 1, Res 1) and the directed edges derived from it]
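Converting a Folksonomy into an Undirected Graph
Set V of nodes consists of the disjoint union of the sets of tags, users and resources: $V = U \cup T \cup R$
Co-occurrences of users and tags, tags and resources, and users and resources become edges between the respective nodes:
$E = \{\{u,t\} \mid \exists r \in R : (u,t,r) \in Y\} \cup \{\{t,r\} \mid \exists u \in U : (u,t,r) \in Y\} \cup \{\{u,r\} \mid \exists t \in T : (u,t,r) \in Y\}$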
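The conversion amounts to turning every tag assignment into three undirected edges. A minimal sketch, assuming tag assignments are (user, tag, resource) triples and ignoring edge weights; the function name and the toy data are made up for illustration.

```python
def folksonomy_graph(assignments):
    """Build the undirected graph (V, E) from tag assignments (u, t, r)."""
    nodes, edges = set(), set()
    for u, t, r in assignments:
        # Prefix the labels so that users, tags and resources stay disjoint
        # even if, for example, a user name equals a tag.
        un, tn, rn = ("user", u), ("tag", t), ("res", r)
        nodes.update([un, tn, rn])
        # frozenset makes an edge independent of the order of its end points.
        edges.add(frozenset([un, tn]))
        edges.add(frozenset([tn, rn]))
        edges.add(frozenset([un, rn]))
    return nodes, edges


# Toy tag assignments: two users tag the same resource with the same tag.
Y = {("alice", "python", "r1"), ("bob", "python", "r1")}
V, E = folksonomy_graph(Y)
```

Because edges are stored in a set, the shared tag-resource edge appears only once; counting how often each edge is produced would give edge weights.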
Ranking in Folksonomies: FolkRank
Problems of folksonomy-adapted PageRank
dominated by graph structure
undirected: weight flows back (PageRank ≈ edge degree)
Differential approach
compute rank with and without preferences
FolkRank = difference between those rankings normalized to [0,1]
Let $R_{AP}$ be the fixed point with $p = \mathbf{1}$
Let $R_{pref}$ be the fixed point with p representing the high weights for the preferred items
$R := R_{pref} - R_{AP}$ is the final weight vector
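The differential approach can be sketched by running the same weight-spreading iteration twice and subtracting the results. The sketch assumes an unweighted undirected graph, preference vectors scaled to sum to 1, a damping value d = 0.7 and a preference boost equal to the number of nodes; these parameter choices, the function names and the toy graph are assumptions rather than the settings of the original system.

```python
def adapted_pagerank(adj, p, d=0.7, iterations=200):
    """Weight spreading on an undirected graph with preference vector p."""
    x = dict(p)
    for _ in range(iterations):
        new_x = {v: (1.0 - d) * p[v] for v in adj}
        for v, neighbors in adj.items():
            # Each node passes its weight to its neighbors in equal shares.
            for nb in neighbors:
                new_x[nb] += d * x[v] / len(neighbors)
        x = new_x
    return x


def folkrank(adj, preferred, d=0.7):
    """Difference between the ranking with and without preferences."""
    n = len(adj)
    # Baseline: all nodes weighted equally.
    uniform = {v: 1.0 / n for v in adj}
    # Preference: extra weight on the preferred nodes, scaled to sum to 1.
    raw = {v: 1.0 + (n if v in preferred else 0.0) for v in adj}
    total = sum(raw.values())
    pref = {v: w / total for v, w in raw.items()}
    r_ap = adapted_pagerank(adj, uniform, d)
    r_pref = adapted_pagerank(adj, pref, d)
    return {v: r_pref[v] - r_ap[v] for v in adj}


# Toy graph: one resource shared by two users with one tag each.
adj = {
    "tag:java": {"user:a", "res:1"},
    "user:a": {"tag:java", "res:1"},
    "res:1": {"tag:java", "user:a", "user:b", "tag:web"},
    "user:b": {"res:1", "tag:web"},
    "tag:web": {"user:b", "res:1"},
}
scores = folkrank(adj, preferred={"tag:java"})
```

Subtracting the baseline removes the part of the ranking that only reflects how well connected a node is, so nodes close to the preferred tag stand out even if they have few edges.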
Results for: “Semantic Web”
Columns: PageRank without preference, PageRank with preference, FolkRank with preference
Rankings for „semanticweb“ for discovering semantic relationships, user communities, and web pages
Trends with respect to tag “politics”
US elections in Nov. 2004
Tagging and Search 56
“Traditional” Search…
“Social” Search
Is the content in both systems the same?
Data Collection
Search engine data
SessionId
[Diagram labels: Datasets; Rankings: Google, AOL, MSN; Queries (clickdata)]
Tagging and Searching: Basic statistics
MSN and AOL very similar
to power law distribution
frequent (>10) terms
word lexemes
MSN contains many URL parts
Top tags in Del.icio.us (frequency): design (119,580), blog (102,728), software (100,873), web (97,495), reference (92,078)
Top terms in the search logs (MSN, AOL) with frequency:
yahoo (181,137), google (166,110), free (118,628), county (118,002), myspace (107,316)
free (145,585), google (116,537), http (84,376), county (77,798), pictures (75,977)
Tagging and Searching: Usage over Time
MSN words and del.icio.us tags
(5 % level)
[Figure for “vista”: number of times the split word was submitted to a search engine on one day; number of times the tag was added to a resource]
Summary
motivations
Tag Recommender
Clicked Recommended Tags
Recommender - Related Work
More under: http://www.bibsonomy.org/tag/recommender+folksonomy
BibSonomy after lunch ...
Spam?
Spam – User Level vs. Post Level
Sources of Spam?
Spam Classification
[Diagram labels: User, Data Collection (Registration Information, Posts, Logging Information, S… Information), Feature Engineering, Training, Classification, Personal Data!!]
Dataset: Creation
BibSonomy
Decision is based on
Links (websites) of posts
Added tags
Also influenced by personal information:
Dataset creation process
Dataset - Figures
          Users  Spammer  Tags     Resources  TAS
All       1,411  18,681   306,993  920,176    8,709,417
Training  1,306  15,891   282,473  774,678    7,904,735
Test      100    2,790    49,644   153,512    804,682
Features
Profile Information
Location Information
Activity Information
Semantic Information
email, realname
post
334 for users
domain
top level domain
this IP
graph, e.g. spammer shares resources with other spammers
Personal Data in BibSonomy ???
Identifiable: person who can be identified, directly or indirectly
Example 1 Example 2
Data Privacy
is collected and stored – in digital form or otherwise
Rights (ECHR)
Open Issues
conform to existing law?
tagging systems?
research applications?
Data Privacy Categories for Data Sources
Rank | Data Categories | Examples
1 | anonymised data | All user data that the operator cannot associate with a single user after having removed all features which allow an identification
2 | publicly available data | Posts, friend and follower links, registration information if published
3 | registration information | Registration data not published
4 | logging information | Logging data collected by operator
5 | explicitly not published data | Posts, contact and profile information marked as private by the user
Data Privacy Categories for Data Sources
Data source: rank
Registration Information: 3
Posts: 2
Logging Information: 4
S… Information: 2,3,4
Ranks: 1 anonymised data, 2 publicly available data, 3 registration information, 4 logging information, 5 explicitly not published data
Experimental Design
Spam challenge dataset 2008
reached an AUC value of: 0.98
Performance + Privacy Evaluation
What kind of “related” tags?
Example for cosine measure
Example: Most related tags for „web2.0“ and „howto“
WEB2.0
               1         2           3          4            5
Coocc          ajax      web         tools      blog         webdesign
FolkRank       web       ajax        tools      design       blog
TagContext     web2      web-2.0     webapp     „web         web_2.0
ResourceCont.  web2      web20       2.0        web_2.0      web-2.0
UserContext    ajax      aggregator  rss        google       collaborate
HOWTO
               1          2          3          4            5
Coocc          tutorial   reference  tips       linux        programming
FolkRank       reference  linux      tutorial   programming  software
TagContext     how-to     guide      tutorials  help         how_to
ResourceCont.  how-to     tutorial   tutorials  tips         diy
UserContext    reference  tutorial   tips       hacks        tools
Semantic Grounding in WordNet
WordNet is a large lexical database for English.
Words with the same meaning are grouped in synsets, which are ordered by an is-a hierarchy.
Introduction of a single artificial root node enables application of graph-based similarity metrics between pairs of nouns / pairs of verbs.
Inclusion of top n Delicious tags in WordNet:
Example of Semantic Grounding
Original tag: „java“
Most similar tag:
Freq, FolkRank: „programming“
Cosine: „python“
WordNet Synset Hierarchy: computers, programming, languages, design_patterns, java, python
map
Grounded similarity
[Figure: shortest paths in WordNet; axis: length of shortest path to most related tag; labels: siblings, random]
Related work
More under: http://www.bibsonomy.org/tag/ontology+folksonomy
Lessons Learned
Network measures provide interesting insights into the user behavior of folksonomies
All types of nodes provide valuable information
Click behaviour has a similar structure to tagged data
Ranking based on graph structure similar to PageRank is possible
Correlation between search interests and tagged items
Recommender can influence the user behaviour
Spam is a critical issue
Relatedness measures on tags in folksonomies are a good basis to extract semantic relations
Future Work
within tagging systems
References: http://www.bibsonomy.org/group/kde/ol_tut2010
Agenda
BibSonomy
publication sharing system
Network Properties of Folksonomies
Network Properties of Logsonomies
Tag Relatedness
Query Relatedness
Spam Detection
Summary and Outlook
[Figure: rank (0.05–0.4) per month (2–14) for the tags "blog", "css", "design", "linux", "music", "news", "programming", "software", "web"]
Folksonomy vs. Logsonomy Dataset
Most related tags by cooccurrence / cosine similarity (Folksonomy, Logsonomy)
cosine
news: blogs, people, weblog, culture, future
news: news.com, newspaper, weather, obituaries, newspapers
video: entertainment, awesome, fun, cool, random
video: videos, downloading, url, downloads, download
tutorial: tutorials, tips, coding, code, examples
tutorial: tutorials, software, trial, download, templates
freq
news: blog, technology, politics, media, daily
news: channel, daily, fox, paper, newport
video: music, funny, tv, software, media
video: music, free, codes, clips, sex, game, myspace
tutorial: howto, programming, reference, design, css
tutorial: free, tutorials, psp, electronics, microsoft
Qualitative insights: Overlap of 10 most related tags
                  coocc  FolkRank  Tag Context  Resource Context
User Context      2.28   2.16      0.71         1.11
Resource Context  1.93   2.25      1.5
Tag Context       0.88   1.1
FolkRank          5.91
Qualitative insights 2: Average rank of related tags
Shortest path between original tag and most closely related one
[Figure panels: Logsonomy, Folksonomy]
shortest paths in WordNet for the Logsonomy
[Figure: length of shortest path to most related tag; panels: Logsonomy, Folksonomy]