 
Small World Properties
• Test of small-world properties:
  - Clustering coefficient (CC)
  - Characteristic path length (CPL)
  - ... compared to a random graph (Erdős)
• [Figure: example graph with nodes User 1, User 2, User 3, Tag 2, Tag 3, Res 1, Res 2, Res 3]
• Logsonomies show similar properties.
Andreas Hotho, 30.08.2011
Results: Average Nearest Neighbor Strength
[Figure panels: Del.icio.us (complete URLs, host-only URLs); AOL (split queries, complete queries)]
Agenda
• Introduction
• Understanding the Data
  - Network Properties of Folksonomies
  - Network Properties of Logsonomies
  - Types of Tags, Users, Resources
• Folksonomies and Search
  - Ranking in Folksonomies (FolkRank)
  - Comparison of Traditional and Social Search
• IR-related approaches
  - Recommender
  - Spam in Folksonomies
  - Semantics in Folksonomies
• Summary and Outlook
Types of Tags [Golder & Huberman, 2006]
Golder & Huberman identified seven types of tags:
• Identifying what (or who) it is about, e.g., ontology, learning
• Identifying what it is, e.g., article, blog
• Identifying who owns it, e.g., apple, google
• Refining categories, e.g., 2010
• Identifying qualities or characteristics, e.g., interesting, cool (also called sentiment tags)
• Self reference, e.g., myown
• Task organization, e.g., toread, tobuy (also called intent or purpose tags)
Additionally, we can find
• Category of a resource
• System tags, e.g., for:andrea
Types of Users [Marlow et al., 2006]
• A: consistently adds new tags as new photos are uploaded
• B: few tags at first, sudden growth later
Types of Users [Strohmaier et al., 2010]
Evidence of different ways HOW users tag (Tagging Pragmatics)
Broad distinction by tagging motivation [Strohmaier 2009]:
• “Describers” …
  - tag “verbosely” with freely chosen words
  - vocabulary not necessarily consistent (synonyms, spelling variants, …)
  - goal: describe content, ease retrieval
• “Categorizers” …
  - use a small controlled tag vocabulary
  - goal: “ontology-like” categorization by tags, for later browsing
  - tags as replacement for folders
[Figure: example tag vocabularies with labels bev, alc, nalc, wine, beer, donuts, duff, bart, Duff-beer, marge, barty]
Types of Resources
Basically, there are systems to tag anything …
• photos
• videos
• contacts
• publication references
• news
• goals in life
• bookmarks
… to name just a few.
Types of Tags - Related Work
• S. Golder and B. Huberman. Usage patterns of collaborative tagging systems. Journal of Information Science 32, 2006.
• M. Strohmaier. Purpose Tagging - Capturing User Intent to Assist Goal-Oriented Social Search. SSM'08 Workshop on Search in Social Media, in conjunction with CIKM'08, Napa Valley, USA, 2008.
• M. Strohmaier, C. Körner, and R. Kern. Why do Users Tag? Detecting Users' Motivation for Tagging in Social Tagging Systems. 4th International AAAI Conference on Weblogs and Social Media (ICWSM2010), Washington, DC, USA, May 23-26, 2010.
• Y. Yanbe, A. Jatowt, S. Nakamura, and K. Tanaka. Can social bookmarking enhance search in the web? In JCDL '07: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, ACM, New York, NY, USA, 2007, pp. 107-116.
• T. Rattenbury, N. Good, and M. Naaman. Towards automatic extraction of event and place semantics from flickr tags. In SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, New York, NY, USA, 2007, pp. 103-110.
Types of Users – Related Work
• C. Marlow, M. Naaman, D. Boyd, and M. Davis. HT06, tagging paper, taxonomy, Flickr, academic article, to read. In HYPERTEXT '06: Proceedings of the seventeenth conference on Hypertext and hypermedia, ACM, New York, NY, USA, 2006, pp. 31-40.
• C. Körner, R. Kern, H.-P. Grahsl, and M. Strohmaier. Of categorizers and describers: an evaluation of quantitative measures for tagging motivation. In HT '10: Proceedings of the 21st ACM Conference on Hypertext and Hypermedia, ACM, New York, NY, USA, 2010.
• M. Strohmaier, C. Körner, and R. Kern. Why do users tag? Detecting users' motivation for tagging in social tagging systems. In International AAAI Conference on Weblogs and Social Media (ICWSM2010), 2010.
• http://src.acm.org/2010/ChristianKoerner/understanding_the_motivation_behind_tagging/index.html
Search in Folksonomies
Search engines need
1. to compute the hits for a query
2. and to rank them.
The PageRank algorithm is very successful on the web (see Google):
• Authority values are propagated along the hyperlinks according to
  $x \leftarrow d A x + (1 - d) p$
  where $A$ is the row-stochastic adjacency matrix of the web graph (each row of $A$ is normalized to 1), $x$ is the rank vector, $p$ is the random surfer component (may be used as a preference vector), and $d \in [0,1]$ is a weighting factor.
• If $\|A\|_1 := \|p\|_1 := 1$ and there are no rank sinks, then the computation of a fixed point equals the computation of the first eigenvector of the matrix $d A + (1 - d) p \mathbf{1}^T$.
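The weight propagation above can be sketched as a simple power iteration. This is only an illustrative sketch, not the implementation behind the slides: the function name `pagerank`, the damping value `d=0.85`, and the three-page toy graph are assumptions. It also assumes the common convention that each page splits its weight equally among its out-links (the row-normalized matrix applied in transposed form), which keeps the total weight at 1.

```python
def pagerank(adj, d=0.85, p=None, iters=100, tol=1e-12):
    """Power iteration for x <- d * A^T x + (1 - d) * p.

    adj: dict mapping node -> list of successors (assumed: no rank sinks).
    p:   optional preference dict; defaults to the uniform distribution.
    """
    nodes = sorted(adj)
    if p is None:
        p = {v: 1.0 for v in nodes}
    total = sum(p.get(v, 0.0) for v in nodes)
    p = {v: p.get(v, 0.0) / total for v in nodes}  # enforce |p|_1 = 1
    x = dict(p)                                     # start with |x|_1 = 1
    for _ in range(iters):
        new = {v: (1.0 - d) * p[v] for v in nodes}  # random surfer part
        for u in nodes:
            share = d * x[u] / len(adj[u])          # row of A sums to 1
            for v in adj[u]:
                new[v] += share
        delta = sum(abs(new[v] - x[v]) for v in nodes)
        x = new
        if delta < tol:
            break
    return x

# Hypothetical toy web graph: a -> b, a -> c, b -> c, c -> a
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
```

Passing a non-uniform `p` turns the same iteration into a preference-biased ranking, which is the hook later used for folksonomies.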
Search in Folksonomies
• Folksonomies have a different structure than the web graph.
[Figure panels: "Web graph" and "Folksonomies"; the folksonomy panel shows users, tags (Tag 1-3) and resources (Res 1-3) connected by hyperedges]
• What could a ranking algorithm for this structure look like?
First Approach: Adapted PageRank
1. Split each hyperedge into six directed edges.
[Figure: hyperedge (Tag 1, User 1, Res 1) before and after the split]
2. Iterative weight propagation according to PageRank: $x \leftarrow d A x + (1 - d) p$.
Converting a Folksonomy into an Undirected Graph
• The set $V$ of nodes consists of the disjoint union of the sets of tags, users and resources:
  $V = U \cup T \cup R$
• All co-occurrences of users and tags, tags and resources, users and resources become edges between the respective nodes:
  $E = \{\{u,t\} \mid \exists r \in R : (u,t,r) \in Y\} \cup \{\{t,r\} \mid \exists u \in U : (u,t,r) \in Y\} \cup \{\{u,r\} \mid \exists t \in T : (u,t,r) \in Y\}$
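A minimal sketch of this conversion, assuming the folksonomy is given as a Python set `Y` of (user, tag, resource) triples. The function name, the node prefixes used to keep the three sets disjoint, and the sample data are all hypothetical.

```python
from itertools import combinations

def folksonomy_to_graph(Y):
    """Return (V, E) for a set Y of (user, tag, resource) assignments."""
    V, E = set(), set()
    for u, t, r in Y:
        # Prefix each name with its kind so that U, T and R stay disjoint
        # even if, say, a user and a tag share the same string.
        nodes = (("u", u), ("t", t), ("r", r))
        V.update(nodes)
        # Every triple contributes the edges {u,t}, {u,r} and {t,r}.
        for a, b in combinations(nodes, 2):
            E.add(frozenset((a, b)))
    return V, E

# Hypothetical tag assignments
Y = {
    ("alice", "python", "docs.python.org"),
    ("bob", "python", "docs.python.org"),
    ("bob", "web", "w3.org"),
}
V, E = folksonomy_to_graph(Y)
```

Using `frozenset` for edges makes them undirected and removes duplicates, so the tag-resource edge shared by the first two triples is stored once.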
Ranking in Folksonomies: FolkRank
Problems of folksonomy-adapted PageRank
• dominated by graph structure
• undirected: weight flows back (PageRank ≈ edge degree)
Differential approach
• compute rank with and without preferences
• FolkRank = difference between those rankings, normalized to [0,1]
• Let $R_{AP}$ be the fixed point with $p = \mathbf{1}$
• Let $R_{pref}$ be the fixed point with $p$ representing the high weights for the preferred items
• $R := R_{pref} - R_{AP}$ is the final weight vector
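The differential approach can be sketched as two runs of the same weight spreading, one with a uniform and one with a boosted preference vector. The following is an illustration under assumptions, not the original implementation: edges are unweighted, `d=0.7` and the boost size are arbitrary choices, the final normalization to [0,1] is omitted, and the small graph is hypothetical.

```python
def spread(neighbors, p, d=0.7, iters=200):
    """Fixed point of x <- d * (weight spreading) + (1 - d) * p."""
    nodes = list(neighbors)
    total = sum(p[v] for v in nodes)
    p = {v: p[v] / total for v in nodes}        # normalize preferences
    x = dict(p)
    for _ in range(iters):
        new = {v: (1.0 - d) * p[v] for v in nodes}
        for u in nodes:
            share = d * x[u] / len(neighbors[u])
            for v in neighbors[u]:
                new[v] += share
        x = new
    return x

def folkrank(neighbors, preferred, d=0.7):
    """R := R_pref - R_AP for a set of preferred nodes."""
    nodes = list(neighbors)
    uniform = {v: 1.0 for v in nodes}           # p = 1
    pref = dict(uniform)
    for v in preferred:
        pref[v] += len(nodes)                   # assumed boost size
    r_ap = spread(neighbors, uniform, d)
    r_pref = spread(neighbors, pref, d)
    return {v: r_pref[v] - r_ap[v] for v in nodes}

# Hypothetical undirected folksonomy graph (adjacency lists)
graph = {
    "u:alice": ["t:python", "r:docs"],
    "u:bob": ["t:python", "r:docs", "t:web", "r:w3"],
    "t:python": ["u:alice", "u:bob", "r:docs"],
    "t:web": ["u:bob", "r:w3"],
    "r:docs": ["u:alice", "u:bob", "t:python"],
    "r:w3": ["u:bob", "t:web"],
}
scores = folkrank(graph, preferred={"t:web"})
top = max(scores, key=scores.get)
```

Because both runs distribute a total weight of 1, the differences sum to zero: nodes close to the preferred tag gain weight, while nodes that are merely well connected lose it.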
Results for: “Semantic Web”
[Figure panels: PageRank without preference; PageRank with preference; FolkRank with preference]
Rankings for “semanticweb” for discovering semantic relationships, user communities, and web pages
Trends with respect to tag “politics”
[Figure annotation: US elections in Nov. 2004]
Related Work
Ranking in Folksonomies
• Michail, A. CollaborativeRank: Motivating People to Give Helpful and Timely Ranking Suggestions, School of Computer Science and Engineering, 2005.
• Szekely, B. & Torres, E. Ranking Bookmarks and Bistros: Intelligent Community and Folksonomy Development, 2005.
• Bao, S.; Xue, G.; Wu, X.; Yu, Y.; Fei, B. & Su, Z. Optimizing web search using social annotations, ACM Press, 2007, 501-510.
Ranking in Web 2.0
• Mohammad Nauman and Shahbaz Khan. Using Personalized Web Search for Enhancing Common Sense and Folksonomy Based Intelligent Search Systems. wi, (0):423-426, IEEE Computer Society, Los Alamitos, CA, USA, 2007.
Usefulness of Tag Clouds
• J. Sinclair and M. Cardew-Hall. The folksonomy tag cloud: When is it useful? Journal of Information Science, 016555150607808, CILIP, 2007.
Tagging and Search: “Traditional” Search …
“Social” Search
Is the content in both systems the same?
Data Collection
• Social bookmarking data (del.icio.us)
• Search engine data
  - Queries, timestamps, session IDs
  - Rankings (Google 100, MSN 1000)
[Figure: datasets, with queries (click data) and rankings; sources shown: MSN, AOL, Google]
Tagging and Searching: Basic statistics
• MSN and AOL very similar
• words and tags different:
  - low overall overlap due to power law distribution
  - relatively high overlap on frequent (>10) terms
• del.icio.us: many multi-word lexemes
• MSN contains many URL parts

Del.icio.us              MSN                      AOL
Top tags    Frequency    Top terms   Frequency    Top terms   Frequency
design      119,580      yahoo       181,137      free        145,585
blog        102,728      google      166,110      google      116,537
software    100,873      free        118,628      http         84,376
web          97,495      county      118,002      county       77,798
reference    92,078      myspace     107,316      pictures     75,977
Tagging and Searching: Usage over Time
• Comparison of MSN words and del.icio.us tags
• Normalization, Pearson correlation coefficient, t-test
• 1003 terms: 307 out of 1003 terms show a significant correlation (5% level)
[Figure: example term "vista"; number of times the split word was submitted to a search engine on one day vs. number of times the tag was added to a resource on one day]
• People search and tag content at the same time
• Tagging and searching are triggered by similar motivations
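As a rough illustration of the per-term test, the sketch below computes the Pearson correlation coefficient of two daily count series and the corresponding t statistic, t = r * sqrt((n - 2) / (1 - r^2)). The slides do not spell out the normalization, so dividing each series by its total is an assumption here, and the counts are invented for the example.

```python
import math

def normalize(counts):
    """Relative frequencies (assumed normalization; Pearson's r is
    unaffected by this kind of rescaling)."""
    total = float(sum(counts))
    return [c / total for c in counts]

def pearson_t(xs, ys):
    """Return (r, t) for two equally long series with n > 2 points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    if abs(r) >= 1.0:
        return r, math.copysign(math.inf, r)
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return r, t

# Hypothetical daily counts for one term over a week
queries_per_day = [10, 12, 30, 28, 11, 9, 10]   # search engine log
tags_per_day = [3, 4, 9, 10, 4, 2, 3]           # tagging system
r, t = pearson_t(normalize(queries_per_day), normalize(tags_per_day))
```

The t statistic is compared against the critical value of the t distribution with n - 2 degrees of freedom; for n = 7 and a two-sided 5% level that value is about 2.571.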
Summary
• Low overlap of query terms / tags
• URLs in query log / multi-word lexemes
• Power law distribution
• Frequent terms overlap
• Tagging and searching are triggered by similar motivations
• Del.icio.us covers top search engine URLs
• Ranking overlap is low
• Correlation higher for IT-specific topics
Tag Recommender
Clicked Recommended Tags
Recommender - Related Work
Recommender
• G. Adomavicius and A. Tuzhilin. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. Knowledge and Data Engineering, IEEE Transactions on, (17)6:734-749, 2005.
Tag Recommender
• Z. Xu and Y. Fu and J. Mao and D. Su. Towards the semantic web: Collaborative tag suggestions. Collaborative Web Tagging Workshop at WWW2006, Edinburgh, Scotland, May, 2006.
• Yanfei Xu and Liang Zhang and Wei Liu. Cubic Analysis of Social Bookmarking for Personalized Recommendation. Frontiers of WWW Research and Development - APWeb 2006, 733-738, 2006.
• Jäschke, R.; Eisterlehner, F.; Hotho, A. & Stumme, G. (2009), Testing and Evaluating Tag Recommenders in a Live System, in Dominik Benz & Frederik Janssen, ed., 'Workshop on Knowledge Discovery, Data Mining, and Machine Learning', pp. 44-51.
• Jäschke, R.; Marinho, L.; Hotho, A.; Schmidt-Thieme, L. & Stumme, G. (2008), 'Tag Recommendations in Social Bookmarking Systems', AI Communications 21 (4), 231-247.
More under: http://www.bibsonomy.org/tag/recommender+folksonomy
BibSonomy after lunch ...
Spam?
Spam – User Level vs. Post Level
Sources of Spam?
Spam Classification
[Figure: pipeline from User Data Collection (registration information, logging information, social network information, posts) via Feature Engineering and Training to Classification; annotation: "Personal Data!!"]
Dataset: Creation
BibSonomy admins and developers flag users as spammers. The decision is based on
• links (websites) of posts
• added tags
It is also influenced by personal information:
• e-mail
• choice of name
• registration IP
• …
Dataset creation process
Dataset - Figures

            Users    Spammers   Tags       Resources   TAS
All         1,411    18,681     306,993    920,176     8,709,417
Training    1,306    15,891     282,473    774,678     7,904,735
Test          100     2,790      49,644    153,512       804,682

• Time frame: until end of 2007
• Only users with at least one post
• No consideration of private posts
• Tags not normalized
Features
• 25 features
• 4 different categories
• Normalization of each user's feature vector
Profile Information
• Realname with 2 or 3 words
• length of the user name, email, realname
• digits in user name
Activity Information
• time between registration and first post
• number of tags per post
• average number of TAS: 470 for spammers, 334 for users
Location Information
• number of users in the same domain
• number of users in the same top level domain
• number of spam users with this IP
Semantic Information
• blacklist of tags
• co-occurrence information of the graph, e.g. spammer shares resources with other spammers
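To make the profile category concrete, here is a small sketch that derives a few of the listed profile features from registration strings and then normalizes the vector. The function names, the feature order, the Euclidean normalization, and the sample account are assumptions; the slides only name the features and state that each user's feature vector is normalized.

```python
import math

def profile_features(username, email, realname):
    """A few profile features as a list of floats (illustrative subset)."""
    words = realname.split()
    return [
        float(len(username)),                         # length of user name
        float(len(email)),                            # length of e-mail
        float(len(realname)),                         # length of realname
        float(sum(ch.isdigit() for ch in username)),  # digits in user name
        1.0 if len(words) in (2, 3) else 0.0,         # realname has 2-3 words
    ]

def normalize(vec):
    """Scale a vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)

# Hypothetical account data
raw = profile_features("buy4cheap99", "x@example.com", "")
features = normalize(raw)
```

Such a vector would then be concatenated with the activity, location and semantic features before being passed to a classifier.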
Personal Data in BibSonomy???
Identifiable: a person who can be identified, directly or indirectly
[Figures: Example 1, Example 2]
Data Privacy
• Concerns exist wherever personally identifiable information is collected and stored, in digital form or otherwise
• It is a right to control the storage of personal data!
• Europe: Article 8 of the European Convention on Human Rights (ECHR)
Open Issues
• Which laws?
• Does the collection and processing of data in BibSonomy conform to existing law?
• Best practices for other social bookmarking systems / social tagging systems?
• Do we need to process personal data in our data mining / research applications?
Data Privacy Categories for Data Sources

Rank   Data category                   Examples
1      anonymised data                 All user data that the operator cannot associate with a single user after having removed all features which allow an identification
2      publicly available data         Posts, friend and follower links, registration information if published
3      registration information        Registration data not published
4      logging information             Logging data collected by operator
5      explicitly not published data   Posts, contact and profile information marked as private by the user

Data Privacy Categories for Data Sources
[Figure: the data sources for feature engineering (registration information, logging information, social network information, posts) annotated with the privacy ranks from the table above; rank labels shown in the figure: 3, 2, 4 and 2,3,4]
Experimental Design
• Spam challenge dataset 2008
• Features of different data categories
• Classification methods of Weka
• AUC value
• Winner of the challenge reached an AUC value of 0.98
Performance + Privacy Evaluation
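For readers unfamiliar with the AUC value, the following sketch computes it directly from its pairwise definition: the share of (spammer, non-spammer) pairs in which the spammer receives the higher score, with ties counted as one half. The function name and the toy labels and scores are hypothetical, and this quadratic version is meant for illustration, not for large datasets.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for spammer, 0 for legitimate user.
    scores: classifier output, higher means "more likely spam".
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # correctly ordered pair
            elif p == n:
                wins += 0.5      # tie counts half
    return wins / (len(pos) * len(neg))

# Hypothetical classifier output for five users
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]
value = auc(labels, scores)
```

An AUC of 1.0 means every spammer is ranked above every legitimate user, while 0.5 corresponds to random ordering.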
Spam Detection - Related Work
• Paul Heymann and Georgia Koutrika and Hector Garcia-Molina. Fighting Spam on Social Web Sites: A Survey of Approaches and Future Challenges. IEEE Internet Computing, (11)6:36-45, 2007.
• Georgia Koutrika and Frans Adjie Effendi and Zoltan Gyöngyi and Paul Heymann and Hector Garcia-Molina. Combating spam in tagging systems. AIRWeb '07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, 57-64, ACM Press, New York, NY, USA, 2007.
• Benjamin Markines and Ciro Cattuto and Filippo Menczer. Social spam detection. In Dennis Fetterly and Zoltán Gyöngyi, editor(s), AIRWeb, 41-48, 2009.
• Zoltán Gyöngyi and Hector Garcia-Molina and Jan Pedersen. Combating Web Spam with TrustRank. VLDB, 576-587, 2004.
What kind of “related” tags?
Example for cosine measure
Example: Most related tags for “web2.0” and “howto”

Tag      Sim. measure    1           2            3           4             5
web2.0   Coocc           ajax        web          tools       blog          webdesign
web2.0   FolkRank        web         ajax         tools       design        blog
web2.0   TagContext      web2        web-2.0      webapp      „web          web_2.0
web2.0   ResourceCont.   web2        web20        2.0         web_2.0       web-2.0
web2.0   UserContext     ajax        aggregator   rss         google        collaborate
howto    Coocc           tutorial    reference    tips        linux         programming
howto    FolkRank        reference   linux        tutorial    programming   software
howto    TagContext      how-to      guide        tutorials   help          how_to
howto    ResourceCont.   how-to      tutorial     tutorials   tips          diy
howto    UserContext     reference   tutorial     tips        hacks         tools
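The table suggests why context-based measures find spelling variants that plain co-occurrence misses. The sketch below illustrates the idea for a tag context: each tag is described by a vector of co-occurrence counts with other tags, and two tags are compared by the cosine of their vectors. How exactly the context vectors were built for the slides is not stated, so this construction, the function names, and the sample posts are assumptions.

```python
import math
from collections import defaultdict
from itertools import permutations

def tag_context_vectors(posts):
    """Map each tag to a vector {other_tag: number of shared posts}."""
    vectors = defaultdict(lambda: defaultdict(int))
    for tags in posts:
        for a, b in permutations(set(tags), 2):
            vectors[a][b] += 1
    return vectors

def cosine(v, w):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(v[k] * w[k] for k in v if k in w)
    nv = math.sqrt(sum(x * x for x in v.values()))
    nw = math.sqrt(sum(x * x for x in w.values()))
    return dot / (nv * nw) if nv and nw else 0.0

# Hypothetical posts, each given as the set of tags attached to it
posts = [
    {"howto", "linux", "tutorial"},
    {"how-to", "linux", "tutorial"},
    {"howto", "tips"},
    {"how-to", "tips"},
    {"music", "mp3"},
]
vec = tag_context_vectors(posts)
sim_variant = cosine(vec["howto"], vec["how-to"])
sim_unrelated = cosine(vec["howto"], vec["music"])
```

In this toy data "howto" and "how-to" never appear in the same post, so their co-occurrence count is zero, yet their contexts are identical and the cosine is maximal.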
Semantic Grounding in WordNet
• WordNet is a large lexical database for English.
• Words with the same meaning are grouped in synsets, which are ordered by an is-a hierarchy.
• Introducing a single artificial root node enables the application of graph-based similarity metrics between pairs of nouns / pairs of verbs.
• Inclusion of the top n Delicious tags in WordNet:
  - 100: 82%
  - 1,000: 79%
  - 5,000: 69%
  - 10,000: 61%
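The role of the artificial root can be illustrated with a toy is-a hierarchy and a breadth-first search for the shortest path between two concepts. This sketch does not use WordNet itself; the hierarchy, the node names, and the function name are invented for the example.

```python
from collections import deque

def shortest_path_length(parent, a, b):
    """Number of edges on the shortest path between a and b in an
    is-a hierarchy given as a child -> parent mapping."""
    adj = {}
    for child, par in parent.items():
        adj.setdefault(child, set()).add(par)
        adj.setdefault(par, set()).add(child)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # not connected

# Hypothetical hierarchy: two separate trees joined by an artificial ROOT
parent = {
    "java": "language",
    "python": "language",
    "language": "ROOT",
    "map": "representation",
    "representation": "ROOT",
}
close = shortest_path_length(parent, "java", "python")
far = shortest_path_length(parent, "java", "map")
```

Without the two edges to `ROOT`, "java" and "map" would lie in different trees and no path-based similarity could be computed for them.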
Example of Semantic Grounding
• Original tag: “java”
• Most similar tag:
  - Freq, FolkRank: “programming”
  - Cosine: “python”
[Figure: WordNet synset hierarchy; labels shown: computers, programming, design_patterns, languages, java, python, map, grounded similarity]
[Figure: shortest paths in WordNet; length of the shortest path to the most related tag, compared with random siblings]
Related work
Ontology Learning
• Dominik Benz and Andreas Hotho. Position Paper: Ontology Learning from Folksonomies. In Alexander Hinneburg, editor(s), LWA 2007: Lernen - Wissen - Adaption, Halle, September 2007, Workshop Proceedings (LWA), 109-112, Martin-Luther-University Halle-Wittenberg, 2007.
• Francis Heylighen. Bootstrapping knowledge representations: from entailment meshes via semantic nets to learning webs. Kybernetes, (30)5/6:691-722, 2001.
• Paul Heymann and Hector Garcia-Molina. Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems. 2006-10, 2006.
• P. Mika, Ontologies Are Us: A Unified Model of Social Networks and Semantics, Springer, 2005, 522-536.
• P. Schmitz, Inducing Ontology from Flickr Tags. 2006.
Analysis of tagging behaviour
• C. Cattuto, Semiotic dynamics in online social communities. The European Physical Journal C - Particles and Fields, 2006, 46, 33-37.
• Shilad Sen and Shyong K. Lam and Al Mamunur Rashid and Dan Cosley and Dan Frankowski and Jeremy Osterhouse and F. Maxwell Harper and John Riedl. tagging, communities, vocabulary, evolution. CSCW '06: Proceedings of the 2006 20th anniversary conference on Computer supported cooperative work, 181-190, ACM, New York, NY, USA, 2006.
More under: http://www.bibsonomy.org/tag/ontology+folksonomy
Lessons Learned
• Network measures provide interesting insights into user behaviour in folksonomies
• All types of nodes provide valuable information
• Click behaviour has a structure similar to that of tagged data
• Ranking based on graph structure, similar to PageRank, is possible
• There is a correlation between search interests and tagged items
• Recommenders can influence user behaviour
• Spam is a critical issue
• Relatedness measures on tags in folksonomies are a good basis for extracting semantic relations
Future Work
• Combining the information for personalized search
• Using tag data to improve intranet search
• Utilizing the information of the annotated resource
• Trying to improve search and ranking by allowing semantics within tagging systems
• Learning the ranking
• …
References: http://www.bibsonomy.org/group/kde/ol_tut2010
Backup
Agenda
• BibSonomy – a social bookmark and publication sharing system
• Network Properties of Folksonomies
• Network Properties of Logsonomies
• Tag Relatedness
• Query Relatedness
• Spam Detection
• Summary and Outlook
[Figure: rank (0 to 0.4) per month (0 to 14) for the tags "blog", "css", "design", "linux", "music", "news", "programming", "software", "web"]
Folksonomy vs. Logsonomy Dataset
• Del.icio.us excerpt: 10,000 most popular tags
  - |U| = 476,378, |T| = 10,000, |R| = 12,660,470
  - |Y| = 101,491,722
• AOL split queries: 10,000 most popular words
  - |U| = 463,380, |T| = 10,000, |R| = 1,284,724
  - |Y| = 26,227,550