IR in Social Media

Andreas Hotho, Data Mining and Information Retrieval Group, University of Würzburg. The Data Mining and Information Retrieval Group @ Würzburg was founded in 2009 as an academic offspring of Kassel.


  1. Small World Properties • Test of small-world properties • Clustering coefficient (CC) • Characteristic path length (CPL) • ... compared to a random graph (Erdős). [Figure: example graph of users, tags, and resources.] Logsonomies show similar properties. Andreas Hotho, 30.08.2011
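The two measures named on this slide can be sketched for a plain undirected graph as follows. This is a minimal illustration using the standard graph definitions, not the folksonomy-specific variants used in the talk; the toy graph and all function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def clustering_coefficient(adj):
    """Average local clustering coefficient of an undirected graph."""
    total = 0.0
    for node, neighbours in adj.items():
        k = len(neighbours)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        # Count edges that exist among the neighbours of this node.
        links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def characteristic_path_length(adj):
    """Mean shortest-path length over all connected ordered node pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cc = clustering_coefficient(toy_graph)
cpl = characteristic_path_length(toy_graph)
print(cc, cpl)
```

A small-world test in the sense of the slide would compare both values against those of a random graph with the same number of nodes and edges: a much higher CC at a similar CPL indicates small-world structure.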

  2. Results: Average Nearest Neighbor Strength. [Figure panels: Del.icio.us with complete URLs and with host-only URLs; AOL with split queries and with complete queries.]

  3. Agenda • Introduction • Understanding the Data: Network Properties of Folksonomies; Network Properties of Logsonomies; Types of Tags, Users, Resources • Folksonomies and Search: Ranking in Folksonomies (FolkRank); Comparison of Traditional and Social Search • IR-Related Approaches: Recommender; Spam in Folksonomies; Semantics in Folksonomies • Summary and Outlook

  4. Types of Tags [Golder & Huberman, 2006]. Golder & Huberman identified seven types of tags: • Identifying what (or who) it is about, e.g., ontology, learning • Identifying what it is, e.g., article, blog • Identifying who owns it, e.g., apple, google • Refining categories, e.g., 2010 • Identifying qualities or characteristics, e.g., interesting, cool (also called sentiment tags) • Self reference, e.g., myown • Task organization, e.g., toread, tobuy (also called intent or purpose tags). Additionally, we can find: • Category of a resource • System tags, e.g., for:andrea

  5. Types of Users [Marlow et al., 2006]. User A: consistently adds new tags as new photos are uploaded. User B: few tags at first, sudden growth later.

  6. Types of Users [Strohmaier et al., 2010]. Evidence of different ways HOW users tag (tagging pragmatics). Broad distinction by tagging motivation [Strohmaier 2009]: • "Describers": tag "verbosely" with freely chosen words; vocabulary not necessarily consistent (synonyms, spelling variants, …); goal: describe content, ease retrieval • "Categorizers": use a small controlled tag vocabulary; goal: "ontology-like" categorization by tags, for later browsing; tags as replacement for folders. [Figure: example tag vocabularies with the tags bev, alc, nalc, wine, beer, donuts, duff, bart, Duff-beer, marge, barty.]

  7. Types of Resources. Basically, there are systems to tag anything: photos, videos, contacts, publication references, news, goals in life, bookmarks, … to name just a few.

  8. Types of Tags - Related Work • S. Golder and B. Huberman, Usage patterns of collaborative tagging systems, Journal of Information Science 32, 2006. • M. Strohmaier, Purpose Tagging - Capturing User Intent to Assist Goal-Oriented Social Search, SSM'08 Workshop on Search in Social Media, in conjunction with CIKM'08, Napa Valley, USA, 2008. • M. Strohmaier, C. Körner, and R. Kern, Why do Users Tag? Detecting Users' Motivation for Tagging in Social Tagging Systems, 4th International AAAI Conference on Weblogs and Social Media (ICWSM2010), Washington, DC, USA, May 23-26, 2010. • Yanbe, Y.; Jatowt, A.; Nakamura, S. & Tanaka, K. (2007), Can social bookmarking enhance search in the web?, in 'JCDL '07: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries', ACM, New York, NY, USA, pp. 107--116. • Rattenbury, T.; Good, N. & Naaman, M. (2007), Towards automatic extraction of event and place semantics from flickr tags, in 'SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval', ACM Press, New York, NY, USA, pp. 103--110.

  9. Types of Users - Related Work • Marlow, C.; Naaman, M.; Boyd, D. & Davis, M. (2006), HT06, tagging paper, taxonomy, Flickr, academic article, to read, in 'HYPERTEXT '06: Proceedings of the seventeenth conference on Hypertext and hypermedia', ACM, New York, NY, USA, pp. 31--40. • C. Körner, R. Kern, H.-P. Grahsl, and M. Strohmaier: Of categorizers and describers: an evaluation of quantitative measures for tagging motivation, HT '10: Proceedings of the 21st ACM Conference on Hypertext and Hypermedia, New York, NY, USA, ACM, 2010. • Strohmaier, M.; Körner, C. & Kern, R. (2010), Why do users tag? Detecting users' motivation for tagging in social tagging systems, in 'International AAAI Conference on Weblogs and Social Media (ICWSM2010)'. • http://src.acm.org/2010/ChristianKoerner/understanding_the_motivation_behind_tagging/index.html

  11. Search in Folksonomies. Search engines need (1) to compute the hits for a query and (2) to rank them. The PageRank algorithm is very successful on the web (see Google): • Authority values are propagated along the hyperlinks according to $x \leftarrow d A x + (1-d) p$, where $A$ is the row-stochastic adjacency matrix of the web graph (each row of $A$ is normalized to 1), $x$ is the rank vector, $p$ is the random surfer component (may be used as preference vector), and $d \in [0,1]$ is a weighting factor. • If $\|A\|_1 := \|p\|_1 := 1$ and there are no rank sinks, then the computation of a fixed point equals the computation of the first eigenvector of the matrix $dA + (1-d)\, p \mathbf{1}^T$.
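The weight-propagation rule on this slide can be sketched as a simple power iteration. This is a minimal illustration under stated assumptions, not the implementation used in the talk: each node passes its weight to its out-neighbours in equal shares (the usual reading of the row-normalized adjacency matrix), every node has at least one out-link (no rank sinks), and the three-page web, the default `d=0.85` and the function name are hypothetical.

```python
def pagerank(out_links, d=0.85, p=None, iterations=100):
    """Iterate x <- d*A*x + (1-d)*p on a directed graph without rank sinks."""
    nodes = sorted(out_links)
    if p is None:
        # Uniform random-surfer vector with |p|_1 = 1.
        p = {v: 1.0 / len(nodes) for v in nodes}
    x = dict(p)
    for _ in range(iterations):
        new = {v: (1.0 - d) * p[v] for v in nodes}
        for u in nodes:
            # Each node distributes its damped weight evenly over its out-links.
            share = d * x[u] / len(out_links[u])
            for v in out_links[u]:
                new[v] += share
        x = new
    return x


# Hypothetical three-page web: a links to b and c, b links to c, c links to a.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
print(ranks)
```

Passing a non-uniform `p` turns the same iteration into a preference-biased ranking, which is the hook that the folksonomy-adapted variants build on.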

  12. Search in Folksonomies • Folksonomies have a different structure than the web graph. [Figure: a web graph contrasted with a folksonomy of users, tags, and resources.] • What could a ranking algorithm for this structure look like?

  13. First Approach: Adapted PageRank. (1) Split each hyperedge into six directed edges. [Figure: the hyperedge (User 1, Tag 1, Res 1) before and after the split.] (2) Iterative weight propagation according to PageRank: $x \leftarrow d A x + (1-d) p$.

  14. Converting a Folksonomy into an Undirected Graph • The set $V$ of nodes consists of the disjoint union of the sets of tags, users and resources: $V = U \cup T \cup R$ • All co-occurrences of users and tags, tags and resources, users and resources become edges between the respective nodes: $E = \{\{u,t\} \mid \exists r \in R : (u,t,r) \in Y\} \cup \{\{t,r\} \mid \exists u \in U : (u,t,r) \in Y\} \cup \{\{u,r\} \mid \exists t \in T : (u,t,r) \in Y\}$
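The construction of V and E can be sketched directly from a set of tag assignments Y. This is a minimal illustration; the example triples and the function name are hypothetical, and nodes are prefixed with their kind so that the union of users, tags, and resources stays disjoint even if a user and a tag share a name.

```python
def folksonomy_to_graph(Y):
    """Build the undirected graph (V, E) from (user, tag, resource) triples."""
    nodes, edges = set(), set()
    for user, tag, res in Y:
        u, t, r = ("user", user), ("tag", tag), ("res", res)
        nodes.update((u, t, r))
        # Each tag assignment contributes the three pairwise co-occurrences.
        edges.add(frozenset((u, t)))
        edges.add(frozenset((t, r)))
        edges.add(frozenset((u, r)))
    return nodes, edges


# Hypothetical tag assignments.
Y = {("u1", "t1", "r1"), ("u1", "t2", "r1"), ("u2", "t1", "r2")}
nodes, edges = folksonomy_to_graph(Y)
print(len(nodes), len(edges))
```

Using `frozenset` for edges makes them unordered and removes duplicates: the edge between u1 and r1 is produced by two triples but stored once.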

  15. Ranking in Folksonomies: FolkRank. Problems of folksonomy-adapted PageRank: • dominated by graph structure • undirected: weight flows back (PageRank $\approx$ edge degree). Differential approach: • compute rank with and without preferences • FolkRank = difference between those rankings, normalized to [0,1] • Let $R_{AP}$ be the fixed point with $p = \mathbf{1}$ • Let $R_{pref}$ be the fixed point with $p$ representing the high weights for the preferred items • $R := R_{pref} - R_{AP}$ is the final weight vector
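The differential approach can be sketched as two runs of the same weight-spreading iteration, one with a uniform preference vector and one with a boosted entry for the preferred item. This is a simplified illustration, not the published implementation: edges are unweighted (no co-occurrence counts), both preference vectors are normalized to sum to 1, and the damping value `d=0.7`, the boost size, the example triples and all function names are assumptions.

```python
def build_graph(triples):
    """Undirected adjacency sets over user, tag and resource nodes."""
    adj = {}
    for user, tag, res in triples:
        u, t, r = ("user", user), ("tag", tag), ("res", res)
        for a, b in ((u, t), (t, r), (u, r)):
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj


def spread_weights(adj, p, d=0.7, iterations=200):
    """Iterate x <- d*A*x + (1-d)*p with p normalized to sum to 1."""
    total = sum(p.values())
    p = {v: w / total for v, w in p.items()}
    x = dict(p)
    for _ in range(iterations):
        new = {v: (1.0 - d) * p[v] for v in adj}
        for u, neighbours in adj.items():
            share = d * x[u] / len(neighbours)
            for v in neighbours:
                new[v] += share
        x = new
    return x


def folkrank(adj, preferred, boost=None):
    """R := R_pref - R_AP, the difference of the two fixed points."""
    if boost is None:
        boost = float(len(adj))  # hypothetical choice of extra preference weight
    uniform = {v: 1.0 for v in adj}
    pref = dict(uniform)
    for v in preferred:
        pref[v] += boost
    r_ap = spread_weights(adj, uniform)
    r_pref = spread_weights(adj, pref)
    return {v: r_pref[v] - r_ap[v] for v in adj}


# Hypothetical tag assignments; the tag "t2" is the preferred item.
triples = [("u1", "t1", "r1"), ("u2", "t1", "r2"), ("u2", "t2", "r2")]
adj = build_graph(triples)
scores = folkrank(adj, preferred=[("tag", "t2")])
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 4))
```

Because both runs distribute the same total weight, the differences sum to zero: items close to the preferred tag gain weight, and items that rank high only because of their degree lose it.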

  16. Results for "Semantic Web". [Figure panels: PageRank without preference; PageRank with preference; FolkRank with preference.]

  17. Rankings for "semanticweb", for discovering semantic relationships, user communities, and web pages

  18. Trends with respect to the tag "politics". [Figure annotation: US elections in Nov. 2004.]

  19. Related Work. Ranking in Folksonomies: • Michail, A. CollaborativeRank: Motivating People to Give Helpful and Timely Ranking Suggestions, School of Computer Science and Engineering, 2005. • Szekely, B. & Torres, E. Ranking Bookmarks and Bistros: Intelligent Community and Folksonomy Development, 2005. • Bao, S.; Xue, G.; Wu, X.; Yu, Y.; Fei, B. & Su, Z. Optimizing web search using social annotations, ACM Press, 2007, 501-510. Ranking in Web 2.0: • Mohammad Nauman and Shahbaz Khan. Using Personalized Web Search for Enhancing Common Sense and Folksonomy Based Intelligent Search Systems. wi, (0):423-426, IEEE Computer Society, Los Alamitos, CA, USA, 2007. Usefulness of Tag Clouds: • J. Sinclair and M. Cardew-Hall. The folksonomy tag cloud: When is it useful? Journal of Information Science, 016555150607808, CILIP, 2007.

  21. “Traditional” Search … Tagging and Search

  22. “Social” Search

  23. Is the content in both systems the same?

  24. Data Collection • Social bookmarking data (del.icio.us) • Search engine data: queries, timestamps, session IDs; rankings (Google 100, MSN 1000). [Figure: datasets MSN, AOL, and Google, grouped into queries (click data) and rankings.]

  25. Tagging and Searching: Basic statistics • MSN and AOL are very similar • Words and tags differ: low overall overlap due to the power-law distribution; relatively high overlap on frequent (>10) terms • del.icio.us contains many multi-word lexemes • MSN contains many URL parts

  26. Tagging and Searching: Basic statistics (top terms by frequency)

      Del.icio.us               MSN                       AOL
      Top tags    Frequency     Top terms   Frequency     Top terms   Frequency
      design      145,585       yahoo       119,580       free        181,137
      blog        116,537       google      102,728       google      166,110
      software     84,376       free        100,873       http        118,628
      web          77,798       county       97,495       county      118,002
      reference    75,977       myspace      92,078       pictures    107,316

  27. Tagging and Searching: Usage over Time • Comparison of MSN words and del.icio.us tags • Normalization, Pearson correlation coefficient, t-test • 1003 terms: 307 out of 1003 terms show a significant correlation (5 % level) • [Figure, example "vista": number of times the split word was submitted to a search engine on one day versus number of times the tag was added to a resource on one day.] • People search and tag content at the same time • Tagging and searching are triggered by similar motivations
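The per-term comparison can be sketched as a Pearson correlation between two daily count series, followed by the usual t statistic for testing whether the correlation differs from zero. This is a minimal sketch: the daily counts below are invented for illustration, the normalization step mentioned on the slide is omitted, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t value for H0: no correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Invented daily counts for one term over seven days.
searches_per_day = [10, 12, 30, 55, 40, 22, 15]  # query-log occurrences
tags_per_day = [1, 2, 5, 9, 8, 3, 2]             # tag assignments

r = pearson(searches_per_day, tags_per_day)
t = t_statistic(r, len(searches_per_day))
print(r, t)
```

With seven days there are five degrees of freedom, so the correlation is significant at the two-sided 5 % level when |t| exceeds roughly 2.571.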

  28. Summary • Low overlap of query terms / tags • URLs in query log / multi-word lexemes • Power-law distribution • Frequent terms overlap • Tagging and searching are triggered by similar motivations • Del.icio.us covers top search engine URLs • Ranking overlap is low • Correlation is higher for IT-specific topics

  30. Tag Recommender

  31. Tag Recommender

  32. Tag Recommender

  33. Clicked Recommended Tags

  34. Recommender - Related Work. Recommender: • G. Adomavicius and A. Tuzhilin. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. Knowledge and Data Engineering, IEEE Transactions on, (17)6:734--749, 2005. Tag Recommender: • Z. Xu and Y. Fu and J. Mao and D. Su. Towards the semantic web: Collaborative tag suggestions. Collaborative Web Tagging Workshop at WWW2006, Edinburgh, Scotland, May, 2006. • Yanfei Xu and Liang Zhang and Wei Liu. Cubic Analysis of Social Bookmarking for Personalized Recommendation. Frontiers of WWW Research and Development - APWeb 2006, 733--738, 2006. • Jäschke, R.; Eisterlehner, F.; Hotho, A. & Stumme, G. (2009), Testing and Evaluating Tag Recommenders in a Live System, in Dominik Benz & Frederik Janssen, ed., 'Workshop on Knowledge Discovery, Data Mining, and Machine Learning', pp. 44--51. • Jäschke, R.; Marinho, L.; Hotho, A.; Schmidt-Thieme, L. & Stumme, G. (2008), 'Tag Recommendations in Social Bookmarking Systems', AI Communications 21 (4), 231-247. More under: http://www.bibsonomy.org/tag/recommender+folksonomy

  36. BibSonomy after lunch ...

  37. Spam?

  38. Spam: User Level vs. Post Level

  39. Sources of Spam?

  40. Spam Classification. [Figure: pipeline with the stages user data collection, feature engineering, training, and classification; data sources: registration information, logging information, social network information, and posts; annotation: "Personal Data!!"]

  41. Dataset: Creation. BibSonomy admins and developers flag users as spammers. The decision is based on: • Links (websites) of posts • Added tags. It is also influenced by personal information: • E-mail • Choice of name • Registration IP • …

  42. Dataset creation process

  43. Dataset - Figures

                Users   Spammers   Tags      Resources   TAS
      All       1,411   18,681     306,993   920,176     8,709,417
      Training  1,306   15,891     282,473   774,678     7,904,735
      Test        100    2,790      49,644   153,512       804,682

      • Time frame: until end of 2007 • Only users with at least one post • No consideration of private posts • Tags not normalized

  44. Features • 25 features • 4 different categories • Normalization of each user's feature vector. Profile Information: • Realname with 2 or 3 words • length of the user name, email, realname • digits in user name. Activity Information: • time between registration and first post • number of tags per post • average number of TAS • 470 for spammers, 334 for users. Location Information: • number of users in the same domain • number of users in the same top level domain • number of spam users with this IP. Semantic Information: • blacklist of tags • co-occurrence information of the graph, e.g. spammer shares resources with other spammers

  45. Personal Data in BibSonomy??? Identifiable: a person who can be identified, directly or indirectly. [Figures: Example 1, Example 2.]

  46. Data Privacy • Concerns exist wherever personally identifiable information is collected and stored, in digital form or otherwise • It is a right to control the storage of personal data! • Europe: Article 8 of the European Convention on Human Rights (ECHR). Open Issues: • Which laws? • Does the collection and processing of data in BibSonomy conform to existing law? • Best practices for other social bookmarking systems / social tagging systems? • Do we need to process personal data in our data mining / research applications?

  47. Data Privacy Categories for Data Sources

      Rank  Data category                  Examples
      1     anonymised data                All user data that the operator cannot associate with a single user after having removed all features which allow an identification
      2     publicly available data        Posts, friend and follower links, registration information if published
      3     registration information       Registration data not published
      4     logging information            Logging data collected by operator
      5     explicitly not published data  Posts, contact and profile information marked as private by the user

  48. Data Privacy Categories for Data Sources. Ranks: 1 anonymised data, 2 publicly available data, 3 registration information, 4 logging information, 5 explicitly not published data. [Figure: the data sources for feature engineering (registration information, logging information, social network information, posts) annotated with the privacy ranks 3, 2, 4, and 2,3,4.]

  49. Experimental Design • Spam challenge dataset 2008 • Features of different data categories • Classification methods of Weka • AUC value • The winner of the challenge reached an AUC value of 0.98
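The AUC value used for evaluation can be read as the probability that a randomly chosen spammer receives a higher score than a randomly chosen legitimate user. A minimal sketch of that pairwise definition follows; the labels and scores are invented, the function name is hypothetical, and this is not the Weka implementation.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented example: 1 = spammer, 0 = legitimate user.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
value = auc(labels, scores)
print(value)
```

The quadratic loop is fine for illustration; for datasets of the size reported in the talk, a rank-based computation after a single sort would be the more practical choice.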

  50. Performance + Privacy Evaluation

  51. Spam Detection - Related Work • Paul Heymann and Georgia Koutrika and Hector Garcia-Molina. Fighting Spam on Social Web Sites: A Survey of Approaches and Future Challenges. IEEE Internet Computing, (11)6:36-45, 2007. • Georgia Koutrika and Frans Adjie Effendi and Zoltan Gyöngyi and Paul Heymann and Hector Garcia-Molina. Combating spam in tagging systems. AIRWeb '07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, 57--64, ACM Press, New York, NY, USA, 2007. • Benjamin Markines and Ciro Cattuto and Filippo Menczer. Social spam detection. In Dennis Fetterly and Zoltán Gyöngyi, editor(s), AIRWeb, 41-48, 2009. • Zoltán Gyöngyi and Hector Garcia-Molina and Jan Pedersen. Combating Web Spam with TrustRank. VLDB, 576-587, 2004.

  53. What kind of “related” tags?

  54. Example for cosine measure

  55. Example: Most related tags for "web2.0" and "howto"

      Tag     Sim. measure   1          2           3          4            5
      web2.0  Coocc          ajax       web         tools      blog         webdesign
      web2.0  FolkRank       web        ajax        tools      design       blog
      web2.0  TagContext     web2       web-2.0     webapp     „web         web_2.0
      web2.0  ResourceCont.  web2       web20       2.0        web_2.0      web-2.0
      web2.0  UserContext    ajax       aggregator  rss        google       collaborate
      howto   Coocc          tutorial   reference   tips       linux        programming
      howto   FolkRank       reference  linux       tutorial   programming  software
      howto   TagContext     how-to     guide       tutorials  help         how_to
      howto   ResourceCont.  how-to     tutorial    tutorials  tips         diy
      howto   UserContext    reference  tutorial    tips       hacks        tools
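The context-based measures in this comparison rest on the cosine between sparse tag vectors. A minimal sketch follows, assuming each tag is described by counts of the other tags it co-occurs with (a tag-context style representation); the counts and function names are invented for illustration and are not taken from the talk's data.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(key, 0) for key, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_related(tag, contexts):
    """Return the other tag whose context vector is closest to that of `tag`."""
    others = (t for t in contexts if t != tag)
    return max(others, key=lambda t: cosine(contexts[tag], contexts[t]))


# Invented co-occurrence counts: tag -> {co-occurring tag: count}.
contexts = {
    "howto": {"linux": 4, "tips": 2, "diy": 1},
    "how-to": {"linux": 3, "tips": 2, "diy": 1},
    "music": {"mp3": 5, "radio": 2, "tips": 1},
}
best = most_related("howto", contexts)
print(best)
```

Two tags can score high here without ever being assigned together, which is why the context measures tend to surface spelling variants such as "how-to" rather than topically associated tags.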

  56. Semantic Grounding in WordNet • WordNet is a large lexical database for English. • Words with the same meaning are grouped in synsets, which are ordered by an is-a hierarchy. • Introducing a single artificial root node enables the application of graph-based similarity metrics between pairs of nouns / pairs of verbs. • Inclusion of the top n Delicious tags in WordNet: 100: 82%; 1,000: 79%; 5,000: 69%; 10,000: 61%

  57. Example of Semantic Grounding. Original tag: "java". Most similar tag: • Freq, FolkRank: "programming" • Cosine: "python". [Figure: tags mapped into the WordNet synset hierarchy, with the labels computers, programming, design_patterns, languages, java, and python; the grounded similarity is measured in that hierarchy.]

  58. Shortest paths in WordNet. [Figure: length of the shortest path to the most related tag; labels: random, siblings.]

  59. Related Work. Ontology Learning: • Dominik Benz and Andreas Hotho. Position Paper: Ontology Learning from Folksonomies. In Alexander Hinneburg, editor(s), LWA 2007: Lernen - Wissen - Adaption, Halle, September 2007, Workshop Proceedings (LWA), 109-112, Martin-Luther-University Halle-Wittenberg, 2007. • Francis Heylighen. Bootstrapping knowledge representations: from entailment meshes via semantic nets to learning webs. Kybernetes, (30)5/6:691--722, 2001. • Paul Heymann and Hector Garcia-Molina. Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems. 2006-10 2006. • P. Mika, Ontologies Are Us: A Unified Model of Social Networks and Semantics, Springer, 2005, 522-536. • P. Schmitz, Inducing Ontology from Flickr Tags. 2006. Analysis of tagging behaviour: • C. Cattuto, Semiotic dynamics in online social communities. The European Physical Journal C - Particles and Fields, 2006, 46, 33-37. • Shilad Sen and Shyong K. Lam and Al Mamunur Rashid and Dan Cosley and Dan Frankowski and Jeremy Osterhouse and F. Maxwell Harper and John Riedl. tagging, communities, vocabulary, evolution. CSCW '06: Proceedings of the 2006 20th anniversary conference on Computer supported cooperative work, 181--190, ACM, New York, NY, USA, 2006. More under: http://www.bibsonomy.org/tag/ontology+folksonomy

  61. Lessons Learned • Network measures provide interesting insights into the user behaviour of folksonomies • All types of nodes provide valuable information • Click behaviour has a similar structure to tagged data • Ranking based on graph structure, similar to PageRank, is possible • Correlation between search interests and tagged items • Recommenders can influence the user behaviour • Spam is a critical issue • Relatedness measures on tags in folksonomies are a good basis to extract semantic relations

  62. Future Work • Combining the information for personalized search • Using tag data to improve intranet search • Utilizing the information of the annotated resource • Trying to improve search and ranking by allowing semantics within tagging systems • Learning the ranking • …

  63. References: http://www.bibsonomy.org/group/kde/ol_tut2010

  64. Backup

  65. Agenda • BibSonomy: a social bookmark and publication sharing system • Network Properties of Folksonomies • Network Properties of Logsonomies • Tag Relatedness • Query Relatedness • Spam Detection • Summary and Outlook. [Figure: rank (y-axis, 0 to 0.4) over month (x-axis, 0 to 14) for the tags "blog", "css", "design", "linux", "music", "news", "programming", "software", and "web".]

  66. Folksonomy vs. Logsonomy Dataset • Del.icio.us excerpt, 10,000 most popular tags: |U| = 476,378, |T| = 10,000, |R| = 12,660,470, |Y| = 101,491,722 • AOL split queries, 10,000 most popular words: |U| = 463,380, |T| = 10,000, |R| = 1,284,724, |Y| = 26,227,550
