IR in Social Media Andreas Hotho Data Mining and Information - - PowerPoint PPT Presentation



slide-1
SLIDE 1

IR in Social Media

Andreas Hotho

Data Mining and Information Retrieval Group

University of Würzburg

slide-2
SLIDE 2

Data Mining and Information Retrieval Group @ Würzburg

  • Founded in 2009 as an academic offspring of Kassel

www.wordle.net

slide-3
SLIDE 3

Agenda

Introduction

Understanding the Data
  • Network Properties of Folksonomies
  • Network Properties of Logsonomies
  • Types of Tags, Users, Resources

Folksonomies and Search
  • Ranking in Folksonomies (FolkRank)
  • Comparison of Traditional and Social Search

IR related approaches
  • Recommender
  • Spam in Folksonomies
  • Semantic in Folksonomy

Summary and Outlook

slide-4
SLIDE 4

Definition: Web 2.0

“The term Web 2.0 is commonly associated with web applications that facilitate interactive information sharing, interoperability, user-centered design, and collaboration on the World Wide Web. Although the term suggests a new version of the World Wide Web, it does not refer to an update to any technical specifications, but rather to cumulative changes in the ways software developers and end-users use the Web.”

Wikipedia, http://en.wikipedia.org/wiki/Web_2.0

  • The term was coined in 1999 by Darcy DiNucci in her article „Fragmented Future“.
  • Tim O'Reilly shaped it by his work „What is Web 2.0“ (Sep. 2005) and the Web 2.0 conference in 2004.

slide-5
SLIDE 5

Definition: Web 2.0

The version number in “Web 2.0”, known from software updates, points to an updated web that includes services like
  • Folksonomies (collaborative tagging, social classification, social indexing, and social tagging),
  • Weblogs (blogs),
  • Wikis,
  • Social networks,
  • Social software,

and technologies like
  • RSS feeds,
  • Podcasts,
  • Mashups (merging content from different sources, client- and server-side),
  • Application Programming Interfaces (APIs): REST and/or XML- and/or JSON-based APIs,
  • AJAX-based web applications.

slide-6
SLIDE 6

A Map of the Web 2.0

artwork by R. Munroe http://xkcd.com/

  • Blogs
  • Wikis
  • Bookmarking
  • YouTube
  • Flickr
  • 43Things
  • MySpace
  • Facebook
  • ...
slide-7
SLIDE 7

30.08.2011 Andreas Hotho 8

slide-8
SLIDE 8

Web 2.0 – Collaborative Tagging

In this lecture we will focus on collaborative tagging, in particular on social bookmarking:
  • everybody knows (web) bookmarks,
  • has them in his/her own browser,
  • uses them on a daily basis,
  • bookmark repositories emerge totally independently.

Social bookmarking is an interesting source of data: its structure can be used for good ranking, it is similar to click log data, and it shows properties of emergent semantics.

slide-9
SLIDE 9

10

Web Bookmarks

slide-10
SLIDE 10

11

Web Bookmarks

Tags User Resource

slide-11
SLIDE 11

12

Audio Streams

slide-12
SLIDE 12

13

Audio Streams

Tags Users Resource

slide-13
SLIDE 13

14

Photos

slide-14
SLIDE 14

15

Photos

Tags User Resource

slide-15
SLIDE 15

16

Videos

slide-16
SLIDE 16

17

Videos

Tags User Resource

slide-17
SLIDE 17

Folksonomies

Folksonomies allow users to assign tags to resources.

A folksonomy is a tuple $F := (U, T, R, Y, \prec)$ where
  • $U$, $T$, and $R$ are finite sets, whose elements are called users, tags and resources,
  • $Y \subseteq U \times T \times R$ is called the set of tag assignments,
  • $\prec \subseteq U \times T \times T$ is a user-specific sub-tag/super-tag relation.

The personomy $P_u$ of user $u$ is the restriction of $F$ to $u$.
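The definition can be made concrete with a minimal sketch. It assumes that tag assignments are stored as plain (user, tag, resource) triples and that U, T and R are simply read off Y. The sub-tag/super-tag relation is left out, and all user names, tags and URLs are hypothetical toy values.

```python
from collections import namedtuple

# One element of Y: a (user, tag, resource) triple.
TagAssignment = namedtuple("TagAssignment", ["user", "tag", "resource"])

# Hypothetical toy folksonomy.
Y = {
    TagAssignment("alice", "python", "http://example.org/a"),
    TagAssignment("alice", "web", "http://example.org/b"),
    TagAssignment("bob", "web", "http://example.org/a"),
}

# Simplification: U, T and R are taken as the projections of Y.
U = {y.user for y in Y}
T = {y.tag for y in Y}
R = {y.resource for y in Y}


def personomy(assignments, user):
    """Restrict the tag assignments to a single user."""
    return {y for y in assignments if y.user == user}


alice = personomy(Y, "alice")
```

The personomy of a user is then just the subset of triples carrying that user, which is all that "restriction of F to u" needs on the level of Y.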

slide-18
SLIDE 18

Our system: BibSonomy

BibSonomy
  • for sharing bookmarks,
  • for managing publication lists
  • for researchers,
  • for research groups,
  • for projects, ...
  • http://www.bibsonomy.org/

slide-20
SLIDE 20

Dataset

Data from the Delicious folksonomy site
  • Obtained in July 2005 (14 monthly dumps, June 2004 – July 2005)
  • Consists of
    • |U| = 75,242 users
    • |T| = 533,191 tags
    • |R| = 3,158,297 resources
    • |Y| = 17,362,212 triples

Data from BibSonomy
  • Latest obtained in July 2006 (20 monthly dumps)
  • Consists of
    • |U| = 428 users
    • |T| = 13,108 tags
    • |R| = 47,538 resources
    • |Y| = 161,438 triples
slide-21
SLIDE 21

Power Law Distribution in Delicious

  • tag “unlabeled” occurs 415,950 times
  • tag “web” occurs 238,891 times
  • approx. 40% of the tags occur only once
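Figures like these come from simple counting over the tag assignments. The sketch below is an illustration on hypothetical toy triples, not the Delicious data. It counts how often each tag occurs and which share of the distinct tags occurs exactly once.

```python
from collections import Counter


def tag_frequencies(assignments):
    """Count how often each tag appears in (user, tag, resource) triples."""
    return Counter(tag for _user, tag, _resource in assignments)


def share_of_singletons(freq):
    """Fraction of distinct tags that occur exactly once."""
    if not freq:
        return 0.0
    return sum(1 for count in freq.values() if count == 1) / len(freq)


# Hypothetical toy data.
toy = [
    ("u1", "web", "r1"),
    ("u2", "web", "r2"),
    ("u1", "css", "r1"),
    ("u3", "rare", "r3"),
    ("u3", "web", "r3"),
]
freq = tag_frequencies(toy)
singletons = share_of_singletons(freq)
```

Sorting the counts in descending order and plotting them on log-log axes is the usual way to inspect such a distribution for a heavy tail.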
slide-22
SLIDE 22

Small World

Milgram introduced the notion of a „small world“
(Stanley Milgram. The small world problem. Psychology Today, 67(1):61–67, 1967.):
  • Practical experiment in the US
  • Any two persons in the US are connected by a very short chain: six degrees of separation

Formal definition of the small world property for graphs:
  • (Erdös) random graph
  • Large clustering degree

Folksonomies exhibit small world properties:
  • Small characteristic path lengths
  • Large clustering degree (connectedness and cliquishness)

slide-23
SLIDE 23

Networks of Tag Co-Occurrence

  • Consider tag-tag co-occurrences
  • Link weight = number of common posts
  • Strength of a node $t$: total weight of its edges
  • Examine cumulative strength distribution [Vazquez 2005]: $P_>(s)$ := probability of node strength exceeding $s$
  • Compare with shuffled graph: tags exchanged randomly between posts
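The quantities on this slide can be sketched in a few lines. The sketch assumes that a post is given as the set of tags one user attached to one resource. The function names and the toy posts are hypothetical, and the shuffled reference graph is not included.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_weights(posts):
    """Link weight of a tag pair = number of posts containing both tags."""
    weights = defaultdict(int)
    for tags in posts:
        for a, b in combinations(sorted(set(tags)), 2):
            weights[(a, b)] += 1
    return weights


def node_strengths(weights):
    """Strength of a tag = total weight of its edges."""
    strength = defaultdict(int)
    for (a, b), w in weights.items():
        strength[a] += w
        strength[b] += w
    return strength


def cumulative_strength_distribution(strength):
    """P_>(s): fraction of nodes whose strength exceeds s."""
    values = list(strength.values())
    return {
        s: sum(1 for v in values if v > s) / len(values)
        for s in sorted(set(values))
    }


# Hypothetical toy posts.
posts = [{"web", "css"}, {"web", "css", "design"}, {"web", "python"}]
weights = cooccurrence_weights(posts)
strength = node_strengths(weights)
p_greater = cumulative_strength_distribution(strength)
```

On real data, `p_greater` would be plotted on log-log axes and compared with the same curve for a version in which tags are exchanged randomly between posts.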

slide-24
SLIDE 24

Cumulative Strength Distribution

  • Fat-tailed distribution
  • Irregularities due to spamming activity, e.g.
    • Large number of tags per post
    • Regular number of tags (10, 50) per post
  • Same distribution for shuffled tags
    • Behaviour determined solely by tag frequencies

[Figure: P>(s) versus s for Delicious and BibSonomy]

slide-25
SLIDE 25

Nearest-Neighbor Strength

  • Examine strength correlations between neighbors
  • Average nearest-neighbor strength for node $i$: $S_{nn}$
  • Assortative mixing: $S_{nn}$ positively correlated to $s$
    • E.g. social networks
  • Disassortative mixing: $S_{nn}$ negatively correlated to $s$
    • E.g. man-made, hierarchical networks
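The formula for the average nearest-neighbor strength did not survive on the slide. The sketch below therefore assumes the common definition: the mean of the strengths of a node's neighbors, each neighbor counted once. The edge weights are hypothetical toy values.

```python
from collections import defaultdict


def average_nn_strength(weights):
    """Return (S_nn, strength) for an undirected graph {(a, b): weight}.

    Assumed definition: S_nn(i) is the unweighted mean of the strengths
    of the neighbors of node i.
    """
    neighbors = defaultdict(set)
    strength = defaultdict(float)
    for (a, b), w in weights.items():
        neighbors[a].add(b)
        neighbors[b].add(a)
        strength[a] += w
        strength[b] += w
    s_nn = {
        i: sum(strength[j] for j in neighbors[i]) / len(neighbors[i])
        for i in neighbors
    }
    return s_nn, dict(strength)


# Hypothetical toy tag co-occurrence graph.
toy_weights = {
    ("css", "web"): 2,
    ("css", "design"): 1,
    ("design", "web"): 1,
    ("python", "web"): 1,
}
s_nn, strength = average_nn_strength(toy_weights)
```

In this toy graph the strongest node has the lowest S_nn and the weakest node the highest, which is the disassortative pattern described above.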
slide-26
SLIDE 26

Nearest-Neighbor Strength: Delicious

[Figure: average nearest-neighbor strength S_nn versus node strength s]

slide-27
SLIDE 27

Related Work – Analysis of the Folksonomy Graph

Network properties of Web 2.0 applications
  • K. Shen, L. Wu. Folksonomy as a Complex Network, 2005.
  • R. Lambiotte and M. Ausloos. Collaborative tagging as a tripartite network. 2005.
  • P. Kolari, T. Finin, Y. Yesha, Y. Yesha, K. Lyons, S. Perelgut and J. Hawkins. On the Structure, Properties and Utility of Internal Corporate Blogs. Proceedings of the International Conference on Weblogs and Social Media (ICWSM 2007), 2007.
  • A. Capocci, V. D. P. Servedio, F. Colaiori, L. S. Buriol, D. Donato, S. Leonardi, and G. Caldarelli. Preferential attachment in the growth of social networks: The internet encyclopedia wikipedia. Phys. Rev. E, 74:036116, 2006.

Introduction into tagging systems
  • S. Golder and B. A. Huberman. The Structure of Collaborative Tagging Systems, cs/0508082 (2005)
  • A. Mathes. Folksonomies – Cooperative Classification and Communication Through Shared Metadata, December 2004. http://www.adammathes.com/academic/computermediated-communication/folksonomies.html.

Analysis of tagging behaviour
  • C. Cattuto, A. Baldassarri, V. Servedio, and V. Loreto. Vocabulary growth in collaborative tagging systems, 2007.
  • C. Cattuto, V. Loreto and L. Pietronero. Collaborative Tagging and Semiotic Dynamics, PNAS, 2007.
  • E. Santos-Neto, M. Ripeanu, A. Iamnitchi. Tracking User Attention in Collaborative Tagging Communities, 2007.

More under: http://www.bibsonomy.org/tag/network+folksonomy

slide-29
SLIDE 29


slide-30
SLIDE 30

Logsonomies allow users to assign resources to query terms.

slide-31
SLIDE 31

Folksonomies & Logsonomies

Social Bookmarking Systems
  • del.icio.us
  • simpy
  • BibSonomy

Query Logs from Search Engines
  • Google
  • MSN
  • AOL

Folksonomies: allow users to assign tags to resources

Logsonomies: allow users to query terms and click results

slide-32
SLIDE 32

Datasets

  • del.icio.us
    • 81,992 users
    • July 2005
  • AOL
    • 10 million queries
    • 657,426 users
    • March-May 2006
  • MSN
    • 15 million queries
    • 7.4 million different sessions
    • May 2006
slide-33
SLIDE 33


Logsonomy Dataset

Most frequent Tags

slide-34
SLIDE 34


Degree distribution of tags/query words/queries.

slide-35
SLIDE 35

Small World Properties

  • Test of small world properties
    • Clustering coefficient (CC)
    • Characteristic path length (CPL)
    • ... compared to random graph (Erdös)

[Figure: example graph connecting User 1 to User 3, Tag 2, Tag 3 and Res 1 to Res 3]

Logsonomies show similar properties.
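The two measures can be sketched for a small undirected graph stored as a dictionary of neighbor sets. This is an illustration under simplifying assumptions. The clustering coefficient is taken as the average local coefficient, and path lengths are averaged over reachable pairs only. The random reference graph is a simple G(n, m) sampler. The tripartite variants used for folksonomies are not covered, and the toy graph is hypothetical.

```python
import random
from collections import deque
from itertools import combinations


def clustering_coefficient(graph):
    """Average local clustering coefficient (degree < 2 counts as 0)."""
    total = 0.0
    for node, nbrs in graph.items():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)


def characteristic_path_length(graph):
    """Mean shortest-path length over all reachable ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def random_graph(nodes, n_edges, seed=0):
    """Random reference graph with the same number of nodes and edges."""
    rng = random.Random(seed)
    nodes = list(nodes)
    graph = {v: set() for v in nodes}
    added = 0
    while added < n_edges:
        a, b = rng.sample(nodes, 2)
        if b not in graph[a]:
            graph[a].add(b)
            graph[b].add(a)
            added += 1
    return graph


# Hypothetical toy graph: a triangle a-b-c with a pendant node d.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
cc = clustering_coefficient(toy)
cpl = characteristic_path_length(toy)
reference = random_graph(toy, n_edges=4)
```

A graph is then called small-world-like when its path length is about as short as in the random reference while its clustering coefficient is clearly larger.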

slide-36
SLIDE 36

Results: Average Nearest Neighbor Strength

[Figure panels: Del.icio.us; AOL with split queries, complete queries, complete URLs, host only URLs]

slide-38
SLIDE 38

Types of Tags [Golder & Huberman, 2006]

Golder & Huberman identified seven types of tags:
  • Identifying what (or who) it is about, e.g., ontology, learning
  • Identifying what it is, e.g., article, blog
  • Identifying who owns it, e.g., apple, google
  • Refining categories, e.g., 2010
  • Identifying qualities or characteristics, e.g., interesting, cool (also called sentiment tags)
  • Self reference, e.g., myown
  • Task organization, e.g., toread, tobuy (also called intent or purpose tags)

Additionally, we can find
  • Category of a resource
  • System tags, e.g., for:andrea

slide-39
SLIDE 39

Types of Users [Marlow et al., 2006]

  • A: consistently new tags as new photos are uploaded
  • B: few tags, sudden growth later

slide-40
SLIDE 40

Types of Users [Strohmaier et al., 2010]

Evidence of different ways HOW users tag (Tagging Pragmatics)

Broad distinction by tagging motivation [Strohmaier 2009]:

„Categorizers“ …
  • use a small controlled tag vocabulary
  • goal: „ontology-like“ categorization by tags, for later browsing
  • tags as replacement for folders

„Describers“ …
  • tag „verbosely“ with freely chosen words
  • vocabulary not necessarily consistent (synonyms, spelling variants, …)
  • goal: describe content, ease retrieval

[Figure: example tag vocabularies: donuts, duff, marge, beer, bart, barty, Duff-beer; bev, alc, nalc, beer, wine]
slide-41
SLIDE 41

Types of Resources

Basically, there are systems to tag anything …
  • videos
  • goals in life
  • bookmarks
  • photos
  • news
  • publication references
  • contacts
  • …

to name just a few.

slide-42
SLIDE 42

Types of Tags - Related Work

  • Usage patterns of collaborative tagging systems. S. Golder and B. Huberman, Journal of Information Science 32, 2006.
  • M. Strohmaier, Purpose Tagging - Capturing User Intent to Assist Goal-Oriented Social Search, SSM'08 Workshop on Search in Social Media, in conjunction with CIKM'08, Napa Valley, USA, 2008.
  • M. Strohmaier, C. Körner, and R. Kern, Why do Users Tag? Detecting Users' Motivation for Tagging in Social Tagging Systems, 4th International AAAI Conference on Weblogs and Social Media (ICWSM2010), Washington, DC, USA, May 23-26, 2010.
  • Yanbe, Y.; Jatowt, A.; Nakamura, S. & Tanaka, K. (2007), Can social bookmarking enhance search in the web?, in 'JCDL '07: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries', ACM, New York, NY, USA, pp. 107--116.
  • Rattenbury, T.; Good, N. & Naaman, M. (2007), Towards automatic extraction of event and place semantics from flickr tags, in 'SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval', ACM Press, New York, NY, USA, pp. 103--110.

slide-43
SLIDE 43

Types of Users – Related Work

  • Marlow, C.; Naaman, M.; Boyd, D. & Davis, M. (2006), HT06, tagging paper, taxonomy, Flickr, academic article, to read, in 'HYPERTEXT '06: Proceedings of the seventeenth conference on Hypertext and hypermedia', ACM, New York, NY, USA, pp. 31--40.
  • C. Körner, R. Kern, H.-P. Grahsl, and M. Strohmaier: Of categorizers and describers: an evaluation of quantitative measures for tagging motivation, HT '10: Proceedings of the 21st ACM Conference on Hypertext and Hypermedia, New York, NY, USA, ACM, 2010.
  • Strohmaier, M.; Körner, C. & Kern, R. (2010), Why do users tag? Detecting users' motivation for tagging in social tagging systems, in 'International AAAI Conference on Weblogs and Social Media (ICWSM2010)'.
  • http://src.acm.org/2010/ChristianKoerner/understanding_the_motivation_behind_tagging/index.html

slide-45
SLIDE 45

Search in Folksonomies

Search engines need
  1. to compute the hits for a query
  2. and rank them.

The PageRank algorithm is very successful in the web (see Google): authority values are propagated along the hyperlinks according to

$$x \leftarrow d A x + (1-d) p$$

where
  • $A$ is the row-stochastic adjacency matrix of the web graph (each row of $A$ is normalized to 1),
  • $x$ is the rank vector,
  • $p$ is the random surfer component (may be used as preference vector),
  • $d \in [0,1]$ is a weighting factor.

If $\|A\|_1 := \|p\|_1 := 1$ and there are no rank sinks, then the computation of a fixed point equals the computation of the first eigenvector of the matrix $dA + (1-d) p \mathbf{1}^T$.
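The iteration can be sketched as a plain power iteration. The sketch is written from the point of view of the node passing weight on: every node splits its current weight evenly among its out-links, which corresponds to applying the transpose of a row-normalized adjacency matrix. The value d = 0.85, the fixed iteration count and the toy graph are assumptions made for illustration. The graph must not contain rank sinks.

```python
def pagerank(out_links, p=None, d=0.85, iterations=100):
    """Iterate x <- d * (spread along links) + (1 - d) * p.

    out_links maps each node to a non-empty list of successor nodes.
    p is the preference vector; if omitted, it is uniform.
    """
    nodes = sorted(out_links)
    if p is None:
        p = {v: 1.0 / len(nodes) for v in nodes}
    x = dict(p)
    for _ in range(iterations):
        new = {v: (1 - d) * p[v] for v in nodes}
        for u in nodes:
            share = d * x[u] / len(out_links[u])
            for v in out_links[u]:
                new[v] += share
        x = new
    return x


# Hypothetical toy web graph without rank sinks.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
```

Passing a non-uniform `p` turns the same loop into a topic-specific or personalized ranking, which is the role of the preference vector mentioned above.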
slide-46
SLIDE 46

Search in Folksonomies

  • Folksonomies have a different structure than the web graph.
  • What could a ranking algorithm for this structure look like?

[Figure: web graph versus folksonomy, with hyperedges connecting users (User 1 to User 4), tags (Tag 1 to Tag 3) and resources (Res 1 to Res 3)]

slide-47
SLIDE 47

First Approach: Adapted PageRank

  1. Split each hyperedge into six directed edges.
  2. Iterative weight propagation according to PageRank:

$$x \leftarrow d A x + (1-d) p$$

[Figure: hyperedge (User 1, Tag 1, Res 1) before and after the split]

slide-48
SLIDE 48

Converting a Folksonomy into an Undirected Graph

  • The set $V$ of nodes consists of the disjoint union of the sets of tags, users and resources: $V = U \,\dot\cup\, T \,\dot\cup\, R$
  • All co-occurrences of users and tags, tags and resources, users and resources become edges between the respective nodes:

$$E = \{\{u,t\} \mid \exists r \in R : (u,t,r) \in Y\} \cup \{\{t,r\} \mid \exists u \in U : (u,t,r) \in Y\} \cup \{\{u,r\} \mid \exists t \in T : (u,t,r) \in Y\}$$
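The conversion can be sketched directly from the definition. To keep the union disjoint, each node is stored as a pair of its kind and its name, so a user and a tag with the same spelling stay distinct. That pairing and the toy triples are hypothetical choices for illustration.

```python
def folksonomy_to_graph(assignments):
    """Build the undirected graph (V, E) from (user, tag, resource) triples."""
    nodes, edges = set(), set()
    for user, tag, resource in assignments:
        u, t, r = ("user", user), ("tag", tag), ("res", resource)
        nodes.update((u, t, r))
        # Each triple contributes the three pairwise co-occurrence edges.
        edges.add(frozenset((u, t)))
        edges.add(frozenset((t, r)))
        edges.add(frozenset((u, r)))
    return nodes, edges


# Hypothetical toy tag assignments.
Y = [("alice", "web", "r1"), ("bob", "web", "r1"), ("alice", "css", "r2")]
V, E = folksonomy_to_graph(Y)
```

Because edges are stored as sets, the pair (web, r1) that occurs in two triples is counted only once, as in the definition of E.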

slide-49
SLIDE 49

Ranking in Folksonomies: FolkRank

Problems of folksonomy-adapted PageRank
  • dominated by graph structure
  • undirected: weight flows back (PageRank ≈ edge degree)

Differential approach
  • compute rank with and without preferences
  • FolkRank = difference between those rankings, normalized to [0,1]
  • Let $R_{AP}$ be the fixed point with $p = \mathbf{1}$
  • Let $R_{pref}$ be the fixed point with $p$ representing the high weights for the preferred items
  • $R := R_{pref} - R_{AP}$ is the final weight vector
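The differential approach can be sketched as two runs of the same weight propagation followed by a subtraction. Several details below are assumptions, not taken from the slides. Edge weights count how often two nodes co-occur, both preference vectors are scaled to sum to 1, d = 0.7, and the preferred node gets an extra weight `boost`. The final normalization to [0,1] is left out, and the toy triples are hypothetical.

```python
from collections import defaultdict


def build_graph(assignments):
    """Weighted undirected graph; weight = number of co-occurrences."""
    w = defaultdict(lambda: defaultdict(float))
    for user, tag, resource in assignments:
        a, b, c = ("user", user), ("tag", tag), ("res", resource)
        for x, y in ((a, b), (b, c), (a, c)):
            w[x][y] += 1.0
            w[y][x] += 1.0
    return w


def propagate(w, p, d=0.7, iterations=50):
    """Adapted PageRank: spread weight along edges, mix in preference p."""
    total = sum(p.values())
    p = {v: p[v] / total for v in w}
    x = dict(p)
    for _ in range(iterations):
        new = {v: (1 - d) * p[v] for v in w}
        for u in w:
            out = sum(w[u].values())
            for v, weight in w[u].items():
                new[v] += d * x[u] * weight / out
        x = new
    return x


def folkrank(assignments, preferred, boost=10.0):
    """Difference between the preference run and the uniform run."""
    w = build_graph(assignments)
    uniform = {v: 1.0 for v in w}
    pref = dict(uniform)
    pref[preferred] += boost
    r_ap = propagate(w, uniform)
    r_pref = propagate(w, pref)
    return {v: r_pref[v] - r_ap[v] for v in w}


# Hypothetical toy tag assignments.
Y = [
    ("alice", "python", "r1"),
    ("alice", "web", "r2"),
    ("bob", "python", "r3"),
    ("carol", "web", "r2"),
]
scores = folkrank(Y, ("tag", "python"))
```

Subtracting the uniform run removes the part of the ranking that every query would share, so nodes that are merely well connected no longer dominate the result.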

slide-50
SLIDE 50

51

Results for: “Semantic Web”

PageRank without preference PageRank with preference FolkRank with preference

slide-51
SLIDE 51

52

Rankings for „semanticweb“ for discovering semantic relationships, user communities, and web pages

slide-52
SLIDE 52

53

Trends with respect to tag “politics”

US elections in Nov. 2004

slide-53
SLIDE 53

Related Work

Ranking in Folksonomies
  • Michail, A. CollaborativeRank: Motivating People to Give Helpful and Timely Ranking Suggestions, School of Computer Science and Engineering, 2005.
  • Szekely, B. & Torres, E. Ranking Bookmarks and Bistros: Intelligent Community and Folksonomy Development, 2005.
  • Bao, S.; Xue, G.; Wu, X.; Yu, Y.; Fei, B. & Su, Z. Optimizing web search using social annotations, ACM Press, 2007, 501-510.

Ranking in Web 2.0
  • Mohammad Nauman and Shahbaz Khan. Using Personalized Web Search for Enhancing Common Sense and Folksonomy Based Intelligent Search Systems. wi, (0):423-426, IEEE Computer Society, Los Alamitos, CA, USA, 2007.

Usefulness of Tag Clouds
  • J. Sinclair and M. Cardew-Hall. The folksonomy tag cloud: When is it useful? Journal of Information Science, 016555150607808, CILIP, 2007.

slide-55
SLIDE 55

Tagging and Search 56

“Traditional” Search…

slide-56
SLIDE 56

Tagging and Search 57

“Social” Search

slide-57
SLIDE 57

Tagging and Search 58

Is the content in both systems the same?

slide-58
SLIDE 58

Data Collection

  • Social bookmarking data (del.icio.us)
  • Search engine data
    • Queries, Timestamps, SessionId
    • Rankings (Google 100, MSN 1000)

[Figure: datasets from Google, AOL and MSN, split into rankings and queries (clickdata)]

slide-59
SLIDE 59

Tagging and Searching: Basic statistics

  • MSN and AOL very similar
  • words and tags different:
    • low overall overlap due to power law distribution
    • relatively high overlap on frequent (>10) terms
    • del.icio.us many multi-word lexemes
    • MSN contains many URL parts

slide-60
SLIDE 60

Tagging and Searching: Basic statistics

Del.icio.us              MSN                      AOL
Top tags    Frequency    Top terms   Frequency    Top terms   Frequency
design      119,580      free        145,585      yahoo       181,137
blog        102,728      google      116,537      google      166,110
software    100,873      http         84,376      free        118,628
web          97,495      county       77,798      county      118,002
reference    92,078      pictures     75,977      myspace     107,316

  • MSN and AOL very similar
  • words and tags different:
    • low overall overlap due to power law distribution
    • relatively high overlap on frequent (>10) terms
    • del.icio.us many multi-word lexemes
    • MSN contains many URL parts

slide-61
SLIDE 61

Tagging and Searching: Usage over Time

  • Comparison of MSN words and del.icio.us tags
  • Normalization, Pearson correlation coefficient, t-test
  • 1003 terms: 307 out of 1003 terms show a significant correlation (5 % level)
  • People search and tag content at the same time
  • Tagging and searching are triggered by similar motivations

[Figure for the term “vista”: number of times the split word was submitted to a search engine on one day; number of times the tag was added to a resource on one day]
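The comparison of two daily time series can be sketched as follows. The exact normalization used in the study is not given on the slide, so dividing each series by its total is an assumption. The daily counts are hypothetical toy numbers. The t statistic is the standard one for testing a correlation coefficient, with n - 2 degrees of freedom.

```python
import math


def normalize(series):
    """Scale a daily count series so that it sums to 1 (assumed scheme)."""
    total = sum(series)
    return [value / total for value in series]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def t_statistic(r, n):
    """t value for testing r against zero; undefined for |r| = 1."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical daily counts for one term over five days.
queries_per_day = [10, 12, 30, 28, 11]
tags_per_day = [1, 2, 6, 5, 1]

r = pearson(normalize(queries_per_day), normalize(tags_per_day))
t = t_statistic(r, len(queries_per_day))
```

The t value is then compared with the critical value of the t distribution at the chosen significance level, which is how a per-term decision such as "significant at the 5 % level" is reached.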

slide-62
SLIDE 62

Tagging and Search 63

Summary

  • Low overlap of query terms / tags
  • URLs in query log / multi-word lexemes
  • Power law distribution
  • Frequent terms overlap
  • Tagging and searching are triggered by similar motivations

  • Del.icio.us covers top search engine URLs
  • Ranking overlap is low
  • Correlation higher for IT specific topics
slide-63
SLIDE 63

64

Agenda

Introduction Understanding the Data

  • Network Properties of Folksonomies
  • Network Properties of LogS
  • nomies
  • Types of Tags, Users, Resources

Folksonomies and Search

  • Ranking in Folksonomies (Folkrank)
  • Comparision of Traditional and S
  • cial S

earch IR related approaches

  • Recommender
  • S

pam in Folksonomies

  • S

emantic in Folksonomy Summary and Outlook

slide-64
SLIDE 64

65

Tag Recommender

slide-65
SLIDE 65

66

Tag Recommender

slide-66
SLIDE 66

67

Tag Recommender

slide-67
SLIDE 67

68

Clicked Recommended Tags

slide-68
SLIDE 68

Recommender - Related Work

Recommender
  • G. Adomavicius and A. Tuzhilin. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. Knowledge and Data Engineering, IEEE Transactions on, (17)6:734--749, 2005.

Tag Recommender
  • Z. Xu and Y. Fu and J. Mao and D. Su. Towards the semantic web: Collaborative tag suggestions. Collaborative Web Tagging Workshop at WWW2006, Edinburgh, Scotland, May, 2006.
  • Yanfei Xu and Liang Zhang and Wei Liu. Cubic Analysis of Social Bookmarking for Personalized Recommendation. Frontiers of WWW Research and Development - APWeb 2006, 733--738, 2006.
  • Jäschke, R.; Eisterlehner, F.; Hotho, A. & Stumme, G. (2009), Testing and Evaluating Tag Recommenders in a Live System, in Dominik Benz & Frederik Janssen, ed., 'Workshop on Knowledge Discovery, Data Mining, and Machine Learning', pp. 44--51.
  • Jäschke, R.; Marinho, L.; Hotho, A.; Schmidt-Thieme, L. & Stumme, G. (2008), 'Tag Recommendations in Social Bookmarking Systems', AI Communications 21 (4), 231-247.

More under: http://www.bibsonomy.org/tag/recommender+folksonomy

slide-70
SLIDE 70

71

BibSonomy after lunch ...

slide-71
SLIDE 71

72

Spam?

slide-72
SLIDE 72

73

Spam – User Level vs. Post Level

slide-73
SLIDE 73

74

Sources of Spam?

slide-74
SLIDE 74

Spam Classification

  • User data collection: registration information, posts, logging information, social network information
  • Feature engineering
  • Training
  • Classification

Personal Data!!

slide-75
SLIDE 75

Dataset: Creation

BibSonomy admins and developers flag users as spammers.

Decision is based on
  • Links (websites) of posts
  • Added tags
  • Also influenced by personal information:
    • E-mail
    • Choice of name
    • Registration IP
slide-76
SLIDE 76


Dataset creation process

slide-77
SLIDE 77

Dataset - Figures

  • Time frame: until end of 2007
  • Only users with at least one post
  • No consideration of private posts
  • Tags not normalized

            Users    Spammers   Tags      Resources   TAS
All         1,411    18,681     306,993   920,176     8,709,417
Training    1,306    15,891     282,473   774,678     7,904,735
Test          100     2,790      49,644   153,512       804,682

slide-78
SLIDE 78

Features

  • 25 features
  • 4 different categories
  • Normalization of each user's feature vector

Profile Information
  • Realname with 2 or 3 words
  • length of the user name, email, realname
  • digits in user name

Activity Information
  • time between registration and first post
  • number of tags per post
  • average number of TAS
    • 470 for spammers, 334 for users

Location Information
  • number of users in the same domain
  • number of users in the same top level domain
  • number of spam users with this IP

Semantic Information
  • blacklist of tags
  • Co-occurrence information of the graph, e.g. spammer shares resources with other spammers
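A few of the profile features can be sketched as simple string statistics. The function names, the dictionary keys and the example account are hypothetical. The slide only states that each user's feature vector is normalized, so the Euclidean (L2) normalization used here is an assumption.

```python
import math


def profile_features(username, email, realname):
    """A handful of profile features, computed from registration strings."""
    words = realname.split()
    return {
        "realname_2_or_3_words": 1.0 if len(words) in (2, 3) else 0.0,
        "len_username": float(len(username)),
        "len_email": float(len(email)),
        "len_realname": float(len(realname)),
        "digits_in_username": float(sum(ch.isdigit() for ch in username)),
    }


def normalize(features):
    """Scale a feature vector to unit Euclidean length (assumed scheme)."""
    norm = math.sqrt(sum(v * v for v in features.values()))
    return {k: (v / norm if norm else 0.0) for k, v in features.items()}


# Hypothetical example account.
raw = profile_features("seo4cash99", "x@example.com", "Buy Cheap Pills Online Now")
vector = normalize(raw)
```

Vectors of this kind, extended by the activity, location and semantic features, would then be handed to a standard classifier.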

slide-79
SLIDE 79

Personal Data in BibSonomy ???

Identifiable: a person who can be identified, directly or indirectly

Example 1 Example 2

slide-80
SLIDE 80

Data Privacy

  • Concerns exist wherever personally identifiable information is collected and stored, in digital form or otherwise
  • It is a right to control the storage of personal data!
  • Europe: Article 8 of the European Convention on Human Rights (ECHR)

Open Issues
  • Which laws?
  • Does the collection and processing of data in BibSonomy conform to existing law?
  • Best practices for other social bookmarking systems / social tagging systems?
  • Do we need to process personal data in our data mining / research applications?

slide-81
SLIDE 81

Data Privacy Categories for Data Sources

Rank   Data Categories                 Examples
1      anonymised data                 All user data that the operator cannot associate with a single user after having removed all features which allow an identification
2      publicly available data         Posts, friend and follower links, registration information if published
3      registration information        Registration data not published
4      logging information             Logging data collected by operator
5      explicitly not published data   Posts, contact and profile information marked as private by the user

slide-82
SLIDE 82

Data Privacy Categories for Data Sources

Data Sources for Feature Engineering   Rank
Registration Information               3
Posts                                  2
Logging Information                    4
Social Network Information             2, 3, 4

Rank   Data Categories
1      anonymised data
2      publicly available data
3      registration information
4      logging information
5      explicitly not published data

slide-83
SLIDE 83

Experimental Design

  • Spam challenge dataset 2008
  • Features of different data categories
  • Classification methods of Weka
  • AUC value
  • Winner of the challenge reached an AUC value of 0.98
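For readers unfamiliar with the AUC value, here is a minimal sketch of how it can be computed from classifier scores. It uses the pairwise definition: the probability that a randomly chosen spammer is scored higher than a randomly chosen legitimate user, with ties counting one half. The scores and labels are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (label 1 = spammer)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier outputs.
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
mixed = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

The quadratic loop is fine for illustration; for datasets of the size shown above a rank-based computation would be used instead.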

slide-84
SLIDE 84

85

Performance + Privacy Evaluation

slide-85
SLIDE 85

Spam Detection - Related Work

  • Paul Heymann and Georgia Koutrika and Hector Garcia-Molina. Fighting Spam on Social Web Sites: A Survey of Approaches and Future Challenges. IEEE Internet Computing, (11)6:36-45, 2007.
  • Georgia Koutrika and Frans Adjie Effendi and Zoltan Gyöngyi and Paul Heymann and Hector Garcia-Molina. Combating spam in tagging systems. AIRWeb '07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, 57--64, ACM Press, New York, NY, USA, 2007.
  • Benjamin Markines and Ciro Cattuto and Filippo Menczer. Social spam detection. In Dennis Fetterly and Zoltán Gyöngyi, editor(s), AIRWeb, 41-48, 2009.
  • Zoltán Gyöngyi and Hector Garcia-Molina and Jan Pedersen. Combating Web Spam with TrustRank. VLDB, 576-587, 2004.

slide-87
SLIDE 87

88

What kind of “related” tags ?

slide-88
SLIDE 88

89

Example for cosine measure
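The example shown on this slide is not preserved in the transcript. As a stand-in, here is a minimal sketch of the cosine measure in what the deck calls the tag context: each tag is described by a vector that counts how often it co-occurs with every other tag, and two tags are compared by the cosine of their vectors. The posts are hypothetical, and any further weighting the authors may have applied is not modelled.

```python
import math
from collections import Counter


def tag_context_vectors(posts):
    """Map each tag to a Counter of the tags it co-occurs with."""
    vectors = {}
    for tags in posts:
        for tag in tags:
            vector = vectors.setdefault(tag, Counter())
            for other in tags:
                if other != tag:
                    vector[other] += 1
    return vectors


def cosine(v, w):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(v[key] * w[key] for key in v if key in w)
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm_v * norm_w) if norm_v and norm_w else 0.0


# Hypothetical toy posts, each given as a set of tags.
posts = [
    {"java", "programming"},
    {"python", "programming"},
    {"java", "code"},
    {"python", "code"},
    {"music", "mp3"},
]
vectors = tag_context_vectors(posts)
sim_java_python = cosine(vectors["java"], vectors["python"])
sim_java_music = cosine(vectors["java"], vectors["music"])
```

In this toy data „java“ and „python“ never appear in the same post, yet their contexts are identical, so the cosine measure rates them as highly related while plain co-occurrence counting would not.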

slide-89
SLIDE 89

Example: Most related tags for „web2.0“ and „howto“

WEB2.0
Sim. Measure       1          2            3           4             5
Coocc              ajax       web          tools       blog          webdesign
FolkRank           web        ajax         tools       design        blog
TagContext         web2       web-2.0      webapp      „web          web_2.0
ResourceContext    web2       web20        2.0         web_2.0       web-2.0
UserContext        ajax       aggregator   rss         google        collaborate

HOWTO
Sim. Measure       1          2            3           4             5
Coocc              tutorial   reference    tips        linux         programming
FolkRank           reference  linux        tutorial    programming   software
TagContext         how-to     guide        tutorials   help          how_to
ResourceContext    how-to     tutorial     tutorials   tips          diy
UserContext        reference  tutorial     tips        hacks         tools

slide-90
SLIDE 90

Semantic Grounding in WordNet

  • WordNet is a large lexical database for English.
  • Words with the same meaning are grouped in synsets, which are ordered by an is-a hierarchy.
  • Introduction of a single artificial root node enables application of graph-based similarity metrics between pairs of nouns / pairs of verbs.
  • Inclusion of top n Delicious tags in WordNet:
    • 100: 82%
    • 1,000: 79%
    • 5,000: 69%
    • 10,000: 61%
slide-91
SLIDE 91

Example of Semantic Grounding

Original tag: „java“

Most similar tag:
  • Freq, FolkRank: „programming“
  • Cosine: „python“

[Figure: tags are mapped to the WordNet synset hierarchy (computers, programming, languages, design_patterns, java, python) to obtain a grounded similarity]

slide-92
SLIDE 92

[Figure: shortest paths in WordNet; length of shortest path to most related tag; annotations: siblings, random]

slide-93
SLIDE 93

Related work

Ontology Learning
  • Dominik Benz and Andreas Hotho. Position Paper: Ontology Learning from Folksonomies. In Alexander Hinneburg, editor(s), LWA 2007: Lernen - Wissen - Adaption, Halle, September 2007, Workshop Proceedings (LWA), 109-112, Martin-Luther-University Halle-Wittenberg, 2007.
  • Francis Heylighen. Bootstrapping knowledge representations: from entailment meshes via semantic nets to learning webs. Kybernetes, (30)5/6:691--722, 2001.
  • Paul Heymann and Hector Garcia-Molina. Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems. 2006-10 2006.
  • P. Mika, Ontologies Are Us: A Unified Model of Social Networks and Semantics, Springer, 2005, 522-536.
  • P. Schmitz, Inducing Ontology from Flickr Tags. 2006.

Analysis of tagging behaviour
  • C. Cattuto, Semiotic dynamics in online social communities. The European Physical Journal C - Particles and Fields, 2006, 46, 33-37
  • Shilad Sen and Shyong K. Lam and Al Mamunur Rashid and Dan Cosley and Dan Frankowski and Jeremy Osterhouse and F. Maxwell Harper and John Riedl. tagging, communities, vocabulary, evolution. CSCW '06: Proceedings of the 2006 20th anniversary conference on Computer supported cooperative work, 181--190, ACM, New York, NY, USA, 2006.

More under: http://www.bibsonomy.org/tag/ontology+folksonomy

slide-95
SLIDE 95

Lessons Learned

  • Network measures provide interesting insights into the user behavior of folksonomies
  • All types of nodes provide valuable information
  • Click behaviour has a similar structure to tagged data
  • Ranking based on graph structure, similar to PageRank, is possible
  • Correlation between search interests and tagged items
  • Recommenders can influence the user behaviour
  • Spam is a critical issue
  • Relatedness measures on tags in folksonomies are a good basis to extract semantic relations

slide-96
SLIDE 96

Future Work

  • Combining the information for personalized search
  • Using tag data to improve intranet search
  • Utilizing the information of the annotated resource
  • Trying to improve search and ranking by allowing semantics within tagging systems
  • Learning the Ranking
slide-97
SLIDE 97

Agenda

References: http://www.bibsonomy.org/group/kde/ol_tut2010

Introduction

Understanding the Data
  • Network Properties of Folksonomies
  • Network Properties of Logsonomies
  • Types of Tags, Users, Resources

Folksonomies and Search
  • Ranking in Folksonomies (FolkRank)
  • Comparison of Traditional and Social Search

IR related approaches
  • Recommender
  • Spam in Folksonomies
  • Semantic in Folksonomy

Summary and Outlook

slide-98
SLIDE 98

Backup

99

slide-99
SLIDE 99

Agenda

BibSonomy – a social bookmark and publication sharing system

Network Properties of Folksonomies

Network Properties of Logsonomies

Tag Relatedness

Query Relatedness

Spam Detection

Summary and Outlook

[Figure: rank per month for the tags "blog", "css", "design", "linux", "music", "news", "programming", "software", "web"]

slide-100
SLIDE 100

Folksonomy vs. Logsonomy Dataset

  • Del.icio.us excerpt: 10,000 most popular tags
    • |U| = 476,378, |T| = 10,000, |R| = 12,660,470
    • |Y| = 101,491,722
  • AOL split queries: 10,000 most popular words
    • |U| = 463,380, |T| = 10,000, |R| = 1,284,724
    • |Y| = 26,227,550
slide-101
SLIDE 101


cosine

news blogs people weblog culture future news news.com newspaper weather obituaries newspapers video entertainment awesome fun cool random video videos downloading url downloads download tutorial tutorials tips coding code examples tutorial tutorials software trial download templates news blog technology politics media daily news channel daily fox paper newport video music funny tv software media video music free codes clips sex game myspace tutorial howto programming reference design css tutorial free tutorials psp electronics microsoft

freq

Most related tags by cooccurrence / cosine similarity

[Table columns: Logsonomy, Folksonomy (for both freq and cosine)]

slide-102
SLIDE 102

Qualitative insights: Overlap of 10 most related tags

                   coocc   FolkRank   Tag Context   Resource Context
User Context       2.28    2.16       0.71          1.11
Resource Context   1.93    2.25       1.5
Tag Context        0.88    1.1
FolkRank           5.91

slide-103
SLIDE 103


Qualitative insights 2: Average rank of related tags

slide-104
SLIDE 104

Shortest path between original tag and most closely related one

[Figure panels: Logsonomy, Folksonomy]

slide-105
SLIDE 105

Shortest paths in WordNet for the Logsonomy

[Figure: length of shortest path to most related tag; panels: Logsonomy, Folksonomy]