SLIDE 1

Harnessing Folksonomies for Resource Classification

PhD Thesis Arkaitz Zubiaga

UNED

July 12th, 2011

Advisors: Raquel Martínez Unanue, Víctor Fresno Fernández

SLIDE 2

PhD Thesis | Arkaitz Zubiaga | Motivation | Selection of a Classifier | STS & Datasets | Representing the Aggregation of Tags | Tag Distributions on STS | User Behavior on STS | Conclusions & Outlook | Publications

Index

1. Motivation
2. Selection of a Classifier
3. STS & Datasets
4. Representing the Aggregation of Tags
5. Tag Distributions on STS
6. User Behavior on STS
7. Conclusions & Outlook
8. Publications

Arkaitz Zubiaga (UNED) PhD Thesis July 12th, 2011 3 / 98

SLIDE 7

Resource Classification

Classifying resources is a common task.

Web pages, books, movies, files, ...

Large collections of resources → expensive & effortful to classify manually.

The Library of Congress (LoC) reported an average cost of $94.58 for cataloging each book in 2002.

Enormous costs and efforts → automatic classification.

SLIDE 8

Resource Classification

Representation of resources → self-content.
Use of self-content of resources presents some issues:

  Not always representative enough.
  Not always accessible (e.g., books).

Social tags provided by users → alternative to solve the problem.

SLIDE 9

Tagging

T1, T2, T3 = sets of tags.

SLIDE 10

Social Tagging

Aggregation of user annotations → folksonomy.
Folksonomy: Folk (People) + Taxis (Classification) + Nomos (Management).

SLIDE 11

Organization of Resources

User annotations → own organization of resources.

A user’s tags:

Tag        # Resources
research   82
twitter    28
web2.0     35
language   42
english    64
...        ...

SLIDE 12

Example of Bookmarks

    User    Resource      Tags
1   user1   flickr.com    photo, web2.0, social
2   user2   flickr.com    photography, images
3   user1   google.com    searchengine
4   user3   twitter.com   microblogging, twitter

Bookmark: (1) user u_i ∈ U who annotates, (2) resource r_j ∈ R being annotated, (3) tags T_ij = {t_1, ..., t_n} ⊆ T utilized.

b_ij : u_i × r_j × T_ij
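The bookmark triple can be sketched as a small data structure. This is only an illustration, not code from the thesis: the `Bookmark` type and the `tag_user_counts` helper are hypothetical names, and the data is the four example bookmarks listed on this slide. Counting distinct users per tag on one resource gives the kind of aggregate that a folksonomy provides.

```python
from collections import Counter
from typing import NamedTuple


class Bookmark(NamedTuple):
    """One bookmark b_ij: user u_i annotates resource r_j with tags T_ij."""
    user: str
    resource: str
    tags: frozenset


# The four example bookmarks from this slide.
BOOKMARKS = [
    Bookmark("user1", "flickr.com", frozenset({"photo", "web2.0", "social"})),
    Bookmark("user2", "flickr.com", frozenset({"photography", "images"})),
    Bookmark("user1", "google.com", frozenset({"searchengine"})),
    Bookmark("user3", "twitter.com", frozenset({"microblogging", "twitter"})),
]


def tag_user_counts(bookmarks, resource):
    """Count, for one resource, how many distinct users applied each tag."""
    users_per_tag = {}
    for bookmark in bookmarks:
        if bookmark.resource == resource:
            for tag in bookmark.tags:
                users_per_tag.setdefault(tag, set()).add(bookmark.user)
    return Counter({tag: len(users) for tag, users in users_per_tag.items()})


flickr_counts = tag_user_counts(BOOKMARKS, "flickr.com")
```

With only two bookmarks on flickr.com, every tag has a user count of 1; on a real system the same aggregation is what yields a ranked list of top tags per resource.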

SLIDE 13

Sum of Annotations

Top tags (79,681 users)

Tag Rank   Tag           User Count
1          photos        22,712
2          flickr        19,046
3          photography   15,968
4          photo         15,225
5          sharing       10,648
6          images        9,637
7          web2.0        9,528
8          community     4,571
9          social        3,798
10         pictures      3,115

SLIDE 14

Tag-based Resource Classification

SLIDE 15

Problem Statement

How can the annotations provided by users on social tagging systems be exploited to improve the accuracy of a resource classification task?

SLIDE 16

Related Work

Social tags for information management:

  Search: Bao et al. (2007) & Heymann et al. (2008).
  Recommender Systems: Shepitsen et al. (2008) & Li et al. (2008).
  Enhanced Browsing: Smith (2008).

SLIDE 17

Related Work

Classification: Noll and Meinel (2008) → statistical analysis of matches between tags & taxonomies.

  Tags are useful for broad categorization.
  Not for narrower categorization.

Lack of further research with:

  Actual classification experiments.
  Other types of resources.
  Different representations of social tags.

SLIDE 18

Index

2. Selection of a Classifier

SLIDE 19

Characteristics of the task

We have:

  Large set of resources: some labeled + many unlabeled.
  Multiclass taxonomy.

Automated classifiers learn a model from labeled resources.

This model is used to classify unlabeled resources afterward.

2 learning settings:

  Supervised: only labeled resources considered for learning.
  Semi-supervised: unlabeled resources are also taken into account.

SLIDE 20

Support Vector Machines (SVM)

Hyperplane that separates the classes with the largest margin.
Use of kernels → maps resources into a different feature space.
Resource/hyperplane margin → classifier’s reliability.
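To make the margin-as-reliability idea concrete, here is a minimal sketch for the linear case. It does not train an SVM: the hyperplane (`W`, `B`) is a made-up, already-learned one, and `signed_margin` is a hypothetical helper that computes the signed geometric distance of a resource from that hyperplane. Whether the thesis uses this normalized distance or the raw decision value is not stated on the slide, so treat the normalization as an assumption.

```python
import math


def signed_margin(weights, bias, features):
    """Signed distance of a feature vector from the hyperplane w.x + b = 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    norm = math.sqrt(sum(w * w for w in weights))
    return score / norm


# Hypothetical, already-trained hyperplane: 3*x1 + 4*x2 - 5 = 0.
W, B = [3.0, 4.0], -5.0

near = signed_margin(W, B, [1.0, 1.0])  # close to the boundary
far = signed_margin(W, B, [3.0, 4.0])   # well inside the positive side
```

The sign gives the predicted side, and the larger absolute value for `far` is read as a more reliable prediction than the one for `near`.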

SLIDE 21

Selection of a Classifier

SVMs solve binary problems by default. To solve multiclass tasks:

  Native multiclass classifier (mSVM).
  Combining binary classifiers:
    One-against-all (oaaSVM).
    One-against-one (oaoSVM).

Both supervised (s) and semi-supervised (ss).
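The two ways of combining binary classifiers can be sketched as follows. This is a generic illustration under stated assumptions, not the thesis implementation: the margins are invented numbers, the category names are hypothetical, and one-against-one is resolved by simple majority voting, which is one common scheme among several.

```python
def one_against_all(margins):
    """margins: {category: margin of the 'category vs. rest' classifier}.

    The category whose binary classifier is most confident wins.
    """
    return max(margins, key=margins.get)


def one_against_one(pairwise_margins):
    """pairwise_margins: {(a, b): margin}; positive favours a, else b.

    Each pairwise classifier casts one vote; the most voted category wins.
    """
    votes = {}
    for (first, second), margin in pairwise_margins.items():
        votes.setdefault(first, 0)
        votes.setdefault(second, 0)
        votes[first if margin > 0 else second] += 1
    return max(votes, key=votes.get)


# Hypothetical margins for one resource and three categories.
OAA_MARGINS = {"arts": -0.3, "science": 0.8, "sports": 0.1}
OAO_MARGINS = {
    ("arts", "science"): -0.5,
    ("arts", "sports"): 0.2,
    ("science", "sports"): 0.9,
}

oaa_prediction = one_against_all(OAA_MARGINS)
oao_prediction = one_against_one(OAO_MARGINS)
```

One-against-all needs one binary classifier per category, while one-against-one needs one per pair of categories, which grows quadratically with the size of the taxonomy.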

SLIDE 22

Experiment Settings

3 benchmark datasets to analyze suitability of classifiers:

Dataset      # web pages   # trainset   # categories
BankSearch   10,000        3,000        10
WebKB        4,518         1,000        6
Y! Science   788           100          6

We present accuracy to show performance.
We perform 6 runs, and show the average accuracy.

SLIDE 23

Results

              BankSearch   WebKB   Y! Science
mSVM (s)      .925         .810    .825
mSVM (ss)     .923         .778    .836
oaaSVM (s)    .843         .776    .536
oaaSVM (ss)   .842         .773    .565
oaoSVM (s)    .826         .775    .483
oaoSVM (ss)   .811         .754    .514

Native multiclass classifier performs best, while supervised ≃ semi-supervised.
We used the supervised approach, as it is computationally less expensive.

SLIDE 24

Index

3. STS & Datasets

SLIDE 25

Requirements

Selected social tagging systems (STS) should have:

  Large communities involved.
  Public access to data.
  Consolidated taxonomies as a ground truth.

We chose Delicious, LibraryThing & GoodReads.

SLIDE 26

Characteristics of STS

                    Delicious                    LibraryThing               GoodReads
Resources           web documents                books                      books
Tag suggestions     based on earlier bookmarks   no                         based on earlier tags
                    on the resource                                         utilized by the user
Tag insertion       space-separated              comma-separated            one by one text-box
Saving a resource   prompts user to add tags     prompts user to add tags   user needs to click
                                                 at second step             again to add tags

SLIDE 31

Retrieval of Categorized Resources

Retrieval of popular annotated resources, which were also categorized by experts.

              Top level (L1)        Second level (L2)
              Resources   Classes   Resources   Classes
Web     ODP   12,616      17        12,286      243
Books   DDC   27,299      10        27,040      99
        LCC   24,861      20        23,565      204

SLIDE 32

Retrieval of Additional User Annotations

Delicious: 300,571,231 bookmarks → 273,478,137 annotated (91.00%)
LibraryThing: 44,612,784 bookmarks → 22,343,427 annotated (50.08%)
GoodReads: 47,302,861 bookmarks → 9,323,539 annotated (19.71%)

Importance of the system’s encouragement to tag resources.

SLIDE 33

Tag Popularity on Resources

[Figure: average usage (0.1 to 1) against tag rank on resources, for Delicious, LibraryThing and GoodReads.]

SLIDE 34

Tag Novelty in Bookmarks by Rank

[Figure: % of novelty (0% to 100%) against bookmark rank (1 to 99), for Delicious and LibraryThing.]

SLIDE 35

Retrieval of Additional Data

URLs:

  Self-content, by crawling URLs.
  User reviews (Delicious & StumbleUpon).

Books:

  Self-content (unavailable):
    Synopses (Barnes&Noble).
    Editorial reviews (Amazon).
  User reviews (LibraryThing, GoodReads & Amazon).

SLIDE 36

Summary of the Analysis of Datasets

Few users annotate resources when the system does not encourage them to do so.
Resource-based tag suggestions → repeated use of popular tags.

SLIDE 37

Index

4. Representing the Aggregation of Tags

SLIDE 38

Representing Resources Using Tags

Different ways to aggregate user annotations into a vector representation.
2 major factors to consider:

  What tags to use?
  How to weight those tags?

SLIDE 39

Representing Resources Using Tags

Use of all tags (FTA), or just top 10 tags for each resource.
4 different weightings.

Example of a resource (100 users): t1 (50), t2 (30), t3 (20), ..., t9 (1), t10 (1), ..., tn (1)

FTA covers t1 ... tn; Top 10 covers t1 ... t10.

            t1    t2    t3    ...   t9     t10    ...   tn
Ranks       1     0.9   0.8   ...   0.2    0.1    ...
Fractions   0.5   0.3   0.2   ...   0.02   0.01   ...   0.01
Binary      1     1     1     ...   1      1      ...   1
TF          50    30    20    ...   2      1      ...   1
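The four weightings can be sketched in a few lines. This is an illustrative reading of the slide rather than the thesis code: `tag_weights` is a hypothetical helper, the counts are the first three tags of the slide's example (50, 30 and 20 out of 100 users), and the Ranks formula `(10 - position) / 10` is inferred from the 1, 0.9, ..., 0.1 pattern shown in the table.

```python
def tag_weights(tag_counts, n_users, scheme, top_n=None):
    """Weight the tags of one resource.

    tag_counts: {tag: number of users who applied the tag}
    scheme: "ranks", "fractions", "binary" or "tf"
    top_n: keep only the top_n most used tags; None keeps all tags (FTA)
    """
    if scheme == "ranks":
        top_n = 10  # rank weights are only defined for the top 10 tags
    # Most used tags first; ties broken alphabetically for determinism.
    ranked = sorted(tag_counts.items(), key=lambda item: (-item[1], item[0]))
    if top_n is not None:
        ranked = ranked[:top_n]
    weights = {}
    for position, (tag, count) in enumerate(ranked):
        if scheme == "ranks":
            weights[tag] = (10 - position) / 10  # 1, 0.9, ..., 0.1
        elif scheme == "fractions":
            weights[tag] = count / n_users
        elif scheme == "binary":
            weights[tag] = 1
        elif scheme == "tf":
            weights[tag] = count
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return weights


EXAMPLE = {"t1": 50, "t2": 30, "t3": 20}

ranks = tag_weights(EXAMPLE, 100, "ranks")
fractions = tag_weights(EXAMPLE, 100, "fractions")
binary = tag_weights(EXAMPLE, 100, "binary")
tf_top2 = tag_weights(EXAMPLE, 100, "tf", top_n=2)
```

Only TF keeps the absolute number of users behind each tag; the other three schemes normalize or discard it.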

SLIDE 40

Representing Resources Using Other Data Sources

To represent resources using content and reviews:

1. Removal of HTML tags.
2. Removal of stopwords.
3. Stemming of remaining words.
4. TF-IDF weighting of words.
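The four steps can be sketched end to end with the standard library only. Everything specific here is an assumption made for illustration: the stopword list is a tiny stand-in, `naive_stem` is a crude suffix stripper rather than a real stemmer, the regular expression is a simplification of HTML parsing, and the two documents are invented.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are far longer).
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}


def strip_html(text):
    """Step 1: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", " ", text)


def naive_stem(word):
    """Step 3: crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Steps 1 to 3: strip HTML, lowercase, remove stopwords, stem."""
    words = re.findall(r"[a-z0-9]+", strip_html(text).lower())
    return [naive_stem(word) for word in words if word not in STOPWORDS]


def tf_idf(documents):
    """Step 4: one {term: tf * log(|D| / df)} dictionary per document."""
    tokenized = [tokenize(doc) for doc in documents]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    return [
        {term: tf * math.log(n_docs / doc_freq[term])
         for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


DOCS = [
    "<p>Sharing photos and images</p>",
    "<p>Searching the web for images</p>",
]
vectors = tf_idf(DOCS)
```

In this toy collection the stem shared by both documents receives a weight of zero, while stems unique to one document receive `log(2)`.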

SLIDE 41

Experiment Setup

Multiclass SVMs.
Show the average accuracy of 6 runs.
For clarity of presentation, we limit results to:

  LCC taxonomy for books.
  Training sets of 6,000 URLs (6,616 (L1) / 6,286 (L2) for test).
  Training sets of 18,000 books (8,861 (L1) / 5,565 (L2) for test).

SLIDE 42

Experiment Setup

Compared Representations

  Self-content (baseline).
  Reviews.
  Tags:
    Ranks (Top 10).
    Fractions (Top 10 & FTA).
    Binary (Top 10 & FTA).
    TF (Top 10 & FTA).

SLIDE 43

Results of Tags vs Other Data Sources

                    Delicious       LThing          GReads
                    L1      L2      L1      L2      L1      L2
                    (17)    (243)   (20)    (204)   (20)    (204)
Content             .610    .470    .807    .673    .807    .673
Reviews             .646    .524    .828    .705    .828    .705
Tags
  Ranks             .484    .360    .795    .511    .630    .405
  Fractions (10)    .464    .349    .738    .411    .663    .427
  Fractions (FTA)   .461    .336    .712    .409    .654    .432
  Binary (10)       .531    .361    .770    .550    .623    .422
  Binary (FTA)      .572    .529    .655    .606    .639    .481
  TF (10)           .654    .545    .855    .722    .713    .491
  TF (FTA)          .680    .568    .857    .736    .731    .517

Usually, FTA > 10.

SLIDE 44

Results of Tags vs Other Data Sources

TF (FTA) is the best approach for tags.

SLIDE 45

Results of Tags vs Other Data Sources

          Delicious       LThing          GReads
          L1      L2      L1      L2      L1      L2
          (17)    (243)   (20)    (204)   (20)    (204)
Content   .610    .470    .807    .673    .807    .673
Reviews   .646    .524    .828    .705    .828    .705
Tags      .680    .568    .857    .736    .731    .517

Tags clearly outperform content and reviews on Delicious and LibraryThing.

SLIDE 46

Results of Tags vs Other Data Sources

GoodReads’ lack of encouragement to tag leaves tags insufficient to outperform content and reviews.

SLIDE 47

Results of Tags vs Other Data Sources

Tags are also useful for deeper categorization (L2).

SLIDE 48

Classifier Committees

Despite the superiority of social tags, all data sources perform well.
Their outputs can be combined by using classifier committees.
Classifier committees add up the margins (i.e., reliability values) output by several classifiers, and provide a single combined prediction.

                      Cat. #1   Cat. #2   Cat. #3
Classif. A            1.2       1.1       0.6
Classif. B            0.5       1.0       1.2
Classif. committees   1.7       2.1       1.8
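The committee rule described on this slide (add up the margins, then pick the category with the largest sum) can be sketched as below. The function name `committee_predict` is hypothetical, and the margins are the example values from the slide's table.

```python
def committee_predict(margins_by_classifier):
    """Sum per-category margins over all classifiers.

    margins_by_classifier: {classifier name: {category: margin}}
    Returns (winning category, {category: summed margin}).
    """
    combined = {}
    for margins in margins_by_classifier.values():
        for category, margin in margins.items():
            combined[category] = combined.get(category, 0.0) + margin
    winner = max(combined, key=combined.get)
    return winner, combined


# Margins from the example on this slide.
MARGINS = {
    "Classif. A": {"Cat. #1": 1.2, "Cat. #2": 1.1, "Cat. #3": 0.6},
    "Classif. B": {"Cat. #1": 0.5, "Cat. #2": 1.0, "Cat. #3": 1.2},
}

winner, combined = committee_predict(MARGINS)
```

In this example classifier A alone would choose Cat. #1 and classifier B alone would choose Cat. #3, yet the committee chooses Cat. #2, the category both classifiers rate consistently high.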

SLIDE 49

Results of Classifier Committees

                Delicious       LThing          GReads
                L1      L2      L1      L2      L1      L2
                (17)    (243)   (20)    (204)   (20)    (204)
Content (C)     .610    .470    .807    .673    .807    .673
Reviews (R)     .646    .524    .828    .705    .828    .705
Tags (T)        .680    .568    .857    .736    .731    .517
Commit.
  C + R         .670    .547    .817    .704    .817    .704
  C + T         .696    .587    .821    .720    .832    .696
  R + T         .694    .584    .859    .755    .857    .730
  C + R + T     .699    .588    .827    .732    .843    .727

Classifier committees successfully improve performance.

Even on GoodReads, where tags were not good enough on their own.

SLIDE 50

Results of Classifier Committees

Data sources must be chosen with care:

  All 3 are helpful on Delicious.
  Content is harmful for books.
    Is it inappropriate to consider synopses and editorial reviews as a summary of content?

SLIDE 51

Summary of Results

Resources are best represented using all tags with TF weighting.
Tags perform accurately even for deeper levels.

The system must encourage users to tag for tags to be useful enough.

Tags can be combined with other data to improve performance.

Combined data sources must be chosen with care.

SLIDE 52

Index

5. Tag Distributions on STS

SLIDE 53

Tag Distributions

So far, we have considered that tags annotated by the same number of users are equally representative of the resource.
Distributions of tags in a collection could help determine the representativeness of tags.

SLIDE 54

TF-IDF

TF-IDF is an inverse weighting function (IWF) that computes:

  the term frequency (TF).
  the inverse document frequency (IDF).

$$\text{tf-idf}_{ij} = \text{tf}_{ij} \times \log \frac{|D|}{|\{d : t_i \in d\}|}$$

High IDF value for terms appearing in few documents.
Low IDF value for terms appearing in many documents.
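As a worked illustration with made-up numbers (and assuming the natural logarithm, since the slide does not state the base): take a collection of |D| = 1000 documents and a term with tf = 3, appearing either in 10 documents (rare) or in 500 documents (common).

```latex
\begin{align*}
\text{tf-idf}_{\text{rare}}   &= 3 \times \log \frac{1000}{10}  = 3 \times \log 100 \approx 13.82 \\
\text{tf-idf}_{\text{common}} &= 3 \times \log \frac{1000}{500} = 3 \times \log 2   \approx 2.08
\end{align*}
```

With the same term frequency, the rare term ends up weighted several times higher than the common one.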

SLIDE 55

Tag Weighting Functions

Analogous to TF-IDF on folksonomies:

  TF-IRF → distributions across resources.
  TF-IUF → distributions across users.
  TF-IBF → distributions across bookmarks.

TF-IRF and TF-IUF had barely been used, and their suitability was still unexplored.
TF-IBF had not been used.
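A minimal sketch of the three inverse factors, assuming each one mirrors IDF directly, i.e. `log(N / n_t)` with N the total number of resources, users or bookmarks and n_t the number of those in which tag t occurs. The slide only states the analogy, so the exact formula, the helper name `inverse_weights` and the bookmarks below are all assumptions for illustration. The final tag weight would then be the TF of the tag on a resource multiplied by the chosen factor.

```python
import math


def inverse_weights(bookmarks):
    """Return (irf, iuf, ibf) dictionaries keyed by tag.

    bookmarks: list of (user, resource, tags) triples.
    """
    resources = {resource for _, resource, _ in bookmarks}
    users = {user for user, _, _ in bookmarks}
    tag_resources, tag_users, tag_bookmarks = {}, {}, {}
    for user, resource, tags in bookmarks:
        for tag in set(tags):
            tag_resources.setdefault(tag, set()).add(resource)
            tag_users.setdefault(tag, set()).add(user)
            tag_bookmarks[tag] = tag_bookmarks.get(tag, 0) + 1
    irf = {t: math.log(len(resources) / len(rs))
           for t, rs in tag_resources.items()}
    iuf = {t: math.log(len(users) / len(us))
           for t, us in tag_users.items()}
    ibf = {t: math.log(len(bookmarks) / n)
           for t, n in tag_bookmarks.items()}
    return irf, iuf, ibf


# Hypothetical bookmarks: (user, resource, tags).
EXAMPLE_BOOKMARKS = [
    ("user1", "flickr.com", ["photo", "web2.0"]),
    ("user2", "flickr.com", ["photo"]),
    ("user1", "google.com", ["search", "web2.0"]),
    ("user3", "twitter.com", ["web2.0"]),
]

irf, iuf, ibf = inverse_weights(EXAMPLE_BOOKMARKS)
```

In this toy data `web2.0` appears on every resource, so its IRF is zero, whereas its IUF and IBF stay positive because not every user or bookmark carries it. The three factors can therefore rank the same tag differently.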

SLIDE 56

Results Using IWFs

           Delicious       LThing          GReads
           L1      L2      L1      L2      L1      L2
TF         .680    .568    .857    .736    .731    .517
IWFs
  TF-IRF   .639    .529    .894    .809    .799    .622
  TF-IBF   .641    .532    .895    .811    .800    .628
  TF-IUF   .661    .555    .892    .803    .794    .623

All 3 IWFs clearly outperform TF for LibraryThing and GoodReads.

Similar performance of IWFs.

SLIDE 57

Results Using IWFs

IWFs underperform on Delicious, due to tag suggestions that make top tags extremely popular.

IUF superior to IBF and IRF. Users who make their own choices make the difference.

SLIDE 58

Using Classifier Committees with IWFs

How about using tags represented with IWFs on classifier committees?

SLIDE 59

Results Using IWF with Committees

           Delicious       LThing          GReads
           L1      L2      L1      L2      L1      L2
TF         .699    .588    .859    .755    .857    .730
IWFs
  TF-IRF   .697    .592    .885    .793    .864    .748
  TF-IBF   .698    .592    .887    .797    .866    .751
  TF-IUF   .700    .595    .885    .792    .864    .749

IWF-based committees are even better than TF-based ones.

Even on Delicious, where IWFs were not appropriate, committees perform slightly better.

SLIDE 60

Results Using IWF with Committees

Although IWFs outperform TF when used in committees, IWFs on their own still perform better on LibraryThing (.895 for L1 and .811 for L2 with TF-IBF).

SLIDE 61

Summary of Results Using IWFs

IWFs are an appropriate way to weight tags when used on classifier committees.

The exception is LibraryThing, where tags on their own perform better.

Combined data sources must be appropriately chosen (e.g., synopses & ed. reviews are harmful with books).

SLIDE 62

Index

1. Motivation
2. Selection of a Classifier
3. STS & Datasets
4. Representing the Aggregation of Tags
5. Tag Distributions on STS
6. User Behavior on STS
7. Conclusions & Outlook
8. Publications

SLIDE 63

User Behavior: Categorizers and Describers

Körner [1] suggested 2 kinds of user behavior:

                            Categorizer       Describer
Goal of Tagging             later browsing    later retrieval
Change of Tag Vocabulary    costly            cheap
Size of Tag Vocabulary      limited           open
Tags                        subjective        objective

They found that Describers help infer semantic relations among tags.

Do these tagging behaviors affect the usefulness of tags for resource classification?

[1] C. Körner. Understanding the Motivation behind Tagging. Hypertext 2009.

SLIDE 64

Categorizers and Describers

SLIDE 65

Categorizers and Describers

SLIDE 66

Weighting Measures

We use 3 measures to weight users, based on Koerner et al. (2010). 2 factors are considered: verbosity & diversity.

SLIDE 67

Weighting Measures: TPP

Tags per Post (TPP) – Verbosity

$$\mathrm{TPP}(u) = \frac{\sum_{r} |T_{ur}|}{|R_u|}$$
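A minimal sketch of the TPP computation, assuming a user's data is available as a mapping from each annotated resource to the set of tags assigned to it (so the sum runs over the user's posts and the number of posts stands for |R_u|). The function name and the toy data are hypothetical.

```python
def tpp(posts):
    """Tags per post: total tag assignments divided by number of resources.

    posts: dict mapping resource id -> set of tags the user assigned to it.
    """
    if not posts:
        return 0.0
    return sum(len(tags) for tags in posts.values()) / len(posts)

# Toy user with 3 posts and 6 tag assignments in total.
user_posts = {
    "r1": {"python", "code"},
    "r2": {"python"},
    "r3": {"news", "daily", "read"},
}
print(tpp(user_posts))  # 2.0
```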

SLIDE 68

Weighting Measures: ORPHAN

Orphan Ratio (ORPHAN) – Diversity

$$n = \frac{|R(t_{max})|}{100}$$

$$\mathrm{ORPHAN}(u) = \frac{|T_u^o|}{|T_u|}, \quad T_u^o = \{t \mid |R(t)| \le n\}$$
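The orphan ratio can be sketched as follows. This assumes that R(t) is the set of the user's own resources annotated with tag t, that t_max is the user's most frequently used tag, and that the threshold n is used as written, without rounding. All names and the toy data are hypothetical.

```python
def orphan_ratio(posts):
    """Share of a user's tags that are used on at most n resources.

    posts: dict mapping resource id -> set of tags the user assigned to it.
    """
    # Invert the posts: for each tag, the set of resources it annotates.
    resources_per_tag = {}
    for resource, tags in posts.items():
        for tag in tags:
            resources_per_tag.setdefault(tag, set()).add(resource)
    if not resources_per_tag:
        return 0.0
    # Threshold n: one hundredth of the usage of the most frequent tag.
    n = max(len(rs) for rs in resources_per_tag.values()) / 100
    orphans = [t for t, rs in resources_per_tag.items() if len(rs) <= n]
    return len(orphans) / len(resources_per_tag)

# Toy user: "web" on 200 resources, "a" on 1, "b" on 2, "c" on 5.
posts = {f"r{i}": {"web"} for i in range(200)}
posts["r0"] |= {"a", "b", "c"}
posts["r1"] |= {"b", "c"}
for i in range(2, 5):
    posts[f"r{i}"].add("c")

print(orphan_ratio(posts))  # 0.5: n = 2, so "a" and "b" are orphans out of 4 tags
```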

SLIDE 69

Weighting Measures: TRR

Tag Resource Ratio (TRR) – Verbosity + Diversity

$$\mathrm{TRR}(u) = \frac{|T_u|}{|R_u|}$$
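A minimal sketch of TRR, assuming T_u is the user's vocabulary of distinct tags and R_u the set of resources the user annotated. The function name and toy data are hypothetical.

```python
def trr(posts):
    """Tag/resource ratio: distinct tags divided by number of resources.

    posts: dict mapping resource id -> set of tags the user assigned to it.
    """
    if not posts:
        return 0.0
    vocabulary = set().union(*posts.values())
    return len(vocabulary) / len(posts)

# Toy user with 5 distinct tags spread over 3 resources.
example_posts = {
    "r1": {"python", "code"},
    "r2": {"python"},
    "r3": {"news", "daily", "read"},
}
print(trr(example_posts))
```

Unlike TPP, a tag reused across several posts counts only once here, which is why the measure mixes verbosity with diversity.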

SLIDE 70

Use of Weighting Measures

These 3 measures provide:

A weight for each user. Ranking of users according to each measure.

From rankings → subsets of users as extreme Categorizers (highest-ranked) and extreme Describers (lowest-ranked). Subsets range from 10% to 100% (step size = 10%).

SLIDE 71

Experiment Setup

We select subsets of users according to number of tag assignments. Selecting by percents of users would be unfair → different amounts of data.
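One possible reading of this selection procedure is sketched below: walk down a ranking of users and keep adding users until they account for a given percentage of all tag assignments, so that every subset holds a comparable amount of data. The function name, the stopping rule ("at least" the budget) and the toy numbers are assumptions, not details taken from the slides.

```python
def select_by_assignments(ranked_users, assignments, percent):
    """Take users from the top of a ranking until they cover at least
    `percent` % of all tag assignments.

    ranked_users: user ids, already ordered by one of the measures.
    assignments:  user id -> number of tag assignments by that user.
    """
    budget = percent * sum(assignments.values()) / 100
    selected, used = [], 0
    for user in ranked_users:
        if used >= budget:
            break
        selected.append(user)
        used += assignments[user]
    return selected

# Toy ranking of 4 users with 100 tag assignments in total.
ranking = ["a", "b", "c", "d"]
assignments = {"a": 10, "b": 30, "c": 40, "d": 20}

# Subsets from 10% to 100% in steps of 10%.
subsets = {
    p: select_by_assignments(ranking, assignments, p)
    for p in range(10, 101, 10)
}
print(subsets[10], subsets[40], subsets[100])
```

Passing the ranking in reverse order would give the subsets for the opposite end of the ranking.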

SLIDE 72

Experiments

Classification: We use a multiclass SVM, with TF weighting of tags.

Descriptivity: Vectorial representations of resources:

T_r → tag frequencies.
R_r → term frequencies on descriptive data (self-content).

Cosine similarity between T_r and R_r:

$$\cos(\theta_r) = \frac{\sum_{i=1}^{n} T_{ri} \times R_{ri}}{\sqrt{\sum_{i=1}^{n} (T_{ri})^2} \times \sqrt{\sum_{i=1}^{n} (R_{ri})^2}}$$
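The descriptivity measure can be sketched as a plain cosine similarity between two sparse frequency vectors. The dictionary representation, the function name and the toy vectors are assumptions made for illustration.

```python
import math

def cosine(t, r):
    """Cosine similarity between two sparse vectors (dicts term -> frequency)."""
    dot = sum(weight * r.get(term, 0) for term, weight in t.items())
    norm_t = math.sqrt(sum(w * w for w in t.values()))
    norm_r = math.sqrt(sum(w * w for w in r.values()))
    if norm_t == 0 or norm_r == 0:
        return 0.0
    return dot / (norm_t * norm_r)

# Tag frequencies of a resource vs. term frequencies of its content.
tag_vector = {"python": 3, "tutorial": 4}
content_vector = {"python": 3, "tutorial": 4}
unrelated_vector = {"news": 2}

print(cosine(tag_vector, content_vector))    # identical direction
print(cosine(tag_vector, unrelated_vector))  # no shared terms
```

A value close to 1 would indicate tags that closely mirror the resource's own content, and a value close to 0 would indicate tags that share little vocabulary with it.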

SLIDE 73

Descriptivity Results

[Figure: descriptivity results, with panels for TPP (verbosity), ORPHAN (diversity) and TRR (verbosity + diversity) on Delicious, LibraryThing and GoodReads]

SLIDE 74

Classification Results

[Figure: classification results, with panels for TPP (verbosity), ORPHAN (diversity) and TRR (verbosity + diversity) on Delicious, LibraryThing and GoodReads]

SLIDE 75

Classification Results

[Figure: classification results, with panels for TPP (verbosity), ORPHAN (diversity) and TRR (verbosity + diversity) on Delicious, LibraryThing and GoodReads]

SLIDE 76

Overall Categorizers/Describers Results

Discriminating by verbosity (TPP) does best for finding extreme Categorizers.

The use of non-descriptive tags provides more accurate classification.

SLIDE 77

Index

1. Motivation
2. Selection of a Classifier
3. STS & Datasets
4. Representing the Aggregation of Tags
5. Tag Distributions on STS
6. User Behavior on STS
7. Conclusions & Outlook
8. Publications

SLIDE 78

Contributions

Generation & analysis of 3 large-scale social tagging datasets. Release of some tagging datasets, used by Godoy and Amandi (2010), Strohmaier et al. (2010), Li et al. (2011), and Ares et al. (2011).

SLIDE 79

Contributions

First research work performing actual classification experiments using social tags.

Analysis of different representations of social tags. Analysis of effect of tag distributions. Study of user behavior.

It paves the way for future researchers interested in the task & in the exploration of STS.

SLIDE 80

Research Questions

Apart from the Problem Statement:

How can the annotations provided by users on social tagging systems be exploited to improve the accuracy of a resource classification task?

We set forth 10 research questions.

SLIDE 81

Research Questions (1)

RQ 1 What is a suitable SVM classifier for the task? Native multiclass SVM >> Combinations of binary SVMs.

SLIDE 82

Research Questions (2)

RQ 2 What is a suitable learning method for the task? Supervised ≃ Semi-supervised. Unlike for binary tasks, where Semi-supervised >> Supervised (Joachims, 1999).

SLIDE 83

Research Questions (3)

RQ 3 How do the settings of STS affect folksonomies? Great impact of tag suggestions. Importance of encouraging users to annotate.

SLIDE 84

Research Questions (4)

RQ 4 How to amalgamate annotations to get a representation of a resource? Considering all the tags rather than only those in the top. Weighting tags according to number of users annotating them.

SLIDE 85

Research Questions (5)

RQ 5 Is it worthwhile combining tags with other data sources? Combining different data sources helps improve performance. Data sources must be appropriately chosen.

SLIDE 86

Research Questions (6)

RQ 6 Are social tags specific enough to classify into narrower categories? Tags are as useful as for top level. Noll and Meinel (2008) → tags were probably not useful for deeper levels.

SLIDE 87

Research Questions (7)

RQ 7 Can we consider tag distributions to get the representativity of each tag? LibraryThing & GoodReads: really useful. Delicious: not useful, because of tag suggestions → need of committees to make them useful.

SLIDE 88

Research Questions (8)

RQ 8 What approach to use to weigh the representativity of tags? LibraryThing & GoodReads: IBF, IRF & IUF are very similar. Delicious: IUF clearly superior, because of users that get rid of suggestions.

SLIDE 89

Research Questions (9)

RQ 9 Can we discriminate users who further resemble an expert classification? Categorizers > Describers for classification. Need of appropriate measure for discriminating.

SLIDE 90

Research Questions (10)

RQ 10 What features identify a Categorizer? Categorizers can be found when discriminating by verbosity. Non-descriptive tags produce more accurate classification.

SLIDE 91

Future Directions

Increase of interest in the field, still much work to do.

We have considered each tag as a different token. → Considering semantic meanings of social tags could help.

Tag suggestions raise several issues in folksonomies. → Looking for a weighting function that fits the characteristics of systems with tag suggestions, e.g., Delicious.

SLIDE 92

Index

1. Motivation
2. Selection of a Classifier
3. STS & Datasets
4. Representing the Aggregation of Tags
5. Tag Distributions on STS
6. User Behavior on STS
7. Conclusions & Outlook
8. Publications

SLIDE 93

Publications

Peer-Reviewed Conferences (I)

Arkaitz Zubiaga, Christian Körner, Markus Strohmaier. 2011. Tags vs Shelves: From Social Tagging to Social Classification. In Proceedings of Hypertext 2011, the 22nd ACM Conference on Hypertext and Hypermedia, Eindhoven, Netherlands. (acceptance rate: 35/104, 34%)

Arkaitz Zubiaga, Raquel Martínez, Víctor Fresno. 2009. Getting the Most Out of Social Annotations for Web Page Classification. In Proceedings of DocEng 2009, the 9th ACM Symposium on Document Engineering, pp. 74-83, Munich, Germany. (acceptance rate: 16/54, 29.6%) [15 citations]

SLIDE 94

Publications

Peer-Reviewed Conferences (II)

Arkaitz Zubiaga. 2009. Enhancing Navigation on Wikipedia with Social Tags. Wikimania 2009, Buenos Aires, Argentina. [6 citations]

Arkaitz Zubiaga, Alberto P. García-Plaza, Víctor Fresno, Raquel Martínez. 2009. Content-based Clustering for Tag Cloud Visualization. In Proceedings of ASONAM 2009, International Conference on Advances in Social Networks Analysis and Mining, pp. 316-319, Athens, Greece. [3 citations]

SLIDE 95

Publications

Journals (I)

Arkaitz Zubiaga, Raquel Martínez, Víctor Fresno. 2011. Augmenting Web Page Classifiers with Social Annotations. Procesamiento del Lenguaje Natural. (acceptance rate: 33/60, 55%)

Arkaitz Zubiaga, Raquel Martínez, Víctor Fresno. 2009. Clasificación de Páginas Web con Anotaciones Sociales. Procesamiento del Lenguaje Natural, vol. 43, pp. 225-233. (acceptance rate: 36/72, 50%)

Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez. 2009. Comparativa de Aproximaciones a SVM Semisupervisado Multiclase para Clasificación de Páginas Web. Procesamiento del Lenguaje Natural, vol. 42, pp. 63-70.

SLIDE 96

Publications

Journals (II)

Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez. Harnessing Folksonomies to Produce a Social Classification of Resources. IEEE Transactions on Knowledge and Data Engineering. (pending notification)

SLIDE 97

Publications

Book Chapters

Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez. 2011. Exploiting Social Annotations for Resource Classification. Social Network Mining, Analysis and Research Trends: Techniques and Applications. IGI Global.

Workshops

Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez. 2009. Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?. In Proceedings of the NAACL-HLT 2009 Workshop on Semi-supervised Learning for Natural Language Processing, pp. 28-36, Boulder, CO, United States.

SLIDE 98

Thank You

Achiu, Arigato, Danke, Dhannvaad, Dua Netjer en ek, Efcharisto, Gracias, Gràcies, Gratia, Grazie, Guishepeli, Hvala, Kiitos, Köszönöm, Mercé, Merci, Mila esker, Obrigado, Shukran, Shukriya, Tack, Tak, Takk, Tänan, Tapadh leat, Tesekkür ederim, Thank you, Toda

http://thesis.zubiaga.org/


Harnessing Folksonomies for Resource Classification

Table of Contents

Index

Resource Classification

Resource Classification

Resource Classification

Resource Classification

Resource Classification

Tagging

Social Tagging

Organization of Resources

Example of Bookmarks

bij : ui × rj × Tij

Sum of Annotations

Tag-based Resource Classification

Problem Statement

Related Work

Related Work

Index

Characteristics of the task

Support Vector Machines (SVM)

Selection of a Classifier

Experiment Settings

Results

Index

Requirements

Characteristics of STS

Characteristics of STS

Characteristics of STS

Characteristics of STS

Characteristics of STS

Retrieval of Categorized Resources

Retrieval of Additional User Annotations

Tag Popularity on Resources

Tag Novelty in Bookmarks by Rank

Retrieval of Additional Data

Summary of the Analysis of Datasets

Index

Representing Resources Using Tags

Representing Resources Using Tags

Representing Resources Using Other Data Sources

Experiment Setup

Experiment Setup

Results of Tags vs Other Data Sources

Results of Tags vs Other Data Sources

Results of Tags vs Other Data Sources

Results of Tags vs Other Data Sources

Results of Tags vs Other Data Sources

Classifier Committees

Results of Classifier Committees

Results of Classifier Committees

Summary of Results

Index

Tag Distributions

TF-IDF

Tag Weighting Functions

Results Using IWFs

Results Using IWFs

Using Classifier Committees with IWFs

Results Using IWF with Committees

Results Using IWF with Committees

Summary of Results Using IWFs

Index

User Behavior: Categorizers and Describers

Categorizers and Describers

Categorizers and Describers

Weighting Measures

Weighting Measures: TPP

Weighting Measures: ORPHAN

Weighting Measures: TRR

Use of Weighting Measures

Experiment Setup

Experiments

Descriptivity Results

Classification Results

Classification Results

Overall Categorizers/Describers Results

Index

Contributions

Contributions