Improved Annotation of the Blogosphere via Autotagging and Hierarchical Clustering

SLIDE 1

15th International World Wide Web Conference

Improved Annotation of the Blogosphere via Autotagging and Hierarchical Clustering

Chris Brooks and Nancy Montanez
Department of Computer Science
University of San Francisco

Department of Computer Science — University of San Francisco – p. 1/??

SLIDE 3

Shared Tags and Folksonomies

Tags have (at least) three clear uses:
Individual organization
Shared annotation of articles into categories
Shared annotation as an aid to searching
We are more interested in tags as a mechanism for sharing information.
Folksonomy: the meaning associated with a tag will evolve and coalesce through community usage.


SLIDE 4

Popular tags

About Me, Acne News, Actualite, Actualites, Actualites et politique, Advertising, Allmant, All Posts, amazon, Amigos, amor, Amusement, Anime, Announcements, Articles/News, Asides, Asterisk, audio, Babes, Babes On Flickr, Baby, Baseball, Blogging, Blogs, book, books, Business, Car, Car Insurance, Cars, category, Cell Phones, China, Cinema, Cine cinema, Comics, Computadores e a Internet, Computer, Computers, Computers and Internet, Computers en internet, Computing, CSS, Curiosidades, Current events, Data Recovery, days, Development, diario, Directory, Divertissement, Dogs, dreams, Entertainment, Entretenimento, Entretenimiento, Environment, etc, Europe, Event, EveryDay, Everything, F1, fAcTs, Family, fashion, Feeling, Feelings, FF11, FFXI, Film, Firefox, Flash, Flickr, Flutes, Food and Drink, Football, foreign-exchange, Foreign Exchange, Fotos, Friends, Fun, Funny, general, Game, Games, Gaming, Generale, General news, General Posting, General webmaster threads, Geral, Golf, Google, gossip, Hardware, Health and wellness, Health Insurance, History, hobbies, Hobby, Home, Humor, Hurricane Katrina, Info, Informatica e Internet, International, Internet, In The News, Intrattenimento, Java, jeux, Jewelry, jogos, Journal, Journalism, Juegos, kat-tun, Katrina, Knitting, Law, Legislation, libros, Life, Links, Live, Livres, Livros, London, Love, Love Poems, Lyrics, Musica, Macintosh, Marketing, MassCops Recent Topics, Me, Media, meme, memes, memo, metblogs, metroblogging, Military, Misc, Misc., miscellaneous, MobLog, Mood, Movie, Movies, murmur, Music, Musica, Musik, Musings, Musique, Muziek, My blog, Nature, News and politics, Noticias e politica, Opinion, Ordinateurs et Internet, Organizacoes, Organizaciones, Organizations, others, Pasatiempos, Passatempos, PC, Pensamentos, Pensamientos, People, Personal, Philosophy, photo, Pictures, Podcast, Poem, poemas, Poesia, Poker, police headlines, Politik, Projects, Quotes, Radio, Ramblings, random, Randomness, Random thoughts, Rant, Real Estate, Recipes, reflexiones, reizen, Relationships, Research, Resources, Review, RO, RSS, Saude e bem-estar, Salud y bienestar, Sante et bien-etre, School, Science, Search, Sex, sexy, Shopping, Site news, Society, software, Spam, Stories, stuff, Tech News, technology, Television, Terrorism, test, Tips, Tools, Travel, Updates, USA, Viagens, Viajes, Video, Videos, VoIP, Votes, Voyages, War, Weather, Weblog, Website, weight loss, Whatever, Windows, Wireless, wordpress, words, Work, World news, Writing

The 250 most popular tags on Technorati, as of October 6, 2005

Things to notice:
Tags tend to be general terms
Synonyms and related concepts are repeated
Misspellings and inconsistent capitalization
Jargon, slang, spam, and non-English words
Non-useful tags (everything, etc, random, test)


SLIDE 5

Representational Power

A tag is a label that is applied to a set of blog entries.
There is no way to specify relationships between tags
Opposite, more general/specific, synonym, etc
In logical terms, tags are a propositional mechanism.
This should set off some alarms amongst the AI people in the audience!
We see users trying to use tags more expressively
e.g. “San Francisco, California”
This can’t be decomposed, or related to the tag “San Francisco” or the tag “California”

Maybe tags are not quite so easy to use ...


SLIDE 6

Tags as an Information Retrieval Mechanism

In this paper, we tried to determine whether tags were useful as an information retrieval mechanism.
Specifically, can tags help with a search task?
How similar are articles that are assigned the same tags?
Hypothesis: Rarer tags are better at describing articles than more popular tags.


SLIDE 7

Tags as an Information Retrieval Mechanism

Retrieved the top 350 tags from Technorati, and then the 250 most recent articles for each tag.
Articles are converted into weighted vectors, using TFIDF to assign weights to each word.
All articles that share a tag are assigned to a tag cluster.
The size of a tag cluster is measured using the average pairwise cosine similarity.
Note: the actual content of the documents is what is evaluated.
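A minimal Python sketch of the measurement described on this slide, assuming articles are already tokenized into lists of lowercase words. The function names, the tf × log(N/df) weighting, and the toy corpus are illustrative assumptions; the slides do not say which TFIDF variant was used.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (word -> weight)."""
    n = len(docs)
    # Document frequency: how many documents contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {w: (c / len(doc)) * math.log(n / df[w]) for w, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_pairwise_similarity(vectors):
    """Mean cosine similarity over all unordered pairs in one cluster."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy corpus (made up): the first two articles share a hypothetical tag.
corpus = [
    ["python", "code", "bug"],
    ["python", "code", "test"],
    ["garden", "tomato", "soil"],
]
vectors = tfidf_vectors(corpus)
tag_cluster = [vectors[0], vectors[1]]
print(average_pairwise_similarity(tag_cluster))
```

Note that the weights are computed over the whole corpus before any cluster is scored, so a word that appears in every article contributes nothing to the similarity.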


SLIDE 8

How Similar are Tag Clusters?

[Figure: Similarity Among Blogs using Popular Tags (250 Blogs per Tag). x-axis: Tag Rank (or Popularity), 50 to 350; y-axis: Cosine Similarity, 0.1 to 1.]

Articles with the same tag are somewhat similar.
Small spike amongst highly popular tags. (game, games, vote)
Contrary to expectations, articles with rare tags are not more similar than articles with common tags.
But how similar are these clusters?


SLIDE 9

Baselines

[Figure: Similarity Among Random Clusters of Blogs (2,500 Blogs Total). x-axis: Index of Clusters (50 Blogs per Cluster); y-axis: Cosine Similarity.]

[Figure: Similarity Among Documents from Google News (10 Different News Topics, 30 Articles per Topic). x-axis: Index of News Topic Clusters (Roughly 30 Articles per Topic); y-axis: Cosine Similarity.]

Tagging clusters articles better than random selection, but worse than Google News.
Tagging seems most effective at grouping articles into broad topical bins.
Not very effective as a mechanism for locating particular articles.


SLIDE 10

Autotagging

Perhaps users are not very good at choosing tags for search - can automated methods do better?
Autotagging is the process of automatically assigning tags based on the content of an article.
Hypothesis: To determine what an article is about, look at the article itself!
Assign TFIDF scores to all words and extract the highest-scoring words.
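The autotagging step can be sketched roughly as follows, assuming tokenized articles and the same tf × log(N/df) weighting as a stand-in for whatever TFIDF variant was actually used. The function name `autotag`, the alphabetical tie-break, and the toy articles are hypothetical; the default of three tags per article follows the next slide.

```python
import math
from collections import Counter


def autotag(docs, k=3):
    """Return the k highest-scoring TF-IDF words of each tokenized article."""
    n = len(docs)
    # Document frequency: how many articles contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    tags = []
    for doc in docs:
        tf = Counter(doc)
        scores = {w: (c / len(doc)) * math.log(n / df[w]) for w, c in tf.items()}
        # Highest score first; ties broken alphabetically so output is stable.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        tags.append(ranked[:k])
    return tags


# Toy articles (made up for illustration).
articles = [
    ["cat", "cat", "the", "sat"],
    ["dog", "dog", "the", "ran"],
    ["the", "bird", "flew", "flew"],
]
for article_tags in autotag(articles, k=2):
    print(article_tags)
```

A word such as "the" that occurs in every article gets a score of zero, which is why it never surfaces as a tag in this sketch.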


SLIDE 11

Autotagging

[Figure: Similarity Among Blogs with the Same Top TFIDF Keywords (500 Blogs Total). x-axis: Index of TFIDF Keyword Clusters; y-axis: Cosine Similarity. Caption: Pairwise similarity of clusters of articles sharing a highly-scored word.]

We also extracted the top three highest-scoring words from each article and assigned them as tags.
Clusters formed using these words were smaller and much more similar than clusters using user-chosen keywords.
Tags extracted from user text are more helpful in creating specific categories than user-selected tags are.


SLIDE 12

Generating Hierarchies of Tags

Tags are unable to express related concepts.
Do related articles have tags judged as similar by a human?
To address this question, we use agglomerative clustering to construct a tag hierarchy.

Goal: identify and group tags that are similar or related.


SLIDE 13

Agglomerative Clustering

The agglomerative clustering algorithm is very straightforward:
Find the two closest tag clusters and merge them into a single abstract cluster. Repeat until one cluster containing all tags remains.
This yields a dendrogram showing tag similarities.
Tags = {t1, t2, ..., tn}
while |Tags| > 1 :
    find ti, tj s.t. sim(ti, tj) >= sim(ti, tk) ∀ k ≠ i, k ≠ j
    tnew = ti ∪ tj
    Tags = Tags − {ti, tj}
    Tags = Tags ∪ {tnew}
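A small Python sketch of the loop above, following the prose description (merge the two closest clusters until one remains). The slides do not say how similarity between merged clusters is computed, so the average-link helper, the function names, and the toy similarity values are all assumptions made for illustration. The sketch expects at least one tag.

```python
from itertools import combinations


def agglomerate(tags, sim):
    """Merge the two most similar clusters until one remains.

    tags: list of distinct tag names (at least one).
    sim:  function taking two frozensets of tags and returning a similarity.
    Returns the dendrogram as nested 2-tuples with tag names at the leaves.
    """
    # Map each cluster (a frozenset of tags) to its dendrogram node.
    clusters = {frozenset([t]): t for t in tags}
    while len(clusters) > 1:
        # Pick the pair of clusters with the highest similarity.
        a, b = max(combinations(clusters, 2), key=lambda pair: sim(*pair))
        merged_node = (clusters.pop(a), clusters.pop(b))
        clusters[a | b] = merged_node
    return next(iter(clusters.values()))


def average_link(pairwise):
    """Build a cluster similarity from tag-to-tag scores (average linkage)."""
    def sim(a, b):
        total = sum(pairwise[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))
    return sim


# Toy tag-to-tag similarities (made up, not taken from the experiments).
pairwise = {
    frozenset(("game", "games")): 0.9,
    frozenset(("game", "politics")): 0.3,
    frozenset(("games", "politics")): 0.2,
}
tree = agglomerate(["game", "games", "politics"], average_link(pairwise))
print(tree)
```

Each pass rescans every pair of clusters, which is fine for a few hundred tags but would need a smarter data structure for a much larger vocabulary.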


SLIDE 14

Positive results: locating related tags

[Figure: dendrogram of clustered tags; leaf labels were partly lost in extraction. Legible labels include iTunes, Car Insurance, Wine and Cheese, garden, Reference, Junk, Tips, News, weight loss, Daily news, essay, Anything, Feeling, Feelings, My day, Insurance, Health, literary, self-expression, gardening.]

Clustering is able to construct groups of tags that might be characterized as “related” by a human.


SLIDE 15

Negative results: shared vocabulary

Using vectors of single words to represent documents can produce anomalies
Both politics and games talk about scores, opponents, and winning.
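To make the anomaly concrete, here is a toy Python illustration using raw word counts rather than TFIDF weights. The three sentences and the function name are invented for this example and are not drawn from the experimental data.

```python
import math
from collections import Counter


def bag_of_words_cosine(text_a, text_b):
    """Cosine similarity between two texts using raw single-word counts."""
    u, v = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(u[w] * v[w] for w in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up sentences: the first two differ only in their subject word.
politics = "candidate scores win over opponent"
game = "player scores win over opponent"
gardening = "plant tomato seeds in spring"

print(bag_of_words_cosine(politics, game))
print(bag_of_words_cosine(politics, gardening))
```

The politics and game sentences share four of their five words, so a single-word representation rates them as close even though the topics differ, while the gardening sentence shares nothing with either.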


SLIDE 16

Negative results: syntactic problems

“diary” and “dairy” are seen as closely related.
Misspelling in the tag.
Illustrates a problem with the current representational power of tags.
Your tags are only as good as your users!
(aside: many community blogs have frequent discussions about “appropriate” tagging vocabularies)


SLIDE 17

Conclusions

Tags are very attractive due to their simplicity and ease of use.
Limited representational power makes them most useful for grouping into large categories.
By themselves, tags do not seem very effective as a search mechanism.
Tags can be grouped using clustering techniques, which indicates that relationships can be induced automatically.
Needed: tools for increasing expressivity without sacrificing ease of use.
Expressing relationships, suggesting appropriate tags, catching misspellings, automatically grouping tags.


SLIDE 18

Future Work

Current experiments only provide an approximate picture of cluster similarity.
Phrase extraction would produce more precise results.
Other metrics should be evaluated.
Currently developing tools that suggest tags based on article similarity and hierarchy.
Question: do authors and readers use the same tag vocabulary?

Thanks to Technorati for the use of their data.



Tags

become a popular method for annotating and

keywords to blog entries, and share these annotations with others.

useful for?

help as an information retrieval mechanism?
