Improved Annotation of the Blogosphere via Autotagging and Hierarchical Clustering

15th International World Wide Web Conference. Improved Annotation of the Blogosphere via Autotagging and Hierarchical Clustering. Chris Brooks and Nancy Montanez, Department of Computer Science, University of San Francisco.


SLIDE 1

15th International World Wide Web Conference

Improved Annotation of the Blogosphere via Autotagging and Hierarchical Clustering

Chris Brooks and Nancy Montanez Department of Computer Science University of San Francisco

Department of Computer Science, University of San Francisco

SLIDE 2

Tags

  • Tagging has recently become a popular method for annotating and organizing blog entries.
  • Allows users to attach keywords to blog entries, and share these annotations with others.
  • Easy to use and intuitive.
  • But what tasks are tags useful for?
  • More specifically, do tags help as an information retrieval mechanism?

SLIDE 3

Shared Tags and Folksonomies

  • Tags have (at least) three clear uses:
      • Individual organization
      • Shared annotation of articles into categories
      • Shared annotation as an aid to searching
  • We are more interested in tags as a mechanism for sharing information.
  • Folksonomy: the meaning associated with a tag will evolve and coalesce through community usage.

SLIDE 4

Popular tags

About Me, Acne News, Actualite, Actualites, Actualites et politique, Advertising, Allmant, All Posts, amazon, Amigos, amor, Amusement, Anime, Announcements, Articles/News, Asides, Asterisk, audio, Babes, Babes On Flickr, Baby, Baseball, Blogging, Blogs, book, books, Business, Car, Car Insurance, Cars, category, Cell Phones, China, Cinema, Cine cinema, Comics, Computadores e a Internet, Computer, Computers, Computers and Internet, Computers en internet, Computing, CSS, Curiosidades, Current events, Data Recovery, days, Development, diario, Directory, Divertissement, Dogs, dreams, Entertainment, Entretenimento, Entretenimiento, Environment, etc, Europe, Event, EveryDay, Everything, F1, fAcTs, Family, fashion, Feeling, Feelings, FF11, FFXI, Film, Firefox, Flash, Flickr, Flutes, Food and Drink, Football, foreign-exchange, Foreign Exchange, Fotos, Friends, Fun, Funny, general, Game, Games, Gaming, Generale, General news, General Posting, General webmaster threads, Geral, Golf, Google, gossip, Hardware, Health and wellness, Health Insurance, History, hobbies, Hobby, Home, Humor, Hurricane Katrina, Info, Informatica e Internet, International, Internet, In The News, Intrattenimento, Java, jeux, Jewelry, jogos, Journal, Journalism, Juegos, kat-tun, Katrina, Knitting, Law, Legislation, libros, Life, Links, Live, Livres, Livros, London, Love, Love Poems, Lyrics, Musica, Macintosh, Marketing, MassCops Recent Topics, Me, Media, meme, memes, memo, metblogs, metroblogging, Military, Misc, Misc., miscellaneous, MobLog, Mood, Movie, Movies, murmur, Music, Musica, Musik, Musings, Musique, Muziek, My blog, Nature, News and politics, Noticias e politica, Opinion, Ordinateurs et Internet, Organizacoes, Organizaciones, Organizations, others, Pasatiempos, Passatempos, PC, Pensamentos, Pensamientos, People, Personal, Philosophy, photo, Pictures, Podcast, Poem, poemas, Poesia, Poker, police headlines, Politik, Projects, Quotes, Radio, Ramblings, random, Randomness, Random thoughts, Rant, Real Estate, Recipes, reflexiones, reizen, Relationships, Research, Resources, Review, RO, RSS, Saude e bem-estar, Salud y bienestar, Sante et bien-etre, School, Science, Search, Sex, sexy, Shopping, Site news, Society, software, Spam, Stories, stuff, Tech News, technology, Television, Terrorism, test, Tips, Tools, Travel, Updates, USA, Viagens, Viajes, Video, Videos, VoIP, Votes, Voyages, War, Weather, Weblog, Website, weight loss, Whatever, Windows, Wireless, wordpress, words, Work, World news, Writing

The 250 most popular tags on Technorati, as of October 6, 2005

  • Things to notice:
      • Tags tend to be general terms
      • Synonyms and related concepts are repeated
      • Misspellings and inconsistent capitalization
      • Jargon, slang, spam, and non-English words
      • Non-useful tags (everything, etc, random, test)

SLIDE 5

Representational Power

  • A tag is a label that is applied to a set of blog entries.
  • There is no way to specify relationships between tags
      • Opposite, more general/specific, synonym, etc.
  • In logical terms, tags are a propositional mechanism.
      • This should set off some alarms amongst the AI people in the audience!
  • We see users trying to use tags more expressively
      • e.g. “San Francisco, California”
      • This can’t be decomposed, or related to the tag “San Francisco” or the tag “California”
  • Maybe tags are not quite so easy to use ...
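
To make the decomposition problem concrete, here is a minimal sketch, assuming tags are stored as a flat index from label to entry identifiers. The function `build_tag_index`, the `entries` dictionary, and the post identifiers are hypothetical and are not taken from any real tagging service. Because each tag is an opaque string, the composite tag is just a third, unrelated key.

```python
from collections import defaultdict


def build_tag_index(entries):
    """Map each tag (an opaque string) to the set of entries carrying it."""
    index = defaultdict(set)
    for entry_id, tags in entries.items():
        for tag in tags:
            index[tag].add(entry_id)
    return index


# Hypothetical entries, for illustration only.
entries = {
    "post-1": ["San Francisco, California"],
    "post-2": ["San Francisco"],
    "post-3": ["California"],
}
index = build_tag_index(entries)

# A lookup on either simpler tag does not find post-1.
print(index["San Francisco"])
print(index["California"])
```

Nothing in the index says that the composite label implies either of the other two; any such relationship would have to be added on top of the tags.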

SLIDE 6

Tags as an Information Retrieval Mechanism

  • In this paper, we tried to determine whether tags were useful as an information retrieval mechanism.
  • Specifically, can tags help with a search task?
  • How similar are articles that are assigned the same tags?
  • Hypothesis: Rarer tags are better at describing articles than more common tags.

SLIDE 7

Tags as an Information Retrieval Mechanism

  • Retrieved the 350 most popular tags from Technorati, and then the 250 most recent articles for each tag.
  • Articles are converted into weighted word vectors, using TFIDF to assign a weight to each word.
  • All articles that share a tag are assigned to a tag cluster.
  • The size (spread) of a tag cluster is measured using the average pairwise cosine similarity of its articles.
  • Note: the actual content of the documents is what is evaluated.
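
The measurement described above can be sketched as follows. This is a minimal illustration, not the authors' code: the slides do not give the exact TFIDF formula, so the sketch assumes one common variant (term frequency normalized by document length, times the log of inverse document frequency). The function names and the four-article toy corpus are made up for the example.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TFIDF vectors (word -> weight)."""
    n = len(docs)
    # Document frequency: how many documents contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({w: (c / len(doc)) * math.log(n / df[w]) for w, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_pairwise_similarity(vectors):
    """Mean cosine similarity over all unordered pairs in a cluster."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy corpus of (tag, tokens); the texts are invented for illustration.
corpus = [
    ("games", "the team won the game with a late score".split()),
    ("games", "a new game release and a high score".split()),
    ("knitting", "wool yarn and a scarf pattern".split()),
    ("knitting", "a scarf pattern for soft yarn".split()),
]
vectors = tfidf_vectors([tokens for _, tokens in corpus])

# Group the article vectors by tag to form tag clusters, then score each one.
clusters = {}
for (tag, _), vec in zip(corpus, vectors):
    clusters.setdefault(tag, []).append(vec)
scores = {tag: average_pairwise_similarity(vs) for tag, vs in clusters.items()}
print(scores)
```

Weights are computed over the whole corpus rather than per cluster, so a word that appears in every article (here "a") gets weight zero and contributes nothing to any similarity.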

SLIDE 8

How Similar are Tag Clusters?

[Figure: Similarity Among Blogs Using Popular Tags (250 blogs per tag). x-axis: Tag Rank (or Popularity), 0 to 350; y-axis: Cosine Similarity, 0 to 1.]

  • Articles with the same tag are somewhat similar.
  • Small spike among highly popular tags (game, games, vote).
  • Contrary to expectations, articles with rare tags are not more similar than articles with common tags.
  • But how similar are these clusters?

SLIDE 9

Baselines

[Figure: Similarity Among Random Clusters of Blogs (2,500 blogs total). x-axis: Index of Clusters (50 blogs per cluster); y-axis: Cosine Similarity, 0 to 1.]

[Figure: Similarity Among Documents from Google News (10 different news topics, roughly 30 articles per topic). x-axis: Index of News Topic Clusters; y-axis: Cosine Similarity, 0 to 1.]

  • Tagging clusters articles better than random selection, but worse than Google News.
  • Tagging seems most effective at grouping articles into broad topical bins.
  • Not very effective as a mechanism for locating particular articles.
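
The random baseline can be sketched roughly as below: shuffle the articles, cut them into fixed-size clusters, and score each cluster with the same average pairwise measure. This is an assumption-laden illustration, not the authors' procedure. The helper names are hypothetical, and `word_overlap` (Jaccard overlap of word sets) is only a stand-in for TFIDF cosine similarity so that the example stays short.

```python
import random
from itertools import combinations


def random_clusters(items, cluster_size, seed=0):
    """Shuffle items and cut them into equal-sized clusters (leftovers are dropped)."""
    rng = random.Random(seed)  # fixed seed so the baseline is repeatable
    shuffled = list(items)
    rng.shuffle(shuffled)
    last_start = len(shuffled) - cluster_size
    return [shuffled[i:i + cluster_size] for i in range(0, last_start + 1, cluster_size)]


def average_pairwise(cluster, similarity):
    """Mean similarity over all unordered pairs in a cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def word_overlap(a, b):
    """Jaccard overlap of word sets: a simple stand-in for TFIDF cosine similarity."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Sizes from the slide: 2,500 articles cut into clusters of 50.
# Integers stand in for article identifiers here.
clusters = random_clusters(range(2500), cluster_size=50)
print(len(clusters), len(clusters[0]))

# Tiny made-up "articles" to show the scoring step.
toy_cluster = ["storm hits coast", "storm damages coast towns", "knitting wool scarf"]
toy_score = average_pairwise(toy_cluster, word_overlap)
print(toy_score)
```

A tag cluster that scores no higher than these random groups would carry no topical information at all, which is why the comparison is a useful floor.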

SLIDE 10

Autotagging

  • Perhaps users are not very good at choosing tags for search. Can automated methods do better?
  • Autotagging is the process of automatically assigning tags based on the content of an article.
  • Hypothesis: To determine what an article is about, look at the article itself!
  • Assign TFIDF scores to all words and extract the highest-scoring words.
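
A minimal sketch of this idea follows, assuming the same simple TFIDF variant as a length-normalized term frequency times log inverse document frequency (the slides do not specify the formula). The function name `autotag`, the default `k=3`, the alphabetical tie-break, and the three toy articles are all assumptions made for illustration.

```python
import math
from collections import Counter


def autotag(docs, k=3):
    """Return the k highest-scoring TFIDF words of each tokenized document."""
    n = len(docs)
    # Document frequency: how many documents contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    tags = []
    for doc in docs:
        tf = Counter(doc)
        scores = {w: (c / len(doc)) * math.log(n / df[w]) for w, c in tf.items()}
        # Highest score first; ties are broken alphabetically so output is stable.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        tags.append(ranked[:k])
    return tags


# Invented articles, for illustration only.
docs = [
    "the hurricane flooded the city and the hurricane moved north".split(),
    "the recipe needs flour and the recipe needs sugar".split(),
    "the city council passed the budget".split(),
]
tags = autotag(docs)
for doc_tags in tags:
    print(doc_tags)
```

Words that occur in every article (such as "the") score zero and never become tags, while words concentrated in one article rise to the top.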

SLIDE 11

Autotagging

[Figure: Similarity Among Blogs with the Same Top TFIDF Keywords (500 blogs total). x-axis: Index of TFIDF Keyword Clusters; y-axis: Cosine Similarity, 0 to 1. Caption: Pairwise similarity of clusters of articles sharing a highly-scored word.]

  • We also extracted the three highest-scoring words from each article and assigned them as tags.
  • Clusters formed using these words were smaller and much more similar than clusters formed using user-chosen tags.
  • Tags extracted from the text of an article are more helpful in creating specific categories than user-selected tags are.

SLIDE 12

Generating Hierarchies of Tags

  • Tags are unable to express related concepts.
  • Do related articles have tags judged as similar by a human?
  • To address this question, we use agglomerative clustering to construct a tag hierarchy.
  • Goal: identify and group tags that are similar or related.

SLIDE 13

Agglomerative Clustering

  • The agglomerative clustering algorithm is very straightforward:
      • Find the two closest tag clusters and merge them into a single abstract cluster. Repeat until one cluster containing all tags remains.
  • This yields a dendrogram showing tag similarities.
  • Tags = {t1, t2, ..., tn}
  • while |Tags| > 1 :
      • find ti, tj (i ≠ j) s.t. sim(ti, tj) >= sim(tk, tl) ∀ k ≠ l
      • tnew = ti ∪ tj
      • Tags = Tags − {ti, tj}
      • Tags = Tags ∪ {tnew}
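
The loop above can be turned into a short runnable sketch. Two things here are assumptions rather than details from the slides: the similarity between merged clusters is taken as the average over all cross-cluster tag pairs (average-link), and the similarity values in `TOY_SIM` are invented for illustration. The names `agglomerate` and `toy_sim` are hypothetical.

```python
from itertools import combinations


def agglomerate(tags, sim):
    """Repeatedly merge the two most similar clusters; return the merge tree.

    Leaves are tag names; each merge becomes a nested pair (left, right).
    """
    if not tags:
        raise ValueError("need at least one tag")
    # Each cluster is (tree, member_tags).
    clusters = [(t, [t]) for t in tags]

    def cluster_sim(a, b):
        # Average-link similarity between two clusters (an assumption).
        total = sum(sim(x, y) for x in a[1] for y in b[1])
        return total / (len(a[1]) * len(b[1]))

    while len(clusters) > 1:
        # Pick the pair of clusters with the highest similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Invented pairwise similarities between tags, for illustration only.
TOY_SIM = {
    frozenset(["Poem", "Poetry"]): 0.9,
    frozenset(["Car Insurance", "Auto Insurance"]): 0.8,
}


def toy_sim(a, b):
    """Look up a made-up similarity; unrelated tags default to 0.1."""
    return TOY_SIM.get(frozenset([a, b]), 0.1)


tree = agglomerate(["Poem", "Poetry", "Car Insurance", "Auto Insurance"], toy_sim)
print(tree)
```

The nested pairs are a plain-data form of the dendrogram: the innermost pairs were merged first and are therefore the most similar tags.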

SLIDE 14

Positive results: locating related tags

[Figure: dendrogram of clustered tags. Recoverable labels: iTunes, Car Insurance, Auto Insurance, Wine and Cheese, garden, Garden and Ponds, Foreign Policy, Reference, Junk, Legislation, Tips, Body-for-LIFE, News, Prescription, weight loss, Daily news, Boredom, essay, About me, Poetry, Anything, My Story, Poem, Feeling, Feelings, My day, Mood, Home and Garden, Politics, Insurance, Health, Personal, Jokes, root concept ..., literary, self-expression, emotions, gardening.]

  • Clustering is able to construct groups of tags that might be characterized as “related” by a human.

SLIDE 15

Negative results: shared vocabulary

  • Using vectors of single words to represent documents can produce anomalies.
  • Both politics and games talk about scores, opponents, and winning.

SLIDE 16

Negative results: syntactic problems

  • “diary” and “dairy” are seen as closely related.
      • Misspelling in the tag.
  • Illustrates a problem with the current representational power of tags.
  • Your tags are only as good as your users!
  • (aside: many community blogs have frequent discussions about “appropriate” tagging vocabularies)

SLIDE 17

Conclusions

  • Tags are very attractive due to their simplicity and ease of use.
  • Limited representational power makes them most useful for grouping into large categories.
  • By themselves, tags do not seem very effective as a search mechanism.
  • Tags can be grouped using clustering techniques, which indicates that relationships can be induced automatically.
  • Needed: tools for increasing expressivity without sacrificing ease of use.
      • Expressing relationships, suggesting appropriate tags, catching misspellings, automatically grouping tags.
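
As one possible shape for the misspelling tool mentioned above, the sketch below flags pairs of tags whose spellings are nearly identical, using Python's standard `difflib.SequenceMatcher`. This is an illustration, not something described in the talk: the function name `near_duplicate_tags` and the cutoff of 0.85 are assumptions, and the sample tags are drawn from the popular-tag list only as input data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicate_tags(tags, cutoff=0.85):
    """Return pairs of distinct tags whose spelling similarity is at least cutoff."""
    pairs = []
    for a, b in combinations(sorted(set(tags)), 2):
        # Compare case-insensitively so "Game" and "game" count as close.
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= cutoff:
            pairs.append((a, b))
    return pairs


suspects = near_duplicate_tags(["Feeling", "Feelings", "Golf", "Movie", "Movies"])
print(suspects)
```

A check like this only proposes candidates for review. It cannot tell a typo from two legitimate words that happen to be spelled alike, so it would need a human or a content-based signal to confirm each pair.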

SLIDE 18

Future Work

  • Current experiments only provide an approximate picture of cluster similarity.
  • Phrase extraction would produce more precise results.
  • Other metrics should be evaluated.
  • Currently developing tools that suggest tags based on article similarity and hierarchy.
  • Question: do authors and readers use the same tag vocabulary?

  • Thanks to Technorati for the use of their data.
