Comparing Recommendation Algorithms for Social Bookmarking



1. Comparing Recommendation Algorithms for Social Bookmarking
   Toine Bogers
   Royal School of Library and Information Science
   Copenhagen, Denmark


2. About me
   • Ph.D. from Tilburg University
     - "Recommender Systems for Social Bookmarking"
     - Promotor: Prof. dr. Antal van den Bosch
   • Currently @ RSLIS (Copenhagen, DK)
     - Research assistant on a retrieval fusion project
   • Research interests
     - Recommender systems
     - Social bookmarking
     - Expert search
     - Information retrieval

3. Outline
   1. Introduction
   2. Collaborative filtering
   3. Content-based filtering
   4. Recommender systems fusion
   5. Conclusions


4. Social bookmarking
   • A way of storing, organizing, and managing bookmarks of Web pages, scientific articles, books, etc.
     - All done online
     - Can be made public or kept private
     - Allows users to tag (= label) their items
     - Many different websites available

5. Social bookmarking
   • Different domains
     - Web pages
     - Scientific articles
     - Books
   • Strong growth in popularity
     - Millions of users, items, and tags
     - For example, Delicious:
       - 140,000+ posts/day on average in 2008 (Keller, 2009)
       - 7,000,000+ posts/month in 2008 (Wetzker et al., 2009)


6. Content overload
   • Problems with this growth
     - Content overload
     - Increasing ambiguity
   • How can we deal with this?
     - Browsing: can become less effective as content increases!
     - Search
   • A possible solution
     - Take a more active role: recommendation

7. Recommendation tasks
   [Diagram: overview of recommendation tasks and their information sources]

8. Item recommendation
   • Our focus: item recommendation
     - Identify sets of items that are likely to be of interest to a certain user
       - Return a ranked list of items
       - 'Find Good Items' task (Herlocker et al., 2004)
     - Based on different information sources
       - Transaction patterns (usage data, purchase information)
       - Explicit ratings
       - Implicit feedback
       - Metadata
       - Tags


9. Related work
   • Work on social bookmarking has mostly focused on
     - Improving the browsing experience (clustering, dealing with ambiguity)
     - Incorporating tags in search algorithms
     - Tag recommendation
   • Problems with existing work on item recommendation
     - Different data sets
     - Different evaluation metrics
     - No comparison of algorithms under controlled conditions
     - Hardly ever publicly available data sets
     - No user-based evaluation


10. Collecting data
   • Four data sets from two different domains
     - Web bookmarks
       - Delicious
       - BibSonomy
     - Scientific articles
       - CiteULike
       - BibSonomy
   • On BibSonomy, ~78% of users posted only one type of content (bookmarks or scientific articles)


11. What did we collect?
   • Usage data
     - User-item-tag triples with timestamps
   • Metadata
     - Varies with the domain

                        Web bookmarks                    Scientific articles
     Item-intrinsic     TITLE, DESCRIPTION, TAGS, URL    TITLE, DESCRIPTION, JOURNAL, AUTHOR, TAGS, URL, etc.
     Item-extrinsic     -                                CHAPTER, DAY, EDITION, YEAR, INSTITUTION, etc.



12. Filtering
   • Why?
     - To reduce noise in our data sets
     - Common procedure in recommender systems research
   • How? (see the sketch after this slide)
     - ≥ 20 items per user
     - ≥ 2 users per item (no hapax legomena items)
     - No untagged posts
   • Compared to related work
     - Stricter filtering
     - More realistic
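A minimal sketch of this filtering step, assuming posts are (user, item, tags) tuples and that the two thresholds are re-applied until both hold; the function and parameter names are illustrative, not the authors' code.

    from collections import Counter

    def filter_posts(posts, min_items_per_user=20, min_users_per_item=2):
        """Filter (user, item, tags) posts until the per-user and per-item thresholds hold."""
        # Drop untagged posts first
        posts = [p for p in posts if p[2]]
        changed = True
        while changed:
            user_counts = Counter(user for user, _, _ in posts)
            item_counts = Counter(item for _, item, _ in posts)
            kept = [
                (user, item, tags)
                for user, item, tags in posts
                if user_counts[user] >= min_items_per_user
                and item_counts[item] >= min_users_per_item
            ]
            changed = len(kept) != len(posts)  # removing posts may drop others below a threshold
            posts = kept
        return posts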


13. Data sets

                 Bookmarks                  Scientific articles
                 Delicious    BibSonomy     CiteULike    BibSonomy
    # users      1,243        192           1,322        167
    # items      152,698      11,165        38,419       12,982
    # tags       42,820       13,233        28,312       5,165
    # posts      238,070      29,096        84,637       29,720


14. Experimental setup
   • Backtesting (see the sketch after this slide)
     - Withhold randomly selected items from each test user
     - Use the remaining material to train the recommender system
     - Success is measured by how well the system predicts the user's interest in his/her withheld items
   • Details
     - Overall 90%-10% split on users
     - Withhold 10 randomly selected items from each test user
     - Parameter optimization
       - 10-fold cross-validation
       - 90-10 splits
       - 10 withheld items
     - Macro-averaging of evaluation scores
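A minimal sketch of the backtesting split, assuming each user maps to the list of items they posted; the function name, parameters, and fixed seed are illustrative.

    import random

    def backtest_split(user_items, test_fraction=0.1, n_withheld=10, seed=42):
        """Split users 90%-10% and withhold n items from each test user."""
        rng = random.Random(seed)
        users = list(user_items)
        rng.shuffle(users)
        n_test = max(1, int(len(users) * test_fraction))
        test_users = set(users[:n_test])

        train, withheld = {}, {}
        for user, items in user_items.items():
            if user in test_users and len(items) > n_withheld:
                held_out = rng.sample(items, n_withheld)   # items hidden from training
                withheld[user] = held_out
                train[user] = [i for i in items if i not in held_out]
            else:
                train[user] = list(items)
        return train, withheld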


15. Evaluation
   • The 'Find Good Items' task returns a ranked list
     - We need a metric that takes the ranking of items into account
   • Precision-oriented metric: Mean Average Precision (MAP) (see the sketch after this slide)
     - Average Precision (AP) is the average of the precision values at each relevant, retrieved item
     - MAP is AP averaged over all users
     - "single figure measure of quality across recall levels" (Manning, 2009)
   • Tested different metrics
     - All precision-oriented metrics showed the same picture
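A minimal sketch of AP and MAP as described above, assuming each test user has a ranked recommendation list and a set of withheld (relevant) items; function names are illustrative.

    def average_precision(ranked_items, relevant):
        """AP: mean of the precision values at each relevant, retrieved item."""
        relevant = set(relevant)
        hits, precisions = 0, []
        for rank, item in enumerate(ranked_items, start=1):
            if item in relevant:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(relevant) if relevant else 0.0

    def mean_average_precision(rankings, withheld):
        """MAP: AP macro-averaged over all test users."""
        scores = [average_precision(rankings[u], withheld[u]) for u in withheld]
        return sum(scores) / len(scores) if scores else 0.0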


16. Collaborative filtering
   • Question
     - How can we use the information in the folksonomy (users, items, tags, and their usage patterns) to generate better recommendations?
   • Collaborative filtering (CF)
     - Attempts to automate "word-of-mouth" recommendations
     - Recommend items based on how like-minded users rated those items
     - Similarity based on
       - Usage data
       - Tagging data


17. Collaborative filtering
   • Model-based CF
     - 'Eager' recommendation algorithms
     - Train a predictive model of the recommendation task
     - Quick to apply when generating recommendations
   • Memory-based CF
     - 'Lazy' recommendation algorithms
     - Simply store all patterns in memory
     - Defer the prediction effort to when the user requests recommendations


18. Related work
   • Model-based
     - Hybrid PLSA-based approach (Wetzker et al., 2009)
     - Tensor decomposition (Symeonidis et al., 2008)
   • Memory-based
     - Tag-aware fusion (Tso-Sutter et al., 2008)
   • Graph-based
     - FolkRank (Hotho et al., 2006)
     - Random walk (Clements et al., 2008)


19. Algorithms
   • User-based k-NN algorithm (see the sketch after this slide)
     - Calculate the similarity between the active user and all other users
     - Determine the top k nearest neighbors
       - I.e., the most similar users
     - Unseen items from the nearest neighbors are scored by the similarity between the neighbor and the active user
   • Item-based k-NN algorithm
     - Calculate the similarity between the active user's items and all other items
     - Determine the top k nearest neighbors
       - I.e., the most similar items for each of the active user's items
     - Unseen neighboring items are scored by the similarity between the neighbor and the active user's item
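A minimal sketch of the user-based k-NN variant, assuming a unary user-item matrix and cosine similarity (as used with the usage data on the next slide); the function and its parameters are illustrative, not the authors' implementation.

    import numpy as np

    def user_based_knn_scores(ui, active_user, k=10):
        """Score unseen items for one user via user-based k-NN with cosine similarity."""
        norms = np.linalg.norm(ui, axis=1)
        norms[norms == 0] = 1.0                    # avoid division by zero for empty profiles
        sims = (ui @ ui[active_user]) / (norms * norms[active_user])
        sims[active_user] = 0.0                    # exclude the active user

        neighbors = np.argsort(sims)[::-1][:k]     # top-k most similar users
        scores = sims[neighbors] @ ui[neighbors]   # sum of neighbor similarities per item
        scores[ui[active_user] > 0] = 0.0          # only score unseen items
        return scores                              # rank items by descending score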


20. Usage data
   • Baseline: CF using usage data
   • Profile vectors, taken from the user-item matrix UI (see the sketch after this slide)
     - User profiles (rows of UI)
     - Item profiles (columns of UI)
   • No explicit ratings available
     - Only binary information (1 or 0)
     - Or rather: unary!
   • Similarity metric
     - Cosine similarity
   • 10-fold cross-validation to optimize k
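A minimal sketch of how the unary UI matrix could be built from the collected posts; `posts`, `users`, and `items` are assumed inputs (post tuples plus fixed orderings of users and items), and all names are illustrative.

    import numpy as np

    def build_ui_matrix(posts, users, items):
        """Build the unary user-item matrix UI from (user, item, tags) posts."""
        user_idx = {u: i for i, u in enumerate(users)}
        item_idx = {it: j for j, it in enumerate(items)}
        ui = np.zeros((len(users), len(items)), dtype=np.int8)
        for user, item, _tags in posts:
            ui[user_idx[user], item_idx[item]] = 1   # unary: posted or not
        return ui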


21. Results (usage data)

    (MAP)                  Bookmarks                 Scientific articles
                           BibSonomy    Delicious    BibSonomy    CiteULike
    UBCF + usage data      0.0277       0.0046       0.0865       0.0746
    IBCF + usage data      0.0244       0.0027       0.0737       0.0887


22. Tagging data
   • Tags are short topical descriptions of an item (or user)
   • Profile vectors, taken from the user-tag matrix UT and the item-tag matrix IT
     - User tag profiles (rows of UT)
     - Item tag profiles (rows of IT)
   • Similarity metrics (see the sketch after this slide)
     - Cosine similarity
     - Jaccard overlap
     - Dice's coefficient
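A minimal sketch of the Jaccard and Dice overlap measures on tag profiles (cosine similarity appears in the earlier k-NN sketch); treating a profile as a plain set of tags is an assumption for illustration.

    def jaccard(a, b):
        """Jaccard overlap between two tag sets."""
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def dice(a, b):
        """Dice's coefficient between two tag sets."""
        a, b = set(a), set(b)
        return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0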


23. Results (tagging data)

    (MAP)                   Bookmarks                 Scientific articles
                            BibSonomy    Delicious    BibSonomy    CiteULike
    UBCF + usage data       0.0277       0.0046       0.0865       0.0746
    IBCF + usage data       0.0244       0.0027       0.0737       0.0887
    UBCF + tagging data     0.0102       0.0017       0.0459       0.0449
    IBCF + tagging data     0.0370       0.0101       0.1100       0.0814

