Tag Spam Creates Large Non-Giant Connected Components




SLIDE 1

Tag Spam Creates Large Non-Giant Connected Components

Nicolas Neubauer (1), Robert Wetzker (2) & Klaus Obermayer (1)
Neural Information Processing Group (1), DAI Lab (2)
Technische Universität Berlin
AIRWeb@WWW’09, 21.4.2009

SLIDE 2

Overview

1. Spam in Social Bookmarking Systems
2. Hyperincident Connected Components
3. Document/User and Tag/User Graphs
4. Conclusions

SLIDE 3

Social Tagging

SLIDE 4

Edges: Top 2000 similarities between top 800 documents (no spam) - Bibsonomy

SLIDE 5

Edges: Top 2000 similarities between top 800 documents (spam) - Bibsonomy

  • Some tag spam targets search engines
  • Top entry for a given tag might indicate relevance
  • Other tag spam targets users
  • Sites with certain tags might lure users into visiting them
  • Spammers behave so radically differently that it shows in the resulting network structures

SLIDE 6

Overview

1. Spam in Social Bookmarking Systems
2. Hyperincident Connected Components
3. Document/User and Tag/User Graphs
4. Conclusions

SLIDE 7

Hyperincident Connectivity

  • Tagging data can be interpreted as a hypergraph, defined by hyperedges (d,u,t) for a document d being tagged with tag t by a user u
  • Two edges are incident if they share a node (i.e., d, u, or t)
– In all examined datasets, everything was basically connected to everything
  • Definition: Two edges are 2-hyperincident if they share at least two nodes
  • 2-hyperincident connected components: components of edges between which paths of 2-hyperincident edges exist

Blue, dotted lines indicate 2-hyperincident edges Blue, dotted lines indicate incident edges
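
The definition above can be turned into a small computation. The following is a minimal sketch, not the authors' implementation: it assumes the tagging data is available as an iterable of (document, user, tag) triples and uses a union-find structure. The function name `two_hcc` and the key prefixes "du", "dt" and "ut" are illustrative choices. Two triples share at least two nodes exactly when they agree on one of the pairs (d,u), (d,t) or (u,t), so each edge is merged with the first edge seen for each of its three pairs.

```python
from collections import defaultdict


def two_hcc(edges):
    """Group (document, user, tag) triples into 2-hyperincident connected components."""
    edges = list(dict.fromkeys(edges))  # drop duplicate triples, keep input order
    parent = list(range(len(edges)))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    first_seen = {}  # shared node pair -> index of the first edge containing it
    for i, (d, u, t) in enumerate(edges):
        for key in (("du", d, u), ("dt", d, t), ("ut", u, t)):
            union(first_seen.setdefault(key, i), i)

    components = defaultdict(list)
    for i, edge in enumerate(edges):
        components[find(i)].append(edge)
    return sorted(components.values(), key=len, reverse=True)
```

In this sketch, edges that only share a single node (for example the same tag used by two unrelated users on unrelated documents) stay in separate components, which is the point of the stricter connectivity notion.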

SLIDE 8

Expanding 2-hyperincident edges around a user

  • Starting from a legitimate user, we had to stop at a limit of discovered nodes (here: 2000)

  • Starting from spam users, we often found closed sets of connected nodes
  • We did not find such components for legitimate users
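
The expansion described on this slide can be sketched as a breadth-first search over 2-hyperincident edges. This is an illustration under assumptions rather than the procedure actually used: the data is assumed to be a list of (document, user, tag) triples, the search is seeded with all edges of the chosen user, and the function name `expand_from_user` and its return shape are hypothetical. Only the limit of 2000 discovered nodes comes from the slide.

```python
from collections import defaultdict, deque


def expand_from_user(edges, user, node_limit=2000):
    """Expand over 2-hyperincident edges, starting from all edges of one user.

    Returns (discovered_nodes, closed). closed is True when the search ran out
    of new edges before discovering more than node_limit nodes.
    """
    edges = list(dict.fromkeys(edges))
    by_pair = defaultdict(list)  # shared node pair -> indices of edges containing it
    for i, (d, u, t) in enumerate(edges):
        for key in (("du", d, u), ("dt", d, t), ("ut", u, t)):
            by_pair[key].append(i)

    queue = deque(i for i, (_, u, _) in enumerate(edges) if u == user)
    seen = set(queue)
    nodes = set()
    while queue:
        d, u, t = edges[queue.popleft()]
        nodes.update((("d", d), ("u", u), ("t", t)))
        if len(nodes) > node_limit:
            return nodes, False  # hit the limit: no closed set found
        for key in (("du", d, u), ("dt", d, t), ("ut", u, t)):
            for j in by_pair[key]:
                if j not in seen:
                    seen.add(j)
                    queue.append(j)
    return nodes, True
```

A result with `closed == True` corresponds to the closed sets of connected nodes reported for spam users; hitting the limit corresponds to the behaviour reported for legitimate users.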
SLIDE 9

Distribution of Component Sizes

x = number of components of size y (log/log)

Neubauer & Obermayer: Hyperincident Connected Components of Tagging Networks, HyperText 2009, in press

SLIDE 10

Distribution of Large Components’ Sizes

x = rank of component, y = number of edges in component
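
Both plots (number of components per size, and size per rank) can be derived from a plain list of component sizes. A minimal sketch, assuming each size is the number of edges in a component; the function names are illustrative and no plotting is included.

```python
from collections import Counter


def size_distribution(component_sizes):
    """Map each component size to the number of components of that size."""
    return dict(sorted(Counter(component_sizes).items()))


def ranked_sizes(component_sizes):
    """Sizes in descending order; index 0 holds the largest component (rank 1)."""
    return sorted(component_sizes, reverse=True)
```

Plotting the first result on log/log axes gives the size distribution, and plotting the second against its index gives the rank plot for the largest components.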

SLIDE 11

Spam Detection

  • Users in nlc/gcc are likely to be spammers/non-spammers
  • Are spammers/non-spammers also likely to live in nlc/gcc?
  • Yes
– although many users from both classes do neither.
  • Simple classification heuristic:
– If user is only in nlc -> spam = 1
– If user is only in gcc -> spam = 0
– Otherwise -> spam = 0.5
– Note that users can be in more than one component
  • Area under ROC curve (AUC - balanced accuracy) of .73

Figures: Distribution of users over components; ROC curve of simple classifier
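
The heuristic above is simple enough to write down directly. The sketch below assumes that nlc and gcc membership is already known per user as two booleans; the names `spam_score` and `auc` are illustrative. The AUC is computed as the probability that a randomly chosen spammer receives a higher score than a randomly chosen non-spammer, with ties counted as one half. The toy data in the usage note is made up and does not reproduce the .73 reported on the slide.

```python
def spam_score(in_nlc, in_gcc):
    """Slide heuristic: 1 if only in nlc, 0 if only in gcc, 0.5 otherwise."""
    if in_nlc and not in_gcc:
        return 1.0
    if in_gcc and not in_nlc:
        return 0.0
    return 0.5  # in both kinds of component, or in neither


def auc(scores, labels):
    """Area under the ROC curve; labels are 1 for spammers, 0 for non-spammers."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, scoring four hypothetical users with memberships `(True, False)`, `(False, False)`, `(False, True)`, `(True, True)` and labels `[1, 1, 0, 0]` yields scores `[1.0, 0.5, 0.0, 0.5]`.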

SLIDE 12

Largest and Next-largest 2-HCC for one Month of Delicious Tags

SLIDE 13

Overview

1. Spam in Social Bookmarking Systems
2. Hyperincident Connected Components
3. Document/User and Tag/User Graphs
4. Conclusions

SLIDE 14

Doubting Hyper-Incident Connectivity

  • “Nice result, but probably mostly based on documents”
  • Short story: Right.

– Long story: Tags do have a bit of influence here.

  • Question: What happens if we examine connectivity on the document/user graph, i.e. edges = (d,u) for (d,u,t) in the hypergraph?
– And what happens if we do the same for the tag/user graph?
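
The projection in the question can be sketched as follows. The slides do not state which connectivity notion is applied to the projected graphs, so this example assumes ordinary connected components of the resulting bipartite graph; that choice, the function names, and the index-based `keep` parameter are all assumptions made for illustration.

```python
from collections import defaultdict


def project(edges, keep):
    """Project (d, u, t) triples onto two positions, e.g. (0, 1) for document/user."""
    a, b = keep
    return {(e[a], e[b]) for e in edges}


def bipartite_components(pairs):
    """Ordinary connected components of a bipartite graph given as (left, right) pairs."""
    adj = defaultdict(set)
    for left, right in pairs:
        # Tag each node with its side so equal names on both sides stay distinct.
        adj[("L", left)].add(("R", right))
        adj[("R", right)].add(("L", left))
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)
```

Calling `project(edges, (0, 1))` gives the document/user edges and `project(edges, (2, 1))` the tag/user edges; the component sizes of both can then be compared with those of the full hypergraph.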

SLIDE 15

Connectivity Structure (Bibsonomy)

  • We see that the distribution of component sizes in the user/document graph closely resembles the one found in the entire hypergraph

  • The tag/document graph is mostly connected
SLIDE 16

User Distribution

  • Accordingly, membership information on the user/document graph is comparably informative, while the tag/document graph is useless

SLIDE 17

Spam Detection

New spam detection experiments:
  • applied above heuristic on document/user graph (red)
  • compared to original approach (black)
  • new heuristic (blue):
– new maximum spam score for users being in nlc in both graphs
  • also examined effect of #documents/user

Results:
  • Hypergraph and document/user graph connectivity provide similar, but sometimes complementary information
  • Entire approach works better when users have more documents

Figures: ROC curves for all three heuristics; AUC values
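
The slide does not spell out how the two graphs are combined in the new heuristic. One possible reading, shown here purely as an assumption, is to add the per-graph scores of the simple heuristic, so that the new maximum score of 2 is reached only by users who are in nlc alone in both graphs. The function names are illustrative.

```python
def spam_score(in_nlc, in_gcc):
    """Single-graph heuristic: 1 if only in nlc, 0 if only in gcc, 0.5 otherwise."""
    if in_nlc and not in_gcc:
        return 1.0
    if in_gcc and not in_nlc:
        return 0.0
    return 0.5


def combined_score(hypergraph_membership, docuser_membership):
    """Sum of the per-graph scores (assumed combination rule, range 0 to 2).

    Each argument is an (in_nlc, in_gcc) pair for the respective graph.
    """
    return spam_score(*hypergraph_membership) + spam_score(*docuser_membership)
```

Because the ROC curve only depends on the ranking of users, any monotone rescaling of this sum would give the same AUC.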

SLIDE 18

Overview

1. Spam in Social Bookmarking Systems
2. Hyperincident Connected Components
3. Document/User and Tag/User Graphs
4. Conclusions

SLIDE 19

Final Results & Discussion

Requirements (columns: Feature extraction on resources or references; Previous labels)
  • Content analysis: X, X
  • Reference analysis: X, X
  • User Similarity: X
  • Structural Analysis: (none)

  • Accuracy decreases, but so do domain dependence and requirements on available information
  • Addition to other, more specialized approaches
  • Stand-alone baseline when more specialized approaches are not available
  • Although a large part of connectivity is related to documents, tags do play a subtle role.
  • Next: Exploring temporal evolution & even stricter notions of connectivity