Benjamin Markines Ciro Cattuto Filippo Menczer ISI Foundation



slide-1
SLIDE 1

Benjamin Markines Ciro Cattuto Filippo Menczer

ISI Foundation

slide-2
SLIDE 2
slide-3
SLIDE 3
slide-4
SLIDE 4
slide-5
SLIDE 5
slide-6
SLIDE 6
slide-7
SLIDE 7

• Beneficiaries
  • Spammer
  • Intermediary
• Non-beneficiaries
  • Systems: search engines, tagging
  • Information consumers
  • Original authors
• ?
  • Surfer
  • Advertisers

slide-8
SLIDE 8

• Features
  • Folksonomy description
  • Feature descriptions
    • post level
    • resource level
    • user level
• Feature analysis
  • Dataset description
• Social spam detection

slide-9
SLIDE 9

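[Figure: example folksonomy linking users (alice, bob), tags (web, tech, news), and resources (wired.com, cnn.com, www2009.org)]

F = (U, T, R, Y), Y ⊆ U × T × R (the triples)

where U is the set of users, T the set of tags, and R the set of resources.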
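As a minimal sketch of this model, the folksonomy can be held as a set of (user, tag, resource) triples. The triples below are hypothetical: the slide names the users, tags, and resources, but the actual assignments in the figure did not survive extraction. The helper names `post_tags` and `posts_of` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical triples Y ⊆ U × T × R; the real assignments from the
# slide's figure are not recoverable, so these are illustrative only.
Y = {
    ("alice", "news", "cnn.com"),
    ("alice", "web", "www2009.org"),
    ("bob", "tech", "wired.com"),
    ("bob", "news", "wired.com"),
    ("bob", "web", "www2009.org"),
}


def post_tags(triples):
    """Group triples into posts: (user, resource) -> set of tags T(u, r)."""
    posts = defaultdict(set)
    for user, tag, resource in triples:
        posts[(user, resource)].add(tag)
    return dict(posts)


def posts_of(triples, user):
    """Return P(u): the (user, resource) pairs posted by one user."""
    return {(u, r) for (u, _tag, r) in triples if u == user}


T = post_tags(Y)
print(sorted(T[("bob", "wired.com")]))  # ['news', 'tech']
print(sorted(posts_of(Y, "alice")))     # [('alice', 'cnn.com'), ('alice', 'www2009.org')]
```

Grouping by (user, resource) gives the post-level view used by the post-level features, while `posts_of` gives the per-user view needed for user-level aggregation.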

slide-10
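SLIDE 10

TagSpam

$$f_{\mathrm{TagSpam}}(u, r) = \frac{1}{|T(u, r)|} \sum_{t \in T(u, r)} \Pr(t)$$

where u is the user, r is the resource, T(u, r) is the set of tags in the post, and Pr(t) is the spam probability of tag t.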
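The TagSpam score is the average spam probability over the tags of one post. The sketch below assumes the per-tag probabilities Pr(t) are already available as a dictionary; the values and the name `tag_spam` are hypothetical.

```python
def tag_spam(post_tag_set, spam_prob):
    """Average spam probability Pr(t) over the tags T(u, r) of one post."""
    if not post_tag_set:
        raise ValueError("a post must have at least one tag")
    return sum(spam_prob[t] for t in post_tag_set) / len(post_tag_set)


# Hypothetical per-tag spam probabilities; in practice they would be
# estimated from a set of labeled tags.
spam_prob = {"web": 0.1, "news": 0.2, "cheap-pills": 0.9}

score = tag_spam({"news", "cheap-pills"}, spam_prob)
print(round(score, 2))  # (0.2 + 0.9) / 2 = 0.55
```

A tag missing from `spam_prob` raises a `KeyError` here; how unseen tags should be scored is not stated on the slide.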

slide-11
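SLIDE 11

TagBlur

$$f_{\mathrm{TagBlur}}(u, r) = \frac{1}{Z} \sum_{t_1 \neq t_2 \in T(u, r)} \left[ \frac{1}{\sigma(t_1, t_2) + \epsilon} - \frac{1}{1 + \epsilon} \right]$$

where Z is the number of tag pairs in the post and σ(t1, t2) is the similarity between tags t1 and t2.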
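A minimal sketch of the TagBlur computation follows. It assumes a symmetric similarity function with values in [0, 1] and sums over unordered tag pairs; the value of ε, the toy similarity table, and the zero score for single-tag posts are all assumptions, since the slide does not specify them.

```python
from itertools import combinations


def tag_blur(post_tag_set, similarity, eps=0.1):
    """Mean over tag pairs of 1 / (sigma + eps) - 1 / (1 + eps)."""
    pairs = list(combinations(sorted(post_tag_set), 2))
    if not pairs:
        return 0.0  # assumption: a single-tag post has no blur
    total = sum(
        1.0 / (similarity(t1, t2) + eps) - 1.0 / (1.0 + eps)
        for t1, t2 in pairs
    )
    return total / len(pairs)  # Z = number of tag pairs


# Hypothetical pairwise similarities between tags.
toy_similarity = {
    frozenset(("web", "tech")): 0.8,
    frozenset(("web", "pills")): 0.0,
    frozenset(("tech", "pills")): 0.0,
}


def sigma(t1, t2):
    return toy_similarity[frozenset((t1, t2))]


focused = tag_blur({"web", "tech"}, sigma)
blurred = tag_blur({"web", "pills"}, sigma)
print(focused < blurred)  # True: unrelated tags give a higher score
```

Each term is zero when two tags are maximally similar and grows as similarity approaches zero, so posts that mix unrelated tags score higher.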

slide-12
SLIDE 12

[Figure: panels labeled "random" and "τ = 10^-4"; venue labels "WWW 2009" and "HT 2009"]

slide-13
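SLIDE 13

DomFp

$$f_{\mathrm{DomFp}}(r) = \frac{\sum_{k \in K} \sigma(k(r), k) \cdot \Pr(k)}{\sum_{k \in K} \sigma(k(r), k)}$$

where k(r) is the DOM fingerprint of resource r, K is the set of labeled fingerprints, σ is the shingle similarity between two fingerprints, and Pr(k) is the spam probability of fingerprint k.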
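The DomFp score is a similarity-weighted average of the spam probabilities of known fingerprints. The sketch below makes two assumptions not stated on the slide: a fingerprint is represented as a sequence of DOM tag names, and shingle similarity is the Jaccard overlap of w-token shingle sets. The labeled fingerprints and their probabilities are hypothetical.

```python
def shingles(tokens, w=3):
    """Set of w-token shingles (contiguous windows) of a token sequence."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def shingle_similarity(fp_a, fp_b, w=3):
    """Assumption: Jaccard overlap of shingle sets stands in for sigma."""
    a, b = shingles(fp_a, w), shingles(fp_b, w)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dom_fp(fingerprint, labeled, w=3):
    """Similarity-weighted average of Pr(k) over labeled fingerprints K."""
    weighted = [(shingle_similarity(fingerprint, k, w), p) for k, p in labeled]
    norm = sum(s for s, _p in weighted)
    if norm == 0:
        return 0.0  # assumption: no similar known fingerprint scores 0
    return sum(s * p for s, p in weighted) / norm


# Hypothetical labeled fingerprints: (tag sequence, spam probability).
labeled = [
    (("html", "body", "div", "a", "a", "a"), 0.9),
    (("html", "body", "div", "p", "p", "img"), 0.1),
]
page = ("html", "body", "div", "a", "a", "a")
print(round(dom_fp(page, labeled), 3))  # 0.8
```

The page matches the first fingerprint exactly and shares one of seven shingles with the second, so its score is pulled strongly toward 0.9.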

slide-14
SLIDE 14

NumAds

$$f_{\mathrm{NumAds}}(r) = \frac{g(r)}{g_{\max}}$$

where g(r) is the number of ads on resource r.
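The NumAds feature is a raw count scaled by a maximum. In the sketch below, the ad counts are hypothetical, and taking g_max as the largest observed count is an assumption; the slide does not say how g_max is chosen.

```python
def normalized_count(count, max_count):
    """Scale a raw count into [0, 1] by dividing by a maximum."""
    if max_count <= 0:
        raise ValueError("max_count must be positive")
    return min(count, max_count) / max_count


# Hypothetical ad counts g(r) per resource; g_max is taken here as the
# largest observed count, which is an assumption.
ads = {"wired.com": 4, "cnn.com": 6, "spam.example": 24}
g_max = max(ads.values())
num_ads = {r: normalized_count(g, g_max) for r, g in ads.items()}
print(num_ads["cnn.com"])  # 6 / 24 = 0.25
```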

slide-15
SLIDE 15

Plagiarism

$$f_{\mathrm{Plagiarism}}(r) = \frac{y(r)}{y_{\max}}$$

where y(r) is the number of more authoritative sources matching an exact phrase.

slide-16
SLIDE 16

ValidLinks

$$f_{\mathrm{ValidLinks}}(u) = \frac{|V_u|}{|R_u|}$$

where |V_u| is the number of valid links and |R_u| is the total number of links of user u.
slide-17
SLIDE 17

• BibSonomy.org
• Spam is labeled at user level
  • Aggregate features to the user level
• Sampled 1000 users
  • 500 spammers
• 500 users each in the training set and the test set
  • 250 spammers each

$$f(u) = \frac{1}{|P(u)|} \sum_{(u, r) \in P(u)} f(u, r)$$

where P(u) is the set of posts of user u.
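The user-level aggregation on this slide averages a post-level score over all of a user's posts. A minimal sketch, with hypothetical post scores and a hypothetical function name:

```python
from collections import defaultdict


def aggregate_to_users(post_scores):
    """Average post-level scores f(u, r) over each user's posts P(u)."""
    by_user = defaultdict(list)
    for (user, _resource), score in post_scores.items():
        by_user[user].append(score)
    return {user: sum(s) / len(s) for user, s in by_user.items()}


# Hypothetical post-level feature values keyed by (user, resource).
post_scores = {
    ("alice", "cnn.com"): 0.1,
    ("alice", "www2009.org"): 0.2,
    ("bob", "wired.com"): 0.8,
}
user_scores = aggregate_to_users(post_scores)
print(round(user_scores["alice"], 2))  # (0.1 + 0.2) / 2 = 0.15
```

Since spam labels exist only per user, this step is what lets post-level features be compared against those labels.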

slide-18
SLIDE 18

[Figure: bar chart for Plagiarism, NumAds, ValidLinks, DomFp, TagBlur, and TagSpam; axes "chi-squared" (ticks 0 to 450) and "Pearson correlation" (ticks 0 to 1); legend: Pearson, chi-squared]

slide-19
SLIDE 19

[Figure: ROC curves, true positive rate (tp) vs. false positive rate (fp), for TagSpam, TagBlur, DomFp, ValidLinks, NumAds, and Plagiarism]
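The area under an ROC curve like the ones on this slide can be computed directly from feature values and user labels. The sketch below uses the rank interpretation of AUC (the probability that a randomly chosen spammer scores higher than a randomly chosen legitimate user, with ties counted as one half). The scores and labels are hypothetical, and treating spammers as the positive class is an assumption.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (spammer, legitimate) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical user-level feature values; label 1 marks a spammer.
scores = [0.9, 0.4, 0.5, 0.2]
labels = [1, 1, 0, 0]
print(roc_auc(scores, labels))  # 3 of 4 pairs ranked correctly: 0.75
```

This pairwise form is quadratic in the number of users, which is acceptable for a sample of a few hundred; a sort-based version would scale better.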

slide-20
SLIDE 20

[Figure: ROC AUC per feature, axis 0 to 1. Features as extracted: TagSpam, TagBlur, DomFp, ValidLinks, NumAds, Plagiarism; bar values as extracted: 0.99, 0.78, 0.86, 0.64, 0.7, 0.65]

slide-21
SLIDE 21
slide-22
SLIDE 22

Features        linear SVM                AdaBoost
                Accuracy   FP     F1      Accuracy   FP     F1
TagSpam         95.82%     .061   .957    94.66%     .048   .943
+ TagBlur       96.75%     .048   .966    96.06%     .044   .958
+ DomFp         96.75%     .048   .966    96.06%     .044   .958
+ ValidLinks    96.52%     .048   .964    96.75%     .026   .965
+ NumAds        96.52%     .048   .964    97.22%     .026   .970
+ Plagiarism    96.75%     .048   .966    98.38%     .022   .983
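The three columns reported for each classifier can be derived from predicted and true labels. The sketch below assumes spammers are the positive class and reads FP as the false-positive rate, FP / (FP + TN); the labels are hypothetical and unrelated to the reported results.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, false-positive rate, and F1 with spam (1) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "fp_rate": fp_rate, "f1": f1}


# Hypothetical labels: 4 spammers and 4 legitimate users.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(classification_metrics(y_true, y_pred))
# {'accuracy': 0.75, 'fp_rate': 0.25, 'f1': 0.75}
```

Reporting the false-positive rate alongside accuracy matters here because a false positive means a legitimate user is flagged as a spammer.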

slide-23
SLIDE 23

[Figure: percent correctly classified (94 to 99) vs. number of features (1 to 6), for linear SVM and AdaBoost]

slide-24
SLIDE 24

• Web/Email Spam
  • Attenberg and Suel 2008, Gyöngyi et al. 2004
• Social Spam
  • Heymann et al. 2007
• Spam Detection
  • Krause et al. 2008, Caverlee et al. 2008, Benevenuto et al. 2008, Koutrika et al. 2007
• ECML PKDD Discovery Challenge 2008
  • Held by BibSonomy team
  • Gkanogiannis and Kalamboukis 2008, Chevalier and Gramme 2008, Kim and Hwang 2008

slide-25
SLIDE 25

• Identified/analyzed 6 features for spam detection
• TagSpam alone achieves 0.99 ROC AUC
  • Outperforming ECML PKDD Discovery Challenge 2008
• Accuracy over 98% with AdaBoost
  • False-positive rate: 0.022
  • These results set the state of the art
    • could improve by combining with other features, e.g. Krause et al. 2008
• Limitations
  • Efficiency issues
  • Bootstrap issues

slide-26
SLIDE 26

TagSpam
  • Depends on a set of labeled tags

TagBlur
  • Depends on a notion of similarity/distances
  • Assumes spam does not dominate the folksonomy, affecting distances

DomFp
  • Depends on a set of labeled fingerprints
  • Requires page download

NumAds
  • Requires page download

Plagiarism
  • Requires page download
  • Search engine cooperation

ValidLinks
  • HEAD request per resource

slide-27
SLIDE 27

Benjamin Markines Ciro Cattuto Filippo Menczer

ISI Foundation

Features
  • Folksonomy description
  • Feature descriptions
    • post level
    • resource level
    • user level

Feature analysis

Social spam detection

BibSonomy Team http://www.bibsonomy.org