
Benjamin Markines, Ciro Cattuto, Filippo Menczer; ISI Foundation



  1. Benjamin Markines, Ciro Cattuto, Filippo Menczer; ISI Foundation

  2. Beneficiaries: Spammer; Intermediary. Non-beneficiaries: Systems (search engines, tagging); Information consumers; Original authors. Unclear ("?"): Surfer; Advertisers.

  3. Outline: Features (folksonomy description; feature descriptions at post level, resource level, and user level); Feature analysis (dataset description); Social spam detection.

  4. Folksonomy example. Users: alice, bob. Tags: web, tech, news. Resources: www2009.org, wired.com, cnn.com. A folksonomy is $F = (U, T, R, Y)$ with $Y \subseteq U \times T \times R$ (the triples).
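
A minimal sketch of the folksonomy $F = (U, T, R, Y)$ as a set of (user, tag, resource) triples. The names on the slide are reused, but the specific annotations in `Y` are hypothetical (the slide does not say who tagged what), and the helper names `post_tags` and `posts` are illustrative only.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    user: str
    tag: str
    resource: str


# Hypothetical annotations: the slide lists users, tags, and resources,
# but not which user applied which tag to which resource.
Y = {
    Triple("alice", "web", "www2009.org"),
    Triple("alice", "tech", "wired.com"),
    Triple("bob", "tech", "wired.com"),
    Triple("bob", "news", "wired.com"),
    Triple("bob", "news", "cnn.com"),
}

# U, T, R are taken here as the projections of Y.
U = {y.user for y in Y}
T = {y.tag for y in Y}
R = {y.resource for y in Y}


def post_tags(triples, u, r):
    """T(u, r): the set of tags that user u assigned to resource r."""
    return {y.tag for y in triples if y.user == u and y.resource == r}


def posts(triples, u):
    """P(u): the (user, resource) pairs annotated by user u."""
    return {(y.user, y.resource) for y in triples if y.user == u}


bob_wired = post_tags(Y, "bob", "wired.com")  # {"tech", "news"}
```

A post is one (user, resource) pair together with its tags, which is the unit the post-level features operate on.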

  5. TagSpam: the average spam probability $\Pr(t)$ of the post tags $T(u, r)$, for user $u$ and resource $r$:

     $$f_{\text{TagSpam}}(u, r) = \frac{1}{|T(u, r)|} \sum_{t \in T(u, r)} \Pr(t)$$
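
The TagSpam average can be sketched as follows. The per-tag spam probabilities are assumed to be given as a dictionary; the values below, the `default` for unseen tags, and the choice to score an empty post as 0 are all assumptions for illustration, not taken from the slides.

```python
def tag_spam(post_tags, spam_prob, default=0.0):
    """Mean spam probability Pr(t) over the tags T(u, r) of one post.

    spam_prob maps tag -> Pr(t). Tags missing from the map get `default`
    (an assumption; the slide does not cover unseen tags).
    """
    tags = set(post_tags)
    if not tags:
        return 0.0  # assumption: an untagged post carries no evidence
    return sum(spam_prob.get(t, default) for t in tags) / len(tags)


# Hypothetical tag probabilities.
spam_prob = {"free": 0.9, "money": 0.8, "web": 0.1}
score = tag_spam({"free", "web"}, spam_prob)  # (0.9 + 0.1) / 2
```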

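  6. TagBlur: where $Z$ is the number of tag pairs in the post and $\sigma(t_1, t_2)$ is the tag similarity:

     $$f_{\text{TagBlur}}(u, r) = \frac{1}{Z} \sum_{t_1 \neq t_2 \in T(u, r)} \left[ \frac{1}{\sigma(t_1, t_2) + \epsilon} - \frac{1}{1 + \epsilon} \right]$$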
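
A sketch of TagBlur under stated assumptions: the similarity $\sigma$ is passed in as a symmetric function with values in [0, 1], the value of $\epsilon$ is a placeholder (the slide gives none), and a post with fewer than two tags is scored 0. Because $\sigma$ is assumed symmetric, averaging over unordered pairs gives the same result as averaging over ordered pairs.

```python
from itertools import combinations


def tag_blur(post_tags, sim, eps=0.1):
    """Average of 1/(sim + eps) - 1/(1 + eps) over the tag pairs of a post.

    sim(t1, t2) is assumed symmetric with values in [0, 1]; eps is a
    hypothetical smoothing constant that keeps the first term finite.
    """
    pairs = list(combinations(sorted(set(post_tags)), 2))
    if not pairs:
        return 0.0  # assumption: no pairs means no blur
    total = sum(1.0 / (sim(t1, t2) + eps) - 1.0 / (1.0 + eps)
                for t1, t2 in pairs)
    return total / len(pairs)  # Z = number of tag pairs


focused = tag_blur({"python", "code", "dev"}, lambda a, b: 1.0)
blurred = tag_blur({"python", "casino", "pills"}, lambda a, b: 0.0)
```

Each term is 0 when two tags are maximally similar and grows as they become unrelated, so a post whose tags have little to do with each other scores high.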

  7. [Figure; surviving labels: random, $\tau = 10^{-4}$, WWW 2009, HT 2009.]

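  8. DomFp: where $k(r)$ is the DOM shingle fingerprint of resource $r$, $K$ is the set of known fingerprints, $\sigma$ is the similarity, and $\Pr(k)$ is the spam probability of fingerprint $k$:

     $$f_{\text{DomFp}}(r) = \frac{\sum_{k \in K} \sigma(k(r), k) \cdot \Pr(k)}{\sum_{k \in K} \sigma(k(r), k)}$$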
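
The DomFp formula is a similarity-weighted average of spam probabilities over the known fingerprints. In the sketch below a fingerprint is modelled as a set of shingles and $\sigma$ is Jaccard similarity; both are stand-in assumptions, since the slide does not specify how fingerprints are represented or compared. The shingle strings and probabilities are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two shingle sets (stand-in for sigma)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dom_fp(fingerprint, labeled, sim=jaccard):
    """Similarity-weighted mean of Pr(k) over labeled fingerprints K.

    labeled is a list of (fingerprint_k, spam_probability_k) pairs.
    """
    weighted = [(sim(fingerprint, k), p) for k, p in labeled]
    total = sum(w for w, _ in weighted)
    if total == 0:
        return 0.0  # assumption: no similar fingerprint, no evidence
    return sum(w * p for w, p in weighted) / total


# Hypothetical DOM shingles and labels.
page = frozenset({"html>body", "body>div", "div>a"})
K = [
    (frozenset({"html>body", "body>div", "div>a"}), 1.0),  # known spam
    (frozenset({"html>body", "body>p"}), 0.0),             # known legitimate
]
score = dom_fp(page, K)  # similarities 1.0 and 0.25
```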

  9. NumAds: where $g(r)$ is the number of ads on resource $r$:

     $$f_{\text{NumAds}}(r) = \frac{g(r)}{g_{\max}}$$

  10. Plagiarism: where $y(r)$ is the number of more authoritative sources matching an exact phrase from resource $r$:

      $$f_{\text{Plagiarism}}(r) = \frac{y(r)}{y_{\max}}$$
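
NumAds and Plagiarism share the same shape: a raw count divided by a maximum. A sketch, assuming the maximum is taken over the resources in the collection at hand (the slides do not say how $g_{\max}$ and $y_{\max}$ are obtained); the counts and the `.example` hosts are hypothetical.

```python
def normalize_by_max(counts):
    """Map each resource's raw count to count / max(count)."""
    peak = max(counts.values(), default=0)
    if peak == 0:
        return {r: 0.0 for r in counts}  # nothing to normalize against
    return {r: c / peak for r, c in counts.items()}


# Hypothetical ad counts g(r) per resource.
ads = {"a.example": 0, "b.example": 3, "c.example": 12}
num_ads = normalize_by_max(ads)
```

The same function applies to $y(r)$ once the counts of matching authoritative sources are available.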

  11. ValidLinks: the number of valid links $|V_u|$ over the total number of links $|R_u|$ of user $u$:

      $$f_{\text{ValidLinks}}(u) = \frac{|V_u|}{|R_u|}$$
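
A sketch of the ValidLinks ratio using one HTTP HEAD request per link, via the standard library. What counts as "valid" is an assumption here (a response with status below 400 after redirects), as are the timeout and the score of 0 for a user with no links. The checker is injectable so the ratio can be computed without network access.

```python
import urllib.error
import urllib.request


def is_valid(url, timeout=5.0):
    """True if a HEAD request succeeds (assumed definition of 'valid')."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # HTTP errors, malformed URLs, DNS failures, and timeouts.
        return False


def valid_links(urls, check=is_valid):
    """|V_u| / |R_u| for the links R_u posted by one user."""
    urls = list(urls)
    if not urls:
        return 0.0  # assumption: no links, no score
    return sum(1 for url in urls if check(url)) / len(urls)


# Offline usage with a stubbed checker; the URLs are placeholders.
alive = {"http://a.example/", "http://b.example/", "http://c.example/"}
ratio = valid_links(
    ["http://a.example/", "http://b.example/",
     "http://c.example/", "http://gone.example/"],
    check=lambda url: url in alive,
)
```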

  12. Dataset: BibSonomy.org. Spam is labeled at the user level, so post-level features are aggregated per user over the user's posts $P(u)$:

      $$f(u) = \frac{1}{|P(u)|} \sum_{(u, r) \in P(u)} f(u, r)$$

      Sampled 1000 users, 500 of them spammers; 500 users in each of the training and test sets, 250 of them spammers.
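
The user-level aggregation is a plain mean of a post-level feature over each user's posts. A sketch, with hypothetical post scores keyed by (user, resource):

```python
from collections import defaultdict


def aggregate_by_user(post_scores):
    """f(u): the mean of f(u, r) over the posts P(u) of each user.

    post_scores maps (user, resource) -> post-level feature value.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (user, _resource), value in post_scores.items():
        sums[user] += value
        counts[user] += 1
    return {user: sums[user] / counts[user] for user in sums}


# Hypothetical post-level scores.
post_scores = {
    ("alice", "www2009.org"): 0.2,
    ("alice", "wired.com"): 0.4,
    ("bob", "cnn.com"): 0.9,
}
user_scores = aggregate_by_user(post_scores)
```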

  13. [Figure: Pearson correlation (axis 0 to 1) and chi-squared (axis 0 to 450) for each feature: TagSpam, TagBlur, DomFp, ValidLinks, NumAds, Plagiarism. Values are not recoverable from the extraction.]

  14. [Figure: ROC curves, true positive rate (tp) against false positive rate (fp), for TagSpam, TagBlur, DomFp, ValidLinks, NumAds, and Plagiarism.]
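
For a single feature used as a score, the area under the ROC curve equals the probability that a randomly chosen spammer scores higher than a randomly chosen legitimate user, with ties counted as one half. The sketch below computes it that way; this is a standard equivalence, not necessarily how the authors computed their curves, and the scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical user-level feature values; label 1 marks a spammer.
scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 0, 1, 0]
auc = roc_auc(scores, labels)  # 3 of the 4 pairs are ordered correctly
```

This pairwise form is quadratic in the number of users, which is fine at the scale of a 500-user test set.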

  15. [Figure: bar chart of a per-feature score for TagSpam, TagBlur, DomFp, ValidLinks, NumAds, and Plagiarism. Values are not recoverable from the extraction.]

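  16. Classification results as features are added cumulatively (FP = false positive rate):

      | Features     | linear SVM Accuracy | linear SVM FP | linear SVM F1 | AdaBoost Accuracy | AdaBoost FP | AdaBoost F1 |
      |--------------|---------------------|---------------|---------------|-------------------|-------------|-------------|
      | TagSpam      | 95.82%              | .061          | .957          | 94.66%            | .048        | .943        |
      | + TagBlur    | 96.75%              | .048          | .966          | 96.06%            | .044        | .958        |
      | + DomFp      | 96.75%              | .048          | .966          | 96.06%            | .044        | .958        |
      | + ValidLinks | 96.52%              | .048          | .964          | 96.75%            | .026        | .965        |
      | + NumAds     | 96.52%              | .048          | .964          | 97.22%            | .026        | .970        |
      | + Plagiarism | 96.75%              | .048          | .966          | 98.38%            | .022        | .983        |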
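
The three reported quantities (accuracy, false positive rate, F1) all follow from the confusion counts of a binary classifier. A sketch using the standard definitions, with spammers as the positive class; the labels and predictions below are hypothetical and unrelated to the numbers in the table.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, false positive rate, and F1 from binary labels (1 = spam)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # Share of legitimate users wrongly flagged as spammers.
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    # Harmonic mean of precision and recall, written in count form.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"accuracy": accuracy, "fp_rate": fp_rate, "f1": f1}


# Hypothetical test-set labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
m = classification_metrics(y_true, y_pred)
```

In a spam filter the false positive rate matters most to legitimate users, since each false positive is a real user treated as a spammer.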

  17. [Figure: percent correctly classified (axis 94 to 99) against number of features (1 to 6), for linear SVM and AdaBoost.]

  18. Related work. Web/email spam: Attenberg and Suel 2008; Gyöngyi et al. 2004. Social spam: Heymann et al. 2007. Spam detection: Krause et al. 2008; Caverlee et al. 2008; Benevenuto et al. 2008; Koutrika et al. 2007. ECML PKDD Discovery Challenge 2008 (held by the BibSonomy team): Gkanogiannis and Kalamboukis 2008; Chevalier and Gramme 2008; Kim and Hwang 2008.

  19. Conclusions. Identified and analyzed 6 features for spam detection. TagSpam alone achieves 0.99 ROC AUC, outperforming results from the ECML PKDD Discovery Challenge 2008. Accuracy is over 98% with AdaBoost, at a false positive rate of 0.022. These results set the state of the art, and could improve further by combining with other features, e.g. those of Krause et al. 2008. Limitations: efficiency issues; bootstrap issues.

  20. Limitations by feature. TagSpam: depends on a set of labeled tags. TagBlur: depends on a notion of similarity/distance; assumes spam does not dominate the folksonomy, which would affect distances. DomFp: depends on a set of labeled fingerprints; requires a page download. NumAds: requires a page download. Plagiarism: requires a page download and search engine cooperation. ValidLinks: one HEAD request per resource.

  21. Benjamin Markines, Ciro Cattuto, Filippo Menczer; BibSonomy Team (http://www.bibsonomy.org); ISI Foundation. Outline: Features (folksonomy description; feature descriptions at post level, resource level, and user level); Feature analysis; Social spam detection.
