Tag Spam Creates Large Non-Giant Connected Components

Nicolas Neubauer (1), Robert Wetzker (2) & Klaus Obermayer (1)
Neural Information Processing Group (1), DAI Lab (2), Technische Universität Berlin
AIRWeb@WWW’09, 21.4.2009

Overview
1. Spam in Social Bookmarking Systems
2. Hyperincident Connected Components
3. Document/User and Tag/User Graphs
4. Conclusions

Social Tagging

Edges: Top 2000 similarities between top 800 documents (no spam) - Bibsonomy

• Some tag spam targets search engines
• Top entry for a given tag might indicate relevance
• Other tag spam targets users
• Sites with certain tags might lure users into visiting them
• Spammers behave so radically differently that it shows in the resulting network structures
Edges: Top 2000 similarities between top 800 documents (spam) - Bibsonomy

Hyperincident Connectivity
• Tagging data can be interpreted as a hypergraph, defined by hyperedges (d,u,t) for a document d being tagged with tag t by a user u
• Two edges are incident if they share a node (i.e., d, u, or t)
– In all examined datasets, everything was basically connected to everything
• Definition: Two edges are 2-hyperincident if they share at least two nodes
• 2-hyperincident connected components: components of edges between which paths of 2-hyperincident edges exist
Blue, dotted lines indicate incident edges
Blue, dotted lines indicate 2-hyperincident edges
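The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `two_hyperincident_components`, the union-find approach, and the toy edges are all assumptions made for the example. It relies on the observation that two hyperedges share at least two nodes exactly when they agree on one of the pairs (d,u), (d,t) or (u,t).

```python
from collections import defaultdict


def two_hyperincident_components(edges):
    """Group hyperedges (d, u, t) into 2-hyperincident connected components.

    Hypothetical helper for illustration; returns lists of edges,
    largest component first.
    """
    edges = list(dict.fromkeys(edges))  # drop duplicate edges, keep order
    parent = list(range(len(edges)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # Bucket edges by each node pair they contain; all edges in one bucket
    # share two nodes and are therefore pairwise 2-hyperincident.
    buckets = defaultdict(list)
    for idx, (d, u, t) in enumerate(edges):
        buckets[("du", d, u)].append(idx)
        buckets[("dt", d, t)].append(idx)
        buckets[("ut", u, t)].append(idx)

    for members in buckets.values():
        for other in members[1:]:
            union(members[0], other)

    components = defaultdict(list)
    for idx, edge in enumerate(edges):
        components[find(idx)].append(edge)
    return sorted(components.values(), key=len, reverse=True)


# Toy data (made up): the last edge shares only tag t1 with the first one,
# so it is incident to it but not 2-hyperincident.
toy_edges = [
    ("d1", "u1", "t1"),
    ("d1", "u1", "t2"),
    ("d2", "u1", "t2"),
    ("d3", "u2", "t1"),
]
toy_components = two_hyperincident_components(toy_edges)
```

Bucketing by node pairs avoids comparing every pair of edges, which would be quadratic in the number of tag assignments.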

Expanding 2-hyperincident edges around a user
• Starting from a legitimate user, we had to stop at a limit of discovered nodes (here: 2000)
• Starting from spam users, we often found closed sets of connected nodes
• We did not find such components for legitimate users
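The expansion described on this slide might look roughly like the breadth-first search below. It is a sketch under assumptions: the slide does not say how the expansion was implemented, so the function name `expand_from_user`, the choice to start from all of the user's edges, and the exact stopping rule (stop once more than `node_limit` nodes are discovered) are illustrative only.

```python
from collections import defaultdict, deque


def expand_from_user(edges, user, node_limit=2000):
    """Follow 2-hyperincident edges outward from one user's hyperedges.

    Returns (nodes, closed). closed is True when the expansion ran out of
    new edges before exceeding node_limit, i.e. a closed set was found.
    """
    pair_index = defaultdict(list)  # node pair -> edges containing it
    by_user = defaultdict(list)     # user -> that user's edges
    for edge in set(edges):
        d, u, t = edge
        by_user[u].append(edge)
        for key in (("du", d, u), ("dt", d, t), ("ut", u, t)):
            pair_index[key].append(edge)

    queue = deque(by_user[user])
    seen_edges = set(queue)
    nodes = set()
    while queue:
        d, u, t = queue.popleft()
        # Prefix node names with their kind so a tag and a user with the
        # same name stay distinct.
        nodes.update({("doc", d), ("user", u), ("tag", t)})
        if len(nodes) > node_limit:
            return nodes, False  # limit exceeded, set not known to be closed
        for key in (("du", d, u), ("dt", d, t), ("ut", u, t)):
            for neighbour in pair_index[key]:
                if neighbour not in seen_edges:
                    seen_edges.add(neighbour)
                    queue.append(neighbour)
    return nodes, True


# Made-up toy edges for illustration.
sample_edges = [
    ("d1", "u1", "t1"),
    ("d1", "u1", "t2"),
    ("d2", "u1", "t2"),
    ("d3", "u2", "t1"),
]
```

With the default limit, a small isolated group of edges is reported as closed, while a tiny `node_limit` forces an early stop in the same way the 2000-node limit did for legitimate users.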

Distribution of Component Sizes
x = number of components of size y (log/log)
Neubauer & Obermayer: Hyperincident Connected Components of Tagging Networks, HyperText 2009, in press

Distribution of Large Components' Sizes
x = rank of component, y = number of edges in component

Spam Detection
• Users in nlc/gcc are likely to be spammers/non-spammers
• Are spammers/non-spammers also likely to live in nlc/gcc?
– Yes, although many users from both classes do neither.
• Simple classification heuristic:
– If user is only in nlc -> spam = 1
– If user is only in gcc -> spam = 0
– otherwise -> spam = 0.5
– Note that users can be in more than one component
• Area under ROC curve (AUC - balanced accuracy) of .73
Distribution of users over components
ROC curve of simple classifier
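The three-valued heuristic and its evaluation can be sketched as follows. The names `spam_score` and `auc`, the component identifiers, and the four toy users are assumptions for illustration; the slide does not specify how component membership was represented, and the toy numbers are unrelated to the reported AUC of .73.

```python
def spam_score(components_of_user, nlc, gcc):
    """Score one user from the set of components the user's edges fall in.

    nlc: set of identifiers of the nlc components (assumed representation)
    gcc: identifier of the gcc component (assumed representation)
    """
    comps = set(components_of_user)
    if comps and comps <= nlc:
        return 1.0   # user is only in nlc
    if comps == {gcc}:
        return 0.0   # user is only in gcc
    return 0.5       # in both, or in neither


def auc(scores, labels):
    """Area under the ROC curve as the probability that a random spammer
    scores higher than a random non-spammer (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up users: component memberships and spam labels.
memberships = [{"n1"}, {"g"}, {"g", "n1"}, {"s7"}]
labels = [1, 0, 1, 0]
scores = [spam_score(m, nlc={"n1", "n2"}, gcc="g") for m in memberships]
toy_auc = auc(scores, labels)
```

Because the heuristic produces only three distinct scores, ties are frequent, which is why the AUC computation counts them as half a win.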

Largest and Next-largest 2-HCC for one Month of Delicious Tags

Doubting Hyper-Incident Connectivity
• “Nice result, but probably mostly based on documents”
– Short story: Right.
– Long story: Tags do have a bit of influence here.
• Question: What happens if we examine connectivity on the document/user graph, i.e. edges = (d,u) for (d,u,t) in the hypergraph?
– And what happens if we do the same for the tag/user graph?
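The projection asked about here, edges = (d,u) for each (d,u,t), can be sketched together with an ordinary connected-components search. The helper names `project` and `connected_components` and the toy edges are assumptions for illustration, not taken from the slides.

```python
from collections import defaultdict


def project(edges, keep):
    """Project hyperedges (d, u, t) onto two of their node types.

    keep is a two-letter string such as "du" (document/user graph) or
    "tu" (tag/user graph). Nodes are prefixed with their type so names
    from different types cannot collide.
    """
    index = {"d": 0, "u": 1, "t": 2}
    i, j = index[keep[0]], index[keep[1]]
    return {((keep[0], e[i]), (keep[1], e[j])) for e in edges}


def connected_components(pairs):
    """Connected components of an undirected graph given as node pairs,
    returned as sets of nodes, largest first."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in adjacency[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Made-up toy edges for illustration.
hyperedges = [
    ("d1", "u1", "t1"),
    ("d1", "u1", "t2"),
    ("d2", "u1", "t2"),
    ("d3", "u2", "t1"),
]
doc_user_components = connected_components(project(hyperedges, "du"))
tag_user_components = connected_components(project(hyperedges, "tu"))
```

In this toy example the document/user projection keeps the two users apart, while the shared tag t1 merges everything in the tag/user projection into a single component.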

Connectivity Structure (Bibsonomy)
• We see that the distribution of component sizes in the user/document graph closely resembles the one found in the entire hypergraph
• The tag/document graph is mostly connected

User Distribution
• Accordingly, membership information on the user/document graph is comparably informative, while the tag/document graph is useless

Spam Detection
• New spam detection experiments:
– applied above heuristic on document/user graph (red) compared to original approach (black)
– new heuristic (blue): new maximum spam score for users being in nlc in both graphs
– also examined effect of #documents/user
• Results:
– Hypergraph and document/user graph connectivity provide similar, but sometimes complementary information
– Entire approach works better when users have more documents
ROC curves for all three heuristics
AUC values
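One possible reading of the combined heuristic is sketched below. The slide only states that users in nlc in both graphs receive a new maximum spam score, so summing the two per-graph scores is an assumption, as are the names `combined_score` and `documents_per_user` and the toy edges.

```python
from collections import defaultdict


def combined_score(hypergraph_score, doc_user_score):
    """Combine the per-graph heuristic scores (each 0, 0.5 or 1).

    Assumption: a plain sum, so the new maximum of 2.0 is reached only
    when the user is only in nlc in both graphs.
    """
    return hypergraph_score + doc_user_score


def documents_per_user(edges):
    """Count distinct documents per user from hyperedges (d, u, t), for
    example to restrict evaluation to users with enough documents."""
    docs = defaultdict(set)
    for d, u, _t in edges:
        docs[u].add(d)
    return {u: len(ds) for u, ds in docs.items()}


# Made-up toy edges for illustration.
tagging_edges = [
    ("d1", "u1", "t1"),
    ("d1", "u1", "t2"),
    ("d2", "u1", "t2"),
    ("d3", "u2", "t1"),
]
doc_counts = documents_per_user(tagging_edges)
```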

Final Results & Discussion
Requirements (columns: Feature extraction on resources or references; Previous Labels)
Content analysis: X X
Reference analysis: X X
User Similarity: X
Structural Analysis: (no marks)
• Accuracy decreases, but so do domain dependence and requirements on available information
• Addition to other, more specialized approaches
• Stand-alone baseline when more specialized approaches are not available
• Although a large part of connectivity is related to documents, tags do play a subtle role.
• Next: Exploring temporal evolution & even stricter notions of connectivity
