
SLIDE 1

Exploring Linguistic Features for Web Spam Detection A Preliminary Study

Jakub Piskorski (1), Marcin Sydow (2), Dawid Weiss (3)

(1) Joint Research Centre of the European Commission, Ispra, Italy
(2) Web Mining Lab, Polish-Japanese Institute of Information Technology, Warsaw, Poland
(3) Institute of Computing Science, Poznan University of Technology, Poland

SLIDE 2

1. Introduction
2. Computation
3. Preprocessing
4. Attribute pre-Selection
5. Conclusions

SLIDE 3

Background

There has been recent interest in machine-learning approaches to Web spam detection. The main motivations are:

  • complexity: too many factors to consider
  • scale: too much data to analyse by humans
  • need for adaptivity: a dynamic problem (arms race)
SLIDE 4

Previous work on content analysis, etc.

Various content-based factors have already been studied:

  • statistics-based approach (Fetterly et al. ’04)
  • checksums, term weighting (Drost et al. ’05, Ntoulas et al. ’06)
  • blog spam detection by language-model disagreement (Mishne et al. ’05)
  • auto-generated content (Fetterly et al. ’05)
  • HTML structure (Urvoy et al. ’06)
  • commercial attractiveness of keywords (Benczur et al. ’07)

Other dimensions of the data have also been explored: link-based, query-log-based, combined, etc.

What about linguistic analysis of Web documents?
SLIDE 7

Motivation

Linguistic analysis:

  • has not been used before for the Web spam detection problem (except for some corpus-based statistics)
  • proved successful for deception detection in textual human-to-human communication (Zhou et al., “Automating Linguistics-Based Cues for Detecting Deception in Text-Based Asynchronous Computer-Mediated Communication”)

SLIDE 8

Linguistic Analysis

We applied light-weight linguistic analysis to compute new attributes for the Web spam detection problem. Two different NLP software tools were used:

  • Corleone (developed at JRC, Ispra)
  • General Inquirer (www.wjh.harvard.edu/~inquirer)

Why only light-weight analysis?

  • computationally cheap
  • more robust given the open-domain nature of Web documents

General linguistic, document-level analysis without any prior knowledge about the corpus.

SLIDE 9

Contributions

1. the two Yahoo! Web Spam Corpora of human-labelled hosts were taken
2. the two different NLP software tools were applied to them
3. over 200 linguistic-based attributes were computed and made publicly available for further research: http://www.pjwstk.edu.pl/~msyd/linguisticSpamFeatures.html
4. over 1200 histograms were generated and analysed (also available)
5. the most promising attributes were preliminarily selected with the use of 2 different distribution-distance metrics

SLIDE 10

Corleone-based attributes, examples

  • Type:
      Lexical validity = (# of valid word forms) / (# of all tokens)
      Text-like fraction = (# of potential word forms) / (# of all tokens)

  • Diversity:
      Lexical diversity = (# of different tokens) / (# of all tokens)
      Content diversity = (# of different nouns & verbs) / (# of all nouns & verbs)
      Syntactical diversity = (# of different POS n-grams) / (# of all POS n-grams)
      Syntactical entropy = − Σ_{g∈G} p_g · log p_g
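As an illustration, the diversity attributes above can be computed directly from a document's token and POS-tag sequences. This is a minimal sketch, not the Corleone implementation; the function names and the use of the natural logarithm are my own choices.

```python
import math
from collections import Counter

def lexical_diversity(tokens):
    # (# of different tokens) / (# of all tokens)
    return len(set(tokens)) / len(tokens)

def syntactical_diversity(pos_tags, n):
    # (# of different POS n-grams) / (# of all POS n-grams)
    grams = [tuple(pos_tags[i:i + n]) for i in range(len(pos_tags) - n + 1)]
    return len(set(grams)) / len(grams)

def syntactical_entropy(pos_tags, n):
    # − Σ_{g∈G} p_g · log p_g over the POS n-gram distribution
    grams = [tuple(pos_tags[i:i + n]) for i in range(len(pos_tags) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return -sum(c / total * math.log(c / total) for c in counts.values())
```

Highly repetitive (auto-generated) text yields few distinct n-grams, so both diversity and entropy drop, which is what makes these attributes discriminative.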

SLIDE 11

General Inquirer attribute groups

  • ‘Osgood’ semantic dimensions
  • pleasure, pain, virtue and vice
  • overstatement/understatement
  • language of a particular ‘institution’
  • roles, collectivities, rituals, and interpersonal relations
  • references to people/animals
  • processes of communicating
  • valuing of status, honour, recognition and prestige
  • references to locations
  • references to objects
  • cognitive orientation
  • pronoun types
  • negation and interjections
  • verb types
  • adjective types
  • skill categories
  • motivation
  • power
  • rectitude
  • affection
  • wealth
  • well-being
  • enlightenment
SLIDE 12

Computation, input data sets

Map-reduce jobs (Hadoop) for processing (40-CPU cluster).

                                      2006        2007
  pages                          3 396 900  12 533 652
  pages without content             65 948   1 616 853
  pages with HTTP/404              281 875     230 120
  TXT SQF (compressed file, GB)       2.87        8.24

SLIDE 13

Reducing noise

  • Removed binary content-type pages.
  • Different “modes” of page filtering:

(0) < 50k tokens, (1) 150–20k tokens, (2) 400–5k tokens.

[Histogram omitted; series: NON-SPAM, SPAM, UNDECIDED]

Lexical validity for unfiltered input, Corleone, WebSpam-Uk2007.
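The filtering modes above can be sketched as a simple token-count check. The table of bounds comes straight from the slide; the helper name and the inclusive treatment of the bounds are assumptions.

```python
# Token-count ranges for the three page-filtering modes named on the slide.
FILTER_MODES = {
    0: (0, 50_000),    # mode 0: fewer than 50k tokens
    1: (150, 20_000),  # mode 1: 150 to 20k tokens
    2: (400, 5_000),   # mode 2: 400 to 5k tokens
}

def keep_page(token_count, mode):
    # Keep a page only if its token count falls inside the mode's range.
    lo, hi = FILTER_MODES[mode]
    return lo <= token_count <= hi
```

Tighter modes discard both near-empty pages (too few tokens for stable linguistic statistics) and extremely long ones, which reduces the noise visible in the unfiltered histograms.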

SLIDE 14

Reducing noise


[Histogram omitted; series: NON-SPAM, SPAM, UNDECIDED]

Lexical validity for mode-1 filtered input, Corleone, WebSpam-Uk2007.

SLIDE 15

Discriminancy Measures

absDist(h) = (1/200) · Σ_{i∈I} |s^h_i − n^h_i|    (1)

sqDist(h) = (1/|I|) · Σ_{i∈I} (s^h_i / max^h − n^h_i / max^h)²    (2)
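The two measures can be sketched as follows, taking s and n to be the SPAM and NON-SPAM histogram bin values for an attribute h. Treating max^h as the largest bin value over both histograms is my reading of equation (2), not something the slide states explicitly.

```python
def abs_dist(s, n):
    # Equation (1): sum of absolute bin differences, divided by 200.
    return sum(abs(si - ni) for si, ni in zip(s, n)) / 200

def sq_dist(s, n):
    # Equation (2): mean squared difference of max-normalised histograms.
    m = max(max(s), max(n))  # assumption: max^h taken over both histograms
    return sum((si / m - ni / m) ** 2 for si, ni in zip(s, n)) / len(s)
```

Both measures are larger when the spam and non-spam distributions of an attribute differ more, so ranking attributes by them pre-selects the most discriminative ones.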

SLIDE 16

The Most Promising Features (Corleone)

The most discriminating Corleone attributes w.r.t. the absDist and sqDist metrics.

Corleone (absDist)      2007   2006
Passive Voice           0.263  0.273
Syn. Diversity (4g)     0.255  0.245
Content Diversity       0.234  0.331
Syn. Diversity (3g)     0.230  0.253
Pronoun Fraction        0.224  0.261
Syn. Diversity (2g)     0.221  0.232
Lexical Diversity       0.213  0.262
Syn. Entropy (2g)       0.208  0.179
Text-Like Fraction      0.188  0.184

Corleone (sqDist)       2007   2006
Syn. Diversity (4g)     0.053  0.054
Syn. Diversity (3g)     0.050  0.067
Syn. Diversity (2g)     0.037  0.036
Content Diversity       0.032  0.065
Syn. Entropy (2g)       0.029  0.026
Lexical Diversity       0.026  0.043
Lexical Validity        0.024  0.033
Pronoun Fraction        0.024  0.031
Text-Like Fraction      0.023  0.017

SLIDE 17

Corleone, Syntactical diversity mode-1 filtered, 2006 data set

  • 2, 3 and 4-grams
  • different Y scale to illustrate shape
  • increasing skewness of NON-SPAM

[Histograms omitted: SyntacticalDiversity4Grams, SyntacticalDiversity3Grams, SyntacticalDiversity2Grams; series: NON-SPAM, SPAM, UNDECIDED]

SLIDE 18

Corleone, Syntactical diversity mode-1 filtered, 2006 and 2007 data set

  • 4-grams
  • different Y scale to illustrate shape
  • 2006 (left), 2007 (right)
  • results very similar

[Histograms omitted: SyntacticalDiversity4Grams, 2006 (left) and 2007 (right); series: NON-SPAM, SPAM, UNDECIDED]

SLIDE 19

The Most Promising Features (GI)

The most discriminating General Inquirer attributes according to the absDist and sqDist metrics.

GI (absDist)   2007   2006
WltTot         0.287  0.346
WltOth         0.285  0.341
Academ         0.270  0.263
Object         0.255  0.282
EnlTot         0.249  0.247
Econ@          0.228  0.356
SV             0.206  0.260

GI (sqDist)    2007    2006
leftovers      0.0150  0.0128
EnlOth         0.0085  0.0072
EnlTot         0.0082  0.0118
Object         0.0073  0.0086
text-length    0.0056  0.0048
ECON           0.0038  0.0034
Econ@          0.0038  0.0031
WltTot         0.0038  0.0027
WltOth         0.0037  0.0024

SLIDE 20

Leftovers attribute, General Inquirer, mode-1 filtered, 2006 data set:

[Histogram omitted: leftovers.dat; series: NON-SPAM, SPAM, UNDECIDED]

SLIDE 21

Conclusions and Further Work

Positive outcomes:

  • Features showing different characteristics between the normal and spam classes: content diversity, lexical diversity, syntactical diversity, . . .

Limitations and problems:

  • Spam pages generated from legitimate content.
  • Graphical spam (images overlaid over legitimate text).
  • Multi-lingual pages.

Further steps:

  • new attributes should be tested directly in the Web classification task

SLIDE 22

The Data sets

There are 4 data sets available ({’06, ’07} × {Corleone, GI}):

  • the data sets are document-level
  • the assigned labels are host-level
  • for the ’07 corpus the labels are taken from the training set and merged with the ’06 labels
  • easy, line-record, tab-separated ASCII format
  • the histograms are also available
SLIDE 23

Availability of the Data

Data sets:
http://www.pjwstk.edu.pl/~msyd/lingSpamFeatures.html

Enquiries:
msyd@pjwstk.edu.pl
jpiskorski@googlemail.com
dawid.weiss@cs.put.poznan.pl

SLIDE 24

Thank you for your attention.