Opinion Spam Analysis and Detection Leaked Confidential Information - PowerPoint PPT Presentation

Opinion Spam Analysis and Detection: Leaked Confidential Information as Ground Truth
Yu-Ren Chen, Hsin-Hsi Chen
National Taiwan University

What Is Opinion Spam?
● spreading commercially advantageous opinions as regular users on the Internet
  ○ positive opinions about own products/services
  ○ negative opinions about competitors
● also known as ‘review spam’ or ‘shilling’
● undoubtedly unethical
● in most cases illegal
● believed to be widely used in practice
  ○ potentially lucrative profits → strong incentives

‘Spam’ in General
● definition of ‘spam’
  ○ (Wikipedia) the use of electronic messaging systems to send unsolicited bulk messages (spam), especially advertising, indiscriminately
  ○ (Oxford) irrelevant or unsolicited messages sent over the Internet, typically to large numbers of users, for the purposes of advertising, phishing, spreading malware, etc.

Various Kinds of Spam
● email spam
  ○ the most well-known kind of spam
  ○ spam is defined as ‘email spam’ in Merriam-Webster
● search engine spam
  ○ manipulation of search engine indices and rankings
  ○ content spam: keyword stuffing, meta-tag stuffing, scraper sites, article spinning, machine translation
  ○ link spam: link farms, hidden links, expired domains, comment spam, referrer log spam
● social network spam
  ○ bulk messages, malicious links, fake friends, etc.
● many more...

Opinion Spam vs. Other Spam
● very carefully written to avoid getting caught
  ○ deemed fraud → unacceptable
  ○ backfiring may seriously damage the reputation of a brand (or a store, restaurant, etc.)
● initiated by a big brand in our case study (Samsung)
  ○ high stakes
  ○ opinion spammers have to be really careful

Difficulty in Obtaining Dataset
● manual annotation is pretty much useless
  ○ low inter-annotator agreement score (Ott et al. 2011)
  ○ approximated ground truth
    ■ duplicate / near-duplicate reviews (Jindal et al. 2008)
    ■ crowdsourced fake reviews (Ott et al. 2011)
● utilizing confidential internal records in our study
  ○ real record of opinion spam in the real world
  ○ ‘true’ ground truth

Our Case Study - the Samsung ‘Writer-gate’ Incident (三星寫手門事件)
● conducted by ‘鵬泰顧問有限公司’ (Pengtai Consulting Co., Ltd.), a subsidiary company of Samsung
● hired writers and designated employees were instructed to post disingenuous reviews on web forums
● revealed by a hacker known as ‘0xb’
  ○ who made confidential documents from 鵬泰 publicly available on TaiwanSamsungLeaks.org
● Samsung was fined NT$10 million by the Fair Trade Commission in Taiwan

About ‘Mobile01’
● the main battlefield of this marketing campaign
● where the leaked documents were first made public
● one of the most popular websites in Taiwan
  ○ #10 in Alexa traffic rank in Taiwan
● mainly featuring discussion about consumer electronics
  ○ mobile phones, tablets, etc.
● primarily a web forum site
  ○ rather than a product review site such as Amazon
● written in Traditional Chinese for the most part
  ○ rather than in English

Structure of Mobile01 Forum

More about Web Forums
● thread (topic, 討論串)
  ○ collection of posts (from oldest to latest)
  ○ started by specifying the title and the first post
● post (文章)
  ○ first post (original post, thread starter, 一樓, literally ‘first floor’)
  ○ reply (回覆)
    ■ all posts except the first post
    ■ can be used for bumping (手動置頂, 頂)
● hierarchical structure
  ○ forum ⇢ board ⇢ thread ⇢ post

Dataset Collection
● leaked spreadsheets
  ○ HHP-2011.xlsx and HHP-2012.xlsx
  ○ containing URLs to the spam posts
  ○ source of the ground truth in our study
● Mobile01
  ○ contents of the posts
    ■ including spam and non-spam
  ○ various other kinds of information on Mobile01
  ○ all posts from 2011 to 2012 on the SAMSUNG (Android) board
    ■ where the ‘spam density’ is the highest
  ○ all profiles of the posters of such posts
  ○ three SQLite tables: POSTS, PROFILES, THREADS

a snippet of the leaked spreadsheet ‘HHP-2012.xlsx’

an example post on Mobile01

a snippet of the table ‘POSTS’ scraped from Mobile01

an example profile on Mobile01

a snippet of the table ‘PROFILES’ scraped from Mobile01

a snippet of the table ‘THREADS’ scraped from Mobile01

amount of data collected

Looking into the Dataset
● main observations
  ○ subtlety in spam posts
  ○ low spam post ratio of some spammers
  ○ types of spammer accounts
    ■ reputable accounts
    ■ throwaway accounts
  ○ first posts vs. replies in threads
  ○ pattern in submission time of spam posts
  ○ activeness of threads
  ○ collusive activities of spammers

examples of subtly written spam posts

Low Spam Post Ratio of Some Spammers
● our definition of spammer
  ○ poster who had submitted any spam post
● only 33% of the posts from spammers are spam in this dataset
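
The spammer definition and the per-poster spam ratio above can be sketched roughly as follows. This is only an illustration: the `(poster, is_spam)` pairs and the poster names are made-up toy data, not rows from the dataset.

```python
from collections import defaultdict

def spam_ratio_by_poster(posts):
    """Map each poster to (spam_count, total_count, spam_ratio)."""
    counts = defaultdict(lambda: [0, 0])  # poster -> [spam posts, all posts]
    for poster, is_spam in posts:
        counts[poster][0] += int(is_spam)
        counts[poster][1] += 1
    return {p: (s, n, s / n) for p, (s, n) in counts.items()}

def spammers(posts):
    """A spammer is any poster who submitted at least one spam post."""
    return {p for p, (s, _, _) in spam_ratio_by_poster(posts).items() if s > 0}

# Hypothetical toy data: (poster id, post is spam?)
toy_posts = [
    ("writer_a", True), ("writer_a", False), ("writer_a", False),
    ("throwaway_b", True), ("throwaway_b", True),
    ("regular_c", False),
]

ratios = spam_ratio_by_poster(toy_posts)
spammer_ids = spammers(toy_posts)

# Share of spam among all posts written by spammers
spam_by_spammers = sum(ratios[p][0] for p in spammer_ids)
posts_by_spammers = sum(ratios[p][1] for p in spammer_ids)
overall_spam_share = spam_by_spammers / posts_by_spammers
```

In the toy data, `writer_a` counts as a spammer even though only one of three posts is spam, which is the situation the 33% figure describes.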

Different Types of Spammer Accounts
● reputable accounts
  ○ hired reputable writers
  ○ low spam ratio
● throwaway accounts
  ○ registered in batch
  ○ high spam ratio
  ○ low number of threads initiated
● others
  ○ hired non-reputable posters?
  ○ borrowed accounts?

pattern in registration time

First Posts vs. Replies
● first post
  ○ initiating the thread
  ○ richer in content
  ○ higher spam ratio
● reply (2nd, 3rd and later posts in the thread)
  ○ usually quite concise

Submission Time of Spam Posts
● hypothesis: spam posts are more often made during work time than normal posts
  ○ because spamming is a job rather than a leisure activity
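
One way to check this hypothesis is to compare the share of posts submitted during work time for spam and normal posts. The sketch below assumes work time means Monday to Friday, 09:00 to 18:00; that window and all timestamps are assumptions for illustration, not values from the slides.

```python
from datetime import datetime

WORK_START, WORK_END = 9, 18  # assumed office hours, not taken from the slides

def in_work_time(ts: datetime) -> bool:
    """True for Monday-Friday timestamps within the assumed office hours."""
    return ts.weekday() < 5 and WORK_START <= ts.hour < WORK_END

def work_time_share(timestamps):
    """Fraction of timestamps that fall in work time."""
    timestamps = list(timestamps)
    return sum(in_work_time(t) for t in timestamps) / len(timestamps)

# Hypothetical submission times
spam_times = [
    datetime(2012, 3, 5, 10, 30),   # Monday morning
    datetime(2012, 3, 6, 15, 0),    # Tuesday afternoon
    datetime(2012, 3, 7, 22, 15),   # Wednesday night
]
normal_times = [
    datetime(2012, 3, 5, 21, 0),    # Monday night
    datetime(2012, 3, 10, 14, 0),   # Saturday afternoon
    datetime(2012, 3, 8, 11, 45),   # Thursday morning
]

spam_share = work_time_share(spam_times)
normal_share = work_time_share(normal_times)
```

A higher `spam_share` than `normal_share` on the real data would be consistent with the hypothesis.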

Activeness of Threads
● threads started by spam first posts are expected to be more active
  ○ written to draw attention and exposure
● measuring the ‘activeness’ of a thread
  ○ number of posts in the thread
  ○ number of clicks on the thread

Collusion between Spammers
● different spam accounts submit spam posts to the same thread
  ○ fabricating the majority opinion
  ○ bumping the same spam thread
● 67% of the spam posts are in threads containing multiple spam posts
● could be different actual human posters
● or just one human logging in with different accounts
  ○ can still be seen as collusion between the accounts
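
A statistic like the 67% figure can be computed by counting spam posts per thread. The following is a minimal sketch on made-up `(thread_id, poster_id)` pairs; the identifiers are hypothetical.

```python
from collections import Counter

def share_in_multi_spam_threads(spam_posts):
    """Fraction of spam posts that sit in a thread holding more than one spam post.

    spam_posts: iterable of (thread_id, poster_id) pairs, spam posts only.
    """
    spam_posts = list(spam_posts)
    per_thread = Counter(thread for thread, _ in spam_posts)
    in_multi = sum(1 for thread, _ in spam_posts if per_thread[thread] > 1)
    return in_multi / len(spam_posts)

# Hypothetical toy data: threads t1 and t3 each contain several spam posts
toy_spam_posts = [
    ("t1", "acct_a"), ("t1", "acct_b"),
    ("t2", "acct_a"),
    ("t3", "acct_c"), ("t3", "acct_d"), ("t3", "acct_a"),
]

collusion_share = share_in_multi_spam_threads(toy_spam_posts)
```

The count is per thread rather than per account, so it does not distinguish several human posters from one person using several accounts.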

Detection
● evaluation metric
● data splitting
● machine learning
● spam detection for first posts
● spam detection for replies
● spammer detection

Evaluation Metric
● spam / spammer is in the minority (<5%)
  ○ accuracy ✘
● high precision / recall on the spam / spammer class is preferable
  ○ F-measure ✔
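
A small worked example shows why accuracy is a poor metric when spam is rare. The confusion counts below are invented for illustration (1,000 posts, 30 of them spam) and do not come from the dataset.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure (harmonic mean) for the spam class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

TOTAL, SPAM = 1000, 30  # hypothetical: 3% of posts are spam

# Classifier A: always predicts "not spam"
acc_a = (TOTAL - SPAM) / TOTAL
_, _, f_a = precision_recall_f(tp=0, fp=0, fn=SPAM)

# Classifier B: finds 20 of the 30 spam posts, with 10 false alarms
tp_b, fp_b, fn_b = 20, 10, 10
acc_b = (TOTAL - fp_b - fn_b) / TOTAL
_, _, f_b = precision_recall_f(tp_b, fp_b, fn_b)
```

The two accuracies differ by only one point (0.97 vs. 0.98), while the F-measure separates a useless classifier (0.0) from a useful one (about 0.67).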

Data Splitting
● posts (spam detection)
  ○ made in 2011 → training set
  ○ made between Jan 2012 and May 2012 → test set
● user accounts (spammer detection)
  ○ who had submitted a post in 2011 → training set
  ○ who had submitted a post between Jan 2012 and May 2012 → test set
  ○ who had submitted a post in both → training set
    ■ probabilistic prediction on posts will be used in spammer detection
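
The time-based split can be sketched as below. The dictionary keys `poster` and `date`, and the toy posts, are hypothetical; only the date ranges and the rule that users seen in both periods go to the training set come from the slide.

```python
from datetime import date

TRAIN_START, TRAIN_END = date(2011, 1, 1), date(2011, 12, 31)
TEST_START, TEST_END = date(2012, 1, 1), date(2012, 5, 31)

def split_posts(posts):
    """Posts made in 2011 go to training, posts from Jan-May 2012 to test."""
    train = [p for p in posts if TRAIN_START <= p["date"] <= TRAIN_END]
    test = [p for p in posts if TEST_START <= p["date"] <= TEST_END]
    return train, test

def split_users(train_posts, test_posts):
    """Users who posted in both periods stay in the training set only."""
    train_users = {p["poster"] for p in train_posts}
    test_users = {p["poster"] for p in test_posts} - train_users
    return train_users, test_users

# Hypothetical toy posts
toy_posts = [
    {"poster": "u1", "date": date(2011, 3, 14)},
    {"poster": "u1", "date": date(2012, 2, 1)},
    {"poster": "u2", "date": date(2012, 4, 20)},
    {"poster": "u3", "date": date(2012, 8, 9)},  # outside both periods
]

train_posts, test_posts = split_posts(toy_posts)
train_users, test_users = split_users(train_posts, test_posts)
```

Here `u1` posted in both periods and therefore lands in the training users, while the 2012 post by `u1` is still a test post.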

Test Set* for Posts
● concern: capturing the writing habits of spammers?
  ○ favorite words, preferred writing style, etc.
    ■ might be purely personal preference
    ■ not essential to opinion spamming
● solution: removing posts by ‘cross-posters’ from the test set for posts
  ○ cross-posters: users who have made a post in both the training set and the test set
  ○ resulting set: test set*
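
Building test set* amounts to a set intersection followed by a filter. A minimal sketch, assuming each post is a dictionary with a hypothetical `poster` key:

```python
def remove_cross_posters(train_posts, test_posts):
    """Return (test set*, cross-posters).

    Cross-posters are users with at least one post in both splits;
    all of their posts are dropped from the test set.
    """
    train_posters = {p["poster"] for p in train_posts}
    test_posters = {p["poster"] for p in test_posts}
    cross_posters = train_posters & test_posters
    test_star = [p for p in test_posts if p["poster"] not in cross_posters]
    return test_star, cross_posters

# Hypothetical toy splits
toy_train = [{"poster": "u1"}, {"poster": "u2"}]
toy_test = [{"poster": "u1"}, {"poster": "u3"}, {"poster": "u4"}]

test_star, cross_posters = remove_cross_posters(toy_train, toy_test)
```

Only the test set shrinks; the training set keeps the cross-posters' posts.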

number of instances in each split

Machine Learning
● Scikit-Learn (Pedregosa et al. 2011)
  ○ machine learning in Python
● SVM with RBF kernel
  ○ outperforms SVM with linear kernel, logistic regression, AdaBoost, random forests, etc.
  ○ Python wrapper for LibSVM
  ○ scaling features to zero mean and unit variance
    ■ (Hsu et al. 2003)
  ○ two primary hyperparameters (C, γ) to tune
    ■ 5-fold cross-validation on the training set
    ■ grid search on (C, γ) with F-measure as the metric to optimize
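
The setup on this slide maps onto standard scikit-learn components roughly as follows. This is a sketch, not the authors' code: the synthetic data, the grid values and the random seeds are assumptions chosen only to make the example run quickly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the real feature matrix (assumption)
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Scale to zero mean and unit variance, then fit an RBF-kernel SVM
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Exponentially spaced grid over (C, gamma); the exact values are illustrative
param_grid = {
    "svc__C": [2.0 ** k for k in (-1, 1, 3, 5)],
    "svc__gamma": [2.0 ** k for k in (-7, -5, -3, -1)],
}

# 5-fold cross-validation on the training data, optimizing the F-measure
search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

best_params = search.best_params_
best_f = search.best_score_
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no statistics from the held-out fold leak into the scaling.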

Spam Detection for First Posts
● specifically for first posts in threads
● expected to give better results than for replies
  ○ higher spam ratio
  ○ richer content
● only the performance on test set* is shown in the following slides, for conciseness

Random Baseline
● predict whether a first post is spam according to the result of a fair coin flip
● precision ≈ ratio of spam
● recall ≈ 50%

features\metrics   precision   recall   F-measure
random             2.52%       55.71%   4.82%
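
The two approximations for the coin-flip baseline can be checked with a quick simulation. The label vector below is hypothetical (100,000 first posts, 2.5% spam); only the precision and recall passed to `f_measure` at the end are taken from the table.

```python
import random

def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def coin_flip_baseline(labels, seed=0):
    """Predict spam with probability 0.5 and score against the true labels."""
    rng = random.Random(seed)
    preds = [rng.random() < 0.5 for _ in labels]
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f_measure(precision, recall)

# Hypothetical labels: 100,000 first posts, 2.5% of them spam
labels = [i < 2500 for i in range(100_000)]
sim_precision, sim_recall, sim_f = coin_flip_baseline(labels)

# F-measure recomputed from the precision and recall reported in the table
reported_f = f_measure(0.0252, 0.5571)
```

On a sample this large the simulated precision stays close to the spam ratio and the recall close to 50%. The recomputed `reported_f` is about 0.0482, matching the 4.82% in the table.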

(Dimension-reduced) Bag-of-Words
● Chinese word segmentation with Jieba
● words with < 5 occurrences are removed
● words that appear in over 30% of the posts are removed
  ○ e.g., stop words
● Randomized PCA (Halko et al. 2011) to reduce the dimension of bag-of-words
  ○ efficient on large matrices
  ○ mitigating overfitting
  ○ speeding up the training process
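
The two vocabulary filters can be sketched in plain Python on posts that are already segmented into words. The English tokens below are a hypothetical stand-in for Jieba output, and the function names are illustrative; segmentation and the Randomized PCA step are not shown.

```python
from collections import Counter

def build_vocabulary(tokenized_posts, min_count=5, max_doc_ratio=0.30):
    """Keep words with at least min_count occurrences overall that
    appear in at most max_doc_ratio of the posts."""
    total = Counter()     # word -> occurrences over all posts
    doc_freq = Counter()  # word -> number of posts containing it
    for tokens in tokenized_posts:
        total.update(tokens)
        doc_freq.update(set(tokens))
    n_posts = len(tokenized_posts)
    return sorted(
        w for w in total
        if total[w] >= min_count and doc_freq[w] / n_posts <= max_doc_ratio
    )

def bag_of_words(tokens, vocabulary):
    """Count vector of one post over the filtered vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

# Hypothetical pre-segmented posts
toy_posts = [
    ["the", "camera", "camera"],
    ["the", "camera", "camera", "camera"],
    ["the", "battery"], ["the", "screen"], ["the"], ["the"],
    ["price"], ["price"], ["rare"], ["the"],
]

vocabulary = build_vocabulary(toy_posts)
first_vector = bag_of_words(toy_posts[0], vocabulary)
```

In the toy data, `the` is dropped because it appears in 70% of the posts, and every other word except `camera` is dropped for having fewer than 5 occurrences.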

Number of Dimensions to Reduce To
● determined by the result of the 5-fold cross-validation
● F-measure is the highest when bag-of-words is reduced to 150 dimensions

Bag-of-Words Performance
● tremendous performance boost
  ○ F-measure improved by about 46 percentage points (4.82% → 50.70%)
● how?

features\metrics   precision   recall   F-measure
random             2.52%       55.71%   4.82%
bag-of-words       50.00%      51.43%   50.70%
