Detecting Singleton Review Spammers Using Semantic Similarity


  1. Detecting Singleton Review Spammers Using Semantic Similarity Vlad Sandulescu, joint work with Martin Ester 2015.05.19

  2. Online reviews • 31% of consumers read online reviews before making a purchase (and rising) • by the end of 2014, 15% of all social media reviews will consist of company-paid fake reviews

  3. ⋆ ⋆ ⋆ ⋆ ⋆ 4/12/2011, Ken K., Burke, VA (0 friends, 4 reviews): "Immediately upon entering, we became aware of the fact that this is a unique and charming hotel. The main lobby is decorated by live vines overlapping the open-feeling roof and by chandeliers, quite a contrast. The hotel staff were courteous, welcoming and efficient. The room was tastefully decorated with plush, comfortable bedding and the street noises of New York were never noticeable. The location is convenient to everything in the area of Columbus Circle and Carnegie Hall and there is a subway nearby. Overall a lovely experience."

  4. (same example review as slide 3) Behavioural features vs. text analysis • Behavioural approach gives good results for "elite" users • Textual analysis = mostly cosine similarity, but also linguistic cues of deceptive writing: using more verbs, adverbs and pronouns • "husband" or "vacation" = highly suspicious based on their incidence in fake reviews • ∼90% of reviewers write a single review under one user name • What about the singleton reviewers?

  5. Hypothesis • Semantic similarity measures should outperform vectorial-based models in detecting more subtle similarities between fake reviews written by the same author • A spammer's imagination is limited, so he will partially reuse some of the aspects between reviews, through paraphrase and synonyms. Goals • Detect opinion spam using semantic similarity (WordNet) and topic modeling (LDA) • Compare to vectorial similarity models (cosine)

  6. WordNet synsets (figure: synset graph around "transport", covering senses such as shipping/transportation/conveyance, move/transfer/carry/send, and delight/enchant/enrapture/enthrall)

  7. WordNet synsets (same figure, annotated with similarity scores) • transport - shipping = 0.8 • transport - move = 0.2
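The synset scores on slide 7 come from a WordNet-style path similarity: nearby senses in the hypernym hierarchy score high, distant ones score low. A minimal sketch of that idea, using a tiny hand-made hypernym dictionary in place of WordNet (the graph below is illustrative, not WordNet's actual structure, and the numeric scores differ from the slide's):

```python
# Toy hypernym graph: each word maps to its hypernym (more general term).
# A stand-in for WordNet's noun hierarchy, for illustration only.
HYPERNYMS = {
    "shipping": "transport",
    "transport": "conveyance",
    "conveyance": "artifact",
    "move": "change",
    "change": "action",
}

def path_to_root(word):
    """Return the hypernym chain from a word up to its root."""
    chain = [word]
    while chain[-1] in HYPERNYMS:
        chain.append(HYPERNYMS[chain[-1]])
    return chain

def path_similarity(w1, w2):
    """WordNet-style path similarity: 1 / (1 + distance through the
    hierarchy via the lowest common ancestor); 0.0 if none exists."""
    p1, p2 = path_to_root(w1), path_to_root(w2)
    common = set(p1) & set(p2)
    if not common:
        return 0.0
    dist = min(p1.index(c) + p2.index(c) for c in common)
    return 1.0 / (1.0 + dist)

print(path_similarity("shipping", "transport"))  # direct hypernym: high score
print(path_similarity("shipping", "move"))       # no shared ancestor here: 0.0
```

NLTK's `wordnet` corpus exposes the real version of this via `synset.path_similarity()`.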

  8. Vectorial-based measures. For two texts T1 and T2, their cosine similarity can be formulated as

  $$\cos(T_1, T_2) = \frac{T_1 \cdot T_2}{\|T_1\| \, \|T_2\|} = \frac{\sum_{i=1}^{n} T_{1i} T_{2i}}{\sqrt{\sum_{i=1}^{n} T_{1i}^2} \, \sqrt{\sum_{i=1}^{n} T_{2i}^2}}$$

  Knowledge-based measures. For T1 and T2, their semantic similarity (Mihalcea et al.) can be formulated as:

  $$sim(T_1, T_2) = \frac{1}{2} \left( \frac{\sum_{w \in \{T_1\}} maxSim(w, T_2) \cdot idf(w)}{\sum_{w \in \{T_1\}} idf(w)} + \frac{\sum_{w \in \{T_2\}} maxSim(w, T_1) \cdot idf(w)}{\sum_{w \in \{T_2\}} idf(w)} \right)$$

  Example: transport - "The shop now offers night delivery"
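Both formulas above can be sketched in a few lines. This is a minimal illustration, assuming a plug-in `word_sim` function (the real measure would use a WordNet-based similarity, as on the previous slides) and an `idf` dictionary; the exact-match `word_sim` used in the demo is a placeholder:

```python
import math
from collections import Counter

def cosine_similarity(t1_tokens, t2_tokens):
    """Vectorial baseline: cosine over raw term-frequency vectors."""
    v1, v2 = Counter(t1_tokens), Counter(t2_tokens)
    dot = sum(v1[w] * v2[w] for w in v1)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def mihalcea_similarity(t1_tokens, t2_tokens, word_sim, idf):
    """Knowledge-based measure of Mihalcea et al.: for each word in one
    text, take its best match in the other, weight by idf, then average
    the two directions."""
    def directed(src, tgt):
        num = sum(max(word_sim(w, v) for v in tgt) * idf.get(w, 1.0) for w in src)
        den = sum(idf.get(w, 1.0) for w in src)
        return num / den if den else 0.0
    return 0.5 * (directed(t1_tokens, t2_tokens) + directed(t2_tokens, t1_tokens))

# Placeholder word similarity (exact match); idf defaults to 1.0 per word.
exact = lambda a, b: 1.0 if a == b else 0.0
print(cosine_similarity(["hotel", "charming"], ["hotel", "lovely"]))
print(mihalcea_similarity(["hotel", "charming"], ["hotel", "lovely"], exact, {}))
```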

  9. (same example review as slide 3) Aspect-based opinion mining • opinion phrases: <aspect, sentiment> • e.g. <hotel, unique>, <hotel, charming>, <staff, courteous> • different words = same aspect (laptop, notebook, notebook computer) • reviews = short documents = latent topic mixture = review aspect mixture • review similarity = topic similarity => a topic modeling problem • advantage: language agnostic, unlike WordNet
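A crude way to get the <aspect, sentiment> phrases of slide 9 from POS-tagged text is to pair each adjective with the nearest following noun. Real systems use dependency parsing; this heuristic is only a sketch, and the pairing rule is an assumption, not the paper's method:

```python
def extract_opinion_phrases(tagged):
    """Pair each adjective (JJ) with the nearest following noun (NN),
    yielding (aspect, sentiment) tuples."""
    phrases = []
    for i, (word, tag) in enumerate(tagged):
        if tag == "JJ":
            for later_word, later_tag in tagged[i + 1:]:
                if later_tag == "NN":
                    phrases.append((later_word, word))  # (aspect, sentiment)
                    break
    return phrases

# Tokens pre-tagged with Penn Treebank tags, as on the preprocessing slide.
tagged = [("unique", "JJ"), ("and", "CC"), ("charming", "JJ"), ("hotel", "NN"),
          ("courteous", "JJ"), ("staff", "NN")]
print(extract_opinion_phrases(tagged))
# [('hotel', 'unique'), ('hotel', 'charming'), ('staff', 'courteous')]
```

This reproduces the slide's example phrases for the hotel review.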

  10. Topic modeling for opinion spam detection (figure: LDA plate diagram with hyperparameters α, β and variables Θ_d, Z_{d,n}, W_{d,n} over N words and D documents)
  • Θ_d represents the topic proportions for the d-th document
  • Z_{d,n} represents the topic assignment for the n-th word in the d-th document
  • W_{d,n} represents the observed word for the n-th word in the d-th document
  • β represents a distribution over the words in the known vocabulary

  $$KL(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$

  $$JS(P \| Q) = \frac{1}{2} KL(P \| M) + \frac{1}{2} KL(Q \| M), \quad \text{where } M = \frac{1}{2}(P + Q)$$

  $$IR(p, q) = 10^{-\beta \, JS(p \| q)}$$
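The divergence-based similarity between two reviews' topic distributions can be computed directly from the formulas above. A minimal sketch (the example distributions `p` and `q` are made up for illustration):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(P || Q), log base 2."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetrised KL against the midpoint M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def ir_similarity(p, q, beta=1.0):
    """IR similarity between two topic distributions: 10^(-beta * JS)."""
    return 10 ** (-beta * js(p, q))

# Toy topic proportions for two reviews (e.g. from a 3-topic LDA model).
p = [0.7, 0.2, 0.1]
q = [0.6, 0.3, 0.1]
print(js(p, q))            # small divergence for similar topic mixtures
print(ir_similarity(p, p)) # identical distributions give similarity 1.0
```

With base-2 logs, JS is bounded in [0, 1], so IR decays smoothly from 1 toward 10^(-beta) as the topic mixtures diverge.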

  11. Datasets
  • Ott dataset: 800 labeled reviews from TripAdvisor and AMT; one submission per turker; short, illegible or plagiarized reviews rejected
  • Yelp: 9K labeled reviews from 660 New York restaurants; Recommended reviews = truthful, Not recommended = fake
  • Trustpilot: 57K crawled reviews from 130 US and UK businesses

  12. Preprocessing • Stop words removal, POS tagging (extracted NN, JJ, VB): "I am working hard on my presentation at WWW" → I/PRP am/VBP working/VBG hard/RB on/IN my/PRP presentation/NN at/IN WWW/NNP • Lemmatization: am → be, working → work • Measures: Cosine (all POS), Cosine (NN, JJ, VB), Cosine with lemmatization, Semantic. Pairwise similarity • ∀ pairs (Ri, Rj) ∈ business B • if sim(Ri, Rj) > T, T ∈ [0.5, 1] ⇒ Ri and Rj are fake, else truthful
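The pairwise classification rule above is simple to sketch: within one business, compare every pair of reviews and flag both members of any pair whose similarity exceeds the threshold. The Jaccard token overlap used in the demo is only a stand-in for the cosine/semantic/topic measures on the other slides:

```python
from itertools import combinations

def flag_fake_reviews(reviews, sim, threshold=0.8):
    """For all review pairs (Ri, Rj) of one business, mark both as fake
    when sim(Ri, Rj) > threshold; all other reviews stay truthful."""
    fake = set()
    for (i, r1), (j, r2) in combinations(enumerate(reviews), 2):
        if sim(r1, r2) > threshold:
            fake.update((i, j))
    return fake  # indices of reviews flagged as fake

# Placeholder similarity: Jaccard overlap of token sets.
def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

reviews = ["charming unique hotel great staff",
           "charming unique hotel lovely staff",
           "terrible food slow service"]
print(flag_fake_reviews(reviews, jaccard, threshold=0.5))  # near-duplicate pair flagged
```

Sweeping the threshold, as in the result slides that follow, trades precision against recall: a higher T flags fewer pairs but with higher confidence.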

  13. Semantic similarity results (figure: Yelp/Trustpilot classifier performance with vectorial and semantic similarity measures; panels: (a) Yelp Precision, (b) Yelp F1 score, (c) Trustpilot Precision, (d) Trustpilot F1 score, each vs. threshold; curves: cos, cpnl, cpl, mih)
  • CPL: highest precision for T > 0.75
  • higher T ⇒ higher P; Yelp: P = 90% for T > 0.8
  • semantic measures achieve a higher F1-score
  • Trustpilot: P = 90% for T > 0.85
  • Trustpilot's spammers are lazy; Yelp's spam is higher quality

  14. Distribution of truthful and deceptive reviews - Ott (figure: cumulative percentage of reviews vs. similarity values; panels: (a) Cosine, (b) Mihalcea; curves: truthful, deceptive)
  • Vectorial: ∼2% difference between the classes; 80% of reviews at similarity 0.32 vs. 0.34
  • Semantic: ∼6-10% difference; 40% of reviews at 0.22 vs. 0.32, 80% of reviews at 0.38 vs. 0.44

  15. Bag-of-words LDA model results (figure: Yelp/Trustpilot classifier performance for IR similarity with bag-of-words LDA; panels: Precision and F1 score vs. threshold for each site; curves: IR10, IR30, IR50, IR70, IR100)
  • number of topics ∈ {10 - 100}
  • 30 topics: P > 70%
  • more topics ⇒ lower precision and lower F1
  • Trustpilot reviews are much shorter, and everybody talks about more or less the same aspects

  16. Bag-of-opinion-phrases LDA model results (figure: Yelp classifier performance for IR similarity with bag-of-opinion-phrases LDA; panels: (a) Precision, (b) F1 score vs. threshold; curves: IR10, IR30, IR50, IR70, IR100)
  • Yelp: smoother precision increase as both the number of topics and the threshold increase
  • Trustpilot: poor results due to review length, topic sparseness and the smaller dataset
  • (aspect, sentiment) pairs predict same authorship better
