Sentiment Analysis for Twitter using Hybrid Naive Bayes

SLIDE 1

Introduction Background Proposed approach Experimental setup Results Conclusion

Sentiment Analysis for Twitter using Hybrid Naive Bayes

Harsh Thakkar (1), Dr. Dhiren Patel (2)

(1) M.Tech. II Student; (2) Professor & Guide

Computer Engineering Department SVNIT, Surat

June 19, 2013

Harsh Thakkar Roll: P11CO010 M.Tech. Dissertation 2013 1/41

SLIDE 2

Road Map

1. Introduction
2. Background & Related work
3. Proposed approach
4. Experimental setup
5. Results & Analysis
6. Conclusion

SLIDE 4

Sentiment Analysis

Sentiment analysis: “It is the phenomenon of extracting sentiments or opinions from reviews expressed by users over a particular subject, area or product online.”

SLIDE 5

Natural Language Processing

Natural Language Processing (NLP): “It is the technology dealing with our most ubiquitous product: human language, as it appears in emails, web pages, tweets, product descriptions, newspaper stories, social media, and scientific articles, in thousands of languages and varieties.”

SLIDE 6

Motivation

Why S.A.?

  • Increased use of microblogging as a platform to express opinions.
  • Every day an enormous amount of data is created on social networks like Twitter.
  • Data ⇒ valuable information for everybody’s needs.

Why Twitter?

  • Twitter is an open-access social network.
  • It is an ocean of sentiments (140 characters ⇒ high sentiment density).
  • Twitter provides a developer-friendly API, so mining sentiments is easier.

SLIDE 8

Background & Related work

Sentiment analysis is formulated as a text-classification problem. Depending on the task at hand and the perspective of the person doing the sentiment analysis, the approach can be one of:

  • General approaches
  • Twitter-specific approaches

SLIDE 9

General Approaches

General approaches are as follows:

  • Knowledge-based approach: sentiment is a function of keywords.
  • Relationship-based approach: component-relationship oriented [customer, brand].
  • Language models: based on the frequency of n-grams.
  • Semantics & discourse structures: the overall semantic structure of a text is taken into consideration; every word has its subjective meaning.

Applications:

  • Movie reviews [4]
  • Product reviews [5]
  • News and blogs ([3], [6])
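
The n-gram frequency idea behind the language-model approach can be sketched as follows. This is a minimal illustration rather than the authors' code; the whitespace tokenizer and the function name `ngram_counts` are assumptions made for the example.

```python
from collections import Counter

def ngram_counts(text, n=2):
    """Count word n-grams in a text (lower-cased, whitespace-tokenized)."""
    tokens = text.lower().split()
    # Pair each token with its n-1 successors by zipping shifted copies.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

# Example: bigram counts for a short, made-up review.
counts = ngram_counts("good movie very good movie", n=2)
```

Here `counts` maps each bigram tuple, such as `("good", "movie")`, to the number of times it occurs.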

SLIDE 10

Twitter-specific Approaches

Twitter-specific approaches are:

  • Lexical approach
  • Machine learning approach
  • Hybrid approach

SLIDE 11

Lexical approach

SLIDE 12

Machine learning approach

Main tasks:

  • The classifier (algorithm/method)
  • Selection of features (emoticons, n-grams, etc.)
  • The training data

A series of feature vectors is chosen and a collection of tagged corpora is provided for training a classifier. Selection of features is crucial to the success rate of the classification. Two classification methods are dominant:

  • SVM ([14], [15])
  • Naive Bayes [16]
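
Feature selection of the kind listed above (emoticons, n-grams) might look like the sketch below. The emoticon sets, the feature naming scheme and the function name `extract_features` are all assumptions for illustration; the slides do not give the actual feature set.

```python
POSITIVE_EMOTICONS = {":)", ":-)", ":D"}   # assumed set, for illustration only
NEGATIVE_EMOTICONS = {":(", ":-("}         # assumed set, for illustration only

def extract_features(tweet):
    """Map a tweet to a dict of boolean features: unigrams, bigrams, emoticons."""
    raw_tokens = tweet.split()
    tokens = [t.lower() for t in raw_tokens]
    features = {}
    for tok in tokens:
        features["has(%s)" % tok] = True
    for first, second in zip(tokens, tokens[1:]):
        features["bigram(%s %s)" % (first, second)] = True
    # Emoticons are matched on the raw tokens so that ":D" keeps its case.
    features["pos_emoticon"] = any(t in POSITIVE_EMOTICONS for t in raw_tokens)
    features["neg_emoticon"] = any(t in NEGATIVE_EMOTICONS for t in raw_tokens)
    return features

features = extract_features("Loved ironman3 :)")
```

A list of `(features, label)` pairs in this dict form is the input shape that NLTK's `NaiveBayesClassifier.train` accepts, which is one reason to represent features as a dictionary.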

SLIDE 13

Performance comparison of lexical and ML approaches

SLIDE 14

Performance comparison of Hybrid approaches

SLIDE 15

Inference

  • It is clear from the results that ML approaches are superior to lexical approaches.
  • Among machine learning approaches, Naive Bayes yields higher accuracy (IMDB, spam filters).
  • Lexical vs machine learning ⇒ time vs performance.

SLIDE 17

Problem Statement

“To propose a hybrid approach yielding competitive results by hybridizing machine learning and lexical approaches that captures and analyses sentiments of users in an open social network like Twitter for exploring public opinion.”

SLIDE 18

Proposed approach

We propose to hybridize the following two approaches, lexical and machine learning:

  • Lexical ⇒ SentiWordNet lexicon dictionary, with
  • Machine learning ⇒ Naive Bayes algorithm

SLIDE 19

Proposed system architecture

SLIDE 20

Proposed process flow model

SLIDE 21

Corpus & Preprocessing

Corpus:

  • We crawled datasets labelled using emoticons.
  • The corpus contains datasets of 1k, 10k, 50k, 100k and 1M tweets, approx. 4 million in total.
  • Data was crawled by archiving real-time tweets via the TweetStream API.
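
Labelling tweets by the emoticons they contain could be sketched as below. The emoticon glyphs on the slide did not survive, so `:)` and `:(` (with their `:-)` and `:-(` variants) are assumptions, as are the function name `label_by_emoticon`, the choice to discard tweets with both or neither kind of emoticon, and the choice to strip the emoticon from the text.

```python
POSITIVE = (":)", ":-)")  # assumed positive markers
NEGATIVE = (":(", ":-(")  # assumed negative markers

def label_by_emoticon(tweet):
    """Return (text_without_emoticons, label), or None if no reliable label exists."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        # Neither kind or both kinds of emoticon present: skip the tweet.
        return None
    text = tweet
    for emoticon in POSITIVE + NEGATIVE:
        text = text.replace(emoticon, "")
    label = "pos" if has_pos else "neg"
    # Collapse the whitespace left behind by the removed emoticon.
    return " ".join(text.split()), label
```

Stripping the emoticon keeps a classifier trained on this data from simply memorising the label marker.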

Preprocessing:

SLIDE 22

Phase I

Naive Bayes: based on the Bayesian conditional probability model

$$P(H \mid E) = \frac{P(H)\,P(E \mid H)}{P(E)} \tag{1}$$

where

  • P(H|E): posterior probability of the hypothesis,
  • P(H): prior probability of the hypothesis,
  • P(E): prior probability of the evidence,
  • P(E|H): conditional probability of the evidence given the hypothesis.

Or in a simpler form:

$$\text{Posterior} = \frac{\text{Prior} \times \text{Likelihood}}{\text{Evidence}} \tag{2}$$
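
Applied to text, the hypothesis H is a sentiment class and the evidence E is the set of words in a tweet. The sketch below is a minimal multinomial Naive Bayes, not the authors' implementation. The class name `TinyNaiveBayes`, the add-one (Laplace) smoothing, the log-space arithmetic and the toy training data are all assumptions.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumption, not from the slides)."""

    def fit(self, documents):
        # documents: iterable of (list_of_tokens, label) pairs
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in documents:
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, tokens):
        """Return log P(H) + sum of log P(word | H) for each class H.

        P(E) is left out because it is the same for every class.
        """
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)

# Toy training data, invented for illustration.
train = [
    ("love this movie".split(), "pos"),
    ("great fun".split(), "pos"),
    ("hate this movie".split(), "neg"),
    ("boring and bad".split(), "neg"),
]
model = TinyNaiveBayes().fit(train)
```

Dropping the evidence term P(E) from equation (1) does not change which class scores highest, so it is common to compare only prior times likelihood.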

SLIDE 23

Phase II

Integrating SentiWordNet 3.0:

  • Derived from WordNet (a hierarchically organized lexical database).
  • Groups English words into sets of synonyms called “synsets”.
  • Records semantic relations between these synonym sets.
  • Each term in the SentiWordNet database is assigned a score in [−1, 1] which indicates its polarity.

[courtesy: sentiwordnet.isti.cnr.it]
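
The slides do not say how the lexicon score is merged with the Naive Bayes output, so the sketch below shows just one plausible combination. The toy lexicon values are invented (they are not real SentiWordNet scores), and the additive rule, the `weight` parameter and both function names are assumptions.

```python
# Toy stand-in for SentiWordNet: term -> polarity in [-1, 1].
# These numbers are invented for illustration, not real SentiWordNet scores.
TOY_LEXICON = {"good": 0.6, "great": 0.8, "bad": -0.6, "awful": -0.9}

def lexicon_polarity(tokens, lexicon=TOY_LEXICON):
    """Average polarity of the tokens found in the lexicon (0.0 if none are found)."""
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

def hybrid_label(nb_log_odds, tokens, weight=1.0):
    """Combine a Naive Bayes log-odds (pos minus neg) with the lexicon polarity.

    The additive combination and `weight` are assumptions; the slides do not
    specify how the two signals are merged.
    """
    combined = nb_log_odds + weight * lexicon_polarity(tokens)
    return "pos" if combined >= 0 else "neg"
```

With this rule the lexicon only changes the outcome when the classifier is close to undecided; a larger `weight` would give the lexicon more influence.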

SLIDE 25

General system requirements for Hybrid Naive Bayes

SLIDE 26

Tools & Technology

We use the following tools and technologies:

  • Python® 2.7: overall scripting & backend
  • SentiWordNet 3.0: linguistic resource
  • LMF® 2.3.5: persistent data storage
  • NLTK® 2.0: language processing and validation

SLIDE 28

Result format

We present the results in the form of classifier accuracy, in two phases:

Phase I: Base Naive Bayes performance

  • Tests were carried out using multiple Twitter datasets consisting of a mixture of new and old keywords, e.g. [“ironman3”, “amitabhbachhan”, “google”, “twitter”, “robertdowneyjr”, etc.].
  • Results are validated using NLTK 2.0.

Phase II: Hybrid Naive Bayes performance

  • We carried out the same procedure after integrating the SentiWordNet dictionary with the classifier.
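
Classifier accuracy, the metric used in both phases, is the fraction of test tweets whose predicted label matches the gold label. A minimal sketch follows; the function name `accuracy` and the sample labels are invented for illustration and are not the reported results.

```python
def accuracy(predicted, gold):
    """Fraction of predictions that match the gold labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

# Made-up labels: three of the four predictions are correct.
acc = accuracy(["pos", "neg", "pos", "pos"], ["pos", "neg", "neg", "pos"])
```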

SLIDE 29

Results

Phase I: Base Naive Bayes in action on a Windows platform

SLIDE 30

Results

SLIDE 31

Results

Phase II: Hybrid Naive Bayes in action on a Windows platform

SLIDE 32

Results

SLIDE 33

Results

SLIDE 34

Results

SLIDE 35

Results

SLIDE 36

Results

SLIDE 38

Conclusion

  • We successfully hybridized existing lexical and machine learning approaches and outperformed base Naive Bayes, with a consistent average accuracy ≥ 90%, and 98.6% in the best case.
  • Our approach also outperforms the approaches of ([17], [26], [27]).
  • It is clear from the results that hybrid Naive Bayes can be positively applied to other sentiment analysis applications, such as:
      • Financial sentiment analysis (stock opinion mining)
      • Customer feedback services, etc.

SLIDE 39

Future work

  • Interpreting sarcasm
      • “Discourse-driven sentiment analysis”
      • Deep dive into linguistics (Dr. Pushpak & team, IITB)
  • Multi-lingual support
      • Language-specific lexicon dictionary

SLIDE 40

Our publication

Harsh Thakkar and Dhiren Patel, “Approaches for sentiment analysis on Twitter: A state-of-art study”, accepted at the International Network for Social Network Analysis conference (INSNA), Xi'an, China, July 2013.

SLIDE 41

Queries?
