TERM PROJECT: Classifying Tweets Using Naïve Bayes Classifier, CSC 177

slide-1
SLIDE 1

TERM PROJECT

Classifying Tweets Using Naïve Bayes Classifier
CSC 177 – Spring 2020
Andrew Flores, Hera Flores

slide-2
SLIDE 2

Agenda

• Motivation
• Background Knowledge
• Scope of the Project
• Approach
• Implementation Details
• Demo
• Lessons Learned
• Future work
• References
• Acknowledgment

slide-3
SLIDE 3

Extensive amounts of data

• IBM has estimated that 80% of the world’s data is unstructured. [7]
• Every day, roughly 2.5 billion GB of new data is created.
• In particular, Twitter creates roughly 12 TB of data every day. [9]

slide-4
SLIDE 4

Knowledge is power (1)

• All of this unstructured data is a goldmine of knowledge.

slide-5
SLIDE 5

Knowledge is power (2)

• The data is out there. We just have to mine it.

slide-6
SLIDE 6

Sentiment analysis

Background Knowledge

[6]

slide-7
SLIDE 7

Why sentiment analysis matters

• Many companies apply sentiment analysis techniques to social media in order to gain an understanding of:
  • Service performance
  • Investor sentiment
  • Public opinion
• The following are examples from SentDex, an online algorithm that indexes financial, political, and geographical sentiment. [10]

slide-8
SLIDE 8

EBAY

[10]

slide-9
SLIDE 9

Charles Schwab Corp

[10]

slide-10
SLIDE 10

Public Enterprise Group Inc

[10]

slide-11
SLIDE 11

Scope of the Project

• We downloaded a dataset from Kaggle that consisted of roughly 14,000 tweets.
• Our goal was to analyze, transform, and classify this data.
• We wanted to run multiple trials, adjusting parameters to achieve optimal performance.

slide-12
SLIDE 12

Naïve Bayes Classifier

• The Naïve Bayes classifier is a supervised machine learning algorithm based on the statistical work of the mathematician Thomas Bayes.

slide-13
SLIDE 13

Naïve Bayes properties

• Principle of the Naïve Bayes classifier
  • Probabilistic machine learning model
  • Based on Bayes’ theorem [4]
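For reference, Bayes' theorem and the conditional-independence ("naïve") assumption can be written as follows. The symbols here are notation introduced for illustration and do not appear in the slides: c is a sentiment class and w_1, ..., w_n are the words of a tweet.

```latex
% Bayes' theorem applied to a tweet with words w_1..w_n and class c
P(c \mid w_1, \dots, w_n) = \frac{P(c)\, P(w_1, \dots, w_n \mid c)}{P(w_1, \dots, w_n)}

% Naive assumption: words are conditionally independent given the class
P(w_1, \dots, w_n \mid c) \approx \prod_{i=1}^{n} P(w_i \mid c)

% Resulting decision rule (the denominator is the same for every class)
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

The denominator can be dropped in the last line because it does not depend on the class being scored.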

slide-14
SLIDE 14

The problem

• It’s difficult to analyze sentiment with traditional surveys, which are inefficient in terms of time and effort and can also be erroneous at times.
• Airline companies aren’t able to sift through the thousands of social media posts made at any given time that might hold valuable information about their service performance or critical concerns.

slide-15
SLIDE 15

The solution

• We’ll build a sentiment text classifier that puts airline-related tweets into one of two categories: negative or positive sentiment. [1]
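As a rough illustration of how a two-class Naïve Bayes text classifier works, here is a minimal sketch in plain Python. It is not the project's code: the class name `NaiveBayesTextClassifier`, the whitespace tokenization, the add-one smoothing, and the toy tweets are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # number of tweets per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus summed log likelihoods (logs avoid underflow).
            score = math.log(doc_count / total_docs)
            for word in text.lower().split():
                if word in self.vocab:            # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy tweets, invented for this illustration.
train_texts = [
    "great flight and friendly crew",
    "thanks for the great service",
    "flight delayed again terrible service",
    "lost my bag worst airline",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(model.predict("friendly crew and great service"))
print(model.predict("terrible delayed flight"))
```

In practice a library implementation would normally replace this hand-written class; the sketch only makes the scoring rule visible.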

slide-16
SLIDE 16

Methods of Implementation (1)

• Data from Kaggle
• Anaconda to run the Python code
• Jupyter Notebook for exploratory data analysis (EDA)
• Naïve Bayes classifier in Jupyter Notebook and Spyder

slide-17
SLIDE 17

Methods of Implementation (2)

slide-18
SLIDE 18

Let’s visualize some data!

slide-19
SLIDE 19

Tweet Wordcloud

• Consists of all tweets from our tweet.csv file
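A word cloud sizes each word by how often it occurs, so the underlying step is a word-frequency count. The sketch below shows one way that count might be done with the standard library. The stop-word list, the removal of @mentions and URLs, and the sample tweets are assumptions for illustration, not details taken from the slides.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would likely use
# a fuller list such as the NLTK one described in [8].
STOP_WORDS = {"the", "a", "an", "to", "my", "is", "was", "and", "for", "on", "i"}


def word_frequencies(tweets):
    """Count words across tweets, ignoring @mentions, URLs, and stop words."""
    counts = Counter()
    for tweet in tweets:
        text = re.sub(r"@\w+|https?://\S+", " ", tweet.lower())
        words = re.findall(r"[a-z']+", text)
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


# Invented sample tweets, not rows from the dataset.
sample = [
    "@united my flight was cancelled",
    "@united thanks for the great flight!",
    "@delta flight delayed again http://example.com",
]
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

The resulting counts are the kind of input a word-cloud renderer would scale font sizes by.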

slide-20
SLIDE 20

Positive Tweet Wordcloud

• Consists of the most positively associated words from the entire data set.

slide-21
SLIDE 21

Negative Tweet Wordcloud

• Consists of the most negatively associated words from the entire data set.

slide-22
SLIDE 22

Overall Sentiment Distribution

Sentiment   Tweets
Positive    2363
Negative    9178
Neutral     3099
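The class imbalance can be quantified directly from the three counts on this slide. A short sketch of the arithmetic (the variable names are illustrative):

```python
# Tweet counts per sentiment, as shown on this slide.
sentiment_counts = {"Positive": 2363, "Negative": 9178, "Neutral": 3099}

total = sum(sentiment_counts.values())
shares = {label: count / total for label, count in sentiment_counts.items()}

print(f"Total tweets: {total}")
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

The counts sum to 14,640 tweets, of which roughly 63% are negative, so a classifier trained on the raw distribution would see far more negative than positive examples.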

slide-23
SLIDE 23

Distribution of Airlines

Airline          Tweets
Virgin America   504
United           3822
Southwest        2420
Delta            2222
US Airways       2913
American         2759

slide-24
SLIDE 24

Individual Airline Sentiment Distribution

slide-25
SLIDE 25

Classification Report (1)

Input         Size   Accuracy   Precision   Recall   F1-score   Support   Time (sec)
Input 1 avg   2000   0.88       0.88        0.88     0.87       400       12.76
Input 2 avg   3000   0.91       0.91        0.91     0.91       600       15.40
Input 3 avg   3600   0.92       0.92        0.92     0.92       720       24.55
Input 4 avg   4726   0.92       0.93        0.92     0.92       946       30.53
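For readers unfamiliar with the report columns, the sketch below shows how accuracy, precision, recall, and F1 are derived from the four cells of a binary confusion matrix. The function name `binary_metrics` and the counts passed to it are hypothetical; they are not the project's results. The rows labeled "avg" presumably average such per-class values over both classes, and the support column is about one fifth of each size, which would be consistent with a 20% test split, though the slides state neither.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of the tweets predicted positive, how many were
    recall = tp / (tp + fn)      # of the truly positive tweets, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical confusion-matrix counts for a 400-tweet test set.
metrics = binary_metrics(tp=170, fp=20, fn=30, tn=180)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```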

slide-26
SLIDE 26

Classification Report (2)

[Chart: "Classification chart" plotting accuracy/recall, precision, and f1-score for Inputs 1 to 4]

slide-27
SLIDE 27

Classification Report (3)

Input   Size   Time (sec)
1       2000   12.76
2       3000   15.40
3       3600   24.55
4       4726   30.53
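One way to read these timings is as a cost per input item. The sketch below does that arithmetic on the four size and time pairs from this slide; the assumption that "size" counts tweets is ours.

```python
# (size, seconds) pairs taken from this slide.
runs = [(2000, 12.76), (3000, 15.40), (3600, 24.55), (4726, 30.53)]

# Milliseconds spent per input item in each run.
ms_per_item = [seconds / size * 1000 for size, seconds in runs]

for (size, _), ms in zip(runs, ms_per_item):
    print(f"size {size}: {ms:.2f} ms per item")
```

The per-item cost stays between about 5 and 7 ms across all four runs, which suggests the total time grows roughly in proportion to the input size.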

slide-28
SLIDE 28

Confusion Matrices

slide-29
SLIDE 29

DEMO TIME!!!!!

slide-30
SLIDE 30

Lessons Learned

• How to implement text mining techniques using Python and its associated libraries.
• Performing multiple trials with different input sizes yields different results.
• Run time increased with input size, from 12.76 sec at size 2000 to 30.53 sec at size 4726.
• Accuracy increased with input size, from 0.88 at size 2000 to 0.92 at sizes 3600 and 4726.

slide-31
SLIDE 31

Future Work

• In the future we plan to:
  • Incorporate the neutral tweets.
  • Compare the Naïve Bayes classifier’s efficiency against other supervised ML algorithms.
  • Implement web scraping to gather new Twitter data sets.

slide-32
SLIDE 32

References

[1] V. Valkov. "Movie review sentiment." Curiousily, www.curiousily.com/posts/movie-review-sentiment-analysis-with-naive-bayes/. Accessed 18 Apr. 2020.

[2] K. DeGrave. "A Naive Bayes Tweet Classifier." Kaggle, www.kaggle.com/degravek/a-naive-bayes-tweet-classifier. Accessed 19 Apr. 2020.

[3] N.K. Sharma, S. Rahamatkar, S. Sharma. "Classification of Airline Tweet using Naïve-Bayes classifier for Sentiment Analysis." 9 IEEE Xplore. Accessed 5 Apr. 2020.

[4] R. Gandhi. "Naive Bayes Classifier." TowardsDataScience, towardsdatascience.com/naive-bayes-classifier-81d512f50a7c. Accessed 27 Apr. 2020.

[5] C. Masolo. "Sentiment analysis on US Twitter Airline dataset." TowardsDataScience, towardsdatascience.com/sentiment-analysis-on-us-twitter-airline-dataset-1-of-2-2417f204b971. Accessed 5 Apr. 2020.

[6] MonkeyLearn. "Sentiment Analysis." monkeylearn.com/sentiment-analysis/. Accessed 5 Apr. 2020.

[7] C. Schneider. "The biggest data challenges that you might not even know you have." IBM, www.ibm.com/blogs/watson/2016/05/biggest-data-challenges-might-not-even-know/. Accessed 22 Apr. 2020.

[8] P. Upadhyay. "Removing stop words with NLTK in Python." GeeksforGeeks, www.geeksforgeeks.org/removing-stop-words-nltk-python/. Accessed 15 Apr. 2020.

[9] D. Gura. "All Those 140-Character Twitter Messages Amount To Petabytes Of Data Every Year." NPR, www.npr.org/sections/thetwo-way/2010/09/28/130199229/all-those-140-character-twitter-messages-yield-four-petrabytes-of-data-annually. Accessed 15 Apr. 2020.

[10] H. Kinsley. "Sentiment analysis accuracy." Sentdex, sentdex.com/how-accurate-is-sentiment-analysis-for-stocks/. Accessed 15 Apr. 2020.

[11] Kaggle. "Twitter US Airline Sentiment." www.kaggle.com/crowdflower/twitter-airline-sentiment. Accessed 15 Mar. 2020.

[12] Tsuruoka, Yoshimasa, et al. "Highly Scalable Text Mining – Parallel Tagging Application." IEEE Xplore, Oxford Journals, 2004.

slide-33
SLIDE 33

Acknowledgments (1)

Thank you to Thomas Bayes for laying down the theoretical groundwork for future engineers, scientists, and mathematicians.

slide-34
SLIDE 34

Acknowledgments (2)

Thank you to Dr. Lu for providing us with a great learning environment and teaching us plenty about Data Warehousing/Data Mining.