TERM PROJECT
Classifying Tweets Using Naïve Bayes Classifier
CSC 177 Spring 2020
Andrew Flores, Hera Flores
Agenda
Motivation
Background Knowledge
Scope of the Project
Approach
Implementation Details
Demo
Lessons Learned
Future work
References
Acknowledgment
Extensive amounts of data
IBM has estimated that 80% of the world's data is unstructured. [8]
Every day, roughly 2.5 billion GB of new data is created. In particular, Twitter creates roughly 12 TB of data every day. [9]
Knowledge is power (1)
All of this unstructured data is a goldmine of knowledge.
Knowledge is power (2)
The data is out there. We just have to mine it.
Sentiment analysis
Background Knowledge
[6]
Why sentiment analysis matters
Many companies apply sentiment analysis techniques to social media in order to gain an understanding of:
Service performance
Investor sentiment
Public opinion
The following are examples from SentDex, an online algorithm that indexes financial, political, and geographical sentiment. [10]
EBAY
[10]
Charles Schwab Corp
[10]
Public Enterprise Group Inc
[10]
Scope of the Project
We downloaded a dataset from Kaggle consisting of roughly 14,000 tweets.
Our goal was to analyze, transform, and classify this data.
We wanted to run multiple trials, adjusting parameters to achieve optimal performance.
Naïve Bayes Classifier
The Naïve Bayes Classifier is a supervised machine learning algorithm based on a statistical method, Bayes' theorem, named after the mathematician Thomas Bayes.
Naïve Bayes properties
Principle of Naïve Bayes Classifier
Probabilistic machine learning model
Based on Bayes' theorem [4]
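For reference, Bayes' theorem applied to tweet classification can be written as below. The notation is an assumption introduced here, not taken from the slides: c is a sentiment class and w_1, ..., w_n are the words of a tweet. The "naïve" step assumes the words are conditionally independent given the class, which lets the classifier pick the class with the largest product of per-word probabilities.

```latex
P(c \mid w_1, \dots, w_n)
  = \frac{P(c)\, P(w_1, \dots, w_n \mid c)}{P(w_1, \dots, w_n)}

% Naive independence assumption: P(w_1, \dots, w_n \mid c) = \prod_{i=1}^{n} P(w_i \mid c)

\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```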
The problem
It's difficult to analyze sentiment through traditional surveys, which are inefficient in terms of time and effort. These traditional methods can also be error-prone.
Airline companies aren't able to sift through the thousands of social media posts made at any given time that might hold valuable information about their service performance or critical concerns.
The solution
We'll build a sentiment text classifier that puts airline-related tweet texts into one of two categories: negative or positive sentiment.
[1]
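A minimal sketch of such a two-class classifier is shown below, using only the Python standard library. It is not the project's actual code: the function names (tokenize, train, predict), the whitespace tokenizer, the Laplace smoothing, and the toy tweets are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and split on whitespace; a fuller pipeline would also
    # strip punctuation, @mentions, and stop words.
    return text.lower().split()


def train(samples):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, invented for this example only.
toy_tweets = [
    ("great flight and friendly crew", "positive"),
    ("thank you for the great service", "positive"),
    ("flight delayed again and lost my bag", "negative"),
    ("worst service ever flight cancelled", "negative"),
]
model = train(toy_tweets)
print(predict(model, "great crew"))
print(predict(model, "lost my bag"))
```

Working in log space keeps the product of many small probabilities from underflowing on longer tweets.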
Methods of Implementation (1)
Data from Kaggle
Anaconda to implement Python code
Jupyter Notebook for EDA
Naïve Bayes classifier in Jupyter Notebook and Spyder
Methods of Implementation (2)
Let’s visualize some data!
Tweet Wordcloud
Consists of all tweets from our tweet.csv file.
Positive Tweet Wordcloud
Consists of the most positively associated words from the entire data set.
Negative Tweet Wordcloud
Consists of the most negatively associated words from the entire data set.
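A word cloud is driven by word frequencies, so the counting step behind these images can be sketched as follows. This is an illustrative assumption, not the project's code: the word_frequencies helper, the tiny STOP_WORDS set, and the sample tweets are all hypothetical, and the actual rendering of the cloud is left out.

```python
from collections import Counter

# Tiny illustrative stop-word list; a real run would use a fuller list.
STOP_WORDS = {"the", "a", "to", "and", "my", "for", "is", "was"}


def word_frequencies(tweets):
    """Count how often each non-stop word appears across the tweets."""
    counts = Counter()
    for tweet in tweets:
        for token in tweet.lower().split():
            word = token.strip(".,!?")
            # Skip empty tokens, @mentions, and stop words.
            if word and not word.startswith("@") and word not in STOP_WORDS:
                counts[word] += 1
    return counts


# Hypothetical sample tweets, invented for this example only.
sample = ["@airline thanks for the great flight!", "Great crew, great service."]
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

Running the same count on only the positive or only the negative tweets gives the input for the per-sentiment clouds.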
Overall Sentiment Distribution
Sentiment   Tweets
Positive    2363
Negative    9178
Neutral     3099
Distribution of Airlines
Airline          Tweets
Virgin America   504
United           3822
Southwest        2420
Delta            2222
US Airways       2913
American         2759
Individual Airline Sentiment Distribution
Classification Report (1)
Input         Size   Accuracy   Precision   Recall   F1-score   Support   Time (sec)
Input 1 avg   2000   0.88       0.88        0.88     0.87       400       12.76
Input 2 avg   3000   0.91       0.91        0.91     0.91       600       15.40
Input 3 avg   3600   0.92       0.92        0.92     0.92       720       24.55
Input 4 avg   4726   0.92       0.93        0.92     0.92       946       30.53
Classification Report (2)
Classification chart
[Chart: accuracy/recall, precision, and f1-score for Input 1 through Input 4; y-axis from 0.84 to 0.94]
Classification Report (3)
Input   Size   Time (sec)
1       2000   12.76
2       3000   15.40
3       3600   24.55
4       4726   30.53
Confusion Matrices
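The metrics in the classification reports can all be derived from a confusion matrix. The sketch below shows the standard formulas for a single class; the binary_metrics helper and the counts passed to it are hypothetical and are not the project's results. It assumes none of the denominators is zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of the predicted positives, how many were right
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for a 400-tweet test set, invented for this example only.
metrics = binary_metrics(tp=180, fp=20, fn=30, tn=170)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

An averaged report like the one above would compute these values per class and then average them.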
DEMO TIME!!!!!
Lessons Learned
How to implement text mining techniques using Python and its associated libraries.
Performing multiple trials with different input sizes yields different results.
Run time increased as input size increased.
Accuracy increased as input size increased.
Future Work
In the future, we plan to:
Incorporate neutral data.
Compare the classifier's efficiency against other supervised ML algorithms.
Implement web scraping to collect a new Twitter data set.
References
[1] V. Valkov. "Movie review sentiment." Curiousily, www.curiousily.com/posts/movie-review-sentiment-analysis-with-naive-bayes/. Accessed 18 Apr. 2020.
[2] K. DeGrave. "A Naive Bayes Tweet Classifier." Kaggle, www.kaggle.com/degravek/a-naive-bayes-tweet-classifier. Accessed 19 Apr. 2020.
[3] N.K. Sharma, S. Rahamatkar, S. Sharma. "Classification of Airline Tweet using Naïve-Bayes classifier for Sentiment Analysis." IEEE Xplore. Accessed 5 Apr. 2020.
[4] R. Gandhi. "Naive Bayes Classifier." TowardsDataScience, towardsdatascience.com/naive-bayes-classifier-81d512f50a7c. Accessed 27 Apr. 2020.
[5] C. Masolo. "Sentiment analysis on US Twitter Airline dataset." TowardsDataScience, towardsdatascience.com/sentiment-analysis-on-us-twitter-airline-dataset-1-of-2-2417f204b971. Accessed 5 Apr. 2020.
[6] MonkeyLearn. "Sentiment Analysis." monkeylearn.com/sentiment-analysis/. Accessed 5 Apr. 2020.
[7] C. Schneider. "The biggest data challenges that you might not even know you have." IBM, www.ibm.com/blogs/watson/2016/05/biggest-data-challenges-might-not-even-know/. Accessed 22 Apr. 2020.
[8] P. Upadhyay. "Removing stop words with NLTK in Python." GeeksforGeeks, www.geeksforgeeks.org/removing-stop-words-nltk-python/. Accessed 15 Apr. 2020.
[9] D. Gura. "All Those 140-Character Twitter Messages Amount To Petabytes Of Data Every Year." NPR, www.npr.org/sections/thetwo-way/2010/09/28/130199229/all-those-140-character-twitter-messages-yield-four-petrabytes-of-data-annually. Accessed 15 Apr. 2020.
[10] H. Kinsley. "Sentiment analysis accuracy." Sentdex, sentdex.com/how-accurate-is-sentiment-analysis-for-stocks/. Accessed 15 Apr. 2020.
[11] Kaggle. "Twitter US Airline Sentiment." www.kaggle.com/crowdflower/twitter-airline-sentiment. Accessed 15 Mar. 2020.
[12] Tsuruoka, Yoshimasa, et al. "Highly Scalable Text Mining – Parallel Tagging Application." IEEE Xplore, Oxford Journals, 2004.