TERM PROJECT Classifying Tweets Using Naïve Bayes Classifier CSC 177 - PowerPoint PPT Presentation

TERM PROJECT Classifying Tweets Using Naïve Bayes Classifier CSC 177 – Spring 2020 Andrew Flores, Hera Flores

Agenda
- Motivation
- Background Knowledge
- Scope of the Project
- Approach
- Implementation Details
- Demo
- Lessons Learned
- Future work
- References
- Acknowledgment

Extensive amounts of data
- IBM has estimated that 80% of the world's data is unstructured. [7]
- Every day, roughly 2.5 billion GB of new data are created.
- In particular, Twitter creates roughly 12 TB of data every day. [9]

Knowledge is power (1)  All of this unstructured data is a goldmine of knowledge.

Knowledge is power (2)  The data is out there. We just have to mine it.

Background Knowledge Sentiment analysis [6]

Why sentiment analysis matters
- Many companies apply sentiment analysis techniques to social media in order to gain an understanding of:
  - Service performance
  - Investor sentiment
  - Public opinion
- The following are examples from Sentdex, an online algorithm that indexes financial, political, and geographical sentiment. [10]

EBAY [10]

Charles Schwab Corp [10]

Public Enterprise Group Inc [10]

Scope of the Project
- We downloaded a dataset from Kaggle that consisted of roughly 14,000 tweets.
- It was our goal to analyze, transform, and classify this data.
- We wanted to run multiple trials, adjusting parameters to achieve optimal performance.

Naïve Bayes Classifier  The Naïve Bayes classifier is a supervised machine learning algorithm based on statistical methods that trace back to the mathematician Thomas Bayes.

Naïve Bayes properties
- Principle of the Naïve Bayes classifier
- Probabilistic machine learning model
- Based on Bayes' theorem [4]
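For reference, Bayes' theorem and the "naïve" independence assumption that the classifier adds to it can be written as below. The notation is chosen here for illustration and does not come from the slides: $c$ is a sentiment class and $w_1, \dots, w_n$ are the words of one tweet.

```latex
% Bayes' theorem applied to a class c and the words of a tweet
P(c \mid w_1, \dots, w_n) = \frac{P(c)\, P(w_1, \dots, w_n \mid c)}{P(w_1, \dots, w_n)}

% Naive assumption: words are conditionally independent given the class
P(w_1, \dots, w_n \mid c) \approx \prod_{i=1}^{n} P(w_i \mid c)

% Resulting decision rule (the denominator is the same for every class)
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```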

The problem
- Analyzing sentiment through traditional surveys is inefficient in terms of time and effort, and these methods can also be error-prone.
- Airline companies cannot sift through the thousands of social media posts made at any given time, even though those posts might hold valuable information about their service performance or critical concerns.

The solution  We'll build a sentiment text classifier that puts airline-related tweets into one of two categories: negative or positive sentiment. [1]
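A two-class text classifier of this kind can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's code: the class name `NaiveBayesSketch`, the whitespace tokenizer, the add-one smoothing, and the four toy tweets are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class NaiveBayesSketch:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)        # tweets per class
        self.word_counts = defaultdict(Counter)    # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            # Smoothed denominator: words seen in this class + vocabulary size
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:             # skip words never seen in training
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training data, invented for illustration only
train_texts = [
    "great flight and friendly crew",
    "love the quick boarding",
    "flight delayed again terrible service",
    "lost my bag worst airline",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayesSketch().fit(train_texts, train_labels)
print(model.predict("friendly crew great service"))
print(model.predict("terrible delayed flight"))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.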

Methods of Implementation (1)
- Data from Kaggle
- Anaconda distribution to run the Python code
- Jupyter Notebook for exploratory data analysis (EDA)
- Naïve Bayes classifier in Jupyter Notebook and Spyder
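Loading the Kaggle data and keeping only the two target classes could look roughly like this. It is a sketch under assumptions: the column names `airline_sentiment` and `text` are assumed to match the CSV header, the helper `load_binary_tweets` is hypothetical, and the three sample rows are invented so the example runs without the real file.

```python
import csv
import io

# Invented stand-in for the Kaggle CSV; for the real file one would pass
# open(path, newline="", encoding="utf-8") instead of io.StringIO(...).
SAMPLE_CSV = """airline_sentiment,airline,text
positive,Delta,Thanks for the smooth flight!
neutral,United,What time does boarding start?
negative,American,Two hour delay and no updates.
"""


def load_binary_tweets(file_obj):
    """Return (text, label) pairs, dropping tweets labeled neutral."""
    rows = []
    for row in csv.DictReader(file_obj):
        if row["airline_sentiment"] in ("positive", "negative"):
            rows.append((row["text"], row["airline_sentiment"]))
    return rows


tweets = load_binary_tweets(io.StringIO(SAMPLE_CSV))
for text, label in tweets:
    print(label, "|", text)
```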

Methods of Implementation (2)

Let’s visualize some data!

Tweet Wordcloud  Consists of all tweets from our tweet.csv file

Positive Tweet Wordcloud  Consists of the most positively associated words from the entire data set.

Negative Tweet Wordcloud  Consists of the most negatively associated words from the entire data set.
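A word cloud is driven by word frequencies, so the counts behind the positive and negative clouds can be approximated as follows. This is an illustrative sketch only: the tiny stop-word set, the helper `top_words`, and the sample tweets are assumptions, and a real run would use a fuller stop-word list.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the project's list)
STOP_WORDS = {"the", "a", "to", "and", "is", "my", "for", "was", "i", "on"}


def top_words(texts, n=3):
    """Count non-stop-words across texts and return the n most frequent."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Invented sample tweets for one sentiment class
negative_tweets = [
    "My flight was delayed",
    "Delayed again and the bag is lost",
    "Flight cancelled",
]

print(top_words(negative_tweets, 2))
```

Running the same function separately on the positive and negative subsets gives the per-class frequencies that each cloud visualizes.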

Overall Sentiment Distribution (chart of tweet counts by sentiment)
- Positive: 2363
- Neutral: 3099
- Negative: 9178

Distribution of Airlines (chart of tweet counts by airline)
- United: 3822
- US Airways: 2913
- American: 2759
- Southwest: 2420
- Delta: 2222
- Virgin America: 504

Individual Airline Sentiment Distribution

Classification Report (1)

| Input (avg) | Size | Accuracy | Precision | Recall | F1-score | Support | Time (sec) |
|---|---|---|---|---|---|---|---|
| Input 1 | 2000 | 0.88 | 0.88 | 0.88 | 0.87 | 400 | 12.76 |
| Input 2 | 3000 | 0.91 | 0.91 | 0.91 | 0.91 | 600 | 15.40 |
| Input 3 | 3600 | 0.92 | 0.92 | 0.92 | 0.92 | 720 | 24.55 |
| Input 4 | 4726 | 0.92 | 0.93 | 0.92 | 0.92 | 946 | 30.53 |
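The support column is consistent with holding out 20% of each input for testing, rounded up, although the slides do not state the split explicitly. The sketch below checks that arithmetic; `holdout_split` is a hypothetical helper, and the seed and shuffling strategy are assumptions.

```python
import random


def holdout_split(items, test_percent=20, seed=0):
    """Shuffle items and return (train, test); the test size is rounded up."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = (len(shuffled) * test_percent + 99) // 100  # integer ceiling
    return shuffled[n_test:], shuffled[:n_test]


# Size -> support pairs as reported in the classification report
reported = {2000: 400, 3000: 600, 3600: 720, 4726: 946}

for size, support in reported.items():
    train, test = holdout_split(range(size))
    print(size, len(train), len(test), len(test) == support)
```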

Classification Report (2)  Chart: accuracy/recall, precision, and f1-score for Inputs 1 to 4, plotted on an axis from 0.84 to 0.94.

Classification Report (3)

| Input | Size | Time (sec) |
|---|---|---|
| 1 | 2000 | 12.76 |
| 2 | 3000 | 15.40 |
| 3 | 3600 | 24.55 |
| 4 | 4726 | 30.53 |
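One way to read these timings is to normalize them by input size. The short calculation below does that, assuming that Time (sec) covers the whole run for each input, which the slides do not specify. The per-tweet cost comes out at roughly 5 to 7 ms for every input, so total run time grows approximately in proportion to size.

```python
# (size, seconds) pairs as reported in the table
timings = [(2000, 12.76), (3000, 15.40), (3600, 24.55), (4726, 30.53)]

# Milliseconds per tweet for each input
ms_per_tweet = [(size, 1000 * seconds / size) for size, seconds in timings]

for size, ms in ms_per_tweet:
    print(f"{size:>5} tweets: {ms:.2f} ms per tweet")
```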

Confusion Matrices
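For a two-class problem, a confusion matrix and the metrics derived from it can be computed as sketched here. The label lists are invented for illustration, and the functions `confusion_counts` and `report` are hypothetical helpers. The metrics below are for the positive class only, whereas the classification report appears to show averages over both classes.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive="positive"):
    """Tally true/false positives and negatives for a binary problem."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            counts["tp" if pred == positive else "fn"] += 1
        else:
            counts["fp" if pred == positive else "tn"] += 1
    return counts


def report(counts):
    """Derive accuracy, precision, recall, and F1 from the four counts."""
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented true and predicted labels for eight tweets
y_true = ["positive", "positive", "positive", "negative",
          "negative", "negative", "negative", "positive"]
y_pred = ["positive", "positive", "negative", "negative",
          "negative", "positive", "negative", "positive"]

counts = confusion_counts(y_true, y_pred)
print(dict(counts))
print(report(counts))
```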

DEMO TIME!!!!!

Lessons Learned
- How to implement text mining techniques using Python and its associated libraries.
- Running multiple trials with different input sizes yields different results.
- Run time increased as input size increased.
- Accuracy increased as input size increased.

Future Work
- In the future we plan to:
  - Incorporate the neutral tweets.
  - Compare the classifier's efficiency against other supervised ML algorithms.
  - Use web scraping to collect a new Twitter data set.

References
[1] V. Valkov. "Movie review sentiment." Curiousily, www.curiousily.com/posts/movie-review-sentiment-analysis-with-naive-bayes/. Accessed 18 Apr. 2020.
[2] K. DeGrave. "A Naive Bayes Tweet Classifier." Kaggle, www.kaggle.com/degravek/a-naive-bayes-tweet-classifier. Accessed 19 Apr. 2020.
[3] N.K. Sharma, S. Rahamatkar, S. Sharma. "Classification of Airline Tweet using Naïve-Bayes classifier for Sentiment Analysis." IEEE Xplore. Accessed 5 Apr. 2020.
[4] R. Gandhi. "Naive Bayes Classifier." TowardsDataScience, towardsdatascience.com/naive-bayes-classifier-81d512f50a7c. Accessed 27 Apr. 2020.
[5] C. Masolo. "Sentiment analysis on US Twitter Airline dataset." TowardsDataScience, towardsdatascience.com/sentiment-analysis-on-us-twitter-airline-dataset-1-of-2-2417f204b971. Accessed 5 Apr. 2020.
[6] MonkeyLearn. "Sentiment Analysis." monkeylearn.com/sentiment-analysis/. Accessed 5 Apr. 2020.
[7] C. Schneider. "The biggest data challenges that you might not even know you have." IBM, www.ibm.com/blogs/watson/2016/05/biggest-data-challenges-might-not-even-know/. Accessed 22 Apr. 2020.
[8] P. Upadhyay. "Removing stop words with NLTK in Python." GeeksforGeeks, www.geeksforgeeks.org/removing-stop-words-nltk-python/. Accessed 15 Apr. 2020.
[9] D. Gura. "All Those 140-Character Twitter Messages Amount To Petabytes Of Data Every Year." NPR, www.npr.org/sections/thetwo-way/2010/09/28/130199229/all-those-140-character-twitter-messages-yield-four-petrabytes-of-data-annually. Accessed 15 Apr. 2020.
[10] H. Kinsley. "Sentiment analysis accuracy." Sentdex, sentdex.com/how-accurate-is-sentiment-analysis-for-stocks/. Accessed 15 Apr. 2020.
[11] Kaggle. "Twitter US Airline Sentiment." www.kaggle.com/crowdflower/twitter-airline-sentiment. Accessed 15 Mar. 2020.
[12] Tsuruoka, Yoshimasa, et al. "Highly Scalable Text Mining – Parallel Tagging Application." IEEE Xplore, Oxford Journals, 2004.

Acknowledgments (1) Thank you to Thomas Bayes for laying down the theoretical groundwork for future engineers, scientists, and mathematicians.

Acknowledgments (2) Thank you to Dr. Lu for providing us with a great learning environment and teaching us plenty about Data Warehousing/Data Mining.
