(Simple) Text Analysis using Twitter Data IR Lab., CS - UI Bayu - PowerPoint PPT Presentation

10/16/2015 (Simple) Text Analysis using Twitter Data IR Lab., CS - UI Bayu Distiawan Trisedya Slide Acknowledgement: Alfan Farizki W Information Retrieval Lab. Faculty of Computer Science University of Indonesia 1 Big Data Workshop – Bank Indonesia, Surabaya, Indonesia

We will do fun programming task  10/16/2015 IR Lab., CS - UI If you’re a not a programmer, don’t worry ! 2

10/16/2015 Crawling Tweets & Simple IR Lab., CS - UI Processing 3

1. Getting Twitter API • Create a twitter account if you do not already have one. • Go to https://apps.twitter.com/ and log in with your twitter 10/16/2015 credentials. • Click "Create New App“ IR Lab., CS - UI • Fill out the form, agree to the terms, and click "Create your Twitter application“ • In the next page, click on "API keys" tab, and copy your "API key" and "API secret". • Scroll down and click "Create my access token", and copy your "Access token" and "Access token secret". 4

2. Install Python (and libraries) • Install Anaconda • Install Tweepy 10/16/2015 OR • Install Python IR Lab., CS - UI • Lib: Tweepy, pandas, matplotlib, numpy • Set environment variables 5

3. Creating crawler • Open tweepy installation folder, find streaming example • \installation_dir\tweepy\examples\streaming.py 10/16/2015 IR Lab., CS - UI 6

3. Creating crawler (2) • Copy your key & token from Twitter API to the code 10/16/2015 # Go to http://apps.twitter.com and create an app. # The consumer key and secret will be generated for you after consumer_key =“” IR Lab., CS - UI consumer_secret =“” # After the step above, you will be redirected to your app's page. # Create an access token under the the "Your access token" section access_token =“” access_token_secret =“” 7

3. Creating crawler (3) class StdOutListener(StreamListener): 10/16/2015 def on_data(self, data): print(data) return True IR Lab., CS - UI def on_error(self, status): print(status) • This part is a listener to print received tweet to standard output for example command line 8

3. Creating crawler (4) stream.filter(track=['basketball']) 10/16/2015 • This part is where we put the keyword of tweets that suits our IR Lab., CS - UI interest. From the example, we will received tweets that contains “basketball” • You can try to change with multiple keywords like • stream.filter (track=[‘ jokowi ‘, ‘ prabowo ’]) • We don’t cover the technique to enhance the keywords to make the search result better. 9

3. Creating crawler (5) • Run the crawler • open your command line or terminal 10/16/2015 • change the active directory to the place of the crawler file • command: IR Lab., CS - UI • python crawler.py • see what happended • change the command to: • python crawler.py > output.json 10

4. Preparing corpus (1) • Filter the tweet stream, pick the attribute we want to analyze. • In this example we only want to analyze the text from the 10/16/2015 tweets IR Lab., CS - UI 11

4. Preparing corpus (2) import json fo = open (‘ file_path\output.json ’, 'r') 10/16/2015 fw = open(‘ file_path\corpus.txt', 'a') IR Lab., CS - UI • Create new python file (transform2.py) • Impor json, we want to read json format • fo -> read the file where the crawler produced • fw -> create new file 12

4. Preparing corpus (3) for line in fo: try: tweet = json.loads(line) 10/16/2015 fw.write(tweet['text']+"\n") except: continue IR Lab., CS - UI • read all line in fo • write the tweet text to fw 13

5. Simple Analysis (1) import json import pandas as pd 10/16/2015 import matplotlib.pyplot as plt import re IR Lab., CS - UI • Count how many tweets contain “ jokowi ” and “ prabowo ” • Create new python file (simple_analysis.py) • Import the library needed 14

5. Simple Analysis (2) def word_in_text(word, text): word = word.lower() text = text.lower() 10/16/2015 match = re.search(word, text) if match: return True IR Lab., CS - UI return False • Create a function to check wheter the word contained on a text or not 15

5. Simple Analysis (3) corpus_path = 'corpus.txt' tweets = [] 10/16/2015 corpus_file = open(corpus_path, "r") for line in corpus_file: tweets.append(line) IR Lab., CS - UI print "Tweets count: " + str(len(tweets)) • Read the corpus file line by line and store each line into an array • Count how many tweets on it 16

5. Simple Analysis (4) tweets_frame = pd.DataFrame() tweets_frame['jokowi'] = map(lambda tweet: word_in_text('jokowi', tweet), tweets) 10/16/2015 tweets_frame['prabowo'] = map(lambda tweet: word_in_text('prabowo', tweet), tweets) print tweets_frame['jokowi'].value_counts()[True] print tweets_frame['prabowo'].value_counts()[True] IR Lab., CS - UI • Calculate how many tweets contains “ jokowi ” and “ prabowo ” • Print the result 17

5. Simple Analysis (5) candidates = ['jokowi', 'prabowo'] tweets_candidates = [tweets_frame['jokowi'].value_counts()[True], tweets_frame['prabowo'].value_counts()[True]] 10/16/2015 x_pos = list(range(len(candidates))) width = 0.8 fig, ax = plt.subplots() plt.bar(x_pos, tweets_candidates, width, alpha=1, color='g') IR Lab., CS - UI # Setting axis labels and ticks ax.set_ylabel('Number of tweets', fontsize=15) ax.set_title('Jokowi vs. Prabowo', fontsize=10, fontweight='bold') ax.set_xticks([p + 0.4 * width for p in x_pos]) ax.set_xticklabels(candidates) plt.grid() plt.show() • Show the result on bar chart 18

10/16/2015 Simple Political Sentiment IR Lab., CS - UI Analysis on Tweets 19

The Data • We collected tweets on June 9 th 2014, when the first 10/16/2015 presidential election debate was held. IR Lab., CS - UI • The data consists of around 2,456,465 tweets, crawled for around 24 hours. • You can find this data on “debatcapres_ 2014_sesi1. txt” . Please do not open directly using your text editor ! 20

The Data If you want to peek your data, you can use the following Python code: 10/16/2015 dataFile = open ("debatcapres_2014_sesi1.txt", "r") IR Lab., CS - UI lines = 10 for i in range (lines): print (dataFile. readline ()) dataFile. close () 21

Our Task Sentiment Analysis Positive tweets 10/16/2015 Negative tweets “ Jokowi ” IR Lab., CS - UI Positive tweets Preprocessed Raw Tweets Tweets Negative tweets “ Prabowo ” 22

Steps • Preprocessing our corpus • Splitting our corpus 10/16/2015 • “ jokowi ” • “ prabowo ” IR Lab., CS - UI • Simple Sentiment Analysis • N-gram Frequency Analysis • Top 100 unigrams • Top 100 bigrams 23

Preprocessing Before we analyze the text, we need to clean and normalize our raw data. 10/16/2015 We usually do the following steps (but not limited to): IR Lab., CS - UI Normalization Stop Word Removal Stemming 24

Preprocessing Normalization We transform our text data into standard form. 10/16/2015 yawdh gw yg akan pergi ksn IR Lab., CS - UI Ya sudah, saya yang akan pergi ke sana 25

Preprocessing aje saja ajh saja Normalization ak aku We leverage a special dictionary. 10/16/2015 alesan alasan ancur hancur ane saya IR Lab., CS - UI Resource: anget hangat singkatan.dic ank anak apah apa Code: apo apa normalizer.py aq aku asek asik ati2 hati-hati 26 atit sakit

Preprocessing Normalization 10/16/2015 Let’s try from command prompt ! IR Lab., CS - UI > python >>> from normalizer import Normalizer >>> norm = Normalizer() >>> norm.normalize (“ yawdh gw yg akan pergi ksn ”) ‘ ya sudah saya yang akan pergi ke sana ’ 27

Preprocessing Stop Word Removal Stop Word: most common words; they usually have little value. 10/16/2015 IR Lab., CS - UI 28 http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html

Preprocessing Resource: Stop Word Removal twitter_stp.dic 10/16/2015 Code: stpremoval.py IR Lab., CS - UI > python >>> from stpremoval import StpRemoval >>> st = StpRemoval() >>> norm.removeStp (“ budi dan rani pergi ke bandung ”) ‘ budi rani pergi bandung ’ 29

Preprocessing Let’s combine Normalization and Stop Word Removal to our whole corpus. 10/16/2015 IR Lab., CS - UI Preprocessed Raw Tweets Tweets You just need to run preprocesscorp.py ! 30

Splitting Corpus Simple Approach: Suppose we want to find tweet that 10/16/2015 mentions “ prabowo ” . “ jokowi ” IR Lab., CS - UI Idea: for each tweet in the corpus: if tweet contains “ prabowo ” then print tweet Preprocessed Tweets Use select.py for “ prabowo ” and “ jokowi ” ! “ prabowo ” 31

Sentiment Analysis One of the tasks: Polarity Classification 10/16/2015 Positive Tweets @IMKristenBell I absolutely love the samsung commercials IR Lab., CS - UI with you and Dax xD so cute and funny. ♡♡ Hope you have nice week :) I love my iPhone 6! :D The case on it makes it look so nice ! Negative Tweets # samsung chargers are just as bad as #apple ones. holy hell. > :( snapchat looks so bad on the new iphone :-( 32

(Simple) Text Analysis using Twitter Data IR Lab., CS - UI Bayu - PowerPoint PPT Presentation

10/16/2015 (Simple) Text Analysis using Twitter Data IR Lab., CS - UI Bayu Distiawan Trisedya Slide Acknowledgement: Alfan Farizki W Information Retrieval Lab. Faculty of Computer Science University of Indonesia 1 Big Data Workshop Bank

10 slides that always work Simple text boxes (I) Sample text Sample text Sample text

CONTENT TITLE Insert Subtitle Here Enter Text Here Enter Text Here Enter Text Here

Stack Stack Heap Heap Data Data Text Text Program A Program B Stack Stack Text Heap

Post-Conference Presentation Sunday Oladayo Oladejo Table of Content A Introduction B

Using Twitter for your CPD Janet Thomas November 2019 #PHYSIO19 Why twitter for CPD?

Enhancing ICANN Text Accountability 26 June 2014 Text #ICANN50 Text #ICANN50 Text #ICANN50

Add Your Title Here Replace your text here! Replace your text here! Insert your title here 1

Text Text #ICANN51 15 October 2014 Text Text IDN Root Zone LGR Sarmad Hussain IDN Program

Text Text #ICANN51 Contractual Compliance Text Text Contractual Compliance Update

Text Text #ICANN50 Contractual Compliance Text Text GNSO Council Meeting Wednesday, Jun 25

Large-Scale Machine Learning at Twitter 2 Large-Scale Machine Learning at Twitter Jimmy Lin and

ML at Twitter: A Deep Dive into Twitters Timeline Cibele Montez Halasz, Twitter Cortex

//Dashboard //Twitter Panel //Twitter Panel Context and Actions Act based on the document

God Rescues Daniel from the Lions Daniel 6 Here is some test text Here is some test text Here

5. Text CHAPTER HIGHLIGHTS Text tradition. Codes for computer text. C d f t t t

Business Proposal Infographic Style Your Text Here Your Text Here Your Text Here Your Text

Littlewood-Paley decompositions on manifolds with ends Jean-Marc Bouclet Universit e de

Public policy and Mori economic development 20-Aug-20 1 Aotearoa New Zealand, its history,

Krein spaces applied to Friedrichs systems Kre simir Burazin Department of Mathematics,

Loewners theorem in several variables Mikl os P alfia Sungkyunkwan University &

Price Perceptions and Electricity Demand with Nonlinear Tariffs Shaun McRae and Robyn Meeks

Outline Topic 6 Topic 6 Explanation of inheritance. Using inheritance to create a

Results and Status from Results and Status from HARP and MIPP HARP and MIPP M. Sorel (IFIC,

House Prices and Rents in Tokyo - A Comparison of Repeat-sales and Hedonic measures- Chihiro

(Simple) Text Analysis using Twitter Data IR Lab., CS - UI Bayu - PowerPoint PPT Presentation

10/16/2015 (Simple) Text Analysis using Twitter Data IR Lab., CS - UI Bayu Distiawan Trisedya Slide Acknowledgement: Alfan Farizki W Information Retrieval Lab. Faculty of Computer Science University of Indonesia 1 Big Data Workshop Bank

10 slides that always work Simple text boxes (I) Sample text Sample text Sample text

CONTENT TITLE Insert Subtitle Here Enter Text Here Enter Text Here Enter Text Here

Stack Stack Heap Heap Data Data Text Text Program A Program B Stack Stack Text Heap

Post-Conference Presentation Sunday Oladayo Oladejo Table of Content A Introduction B

Using Twitter for your CPD Janet Thomas November 2019 #PHYSIO19 Why twitter for CPD?

Enhancing ICANN Text Accountability 26 June 2014 Text #ICANN50 Text #ICANN50 Text #ICANN50

Add Your Title Here Replace your text here! Replace your text here! Insert your title here 1

Text Text #ICANN51 15 October 2014 Text Text IDN Root Zone LGR Sarmad Hussain IDN Program

Text Text #ICANN51 Contractual Compliance Text Text Contractual Compliance Update

Text Text #ICANN50 Contractual Compliance Text Text GNSO Council Meeting Wednesday, Jun 25

Large-Scale Machine Learning at Twitter 2 Large-Scale Machine Learning at Twitter Jimmy Lin and

ML at Twitter: A Deep Dive into Twitters Timeline Cibele Montez Halasz, Twitter Cortex

//Dashboard //Twitter Panel //Twitter Panel Context and Actions Act based on the document

God Rescues Daniel from the Lions Daniel 6 Here is some test text Here is some test text Here

5. Text CHAPTER HIGHLIGHTS Text tradition. Codes for computer text. C d f t t t

Business Proposal Infographic Style Your Text Here Your Text Here Your Text Here Your Text

Littlewood-Paley decompositions on manifolds with ends Jean-Marc Bouclet Universit e de

Public policy and Mori economic development 20-Aug-20 1 Aotearoa New Zealand, its history,

Krein spaces applied to Friedrichs systems Kre simir Burazin Department of Mathematics,

Loewners theorem in several variables Mikl os P alfia Sungkyunkwan University &amp;

Price Perceptions and Electricity Demand with Nonlinear Tariffs Shaun McRae and Robyn Meeks

Outline Topic 6 Topic 6 Explanation of inheritance. Using inheritance to create a

Results and Status from Results and Status from HARP and MIPP HARP and MIPP M. Sorel (IFIC,

House Prices and Rents in Tokyo - A Comparison of Repeat-sales and Hedonic measures- Chihiro

Loewners theorem in several variables Mikl os P alfia Sungkyunkwan University &