


  1. (Simple) Text Analysis using Twitter Data
     Bayu Distiawan Trisedya, Information Retrieval Lab., Faculty of Computer Science, University of Indonesia
     Slide acknowledgement: Alfan Farizki W
     Big Data Workshop, Bank Indonesia, Surabaya, Indonesia, 10/16/2015

  2. We will do a fun programming task. If you're not a programmer, don't worry!

  3. Crawling Tweets & Simple Processing

  4. 1. Getting Twitter API keys
     • Create a Twitter account if you do not already have one.
     • Go to https://apps.twitter.com/ and log in with your Twitter credentials.
     • Click "Create New App".
     • Fill out the form, agree to the terms, and click "Create your Twitter application".
     • On the next page, click the "API keys" tab and copy your "API key" and "API secret".
     • Scroll down, click "Create my access token", and copy your "Access token" and "Access token secret".

  5. 2. Install Python (and libraries)
     • Option A: Install Anaconda, then install Tweepy.
     • Option B: Install Python, then the libraries Tweepy, pandas, matplotlib, and numpy, and set your environment variables.

  6. 3. Creating the crawler
     • Open the Tweepy installation folder and find the streaming example:
     • \installation_dir\tweepy\examples\streaming.py

  7. 3. Creating the crawler (2)
     • Copy your key and token from the Twitter API into the code:

     # Go to http://apps.twitter.com and create an app.
     # The consumer key and secret will be generated for you after
     consumer_key = ""
     consumer_secret = ""
     # After the step above, you will be redirected to your app's page.
     # Create an access token under the "Your access token" section
     access_token = ""
     access_token_secret = ""

  8. 3. Creating the crawler (3)

     class StdOutListener(StreamListener):
         def on_data(self, data):
             print(data)
             return True

         def on_error(self, status):
             print(status)

     • This part is a listener that prints each received tweet to standard output, for example the command line.

  9. 3. Creating the crawler (4)

     stream.filter(track=['basketball'])

     • This is where we put the keyword for the tweets that suit our interest. In this example, we will receive tweets that contain "basketball".
     • You can try multiple keywords:
     • stream.filter(track=['jokowi', 'prabowo'])
     • We don't cover techniques for enhancing the keywords to improve the search results.

  10. 3. Creating the crawler (5)
     • Run the crawler:
     • Open your command line or terminal.
     • Change the active directory to the location of the crawler file.
     • Command: python crawler.py
     • See what happens.
     • Then change the command to: python crawler.py > output.json

  11. 4. Preparing the corpus (1)
     • Filter the tweet stream and pick the attribute we want to analyze.
     • In this example we only want to analyze the text of the tweets.

  12. 4. Preparing the corpus (2)

     import json
     fo = open('file_path\output.json', 'r')
     fw = open('file_path\corpus.txt', 'a')

     • Create a new Python file (transform2.py).
     • Import json, since we want to read JSON format.
     • fo -> reads the file the crawler produced.
     • fw -> creates a new file.

  13. 4. Preparing the corpus (3)

     for line in fo:
         try:
             tweet = json.loads(line)
             fw.write(tweet['text'] + "\n")
         except:
             continue

     • Read every line in fo.
     • Write the tweet text to fw.
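The extraction logic on slide 13 can be sketched as a self-contained function that works on any iterable of lines (the sample data below is invented for illustration; the workshop script reads from output.json instead):

```python
import json

def extract_texts(json_lines):
    """Extract the 'text' field from each JSON-encoded tweet line.
    Lines that are not valid JSON (e.g. keep-alive newlines from the
    streaming API) are skipped, mirroring the try/except on the slide."""
    texts = []
    for line in json_lines:
        try:
            tweet = json.loads(line)
            texts.append(tweet["text"])
        except (ValueError, KeyError):
            continue
    return texts

# Two fake tweet objects and one malformed line:
sample = [
    '{"text": "first tweet", "id": 1}',
    'not json',
    '{"text": "second tweet", "id": 2}',
]
print(extract_texts(sample))  # ['first tweet', 'second tweet']
```

Catching only ValueError and KeyError (rather than a bare except) keeps genuine programming errors visible while still skipping malformed stream lines.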

  14. 5. Simple Analysis (1)

     import json
     import pandas as pd
     import matplotlib.pyplot as plt
     import re

     • Count how many tweets contain "jokowi" and "prabowo".
     • Create a new Python file (simple_analysis.py).
     • Import the libraries needed.

  15. 5. Simple Analysis (2)

     def word_in_text(word, text):
         word = word.lower()
         text = text.lower()
         match = re.search(word, text)
         if match:
             return True
         return False

     • Create a function to check whether the word is contained in a text or not.
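One caveat with the slide's function: re.search treats the keyword as a regular expression and matches it anywhere, even inside longer words. A slightly stricter variant (a sketch, not the workshop's code) escapes the keyword and anchors it to word boundaries:

```python
import re

def word_in_text(word, text):
    # re.escape prevents metacharacters in `word` from being interpreted
    # as a pattern; \b restricts the search to whole-word matches, so
    # 'jokowi' does not match inside a longer token like 'jokowidodo'.
    pattern = r"\b" + re.escape(word.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None

print(word_in_text("jokowi", "Debat: JOKOWI unggul"))  # True
print(word_in_text("jokowi", "jokowidodo fans club"))  # False
```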

  16. 5. Simple Analysis (3)

     corpus_path = 'corpus.txt'
     tweets = []
     corpus_file = open(corpus_path, "r")
     for line in corpus_file:
         tweets.append(line)
     print "Tweets count: " + str(len(tweets))

     • Read the corpus file line by line and store each line in a list.
     • Count how many tweets it contains.

  17. 5. Simple Analysis (4)

     tweets_frame = pd.DataFrame()
     tweets_frame['jokowi'] = map(lambda tweet: word_in_text('jokowi', tweet), tweets)
     tweets_frame['prabowo'] = map(lambda tweet: word_in_text('prabowo', tweet), tweets)
     print tweets_frame['jokowi'].value_counts()[True]
     print tweets_frame['prabowo'].value_counts()[True]

     • Calculate how many tweets contain "jokowi" and "prabowo".
     • Print the result.
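For readers without pandas installed, the same count can be sketched in plain Python (the sample tweets below are invented; the workshop uses the crawled corpus):

```python
def count_mentions(tweets, keyword):
    # Plain-Python equivalent of the value_counts()[True] lookup on the
    # slide: a case-insensitive substring test over all tweets.
    return sum(1 for t in tweets if keyword.lower() in t.lower())

tweets = ["Jokowi menang debat", "prabowo tegas", "jokowi vs prabowo"]
print(count_mentions(tweets, "jokowi"))   # 2
print(count_mentions(tweets, "prabowo"))  # 2
```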

  18. 5. Simple Analysis (5)

     candidates = ['jokowi', 'prabowo']
     tweets_candidates = [tweets_frame['jokowi'].value_counts()[True],
                          tweets_frame['prabowo'].value_counts()[True]]
     x_pos = list(range(len(candidates)))
     width = 0.8
     fig, ax = plt.subplots()
     plt.bar(x_pos, tweets_candidates, width, alpha=1, color='g')

     # Setting axis labels and ticks
     ax.set_ylabel('Number of tweets', fontsize=15)
     ax.set_title('Jokowi vs. Prabowo', fontsize=10, fontweight='bold')
     ax.set_xticks([p + 0.4 * width for p in x_pos])
     ax.set_xticklabels(candidates)
     plt.grid()
     plt.show()

     • Show the result in a bar chart.

  19. Simple Political Sentiment Analysis on Tweets

  20. The Data
     • We collected tweets on June 9th, 2014, when the first presidential election debate was held.
     • The data consists of around 2,456,465 tweets, crawled over around 24 hours.
     • You can find this data in "debatcapres_2014_sesi1.txt". Please do not open it directly in your text editor!

  21. The Data
     If you want to peek at the data, you can use the following Python code:

     dataFile = open("debatcapres_2014_sesi1.txt", "r")
     lines = 10
     for i in range(lines):
         print(dataFile.readline())
     dataFile.close()

  22. Our Task: Sentiment Analysis
     (Pipeline diagram:) Raw Tweets -> Preprocessed Tweets -> split by candidate ("Jokowi" / "Prabowo") -> positive tweets and negative tweets for each candidate.

  23. Steps
     • Preprocess our corpus.
     • Split our corpus: "jokowi" / "prabowo".
     • Simple sentiment analysis.
     • N-gram frequency analysis: top 100 unigrams, top 100 bigrams.
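The n-gram frequency step can be sketched with the standard library's Counter (the two-tweet sample is invented; in the workshop this runs over the split corpora):

```python
from collections import Counter

def top_ngrams(tweets, n, k):
    """Count word n-grams across all tweets and return the k most common.
    n=1 gives unigrams, n=2 bigrams, as in the slide's analysis."""
    counts = Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts.most_common(k)

tweets = ["debat capres malam ini", "debat capres seru"]
print(top_ngrams(tweets, 1, 3))  # top unigrams
print(top_ngrams(tweets, 2, 3))  # top bigrams
```

With k=100 this yields the "top 100 unigrams / bigrams" the slide refers to.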

  24. Preprocessing
     Before we analyze the text, we need to clean and normalize our raw data. We usually do the following steps (but are not limited to them):
     • Normalization
     • Stop word removal
     • Stemming

  25. Preprocessing: Normalization
     We transform our text data into standard form, e.g. the colloquial Indonesian
     "yawdh gw yg akan pergi ksn"
     becomes the standard
     "Ya sudah, saya yang akan pergi ke sana"
     (roughly: "Alright, I will be the one to go there").

  26. Preprocessing: Normalization
     We leverage a special dictionary (resource: singkatan.dic; code: normalizer.py) mapping slang to standard forms, e.g.:
     aje -> saja, ajh -> saja, ak -> aku, alesan -> alasan, ancur -> hancur, ane -> saya, anget -> hangat, ank -> anak, apah -> apa, apo -> apa, aq -> aku, asek -> asik, ati2 -> hati-hati, atit -> sakit
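The dictionary lookup itself can be sketched in a few lines. This is only an illustration: the SLANG table below is a tiny invented subset, and the workshop's actual implementation lives in normalizer.py with singkatan.dic as its resource:

```python
# Invented subset of a slang-to-standard dictionary (cf. singkatan.dic).
SLANG = {
    "yawdh": "ya sudah", "gw": "saya", "yg": "yang", "ksn": "ke sana",
    "aje": "saja", "ak": "aku", "aq": "aku",
}

def normalize(text):
    # Replace each token with its standard form if it appears in the
    # dictionary; leave all other tokens unchanged.
    return " ".join(SLANG.get(tok, tok) for tok in text.lower().split())

print(normalize("yawdh gw yg akan pergi ksn"))
# 'ya sudah saya yang akan pergi ke sana'
```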

  27. Preprocessing: Normalization
     Let's try it from the command prompt!

     > python
     >>> from normalizer import Normalizer
     >>> norm = Normalizer()
     >>> norm.normalize("yawdh gw yg akan pergi ksn")
     'ya sudah saya yang akan pergi ke sana'

  28. Preprocessing: Stop Word Removal
     Stop words are the most common words; they usually carry little value.
     http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html

  29. Preprocessing: Stop Word Removal
     Resource: twitter_stp.dic. Code: stpremoval.py.

     > python
     >>> from stpremoval import StpRemoval
     >>> st = StpRemoval()
     >>> st.removeStp("budi dan rani pergi ke bandung")
     'budi rani pergi bandung'
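A minimal sketch of what StpRemoval does under the hood (the STOP_WORDS set below is an invented sample; the real list is read from twitter_stp.dic):

```python
# Invented sample of Indonesian stop words (cf. twitter_stp.dic).
STOP_WORDS = {"dan", "ke", "yang", "di", "itu"}

def remove_stopwords(text):
    # Drop every token that appears in the stop-word set.
    return " ".join(t for t in text.lower().split() if t not in STOP_WORDS)

print(remove_stopwords("budi dan rani pergi ke bandung"))
# 'budi rani pergi bandung'
```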

  30. Preprocessing
     Let's apply Normalization and Stop Word Removal to our whole corpus (Raw Tweets -> Preprocessed Tweets).
     You just need to run preprocesscorp.py!

  31. Splitting the Corpus
     Simple approach: suppose we want to find tweets that mention "prabowo".
     Idea:
     for each tweet in the corpus:
         if tweet contains "prabowo" then print tweet
     Use select.py for "prabowo" and "jokowi"!
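The pseudocode above translates almost directly into Python. This sketch works on an in-memory list (select.py, per the slide, does the same over the corpus file); the sample tweets are invented:

```python
def select_tweets(tweets, keyword):
    # Keep only tweets that mention the keyword, using the simple
    # substring test described on the slide.
    return [t for t in tweets if keyword in t.lower()]

tweets = ["debat prabowo tegas", "jokowi blusukan", "prabowo vs jokowi"]
print(select_tweets(tweets, "prabowo"))
```

Running it once per candidate produces the two sub-corpora used in the sentiment step.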

  32. Sentiment Analysis
     One of the tasks: Polarity Classification
     Positive tweets:
     • "@IMKristenBell I absolutely love the samsung commercials with you and Dax xD so cute and funny. ♡♡ Hope you have nice week :)"
     • "I love my iPhone 6! :D The case on it makes it look so nice!"
     Negative tweets:
     • "#samsung chargers are just as bad as #apple ones. holy hell. >:("
     • "snapchat looks so bad on the new iphone :-("
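A toy lexicon-based classifier illustrates what polarity classification means here. Everything below is invented for illustration (the word lists are not the workshop's resources, and real systems use far richer lexicons or learned models):

```python
# Tiny invented sentiment lexicons, purely for illustration.
POSITIVE = {"love", "nice", "cute", "funny", "good"}
NEGATIVE = {"bad", "hell", "worst", "hate"}

def polarity(tweet):
    # Score = (# positive tokens) - (# negative tokens); the sign
    # decides the label, with ties falling back to "neutral".
    tokens = tweet.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love my iPhone 6! The case makes it look so nice"))  # positive
print(polarity("samsung chargers are just as bad"))                    # negative
```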
