

SLIDE 1

(Simple) Text Analysis using Twitter Data

Bayu Distiawan Trisedya
Slide Acknowledgement: Alfan Farizki W
Information Retrieval Lab., Faculty of Computer Science, University of Indonesia
Big Data Workshop – Bank Indonesia, Surabaya, Indonesia
10/16/2015

SLIDE 2

We will do a fun programming task. If you're not a programmer, don't worry!

SLIDE 3

Crawling Tweets & Simple Processing

SLIDE 4

1. Getting Twitter API keys
  • Create a Twitter account if you do not already have one.
  • Go to https://apps.twitter.com/ and log in with your Twitter credentials.
  • Click "Create New App".
  • Fill out the form, agree to the terms, and click "Create your Twitter application".
  • On the next page, click the "API keys" tab, and copy your "API key" and "API secret".
  • Scroll down, click "Create my access token", and copy your "Access token" and "Access token secret".

SLIDE 5

2. Install Python (and libraries)
  • Install Anaconda
  • Install Tweepy

OR

  • Install Python
  • Install the libraries: Tweepy, pandas, matplotlib, numpy
  • Set environment variables

SLIDE 6

3. Creating crawler
  • Open the Tweepy installation folder and find the streaming example:
    \installation_dir\tweepy\examples\streaming.py

SLIDE 7

3. Creating crawler (2)
  • Copy your key & token from the Twitter API page into the code:

    # Go to http://apps.twitter.com and create an app.
    # The consumer key and secret will be generated for you afterwards.
    consumer_key = ""
    consumer_secret = ""
    # After the step above, you will be redirected to your app's page.
    # Create an access token under the "Your access token" section.
    access_token = ""
    access_token_secret = ""

SLIDE 8

3. Creating crawler (3)
  • This part is a listener that prints each received tweet to standard
    output (for example, the command line):

    class StdOutListener(StreamListener):
        def on_data(self, data):
            print(data)
            return True

        def on_error(self, status):
            print(status)

SLIDE 9

3. Creating crawler (4)
  • This part is where we put the keywords for the tweets that suit our
    interest. In the example, we will receive tweets that contain "basketball":

    stream.filter(track=['basketball'])

  • You can also filter on multiple keywords:

    stream.filter(track=['jokowi', 'prabowo'])

  • We don't cover techniques for enhancing the keywords to make the
    search results better.

SLIDE 10

3. Creating crawler (5)
  • Run the crawler:
  • Open your command line or terminal
  • Change the active directory to the folder containing the crawler file
  • Command:
    python crawler.py
  • See what happens
  • Then redirect the output to a file:
    python crawler.py > output.json

SLIDE 11

4. Preparing corpus (1)
  • Filter the tweet stream and pick the attribute we want to analyze.
  • In this example we only want to analyze the text of the tweets.

SLIDE 12

4. Preparing corpus (2)
  • Create a new Python file (transform2.py)
  • Import json, since we want to read the JSON format
  • fo -> reads the file the crawler produced
  • fw -> creates a new file

    import json
    fo = open('file_path\output.json', 'r')
    fw = open('file_path\corpus.txt', 'a')

SLIDE 13

4. Preparing corpus (3)
  • Read every line in fo
  • Write the tweet text to fw

    for line in fo:
        try:
            tweet = json.loads(line)
            fw.write(tweet['text'] + "\n")
        except ValueError:
            continue
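As a sanity check of the extraction loop above, the same json.loads logic can be run in memory over a few sample lines (the sample tweets below are made up for illustration):

```python
import json

# One malformed line mixed in, as can happen with a raw crawler dump.
sample_lines = [
    '{"text": "first tweet"}',
    'not valid json',
    '{"text": "second tweet"}',
]

texts = []
for line in sample_lines:
    try:
        tweet = json.loads(line)
        texts.append(tweet['text'])
    except ValueError:
        # Skip lines that are not valid JSON, as transform2.py does.
        continue

print(texts)  # the malformed line is silently skipped
```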

SLIDE 14

5. Simple Analysis (1)
  • Count how many tweets contain "jokowi" and "prabowo"
  • Create a new Python file (simple_analysis.py)
  • Import the libraries needed:

    import json
    import pandas as pd
    import matplotlib.pyplot as plt
    import re

SLIDE 15

5. Simple Analysis (2)
  • Create a function to check whether the word is contained in a text or not:

    def word_in_text(word, text):
        word = word.lower()
        text = text.lower()
        match = re.search(word, text)
        if match:
            return True
        return False
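One caveat with the function above: re.search treats the word as a regular expression, so metacharacters in a keyword could misfire. Wrapping it in re.escape is a safer variant (a small sketch, not part of the original slides):

```python
import re

def word_in_text(word, text):
    # Case-insensitive containment check; re.escape guards against
    # regex metacharacters in the search word.
    return re.search(re.escape(word.lower()), text.lower()) is not None

print(word_in_text('Jokowi', 'debat jokowi malam ini'))   # True
print(word_in_text('prabowo', 'debat jokowi malam ini'))  # False
```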

SLIDE 16

5. Simple Analysis (3)
  • Read the corpus file line by line and store each line in a list
  • Count how many tweets it contains:

    corpus_path = 'corpus.txt'
    tweets = []
    corpus_file = open(corpus_path, "r")
    for line in corpus_file:
        tweets.append(line)
    print("Tweets count: " + str(len(tweets)))

SLIDE 17

5. Simple Analysis (4)
  • Calculate how many tweets contain "jokowi" and "prabowo"
  • Print the result:

    tweets_frame = pd.DataFrame()
    tweets_frame['jokowi'] = list(map(lambda tweet: word_in_text('jokowi', tweet), tweets))
    tweets_frame['prabowo'] = list(map(lambda tweet: word_in_text('prabowo', tweet), tweets))
    print(tweets_frame['jokowi'].value_counts()[True])
    print(tweets_frame['prabowo'].value_counts()[True])
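Note that value_counts()[True] raises a KeyError when no tweet matches a keyword. If pandas is not at hand, a plain sum over booleans gives the same counts and degrades gracefully to 0 (sketch with a made-up miniature corpus):

```python
import re

def word_in_text(word, text):
    return re.search(word.lower(), text.lower()) is not None

# Made-up miniature corpus for illustration.
tweets = [
    "debat jokowi vs prabowo malam ini",
    "jokowi menjawab pertanyaan dengan santai",
    "nonton bola saja deh",
]

# True counts as 1, False as 0, so sum() yields the match count.
jokowi_count = sum(word_in_text('jokowi', t) for t in tweets)
prabowo_count = sum(word_in_text('prabowo', t) for t in tweets)
print(jokowi_count, prabowo_count)  # 2 1
```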

SLIDE 18

5. Simple Analysis (5)
  • Show the result in a bar chart:

    candidates = ['jokowi', 'prabowo']
    tweets_candidates = [tweets_frame['jokowi'].value_counts()[True],
                         tweets_frame['prabowo'].value_counts()[True]]
    x_pos = list(range(len(candidates)))
    width = 0.8
    fig, ax = plt.subplots()
    plt.bar(x_pos, tweets_candidates, width, alpha=1, color='g')

    # Setting axis labels and ticks
    ax.set_ylabel('Number of tweets', fontsize=15)
    ax.set_title('Jokowi vs. Prabowo', fontsize=10, fontweight='bold')
    ax.set_xticks([p + 0.4 * width for p in x_pos])
    ax.set_xticklabels(candidates)
    plt.grid()
    plt.show()

SLIDE 19

Simple Political Sentiment Analysis on Tweets

SLIDE 20

The Data
  • We collected tweets on June 9th 2014, when the first presidential election debate was held.
  • The data consists of around 2,456,465 tweets, crawled over around 24 hours.
  • You can find this data in "debatcapres_2014_sesi1.txt". Please do not open it directly in your text editor!

SLIDE 21

The Data

If you want to peek at the data, you can use the following Python code:

    dataFile = open("debatcapres_2014_sesi1.txt", "r")
    lines = 10
    for i in range(lines):
        print(dataFile.readline())
    dataFile.close()

SLIDE 22

Our Task

Pipeline: Raw Tweets -> Preprocessed Tweets -> split into "Jokowi" / "Prabowo" subsets -> Sentiment Analysis -> Positive and Negative tweets for each candidate

SLIDE 23

Steps
  • Preprocessing our corpus
  • Splitting our corpus into "jokowi" and "prabowo" subsets
  • Simple Sentiment Analysis
  • N-gram Frequency Analysis: top 100 unigrams and top 100 bigrams

SLIDE 24

Preprocessing

Before we analyze the text, we need to clean and normalize the raw data. We usually do the following steps (but are not limited to them):

Normalization -> Stop Word Removal -> Stemming

SLIDE 25

Preprocessing

Normalization: we transform the text data into its standard form. Example:

"yawdh gw yg akan pergi ksn" -> "Ya sudah, saya yang akan pergi ke sana" ("All right, I am the one who will go there")

SLIDE 26

Preprocessing

Normalization: we leverage a special dictionary mapping informal words to standard ones, e.g.:

    aje -> saja      ajh -> saja      ak -> aku        alesan -> alasan
    ancur -> hancur  ane -> saya      anget -> hangat  ank -> anak
    apah -> apa      apo -> apa       aq -> aku        asek -> asik
    ati2 -> hati-hati                 atit -> sakit

Resource: singkatan.dic
Code: normalizer.py
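The workshop's normalizer.py is not reproduced in the slides, but a minimal dictionary-based normalizer along these lines would behave as described (the inline dictionary is a tiny made-up excerpt standing in for singkatan.dic):

```python
# Hypothetical mini-dictionary; the real singkatan.dic is much larger.
NORM_DICT = {
    'yawdh': 'ya sudah',
    'gw': 'saya',
    'yg': 'yang',
    'ksn': 'ke sana',
}

def normalize(text):
    # Replace each token found in the dictionary; leave others unchanged.
    return ' '.join(NORM_DICT.get(tok, tok) for tok in text.lower().split())

print(normalize('yawdh gw yg akan pergi ksn'))
# 'ya sudah saya yang akan pergi ke sana'
```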

SLIDE 27

Preprocessing

Normalization: let's try it from the command prompt!

    > python
    >>> from normalizer import Normalizer
    >>> norm = Normalizer()
    >>> norm.normalize("yawdh gw yg akan pergi ksn")
    'ya sudah saya yang akan pergi ke sana'

SLIDE 28

Preprocessing

Stop Word Removal. Stop words are the most common words; they usually carry little value.

Reference: http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html

SLIDE 29

Preprocessing

Stop Word Removal

Resource: twitter_stp.dic
Code: stpremoval.py

    > python
    >>> from stpremoval import StpRemoval
    >>> st = StpRemoval()
    >>> st.removeStp("budi dan rani pergi ke bandung")
    'budi rani pergi bandung'
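stpremoval.py itself is not shown in the slides; here is a minimal sketch of the same idea, with a tiny hypothetical stop word list standing in for twitter_stp.dic:

```python
# Hypothetical mini stop word list; the real twitter_stp.dic is larger.
STOP_WORDS = {'dan', 'ke', 'yang', 'di'}

def remove_stop_words(text):
    # Drop every token that appears in the stop word list.
    return ' '.join(t for t in text.lower().split() if t not in STOP_WORDS)

print(remove_stop_words('budi dan rani pergi ke bandung'))
# 'budi rani pergi bandung'
```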

SLIDE 30

Preprocessing

Let's apply both Normalization and Stop Word Removal to the whole corpus (Raw Tweets -> Preprocessed Tweets).

You just need to run preprocesscorp.py!

SLIDE 31

Splitting Corpus

We split the Preprocessed Tweets into a "jokowi" subset and a "prabowo" subset.

Simple approach: suppose we want to find the tweets that mention "prabowo". Idea:

    for each tweet in the corpus:
        if tweet contains "prabowo" then print tweet

Use select.py for both "prabowo" and "jokowi"!
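select.py is not reproduced in the slides; the pseudocode above amounts to a one-line filter, sketched here with a made-up mini corpus:

```python
def select(tweets, keyword):
    # Keep only tweets whose text mentions the keyword (case-insensitive).
    keyword = keyword.lower()
    return [t for t in tweets if keyword in t.lower()]

corpus = ['prabowo tegas sekali', 'jokowi santai saja', 'debat seru malam ini']
print(select(corpus, 'prabowo'))  # ['prabowo tegas sekali']
```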

SLIDE 32

Sentiment Analysis

One of its tasks: Polarity Classification

Positive tweets:
  • @IMKristenBell I absolutely love the samsung commercials with you and Dax xD so cute and funny. ♡♡ Hope you have nice week :)
  • I love my iPhone 6! :D The case on it makes it look so nice!

Negative tweets:
  • #samsung chargers are just as bad as #apple ones. holy hell. >:(
  • snapchat looks so bad on the new iphone :-(

SLIDE 33

Sentiment Analysis

We need a Sentiment Lexicon, i.e., a collection of opinionated words. Example:

Positive words: good, happy, excellent, beautiful, useful, ...
Negative words: bad, sad, worst, ugly, useless, ...

SLIDE 34

Sentiment Analysis

Positive words score +1, negative words score -1.

    polarity = (#pos - #neg) / #opinion_words   if #opinion_words > 0, else 0

Example: "This laptop looks nice (+1) and has cool (+1) screen, but the performance is bad (-1)"

Polarity score = (1 + 1 - 1) / 3 = 0.33. The sentiment is positive enough!

SLIDE 35

Sentiment Analysis

Resource: positif.txt, negatif.txt
Code: sentianal.py

    > python
    >>> from sentianal import Sentianal
    >>> s = Sentianal()
    >>> s.compute("buku ini bagus dan menarik tapi mahal")
    0.33333

(The test sentence means "this book is good and interesting, but expensive".)
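sentianal.py itself is not shown; here is a minimal sketch of lexicon-based scoring following the slide-34 formula, with tiny made-up lexicons standing in for positif.txt and negatif.txt:

```python
# Hypothetical mini-lexicons; positif.txt and negatif.txt are much larger.
POSITIVE = {'bagus', 'menarik'}
NEGATIVE = {'mahal'}

def polarity(text):
    tokens = text.lower().split()
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    opinion = pos + neg
    # polarity = (#pos - #neg) / #opinion_words, or 0 if no opinion words.
    return (pos - neg) / opinion if opinion else 0.0

print(round(polarity('buku ini bagus dan menarik tapi mahal'), 5))  # 0.33333
```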

SLIDE 36

Sentiment Analysis

Split the "Jokowi" tweets into positive and negative tweets (and likewise for "Prabowo").

Use selectposneg.py for both the jokowi and prabowo cases!

SLIDE 37

Shortcomings (Kekurangan)

Example of a tweet that is hard to score:
"Prabowo emosional jokowi galau OMG prabowo tersudutkan!!! Ham n hukum! Hebat- JK!"

SLIDE 38

N-Gram Frequency Analysis

Next question: what topics are they talking about (in the positive and negative tweets for each candidate)?

SLIDE 39

N-Gram Frequency Analysis

How can we extract topics from these collections of tweets? The simplest way:
1. Computing common words (common unigrams)
2. Computing common bigrams

SLIDE 40

N-Gram Frequency Analysis

Example sentence: "Saya pergi ke depok untuk mengikuti konferensi" ("I go to Depok to attend a conference")

Unigrams: saya, pergi, ke, depok, untuk, mengikuti, konferensi
Bigrams: saya pergi, pergi ke, ke depok, depok untuk, untuk mengikuti, mengikuti konferensi

Find the common unigrams and bigrams!

SLIDE 41

N-Gram Frequency Analysis
  • Use unigramcount.py to find the top-N unigrams in the corpus
  • Use bigramcount.py to find the top-N bigrams in the corpus

Example sequence: A A B A A C B A B C A A B C

Unigrams: A: 7 times, B: 4 times, C: 3 times
Bigrams: A A: 3 times, A B: 3 times, B A: 2 times, B C: 2 times, A C: once, C B: once, C A: once

Top-2 unigrams: A, B. Top-2 bigrams: A A, A B.
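unigramcount.py and bigramcount.py are not reproduced in the slides; collections.Counter gives the same tallies, checked here against the A/B/C example sequence above:

```python
from collections import Counter

tokens = 'A A B A A C B A B C A A B C'.split()

unigrams = Counter(tokens)
# Adjacent token pairs form the bigrams.
bigrams = Counter(zip(tokens, tokens[1:]))

print(unigrams.most_common(2))  # [('A', 7), ('B', 4)]
print(bigrams[('A', 'A')], bigrams[('A', 'B')], bigrams[('B', 'A')])  # 3 3 2
```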

SLIDE 42

Political Sentiment Analysis for Predicting Presidential Election Results in a Twitter Nation

Approach:
  • Collecting Twitter data (from May 1st 2014 to July 6th 2014)
  • Automatic Buzzer Detection
  • Removing noise in our data
  • Removing fake users, paid users, etc.
  • Applying our sentiment analysis method
  • Using the number of positive tweets to predict the election outcomes