SLIDE 1

Feature extraction for sentiment analysis on Twitter data in the Spanish language

Victor Muñiz

Research Center in Mathematics. Monterrey, Mexico.

Victor Muñiz (CIMAT Mty) Sentiment Analysis June 2015 1 / 33

SLIDE 2

Introduction

Sentiment Analysis focuses on automatically identifying whether a text expresses a positive, negative or neutral opinion about some topic.

SLIDE 3

Introduction

Among all online opinion platforms, Twitter has become the most popular for sentiment analysis, for several reasons:

  • Availability of information
  • Large amount of data
  • Constant updates
  • Worldwide availability

SLIDE 4

Introduction

Among all online opinion platforms, Twitter has become the most popular for sentiment analysis, for several reasons:

  • Many applications:
      • Opinion-based marketing
      • Online ranking
      • Government and politics
      • Official statistics
      • Among many others

SLIDE 5

Introduction

One of the most popular techniques for text classification is the Bag of Words (Joachims, 1998), which constructs a Term Document Matrix based on term frequencies.
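
As a rough illustration of the idea, the sketch below builds a term-document matrix of raw term frequencies for three made-up Spanish sentences. The function name `term_document_matrix` and the whitespace tokenizer are assumptions for this example, not part of the original work.

```python
from collections import Counter

def term_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term i in document j."""
    # Naive tokenizer (assumption): lowercase and split on white space.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix

# Made-up toy documents.
docs = ["el cafe es bueno", "el servicio es malo", "cafe malo"]
vocab, tdm = term_document_matrix(docs)
for term, row in zip(vocab, tdm):
    print(f"{term:10s} {row}")
```

Each row is a term and each column a document, so a misspelled variant of a word gets its own, mostly empty row.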

SLIDE 6

Introduction

However, on Twitter data, the application of this (or any) technique is not straightforward:

Andas bien loco @Telcel con la zona horaria d tu RED, a cada rato m mueves la Hr.?? #chidotucotorreo @ServicioTelcel http://t.co/QoOX3OCYxt

@Profeco @Tiendas_OXXO no cumple con algunos requerimientos como tipos de bebida falsos asi como la falta del precio :(

No nos deja pasar el cadenero del oxxo gooey! k pedo 100pre me pasaaaa!!! #Queoso

  • Short text
  • Misspellings
  • Abbreviations and non-standard contractions
  • Emoticons, hashtags
  • Unbalanced classes

SLIDE 7

Introduction

Standard preprocessing techniques are not enough for Twitter data, because we often find spelling variants of words with the same meaning:

    pseudo-estudiantes = pseudoestudiantes = seudoestudiantes = seudestudiantes
    separados = separa2
    siempre = sienpre = 100pre

  • This problem produces a sparse Term Document Matrix.
  • Bag of words is not enough. We need to incorporate contextual (a priori) information.
  • The challenge is to extract the main features of the tweet, which give us insight into the sentiment (polarity) of the text.

SLIDE 8

Introduction

  • There is a lot of work on both feature extraction and classification for tweets; however, the vast majority focuses on English text.
  • Some previous work on lexical normalization of Spanish text has been done (Mosqueda & Moreda, 2012); however, there are important differences between countries and regions, even within the same language. This must be taken into account.
  • The objective of our work is to implement a normalization method for Spanish text using kernel-based methods, in order to obtain important features which can be used as input for a classification method.

SLIDE 9

Preprocessing and normalization

SLIDE 10

Normalization

Data: We obtained tweets from the Twitter API (https://dev.twitter.com/) and manually classified them according to specific topics (e.g., convenience stores, cellphone services).

Standard text preprocessing:

  • Convert to lowercase
  • Remove Spanish stopwords according to the list given by Martin Porter’s Snowball stemming project (http://snowball.tartarus.org/). We added some topic-related words.
  • Remove special characters: URLs, @, RT, -, :, among others
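
The standard preprocessing steps can be sketched as follows. This is only an illustration under stated assumptions: the regular expressions, the function name `basic_preprocess` and the tiny `STOPWORDS` set are hypothetical (the slides use the full Snowball Spanish list plus topic words). Note that stripping punctuation this way also destroys emoticons, so any emoticon substitution would have to run before this step.

```python
import re

# Hypothetical, very short stopword list for illustration only.
STOPWORDS = {"el", "la", "de", "del", "en", "con", "que", "lo", "a", "tu"}

def basic_preprocess(tweet, stopwords=STOPWORDS):
    text = tweet.lower()                        # convert to lowercase
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove user mentions
    text = re.sub(r"\brt\b", " ", text)         # remove the retweet marker
    text = re.sub(r"[^\w\s]", " ", text)        # remove remaining punctuation
    return [tok for tok in text.split() if tok not in stopwords]

print(basic_preprocess("RT @Telcel Andas bien loco con la RED http://t.co/abc"))
```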

SLIDE 11

Normalization

  • Remove repeated characters and excess white space
  • Emoticon substitution according to the list at en.wikipedia.org/wiki/List_of_emoticons. For instance:

    Emoticon   Replacement
    :-)        emoticon-positivo
    :)         emoticon-positivo
    :o)        emoticon-positivo
    :c)        emoticon-positivo
    :-D        emoticon-muy-positivo
    X-D        emoticon-muy-positivo
    >:[        emoticon-negativo
    =(         emoticon-negativo
    :-[        emoticon-negativo
    :-||       emoticon-muy-negativo
    >:(        emoticon-muy-negativo
    :|         emoticon-neutral
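
A minimal sketch of these two steps, assuming a small subset of the substitution table and an arbitrary rule that keeps at most two copies of a repeated character (the slides do not say how many repeats are kept). The names `replace_emoticons` and `squeeze` are hypothetical.

```python
import re

# Subset of the substitution table, for illustration.
EMOTICONS = {
    ":-)": "emoticon-positivo",
    ":)": "emoticon-positivo",
    ":-D": "emoticon-muy-positivo",
    "=(": "emoticon-negativo",
    ">:(": "emoticon-muy-negativo",
    ":|": "emoticon-neutral",
}

def replace_emoticons(text, table=EMOTICONS):
    # Longest emoticons first, so a short one never matches inside a longer one.
    for emoticon in sorted(table, key=len, reverse=True):
        text = text.replace(emoticon, f" {table[emoticon]} ")
    return text

def squeeze(text, max_run=2):
    # Assumption: runs of one character are cut down to max_run copies.
    text = re.sub(r"(.)\1{%d,}" % max_run, lambda m: m.group(1) * max_run, text)
    # Collapse any amount of white space into a single space.
    return re.sub(r"\s+", " ", text).strip()

print(squeeze(replace_emoticons("me pasaaaa!!!   :)")))
```

Keeping two copies rather than one is a design choice for Spanish, where doubled letters such as "ll" and "rr" are legitimate.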

SLIDE 12

Normalization

The normalization process consists of:

1. Detection of non-conventional words
2. Substitution with similar words (ideally the correct ones in terms of linguistic meaning)

SLIDE 13

Normalization

Detection of non-conventional words

  • We used Aspell (http://aspell.net/) with a Spanish dictionary, to which we added extra terms, such as cities and localities in Mexico and other topic-related words.
  • For each word in the preprocessed tweet, we look it up with the Aspell API; if it does not appear, we consider the suggestions given by Aspell.
  • Very often, the top-ranked Aspell suggestion is not the best choice.
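
The lookup-then-suggest flow can be sketched without Aspell. The example below is a stand-in, not the Aspell API: it uses a hypothetical toy dictionary and Python's standard `difflib.get_close_matches` to produce suggestions for unknown words.

```python
import difflib

# Hypothetical toy dictionary standing in for the Aspell Spanish dictionary.
DICTIONARY = ["siempre", "separados", "estudiantes", "pseudo", "cafe", "monterrey"]

def check_word(word, dictionary=DICTIONARY, max_suggestions=5):
    """Return (is_known, suggestions) for one preprocessed token."""
    if word in dictionary:
        return True, []
    # difflib ranks candidates by a generic similarity ratio; Aspell uses
    # its own ranking, so the order of suggestions will differ.
    suggestions = difflib.get_close_matches(
        word, dictionary, n=max_suggestions, cutoff=0.6
    )
    return False, suggestions

for token in ["siempre", "sienpre"]:
    print(token, check_word(token))
```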

SLIDE 14

Normalization

Detection of non-conventional words

Consider pseudoestudiantes. The Aspell suggestions are:

 [1] "pseudo"          "estudiantes"     "pseudo-estudiantes"
 [4] "predestinares"   "predestines"     "predestinases"
 [7] "predestinareis"  "predestinase"    "predestinar"
[10] "predestinas"     "predestinasteis" "predestinaste"
[13] "predestinéis"    "sudestada"       "predestinaras"
[16] "predestinarás"   "predestinaseis"  "sudestadas"
[19] "predestináis"    "predestinadas"   "predestinados"
[22] "predestinabas"   "predestinamos"

We need to choose the appropriate word from the suggestions.

SLIDE 15

Normalization

Kernel methods and “string kernels”. Let $x, z \in X$ (input space). Consider the kernel function

$$k(x, z) = \langle \phi(x), \phi(z) \rangle,$$

where $\phi$ is a map $\phi : x \in X \mapsto \phi(x) \in H$ (feature space).

Kernel trick (Schölkopf and Smola, 2002):

Data $X$ → Kernel $k(x, x')$ → Gram matrix $K$ → Algorithm $A$ → Decision function $f(x) = \sum_i \alpha_i k(x_i, x)$

SLIDE 16

Normalization

String kernels (Lodhi 2002, Shawe-Taylor and Cristianini 2004, Watkins 2000, Herbrich 2002) provide a similarity measure between two documents $x$ and $y$. Let $s$ be a substring. The mapping to the feature space is given by

$$\phi_s(x) = \sum_{s \in x} \lambda^{L(s_x)},$$

where $\lambda \in (0, 1)$ is a weight and $L(s_x)$ is the length that the (possibly non-contiguous) substring $s$ spans in the document $x$.

Example: Consider $s$ = car:

  • if $x$ = “cara”, then $L(s_x) = 3$ (**car**a) and $\phi_s(x) = \lambda^3$,
  • if $x$ = “cuarto”, then $L(s_x) = 4$ (**c**u**ar**to) and $\phi_s(x) = \lambda^4$.
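
The two examples can be checked with a brute-force sketch that enumerates every occurrence of s in x as an increasing index tuple and adds λ raised to the span of that occurrence. The function name `phi` and the enumeration strategy are assumptions made for clarity; this is not an efficient algorithm.

```python
from itertools import combinations

def phi(s, x, lam):
    """Feature of document x for the (possibly gapped) substring s."""
    total = 0.0
    for idx in combinations(range(len(x)), len(s)):
        # idx is an occurrence of s if the selected characters spell s.
        if all(x[i] == ch for i, ch in zip(idx, s)):
            span = idx[-1] - idx[0] + 1   # L(s_x): first to last matched position
            total += lam ** span
    return total

lam = 0.5
print(phi("car", "cara", lam), lam ** 3)
print(phi("car", "cuarto", lam), lam ** 4)
```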

SLIDE 19

Normalization

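The kernel (dot product) between documents $x$ and $y$ is given by

$$k_n(x, y) = \sum_{s \in \Sigma^n} \sum_{s \subset x} \sum_{s \subset y} \lambda^{L(s_x) + L(s_y)},$$

where $\Sigma^n$ is the set of all substrings of size $n$ from a finite alphabet $\Sigma$.

Example: Consider the words cat, car, bat and bar with $|s| = 2$:

            c-a   c-t   a-t   b-a   b-t   c-r   a-r   b-r
  φ(cat)    λ^2   λ^3   λ^2   0     0     0     0     0
  φ(car)    λ^2   0     0     0     0     λ^3   λ^2   0
  φ(bat)    0     0     λ^2   λ^2   λ^3   0     0     0
  φ(bar)    0     0     0     λ^2   0     0     λ^2   λ^3

$k(\text{car}, \text{cat}) = \lambda^4$, and $k(\text{car}, \text{car}) = k(\text{cat}, \text{cat}) = 2\lambda^4 + \lambda^6$.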
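
The table and the kernel values above can be reproduced with a small brute-force sketch. It builds the explicit feature map of each word over all gapped substrings of length n and takes the dot product; the names `feature_map` and `string_kernel` are hypothetical, and practical implementations use dynamic programming instead of enumeration.

```python
import math
from collections import defaultdict
from itertools import combinations

def feature_map(word, n, lam):
    """Map each length-n (possibly gapped) substring to its weight in word."""
    phi = defaultdict(float)
    for idx in combinations(range(len(word)), n):
        s = "".join(word[i] for i in idx)
        phi[s] += lam ** (idx[-1] - idx[0] + 1)   # lambda ** L(s_x)
    return dict(phi)

def string_kernel(x, y, n, lam):
    """Dot product of the two feature maps."""
    phi_x, phi_y = feature_map(x, n, lam), feature_map(y, n, lam)
    return sum(v * phi_y[s] for s, v in phi_x.items() if s in phi_y)

lam = 0.5
print(feature_map("cat", 2, lam))
print(string_kernel("car", "cat", 2, lam), lam ** 4)
print(string_kernel("car", "car", 2, lam), 2 * lam ** 4 + lam ** 6)
```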

SLIDE 22

Normalization

  • There are different types of string kernels (spectrum, constant, sequence, exponential, boundrange), depending on the weight λ and the substring length.
  • If λ(s) is nonzero only for substrings that start and end with white space, we obtain the “bag of words” kernel.
  • String kernels can be computationally expensive for large documents.

SLIDE 23

Normalization

Substitution of non-standard words

Kernel PCA based on the Aspell suggestions for pseudoestudiantes. We used a sequence string kernel with substring length 3 and λ = 0.5.

[Figure: Aspell suggestions for pseudoestudiantes projected onto the 1st and 2nd principal components.]

SLIDE 24

Normalization

Substitution of non-standard words

The projection of pseudoestudiantes onto the first and second principal components is shown in the figure.

[Figure: Kernel PCA projection of the Aspell suggestions together with pseudoestudiantes; axes: 1st Principal Component, 2nd Principal Component.]

Using the minimum distance criterion with 3 principal components, the most similar word is pseudo-estudiantes, which is correct.
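
The selection step can be illustrated with a simplified variant. Instead of projecting onto the first three principal components, the sketch below measures the distance in the full kernel feature space, using $d^2(x, z) = k(x, x) - 2k(x, z) + k(z, z)$. This is a deliberate simplification and not the Kernel PCA procedure of the slides; the brute-force gapped-substring kernel and all function names are assumptions.

```python
import math
from collections import defaultdict
from itertools import combinations

def feature_map(word, n, lam):
    phi = defaultdict(float)
    for idx in combinations(range(len(word)), n):
        phi["".join(word[i] for i in idx)] += lam ** (idx[-1] - idx[0] + 1)
    return phi

def kernel(x, y, n, lam):
    phi_x, phi_y = feature_map(x, n, lam), feature_map(y, n, lam)
    return sum(v * phi_y[s] for s, v in phi_x.items() if s in phi_y)

def kernel_distance(x, z, n=3, lam=0.5):
    # Distance in feature space, computed with kernel evaluations only.
    squared = kernel(x, x, n, lam) - 2 * kernel(x, z, n, lam) + kernel(z, z, n, lam)
    return math.sqrt(max(squared, 0.0))

def closest_candidate(word, candidates, n=3, lam=0.5):
    # Minimum distance criterion over the spell-checker suggestions.
    return min(candidates, key=lambda c: kernel_distance(word, c, n, lam))

print(closest_candidate("sienpre", ["separados", "siempre"], n=2))
```

Because the unnormalized kernel grows with word length, long candidates are penalized; a normalized kernel or the PCA projection may rank suggestions differently.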

SLIDE 25

Normalization

Simulation test: randomly change 1 letter in a sample of 200 words from the Spanish dictionary.

[Figure: error by weight value (val.lam: 0.5, 1.1, 1.5, 2), in panels for substring lengths 2, 3 and 4 (len2, len3, len4); methods compared: aspell, bound, constant, exp, sequence, spectrum.]

SLIDE 26

Normalization

Simulation test: randomly change 2 letters in a sample of 200 words from the Spanish dictionary.

[Figure: error by weight value (val.lam: 0.5, 1.1, 1.5, 2), in panels for substring lengths 2, 3 and 4 (len2, len3, len4); methods compared: aspell, bound, constant, exp, sequence, spectrum.]
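
A simulation of this kind could be set up roughly as below. The sketch replaces a chosen number of letters at random positions and measures how often a correction function fails to recover the original word. The `correct` callable is a hypothetical placeholder for any normalization method, and the use of unaccented lowercase ASCII letters as replacements is an assumption.

```python
import random
import string

def perturb(word, n_changes, rng):
    """Replace n_changes distinct positions of word with a different letter."""
    letters = list(word)
    for pos in rng.sample(range(len(word)), n_changes):
        # Assumption: replacements come from unaccented lowercase letters.
        alternatives = [ch for ch in string.ascii_lowercase if ch != letters[pos]]
        letters[pos] = rng.choice(alternatives)
    return "".join(letters)

def error_rate(words, n_changes, correct, seed=0):
    """Fraction of words that `correct` fails to restore after perturbation."""
    rng = random.Random(seed)
    misses = sum(correct(perturb(w, n_changes, rng)) != w for w in words)
    return misses / len(words)

# An identity "corrector" never restores a perturbed word.
print(error_rate(["cafe", "siempre", "separados"], 1, correct=lambda w: w))
```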

SLIDE 27

Normalization

Some results:

Original

1. Ese cafe del oxxo si que levanta!!! Puuues a chambearts!!!
2. No se cual cafe sea mas malo si el del @7ElevenMexico o el de @Tiendas_OXXO pero ambos son malisiiiiimoooooosss!!! Pesima mezcla
3. @paolastonexxx eso es lo bonito... Puedo pagar en OXXO. :)

Normalized

1. ese cafe si levanta pues chamba
2. no se cual cafe sea mas malo si pero ambos son musimos pesima mezcla
3. eso es bonito puedo pagar emoticon-positivo

SLIDE 28

Classification

SLIDE 29

Classification

We implement a classification algorithm similar to that used by Melville et al. (2013).

  • We used bag of words on the preprocessed and normalized tweets (positive, negative and neutral) to obtain relevant words for each category, using a mutual information measure (Yang and Pedersen, 1997).
  • We add a topic-related list of words for each category as a priori information.
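
As an illustration of scoring words by category, the sketch below estimates the mutual information between the presence of a term and a category from document counts, $I(t, c) \approx \log \frac{A \cdot N}{(A + C)(A + B)}$, where $A$ counts documents in category $c$ that contain $t$, $A + B$ all documents that contain $t$, $A + C$ all documents in $c$, and $N$ all documents. The toy documents and the function name are made up, and the exact variant used in the original work is not stated.

```python
import math

def mutual_information(term, category, docs):
    """docs is a list of (tokens, label) pairs."""
    n = len(docs)
    both = sum(1 for tokens, label in docs if term in tokens and label == category)
    with_term = sum(1 for tokens, _ in docs if term in tokens)
    in_category = sum(1 for _, label in docs if label == category)
    if both == 0:
        return float("-inf")   # the term never occurs in this category
    return math.log(both * n / (with_term * in_category))

# Made-up labelled tweets (P = positive, N = negative).
docs = [
    (["bueno", "cafe"], "P"),
    (["malo", "cafe"], "N"),
    (["bueno"], "P"),
    (["pesimo"], "N"),
]
for term in ["bueno", "cafe", "malo"]:
    print(term, mutual_information(term, "P", docs))
```

A term spread evenly over the categories, such as "cafe" here, scores zero and would not be selected as relevant.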

SLIDE 30

Classification

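We implement a multinomial naive Bayes classifier, where the class $c^*$ of a tweet $d$ is given by

$$c^* = \arg\max_c \, p(c \mid d),$$

where

$$p(c \mid d) = \alpha_1 p_1(c \mid d) + \alpha_2 p_2(c \mid d),$$

  • $p_1(c \mid d)$, $\alpha_1$: class probabilities and weight using bag of words
  • $p_2(c \mid d)$, $\alpha_2$: class probabilities and weight using the topic-related words.

And

$$p(c \mid d) = \frac{p(c) \prod_i p(w_i \mid c)^{n_i(d)}}{p(d)},$$

where $n_i(d)$ is the number of times the word $w_i$ appears in $d$.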
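
The pooled posterior can be sketched as follows. Several choices here are assumptions rather than details from the slides: Laplace smoothing, equal weights α1 = α2 = 0.5, and the representation of the topic-related word lists as one pseudo-document per class. All data and function names are made up for illustration.

```python
import math
from collections import Counter

def train_multinomial(docs, smoothing=1.0):
    """docs: list of (tokens, label). Returns (priors, word_probs)."""
    vocab = sorted({w for tokens, _ in docs for w in tokens})
    priors, word_probs = {}, {}
    for c in sorted({label for _, label in docs}):
        class_docs = [tokens for tokens, label in docs if label == c]
        priors[c] = len(class_docs) / len(docs)
        counts = Counter(w for tokens in class_docs for w in tokens)
        total = sum(counts.values()) + smoothing * len(vocab)
        word_probs[c] = {w: (counts[w] + smoothing) / total for w in vocab}
    return priors, word_probs

def posterior(tokens, model):
    """p(c|d) = p(c) * prod_i p(w_i|c)^n_i(d) / p(d), computed in log space."""
    priors, word_probs = model
    log_scores = {
        c: math.log(priors[c])
        + sum(math.log(word_probs[c][w]) for w in tokens if w in word_probs[c])
        for c in priors
    }
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())   # plays the role of p(d)
    return {c: v / z for c, v in unnormalized.items()}

def pooled_posterior(tokens, bow_model, topic_model, alpha1=0.5, alpha2=0.5):
    p1, p2 = posterior(tokens, bow_model), posterior(tokens, topic_model)
    return {c: alpha1 * p1[c] + alpha2 * p2[c] for c in p1}

tweets = [(["cafe", "bueno"], "P"), (["cafe", "malo"], "N")]
topic_words = [(["bueno"], "P"), (["malo", "pesimo"], "N")]   # pseudo-documents
bow_model = train_multinomial(tweets)
topic_model = train_multinomial(topic_words)
scores = pooled_posterior(["cafe", "malo"], bow_model, topic_model)
predicted = max(scores, key=scores.get)
print(predicted, scores)
```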

SLIDE 31

Classification

  • We use 800 previously classified tweets.
  • We use 80% for training and 20% for testing, following a cross-validation criterion.
  • The Mean Average Error (MAE) for the training data was 0.192 ± 0.015.
  • The MAE for the test data was 0.23 ± 0.021.
  • But...

SLIDE 35

Classification

Error: 0.195

[Figure: confusion matrix of estimated class against true class for the categories N, O and P; cell values 15, 5, 9, 479, 63, 14, 28, 6.]

SLIDE 36

Classification

[Figure: distribution of categories (P, N, O); axis scale from 0.0 to 0.8.]

It is necessary to use classification algorithms sensitive to unbalanced categories
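
A toy example shows why the overall error can be misleading when one category dominates. The labels below are made up and do not come from the 800 tweets: a classifier that always predicts the majority class gets a low error but never recovers the minority classes, which per-class recall makes visible.

```python
from collections import Counter

def error_rate(y_true, y_pred):
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred):
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

# Made-up labels: 8 tweets of class O, 1 of class P and 1 of class N.
y_true = ["O"] * 8 + ["P", "N"]
y_pred = ["O"] * 10              # always predict the majority class

recalls = per_class_recall(y_true, y_pred)
balanced = sum(recalls.values()) / len(recalls)
print(error_rate(y_true, y_pred), recalls, balanced)
```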

SLIDE 38

Work in progress

SLIDE 39

Work in progress

  • We are improving tweet normalization:
      • Implementing the Metaphone algorithm
      • Testing different types of string kernels
      • Improving the a priori information (word lists for normalization and classification)
  • Hashtag information
  • Classification methods (SVM, boosting)
  • Cost-sensitive classification
  • Spatial and temporal analysis of tweets

SLIDE 40

Thank you for your attention!
