
Feature extraction for sentiment analysis on Twitter data with Spanish language
Victor Muñiz
Research Center in Mathematics. Monterrey, Mexico.
Victor Muñiz (CIMAT Mty), Sentiment Analysis, June 2015

Introduction
Sentiment Analysis focuses on automatically identifying whether a text expresses a positive, negative or neutral opinion about some topic.

Introduction
Among all virtual opinion platforms, Twitter has become the most popular for sentiment analysis for several reasons:
- Availability of information
- Large amount of data
- Constant updates
- Worldwide availability

Introduction
Twitter's popularity for sentiment analysis also stems from its many applications:
- Opinion-based marketing
- Online ranking
- Government and politics
- Official statistics
- Among many others...

Introduction
One of the most popular techniques for text classification is the Bag of Words (Joachims, 1998), which constructs a Term Document Matrix based on term frequencies.
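As a minimal sketch of what a Term Document Matrix looks like, the following Python snippet counts term frequencies per document using only the standard library. The function name `term_document_matrix`, the whitespace tokenization and the two sample sentences are assumptions made for illustration; they are not taken from the slides.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] is the count of term j in document i."""
    # Naive tokenization: lowercase and split on whitespace (an assumption of this sketch).
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is the sorted set of all terms seen in any document.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


# Hypothetical sample documents.
docs = ["el servicio es bueno", "el servicio es malo malo"]
vocab, tdm = term_document_matrix(docs)
for term, column in zip(vocab, zip(*tdm)):
    print(term, column)
```

Each row of the matrix is one document and each column one term, so any spelling variant of a word adds a new, mostly empty column.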

Introduction
However, on Twitter data, the application of this (or any) technique is not straightforward:
- Short text
- Misspellings
- Abbreviations and non-standard contractions
- Emoticons, hashtags
- Unbalanced classes
Example tweets:
- Andas bien loco @Telcel con la zona horaria d tu RED, a cada rato m mueves la Hr.?? #chidotucotorreo @ServicioTelcel http://t.co/QoOX3OCYxt
- @Profeco @Tiendas_OXXO no cumple con algunos requerimientos como tipos de bebida falsos asi como la falta del precio :(
- No nos deja pasar el cadenero del oxxo gooey! k pedo 100pre me pasaaaa!!! #Queoso

Introduction
Standard preprocessing techniques are not enough on Twitter data, because we generally find several variants of a word with the same meaning:
- pseudo-estudiantes = pseudoestudiantes = seudoestudiantes = seudestudiantes
- separados = separa2
- siempre = sienpre = 100pre
This problem produces a sparse Term Document Matrix.
Bag of words is not enough. We need to incorporate contextual (a priori) information.
The challenge is to extract the main features of the tweet, which give us insight into the sentiment (polarity) of the text.

Introduction
- There is a lot of work on both feature extraction and classification for tweets; however, the vast majority focuses on English text.
- Some previous work on lexical normalization of Spanish text has been done (Mosqueda & Moreda, 2012); however, there are important differences between countries and regions, even within the same language. This must be taken into account.
- The objective of our work is to implement a normalization method for Spanish text using kernel-based methods, in order to obtain important features that can be used as input for a classification method.

Preprocessing and normalization

Normalization
Data: We obtained and manually classified tweets from the Twitter API (https://dev.twitter.com/) according to some specific topics (e.g., convenience stores, cellphone services, etc.).
Standard text preprocessing:
- Convert to lowercase
- Remove Spanish stopwords according to the list given by Martin Porter's Snowball stemming project (http://snowball.tartarus.org/). We add some words related to the topic.
- Remove special characters: URLs, @, RT, -, :, among others

Normalization
- Remove repeated characters and excess white space
- Emoticon substitution according to the list at en.wikipedia.org/wiki/List_of_emoticons. For instance:

Emoticon  Replacement token        Emoticon  Replacement token
:-)       emoticon-positivo        >:[       emoticon-negativo
:)        emoticon-positivo        =(        emoticon-negativo
:o)       emoticon-positivo        :-[       emoticon-negativo
:c)       emoticon-positivo        :-||      emoticon-muy-negativo
:-D       emoticon-muy-positivo    >:(       emoticon-muy-negativo
X-D       emoticon-muy-positivo    :|        emoticon-neutral
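The preprocessing steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's implementation: the stopword set is a tiny placeholder rather than the Snowball list, runs of three or more identical characters are collapsed to one (the slides do not say how repeats are handled), and emoticons are substituted before lowercasing so that tokens such as `:-D` are still recognized. The function name `preprocess` is hypothetical.

```python
import re

# Emoticon table from the slides; the replacement tokens are kept as given.
EMOTICONS = {
    ":-)": "emoticon-positivo", ":)": "emoticon-positivo",
    ":o)": "emoticon-positivo", ":c)": "emoticon-positivo",
    ":-D": "emoticon-muy-positivo", "X-D": "emoticon-muy-positivo",
    ">:[": "emoticon-negativo", "=(": "emoticon-negativo",
    ":-[": "emoticon-negativo", ":-||": "emoticon-muy-negativo",
    ">:(": "emoticon-muy-negativo", ":|": "emoticon-neutral",
}
EMOTICON_TOKENS = set(EMOTICONS.values())

# Placeholder stopwords (assumption); the slides use the Snowball Spanish list.
STOPWORDS = {"el", "la", "de", "con", "es", "que"}


def preprocess(tweet):
    """Return a cleaned, space-separated version of a tweet."""
    text = re.sub(r"https?://\S+", " ", tweet)   # drop URLs
    text = re.sub(r"\bRT\b", " ", text)          # drop the retweet marker
    # Substitute emoticons, longest first, before punctuation is stripped.
    for emo in sorted(EMOTICONS, key=len, reverse=True):
        text = text.replace(emo, f" {EMOTICONS[emo]} ")
    tokens = []
    for tok in text.lower().split():             # split() also removes extra spaces
        if tok not in EMOTICON_TOKENS:
            tok = re.sub(r"[^\w]", "", tok)          # strip @, #, -, :, ! and similar
            tok = re.sub(r"(.)\1{2,}", r"\1", tok)   # collapse runs of 3+ characters
        if tok and tok not in STOPWORDS:
            tokens.append(tok)
    return " ".join(tokens)


print(preprocess("RT @Telcel el servicio es buenisimooo :-D http://t.co/abc"))
```

Collapsing only runs of three or more characters leaves legitimate double letters such as "ll" or "rr" untouched, which matters for Spanish.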

Normalization
The normalization process consists of:
1. Detection of non-conventional words
2. Substitution with similar words (hopefully the correct ones in terms of linguistic meaning)

Normalization
Detection of non-conventional words
- We used Aspell (http://aspell.net/) with a Spanish dictionary, and we added extra terms, such as cities and localities in Mexico and others related to the topic.
- For each word in the preprocessed tweet, we look it up with the Aspell API; if it does not appear, we consider the suggestions given by Aspell.
- Very often, the top-ranked suggestion by Aspell is not the best choice.
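The lookup-then-suggest flow can be illustrated with a stand-in. The sketch below does not call Aspell: it replaces the Aspell dictionary with a small hypothetical word set and replaces Aspell's suggestion engine with `difflib.get_close_matches` from the Python standard library, so the suggestions and their ranking will differ from what Aspell returns. The names `DICTIONARY` and `check_word` are assumptions for illustration.

```python
import difflib

# Hypothetical toy dictionary; the slides use Aspell's Spanish dictionary
# extended with Mexican localities and topic-specific terms.
DICTIONARY = {"siempre", "separados", "estudiantes", "servicio", "monterrey"}


def check_word(word, dictionary=DICTIONARY, max_suggestions=5):
    """Return (is_known, suggestions); suggestions is empty for known words."""
    if word in dictionary:
        return True, []
    # Stand-in for Aspell's suggestions: closest dictionary entries by
    # difflib's similarity ratio, best match first.
    suggestions = difflib.get_close_matches(
        word, sorted(dictionary), n=max_suggestions, cutoff=0.6
    )
    return False, suggestions


for w in ["siempre", "sienpre"]:
    print(w, check_word(w))
```

Variants such as "100pre" share few characters with "siempre", so a purely character-based suggester may return nothing useful for them.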

Normalization
Detection of non-conventional words
Consider pseudoestudiantes:
 [1] "pseudo"          "estudiantes"     "pseudo-estudiantes"
 [4] "predestinares"   "predestines"     "predestinases"
 [7] "predestinareis"  "predestinase"    "predestinar"
[10] "predestinas"     "predestinasteis" "predestinaste"
[13] "predestinis"     "sudestada"       "predestinaras"
[16] "predestinars"    "predestinaseis"  "sudestadas"
[19] "predestinis"     "predestinadas"   "predestinados"
[22] "predestinabas"   "predestinamos"
We need to choose the appropriate word from the suggestions.

Normalization
Kernel methods and "string kernels".
Let $x, z \in X$ (input space). Consider the kernel function
$$k(x, z) = \langle \phi(x), \phi(z) \rangle,$$
where $\phi$ is a map
$$\phi : x \in X \mapsto \phi(x) \in H \quad \text{(feature space)}.$$
Kernel trick (Schölkopf and Smola, 2002):
Data $X$ → Kernel $k(x, x')$ → Gram matrix $K$ → Algorithm $A$ → Decision function $f(x) = \sum_i \alpha_i k(x_i, x)$

Normalization
String kernels (Lodhi 2002, Shawe-Taylor and Cristianini 2004, Watkins 2000, Herbrich 2002) provide a similarity measure between two documents $x$ and $y$. Let $s$ be a substring (its characters need not be contiguous in the document). The mapping to the feature space is given by
$$\phi_s(x) = \sum_{s \in x} \lambda^{L(s_x)},$$
where $\lambda \in (0, 1)$ is a weight and $L(s_x)$ is the length of the span that the substring $s$ occupies in the document $x$.
Example: consider $s$ = car:
- if $x$ = "cara", then $L(s_x) = 3$ (the span "car" in "cara"), so $\phi_s(x) = \lambda^3$;
- if $x$ = "cuarto", then $L(s_x) = 4$ (the span "cuar" in "cuarto"), so $\phi_s(x) = \lambda^4$.
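The feature value $\phi_s(x)$ can be checked with a brute-force sketch that enumerates every way the characters of $s$ appear in order inside $x$ and adds $\lambda$ raised to the length of the span each occurrence covers. The function name `phi` and the choice $\lambda = 0.5$ are assumptions for illustration; the enumeration is exponential in general and is only meant for short words.

```python
from itertools import combinations


def phi(s, x, lam):
    """Sum of lam**span over all in-order occurrences of s inside x."""
    if not s:
        return 0.0
    total = 0.0
    # Try every increasing tuple of positions in x with as many entries as s has characters.
    for idx in combinations(range(len(x)), len(s)):
        if all(x[i] == c for i, c in zip(idx, s)):
            span = idx[-1] - idx[0] + 1   # L(s_x): first to last matched position
            total += lam ** span
    return total


lam = 0.5
print(phi("car", "cara", lam))    # one occurrence with span 3
print(phi("car", "cuarto", lam))  # one occurrence with span 4
```

With $\lambda < 1$, occurrences that are spread over a longer span contribute less, which is why "cuarto" scores lower than "cara" for the same substring.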

Normalization
The kernel (dot product) between documents $x$ and $y$ is given by
$$k_n(x, y) = \sum_{s \in \Sigma^n} \sum_{s \subset x} \sum_{s \subset y} \lambda^{L(s_x) + L(s_y)},$$
where $\Sigma^n$ is the set of all substrings of size $n$ from a finite alphabet $\Sigma$.
Example: consider the words cat, car, bat and bar with $|s| = 2$:

          c-a   c-t   a-t   b-a   b-t   c-r   a-r   b-r
φ(cat)    λ^2   λ^3   λ^2   0     0     0     0     0
φ(car)    λ^2   0     0     0     0     λ^3   λ^2   0
φ(bat)    0     0     λ^2   λ^2   λ^3   0     0     0
φ(bar)    0     0     0     λ^2   0     0     λ^2   λ^3

$k(\text{car}, \text{cat}) = \lambda^4$, $k(\text{car}, \text{car}) = k(\text{cat}, \text{cat}) = 2\lambda^4 + \lambda^6$.
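The worked example can be reproduced with a small brute-force sketch: build the feature map over all length-$n$ substrings (characters in order, not necessarily contiguous) and take the dot product. The names `feature_map` and `string_kernel` and the value $\lambda = 0.5$ are assumptions for illustration, $n \ge 1$ is required, and this direct enumeration is not the efficient dynamic-programming formulation used in practice.

```python
from collections import defaultdict
from itertools import combinations


def feature_map(x, n, lam):
    """Map each length-n substring of x (gaps allowed) to its summed weight lam**span."""
    phi = defaultdict(float)
    for idx in combinations(range(len(x)), n):
        s = "".join(x[i] for i in idx)
        phi[s] += lam ** (idx[-1] - idx[0] + 1)
    return phi


def string_kernel(x, y, n, lam):
    """Dot product of the two feature maps, i.e. k_n(x, y)."""
    phi_x = feature_map(x, n, lam)
    phi_y = feature_map(y, n, lam)
    return sum(w * phi_y[s] for s, w in phi_x.items() if s in phi_y)


lam = 0.5
print(string_kernel("car", "cat", 2, lam), lam ** 4)
print(string_kernel("car", "car", 2, lam), 2 * lam ** 4 + lam ** 6)
```

One plausible use, which the slides shown here do not confirm, is to score each Aspell suggestion against the misspelled word with such a kernel and keep the highest-scoring candidate.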
