Information Retrieval for Sentiment Analysis: Weighting Schemes for Enhancing Classification Accuracy


  1. Information Retrieval for Sentiment Analysis: Weighting Schemes for Enhancing Classification Accuracy. Krithika Verma (kverma2@umbc.edu), CMSC-676 Information Retrieval

  2. Introduction
  • Sentiment analysis (SA) → a mechanism for processing vast amounts of information and extracting insightful content.
  • The information is available in the form of blogs, tweets, social networks, reviews, etc.
  • Examples: the government sector, business intelligence for politics, market research.
  • What is classification in SA? ➢ Identify whether a text is subjective or objective ➢ Determine the polarity of a given text, i.e. positive, negative, or neutral
  • Document representation → an important part of SA

  3. Delta TF-IDF for SA
  • Words are weighted using the difference between their tf.idf scores in positive-class and negative-class documents.
  • An SVM is used with delta tf.idf to show the improved accuracy.
  • Why SVM?
  • Method ➢ Assigns feature values for a document based on the difference of a word's tf.idf scores in the positive and negative corpora.
  • Feature sign ➢ More frequent in the negative training set than in the positive → positive score ➢ More frequent in the positive training set than in the negative → negative score
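The feature assignment described above can be sketched as follows. This is a minimal illustration of the delta tf.idf idea (tf times the difference of the term's idf in the positive and negative corpora), not the paper's exact implementation; the function name and the assumption of nonzero document frequencies are mine.

```python
import math

def delta_tfidf(tf, n_pos_docs, df_pos, n_neg_docs, df_neg):
    """Delta tf.idf feature value for one term in one document:
    term frequency times the difference between the term's idf in
    the positive and negative training corpora. Terms concentrated
    in the negative corpus get a positive score and vice versa,
    matching the sign convention on the slide. Document frequencies
    are assumed nonzero here; real code needs smoothing."""
    idf_pos = math.log2(n_pos_docs / df_pos)
    idf_neg = math.log2(n_neg_docs / df_neg)
    return tf * (idf_pos - idf_neg)
```

A term spread evenly across both corpora scores zero, so purely topical words contribute little to the SVM's decision.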

  4. Delta tf.idf features are more sentimental
  • Created a subjectivity detector using Pang and Lee's subjectivity dataset.
  • The transformation yields a clear improvement, at a 99.9% confidence level, over the baseline bag of words.

  5. Related Work – Simple tf.idf and Delta tf.idf
  • Raw tf in classification results in decreased accuracy.
  • Idf expresses no class preference.
  • To solve this issue, delta tf.idf was introduced by Martineau and Finin.
  • Delta tf.idf outperforms simple tf and the binary weighting scheme.
  • It fails to take into consideration ➢ the non-linearity of tf with respect to document relevancy ➢ smoothing for the df_i factor

  6. SMART Notation and BM25
  • In SMART notation, a term weight is the product of a triple of factors: ➢ a term frequency factor (local factor) ➢ an inverse document frequency factor (global factor) ➢ a normalization factor
  • Examples: SMART bnn → binary document representation; nnn → raw term frequency representation
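The triple can be made concrete with a small dispatcher. This is a sketch covering only the letters the slide mentions plus the classic idf letter 't'; the letter-to-factor mapping follows the usual SMART convention, and normalization is left as the identity ('n') for brevity.

```python
import math

def smart_weight(tf, df, n_docs, scheme="nnn"):
    """Term weight as a product of SMART-style factors, one per
    letter of the scheme string: local (tf) factor, global (idf)
    factor, normalization. Only 'n'/'b' local and 'n'/'t' global
    letters are implemented in this sketch."""
    local, global_, norm = scheme
    tf_factor = {"n": tf,                       # raw term frequency
                 "b": 1.0 if tf > 0 else 0.0    # binary presence
                 }[local]
    idf_factor = {"n": 1.0,                     # no global factor
                  "t": math.log10(n_docs / df)  # classic idf
                  }[global_]
    # 'n' normalization = none; cosine/pivoted variants omitted here.
    return tf_factor * idf_factor
```

So "bnn" reduces every document to presence/absence, while "ntn" recovers plain tf.idf.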

  7. Extending SMART Notation and BM25 with Delta TF.IDF
  • The smoothing factor is added to the product of df_i with N_i rather than to df_i alone.
  • The best accuracy, 96.90%, is attained using BM25 tf weights with the BM25 delta idf variant.
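A hedged sketch of the combination the slide describes: a BM25-saturated tf factor multiplied by a smoothed delta idf in which, per the slide, the 0.5 smoothing term is added to the product N_i · df_i rather than to df_i alone. The k1 and b values are conventional BM25 defaults (not taken from the slides), and the sign convention follows slide 3 (negative-corpus terms score positive).

```python
import math

def bm25_delta_weight(tf, doc_len, avg_doc_len,
                      n_pos, df_pos, n_neg, df_neg,
                      k1=1.2, b=0.75):
    """BM25 tf weight times a smoothed delta idf.
    The BM25 factor saturates raw tf and normalizes by document
    length; the delta idf compares the term's prevalence across the
    positive and negative training corpora, with 0.5 smoothing on
    the N_i * df_i products so empty counts stay finite."""
    bm25_tf = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf)
    delta_idf = math.log2((n_pos * df_neg + 0.5) / (n_neg * df_pos + 0.5))
    return bm25_tf * delta_idf
```

As with plain delta tf.idf, a term distributed identically across both corpora gets weight zero.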

  8. Related Supervised Term Weighting Methods
  • Text classification → a supervised learning task.
  • Example: inverse category frequency → the fewer the categories in which a term occurs, the greater the discriminating power of that term.
  • The fundamental elements of supervised term weighting are used to compute the global importance of a term t_i for a category c_k.
  • Relevance frequency → considers the term's distribution in the positive and negative examples: in multi-label text categorization, the more a term appears in positive examples relative to negative ones, the greater its contribution to categorization.

  9. Supervised Variant of tf.idf
  • Idea → avoid decreasing the weight of terms contained in documents belonging to the same category, unlike what happens with idf.
  • Idfec (inverse document frequency excluding category): document counts are taken over D_T \ C_k → the training documents not labeled with C_k.
  • How it helps? ➢ Improves classification effectiveness ➢ A term not belonging to category C_k is penalized as in tf.idf ➢ A term appearing in category C_k retains a high global weight
  • Similar to tf.rf, as both penalize the weight of a term t_i according to the number of negative examples in which t_i appears.
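The idfec computation restricted to out-of-category documents can be sketched as follows; the formula is inferred from the worked example on the next slide, and the max(1, ...) guard against a term that never occurs outside C_k is my addition, not from the slides.

```python
import math

def idfec(n_docs_outside_ck, df_outside_ck):
    """Inverse document frequency excluding category: both the
    collection size and the document frequency are counted only
    over training documents NOT labeled with c_k, so a term
    concentrated inside c_k keeps a high global weight instead of
    being penalized for being frequent."""
    # Guard against a zero denominator when the term never occurs
    # outside c_k (an assumption, not from the slides).
    return math.log10(n_docs_outside_ck / max(1, df_outside_ck))
```

With the slide-10 numbers (70 documents outside C_k, 5 of them containing t_1), this gives log10(14) ≈ 1.15.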

  10. Example
  • A corpus of 100 training documents containing two terms, t_1 and t_2, with category C_k.
  • For term t_1: ➢ idf(t_1) = log(100/(27 + 5)) = log(3.125) → 0.49 ➢ rf(t_1, C_k) = log(2 + 27/5) = log(7.4) → 0.86 ➢ idfec(t_1, C_k) = log((65 + 5)/5) = log(14) → 1.14
  • For term t_2: ➢ idf(t_2) = log(2.857) → 0.46 ➢ rf(t_2, C_k) = log(2.4) → 0.38 ➢ idfec(t_2, C_k) = log(2.8) → 0.44
  • Conclusion: t_1 is more relevant for classifying a document into the category.
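The slide's arithmetic can be reproduced with base-10 logarithms. The rf formula assumed here is rf(t, C_k) = log10(2 + a/c), with a = documents of C_k containing t and c = out-of-category documents containing t; the slide's printed values are truncated to two digits, so a couple differ from the rounded results in the last place.

```python
import math

log10 = math.log10

idf_t1   = log10(100 / (27 + 5))  # log10(3.125) ≈ 0.49
rf_t1    = log10(2 + 27 / 5)      # log10(7.4)   ≈ 0.87
idfec_t1 = log10((65 + 5) / 5)    # log10(14)    ≈ 1.15

idf_t2   = log10(2.857)           # ≈ 0.46
rf_t2    = log10(2.4)             # ≈ 0.38
idfec_t2 = log10(2.8)             # ≈ 0.45

# t_1 scores higher than t_2 on both supervised measures (rf, idfec),
# supporting the slide's conclusion.
assert rf_t1 > rf_t2 and idfec_t1 > idfec_t2
```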

  11. Related Work
  • Delta TFIDF outperforms the baseline with statistical significance at the 95% level on a two-tailed t-test.
  ➢ Pang and Lee's approach requires an additional trained SVM subjectivity classifier, which requires even more labeled data.
  ➢ A subjectivity detector was created using Pang and Lee's subjectivity dataset.
  ➢ The transformation yields a clear improvement, at a 99.9% confidence level, over the baseline bag of words.
  • Reviewed existing unsupervised and supervised methods and proposed a novel solution as a modification of the classic tf.idf scheme.

  12. Future Work
  • Plan to test the proposed weighting functions in other domains, such as topic classification, and to extend the approach to accommodate multi-class classification.
  • Improve the technique using redundancy, as redundancy is more effective than idf weights.
  • Plan to test further variants of the supervised scheme and perform tests on larger datasets.

  13. References
  • Georgios Paltoglou and Mike Thelwall. A Study of Information Retrieval Weighting Schemes for Sentiment Analysis. https://www.aclweb.org/anthology/P10-1141/
  • Justin Martineau and Tim Finin. Delta TFIDF: An Improved Feature Space for Sentiment Analysis. https://www.researchgate.net/publication/221298092_Delta_TFIDF_An_Improved_Feature_Space_for_Sentiment_Analysis
  • A Comparison of Term Weighting Schemes for Text Classification and Sentiment Analysis with a Supervised Variant of tf.idf. Conference paper. https://www.researchgate.net/publication/299278964_A_Comparison_of_Term_Weighting_Schemes_for_Text_Classification_and_Sentiment_Analysis_with_a_Supervised_Variant_of_tfidf
  • Supervised Term Weighting Metrics for Sentiment Analysis in Short Text. https://arxiv.org/abs/1610.03106
