Information Retrieval for Sentiment Analysis



slide-1
SLIDE 1

Information Retrieval for Sentiment Analysis

Weighting Schemes for enhancing classification accuracy

Krithika Verma (kverma2@umbc.edu) CMSC-676 Information Retrieval

slide-2
SLIDE 2

Introduction

  • Sentiment analysis (SA) → a mechanism for processing vast amounts of information and extracting insightful content.
  • Information is available in the form of → blogs, tweets, social networks, reviews, etc.
  • Ex: Government sector, BI for politics, market research
  • What is classification in SA?

➢Identify whether a text is subjective or objective
➢Determine the polarity of a given text, i.e. positive, negative or neutral

  • Document representation → important part in SA
slide-3
SLIDE 3

Delta TFIDF for SA

  • Words are weighted using the difference between their tf.idf scores in positive and negative class documents.
  • SVM is used with delta tf.idf to show the improved accuracy
  • Why SVM?
  • Method

➢Assigns feature values for a document based on the difference of each word's tf.idf scores in the positive and negative corpora.

  • Features that are more prominent in

➢the negative training set than the positive training set → positive score
➢the positive training set than the negative training set → negative score
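The method above can be illustrated with a minimal Python sketch. It assumes each training document is a set of tokens and uses base-2 logarithms. The function name `delta_tfidf`, the toy corpora, and the add-one smoothing are illustrative assumptions and are not taken from the slides.

```python
import math
from collections import Counter


def delta_tfidf(doc_tokens, pos_docs, neg_docs):
    """Return {term: Delta TFIDF weight} for one tokenized document.

    pos_docs / neg_docs are the positive / negative training documents,
    each given as a set of tokens.
    """
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    weights = {}
    for term, count in Counter(doc_tokens).items():
        # Document frequency of the term in each class.
        df_pos = sum(term in d for d in pos_docs)
        df_neg = sum(term in d for d in neg_docs)
        # Assumption (not from the slides): add-one smoothing so that a
        # term unseen in one class does not cause a division by zero.
        idf_pos = math.log2((n_pos + 1) / (df_pos + 1))
        idf_neg = math.log2((n_neg + 1) / (df_neg + 1))
        # tf.idf in the positive corpus minus tf.idf in the negative corpus.
        weights[term] = count * (idf_pos - idf_neg)
    return weights


# Hypothetical toy corpora, for illustration only.
pos = [{"great", "movie"}, {"great", "plot"}]
neg = [{"awful", "movie"}, {"awful", "plot"}]
weights = delta_tfidf(["great", "movie", "great"], pos, neg)
print(weights)
```

In this toy run, "great" occurs only in positive documents and so receives a negative weight, while "movie" is evenly distributed across both classes and receives a weight of zero, which matches the sign convention listed above.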

slide-4
SLIDE 4

  • Delta tf.idf features are more sentimental
  • Created a subjectivity detector using Pang and Lee's subjectivity data-set
  • The transformation yields a clear improvement with a 99.9% confidence interval over the baseline bag of words

slide-5
SLIDE 5

Related Work – Simple tf.idf and delta tf.idf

  • Using tf in classification results in decreased accuracy
  • idf has no additional class preference.
  • To solve this issue, delta tf.idf was introduced by Martineau and Finin
  • Delta tf.idf performs better than the simple tf and binary weighting schemes
  • Delta tf.idf fails to take into consideration
      • the non-linearity of tf to document relevancy
      • smoothing for the dfi,j factor (none is applied)
slide-6
SLIDE 6

SMART NOTATION AND BM25

  • As per SMART notation, a term weight is a function of three factors, written as a triple

➢term frequency factor (local factor)
➢inverse document frequency factor (global factor)
➢normalization factor

  • Ex: SMART.bnn → binary document representation
  • Ex: SMART.nnn → raw term frequency based representation
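As a rough illustration of how a SMART triple maps to a weight, the sketch below supports a small subset of the commonly used codes: `n`, `b`, `l` for the term frequency factor, `n`, `t` for the idf factor, and `n`, `c` for normalization. The function name `smart_weights` and the toy corpus are assumptions made for this example, and the BM25 variants are not covered.

```python
import math
from collections import Counter


def smart_weights(doc_tokens, corpus, scheme="nnn"):
    """Weight the terms of one document under a three-letter SMART scheme."""
    tf_code, idf_code, norm_code = scheme
    if tf_code not in "nbl" or idf_code not in "nt" or norm_code not in "nc":
        raise ValueError(f"unsupported SMART scheme: {scheme}")
    n_docs = len(corpus)
    weights = {}
    for term, tf in Counter(doc_tokens).items():
        # Local factor: b = binary, l = 1 + log(tf), n = raw count.
        if tf_code == "b":
            local = 1.0
        elif tf_code == "l":
            local = 1.0 + math.log(tf)
        else:
            local = float(tf)
        # Global factor: t = log(N / df), n = no idf.
        if idf_code == "t":
            df = sum(term in d for d in corpus)
            glob = math.log(n_docs / df) if df else 0.0
        else:
            glob = 1.0
        weights[term] = local * glob
    # Normalization: c = cosine (unit length), n = none.
    if norm_code == "c":
        length = math.sqrt(sum(w * w for w in weights.values()))
        if length > 0:
            weights = {t: w / length for t, w in weights.items()}
    return weights


# Hypothetical two-document corpus, for illustration only.
doc = ["good", "good", "film"]
corpus = [doc, ["bad", "film"]]
bnn = smart_weights(doc, corpus, "bnn")
nnn = smart_weights(doc, corpus, "nnn")
ntc = smart_weights(doc, corpus, "ntc")
print(bnn, nnn, ntc)
```

Under `bnn` every term present gets weight 1, under `nnn` the weight is the raw count, and under `ntc` the term "film", which occurs in every document, drops to zero before cosine normalization.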

slide-7
SLIDE 7

EXTENDING SMART NOTATION AND BM25 WITH DELTA TF.IDF

  • The smoothing factor is added to the product of dfi and Ni rather than to dfi alone
  • The best accuracy of 96.90% is attained using BM25 tf weights with the BM25 delta idf variant
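One possible reading of the smoothing bullet above is sketched below. It assumes the delta idf of a term is log((N1 · df2) / (N2 · df1)), where N1 and N2 are the class sizes and df1 and df2 the class document frequencies, and that the smoothing constant is added to each product. The constant 0.5 and both function names are assumptions. The BM25 delta idf variant is not reproduced here.

```python
import math


def delta_idf(n1, df1, n2, df2):
    """Unsmoothed delta idf: idf in class 1 minus idf in class 2."""
    return math.log((n1 * df2) / (n2 * df1))


def smoothed_delta_idf(n1, df1, n2, df2, s=0.5):
    """Delta idf with the smoothing constant s added to each N * df product.

    The default s = 0.5 is an assumption made for this sketch.
    """
    return math.log((n1 * df2 + s) / (n2 * df1 + s))


# A term spread evenly over both classes has no class preference.
even = smoothed_delta_idf(100, 10, 100, 10)
# A term never seen in class 1 breaks the unsmoothed form (division by
# zero) but stays finite once the products are smoothed.
unseen = smoothed_delta_idf(100, 0, 100, 10)
print(even, unseen)
```

Smoothing the product rather than the document frequency alone keeps the ratio symmetric: swapping the two classes only flips the sign of the result.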

slide-8
SLIDE 8

Related Supervised Term Weighting Methods

  • Text classification → supervised learning task
  • Ex: inverse category frequency
  • The fewer the categories in which a term occurs, the greater the discriminating power of the term
  • Fundamental elements of supervised term weighting are used to compute the global importance of a term ti for a category ck
  • Relevance frequency → considers the term's distribution in the positive and negative examples: in multi-label text categorization, the more a term is concentrated in positive examples rather than negative ones, the greater its contribution to categorization

slide-9
SLIDE 9

Supervised Variant of tf.idf

  • Idea → avoid decreasing the weight of terms contained in documents belonging to the same category, unlike what happens in idf.
  • idfec (inverse document frequency excluding category)
  • DT \ Ck → training documents not labeled with Ck
  • How does it help?

➢Improves classification effectiveness
➢A term that appears outside category Ck is penalized, as in tf.idf
➢A term appearing in category Ck retains a high global weight

  • Similar to tf.rf, as both penalize the weight of a term ti according to the number of negative examples in which ti appears

slide-10
SLIDE 10

Example

  • A corpus of 100 training documents, containing two terms t1 and t2, and a category Ck
  • For term t1 (base-10 logarithms)

➢ idf(t1) = log(100/(27 + 5)) = log(3.125) ≈ 0.49
➢ rf(t1, Ck) = log(2 + 27/5) = log(7.4) ≈ 0.87
➢ idfec(t1, Ck) = log((65 + 5)/5) = log(14) ≈ 1.15

  • For term t2

➢ idf(t2) = log(2.857) ≈ 0.46
➢ rf(t2, Ck) = log(2.4) ≈ 0.38
➢ idfec(t2, Ck) = log(2.8) ≈ 0.45

  • Conclusion: t1 is more relevant than t2 for deciding whether a particular document belongs to category Ck
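The arithmetic in this example can be checked with a short Python sketch. It assumes base-10 logarithms, which is the base that matches the slide's figures, and follows the slide's simplified forms of the three measures. The counts for t2 (10 documents in Ck, 25 outside) are not stated on the slide. They are inferred here from its arguments: 100/35 ≈ 2.857, 2 + 10/25 = 2.4 and 70/25 = 2.8.

```python
import math


def idf(n_docs, df):
    """Inverse document frequency over the whole corpus."""
    return math.log10(n_docs / df)


def rf(in_cat, out_cat):
    """Relevance frequency: documents with the term inside vs outside Ck."""
    return math.log10(2 + in_cat / max(1, out_cat))


def idfec(n_outside, out_cat):
    """idf computed only over the documents not labeled with Ck."""
    return math.log10(n_outside / max(1, out_cat))


N_DOCS = 100      # training documents in the corpus
N_OUTSIDE = 70    # documents not labeled with Ck (65 + 5 on the slide)

# (documents in Ck containing the term, documents outside Ck containing it)
terms = {"t1": (27, 5), "t2": (10, 25)}  # t2 counts are inferred, see above

for name, (in_cat, out_cat) in terms.items():
    print(
        name,
        round(idf(N_DOCS, in_cat + out_cat), 2),
        round(rf(in_cat, out_cat), 2),
        round(idfec(N_OUTSIDE, out_cat), 2),
    )
```

Plain idf barely separates the two terms, while rf and idfec both give t1 a clearly higher weight because it appears in few documents outside Ck.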

slide-11
SLIDE 11

Related Work

  • Delta TFIDF outperforms the baseline with a statistical significance of 95% on a two-tailed t-test.

➢Pang and Lee's approach requires an additional trained SVM subjectivity classifier, which requires even more labeled data
➢Created a subjectivity detector using Pang and Lee's subjectivity data-set
➢The transformation yields a clear improvement with a 99.9% confidence interval over the baseline bag of words

  • Reviewed existing methods, both unsupervised and supervised, and proposed a novel solution as a modification of the classic tf.idf scheme.

slide-12
SLIDE 12

Future Work

  • Plan to test the proposed weighting functions in other domains such as topic classification, and to extend the approach to accommodate multi-class classification.
  • Improve the technique using redundancy, as redundancy is more effective than idf weights
  • Plan to test further variants of the supervised scheme and perform tests on larger datasets
slide-13
SLIDE 13

References

  • Georgios Paltoglou and Mike Thelwall. A Study of Information Retrieval Weighting Schemes for Sentiment Analysis. https://www.aclweb.org/anthology/P10-1141/
  • Justin Martineau and Tim Finin. Delta TFIDF: An Improved Feature Space for Sentiment Analysis. https://www.researchgate.net/publication/221298092_Delta_TFIDF_An_Improved_Feature_Space_for_Sentiment_Analysis
  • A Comparison of Term Weighting Schemes for Text Classification and Sentiment Analysis with a Supervised Variant of tf.idf. Conference paper. https://www.researchgate.net/publication/299278964_A_Comparison_of_Term_Weighting_Schemes_for_Text_Classification_and_Sentiment_Analysis_with_a_Supervised_Variant_of_tfidf
  • Supervised Term Weighting Metrics for Sentiment Analysis in Short Text. https://arxiv.org/abs/1610.03106