Mining Sentiment Classification from Political Web Logs

Kathleen Durant, WebKDD '06, August 20, 2006


SLIDE 1

WebKDD 2006 Workshop on Knowledge Discovery on the Web, Aug. 20, 2006, at KDD 2006, Philadelphia, PA, USA

Mining Sentiment Classification from Political Web Logs

Kathleen Durant
WebKDD '06, August 20, 2006

SLIDE 2

Explosion of News and Opinions on the Web

  • Substantial growth of people accessing the Internet for news
    – 3% in 1995, 20% in 2004
  • Growth of web logs on the Web
    – 100,000 in 2002 to 4.8 million in 2004
  • Growth in people reading web logs
    – 2004 saw a 58% increase in readers of web logs

SLIDE 3

Sentiment Topic View of the Blog Space

  • Web logs provide readily available opinions on a myriad of topics
  • Sentiment classification separates opinions into two opposing camps
  • Take advantage of opinions and tools to build a custom view of blog space by topic and opinion

SLIDE 4

Questions Investigated

  • Can existing machine learning techniques be successfully applied?
  • Which techniques work well?
    – Naïve Bayes, Support Vector Machines
  • What's the effect of unbalanced class compositions on results?
    – Different camps write at different rates on particular topics


SLIDE 5

Research Statement

  • Apply sentiment classification to political web log posts
    – Topic-specific corpus
      • George W. Bush and the Iraq War
    – Domain-specific
      • Political web log posts
  • Judge: Joe Gandelman
    – classified over 250 web logs
  • Classify web log posts according to our judge's sentiment class
    – Right-voice
    – Left-voice

SLIDE 6

Segmentation of Data

  • Data segmented by the month
    – Leading to 25 different models
    – Small enough to limit the events discussed
    – Large enough to generate enough posts on topic

[Figure: timeline of the 25 monthly segments, Mar 03 through Mar 05]
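The monthly segmentation can be illustrated with a minimal sketch. It assumes each post arrives as a (date, text) pair; the function names `month_keys` and `segment_by_month` are hypothetical and not taken from the talk.

```python
from collections import defaultdict
from datetime import date


def month_keys(start=(2003, 3), end=(2005, 3)):
    """Return 'YYYY-MM' keys for every month from start to end, inclusive."""
    year, month = start
    keys = []
    while (year, month) <= end:
        keys.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return keys


def segment_by_month(posts):
    """Group (post_date, text) pairs into per-month lists keyed by 'YYYY-MM'."""
    segments = defaultdict(list)
    for post_date, text in posts:
        segments[post_date.strftime("%Y-%m")].append(text)
    return dict(segments)


MONTHS = month_keys()  # March 2003 through March 2005

# Hypothetical posts, only to show the grouping.
example = segment_by_month([
    (date(2003, 3, 20), "post a"),
    (date(2003, 3, 25), "post b"),
    (date(2004, 11, 3), "post c"),
])
```

March 2003 through March 2005 gives 25 keys, one per model, which matches the count on the slide.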

SLIDE 7

Dataset Representation via the Vector Space Model

  • Feature set
    – Terms occurring at least 5 times within the month's corpus
    – Unigrams with polarity of environment
      • Differentiate between "not support" and "support"
    – Bag-of-words framework
      • Order not important: "Bush is" = "Is Bush"
    – Presence vectors
      • Given n features, the post is represented as an n-dimensional vector
        – 0: feature not present in post
        – 1: feature is present
        – Example: {0,1,1,1,0} with 5 features: features 1 and 5 are not present; features 2, 3, and 4 are.
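A minimal sketch of this representation follows. The slides do not say how "polarity of environment" was computed, so the negation handling here (prefixing the word after a cue such as "not" with `NOT_`) is an assumption, as are the names `tokenize`, `build_vocabulary`, and `presence_vector`.

```python
from collections import Counter

# Assumed negation cues; the talk does not list the ones actually used.
NEGATIONS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase and split a post, tagging the word that follows a negation cue."""
    tokens, negate = [], False
    for word in text.lower().split():
        word = word.strip(".,!?;:\"'")
        if not word:
            continue
        if word in NEGATIONS:
            negate = True
            continue
        tokens.append("NOT_" + word if negate else word)
        negate = False
    return tokens


def build_vocabulary(posts, min_count=5):
    """Keep terms that occur at least min_count times in the month's corpus."""
    counts = Counter(tok for post in posts for tok in tokenize(post))
    return sorted(term for term, n in counts.items() if n >= min_count)


def presence_vector(post, vocabulary):
    """Return a 0/1 vector: 1 if the feature appears in the post, else 0."""
    present = set(tokenize(post))
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical month's corpus, repeated so every term reaches the threshold.
corpus = ["Bush is right"] * 5 + ["I do not support Bush"] * 5 + ["I support the war"] * 5
vocab = build_vocabulary(corpus)
```

Because the vector records only presence, `presence_vector("Bush is", vocab)` and `presence_vector("Is Bush", vocab)` are identical, while "not support" and "support" map to different features.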

SLIDE 8

Naïve Bayes Classification

Prior for the red class; prior for the blue class

Likelihood that term i appears in a class = total number of occurrences of the term in the class / total number of words in the class (shown on the slide for the red category)

Posterior Probability = Prior * Likelihood

Choose the category with the Maximum Posterior Probability

Calculate the product of the probabilities for each term in a post
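The steps on this slide can be sketched as follows. This is an illustration, not the classifier used in the study: it works in log space to avoid underflow and adds add-one smoothing so unseen terms do not zero out the product, both of which are assumptions. The function names and the toy training data are hypothetical.

```python
import math
from collections import Counter


def train_naive_bayes(labeled_posts):
    """labeled_posts: list of (tokens, label). Returns priors, term counts, vocabulary."""
    doc_counts = Counter(label for _, label in labeled_posts)
    term_counts = {label: Counter() for label in doc_counts}
    for tokens, label in labeled_posts:
        term_counts[label].update(tokens)
    total_docs = sum(doc_counts.values())
    priors = {label: n / total_docs for label, n in doc_counts.items()}
    vocabulary = set().union(*term_counts.values())
    return priors, term_counts, vocabulary


def classify(tokens, priors, term_counts, vocabulary):
    """Choose the class with the maximum posterior (prior * product of likelihoods)."""
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        total_words = sum(term_counts[label].values())
        score = math.log(prior)
        for term in tokens:
            if term not in vocabulary:
                continue  # ignore terms never seen in training
            # occurrences of term in class / words in class, with add-one smoothing (assumption)
            likelihood = (term_counts[label][term] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set.
training = [
    (["war", "wrong"], "left"),
    (["war", "lies"], "left"),
    (["support", "troops"], "right"),
    (["support", "president"], "right"),
]
priors, term_counts, vocabulary = train_naive_bayes(training)
```

Summing logarithms is equivalent to multiplying the probabilities, so the class chosen is the same one the product rule on the slide would pick.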

SLIDE 9

Support Vector Machines

[Figure: separating hyperplane with margin]

  • Separating hyperplane: Wx + b = 0
  • Margin boundaries: Wx + b = 1 and Wx + b = -1
  • Class regions: Wx + b > 0 and Wx + b < 0
  • Margin = 2/|W|
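As a small illustration of the quantities on this slide, the sketch below evaluates Wx + b for a point and computes the margin width 2/|W|. It assumes W and b have already been learned; the example values and function names are hypothetical, and no SVM training is shown.

```python
import math


def decision_value(w, x, b):
    """Compute Wx + b for weight vector w, feature vector x, and bias b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, x, b):
    """Return +1 when Wx + b > 0, otherwise -1 (points exactly on the plane go to -1)."""
    return 1 if decision_value(w, x, b) > 0 else -1


def margin_width(w):
    """Distance between the planes Wx + b = 1 and Wx + b = -1, i.e. 2/|W|."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical learned parameters.
W, B = [3.0, 4.0], -5.0
```

With these values |W| = 5, so the margin is 0.4; a smaller |W| gives a wider margin, which is what SVM training maximizes.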

SLIDE 10

Web logs to Classifiers

[Diagram labels: On Topic, Off Topic; datasets: Unbalanced, Unbalanced Small, Balanced, Balanced Inflated; classifiers: Naïve Bayes (NB), SVM]

SLIDE 11

Comparing Machine Learning Techniques

  • Off-the-shelf machine learning techniques perform well
  • Naïve Bayes significantly outperforms Support Vector Machines
    – 99.9% confidence level, CI [1.425, 3.489]

[Chart: Predictability (30 to 100) by month, 2003-03 through 2005-03]

SLIDE 12

Class Composition found on the Web

  • Imbalance in the class ratio
    – 14% of right-voice posts on topic
    – 24% of left-voice posts on topic

[Chart: Percentage (10 to 100) by month, 2003-03 through 2005-03; series: Right Voice, Left Voice]

SLIDE 13

Unbalanced Large and Small Results by Category

[Chart, two panels: Unbalanced Large and Unbalanced Small. Each plots Predictability (20 to 100) by month, 2003-03 through 2005-03; series: Right Voices, Left Voices]

SLIDE 14

Unbalanced and Balanced Results by Category

[Chart, two panels: Unbalanced and Balanced. Each plots Predictability (20 to 100) by month, 2003-03 through 2005-03; series: Right Voices, Left Voices]

SLIDE 15

Conclusions

  • Off-the-shelf machine learning techniques work pretty well
  • Balanced Naïve Bayes significantly outperforms Support Vector Machines
    – SVM 75.47%, NB 78.06% [1.425, 3.488]
  • Balancing the classes helps keep the number of misclassified per category more balanced
    – Unbalanced classifiers: more right-voices were consistently misclassified
    – Balanced classifiers: more left-voices were misclassified, 56% to 44% over the time continuum
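A per-category split of misclassifications, such as the 56% to 44% figure, could be computed along these lines. This is a sketch with a hypothetical function name and made-up labels; it is not the evaluation code from the study.

```python
from collections import Counter


def misclassification_share(true_labels, predicted_labels):
    """Return each class's share of all misclassified posts, as fractions of 1."""
    errors = Counter(t for t, p in zip(true_labels, predicted_labels) if t != p)
    total = sum(errors.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in errors.items()}


# Hypothetical labels: 3 left-voice and 1 right-voice posts are misclassified.
truth = ["left"] * 10 + ["right"] * 10
guess = ["right"] * 3 + ["left"] * 7 + ["left"] * 1 + ["right"] * 9
shares = misclassification_share(truth, guess)
```

A classifier whose errors fall evenly on both camps would give shares near 0.5 for each class; a skewed split shows which camp is misclassified more often.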