WebKDD 2006 Workshop on Knowledge Discovery on the Web,
Aug. 20, 2006, at KDD 2006,
Philadelphia, PA, USA
Mining Sentiment Classification from Political Web Logs
Kathleen Durant
WebKDD 06, August 20, 2006
– Topic specific corpus: George W. Bush and the Iraq War
– Domain specific: Political Web log posts
– Classified over 250 web logs
– Right-voice
– Left-voice
– Leading to 25 different models
– Small enough to limit the events discussed
– Large enough to generate enough posts on topic
[Figure: timeline of months, Mar 03 to Mar 05]
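The timeline runs from Mar 03 to Mar 05, which is 25 calendar months, the same number as the models mentioned above. The sketch below simply enumerates those months. That each model corresponds to one month is an assumption here, not something the slides state; the helper name `month_range` is hypothetical.

```python
from datetime import date

def month_range(start, end):
    """Yield the first day of each month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1

# Assumed span, taken from the first and last months on the timeline.
months = list(month_range(date(2003, 3, 1), date(2005, 3, 1)))
labels = [m.strftime("%Y-%m") for m in months]

print(len(months), labels[0], labels[-1])
```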
Differentiate between “not support” and “support”
Order not important: “Bush is” = “Is Bush”
Given n features, the post is represented as an n-dimensional vector
– 0: feature is not present in post
– 1: feature is present
– Example: {0,1,1,1,0}, 5 features: features 1 and 5 are not present, features 2, 3, 4 are.
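The presence/absence representation can be sketched as follows. This is a minimal illustration, assuming single-word features and whitespace tokenization; the function name `to_feature_vector` and the feature list are hypothetical, and the sketch does not attempt to distinguish “not support” from “support”.

```python
def to_feature_vector(post, features):
    """Return a list of 0/1 flags: 1 if the feature term occurs in the post."""
    terms = set(post.lower().split())  # a set, so word order is ignored
    return [1 if f in terms else 0 for f in features]

# Hypothetical feature list, chosen to reproduce the {0,1,1,1,0} example.
features = ["war", "bush", "is", "iraq", "support"]

v1 = to_feature_vector("Bush is Iraq", features)
v2 = to_feature_vector("Iraq is Bush", features)

print(v1)        # features 2, 3, 4 present; 1 and 5 absent
print(v1 == v2)  # same vector regardless of word order
```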
Prior for the red class; prior for the blue class
Likelihood that term i appears in a class = total number of occurrences of the term in the class / total number of words in the class (e.g. the red category)
Posterior Probability = Prior * Likelihood
Choose the category with the Maximum Posterior Probability
Calculate the product of the probabilities for each term in a post
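The steps above can be sketched in a few lines. This is an illustrative sketch, not the implementation used in the study: the training posts are made up, the function names are hypothetical, and the add-`alpha` smoothing is an assumption added so that an unseen term does not force the product to zero.

```python
from collections import Counter

def train(posts_by_class):
    """posts_by_class maps a class label to a list of tokenized posts."""
    total_posts = sum(len(posts) for posts in posts_by_class.values())
    vocab = {t for posts in posts_by_class.values() for p in posts for t in p}
    model = {}
    for label, posts in posts_by_class.items():
        counts = Counter(t for p in posts for t in p)
        model[label] = {
            "prior": len(posts) / total_posts,   # share of posts in the class
            "counts": counts,                     # term occurrences in the class
            "total_words": sum(counts.values()),  # all words in the class
        }
    return model, vocab

def posterior(model, vocab, label, post, alpha=1.0):
    """Unnormalized posterior: prior times the product of term likelihoods."""
    c = model[label]
    score = c["prior"]
    for term in post:
        # occurrences of term in class / total words in class,
        # with add-alpha smoothing (an assumption, not from the slides)
        score *= (c["counts"][term] + alpha) / (c["total_words"] + alpha * len(vocab))
    return score

def classify(model, vocab, post, alpha=1.0):
    """Choose the class with the maximum posterior probability."""
    return max(model, key=lambda label: posterior(model, vocab, label, post, alpha))

# Made-up toy data using the red/blue class names from the slide.
training = {
    "red": [["support", "bush"], ["support", "war"]],
    "blue": [["oppose", "war"], ["oppose", "bush"], ["oppose", "oppose"]],
}
model, vocab = train(training)
predicted = classify(model, vocab, ["support", "bush"])
print(predicted)
```

For long posts, summing log-probabilities instead of multiplying raw probabilities avoids numeric underflow; the product form is kept here to mirror the slide.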
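[Figure: Support Vector Machine]
Separating hyperplane: Wx + b = 0
Margin boundaries: Wx + b = -1 and Wx + b = 1
Classes: Wx + b < 0 on one side, Wx + b > 0 on the other
Margin = 2/|W|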
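To make the geometry concrete, the sketch below evaluates a given hyperplane: it computes Wx + b for a point, reads the class from its sign, and computes the margin 2/|W|. The weights are hypothetical values picked for easy arithmetic, and the sketch does not cover training, that is, finding the W and b that maximize the margin.

```python
import math

def decision(w, b, x):
    """Return W.x + b for one feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def side(w, b, x):
    """Label by the side of the hyperplane W.x + b = 0 (ties go to -1)."""
    return 1 if decision(w, b, x) > 0 else -1

def margin(w):
    """Distance between W.x + b = -1 and W.x + b = 1, i.e. 2 / |W|."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))

w, b = [3.0, 4.0], -5.0  # hypothetical weights, not from the slides

print(decision(w, b, [1.0, 1.0]))  # positive side
print(side(w, b, [0.0, 0.0]))      # negative side
print(margin(w))                   # 2 / 5
```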
[Diagram: experiment configurations. Classifiers: Naïve Bayes (NB), SVM. Data sets: Unbalanced, Unbalanced Small, Balanced, Balanced Inflated. Posts: On Topic, Off Topic]
Off-the-shelf Machine Learning Techniques perform well
Naïve Bayes significantly outperforms Support Vector Machines
– 99.9% confidence level, CI [1.425,3.489]
Predictability
[Figure: Percentage, Right voice vs. Left voice]
[Figure: Predictability by month, 2003-03 to 2005-03, Right Voices vs. Left Voices; panels: Unbalanced Large, Unbalanced Small]
[Figure: Predictability by month, 2003-03 to 2005-03, Right Voices vs. Left Voices; panels: Unbalanced, Balanced]
– SVM 75.47%, NB 78.06% [1.425,3.488]
– Unbalanced classifiers: more Right-voices were consistently misclassified
– Balanced classifiers: more Left-voices were misclassified, 56% to 44%, over time continuum