Mining the Peanut Gallery
Opinion Extraction and Semantic Classification of Product Reviews
A paper by Kushal Dave, Steve Lawrence, and David M. Pennock
Presented by Ledao Chen and David Zhao
Problem: Product reviews are everywhere!
[Figure: Categories 1, 2, 3, 5, 6, 7; Positive / Negative; Train fold / Test fold]
[Figure: Categories 1, 2, 3, 5, 6, 7; Positive / Negative; 10x sets]
[ [“Peace” “cannot” “be” “kept” “by” “force” “;” “it” “can” “only” ...],
  [“Darkness” “cannot” “drive” “out” “darkness” “;” “only” “light”...],
  [“Hate” “cannot” “drive” “out” “hate” “;” “only” “love” “can” “do”...] ]
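The token lists above can be produced by a simple tokenizer. The following is a minimal sketch, assuming a regular-expression split that keeps punctuation marks as separate tokens; the `tokenize` name and the pattern are illustrative and not taken from the paper.

```python
import re

def tokenize(text):
    # Words (optionally with an internal apostrophe, e.g. "don't")
    # or single non-space punctuation characters become tokens.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", text)

sentence = "Peace cannot be kept by force; it can only be achieved by understanding."
tokens = tokenize(sentence)
print(tokens[:10])
# ['Peace', 'cannot', 'be', 'kept', 'by', 'force', ';', 'it', 'can', 'only']
```

Keeping ";" as its own token matches the slide's example, where the semicolon appears as a separate list element.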
○ “not good or useful” → “not NOTgood NOTor NOTuseful”
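The negation substitution shown above can be sketched as follows. This is a minimal illustration, assuming that a small set of negation words triggers the tagging and that the scope ends at the next punctuation mark; the word list, the punctuation set, and the `mark_negation` name are assumptions, not details from the slides.

```python
NEGATIONS = {"not", "never", "no"}          # assumed trigger words
PUNCTUATION = {".", ",", ";", ":", "!", "?"}  # assumed scope terminators

def mark_negation(tokens):
    # Prefix "NOT" to every token that follows a negation word,
    # until a punctuation token ends the negated span.
    out, negating = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negating = False
            out.append(tok)
        elif tok.lower() in NEGATIONS:
            negating = True
            out.append(tok)
        elif negating:
            out.append("NOT" + tok)
        else:
            out.append(tok)
    return out

print(" ".join(mark_negation("not good or useful".split())))
# not NOTgood NOTor NOTuseful
```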
[Peace cannot be kept by force it can only be achieved by understanding]
“achieved” and “understanding” are combined into an “achieved-understanding” feature
As substrings become longer, they generally become more discriminatory
As the frequency of substrings decreases, there is less evidence for considering them relevant
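The trade-off between substring length and frequency can be seen by counting n-grams of increasing length. The sketch below is a toy illustration on a made-up sentence; the `ngrams` helper and the sample text are assumptions for demonstration only.

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous runs of n tokens, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample text, not from the paper's data.
tokens = "this camera is great and this camera is cheap".split()

for n in range(1, 5):
    counts = Counter(ngrams(tokens, n))
    # Longer n-grams are more specific, but each one is seen less often.
    print(n, len(counts), max(counts.values()))
```

In this toy text, every 4-gram occurs exactly once, so a count alone says little about whether any single 4-gram is a reliable signal.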
The normalized term frequency is obtained by taking the number of times a feature f_i occurs in C and dividing it by the total number of tokens in C. A term's score is thus a measure of bias ranging from -1 to 1.
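The scoring formula itself did not survive on this slide. The following is a minimal sketch, assuming the score of a feature is the difference between its normalized term frequencies in the two classes divided by their sum, which is one form that yields the stated range of -1 to 1. The function names are illustrative.

```python
from collections import Counter

def normalized_tf(docs):
    # Count of each feature in the class divided by the
    # total number of tokens in the class.
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def bias_scores(pos_docs, neg_docs):
    # Assumed form: (p_pos - p_neg) / (p_pos + p_neg), in [-1, 1].
    p_pos = normalized_tf(pos_docs)
    p_neg = normalized_tf(neg_docs)
    scores = {}
    for tok in set(p_pos) | set(p_neg):
        a, b = p_pos.get(tok, 0.0), p_neg.get(tok, 0.0)
        scores[tok] = (a - b) / (a + b)  # a + b > 0 for any seen token
    return scores

scores = bias_scores([["great", "camera"]], [["bad", "camera"]])
print(scores)
```

A feature seen only in positive reviews scores 1, one seen only in negative reviews scores -1, and one that is equally frequent in both scores 0.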
Performs on par with SVM
Sensitive to different class sizes, and thus performs poorly on Test 1
Performs poorly on both tests
Document frequency, dampened by a logarithm, provided better results on Test 1
scheme on TF provided better results
Basic idea: Sum the scores of the words in an unknown document and use the sign of the total to determine the document's class (positive or negative)
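The basic idea above can be sketched in a few lines. This is a minimal illustration, assuming per-word scores are already available in a dictionary and that unseen words contribute zero; the example scores and the tie-breaking rule (a total of zero is treated as negative) are assumptions.

```python
def classify(tokens, scores):
    # Sum the scores of the document's words; the sign decides the class.
    total = sum(scores.get(tok, 0.0) for tok in tokens)
    return "positive" if total > 0 else "negative"

# Hypothetical word scores in [-1, 1], for illustration only.
example_scores = {"great": 0.8, "sharp": 0.5, "broken": -0.9, "slow": -0.4}

print(classify("great sharp but slow".split(), example_scores))
print(classify("slow and broken".split(), example_scores))
```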
Basic idea: Crawl search engine results for a given product's name and attempt to identify and analyze product reviews within this set.
Model by data from
Discard certain pages, paragraphs, and sentences (such as pages without “review” in their title, paragraphs not containing the name of the product, and excessively long or short sentences)
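The discarding heuristics above can be sketched as a chain of filters. This is a minimal illustration, assuming pages arrive as (title, paragraphs) pairs; the word-count thresholds, the sentence splitter, the function names, and the "Acme X100" product are all hypothetical, since the slides do not give these details.

```python
import re

def keep_page(title):
    # Discard pages without "review" in their title.
    return "review" in title.lower()

def keep_paragraph(paragraph, product):
    # Discard paragraphs that do not contain the product name.
    return product.lower() in paragraph.lower()

def keep_sentence(sentence, min_words=4, max_words=40):
    # Discard excessively short or long sentences (thresholds assumed).
    return min_words <= len(sentence.split()) <= max_words

def candidate_sentences(pages, product):
    kept = []
    for title, paragraphs in pages:
        if not keep_page(title):
            continue
        for para in paragraphs:
            if not keep_paragraph(para, product):
                continue
            # Naive split after sentence-ending punctuation.
            for sent in re.split(r"(?<=[.!?])\s+", para):
                if keep_sentence(sent):
                    kept.append(sent)
    return kept

pages = [
    ("Acme X100 review", ["The Acme X100 takes sharp photos in low light. Great!",
                          "Shipping was slow."]),
    ("Acme store", ["The Acme X100 is on sale today for everyone."]),
]
print(candidate_sentences(pages, "Acme X100"))
```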
Randomly selected 600 sentences (200 for each of 3 products) from search engine results, as parsed and thresholded by the mining tool.
Manually tagged as positive (P), negative (N), or ambiguous (I).
Ambiguous means they were ambiguous when taken out of context, did not express an opinion at all, or were not describing the product.
P: 173, N: 71, I: 356
Worse than tossing a coin
through the choice of appropriate features and metrics