Text Mining on Mailing Lists: Sentiment Analysis
Gordon Heiczman, B. Sc.
October 13, 2017
Chair of Network Architectures and Services, Department of Informatics, Technical University of Munich
Introduction
What is Sentiment Analysis?
- G. Heiczman — Sentiment Analysis
Introduction
Problems of today:
- Too much information
- Too little time
Introduction
Agenda
- Text Mining summary
- Example of practical application
- Presentation of results
- Conclusion and Lessons Learned
Text Mining
Feature Selection
Main purpose: extract valuable information, get rid of redundant features
'Bag of Words' approach
Most common selection steps:
- Removal of stop words (the, is, at ...)
- Removal of plurals (dogs -> dog)
- Word / n-gram frequency
- Part of Speech (POS) tagging (adjectives)
- Opinion words (like, hate, love ...)
- Detection of negation (not good -> bad)
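The selection steps above can be sketched with the standard library alone. This is a minimal illustration, not the pipeline used in the talk: the stop-word list, the plural rule and the helper names (`tokenize`, `singularize`, `mark_negation`, `select_features`) are assumptions. Negation is handled by tagging the following word (not good -> not_good) instead of mapping it to an antonym.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the talk).
STOP_WORDS = {"the", "is", "at", "a", "an", "and", "of", "to", "this", "it"}
NEGATIONS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def singularize(word):
    """Very naive plural removal: strip a trailing 's' (dogs -> dog)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def mark_negation(tokens):
    """Merge a negation word with the token that follows it (not good -> not_good)."""
    result = []
    i = 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            result.append("not_" + tokens[i + 1])
            i += 2
        else:
            result.append(tokens[i])
            i += 1
    return result


def select_features(text):
    """Return a bag of words (token -> frequency) after the selection steps."""
    tokens = mark_negation(tokenize(text))
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [singularize(t) for t in tokens]
    return Counter(tokens)


bag = select_features("The dogs are not good, the dogs bark at dogs.")
print(bag.most_common(3))
```

A real pipeline would typically use a stemmer or lemmatizer and a POS tagger from an NLP library instead of these hand-written rules.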
Text Mining
Sentiment Classification
Three main categories:
- Machine Learning
- Lexicon-based
- Hybrid
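To make the lexicon-based category concrete, here is a minimal sketch under stated assumptions: the lexicon entries and their scores are invented for illustration, and the function names `lexicon_polarity` and `classify` are hypothetical. A machine learning approach would instead learn such weights from labeled examples.

```python
import re

# Toy opinion lexicon with hand-picked scores (hypothetical values).
LEXICON = {"good": 1.0, "love": 1.0, "like": 0.5, "bad": -1.0, "hate": -1.0}
NEGATIONS = {"not", "no", "never"}


def lexicon_polarity(text):
    """Average the lexicon scores of all opinion words, flipping the sign after a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = []
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        if token in LEXICON:
            score = LEXICON[token]
            scores.append(-score if negate else score)
        negate = False  # a negation only affects the next token
    return sum(scores) / len(scores) if scores else 0.0


def classify(text):
    """Map the averaged polarity to a coarse label."""
    polarity = lexicon_polarity(text)
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


print(classify("I love this draft"))
print(classify("This is not good"))
```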
Text Mining
Pitfalls
- Named Entity Recognition, i.e. "What is the topic?"
- Anaphora Resolution (resolving reference words): "What is 'it' referring to?"
- Sarcasm
- Abbreviations, poor grammar / punctuation / spelling
- G. Heiczman — Sentiment Analysis
9
Practical Application
- Dataset
- Language
- Email retrieval
- Content retrieval
- Sentiment value retrieval
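The email and content retrieval steps could look roughly like the sketch below. It assumes the archive files are in mbox format, which the talk does not state, and the helper names `extract_text` and `iter_bodies` are hypothetical. The demo writes a throwaway archive so that the example is self-contained.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def extract_text(message):
    """Return the first text/plain part of a message as a string."""
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is None:
                continue
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""


def iter_bodies(mbox_path):
    """Yield (key, subject, body) for every message in an mbox file."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for key, message in box.items():
            yield key, message.get("Subject", ""), extract_text(message)
    finally:
        box.close()


# Demo with a throwaway archive; the file name and content are made up.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.mail")
    box = mailbox.mbox(path)
    msg = EmailMessage()
    msg["From"] = "alice@example.org"
    msg["Subject"] = "Draft review"
    msg.set_content("I think this draft is really good.")
    box.add(msg)
    box.flush()
    box.close()
    bodies = list(iter_bodies(path))

print(bodies)
```

Each extracted body can then be passed to the sentiment library to obtain the sentiment values.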
Practical Application
Dataset
Collection of emails from the IETF. The task of the IETF is to set Internet standards.
Practical Application
Language
C# or Python?
Not enough comprehensive, completely free tools
Notable C# tools:
- VaderSharp (free but primitive)
- Aylien (paid)
- Watson D.C. (paid)
- Vivekn (free but no documentation)
Python tool: TextBlob
Practical Application
Multiple values obtained through SA:
- Polarity (-1.0 to 1.0)
- Subjectivity (0.0 to 1.0)
- Most used word
- Sentence Count
Practical Application
TextBlob example

    from textblob import TextBlob

    blob = TextBlob("I think this presentation is really, really good!")
    print(blob.sentiment)  # Gives both polarity and subjectivity around 1.0
    print(blob.words.count('really'))  # Gives 2
    print(blob.noun_phrases)  # Gives noun phrases, in this case presentation
Practical Application
Figure 1: Example of email with polarity 1.0
- Filename: /home/.../geopriv/2007-12.mail
- Key: 251
Practical Application
Program flow
Statistics
[Bar chart; axis from 5 to 80]
Figure 2: Top 10 groups who use the most sentences
Even distribution
Indication of in-depth discussion or off-topic rambling?
Statistics
[Bar chart; axis from 0.00 to 0.60; bar values 0.5, 0.375854, 0.307836, 0.303693, 0.29163, 0.276323, 0.253274, 0.251799, 0.25, 0.25]
Figure 3: Top 10 most positive groups
Logarithmic distribution
Notable group: "iaoc-scribes"
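A ranking like the one in Figure 3 could be produced along the following lines. This is a sketch under assumptions: it aggregates by the mean polarity per group, which the talk does not confirm, and the `(group, polarity)` record layout, the function name `top_groups` and the sample values are made up.

```python
from collections import defaultdict
from statistics import mean


def top_groups(records, n=10, most_positive=True):
    """Rank mailing-list groups by the mean polarity of their emails.

    `records` is an iterable of (group, polarity) pairs; this layout is
    an assumption made for the sketch.
    """
    by_group = defaultdict(list)
    for group, polarity in records:
        by_group[group].append(polarity)
    averages = {group: mean(values) for group, values in by_group.items()}
    ranked = sorted(averages.items(), key=lambda item: item[1], reverse=most_positive)
    return ranked[:n]


# Made-up sample values, not results from the talk.
sample = [
    ("group-a", 0.5), ("group-a", 0.25),
    ("group-b", -0.25), ("group-b", 0.0),
    ("group-c", 0.125),
]
print(top_groups(sample, n=2))
print(top_groups(sample, n=1, most_positive=False))
```

Setting `most_positive=False` gives the most negative groups with the same function.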
Statistics
[Bar chart; axis from -0.50 to 0.00]
Figure 4: Top 10 most negative groups
Stronger logarithmic distribution
Notable group: "ietf-sailors"
Statistics
[Bar chart; axis from 0.0 to 1.2]
Figure 5: Top 10 most subjective groups
Surprising top scores
Discussion groups
Statistics
Of the 7 entries with the most negative polarity (-1.0), 6 belong to the group 'eos'.
All of them are in Spanish (?)
Conclusion
Useful, but not universally
Lessons learned:
- Filter the dataset intelligently
- Don’t try to solve everything with one library
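One way to act on the first lesson is to drop non-English emails before scoring them, since an English sentiment lexicon gives misleading values for other languages. The sketch below is a crude heuristic under stated assumptions: the stop-word sets are hand-picked, only Spanish is considered as the competing language, and the names `looks_english` and `filter_english` are hypothetical. A dedicated language detection library would be more robust.

```python
import re

# Small hand-picked stop-word sets (assumptions for this sketch).
ENGLISH_STOP_WORDS = {"the", "is", "and", "of", "to", "in", "that", "it", "for", "this"}
SPANISH_STOP_WORDS = {"el", "la", "los", "las", "de", "que", "y", "en", "es", "por"}


def looks_english(text):
    """Guess whether a text is English by comparing stop-word hits."""
    tokens = re.findall(r"[a-záéíóúñü]+", text.lower())
    english_hits = sum(token in ENGLISH_STOP_WORDS for token in tokens)
    spanish_hits = sum(token in SPANISH_STOP_WORDS for token in tokens)
    return english_hits > spanish_hits


def filter_english(texts):
    """Keep only the texts that look English."""
    return [text for text in texts if looks_english(text)]


mails = [
    "This is the agenda for the meeting.",
    "La reunión es en la sala de conferencias.",
]
print(filter_english(mails))
```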