Text Mining on Mailing Lists: Sentiment Analysis (Gordon Heiczman, B. Sc.)

slide-1
SLIDE 1

Chair of Network Architectures and Services
Department of Informatics
Technical University of Munich

Text Mining on Mailing Lists: Sentiment Analysis

Gordon Heiczman, B. Sc.

October 13, 2017

slide-2
SLIDE 2

Introduction

What is Sentiment Analysis?

  • G. Heiczman — Sentiment Analysis

slide-5
SLIDE 5

Introduction

Problems of today:

  • Too much information
  • Too little time

slide-6
SLIDE 6

Introduction

Agenda

  • Text Mining summary
  • Example of practical application
  • Presentation of results
  • Conclusion and Lessons Learned

slide-7
SLIDE 7

Text Mining

Feature Selection

Main purpose: extract valuable information and get rid of redundant features.
'Bag of Words' approach.
Most common selection steps:

  • Removal of stop words (the, is, at ...)
  • Removal of plurals (dogs -> dog)
  • Word / n-gram frequency
  • Part of Speech (POS) tagging (adjectives)
  • Opinion words (like, hate, love ...)
  • Detection of negation (not good -> bad)
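
The selection steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the stop-word and negation lists are tiny made-up samples, `naive_singular` is a crude stand-in for a real stemmer, and negation is marked by prefixing the next token rather than by mapping it to an antonym.

```python
import re
from collections import Counter

# Tiny illustrative lists (assumptions); real pipelines use much larger ones.
STOP_WORDS = {"the", "is", "at", "a", "an", "and", "of", "to", "this"}
NEGATIONS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def naive_singular(word):
    """Very rough plural removal (dogs -> dog); not a real stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def mark_negation(tokens):
    """Merge a negation word into the token that follows it (not good -> not_good)."""
    result = []
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        result.append("not_" + token if negate else token)
        negate = False
    return result


def select_features(text, n=1):
    """Count n-grams after stop-word removal, plural removal and negation marking."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    tokens = mark_negation([naive_singular(t) for t in tokens])
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in ngrams)


features = select_features("The dogs are not good. The dog is at the park.")
```

Here `features` counts "dog" twice (once from "dogs"), keeps "not_good" as a single feature, and contains no stop words. Passing `n=2` yields bigram counts instead.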

slide-8
SLIDE 8

Text Mining

Sentiment Classification

Three main categories:

  • Machine Learning
  • Lexicon-based
  • Hybrid
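
To make the lexicon-based category concrete, here is a minimal sketch. The lexicon entries, their scores and the threshold are invented for illustration; they are not taken from any real sentiment lexicon or from the tool used later in the talk.

```python
# Hypothetical mini-lexicon: word -> polarity in [-1.0, 1.0]. Values are made up.
LEXICON = {"good": 0.7, "love": 0.5, "like": 0.4, "bad": -0.7, "hate": -0.8}
NEGATIONS = {"not", "never", "no"}


def lexicon_polarity(text):
    """Average the lexicon scores of known words; a preceding negation flips the sign."""
    scores = []
    negate = False
    for raw in text.lower().split():
        word = raw.strip(".,!?;:\"'")
        if word in NEGATIONS:
            negate = True
            continue
        if word in LEXICON:
            score = LEXICON[word]
            scores.append(-score if negate else score)
        negate = False
    return sum(scores) / len(scores) if scores else 0.0


def classify(text, threshold=0.1):
    """Map the polarity score to a coarse label."""
    polarity = lexicon_polarity(text)
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"
```

A machine-learning classifier would instead learn such weights from labelled examples, and a hybrid approach combines both.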

slide-9
SLIDE 9

Text Mining

Pitfalls

  • Named Entity Recognition, i.e. "What is the topic?"
  • Anaphora resolution (resolving reference words): "What is 'it' referring to?"
  • Sarcasm
  • Abbreviations, poor grammar / punctuation / spelling

slide-10
SLIDE 10

Practical Application

  • Dataset
  • Language
  • Email retrieval
  • Content retrieval
  • Sentiment value retrieval

slide-11
SLIDE 11

Practical Application

Dataset

Collection of emails from the IETF (Internet Engineering Task Force). The IETF's task is to set Internet standards.


slide-12
SLIDE 12

Practical Application

Language

C# or Python? Not enough comprehensive, completely free tools.

Notable C# tools:

  • VaderSharp (free but primitive)
  • Aylien (paid)
  • Watson D.C. (paid)
  • Vivekn (free but no documentation)

Python tool: TextBlob


slide-13
SLIDE 13

Practical Application

Multiple values obtained through SA:

  • Polarity (-1.0 to 1.0)
  • Subjectivity (0.0 to 1.0)
  • Most used word
  • Sentence Count
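
Polarity and subjectivity come from the sentiment tool itself. The other two values can be approximated with the standard library alone, as in the sketch below. The stop-word list and the punctuation-based sentence split are simplifying assumptions; how the original program computed these values is not stated on the slides.

```python
import re
from collections import Counter

# Small illustrative stop-word list (assumption).
STOP_WORDS = {"the", "is", "at", "a", "an", "and", "of", "to", "in", "it", "i"}


def most_used_word(text):
    """Return the most frequent non-stop-word, or None for empty input."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    return Counter(words).most_common(1)[0][0] if words else None


def sentence_count(text):
    """Count sentences by splitting on ., ! and ? (a rough heuristic)."""
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])
```

Abbreviations such as "e.g." will inflate the sentence count with this heuristic, which is one reason a proper tokenizer is preferable for real mail bodies.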

slide-14
SLIDE 14

Practical Application

TextBlob example

    from textblob import TextBlob

    blob = TextBlob("I think this presentation is really, really good!")
    print(blob.sentiment)              # Gives both polarity and subjectivity around 1.0
    print(blob.words.count('really'))  # Gives 2
    print(blob.noun_phrases)           # Gives noun phrases, in this case "presentation"


slide-15
SLIDE 15

Practical Application

Figure 1: Example of email with polarity 1.0

  • Filename: /home/.../geopriv/2007-12.mail
  • Key: 251
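
The filename and key above suggest that each `.mail` file holds many messages addressed by an integer key. Assuming the files are in mbox format (an assumption, the slides do not say so), email and content retrieval could look roughly like this with Python's standard `mailbox` module. The helper names are hypothetical.

```python
import mailbox


def plain_text_body(message):
    """Return the first text/plain part of an email message as a string."""
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""


def iter_bodies(path):
    """Yield (key, subject, body) for every message in an mbox file."""
    box = mailbox.mbox(path)
    try:
        for key, message in box.iteritems():
            yield key, message["subject"], plain_text_body(message)
    finally:
        box.close()
```

Each yielded body can then be handed to the sentiment tool, with the key kept alongside so that a striking score can be traced back to the exact message.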

slide-16
SLIDE 16

Practical Application

Program flow

slide-18
SLIDE 18

Statistics

[Bar chart omitted; axis ticks run from 5 to 80. Group labels as far as recoverable: ipoib, 84all, ibnemo, ideas, datatracker-rqmts, imapext, dns-rrtype-applications, ietfmibs, l3vpn, hokey.]

Figure 2: Top 10 groups who use the most sentences

Even distribution.
Indication of in-depth discussion or off-topic rambling?


slide-19
SLIDE 19

Statistics

[Bar chart omitted; axis from 0.00 to 0.60. Bar values: 0.5, 0.375854, 0.307836, 0.303693, 0.29163, 0.276323, 0.253274, 0.251799, 0.25, 0.25. Group labels as far as recoverable: iaoc-scribes, issll, 84all, dnsoverhttp, iola-conversion-tool, addr-select-dt, 67attendees, bgmp; two further labels are illegible.]

Figure 3: Top 10 most positive groups

Logarithmic distribution.
Notable group: "iaoc-scribes"
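
A ranking like the one in this figure can be produced by aggregating per-email scores per group. The sketch below uses the mean polarity of each group; that choice is an assumption, since the slides do not state which aggregate was plotted, and `top_groups` is a hypothetical helper name.

```python
from collections import defaultdict


def top_groups(records, n=10, most_positive=True):
    """Rank groups by mean polarity.

    records: iterable of (group, polarity) pairs, one pair per email.
    """
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of polarity, email count]
    for group, polarity in records:
        totals[group][0] += polarity
        totals[group][1] += 1
    means = {group: total / count for group, (total, count) in totals.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=most_positive)[:n]
```

Calling it with `most_positive=False` gives the most negative groups instead. Groups with very few emails can dominate such a ranking, so a minimum message count per group may be worth adding.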


slide-20
SLIDE 20

Statistics

[Bar chart omitted; axis from -0.50 to 0.00. Group labels as far as recoverable: ietf-sailors, irtf-mobility-charter, 7attendees, schema, wrec, ipsra, sasl, iucg, webprint, mib.]

Figure 4: Top 10 most negative groups

Stronger logarithmic distribution.
Notable group: "ietf-sailors"


slide-21
SLIDE 21

Statistics

[Bar chart omitted; axis from 0.0 to 1.2. Group labels as far as recoverable: iaoc-scribes, cosmogol, 73attendees, ietf-outcomes, 7attendees, irtf-mobility-charter, ipsra, imap, kmart, dclc.]

Figure 5: Top 10 most subjective groups

Surprising top scores.
Discussion groups.


slide-22
SLIDE 22

Statistics

Of the 7 entries with the most negative polarity (-1.0), 6 belong to the group 'eos'.
All of them are in Spanish (?)


slide-23
SLIDE 23

Conclusion

Useful, but not universally.

Lessons learned:

  • Filter the dataset intelligently
  • Don’t try to solve everything with one library
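
One possible form of such filtering is to drop non-English emails before scoring, assuming the sentiment tool targets English text. The sketch below is a deliberately crude heuristic based on stop-word hits. The word lists are small illustrative samples, and a dedicated language-detection library would be more reliable.

```python
import re

# Small illustrative stop-word samples (assumptions); real lists are much longer.
ENGLISH_HINTS = {"the", "and", "is", "of", "to", "in", "that", "it", "for", "this"}
SPANISH_HINTS = {"el", "la", "de", "que", "y", "en", "los", "es", "por", "para"}


def looks_english(text):
    """Guess whether a text is English by comparing stop-word hits."""
    words = re.findall(r"[a-záéíóúñü]+", text.lower())
    english = sum(word in ENGLISH_HINTS for word in words)
    spanish = sum(word in SPANISH_HINTS for word in words)
    return english > spanish


def filter_english(bodies):
    """Keep only the texts that look English."""
    return [body for body in bodies if looks_english(body)]
```

Texts without any hits, including empty bodies, are treated as non-English and dropped, which errs on the side of excluding unclear input.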