SLIDE 1

Christopher Harms, SKOPOS GmbH & Co. KG Sebastian Schmidt, SKOPOS GmbH & Co. KG Learning From All Answers: Embedding-based Topic Modelling for Open-Ended Questions Contact: christopher.harms@skopos.de

General Online Research Conference GOR 18 28 February to 2 March 2018, TH Köln – University of Applied Sciences, Cologne, Germany

This work is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). Suggested citation: Harms, Christopher, & Schmidt, Sebastian. 2018. “Learning From All Answers: Embedding-based Topic Modelling for Open-Ended Questions.” General Online Research (GOR) Conference, Cologne.

SLIDE 2

GOR 2018 | Cologne | March 1, 2018

Learning From All Answers: Embedding-based Topic Modelling for Open-Ended Questions

Christopher Harms, Consultant Research & Development Sebastian Schmidt, Director Research & Development

# LearningFromAllAnswers

SLIDE 3

Initial Scenario

What can we do to improve our service for you?

SLIDE 4

Initial Scenario

Typical outputs: word cloud | qualitative summary | code plan

  • Manual coding
  • Automatic coding through supervised learning

How can we extract information from open-ended questions? Can we improve this through unsupervised machine learning?

SLIDE 5

Methods Overview

  • Naïve Keyword Extraction
  • Latent Dirichlet Allocation (LDA; Blei et al., 2003)
  • Embedding-based Topic Modelling (ETM; Qiang et al., 2016)

SLIDE 6

Methods Overview

Naïve Keyword Extraction

  • Nouns indicate topics
  • Extraction through a pre-trained POS tagger (e.g. spaCy)
  • Catch different forms of the same word: lemmatization or stemming
  • Word cloud of resulting terms, highlighting relative frequency

Example response: “Working from home for me means freedom and independence. I can just go for a walk when there is sunny weather and I need a break.”

Extracted keywords: home, freedom, independence, walk, weather, break
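The extraction step above relies on a POS tagger such as spaCy plus lemmatization; as a dependency-free sketch, the same idea can be approximated with a stopword filter and frequency counts (the stopword list here is a hand-made stand-in, not a real tagger):

```python
# Naive keyword extraction, simplified: the slides use a POS tagger
# (spaCy) plus lemmatization; this sketch approximates noun selection
# with a stopword filter and frequency counting only.
from collections import Counter
import re

# Minimal hand-made stopword list (illustrative only).
STOPWORDS = {"a", "and", "can", "for", "from", "go", "i", "is",
             "just", "me", "means", "my", "need", "the", "there", "when"}

def extract_keywords(text, top_n=6):
    """Return the most frequent non-stopword tokens as keyword candidates."""
    tokens = re.findall(r"[a-z]+", text.lower())
    candidates = [t for t in tokens if t not in STOPWORDS and len(t) > 3]
    return [word for word, _ in Counter(candidates).most_common(top_n)]

answer = ("Working from home for me means freedom and independence. "
          "I can just go for a walk when there is sunny weather and "
          "I need a break.")
print(extract_keywords(answer))
```

Unlike a real POS tagger, this keeps non-nouns such as “working” or “sunny”, which is exactly why the slides recommend spaCy for production use.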

SLIDE 7

Methods Overview

Latent Dirichlet Allocation

  • Bayesian generative probabilistic model
  • Each topic is a probability distribution over words
  • Inference: find the relationship between words and topics for a given corpus

[Plate diagram: LDA graphical model with priors α and β, per-document topic distributions, topic assignments, and observed words, over L documents and K topics]
SLIDE 8

Methods Overview

Latent Dirichlet Allocation

Benefits:
  • Co-occurring words are grouped into a topic
  • Readily available programming packages (e.g. gensim)

Disadvantages:
  • Number of topics has to be chosen a priori
  • Large corpus needed for reasonable results
  • No knowledge about relationships between different words (e.g. “buffet” and “restaurant”)

SLIDE 9

Methods Overview

Word Embeddings

  king - man + woman = queen
  breakfast + lunch = brunch

  • Embeddings contain information about word relationships
  • Trained on a very large corpus of texts
  • Each word becomes a multidimensional vector
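The analogy arithmetic above can be made concrete with cosine similarity over vectors. The 3-dimensional vectors below are hand-made for illustration only; real embeddings (word2vec, GloVe, fastText) have hundreds of dimensions and are trained on very large corpora:

```python
# Toy illustration of the "king - man + woman = queen" property.
# These tiny hand-made vectors stand in for real pre-trained embeddings.
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),   # royal + male
    "queen": np.array([0.9, 0.1, 0.8]),   # royal + female
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def nearest(vector, exclude=()):
    """Return the vocabulary word with the highest cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], vector))

# As is standard for analogy queries, the input words are excluded.
result = nearest(emb["king"] - emb["man"] + emb["woman"],
                 exclude={"king", "man", "woman"})
print(result)  # queen
```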
SLIDE 10

Methods Overview

Embedding-based Topic Modelling

  • Extension of the LDA model:
    1. Aggregate short texts into pseudo-documents
    2. Make similar words more likely to be assigned to the same topic
  • Word embeddings are used to measure the similarity of documents and words

SLIDE 11

Methods Overview

Embedding-based Topic Modelling

  • Undirected edge between the topic assignments of similar words (binary potential): similar words should be more likely to belong to the same topic
  • Graphical model is a Markov Random Field (MRF-LDA; Xie et al., 2015)
  • Weight for the binary potential; if it is 0, the model reduces to LDA

[Plate diagram: MRF-LDA graphical model with priors α and β, topic assignments, and observed words, over L documents and K topics]
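Schematically, the binary potential modifies the LDA prior over topic assignments. The form below is a simplified sketch of the idea, not the exact parameterization of Xie et al. (2015):

```latex
p(\mathbf{z} \mid \boldsymbol{\theta}) \;\propto\;
  \prod_{n=1}^{N} \theta_{z_n}
  \cdot \exp\!\Bigl( \lambda \sum_{(i,j) \in E} \mathbb{1}[z_i = z_j] \Bigr)
```

Here E is the set of edges connecting words that are similar in embedding space and λ is the weight of the binary potential; setting λ = 0 removes the coupling term and recovers standard LDA.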
SLIDE 12

Methods Overview

Embedding-based Topic Modelling

Benefits:
  • Knowledge of word relationships is incorporated (pre-trained embeddings)
  • k-means aggregation improves topic modelling of short texts

Disadvantages:
  • Number of pseudo-documents and topics has to be chosen a priori
  • Computationally expensive
  • Requires a large corpus for reasonable results
  • No ready-made software packages available
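The pseudo-document step from the two preceding slides can be sketched as k-means clustering over averaged word vectors. The embeddings below are random stand-ins for pre-trained vectors, and the vocabulary and responses are invented:

```python
# Sketch of ETM's aggregation step: merge short texts into
# pseudo-documents by k-means clustering of averaged word embeddings.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vocab = ["hotel", "room", "flight", "airport", "food", "buffet"]
# Random vectors stand in for pre-trained embeddings (word2vec/GloVe).
emb = {w: rng.normal(size=50) for w in vocab}

responses = [
    ["hotel", "room"], ["flight", "airport"], ["food", "buffet"],
    ["room", "hotel", "food"], ["airport", "flight"],
]

# Represent each short response by its mean word vector ...
doc_vecs = np.array([np.mean([emb[w] for w in r], axis=0)
                     for r in responses])

# ... and merge similar responses. The number of pseudo-documents,
# like the number of topics, must be chosen a priori.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(doc_vecs)

pseudo_docs = {}
for label, response in zip(labels, responses):
    pseudo_docs.setdefault(int(label), []).extend(response)
print(pseudo_docs)
```

The resulting pseudo-documents are long enough for LDA-style inference, which is how ETM side-steps the sparsity of individual short answers.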

SLIDE 13

Proof-of-Concept

Datasets:

Twitter (Sentiment140):
  • 10,000 tweets in English
  • Purely observational

Survey Responses:
  • 10,000 survey responses in German
  • Responses to three different questions concerning travel

SLIDE 14

Proof-of-Concept

Results: resulting topics with top-5 words (excerpt)

LDA (English):
  Topic #1: hope, better, sick, feeling, feel
  Topic #2: twitter, phone, use, site, tweets
  Topic #3: morning, good, cold, snow, car

ETM (English):
  Topic #1: new, cold, better, damn, need
  Topic #2: sad, house, watching, night, thank
  Topic #3: sleep, time, night, hours, bed

LDA (German):
  Topic #1: gut, geklappt, organisiert, gefallen, reise
  Topic #2: super, einfach, nein, unkompliziert, schnell
  Topic #3: immer, zufrieden, buchen, gerne, reisen

ETM (German):
  Topic #1: super, einfach, stimmt, tolle, funktioniert
  Topic #2: geklappt, reibungslos, vielen, dank, perfekt
  Topic #3: service, organisation, hotel, hotels, information

SLIDE 15

Expert Review

Classical machine learning metrics are not informative for real research projects. The question of interest for us: can our (human) colleagues work with the results provided by the algorithms? Are the resulting topics coherent? That is, can the words associated with a topic indeed be grouped into a sensible topic?

SLIDE 16

Expert Review

Results: Expert Review (English dataset)

[Bar chart] Mean coherence ratings (SD): LDA 3.54 (1.04) and 3.23 (1.10); ETM 2.70 (1.15) and 2.25 (1.19)

SLIDE 17

Expert Review

Results: Expert Review (German dataset)

[Bar chart] Mean coherence ratings (SD): LDA 4.06 (0.76) and 4.09 (0.90); ETM 4.09 (0.85) and 3.72 (0.98)

SLIDE 18

Expert Review

Summary:
  • English: LDA results more coherent than ETM results
  • German: ETM and LDA rated equally coherent
  • But: highly dependent on topic selection

SLIDE 19

Summary

Our Learnings:
  • Proof of concept; needs further development
  • Fine-tuning of hyperparameters and techniques required
  • Pre-trained word vectors provide valuable information
  • Lots of data required for best results (> 1,000 responses)
  • Open question: a metric for usefulness in a real-world environment?

SLIDE 20

Further questions? Let's talk!

Thank You For Your Attention!

Christopher Harms, Consultant Research & Development, christopher.harms@skopos.de, @chrisharms
Sebastian Schmidt, Director Research & Development, sebastian.schmidt@skopos.de

SLIDE 21

References

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.
  • Qiang, J., Chen, P., Wang, T., & Wu, X. (2016). Topic Modeling over Short Texts by Incorporating Word Embeddings. CEUR Workshop Proceedings, 1828, 53–59. Retrieved from http://arxiv.org/abs/1609.08496
  • Xie, P., Yang, D., & Xing, E. P. (2015). Incorporating Word Correlation Knowledge into Topic Modeling. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) (pp. 725–734). Retrieved from http://www.cs.cmu.edu/~pengtaox/papers/naacl15_mrflda.pdf