Topic Modelling (and Natural Language Processing) workshop, PyCon UK 2019



SLIDE 1

Topic Modelling


(and Natural Language Processing)


workshop

@MarcoBonzanini

PyCon UK 2019

github.com/bonzanini/topic-modelling

SLIDE 2

Nice to meet you

  • Data Science consultant: NLP, Machine Learning, Data Engineering

  • Corporate training: Python + Data Science

  • PyData London chairperson


SLIDE 3

This tutorial

  • Introduction to Topic Modelling
  • Depending on time/interest: happy to discuss broader applications of NLP

  • The audience (tell me about you):

  • new-ish to NLP?

  • new-ish to Python tools for NLP?


SLIDE 4

Motivation

Suppose you:

  • have a huge number of (text) documents
  • want to know what they’re talking about
  • can’t read them all


SLIDE 5

Topic Modelling

  • Bird’s-eye view of the whole corpus (a dataset of documents)
  • Unsupervised learning

  • pros: no need for labelled data

  • cons: how to evaluate the model?

SLIDE 6

Topic Modelling

Input:


  • a collection of documents
  • a number of topics K


SLIDE 7

Topic Modelling

Output:


  • K topics
  • their word distributions

  • movie, actor, soundtrack, director, …

  • goal, match, referee, champions, …

  • price, invest, market, stock, …


SLIDE 8

Distributional Hypothesis

  • “You shall know a word by the company it keeps”


— J. R. Firth, 1957

  • “Words that occur in similar contexts tend to have similar meanings”

— Z. Harris, 1954

  • Context approximates Meaning


SLIDE 9

Term-document matrix

        Word 1   Word 2   …   Word N
Doc 1     1        7      …     2
Doc 2     3        5      …
Doc N     4        2      …
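A matrix like the one above can be built in a few lines of plain Python. The documents below are invented, and the tokenisation is deliberately naive (a real pipeline would normalise further: stop words, stemming, and so on):

```python
from collections import Counter

# Illustrative toy documents.
docs = [
    "the movie had a great actor and a great soundtrack",
    "the match ended when the referee blew the whistle",
]

# Naive tokenisation: lowercase and split on whitespace.
tokenised = [doc.lower().split() for doc in docs]

# One column per unique word across the corpus.
vocab = sorted({w for toks in tokenised for w in toks})

# One row per document, holding the raw count of each word.
matrix = [[Counter(toks)[w] for w in vocab] for toks in tokenised]

for doc_id, row in enumerate(matrix, start=1):
    print(f"Doc {doc_id}:", dict(zip(vocab, row)))
```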


SLIDE 10

Latent Dirichlet Allocation

  • Commonly used topic modelling approach
  • Key idea:

  • each document is a distribution over topics

  • each topic is a distribution over words

SLIDE 11

Latent Dirichlet Allocation

  • “Latent” as in hidden:

  • only words are observed, other variables are hidden

  • “Dirichlet Allocation”:

  • topics are assumed to follow a specific probability distribution (a Dirichlet prior)
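A quick way to see what the Dirichlet prior contributes: each sample from it is a valid probability distribution over the K topics, i.e. a possible per-document topic mixture. This sketch uses NumPy; the value of `alpha` is illustrative (small values favour sparse mixtures where a document leans on few topics).

```python
import numpy as np

rng = np.random.default_rng(42)

K = 3
alpha = [0.1] * K  # symmetric Dirichlet prior; illustrative value

# Five per-document topic distributions drawn from the prior.
theta = rng.dirichlet(alpha, size=5)

print(theta.round(2))
print(theta.sum(axis=1))  # each row sums to 1: a valid distribution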

SLIDE 12

Topic Model Evaluation

  • How good is my topic model?


“Unsupervised learning”… is there a correct answer?

  • Extrinsic metrics: what’s the task?
  • Intrinsic metrics: e.g. topic coherence
  • More interesting:

  • how useful is my topic model?

  • data visualisation can help to get some insights


SLIDE 13

Topic Coherence

  • Gives a score for the quality of a topic
  • Related to Information Theory (Pointwise Mutual Information)
  • Used to find the best number of topics for a corpus
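One common coherence formulation (UMass) can be sketched in plain Python: for each pair of a topic's top words, score how often they co-occur in the same document. The corpus and topic word lists below are invented, and the sketch assumes every topic word appears in at least one document (otherwise the ratio's denominator would be zero).

```python
from itertools import combinations
from math import log

# Tiny corpus: each document reduced to its set of words (illustrative).
docs = [
    {"movie", "actor", "director"},
    {"movie", "soundtrack", "director"},
    {"goal", "match", "referee"},
    {"movie", "actor"},
]

def doc_freq(word):
    """Number of documents containing the word."""
    return sum(word in d for d in docs)

def co_doc_freq(w1, w2):
    """Number of documents containing both words."""
    return sum(w1 in d and w2 in d for d in docs)

def umass_coherence(topic_words):
    """Sum of log co-occurrence ratios over word pairs; higher is better."""
    return sum(
        log((co_doc_freq(w1, w2) + 1) / doc_freq(w2))
        for w1, w2 in combinations(topic_words, 2)
    )

print(umass_coherence(["movie", "actor", "director"]))      # coherent topic
print(umass_coherence(["movie", "referee", "soundtrack"]))  # mixed topic
```

Words that rarely share documents drag the score down, which is why a mixed-up topic scores lower than a coherent one; libraries such as gensim implement this and related measures.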


SLIDE 14

Demo