Topic Modelling (and Natural Language Processing) workshop, PyCon UK 2019


  1. Topic Modelling (and Natural Language Processing) workshop
  PyCon UK 2019
  @MarcoBonzanini
  github.com/bonzanini/topic-modelling

  2. Nice to meet you
  • Data Science consultant: NLP, Machine Learning, Data Engineering
  • Corporate training: Python + Data Science
  • PyData London chairperson

  3. This tutorial
  • Introduction to Topic Modelling
  • Depending on time/interest: happy to discuss broader applications of NLP
  • The audience (tell me about you):
    - new-ish to NLP?
    - new-ish to Python tools for NLP?

  4. Motivation
  Suppose you:
  • have a huge number of (text) documents
  • want to know what they're talking about
  • can't read them all

  5. Topic Modelling
  • Bird's-eye view of the whole corpus (the dataset of documents)
  • Unsupervised learning
    - pros: no need for labelled data
    - cons: how to evaluate the model?

  6. Topic Modelling
  Input:
  - a collection of documents
  - a number of topics K

  7. Topic Modelling
  Output:
  - K topics
  - their word distributions
  Example topics:
  - movie, actor, soundtrack, director, …
  - goal, match, referee, champions, …
  - price, invest, market, stock, …
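
A minimal sketch of this input/output contract using gensim (an assumption: the workshop demo may use different tooling); the toy documents, the choice of K = 2 and all variable names are invented for illustration:

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # Toy corpus: in practice this would be thousands of documents.
    docs = [
        "the new movie has a great soundtrack and a famous director",
        "the director cast a well known actor in the movie",
        "the match ended when the referee awarded a late goal",
        "the champions won the match with a last minute goal",
    ]

    # Input: a collection of documents (here as lists of tokens) and a number of topics K.
    tokenised = [doc.lower().split() for doc in docs]
    dictionary = Dictionary(tokenised)                             # word <-> id mapping
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenised]  # bag-of-words vectors

    K = 2
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=K,
                   passes=10, random_state=42)

    # Output: K topics, each described by a distribution over words.
    for topic_id, words in lda.print_topics(num_words=5):
        print(topic_id, words)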

  8. Distributional Hypothesis
  • "You shall know a word by the company it keeps" (J. R. Firth, 1957)
  • "Words that occur in similar contexts tend to have similar meanings" (Z. Harris, 1954)
  • Context approximates meaning

  9. Term-document matrix

              Word 1   Word 2   Word N
    Doc 1        1        7        2
    Doc 2        3        0        5
    Doc N        0        4        2
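
A matrix like the one above (documents as rows, words as columns) can be built with scikit-learn's CountVectorizer; a minimal sketch assuming scikit-learn >= 1.0 (for get_feature_names_out) and invented example documents:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "the movie had a great director and a famous actor",
        "the referee awarded a goal late in the match",
        "stock prices fell as investors left the market",
    ]

    vectoriser = CountVectorizer()
    X = vectoriser.fit_transform(docs)          # sparse matrix: rows = documents, columns = words

    print(vectoriser.get_feature_names_out())   # the vocabulary (column labels)
    print(X.toarray())                          # raw term counts per document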

  10. Latent Dirichlet Allocation
  • Commonly used topic modelling approach
  • Key idea:
    - each document is a distribution over topics
    - each topic is a distribution over words

  11. Latent Dirichlet Allocation
  • "Latent" as in hidden: only the words are observed; the other variables (the topics) are hidden
  • "Dirichlet Allocation": topic proportions are assumed to follow a specific probability distribution (a Dirichlet prior)
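
To make the two distributions concrete, a short continuation of the earlier gensim sketch (it assumes the lda, dictionary and tokenised variables defined there):

    # Topic -> words: the word distribution of topic 0 (top 5 words with probabilities).
    print(lda.show_topic(0, topn=5))

    # Document -> topics: the topic distribution of the first document.
    bow = dictionary.doc2bow(tokenised[0])
    print(lda.get_document_topics(bow))         # a list of (topic_id, probability) pairs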

  12. Topic Model Evaluation
  • How good is my topic model? "Unsupervised learning"… is there a correct answer?
  • Extrinsic metrics: how well does the model help on a downstream task?
  • Intrinsic metrics: e.g. topic coherence
  • More interesting: how useful is my topic model?
    - data visualisation can help to get some insights (see the sketch below)
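
For the visualisation point, one common option with gensim models is pyLDAvis; a sketch assuming pyLDAvis >= 3 (where the gensim helper is pyLDAvis.gensim_models) and the lda, corpus and dictionary objects from the training sketch above:

    import pyLDAvis
    import pyLDAvis.gensim_models

    # Interactive view of the topics, their sizes and their most relevant words.
    vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
    pyLDAvis.save_html(vis, "lda_topics.html")  # open the HTML file in a browser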

  13. Topic Coherence
  • Gives a score for the quality of a topic
  • Related to Information Theory (Pointwise Mutual Information)
  • Used to find the best number of topics for a corpus (see the sketch below)
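
A sketch of using coherence to compare candidate values of K, again reusing the corpus, dictionary and tokenised variables from the training sketch; the 'c_v' measure and the range of K are arbitrary choices:

    from gensim.models import CoherenceModel, LdaModel

    for k in range(2, 6):
        lda_k = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                         passes=10, random_state=42)
        cm = CoherenceModel(model=lda_k, texts=tokenised,
                            dictionary=dictionary, coherence='c_v')
        print(k, cm.get_coherence())            # higher coherence is (usually) better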

  14. Demo
