Topic Models and Applications to Short Documents - Dieu-Thu Le (PowerPoint PPT Presentation)

SLIDE 1

Introduction Latent Dirichlet Allocation Gibbs Sampling Short Text Enrichment with Topic Models

Topic Models and Applications to Short Documents

Dieu-Thu Le

Email: dieuthu.le@unitn.it Trento University

April 6, 2011

1 / 43

SLIDE 2

Outline

◮ Introduction
◮ Latent Dirichlet Allocation
◮ Gibbs Sampling
◮ Short Text Enrichment with Topic Models
  • Author Name Disambiguation
  • Online Contextual Advertising
  • Query Classification

SLIDE 3

Problems with data collections

◮ With the availability of large document collections online, it becomes more difficult to represent and extract knowledge from them
◮ We need new tools to organize and understand these vast collections

SLIDE 4

Topic Models

Topic models provide methods for the statistical analysis of document collections and other discrete data:
◮ Uncover the hidden topical patterns in the collection
◮ Discover patterns of word use and connect documents that exhibit similar patterns

SLIDE 5

Discover Topics from a Document Collection

SLIDE 6

Image Annotation with Topic Models

Source: Y. Shao et al., Semi-supervised topic modeling for image annotation, 2009

SLIDE 7

Intuition behind LDA (Latent Dirichlet Allocation)

Simple intuition: documents exhibit multiple topics.

Source: http://www.cs.princeton.edu/~blei/modeling-science.pdf

SLIDE 8

Generative Process

Cast this intuition into a probabilistic procedure by which documents can be generated:
◮ Choose a distribution over topics for the document
◮ For each word, choose a topic according to that distribution, then choose the word from that topic's word distribution

SLIDE 9

Generative Process (2)

SLIDE 14

Statistical Inference: a Reverse Process

In reality, we observe only the documents. Given these documents, our goal is to infer the topic model most likely to have generated the data:
◮ What are the words for each topic?
◮ What are the topics for each document?

SLIDE 15

Graphical Models Notation

◮ Nodes are random variables
◮ Edges denote possible dependence
◮ Observed variables are shaded
◮ Plates denote repetitions

E.g., this graph encodes: p(y, x_1, ..., x_N) = p(y) ∏_{n=1}^{N} p(x_n | y)

SLIDE 16

Notations

◮ Word: an item from a vocabulary indexed by 1...V
◮ Document: w = (w_1, w_2, ..., w_{N_d}), a sequence of N_d words
◮ Corpus: D = (w_1, w_2, ..., w_M), a collection of M documents

SLIDE 17

LDA: Graphical Model

◮ α, β: Dirichlet priors
◮ M: number of documents
◮ N_d: number of words in document d
◮ z: latent topic
◮ w: observed word
◮ θ: distribution over topics in a document
◮ φ: distribution over words generated from topic z

Using plate notation:
◮ Sample a distribution over topics for each document d
◮ Sample a word distribution for each topic z, until T topics have been generated
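As a toy illustration, the two-level sampling described by the plate notation can be sketched in code. All sizes, priors, and document lengths below are made-up assumptions for the sketch, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

T, V, M = 3, 8, 2          # topics, vocabulary size, documents (toy sizes)
alpha, beta = 0.1, 0.01    # symmetric Dirichlet priors

# phi[z]: word distribution of topic z; theta[d]: topic distribution of doc d
phi = rng.dirichlet(np.full(V, beta), size=T)
theta = rng.dirichlet(np.full(T, alpha), size=M)

docs = []
for d in range(M):
    N_d = 10  # number of words in document d (fixed here for simplicity)
    words = []
    for _ in range(N_d):
        z = rng.choice(T, p=theta[d])   # choose a topic from theta_d
        w = rng.choice(V, p=phi[z])     # choose a word from phi_z
        words.append(int(w))
    docs.append(words)

print(docs)
```

Running the sketch produces M lists of word ids, one per generated document; statistical inference (next slides) is the reverse of this process.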

SLIDE 18

LDA: Graphical Model

Key Problem

Compute posterior distribution of the hidden variables given a document

SLIDE 19

Algorithm for Extracting Topics

◮ How do we estimate the posterior distribution of hidden variables given a collection of documents?
◮ Direct: e.g., via expectation-maximization (EM) [Hofmann, 1999]
◮ Indirect: estimate the posterior distribution over z, e.g., via Gibbs Sampling [Griffiths & Steyvers, 2004]

SLIDE 20

Gibbs Sampling for LDA

◮ Random start
◮ Iterative
◮ For each word, we compute:
  • How dominant is topic z in document d? How often was topic z already used in doc d?
  • How likely is word w for topic z? How often was word w already assigned to topic z?

SLIDE 21

Gibbs Sampling for LDA

P(z_i = j | z_{-i}, w_i, d_i, ·) ∝ [(C^WT_{w_i j} + β) / (Σ_{w=1}^{W} C^WT_{w j} + Wβ)] · [(C^DT_{d_i j} + α) / (Σ_{t=1}^{T} C^DT_{d_i t} + Tα)]

◮ The topic of each word is sampled from this distribution
◮ C^WT_{w_i j}: #times word w_i was assigned to topic j (excluding the current assignment)
◮ Σ_w C^WT_{w j}: total #words assigned to topic j
◮ C^DT_{d_i j}: #words in document d_i assigned to topic j (excluding the current assignment)
◮ Σ_t C^DT_{d_i t}: total #words in document d_i
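A minimal collapsed Gibbs sweep implementing this update might look as follows. The toy corpus and the names C_WT, C_DT are assumptions for the sketch; a real implementation would add burn-in and convergence checks:

```python
import numpy as np

rng = np.random.default_rng(1)
docs = [[0, 1, 2, 1], [2, 3, 3, 0]]   # toy corpus: word ids per document
W, T = 4, 2                           # vocabulary size, number of topics
alpha, beta = 0.5, 0.1

# Random start: assign each word a random topic and fill the count matrices
z = [[int(rng.integers(T)) for _ in doc] for doc in docs]
C_WT = np.zeros((W, T))               # word-topic counts
C_DT = np.zeros((len(docs), T))       # document-topic counts
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        C_WT[w, z[d][i]] += 1
        C_DT[d, z[d][i]] += 1

for _ in range(100):                  # Gibbs iterations
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            j = z[d][i]
            C_WT[w, j] -= 1           # exclude the current assignment
            C_DT[d, j] -= 1
            # The sampling distribution from the slide, for all topics at once
            p = ((C_WT[w] + beta) / (C_WT.sum(axis=0) + W * beta)
                 * (C_DT[d] + alpha) / (C_DT[d].sum() + T * alpha))
            j = int(rng.choice(T, p=p / p.sum()))
            z[d][i] = j
            C_WT[w, j] += 1
            C_DT[d, j] += 1
```

After enough iterations the count matrices stabilize and can be read off to estimate φ and θ.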

SLIDE 26

Gibbs Sampling Convergence

◮ Random start
◮ N iterations
◮ Each iteration updates the count matrices

Convergence: the count matrices stop changing

SLIDE 27

Estimating θ and φ

φ'^(j)_i = (C^WT_{i j} + β) / (Σ_{k=1}^{W} C^WT_{k j} + Wβ)

θ'^(d)_j = (C^DT_{d j} + α) / (Σ_{k=1}^{T} C^DT_{d k} + Tα)

φ'^(j)_i is the estimated probability of word i under topic j; θ'^(d)_j is the estimated probability of topic j in document d.
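These two estimates are simple normalizations of the count matrices a Gibbs sampler leaves behind. A sketch, with made-up count matrices for illustration:

```python
import numpy as np

W, T = 4, 2
alpha, beta = 0.5, 0.1

# Made-up count matrices, as a Gibbs sampler might leave them
C_WT = np.array([[3, 0], [1, 2], [0, 4], [2, 1]])   # W x T word-topic counts
C_DT = np.array([[4, 2], [2, 4]])                   # D x T doc-topic counts

# phi[i, j]: P(word i | topic j); normalize each topic column
phi = (C_WT + beta) / (C_WT.sum(axis=0) + W * beta)
# theta[d, j]: P(topic j | doc d); normalize each document row
theta = (C_DT + alpha) / (C_DT.sum(axis=1, keepdims=True) + T * alpha)

assert np.allclose(phi.sum(axis=0), 1.0)    # each topic's word dist sums to 1
assert np.allclose(theta.sum(axis=1), 1.0)  # each doc's topic dist sums to 1
```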

SLIDE 28

Short & Sparse Text Segments

◮ The explosion of
  • e-commerce
  • online communication, and
  • online publishing
◮ Typical examples
  • Web search snippets
  • Forum & chat messages
  • Blog and news feeds/summaries
  • Book & movie summaries
  • Product descriptions
  • Customer reviews
  • Short descriptions of entities such as people, companies, hotels, etc.

SLIDE 29

Challenges

◮ Very short
  • From a dozen words to several sentences
  • Noisier
  • Less topic-focused
◮ Sparse
  • Not enough common words or shared context among them
◮ Consequences
  • Similarity is difficult to measure
  • Hard to classify and cluster correctly

SLIDE 30

Synonymy & Polysemy with Topics

SLIDE 31

Short Text Enrichment with Topic Models

◮ Take advantage of available large collections to learn a topic model
◮ Use this model to analyze the topics of short text documents
◮ Enrich short text documents with the topics that have high probability
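A rough sketch of the enrichment step. Everything here is an illustrative assumption: the tiny two-topic model, the vocabulary, and the crude one-pass `infer_theta`, which stands in for proper topic inference with an estimated model (e.g. Gibbs sampling with φ held fixed):

```python
import numpy as np

# Assume an already-estimated model phi over a toy vocabulary
vocab = ["film", "movie", "actor", "bank", "loan", "credit"]
phi = np.array([[0.5, 0.3, 0.2, 0.0, 0.0, 0.0],    # topic 0: cinema-like
                [0.0, 0.0, 0.0, 0.4, 0.3, 0.3]])   # topic 1: finance-like
alpha = 0.5

def infer_theta(word_ids, phi, alpha):
    """Crude topic inference for a short text: accumulate per-word
    topic posteriors (a stand-in for full Gibbs inference)."""
    counts = np.full(phi.shape[0], alpha)
    for w in word_ids:
        p = phi[:, w]
        counts += p / p.sum()
    return counts / counts.sum()

def enrich(words, vocab, phi, alpha, cutoff=0.3, n_tokens=2):
    """Append pseudo-tokens for every topic whose probability
    in the short text exceeds the cutoff."""
    ids = [vocab.index(w) for w in words if w in vocab]
    theta = infer_theta(ids, phi, alpha)
    extra = [f"topic{z}" for z in range(len(theta)) if theta[z] > cutoff
             for _ in range(n_tokens)]
    return words + extra

print(enrich(["movie", "actor"], vocab, phi, alpha))
```

The appended topic tokens give two short texts about the same subject shared context even when they share no surface words.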

SLIDE 32

Short Text Enrichment with Topic Models

◮ Deals with the problems of sparse and short texts: word choice, synonymy, polysemy
◮ Increases the co-occurrence phenomenon among them
◮ Expands and enriches the shared context of the data
◮ General and flexible: can be applied to different tasks, domains, and languages

SLIDE 33

Applications

◮ Author Name Disambiguation: enrich book titles, scientific/general domain, in English
◮ Online Contextual Advertising: enrich webpages and advertisements, general domain, in Vietnamese
◮ Query Classification: enrich queries, art domain, in English

SLIDE 34

Author Name Disambiguation

◮ Ambiguous author name: different authors having the same name
◮ Author name disambiguation: a crucial service in catalogue searching & data integration

SLIDE 35

Author Name Disambiguation: A Framework

SLIDE 36

Metadata enriching module with Topics

SLIDE 37

Wikipedia Preprocessing

SLIDE 38

Sample topics extracted from the estimated model

SLIDE 39

Hidden Topic Inference for Metadata

SLIDE 40

Results

SLIDE 41

Online Contextual Advertising

A solution for “reaching the right person with the right message at the right time”.

SLIDE 42

Contextual Matching & Ranking

(Figure: a target page matched and ranked against a set of advertisements.)

◮ A set of Web pages P = {p_1, p_2, ..., p_n}
◮ A set of ads A = {a_1, a_2, ..., a_m}

Matching & ranking:
◮ For each p ∈ P (p is called the "target page")
◮ Match & rank all ads in A w.r.t. p such that the top-k ads A* = {a_{p_1}, ..., a_{p_k}} ⊂ A are most relevant to the content of p
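One way to sketch this matching step: represent both pages and ads as enriched bags of tokens (original terms plus topic pseudo-tokens) and rank ads by cosine similarity. The vectors and names below are toy assumptions, not the talk's actual features:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bags of tokens."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_ads(page_tokens, ads, k=2):
    """Return the names of the k ads most similar to the target page."""
    scored = sorted(ads.items(),
                    key=lambda kv: cosine(page_tokens, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# Toy enriched representations: terms plus topic pseudo-tokens
page = ["laptop", "review", "battery", "topic_tech", "topic_tech"]
ads = {
    "ad_laptop": ["laptop", "sale", "topic_tech"],
    "ad_loan":   ["loan", "credit", "topic_finance"],
    "ad_phone":  ["phone", "case", "topic_tech"],
}

print(rank_ads(page, ads, k=2))
```

Note how the shared `topic_tech` token lets the phone ad match the laptop page even though they share no surface words; that is the point of the enrichment.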

SLIDE 43

Webpage & Advertisement Enriching with Topics

SLIDE 44

Topic Analysis of Large News Collections

Using Latent Dirichlet Allocation (LDA) [Blei et al. 2003] & Gibbs Sampling [Griffiths & Steyvers 2004]

SLIDE 45

Sample topics extracted from the estimated model

Full results at http://gibbslda.sourceforge.net/vnexpress-200topics.txt

SLIDE 46

Result

SLIDE 47

Query Classification Task

◮ Classifying queries into a target taxonomy
◮ Domain: Art, Culture & History images

SLIDE 48

Query enriching with Topics

SLIDE 49

Result

Setting      #1   #2   #3   Hits   % Top 3
Baseline 1   13   17    5    33    60%
Baseline 2   15   14    7    35    63.6%
TM 1         14   15    5    32    58.2%
TM 2a        22   14    6    40    72.7%
TM 2b        31    9    6    44    80%

Table: Results of Query Classification with Click-Through Information

SLIDE 50

Conclusions

◮ Topic models can be useful tools for the statistical analysis of document collections
◮ These models make explicit assumptions about the process responsible for generating a document
◮ Topic models estimated from large corpora can be exploited to deal with the problem of short and sparse text, as demonstrated in different tasks with promising results

SLIDE 51

Bibliography

D.M. Blei and J.D. Lafferty, A correlated topic model of science, The Annals of Applied Statistics 1 (2007), no. 1, 17–35.

D.M. Blei, A.Y. Ng, and M.I. Jordan, Latent Dirichlet allocation, The Journal of Machine Learning Research 3 (2003), 993–1022.

T.L. Griffiths and M. Steyvers, Finding scientific topics, Proceedings of the National Academy of Sciences of the United States of America 101 (2004), no. Suppl 1, 5228–5235.

D.T. Le, C.T. Nguyen, Q.T. Ha, X.H. Phan, and S. Horiguchi, Matching and ranking with hidden topics towards online contextual advertising, 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, IEEE, 2008, pp. 888–891.

X. Phan, C. Nguyen, D. Le, L. Nguyen, S. Horiguchi, Q. Ha, E. Iosif, A. Potamianos, P. Velardi, A. Cucchiarelli, et al., A hidden topic-based framework towards building applications with short Web documents, IEEE Transactions on Knowledge and Data Engineering (2011).

D.T. Le and R. Bernardi, Metadata enrichment via topic models for author name disambiguation, Advanced Language Technologies for Digital Libraries, Hot Topic series, Springer (2011).