SLIDE 1

CS6200: Information Retrieval

Applications of Topic Models

Document Understanding, session 7

SLIDE 2

Extending Topic Models

PLSA is the most basic probabilistic topic model, and the idea has been usefully extended in many ways.

  • Its probability estimates have been regularized to improve output quality, most notably by Latent Dirichlet Allocation (LDA).
  • The document collection has been grouped in various ways (e.g. by language or publication date) to give topics more flexibility.
  • Additional data can be included, such as sentiment labels, to condition the vocabulary distribution on new factors.

Plate-diagram legend:
M – number of documents
N – document length
d – document, selected with P(d)
z – topic, selected with P(z|d)
w – word, selected with P(w|z)
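To make PLSA's generative story concrete, here is a minimal sketch in Python. The dimensions and the randomly drawn parameters are made up purely for illustration; in a real system P(d), P(z|d), and P(w|z) are fit to a corpus, typically with EM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: M documents, K topics, V vocabulary words.
M, K, V = 4, 3, 10

# PLSA's parameters, drawn at random here just to have concrete values;
# a fitted model would estimate them from data.
P_d = rng.dirichlet(np.ones(M))             # P(d): distribution over documents
P_z_given_d = rng.dirichlet(np.ones(K), M)  # P(z|d): topic mixture per document
P_w_given_z = rng.dirichlet(np.ones(V), K)  # P(w|z): word distribution per topic

def generate_observation():
    """Sample one (document, word) pair following PLSA's generative story."""
    d = rng.choice(M, p=P_d)              # pick a document with P(d)
    z = rng.choice(K, p=P_z_given_d[d])   # pick a topic with P(z|d)
    w = rng.choice(V, p=P_w_given_z[z])   # pick a word with P(w|z)
    return d, w

print([generate_observation() for _ in range(5)])
```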

SLIDE 3

Latent Dirichlet Allocation

Latent Dirichlet Allocation regularizes PLSA by placing Dirichlet priors on its multinomial topic distributions; most topic models extend LDA, not PLSA. The prior with parameter α governs each document's topic mixture θ, and β holds the per-topic word distributions (which receive their own Dirichlet prior in smoothed LDA). These priors work like smoothing parameters, limiting how extreme the document and vocabulary distributions can become.

Plate-diagram legend:
M – number of documents
N_d – length of document d
α – parameter of the Dirichlet prior on topic mixtures
β – per-topic word distributions
θ_d – topic mixture for document d, drawn with p(θ_d | α)
z – topic, selected with p(z | θ_d)
w – word, selected with p(w | z, β)

The data likelihood is given by:

\[
P(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z} p(z \mid \theta_d)\, p(w_n \mid z, \beta) \right) d\theta_d
\]

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation.
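A minimal generative sketch of the model above, with hypothetical toy dimensions. β is drawn at random here just to have concrete word distributions; a fitted model would infer it from the corpus.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: K topics, V vocabulary words, N_d words/doc.
K, V, N_d = 3, 10, 15

alpha = np.full(K, 0.1)              # Dirichlet prior on topic mixtures;
                                     # small values favor sparse mixtures
beta = rng.dirichlet(np.ones(V), K)  # per-topic word distributions
                                     # (random stand-ins; normally inferred)

def generate_document():
    """Sample one document of N_d word ids following LDA's generative story."""
    theta = rng.dirichlet(alpha)                # theta_d ~ Dirichlet(alpha)
    words = []
    for _ in range(N_d):
        z = rng.choice(K, p=theta)              # z ~ Multinomial(theta_d)
        words.append(rng.choice(V, p=beta[z]))  # w ~ p(w | z, beta)
    return words

print(generate_document())
```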

SLIDE 4

Dynamic Topic Models

Language usage changes over time, due to vocabulary drift and communities’ changing interests. Dynamic Topic Models capture that change by learning how topics drift as time goes on. Documents are grouped into time steps according to their publication dates, and the distributions over vocabulary and documents, α and β, are constrained to drift only gradually from the distributions in the preceding time step.

David M. Blei and John D. Lafferty. 2006. Dynamic topic models.

[Figure: three time steps of the model; α and β drift slightly in each time step.]
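The drift constraint can be sketched as a Gaussian random walk in an unconstrained parameter space, mapped through a softmax to get each time step's word distribution. The dimensions and the drift scale sigma below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: one topic over V words across T time steps.
V, T, sigma = 10, 3, 0.05

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Topic parameters follow a random walk, beta_t ~ Normal(beta_{t-1}, sigma^2 I),
# so the softmax-mapped word distribution can only drift gradually per step.
beta = np.zeros((T, V))
beta[0] = rng.normal(0.0, 1.0, V)
for t in range(1, T):
    beta[t] = beta[t - 1] + rng.normal(0.0, sigma, V)

for t in range(T):
    print(f"step {t}:", np.round(softmax(beta[t]), 3))
```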

SLIDE 5

Topics over Time

The resulting topics show how language usage changes within each topic.

David M. Blei and John D. Lafferty. 2006. Dynamic topic models.

SLIDE 6

Polylingual Topic Models

Can we learn how topics are expressed by speakers of different languages? Polylingual Topic Models accomplish this by training on a collection of document tuples, each containing a representative document from every language. Tuples may be direct translations, or simply the Wikipedia pages on a subject in each language, even though those pages don’t cover exactly the same subtopics.

David Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual topic models.

Plate-diagram legend:
θ – topic mixture shared by a tuple of related documents, one per language
φ – language-specific vocabulary distribution for each topic
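A minimal generative sketch of that structure, with hypothetical vocabulary sizes: every tuple shares one topic mixture θ, while each language keeps its own topic-word distributions φ.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3                                # number of topics (hypothetical)
vocab_sizes = {"en": 12, "de": 14}   # per-language vocabularies (hypothetical)
N = 8                                # words per document

# phi[lang] holds K word distributions for that language, one per topic;
# random stand-ins here, normally inferred from the tuple collection.
phi = {lang: rng.dirichlet(np.ones(V), K) for lang, V in vocab_sizes.items()}

def generate_tuple(alpha=0.1):
    """Sample one tuple of documents that share a single topic mixture theta."""
    theta = rng.dirichlet(np.full(K, alpha))   # shared across the whole tuple
    docs = {}
    for lang, V in vocab_sizes.items():
        zs = rng.choice(K, size=N, p=theta)    # topics drawn from shared theta
        docs[lang] = [int(rng.choice(V, p=phi[lang][z])) for z in zs]
    return docs

print(generate_tuple())
```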

SLIDE 7

Polylingual Topic Models

[Figure: two topics from EU Parliament Proceedings (direct translations), and two topics from Wikipedia (related pages).]

SLIDE 8

Wrapping Up

There are many ways to group documents or include additional data to extend topic modeling, and the resulting topics are useful for data exploration and categorization. Topic models alone are not sufficient for good IR ranking performance, but they make a useful set of supplementary features for document understanding. Next, we’ll look at how to cluster documents together using any set of features; a brief sketch of that hand-off follows.
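A hedged sketch of using topics as supplementary features, assuming scikit-learn and a made-up four-document corpus: fit a topic model, take each document's inferred topic proportions as its feature vector, and cluster on those features.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus purely for illustration.
docs = [
    "retrieval ranking query index",
    "ranking query search engine",
    "topic model dirichlet inference",
    "latent topic inference gibbs",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_features = lda.fit_transform(counts)   # each row is a doc's topic mixture

# Cluster documents on their topic-proportion features.
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(topic_features)
print(labels)
```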