Machine Learning 2 (DS 4420, Spring 2020): Topic Modeling 1, Byron C. Wallace


  1. Machine Learning 2 DS 4420 - Spring 2020 Topic Modeling 1 Byron C. Wallace

  2. Last time: Clustering → Mixture Models → Expectation Maximization (EM)

  3. Today: Topic models

  4. Mixture models
  Assume we are given data $\mathcal{D}$, consisting of $N$ fully unsupervised examples in $M$ dimensions:
  Data: $\mathcal{D} = \{x^{(i)}\}_{i=1}^{N}$ where $x^{(i)} \in \mathbb{R}^{M}$
  Generative story: $z \sim \mathrm{Multinomial}(\phi)$, $x \sim p_{\theta}(\cdot \mid z)$
  Joint: $p_{\theta,\phi}(x, z) = p_{\theta}(x \mid z)\, p_{\phi}(z)$
  Marginal: $p_{\theta,\phi}(x) = \sum_{z=1}^{K} p_{\theta}(x \mid z)\, p_{\phi}(z)$
  (Marginal) log-likelihood: $\ell(\theta, \phi) = \log \prod_{i=1}^{N} p_{\theta,\phi}(x^{(i)}) = \sum_{i=1}^{N} \log \sum_{z=1}^{K} p_{\theta}(x^{(i)} \mid z)\, p_{\phi}(z)$
  Slide credit: Matt Gormley and Eric Xing (CMU)
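  As a small numeric illustration of this marginal log-likelihood, here is a sketch assuming a mixture of multinomials over word-count vectors; all parameter values and data below are made up for illustration.

```python
import numpy as np

# Hypothetical toy setup: K = 2 mixture components over a vocabulary of M = 3 words.
phi = np.array([0.6, 0.4])                       # p_phi(z): mixing weights
theta = np.array([[0.7, 0.2, 0.1],               # p_theta(word | z = 0)
                  [0.1, 0.3, 0.6]])              # p_theta(word | z = 1)

# Data: N examples, each a word-count vector x^(i) in R^M.
X = np.array([[3, 1, 0],
              [0, 2, 4]])

def log_marginal_likelihood(X, phi, theta):
    """l(theta, phi) = sum_i log sum_z p_theta(x^(i) | z) p_phi(z)."""
    # log p_theta(x | z) for a multinomial (dropping the constant count factor)
    log_px_given_z = X @ np.log(theta.T)          # shape (N, K)
    log_joint = log_px_given_z + np.log(phi)      # add log p_phi(z)
    # log-sum-exp over z gives the log marginal for each example
    m = log_joint.max(axis=1, keepdims=True)
    log_px = m.squeeze(1) + np.log(np.exp(log_joint - m).sum(axis=1))
    return log_px.sum()

print(log_marginal_likelihood(X, phi, theta))
```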

  8. Naive Bayes
  The model: $p(c \mid w_{1:N}, \pi, \theta) \propto p(c \mid \pi) \prod_{n=1}^{N} p(w_n \mid \theta_c)$
  Likelihood of the corpus: $p(\mathcal{D} \mid \theta_{1:C}, \pi) = \prod_{d=1}^{D} \left( p(c_d \mid \pi) \prod_{n=1}^{N} p(w_n \mid \theta_{c_d}) \right)$
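  A short sketch of computing the class posterior $p(c \mid w_{1:N}, \pi, \theta)$ in log space; the class prior, word distributions, and document below are illustrative assumptions, not values from the slides.

```python
import numpy as np

log_pi = np.log(np.array([0.5, 0.5]))            # log p(c | pi), C = 2 classes
log_theta = np.log(np.array([[0.7, 0.2, 0.1],    # log p(w | theta_c) for c = 0
                             [0.1, 0.3, 0.6]]))  # ... and c = 1; vocab size 3

words = [0, 0, 2, 1]                             # w_{1:N} as vocabulary indices

# log p(c | w_{1:N}) is, up to a constant, log p(c | pi) + sum_n log p(w_n | theta_c)
log_post = log_pi + log_theta[:, words].sum(axis=1)
post = np.exp(log_post - log_post.max())
post /= post.sum()                               # normalize over classes
print(post)
```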

  9. (Soft) EM
  Initialize parameters randomly
  While not converged:
    1. E-Step: Create one training example for each possible value of the latent variables. Weight each example according to the model's confidence. Treat the parameters as observed.
    2. M-Step: Set the parameters to the values that maximize the likelihood. Treat the pseudo-counts from above as observed.
  Slide credit: Matt Gormley and Eric Xing (CMU)

  10. EM for Naive Bayes: the updates use expected (soft) counts. $P(z = c) \propto \sum_i P(z = c \mid x^{(i)})$, and $P(x = t \mid z = c) = \dfrac{\sum_i P(z = c \mid x^{(i)})\,\mathrm{count}(t, x^{(i)})}{\sum_i P(z = c \mid x^{(i)})\,(\text{total token count of } x^{(i)})}$.
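  A compact sketch of soft EM for a mixture of multinomials (Naive Bayes with a latent class), matching the E-step/M-step recipe above; the toy document-word counts and the number of components are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[3, 1, 0], [0, 2, 4], [2, 2, 0], [0, 1, 5]], dtype=float)  # doc-word counts
N, M = X.shape
K = 2

# Random initialization of p(z) and p(word | z)
phi = np.full(K, 1.0 / K)
theta = rng.dirichlet(np.ones(M), size=K)        # shape (K, M)

for _ in range(50):
    # E-step: responsibilities r[i, z] = p(z | x^(i)) under current parameters
    log_joint = X @ np.log(theta.T) + np.log(phi)
    log_joint -= log_joint.max(axis=1, keepdims=True)
    r = np.exp(log_joint)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: maximize expected complete-data log-likelihood using soft counts
    phi = r.sum(axis=0) / N                       # p(z) from expected class counts
    expected_counts = r.T @ X                     # expected word counts per class, (K, M)
    theta = expected_counts / expected_counts.sum(axis=1, keepdims=True)

print(phi)
print(theta.round(3))
```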

  11. TOPIC MODELS. Some content borrowed from David Blei (Columbia)

  12. Topic Models: Motivation
  • Suppose we have a giant dataset ("corpus") of text, e.g., all of the NYTimes or all emails from a company
  ❖ Cannot read all documents
  ❖ But want to get a sense of what they contain

  13. Topic Models: Motivation
  • Topic models are a way of uncovering, well, "topics" (themes) in a set of documents
  • Topic models are unsupervised
  • Can be viewed as a type of clustering, so follows naturally from prior lectures; will come back to this

  15. Topic Models: Motivation
  • Topic models are a way of uncovering, well, "topics" (themes) in a set of documents
  • Topic models are unsupervised
  • Can be viewed as a sort of soft clustering of documents into topics

  16. Example topics (top words), from Wallach, 2006:
      Topic 1     Topic 2     Topic 3        Topic 4
      the         i           that           easter
      "number"    is          proteins       ishtar
      in          the         a              satan
      to          the         of             the
      which       to          have           espn
      and         i           with           hockey
      a           of          if             but
      this        "number"    metaphorical   english
      as          you         and            evil
      there       is          run            fact

  17. Key outputs
  • Topics: distributions over words; we hope these are somehow thematically coherent
  • Document-topics: probabilistic assignments of topics to documents
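  As a minimal sketch of how these two outputs look in practice, here is a tiny example using scikit-learn's LatentDirichletAllocation; the three-document corpus and all settings below are made-up assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["gas pipeline contract deal capacity",
        "california power utilities state davis",
        "meeting team plan process group"]       # tiny stand-in corpus

vectorizer = CountVectorizer().fit(docs)
X = vectorizer.transform(docs)

lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)

# Output 1: topics -- each row of components_ is an (unnormalized) word distribution
vocab = vectorizer.get_feature_names_out()
for k, row in enumerate(lda.components_):
    top_words = [vocab[i] for i in row.argsort()[::-1][:3]]
    print(f"topic {k}:", top_words)

# Output 2: document-topic probabilities (one row per document)
print(lda.transform(X).round(2))
```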

  18. Example: Enron emails https://en.wikipedia.org/wiki/Enron_scandal https://www.cs.cmu.edu/~enron/

  19. Example: Enron emails
      Topic   Terms
      3       trading financial trade product price
      6       gas capacity deal pipeline contract
      9       state california davis power utilities
      14      ferc issue order party case
      22      group meeting team process plan
  Example from Boyd-Graber, Hu and Mimno, 2017
  https://en.wikipedia.org/wiki/Enron_scandal

  20. Document-topic probabilities
  Yesterday, SDG&E filed a motion for adoption of an electric procurement cost recovery mechanism and for an order shortening time for parties to file comments on the mechanism. The attached email from SDG&E contains the motion, an executive summary, and a detailed summary of their proposals and recommendations governing procurement of the net short energy requirements for SDG&E's customers. The utility requests a 15-day comment period, which means comments would have to be filed by September 10 (September 8 is a Saturday). Reply comments would be filed 10 days later.
      Topic   Probability
      9       0.42
      11      0.05
      8       0.05
  Example from Boyd-Graber, Hu and Mimno, 2017

  21. Topics as Matrix Factorization
  • One can view topics as a kind of matrix factorization:
    [M × K topic-assignment matrix] × [K × V topics matrix] ≈ [M × V dataset matrix]
  Figure from Boyd-Graber, Hu and Mimno, 2017

  22. Topics as Matrix Factorization
  • One can view topics as a kind of matrix factorization:
    [M × K topic-assignment matrix] × [K × V topics matrix] ≈ [M × V dataset matrix]
  • We will take a more probabilistic view, but it is useful to keep this in mind
  Figure from Boyd-Graber, Hu and Mimno, 2017
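  Purely as an illustration of this factorization view (a non-probabilistic stand-in, not the model developed in the rest of the lecture), here is a sketch using scikit-learn's NMF on a synthetic document-term matrix; the sizes and data below are made up.

```python
import numpy as np
from sklearn.decomposition import NMF

M, V, K = 6, 10, 2                                # documents, vocabulary size, topics
rng = np.random.default_rng(0)
dataset = rng.poisson(1.0, size=(M, V)).astype(float)   # fake M x V count matrix

nmf = NMF(n_components=K, init="random", random_state=0, max_iter=500)
doc_topic = nmf.fit_transform(dataset)            # M x K "topic assignment" matrix
topics = nmf.components_                          # K x V "topics" matrix

# The product approximately reconstructs the dataset: (M x K) @ (K x V) ≈ (M x V)
print(np.abs(doc_topic @ topics - dataset).mean())
```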

  23. Probabilistic Word Mixtures
  Idea: Model text as a mixture over words (ignore order)
  Example topics (distributions over words):
      gene 0.04, dna 0.02, genetic 0.01, …
      life 0.02, evolve 0.01, organism 0.01, …
      brain 0.04, neuron 0.02, nerve 0.01, …
      data 0.02, number 0.02, computer 0.01, …

  24. Topic Modeling
  Idea: Model a corpus of documents with shared topics
  [Figure: shared topics (distributions over words such as gene/dna/genetic, life/evolve/organism, brain/neuron/nerve, data/number/computer), document-specific topic proportions and assignments, and the words in a document as a mixture over topics]

  25. Topic Modeling
  [Figure: shared topics, document-specific topic proportions and assignments, and the words in a document]
  • Each topic is a distribution over words
  • Each document is a mixture over topics
  • Each word is drawn from one topic distribution
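  The three bullets above describe a generative process; here is a small sketch that samples one document that way. The topic distributions, topic proportions, and vocabulary below are made-up illustrative values, loosely echoing the example words on the previous slides.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["gene", "dna", "brain", "neuron", "data", "number"]

# Shared topics: each row is a distribution over the vocabulary
topics = np.array([[0.45, 0.35, 0.05, 0.05, 0.05, 0.05],   # "genetics"-like topic (assumed)
                   [0.05, 0.05, 0.45, 0.35, 0.05, 0.05],   # "neuroscience"-like topic (assumed)
                   [0.05, 0.05, 0.05, 0.05, 0.45, 0.35]])  # "data"-like topic (assumed)

# Document-specific topic proportions (the mixture over topics)
proportions = np.array([0.7, 0.2, 0.1])

doc = []
for _ in range(10):
    z = rng.choice(len(topics), p=proportions)    # pick a topic for this word
    w = rng.choice(len(vocab), p=topics[z])       # draw the word from that topic's distribution
    doc.append(vocab[w])
print(doc)
```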
