COMS 4721: Machine Learning for Data Science
Lecture 18, 4/4/2017
Prof. John Paisley
Department of Electrical Engineering & Data Science Institute, Columbia University

TOPIC MODELING

MODELS FOR TEXT DATA
Given text data we want to:
◮ Organize
◮ Visualize
◮ Summarize
◮ Search
◮ Predict
◮ Understand
Topic models allow us to do all of these things.

A probabilistic topic model
◮ learns distributions on words called "topics" that are shared by the documents,
◮ learns a distribution on topics for each document,
◮ assigns every word in a document to a topic.

However, none of these things are known in advance and must be learned from the data.
◮ Each document is treated as a "bag of words."
◮ We need to define (1) a model and (2) an algorithm to learn it.
◮ We will review the standard topic model, but won't cover inference.
There are two essential ingredients to latent Dirichlet allocation (LDA).

Ingredient 1: Topics. Each topic βk is a distribution on the words of the vocabulary, and the topics are shared by all documents.

[Figure: three example topics β1, β2, β3, each putting high probability on a different group of words, e.g., "vote, politics, senate, tax" vs. "ball, score, season, set" vs. "reason, sense, proof, brain."]

Ingredient 2: Topic proportions. Each document d has its own distribution θd on the topics, which is used to allocate every word in the document to one of the topics.
The generative process for LDA is:

1. Generate each topic: βk ∼ Dirichlet(γ), k = 1, . . . , K
2. For each document, generate its distribution on topics: θd ∼ Dirichlet(α), d = 1, . . . , D
3. For each word n in document d:
   a) Allocate the word to a topic, cdn ∼ Discrete(θd)
   b) Generate the word from the selected topic, xdn ∼ Discrete(βcdn)
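To make the generative process concrete, here is a minimal NumPy sketch (not from the slides) that samples a small synthetic corpus from this model; the vocabulary size, topic count, document count, document length, and hyperparameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, D, N = 50, 3, 10, 40   # vocab size, topics, documents, words per doc (illustrative)
gamma, alpha = 0.1, 1.0      # Dirichlet hyperparameters (illustrative)

# 1. Generate each topic: beta_k ~ Dirichlet(gamma)
beta = rng.dirichlet(np.full(V, gamma), size=K)     # K x V, each row a distribution on words

# 2. Generate each document's distribution on topics: theta_d ~ Dirichlet(alpha)
theta = rng.dirichlet(np.full(K, alpha), size=D)    # D x K

docs = []
for d in range(D):
    # 3a. Allocate each word to a topic, c_dn ~ Discrete(theta_d)
    c = rng.choice(K, size=N, p=theta[d])
    # 3b. Generate each word from its selected topic, x_dn ~ Discrete(beta_{c_dn})
    x = np.array([rng.choice(V, p=beta[k]) for k in c])
    docs.append(x)
```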
A continuous distribution on discrete probability vectors. Let βk be a probability vector and γ a positive parameter vector. Then

p(βk | γ) = [ Γ(∑v γv) / ∏v=1^V Γ(γv) ] ∏v=1^V βk,v^(γv − 1).

This defines the Dirichlet distribution.

[Figures: example vectors βk generated from this distribution for V = 10 and a constant parameter value γ = 100, 10, 1, 0.1, and 0.01.]
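As a quick illustration (a sketch, not from the slides), the following draws such vectors with NumPy for the same values of γ; large γ gives draws close to uniform, while small γ concentrates most of the probability mass on a few entries.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 10  # dimension of the probability vector, as in the slides

for gamma in [100.0, 10.0, 1.0, 0.1, 0.01]:
    beta_k = rng.dirichlet(np.full(V, gamma))   # one draw from Dirichlet(gamma, ..., gamma)
    print(f"gamma = {gamma:6}:", np.round(beta_k, 3))
```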
LDA outputs two main things:

1. The topics. [Figure: example topics learned from NYT data; for each topic we list the ten words with the highest probability.]
2. A distribution on topics for each document. This gives the document's thematic breakdown and provides a compact representation of it.
Q: For a particular document, what is P(xdn = i | β, θd)?

A: Find this by integrating out the cluster assignment,

P(xdn = i | β, θd) = ∑k=1^K P(xdn = i, cdn = k | β, θd)
                   = ∑k=1^K P(xdn = i | β, cdn = k) P(cdn = k | θd)
                   = ∑k=1^K βk,i θd,k.

Let B = [β1, . . . , βK] and Θ = [θ1, . . . , θD]. Then

P(xdn = i | β, θ) = (BΘ)id.

In other words, we can read these probabilities from a matrix formed by taking the product of two matrices with nonnegative entries.
LDA can be thought of as an instance of nonnegative matrix factorization.
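A small NumPy check of this identity (a sketch; B and Θ are random here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, D = 20, 4, 5   # vocabulary size, topics, documents (illustrative)

# Columns of B are topics (distributions on words); columns of Theta are per-document topic proportions.
B = rng.dirichlet(np.ones(V), size=K).T       # V x K
Theta = rng.dirichlet(np.ones(K), size=D).T   # K x D

# P(x_dn = i | B, theta_d) for all i and d at once: entry (i, d) of the product B @ Theta
P = B @ Theta
assert np.allclose(P.sum(axis=0), 1.0)        # each column is a valid distribution on words
```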
◮ LDA is a probabilistic model.
◮ Inference for LDA involves techniques not taught in this course.

We will instead discuss two closely related methods and their algorithms, both referred to as nonnegative matrix factorization (NMF).
◮ They can be used for the same tasks as LDA.
◮ Though "nonnegative matrix factorization" is a general technique, "NMF" usually just refers to the following two methods.
N2 "objects" N1 dimensions
(i,j)-th entry, Xij > _ 0
Wik > _ 0 Hkj > _ 0
rank = k
We use notation and think about the problem slightly differently from PMF.
◮ Data X has nonnegative entries. None are missing, but there are likely many zeros.
◮ The learned factorization W and H also have nonnegative entries.
◮ The value Xij ≈ ∑k WikHkj, but we won't write this with vector notation.
◮ Later we interpret the output in terms of the columns of W and H.
What are some data modeling problems that can constitute X?

◮ Text data:
  ◮ Word term frequencies.
  ◮ Xij contains the number of times word i appears in document j.
◮ Image data:
  ◮ Face identification data sets.
  ◮ Put each vectorized N × M image of a face on a column of X.
◮ Other discrete grouped data:
  ◮ Quantize continuous sets of features using K-means.
  ◮ Xij counts how many times group j uses cluster i.
  ◮ For example: group = song, features = d × n spectral information matrix.
NMF minimizes one of the following two objective functions over W and H.

Choice 1: Squared error objective
  ‖X − WH‖² = ∑ij (Xij − (WH)ij)²

Choice 2: Divergence objective
  D(X‖WH) = −∑ij [Xij ln(WH)ij − (WH)ij]

◮ Both have the constraint that W and H contain nonnegative values.
◮ NMF uses a fast, simple algorithm for optimizing these two objectives.
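The two objectives in NumPy might look like the following sketch (assuming X, W, H are nonnegative arrays; a small constant guards the logarithm):

```python
import numpy as np

def squared_error(X, W, H):
    # ||X - WH||^2 = sum_ij (Xij - (WH)ij)^2
    return np.sum((X - W @ H) ** 2)

def divergence(X, W, H, eps=1e-12):
    # D(X || WH) = -sum_ij [Xij ln(WH)ij - (WH)ij]  (up to terms constant in W and H)
    WH = W @ H
    return -np.sum(X * np.log(WH + eps) - WH)
```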
Recall what we should look for when minimizing an objective "minh F(h)": an algorithm that produces a sequence of iterates h1, h2, h3, . . . with

F(h1) ≥ F(h2) ≥ F(h3) ≥ · · ·

The following algorithms fulfill this requirement.
◮ Minimization is done via an "auxiliary function."
◮ This leads to a "multiplicative algorithm" for updating W and H.
◮ We'll skip the details (see reference¹).
¹ For details, see D.D. Lee and H.S. Seung (2001). "Algorithms for non-negative matrix factorization." Advances in Neural Information Processing Systems.
Problem: minimize ∑ij (Xij − (WH)ij)² over W and H, subject to Wik ≥ 0, Hkj ≥ 0.

Algorithm
◮ Randomly initialize H and W with nonnegative values.
◮ Iterate the following, first for all values in H, then all in W:

  Hkj ← Hkj (WᵀX)kj / (WᵀWH)kj ,    Wik ← Wik (XHᵀ)ik / (WHHᵀ)ik ,

  until the change in ‖X − WH‖² is "small."
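A minimal NumPy sketch of these updates (the rank K, iteration count, and the small constant added to denominators to avoid division by zero are illustrative choices; a real implementation would also monitor the change in ‖X − WH‖²):

```python
import numpy as np

def nmf_squared_error(X, K, n_iters=200, eps=1e-12, seed=0):
    """Multiplicative updates for min ||X - WH||^2 with W, H >= 0."""
    rng = np.random.default_rng(seed)
    N1, N2 = X.shape
    W = rng.random((N1, K))
    H = rng.random((K, N2))
    for _ in range(n_iters):
        # Update all of H, then all of W
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```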
[Figure: a visualization of the updates that may be helpful, using the color-coded definition X ≈ WH above. Element-wise multiplication (.*) and division (./) are used across the three columns of the diagram; matrix multiplication is used within each outlined box.]
Probabilistically, the squared error penalty implies a Gaussian distribution,

Xij ∼ N(∑k WikHkj, σ²).
Since Xij ≥ 0 (and often isn’t continuous), we are making an incorrect modeling assumption. Nevertheless, as with PMF it still works well.
Problem: minimize ∑ij [ Xij ln(1/(WH)ij) + (WH)ij ] over W and H, subject to Wik ≥ 0, Hkj ≥ 0.

Algorithm
◮ Randomly initialize H and W with nonnegative values.
◮ Iterate the following, first for all values in H, then all in W:

  Hkj ← Hkj (∑i Wik Xij/(WH)ij) / (∑i Wik) ,    Wik ← Wik (∑j Hkj Xij/(WH)ij) / (∑j Hkj) ,

  until the change in D(X‖WH) is "small."
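A corresponding NumPy sketch of the divergence updates (same illustrative defaults as before; the denominators ∑i Wik and ∑j Hkj appear as column and row sums of W and H):

```python
import numpy as np

def nmf_divergence(X, K, n_iters=200, eps=1e-12, seed=0):
    """Multiplicative updates for the divergence objective D(X || WH) with W, H >= 0."""
    rng = np.random.default_rng(seed)
    N1, N2 = X.shape
    W = rng.random((N1, K))
    H = rng.random((K, N2))
    for _ in range(n_iters):
        # Update all of H: H_kj <- H_kj * (sum_i W_ik X_ij/(WH)_ij) / (sum_i W_ik)
        R = X / (W @ H + eps)                           # the matrix X ./ WH
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)
        # Update all of W: W_ik <- W_ik * (sum_j H_kj X_ij/(WH)_ij) / (sum_j H_kj)
        R = X / (W @ H + eps)
        W *= (R @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H
```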
[Figure: visualizing the update for the divergence penalty is more complicated. Using the color-coded definition above, "purple" is the data matrix X "dot-divided" (./) by its approximation WH. In the update for H, the rows of Wᵀ are normalized to sum to one before multiplying the purple matrix; in the update for W, the columns of Hᵀ are normalized to sum to one.]
The maximum likelihood interpretation of the divergence penalty is more interesting than for the squared error penalty. If we model the data as independent Poisson random variables,

  Xij ∼ Pois((WH)ij),    Pois(x|λ) = (λˣ / x!) e^(−λ),  x ∈ {0, 1, 2, . . . },

then the negative divergence penalty is the maximum likelihood objective for W and H:

  −D(X‖WH) = ∑ij [Xij ln(WH)ij − (WH)ij] = ∑ij ln P(Xij | W, H) + constant.

Here we use P(X | W, H) = ∏ij P(Xij | W, H) = ∏ij Pois(Xij | (WH)ij).
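A quick numerical check of this equivalence (a sketch; SciPy's Poisson log-pmf and gammaln supply the "constant" term −∑ij ln Xij!):

```python
import numpy as np
from scipy.stats import poisson
from scipy.special import gammaln

rng = np.random.default_rng(0)
W, H = rng.random((6, 3)), rng.random((3, 8))   # small random nonnegative factors
X = rng.poisson(W @ H)                          # synthetic count data

WH = W @ H
neg_div = np.sum(X * np.log(WH) - WH)           # -D(X || WH) as defined above
log_lik = poisson.logpmf(X, WH).sum()           # sum_ij ln Pois(Xij | (WH)ij)
const = -gammaln(X + 1).sum()                   # the constant: -sum_ij ln(Xij!)

assert np.isclose(neg_div, log_lik - const)
```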
As discussed, NMF can be used for topic modeling. In fact, one can show that the divergence penalty is closely related mathematically to LDA.

Step 1. Form the term-frequency matrix X. (Xij = # times word i appears in doc j)
Step 2. Run NMF to learn W and H using the D(X‖WH) penalty.
Step 3. As an added step, after Step 2 is complete, for k = 1, . . . , K: set ak = ∑i Wik, then divide the kth column of W by ak and multiply the kth row of H by ak.

Notice that this does not change the matrix product WH.

Interpretation: The kth column of W can be interpreted as the kth topic. The jth column of H can be interpreted as how much document j uses each topic.
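A sketch of Step 3 in NumPy (assuming W and H come from the divergence-penalty NMF above):

```python
import numpy as np

def normalize_topics(W, H):
    """Rescale so each column of W sums to one, without changing the product WH."""
    a = W.sum(axis=0)          # a_k = sum_i W_ik, one value per topic
    W_topics = W / a           # k-th column of W becomes a distribution on words (a topic)
    H_scaled = H * a[:, None]  # k-th row of H absorbs the scale, so W_topics @ H_scaled == W @ H
    return W_topics, H_scaled
```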
For face modeling, put the face images along the columns of X and factorize. Show the columns of W as images, and compare with K-means and SVD.

K-means (i.e., VQ): Equivalent to each column of H having a single 1. K-means learns averages of full faces.

SVD: Finds the singular value decomposition of X. The results are not interpretable because of the ± values and the orthogonality constraint.

NMF learns a "parts-based" representation: each column of W captures a part of the face.