Data Mining Techniques
CS 6220 - Section 2 - Spring 2017
Lecture 7
Jan-Willem van de Meent (credit: David Blei)
Review: K-means Clustering
Objective: Sum of Squares
$$J(z, \mu) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \, \lVert x_n - \mu_k \rVert^2$$
z_nk: one-hot assignment (z_nk = 1 if point n belongs to cluster k, otherwise 0)
µk: center for cluster k (µ1, µ2, µ3 in the figure)
Alternate between two steps:
1. Update the assignments z with the centers µ held fixed.
2. Update the centers µ with the assignments z held fixed.
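The alternation between assignment and center updates can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's own code: the function name `kmeans`, the random initialization at data points, and the stopping rule are all assumptions.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Alternate assignment and center updates to reduce the sum of squares."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize the centers at K distinct, randomly chosen data points.
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    z = np.full(len(X), -1)
    for _ in range(n_iters):
        # Assignment step: squared distance from every point to every center.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # shape (N, K)
        z_new = d2.argmin(axis=1)
        if np.array_equal(z_new, z):
            break  # assignments stopped changing
        z = z_new
        # Center step: mean of the points assigned to each cluster.
        for k in range(K):
            if np.any(z == k):  # an empty cluster keeps its previous center
                mu[k] = X[z == k].mean(axis=0)
    objective = ((X - mu[z]) ** 2).sum()
    return z, mu, objective
```

Here `z` stores cluster indices rather than one-hot vectors; `mu[z]` then picks out the center assigned to each point, which is what the one-hot sum in the objective does.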
Generative Model
zn ∼ Discrete(π)
xn | zn = k ∼ Norm(µk, Σk)
Questions
1. What is log p(X, z | μ, Σ, π)?
2. For what choice of π and Σ do we recover K-means?
Same as K-means when:
πk = 1/K and Σk = σ²I
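One way to work through both questions is sketched below. The symbol D (the dimension of each x_n) is introduced here for illustration and does not appear on the slides. Note that the π inside 2πσ² is the constant 3.14159..., not the mixture weights.

```latex
% Log joint of the generative model, using the one-hot assignments z_{nk}:
\log p(X, z \mid \mu, \Sigma, \pi)
  = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk}
    \left[ \log \pi_k + \log \mathrm{Norm}(x_n \mid \mu_k, \Sigma_k) \right]

% Substituting \pi_k = 1/K and \Sigma_k = \sigma^2 I:
\log p(X, z \mid \mu, \sigma^2)
  = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \, \lVert x_n - \mu_k \rVert^2
    - N \log K - \frac{N D}{2} \log (2 \pi \sigma^2)
```

Only the first term depends on z and µ, so maximizing the log joint over z and µ is the same as minimizing the K-means sum of squares.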
Parameter Updates
$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} \, x_n$$
$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} \, (x_n - \mu_k)(x_n - \mu_k)^\top$$
$$\pi = (N_1/N, \ldots, N_K/N)$$
Assignment Update
Idea: Replace hard assignments with soft assignments
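Hard assignments: $$N_k := \sum_{n=1}^{N} z_{nk}$$
Soft assignments: $$N_k := \sum_{n=1}^{N} \gamma_{nk}$$
Parameter Updates
$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, x_n$$
$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, (x_n - \mu_k)(x_n - \mu_k)^\top$$
$$\pi = (N_1/N, \ldots, N_K/N)$$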
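The soft-assignment parameter updates translate almost directly into array operations. The following is a minimal sketch; the function name `m_step` and the array layout (rows of `gamma` are data points, columns are clusters) are assumptions made for illustration.

```python
import numpy as np

def m_step(X, gamma):
    """Parameter updates from soft assignments.

    X:     (N, D) data matrix.
    gamma: (N, K) soft assignments; each row sums to 1.
    """
    N = X.shape[0]
    Nk = gamma.sum(axis=0)                     # N_k, shape (K,)
    pi = Nk / N                                # mixture weights
    mu = (gamma.T @ X) / Nk[:, None]           # weighted means, shape (K, D)
    diff = X[:, None, :] - mu[None, :, :]      # x_n - mu_k, shape (N, K, D)
    # Weighted sum of outer products (x_n - mu_k)(x_n - mu_k)^T for each k.
    Sigma = np.einsum("nk,nkd,nke->kde", gamma, diff, diff) / Nk[:, None, None]
    return pi, mu, Sigma
```

Passing one-hot rows for `gamma` reproduces the hard-assignment updates, which makes the relationship between the two versions easy to check numerically.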
Soft Assignment Update
Idea: Replace hard assignments with soft assignments
[Derivation; steps annotated: (multiplication by 1), (multiplication by 1), (Bayes rule)]
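The slide's expression for the soft assignments did not survive extraction. A standard choice, assumed here, is the posterior probability of each cluster under the generative model: γnk = πk Norm(xn | µk, Σk) / Σj πj Norm(xn | µj, Σj). The sketch below computes it in log space for numerical stability; the function names `log_gaussian` and `e_step` are illustrative.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Log density of Norm(mu, Sigma) evaluated at each row of X."""
    D = X.shape[1]
    L = np.linalg.cholesky(Sigma)              # Sigma = L L^T
    sol = np.linalg.solve(L, (X - mu).T)       # L^{-1} (x_n - mu), shape (D, N)
    maha = (sol ** 2).sum(axis=0)              # squared Mahalanobis distances
    log_det = 2.0 * np.log(np.diag(L)).sum()   # log |Sigma|
    return -0.5 * (D * np.log(2.0 * np.pi) + log_det + maha)

def e_step(X, pi, mu, Sigma):
    """Soft assignments gamma[n, k] = p(z_n = k | x_n, pi, mu, Sigma)."""
    K = len(pi)
    log_w = np.stack(
        [np.log(pi[k]) + log_gaussian(X, mu[k], Sigma[k]) for k in range(K)],
        axis=1,
    )                                          # log pi_k + log Norm(...), shape (N, K)
    log_norm = np.logaddexp.reduce(log_w, axis=1, keepdims=True)
    return np.exp(log_w - log_norm)            # rows sum to 1
```

A point halfway between two equally weighted clusters with equal covariances gets γ = (0.5, 0.5), whereas a hard assignment would have to pick one of them.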
Expectation Maximization
Generative Model
zn ∼ Discrete(π)
xn | zn = k ∼ Norm(µk, Σk)
Initialize θ
Repeat until convergence
1. E-step: update the soft assignments γ
2. M-step: update the parameters θ = (µ, Σ, π)
Borrowing from: David Blei (Columbia)
Idea: Model text as a mixture over words (ignore order)
Topics and words (each topic assigns a probability to every word):
gene 0.04, dna 0.02, genetic 0.01, ...
life 0.02, evolve 0.01, ...
brain 0.04, neuron 0.02, nerve 0.01, ...
data 0.02, number 0.02, computer 0.01, ...
Expectation Maximization
Generative Model
Initialize θ
Repeat until convergence
E-step: Update assignments
M-step: Update parameters
[Figure: topics (gene 0.04, dna 0.02, ...; life 0.02, evolve 0.01, ...; brain 0.04, neuron 0.02, ...; data 0.02, number 0.02, ...), documents, and topic proportions and assignments]
Generative Model
E-step: Update assignments
M-step: Update parameters
*(Probabilistic Latent Semantic Indexing, a.k.a. Probabilistic Latent Semantic Analysis)
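The E-step and M-step for PLSI/PLSA can be sketched on a document-term count matrix. The slide's update equations were not recovered, so what follows uses the standard formulation as an assumption: p(w | d) = Σk θdk βkw, with per-word topic responsibilities in the E-step and normalized expected counts in the M-step. The function name `plsa_em` and the random initialization are illustrative, and every document is assumed to contain at least one word.

```python
import numpy as np

def plsa_em(counts, K, n_iters=50, seed=0):
    """EM for PLSA on a (D, V) matrix of word counts per document."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    theta = rng.dirichlet(np.ones(K), size=D)   # (D, K): topic proportions per document
    beta = rng.dirichlet(np.ones(V), size=K)    # (K, V): word distribution per topic
    log_liks = []
    for _ in range(n_iters):
        # E-step: phi[d, v, k] = p(topic k | document d, word v).
        joint = theta[:, None, :] * beta.T[None, :, :]   # (D, V, K)
        p_w = joint.sum(axis=2) + 1e-12                  # p(word | doc); guard against 0
        phi = joint / p_w[:, :, None]
        log_liks.append(float((counts * np.log(p_w)).sum()))
        # M-step: normalize the expected counts.
        expected = counts[:, :, None] * phi              # (D, V, K)
        theta = expected.sum(axis=1)
        theta /= theta.sum(axis=1, keepdims=True)
        beta = expected.sum(axis=0).T                    # (K, V)
        beta /= beta.sum(axis=1, keepdims=True)
    return theta, beta, log_liks
```

The recorded log-likelihoods should not decrease from one iteration to the next, which is a useful sanity check when adapting the sketch. The dense (D, V, K) array is only practical for small vocabularies.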
Generative Model (with priors)
Maximum a Posteriori
E-step: Update assignments
M-step: Update parameters
[Figure: graphical model in plate notation]
α: Proportions parameter
θd: Per-document topic proportions
Zd,n: Per-word topic assignment
Wd,n: Observed word
βk: Topics
η: Topic parameter
Plates: N, D, K
(a.k.a. PLSI/PLSA with priors)
Likelihood (discrete)
Prior (Dirichlet)
Question: What distribution is the posterior?
More examples: https://en.wikipedia.org/wiki/Conjugate_prior
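For a discrete likelihood with a Dirichlet prior, the posterior is again a Dirichlet whose parameters are the prior parameters plus the observed counts. The short sketch below illustrates this; the function names are hypothetical, and the mode formula (the MAP estimate) is only valid when every posterior parameter exceeds 1.

```python
def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters after observing category counts."""
    return [a + n for a, n in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution."""
    total = sum(alpha)
    return [a / total for a in alpha]

def dirichlet_mode(alpha):
    """Mode of a Dirichlet distribution (the MAP estimate when applied to a posterior)."""
    if any(a <= 1 for a in alpha):
        raise ValueError("the mode formula requires every parameter to exceed 1")
    total, K = sum(alpha), len(alpha)
    return [(a - 1) / (total - K) for a in alpha]

# Example: symmetric prior Dir(2, 2, 2) and observed counts (3, 0, 1).
posterior = dirichlet_posterior([2.0, 2.0, 2.0], [3, 0, 1])
```

In the example the posterior is Dir(5, 2, 3): the category that was never observed keeps nonzero probability because of the prior.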
Idea: Maximize Evidence Lower Bound (ELBO)
Maximizing the ELBO is equivalent to minimizing the KL divergence between the approximation q and the posterior
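The equivalence follows from a standard decomposition of the log evidence, sketched here in the notation of the topic model (w denotes the observed words; conditioning on α and η is written explicitly as an assumption about the slide's notation):

```latex
\log p(w \mid \alpha, \eta)
  = \underbrace{\mathbb{E}_{q}\!\left[ \log p(w, z, \beta, \theta \mid \alpha, \eta)
      - \log q(z, \beta, \theta) \right]}_{\text{ELBO}}
  + \underbrace{\mathrm{KL}\!\left( q(z, \beta, \theta) \,\Vert\,
      p(z, \beta, \theta \mid w, \alpha, \eta) \right)}_{\ge 0}
```

The left-hand side does not depend on q, so any increase in the ELBO is matched by an equal decrease in the KL term.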
Use Factorized Approximation for q(z, β, θ): Discrete (z), Dirichlet (β), Dirichlet (θ)
Variational E-step: Maximize w.r.t. φ (expectations closed form for Dirichlet distributions)
Variational M-step: Maximize w.r.t. λ and γ (analogous to MAP estimation)
[Figure: bar chart of probability by topic; x-axis: Topics (1 to 96), y-axis: Probability (0.0 to 0.4)]
Top words per topic (one topic per line):
human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences
evolution, evolutionary, species, life, biology, groups, phylogenetic, living, diversity, group, new, two, common
disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis
computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations
Top words per topic, second example (one topic per line):
problem, problems, mathematical, number, new, mathematics, university, two, first, numbers, work, time, mathematicians, chaos, chaotic
model, rate, constant, distribution, time, number, size, values, value, average, rates, data, density, measured, models
selection, male, males, females, sex, species, female, evolution, populations, population, sexual, behavior, evolutionary, genetic, reproductive
species, forest, ecology, fish, ecological, conservation, diversity, population, natural, ecosystems, populations, endangered, tropical, forests, ecosystem
[Figure: perplexity versus number of topics, two panels (Nematode abstracts; Associated Press), comparing Smoothed Unigram, Smoothed Mixt. Unigrams, LDA, and Fold in pLSI]
$$\text{perplexity} = \exp\left\{ -\frac{\sum_{d} \log p(w_d)}{\sum_{d} N_d} \right\}$$
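The perplexity formula is simple to evaluate once a model supplies a log probability for each held-out document. The sketch below assumes those values are already available; the function name `perplexity` and its arguments are illustrative.

```python
import math

def perplexity(doc_log_probs, doc_lengths):
    """Perplexity from per-document log probabilities log p(w_d) and lengths N_d."""
    if len(doc_log_probs) != len(doc_lengths):
        raise ValueError("need one log probability and one length per document")
    return math.exp(-sum(doc_log_probs) / sum(doc_lengths))

# Sanity check: a uniform model over a 1000-word vocabulary assigns each word
# probability 1/1000, so a document of length N has log probability N * log(1/1000).
lengths = [10, 20]
log_probs = [n * math.log(1.0 / 1000.0) for n in lengths]
uniform_perplexity = perplexity(log_probs, lengths)
```

For the uniform model the perplexity equals the vocabulary size, which gives a reference point: lower values mean the model predicts held-out words better than uniform guessing.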
to Variational inference or MAP inference (LDA)
can be embedded in more complicated models
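[Figure: graphical model in plate notation with nodes α, θd, Zd,n, Wd,n, βk, Yd and parameters η, σ²; plates N, D, K]
1. Draw topic proportions θ | α ∼ Dir(α).
2. For each word
3. Draw response variable $y \mid z_{1:N}, \eta, \sigma^2 \sim \mathcal{N}(\eta^\top \bar{z}, \sigma^2)$, where $\bar{z} = (1/N) \sum_{n=1}^{N} z_n$.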
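The generative process can be sketched as a sampler for a single document. The per-word sub-steps of step 2 did not survive extraction; the code assumes the usual topic-model choices (draw a topic assignment from θ, then draw a word from that topic's distribution), and labels them as such. The function name and argument layout are hypothetical.

```python
import numpy as np

def sample_document(alpha, beta, eta, sigma2, N, rng):
    """Sample one document and its response from the generative process.

    alpha: (K,) Dirichlet parameter; beta: (K, V) topics (rows sum to 1);
    eta: (K,) regression coefficients; sigma2: response variance; N: word count.
    """
    K, V = beta.shape
    theta = rng.dirichlet(alpha)                           # step 1
    # Step 2 (assumed sub-steps): topic assignment, then word, for each position.
    z = rng.choice(K, size=N, p=theta)
    w = np.array([rng.choice(V, p=beta[k]) for k in z])
    # Step 3: z_bar is the average of the one-hot topic assignments.
    z_bar = np.bincount(z, minlength=K) / N
    y = rng.normal(eta @ z_bar, np.sqrt(sigma2))
    return theta, z, w, y
```

Because z_bar holds the empirical topic frequencies of the document, each entry of η acts as the contribution of one topic to the expected response.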
[Figure: topics (word lists such as "both, motion, simple, perfect, fascinating, power, complex" and "awful, featuring, routine, dry") arranged along an axis with ticks from −30 to 20]
Estimate a covariance matrix Σ that parameterizes correlations between topics in a document
[Figure: graphical model in plate notation with nodes Σ, µ, ηd, Zd,n, Wd,n, βk; plates N, D, K]
Nonconjugate prior
Track changes in word distributions associated with a topic over time.
Inaugural addresses
1789: AMONG the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order...
2009: My fellow citizens: I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful...
[Figure: graphical model in plate notation, repeated once per time slice (nodes α, θd, Zd,n, Wd,n; plates N, D, K), with per-slice topics βk,1, βk,2, ..., βk,T]
Top words of one topic by decade:
1880: electric, machine, power, engine, steam, two, machines, iron, battery, wire
1890: electric, power, company, steam, electrical, machine, two, system, motor, engine
1900: apparatus, steam, power, engine, engineering, water, construction, engineer, room, feet
1910: air, water, engineering, apparatus, room, laboratory, engineer, made, gas, tube
1920: apparatus, tube, air, pressure, water, glass, gas, made, laboratory, mercury
1930: tube, apparatus, glass, air, mercury, laboratory, pressure, made, gas, small
1940: air, tube, apparatus, glass, laboratory, rubber, pressure, small, mercury, gas
1950: tube, apparatus, glass, air, chamber, instrument, small, laboratory, pressure, rubber
1960: tube, system, temperature, air, heat, chamber, power, high, instrument, control
1970: air, heat, power, system, temperature, chamber, high, flow, tube, design
1980: high, power, design, heat, system, systems, devices, instruments, control, large
1990: materials, high, power, current, applications, technology, devices, design, device, heat
2000: devices, device, materials, current, gate, high, light, silicon, material, technology
[Figure: two panels over the years 1880 to 2000: "Theoretical Physics" (RELATIVITY, LASER, FORCE) and "Neuroscience" (NERVE, OXYGEN, NEURON)]
[Figure: graphical model in plate notation]
Bill content (topic model): α, θd, Zdn, Wdn, η; plates N, D, K
Bill sentiment variables: Ad, Bd (variance σ²d)
Legislator ideal points: Xu (variance σ²u; plate U)
Observed votes: Vud
Top terms per topic (one topic per line):
dod, defense, defense and appropriation, military, subtitle
veteran, veterans, bills, care, injury
people, woman, american, nation, school
producer, eligible, crop, farm, subparagraph
coin, inspector, designee, automobile, lebanon
bills, iran, official, company, sudan
human, vietnam, united nations, call, people
drug, pediatric, product, device, medical
child, fire, attorney, internet, bills
surveillance, director, court, electronic, flood
energy, bills, price, commodity, market
land, site, bills, interior, river
child, center, poison, victim, abuse
coast guard, vessel, space, administrator, requires
science, director, technology, mathematics, bills
computer, alien, bills, user, collection
head, start, child, technology, award
loss, crop, producer, agriculture, trade
bills, tax, subparagraph, loss, taxable
cover, bills, bridge, transaction, following
transportation, rail, railroad, passenger, homeland security
business, administrator, bills, business concern, loan
defense, iraq, transfer, expense, chapter
medicare, medicaid, child, chip, coverage
student, loan, institution, lender, school
energy, fuel, standard, administrator, lamp
housing, mortgage, loan, family, recipient
bank, transfer, requires, holding company, industrial
county, eligible, ballot, election, jurisdiction
tax credit, budget authority, energy, outlays, tax