Data Mining Techniques
CS 6220 - Section 3 - Fall 2016
Lecture 11
Jan-Willem van de Meent (credit: Yijun Zhao, Dave Blei)
PROJECT GUIDELINES (updated)
Project Goals
Select a dataset / prediction problem
Perform exploratory analysis
and preprocessing
(can figures and tables be generated by running code?)
relate to choice of algorithm
and methodology
which do not, and why?
from UCI repository
(if any) is needed
algorithm in scikit-learn
(i.e. not classification or clustering)
(talk to me for suggestions)
(after midterm and proposals)
time to work on projects
(expect final to emphasize later topics)
http://www.ccs.neu.edu/course/cs6220f16/sec3/midterm-topics.html
Suppose we have two weighted coins, A and B, and we want to estimate the bias of each, i.e. pA(head) = µA and pB(head) = µB.
Pick a coin at random (simplified version: an equal mixture).
Flip it 10 times and record 'H' and 'T'.
Repeat the process until we have a training set of reasonable size.
Generative Model
Expectation Maximization
Initialize θ
Repeat until convergence
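The two-coin setup and the "initialize θ, repeat until convergence" loop can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: it assumes an equal mixture of the two coins (as stated above), takes θ = (µA, µB), and uses a fixed number of iterations instead of a convergence test. The function name `em_two_coins` and the head counts in the usage note are hypothetical.

```python
import numpy as np

def em_two_coins(heads, n_flips, mu_init=(0.6, 0.5), n_iter=50):
    """EM for an equal mixture of two biased coins (illustrative sketch).

    heads   : number of heads observed in each trial
    n_flips : number of flips per trial (10 in the slide's setup)
    mu_init : initial guesses for (mu_A, mu_B)
    """
    heads = np.asarray(heads, dtype=float)
    tails = n_flips - heads
    mu = np.array(mu_init, dtype=float)
    for _ in range(n_iter):
        # E-step: posterior probability that each trial used coin A or B.
        # The binomial coefficient and the 1/2 mixture weight cancel.
        log_lik = heads[:, None] * np.log(mu) + tails[:, None] * np.log1p(-mu)
        log_lik -= log_lik.max(axis=1, keepdims=True)
        resp = np.exp(log_lik)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: expected heads divided by expected flips for each coin.
        mu = (resp * heads[:, None]).sum(axis=0) / (resp.sum(axis=0) * n_flips)
    return mu, resp
```

For example, `em_two_coins([5, 9, 8, 4, 7], n_flips=10)` returns estimates of the two biases together with the per-trial responsibilities. The sketch does not guard against an estimate reaching exactly 0 or 1, where the logarithm is undefined.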
Borrowing from: David Blei (Columbia)
Labels: Spam or not Spam
Features: Words in E-mail
[Figure: binary feature vector x_n with one entry per vocabulary word (a, aardvark, aardwolf, ..., buy, ..., zygmurgy)]
Generative Model
Maximum Likelihood
Generative Model (with prior)
Maximum Likelihood
Posterior Mean for Parameters
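The difference between the maximum likelihood estimate and the posterior mean can be made concrete with a small sketch. It assumes binary word-presence features (a Bernoulli naive Bayes model) and a Beta(a, b) prior on each word probability. Both are assumptions made for illustration, as are the function names, since the slide's equations are not reproduced in this text.

```python
import numpy as np

def fit_bernoulli_nb(X, y, a=0.0, b=0.0):
    """Estimate class priors and per-class word probabilities.

    a = b = 0 gives the maximum likelihood estimate;
    a, b > 0 gives the posterior mean under a Beta(a, b) prior (assumed).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    prior = np.array([(y == c).mean() for c in classes])
    theta = np.array([
        (X[y == c].sum(axis=0) + a) / ((y == c).sum() + a + b)
        for c in classes
    ])
    return classes, prior, theta

def log_posterior(x, prior, theta):
    """Log p(class | x); requires 0 < theta < 1, so use a, b > 0 when fitting."""
    x = np.asarray(x, dtype=float)
    log_joint = np.log(prior) + (
        x * np.log(theta) + (1 - x) * np.log1p(-theta)
    ).sum(axis=1)
    return log_joint - np.logaddexp.reduce(log_joint)
```

With the maximum likelihood estimate, a word never seen in one class gets probability exactly 0, which rules that class out for any e-mail containing the word. The posterior mean keeps every probability strictly between 0 and 1.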
Observations: Bag of Words
[Figure: word-count vector x_n with one entry per vocabulary word (a, aardvark, aardwolf, ..., buy, ..., zygmurgy), showing example counts 24, 1, 4]
Clusters: Types of Documents
Observations: Bag of Words
Generative Model (with prior)
Maximum Likelihood
How should we modify the generative model?
Clusters: Types of Documents
Observations: Bag of Words
Generative Model (with prior)
[Figure: Topics, Documents, Topic proportions and assignments. Example topics (word, probability):]
gene 0.04, dna 0.02, genetic 0.01, ...
life 0.02, evolve 0.01, ...
brain 0.04, neuron 0.02, nerve 0.01, ...
data 0.02, number 0.02, computer 0.01, ...
[Figure: LDA graphical model with plates N (words), D (documents), K (topics)]
α: Proportions parameter
θd: Per-document topic proportions
Zd,n: Per-word topic assignment
Wd,n: Observed word
βk: Topics
η: Topic parameter
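The variables in the graphical model can be read as a sampling procedure. The sketch below draws a small synthetic corpus from it. It assumes symmetric Dirichlet priors with scalar α and η and a fixed document length N. The function name and argument names are hypothetical.

```python
import numpy as np

def sample_lda_corpus(n_docs, n_words, n_topics, n_vocab, alpha, eta, seed=0):
    """Draw (beta, theta, z, w) from the LDA generative model (illustrative)."""
    rng = np.random.default_rng(seed)
    # Topics: beta_k ~ Dir(eta), one distribution over the vocabulary per topic.
    beta = rng.dirichlet(np.full(n_vocab, eta), size=n_topics)
    # Per-document topic proportions: theta_d ~ Dir(alpha).
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)
    z = np.empty((n_docs, n_words), dtype=int)
    w = np.empty((n_docs, n_words), dtype=int)
    for d in range(n_docs):
        # Per-word topic assignment: z_{d,n} ~ Discrete(theta_d).
        z[d] = rng.choice(n_topics, size=n_words, p=theta[d])
        for n in range(n_words):
            # Observed word: w_{d,n} ~ Discrete(beta_{z_{d,n}}).
            w[d, n] = rng.choice(n_vocab, p=beta[z[d, n]])
    return beta, theta, z, w
```

Inference runs this process in reverse: only `w` is observed, and `beta`, `theta`, and `z` have to be estimated.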
Generative Model (no priors) Expectation Step Maximization Step
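LDA:
[Figure: graphical model of the generative model (α, θd, Zd,n, Wd,n, βk, η; plates N, D, K) next to the graphical model of the variational approximation (plates D, N, K)]
Generative Model
Variational Approximation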
One iteration of mean field variational inference for LDA
(1) For each topic k and term v:
$$\lambda_{k,v}^{(t+1)} = \eta + \sum_{d=1}^{D} \sum_{n=1}^{N} 1(w_{d,n} = v)\, \phi_{d,n,k}^{(t)} \tag{8}$$
(2) For each document d:
(a) Update γd:
$$\gamma_{d,k}^{(t+1)} = \alpha_k + \sum_{n=1}^{N} \phi_{d,n,k}^{(t)} \tag{9}$$
(b) For each word n, update φd,n:
$$\phi_{d,n,k}^{(t+1)} \propto \exp\left\{ \Psi\big(\gamma_{d,k}^{(t+1)}\big) + \Psi\big(\lambda_{k,w_{d,n}}^{(t+1)}\big) - \Psi\Big(\sum_{v=1}^{V} \lambda_{k,v}^{(t+1)}\Big) \right\} \tag{10}$$
where Ψ is the digamma function, the first derivative of the log Γ function.
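Updates (8) to (10) map onto array operations fairly directly. The following is a rough sketch of one iteration, assuming documents are given as arrays of integer word ids and φ is stored as one (N_d, K) array per document. The function name and data layout are choices made here, and `scipy.special.digamma` supplies Ψ.

```python
import numpy as np
from scipy.special import digamma

def mean_field_iteration(docs, phi, eta, alpha, n_vocab):
    """One pass of updates (8)-(10).

    docs : list of 1-D integer arrays, the word ids w_{d,n} of each document
    phi  : list of (N_d, K) arrays, current phi_{d,n,k} (rows sum to 1)
    """
    n_topics = phi[0].shape[1]
    # (8): lambda_{k,v} = eta + sum_d sum_n 1(w_{d,n} = v) * phi_{d,n,k}
    lam_vk = np.full((n_vocab, n_topics), eta, dtype=float)
    for w, ph in zip(docs, phi):
        np.add.at(lam_vk, w, ph)  # accumulates correctly for repeated word ids
    lam = lam_vk.T                # shape (K, V)
    psi_lam = digamma(lam)
    psi_lam_sum = digamma(lam.sum(axis=1))
    gamma, new_phi = [], []
    for w, ph in zip(docs, phi):
        # (9): gamma_{d,k} = alpha_k + sum_n phi_{d,n,k}
        g = alpha + ph.sum(axis=0)
        # (10): phi_{d,n,k} proportional to
        #       exp(Psi(gamma_dk) + Psi(lambda_{k,w_dn}) - Psi(sum_v lambda_kv))
        log_phi = digamma(g) + psi_lam[:, w].T - psi_lam_sum
        log_phi -= log_phi.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(log_phi)
        p /= p.sum(axis=1, keepdims=True)
        gamma.append(g)
        new_phi.append(p)
    return lam, gamma, new_phi
```

Calling the function repeatedly, feeding `new_phi` back in as `phi`, gives the iterative scheme. A stopping rule is left out of this sketch.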
[Figure: topic proportions for an example document (x-axis: Topics, y-axis: Probability), with the top words of selected topics]
Top words, one topic per line:
human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences
evolution, evolutionary, species, life, biology, groups, phylogenetic, living, diversity, group, new, two, common
disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis
computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations
problem, problems, mathematical, number, new, mathematics, university, two, first, numbers, work, time, mathematicians, chaos, chaotic
model, rate, constant, distribution, time, number, size, values, value, average, rates, data, density, measured, models
selection, male, males, females, sex, species, female, evolution, populations, population, sexual, behavior, evolutionary, genetic, reproductive
species, forest, ecology, fish, ecological, conservation, diversity, population, natural, ecosystems, populations, endangered, tropical, forests, ecosystem
[Figure: Perplexity vs. Number of Topics on two corpora (Nematode abstracts, Associated Press), comparing Smoothed Unigram, Smoothed Mixt. Unigrams, LDA, and Fold in pLSI]
$$\text{perplexity} = \exp\left\{ -\frac{\sum_{d} \log p(w_d)}{\sum_{d} N_d} \right\}$$
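As a quick check on the perplexity formula, the sketch below computes it from per-document log-likelihoods and document lengths. The function name is hypothetical. The inputs are assumed to be natural-log probabilities log p(w_d) of held-out documents and their word counts N_d.

```python
import math

def perplexity(doc_log_probs, doc_lengths):
    """exp(-sum_d log p(w_d) / sum_d N_d), with natural-log probabilities."""
    return math.exp(-sum(doc_log_probs) / sum(doc_lengths))

# Sanity check: a uniform model over V words assigns each word probability 1/V,
# so log p(w_d) = -N_d * log(V) and the perplexity equals V.
V = 1000
lengths = [50, 120, 80]
log_probs = [-n * math.log(V) for n in lengths]
uniform_perplexity = perplexity(log_probs, lengths)
```

Lower perplexity means the model assigns higher probability per word to held-out text. The uniform model gives a natural upper reference point equal to the vocabulary size.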
results to Variational inference (LDA) on most data
can be embedded in more complicated models
Estimate a covariance matrix Σ that parameterizes correlations between topics in a document
[Figure: graphical model with Σ, µ, ηd, Zd,n, Wd,n, βk; plates N, D, K]
Nonconjugate prior
Track changes in word distributions associated with a topic over time.
[Figure: Inaugural addresses, 1789 and 2009]
1789: AMONG the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order...
2009: My fellow citizens: I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful
[Figure: graphical model with one topic model per time slice (θd, Zd,n, Wd,n, α; plates N, D, K) and per-slice topics βk,1, βk,2, ..., βk,T]
Top words of one topic, by decade:
1880: electric machine power engine steam two machines iron battery wire
1890: electric power company steam electrical machine two system motor engine
1900: apparatus steam power engine engineering water construction engineer room feet
1910: air water engineering apparatus room laboratory engineer made gas tube
1920: apparatus tube air pressure water glass gas made laboratory mercury
1930: tube apparatus glass air mercury laboratory pressure made gas small
1940: air tube apparatus glass laboratory rubber pressure small mercury gas
1950: tube apparatus glass air chamber instrument small laboratory pressure rubber
1960: tube system temperature air heat chamber power high instrument control
1970: air heat power system temperature chamber high flow tube design
1980: high power design heat system systems devices instruments control large
1990: materials high power current applications technology devices design device heat
2000: devices device materials current gate high light silicon material technology
[Figure: word trajectories over time (x-axis: 1880 to 2000). "Theoretical Physics": RELATIVITY, LASER, FORCE. "Neuroscience": NERVE, OXYGEN, NEURON]
[Figure: graphical model with α, θd, Zd,n, Wd,n, βk, response Yd, and parameters η, σ2; plates N, D, K]
1 Draw topic proportions $\theta \mid \alpha \sim \mathrm{Dir}(\alpha)$.
2 For each word
3 Draw response variable $y \mid z_{1:N}, \eta, \sigma^2 \sim \mathcal{N}(\eta^\top \bar{z}, \sigma^2)$, where $\bar{z} = (1/N) \sum_{n=1}^{N} z_n$.
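Step 3 is easy to misread, so here is a small sketch of it. It assumes each topic assignment z_n is stored as an integer topic index, so that z̄ is the vector of empirical topic frequencies in the document, and that η holds one coefficient per topic. The function names are hypothetical.

```python
import numpy as np

def response_mean(z, eta):
    """Return (eta^T z_bar, z_bar), where z_bar averages one-hot assignments."""
    eta = np.asarray(eta, dtype=float)
    z_bar = np.bincount(np.asarray(z), minlength=len(eta)) / len(z)
    return float(eta @ z_bar), z_bar

def draw_response(z, eta, sigma2, rng):
    """Sample y ~ N(eta^T z_bar, sigma2) for one document."""
    mean, _ = response_mean(z, eta)
    return rng.normal(mean, np.sqrt(sigma2))
```

The response depends on the document only through how often each topic was assigned, so a topic's coefficient indicates how its presence shifts the predicted value.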
[Figure: topic word lists placed along an axis with ticks from −30 to 20. Word groups as they appear in the figure:]
both motion simple perfect fascinating power complex however cinematography screenplay performances pictures effective picture his their character many while performance between
has than films director will characters
from there which who much what awful featuring routine dry
charlie paris not about movie all would they its have like you was just some
bad guys watchable its not
movie least problem unfortunately supposed worse flat dull
[Figure: graphical model combining bill content (topic model: α, θd, Zdn, Wdn, η; plates N, D, K), bill sentiment variables (Ad, Bd, with variance σ2d), legislator ideal points (Xu, with variance σ2u; plate U), and observed votes (Vud)]
Top terms, one topic per line:
dod,defense,defense and appropriation,military,subtitle
veteran,veterans,bills,care,injury
people,woman,american,nation,school
producer,eligible,crop,farm,subparagraph
coin,inspector,designee,automobile,lebanon
bills,iran,official,company,sudan
human,vietnam,united nations,call,people
drug,pediatric,product,device,medical
child,fire,attorney,internet,bills
surveillance,director,court,electronic,flood
energy,bills,price,commodity,market
land,site,bills,interior,river
child,center,poison,victim,abuse
coast guard,vessel,space,administrator,requires
science,director,technology,mathematics,bills
computer,alien,bills,user,collection
head,start,child,technology,award
loss,crop,producer,agriculture,trade
bills,tax,subparagraph,loss,taxable
cover,bills,bridge,transaction,following
transportation,rail,railroad,passenger,homeland security
business,administrator,bills,business concern,loan
defense,iraq,transfer,expense,chapter
medicare,medicaid,child,chip,coverage
student,loan,institution,lender,school
energy,fuel,standard,administrator,lamp
housing,mortgage,loan,family,recipient
bank,transfer,requires,holding company,industrial
county,eligible,ballot,election,jurisdiction
tax credit,budget authority,energy,outlays,tax