Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 7



slide-1
SLIDE 1

Data Mining Techniques

CS 6220 - Section 2 - Spring 2017

Lecture 7

Jan-Willem van de Meent (credit: David Blei)

slide-2
SLIDE 2

Review: K-means Clustering

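Objective: Sum of Squares

$$\mathrm{SSE} = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \, \lVert x_n - \mu_k \rVert^2$$

  • zn: One-hot assignment
  • µk: Center for cluster k

Alternate between two steps

  • 1. Minimize SSE w.r.t. zn
  • 2. Minimize SSE w.r.t. μk

[Figure: data points with three cluster centers μ1, μ2, μ3]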
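The two alternating steps can be sketched in NumPy as follows. This is a minimal illustration, not the lecture's own code: the function name `kmeans`, the initialization from randomly chosen data points, and the empty-cluster handling are all assumptions.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Alternate between assignment and center updates (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Assumption: initialize centers with K distinct data points.
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Step 1: minimize SSE w.r.t. z_n (assign each point to its nearest center).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
        z = d2.argmin(axis=1)
        # Step 2: minimize SSE w.r.t. mu_k (mean of the points assigned to k).
        new_mu = mu.copy()
        for k in range(K):
            if np.any(z == k):  # assumption: keep the old center if a cluster is empty
                new_mu[k] = X[z == k].mean(axis=0)
        converged = np.allclose(new_mu, mu)
        mu = new_mu
        if converged:
            break
    sse = ((X - mu[z]) ** 2).sum()
    return z, mu, float(sse)
```

Each step can only lower (or keep) the SSE, so the loop stops at a local minimum; different seeds may give different solutions.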

slide-3
SLIDE 3

Review: Probabilistic K-means

Generative Model

zn ∼ Discrete(π)
xn | zn = k ∼ Norm(µk, Σk)

Questions

1. What is log p(X, z | μ, Σ, π)?
2. For what choice of π and Σ do we recover K-means?

Same as K-means when:

πk = 1/K, Σk = σ²I
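A worked answer to the two questions, written out as a sketch from the generative model above. The shorthand "const" collects every term that depends on neither z nor μ; it is notation introduced here, not taken from the slide.

```latex
\log p(X, z \mid \mu, \Sigma, \pi)
  = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk}
    \left[ \log \pi_k + \log \mathrm{Norm}(x_n \mid \mu_k, \Sigma_k) \right]

% With \pi_k = 1/K and \Sigma_k = \sigma^2 I:
\log p(X, z \mid \mu, \Sigma, \pi)
  = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \, \lVert x_n - \mu_k \rVert^2
    + \mathrm{const}
```

Under these choices, maximizing the log joint over z and μ is the same as minimizing the sum of squares.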

slide-4
SLIDE 4

Review: Probabilistic K-means

Assignment Update

Idea: Replace hard assignments with soft assignments

Parameter Updates

$$N_k := \sum_{n=1}^{N} z_{nk}$$

$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} \, x_n$$

$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} \, (x_n - \mu_k)(x_n - \mu_k)^\top$$

$$\pi = (N_1/N, \ldots, N_K/N)$$

slide-5
SLIDE 5

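Review: Soft K-means

Soft Assignment Update

Idea: Replace hard assignments with soft assignments

Parameter Updates

$$N_k := \sum_{n=1}^{N} \gamma_{nk}$$

$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, x_n$$

$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, (x_n - \mu_k)(x_n - \mu_k)^\top$$

$$\pi = (N_1/N, \ldots, N_K/N)$$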
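The parameter updates for soft assignments translate almost line by line into NumPy. The sketch below assumes the soft assignments are given as an array `gamma` of shape (N, K) whose rows sum to one; the function name is hypothetical.

```python
import numpy as np

def soft_parameter_updates(X, gamma):
    """Parameter updates from soft assignments gamma (shape N x K)."""
    N, D = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)                 # N_k := sum_n gamma_nk
    mu = (gamma.T @ X) / Nk[:, None]       # mu_k: weighted mean, shape (K, D)
    Sigma = np.empty((K, D, D))
    for k in range(K):
        diff = X - mu[k]                   # (x_n - mu_k), shape (N, D)
        # Sigma_k: weighted sum of outer products divided by N_k
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    pi = Nk / N                            # pi = (N_1/N, ..., N_K/N)
    return Nk, mu, Sigma, pi
```

Passing a one-hot `gamma` gives the hard-assignment updates as a special case.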

slide-6
SLIDE 6

Review: Lower Bound on Log Likelihood

(multiplication by 1)

slide-7
SLIDE 7

Review: Lower Bound on Log Likelihood

(multiplication by 1)
(multiplication by 1)

slide-8
SLIDE 8

Review: Lower Bound on Log Likelihood

(multiplication by 1)
(multiplication by 1)
(Bayes rule)

slide-9
SLIDE 9

Review: Lower Bound on Log Likelihood

(multiplication by 1)
(multiplication by 1)
(Bayes rule)

slide-10
SLIDE 10

Review: Lower Bound on Log Likelihood

slide-11
SLIDE 11

Review: Lower Bound on Log Likelihood
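The equations for this derivation did not survive extraction. One standard way to arrive at the bound that uses the same annotated steps is sketched below. The symbols q(z) for an arbitrary distribution over assignments and \mathcal{L}(q, θ) for the bound are assumed notation, and the original slides may arrange the steps differently.

```latex
\begin{align*}
\log p(X \mid \theta)
  &= \sum_{z} q(z) \, \log p(X \mid \theta)
     && \text{(multiplication by 1: } \textstyle\sum_z q(z) = 1 \text{)} \\
  &= \sum_{z} q(z) \, \log \left[ p(X \mid \theta) \, \frac{q(z)}{q(z)} \right]
     && \text{(multiplication by 1)} \\
  &= \sum_{z} q(z) \, \log \left[ \frac{p(X, z \mid \theta)}{p(z \mid X, \theta)} \, \frac{q(z)}{q(z)} \right]
     && \text{(Bayes rule)} \\
  &= \underbrace{\sum_{z} q(z) \, \log \frac{p(X, z \mid \theta)}{q(z)}}_{\mathcal{L}(q, \theta)}
   + \underbrace{\sum_{z} q(z) \, \log \frac{q(z)}{p(z \mid X, \theta)}}_{\mathrm{KL}\left(q(z) \,\|\, p(z \mid X, \theta)\right) \;\ge\; 0} \\
  &\ge \mathcal{L}(q, \theta)
\end{align*}
```

The bound is tight when q(z) equals the posterior p(z | X, θ), since the KL term is then zero.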

slide-12
SLIDE 12

Review: EM for Gaussian Mixtures

Generative Model

zn ∼ Discrete(π)
xn | zn = k ∼ Norm(µk, Σk)

Expectation Maximization

Initialize θ
Repeat until convergence

  • 1. Expectation Step
  • 2. Maximization Step
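A compact NumPy sketch of this loop for the Gaussian mixture is given below. Several details are assumptions made for illustration rather than part of the lecture: the initialization (means at random data points, covariances set to the data covariance), the small ridge term `reg` added to each covariance for numerical stability, and the fixed iteration count in place of a convergence test.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Log density of Norm(mu, Sigma) evaluated at each row of X."""
    D = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.einsum("nd,nd->n", diff, np.linalg.solve(Sigma, diff.T).T)
    return -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)

def em_gmm(X, K, n_iters=50, seed=0, reg=1e-6):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialize theta = (pi, mu, Sigma); this particular choice is an assumption.
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)
    Sigma = np.stack([np.cov(X.T).reshape(D, D) + reg * np.eye(D)] * K)
    lls = []
    for _ in range(n_iters):
        # Expectation step: gamma_nk proportional to pi_k * Norm(x_n | mu_k, Sigma_k)
        log_r = np.stack(
            [np.log(pi[k]) + log_gaussian(X, mu[k], Sigma[k]) for k in range(K)],
            axis=1,
        )
        log_norm = np.logaddexp.reduce(log_r, axis=1, keepdims=True)
        gamma = np.exp(log_r - log_norm)
        lls.append(float(log_norm.sum()))  # log likelihood under current parameters
        # Maximization step: weighted versions of the K-means parameter updates
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(D)
    return pi, mu, Sigma, gamma, lls
```

The recorded log likelihoods in `lls` should not decrease from one iteration to the next, which is a convenient sanity check when experimenting with this sketch.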

slide-13
SLIDE 13

TOPIC MODELS

Borrowing from:
 David Blei
 (Columbia)

slide-14
SLIDE 14

Word Mixtures

Idea: Model text as a mixture over words (ignore order)

[Figure labels: Topics, Words]

Topics:

  • gene 0.04, dna 0.02, genetic 0.01, ...
  • life 0.02, evolve 0.01, organism 0.01, ...
  • brain 0.04, neuron 0.02, nerve 0.01, ...
  • data 0.02, number 0.02, computer 0.01, ...

slide-15
SLIDE 15

EM for Word Mixtures

Generative Model

Expectation Maximization

Initialize θ
Repeat until convergence

  • 1. Expectation Step
  • 2. Maximization Step

slide-16
SLIDE 16

EM for Word Mixtures

Generative Model

E-step: Update assignments

M-step: Update parameters

slide-17
SLIDE 17

Topic Modeling

[Figure labels: Topics, Documents, Topic proportions and assignments]

Topics:

  • gene 0.04, dna 0.02, genetic 0.01, ...
  • life 0.02, evolve 0.01, organism 0.01, ...
  • brain 0.04, neuron 0.02, nerve 0.01, ...
  • data 0.02, number 0.02, computer 0.01, ...

  • Each topic is a distribution over words
  • Each document is a mixture over topics
  • Each word is drawn from one topic distribution
slide-18
SLIDE 18

Topic Modeling

[Figure labels: Topics, Words, Documents, Topic proportions and assignments]

Topics:

  • gene 0.04, dna 0.02, genetic 0.01, ...
  • life 0.02, evolve 0.01, organism 0.01, ...
  • brain 0.04, neuron 0.02, nerve 0.01, ...
  • data 0.02, number 0.02, computer 0.01, ...

slide-19
SLIDE 19

EM for Topic Models (PLSI/PLSA*)

Generative Model

E-step: Update assignments

M-step: Update parameters

*(Probabilistic Latent Semantic Indexing, a.k.a. Probabilistic Latent Semantic Analysis)
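The slide's update equations were lost in extraction, so the following is a sketch of the standard PLSA EM updates rather than a transcription. It assumes the usual model in which each word's topic is drawn from per-document proportions `theta[d]` and the word is drawn from the topic's distribution `beta[k]`. The input is assumed to be a document-term count matrix; the function name, the random initialization, and the small constant `eps` guarding against division by zero are illustrative choices.

```python
import numpy as np

def plsa_em(counts, K, n_iters=100, seed=0, eps=1e-12):
    """EM for PLSA on a (D, V) document-term count matrix (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    theta = rng.dirichlet(np.ones(K), size=D)  # (D, K) per-document topic proportions
    beta = rng.dirichlet(np.ones(V), size=K)   # (K, V) per-topic word distributions
    for _ in range(n_iters):
        # E-step: gamma[d, v, k] proportional to theta[d, k] * beta[k, v]
        gamma = theta[:, None, :] * beta.T[None, :, :]          # (D, V, K)
        gamma = gamma / (gamma.sum(axis=2, keepdims=True) + eps)
        # M-step: expected topic counts, normalized per document and per topic
        weighted = counts[:, :, None] * gamma                   # (D, V, K)
        theta = weighted.sum(axis=1)                            # (D, K)
        theta = theta / (theta.sum(axis=1, keepdims=True) + eps)
        beta = weighted.sum(axis=0).T                           # (K, V)
        beta = beta / (beta.sum(axis=1, keepdims=True) + eps)
    return theta, beta
```

The dense (D, V, K) array keeps the code short but is only practical for small vocabularies; a real implementation would iterate over nonzero counts.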

slide-20
SLIDE 20

Topic Models with Priors

Generative Model (with priors)

Maximum a Posteriori

E-step: Update assignments

M-step: Update parameters

slide-21
SLIDE 21

Latent Dirichlet Allocation

Plate diagram: α → θd → Zd,n → Wd,n ← βk ← η, with plates N, D, K

  • α: Proportions parameter
  • θd: Per-document topic proportions
  • Zd,n: Per-word topic assignment
  • Wd,n: Observed word
  • βk: Topics
  • η: Topic parameter

(a.k.a. PLSI/PLSA with priors)
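To make the roles of the variables concrete, the sketch below samples a small synthetic corpus from the LDA generative process. It assumes symmetric Dirichlet priors with scalar `alpha` and `eta` and a fixed number of words N per document; both are simplifications for illustration, and the function name is hypothetical.

```python
import numpy as np

def sample_lda_corpus(D, N, K, V, alpha=0.5, eta=0.5, seed=0):
    """Sample D documents of N words each from LDA with K topics and V word types."""
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(V, eta), size=K)      # topics: beta_k ~ Dir(eta)
    theta = rng.dirichlet(np.full(K, alpha), size=D)   # proportions: theta_d ~ Dir(alpha)
    z = np.empty((D, N), dtype=int)
    w = np.empty((D, N), dtype=int)
    for d in range(D):
        # per-word topic assignment: z_dn ~ Discrete(theta_d)
        z[d] = rng.choice(K, size=N, p=theta[d])
        for n in range(N):
            # observed word: w_dn ~ Discrete(beta_{z_dn})
            w[d, n] = rng.choice(V, p=beta[z[d, n]])
    return beta, theta, z, w
```

Only `w` would be observed in practice; `beta`, `theta`, and `z` are the latent quantities that inference tries to recover.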

slide-22
SLIDE 22

Intermezzo: Dirichlet Distribution

slide-23
SLIDE 23

Intermezzo: Dirichlet Distribution

slide-24
SLIDE 24

Intermezzo: Conjugacy

Likelihood (discrete)

Prior (Dirichlet)

Question: What distribution is the posterior?

More examples: https://en.wikipedia.org/wiki/Conjugate_prior
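A short worked answer to the question, using assumed notation: x_1, ..., x_N are discrete observations taking values in 1..K, θ is the probability vector, and N_k counts how often value k was observed.

```latex
\begin{align*}
\text{Likelihood:} \quad & p(x_{1:N} \mid \theta) = \prod_{k=1}^{K} \theta_k^{N_k},
  \qquad N_k = \sum_{n=1}^{N} \mathbb{1}[x_n = k] \\
\text{Prior:} \quad & p(\theta \mid \alpha)
  = \frac{\Gamma\!\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)}
    \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \\
\text{Posterior:} \quad & p(\theta \mid x_{1:N}, \alpha)
  \;\propto\; \prod_{k=1}^{K} \theta_k^{N_k + \alpha_k - 1}
  \;\;\Longrightarrow\;\;
  \theta \mid x_{1:N} \sim \mathrm{Dir}(\alpha_1 + N_1, \ldots, \alpha_K + N_K)
\end{align*}
```

The posterior is again a Dirichlet, which is what makes the Dirichlet a conjugate prior for the discrete likelihood.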

slide-25
SLIDE 25

MAP estimation for LDA

Generative Model (with priors)

Maximum a Posteriori

E-step: Update assignments

M-step: Update parameters
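The slide's equations are missing from the extracted text. As a sketch of what a MAP M-step with Dirichlet priors typically looks like, assume soft assignments γ_{dnk} from the E-step, a prior Dir(α) on each θ_d, and a prior Dir(η) on each β_k (the symbol γ_{dnk} and the count notation are assumptions). The updates below are the modes of the resulting Dirichlet posteriors and are valid when all α_k ≥ 1 and η_v ≥ 1.

```latex
\begin{align*}
N_{dk} &= \sum_{n} \gamma_{dnk},
&
\theta_{dk} &= \frac{N_{dk} + \alpha_k - 1}{\sum_{j} \left( N_{dj} + \alpha_j - 1 \right)} \\
N_{kv} &= \sum_{d} \sum_{n} \gamma_{dnk} \, \mathbb{1}[w_{dn} = v],
&
\beta_{kv} &= \frac{N_{kv} + \eta_v - 1}{\sum_{u} \left( N_{ku} + \eta_u - 1 \right)}
\end{align*}
```

Setting α_k = η_v = 1 makes the prior terms vanish and gives the plain maximum-likelihood updates without priors.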

slide-26
SLIDE 26

Variational Inference

Idea: Maximize Evidence Lower Bound (ELBO)

Maximizing the ELBO is equivalent to minimizing the KL divergence between the approximation q and the true posterior
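The equivalence can be read off from the following decomposition, written here for the LDA variables as a sketch. The notation q(z, β, θ) for the approximating distribution and \mathcal{L}(q) for the ELBO is assumed.

```latex
\begin{align*}
\log p(w \mid \alpha, \eta)
  &= \mathcal{L}(q)
   + \mathrm{KL}\!\left( q(z, \beta, \theta) \,\|\, p(z, \beta, \theta \mid w, \alpha, \eta) \right) \\
\mathcal{L}(q)
  &= \mathbb{E}_{q}\!\left[ \log p(w, z, \beta, \theta \mid \alpha, \eta) \right]
   - \mathbb{E}_{q}\!\left[ \log q(z, \beta, \theta) \right]
\end{align*}
```

The left-hand side does not depend on q, so any increase in the ELBO is matched by an equal decrease in the KL term.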

slide-27
SLIDE 27

Variational EM

Use Factorized Approximation for q(z,β,θ): Discrete (z), Dirichlet (β), Dirichlet (θ)

Variational E-step: Maximize w.r.t. φ
(expectations closed form for Dirichlet distributions)

Variational M-step: Maximize w.r.t. λ and γ
(analogous to MAP estimation)

slide-28
SLIDE 28

Variational EM

Use Factorized Approximation for q(z,β,θ): Discrete (z), Dirichlet (β), Dirichlet (θ)

Variational E-step: Maximize w.r.t. φ
(expectations closed form for Dirichlet distributions)

Variational M-step: Maximize w.r.t. λ and γ
(analogous to MAP estimation)
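The update equations themselves did not survive extraction. For reference, the standard mean-field updates for smoothed LDA are sketched below; the indexing conventions (d for documents, n for word positions, k for topics, v for word types) and the digamma function ψ are assumed notation that may differ from the slide.

```latex
\begin{align*}
q(z, \beta, \theta)
  &= \prod_{d} \prod_{n} \mathrm{Discrete}(z_{dn} \mid \phi_{dn})
     \prod_{k} \mathrm{Dir}(\beta_k \mid \lambda_k)
     \prod_{d} \mathrm{Dir}(\theta_d \mid \gamma_d) \\[4pt]
\phi_{dnk}
  &\propto \exp\!\left( \mathbb{E}_q[\log \theta_{dk}] + \mathbb{E}_q[\log \beta_{k, w_{dn}}] \right) \\
\gamma_{dk}
  &= \alpha_k + \sum_{n} \phi_{dnk} \\
\lambda_{kv}
  &= \eta_v + \sum_{d} \sum_{n} \phi_{dnk} \, \mathbb{1}[w_{dn} = v] \\[4pt]
\mathbb{E}_q[\log \theta_{dk}]
  &= \psi(\gamma_{dk}) - \psi\!\Big( \sum_{j} \gamma_{dj} \Big),
\qquad
\mathbb{E}_q[\log \beta_{kv}]
  = \psi(\lambda_{kv}) - \psi\!\Big( \sum_{u} \lambda_{ku} \Big)
\end{align*}
```

The γ and λ updates add expected counts to the prior parameters, which is the sense in which they resemble MAP estimation.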

slide-29
SLIDE 29

Example Inference

[Figure: probability (y-axis, 0.0 to 0.4) for each topic (x-axis, 1 to 96)]

slide-30
SLIDE 30

Example Inference

Top words of four topics (one topic per line):

  • human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences
  • evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common
  • disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis
  • computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations

slide-31
SLIDE 31

Example Inference

slide-32
SLIDE 32

Example Inference

Top words of four topics (one topic per line):

  • problem, problems, mathematical, number, new, mathematics, university, two, first, numbers, work, time, mathematicians, chaos, chaotic
  • model, rate, constant, distribution, time, number, size, values, value, average, rates, data, density, measured, models
  • selection, male, males, females, sex, species, female, evolution, populations, population, sexual, behavior, evolutionary, genetic, reproductive
  • species, forest, ecology, fish, ecological, conservation, diversity, population, natural, ecosystems, populations, endangered, tropical, forests, ecosystem

slide-33
SLIDE 33

Performance Metric: Perplexity

[Figure: perplexity (y-axis) vs. number of topics (x-axis) on two corpora, Nematode abstracts and Associated Press, comparing Smoothed Unigram, Smoothed Mixt. Unigrams, LDA, and Fold in pLSI]

$$\text{perplexity} = \exp\left\{ - \frac{\sum_{d} \log p(w_d)}{\sum_{d} N_d} \right\}$$

  • Marginal likelihood (evidence) of held out documents
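The perplexity formula is easy to compute once per-document log marginal likelihoods are available. The sketch below assumes those values are already given (estimating them for a topic model is the hard part and is not shown); the function and argument names are hypothetical.

```python
import numpy as np

def perplexity(log_p_docs, doc_lengths):
    """exp(- sum_d log p(w_d) / sum_d N_d) for held-out documents."""
    log_p_docs = np.asarray(log_p_docs, dtype=float)
    doc_lengths = np.asarray(doc_lengths, dtype=float)
    return float(np.exp(-log_p_docs.sum() / doc_lengths.sum()))

# Sanity check: a model that assigns every word probability 1/V has perplexity V.
V = 1000
lengths = np.array([10, 20])
uniform_log_p = lengths * np.log(1.0 / V)
uniform_perplexity = perplexity(uniform_log_p, lengths)
```

Lower perplexity means the model assigns higher probability per word to the held-out documents.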
slide-34
SLIDE 34

Extensions of LDA

  • EM inference (PLSA/PLSI) yields similar results to Variational inference or MAP inference (LDA) on most data
  • Reason for popularity of LDA: can be embedded in more complicated models

slide-35
SLIDE 35

Extensions: Supervised LDA

Plate diagram nodes: α, θd, Zd,n, Wd,n, βk, Yd, η, σ²; plates N, D, K

1. Draw topic proportions θ | α ∼ Dir(α).
2. For each word
  • Draw topic assignment zn | θ ∼ Mult(θ).
  • Draw word wn | zn, β1:K ∼ Mult(βzn).
3. Draw response variable $y \mid z_{1:N}, \eta, \sigma^2 \sim \mathcal{N}(\eta^\top \bar{z}, \sigma^2)$, where $\bar{z} = \frac{1}{N} \sum_{n=1}^{N} z_n$.
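The three steps of this generative process can be sketched for a single document as follows. The function name and argument layout are illustrative assumptions; `beta` is taken to be a (K, V) array whose rows are topic-word distributions, and z̄ is computed as the vector of empirical topic frequencies, which equals the average of the one-hot assignments.

```python
import numpy as np

def sample_slda_document(alpha, beta, eta, sigma2, N, rng):
    """Sample one document and its response from the supervised LDA model."""
    K, V = beta.shape
    # 1. Draw topic proportions theta | alpha ~ Dir(alpha)
    theta = rng.dirichlet(alpha)
    # 2a. Draw topic assignments z_n | theta ~ Mult(theta)
    z = rng.choice(K, size=N, p=theta)
    # 2b. Draw words w_n | z_n, beta ~ Mult(beta_{z_n})
    w = np.array([rng.choice(V, p=beta[k]) for k in z])
    # z_bar = (1/N) sum_n z_n, with z_n treated as one-hot vectors
    z_bar = np.bincount(z, minlength=K) / N
    # 3. Draw response y ~ N(eta^T z_bar, sigma2)
    y = rng.normal(eta @ z_bar, np.sqrt(sigma2))
    return theta, z, w, y
```

Because the response depends on the realized assignments through z̄, documents that use the same topics in similar proportions receive similar expected responses.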

slide-36
SLIDE 36

Extensions: Supervised LDA

[Figure: topics placed along an axis with ticks from −30 to 20; top words of each topic are listed below, one topic per line]

  • both, motion, simple, perfect, fascinating, power, complex
  • however, cinematography, screenplay, performances, pictures, effective, picture
  • his, their, character, many, while, performance, between
  • more, has, than, films, director, will, characters
  • one, from, there, which, who, much, what
  • awful, featuring, routine, dry, offered, charlie, paris
  • not, about, movie, all, would, they, its
  • have, like, you, was, just, some, out
  • bad, guys, watchable, its, not, one, movie
  • least, problem, unfortunately, supposed, worse, flat, dull

slide-37
SLIDE 37

Extensions: Correlated Topic Model

Estimate a covariance matrix Σ that parameterizes correlations between topics in a document

Plate diagram nodes: µ, Σ, ηd, Zd,n, Wd,n, βk; plates N, D, K

Nonconjugate prior on topic proportions
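One common way to obtain correlated topic proportions, and the one assumed in this sketch, is the logistic normal construction: draw ηd from a Gaussian with mean µ and covariance Σ, then map it to the simplex with a softmax. The function name is hypothetical and the slide's exact parameterization may differ.

```python
import numpy as np

def sample_topic_proportions(mu, Sigma, size, rng):
    """Draw topic proportions via eta_d ~ N(mu, Sigma), theta_d = softmax(eta_d)."""
    eta = rng.multivariate_normal(mu, Sigma, size=size)   # shape (size, K)
    eta = eta - eta.max(axis=1, keepdims=True)            # shift for numerical stability
    theta = np.exp(eta)
    return theta / theta.sum(axis=1, keepdims=True)

# Example: topics 0 and 1 tend to co-occur because their eta entries are correlated.
rng = np.random.default_rng(0)
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
theta = sample_topic_proportions(mu, Sigma, size=5, rng=rng)
```

Off-diagonal entries of Σ let pairs of topics rise and fall together within a document; the price is that the Gaussian is not conjugate to the discrete topic assignments.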
slide-38
SLIDE 38

Extensions: Dynamic Topic Models

Track changes in word distributions associated with a topic over time.

Inaugural addresses

1789: AMONG the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order...

2009: My fellow citizens: I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors...

slide-39
SLIDE 39

Extensions: Dynamic Topic Models

[Plate diagram: one LDA model per time slice (nodes α, θd, Zd,n, Wd,n; plates N, D), with topics βk,1, βk,2, ..., βk,T inside plate K]

slide-40
SLIDE 40

Extensions: Dynamic Topic Models

Top words of one topic by decade:

  • 1880: electric, machine, power, engine, steam, two, machines, iron, battery, wire
  • 1890: electric, power, company, steam, electrical, machine, two, system, motor, engine
  • 1900: apparatus, steam, power, engine, engineering, water, construction, engineer, room, feet
  • 1910: air, water, engineering, apparatus, room, laboratory, engineer, made, gas, tube
  • 1920: apparatus, tube, air, pressure, water, glass, gas, made, laboratory, mercury
  • 1930: tube, apparatus, glass, air, mercury, laboratory, pressure, made, gas, small
  • 1940: air, tube, apparatus, glass, laboratory, rubber, pressure, small, mercury, gas
  • 1950: tube, apparatus, glass, air, chamber, instrument, small, laboratory, pressure, rubber
  • 1960: tube, system, temperature, air, heat, chamber, power, high, instrument, control
  • 1970: air, heat, power, system, temperature, chamber, high, flow, tube, design
  • 1980: high, power, design, heat, system, systems, devices, instruments, control, large
  • 1990: materials, high, power, current, applications, technology, devices, design, device, heat
  • 2000: devices, device, materials, current, gate, high, light, silicon, material, technology

slide-41
SLIDE 41

Extensions: Dynamic Topic Models

[Figure: two panels of word trajectories over time (x-axis: 1880 to 2000). Panel "Theoretical Physics": RELATIVITY, LASER, FORCE. Panel "Neuroscience": NERVE, OXYGEN, NEURON.]

slide-42
SLIDE 42

Extensions: Ideal Point Topic Models

Plate diagram nodes: α, θd, Zdn, Wdn, βk, η, Ad, Bd, Xu, Vud, σu², σd²; plates N, D, K, U

  • Bill content (topic model): θd, Zdn, Wdn, βk
  • Bill sentiment variables: Ad, Bd
  • Legislator ideal points: Xu
  • Observed votes: Vud

slide-43
SLIDE 43

Extensions: Ideal Point Topic Models

Top terms per topic (one topic per line):

  • dod, defense, defense and appropriation, military, subtitle
  • veteran, veterans, bills, care, injury
  • people, woman, american, nation, school
  • producer, eligible, crop, farm, subparagraph
  • coin, inspector, designee, automobile, lebanon
  • bills, iran, official, company, sudan
  • human, vietnam, united nations, call, people
  • drug, pediatric, product, device, medical
  • child, fire, attorney, internet, bills
  • surveillance, director, court, electronic, flood
  • energy, bills, price, commodity, market
  • land, site, bills, interior, river
  • child, center, poison, victim, abuse
  • coast guard, vessel, space, administrator, requires
  • science, director, technology, mathematics, bills
  • computer, alien, bills, user, collection
  • head, start, child, technology, award
  • loss, crop, producer, agriculture, trade
  • bills, tax, subparagraph, loss, taxable
  • cover, bills, bridge, transaction, following
  • transportation, rail, railroad, passenger, homeland security
  • business, administrator, bills, business concern, loan
  • defense, iraq, transfer, expense, chapter
  • medicare, medicaid, child, chip, coverage
  • student, loan, institution, lender, school
  • energy, fuel, standard, administrator, lamp
  • housing, mortgage, loan, family, recipient
  • bank, transfer, requires, holding company, industrial
  • county, eligible, ballot, election, jurisdiction
  • tax credit, budget authority, energy, outlays, tax