

  1. Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 7 Jan-Willem van de Meent (credit: David Blei)

  2. Review: K-means Clustering. Objective: Sum of Squares (SSE), with one-hot assignments z_n and a center μ_k for each cluster k:
 SSE = Σ_{n=1}^N Σ_{k=1}^K z_nk ‖x_n − μ_k‖²
 [Figure: data points grouped around cluster centers μ_1, μ_2, μ_3.]
 Alternate between two steps:
 1. Minimize SSE w.r.t. z_n
 2. Minimize SSE w.r.t. μ_k
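To make the two alternating steps concrete, here is a minimal K-means sketch in Python/numpy (my illustration, not code from the slides; the random-point initialization and fixed iteration count are assumptions):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: alternate hard assignments and center updates."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # init centers at K random points
    for _ in range(n_iters):
        # Step 1: minimize SSE w.r.t. z_n -- assign each point to its nearest center
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
        z = dists.argmin(axis=1)
        # Step 2: minimize SSE w.r.t. mu_k -- move each center to the mean of its points
        for k in range(K):
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)
    return mu, z
```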

  3. Review: Probabilistic K-means. Generative Model:
 z_n ∼ Discrete(π)
 x_n | z_n = k ∼ Norm(μ_k, Σ_k)
 Questions: 1. What is log p(X, z | μ, Σ, π)? 2. For what choice of π and Σ do we recover K-means?
 Same as K-means when: Σ_k = σ²I and π_k = 1/K.
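For reference, the log joint asked for in question 1 follows directly from the generative model (a standard result; the slide showed it as an image):

 log p(X, z | μ, Σ, π) = Σ_{n=1}^N Σ_{k=1}^K z_nk [ log π_k + log Norm(x_n | μ_k, Σ_k) ]

With Σ_k = σ²I and π_k = 1/K, maximizing this over z and μ reduces to minimizing the K-means SSE.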

  4. Review: Probabilistic K-means. Assignment Update: assign each point to its most probable cluster. Parameter Updates:
 N_k := Σ_{n=1}^N z_nk
 π = (N_1/N, ..., N_K/N)
 μ_k = (1/N_k) Σ_{n=1}^N z_nk x_n
 Σ_k = (1/N_k) Σ_{n=1}^N z_nk (x_n − μ_k)(x_n − μ_k)^⊤
 Idea: Replace hard assignments with soft assignments.

  5. Review: Soft K-means. Idea: Replace hard assignments z_nk with soft assignments γ_nk. Soft Assignment Update: compute responsibilities γ_nk. Parameter Updates:
 N_k := Σ_{n=1}^N γ_nk
 π = (N_1/N, ..., N_K/N)
 μ_k = (1/N_k) Σ_{n=1}^N γ_nk x_n
 Σ_k = (1/N_k) Σ_{n=1}^N γ_nk (x_n − μ_k)(x_n − μ_k)^⊤
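Written out (a standard mixture-model result; the explicit formula was an image on the slide), the soft assignment is the posterior responsibility:

 γ_nk = p(z_n = k | x_n, μ, Σ, π) = π_k Norm(x_n | μ_k, Σ_k) / Σ_{j=1}^K π_j Norm(x_n | μ_j, Σ_j)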

  6.–11. Review: Lower Bound on Log Likelihood. (A derivation built up over six slides; the equations were images, and the surviving step annotations are "multiplication by 1", "multiplication by 1", and "Bayes rule".)
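The derivation those annotations refer to is the standard EM lower-bound argument; a reconstruction in my notation, for any distribution q(z):

 log p(X | θ) = Σ_z q(z) log p(X | θ)                                     (multiplication by 1: Σ_z q(z) = 1)
             = Σ_z q(z) log [ p(X, z | θ) / p(z | X, θ) ]                 (Bayes rule)
             = Σ_z q(z) log [ (p(X, z | θ) / q(z)) (q(z) / p(z | X, θ)) ]  (multiplication by 1: q(z)/q(z))
             = L(q, θ) + KL(q(z) ‖ p(z | X, θ)) ≥ L(q, θ)

Since the KL term is non-negative, L(q, θ) = Σ_z q(z) log [p(X, z | θ)/q(z)] is a lower bound on the log likelihood; EM alternates between tightening the bound in q and maximizing it in θ.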

  12. Review: EM for Gaussian Mixtures. Generative Model: z_n ∼ Discrete(π); x_n | z_n = k ∼ Norm(μ_k, Σ_k). Expectation Maximization: Initialize θ; repeat until convergence: 1. Expectation Step. 2. Maximization Step.
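A minimal EM-for-GMM sketch in Python/numpy combining the updates from slides 5 and 12 (my illustration, not the course code; the initialization and the small covariance stabilizer are assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a Gaussian mixture with full covariances."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                         # mixture weights
    mu = X[rng.choice(N, size=K, replace=False)]     # centers at K random points
    Sigma = np.stack([np.eye(D) for _ in range(K)])  # covariances
    for _ in range(n_iters):
        # E-step: responsibilities gamma_nk = p(z_n = k | x_n, theta)
        gamma = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                          for k in range(K)], axis=1)          # (N, K)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi, mu_k, Sigma_k from soft counts
        Nk = gamma.sum(axis=0)                                 # (K,)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
            Sigma[k] += 1e-6 * np.eye(D)                       # keep positive definite
    return pi, mu, Sigma, gamma
```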

  13. TOPIC MODELS. Borrowing from: David Blei (Columbia)

  14. Word Mixtures. Idea: Model text as a mixture over words (ignore order). [Figure: example topics as word–probability lists, with document words labeled by topic.] Topics: gene 0.04, dna 0.02, genetic 0.01, ... | life 0.02, evolve 0.01, organism 0.01, ... | brain 0.04, neuron 0.02, nerve 0.01, ... | data 0.02, number 0.02, computer 0.01, ...

  15. EM for Word Mixtures. Generative Model. Expectation Maximization: Initialize θ; repeat until convergence: 1. Expectation Step. 2. Maximization Step.

  16. EM for Word Mixtures. Generative Model. E-step: Update assignments. M-step: Update parameters.
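As a sketch of what those two updates look like for a mixture of unigrams, here is a minimal EM implementation (my illustration; the document-term count matrix C and the Dirichlet random initialization are assumptions):

```python
import numpy as np

def em_word_mixture(C, K, n_iters=100, seed=0):
    """EM for a mixture of unigrams: each document has a single topic.
    C: (D, V) matrix of word counts per document."""
    D, V = C.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                  # topic proportions
    beta = rng.dirichlet(np.ones(V), size=K)  # (K, V) word distributions
    for _ in range(n_iters):
        # E-step: gamma_dk ∝ pi_k * prod_w beta_kw^{C_dw}, computed in log space
        log_g = np.log(pi)[None, :] + C @ np.log(beta).T  # (D, K)
        log_g -= log_g.max(axis=1, keepdims=True)
        gamma = np.exp(log_g)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: update pi and beta from expected counts
        pi = gamma.mean(axis=0)
        beta = gamma.T @ C + 1e-12                        # (K, V)
        beta /= beta.sum(axis=1, keepdims=True)
    return pi, beta, gamma
```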

  17. Topic Modeling. [Figure: topics (word–probability lists, as on slide 14), documents, and per-document topic proportions and assignments.]
 • Each topic is a distribution over words
 • Each document is a mixture over topics
 • Each word is drawn from one topic distribution

  18. Topic Modeling. [Figure: same topics and documents as slide 17, with each word labeled by the topic it was drawn from.]

  19. EM for Topic Models (PLSI/PLSA*). Generative Model. E-step: Update assignments. M-step: Update parameters. *(Probabilistic Latent Semantic Indexing, a.k.a. Probabilistic Latent Semantic Analysis)
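The update equations were images on the slide; the standard PLSA updates they correspond to are (my notation: θ_dk for topic proportions in document d, β_kw for word probabilities in topic k, w_dn for the n-th word of document d):

 E-step:  γ_dnk = θ_dk β_k,w_dn / Σ_{j=1}^K θ_dj β_j,w_dn
 M-step:  θ_dk ∝ Σ_{n=1}^{N_d} γ_dnk        β_kw ∝ Σ_d Σ_{n: w_dn = w} γ_dnk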

  20. Topic Models with Priors. Generative Model (with priors). Maximum a Posteriori: E-step: Update assignments. M-step: Update parameters.
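With Dirichlet priors θ_d ∼ Dir(α) and β_k ∼ Dir(η), MAP estimation keeps the E-step and only adds pseudo-counts to the M-step (a standard result; not spelled out in the extracted text):

 θ_dk ∝ Σ_{n=1}^{N_d} γ_dnk + α_k − 1        β_kw ∝ Σ_d Σ_{n: w_dn = w} γ_dnk + η_w − 1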

  21. Latent Dirichlet Allocation (a.k.a. PLSI/PLSA with priors). Plate diagram: α → θ_d → Z_d,n → W_d,n ← β_k ← η, with plates N (words), D (documents), K (topics).
 α: proportions parameter; θ_d: per-document topic proportions; Z_d,n: per-word topic assignment; W_d,n: observed word; β_k: topics; η: topic parameter.

  22. Intermezzo: Dirichlet Distribution

  23. Intermezzo: Dirichlet Distribution (continued)
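The formulas on these two slides were images; for reference, the Dirichlet density (standard definition):

 Dir(θ | α) = [ Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k) ] Π_{k=1}^K θ_k^(α_k − 1),   with θ_k ≥ 0 and Σ_k θ_k = 1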

  24. Intermezzo: Conjugacy. Likelihood (discrete); prior (Dirichlet). Question: What distribution is the posterior? More examples: https://en.wikipedia.org/wiki/Conjugate_prior
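The answer is again a Dirichlet; this Dirichlet–discrete conjugacy is a standard result:

 If θ ∼ Dir(α) and x_1, ..., x_N ∼ Discrete(θ) with counts N_k = Σ_n 1[x_n = k], then
 θ | x_1, ..., x_N ∼ Dir(α_1 + N_1, ..., α_K + N_K)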

  25. MAP Estimation for LDA. Generative Model (with priors). Maximum a Posteriori: E-step: Update assignments. M-step: Update parameters.

  26. Variational Inference. Idea: Maximize the Evidence Lower Bound (ELBO). Maximizing the ELBO is equivalent to minimizing the KL divergence.
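The identity behind this equivalence (standard; the slide's equations were images):

 log p(x) = ELBO(q) + KL(q(z) ‖ p(z | x)),   where ELBO(q) = E_q[ log p(x, z) − log q(z) ]

Because log p(x) does not depend on q, raising the ELBO necessarily lowers the KL divergence between q(z) and the true posterior.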

  27. Variational EM. Use a factorized approximation for q(z, β, θ): Discrete for z (parameters φ), Dirichlet for β (parameters λ), Dirichlet for θ (parameters γ). Variational E-step: Maximize w.r.t. φ (expectations closed form for Dirichlet distributions). Variational M-step: Maximize w.r.t. λ and γ (analogous to MAP estimation).

  28. Variational EM. [Same text as slide 27; build slide.]
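Written out, the factorization these slides use is the standard mean-field family for LDA (my reconstruction):

 q(z, β, θ) = Π_{d=1}^D Π_{n=1}^{N_d} Discrete(z_d,n | φ_d,n) · Π_{k=1}^K Dir(β_k | λ_k) · Π_{d=1}^D Dir(θ_d | γ_d)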

  29. Example Inference. [Figure: bar plot of inferred topic proportions for one document; y-axis "Probability" from 0.0 to 0.4, x-axis "Topics" from 1 to ~100.]

  30. Example Inference. Top words for four inferred topics (the slide's table, read column-wise):
 Topic 1: human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences
 Topic 2: evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common
 Topic 3: disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis
 Topic 4: computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations

  31. Example Inference

  32. Example Inference. Top words for four more inferred topics (the slide's table, read column-wise):
 Topic 1: problem, problems, mathematical, number, new, mathematics, university, two, first, numbers, work, time, mathematicians, chaos, chaotic
 Topic 2: model, rate, constant, distribution, time, number, size, values, value, average, rates, data, density, measured, models
 Topic 3: selection, male, males, females, sex, species, female, evolution, populations, population, sexual, behavior, evolutionary, genetic, reproductive
 Topic 4: species, forest, ecology, fish, ecological, conservation, diversity, population, natural, ecosystems, populations, endangered, tropical, forests, ecosystem

  33. Performance Metric: Perplexity.
 perplexity = exp( − Σ_d log p(w_d) / Σ_d N_d )
 where p(w_d) is the marginal likelihood (evidence) of held-out document d and N_d its length. [Figure: held-out perplexity vs. number of topics on two corpora, Nematode abstracts (y-axis ≈ 1400–3400, up to 100 topics) and Associated Press (y-axis ≈ 2500–7000, up to 200 topics), comparing Smoothed Unigram, Smoothed Mixt. Unigrams, LDA, and Fold-in pLSI.]
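The formula translates directly to code; a minimal sketch, assuming you already have per-document held-out log evidences and document lengths:

```python
import numpy as np

def perplexity(log_p_docs, doc_lengths):
    """perplexity = exp(-sum_d log p(w_d) / sum_d N_d); lower is better."""
    return np.exp(-np.sum(log_p_docs) / np.sum(doc_lengths))

# Hypothetical example: two held-out documents of 300 and 250 words
print(perplexity([-2300.5, -1875.2], [300, 250]))
```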

  34. Extensions of LDA.
 • EM inference (PLSA/PLSI) yields results similar to variational inference or MAP inference (LDA) on most data.
 • The main reason for LDA's popularity is that it can be embedded in more complicated models.

  35. Extensions: Supervised LDA. Plate diagram: α → θ_d → Z_d,n → W_d,n ← β_k, plus a per-document response Y_d with parameters η, σ²; plates N, D, K.
 1. Draw topic proportions θ | α ∼ Dir(α).
 2. For each word:
 • Draw topic assignment z_n | θ ∼ Mult(θ).
 • Draw word w_n | z_n, β_1:K ∼ Mult(β_z_n).
 3. Draw response variable y | z_1:N, η, σ² ∼ N(η^⊤ z̄, σ²), where z̄ = (1/N) Σ_{n=1}^N z_n.

  36. Extensions: Supervised LDA. [Figure: ten topics from an sLDA fit to movie reviews, positioned by their estimated regression coefficients (x-axis roughly −30 to 20). The most negative topic contains words such as "least, problem, unfortunately, supposed, worse, flat, dull"; the most positive contains "however, cinematography, screenplay, performances, pictures, effective, picture".]

  37. Extensions: Correlated Topic Model. Plate diagram: η_d → Z_d,n → W_d,n ← β_k, with η_d drawn from N(μ, Σ); plates N, D, K. Non-conjugate prior on topic proportions. Estimate a covariance matrix Σ that parameterizes correlations between topics in a document.
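Concretely, the CTM replaces LDA's Dirichlet on the topic proportions with a logistic normal (standard formulation; the slide's equations were not extracted):

 η_d ∼ N(μ, Σ),   θ_d,k = exp(η_d,k) / Σ_{j=1}^K exp(η_d,j)

Off-diagonal entries of Σ let topics co-occur more or less often than independence would allow; the softmax link is also why this prior is no longer conjugate.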

  38. Extensions: Dynamic Topic Models. Inaugural addresses, 1789 to 2009:
 1789: "AMONG the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order..."
 2009: "My fellow citizens: I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors..."
 Track changes in word distributions associated with a topic over time.

  39. Extensions: Dynamic Topic Models. [Plate diagram: one LDA model per time slice (α, θ_d, Z_d,n, W_d,n; plates N, D), chained through evolving topic parameters β_k,1 → β_k,2 → ... → β_k,T; plate K.]
