CS6220: Data Mining Techniques


  1. CS6220: DATA MINING TECHNIQUES Text Data: Topic Models Instructor: Yizhou Sun yzsun@ccs.neu.edu February 17, 2016

  2. Methods to Learn (methods by task and data type)
  • Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN (matrix data) | HMM (sequence data) | Label Propagation (graph & network) | Neural Network (images)
  • Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means* (matrix data) | PLSA (text data) | SCAN; Spectral Clustering (graph & network)
  • Frequent Pattern Mining: Apriori; FP-growth (set data) | GSP; PrefixSpan (sequence data)
  • Prediction: Linear Regression; Collaborative Filtering (matrix data) | Autoregression (time series)
  • Similarity Search: DTW (time series) | P-PageRank (graph & network)
  • Ranking: PageRank (graph & network)

  3. Text Data: Topic Models • Text Data and Topic Models • Probabilistic Latent Semantic Analysis • Summary 3

  4. Text Data
  • Word/term
  • Document: a bag of words
  • Corpus: a collection of documents

  5. Represent a Document
  • Most common way: Bag-of-Words
  • Ignore the order of words; keep the count
  • [figure: term-document count matrix for example documents c1–c5 and m1–m4]

  6. More Details
  • Represent the document as a vector where each entry corresponds to a different word, and the number at that entry corresponds to how many times that word appears in the document (or some function of it)
  • The number of words is huge
    • Select and use a smaller set of words that are of interest
    • E.g., uninteresting words: 'and', 'the', 'at', 'is', etc. These are called stop-words
    • Stemming: remove endings. E.g., 'learn', 'learning', 'learnable', 'learned' could all be substituted by the single stem 'learn'
    • Other simplifications can also be invented and used
  • The set of different remaining words is called the dictionary or vocabulary. Fix an ordering of the terms in the dictionary so that each term can be referred to by its index.
  • Can be extended to bi-grams, tri-grams, and so on
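
A minimal Python sketch of this pipeline, not from the slides: the toy documents, the stop-word list, and the absence of real stemming are illustrative simplifications.

```python
from collections import Counter

# Toy corpus and stop-word list; both are illustrative placeholders.
docs = ["the cat sat on the mat", "the dog learned a new trick"]
stop_words = {"the", "a", "on", "and", "at", "is"}

def tokenize(text):
    """Lower-case, split on whitespace, drop stop-words (no real stemming here)."""
    return [w for w in text.lower().split() if w not in stop_words]

# Build the dictionary (vocabulary) with a fixed ordering of terms.
vocab = sorted({w for doc in docs for w in tokenize(doc)})
index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(text):
    """One count vector per document, entries indexed by term id."""
    counts = Counter(tokenize(text))
    return [counts.get(w, 0) for w in vocab]

vectors = [bag_of_words(doc) for doc in docs]
print(vocab)
print(vectors)   # word order within each document is ignored, only counts remain
```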

  7. Topics
  • Topic
    • A topic is represented by a word distribution
    • Relates to an issue

  8. Topic Models • Topic modeling • Get topics automatically from a corpus • Assign documents to topics automatically • Most frequently used topic models • pLSA • LDA 8

  9. Text Data: Topic Models • Text Data and Topic Models • Probabilistic Latent Semantic Analysis • Summary 9

  10. Notations
  • Word, document, topic: $w$, $d$, $z$
  • Word count in document: $c(w, d)$
  • Word distribution for each topic ($\beta_z$): $\beta_{zw} = p(w \mid z)$
  • Topic distribution for each document ($\theta_d$): $\theta_{dz} = p(z \mid d)$  (yes, this is fuzzy clustering)

  11. Review of Multinomial Distribution
  • Select n data points from K categories, each with probability $p_k$
    • n trials of independent categorical distribution
    • E.g., roll a die n times, where each face 1–6 comes up with probability 1/6
  • When K = 2, binomial distribution
    • n trials of independent Bernoulli distribution
    • E.g., flip a coin n times to get heads or tails
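
A small NumPy illustration of these two facts; the trial count and probabilities are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# K = 6 categories with equal probability 1/6 (a fair die), n = 10 trials.
p = np.full(6, 1 / 6)
counts = rng.multinomial(10, p)   # count per face; the counts always sum to n
print(counts, counts.sum())

# K = 2 recovers the binomial: n coin flips, count of heads.
heads = rng.multinomial(10, [0.5, 0.5])[0]
print(heads)
```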

  12. Generative Model for pLSA
  • Describe how a document is generated probabilistically
  • For each position in d, $n = 1, \dots, N_d$:
    • Generate the topic for the position as $z_n \sim \mathrm{mult}(\cdot \mid \theta_d)$, i.e., $p(z_n = k) = \theta_{dk}$  (note: a 1-trial multinomial, i.e., a categorical distribution)
    • Generate the word for the position as $w_n \sim \mathrm{mult}(\cdot \mid \beta_{z_n})$, i.e., $p(w_n = w) = \beta_{z_n w}$
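
A minimal NumPy sketch of this generative process using the notation above. The corpus sizes and the use of Dirichlet draws to create random but valid probability vectors are illustrative choices for the demo, not part of pLSA itself.

```python
import numpy as np

rng = np.random.default_rng(42)

K, V = 3, 8            # number of topics, vocabulary size (illustrative)
N_d = 20               # number of positions (words) in document d

# Parameters of pLSA; each row is a probability distribution (sums to 1).
# Dirichlet is used here only to produce random valid distributions for the
# demo; pLSA treats beta and theta as fixed parameters to be estimated (unlike LDA).
beta = rng.dirichlet(np.ones(V), size=K)   # beta[z, w] = p(w | z)
theta_d = rng.dirichlet(np.ones(K))        # theta_d[z] = p(z | d)

doc = []
for n in range(N_d):
    # 1-trial multinomial = categorical draw of a topic for this position
    z_n = rng.choice(K, p=theta_d)
    # then draw the word for this position from that topic's word distribution
    w_n = rng.choice(V, p=beta[z_n])
    doc.append(w_n)

print(doc)   # the generated document as a sequence of word indices
```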

  13. The Likelihood Function for a Corpus
  • Probability of a word: $p(w \mid d) = \sum_k p(w, z = k \mid d) = \sum_k p(w \mid z = k)\, p(z = k \mid d) = \sum_k \beta_{kw}\,\theta_{dk}$
  • Likelihood of a corpus: $\log L = \sum_d \big[\log \pi_d + \sum_w c(w,d) \log p(w \mid d)\big]$, where the document prior $\pi_d$ is usually considered as uniform and can be dropped

  14. Re-arrange the Likelihood Function
  • Group the same word from different positions together:
    $\max \log L = \sum_{d,w} c(w,d)\, \log \sum_z \theta_{dz}\,\beta_{zw}$
    s.t. $\sum_z \theta_{dz} = 1$ and $\sum_w \beta_{zw} = 1$
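
As a hedged sketch, this objective can be evaluated directly from a document-by-word count matrix and the two parameter matrices. The shapes, function name, and random values below are assumptions made purely for illustration.

```python
import numpy as np

def plsa_log_likelihood(C, theta, beta):
    """log L = sum_{d,w} c(w,d) * log sum_z theta[d,z] * beta[z,w].

    C:     D x V matrix of word counts c(w, d)
    theta: D x K matrix, rows sum to 1 (p(z|d))
    beta:  K x V matrix, rows sum to 1 (p(w|z))
    """
    p_w_given_d = theta @ beta                        # D x V matrix of p(w|d)
    return float(np.sum(C * np.log(p_w_given_d + 1e-12)))

# Tiny illustrative example with random counts and parameters.
rng = np.random.default_rng(0)
D, K, V = 4, 2, 6
C = rng.integers(0, 5, size=(D, V))
theta = rng.dirichlet(np.ones(K), size=D)
beta = rng.dirichlet(np.ones(V), size=K)
print(plsa_log_likelihood(C, theta, beta))
```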

  15. Optimization: EM Algorithm
  • Repeat until convergence:
    • E-step: for each word in each document, calculate its conditional probability of belonging to each topic:
      $p(z \mid w, d) \propto p(w \mid z, d)\, p(z \mid d) = \beta_{zw}\,\theta_{dz}$, i.e., $p(z \mid w, d) = \dfrac{\beta_{zw}\,\theta_{dz}}{\sum_{z'} \beta_{z'w}\,\theta_{dz'}}$
    • M-step: given the conditional distribution, find the parameters that maximize the expected likelihood:
      $\beta_{zw} \propto \sum_d p(z \mid w, d)\, c(w,d)$, i.e., $\beta_{zw} = \dfrac{\sum_d p(z \mid w, d)\, c(w,d)}{\sum_{w',d} p(z \mid w', d)\, c(w',d)}$
      $\theta_{dz} \propto \sum_w p(z \mid w, d)\, c(w,d)$, i.e., $\theta_{dz} = \dfrac{\sum_w p(z \mid w, d)\, c(w,d)}{N_d}$
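
A compact NumPy sketch of these E- and M-step updates, assuming the same D×V count matrix C, D×K matrix theta, and K×V matrix beta as in the previous sketch. It is an illustrative implementation of the update formulas above, not the course's reference code, and the iteration count and toy data are made up.

```python
import numpy as np

def plsa_em(C, K, n_iter=100, seed=0):
    """Fit pLSA by EM. C is a D x V matrix of counts c(w, d)."""
    rng = np.random.default_rng(seed)
    D, V = C.shape
    theta = rng.dirichlet(np.ones(K), size=D)   # theta[d, z] = p(z | d)
    beta = rng.dirichlet(np.ones(V), size=K)    # beta[z, w]  = p(w | z)

    for _ in range(n_iter):
        # E-step: p(z | w, d) ∝ beta[z, w] * theta[d, z] for every (d, w).
        post = theta[:, :, None] * beta[None, :, :]          # D x K x V
        post /= post.sum(axis=1, keepdims=True) + 1e-12      # normalize over z

        # Weighted counts p(z | w, d) * c(w, d), reused by both M-step updates.
        weighted = post * C[:, None, :]                      # D x K x V

        # M-step: beta[z, w] ∝ sum_d p(z|w,d) c(w,d), rows renormalized over w.
        beta = weighted.sum(axis=0)
        beta /= beta.sum(axis=1, keepdims=True) + 1e-12

        # M-step: theta[d, z] ∝ sum_w p(z|w,d) c(w,d); the denominator is N_d.
        theta = weighted.sum(axis=2)
        theta /= theta.sum(axis=1, keepdims=True) + 1e-12

    return theta, beta

# Usage on a tiny random count matrix (illustrative only).
rng = np.random.default_rng(1)
C = rng.integers(0, 5, size=(6, 10))
theta, beta = plsa_em(C, K=3)
print(theta.shape, beta.shape)
```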

  16. Text Data: Topic Models • Text Data and Topic Models • Probabilistic Latent Semantic Analysis • Summary 16

  17. Summary • Basic Concepts • Word/term, document, corpus, topic • How to represent a document • pLSA • Generative model • Likelihood function • EM algorithm 17
