SLIDE 1

CS145: INTRODUCTION TO DATA MINING

Text Data: Naïve Bayes

Instructor: Yizhou Sun
yzsun@cs.ucla.edu

December 7, 2017

SLIDE 2

Methods to be Learnt

Task                     | Vector Data                                               | Set Data           | Sequence Data   | Text Data
Classification           | Logistic Regression; Decision Tree; KNN; SVM; NN          |                    |                 | Naïve Bayes for Text
Clustering               | K-means; Hierarchical Clustering; DBSCAN; Mixture Models  |                    |                 | PLSA
Prediction               | Linear Regression; GLM*                                   |                    |                 |
Frequent Pattern Mining  |                                                           | Apriori; FP-growth | GSP; PrefixSpan |
Similarity Search        |                                                           |                    | DTW             |

SLIDE 3

Naïve Bayes for Text

  • Text Data
  • Revisit of Multinomial Distribution
  • Multinomial Naïve Bayes
  • Summary


SLIDE 4

Text Data

  • Word/term
  • Document
    • A sequence of words
  • Corpus
    • A collection of documents


SLIDE 5

Text Classification Applications

  • Spam detection
  • Sentiment analysis

Example (spam email):

From: airak@medicana.com.tr
Subject: Loan Offer
Do you need a personal or business loan urgent that can be process within 2 to 3 working days? Have you been frustrated so many times by your banks and other loan firm and you don't know what to do? Here comes the Good news Deutsche Bank Financial Business and Home Loan is here to offer you any kind of loan you need at an affordable interest rate of 3% If you are interested let us know.

SLIDE 6

Represent a Document

  • Most common way: Bag-of-Words
    • Ignore the order of words
    • Keep the counts

[Figure: word-count vectors for example documents c1–c5 and m1–m4]

Vector space model: for document $d$, $\mathbf{x}_d = (x_{d1}, x_{d2}, \dots, x_{dN})$, where $x_{dn}$ is the count of the $n$th word in the vocabulary.
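To make the representation concrete, here is a minimal sketch of building such count vectors in Python; the whitespace tokenizer and the tiny vocabulary are illustrative assumptions, not part of the slides:

```python
from collections import Counter

def bag_of_words(doc, vocabulary):
    """Map a document to its count vector x_d over a fixed, ordered vocabulary."""
    counts = Counter(doc.lower().split())          # naive whitespace tokenizer
    return [counts[word] for word in vocabulary]   # x_dn = count of the nth word

# Hypothetical example: a fixed, ordered vocabulary and one document
vocabulary = ["chinese", "beijing", "shanghai", "macao", "tokyo", "japan"]
print(bag_of_words("Chinese Beijing Chinese", vocabulary))  # [2, 1, 0, 0, 0, 0]
```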

SLIDE 7

More Details

  • Represent the doc as a vector where each entry corresponds to a different word, and the number at that entry corresponds to how many times that word was present in the document (or some function of it)

  • The number of distinct words is huge
    • Select and use a smaller set of words that are of interest
    • E.g., uninteresting words: 'and', 'the', 'at', 'is', etc. These are called stop-words
    • Stemming: remove endings. E.g., 'learn', 'learning', 'learnable', 'learned' could all be substituted by the single stem 'learn'
    • Other simplifications can also be invented and used
    • The set of different remaining words is called the dictionary or vocabulary. Fix an ordering of the terms in the dictionary so that you can refer to them by their index.
  • Can be extended to bi-grams, tri-grams, and so on (see the preprocessing sketch below)
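A minimal preprocessing sketch along these lines; the stop-word list and the suffix-stripping rule are invented stand-ins (a real system would use a curated stop list and a proper stemmer such as Porter's):

```python
STOP_WORDS = {"and", "the", "at", "is", "a", "of", "to"}  # hypothetical list

def crude_stem(word):
    """Strip a few common endings; far simpler than a real stemmer."""
    for suffix in ("ing", "able", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(doc):
    """Lowercase, tokenize, drop stop-words, and stem the rest."""
    tokens = doc.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The learner is learning learnable things"))
# ['learner', 'learn', 'learn', 'thing']
```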


SLIDE 8

Limitations of Vector Space Model

  • Dimensionality
    • High dimensionality
  • Sparseness
    • Most of the entries are zero
  • Shallow representation
    • The vector representation does not capture semantic relations between words

SLIDE 9

Naïve Bayes for Text

  • Text Data
  • Revisit of Multinomial Distribution
  • Multinomial Naïve Bayes
  • Summary


SLIDE 10

Bernoulli and Categorical Distribution

  • Bernoulli distribution
    • Discrete distribution that takes two values {0, 1}
    • $P(X = 1) = p$ and $P(X = 0) = 1 - p$
    • E.g., tossing a coin with head and tail
  • Categorical distribution
    • Discrete distribution that takes more than two values, i.e., $x \in \{1, \dots, K\}$
    • Also called generalized Bernoulli distribution, or multinoulli distribution
    • $P(X = k) = p_k$ and $\sum_k p_k = 1$
    • E.g., rolling a die, where each of the faces 1–6 comes up with probability 1/6
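A quick sketch of sampling from both distributions in pure Python (the probability values are arbitrary examples):

```python
import random

def sample_bernoulli(p):
    """Return 1 with probability p, else 0."""
    return 1 if random.random() < p else 0

def sample_categorical(probs):
    """Return a value k in {1, ..., K} with probability probs[k-1]."""
    u, cumulative = random.random(), 0.0
    for k, p_k in enumerate(probs, start=1):
        cumulative += p_k
        if u < cumulative:
            return k
    return len(probs)  # guard against floating-point round-off

print(sample_bernoulli(0.5))          # coin toss
print(sample_categorical([1/6] * 6))  # fair die roll
```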

SLIDE 11

Binomial and Multinomial Distribution

  • Binomial distribution
    • The number of successes (i.e., the total number of 1's) in n repeated independent Bernoulli trials, each with success probability $p$
    • $x$: number of successes
    • $P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$
  • Multinomial distribution (multivariate random variable)
    • Repeat n independent trials of a categorical distribution
    • Let $x_k$ be the number of times value $k$ has been observed; note $\sum_k x_k = n$
    • $P(X_1 = x_1, X_2 = x_2, \dots, X_K = x_K) = \frac{n!}{x_1! \, x_2! \cdots x_K!} \prod_k p_k^{x_k}$
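To make the pmf concrete, a small sketch that evaluates it directly (the counts and probability vector are made-up inputs):

```python
from math import factorial

def multinomial_pmf(x, p):
    """P(X_1 = x_1, ..., X_K = x_K) = n!/(x_1!...x_K!) * prod_k p_k^x_k."""
    n = sum(x)
    coefficient = factorial(n)
    for x_k in x:
        coefficient //= factorial(x_k)  # exact at every step
    prob = 1.0
    for x_k, p_k in zip(x, p):
        prob *= p_k ** x_k
    return coefficient * prob

# E.g., 6 rolls of a fair die observed as counts (1, 1, 1, 1, 1, 1)
print(multinomial_pmf([1, 1, 1, 1, 1, 1], [1/6] * 6))  # 6!/6^6 ≈ 0.0154
```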

SLIDE 12

Naïve Bayes for Text

  • Text Data
  • Revisit of Multinomial Distribution
  • Multinomial Naïve Bayes
  • Summary


SLIDE 13

Bayes’ Theorem: Basics

  • Bayes' Theorem:
    • Let X be a data sample ("evidence")
    • Let h be a hypothesis that X belongs to class C
  • P(h) (prior probability): the probability of hypothesis h
    • E.g., the probability of the "spam" class
  • P(X|h) (likelihood): the probability of observing the sample X, given that the hypothesis holds
    • E.g., the probability of an email given that it's spam
  • P(X) (marginal probability): the probability that the sample data is observed
    • $P(X) = \sum_h P(X|h) \, P(h)$
  • P(h|X) (posterior probability): the probability that the hypothesis holds given the observed data sample X

$$P(h|X) = \frac{P(X|h) \, P(h)}{P(X)}$$
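A quick numeric illustration with invented numbers: suppose 20% of email is spam, 90% of spam contains the word 'loan', and 10% of non-spam does. For an email X containing 'loan',

$$P(\text{spam} \mid X) = \frac{0.9 \times 0.2}{0.9 \times 0.2 + 0.1 \times 0.8} = \frac{0.18}{0.26} \approx 0.69$$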


SLIDE 14

Classification: Choosing Hypotheses

  • Maximum likelihood (maximize the likelihood):
    $h_{ML} = \arg\max_{h \in H} P(X|h)$
  • Maximum a posteriori (maximize the posterior):
    $h_{MAP} = \arg\max_{h \in H} P(h|X) = \arg\max_{h \in H} P(X|h) \, P(h)$
  • Useful observation: the maximization does not depend on the denominator P(X)

SLIDE 15

Classification by Maximum A Posteriori

  • Let D be a training set of tuples and their associated class labels, where each tuple is represented by a p-dimensional attribute vector x = (x1, x2, …, xp)
  • Suppose there are m classes y ∈ {1, 2, …, m}
  • Classification is to derive the maximum a posteriori class, i.e., the one with maximal P(y = j | x)
  • This can be derived from Bayes' theorem:

$$p(y = j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y = j) \, p(y = j)}{p(\mathbf{x})}$$

  • Since p(x) is constant for all classes, only $p(\mathbf{x} \mid y) \, p(y)$ needs to be maximized

SLIDE 16

Now Come to Text Setting

  • A document is represented as a bag of words
    • $\mathbf{x}_d = (x_{d1}, x_{d2}, \dots, x_{dN})$, where $x_{dn}$ is the count of the $n$th word in the vocabulary
  • Model $p(\mathbf{x}_d \mid y)$ for class $y$
    • Follows a multinomial distribution with parameter vector $\boldsymbol{\beta}_y = (\beta_{y1}, \beta_{y2}, \dots, \beta_{yN})$, i.e.,
    • $p(\mathbf{x}_d \mid y) = \frac{(\sum_n x_{dn})!}{x_{d1}! \, x_{d2}! \cdots x_{dN}!} \prod_n \beta_{yn}^{x_{dn}}$
  • Model $p(y = j)$
    • Follows a categorical distribution with parameter vector $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_m)$, i.e.,
    • $p(y = j) = \pi_j$

SLIDE 17

Classification Process Assuming Parameters are Given

  • Find the $y$ that maximizes $p(y \mid \mathbf{x}_d)$, which is equivalent to maximizing

$$y^* = \arg\max_y p(\mathbf{x}_d, y) = \arg\max_y p(\mathbf{x}_d \mid y) \, p(y) = \arg\max_y \frac{(\sum_n x_{dn})!}{x_{d1}! \, x_{d2}! \cdots x_{dN}!} \prod_n \beta_{yn}^{x_{dn}} \times \pi_y$$

$$= \arg\max_y \prod_n \beta_{yn}^{x_{dn}} \times \pi_y = \arg\max_y \sum_n x_{dn} \log \beta_{yn} + \log \pi_y$$

  • The multinomial coefficient $\frac{(\sum_n x_{dn})!}{x_{d1}! \cdots x_{dN}!}$ is constant for every class, denoted as $c_d$, so it can be dropped
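A minimal sketch of this decision rule in log space; the parameter containers (`beta` mapping each class to its word-probability vector, `pi` mapping each class to its prior) are assumed shapes for illustration:

```python
from math import log

def classify(x, beta, pi):
    """Return argmax_y [ sum_n x_n * log(beta[y][n]) + log(pi[y]) ].

    Assumes smoothed (strictly positive) word probabilities,
    since log(0) is undefined."""
    best_y, best_score = None, float("-inf")
    for y in pi:
        score = log(pi[y]) + sum(x_n * log(b_n) for x_n, b_n in zip(x, beta[y]))
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```

For instance, with the smoothed parameters from the example slides below, `classify([3, 0, 0, 0, 1, 1], beta, pi)` returns class c.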

SLIDE 18

Parameter Estimation via MLE

  • Given a corpus and a label for each document
    • $D = \{(\mathbf{x}_d, y_d)\}$
  • Find the MLE estimators for $\Theta = (\boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \dots, \boldsymbol{\beta}_m, \boldsymbol{\pi})$
  • The log-likelihood function for the training dataset:

$$\log L = \log \prod_d p(\mathbf{x}_d, y_d \mid \Theta) = \sum_d \log p(\mathbf{x}_d, y_d \mid \Theta) = \sum_d \log p(\mathbf{x}_d \mid y_d) \, p(y_d) = \sum_d \left( \sum_n x_{dn} \log \beta_{y_d n} + \log \pi_{y_d} + \log c_d \right)$$

  • $\log c_d$ does not involve the parameters and can be dropped for optimization purposes
  • The optimization problem:

$$\max_\Theta \log L \quad \text{s.t.} \quad \pi_j \ge 0 \text{ and } \sum_j \pi_j = 1; \quad \beta_{jn} \ge 0 \text{ and } \sum_n \beta_{jn} = 1 \text{ for all } j$$

SLIDE 19

Solve the Optimization Problem

  • Use the Lagrange multiplier method
  • Solution:

$$\hat{\beta}_{jn} = \frac{\sum_{d: y_d = j} x_{dn}}{\sum_{d: y_d = j} \sum_{n'} x_{dn'}}$$

  • $\sum_{d: y_d = j} x_{dn}$: total count of word $n$ in class $j$
  • $\sum_{d: y_d = j} \sum_{n'} x_{dn'}$: total count of words in class $j$

$$\hat{\pi}_j = \frac{\sum_d 1(y_d = j)}{|D|}$$

  • $1(y_d = j)$ is the indicator function, which equals 1 if $y_d = j$ holds
  • $|D|$: total number of documents
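A sketch of these closed-form estimators; the `(count_vector, label)` data format is an assumption for illustration, and no smoothing is applied yet:

```python
def mle_estimates(data, num_words):
    """data: list of (x_d, y_d) pairs, x_d a count vector of length num_words.
    Returns (beta, pi) per the closed-form MLE solution (no smoothing)."""
    word_counts, doc_counts = {}, {}
    for x_d, y_d in data:
        acc = word_counts.setdefault(y_d, [0] * num_words)
        for n, x_dn in enumerate(x_d):
            acc[n] += x_dn                        # total count of word n in class y_d
        doc_counts[y_d] = doc_counts.get(y_d, 0) + 1
    beta = {
        j: [c / sum(counts) for c in counts]      # normalize by total words in class j
        for j, counts in word_counts.items()
    }
    pi = {j: m / len(data) for j, m in doc_counts.items()}  # fraction of docs in class j
    return beta, pi
```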

SLIDE 20

Smoothing

  • What if some word $n$ does not appear in some class $j$ in the training dataset?

$$\hat{\beta}_{jn} = \frac{\sum_{d: y_d = j} x_{dn}}{\sum_{d: y_d = j} \sum_{n'} x_{dn'}} = 0$$

  • $\Rightarrow p(\mathbf{x}_d \mid y = j) \propto \prod_n \beta_{jn}^{x_{dn}} = 0$ for any document containing word $n$
  • But other words may give a strong indication that the document belongs to class $j$
  • Solution: add-1 smoothing, or Laplacian smoothing

$$\hat{\beta}_{jn} = \frac{\sum_{d: y_d = j} x_{dn} + 1}{\sum_{d: y_d = j} \sum_{n'} x_{dn'} + N}$$

  • $N$: total number of words in the vocabulary
  • Check: does $\sum_n \hat{\beta}_{jn} = 1$ still hold?
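The same estimator with add-1 smoothing, as a sketch (reusing the hypothetical data format from the MLE sketch above), including the suggested sanity check:

```python
def smoothed_beta(data, num_words):
    """Add-1 (Laplacian) smoothing: (count + 1) / (class total + N)."""
    word_counts = {}
    for x_d, y_d in data:
        acc = word_counts.setdefault(y_d, [0] * num_words)
        for n, x_dn in enumerate(x_d):
            acc[n] += x_dn
    beta = {
        j: [(c + 1) / (sum(counts) + num_words) for c in counts]
        for j, counts in word_counts.items()
    }
    # Sanity check: each smoothed distribution still sums to 1
    for j in beta:
        assert abs(sum(beta[j]) - 1.0) < 1e-9
    return beta
```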

SLIDE 21

Example

  • Data: four training documents, three in class c and one in class j; class c contains 8 words in total (Chinese: 5, Beijing: 1, Shanghai: 1, Macao: 1), and class j contains 3 words (Chinese: 1, Tokyo: 1, Japan: 1)
  • Vocabulary:

    Index | 1       | 2       | 3        | 4     | 5     | 6
    Word  | Chinese | Beijing | Shanghai | Macao | Tokyo | Japan

  • Learned parameters (with smoothing):

$$\hat{\beta}_{c1} = \frac{5 + 1}{8 + 6} = \frac{3}{7}, \quad \hat{\beta}_{c2} = \hat{\beta}_{c3} = \hat{\beta}_{c4} = \frac{1 + 1}{8 + 6} = \frac{1}{7}, \quad \hat{\beta}_{c5} = \hat{\beta}_{c6} = \frac{0 + 1}{8 + 6} = \frac{1}{14}$$

$$\hat{\beta}_{j1} = \frac{1 + 1}{3 + 6} = \frac{2}{9}, \quad \hat{\beta}_{j2} = \hat{\beta}_{j3} = \hat{\beta}_{j4} = \frac{0 + 1}{3 + 6} = \frac{1}{9}, \quad \hat{\beta}_{j5} = \hat{\beta}_{j6} = \frac{1 + 1}{3 + 6} = \frac{2}{9}$$

$$\hat{\pi}_c = \frac{3}{4}, \quad \hat{\pi}_j = \frac{1}{4}$$

SLIDE 22

Example (Continued)

  • Classification stage
  • For the test document d = 5 (word counts: Chinese ×3, Tokyo ×1, Japan ×1), compute

$$p(y = c \mid \mathbf{x}_5) \propto p(y = c) \times \prod_n \beta_{cn}^{x_{5n}} = \frac{3}{4} \times \left(\frac{3}{7}\right)^3 \times \frac{1}{14} \times \frac{1}{14} \approx 0.0003$$

$$p(y = j \mid \mathbf{x}_5) \propto p(y = j) \times \prod_n \beta_{jn}^{x_{5n}} = \frac{1}{4} \times \left(\frac{2}{9}\right)^3 \times \frac{2}{9} \times \frac{2}{9} \approx 0.0001$$

  • Conclusion: $\mathbf{x}_5$ should be classified into class c
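A short sketch that reproduces these numbers, with the smoothed parameters hard-coded from the previous slide:

```python
pi = {"c": 3/4, "j": 1/4}
beta = {
    "c": [3/7, 1/7, 1/7, 1/7, 1/14, 1/14],  # Chinese..Japan, class c
    "j": [2/9, 1/9, 1/9, 1/9, 2/9, 2/9],    # Chinese..Japan, class j
}
x5 = [3, 0, 0, 0, 1, 1]  # test doc: Chinese x3, Tokyo, Japan

for y in pi:
    score = pi[y]
    for x_n, b_n in zip(x5, beta[y]):
        score *= b_n ** x_n
    print(y, round(score, 5))  # c: ~0.0003, j: ~0.00014
```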

SLIDE 23

A More General Naïve Bayes Framework

  • Let D be a training set of tuples and their associated class labels, where each tuple is represented by a p-dimensional attribute vector x = (x1, x2, …, xp)
  • Suppose there are m classes y ∈ {1, 2, …, m}
  • Goal: find the y that maximizes

$$P(y \mid \mathbf{x}) = P(y, \mathbf{x}) / P(\mathbf{x}) \propto P(\mathbf{x} \mid y) \, P(y)$$

  • A simplifying assumption: attributes are conditionally independent given the class (class-conditional independence):
    • $p(\mathbf{x} \mid y) = \prod_k p(x_k \mid y)$
    • $p(x_k \mid y)$ can follow any distribution, e.g., Gaussian, Bernoulli, categorical, … (see the sketch below)
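As an illustration of this flexibility, a sketch of the factored likelihood $\prod_k p(x_k \mid y)$ with one Gaussian and one Bernoulli attribute; all names and parameter values are invented:

```python
from math import exp, pi as PI, sqrt

def gaussian_pdf(x, mean, var):
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * PI * var)

def bernoulli_pmf(x, p):
    return p if x == 1 else 1 - p

def class_likelihood(x, params):
    """p(x | y) = p(x_1 | y) * p(x_2 | y): Gaussian height, Bernoulli smoker flag."""
    height, smoker = x
    return gaussian_pdf(height, *params["height"]) * bernoulli_pmf(smoker, params["smoker"])

params_class1 = {"height": (170.0, 25.0), "smoker": 0.3}  # invented parameters
print(class_likelihood((172.0, 1), params_class1))
```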

SLIDE 24

Generative Model vs. Discriminative Model

  • Generative model
    • Models the joint probability $p(\mathbf{x}, y)$
    • E.g., naïve Bayes
  • Discriminative model
    • Models the conditional probability $p(y \mid \mathbf{x})$
    • E.g., logistic regression

SLIDE 25

Naïve Bayes for Text

  • Text Data
  • Revisit of Multinomial Distribution
  • Multinomial Naïve Bayes
  • Summary


SLIDE 26

Summary

  • Text data
    • Bag-of-words representation
  • Naïve Bayes for Text
    • Multinomial naïve Bayes
