SLIDE 1

CS145: INTRODUCTION TO DATA MINING

Text Data: Naïve Bayes

Instructor: Yizhou Sun
yzsun@cs.ucla.edu

December 7, 2017

SLIDE 2

Methods to be Learnt

Task                     | Vector Data                                               | Set Data           | Sequence Data   | Text Data
Classification           | Logistic Regression; Decision Tree; KNN; SVM; NN          |                    |                 | Naïve Bayes for Text
Clustering               | K-means; Hierarchical Clustering; DBSCAN; Mixture Models  |                    |                 | PLSA
Prediction               | Linear Regression; GLM*                                   |                    |                 |
Frequent Pattern Mining  |                                                           | Apriori; FP-growth | GSP; PrefixSpan |
Similarity Search        |                                                           |                    | DTW             |

SLIDE 3

Naïve Bayes for Text

  • Text Data
  • Revisit of Multinomial Distribution
  • Multinomial Naïve Bayes
  • Summary


SLIDE 4

Text Data

  • Word/term
  • Document
    • A sequence of words
  • Corpus
    • A collection of documents


SLIDE 5

Text Classification Applications

  • Spam detection
  • Sentiment analysis

Example (spam email):

From: airak@medicana.com.tr
Subject: Loan Offer
Do you need a personal or business loan urgent that can be process within 2 to 3 working days? Have you been frustrated so many times by your banks and other loan firm and you don't know what to do? Here comes the Good news Deutsche Bank Financial Business and Home Loan is here to offer you any kind of loan you need at an affordable interest rate of 3% If you are interested let us know.

SLIDE 6

Represent a Document

  • Most common way: Bag-of-Words
    • Ignore the order of words
    • Keep the counts

[Figure: word-count vectors for example documents c1–c5 and m1–m4]

Vector space model: for document $d$, $\mathbf{x}_d = (x_{d1}, x_{d2}, \dots, x_{dN})$, where $x_{dn}$ is the count of the $n$th word in the vocabulary.
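To make the representation concrete, here is a minimal sketch of building such count vectors in Python; the whitespace tokenizer and the tiny vocabulary are illustrative assumptions, not part of the slides:

```python
from collections import Counter

def bag_of_words(doc, vocabulary):
    """Map a document to its count vector x_d over a fixed, ordered vocabulary."""
    counts = Counter(doc.lower().split())          # naive whitespace tokenizer
    return [counts[word] for word in vocabulary]   # x_dn = count of the nth word

# Hypothetical example: a fixed, ordered vocabulary and one document
vocabulary = ["chinese", "beijing", "shanghai", "macao", "tokyo", "japan"]
print(bag_of_words("Chinese Beijing Chinese", vocabulary))  # [2, 1, 0, 0, 0, 0]
```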

SLIDE 7

More Details

  • Represent the doc as a vector where each entry corresponds to a different word, and the number at that entry corresponds to how many times that word was present in the document (or some function of it)

  • The number of distinct words is huge
    • Select and use a smaller set of words that are of interest
    • E.g., uninteresting words: 'and', 'the', 'at', 'is', etc. These are called stop-words
    • Stemming: remove endings. E.g., 'learn', 'learning', 'learnable', 'learned' could all be substituted by the single stem 'learn'
    • Other simplifications can also be invented and used
    • The set of different remaining words is called the dictionary or vocabulary. Fix an ordering of the terms in the dictionary so that you can refer to them by their index.
  • Can be extended to bi-grams, tri-grams, and so on (see the preprocessing sketch below)
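A minimal preprocessing sketch along these lines; the stop-word list and the suffix-stripping rule are invented stand-ins (a real system would use a curated stop list and a proper stemmer such as Porter's):

```python
STOP_WORDS = {"and", "the", "at", "is", "a", "of", "to"}  # hypothetical list

def crude_stem(word):
    """Strip a few common endings; far simpler than a real stemmer."""
    for suffix in ("ing", "able", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(doc):
    """Lowercase, tokenize, drop stop-words, and stem the rest."""
    tokens = doc.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The learner is learning learnable things"))
# ['learner', 'learn', 'learn', 'thing']
```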


SLIDE 8

Limitations of Vector Space Model

  • Dimensionality
    • High dimensionality
  • Sparseness
    • Most of the entries are zero
  • Shallow representation
    • The vector representation does not capture semantic relations between words

SLIDE 9

Naïve Bayes for Text

  • Text Data
  • Revisit of Multinomial Distribution
  • Multinomial Naïve Bayes
  • Summary


SLIDE 10

Bernoulli and Categorical Distribution

  • Bernoulli distribution
    • Discrete distribution that takes two values {0, 1}
    • $P(X = 1) = p$ and $P(X = 0) = 1 - p$
    • E.g., tossing a coin with head and tail
  • Categorical distribution
    • Discrete distribution that takes more than two values, i.e., $x \in \{1, \dots, K\}$
    • Also called generalized Bernoulli distribution, or multinoulli distribution
    • $P(X = k) = p_k$ and $\sum_k p_k = 1$
    • E.g., rolling a die, where each of the faces 1–6 comes up with probability 1/6
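A quick sketch of sampling from both distributions in pure Python (the probability values are arbitrary examples):

```python
import random

def sample_bernoulli(p):
    """Return 1 with probability p, else 0."""
    return 1 if random.random() < p else 0

def sample_categorical(probs):
    """Return a value k in {1, ..., K} with probability probs[k-1]."""
    u, cumulative = random.random(), 0.0
    for k, p_k in enumerate(probs, start=1):
        cumulative += p_k
        if u < cumulative:
            return k
    return len(probs)  # guard against floating-point round-off

print(sample_bernoulli(0.5))          # coin toss
print(sample_categorical([1/6] * 6))  # fair die roll
```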

SLIDE 11

Binomial and Multinomial Distribution

  • Binomial distribution
    • The number of successes (i.e., the total number of 1's) in n repeated independent Bernoulli trials, each with success probability $p$
    • $x$: number of successes
    • $P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$
  • Multinomial distribution (multivariate random variable)
    • Repeat n independent trials of a categorical distribution
    • Let $x_k$ be the number of times value $k$ has been observed; note $\sum_k x_k = n$
    • $P(X_1 = x_1, X_2 = x_2, \dots, X_K = x_K) = \frac{n!}{x_1! \, x_2! \cdots x_K!} \prod_k p_k^{x_k}$
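To make the pmf concrete, a small sketch that evaluates it directly (the counts and probability vector are made-up inputs):

```python
from math import factorial

def multinomial_pmf(x, p):
    """P(X_1 = x_1, ..., X_K = x_K) = n!/(x_1!...x_K!) * prod_k p_k^x_k."""
    n = sum(x)
    coefficient = factorial(n)
    for x_k in x:
        coefficient //= factorial(x_k)  # exact at every step
    prob = 1.0
    for x_k, p_k in zip(x, p):
        prob *= p_k ** x_k
    return coefficient * prob

# E.g., 6 rolls of a fair die observed as counts (1, 1, 1, 1, 1, 1)
print(multinomial_pmf([1, 1, 1, 1, 1, 1], [1/6] * 6))  # 6!/6^6 ≈ 0.0154
```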

SLIDE 12

Naïve Bayes for Text

  • Text Data
  • Revisit of Multinomial Distribution
  • Multinomial Naïve Bayes
  • Summary


SLIDE 13

Bayes’ Theorem: Basics

  • Bayes' Theorem:
    • Let X be a data sample ("evidence")
    • Let h be a hypothesis that X belongs to class C
  • P(h) (prior probability): the probability of hypothesis h
    • E.g., the probability of the "spam" class
  • P(X|h) (likelihood): the probability of observing the sample X, given that the hypothesis holds
    • E.g., the probability of an email given that it's spam
  • P(X) (marginal probability): the probability that the sample data is observed
    • $P(X) = \sum_h P(X|h) \, P(h)$
  • P(h|X) (posterior probability): the probability that the hypothesis holds given the observed data sample X

$$P(h|X) = \frac{P(X|h) \, P(h)}{P(X)}$$
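A quick numeric illustration with invented numbers: suppose 20% of email is spam, 90% of spam contains the word 'loan', and 10% of non-spam does. For an email X containing 'loan',

$$P(\text{spam} \mid X) = \frac{0.9 \times 0.2}{0.9 \times 0.2 + 0.1 \times 0.8} = \frac{0.18}{0.26} \approx 0.69$$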


SLIDE 14

Classification: Choosing Hypotheses

  • Maximum likelihood (maximize the likelihood):
    $h_{ML} = \arg\max_{h \in H} P(X|h)$
  • Maximum a posteriori (maximize the posterior):
    $h_{MAP} = \arg\max_{h \in H} P(h|X) = \arg\max_{h \in H} P(X|h) \, P(h)$
  • Useful observation: the maximization does not depend on the denominator P(X)

SLIDE 15

Classification by Maximum A Posteriori

  • Let D be a training set of tuples and their associated class labels, where each tuple is represented by a p-dimensional attribute vector x = (x1, x2, …, xp)
  • Suppose there are m classes y ∈ {1, 2, …, m}
  • Classification is to derive the maximum a posteriori class, i.e., the one with maximal P(y = j | x)
  • This can be derived from Bayes' theorem:

$$p(y = j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y = j) \, p(y = j)}{p(\mathbf{x})}$$

  • Since p(x) is constant for all classes, only $p(\mathbf{x} \mid y) \, p(y)$ needs to be maximized

SLIDE 16

Now Come to Text Setting

  • A document is represented as a bag of words
    • $\mathbf{x}_d = (x_{d1}, x_{d2}, \dots, x_{dN})$, where $x_{dn}$ is the count of the $n$th word in the vocabulary
  • Model $p(\mathbf{x}_d \mid y)$ for class $y$
    • Follows a multinomial distribution with parameter vector $\boldsymbol{\beta}_y = (\beta_{y1}, \beta_{y2}, \dots, \beta_{yN})$, i.e.,
    • $p(\mathbf{x}_d \mid y) = \frac{(\sum_n x_{dn})!}{x_{d1}! \, x_{d2}! \cdots x_{dN}!} \prod_n \beta_{yn}^{x_{dn}}$
  • Model $p(y = j)$
    • Follows a categorical distribution with parameter vector $\boldsymbol{\pi} = (\pi_1, \pi_2, \dots, \pi_m)$, i.e.,
    • $p(y = j) = \pi_j$

SLIDE 17

Classification Process Assuming Parameters are Given

  • Find the $y$ that maximizes $p(y \mid \mathbf{x}_d)$, which is equivalent to maximizing

$$y^* = \arg\max_y p(\mathbf{x}_d, y) = \arg\max_y p(\mathbf{x}_d \mid y) \, p(y) = \arg\max_y \frac{(\sum_n x_{dn})!}{x_{d1}! \, x_{d2}! \cdots x_{dN}!} \prod_n \beta_{yn}^{x_{dn}} \times \pi_y$$

$$= \arg\max_y \prod_n \beta_{yn}^{x_{dn}} \times \pi_y = \arg\max_y \sum_n x_{dn} \log \beta_{yn} + \log \pi_y$$

  • The multinomial coefficient $\frac{(\sum_n x_{dn})!}{x_{d1}! \cdots x_{dN}!}$ is constant for every class, denoted as $c_d$, so it can be dropped
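A minimal sketch of this decision rule in log space; the parameter containers (`beta` mapping each class to its word-probability vector, `pi` mapping each class to its prior) are assumed shapes for illustration:

```python
from math import log

def classify(x, beta, pi):
    """Return argmax_y [ sum_n x_n * log(beta[y][n]) + log(pi[y]) ].

    Assumes smoothed (strictly positive) word probabilities,
    since log(0) is undefined."""
    best_y, best_score = None, float("-inf")
    for y in pi:
        score = log(pi[y]) + sum(x_n * log(b_n) for x_n, b_n in zip(x, beta[y]))
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```

For instance, with the smoothed parameters from the example slides below, `classify([3, 0, 0, 0, 1, 1], beta, pi)` returns class c.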

SLIDE 18

Parameter Estimation via MLE

  • Given a corpus and a label for each document
    • $D = \{(\mathbf{x}_d, y_d)\}$
  • Find the MLE estimators for $\Theta = (\boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \dots, \boldsymbol{\beta}_m, \boldsymbol{\pi})$
  • The log-likelihood function for the training dataset:

$$\log L = \log \prod_d p(\mathbf{x}_d, y_d \mid \Theta) = \sum_d \log p(\mathbf{x}_d, y_d \mid \Theta) = \sum_d \log p(\mathbf{x}_d \mid y_d) \, p(y_d) = \sum_d \left( \sum_n x_{dn} \log \beta_{y_d n} + \log \pi_{y_d} + \log c_d \right)$$

  • $\log c_d$ does not involve the parameters and can be dropped for optimization purposes
  • The optimization problem:

$$\max_\Theta \log L \quad \text{s.t.} \quad \pi_j \ge 0 \text{ and } \sum_j \pi_j = 1; \quad \beta_{jn} \ge 0 \text{ and } \sum_n \beta_{jn} = 1 \text{ for all } j$$

SLIDE 19

Solve the Optimization Problem

  • Use the Lagrange multiplier method
  • Solution:

$$\hat{\beta}_{jn} = \frac{\sum_{d: y_d = j} x_{dn}}{\sum_{d: y_d = j} \sum_{n'} x_{dn'}}$$

  • $\sum_{d: y_d = j} x_{dn}$: total count of word $n$ in class $j$
  • $\sum_{d: y_d = j} \sum_{n'} x_{dn'}$: total count of words in class $j$

$$\hat{\pi}_j = \frac{\sum_d 1(y_d = j)}{|D|}$$

  • $1(y_d = j)$ is the indicator function, which equals 1 if $y_d = j$ holds
  • $|D|$: total number of documents
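A sketch of these closed-form estimators; the `(count_vector, label)` data format is an assumption for illustration, and no smoothing is applied yet:

```python
def mle_estimates(data, num_words):
    """data: list of (x_d, y_d) pairs, x_d a count vector of length num_words.
    Returns (beta, pi) per the closed-form MLE solution (no smoothing)."""
    word_counts, doc_counts = {}, {}
    for x_d, y_d in data:
        acc = word_counts.setdefault(y_d, [0] * num_words)
        for n, x_dn in enumerate(x_d):
            acc[n] += x_dn                        # total count of word n in class y_d
        doc_counts[y_d] = doc_counts.get(y_d, 0) + 1
    beta = {
        j: [c / sum(counts) for c in counts]      # normalize by total words in class j
        for j, counts in word_counts.items()
    }
    pi = {j: m / len(data) for j, m in doc_counts.items()}  # fraction of docs in class j
    return beta, pi
```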

SLIDE 20

Smoothing

  • What if some word $n$ does not appear in some class $j$ in the training dataset?

$$\hat{\beta}_{jn} = \frac{\sum_{d: y_d = j} x_{dn}}{\sum_{d: y_d = j} \sum_{n'} x_{dn'}} = 0$$

  • $\Rightarrow p(\mathbf{x}_d \mid y = j) \propto \prod_n \beta_{jn}^{x_{dn}} = 0$ for any document containing word $n$
  • But other words may give a strong indication that the document belongs to class $j$
  • Solution: add-1 smoothing, or Laplacian smoothing

$$\hat{\beta}_{jn} = \frac{\sum_{d: y_d = j} x_{dn} + 1}{\sum_{d: y_d = j} \sum_{n'} x_{dn'} + N}$$

  • $N$: total number of words in the vocabulary
  • Check: does $\sum_n \hat{\beta}_{jn} = 1$ still hold?
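The same estimator with add-1 smoothing, as a sketch (reusing the hypothetical data format from the MLE sketch above), including the suggested sanity check:

```python
def smoothed_beta(data, num_words):
    """Add-1 (Laplacian) smoothing: (count + 1) / (class total + N)."""
    word_counts = {}
    for x_d, y_d in data:
        acc = word_counts.setdefault(y_d, [0] * num_words)
        for n, x_dn in enumerate(x_d):
            acc[n] += x_dn
    beta = {
        j: [(c + 1) / (sum(counts) + num_words) for c in counts]
        for j, counts in word_counts.items()
    }
    # Sanity check: each smoothed distribution still sums to 1
    for j in beta:
        assert abs(sum(beta[j]) - 1.0) < 1e-9
    return beta
```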

SLIDE 21

Example

  • Data: four training documents, three in class c and one in class j; class c contains 8 words in total (Chinese: 5, Beijing: 1, Shanghai: 1, Macao: 1), and class j contains 3 words (Chinese: 1, Tokyo: 1, Japan: 1)
  • Vocabulary:

    Index | 1       | 2       | 3        | 4     | 5     | 6
    Word  | Chinese | Beijing | Shanghai | Macao | Tokyo | Japan

  • Learned parameters (with smoothing):

$$\hat{\beta}_{c1} = \frac{5 + 1}{8 + 6} = \frac{3}{7}, \quad \hat{\beta}_{c2} = \hat{\beta}_{c3} = \hat{\beta}_{c4} = \frac{1 + 1}{8 + 6} = \frac{1}{7}, \quad \hat{\beta}_{c5} = \hat{\beta}_{c6} = \frac{0 + 1}{8 + 6} = \frac{1}{14}$$

$$\hat{\beta}_{j1} = \frac{1 + 1}{3 + 6} = \frac{2}{9}, \quad \hat{\beta}_{j2} = \hat{\beta}_{j3} = \hat{\beta}_{j4} = \frac{0 + 1}{3 + 6} = \frac{1}{9}, \quad \hat{\beta}_{j5} = \hat{\beta}_{j6} = \frac{1 + 1}{3 + 6} = \frac{2}{9}$$

$$\hat{\pi}_c = \frac{3}{4}, \quad \hat{\pi}_j = \frac{1}{4}$$

SLIDE 22

Example (Continued)

  • Classification stage
  • For the test document d = 5 (word counts: Chinese ×3, Tokyo ×1, Japan ×1), compute

$$p(y = c \mid \mathbf{x}_5) \propto p(y = c) \times \prod_n \beta_{cn}^{x_{5n}} = \frac{3}{4} \times \left(\frac{3}{7}\right)^3 \times \frac{1}{14} \times \frac{1}{14} \approx 0.0003$$

$$p(y = j \mid \mathbf{x}_5) \propto p(y = j) \times \prod_n \beta_{jn}^{x_{5n}} = \frac{1}{4} \times \left(\frac{2}{9}\right)^3 \times \frac{2}{9} \times \frac{2}{9} \approx 0.0001$$

  • Conclusion: $\mathbf{x}_5$ should be classified into class c
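A short sketch that reproduces these numbers, with the smoothed parameters hard-coded from the previous slide:

```python
pi = {"c": 3/4, "j": 1/4}
beta = {
    "c": [3/7, 1/7, 1/7, 1/7, 1/14, 1/14],  # Chinese..Japan, class c
    "j": [2/9, 1/9, 1/9, 1/9, 2/9, 2/9],    # Chinese..Japan, class j
}
x5 = [3, 0, 0, 0, 1, 1]  # test doc: Chinese x3, Tokyo, Japan

for y in pi:
    score = pi[y]
    for x_n, b_n in zip(x5, beta[y]):
        score *= b_n ** x_n
    print(y, round(score, 5))  # c: ~0.0003, j: ~0.00014
```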

SLIDE 23

A More General Naïve Bayes Framework

  • Let D be a training set of tuples and their associated class labels, where each tuple is represented by a p-dimensional attribute vector x = (x1, x2, …, xp)
  • Suppose there are m classes y ∈ {1, 2, …, m}
  • Goal: find the y that maximizes

$$P(y \mid \mathbf{x}) = P(y, \mathbf{x}) / P(\mathbf{x}) \propto P(\mathbf{x} \mid y) \, P(y)$$

  • A simplifying assumption: attributes are conditionally independent given the class (class-conditional independence):
    • $p(\mathbf{x} \mid y) = \prod_k p(x_k \mid y)$
    • $p(x_k \mid y)$ can follow any distribution, e.g., Gaussian, Bernoulli, categorical, … (see the sketch below)
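As an illustration of this flexibility, a sketch of the factored likelihood $\prod_k p(x_k \mid y)$ with one Gaussian and one Bernoulli attribute; all names and parameter values are invented:

```python
from math import exp, pi as PI, sqrt

def gaussian_pdf(x, mean, var):
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * PI * var)

def bernoulli_pmf(x, p):
    return p if x == 1 else 1 - p

def class_likelihood(x, params):
    """p(x | y) = p(x_1 | y) * p(x_2 | y): Gaussian height, Bernoulli smoker flag."""
    height, smoker = x
    return gaussian_pdf(height, *params["height"]) * bernoulli_pmf(smoker, params["smoker"])

params_class1 = {"height": (170.0, 25.0), "smoker": 0.3}  # invented parameters
print(class_likelihood((172.0, 1), params_class1))
```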

SLIDE 24

Generative Model vs. Discriminative Model

  • Generative model
    • Models the joint probability $p(\mathbf{x}, y)$
    • E.g., naïve Bayes
  • Discriminative model
    • Models the conditional probability $p(y \mid \mathbf{x})$
    • E.g., logistic regression

SLIDE 25

Naïve Bayes for Text

  • Text Data
  • Revisit of Multinomial Distribution
  • Multinomial Naïve Bayes
  • Summary


SLIDE 26

Summary

  • Text data
    • Bag-of-words representation
  • Naïve Bayes for Text
    • Multinomial naïve Bayes
