Machine Learning: Naïve Bayes Model. Rui Xia, Text Mining Group, NJUST - PowerPoint PPT Presentation

SLIDE 1

Machine Learning

Naïve Bayes Model

Rui Xia Text Mining Group Nanjing University of Science & Technology rxia@njust.edu.cn

SLIDE 2

Naïve Bayes Models

  • A Probabilistic Model
  • A Generative Model
  • Known as the “Naïve” Assumption
  • Suitable for Discrete Distributions
  • Widely used in Text Classification, Natural Language Processing, and Pattern Recognition

Machine Learning Course, NJUST 2

SLIDE 3

Generative vs. Discriminative

  • Discriminative Model: models the posterior probability of the class label given the observation, p(y|x).
  • Generative Model: models the joint probability of the class label and the observation, p(x, y), and then uses Bayes' rule (p(y|x) = p(x, y)/p(x)) for prediction.
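The generative route can be checked on a toy joint table; the distribution below is invented purely for illustration.

```python
# Toy joint distribution p(x, y) over a binary feature x and two classes.
joint = {
    (0, "pos"): 0.10, (0, "neg"): 0.30,
    (1, "pos"): 0.40, (1, "neg"): 0.20,
}

def posterior(x, y):
    """Bayes rule: p(y|x) = p(x, y) / p(x), with p(x) by marginalizing over y."""
    p_x = sum(p for (xv, _), p in joint.items() if xv == x)
    return joint[(x, y)] / p_x

print(round(posterior(1, "pos"), 4))  # 0.4 / 0.6, rounded
```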


SLIDE 4

Naïve Bayes Assumption

  • Bag-of-words (BOW) representation
  • A Mixture Model

A mixture model: the class prior probability p(y = c_k) times the class-conditional probability p(x | c_k); there are two event models for the latter.


$$p(x, y = c_k) = p(y = c_k)\, p(x \mid c_k)$$

$$x = (w_1, w_2, \ldots, w_{|x|})$$

$$p(x \mid c_k) = p(w_1, w_2, \ldots, w_{|x|} \mid c_k) = \prod_{h=1}^{|x|} p(w_h \mid c_k)$$
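A minimal sketch of the naïve factorization under the bag-of-words representation; the vocabulary and per-class word probabilities below are made up for illustration.

```python
# Hypothetical per-class word probabilities p(w|c) for a tiny vocabulary.
p_word_given_class = {
    "sports": {"ball": 0.5, "game": 0.4, "tax": 0.1},
    "finance": {"ball": 0.1, "game": 0.2, "tax": 0.7},
}

def class_conditional(doc, c):
    """Naive assumption: p(x|c) = prod_h p(w_h|c), words treated independently."""
    prob = 1.0
    for w in doc:
        prob *= p_word_given_class[c][w]
    return prob

doc = ["ball", "game", "ball"]  # bag-of-words: word order is ignored
print(class_conditional(doc, "sports"))  # 0.5 * 0.4 * 0.5
```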

SLIDE 5

Multinomial Event Model


SLIDE 6

Model Description

  • Hypothesis
  • Joint Probability

Model Parameters


$$p(y = c_k) = \pi_k$$

$$p(x \mid c_k) = p(w_1, w_2, \ldots, w_{|x|} \mid c_k) = \prod_{h=1}^{|x|} p(w_h \mid c_k) = \prod_{j=1}^{V} p(v_j \mid c_k)^{N(v_j, x)} = \prod_{j=1}^{V} \theta_{j|k}^{N(v_j, x)}$$

$$p(x, y = c_k) = p(c_k)\, p(x \mid c_k) = \pi_k \prod_{j=1}^{V} \theta_{j|k}^{N(v_j, x)}$$

where $V$ is the vocabulary size and $N(v_j, x)$ is the number of occurrences of word $v_j$ in document $x$; the model parameters are $\pi_k$ and $\theta_{j|k}$.
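The multinomial joint probability is usually evaluated in log space; a sketch with invented parameters:

```python
import math

# Hypothetical parameters: class priors pi_k and word distributions theta_{j|k}.
pi = {"sports": 0.6, "finance": 0.4}
theta = {
    "sports": {"ball": 0.5, "game": 0.4, "tax": 0.1},
    "finance": {"ball": 0.1, "game": 0.2, "tax": 0.7},
}

def joint_log_prob(counts, c):
    """log p(x, y=c) = log pi_c + sum_j N(v_j, x) * log theta_{j|c}."""
    return math.log(pi[c]) + sum(n * math.log(theta[c][v]) for v, n in counts.items())

counts = {"ball": 2, "game": 1}  # N(v_j, x): word counts of one document
print(joint_log_prob(counts, "sports"))
```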

SLIDE 7

Likelihood Function

  • (Joint) Likelihood


$$L(\pi, \theta) = \log \prod_{i=1}^{N} p(x_i, y_i) = \log \prod_{i=1}^{N} \sum_{k=1}^{C} I(y_i = c_k)\, p(y_i = c_k)\, p(x_i \mid y_i = c_k)$$

$$= \sum_{i=1}^{N} \sum_{k=1}^{C} I(y_i = c_k) \log \big[ p(y_i = c_k)\, p(x_i \mid y_i = c_k) \big]$$

$$= \sum_{i=1}^{N} \sum_{k=1}^{C} I(y_i = c_k) \log \Big[ \pi_k \prod_{j=1}^{V} \theta_{j|k}^{N(v_j, x_i)} \Big]$$

$$= \sum_{i=1}^{N} \sum_{k=1}^{C} I(y_i = c_k) \Big[ \log \pi_k + \sum_{j=1}^{V} N(v_j, x_i) \log \theta_{j|k} \Big]$$

where $N$ is the number of training documents and $I(\cdot)$ is the indicator function.
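The final form of the log-likelihood is easy to compute directly: the indicator just selects the true class of each document. A sketch on made-up data and parameters:

```python
import math

# Toy training set: (word-count dict, label). All values are illustrative.
data = [
    ({"ball": 2, "game": 1}, "sports"),
    ({"tax": 3}, "finance"),
]
pi = {"sports": 0.5, "finance": 0.5}
theta = {
    "sports": {"ball": 0.5, "game": 0.4, "tax": 0.1},
    "finance": {"ball": 0.1, "game": 0.2, "tax": 0.7},
}

def log_likelihood(data, pi, theta):
    """L = sum_i [ log pi_{y_i} + sum_j N(v_j, x_i) * log theta_{j|y_i} ]."""
    total = 0.0
    for counts, y in data:
        total += math.log(pi[y])
        total += sum(n * math.log(theta[y][v]) for v, n in counts.items())
    return total

print(log_likelihood(data, pi, theta))
```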

SLIDE 8

Maximum Likelihood Estimation

  • MLE Formulation
  • Applying Lagrange multipliers


$$\max_{\pi, \theta} L(\pi, \theta) \quad \text{s.t.} \quad \sum_{k=1}^{C} \pi_k = 1, \qquad \sum_{j=1}^{V} \theta_{j|k} = 1, \; k = 1, \ldots, C$$

$$\Lambda = L(\pi, \theta) + \alpha \Big(1 - \sum_{k=1}^{C} \pi_k\Big) + \sum_{k=1}^{C} \lambda_k \Big(1 - \sum_{j=1}^{V} \theta_{j|k}\Big)$$

$$= \sum_{i=1}^{N} \sum_{k=1}^{C} I(y_i = c_k) \Big[ \log \pi_k + \sum_{j=1}^{V} N(v_j, x_i) \log \theta_{j|k} \Big] + \alpha \Big(1 - \sum_{k=1}^{C} \pi_k\Big) + \sum_{k=1}^{C} \lambda_k \Big(1 - \sum_{j=1}^{V} \theta_{j|k}\Big)$$

SLIDE 9

Closed-form MLE Solution

  • Gradient
  • MLE Solution


$$\frac{\partial \Lambda}{\partial \pi_k} = \sum_{i=1}^{N} I(y_i = c_k) \frac{1}{\pi_k} - \alpha = 0, \qquad \frac{\partial \Lambda}{\partial \theta_{j|k}} = \sum_{i=1}^{N} I(y_i = c_k) \frac{N(v_j, x_i)}{\theta_{j|k}} - \lambda_k = 0$$

$$\pi_k = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{\sum_{i=1}^{N} \sum_{k'=1}^{C} I(y_i = c_{k'})} = \frac{N_k}{N}$$

$$\theta_{j|k} = \frac{\sum_{i=1}^{N} I(y_i = c_k)\, N(v_j, x_i)}{\sum_{i=1}^{N} I(y_i = c_k) \sum_{j'=1}^{V} N(v_{j'}, x_i)}$$
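The closed-form solution is just counting and normalizing; a sketch on a made-up three-document corpus:

```python
from collections import Counter, defaultdict

# Toy corpus: (word-count dict, label). All values are illustrative.
data = [
    ({"ball": 2, "game": 1}, "sports"),
    ({"ball": 1, "tax": 1}, "sports"),
    ({"tax": 3, "game": 1}, "finance"),
]

# pi_k = N_k / N: fraction of documents in class k
class_counts = Counter(y for _, y in data)
pi = {c: n / len(data) for c, n in class_counts.items()}

# theta_{j|k}: word counts within class k, normalized by total words in class k
word_counts = defaultdict(Counter)
for counts, y in data:
    word_counts[y].update(counts)
theta = {
    c: {v: n / sum(wc.values()) for v, n in wc.items()}
    for c, wc in word_counts.items()
}
print(pi["sports"], theta["sports"]["ball"])  # 2/3 and 3/5
```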

SLIDE 10

Laplace Smoothing

  • To prevent zero probabilities
  • Laplace Smoothing


Without smoothing:

$$p(x, y = c_k) = \pi_k \prod_{j=1}^{V} \theta_{j|k}^{N(v_j, x)}$$

$$\theta_{j|k} = \frac{\sum_{i=1}^{N} I(y_i = c_k)\, N(v_j, x_i)}{\sum_{j'=1}^{V} \sum_{i=1}^{N} I(y_i = c_k)\, N(v_{j'}, x_i)}, \qquad \pi_k = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{\sum_{k'=1}^{C} \sum_{i=1}^{N} I(y_i = c_{k'})}$$

With Laplace smoothing:

$$\theta_{j|k} = \frac{\sum_{i=1}^{N} I(y_i = c_k)\, N(v_j, x_i) + 1}{\sum_{j'=1}^{V} \sum_{i=1}^{N} I(y_i = c_k)\, N(v_{j'}, x_i) + V}, \qquad \pi_k = \frac{\sum_{i=1}^{N} I(y_i = c_k) + 1}{\sum_{k'=1}^{C} \sum_{i=1}^{N} I(y_i = c_{k'}) + C}$$
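A sketch of the smoothed estimates on made-up data, showing that a word never seen in a class still gets nonzero probability:

```python
from collections import Counter, defaultdict

# Toy corpus: (word-count dict, label). All values are illustrative.
data = [
    ({"ball": 2, "game": 1}, "sports"),
    ({"tax": 3}, "finance"),
]
vocab = ["ball", "game", "tax"]
classes = ["sports", "finance"]
V, C, N = len(vocab), len(classes), len(data)

word_counts = defaultdict(Counter)
label_counts = Counter()
for counts, y in data:
    word_counts[y].update(counts)
    label_counts[y] += 1

# Laplace smoothing: add 1 to every count, V (resp. C) to the normalizer.
theta = {
    c: {v: (word_counts[c][v] + 1) / (sum(word_counts[c].values()) + V)
        for v in vocab}
    for c in classes
}
pi = {c: (label_counts[c] + 1) / (N + C) for c in classes}

# "tax" never occurs in class "sports", yet its probability is nonzero:
print(theta["sports"]["tax"])  # 1 / (3 + 3)
```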

SLIDE 11

Multi-variate Bernoulli Event Model


SLIDE 12

Model Description

  • Hypothesis
  • Joint Probability

Model Parameters


$$p(y = c_k) = \pi_k$$

$$p(x \mid y = c_k) = p(v_1, v_2, \ldots, v_V \mid c_k) = \prod_{j=1}^{V} \big[ I(v_j \in x)\, p(v_j \mid c_k) + I(v_j \notin x)\,(1 - p(v_j \mid c_k)) \big]$$

$$= \prod_{j=1}^{V} \big[ I(v_j \in x)\, \mu_{j|k} + I(v_j \notin x)\,(1 - \mu_{j|k}) \big]$$

$$p(x, y = c_k) = \pi_k \prod_{j=1}^{V} \big[ I(v_j \in x)\, \mu_{j|k} + I(v_j \notin x)\,(1 - \mu_{j|k}) \big]$$

SLIDE 13

Likelihood Function

  • (Joint) Likelihood


$$L(\pi, \mu) = \log \prod_{i=1}^{N} p(x_i, y_i) = \sum_{i=1}^{N} \log \sum_{k=1}^{C} I(y_i = c_k)\, p(x_i, y_i = c_k)$$

$$= \sum_{i=1}^{N} \sum_{k=1}^{C} I(y_i = c_k) \log \Big[ p(c_k) \prod_{j=1}^{V} \big( I(v_j \in x_i)\, p(v_j \mid c_k) + I(v_j \notin x_i)\,(1 - p(v_j \mid c_k)) \big) \Big]$$

$$= \sum_{i=1}^{N} \sum_{k=1}^{C} I(y_i = c_k) \Big[ \log \pi_k + \sum_{j=1}^{V} \big( I(v_j \in x_i) \log \mu_{j|k} + I(v_j \notin x_i) \log (1 - \mu_{j|k}) \big) \Big]$$

SLIDE 14

Maximum Likelihood Estimation

  • MLE Formulation
  • Applying Lagrange multipliers


$$\max_{\pi, \mu} L(\pi, \mu) \quad \text{s.t.} \quad \sum_{k=1}^{C} \pi_k = 1$$

$$\Lambda = L(\pi, \mu) + \alpha \Big(1 - \sum_{k=1}^{C} \pi_k\Big)$$

$$= \sum_{i=1}^{N} \sum_{k=1}^{C} I(y_i = c_k) \Big[ \log \pi_k + \sum_{j=1}^{V} \big( I(v_j \in x_i) \log \mu_{j|k} + I(v_j \notin x_i) \log (1 - \mu_{j|k}) \big) \Big] + \alpha \Big(1 - \sum_{k=1}^{C} \pi_k\Big)$$

SLIDE 15

Closed-form MLE Solution

  • Gradient
  • MLE Solution


$$\frac{\partial \Lambda}{\partial \pi_k} = \sum_{i=1}^{N} I(y_i = c_k) \frac{1}{\pi_k} - \alpha = 0$$

$$\frac{\partial \Lambda}{\partial \mu_{j|k}} = \sum_{i=1}^{N} I(y_i = c_k) \Big[ \frac{I(v_j \in x_i)}{\mu_{j|k}} - \frac{I(v_j \notin x_i)}{1 - \mu_{j|k}} \Big] = 0, \quad \forall k = 1, \ldots, C$$

$$\pi_k = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{\sum_{i=1}^{N} \sum_{k'=1}^{C} I(y_i = c_{k'})} = \frac{N_k}{N}$$

$$\mu_{j|k} = \frac{\sum_{i=1}^{N} I(y_i = c_k)\, I(v_j \in x_i)}{\sum_{i=1}^{N} I(y_i = c_k)}$$
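The Bernoulli MLE is document-level counting: the fraction of class-k documents that contain each word. A sketch on made-up data:

```python
from collections import Counter

# Toy corpus: (set of words present in the document, label). Illustrative only.
data = [
    ({"ball", "game"}, "sports"),
    ({"ball"}, "sports"),
    ({"tax"}, "finance"),
]
vocab = ["ball", "game", "tax"]

label_counts = Counter(y for _, y in data)
pi = {c: n / len(data) for c, n in label_counts.items()}

# mu_{j|k}: fraction of class-k documents containing word v_j
mu = {
    c: {v: sum(1 for x, y in data if y == c and v in x) / label_counts[c]
        for v in vocab}
    for c in label_counts
}
print(mu["sports"]["game"])  # "game" appears in 1 of 2 sports documents
```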

SLIDE 16

Laplace Smoothing

  • To prevent zero probabilities
  • Laplace Smoothing


Without smoothing:

$$p(x, y = c_k) = \pi_k \prod_{j=1}^{V} \big[ I(v_j \in x)\, \mu_{j|k} + I(v_j \notin x)\,(1 - \mu_{j|k}) \big]$$

$$\mu_{j|k} = \frac{\sum_{i=1}^{N} I(y_i = c_k)\, I(v_j \in x_i)}{\sum_{i=1}^{N} I(y_i = c_k)}, \qquad \pi_k = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{\sum_{k'=1}^{C} \sum_{i=1}^{N} I(y_i = c_{k'})}$$

With Laplace smoothing:

$$\mu_{j|k} = \frac{\sum_{i=1}^{N} I(y_i = c_k)\, I(v_j \in x_i) + 1}{\sum_{i=1}^{N} I(y_i = c_k) + 2}, \qquad \pi_k = \frac{\sum_{i=1}^{N} I(y_i = c_k) + 1}{\sum_{k'=1}^{C} \sum_{i=1}^{N} I(y_i = c_{k'}) + C}$$

SLIDE 17

Text Classification as an Example


SLIDE 18

Data sets

  • Training data
  • Test data
  • Class labels
  • Feature vector


SLIDE 19
Multinomial Naïve Bayes

  • Training
  • Prediction

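A minimal end-to-end sketch of multinomial naïve Bayes (training with Laplace smoothing, prediction in log space); the training corpus below is invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Illustrative training data: (token list, label).
train = [
    (["ball", "game", "ball"], "sports"),
    (["game", "win"], "sports"),
    (["tax", "bank", "tax"], "finance"),
]

def train_multinomial_nb(docs):
    classes = sorted({y for _, y in docs})
    vocab = sorted({w for x, _ in docs for w in x})
    V, C, N = len(vocab), len(classes), len(docs)
    label_counts = Counter(y for _, y in docs)
    word_counts = defaultdict(Counter)
    for x, y in docs:
        word_counts[y].update(x)
    # Smoothed estimates of pi_k and theta_{j|k}
    pi = {c: (label_counts[c] + 1) / (N + C) for c in classes}
    theta = {c: {v: (word_counts[c][v] + 1) / (sum(word_counts[c].values()) + V)
                 for v in vocab} for c in classes}
    return pi, theta

def predict(doc, pi, theta):
    # argmax_k log pi_k + sum_j N(v_j, x) log theta_{j|k}; out-of-vocab words skipped
    scores = {}
    for c in pi:
        scores[c] = math.log(pi[c]) + sum(
            math.log(theta[c][w]) for w in doc if w in theta[c])
    return max(scores, key=scores.get)

pi, theta = train_multinomial_nb(train)
print(predict(["ball", "game"], pi, theta))  # "sports"
```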

SLIDE 20
Multi-variate Bernoulli Naïve Bayes

  • Training
  • Prediction

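The Bernoulli variant can be sketched the same way (training with the +1/+2 smoothing, prediction over the whole vocabulary); the data below is invented for illustration.

```python
import math
from collections import Counter

# Illustrative training data: (set of words present, label).
train = [
    ({"ball", "game"}, "sports"),
    ({"game", "win"}, "sports"),
    ({"tax", "bank"}, "finance"),
]

def train_bernoulli_nb(docs):
    classes = sorted({y for _, y in docs})
    vocab = sorted({w for x, _ in docs for w in x})
    N, C = len(docs), len(classes)
    label_counts = Counter(y for _, y in docs)
    pi = {c: (label_counts[c] + 1) / (N + C) for c in classes}
    # Laplace smoothing for a Bernoulli variable: +1 in the numerator, +2 below
    mu = {c: {v: (sum(1 for x, y in docs if y == c and v in x) + 1)
                 / (label_counts[c] + 2) for v in vocab} for c in classes}
    return pi, mu, vocab

def predict(doc, pi, mu, vocab):
    # Every vocabulary word contributes: mu if present, (1 - mu) if absent.
    scores = {}
    for c in pi:
        s = math.log(pi[c])
        for v in vocab:
            s += math.log(mu[c][v] if v in doc else 1.0 - mu[c][v])
        scores[c] = s
    return max(scores, key=scores.get)

pi, mu, vocab = train_bernoulli_nb(train)
print(predict({"ball", "win"}, pi, mu, vocab))  # "sports"
```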

SLIDE 21

Xia-NB Software

  • Functions

– Written in C++
– Supports the multinomial and multi-variate Bernoulli event models
– Laplace smoothing
– Uniform data format like SVM-light/LibSVM
– Fast running with sparse representation

  • Download

https://github.com/NUSTM/XIA-NB


SLIDE 22

Project

  • Implement the naïve Bayes algorithm with
– Multinomial event model
– Multi-variate Bernoulli event model
  • Run the algorithm on the training & testing data given on Page 18.
  • Compare the naïve Bayes algorithm with logistic regression (using bag-of-words to represent the data).


SLIDE 23

Questions?
