Machine Learning: Naïve Bayes Model
Rui Xia, Text Mining Group
Nanjing University of Science & Technology
rxia@njust.edu.cn
Naïve Bayes Models
- A Probabilistic Model
- A Generative Model
- Known as the “Naïve” Assumption
- Suitable for Discrete Distributions
- Widely used in Text Classification, Natural Language Processing, and Pattern Recognition
Generative vs. Discriminative
- Discriminative Model
It models the posterior probability of the class label given the observation, $p(y|x)$, directly.
- Generative Model
It models the joint probability of the class label and the observation, $p(x, y)$, and then applies the Bayes rule, $p(y|x) = p(x, y)/p(x)$, for prediction.
Naïve Bayes Assumption
- Bag-of-words (BOW) representation
- A Mixture Model
$p(x, y = c_k) = p(y = c_k)\,p(x \mid c_k), \quad k = 1, \ldots, C$
where $p(y = c_k)$ is the class prior probability and $p(x \mid c_k)$ is the class-conditional probability; the class-conditional part admits two event models (multinomial and multi-variate Bernoulli, described below).
- The "naïve" assumption: conditioned on the class, the words of a document $x = (w_1, w_2, \ldots, w_{|x|})$ are independent, so
$p(x \mid c_k) = p(w_1, w_2, \ldots, w_{|x|} \mid c_k) = \prod_{h=1}^{|x|} p(w_h \mid c_k)$
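For concreteness, here is a minimal sketch of the bag-of-words representation that both event models consume (illustrative only; the function and variable names are our own):

```python
from collections import Counter

def bag_of_words(document: str) -> Counter:
    """Reduce a document to unordered term counts N(t, x): word order
    is discarded, which is exactly the information the naive
    conditional-independence assumption throws away."""
    return Counter(document.lower().split())

print(bag_of_words("a good movie a GOOD plot"))
# Counter({'a': 2, 'good': 2, 'movie': 1, 'plot': 1})
```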
Multinomial Event Model
Model Description
- Hypothesis
Class prior: $p(y = c_k) = \pi_k$
Class-conditional, with vocabulary $\{t_1, \ldots, t_V\}$ and $N(t_j, x)$ the number of occurrences of term $t_j$ in document $x$:
$p(x \mid c_k) = p(w_1, w_2, \ldots, w_{|x|} \mid c_k) = \prod_{h=1}^{|x|} p(w_h \mid c_k) = \prod_{j=1}^{V} p(t_j \mid c_k)^{N(t_j, x)} = \prod_{j=1}^{V} \theta_{j|k}^{N(t_j, x)}$
- Joint Probability
$p(x, y = c_k) = p(y = c_k)\,p(x \mid c_k) = \pi_k \prod_{j=1}^{V} \theta_{j|k}^{N(t_j, x)}$
- Model Parameters
$\pi_k = p(y = c_k)$ and $\theta_{j|k} = p(t_j \mid c_k)$, for $k = 1, \ldots, C$ and $j = 1, \ldots, V$.
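As a toy numeric illustration (all numbers hypothetical), take $V = 2$, $\pi_1 = 0.6$, $\theta_{1|1} = 0.7$, $\theta_{2|1} = 0.3$, and a document $x$ in which $t_1$ occurs twice and $t_2$ once:

$p(x, y = c_1) = \pi_1\,\theta_{1|1}^{2}\,\theta_{2|1}^{1} = 0.6 \times 0.7^2 \times 0.3 = 0.0882$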
Likelihood Function
- (Joint) Likelihood
Given training data $\{(x_i, y_i)\}_{i=1}^{N}$ and the indicator function $I(\cdot)$:
$\mathcal{L}(\pi, \theta) = \log \prod_{i=1}^{N} p(x_i, y_i) = \log \prod_{i=1}^{N} \sum_{k=1}^{C} I(y_i = c_k)\,p(y_i = c_k)\,p(x_i \mid y_i = c_k)$
$= \sum_{i=1}^{N} \sum_{k=1}^{C} I(y_i = c_k) \log\big[p(y_i = c_k)\,p(x_i \mid y_i = c_k)\big]$ (the indicator selects exactly one term of the inner sum)
$= \sum_{i=1}^{N} \sum_{k=1}^{C} I(y_i = c_k) \log\Big[\pi_k \prod_{j=1}^{V} \theta_{j|k}^{N(t_j, x_i)}\Big]$
$= \sum_{i=1}^{N} \sum_{k=1}^{C} I(y_i = c_k) \Big[\log \pi_k + \sum_{j=1}^{V} N(t_j, x_i) \log \theta_{j|k}\Big]$
Maximum Likelihood Estimation
- MLE Formulation
$\max_{\pi, \theta} \mathcal{L}(\pi, \theta)$
$\text{s.t.}\; \sum_{k=1}^{C} \pi_k = 1; \quad \sum_{j=1}^{V} \theta_{j|k} = 1, \; k = 1, \ldots, C$
- Applying Lagrange multipliers ($\lambda$ for the prior constraint, $\gamma_k$ for the class-conditional constraints)
$\Lambda = \mathcal{L}(\pi, \theta) + \lambda\Big(1 - \sum_{k=1}^{C} \pi_k\Big) + \sum_{k=1}^{C} \gamma_k\Big(1 - \sum_{j=1}^{V} \theta_{j|k}\Big)$
$= \sum_{i=1}^{N} \sum_{k=1}^{C} I(y_i = c_k)\Big[\log \pi_k + \sum_{j=1}^{V} N(t_j, x_i)\log \theta_{j|k}\Big] + \lambda\Big(1 - \sum_{k=1}^{C} \pi_k\Big) + \sum_{k=1}^{C} \gamma_k\Big(1 - \sum_{j=1}^{V} \theta_{j|k}\Big)$
Closed-form MLE Solution
- Gradient
$\dfrac{\partial \Lambda}{\partial \pi_k} = \sum_{i=1}^{N} I(y_i = c_k)\dfrac{1}{\pi_k} - \lambda = 0$
$\dfrac{\partial \Lambda}{\partial \theta_{j|k}} = \sum_{i=1}^{N} I(y_i = c_k)\dfrac{N(t_j, x_i)}{\theta_{j|k}} - \gamma_k = 0$
Enforcing the constraints fixes the multipliers: summing the first condition over $k$ gives $\lambda = N$; summing the second over $j$ gives $\gamma_k = \sum_{i=1}^{N} I(y_i = c_k) \sum_{j'=1}^{V} N(t_{j'}, x_i)$.
- MLE Solution
$\pi_k = \dfrac{\sum_{i=1}^{N} I(y_i = c_k)}{\sum_{i=1}^{N} \sum_{k'=1}^{C} I(y_i = c_{k'})} = \dfrac{N_k}{N}$, where $N_k$ is the number of training documents in class $c_k$
$\theta_{j|k} = \dfrac{\sum_{i=1}^{N} I(y_i = c_k)\,N(t_j, x_i)}{\sum_{i=1}^{N} I(y_i = c_k) \sum_{j'=1}^{V} N(t_{j'}, x_i)}$
Laplace Smoothing
- The MLE can assign zero probability: if term $t_j$ never occurs in the training documents of class $c_k$, then $\theta_{j|k} = 0$, and the joint probability $p(x, y = c_k) = \pi_k \prod_{j=1}^{V} \theta_{j|k}^{N(t_j, x)}$ vanishes for any document containing $t_j$.
- Laplace Smoothing adds one to each count in the MLE estimates:
$\theta_{j|k} = \dfrac{\sum_{i=1}^{N} I(y_i = c_k)\,N(t_j, x_i) + 1}{\sum_{j'=1}^{V} \sum_{i=1}^{N} I(y_i = c_k)\,N(t_{j'}, x_i) + V}$
$\pi_k = \dfrac{\sum_{i=1}^{N} I(y_i = c_k) + 1}{\sum_{k'=1}^{C} \sum_{i=1}^{N} I(y_i = c_{k'}) + C}$
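A toy numeric check (hypothetical counts): if class $c_1$ contains 10 term occurrences in total, term $t_1$ occurs 0 times among them, and $V = 5$, the unsmoothed estimate $\theta_{1|1} = 0/10 = 0$ zeroes out $p(x, c_1)$ for every document containing $t_1$, while the smoothed estimate is $\theta_{1|1} = (0 + 1)/(10 + 5) = 1/15 \approx 0.067$.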
Multi-variate Bernoulli Event Model
Model Description
- Hypothesis
Class prior: $p(y = c_k) = \pi_k$
Class-conditional: each vocabulary term $t_j$ either appears in the document or not, modeled as an independent Bernoulli variable:
$p(x \mid y = c_k) = \prod_{j=1}^{V} \big[I(t_j \in x)\,p(t_j \mid c_k) + I(t_j \notin x)\,(1 - p(t_j \mid c_k))\big] = \prod_{j=1}^{V} \big[I(t_j \in x)\,\mu_{j|k} + I(t_j \notin x)\,(1 - \mu_{j|k})\big]$
- Joint Probability
$p(x, y = c_k) = \pi_k \prod_{j=1}^{V} \big[I(t_j \in x)\,\mu_{j|k} + I(t_j \notin x)\,(1 - \mu_{j|k})\big]$
- Model Parameters
$\pi_k = p(y = c_k)$ and $\mu_{j|k} = p(t_j \mid c_k)$, the probability that term $t_j$ appears in a document of class $c_k$.
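Unlike the multinomial model, absent terms also contribute, through the factors $(1 - \mu_{j|k})$. A toy illustration (hypothetical numbers): with $V = 2$, $\pi_1 = 0.6$, $\mu_{1|1} = 0.7$, $\mu_{2|1} = 0.2$, and a document $x$ containing $t_1$ but not $t_2$:

$p(x, y = c_1) = \pi_1\,\mu_{1|1}\,(1 - \mu_{2|1}) = 0.6 \times 0.7 \times 0.8 = 0.336$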
Likelihood Function
- (Joint) Likelihood
$\mathcal{L}(\pi, \mu) = \log \prod_{i=1}^{N} p(x_i, y_i) = \sum_{i=1}^{N} \log \sum_{k=1}^{C} I(y_i = c_k)\,p(x_i, y_i = c_k)$
$= \sum_{i=1}^{N} \sum_{k=1}^{C} I(y_i = c_k) \log\Big[p(y = c_k) \prod_{j=1}^{V}\big(I(t_j \in x_i)\,p(t_j \mid c_k) + I(t_j \notin x_i)\,(1 - p(t_j \mid c_k))\big)\Big]$
$= \sum_{i=1}^{N} \sum_{k=1}^{C} I(y_i = c_k)\Big[\log \pi_k + \sum_{j=1}^{V}\big(I(t_j \in x_i)\log \mu_{j|k} + I(t_j \notin x_i)\log(1 - \mu_{j|k})\big)\Big]$
Maximum Likelihood Estimation
- MLE Formulation
$\max_{\pi, \mu} \mathcal{L}(\pi, \mu)$
$\text{s.t.}\; \sum_{k=1}^{C} \pi_k = 1$
(each $\mu_{j|k}$ only needs to lie in $[0, 1]$, which the stationary point satisfies automatically, so no multiplier is needed for it)
- Applying Lagrange multipliers
$\Lambda = \mathcal{L}(\pi, \mu) + \lambda\Big(1 - \sum_{k=1}^{C} \pi_k\Big)$
$= \sum_{i=1}^{N} \sum_{k=1}^{C} I(y_i = c_k)\Big[\log \pi_k + \sum_{j=1}^{V}\big(I(t_j \in x_i)\log \mu_{j|k} + I(t_j \notin x_i)\log(1 - \mu_{j|k})\big)\Big] + \lambda\Big(1 - \sum_{k=1}^{C} \pi_k\Big)$
Closed-form MLE Solution
- Gradient
$\dfrac{\partial \Lambda}{\partial \pi_k} = \sum_{i=1}^{N} I(y_i = c_k)\dfrac{1}{\pi_k} - \lambda = 0$
$\dfrac{\partial \Lambda}{\partial \mu_{j|k}} = \sum_{i=1}^{N} I(y_i = c_k)\Big[\dfrac{I(t_j \in x_i)}{\mu_{j|k}} - \dfrac{I(t_j \notin x_i)}{1 - \mu_{j|k}}\Big] = 0, \quad \forall k = 1, \ldots, C$
- MLE Solution
$\pi_k = \dfrac{\sum_{i=1}^{N} I(y_i = c_k)}{\sum_{i=1}^{N} \sum_{k'=1}^{C} I(y_i = c_{k'})} = \dfrac{N_k}{N}$
$\mu_{j|k} = \dfrac{\sum_{i=1}^{N} I(y_i = c_k)\,I(t_j \in x_i)}{\sum_{i=1}^{N} I(y_i = c_k)}$
i.e., the fraction of class-$c_k$ training documents that contain $t_j$.
Laplace Smoothing
- The MLE can again assign zero probability: if $t_j$ appears in none (or all) of the class-$c_k$ training documents, one factor of $p(x, y = c_k) = \pi_k \prod_{j=1}^{V}\big[I(t_j \in x)\,\mu_{j|k} + I(t_j \notin x)\,(1 - \mu_{j|k})\big]$ becomes zero.
- Laplace Smoothing (the denominator adds 2 because each term has two outcomes, present or absent):
$\mu_{j|k} = \dfrac{\sum_{i=1}^{N} I(y_i = c_k)\,I(t_j \in x_i) + 1}{\sum_{i=1}^{N} I(y_i = c_k) + 2}$
$\pi_k = \dfrac{\sum_{i=1}^{N} I(y_i = c_k) + 1}{\sum_{k'=1}^{C} \sum_{i=1}^{N} I(y_i = c_{k'}) + C}$
Text Classification as an Example
Data Sets
- Training data
- Test data
- Class labels
- Feature vector
Multinomial Naïve Bayes
- Training: estimate $\pi_k$ and $\theta_{j|k}$ from the training term counts with Laplace smoothing, as derived above.
- Prediction: $\hat{y} = \arg\max_k \big[\log \pi_k + \sum_{j=1}^{V} N(t_j, x)\log \theta_{j|k}\big]$ (see the sketch below).
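A minimal NumPy sketch of this training/prediction pair (our own illustrative code, not the Xia-NB implementation; it assumes documents already arrive as count vectors):

```python
import numpy as np

def train_multinomial_nb(X, y, n_classes):
    """X: (N, V) term-count matrix, X[i, j] = N(t_j, x_i); y: (N,) class ids.
    Returns Laplace-smoothed estimates (pi, theta)."""
    N, V = X.shape
    pi = np.zeros(n_classes)
    theta = np.zeros((n_classes, V))
    for k in range(n_classes):
        Xk = X[y == k]                                # documents of class c_k
        pi[k] = (Xk.shape[0] + 1) / (N + n_classes)   # (N_k + 1) / (N + C)
        counts = Xk.sum(axis=0)                       # sum_i I(y_i = c_k) N(t_j, x_i)
        theta[k] = (counts + 1) / (counts.sum() + V)  # add-one / add-V smoothing
    return pi, theta

def predict_multinomial_nb(X, pi, theta):
    """argmax_k [ log pi_k + sum_j N(t_j, x) log theta_{j|k} ], in log space."""
    log_joint = np.log(pi) + X @ np.log(theta).T      # shape (N, C)
    return log_joint.argmax(axis=1)

# Toy usage with hypothetical counts: 2 classes, 3-term vocabulary.
X = np.array([[2, 1, 0], [3, 0, 1], [0, 2, 2], [1, 0, 3]])
y = np.array([0, 0, 1, 1])
pi, theta = train_multinomial_nb(X, y, n_classes=2)
print(predict_multinomial_nb(X, pi, theta))           # [0 0 1 1]
```

Working in log space avoids underflow from multiplying many small probabilities.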
Multi-variate Bernoulli Naïve Bayes
- Training: estimate $\pi_k$ and $\mu_{j|k}$ from binary term-occurrence indicators with Laplace smoothing, as derived above.
- Prediction: $\hat{y} = \arg\max_k \big[\log \pi_k + \sum_{j=1}^{V}\big(I(t_j \in x)\log \mu_{j|k} + I(t_j \notin x)\log(1 - \mu_{j|k})\big)\big]$ (see the sketch below).
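A matching sketch for the Bernoulli model (again, names and data are our own); features are binarized, and absent terms contribute $\log(1 - \mu_{j|k})$ factors:

```python
import numpy as np

def train_bernoulli_nb(X, y, n_classes):
    """X: (N, V) binary matrix, X[i, j] = I(t_j in x_i); y: (N,) class ids.
    Returns Laplace-smoothed estimates (pi, mu)."""
    N, V = X.shape
    pi = np.zeros(n_classes)
    mu = np.zeros((n_classes, V))
    for k in range(n_classes):
        Xk = X[y == k]                             # documents of class c_k
        N_k = Xk.shape[0]
        pi[k] = (N_k + 1) / (N + n_classes)        # (N_k + 1) / (N + C)
        mu[k] = (Xk.sum(axis=0) + 1) / (N_k + 2)   # add-one / add-two smoothing
    return pi, mu

def predict_bernoulli_nb(X, pi, mu):
    """argmax_k [ log pi_k + sum_j x_j log mu_{j|k} + (1-x_j) log(1-mu_{j|k}) ]."""
    log_joint = np.log(pi) + X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T
    return log_joint.argmax(axis=1)

# Toy usage with hypothetical binary occurrence vectors.
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([0, 0, 1, 1])
pi, mu = train_bernoulli_nb(X, y, n_classes=2)
print(predict_bernoulli_nb(X, pi, mu))             # [0 0 1 1]
```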
Xia-NB Software
- Functions
– Written in C++
– Supports the multinomial and multi-variate Bernoulli event models
– Laplace smoothing
– Uniform data format like SVM-light/LibSVM
– Fast running with sparse representation
- Download
https://github.com/NUSTM/XIA-NB
Project
- Implement the naïve Bayes algorithm with
– Multinomial event model
– Multi-variate Bernoulli event model
- Run the algorithm on the training & testing data given in the Data Sets slide.
- Compare the naïve Bayes algorithm with logistic regression (using bag-of-words to represent the data); a starting-point sketch follows.
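A minimal scikit-learn sketch for the comparison baseline (assuming scikit-learn is available; the toy documents are hypothetical stand-ins for the course data, which is not reproduced in this transcript):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical stand-in data; replace with the course training/test sets.
train_docs = ["good great movie", "fine good plot", "bad awful acting", "boring bad plot"]
train_y = [1, 1, 0, 0]
test_docs = ["good plot", "awful boring movie"]
test_y = [1, 0]

vec = CountVectorizer()                   # bag-of-words term counts
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

for clf in (MultinomialNB(), BernoulliNB(), LogisticRegression(max_iter=1000)):
    pred = clf.fit(X_train, train_y).predict(X_test)
    print(type(clf).__name__, accuracy_score(test_y, pred))
```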
Questions?