CS145: INTRODUCTION TO DATA MINING
Instructor: Yizhou Sun
yzsun@cs.ucla.edu December 7, 2017
Text Data: Naïve Bayes
Methods to be Learnt
Data types: Vector Data, Set Data, Sequence Data, Text Data
Classification
  Vector Data: Logistic Regression; Decision Tree; KNN; SVM; NN
  Text Data: Naïve Bayes for Text
Clustering
  Vector Data: K-means; hierarchical clustering; DBSCAN; Mixture Models
  Text Data: PLSA
Prediction
  Vector Data: Linear Regression; GLM*
Frequent Pattern Mining
  Set Data: Apriori; FP growth
  Sequence Data: GSP; PrefixSpan
Similarity Search
  Sequence Data: DTW
From: airak@medicana.com.tr Subject: Loan Offer Do you need a personal or business loan urgent that can be process within 2 to 3 working days? Have you been frustrated so many times by your banks and other loan firm and you don't know what to do? Here comes the Good news Deutsche Bank Financial Business and Home Loan is here to offer you any kind of loan you need at an affordable interest rate of 3% If you are interested let us know.
Vector space model
For document $d$, $\mathbf{x}_d = (x_{d1}, x_{d2}, \dots, x_{dN})$, where $x_{dn}$ is the number of times the $n$th word in the vocabulary occurs in $d$.
words
Stemming: inflected forms of a word (e.g., learning, learned, learns) could be substituted by the single stem ‘learn’.
Dictionary: an ordering of the terms in the dictionary so that you can operate on them by their index.
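The vector space representation can be sketched in a few lines of Python. This is a minimal illustration, not the course's code: the function names `build_dictionary` and `to_count_vector`, the toy corpus, whitespace tokenization, and alphabetical ordering of the dictionary are all assumptions made here, and no stemming or stop-word removal is applied.

```python
from collections import Counter


def build_dictionary(documents):
    """Return the vocabulary in a fixed order plus a word -> index map."""
    # Alphabetical order is an arbitrary choice; any fixed ordering works.
    vocab = sorted({word for doc in documents for word in doc.lower().split()})
    index = {word: n for n, word in enumerate(vocab)}
    return vocab, index


def to_count_vector(document, index):
    """Map one document to its count vector x_d = (x_d1, ..., x_dN)."""
    counts = Counter(document.lower().split())
    # Words that are not in the dictionary are ignored.
    return [counts.get(word, 0) for word in index]


docs = ["data mining mines data", "text data"]  # hypothetical toy corpus
vocab, index = build_dictionary(docs)
vectors = [to_count_vector(doc, index) for doc in docs]
print(vocab)
print(vectors)
```

Each position of a vector corresponds to one dictionary index, so word order inside the document is discarded and only counts remain.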
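Binomial distribution:
$$P(X = x) = \binom{n}{x} p^{x} (1 - p)^{n - x}$$
Multinomial distribution (note $\sum_k x_k = n$):
$$P(X_1 = x_1, \dots, X_K = x_K) = \frac{n!}{x_1!\, x_2! \cdots x_K!} \prod_k p_k^{x_k}$$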
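As a quick check of the multinomial formula, here is a small sketch using only the Python standard library. The function name `multinomial_pmf` and the example numbers are illustrative assumptions.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """P(X_1 = x_1, ..., X_K = x_K) for counts x_k and probabilities p_k."""
    n = sum(counts)
    # Multinomial coefficient n! / (x_1! x_2! ... x_K!), kept as an exact integer.
    coefficient = factorial(n)
    for x in counts:
        coefficient //= factorial(x)
    return coefficient * prod(p ** x for p, x in zip(probs, counts))


# With K = 2 the multinomial reduces to the binomial:
# 2 successes in 3 fair trials has probability C(3, 2) * 0.5**3.
print(multinomial_pmf([2, 1], [0.5, 0.5]))
```

For long documents the factorials and products become extreme, which is why text classifiers usually work with logarithms of these quantities.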
Bayes' theorem:
$$P(h \mid X) = \frac{P(X \mid h)\, P(h)}{P(X)}$$
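Maximum likelihood (ML) hypothesis:
$$h_{ML} = \arg\max_{h \in H} P(X \mid h)$$
Maximum a posteriori (MAP) hypothesis:
$$h_{MAP} = \arg\max_{h \in H} P(h \mid X) = \arg\max_{h \in H} P(X \mid h)\, P(h)$$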
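For classification, predict the class $j$ with maximal $P(y = j \mid \mathbf{x})$:
$$p(y = j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y = j)\, p(y = j)}{p(\mathbf{x})}$$
Since $p(\mathbf{x})$ is the same for every class, only $p(\mathbf{x} \mid y = j)\, p(y = j)$ needs to be maximized.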
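The following sketch illustrates that dropping $p(\mathbf{x})$ does not change the predicted class. The function names and the likelihood and prior values are hypothetical, chosen only for illustration.

```python
def posterior(likelihoods, priors):
    """Full Bayes rule: p(y=j | x) = p(x | y=j) p(y=j) / p(x)."""
    joint = {j: likelihoods[j] * priors[j] for j in priors}
    evidence = sum(joint.values())  # p(x), identical for every class
    return {j: joint[j] / evidence for j in joint}


def map_class(likelihoods, priors):
    """MAP decision: maximize p(x | y=j) p(y=j) without computing p(x)."""
    return max(priors, key=lambda j: likelihoods[j] * priors[j])


# Hypothetical numbers for a two-class spam filter.
priors = {"spam": 0.2, "ham": 0.8}
likelihoods = {"spam": 0.05, "ham": 0.001}

post = posterior(likelihoods, priors)
print(post)
print(map_class(likelihoods, priors))
```

Both routes select the same class; the normalized posterior is only needed when calibrated probabilities are required.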
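Multinomial model for a document $d$ given its class $y$:
$$p(\mathbf{x}_d \mid y) = \frac{(\sum_n x_{dn})!}{x_{d1}!\, x_{d2}! \cdots x_{dN}!} \prod_n \beta_{yn}^{x_{dn}}$$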
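In practice this probability is usually evaluated in log space, because the product of many small word probabilities underflows quickly. A minimal sketch follows, assuming every word that occurs in the document has a strictly positive probability; the function name `log_doc_likelihood` is an illustrative choice.

```python
from math import exp, lgamma, log


def log_doc_likelihood(x_d, beta_y):
    """log p(x_d | y) under the multinomial model with word probabilities beta_y."""
    # log of the multinomial coefficient, using lgamma(k + 1) = log(k!).
    log_coeff = lgamma(sum(x_d) + 1) - sum(lgamma(x + 1) for x in x_d)
    # Words with zero count contribute nothing; beta must be > 0 where x > 0.
    log_prod = sum(x * log(b) for x, b in zip(x_d, beta_y) if x > 0)
    return log_coeff + log_prod


# Two-word vocabulary, document with counts (2, 1), uniform word probabilities.
print(exp(log_doc_likelihood([2, 1], [0.5, 0.5])))
```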
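Prediction for a document $d$, with class prior $\pi_y = p(y)$:
$$y^* = \arg\max_y p(\mathbf{x}_d, y) = \arg\max_y p(\mathbf{x}_d \mid y)\, p(y)$$
$$= \arg\max_y \frac{(\sum_n x_{dn})!}{x_{d1}!\, x_{d2}! \cdots x_{dN}!} \prod_n \beta_{yn}^{x_{dn}} \times \pi_y = \arg\max_y \prod_n \beta_{yn}^{x_{dn}} \times \pi_y$$
The multinomial coefficient $\frac{(\sum_n x_{dn})!}{x_{d1}!\, x_{d2}! \cdots x_{dN}!}$ is constant for every class, denoted as $c_d$.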
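The prediction rule can be sketched as follows, working with logarithms and omitting the class-independent constant $c_d$. The function name `predict` and the parameter values are hypothetical; the sketch assumes all $\beta$ entries used are strictly positive.

```python
from math import log


def predict(x_d, beta, pi):
    """Return the class maximizing log(pi_y) + sum_n x_dn * log(beta_yn)."""
    scores = {}
    for y in pi:
        # c_d is the same for every class, so it is left out of the score.
        scores[y] = log(pi[y]) + sum(
            x * log(b) for x, b in zip(x_d, beta[y]) if x > 0
        )
    return max(scores, key=scores.get), scores


# Hypothetical parameters for two classes over a three-word vocabulary.
beta = {"a": [0.7, 0.2, 0.1], "b": [0.1, 0.2, 0.7]}
pi = {"a": 0.5, "b": 0.5}

label, scores = predict([3, 1, 0], beta, pi)
print(label, scores)
```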
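Log-likelihood of the training data, with parameters $\Theta$ (all $\beta_{jn}$ and $\pi_j$):
$$\log L = \log \prod_d p(\mathbf{x}_d, y_d \mid \Theta) = \sum_d \log p(\mathbf{x}_d, y_d \mid \Theta) = \sum_d \log p(\mathbf{x}_d \mid y_d)\, p(y_d) = \sum_d \Big( \sum_n x_{dn} \log \beta_{y_d n} + \log \pi_{y_d} + \log c_d \Big)$$
Optimization problem:
$$\max_{\Theta} \log L \quad \text{s.t.} \quad \pi_j \ge 0 \text{ and } \sum_j \pi_j = 1; \quad \beta_{jn} \ge 0 \text{ and } \sum_n \beta_{jn} = 1 \text{ for all } j$$
The term $\log c_d$ does not involve parameters and can be dropped for optimization purposes.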
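A sketch of why this constrained maximization has a closed-form solution, using Lagrange multipliers. The multiplier symbols $\lambda_j$ and $\mu$ are introduced here for illustration and do not appear in the slides; the nonnegativity constraints turn out to be satisfied automatically.

```latex
\begin{aligned}
\mathcal{L}(\Theta, \lambda, \mu) &= \sum_d \Big( \sum_n x_{dn} \log \beta_{y_d n} + \log \pi_{y_d} \Big)
  + \sum_j \lambda_j \Big( 1 - \sum_n \beta_{jn} \Big) + \mu \Big( 1 - \sum_j \pi_j \Big) \\
\frac{\partial \mathcal{L}}{\partial \beta_{jn}} &= \frac{\sum_{d: y_d = j} x_{dn}}{\beta_{jn}} - \lambda_j = 0
  \;\Rightarrow\; \beta_{jn} = \frac{\sum_{d: y_d = j} x_{dn}}{\lambda_j},
  \qquad \lambda_j = \sum_{d: y_d = j} \sum_{n'} x_{dn'} \\
\frac{\partial \mathcal{L}}{\partial \pi_j} &= \frac{\sum_d 1(y_d = j)}{\pi_j} - \mu = 0
  \;\Rightarrow\; \pi_j = \frac{\sum_d 1(y_d = j)}{\mu},
  \qquad \mu = |D|
\end{aligned}
```

The values of $\lambda_j$ and $\mu$ follow from substituting each solution back into its sum-to-one constraint.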
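Maximum likelihood estimates:
$$\hat{\beta}_{jn} = \frac{\sum_{d: y_d = j} x_{dn}}{\sum_{d: y_d = j} \sum_{n'} x_{dn'}}$$
$$\hat{\pi}_j = \frac{\sum_d 1(y_d = j)}{|D|}$$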
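These estimates are just normalized counts, as the following sketch shows. The function name `estimate_parameters` and the toy training set are hypothetical.

```python
def estimate_parameters(X, y):
    """Unsmoothed maximum likelihood estimates (pi, beta) from count vectors X and labels y."""
    classes = sorted(set(y))
    num_words = len(X[0])
    # pi_j: fraction of training documents that belong to class j.
    pi = {j: sum(1 for label in y if label == j) / len(y) for j in classes}
    beta = {}
    for j in classes:
        # Total count of each word over the documents of class j.
        word_counts = [
            sum(x[n] for x, label in zip(X, y) if label == j)
            for n in range(num_words)
        ]
        total = sum(word_counts)
        beta[j] = [count / total for count in word_counts]
    return pi, beta


# Hypothetical training set: three documents over a three-word vocabulary.
X = [[2, 1, 0], [1, 0, 1], [0, 0, 2]]
y = ["a", "a", "b"]
pi, beta = estimate_parameters(X, y)
print(pi)
print(beta)
```

Note that class "b" receives probability zero for the first two words, since neither occurs in its only training document.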
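Zero-count problem: if word $n$ never occurs in the training documents of class $j$, then
$$\hat{\beta}_{jn} = \frac{\sum_{d: y_d = j} x_{dn}}{\sum_{d: y_d = j} \sum_{n'} x_{dn'}} = 0$$
and every document $d$ that contains word $n$ gets
$$p(\mathbf{x}_d \mid y = j) \propto \prod_n \beta_{jn}^{x_{dn}} = 0$$
Smoothing (add one to every word count):
$$\hat{\beta}_{jn} = \frac{\sum_{d: y_d = j} x_{dn} + 1}{\sum_{d: y_d = j} \sum_{n'} x_{dn'} + N}$$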
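A short sketch of the smoothed estimate. The function name `smoothed_beta` is illustrative, and the `alpha` parameter is a generalization assumed here: `alpha=1` reproduces the add-one formula, while other values are not discussed in the slides.

```python
def smoothed_beta(word_counts, alpha=1):
    """Add-alpha estimate of beta_j from the per-word counts of class j."""
    num_words = len(word_counts)  # N, the vocabulary size
    total = sum(word_counts)
    return [(count + alpha) / (total + alpha * num_words) for count in word_counts]


# Hypothetical class whose documents only ever contain the third word.
print(smoothed_beta([0, 0, 2]))
```

Every word now has a strictly positive probability, and the estimates still sum to one.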
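Vocabulary:
Index  Word
1      Chinese
2      Beijing
3      Shanghai
4      Macao
5      Tokyo
6      Japan

Smoothed estimates for class $c$:
$$\hat{\beta}_{c1} = \frac{5+1}{8+6} = \frac{3}{7}, \quad \hat{\beta}_{c2} = \frac{1+1}{8+6} = \frac{1}{7}, \quad \hat{\beta}_{c3} = \frac{1+1}{8+6} = \frac{1}{7}, \quad \hat{\beta}_{c4} = \frac{1+1}{8+6} = \frac{1}{7}, \quad \hat{\beta}_{c5} = \frac{0+1}{8+6} = \frac{1}{14}, \quad \hat{\beta}_{c6} = \frac{0+1}{8+6} = \frac{1}{14}$$
Smoothed estimates for class $j$:
$$\hat{\beta}_{j1} = \frac{1+1}{3+6} = \frac{2}{9}, \quad \hat{\beta}_{j2} = \frac{0+1}{3+6} = \frac{1}{9}, \quad \hat{\beta}_{j3} = \frac{0+1}{3+6} = \frac{1}{9}, \quad \hat{\beta}_{j4} = \frac{0+1}{3+6} = \frac{1}{9}, \quad \hat{\beta}_{j5} = \frac{1+1}{3+6} = \frac{2}{9}, \quad \hat{\beta}_{j6} = \frac{1+1}{3+6} = \frac{2}{9}$$
Class priors:
$$\hat{\pi}_c = \frac{3}{4}, \quad \hat{\pi}_j = \frac{1}{4}$$
Prediction for document 5:
$$p(y = c \mid \mathbf{x}_5) \propto \hat{\pi}_c \times \prod_n \hat{\beta}_{cn}^{x_{5n}} = \frac{3}{4} \times \Big(\frac{3}{7}\Big)^3 \times \frac{1}{14} \times \frac{1}{14} \approx 0.0003$$
$$p(y = j \mid \mathbf{x}_5) \propto \hat{\pi}_j \times \prod_n \hat{\beta}_{jn}^{x_{5n}} = \frac{1}{4} \times \Big(\frac{2}{9}\Big)^3 \times \frac{2}{9} \times \frac{2}{9} \approx 0.0001$$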
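The arithmetic of this example can be reproduced with exact fractions. The per-class word counts and the count vector of document 5 below are read off from the numerators and exponents in the example (the training documents themselves are not shown in the text), so treat them as an assumption; the variable names are illustrative.

```python
from fractions import Fraction

N = 6  # vocabulary size: Chinese, Beijing, Shanghai, Macao, Tokyo, Japan

# Per-class word counts, taken from the numerators of the estimates above.
counts = {"c": [5, 1, 1, 1, 0, 0], "j": [1, 0, 0, 0, 1, 1]}
priors = {"c": Fraction(3, 4), "j": Fraction(1, 4)}

# Document 5: "Chinese" three times, "Tokyo" once, "Japan" once (from the exponents).
x5 = [3, 0, 0, 0, 1, 1]

scores = {}
for label, word_counts in counts.items():
    total = sum(word_counts)
    # Add-one smoothed word probabilities for this class.
    beta = [Fraction(count + 1, total + N) for count in word_counts]
    score = priors[label]
    for b, x in zip(beta, x5):
        score *= b ** x
    scores[label] = score

prediction = max(scores, key=scores.get)
for label, score in scores.items():
    print(label, score, float(score))
print("predicted class:", prediction)
```

Class $c$ has the larger unnormalized posterior, so document 5 is assigned to class $c$.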