Laboratoire LE2I – UMR CNRS 6306 – Université de Bourgogne
An unsupervised classification process for large datasets based on web reasoning
Rafael PEIXOTO, Thomas HASSAN, Christophe CRUZ, Aurelie BERTAUX, Nuno SILVA thomas.hassan@u-bourgogne.fr
large volumes of unstructured text
Distributed method that exploits the co-occurrence matrix
data items
Classification using a Web Reasoner
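Alpha set: $\omega_\alpha(t_i) = \{\, t_j \mid \forall t_j \in \mathrm{Term} : P_C(t_i \mid t_j) > \alpha \,\}$
Beta set: $\omega_\beta(t_i) = \{\, t_j \mid \forall t_j \in \mathrm{Term} : \beta \le P_C(t_i \mid t_j) \le \alpha \,\}$
Co-occurrence: $P_C(\mathrm{term}_i \mid \mathrm{term}_j) = \dfrac{\mathrm{cfm}(\mathrm{term}_i, \mathrm{term}_j)}{\mathrm{cfm}(\mathrm{term}_j, \mathrm{term}_j)}$

$P_C(i \mid j)$, rows label1 to label7, columns term1 to term7 (blank cells were lost in extraction, so each row lists its values in their original order):
label1: 5, 5, 25, 25
label2: 75, 75, 5
label3: 75, 25
label4: 5, 25, 25, 5, 93, 25
label5: 95, 60, 5
label6: 60, 95, 90
label7: 5, 98, 5, 60, 25, 79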
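The alpha set, beta set and co-occurrence definitions above can be illustrated with a minimal Python sketch. It is not the distributed implementation the slides refer to: the toy corpus, the function names (`cooccurrence`, `conditional_proportion`, `alpha_beta_sets`) and the choice to skip a term when building its own sets are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product


def cooccurrence(items):
    """Count, for every ordered term pair (a, b), the items containing both.

    The diagonal entry (a, a) is the number of items that contain a.
    """
    cfm = Counter()
    for terms in items:
        unique = set(terms)
        for a, b in product(unique, repeat=2):
            cfm[(a, b)] += 1
    return cfm


def conditional_proportion(cfm, ti, tj):
    """P_C(ti | tj) in percent: cfm(ti, tj) / cfm(tj, tj)."""
    denom = cfm[(tj, tj)]
    return 100.0 * cfm[(ti, tj)] / denom if denom else 0.0


def alpha_beta_sets(cfm, ti, terms, alpha, beta):
    """Split the other terms into the alpha set (> alpha) and the beta set ([beta, alpha])."""
    alpha_set, beta_set = set(), set()
    for tj in terms:
        if tj == ti:
            continue  # assumption: a term is not used as evidence for itself
        pc = conditional_proportion(cfm, ti, tj)
        if pc > alpha:
            alpha_set.add(tj)
        elif beta <= pc <= alpha:
            beta_set.add(tj)
    return alpha_set, beta_set


# Hypothetical toy corpus: each item is the list of terms extracted from it.
items = [
    ["wine", "burgundy", "grape"],
    ["wine", "burgundy"],
    ["wine", "grape"],
    ["wine", "cheese"],
    ["cheese", "grape"],
]
cfm = cooccurrence(items)
terms = sorted({t for item in items for t in item})
alpha_set, beta_set = alpha_beta_sets(cfm, "wine", terms, alpha=91, beta=50)
print("alpha set:", sorted(alpha_set))
print("beta set:", sorted(beta_set))
```

In this toy corpus every item containing "burgundy" also contains "wine", so "burgundy" lands in the alpha set of "wine", while "grape" (2 of 3 items) and "cheese" (1 of 2 items) fall between the two thresholds.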
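% (columns term1 to term7)
label1: 5, 5, 25, 25
label2: 75, 75, 5
label3: 75, 25
label4: 5, 25, 25, 5, 93, 25
label5: 95, 60, 5
label6: 60, 95, 90
label7: 5, 98, 5, 60, 25, 79

$\omega_\alpha(t_i) = \{\, t_j \mid \forall t_j \in \mathrm{Term} : P_C(t_i \mid t_j) > \alpha \,\}$, $\alpha = 91$
$\omega_\beta(t_i) = \{\, t_j \mid \forall t_j \in \mathrm{Term} : \beta \le P_C(t_i \mid t_j) \le \alpha \,\}$, $\beta = 70$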
DL concepts                    Description
Item ⊑ ∃hasTerm.Term           Items to classify (e.g. documents) have terms
Term ⊑ ⊤                       Terms (e.g. words) extracted from items
Label ⊑ Term                   Labels are terms used to classify items
Label ⊑ ∀broader.Label         Broader relation between labels
Label ⊑ ∀narrower.Label        Narrower relation between labels
broader ≡ narrower⁻            Broader and narrower are inverse
Item ⊓ Term = ∅                Items and Terms are disjoint
Item ⊑ ∀isClassified.Label     Relation that links items with labels
Item(?it), Term(term_j), Label(term_i), hasTerm(?it, term_j) → isClassified(?it, term_i)
Alpha rules
Item(?it), Term(t1), Label(term_i), hasTerm(?it, t1) → isClassified(?it, term_i)
Item(?it), Term(t2), Label(term_i), hasTerm(?it, t2) → isClassified(?it, term_i)
Beta rules
Item(?it), Term(t1), Term(t2), Label(term_i), hasTerm(?it, t1), hasTerm(?it, t2) → isClassified(?it, term_i)
Item(?it), Term(t1), Term(t3), Label(term_i), hasTerm(?it, t1), hasTerm(?it, t3) → isClassified(?it, term_i)
Item(?it), Term(t2), Term(t3), Label(term_i), hasTerm(?it, t2), hasTerm(?it, t3) → isClassified(?it, term_i)
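The alpha and beta rules above follow a regular pattern, which the following Python sketch tries to make explicit. It assumes that one alpha rule is produced per term, and that beta rules are produced for every pair drawn from a three-term set, as the three beta rules listed suggest; the slides do not state how the combinations are chosen. The helper names (`alpha_rules`, `beta_rules`, `to_swrl`, `is_classified`) and the `size` parameter are hypothetical, and the set-containment check stands in for a real web reasoner.

```python
from itertools import combinations


def alpha_rules(alpha_set):
    """One rule per term: a single matching term is enough to classify."""
    return [frozenset([t]) for t in sorted(alpha_set)]


def beta_rules(beta_set, size=2):
    """One rule per combination of `size` terms (assumption: pairs, as on the slide)."""
    return [frozenset(c) for c in combinations(sorted(beta_set), size)]


def to_swrl(rule, label):
    """Render one rule body in the notation used on the slide."""
    terms = sorted(rule)
    atoms = ["Item(?it)"]
    atoms += [f"Term({t})" for t in terms]
    atoms.append(f"Label({label})")
    atoms += [f"hasTerm(?it, {t})" for t in terms]
    return ", ".join(atoms) + f" → isClassified(?it, {label})"


def is_classified(item_terms, rules):
    """True if every term required by at least one rule appears in the item."""
    terms = set(item_terms)
    return any(rule <= terms for rule in rules)


# Term names mirror the slide; the sets themselves are illustrative.
a_rules = alpha_rules({"t1", "t2"})
b_rules = beta_rules({"t1", "t2", "t3"})

for rule in a_rules + b_rules:
    print(to_swrl(rule, "term_i"))

print(is_classified(["t2", "t3", "t9"], b_rules))  # both t2 and t3 present
print(is_classified(["t3"], a_rules + b_rules))    # t3 alone matches no rule
```

Representing each rule body as a set of required terms keeps the check independent of rule order; an actual deployment would instead hand the generated rules to the reasoner together with the ontology.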
Parameter             Step          Value
Alpha Threshold       Resolution    90
Beta Threshold                      80
Term ranking (n)                    5
p                                   0.25
Term Threshold (δ)    Realization   2
α = 91 90 β = 80
Thank you!