An unsupervised classification process for large datasets based on web reasoning




SLIDE 1

Laboratoire LE2I – UMR CNRS 6306 – Université de Bourgogne

An unsupervised classification process for large datasets based on web reasoning

Rafael PEIXOTO, Thomas HASSAN, Christophe CRUZ, Aurelie BERTAUX, Nuno SILVA
thomas.hassan@u-bourgogne.fr

SLIDE 2

Outline

Context

  • Global problem
  • The Semantic HMC

Specific Problem

  • Proposed Solution

Implementation

  • Setup
  • Results

Conclusion and future work

SLIDE 3

Global Problem

Value extraction from Big Data sources

SLIDE 4

Global Problem

SLIDE 5

Proposition: « Semantic HMC »

SLIDE 6

Proposition: « Semantic HMC »

SLIDE 7

Proposition: « Semantic HMC »

  • Unsupervised ontology learning
  • Rule-based Classification (Web Reasoner)


SLIDE 9

Specific Problem

Rule-based reasoning to perform Classification

  • Unsupervised ontology learning
  • Rule-based Classification

SLIDE 10

Specific Problem

Resolution: learn classification rules from large volumes of unstructured text

  • Distributed method that exploits the co-occurrence matrix

Realization: classify large volumes of new data items

  • Classification using a Web Reasoner
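
The resolution step relies on a co-occurrence matrix built in a distributed way. The following is a minimal single-process sketch of the map and reduce phases, assuming each item has already been tokenized into terms. The function names `map_item` and `reduce_counts` and the toy items are illustrative and do not come from the slides.

```python
from collections import Counter
from itertools import chain


def map_item(terms):
    """Map phase: emit ((term_a, term_b), 1) for every ordered pair of
    terms in one item, including (term, term), so the diagonal ends up
    holding the number of items that contain that term."""
    unique = sorted(set(terms))
    for a in unique:
        for b in unique:
            yield (a, b), 1


def reduce_counts(pairs):
    """Reduce phase: sum the emitted counts per (term_a, term_b) key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


# Hypothetical items, each already reduced to a list of terms.
items = [
    ["sport", "football", "goal"],
    ["sport", "tennis"],
    ["football", "goal"],
]

cfm = reduce_counts(chain.from_iterable(map_item(t) for t in items))
print(cfm[("sport", "sport")])    # number of items containing "sport"
print(cfm[("football", "goal")])  # number of items containing both terms
```

In a real MapReduce job the map calls would run on separate workers and the framework would group keys before the reduce phase; here both phases run in one process to keep the data flow visible.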

SLIDE 11

Proposed solution: rule learning (Resolution)

Learning Alpha and Beta sets

Co-occurrence:

$$P_C(term_i \mid term_j) = \frac{cfm(term_i, term_j)}{cfm(term_j, term_j)}$$

Alpha set:

$$\omega_\alpha(t_i) = \{\, t_j \mid \forall t_j \in Term : P_C(t_i \mid t_j) > \alpha \,\}$$

Beta set:

$$\omega_\beta(t_i) = \{\, t_j \mid \forall t_j \in Term : \beta \le P_C(t_i \mid t_j) \le \alpha \,\}$$

P_C(i|j) per label over term1 to term7 (non-empty cells only, in column order):

  label1: 5, 5, 25, 25
  label2: 75, 75, 5
  label3: 75, 25
  label4: 5, 25, 25, 5, 93, 25
  label5: 95, 60, 5
  label6: 60, 95, 90
  label7: 5, 98, 5, 60, 25, 79
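
A minimal sketch of how the Alpha and Beta sets could be derived from a co-occurrence matrix, following the definitions on this slide. The matrix contents, the percentage scaling, the exclusion of a term from its own sets, and the names `conditional_proportion` and `alpha_beta_sets` are assumptions made for illustration.

```python
def conditional_proportion(cfm, term_i, term_j):
    """P_C(term_i | term_j) = cfm(term_i, term_j) / cfm(term_j, term_j),
    returned as a percentage to match the thresholds used on the slides."""
    denom = cfm.get((term_j, term_j), 0)
    if denom == 0:
        return 0.0
    return 100.0 * cfm.get((term_i, term_j), 0) / denom


def alpha_beta_sets(cfm, terms, term_i, alpha, beta):
    """Return (alpha_set, beta_set) for term_i.

    alpha_set: terms t_j with P_C(term_i | t_j) > alpha
    beta_set:  terms t_j with beta <= P_C(term_i | t_j) <= alpha
    """
    alpha_set, beta_set = set(), set()
    for term_j in terms:
        if term_j == term_i:
            continue  # assumption: a term is not used as evidence for itself
        p = conditional_proportion(cfm, term_i, term_j)
        if p > alpha:
            alpha_set.add(term_j)
        elif p >= beta:
            beta_set.add(term_j)
    return alpha_set, beta_set


# Hypothetical symmetric co-occurrence counts (number of items).
cfm = {
    ("sport", "sport"): 100,
    ("football", "football"): 40,
    ("goal", "goal"): 50,
    ("tennis", "tennis"): 10,
    ("sport", "football"): 38, ("football", "sport"): 38,
    ("sport", "goal"): 40, ("goal", "sport"): 40,
    ("sport", "tennis"): 3, ("tennis", "sport"): 3,
}
terms = ["sport", "football", "goal", "tennis"]

alpha_set, beta_set = alpha_beta_sets(cfm, terms, "sport", alpha=90, beta=70)
print(alpha_set)  # {'football'}: 38/40 = 95%
print(beta_set)   # {'goal'}: 40/50 = 80%
```

Dividing by the diagonal entry normalizes each count by how often the conditioning term occurs, so a rare term that almost always appears with the label scores higher than a frequent term that appears with it only sometimes.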


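SLIDE 13

Proposed solution: rule learning (Resolution)

Example:

P_C(i|j) in %, per label over term1 to term7 (non-empty cells only, in column order):

  label1: 5, 5, 25, 25
  label2: 75, 75, 5
  label3: 75, 25
  label4: 5, 25, 25, 5, 93, 25
  label5: 95, 60, 5
  label6: 60, 95, 90
  label7: 5, 98, 5, 60, 25, 79

$$\omega_\alpha(t_i) = \{\, t_j \mid \forall t_j \in Term : P_C(t_i \mid t_j) > \alpha \,\}, \quad \alpha = 91$$

$$\omega_\beta(t_i) = \{\, t_j \mid \forall t_j \in Term : \beta \le P_C(t_i \mid t_j) \le \alpha \,\}, \quad \beta = 70$$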
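
As a quick check of the example, the two thresholds can be applied to the percentages listed in the table. The sketch below keeps the values row by row as they appear on the slide. Because the column positions of empty cells are not recoverable, it reports only which values fall in each band, not which terms they belong to. The helper name `split_by_threshold` is illustrative.

```python
ALPHA = 91  # alpha threshold given in the example
BETA = 70   # beta threshold given in the example

# Non-empty P_C percentages per label, in the order listed on the slide.
rows = {
    "label1": [5, 5, 25, 25],
    "label2": [75, 75, 5],
    "label3": [75, 25],
    "label4": [5, 25, 25, 5, 93, 25],
    "label5": [95, 60, 5],
    "label6": [60, 95, 90],
    "label7": [5, 98, 5, 60, 25, 79],
}


def split_by_threshold(values, alpha, beta):
    """Split values into the alpha band (> alpha) and the beta band
    (beta <= value <= alpha)."""
    alpha_band = [v for v in values if v > alpha]
    beta_band = [v for v in values if beta <= v <= alpha]
    return alpha_band, beta_band


for label, values in rows.items():
    alpha_band, beta_band = split_by_threshold(values, ALPHA, BETA)
    print(f"{label}: alpha={alpha_band} beta={beta_band}")
```

With these thresholds, label4, label5, label6 and label7 each have one value in the alpha band, while label2, label3, label6 and label7 have values in the beta band.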

SLIDE 14

Proposed solution: classification with web reasoner

Classification at query-time using backward-chaining
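
Backward chaining evaluates rules only when a query asks for their conclusion, instead of materializing every classification up front. The following is a toy sketch of that idea in plain Python, not the reasoner's actual algorithm. The rule encoding, the item data, and the function names `is_classified` and `labels_of` are hypothetical.

```python
# Each rule: (label, required_terms). An item is classified under `label`
# when it has every term in `required_terms` (one term for alpha-style
# rules, several for beta-style rules).
RULES = [
    ("sport", {"football"}),
    ("sport", {"goal", "match"}),
    ("politics", {"election"}),
]

# Facts: hasTerm assertions per item.
HAS_TERM = {
    "doc1": {"goal", "match", "referee"},
    "doc2": {"election", "vote"},
}


def is_classified(item, label):
    """Answer the goal isClassified(item, label) by working backwards:
    select only rules whose head matches the goal, then check their
    bodies against the stored facts."""
    item_terms = HAS_TERM.get(item, set())
    for head_label, required_terms in RULES:
        if head_label != label:
            continue  # this rule cannot prove the goal, skip it
        if required_terms <= item_terms:
            return True
    return False


def labels_of(item):
    """Query-time classification: try each known label as a goal."""
    return sorted({label for label, _ in RULES if is_classified(item, label)})


print(labels_of("doc1"))  # ['sport']
print(labels_of("doc2"))  # ['politics']
```

No classification is stored ahead of time in this sketch; adding a new item only adds facts, and the rules are consulted when a query arrives.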

SLIDE 15

Core Ontology

DL concepts                   Description
Item ⊑ ∃hasTerm.Term          Items to classify (e.g. doc) have terms
Term ⊑ ⊤                      Terms (e.g. word) extracted from items
Label ⊑ Term                  Labels are terms used to classify items
Label ⊑ ∀broader.Label        Broader relation between labels
Label ⊑ ∀narrower.Label       Narrower relation between labels
broader ≡ narrower⁻           Broader and narrower are inverse
Item ⊓ Term = ∅               Items and Terms are disjoint
Item ⊑ ∀isClassified.Label    Relation that links items with labels
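
The axioms of the core ontology can be read as constraints on the data. The sketch below checks several of them over plain Python sets. It is a closed-world illustration with hypothetical individuals, not an OWL encoding, and an OWL reasoner under the open-world assumption would treat some of these checks differently.

```python
# Hypothetical individuals.
items = {"doc1", "doc2"}
terms = {"sport", "football", "goal"}
labels = {"sport", "football"}

has_term = {("doc1", "football"), ("doc2", "goal")}
broader = {("football", "sport")}  # football has the broader label sport
is_classified = {("doc1", "football")}

# broader ≡ narrower⁻ : derive narrower as the inverse relation.
narrower = {(b, a) for (a, b) in broader}


def check_core_ontology():
    """Return a list of violated constraints (empty if all hold)."""
    errors = []
    if not labels <= terms:
        errors.append("Label ⊑ Term violated")
    if items & terms:
        errors.append("Item and Term must be disjoint")
    if any(i not in items or t not in terms for i, t in has_term):
        errors.append("hasTerm must link items to terms")
    if any(a not in labels or b not in labels for a, b in broader):
        errors.append("broader must link labels to labels")
    if any(i not in items or lab not in labels for i, lab in is_classified):
        errors.append("isClassified must link items to labels")
    if {i for i, _ in has_term} != items:
        errors.append("Item ⊑ ∃hasTerm.Term violated")
    return errors


print(check_core_ontology())  # []
print(narrower)               # {('sport', 'football')}
```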


SLIDE 17

Implementation: rule creation

Distributed process using MapReduce:

OWL API used to generate SWRL rules from the output

Item(?it), Term(term_j), Label(term_i), hasTerm(?it, term_j) → isClassified(?it, term_i)

SLIDE 18

Implementation: rule creation

Generated rules: Example

Alpha rules

  • Item(?it), Term(t1), Label(term_i), hasTerm(?it, t1) → isClassified(?it, term_i)
  • Item(?it), Term(t2), Label(term_i), hasTerm(?it, t2) → isClassified(?it, term_i)

Beta rules

  • Item(?it), Term(t1), Term(t2), Label(term_i), hasTerm(?it, t1), hasTerm(?it, t2) → isClassified(?it, term_i)
  • Item(?it), Term(t1), Term(t3), Label(term_i), hasTerm(?it, t1), hasTerm(?it, t3) → isClassified(?it, term_i)
  • Item(?it), Term(t2), Term(t3), Label(term_i), hasTerm(?it, t2), hasTerm(?it, t3) → isClassified(?it, term_i)
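
A sketch of how rule text of this shape could be generated from learned Alpha and Beta sets. The slides state that the OWL API is used for this step; the string templates below only mimic the printed rule form in Python. Pairing Beta terms two at a time is inferred from the examples, and the function names and inputs are hypothetical.

```python
from itertools import combinations


def alpha_rules(label, alpha_terms):
    """One rule per alpha term: a single term is enough to classify."""
    return [
        f"Item(?it), Term({t}), Label({label}), hasTerm(?it, {t})"
        f" -> isClassified(?it, {label})"
        for t in sorted(alpha_terms)
    ]


def beta_rules(label, beta_terms):
    """One rule per pair of beta terms: both terms must be present."""
    return [
        f"Item(?it), Term({a}), Term({b}), Label({label}),"
        f" hasTerm(?it, {a}), hasTerm(?it, {b})"
        f" -> isClassified(?it, {label})"
        for a, b in combinations(sorted(beta_terms), 2)
    ]


for rule in alpha_rules("term_i", {"t1", "t2"}):
    print(rule)
for rule in beta_rules("term_i", {"t1", "t2", "t3"}):
    print(rule)
```

With three Beta terms this produces three pairwise rules, matching the number of Beta rules shown in the example.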

SLIDE 19

Implementation: Classification at query-time

  • Stardog used as a scalable triple-store (compatible with backward-chaining inference as well as SWRL rule inference)
  • Rule selection process developed in Java, interacting with Stardog to optimize query performance

[Diagram labels: Resolution, Realization]

SLIDE 20

Implementation: test environment

SLIDE 21

Implementation: parameter setup

Parameter            Step          Value
Alpha Threshold      Resolution    90
Beta Threshold       Resolution    80
Term ranking (n)     Resolution    5
p                    Resolution    0.25
Term Threshold (δ)   Realization   2

SLIDE 22

Results

Number of classifications: Item ⊑ ∀isClassified.Label

SLIDE 23

Results

Number of learned rules (Alpha + Beta)

SLIDE 24

Results

Number of learned rules (Alpha + Beta)

Chart annotations: α = 91, 90; β = 80


SLIDE 26

Conclusion

  • A new unsupervised process to automatically classify items
  • A highly scalable rule learning method based on statistical and lexical approaches
  • A novel method to classify items using a web reasoner
  • The process prototype was successfully implemented in a scalable and distributed platform to process Big Data
  • Preliminary results show that the items are classified automatically by the reasoner

SLIDE 27

Ongoing and Future Work

  • Quality evaluation of the process: comparison with the state of the art in classification
  • Predictive performance evaluation based on cross-validation with a large dataset
  • Optimization of the process by exhaustive analysis of the parameters’ impact
  • Application to classification of news articles on the web

SLIDE 28

Laboratoire LE2I – UMR CNRS 6306 – Université de Bourgogne

An unsupervised classification process for large datasets using web reasoning

Thank you!

Rafael PEIXOTO, Thomas HASSAN, Christophe CRUZ, Aurelie BERTAUX, Nuno SILVA
thomas.hassan@u-bourgogne.fr