Ontology-based approach for unsupervised and adaptive focused crawling



SLIDE 1

Le2i FRE2005, CNRS, Arts et Métiers, Univ. Bourgogne Franche-Comté Dijon, France

Ontology-based approach for unsupervised and adaptive focused crawling

Thomas HASSAN, Christophe CRUZ, Aurélie Bertaux

thomas.hassan@u-bourgogne.fr

SLIDE 2

Outline

  • Context
      § Industrial context
      § Problem statement
  • Proposed solution
      § Background
      § Architecture
  • Evaluation
      § Scaling
      § Performance
  • Conclusion and future work
SLIDE 3

Industrial context

Competitive intelligence

SLIDE 4

Industrial context

[Diagram labels: Content feed tools, Content analysis]

SLIDE 5

Problem statement

Bottlenecks:

  • Cross-referencing articles to assess veracity
  • Manual classification of articles
  • Discrepancy between data and knowledge base

High time cost for experts, possible loss of information

[Diagram labels: Content feed tools, Content analysis]

SLIDE 6

Problem statement

  • How to specialize feed tools with domain-specific knowledge?
  • How to optimize content gathering to find the most relevant items fast?
  • How to expand the horizon of information sources?
SLIDE 7

Outline

  • Context
      § Industrial context
      § Problem statement
  • Proposed solution
      § Background
      § Architecture
  • Evaluation
      § Scaling
      § Performance
  • Conclusion and future work
SLIDE 8

Background : focused crawler

[Figure legend: Relevant, Irrelevant, Seed item, Inlink]

SLIDE 9

Background : focused crawler + semantics

[Diagram labels: Web crawler, Ontology, Efficient content gathering, Relevant content analysis]

SLIDE 10

Limitations

1) Dynamic data vs. static ontology: discrepancy between the ontology-based classifier and actual web data
2) The crawler should improve from experience: both content and graph mining should be used to enhance crawling performance

Objectives: adapt both crawling experience and content analysis over time to accelerate crawling and improve relevance

SLIDE 11

Architecture : baseline implementation

  • Crawl web sources periodically
  • High throughput, fault tolerance
  • Integrate useful modules

Based on Nutch, a Hadoop-based distributed crawler

Diagram from: https://nutch.wordpress.com/

SLIDE 12

Architecture : classification module

[Figure: Multi-label Hierarchical Classification; panels: HMC with DAG, HMC with Tree]

Classification model construction based on probability distribution of features

[Table: label × term matrix over term1 to term7; empty cells were lost in extraction, so each row lists its values without column positions]
label1: 5, 5, 25, 25
label2: 75, 75, 5
label3: 75, 25
label4: 5, 25, 25, 5, 93, 25
label5: 95, 60, 5
label6: 60, 95, 90
label7: 5, 98, 5, 60, 25, 79

SLIDE 13

Architecture : classification module

[Figure: Multi-label Hierarchical Classification; panels: HMC with DAG, HMC with Tree]

Objective: content-based classification of items

  • Each document is represented as a vector of the terms it contains (Lucene)
  • Outputs a vector of labels (relevant concepts of the ontology) for each item
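
The classification step can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy ontology `PARENTS`, the label-to-term weights in `MODEL`, the threshold, and the whitespace tokenizer standing in for Lucene analysis. It only shows the general shape (term vector in, label vector out, ancestors added along the hierarchy) and is not the authors' model.

```python
from collections import Counter

# Hypothetical toy ontology: each label maps to its parent labels (a DAG).
PARENTS = {
    "economy": [],
    "finance": ["economy"],
    "banking": ["finance"],
    "technology": [],
}

# Hypothetical label -> {term: weight} model (weights read as percentages).
MODEL = {
    "banking": {"bank": 95, "loan": 60},
    "finance": {"stock": 75, "bank": 25},
    "technology": {"software": 90},
}


def term_vector(text):
    """Represent a document as a bag of lowercase terms."""
    return Counter(text.lower().split())


def with_ancestors(label):
    """Return the label plus every ancestor reachable in the DAG."""
    seen, stack = set(), [label]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def classify(text, threshold=50):
    """Return the sorted labels whose summed term weights reach the threshold."""
    terms = term_vector(text)
    labels = set()
    for label, weights in MODEL.items():
        score = sum(weight for term, weight in weights.items() if term in terms)
        if score >= threshold:
            # A label implies its ancestors in the hierarchy.
            labels |= with_ancestors(label)
    return sorted(labels)
```

With these toy values, `classify("The bank approved a loan")` selects `banking` directly and adds `finance` and `economy` through the hierarchy.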

SLIDE 14

Architecture : priority module

Use the context-graph approach to estimate the relevance of unseen links. Computes similarity with fetched items based on the distance to relevant items.

Diligenti, et al., 2000. Focused Crawling Using Context Graphs. In VLDB (pp. 527-534).

[Figure legend: Relevant, Irrelevant, Inlink, Graph layers]
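
The layer idea behind the context graph can be sketched as follows. This is a simplified illustration under stated assumptions, not the module itself: it only builds the layers by walking inlinks backwards from relevant items, and the `priority` mapping is a hypothetical choice. The per-layer classifiers that the context-graph approach uses to place unseen pages are left out.

```python
from collections import deque


def context_layers(inlinks, relevant, max_layer=3):
    """Breadth-first search backwards along inlinks from the relevant items.

    `inlinks` maps a page to the pages that link to it. Returns
    {page: layer}: layer 0 holds the relevant items, layer k holds pages
    that reach a relevant item in k link hops.
    """
    layers = {page: 0 for page in relevant}
    queue = deque(relevant)
    while queue:
        page = queue.popleft()
        if layers[page] == max_layer:
            continue  # do not expand beyond the outermost layer
        for source in inlinks.get(page, []):
            if source not in layers:
                layers[source] = layers[page] + 1
                queue.append(source)
    return layers


def priority(layer):
    """Hypothetical mapping from layer index to a crawl priority in (0, 1]."""
    return 1.0 / (1 + layer)
```

Pages in lower layers are closer to known relevant items, so a frontier ordered by `priority` would fetch their outlinks first.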

SLIDE 15

Architecture : classification module

Integration with the crawler

SLIDE 16

Architecture : maintenance module

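Objective: maintain a co-occurrence matrix of features

[Table: label × term matrix over term1 to term7; empty cells were lost in extraction, so each row lists its values without column positions]
label1: 5, 5, 25, 25
label2: 75, 75, 5
label3: 75, 25
label4: 5, 25, 25, 5, 93, 25
label5: 95, 60, 5
label6: 60, 95, 90
label7: 5, 98, 5, 60, 25, 79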
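
One way to keep such a matrix up to date is to update sparse counts each time an item is classified. The sketch below assumes this incremental scheme; the class name, its methods, and the reading of the table values as percentages are illustrative assumptions, not details given on the slides.

```python
from collections import defaultdict


class CooccurrenceMatrix:
    """Sparse label x term co-occurrence counts, updated item by item."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.term_totals = defaultdict(int)

    def update(self, terms, labels):
        """Record one classified item: its distinct terms and its labels."""
        distinct = set(terms)
        for term in distinct:
            self.term_totals[term] += 1
        for label in labels:
            for term in distinct:
                self.counts[label][term] += 1

    def percentage(self, label, term):
        """Share of items containing `term` that carry `label`, in percent."""
        total = self.term_totals.get(term, 0)
        if total == 0:
            return 0.0
        return 100.0 * self.counts.get(label, {}).get(term, 0) / total
```

Because only counts are stored, newly crawled items can be folded in without recomputing the matrix from scratch.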

SLIDE 17

Architecture : maintenance module

SLIDE 18

Outline

  • Context
      § Industrial context
      § Problem statement
  • Proposed solution
      § Background
      § Architecture
  • Evaluation
      § Scaling
      § Performance
  • Conclusion and future work
SLIDE 19

Scaling

Distributed architecture to deal with scaling

SLIDE 20

Scaling

Distributed architecture to deal with scaling

SLIDE 21

Quality Evaluation

Comparison with standard Best-N-First using only cosine similarity
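
For readers unfamiliar with the baseline, a Best-N-First selection step driven only by cosine similarity might look like the sketch below. The function names, the whitespace tokenization, and the use of link text as the scored context are assumptions for illustration; the slides do not describe the baseline's implementation.

```python
import heapq
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_n_first(frontier, topic, n):
    """Pick the n frontier URLs whose context text best matches the topic.

    `frontier` maps URL -> text describing the link (e.g. anchor text).
    """
    topic_vec = Counter(topic.lower().split())
    scored = [
        (cosine(Counter(text.lower().split()), topic_vec), url)
        for url, text in frontier.items()
    ]
    # Highest similarity first; the crawler would fetch these n links next.
    return [url for _, url in heapq.nlargest(n, scored)]
```

Each crawl round would call `best_n_first` on the current frontier, fetch the returned URLs, and add their outlinks back to the frontier.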

SLIDE 22

Outline

  • Context
      § Industrial context
      § Problem statement
  • Proposed solution
      § Background
      § Architecture
  • Evaluation
      § Scaling
      § Performance
  • Conclusion and future work
SLIDE 23

Conclusion

  • An approach for unsupervised ontology-based focused crawling
      § Performs cross-referencing of web items
      § Ontology-based classification model for accurate item classification
      § Adaptation and evolution of the model using web content and web graph mining
  • Future work
      § Evaluation of the architecture in an industrial context
      § Address scalability issues of the maintenance process
      § Active learning integration in the maintenance process (expert feedback)

SLIDE 24

Thank you !