Le2i FRE2005, CNRS, Arts et Métiers, Univ. Bourgogne Franche-Comté Dijon, France
Ontology-based approach for unsupervised and adaptive focused crawling
Thomas HASSAN, Christophe CRUZ, Aurélie Bertaux
thomas.hassan@u-bourgogne.fr
Content feed tools Content analysis
Bottlenecks:
High time cost for experts, possible loss
Figure legend: Relevant, Irrelevant, Seed item, Inlink
Web crawler + ontology: efficient content gathering, relevant content analysis
Diagram from: https://nutch.wordpress.com/
Multi-label Hierarchical Classification
Figure panels: HMC with DAG; HMC with Tree (each maps an item to labels L)
Label-by-term table (columns: term1 term2 term3 term4 term5 term6 term7), values per label:
label1: 5, 5, 25, 25
label2: 75, 75, 5
label3: 75, 25
label4: 5, 25, 25, 5, 93, 25
label5: 95, 60, 5
label6: 60, 95, 90
label7: 5, 98, 5, 60, 25, 79
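
To illustrate the difference between the tree and DAG variants of hierarchical multi-label classification, here is a minimal Python sketch. It assumes the usual hierarchy constraint (an item tagged with a label is also tagged with that label's ancestors); the slides do not state this rule explicitly. The label names, the `root` node, and both parent maps are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical hierarchies: each label maps to its direct parents.
# In a tree every label has at most one parent; a DAG allows several.
TREE_PARENTS: Dict[str, List[str]] = {
    "root": [],
    "label1": ["root"],
    "label2": ["root"],
    "label3": ["label1"],
}
DAG_PARENTS: Dict[str, List[str]] = {
    "root": [],
    "label1": ["root"],
    "label2": ["root"],
    "label3": ["label1", "label2"],  # two parents: only possible in a DAG
}


def is_tree(parents: Dict[str, List[str]]) -> bool:
    """True when no label has more than one direct parent."""
    return all(len(direct) <= 1 for direct in parents.values())


def with_ancestors(labels: Set[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Return the given labels plus every ancestor reachable via parent links."""
    closed: Set[str] = set()
    stack = list(labels)
    while stack:
        label = stack.pop()
        if label in closed:
            continue  # already expanded (a DAG can reach a node twice)
        closed.add(label)
        stack.extend(parents.get(label, []))
    return closed


if __name__ == "__main__":
    print(sorted(with_ancestors({"label3"}, TREE_PARENTS)))
    print(sorted(with_ancestors({"label3"}, DAG_PARENTS)))
```

In the tree, `label3` pulls in a single chain of ancestors; in the DAG, the same label pulls in both `label1` and `label2`.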
Each document is represented as a vector of the terms it contains (Lucene).
The classifier outputs a vector of labels (relevant concepts of the ontology) for each item.
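
The input and output representations described above can be sketched as follows. This is an illustration only: a plain Python tokenizer stands in for Lucene, and the vocabulary, the label names, the term-to-label weights, the max-based scoring rule, and the `threshold` parameter are all assumptions, not the classifier presented in the slides.

```python
import re
from collections import Counter
from typing import Dict, List

# Hypothetical vocabulary and term-to-label weights (not taken from the slides).
VOCABULARY: List[str] = ["crawler", "ontology", "label", "graph"]
LABEL_WEIGHTS: Dict[str, Dict[str, float]] = {
    "WebCrawling": {"crawler": 0.9, "graph": 0.4},
    "KnowledgeRepresentation": {"ontology": 0.8, "label": 0.3},
}


def term_vector(text: str, vocabulary: List[str]) -> List[int]:
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in vocabulary]


def label_vector(
    terms: List[int],
    vocabulary: List[str],
    label_weights: Dict[str, Dict[str, float]],
    threshold: float = 0.5,
) -> Dict[str, int]:
    """Mark a label 1 when any term present in the item has a weight >= threshold."""
    present = {term for term, count in zip(vocabulary, terms) if count > 0}
    result: Dict[str, int] = {}
    for label, weights in label_weights.items():
        # Assumed scoring rule: strongest weight among the terms that occur.
        score = max((w for t, w in weights.items() if t in present), default=0.0)
        result[label] = 1 if score >= threshold else 0
    return result


if __name__ == "__main__":
    doc = "The crawler follows the link graph."
    vec = term_vector(doc, VOCABULARY)
    print(vec)
    print(label_vector(vec, VOCABULARY, LABEL_WEIGHTS))
```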
Diligenti, et al., 2000. Focused Crawling Using Context Graphs. In VLDB (pp. 527-534).
Figure legend: Relevant, Irrelevant, Inlink, Graph layers
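
As a rough illustration of the graph layers in the legend above, the sketch below groups pages by their inlink distance from a target page, in the spirit of the context graphs of Diligenti et al. It is not the authors' implementation: the `INLINKS` map, the page names, and the `max_depth` parameter are hypothetical.

```python
from collections import deque
from typing import Dict, List

# Hypothetical inlink map: page -> pages that link to it.
INLINKS: Dict[str, List[str]] = {
    "target": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["a"],
}


def context_layers(
    target: str, inlinks: Dict[str, List[str]], max_depth: int
) -> List[List[str]]:
    """Breadth-first walk over inlinks; layer i holds pages i links away from the target."""
    depth = {target: 0}
    queue = deque([target])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # do not expand beyond the last requested layer
        for source in inlinks.get(page, []):
            if source not in depth:  # keep the shortest distance only
                depth[source] = depth[page] + 1
                queue.append(source)
    layers: List[List[str]] = [[] for _ in range(max_depth + 1)]
    for page, d in depth.items():
        layers[d].append(page)
    return [sorted(layer) for layer in layers]


if __name__ == "__main__":
    for i, layer in enumerate(context_layers("target", INLINKS, 2)):
        print(i, layer)
```

Layer 0 holds the target itself; each further layer holds pages one more link away, which gives a crawler an estimate of how far a page is from relevant content.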
Thank you!