Hierarchical Link Analysis for Ranking Web Data
Renaud Delbru, Nickolai Toupikov, Michele Catasta, Giovanni Tummarello, and Stefan Decker
Digital Enterprise Research Institute, Galway
June 1, 2010
Introduction Web Data Model The DING Model Experimental Results Scalability Conclusion
Introduction
Web of Data
The number of web data sources is growing rapidly:
Linked Open Data cloud; Open Graph protocol; e-commerce (GoodRelations); e-government; ...
How can we search and retrieve relevant information?
A single query can return millions of entities ...
... and users expect only the most relevant ones.
Web data search engines (e.g., Sindice) need an effective way to rank entities.
Partial solution: popularity-based entity ranking.
Link Analysis on the Web
Link Analysis
Given a directed graph, determine the popularity of its nodes using link information.
A link from a node i to a node j is considered evidence of the importance of node j.
Link Analysis for Web Documents
PageRank considers link structure exclusively.
Hierarchical link analysis considers both link structure and hierarchical structure.
Link Analysis for Web Data
Current approaches consider link structure exclusively.
Sindice: dataset/entity-centric view.
Outline: Web Data Model
Web Data Model
  Web Data Graph
  Dataset Graph
  Internal and External Node
  Intra and Inter-Dataset Edge
  Linkset
  Two-Layer Model
  Quantifying the Two-Layer Model
Web Data Graph
Figure: Web data graph
Dataset Graph
Figure: Dataset graph
Internal and External Node
Figure: Internal (red) and external nodes (blue)
Intra and Inter-Dataset Edge
Figure: Inter-dataset (orange) and intra-dataset (black) edges
Linkset
Figure: Linkset
Two-Layer Model
Figure: Two-layer model of the Web of Data
Quantifying the two-layer model
Datasets
DBpedia: 17.7 million entities
Citeseer (RKBExplorer): 2.48 million entities
Geonames: 13.8 million entities
Sindice: 60 million entities across 50,000 datasets

Dataset    Intra            Inter
DBpedia    88M (93.2%)      6.4M (6.8%)
Citeseer   12.9M (77.7%)    3.7M (22.3%)
Geonames   59M (98.3%)      1M (1.7%)
Sindice    287M (78.8%)     77M (21.2%)
Table: Ratio intra / inter dataset links
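The percentages in the table can be recomputed from the raw link counts. The short sketch below does only that arithmetic; the dictionary and function names are illustrative and are not part of the original slides.

```python
# Link counts in millions, copied from the table above.
link_counts = {
    "DBpedia": (88.0, 6.4),
    "Citeseer": (12.9, 3.7),
    "Geonames": (59.0, 1.0),
    "Sindice": (287.0, 77.0),
}


def intra_inter_share(intra, inter):
    """Return (intra %, inter %) of all links, rounded to one decimal."""
    total = intra + inter
    return round(100 * intra / total, 1), round(100 * inter / total, 1)


shares = {name: intra_inter_share(*counts) for name, counts in link_counts.items()}
for name, (intra_pct, inter_pct) in shares.items():
    print(f"{name}: {intra_pct}% intra, {inter_pct}% inter")
```

In every collection the intra-dataset links dominate, which is the observation that motivates treating each dataset as a unit in the two-layer model.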
Outline: The DING Model
The DING Model
  Overview
  Unsupervised Link Weighting
  Computing DatasetRank
  Computing Local EntityRank
  Combining Dataset Rank and Entity Rank
The DING Model: Overview
DING Principles
DING performs entity ranking in three steps:
1. Dataset ranks are computed by performing link analysis on the top layer (i.e., the dataset graph).
2. For each dataset, entity ranks are computed by performing link analysis on the local entity collection.
3. The popularity of the dataset is propagated to its entities and combined with their local ranks to estimate a global entity rank.
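Step 1 works on the dataset graph, which has to be derived from entity-level links. The following is a minimal sketch of that aggregation, assuming every entity belongs to exactly one dataset; the function name `build_linksets`, the entity identifiers and the sample labels are hypothetical and only serve the illustration.

```python
from collections import Counter


def build_linksets(edges, dataset_of):
    """Aggregate inter-dataset entity links into linksets.

    edges:      iterable of (source_entity, label, target_entity)
    dataset_of: dict mapping each entity to its dataset
                (assumption: one dataset per entity)
    Returns a Counter keyed by (label, source_dataset, target_dataset);
    each value is the number of entity links in that linkset.
    """
    linksets = Counter()
    for src, label, dst in edges:
        d_src, d_dst = dataset_of[src], dataset_of[dst]
        if d_src != d_dst:  # intra-dataset edges stay in the bottom layer
            linksets[(label, d_src, d_dst)] += 1
    return linksets


# Hypothetical toy data.
dataset_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
edges = [
    ("a1", "rdfs:seeAlso", "b1"),
    ("a2", "rdfs:seeAlso", "b1"),
    ("a1", "foaf:knows", "a2"),  # intra-dataset link, not part of any linkset
    ("b1", "owl:sameAs", "c1"),
]
linksets = build_linksets(edges, dataset_of)
```

Steps 2 and 3 then operate on the intra-dataset edges that this aggregation leaves out.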
Unsupervised Link Weighting
Intuition
TF-IDF applied on link labels
Link Frequency - Inverse Dataset Frequency (LF-IDF)
Link weighting factor w_{σ,i,j}
Assign a low weight to very common links, such as rdfs:seeAlso.
\[
w_{\sigma,i,j} = LF(L_{\sigma,i,j}) \times IDF(\sigma)
              = \frac{|L_{\sigma,i,j}|}{\sum_{L_{\tau,i,k}} |L_{\tau,i,k}|} \times \log \frac{N}{1 + \mathit{freq}(\sigma)}
\]
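A minimal sketch of the LF-IDF weighting follows. It assumes that N is the number of datasets and that freq(σ) counts the datasets whose outgoing linksets use the label σ; the slides do not spell out either definition, so both are assumptions, as are the function name `lf_idf` and the toy data.

```python
import math
from collections import defaultdict


def lf_idf(linksets, n_datasets):
    """Compute an LF-IDF weight for every linkset.

    linksets:   dict {(label, source_dataset, target_dataset): number of links}
    n_datasets: N in the formula (assumed: total number of datasets)
    """
    out_total = defaultdict(int)        # sum of |L_{tau,i,k}| per source dataset i
    label_datasets = defaultdict(set)   # source datasets using each label (assumed freq)
    for (label, src, _dst), size in linksets.items():
        out_total[src] += size
        label_datasets[label].add(src)

    weights = {}
    for (label, src, dst), size in linksets.items():
        lf = size / out_total[src]
        idf = math.log(n_datasets / (1 + len(label_datasets[label])))
        weights[(label, src, dst)] = lf * idf
    return weights


# Hypothetical linksets among 10 datasets.
linksets = {
    ("rdfs:seeAlso", "A", "B"): 30,
    ("foaf:knows", "A", "C"): 10,
    ("rdfs:seeAlso", "B", "C"): 5,
    ("rdfs:seeAlso", "C", "A"): 5,
}
weights = lf_idf(linksets, n_datasets=10)
```

With this reading of freq(σ), a label used by nearly every dataset gets an IDF close to zero, and it can even turn slightly negative when the label appears in all N datasets; a production implementation would probably clamp that case.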
Computing Dataset Rank
Assumption
Dataset surfing behaviour is the same as the web page surfing behaviour in PageRank
DatasetRank
Weighted PageRank on the weighted dataset graph.
The distribution factor w_{σ,i,j} is defined by LF-IDF.
The probability of a random jump is proportional to the size of a dataset.
\[
r^{k}(D_j) = \alpha \sum_{L_{\sigma,i,j}} r^{k-1}(D_i)\, w_{\sigma,i,j} + (1 - \alpha) \frac{|E_{D_j}|}{\sum_{D \in G} |E_D|}
\]
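The DatasetRank iteration can be sketched as a plain power iteration. This is an illustration under stated assumptions, not the authors' implementation: outgoing weights are renormalised per source dataset so that they form a distribution, datasets without outgoing links simply lose their rank mass, and the damping value 0.85 and the name `dataset_rank` are choices made for this example.

```python
def dataset_rank(weights, sizes, alpha=0.85, iterations=50):
    """Weighted PageRank over the dataset graph.

    weights: dict {(label, source_dataset, target_dataset): w}
    sizes:   dict {dataset: number of entities |E_D|}
    """
    total_entities = sum(sizes.values())
    # Random jump probability proportional to dataset size.
    jump = {d: n / total_entities for d, n in sizes.items()}

    # Assumption: normalise outgoing weights per source dataset.
    out_sum = {}
    for (_label, src, _dst), w in weights.items():
        out_sum[src] = out_sum.get(src, 0.0) + w

    rank = dict(jump)  # start from the size-proportional distribution
    for _ in range(iterations):
        new_rank = {d: (1 - alpha) * jump[d] for d in sizes}
        for (_label, src, dst), w in weights.items():
            new_rank[dst] += alpha * rank[src] * w / out_sum[src]
        rank = new_rank
    return rank


# Hypothetical three-dataset graph in which every dataset has an outgoing link.
sizes = {"A": 1000, "B": 100, "C": 10}
weights = {
    ("p", "A", "B"): 1.0,
    ("p", "B", "A"): 0.5,
    ("q", "B", "C"): 0.5,
    ("p", "C", "A"): 1.0,
}
ranks = dataset_rank(weights, sizes)
```

Because the dataset graph is small compared with the entity graph, each pass over `weights` is cheap, which is consistent with the short iteration time reported in the scalability section.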
Computing Local EntityRank
Generic Algorithms
Weighted EntityRank: weighted PageRank applied to the internal entities and intra-dataset links of a dataset.
Weighted LinkCount: weighted in-degree (link counting) computed over the internal entities and intra-dataset links of a dataset.
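Weighted LinkCount needs only a single pass over the intra-dataset links. The sketch below assumes that each link contributes a weight determined by its label and that unknown labels default to 1.0; the slides do not specify the weighting, so that default, the function name `weighted_link_count` and the sample labels are hypothetical.

```python
from collections import defaultdict


def weighted_link_count(intra_edges, label_weight):
    """Score each entity by the summed weight of its incoming intra-dataset links.

    intra_edges:  iterable of (source_entity, label, target_entity)
    label_weight: dict {label: weight}; missing labels count as 1.0 (assumption)
    """
    scores = defaultdict(float)
    for src, label, dst in intra_edges:
        scores[src] += 0.0  # make sure entities without in-links appear with score 0
        scores[dst] += label_weight.get(label, 1.0)
    return dict(scores)


# Hypothetical intra-dataset links.
intra_edges = [
    ("e1", "dbo:birthPlace", "e3"),
    ("e2", "dbo:birthPlace", "e3"),
    ("e3", "rdfs:seeAlso", "e1"),
]
label_weight = {"dbo:birthPlace": 1.0, "rdfs:seeAlso": 0.1}
local_scores = weighted_link_count(intra_edges, label_weight)
```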
Combining Dataset Rank and Entity Rank
Naive approach
Purely probabilistic point of view: joint probability.
Assumption: independent events.
Global score:
\[
r_g(e) = P(e \cap D) = r(e) \cdot r(D)
\]
Problem: favours smaller datasets.
DING Approach
Add a local entity rank factor;
Normalise local ranks to the same average based on dataset size:
\[
r_g(e) = r(D) \cdot r(e) \cdot \frac{|E_D|}{\sum_{D' \in G} |E_{D'}|}
\]
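A small numeric sketch shows why the naive product favours small datasets and how the size factor compensates. The two datasets, their equal dataset ranks and the uniform local ranks are hypothetical values chosen for illustration.

```python
def naive_global_rank(r_dataset, r_local):
    """Joint probability under the independence assumption."""
    return r_dataset * r_local


def ding_global_rank(r_dataset, r_local, dataset_size, total_entities):
    """DING combination: local rank rescaled by the relative dataset size."""
    return r_dataset * r_local * dataset_size / total_entities


# Two datasets with the same dataset rank and uniform local ranks.
big_size, small_size = 1000, 10
total = big_size + small_size
r_dataset = 0.5
r_local_big = 1 / big_size      # local ranks sum to 1 inside each dataset
r_local_small = 1 / small_size

naive_big = naive_global_rank(r_dataset, r_local_big)
naive_small = naive_global_rank(r_dataset, r_local_small)
ding_big = ding_global_rank(r_dataset, r_local_big, big_size, total)
ding_small = ding_global_rank(r_dataset, r_local_small, small_size, total)

print(naive_big, naive_small)
print(ding_big, ding_small)
```

In the naive scheme an average entity of the 10-entity dataset scores 100 times higher than an average entity of the 1000-entity dataset, purely because local ranks are spread over fewer entities. With the size factor, both average entities receive the same global score.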
Outline: Experimental Results
Experimental Results
  Overview
  User Study
  SemSearch 2010
Experimental Results: Overview
Link Analysis Methods
Global EntityRank (GER);
Local LinkCount (LLC) and Local EntityRank (LER);
Local algorithms combined with DatasetRank (DR-LLC and DR-LER).
Experiments
1. User study to evaluate each method qualitatively;
2. Semantic Search challenge.
User Study: Design
Exp-A
Local entity ranking (LER & LLC) on the DBpedia dataset
31 participants
Exp-B
DING (DR-LER & DR-LLC) on Sindice’s page repository
58 participants
Task
10 queries (keyword and SPARQL queries)
One result list (top-10) per algorithm
Rate each algorithm relative to GER: Worse (W), Slightly Worse (SW), Similar (S), Slightly Better (SB), Better (B)
User Study: Questionnaire
Figure: One of the questionnaires given to the participants
User Study A: Results
(a) LER
Rate     O_i   E_i    %χ²
B        0     6.2    −13%
SB       7     6.2    +0%
S        21    6.2    +71%
SW       3     6.2    −3%
W        0     6.2    −13%
Totals   31    31
(b) LLC
Rate     O_i   E_i    %χ²
B        3     6.2    −12%
SB       8     6.2    +4%
S        13    6.2    +53%
SW       6     6.2    −0%
W        1     6.2    −31%
Totals   31    31
Table: Chi-square test for Exp-A. The column %χ2 gives, for each modality, its contribution to χ2 (in relative value).
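The %χ² column can be reproduced from the observed counts. The sketch below assumes a uniform expected distribution over the five ratings (31 / 5 = 6.2) and assumes that the sign marks whether the observed count lies above or below the expected one; that sign convention is inferred from the table rather than stated in the slides. The function name is illustrative.

```python
def chi_square_contributions(observed):
    """Return (chi2, {rating: signed contribution in percent}).

    observed: dict {rating: observed count O_i}
    Expected counts E_i are assumed uniform across ratings.
    """
    n = sum(observed.values())
    expected = n / len(observed)
    terms = {k: (o - expected) ** 2 / expected for k, o in observed.items()}
    chi2 = sum(terms.values())
    signed = {
        k: (1 if observed[k] >= expected else -1) * 100 * t / chi2
        for k, t in terms.items()
    }
    return chi2, signed


# Observed counts for LLC in Exp-A, taken from table (b).
llc_observed = {"B": 3, "SB": 8, "S": 13, "SW": 6, "W": 1}
chi2, contributions = chi_square_contributions(llc_observed)
for rating, pct in contributions.items():
    print(f"{rating}: {pct:+.0f}%")
```

For LLC the statistic comes out at χ² = 14.0, and the rounded contributions match the table: the "Similar" rating accounts for about 53% of the deviation from a uniform distribution.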
Conclusion
LER and LLC provide results similar to GER. However, a larger proportion of participants rated LER as similar to GER (21 of 31, against 13 of 31 for LLC).
User Study B: Results
(a) DR-LER
Rate     O_i   E_i     %χ²
B        12    11.6    +0%
SB       12    11.6    +0%
S        22    11.6    +57%
SW       9     11.6    −4%
W        3     11.6    −39%
Totals   58    58
(b) DR-LLC
Rate     O_i   E_i     %χ²
B        7     11.6    −9%
SB       24    11.6    +65%
S        13    11.6    +1%
SW       10    11.6    −1%
W        4     11.6    −24%
Totals   58    58
Table: Chi-square test for Exp-B. The column %χ2 gives, for each modality, its contribution to χ2 (in relative value).
Conclusion
DR-LLC appears to be the more effective method. A large proportion of participants (24 of 58) found it slightly better than GER, and only a few (4 of 58) found it worse.
SemSearch 2010: Entity Search Track
SemSearch 2010
First semantic search evaluation;
Focus on entity search.
Experiment Design
Billion Triple Challenge 2009 dataset;
92 keyword queries;
Relevance judgement on the top 10 entities.
SemSearch 2010: Experiment Results
Figure: SemSearch 2010 evaluation results
Scalability: Computing Dataset Rank
Graph      Nodes   Edges
Web Data   60M     364M
Dataset    50K     1.2M
Table: Graph Size
DatasetRank
1 iteration ≈ 200 ms;
Good-quality ranks within a few seconds.
Scalability: Dataset size distribution
Power-law distribution;
The majority of the datasets contain fewer than 1000 nodes.
Scalability: Computing Entity Rank
EntityRank
55 iterations of 1 minute each (for the DBpedia dataset).
LinkCount
Requires only 1 iteration;
Can be computed on the fly with an appropriate data index.
Dataset-Dependent Local EntityRank
Dataset Specific Algorithms
No reason to have one generic algorithm for all datasets;
We could choose an appropriate entity ranking algorithm for each dataset.

Graph Structure       Dataset                Algorithm
Generic, Controlled   DBpedia                LinkCount
Generic, Open         Social Communities     EntityRank
Hierarchical          Geonames, Taxonomies   DHC
Bipartite             DBLP                   CiteRank
Table: List of various graph structures with appropriate algorithms
Conclusion
DING Method
Hierarchical link analysis for web data;
Quality comparable to or even better than standard approaches;
Lower computational complexity;
Dataset-dependent local entity ranking.
Future Work
Investigate how to detect the appropriate local entity ranking method for a dataset;
Study query-dependent ranking and how it can be combined with DING ranking.