

SLIDE 1

Hierarchical Link Analysis for Ranking Web Data

Renaud Delbru, Nickolai Toupikov, Michele Catasta, Giovanni Tummarello, and Stefan Decker

Digital Enterprise Research Institute, Galway

June 1, 2010

SLIDE 2

Introduction Web Data Model The DING Model Experimental Results Scalability Conclusion

Introduction

Web of Data

The number of web data sources is growing rapidly ...

Linked Open Data cloud; Open Graph protocol; e-commerce (GoodRelations), e-government, ...

How to search and retrieve relevant information?

A single query can return millions of entities ...
... and users expect only the most relevant ones.
Web data search engines (e.g., Sindice) need an effective way to rank entities.
Partial solution: popularity-based entity ranking.

SLIDE 3

Link Analysis on the Web

Link Analysis

Given a directed graph, determine the popularity of its nodes using link information.
A link from a node i to a node j is considered as evidence of the importance of node j.

Link Analysis for Web Documents

PageRank considers link structure exclusively.
Hierarchical Link Analysis considers both link structure and hierarchical structure.

Link Analysis for Web Data

Current approaches consider link structure exclusively.
Sindice: dataset/entity-centric view.

SLIDE 8

Outline: Web Data Model

Web Data Model
  Web Data Graph
  Dataset Graph
  Internal and External Node
  Intra and Inter-Dataset Edge
  Linkset
  Two-Layer Model
  Quantifying the Two-Layer Model

SLIDE 9

Web Data Graph

Figure: Web data graph

SLIDE 10

Dataset Graph

Figure: Dataset graph

SLIDE 11

Internal and External Node

Figure: Internal (red) and external nodes (blue)

SLIDE 12

Intra and Inter-Dataset Edge

Figure: Inter-dataset (orange) and intra-dataset (black) edges

SLIDE 13

Linkset

Figure: Linkset

SLIDE 14

Two-Layer Model

Figure: Two-layer model of the Web of Data

SLIDE 15

Quantifying the two-layer model

Datasets

DBpedia: 17.7 million entities
Citeseer (RKBExplorer): 2.48 million entities
Geonames: 13.8 million entities
Sindice: 60 million entities among 50,000 datasets

Dataset    Intra           Inter
DBpedia    88M (93.2%)     6.4M (6.8%)
Citeseer   12.9M (77.7%)   3.7M (22.3%)
Geonames   59M (98.3%)     1M (1.7%)
Sindice    287M (78.8%)    77M (21.2%)

Table: Ratio intra / inter dataset links
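
The intra/inter split in the table can be illustrated with a minimal sketch. It assumes, purely for illustration, that a dataset is identified by the host name of an entity URI; the helper names `dataset_of` and `intra_inter_ratio` and the example URIs are hypothetical and not taken from the slides.

```python
from urllib.parse import urlparse


def dataset_of(uri: str) -> str:
    """Assumption: the dataset of an entity is the host name of its URI."""
    return urlparse(uri).netloc


def intra_inter_ratio(links):
    """Return (intra share, inter share) for (source, label, target) links."""
    intra = inter = 0
    for source, _label, target in links:
        if dataset_of(source) == dataset_of(target):
            intra += 1  # both ends live in the same dataset
        else:
            inter += 1  # the link crosses a dataset boundary
    total = intra + inter
    if total == 0:
        return 0.0, 0.0
    return intra / total, inter / total


# Hypothetical toy graph: three intra-dataset links, one inter-dataset link.
links = [
    ("http://a.example/e1", "knows", "http://a.example/e2"),
    ("http://a.example/e2", "knows", "http://a.example/e3"),
    ("http://a.example/e3", "sameAs", "http://b.example/x"),
    ("http://b.example/x", "seeAlso", "http://b.example/y"),
]
intra_share, inter_share = intra_inter_ratio(links)
```

On real data the same counting, applied per dataset, would give the percentages reported in the table.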

SLIDE 16

Outline: The DING Model

The DING Model
  Overview
  Unsupervised Link Weighting
  Computing DatasetRank
  Computing Local EntityRank
  Combining Dataset Rank and Entity Rank

SLIDE 17

The DING Model: Overview

DING Principles

DING performs entity ranking in three steps:

1. dataset ranks are computed by performing link analysis on the top layer (i.e. the dataset graph);
2. for each dataset, entity ranks are computed by performing link analysis on the local entity collection;
3. the popularity of the dataset is propagated to its entities and combined with their local ranks to estimate a global entity rank.

SLIDE 21

Unsupervised Link Weighting

Intuition

TF-IDF applied on link labels

Link Frequency - Inverse Dataset Frequency (LF-IDF)

Link weighting factor $w_{\sigma,i,j}$
Assign low weight to very common links, such as rdfs:seeAlso

$$ w_{\sigma,i,j} = LF(L_{\sigma,i,j}) \times IDF(\sigma) = \frac{|L_{\sigma,i,j}|}{\sum_{L_{\tau,i,k}} |L_{\tau,i,k}|} \times \log \frac{N}{1 + freq(\sigma)} $$
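
A minimal sketch of the LF-IDF weighting follows. The slide does not define N or freq(σ); by analogy with TF-IDF, the sketch assumes N is the total number of datasets and freq(σ) is the number of datasets that use label σ on an outgoing link. The function name `lf_idf` and the toy linksets are hypothetical.

```python
import math
from collections import defaultdict


def lf_idf(linksets, num_datasets):
    """linksets maps (label, source dataset, target dataset) to a link count.

    Returns a dict with the same keys holding the LF-IDF weight.
    """
    out_total = defaultdict(int)       # all links leaving dataset i
    label_datasets = defaultdict(set)  # datasets using a label (assumed freq)
    for (label, src, _dst), size in linksets.items():
        out_total[src] += size
        label_datasets[label].add(src)

    weights = {}
    for (label, src, dst), size in linksets.items():
        lf = size / out_total[src]
        idf = math.log(num_datasets / (1 + len(label_datasets[label])))
        weights[(label, src, dst)] = lf * idf
    return weights


# Hypothetical linksets: rdfs:seeAlso is used by four datasets, foaf:knows by one.
linksets = {
    ("rdfs:seeAlso", "A", "B"): 50,
    ("foaf:knows", "A", "B"): 50,
    ("rdfs:seeAlso", "C", "B"): 10,
    ("rdfs:seeAlso", "D", "A"): 5,
    ("rdfs:seeAlso", "E", "A"): 5,
}
weights = lf_idf(linksets, num_datasets=10)
```

Both linksets leaving dataset A have the same link frequency (0.5), so the difference in their weights comes only from the IDF term, which penalises the widely used rdfs:seeAlso label.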

SLIDE 27

Computing Dataset Rank

Assumption

Dataset surfing behaviour is the same as the web page surfing behaviour in PageRank

DatasetRank

Weighted PageRank on the weighted dataset graph
Distribution factor $w_{\sigma,i,j}$ is defined by LF-IDF
Probability of random jump is proportional to the size of a dataset

$$ r^{k}(D_j) = \alpha \sum_{L_{\sigma,i,j}} r^{k-1}(D_i) \, w_{\sigma,i,j} + (1 - \alpha) \frac{|E_{D_j}|}{\sum_{D \in G} |E_D|} $$
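
The DatasetRank iteration can be sketched as follows. This is an illustration of the formula as written, not the authors' implementation: the damping value 0.85, the fixed iteration count, the absence of renormalisation, and the function name `dataset_rank` are all assumptions.

```python
def dataset_rank(weights, sizes, alpha=0.85, iterations=50):
    """Iterate r^k(D_j) = alpha * sum(r^{k-1}(D_i) * w) + (1 - alpha) * jump(D_j).

    weights: dict (label, source dataset, target dataset) -> linkset weight
    sizes:   dict dataset -> number of entities |E_D|
    """
    total_entities = sum(sizes.values())
    # The random jump probability is proportional to the dataset size.
    jump = {d: n / total_entities for d, n in sizes.items()}
    rank = dict(jump)  # assumed starting point
    for _ in range(iterations):
        new_rank = {d: (1 - alpha) * jump[d] for d in sizes}
        for (_label, src, dst), w in weights.items():
            new_rank[dst] += alpha * rank[src] * w
        rank = new_rank
    return rank


# Hypothetical example: three datasets of equal size, two linksets into B.
sizes = {"A": 100, "B": 100, "C": 100}
weights = {("foaf:knows", "A", "B"): 0.8, ("dc:creator", "C", "B"): 0.5}
ranks = dataset_rank(weights, sizes)
```

Because the three datasets have the same size, any difference in rank here comes from the incoming weighted linksets, so B ends up above A and C.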

SLIDE 28

Computing Local EntityRank

Generic Algorithms

Weighted EntityRank: weighted PageRank applied to the internal entities and intra-dataset links of a dataset.
Weighted LinkCount: weighted in-degree (link counting) applied to the internal entities and intra-dataset links of a dataset.
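
A minimal sketch of Weighted LinkCount is given below. The slides do not say how entity-level links are weighted, so the per-label weight table, the default weight of 1.0, the normalisation to a sum of 1, and the name `weighted_link_count` are assumptions made for illustration.

```python
from collections import defaultdict


def weighted_link_count(intra_links, label_weight):
    """Score each internal entity by the summed weight of its incoming intra-dataset links."""
    scores = defaultdict(float)
    for source, label, target in intra_links:
        scores[source] += 0.0  # make sure entities without in-links are listed
        scores[target] += label_weight.get(label, 1.0)
    total = sum(scores.values())
    # Assumed normalisation so that local ranks sum to 1 within the dataset.
    return {e: (s / total if total else 0.0) for e, s in scores.items()}


# Hypothetical dataset with three entities and three intra-dataset links.
intra_links = [
    ("e1", "foaf:knows", "e3"),
    ("e2", "foaf:knows", "e3"),
    ("e3", "rdfs:seeAlso", "e1"),
]
local_ranks = weighted_link_count(
    intra_links, {"foaf:knows": 1.0, "rdfs:seeAlso": 0.2}
)
```

The computation is a single pass over the links, in contrast to the repeated passes that a PageRank-style EntityRank needs.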

SLIDE 29

Combining Dataset Rank and Entity Rank

Naive approach

Purely probabilistic point of view: joint probability
Assumption: independent events
Global score: $r_g(e) = P(e \cap D) = r(e) \times r(D)$
Problem: favours smaller datasets

DING Approach

Add a local entity rank factor
Normalise local ranks to the same average based on dataset size

$$ r_g(e) = r(D) \times r(e) \times \frac{|E_D|}{\sum_{D' \in G} |E_{D'}|} $$
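
The bias of the naive score can be seen in a small worked example. If local ranks sum to 1 within each dataset, an average entity in a dataset of 10 entities has a local rank 100 times larger than an average entity in a dataset of 1,000 entities, so the naive product favours the small dataset. The sketch below uses hypothetical numbers (equal dataset ranks, uniform local ranks) and hypothetical function names to compare the two formulas.

```python
import math


def naive_global_rank(dataset_rank, local_rank):
    """Joint probability under independence: r_g(e) = r(e) * r(D)."""
    return dataset_rank * local_rank


def ding_global_rank(dataset_rank, local_rank, dataset_size, total_entities):
    """DING combination: r_g(e) = r(D) * r(e) * |E_D| / sum of |E_D'|."""
    return dataset_rank * local_rank * dataset_size / total_entities


# Two datasets with the same dataset rank but very different sizes.
small_size, big_size = 10, 1000
total = small_size + big_size
r_dataset = 0.5
# Uniform local ranks: every entity is equally popular inside its dataset.
r_small, r_big = 1 / small_size, 1 / big_size

naive_small = naive_global_rank(r_dataset, r_small)
naive_big = naive_global_rank(r_dataset, r_big)
ding_small = ding_global_rank(r_dataset, r_small, small_size, total)
ding_big = ding_global_rank(r_dataset, r_big, big_size, total)
```

With the naive product the entity from the small dataset scores 100 times higher; with the size factor the two average entities receive the same global score.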

SLIDE 30

Outline: Experimental Results

Experimental Results
  Overview
  User Study
  SemSearch 2010

SLIDE 31

Experimental Results: Overview

Link Analysis Methods

Global EntityRank (GER);
Local LinkCount (LLC) and Local EntityRank (LER);
Local algorithms combined with DatasetRank (DR-LLC and DR-LER).

Experiments

1. User study to qualitatively evaluate each method;
2. Semantic Search challenge.

SLIDE 32

User Study: Design

Exp-A

Local entity ranking (LER & LLC) on the DBpedia dataset
31 participants

Exp-B

DING (DR-LER & DR-LLC) on Sindice’s page-repository
58 participants

Task

10 queries (keyword and SPARQL queries)
One result list (top-10) per algorithm
Rate each algorithm relative to GER: Worse (W), Slightly Worse (SW), Similar (S), Slightly Better (SB), Better (B)

SLIDE 33

User Study: Questionnaire

Figure: One of the questionnaires given to the participants

SLIDE 34

User Study A: Results

(a) LER

Rate     Oi   Ei    %χ2
B        0    6.2   −13%
SB       7    6.2   +0%
S        21   6.2   +71%
SW       3    6.2   −3%
W        0    6.2   −13%
Totals   31   31

(b) LLC

Rate     Oi   Ei    %χ2
B        3    6.2   −12%
SB       8    6.2   +4%
S        13   6.2   +53%
SW       6    6.2   −0%
W        1    6.2   −31%
Totals   31   31

Table: Chi-square test for Exp-A. The column %χ2 gives, for each modality, its contribution to χ2 (in relative value).
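
The %χ2 column can be recomputed from the observed counts. The sketch below assumes a uniform expected count Ei = n / 5 and assumes that the sign marks whether the observed count lies above or below the expected one; under these assumptions it reproduces the LLC column. The function name `chi_square_contributions` is hypothetical.

```python
def chi_square_contributions(observed):
    """Return (chi-square statistic, signed relative contributions in percent)."""
    n = sum(observed)
    expected = n / len(observed)  # uniform expectation, e.g. 31 / 5 = 6.2
    terms = [(o - expected) ** 2 / expected for o in observed]
    chi2 = sum(terms)
    if chi2 == 0:
        return 0.0, [0.0 for _ in observed]
    signed = [
        (1 if o >= expected else -1) * 100 * t / chi2
        for o, t in zip(observed, terms)
    ]
    return chi2, signed


# Observed counts for LLC in Exp-A, in the order B, SB, S, SW, W.
chi2, contributions = chi_square_contributions([3, 8, 13, 6, 1])
rounded = [round(c) for c in contributions]
```

The same function applies to the Exp-B counts with n = 58 and Ei = 11.6.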

Conclusion

LER and LLC both provide results similar to GER. However, a larger proportion of participants rate LER as similar to GER (21 of 31, against 13 of 31 for LLC).

SLIDE 35

User Study B: Results

(a) DR-LER

Rate     Oi   Ei     %χ2
B        12   11.6   +0%
SB       12   11.6   +0%
S        22   11.6   +57%
SW       9    11.6   −4%
W        3    11.6   −39%
Totals   58   58

(b) DR-LLC

Rate     Oi   Ei     %χ2
B        7    11.6   −9%
SB       24   11.6   +65%
S        13   11.6   +1%
SW       10   11.6   −1%
W        4    11.6   −24%
Totals   58   58

Table: Chi-square test for Exp-B. The column %χ2 gives, for each modality, its contribution to χ2 (in relative value).

Conclusion

DR-LLC appears to be the more effective of the two. A large proportion of participants (24 of 58) find it slightly better than GER, and this is reinforced by the small number (4 of 58) who find it worse.

SLIDE 36

SemSearch 2010: Entity Search Track

SemSearch 2010

First semantic search evaluation; Focus on entity search.

Experiment Design

Billion Triple Challenge 2009 dataset; 92 keyword queries; Relevance judgement on top 10 entities.

SLIDE 37

SemSearch 2010: Experiment Results

Figure: SemSearch 2010 evaluation results

SLIDE 38

Scalability: Computing Dataset Rank

Graph      Node   Edge
Web Data   60M    364M
Dataset    50K    1.2M

Table: Graph Size

DatasetRank

1 iteration ≈ 200 ms; good-quality ranks within a few seconds.

SLIDE 39

Scalability: Dataset size distribution

Power-law distribution; the majority of the datasets contain fewer than 1,000 nodes.

SLIDE 40

Scalability: Computing Entity Rank

EntityRank

55 iterations of 1 minute each (for the DBpedia dataset).

LinkCount

Requires only 1 iteration; can be computed on the fly with an appropriate data index.

SLIDE 41

Dataset-Dependent Local EntityRank

Dataset Specific Algorithms

No reason to have one generic algorithm for all datasets;
We could choose an appropriate entity ranking algorithm for each dataset.

Graph Structure       Dataset                Algorithm
Generic, Controlled   DBpedia                LinkCount
Generic, Open         Social Communities     EntityRank
Hierarchical          Geonames, Taxonomies   DHC
Bipartite             DBLP                   CiteRank

Table: List of various graph structures with appropriate algorithms

SLIDE 44

Conclusion

DING Method

Hierarchical Link Analysis for web data; Quality comparable or even better than standard approaches; Lower computational complexity; Dataset-dependent local entity ranking.

Future Work

Investigate how to detect appropriate local entity ranking method for a dataset; Study query-dependent ranking and how it can be combined with DING ranking.
