Information Retrieval and Filtering over Self-Organising Digital Libraries



SLIDE 1

Information Retrieval and Filtering over Self-Organising Digital Libraries

Paraskevi Raftopoulou (1,2), Euripides G.M. Petrakis (2), Christos Tryfonopoulos (1), and Gerhard Weikum (1)

(1) Max-Planck Institute for Informatics, Saarbruecken, Germany
http://www.mpi-inf.mpg.de/

(2) Technical University of Crete, Chania, Greece
http://www.intelligence.tuc.gr/

SLIDE 2

ECDL Conference 2008 Aarhus, Denmark, 14-19 September 2008 2 of 32 Paraskevi Raftopoulou Max-Planck Institute for Informatics & Technical University of Crete

Outline

Motivating scenario
Background
iClusterDL
  • Architecture
  • Protocols
Experimental evaluation
Related work & outlook

SLIDE 3

Motivating scenario

SLIDE 4

Motivating scenario

Christos needs papers on information retrieval

[Figure: Christos poses the query “I want papers on information retrieval” and receives answers]

SLIDE 7

Motivating scenario

There are lots of DLs out there!
  • Why ask one or a few, when you could ask thousands?

Goal: Distributed resource sharing
  • Framework to provide IR and IF functionality on top of SONs
  • Integrate DLs, publishers and other networks seamlessly and with minimum effort
  • Speed up query processing

SLIDE 8

Background information

SLIDE 9

Background: IR vs IF

IR scenario:
  • A user poses a one-time query “I want papers on information retrieval”.
  • The system returns a list of pointers to matching resources (or the actual resources).

IF (or pub/sub or information dissemination) scenario:
  • A user posts a continuous query to receive a notification when a paper on “information retrieval” is published.
  • The system notifies the subscriber with a pointer to the matching resources (or the actual resources).
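The difference between the two scenarios can be sketched in a few lines of Python. This is only an illustration: the `ToyLibrary` class, its method names, and the keyword-containment matching are assumptions made for the example and are not part of the presented system.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLibrary:
    """Hypothetical in-memory library; matching is plain keyword containment."""
    papers: list[str] = field(default_factory=list)
    # continuous query -> notifications delivered so far
    subscriptions: dict[str, list[str]] = field(default_factory=dict)

    def query(self, q: str) -> list[str]:
        """IR: a one-time query is answered from what is stored right now."""
        return [p for p in self.papers if q.lower() in p.lower()]

    def subscribe(self, q: str) -> None:
        """IF: a continuous query is stored and waits for future publications."""
        self.subscriptions.setdefault(q, [])

    def publish(self, title: str) -> None:
        """Store the paper and notify every matching continuous query."""
        self.papers.append(title)
        for q, inbox in self.subscriptions.items():
            if q.lower() in title.lower():
                inbox.append(title)
```

A one-time query only sees papers published before it is posed, whereas a subscription only produces notifications for papers published after it is stored.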

SLIDE 10

Background: SONs

  • Virtually connected peers
  • Routing indices with links to other peers
  • Peers connected to each other are called neighbors
  • Provide semantic (and social) information about peers
  • Self-organising overlay networks
  • Support rich data models and expressive query languages

[Figure: peers p1 to p8 shown in the physical network and in the overlay network; a routing index RI4 lists p1 and p7]
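A routing index can be pictured as a small per-peer table of overlay links. The sketch below is illustrative only: the `Peer` class, the one-word interest labels, and the symmetric `connect` method are assumptions for the example, not the data structure used by the authors.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    name: str
    interest: str  # hypothetical one-word interest label
    # routing index: neighbor name -> that neighbor's interest
    routing_index: dict[str, str] = field(default_factory=dict)

    def connect(self, other: "Peer") -> None:
        """Create an overlay link in both directions (assumed symmetric)."""
        self.routing_index[other.name] = other.interest
        other.routing_index[self.name] = self.interest

    def neighbors(self) -> list[str]:
        """Peers reachable in one overlay hop, in name order."""
        return sorted(self.routing_index)


# Example: p4 keeps overlay links to p1 and p7 (interest labels are made up).
p1, p4, p7 = Peer("p1", "databases"), Peer("p4", "retrieval"), Peer("p7", "retrieval")
p4.connect(p1)
p4.connect(p7)
```

The overlay links are independent of the physical network: two neighbors in the routing index may be many physical hops apart.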

SLIDE 11

Background: Rewiring strategies

Techniques for self-organising peers:
  • abandon old connections and create new ones
  • periodic process

Inspired by the ‘small world effect’
  • reach anybody in a small number of routing hops

[Figure: clustered overlay with intra-cluster (short-range) links and inter-cluster (long-range) links]

SLIDE 12

iClusterDL architecture

SLIDE 13

iClusterDL basics

(i) intelligent + (Cluster) clustering + (DL) digital libraries = iClusterDL

Contributions:

Architecture and protocols to support both IR and IF
  • 2-level hierarchical (super-peer) P2P network
  • seamless and easy integration of DLs, scalable

Self-organising DLs based on SONs
  • support rich query models
  • benefits from loosely-connected peers

SLIDE 14

iClusterDL Architecture

Super-peer
  • Forms message routing layer
  • Runs a rewiring protocol
  • Serves clients and providers
  • stores continuous queries
  • stores resource publications
  • answers one-time queries
  • creates notifications
  • stores notifications

[Figure: network of super-peers (SP) integrating providers (P) such as Springer DL, ACM DL and CiteSeer, and clients (C); a super-peer is highlighted]

SLIDE 15

iClusterDL Architecture

Provider
  • Implemented by information sources
  • Used to expose source’s contents
  • Connects to iClusterDL network through a super-peer

[Figure: network of super-peers (SP) integrating providers (P) such as Springer DL, ACM DL and CiteSeer, and clients (C); a provider attached to its super-peer is highlighted]

SLIDE 16

iClusterDL Architecture

Client
  • Connects to iClusterDL network through a super-peer
  • Information consumers:
  • pose one-time queries
  • receive answers
  • subscribe to resource publications
  • receive notifications

[Figure: network of super-peers (SP) integrating providers (P) such as Springer DL, ACM DL and CiteSeer, and clients (C); a client requests a resource from a provider and the provider sends the resource]

SLIDE 17

iClusterDL Protocols

  • Super-peer join/leave
  • Super-peer rewiring
  • Client join (first time only)
  • Client connect/disconnect
  • Resource publication/indexing/removal/update
  • One-time query processing
  • Continuous query processing
  • Notification delivery (client online or offline)

SLIDE 18

Super-peer protocols

Basic idea: Organise super-peers in SONs. Make sure that similar super-peers are clustered together.

Two levels of clustering:
  • A provider peer clusters its documents and uses its interests to join the network.
  • A super-peer uses the interests of its providers to identify itself in the network and find other similar super-peers.

SLIDE 19

Super-peer rewiring

A super-peer s
  1. computes its intra-cluster similarity (average similarity with its short-range links)
  2. initiates rewiring if similarity < threshold θ
  3. sends a message (msg) with its interest to m neighbors

  • All super-peers receiving msg append their interest and forward msg to m neighbors
  • The message is sent back to s when TTL = 0
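The rewiring steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: interests are modelled as term-weight dictionaries, similarity is taken to be cosine similarity, the m neighbors are chosen at random, one `neighbors` table stands in for the short-range links, and the function names are hypothetical. How s uses the collected interests to pick new links is not covered by the slide and is left out.

```python
import math
import random

Vector = dict[str, float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse term-weight vectors (assumed measure)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def intra_cluster_similarity(s: str, interests: dict[str, Vector],
                             neighbors: dict[str, list[str]]) -> float:
    """Step 1: average similarity between s and its linked super-peers."""
    links = neighbors[s]
    if not links:
        return 0.0
    return sum(cosine(interests[s], interests[n]) for n in links) / len(links)


def rewiring_walk(s: str, interests: dict[str, Vector],
                  neighbors: dict[str, list[str]], theta: float,
                  m: int, ttl: int, rng: random.Random):
    """Steps 2-3: return the interests gathered by msg, or None if no rewiring starts."""
    if intra_cluster_similarity(s, interests, neighbors) >= theta:
        return None  # cluster is already similar enough
    collected = {s: interests[s]}  # msg starts with the interest of s
    frontier = [s]
    for _ in range(ttl):  # each round is one hop; msg returns to s when TTL = 0
        next_frontier = []
        for peer in frontier:
            # Random choice of m neighbors is an assumption of this sketch.
            targets = rng.sample(neighbors[peer], min(m, len(neighbors[peer])))
            for t in targets:
                collected[t] = interests[t]  # receiver appends its interest
                next_frontier.append(t)
        frontier = next_frontier
    return collected
```

With fanout m and rewiring TTL τR, a single walk touches at most m + m² + ... + m^τR super-peers, which bounds the message cost of one rewiring round.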

SLIDE 20

IR protocols

Basic idea: Index information in the SON. Make sure one-time queries meet similar publications.

Two levels of indexing:
  • Global (among all super-peers): Use a self-organising protocol.
  • Local (at each super-peer): Use a local index appropriate for the publication language.

SLIDE 21

One-time query processing

A super-peer s
  1. compares q against its interests and selects the interest int most similar to q
  2. if similarity ≥ threshold θ
     • forwards a message (msg) including q to all its short-range links
     • sends q to all similar providers stored in its provider table
  3. if similarity < threshold θ
     • forwards msg to the m of its neighbors most similar to q

All super-peers receiving msg repeat the same process.
The message is forwarded until TTL = 0.
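The query routing steps can be sketched as below. The sketch rests on several assumptions that are not stated on the slide: cosine similarity over term-weight dictionaries, a single TTL counter, duplicate suppression through a `visited` set, reuse of θ to decide which providers count as similar, and direct lookup of neighbor interests in a shared `network` table. All class and function names are hypothetical.

```python
import math
from collections import deque
from dataclasses import dataclass, field

Vector = dict[str, float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse term-weight vectors (assumed measure)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class SuperPeer:
    interests: list[Vector]
    short_links: list[str] = field(default_factory=list)
    long_links: list[str] = field(default_factory=list)
    providers: dict[str, Vector] = field(default_factory=dict)  # provider table

    def best_similarity(self, q: Vector) -> float:
        """Step 1: similarity of the interest most similar to q."""
        return max((cosine(q, i) for i in self.interests), default=0.0)


def route_query(network: dict[str, SuperPeer], start: str, q: Vector,
                theta: float, m: int, ttl: int) -> set[str]:
    """Return the names of the providers that receive q."""
    reached: set[str] = set()
    visited = {start}  # duplicate suppression is an assumption of this sketch
    queue = deque([(start, ttl)])
    while queue:
        name, remaining = queue.popleft()
        sp = network[name]
        if sp.best_similarity(q) >= theta:
            # Step 2: q reached a similar cluster; ask providers, then broadcast.
            reached.update(p for p, v in sp.providers.items() if cosine(q, v) >= theta)
            targets = sp.short_links
        else:
            # Step 3: forward towards the m neighbors most similar to q.
            candidates = sp.short_links + sp.long_links
            targets = sorted(candidates,
                             key=lambda n: network[n].best_similarity(q),
                             reverse=True)[:m]
        if remaining == 0:
            continue  # TTL exhausted: process locally but do not forward
        for t in targets:
            if t not in visited:
                visited.add(t)
                queue.append((t, remaining - 1))
    return reached
```

Outside a similar cluster the query takes a narrow, directed path (fanout m); once inside, it spreads over the short-range links, which is where most matching providers are expected to be.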

SLIDE 22

Experimental evaluation

SLIDE 23

Experimental Evaluation

Evaluated the protocols under different parameters:
  • Data corpus
  • Similarity threshold
  • Query TTL

Looked into the:
  • Network traffic
  • Recall

Corpus         Documents                  Categories
OHSUMED TREC   30,000 medical articles    10
TREC-6         556,000 documents          100

SLIDE 24

Experimental Evaluation

  • the start of the rewiring is randomly chosen from the time interval [0, 4K]
  • the periodicity is randomly selected from a normal distribution of 2K

SLIDE 25

Experimental Evaluation

Parameter               Symbol   Value
super-peers             N        2,000
short-range links       s        8
long-range links        l        4
similarity threshold    θ        0.9
rewiring TTL            τR       4
fixed forwarding TTL    τf       6
message fanout          m        2
broadcast TTL           τb       2

SLIDE 26

Experimental Evaluation

Recall for IR and IF

[Figure: recall (axis 0.1 to 0.8) against time units x 1000 (axis 4 to 6) for four series: IR - OHSUMED, IF - OHSUMED, IR - TREC-6, IF - TREC-6]

SLIDE 27

Experimental Evaluation

Recall for iClusterDL and Flooding using the same number of messages

SLIDE 28

Related work and outlook

SLIDE 29

Related Work

Semantic Overlay Networks
  • Initial approaches include: [KJ04], [SMZ03], [PMW07]
  • Based on the idea of small-world networks: [Smi04], [LLS04], [VSI06], DESENT

IR and IF in Digital Libraries
  • Content-based retrieval: [LC03], OverCite
  • Support both IR & IF functionality: P2PDIET, LibraRing, MinervaDL

SLIDE 30

Contributions summarised

First architecture to
  • unify IR & IF functionality in SONs
  • apply SONs to Digital Library domain to support scalability

An architecture that is
  • automatic: requires no intervention
  • general: works for any type of data
  • adaptive: adjusts to changes of DL contents
  • efficient: offers fast query processing
  • accurate: achieves high recall

SLIDE 31

Open Problems

The effect of different system parameters on:
  • clustering performance
  • retrieval performance

Dynamic peer content

Peer churn

SLIDE 32

Acknowledgements - Funding

  • EU project Aeolus
  • Heraclitus (Greek Government PhD Fellowship Program)