Kikori-KS: An Effective and Efficient Keyword Search System for Digital Libraries in XML

Toshiyuki Shimizu (Kyoto University) Norimasa Terada (Nagoya University) Masatoshi Yoshikawa (Kyoto University)

ICADL 2006 29th November

2

Outline

Background
Summary of our Contribution
Kikori-KS
  User Interfaces
  Implementation of Keyword Search on Relational Databases
  Ranking Model
Experiments
Conclusions, Future Works

3

Background (1/2)

A large number of documents in digital libraries are now structured in XML
Growing demand for XML Information Retrieval (XML-IR) systems
  We can identify meaningful document fragments by encoding documents in XML
    ex) Sections, subsections and paragraphs in scholarly articles
  Browsing only document fragments relevant to a certain topic
Keyword search on XML documents
  Simple, intuitively understandable, yet useful form of queries, especially for unskilled end-users
  Users do not need to understand XML query languages and XML schema

4

Background (2/2)

For the keyword “database”

[Figure: XML tree of a sample article. Element nodes: article, title, transaction, body, section, title, p. Text values: "XML Index", "database", "Introduction", "…XML database", "XML Labeling", "Query processing…"]

6

Summary of our Contribution

We have developed Kikori-KS
  A prototype system for XML-IR
  Under Kikori Project
  Accepts Keyword Set as a query
User-friendly interface
  FetchHighlight interface
Storage schema on RDB
  The database schema is carefully designed
  Acceptable search time

8

Overview of Kikori-KS

[Figure: system architecture of Kikori-KS. Components: End User, User Interface Module, SQL Translation Module, Storage Module, RDB. Inputs: XML documents, set of keywords. Output: ranked relevant elements, shown as search results.]

10

User Interfaces

Search results of XML-IR are document fragments, which may be nested
The INEX 2005 project* defined three strategies for element retrieval
  Focussed: E1, E2, E3, ... (Ei does not overlap with Ej)
  Thorough: E1, E2, E3, ...
  FetchBrowse: D1: E11, E12, E13, ...; D2: E21, E22, ...
The three INEX strategies are not necessarily intended to be used in designing user interfaces
  FetchHighlight: D1: E11, E111, E112, E12, ...; D2: E21, ...

*http://inex.is.informatik.uni-duisburg.de/2005/

11

Retrieval Strategy of INEX (1/3)

Thorough
  Relevant elements are retrieved in descending order of their scores
  Result list, element (score): E1-3 (0.6), E1-8 (0.5), E1-10 (0.4), E2-10 (0.4), ...

[Figure: element trees of documents D1 (E1-1 to E1-10) and D2 (E2-1 to E2-10), with scores attached to the relevant elements]

12

Retrieval Strategy of INEX (2/3)

Focussed
  The system retrieves only focussed elements (i.e. non-overlapping elements)
  Ranked in relevance order
  Result list, element (score): E1-3 (0.6), E1-8 (0.5), E2-10 (0.4)

[Figure: element trees of documents D1 (E1-1 to E1-10) and D2 (E2-1 to E2-10), with scores attached to the relevant elements]

13

Retrieval Strategy of INEX (3/3)

FetchBrowse
  Fetching Phase
    The system first identifies relevant documents and ranks them in relevance order
  Browsing Phase
    Within a fetched document, the system identifies relevant elements and ranks them in relevance order
  Result list, document (document score): element (element score)
    D1 (0.2): E1-3 (0.6), E1-8 (0.5), E1-10 (0.4), ...
    D2 (0.1): E2-10 (0.4), ...

[Figure: element trees of documents D1 (E1-1 to E1-10) and D2 (E2-1 to E2-10), with element scores attached to the relevant elements and document scores 0.2 (D1) and 0.1 (D2)]
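
The three INEX strategies can be compared side by side in a short Python sketch. The element names and scores are the ones listed on the slides; the start and end positions are invented for illustration, with E1-10 placed inside E1-8 so that the Focussed output matches the slide. The greedy overlap removal in `focussed` is one possible reading of "non-overlapping", not necessarily what INEX or Kikori-KS does.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    doc: str
    elem: str
    start: int    # hypothetical region start within the document
    end: int      # hypothetical region end within the document
    score: float

# Scores come from the slides; positions are assumptions for illustration.
HITS = [
    Hit("D1", "E1-3", 10, 40, 0.6),
    Hit("D1", "E1-8", 50, 90, 0.5),
    Hit("D1", "E1-10", 60, 70, 0.4),   # assumed to be nested inside E1-8
    Hit("D2", "E2-10", 30, 60, 0.4),
]
DOC_SCORES = {"D1": 0.2, "D2": 0.1}

def thorough(hits):
    """All relevant elements in descending score order (stable for ties)."""
    return sorted(hits, key=lambda h: -h.score)

def overlaps(a, b):
    """True if two elements of the same document share any region."""
    return a.doc == b.doc and a.start <= b.end and b.start <= a.end

def focussed(hits):
    """Greedy sketch: keep a hit only if it overlaps no better-ranked kept hit."""
    kept = []
    for h in thorough(hits):
        if not any(overlaps(h, k) for k in kept):
            kept.append(h)
    return kept

def fetch_browse(hits, doc_scores):
    """Rank documents first, then rank the elements inside each document."""
    docs = sorted(doc_scores, key=lambda d: -doc_scores[d])
    return [(d, [h for h in thorough(hits) if h.doc == d]) for d in docs]

if __name__ == "__main__":
    print([h.elem for h in thorough(HITS)])
    print([h.elem for h in focussed(HITS)])
    for doc, elems in fetch_browse(HITS, DOC_SCORES):
        print(doc, [h.elem for h in elems])
```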

14

FetchHighlight

Displaying search result elements aggregated by XML documents is effective
  FetchBrowse is of that style
Displaying search result elements in their document order is useful
FetchHighlight
  XML documents are first sorted in their relevance order
  Relevant elements within the XML document are displayed in document order
  Elements are indented in accordance with their depth in the XML tree
  Example: D1: E11, E111, E112, E12, ...; D2: E21, ...
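
The FetchHighlight ordering described on this slide can be sketched as follows. This is a minimal illustration, not the Kikori-KS implementation: the `Elem` record, the positions, the depths, and the document scores are all hypothetical values chosen to reproduce the D1/D2 example.

```python
from typing import NamedTuple

class Elem(NamedTuple):
    doc: str
    label: str
    start: int   # position in document order (hypothetical values)
    depth: int   # depth in the XML tree (hypothetical values)

# Hypothetical result set mirroring the D1/D2 example on the slide.
ELEMS = [
    Elem("D2", "E21", 3, 1),
    Elem("D1", "E12", 9, 1),
    Elem("D1", "E111", 2, 2),
    Elem("D1", "E11", 1, 1),
    Elem("D1", "E112", 5, 2),
]
DOC_SCORES = {"D1": 0.9, "D2": 0.4}   # assumed document relevance scores

def fetch_highlight(elems, doc_scores):
    """Documents by descending relevance; elements in document order, indented by depth."""
    lines = []
    for doc in sorted(doc_scores, key=lambda d: -doc_scores[d]):
        lines.append(doc)
        in_doc = sorted((e for e in elems if e.doc == doc), key=lambda e: e.start)
        for e in in_doc:
            lines.append("  " * e.depth + e.label)
    return lines

if __name__ == "__main__":
    print("\n".join(fetch_highlight(ELEMS, DOC_SCORES)))
```

Note that element scores play no role in the ordering here; on the slides they only affect presentation (font size).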

15

FetchHighlight Interface

[Screenshot: FetchHighlight search result page, with the following callouts]
Document order
Aggregated by document
Elements with high scores are displayed in a larger font
Outline elements are displayed

16

Browsing Document Fragment

* Selected document fragment is highlighted
* Search words are highlighted

17

Features of the FetchHighlight Interface

Focussed elements are easily identified
  Users can also recognize the parts of the documents where many highly relevant elements are clustered
Outline elements
  Displayed even if the score is 0
  The elements with particular structural information
    ex) sections and subsections
  Useful for browsing

19

Storing XML documents into RDB

A huge number of document fragments have to be handled efficiently
  ex) There are 16,080,830 document fragments (elements) against 16,819 documents in the INEX 1.9 collection used in our experiments
Storage schema based on XRel
  Independent of the logical structure of XML documents
Conceptual Database Design
  Element (docID, elemID, pathID, start, end, label)
  Path (pathID, pathexp)
  Term (term, docID, elemID, tfipf)

20

Conceptual Database Design

Element
  docID  elemID  pathID  start  end  label
  1      1       1       1      236  XML Index
  1      2       2       10     44   database
  1      3       3       45     68   XML Index
  :      :       :       :      :    :

Path
  pathID  pathexp
  1       /article
  2       /article/transaction
  3       /article/title
  :       :

Term
  term      docID  elemID  tfipf
  XML       1      1       0.3
  XML       1      3       0.4
  database  1      1       0.3
  database  1      2       0.1
  :         :      :       :

[Figure: XML tree of the sample article (element nodes: article, title, transaction, body, section, title, p; text values: "XML Index", "database", "Introduction", "We explain XML database", "XML Labeling", "Query processing"), with elemID 1 to 10 assigned to the element nodes]

* label: short text representing the element
* tfipf: term weight in the element
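
To make the conceptual design concrete, here is a small sketch that loads the sample rows from this slide into an in-memory SQLite database (the slides report Oracle10g; SQLite is used here only so the example runs anywhere). The helper names `elements_for` and `ancestors` are hypothetical. The ancestor query relies on the region idea used by XRel-style schemas, where an ancestor's (start, end) region encloses that of its descendants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Element (docID INTEGER, elemID INTEGER, pathID INTEGER,
                      start INTEGER, "end" INTEGER, label TEXT);
CREATE TABLE Path (pathID INTEGER, pathexp TEXT);
CREATE TABLE Term (term TEXT, docID INTEGER, elemID INTEGER, tfipf REAL);
INSERT INTO Element VALUES (1, 1, 1, 1, 236, 'XML Index'),
                           (1, 2, 2, 10, 44, 'database'),
                           (1, 3, 3, 45, 68, 'XML Index');
INSERT INTO Path VALUES (1, '/article'), (2, '/article/transaction'),
                        (3, '/article/title');
INSERT INTO Term VALUES ('XML', 1, 1, 0.3), ('XML', 1, 3, 0.4),
                        ('database', 1, 1, 0.3), ('database', 1, 2, 0.1);
""")

def elements_for(term):
    """Elements containing a term, with their weight and path, best first."""
    return conn.execute("""
        SELECT t.docID, t.elemID, t.tfipf, p.pathexp
        FROM Term t
        JOIN Element e ON e.docID = t.docID AND e.elemID = t.elemID
        JOIN Path p ON p.pathID = e.pathID
        WHERE t.term = ?
        ORDER BY t.tfipf DESC, t.elemID
    """, (term,)).fetchall()

def ancestors(doc_id, elem_id):
    """Elements whose region strictly encloses the given element's region."""
    rows = conn.execute("""
        SELECT a.elemID
        FROM Element a
        JOIN Element d ON d.docID = a.docID
        WHERE d.docID = ? AND d.elemID = ?
          AND a.start < d.start AND a."end" > d."end"
        ORDER BY a.start
    """, (doc_id, elem_id)).fetchall()
    return [r[0] for r in rows]

if __name__ == "__main__":
    print(elements_for("database"))
    print(ancestors(1, 3))
```

The three-way join in `elements_for` is the cost that the schema refinement on the following slides is meant to avoid.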

21

Schema Refinement (1/4)

Materialized view
  Join Element table, Path table, and Term table
Partitioning the Term table by term
  Term_xyz (docID, elemID, tfipf, start, end, label, pathexp)
Selecting outline elements and constructing an Outline table in advance
  The system designer has to predefine outline elements
  Outline (docID, elemID, start, end, label, pathexp)

22

Schema Refinement (2/4)

Materialized view
  Join Element table, Path table, and Term table

  term      docID  elemID  tfipf  start  end  label      pathexp
  XML       1      1       0.3    1      236  XML Index  /article
  database  1      2       0.1    10     44   database   /article/transaction
  database  1      1       0.3    1      236  XML Index  /article
  :         :      :       :      :      :    :          :
  XML       1      3       0.4    45     68   XML Index  /article/title

  Column sources: term, docID, elemID, tfipf from Term; start, end, label from Element; pathexp from Path

23

Schema Refinement (3/4)

Partitioning the table by terms
  Term_xyz (docID, elemID, tfipf, start, end, label, pathexp)

Term_database
  docID  elemID  tfipf  start  end  label      pathexp
  1      1       0.3    1      236  XML Index  /article
  1      2       0.1    10     44   database   /article/transaction
  :      :       :      :      :    :          :

Term_XML
  docID  elemID  tfipf  start  end  label      pathexp
  1      1       0.3    1      236  XML Index  /article
  1      3       0.4    45     68   XML Index  /article/title
  :      :       :      :      :    :          :
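
The join-then-partition refinement can be sketched with SQLite as below. This is an illustration under assumptions: the slides do not show how Kikori-KS builds these tables on Oracle, and the helpers `table_name` and `partition_by_term` are hypothetical. Each per-term table is created explicitly and filled with an `INSERT ... SELECT` over the three base tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Element (docID INTEGER, elemID INTEGER, pathID INTEGER,
                      start INTEGER, "end" INTEGER, label TEXT);
CREATE TABLE Path (pathID INTEGER, pathexp TEXT);
CREATE TABLE Term (term TEXT, docID INTEGER, elemID INTEGER, tfipf REAL);
INSERT INTO Element VALUES (1, 1, 1, 1, 236, 'XML Index'),
                           (1, 2, 2, 10, 44, 'database'),
                           (1, 3, 3, 45, 68, 'XML Index');
INSERT INTO Path VALUES (1, '/article'), (2, '/article/transaction'),
                        (3, '/article/title');
INSERT INTO Term VALUES ('XML', 1, 1, 0.3), ('XML', 1, 3, 0.4),
                        ('database', 1, 1, 0.3), ('database', 1, 2, 0.1);
""")

def table_name(term):
    """Map a term to a table name; non-alphanumerics become '_' (collisions not handled)."""
    return "Term_" + "".join(ch if ch.isalnum() else "_" for ch in term)

def partition_by_term(db):
    """Create one pre-joined table per term and return {term: table name}."""
    names = {}
    terms = [row[0] for row in db.execute("SELECT DISTINCT term FROM Term").fetchall()]
    for term in terms:
        name = table_name(term)
        db.execute(f"""CREATE TABLE "{name}" (docID INTEGER, elemID INTEGER,
                       tfipf REAL, start INTEGER, "end" INTEGER,
                       label TEXT, pathexp TEXT)""")
        db.execute(f"""INSERT INTO "{name}"
                       SELECT t.docID, t.elemID, t.tfipf, e.start, e."end",
                              e.label, p.pathexp
                       FROM Term t
                       JOIN Element e ON e.docID = t.docID AND e.elemID = t.elemID
                       JOIN Path p ON p.pathID = e.pathID
                       WHERE t.term = ?""", (term,))
        names[term] = name
    return names

TABLES = partition_by_term(conn)

if __name__ == "__main__":
    for term, name in sorted(TABLES.items()):
        print(name, conn.execute(f'SELECT * FROM "{name}" ORDER BY elemID').fetchall())
```

With this layout a keyword lookup touches only the table for that keyword and needs no join at query time, at the cost of duplicating element and path data across tables.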

24

Schema Refinement (4/4)

Selecting outline elements and constructing an Outline table in advance
  The system designer predefines outline elements

Outline (docID, elemID, start, end, label, pathexp)
  docID  elemID  start  end  label         pathexp
  1      5       75     143  Introduction  /article/body/section
  1      8       144    219  XML Labeling  /article/body/section

[Figure: XML tree of the sample article (element nodes: article, title, transaction, body, section, title, p; text values: "XML Index", "database", "Introduction", "We explain XML database", "XML Labeling", "Query processing"), with elemID 1 to 10 assigned to the element nodes]

25

Query Translation

The input keyword set is automatically translated into an SQL statement
  The statement calculates the score of each relevant element
Supports mandatory terms (prefixed with a “+” sign) and negated terms (prefixed with a “-” sign)
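
A keyword-to-SQL translation of this kind might look like the sketch below. Several things here are assumptions rather than the Kikori-KS translator: the function names `translate` and `search`, the use of the single conceptual Term table instead of per-term tables, a query weight of 1 for every keyword, and the reading that “+” requires the term to occur in the same element while “-” requires it not to. At least one non-negated keyword is expected.

```python
import sqlite3

def translate(keywords):
    """Return (sql, params) for a keyword set with optional '+' / '-' prefixes."""
    scored, mandatory, negated = [], [], []
    for kw in keywords:
        if kw.startswith("+"):
            mandatory.append(kw[1:])
            scored.append(kw[1:])        # mandatory terms also contribute to the score
        elif kw.startswith("-"):
            negated.append(kw[1:])
        else:
            scored.append(kw)
    placeholders = ", ".join("?" for _ in scored)
    same_elem = "x.docID = t.docID AND x.elemID = t.elemID"
    sql = ("SELECT t.docID, t.elemID, SUM(t.tfipf) AS score "
           f"FROM Term t WHERE t.term IN ({placeholders})")
    for _ in mandatory:
        sql += f" AND EXISTS (SELECT 1 FROM Term x WHERE x.term = ? AND {same_elem})"
    for _ in negated:
        sql += f" AND NOT EXISTS (SELECT 1 FROM Term x WHERE x.term = ? AND {same_elem})"
    sql += " GROUP BY t.docID, t.elemID ORDER BY score DESC, t.docID, t.elemID"
    return sql, scored + mandatory + negated

def search(db, keywords):
    """Run the translated statement and return (docID, elemID, score) rows."""
    sql, params = translate(keywords)
    return db.execute(sql, params).fetchall()

# Sample Term rows taken from the conceptual design slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Term (term TEXT, docID INTEGER, elemID INTEGER, tfipf REAL)")
conn.executemany("INSERT INTO Term VALUES (?, ?, ?, ?)", [
    ("XML", 1, 1, 0.3), ("XML", 1, 3, 0.4),
    ("database", 1, 1, 0.3), ("database", 1, 2, 0.1),
])

if __name__ == "__main__":
    print(search(conn, ["XML", "database"]))
    print(search(conn, ["XML", "-database"]))
    print(search(conn, ["+database", "XML"]))
```

Keywords are always passed as bound parameters, so user input never becomes part of the SQL text.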

27

Ranking Model (1/2)

Vector Space Model
  The score is the degree of similarity between the query vector and the element vector

$$\mathrm{Sim}(Q, E) = \sum_{k \in Q, E} \mathrm{weight}(k, Q) \cdot \mathrm{weight}(k, E)$$

  Q: Query Vector
  E: Element Vector

$$\mathrm{weight}(k, Q) = qtf$$

  We can use the term frequency of k in Q (qtf) as the weight of the term in the query
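
As a small worked example of the similarity above, the sketch below computes the sum of products over the terms shared by query and element, using the query term frequency as the query-side weight. The function name `sim` and the dictionary representation of the element vector are assumptions made for illustration.

```python
from collections import Counter

def sim(query_terms, element_weights):
    """Sum of weight(k, Q) * weight(k, E) over terms k present in both Q and E.

    query_terms: list of keywords (repeats raise the query term frequency).
    element_weights: mapping term -> weight(k, E) for one element.
    """
    qtf = Counter(query_terms)
    return sum(count * element_weights[k]
               for k, count in qtf.items() if k in element_weights)

if __name__ == "__main__":
    # Element 1 of the sample document has weight 0.3 for both terms.
    print(sim(["XML", "database"], {"XML": 0.3, "database": 0.3}))
    # A term missing from the element contributes nothing.
    print(sim(["XML", "labeling"], {"XML": 0.4}))
```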

28

Ranking Model (2/2)

Term weights in the element
  Based on the formula in [Liu et al. 06] and [Grabs et al. 02]

$$\mathrm{weight}(k, E) = \frac{ntf}{nel} \cdot ipf$$

Normalized term frequency (tf)

$$ntf = 1 + \ln(1 + \ln(tf))$$

Specificity of term k within elements that share path p

$$ipf = \ln\frac{N_p + 1}{ef_p}$$

Normalization factor that reflects the element length (el) of E

$$nel = \left((1 - s) + s \cdot \frac{el}{avgel_p}\right) \cdot \left(1 + \ln(avgel_p)\right), \qquad s = 0.2$$
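
The term weight can be evaluated numerically with a sketch like the following. It assumes the formulas read as ntf = 1 + ln(1 + ln(tf)), ipf = ln((N_p + 1) / ef_p), and nel = ((1 - s) + s * el / avgel_p) * (1 + ln(avgel_p)) with s = 0.2. The slide does not define N_p, ef_p, or avgel_p; the code takes them to be the number of elements sharing path p, the number of those elements containing the term, and their average length, which is an interpretation rather than something the slide states.

```python
import math

def ntf(tf):
    """Normalized term frequency; tf must be at least 1."""
    return 1 + math.log(1 + math.log(tf))

def ipf(n_p, ef_p):
    """Specificity of a term among the elements sharing path p (assumed meaning)."""
    return math.log((n_p + 1) / ef_p)

def nel(el, avgel_p, s=0.2):
    """Length normalization for an element of length el on path p."""
    return ((1 - s) + s * el / avgel_p) * (1 + math.log(avgel_p))

def weight(tf, el, n_p, ef_p, avgel_p, s=0.2):
    """weight(k, E) = ntf / nel * ipf."""
    return ntf(tf) / nel(el, avgel_p, s) * ipf(n_p, ef_p)

if __name__ == "__main__":
    # Hypothetical numbers: a term occurring 3 times in a 120-word element,
    # on a path with 1000 elements of average length 100, 50 of which contain it.
    print(weight(tf=3, el=120, n_p=1000, ef_p=50, avgel_p=100))
```

With these definitions, a longer element gets a smaller weight for the same term frequency, and a term that is rare among elements on the same path gets a larger one.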

30

Experiments

XML data set
  INEX 1.9 (about 700 MB)
  The articles of the IEEE Computer Society’s publications are marked up in XML
Query set
  40 queries of INEX 2005
  INEX also provides relevance assessments
Precision/Recall Graph for examining effectiveness
  Precision/Recall Graph can be obtained using EvalJ*
Query processing time for examining efficiency
Experimental setup
  CPU : Xeon 3.80 GHz (2 CPU)
  RAM : 4.0 GB
  OS : Miracle Linux 3.0
  RDBMS : Oracle10g Release1

*http://sourceforge.net/projects/evalj/

31

Precision/Recall

[Figure: two precision/recall graphs, one for Thorough and one for FetchBrowse, each with the Kikori-KS curve marked; axes: Recall and Precision]

The rank of Kikori-KS is relatively high, especially in FetchBrowse

32

Processing Time

[Figure: processing time in ms (axis ticks from 1000 to 6000) for Thorough, FetchBrowse, and FetchHighlight]

Kikori-KS achieved acceptable search time

34

Conclusions and Future Works

Conclusions

  Kikori-KS
    A prototype system for XML-IR that accepts keyword set as a query
    User-friendly FetchHighlight Interface
    Storage schema on RDB
  Experiments using INEX test collection
    Kikori-KS can handle a keyword set query in an acceptable time and with relatively high precision

Future Works
  Developing storage schema and weighting methods for phrase searches
  Introducing content and structure (CAS) searches

35

Thank you