Indexing of textual databases based on lexical resources: A case study for Serbian


SLIDE 1

Indexing of textual databases based on lexical resources:

A case study for Serbian

Ranka Stanković Cvetana Krstev Ivan Obradović Olivera Kitanović University of Belgrade, Serbia

1st International KEYSTONE Conference

IKC 2015

Coimbra Portugal, 8-9 September 2015

SLIDE 2

Presentation outline

  • Motivation
  • Current solution
  • Improved solution

    ▫ Resources used
    ▫ Architecture of the new system

  • Evaluation
  • Conclusion and future work

SLIDE 3

Motivation

  • Geological Information System of Serbia was launched in 2004

    ▫ general geology, exploration of mineral deposits, hydrogeology, engineering geology
    ▫ users (professionals or ordinary citizens)
    ▫ geo-portal, cartographic content, multimedia, dictionaries and textual databases

  • FoDiB - geological project documentation with structured descriptions of over 4,900 national geological projects from 1956 to the present day
  • Metadata:

    ▫ title, year, location, company, authors, abstract, keywords
    ▫ prospects, application of mineral resource and possibilities for its use
    ▫ field works, geomechanics, mining, geodesic works, prospective exploration

  • The database contains a summary of each project with about 30% of the text of the project report, representing well the textual content of the complete report
  • Digitization and full-text archiving are planned, so this approach will be extended to the future full-text database

SLIDE 4

Present solution

Faceted search:

  • Mineral resource
  • Location
  • Author
  • Year
  • General

Example query: (ugalj OR “ugljeni basen”) AND kostolac, i.e. (coal OR “coal basin”) AND Kostolac

SLIDE 5

Current solution

  • Search by scanning the text

    ▫ appropriate fields are scanned for the given keywords
    ▫ word boundaries are not taken into consideration

  • Search results are ranked on the basis of weight factors assigned to individual fields
  • Each search criterion maps to several different attributes within the database, and weight factors determine each attribute's relevance for the result set
  • Example of a search criterion: location

    ▫ Weights: Municipality 8, County 7, Title 4, Keywords 3, Abstract 2, Appendices 1
    ▫ For the location criterion with keyword Bor, a document with Bor in the Municipality field is ranked higher than one with Bor only in the Abstract field
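
The weighted field scan for the location criterion can be sketched as follows. This is a minimal illustration, not the system's actual code: the record layout, the function names, and the choice to sum the weights of all matching fields are assumptions; only the weight values come from the slide. The substring test also shows the side effect of ignoring word boundaries.

```python
# Field weights for the "location" criterion, as listed on the slide.
LOCATION_WEIGHTS = {
    "municipality": 8, "county": 7, "title": 4,
    "keywords": 3, "abstract": 2, "appendices": 1,
}


def location_score(record, keyword, weights=LOCATION_WEIGHTS):
    """Sum the weights of all fields that contain the keyword.

    Plain substring matching: word boundaries are deliberately ignored,
    as in the current solution. Summing is an assumption of this sketch.
    """
    needle = keyword.lower()
    return sum(
        weight
        for field, weight in weights.items()
        if needle in record.get(field, "").lower()
    )


def rank(records, keyword):
    """Return matching records, highest score first."""
    scored = [(location_score(r, keyword), r) for r in records]
    matching = [(s, r) for s, r in scored if s > 0]
    return [r for s, r in sorted(matching, key=lambda pair: pair[0], reverse=True)]


# Hypothetical records for illustration only.
records = [
    {"id": 1, "abstract": "Istraživanja u okolini Bora"},
    {"id": 2, "municipality": "Bor", "title": "Ležište bakra"},
    {"id": 3, "abstract": "laboratorijska ispitivanja uzoraka"},
]
ranked_ids = [r["id"] for r in rank(records, "Bor")]
```

Record 3 is retrieved only because "laboratorijska" happens to contain the string "bor", which is the kind of false match that scanning without word boundaries produces.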

SLIDE 6

Improved Solution

  • One of the problems of full-text search in Serbian is the language's rich morphology
  • Keywords are in the nominative singular, while in the texts they take different inflectional forms
  • Normalization of morphological forms for document indexing and query processing

    ▫ stemming: several stemmers are available, one of them open source
    ▫ statistical lemmatization (TreeTagger, trained on a corpus of contemporary Serbian, not appropriate for technical texts)
    ▫ lemmatization based on morphological electronic dictionaries and finite-state transducers for Serbian
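
The effect of dictionary-based lemmatization on matching can be illustrated with a toy lookup table. The table below is a hypothetical stand-in for the morphological e-dictionaries (which the slides describe but do not show); the function name and the fallback to the lowercased form for unknown words are assumptions of this sketch.

```python
# Hypothetical miniature dictionary: inflected form -> lemma.
FORM_TO_LEMMA = {
    "ugalj": "ugalj", "uglja": "ugalj", "uglju": "ugalj", "ugljem": "ugalj",
    "zlato": "zlato", "zlata": "zlato", "zlatu": "zlato", "zlatom": "zlato",
}


def lemmatize(tokens, form_to_lemma=FORM_TO_LEMMA):
    """Map each token to its lemma; unknown tokens are only lowercased."""
    return [form_to_lemma.get(t.lower(), t.lower()) for t in tokens]


# A query keyword in the nominative and a text form in the genitive
# meet on the same lemma after normalization.
query_lemmas = lemmatize(["ugalj"])
text_lemmas = lemmatize(["Rezerve", "uglja", "i", "zlata"])
```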

SLIDE 7

Resources used in improved solution

  • NLP for Serbian based on lexical resources

    ▫ electronic dictionaries: 135,000 simple word lemmas + 13,000 multi-word units (MWUs)
    ▫ local grammars using finite-state transducers (FSTs): 1,000 inflectional transducers
    ▫ 3,500,000 inflected forms generated automatically

  • NER: names of persons, locations and organizations, time, date, money and percentage
  • The Serbian NER system is a handcrafted rule-based system based on e-dictionaries and local grammars in the form of FSTs
  • For more information about lexical resources and tools see: http://jerteh.rs

SLIDE 8

Resources used in improved solution

NE type         Frequency   Average per doc   % of the text
person             11,991              2.45            1.33
location           49,414             10.08            5.49
organization        2,882              0.59            0.32
total              64,287             13.11            7.14

Table 1. Distribution of three top-level NEs: persons, locations and organizations

  • The whole collection consists of 4,902 documents, 2,880,229 tokens (900,403 simple word forms).
  • Almost all documents contained at least one NE
  • On average, 4 NEs of all types were recognized per document, with as many as 47 NEs for one of the documents
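
The derived columns of Table 1 can be reproduced from the frequencies and the collection size. In this sketch, dividing by the 900,403 simple word forms for the percentage column is an assumption (the slide does not state the denominator), though it does reproduce the published figures; the variable and function names are illustrative.

```python
DOCUMENTS = 4_902            # documents in the collection
SIMPLE_WORD_FORMS = 900_403  # assumed denominator for "% of the text"

NE_FREQUENCY = {"person": 11_991, "location": 49_414, "organization": 2_882}


def table_row(frequency):
    """Return (average per document, percentage of the text), rounded to 2 places."""
    average = round(frequency / DOCUMENTS, 2)
    percent = round(100 * frequency / SIMPLE_WORD_FORMS, 2)
    return average, percent


total = sum(NE_FREQUENCY.values())
rows = {name: table_row(freq) for name, freq in NE_FREQUENCY.items()}
rows["total"] = table_row(total)
```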

SLIDE 9

Architecture of the new system

[Diagram: architecture of the new system]

  • Lexical resources: e-dictionaries, local grammars
  • NLP tools: Leximir, Unitex
  • Document processing: Geological documents, Transliteration, Tokenization, POS tagging & lemmatization, Bag-of-words, NE extraction, Phrase chunking, Document representation, Indexed documents
  • Query processing: User query, Transliteration (Latin->Cyrillic), Tokenization & lemmatization, Query representation
  • Matching: scoring and ranking, Retrieved documents, Feedback
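
The Latin->Cyrillic transliteration step in the diagram can be sketched with a letter table plus the three Serbian digraphs (lj, nj, dž). This is a simplified illustration, not the system's implementation: the function name is hypothetical, and words in which these letter pairs are not digraphs (for example across a prefix boundary) are not handled.

```python
# Serbian Latin -> Cyrillic single-letter table.
SINGLE = {
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д", "đ": "ђ",
    "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и", "j": "ј", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с",
    "š": "ш", "t": "т", "u": "у", "v": "в", "z": "з", "ž": "ж",
}
# Two-letter Latin sequences that map to a single Cyrillic letter.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}


def to_cyrillic(text):
    """Transliterate Serbian Latin text, trying digraphs before single letters."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair.lower() in DIGRAPHS:
            cyr = DIGRAPHS[pair.lower()]
            out.append(cyr.upper() if pair[0].isupper() else cyr)
            i += 2
            continue
        ch = text[i]
        cyr = SINGLE.get(ch.lower(), ch)  # unmapped characters pass through
        out.append(cyr.upper() if ch.isupper() else cyr)
        i += 1
    return "".join(out)
```

With this table, the query terms used elsewhere in the slides come out as `to_cyrillic("ugalj")` giving "угаљ" and `to_cyrillic("Kostolac")` giving "Костолац".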

SLIDE 10

Architecture of the new system

  • BOW - representation of the document by a set of content words, i.e. non-grammatical words (nouns, adjectives, adverbs and acronyms), each followed by its frequency
  • Text is lemmatized, lemmas (simple and multi-word) are extracted, and their frequencies are calculated
  • In this approach 12,204 simple lemmas (with 450,418 occurrences) and 271 MWUs (with 6,525 occurrences) were extracted
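
A bag-of-words representation of this kind can be sketched as a frequency count over lemmas filtered by part of speech. The input format (lemma, tag) and the tag names are hypothetical; the slides do not specify the tagset. Multi-word units are treated as single lemmas.

```python
from collections import Counter

# Hypothetical tag names for nouns, adjectives, adverbs and acronyms.
CONTENT_POS = {"N", "A", "ADV", "ABB"}


def bag_of_words(tagged_lemmas):
    """Count lemmas whose POS tag marks a content word."""
    return Counter(lemma for lemma, pos in tagged_lemmas if pos in CONTENT_POS)


# Toy lemmatized input; "ugljeni basen" stands for a multi-word unit.
tagged = [
    ("zlato", "N"), ("u", "PREP"), ("rudnik", "N"),
    ("zlato", "N"), ("ugljeni basen", "N"), ("biti", "V"),
]
bow = bag_of_words(tagged)
```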

SLIDE 11

One document dealing with gold

SLIDE 12

Term weights

  • First implementation: tf.idf
  • Further development included:

    ▫ tfc.tfc - modification of tf.idf with cosine normalization
    ▫ tfc.nfc - term weighting algorithm with a normalized tf factor for the query term weights
    ▫ lnc.ltc - measure where ‘l’ stands for weights with a logarithmic tf component
    ▫ lnu.ltu - normalization is based on the number of unique words in the text
    ▫ the measure used in the Inquery system

Hiemstra, D.: Using language models for information retrieval. Taaluitgeverij Neslia Paniculata (2001)
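
As an illustration of the tfc scheme (tf.idf with cosine normalization), a minimal sketch follows. It assumes raw term frequency, idf = ln(N/df), and documents given as lists of lemmas; the function name is hypothetical and the exact variant used in the system may differ.

```python
import math
from collections import Counter


def tfc_weights(docs):
    """Return one {term: weight} dict per document.

    weight = tf * ln(N / df), then each document vector is scaled
    to unit Euclidean length (cosine normalization).
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return weighted


# Toy collection of lemmatized documents.
docs = [["zlato", "bor", "zlato"], ["ugalj", "bor"], ["bakar", "bor"]]
weights = tfc_weights(docs)
```

A term that occurs in every document ("bor" here) gets weight 0, so only the discriminating terms contribute to the match score.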

SLIDE 13

Evaluation

  • First evaluation: entire collection of documents and a set of 10 information needs
  • For query selection, the log of the existing system was used, as well as suggestions of geologists on the most common information needs
  • It turned out that the most frequent requests are for

    ▫ a mineral resource type (copper, gold, coal)
    ▫ location
    ▫ geological event (landslide, earthquake)
    ▫ research company

SLIDE 14

Evaluation

  • Precision P = tp/(tp + fp), recall R = tp/(tp + fn), and F-measure F = 2*P*R/(P + R), where tp, fp and fn are the numbers of true positives, false positives and false negatives
  • Precision-recall curve for (zlato OR Au) AND (Bor OR Borski okrug)

[Figure: precision-recall curves for retrieval without index and with index]

  • Precision of the old system is significantly better among first-ranked documents
  • Recall is better with the new system: 39 among the first 80 documents in the new system were relevant, compared to 25 in the old one
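
The three measures can be computed directly from the counts. In the sketch below the 39 and 25 relevant documents among the first 80 come from the slide; the total number of relevant documents (60) is a hypothetical value chosen only so that recall and F-measure can be shown, since the slides do not give it.

```python
def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def f_measure(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0


CUTOFF = 80
TOTAL_RELEVANT = 60  # hypothetical: not stated in the slides

results = {}
for system, tp in {"new": 39, "old": 25}.items():
    fp = CUTOFF - tp          # retrieved but not relevant
    fn = TOTAL_RELEVANT - tp  # relevant but not among the first 80
    p, r = precision(tp, fp), recall(tp, fn)
    results[system] = (p, r, f_measure(p, r))
```

At the same cut-off of 80 documents, the new system reaches a precision of 39/80 against 25/80 for the old one, and its recall is higher for any fixed number of relevant documents.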

SLIDE 15

Evaluation

  • Comparative graph of the relationship between precision and recall

    ▫ interpolated average precision for 11 levels of recall: 0.0, 0.1, 0.2, ..., 0.9, 1.0
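
The 11-point interpolation can be sketched as follows, using the usual definition in which the interpolated precision at a recall level is the highest precision observed at that recall or any higher one. The input format (a ranked list of relevance flags plus the total number of relevant documents) and the function name are assumptions of this sketch.

```python
def eleven_point_interpolated(ranked_relevance, total_relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    points = []  # (recall, precision) after each rank
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))
    levels = [i / 10 for i in range(11)]
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]


# Toy ranking: relevant documents at ranks 1 and 3, two relevant in total.
curve = eleven_point_interpolated([True, False, True, False], total_relevant=2)
```

Averaging such curves over all queries, level by level, gives the comparative graph for the two systems.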

SLIDE 16

Average Precision per query and Mean Average Precision (MAP) for the old and the new system

A space in a query stands for an OR operator, a semicolon for an AND operator (relevant for the old system)
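
Average Precision and MAP can be computed as in the sketch below, which follows the standard definitions: AP averages the precision values at the ranks of the relevant documents over all relevant documents for the query, and MAP is the mean of AP over the queries. The input format and function names are assumptions; the rankings shown are toy data, not the evaluation results.

```python
def average_precision(ranked_relevance, total_relevant):
    """Mean of precision values at the ranks where relevant documents occur."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total_relevant if total_relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_relevance, total_relevant), one per query."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)


# Two toy queries.
runs = [([True, False, True], 2), ([False, True], 1)]
map_score = mean_average_precision(runs)
```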

SLIDE 17

Evaluation

  • The biggest problems of the new system are:

    ▫ specific technical terms that are not found in the electronic dictionaries
    ▫ quite a number of typographical errors in the document collection

  • These shortcomings can be rectified by:

    ▫ correcting errors (based on the list of words unrecognized by the vocabulary)
    ▫ continuous enhancement of the vocabulary by adding new words

  • Evaluation was time consuming due to:

    ▫ lack of documents previously marked as relevant for the queries
    ▫ no software support for evaluation; everything was done manually in Excel
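
The list of words unrecognized by the vocabulary, mentioned above as the basis for error correction and dictionary enhancement, could be produced along these lines. This is only a sketch: the function name and the toy vocabulary are hypothetical, and a real run would check tokens against the full set of inflected forms in the e-dictionaries.

```python
from collections import Counter


def unrecognized_words(tokens, vocabulary):
    """Return (word, count) pairs for tokens missing from the vocabulary,
    most frequent first."""
    known = {w.lower() for w in vocabulary}
    unknown = Counter(t.lower() for t in tokens if t.lower() not in known)
    return unknown.most_common()


# Toy data: "zlaot" is a typo, "halkopirit" a missing technical term.
vocabulary = {"zlato", "bakar"}
tokens = ["zlato", "zlaot", "bakar", "zlaot", "halkopirit"]
candidates = unrecognized_words(tokens, vocabulary)
```

Sorting by frequency puts the most common typos and missing terms at the top of the list, where a manual review pays off most.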

SLIDE 18

Conclusion and future work

  • Advantages of the current solution:

    ▫ simple to apply
    ▫ performs well for certain types of queries

  • The new solution based on pre-indexing outperforms the present one, but it can be further improved by:

    ▫ enriching morphological e-dictionaries with terms from the geological domain
    ▫ adapting NER to the new domain and text type (technical rather than newspaper texts)
    ▫ experimenting with different term weight measures
    ▫ experimenting with different ways of comparing document and information need representations

  • Further research will be done by:

    ▫ applying the new solution to other textual collections
    ▫ developing a geodatabase for visualization of locations of recognized named entities

  • An analysis of queries in full-sentence form is planned
  • Integration of query expansion is also planned, adding synonyms from available resources, such as the geologic dictionary for terminological query terms and WordNet for more general terms
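
The planned query expansion could take roughly the following shape: each query term becomes an OR-group of the term and its synonyms. The synonym table is a hypothetical stand-in for the geologic dictionary and WordNet, and the function name is illustrative; the zlato/Au pairing mirrors the evaluation query shown earlier.

```python
# Hypothetical synonym table (term -> synonyms), for illustration only.
SYNONYMS = {"zlato": ["Au"], "bakar": ["Cu"]}


def expand_query(terms, synonyms=SYNONYMS):
    """Turn each term into an OR-group: the term followed by its synonyms."""
    expanded = []
    for term in terms:
        extra = [s for s in synonyms.get(term.lower(), []) if s != term]
        expanded.append([term] + extra)
    return expanded


# Groups are ANDed together; members within a group are ORed.
groups = expand_query(["zlato", "Bor"])
```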
