Indexing of textual databases based on lexical resources: A case - PowerPoint PPT Presentation

Indexing of textual databases based on lexical resources: A case study for Serbian
Ranka Stanković, Cvetana Krstev, Ivan Obradović, Olivera Kitanović
University of Belgrade, Serbia
1st International KEYSTONE Conference IKC 2015, Coimbra, Portugal, 8-9 September 2015

Presentation outline
• Motivation
• Current solution
• Improved solution
  ▫ Resources used
  ▫ Architecture of the new system
• Evaluation
• Conclusion and future work

Motivation
• The Geological Information System of Serbia was launched in 2004
  ▫ covers general geology, exploration of mineral deposits, hydrogeology, and engineering geology
  ▫ users are professionals or ordinary citizens
  ▫ offers a geo-portal, cartographic content, multimedia, dictionaries, and textual databases
• FoDiB: geological project documentation with structured descriptions of over 4,900 national geological projects from 1956 to the present day
• Metadata:
  ▫ title, year, location, company, authors, abstract, keywords
  ▫ prospects, application of the mineral resource and possibilities for its use
  ▫ field works, geomechanics, mining, geodetic works, prospective exploration
• The DB contains a project summary with about 30% of the text of each project, which represents well the textual content of the complete report
• Digitization and full-text archiving are foreseen, so this approach will be extended to the future full-text database

Present solution
Example query: (ugalj OR “ugljeni basen”) AND kostolac (ugalj = coal, ugljeni basen = coal basin, Kostolac is a location)
Faceted search:
• Mineral resource
• Location
• Author
• Year
• General

Current solution
• Search by scanning the text
  ▫ the appropriate fields are scanned for the given keywords
  ▫ word boundaries are not taken into consideration
• Search results are ranked on the basis of weight factors assigned to individual fields
• Each search criterion maps to several different attributes in the database, and the weight factors determine each attribute's relevance for the result set
• Example of a search criterion: location
  ▫ Weights: Municipality 8, County 7, Title 4, Keywords 3, Abstract 2, Appendices 1
  ▫ For the location criterion with the keyword Bor, a match in the Municipality field ranks higher than a match in the Abstract field
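
The field-weighted ranking described above can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: the slides give only the weights, so the additive scoring rule, the `score` and `rank` functions, and the sample records are assumptions. The plain substring test mirrors the fact that word boundaries are ignored, which is why a record mentioning "Boreholes" also matches "Bor".

```python
# Field weights for the "location" criterion, as listed on the slide.
LOCATION_WEIGHTS = {
    "municipality": 8,
    "county": 7,
    "title": 4,
    "keywords": 3,
    "abstract": 2,
    "appendices": 1,
}

def score(record, keyword, weights=LOCATION_WEIGHTS):
    """Sum the weights of all fields whose text contains the keyword.

    Assumption: scores are additive across fields. Matching is a plain
    case-insensitive substring scan, so word boundaries are ignored.
    """
    needle = keyword.lower()
    return sum(
        weight
        for field, weight in weights.items()
        if needle in record.get(field, "").lower()
    )

def rank(records, keyword):
    """Return the matching records, highest score first."""
    scored = [(score(r, keyword), r) for r in records]
    hits = [(s, r) for s, r in scored if s > 0]
    return [r for s, r in sorted(hits, key=lambda pair: pair[0], reverse=True)]

# Hypothetical records, for illustration only.
records = [
    {"id": 1, "abstract": "Exploration near Bor."},
    {"id": 2, "municipality": "Bor", "title": "Copper deposits"},
    {"id": 3, "abstract": "Boreholes were drilled."},
]
ranked = rank(records, "Bor")
```

Record 2 comes first because the match is in the Municipality field; record 3 is a false hit caused by ignoring word boundaries.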

Improved solution
• One of the problems of full-text search in Serbian is the rich morphology of the language
• Keywords are in the nominative singular, while in the texts they take different inflectional forms
• Normalization of morphological forms for document indexing and query processing
  ▫ stemming: several stemmers are available, one of them open source
  ▫ statistical lemmatization (TreeTagger, trained on a corpus of contemporary Serbian, not appropriate for technical texts)
  ▫ lemmatization based on morphological electronic dictionaries and finite-state transducers for Serbian
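
Why normalization helps can be shown with a toy dictionary lookup. The sketch below is only an illustration: the small `FORM_TO_LEMMA` table is a hypothetical stand-in for the morphological e-dictionaries, and the `lemmatize` function is not part of any tool named in the slides.

```python
# Toy stand-in for a morphological e-dictionary: inflected form -> lemma.
# The real resources are far larger; these few entries are illustrative.
FORM_TO_LEMMA = {
    "ugalj": "ugalj", "uglja": "ugalj", "uglju": "ugalj", "ugljem": "ugalj",
    "zlato": "zlato", "zlata": "zlato", "zlatu": "zlato", "zlatom": "zlato",
}

def lemmatize(tokens, lexicon=FORM_TO_LEMMA):
    """Map each token to its lemma; unknown tokens are kept, lowercased."""
    return [lexicon.get(t.lower(), t.lower()) for t in tokens]

# The query keyword is in its base form; the text uses genitive forms.
query = lemmatize(["ugalj"])
text = lemmatize(["nalazišta", "uglja", "i", "zlata"])
matches = [lemma for lemma in text if lemma in query]
```

A literal scan for "ugalj" would miss "uglja"; after both sides are lemmatized, the match is found.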

Resources used in improved solution
• NLP for Serbian based on lexical resources
  ▫ electronic dictionaries: 135,000 simple-word lemmas + 13,000 multi-word units (MWUs)
  ▫ local grammars using finite-state transducers (FSTs): 1,000 inflectional transducers
  ▫ 3,500,000 inflected forms generated automatically
• Named entity recognition (NER): names of persons, locations and organizations, time, date, money and percentage
• The Serbian NER system is a handcrafted rule-based system built on e-dictionaries and local grammars in the form of FSTs
• For more information about lexical resources and tools see: http://jerteh.rs

Resources used in improved solution
• The whole collection consists of 4,902 documents with 2,880,229 tokens (900,403 simple word forms)
• Almost all documents contained at least one named entity (NE)
• On average, 4 NEs of all types were recognized per document, with as many as 47 NEs for one of the documents
Table 1. Distribution of three top-level NEs: persons, locations and organizations
NE type        Frequency   Average per doc   % of the text
person            11,991              2.45            1.33
location          49,414             10.08            5.49
organization       2,882              0.59            0.32
total             64,287             13.11            7.14

Architecture of the new system
[Diagram: two processing pipelines that share lexical resources (e-dictionaries, local grammars) and NLP tools (Leximir, Unitex)]
• User query: transliteration (Latin->Cyrillic), tokenization and lemmatization, query representation
• Geological documents: transliteration, tokenization, POS tagging and lemmatization, then bag-of-words, NE extraction and phrase chunking, giving the document representation stored as indexed documents
• Matching: scoring and ranking of the query representation against the document representation, returning the retrieved documents; a feedback step is also shown

Architecture of the new system
• BOW: representation of the document by a set of content (non-grammatical) words (nouns, adjectives, adverbs and acronyms), each followed by its frequency
• The text is lemmatized, lemmas (simple and multi-word) are extracted, and their frequencies are calculated
• With this approach, 12,204 simple lemmas (450,418 occurrences) and 271 MWUs (6,525 occurrences) were extracted
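
A bag-of-words representation of this kind might be built as in the sketch below. It assumes the text has already been lemmatized and POS-tagged upstream, so the input is a list of (lemma, tag) pairs; the tag names in `CONTENT_POS` and the sample input are hypothetical, not the tag set of the actual dictionaries.

```python
from collections import Counter

# Hypothetical POS tags for the word classes kept in the BOW:
# nouns, adjectives, adverbs and acronyms.
CONTENT_POS = {"N", "A", "ADV", "ABB"}

def bag_of_words(tagged_lemmas):
    """Count lemmas of content words; input is (lemma, pos) pairs.

    A multi-word unit is passed in as a single lemma string.
    """
    return Counter(lemma for lemma, pos in tagged_lemmas if pos in CONTENT_POS)

# Illustrative input only; "ugljeni basen" is one multi-word lemma.
tagged = [
    ("zlato", "N"), ("u", "PREP"), ("rudnik", "N"),
    ("zlato", "N"), ("bogat", "A"), ("ugljeni basen", "N"),
]
bow = bag_of_words(tagged)
```

Function words such as the preposition "u" are dropped, while repeated lemmas accumulate a frequency.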

One document dealing with gold

Term weights
• First implementation: tf.idf
• Further development included:
  ▫ tfc.tfc: modification of tf.idf with cosine normalization
  ▫ tfc.nfc: term weighting algorithm with a normalized tf factor for the query term weights
  ▫ lnc.ltc: measure where ‘l’ stands for weights with a logarithmic tf component
  ▫ lnu.ltu: normalization based on the number of unique words in the text
  ▫ the measure used in the Inquery system
Hiemstra, D.: Using language models for information retrieval. Taaluitgeverij Neslia Paniculata (2001)
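
As an illustration of one of these schemes, the sketch below computes lnc.ltc scores following the usual SMART notation: documents use a logarithmic tf, no idf, and cosine normalization, while queries use a logarithmic tf, idf, and cosine normalization. The slides do not give the formulas, so the exact variants used in the system are an assumption, and the three tiny documents are made up.

```python
import math
from collections import Counter

def log_tf(tf):
    """'l' component: 1 + ln(tf) for tf > 0, else 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def cosine_normalize(weights):
    """'c' component: scale the vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else dict(weights)

def lnc_document(doc_terms):
    """lnc: logarithmic tf, no idf, cosine normalization."""
    return cosine_normalize({t: log_tf(tf) for t, tf in Counter(doc_terms).items()})

def ltc_query(query_terms, doc_freq, n_docs):
    """ltc: logarithmic tf, idf = ln(N / df), cosine normalization."""
    weights = {
        t: log_tf(tf) * math.log(n_docs / doc_freq[t])
        for t, tf in Counter(query_terms).items()
        if doc_freq.get(t)  # terms unseen in the collection are skipped
    }
    return cosine_normalize(weights)

def score(doc_vec, query_vec):
    """Dot product of the two normalized weight vectors."""
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())

# Made-up lemmatized documents, for illustration only.
docs = {
    "d1": ["zlato", "zlato", "bakar"],
    "d2": ["ugalj", "bakar"],
    "d3": ["ugalj", "ugalj", "ugalj"],
}
doc_freq = Counter(t for terms in docs.values() for t in set(terms))
query_vec = ltc_query(["zlato"], doc_freq, len(docs))
scores = {d: score(lnc_document(terms), query_vec) for d, terms in docs.items()}
```

Only d1 contains the query lemma, so it is the only document with a non-zero score.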

Evaluation
• First evaluation: the entire collection of documents and a set of 10 information needs
• Queries were selected using the log of the existing system and suggestions from geologists on the most common information needs
• The most frequent requests turned out to be for
  ▫ a mineral resource type (copper, gold, coal)
  ▫ a location
  ▫ a geological event (landslide, earthquake)
  ▫ a research company

Evaluation
• Precision P = tp/(tp + fp), recall R = tp/(tp + fn), and F-measure F = 2*P*R/(P + R)
• Precision-recall curve for the query (zlato OR Au) AND (Bor OR Borski okrug), with one curve for retrieval without index and one for retrieval with index
• Precision of the old system is significantly better among the first-ranked documents
• Recall is better with the new system: 39 among the first 80 documents in the new system were relevant, compared to 25 in the old one
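
The three measures can be checked with a few lines of code. The sketch below applies them to the counts reported above (39 and 25 relevant documents among the first 80). The total number of relevant documents is not reported in the slides, so `TOTAL_RELEVANT` is a hypothetical value used only to show how recall and F would be computed.

```python
def precision(tp, fp):
    """P = tp / (tp + fp)."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """R = tp / (tp + fn)."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(p, r):
    """F = 2 * P * R / (P + R)."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Counts from the slide: relevant documents among the first 80 retrieved.
p_new = precision(tp=39, fp=80 - 39)
p_old = precision(tp=25, fp=80 - 25)

# Hypothetical number of relevant documents in the collection; the
# slides do not report it, so this value is for illustration only.
TOTAL_RELEVANT = 60
r_new = recall(tp=39, fn=TOTAL_RELEVANT - 39)
r_old = recall(tp=25, fn=TOTAL_RELEVANT - 25)
f_new = f_measure(p_new, r_new)
```

At a fixed cutoff of 80 documents, precision and recall both scale with the number of relevant documents found, so whatever the true total is, the new system's recall at that cutoff is 39/25 times that of the old one.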

Evaluation
• Comparative graph of the relationship between precision and recall
  ▫ interpolated average precision at 11 levels of recall: 0.0, 0.1, 0.2, ..., 0.9, 1.0
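
For one query, the 11-point interpolated precision can be computed as sketched below, using the standard definition in which the interpolated precision at a recall level is the highest precision observed at any recall greater than or equal to that level. The function name and the sample ranking are assumptions for illustration; averaging the resulting curves over all queries gives the plotted values.

```python
def interpolated_precision_11pt(ranked_relevance, total_relevant):
    """Return interpolated precision at recall levels 0.0, 0.1, ..., 1.0.

    ranked_relevance: booleans in ranked order (True = relevant).
    total_relevant: number of relevant documents for the query (> 0).
    """
    points = []  # (recall, precision) measured after each rank
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))
    levels = [i / 10 for i in range(11)]
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]

# Hypothetical ranking: relevant documents at ranks 1 and 3, two in total.
curve = interpolated_precision_11pt([True, False, True, False, False], 2)
```

Recall levels that the ranking never reaches get an interpolated precision of 0.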

Average Precision per query and Mean Average Precision (MAP) for the old and the new system. A space in a query stands for an OR operator and a semicolon for an AND operator (relevant for the old system).
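
Average Precision and MAP can be computed as in the following sketch, which uses the standard definitions: AP is the mean of the precision values at the ranks of the relevant documents, divided by the number of relevant documents for the query, and MAP is the mean of AP over all queries. The function names and the two sample rankings are hypothetical and are not taken from the evaluation.

```python
def average_precision(ranked_relevance, total_relevant):
    """Mean of precision at each relevant rank, over all relevant docs.

    Relevant documents that are never retrieved contribute zero.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0

def mean_average_precision(runs):
    """runs: one (ranked_relevance, total_relevant) pair per query."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)

# Two hypothetical queries, for illustration only.
runs = [
    ([True, False, True], 2),   # AP = (1/1 + 2/3) / 2
    ([False, True], 1),         # AP = (1/2) / 1
]
map_score = mean_average_precision(runs)
```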

Evaluation
• The biggest problems of the new system are:
  ▫ specific technical terms that are not found in the electronic dictionaries
  ▫ a considerable number of typographical errors in the document collection
• These shortcomings can be rectified by:
  ▫ correcting errors (based on the list of words unrecognized by the vocabulary)
  ▫ continuously enhancing the vocabulary by adding new words
• Evaluation was time-consuming due to:
  ▫ the lack of documents previously marked as relevant for the queries
  ▫ no software support for evaluation; everything was done manually in Excel

Conclusion and future work
• Advantages of the current solution:
  ▫ simple to apply
  ▫ performs well for certain types of queries
• The new solution based on pre-indexing outperforms the present one, but it can be further improved by:
  ▫ enriching morphological e-dictionaries with terms from the geological domain
  ▫ adapting NER to the new domain and text type (technical rather than newspaper texts)
  ▫ experimenting with different term weight measures
  ▫ experimenting with different ways of comparing document and information need representations
• Further research will be done by:
  ▫ applying the new solution to other textual collections
  ▫ developing a geodatabase for visualization of the locations of recognized named entities
• An analysis of queries in full-sentence form is planned
• Integration of query expansion is planned, adding synonyms from available resources, such as the geologic dictionary for terminological query terms and WordNet for more general terms
