Information Retrieval Based on Extraction of Domain Specific Significant Keywords and Other Relevant Phrases from a Conceptual Semantic Network Structure



SLIDE 1

“Information Retrieval Based on Extraction of Domain Specific Significant Keywords and Other Relevant Phrases from a Conceptual Semantic Network Structure”

Mohammad Moinul Hoque, Prakash Poudyal, Teresa Goncalves and Paulo Quaresma

University of Evora Team
Evora, Portugal

SLIDE 2

Introduction

• This paper presents
  • A functional approach to the problem of information retrieval, built upon a narration-based search text.
• The presented system
  • Retrieves documents from the background collection
  • By extracting
    • Domain specific significant keywords
    • Other relevant phrases from a given narrative search text.
• The narrative search text can be
  • A description / scenario

SLIDE 3

Proposed Approach

• A domain specific Conceptual Semantic Network (CSN) is built
• Significant keywords are extracted from the narrative search text with the help of the CSN to form an initial search query
• Alternative sets of search queries are also formulated by
  • Expanding the initial query built from the CSN.
  • Adding synonymous terms of the retrieved keywords/phrases using WordNet synonym sets.

SLIDE 4

Domain specific Conceptual Semantic Network

• The corpus we are dealing with is domain specific (legal documents)
• Search space is also domain dependent.
• We build a potential model of a Concept based Semantic Network structure (CSN) manually.
• CSN contains various conceptual terms/phrases related to the respective domains and connections among them.
• These concept terms / phrases are extracted from Wikipedia using a crawler application developed for the purpose.

SLIDE 5

Domain specific Conceptual Semantic Network

[Figure: network nodes include Marriage, Dowry, Torture, Divorce, Maintenance, Child Custody, Case File and Re-Marriage]

A partial view of the domain specific (Hindu Marriage and Divorce Law domain) Conceptual Semantic Network

SLIDE 6

Domain specific Conceptual Semantic Network

[Figure: network nodes include Consumer Protection, Warranty claim, Warranty refusal, Manufacturing defect, Return and Replacement]

A partial view of the domain specific (Consumer Protection Law domain) Conceptual Semantic Network
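
A CSN like the two partial views above could be held in memory as an undirected graph stored as an adjacency map. The sketch below is illustrative only: the node names come from the slides, but the edges are assumed (the connections drawn in the figures are not recoverable here), and `ConceptNetwork`, `add_edge` and `neighbours` are hypothetical names.

```python
from collections import defaultdict


class ConceptNetwork:
    """Undirected graph of concept terms/phrases (a minimal CSN stand-in)."""

    def __init__(self):
        self._adj = defaultdict(set)

    def add_edge(self, a, b):
        # Terms are stored lower-cased so lookups are case-insensitive.
        a, b = a.lower(), b.lower()
        self._adj[a].add(b)
        self._adj[b].add(a)

    def __contains__(self, term):
        return term.lower() in self._adj

    def neighbours(self, term):
        # .get avoids creating empty entries for unknown terms.
        return sorted(self._adj.get(term.lower(), set()))


csn = ConceptNetwork()
# ASSUMED edges, for illustration only; the real CSN was built manually.
for a, b in [
    ("marriage", "dowry"),
    ("marriage", "divorce"),
    ("dowry", "torture"),
    ("divorce", "maintenance"),
    ("divorce", "child custody"),
    ("divorce", "re-marriage"),
]:
    csn.add_edge(a, b)

print(csn.neighbours("Divorce"))
```

An adjacency map keeps both membership tests ("is this term a domain concept?") and neighbour lookups ("which concepts are connected to it?") cheap, which are the two operations the later slides rely on.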

SLIDE 7

Indexing the Document Corpus

• Background data files are preprocessed first by stripping off a few data structure tags.
• Stemming is performed on the data using the Porter Stemming Algorithm for the English language.
• Finally the data is indexed using an inverted index structure.

[Diagram: Background Data → Preprocess → Stemming → Inverted document Index]
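
The three indexing steps can be sketched in a few lines. This is a toy version under stated assumptions: `crude_stem` is a simple suffix stripper standing in for the Porter algorithm (it is not Porter), the `<DOC>`/`<TEXT>` tags are made-up examples of "data structure tags", and the real system indexes through Lucene rather than a Python dictionary.

```python
import re
from collections import defaultdict

TAG_RE = re.compile(r"<[^>]+>")


def crude_stem(token):
    """Placeholder stemmer: NOT the Porter algorithm, just suffix stripping."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_inverted_index(docs):
    """Map each stemmed term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, raw in docs.items():
        # Step 1: strip markup tags. Step 2: stem. Step 3: post to the index.
        text = TAG_RE.sub(" ", raw).lower()
        for token in re.findall(r"[a-z]+", text):
            term = crude_stem(token)
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


# Hypothetical documents; the tag names are assumptions.
docs = {
    "d1": "<DOC><TEXT>Dowry demands and harassment</TEXT></DOC>",
    "d2": "<DOC><TEXT>Warranty claims refused; dowry</TEXT></DOC>",
}
index = build_inverted_index(docs)
print(sorted(index["dowry"]))
```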

SLIDE 8

Search text analysis and processing

• Preprocessing
  • English stop words are eliminated from the text since they carry little or no significance
  • Text is freed from noisy symbols or characters.
  • Search text is converted into a set of sentences using the heuristic method employed by the OpenNLP API.
• POS Tagging: Words in the search text are tagged with POS tags using the Stanford POS tagger
• Named Entities such as Organizations and Locations are identified from the search text.
• Domain specific Conceptual Semantic Network is consulted to select the significant keywords from the search text
  • Those non-stop terms/phrases are initially picked up as a possible keyword set to build up an initial search query.
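
A rough sketch of the preprocessing and keyword selection steps follows. It is a simplification, not the pipeline used in the slides: the regex sentence splitter stands in for OpenNLP, POS tagging and named entity recognition are left out, and `STOP_WORDS` and `CSN_TERMS` are small assumed samples.

```python
import re

# Assumed, deliberately tiny stop word list and CSN term list.
STOP_WORDS = {"i", "am", "a", "the", "and", "of", "my", "for", "have",
              "had", "been", "from", "to", "in", "want"}
CSN_TERMS = {"marriage", "dowry", "divorce", "child custody", "maintenance"}


def split_sentences(text):
    """Naive splitter; the slides use OpenNLP's sentence detector instead."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def clean_tokens(sentence):
    """Lower-case, drop noisy symbols, then drop stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def select_keywords(text, csn_terms=CSN_TERMS):
    """Return the CSN terms (single words or two-word phrases) found in the text."""
    found = []
    for sentence in split_sentences(text):
        tokens = clean_tokens(sentence)
        # Candidates are single tokens plus adjacent token pairs.
        candidates = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
        for cand in candidates:
            if cand in csn_terms and cand not in found:
                found.append(cand)
    return found


text = "I want the child custody and monthly maintenance. They filed for divorce!"
print(select_keywords(text))
```

Because pairs are formed after stop word removal, two words separated only by a stop word can be joined into a phrase; a fuller implementation would match phrases against the original token sequence.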

SLIDE 9

Search query expansion

• From the Conceptual Semantic Network
  • Possible connections with other conceptual terms related to the initial set of keywords are analyzed and added to form an alternate set of queries
• Based on the parts of speech tags of the marked keywords
  • Possible sets of synonyms are extracted using WordNet synsets.
  • These synonyms are also added to create an alternate set of queries
    • For possible file retrieval performance enhancement.
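
The two expansion routes can be sketched as set operations. Everything concrete below is an assumption made for illustration: the `CSN_LINKS` edges are invented, and `SYNONYMS` is a hand-written table keyed by (word, POS tag) that stands in for a real WordNet lookup; its entries borrow terms that appear elsewhere in the slides and are not actual WordNet output.

```python
# ASSUMED CSN connections.
CSN_LINKS = {
    "marriage": {"dowry", "divorce"},
    "dowry": {"harassment"},
    "divorce": {"maintenance", "child custody"},
}
# Hand-written stand-in for WordNet synsets, keyed by (word, POS tag).
SYNONYMS = {
    ("dowry", "NN"): {"endowment"},
    ("divorce", "NN"): {"separation"},
}


def expand_query(initial, tagged):
    """Return alternate queries: initial set, CSN-expanded set, synonym-expanded set."""
    initial = frozenset(initial)

    with_csn = set(initial)
    for kw in initial:
        with_csn |= CSN_LINKS.get(kw, set())

    with_syn = set(initial)
    for kw in initial:
        # The POS tag selects which synonym entry applies to the keyword.
        with_syn |= SYNONYMS.get((kw, tagged.get(kw)), set())

    # Drop duplicate queries while keeping their order.
    unique = []
    for query in (initial, frozenset(with_csn), frozenset(with_syn)):
        if query not in unique:
            unique.append(query)
    return unique


alternates = expand_query({"marriage", "dowry"}, {"marriage": "NN", "dowry": "NN"})
for query in alternates:
    print(sorted(query))
```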

SLIDE 10

Search text analysis and processing

[Diagram components: Domain specific Narrative Search text; Preprocessing; POS Tagging; Named Entity Recognition; Domain specific Conceptual Semantic Network; Initial Search Query; Query expansion; WordNet; Set of Search Queries; Query Stemming; Indexed Data]

SLIDE 11

Generating Search Queries (Example)

• S1:

“I am a Hindu girl married for over 5 years and have a 4 yr child out of my wedlock. My married life had been full of problems from the first week of marriage - most of which can be summarized as dowry related harassment, physical and mental torture and cruelty. Now my husband and family have filed for divorce mostly on the grounds of cruelty and infidelity using false allegations to malign my character and false allegation to prove that I have been a bad daughter in law. …... I want to file a FIR and complaint in Women Cell regarding my jewellery and dowry related harassment. …. I want the child custody, monthly maintenance and share in husband's or in-law's property…”

SLIDE 12

Generating Search Queries (Example)

• Our system analyzes the text and discovers the domain dependent terms from the CSN
• For example: the system extracts the keyword ‘marriage’ and associates the phrase ‘full of problems’ with it.
• The system continues to discover similar associations
• Consults the WordNet synonym set to add a few more synonyms of those keywords depending on their parts of speech tag
• The system creates a collection of sets containing 1…m number of sets
  • Each of which is again a set of n number of keywords.
  • The cardinality of these sets appearing inside the superset ranges from 2…n
  • The content of these sets: keywords in the search text and keywords from the conceptual network connections which are associated with them.
• Phrases appearing inside quotation marks are directly included in the collection set

SLIDE 13

Search Query generation

• Collection Set

[ {marriage, problem},
  {marriage, dowry},
  {marriage, ‘physical torture’},
  {Marriage, dowry, harassment},
  {Separation, child custody},
  {divorce, maintenance},
  {Divorce, ‘false allegation’},
  {marriage, endowment, harassment},
  {Marriage, Dowry, mental torture},
  {marriage, dowry, emotional abuse},
  {marriage, dowry, ‘verbal abuse’, ‘physical abuse’, harassment},
  {marriage, dowry, ‘verbal abuse’, ‘physical abuse’, harassment, divorce},
  {marriage, ‘physical abuse’, ‘abusive marriage’, cruelty, ‘mental torture’, separation} ]
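
Each set in the collection eventually has to become a query string for the search engine. The sketch below shows one possible rendering. It assumes that the keywords of a set are combined with `AND` and that multi-word phrases are quoted, which is valid Lucene query syntax, but the slides do not say which operator the system actually uses. `to_query_string` is a hypothetical helper.

```python
def to_query_string(keywords):
    """Join keywords with AND (an assumption); multi-word phrases are quoted."""
    parts = []
    for kw in sorted(k.lower() for k in keywords):
        parts.append(f'"{kw}"' if " " in kw else kw)
    return " AND ".join(parts)


# Three sets taken from the collection above.
collection = [
    {"marriage", "problem"},
    {"Marriage", "dowry", "harassment"},
    {"marriage", "physical torture"},
]
queries = [to_query_string(s) for s in collection]
for q in queries:
    print(q)
```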

SLIDE 14

Ranking of the retrieved documents and selection of final set of documents

• Adopted the Lucene based searching techniques (uses a combination of Vector Space Model and Boolean Model)
• Documents are ranked and scored by the VSM, but only those retrieved documents which are approved by the BM.
• When passing the queries
  • Each of the sets of queries inside the collection set of queries is sent separately and the returned set of documents is stored with their corresponding scores.
• VSM score of a document d for the query q is calculated using Cosine Similarity of the weighted query and document vectors V(q) and V(d).

SLIDE 15

Ranking of the retrieved documents and selection of final set of documents

\[
\text{Cosine Similarity}(q, d) = \frac{V(q) \cdot V(d)}{|V(q)| \, |V(d)|}
\]

V(q) · V(d) is the dot product of the weighted vectors, and |V(q)| and |V(d)| are their Euclidean norms.
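
For concreteness, the cosine formula can be computed over sparse term-weight vectors as follows. This is a minimal sketch: the vectors are plain dictionaries with made-up weights, whereas in the presented system the weighting and scoring are handled by Lucene.

```python
import math


def cosine_similarity(vq, vd):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * vd.get(term, 0.0) for term, w in vq.items())
    norm_q = math.sqrt(sum(w * w for w in vq.values()))
    norm_d = math.sqrt(sum(w * w for w in vd.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_q * norm_d)


# Hypothetical weights, chosen only to make the arithmetic easy to follow.
v_q = {"marriage": 1.0, "dowry": 1.0}
v_d = {"marriage": 2.0, "dowry": 2.0, "court": 1.0}
print(round(cosine_similarity(v_q, v_d), 4))
```

Here the dot product is 4, |V(q)| is √2 and |V(d)| is 3, so the score is 4 / (3√2) ≈ 0.9428.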

SLIDE 16

Ranking Point priorities

• Search queries having larger cardinality in terms of the keywords they contain are given higher priorities
  • Returned documents containing more significant keywords have the best chance of being more relevant.
• From the result set of each query,
  • Top 1000 highest ranked documents (based on VSM points) are generated and expected to be relevant to the search text.
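
The slides do not spell out how the cardinality priority and the VSM score are combined, so the sketch below adopts one plausible rule as an assumption: a document is ranked first by the size of the largest query that retrieved it, and then by its VSM score under that query. `merge_results` is a hypothetical helper.

```python
def merge_results(results_per_query, top_k=1000):
    """Merge per-query results into one ranked list of document ids.

    results_per_query: list of (query_keywords, {doc_id: vsm_score}).
    ASSUMED rule: rank by (query cardinality, VSM score), both descending.
    """
    best = {}
    for keywords, scored_docs in results_per_query:
        size = len(keywords)
        for doc_id, score in scored_docs.items():
            key = (size, score)
            # Keep the highest-priority evidence seen for each document.
            if doc_id not in best or key > best[doc_id]:
                best[doc_id] = key
    ranked = sorted(best, key=lambda d: best[d], reverse=True)
    return ranked[:top_k]


# Hypothetical scores for two queries of different cardinality.
results = [
    ({"marriage", "dowry"}, {"d1": 0.9, "d2": 0.4}),
    ({"marriage", "dowry", "harassment"}, {"d2": 0.3, "d3": 0.2}),
]
print(merge_results(results, top_k=2))
```

Under this rule a modest score from a three-keyword query outranks a high score from a two-keyword query, which mirrors the stated preference for documents matching more significant keywords.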

SLIDE 17

Experimental Setup

• Data set and queries: ‘FIRE 2013 Ad-hoc retrieval from legal documents’ track.
• Background data contained a set of small documents constituting over 3 Gigabytes in size.
• Documents contained verdicts from the Supreme Court and various acts of parliament on two different domains, namely ‘Consumer Law’ and ‘Hindu Marriage & Divorce Law’.
• The number of documents containing such verdicts was in excess of one hundred and sixty thousand for each of the domains.
• 20 different search texts, 10 from each of the domains, were used for the document retrieval.

SLIDE 18

Experimental Setup

• Our system
  • Retrieved the top 1000 documents for each of the queries from the respective domains.
  • Also retrieved the top 1000 documents for each of the queries from the combined domain index rather than the indices from the relevant domain.

SLIDE 19

Conclusion

• Creating a domain specific Conceptual Semantic Network and selecting significant keywords from it is at the core of the system.
• The creation of such a domain specific network structure is feasible, specifically for legal domain documents
  • As the number of terms or keyword phrases participating in the conceptual network is limited.
• WordNet synonyms of the keywords also helped the system to achieve wider retrieval coverage.
• Because of the limited amount of time that we had, we could not design the CSN as expressively as we wanted to.
  • So, the retrieval results that we have achieved have lots of scope for improvement.