Intelligent Information Retrieval: some research trends - PowerPoint PPT Presentation

SLIDE 1

Intelligent Information Retrieval: some research trends

Gabriella Pasi

Istituto per le Tecnologie della Costruzione
Sezione Tecnologie Informatiche Multimediali
Consiglio Nazionale delle Ricerche
via Ampère, 56, 20131 Milano
e-mail: gabriella.pasi@itim.mi.cnr.it

SLIDE 2

The problem of Information Access

  • Development of the WWW
  • Increasing amount of available information

NEED FOR SYSTEMS WHICH SUPPORT A FAST AND EFFECTIVE ACCESS TO INFORMATION

  • Distinct nature of information needs
  • Distinct ways to provide an automatic support to information access

SLIDE 3

The problem of Information Access

There are distinct ways to locate information, depending both on the way in which the information is represented, and on the users’ needs:

– Navigating via links on web sites (point and click paradigm) requires the identification of a meaningful starting point
– Explicit specification of users’ needs requires an explicit query formulation (Information Retrieval Systems - Search Engines)
– Recommendations as a decision support aid require learning from “similar” preferences (Recommender Systems)
– Preferences elicitation through “guided” dialogues requires user knowledge elicitation (Decision Support Systems)

SLIDE 4

The problem of Information Access

– Notion of relevance: what the user wants is relevant information. Relevance is a subjective property of information items. The notion of preference is in this context related to the one of relevance.

– Systems which support information access: the definition of systems which help users to access information relevant to their needs is based on the solution of a decision making problem: how to select and rank information items which reflect the user’s preferences?

SLIDE 5

Information Retrieval

Information Retrieval (IR) aims at defining systems able to find documents which satisfy someone’s information need.

Information can be of any kind: textual, visual, or auditory, although most current IR systems store and enable the retrieval of only textual information organized in documents.

The problem of identifying the documents relevant to specific needs is a decision-making problem, based on the assessment of the subjective notion of relevance.

This is a very complex task, pervaded with imprecision and uncertainty.

SLIDE 6

Information Retrieval System: a basic scheme

Ultimate aim of the system: to estimate the relevance of documents on the basis of a comparison of the formal representation of documents and queries.

[Diagram components: DOCUMENTS (usually unstructured or semi-structured text); INDEXING MECHANISM; FORMAL REPRESENTATION OF DOCUMENTS; USER QUERY; QUERY FORMULATION; MATCHING MECHANISM; ITEMS ESTIMATED RELEVANT]
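
The basic scheme can be illustrated with a minimal Python sketch. It is not the system described in the slides: the function names (`index`, `match`, `retrieve`), the whitespace tokenization and the overlap-count score are all assumptions chosen for brevity. It only shows the three components named in the scheme: an indexing mechanism, a query representation, and a matching mechanism that produces a ranked list.

```python
from collections import Counter

def index(text):
    """Indexing mechanism (assumed): map raw text to a bag of lowercase terms."""
    return Counter(text.lower().split())

def match(query_repr, doc_repr):
    """Matching mechanism (assumed): count the query terms found in the document."""
    return sum(1 for term in query_repr if term in doc_repr)

def retrieve(query, documents):
    """Return (doc_id, score) pairs with a positive score, best first."""
    q = index(query)
    scored = [(doc_id, match(q, index(text))) for doc_id, text in documents.items()]
    return sorted([pair for pair in scored if pair[1] > 0], key=lambda pair: -pair[1])

# Hypothetical three-document collection.
docs = {
    "d1": "fuzzy models of information retrieval",
    "d2": "neural networks for image processing",
    "d3": "information access on the web",
}
print(retrieve("fuzzy information retrieval", docs))
```

Here the score is only a crude relevance estimate; the models discussed later replace it with probabilistic, logical, fuzzy or neural estimates.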

SLIDE 7

Techniques that improve the basic scheme of an IRS

Some techniques that improve the retrieval capabilities are:

  • Relevance Feedback
  • Text Categorization
  • Use of Thesauri
  • Document clustering
  • Cross-lingual Information Retrieval

SLIDE 8

Information Retrieval: main issues

  • Text (or other media) formal representation
    the text representation is usually based on keyword extraction and weighting
    – how to improve document representations?

  • Queries
    usually based on selection criteria specified by terms
    – how to define query languages that better express users’ needs?

  • The matching mechanism
    it compares the document and query representations
    – what is a “good” model of retrieval? How to account for imprecision and uncertainty?

  • Produced results: ranked lists of documents
    degrees of relevance or probability of relevance

SLIDE 9

Information Retrieval Systems

The relevance estimate strongly depends on the adopted IR model.

How to improve the relevance estimate? By better interpreting and learning users’ preferences.

Definition of “intelligent retrieval systems”

Flexible systems vs. intelligent systems:

  • tolerance to uncertainty and imprecision (intrinsic in subjective evaluations)
  • learning capabilities

Application of soft computing techniques:

  • to simplify the user-system interaction (tolerance to an approximate expression of users’ needs)
  • to learn the user notion of relevance
  • to improve the formal representation of the documents’ content

SLIDE 10

“Intelligent” IR: some research directions

  • IR models that manage uncertainty and vagueness
    They model the uncertainty and/or imprecision intrinsic in the retrieval activity
  • Relevance Feedback
    To learn users’ preferences by refinement of queries
  • Automated text categorization
  • Vocabulary expansion and intelligent users’ interfaces
  • Personalized indexing
    To improve the formal representation of documents
  • Flexible query languages
    To improve the expression of users’ needs

SLIDE 11

IR models that deal with uncertainty and vagueness

  • Probabilistic models
    estimate of the probability of relevance of documents to a user’s query

  • Logical models
    The estimate of the relevance of a document with respect to a query consists in determining the "logical status" of the implication.

  • Fuzzy models
    relevance is modeled as a gradual property of documents. They capture the vagueness intrinsic in the retrieval activity

  • Neural models
    to design IRSs able to adapt to the characteristics of the IR environment, and in particular to the user's interpretation of relevance.

SLIDE 12

Relevance Feedback

Relevance feedback exploits a learning of the user’s notion of relevance by adapting the system behavior to it.

A relevance feedback mechanism performs an automatic process which generates improved queries on the basis of an initial query evaluation.

This process is directed by the user, who first analyzes the preference ordering estimated by the system over the retrieved information items, and then is asked to express her/his preferences over the retrieved items in order to explicitly indicate to the system the items truly evaluated relevant.

[Diagram components: USERS; INFORMATION NEEDS; QUERY FORMULATION; Information Retrieval System; DOCUMENTS estimated relevant; RELEVANCE FEEDBACK]
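
The slide does not name a specific feedback algorithm. As one possible illustration, the sketch below uses the classical Rocchio query reformulation, which moves the query vector towards the documents the user marked relevant and away from those marked non-relevant. The function name `rocchio`, the dictionary representation of vectors and the default coefficients are assumptions made for the example.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an improved query vector (dict term -> weight).

    query: dict term -> weight of the initial query.
    relevant / non_relevant: lists of document vectors judged by the user.
    alpha, beta, gamma: illustrative coefficients, not values from the slides.
    """
    terms = set(query)
    for doc in relevant + non_relevant:
        terms |= set(doc)
    new_query = {}
    for t in terms:
        pos = sum(d.get(t, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(t, 0.0) for d in non_relevant) / len(non_relevant) if non_relevant else 0.0
        weight = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
        if weight > 0:  # terms with non-positive weight are dropped
            new_query[t] = weight
    return new_query

# Hypothetical feedback round: one document judged relevant, one non-relevant.
initial = {"network": 1.0}
improved = rocchio(initial,
                   relevant=[{"network": 0.6, "PC": 0.9}],
                   non_relevant=[{"chip": 0.8}])
print(improved)
```

The improved query gains the term "PC" from the relevant document, while "chip" is left out because its weight would be negative.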

SLIDE 13

Automated text categorization

It is aimed at the automated categorization (classification) of texts into predefined categories, thus organizing them and making retrieval more flexible and consequently more effective.

TC is applied in several domains, such as document indexing based on a controlled vocabulary, document filtering, word sense disambiguation, etc.

The dominant approach to text categorization is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories.
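
The inductive process can be sketched with a very small multinomial naive Bayes classifier. This is only one of many possible learners and is not claimed to be the one the author had in mind; the function names `train` and `classify`, the whitespace tokenization, the add-one smoothing and the toy training set are assumptions.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Learn per-category term counts from (text, category) pairs."""
    term_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labeled_docs:
        term_counts[category].update(text.lower().split())
        doc_counts[category] += 1
    vocabulary = {t for counts in term_counts.values() for t in counts}
    return term_counts, doc_counts, vocabulary

def classify(text, model):
    """Return the category with the highest smoothed log-probability."""
    term_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category in doc_counts:
        counts = term_counts[category]
        denom = sum(counts.values()) + len(vocabulary)  # add-one smoothing
        score = math.log(doc_counts[category] / total_docs)
        for term in text.lower().split():
            if term in vocabulary:  # terms never seen in training are ignored
                score += math.log((counts[term] + 1) / denom)
        if score > best_score:
            best, best_score = category, score
    return best

# Hypothetical preclassified documents.
training = [
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins football match", "sport"),
    ("player scores in final match", "sport"),
]
model = train(training)
print(classify("football match tonight", model))
```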

SLIDE 14

Vocabulary expansion and intelligent users’ interfaces

The query representation in IRSs is commonly based on keywords (or strings) specification. The retrieval mechanism performs in this case a lexical match of words.

One of the main problems of IR systems is vocabulary mismatch.

Vocabulary expansion can result from transforming the document and query representations, as with Latent Semantic Indexing, or it can be done by using a thesaurus.

The basic assumption of LSI is that in the word usage there is an underlying or latent structure: for retrieval some statistically derived conceptual indices are used instead of individual words.

Fuzzy thesauri and pseudothesauri are used to expand the set of index terms of documents with new terms by taking into account their varying significance in representing the topics dealt with in the documents.
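
A thesaurus-based expansion of a document's index terms might look like the sketch below. The slide does not give a formula, so the combination rule (weight of the original term times the association strength, keeping the maximum when several terms suggest the same new term) is an assumption, as are the function name `expand` and the thesaurus entries.

```python
def expand(doc_weights, thesaurus):
    """Add related terms to a weighted document representation.

    doc_weights: dict term -> index weight in [0, 1].
    thesaurus: dict term -> {related term: association strength in [0, 1]}.
    A related term receives weight * strength (assumed rule); an existing
    weight is only overwritten by a larger one.
    """
    expanded = dict(doc_weights)
    for term, weight in doc_weights.items():
        for related, strength in thesaurus.get(term, {}).items():
            candidate = weight * strength
            if candidate > expanded.get(related, 0.0):
                expanded[related] = candidate
    return expanded

# Hypothetical fuzzy thesaurus and document representation.
thesaurus = {"PC": {"computer": 0.8, "DOS": 0.5}}
doc = {"PC": 0.9, "network": 0.6}
expanded_doc = expand(doc, thesaurus)
print(expanded_doc)
```

A query on "computer" can now match this document, with a degree lower than the one of the original term "PC".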

SLIDE 15

Document indexing

The most used automatic indexing procedures are based on term extraction and weighting: a document is represented by means of a collection of index terms with associated weights (the index term weights).

An index term weight expresses the degree of significance of the index term as a descriptor of the document information content.

The vector space model, the probabilistic models and fuzzy models adopt a weighted document representation.

Limitations:

  • the weighted representation of documents does not take into account that a term can play a different role within a text, according to the distribution of its occurrences.
  • usual indexing procedures behave as a black box producing the same document representation for all users

Need for “personalized” indexing procedures
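
As an illustration of term extraction and weighting, the sketch below computes index term weights with a common tf-idf variant (term frequency normalized by the most frequent term, times the logarithm of the inverse document frequency). The slides do not prescribe this formula; it is assumed here, along with the function name `tf_idf` and the whitespace tokenization.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return {doc_id: {term: weight}} using normalized tf times idf."""
    tokens = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokens)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(t for terms in tokens.values() for t in set(terms))
    weights = {}
    for doc_id, terms in tokens.items():
        if not terms:
            weights[doc_id] = {}
            continue
        counts = Counter(terms)
        max_count = max(counts.values())
        weights[doc_id] = {
            t: (c / max_count) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()
        }
    return weights

# Hypothetical two-document collection.
index_weights = tf_idf({"d1": "network network pc", "d2": "network chip"})
print(index_weights)
```

The output is the same for every user, which is exactly the "black box" limitation listed above: "network" gets weight 0 in both documents because it occurs everywhere, regardless of where in the text it appears or who is searching.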

SLIDE 16

Personalized document indexing

To index structured documents

The formal representation of a document is defined by exploiting its logical structure (e.g. XML documents). Given a term t, for each subpart s_i of the document a distinct term weight is computed, expressing the importance of the term as a descriptor in that document subpart. The overall index term weight is computed by aggregating, in a user-driven way, the “partial” weights.

The user specifies both her/his preference about the sections in which to privilege the search and the aggregation function.

[Diagram: the document sections (TITLE, AUTHORS, ABSTRACT, INTRODUCTION, ...) yield the partial weights F_s1(d,t), F_s2(d,t), F_s3(d,t), F_s4(d,t); the aggregation function A combines them into the overall weight F(d,t)]
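
A minimal sketch of user-driven aggregation follows. The slide leaves both the partial weights and the aggregation function open, so everything concrete here is hypothetical: the function names, the numeric importance values and the two sample aggregation functions (an importance-weighted mean and a maximum over the sections the user cares about).

```python
def weighted_mean(pairs):
    """Aggregate (partial weight, importance) pairs by importance-weighted mean."""
    total = sum(importance for _, importance in pairs)
    return sum(w * importance for w, importance in pairs) / total if total else 0.0

def strongest_section(pairs):
    """Aggregate by taking the best partial weight among sections with importance > 0."""
    return max((w for w, importance in pairs if importance > 0), default=0.0)

def overall_weight(partial_weights, importance, aggregate):
    """Compute F(d,t) from the partial weights F_si(d,t).

    partial_weights: dict section -> F_si(d,t).
    importance: dict section -> user-assigned importance (assumed numeric).
    aggregate: the aggregation function chosen by the user.
    """
    pairs = [(w, importance.get(section, 0.0)) for section, w in partial_weights.items()]
    return aggregate(pairs)

# Hypothetical partial weights of one term in one document.
partial = {"title": 1.0, "abstract": 0.5, "introduction": 0.2}
user_importance = {"title": 3, "abstract": 2, "introduction": 1}
print(overall_weight(partial, user_importance, weighted_mean))
print(overall_weight(partial, user_importance, strongest_section))
```

Two users with different importance values or aggregation functions obtain different weights for the same term and document, which is the sense in which the indexing is personalized.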

SLIDE 17

Flexible Query Languages

A) linguistic query weights as flexible constraints on weighted document representations

Weighted query:
q = <network; important> AND <PC; not very important>

Weighted representation of document d1:
R(d1) = {0.2/computer, 0.6/network, 0.1/chip, 0.9/PC, 0.7/DOS}

[Diagram: the weighted query and the weighted representation of documents are compared by the partial matching mechanism]

Soft constraints:

  • The weights specify soft constraints on the weighted document representations
  • The Retrieval Status Value of a document expresses the degree of constraints satisfaction
  • The constraints’ evaluation depends on the weight semantics
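
The example query can be evaluated with a short sketch. The slide does not define the membership functions of the linguistic weights, so the ones below are assumptions: "important" is taken as the identity on the index weight, "very" squares a membership degree, "not" complements it, and AND is evaluated as the minimum. Other weight semantics would give different Retrieval Status Values.

```python
def important(w):
    """Assumed membership function: the index weight itself."""
    return w

def very(membership):
    """Assumed hedge: concentrate a membership function by squaring it."""
    return lambda w: membership(w) ** 2

def negate(membership):
    """Assumed negation: standard fuzzy complement."""
    return lambda w: 1.0 - membership(w)

def rsv(doc_weights, weighted_query):
    """Retrieval Status Value: degree to which all soft constraints hold (AND = min)."""
    return min(constraint(doc_weights.get(term, 0.0))
               for term, constraint in weighted_query)

# Document representation and query taken from the slide.
d1 = {"computer": 0.2, "network": 0.6, "chip": 0.1, "PC": 0.9, "DOS": 0.7}
q = [("network", important), ("PC", negate(very(important)))]
print(rsv(d1, q))
```

Under these assumptions d1 satisfies <network; important> to degree 0.6 but <PC; not very important> only to about 0.19, so the second constraint determines the overall value.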

SLIDE 18

Flexible Query Languages

B) linguistic aggregation operators

Use of linguistic quantifiers to specify aggregation criteria (the behaviour of these operators lies between AND and OR), e.g. all, most of, at least k:

  • to simplify query formulation
  • to improve expressiveness

all ---------------------------------> AND
most of
at least k, with k ∈ N
.......................
at least 1 ----------------------------> OR

Example:

Boolean query

(images AND noise) OR (images AND satellite) OR (images AND meteo*) OR (noise AND satellite) OR (noise AND meteo*) OR (meteo* AND satellite)

Same query with linguistic quantifiers

at least 2 (images, noise, satellite, meteo*)
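
The equivalence between the two formulations can be checked mechanically. In the sketch below, "at least k" is evaluated as the k-th largest satisfaction degree, which reduces to OR (maximum) for k = 1 and to AND (minimum) when k equals the number of terms; this evaluation rule and the function names are assumptions, since the slide only states the quantifiers informally.

```python
from itertools import product

def at_least(k, degrees):
    """Degree to which at least k criteria hold: the k-th largest degree."""
    return sorted(degrees, reverse=True)[k - 1]

def boolean_query(images, noise, satellite, meteo):
    """The six-clause Boolean query from the slide (AND = min, OR = max)."""
    pairs = [(images, noise), (images, satellite), (images, meteo),
             (noise, satellite), (noise, meteo), (meteo, satellite)]
    return max(min(a, b) for a, b in pairs)

# Check all 16 crisp combinations of term presence (1) and absence (0).
equivalent = all(at_least(2, values) == boolean_query(*values)
                 for values in product([0, 1], repeat=4))
print(equivalent)

# With graded degrees the same operator moves between OR and AND.
degrees = [0.9, 0.2, 0.7, 0.4]
print(at_least(1, degrees), at_least(2, degrees), at_least(4, degrees))
```

The quantified form stays one line long however many terms are listed, whereas the Boolean form needs one clause per pair of terms.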

SLIDE 19

Flexible querying of Structured Documents

First step: the user ranks the logical sections of the documents in decreasing order of their perceived importance in bearing relevant information; section s_i is more important than section s_j iff i < j (i and j being the positions of s_i and s_j, respectively, in the ordered list).

Second step: the user formulates a flexible query based on the following soft constraints: “t in Q sections”, in which Q is a linguistic quantifier such as at least one, most, all, that specifies the number of the documents’ sections that should be taken into account to compute the overall weight of the index term t in a whole document d.
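
One way to evaluate a constraint of the form "t in Q sections" is an ordered weighted average (OWA) whose weights are derived from the quantifier. The sketch below is a simplification under stated assumptions: it covers only the second step and ignores the section ranking of the first step, and the quantifier definitions (in particular "most" as the square of the proportion) are illustrative choices, not taken from the slides.

```python
def owa_weights(quantifier, n):
    """Derive n OWA weights from a non-decreasing quantifier on [0, 1]."""
    return [quantifier(i / n) - quantifier((i - 1) / n) for i in range(1, n + 1)]

def term_weight_in_q_sections(section_weights, quantifier):
    """Overall weight of a term: OWA of its per-section weights, best first."""
    ordered = sorted(section_weights, reverse=True)
    weights = owa_weights(quantifier, len(ordered))
    return sum(w * v for w, v in zip(weights, ordered))

# Assumed quantifier definitions over the proportion r of sections.
QUANTIFIERS = {
    "at least one": lambda r: 1.0 if r > 0 else 0.0,
    "most": lambda r: r ** 2,
    "all": lambda r: 1.0 if r >= 1.0 else 0.0,
}

# Hypothetical weights of term t in four sections of document d.
section_weights = [0.8, 0.4, 0.0, 0.6]
for name, quantifier in QUANTIFIERS.items():
    print(name, term_weight_in_q_sections(section_weights, quantifier))
```

With these definitions "at least one" returns the best section weight, "all" returns the worst, and "most" falls in between.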

SLIDE 20

Conclusions

Some approaches to the definition of “intelligent” Information Retrieval Systems have been presented. In particular, some promising research directions that could guarantee the development of more effective IRSs have been outlined.

Among these, the research efforts aimed at defining new indexing techniques of semi-structured documents (such as XML documents) are very important: the possibility of creating in a user-driven way the documents’ surrogates would ensure a modeling of the users’ interests also at the indexing level (usually this is limited to the query formulation level).