CS344: Introduction to Artificial Intelligence Pushpak - - PowerPoint PPT Presentation



slide-1
SLIDE 1

CS344: Introduction to Artificial Intelligence

Pushpak Bhattacharyya
CSE Dept., IIT Bombay
Lecture 32: Information Retrieval: Basic concepts and Model

slide-2
SLIDE 2

The elusive user satisfaction

[Diagram labels: Ranking, Correctness, Coverage, Query Processing, Indexing, Crawling, NER, Stemming, MWE]

slide-3
SLIDE 3

Query: Indian Tribes in Latin America

slide-4
SLIDE 4

Google

  • Indians of Latin America: an exhibition of materials in the Lilly ...
    Lilly Library: Latin American mss. Brazil. A large map in colors, this locates the course of rivers, towns, mountain ranges, and Indian tribes. ...
    www.indiana.edu/~liblilly/etexts/ila/ - 241k
  • Indigenous peoples of the Americas - Wikipedia, the free encyclopedia
    American Indian creation legends tell of a variety of originations of ..... that it had confirmed the presence of 67 different uncontacted tribes in Brazil, ...
    en.wikipedia.org/wiki/Indigenous_peoples_of_the_Americas - 178k
  • Cognition :: Giving Technologies New Meaning
    The volumes that Farabee produced from his travels include Indian Tribes of Eastern Peru ... motor vehicles that are lemons · Indian tribes of Latin America ...
    wikipedia.cognition.com/?num=10&from_val...Indian%20tribes%20of%20Latin%20Ame... - 54k
  • Top 25 American Indian Tribes for the United
    Top 25 American Indian Tribes for the United States: 1990 and 1980--Con. ... 16028 73.0 Canadian and Latin American... 19375 248.3 Chickasaw. ...
    www.census.gov/population/socdemo/race/indian/ailang1.txt - 6k
  • Ten Largest American Indian Tribes, 2000 - Infoplease.com
    Latin American Indian, 180940. Choctaw, 158774. Sioux, 153360 ... American Indian and Alaska Native Population by Selected Tribes, Census 2000 ...
    www.infoplease.com/ipa/A0767349.html - 29k
  • The Indian Tribes of North America by John R. Swanton at Questia ...
    Read the complete book The Indian Tribes of North America by becoming a ..... Sao Paulo recently elected its...must cope with demands by Latin America for ...
    www.questia.com/library/book/the-indian-tribes-of-north-america-by-john-r-swanton.jsp
slide-5
SLIDE 5

Yahoo

  • different indian tribes of latin america,
  • More...
  • WEB RESULTS
  • South America Daily
    Indian Pepper Photos Prices Spices. The Times of India ... Archaeologists unearth ancient tribe members sacri London ... Iran and the left in Latin America ...
    www.wn.com/LatinAmerica - 192k
  • Native American Indian Cultures - Mexico, South America
    Also, many of the Yanomamo tribe are losing their members and culture by ... of Amazon Indian tribal art in the world, with over 75 tribes represented. ...
    indian-cultures.com
  • Native American Indian Cultures - links
    North American Tribes. rednation.org - RedNation of the Cherokee. Meso and Latin American Indians ... Human Rights in Latin America ...
    www.indian-cultures.com/Cultures/Links.html
  • Indigenous peoples of the Americas - Wikipedia, the free encyclopedia
    ... in America, particularly with regards to native Indians. ... Uncontacted Indian tribe found in Brazil's Amazon. The Peopling of the American Continents ...
    en.wikipedia.org/wiki/Indigenous_peoples_of_the_Americas - 179k
  • Native American Images - American Indian North America Tribe Map
    American Indian North America Tribe Map. Click here to view more images ... Medal | History Hotline | Iraqi War | Korean War | Latin Americans | Medal of ... |
    www.nativeamericans.com/NativeAmericanImages6.htm
  • Resources for
    Numbers of Native Americans or Indians in Latin America: 39,442,000 million ... Indian Tribes in Latin America - Latin American Indian Population - Up date ...
    www.xmission.com/~amauta/population.htm
  • Indian tribe found in Brazil's Amazon - Boston.com
    Latin America/Caribbean. Indian tribe found in Brazil's Amazon ... Uncontacted tribes are usually discovered when loggers and ranchers encroach on ...
    boston.com/news/world/latinamerica/articles/2007/06/01/.../ News
slide-6
SLIDE 6

AltaVista

  • Latin America
    Compare airfare prices from over 120 top websites and save up to 70%.
    Flights.SideStep.com
  • Regional Telecom Statistics & Forecasts
    Fixed, mobile, Internet, broadband telecom statistics and forecasts.
    www.hottelecom.com

AltaVista found 4,520,000 results

  • South America Daily
    Indian Pepper Photos Prices Spices. The Times of India ... Archaeologists unearth ancient tribe members sacri London ... Iran and the left in Latin America ...
    www.wn.com/LatinAmerica
  • Native American Indian Cultures - Mexico, South America
    Also, many of the Yanomamo tribe are losing their members and culture by ... of Amazon Indian tribal art in the world, with over 75 tribes represented. ...
    indian-cultures.com
  • Indian tribes in Suriname cross borders - Boston.com
    Days of rain near Suriname's southern border have deluged Amerindian farmland, ... Latin America/Caribbean. Indian tribes in Suriname cross borders ...
    www.boston.com/news/world/latinamerica/articles/2006/05/12...in_suriname_cross_borders
  • Indigenous peoples of the Americas - Wikipedia, the free encyclopedia
    ... in America, particularly with regards to native Indians. ... Uncontacted Indian tribe found in Brazil's Amazon. The Peopling of the American Continents ...
    en.wikipedia.org/wiki/Indigenous_peoples_of_the_Americas
  • Native American Images - American Indian North America Tribe Map
    American Indian North America Tribe Map. Click here to view more images ... Medal | History Hotline | Iraqi War | Korean War | Latin Americans | Medal of ...
    www.nativeamericans.com/NativeAmericanImages6.htm

slide-7
SLIDE 7

MSN

  • Native American Images - American Indian North America Tribe Map
    Native American Images American Indian North America Tribe Map Click here to view more images ... History Hotline | Iraqi War | Korean War | Latin Americans ...
    www.nativeamericans.com/NativeAmericanImages6.htm
  • Resources for
    ... 152t.), Panama (126t.), Paraguay (67t.), Surinam (10t.), and Venezuela (331t.) (t.= thousand). - Indian Tribes in Latin America
    www.xmission.com/~amauta/tribes.htm
  • Latin America Community Assistance Foundation - LACA
    The Tarahumara Indians are the most primitive of all Indian tribes in North America, and are the least touched by modern society.
    www.lacafoundation.org/?page_id=58
  • Latin America Tour Set for Curtis Photos of North America Tribes
    28 September 2005. Latin America Tour Set for Curtis Photos of North America Tribes. Famed photographer recorded Indian tribal life in 19th, early 20th century
    www.america.gov/st/washfile-english/2005/September/20050928134700GLnesnoM0.2225763.html
  • Latin America / / Current
    Current TV Latin America category, discover popular Latin America stories, news and ... of the Amazon jungle, a land conflict between rice farmers and a handful of Indian tribes ...
    current.com/topics/75844112_latin_america
  • Bloomberg.com: Latin America
    May 30 (Bloomberg) -- Brazil's National Indian Foundation has discovered an Indian tribe in the Amazon that hasn't had contact with civilization in a rare sighting of the few ...
    www.bloomberg.com/apps/news?pid=20601086&sid=aSrj5wfHW.CQ&refer=latin_america
slide-8
SLIDE 8

Personalized focused search (wikipedia.cognition)

Indian Latin-America tribe: 249 files

  • William Curtis Farabee
    The volumes that Farabee produced from his travels include Indian Tribes of Eastern Peru based on his first trip in 1906-1908 (Obituary, 1925).
  • Mexican Texas
    Settlers were empowered to create their own militias to help control hostile Indian tribes. Texas faced raids from both the Apache and Comanche tribes, [...]
  • Temecula, California
    The Luiseño and Cahuilla tribes were involved, rather bloodily, in the local battles of the Mexican-American War during the following years.
  • Kaweah Indian Nation
    Recently, scam artists have sold purported citizenships in the non-recognized tribe, particularly to Mexican nationals who have entered the US illegally.1 [...]
  • Flag of Puerto Rico
    The tribal nation flag of the Jatibonicu Taino Indians of Borikén, represents the Jatibonicu Taino tribe's original pre-Columbian territories of [...]
  • Maina Indians
    The Maina Indians are a group of tribes constituting a distinct linguistic stock, the [...] along the north bank of the Marañón River in South America
  • Erie (tribe)
    ^ Ebooks by Google: "Handbook of American Indians North of Mexico" By Frederick Webb Hodge http://books.google.com/books?
  • Miccosukee
    [1] Other members went on to form the Miccosukee Tribe of Indians of Florida, which was not recognized by Fidel Castro's Cuban government in 1959. The [...]
  • New Tribes Mission
    In Paraguay in 1979 and 1986, New Tribes Mission was accused of assisting in the forcible contact of nomadic Ayoreo Indians.
slide-9
SLIDE 9

Example: Semantically precise search for relations/events

Query: afghans destroying opium poppies

slide-10
SLIDE 10

India Wide Cross Lingual Information Access (CLIA) Endeavour

slide-11
SLIDE 11

Motivation

  • English still the most dominant language on the web
      – Contributes 72% of the content
  • Number of non-English users steadily rising all over the world
  • English penetration in India
      – Estimated to be around 3-4%
      – Mostly the urban educated class
  • Need to enable access to above information through local languages

slide-12
SLIDE 12

Cross Language Information Retrieval (CLIR)

[Diagram labels: Hindi Query; CLIR Engine; Language Resources; Crawled and Indexed Web Pages; Target Language Index in English; Target Information in English; Ranked List of Results; Result Snippets in Hindi]

Hindi query (translated): "Tirupati travel"

Result snippet in Hindi (translated): "Rail options for coming to Tirupati. Many trains are available to reach the holy city of Tirupati. If you are travelling from Mumbai, you can travel by the Mumbai-Chennai Express."

slide-13
SLIDE 13

Challenges involved in CLIA

  • Indexing, retrieval and ranking of multilingual documents
  • Web data is not clean and regular
  • Different font encodings, some of them proprietary
  • Spelling variations very common
  • Different document encodings
  • Language identification needed to invoke appropriate language analyzers
  • Involves a number of fundamental NLP research problems like query disambiguation, machine transliteration, named-entity recognition, multi-word recognition

slide-14
SLIDE 14

Cross Language Information Access (CLIA) Consortia Project

  • Indian Language CLIR Engine under development
      – Input: Six Indian Languages (Hindi, Bengali, Telugu, Tamil, Marathi and Punjabi)
      – Output: Hindi, English and Input Language of Query
      – Domains: Tourism (Current Release)
  • Involves 10 academic institutes all over the country: IITs, Indian Statistical Institute, CDAC, Anna University, Jadavpur University
      – IIT Bombay: Overall co-ordinator
      – Responsible for Hindi, Marathi language verticals
  • Includes full-fledged search features
      – Snippet translation
      – Summary generation
      – Information Extraction

slide-15
SLIDE 15

Portal

Public portal released at http://www.clia.iitb.ac.in/clia-beta-ext/ in September 2009. (Outside IITB)

Public portal released at http://www.clia.iitb.ac.in:8080/clia-beta-ext/ in September 2009. (Inside IITB)

slide-16
SLIDE 16
slide-17
SLIDE 17

Recent Press Coverage

slide-18
SLIDE 18

Hindustan Times

slide-19
SLIDE 19

IR Basics

(mainly from R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, Wokingham, UK, 1999, and Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.)

slide-20
SLIDE 20

Definition of IR Model

An IR model is a quadruple [D, Q, F, R(qi, dj)]

where
  • D: documents
  • Q: queries
  • F: framework for modeling documents, queries and their relationships
  • R(.,.): ranking function returning a real number expressing the relevance of dj with qi

slide-21
SLIDE 21

Index Terms

  • Keywords representing a document
  • Semantics of the word helps remember the main theme of the document
  • Generally nouns
  • Assign numerical weights to index terms to indicate their importance

slide-22
SLIDE 22

Introduction

[Diagram labels: Docs, Index Terms, doc, Information Need, query, match, Ranking]

slide-23
SLIDE 23

Classic IR Models - Basic Concepts

  • The importance of the index terms is represented by weights associated to them
  • Let
      – t be the number of index terms in the system
      – K = {k1, k2, k3, ..., kt} be the set of all index terms
      – ki be an index term
      – dj be a document
      – wij be a weight associated with (ki,dj)
      – wij = 0 indicates that the term does not belong to the doc
      – vec(dj) = (w1j, w2j, …, wtj) be the weighted vector associated with the document dj
      – gi(vec(dj)) = wij be a function which returns the weight associated with the pair (ki,dj)
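The notation above can be made concrete with a small sketch. The vocabulary and the weights below are hypothetical and only illustrate how vec(dj) and gi might be represented; they are not taken from the lecture.

```python
from typing import Dict, List

# Hypothetical vocabulary K = {k1, ..., kt}; the term names are illustrative only.
K: List[str] = ["retrieval", "index", "query"]


def vec(doc_weights: Dict[str, float]) -> List[float]:
    """Build vec(dj) = (w1j, ..., wtj); a term missing from the doc gets weight 0."""
    return [doc_weights.get(k, 0.0) for k in K]


def g(i: int, doc_vector: List[float]) -> float:
    """g_i(vec(dj)) = wij, using the slide's 1-based term index i."""
    return doc_vector[i - 1]


# A made-up document with non-zero weights for k1 and k3 only.
d_j = vec({"retrieval": 0.8, "query": 0.3})
```

Here `g(2, d_j)` is 0, which matches the convention that wij = 0 means the term does not belong to the document.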

slide-24
SLIDE 24

The Boolean Model

  • Simple model based on set theory
  • Only AND, OR and NOT are used

  • Queries specified as boolean expressions
      – precise semantics
      – neat formalism
      – q = ka ∧ (kb ∨ ¬kc)
  • Terms are either present or absent. Thus, wij ∈ {0,1}
  • Consider
      – q = ka ∧ (kb ∨ ¬kc)
      – vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
      – vec(qcc) = (1,1,0) is a conjunctive component
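The disjunctive normal form of the example query can be checked by brute force. The sketch below enumerates every (ka, kb, kc) assignment and keeps those that satisfy q; the function names are illustrative, not part of the lecture.

```python
from itertools import product
from typing import List, Tuple


def q(ka: int, kb: int, kc: int) -> bool:
    """The example query q = ka AND (kb OR NOT kc)."""
    return bool(ka and (kb or not kc))


def dnf_components() -> List[Tuple[int, int, int]]:
    """Enumerate all binary (ka, kb, kc) assignments that satisfy q."""
    return [bits for bits in product((1, 0), repeat=3) if q(*bits)]


# Each tuple is one conjunctive component of vec(qdnf).
q_dnf = dnf_components()
```

The three tuples in `q_dnf` are exactly the conjunctive components listed on the slide.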

slide-25
SLIDE 25

The Boolean Model

[Venn diagram over Ka, Kb and Kc with the regions (1,0,0), (1,1,0) and (1,1,1) marked]

  • q = ka ∧ (kb ∨ ¬kc)
  • sim(q,dj) = 1 if ∃ vec(qcc) | (vec(qcc) ∈ vec(qdnf)) ∧ (∀ki, gi(vec(dj)) = gi(vec(qcc)))
  • sim(q,dj) = 0 otherwise
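A minimal sketch of this similarity function, assuming documents are given as binary tuples over (ka, kb, kc) and the query is already in disjunctive normal form. The names `Q_DNF` and `sim` are illustrative.

```python
from typing import Iterable, Sequence, Tuple

# vec(qdnf) for q = ka AND (kb OR NOT kc), as listed on the earlier slide.
Q_DNF = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]


def sim(q_dnf: Iterable[Tuple[int, ...]], d_j: Sequence[int]) -> int:
    """Return 1 if some conjunctive component agrees with d_j on every term, else 0."""
    for q_cc in q_dnf:
        if all(g_d == g_q for g_d, g_q in zip(d_j, q_cc)):
            return 1
    return 0
```

A document with weights (1,1,0) gets similarity 1, while (1,0,1) gets 0: the outcome is strictly binary, with no credit for a partial match.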

slide-26
SLIDE 26

Drawbacks of the Boolean Model

  • Retrieval based on binary decision criteria with no notion of partial matching
  • No ranking of the documents is provided (absence of a grading scale)
  • Information need has to be translated into a Boolean expression which most users find awkward
  • The Boolean queries formulated by the users are most often too simplistic
  • As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query

slide-27
SLIDE 27

The Vector Model

  • Use of binary weights is too limiting
  • Non-binary weights provide consideration for partial matches
  • These term weights are used to compute a degree of similarity between a query and each document
  • Ranked set of documents provides for better matching

slide-28
SLIDE 28

The Vector Model

  • Define:
      – wij > 0 whenever ki ∈ dj
      – wiq >= 0 associated with the pair (ki,q)
      – vec(dj) = (w1j, w2j, ..., wtj)
      – vec(q) = (w1q, w2q, ..., wtq)
  • In this space, queries and documents are represented as weighted vectors

slide-29
SLIDE 29

The Vector Model

[Diagram: vectors dj and q separated by angle Θ]

  • Sim(q,dj) = cos(Θ) = [vec(dj) • vec(q)] / (|dj| * |q|) = [Σi wij * wiq] / (|dj| * |q|)
  • Since wij >= 0 and wiq >= 0, 0 <= sim(q,dj) <= 1
  • A document is retrieved even if it matches the query terms only partially
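The cosine formula can be sketched directly from the definition. Returning 0 for an all-zero vector is an assumption made here to avoid division by zero; the slide does not cover that case.

```python
import math
from typing import Sequence


def cosine_sim(d: Sequence[float], q: Sequence[float]) -> float:
    """sim(q, dj) = (dj . q) / (|dj| * |q|) for equal-length weight vectors."""
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        # Assumption: an empty document or query is treated as similarity 0.
        return 0.0
    return dot / (norm_d * norm_q)
```

For example, a document (1, 0, 1) and a query (1, 1, 1) share two of the three terms, so the similarity is 2 / (√2 · √3), strictly between 0 and 1: the document is still retrieved on a partial match.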
slide-30
SLIDE 30

The Vector Model

  • Sim(q,dj) = [Σi wij * wiq] / (|dj| * |q|)
  • How to compute the weights wij and wiq?
  • A good weight must take into account two effects:
      – quantification of intra-document contents (similarity)
          • tf factor, the term frequency within a document
      – quantification of inter-documents separation (dissimilarity)
          • idf factor, the inverse document frequency
      – wij = tf(i,j) * idf(i)

slide-31
SLIDE 31

The Vector Model

  • Let,
      – N be the total number of docs in the collection
      – ni be the number of docs which contain ki
      – freq(i,j) be the raw frequency of ki within dj
  • A normalized tf factor is given by
      – f(i,j) = freq(i,j) / maxl(freq(l,j))
      – where the maximum is computed over all terms which occur within the document dj
  • The idf factor is computed as
      – idf(i) = log(N/ni)
      – the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
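The two factors can be sketched on a toy collection. The documents below are hypothetical, the natural logarithm is an assumption (the slide does not fix the base), and returning 0 for a term that appears in no document is also an assumption.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical toy collection; document names and tokens are illustrative only.
DOCS: Dict[str, List[str]] = {
    "d1": ["rail", "tirupati", "rail"],
    "d2": ["tirupati", "temple"],
    "d3": ["rail", "mumbai", "chennai"],
}


def normalized_tf(term: str, tokens: List[str]) -> float:
    """f(i,j) = freq(i,j) / max_l freq(l,j), the max taken over terms in the doc."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    return counts[term] / max(counts.values())


def idf(term: str, docs: Dict[str, List[str]]) -> float:
    """idf(i) = log(N / n_i); natural log assumed."""
    n_i = sum(1 for tokens in docs.values() if term in tokens)
    if n_i == 0:
        # Assumption: a term absent from the whole collection gets idf 0.
        return 0.0
    return math.log(len(docs) / n_i)


def tf_idf(term: str, doc_name: str, docs: Dict[str, List[str]]) -> float:
    """wij = f(i,j) * idf(i)."""
    return normalized_tf(term, docs[doc_name]) * idf(term, docs)
```

In this toy collection "temple" occurs in one document out of three, so it receives a higher idf than "rail", which occurs in two.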

slide-32
SLIDE 32

The Vector Model

  • The best term-weighting schemes use weights which are given by
      – wij = f(i,j) * log(N/ni)
      – the strategy is called a tf-idf weighting scheme
  • For the query term weights, a suggestion is
      – wiq = (0.5 + [0.5 * freq(i,q) / maxl(freq(l,q))]) * log(N/ni)
  • The vector model with tf-idf weights is a good ranking strategy with general collections
  • The vector model is usually as good as the known ranking alternatives. It is also simple and fast to compute.
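The suggested query weight can be sketched as follows. The function name and arguments are illustrative; giving weight 0 to terms that do not occur in the query, and using the natural logarithm, are assumptions not stated on the slide.

```python
import math
from collections import Counter
from typing import List


def query_weight(term: str, query_tokens: List[str], n_docs: int, n_i: int) -> float:
    """wiq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / n_i)."""
    counts = Counter(query_tokens)
    if counts[term] == 0 or n_i == 0:
        # Assumption: terms outside the query (or the collection) get weight 0.
        return 0.0
    tf_part = 0.5 + 0.5 * counts[term] / max(counts.values())
    return tf_part * math.log(n_docs / n_i)
```

With the query ["rail", "rail", "tirupati"], the term "rail" gets the full tf part of 1.0 and "tirupati" gets 0.75, so the 0.5 offset keeps rarer query terms from being discounted too heavily.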
slide-33
SLIDE 33

The Vector Model

  • Advantages:
      – term-weighting improves quality of the answer set
      – partial matching allows retrieval of docs that approximate the query conditions
      – cosine ranking formula sorts documents according to degree of similarity to the query
  • Disadvantages:
      – assumes independence of index terms (??); not clear that this is bad though

slide-34
SLIDE 34

The Vector Model: Example I

[Venn diagram: documents d1 to d7 placed in the regions of index terms k1, k2 and k3]

        k1   k2   k3   q • dj
  d1     1    0    1     2
  d2     1    0    0     1
  d3     0    1    1     2
  d4     1    0    0     1
  d5     1    1    1     3
  d6     1    1    0     2
  d7     0    1    0     1
  q      1    1    1

slide-35
SLIDE 35

The Vector Model: Example II

[Venn diagram: documents d1 to d7 placed in the regions of index terms k1, k2 and k3]

        k1   k2   k3   q • dj
  d1     1    0    1     4
  d2     1    0    0     1
  d3     0    1    1     5
  d4     1    0    0     1
  d5     1    1    1     6
  d6     1    1    0     3
  d7     0    1    0     2
  q      1    2    3

slide-36
SLIDE 36

The Vector Model: Example III

[Venn diagram: documents d1 to d7 placed in the regions of index terms k1, k2 and k3]

        k1   k2   k3   q • dj
  d1     2    0    1     5
  d2     1    0    0     1
  d3     0    1    3    11
  d4     2    0    0     2
  d5     1    2    4    17
  d6     1    2    0     5
  d7     0    5    0    10
  q      1    2    3
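The q • dj column of Example III can be reproduced with a short sketch. It assumes the blank table cells are zero weights and that each non-zero weight sits in the column that makes the listed dot product come out right with q = (1, 2, 3); ranking by the raw dot product follows the column shown in the example.

```python
from typing import Dict, Sequence, Tuple

# Query and document weights over (k1, k2, k3) for Example III.
# Assumption: blank cells in the slide's table are zero weights.
Q: Tuple[int, int, int] = (1, 2, 3)
DOCS: Dict[str, Tuple[int, int, int]] = {
    "d1": (2, 0, 1),
    "d2": (1, 0, 0),
    "d3": (0, 1, 3),
    "d4": (2, 0, 0),
    "d5": (1, 2, 4),
    "d6": (1, 2, 0),
    "d7": (0, 5, 0),
}


def dot(d: Sequence[int], q: Sequence[int]) -> int:
    """Inner product q . dj, the numerator of the cosine formula."""
    return sum(w_d * w_q for w_d, w_q in zip(d, q))


scores = {name: dot(d, Q) for name, d in DOCS.items()}
# Highest score first; documents with equal scores keep their listing order.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions d5 ranks first with a score of 17, followed by d3 and d7.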