CS344: Introduction to Artificial Intelligence Pushpak - - PowerPoint PPT Presentation
CS344: Introduction to Artificial Intelligence
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
Lecture 32: Information Retrieval: Basic concepts and Model
The elusive user satisfaction
- Ranking
- Correctness
- Coverage
- Query Processing
- Indexing
- Crawling
- NER
- Stemming
- MWE
Query: Indian Tribes in Latin America
- Indians of Latin America: an exhibition of materials in the Lilly ...
  Lilly Library: Latin American mss. Brazil. A large map in colors, this locates the course of rivers, towns, mountain ranges, and Indian tribes. ...
  www.indiana.edu/~liblilly/etexts/ila/ - 241k
- Indigenous peoples of the Americas - Wikipedia, the free encyclopedia
  American Indian creation legends tell of a variety of originations of ..... that it had confirmed the presence of 67 different uncontacted tribes in Brazil, ...
  en.wikipedia.org/wiki/Indigenous_peoples_of_the_Americas - 178k
- Cognition :: Giving Technologies New Meaning
  The volumes that Farabee produced from his travels include Indian Tribes of Eastern Peru ... motor vehicles that are lemons · Indian tribes of Latin America ...
  wikipedia.cognition.com/?num=10&from_val...Indian%20tribes%20of%20Latin%20Ame... - 54k
- Top 25 American Indian Tribes for the United
  Top 25 American Indian Tribes for the United States: 1990 and 1980--Con. ... 16028 73.0 Canadian and Latin American... 19375 248.3 Chickasaw. ...
  www.census.gov/population/socdemo/race/indian/ailang1.txt - 6k
- Ten Largest American Indian Tribes, 2000 - Infoplease.com
  Latin American Indian, 180940. Choctaw, 158774. Sioux, 153360 ... American Indian and Alaska Native Population by Selected Tribes, Census 2000 ...
  www.infoplease.com/ipa/A0767349.html - 29k
- The Indian Tribes of North America by John R. Swanton at Questia ...
  Read the complete book The Indian Tribes of North America by becoming a ..... Sao Paulo recently elected its...must cope with demands by Latin America for ...
  www.questia.com/library/book/the-indian-tribes-of-north-america-by-john-r-swanton.jsp
Yahoo
- South America Daily
  Indian Pepper Photos Prices Spices. The Times of India ... Archaeologists unearth ancient tribe members sacri London ... Iran and the left in Latin America ...
  www.wn.com/LatinAmerica - 192k
- Native American Indian Cultures - Mexico, South America
  Also, many of the Yanomamo tribe are losing their members and culture by ... of Amazon Indian tribal art in the world, with over 75 tribes represented. ...
  indian-cultures.com
- Native American Indian Cultures - links
  North American Tribes. rednation.org - RedNation of the Cherokee. Meso and Latin American Indians ... Human Rights in Latin America ...
  www.indian-cultures.com/Cultures/Links.html
- Indigenous peoples of the Americas - Wikipedia, the free encyclopedia
  ... in America, particularly with regards to native Indians. ... Uncontacted Indian tribe found in Brazil's Amazon. The Peopling of the American Continents ...
  en.wikipedia.org/wiki/Indigenous_peoples_of_the_Americas - 179k
- Native American Images - American Indian North America Tribe Map
  American Indian North America Tribe Map. Click here to view more images ... Medal | History Hotline | Iraqi War | Korean War | Latin Americans | Medal of ... |
  www.nativeamericans.com/NativeAmericanImages6.htm
- Resources for
  Numbers of Native Americans or Indians in Latin America: 39,442,000 million ... Indian Tribes in Latin America - Latin American Indian Population - Up date ...
  www.xmission.com/~amauta/population.htm
- Indian tribe found in Brazil's Amazon - Boston.com
  Latin America/Caribbean. Indian tribe found in Brazil's Amazon ... Uncontacted tribes are usually discovered when loggers and ranchers encroach on ...
  boston.com/news/world/latinamerica/articles/2007/06/01/.../ News
AltaVista
- Latin America
  Compare airfare prices from over 120 top websites and save up to 70%.
  Flights.SideStep.com
- Regional Telecom Statistics & Forecasts
  Fixed, mobile, Internet, broadband telecom statistics and forecasts.
  www.hottelecom.com
AltaVista found 4,520,000 results
- South America Daily
  Indian Pepper Photos Prices Spices. The Times of India ... Archaeologists unearth ancient tribe members sacri London ... Iran and the left in Latin America ...
  www.wn.com/LatinAmerica
- Native American Indian Cultures - Mexico, South America
  Also, many of the Yanomamo tribe are losing their members and culture by ... of Amazon Indian tribal art in the world, with over 75 tribes represented. ...
  indian-cultures.com
- Indian tribes in Suriname cross borders - Boston.com
  Days of rain near Suriname's southern border have deluged Amerindian farmland, ... Latin America/Caribbean. Indian tribes in Suriname cross borders ...
  www.boston.com/news/world/latinamerica/articles/2006/05/12...in_suriname_cross_borders
- Indigenous peoples of the Americas - Wikipedia, the free encyclopedia
  ... in America, particularly with regards to native Indians. ... Uncontacted Indian tribe found in Brazil's Amazon. The Peopling of the American Continents ...
  en.wikipedia.org/wiki/Indigenous_peoples_of_the_Americas
- Native American Images - American Indian North America Tribe Map
  American Indian North America Tribe Map. Click here to view more images ... Medal | History Hotline | Iraqi War | Korean War | Latin Americans | Medal of ...
  www.nativeamericans.com/NativeAmericanImages6.htm
MSN
- Native American Images - American Indian North America Tribe Map
  Native American Images American Indian North America Tribe Map Click here to view more images ... History Hotline | Iraqi War | Korean War | Latin Americans ...
  www.nativeamericans.com/NativeAmericanImages6.htm
- Resources for
  ... 152t.), Panama (126t.), Paraguay (67t.), Surinam (10t.), and Venezuela (331t.) (t.= thousand). - Indian Tribes in Latin America
  www.xmission.com/~amauta/tribes.htm
- Latin America Community Assistance Foundation - LACA
  The Tarahumara Indians are the most primitive of all Indian tribes in North America, and are the least touched by modern society.
  www.lacafoundation.org/?page_id=58
- Latin America Tour Set for Curtis Photos of North America Tribes
  28 September 2005. Latin America Tour Set for Curtis Photos of North America Tribes. Famed photographer recorded Indian tribal life in 19th, early 20th century
  www.america.gov/st/washfile-english/2005/September/20050928134700GLnesnoM0.2225763.html
- Latin America / / Current
  Current TV Latin America category, discover popular Latin America stories, news and ... of the Amazon jungle, a land conflict between rice farmers and a handful of Indian tribes ...
  current.com/topics/75844112_latin_america
- Bloomberg.com: Latin America
  May 30 (Bloomberg) -- Brazil's National Indian Foundation has discovered an Indian tribe in the Amazon that hasn't had contact with civilization in a rare sighting of the few ...
  www.bloomberg.com/apps/news?pid=20601086&sid=aSrj5wfHW.CQ&refer=latin_america
Personalized focused search (wikipedia.cognition)
- Indian Latin-America tribe: 249 files
- William Curtis Farabee
  The volumes that Farabee produced from his travels include Indian Tribes of Eastern Peru based on his first trip in 1906-1908 (Obituary, 1925).
- Mexican Texas
  Settlers were empowered to create their own militias to help control hostile Indian tribes. Texas faced raids from both the Apache and Comanche tribes, [...]
- Temecula, California
  The Luiseño and Cahuilla tribes were involved, rather bloodily, in the local battles of the Mexican-American War during the following years.
- Kaweah Indian Nation
  Recently, scam artists have sold purported citizenships in the non-recognized tribe, particularly to Mexican nationals who have entered the US illegally.1 [...]
- Flag of Puerto Rico
  The tribal nation flag of the Jatibonicu Taino Indians of Borikén, represents the Jatibonicu Taino tribe's original pre-Columbian territories of [...]
- Maina Indians
  The Maina Indians are a group of tribes constituting a distinct linguistic stock, the [...] along the north bank of the Marañón River in South America
- Erie (tribe)
  ^ Ebooks by Google: "Handbook of American Indians North of Mexico" By Frederick Webb Hodge http://books.google.com/books?
- Miccosukee
  [1] Other members went on to form the Miccosukee Tribe of Indians of Florida, which was not recognized by Fidel Castro's Cuban government in 1959. The [...]
- New Tribes Mission
  In Paraguay in 1979 and 1986, New Tribes Mission was accused of assisting in the forcible contact of nomadic Ayoreo Indians.
Example: Semantically precise search for relations/events
Query: afghans destroying opium poppies
India Wide Cross Lingual Information Access (CLIA) Endeavour
Motivation
- English is still the most dominant language on the web
  – Contributes 72% of the content
- Number of non-English users steadily rising all over the world
- English penetration in India
  – Estimated to be around 3-4%
  – Mostly the urban educated class
- Need to enable access to the above information through local languages
Cross Language Information Retrieval (CLIR)
[Figure: a Hindi query ("Tirupati travel") is given to the CLIR engine, which uses language resources and a target-language index of crawled and indexed web pages in English to find the target information in English. The engine returns a ranked list of results with result snippets in Hindi, for example: "Many trains are available to reach the holy city of Tirupati. If you are travelling from Mumbai, you can travel by the Mumbai-Chennai Express train."]
Challenges involved in CLIA
- Indexing, retrieval and ranking of multilingual documents
- Web data is not clean and regular
  – Different font encodings, some of them proprietary
  – Spelling variations very common
  – Different document encodings
- Language identification needed to invoke appropriate language analyzers
- Involves a number of fundamental NLP research problems like query disambiguation, machine transliteration, named-entity recognition, multi-word recognition
Cross Language Information Access (CLIA) Consortia Project
- Indian Language CLIR Engine under development
  – Input: six Indian languages (Hindi, Bengali, Telugu, Tamil, Marathi and Punjabi)
  – Output: Hindi, English and the input language of the query
  – Domains: Tourism (current release)
- Involves 10 academic institutes all over the country: IITs, Indian Statistical Institute, CDAC, Anna University, Jadavpur University
  – IIT Bombay: overall co-ordinator
  – Responsible for Hindi, Marathi language verticals
- Includes full-fledged search features
  – Snippet translation
  – Summary generation
  – Information Extraction
Portal
- Public portal released at http://www.clia.iitb.ac.in/clia-beta-ext/ in September 2009. (Outside IITB)
- Public portal released at http://www.clia.iitb.ac.in:8080/clia-beta-ext/ in September 2009. (Inside IITB)
Recent Press Coverage
Hindustan Times
IR Basics
(mainly from R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, Wokingham, UK, 1999, and Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.)
Definition of IR Model
An IR model is a quadruple [D, Q, F, R(qi, dj)]
where
- D: documents
- Q: queries
- F: framework for modeling documents, queries and their relationships
- R(.,.): ranking function returning a real number expressing the relevance of dj with respect to qi
Index Terms
- Keywords representing a document
- Semantics of the word helps remember the main theme of the document
- Generally nouns
- Assign numerical weights to index terms to indicate their importance
Introduction
[Figure: Docs are represented by index terms (doc); the information need is expressed as a query; query and doc are matched, and the matches are ranked.]
Classic IR Models - Basic Concepts
- The importance of the index terms is represented by weights associated with them
- Let
  – t be the number of index terms in the system
  – K = {k1, k2, k3, ..., kt} be the set of all index terms
  – ki be an index term
  – dj be a document
  – wij be a weight associated with (ki,dj)
  – wij = 0 indicates that the term does not belong to the doc
  – vec(dj) = (w1j, w2j, …, wtj) be the weighted vector associated with the document dj
  – gi(vec(dj)) = wij be a function which returns the weight associated with the pair (ki,dj)
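The notation above can be made concrete with a small sketch. The index terms, the two toy documents and the use of raw term counts as the weights wij are assumptions made only for illustration; the slides do not fix a weighting at this point.

```python
from collections import Counter

def build_vectors(docs, index_terms):
    """Return {doc_id: (w_1j, ..., w_tj)}, using raw term counts as weights (an assumption)."""
    vectors = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        # Counter returns 0 for a missing term, so w_ij = 0 means "term not in doc".
        vectors[doc_id] = tuple(counts[term] for term in index_terms)
    return vectors

def g(i, vec_dj):
    """g_i(vec(d_j)) = w_ij, with i counted from 1 as in the slides."""
    return vec_dj[i - 1]

# Hypothetical index terms k1, k2, k3 and toy documents.
K = ("tribe", "amazon", "river")
docs = {
    "d1": "Amazon tribe found near the Amazon river",
    "d2": "river maps of Brazil",
}
vectors = build_vectors(docs, K)
print(vectors)                 # weighted vector vec(dj) for every document
print(g(2, vectors["d1"]))     # weight of k2 ("amazon") in d1
```

Each document becomes a tuple of length t, one position per index term, which is the representation the Boolean and vector models both build on.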
The Boolean Model
- Simple model based on set theory
- Only AND, OR and NOT are used
- Queries specified as Boolean expressions
  – precise semantics
  – neat formalism
  – q = ka ∧ (kb ∨ ¬kc)
- Terms are either present or absent. Thus, wij ∈ {0,1}
- Consider
  – q = ka ∧ (kb ∨ ¬kc)
  – vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
  – vec(qcc) = (1,1,0) is a conjunctive component
The Boolean Model
- q = ka ∧ (kb ∨ ¬kc)
[Figure: Venn diagram of Ka, Kb and Kc with the regions (1,1,1), (1,1,0) and (1,0,0) marked]
- sim(q,dj) = 1 if ∃ vec(qcc) | (vec(qcc) ∈ vec(qdnf)) ∧ (∀ki, gi(vec(dj)) = gi(vec(qcc)))
  sim(q,dj) = 0 otherwise
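A minimal sketch of this matching rule, assuming the system has exactly the three index terms (ka, kb, kc) so that a document is a binary 3-tuple. The helper names `dnf_components` and `sim` are illustrative, not part of the lecture.

```python
from itertools import product

def dnf_components(query, t):
    """Enumerate all binary t-tuples and keep those that satisfy the query.

    The result is the set of conjunctive components of the query's
    disjunctive normal form, vec(qdnf).
    """
    return {bits for bits in product((0, 1), repeat=t) if query(*bits)}

def sim(q_dnf, vec_dj):
    """Return 1 if the document's binary vector equals some conjunctive component, else 0."""
    return 1 if tuple(vec_dj) in q_dnf else 0

# q = ka AND (kb OR NOT kc)
def q(ka, kb, kc):
    return bool(ka and (kb or not kc))

q_dnf = dnf_components(q, 3)
print(sorted(q_dnf, reverse=True))
print(sim(q_dnf, (1, 1, 0)), sim(q_dnf, (1, 0, 1)))
```

The enumeration recovers the three components (1,1,1), (1,1,0) and (1,0,0) listed on the slide; a document such as (1,0,1) matches none of them and scores 0, with no notion of a partial match.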
Drawbacks of the Boolean Model
- Retrieval based on binary decision criteria with no notion of partial matching
- No ranking of the documents is provided (absence of a grading scale)
- Information need has to be translated into a Boolean expression, which most users find awkward
- The Boolean queries formulated by the users are most often too simplistic
- As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
The Vector Model
- Use of binary weights is too limiting
- Non-binary weights provide consideration for partial matches
- These term weights are used to compute a degree of similarity between a query and each document
- Ranked set of documents provides for better matching
The Vector Model
- Define:
  – wij > 0 whenever ki ∈ dj
  – wiq >= 0 associated with the pair (ki,q)
  – vec(dj) = (w1j, w2j, ..., wtj)
  – vec(q) = (w1q, w2q, ..., wtq)
- In this space, queries and documents are represented as weighted vectors
The Vector Model
[Figure: the vectors dj and q, with the angle Θ between them]
- sim(q,dj) = cos(Θ) = [vec(dj) • vec(q)] / (|dj| * |q|) = [Σi wij * wiq] / (|dj| * |q|)
- Since wij >= 0 and wiq >= 0, 0 <= sim(q,dj) <= 1
- A document is retrieved even if it matches the query terms only partially
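The cosine formula can be sketched in a few lines. The zero-norm guard is an added assumption (the slides do not say what to return for an all-zero vector), and the example vectors are made up.

```python
import math

def cosine_sim(vec_dj, vec_q):
    """sim(q, dj) = (sum_i w_ij * w_iq) / (|dj| * |q|)."""
    dot = sum(w_ij * w_iq for w_ij, w_iq in zip(vec_dj, vec_q))
    norm_d = math.sqrt(sum(w * w for w in vec_dj))
    norm_q = math.sqrt(sum(w * w for w in vec_q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an empty document or query matches nothing
    return dot / (norm_d * norm_q)

q = (1, 0, 0)
print(cosine_sim((1, 1, 0), q))  # partial match: between 0 and 1
print(cosine_sim((0, 1, 0), q))  # no shared terms: 0
```

A document that shares only some of the query terms still gets a positive score, which is the partial matching the Boolean model lacks.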
The Vector Model
- sim(q,dj) = [Σi wij * wiq] / (|dj| * |q|)
- How to compute the weights wij and wiq?
- A good weight must take into account two effects:
  – quantification of intra-document contents (similarity)
    - tf factor, the term frequency within a document
  – quantification of inter-document separation (dissimilarity)
    - idf factor, the inverse document frequency
  – wij = tf(i,j) * idf(i)
The Vector Model
- Let,
  – N be the total number of docs in the collection
  – ni be the number of docs which contain ki
  – freq(i,j) be the raw frequency of ki within dj
- A normalized tf factor is given by
  – f(i,j) = freq(i,j) / maxl(freq(l,j))
  – where the maximum is computed over all terms which occur within the document dj
- The idf factor is computed as
  – idf(i) = log(N/ni)
  – the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
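As a rough illustration of the two factors, the sketch below computes the normalized tf for one document and the idf for every term of a toy collection. The collection is invented, documents are assumed to be non-empty lists of terms, and the natural logarithm is an assumption since the slides do not state the base.

```python
import math
from collections import Counter

def normalized_tf(doc_terms):
    """f(i,j) = freq(i,j) / max_l freq(l,j) for every term in one (non-empty) document."""
    counts = Counter(doc_terms)
    max_freq = max(counts.values())
    return {term: freq / max_freq for term, freq in counts.items()}

def idf(collection):
    """idf(i) = log(N / n_i), where n_i is the number of docs containing term i."""
    N = len(collection)
    n = Counter(term for doc in collection for term in set(doc))
    return {term: math.log(N / n_i) for term, n_i in n.items()}

# Hypothetical toy collection: each document is a list of index terms.
collection = [
    ["tribe", "amazon", "amazon"],
    ["tribe", "river"],
    ["tribe", "census", "census", "census"],
    ["amazon", "river"],
]
print(normalized_tf(collection[0]))
print(idf(collection))
```

Here "tribe" occurs in three of the four documents and gets a low idf, while "census" occurs in only one and gets the highest, matching the reading of idf as the amount of information a term carries.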
The Vector Model
- The best term-weighting schemes use weights which are given by
  – wij = f(i,j) * log(N/ni)
  – the strategy is called a tf-idf weighting scheme
- For the query term weights, a suggestion is
  – wiq = (0.5 + [0.5 * freq(i,q) / maxl(freq(l,q))]) * log(N/ni)
- The vector model with tf-idf weights is a good ranking strategy with general collections
- The vector model is usually as good as the known ranking alternatives. It is also simple and fast to compute.
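Putting the document weights, the suggested query weights and the cosine formula together gives a small end-to-end ranking sketch. The three toy documents, the sparse dict representation and the natural logarithm are assumptions for illustration only.

```python
import math
from collections import Counter

def doc_weights(doc_terms, idf):
    """w_ij = f(i,j) * log(N/n_i), with f normalized by the max frequency in d_j."""
    counts = Counter(doc_terms)
    max_freq = max(counts.values())
    return {t: (c / max_freq) * idf.get(t, 0.0) for t, c in counts.items()}

def query_weights(query_terms, idf):
    """w_iq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N/n_i)."""
    counts = Counter(query_terms)
    max_freq = max(counts.values())
    return {t: (0.5 + 0.5 * c / max_freq) * idf.get(t, 0.0) for t, c in counts.items()}

def cosine(w_d, w_q):
    """Cosine similarity between two sparse weight dicts."""
    dot = sum(w * w_q.get(t, 0.0) for t, w in w_d.items())
    norm_d = math.sqrt(sum(w * w for w in w_d.values()))
    norm_q = math.sqrt(sum(w * w for w in w_q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

# Hypothetical collection; terms are already tokenized and lowercased.
docs = {
    "d1": ["indian", "tribes", "latin", "america"],
    "d2": ["latin", "america", "airfare", "prices"],
    "d3": ["indian", "pepper", "prices"],
}
N = len(docs)
n = Counter(t for terms in docs.values() for t in set(terms))
idf = {t: math.log(N / n_i) for t, n_i in n.items()}

w_q = query_weights(["indian", "tribes", "latin", "america"], idf)
scores = {d: cosine(doc_weights(terms, idf), w_q) for d, terms in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy run the document containing all four query terms ranks first, and the two partial matches follow in order of how much idf-weighted overlap they have with the query.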
The Vector Model
- Advantages:
  – term-weighting improves quality of the answer set
  – partial matching allows retrieval of docs that approximate the query conditions
  – cosine ranking formula sorts documents according to degree of similarity to the query
- Disadvantages:
  – assumes independence of index terms (??); not clear that this is bad though
The Vector Model: Example I
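[Figure: Venn diagram of index terms k1, k2 and k3, with documents d1 to d7 placed in the regions of the terms they contain]
      k1   k2   k3   q • dj
d1     1    0    1      2
d2     1    0    0      1
d3     0    1    1      2
d4     1    0    0      1
d5     1    1    1      3
d6     1    1    0      2
d7     0    1    0      1
q      1    1    1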
The Vector Model: Example II
[Figure: Venn diagram of index terms k1, k2 and k3, with documents d1 to d7 placed in the regions of the terms they contain]
      k1   k2   k3   q • dj
d1     1    0    1      4
d2     1    0    0      1
d3     0    1    1      5
d4     1    0    0      1
d5     1    1    1      6
d6     1    1    0      3
d7     0    1    0      2
q      1    2    3
The Vector Model: Example III
[Figure: Venn diagram of index terms k1, k2 and k3, with documents d1 to d7 placed in the regions of the terms they contain]
      k1   k2   k3   q • dj
d1     2    0    1      5
d2     1    0    0      1
d3     0    1    3     11
d4     2    0    0      2
d5     1    2    4     17
d6     1    2    0      5
d7     0    5    0     10
q      1    2    3
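The q • dj column of Example III can be checked with a short sketch. The zero entries in the document vectors are an assumption: they are inferred from the q • dj values, since blank cells are taken to mean the term is absent. Note that this is the unnormalized dot product, not the cosine.

```python
def dot(u, v):
    """Inner product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(u, v))

q = (1, 2, 3)  # query weights for (k1, k2, k3)

# Document weights for (k1, k2, k3); zeros are inferred (assumed) blanks.
docs = {
    "d1": (2, 0, 1),
    "d2": (1, 0, 0),
    "d3": (0, 1, 3),
    "d4": (2, 0, 0),
    "d5": (1, 2, 4),
    "d6": (1, 2, 0),
    "d7": (0, 5, 0),
}

scores = {name: dot(vec, q) for name, vec in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Ranking by these scores puts d5 first, followed by d3 and d7; dividing each score by |dj| * |q| would turn them into the cosine values used by the vector model and may reorder the documents.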