SLIDE 1

Information Retrieval Modeling

Russian Summer School in Information Retrieval

Djoerd Hiemstra http://www.cs.utwente.nl/~hiemstra

SLIDE 2

Overview

  • 1. Smoothing methods
  • 2. Translation models
  • 3. Document priors
  • 4. ...

SLIDE 3

Course Material

  • Djoerd Hiemstra, “Language Models, Smoothing, and N-grams”. In M. Tamer Özsu and Ling Liu (eds.), Encyclopedia of Database Systems, Springer, 2009

SLIDE 4

Noisy channel paradigm (Shannon 1948)

I (input) → noisy channel → O (output)

  • hypothesise all possible input texts I and take the one with the highest probability, symbolically:

$$\hat{I} = \arg\max_I P(I \mid O) = \arg\max_I P(I) \cdot P(O \mid I)$$

SLIDE 5

Noisy channel paradigm (Shannon 1948)

D (document) → noisy channel → T1, T2, … (query)

  • hypothesise all possible documents D and take the one with the highest probability, symbolically:

$$\hat{D} = \arg\max_D P(D \mid T_1, T_2, \cdots) = \arg\max_D P(D) \cdot P(T_1, T_2, \cdots \mid D)$$

SLIDE 6

Noisy channel paradigm

  • Did you get the picture? Formulate the following systems as a noisy channel:

– Automatic Speech Recognition
– Optical Character Recognition
– Parsing of Natural Language
– Machine Translation
– Part-of-speech tagging

SLIDE 7

Statistical language models

  • Given a query T1, T2, …, Tn, rank the documents according to the following probability measure:

$$P(T_1, T_2, \ldots, T_n \mid D) = \prod_{i=1}^{n} \big( (1-\lambda_i) P(T_i) + \lambda_i P(T_i \mid D) \big)$$

λi : probability that the term on position i is important
1 − λi : probability that the term is unimportant
P(Ti | D) : probability of an important term
P(Ti) : probability of an unimportant term

SLIDE 8

Statistical language models

PT i=ti∣D=d = tf ti ,d 

∑t tf t ,d 

important term PT i=ti = df t i 

∑t df t 

unimportant term 

  • Definition of probability measures:

λi = 0.5
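These estimators are straightforward to implement; in this sketch documents are plain token lists, an assumption of the example rather than the slides:

```python
from collections import Counter

def p_important(term, doc_tokens):
    """P(Ti = ti | D = d) = tf(ti, d) / sum_t tf(t, d)."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def p_unimportant(term, all_docs):
    """P(Ti = ti) = df(ti) / sum_t df(t), with df counted over all documents."""
    df = Counter(t for doc in all_docs for t in set(doc))
    return df[term] / sum(df.values())
```

Feeding these into the query_likelihood sketch above with λ = 0.5 reproduces the ranking measure of the previous slide.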

SLIDE 9

Statistical language models

  • How to estimate the value of λi?

– For ad-hoc retrieval (i.e. no previously retrieved documents to guide the search): λi = constant (i.e. each term is equally important)

– Note the extreme values:
  λi = 0 : the term does not influence the ranking
  λi = 1 : the term is mandatory in retrieved documents
  lim λi → 1 : documents containing n query terms are ranked above documents containing n − 1 query terms

(Hiemstra 2002)

SLIDE 10

Statistical language models

  • Presentation as hidden Markov model

– finite state machine: probabilities governing transitions
– the sequence of state transitions cannot be determined from the sequence of output symbols (i.e. the transitions are hidden)

SLIDE 11

Statistical language models

  • Implementation

$$P(T_1, T_2, \cdots, T_n \mid D) = \prod_{i=1}^{n} \big( (1-\lambda_i) P(T_i) + \lambda_i P(T_i \mid D) \big)$$

Dividing each factor by the document-independent (1 − λi) P(Ti), which does not change the ranking, and taking logarithms gives:

$$P(T_1, T_2, \cdots, T_n \mid D) \propto \sum_{i=1}^{n} \log\left(1 + \frac{\lambda_i\, P(T_i \mid D)}{(1-\lambda_i)\, P(T_i)}\right)$$

Terms that do not occur in the document contribute log 1 = 0, so only matching terms need to be considered.

SLIDE 12

Statistical language models

  • Implementation as vector product:

scoreq , d  =

k∈ matching terms

qk⋅d k q k=tf  k ,q d k=log1 tf k ,d ∑t df t  df k ∑t tf t ,d  ⋅ k 1−k 
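A runnable sketch of this vector product; df is a document-frequency dictionary and sum_df its total, both assumed precomputed for this example:

```python
from collections import Counter
from math import log

def score(query_tokens, doc_tokens, df, sum_df, lam=0.5):
    """score(q, d) = sum_{k in matching terms} q_k * d_k."""
    q_tf, d_tf = Counter(query_tokens), Counter(doc_tokens)
    doc_len = len(doc_tokens)
    return sum(q_tf[k] * log(1 + (d_tf[k] * sum_df) / (df[k] * doc_len)
                             * lam / (1 - lam))
               for k in q_tf.keys() & d_tf.keys())  # matching terms only
```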

SLIDE 13

Smoothing

  • Sparse data problem:

– many events that are plausible in reality are not found in the data used to estimate probabilities
– i.e., documents are short, and do not contain all words that would be good index terms

SLIDE 14

No smoothing

  • Maximum likelihood estimate

– Documents that do not contain all query terms get zero probability (are not retrieved)

P T i=ti∣D=d = tf ti , d 

∑t tf t ,d 

SLIDE 15

Laplace smoothing

  • Simply add 1 to every possible event

– over-estimates probabilities of unseen events

P T i=ti∣D=d = tf t i ,d 1

∑t tf t ,d 1

SLIDE 16

Linear interpolation smoothing

  • Linear combination of the maximum likelihood model and a model that is less sparse

– also called “Jelinek-Mercer smoothing”

$$P(T_i \mid D) = (1-\lambda)\, P(T_i) + \lambda\, P(T_i \mid D), \qquad \text{where } 0 \leq \lambda \leq 1$$

SLIDE 17

Dirichlet smoothing

  • Has a relatively big effect on small documents, but a relatively small effect on big documents (Zhai & Lafferty 2004)

$$P(T_i = t_i \mid D = d) = \frac{\mathit{tf}(t_i, d) + \mu\, P(T_i \mid C)}{\sum_t \mathit{tf}(t, d) + \mu}$$

SLIDE 18

Cross-language IR

cross-language information retrieval (English)
zoeken in anderstalige informatie (Dutch: “searching in foreign-language information”)
recherche d'informations multilingues (French: “multilingual information retrieval”)

SLIDE 19

Language models & translation

  • Cross-language information retrieval (CLIR):

– Enter a query in one language (the language of choice) and retrieve documents in one or more other languages
– The system takes care of automatic translation

SLIDE 21

Language models & translation

  • Noisy channel paradigm

D (doc.) → noisy channel → T1, T2, … (query) → noisy channel → S1, S2, … (request)

  • hypothesise all possible documents D and take the one with the highest probability:

$$\hat{D} = \arg\max_D P(D \mid S_1, S_2, \cdots) = \arg\max_D P(D) \cdot \sum_{T_1, T_2, \cdots} P(T_1, T_2, \cdots; S_1, S_2, \cdots \mid D)$$

SLIDE 22

Language models & translation

  • Cross-language information retrieval:

– Assume that the translation of a word/term does not depend on the document in which it occurs
– if S1, S2, …, Sn is a Dutch query of length n
– and ti1, ti2, …, timi are the mi English translations of the Dutch query term Si

$$P(S_1, S_2, \ldots, S_n \mid D) = \prod_{i=1}^{n} \sum_{j=1}^{m_i} P(S_i \mid T_i = t_{ij}) \big( (1-\lambda_i) P(T_i = t_{ij}) + \lambda_i P(T_i = t_{ij} \mid D) \big)$$

SLIDE 23

Language models & translation

  • Presentation as hidden Markov model

SLIDE 24

Language models & translation

  • How does it work in practice?

– Find for each Russian query term Si the possible translations ti1, ti2, …, tim and their translation probabilities
– Combine them into a structured query
– Process the structured query

SLIDE 25

Language models & translation

  • Example:

– Russian query: ОСТОРОЖНО РАДИОАКТИВНЫЕ ОТХОДЫ (“caution: radioactive waste”)
– Translations of ОСТОРОЖНО: dangerous (0.8) or hazardous (1.0)
– Translations of РАДИОАКТИВНЫЕ ОТХОДЫ: radioactivity (0.3) or radioactive chemicals (0.3) or radioactive waste (0.1)
– Structured query: ((0.8 dangerous ∪ 1.0 hazardous), (0.3 radioactivity ∪ 0.3 radioactive chemicals ∪ 0.1 radioactive waste))

SLIDE 26

Structured query

– Structured query:

((0.8 dangerous ∪ 1.0 hazardous), (0.3 radioactivity ∪ 0.3 radioactive chemicals ∪ 0.1 radioactive waste))

SLIDE 27

Language models & translation

  • Other applications using the translation model

– On-line stemming
– Synonym expansion
– Spelling correction
– ‘Fuzzy’ matching
– Extended (ranked) Boolean retrieval

SLIDE 28

Language models & translation

  • Note that:

– λi = 1 for all 1 ≤ i ≤ n : Boolean retrieval

– Stemming and on-line morphological generation give exactly the same results:

P(funny ∪ funnies, table ∪ tables ∪ tabled) = P(funni, tabl)

SLIDE 29

Experimental Results

  • translation language model (source: parallel corpora)
    – average precision: 0.335 (83% of baseline)
  • no translation model, using all translations
    – average precision: 0.308 (76% of baseline)
  • manually disambiguated run (take the best translation)
    – average precision: 0.315 (78% of baseline)

(Hiemstra and De Jong 1999)

SLIDE 30

Prior probabilities

SLIDE 31

Prior probabilities and static ranking

  • Noisy channel paradigm (Shannon 1948)

D (document) → noisy channel → T1, T2, … (query)

  • hypothesise all possible documents D and take the one with the highest probability, symbolically:

$$\hat{D} = \arg\max_D P(D \mid T_1, T_2, \cdots) = \arg\max_D P(D) \cdot P(T_1, T_2, \cdots \mid D)$$

  • The prior P(D) is where static, query-independent information about the document enters the ranking

SLIDE 32

Prior probability of relevance on informational queries

[plot: probability of relevance vs. document length]

$$P_{\mathit{doclen}}(D) = C \cdot \mathit{doclen}(D)$$
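In log space the prior simply adds to the query-likelihood score; a sketch (the constant C normalises the prior and can be dropped without changing the ranking):

```python
from math import log

def static_rank_score(log_p_query_given_doc, doclen):
    """log P(D) + log P(T1,...,Tn | D), with P_doclen(D) proportional to
    doclen(D); the normalising constant C is omitted because it does not
    affect the ranking."""
    return log(doclen) + log_p_query_given_doc
```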

SLIDE 33

Priors in Entry Page Search

  • Sources of Information

– Document length
– Number of links pointing to a document
– The depth of the URL
– Occurrence of cue words (‘welcome’, ‘home’)
– Number of links in a document
– Page traffic

SLIDE 34

Prior probability of relevance on navigational queries

[plot: probability of relevance vs. document length]

SLIDE 35

Priors in Entry Page Search

  • Assumption
    – Entry pages are referenced more often
  • Different types of inlinks
    – From other hosts (recommendation)
    – From the same host (navigational)
  • Both types often point to entry pages

SLIDE 36

Priors in Entry Page Search

Pinlinks D=C⋅inlinkCount  D

← probability of relevance nr of inlinks →

SLIDE 37

Priors in Entry Page Search: URL depth

  • Top level documents are often entry pages
  • Four types of URLs:
    – root: www.romip.ru/
    – subroot: www.romip.ru/russir2009/
    – path: www.romip.ru/russir2009/en/
    – file: www.romip.ru/russir2009/en/venue.html
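A heuristic sketch that assigns one of the four types to a URL; the rules are a plausible reading of the examples above, not code from the paper:

```python
def url_type(url):
    """Classify a URL as root, subroot, path, or file."""
    path = url.split("://")[-1].partition("/")[2]  # everything after the host
    segments = [s for s in path.split("/") if s]
    if not segments:
        return "root"
    if not path.endswith("/"):
        return "file"
    return "subroot" if len(segments) == 1 else "path"

assert url_type("www.romip.ru/") == "root"
assert url_type("www.romip.ru/russir2009/") == "subroot"
assert url_type("www.romip.ru/russir2009/en/") == "path"
assert url_type("www.romip.ru/russir2009/en/venue.html") == "file"
```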

SLIDE 38

Priors in Entry Page Search: results

method               Content   Anchors
P(Q|D)               0.3375    0.4188
P(Q|D)·Pdoclen(D)    0.2634    0.5600
P(Q|D)·Pinlink(D)    0.4974    0.5365
P(Q|D)·PURL(D)       0.7705    0.6301

(Kraaij, Westerveld and Hiemstra 2002)

SLIDE 39

Language Models conclusion

  • Smoothing: accounts for sparse documents and bad queries
  • Translation model: accounts for multiple query representations (e.g. CLIR or stemming)
  • Document priors: account for “non-content” information

SLIDE 40

References

  • Djoerd Hiemstra and Franciska de Jong. Disambiguation strategies for cross-language information retrieval. Lecture Notes in Computer Science 1696: Research and Advanced Technology for Digital Libraries, Springer-Verlag, 1999
  • Wessel Kraaij, Thijs Westerveld and Djoerd Hiemstra. The Importance of Prior Probabilities for Entry Page Search. In Proceedings of SIGIR 2002
  • Claude Shannon. A mathematical theory of communication. Bell System Technical Journal 27, 1948
  • ChengXiang Zhai and John Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems 22(2), pages 179–214, 2004