
Introduction to Information Retrieval

http://informationretrieval.org

IIR 12: Language Models for IR

Hinrich Schütze

Institute for Natural Language Processing, Universität Stuttgart

2011-08-29

Models and Methods

1. Boolean model and its limitations (30)
2. Vector space model (30)
3. Probabilistic models (30)
4. Language model-based retrieval (30)
5. Latent semantic indexing (30)
6. Learning to rank (30)


Take-away

Statistical language models: Introduction
Statistical language models in IR
Discussion: Properties of different probabilistic models in use in IR


Outline

1. Statistical language models
2. Statistical language models in IR
3. Discussion


What is a language model?

We can view a finite state automaton as a deterministic language model.
It can generate: I wish I wish I wish I wish I wish . . .
Cannot generate: “wish I wish” or “I wish I”
Our basic model: each document was generated by a different automaton like this, except that these automata are probabilistic.


A probabilistic language model

This is a one-state probabilistic finite-state automaton – a unigram language model – and the state emission distribution for its one state q1. STOP is not a word, but a special symbol indicating that the automaton stops.

w      P(w|q1)      w      P(w|q1)
STOP   0.2          toad   0.01
the    0.2          said   0.03
a      0.1          likes  0.02
frog   0.01         that   0.04
. . .               . . .

Example: P(frog said that toad likes frog STOP)
= 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.2 = 0.0000000000048 = 4.8 · 10^-12
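A minimal sketch of this computation in Python (the emission table and the example string are from the slide; the function name is ours):

```python
# Emission probabilities for the single state q1 of the unigram model.
# "STOP" is the special end symbol, not a word.
emissions = {
    "STOP": 0.2, "the": 0.2, "a": 0.1, "frog": 0.01,
    "toad": 0.01, "said": 0.03, "likes": 0.02, "that": 0.04,
}

def string_probability(tokens, model):
    """Probability that the one-state automaton emits exactly this token sequence."""
    p = 1.0
    for t in tokens:
        p *= model[t]
    return p

print(string_probability("frog said that toad likes frog STOP".split(), emissions))
# ≈ 4.8e-12
```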


A different language model for each document

language model of d1:

w      P(w|.)      w      P(w|.)
STOP   .2          toad   .01
the    .2          said   .03
a      .1          likes  .02
frog   .01         that   .04
. . .              . . .

language model of d2:

w      P(w|.)      w      P(w|.)
STOP   .2          toad   .02
the    .15         said   .03
a      .08         likes  .02
frog   .01         that   .05
. . .              . . .

query: frog said that toad likes frog STOP

P(query|Md1) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.2 = 0.0000000000048 = 4.8 · 10^-12
P(query|Md2) = 0.01 · 0.03 · 0.05 · 0.02 · 0.02 · 0.01 · 0.2 = 0.0000000000120 = 12 · 10^-12

P(query|Md1) < P(query|Md2): thus, document d2 is “more relevant” to the query “frog said that toad likes frog STOP” than d1 is.
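A sketch of this comparison, reusing string_probability from the previous snippet (the two model tables are from the slide; variable names are ours):

```python
model_d1 = {"STOP": 0.2, "the": 0.2, "a": 0.1, "frog": 0.01,
            "toad": 0.01, "said": 0.03, "likes": 0.02, "that": 0.04}
model_d2 = {"STOP": 0.2, "the": 0.15, "a": 0.08, "frog": 0.01,
            "toad": 0.02, "said": 0.03, "likes": 0.02, "that": 0.05}

query = "frog said that toad likes frog STOP".split()
print(string_probability(query, model_d1))  # ≈ 4.8e-12
print(string_probability(query, model_d2))  # ≈ 1.2e-11, so d2 ranks higher
```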


Outline

1. Statistical language models
2. Statistical language models in IR
3. Discussion


Using language models in IR

Each document is treated as (the basis for) a language model.
Given a query q, rank documents based on P(d|q):

P(d|q) = P(q|d) P(d) / P(q)

P(q) is the same for all documents, so we can ignore it.
P(d) is the prior – often treated as the same for all d. But we can give a higher prior to “high-quality” documents, e.g., those with high PageRank.
P(q|d) is the probability of q given d.
Under the assumptions we made, ranking documents according to P(q|d)P(d) and according to P(d|q) is equivalent. A tiny sketch of the resulting ranking rule follows.


How to compute P(q|d)

We will make the same conditional independence assumption as in the BIM:

P(q|Md) = P(⟨t_1, . . . , t_|q|⟩|Md) = ∏_{1≤k≤|q|} P(t_k|Md)

(|q|: length of q; t_k: the token occurring at position k in q)

This is equivalent to:

P(q|Md) = ∏_{distinct term t in q} P(t|Md)^{tf_{t,q}}

(tf_{t,q}: term frequency, i.e., number of occurrences, of t in q)

This is a multinomial model (omitting the constant factor).
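A sketch of the second form of the product in Python (a Counter supplies tf_{t,q}; the function name is ours):

```python
from collections import Counter

def p_query_given_model(query_tokens, p_t):
    """P(q|Md) as a product over distinct terms t of P(t|Md) ** tf(t, q)."""
    prob = 1.0
    for term, tf in Counter(query_tokens).items():
        prob *= p_t(term) ** tf
    return prob
```

Called with, e.g., p_t = lambda t: model_d1.get(t, 0.0), this reproduces the earlier query probabilities; grouping repeated terms does not change the product.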


Parameter estimation

Missing piece: Where do the parameters P(t|Md) come from?
Start with maximum likelihood estimates:

P̂(t|Md) = tf_{t,d} / |d|

(|d|: length of d; tf_{t,d}: number of occurrences of t in d)

We have a problem with zeros: a single term t in the query with P(t|Md) = 0 will make the whole product P(q|Md) = ∏_t P(t|Md) zero. We would give a single term in the query “veto power”. For example, for the query [Michael Jackson top hits], a document about “Michael Jackson top songs” (but not using the word “hits”) would have P(q|Md) = 0. – That's bad.
We need to smooth the estimates to avoid zeros.
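A sketch of the maximum likelihood estimate and its zero problem (the toy document is adapted from the slide's example; names are ours):

```python
from collections import Counter

def mle(term, doc_tokens):
    """Maximum likelihood estimate P(t|Md) = tf(t, d) / |d|."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

doc = "michael jackson top songs".split()
print(mle("top", doc))   # 0.25
print(mle("hits", doc))  # 0.0 -- a single such term zeroes out P(q|Md)
```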


Smoothing

Key intuition: A nonoccurring term is possible (even though it didn't occur), but no more likely than would be expected by chance in the collection.

Notation: Mc: the collection model; cf_t: the number of occurrences of t in the collection; T = Σ_t cf_t: the total number of tokens in the collection.

P̂(t|Mc) = cf_t / T

We will use P̂(t|Mc) to “smooth” P(t|d) away from zero.


Jelinek-Mercer smoothing

P(t|d) = λ P(t|Md) + (1 − λ) P(t|Mc)

Mixes the probability from the document with the general collection frequency of the word.
High value of λ: “conjunctive-like” search – tends to retrieve documents containing all query words.
Low value of λ: more disjunctive, suitable for long queries.
Tuning λ is important for good performance.


Jelinek-Mercer smoothing: Summary

P(q|d) ∝ ∏_{1≤k≤|q|} (λ P(t_k|Md) + (1 − λ) P(t_k|Mc))

What we model: The user has a document in mind and generates the query from this document. P(q|d) is the probability that the document that the user had in mind was in fact this one.
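A sketch of Jelinek-Mercer-smoothed query likelihood, with documents and the collection as plain token lists (function and variable names are ours):

```python
from collections import Counter

def jm_score(query_tokens, doc_tokens, collection_tokens, lam=0.5):
    """P(q|d) ∝ product over query positions of lam*P(t|Md) + (1-lam)*P(t|Mc)."""
    tf_d, cf = Counter(doc_tokens), Counter(collection_tokens)
    score = 1.0
    for t in query_tokens:
        p_md = tf_d[t] / len(doc_tokens)        # document model (MLE)
        p_mc = cf[t] / len(collection_tokens)   # collection model
        score *= lam * p_md + (1 - lam) * p_mc
    return score
```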


Example

Collection: d1 and d2
d1: Jackson was one of the most talented entertainers of all time
d2: Michael Jackson anointed himself King of Pop
Query q: Michael Jackson
Use mixture model with λ = 1/2

P(q|d1) = [(0/11 + 1/18)/2] · [(1/11 + 2/18)/2] ≈ 0.003
P(q|d2) = [(1/7 + 1/18)/2] · [(1/7 + 2/18)/2] ≈ 0.013

Ranking: d2 > d1
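The jm_score sketch above reproduces these numbers:

```python
d1 = "Jackson was one of the most talented entertainers of all time".split()
d2 = "Michael Jackson anointed himself King of Pop".split()
q = "Michael Jackson".split()
collection = d1 + d2
print(jm_score(q, d1, collection))  # ≈ 0.003
print(jm_score(q, d2, collection))  # ≈ 0.013, so d2 > d1
```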


Dirichlet smoothing

P(t|d) = (tf_{t,d} + α P(t|Mc)) / (L_d + α)

The background distribution P(t|Mc) is the prior for P(t|d).
Intuition: Before having seen any part of the document, we start with the background distribution as our estimate; as we read the document and count terms, we update the background distribution.
The weighting factor α determines how strong an effect the prior has.
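A sketch of the Dirichlet-smoothed estimate (names are ours; the default α = 2000 is a commonly used value in the literature, not from the slide):

```python
from collections import Counter

def dirichlet_p(term, doc_tokens, collection_tokens, alpha=2000.0):
    """P(t|d) = (tf(t,d) + alpha * P(t|Mc)) / (L_d + alpha)."""
    p_mc = Counter(collection_tokens)[term] / len(collection_tokens)
    tf = Counter(doc_tokens)[term]
    return (tf + alpha * p_mc) / (len(doc_tokens) + alpha)
```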


Jelinek-Mercer or Dirichlet?

Dirichlet performs better for keyword queries; Jelinek-Mercer performs better for verbose queries. Both models are sensitive to the smoothing parameters – you shouldn't use these models without parameter tuning.


Sensitivity of Dirichlet to smoothing parameter

µ is the Dirichlet smoothing parameter (called α on the previous slides)


Vector space (tf-idf) vs. LM

Precision of tf-idf vs. LM at different recall levels (* marks a statistically significant difference):

Rec.              tf-idf   LM       %chg
0.0               0.7439   0.7590   +2.0
0.1               0.4521   0.4910   +8.6
0.2               0.3514   0.4045   +15.1 *
0.4               0.2093   0.2572   +22.9 *
0.6               0.1024   0.1405   +37.1 *
0.8               0.0160   0.0432   +169.6 *
1.0               0.0028   0.0050   +76.9
11-point average  0.1868   0.2233   +19.6 *

The language modeling approach always does better in these experiments, but note that where the approach shows significant gains is at higher levels of recall.


Summary: IR language models

1. View the document as a generative model that generates the query
2. Define the precise generative model we want to use
3. Estimate parameters (different parameters for each document's model)
4. Smooth to avoid zeros
5. Apply to query and find document most likely to have generated the query
6. Present most likely document(s) to user


Outline

1. Statistical language models
2. Statistical language models in IR
3. Discussion


Naive Bayes and LM generative models

Naive Bayes: We want to classify document d.
LM for IR: We want to classify a query q.

Naive Bayes: Human-defined classes: e.g., politics, economics, sports.
LM for IR: Each document in the collection is a different class.

Naive Bayes: Assume that d was generated by the generative model.
LM for IR: Assume that q was generated by a generative model.

Naive Bayes: Key question: Which of the classes (= class models) is most likely to have generated the document?
LM for IR: Which document (= class) is most likely to have generated the query q?

Naive Bayes: Or: for which class do we have the most evidence?
LM for IR: For which document (as the source of the query) do we have the most evidence?


Naive Bayes Multinomial model / IR language models

[Diagram: class node C=China generating the token sequence X1=Beijing, X2=and, X3=Taipei, X4=join, X5=WTO.]


Naive Bayes Bernoulli model / Binary independence model

[Diagram: class node C=China generating binary term-occurrence variables UAlaska=0, UBeijing=1, UIndia=0, Ujoin=1, UTaipei=1, UWTO=1.]


Comparison of the two models

                        multinomial model / IR LM               Bernoulli model / BIM
event model             generation of (multi)set of tokens      generation of subset of vocabulary
random variable(s)      X = t iff t occurs at given position    U_t = 1 iff t occurs in doc
doc. representation     d = ⟨t_1, . . . , t_nd⟩, t_k ∈ V        d = ⟨e_1, . . . , e_M⟩, e_i ∈ {0, 1}
parameter estimation    P̂(X = t|c)                              P̂(U_i = e|c)
dec. rule: maximize     P̂(c) · ∏_{1≤k≤nd} P̂(X = t_k|c)          P̂(c) · ∏_{t_i∈V} P̂(U_i = e_i|c)
multiple occurrences    taken into account                      ignored
length of docs          can handle longer docs                  works best for short docs
# features              can handle more                         works best with fewer
estimate for “the”      P̂(X = the|c) ≈ 0.05                     P̂(U_the = 1|c) ≈ 1.0


Vector space vs BM25 vs LM

BM25/LM: based on probability theory.
Vector space: based on similarity, a geometric/linear algebra notion.

Term frequency is directly used in all three models:
LMs: raw term frequency; BM25/vector space: more complex transformations.

Length normalization:
Vector space: cosine or pivot normalization.
LMs: probabilities are inherently length-normalized.
BM25: tuning parameters for optimizing length normalization.

idf: BM25/vector space use it directly.
LMs: mixing term and collection frequencies has an effect similar to idf: terms rare in the general collection, but common in some documents, will have a greater influence on the ranking.

Collection frequency (LMs) vs. document frequency (BM25, vector space).


Take-away

Statistical language models: Introduction
Statistical language models in IR
Discussion: Properties of different probabilistic models in use in IR


Resources

Chapter 12 of Introduction to Information Retrieval
Resources at http://informationretrieval.org/essir2011
Ponte and Croft's 1998 SIGIR paper (one of the first on LMs in IR)
Zhai and Lafferty: A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst. (2004)
Lemur toolkit (good support for LMs in IR)
Bernoulli vs. multinomial models


Exercise: Compute ranking

Collection: d1 and d2
d1: Xerox reports a profit but revenue is down
d2: Lucene narrows quarter loss but revenue decreases further
Query q: revenue down
Use mixture model with λ = 1/2

P(q|d1) = [(1/8 + 2/16)/2] · [(1/8 + 1/16)/2] = 1/8 · 3/32 = 3/256
P(q|d2) = [(1/8 + 2/16)/2] · [(0/8 + 1/16)/2] = 1/8 · 1/32 = 1/256

Ranking: d1 > d2
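The jm_score sketch from earlier confirms the exercise numbers:

```python
d1 = "Xerox reports a profit but revenue is down".split()
d2 = "Lucene narrows quarter loss but revenue decreases further".split()
q = "revenue down".split()
collection = d1 + d2
print(jm_score(q, d1, collection))  # 3/256 ≈ 0.0117
print(jm_score(q, d2, collection))  # 1/256 ≈ 0.0039, so d1 > d2
```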