Introduction to Information Retrieval - PowerPoint PPT Presentation


SLIDE 1

tf-idf weighting · Vector space model · Pivot length normalization

Introduction to Information Retrieval
http://informationretrieval.org

IIR 6&7: Vector Space Model

Hinrich Schütze
Institute for Natural Language Processing, University of Stuttgart

2011-08-29

Schütze: Vector space model, 1 / 37

SLIDE 2

Models and Methods

1. Boolean model and its limitations (30)
2. Vector space model (30)
3. Probabilistic models (30)
4. Language model-based retrieval (30)
5. Latent semantic indexing (30)
6. Learning to rank (30)

SLIDES 3-6

Take-away

- tf-idf weighting: quick review of tf-idf weighting
- Vector space model: represents queries and documents in a high-dimensional space.
- Pivot normalization (or “pivoted document length normalization”): an alternative to cosine normalization that removes a bias inherent in standard length normalization.

SLIDE 7

Outline

1. tf-idf weighting
2. Vector space model
3. Pivot length normalization

SLIDES 8-9

Binary incidence matrix

           Anthony and  Julius  The      Hamlet  Othello  Macbeth  . . .
           Cleopatra    Caesar  Tempest
Anthony    1            1       0        0       0        1
Brutus     1            1       0        1       0        0
Caesar     1            1       0        1       1        0
Calpurnia  0            1       0        0       0        0
Cleopatra  1            0       0        0       0        0
mercy      1            0       1        1       1        1
worser     1            0       1        1       1        1
. . .

Each document is represented as a binary vector ∈ {0, 1}^|V|.

SLIDES 10-11

Count matrix

           Anthony and  Julius  The      Hamlet  Othello  Macbeth  . . .
           Cleopatra    Caesar  Tempest
Anthony    157          73      0        0       0        1
Brutus     4            157     0        2       0        0
Caesar     232          227     0        2       1        0
Calpurnia  0            10      0        0       0        0
Cleopatra  57           0       0        0       0        0
mercy      2            0       3        8       5        8
worser     2            0       1        1       1        5
. . .

Each document is now represented as a count vector ∈ N^|V|.

SLIDES 12-18

Term frequency tf

- The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
- We want to rank documents according to query-document matching scores and use tf as a component in these matching scores. But how?
- Raw term frequency is not what we want because:
  - A document with tf = 10 occurrences of the term is more relevant than a document with tf = 1 occurrence of the term.
  - But not 10 times more relevant.
- Relevance does not increase proportionally with term frequency.

SLIDES 19-22

Instead of raw frequency: Log frequency weighting

The log frequency weight of term t in d is defined as follows:

    w_{t,d} = 1 + log10(tf_{t,d})   if tf_{t,d} > 0
    w_{t,d} = 0                     otherwise

tf_{t,d} → w_{t,d}: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

Matching score for a document-query pair: sum over terms t occurring in both q and d:

    tf-matching-score(q, d) = Σ_{t ∈ q∩d} (1 + log10 tf_{t,d})
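The piecewise weight and the matching score above can be sketched in a few lines of Python (a minimal illustration; the function names are ours, not from the lecture):

```python
import math

def log_tf_weight(tf: int) -> float:
    # w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, else 0
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

def tf_matching_score(query_terms, doc_tf) -> float:
    # Sum the log-frequency weights over terms shared by query and document.
    return sum(log_tf_weight(doc_tf[t]) for t in set(query_terms) if t in doc_tf)

# Reproduces the mapping above: 0 -> 0, 1 -> 1, 2 -> 1.3, 10 -> 2, 1000 -> 4
```

For a document containing one query term 10 times and another once, the score is 2 + 1 = 3.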

SLIDES 23-24

Frequency in document vs. frequency in collection

In addition to term frequency (the frequency of the term in the document), we also want to use the frequency of the term in the collection for weighting and ranking.

SLIDES 25-30

idf weight

- df_t is the document frequency, the number of documents that t occurs in.
- df_t is an inverse measure of the informativeness of term t.
- Inverse document frequency, idf_t, is a direct measure of the informativeness of the term.
- The idf weight of term t is defined as follows:

      idf_t = log10(N / df_t)

  (N is the number of documents in the collection.)
- We use log(N / df_t) instead of N / df_t to “dampen” the effect of idf.

SLIDES 31-32

Examples for idf

idf_t = log10(1,000,000 / df_t)

term            df_t   idf_t
calpurnia          1       6
animal           100       4
sunday         1,000       3
fly           10,000       2
under        100,000       1
the        1,000,000       0
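With N = 1,000,000 as in the example, these idf values can be checked directly (a sketch; `idf` is our own helper name):

```python
import math

def idf(N: int, df: int) -> float:
    # idf_t = log10(N / df_t)
    return math.log10(N / df)

N = 1_000_000
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:10s} {df:>9} {idf(N, df):.0f}")
```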

SLIDES 33-38

Effect of idf on ranking

- idf gives high weights to rare terms like arachnocentric.
- idf gives low weights to frequent words like good, increase, and line.
- idf affects the ranking of documents for queries with at least two terms. For example, in the query “arachnocentric line”, idf weighting increases the relative weight of arachnocentric and decreases the relative weight of line.
- idf has little effect on ranking for one-term queries.

SLIDES 39-43

Summary: tf-idf weighting

Assign a tf-idf weight to each term t in each document d:

    w_{t,d} = (1 + log10 tf_{t,d}) · log10(N / df_t)

The tf-idf weight . . .

- . . . increases with the number of occurrences within a document (term frequency component).
- . . . increases with the rarity of the term in the collection (inverse document frequency component).
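Combining the two components gives a small tf-idf helper (our own sketch of the formula above):

```python
import math

def tf_idf(tf: int, df: int, N: int) -> float:
    # w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t); terms absent from d get weight 0.
    if tf == 0:
        return 0.0
    return (1.0 + math.log10(tf)) * math.log10(N / df)
```

For example, a term occurring 10 times in d (tf component 2) and in 100 of 1000 documents (idf component 1) gets weight 2.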

SLIDE 44

Outline

1. tf-idf weighting
2. Vector space model
3. Pivot length normalization


SLIDES 47-48

Binary → count → weight matrix

           Anthony and  Julius  The      Hamlet  Othello  Macbeth  . . .
           Cleopatra    Caesar  Tempest
Anthony    5.25         3.18    0.0      0.0     0.0      0.35
Brutus     1.21         6.10    0.0      1.0     0.0      0.0
Caesar     8.59         2.54    0.0      1.51    0.25     0.0
Calpurnia  0.0          1.54    0.0      0.0     0.0      0.0
Cleopatra  2.85         0.0     0.0      0.0     0.0      0.0
mercy      1.51         0.0     1.90     0.12    5.25     0.88
worser     1.37         0.0     0.11     4.15    0.25     1.95
. . .

Each document is now represented as a real-valued vector of tf-idf weights ∈ R^|V|.

SLIDES 49-54

Documents as vectors

- Each document is now represented as a real-valued vector of tf-idf weights ∈ R^|V|.
- So we have a |V|-dimensional real-valued vector space.
- Terms are axes of the space.
- Documents are points or vectors in this space.
- Very high-dimensional: tens of millions of dimensions when you apply this to web search engines.
- Each vector is very sparse: most entries are zero.
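Because the vectors are so sparse, an implementation would typically store only the nonzero entries, e.g. as a term → weight map. A sketch under the tf-idf weighting above; the document-frequency table is made up for illustration:

```python
import math
from collections import Counter

def doc_vector(tokens, df, N):
    # Sparse tf-idf vector: only terms that actually occur in the document get an entry.
    tf = Counter(tokens)
    return {t: (1.0 + math.log10(c)) * math.log10(N / df[t]) for t, c in tf.items()}

# Hypothetical document frequencies for a toy collection of N = 1000 documents.
df = {"car": 10, "auto": 100}
v = doc_vector(["car", "car", "auto"], df, N=1000)
```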

SLIDES 55-60

Queries as vectors

- Key idea 1: do the same for queries: represent them as vectors in the high-dimensional space.
- Key idea 2: rank documents according to their proximity to the query.
  - proximity = similarity
  - proximity ≈ negative distance
- Recall: we’re doing this because we want to get away from the you’re-either-in-or-out, feast-or-famine Boolean model.
- Instead: rank relevant documents higher than nonrelevant documents.

SLIDES 61-66

How do we formalize vector space similarity?

- First cut: (negative) distance between two points (= distance between the end points of the two vectors).
- Euclidean distance?
- Euclidean distance is a bad idea . . . because Euclidean distance is large for vectors of different lengths.

SLIDES 67-68

Why distance is a bad idea

[Figure: the query q = “rich poor” and three documents plotted in the plane spanned by the axes rich and poor: d1 “Ranks of starving poets swell”, d2 “Rich poor gap grows”, d3 “Record baseball salaries in 2010”.]

The Euclidean distance of q and d2 is large although the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
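The rich/poor picture can be reproduced numerically. The coordinates below are made up for illustration: d2 points in the same direction as q but is much longer, so its Euclidean distance from q is large even though its term distribution matches the query’s:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical (rich, poor) coordinates.
q  = (1.0, 1.0)   # query "rich poor"
d2 = (4.0, 4.0)   # same term distribution as q, just a longer document
d3 = (1.2, 0.2)   # mostly "rich", hardly any "poor"

# d3 is closer to q than d2 in Euclidean terms, yet d2 matches the query better:
# euclidean(q, d2) > euclidean(q, d3), while cosine(q, d2) > cosine(q, d3).
```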

SLIDES 69-75

Use angle instead of distance

- Rank documents according to angle with query.
- The following two notions are equivalent:
  - Rank documents according to the angle between query and document in increasing order.
  - Rank documents according to cosine(query, document) in decreasing order.
- Cosine is a monotonically decreasing function of the angle for the interval [0°, 180°].
- → do ranking according to cosine.

SLIDES 76-82

Cosine similarity between query and document

    cos(q, d) = sim(q, d) = (q · d) / (|q| |d|) = Σ_{i=1}^{|V|} (q_i / |q|) · (d_i / |d|)

- q_i is the tf-idf weight of term i in the query.
- d_i is the tf-idf weight of term i in the document.
- |q| and |d| are the lengths of q and d.
- This is the cosine similarity of q and d . . . or, equivalently, the cosine of the angle between q and d.
- cosine similarity = dot product of length-normalized vectors
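For sparse term → weight vectors, the sum only needs to run over terms the query and document share (a minimal sketch of the formula above):

```python
import math

def length(v):
    # |v| = sqrt of the sum of squared weights
    return math.sqrt(sum(w * w for w in v.values()))

def cosine_sim(q, d):
    # cos(q, d) = sum_i (q_i / |q|) * (d_i / |d|); only shared terms contribute.
    nq, nd = length(q), length(d)
    if nq == 0.0 or nd == 0.0:
        return 0.0
    return sum(w * d.get(t, 0.0) for t, w in q.items()) / (nq * nd)
```

Vectors pointing in the same direction get similarity 1; vectors with no shared terms get 0.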

SLIDES 83-84

Cosine similarity illustrated

[Figure: the length-normalized vectors v(q), v(d1), v(d2), v(d3) in the plane spanned by the axes rich and poor, with θ the angle between v(q) and a document vector.]

SLIDES 85-86

Components of tf-idf weighting

Term frequency:
  n (natural)    tf_{t,d}
  l (logarithm)  1 + log(tf_{t,d})
  a (augmented)  0.5 + (0.5 × tf_{t,d}) / max_t(tf_{t,d})
  b (boolean)    1 if tf_{t,d} > 0, 0 otherwise
  L (log ave)    (1 + log(tf_{t,d})) / (1 + log(ave_{t∈d}(tf_{t,d})))

Document frequency:
  n (no)        1
  t (idf)       log(N / df_t)
  p (prob idf)  max{0, log((N − df_t) / df_t)}

Normalization:
  n (none)            1
  c (cosine)          1 / sqrt(w_1² + w_2² + . . . + w_M²)
  u (pivoted unique)  1/u
  b (byte size)       1/CharLength^α, α < 1

Best known combination of weighting options

SLIDES 87-95

tf-idf example

- We often use different weightings for queries and documents.
- Notation: ddd.qqq
- Example: lnc.ltn
  - document: logarithmic tf, no df weighting, cosine normalization
  - query: logarithmic tf, idf, no normalization
- Isn’t it bad to not idf-weight the document?
- Example query: “best car insurance”
- Example document: “car insurance auto insurance”

SLIDES 96-108

tf-idf example: lnc.ltn

Query: “best car insurance”. Document: “car insurance auto insurance”.

word       |              query               |           document            | product
           | tf-raw tf-wght    df  idf weight | tf-raw tf-wght weight n’lized |
auto       |    0      0     5000  2.3   0    |    1      1      1     0.52   |  0
best       |    1      1    50000  1.3   1.3  |    0      0      0     0      |  0
car        |    1      1    10000  2.0   2.0  |    1      1      1     0.52   |  1.04
insurance  |    1      1     1000  3.0   3.0  |    2      1.3    1.3   0.68   |  2.04

Key to columns: tf-raw: raw (unweighted) term frequency; tf-wght: logarithmically weighted term frequency; df: document frequency; idf: inverse document frequency; weight: the final weight of the term in the query or document; n’lized: document weights after cosine normalization; product: the product of final query weight and final document weight.

Cosine normalization of the document vector: √(1² + 0² + 1² + 1.3²) ≈ 1.92, so 1/1.92 ≈ 0.52 and 1.3/1.92 ≈ 0.68.

Final similarity score between query and document: Σi wq,i · wd,i = 0 + 0 + 1.04 + 2.04 = 3.08

Schütze: Vector space model 28 / 37
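The table’s arithmetic can be reproduced in a few lines. A minimal sketch, assuming a collection size of N = 1,000,000 (not stated on the slide, but it makes idf = log10(N/df) match the idf column):

```python
import math

N = 1_000_000  # assumed collection size; chosen so idf = log10(N/df) matches the slide

df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
query_tf = {"best": 1, "car": 1, "insurance": 1}   # "best car insurance"
doc_tf = {"car": 1, "insurance": 2, "auto": 1}     # "car insurance auto insurance"

def log_tf(tf):
    # logarithmic tf weighting: 1 + log10(tf) for tf > 0, else 0
    return 1 + math.log10(tf) if tf > 0 else 0.0

# ltn query weights: log tf times idf, no normalization
q_weights = {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in query_tf.items()}

# lnc document weights: log tf, no idf, cosine normalization
d_raw = {t: log_tf(tf) for t, tf in doc_tf.items()}
norm = math.sqrt(sum(w * w for w in d_raw.values()))
d_weights = {t: w / norm for t, w in d_raw.items()}

# similarity = dot product over the query terms
score = sum(w * d_weights.get(t, 0.0) for t, w in q_weights.items())
print(round(score, 2))  # ≈ 3.07 (the slide's 3.08 comes from rounded intermediate values)
```

Computed with full precision the score is about 3.07; the slide arrives at 3.08 because it multiplies the already-rounded column values 0.52 and 0.68.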

SLIDE 109

Outline

1 tf-idf weighting
2 Vector space model
3 Pivot length normalization

Schütze: Vector space model 29 / 37

SLIDES 110-116

A problem for cosine normalization

Query q: “anti-doping rules Beijing 2008 olympics”
Compare three documents:
d1: a short document on anti-doping rules at the 2008 Olympics
d2: a long document that consists of a copy of d1 and 5 other news stories, all on topics different from Olympics/anti-doping
d3: a short document on anti-doping rules at the 2004 Athens Olympics

What ranking do we expect in the vector space model?

Schütze: Vector space model 30 / 37
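The d1/d2 problem is easy to demonstrate with raw term-frequency vectors and cosine similarity. A small sketch (the document strings are toy stand-ins for d1 and d2; d2 is d1 plus a block of unrelated text):

```python
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity of two raw term-frequency vectors (Counters)
    dot = sum(c * b.get(t, 0) for t, c in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb)

q = Counter("anti-doping rules beijing 2008 olympics".split())
d1 = Counter("anti-doping rules at the beijing 2008 olympics".split())
d2 = d1 + Counter(("stock market report " * 20).split())  # d1 plus unrelated text

# d2 contains everything d1 does, yet cosine normalization scores it much lower
assert cosine(q, d1) > cosine(q, d2)
```

Even though d2 contains all of d1 and is therefore just as relevant, its long unrelated tail inflates the normalization factor and drags its similarity down, which is exactly the bias pivot normalization corrects.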

SLIDES 117-121

Pivot normalization

Cosine normalization produces weights that are too large for short documents and too small for long documents (on average).
Adjust cosine normalization by linear adjustment: “turning” the average normalization on the pivot.
Effect: Similarities of short documents with query decrease; similarities of long documents with query increase.
This removes the unfair advantage that short documents have.
Singhal’s study is also interesting from the point of view of methodology.

Schütze: Vector space model 31 / 37
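The linear adjustment can be sketched as a function of the old normalization factor. A minimal sketch; the slope value (0.75 here) and the pivot are tuning parameters I am assuming for illustration, not values from the slides:

```python
def pivoted_norm(cosine_norm, pivot, slope=0.75):
    # "Turn" the normalization line around the pivot:
    # documents whose cosine norm equals the pivot are unchanged,
    # shorter documents (norm < pivot) get a larger divisor (similarity drops),
    # longer documents (norm > pivot) get a smaller divisor (similarity rises).
    return (1.0 - slope) * pivot + slope * cosine_norm

pivot = 2.0
assert pivoted_norm(pivot, pivot) == pivot   # at the pivot: unchanged
assert pivoted_norm(1.0, pivot) > 1.0        # short document: divisor grows
assert pivoted_norm(4.0, pivot) < 4.0        # long document: divisor shrinks
```

In practice the pivot is set to the average normalization factor over the collection, and the slope is tuned on training data.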

SLIDES 122-123

Predicted and true probability of relevance

(figure; source: Lillian Lee)

Schütze: Vector space model 32 / 37

SLIDES 124-125

Pivot normalization

(figure; source: Lillian Lee)

Schütze: Vector space model 33 / 37

SLIDES 126-127

Pivoted normalization: Amit Singhal’s experiments

(relevant documents retrieved and (change in) average precision)

Schütze: Vector space model 34 / 37

SLIDES 128-134

Summary: Ranked retrieval in the vector space model

Represent each document as a weighted tf-idf vector
Represent the query as a weighted tf-idf vector
Compute the cosine similarity between the query vector and each document vector
Alternatively, use pivot normalization
Rank documents with respect to the query
Return the top K (e.g., K = 10) to the user

Schütze: Vector space model 35 / 37
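The summary steps can be sketched end to end. A minimal, self-contained version using a generic tf-idf/cosine scheme (not the specific lnc.ltn combination from the worked example); the toy corpus and whitespace tokenization are illustrative assumptions:

```python
import math
from collections import Counter

def build_index(docs):
    # docs: list of token lists. Returns idf and one cosine-normalized
    # tf-idf weight vector (a dict) per document.
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log10(N / n) for t, n in df.items()}
    vecs = []
    for d in docs:
        w = {t: (1 + math.log10(c)) * idf[t] for t, c in Counter(d).items()}
        norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
        vecs.append({t: x / norm for t, x in w.items()})
    return idf, vecs

def rank(query_tokens, idf, vecs, k=10):
    # tf-idf query vector, dot product with each document vector, top K
    q = {t: (1 + math.log10(c)) * idf.get(t, 0.0)
         for t, c in Counter(query_tokens).items()}
    scores = [(sum(w * v.get(t, 0.0) for t, w in q.items()), i)
              for i, v in enumerate(vecs)]
    return sorted(scores, reverse=True)[:k]

docs = [s.split() for s in
        ["car insurance auto insurance", "best car", "olympics doping rules"]]
idf, vecs = build_index(docs)
top = rank("best car insurance".split(), idf, vecs, k=3)
```

Each result is a (score, document index) pair; the document sharing no terms with the query scores exactly 0 and ranks last.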

SLIDE 135

Take-away

tf-idf weighting: Quick review of tf-idf weighting
Vector space model: represents queries and documents in a high-dimensional space
Pivot normalization (or “pivoted document length normalization”): alternative to cosine normalization that removes a bias inherent in standard length normalization

Schütze: Vector space model 36 / 37

SLIDE 136

Resources

Chapters 6 and 7 of Introduction to Information Retrieval
Resources at http://informationretrieval.org/essir2011
Gerard Salton (main proponent of the vector space model in the 70s, 80s, 90s)
Exploring the similarity space (Moffat and Zobel, 2005)
Pivot normalization (original paper)

Schütze: Vector space model 37 / 37