Information Retrieval Tutorial 4: Vector Space Model

Professor: Michel Schellekens
TA: Ang Gao
University College Cork
2012-11-15

Outline

1. Review

Simple Boolean vs. ranking of the result set

Simple Boolean retrieval returns matching documents in no particular order. Google (and most well-designed Boolean engines) rank the result set: they rank good hits (according to some estimator of relevance) higher than bad hits.

Ranked retrieval

Thus far, our queries have been Boolean:

- Documents either match or don't.
- Good for expert users with a precise understanding of their needs and of the collection.
- Also good for applications: applications can easily consume thousands of results.
- Not good for the majority of users:
  - Most users are not capable of writing Boolean queries...
  - ...or they are, but they think it's too much work.
  - Most users don't want to wade through thousands of results. This is particularly true of web search.

Problem with Boolean search: feast or famine

Boolean queries often produce either too few (= 0) or too many (thousands of) results. In Boolean retrieval, it takes a lot of skill to come up with a query that yields a manageable number of hits: AND gives too few; OR gives too many.

Scoring as the basis of ranked retrieval

We wish to rank documents that are more relevant higher than documents that are less relevant. How can we accomplish such a ranking of the documents in the collection with respect to a query? Assign a score, say in [0, 1], to each query-document pair. This score measures how well the document and the query "match".

Take 1: Jaccard coefficient

A commonly used measure of the overlap of two sets. Let A and B be two sets (A ≠ ∅ or B ≠ ∅). The Jaccard coefficient is

    jaccard(A, B) = |A ∩ B| / |A ∪ B|

- jaccard(A, A) = 1
- jaccard(A, B) = 0 if A ∩ B = ∅
- A and B don't have to be the same size.
- It always assigns a number between 0 and 1.

Jaccard coefficient: Example

Problem 1: What is the query-document match score that the Jaccard coefficient computes for:

- Query: "University College Cork"
- Document: "Cork City Tourism guide"

The two term sets share one term (Cork) out of six distinct terms in total, so jaccard(q, d) = 1/6.
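As a quick sketch (not part of the original tutorial), the Jaccard score over term sets can be computed directly with Python's set operations:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A intersect B| / |A union B|, assuming A or B non-empty."""
    return len(a & b) / len(a | b)

# Problem 1 from above: one shared term ("cork") out of six distinct terms.
query = set("university college cork".lower().split())
doc = set("cork city tourism guide".lower().split())
score = jaccard(query, doc)  # -> 1/6
```

Lowercasing here stands in for whatever term normalization the retrieval system applies.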

What's wrong with Jaccard?

- It doesn't consider term frequency (tf): how many occurrences a term has.
- Rare terms are more informative than frequent terms; Jaccard does not consider this information (idf).
- We also need a more sophisticated way of normalizing for the length of a document.

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

    w_{t,d} = (1 + log tf_{t,d}) · log(N / df_t)

where the first factor is the tf weight and the second is the idf weight (N is the number of documents in the collection).

- This is the best-known weighting scheme in information retrieval.
- The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
- df_t is the document frequency: the number of documents that t occurs in. df_t is an inverse measure of the informativeness of term t.
- idf_t = log(N / df_t) is a measure of the informativeness of the term.
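The weighting formula can be sketched as a small Python helper. This is illustrative only; it uses natural logarithms, which is what the tutorial's worked example below assumes (the log base is a free choice in general):

```python
import math

def tf_idf_weight(tf: int, df: int, n_docs: int) -> float:
    """w_{t,d} = (1 + log tf_{t,d}) * log(N / df_t), natural log.

    Returns 0.0 when the term does not occur in the document (tf == 0).
    """
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(n_docs / df)
```

Note that a term occurring in every document (df = N) gets weight 0: it carries no discriminating information.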

Computing TF-IDF: An Example

Problem 2: Given a document containing terms with the following frequencies:

    A(3), B(2), C(1)

Assume the collection contains 10,000 documents and the document frequencies of these terms are:

    A(50), B(1300), C(250)

Calculate the tf-idf weight for A, B, C in this document (natural logarithms):

- A: (1 + log(3)) · log(10000/50) = 11.119
- B: (1 + log(2)) · log(10000/1300) = 3.454
- C: (1 + log(1)) · log(10000/250) = 3.689
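The three weights above can be checked with a few lines of Python (natural log, matching the example; the dictionary of (tf, df) pairs is just the problem data restated):

```python
import math

def w(tf: int, df: int, n: int = 10_000) -> float:
    # (1 + ln tf) * ln(N / df), as in Problem 2
    return (1 + math.log(tf)) * math.log(n / df)

weights = {t: w(tf, df) for t, (tf, df) in
           {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}.items()}
```

Observe that C outweighs B despite its lower term frequency: C is the rarer, hence more informative, term.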

Binary incidence matrix

               Anthony &   Julius   The       Hamlet   Othello   Macbeth   ...
               Cleopatra   Caesar   Tempest
    Anthony        1          1        0         0        0         1
    Brutus         1          1        0         1        0         0
    Caesar         1          1        0         1        1         1
    Calpurnia      0          1        0         0        0         0
    Cleopatra      1          0        0         0        0         0
    mercy          1          0        1         1        1         1
    worser         1          0        1         1        1         0
    ...

Each document is represented as a binary vector ∈ {0, 1}^|V|.


Count matrix

               Anthony &   Julius   The       Hamlet   Othello   Macbeth   ...
               Cleopatra   Caesar   Tempest
    Anthony       157        73       0         0        0         1
    Brutus          4       157       0         2        0         0
    Caesar        232       227       0         2        1         0
    Calpurnia       0        10       0         0        0         0
    Cleopatra      57         0       0         0        0         0
    mercy           2         0       3         8        5         8
    worser          2         0       1         1        1         5
    ...

Each document is now represented as a count vector ∈ N^|V|.


Binary → count → weight matrix

               Anthony &   Julius   The       Hamlet   Othello   Macbeth   ...
               Cleopatra   Caesar   Tempest
    Anthony      5.25       3.18     0.0       0.0      0.0       0.35
    Brutus       1.21       6.10     0.0       1.0      0.0       0.0
    Caesar       8.59       2.54     0.0       1.51     0.25      0.0
    Calpurnia    0.0        1.54     0.0       0.0      0.0       0.0
    Cleopatra    2.85       0.0      0.0       0.0      0.0       0.0
    mercy        1.51       0.0      1.90      0.12     5.25      0.88
    worser       1.37       0.0      0.11      4.15     0.25      1.95
    ...

Each document is now represented as a real-valued vector of tf-idf weights ∈ R^|V|.


Summary: Ranked retrieval in the vector space model

- Represent the query as a weighted tf-idf vector.
- Represent each document as a weighted tf-idf vector.
- Compute the cosine similarity between the query vector and each document vector.
  - Euclidean distance is large for vectors of different lengths: long documents and short documents (or queries) would be positioned far apart, even when they are about the same topic.
  - The angle between two semantically identical documents is 0 (cosine similarity 1).
- Rank the documents with respect to the query.
- Return the top K (e.g., K = 10) to the user.

Cosine similarity between query and document

    cos(q, d) = sim(q, d) = (q · d) / (|q| |d|)
              = Σ_{i=1..|V|} q_i d_i / ( sqrt(Σ_{i=1..|V|} q_i²) · sqrt(Σ_{i=1..|V|} d_i²) )

- q_i is the tf-idf weight of term i in the query.
- d_i is the tf-idf weight of term i in the document.
- |q| and |d| are the lengths of q and d.
- This is the cosine similarity of q and d... or, equivalently, the cosine of the angle between q and d.
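The formula translates directly into code. A minimal sketch, assuming dense vectors over a shared vocabulary and non-zero norms:

```python
import math

def cosine_similarity(q: list[float], d: list[float]) -> float:
    """cos(q, d) = (q . d) / (|q| |d|) for two tf-idf vectors
    indexed by the same vocabulary."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)
```

Because of the normalization by |q| and |d|, scaling a vector by any positive constant leaves the similarity unchanged, which is exactly the length-normalization property motivated above.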

Ranked retrieval in the vector space model: example

Example: Consider these documents:

- Doc1: "Shipment of gold damaged in a fire"
- Doc2: "Delivery of silver arrived in a silver truck"
- Doc3: "Shipment of gold arrived in a truck"

Compute the tf-idf weights for each term in each document, then rank the three documents by computed score for the query "gold silver truck".

First, for each document and the query, we compute all vector lengths (zero terms ignored). Next, we compute all dot products (zero products ignored). Finally, we calculate the similarity values and rank the documents.
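The three steps above can be sketched end to end in Python. This is a minimal illustration, assuming the (1 + log tf) · log(N/df) weighting with natural logarithms from earlier; helper names like tfidf_vector are ours, not the tutorial's:

```python
import math
from collections import Counter

docs = {
    "Doc1": "shipment of gold damaged in a fire",
    "Doc2": "delivery of silver arrived in a silver truck",
    "Doc3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

N = len(docs)
# Document frequency: in how many documents each term occurs.
df = Counter(t for text in docs.values() for t in set(text.split()))

def tfidf_vector(text: str) -> dict[str, float]:
    # w_{t,d} = (1 + ln tf) * ln(N / df_t); all query terms occur in
    # the collection here, so df[t] is never zero.
    tf = Counter(text.split())
    return {t: (1 + math.log(c)) * math.log(N / df[t]) for t, c in tf.items()}

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    # Sparse dot product over u's terms; zero products are skipped implicitly.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

qv = tfidf_vector(query)
scores = {name: cosine(qv, tfidf_vector(text)) for name, text in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Doc2 ranks first: "silver" occurs twice there and nowhere else, so it carries both a high tf weight and a high idf weight.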

Ranked retrieval in the vector space model: exercise

Problem 3: Consider these documents:

- Doc1: a a b e c
- Doc2: b c a c c
- Doc3: e b d

Compute the tf-idf weights for each term in each document, then rank the three documents by computed score for the query "a c d".
