computing relevance similarity the vector space model
play

Computing Relevance, Similarity: The Vector Space Model Chapter 27, - PDF document

Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearsts slides at UC-Berkeley http://www.sims.berkeley.edu/courses/is202/f00/ Database Management Systems, R. Ramakrishnan 1 Document Vectors


  1. Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides at UC-Berkeley http://www.sims.berkeley.edu/courses/is202/f00/ Database Management Systems, R. Ramakrishnan 1 Document Vectors � Documents are represented as “bags of words” � Represented as vectors when used computationally • A vector is like an array of floating point • Has direction and magnitude • Each vector holds a place for every term in the collection • Therefore, most vectors are sparse Database Management Systems, R. Ramakrishnan 2 Document Vectors: One location for each word. nova galaxy heat h’wood film role diet fur A 10 5 3 B 5 10 C 10 8 7 D “Nova” occurs 10 times in text A 9 10 5 “Galaxy” occurs 5 times in text A 10 10 E F “Heat” occurs 3 times in text A 9 10 G 5 7 (Blank means 0 occurrences.) 9 H 6 10 2 8 I 7 5 1 3 Database Management Systems, R. Ramakrishnan 3

  2. Document Vectors Document ids nova galaxy heat h’wood film role diet fur A 10 5 3 B 5 10 C 10 8 7 D 9 10 5 E 10 10 9 10 F G 5 7 9 H 6 10 2 8 I 7 5 1 3 Database Management Systems, R. Ramakrishnan 4 We Can Plot the Vectors Star Doc about movie stars Doc about astronomy Doc about mammal behavior Diet Assumption: Documents that are “close” in space are similar. Database Management Systems, R. Ramakrishnan 5 Vector Space Model � Documents are represented as vectors in term space • Terms are usually stems • Documents represented by binary vectors of terms � Queries represented the same as documents � A vector distance measure between the query and documents is used to rank retrieved documents • Query and Document similarity is based on length and direction of their vectors • Vector operations to capture boolean query Database Management Systems, R. Ramakrishnan 6 conditions

  3. Vector Space Documents and Queries t 1 docs t1 t2 t3 RSV=Q.Di t 3 D1 1 0 1 4 D 2 D 9 D 1 D2 1 0 0 1 D3 0 1 1 5 D 4 D 11 D4 1 0 0 1 D 5 D5 1 1 1 6 D 3 D 6 D6 1 1 0 3 D 10 D7 0 1 0 2 D8 0 1 0 2 D 8 D9 0 0 1 3 t 2 D 7 D10 0 1 1 5 D11 1 0 1 3 Q 1 2 3 q1 q2 q3 Boolean term combinations Q is a query – also represented as a vector Database Management Systems, R. Ramakrishnan 7 Assigning Weights to Terms � Binary Weights � Raw term frequency � tf x idf • Recall the Zipf distribution • Want to weight terms highly if they are •frequent in relevant documents … BUT •infrequent in the collection as a whole Database Management Systems, R. Ramakrishnan 8 Binary Weights � Only the presence (1) or absence (0) of a term is included in the vector docs t1 t2 t3 D1 1 0 1 D2 1 0 0 D3 0 1 1 D4 1 0 0 D5 1 1 1 D6 1 1 0 D7 0 1 0 D8 0 1 0 D9 0 0 1 D10 0 1 1 D11 1 0 1 Database Management Systems, R. Ramakrishnan 9

  4. Raw Term Weights � The frequency of occurrence for the term in each document is included in the vector docs t1 t2 t3 D1 2 0 3 D2 1 0 0 D3 0 4 7 D4 3 0 0 D5 1 6 3 D6 3 5 0 D7 0 8 0 D8 0 10 0 D9 0 0 1 D10 0 3 5 D11 4 0 1 Database Management Systems, R. Ramakrishnan 10 TF x IDF Weights � tf x idf measure: • Term Frequency (tf) • Inverse Document Frequency (idf) -- a way to deal with the problems of the Zipf distribution � Goal: Assign a tf * idf weight to each term in each document Database Management Systems, R. Ramakrishnan 11 TF x IDF Calculation w = tf * log( N / n ) ik ik k = T term k in document D k i = tf frequency of term T in document D ik k i = idf inverse document frequency of term T in C k k = N total number of documents in the collection C = n the number of documents in C that contain T k k   N = idf log   k  n  k Database Management Systems, R. Ramakrishnan 12

  5. Inverse Document Frequency � IDF provides high values for rare words and low values for common words  10000    = log 0  10000  For a  10000  log   = 0 . 301 collection  5000  of 10000   10000 = documents log   2 . 698  20    10000 = log   4  1  Database Management Systems, R. Ramakrishnan 13 TF x IDF Normalization � Normalize the term weights (so longer documents are not unfairly given more weight) • The longer the document, the more likely it is for a given term to appear in it, and the more often a given term is likely to appear in it. So, we want to reduce the importance attached to a term appearing in a document based on the length of the document. tf log( N / n ) = w ik k ik ∑ = t 2 2 ( tf ) [log( N / n )] ik k k 1 Database Management Systems, R. Ramakrishnan 14 Pair-wise Document Similarity nova galaxy heat h’wood film role diet fur A 1 3 1 B 5 2 2 1 5 C 4 1 D How to compute document similarity? Database Management Systems, R. Ramakrishnan 15

  6. Pair-wise Document Similarity = ∗ + ∗ = sim ( A , B ) ( 1 5 ) ( 2 3 ) 11 = D w , w ..., w 1 11 12 , 1 t sim ( A , C ) = 0 D = w , w ..., w = sim ( A , D ) 0 2 21 22 , 2 t sim ( B , C ) = 0 t ∑ = ∗ sim ( D , D ) w w = sim ( B , D ) 0 1 2 1 i 2 i = i 1 = ∗ + ∗ = sim ( C , D ) ( 2 4 ) ( 1 1 ) 9 nova galaxy heat h’wood film role diet fur A 1 3 1 B 5 2 2 1 5 C 4 1 D Database Management Systems, R. Ramakrishnan 16 Pair-wise Document Similarity (cosine normalization) = D w , w ..., w 1 11 12 , 1 t = D w , w ..., w 2 21 22 , 2 t t ∑ = ∗ sim ( D , D ) w w unnormaliz ed 1 2 1 i 2 i = i 1 t ∑ ∗ w w 1 i 2 i = sim ( D , D ) i = 1 cosine normalized 1 2 t t ∑ ∑ 2 ∗ 2 ( w ) ( w ) 1 i 2 i = = i 1 i 1 Database Management Systems, R. Ramakrishnan 17 Vector Space “Relevance” Measure = D w , w ,..., w i d d d i 1 i 2 it = = Q w , w ..., w w 0 if a term is absent q 1 q 2 , qt t ∑ = ∗ if term weights normalized : sim ( Q , D ) w w i qj d ij = j 1 otherwise normalize in the similarity comparison : t ∑ ∗ w w qj d ij = j = 1 sim ( Q , D ) i t t ∑ ∑ 2 ∗ 2 ( w ) ( w ) qj d ij j = 1 j = 1 Database Management Systems, R. Ramakrishnan 18

  7. Computing Relevance Scores = Say we have query vect or Q ( 0 . 4 , 0 . 8 ) Also, document D = ( 0 . 2 , 0 . 7 ) 2 What does their similarity comparison yield? + ( 0 . 4 * 0 . 2 ) ( 0 . 8 * 0 . 7 ) = sim ( Q , D ) 2 2 + 2 2 + 2 [( 0 . 4 ) ( 0 . 8 ) ] * [( 0 . 2 ) ( 0 . 7 ) ] 0 . 64 = = 0 . 98 0 . 42 Database Management Systems, R. Ramakrishnan 19 Vector Space with Term Weights and Cosine Matching D i =( d i1 ,w di1 ;d i2 , w di2 ;…;d it , w dit ) Term B Q =( q i1 ,w qi1 ;q i2 , w qi2 ;…;q it , w qit ) 1.0 Q = (0.4,0.8) ∑ t w w D1=(0.8,0.3) Q D 2 q d j = 1 sim ( Q , D ) = j ij D2=(0.2,0.7) 0.8 i ∑ ∑ t t 2 2 ( w ) ( w ) = q = d j 1 j j 1 ij 0.6 α ⋅ + ⋅ ( 0 . 4 0 . 2 ) ( 0 . 8 0 . 7 ) 2 = sim ( Q , D 2 ) 0.4 2 + 2 ⋅ 2 + 2 [( 0 . 4 ) ( 0 . 8 ) ] [( 0 . 2 ) ( 0 . 7 ) ] D 1 0.2 0 . 64 α = = 0 . 98 1 0 . 42 0 0.2 0.4 0.6 0.8 1.0 . 56 = = Term A sim ( Q , D ) 0 . 74 1 0 . 58 Database Management Systems, R. Ramakrishnan 20 Text Clustering � Finds overall similarities among groups of documents � Finds overall similarities among groups of tokens � Picks out some themes, ignores others Database Management Systems, R. Ramakrishnan 21

  8. Text Clustering Clustering is “The art of finding groups in data.” -- Kaufmann and Rousseeu Te Term rm 1 1 Te Term rm 2 Database Management Systems, R. Ramakrishnan 22 Problems with Vector Space � There is no real theoretical basis for the assumption of a term space • It is more for visualization than having any real basis • Most similarity measures work about the same � Terms are not really orthogonal dimensions • Terms are not independent of all other terms; remember our discussion of correlated terms in text Database Management Systems, R. Ramakrishnan 23 Probabilistic Models � Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query � Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle) � Relies on accurate estimates of probabilities Database Management Systems, R. Ramakrishnan 24

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend