
Ranking and Preference in Database Search – Kevin Chen-Chuan Chang (PDF slides)



  1. Ranking and Preference in Database Search: a) Similarity and Relevance
     Kevin Chen-Chuan Chang

     Ranking – Ordering according to the degree of some fuzzy notions:
     - Similarity (or dissimilarity)
     - Relevance
     - Preference
     [Figure: a query Q posed against a database yields a ranking.]

     Similarity – Are they similar?
     - Two images

  2. Similarity – Are they similar?
     - Two strings

     So, similarity is not a Boolean notion – it is relative, a matter of ranking.

     Ranking by similarity
     - Similarity-based ranking – by a "distance" (or "dissimilarity") function: rank each object O_i by d(Q, O_i). (A minimal code sketch follows below.)
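To make the last slide concrete: similarity-based ranking is just a sort of the objects by increasing d(Q, O_i). A minimal sketch, assuming plain Python objects and a caller-supplied distance function (the names and the toy 1-D example are illustrative, not from the slides):

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")

def rank_by_distance(q: T,
                     objects: Iterable[T],
                     d: Callable[[T, T], float]) -> list[T]:
    """Similarity-based ranking: order objects by increasing d(q, o)."""
    return sorted(objects, key=lambda o: d(q, o))

# Toy 1-D example: distance is absolute difference.
d = lambda x, y: abs(x - y)
print(rank_by_distance(5, [1, 9, 4, 7], d))  # [4, 7, 1, 9]
```

Any dissimilarity function can be plugged in here; the rest of the lecture is about which functions are reasonable and how to avoid scanning every object.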

  3. The "space" – Defined by the objects and their distances
     - Object representation – vector or not?
     - Distance function – metric or not?

     Vector space – What is a vector space?
     (S, d) is a vector space if:
     - each object in S is a k-dimensional vector, x = (x_1, ..., x_k), y = (y_1, ..., y_k)
     - the distance d(x, y) between any x and y is a metric

     Vector space distance functions – The L_p distance functions
     - The general form:
       $L_p(x{:}(x_1, \dots, x_k),\; y{:}(y_1, \dots, y_k)) = \left( \sum_{i=1}^{k} |x_i - y_i|^p \right)^{1/p}$
     - AKA: p-norm distance, Minkowski distance
     - Does this look familiar?

     Vector space distance functions – L_1: The Manhattan distance
     - Let p = 1 in L_p:
       $L_1(x, y) = \sum_{i=1}^{k} |x_i - y_i|$
     - Manhattan or "block" distance
     [Figure: points (x_1, x_2) and (y_1, y_2) joined by axis-parallel segments.]
     (A code sketch of the L_p family follows after this block.)
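The L_p formulas above translate directly into code. A minimal sketch, assuming plain Python lists as vectors (the function names are mine):

```python
def minkowski(x: list[float], y: list[float], p: float) -> float:
    """General L_p (Minkowski, p-norm) distance between k-dimensional vectors."""
    assert len(x) == len(y), "vectors must have the same dimensionality"
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

def manhattan(x: list[float], y: list[float]) -> float:
    """L_1: the Manhattan ("block") distance, i.e. p = 1."""
    return minkowski(x, y, p=1)

def euclidean(x: list[float], y: list[float]) -> float:
    """L_2: the Euclidean (shortest) distance, i.e. p = 2."""
    return minkowski(x, y, p=2)

# For (0, 0) vs. (3, 4): walking the blocks costs 7, the straight line costs 5.
assert manhattan([0, 0], [3, 4]) == 7
assert euclidean([0, 0], [3, 4]) == 5
```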

  4. Vector space distance functions – L_2: The Euclidean distance
     - Let p = 2 in L_p:
       $L_2(x, y) = \left( \sum_{i=1}^{k} (x_i - y_i)^2 \right)^{1/2}$
     - The shortest distance
     [Figure: points (x_1, x_2) and (y_1, y_2) joined by a straight segment.]

     Vector space distance functions – The Cosine measure
     - $\mathrm{sim}(x, y) = \cos(\theta) = \frac{x \cdot y}{\|x\| \times \|y\|} = \frac{\sum_i x_i \times y_i}{\sqrt{\sum_i x_i^2} \times \sqrt{\sum_i y_i^2}}$
     [Figure: vectors x and y separated by angle θ.]
     (A code sketch of the cosine measure follows after this block.)

     Sounds abstract? That's actually how Web search engines (like Google) work
     - Vector space modeling: cosine measure, or the "TF-IDF" model
     - Q: "apple computer"; Q = (x_1, ..., x_k), D = (y_1, ..., y_k)
     - $\mathrm{Sim}(Q, D) = \sum_i x_i \times y_i$

     How to evaluate vector-space queries? Consider the L_p measure
     - Consider L_2 as the ranking function
     - Given object Q, find O_i in order of increasing d(Q, O_i)
     - How to evaluate this query? What index structure?
     - As nearest-neighbor queries, using multidimensional or spatial indexes, e.g., the R-tree [Guttman, 1984]
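A matching sketch for the cosine measure: the dot product of the two vectors normalized by their lengths, so only the angle between them matters, not their magnitudes:

```python
import math

def cosine(x: list[float], y: list[float]) -> float:
    """Cosine measure: dot product divided by the product of vector lengths."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
assert abs(cosine([1, 2], [2, 4]) - 1.0) < 1e-9
assert cosine([1, 0], [0, 1]) == 0.0
```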

  5. How to evaluate vector-space queries? Consider the Cosine measure
     - $\mathrm{Sim}(Q, D) = \sum_i x_i \times y_i$
     - How to evaluate this query? What index structure?
     - Simple computation: multiply and sum up
     - Inverted index to find the documents with non-zero weights for the query terms
     (An inverted-index sketch follows after this block.)

     Is vector space always possible?
     - Can you always express objects as k-dimensional vectors, so that the distance function compares only corresponding dimensions?
     - Counter examples?

     How about comparing two strings? Is it natural to consider them in vector space?
     - Two strings

     Metric space – What is a metric space?
     - A set S of objects
     - A global distance function d (the "metric")
     - For every two points x, y in S:
       - Positiveness: $d(x, y) \ge 0$
       - Symmetry: $d(x, y) = d(y, x)$
       - Reflexivity: $d(x, x) = 0$
       - Triangle inequality: $d(x, y) \le d(x, z) + d(z, y)$
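A minimal sketch of the inverted-index evaluation described above: the index maps each term to the documents where its weight is non-zero, so the score Sim(Q, D) = Σ x_i × y_i is accumulated only over documents that share a term with the query. The documents and weights are illustrative assumptions:

```python
from collections import defaultdict

# Toy document vectors as sparse term -> weight maps (illustrative data).
docs = {
    "d1": {"apple": 0.8, "computer": 0.5},
    "d2": {"apple": 0.3, "pie": 0.9},
    "d3": {"computer": 0.7, "science": 0.6},
}

# Inverted index: term -> postings list of (doc id, weight).
index: dict[str, list[tuple[str, float]]] = defaultdict(list)
for doc_id, vector in docs.items():
    for term, weight in vector.items():
        index[term].append((doc_id, weight))

def search(query: dict[str, float]) -> list[tuple[str, float]]:
    """Accumulate Sim(Q, D) = sum_i x_i * y_i over postings of query terms only."""
    scores: dict[str, float] = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(search({"apple": 1.0, "computer": 1.0}))
# [('d1', 1.3), ('d3', 0.7), ('d2', 0.3)]
```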

  6. Vector space is a special case of metric space – e.g., consider L_2
     - Let p = 2 in L_p:
       $L_2(x, y) = \left( \sum_{i=1}^{k} (x_i - y_i)^2 \right)^{1/2}$
     - The shortest distance
     [Figure: points (x_1, x_2) and (y_1, y_2).]

     Another example – Edit distance
     - The smallest number of edit operations (insertions, deletions, and substitutions) required to transform one string into another
     - Virginia → Verginia → Verminia → Vermonta → Vermont
     - http://urchin.earth.li/~twic/edit-distance.html
     (A dynamic-programming sketch follows after this block.)

     Is edit distance metric?
     - Can you show that it is symmetric, i.e., that d(Virginia, Vermont) = d(Vermont, Virginia)?
       (Hint: any edit sequence can be run backwards at the same cost: each insertion becomes a deletion and vice versa, and each substitution reverses.)
     - Check the other properties

     How to evaluate metric-space ranking queries? [Chávez et al., 2001]
     - Can we still use an R-tree?
     - What property of metric space can we leverage to "prune" the search space for finding near objects?
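The textbook dynamic-programming sketch of edit (Levenshtein) distance, matching the slide's definition; the function name is mine:

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to transform s into t."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between the prefixes s[:i] and t[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                 # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                 # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution (or match)
    return dp[m][n]

print(edit_distance("Virginia", "Vermont"))  # 5
```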

  7. Metric-space indexing
     - What is the range of u?
     - How does this help in focusing our search?
     [Figure: a query Q and indexed objects with known distances 2, 3, 5, and 6; u denotes an unknown distance to be bounded.]
     (A pruning sketch based on the triangle inequality follows after this block.)

     Relevance-based ranking – for text retrieval
     - What is being "relevant"? Many different ways of modeling relevance:
       - Similarity: how similar is D to Q?
       - Probability: how likely is D relevant to Q?
       - Inference: how likely can D infer Q?

     Similarity-based relevance – we just talked about this "vector-space modeling" [Salton et al., 1975]
     - Cosine measure, or the "TF-IDF" model
     - TF-IDF for term weights in vectors:
       - TF: term frequency (in this document) – the more term occurrences in this doc, the better
       - IDF: inverse document frequency (in the entire DB) – the fewer documents contain this term, the better

     Probabilistic relevance
     - View: probability of relevance – the "probabilistic ranking principle" [Robertson, 1977]:
       "If a retrieval system's response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."
     - Initial idea proposed in [Maron and Kuhns, 1960]; many models followed
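One answer to the metric-space question above is the triangle inequality: if the index stores d(p, o) for some pivot object p, then d(q, o) ≥ |d(q, p) − d(p, o)|, so any object whose lower bound already exceeds the search radius can be discarded without ever computing its distance to q. A minimal sketch, assuming a single pivot and a range query (names and data are illustrative):

```python
def range_search(q, pivot, stored, d, r):
    """Return objects o with d(q, o) <= r, pruning via one pivot.

    `stored` maps each object o to its pre-computed distance d(pivot, o).
    By the triangle inequality, d(q, o) >= |d(q, pivot) - d(pivot, o)|,
    so objects failing that lower bound are pruned with no call to d().
    """
    dq_pivot = d(q, pivot)
    results = []
    for o, d_pivot_o in stored.items():
        if abs(dq_pivot - d_pivot_o) > r:
            continue                 # pruned: o cannot lie within radius r
        if d(q, o) <= r:             # verify the survivors with a real d()
            results.append(o)
    return results

# Toy 1-D example: d is absolute difference, pivot = 10.
d = lambda x, y: abs(x - y)
stored = {o: d(10, o) for o in [2, 8, 11, 25]}
print(range_search(q=9, pivot=10, stored=stored, d=d, r=2))  # [8, 11]
```

This is the essence of the pivot-based metric indexes surveyed in [Chávez et al., 2001]: the pruning needs only the metric properties, not coordinates, so it works for edit distance too.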

  8. Probabilistic models (e.g., [Croft and Harper, 1979])
     - Estimate and rank by $P(R \mid Q, D)$, or equivalently by the log-odds
       $\log \frac{P(R \mid Q, D)}{P(\bar{R} \mid Q, D)}$
     - I.e., rank by $\sum_{t_i \in Q, D} \log \left( \frac{p_i}{q_i} \cdot \frac{1 - q_i}{1 - p_i} \right)$,
       where $p_i = P(t_i \mid R)$ and $q_i = P(t_i \mid \bar{R})$
     - Assume:
       - $p_i$ is the same for all query terms
       - $q_i = n_i / N$, where $n_i$ is the number of documents containing $t_i$ and $N$ is the DB size (i.e., "all" docs are non-relevant)
     - Similar to using "IDF"
       - Intuition: e.g., "apple computer" in a computer DB – "computer" appears in nearly every document, so it contributes little to the ranking
     (A worked sketch of this scoring function follows after this block.)

     This is how we derive the ranking function:
     - To rank by $\log \frac{P(R \mid Q, D)}{P(\bar{R} \mid Q, D)}$
     - By Bayes' rule, $P(R \mid Q, D) = \frac{P(Q, D \mid R) \, P(R)}{P(Q, D)} \propto P(Q, D \mid R) \, P(R)$, and likewise for $\bar{R}$
     - Assuming term independence:
       $P(Q, D \mid R) = \prod_{t_i \in Q, D} P(t_i \mid R) \prod_{t_j \in Q, \bar{D}} \left( 1 - P(t_j \mid R) \right) = \prod_{t_i \in Q, D} p_i \prod_{t_j \in Q, \bar{D}} (1 - p_j)$
       $P(Q, D \mid \bar{R}) = \prod_{t_i \in Q, D} q_i \prod_{t_j \in Q, \bar{D}} (1 - q_j)$
     - Regrouping (the factor $\prod_{t_i \in Q} \frac{1 - p_i}{1 - q_i}$ is the same for every document and drops out under $\propto$):
       $\log \frac{P(R \mid Q, D)}{P(\bar{R} \mid Q, D)} \propto \log \prod_{t_i \in Q, D} \frac{p_i (1 - q_i)}{q_i (1 - p_i)} = \sum_{t_i \in Q, D} \log \left( \frac{p_i}{q_i} \cdot \frac{1 - q_i}{1 - p_i} \right)$
     - With the assumptions above ($p_i$ constant, so its factor contributes the same weight to every matched term and is dropped; $q_i = n_i / N$):
       $\propto \sum_{t_i \in Q, D} \log \frac{1 - q_i}{q_i} = \sum_{t_i \in Q, D} \log \frac{N - n_i}{n_i}$
     - For $n_i \ll N$ this is approximately $\sum_{t_i \in Q, D} \log \frac{N}{n_i}$, the classic IDF weight
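A worked sketch of the resulting scoring function, assuming only a document-frequency table (the data is illustrative):

```python
import math

def rsv(query_terms: set[str], doc_terms: set[str],
        doc_freq: dict[str, int], N: int) -> float:
    """Score from the derivation above: sum over terms in both Q and D
    of log((N - n_i) / n_i)."""
    score = 0.0
    for t in query_terms & doc_terms:
        n_i = doc_freq[t]
        score += math.log((N - n_i) / n_i)
    return score

# The slide's intuition: in a computer DB, "computer" occurs almost
# everywhere and is nearly useless; the rare term "apple" dominates.
doc_freq = {"computer": 900, "apple": 30}
N = 1000
print(round(rsv({"apple", "computer"}, {"apple", "computer", "mac"}, doc_freq, N), 3))
# 1.279 = log(970/30) + log(100/900): "apple" contributes +3.476, "computer" -2.197
```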

  9. Inference-based relevance
     - Given the doc as evidence, prove that the info need is satisfied
     - Motivation:
       - Is there any "objective" way of defining relevance?
       - Hint from a logic view of database querying: retrieve all objects s.t. O → Q
         - E.g., O = (john, cs, 3.5) implies Q: gpa > 3.0 AND dept = cs
       - What about "Retrieve D iff we can prove D → Q"?
     - Challenges: uncertainty in inference? [van Rijsbergen, 1986]
       - Representation of documents and queries
       - Quantify the uncertainty of inference: P(D → Q) = P(Q | D)

     Inference network [Turtle and Croft, 1990]
     - Inference based on Bayesian belief networks
     [Figure: the document network – documents d_1, d_2, ..., d_n (evidence: "doc d_n observed"), document representations t_1, t_2, ..., t_n, and document concepts r_1, r_2, r_3, ..., r_k – feeds the query network – query concepts c_1, c_2, ..., c_m, query representations q_1, q_2, and the query Q, the "information need".]

     Using and constructing the network
     - Using the network: suppose all probabilities are known
       - The document network can be pre-computed
       - For any given query, the query network can be evaluated
       - P(Q | D) can be computed for each document
       - Documents can be ranked according to P(Q | D) (see the toy sketch after this block)
     - Constructing the network: assigning probabilities
       - Subjective probabilities
       - Heuristics, e.g., TF-IDF weighting
       - Statistical estimation – needs "training"/relevance data

     Ranking and Preference in Database Search: b) Preference Modeling
     Kevin Chen-Chuan Chang

     Ranking – Ordering according to the degree of some fuzzy notions:
     - Similarity (or dissimilarity)
     - Relevance
     - Preference

     What do you prefer? For a job.
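Returning to "Using the network" above: once P(Q | D) is computable for each document, ranking is a sort. The full Turtle-Croft belief network does not fit in a few lines, so this toy sketch swaps in a simple smoothed query-likelihood estimate of P(Q | D); the smoothing scheme, vocabulary size, and documents are all illustrative assumptions, not the slides' model:

```python
import math

def p_q_given_d(query: list[str], doc: list[str],
                mu: float = 1.0, vocab_size: int = 10_000) -> float:
    """Toy stand-in for P(Q|D): product of smoothed per-term probabilities.
    Smoothing keeps unseen query terms from zeroing the whole product."""
    n = len(doc)
    def p_term(t: str) -> float:
        return (doc.count(t) + mu / vocab_size) / (n + mu)
    return math.exp(sum(math.log(p_term(t)) for t in query))

docs = {
    "d1": "apple computer apple mac".split(),
    "d2": "apple pie recipe".split(),
}
query = "apple computer".split()
ranked = sorted(docs, key=lambda name: p_q_given_d(query, docs[name]), reverse=True)
print(ranked)  # ['d1', 'd2']
```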
