Similarity Estimation Techniques from Rounding Algorithms

Moses Charikar, Princeton University (PowerPoint PPT presentation)


  1. Similarity Estimation Techniques from Rounding Algorithms. Moses Charikar, Princeton University.

  2. Compact sketches for estimating similarity
     - Collection of objects, e.g. mathematical representations of documents, images.
     - Implicit similarity/distance function.
     - Want to estimate similarity without looking at entire objects.
     - Compute compact sketches of objects so that similarity/distance can be estimated from them.

  3. Similarity Preserving Hashing
     - Similarity function sim(x, y).
     - Family of hash functions F with a probability distribution such that Pr_{h ∈ F}[h(x) = h(y)] = sim(x, y).

  4. Applications
     - Compact representation scheme for estimating similarity (see the sketch below):
       x → (h_1(x), h_2(x), ..., h_k(x))
       y → (h_1(y), h_2(y), ..., h_k(y))
     - Approximate nearest neighbor search [Indyk, Motwani] [Kushilevitz, Ostrovsky, Rabani].
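The similarity of x and y is estimated as the fraction of coordinates on which their two sketches agree. A minimal sketch of that estimator, assuming the sampled hash family is supplied by the caller (the function names here are ours, not from the slides):

```python
def sketch(x, hash_family):
    """Apply each sampled hash function h_1, ..., h_k to x."""
    return [h(x) for h in hash_family]

def estimate_similarity(sketch_x, sketch_y):
    """Fraction of agreeing coordinates; its expectation is sim(x, y)."""
    matches = sum(1 for a, b in zip(sketch_x, sketch_y) if a == b)
    return matches / len(sketch_x)
```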

  5. Estimating Set Similarity [Broder, Manasse, Glassman, Zweig] [Broder, Charikar, Frieze, Mitzenmacher]
     - Collection of subsets S_1, S_2, ...
     - similarity = |S_1 ∩ S_2| / |S_1 ∪ S_2|

  6. Min-wise Independent Permutations
     - Pick a random permutation σ; the sketch of S_1 is min(σ(S_1)) and the sketch of S_2 is min(σ(S_2)).
     - Pr[min(σ(S_1)) = min(σ(S_2))] = |S_1 ∩ S_2| / |S_1 ∪ S_2|
     (A code sketch of this estimator follows.)
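A minimal MinHash sketch along these lines, with salted hash functions standing in for truly min-wise independent permutations (an approximation; all names here are ours):

```python
import random

def make_minhash_family(k, seed=0):
    """Sample k 'permutations', approximated here by salted hash functions."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(k)]
    return [lambda x, s=s: hash((s, x)) for s in salts]

def minhash_sketch(S, family):
    """Sketch of set S: the minimum hash value under each sampled function."""
    return [min(h(x) for x in S) for h in family]

def estimate_jaccard(sk1, sk2):
    """Fraction of agreeing minima estimates |S1 ∩ S2| / |S1 ∪ S2|."""
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

# Example: two overlapping sets with true Jaccard similarity 2/6 ≈ 0.33.
S1, S2 = {"a", "b", "c", "d"}, {"c", "d", "e", "f"}
family = make_minhash_family(200)
print(estimate_jaccard(minhash_sketch(S1, family), minhash_sketch(S2, family)))
```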

  7. Related Work
     - Streaming algorithms: compute f(data) in one pass using small space; implicitly construct a sketch of the data seen so far.
     - Synopsis data structures [Gibbons, Matias].
     - Compact distance oracles, distance labels.
     - Hash functions with similar properties: [Linial, Sasson] [Indyk, Motwani, Raghavan, Vempala] [Feige, Krauthgamer].

  8. Results
     - Necessary conditions for the existence of similarity preserving hashing (SPH).
     - SPH schemes from rounding algorithms:
       - Hash function for vectors based on random hyperplane rounding.
       - Hash function for estimating Earth Mover Distance based on rounding schemes for classification with pairwise relationships.

  9. Existence of SPH schemes
     - sim(x, y) admits an SPH scheme if there exists a family of hash functions F such that Pr_{h ∈ F}[h(x) = h(y)] = sim(x, y).

  10. Theorem: If sim(x, y) admits an SPH scheme then 1 - sim(x, y) satisfies the triangle inequality.
      Proof: 1 - sim(x, y) = Pr_{h ∈ F}[h(x) ≠ h(y)]. Let Δ_h(x, y) be the indicator variable for the event h(x) ≠ h(y). For every fixed h, Δ_h(x, y) + Δ_h(y, z) ≥ Δ_h(x, z). Taking expectations over h ∈ F and using 1 - sim(x, y) = E_{h ∈ F}[Δ_h(x, y)] gives the triangle inequality.
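A quick empirical illustration of this consequence, using Jaccard similarity (which admits an SPH scheme via min-wise permutations, as on the previous slides) on random sets; this is only a sanity check, not part of the proof:

```python
import random

def jaccard(A, B):
    """Jaccard similarity |A ∩ B| / |A ∪ B|."""
    return len(A & B) / len(A | B) if (A | B) else 1.0

rng = random.Random(1)
universe = list(range(30))
for _ in range(1000):
    X, Y, Z = (set(rng.sample(universe, rng.randint(1, 20))) for _ in range(3))
    dist = lambda A, B: 1.0 - jaccard(A, B)   # 1 - sim(x, y)
    assert dist(X, Z) <= dist(X, Y) + dist(Y, Z) + 1e-12
print("1 - Jaccard satisfied the triangle inequality on all sampled triples")
```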

  11. Stronger Condition
      Theorem: If sim(x, y) admits an SPH scheme then (1 + sim(x, y))/2 has an SPH scheme with hash functions mapping objects to {0,1}.
      Theorem: If sim(x, y) admits an SPH scheme then 1 - sim(x, y) is isometrically embeddable in the Hamming cube.
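One way to obtain such a {0,1}-valued scheme is to apply a random h from F and then map its output to a single bit: if h(x) = h(y) the bits always agree, otherwise they agree with probability about 1/2, giving collision probability (1 + sim(x, y))/2. A small sketch, with a salted built-in hash standing in for a truly random bit function (an approximation, and the name binary_sph is ours):

```python
import random

def binary_sph(hash_family, rng=random):
    """Turn an SPH family for sim(x, y) into a {0,1}-valued hash whose
    collision probability is (1 + sim(x, y)) / 2: pick h from the family,
    then map its output to a bit via a salted hash."""
    h = rng.choice(hash_family)
    salt = rng.getrandbits(64)
    return lambda x: hash((salt, h(x))) & 1
```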

  12. Random Hyperplane Rounding based SPH
      - Collection of vectors u, v with sim(u, v) = 1 - θ(u, v)/π.
      - Pick a random hyperplane through the origin with normal vector r, and set
        h_r(u) = 1 if r · u ≥ 0, and 0 if r · u < 0.
      - [Goemans, Williamson]
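A minimal sketch of this hash and the resulting angle estimator, assuming plain Python lists as vectors and Gaussian coordinates for the random normal (all names are ours):

```python
import math
import random

def random_hyperplane_hash(dim, rng):
    """Sample a normal vector r with Gaussian coordinates; h(u) = 1 iff r · u >= 0."""
    r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda u: 1 if sum(ri * ui for ri, ui in zip(r, u)) >= 0 else 0

def estimate_angle(u, v, k=1000, seed=0):
    """Estimate the angle between u and v from k independent hyperplane hashes."""
    rng = random.Random(seed)
    hashes = [random_hyperplane_hash(len(u), rng) for _ in range(k)]
    disagreements = sum(h(u) != h(v) for h in hashes)
    # Pr[h(u) != h(v)] = theta(u, v) / pi, so theta is about pi * disagreements / k.
    return math.pi * disagreements / k
```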

  13. Earth Mover Distance (EMD)
      [Figure: two distributions P and Q, with EMD(P,Q) the cost of transforming one into the other.]

  14. Earth Mover Distance
      - Set of points L = {l_1, l_2, ..., l_n}.
      - Distance function d(i, j) (assume metric).
      - Distribution P(L): non-negative weights (p_1, p_2, ..., p_n).
      - Earth Mover Distance (EMD): distance between distributions P and Q.
      - Proposed as a metric in graphics and vision for the distance between images. [Rubner, Tomasi, Guibas]

  15. EMD(P,Q) is the optimum of the transportation LP (solved in the sketch below):
      minimize   Σ_{i,j} f_{i,j} · d(i, j)
      subject to Σ_j f_{i,j} = p_i   for all i
                 Σ_i f_{i,j} = q_j   for all j
                 f_{i,j} ≥ 0         for all i, j
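One direct way to evaluate this LP, assuming SciPy is available and that p and q have equal total weight (the function and argument names are ours):

```python
import numpy as np
from scipy.optimize import linprog

def emd(p, q, d):
    """Earth Mover Distance via the transportation LP above.

    p, q: non-negative weight vectors of length n with equal totals.
    d:    n x n cost matrix with d[i][j] = d(i, j).
    The flow variables f[i][j] are flattened row-major into one vector.
    """
    n = len(p)
    c = np.asarray(d, dtype=float).reshape(n * n)     # objective: sum_ij f_ij * d(i, j)
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0              # row i: sum_j f_ij = p_i
    for j in range(n):
        A_eq[n + j, j::n] = 1.0                       # row n+j: sum_i f_ij = q_j
    res = linprog(c, A_eq=A_eq, b_eq=np.concatenate([p, q]), bounds=(0, None))
    return res.fun
```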

  16. Relaxation of SPH
      - Estimate a distance measure, not a similarity measure in [0,1].
      - Allow hash functions to map objects to points in a metric space and measure E[d(h(P), h(Q))]. (SPH is the special case d(x,y) = 1 if x ≠ y.)
      - The estimator will approximate EMD.

  17. Classification with pairwise relationships [Kleinberg, Tardos]
      [Figure: a labeled graph illustrating the assignment cost and the separation cost w_e on edges.]

  18. Classification with pairwise relationships
      - Collection of objects V.
      - Labels L = {l_1, l_2, ..., l_n}.
      - Assignment of labels h : V → L.
      - Cost of assigning a label to u: c(u, h(u)).
      - Graph of related objects; for edge e = (u, v), cost paid: w_e · d(h(u), h(v)).
      - Find an assignment of labels to minimize total cost (evaluated in the sketch below).
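For concreteness, the objective from this slide can be evaluated as follows; a minimal sketch with hypothetical dictionary-based inputs for c, d, and w:

```python
def labeling_cost(h, c, edges, w, d):
    """Total cost of an assignment h: V -> L for metric labeling.

    h:     dict object -> label
    c:     dict (object, label) -> assignment cost
    edges: list of (u, v) pairs of related objects
    w:     dict (u, v) -> edge weight w_e
    d:     dict (label, label) -> distance between labels
    """
    assignment = sum(c[(u, h[u])] for u in h)
    separation = sum(w[(u, v)] * d[(h[u], h[v])] for (u, v) in edges)
    return assignment + separation
```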

  19. LP Relaxation and Rounding [Kleinberg, Tardos] [Chekuri, Khanna, Naor, Zosin]
      - The LP relaxation assigns each object a distribution over labels; P and Q denote the distributions of two related objects.
      - Separation cost is measured by EMD(P,Q).
      - The rounding algorithm guarantees Pr[h(P) = l_i] = p_i and E[d(h(P), h(Q))] ≤ O(log n · log log n) · EMD(P,Q).

  20. Rounding details
      - Probabilistically approximate the metric on L by a tree metric (HST).
      - Expected distortion O(log n log log n).
      - EMD on a tree metric has a nice form (a code sketch follows):
        - T: subtree
        - P(T): sum of probabilities for leaves in T
        - l_T: length of the edge leading up from T
        - EMD(P,Q) = Σ_T l_T |P(T) - Q(T)|
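A small sketch of this tree formula, assuming a simple recursive node structure invented here for illustration:

```python
class Node:
    """Tree node; edge_len is the length of the edge leading up from this subtree."""
    def __init__(self, edge_len=0.0, children=None, p=0.0, q=0.0):
        self.edge_len = edge_len
        self.children = children or []
        self.p = p   # P-weight at this leaf (0 for internal nodes)
        self.q = q   # Q-weight at this leaf (0 for internal nodes)

def tree_emd(node):
    """Return (P(T), Q(T), cost), where cost accumulates l_T * |P(T) - Q(T)|
    over every subtree T rooted at or below this node."""
    p, q, cost = node.p, node.q, 0.0
    for child in node.children:
        cp, cq, ccost = tree_emd(child)
        p, q, cost = p + cp, q + cq, cost + ccost
    return p, q, cost + node.edge_len * abs(p - q)

# EMD(P, Q) on the tree is the cost component returned at the root (root edge length 0).
```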

  21. Theorem: The rounding scheme gives a hashing scheme such that
      EMD(P,Q) ≤ E[d(h(P), h(Q))] ≤ O(log n log log n) · EMD(P,Q).
      Proof: Let y_{i,j} be the probability that h(P) = l_i and h(Q) = l_j. The y_{i,j} give a feasible solution to the LP for EMD, and the cost of this solution is E[d(h(P), h(Q))]. Hence EMD(P,Q) ≤ E[d(h(P), h(Q))].
