Similarity Estimation, Lecture 13, March 5, 2019, Chandra (UIUC)

  1. CS 498ABD: Algorithms for Big Data, Spring 2019. Similarity Estimation. Lecture 13, March 5, 2019. Chandra (UIUC)

  4. Similar Items. Modern data is often unstructured and high-dimensional. Examples: documents, web pages, reviews, images, audio, video. Given a collection of objects: (a) find all “similar” items (application: duplicate detection in documents); (b) for an item x, find all items in the collection similar to x (near-neighbor search, many applications). Comparing two items is expensive; comparing all pairs is infeasible.

  5. High-level Ideas. How to measure similarity/dissimilarity? Use proxy functions for estimating/capturing similarity. Focus only on highly similar items rather than trying to find the similarity of all pairs. Use compression/sketching/hashing to create compact representations of objects. Do fast, approximate near-neighbor search via ideas such as locality-sensitive hashing, clustering, etc.

  6. Topics: Jaccard similarity for sets and minhash; angular distance and simhash; locality-sensitive hashing.

  7. Part I: Jaccard Similarity and Min-wise Independent Hashing

  9. Set Similarity. Motivation: how do we detect near-duplicate text documents (web pages, papers, homeworks, ...)? Model documents as (multi)sets of “words” or, more generally, “shingles”. The universe is a very large set of words/shingles. Each document is a set of words/shingles. There are a large number of documents, and each document is sparse in the space of words/shingles.
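A minimal sketch of the modeling step on this slide: turning a document into a set of word shingles. The function name `shingles`, the lowercasing and whitespace tokenization, and the shingle width `k = 3` are illustrative assumptions, not choices stated in the lecture.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (contiguous word k-grams) of text."""
    words = text.lower().split()
    if len(words) < k:
        # Assumption: a document shorter than k words yields one short shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


doc = "the quick brown fox jumps"
for sh in sorted(shingles(doc, k=3)):
    print(sh)
```

Larger `k` makes accidental matches between unrelated documents rarer, at the cost of a larger universe of possible shingles.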

  12. Jaccard similarity of sets. Definition: given two sets $S, T$, the Jaccard similarity between $S$ and $T$ is defined as $|S \cap T| / |S \cup T|$ and denoted by $\mathrm{SIM}(S, T)$. Assumption: $S, T$ are very similar if $\mathrm{SIM}(S, T) \ge \alpha$ for some fixed threshold $\alpha$, say $\alpha = 0.7$. Question: given many documents, how do we find similar documents?
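The definition can be sketched directly on Python sets. The function name `jaccard` and the convention that two empty sets have similarity 1 are assumptions made here for illustration; the slide does not address the empty case.

```python
def jaccard(s: set, t: set) -> float:
    """Jaccard similarity |S ∩ T| / |S ∪ T| of two sets."""
    union = s | t
    if not union:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(s & t) / len(union)


S = {1, 2, 3, 4}
T = {2, 3, 4, 5}
print(jaccard(S, T))          # 3 shared elements out of 5 in the union -> 0.6
print(jaccard(S, T) >= 0.7)   # False: below the example threshold alpha = 0.7
```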

  13. Min Hashing. Let $n$ be the size of the vocabulary. For a permutation $\sigma$ of $[n]$ and a set $S$, let $\sigma_{\min}(S) = \min\{\sigma(i) \mid i \in S\}$.
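A small worked illustration of $\sigma_{\min}$, not taken from the slides. It assumes the vocabulary is indexed as {0, ..., n-1} and that a permutation is stored as a list whose entry `sigma[i]` is the image of `i`; the particular permutation and sets are made up.

```python
def sigma_min(sigma: list, s: set) -> int:
    """Smallest image sigma(i) over the elements i of a non-empty set s."""
    return min(sigma[i] for i in s)


# Hypothetical permutation of {0, 1, 2, 3, 4}: sigma[i] is the image of i.
sigma = [3, 0, 4, 1, 2]
S = {0, 2, 4}  # images 3, 4, 2
T = {2, 3, 4}  # images 4, 1, 2
print(sigma_min(sigma, S))  # 2
print(sigma_min(sigma, T))  # 1
```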

  15. Min Hashing. Lemma: Let $S, T$ be two subsets of $[n]$. Suppose $\sigma$ is a random permutation of $[n]$. Then $\Pr[\sigma_{\min}(S) = \sigma_{\min}(T)] = |S \cap T| / |S \cup T| = \mathrm{SIM}(S, T)$.
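The idea behind the lemma: the two minima coincide exactly when the element of $S \cup T$ with the smallest image lies in $S \cap T$, and under a uniformly random permutation every element of $S \cup T$ is equally likely to be that element. The sketch below checks the identity exactly on a tiny instance by enumerating all $n!$ permutations; the function name and the 0-indexed universe are assumptions for illustration, and the brute force is only feasible for very small $n$.

```python
from fractions import Fraction
from itertools import permutations


def collision_probability(n: int, s: set, t: set) -> Fraction:
    """Exact Pr[sigma_min(S) = sigma_min(T)] over all permutations of {0..n-1}."""
    hits = total = 0
    for sigma in permutations(range(n)):
        total += 1
        if min(sigma[i] for i in s) == min(sigma[i] for i in t):
            hits += 1
    return Fraction(hits, total)


S, T = {0, 1, 2}, {1, 2, 3, 4}
print(collision_probability(5, S, T))    # 2/5
print(Fraction(len(S & T), len(S | T)))  # 2/5
```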

  17. Min Hashing. Pick $\ell$ random permutations $\sigma_1, \sigma_2, \ldots, \sigma_\ell$. For each set $S$, store the $\ell$-tuple $((\sigma_1)_{\min}(S), \ldots, (\sigma_\ell)_{\min}(S))$. To check similarity between $S$ and $T$, let $s = |\{ i \mid (\sigma_i)_{\min}(S) = (\sigma_i)_{\min}(T) \}|$. Output the estimator $Z = s/\ell$ for $\mathrm{SIM}(S, T)$. $Z$ is an unbiased estimator for $\mathrm{SIM}(S, T)$: by the lemma, each coordinate matches with probability $\mathrm{SIM}(S, T)$, so $\mathrm{E}[Z] = \mathrm{SIM}(S, T)$. Exercise: suppose $\mathrm{SIM}(S, T) \ge \alpha$. How large should $\ell$ be such that $\Pr[Z < (1 - \epsilon)\alpha] < \delta$?
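The scheme on this slide can be sketched as follows. This is an illustration under stated assumptions, not the lecture's implementation: the universe is {0, ..., n-1}, permutations are stored explicitly as shuffled lists (which is exactly the cost the later slides try to avoid), and the names `random_permutations`, `signature`, and `estimate_similarity` are hypothetical.

```python
import random


def random_permutations(n: int, ell: int, seed: int = 0) -> list:
    """ell independent uniformly random permutations of {0..n-1}."""
    rng = random.Random(seed)
    perms = []
    for _ in range(ell):
        sigma = list(range(n))
        rng.shuffle(sigma)
        perms.append(sigma)
    return perms


def signature(s: set, perms: list) -> tuple:
    """The ell-tuple of minimum images of s, one entry per permutation."""
    return tuple(min(sigma[i] for i in s) for sigma in perms)


def estimate_similarity(sig_s: tuple, sig_t: tuple) -> float:
    """Z = s / ell: the fraction of coordinates on which the signatures agree."""
    matches = sum(1 for a, b in zip(sig_s, sig_t) if a == b)
    return matches / len(sig_s)


n, ell = 100, 200
perms = random_permutations(n, ell)
S = set(range(0, 60))
T = set(range(20, 80))
Z = estimate_similarity(signature(S, perms), signature(T, perms))
print(Z)  # an estimate of the true similarity 40/80 = 0.5
```

Once signatures are stored, comparing two sets costs $O(\ell)$ time regardless of how large the sets are.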

  18. Min Hashing in practice. Pick some sufficiently large $\ell$. Use “shingles” instead of “words” (the choice depends on the application). Store for each $S$ the compact “sketch/signature” $((\sigma_1)_{\min}(S), \ldots, (\sigma_\ell)_{\min}(S))$. Do further optimizations for performance/space. See Chapter 3 of the book Mining of Massive Datasets by Leskovec, Rajaraman, and Ullman.
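One hedged way to quantify “sufficiently large”, assuming the $\ell$ permutations are independent and using the standard multiplicative Chernoff lower-tail bound $\Pr[X < (1-\epsilon)\mu] \le e^{-\epsilon^2 \mu / 2}$ for a sum $X$ of independent 0/1 variables with mean $\mu$. If $\mathrm{SIM}(S, T) \ge \alpha$, the number of matching coordinates has mean at least $\alpha \ell$, so $\Pr[Z < (1-\epsilon)\alpha] \le e^{-\epsilon^2 \alpha \ell / 2}$, and $\ell \ge 2 \ln(1/\delta) / (\epsilon^2 \alpha)$ suffices to push this below $\delta$. The function name `signature_length` is hypothetical, and the bound is a sufficient condition rather than a tight one.

```python
import math


def signature_length(alpha: float, eps: float, delta: float) -> int:
    """Smallest integer ell with exp(-eps^2 * alpha * ell / 2) <= delta."""
    return math.ceil(2.0 * math.log(1.0 / delta) / (eps ** 2 * alpha))


# Threshold alpha = 0.7, relative error eps = 0.1, failure probability delta = 0.01.
print(signature_length(alpha=0.7, eps=0.1, delta=0.01))  # 1316
```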

  19. Random permutation? A random permutation, like a random hash function, is complex: it cannot be stored compactly, and computing $\sigma_{\min}(S)$ is expensive. We need pseudorandom permutations that suffice.

  21. Minwise Independent Permutations [Broder-Charikar-Frieze-Mitzenmacher]. Given $n$, $S_n$ is the set of all $n!$ permutations of $[n]$. We want a family $F \subseteq S_n$ of permutations such that picking a random $\sigma$ from $F$ behaves like a random permutation (uniformly chosen from $S_n$). Definition: a family $F \subseteq S_n$ is a minwise independent family of permutations if for every $X \subseteq [n]$ and $a \in X$, for a $\sigma$ chosen uniformly from $F$, $\Pr[\sigma_{\min}(X) = \sigma(a)] = 1/|X|$.

  24. Minwise Independent Permutations. Definition: a family $F \subseteq S_n$ is a minwise independent family of permutations if for every $X \subseteq [n]$ and $a \in X$, for a $\sigma$ chosen uniformly from $F$, $\Pr[\sigma_{\min}(X) = \sigma(a)] = 1/|X|$. Exercise: show that minwise independent permutations suffice for Jaccard similarity estimation. Question: is there a small $F$? It is not obvious that there is a non-trivial family. There exist minwise independent families of size $4^n$. Any minwise independent family must have size at least $e^{(1 - o(1))n}$. Hence we need to relax the requirement further.
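To make the definition concrete, here is a brute-force check on a tiny universe. It is only an illustration: the 0-indexed universe, the function name `is_minwise_independent`, and the choice of cyclic shifts as a small candidate family are assumptions, and the enumeration over all subsets is exponential in $n$.

```python
from fractions import Fraction
from itertools import combinations, permutations


def is_minwise_independent(family: list, n: int) -> bool:
    """Check Pr[sigma(a) = sigma_min(X)] = 1/|X| for every non-empty X and a in X."""
    for size in range(1, n + 1):
        for X in combinations(range(n), size):
            for a in X:
                wins = sum(1 for sigma in family
                           if sigma[a] == min(sigma[i] for i in X))
                if Fraction(wins, len(family)) != Fraction(1, size):
                    return False
    return True


n = 4
all_perms = list(permutations(range(n)))  # the full set S_n, n! permutations
shifts = [tuple((i + c) % n for i in range(n)) for c in range(n)]  # n cyclic shifts

print(is_minwise_independent(all_perms, n))  # True
print(is_minwise_independent(shifts, n))     # False
```

The cyclic shifts fail already on $X = \{0, 1\}$: element 0 attains the minimum under three of the four shifts, not half of them.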

  25. Minwise Independent Permutations. Definition: a family $F \subseteq S_n$ is a minwise independent family of permutations if for every $X \subseteq [n]$ and $a \in X$, for a $\sigma$ chosen uniformly from $F$, $\Pr[\sigma_{\min}(X) = \sigma(a)] = 1/|X|$. Two relaxations: (1) $\epsilon$-approximate minwise independence: $(1 - \epsilon)/|X| \le \Pr[\sigma_{\min}(X) = \sigma(a)] \le (1 + \epsilon)/|X|$. (2) The condition needs to hold only for sets $X$ with $|X| \le k$ for some $k < n$; this is sufficient for applications where sets are much smaller than $n$.

  26. Relaxation of Minwise Independence. Definition: a family $F \subseteq S_n$ is an $(\epsilon, k)$ min-wise independent family if for all $X \subseteq [n]$ such that $|X| \le k$ and all $a \in X$, if $\sigma$ is chosen uniformly from $F$, then $(1 - \epsilon)/|X| \le \Pr[\sigma_{\min}(X) = \sigma(a)] \le (1 + \epsilon)/|X|$.
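The two-sided condition is equivalent to $\big|\,|X| \cdot \Pr[\sigma_{\min}(X) = \sigma(a)] - 1\,\big| \le \epsilon$, so for a small explicit family one can compute the smallest $\epsilon$ that works for a given $k$. The sketch below does this by brute force; the function name `minwise_error`, the 0-indexed universe, and the cyclic-shift family are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations, permutations


def minwise_error(family: list, n: int, k: int) -> Fraction:
    """Smallest eps for which family is (eps, k) min-wise independent."""
    worst = Fraction(0)
    for size in range(1, k + 1):
        for X in combinations(range(n), size):
            for a in X:
                wins = sum(1 for sigma in family
                           if sigma[a] == min(sigma[i] for i in X))
                p = Fraction(wins, len(family))
                # (1 - eps)/|X| <= p <= (1 + eps)/|X|  iff  |p * |X| - 1| <= eps
                worst = max(worst, abs(p * size - 1))
    return worst


n = 4
shifts = [tuple((i + c) % n for i in range(n)) for c in range(n)]
print(minwise_error(shifts, n, k=2))                            # 1/2
print(minwise_error(list(permutations(range(n))), n, k=n))      # 0
```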

  28. Minwise Independence and Hashing. Question: is there a connection between minwise independent permutations and hashing? Suppose $H$ is a family of $t$-wise independent hash functions from $[n]$ to $[n]$. Let $h \in H$. Why is $h$ not a permutation? Because of collisions. Suppose instead $h : [n] \to [m]$ where $m \gg n$; then $h$ has a very low probability of collisions. Would $h$ then behave like a minwise independent permutation?
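A sketch of what replacing permutations by hash functions into a large range looks like in code. Everything specific here is an assumption made for illustration: the pairwise-independent family $h(x) = (ax + b) \bmod P$, the prime $P = 2^{31} - 1$ playing the role of $m$, and the function names. This simple family is not claimed to be minwise independent; whether such a substitution is justified is exactly the question the slide raises.

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1; hash range, assumed much larger than n


def random_linear_hash(rng: random.Random):
    """A random function x -> (a*x + b) mod P with a != 0 (injective on 0..P-1)."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: (a * x + b) % P


def hash_signature(s: set, hashes: list) -> tuple:
    """Minimum hash value of s under each hash function."""
    return tuple(min(h(x) for x in s) for h in hashes)


rng = random.Random(0)
hashes = [random_linear_hash(rng) for _ in range(256)]

universe = range(1_000_000)
S = set(rng.sample(universe, 500))
T = set(list(S)[:300]) | set(rng.sample(universe, 200))

sig_s, sig_t = hash_signature(S, hashes), hash_signature(T, hashes)
estimate = sum(a == b for a, b in zip(sig_s, sig_t)) / len(hashes)
true_sim = len(S & T) / len(S | T)
print(estimate, true_sim)  # compare the sketch estimate with the exact value
```

Each hash function is described by just the two numbers `a` and `b`, in contrast to an explicit permutation of the whole vocabulary.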
