
CS 498ABD: Algorithms for Big Data, Spring 2019

Similarity Estimation

Lecture 13

March 5, 2019

Chandra (UIUC)

Similar Items

Modern data is often unstructured and high-dimensional. Examples: documents, web pages, reviews, images, audio, video, . . .

Given a collection of such objects, two basic tasks:
• Find all pairs of "similar" items (application: duplicate detection in documents).
• Given an item x, find all items in the collection similar to x (near-neighbor search, many applications).

Comparing two items is expensive; comparing all pairs is infeasible.

High-level Ideas

• How to measure similarity/dissimilarity?
• Proxy functions for estimating/capturing similarity.
• Focus only on highly similar items rather than trying to compute similarity for all pairs.
• Compression/sketching/hashing to create compact representations of objects.
• Fast/approximate near-neighbor search via ideas such as locality-sensitive hashing, clustering, etc.

Topics

• Jaccard similarity for sets and minhash
• Angular distance and simhash
• Locality-sensitive hashing

Part I: Jaccard Similarity and Min-wise Independent Hashing

Set Similarity

Motivation: How do we detect near-duplicate text documents? Web pages, papers, homeworks, . . .?

• Model documents as (multi)sets of "words", or more generally "shingles".
• The universe of words/shingles is very large.
• Each document is a set of words/shingles.
• The number of documents is large, and each document is sparse in the space of words/shingles.

Jaccard similarity of sets

Definition: given two sets S, T the Jaccard similarity between S and T is defined as

    SIM(S, T) = |S ∩ T| / |S ∪ T|.

Assumption: S, T are very similar if SIM(S, T) ≥ α for some fixed threshold α, say α = 0.7.

Question: Given many documents, how do we find the similar ones?
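
To make the definition concrete, here is a minimal sketch in Python; the helper name jaccard and the example sets are illustrative, not from the lecture.

    def jaccard(S, T):
        """Jaccard similarity |S ∩ T| / |S ∪ T| of two sets."""
        if not S and not T:
            return 1.0  # convention for two empty sets
        return len(S & T) / len(S | T)

    # Example: two small "documents" as sets of words
    S = {"big", "data", "algorithms", "course"}
    T = {"big", "data", "streaming", "course"}
    print(jaccard(S, T))  # 3/5 = 0.6, below a threshold of α = 0.7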

Min Hashing

• Let n be the size of the vocabulary.
• For a permutation σ of [n] and a set S ⊆ [n], let

    σmin(S) = min{σ(i) | i ∈ S}.

Min Hashing

Lemma

Let S, T be two subsets of [n]. Suppose σ is a random permutation of [n]. Then

    Pr[σmin(S) = σmin(T)] = |S ∩ T| / |S ∪ T|.
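
A quick empirical check of the lemma (our sketch, using nothing beyond the statement): draw random permutations of [n] and compare the collision frequency of σmin against the Jaccard similarity.

    import random

    def sigma_min(sigma, S):
        """min over i in S of sigma(i); sigma is a list with sigma[i] = σ(i)."""
        return min(sigma[i] for i in S)

    n = 100
    S = set(range(0, 40))   # {0, ..., 39}
    T = set(range(20, 60))  # {20, ..., 59}; SIM(S, T) = 20/60 = 1/3

    trials, hits = 50000, 0
    for _ in range(trials):
        sigma = list(range(n))
        random.shuffle(sigma)  # a uniformly random permutation of [n]
        if sigma_min(sigma, S) == sigma_min(sigma, T):
            hits += 1

    print(hits / trials)  # should be close to 1/3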

Min Hashing

• Pick ℓ random permutations σ1, σ2, . . . , σℓ.
• For each set S store the ℓ-tuple (σ1min(S), σ2min(S), . . . , σℓmin(S)).
• To check similarity between S and T, let s = |{i | σimin(S) = σimin(T)}| and output the estimator

    Z = s/ℓ

  as the estimate of SIM(S, T).
• By the lemma, E[Z] = SIM(S, T), so Z is an unbiased estimator of SIM(S, T).

Exercise: Suppose SIM(S, T) ≥ α. How large should ℓ be so that Pr[Z < (1 − ε)α] < δ?
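
A sketch of this signature scheme in Python, with explicit random permutations (fine for small n; the helper names are ours):

    import random

    def minhash_signature(S, perms):
        """The ℓ-tuple of min-hash values of S under the given permutations."""
        return tuple(min(sigma[i] for i in S) for sigma in perms)

    def estimate_sim(sig_S, sig_T):
        """Fraction of coordinates where the signatures agree: the estimator Z."""
        agree = sum(1 for a, b in zip(sig_S, sig_T) if a == b)
        return agree / len(sig_S)

    n, ell = 100, 200
    perms = []
    for _ in range(ell):
        sigma = list(range(n))
        random.shuffle(sigma)
        perms.append(sigma)

    S = set(range(0, 40))
    T = set(range(20, 60))
    Z = estimate_sim(minhash_signature(S, perms), minhash_signature(T, perms))
    print(Z)  # concentrates around SIM(S, T) = 1/3 as ℓ grows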

Min Hashing

In practice:
• Pick some sufficiently large ℓ.
• Use "shingles" instead of "words"; the right choice depends on the application (see the sketch below).
• Store for each S the compact sketch/signature (σ1min(S), . . . , σℓmin(S)).
• Do further optimizations for performance/space.

See Chapter 3 of the book Mining of Massive Datasets by Leskovec, Rajaraman, and Ullman.
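
For instance, a document can be reduced to its set of character k-shingles before min-hashing. A minimal sketch; the choice k = 5 and the helper name shingles are illustrative.

    def shingles(text, k=5):
        """The set of all length-k character substrings (k-shingles) of text."""
        return {text[i:i + k] for i in range(len(text) - k + 1)}

    doc1 = "the quick brown fox jumps over the lazy dog"
    doc2 = "the quick brown fox jumped over a lazy dog"
    S, T = shingles(doc1), shingles(doc2)
    print(len(S & T) / len(S | T))  # Jaccard similarity of the shingle sets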

Random permutation?

• A truly random permutation, like a truly random hash function, is complex:
  • it cannot be stored compactly
  • computing σmin(S) is expensive
• Need pseudorandom permutations that suffice.

Minwise Independent Permutations

[Broder-Charikar-Frieze-Mitzenmacher]

• Given n, Sn is the set of all n! permutations of [n].
• Want a family F ⊆ Sn of permutations such that picking a random σ from F behaves like a random permutation (uniformly chosen from Sn).

Definition

A family F ⊆ Sn is a minwise independent family of permutations if for every X ⊆ [n] and every a ∈ X, for σ chosen uniformly from F,

    Pr[σmin(X) = a] = 1/|X|.

Minwise Independent Permutations

Exercise: Minwise independent permutations suffice for Jaccard similarity estimation.

Question: Is there a small such family F? It is not obvious that a non-trivial family exists.

• There exist minwise independent families of size 4^n.
• Any minwise independent family must have size at least e^((1−o(1))n).
• Hence we need to relax the requirement further.

Minwise Independent Permutations

Two relaxations:
• ε-approximate minwise independence:

    (1 − ε)/|X| ≤ Pr[σmin(X) = a] ≤ (1 + ε)/|X|.

• Require the condition to hold only for sets X with |X| ≤ k for some k < n. This is sufficient for applications where the sets are much smaller than n.

Relaxation of Minwise Independence

Definition

A family F ⊆ Sn is an (ε, k) min-wise independent family if for all X ⊆ [n] with |X| ≤ k and all a ∈ X, if σ is chosen uniformly from F,

    (1 − ε)/|X| ≤ Pr[σmin(X) = a] ≤ (1 + ε)/|X|.

Minwise Independence and Hashing

Question: Is there a connection between minwise independent permutations and hashing?

• Suppose H is a family of t-wise independent hash functions from [n] to [n]. Let h ∈ H. Why is h not a permutation? Because of collisions.
• Suppose h : [n] → [m] where m ≫ n; then h has very low probability of collisions. Would such an h behave like a minwise independent permutation?

Minwise Independence and Hashing

Theorem (Indyk)

Let H be a t-wise independent family of hash functions from [n] to [n] where t = Ω(log(1/ε)). Then H is an (ε, k) minwise-independent family for k = Ω(εn).

Thus hash functions from [n] to [n] effectively suffice for minwise independence and can be used in minhashing.
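
In this spirit, practical min-hashing replaces permutations with hash functions. Below is a sketch using random linear functions h(x) = (ax + b) mod p, a standard pairwise independent family; the theorem asks for t-wise independence with t = Ω(log(1/ε)), which would use degree-(t−1) polynomials mod p instead. The concrete choices here are ours, not the lecture's.

    import random

    P = 2_147_483_647  # a prime (2^31 - 1) larger than the universe size

    def make_hash():
        """A random h(x) = (a*x + b) mod P from a pairwise independent family."""
        a = random.randrange(1, P)
        b = random.randrange(0, P)
        return lambda x: (a * x + b) % P

    ell = 100
    hs = [make_hash() for _ in range(ell)]

    def signature(S):
        """Min-hash signature of S using hash functions in place of permutations."""
        return tuple(min(h(x) for x in S) for h in hs)

    S = set(range(0, 40))
    T = set(range(20, 60))
    sig_S, sig_T = signature(S), signature(T)
    print(sum(a == b for a, b in zip(sig_S, sig_T)) / ell)  # ≈ SIM(S, T) = 1/3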

Minwise independence and Distinct Elements

Do you see a connection between minwise independent permutations/hashing and distinct-element sampling?

Exercise: How would you use minwise independent permutations to sample near-uniformly from the set of distinct elements in a stream?
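
One possible approach to the exercise (our sketch, not the lecture's solution): hash every stream element and remember the element with the smallest hash value; by (approximate) minwise independence, each distinct element is kept with probability close to 1 over the number of distinct elements.

    import random

    P = 2_147_483_647
    a, b = random.randrange(1, P), random.randrange(0, P)
    h = lambda x: (a * x + b) % P  # one random hash function

    best, best_val = None, None
    stream = [3, 7, 3, 1, 9, 7, 7, 1]  # duplicates do not change the sample
    for x in stream:
        if best_val is None or h(x) < best_val:
            best, best_val = x, h(x)

    print(best)  # a near-uniform sample from the distinct elements {1, 3, 7, 9}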

Part II: Angular Distance and Simhash

Angular distance

• Given a collection of vectors v1, v2, . . . , vn in R^d representing some data objects.
• Two vectors u, v are "similar" if they point in roughly the same direction.
• Define dist(u, v) = θ(u, v)/π where θ(u, v) is the angle between u and v. Assuming wlog that u, v are unit vectors, u · v = cos(θ(u, v)).
• Similarity is 1 − dist(u, v).
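
Computing this distance directly from the definition (a small sketch of ours):

    import math

    def angular_dist(u, v):
        """θ(u, v)/π, from the cosine of the angle between u and v."""
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp for safety
        return math.acos(cos_theta) / math.pi

    u, v = (1.0, 0.0), (1.0, 1.0)
    print(angular_dist(u, v))      # 0.25 (a 45-degree angle)
    print(1 - angular_dist(u, v))  # similarity 0.75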

Sim Hash

[Charikar], as a special case of a connection between rounding algorithms and hashing.

• Pick a random hyperplane through the origin, with unit normal vector r.
• For each vi set hr(vi) = sign(r · vi).

Lemma

Pr[hr(vi) ≠ hr(vj)] = θ(vi, vj)/π.

Using several random hyperplanes r1, r2, . . . , rℓ we create a compact hash value/sketch for angular similarity.
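
A sketch of simhash in Python (the names are ours): coordinates of independent Gaussians give a uniformly random direction, the sign pattern is the sketch, and the fraction of disagreeing bits estimates θ/π.

    import math
    import random

    def simhash(v, rs):
        """Sign pattern of v against the random directions rs: the sketch."""
        return tuple(sum(a * b for a, b in zip(r, v)) >= 0 for r in rs)

    d, ell = 3, 2000
    # Independent Gaussian coordinates give a uniformly random direction.
    rs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(ell)]

    u = (1.0, 0.0, 0.0)
    v = (1.0, 1.0, 0.0)  # angle π/4 with u

    su, sv = simhash(u, rs), simhash(v, rs)
    disagree = sum(a != b for a, b in zip(su, sv)) / ell
    print(disagree * math.pi)  # estimates θ(u, v) = π/4 ≈ 0.785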

A general observation

For Jaccard similarity and angular similarity we had the property that there is a family of hash functions H such that, for h chosen randomly from H,

    Pr[h(A) = h(B)] = sim(A, B).

Question: When is the above true in general?

Lemma (Charikar)

If there is a hash family for a similarity measure sim(·, ·) with the preceding property, then d(·, ·) = 1 − sim(·, ·) is a metric, and further d is embeddable in generalized Hamming distance.

Part III: Similarity and Distance Measures

Similarity and Distance

• Different objects and applications drive different similarity measures.
• One common way: a similarity measure, where a large value of sim(x, y) means x and y are more alike.
• Another common way is to use distances, where a small distance means higher similarity.

Some common measures

• Jaccard similarity measure for sets.
• Cosine of the angle between vectors.
• Distance measures: norm-based measures ‖x − y‖p, say p = 1, 2, . . .
• Hamming distance between vectors.
• Edit distance between strings.
• Distance measures between probability distributions: earth-mover distance, KL divergence/relative entropy (not symmetric), . . .
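
For concreteness, two of these measures computed straight from their definitions (a small sketch of ours):

    def lp_distance(x, y, p):
        """The ℓp distance ‖x − y‖p between two vectors."""
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

    def hamming(x, y):
        """Number of coordinates where two vectors differ."""
        return sum(a != b for a, b in zip(x, y))

    x, y = (1.0, 2.0, 3.0), (2.0, 2.0, 1.0)
    print(lp_distance(x, y, 1))                 # 3.0
    print(lp_distance(x, y, 2))                 # sqrt(5) ≈ 2.236
    print(hamming((0, 1, 1, 0), (0, 1, 0, 1)))  # 2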

Part IV: Near-Neighbor Search

Similarity estimation and search

• Collection of data items/objects D.
• We saw ways to compress objects to speed up similarity estimation between pairs of objects.

Still, two problems remain:
• Finding all highly similar pairs: we cannot afford quadratic time even with compressed hashes.
• Given a new point x, we want all points "similar" to x in D: linear search is not feasible.

Near-Neighbor Search

• Collection of data items/objects D.
• Preprocess D using small space so that, given a query x, we can output all y ∈ D with high similarity to x (or small distance to x).
• A fundamental data structure problem with many applications.
• Classical (exact) solution approaches from geometry: Voronoi diagrams, k-d trees, space partition/filling approaches.
• Major drawback: the curse of dimensionality for exact search.
• Modern/recent approaches: approximate NN search via locality-sensitive hashing (LSH), randomized k-d trees, etc.

LSH approach

Initially developed for NN search in high-dimensional Euclidean space and then generalized to other similarity/distance measures.

High-level ideas (a concrete sketch follows this list):
• A collection of n objects p1, p2, . . . , pn in some space, and a distance/similarity measure d on pairs of objects.
• Create a hash function family H with a "locality preserving" property: each h ∈ H maps points that are similar to each other (or close in distance) to the same bucket with higher probability than points that are not so similar.
• Use multiple independent hash functions to create a data structure.
• The hashing family depends on the similarity/distance measure.
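
To illustrate the data-structure idea, here is a minimal sketch of the standard banding construction over min-hash signatures (the scheme follows Chapter 3 of Mining of Massive Datasets; the parameters and names are illustrative): split each signature into bands of r rows and bucket objects by each band, so that highly similar objects collide in some band with good probability.

    import random
    from collections import defaultdict

    P = 2_147_483_647

    def make_hash():
        a, b = random.randrange(1, P), random.randrange(0, P)
        return lambda x: (a * x + b) % P

    bands, rows = 20, 5  # signature length = bands * rows = 100
    hs = [make_hash() for _ in range(bands * rows)]

    def signature(S):
        return tuple(min(h(x) for x in S) for h in hs)

    def candidate_pairs(sets):
        """Pairs of ids that share a bucket in at least one band."""
        sigs = {name: signature(S) for name, S in sets.items()}
        cands = set()
        for b in range(bands):
            buckets = defaultdict(list)
            for name, sig in sigs.items():
                band = sig[b * rows:(b + 1) * rows]  # one band of the signature
                buckets[band].append(name)
            for names in buckets.values():
                for i in range(len(names)):
                    for j in range(i + 1, len(names)):
                        cands.add(tuple(sorted((names[i], names[j]))))
        return cands

    sets = {
        "A": set(range(0, 50)),
        "B": set(range(2, 52)),    # very similar to A
        "C": set(range(60, 110)),  # dissimilar to both
    }
    print(candidate_pairs(sets))   # very likely {("A", "B")}

Tuning bands and rows trades false positives against false negatives: a pair with similarity s becomes a candidate with probability 1 − (1 − s^rows)^bands.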