SLIDE 1

Efficient Algorithms for Streaming Datasets with Near-Duplicates

Theory and Applications of Hashing, May 4, 2017

Qin Zhang, Indiana University Bloomington

Based on work with: Djamal Belazzougui (CERIST), Di Chen (HKUST), Jiecao Chen (IUB), Haoyu Zhang (IUB)

SLIDE 2

Disclaimer

Not really a survey talk; results are all very recent, and solutions may be quite premature.

Agenda

  • 1. Background and motivation
  • 2. Distinct elements on data with near-duplicates
  • 3. Similarity join under edit distance

SLIDE 3

The Streaming Model

Model of computation:
– high-speed online data
– want space/time-efficient algorithms

[Figure: a stream 1 7 9 1 7 3 2 feeding a CPU with a small RAM]

E.g., what is the number of distinct elements?
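
Not from the slides: a minimal sketch of one classical answer to this question, a k-minimum-values (KMV) estimator for the number of distinct elements; all names and parameters here are my own illustration, not the talk's algorithm.

```python
import heapq
import random

class KMV:
    """k-minimum-values F0 estimator (illustrative sketch).

    Hash each item to a pseudo-random value in (0, 1) and keep the k
    smallest values seen; if the k-th smallest is v, estimate F0 as (k-1)/v.
    """
    def __init__(self, k=256, seed=42):
        self.k = k
        self.salt = random.Random(seed).random()
        self.heap = []       # max-heap via negation: the k smallest hash values
        self.kept = set()    # hash values currently kept

    def _hash(self, item):
        # Deterministic pseudo-random map of the item into (0, 1).
        return (hash((self.salt, item)) % (2**61 - 1) + 1) / 2**61

    def update(self, item):
        h = self._hash(item)
        if h in self.kept:                   # duplicate item: no effect
            return
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, -h)
            self.kept.add(h)
        elif h < -self.heap[0]:              # beats the current k-th smallest
            evicted = -heapq.heappushpop(self.heap, -h)
            self.kept.discard(evicted)
            self.kept.add(h)

    def estimate(self):
        if len(self.heap) < self.k:
            return len(self.heap)            # fewer than k distinct items seen
        return (self.k - 1) / (-self.heap[0])

kmv = KMV()
for x in [1, 7, 9, 1, 7, 3, 2] * 1000 + list(range(5000)):
    kmv.update(x)
print(kmv.estimate())   # close to the true F0 of 5000
```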

SLIDE 4

Linear sketches

Problem: given a data vector x ∈ R^d, compute f(x). We can do this using linear sketches: recover g(Mx) ≈ f(x).

[Figure: Mx = M · x, where M is a linear mapping (sometimes embedding a hash function) and Mx is the sketching vector]

Simple and useful: used extensively in streaming/distributed algorithms, compressive sensing, . . .

SLIDE 5

Linear sketches in the streaming model

View each incoming element i as updating x ← x + ei. The sketching vector can be updated incrementally: M(x + ei) = Mx + Mei = Mx + Mi, where Mi is the i-th column of M.

space = size of the sketch Mx
time ≤ space (usually)

[Figure: the stream 1 7 9 1 7 3 2 updating the sketch Mx kept in RAM]
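
To make the update rule concrete, here is a minimal AMS-style linear sketch for F2 (an assumed illustration, not from the slides): the sketch is Mx for a random ±1 matrix M, and the stream update x ← x + ei becomes one column addition.

```python
import random
import statistics

class AMSSketchF2:
    """AMS linear sketch for F2 = sum_i x_i^2 (illustrative).

    Row j stores z_j = sum_i s_j(i) * x_i for random signs s_j(i) in {-1, +1},
    i.e. the sketch is Mx for a random +/-1 matrix M.  By linearity, the
    update x <- x + e_i is z_j <- z_j + s_j(i): time = O(rows) <= space.
    """
    def __init__(self, rows=64, seed=0):
        self.rows = rows
        self.seed = seed
        self.z = [0.0] * rows

    def _sign(self, j, i):
        # In theory 4-wise independent signs suffice; plain hashing is fine here.
        return 1 if hash((self.seed, j, i)) & 1 else -1

    def update(self, i, delta=1):
        # M(x + delta * e_i) = Mx + delta * M_i (the i-th column of M).
        for j in range(self.rows):
            self.z[j] += delta * self._sign(j, i)

    def estimate_f2(self):
        return statistics.median(zj * zj for zj in self.z)

sk = AMSSketchF2()
for item in [1, 7, 9, 1, 7, 3, 2]:
    sk.update(item)
print(sk.estimate_f2())   # approximates the true F2 = 4 + 4 + 1 + 1 + 1 = 11
```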

SLIDE 6

Real-world data is often noisy

music, images, videos... after compression, resizing, Photoshop, etc.

Queries with the same meaning sent to Google: “theory and applications of hashing”, “theory application of hash”, “dagstuhl hashing”, “dagstuhl seminar hash”

SLIDE 7

Robust streaming algorithms

We have to consider near-duplicates as one element. Then how do we compute f(x)?

SLIDE 8

Linear sketches do not work

Why? Items representing the same entity may be hashed into different coordinates of the sketching vector.

SLIDE 9

Magic hash functions?

Does there exist a magic hash function that can (1) map only items representing the same element into the same bucket, and (2) be described succinctly?

Answer: (in general) no. Some hash functions may help (will discuss later).

SLIDE 10

History and the New Question

Related to Entity Resolution: identify and group different manifestations of the same real-world object.

A key problem in data cleaning/integration; studied for 40+ years in DB, and also in AI and NT.

Previous solutions use at least linear space: they detect items representing the same entity and output all distinct entities.

Question: can we analyze data with near-duplicates in the streaming model space/time-efficiently?

SLIDE 11

Distinct Elements

  • Data: points in a metric space
  • Problem: compute the number of robust distinct elements (robust F0)

(Useful in: traffic monitoring, query optimization, . . .)

Robust F0: given a threshold α, partition the input item set S into a minimum-cardinality set of groups G = {G1, . . . , Gn} such that ∀p, q ∈ Gi, d(p, q) ≤ α; the robust F0 is the number of groups n.

– Chen, Z., SIGMOD 2016 (will discuss today)
– Chen, Z., ???? (extends to sliding windows and ℓ0-sampling)

SLIDE 12

Well-shaped dataset

(α, β)-sparse dataset: pairs of items in the same group have distance at most α; pairs of items in different groups have distance at least β.

If the separation ratio β/α > 2, we call the dataset well-shaped. A natural partition exists for a well-shaped dataset. (Will talk about general datasets later.)
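
Why the partition is natural when β/α > 2: a greedy grouping by representatives recovers it exactly. A minimal sketch of this Θ(n)-space baseline (my own reconstruction; the function and variable names are hypothetical):

```python
def greedy_partition(points, alpha, dist):
    """Put each point into the first group whose representative is within
    alpha; otherwise open a new group.

    On an (alpha, beta)-sparse dataset with beta > 2*alpha this recovers the
    natural partition: same-group points are within alpha of the group's
    representative, while different-group points are at distance >= beta >
    alpha and therefore never merge.  Linear space: not a streaming algorithm.
    """
    reps, groups = [], []
    for p in points:
        for rep, grp in zip(reps, groups):
            if dist(p, rep) <= alpha:
                grp.append(p)
                break
        else:
            reps.append(p)
            groups.append([p])
    return groups

# 1D example with alpha = 1, beta = 10 (so beta/alpha > 2): two groups.
pts = [0.0, 0.5, 0.9, 20.0, 20.4]
print(len(greedy_partition(pts, 1.0, lambda a, b: abs(a - b))))   # -> 2
```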

SLIDE 13

Algorithm for (α, β) (β > 2α) well-shaped datasets in 2D

[Figure: groups G1, G2, G3 on a random grid G of side length α/2]

SLIDE 14

Simple sampling (needs two passes)

Algorithm Simple Sampling:
1. Sample η = Õ(1/ε²) non-empty cells; call the sample C.
2. In a second pass, compute for each sampled cell C the weight w(C) = 1/w(GC), where GC is the (only) group intersecting C, and w(GC) is the number of cells GC intersects.
3. Output (z/η) · Σ_{C∈C} w(C), where z is the number of non-empty cells in G.

Gives a (1 + ε)-approximation of robust F0 using Õ(1/ε²) bits of space and 2 passes.
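
A sketch of Simple Sampling for 2D points, under the simplifying assumption that each point's group id is given (the real algorithm infers a cell's group from its neighborhood); all helper names are hypothetical:

```python
import math
import random
from collections import defaultdict

def cell_of(p, side, shift):
    """Cell of point p in a randomly shifted grid with the given side length."""
    return (math.floor((p[0] + shift[0]) / side),
            math.floor((p[1] + shift[1]) / side))

def simple_sampling_f0(points, group_ids, alpha, eta=200, seed=1):
    rng = random.Random(seed)
    side = alpha / 2
    shift = (rng.uniform(0, side), rng.uniform(0, side))

    # Pass 1: collect the non-empty cells and sample eta of them.
    nonempty = {cell_of(p, side, shift) for p in points}
    z = len(nonempty)
    sample = rng.sample(sorted(nonempty), min(eta, z))

    # Pass 2: w(C) = 1 / (#cells touched by the group intersecting C).  With
    # side alpha/2 and beta > 2*alpha, each cell meets exactly one group.
    cells_of_group = defaultdict(set)
    group_of_cell = {}
    for p, g in zip(points, group_ids):
        c = cell_of(p, side, shift)
        cells_of_group[g].add(c)
        group_of_cell[c] = g
    total = sum(1 / len(cells_of_group[group_of_cell[c]]) for c in sample)

    # Each group contributes w(G) cells of weight 1/w(G) each, so the sum over
    # all non-empty cells equals the number of groups; rescale the sample.
    return (z / len(sample)) * total
```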

SLIDE 15

Bucket sampling

  • Cannot sample cells early: most sampled cells will be empty and thus useless for the estimation.
  • Cannot sample late: cannot obtain the “neighborhood” information needed to compute w(C) for a sampled cell C.

What to do? We sample a collection of cells implicitly, but only maintain the neighborhood information for “non-empty” sampled cells.

Maintain the collection using a hash function h: the sample is all cells C with h(C) = 1. Maintain h s.t. |{C | h(C) = 1 ∧ ∃p ∈ S, d(p, C) ≤ α}| = O(1/ε²).
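
A rough sketch of the implicit-sampling idea (my own simplification, not the paper's exact algorithm): cells are “sampled” by a hash at a geometrically decreasing rate, and we keep one point per non-empty cell adjacent to a sampled cell; when too much is stored, the sampling rate is halved.

```python
import random
from collections import defaultdict

class BucketSampler:
    """One-pass bucket sampling sketch (illustrative; names are hypothetical).

    Cell C is in the sample iff hash(C) == 0 mod 2**level, i.e. h(C) = 1 with
    probability 2**-level, and the samples are nested across levels.  For each
    sampled cell we keep one point per non-empty neighboring cell (a one-ring
    stand-in for 'within distance alpha'): the info needed to compute w(C).
    """
    def __init__(self, side, budget=400, seed=0):
        self.side, self.budget, self.seed = side, budget, seed
        self.level = 0
        self.kept = defaultdict(dict)   # sampled cell -> {neighbor cell: one point}

    def _cell(self, p):
        return (int(p[0] // self.side), int(p[1] // self.side))

    def _sampled(self, cell):
        return hash((self.seed, cell)) % (2 ** self.level) == 0

    def update(self, p):
        c = self._cell(p)
        # p contributes neighborhood info to every sampled cell adjacent to c.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                s = (c[0] + dx, c[1] + dy)
                if self._sampled(s):
                    self.kept[s].setdefault(c, p)
        # Too many cells touched: subsample harder and drop what leaves.
        while len(self.kept) > self.budget:
            self.level += 1
            self.kept = defaultdict(dict, {cell: nbrs
                                           for cell, nbrs in self.kept.items()
                                           if self._sampled(cell)})
```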

SLIDE 16

Bucket sampling (cont.)

[Figure: groups G1, G2, G3 on the grid; a sampled cell stores one point of each non-empty neighboring cell, used to compute the weight of the sampled cell]

For a well-shaped dataset, we can get a (1 + ε)-approximation of robust F0 using Õ(1/ε²) bits of space and Õ(1) time per item.

SLIDE 17

General datasets

For general datasets, we introduce F0-ambiguity: the F0-ambiguity of S is the minimum δ s.t. there exists T ⊆ S such that
  • S\T is well-shaped
  • F0(S\T) ≥ (1 − δ)F0(S)

Unfortunately, approximating δ is hard: we cannot differentiate between δ = 0 and δ = 1/2 without Ω(m) space, by a reduction from the Diameter problem.

However, we can still guarantee the following even without knowing the value of δ: for a dataset with F0-ambiguity δ, we can get a (1 + O(ε + δ))-approximation of robust F0 using Õ(1/ε²) bits.

SLIDE 18

Generalization

We say h is ρ-smart on a well-shaped dataset S and its natural group partition if it satisfies:

  • Small “image radius”: each group is adjacent to (at most) ρ hash buckets on average.
    – We say a group G is adjacent to a hash bucket B if ∃p, q ∈ S s.t. p ∈ G, h(q) = B and d(p, q) ≤ α.
  • No false positives: items from different groups are hashed into disjoint buckets.

This is all we really need in the analysis of Random Grid + 2D Euclidean space.

SLIDE 19

Locality sensitive hashing (LSH)

We say a hash family H is (ℓ, u, p1, p2)-sensitive if for any two items p, q:
1. if d(p, q) ≤ ℓ, then Pr_{h∈H}[h(p) = h(q)] ≥ p1;
2. if d(p, q) ≥ u, then Pr_{h∈H}[h(p) = h(q)] ≤ p2.

A hash function h is called η-concentrated on a (well-shaped) S if for any G ∈ G, |{h(x) | ∃y ∈ G s.t. d(x, y) ≤ α}| ≤ η.
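
A concrete instance of the definition (an assumed example, not from the slides): the random-hyperplane (SimHash) family for angular distance; its collision probability 1 − θ/π matches the sensitivity claimed for Random Projection LSH on the next slide.

```python
import math
import random

def simhash_family(dim, seed):
    """One random-hyperplane hash: h(x) = sign(<r, x>) with Gaussian r.

    For vectors at angle theta, Pr[h(p) = h(q)] = 1 - theta/pi, so the family
    is (l, u, 1 - l/pi, 1 - u/pi)-sensitive w.r.t. angular distance.
    """
    rng = random.Random(seed)
    r = [rng.gauss(0, 1) for _ in range(dim)]
    return lambda x: 1 if sum(ri * xi for ri, xi in zip(r, x)) >= 0 else 0

# Empirical check of the collision probability at angle 0.3 rad.
p, q = [1.0, 0.0], [math.cos(0.3), math.sin(0.3)]
hits = sum(simhash_family(2, s)(p) == simhash_family(2, s)(q)
           for s in range(20000))
print(hits / 20000, 1 - 0.3 / math.pi)   # both close to 0.905
```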

SLIDE 20

The connections

S: an (α, β)-sparse (β > 2α) dataset, |S| = m.
H: a (2α, β, p1, p2)-sensitive LSH family that is η-concentrated on S.
F: the k-fold hash family of H; let f be a random member of F.
Then f is 100(η(1 − p1) + p1)k-smart on S w.pr. 0.99 − m²p2^k.

Gaussian LSH for Euclidean distance: (α, β, p(α), p(β))-sensitive and O(1)-concentrated ⇒ O(1)-smart when β/α ≥ log m.
Random Projection LSH for Cosine distance: (α, β, 1 − α/π, 1 − β/π)-sensitive and O(1)-concentrated ⇒ O(1)-smart when α ≤ 1/log m and Ω(1) ≤ β < π.

Not every LSH can be made ρ-smart, e.g., Min-Hash.
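
For completeness, the k-fold family used above is plain concatenation; a two-line sketch (assuming the simhash_family helper from the previous example):

```python
def k_fold(make_h, k, seed=0):
    """f(x) = (h1(x), ..., hk(x)) with independent draws from the base family:
    near pairs still collide w.pr. >= p1**k, far pairs only w.pr. <= p2**k."""
    hs = [make_h(1000 * seed + i) for i in range(k)]
    return lambda x: tuple(h(x) for h in hs)

f = k_fold(lambda s: simhash_family(2, s), k=10)
```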

SLIDE 21

Experiments

Dataset: 4,000,000 images from ImageNet. Experiments run on a desktop PC with 8GB of RAM and a 4-core 3.40GHz Intel i7 CPU.

I500k100x5d means the dataset consists of
– 500k images,
– each with 100 near-duplicates on average,
– mapped into points in 5-dim Euclidean space (the feature space).

SLIDE 22

Correctness (known α)

[Figure: accuracy results when α is known]

SLIDE 23

Correctness (unknown α)

Dataset: I500k100x5d. Compared algorithms:
– Baseline (greedy algorithm): Θ(n) space
– Sketch (our algorithm): Õ(1/ε²) space
– CellCount (streaming algorithm for comparison): Õ(1/ε²) space

[Figure: accuracy results when α is unknown]

SLIDE 24

Running time

[Figure: running-time results]

SLIDE 25

What if a metric X has no good LSH, e.g., edit distance?

Idea: embed X into another metric which has a good LSH.

SLIDE 26

Edit Similarity Join

  • Problem: given strings s1, . . . , sn over an alphabet Σ and a threshold K, edit similarity (self-)join outputs all pairs (si, sj) s.t. ED(si, sj) ≤ K.
  • A central problem in databases; studied extensively in the literature.

– Zhang, Z., KDD 2017

SLIDE 27

Previous work

Most existing approaches are signature-based, using different filtering methods. They fall short on long strings and relatively large thresholds.

In a recent string similarity search/join competition (SIGMOD Record 2014), it was reported that “an error rate (K/N) of 20% ∼ 25% pushes today’s techniques to the limit”.

In our experiments, the previous best algorithms could not finish within 10 hours on a collection of 20,000 DNA sequences, each of length 20,000, at a 1% error rate.

However, long strings and large thresholds are critical to applications in bioinformatics; the edit distances of human DNAs are mostly in the range of 1% ∼ 10%.

SLIDE 28

Our work

We propose an algorithm called EmbedJoin. EmbedJoin scales well up to an error threshold of 20%, which is far beyond the reach of existing algorithms.

SLIDE 29

Our approach

We first try to embed edit distance into some “easier” space for which an LSH exists. Previously, the best embedding from ED to ℓ1 incurred a distortion of 2^{O(√(log n log log n))} (Ostrovsky, Rabani; JACM’07). Recently, Chakraborty, Goldenberg and Koucky (STOC’16) gave a (weak) embedding from ED to Hamming with distortion O(K); call this the CGK embedding.

We basically do: CGK + LSH (for Hamming). This finds a set of candidate pairs (i, j) with ED(si, sj) ≤ K in a streaming fashion. A further verification step removes all false positives.

SLIDE 30

Our main tool – the CGK embedding

The CGK embedding is parameterized by a random string r ∈ {0, 1}^{6n} and maps f : s ∈ {0, 1}^n → s′ ∈ {0, 1}^{3n}. Two counters i and j are both initialized to 1. For j = 1, 2, . . .:
1. s′[j] ← s[i].
2. If r[(2j − 1) + s[i]] = 1, then i ← i + 1. Stop when i = n + 1.
3. j ← j + 1.

[Figure: s being copied to s′, with the random bits of r deciding when the pointer i advances]

Property: if ed(s, t) = k, then k/2 ≤ HAM(f(s), f(t)) ≤ O(k²) w.pr. 0.99.
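
A direct transcription of the embedding into code (binary alphabet; the zero-padding once i passes n is my assumption about the output-length convention):

```python
import random

def cgk_embed(s, r):
    """CGK embedding of a 0/1 list s, following the steps on the slide:
    copy s[i] to the output and advance i iff the random bit
    r[(2j - 1) + s[i]] is 1 (the slide's indices are 1-based)."""
    n = len(s)
    out = []
    i = 0                                    # 0-based version of the slide's i = 1
    for j in range(1, 3 * n + 1):
        if i < n:
            out.append(s[i])
            if r[(2 * j - 1) + s[i] - 1]:    # -1 converts the 1-based r-index
                i += 1
        else:
            out.append(0)                    # padding convention (assumption)
    return out

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)
s = [rng.randint(0, 1) for _ in range(200)]
t = list(s); t[50] = 1 - t[50]; del t[120]; t.append(rng.randint(0, 1))  # ed(s,t) <= 3
r = [rng.randint(0, 1) for _ in range(6 * len(s))]
print(hamming(cgk_embed(s, r), cgk_embed(t, r)))   # small, as the Property predicts
```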

SLIDE 31

CGK as a random walk

Run the CGK embedding on two strings s and t with the same random string r; let p and q be the pointers into s and t, respectively.

[Figure: CGK applied to s and t in parallel, producing s′ and t′]

The shift (p − q) is a random walk on the line.

SLIDE 32

CGK as a random walk (cont.)

[Figure: the optimal matching (blue) between s and t; the current shift is p − q = 2]

The blue area belongs to the optimal matching. The shift is p − q = 2. If the random walk w.r.t. the shift goes left by 2 steps, (p, q) will hit one of the blue (matching) edges, and the embedding will be “synchronized” afterwards (i.e., the bits we write to s′ and t′ will be the same at every step).

SLIDE 33

Observations

If ed(s, t) = k, then k/2 ≤ HAM(f(s), f(t)) ≤ O(k²) w.pr. 0.99.

The upper bound O(k²) comes from the fact that a simple random walk on the integer line, starting from the origin, hits position k within O(k²) steps w.pr. 0.99.

Observation 1: as long as the gap is preserved, a slightly larger distortion is not a problem for the LSH step.

Observation 2: running CGK multiple times and taking the min helps in practice.
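
Putting the pieces together, a toy EmbedJoin-style pipeline (a simplification with assumed parameters, not the paper's algorithm): repeat CGK several times per Observation 2, bucket the embeddings by a bit-sampling LSH for Hamming distance, then verify candidates exactly. Assumes the cgk_embed sketch above and binary strings padded to a common length.

```python
import random
from itertools import combinations

def edit_distance(a, b):
    """Standard O(|a||b|) dynamic program; used only for final verification."""
    m, n = len(a), len(b)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (a[i - 1] != b[j - 1]))
    return d[n]

def embed_join(strings, K, reps=8, lsh_bits=24, seed=0):
    rng = random.Random(seed)
    n = max(len(s) for s in strings)
    padded = [s + [0] * (n - len(s)) for s in strings]    # simplifying assumption
    candidates = set()
    for _ in range(reps):                                  # Observation 2
        r = [rng.randint(0, 1) for _ in range(6 * n)]
        positions = rng.sample(range(3 * n), lsh_bits)     # bit-sampling LSH
        buckets = {}
        for idx, s in enumerate(padded):
            emb = cgk_embed(s, r)
            buckets.setdefault(tuple(emb[p] for p in positions), []).append(idx)
        for group in buckets.values():
            candidates.update(combinations(group, 2))
    # Verification removes all false positives.
    return sorted((i, j) for i, j in candidates
                  if edit_distance(strings[i], strings[j]) <= K)
```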

SLIDE 34

Experimental results – datasets

  • UNIREF: a dataset of UniRef90 protein sequence data from the UniProt project
  • TREC: a dataset of references from Medline, consisting of titles and abstracts from 270 medical journals
  • GENXXX: datasets of human genome sequences of 50 individuals, obtained from the 1000 Genomes Project

SLIDE 35

Experimental results – tested algorithms

We compare our algorithm with the previous best algorithms reported in a recent experimental study [Jiang et al., VLDB’14]:

  • 1. EmbedJoin
  • 2. PassJoin [Li, Deng, Wang, Feng; PVLDB’11]: partition-based, uses the pigeonhole principle
  • 3. EDJoin [Xiao, Wang, Lin; PVLDB’08]: signature (q-gram) based, prefix filtering
  • 4. AdaptJoin [Wang, Li, Feng; SIGMOD’12]: improves prefix filtering by learning the tradeoff between the number of signatures and the filtering power
  • 5. QChunk [Qin et al.; SIGMOD’11]: improves prefix filtering by using q-chunks in place of q-grams

SLIDE 36

Accuracy of EmbedJoin

[Figure: accuracy results]

SLIDE 37

Running time comparisons

[Figure: four panels – UNIREF, vary K; UNIREF, vary n; GEN50kS, vary K; GEN50kS, vary n]

SLIDE 38

Space usage comparisons

[Figure: four panels – UNIREF, vary K; UNIREF, vary n; GEN50kS, vary K; GEN50kS, vary n]

SLIDE 39

The scalability of EmbedJoin

[Figure: three panels – GEN20kS, vary K/n; GEN20kL, vary K/n; GEN320kS, vary K/n]

SLIDE 40

What if we want to find all pairs (x, y) with ED(x, y) ≤ K exactly? Embedding + LSH does not work.

Solution: use sketches. (In streaming, we have to store a sketch for each string.)

SLIDE 41

Sketching Edit Distance

[Figure: x → sk(x), y → sk(y); applications to distributed similarity join and to streaming]

Our sketches can be constructed by one scan of the input.

– Belazzougui, Z., FOCS 2016

SLIDE 42

Previous and our results

The first sketching/streaming algorithm with poly(K, log n) size/space (more precisely, O(K⁸ log⁵ n)). The sketch can be constructed in Õ(n) time assuming K ≤ n^{0.1}. Given sk(x) and sk(y), we can compute the at most K edits transforming x into y in poly(K, log n) time.

Previously: an Ω(n) lower bound for linear sketches (Andoni, Goldberger, McGregor, Porat; STOC 2013).

SLIDE 43

The idea

We can view an alignment A between s and t as a non-crossing bipartite matching.

[Figure: matching edges between s and t; consecutive parallel edges form clusters]

A can be compressed by writing down all singletons and the starting/ending edges of each cluster; denote this by sk(A). Note: the size of sk(OPT) is only O(k log n).

Main idea: given alignments A1, . . . , Aρ, let I = ∩_{j∈[ρ]} Aj. If there exists an optimal alignment that goes through all edges in I, then we can obtain an optimal alignment using sk(A1), . . . , sk(Aρ).
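
A small sketch of the compression step, under my reading of the figure that a “cluster” is a maximal run of consecutive parallel edges (helper names are hypothetical):

```python
def sketch_alignment(edges):
    """Compress a non-crossing matching given as sorted (u, v) pairs: keep each
    cluster's first and last edge; a singleton cluster is kept as one edge."""
    sk, start = [], 0
    for idx in range(1, len(edges) + 1):
        run_ends = (idx == len(edges)
                    or edges[idx][0] != edges[idx - 1][0] + 1
                    or edges[idx][1] != edges[idx - 1][1] + 1)
        if run_ends:
            first, last = edges[start], edges[idx - 1]
            sk.append((first, last) if first != last else (first,))
            start = idx
    return sk

# Matching for s = "abXcdef" vs t = "abcdefY": the shift changes after the edit,
# so there are two clusters and sk keeps 4 edges however long the strings are.
edges = [(0, 0), (1, 1), (3, 2), (4, 3), (5, 4), (6, 5)]
print(sketch_alignment(edges))   # [((0, 0), (1, 1)), ((3, 2), (6, 5))]
```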

SLIDE 44

The idea (cont.)

The CGK embedding naturally gives an alignment.

[Figure: CGK run on s and t with pointers p and q]

The random walk state sequence ((p1, q1), (p2, q2), . . .) contains an alignment A. A can be constructed greedily, and sk(A) has size poly(K, log n).

Key lemma: if we take ρ = poly(K, log n) random walks, giving alignments A1, . . . , Aρ, then there is an optimal alignment that contains I = ∩_{j∈[ρ]} Aj.

SLIDE 45

The idea (cont.)

sk(Ai) corresponds to the differences between s′ and t′ in the Hamming space, which we know how to find. Additional structures are needed for the reverse mapping (Hamming space → edit space) to recover all the edits.

SLIDE 46

The idea (cont.)

Key lemma: if we take ρ = poly(K, log n) random walks, giving alignments A1, . . . , Aρ, then there is an optimal alignment that contains I = ∩_{j∈[ρ]} Aj.

  • Anchor: given ρ random walks generated according to the CGK embedding, we say a pair (u, v) is an anchor if s[u] = t[v] and (u, v) ∈ I.

Claim: w.pr. 1 − 1/n², there is an optimal alignment going through all anchors.

Proof idea: we focus on a “greedy” optimal matching O. Suppose, for contradiction, that O does not pass through an anchor (u, v). Then we can find a matching M in the left neighborhood of (u, v) which may “mislead” a random walk; that is, with non-trivial probability the random walk will “follow” M and consequently miss (u, v).

SLIDE 47

Conclusion and open problems

The motivation of this line of work: can we process noisy data (with near-duplicates) in the streaming model space/time-efficiently?

Linear sketches do not work; we propose a framework using LSHs with special properties.

For metrics that do not have good LSHs, embedding is a possible solution, but it introduces distortion.

For exact thresholds, we may need sketches.

Many open problems. General question: what can we do in the streaming model for a general metric, given a threshold for near-duplicates? A similar question can be asked for the distributed model.

slide-84
SLIDE 84

48-1

Thank you! Questions?