Edit Distance: Sketching, Streaming and Document Exchange
FOCS 2016
Oct. 9, 2016
Djamal Belazzougui (CERIST, Algeria), Qin Zhang (IU Bloomington)
Edit Distance
Definition: Given two strings s, t of length n: ed(s, t) = minimum number of ...
bioinformatics (measuring similarity between DNA sequences)
automatic spelling correction
App: remote file sync; file transmission through a noisy channel
App: distributed similarity join
K: distance threshold; n: input size. For simplicity, assume K < n^0.1
√log n under
almost linear encoding/decoding time for doc-exchange.
Note: Ω(n) lower bound for linear sketches (Andoni, Goldberger, McGregor, Porat, STOC’13).
f : s ∈ {0, 1}^n → s′ ∈ {0, 1}^{3n}. Two counters i and j, both initialized to 1. For j = 1, 2, . . . steps:
[Figure: input string s with pointer i and embedded string s′ with pointer j]
(Chakraborty, Goldenberg, Koucky, STOC’16; similar idea by Saha, FOCS’14)
If ed(s, t) = k, then k/2 ≤ ham(f(s), f(t)) ≤ O(k^2) w.pr. 0.99
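The step rule of f is not spelled out above. The following is a minimal Python sketch of the embedding as CGK is usually described, under these assumptions: at step j the output copies s[i], then i advances by a random bit r_j(s[i]); once i has run past the end of s, the output is padded with 0. The names `make_randomness`, `cgk_embed` and `hamming` are illustrative only, strings are lists of 0/1 integers, and the code is 0-indexed whereas the slide counts from 1.

```python
import random


def make_randomness(n, seed=0):
    """Return 3n pairs (r_j(0), r_j(1)); each entry is a random bit.

    r[j][b] says whether the input pointer advances at step j
    after reading bit b. (Illustrative helper, not from the slides.)
    """
    rng = random.Random(seed)
    return [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(3 * n)]


def cgk_embed(s, r, pad=0):
    """Embed a bit list s of length n into a bit list of length 3n."""
    n = len(s)
    out = []
    i = 0  # pointer into s
    for j in range(3 * n):  # j indexes the output
        if i < n:
            out.append(s[i])    # copy the current input bit
            i += r[j][s[i]]     # advance by 0 or 1, depending on the bit read
        else:
            out.append(pad)     # input exhausted: pad
    return out


def hamming(a, b):
    """Hamming distance between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))
```

Embedding s and t with the same `r` and comparing `hamming(cgk_embed(s, r), cgk_embed(t, r))` against ed(s, t) over many seeds is one way to observe the distortion claim empirically; the sketch does not prove it.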
[Figure: CGK maps s to s′ and t to t′; pointer p in s, pointer q in t, output index j]
App: remote file sync; file transmission through a noisy channel
common periodic substrings.
CGK (recall IMS gives O(K log n log(n/K)))
Call a walk step from state (p, q) a progress step if s[p] ≠ t[q] and one of these cases happens
Call a sequence of walk steps, from a state (p, q) where the next progress step happens to the first state (p′, q′) where ed(s[p′...n], t[q′...n]) = ed(s[p...n], t[q...n]) − 1, a progress phase
a progress phase ⇔ a pair of mismatching blocks
≤ K progress phases ⇒ ≤ K pairs of mismatching blocks
# random walk steps in a progress phase ⇔ size of the mismatching block
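To make the state sequence (p, q) concrete, here is a small Python sketch of the two CGK pointers driven by shared random bits. It rests on assumptions: both strings are lists of 0/1 integers, indices are 0-based, and the walk simply stops once either pointer leaves its string. The exact cases that define a progress step are not recoverable from the slide, so the helper below only reports steps at which the offset p − q changes; `coupled_walk` and `drift_change_steps` are illustrative names.

```python
import random


def coupled_walk(s, t, seed=0):
    """Return the states (p, q) visited by the CGK pointers on s and t.

    Both pointers use the same random pair (r(0), r(1)) at every step,
    so they move together whenever they read the same bit.
    """
    n = max(len(s), len(t))
    rng = random.Random(seed)
    p = q = 0
    states = []
    for _ in range(3 * n):
        if p >= len(s) or q >= len(t):
            break  # simplification: stop when one string is exhausted
        states.append((p, q))
        r = (rng.randint(0, 1), rng.randint(0, 1))
        p, q = p + r[s[p]], q + r[t[q]]
    return states


def drift_change_steps(states):
    """Indices k where p - q differs between states[k] and states[k + 1]."""
    return [
        k
        for k, ((p0, q0), (p1, q1)) in enumerate(zip(states, states[1:]))
        if (p1 - q1) != (p0 - q0)
    ]
```

Because the random bits are shared, the offset can only change at a state where the two pointers read different bits; on identical inputs the walk stays on the diagonal p = q.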
Whp, a progress phase “consumes” ≤ K^10 progress steps.
Can show that after properly removing long common periods, we get a progress step in ≤ K^50 random walk steps.
Recall our main idea: if we can find ≤ K pairs of blocks in s and t, each of size K^99, such that they contain all the edits, then IMS gives O(K log^2 K). (Other steps cost O(K log n).)
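One way to see where the O(K log^2 K) figure comes from, as a back-of-the-envelope sketch: assume IMS is run on an input formed from the ≤ K blocks, so its length N is at most K · K^99 = K^100 (this reading is an assumption, not stated on the slide), and substitute N for n in the IMS bound O(K log n log(n/K)).

```latex
\[
  O\!\left(K \log N \,\log\frac{N}{K}\right)
  \;\le\; O\!\left(K \cdot \log K^{100} \cdot \log K^{99}\right)
  \;=\; O\!\left(K \cdot 100\log K \cdot 99\log K\right)
  \;=\; O\!\left(K \log^{2} K\right).
\]
```

The constants 100 and 99 are absorbed by the O-notation, which is why the exact exponent of the block size does not matter as long as it is polynomial in K.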
App: distributed similarity join
[Figure: example binary strings s and t]
Note: size of sk(OPT) is only O(K log n).
j∈[ρ] Aj
The random walk state sequence ((p_1, q_1), (p_2, q_2), . . .) contains an alignment A, which can be constructed in a greedy way, and sk(A) has size poly(K, log n).
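The slides do not say how the greedy construction works. One plausible reading, sketched here purely as an assumption: scan the state sequence in order and keep a state (p, q) whenever s[p] = t[q] and both indices are strictly larger than those of the last kept pair, which yields a monotone matching between s and t. The function name `greedy_alignment` is hypothetical, indices are 0-based, and the state sequence is taken as given input.

```python
def greedy_alignment(s, t, states):
    """Extract a monotone matching from a sequence of walk states.

    s, t   : sequences of symbols
    states : iterable of (p, q) index pairs in visiting order
    Returns the kept pairs (p, q), strictly increasing in both coordinates.
    """
    matched = []
    last_p = last_q = -1
    for p, q in states:
        in_range = p < len(s) and q < len(t)
        if in_range and s[p] == t[q] and p > last_p and q > last_q:
            matched.append((p, q))  # keep this pair as a match
            last_p, last_q = p, q
    return matched
```

For example, with s = [1, 0, 1], t = [1, 0, 0, 1] and states [(0, 0), (1, 1), (1, 2), (2, 3)], the state (1, 2) is skipped because position 1 of s is already matched, and the remaining three states are kept.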
almost linear encoding/decoding time for document exchange.
to the information-theoretic optimal bound O(K log n) for all values K and n under (almost) linear running time?
Can we prove any non-trivial lower bounds?
(Now K^8 log^5 n. We believe that a more careful analysis of the same algorithm can reduce this to K^4 log^3 n or even K^3 log^2 n.)
j∈[ρ] Aj
CGK embedding, we say a pair (u, v) is an anchor if s[u] = t[v] and all the ρ random walks pass (u, v).
Claim: W.pr. 1 − 1/n^2, there is an optimal alignment going through all anchors.
Proof idea: We focus on a “greedy” optimal matching O. Suppose, on the contrary, that O does not pass an anchor (u, v). Then we can find a matching M in the left neighborhood of (u, v) which may “mislead” a random walk; that is, with non-trivial probability the random walk will “follow” M and consequently miss (u, v).
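The anchor definition can be written down directly. The sketch below is illustrative: it assumes each of the ρ random walks is already available as a list of visited states (u, v), 0-indexed, and that ρ ≥ 1; the function name `anchors` is hypothetical. It returns the pairs visited by every walk at which the two strings agree.

```python
def anchors(s, t, walks):
    """Return the sorted list of anchors for strings s and t.

    walks : non-empty list of state sequences, one per random walk;
            each state is a pair (u, v) of indices into s and t.
    A pair is an anchor if every walk visits it and s[u] == t[v].
    """
    common = set(walks[0])
    for walk in walks[1:]:
        common &= set(walk)  # keep only states shared by all walks so far
    return sorted(
        (u, v)
        for (u, v) in common
        if u < len(s) and v < len(t) and s[u] == t[v]
    )
```

With s = [1, 0, 1], t = [1, 1, 1] and the two walks [(0, 0), (1, 1), (2, 2)] and [(0, 0), (1, 0), (2, 1), (2, 2)], the shared states are (0, 0) and (2, 2), and both are anchors since the symbols match there.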