Edit Distance: Sketching, Streaming and Document Exchange

Djamal Belazzougui (CERIST, Algeria) and Qin Zhang (IU Bloomington). FOCS 2016, Oct. 9, 2016.


  1. Edit Distance: Sketching, Streaming and Document Exchange. Djamal Belazzougui (CERIST, Algeria), Qin Zhang (IU Bloomington). FOCS 2016, Oct. 9, 2016.

  5. Edit Distance. Definition: given two strings $s, t \in \Sigma^n$, $\mathrm{ed}(s, t)$ is the minimum number of character operations (insertion, deletion, substitution) that transform $s$ into $t$. Example: $\mathrm{ed}(\texttt{banana}, \texttt{ananas}) = 2$. Applications: numerous, e.g., bioinformatics (measuring similarity between DNA sequences) and automatic spelling correction.
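The definition above can be checked with the textbook dynamic program. This is a minimal sketch for illustration, not the algorithm of the talk; the function name `edit_distance` is a hypothetical choice.

```python
def edit_distance(s, t):
    """Textbook DP: O(len(s) * len(t)) time, O(len(t)) space."""
    # prev[j] holds ed(s[:i-1], t[:j]); cur[j] holds ed(s[:i], t[:j]).
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]  # ed(s[:i], "") = i deletions
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a from s
                cur[j - 1] + 1,          # insert b into s
                prev[j - 1] + (a != b),  # substitute, or free match
            ))
        prev = cur
    return prev[-1]


print(edit_distance("banana", "ananas"))  # 2, as in the slide's example
```

The quadratic running time of this baseline is what motivates the sketching and streaming questions in the rest of the deck.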

  9. Problems. The threshold version of ED: given two strings $s, t \in \{0,1\}^n$ and a threshold $K$, output all the edits if $\mathrm{ed}(s, t) \le K$, and output “Error” otherwise. Models/Problems:
     • Document exchange: the party holding $s$ sends a sketch $\mathrm{sk}(s)$ to the party holding $t$. App: remote file sync; file transmission through a noisy channel.
     • Sketching: $\mathrm{sk}(s)$ and $\mathrm{sk}(t)$ are computed separately and then combined. App: distributed similarity join.
     • Streaming: $s$ and $t$ are read as a stream by a CPU with limited RAM.
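To make the output contract of the threshold version concrete, here is a minimal offline sketch: it computes the full DP table, returns "Error" if the distance exceeds K, and otherwise backtracks to list the edits. It uses linear memory per row and sees both strings in full, so it is only a reference for what the sketching and streaming algorithms must output, not how they work. The name `threshold_edits` and the edit tuple format are assumptions made for this example.

```python
def threshold_edits(s, t, K):
    """Return a list of edits turning s into t if ed(s, t) <= K, else "Error"."""
    n, m = len(s), len(t)
    # d[i][j] = ed(s[:i], t[:j])
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
    if d[n][m] > K:
        return "Error"

    # Backtrack from (n, m) to (0, 0), recording one optimal edit script.
    edits = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i - 1] == t[j - 1] and d[i][j] == d[i - 1][j - 1]:
            i, j = i - 1, j - 1                           # match, no edit
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            edits.append(("substitute", i - 1, t[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            edits.append(("delete", i - 1, s[i - 1]))
            i -= 1
        else:
            edits.append(("insert", i, t[j - 1]))
            j -= 1
    edits.reverse()
    return edits


print(threshold_edits("1011", "1111", 1))  # [('substitute', 1, '1')]
print(threshold_edits("banana", "ananas", 1))  # Error
```

Positions in the edit tuples are 0-based indices into s; other conventions would work equally well.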

  11. Previous and our results. $K$: distance threshold; $n$: input size. For simplicity, assume $K < n^{0.1}$. Previous work highlighted on the slide: the IMS scheme.
      • Information-theoretically optimal communication for $K \le 2^{\sqrt{\log n}}$, under almost linear encoding/decoding time, for document exchange.
      • First sketching/streaming algorithm with $\mathrm{poly}(K, \log n)$ size/space. Note: $\Omega(n)$ lower bound for linear sketches (Andoni, Goldberger, McGregor, Porat, STOC’13).

  12. Main Tool: CGK Embedding

  14. Our main tool: the CGK embedding (Chakraborty, Goldenberg, Koucky, STOC’16; similar idea by Saha, FOCS’14). $f : s \in \{0,1\}^n \to s' \in \{0,1\}^{3n}$. Two counters $i$ and $j$ are both initialized to 1. For $j = 1, 2, \ldots$ steps:
      1. $s'[j] \leftarrow s[i]$.
      2. Flip a coin; if heads, then $i \leftarrow i + 1$. Stop when $i = n + 1$.
      3. $j \leftarrow j + 1$.
      (Figure: $s$ with pointer $i$ and $s'$ with pointer $j$.)
      Property: if $\mathrm{ed}(s, t) = k$, then $k/2 \le \mathrm{ham}(f(s), f(t)) \le O(k^2)$ with probability 0.99.
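The three steps above can be sketched in a few lines of Python. Two details go beyond what the slide states and are assumptions here: the coin at step j is taken from shared randomness indexed by (j, current bit), which is how I understand the CGK construction to make the embeddings of two strings comparable, and the output is padded with zeros up to length 3n once s is consumed. The names `make_coins`, `cgk_embed` and `hamming` are hypothetical.

```python
import random


def make_coins(n, seed=0):
    """Shared randomness: for each of the 3n steps, one coin per input bit."""
    rng = random.Random(seed)
    return [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(3 * n)]


def cgk_embed(s, coins):
    """Map a list of bits s (length n) to a list of bits of length 3n."""
    n = len(s)
    out = []
    i = 0  # 0-based pointer into s (the slide uses 1-based counters)
    for j in range(3 * n):
        if i < n:
            out.append(s[i])       # step 1: s'[j] <- s[i]
            i += coins[j][s[i]]    # step 2: advance i on "heads"
        else:
            out.append(0)          # assumption: zero padding after s is consumed
    return out


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


s = [1, 0, 1, 1, 0, 1, 1, 1]
t = [1, 1, 1, 1, 0, 1, 1, 1]   # one substitution away from s
coins = make_coins(len(s), seed=1)
print(hamming(cgk_embed(s, coins), cgk_embed(t, coins)))
```

Both strings must be embedded with the same `coins`; with independent coins the two outputs would drift apart even for identical inputs.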

  16. CGK as a random walk. CGK → a random walk on two strings. (Figure: $s$ and $t$ are embedded by CGK into $s'$ and $t'$; pointer $p$ moves on $s$, pointer $q$ moves on $t$, and $j$ is the common output position.) The shift $(p - q)$ is a random walk on the line.
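The random-walk view can be traced directly by running both pointers with shared coins and recording the shift p − q after each step. This is a minimal sketch under the same assumption as in the CGK construction as I understand it: the coin at each step depends on the step index and on the bit currently read, so the shift can only change when s[p] and t[q] differ. The function name `shift_walk` and the coin layout are hypothetical.

```python
import random


def shift_walk(s, t, coins):
    """Return the list of shifts p - q after each joint step of the two pointers.

    coins[j] = (heads_if_0, heads_if_1): whether a pointer reading bit 0
    (resp. bit 1) advances at step j. The walk stops once either string
    is consumed or the coins run out.
    """
    p = q = 0
    shifts = []
    for heads_if_0, heads_if_1 in coins:
        if p >= len(s) or q >= len(t):
            break
        step_p = heads_if_1 if s[p] else heads_if_0
        step_q = heads_if_1 if t[q] else heads_if_0
        p += step_p
        q += step_q
        shifts.append(p - q)
    return shifts


rng = random.Random(0)
s = [1, 0, 1, 1, 0, 1, 1, 1]
t = [1, 1, 1, 1, 1, 1, 1]
coins = [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(3 * len(s))]
print(shift_walk(s, t, coins))
```

When the two pointers read the same bit they use the same coin and move together, so the shift stays put; it moves by at most one only on steps where the bits differ.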

  17. Document Exchange. The party holding $s$ sends $\mathrm{sk}(s)$ to the party holding $t$. App: remote file sync; file transmission through a noisy channel. Warning: I will cheat in multiple places.

  22. Technique overview: document exchange.
      Main idea: if we can find $\le K$ pairs of blocks in $s$ and $t$, each of size $K^{99}$, such that they contain all the edits, then IMS gives $O(K \log^2 K)$. (Recall that IMS gives $O(K \log n \log(n/K))$ in general.)
      Question: if such pairs exist, how do we identify them? CGK (edit space → Hamming space) + random partition into blocks + error-correcting code for Hamming distance w.r.t. blocks + reverse mapping.
      Challenge: the $O(K^2)$ errors after the CGK embedding can be spread over $O(K^2)$ pairs of blocks. This may introduce a factor of $K^2$ in the communication of the error-correcting step, which is too much.
      • We can reduce $O(K^2)$ pairs to $O(K)$ by removing long common periodic substrings.
      • Not easy: everything has to be done using one-way communication!
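One way to read the $O(K \log^2 K)$ bound in the main idea is as a direct substitution into the general IMS bound. The derivation below is a sketch under an assumption not stated on the slide: that IMS is simply run on the at most $K$ identified blocks, whose total length $n'$ is at most $K \cdot K^{99} = K^{100}$.

```latex
\[
  n' \le K \cdot K^{99} = K^{100}
  \quad\Longrightarrow\quad
  K \log n' \, \log\frac{n'}{K}
  \;\le\; K \cdot (100 \log K) \cdot (99 \log K)
  \;=\; O\!\left(K \log^2 K\right).
\]
```

The exponent 99 only affects the hidden constant, which is why any fixed polynomial block size would give the same asymptotic bound.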

  26. Technique overview: document exchange (cont.). (Figure: the CGK walk on $s$ and $t$ with pointers $p$ and $q$ and common output position $j$.)
      Call a walk step from state $(p, q)$ a progress step if $s[p] \ne t[q]$ and one of these cases happens (the cases appear only in the slide figure).
      Call a sequence of walk steps, from the state $(p, q)$ where the next progress step happens to the first state $(p', q')$ where $\mathrm{ed}(s[p' \ldots n], t[q' \ldots n]) = \mathrm{ed}(s[p \ldots n], t[q \ldots n]) - 1$, a progress phase.
      • A progress phase ⇔ a pair of mismatching blocks.
      • $\le K$ progress phases ⇒ $\le K$ pairs of mismatching blocks.
      • Number of random walk steps in a progress phase ⇔ size of the mismatching block.

  28. Technique overview: document exchange (cont.). Call a sequence of walk steps, from the state $(p, q)$ where a (the next) progress step happens to the first state $(p', q')$ where $\mathrm{ed}(s[p' \ldots n], t[q' \ldots n]) = \mathrm{ed}(s[p \ldots n], t[q \ldots n]) - 1$, a progress phase.
      • $\le K$ progress phases ⇒ $\le K$ pairs of mismatching blocks.
      • Number of random walk steps in a progress phase ⇔ size of the mismatching block.
      • With high probability, a progress phase “consumes” $\le K^{10}$ progress steps.
      • One can show that, after properly removing long common periods, we get a progress step within $\le K^{50}$ random walk steps.
