Edit Distance: Sketching, Streaming and Document Exchange

Djamal Belazzougui (CERIST, Algeria) and Qin Zhang (IU Bloomington)
FOCS 2016, Oct. 9, 2016

Edit Distance
Definition: Given two strings $s, t \in \Sigma^n$, $\mathrm{ed}(s, t)$ is the minimum number of character operations (insertion, deletion, substitution) that transform $s$ into $t$.
Example: $\mathrm{ed}(\texttt{banana}, \texttt{ananas}) = 2$.
Applications: numerous, e.g., bioinformatics (measuring similarity between DNA sequences) and automatic spelling correction.
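The definition can be checked with the textbook quadratic-time dynamic program. This is a minimal sketch for illustration only, not the algorithm from the talk; the function name `edit_distance` is hypothetical.

```python
def edit_distance(s: str, t: str) -> int:
    """Textbook dynamic program; prev[j] holds ed(s[:i-1], t[:j])."""
    prev = list(range(len(t) + 1))          # ed("", t[:j]) = j insertions
    for i, a in enumerate(s, start=1):
        cur = [i]                           # ed(s[:i], "") = i deletions
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,               # delete a from s
                           cur[j - 1] + 1,            # insert b into s
                           prev[j - 1] + (a != b)))   # substitute, or match for free
        prev = cur
    return prev[-1]

print(edit_distance("banana", "ananas"))  # 2: delete the leading "b", append "s"
```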

Problems
The threshold version of ED: Given two strings $s, t \in \{0,1\}^n$ and a threshold $K$, output all the edits if $\mathrm{ed}(s, t) \le K$; output “Error” otherwise.
Models/Problems:
• Document exchange: the holder of $s$ sends a sketch $\mathrm{sk}(s)$ to the holder of $t$. App: remote file sync; file transmission through a noisy channel.
• Sketching: each string is summarized separately, as $\mathrm{sk}(s)$ and $\mathrm{sk}(t)$. App: distributed similarity join.
• Streaming: $s$ and $t$ stream past a CPU with limited RAM.
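As a reference point for the threshold version, here is a minimal offline sketch that sees both strings in full and uses quadratic time and space. It only illustrates the required input/output behavior and is not the sketching, streaming, or document-exchange algorithm of the talk. The name `edits_within` and the edit-tuple format are assumptions made for this example.

```python
def edits_within(s: str, t: str, K: int):
    """Return a list of edits turning s into t if ed(s, t) <= K, else "Error".

    Each edit is (operation, position in s, character); this tuple format
    is a hypothetical choice for the example.
    """
    n, m = len(s), len(t)
    # d[i][j] = ed(s[:i], t[:j])
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
    if d[n][m] > K:
        return "Error"
    # Walk back from (n, m) to (0, 0) along one optimal path.
    edits, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            if s[i - 1] != t[j - 1]:
                edits.append(("substitute", i - 1, t[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            edits.append(("delete", i - 1, s[i - 1]))
            i -= 1
        else:
            edits.append(("insert", i, t[j - 1]))
            j -= 1
    return edits[::-1]
```

For example, `edits_within("0110", "0100", 1)` reports a single substitution, while `edits_within("0000", "1111", 2)` returns `"Error"` because the distance is 4.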

Previous and our results
$K$: distance threshold; $n$: input size. For simplicity, assume $K < n^{0.1}$.
Previous work highlighted on the slide: the IMS scheme.
• Information-theoretically optimal communication for $K \le 2^{\sqrt{\log n}}$, under almost linear encoding/decoding time, for document exchange.
• First sketching/streaming algorithm with $\mathrm{poly}(K, \log n)$ size/space.
Note: $\Omega(n)$ lower bound for linear sketches (Andoni, Goldberger, McGregor, Porat, STOC’13).

Main Tool: CGK Embedding

Our main tool: CGK embedding
The CGK embedding (Chakraborty, Goldenberg, Koucky, STOC’16; similar idea by Saha, FOCS’14)
$f : s \in \{0,1\}^n \to s' \in \{0,1\}^{3n}$.
Two counters $i$ and $j$, both initialized to 1. For $j = 1, 2, \ldots$ steps:
1. $s'[j] \leftarrow s[i]$.
2. Flip a coin; if heads, then $i \leftarrow i + 1$. Stop when $i = n + 1$.
3. $j \leftarrow j + 1$.
[Figure: pointer $i$ into $s$ and pointer $j$ into $s'$.]
Property: If $\mathrm{ed}(s, t) = k$, then $k/2 \le \mathrm{ham}(f(s), f(t)) \le O(k^2)$ with probability 0.99.
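The three steps above can be sketched as follows. Two details are assumptions that the slide does not state: the coin at step $j$ is shared between the two strings and may depend on the bit being copied (as in the CGK paper), and the output is padded with zeros up to length $3n$ once $s$ is exhausted. The names `shared_coins`, `cgk_embed`, and `hamming` are hypothetical.

```python
import random

def shared_coins(n: int, seed: int):
    """coins[j][b] is the coin used at step j when the copied bit is b.

    Both strings must be embedded with the same coins (assumption taken
    from the CGK paper; the slide only says "flip a coin").
    """
    rng = random.Random(seed)
    return [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(3 * n)]

def cgk_embed(s, coins):
    """Map a list of n bits to a list of 3n bits."""
    n = len(s)
    out, i = [], 0                    # i is 0-based here; the slide counts from 1
    for j in range(3 * n):
        if i < n:
            out.append(s[i])          # step 1: s'[j] <- s[i]
            i += coins[j][s[i]]       # step 2: advance i on "heads"
        else:
            out.append(0)             # assumed zero padding after s is consumed
    return out

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

s = [1, 0, 1, 1, 0, 1, 1, 1]
t = [1, 1, 1, 1, 0, 1, 1, 1]          # one substitution away from s
coins = shared_coins(len(s), seed=7)
print(hamming(cgk_embed(s, coins), cgk_embed(t, coins)))
```

Sharing the coins is what makes identical inputs map to identical outputs, so the Hamming distance of the embeddings reflects only the differences between the strings.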

CGK as a random walk
CGK → a random walk on two strings.
[Figure: CGK maps $s$ to $s'$ and $t$ to $t'$; pointer $p$ into $s$, pointer $q$ into $t$, and a common pointer $j$ into $s'$ and $t'$.]
The shift $(p - q)$ is a random walk on the line.
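A small simulation makes the walk visible. It assumes, as in the CGK paper, that at each step the two strings use one shared coin per bit value: when $s[p] = t[q]$ both pointers make the same move and the shift is unchanged, and when $s[p] \ne t[q]$ the pointers move independently, so the shift changes by $-1$, $0$, or $+1$. The function name `shift_trace` is hypothetical.

```python
import random

def shift_trace(s, t, seed=0):
    """Return the successive shifts p - q while both pointers are inside their strings."""
    rng = random.Random(seed)
    p = q = 0
    trace = [0]
    while p < len(s) and q < len(t):
        # One shared coin for bit value 0 and one for bit value 1 at this step.
        coin = (rng.randint(0, 1), rng.randint(0, 1))
        # Both increments are computed from the old p and q.
        p, q = p + coin[s[p]], q + coin[t[q]]
        trace.append(p - q)
    return trace

print(shift_trace([1, 0, 1, 1, 0, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1], seed=1))
```

On identical strings the trace stays at 0, which matches the intuition that the walk only moves where the strings disagree.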

Document Exchange
[Figure: the holder of $s$ sends $\mathrm{sk}(s)$ to the holder of $t$.]
App: remote file sync; file transmission through a noisy channel.
Warning: I will cheat in multiple places.

Technique overview: document exchange
Main idea: If we can find $\le K$ pairs of blocks in $s$ and $t$, each of size $K^{99}$, such that they contain all the edits, then IMS gives $O(K \log^2 K)$. (Recall IMS gives $O(K \log n \log(n/K))$.)
Question: if they exist, how do we identify these pairs?
CGK (edit space → Hamming space) + random partition into blocks + error-correcting code for Hamming distance w.r.t. blocks + reverse mapping.
Challenge: the $O(K^2)$ errors after the CGK embedding can possibly be distributed over $O(K^2)$ pairs of blocks. This may introduce a factor of $K^2$ of communication in the error correction, which is too much.
• Can reduce $O(K^2)$ pairs to $O(K)$ by removing long common periodic substrings.
• Not easy: everything has to be done using one-way communication!
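One way to see where the $O(K \log^2 K)$ bound in the main idea comes from, assuming IMS is run on the $\le K$ block pairs only, so that the effective input length is $n' = K \cdot K^{99} = K^{100}$ (this reading is an assumption; the slide does not spell out the substitution):

```latex
\[
O\bigl(K \log n' \log(n'/K)\bigr)
  = O\bigl(K \cdot \log K^{100} \cdot \log K^{99}\bigr)
  = O\bigl(K \cdot 100 \log K \cdot 99 \log K\bigr)
  = O\bigl(K \log^2 K\bigr).
\]
```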

Technique overview: document exchange (cont.)
[Figure: CGK maps $s$ to $s'$ and $t$ to $t'$; pointer $p$ into $s$, pointer $q$ into $t$, and a common pointer $j$ into $s'$ and $t'$.]
Call a walk step from state $(p, q)$ a progress step if $s[p] \ne t[q]$ and one of these cases happens. [The cases were shown in a figure that is not preserved here.]
Call a sequence of walk steps, from the state $(p, q)$ where the next progress step happens to the first state $(p', q')$ where $\mathrm{ed}(s[p' \ldots n], t[q' \ldots n]) = \mathrm{ed}(s[p \ldots n], t[q \ldots n]) - 1$, a progress phase.
• A progress phase ⇔ a pair of mismatching blocks.
• $\le K$ progress phases ⇒ $\le K$ pairs of mismatching blocks.
• Number of random walk steps in a progress phase ⇔ size of the mismatching block.

Technique overview: document exchange (cont.)
Call a sequence of walk steps, from the state $(p, q)$ where a (the next) progress step happens to the first state $(p', q')$ where $\mathrm{ed}(s[p' \ldots n], t[q' \ldots n]) = \mathrm{ed}(s[p \ldots n], t[q \ldots n]) - 1$, a progress phase.
• $\le K$ progress phases ⇒ $\le K$ pairs of mismatching blocks.
• Number of random walk steps in a progress phase ⇔ size of the mismatching block.
W.h.p., a progress phase “consumes” $\le K^{10}$ progress steps.
Can show that, after properly removing long common periods, we get a progress step within $\le K^{50}$ random walk steps.
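One possible way to combine the two bounds, assuming that every progress step of a phase is reached within $K^{50}$ walk steps of the previous one (the slides do not state this combination explicitly, and the exponents are illustrative):

```latex
\[
\#\{\text{walk steps in a progress phase}\}
  \;\le\; \underbrace{K^{10}}_{\text{progress steps}} \cdot
          \underbrace{K^{50}}_{\text{walk steps per progress step}}
  \;=\; K^{60} \;\le\; K^{99}.
\]
```

Under that reading, each progress phase fits inside a block of size $K^{99}$, which is the block size used in the main idea.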
