Streaming and communication complexity of Hamming distance


  1. Streaming and communication complexity of Hamming distance. Tatiana Starikovskaya, IRIF, Université Paris-Diderot (joint work with Raphaël Clifford, ICALP’16)

  2. Approximate pattern matching
     Problem: given a pattern P of length n and a text T, find the Hamming distance between P and each n-length substring of T.

  3. Approximate pattern matching
     Problem: given a pattern P of length n and a text T, find the Hamming distance between P and each n-length substring of T.
     “Big Data” applications:
     ▸ Computational biology
     ▸ Signal processing
     ▸ Text retrieval
     Standard algorithms: Ω(n) space
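To make the problem statement concrete, here is a minimal offline baseline in Python. It is only an illustration of the definition, not an algorithm from the talk; the function names `hamming_distance` and `all_hamming_distances` are hypothetical. It keeps both P and T in memory, which is the kind of Ω(n)-space solution the slide refers to as standard.

```python
from typing import List


def hamming_distance(a: str, b: str) -> int:
    """Number of positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def all_hamming_distances(pattern: str, text: str) -> List[int]:
    """Hamming distance between the pattern and every n-length substring of the text."""
    n = len(pattern)
    return [
        hamming_distance(pattern, text[i:i + n])
        for i in range(len(text) - n + 1)
    ]


if __name__ == "__main__":
    # Windows of "abba" of length 2 are "ab", "bb", "ba".
    print(all_hamming_distances("ab", "abba"))
```

This naive version takes O(n) time per alignment and stores everything, so it serves only as a reference point for the space and time bounds discussed later.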

  4. Model of computation
     Problem: given a pattern P of length n and a text T, find the Hamming distance between P and each n-length substring of T.
     Model:
     ▸ T = stream of characters
     ▸ Length of the text and size of the universe are extremely large
     ▸ Can’t store a copy of T or P
     ▸ Space = total space used; Time = time per character of T
     Text T (first character arrives): c

  13. Model of computation
      Problem: given a pattern P of length n and a text T, find the Hamming distance between P and each n-length substring of T.
      Model:
      ▸ T = stream of characters
      ▸ Length of the text and size of the universe are extremely large
      ▸ Can’t store a copy of T or P
      ▸ Space = total space used; Time = time per character of T
      Text T (characters arrived so far): c a a b c a a a c a
      Pattern P (aligned with the most recent characters of T): b c a a a c
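The streaming interface can be sketched as a generator that consumes one character of T at a time and reports one distance per arrival. The sketch below is a baseline only: it buffers the last n characters in a `deque`, so it uses Θ(n) space, which is exactly what the model above rules out. The name `stream_hamming` is hypothetical, and the example strings are one reading of the animation frames on this slide (P = bcaaac, T = caabcaaaca), which is an assumption about the garbled extraction.

```python
from collections import deque
from typing import Iterable, Iterator, Optional


def stream_hamming(pattern: str, stream: Iterable[str]) -> Iterator[Optional[int]]:
    """Yield, after each arriving character, the Hamming distance between the
    pattern and the last n characters seen (None until n characters arrived)."""
    n = len(pattern)
    window = deque(maxlen=n)  # Theta(n) buffer: NOT allowed in the small-space model
    for ch in stream:
        window.append(ch)  # the oldest character is dropped automatically
        if len(window) < n:
            yield None
        else:
            yield sum(1 for p, t in zip(pattern, window) if p != t)


if __name__ == "__main__":
    for position, dist in enumerate(stream_hamming("bcaaac", "caabcaaaca"), start=1):
        print(position, dist)
```

The time per character here is O(n); the results quoted on the following slides improve on both the time and, for bounded or approximate distances, the space.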

  14. What is known: Hamming distance
      ▸ All distances
        ▸ Space Ω(n) [Folklore]
        ▸ Time O(log² n) [Clifford et al., CPM’11]

  15. What is known: Hamming distance
      ▸ All distances
        ▸ Space Ω(n) [Folklore]
        ▸ Time O(log² n) [Clifford et al., CPM’11]
      ▸ Only distances ≤ k [Clifford et al., SODA’16]
        ▸ Exact values: space O(k² polylog n), time O(√k log k + polylog n)
        ▸ (1 + ε)-approx.: space O(ε⁻² k² polylog n), time O(ε⁻² polylog n)

  16. This work: (1 + ε)-approximate HDs problem
      ▸ Lower bounds: reduction to a CC problem
      ▸ Upper bounds: show a streaming algorithm

  17. This work: (1 + ε)-approximate HDs problem
      ▸ Lower bounds: reduction to a CC problem
      ▸ Upper bounds: show a streaming algorithm
      Let's discuss that!

  18. Lower bound for all HDs, approximate
      Alice: a a a a a a
      Bob: b a a b a b
      Charlie: a a a a a a
      3-party CC problem:
      ▸ Alice holds the pattern, Bob holds T[1, n], Charlie holds T[n + 1, 2n]
      ▸ Charlie’s output: (1 + ε)-HD for each alignment of P and T
      What is the minimum communication between Alice, Bob, and Charlie?

  19. Lower bound for all HDs, approximate
      Alice: a a a a a a
      Bob: b a a b a b
      Charlie: a a a a a a
      ▸ Streaming algorithm: T = stream, not allowed to store a copy of P or T, output = (1 + ε)-HDs
      ▸ At time n it stores all the information needed to compute the (1 + ε)-HDs
      ▸ Comm. protocol: send this information from A and B to C
      ▸ Lower bound for the CC problem ⇒ streaming lower bound
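The reduction above can be illustrated with a toy simulation: whatever a streaming algorithm keeps in memory at time n is a valid message to Charlie, so the message length is bounded by the algorithm's space. The sketch below is an assumption-laden illustration, not the construction from the paper: `BufferingStreamer` is a hypothetical stand-in that simply buffers the last n characters and answers exactly, and the protocol is modelled as a chain Alice → Bob → Charlie. The strings are the ones shown on the slide.

```python
from typing import List, Optional, Tuple


class BufferingStreamer:
    """Hypothetical stand-in for a streaming algorithm (exact, Theta(n) space)."""

    def __init__(self, pattern: str, window: str = "") -> None:
        self.pattern = pattern
        self.window = window

    def feed(self, ch: str) -> Optional[int]:
        n = len(self.pattern)
        self.window = (self.window + ch)[-n:]  # keep only the last n characters
        if len(self.window) < n:
            return None
        return sum(p != t for p, t in zip(self.pattern, self.window))

    def snapshot(self) -> Tuple[str, str]:
        """Everything the algorithm holds in memory right now."""
        return (self.pattern, self.window)


def run_protocol(pattern: str, bob_half: str, charlie_half: str) -> Tuple[List[Optional[int]], int]:
    # Alice initialises the algorithm with P and passes its memory to Bob.
    alice = BufferingStreamer(pattern)
    bob = BufferingStreamer(*alice.snapshot())
    # Bob streams T[1, n] and passes the memory at time n to Charlie.
    for ch in bob_half:
        bob.feed(ch)
    message = bob.snapshot()
    # Charlie resumes from the snapshot and streams T[n + 1, 2n].
    charlie = BufferingStreamer(*message)
    outputs = [charlie.feed(ch) for ch in charlie_half]
    message_chars = sum(len(part) for part in message)
    return outputs, message_chars


if __name__ == "__main__":
    print(run_protocol("aaaaaa", "baabab", "aaaaaa"))
```

Because any streaming algorithm yields a protocol this way, a lower bound on the communication of the 3-party problem is also a lower bound on streaming space.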

  21. This work: (1 + ε)-approximate HDs problem
      ▸ Lower bounds: reduction to a CC problem
      ▸ Upper bounds: show a streaming algorithm
      Two CC problems:
      ▸ Simpler CC problem: B and C know the pattern
      ▸ 3-party CC problem

  22. This work: (1 + ε)-approximate HDs problem
      ▸ Lower bounds: reduction to a CC problem
      ▸ Upper bounds: show a streaming algorithm
      Two CC problems, with upper bounds for each:
      ▸ Simpler CC problem: B and C know the pattern
      ▸ 3-party CC problem

  25. Communication complexity

  26. Simpler CC problem: B and C know the pattern
      Lower bound: Ω(ε⁻¹ log² ε⁻¹n)
      Bob: b a a b a b
      Charlie: a a a a a a
      ▸ Window counting: (1 + ε)-approx. of #(b) in a sliding window of width n = (1 + ε)-approx. of HD between P = aa...a and T
      ▸ Ω(ε⁻¹ log² ε⁻¹n) bits [Datar et al., 2013]
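The equivalence used in this lower bound is easy to check on the slide's own strings: over the alphabet {a, b}, counting b's in a window of width n is the same as taking the Hamming distance to the all-a pattern. The sketch below verifies the exact version of that identity; the function names are hypothetical, and it does not implement the approximate small-space window counter that the cited bound is about.

```python
from typing import List


def sliding_count(text: str, width: int, symbol: str = "b") -> List[int]:
    """Number of occurrences of `symbol` in each window of the given width,
    updated in O(1) per arriving character."""
    counts = []
    current = 0
    for i, ch in enumerate(text):
        current += (ch == symbol)                    # character entering the window
        if i >= width:
            current -= (text[i - width] == symbol)   # character leaving the window
        if i >= width - 1:
            counts.append(current)
    return counts


def hd_to_all_a(text: str, width: int) -> List[int]:
    """Hamming distance between P = aa...a and each window of the text."""
    pattern = "a" * width
    return [
        sum(p != t for p, t in zip(pattern, text[i:i + width]))
        for i in range(len(text) - width + 1)
    ]


if __name__ == "__main__":
    text = "baabab" + "aaaaaa"   # Bob's half followed by Charlie's half
    print(sliding_count(text, 6))
    print(hd_to_all_a(text, 6))
```

Since the two quantities coincide, any protocol for approximate Hamming distance with a known pattern also solves approximate window counting, which is how the counting lower bound transfers.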

  27. 3-party CC problem
      Lower bound: Ω(ε⁻¹ log² ε⁻¹n + ε⁻² log n)
      Bob: b a a b a b
      Charlie: a a a a a a
      ▸ Output = (1 + ε)-HD between T[1, n] and T[n + 1, 2n] = (1 + ε)-approx. of HD between T = T[1, n] 00...0 (Bob and Charlie) and P = T[n + 1, 2n] (Alice)
      ▸ Ω(ε⁻² log n) bits [Jayram & Woodruff, 2013]

  28. Important notion: (1 + ε)-approximate sketch for HD
      Intuition
      ▸ Sketch of a string is a very short vector
      ▸ L₂-distance between sketches ≈ HD between strings

  29. Important notion: (1 + ε)-approximate sketch for HD
      Intuition
      ▸ Sketch of a string is a very short vector
      ▸ L₂-distance between sketches ≈ HD between strings
      Formal definition (binary alphabets)
      ▸ Y = (1/ε²) × n matrix of IID unbiased ±1 random variables
      ▸ sketch(S) = Y · S, a vector of length 1/ε², where S = (S[1], S[2], ...)ᵀ is the string written as a column vector
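A minimal sketch of this definition in plain Python, under stated assumptions: strings are lists of 0/1 bits, the number of rows is ⌈1/ε²⌉, the random signs come from a seeded `random.Random`, and the estimate rescales the squared L₂-distance by 1/rows (which equals ε² when 1/ε² is an integer). The function names are hypothetical, and a fully random matrix is used only for clarity; nothing here addresses how to store or generate Y in small space.

```python
import math
import random
from typing import List


def make_sign_matrix(eps: float, n: int, seed: int = 0) -> List[List[int]]:
    """Y: ceil(1/eps^2) x n matrix of independent unbiased +/-1 entries."""
    rows = math.ceil(1 / eps ** 2)
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(rows)]


def sketch(Y: List[List[int]], s: List[int]) -> List[int]:
    """sketch(S) = Y * S for a binary string S given as a list of 0/1."""
    return [sum(y * bit for y, bit in zip(row, s)) for row in Y]


def estimate_hd(sk1: List[int], sk2: List[int]) -> float:
    """Squared L2-distance between sketches, rescaled by 1/rows."""
    rows = len(sk1)
    return sum((a - b) ** 2 for a, b in zip(sk1, sk2)) / rows


if __name__ == "__main__":
    n = 64
    rng = random.Random(1)
    s1 = [rng.randint(0, 1) for _ in range(n)]
    s2 = [bit ^ (i % 4 == 0) for i, bit in enumerate(s1)]  # flip every 4th bit
    Y = make_sign_matrix(0.25, n)
    true_hd = sum(a != b for a, b in zip(s1, s2))
    print(true_hd, estimate_hd(sketch(Y, s1), sketch(Y, s2)))
```

Two sanity checks hold deterministically: identical strings give an estimate of 0, and strings differing in exactly one position give exactly 1, because a single ±1 column has squared norm equal to the number of rows.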

  30. Important notion: (1 + ε)-approximate sketch for HD
      Formal definition (binary alphabets)
      ▸ Y = (1/ε²) × n matrix of IID unbiased ±1 random variables
      ▸ sketch(S) = Y · S, a vector of length 1/ε²

  31. Important notion: (1 + ε)-approximate sketch for HD
      Formal definition (binary alphabets)
      ▸ Y = (1/ε²) × n matrix of IID unbiased ±1 random variables
      ▸ sketch(S) = Y · S, a vector of length 1/ε²
      Lemma: (1 − ε) · HD(S₁, S₂) ≤ ε² · ‖sketch(S₁) − sketch(S₂)‖₂² ≤ (1 + ε) · HD(S₁, S₂)
      Proof

  32. Important notion: (1 + ε)-approximate sketch for HD
      Formal definition (binary alphabets)
      ▸ Y = (1/ε²) × n matrix of IID unbiased ±1 random variables
      ▸ sketch(S) = Y · S, a vector of length 1/ε²
      Lemma: (1 − ε) · HD(S₁, S₂) ≤ ε² · ‖sketch(S₁) − sketch(S₂)‖₂² ≤ (1 + ε) · HD(S₁, S₂)
      Proof: E[ε² · ‖sketch(S₁) − sketch(S₂)‖₂²] = E[ε² · ‖Y(S₁ − S₂)‖₂²] = ε² · E[‖Y(S₁ − S₂)‖₂²] =
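The transcript breaks off in the middle of this expectation computation. One standard way to finish it, which may differ from what the remaining slides show, uses only the independence and zero mean of the entries of Y. Here x = S₁ − S₂ is an assumed shorthand, a vector with entries in {−1, 0, 1} for binary strings, and the matrix is taken to have exactly 1/ε² rows.

```latex
\begin{align*}
\varepsilon^2 \cdot \mathbb{E}\bigl[\lVert Y x \rVert_2^2\bigr]
  &= \varepsilon^2 \sum_{i=1}^{1/\varepsilon^2}
     \mathbb{E}\Bigl[\Bigl(\sum_{j=1}^{n} Y_{ij}\, x_j\Bigr)^{2}\Bigr] \\
  &= \varepsilon^2 \sum_{i=1}^{1/\varepsilon^2}
     \sum_{j=1}^{n} \sum_{k=1}^{n} \mathbb{E}[Y_{ij} Y_{ik}]\, x_j x_k \\
  &= \varepsilon^2 \sum_{i=1}^{1/\varepsilon^2} \sum_{j=1}^{n} x_j^2
     \qquad \text{since } \mathbb{E}[Y_{ij} Y_{ik}] = 0 \text{ for } j \neq k
     \text{ and } Y_{ij}^2 = 1 \\
  &= \varepsilon^2 \cdot \frac{1}{\varepsilon^2} \cdot \lVert x \rVert_2^2
   = \mathrm{HD}(S_1, S_2).
\end{align*}
```

The last step uses that x_j² = 1 exactly at the positions where S₁ and S₂ differ. This shows only that the rescaled squared distance is correct in expectation; the two-sided bound in the lemma additionally needs a concentration argument, which is not recoverable from the transcript.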
