Streaming and communication complexity of Hamming distance

Tatiana Starikovskaya, IRIF, Université Paris-Diderot (Joint work with Raphaël Clifford, ICALP’16)

Approximate pattern matching

Problem: Pattern P of length n, text T. Find the Hamming distance between P and each n-length substring of T.

“Big Data” applications
▸ Computational biology
▸ Signal processing
▸ Text retrieval

Standard algorithms: $\Omega(n)$ space
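The problem statement above can be made concrete with a direct, non-streaming computation. This is only an illustrative baseline, not the method of the talk; the function name `hamming_distances` and the sample strings are assumptions chosen for the example.

```python
def hamming_distances(pattern: str, text: str) -> list[int]:
    """Hamming distance between pattern and every len(pattern)-length substring of text."""
    n = len(pattern)
    return [
        # Count the positions where the pattern and the current substring differ.
        sum(p != t for p, t in zip(pattern, text[i:i + n]))
        for i in range(len(text) - n + 1)
    ]


if __name__ == "__main__":
    # One distance per alignment of the pattern against the text.
    print(hamming_distances("aba", "abaab"))
```

This version keeps both strings in memory, which is the kind of linear space usage the streaming model is meant to avoid.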

Model of computation

Problem: Pattern P of length n, text T. Find the Hamming distance between P and each n-length substring of T.

Model
▸ T = stream of characters
▸ Length of the text and size of the universe are extremely large
▸ Can’t store a copy of T or P
▸ Space = total space used; Time = time per character of T

Example (the text arrives one character at a time)
Text T: c a a b c a a a c a
Pattern P: b c a a a c
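To illustrate the cost measures of the model (total space, time per arriving character), here is a naive streaming baseline. It deliberately violates the "can't store a copy" constraint by keeping the pattern and the last n characters, so it uses linear space and linear time per character; it is a sketch for comparison, not the algorithm of the talk. The name `stream_hamming` is an assumption.

```python
from collections import deque
from typing import Iterable, Iterator


def stream_hamming(pattern: str, stream: Iterable[str]) -> Iterator[int]:
    """Yield the Hamming distance for each alignment as characters arrive."""
    n = len(pattern)
    window = deque(maxlen=n)  # last n characters of the text: Theta(n) space
    for ch in stream:
        window.append(ch)  # the oldest character is dropped automatically
        if len(window) == n:
            # O(n) work per arriving character.
            yield sum(p != t for p, t in zip(pattern, window))


if __name__ == "__main__":
    # Pattern and text taken from the slide's example.
    for distance in stream_hamming("bcaaac", "caabcaaaca"):
        print(distance)
```

The results in the talk concern how far below this baseline the space and per-character time can be pushed when only approximate distances are required.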

What is known: Hamming distance
▸ All distances
  ▸ Space $\Omega(n)$ [Folklore]
  ▸ Time $O(\log^2 n)$ [Clifford et al., CPM’11]
▸ Only distances $\le k$ [Clifford et al., SODA’16]
  ▸ Exact values: space $O(k^2 \,\mathrm{polylog}\, n)$, time $O(\sqrt{k} \log k + \mathrm{polylog}\, n)$
  ▸ $(1+\varepsilon)$-approx.: space $O(\varepsilon^{-2} k^2 \,\mathrm{polylog}\, n)$, time $O(\varepsilon^{-2} \,\mathrm{polylog}\, n)$

This work: (1+ε)-approximate Hamming distances (HDs) problem
▸ Lower bounds: reduction to a communication complexity (CC) problem
▸ Upper bounds: show a streaming algorithm

Let's discuss that!

Lower bound for all HDs, approximate

3-parties CC problem
▸ Alice holds the pattern, Bob holds T[1, n], Charlie holds T[n+1, 2n]
▸ Charlie’s output: (1+ε)-HD for each alignment of P and T

Example: Alice: P = a a a a a a; Bob: T[1, n] = b a a b a b; Charlie: T[n+1, 2n] = a a a a a a

Min. communication between Alice, Bob, and Charlie?

Lower bound for all HDs, approximate
▸ Streaming algorithm: T = stream, not allowed to store a copy of P or T, output = (1+ε)-HDs
▸ At time n it stores all the information needed to compute the (1+ε)-HDs
▸ Comm. protocol: send this information from Alice and Bob to Charlie
▸ Lower bound for the CC problem ⇒ streaming lower bound
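The reduction can be simulated on a toy scale. The sketch below assumes a naive exact streaming algorithm whose whole memory is the pair (pattern, last n text characters); the helper names `process` and `simulate_protocol` are hypothetical. The memory contents at time n depend only on Alice's and Bob's inputs, so they can serve as the message to Charlie, and the message size equals the space of the streaming algorithm.

```python
def process(state, chars):
    """Feed chars to a naive streaming algorithm; state = (pattern, last n text chars)."""
    pattern, window = state
    n = len(pattern)
    outputs = []
    for ch in chars:
        window = (window + ch)[-n:]  # keep only the last n characters
        if len(window) == n:
            outputs.append(sum(p != t for p, t in zip(pattern, window)))
    return (pattern, window), outputs


def simulate_protocol(pattern, text):
    """Return Charlie's outputs and the number of characters sent to him."""
    n = len(pattern)
    # Memory at time n is determined by P (Alice) and T[1, n] (Bob).
    state_at_n, _ = process((pattern, ""), text[:n])
    message_size = sum(len(part) for part in state_at_n)
    # Charlie resumes the algorithm on T[n+1, 2n] from the received memory.
    _, charlie_outputs = process(state_at_n, text[n:2 * n])
    return charlie_outputs, message_size


if __name__ == "__main__":
    # Inputs from the slide's example: P = aaaaaa, T = baabab aaaaaa.
    print(simulate_protocol("aaaaaa", "baababaaaaaa"))
```

Any streaming algorithm with smaller memory would give a cheaper protocol in the same way, which is why a communication lower bound carries over to streaming space.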

This work: (1+ε)-approximate HDs problem
▸ Lower bounds: reduction to a CC problem
  ▸ 3-parties CC problem
  ▸ Simpler CC problem: B and C know the pattern
  ▸ Upper bounds (shown for each of the two CC problems)
▸ Upper bounds: show a streaming algorithm

Communication complexity

Simpler CC problem: B and C know the pattern

Lower bound: $\Omega(\varepsilon^{-1} \log^2 \varepsilon^{-1} n)$

Example: Bob: T[1, n] = b a a b a b; Charlie: T[n+1, 2n] = a a a a a a

▸ Window counting: (1+ε)-approx. of #(b) in a sliding window of width n = (1+ε)-approx. of HD between P = aa...a and T
▸ $\Omega(\varepsilon^{-1} \log^2 \varepsilon^{-1} n)$ bits [Datar et al., 2013]
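The equivalence behind this lower bound can be checked on the slide's example. The sketch below assumes a binary alphabet {a, b} and computes exact values (the lower bound itself concerns approximate ones); the function names are assumptions.

```python
from collections import deque


def count_b_in_windows(text, n):
    """Number of 'b' characters in each length-n sliding window of text."""
    window = deque(maxlen=n)
    count = 0
    counts = []
    for ch in text:
        if len(window) == n and window[0] == "b":
            count -= 1  # the leftmost character is about to leave the window
        window.append(ch)
        if ch == "b":
            count += 1
        if len(window) == n:
            counts.append(count)
    return counts


def hd_to_all_a(text, n):
    """Hamming distance between P = 'a' * n and each length-n substring of text."""
    pattern = "a" * n
    return [
        sum(p != t for p, t in zip(pattern, text[i:i + n]))
        for i in range(len(text) - n + 1)
    ]


if __name__ == "__main__":
    text = "baabab" + "aaaaaa"  # Bob's half followed by Charlie's half
    print(count_b_in_windows(text, 6))
    print(hd_to_all_a(text, 6))
```

Both functions return the same list on any text over {a, b}, so an algorithm for approximate Hamming distances also solves approximate window counting.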

3-parties CC problem

Lower bound: $\Omega(\varepsilon^{-1} \log^2 \varepsilon^{-1} n + \varepsilon^{-2} \log n)$

Example: Bob: T[1, n] = b a a b a b; Charlie: T[n+1, 2n] = a a a a a a

▸ Output = (1+ε)-HD between T[1, n] and T[n+1, 2n] = (1+ε)-approx. of HD between T = T[1, n] 00...0 (Bob and Charlie) and P = T[n+1, 2n] (Alice)
▸ $\Omega(\varepsilon^{-2} \log n)$ bits [Jayram & Woodruff, 2013]

Important notion: (1+ε)-approximate sketch for HD

Intuition
▸ Sketch of a string is a very short vector
▸ $L_2$-distance between sketches ≈ HD between strings

Formal definition (binary alphabets)
▸ $Y$ = $1/\varepsilon^2 \times n$ matrix of IID unbiased $\pm 1$ random variables
▸ $\mathrm{sketch}(S) = Y S$, a vector of length $1/\varepsilon^2$:

$$\mathrm{sketch}(S) = \begin{pmatrix} \pm 1 & \pm 1 & \dots \\ \pm 1 & \ddots & \\ \vdots & & \end{pmatrix} \begin{pmatrix} S[1] \\ S[2] \\ \vdots \end{pmatrix}$$

Lemma

$$(1-\varepsilon) \cdot \mathrm{HD}(S_1, S_2) \le \varepsilon^2 \cdot \|\mathrm{sketch}(S_1) - \mathrm{sketch}(S_2)\|_2^2 \le (1+\varepsilon) \cdot \mathrm{HD}(S_1, S_2)$$

Proof

$$E[\varepsilon^2 \cdot \|\mathrm{sketch}(S_1) - \mathrm{sketch}(S_2)\|_2^2] = E[\varepsilon^2 \cdot \|Y(S_1 - S_2)\|_2^2] = \varepsilon^2 \cdot E[\|Y(S_1 - S_2)\|_2^2] =$$
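The definition can be tried out with a small sketch in plain Python. It assumes strings are given as lists of 0/1 values and that 1/ε² is an integer; the helper names `make_sketch_matrix`, `sketch`, and `estimate_hd` are hypothetical. For binary strings, each coordinate of Y(S1 − S2) is a sum of HD(S1, S2) independent ±1 terms, so its square has expectation HD(S1, S2); summing the 1/ε² coordinates and scaling by ε² leaves HD(S1, S2) in expectation.

```python
import math
import random


def make_sketch_matrix(eps, n, rng):
    """(1/eps^2) x n matrix of independent unbiased +1/-1 entries."""
    rows = math.ceil(1 / eps ** 2)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(rows)]


def sketch(Y, s):
    """Matrix-vector product Y s for a 0/1 vector s."""
    return [sum(y * x for y, x in zip(row, s)) for row in Y]


def estimate_hd(Y, eps, s1, s2):
    """eps^2 times the squared L2 distance between the two sketches."""
    diff = [a - b for a, b in zip(sketch(Y, s1), sketch(Y, s2))]
    return eps ** 2 * sum(d * d for d in diff)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the run repeatable
    eps, n = 0.25, 64
    s1 = [rng.randint(0, 1) for _ in range(n)]
    s2 = [rng.randint(0, 1) for _ in range(n)]
    Y = make_sketch_matrix(eps, n, rng)
    true_hd = sum(a != b for a, b in zip(s1, s2))
    print(true_hd, estimate_hd(Y, eps, s1, s2))
```

A single run gives a noisy estimate; the code illustrates the definition and the expectation computation, not the probability with which the lemma's bounds hold.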
