CS 473: Algorithms
Ruta Mehta
University of Illinois, Urbana-Champaign
Spring 2018
Ruta (UIUC) CS473 1 Spring 2018 1 / 29
Fingerprinting
Lecture 11
Feb 20, 2018
Most slides are courtesy Prof. Chekuri
Source: Wikipedia
Process of mapping a large data item to a much shorter bit string, called its fingerprint.
A fingerprint uniquely identifies the data “for all practical purposes”.
Typically used to avoid comparison and transmission of bulky data.
E.g., a web browser can store/fetch file fingerprints to check whether a file has changed.
As you may have guessed, fingerprint functions are hash functions.
Hashing:
1. To insert x in the dictionary, store x in the table at location h(x).
2. To look up y in the dictionary, check the contents of location h(y).
Bloom Filter: trade off space for false positives
1. What if the elements (x) are unwieldy objects such as long strings, images, etc., with non-uniform sizes?
2. To insert x in the dictionary, set the bit at location h(x) to 1 (initially all bits are set to 0).
3. To look up y: if the bit at location h(y) is 1 say yes, else no.
Bloom Filter: trade off space for false positives
Reducing false positives:
1. Pick k hash functions $h_1, h_2, \ldots, h_k$ independently.
2. Insert x: for 1 ≤ i ≤ k, set the bit at location $h_i(x)$ in table i to 1.
3. Lookup y: compute $h_i(y)$ for 1 ≤ i ≤ k and say yes only if each bit in the corresponding location is 1; otherwise say no.
If the probability of a false positive for one hash function is α < 1, then with k independent hash functions it is $\alpha^k$.
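The insert and lookup steps can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the class name `BloomFilter`, the table size, and the use of salted SHA-256 as a stand-in for the k independent hash functions are all assumptions.

```python
import hashlib


class BloomFilter:
    """k bit tables, one per hash function, following the steps above."""

    def __init__(self, table_size: int, k: int):
        self.table_size = table_size
        self.k = k
        self.tables = [[0] * table_size for _ in range(k)]

    def _location(self, i: int, item: str) -> int:
        # Hypothetical h_i: SHA-256 salted with the table index i.
        digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
        return int.from_bytes(digest, "big") % self.table_size

    def insert(self, item: str) -> None:
        # Set the bit at location h_i(item) in table i, for every i.
        for i in range(self.k):
            self.tables[i][self._location(i, item)] = 1

    def lookup(self, item: str) -> bool:
        # Say yes only if every table has a 1 at h_i(item).
        return all(self.tables[i][self._location(i, item)]
                   for i in range(self.k))
```

An inserted item is always reported as present; an item that was never inserted can still be reported as present, which is the false positive the slide trades for space.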
Use of hash functions for designing fast algorithms
Given a text T of length m and a pattern P of length n, m ≫ n, find all occurrences of P in T.
It involves:
Sampling a prime
String equality via mod p arithmetic
Rabin’s fingerprinting scheme: rolling hash
Karp-Rabin pattern matching algorithm: O(m + n) time.
Given an integer x > 0, sample a prime uniformly at random from all the primes between 1 and x.
1. Sample a number p uniformly at random from {1, . . . , x}.
2. If p is a prime, then output p. Else go to Step (1).
Checking whether p is a prime:
Agrawal-Kayal-Saxena primality test: deterministic but slow.
Miller-Rabin randomized primality test: fast but randomized.
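The two-step sampler can be sketched with a Miller-Rabin check standing in for the primality test. The function names and the default of 20 rounds are assumptions chosen for illustration.

```python
import random


def is_probable_prime(n: int, rounds: int = 20) -> bool:
    """Miller-Rabin: always True for a prime; wrong on a composite
    with probability at most 4**(-rounds)."""
    if n < 2:
        return False
    for small in (2, 3, 5, 7, 11, 13):
        if n % small == 0:
            return n == small
    d, r = n - 1, 0
    while d % 2 == 0:            # write n - 1 = d * 2**r with d odd
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        y = pow(a, d, n)
        if y in (1, n - 1):
            continue
        for _ in range(r - 1):
            y = pow(y, 2, n)
            if y == n - 1:
                break
        else:
            return False         # a witnesses that n is composite
    return True


def sample_prime(x: int) -> int:
    """Step (1) and Step (2): redraw from {1, ..., x} until a prime appears."""
    if x < 2:
        raise ValueError("there is no prime in {1, ..., x} for x < 2")
    while True:
        p = random.randint(1, x)
        if is_probable_prime(p):
            return p
```

Because the test is randomized, the output is a prime only with very high probability; a deterministic test could be swapped in at the cost of speed.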
Is the returned prime sampled uniformly at random?
π(x) : number of primes in {1, . . . , x}.
For a fixed prime p∗ ≤ x, Pr[algorithm outputs p∗] = 1/π(x).
Event A : a prime is picked in a round. Pr[A] = π(x)/x.
Event B : number (prime) p∗ is picked. Pr[B] = 1/x.
Pr[A ∩ B] = Pr[B] = 1/x. Why? Because B ⊂ A.
The algorithm outputs the first prime it picks, so Pr[algorithm outputs p∗] = Pr[B | A]:
$\Pr[B \mid A] = \frac{\Pr[A \cap B]}{\Pr[A]} = \frac{\Pr[B]}{\Pr[A]} = \frac{1/x}{\pi(x)/x} = \frac{1}{\pi(x)}$
Q: How many samples in expectation before termination?
A: x/π(x). Exercise. (Hint: each round succeeds independently with probability Pr[A] = π(x)/x.)
π(x) : number of primes between 0 and x.
Prime Number Theorem: $\lim_{x \to \infty} \frac{\pi(x)}{x / \ln x} = 1$
$\pi(x) \ge \frac{7}{8} \cdot \frac{x}{\ln x} = (1.262\ldots) \frac{x}{\lg x} > \frac{x}{\lg x}$
If y ∼ {1, . . . , x} u.a.r., then y is a prime w.p. $\frac{\pi(x)}{x} > \frac{1}{\lg x}$.
If we want k ≥ 4 primes then x ≥ 2k lg k suffices:
$\pi(x) \ge \pi(2k \lg k) \ge \frac{2k \lg k}{\lg 2 + \lg k + \lg \lg k} \ge \frac{k (2 \lg k)}{2 \lg k} = k$
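These bounds can be checked numerically for small values. The sketch below counts primes with a sieve of Eratosthenes; `prime_count` and `enough_primes` are illustrative names, not part of the lecture.

```python
import math


def prime_count(x: int) -> int:
    """pi(x), computed with the sieve of Eratosthenes."""
    if x < 2:
        return 0
    is_prime = [True] * (x + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, math.isqrt(x) + 1):
        if is_prime[i]:
            for j in range(i * i, x + 1, i):
                is_prime[j] = False
    return sum(is_prime)


def enough_primes(k: int) -> bool:
    """Does {1, ..., ceil(2k lg k)} contain at least k primes?"""
    x = math.ceil(2 * k * math.log2(k))
    return prime_count(x) >= k


# Compare pi(x) with x / lg x for a few values of x.
for x in (16, 100, 1000, 10000):
    print(x, prime_count(x), x / math.log2(x))
```

For very small x the inequality π(x) > x / lg x can fail (for example at x = 4), which is one reason the slide restricts the last claim to k ≥ 4.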
Alice, the captain of a Mars lander, receives an N-bit string x, and Bob, back at mission control, receives a string y. They know nothing about each other's strings, but want to check if x = y.
Alice sends Bob x, and Bob confirms if x = y. But sending N bits is costly! Can they communicate less and still check equality?
If they want 100% certainty, then NO. If they are OK with 99.99% certainty, then O(lg N) bits may suffice!
If x = y, then Pr[Bob says equal] = 1.
If x ≠ y, then Pr[Bob says un-equal] ≥ 0.9999.
HOW?
x, y : N-bit strings, viewed as integers.
(Recall) If M = ⌈2(5N) lg 5N⌉, then there are at least 5N primes in {1, . . . , M}.
Define $h_p(x) = x \bmod p$.
1. Alice picks a random prime p from {1, . . . , M}.
2. She sends Bob the prime p, and also $h_p(x) = x \bmod p$.
3. Bob checks if $h_p(y) = h_p(x)$. If so, he says equal, else un-equal.
If x = y then Bob always says equal.
If x ≠ y then Pr[Bob says equal] ≤ 1/5 (error probability).
In general, for a parameter s: if M = ⌈2(sN) lg sN⌉, then there are at least sN primes in {1, . . . , M}.
Running the same three steps with this M: if x ≠ y then Pr[Bob says equal] ≤ 1/s (error probability).
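One round of the protocol might look like the sketch below, with x and y held as Python integers. The function names are hypothetical, and trial division is used for the primality check only because M is small here; it is not what the lecture prescribes.

```python
import math
import random


def is_prime(n: int) -> bool:
    """Trial division; adequate because M is only about sN lg(sN)."""
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))


def alice_message(x: int, n_bits: int, s: int = 5):
    """Alice: pick a random prime p <= M and send the pair (p, x mod p)."""
    big_m = math.ceil(2 * s * n_bits * math.log2(s * n_bits))
    while True:
        p = random.randint(1, big_m)
        if is_prime(p):
            return p, x % p


def bob_says_equal(y: int, message) -> bool:
    """Bob: compare his own fingerprint with the one Alice sent."""
    p, fingerprint = message
    return y % p == fingerprint
```

For example, `bob_says_equal(x, alice_message(x, 256))` is always `True`, while for y ≠ x the answer is wrong with probability at most 1/s.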
Let x = 6 = 2 ∗ 3. If we draw a p u.a.r. from {2, 3, 5, 7}, then what is the probability that x mod p = 0?
(A) 0. (B) 1. (C) 1/4. (D) 1/2. (E) none of the above.
Now, let y = 21. What is the probability that (y − x) mod p = 15 mod p = 0?
(A) 0. (B) 1. (C) 1/4. (D) 1/2.
Error probability
x, y : N-bit strings, M = ⌈2(sN) lg sN⌉, and $h_p(x) = x \bmod p$.
If x ≠ y then Pr[Bob says equal] = $\Pr[h_p(x) = h_p(y)] \le 1/s$.
Given x ≠ y, $h_p(x) = h_p(y)$ ⇒ x mod p = y mod p.
Let D = |x − y|; then D mod p = 0, and $D \le 2^N$.
$D = p_1 \cdots p_k$ prime factorization.
All $p_i \ge 2$ ⇒ $D \ge 2^k$.
$2^k \le D \le 2^N$ ⇒ k ≤ N. D has at most N prime divisors.
Probability that a random prime p from {1, . . . , M} is a divisor of D:
$\le \frac{k}{\pi(M)} \le \frac{N}{\pi(M)} \le \frac{N}{M / \lg M} \le \frac{N \lg M}{2(sN) \lg sN} \le \frac{1}{s}$
(the last step uses lg M ≤ 2 lg sN, which holds for sN ≥ 4).
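For small inputs the bound can be checked exactly by listing every prime up to M and counting those that divide D. The helper name `collision_probability` is an assumption made for this sketch.

```python
import math


def collision_probability(x: int, y: int, n_bits: int, s: int) -> float:
    """Exact Pr[x mod p == y mod p] over a uniformly random prime p <= M."""
    big_m = math.ceil(2 * s * n_bits * math.log2(s * n_bits))
    primes = [q for q in range(2, big_m + 1)
              if all(q % d for d in range(2, math.isqrt(q) + 1))]
    # A prime p causes a collision exactly when it divides D = |x - y|.
    bad = [q for q in primes if (x - y) % q == 0]
    return len(bad) / len(primes)


# x = 6 and y = 21 as 5-bit strings, with s = 5.
print(collision_probability(6, 21, n_bits=5, s=5))
```

Here D = 15 has only the prime divisors 3 and 5, so the collision probability is well below the guaranteed 1/s = 0.2.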
1. Choose large enough s. Error prob: 1/s.
2. Alice repeats the process R times with independently chosen primes, and Bob says equal only if he gets equal all R times. Error probability: $1/s^R$. For s = 5, R = 10: $1/5^{10} \le 0.000001$.
M = ⌈2(sN) lg sN⌉
Each round sends 2 integers ≤ M. # bits: 2 lg M ≤ 4(lg s + lg N).
If x and y are copies of Wikipedia: about 25 billion characters. If 8 bits per character, then $N \approx 2^{38}$ bits.
Second approach will send 10(2 lg (10N lg 5N)) ≤ 1280 bits.
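The communication cost of the second approach can be recomputed directly. This is a back-of-the-envelope sketch; `bits_sent` is a hypothetical helper and it charges each of the two integers the full bit length of M.

```python
import math


def bits_sent(n_bits: int, s: int, rounds: int) -> int:
    """Total bits when every round sends p and x mod p, both at most M."""
    big_m = math.ceil(2 * s * n_bits * math.log2(s * n_bits))
    return rounds * 2 * big_m.bit_length()


print(bits_sent(2 ** 38, s=5, rounds=10))   # Wikipedia-sized input
print(5 ** -10)                             # error probability 1 / s**R
```

The total stays under the 1280-bit figure quoted above, compared with roughly 2^38 bits for sending x itself.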
Given a string T of length m and a pattern P of length n, s.t. m ≫ n, find all occurrences of P in T.
T = abracadabra, P = ab. Solution S = {1, 8}.
For j > i, let $T_{i \ldots j} = T[i]\,T[i+1] \cdots T[j]$.
S = ∅.
For each i = 1, . . . , m − n + 1:
    If $T_{i \ldots i+n-1} = P$ then S = S ∪ {i}.
O(mn) run-time.
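The brute-force loop translates almost line for line. In this sketch the function name is an assumption, and positions are reported 1-based to match the slide.

```python
def naive_match(text: str, pattern: str) -> list:
    """Return every 1-based position i with T[i .. i+n-1] == P."""
    m, n = len(text), len(pattern)
    # Each slice comparison costs O(n), giving O(mn) overall.
    return [i + 1 for i in range(m - n + 1) if text[i:i + n] == pattern]


print(naive_match("abracadabra", "ab"))
```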
Pick a prime p u.a.r. from {1, . . . , M}. $h_p(x) = x \bmod p$.
S = ∅.
For each i = 1, . . . , m − n + 1:
    If $h_p(T_{i \ldots i+n-1}) = h_p(P)$ then S = S ∪ {i}.
If x is of length n, then computing $h_p(x)$ takes O(n) time. Overall O(mn) running time.
Can we compute $h_p(T_{i+1 \ldots i+n})$ from $h_p(T_{i \ldots i+n-1})$ fast?
Let a and b be (non-negative) integers.
(a + b) mod p = ((a mod p) + (b mod p)) mod p
(a · b) mod p = ((a mod p) · (b mod p)) mod p
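$x = T_{i \ldots i+n-1}$ and $x' = T_{i+1 \ldots i+n}$.
Example: x = 1011001, and x′ = 0110010 (or x′ = 0110011).
With $x_{hb}$ the highest-order bit of x and $x'_{lb}$ the lowest-order bit of x′:
$x' = 2(x - x_{hb} 2^{n-1}) + x'_{lb} = 2x - x_{hb} 2^n + x'_{lb}$
$h_p(x') = x' \bmod p = (2(x \bmod p) - x_{hb}(2^n \bmod p) + x'_{lb}) \bmod p = (2 h_p(x) - x_{hb} h_p(2^n) + x'_{lb}) \bmod p$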
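The update rule is a one-liner once $h_p(2^n)$ has been precomputed. The sketch below uses the slide's example window with p = 13 chosen arbitrarily; the function and parameter names are assumptions.

```python
def roll(h_x: int, high_bit: int, low_bit: int, pow_n: int, p: int) -> int:
    """h_p(x') from h_p(x): drop x's highest bit, shift left, append low_bit.

    pow_n must equal 2**n mod p, where n is the window length.
    """
    return (2 * h_x - high_bit * pow_n + low_bit) % p


x, n, p = 0b1011001, 7, 13
pow_n = pow(2, n, p)
print(roll(x % p, 1, 0, pow_n, p), 0b0110010 % p)   # next bit is 0
print(roll(x % p, 1, 1, pow_n, p), 0b0110011 % p)   # next bit is 1
```

Python's `%` always returns a non-negative result for a positive modulus, so the subtraction needs no extra correction; in languages where it can be negative, add p before reducing.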
p : a random prime from {1, . . . , M}.
1. Set S = ∅. Compute $h_p(T_{1 \ldots n})$, $h_p(2^n)$, and $h_p(P)$.
2. For each i = 1, . . . , m − n + 1:
    (a) If $h_p(T_{i \ldots i+n-1}) = h_p(P)$, then S = S ∪ {i}.
    (b) Compute $h_p(T_{i+1 \ldots i+n})$ using $h_p(T_{i \ldots i+n-1})$ and $h_p(2^n)$ by applying the rolling hash.
In Step 1, computing $h_p(x)$ for an n-bit x takes O(n) time.
Assuming O(lg M)-bit arithmetic can be done in O(1) time: since $h_p(\cdot)$ produces lg M-bit numbers, both steps inside the for loop can be done in O(1) time.
Overall O(m + n) time. Can’t do better: just reading the input takes that long.
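Putting the pieces together, a sketch of the algorithm on bit strings could look like this. The prime p is passed in as a parameter rather than sampled, which is a simplification for illustration, and `karp_rabin` is an assumed name.

```python
def karp_rabin(text: str, pattern: str, p: int) -> list:
    """1-based positions whose window has the same fingerprint as the pattern.

    text and pattern are strings of '0'/'1'. The result contains every true
    match and may also contain false positives when fingerprints collide.
    """
    m, n = len(text), len(pattern)
    if n == 0 or n > m:
        return []
    pow_n = pow(2, n, p)                 # h_p(2^n)
    h_pat = int(pattern, 2) % p          # h_p(P)
    h_win = int(text[:n], 2) % p         # h_p(T_{1..n})
    matches = []
    for i in range(m - n + 1):           # i is 0-based here
        if h_win == h_pat:
            matches.append(i + 1)
        if i + n < m:                    # roll the window one bit to the right
            high_bit = int(text[i])
            low_bit = int(text[i + n])
            h_win = (2 * h_win - high_bit * pow_n + low_bit) % p
    return matches


print(karp_rabin("1011001011", "101", p=101))
```

With p = 101 and a 3-bit pattern every window value is smaller than p, so this particular call cannot produce a false positive; with a small modulus relative to $2^n$ it can.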
If there is a match at position i, then i ∈ S. In other words, if $T_{i \ldots i+n-1} = P$, then i ∈ S.
All matched positions are in S.
Can it contain unmatched positions? YES!
With what probability?
Pr[S contains an index i, while there is no match at i]
Set M = ⌈2(sn) lg sn⌉. Given x ≠ y, $\Pr[h_p(x) = h_p(y)] \le 1/s$.
Given $T_{i \ldots i+n-1} \ne P$, Pr[i ∈ S] ≤ 1/s.
Pr[Any index in S is wrong] ≤ m/s (Union bound).
To ensure S is correct with at least 0.99 probability, we need $1 - \frac{m}{s} = 0.99 \Leftrightarrow \frac{m}{s} = \frac{1}{100} \Leftrightarrow s = 100m$.
Back to running time
In Step 1, computing $h_p(x)$ for an n-bit x takes O(n) time.
Assuming O(lg M)-bit arithmetic can be done in O(1) time: since $h_p(\cdot)$ produces lg M-bit numbers, both steps inside the for loop can be done in O(1) time.
Overall O(m + n) time. Can’t do better.
With s = 100m: M = ⌈200mn lg 100mn⌉ ⇒ lg M = O(lg m).
Even if T is the entire Wikipedia, with bit length $m \approx 2^{38}$, lg M ≈ 64 (assuming the bit length of the pattern is $n \le 2^{16}$).
64-bit arithmetic is doable on laptops!
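The size of M can be recomputed for other text and pattern lengths with a few lines. This is a rough sketch: `modulus_bits` is a hypothetical helper and `inverse_failure = 100` encodes the choice s = 100m.

```python
import math


def modulus_bits(m: int, n: int, inverse_failure: int = 100) -> int:
    """Bit length of M = ceil(2 s n lg(s n)) with s = inverse_failure * m."""
    s = inverse_failure * m
    big_m = math.ceil(2 * s * n * math.log2(s * n))
    return big_m.bit_length()


print(modulus_bits(2 ** 38, 2 ** 16))   # Wikipedia-sized text, long pattern
print(modulus_bits(2 ** 38, 2 ** 12))   # same text, shorter pattern
```

For these parameters the result lands in the 60s, in line with the rough estimate of 64 above; shorter patterns lower it by a few bits.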
1. Hashing is a powerful and important technique with many practical applications.
2. Randomization is fundamental to understanding hashing.
3. Good and efficient hashing is possible in theory and practice with proper definitions (universal, perfect, etc.).
4. Related ideas of creating a compact fingerprint/sketch for