SLIDE 1

CS 473: Algorithms

Chandra Chekuri Ruta Mehta

University of Illinois, Urbana-Champaign

Fall 2016

Chandra & Ruta (UIUC) CS473 1 Fall 2016 1 / 22

SLIDE 2

CS 473: Algorithms, Fall 2016

Fingerprinting

Lecture 11

September 28, 2016

SLIDE 5

Fingerprinting

Source: Wikipedia

Fingerprinting is the process of mapping a large data item to a much shorter bit string, called its fingerprint. A fingerprint uniquely identifies the data for all practical purposes.

Fingerprints are typically used to avoid comparing and transmitting bulky data. Example: a web browser can store and fetch file fingerprints to check whether a file has changed.

As you may have guessed, fingerprint functions are hash functions.
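The change-detection use case can be sketched in a few lines. This is only an illustration: SHA-256 from Python's `hashlib` is used as a stand-in fingerprint function (the lecture itself goes on to use arithmetic mod a random prime), and the byte strings are made-up page contents.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Map arbitrarily large data to a short, fixed-length hex string."""
    # Assumption: SHA-256 is a stand-in, not the scheme analysed in this lecture.
    return hashlib.sha256(data).hexdigest()


# Hypothetical cached copy of a page and two later fetches.
cached = fingerprint(b"<html>version 1</html>")
unchanged = fingerprint(b"<html>version 1</html>") == cached
changed = fingerprint(b"<html>version 2</html>") != cached
```

Only the short fingerprints are compared, never the full contents.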

SLIDE 7

Bloom Filters

Hashing:

1. To insert x in the dictionary, store x in the table at location h(x).
2. To look up y in the dictionary, check the contents of location h(y).

Bloom filter: save space at the cost of false positives.

1. Storing items in a dictionary is expensive in terms of memory, especially if the items are unwieldy objects such as long strings, images, etc. with non-uniform sizes.
2. To insert x in the dictionary, set the bit at location h(x) to 1 (initially all bits are set to 0).
3. To look up y, say yes if the bit at location h(y) is 1, else say no.

SLIDE 11

Bloom Filters

Bloom filter: save space at the cost of false positives.

1. To insert x in the dictionary, set the bit at location h(x) to 1 (initially all bits are set to 0).
2. To look up y, say yes if the bit at location h(y) is 1, else say no.
3. No false negatives, but false positives are possible due to collisions.

Reducing false positives:

1. Pick k hash functions $h_1, h_2, \ldots, h_k$ independently.
2. To insert x, for 1 ≤ i ≤ k set the bit at location $h_i(x)$ in table i to 1.
3. To look up y, compute $h_i(y)$ for 1 ≤ i ≤ k and say yes only if the bit at each corresponding location is 1; otherwise say no.

If the probability of a false positive for one hash function is α < 1, then with k independent hash functions it is $\alpha^k$.
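A minimal sketch of the k-table variant described above. The class name, the table size, and the use of salted SHA-256 as a stand-in for the independent hash functions $h_1, \ldots, h_k$ are all assumptions made for illustration; a real deployment would pick hash functions with the required independence guarantees.

```python
import hashlib


class BloomFilter:
    """k tables, one per hash function, as in the 'table i' description."""

    def __init__(self, table_size, k):
        self.table_size = table_size
        self.k = k
        # One table per hash function; all entries start at 0.
        # (A full byte is used per bit here, purely for readability.)
        self.tables = [bytearray(table_size) for _ in range(k)]

    def _location(self, i, item):
        # Hypothetical stand-in for h_i: SHA-256 salted with the index i.
        digest = hashlib.sha256(f"{i}:".encode() + item).digest()
        return int.from_bytes(digest[:8], "big") % self.table_size

    def insert(self, item):
        for i in range(self.k):
            self.tables[i][self._location(i, item)] = 1

    def lookup(self, item):
        # Say yes only if the corresponding bit is set in every table.
        return all(self.tables[i][self._location(i, item)]
                   for i in range(self.k))
```

Inserted items are always reported as present (no false negatives); an item that was never inserted can still be reported as present if all k of its bits happen to be set.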

SLIDE 14

Outline

Use of hash functions for designing fast algorithms

Problem

Given a text T of length m and pattern P of length n, m ≫ n, find all occurrences of P in T.

Karp-Rabin Randomized Algorithm

  • Sampling a prime
  • String equality via mod p arithmetic
  • Rabin's fingerprinting scheme: rolling hash
  • Karp-Rabin pattern matching algorithm: O(m + n) time

SLIDE 17

Sampling a prime

Problem

Given an integer x > 0, sample a prime uniformly at random from all the primes between 1 and x.

Procedure

1. Sample a number p uniformly at random from {1, . . . , x}.
2. If p is a prime, then output p. Else go to Step (1).

Checking if p is prime

  • Agrawal-Kayal-Saxena primality test: deterministic but slow.
  • Miller-Rabin randomized primality test: fast but randomized. It outputs 'prime' for a composite number only with very low probability.
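The procedure can be sketched as rejection sampling on top of a Miller-Rabin test. The function names and the choice of 40 rounds are illustrative assumptions, not part of the lecture.

```python
import random


def is_probable_prime(n, rounds=40):
    """Miller-Rabin: may call a composite 'prime' with tiny probability."""
    if n < 2:
        return False
    for small in (2, 3, 5, 7, 11, 13):
        if n % small == 0:
            return n == small
    # Write n - 1 = 2^r * d with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        y = pow(a, d, n)
        if y in (1, n - 1):
            continue
        for _ in range(r - 1):
            y = pow(y, 2, n)
            if y == n - 1:
                break
        else:
            return False  # a witnesses that n is composite
    return True


def sample_prime(x):
    """Redraw from {1, ..., x} until the sample is (probably) prime."""
    if x < 2:
        raise ValueError("there is no prime in {1, ..., x} when x < 2")
    while True:
        p = random.randint(1, x)
        if is_probable_prime(p):
            return p
```

A genuine prime is never rejected by this test, so the only possible error is returning a composite, and that happens with very low probability.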

SLIDE 23

Sampling a Prime: Analysis

Is the returned prime sampled uniformly at random?

π(x): number of primes in {1, . . . , x}.

Lemma

For a fixed prime p∗ ≤ x, Pr[algorithm outputs p∗] = 1/π(x).

Proof.

A: event that a prime is picked in a round. Pr[A] = π(x)/x.
B: event that the number (prime) p∗ is picked. Pr[B] = 1/x. B ⊂ A.

$$\Pr[B \mid A] = \frac{\Pr[A \cap B]}{\Pr[A]} = \frac{\Pr[B]}{\Pr[A]} = \frac{1/x}{\pi(x)/x} = \frac{1}{\pi(x)}$$

Running time in expectation

Q: How many samples in expectation before termination?
A: x/π(x). Exercise.
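One way to approach the exercise, assuming the rounds are independent: each round succeeds with probability q = Pr[A], so the number of samples is geometrically distributed. The symbol q is introduced here only for this sketch.

```latex
\mathbb{E}[\text{number of samples}]
  = \sum_{t \ge 1} t \,(1-q)^{t-1}\, q
  = \frac{1}{q}
  = \frac{x}{\pi(x)},
\qquad \text{where } q = \Pr[A] = \frac{\pi(x)}{x}.
```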

SLIDE 27

How many primes between 0 and x

π(x) : Number of primes between 0 and x.

Prime Number Theorem

$$\lim_{x \to \infty} \frac{\pi(x)}{x/\ln x} = 1$$

Proved by Jacques Hadamard and Charles Jean de la Vallée-Poussin in 1896.

Chebyshev (from 1848)

$$\pi(x) \ge \frac{7}{8}\,\frac{x}{\ln x} = (1.262..)\,\frac{x}{\lg x} > \frac{x}{\lg x}$$

If y ∼ {1, . . . , x} u.a.r., then y is a prime w.p. $\frac{\pi(x)}{x} > \frac{1}{\lg x}$.

If we want k ≥ 4 primes, then x ≥ 2k lg k suffices:

$$\pi(x) \ge \pi(2k \lg k) > \frac{2k \lg k}{\lg 2 + \lg k + \lg\lg k} \ge k$$
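The claim that a range of size about 2k lg k contains at least k primes can be checked numerically for small k. The sketch below counts primes with a sieve of Eratosthenes; the function names and the rounding up of 2k lg k to an integer are assumptions made for illustration.

```python
import math


def count_primes(x):
    """pi(x): number of primes in {1, ..., x}, via the sieve of Eratosthenes."""
    if x < 2:
        return 0
    is_prime = [True] * (x + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, math.isqrt(x) + 1):
        if is_prime[i]:
            for multiple in range(i * i, x + 1, i):
                is_prime[multiple] = False
    return sum(is_prime)


def range_for_k_primes(k):
    """Smallest integer x with x >= 2 k lg k."""
    return math.ceil(2 * k * math.log2(k))
```

For example, `range_for_k_primes(4)` is 16, and {1, . . . , 16} contains the six primes 2, 3, 5, 7, 11, 13.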

SLIDE 31

String Equality

Problem

Alice, the captain of a Mars lander, receives an N-bit string x, and Bob, back at mission control, receives a string y. They know nothing about each other's strings, but want to check if x = y.

Alice sends Bob x, and Bob confirms if x = y. But sending N bits is costly! Can they check equality with less communication?

Possibilities:

  • If we want 100% surety, then NO.
  • If we are OK with 99.99% surety, then O(lg N) bits may suffice!

If x = y, then Pr[Bob says equal] = 1.
If x ≠ y, then Pr[Bob says un-equal] ≥ 0.9999.

HOW?

SLIDE 36

String Equality: Randomized Algorithm

(Recall) There are at least 5N primes in {1, . . . , M} if M = ⌈2(5N) lg 5N⌉.

Define $h_p(x) = x \bmod p$.

1. Alice picks a random prime p from {1, . . . , M}.
2. She sends Bob the prime p, and also $h_p(x) = x \bmod p$.
3. Bob checks if $h_p(y) = h_p(x)$. If so, he says equal, else un-equal.

Lemma

If x = y then Bob always says equal.

Lemma

If x ≠ y then Pr[Bob says equal] ≤ 1/5 (error probability).
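The three steps can be sketched as two small functions, one per party. The strings x and y are represented as Python integers, the function names are hypothetical, and primality is checked by trial division only because M is small in a toy run; a fast randomized test would replace it for realistic N.

```python
import math
import random


def is_prime(n):
    """Trial division; adequate only for the small M used in this sketch."""
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))


def sample_prime(M):
    """Uniform random prime in {1, ..., M} by rejection sampling."""
    while True:
        p = random.randint(1, M)
        if is_prime(p):
            return p


def alice_message(x, N, s=5):
    """Alice's side: pick a prime p <= M and send (p, x mod p)."""
    M = math.ceil(2 * s * N * math.log2(s * N))
    p = sample_prime(M)
    return p, x % p


def bob_says_equal(y, message):
    """Bob's side: compare his own fingerprint with the one received."""
    p, fingerprint = message
    return y % p == fingerprint
```

When x = y the two fingerprints always agree; when x ≠ y Bob can be fooled only if p happens to divide |x − y|.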

SLIDE 42

String Equality: Randomized Algorithm

Error probability

Let M = ⌈2(sN) lg sN⌉ and $h_p(x) = x \bmod p$.

Lemma

If x ≠ y then Pr[Bob says equal] = Pr[$h_p(x) = h_p(y)$] ≤ 1/s.

Proof.

Given x ≠ y, $h_p(x) = h_p(y)$ ⇒ x mod p = y mod p.

Let D = |x − y|. Then D mod p = 0, and $D \le 2^N$.

Let $D = p_1 \cdots p_k$ be the prime factorization of D. All $p_i \ge 2$ ⇒ $D \ge 2^k$.

$2^k \le D \le 2^N$ ⇒ k ≤ N. So D has at most N prime divisors.

The probability that a random prime p from {1, . . . , M} is a divisor of D is

$$\le \frac{N}{\pi(M)} \le \frac{N}{M/\lg M} = \frac{N}{2(sN)\lg sN}\,\lg M \le \frac{1}{s}$$

(The last step uses lg M ≤ 2 lg sN, which holds when sN ≥ 4.)

SLIDE 48

Low Error Probability and Communication.

Low Error Probability

1. Choose large enough s. Error prob: 1/s.
2. Alice repeats the process R times, and Bob says equal only if he gets equal all R times. Error probability: $1/s^R$. For s = 5, R = 10: $1/5^{10} \le 0.000001$.

Amount of Communication

M = ⌈2(sN) lg sN⌉

Each round sends 2 integers ≤ M. Number of bits: 2 lg M ≤ 4(lg s + lg N).

If x and y are copies of Wikipedia, about 25 billion characters, with 8 bits per character, then $N \approx 2^{38}$ bits.

The second approach will send $10 \cdot 2\lg(10N \lg 5N) \le 1280$ bits.
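The arithmetic behind these numbers can be replayed directly. This is a back-of-the-envelope sketch using the values from the Wikipedia example; rounding lg M up to a whole number of bits per integer is an assumption of the sketch.

```python
import math

s, R = 5, 10
N = 2 ** 38  # string length in bits, as in the Wikipedia example

# Bob is fooled only if all R independent rounds err.
error_probability = 1 / s ** R

# Each round sends the prime p and the fingerprint, both at most M.
M = math.ceil(2 * s * N * math.log2(s * N))
bits_per_round = 2 * math.ceil(math.log2(M))
total_bits = R * bits_per_round
```

With these values `error_probability` is below 0.000001 and `total_bits` stays within the 1280-bit budget, compared with roughly 2^38 bits for sending x itself.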

SLIDE 49

Part I Karp-Rabin Pattern Matching Algorithm

SLIDE 54

Pattern Matching

Given a string T of length m and pattern P of length n, s.t. m ≫ n, find all occurrences of P in T.

Example

T = abracadabra, P = ab. Solution S = {1, 8}.

For j > i, let $T_{i \ldots j} = T[i]\,T[i+1] \cdots T[j]$.

Brute force algorithm

S = ∅.
For each i = 1, . . . , m − n + 1:
    If $T_{i \ldots i+n-1} = P$ then S = S ∪ {i}.

O(mn) run-time.
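The brute force algorithm translates almost line for line. The sketch below keeps the 1-indexed positions used above; the function name is an illustrative choice.

```python
def brute_force_matches(text, pattern):
    """Return all 1-indexed positions i with T_i..i+n-1 = P."""
    m, n = len(text), len(pattern)
    matches = []
    for i in range(1, m - n + 2):          # i = 1, ..., m - n + 1
        # Comparing a length-n window costs O(n), hence O(mn) overall.
        if text[i - 1:i - 1 + n] == pattern:
            matches.append(i)
    return matches


print(brute_force_matches("abracadabra", "ab"))  # [1, 8]
```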

SLIDE 57

Using Hash Function

Pick a prime p u.a.r. from {1, . . . , M}. $h_p(x) = x \bmod p$.

Brute force algorithm using hash function

S = ∅.
For each i = 1, . . . , m − n + 1:
    If $h_p(T_{i \ldots i+n-1}) = h_p(P)$ then S = S ∪ {i}.

If x is of length n, then computing $h_p(x)$ takes O(n) time, so the overall running time is still O(mn).

Can we compute $h_p(T_{i+1 \ldots i+n})$ from $h_p(T_{i \ldots i+n-1})$ quickly?
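To see where the O(n) cost per hash comes from, here is one way to compute $h_p(x)$ for an n-bit string, reading one bit at a time and reducing mod p as we go. Treating x as a string of '0'/'1' characters and the function name `hash_bits` are assumptions of this sketch.

```python
def hash_bits(bits, p):
    """Compute h_p(x) = x mod p for a bit string x, most significant bit first."""
    value = 0
    for bit in bits:
        # Shift left by one bit, add the new bit, and keep the value below p.
        value = (2 * value + int(bit)) % p
    return value
```

Every intermediate value is smaller than 2p, so each step is constant-time arithmetic on lg M-bit numbers, and there are n steps.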

SLIDE 60

Rolling Hash

$x = T_{i \ldots i+n-1}$ and $x' = T_{i+1 \ldots i+n}$.

Example

x = 1011001, and x′ = 0110010 (or x′ = 0110011).

With $x_{hb}$ the highest-order bit of x and $x'_{lb}$ the lowest-order bit of x′:

$$x' = 2\,(x - x_{hb}\,2^{n-1}) + x'_{lb}$$

$$h_p(x') = x' \bmod p = \left(2\,(x \bmod p) - x_{hb}\,(2^n \bmod p) + x'_{lb}\right) \bmod p = \left(2\,h_p(x) - x_{hb}\,h_p(2^n) + x'_{lb}\right) \bmod p$$
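The update rule can be checked on the example above. In this sketch p = 13 is an arbitrary small prime chosen for illustration, and `roll` is a hypothetical helper name.

```python
def roll(hash_x, high_bit, new_low_bit, hash_two_n, p):
    """h_p(x') from h_p(x): drop the high bit of x, shift, append a new low bit."""
    return (2 * hash_x - high_bit * hash_two_n + new_low_bit) % p


n, p = 7, 13                       # p is an arbitrary prime for this example
x = 0b1011001
hash_two_n = pow(2, n, p)          # h_p(2^n), computed once

rolled_0 = roll(x % p, 1, 0, hash_two_n, p)   # x' = 0110010
rolled_1 = roll(x % p, 1, 1, hash_two_n, p)   # x' = 0110011
```

Both results agree with hashing x′ from scratch, but each update uses only a constant number of operations on numbers smaller than p.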

SLIDE 65

Karp-Rabin Algorithm

p: a random prime from {1, . . . , M}.

1. Set S = ∅. Compute $h_p(T_{1 \ldots n})$, $h_p(2^n)$, and $h_p(P)$.
2. For each i = 1, . . . , m − n + 1:
   1. If $h_p(T_{i \ldots i+n-1}) = h_p(P)$, then S = S ∪ {i}.
   2. Compute $h_p(T_{i+1 \ldots i+n})$ from $h_p(T_{i \ldots i+n-1})$ and $h_p(2^n)$ by applying the rolling hash.

Running Time

  • In Step 1, computing $h_p(x)$ for an n-bit x takes O(n) time.
  • Assuming O(lg M)-bit arithmetic can be done in O(1) time: since $h_p(\cdot)$ produces lg M-bit numbers, both steps inside the for loop can be done in O(1) time.
  • Overall O(m + n) time. Can't do better.
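Putting the pieces together, a minimal sketch of the algorithm on bit strings. The text and pattern are assumed to be strings over {'0', '1'}, and p is passed in by the caller (it is assumed, not checked, to be a prime drawn u.a.r. from {1, . . . , M}). For brevity the initial hashes use Python's big-integer conversion rather than a bit-by-bit loop.

```python
def karp_rabin(text_bits, pattern_bits, p):
    """Return 1-indexed positions i where h_p(T_i..i+n-1) = h_p(P)."""
    m, n = len(text_bits), len(pattern_bits)
    if n == 0 or n > m:
        return []
    # Step 1: hashes of the pattern, the first window, and 2^n.
    hash_pattern = int(pattern_bits, 2) % p
    hash_window = int(text_bits[:n], 2) % p
    hash_two_n = pow(2, n, p)
    matches = []
    # Step 2: slide the window; i is 0-indexed inside the loop.
    for i in range(m - n + 1):
        if hash_window == hash_pattern:
            matches.append(i + 1)
        if i + n < m:
            high_bit = int(text_bits[i])
            new_low_bit = int(text_bits[i + n])
            hash_window = (2 * hash_window - high_bit * hash_two_n
                           + new_low_bit) % p
    return matches
```

The returned list always contains every true match; with an unlucky p it may also contain positions where only the hashes agree.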

SLIDE 69

Karp-Rabin Algorithm: Error Probability

1. Set S = ∅. Compute $h_p(T_{1 \ldots n})$, $h_p(2^n)$, and $h_p(P)$.
2. For each i = 1, . . . , m − n + 1:
   1. If $h_p(T_{i \ldots i+n-1}) = h_p(P)$, then S = S ∪ {i}.
   2. Compute $h_p(T_{i+1 \ldots i+n})$ from $h_p(T_{i \ldots i+n-1})$ and $h_p(2^n)$.

Lemma

If there is a match at position i, then i ∈ S. In other words, if $T_{i \ldots i+n-1} = P$, then i ∈ S.

All matched positions are in S. Can S contain unmatched positions? YES! With what probability?

SLIDE 76

Karp-Rabin Algorithm: Error Probability

Pr[S contains an index i, while there is no match at i]

1. Set S = ∅. Compute $h_p(T_{1 \ldots n})$, $h_p(2^n)$, and $h_p(P)$.
2. For each i = 1, . . . , m − n + 1:
   1. If $h_p(T_{i \ldots i+n-1}) = h_p(P)$, then S = S ∪ {i}.
   2. Compute $h_p(T_{i+1 \ldots i+n})$ from $h_p(T_{i \ldots i+n-1})$ and $h_p(2^n)$.

Set M = ⌈2(sn) lg sn⌉. Given x ≠ y, Pr[$h_p(x) = h_p(y)$] ≤ 1/s.

False positive: Pr[S contains an i, while no match at i]

Given $T_{i \ldots i+n-1} \ne P$, Pr[i ∈ S] ≤ 1/s.

Pr[Any index in S is wrong] ≤ m/s (union bound).

To ensure S is correct with probability at least 0.99, we need

$$1 - \frac{m}{s} = 0.99 \iff \frac{m}{s} = \frac{1}{100} \iff s = 100m$$
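The choice of s and M can be written as a small helper. The function name and the `inverse_failure` parameter (100 for the 0.99 target above) are assumptions introduced for this sketch.

```python
import math


def choose_parameters(m, n, inverse_failure=100):
    """Return (s, M) so that the union bound m/s is at most 1/inverse_failure."""
    s = inverse_failure * m
    # M = ceil(2 (s n) lg (s n)), as set above.
    M = math.ceil(2 * s * n * math.log2(s * n))
    return s, M


s, M = choose_parameters(m=1000, n=10)   # toy sizes, for illustration only
```

With s = 100m the range becomes M = ⌈200mn lg 100mn⌉, so the prime p and every fingerprint fit in O(lg m) bits when m ≫ n.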

SLIDE 80

Karp-Rabin Algorithm

Back to running time

Running Time

  • In Step 1, computing $h_p(x)$ for an n-bit x takes O(n) time.
  • Assuming O(lg M)-bit arithmetic can be done in O(1) time: since $h_p(\cdot)$ produces lg M-bit numbers, both steps inside the for loop can be done in O(1) time.
  • Overall O(m + n) time. Can't do better.

M = ⌈200mn lg 100mn⌉ ⇒ lg M = O(lg m)

Even if T is the entire Wikipedia, with bit length $m \approx 2^{38}$, lg M ≈ 64 (assuming bit-length of $n \le 2^{16}$). 64-bit arithmetic is doable on laptops!

SLIDE 81

Take away points

1. Hashing is a powerful and important technique. Many practical applications.
2. Randomization is fundamental to understanding hashing.
3. Good and efficient hashing is possible in theory and practice with proper definitions (universal, perfect, etc.).
4. The related idea of creating a compact fingerprint/sketch for objects is very powerful in theory and practice.