Strings Part 1: Tries and KMP, by Lucca Siaudzionis and Jack Spalding-Jamieson


SLIDE 1

Strings

Part 1: Tries and KMP

Lucca Siaudzionis and Jack Spalding-Jamieson 2020/03/05

University of British Columbia

SLIDE 2

Announcements

  • We’re still finalizing A4. It will be out over the weekend, and you’ll have a little over two weeks to do it.

SLIDE 3

Inspiration

Suppose we want to implement a map: string -> int

  • where we have N string keys and
  • each string has length ≤ M

SLIDE 7

Inspiration

One solution: build a BST of the strings

  • This is essentially what would happen if you were to use map<string, int>

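Time complexity: O(M log N) per lookup or insertion

Space complexity: O(NM) in the worst case, since each of the N nodes stores a key of up to M characters

This doesn’t allow partial prefix matches, which might be useful sometimes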

  • Can we do better?

SLIDE 9

Observation

There are only 26 letters in the alphabet, so there is a lot of repetitive information that would be stored in a BST.

Why don’t we just use the alphabet to form a tree?

SLIDE 10

A Trie is a Tree!

If we build a tree where every node represents a prefix of a word, we have a Trie.

Figure: a trie for the keys to, tea, ted, ten, A, i, in, inn. (Source: Wikipedia)

SLIDE 14

Trie Structure

In the previous example, each node represents a prefix.

  • But we shouldn’t store that entire prefix. (Why?)

We extend a prefix by following an edge labelled with a character.

An entire word is also a prefix of itself, so we flag the nodes at which a word ends.

SLIDE 15

Trie Structure – Implementation

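    #include <string>
    #include <vector>
    using namespace std;

    struct TrieNode {
        bool isWord;                 // true if a word ends at this node
        vector<TrieNode*> children;  // children[i] is the edge for letter 'a' + i
        TrieNode() {
            isWord = false;
            children = vector<TrieNode*>(26, nullptr); // assuming only lowercase letters 'a'..'z'
        }
        // ...
    };

    TrieNode* root = new TrieNode(); // fresh new trie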

SLIDE 16

Trie Lookup

To find out whether a word is present in the trie, we walk along the path in the tree defined by the word.

  • If, at any step, an edge is missing, then the word is not present in the trie.
  • If we reach the end node, but it has isWord false, then the word is not present in the trie.

SLIDE 17

Trie Lookup – Implementation

    // implementation inside TrieNode
    bool find(const string& word) {
        TrieNode* curNode = this;
        for (auto c : word) {
            // a missing edge means the word was never inserted
            if (!curNode->children[c - 'a']) return false;
            curNode = curNode->children[c - 'a'];
        }
        // the path exists, but it only counts if a word ends here
        return curNode->isWord;
    }

SLIDE 18

Trie Insertion

To insert a new word, we walk in the trie along the path defined by that word.

  • Every time an edge is missing, we create a new edge and node and attach it to the node we are currently at.
  • When we reach the end of the word, we flag that node as the end of a word (isWord = true).

SLIDE 19

Trie Insertion – Implementation

    // implementation inside TrieNode struct
    void insert(const string& word) {
        TrieNode* curNode = this;
        for (auto c : word) {
            // if the edge is missing
            if (!curNode->children[c - 'a']) {
                // we create a new node
                curNode->children[c - 'a'] = new TrieNode();
            }
            curNode = curNode->children[c - 'a'];
        }
        // flag the final node as the end of a word
        curNode->isWord = true;
    }
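The structure, lookup, and insertion snippets can be combined into one small self-contained trie. The version below is only a sketch under the same assumption as the slides (keys use only the lowercase letters 'a'..'z'). It swaps the raw pointers for std::unique_ptr so that nodes are freed automatically; that is a choice made for this example, not something the slides prescribe.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>

// Sketch of the trie from the slides; assumes keys use only 'a'..'z'.
struct TrieNode {
    bool isWord = false;                                  // a word ends here
    std::array<std::unique_ptr<TrieNode>, 26> children;   // all null initially

    // Walk the path for `word`, creating missing nodes, then flag the end.
    void insert(const std::string& word) {
        TrieNode* cur = this;
        for (char c : word) {
            assert('a' <= c && c <= 'z');
            auto& child = cur->children[c - 'a'];
            if (!child) child = std::make_unique<TrieNode>();
            cur = child.get();
        }
        cur->isWord = true;
    }

    // True only if the whole path exists and ends on a flagged node.
    bool find(const std::string& word) const {
        const TrieNode* cur = this;
        for (char c : word) {
            if (c < 'a' || c > 'z') return false;  // outside the assumed alphabet
            const auto& child = cur->children[c - 'a'];
            if (!child) return false;              // missing edge
            cur = child.get();
        }
        return cur->isWord;
    }
};
```

With the keys to, tea, ted, ten, i, in, inn inserted, find("tea") holds while find("te") does not: the node for "te" exists, but it is not flagged as a word end.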

SLIDE 21

Trie Deletion and Prefix Match

Prefix match is very similar to the Lookup procedure

  • You should adapt it according to your problem.

There are a few different ways to implement deletion

  • We’ll leave that as an exercise to you. :)
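One possible way to approach both operations is sketched below; it is not the slides' solution. It assumes lowercase keys and adds a per-node counter, passCount (a hypothetical name), holding how many stored words pass through the node. Deletion then only decrements counters and clears the flag, without freeing nodes, and a prefix match checks that the counter is still positive.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>

// Sketch only: trie with prefix matching and counter-based deletion.
struct TrieNode {
    bool isWord = false;
    int passCount = 0;  // number of stored words whose path visits this node
    std::array<std::unique_ptr<TrieNode>, 26> children;

    // Node reached by following `s`, or nullptr if an edge is missing.
    TrieNode* walk(const std::string& s) {
        TrieNode* cur = this;
        for (char c : s) {
            if (c < 'a' || c > 'z') return nullptr;
            cur = cur->children[c - 'a'].get();
            if (!cur) return nullptr;
        }
        return cur;
    }

    bool find(const std::string& word) {
        TrieNode* n = walk(word);
        return n && n->isWord;
    }

    void insert(const std::string& word) {
        if (find(word)) return;  // keep the counters exact for repeated inserts
        TrieNode* cur = this;
        ++cur->passCount;
        for (char c : word) {
            assert('a' <= c && c <= 'z');
            auto& child = cur->children[c - 'a'];
            if (!child) child = std::make_unique<TrieNode>();
            cur = child.get();
            ++cur->passCount;
        }
        cur->isWord = true;
    }

    // Prefix match: same walk as find, but the end node need not be flagged.
    bool hasPrefix(const std::string& prefix) {
        TrieNode* n = walk(prefix);
        return n && n->passCount > 0;
    }

    // Deletion: undo the counters along the path and clear the flag.
    // Returns false if the word was not stored.
    bool erase(const std::string& word) {
        if (!find(word)) return false;
        TrieNode* cur = this;
        --cur->passCount;
        for (char c : word) {
            cur = cur->children[c - 'a'].get();
            --cur->passCount;
        }
        cur->isWord = false;
        return true;
    }
};
```

Leaving the emptied nodes in place keeps erase simple; pruning nodes whose counter drops to zero is an alternative when memory matters.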

SLIDE 22

Discussion Problem

Two players alternate turns adding a letter to a string. After every turn, the string must be a prefix of some word from a given list. The person who adds the last letter of a word loses. If you go first, can you win?

SLIDE 23

Discussion Problem – Insight

Perform tree DP on the trie of all words

  • State: f(node) = can the player about to move win if the current string is at this node?
  • f(trie node that is a word end) = true, because the other player just completed a word and lost
  • f(node) = true if f(child) = false for some child
  • f(node) = false if f(child) = true for every child
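The recurrence can be sketched as a short recursion over the trie. The code below takes f(node) from the point of view of the player about to move, so a word-end node counts as a win for that player (the opponent has just completed a word and lost), which is what the stated losing rule implies. The names addWord, canWin, and firstPlayerWins are hypothetical, and the words are assumed to be non-empty and lowercase.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    bool isWord = false;
    std::array<std::unique_ptr<TrieNode>, 26> children;
};

// Standard trie insertion; assumes only 'a'..'z'.
void addWord(TrieNode& root, const std::string& word) {
    TrieNode* cur = &root;
    for (char c : word) {
        assert('a' <= c && c <= 'z');
        auto& child = cur->children[c - 'a'];
        if (!child) child = std::make_unique<TrieNode>();
        cur = child.get();
    }
    cur->isWord = true;
}

// True if the player about to move from `node` can force a win.
bool canWin(const TrieNode& node) {
    // The previous player just completed a word, so they already lost.
    if (node.isWord) return true;
    // Winning means moving to some child from which the opponent cannot win.
    for (const auto& child : node.children) {
        if (child && !canWin(*child)) return true;
    }
    return false;
}

bool firstPlayerWins(const std::vector<std::string>& words) {
    TrieNode root;
    for (const auto& w : words) addWord(root, w);
    return canWin(root);
}
```

For the single word "abc" the first player is forced to add the final c and loses, while for "ab" the second player is the one who must complete the word.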

SLIDE 24

Exact String Matching

Given a text string T and a pattern P, find all the occurrences of P in T.

  • Let N = length(T) and M = length(P).

SLIDE 26

Exact String Matching – Brute Force

The brute force is intuitive:

  • For every position of T, see if there is a match of P that starts at that position.
  • Implementation is a double for-loop.
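A minimal sketch of that double loop, with text and pattern standing for T and P. The function name bruteForceMatch is hypothetical, and returning no matches for an empty pattern is a choice made here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns every index of `text` at which `pattern` starts.
std::vector<std::size_t> bruteForceMatch(const std::string& text,
                                         const std::string& pattern) {
    std::vector<std::size_t> matches;
    const std::size_t n = text.size();
    const std::size_t m = pattern.size();
    if (m == 0 || m > n) return matches;

    // Outer loop: candidate start positions. Inner loop: compare char by char.
    for (std::size_t start = 0; start + m <= n; ++start) {
        std::size_t j = 0;
        while (j < m && text[start + j] == pattern[j]) ++j;
        if (j == m) matches.push_back(start);
    }
    return matches;
}
```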

Time complexity: O(NM)

Can we do better?

SLIDE 27

Knuth-Morris-Pratt Algorithm (KMP)

The idea of KMP is to find, for every position of T, the longest prefix of P that ends there.

SLIDE 29

KMP – Insight

Say that we know for a fact that the longest prefix of P that ends at the (i − 1)-th character of T has length k.

  • How do we use this to find the longest prefix of P that ends at position i of T?

SLIDE 32

KMP – Insight

Assume for now that k < M (i.e. there was no full match of P). If the longest prefix of P ending at i − 1 has length k, there are two cases to analyze:

  • If P[k] = T[i], then the longest prefix of P ending at i has length k + 1.
  • What if P[k] ≠ T[i]?

SLIDE 38

KMP – Insight

So, we have P[0..k − 1] = T[i − k..i − 1], and P[k] ≠ T[i].

Suppose you knew that the longest proper suffix of P[0..k − 1] that is also a prefix of P has length k2

  • i.e., P[0..k2 − 1] = P[k − k2..k − 1], and k2 is the maximum such number < k.

Then, we have two cases:

  • If P[k2] = T[i], the longest prefix of P ending at i has length k2 + 1.
  • But what if P[k2] ≠ T[i]?
  • Then, we find the longest proper suffix of P[0..k2 − 1] that is also a prefix of P and repeat!

SLIDE 39

KMP – Success and Fail Arrows

Implicitly, what we are doing is building a DFA. Each node represents the length of the prefix of P matched so far. There are two arrows leaving each node:

  • Success: There was a character match, so we increase our longest prefix by 1.
  • Fail: The characters we compared were different, so we move to the longest proper suffix of the current prefix that is also a prefix of P.
  • This is equivalent to setting k ← k2 in the previous slide.

SLIDE 40

KMP – Success and Fail Arrows

For a certain prefix length k, we have:

  • Success[k] = k+1
  • Fail[0] = -1
  • Fail[k] = the next longest prefix that could possibly match if the current match of length k fails to match the next character
  • This is equivalent to the longest proper suffix of this prefix that is also a prefix of P (this is a complicated sentence, read it again slowly)

SLIDE 60

KMP – Fail Arrows

How do we draw the k-th Fail arrow?

  • Follow the (k − 1)-th Fail arrow once, then keep following Fail arrows until the next character matches or the state is −1, then move one step right.

When we see a character c, what is the next state?

  • Follow Fail arrows from the current state until the next character matches c or the state is −1, then move one step right.

SLIDE 61

KMP – String Matching Example

P = “ABABDAB”

T = “ABABABDABC”
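To see what KMP is tracking on this example, the sketch below computes, directly from the definition and without any fail arrows, the length of the longest prefix of P that ends at each position of T. It is deliberately naive and is not KMP itself; the function name longestPrefixEndingAt is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// best[i] = length of the longest prefix of `pattern` that is a suffix of
// text[0..i]. Naive reference implementation of the quantity KMP maintains.
std::vector<std::size_t> longestPrefixEndingAt(const std::string& text,
                                               const std::string& pattern) {
    std::vector<std::size_t> best(text.size(), 0);
    for (std::size_t i = 0; i < text.size(); ++i) {
        const std::size_t maxLen = std::min(pattern.size(), i + 1);
        // Try the longest candidate first and stop at the first that fits.
        for (std::size_t len = maxLen; len > 0; --len) {
            // Compare pattern[0..len) with the len characters of text ending at i.
            if (text.compare(i + 1 - len, len, pattern, 0, len) == 0) {
                best[i] = len;
                break;
            }
        }
    }
    return best;
}
```

For the P and T above this gives 1, 2, 3, 4, 3, 4, 5, 6, 7, 0. The drop from 4 to 3 at the fifth character is a fail arrow being followed, and the value 7 = |P| at index 8 marks the full match starting at index 2.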

SLIDE 72

KMP – Finding a Match

If, when running KMP, we’ve found a match, then there are two scenarios:

  • If we are only interested in finding a single match, we could just stop running KMP.
  • If we want every match, we add this match to a list and follow a single fail arrow.

SLIDE 73

KMP Algorithm: Finding Fail Arrows

    KMP_INIT(W):
        initialize array Fail of size |W|+1
        set Fail[0] = -1
        for i in 1 to |W|:
            let nxt = Fail[i-1]
            while nxt >= 0 && W[nxt] != W[i-1]:
                nxt = Fail[nxt]
            set Fail[i] = nxt + 1
        return Fail
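A direct C++ transcription of KMP_INIT might look as follows. The function name buildFail is an assumption of this sketch; the indexing follows the pseudocode, so fail[k] refers to the prefix of length k.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// fail[k] = length of the longest proper suffix of w[0..k) that is also a
// prefix of w, with fail[0] = -1 as a sentinel.
std::vector<int> buildFail(const std::string& w) {
    std::vector<int> fail(w.size() + 1);
    fail[0] = -1;
    for (std::size_t i = 1; i <= w.size(); ++i) {
        int nxt = fail[i - 1];
        // Follow fail arrows until w[i-1] extends the candidate prefix
        // or we fall off the start.
        while (nxt >= 0 && w[nxt] != w[i - 1]) nxt = fail[nxt];
        fail[i] = nxt + 1;
    }
    return fail;
}
```

For W = "ABABDAB" this yields -1, 0, 0, 1, 2, 0, 1, 2.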

SLIDE 74

KMP Algorithm: Matching

    KMP_MATCH(Fail, W, S):
        initialize cur = 0, matches = empty list
        for i in 0 to |S| - 1:
            while cur >= 0 && W[cur] != S[i]:
                cur = Fail[cur]
            cur = cur + 1
            if cur == |W|:
                add i - |W| + 1 to matches
                cur = Fail[cur]
        return matches
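The matching loop translates to C++ in the same way. In this sketch kmpMatch (a hypothetical name) builds the Fail array itself so that the block is self-contained, and it returns no matches for an empty pattern, which is a choice made here rather than something the pseudocode specifies.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Fail array as in KMP_INIT: fail[0] = -1, fail[k] for the prefix of length k.
std::vector<int> buildFail(const std::string& w) {
    std::vector<int> fail(w.size() + 1);
    fail[0] = -1;
    for (std::size_t i = 1; i <= w.size(); ++i) {
        int nxt = fail[i - 1];
        while (nxt >= 0 && w[nxt] != w[i - 1]) nxt = fail[nxt];
        fail[i] = nxt + 1;
    }
    return fail;
}

// Returns the start index of every occurrence of w in s.
std::vector<std::size_t> kmpMatch(const std::string& w, const std::string& s) {
    std::vector<std::size_t> matches;
    if (w.empty()) return matches;
    const std::vector<int> fail = buildFail(w);
    int cur = 0;  // length of the prefix of w currently matched
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Fail arrows: shrink the match until s[i] extends it or cur == -1.
        while (cur >= 0 && w[cur] != s[i]) cur = fail[cur];
        ++cur;  // success arrow
        if (cur == static_cast<int>(w.size())) {
            matches.push_back(i + 1 - w.size());
            cur = fail[cur];  // follow one fail arrow to keep searching
        }
    }
    return matches;
}
```

On the earlier example, kmpMatch("ABABDAB", "ABABABDABC") reports a single match starting at index 2.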

SLIDE 77

The KMP Algorithm: Time Complexity Analysis

Finding fail arrows:

  • To build the next arrow, start at the end point of the previous arrow, take 0 or more back arrows, and 1 forward arrow.
  • Every backward arrow moves back at least 1 state
  • No arrow moves past −1
  • ⇒ Backward arrows ≤ forward arrows = O(M)

Finding search string in target string of size N

  • To get the next state, take 0 or more back arrows, then 1 forward arrow
  • ⇒ Similar reasoning gives no more than O(N) arrows

⇒ Time complexity is O(M + N)

SLIDE 78

Discussion Problem: Wildcards

Find at least one occurrence of S1 in string S2. Catch: a * in S1 matches any sequence of characters.

  S1      S2        Match?
  aa*b    aab       Yes
          aacdab    Yes
          caaccbd   Yes
          aacdaa    No
          acccccb   No

Figure 1: Example of wildcard matching

How many * can you handle?

SLIDE 79

Discussion Problem: Wildcards – Insight

  • Split S1 at each * into pieces T1, T2, ..., Tk
  • Find the first copy of T1, then the first copy of T2 after T1, and so on
  • Time complexity: still O(M + N) when each piece is found with KMP
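The greedy idea can be sketched as below. For brevity this version uses std::string::find as a stand-in for the per-piece search; the standard does not promise linear time for find, so reaching the O(M + N) bound would mean swapping in a KMP search for each piece. The function name wildcardOccurs is hypothetical.

```cpp
#include <cstddef>
#include <string>

// True if `pattern` (where '*' matches any sequence of characters, possibly
// empty) occurs somewhere in `text`. Greedy: always take the earliest match
// of each piece after the end of the previous one.
bool wildcardOccurs(const std::string& pattern, const std::string& text) {
    std::size_t pos = 0;    // earliest index in text where the next piece may start
    std::size_t start = 0;  // start of the current piece inside pattern
    while (start <= pattern.size()) {
        std::size_t star = pattern.find('*', start);
        if (star == std::string::npos) star = pattern.size();
        const std::string piece = pattern.substr(start, star - start);
        if (!piece.empty()) {
            const std::size_t found = text.find(piece, pos);
            if (found == std::string::npos) return false;
            pos = found + piece.size();
        }
        start = star + 1;  // skip past the '*'
    }
    return true;
}
```

Taking the earliest match of each piece is safe because it leaves the longest possible remainder of the text for the pieces that follow.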

SLIDE 80

Jack’s Weekend Recommendation

Air Force One

A little-known but stellar action film with a healthy dose of way too much Americana.

SLIDE 81

Lucca’s Weekend Recommendation

Casino Royale

This, in my opinion, is the best action film ever.
