Strings Part 1: Tries and KMP
Lucca Siaudzionis and Jack Spalding-Jamieson, 2020/03/05, University of British Columbia


  1. Strings Part 1: Tries and KMP Lucca Siaudzionis and Jack Spalding-Jamieson 2020/03/05 University of British Columbia

  2. Announcements
     • We’re still finalizing A4. It will be out on the weekend and you’ll have a little over two weeks to do it.

  3. Inspiration
     Suppose we want to implement a map: string -> int
     • where we have N string keys and
     • each string has length ≤ M

  7. Inspiration
     One solution: build a BST of the strings.
     • This is essentially what would happen if you were to use map<string, int>.
     Time complexity: O(M log N) per lookup or insertion
     Space complexity: O(N · M) in the worst case, since up to N keys of length up to M are stored
     This doesn’t directly support partial prefix matches, which might be useful sometimes.
     • Can we do better?
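
For reference, the BST-style baseline can be sketched with std::map, which is typically implemented as a balanced binary search tree. The helper name countKeys is hypothetical and only serves the illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: counts how often each key occurs using std::map.
std::map<std::string, int> countKeys(const std::vector<std::string>& keys) {
    std::map<std::string, int> counts;
    for (const std::string& key : keys) {
        // Each access performs O(log N) string comparisons, and every
        // comparison may inspect up to M characters: O(M log N) in total.
        ++counts[key];
    }
    return counts;
}
```

Every key is stored in full inside the tree, which is the repetition a trie tries to avoid.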

  9. Observation
     There are only 26 letters in the alphabet, so there is a lot of repetitive information that would be stored in a BST.
     Why don’t we just use the alphabet to form a tree?

  10. A Trie is a Tree!
      If we build a tree where every node represents a prefix of a word, we have a Trie.
      [Figure: a trie for the keys to, tea, ted, ten, A, i, in, inn. Source: Wikipedia]

  14. Trie Structure
      In the previous example, each node represents a prefix.
      • But we shouldn’t store that entire prefix. (Why?)
      We expand a prefix with an edge containing a character.
      An entire word is also a prefix of itself, so we flag some nodes to indicate that a word ends there.

  15. Trie Structure – Implementation
        #include <vector>
        using namespace std;

        struct TrieNode {
          bool isWord;                 // true if a word ends at this node
          vector<TrieNode*> children;  // children[i] is the edge for letter 'a' + i
          TrieNode() {
            isWord = false;
            children = vector<TrieNode*>(26, nullptr); // assuming only ...
          }
          // ...
        };

        TrieNode* root = new TrieNode(); // fresh new trie

  16. Trie Lookup
      To find out if a word is present in the trie, we just walk along the path in the tree defined by the word.
      • If, at any step, an edge is missing, then the word is not present in the trie.
      • If we reach the end node, but it has isWord set to false, then the word is not present in the trie.

  17. Trie Lookup – Implementation
        // implementation inside TrieNode
        bool find(const string& word) {
          TrieNode* curNode = this;
          for (auto c : word) {
            // a missing edge means no stored word has this prefix
            if (!curNode->children[c - 'a']) return false;
            curNode = curNode->children[c - 'a'];
          }
          // the path exists, but it only counts if a word ends here
          return curNode->isWord;
        }

  18. Trie Insertion
      To insert a new word, we walk in the trie along the path defined by that word.
      • Every time an edge is missing, we create a new edge and node, attaching it to the node we are currently at.
      • When we reach the end of the word, we mark the final node as the end of a word (isWord = true).

  19. Trie Insertion – Implementation
        // implementation inside TrieNode struct
        void insert(const string& word) {
          TrieNode* curNode = this;
          for (auto c : word) {
            // if the edge is missing
            if (!curNode->children[c - 'a']) {
              // we create a new node
              curNode->children[c - 'a'] = new TrieNode();
            }
            curNode = curNode->children[c - 'a'];
          }
          // flag the last node as the end of a word
          curNode->isWord = true;
        }
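
The lookup and insertion routines above can be combined into one small self-contained sketch. This is not the slides' exact code: the wrapper name Trie is hypothetical, std::unique_ptr and std::array replace the raw pointers and vector so that nodes are freed automatically, and the asserts make the assumed lowercase 'a'..'z' alphabet explicit.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>

// One node per prefix; edges are indexed by letter (assumes 'a'..'z' only).
struct TrieNode {
    bool isWord = false;  // true if an inserted word ends at this node
    std::array<std::unique_ptr<TrieNode>, 26> children;
};

// Hypothetical wrapper that owns the root node.
struct Trie {
    TrieNode root;

    void insert(const std::string& word) {
        TrieNode* cur = &root;
        for (char c : word) {
            assert(c >= 'a' && c <= 'z');
            std::unique_ptr<TrieNode>& child = cur->children[c - 'a'];
            if (!child) child = std::make_unique<TrieNode>();  // missing edge
            cur = child.get();
        }
        cur->isWord = true;
    }

    bool find(const std::string& word) const {
        const TrieNode* cur = &root;
        for (char c : word) {
            assert(c >= 'a' && c <= 'z');
            const std::unique_ptr<TrieNode>& child = cur->children[c - 'a'];
            if (!child) return false;  // path breaks: word is absent
            cur = child.get();
        }
        return cur->isWord;  // path exists, but a word must end here
    }
};
```

After inserting "tea", a call to find("te") returns false: the path exists, but no word ends at that node. Both operations take O(M) time for a word of length M, independent of the number of stored keys.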

  21. Trie Deletion and Prefix Match
      Prefix match is very similar to the Lookup procedure.
      • You should adapt it according to your problem.
      There are a few different ways to implement deletion.
      • We’ll leave that as an exercise to you. :)
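
As one possible adaptation of lookup to prefix matching, the sketch below counts how many inserted words start with a given prefix. The names CountingNode, CountingTrie, wordsBelow and countWithPrefix are hypothetical, the alphabet is assumed to be lowercase 'a'..'z', and a word inserted twice is counted twice.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>

struct CountingNode {
    int wordsBelow = 0;   // number of inserted words whose path passes through here
    bool isWord = false;  // true if an inserted word ends at this node
    std::array<std::unique_ptr<CountingNode>, 26> children;
};

struct CountingTrie {
    CountingNode root;

    void insert(const std::string& word) {
        CountingNode* cur = &root;
        ++cur->wordsBelow;  // every word passes through the root
        for (char c : word) {
            assert(c >= 'a' && c <= 'z');
            std::unique_ptr<CountingNode>& child = cur->children[c - 'a'];
            if (!child) child = std::make_unique<CountingNode>();
            cur = child.get();
            ++cur->wordsBelow;
        }
        cur->isWord = true;
    }

    // Same walk as lookup, but the answer is the counter at the final node
    // instead of its isWord flag.
    int countWithPrefix(const std::string& prefix) const {
        const CountingNode* cur = &root;
        for (char c : prefix) {
            assert(c >= 'a' && c <= 'z');
            const std::unique_ptr<CountingNode>& child = cur->children[c - 'a'];
            if (!child) return 0;  // no inserted word has this prefix
            cur = child.get();
        }
        return cur->wordsBelow;
    }
};
```

With the lowercase keys to, tea, ted, ten, in, inn inserted, countWithPrefix("te") returns 3. Deletion would have to decrement the same counters along the path, which is one reason to choose the stored data with the target problem in mind.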

  22. Discussion Problem
      Two-player game where you alternate turns adding a letter to a string. At every turn, the string must be a prefix of some word from a given list. The person who adds the last letter of a word loses. If you go first, can you win?

  23. Discussion Problem – Insight
      Perform tree DP on the trie of all words.
      • State: f(node) = can the player about to move win if the current string is this node?
      • f(trie node that is a word end) = true, because the opponent has just added the last letter of a word and lost
      • Otherwise, f(node) = true if f(child) = false for some child
      • Otherwise, f(node) = false if f(child) = true for all children
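
A minimal sketch of this tree DP follows, under the rule as stated in the problem (whoever adds the last letter of a word loses). Here f(node) is read as "the player about to move from this node can win", so a node that ends a word counts as a win for that player, because the opponent has just completed a word. The names GameNode, canWin and firstPlayerWins are hypothetical, and the words are assumed to be non-empty and lowercase.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

struct GameNode {
    bool isWord = false;
    std::array<int, 26> next;  // index of the child node, or -1 if the edge is missing
    GameNode() { next.fill(-1); }
};

// True if the player about to move from `node` can force a win.
bool canWin(const std::vector<GameNode>& trie, int node) {
    // The opponent has just added the last letter of a word, so they lost.
    if (trie[node].isWord) return true;
    for (int child : trie[node].next) {
        // Moving to a child that is losing for the opponent wins the game.
        if (child != -1 && !canWin(trie, child)) return true;
    }
    return false;  // every available move hands the opponent a winning position
}

bool firstPlayerWins(const std::vector<std::string>& words) {
    std::vector<GameNode> trie(1);  // node 0 is the root (empty string)
    for (const std::string& w : words) {
        int cur = 0;
        for (char c : w) {
            assert(c >= 'a' && c <= 'z');
            if (trie[cur].next[c - 'a'] == -1) {
                trie[cur].next[c - 'a'] = static_cast<int>(trie.size());
                trie.emplace_back();
            }
            cur = trie[cur].next[c - 'a'];
        }
        trie[cur].isWord = true;
    }
    return canWin(trie, 0);
}
```

For the list {"ab"} the first player wins by adding 'a', which forces the opponent to finish the word. For {"abc"} the first player is the one forced to add 'c' and loses.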

  24. Exact String Matching
      Given a text string T and a pattern P, find all the occurrences of P in T.
      • Let N = length(T) and M = length(P).

  26. Exact String Matching – Brute Force
      The brute force is intuitive:
      • For every position of T, see if there is a match of P that starts at that position.
      • Implementation is a double for-loop.
      Time complexity: O(NM)
      Can we do better?
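
The double loop described above might look like the following sketch. The function name bruteForceMatch is hypothetical, match positions are reported as 0-based start indices, and the pattern is assumed to be non-empty.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns every 0-based index s such that text[s .. s+M-1] equals pattern.
std::vector<int> bruteForceMatch(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    std::vector<int> matches;
    for (int start = 0; start + m <= n; ++start) {  // outer loop: candidate start
        int j = 0;
        while (j < m && text[start + j] == pattern[j]) ++j;  // inner loop: compare
        if (j == m) matches.push_back(start);
    }
    return matches;
}
```

A text such as "aaaa...a" with the pattern "aa...ab" makes the inner loop run almost M steps for almost every start, which is where the O(NM) bound comes from.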

  27. Knuth-Morris-Pratt Algorithm (KMP)
      The idea of KMP is to find, for every position i of T, the longest prefix of P that ends there (that is, the longest prefix of P that is also a suffix of T[0..i]).

  29. KMP – Insight
      Say that we know for a fact that the longest prefix of P that ends at position i − 1 of T has length equal to k.
      • How do we use this to find the longest prefix of P that ends at position i of T?

  32. KMP – Insight
      Assume for now that k < M (i.e. there was no full match of P). If the longest prefix of P ending at i − 1 has length k, there are two cases to analyze:
      • If P[k] = T[i], then the longest prefix of P ending at i has length k + 1.
      • What if P[k] ≠ T[i]?

  38. KMP – Insight
      So, we have P[0..k−1] = T[i−k..i−1], and P[k] ≠ T[i].
      Suppose you knew that the longest proper suffix of P[0..k−1] that is also a prefix of P has length k₂
      • i.e., P[0..k₂−1] = P[k−k₂..k−1], and k₂ is the maximum such number < k.
      Then, we have two cases:
      • If P[k₂] = T[i], the longest prefix of P ending at i has length k₂ + 1.
      • But what if P[k₂] ≠ T[i]?
      • Then, we find the longest proper suffix of P[0..k₂−1] that is also a prefix of P and repeat! If the length drops to 0 and P[0] ≠ T[i], no non-empty prefix of P ends at i.
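
The fallback lengths used above depend only on P, so they can be precomputed for every prefix length with the same "fall back and retry" idea applied to P itself. The sketch below is one common way to do this; the names buildFail and fail are hypothetical, and fail[k] is indexed by prefix length k.

```cpp
#include <cassert>
#include <string>
#include <vector>

// fail[k] = length of the longest proper suffix of P[0..k-1] that is also
// a prefix of P (0 if there is none). fail[0] and fail[1] are both 0.
std::vector<int> buildFail(const std::string& p) {
    const int m = static_cast<int>(p.size());
    std::vector<int> fail(m + 1, 0);
    int k = 0;  // invariant at the top of each iteration: k == fail[i]
    for (int i = 1; i < m; ++i) {
        // P[k] fails to extend the current candidate: fall back to a shorter one.
        while (k > 0 && p[i] != p[k]) k = fail[k];
        if (p[i] == p[k]) ++k;
        fail[i + 1] = k;
        assert(fail[i + 1] <= i);  // a proper suffix is shorter than the prefix
    }
    return fail;
}
```

For P = "abab" this yields fail = [0, 0, 0, 1, 2]: the prefix "abab" of length 4 falls back to "ab" of length 2. The total work is O(M), since k increases at most once per iteration and every fallback decreases it.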

  39. KMP – Success and Fail Arrows
      Implicitly, what we are doing is building a DFA. Each node represents a current prefix length of P. There are two arrows leaving each node:
      • Success: there was a character match, and we increase our longest prefix by 1.
      • Fail: the characters we compared were different, so we move to the longest proper suffix of the current prefix that is also a prefix of P.
      • This is equivalent to setting k ← k₂ in the previous slide.
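
Putting the two arrows together gives the full search. The following is a sketch rather than the authors' code: kmpSearch and fail are hypothetical names, the state k is the current prefix length, matches are reported as 0-based start indices, and the pattern is assumed to be non-empty.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns every 0-based start index of pattern p inside text t.
std::vector<int> kmpSearch(const std::string& t, const std::string& p) {
    assert(!p.empty());
    const int n = static_cast<int>(t.size());
    const int m = static_cast<int>(p.size());

    // Fail arrows: fail[k] is the state to fall back to from state k.
    std::vector<int> fail(m + 1, 0);
    for (int i = 1, k = 0; i < m; ++i) {
        while (k > 0 && p[i] != p[k]) k = fail[k];
        if (p[i] == p[k]) ++k;
        fail[i + 1] = k;
    }

    std::vector<int> matches;
    int k = 0;  // length of the longest prefix of p ending at the previous position
    for (int i = 0; i < n; ++i) {
        // State m has no success arrow, so leave it through its fail arrow first.
        if (k == m) k = fail[k];
        while (k > 0 && t[i] != p[k]) k = fail[k];  // follow fail arrows
        if (t[i] == p[k]) ++k;                      // follow the success arrow
        if (k == m) matches.push_back(i - m + 1);   // full match ends at i
    }
    return matches;
}
```

The state k grows by at most 1 per text character and each fail arrow strictly decreases it, so the search takes O(N + M) time overall, compared with O(NM) for the brute force.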
