Strings Part 1: Tries and KMP
Lucca Siaudzionis and Jack Spalding-Jamieson
2020/03/05
University of British Columbia

Announcements
• We're still finalizing A4. It will be out over the weekend, and you'll have a little over two weeks to do it.

Inspiration
Suppose we want to implement a map: string -> int
• where we have N string keys and
• each string has length ≤ M

Inspiration
One solution: build a BST of the strings
• This is essentially what would happen if you were to use map<string, int>
Time complexity: O(M log N) per operation, since each of the O(log N) comparisons can cost O(M)
Space complexity: O(MN), since each of the N keys stores up to M characters
This doesn't allow partial prefix matches, which might be useful sometimes
• Can we do better?
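
For reference, the BST baseline can be sketched with the standard ordered map. This is only an illustration of the cost model above; the helper name lookupOrDefault is hypothetical and not part of the lecture.

```cpp
#include <map>
#include <string>

// Baseline: an ordered map from string keys to int values.
// std::map is a balanced search tree, so a lookup performs O(log N)
// string comparisons, and each comparison can cost up to O(M) characters.
// lookupOrDefault is a hypothetical helper name used for illustration.
int lookupOrDefault(const std::map<std::string, int>& table,
                    const std::string& key, int fallback) {
    auto it = table.find(key);
    return it == table.end() ? fallback : it->second;
}
```

Looking up "te" in a table that holds only "tea" and "ten" returns the fallback, because find only reports exact key matches.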

Observation
There are only 26 letters in the alphabet, so there is a lot of repetitive information that would be stored in a BST.
Why don't we just use the alphabet to form a tree?

A Trie is a Tree!
If we build a tree where every node represents a prefix of a word, we have a Trie.
[Figure: a trie for the keys to, tea, ted, ten, A, i, in, inn. Source: Wikipedia]

Trie Structure
In the previous example, each node represents a prefix.
• But we shouldn't store that entire prefix. (Why?)
We expand a prefix with an edge containing a character.
An entire word is also a prefix of itself, so we flag some nodes to indicate that they contain the end of a word.

Trie Structure – Implementation
    #include <string>
    #include <vector>
    using namespace std;

    struct TrieNode {
        bool isWord;                 // true if a word ends at this node
        vector<TrieNode*> children;  // one slot per letter
        TrieNode() {
            isWord = false;
            children = vector<TrieNode*>(26, nullptr); // assuming only lowercase letters 'a'..'z'
        }
        // ...
    };

    TrieNode* root = new TrieNode(); // fresh new trie

Trie Lookup
To find out whether a word is present in the trie, we just walk along the path in the tree defined by the word.
• If, at any step, an edge is missing, then the word is not present in the trie.
• If we reach the end node, but it has isWord false, then the word is not present in the trie.

Trie Lookup – Implementation
    // implementation inside TrieNode
    bool find(const string& word) const {
        const TrieNode* curNode = this;
        for (auto c : word) {
            // a missing edge means the word is not in the trie
            if (!curNode->children[c - 'a']) return false;
            curNode = curNode->children[c - 'a'];
        }
        return curNode->isWord;
    }

Trie Insertion
To insert a new word, we walk in the trie along the path defined by that word
• Every time an edge is missing, we create a new edge and node, appending it to the current node we are scanning.
• When we reach the end of the word, we set isWord on the final node to mark the word as present.

Trie Insertion – Implementation
    // implementation inside TrieNode struct
    void insert(const string& word) {
        TrieNode* curNode = this;
        for (auto c : word) {
            // if the edge is missing
            if (!curNode->children[c - 'a']) {
                // we create a new node
                curNode->children[c - 'a'] = new TrieNode();
            }
            curNode = curNode->children[c - 'a'];
        }
        curNode->isWord = true;
    }
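
Putting lookup and insertion together, a self-contained variant might look like the sketch below. It is not the lecture's code: it swaps the raw pointers for std::unique_ptr so nodes are freed automatically, wraps the root in a hypothetical Trie struct, and assumes keys contain only lowercase 'a'..'z'.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>

// One node per prefix; children[i] is the edge labelled 'a' + i.
struct TrieNode {
    bool isWord = false;
    std::array<std::unique_ptr<TrieNode>, 26> children;
};

// Hypothetical wrapper; assumes keys use only lowercase 'a'..'z'.
struct Trie {
    TrieNode root;

    void insert(const std::string& word) {
        TrieNode* cur = &root;
        for (char c : word) {
            assert(c >= 'a' && c <= 'z');
            auto& next = cur->children[c - 'a'];
            if (!next) next = std::make_unique<TrieNode>();  // create missing edge
            cur = next.get();
        }
        cur->isWord = true;  // flag the end of the word
    }

    bool find(const std::string& word) const {
        const TrieNode* cur = &root;
        for (char c : word) {
            assert(c >= 'a' && c <= 'z');
            const auto& next = cur->children[c - 'a'];
            if (!next) return false;  // missing edge: word absent
            cur = next.get();
        }
        return cur->isWord;  // a bare prefix does not count as a word
    }
};
```

After inserting "tea" and "ten", find("tea") succeeds while find("te") fails, because the node for "te" exists but is not flagged as a word end. Both operations take O(M) time regardless of N.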

Trie Deletion and Prefix Match
Prefix match is very similar to the Lookup procedure
• You should adapt it according to your problem.
There are a few different ways to implement deletion
• We'll leave that as an exercise for you. :)
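
As one possible adaptation of lookup for prefix matching, the sketch below keeps a counter on every node recording how many inserted keys pass through it. The names PrefixNode, PrefixTrie, passCount and countWithPrefix are hypothetical, and keys are assumed to contain only lowercase 'a'..'z'.

```cpp
#include <array>
#include <memory>
#include <string>

struct PrefixNode {
    int passCount = 0;  // number of inserted keys whose path visits this node
    bool isWord = false;
    std::array<std::unique_ptr<PrefixNode>, 26> children;
};

// Hypothetical trie variant; assumes lowercase 'a'..'z' keys only.
struct PrefixTrie {
    PrefixNode root;

    // Every call counts, so inserting the same key twice counts it twice.
    void insert(const std::string& word) {
        PrefixNode* cur = &root;
        cur->passCount++;
        for (char c : word) {
            auto& next = cur->children[c - 'a'];
            if (!next) next = std::make_unique<PrefixNode>();
            cur = next.get();
            cur->passCount++;
        }
        cur->isWord = true;
    }

    // Same walk as lookup, but the answer is read from passCount
    // instead of isWord: how many inserted keys start with `prefix`.
    int countWithPrefix(const std::string& prefix) const {
        const PrefixNode* cur = &root;
        for (char c : prefix) {
            const auto& next = cur->children[c - 'a'];
            if (!next) return 0;
            cur = next.get();
        }
        return cur->passCount;
    }
};
```

The only difference from exact lookup is what is returned at the end of the walk, which is why the two procedures are so similar.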

Discussion Problem
Two-player game where you alternate turns adding a letter to a string. At every turn, the string must be a prefix of some word from a given list. The person who adds the last letter of a word loses. If you go first, can you win?

Discussion Problem – Insight
Perform tree DP on the trie of all words
• State: f(node) = can the player about to move from this node force a win?
• f(trie node that is a word end) = true, because the opponent just completed a word and lost
• f(node) = true if f(child) = false for some child
• f(node) = false if f(child) = true for all children
• The first player wins exactly when f(root) = true
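
A minimal sketch of this tree DP, read from the point of view of the player about to move. Under that reading a word-end node is a win for the player to move, because the opponent has just completed a word and lost. The names GameNode, addWord, canWin and firstPlayerWins are hypothetical, and the word list is assumed to contain only non-empty lowercase words.

```cpp
#include <array>
#include <memory>
#include <string>
#include <vector>

struct GameNode {
    bool isWord = false;
    std::array<std::unique_ptr<GameNode>, 26> children;
};

// Standard trie insertion; assumes lowercase 'a'..'z' only.
void addWord(GameNode& root, const std::string& word) {
    GameNode* cur = &root;
    for (char c : word) {
        auto& next = cur->children[c - 'a'];
        if (!next) next = std::make_unique<GameNode>();
        cur = next.get();
    }
    cur->isWord = true;
}

// True if the player about to move from `node` can force a win.
bool canWin(const GameNode& node) {
    // The previous player just completed a word, so they have lost.
    if (node.isWord) return true;
    // Winning means moving to some child where the opponent cannot win.
    for (const auto& child : node.children) {
        if (child && !canWin(*child)) return true;
    }
    return false;
}

// Assumes a non-empty list of non-empty words.
bool firstPlayerWins(const std::vector<std::string>& words) {
    GameNode root;
    for (const auto& w : words) addWord(root, w);
    return canWin(root);
}
```

With the single word "ab" the first player wins (the second player is forced to add 'b'), while with the single word "abc" the first player is forced to add the final 'c' and loses. Each node is visited once, so the DP is linear in the size of the trie; the recursion depth is at most M.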

Exact String Matching
Given a text string T and a pattern P, find all the occurrences of P in T.
• Let N = length(T) and M = length(P).

Exact String Matching – Brute Force
The brute force is intuitive:
• For every position of T, see if there is a match of P that starts at that position.
• Implementation is a double for-loop.
Time complexity: O(NM)
Can we do better?
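
The double loop can be sketched as follows. The function name bruteForceMatch is hypothetical, and returning no matches for an empty pattern is an assumption made here for simplicity.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns every index of `text` at which `pattern` starts.
// Worst case O(N * M): each of up to N start positions compares up to M characters.
std::vector<std::size_t> bruteForceMatch(const std::string& text,
                                         const std::string& pattern) {
    std::vector<std::size_t> hits;
    const std::size_t n = text.size();
    const std::size_t m = pattern.size();
    if (m == 0 || m > n) return hits;  // assumption: empty pattern yields no matches
    for (std::size_t start = 0; start + m <= n; ++start) {
        std::size_t j = 0;
        while (j < m && text[start + j] == pattern[j]) ++j;
        if (j == m) hits.push_back(start);  // all M characters matched
    }
    return hits;
}
```

A text such as "aaaa...a" with the pattern "aa...ab" is a worst case: almost the whole pattern is compared at every start position before the mismatch is found.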

Knuth-Morris-Pratt Algorithm (KMP)
The idea of KMP is to find, for every position of T, the longest prefix of P that ends there.

KMP – Insight
Say that we know for a fact that the longest prefix of P that ends at the (i − 1)-th character of T has length equal to k.
• How do we use this to find the longest prefix of P that ends at position i of T?

KMP – Insight
Assume for now that k < M (i.e. there was no full match of P). If the longest prefix of P ending at i − 1 has length k, there are two cases to analyze:
• If P[k] = T[i], then the longest prefix of P ending at i has length k + 1.
• What if P[k] ≠ T[i]?

KMP – Insight
So, we have P[0..k−1] = T[i−k..i−1], and P[k] ≠ T[i].
Suppose you knew that the longest suffix of P[0..k−1] that is also a prefix of P has length k2
• i.e., P[0..k2−1] = P[k−k2..k−1], and k2 is the maximum such number < k.
Then, we have two cases:
• If P[k2] = T[i], the longest prefix of P ending at i has length k2 + 1.
• But what if P[k2] ≠ T[i]?
• Then, we find the longest suffix of P[0..k2−1] that is also a prefix and repeat!
• If we fall back all the way to length 0 and still P[0] ≠ T[i], the longest prefix of P ending at i has length 0.
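
The value k2 depends only on P and k, so it can be precomputed for every prefix length. The sketch below does this by running the same fall-back argument on P against itself. The name buildFail and the choice to index the table by prefix length are assumptions made for this example.

```cpp
#include <string>
#include <vector>

// fail[k] = length of the longest proper suffix of P[0..k-1] that is also
// a prefix of P (the value called k2 above). fail[0] = fail[1] = 0.
std::vector<int> buildFail(const std::string& p) {
    const int m = static_cast<int>(p.size());
    std::vector<int> fail(m + 1, 0);
    int k = 0;  // longest prefix of P ending at position i - 1 (as a proper suffix)
    for (int i = 1; i < m; ++i) {
        // Mismatch: fall back to the next shorter candidate and retry.
        while (k > 0 && p[k] != p[i]) k = fail[k];
        // Match: the candidate grows by one character.
        if (p[k] == p[i]) ++k;
        fail[i + 1] = k;
    }
    return fail;
}
```

For P = "abab" the table is 0, 0, 0, 1, 2: the prefix "aba" has border "a" and the full pattern has border "ab". The inner loop only ever decreases k, and k grows by at most one per outer iteration, so the whole table is built in O(M).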

KMP – Success and Fail Arrows
Implicitly, what we are doing is building a DFA. Each node represents a current prefix-length of P. There are two arrows leaving each node:
• Success: there was a character match, and we increase our longest prefix by 1.
• Fail: the characters we compared were different, so we move to the longest suffix of the current prefix that is also a prefix of P.
• This is equivalent to setting k ← k2 as in the previous slide.
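
A sketch of the full matcher that follows success and fail arrows while scanning T. The name kmpSearch is hypothetical; the fail table is indexed by prefix length, and an empty pattern is assumed to produce no matches. After a full match the state follows the fail arrow immediately, which restores the k < M assumption used earlier.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns every index of `text` at which `pattern` starts, in O(N + M).
std::vector<std::size_t> kmpSearch(const std::string& text,
                                   const std::string& pattern) {
    std::vector<std::size_t> hits;
    const int m = static_cast<int>(pattern.size());
    if (m == 0) return hits;  // assumption: empty pattern yields no matches

    // Fail arrows: fail[k] = longest proper suffix of P[0..k-1]
    // that is also a prefix of P.
    std::vector<int> fail(m + 1, 0);
    for (int i = 1, k = 0; i < m; ++i) {
        while (k > 0 && pattern[k] != pattern[i]) k = fail[k];
        if (pattern[k] == pattern[i]) ++k;
        fail[i + 1] = k;
    }

    int k = 0;  // current node: longest prefix of P ending at the previous position
    for (std::size_t i = 0; i < text.size(); ++i) {
        while (k > 0 && pattern[k] != text[i]) k = fail[k];  // fail arrow
        if (pattern[k] == text[i]) ++k;                      // success arrow
        if (k == m) {
            hits.push_back(i + 1 - m);  // match ends at i, so it starts at i - M + 1
            k = fail[k];                // keep k < M for the next character
        }
    }
    return hits;
}
```

Searching "abababa" for "aba" reports the overlapping matches at indices 0, 2 and 4. The text index never moves backwards; only k falls back, which is what brings the running time down from O(NM) to O(N + M).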
