
Optimal Matching For Sliding Window Compression With Suffix Tries



  1. Optimal Matching For Sliding Window Compression With Suffix Tries • An advantage of sliding window compression is fast decoding. • However, encoding is a different story. • Simple hashing schemes such as that employed by gzip work well in practice, but from a theoretical point of view do not give true linear time encoding. • A suffix trie is a powerful data structure for effectively representing all substrings of a given string, using only linear space. It can be used to give a true linear time implementation of sliding window compression encoding.

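  2. Compact Tries Idea: Compact tries are like regular tries except that edges from a vertex to its children are labeled by strings that start with distinct characters. Example: The compact trie for { aaccaa, aaccab, abc, bb, bca, bcbb, c }: [Figure: compact trie whose root has edges a, b, and c; below a are edges acca (leading to leaf edges a and b) and bc; below b are edges b and c (the latter leading to leaf edges a and bb).]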
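The example set above can be rebuilt with a short sketch. This is an illustration only, not code from the slides: the names `Node`, `insert`, and `edges` are hypothetical, and edge labels are stored as explicit strings (the space-saving integer-pair labels come later).

```python
class Node:
    def __init__(self):
        self.children = {}     # first character -> (edge label, child Node)
        self.terminal = False  # True if one of the stored strings ends here


def insert(root, word):
    """Insert word, splitting an edge when only part of its label matches."""
    node = root
    while word:
        first = word[0]
        if first not in node.children:
            leaf = Node()
            leaf.terminal = True
            node.children[first] = (word, leaf)
            return
        label, child = node.children[first]
        k = 0  # length of the common prefix of label and word
        while k < len(label) and k < len(word) and label[k] == word[k]:
            k += 1
        if k < len(label):
            # Split the edge: a new vertex takes the matched part of the label.
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            node.children[first] = (label[:k], mid)
            child = mid
        word = word[k:]
        node = child
    node.terminal = True


def edges(node):
    """Yield every edge label in preorder, children in alphabetical order."""
    for first in sorted(node.children):
        label, child = node.children[first]
        yield label
        yield from edges(child)


root = Node()
for w in ["aaccaa", "aaccab", "abc", "bb", "bca", "bcbb", "c"]:
    insert(root, w)
labels = list(edges(root))
print(labels)
```

Because children are keyed by the first character of their edge label, the "distinct first characters" property holds by construction.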

  3. Reduced space when the strings are substrings of a master string: • If the strings are explicitly saved, then the space consumed is still proportional to the space used in the uncompacted trie. • However, if the trie is used to represent substrings of a "master" string s that is already stored, edge labels can be pairs of integers indicating a position and length in s, and hence the space per edge is constant. Since each non-leaf vertex has at least two children (so more than half the vertices are leaves), the space for the entire trie is proportional to the space consumed by the leaves. Note: Labels on edges may not be unique. It may be appropriate for a particular application to use some standard convention (for example, when we consider sliding suffix tries, labels will always refer to the rightmost occurrence).

  4. Suffix Tries A suffix trie stores all suffixes of the string s$, where $ is a new symbol. Below is the suffix trie for the string s$ = abcbbacbbab$, with characters indexed 1 through 12: [Figure: suffix trie for abcbbacbbab$ with 12 leaves, each labeled by the starting position of its suffix; leaf edges carry labels such as c(8,5), c(4,9), and b(12,1).] Note: For presentation, edges to a non-leaf vertex are labeled with the full string and edges to leaves are labeled with the first character of the string followed by a (position, length) reference to a suffix of s. In practice, all edges can be labeled by a pair of integers representing a position and length in s$.
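As a small illustration of the (position, length) convention, the sketch below expands three leaf labels that appear in the figure for abcbbacbbab$. The helper name `edge_label` is hypothetical, and positions are taken to be 1-based as on the slide.

```python
def edge_label(text, position, length):
    """Return the substring named by a 1-based (position, length) pair."""
    return text[position - 1 : position - 1 + length]


text = "abcbbacbbab$"
expanded = {}
for first, pair in [("c", (8, 5)), ("c", (4, 9)), ("b", (12, 1))]:
    # A leaf label is its first character followed by the referenced suffix.
    expanded[(first, pair)] = first + edge_label(text, *pair)
    print(first, pair, "->", expanded[(first, pair)])
```

Each expanded label is itself a suffix of s$, which is why a constant-size pair is enough to describe an edge of any length.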


  6. Example Applications of Suffix Tries • Find all occurrences of a string t in a string s: Move down from the root in the suffix trie for s according to t. If the match fails before finishing t (a leaf is passed, or no edge continues with the next character of t), t is not a substring of s. Otherwise, after reaching a vertex v, the leaves below v are exactly the suffixes of s that have t as a prefix (and so each suffix starting position is where t matches). • Find the longest repeated string in a string s: Traverse the suffix trie (e.g., preorder traversal) to find the non-leaf vertex of greatest virtual depth, where virtual depth is the length of the string spelled from the root. • Find the longest common substring of a string s and a string t: Construct the suffix trie for s#t$. Then traverse this suffix trie in post-order and label each vertex as to whether it has descendants in only s, in only t, or in both s and t (once this is known for the children, it is easily computed for the parent), to find the non-leaf vertex of greatest virtual depth that has both. • Sort all suffixes of a string s: Traverse the suffix trie for s in lexicographic order.
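The first three applications can be sketched as follows. For brevity this sketch swaps in an uncompacted suffix trie built from nested dictionaries (quadratic space), not the compact form used on the slides; the function names are hypothetical, and the input is assumed to end with a unique terminator.

```python
def build_suffix_trie(text):
    """Uncompacted suffix trie as nested dicts keyed by character.
    The key None marks a leaf and stores the 1-based suffix position."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node[None] = i + 1
    return root


def find_all(root, t):
    """Walk down according to t, then collect every leaf below that vertex."""
    node = root
    for ch in t:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        for key, child in stack.pop().items():
            if key is None:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


def longest_repeated(root):
    """The deepest vertex with two or more children spells the answer."""
    best, stack = "", [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(node) >= 2 and len(path) > len(best):
            best = path
        stack.extend((c, path + ch) for ch, c in node.items() if ch is not None)
    return best


def longest_common(s, t):
    """Post-order labeling on the trie for s#t$ (s and t must avoid # and $)."""
    root = build_suffix_trie(s + "#" + t + "$")
    boundary = len(s) + 1  # 1-based position of '#'
    best = ""

    def visit(node, path):
        nonlocal best
        in_s = in_t = False
        for ch, child in node.items():
            if ch is None:
                in_s = in_s or child < boundary
                in_t = in_t or child > boundary
            else:
                a, b = visit(child, path + ch)
                in_s, in_t = in_s or a, in_t or b
        if in_s and in_t and len(path) > len(best):
            best = path
        return in_s, in_t

    visit(root, "")
    return best


root = build_suffix_trie("abcbbacbbab$")
print(find_all(root, "bb"), longest_repeated(root), longest_common("xabcy", "zabcw"))
```

In the uncompacted trie, the vertices with two or more children correspond to the non-leaf vertices of the compact trie, so the same "deepest branching vertex" rule applies.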

  7. Simple Suffix Trie Construction Algorithm Idea: At stage i, add the leaf for position i by modifying the suffix trie already constructed for positions 1 through i-1 of s$. Input characters will be read more than once because input is read forward from position i to insert the new suffix, and then we back up to position i+1 to start the next step. Basic brute force algorithm (let n denote the length of s$):

         Initialize an empty suffix trie.
         for i := 1 to n do begin
             cursor := i
             v := SCAN(root)
             Label v with position i.
         end

  8. procedure SCAN(x): Starting at vertex x and reading the input character pointed to by cursor, move down the suffix trie and keep reading input characters (and advancing cursor) to create a new leaf in one of two ways: 1. A vertex is reached that does not have a child for the current input character (and so a new leaf can be added to that vertex); return this new leaf. 2. The string labeling an edge cannot be fully matched; the edge is "split" and a new non-leaf vertex is added, with one child being the edge labeled by the remainder of the string and the other child the new leaf; return this new leaf. Time: In the worst case, O(i) time could be required to process the i-th character, for a total of O(n^2) time. Space: O(1) space in addition to the O(n) space for the suffix trie.
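The brute force construction and the two cases of SCAN can be sketched as below. This is a minimal illustration under stated assumptions, not the presenter's code: the names `Node`, `build_suffix_trie`, and `suffix_order` are hypothetical, edge labels are stored as a 0-based start index plus a length, leaves are labeled with 1-based positions, and the input must end with a unique terminator.

```python
class Node:
    def __init__(self, start=0, length=0):
        self.start = start    # 0-based index where the incoming edge label begins
        self.length = length  # length of the incoming edge label
        self.children = {}    # first character of edge label -> child Node
        self.position = None  # 1-based suffix start, set on leaves only


def build_suffix_trie(text):
    """Quadratic-time construction: insert each suffix starting at the root."""
    root, n = Node(), len(text)
    for i in range(n):
        node, cursor = root, i
        while True:
            child = node.children.get(text[cursor])
            if child is None:
                # Case 1: no child for the current character, so add a leaf here.
                leaf = Node(cursor, n - cursor)
                node.children[text[cursor]] = leaf
                break
            k = 0  # number of characters matched along this edge
            while k < child.length and text[child.start + k] == text[cursor + k]:
                k += 1
            if k == child.length:
                node, cursor = child, cursor + k  # whole edge matched; keep going
                continue
            # Case 2: mismatch inside the edge, so split it with a new vertex.
            mid = Node(child.start, k)
            node.children[text[cursor]] = mid
            child.start += k
            child.length -= k
            mid.children[text[child.start]] = child
            leaf = Node(cursor + k, n - cursor - k)
            mid.children[text[cursor + k]] = leaf
            break
        leaf.position = i + 1
    return root


def suffix_order(node):
    """Leaf positions in lexicographic order of their suffixes."""
    if node.position is not None:
        return [node.position]
    order = []
    for ch in sorted(node.children):
        order.extend(suffix_order(node.children[ch]))
    return order


text = "abcbbacbbab$"
root = build_suffix_trie(text)
print(suffix_order(root))
```

Reading the leaves in alphabetical child order yields the sorted suffixes, which gives a convenient check that every suffix was inserted exactly once.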

  9. Maximum match length: • SCAN can be modified to limit leaf depth to some maximum depth m. • If a leaf corresponds to a string that occurs more than once in s, then each time it is visited, it can be re-labeled with the rightmost position at which this string has occurred thus far.

  10. McCreight's Suffix Trie Construction Idea: • At stage i, add the leaf for position i by modifying the suffix trie already constructed for positions 1 through i-1 of s$. • Input characters may be read more than once because input is read forward from position i to insert the new suffix, and then we back up to position i+1 to start the next step. Definition: An uncle of a non-leaf vertex v in a suffix trie is a vertex corresponding to the same string less the first character. McCreight's algorithm: • Achieves linear time by employing uncle links. • Although it is not necessarily the case that each stage of the algorithm is O(1) time, the time to construct the suffix trie for n characters is O(n). • The details must be carefully worked out.

  11. McCreight's Algorithm - basic idea: • The suffix for position i is like that for position i-1 except that the leftmost character is removed. • Instead of starting at the root each time to insert the next suffix, maintain uncle links so that "shortcuts" can be taken. • If the parent v of the previous leaf added does not already have an uncle link, follow the uncle link of v's parent (which will always be present); this ends up higher in the trie than desired, so the string t that labels the edge from the parent p of v to v must be rescanned. • Add the new uncle link from v to its uncle for future use, scan down to add a new leaf for position i, and then move up to its parent, which becomes the new vertex v for the next iteration of the algorithm. [Figure: general case. Paths lead from the root to p and to its uncle u, joined by an uncle link; the edge labeled t leads from p to the previous v, and a RESCAN of the substring t from u reaches w; a new uncle link joins the previous v to w; a SCAN from w adds the new leaf, whose parent is the new v; the previous leaf hangs below the previous v.]

  12. Note: • The previous figure shows the general case, where solid lines are single edges and dashed lines denote a path of one or more edges. • There are two special cases. When v is the root: simply scan down from the root. When v is a child of the root: rescan t less its first character, since the uncle link from the root to itself goes to a string of the same length (length 0) rather than to one that is one character shorter (as with all other uncle links). Key invariant: At each stage, every non-leaf vertex, except possibly for one created at the last step, has an uncle link.

  13. The McCreight Algorithm

          Initialize the trie to a single root vertex v.
          cursor := 1
          for i := 1 to n do begin
              if v is the root then w := v
              else if v already has an uncle link then w := UNCLE(v)
              else begin
                  p := PARENT(v)
                  u := UNCLE(p)
                  k := the length of the string t labeling the edge from p to v
                  if p is the root then begin k := k - 1; cursor := cursor + 1 end
                  w := RESCAN(u, k)
                  Add an uncle link from v to w.
              end
              v := SCAN(w)
              Label v with position i.
              v := PARENT(v)
              cursor := cursor - 1
          end

  14. function RESCAN(x, k): • Starting at vertex x and scanning the input symbol pointed to by the cursor, advance the cursor as we move down the trie. • Each edge is traversed in constant time because the suffix trie has already been constructed for this portion of the input, and it suffices to match the first character of the edge label to the input (and check that its length does not cause the total length traversed to exceed k). • The vertex where the rescan ends is returned (a new vertex is created if it ends in the middle of an edge).
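Putting SCAN, RESCAN, and uncle links together, a McCreight-style construction might look like the sketch below. It is an illustration under assumptions rather than a transcription of the slides: instead of an explicit cursor it derives the input position from a virtual depth field stored at each vertex, all names are hypothetical, indices are 0-based internally with 1-based leaf labels, and the input must end with a unique terminator.

```python
class Node:
    def __init__(self, start, end, parent):
        self.start, self.end = start, end  # incoming edge label is text[start:end]
        self.parent = parent
        self.children = {}                 # first label character -> child
        self.uncle = None                  # uncle link (non-leaf vertices only)
        self.depth = 0                     # virtual depth: characters from the root
        self.position = None               # 1-based suffix start (leaves only)


def build_mccreight(text):
    n = len(text)
    root = Node(0, 0, None)
    root.uncle = root  # the root's uncle link points to itself

    def split(child, k):
        """Insert a vertex k characters down the edge entering child."""
        parent = child.parent
        mid = Node(child.start, child.start + k, parent)
        mid.depth = parent.depth + k
        parent.children[text[mid.start]] = mid
        child.start += k
        child.parent = mid
        mid.children[text[child.start]] = child
        return mid

    def rescan(node, start, length):
        """Follow text[start:start+length], known to be present, edge by edge."""
        while length > 0:
            child = node.children[text[start]]
            edge_len = child.end - child.start
            if edge_len > length:
                return split(child, length)  # rescan ends in the middle of an edge
            node, start, length = child, start + edge_len, length - edge_len
        return node

    def scan(node, cursor):
        """Match input character by character from cursor, then add a leaf."""
        while text[cursor] in node.children:
            child = node.children[text[cursor]]
            edge_len = child.end - child.start
            k = 0
            while k < edge_len and text[child.start + k] == text[cursor + k]:
                k += 1
            cursor += k
            if k < edge_len:
                node = split(child, k)
                break
            node = child
        leaf = Node(cursor, n, node)
        leaf.depth = node.depth + n - cursor
        node.children[text[cursor]] = leaf
        return leaf

    v = root  # parent of the most recently added leaf
    for i in range(n):
        if v is root:
            w = root
        elif v.uncle is not None:
            w = v.uncle
        else:
            p = v.parent
            start, length = v.start, v.end - v.start
            if p is root:  # special case: drop the first character of t
                start, length = start + 1, length - 1
            w = rescan(p.uncle, start, length)
            v.uncle = w
        leaf = scan(w, i + w.depth)  # the first w.depth characters are already matched
        leaf.position = i + 1
        v = leaf.parent
    return root


def suffix_order(node):
    """Leaf positions in lexicographic order of their suffixes."""
    if node.position is not None:
        return [node.position]
    return [p for ch in sorted(node.children) for p in suffix_order(node.children[ch])]


text = "abcbbacbbab$"
print(suffix_order(build_mccreight(text)))
```

Storing the virtual depth at each vertex makes the cursor adjustments implicit: after arriving at w, the next unread input character for suffix i is always at offset w.depth from the start of that suffix.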

  15. McCreight's Algorithm Complexity (assuming the alphabet size is constant) Time is O(n): • Main loop, excluding calls to SCAN and RESCAN: Each iteration is O(1), for a total of O(n). • Calls to SCAN: Each step of SCAN reads a new input symbol or creates a new trie vertex (or both), so the total time for all calls to SCAN is O(n). • Calls to RESCAN: Traversal of each edge is O(1) since only the first character of its label has to be matched (the position in the input can be calculated using virtual depth fields). If a RESCAN visits more than one edge, each edge except the last corresponds to positions of the input that will never be rescanned or scanned again (since each step either follows v's uncle link or that of the parent of v, but goes no higher in the trie), and hence the time for all calls to RESCAN is O(n). Space is also O(n).
