Optimal Matching For Sliding Window Compression With Suffix Tries



SLIDE 1

Optimal Matching For Sliding Window Compression With Suffix Tries

  • An advantage of sliding window compression is fast decoding.
  • However, encoding is a different story.
  • Simple hashing schemes such as the one employed by gzip work well in practice, but from a theoretical point of view do not give true linear time encoding.
  • A suffix trie is a powerful data structure for effectively representing all substrings of a given string, using only linear space. It can be used to give a true linear time implementation of sliding window compression encoding.

SLIDE 2

Compact Tries

Idea: Compact tries are like regular tries except that edges from a vertex to its children are labeled by strings that start with distinct characters.

Example: The compact trie for {aaccaa, aaccab, abc, bb, bca, bcbb, c}:

[Figure: compact trie. The root has edges a, b, c. Below a: acca (with children a and b) and bc. Below b: b and c (with children a and bb).]
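The example can be reproduced with a small sketch. This is one possible way to build a compact trie by splitting edges on insertion, not necessarily how the slides intend it; the names `Node`, `insert`, and `edges` are made up for illustration, and edge labels are stored as explicit strings for readability.

```python
class Node:
    def __init__(self):
        # Maps the first character of an edge label to (label, child).
        self.children = {}
        self.is_word = False


def insert(root, word):
    """Insert word into the compact trie rooted at root."""
    node = root
    while word:
        first = word[0]
        if first not in node.children:
            leaf = Node()
            leaf.is_word = True
            node.children[first] = (word, leaf)
            return
        label, child = node.children[first]
        # Length of the common prefix of the edge label and the remaining word.
        k = 0
        while k < len(label) and k < len(word) and label[k] == word[k]:
            k += 1
        if k < len(label):
            # Split the edge: a new middle vertex keeps the old child below it.
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            node.children[first] = (label[:k], mid)
            child = mid
        node, word = child, word[k:]
    node.is_word = True


def edges(node, depth=0):
    """Yield (depth, label) for every edge in preorder, children in character order."""
    for first in sorted(node.children):
        label, child = node.children[first]
        yield depth, label
        yield from edges(child, depth + 1)


root = Node()
for w in ["aaccaa", "aaccab", "abc", "bb", "bca", "bcbb", "c"]:
    insert(root, w)
for depth, label in edges(root):
    print("  " * depth + label)
```

The printed edge labels are a, acca, a, b, bc, b, b, c, a, bb, c, indented by depth, which is the same set of labels that appears in the figure.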

SLIDE 3

Reduced space when the strings are substrings of a master string:

  • If the strings are explicitly saved, then the space consumed is still proportional to the space used in the uncompacted trie.
  • However, if the trie is used to represent substrings of a "master" string s that is already stored, edge labels can be pairs of integers indicating a position and length in s, and hence the space per edge is constant. Since each non-leaf vertex has at least two children (so more than half the vertices are leaves), the space for the entire trie is proportional to the space consumed by the leaves.

Note: Labels on edges may not be unique, since the same string can occur at several positions in s. It may be appropriate for a particular application to use some standard convention (for example, when we consider sliding suffix tries, labels will always refer to the rightmost occurrence).

SLIDE 4

Suffix Tries

A suffix trie stores all suffixes of the string s$ where $ is a new symbol. Below is the suffix trie for the string s = abcbbacbbab$:

  a  b  c  b  b  a  c  b  b  a   b   $
  1  2  3  4  5  6  7  8  9  10  11  12

[Figure: the suffix trie, with twelve leaves labeled by the suffix start positions 1 through 12. Edges to non-leaf vertices carry full strings such as ba and cbba; edges to leaves carry labels such as c(4,9), c(8,5), and b(12,1).]

Note: For presentation, edges to a non-leaf vertex are labeled with the full string and edges to leaves are labeled with the first character of the string followed by a (position, length) reference to a suffix of s. In practice, all edges can be labeled by a pair of integers representing a position and length in s$.
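To make the leaf-edge notation concrete, the following sketch expands a few of the labels from the figure. The helper name `label` is an assumption made for illustration; positions are 1-based as on the slide.

```python
def label(text, position, length):
    """Expand a 1-based (position, length) reference into the substring it denotes."""
    return text[position - 1:position - 1 + length]


text = "abcbbacbbab$"
# c(4,9), c(8,5) and b(12,1) are leaf-edge labels taken from the figure.
for first, (position, length) in [("c", (4, 9)), ("c", (8, 5)), ("b", (12, 1))]:
    print(first + label(text, position, length))
# prints cbbacbbab$, cbbab$ and b$
```

Each expanded label is the tail of a suffix of the string: cbbacbbab$ starts at position 3, cbbab$ at position 7, and b$ at position 11.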

SLIDE 6

Example Applications of Suffix Tries

Find all occurrences of a string t in a string s: Move down from the root in the suffix trie for s according to t. If a mismatch occurs, or a leaf is reached before finishing t, then t is not a substring of s. Otherwise, after reaching a vertex v, all of the leaves below v are suffixes of s that have t as a prefix (and so each suffix starting position is a position where t matches).

Find the longest repeated string in a string s: Traverse the suffix trie (e.g., preorder traversal) to find the deepest non-leaf vertex, that is, the one of greatest virtual depth (the length of the string spelled out from the root).

Find the longest common substring of a string s and a string t: Construct the suffix trie for s#t$. Then traverse this suffix trie in post-order and label each vertex as to whether it has descendants in only s, in only t, or in both s and t (once you know this information for the children, it is easily computed for the parent), to find the non-leaf vertex of greatest virtual depth that has both.

Sort all suffixes of a string s: Traverse the suffix trie for s in lexicographic order.
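The first two applications can be tried out with a deliberately naive stand-in: an uncompacted suffix trie that uses quadratic space, built by inserting every suffix character by character. It is only meant to illustrate the traversals, not the linear-space structure described above. The names `Node`, `build_suffix_trie`, `occurrences`, and `longest_repeated` are assumptions, and s is assumed not to contain `$`.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> Node
        self.start = None    # 1-based suffix start position, set on leaves only


def build_suffix_trie(s):
    """Uncompacted suffix trie of s + '$' (quadratic size, for illustration only)."""
    text = s + "$"
    root = Node()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, Node())
        node.start = i + 1
    return root


def occurrences(root, t):
    """Return the sorted 1-based positions at which t occurs."""
    node = root
    for ch in t:
        if ch not in node.children:
            return []          # mismatch: t is not a substring
        node = node.children[ch]
    found, stack = [], [node]
    while stack:               # collect every leaf below the vertex reached
        v = stack.pop()
        if v.start is not None:
            found.append(v.start)
        stack.extend(v.children.values())
    return sorted(found)


def longest_repeated(root):
    """Return the string of the deepest vertex that has at least two children."""
    best, stack = "", [(root, "")]
    while stack:
        v, prefix = stack.pop()
        if len(v.children) >= 2 and len(prefix) > len(best):
            best = prefix
        for ch, child in v.children.items():
            stack.append((child, prefix + ch))
    return best


root = build_suffix_trie("abcbbacbbab")
print(occurrences(root, "ba"))   # [5, 9]
print(longest_repeated(root))    # cbba
```

Vertices with two or more children in the uncompacted trie correspond to the non-leaf vertices of the compact suffix trie, which is why the second function only considers those.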

SLIDE 7

Simple Suffix Trie Construction Algorithm

Idea: At stage i, add the leaf for position i by modifying the suffix trie already constructed for positions 1 through i-1 of s$. Input characters will be read more than once because input is read forward from position i to insert the new suffix, and then we back up to position i+1 to start the next step.

Basic brute force algorithm, where n denotes the length of s$:

    Initialize an empty suffix trie.
    for i := 1 to n do begin
        cursor := i
        v := SCAN(root)
        Label v with position i.
    end

SLIDE 8

procedure SCAN(x): Starting at vertex x reading the input character pointed to by cursor, move down the suffix trie and keep reading input characters (and advancing cursor) to create a new leaf in one of two ways:

  • 1. A vertex is reached that does not have a child for the current input character (and so a new leaf can be added to that vertex); return this new leaf.
  • 2. The string labeling an edge cannot be matched, so the edge is "split": a new non-leaf vertex is added with one child being the edge labeled by the remainder of the string, and the other child the new leaf; return this new leaf.

Time: In the worst case, O(n) time could be required at a single stage (for example, when s is a run of one repeated character), for a total of O(n^2) time.

Space: O(1) space in addition to the O(n) space for the suffix trie.

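A minimal Python sketch of the brute force construction follows, assuming 0-based (start, length) edge labels into the stored text and a text that ends with a unique `$`. The names `Vertex`, `scan`, `build`, and `leaves` are illustrative choices, not taken from the slides, and the cursor is passed as an argument rather than kept as shared state.

```python
class Vertex:
    def __init__(self, start=0, length=0):
        self.start = start      # 0-based start of the incoming edge label in the text
        self.length = length    # length of the incoming edge label
        self.children = {}      # first character of an edge label -> child vertex
        self.position = None    # 1-based suffix start, set on leaves only


def scan(text, x, cursor):
    """Insert text[cursor:] below vertex x and return the new leaf."""
    n, v = len(text), x
    while True:
        ch = text[cursor]
        child = v.children.get(ch)
        if child is None:
            # Case 1: no child for the current character, so add a leaf here.
            leaf = Vertex(cursor, n - cursor)
            v.children[ch] = leaf
            return leaf
        k = 0
        while k < child.length and text[child.start + k] == text[cursor + k]:
            k += 1
        if k == child.length:
            v, cursor = child, cursor + k   # whole edge matched, keep descending
            continue
        # Case 2: mismatch inside the edge, so split it after k characters.
        mid = Vertex(child.start, k)
        v.children[ch] = mid
        child.start += k
        child.length -= k
        mid.children[text[child.start]] = child
        leaf = Vertex(cursor + k, n - cursor - k)
        mid.children[text[cursor + k]] = leaf
        return leaf


def build(s):
    text = s + "$"
    root = Vertex()
    for i in range(len(text)):
        scan(text, root, i).position = i + 1
    return root, text


def leaves(text, v, prefix=""):
    """Yield (position, suffix) for each leaf, visiting children in character order."""
    label = text[v.start:v.start + v.length]
    if not v.children:
        yield v.position, prefix + label
    for ch in sorted(v.children):
        yield from leaves(text, v.children[ch], prefix + label)


root, text = build("abcbbacbbab")
for position, suffix in leaves(text, root):
    print(position, suffix)
```

Because children are visited in character order, the loop at the end also prints the suffixes in sorted order (with `$` sorting before the letters in this encoding).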
SLIDE 9

Maximum match length:

  • SCAN can be modified to limit leaf depth to some maximum depth m.
  • If a leaf corresponds to a string that occurs more than once in s, then each time it is visited, it can be re-labeled with the rightmost position this string has occurred thus far.
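The effect of both modifications can be seen in a simplified sketch. It uses an uncompacted trie rather than a modified SCAN, so it only illustrates the depth limit and the relabeling, not the space bound. The names `Node`, `build_limited`, and `leaf_table` are assumptions, and the text is assumed to end with a unique `$`.

```python
class Node:
    def __init__(self):
        self.children = {}     # character -> Node
        self.position = None   # rightmost 1-based start position (leaves only)


def build_limited(text, m):
    """Insert text[i:i+m] for every i; each leaf remembers its rightmost start."""
    root = Node()
    for i in range(len(text)):
        node = root
        for ch in text[i:i + m]:
            node = node.children.setdefault(ch, Node())
        node.position = i + 1   # a revisited leaf is relabeled with the newer position
    return root


def leaf_table(node, prefix=""):
    """Return a dict mapping each leaf's string to its stored position."""
    if not node.children:
        return {prefix: node.position}
    table = {}
    for ch, child in node.children.items():
        table.update(leaf_table(child, prefix + ch))
    return table


table = leaf_table(build_limited("abababab$", 3))
print(table["aba"], table["bab"])   # 5 6
```

With m = 3, the string aba occurs at positions 1, 3, and 5, and its single leaf ends up labeled with 5, the rightmost of them.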

SLIDE 10

McCreight's Suffix Trie Construction

Idea:

  • At stage i, add the leaf for position i by modifying the suffix trie already constructed for positions 1 through i-1 of s$.
  • Input characters may be read more than once because input is read forward from position i to insert the new suffix, and then we back up to position i+1 to start the next step.

Definition: The uncle of a non-leaf vertex v in a suffix trie is the vertex corresponding to the same string less the first character. (Elsewhere this pointer is commonly called a suffix link.)

McCreight's algorithm:

  • Achieves linear time by employing uncle links.
  • Although it is not necessarily the case that each stage of the algorithm is O(1) time, the time to construct the suffix trie for n characters is O(n).
  • The details must be carefully worked out.
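The definition of an uncle can be checked on the earlier example string without building a trie at all. The sketch below relies on the fact that a string labels a non-leaf vertex of the suffix trie exactly when it is followed by at least two different characters in the text. The function name `internal_strings` is an assumption, and the quadratic enumeration is for illustration only.

```python
from collections import defaultdict


def internal_strings(text):
    """Return the strings of all non-leaf vertices of the suffix trie of text.

    text is assumed to end with a unique terminator such as '$'.
    """
    followers = defaultdict(set)
    n = len(text)
    for i in range(n):
        for j in range(i, n):
            # text[i:j] is followed by text[j] at this occurrence.
            followers[text[i:j]].add(text[j])
    return {w for w, nxt in followers.items() if len(nxt) >= 2}


strings = internal_strings("abcbbacbbab$")
uncles = {w: w[1:] for w in strings if w}   # uncle = same string less its first character
for w in sorted(uncles, key=len):
    print(repr(w), "->", repr(uncles[w]))
```

For this string the non-leaf vertices are the root (the empty string), a, b, ab, ba, bba, and cbba, and every uncle is again one of them, for example cbba -> bba -> ba -> a -> root.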

SLIDE 11

McCreight's Algorithm - basic idea:

  • The suffix for position i is like that for position i-1 except that the leftmost character is removed.
  • Instead of starting at the root each time to insert the next suffix, maintain uncle links so that "shortcuts" can be taken.
  • If the parent v of the previous leaf added does not already have an uncle link, follow the uncle link of v's parent p (which will always be present). This lands at a vertex u that is higher in the trie than desired, so the string t that labels the edge from p to v has to be rescanned from u to reach the uncle w of v.
  • Add the new uncle link from v to w for future use, scan down from w to add a new leaf for position i, and then move up to its parent, which becomes the new vertex v for the next iteration of the algorithm.

[Figure: the general case. Vertices root, p, u, w, the previous v with the previous leaf, and the new v with the new leaf; the existing uncle link from p to u and the new uncle link from v to w; the substring t is RESCANned from u to w, followed by a SCAN from w down to the new leaf.]

SLIDE 12

Note:

  • The previous figure shows the general case, where solid lines are single edges and dashed lines denote a path of one or more edges.
  • The two special cases are:

    When v is the root: Simply scan down from the root.

    When v is a child of the root: Rescan t less its first character, since the uncle link from the root to itself goes to a string of the same length (length 0) rather than to one that is one character shorter (as with all other uncle links).

Key invariant: At each stage, every non-leaf vertex, except possibly for one created at the last step, has an uncle link.

SLIDE 13

The McCreight Algorithm

    Initialize the trie to a single root vertex v.
    cursor := 1
    for i := 1 to n do begin
        if v is the root then w := v
        else if v already has an uncle link then w := UNCLE(v)
        else begin
            p := PARENT(v)
            u := UNCLE(p)
            k := the length of the string t labeling the edge from p to v
            if p is the root then begin k := k-1; cursor := cursor+1 end
            w := RESCAN(u,k)
            Add an uncle link from v to w.
        end
        v := SCAN(w)
        Label v with position i.
        v := PARENT(v)
        cursor := cursor-1
    end
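One possible Python rendering of this pseudocode is sketched below; it is not a reference implementation. It departs from the slide in one respect: instead of incrementing and decrementing a shared cursor, it recomputes the input position from each vertex's virtual depth. The names `Vertex`, `split`, `depth`, `link`, and `leaves` are assumptions, edge labels are 0-based (start, length) pairs, and the input is assumed not to contain `$`.

```python
class Vertex:
    def __init__(self, parent, start, length):
        self.parent = parent
        self.start = start        # edge label is text[start:start + length]
        self.length = length
        self.depth = (parent.depth if parent is not None else 0) + length  # virtual depth
        self.children = {}        # first character of an edge label -> child
        self.link = None          # uncle link
        self.position = None      # 1-based suffix start, set on leaves only


def split(text, child, k):
    """Create a vertex k characters down the edge leading to child and return it."""
    parent = child.parent
    mid = Vertex(parent, child.start, k)
    parent.children[text[mid.start]] = mid
    child.parent, child.start, child.length = mid, child.start + k, child.length - k
    mid.children[text[child.start]] = child
    return mid


def rescan(text, x, cursor, k):
    """Walk k characters down from x, checking only the first character of each edge."""
    while k > 0:
        child = x.children[text[cursor]]
        if child.length > k:
            return split(text, child, k)   # the rescan ends in the middle of an edge
        x, cursor, k = child, cursor + child.length, k - child.length
    return x


def scan(text, x, cursor):
    """Match text[cursor:] character by character below x and return the new leaf."""
    while True:
        child = x.children.get(text[cursor])
        if child is None:
            break
        k = 0
        while k < child.length and text[child.start + k] == text[cursor + k]:
            k += 1
        if k < child.length:
            x, cursor = split(text, child, k), cursor + k
            break
        x, cursor = child, cursor + k
    leaf = Vertex(x, cursor, len(text) - cursor)
    x.children[text[cursor]] = leaf
    return leaf


def mccreight(s):
    text = s + "$"
    root = Vertex(None, 0, 0)
    root.link = root
    v = root
    for i in range(len(text)):
        if v is root:
            w = root
        elif v.link is not None:
            w = v.link
        else:
            p = v.parent
            u = p.link
            k = v.length - 1 if p is root else v.length
            w = rescan(text, u, i + u.depth, k)
            v.link = w
        leaf = scan(text, w, i + w.depth)
        leaf.position = i + 1
        v = leaf.parent
    return root, text


def leaves(text, v, prefix=""):
    """Yield (position, suffix) for every leaf below v."""
    label = text[v.start:v.start + v.length]
    if not v.children:
        yield v.position, prefix + label
    for ch in sorted(v.children):
        yield from leaves(text, v.children[ch], prefix + label)


root, text = mccreight("abcbbacbbab")
print(sorted(leaves(text, root)))
```

A simple sanity check is that the leaves spell out exactly the suffixes of the text, each labeled with its own start position, which is the same result the brute force construction gives.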

SLIDE 14

function RESCAN(x,k):

  • Starting at vertex x scanning the input symbol pointed to by the cursor, advance the cursor as we move down the trie.
  • Each edge is traversed in constant time because the suffix trie has already been constructed for this portion of the input, and it suffices to match the first character of the edge label to the input (and check that its length does not cause the total length traversed to exceed k).
  • The vertex where the rescan ends is returned (a new vertex is created if it ends in the middle of an edge).

SLIDE 15

McCreight's Algorithm Complexity

(assuming the alphabet size is constant)

Time is O(n):

Main loop, excluding calls to SCAN and RESCAN: Each iteration is O(1), for a total of O(n).

Calls to SCAN: Each step of SCAN reads a new input symbol or creates a new trie vertex (or both), so the total time for all calls to SCAN is O(n).

Calls to RESCAN: Traversal of each edge is O(1) since only the first character of its label has to be matched (the position in the input can be calculated using virtual depth fields). If a RESCAN visits more than one edge, every edge except the last corresponds to positions of the input that will never be rescanned or scanned again (since each step either follows v's uncle link or that of v's parent, but goes no higher in the trie), and hence the time for all calls to RESCAN is O(n).

Space is also O(n).

SLIDE 16

Sliding Suffix Tries:

  • For sliding window compression, we not only build the suffix trie as we go, but also trim it so that we have a suffix trie for just the window and not all the way back to the start of the input.
  • One approach is to recycle copies of the data structure.
  • Another approach is the Fiala and Greene algorithm, which removes strings as they fall off the left end of the window in amortized linear time.

SLIDE 17

Sliding Window With Two McCreight Tries

Idea: Start a new McCreight suffix trie for each block of n characters. Only the most recent two tries need to be stored; the "real" window of length n will overlap the window of 2n represented by these two tries, and both tries must be used when searching for strings in the window.

[Figure: input positions 1, n, 2n, 3n on a line, with T1, T2, and T3 covering successive blocks of n characters.]

Two-trie algorithm:

  • 1. Run McCreight's algorithm for n steps starting at position 1 to form T1. Do a post-order traversal of T1 to update the labels of edges and vertices.
  • 2. Run McCreight's algorithm for n steps starting at position n+1 to form T2. Do a post-order traversal of T2 to update the labels of edges and vertices.
  • 3. Reclaim the memory for T1 and construct T3 for positions 2n+1 to 3n. Do a post-order traversal of T3 to update the labels of edges and vertices. Etc.

SLIDE 18

Using the two tries for sliding window compression:

[Figure: the window of the n characters preceding position i, overlapping the end of T(k-1), which covers positions j-n through j-1, and the start of Tk, which covers positions j through j+n-1.]

  • When at a position i and we wish to search for a string t in the preceding n characters, for some k, Tk represents positions j through j+n-1 for some position j ≤ i ≤ j+n-1, and T(k-1) represents positions j-n through j-1.
  • So we can search for t by looking in Tk, and by also looking in T(k-1) by starting at the root and going down as far as possible until we get to a leaf, or must stop at a non-leaf vertex because going down to the appropriate child goes to a vertex representing a string at a position earlier than i-n.

Complexity: O(n) time and space.

Note: Buffering is needed to allow it to scan and rescan ahead as necessary to insert strings into the current trie at least up to the current position i.
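The index arithmetic behind this search can be written out as a short sketch. It only computes which positions of each trie fall inside the window and does not perform the search itself. The function name `window_sources` is an assumption, positions are 1-based, and a range whose lower end exceeds its upper end is meant to be read as empty.

```python
def window_sources(i, n):
    """Return (k, j, current, previous) for position i and block size n.

    current  is the inclusive range of window positions covered by Tk.
    previous is the inclusive range of window positions covered by T(k-1).
    """
    k = (i - 1) // n + 1                # index of the trie whose block contains i
    j = (k - 1) * n + 1                 # first position of that block
    current = (j, i - 1)                # part of the window inside Tk
    previous = (max(i - n, 1), j - 1)   # part of the window inside T(k-1)
    return k, j, current, previous


print(window_sources(21, 8))   # (3, 17, (17, 20), (13, 16))
```

With n = 8 and i = 21, the window is positions 13 through 20: positions 17 to 20 are found through T3, and positions 13 to 16 through T2, where anything earlier than i-n = 13 must be rejected.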

SLIDE 19

Fiala-Greene Sliding Suffix Trie Algorithm

Idea:

  • Use McCreight's algorithm, and delete leaves when they become obsolete.
  • We need to keep vertex and edge labels updated to refer to the rightmost occurrence seen thus far of the corresponding string.
  • However, with McCreight's algorithm, RESCAN and SCAN start at non-leaf vertices that may be deep in the trie, and even if they update labels, the labels above them in the trie will be missed.
  • Moving up to the root to fix labels each time a RESCAN / SCAN is done would be too time consuming.
  • Instead use a "lazy" method of "percolating" updates that has amortized linear time.

Additions to McCreight's data structure:

  • Associate child counts with each vertex and keep the leaves in a queue.
  • Store a flag bit with each non-leaf vertex that will be used to determine when to spend the time to go higher in the trie to update labels to refer to the rightmost position seen thus far of the corresponding string.

SLIDE 20

(Fiala-Greene algorithm, continued)

Modifications to McCreight's suffix trie construction algorithm:

  • Once the window is full, a leaf queue is maintained so that the oldest leaf is deleted at each stage to make room for the new one that is added.
  • When a child count of a non-leaf vertex v is reduced to 1 as a result of a leaf deletion, v is deleted and the child is attached to the parent of v.
  • After a new leaf v has been created and labeled with the current position, "percolate" the updating of ancestor vertices by complementing flag bits from 1 to 0 until a 0 flag bit is found and changed to 1.
  • That is, if we think of a root to leaf path of flag bits as forming a binary number that ends in a 0 bit at the leaf, we set the leaf bit to 1, and then increment this number:

    flag(v) := 1
    while v≠root and flag(v)=1 do begin
        flag(v) := 0
        v := PARENT(v)
        Update the labels of v and the edge that was traversed.
    end
    flag(v) := 1
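The percolation loop can be transcribed almost line for line. In the sketch below, the `Vertex` class and its `label` field (standing in for "the labels of v and the edge that was traversed") are assumptions made for illustration; a real implementation would also adjust edge labels.

```python
class Vertex:
    def __init__(self, parent=None):
        self.parent = parent   # None for the root
        self.flag = 0
        self.label = None      # most recent position recorded at this vertex


def percolate(leaf, position):
    """Record position at the new leaf and lazily propagate it to ancestors."""
    leaf.label = position
    v = leaf
    v.flag = 1
    while v.parent is not None and v.flag == 1:
        v.flag = 0
        v = v.parent
        v.label = position     # update the label of v (and of the edge traversed)
    v.flag = 1
    return v                   # highest vertex touched by this update


# A path root -> a -> b, with four new leaves added below b in turn.
root = Vertex()
a = Vertex(root)
b = Vertex(a)
history = []
for position in range(1, 5):
    percolate(Vertex(b), position)
    history.append((b.label, a.label, root.label))
print(history)
# [(1, None, None), (2, 2, None), (3, 2, None), (4, 4, 4)]
```

The history shows the binary-counter behaviour: the parent of the new leaf is updated every time, its grandparent every second time, and the next ancestor every fourth time.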

SLIDE 21

(Fiala-Greene algorithm, continued)

It can be shown that:

  • The Fiala-Greene algorithm updates every vertex of the trie of n leaves at least once during every interval of n characters, from which correctness follows.
  • The total additional time to employ this algorithm for sliding the trie is linear in the number of characters processed.