The Burrows–Wheeler Transform (BWT)



SLIDE 1

Burrows–Wheeler Transform

Idea: The Burrows–Wheeler Transform (BWT) produces from a string S of length n a new string T of length n, and so is not a compression method in itself. Compression algorithms can be based on the idea that it is easier to compress T than S.

** The popular bzip utilities are examples.

End of string assumption: We assume that S[n] is a special character $ that occurs nowhere else in S and is greatest in the sorted ordering of the characters of S. This assumption is not necessary, but it simplifies our presentation.

SLIDE 2

Matrix Definition of the BWT

Let M(S) be the matrix of all cyclic rotations of S, listed in lexicographic order. We refer to M as the Burrows–Wheeler matrix for S, and the Burrows–Wheeler Transform of S, denoted BWT(S), is the last column of M.

Example:

          1  2  3  4  5  6  7  8  9 10
    S =   b  r  a  t  a  t  b  a  t  $

          1  2  3  4  5  6  7  8  9 10
     1    a  t  a  t  b  a  t  $  b  r
     2    a  t  b  a  t  $  b  r  a  t
     3    a  t  $  b  r  a  t  a  t  b
     4    b  a  t  $  b  r  a  t  a  t
     5    b  r  a  t  a  t  b  a  t  $
     6    r  a  t  a  t  b  a  t  $  b
     7    t  a  t  b  a  t  $  b  r  a
     8    t  b  a  t  $  b  r  a  t  a
     9    t  $  b  r  a  t  a  t  b  a
    10    $  b  r  a  t  a  t  b  a  t
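The matrix definition can be sketched directly in Python. This is a minimal illustration rather than an efficient implementation; the function name `bwt_by_rotations` and the `end` parameter are assumptions made for this example, and the sort key encodes the convention that $ is the greatest character.

```python
def bwt_by_rotations(s: str, end: str = "$") -> str:
    """Return BWT(s) as the last column of the sorted rotation matrix.

    Assumes s ends with `end`, which occurs nowhere else and sorts last.
    """
    assert s.endswith(end) and s.count(end) == 1
    # Sort key that makes the end marker greater than every other character.
    key = lambda row: [(c == end, c) for c in row]
    rotations = sorted((s[i:] + s[:i] for i in range(len(s))), key=key)
    # BWT(s) is the last column of the sorted matrix.
    return "".join(row[-1] for row in rotations)


print(bwt_by_rotations("bratatbat$"))  # rtbt$baaat
```

Building all n rotations takes O(n^2) space, so this form is only suitable for small examples.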

SLIDE 3

Suffix Trie Definition of the BWT

  • The Burrows–Wheeler Transform of S, BWT(S), is a new string of exactly n characters that is a permutation of the characters of S corresponding to a pre-order traversal of the suffix trie of S (assuming children are visited in sorted order).
  • Specifically, after performing a pre-order traversal of the suffix trie to obtain a sequence of indices, subtract 1 from each index (with the convention that 0 = n), and then list the characters at the corresponding positions.

SLIDE 4

Example: The string S = bratatbat$ is shown with its positions labeled from 1 through 10, and below it the corresponding suffix trie.

          1  2  3  4  5  6  7  8  9 10
    S =   b  r  a  t  a  t  b  a  t  $

[Figure: suffix trie of S; its leaves, read from left to right, are labeled 3 5 8 7 1 2 4 6 9 10.]

Traversal in pre-order gives the sequence:

    3 5 8 7 1 2 4 6 9 10

Subtracting 1 from each number (where 0 maps back to 10) gives

    2 4 7 6 10 1 3 5 8 9

which corresponds to the string:

    r t b t $ b a a a t
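The index arithmetic of this example can be checked with a short Python sketch. It does not build a suffix trie; it obtains the same leaf order by sorting the suffixes directly, which is simpler but not linear time. The names `suffix_order` and `bwt_from_suffix_order` are assumptions made for this example.

```python
def suffix_order(s, end="$"):
    """1-based start positions of the suffixes of s in sorted order ($ greatest)."""
    key = lambda i: [(c == end, c) for c in s[i - 1:]]
    return sorted(range(1, len(s) + 1), key=key)


def bwt_from_suffix_order(s, end="$"):
    """Subtract 1 from each position (0 maps to n) and read off the characters."""
    n = len(s)
    out = []
    for i in suffix_order(s, end):
        prev = i - 1 if i > 1 else n   # convention: 0 = n
        out.append(s[prev - 1])        # the slides index S from 1
    return "".join(out)


print(suffix_order("bratatbat$"))           # [3, 5, 8, 7, 1, 2, 4, 6, 9, 10]
print(bwt_from_suffix_order("bratatbat$"))  # rtbt$baaat
```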

SLIDE 5

*** Although the suffix trie and matrix definitions of the BWT are equivalent, when considering efficient implementation of the BWT, it will be convenient to sometimes use one and sometimes use the other when motivating specific algorithms.

SLIDE 6

Intuition:

  • By examining the two equivalent definitions of the BWT, one can gain an intuition as to why it may be more straightforward to compress BWT(S) than S.
  • We are already familiar with the idea of using a context to predict the next character.
  • For example, after seeing "elephan" in English text, we know that with high likelihood the next character is a "t".
  • BWT(S) clusters symbols according to their context, so that runs of identical symbols and runs of symbols drawn from a small subset occur often within BWT(S).
  • Thus it is more straightforward to compress BWT(S) than S.
SLIDE 7

Prefix v. Suffix Contexts

  • The contexts that are effectively employed by the BWT are the strings that follow a character, rather than the ones that precede it, as is the case, for example, with PPM methods.
  • If we want to reflect preceding contexts, we can simply compute BWT(S^R), where S^R denotes the string that is S reversed except for the final $, which we leave at the right end.

Note: It is not actually necessary to reverse S; instead, we could define the sorting of the rows of the matrix to be based on visiting the characters of a row from the second to last back to the first.
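As a small illustration of the reversed string (written S^R here), the sketch below reverses everything except the final $ and then applies the rotation-based transform. The helper names `reverse_keep_end` and `bwt` are assumptions made for this example.

```python
def bwt(s, end="$"):
    """Rotation-matrix BWT with the end marker sorting last."""
    key = lambda row: [(c == end, c) for c in row]
    rows = sorted((s[i:] + s[:i] for i in range(len(s))), key=key)
    return "".join(row[-1] for row in rows)


def reverse_keep_end(s, end="$"):
    """Reverse s except for the final end marker, which stays at the right end."""
    assert s.endswith(end)
    return s[:-1][::-1] + end


s_r = reverse_keep_end("bratatbat$")
print(s_r)       # tabtatarb$
print(bwt(s_r))  # transform whose contexts are the strings preceding each character of S
```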

SLIDE 8

O(n) computation of BWT(S)

For a string S of length n, the straightforward computation of BWT(S) based on the Burrows–Wheeler matrix is O(n^2), since M has n^2 entries.

For an O(n) computation, we can employ the suffix trie definition:

  • Construct a suffix trie for S in such a way that the children of each internal node are accessed in lexicographic order.
  • Traverse the leaves of the suffix trie in pre-order and output S[i–1] at the leaf for position i (except output S[n] when i=1).

*** Assuming that the alphabet size is constant with respect to n, the suffix trie construction is linear time and space, and so is the pre-order traversal, for a total of O(n) time and space.

SLIDE 9

Computation of the inverse BWT

Given the index q of the row in the BWT matrix M that contains S, S can be recovered from BWT(S) in linear time.

Computing q, the index of S in M, from BWT(S): If $ is the qth character in BWT(S), then S is the qth row in M.

Computing the first column of M from BWT(S):

  • We already have the last column (since it is BWT(S)).
  • The first column, which we denote by F[1]...F[n], is just the characters of S listed in sorted order (each character occurs in a block of positions in F).
  • Since the characters of BWT(S) are the same as the characters of S, we can simply sort the characters of BWT(S) to get F.
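Both quantities are easy to compute from BWT(S) alone. The following Python sketch assumes the $ convention used throughout; the function names `first_column` and `row_of_s` are chosen for this example.

```python
def first_column(bwt, end="$"):
    """F: the characters of BWT(S) in sorted order, with $ greatest."""
    return "".join(sorted(bwt, key=lambda c: (c == end, c)))


def row_of_s(bwt, end="$"):
    """q: the 1-based position of $ in BWT(S), which is the row of S in M."""
    return bwt.index(end) + 1


print(first_column("rtbt$baaat"))  # aaabbrttt$
print(row_of_s("rtbt$baaat"))      # 5
```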

SLIDE 10

The Inverse BWT is Well Defined

  • Let M1 = F, and let M2 be the matrix of two columns that is formed by placing BWT(S) in the first column, F in the second column, and then rearranging the rows in sorted order.
  • Then M2 lists all pairs of characters of S in sorted order, and so it is the first 2 columns of M.
  • M3, the first 3 columns of M, is formed by prepending the column BWT(S) to M2 and sorting the rows.
  • This process can be continued until we have Mn = M.
  • We can then read the qth row of M to recover S.

*** It is not very practical to actually construct M; we shall see how to avoid it.

SLIDE 11

Example: In our previous example of S = bratatbat$, to form M2:

    form (BWT(S) F)       → Sort →       M2
          1  2                           1  2
     1    r  a                      1    a  t
     2    t  a                      2    a  t
     3    b  a                      3    a  t
     4    t  b                      4    b  a
     5    $  b                      5    b  r
     6    b  r                      6    r  a
     7    a  t                      7    t  a
     8    a  t                      8    t  b
     9    a  t                      9    t  $
    10    t  $                     10    $  b

and to form M3:

    form (BWT(S) M2)      → Sort →       M3
          1  2  3                        1  2  3
     1    r  a  t                   1    a  t  a
     2    t  a  t                   2    a  t  b
     3    b  a  t                   3    a  t  $
     4    t  b  a                   4    b  a  t
     5    $  b  r                   5    b  r  a
     6    b  r  a                   6    r  a  t
     7    a  t  a                   7    t  a  t
     8    a  t  b                   8    t  b  a
     9    a  t  $                   9    t  $  b
    10    t  $  b                  10    $  b  r
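The prepend-and-sort construction can be written out as a short Python sketch. It really does build all of M, so it is only meant to make the definition concrete; the function name `inverse_bwt_by_sorting` is an assumption made for this example.

```python
def inverse_bwt_by_sorting(bwt, end="$"):
    """Rebuild M column by column (M1, M2, ..., Mn) and read off the row of S."""
    key = lambda row: [(c == end, c) for c in row]   # $ sorts last
    n = len(bwt)
    rows = [""] * n
    for _ in range(n):
        # Prepend the column BWT(S) to the current rows, then sort the rows.
        rows = sorted((bwt[i] + rows[i] for i in range(n)), key=key)
    q = bwt.index(end)   # 0-based row of S, since $ is the last character of S
    return rows[q]


print(inverse_bwt_by_sorting("rtbt$baaat"))  # bratatbat$
```

After the first pass `rows` equals F, after the second it equals M2, and so on, matching the tables in the example.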

SLIDE 12

Lemma: Let S be a string ending in a special character $ that occurs nowhere else in S, and suppose a character c occurs at more than one position in S. Let c1 and c2 denote any two of these occurrences of c. Then c1 comes before c2 in F(S) if and only if c1 comes before c2 in BWT(S).

Proof:

  • Suppose that in F(S), c1 is at position i and c2 is at position j, and suppose that in BWT(S), c1 is at position x and c2 is at position y.
  • Let I and J denote the ith and jth rows of M less their first characters, so that rows i and j are cI and cJ. Then Ic and Jc are also rotations of S, namely rows x and y of M.
  • Then i < j implies I < J (I ≠ J since S ends in $), which implies that Ic < Jc, which implies that x < y, and symmetric reasoning applies for i > j.
  • For the reverse direction, let X and Y be the xth and yth rows of M less their last characters.
  • Then x < y implies X < Y (X ≠ Y since S ends in $), which implies that cX < cY, which implies that i < j, and symmetric reasoning applies for x > y.

SLIDE 13

Example:

Using again S = bratatbat$, we can check that the three a's occur in the same order in columns 1 and 10, the two b's occur in the same order in columns 1 and 10, and the three t's occur in the same order in columns 1 and 10.

          1  2  3  4  5  6  7  8  9 10
    S =   b  r  a  t  a  t  b  a  t  $

          1  2  3  4  5  6  7  8  9 10
     1    a  t  a  t  b  a  t  $  b  r
     2    a  t  b  a  t  $  b  r  a  t
     3    a  t  $  b  r  a  t  a  t  b
     4    b  a  t  $  b  r  a  t  a  t
     5    b  r  a  t  a  t  b  a  t  $
     6    r  a  t  a  t  b  a  t  $  b
     7    t  a  t  b  a  t  $  b  r  a
     8    t  b  a  t  $  b  r  a  t  a
     9    t  $  b  r  a  t  a  t  b  a
    10    $  b  r  a  t  a  t  b  a  t
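The same check can be done mechanically. The sketch below follows each occurrence of a character from the first column to the row that ends with that same occurrence, and confirms that the kth copy in F is the kth copy in the last column. The function name `check_rank_lemma` is an assumption made for this example.

```python
def check_rank_lemma(s, end="$"):
    """Return True if every character keeps its relative order between F and BWT(s)."""
    key = lambda row: [(c == end, c) for c in row]
    m = sorted((s[i:] + s[:i] for i in range(len(s))), key=key)
    row_index = {row: r for r, row in enumerate(m)}   # rows are distinct because of $
    last = [row[-1] for row in m]
    seen = {}
    for row in m:
        c = row[0]
        k = seen[c] = seen.get(c, 0) + 1   # this row holds the kth copy of c in F
        x = row_index[row[1:] + c]         # row of M that ends with this same occurrence
        if last[: x + 1].count(c) != k:    # it must be the kth copy of c in BWT(s)
            return False
    return True


print(check_rank_lemma("bratatbat$"))  # True
```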

SLIDE 14

Idea:

  • Trace between BWT(S) and F(S) to discover the characters of S.
  • At stage 1, we know that S[1] = F[q].
  • Now suppose that F[q] is the kth copy of that character in F; then we find the kth copy of F[q] in BWT(S), and suppose it is the rth character of BWT(S).
  • Then we set q = r and set S[2] = F[q].
  • Now we can repeat the same process for S[3], and so on.
SLIDE 15

Example:

          1  2  3  4  5  6  7  8  9 10
     1    a  t  a  t  b  a  t  $  b  r
     2    a  t  b  a  t  $  b  r  a  t
     3    a  t  $  b  r  a  t  a  t  b
     4    b  a  t  $  b  r  a  t  a  t
     5    b  r  a  t  a  t  b  a  t  $
     6    r  a  t  a  t  b  a  t  $  b
     7    t  a  t  b  a  t  $  b  r  a
     8    t  b  a  t  $  b  r  a  t  a
     9    t  $  b  r  a  t  a  t  b  a
    10    $  b  r  a  t  a  t  b  a  t

Stage 1: S[1] = F[q] = F[5] = b
    We see that this is the second b in F. The second b in BWT(S) is at the end of row 6. Set q = 6.

Stage 2: S[2] = F[q] = F[6] = r
    We see that this is the first r in F. The first r in BWT(S) is at the end of row 1. Set q = 1.

Stage 3: S[3] = F[q] = F[1] = a
    We see that this is the first a in F. The first a in BWT(S) is at the end of row 7. Set q = 7.

Stage 4: S[4] = F[q] = F[7] = t
    We see that this is the first t in F. The first t in BWT(S) is at the end of row 2. Set q = 2.

SLIDE 16

(example, continued)

Stage 5: S[5] = F[q] = F[2] = a
    We see that this is the second a in F. The second a in BWT(S) is at the end of row 8. Set q = 8.

Stage 6: S[6] = F[q] = F[8] = t
    We see that this is the second t in F. The second t in BWT(S) is at the end of row 4. Set q = 4.

Stage 7: S[7] = F[q] = F[4] = b
    We see that this is the first b in F. The first b in BWT(S) is at the end of row 3. Set q = 3.

Stage 8: S[8] = F[q] = F[3] = a
    We see that this is the third a in F. The third a in BWT(S) is at the end of row 9. Set q = 9.

Stage 9: S[9] = F[q] = F[9] = t
    We see that this is the third t in F. The third t in BWT(S) is at the end of row 10. Set q = 10.

Stage 10: S[10] = F[q] = F[10] = $
    (We are done, but if we continue we return to where we started.) We see that this is the first $ in F. The first $ in BWT(S) is at the end of row 5. Set q = 5.

SLIDE 17

Basic Inverse BWT Algorithm:

Compute F by sorting BWT(S). Assuming that we have the convention that S ends in $, we can compute q, the position of S in M, from BWT(S); otherwise, we expect to receive q along with BWT(S).

    for i := 1 to n do begin
        Let c be the character stored in F[q].
        S[i] := c
        Let k be such that F[q] is the kth copy of c in F (that is, there are exactly k positions j, 1 ≤ j ≤ q, for which F[j] = c).
        Let r be such that the rth character of BWT(S) is the kth copy of c in BWT(S).
        q := r
    end
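A direct Python transcription of this loop might look as follows. It recounts occurrences at every step, so it is quadratic in the worst case; it is meant only to mirror the pseudo-code. The function name `inverse_bwt_basic` and the use of 0-based indices are choices made for this example.

```python
def inverse_bwt_basic(bwt, end="$"):
    """Recover S from BWT(S) by tracing between F and BWT(S), one character per stage."""
    f = sorted(bwt, key=lambda c: (c == end, c))   # first column F, with $ greatest
    q = bwt.index(end)                             # 0-based row of S in M
    out = []
    for _ in range(len(bwt)):
        c = f[q]
        out.append(c)
        k = f[: q + 1].count(c)        # f[q] is the kth copy of c in F
        r = -1
        for _ in range(k):             # find the kth copy of c in BWT(S)
            r = bwt.index(c, r + 1)
        q = r
    return "".join(out)


print(inverse_bwt_basic("rtbt$baaat"))  # bratatbat$
```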

SLIDE 18

Efficient Computation of the Inverse BWT

The key to an efficient computation is to pre-compute permutation links that point from a character of F to the corresponding character in BWT(S); that is:

    The kth occurrence of a character c in the first column is linked to the kth occurrence of the character c in BWT(S).

We store these links in the array P[1]...P[n].

*** By starting at position q, and successively doing q := P[q], the characters of S can be output in the correct order.

SLIDE 19

To compute P, we employ the array C:

[Figure: the first column F[1]...F[n] with the block-start pointers C[a], C[b], C[r], C[t], C[$], the column BWT(S), the permutation links P[1]...P[n] between the two columns, and the starting row q.]

bratatbat$ can be recovered by reconstructing the first column F and visiting F in the order of the permutation links, starting at row q. To compute P, the array C is first constructed to point to the start of each block in F. The variable q, the index of the row of M equal to S, is the starting point of the computation.

SLIDE 20

Four Pass Inverse BWT Algorithm

  1. Use bucket sort to reconstruct the first column of M, F[1]...F[n].

  2. for i := 1 to n do
         if i=1 or F[i]≠F[i–1] then C[F[i]] := i

  3. for i := 1 to n do begin
         P[C[BWT[i]]] := i
         C[BWT[i]] := C[BWT[i]]+1
     end

  4. for i := 1 to n do begin
         output F[q] as the ith character of S
         q := P[q]
     end

Note: It is possible to use a similar approach that constructs S from right to left and uses only two passes.
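The four passes translate almost line for line into Python. In this sketch the indices are 0-based, `q` is passed in as the 0-based row of S, and the name `inverse_bwt_four_pass` is an assumption made for this example.

```python
def inverse_bwt_four_pass(bwt, q, end="$"):
    """Invert the BWT with the arrays F, C and P; q is the 0-based row of S in M."""
    n = len(bwt)
    # Pass 1: bucket sort. Count each character, then lay out the blocks of F.
    counts = {}
    for c in bwt:
        counts[c] = counts.get(c, 0) + 1
    alphabet = sorted(counts, key=lambda c: (c == end, c))   # $ is greatest
    f = [c for c in alphabet for _ in range(counts[c])]
    # Pass 2: C[c] points to the start of the block of c in F.
    start = {}
    for i in range(n):
        if i == 0 or f[i] != f[i - 1]:
            start[f[i]] = i
    # Pass 3: link the kth copy of c in F to the kth copy of c in BWT(S).
    p = [0] * n
    for i, c in enumerate(bwt):
        p[start[c]] = i
        start[c] += 1
    # Pass 4: follow the links from row q, outputting F[q] each time.
    out = []
    for _ in range(n):
        out.append(f[q])
        q = p[q]
    return "".join(out)


bwt = "rtbt$baaat"
print(inverse_bwt_four_pass(bwt, bwt.index("$")))  # bratatbat$
```

With a constant-size alphabet each pass is a single linear scan, which is where the O(n) bound comes from.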

SLIDE 21

General Framework for Data Compression with the BWT

  1. Partition the input stream into blocks.
     Note: In practice, something like 64K blocks works well, but blocks on the order of 1MB or greater may be used to get the best possible compression.
  2. Perform the BWT on each block.
  3. Perform some simple processing on each block (e.g., move-to-front coding).
  4. Employ a first order coder (e.g., Huffman or arithmetic).
SLIDE 22

Move To Front Coding

Idea:

  • Move-to-front (MTF) compression maintains an ordered list of the characters of an input alphabet of size k, which we assume are the integers 0 through k–1 (e.g., k=256).
  • Each time a character is processed, the MTF algorithm codes the index of the character in the list. For convenience of notation, we index the list from index 1 to index k, and encode these indices with the integers 0 through k–1 (the code for an index is 1 less than its value).
  • So although the coding of a character is an integer in the range of 0 to k–1, this code is not the binary code for the character, but rather its position in the list (coded as its index minus 1).
  • After a character is coded or decoded, it is moved to the front of the list.
  • Coding can employ an "off-the-shelf" method such as Huffman or arithmetic coding, possibly combined with run-length coding.

SLIDE 23

Simple Implementation of MTF Coding

For an alphabet of size k, use an array list[0]...list[k–1]. In the worst case, the encoder spends O(k) time to search for the position p of a character c in list, and both the encoder and decoder take O(k) time to slide list[0]...list[p–1] back one position so that c can be placed in list[0], for a total of O(nk) time to process a sequence of n characters.
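A minimal Python sketch of this array-based scheme is given below. It emits the integer codes only and leaves the final Huffman or arithmetic coding step out; the function names `mtf_encode` and `mtf_decode` are assumptions made for this example.

```python
def mtf_encode(data, k=256):
    """Code each character as its 0-based position in the list, then move it to the front."""
    lst = list(range(k))
    codes = []
    for c in data:
        p = lst.index(c)            # O(k) search for the position of c
        codes.append(p)
        lst.insert(0, lst.pop(p))   # O(k) slide so that c lands in lst[0]
    return codes


def mtf_decode(codes, k=256):
    """Invert mtf_encode by replaying the same list updates."""
    lst = list(range(k))
    out = []
    for p in codes:
        c = lst.pop(p)
        out.append(c)
        lst.insert(0, c)
    return out


print(mtf_encode([1, 1, 0, 2], k=3))  # [1, 0, 1, 2]
```

Runs of identical symbols, which are common in BWT(S), turn into runs of the code 0, which is what makes the later coding stage effective.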

SLIDE 24

Efficient Implementation of MTF Coding

Idea:

  • Each character c that is encoded or decoded is assigned a time stamp X[c]. Initially, X[c]=–c, 0 ≤ c < k, so that all time stamps start out ≤ 0. When the ith character of S is encoded or decoded, it is given time stamp i, 1 ≤ i ≤ n.
  • Use a data structure list that can efficiently support:

    INDEX(c): The index in the list of the character c.
    FIND(i): The character corresponding to index i.
    MOVE(c,t): Move character c to the front of the list, and give it a new time stamp t.

  • We assume that we have available two functions, CODE that sends bits to the output of the encoder, and DECODE that reads bits from the input to the decoder (e.g., Huffman or arithmetic codes).

SLIDE 25

(balanced tree implementation of MTF continued)

Encoding and decoding algorithms

MTF Encoder:
    L := empty list
    for c := 0 to k–1 insert c into L with time stamp X[c] = –c
    for time := 1 to n do begin
        c := read the next input character
        i := INDEX(c)
        CODE(i–1)
        MOVE(c,time)
    end

MTF Decoder:
    L := empty list
    for c := 0 to k–1 insert c into L with time stamp X[c] = –c
    for time := 1 to n do begin
        i := DECODE
        c := FIND(i+1)
        MOVE(c,time)
        output c
    end

SLIDE 26

O(log(k)) Implementation of the INDEX and FIND Functions

Represent the list with a balanced tree data structure BT based on the time stamp, where each vertex v has three fields:

    TIME(v): a time stamp
    CHAR(v): the character with this time stamp
    COUNT(v): the number of vertices in the subtree rooted at v (including v)

Note: For a nil-pointer, define COUNT(nil)=0.

In addition to INSERT and DELETE, these operations are supported:

    INDEX(c,v): Return the index of character c in the sorted list of data items stored in the subtree of BT rooted at v, where 1 is the index of the first element, and return 0 if c is not in this subtree.
    ** Define INDEX(c) = INDEX(c,root).

    FIND(i,v): Return the character corresponding to index i in the sorted list of data items stored in the subtree of BT rooted at v, where 1 is the index of the first element; no check is made for i out of range.
    ** Define FIND(i) = FIND(i,root).

SLIDE 27

INDEX and FIND Pseudo-Code

(BT is assumed to be ordered so that larger, that is more recent, time stamps lie in the left subtree; the most recently moved character then has index 1, as move-to-front requires.)

    function INDEX(c,v)
        if v=nil then return 0
        else if X[c] > TIME(v) then return INDEX(c,LCHILD(v))
        else if X[c] = TIME(v) then return COUNT(LCHILD(v))+1
        else begin
            i := INDEX(c,RCHILD(v))
            if i=0 then return 0
            else return COUNT(LCHILD(v))+1+i
        end
    end

    function FIND(i,v)
        j := COUNT(LCHILD(v))
        if i ≤ j then return FIND(i,LCHILD(v))
        else if i = j+1 then return CHAR(v)
        else return FIND(i–j–1,RCHILD(v))
    end

Note: To support these functions, it is assumed that the basic INSERT and DELETE functions for BT update the COUNT fields as vertices are traversed.
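The time-stamp idea can be tried out with the sketch below. Note that it substitutes a Fenwick (binary indexed) tree over time-stamp slots for the balanced tree BT, purely to keep the example short: operations cost O(log(n+k)) and the space is O(n+k), not the O(log(k)) time and O(k) space of the balanced tree. The class name `TimeStampList` and its method names are assumptions made for this example; the list is kept in decreasing time-stamp order, so the most recent character has index 1.

```python
class TimeStampList:
    """MTF list over characters 0..k-1, ordered by decreasing time stamp."""

    def __init__(self, k, n):
        self.k, self.size = k, n + k
        self.tree = [0] * (self.size + 1)       # Fenwick tree of occupied slots
        self.slot = [0] * k                     # slot[c] = X[c] + k
        self.char_at = [None] * (self.size + 1)
        for c in range(k):
            self._place(c, -c)                  # initial time stamps X[c] = -c

    def _add(self, i, d):
        while i <= self.size:
            self.tree[i] += d
            i += i & -i

    def _prefix(self, i):
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def _place(self, c, t):
        self.slot[c] = t + self.k
        self.char_at[t + self.k] = c
        self._add(t + self.k, 1)

    def index(self, c):
        """INDEX(c): 1 + number of characters with a larger time stamp."""
        return self.k - self._prefix(self.slot[c] - 1)

    def find(self, i):
        """FIND(i): the character with the ith largest time stamp."""
        target = self.k - i + 1                 # rank counted from the oldest
        pos, step = 0, 1 << self.size.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.size and self.tree[nxt] < target:
                pos, target = nxt, target - self.tree[nxt]
            step >>= 1
        return self.char_at[pos + 1]

    def move(self, c, t):
        """MOVE(c,t): give c the new (largest) time stamp t."""
        self._add(self.slot[c], -1)
        self._place(c, t)


def mtf_encode(data, k):
    lst, codes = TimeStampList(k, len(data)), []
    for time, c in enumerate(data, start=1):
        codes.append(lst.index(c) - 1)
        lst.move(c, time)
    return codes


def mtf_decode(codes, k):
    lst, out = TimeStampList(k, len(codes)), []
    for time, code in enumerate(codes, start=1):
        c = lst.find(code + 1)
        lst.move(c, time)
        out.append(c)
    return out


print(mtf_encode([1, 1, 0, 2], 3))  # [1, 0, 1, 2]
```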

SLIDE 28

Complexity:

  • INDEX and FIND work in O(log(k)) time.
  • MOVE is also O(log(k)) time, since it can be implemented with a DELETE followed by an INSERT on the balanced tree.
  • Hence, assuming that CODE and DECODE work in O(1) time (or even in O(log(k)) time), both the MTF encoder and decoder work in O(n log(k)) time.
  • The space is O(k).