Space Efficient Data Structures and FM index
Venkatesh Raman
The Institute of Mathematical Sciences, Chennai
NISER Bhubaneshwar, February 9, 2019
Introduction Data Structures Libraries Conclusions
[Figure: running rank counts over a bit vector (657, 658, 658, 659, ...), of which only sampled counts (657, 661, 664, 668) are kept; block length label: (log n)/2]
[Figure: two-level rank directory; labels: log n, t * log n bits, loglog n bits, 0.5 * log n; example stored values: 657, then 8, 5, 3]
[Figure: suffix tree with edge labels $, a, ba, aba$, and the same tree with each edge label replaced by an (offset, length) pair into the text: (0, 1), (1, 2), (3, 4), (6, 1)]
(SA = “Suffix Array”)
Suffixes of T = abaaba$ in sorted order: $, a$, aaba$, aba$, abaaba$, ba$, baaba$
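The suffix array for this example can be sketched as follows. This is a naive construction that sorts the suffix strings directly, which is fine for a small example but is not how large-scale indexes are built; the function name `suffix_array` is an illustrative choice, not something from the slides.

```python
def suffix_array(t):
    """Return the offsets of the suffixes of t, in sorted order of the suffixes."""
    # Naive: compare whole suffix strings. '$' sorts before letters in ASCII.
    return sorted(range(len(t)), key=lambda i: t[i:])

t = "abaaba$"
sa = suffix_array(t)
for offset in sa:
    # Print each SA entry next to the suffix it points to
    print(offset, t[offset:])
```

For T = abaaba$ the offsets come out as 6, 5, 2, 3, 0, 4, 1.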
T → All rotations → Sort → Burrows-Wheeler Matrix → Last column = BWT(T)
Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA; Technical Report 124; 1994
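The pipeline (all rotations, sort, take the last column) can be written out directly. This is a minimal sketch for small inputs, assuming T ends with a unique $ that sorts before every other character; the function names are illustrative.

```python
def rotations(t):
    """All cyclic rotations of t."""
    return [t[i:] + t[:i] for i in range(len(t))]

def bwm(t):
    """Burrows-Wheeler Matrix: the rotations of t in sorted order."""
    return sorted(rotations(t))

def bwt(t):
    """BWT(T): the last column of the Burrows-Wheeler Matrix."""
    return "".join(row[-1] for row in bwm(t))

for row in bwm("abaaba$"):
    print(row)
print(bwt("abaaba$"))
```

For T = abaaba$ the last column reads abba$aa.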
How to reverse the BWT? BWM has a key property called the LF Mapping...
Give each character in T a rank, equal to the number of times the character occurred previously in T. Call this the T-ranking.
Now let’s re-write the BWM including ranks...
BWM with T-ranking:
Look at the first and last columns, called F and L.
Look at just the a’s: in both columns we see them in the same order: a3, a1, a2, a0.
Same with the b’s: b1, b0.
LF Mapping: The ith occurrence of a character c in L and the ith occurrence of c in F correspond to the same occurrence in T.
However we rank occurrences of c, ranks appear in the same order in F and L.
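The property can be checked on the running example T = abaaba$. The sketch below tags every character with its T-rank, sorts the rotations by their characters only, and compares the order of ranks in F and L; all function names here are illustrative.

```python
def t_ranked(t):
    """Pair each character of t with its T-rank (number of earlier occurrences)."""
    seen = {}
    out = []
    for c in t:
        out.append((c, seen.get(c, 0)))
        seen[c] = seen.get(c, 0) + 1
    return out

def ranked_bwm(t):
    """Sorted rotations of t, with each character carrying its T-rank."""
    ranked = t_ranked(t)
    rots = [ranked[i:] + ranked[:i] for i in range(len(ranked))]
    # Sort by the characters only; the ranks are just labels
    rots.sort(key=lambda row: "".join(c for c, _ in row))
    return rots

def rank_order(column, c):
    """Ranks of character c, top to bottom, in one column of the matrix."""
    return [rank for ch, rank in column if ch == c]

rows = ranked_bwm("abaaba$")
F = [row[0] for row in rows]
L = [row[-1] for row in rows]
for c in "ab":
    assert rank_order(F, c) == rank_order(L, c)
    print(c, rank_order(F, c))
```

The a’s come out as 3, 1, 2, 0 and the b’s as 1, 0 in both columns.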
Why does the LF Mapping hold?
Why are the a’s in F in this order relative to each other? They’re sorted by right-context.
Why are the a’s in L in this order relative to each other? They’re sorted by right-context too: each row is the right-context of its last character.
Occurrences of c in F are sorted by right-context. Same for L!
Whatever ranking we give to characters in T, rank orders in F and L will match.
We’d like a different ranking so that, for a given character, ranks are in ascending order as we look down the F / L columns...
BWM with B-ranking:
F now has a very simple structure: a $, a block of a’s with ascending ranks, a block of b’s with ascending ranks.
Which BWM row begins with b1?
Skip the row starting with $ (1 row)
Skip the rows starting with a (4 rows)
Skip the row starting with b0 (1 row)
Answer: row 6

Say T has 300 As, 400 Cs, 250 Gs and 700 Ts, and $ < A < C < G < T. Which BWM row (0-based) begins with G100? (Ranks are B-ranks.)
Skip the row starting with $ (1 row)
Skip the rows starting with A (300 rows)
Skip the rows starting with C (400 rows)
Skip the first 100 rows starting with G (100 rows)
Answer: row 1 + 300 + 400 + 100 = row 801
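The skipping argument amounts to adding up character counts. A minimal sketch, assuming the per-character totals are available as a dictionary and that Python's character order matches the alphabet order ($ before the letters); `first_row_of` is a hypothetical helper name.

```python
def first_row_of(counts, c, rank):
    """0-based BWM row whose first character is c with B-rank `rank`."""
    # Skip every row that starts with a smaller character,
    # then skip `rank` rows that start with c itself.
    skipped = sum(n for ch, n in counts.items() if ch < c)
    return skipped + rank

print(first_row_of({"$": 1, "a": 4, "b": 2}, "b", 1))
print(first_row_of({"$": 1, "A": 300, "C": 400, "G": 250, "T": 700}, "G", 100))
```

The two calls reproduce the answers above: row 6 and row 801.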
Reverse BWT(T) starting at the right-hand side of T and moving left:
Start in the first row. F must have $. L contains the character just prior to $: a0.
By the LF Mapping, a0 in L is the same occurrence as the first a in F. Jump to the row beginning with a0. L contains the character just prior to a0: b0.
Repeat for b0, get a2.
Repeat for a2, get a1.
Repeat for a1, get b1.
Repeat for b1, get a3.
Repeat for a3, get $, done.
Reverse of the characters we visited = a3 b1 a1 a2 b0 a0 $ = T
Another way to visualize reversing BWT(T):
The BWT sorts characters by right-context, making a more compressible string.
Reversing it uses repeated applications of the LF Mapping, recreating T from right to left.
Not stored in index
Paolo Ferragina, and Giovanni Manzini. "Opportunistic data structures with applications." Foundations of Computer Science,
We have rows beginning with a, now we seek rows beginning with ba.
Look at those rows in L.
Use the LF Mapping. Let the new range delimit those b’s. Now we have the rows with prefix ba.
We have rows beginning with ba, now we seek rows beginning with aba.
Use the LF Mapping.
Now we have the rows with prefix aba.
This is the same range, [3, 5), that we would have got from querying the suffix array.
Unlike the suffix array, we don’t immediately know where the matches are in T...
Where are these?
When P does not occur in T, we will eventually fail to find the next character in L:
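The query walk-through can be sketched as a backward search over L = BWT(T). This is a minimal sketch that counts characters with plain O(m) scans instead of precomputed ranks, so it shows the logic only, not the space or time bounds of a real FM index; `fm_range` is an illustrative name.

```python
def fm_range(bw, p):
    """Half-open range [lo, hi) of BWM rows that have p as a prefix."""
    # first[c] = row where the block of c's begins in F
    first, total = {}, 0
    for c in sorted(set(bw)):
        first[c] = total
        total += bw.count(c)

    lo, hi = 0, len(bw)          # start with every row
    for c in reversed(p):        # extend the match one character to the left
        if c not in first:
            return (0, 0)
        # LF Mapping applied to both ends of the range;
        # bw[:k].count(c) is the B-rank of the next c at or after row k
        lo = first[c] + bw[:lo].count(c)
        hi = first[c] + bw[:hi].count(c)
        if lo >= hi:             # no row in the range has c in L: P not in T
            return (0, 0)
    return (lo, hi)

bw = "abba$aa"                   # BWT of T = abaaba$
print(fm_range(bw, "aba"))
print(fm_range(bw, "abb"))
```

For P = aba the ranges visited are [1, 5), [5, 7), [3, 5); for P = abb the range becomes empty, so the sketch returns (0, 0).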
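def reverseBwt(bw):
    """ Make T from BWT(T) """
    ranks, tots = rankBwt(bw)  # rank of each character of bw, and totals per character
    first = firstCol(tots)     # for each character, its range of rows in F
    rowi = 0                   # start in the first row, which begins with $
    t = "$"
    while bw[rowi] != '$':
        c = bw[rowi]
        t = c + t              # prepend: T is rebuilt from right to left
        rowi = first[c][0] + ranks[rowi]  # LF Mapping: jump to the row beginning with this c
    return t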
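reverseBwt calls rankBwt and firstCol, which are not shown here. The sketch below fills them in under assumptions inferred from how reverseBwt uses them: rankBwt is assumed to return the B-rank of every character of bw plus per-character totals, and firstCol is assumed to map each character to the (start, end) range of rows that begin with it in F. reverseBwt is repeated so the block runs on its own.

```python
def rankBwt(bw):
    """Assumed helper: B-ranks of the characters of bw, and character totals."""
    tots = {}
    ranks = []
    for c in bw:
        ranks.append(tots.get(c, 0))   # occurrences of c seen so far in bw
        tots[c] = tots.get(c, 0) + 1
    return ranks, tots

def firstCol(tots):
    """Assumed helper: map each character to its (start, end) row range in F."""
    first = {}
    totc = 0
    for c, count in sorted(tots.items()):
        first[c] = (totc, totc + count)
        totc += count
    return first

def reverseBwt(bw):
    """ Make T from BWT(T) """
    ranks, tots = rankBwt(bw)
    first = firstCol(tots)
    rowi = 0
    t = "$"
    while bw[rowi] != '$':
        c = bw[rowi]
        t = c + t
        rowi = first[c][0] + ranks[rowi]
    return t

print(reverseBwt("abba$aa"))
```

On abba$aa the rows visited are 0, 1, 5, 3, 2, 6, 4, and the result is abaaba$.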
Scanning for the preceding character is slow: O(m) scan
Storing ranks takes too much space: m integers
Need a way to find where matches occur in T
[Figure: BWM of T = abaaba$ with B-ranked F and L columns]
With an SA sample we can do this in O(1) time per occurrence
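One way an SA sample can resolve match positions is sketched below. The sampling scheme is an assumption, not taken from the slides: keep the suffix array entry only for text offsets that are multiples of `step`, and for an unsampled row walk left with the LF Mapping until a sampled row is reached. The names `build_index`, `lf` and `resolve` are hypothetical, and ranks are again computed by naive scans.

```python
def build_index(t, step=2):
    """Return (bw, first, sample) for text t ending in '$'."""
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    bw = "".join(t[i - 1] for i in sa)        # t[-1] is '$' when i == 0
    # Keep SA entries only for offsets divisible by step (offset 0 is always kept)
    sample = {row: off for row, off in enumerate(sa) if off % step == 0}
    first, total = {}, 0
    for c in sorted(set(bw)):
        first[c] = total
        total += bw.count(c)
    return bw, first, sample

def lf(bw, first, row):
    """LF Mapping: the row whose first character is the occurrence bw[row]."""
    c = bw[row]
    return first[c] + bw[:row].count(c)

def resolve(bw, first, sample, row):
    """Text offset of the suffix at BWM row `row`."""
    steps = 0
    while row not in sample:                  # at most step - 1 iterations
        row = lf(bw, first, row)              # move one position left in T
        steps += 1
    return sample[row] + steps

bw, first, sample = build_index("abaaba$")
# Rows [3, 5) are the matches of P = aba found by the query above
print([resolve(bw, first, sample, row) for row in range(3, 5)])
```

Rows 3 and 4 resolve to offsets 3 and 0, the two occurrences of aba in abaaba$. With a constant `step`, each occurrence needs a constant number of LF steps.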