
SLIDE 1

Space Efficient Data Structures and FM index

Venkatesh Raman

The Institute of Mathematical Sciences, Chennai

NISER Bhubaneshwar, February 9, 2019

SLIDE 2

Introduction Data Structures Libraries Conclusions

Overview

  • Introduction
  • Data Structures: Goals; Bit Vectors; Strings from a larger alphabet; Sparse Bit Vectors; Trees; Burrows-Wheeler Transform and Indexing
  • Libraries
  • Conclusions

SLIDE 11

Overview

  • Plan of the talk
  • Why space efficient?
  • What do we mean by efficient? (information-theoretic lower bound)
  • How? Some examples (a binary (or d-ary) vector, a subset of a finite universe)
  • A success story: the BWT and the FM index
  • A recent book: Compact Data Structures: A Practical Approach, Gonzalo Navarro, Cambridge UP, 2016.

SLIDE 12

Data Structures

SLIDE 16

Data Structures

  • Pre-process the input data so as to answer a (long) series of retrieval or update operations.
  • We want to minimize:
  • 1. Query/update time.
  • 2. Space usage of the data structure.
  • 3. Pre-processing time.
  • 4. Pre-processing space.
  • In this talk we worry only about the first two, and our data structures are static.

SLIDE 20

Space usage of Data Structures

Answering queries on data requires an index in addition to the data. The index may be much larger than the data. E.g.:

  • Range trees: a data structure for answering 2-D orthogonal range queries on n points. Good worst-case performance, but Θ(n log n) words of space.
  • Suffix trees: a data structure for indexing a sequence T of n symbols from an alphabet of size σ. Supports very complex queries on string patterns quickly, but uses Θ(n) words of space.
  • One word must have at least log2 n bits, so Θ(n) words is Ω(n log n) bits, while the raw sequence T is only n log2 σ bits.
  • A good implementation takes 10x to 30x more space than T.
SLIDE 22

Succinct/Compressed Data Structures

Space usage = “space for data” + “space for index” (the redundancy).

  • The redundancy (working space used by the data structure to answer queries) should be small, ideally o(input size).
  • What should be the space for the data?
SLIDE 26

Why care about space?

  • While the cost of memory continues to go down, data is growing at a much higher rate (e.g. search engines, genome data).
  • Space matters if we want to pack a lot of data into handheld devices.
  • Sometimes better space usage increases the amount of data that can be stored in main memory, thereby increasing time efficiency too.

SLIDE 28

Models of Computation

  • Computational model: unit-cost word RAM with word size Θ(log n) bits.
  • Operations on O(log n)-bit operands (addition, subtraction, OR, multiplication, ...) take O(1) time.
  • Space is counted in bits.
  • There are also other models, such as the cell-probe model with word size Θ(log n) bits (normally used for lower bounds).

SLIDE 30

“Space for Data”

Definition (Information-theoretic Lower Bound)

If an object x is chosen from a set S, then in the worst case we need log2 |S| bits to represent x.

  • x is a binary string of length n.
  • S is the set of all binary strings of length n.
  • log2 |S| = log2 2^n = n bits.
SLIDE 31

“Space for Data”

  • x is a permutation of {1, . . . , n}.
  • S is the set of all permutations of {1, . . . , n}.
  • log2 |S| = log2 n! = n log2 n − n log2 e + o(n) bits.

Note that the standard way to represent a permutation takes n ⌈log2 n⌉ bits.

SLIDE 32

“Space for Data”

  • x is a binary string of length n with m 1s.
  • S is the set of all binary strings of length n with m 1s.
  • log2 |S| = log2 (n choose m) = m log2(n/m) + O(m) bits.
  • E.g. if m = O(n/ log n), then the lower bound is O(m log log n) = o(n) bits.
  • Compare: if we just write down the positions of the 1s, that takes m ⌈log2 n⌉ bits.

SLIDE 33

“Space for Data”

  • x is a binary tree on n nodes.
  • S is the set of all binary trees on n nodes.
  • log2 |S| = log2 ( (1/(n+1)) (2n choose n) ) = 2n − O(log n) bits.

Note that the standard binary tree representation uses Θ(1) pointers per node, i.e. Θ(n) pointers; each pointer is an address needing log n bits, so Θ(n log n) bits in total, a factor of log n more than necessary.

SLIDE 34

“Space for Data”

  • x is a triangulated planar graph on n nodes.
  • S is the set of all triangulated planar graphs on n nodes.
  • log2 |S| ∼ 3.24n bits.

There are also bounds for general graphs, chordal graphs, and bounded-treewidth graphs.
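These counting bounds are easy to sanity-check numerically with exact big-integer arithmetic; a small sketch (n and m here are arbitrary test values, and the 3.24n planar-graph constant is quoted from the slide, not derived here):

```python
import math

n, m = 1024, 64

# Binary strings of length n: log2 |S| = n bits.
assert math.log2(2 ** n) == n

# Permutations of {1..n}: log2(n!) = n log2 n - n log2 e + o(n),
# noticeably below the naive n*ceil(log2 n) bits.
lb_perm = math.log2(math.factorial(n))
naive_perm = n * math.ceil(math.log2(n))
assert lb_perm < naive_perm

# n-bit strings with m ones: log2 C(n, m) = m log2(n/m) + O(m),
# below the naive m*ceil(log2 n) bits of listing the positions.
lb_sparse = math.log2(math.comb(n, m))
assert lb_sparse < m * math.ceil(math.log2(n))

# Binary trees on n nodes: Catalan(n) = C(2n, n)/(n+1), log2 = 2n - O(log n).
lb_tree = math.log2(math.comb(2 * n, n) // (n + 1))
assert 2 * n - 3 * math.log2(n) < lb_tree < 2 * n
```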

SLIDE 35

Overview

  • Introduction
  • Data Structures: Goals; Bit Vectors; Strings from a larger alphabet; Sparse Bit Vectors; Trees; Burrows-Wheeler Transform and Indexing
  • Libraries
  • Conclusions

SLIDE 36

Succinct Data Structures

Aim: Space usage = “space for data” + “space for index” (a lower-order term), performing operations directly on this representation.

  • For static data structures, we often get O(1)-time operations.
  • The representation is often tightly tied to the set of operations.
  • They work in practice!
SLIDE 39

Bit Vectors

Data: a sequence X of n bits, x1, . . . , xn. ITLB: n bits; total space n + o(n) bits.

Operations:

  • rank1(i): number of 1s in x1, . . . , xi.
  • select1(i): position of the ith 1.

Also rank0, select0; ideally all in O(1) time. Example: X = 01101001, rank1(4) = 2, select0(4) = 7. Operations introduced in [Elias, J. ACM ’75], [Tarjan and Yao, C. ACM ’78], [Chazelle, SIAM J. Comput. ’85], [Jacobson, FOCS ’89].
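For reference, the operations are tiny to state; a naive O(n)-per-query Python version reproducing the slide's example (the succinct structures that follow answer the same queries in O(1) time):

```python
X = "01101001"  # positions are 1-based, as on the slide

def rank(b, i):
    """Number of bits equal to b among x1..xi."""
    return X[:i].count(str(b))

def select(b, j):
    """Position of the j-th bit equal to b (1-based), or -1 if there is none."""
    count = 0
    for pos, bit in enumerate(X, start=1):
        count += (bit == str(b))
        if count == j:
            return pos
    return -1

assert rank(1, 4) == 2    # the slide's example
assert select(0, 4) == 7  # the slide's example
```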

SLIDE 42

Bit Vectors: Implementing rank1

[Figure: the bit vector split into blocks of (log n)/2 bits, with sampled rank values 657, 661, 664, 668 stored at block boundaries.]

  • Naive solution: store the answer to every rank1 query. Space: O(n log n) bits.
  • Sample: store the answer only at every ((log n)/2)-th position. Space: O(n) bits.
  • How do we support rank1 in O(1) time?
SLIDE 50

Bit Vectors: Implementing rank1

[Figure: as before, sampled rank values 657, 661, 664, 668 stored every (log n)/2 positions.]

  • Scanning a ((log n)/2)-bit block takes O(log n) time.
  • We will use what is called the “Four Russians” trick.
  • Let k = (log n)/2. Create a table A with 2^(k + log2 k) = O(√n log n) = o(n) entries.
  • A[y1 . . . ylog2 k x1 . . . xk] = number of 1s in x1 . . . xy+1, where y = y1 . . . ylog2 k; the query offset and the block contents together index the table.
  • rank1(x) = 657 + A[10111010011] = 657 + 3.
  • O(n) bits, O(1) time.
  • Many theoretical succinct data structures follow this pattern: decompose + sample + table lookup.
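A sketch of the sampled + table-lookup scheme in Python (a dict-based table and illustrative names; the real structure packs everything into bits and sets k = (log2 n)/2 so the table is o(n)):

```python
def build_rank(bits, k):
    # Sampled answers: number of 1s strictly before each k-aligned block.
    samples = [0]
    for start in range(0, len(bits), k):
        samples.append(samples[-1] + bits[start:start + k].count(1))
    # "Four Russians" table: A[(block, y)] = ones among the first y+1 block bits.
    table = {}
    for block in range(2 ** k):
        word = [(block >> (k - 1 - j)) & 1 for j in range(k)]
        for y in range(k):
            table[(block, y)] = sum(word[:y + 1])
    return samples, table

def rank1(bits, samples, table, k, i):
    """Number of 1s in bits[0..i] (0-based, inclusive): one sample + one lookup."""
    start = (i // k) * k
    word = bits[start:start + k]
    word = word + [0] * (k - len(word))            # pad a trailing partial block
    block = int("".join(map(str, word)), 2)
    return samples[i // k] + table[(block, i - start)]

bits = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1]
samples, table = build_rank(bits, k=4)
for i in range(len(bits)):
    assert rank1(bits, samples, table, 4, i) == sum(bits[:i + 1])
```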
SLIDE 54

Bit Vectors: Implementing rank1

Improve the redundancy by a two-level approach.

  • Store the absolute answer at every (log n)^2 positions. This takes only O(n log n/(log n)^2) = O(n/ log n) = o(n) bits.
  • Then, at every (log n)/2 positions, store the answer relative to the enclosing (log n)^2 block. This takes O(n(log log n)/ log n) = o(n) bits.
  • Then store, as before, a table to find answers within (log n)/2 positions.

SLIDE 57

Bit Vectors: Implementing rank1, the two-level approach

[Figure: superblocks of t · log n bits storing absolute counts (log n bits each, e.g. 657) and blocks of (log n)/2 bits storing counts relative to the superblock (log log n bits each, e.g. 8, 5, 3).]

Space = n + O( (n/(t lg n)) · lg n + (n/ lg n) · lg lg n ) + O(√n · lg n)
      = n + O(n lg lg n/ lg n) bits, choosing t = Θ(lg n/ lg lg n).

  • Redundancy O(n lg lg n/ lg n) bits, optimal for O(1)-time operations [Golynski, TCS ’07].
  • Supporting select1 is similar, though a bit more complicated.
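The two-level layout translates almost directly into code; a sketch with tiny illustrative parameters (blk and t would be Θ(log n) and Θ(log n/ log log n) in the real structure, and a scan stands in for the final table lookup):

```python
def build_two_level(bits, blk, t):
    """Superblocks of t*blk bits store absolute ranks; blocks store ranks
    relative to their superblock (values < t*blk, hence few bits each)."""
    sb = blk * t
    super_rank, block_rank = [], []
    total = 0
    for i in range(0, len(bits), blk):
        if i % sb == 0:
            super_rank.append(total)          # absolute count up to here
        block_rank.append(total - super_rank[-1])  # count relative to superblock
        total += bits[i:i + blk].count(1)
    return super_rank, block_rank

def rank1(bits, super_rank, block_rank, blk, t, i):
    """Number of 1s in bits[0:i]; the tail scan stands in for the table lookup."""
    j = i // blk
    return super_rank[j // t] + block_rank[j] + bits[j * blk:i].count(1)

bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
sr, br = build_two_level(bits, blk=2, t=2)
for i in range(len(bits)):
    assert rank1(bits, sr, br, 2, 2, i) == bits[:i].count(1)
```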
slide-58
SLIDE 58

Introduction Data Structures Libraries Conclusions

Bit-Vectors: Implementing select1; the idea

slide-59
SLIDE 59

Introduction Data Structures Libraries Conclusions

Bit-Vectors: Implementing select1; the idea

  • We will try to manage by using extra O(n/ log log n) bits.
slide-60
SLIDE 60

Introduction Data Structures Libraries Conclusions

Bit-Vectors: Implementing select1; the idea

  • We will try to manage by using extra O(n/ log log n) bits.
  • Store answer for every lg n(lg lg n)th 1, takes space n/ lg lgn

bits.

slide-61
SLIDE 61

Introduction Data Structures Libraries Conclusions

Bit-Vectors: Implementing select1; the idea

  • We will try to manage by using extra O(n/ log log n) bits.
  • Store answer for every lg n(lg lg n)th 1, takes space n/ lg lgn

bits.

  • If the range r between two consecutive answers stored is of

size more than (lg n lg lg n)2, store the positions of all the lg n(lg lg n) 1 in the range;

slide-62
SLIDE 62

Introduction Data Structures Libraries Conclusions

Bit-Vectors: Implementing select1; the idea

  • We will try to manage by using extra O(n/ log log n) bits.
  • Store answer for every lg n(lg lg n)th 1, takes space n/ lg lgn

bits.

  • If the range r between two consecutive answers stored is of

size more than (lg n lg lg n)2, store the positions of all the lg n(lg lg n) 1 in the range; takes (lg n)2(lg lg n) bits, which is at most r/ lg lg n.

slide-63
SLIDE 63

Introduction Data Structures Libraries Conclusions

Bit-Vectors: Implementing select1; the idea

  • We will try to manage by using extra O(n/ log log n) bits.
  • Store answer for every lg n(lg lg n)th 1, takes space n/ lg lgn

bits.

  • If the range r between two consecutive answers stored is of

size more than (lg n lg lg n)2, store the positions of all the lg n(lg lg n) 1 in the range; takes (lg n)2(lg lg n) bits, which is at most r/ lg lg n.

  • Otherwise recurse.
SLIDE 64

Bit Vectors: Implementing select1; the idea

  • We will try to manage using an extra O(n/ log log n) bits.
  • Store the answer for every (lg n · lg lg n)-th 1; this takes n/ lg lg n bits.
  • If the range r between two consecutive stored answers has size more than (lg n · lg lg n)^2, store the positions of all lg n · lg lg n 1s in the range explicitly; this takes (lg n)^2 (lg lg n) bits, which is at most r/ lg lg n.
  • Otherwise recurse. After a couple of levels, the range will be small enough (O((lg lg n)^4)) that a table lookup can complete the job.
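The first level of this sampling idea in code, without the recursion or the table (the parameter s plays the role of lg n · lg lg n, and a forward scan stands in for the rest of the structure):

```python
def build_select(bits, s):
    # pos_of[k] = position (0-based) of the (k*s + 1)-th 1.
    pos_of, seen = [], 0
    for i, b in enumerate(bits):
        if b:
            if seen % s == 0:
                pos_of.append(i)
            seen += 1
    return pos_of

def select1(bits, pos_of, s, j):
    """Position of the j-th 1 (1-based): jump to the nearest sample, then scan."""
    i = pos_of[(j - 1) // s]
    seen = ((j - 1) // s) * s
    while True:
        if bits[i]:
            seen += 1
            if seen == j:
                return i
        i += 1

bits = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1]
pos_of = build_select(bits, s=3)
ones = [i for i, b in enumerate(bits) if b]
for j in range(1, len(ones) + 1):
    assert select1(bits, pos_of, 3, j) == ones[j - 1]
```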

slide-65
SLIDE 65

Introduction Data Structures Libraries Conclusions

Wavelet Tree – Representing strings from a larger alphabet

slide-66
SLIDE 66

Introduction Data Structures Libraries Conclusions

Wavelet Tree – Representing strings from a larger alphabet

Data: Sequence S[1..n] of symbols from an alphabet of size σ. Operations: rank(c, i): number of c’s in S[1..i]. select(c, i): position of i-th c. access(i): return S[i].    in O(log σ) time.

SLIDE 67

Wavelet Tree – Representing strings from a larger alphabet

Data: a sequence S[1..n] of symbols from an alphabet of size σ.
Operations, all in O(log σ) time:

  • rank(c, i): number of c’s in S[1..i].
  • select(c, i): position of the i-th c.
  • access(i): return S[i].

Store ⌈log2 σ⌉ bit vectors: n log σ (the raw size) + o(n log σ) bits [Grossi, Vitter, SJC ’05].

[Figure: the wavelet tree of the sequence 4 3 5 3 2 3 2 6 3 1 . . . , one bit vector per level.]
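The wavelet tree halves the alphabet at each node, storing one bit per symbol per level, so rank on S reduces to rank on bit vectors along a root-to-leaf path. A minimal recursive sketch (plain lists stand in for rank-capable bit vectors, so each level costs O(n) here instead of O(1); the test sequence resembles the figure's):

```python
def wt_rank(seq, lo, hi, c, i):
    """Number of occurrences of c in seq[:i]; one level per bit of the alphabet."""
    if lo == hi:
        return i                      # leaf: every symbol here equals c
    if i == 0:
        return 0
    mid = (lo + hi) // 2
    bits = [0 if x <= mid else 1 for x in seq]            # this level's bit vector
    if c <= mid:
        nxt = [x for x in seq if x <= mid]
        return wt_rank(nxt, lo, mid, c, i - sum(bits[:i]))  # rank0(i)
    nxt = [x for x in seq if x > mid]
    return wt_rank(nxt, mid + 1, hi, c, sum(bits[:i]))      # rank1(i)

S = [4, 3, 5, 3, 2, 3, 2, 6, 3, 1]
assert wt_rank(S, 1, 6, 3, 9) == 4   # four 3s among the first nine symbols
assert wt_rank(S, 1, 6, 2, 9) == 2
```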

SLIDE 69

A Bit Vector with only m 1s

Data: a sequence X of n bits, x1, . . . , xn, with m 1s; equivalently a set X = {x1, . . . , xm} ⊆ {1, . . . , n} with x1 < x2 < . . . < xm.

Operations:

  • select1(i); equivalently access(i): return xi.

ITLB: log2 (n choose m) = m log2(n/m) + O(m) bits.

[Elias, J. ACM ’75], [Grossi/Vitter, SICOMP ’06], [Raman et al., TALG ’07].

SLIDE 70

Elias-Fano Representation

Bucket according to the most significant b bits.

  • Example: b = 3, ⌈log2 n⌉ = 5, m = 7.

Bucket | Keys
000 | −
001 | −
010 | x1, x2, x3
011 | x4
100 | x5, x6
101 | x7
110 | −
111 | −

SLIDE 71

Elias-Fano

⊲ Store only the low-order bits. ⊲ Keep the sizes of all buckets.

Example: select(6).

Bucket | Size | Data (low-order bits)
000 | − |
001 | − |
010 | 3 | 00 (x1), 01 (x2), 11 (x3)
011 | 1 | 01 (x4)
100 | 2 | 00 (x5), 10 (x6)
101 | 1 | 11 (x7)
110 | − |
111 | − |

SLIDE 77

Elias-Fano

  • Choose b = ⌊log2 m⌋ high bits. Within a bucket, keys keep their (⌈log2 n⌉ − ⌊log2 m⌋) low-order bits.
  • m log2 n − m log2 m + O(m) = m log2(n/m) + O(m) bits for the lower part.

Encoding bucket sizes:

Bucket no: 000 001 010 011 100 101 110 111
Bucket size: 0 0 3 1 2 1 0 0

  • Use a unary encoding (each size as a run of 0s, terminated by a 1): 0, 0, 3, 1, 2, 1, 0, 0 → 110001010010111.
  • z buckets, total size m ⇒ m + z = O(m) bits (z = 2^⌊log2 m⌋).
  • Overall, the Elias-Fano bit vector takes m log2(n/m) + O(m) bits.
  • In which bucket is the 6th key? ⊲ “rank1 up to the 6th 0”.
  • select1 in O(1) time.
  • The redundancy can be made o(m), and membership and rank1 can also be supported (RRR01).
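A minimal Elias-Fano sketch: high bits choose the bucket, low bits are stored verbatim, and bucket sizes go into a unary bit string; select walks the unary string (the O(1) select above replaces this scan with a select1 structure). Note this sketch writes keys as 1s and bucket terminators as 0s, the complement of the slide's convention, and all names are illustrative:

```python
import math

def ef_encode(xs, n):
    """xs sorted, values in [0, n). Low bits stored verbatim; sizes in unary."""
    m = len(xs)
    hi_bits = math.floor(math.log2(m))                    # b = floor(log2 m)
    low_bits = math.ceil(math.log2(n)) - hi_bits
    lows = [x & ((1 << low_bits) - 1) for x in xs]
    buckets = [0] * (1 << hi_bits)
    for x in xs:
        buckets[x >> low_bits] += 1
    unary = "".join("1" * c + "0" for c in buckets)       # key=1, terminator=0
    return lows, unary, low_bits

def ef_select(lows, unary, low_bits, i):
    """Value of the i-th key (1-based): bucket index = zeros before the i-th 1."""
    ones = zeros = 0
    for bit in unary:
        if bit == "1":
            ones += 1
            if ones == i:
                return (zeros << low_bits) | lows[i - 1]
        else:
            zeros += 1

xs = [2, 3, 7, 14, 21, 27, 30]
lows, unary, lb = ef_encode(xs, 32)
assert [ef_select(lows, unary, lb, i) for i in range(1, 8)] == xs
```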

SLIDE 80

Tree Representations

Data: an n-node binary tree.
Operations: navigation (left child, right child, parent).

SLIDE 84

Tree Representations

Data: an n-node binary tree. Operations: navigation (left child, right child, parent).

  • Visit the nodes in level order and output 1 for an internal node and 0 for an external one (2n + 1 bits) [Jacobson, FOCS ’89]. Store the sequence of bits as a bit vector.

[Figure: an example binary tree and its level-order bit string.]

  • Number the internal nodes by the position of their 1 in the bit string.
  • Left child of node i = 2 · rank1(i); e.g. in the figure, the left child of node 7 is 2 · 7 = 14 (there rank1(7) = 7).
  • Right child = 2 · rank1(i) + 1. Parent = select1(⌊i/2⌋).
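Jacobson's level-order encoding in miniature, on an assumed 4-node toy tree (not the slide's figure), with naive rank/select standing in for the O(1) bit-vector operations:

```python
# Level order over the extended tree, 1 = internal (real) node, 0 = external.
# Tree: root A, children B and C; B has a left child D.
B = [1, 1, 1, 1, 0, 0, 0, 0, 0]

def rank1(i):                 # naive stand-in for the O(1) operation
    return B[:i].count(1)

def select1(j):               # naive stand-in for the O(1) operation
    seen = 0
    for pos, bit in enumerate(B, start=1):
        seen += bit
        if seen == j:
            return pos

left   = lambda i: 2 * rank1(i)
right  = lambda i: 2 * rank1(i) + 1
parent = lambda i: select1(i // 2)

assert left(1) == 2 and right(1) == 3   # the root's children
assert left(2) == 4                     # node 2's left child is node 4
assert B[left(3) - 1] == 0              # node 3's left "child" is external
assert parent(4) == 2 and parent(3) == 1
```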
SLIDE 86

Tree Representations

  • “Optimal” representations exist for many kinds of trees, e.g. ordinal trees (rooted, arbitrary-degree, (un)labelled trees, such as XML documents) and tries.
  • Wide range of O(1)-time operations; e.g. ordinal trees in 2n + o(n) bits [Navarro, Sadakane, TALG ’12].
SLIDE 92

Pattern Matching – Compressed Text Indexing

Data: a sequence T (the “text”) of n symbols from an alphabet of size σ. ITLB: n log2 σ bits.
Operation: given a pattern P, determine whether P occurs (exactly) in T (and report the number of occurrences, starting positions, etc.).

  • For a human genome sequence, n is about 3 billion (3 × 10^9) characters, and σ = 4.
  • The standard data structure is the suffix tree, which answers this query in O(|P|) time but takes O(n log n) bits of space.
  • In practice, a suffix tree is about 10-30 times larger than the text.
  • A number of succinct data structures have been developed; we will focus on the FM-index [Ferragina, Manzini, JACM ’05].
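The FM-index is built on the Burrows-Wheeler transform, covered later in the deck. As a taste, here is a toy sketch of the transform and of the FM-index's backward-search counting step, following the standard textbook description rather than anything on the slides (naive O(n^2 log n) construction and naive occurrence counts, fine only for small texts):

```python
def bwt(t):
    t += "$"                       # unique terminator, smaller than all symbols
    rots = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rots)

def fm_count(bw, p):
    """Number of occurrences of p in the text, via backward search on its BWT."""
    first = {}                     # first row of each symbol in the sorted column
    for i, c in enumerate(sorted(bw)):
        first.setdefault(c, i)
    lo, hi = 0, len(bw)            # current suffix interval [lo, hi)
    for c in reversed(p):
        if c not in first:
            return 0
        lo = first[c] + bw[:lo].count(c)   # naive Occ(c, lo)
        hi = first[c] + bw[:hi].count(c)   # naive Occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

bw = bwt("abaaba")
assert fm_count(bw, "aba") == 2
assert fm_count(bw, "bb") == 0
```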

slide-93
SLIDE 93


Previous Popular Solution – Suffix Trees

slide-94
SLIDE 94

Suffix trie: making it smaller

Idea 1: coalesce non-branching paths into a single edge with a string label (e.g. an edge labelled aba$). This reduces the number of nodes and edges, and guarantees that every internal node has more than one child.

T = abaaba$

slide-95
SLIDE 95

Suffix tree

T = abaaba$

With respect to m:
How many leaves? m
How many non-leaf nodes? ≤ m - 1
So ≤ 2m - 1 nodes total, i.e. O(m) nodes.
Is the total size O(m) now? No: the total length of the edge labels is quadratic in m.

slide-96
SLIDE 96

Suffix tree

Idea 2: store T itself in addition to the tree, and convert the tree's edge labels to (offset, length) pairs with respect to T. For T = abaaba$, for example, the edge label aba$ becomes the pair (3, 4): the substring of length 4 starting at offset 3.

Space required for the suffix tree is now O(m).

slide-97
SLIDE 97

Suffix tree: leaves hold offsets

[Figure: the suffix tree for T = abaaba$ with (offset, length) edge labels; each leaf holds the starting offset of its suffix, e.g. 3, 2, 5, 4, 1, 6.]

slide-105
SLIDE 105

Previous Popular Solution – Suffix Trees

  • A (compressed) trie containing all the suffixes of T. The tree contains m + 1 leaves and at most m other nodes.
  • Each leaf is labelled with the starting position of the suffix ending at that leaf.
  • Each edge carries a string, which can be represented by the starting and ending positions of that substring in the text.
  • Overall, a naive implementation takes about 4m words, or 4m lg m bits.
  • Progress in succinct data structures has brought the space down to m lg m + O(m) bits (in addition to the text).
  • P exists in T if and only if P is a prefix of a suffix of T. So follow the path from the root matching P; on success, the leaves of the subtree reached give the list of occurrences.
  • O(n + occ) time to find all occurrences.
slide-106
SLIDE 106


Previous popular solution - Suffix Arrays

slide-107
SLIDE 107

Suffix array

(SA = "Suffix Array")

The suffix array of T is an array of m + 1 integers in [0, m] specifying the lexicographic order of the suffixes of T$. As with the suffix tree, T is part of the index.

For T$ = abaaba$: SA(T) = 6 5 2 3 0 4 1, i.e. the sorted suffixes $, a$, aaba$, aba$, abaaba$, ba$, baaba$.

slide-108
SLIDE 108

Suffix array: querying

Is P a substring of T?

  • 1. For P to be a substring, it must be a prefix of ≥ 1 of T's suffixes.
  • 2. Suffixes sharing a prefix are consecutive in the suffix array.

So: use binary search.

slide-109
SLIDE 109

Suffix array: querying

Is P a substring of T? Do binary search; at each step, check whether P is a prefix of the suffix found there.

Worst-case time bound? O(log2 m) bisections, O(n) character comparisons per bisection, so O(n log m).

How many times does P occur in T? Two binary searches yield the range of suffixes with P as a prefix; the size of that range equals the number of times P occurs in T.
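The two binary searches just described can be sketched in a few lines of Python. This is a toy illustration, not code from the slides; the names suffix_array and sa_range are my own, and the naive construction is far too slow for genome-scale texts.

```python
def suffix_array(t):
    # Naive O(m^2 log m) construction: sort suffix start positions by suffix.
    # t is assumed to end with the sentinel '$'.
    return sorted(range(len(t)), key=lambda i: t[i:])

def sa_range(t, sa, p):
    """Return [lo, hi): the suffix-array rows whose suffix has p as a prefix."""
    lo, hi = 0, len(sa)
    while lo < hi:                              # lower bound
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                              # upper bound
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] <= p:
            lo = mid + 1
        else:
            hi = mid
    return start, lo
```

For T = abaaba$, suffix_array returns [6, 5, 2, 3, 0, 4, 1], and sa_range(t, sa, "aba") returns the range (3, 5): two occurrences.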

slide-114
SLIDE 114

Previous popular solution - Suffix Arrays

  • A permutation of {1, 2, . . . , m}. S[i] is the starting position of the i-th suffix in the lexicographic order.
  • Takes m lg m bits. Naive binary search takes O(n lg m) time.
  • With what is called an LCP array, taking another m lg m bits, the search time can be brought down to O(n + lg m).

slide-116
SLIDE 116

The FM-Index

Based on the Burrows-Wheeler transform of the text T. Example: T = mississippi.

[Figure: the sorted rotations of T, with first column F and last column L.]

BWT(T) = pssmipissii

slide-117
SLIDE 117

Burrows-Wheeler Transform

A text transform that is useful for compression & search.

All rotations of banana$, then sorted:

banana$        $banana
anana$b        a$banan
nana$ba        ana$ban
ana$ban        anana$b
na$bana        banana$
a$banan        nana$ba
$banana        na$bana

BWT(banana) = annb$aa

Tends to put runs of the same character together, which makes compression work well. "bzip" is based on this.
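A direct transcription of this construction, as a toy sketch (real implementations build the BWT from a suffix array rather than materializing every rotation):

```python
def bwt(t):
    # BWT via the sorted rotations of t. t must end with the sentinel '$'
    # so that sorting rotations agrees with sorting suffixes.
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)
```

bwt("banana$") returns "annb$aa", matching the slide; bwt("abaaba$") returns "abba$aa", the running example used below.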

slide-118
SLIDE 118

Burrows-Wheeler Transform

[Figure: T = abaaba$; all rotations, then sorted into the Burrows-Wheeler Matrix; the last column is BWT(T) = abba$aa.]

Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA, Technical Report 124; 1994.

A reversible permutation of the characters of a string, used originally for compression. How is it reversible? How is it useful for compression? How is it an index?

slide-119
SLIDE 119

Burrows-Wheeler Transform

The BWM bears a resemblance to the suffix array:

BWM(T)            SA(T)
$ a b a a b a     6  $
a $ a b a a b     5  a$
a a b a $ a b     2  aaba$
a b a $ a b a     3  aba$
a b a a b a $     0  abaaba$
b a $ a b a a     4  ba$
b a a b a $ a     1  baaba$

The sort order is the same whether the rows are rotations or suffixes.

slide-120
SLIDE 120

Burrows-Wheeler Transform

[Figure: T = abaaba$; all rotations, sorted; last column BWT(T) = abba$aa.]

How to reverse the BWT? The BWM has a key property called the LF Mapping...

slide-121
SLIDE 121

Burrows-Wheeler Transform: T-ranking

Give each character in T a rank, equal to the number of times that character occurred previously in T. Call this the T-ranking. For T = abaaba$:

a0 b0 a1 a2 b1 a3 $

Now let’s re-write the BWM including ranks...

slide-122
SLIDE 122

Burrows-Wheeler Transform

BWM with T-ranking:

F                    L
$  a0 b0 a1 a2 b1 a3
a3 $  a0 b0 a1 a2 b1
a1 a2 b1 a3 $  a0 b0
a2 b1 a3 $  a0 b0 a1
a0 b0 a1 a2 b1 a3 $
b1 a3 $  a0 b0 a1 a2
b0 a1 a2 b1 a3 $  a0

Look at the first and last columns, called F and L, and look at just the as: the as occur in the same order in F and L. Reading down either column we see a3, a1, a2, a0.

slide-123
SLIDE 123

Burrows-Wheeler Transform

The same BWM with T-ranking, same F and L columns: the same holds for the bs. Reading down F or L we see b1, b0.

slide-124
SLIDE 124

Burrows-Wheeler Transform: LF Mapping

BWM with T-ranking (F and L columns as before).

LF Mapping: the ith occurrence of a character c in L and the ith occurrence of c in F correspond to the same occurrence in T. However we rank the occurrences of c, the ranks appear in the same order in F and L.

slide-125
SLIDE 125

Burrows-Wheeler Transform: LF Mapping

[Figure: the BWM with only the as (and then only the bs) ranked, highlighting their relative order in F and L.]

Why does the LF Mapping hold? Why are these as in this order relative to each other? Because they're sorted by right-context.

Occurrences of c in F are sorted by right-context. The same is true in L! So whatever ranking we give to the characters of T, the rank orders in F and L will match.

slide-126
SLIDE 126

Burrows-Wheeler Transform: LF Mapping

BWM with T-ranking (as before).

We'd like a different ranking, so that for a given character the ranks are in ascending order as we look down the F / L columns...

slide-127
SLIDE 127

Burrows-Wheeler Transform: LF Mapping

BWM with B-ranking:

F                    L
$  a3 b1 a1 a2 b0 a0
a0 $  a3 b1 a1 a2 b0
a1 a2 b0 a0 $  a3 b1
a2 b0 a0 $  a3 b1 a1
a3 b1 a1 a2 b0 a0 $
b0 a0 $  a3 b1 a1 a2
b1 a1 a2 b0 a0 $  a3

Ranks now ascend as we look down a column. F has a very simple structure: a $, a block of as with ascending ranks, then a block of bs with ascending ranks.

slide-128
SLIDE 128

Burrows-Wheeler Transform

F: $ a0 a1 a2 a3 b0 b1        L: a0 b0 b1 a1 $ a2 a3

Which BWM row begins with b1?
Skip the row starting with $ (1 row).
Skip the rows starting with a (4 rows).
Skip the row starting with b0 (1 row).
Answer: row 6.

slide-129
SLIDE 129

Burrows-Wheeler Transform

Which BWM row (0-based) begins with G100? (Ranks are B-ranks.) Say T has 300 As, 400 Cs, 250 Gs and 700 Ts, and $ < A < C < G < T.
Skip the row starting with $ (1 row).
Skip the rows starting with A (300 rows).
Skip the rows starting with C (400 rows).
Skip the first 100 rows starting with G (100 rows).
Answer: row 1 + 300 + 400 + 100 = row 801.
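This arithmetic is just an offset into cumulative character counts. A tiny sketch (the function name row_of is mine, not from the slides):

```python
def row_of(c, rank, counts, order="$ACGT"):
    # 0-based BWM row beginning with the rank-th (B-rank) occurrence of c,
    # given per-character counts and the alphabet in sorted order.
    row = 0
    for d in order:
        if d == c:
            return row + rank
        row += counts.get(d, 0)
    raise ValueError("character not in alphabet: " + c)
```

With the counts from the slide, row_of("G", 100, {"$": 1, "A": 300, "C": 400, "G": 250, "T": 700}) gives 801.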

slide-130
SLIDE 130

Burrows-Wheeler Transform: reversing

Reverse BWT(T) starting at right-hand-side of T and moving left

F: $ a0 a1 a2 a3 b0 b1        L: a0 b0 b1 a1 $ a2 a3

Start in the first row. F must have $; L contains the character just prior to $: a0.
By the LF Mapping, this a0 is the same occurrence of a as the first a in F. Jump to the row beginning with a0. Its L contains the character just prior to a0: b0.
Repeat for b0, get a2. Repeat for a2, get a1. Repeat for a1, get b1. Repeat for b1, get a3. Repeat for a3, get $; done.
The reverse of the characters we visited = a3 b1 a1 a2 b0 a0 $ = T.

slide-131
SLIDE 131

Burrows-Wheeler Transform: reversing

Another way to visualize reversing BWT(T): repeatedly follow the arrow from a character in L to its matching rank in F, emitting one character of T (right to left) per step.

T: a3 b1 a1 a2 b0 a0 $
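The walk just traced can be transcribed directly. This is a sketch; the helper names deliberately echo the rankBwt/firstCol helpers that appear in the code later in the deck.

```python
def rank_bwt(bw):
    # B-rank each character of the last column; also tally total counts.
    tots, ranks = {}, []
    for c in bw:
        ranks.append(tots.get(c, 0))
        tots[c] = tots.get(c, 0) + 1
    return ranks, tots

def first_col(tots):
    # Map each character to its half-open row range in the F column.
    first, start = {}, 0
    for c in sorted(tots):
        first[c] = (start, start + tots[c])
        start += tots[c]
    return first

def reverse_bwt(bw):
    # Recreate T right-to-left by repeated LF mapping, starting from row 0
    # (the row whose F entry is '$').
    ranks, tots = rank_bwt(bw)
    first = first_col(tots)
    rowi, t = 0, "$"
    while bw[rowi] != "$":
        c = bw[rowi]
        t = c + t
        rowi = first[c][0] + ranks[rowi]   # LF step
    return t
```

reverse_bwt("abba$aa") returns "abaaba$", and reverse_bwt("annb$aa") returns "banana$".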

slide-132
SLIDE 132

Burrows-Wheeler Transform

We've seen how the BWT is useful for compression (it sorts characters by right-context, making a more compressible string) and how it's reversible (repeated applications of the LF Mapping recreate T from right to left).

How is it used as an index?

slide-133
SLIDE 133

FM Index

FM Index: an index combining the BWT with a few small auxiliary data structures. "FM" supposedly stands for "Full-text Minute-space" (but the inventors are named Ferragina and Manzini). The core of the index consists of F and L from the BWM; the middle columns of the BWM are not stored.

Paolo Ferragina and Giovanni Manzini. "Opportunistic data structures with applications." Proceedings of the 41st Annual Symposium on Foundations of Computer Science, IEEE, 2000.

F can be represented very simply (1 integer per alphabet character), and L is compressible: potentially very space-economical!

slide-134
SLIDE 134

FM Index: querying

Though the BWM is related to the suffix array, we can't query it the same way: we don't store the suffix-array columns, so binary search isn't possible.

slide-135
SLIDE 135

FM Index: querying

Look for the range of rows of BWM(T) with P as a prefix.

P = aba. It is easy to find all the rows beginning with a, thanks to F's simple structure. Do this for P's shortest suffix, then extend to successively longer suffixes until the range becomes empty or we've exhausted P.

slide-136
SLIDE 136

FM Index: querying

P = aba. We have the rows beginning with a; now we seek the rows beginning with ba.

Look at those rows in L: b0 and b1 are the bs occurring just to the left. Use the LF Mapping, and let the new range delimit those bs. Now we have the rows with prefix ba.

slide-137
SLIDE 137

FM Index: querying

P = aba. We have the rows beginning with ba; now we seek the rows beginning with aba.

a2 and a3 occur just to the left. Use the LF Mapping. Now we have the rows with prefix aba.

slide-138
SLIDE 138

FM Index: querying

P = aba. Now we have the same range, [3, 5), that we would have got from querying the suffix array.

Unlike with the suffix array, though, we don't immediately know where the matches are in T...

slide-139
SLIDE 139

FM Index: querying

P = bba. When P does not occur in T, we eventually fail to find the next character in L: among the rows with prefix ba, there are no bs in L.
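The whole matching loop, written out as a toy sketch (it recomputes ranks by scanning L, so each step costs O(m) rather than the O(1) a real rank structure provides; the name occ_count is mine):

```python
def occ_count(bw, p):
    # Count BWM rows prefixed by p, given only bw = BWT(t$).
    counts = {}
    for c in bw:
        counts[c] = counts.get(c, 0) + 1
    first, start = {}, 0                  # F-column start index per character
    for c in sorted(counts):
        first[c] = start
        start += counts[c]
    lo, hi = 0, len(bw)                   # start with all rows
    for c in reversed(p):                 # extend p right to left
        if c not in first:
            return 0
        lo = first[c] + bw[:lo].count(c)  # rank of c in L above row lo
        hi = first[c] + bw[:hi].count(c)
        if lo >= hi:
            return 0                      # range became empty: no match
    return hi - lo
```

With bw = "abba$aa" (the BWT of abaaba$), occ_count(bw, "aba") is 2 and occ_count(bw, "bba") is 0, matching the examples on the slides.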

slide-140
SLIDE 140

FM Index: querying

If we scan the characters of the last column to find those bs, that can be very slow: O(m) per step.

slide-141
SLIDE 141

FM Index: lingering issues

Three lingering issues:

(1) Scanning L for the preceding character is slow: an O(m) scan per step.

(2) Storing the rank of every position of L takes too much space (m integers):

def reverseBwt(bw):
    """ Make T from BWT(T) """
    ranks, tots = rankBwt(bw)
    first = firstCol(tots)
    rowi = 0
    t = "$"
    while bw[rowi] != '$':
        c = bw[rowi]
        t = c + t
        rowi = first[c][0] + ranks[rowi]
    return t

(3) We need a way to find where the matches occur in T.

slide-142
SLIDE 142

FM Index: resolving offsets

Idea: store some, but not all, entries of the suffix array (here the values 6, 2 and 4 are kept).

A lookup for row 4 succeeds: we kept that entry of the SA. A lookup for row 3 fails: we discarded that entry.

slide-143
SLIDE 143

FM Index: resolving offsets

But the LF Mapping tells us that the a at the end of row 3 corresponds to the a at the beginning of row 2, and row 2 has suffix array value 2. So row 3 has suffix array value 3 = 2 (row 2's SA value) + 1 (# steps to reach row 2).

If the saved SA values are O(1) positions apart in T, resolving an offset takes O(1) time.
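The resolution step can be sketched as follows (names mine; sa_sample is assumed to be a non-empty dict from sampled BWM row to text offset, and the modulo handles the walk wrapping past offset 0):

```python
def resolve_offset(row, bw, sa_sample):
    # Walk left via LF mapping until a sampled row is hit; each step moves
    # one position left in T, so add the number of steps taken.
    ranks, counts = [], {}
    for c in bw:                       # B-ranks of L
        ranks.append(counts.get(c, 0))
        counts[c] = counts.get(c, 0) + 1
    first, start = {}, 0               # F-column start per character
    for c in sorted(counts):
        first[c] = start
        start += counts[c]
    steps = 0
    while row not in sa_sample:
        row = first[bw[row]] + ranks[row]   # LF step
        steps += 1
    return (sa_sample[row] + steps) % len(bw)
```

For bw = "abba$aa" with sampled entries {0: 6, 2: 2, 5: 4}, resolve_offset(3, bw, ...) walks one LF step to row 2 and returns 2 + 1 = 3, exactly the calculation on the slide.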

slide-144
SLIDE 144

FM Index: problems solved

Need a way to find where the occurrences are in T: with the sampled SA values we can do this in O(1) time per occurrence, at the expense of adding some SA entries (O(m) integers) to the index.

Call this the "SA sample". Solved!

slide-151
SLIDE 151

To Summarize (FM index)

  • Existence of P in T, and
  • the number of occurrences (occ) of P in T

can be determined in O(n) time using

  • m lg σ bits for the BWT (last column),
  • o(m lg σ) bits for rank, and
  • σ lg m bits for the count of each character (first column),

and the positions of all occurrences of P in T can be determined in

  • additional O(k occ) time, using
  • an additional (m lg m)/k bits of space (a sampled suffix array).
  • For example, O(occ lg m) time using additional O(m) bits of space.

slide-152
SLIDE 152

Contrasting with Suffix Arrays and Suffix Trees

FM Index:     O(m lg σ) bits (about 1.5GB for the human genome). O(n) time for finding existence and occ; O(n + occ lg m) for finding all occurrences.
Suffix Array: 2m lg m bits + text (about 12GB for the human genome). O(n + lg m) time for all operations.
Suffix Tree:  3m lg m bits + text (about 47GB in MUMmer for the human genome; m lg m + O(m) bits with optimization). O(n) time for the boolean query; O(n + occ) for finding all occurrences. Useful for many other operations.
slide-153
SLIDE 153


Introduction Data Structures Goals Bit Vectors Strings from a larger alphabet Sparse Bit Vectors Trees Burrows-Wheeler Transform and Indexing Libraries Conclusions

slide-154
SLIDE 154

Libraries

  • A number of good implementations of succinct data structures in C++ are available.
  • Different platforms, coding styles:
  • sdsl-lite (Gog, Petri et al., U. Melbourne).
  • succinct (Grossi and Ottaviano, U. Pisa).
  • Sux4J (Vigna, U. Milan; Java).
  • LIBCDS (Claude and Navarro, Akori and U. Chile).
  • All open-source and available as Git repositories.
slide-157
SLIDE 157

Conclusions

  • SDS are a relatively mature field in terms of the breadth of problems considered.
  • Quite practical; the FM index has been implemented in bioinformatics software (Bowtie).
  • Some foundational questions are still not addressed (e.g. lower bounds), at least in dynamic SDS.


slide-159
SLIDE 159

Thank You

Special thanks to Rajeev Raman (Leicester University) and Ben Langmead (Johns Hopkins) for some of the slides.