Fingerprints in Compressed Strings (In Proc. WADS 2013)

Fingerprints in Compressed Strings
(In Proc. WADS 2013)

Philip Bille (1), Patrick Hagge Cording (1), Inge Li Gørtz (1), Benjamin Sach (2), Hjalte Wedel Vildhøj (1) and Søren Vind (1)

(1) Technical University of Denmark, DTU Compute, {phbi,phaco,inge,hwvi,sovi}@dtu.dk
(2) University of Bristol, Department of Computer Science, ben@cs.bris.ac.uk

October 10, 2013
WCTA 2013, Jerusalem
hwv.dk

The Takeaway Message

“Karp-Rabin fingerprints can be computed efficiently on compressed strings.”

Straight Line Programs
Compression model for strings

◮ Compression is modelled as a Straight Line Program (SLP).
◮ An SLP G is a grammar in Chomsky normal form.
◮ G consists of production rules X_1, ..., X_n of the form X_i = X_l X_r (nonterminal) or X_i = a (terminal), representable as a DAG.
◮ A node v ∈ G produces a unique string S(v) of length |S(v)|.

[Figure: the DAG of an SLP with root X_7, and its expansion into a parse tree whose leaves X_1 = a and X_2 = b spell a b a b b b a b.]
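To make the model concrete, here is a minimal Python sketch of an SLP stored as a rule table. The rules are read off the slide's example figure (an assumption about the drawing, chosen so that the root expands to the example string a b a b b b a b), and the names `RULES`, `expand` and `lengths` are illustrative only.

```python
# Production rules as read off the slide's example figure (an assumption).
# A terminal rule maps to a character, a nonterminal rule to a (left, right) pair.
RULES = {
    "X1": "a",
    "X2": "b",
    "X3": ("X1", "X2"),
    "X4": ("X2", "X2"),
    "X5": ("X3", "X3"),
    "X6": ("X4", "X3"),
    "X7": ("X5", "X6"),
}


def expand(node, rules=RULES):
    """Return S(node) by full decompression (for illustration only)."""
    rule = rules[node]
    if isinstance(rule, str):
        return rule
    left, right = rule
    return expand(left, rules) + expand(right, rules)


def lengths(rules=RULES):
    """Compute |S(v)| for every node bottom-up, without decompressing."""
    memo = {}

    def go(node):
        if node not in memo:
            rule = rules[node]
            memo[node] = 1 if isinstance(rule, str) else go(rule[0]) + go(rule[1])
        return memo[node]

    for node in rules:
        go(node)
    return memo


print(expand("X7"))      # ababbbab
print(lengths()["X7"])   # 8
```

Because shared nodes such as X3 are stored once, the table has n = 7 rules while the string has N = 8 symbols; for highly repetitive strings the gap can be much larger.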

Karp-Rabin Fingerprints
Definition

The Karp-Rabin fingerprint of a string S is defined as

$$\phi(S) = \sum_{k=1}^{|S|} S[k] \cdot c^{k} \bmod p,$$

where p = O(2^w) is a sufficiently large prime and c ∈ Z_p is chosen uniformly at random. Storing a fingerprint requires constant space.

Example, with a = 0 and b = 1:

position: 1 2 3 4 5 6 7 8
S       = a b a b b b a b
        = 0 1 0 1 1 1 0 1

φ(S[2,5]) = 1·c^1 + 0·c^2 + 1·c^3 + 1·c^4 mod p
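The definition translates directly into code. The sketch below is a plain (uncompressed) reference implementation; the prime `P`, the fixed base `C` and the symbol coding `CODE` are assumptions made for reproducibility, whereas the slides draw c uniformly at random from Z_p.

```python
P = (1 << 61) - 1        # a Mersenne prime, standing in for the "sufficiently large prime" p
C = 1_000_003            # fixed here for reproducibility; the slides pick c at random
CODE = {"a": 0, "b": 1}  # symbol coding used in the slide's example


def fingerprint(s, c=C, p=P):
    """phi(s) = sum over k = 1..|s| of s[k] * c^k mod p (k is 1-based)."""
    total, power = 0, 1
    for ch in s:
        power = power * c % p                  # power is now c^k
        total = (total + CODE[ch] * power) % p
    return total


S = "ababbbab"
sub = S[1:5]             # S[2,5] in the slide's 1-based inclusive notation
print(sub, fingerprint(sub) == (C + C**3 + C**4) % P)
```

The fingerprint is a single residue modulo p, so it fits in a constant number of machine words regardless of the length of the string.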

Karp-Rabin Fingerprints
Key properties

Composition
Given any two of φ(S[i,j]), φ(S[j+1,k]) and φ(S[i,k]), the remaining fingerprint can be computed in O(1) time.

position: 1 2 3 4 5 6 7 8
S       = a b a b b b a b
        = 0 1 0 1 1 1 0 1

Example: φ(S[2,5]) and φ(S[6,8]) compose into φ(S[2,8]).

Collisions are very unlikely
If S[i,j] ≠ S[i′,j′] then with high probability φ(S[i,j]) ≠ φ(S[i′,j′]).
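The composition property follows from φ(XY) = φ(X) + c^|X| · φ(Y) mod p, which can be solved for any one of the three fingerprints. A hedged sketch follows; the helper names and the constants `P`, `C`, `CODE` are assumptions. It calls `pow` to obtain c^|X|, which costs O(log |X|) multiplications, so a true O(1) bound needs those powers (and their inverses) stored alongside the fingerprints.

```python
P = (1 << 61) - 1        # assumed prime modulus
C = 1_000_003            # assumed fixed base (random in the slides)
CODE = {"a": 0, "b": 1}


def fingerprint(s):
    """Reference fingerprint: sum of s[k] * c^k mod p for k = 1..|s|."""
    return sum(CODE[ch] * pow(C, k, P) for k, ch in enumerate(s, 1)) % P


def join(fp_x, len_x, fp_y):
    """phi(XY) from phi(X), |X| and phi(Y)."""
    return (fp_x + pow(C, len_x, P) * fp_y) % P


def drop_left(fp_xy, fp_x, len_x):
    """phi(Y) from phi(XY), phi(X) and |X| (negative exponent needs Python 3.8+)."""
    return (fp_xy - fp_x) * pow(C, -len_x, P) % P


def drop_right(fp_xy, fp_y, len_x):
    """phi(X) from phi(XY), phi(Y) and |X|."""
    return (fp_xy - pow(C, len_x, P) * fp_y) % P


S = "ababbbab"
x, y = S[1:5], S[5:8]    # S[2,5] and S[6,8] in 1-based inclusive notation
fx, fy, fxy = fingerprint(x), fingerprint(y), fingerprint(x + y)
print(join(fx, len(x), fy) == fxy)
print(drop_left(fxy, fx, len(x)) == fy)
print(drop_right(fxy, fy, len(x)) == fx)
```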

The SLP Toolbox
Useful primitives on SLPs

◮ Decompress a prefix or suffix of a node in linear time. (Gąsieniec, Kolpakov, Potapov and Sant. In Proc. 15th DCC, 2005)
◮ Access a random symbol S[i] in O(log N) time. (Bille, Landau, Raman, Sadakane, Satti, Weimann. In Proc. 22nd SODA, 2011)
◮ Decompress a substring incident to a bookmark in linear time. (Gagie, Gawrychowski, Kärkkäinen, Nekrich, Puglisi. In Proc. LATA, 2012)

Our additions to the toolbox:

Fingerprints
◮ Compute φ(S[i,j]) in O(log N) time (or in O(log log N) time if the SLP is “linear”).

Longest Common Prefixes / Extensions
◮ Compute LCP(i,j) in O(log N log ℓ) time (or in O(log ℓ log log ℓ + log log N) time if the SLP is “linear”).

Many applications: Approximate String Matching, Longest Common Substring, Palindromes, Tandem Repeats, etc.

Main Ideas

We only need to look at prefixes
◮ Fingerprint composition means that it is sufficient to be able to compute fingerprints for prefixes of S, i.e., φ(S[1,i]).
◮ Subtracting two prefix fingerprints, we can obtain any substring fingerprint φ(S[i,j]) in O(1) time.

Compose prefix fingerprint during a random access traversal
◮ Augment the SLP with additional information, e.g., each node stores its fingerprint.
◮ Compose φ(S[1,i]) from fingerprints of selected substrings of S[1,i].
◮ Obtain these fingerprints from a random access traversal of the SLP and the resulting root-to-leaf path.

Fingerprints in O(h) time
A simple solution

Data structure
Each node stores the fingerprint and length of the string it generates: v stores φ(S(v)), |S(v)|; u stores φ(S(u)), |S(u)|; w stores φ(S(w)), |S(w)|.

Composing φ(S[1,i]) in O(h) time
◮ Traverse the SLP for S[i] from the root, comparing i to the substring length at each node to determine the path.
◮ If following a right edge, add the fingerprint for the string generated by the left child to the composed fingerprint.
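The traversal can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the constants `P`, `C`, `CODE` and all names are hypothetical, the example SLP is the one reconstructed from the earlier figure, and each node additionally stores c^|S(v)| mod p so that every step of the walk costs O(1).

```python
P = (1 << 61) - 1        # assumed prime modulus
C = 1_000_003            # assumed fixed base (random in the slides)
CODE = {"a": 0, "b": 1}


class Node:
    """SLP node storing |S(v)|, phi(S(v)) and c^|S(v)| mod p."""

    def __init__(self, left=None, right=None, char=None):
        self.left, self.right = left, right
        if char is not None:                     # terminal rule X_i = a
            self.length = 1
            self.cpow = C % P
            self.fp = CODE[char] * C % P
        else:                                    # nonterminal rule X_i = X_l X_r
            self.length = left.length + right.length
            self.cpow = left.cpow * right.cpow % P
            self.fp = (left.fp + left.cpow * right.fp) % P


def prefix_fingerprint(root, i):
    """phi(S[1,i]) for 0 <= i <= |S|, following a single root-to-leaf path."""
    acc, acc_pow = 0, 1      # fingerprint and c^length of the part composed so far
    node, remaining = root, i
    while remaining > 0:
        if remaining == node.length:             # the rest is exactly S(node)
            return (acc + acc_pow * node.fp) % P
        if remaining <= node.left.length:        # position lies in the left child
            node = node.left
        else:                                    # right edge: absorb the left child
            acc = (acc + acc_pow * node.left.fp) % P
            acc_pow = acc_pow * node.left.cpow % P
            remaining -= node.left.length
            node = node.right
    return acc


def substring_fingerprint(root, i, j):
    """phi(S[i,j]), 1-based inclusive, from two prefix fingerprints."""
    diff = prefix_fingerprint(root, j) - prefix_fingerprint(root, i - 1)
    return diff * pow(C, -(i - 1), P) % P        # negative exponent needs Python 3.8+


def naive(s):
    """Reference fingerprint computed on the uncompressed string."""
    return sum(CODE[ch] * pow(C, k, P) for k, ch in enumerate(s, 1)) % P


# Example SLP (assumed from the figure); X7 generates "ababbbab".
X1, X2 = Node(char="a"), Node(char="b")
X3, X4 = Node(X1, X2), Node(X2, X2)
X5, X6 = Node(X3, X3), Node(X4, X3)
X7 = Node(X5, X6)

print(substring_fingerprint(X7, 2, 5) == naive("babb"))
```

The walk visits at most h nodes, where h is the height of the SLP, which is where the O(h) bound comes from. The substring query still calls `pow` for the inverse power; storing inverse powers would remove that logarithmic factor.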

Fingerprints in O(log N) time

Theorem (Bille et al., SODA 2011)
A random access query for S[i] in an SLP can be performed in O(log N) time and O(n) space, also retrieving the sequence of O(log N) heavy paths visited on the root-to-leaf path.

[Figure: a heavy path with nodes v and u, hanging nodes a_1, a_2, b_1, b_2, and the query position i.]

Composing φ(S[1,i]) in O(log N) time
◮ Perform a random access query for S[i], and for each visited heavy path, add the fingerprint for all left-hanging nodes in constant time.
◮ Store fingerprints for all left-hanging heavy path suffixes.

Linear Straight Line Programs

Almost a normal SLP, but with two differences:
◮ Allow the root to have k children, denoted r_1, ..., r_k.
◮ Restrict the right child of all other internal nodes to be a leaf.

Motivation:
◮ Models the LZ78 compression scheme with O(1) overhead.
◮ Can be converted into a normal SLP of at most double size.

[Figure: a Linear SLP over the symbols a and b whose root has children r_1, ..., r_6.]

Fingerprints in O(log log N) time

Root children in Linear SLP
◮ The start position of root child r_q is the sum of string lengths for children on the left,

$$B_q = \sum_{p=1}^{q-1} |S(r_p)|.$$

◮ Data structure stores φ(S(r_i)) and φ(S[1,B_i]) (i ∈ {1, ..., k}).

[Figure: S(root) partitioned into the strings of r_1, ..., r_6, with start positions B_1, ..., B_6.]

Composing φ(S[1,i]) in O(log log N) time
◮ Find the predecessor B_j of i in the set {B_1, ..., B_k}.
◮ Compose φ(S[1,i]) from two fingerprints in constant time:
  ◮ Fingerprint φ(S[1,B_j]) for a string ending in r_{j−1} (which is stored).
  ◮ Fingerprint φ(S[B_j + 1, i]) for a prefix of a string generated by r_j.

Linear Straight Line Programs

Every prefix of S(v) is fully generated by another node (for a non-root node v).

[Figure: (a) Linear SLP with root children r_1, ..., r_6. (b) The corresponding dictionary tree.]

◮ Store prefix relationships for non-root nodes in the Linear SLP as parent relationships in a dictionary tree of size O(n).
◮ Can find the node generating the m-length prefix of S(r_j) in O(1) time using a level ancestor data structure.
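The Linear SLP query (predecessor search over the start positions, then a lookup in the dictionary tree) can be sketched on an LZ78 parse. This is a simplified illustration with loudly stated assumptions: `bisect` gives an O(log k) predecessor search instead of the O(log log N) structure from the talk, a plain walk up parent pointers replaces the O(1) level ancestor structure, and `pow` recomputes c^{B_j} instead of reading a stored value. All names and the constants `P`, `C`, `CODE` are hypothetical.

```python
from bisect import bisect_left

P = (1 << 61) - 1        # assumed prime modulus
C = 1_000_003            # assumed fixed base (random in the slides)
CODE = {"a": 0, "b": 1}


def build(s):
    """LZ78-parse s into a dictionary tree and record data for each root child."""
    parent, depth, fp, cpow, kids = [-1], [0], [0], [1], [{}]  # node 0 = empty string
    phrases, node = [], 0
    for ch in s:
        nxt = kids[node].get(ch)
        if nxt is None:                          # new phrase = known phrase + ch
            nxt = len(parent)
            parent.append(node)
            depth.append(depth[node] + 1)
            cpow.append(cpow[node] * C % P)      # c^depth
            fp.append((fp[node] + CODE[ch] * cpow[nxt]) % P)
            kids.append({})
            kids[node][ch] = nxt
            phrases.append(nxt)
            node = 0
        else:
            node = nxt
    if node:                                     # trailing phrase already in the tree
        phrases.append(node)
    starts, start_fp, pos, acc = [], [], 0, 0
    for v in phrases:                            # B_q and phi(S[1, B_q]) per root child
        starts.append(pos)
        start_fp.append(acc)
        acc = (acc + pow(C, pos, P) * fp[v]) % P
        pos += depth[v]
    return {"parent": parent, "depth": depth, "fp": fp,
            "phrases": phrases, "starts": starts, "start_fp": start_fp}


def prefix_fingerprint(index, i):
    """phi(S[1,i]) for 0 <= i <= |S|."""
    if i == 0:
        return 0
    j = bisect_left(index["starts"], i) - 1      # predecessor: largest B_j < i
    v, m = index["phrases"][j], i - index["starts"][j]
    while index["depth"][v] > m:                 # stand-in for a level ancestor query
        v = index["parent"][v]
    return (index["start_fp"][j] + pow(C, index["starts"][j], P) * index["fp"][v]) % P


INDEX = build("ababbbab")
print(INDEX["starts"])
```

On the example string the parse is a, b, ab, bb, ab, so the start positions are 0, 1, 2, 4, 6. A query for i = 5 lands in the fourth phrase and uses the depth-1 ancestor of its tree node, which generates the one-symbol prefix b.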

Longest Common Prefixes / Extensions

Preprocess a Straight Line Program (SLP) G of size n producing a string S of length N to support LCP queries:
◮ LCP(i,j) = max ℓ such that S[i, i+ℓ] = S[j, j+ℓ].

Theorem
There are data structures solving the LCP problem on SLPs in
◮ O(n) space and query time O(log ℓ log N)
◮ O(n) space and query time O(log ℓ log log ℓ + log log N) if G is a Linear SLP

[Figure: positions i and j in S; the LCP is found with O(log ℓ) fingerprint comparisons.]
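The O(log ℓ) comparison count comes from an exponential search followed by a binary search over candidate lengths, comparing fingerprints at each step. In the sketch below a prefix-fingerprint table on the uncompressed string stands in for the SLP fingerprint query (an assumption made to keep the example short), positions are 0-based, and the result is the number of matching symbols. The answer is correct as long as no fingerprint collision occurs; all names and constants are hypothetical.

```python
P = (1 << 61) - 1        # assumed prime modulus
C = 1_000_003            # assumed fixed base (random in the slides)
CODE = {"a": 0, "b": 1}


def prefix_table(s):
    """pre[i] = phi(s[:i]); stands in for a prefix fingerprint query on an SLP."""
    pre, power = [0], 1
    for ch in s:
        power = power * C % P
        pre.append((pre[-1] + CODE[ch] * power) % P)
    return pre


def substring_fp(pre, start, length):
    """Fingerprint of s[start:start+length], re-indexed so that it starts at k = 1."""
    return (pre[start + length] - pre[start]) * pow(C, -start, P) % P


def lcp(pre, i, j):
    """Number of matching symbols of the suffixes at 0-based positions i and j."""
    limit = len(pre) - 1 - max(i, j)             # longest length that fits in s

    def same(length):
        return substring_fp(pre, i, length) == substring_fp(pre, j, length)

    hi = 1
    while hi <= limit and same(hi):              # exponential search: 1, 2, 4, ...
        hi *= 2
    lo = hi // 2                                 # same(lo) holds (lo may be 0)
    hi = min(hi, limit + 1)                      # the answer is strictly below hi
    while hi - lo > 1:                           # binary search in (lo, hi)
        mid = (lo + hi) // 2
        if same(mid):
            lo = mid
        else:
            hi = mid
    return lo


PRE = prefix_table("ababbbab")
print(lcp(PRE, 0, 2))    # 2: "ababbbab" and "abbbab" share "ab"
```

Both phases make O(log ℓ) fingerprint comparisons, so plugging in an O(log N) fingerprint query gives the O(log ℓ log N) bound stated in the theorem.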

The Takeaway Message

“Karp-Rabin fingerprints can be computed efficiently on compressed strings.”

Open Problems
◮ Other basic primitives on SLPs?
◮ Bookmarked fingerprints on unbalanced SLPs?
◮ LCP queries in same time as random access?

Thank you!
