An opportunistic text indexing structure based on run length encoding

CIAC 2015
An opportunistic text indexing structure based on run length encoding
Yuya Tamakoshi, Keisuke Goto, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda
Kyushu University, Japan

Kyushu University, Japan
Kyushu U. is on the Itoshima Peninsula (糸島, "String Island").

String matching
Input: text string T and pattern string P
Output: all occurrences of P in T
Example text T: "We introduce a general framework which is suitable to capture an essence of compressed pattern matching according to various dictionary based compressions. The goal is to find all occurrences of a pattern in a text without decompression, which is one of the most active topics in string matching. Our framework includes such compression methods as Lempel-Ziv family, (LZ77, LZSS, LZ78, LZW), byte-pair encoding, and the static dictionary based method."
Example pattern P: compress
• String matching is fundamental to areas such as information retrieval, bioinformatics, etc.

Indexed string matching
Preprocess: build an index on a fixed text T
Query: pattern string P
Answer: all occurrences of P in T
• The goal is to construct a space-efficient index on T which quickly answers string matching queries.
• Text T can be very long (e.g., DNA sequences).
• We may receive many different query patterns.

Classical text index: Suffix Array
The suffix array SA of text T is an array which stores the beginning positions of the suffixes of T in lexicographic order [Manber & Myers, 1991].
T = cococacao$, where $ is an end-marker which appears only at the end of the string and sorts before every other character.
Suffixes by position        Sorted suffixes    SA
 1  cococacao$              $                  10
 2  ococacao$               acao$               6
 3  cocacao$                ao$                 8
 4  ocacao$                 cacao$              5
 5  cacao$                  cao$                7
 6  acao$                   cocacao$            3
 7  cao$                    cococacao$          1
 8  ao$                     o$                  9
 9  o$                      ocacao$             4
10  $                       ococacao$           2
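The table above can be reproduced with a few lines of Python. This is only a naive sketch that sorts the suffixes directly (quadratic or worse in the text length), not the construction of Manber & Myers; the function name `suffix_array` is an illustrative choice, and positions are 1-based to match the slide.

```python
def suffix_array(text: str) -> list[int]:
    """Return the 1-based start positions of the suffixes of text in lexicographic order."""
    # Naive construction: sort positions by the suffix that starts there.
    # '$' (ASCII 36) sorts before lowercase letters, as the end-marker requires.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


T = "cococacao$"
print(suffix_array(T))  # [10, 6, 8, 5, 7, 3, 1, 9, 4, 2]
```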

String matching with suffix array
Binary search for a given pattern P on SA. Example: T = cococacao$, P = coc.
Sorted suffixes    SA
$                  10
acao$               6
ao$                 8
cacao$              5
cao$                7
cocacao$            3   ✔
cococacao$          1   ✔
o$                  9
ocacao$             4
ococacao$           2
1. P is compared with cao$: P > cao$, so the search continues among the later suffixes.
2. P is compared with o$: P < o$, so the search continues among the suffixes between cao$ and o$.
3. P is compared with cocacao$: P is a prefix of this suffix, so position 3 is an occurrence.
4. The adjacent suffix cococacao$ also has P as a prefix, so position 1 is an occurrence.
The occurrences of P in T form a contiguous range of SA; here P = coc occurs at positions 1 and 3 of T.

String matching with suffix array
All occurrences of P in T can be found in O(m log u + occ) time using SA.
The search time can be improved to O(m + log u + occ) using the LCP array.
u = |T|, m = |P|, occ = number of occurrences of P in T
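The O(m log u + occ) search can be sketched as two binary searches that locate the boundaries of the range of suffixes starting with P. This is a minimal illustration, not the LCP-accelerated O(m + log u + occ) variant; the name `sa_search` is hypothetical, and SA is copied from the example T = cococacao$.

```python
def sa_search(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return the sorted 1-based positions of pattern in text, using suffix array sa."""
    m = len(pattern)

    def prefix(k: int) -> str:
        # First m characters of the k-th smallest suffix (O(m) per comparison).
        start = sa[k] - 1
        return text[start:start + m]

    # Lower bound: first suffix whose m-prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])


T = "cococacao$"
SA = [10, 6, 8, 5, 7, 3, 1, 9, 4, 2]
print(sa_search(T, SA, "coc"))  # [1, 3]
```

Each of the O(log u) comparisons costs O(m), and reporting the range costs O(occ), which gives the first bound on the slide.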

SA+LCP
Theorem [Manber & Myers, 1991]: There is an index (SA+LCP) which reports all occ occurrences of P in T in O(m + log u + occ) time, and requires 2u log u + u log σ + O(u) bits of space.
Space breakdown: 2u log u bits for SA & LCP, u log σ bits for the text T, and O(u) bits for an auxiliary data structure.
u = |T|, m = |P|, σ = |Σ| (alphabet size)
• This can take too much space for a large text T (i.e., for large u).
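To get a feel for the space bound, here is a worked instance under assumed values that are not from the slides: a DNA-like text with u = 2^30 characters over an alphabet of size σ = 4, logarithms taken base 2, and the O(u) term ignored.

```latex
\begin{align*}
2u\log_2 u &= 2 \cdot 2^{30} \cdot 30 = 60 \cdot 2^{30}\ \text{bits} = 7.5\ \text{GiB} && \text{(SA and LCP)}\\
u\log_2 \sigma &= 2^{30} \cdot 2 = 2 \cdot 2^{30}\ \text{bits} = 0.25\ \text{GiB} && \text{(text $T$)}
\end{align*}
```

Under these assumptions the two arrays take about 30 times as much space as the text itself.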

Compressed index
• There are a number of compressed indexes which occupy only the compressed size of the text.
• FM-index [Ferragina & Manzini, 2000], Compressed Suffix Array [Grossi & Vitter, 2000], Lempel-Ziv index [Gagie et al., 2014], etc.
• Most of them are slower than SA+LCP.
Our proposal: a new compressed index based on run length encoding (RLE) of the text, which is smaller and faster than SA+LCP.

Run Length Encoding (RLE)
The run length encoding of text T, denoted RLE(T), is a compressed representation of T in which each maximal run a…a of a character a is encoded as a^p, where p denotes the length of the maximal run.
T = aaaabbbaacccccccbbbbbaaaaa$
RLE(T) = a^4 b^3 a^2 c^7 b^5 a^5 $
• Applications of RLE include:
  • black-and-white fax messages
  • image formats (PackBits, TIFF)
  • music formats (MIDI)
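A minimal sketch of the encoding using `itertools.groupby` from the Python standard library. Representing each run as a (character, exponent) pair is an assumption made for illustration; note that the end-marker $ appears here with exponent 1, which the slide leaves implicit.

```python
from itertools import groupby


def rle(text: str) -> list[tuple[str, int]]:
    """Encode text as a list of (character, run length) pairs, one per maximal run."""
    # groupby yields one group per maximal block of equal consecutive characters.
    return [(ch, len(list(run))) for ch, run in groupby(text)]


T = "aaaabbbaacccccccbbbbbaaaaa$"
print(rle(T))
# [('a', 4), ('b', 3), ('a', 2), ('c', 7), ('b', 5), ('a', 5), ('$', 1)]
```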

RLE suffixes
Let n = |RLE(T)|. For any 1 ≤ i ≤ n, RLEsuf(i) is the suffix of RLE(T) starting with the i-th run.
RLE(T) = a^4 b^3 a^2 c^7 b^5 a^5 $ (n = 7)
RLEsuf(1): a^4 b^3 a^2 c^7 b^5 a^5 $
RLEsuf(2): b^3 a^2 c^7 b^5 a^5 $
RLEsuf(3): a^2 c^7 b^5 a^5 $
RLEsuf(4): c^7 b^5 a^5 $
RLEsuf(5): b^5 a^5 $
RLEsuf(6): a^5 $
RLEsuf(7): $

Difficulty in indexing RLE suffixes
• We want to index only the RLE suffixes of the text, but simply sorted RLE suffixes don't work!
Sorted RLE suffixes of the text, with RLE(P) = a^2 b^1 (✔ marks an occurrence of P):
RLE form      Decoded form
a^5 b ...     aaaaab ...   ✔
a^5 b ...     aaaaab ...   ✔
a^5 c ...     aaaaac ...
a^4 b ...     aaaab ...    ✔
a^4 c ...     aaaac ...
a^4 c ...     aaaac ...
a^4 c ...     aaaac ...
a^3 b ...     aaab ...     ✔
a^3 b ...     aaab ...     ✔
Pattern occurrences are spread out, so we cannot binary search!
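The scattering can be checked mechanically on the slide's example. The sketch below takes the decoded heads of the nine sorted RLE suffixes (the parts shown before each "..."), confirms they are in lexicographic order, and lists which of them contain P = aab. Using a substring test on these short heads is a simplification that works only for this example.

```python
# Decoded heads of the sorted RLE suffixes from the slide (the "..." tails are omitted).
heads = ["aaaaab", "aaaaab", "aaaaac", "aaaab", "aaaac",
         "aaaac", "aaaac", "aaab", "aaab"]
assert heads == sorted(heads)  # the list really is in lexicographic order

P = "aab"  # RLE(P) = a^2 b^1
hits = [i for i, head in enumerate(heads) if P in head]
print(hits)  # [0, 1, 3, 7, 8]

# The matching rows do not form one contiguous block, so a plain binary
# search over the sorted RLE suffixes cannot return them as a single range.
contiguous = hits == list(range(hits[0], hits[-1] + 1))
print(contiguous)  # False
```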

Our ideas to index RLE suffixes
• When sorting RLE suffixes, we "ignore" the exponents of the first runs of the RLE suffixes of text T.
• To find occurrences of pattern P, we first "ignore" the exponent of the first run of RLE(P), and find its corresponding range.
• We then pick up only the occurrences of RLE(P) from this range.

Truncated RLE suffixes
tRLEsuf(i) is the suffix RLEsuf(i) with its first exponent p_i truncated to 1.
i   RLEsuf(i)                        tRLEsuf(i)
1   a^4 b^3 a^2 c^7 b^5 a^5 $        a^1 b^3 a^2 c^7 b^5 a^5 $
2   b^3 a^2 c^7 b^5 a^5 $            b^1 a^2 c^7 b^5 a^5 $
3   a^2 c^7 b^5 a^5 $                a^1 c^7 b^5 a^5 $
4   c^7 b^5 a^5 $                    c^1 b^5 a^5 $
5   b^5 a^5 $                        b^1 a^5 $
6   a^5 $                            a^1 $
7   $                                $

Our index: Truncated RLE Suffix Array
The tRLE suffix array tRLESA of text T is an array which stores the beginning positions of the tRLE suffixes in lexicographical order.
tRLE suffixes by position            Sorted tRLE suffixes             tRLESA
1  a^1 b^3 a^2 c^7 b^5 a^5 $         $                                7
2  b^1 a^2 c^7 b^5 a^5 $             a^1 $                            6
3  a^1 c^7 b^5 a^5 $                 a^1 b^3 a^2 c^7 b^5 a^5 $        1
4  c^1 b^5 a^5 $                     a^1 c^7 b^5 a^5 $                3
5  b^1 a^5 $                         b^1 a^5 $                        5
6  a^1 $                             b^1 a^2 c^7 b^5 a^5 $            2
7  $                                 c^1 b^5 a^5 $                    4
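The array above can be reproduced with a naive sketch that decodes each truncated RLE suffix and sorts the decoded strings. This assumes the lexicographical order is that of the decoded strings, which agrees with the order on the slide; decoding defeats the purpose of a compressed index, so this is for illustration only, and the function names are hypothetical.

```python
from itertools import groupby


def rle(text: str) -> list[tuple[str, int]]:
    """Encode text as a list of (character, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def trle_suffix_array(text: str) -> list[int]:
    """Return 1-based run indices sorted by their decoded truncated RLE suffix."""
    runs = rle(text)

    def decoded_truncated(i: int) -> str:
        # First run truncated to exponent 1, followed by the remaining runs in full.
        head = runs[i][0]
        tail = "".join(ch * p for ch, p in runs[i + 1:])
        return head + tail

    return sorted(range(1, len(runs) + 1), key=lambda i: decoded_truncated(i - 1))


T = "aaaabbbaacccccccbbbbbaaaaa$"
print(trle_suffix_array(T))  # [7, 6, 1, 3, 5, 2, 4]
```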

Monotonicity on Truncated RLE Suffix Array
tRLE suffixes (ignored exponents in parentheses)    tRLESA
...                                                 ...
b^(2) c^5 a^2 b^2 a^6 ...                           47
b^(9) c^5 a^2 b^3 a^1 ...                           99
b^(1) c^5 a^2 b^5 a^4 ...                           11
b^(2) c^5 a^2 b^7 a^3 ...                           40
b^(3) c^5 a^2 b^8 c^2 ...                           55
b^(9) c^5 a^2 b^6 c^3 ...                           72
b^(1) c^5 a^2 b^6 c^3 ...                           19
b^(5) c^5 a^2 b^4 c^7 ...                           26
b^(1) c^5 a^2 b^1 c^8 ...                           4
...                                                 ...
RLE(P) = b^3 c^5 a^2 b^4. We first look for b c^5 a^2 ... The range that b c^5 a^2 matches (the rows shown above) can be found by a binary search.
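The two-step search (find the range with the first exponent of RLE(P) ignored, then keep only the real occurrences) can be sketched as follows. This is a simplified illustration, not the authors' data structure: it decodes suffixes, builds the prefix list eagerly, filters the range by a linear scan, and handles only patterns with at least two runs. All function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from itertools import groupby


def rle(text):
    """Encode text as a list of (character, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def decode(runs):
    return "".join(ch * p for ch, p in runs)


def find_occurrences(text, pattern):
    """Return the sorted 1-based positions of pattern in text."""
    runs, p_runs = rle(text), rle(pattern)
    if len(p_runs) < 2:
        raise ValueError("this sketch only handles patterns with at least two runs")

    # 1-based text position at which each run starts.
    starts, pos = [], 1
    for _, p in runs:
        starts.append(pos)
        pos += p

    # Naive tRLESA: run indices sorted by their decoded truncated RLE suffix.
    keys = {i: runs[i][0] + decode(runs[i + 1:]) for i in range(len(runs))}
    trlesa = sorted(keys, key=keys.get)

    # Step 1: ignore the first exponent of RLE(P) and binary search the
    # range of tRLE suffixes that start with the truncated pattern.
    c1, p1 = p_runs[0]
    truncated = c1 + decode(p_runs[1:])
    prefixes = [keys[i][:len(truncated)] for i in trlesa]
    lo = bisect_left(prefixes, truncated)
    hi = bisect_right(prefixes, truncated)

    # Step 2: keep only the runs whose ignored exponent is at least p1; the
    # occurrence starts p1 characters before the end of that run.
    return sorted(starts[i] + runs[i][1] - p1
                  for i in trlesa[lo:hi] if runs[i][1] >= p1)


T = "aaaabbbaacccccccbbbbbaaaaa$"
print(find_occurrences(T, "bbaac"))  # [6]
```

The linear filter in step 2 is the part an actual index would need to speed up, since the range can contain many suffixes whose ignored exponent is too small.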
