
Searching String Collections for the Most Relevant Documents



  1. Searching String Collections for the Most Relevant Documents
     Wing-Kai Hon (NTHU, Taiwan), Rahul Shah (LSU), Jeff Vitter (Texas A&M Univ.)

  2. Outline
     • Background on compressed data structures
     • Our framework
     • Achieving optimal results
     • Construction algorithms
     • Succinct solutions
     • Conclusions

  3. The Attack of Massive Data
     • Lots of massive data sets being generated
       – Web publishing, bioinformatics, XML, e-mail, satellite geographical data
       – IP address information, UPCs, credit cards, ISBN numbers, large inverted files
     • Data sets need to be compressed (and are compressible)
       – Mobile devices have limited storage available
       – I/O overhead is reduced
       – There is never enough memory!
     • Goal: design data structures to manage massive data sets
       – Near-minimum amount of space
         • Measure space in a data-aware way, i.e., in terms of each individual data set
       – Near-optimal query times for powerful queries
       – Efficient in external memory

  4. Parallel Disk Model [Vitter, Shriver 90, 94]
     • N = problem size (80 GB – 100 TB and more!)
     • M = internal memory size (1 – 4 GB)
     • B = disk block size (8 – 500 KB)
     • D = # independent disks
     • Scan: O(N / DB)
     • Sorting: O((N / DB) log_{M/B} (N / B))
     • Search: O(log_{DB} N)
     • See book [Vitter 08] for overview
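
To get a feel for these bounds, the sketch below plugs concrete parameters into them. It assumes the standard form of the sorting bound, (N / DB) log_{M/B} (N / B), ignores constant factors, and measures N, M, and B in the same unit (items). The function name `pdm_io_bounds` is illustrative and does not come from the slides.

```python
import math


def pdm_io_bounds(N, M, B, D):
    """Return (scan, sort, search) I/O estimates in the Parallel Disk Model.

    N = problem size, M = internal memory size, B = disk block size,
    D = number of independent disks. Constant factors are ignored.
    """
    scan = N / (D * B)                      # O(N / DB)
    sort = scan * math.log(N / B, M / B)    # O((N / DB) log_{M/B} (N / B))
    search = math.log(N, D * B)             # O(log_{DB} N)
    return scan, sort, search


# Hypothetical parameters: 2^40 items, 2^32 items of memory,
# 2^16 items per block, 4 disks.
scan, sort, search = pdm_io_bounds(N=2**40, M=2**32, B=2**16, D=4)
print(f"scan ~ {scan:.0f} I/Os, sort ~ {sort:.0f} I/Os, search ~ {search:.2f} I/Os")
```

With these parameters the sort costs only 1.5 times as many I/Os as a scan, because the logarithm has the large base M/B.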

  5. Indexing all the books in a library
     • 10-floor library
     • catalogue of books
       – each title and some keywords
     • negligible additional space
       – a small card (few bytes) per book
       – one bookshelf to store the cards
     • limited search operations!

  6. Word-level indexing (à la Google) (search for a word using inverted index)
     1. Split the text into words.
     2. Collect all distinct words in a dictionary.
     3. For each word w, store the inverted list of its locations i_1, i_2, … in the text.
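
The three steps above can be sketched in a few lines. This is a minimal illustration, not the indexing scheme of any particular search engine: it assumes words are maximal runs of word characters, lowercases them, and records locations as word positions rather than byte offsets. The name `build_inverted_index` is illustrative.

```python
import re
from collections import defaultdict


def build_inverted_index(text):
    """Map each distinct word to the sorted list of its word positions."""
    index = defaultdict(list)
    # Step 1: split the text into words (assumption: \w+ runs, lowercased).
    words = re.findall(r"\w+", text.lower())
    # Steps 2 and 3: the dict keys form the dictionary of distinct words,
    # and each value is the inverted list of that word's locations.
    for position, word in enumerate(words):
        index[word].append(position)
    return dict(index)


index = build_inverted_index("to be or not to be")
print(index["to"])  # positions of the word "to"
```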

  7. Word-level indexing
     • Simple implementation: one pointer per location
       – Avg. word size ≥ pointer size ⇒ index space ≈ text size
     • Much better implementation: compress the inverted lists by encoding the gaps between adjacent entries (e.g., γ and δ codes [WMB99])
       – Index space is 10%-15% of the text size
     • [Figure: 1 ½ floors of index + 10 floors of text]
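
Gap encoding can be illustrated with the Elias γ code, which writes a positive integer x as (number of binary digits of x, minus one) zeros followed by x in binary. The sketch below assumes strictly increasing, non-negative locations and stores bits as a Python string for readability; a real index would pack them into machine words. All function names are illustrative.

```python
def gamma_encode(x):
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if x < 1:
        raise ValueError("gamma codes are defined for positive integers")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def encode_postings(postings):
    """Gamma-encode the gaps of a strictly increasing list of locations."""
    bits = []
    previous = -1  # so that the first gap, p - (-1), is at least 1
    for p in postings:
        bits.append(gamma_encode(p - previous))
        previous = p
    return "".join(bits)


def decode_postings(bits):
    """Invert encode_postings."""
    postings = []
    previous = -1
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # unary prefix: length of the binary part - 1
            zeros += 1
            i += 1
        gap = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        previous += gap
        postings.append(previous)
    return postings


encoded = encode_postings([3, 7, 8, 20])
print(len(encoded), "bits:", encoded)
print(decode_postings(encoded))
```

Small gaps get short codes, which is why frequent words (with dense inverted lists) compress well.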

  8. Full-text indexing (searching for a general pattern P)
     • Not handled efficiently by Google
     • No clear notion of word is always available:
       – Some Eastern languages
       – unknown structure (e.g., DNA sequences)
     • Alphabet Σ, text T of size n bytes (i.e., n log |Σ| bits): each text position is the start of a potential occurrence of P
     • Naive approach: blow-up with O(n^2) words of space
     • Can we do better with O(n) words (i.e., O(n log n) bits)?
     • Or even better with linear space O(n log |Σ|) bits?
     • Or best yet with compressed space n H_k (1 + o(1)) bits?

  9. Suffix tree / Patricia trie, |Σ| = 2
     • Compact trie storing the suffixes of input string bababa# (assuming a < # < b)
     • Space is O(n log n) bits >> text size of n bits
     • In practice, space is roughly 16n bytes [MM93]
     • [Figure: 160 floors for the index vs. 10 floors for the text]

  10. Suffix array
     • Sorted list of suffixes (assuming a < b < #)
     • Better space occupancy: n log n bits, 4n bytes in practice
     • Additional n bytes for the lcps [MM93]
     • Can find pattern P by binary search. (Actually there are better ways.)
     • [Figure: 40-50 floors for the index vs. 10 floors for the text]
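
A minimal sketch of binary search over a suffix array follows. It builds the array by plain sorting, which is far from the efficient constructions in the literature, and it uses Python's default character order, in which `#` sorts before the letters (unlike the a < b < # order assumed on the slide). That changes the order of the suffixes but not the set of matches. Function names are illustrative.

```python
def build_suffix_array(text):
    """Starting positions of all suffixes, in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text."""
    m = len(pattern)
    # First binary search: leftmost suffix whose m-character prefix >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Second binary search: leftmost suffix whose m-character prefix > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "bababa#"
sa = build_suffix_array(text)
print(sa)
print(find_occurrences(text, sa, "ba"))
```

Each comparison inspects at most |P| characters, so a query costs O(|P| log n) character comparisons with this simple scheme.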

  11. Space reduction
     • The importance of space saving (money saving):
       – Portable computing with limited memory
       – Search engines use DRAM in place of hard disks
       – Next generation cellular phones will cost # bits transmitted
     • Sparse suffix tree [KU96] and other data structures based on suffix trees, arrays, and automata [K95, CV97, ...]
     • Practical implementations of suffix trees reduce space but still 10n bytes [K99] or 2.5n bytes [M01] on average

  12. Compressed Suffix Array (Grossi, Gupta, Vitter 03)
     • O(|P| + polylog(n)) search time (in RAM model).
     • Size of index equals size of text in entropy-compressed form (with multiplicative constant 1)!
     • New indexes (such as our CSA) require 20%-40% of the text size.
     • Self-indexing text: no need to keep the text!
       – Any portion of the text can be decoded from the index.
       – Decoding is fast and does not require scanning the whole text.
     • Can cut search time further by a log n factor (word size).
     • First external memory version in SPIRE 2009.
     • [Figure: space in library floors for the text, inverted index, suffix array, and new index; floor labels recovered from the slide: 50-60, 40-50, 11 ½, 10, 2-4, 1 ½]

  13. Fundamental Problems in Text Search
     • Pattern Matching: Given a text T and pattern P drawn from alphabet Σ, find all locations of P in T.
       – Data structures: Suffix Trees and Suffix Arrays
       – Better: Compressed Suffix Arrays [GGV03], FM-Index [FM05]
     • Document Listing: Given a collection of text strings (documents) d_1, d_2, …, d_D of total length n, search for query pattern P (of length p).
       – Output the ndoc documents which contain pattern P.
       – Issue: Total number ndoc of documents output might be much smaller than total number of pattern occurrences, so going through all occurrences is inefficient.
       – Muthukrishnan: O(n) words of space, answers queries in optimal O(p + ndoc) time.

  14. Modified Problem: Using Relevance
     • Instead of listing all documents (strings) in which the pattern occurs, list only highly "relevant" documents.
       – Frequency: where pattern P occurs most frequently.
       – Proximity: where two occurrences of P are close to each other.
       – Importance: where each document has a static weight (e.g., Google's PageRank).
     • Threshold vs. Top-k
       – Thresholding: K-mine and K-repeats problem (Muthu).
       – Top-k: Retrieve only the k most-relevant documents.
         • Intuitive for User
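
To make the relevance measures concrete, here is a naive baseline that scans every document. It only illustrates the definitions of frequency, proximity, and top-k; it is not the index-based approach of the talk. The function names, and the choice to break frequency ties by lower document index, are assumptions made for this example.

```python
import heapq


def occurrences(doc, pattern):
    """Start positions of all (possibly overlapping) occurrences."""
    positions = []
    i = doc.find(pattern)
    while i != -1:
        positions.append(i)
        i = doc.find(pattern, i + 1)
    return positions


def frequency(doc, pattern):
    """Frequency score: number of occurrences of pattern in doc."""
    return len(occurrences(doc, pattern))


def min_gap(doc, pattern):
    """Proximity score: smallest distance between two occurrences, or None."""
    occ = occurrences(doc, pattern)
    if len(occ) < 2:
        return None
    return min(b - a for a, b in zip(occ, occ[1:]))


def top_k_by_frequency(docs, pattern, k):
    """Indices of the k documents with the most occurrences (at least one)."""
    scored = [(frequency(d, pattern), i) for i, d in enumerate(docs)]
    scored = [s for s in scored if s[0] > 0]
    # Higher frequency first; ties broken by lower document index.
    best = heapq.nlargest(k, scored, key=lambda s: (s[0], -s[1]))
    return [i for _, i in best]


docs = ["banana", "urban", "cabana"]
print(top_k_by_frequency(docs, "an", 2))
print(min_gap("banana", "an"))
```

This baseline takes time proportional to the total length of all documents per query, which is exactly the cost an index is meant to avoid.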

  15. Approaches
     • Inverted Indexes
       – Popular in IR community.
       – Need to know patterns in advance (words).
       – In strings the word boundaries are not well defined.
       – Inverted indexes for all possible substrings can take a lot more space.
       – Else they may not be able to answer arbitrary pattern queries (provably efficiently).
     • Muthukrishnan's Structures (based on suffix trees)
       – Take O(n log n) words of space for K-mine and K-repeats problem (thresholding) while answering queries in O(p + ndoc) time.
       – Top-k queries require additional overhead.

  16. Suffix tree-based solutions
     • Document Retrieval Problem
       – Store all suffixes of the D documents.
       – Each leaf in suffix tree contains
         • Document id
         • D: Leaf-rank of previous leaf of the same document
       – Traverse the suffix tree and get the range [L, R] such that all the occurrences of the pattern correspond to the leaves from leaf-rank L to R.
       – To obtain each document uniquely, output only those leaves with D-values < L (i.e., the smallest leaf rank for the document within the range).
         • 3-sided query in 2 dimensions: (2,1)-range query.
         • Can be done using repeated application of RMQs.
       – O(p + ndoc) time… see figure.
     • K-mine and K-repeats
       – Fixed K, separate structure for each K value: O(n log n) words space.
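
The leaf-rank idea can be sketched with a suffix array standing in for the leaves of the suffix tree. The code below is a simplified illustration under several assumptions: documents are joined with a `"\x00"` separator that occurs in no document or pattern, the array is built by naive sorting, and the range-minimum query is a linear scan rather than the constant-time structure needed for the O(p + ndoc) bound. All names (`prev_leaf`, `list_documents`, and so on) are illustrative.

```python
def build_index(docs):
    """Return (text, sa, doc_of_leaf, prev_leaf) for a list of strings."""
    text = "".join(d + "\x00" for d in docs)
    doc_at = []  # document id of every text position
    for doc_id, d in enumerate(docs):
        doc_at.extend([doc_id] * (len(d) + 1))
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    doc_of_leaf = [doc_at[i] for i in sa]
    # prev_leaf[r] = rank of the previous leaf of the same document, or -1.
    last_seen, prev_leaf = {}, []
    for rank, doc_id in enumerate(doc_of_leaf):
        prev_leaf.append(last_seen.get(doc_id, -1))
        last_seen[doc_id] = rank
    return text, sa, doc_of_leaf, prev_leaf


def suffix_range(text, sa, pattern):
    """Inclusive leaf range (L, R) of suffixes prefixed by pattern, or None."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # leftmost leaf whose m-prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(sa)
    while lo < hi:  # leftmost leaf whose m-prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return (left, lo - 1) if left < lo else None


def list_documents(doc_of_leaf, prev_leaf, L, R):
    """Report each document in leaves L..R once, via repeated range minima."""
    found, stack = [], [(L, R)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        # Naive range-minimum query (assumption: stands in for an O(1) RMQ).
        pos = min(range(lo, hi + 1), key=prev_leaf.__getitem__)
        if prev_leaf[pos] < L:  # first leaf of its document inside [L, R]
            found.append(doc_of_leaf[pos])
            stack.append((lo, pos - 1))
            stack.append((pos + 1, hi))
    return found


def documents_containing(docs, pattern):
    text, sa, doc_of_leaf, prev_leaf = build_index(docs)
    rng = suffix_range(text, sa, pattern)
    return sorted(list_documents(doc_of_leaf, prev_leaf, *rng)) if rng else []


print(documents_containing(["banana", "urban"], "an"))
```

A subrange is abandoned as soon as its minimum is at least L, since every leaf in it then repeats a document already seen earlier in the range; this is what keeps the number of queries proportional to ndoc.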

  17. Suffix tree-based solutions
     • d1: banana, d2: urban ($ < a < b)
     • [Figure: generalized suffix tree over the suffixes of banana$ and urban$, with each leaf labeled d1 or d2]
     • Search pattern: "an"
     • We look at the node's subtree: d1 appears twice and d2 appears once in this subtree
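
The counts in this example can be checked with a short sketch that replaces the suffix tree by a sorted list of (suffix, document) pairs: the leaves below the node for "an" are exactly the contiguous run of suffixes that start with "an". The function name `document_frequencies` is illustrative, and the naive construction is only suitable for tiny inputs.

```python
import bisect
from collections import Counter


def document_frequencies(docs, pattern):
    """Count, per document, the suffixes that start with pattern.

    docs maps a document name to its text; "$" is appended as the end
    marker and sorts before the letters in ASCII, matching $ < a < b.
    """
    leaves = sorted(
        (text[i:] + "$", name)
        for name, text in docs.items()
        for i in range(len(text) + 1)
    )
    suffixes = [suffix for suffix, _ in leaves]
    # Suffixes prefixed by pattern form a contiguous run starting here.
    lo = bisect.bisect_left(suffixes, pattern)
    hi = lo
    while hi < len(suffixes) and suffixes[hi].startswith(pattern):
        hi += 1
    return dict(Counter(name for _, name in leaves[lo:hi]))


print(document_frequencies({"d1": "banana", "d2": "urban"}, "an"))
```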

  18. Preliminary: RMQs for top-k on array
     • Range Maximum Query: Given an array A and query (i, j), report the maximum of A[i..j]
       – Linear space, linear preprocessing time DS with O(1) query time
     • Range threshold: Given an array A and a query (i, j, τ), report all the numbers in A[i..j] which are ≥ τ
       – Can be done using repeated RMQs in O(output) time
     • Range top-k: Given an array A and a query (i, j, k), report top-k highest numbers in A[i..j]
       – Repeated RMQs + heap = O(k log k) time
     • Generalization: Given array A, and query specifies set of t ranges [i_1, j_1], [i_2, j_2], …, [i_t, j_t]
       – Threshold: O(t + output), top-k: O(t + k log k)
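
Both reductions to repeated RMQs can be sketched as follows. The helper `range_max_pos` is a linear scan used here only as a stand-in; the stated time bounds assume it is replaced by a constant-time RMQ structure. Ranges are inclusive, and all function names are illustrative.

```python
import heapq


def range_max_pos(A, i, j):
    """Position of the maximum of A[i..j] (naive stand-in for an O(1) RMQ)."""
    return max(range(i, j + 1), key=A.__getitem__)


def range_threshold(A, i, j, tau):
    """All values in A[i..j] that are >= tau, via repeated RMQs."""
    out, stack = [], [(i, j)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        pos = range_max_pos(A, lo, hi)
        if A[pos] >= tau:  # otherwise nothing in A[lo..hi] qualifies
            out.append(A[pos])
            stack.append((lo, pos - 1))
            stack.append((pos + 1, hi))
    return out


def range_top_k(A, i, j, k):
    """The k largest values of A[i..j] in decreasing order, via RMQs + heap."""
    heap, out = [], []

    def push(lo, hi):
        if lo <= hi:
            pos = range_max_pos(A, lo, hi)
            heapq.heappush(heap, (-A[pos], pos, lo, hi))

    push(i, j)
    while heap and len(out) < k:
        neg_value, pos, lo, hi = heapq.heappop(heap)
        out.append(-neg_value)
        # Splitting at the maximum yields two candidate subranges.
        push(lo, pos - 1)
        push(pos + 1, hi)
    return out


A = [3, 1, 4, 1, 5, 9, 2, 6]
print(range_top_k(A, 1, 6, 3))
print(sorted(range_threshold(A, 0, 7, 4)))
```

Each reported value adds at most two subranges to the heap, so the heap holds O(k) entries and the top-k query performs O(k) RMQs and O(k log k) heap work. For the generalization to t ranges, the same loops can be seeded with all t ranges instead of one.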
