
Scalable Full-Text Search for Petascale File Systems

Andrew W. Leung, Ethan L. Miller. University of California, Santa Cruz. 3rd Petascale Data Storage Workshop (PDSW ’08), November 17th, 2008.


  1. Scalable Full-Text Search for Petascale File Systems
     Andrew W. Leung • Ethan L. Miller
     University of California, Santa Cruz
     3rd Petascale Data Storage Workshop (PDSW ’08)
     November 17th, 2008

  2. Need scalable file management
     • Today’s file systems contain
       • Petabytes of data, billions of files, and thousands of users
     • File systems have focused on scaling
       • I/O and metadata throughput, latency, fault-tolerance, cost
     • Limited work on scaling organization and retrieval
       • File system organization largely unchanged for 30 years
     • File organization and retrieval have not kept pace with file systems

  3. Problems with current approach
     • Files are organized into a single hierarchy
       • Possibly billions of files and directories
     • Slow and inaccurate
       • Users must carefully organize and name files and directories
         - Tedious and time-consuming
       • Users must manually navigate huge hierarchies
         - Wastes time and is inaccurate
       • Files only have a single classification
     • Does not scale to petascale file systems

  4. Scalable file retrieval with search
     • File system search has been researched for decades
       • Focused on full-text (aka keyword) search
     • Organizing and retrieving files with search
       • Files have many automatic classifications
         - Organization becomes much simpler
       • Files can be retrieved with any feature/keywords
         - No more slow namespace navigation
         - Reduces the chances of lost data

  5. Petascale search challenges
     • Cost
       • Very expensive - often requires dedicated hardware
     • Performance
       • Tough to scale - often trades off search performance against update performance
       • File system search should efficiently do both
     • Ranking
       • Limited file ranking algorithms
     • Security
       • Can significantly degrade search performance

  6. A specialized petascale search design
     • Exploits file system properties
       • Can be integrated within the file system
     • Leverages namespace locality with hierarchical partitioning [Leung09]
     • Namespace influences
       • File access patterns [Leung08, Vogel99]
       • File properties [Agrawal07, Leung09]
       • Who accesses them [Agrawal07, Leung08]

  7. Index partitioning
     [Figure: namespace tree (labels: /, home, usr, proj, john, jim, distmeta, reliability, include, thesis, scidac, src, experiments) alongside keyword 1's posting list segments on the hard disk]
     • Traditional file system search uses an inverted index
       • Consists of a dictionary that points to posting lists
     • Our approach partitions the index based on the namespace
       • Posting lists are broken into segments
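To make the partitioning idea on this slide concrete, here is a minimal Python sketch. It is not the authors' implementation: the rule that a partition is named by the first two directories of a path, and the names `partition_of`, `PartitionedIndex`, and `segments_for`, are assumptions made for illustration. Each keyword's posting list is split into one segment per namespace partition.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def partition_of(path: str, depth: int = 2) -> str:
    """Name a partition by the first `depth` directories of a file path.

    Hypothetical rule: the slides do not say how sub-trees are chosen.
    """
    parts = PurePosixPath(path).parts[1:-1]  # drop the leading "/" and the file name
    return "/" + "/".join(parts[:depth])


class PartitionedIndex:
    """Inverted index whose posting lists are split into per-partition segments."""

    def __init__(self, depth: int = 2):
        self.depth = depth
        # segments[partition][keyword] -> set of paths (one posting list segment)
        self.segments = defaultdict(lambda: defaultdict(set))

    def add(self, path: str, text: str) -> None:
        """Index every whitespace-separated word of `text` under the file's partition."""
        part = partition_of(path, self.depth)
        for word in text.lower().split():
            self.segments[part][word].add(path)

    def segments_for(self, keyword: str) -> dict:
        """Return {partition: sorted paths} for the segments that hold `keyword`."""
        return {
            part: sorted(words[keyword])
            for part, words in self.segments.items()
            if keyword in words
        }


index = PartitionedIndex()
index.add("/home/john/thesis/ch1.txt", "petascale search index")
index.add("/home/jim/src/main.c", "index update")
index.add("/proj/scidac/experiments/run1.log", "petascale run")
print(index.segments_for("index"))
```

Because each segment covers only one sub-tree, an update to a file touches only the small segment of its own partition rather than one global posting list.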

  8. Benefits of our design
     • Flexible, fine-grained index control
       • Search and update can be controlled at sub-tree granularity
       • Critical for an index with billions of files
     • Reducing the search space
       • Eliminate partitions that do not match search criteria
       • Allows users to control scope and performance of queries
     • Efficient index updates
       • Smaller posting lists are easier to update and keep sequential on-disk
       • Better resource utilization
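The "reducing the search space" point can be sketched as a scoped query that skips whole partitions. This is an illustrative sketch only: the `segments` layout (partition root, then keyword, then a set of paths), the AND semantics for multiple keywords, and the function names are assumptions, not details from the talk.

```python
def in_scope(partition: str, scope: str) -> bool:
    """True if the partition's sub-tree can contain files under `scope`."""
    p = partition.rstrip("/") + "/"
    s = scope.rstrip("/") + "/"
    # Either the partition lies inside the scope, or the scope lies inside the partition.
    return p.startswith(s) or s.startswith(p)


def search(segments: dict, keywords: list, scope: str = "/") -> list:
    """Return paths under `scope` that contain every keyword (AND query)."""
    if not keywords:
        return []
    s = scope.rstrip("/") + "/"
    hits = []
    for partition, postings in segments.items():
        if not in_scope(partition, scope):
            continue  # whole partition eliminated; its segments are never read
        matched = set.intersection(*(postings.get(k, set()) for k in keywords))
        hits.extend(path for path in matched if path.startswith(s))
    return sorted(hits)


# Hypothetical index contents: partition root -> keyword -> posting list segment.
segments = {
    "/home/john": {
        "index": {"/home/john/thesis/ch1.txt"},
        "petascale": {"/home/john/thesis/ch1.txt"},
    },
    "/home/jim": {"index": {"/home/jim/src/main.c"}},
    "/proj/scidac": {"petascale": {"/proj/scidac/run1.log"}},
}
print(search(segments, ["index"], scope="/home/john"))
```

A narrower `scope` leaves fewer partitions to read, which is one way a user could trade query coverage for speed.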

  9. The indirect index
     [Figure: the indirect index; a dictionary (Keyword 1 to Keyword 4) whose posting lists point to the posting list segments for Partition 1, Partition 2, ...]
     • An inverted index that points to partition locations
       • Stores the dictionary
       • Posting lists store partition segment locations
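A rough sketch of the indirect index follows, under stated assumptions: partition names stand in for on-disk segment locations, queries are treated as AND queries, and the class and method names are hypothetical. The point it illustrates is that the dictionary maps a keyword to the partitions holding a segment for it, so a lookup yields the partitions worth reading rather than the files themselves.

```python
from collections import defaultdict


class IndirectIndex:
    """Dictionary from keyword to the partitions that hold a segment for it."""

    def __init__(self):
        # keyword -> set of partition names (placeholders for segment locations)
        self.dictionary = defaultdict(set)

    def record(self, keyword: str, partition: str) -> None:
        """Note that `partition` stores a posting list segment for `keyword`."""
        self.dictionary[keyword].add(partition)

    def candidate_partitions(self, keywords: list) -> list:
        """Partitions that hold a segment for every keyword of an AND query."""
        if not keywords:
            return []
        sets = [self.dictionary.get(k, set()) for k in keywords]
        return sorted(set.intersection(*sets))


indirect = IndirectIndex()
indirect.record("index", "/home/john")
indirect.record("index", "/home/jim")
indirect.record("petascale", "/home/john")
indirect.record("petascale", "/proj/scidac")
print(indirect.candidate_partitions(["index", "petascale"]))
```

Since a file lives in exactly one partition, a partition missing any keyword of an AND query cannot contain a match and can be skipped.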

  10. Other possible extensions
     • Security
       • Eliminate restricted sub-trees from search space
       • No extra space required and fewer permission checks
     • Ranking
       • Utilize namespace locality to improve search result ranking
       • Employ different ranking algorithms for different sub-trees
     • Cost efficiency
       • Exploit Zipf-like sub-tree query patterns
       • Compress or migrate rarely searched sub-tree segments to lower-tier storage
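Two of these extensions can be sketched in a few lines of Python. Everything here is hypothetical: the permission model (a map from a restricted sub-tree root to the users allowed to search it, with restriction roots at or above partition roots), the per-partition query log, and the 5% threshold are assumptions chosen for illustration, not part of the proposed design.

```python
from collections import Counter


def visible_partitions(partitions: list, restricted: dict, user: str) -> list:
    """Drop every partition that lies under a sub-tree the user may not search.

    `restricted` maps a sub-tree root to the set of users allowed to search it.
    """
    def allowed(partition: str) -> bool:
        p = partition.rstrip("/") + "/"
        return all(
            user in users
            for root, users in restricted.items()
            if p.startswith(root.rstrip("/") + "/")
        )

    return [p for p in partitions if allowed(p)]


def cold_partitions(query_log: list, partitions: list, min_share: float = 0.05) -> list:
    """Partitions receiving less than `min_share` of all logged queries.

    With Zipf-like query patterns most partitions fall in this group, making
    them candidates for compression or migration.
    """
    counts = Counter(query_log)
    total = sum(counts.values()) or 1  # avoid dividing by zero on an empty log
    return [p for p in partitions if counts[p] / total < min_share]


partitions = ["/home/john", "/home/jim", "/proj/scidac"]
restricted = {"/home/john": {"john"}}
print(visible_partitions(partitions, restricted, "jim"))

query_log = ["/home/john"] * 90 + ["/home/jim"] * 9 + ["/proj/scidac"] * 1
print(cold_partitions(query_log, partitions))
```

Pruning by partition means one permission decision per sub-tree instead of one per matching file, which is the saving the security bullet points at.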

  11. Current and future work
     • We are currently working on...
       • Collecting and analyzing keyword data sets
         • Crawl real-world large-scale file systems
         • No current file system search keyword collections exist
       • Completing the index and algorithm designs
       • Implementation and evaluation within the Ceph petascale file system
         • Allows realistic integration and benchmarking

  12. Thank you!
     • Thanks to:
       • Minglong Shao, Timothy Bisson, Shankar Pasupathy and NetApp’s ATG
       • SSRC faculty and students
     • Come see us at the poster session!
       • Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems
     • Questions?
