SLIDE 1

SeFS: Unleashing the Power of Full-text Search on File Systems

USENIX FAST ’07 (WiP)

Stergios V. Anastasiadis (joint work with G. Margaritis)

  • University of Ioannina, Greece
SLIDE 2

Motivation

  • Full-text search in modern systems is often used for
    – Email
    – Application help files
    – Log files
    – Any file that contains text
    – ...

  • Maybe full-text search should
    – Receive the attention it deserves from system designers
    – Be made available as a general system service to developers

02/14/2007 (c) S. V. Anastasiadis 2

SLIDE 3

File System Features

  • File size
    – Most files are small, BUT
    – Most bytes are in large files

  • File lifetime
    – Is highly variable across different systems
    – Varies from minutes to years
    – Has a median age of tens of days

  • User expectations
    – Users perceive the file system as a reliable “storage medium”
    – Users anticipate that changes become visible almost immediately

SLIDE 4

Attempt #1: Information Retrieval

  • Upside
    – Online support of Boolean queries and dynamic updates
    – Mature technology (first ACM SIGIR conference in 1978)

  • Downside
    – Technology initially developed for article archives
    – “Dynamic update” mainly means addition of new articles
    – Indexing structures are biased by decade-old studies toward serving the above assumptions

SLIDE 5

Index Maintenance in IR

  • Inverted files
    – Map terms to term positions in documents (posting lists)

  • Decades ago
    – Updated infrequently to include new articles
    – Stored contiguously on disk to minimize query time

  • Recently
    – Updated dynamically to include new articles, BUT
    – Treat document changes as insertions/deletions
    – Use complex relocation techniques to preserve contiguity
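As a rough illustration of the structure described on this slide, the sketch below builds a toy in-memory inverted file in Python. It is not the SeFS design: the class name `InvertedIndex`, its methods, and the whitespace tokenizer are all assumptions made for the example. It shows posting lists that map each term to word positions per document, and a document change handled as a deletion followed by an insertion.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy in-memory positional inverted index (illustrative only)."""

    def __init__(self):
        # term -> {doc_id -> [word positions]}  (the posting lists)
        self.postings = defaultdict(dict)
        # doc_id -> set of terms, so a document can be removed later
        self.doc_terms = {}

    def add(self, doc_id, text):
        # Hypothetical tokenizer: lowercase and split on whitespace.
        terms = text.lower().split()
        for pos, term in enumerate(terms):
            self.postings[term].setdefault(doc_id, []).append(pos)
        self.doc_terms[doc_id] = set(terms)

    def delete(self, doc_id):
        for term in self.doc_terms.pop(doc_id, set()):
            del self.postings[term][doc_id]
            if not self.postings[term]:
                del self.postings[term]

    def update(self, doc_id, text):
        # A changed document is treated as a deletion plus an insertion.
        self.delete(doc_id)
        self.add(doc_id, text)

    def search_and(self, *terms):
        """Boolean AND: ids of documents that contain every term."""
        sets = [set(self.postings.get(t.lower(), {})) for t in terms]
        return set.intersection(*sets) if sets else set()
```

A real index keeps posting lists on disk, which is where the contiguity and relocation concerns listed above come from; this in-memory version sidesteps them entirely.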

SLIDE 6

Question

  • Why not allocate posting lists on fixed-size blocks?
    – Avoid data relocation during inserts/appends
    – Amortize disk seeks over large block sizes
    – Simplify system structure without major performance penalty

  • Several I/O-demanding systems are based on blocks
    – Database systems
    – The Google File System (chunks of 64 MB)
    – Video streaming storage
    – …
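The question above can be made concrete with a small Python sketch. Everything here is an assumption for illustration, not the SeFS implementation: the class `BlockedPostingList`, the `block_capacity` parameter, and the helper `seek_fraction` with its simple seek-plus-transfer cost model are hypothetical names. The point is that an append either fills the last block or allocates a new one, so postings already written never move, and that the share of read time spent seeking shrinks as blocks grow.

```python
class BlockedPostingList:
    """Posting list stored as a chain of fixed-capacity blocks (toy model)."""

    def __init__(self, block_capacity=4):
        self.block_capacity = block_capacity  # postings per block (hypothetical)
        self.blocks = []                      # each block is a list of postings

    def append(self, posting):
        # Allocate a fresh block only when the last one is full;
        # postings already written are never relocated.
        if not self.blocks or len(self.blocks[-1]) == self.block_capacity:
            self.blocks.append([])
        self.blocks[-1].append(posting)

    def __iter__(self):
        # A query walks the blocks in order; in this model each block
        # would cost about one seek on disk.
        for block in self.blocks:
            yield from block


def seek_fraction(block_bytes, seek_s, bytes_per_s):
    """Fraction of the time to read one block that is spent seeking."""
    transfer_s = block_bytes / bytes_per_s
    return seek_s / (seek_s + transfer_s)
```

Calling `seek_fraction` with a larger `block_bytes` and the same assumed seek time and transfer rate yields a smaller fraction, which is the amortization argument in the second sub-bullet.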

SLIDE 7

Attempt #2: Web Search

  • Upside
    – Technology can handle large data sets
    – Search results are quite close to user expectations

  • Downside
    – The web is perceived as unreliable; infrequent updates are OK
    – Its distributed nature makes stats gathering difficult
    – Dedicated hardware is devoted to indexing

  • Bottom line
    – Despite commonalities, file systems differ from the web
    – Exploit strengths without adopting weaknesses

SLIDE 8

Attempt #3: Relational Databases

  • First approach
    – Store all system metadata in a relational database system
      (e.g., SRB/SDSC, SCFS/MIT, Amino/Stony Brook)
    – OK for FTP-like services
    – BUT maybe too heavyweight for fine-grain accesses

  • Why?
    – File systems are custom-developed and optimized for handling their own metadata

SLIDE 9

Relational Databases (cont’d)

  • Second approach
    – Keep system metadata on custom file-system structures
    – BUT maintain user metadata in a database
    – Maybe OK, but still insufficient for full-text search

  • Why?
    – Full-text search involves more than a few attribute/value pairs per file
    – Inverted files are the most efficient structure for large text collections

SLIDE 10

Conclusion

  • File systems
    – More flexible in their functionality than article repositories
    – More reliable and amenable to stats gathering than the web
    – More efficient in fine-granularity operations than RDBs

  • Full-text search on file systems
    – Useful for different applications and system services
    – Should be designed from scratch, free from inherent drawbacks of solutions from other environments
