SmartStore: A New Metadata Organization Paradigm with Semantic-Awareness for Next-Generation File Systems



SLIDE 1

Supercomputing 2009

SmartStore: A New Metadata Organization Paradigm with Semantic-Awareness for Next-Generation File Systems

Yu Hua, Hong Jiang, Yifeng Zhu, Dan Feng, Lei Tian

SLIDE 2

Outline

  • Motivations
  • SmartStore System
  • Key Issues
  • Performance Evaluation
  • Discussion and Conclusion

SLIDE 3

Motivations

Some Facts

  • Storage capacity → Exabyte (or even larger)
  • Number of files → Billions
  • Metadata-based transactions → over 50%
  • Hierarchical directory tree → Performance bottleneck

Inefficiency of current file systems

  • Static and inflexible I/O interfaces
  • Linear brute-force searching
  • Lack of full utilization of semantics

SLIDE 4

Conventional Directory Trees

Millions of files under each directory

  • This tree is too FAT!
  • This tree is too HIGH!

SLIDE 5

Ideal Scenarios

User requirements

  • Quickly return query results with an acceptable tradeoff
  • Obtain knowledge of interest from the data ocean to guide higher-level services
  • Query high-dimensional data

System requirements

  • Scalability
  • Reliability
  • Performance improvements

SLIDE 6

Intuition

Reduce search space

  • Not the entire large-scale file system

Search correlated metadata

  • Configure a context related to queries

Desirable interfaces

  • Complex queries, such as range queries and top-k queries
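To make the two query types concrete, here is a minimal Python sketch of a range query and a top-k query over multi-dimensional file metadata. The attribute choice (size, last-modified hour), the sample records, and the function names are hypothetical. The linear scan shown is the brute-force baseline that the slides want to avoid, not SmartStore's method.

```python
import heapq
import math

# Hypothetical metadata records: (file name, size in KB, last-modified hour).
FILES = [
    ("a.log", 120, 3),
    ("b.dat", 4096, 10),
    ("c.txt", 8, 11),
    ("d.bin", 3900, 9),
    ("e.cfg", 2, 23),
]

def range_query(files, size_range, mtime_range):
    """Return names of files whose attributes all fall inside the given ranges."""
    (s_lo, s_hi), (m_lo, m_hi) = size_range, mtime_range
    return [name for name, size, mtime in files
            if s_lo <= size <= s_hi and m_lo <= mtime <= m_hi]

def top_k_query(files, point, k):
    """Return names of the k files closest (Euclidean distance) to the query point."""
    def dist(record):
        _, size, mtime = record
        return math.hypot(size - point[0], mtime - point[1])
    return [record[0] for record in heapq.nsmallest(k, files, key=dist)]
```

Both functions touch every record, so their cost grows linearly with the number of files. A real system would also normalize the attributes before computing distances, since raw sizes dominate raw hours in this toy example.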

SLIDE 7

Examples: Complex Queries

SLIDE 8

Our Approach: SmartStore

Basic ideas:

  • Semantic: correlation represented by multi-dimensional attributes of file metadata
  • Group files based on metadata semantic correlations by using the Latent Semantic Indexing (LSI) tool
  • Queries and other relevant operations can be completed within one or a small number of such groups

Our goal is to avoid or minimize the brute-force search that is widely used in directory-tree based file systems during a complex query.

SLIDE 9

Comparisons with Conventional File Systems

SLIDE 10

Grouping Procedures

Node Vector

SLIDE 11

Semantic Grouping

Design Objectives

  • Group sizes are approximately equal.
  • A file in a group has a higher correlation with other files in this group than with any file outside of the group.
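The two design objectives can be illustrated with a small greedy grouping sketch. It is only an illustration under stated assumptions: plain cosine similarity stands in for the LSI-based correlation used by SmartStore, and the seeding rule, the capacity cap, and all names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two attribute vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def balanced_groups(vectors, num_groups):
    """Greedy grouping: similar vectors together, group sizes capped near n/num_groups."""
    n = len(vectors)
    cap = math.ceil(n / num_groups)
    # Farthest-first seeding: each new seed is the vector least similar to the seeds so far.
    seeds = [0]
    while len(seeds) < num_groups:
        rest = [i for i in range(n) if i not in seeds]
        seeds.append(min(rest, key=lambda i: max(cosine(vectors[i], vectors[s])
                                                 for s in seeds)))
    groups = [[s] for s in seeds]
    for i in range(n):
        if i in seeds:
            continue
        # Prefer the most similar seed whose group still has room.
        ranked = sorted(range(num_groups),
                        key=lambda g: -cosine(vectors[i], vectors[seeds[g]]))
        target = next(g for g in ranked if len(groups[g]) < cap)
        groups[target].append(i)
    return groups

# Hypothetical 2-D attribute vectors forming two obvious clusters.
VECTORS = [(1, 0.1), (0.9, 0.2), (0.1, 1), (0.2, 0.9), (1, 0), (0, 1)]
GROUPS = balanced_groups(VECTORS, 2)
```

The capacity cap enforces the first objective (approximately equal sizes), while assigning each file to its most similar seed approximates the second. When a preferred group is full, the file falls back to the next best group, so the second objective is only met on a best-effort basis.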

SLIDE 12

System Architecture

  • Multiple operations: insertion, deletion, point query, range query, top-k NN query
  • Construction of semantic R-trees in a distributed environment
  • Semantic grouping: grouping correlated metadata into storage and index units based on the LSI (Latent Semantic Indexing)

SLIDE 13

Constructing a Semantic R-tree

  • Semantic R-tree leaf nodes serve as storage units
  • Non-leaf nodes serve as index units
  • MBR (minimum bounding rectangle) representation for local metadata
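A rough sketch of how an MBR lets an index unit prune storage units during a range query is shown below. The two-level layout, the unit names, and the sample points are hypothetical; a real semantic R-tree has more levels and is distributed across nodes.

```python
def mbr(points):
    """Minimum bounding rectangle of d-dimensional points, as (lows, highs)."""
    dims = range(len(points[0]))
    return (tuple(min(p[d] for p in points) for d in dims),
            tuple(max(p[d] for p in points) for d in dims))

def intersects(box, query):
    """True if two axis-aligned rectangles overlap in every dimension."""
    (lo, hi), (q_lo, q_hi) = box, query
    return all(l <= qh and ql <= h for l, h, ql, qh in zip(lo, hi, q_lo, q_hi))

def contains(query, point):
    """True if the point lies inside the query rectangle."""
    q_lo, q_hi = query
    return all(ql <= x <= qh for ql, x, qh in zip(q_lo, point, q_hi))

# Hypothetical storage units (leaf nodes), each holding 2-D metadata points.
STORAGE_UNITS = {
    "unit_a": [(1, 1), (2, 3), (3, 2)],
    "unit_b": [(10, 10), (12, 11), (11, 14)],
}
# The index unit (non-leaf node) keeps one MBR per child storage unit.
INDEX_UNIT = {name: mbr(points) for name, points in STORAGE_UNITS.items()}

def range_query(query):
    """Visit only storage units whose MBR overlaps the query rectangle."""
    visited, hits = [], []
    for name, box in INDEX_UNIT.items():
        if intersects(box, query):
            visited.append(name)
            hits += [p for p in STORAGE_UNITS[name] if contains(query, p)]
    return visited, hits
```

For the query rectangle `((0, 0), (2, 3))`, only `unit_a` is visited because the MBR of `unit_b` does not overlap the query, which is the search-space reduction the index units are meant to provide.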

SLIDE 14

SmartStore functions

  • Insertion
  • Deletion
  • On-line query approaches: range query, top-k query, point query

SLIDE 15

Key issues: on-line & off-line

Accelerate queries

  • Off-line pre-processing
  • Each storage unit locally maintains a replica of the semantic vectors of all first-level index units to speed up the queries
  • Lazy updating to deal with information staleness

SLIDE 16

Key Issues: on-line vs off-line

[Figure: query routing. Each unit checks "Matching? Query : Forward"; step (4): if matching fails, the request continues to be forwarded.]
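One possible reading of the "Matching? Query : Forward" rule, combined with the replicated semantic vectors described for off-line pre-processing, is sketched below. Everything concrete here is an assumption made for illustration: the vectors, the unit names, the use of cosine similarity, and the policy of trying index units in decreasing order of correlation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two semantic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical replica kept by every storage unit:
# the semantic vector of each first-level index unit.
INDEX_VECTORS = {"idx_1": (0.9, 0.1), "idx_2": (0.1, 0.9), "idx_3": (0.6, 0.6)}
# Hypothetical contents of each index unit's group.
INDEX_CONTENTS = {"idx_1": {"a.log"}, "idx_2": {"b.dat"}, "idx_3": {"c.txt"}}

def route(query_vector, wanted):
    """Try index units in decreasing correlation; on a miss, forward to the next one."""
    order = sorted(INDEX_VECTORS,
                   key=lambda name: cosine(query_vector, INDEX_VECTORS[name]),
                   reverse=True)
    hops = []
    for name in order:
        hops.append(name)
        if wanted in INDEX_CONTENTS[name]:   # Matching? -> answer the query here
            return name, hops
    return None, hops                        # no unit matched after all forwards
```

Because the ranking is computed from the local replica, the first hop needs no extra messages. Forwarding only happens when the best-ranked unit does not hold the answer.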

SLIDE 17

Key Issues: Consistency Guarantee via Versioning

Multi-replica technique can potentially lead to information staleness and inconsistency.

Lazy Versioning:

  • A newly created version attached to its correlated replica temporarily contains aggregated real-time changes that have not been directly updated in the original replicas
  • SmartStore removes attached versions when reconfiguring index units
  • The frequency of reconfiguration depends on the user requirements and environment constraints
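The lazy versioning idea can be sketched as a replica that keeps pending changes in an attached version and folds them in only at reconfiguration time. This is a minimal single-process illustration; the class, its method names, and the dictionary representation are hypothetical and say nothing about how SmartStore stores versions.

```python
class Replica:
    """A replica with a lazily attached version of pending changes."""

    def __init__(self, data):
        self.data = dict(data)   # original replica contents
        self.version = {}        # attached version: aggregated changes not yet applied

    def record_change(self, key, value):
        # Changes are aggregated in the attached version; the replica itself is untouched.
        self.version[key] = value

    def lookup(self, key):
        # Reads consult the attached version first so they see recent changes.
        if key in self.version:
            return self.version[key]
        return self.data.get(key)

    def reconfigure(self):
        # Fold pending changes into the replica, then remove the attached version.
        self.data.update(self.version)
        self.version = {}

replica = Replica({"f1": "unit_a"})
replica.record_change("f1", "unit_b")
seen_before = replica.lookup("f1")     # read served from the attached version
stored_before = replica.data["f1"]     # original replica still holds the old value
replica.reconfigure()
```

How often `reconfigure` runs is the knob mentioned on the slide: frequent calls keep replicas fresh at a higher update cost, while infrequent calls let the attached version grow.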

SLIDE 18

Key issues: Mapping of Index Units

Our mapping is based on a simple bottom-up approach that iteratively applies random selection and labeling operations.
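The slide gives no further detail, so the following is only one plausible sketch of a bottom-up mapping with random selection and labeling. It assumes that each index unit is hosted on a randomly chosen storage unit from its own subtree and that a labeled storage unit is not reused. The tree shape and all names are hypothetical.

```python
import random

# Hypothetical semantic R-tree: index units map to their children.
# Names that do not appear as keys are storage units (leaves).
TREE = {
    "root": ["idx_1", "idx_2"],
    "idx_1": ["su_a", "su_b"],
    "idx_2": ["su_c", "su_d"],
}

def leaves(node):
    """All storage units below a node."""
    if node not in TREE:
        return [node]
    return [leaf for child in TREE[node] for leaf in leaves(child)]

def height(node):
    """Height above the leaves; storage units have height 0."""
    return 0 if node not in TREE else 1 + max(height(c) for c in TREE[node])

def map_index_units(seed=None):
    """Assign each index unit to an unlabeled storage unit in its subtree."""
    rng = random.Random(seed)
    labeled = set()
    mapping = {}
    # Bottom-up: handle the index units closest to the leaves first.
    for unit in sorted(TREE, key=height):
        candidates = [s for s in leaves(unit) if s not in labeled]
        host = rng.choice(candidates)   # random selection
        labeled.add(host)               # labeling: this storage unit is now taken
        mapping[unit] = host
    return mapping
```

With at least two children per index unit, a subtree always has more storage units than index units, so an unlabeled candidate exists at every step of this sketch.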

SLIDE 19

Performance Evaluation

  • Prototype implementation
  • Large file system-level traces, including HP, MSN, and EECS, intensified by using a Trace Intensifying Factor
  • Compared with a typical DBMS and R-tree: query latency reduction of 1000 times, space savings of 20 times

SLIDE 20

Complex Queries Latency

SLIDE 21

Preliminary Simulation Results

$\mathrm{recall} = \frac{|T(q) \cap A(q)|}{|T(q)|}$

  • T(q) is the ideal answer for query q
  • A(q) is the actual query results

[Figure panels: Top-8 NN Query; Range Query]
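As a small worked example of the recall measure, the sketch below computes it for sets of file names. The function name and the sample data are hypothetical and are not results from the evaluation.

```python
def recall(ideal, actual):
    """recall = |T(q) ∩ A(q)| / |T(q)| for ideal answer T(q) and actual results A(q)."""
    ideal, actual = set(ideal), set(actual)
    if not ideal:
        raise ValueError("ideal answer set T(q) must not be empty")
    return len(ideal & actual) / len(ideal)

# Hypothetical top-8 query: 7 of the 8 ideal answers were returned, plus one wrong file.
IDEAL = {"f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8"}
ACTUAL = {"f1", "f2", "f3", "f4", "f5", "f6", "f7", "x9"}
SCORE = recall(IDEAL, ACTUAL)   # 7 / 8
```

Extra files in the actual results do not lower recall; they would only show up in a precision measure, which the slide does not define.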

SLIDE 22

On-line & off-line

[Figure: two plots comparing the on-line and off-line approaches on the HP, MSN, and EECS traces: latency (ms) vs. number of data nodes, and message number (1000) vs. number of data nodes.]

SLIDE 23

Discussion

SmartStore does work for:

  • Pay-only-once: configuration efficiency for a long time, given the complexity of semantic analysis;
  • Rich semantics of multi-dimensional attributes, which guarantee that the groups match access patterns well

SmartStore does not efficiently work for:

  • Lack of semantics, such as uniform distribution;
  • Quick and dynamic evolution of semantics;
  • Explicit scatter of dimension increments

SLIDE 24

Potential Applications

Users’ views

  • Range query and top-k query

System views

  • De-duplication
  • Caching
  • Pre-fetching

SLIDE 25

Conclusions

SmartStore is a new paradigm for organizing file metadata for next-generation file systems

  • Exploit file semantics
  • Complex queries
  • Enhance system scalability and functionality

Methodology

  • Semantic aggregation
  • Decrease search space

SLIDE 26

Acknowledgement

This work is partially supported by

  • NSFC under Grant 60703046
  • National Basic Research 973 Program under Grant 2004CB318201
  • NSF CCF-0621526, NSF CCF-0937993, NSF CCF-0937988 and NSF CCF-0621493
  • HUST-SRF No.2007Q021B
  • The Program for Changjiang Scholars and Innovative Research Team in University No. IRT-0725

SLIDE 27

Thanks & Questions