DataCutter Joel Saltz Alan Sussman Tahsin Kurc University of - - PowerPoint PPT Presentation





slide-1
SLIDE 1

DataCutter

Joel Saltz Alan Sussman Tahsin Kurc University of Maryland, College Park and Johns Hopkins Medical Institutions

http://www.cs.umd.edu/projects/adr

slide-2
SLIDE 2

DataCutter

  • A suite of middleware for subsetting and filtering multi-dimensional datasets stored on archival storage systems
  • Subsetting through range queries
      • a range query is a hyperbox defined in the multi-dimensional space underlying the dataset
      • items whose multi-dimensional coordinates fall into the box are retrieved
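
The range-query semantics above can be sketched in a few lines of C++. This is only an illustration under assumed names (`HyperBox`, `contains`, and `selectInBox` do not come from DataCutter), and it treats the box boundaries as inclusive, which the slides do not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical axis-aligned hyperbox: lo[i] <= x[i] <= hi[i] in every dimension.
struct HyperBox {
    std::vector<double> lo;
    std::vector<double> hi;
};

// Returns true when the point lies inside the box (boundaries included, by assumption).
bool contains(const HyperBox& box, const std::vector<double>& point) {
    if (point.size() != box.lo.size() || point.size() != box.hi.size()) return false;
    for (std::size_t i = 0; i < point.size(); ++i) {
        if (point[i] < box.lo[i] || point[i] > box.hi[i]) return false;
    }
    return true;
}

// Retrieves the items whose coordinates fall into the query box.
std::vector<std::vector<double>> selectInBox(
        const HyperBox& box, const std::vector<std::vector<double>>& items) {
    std::vector<std::vector<double>> hits;
    for (const std::vector<double>& item : items) {
        if (contains(box, item)) hits.push_back(item);
    }
    return hits;
}
```

A real deployment would not scan every item; the indexing service exists precisely to avoid that.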

slide-3
SLIDE 3

DataCutter

  • Restricted processing (filtering/aggregations) through filters
      • to reduce the amount of data transferred to the client
      • filters can run anywhere, but are intended to run near (i.e., over a local area network from) the storage system
      • based on the filter-stream programming model, to optimize use of limited resources such as memory and disk space

slide-4
SLIDE 4

DataCutter

[Architecture diagram: clients send range queries to DataCutter and receive segment info and segment data. DataCutter consists of the Client Interface Service, Indexing Service, Data Access Service, and Filtering Service (running filters), placed in front of the archival storage systems. Segments are identified as (File, Offset, Size) tuples.]

slide-5
SLIDE 5

DataCutter Architecture

  • Client Interface Service
      • Manages client connections and client requests
      • Manages data and information flow between different services
  • Indexing Service
      • Two-level hierarchical indexing: summary and detailed index files
      • Customizable
          • Default R-tree index
          • User can add new indexing methods

slide-6
SLIDE 6

DataCutter Architecture

  • Filtering Service
      • Manages filters (registered in the system)
      • Users can add/run new filters
  • Data Access Service
      • Manages storage/retrieval of data from the tertiary storage
      • Low-level, system-dependent I/O operations

slide-7
SLIDE 7

DataCutter -- Subsetting

  • Datasets are partitioned into segments
      • a segment is the unit of indexing and the unit of retrieval
  • Indexing very large datasets
      • Multi-level hierarchical indexing scheme
      • Summary index files: index a group of segments or detailed index files
      • Detailed index files: index the segments
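
The two-level scheme can be illustrated with a minimal sketch: a summary entry holds a bounding box for a whole detailed index, so a query only opens the detailed indexes whose summary box it intersects. All names here (`Box`, `Segment`, `DetailedIndex`, `SummaryIndex`, `search`) are hypothetical, and a linear scan stands in for the default R-tree.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical bounding box with per-dimension minimum and maximum.
struct Box {
    std::vector<double> lo;
    std::vector<double> hi;
};

// Two boxes intersect when their extents overlap in every dimension.
bool intersects(const Box& a, const Box& b) {
    if (a.lo.size() != b.lo.size()) return false;
    for (std::size_t i = 0; i < a.lo.size(); ++i) {
        if (a.hi[i] < b.lo[i] || b.hi[i] < a.lo[i]) return false;
    }
    return true;
}

// A segment is addressed as (file, offset, size), as on the architecture slide.
struct Segment {
    Box mbr;
    std::string file;
    unsigned offset;
    unsigned size;
};

// Detailed index: indexes individual segments; mbr covers all of them.
struct DetailedIndex {
    Box mbr;
    std::vector<Segment> segments;
};

// Summary index: indexes a group of detailed indexes.
struct SummaryIndex {
    std::vector<DetailedIndex> detailed;
};

// Prunes whole detailed indexes first, then tests the remaining segments.
std::vector<Segment> search(const SummaryIndex& summary, const Box& query) {
    std::vector<Segment> hits;
    for (const DetailedIndex& d : summary.detailed) {
        if (!intersects(d.mbr, query)) continue;  // skip this detailed index entirely
        for (const Segment& s : d.segments) {
            if (intersects(s.mbr, query)) hits.push_back(s);
        }
    }
    return hits;
}
```

The benefit of the summary level is that a detailed index file stored on slow archival media is never read unless the query can actually touch one of its segments.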
slide-8
SLIDE 8

DataCutter -- Filters

  • Filters
      • Specialized user program to process data (segments) before returning them to the client
  • Filter-stream programming model
      • Originally developed for the Active Disks environment (Acharya, Uysal, and Saltz)
      • Based on stream abstraction
      • A stream denotes a supply of data
      • Streams deliver data in fixed-size buffers
      • Communication of a filter with its environment is restricted to its input and output streams
      • init, process, finalize interface
slide-9
SLIDE 9

A Motivating Scenario

Sample Application:

  • generate 3D reconstructed view from new set of sensor readings
  • compare features with reference db

Grid Configuration:

  • remote data server - reference db
  • sensor host - large raw readings
  • parallel computation farm available
  • 3D reconstruction is computationally intensive

[Diagram: a sensor host (raw dataset of sensor readings), a computation farm, a client PC, and a data server (reference DB with feature list) connected over a WAN. Each host is marked with a question mark.]

slide-10
SLIDE 10

A Motivating Scenario (2)

Application:

    // process relevant raw readings
    // generate 3D view
    // compute features of 3D view
    // find similar features in reference db
    // display new view and similar cases

Filters: Extract raw, Extract ref, 3D reconstruction, View result

[Diagram: the same hosts connected over the WAN, now with filters assigned. Extract raw runs on the sensor host next to the raw dataset, Extract ref runs on the data server next to the reference DB, 3D reconstruction runs on the computation farm, and View result runs on the client PC.]

slide-11
SLIDE 11

Filters

  • Filters
      • communicate with other filters only using streams
      • cannot change stream endpoints
      • are allowed to pre-disclose dynamic allocation of memory/scratch space in the init phase, before the processing phase
  • Advantages
      • location independence
      • easier scheduling of resources
      • filter stop and restart is defined explicitly in the model
slide-12
SLIDE 12

Placement

  • Placement (mapping) is the dynamic assignment of filters to particular hosts for execution
  • Optimization criteria:
      • Communication
          • leverage filter affinity to dataset
          • minimize communication volume on slower connections
          • co-locate filters with large communication volume
      • Computation
          • expensive computation on faster, less loaded hosts
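
The communication criteria can be made concrete with a small cost model. The sketch below is an assumption-laden illustration, not the placement algorithm used by DataCutter (the prototype described later uses manual placement): it charges each stream its volume divided by the bandwidth between the two hosts, charges nothing for co-located filters, and tries every placement by brute force. `Stream`, `commCost`, `bestPlacement`, and the `pinned` vector (used here to model filter affinity to a dataset) are hypothetical names, and the computation criterion is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stream between two filters, with its expected data volume.
struct Stream {
    int from;      // index of the producing filter
    int to;        // index of the consuming filter
    double bytes;  // volume carried by the stream
};

using Bandwidth = std::vector<std::vector<double>>;  // bytes per second between hosts

// Transfer time of all streams that cross hosts; co-located filters cost nothing.
double commCost(const std::vector<Stream>& streams,
                const std::vector<int>& hostOf, const Bandwidth& bw) {
    double seconds = 0.0;
    for (const Stream& s : streams) {
        int a = hostOf[s.from];
        int b = hostOf[s.to];
        if (a != b) seconds += s.bytes / bw[a][b];
    }
    return seconds;
}

// Exhaustive search over placements. pinned[i] >= 0 fixes filter i to that host
// (affinity to a dataset); pinned[i] < 0 leaves the filter free. Requires numHosts >= 1.
std::vector<int> bestPlacement(int numHosts, const std::vector<int>& pinned,
                               const std::vector<Stream>& streams,
                               const Bandwidth& bw) {
    const std::size_t n = pinned.size();
    std::vector<int> current(n, 0);
    std::vector<int> best;
    double bestCost = std::numeric_limits<double>::infinity();
    while (true) {
        bool allowed = true;
        for (std::size_t i = 0; i < n; ++i) {
            if (pinned[i] >= 0 && current[i] != pinned[i]) allowed = false;
        }
        if (allowed) {
            double cost = commCost(streams, current, bw);
            if (cost < bestCost) {
                bestCost = cost;
                best = current;
            }
        }
        // Advance to the next assignment, odometer style.
        std::size_t i = 0;
        while (i < n && ++current[i] == numHosts) {
            current[i] = 0;
            ++i;
        }
        if (i == n) break;
    }
    return best;
}
```

With a reader pinned to the storage host, a viewer pinned to the client, and a free filter that shrinks the data in between, this model places the free filter next to the reader, so that only the small stream crosses the slow link.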
slide-13
SLIDE 13

Restructuring Process

[Diagram: the application is decomposed into some set of filters (f1 to f5). The filters and the target configuration feed the placement/schedule step, which is followed by execution of the application.]

slide-14
SLIDE 14

Software Infrastructure

  • Prototype implementation of filter framework
  • C++ language binding
  • manual placement
  • wide-area execution service
  • one thread for each instantiated filter
slide-15
SLIDE 15

Filter Framework

class MyFilter : public AS_Filter_Base {
public:
    int init(int argc, char *argv[]) { … }
    int process(stream_t st) { … }
    int finalize(void) { … }
};
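
To show how the three entry points fit together, here is a self-contained sketch. The slide does not show the definitions of `AS_Filter_Base` or `stream_t`, so both are replaced by hypothetical stand-ins (a pair of buffer queues and an abstract base class); `ThresholdFilter` and `runFilter` are likewise invented for illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical stand-ins: the real framework types are not shown in the slides.
using Buffer = std::vector<unsigned char>;
struct stream_t {
    std::deque<Buffer>* in;   // input stream: buffers waiting to be consumed
    std::deque<Buffer>* out;  // output stream: buffers produced by the filter
};

class AS_Filter_Base {
public:
    virtual ~AS_Filter_Base() = default;
    virtual int init(int argc, char* argv[]) = 0;
    virtual int process(stream_t st) = 0;
    virtual int finalize(void) = 0;
};

// Example filter: forwards only the bytes at or above a threshold,
// reducing the volume sent downstream.
class ThresholdFilter : public AS_Filter_Base {
public:
    int init(int argc, char* argv[]) override {
        // Resources and parameters are set up before processing starts.
        threshold_ = (argc > 0) ? std::atoi(argv[0]) : 128;
        return 0;
    }
    int process(stream_t st) override {
        // The filter touches nothing except its input and output streams.
        while (!st.in->empty()) {
            Buffer buf = std::move(st.in->front());
            st.in->pop_front();
            Buffer kept;
            for (unsigned char v : buf) {
                if (v >= threshold_) kept.push_back(v);
            }
            st.out->push_back(std::move(kept));
        }
        return 0;
    }
    int finalize(void) override { return 0; }

private:
    int threshold_ = 128;
};

// Drives one filter through its life cycle; returns the first non-zero status.
int runFilter(AS_Filter_Base& f, int argc, char* argv[], stream_t st) {
    if (int rc = f.init(argc, argv)) return rc;
    if (int rc = f.process(st)) return rc;
    return f.finalize();
}
```

Because the filter only sees its streams, the same object could run on the storage host or the client without code changes, which is the location independence the model is after.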

slide-16
SLIDE 16

Filter Connectivity / Placement

[filter.A]
outs = stream1 stream3

[filter.B]
ins = stream1
outs = stream2

[filter.C]
ins = stream2 stream3

[placement]
A = host1.cs.umd.edu
B = host2.cs.umd.edu
C = host3.cs.umd.edu

[Diagram: filter A sends stream1 to filter B and stream3 to filter C; filter B sends stream2 to filter C.]
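
A specification in this style is easy to load into a small in-memory graph. The parser below is a minimal sketch that assumes one `[section]` header or `key = value` pair per line, with `ins` and `outs` as the stream keys; the slides show only the example, not a formal grammar, and `FilterSpec`, `Layout`, and `parseLayout` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical in-memory form of the connectivity and placement spec.
struct FilterSpec {
    std::vector<std::string> ins;
    std::vector<std::string> outs;
};
struct Layout {
    std::map<std::string, FilterSpec> filters;     // filter name -> streams
    std::map<std::string, std::string> placement;  // filter name -> host
};

// Strips leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::string::size_type e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

Layout parseLayout(const std::string& text) {
    const std::string prefix = "filter.";
    Layout layout;
    std::istringstream lines(text);
    std::string line;
    std::string section;
    while (std::getline(lines, line)) {
        line = trim(line);
        if (line.empty()) continue;
        if (line.front() == '[' && line.back() == ']') {
            section = line.substr(1, line.size() - 2);
            if (section.compare(0, prefix.size(), prefix) == 0) {
                layout.filters[section.substr(prefix.size())];  // register the filter
            }
            continue;
        }
        std::string::size_type eq = line.find('=');
        if (eq == std::string::npos) continue;  // ignore lines that are not key = value
        std::string key = trim(line.substr(0, eq));
        std::string value = trim(line.substr(eq + 1));
        if (section == "placement") {
            layout.placement[key] = value;
        } else if (section.compare(0, prefix.size(), prefix) == 0) {
            FilterSpec& spec = layout.filters[section.substr(prefix.size())];
            std::istringstream names(value);
            std::string name;
            while (names >> name) {
                if (key == "ins") spec.ins.push_back(name);
                else if (key == "outs") spec.outs.push_back(name);
            }
        }
    }
    return layout;
}
```

Keeping connectivity and placement in separate sections means the same filter graph can be remapped to different hosts by editing only the `[placement]` section.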

slide-17
SLIDE 17

Execution Service

[Diagram: the application console (linked with the filter library) performs three steps: 1. Read the filter/stream and placement specs; 2. Query the directory daemon at dir.cs.umd.edu:6000, which keeps a directory of (name, host, port) entries; 3. Exec the filters through the AppExec daemon on each host. Filter A runs on host1.cs.umd.edu, filter B on host2.cs.umd.edu, and filter C on host3.cs.umd.edu, each as an application process linked with the filter library.]

slide-18
SLIDE 18

Related Work

[Diagram: related systems arranged by layer (Application Level, Programming Models, Infrastructure Services, Resource Level). Systems shown: Globus, Legion, Condor, NetSolve, Ninf, AppLeS, HPC++, NWS, DataCutter, Harmony, DSM, MPI, RPC, DPSS, SRB, sockets, JavaRMI, DCOM, CORBA. Resource categories shown: grid available resources, user specified resources, client/server, pool, idle resources.]

slide-19
SLIDE 19

Integrating DataCutter with the Storage Resource Broker

slide-20
SLIDE 20

Storage Resource Broker (SRB)

  • Middleware between clients and storage resources
  • Remote access to storage resources
  • Various types:
      • File systems - UNIX, HPSS, UniTree, DPSS (LBL)
      • DB large objects - Oracle, DB2, Illustra
  • Uniform client interface (API)
slide-21
SLIDE 21

Storage Resource Broker (SRB)

  • MCAT - MetaData Catalog
      • Datasets (files) and collections (directories) - inodes and more
      • Storage resources
      • User information - authentication, access privileges, etc.
  • Software package
      • Server, client library, UNIX-like utilities, Java GUI
      • Platforms - Solaris, Sun OS, Digital Unix, SGI Irix, Cray T90

slide-22
SLIDE 22

SRB/DataCutter - Prototype Implementation

  • Support for range queries
      • Creation of indices over data sets (each composed of a set of data files)
      • Subsetting of data sets
      • Search for files or portions of files that intersect a given range query
  • Restricted filter operations on portions of files (data segments) before returning them to the client (to perform filtering or aggregation to reduce data volume)

slide-23
SLIDE 23

SRB/DataCutter System

[Diagram: an application (SRB client) talks to the Storage Resource Broker (SRB) through the SRB I/O and MCAT API. MCAT holds resource, user, and application meta-data. DataCutter, with its Indexing Service and Filtering Service (running filters), handles range queries. The distributed storage resources include DB2, Oracle, Illustra, ObjectStore, HPSS, UniTree, UNIX, and ftp. Other labels in the figure: File SID, DBLobjID, ObjSID.]

slide-24
SLIDE 24

SRB/DataCutter Client Interface

  • Creating and Deleting Index

int sfoCreateIndex(srbConn *conn, sfoClass class, int catType,
                   char *inIndexName, char *outIndexName,
                   char *resourceName)

int sfoDeleteIndex(srbConn *conn, sfoClass class, int catType,
                   char *indexName)

slide-25
SLIDE 25

SRB/DataCutter Client Interface

  • Searching Index -- R-tree index

typedef struct {
    int dim;                 /* bounding box dimensions */
    double *min;             /* minimum in each dimension */
    double *max;             /* maximum in each dimension */
} sfoMBR;                    /* Bounding box structure */

typedef struct {
    sfoMBR segmentMBR;       /* bounding box of the segment */
    char *objID;             /* object in SRB that contains the segment */
    char *collectionName;    /* collection where object is stored */
    unsigned int offset;     /* offset of the segment in the object */
    unsigned int size;       /* size of segment */
} segmentInfo;               /* segment meta-data information */

typedef struct {
    int segmentCount;        /* number of segments returned */
    segmentInfo *segments;   /* segment meta-data information */
    int continueIndex;       /* continuation flag */
} indexSearchResult;         /* search result structure */

slide-26
SLIDE 26

SRB/DataCutter Client Interface

  • Searching Index -- R-tree index

int sfoSearchIndex(srbConn *conn, sfoClass class, char *indexName,
                   void *query, indexSearchResult *myresult,
                   int maxSegCount)

typedef struct {
    int dim;
    double *min, *max;
} rangeQuery;

int sfoGetMoreSearchResult(srbConn *conn, int continueIndex,
                           indexSearchResult *myresult, int maxSegCount)
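
The pair of calls suggests a paging pattern: a first search returns at most maxSegCount segments plus a continuation value, and follow-up calls fetch the rest. The sketch below illustrates only that pattern against an in-memory stand-in; it does not call the SRB library. `FakeIndex`, `SegmentRef`, `SearchPage`, and `fetchAll` are hypothetical, and the convention that a negative continueIndex means "no more results" is an assumption, since the slides describe the field only as a continuation flag.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified mirror of the segment meta-data on the slides.
struct SegmentRef {
    std::string objID;
    unsigned offset;
    unsigned size;
};
struct SearchPage {
    std::vector<SegmentRef> segments;
    int continueIndex;  // assumption: negative means the result set is exhausted
};

// In-memory stand-in for the server side of the two search calls.
class FakeIndex {
public:
    explicit FakeIndex(std::vector<SegmentRef> all) : all_(std::move(all)) {}

    // Plays the role of sfoSearchIndex: returns the first page.
    SearchPage search(int maxSegCount) const { return page(0, maxSegCount); }

    // Plays the role of sfoGetMoreSearchResult: returns the page after continueIndex.
    SearchPage more(int continueIndex, int maxSegCount) const {
        return page(continueIndex, maxSegCount);
    }

private:
    SearchPage page(int start, int maxSegCount) const {
        std::size_t begin = std::min(static_cast<std::size_t>(start), all_.size());
        std::size_t end = std::min(begin + static_cast<std::size_t>(maxSegCount), all_.size());
        SearchPage p;
        p.segments.assign(all_.begin() + begin, all_.begin() + end);
        p.continueIndex = (end < all_.size()) ? static_cast<int>(end) : -1;
        return p;
    }
    std::vector<SegmentRef> all_;
};

// Client-side loop: keep asking for more until the continuation value says stop.
std::vector<SegmentRef> fetchAll(const FakeIndex& index, int maxSegCount) {
    std::vector<SegmentRef> result;
    if (maxSegCount <= 0) return result;  // a zero-sized page would never make progress
    SearchPage p = index.search(maxSegCount);
    while (true) {
        result.insert(result.end(), p.segments.begin(), p.segments.end());
        if (p.continueIndex < 0) break;
        p = index.more(p.continueIndex, maxSegCount);
    }
    return result;
}
```

Bounding each reply with maxSegCount lets a client with limited memory walk through a result set that is much larger than any single buffer it is willing to allocate.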

slide-27
SLIDE 27

Applying Filters

typedef struct {
    segmentInfo segInfo;     /* info on segment data buffer after filter oper. */
    char *segment;           /* segment data buffer after filter is applied */
} segmentData;

typedef struct {
    int segmentDataCount;    /* #segments in segmentData array */
    segmentData *segments;   /* segmentData array */
    int continueIndex;       /* continuation flag */
} filterDataResult;

slide-28
SLIDE 28

Applying Filters

int sfoApplyFilter(srbConn *conn, sfoClass class, char *hostName,
                   int filterID, char *filterArg,
                   int numOfInputSegments, segmentInfo *inputSegments,
                   filterDataResult *myresult, int maxSegCount)

int sfoGetMoreFilterResult(srbConn *conn, int continueIndex,
                           filterDataResult *myresult, int maxSegCount)

slide-29
SLIDE 29

Application: Virtual Microscope

  • Interactive software emulation of a high-power light microscope for processing/visualizing image datasets
  • 3-D image dataset (100MB to 5GB per focal plane)
  • Client-server system organization
  • Rectangular region queries, multiple data chunk reply
  • Pipeline-style processing: read_data, decompress, clip, zoom, view
slide-30
SLIDE 30

Virtual Microscope Client

slide-31
SLIDE 31

VM Application using SRB/DataCutter

[Diagram: clients connect over a wide area network to SRB/DataCutter, which runs on a distributed collection of workstations on a local area network in front of the distributed storage resources. The read, decompress, clip, zoom, and view filters, together with indexing, are divided between the client and SRB/DataCutter.]

Filters:

  • read: read image chunks
  • decompress: convert JPEG image chunks into RGB pixels
  • clip: clip image to query boundaries
  • zoom: sub-sample to the required magnification
  • view: stitch image pieces together and display image
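
Two of these stages, clip and zoom, are simple enough to sketch directly. The code below is an illustration under stated assumptions, not the Virtual Microscope implementation: it uses a hypothetical single-channel `Image` type rather than RGB, treats the query rectangle as half-open, and sub-samples by keeping every `factor`-th pixel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major, single-channel image (the real data is RGB).
struct Image {
    int width;
    int height;
    std::vector<unsigned char> pixels;
};

// clip: keeps the pixels inside the query rectangle [x0, x1) x [y0, y1),
// clamped to the image bounds.
Image clip(const Image& src, int x0, int y0, int x1, int y1) {
    x0 = std::max(x0, 0);
    y0 = std::max(y0, 0);
    x1 = std::min(x1, src.width);
    y1 = std::min(y1, src.height);
    Image out{std::max(x1 - x0, 0), std::max(y1 - y0, 0), {}};
    for (int y = y0; y < y1; ++y) {
        for (int x = x0; x < x1; ++x) {
            out.pixels.push_back(src.pixels[static_cast<std::size_t>(y) * src.width + x]);
        }
    }
    return out;
}

// zoom: sub-samples by keeping every factor-th pixel in both directions.
Image zoom(const Image& src, int factor) {
    if (factor <= 1) return src;
    Image out{(src.width + factor - 1) / factor, (src.height + factor - 1) / factor, {}};
    for (int y = 0; y < src.height; y += factor) {
        for (int x = 0; x < src.width; x += factor) {
            out.pixels.push_back(src.pixels[static_cast<std::size_t>(y) * src.width + x]);
        }
    }
    return out;
}
```

Both stages shrink the data they pass on, which is why running them near the storage system, ahead of the wide area link, reduces the volume sent to the client.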