

SLIDE 1

Storing and Processing Multi-dimensional Scientific Datasets

Alan Sussman UMIACS & Department of Computer Science

http://www.cs.umd.edu/~als

SLIDE 2

Data Exploration and Analysis

Large data collections emerge as important resources

– Data collected from sensors and large-scale simulations
– Multi-resolution, multi-scale, multi-dimensional
  • data elements often correspond to points in a multi-dimensional attribute space
  • medical images, satellite data, hydrodynamics data, etc.
– Terabytes to petabytes today

Low-cost, high-performance, high-capacity commodity hardware
– 5 PCs with 5 terabytes of disk storage for << $10,000

SLIDE 3

Large Data Collections

Scientific data exploration and analysis

– To identify trends or interesting phenomena
– Only requires a portion of the data, accessed through a spatial index (e.g., quad-tree, R-tree)

Spatial (range) query often used to specify an iterator
– computation on data obtained from the spatial query
– computation aggregates data (as in MapReduce); the resulting data product is significantly smaller than the results of the range query

SLIDE 4

Typical Query
– Specify portion of raw sensor data corresponding to some search criterion
– Output grid onto which a projection is carried out

SLIDE 5

Processing Remotely-Sensed Data

NOAA TIROS-N with AVHRR sensor

AVHRR Level 1 Data
• As the TIROS-N satellite orbits, the Advanced Very High Resolution Radiometer (AVHRR) sensor scans perpendicular to the satellite's track.
• At regular intervals along a scan line, measurements are gathered to form an instantaneous field of view (IFOV).
• Scan lines are aggregated into Level 1 data sets.

A single file of Global Area Coverage (GAC) data represents:
• ~one full earth orbit (~110 minutes)
• ~40 megabytes
• ~15,000 scan lines, each of 409 IFOVs

Target example applications: water contamination study, pathology, satellite data processing, multi-perspective volume reconstruction

SLIDE 6

Outline

Active Data Repository
– Overall architecture
– Query planning
– Query execution
– Experimental results

DataCutter

SLIDE 7

Active Data Repository (ADR)

An object-oriented framework (class library + runtime system) for building parallel databases of multi-dimensional datasets
– enables integrated storage, retrieval and processing of multi-dimensional datasets on distributed-memory parallel machines
– can store and process multiple datasets
– provides support and a runtime system for common operations such as data retrieval, memory management, and scheduling of processing across a parallel machine
– customizable for application-specific processing

SLIDE 8

ADR Architecture

[Diagram: an Application Front End submits a Query to the parallel ADR Back End, which returns Results to clients (Client 1 parallel, Client 2 sequential). Back End services: Query Submission, Query Interface, Query Planning, Query Execution, Indexing, Data Aggregation, Attribute Space, and Dataset services.]

SLIDE 9

Active Data Repository (ADR)

Dataset is a collection of user-defined data chunks
– a data chunk contains a set of data elements
– a minimum bounding rectangle (MBR) in the multi-dimensional attribute space is kept for each chunk and used by the spatial index (see the sketch below)
– chunks are declustered across disks to maximize aggregate I/O bandwidth

Separate planning and execution phases for queries
– Tile output if too large to fit entirely in memory
– Plan each tile's I/O, data movement and computation
  • Identify all chunks of input that map to the tile
  • Distribute processing for chunks among processors
– All processors work on one tile at a time
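
A minimal sketch, in C++, of how chunks with bounding rectangles might back the spatial index described above; all names (MBR, ChunkDescriptor, lookup) are hypothetical, not the actual ADR API:

    #include <cstddef>
    #include <vector>

    struct MBR {                        // minimum bounding rectangle in attribute space
        std::vector<double> lo, hi;     // per-dimension lower/upper bounds
        bool intersects(const MBR& o) const {
            for (std::size_t d = 0; d < lo.size(); ++d)
                if (hi[d] < o.lo[d] || o.hi[d] < lo[d]) return false;
            return true;
        }
    };

    struct ChunkDescriptor {
        MBR bounds;        // used by the spatial index
        int disk;          // chunks are declustered across disks
        long offset, size; // location of the chunk on its disk
    };

    // Index lookup: select the chunks whose MBRs intersect the query region.
    std::vector<const ChunkDescriptor*>
    lookup(const std::vector<ChunkDescriptor>& index, const MBR& query) {
        std::vector<const ChunkDescriptor*> hits;
        for (const auto& c : index)
            if (c.bounds.intersects(query)) hits.push_back(&c);
        return hits;
    }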

SLIDE 10

Query Planning

Three steps: index lookup, tiling, workload partitioning

Index lookup
– Select data chunks of interest
– Compute mapping between input and output chunks

Tiling
– Partition output chunks so that each tile fits in memory
– Use a Hilbert space-filling curve to minimize the total length of tile boundaries (see the sketch below)

Workload partitioning
– Each aggregation operation involves an input/output chunk pair
– Want good load balance and low communication overhead
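
A sketch of Hilbert-curve tiling under stated assumptions: output chunks sit on an n x n grid with n a power of two, xy2d is the classic Hilbert distance conversion, and the memory budget and helper names are hypothetical:

    #include <algorithm>
    #include <utility>
    #include <vector>

    // Classic Hilbert-curve conversion: grid cell (x, y) -> distance d along
    // the curve, for an n x n grid where n is a power of two.
    long xy2d(long n, long x, long y) {
        long d = 0;
        for (long s = n / 2; s > 0; s /= 2) {
            long rx = (x & s) ? 1 : 0, ry = (y & s) ? 1 : 0;
            d += s * s * ((3 * rx) ^ ry);
            if (ry == 0) {                       // rotate/reflect quadrant
                if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
                std::swap(x, y);
            }
        }
        return d;
    }

    struct OutputChunk { long cx, cy, bytes; }; // grid cell of chunk center, size

    // Order output chunks along the Hilbert curve so consecutive chunks are
    // spatially close, then pack them into tiles that fit in memory.
    std::vector<std::vector<OutputChunk>>
    tileChunks(std::vector<OutputChunk> chunks, long memBudget, long gridSize) {
        std::sort(chunks.begin(), chunks.end(),
                  [&](const OutputChunk& a, const OutputChunk& b) {
                      return xy2d(gridSize, a.cx, a.cy) < xy2d(gridSize, b.cx, b.cy);
                  });
        std::vector<std::vector<OutputChunk>> tiles(1);
        long used = 0;
        for (const auto& c : chunks) {
            if (used + c.bytes > memBudget && !tiles.back().empty()) {
                tiles.emplace_back();            // start a new tile
                used = 0;
            }
            tiles.back().push_back(c);
            used += c.bytes;
        }
        return tiles;
    }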

SLIDE 11

Query Execution

Broadcast query plan to all processors

For each output tile:
– Initialization phase: read output chunks into memory, replicate if necessary
– Reduction phase: read and process input chunks that map to the current tile
– Combine phase: combine partial results in replicated output chunks, if any
– Output handling: compute final output values

SLIDE 12

ADR Processing Loop

O ← Output dataset, I ← Input dataset
A ← Accumulator (for intermediate results)
[S_I, S_O] ← Intersect(I, O, R_query)
foreach o_e in S_O do
    read o_e
    a_e ← Initialize(o_e)
foreach i_e in S_I do
    read i_e
    S_A ← Map(i_e) ∩ S_O
    foreach a_e in S_A do
        a_e ← Aggregate(i_e, a_e)
foreach a_e in S_O do
    o_e ← Output(a_e)
    write o_e

SLIDE 13

Query Execution Strategies

Distributed Accumulator (DA)
– Assign each aggregation operation to the owner of the output chunk

Fully Replicated Accumulator (FRA)
– Assign each aggregation operation to the owner of the input chunk
– Requires a combine phase

Sparsely Replicated Accumulator (SRA)
– similar to FRA, but only replicates an output chunk when needed

The sketch below contrasts the assignment rules.
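
A minimal sketch of the difference in assignment rules, assuming a simple modular declustering map; ownerOf and the other names are hypothetical, not the ADR API:

    enum class Strategy { DA, FRA, SRA };

    struct Aggregation { int inputChunk, outputChunk; };

    // Stand-in for ADR's declustering map (chunk -> processor).
    int ownerOf(int chunk, int nprocs) { return chunk % nprocs; }

    // DA aggregates on the output chunk's owner, so no combine phase is needed;
    // FRA/SRA aggregate on the input chunk's owner into a (possibly replicated)
    // local accumulator, whose partial results must be combined afterwards.
    int assignProcessor(const Aggregation& op, Strategy s, int nprocs) {
        if (s == Strategy::DA) return ownerOf(op.outputChunk, nprocs);
        return ownerOf(op.inputChunk, nprocs);
    }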

SLIDE 14

Performance Evaluation

128-node IBM SP, with 256MB memory per node

Datasets generated by Application Emulators
– Satellite Data Processing (SAT) – non-uniform mapping
– Virtual Microscope (VM)

App   Input      Output   Fan-in     Fan-out (avg)   Comp (ms, tinit-tred-tcomb)
SAT   1.6-26GB   25MB     161-1307   4.6             1-40-20
VM    1.5-24GB   192MB    16-128     1.0             1-5-1

SLIDE 15

Query Execution Time (sec)

[Graphs: query execution time vs. number of processors (8, 16, 32, 64, 128) for the FRA, DA and SRA strategies; left: SAT, right: VM (fixed input size)]

SLIDE 16

Summary of Experimental Results

Communication volume
– Comm. volume for DA ∝ fan-out
– Comm. volume for FRA/SRA ∝ fan-in

DA may have computational load imbalance due to non-uniform mapping

Relative performance depends on
– Query characteristics (e.g., fan-in, fan-out)
– Machine configuration (e.g., number of processors)

No strategy always outperforms the others

SLIDE 17

ADR queries vs. Other Approaches

Similar to out-of-core reductions (a more general form of MapReduce)
– Commutative & associative
– Most reduction optimization techniques target in-core data
– Out-of-core techniques require data redistribution

Similar to relational group-by queries
– Distributive & algebraic [Gray96]
– spatial join + group-by
– For ADR, output data items and extents are known prior to processing

Out-of-core reduction example (C):

    double x[max_nodes], y[max_nodes];
    int ia[max_edges], ib[max_edges];
    for (int i = 0; i < max_edges; i++)
        x[ia[i]] += y[ib[i]];   /* scatter-add: an irregular reduction */

Group-by example (SQL):

    SELECT Dept, AVG(Salary)
    FROM Employee
    GROUP BY Dept

SLIDE 18

Outline

Active Data Repository

DataCutter
– Architecture
– Filter-stream programming
– Group instances
– Transparent copies

SLIDE 19

Distributed Grid Environment

Heterogeneous shared resources:
– Host level: machine, CPUs, memory, disk storage
– Network connectivity

Many remote datasets:
– Inexpensive archival storage
– Islands of useful data
– Too large for replication

SLIDE 20

DataCutter

Target same classes of applications as ADR

Indexing Service
– Multi-level hierarchical indexes based on spatial indexing methods, e.g., R-trees
– Relies on an underlying multi-dimensional space
– User can add new indexing methods

Filtering Service
– Distributed C++ (and Java) component framework
– Transparent tuning and adaptation for heterogeneity
– Filters implemented as threads – 1 process per host

SLIDE 21

Filter-Stream Programming (FSP)

Purpose: specialized components for processing data
– based on Active Disks research [Acharya, Uysal, Saltz: ASPLOS'98], macro-dataflow, and functional parallelism

filters – the logical unit of computation
– high-level tasks
– init, process, finalize interface

streams – how filters communicate
– unidirectional buffered pipes
– use fixed-size buffers (min, good)

users specify filter connectivity and filter-level characteristics (a sketch of the interface follows)

[Diagram: Extract ref (from Reference DB) and Extract raw (from Raw Dataset) feed 3D reconstruction, which feeds View result]
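
A minimal sketch, in C++, of the init/process/finalize filter interface described above; the class and method names are hypothetical, modeled on the slide's description rather than the actual DataCutter API:

    #include <cstddef>

    struct Buffer { const char* data; std::size_t size; };

    // Streams are unidirectional pipes carrying fixed-size buffers.
    class Stream {
    public:
        virtual bool read(Buffer& b) = 0;          // false at end of the stream
        virtual void write(const Buffer& b) = 0;
        virtual ~Stream() = default;
    };

    enum class Result { EndOfWork, EndOfFilter };

    // A filter is the logical unit of computation; the runtime calls
    // init/process/finalize once per unit of work.
    class Filter {
    public:
        virtual void init() {}
        virtual Result process(Stream& in, Stream& out) = 0;
        virtual void finalize() {}
        virtual ~Filter() = default;
    };

    // Example: a filter that forwards only non-empty buffers downstream.
    class SelectFilter : public Filter {
    public:
        Result process(Stream& in, Stream& out) override {
            Buffer b;
            while (in.read(b))
                if (b.size > 0) out.write(b);      // stand-in for real filtering logic
            return Result::EndOfWork;              // ready for the next unit of work
        }
    };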

SLIDE 22

FSP: Abstractions

Filter group
– logical collection of filters to use together
– application starts filter group instances

Unit-of-work cycle
– "work" is application-defined (e.g., a query)
– work is appended to running instances
– init(), process(), finalize() called for each unit of work
– process() returns { EndOfWork | EndOfFilter }
– allows for adaptivity (see the driver sketch below)

[Diagram: filters A and B connected by stream S; units of work uow 0, uow 1, uow 2 flow through the stream as buffers]
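
A sketch of the loop a runtime might drive around each filter for the unit-of-work cycle, reusing the hypothetical Filter, Stream and Result types sketched on the previous slide:

    // One init/process/finalize round per appended unit of work; EndOfFilter
    // retires the filter, EndOfWork leaves it waiting for the next uow.
    void runFilter(Filter& f, Stream& in, Stream& out) {
        for (;;) {
            f.init();
            Result r = f.process(in, out);
            f.finalize();
            if (r == Result::EndOfFilter) break;
        }
    }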

SLIDE 23

Optimization Techniques

Mapping filters to hosts
– allows components to execute concurrently

Multiple filter group instances
– allow work to be processed concurrently

Transparent copies
– keep the pipeline full by avoiding filter processing imbalance, using write policies to deal with dynamic buffer distribution

Application memory tuning
– minimize resource usage to allow for copies

SLIDE 24

Optimization - Group Instances

[Diagram: two filter group instances, P0-F0-C0 and P1-F1-C1, processing Work across host1, host2 and host3 (2 CPUs each)]

Match the number of instances to the environment (CPU capacity, network)

SLIDE 25

Transparent Copies

replicate filters within an instance (intra-work)

write policy to distribute work buffers to copies (see the sketch below)
– shared queue within a host
– across hosts: round robin (RR), weighted RR (WRR), demand-driven (DD), user-defined (UD)

single-stream illusion preserved (UOW_i ordered before UOW_i+1)
state consistency problems addressed by a merge step

[Diagram: producer P0 streams buffers to transparent copies F0 and F1; the write policy decides which copy receives each buffer before results reach C0]
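
A sketch of two of the write policies named above, under a hypothetical interface where each copy exposes its current queue length:

    #include <cstddef>
    #include <vector>

    struct Copy { std::size_t queued; };   // buffers currently queued at this copy

    // Round robin: cycle through the copies regardless of load.
    std::size_t roundRobin(std::size_t& next, std::size_t ncopies) {
        std::size_t pick = next;
        next = (next + 1) % ncopies;
        return pick;
    }

    // Demand-driven: send to the copy with the shortest queue, a simple proxy
    // for its recent buffer consumption rate.
    std::size_t demandDriven(const std::vector<Copy>& copies) {
        std::size_t best = 0;
        for (std::size_t i = 1; i < copies.size(); ++i)
            if (copies[i].queued < copies[best].queued) best = i;
        return best;
    }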

SLIDE 26

Runtime Pipeline Balancing

Use local information:
– queue size, send time / receiver acks

Adjust number of transparent copies

Demand-based dataflow (choice of consumer)
– Within a host – perfect shared queue among copies
– Across hosts:
  • Round Robin (RR)
  • Weighted Round Robin (WRR)
  • Demand-Driven (DD) – sliding window over the buffer consumption rate (sketched below)
  • User-defined
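
A sketch of the sliding-window consumption-rate estimate a demand-driven policy might keep per consumer, based on receiver acks; the class name and windowing scheme are hypothetical:

    #include <cstddef>
    #include <deque>

    class RateWindow {
        std::deque<double> ackTimes;   // timestamps of recent buffer acks
        std::size_t window;            // how many acks to remember
    public:
        explicit RateWindow(std::size_t w) : window(w) {}
        void onAck(double now) {
            ackTimes.push_back(now);
            if (ackTimes.size() > window) ackTimes.pop_front();
        }
        // Buffers consumed per second over the window (0 until enough data).
        double rate() const {
            if (ackTimes.size() < 2) return 0.0;
            double span = ackTimes.back() - ackTimes.front();
            return span > 0 ? (ackTimes.size() - 1) / span : 0.0;
        }
    };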

SLIDE 27

Experiment – Isosurface Rendering

Isosurface rendering on the Red/Blue Linux cluster at Maryland
– Red – 16 2-processor PII-450 nodes, 256MB memory, 18GB SCSI disk
– Blue – 12 2-processor PIII-550 nodes, 1GB memory, 2 8GB SCSI disks, plus one 8-processor PIII-550 node, 4GB memory, 2 18GB SCSI disks
– Connected via Gigabit Ethernet

UT Austin ParSSim chemical species transport simulation
– Single-time-step 3D visualization; read all data for one time step

Two implementations of the Raster filter – z-buffer and active pixels

SLIDE 28

Sample Isosurface Visualization

[Images: isosurface visualizations at V = 0.35 (left) and V = 0.7 (right)]

SLIDE 29

Experimental setup

Filter pipeline: R (read dataset) → E (isosurface extraction) → Ra (shade + rasterize) → M (merge / view)

Per-filter times for the two Raster implementations:

Filter   Active Pixel   Z-buffer
R        0.64s          0.68s
E        1.64s          1.65s
Ra       11.67s         9.43s
M        0.73s          0.90s
Total    14.68s         12.66s

[Figure also showed data volumes between stages: 150 MB, 32.0 MB, 38.6 MB, 11.8 MB, 28.5 MB]

Experiment to follow combines R and E filters, since that showed best performance in experiments not shown

SLIDE 30

Active Pixel vs. Z-Buffer

[Graphs: time (seconds) vs. number of processors (1, 2, 4, 8) for the Active Pixel and Z-buffer Raster filters, with 2 Raster filters (left) and 1 Raster filter (right)]

Configuration: RE-Ra-M. Only Red nodes used – each one runs 1 RE and 1 or 2 Ra, and one node runs M

SLIDE 31

Heterogeneous Nodes

Active Pixel algorithm on the 8-processor Blue node + Red data nodes. The Blue node runs 7 Ra or ERa copies plus M; each Red node runs one of each filter except M

[Graphs: time (seconds) vs. number of processors (1, 2, 4, 8) under the RR, WRR and DD write policies, for the RE-Ra-M (left) and R-ERa-M (right) configurations]

SLIDE 32

Summary of Results

Placement matters

– Heterogeneity of shared resources, data volume

More instances and transparent copies

– Balance applications for heterogeneity

No static choice will work

– Runtime heterogeneity and dynamic shared resources

SLIDE 33

DataCutter as a Grid Service

[Diagram: layered view from the Application Level down through Programming Models and Infrastructure Services to the Resource Level. DataCutter is positioned among infrastructure systems such as SRB, Legion, Condor, JavaRMI, DCOM, CORBA, NetSolve, Ninf, AppLeS, HPC++, NWS, Harmony, DSM, MPI, RPC, DPSS and Globus, over grid-available, user-specified, client/server (sockets) and idle (Condor pool) resources]

SLIDE 34

Acknowledgments

Students

– Chialin Chang – ADR
– Michael Beynon, Renato Ferreira – DataCutter

Other faculty and postdocs (now at Ohio State)
– Joel Saltz
– Tahsin Kurc
– Umit Catalyurek