Querying Irregular Multi-dimensional Datasets: Dataset Structure

Very Large Dataset Access and Manipulation: Active Data Repository (ADR) and DataCutter
Joel Saltz
University of Maryland, College Park and Johns Hopkins Medical Institutions
http://www.cs.umd.edu/projects/adr

Data Intensive Research Group
University of Maryland/Johns Hopkins
• Mike Beynon
• Umit Catalyurek
• Chialin Chang
• Renato Ferreira
• Tahsin Kurc
• Alan Sussman

NATIONAL PARTNERSHIP FOR ADVANCED COMPUTATIONAL INFRASTRUCTURE

Irregular Multi-dimensional Datasets
• Spatial/multi-dimensional multi-scale, multi-resolution datasets
• Applications select portions of one or more datasets
• Selection of data subset makes use of spatial index (e.g., R-tree, quad-tree, etc.)
• Data not used "as-is", generally preprocessing is needed - often to reduce data volumes

Tools to Manage Storage Hierarchy
• Mass Storage:
  • Load subset of data from tertiary storage into disk cache or client
  • Access data from distributed data collections
  • Preprocess close to data sources
• Fast secondary storage
  • Tools for on-demand data product generation, interactive data exploration, visualization
  • Target closely coupled sets of processors/disks

DataCutter
• A suite of middleware for subsetting and filtering multi-dimensional datasets stored on archival storage systems
• Subsetting through Range Queries
  • a hyperbox in dataset's multi-dimensional space
  • retrieve items with multi-dimensional coordinates in box
• Processing (filtering/aggregations) through Filters
  • Carry out processing near data, compute servers

Active Data Repository (ADR)
• Set of services for building parallel databases of multi-dimensional datasets
  • enables integration of storage, retrieval and processing of multi-dimensional datasets on parallel machines.
  • can maintain and jointly process multiple datasets.
  • provides support and runtime system for common operations such as
    • data retrieval,
    • memory management,
    • scheduling of processing across a parallel machine.
  • customizable for various application specific processing.
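The DataCutter slide describes subsetting as a range query (a hyperbox in the dataset's coordinate space) followed by a filter that processes the selected items. The following is a minimal sketch of that idea, not DataCutter's actual API: the names `Box`, `Chunk`, `range_query` and `mean_filter` are hypothetical, and the chunk bounding boxes stand in for whatever spatial index the real system uses.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Tuple

Point = Tuple[float, ...]
Item = Tuple[Point, float]  # (coordinates, measured value)


@dataclass(frozen=True)
class Box:
    """Axis-aligned hyperbox given by its low and high corners (inclusive)."""
    lo: Point
    hi: Point

    def contains(self, p: Point) -> bool:
        return all(l <= x <= h for l, x, h in zip(self.lo, p, self.hi))

    def intersects(self, other: "Box") -> bool:
        # Two boxes overlap only if their extents overlap in every dimension.
        return all(a_lo <= b_hi and b_lo <= a_hi
                   for a_lo, a_hi, b_lo, b_hi
                   in zip(self.lo, self.hi, other.lo, other.hi))


@dataclass
class Chunk:
    """A stored data chunk with its minimum bounding rectangle (hypothetical layout)."""
    mbr: Box
    items: List[Item]


def range_query(chunks: Iterable[Chunk], query: Box) -> Iterator[Item]:
    """Yield items whose coordinates fall inside the query hyperbox."""
    for chunk in chunks:
        if not chunk.mbr.intersects(query):
            continue  # skip the whole chunk without touching its items
        for coords, value in chunk.items:
            if query.contains(coords):
                yield coords, value


def mean_filter(stream: Iterable[Item]) -> Optional[float]:
    """Example filter: reduce the selected items to a single mean value."""
    values = [v for _, v in stream]
    return sum(values) / len(values) if values else None


chunks = [
    Chunk(Box((0.0, 0.0), (1.0, 1.0)), [((0.2, 0.3), 10.0), ((0.9, 0.9), 20.0)]),
    Chunk(Box((2.0, 2.0), (3.0, 3.0)), [((2.5, 2.5), 99.0)]),
]
query = Box((0.0, 0.0), (0.5, 0.5))
selected = list(range_query(chunks, query))
```

Here the second chunk is rejected by its bounding box alone, which is the point of keeping a spatial index next to the data: most of the dataset is never read.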

Querying Irregular Multi-dimensional Datasets
• Irregular datasets
  • Think of disk-based unstructured meshes, data structures used in adaptive multiple grid calculations, sensor data
  • indexed by spatial location (e.g., position on earth, position of microscope stage)
• Spatial query used to specify iterator
  • computation on data obtained from spatial query
  • computation aggregates data - resulting data product size significantly smaller than results of range query

Dataset Structure
• Spatial and temporal resolution may depend on spatial location
• Physical quantities computed and stored vary with spatial location

Processing Irregular Datasets
Example -- Interpolation
Figure: Output grid onto which a projection is carried out. Specify portion of raw sensor data corresponding to some search criterion.

Applications
Figure panels: Pathology; Volume Rendering; Processing Remotely Sensed Data (AVHRR Level 1 data, NOAA TIROS-N w/ AVHRR sensor); Surface/Groundwater Modeling; Satellite Data Analysis.

Application Scenarios
• Locate TB spatio-temporal region in multi-scale, multi-resolution PB dataset, project data onto new spatio-temporal grid
• Ad-hoc queries, data products from satellite sensor data
• Browse or analyze (multi-resolution) digitized slides from high power light or electron microscopy
  • 1-50 GBytes per digitized slide, 5-50 slides per case, 100's of cases per day per hospital

Application Scenarios (cont.)
• Sensor data, fluid dynamics and chemistry codes to predict condition of waterways (e.g. Chesapeake Bay simulation) and to carry out petroleum reservoir simulation
• Predict materials properties using electron microscope computerized tomography sensor data
• Post-processing, analysis and visualization of data generated by large scientific simulations
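The digitized-slide figures quoted under Application Scenarios (1-50 GBytes per slide, 5-50 slides per case, 100's of cases per day) can be multiplied out to see the daily volume a single hospital might generate. The sketch below is only a worked calculation: it assumes exactly 100 cases per day as the low end of "100's", and the helper name `daily_volume_gb` is hypothetical.

```python
def daily_volume_gb(gb_per_slide: float, slides_per_case: int, cases_per_day: int) -> float:
    """Data volume per day, in GBytes, for one hospital."""
    return gb_per_slide * slides_per_case * cases_per_day


CASES_PER_DAY = 100  # assumption: the slide only says "100's of cases per day"

low_gb = daily_volume_gb(1, 5, CASES_PER_DAY)     # smallest slides, fewest slides per case
high_gb = daily_volume_gb(50, 50, CASES_PER_DAY)  # largest slides, most slides per case

# Using 1 TB = 1000 GB.
print(f"{low_gb / 1000:.1f} TB to {high_gb / 1000:.0f} TB per day")
```

Under these assumptions the range is 0.5 TB to 250 TB per day per hospital, which is why the deck treats subsetting and aggregation close to the data as a requirement.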

Processing Remotely Sensed Data
Figure: NOAA TIROS-N w/ AVHRR sensor.
AVHRR Level 1 Data
• As the TIROS-N satellite orbits, the Advanced Very High Resolution Radiometer (AVHRR) sensor scans perpendicular to the satellite's track.
• At regular intervals along a scan line measurements are gathered to form an instantaneous field of view (IFOV).
• Scan lines are aggregated into Level 1 data sets.
A single file of Global Area Coverage (GAC) data represents:
• ~one full earth orbit.
• ~110 minutes.
• ~40 megabytes.
• ~15,000 scan lines.
One scan line is 409 IFOVs

Spatial Irregularity
Figure: AVHRR Level 1B, NOAA-7 Satellite, 16x16 IFOV blocks. Axes: Latitude, Longitude.

Typical Query
Figure: Output grid onto which a projection is carried out. Specify portion of raw sensor data corresponding to some search criterion.

Active Data Repository

Application Processing Loop
O ← Output dataset, I ← Input dataset
A ← Accumulator (intermediate results)
[S_I, S_O] ← Intersect(I, O, R_query)
foreach o_e in S_O do
    read o_e
    a_e ← Initialize(o_e)
foreach i_e in S_I do
    read i_e
    S_A ← Map(i_e) ∩ S_O
    foreach a_e in S_A do
        a_e ← Aggregate(i_e, a_e)
foreach a_e in S_O do
    o_e ← Output(a_e)
    write o_e

Architecture of Active Data Repository
Figure: Two clients (one sequential, one parallel) send a query to the front end and receive results.
• Front End: Application Front End, Query Submission Service, Query Interface Service, Query Execution Service, Query Planning Service
• Back End: Dataset Service, Indexing Service, Attribute Space Service, Data Aggregation Service
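The Application Processing Loop can be made concrete with a small, sequential sketch. This is an illustration under stated assumptions rather than ADR code: input elements are `(x, y, value)` points, the output dataset is a regular grid with a hypothetical `CELL_SIZE`, `Map` sends each point to the single cell containing it, and `Aggregate`/`Output` compute a per-cell mean. All function names are made up for the example.

```python
import math
from typing import Dict, List, Optional, Set, Tuple

Cell = Tuple[int, int]
InputElem = Tuple[float, float, float]    # (x, y, value)
Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

CELL_SIZE = 1.0  # hypothetical output-grid resolution


def map_to_cell(elem: InputElem) -> Cell:
    """Map: project an input element onto the output grid."""
    x, y, _ = elem
    return (math.floor(x / CELL_SIZE), math.floor(y / CELL_SIZE))


def intersect(inputs: List[InputElem], query: Rect) -> Tuple[List[InputElem], Set[Cell]]:
    """Intersect: select input elements and output cells covered by the query."""
    x0, y0, x1, y1 = query
    s_i = [e for e in inputs if x0 <= e[0] < x1 and y0 <= e[1] < y1]
    s_o = {(cx, cy)
           for cx in range(math.floor(x0 / CELL_SIZE), math.ceil(x1 / CELL_SIZE))
           for cy in range(math.floor(y0 / CELL_SIZE), math.ceil(y1 / CELL_SIZE))}
    return s_i, s_o


def run_query(inputs: List[InputElem], query: Rect) -> Dict[Cell, Optional[float]]:
    s_i, s_o = intersect(inputs, query)

    # Initialize: one accumulator element [sum, count] per output element.
    acc: Dict[Cell, List[float]] = {cell: [0.0, 0] for cell in s_o}

    # Aggregate: fold each input element into the accumulators it maps to.
    for elem in s_i:
        s_a = {map_to_cell(elem)} & s_o  # Map(i_e) ∩ S_O
        for cell in s_a:
            acc[cell][0] += elem[2]
            acc[cell][1] += 1

    # Output: turn each accumulator into a final output element (mean or None).
    return {cell: (total / n if n else None) for cell, (total, n) in acc.items()}


inputs = [(0.5, 0.5, 10.0), (0.7, 0.2, 20.0), (1.5, 0.5, 30.0), (5.0, 5.0, 99.0)]
result = run_query(inputs, (0.0, 0.0, 2.0, 2.0))
```

The accumulator is kept separate from the output because the intermediate state (sum and count) differs from the final product (a mean); the same separation is what lets partial results from several processors be combined before output.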

Loading Datasets into ADR
• A user
  • should decompose dataset into data chunks
  • optionally can distribute chunks across the disks, and provide an index for accessing them
• ADR, given data chunks and associated minimum bounding rectangles in a set of files
  • can distribute data chunks across the disks using a Hilbert-curve based declustering algorithm,
  • can create an R-tree based index on the dataset.

Loading Datasets into ADR
• ADR Data Loading Service
  • Distributes chunks across the disks in the system (e.g., using Hilbert curve based declustering)
  • Constructs an R-tree index using bounding boxes of the data chunks
Figure: Disk Farm.

Data Loading Service
• User must decompose the dataset into chunks
• For a fully cooked dataset, User
  • moves the data and index files to disks (via ftp, for example)
  • registers the dataset using ADR utility programs
• For a half cooked dataset, ADR
  • computes placement information using a Hilbert curve-based declustering algorithm,
  • builds an R-tree index,
  • moves the data chunks to the disks
  • registers the dataset

Query Execution in Active Data Repository
• An ADR Query contains a reference to
  • the data set of interest,
  • a query window (a multi-dimensional bounding box in input dataset's attribute space),
  • default or user defined index lookup functions,
  • user-defined accumulator,
  • user-defined projection and aggregation functions,
  • how the results are handled (write to disk, or send back to the client).
• ADR handles multiple simultaneous active queries

ADR Query Execution
Figure: The client sends a query; ADR performs an index lookup and generates a query plan, then executes four phases:
• Initialization Phase: Initialize output
• Local Reduction Phase: Aggregate local input data into output
• Global Combine Phase: Combine partial output results
• Output Handling Phase: Send output to clients
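The loading slides mention Hilbert-curve based declustering without giving details. The sketch below shows one simple variant of the idea and should not be read as ADR's actual algorithm: chunk bounding-box centers are assumed to be already quantized to integer cells of a power-of-two grid, chunks are ordered by Hilbert index, and disks are assigned round-robin along that order. The names `hilbert_index` and `decluster` are hypothetical.

```python
from typing import List, Tuple


def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along a Hilbert curve filling an n x n grid.

    n must be a power of two; this is the standard iterative conversion.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def decluster(centers: List[Tuple[int, int]], num_disks: int, grid: int) -> List[int]:
    """Return a disk number for each chunk, given its grid-cell center.

    Chunks that are adjacent along the Hilbert curve (and so usually close in
    space) land on different disks, so a range query can read them in parallel.
    """
    order = sorted(range(len(centers)),
                   key=lambda i: hilbert_index(grid, centers[i][0], centers[i][1]))
    placement = [0] * len(centers)
    for rank, i in enumerate(order):
        placement[i] = rank % num_disks
    return placement


centers = [(0, 0), (1, 0), (0, 1), (1, 1)]
placement = decluster(centers, num_disks=2, grid=2)
```

On this 2 x 2 example the curve visits (0, 0), (0, 1), (1, 1), (1, 0) in that order, so consecutive cells alternate between the two disks.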
