Virtual Data Language: A Typed Workflow Notation for Diversely Structured Scientific Data

SLIDE 1

Virtual Data Language: A Typed Workflow Notation for Diversely Structured Scientific Data

Yong Zhao (1), Michael Wilde (2, 3), Ian Foster (1, 2, 3)

(1) Department of Computer Science, University of Chicago
(2) Computation Institute, University of Chicago
(3) Division of MCS, Argonne National Laboratory

DSL Workshop 2006, 02 June 2006

SLIDE 2

Outline

  • Motivation
  • Typing System
  • XDTM (XML Dataset Typing and Mapping)
  • Virtual Data Language
  • fMRI Use Case
  • Conclusion
SLIDE 3

The Broad Picture – Data and Grid

  • Data analysis turns into data integration
    – The need to discover, access, explore, and analyze diverse distributed data sources
  • Science as collaborative workflow
    – The need to organize, archive, reuse, explain, and schedule scientific workflows
  • Virtual data as a unifying concept
    – Integrated view over data, programs, and computations

SLIDE 4

Challenges

  • Deluge of data
  • Heterogeneity
    – Diversely structured data storage and formats
    – Metadata encoded in ad hoc ways
  • Geographic and political distribution
    – Different administrative domains
    – Different access protocols and policies
  • Collaboration within/across large, dynamic communities
    – Negotiation and sharing of resources

SLIDE 5

“Messy” Scientific Data

  • Heterogeneous storage formats and access protocols
    – A logically identical dataset can be stored in a text file (e.g. CSV), a binary file, or a spreadsheet
    – Data is available from filesystems, databases, HTTP, WebDAV, etc.
  • Metadata encoded in directory and file names
    – An fMRI volume is composed of an image file and a header file with the same prefix
  • Format dependency hinders program and workflow reuse

SLIDE 6

But... Data is often Logically Structured

  • Scientific data often maintains a hierarchical structure
  • A common practice is to select a set of data items and apply a transformation to each individual item
  • Nesting such iterations can scale up to millions of objects

SLIDE 7

Introducing a Typing System

  • Describe logical data structures as types
  • Define procedures in terms of typed datasets, then use those procedures on different physical representations
  • Compose workflows from typed procedures
  • Benefits
    – Type checking
    – Dataset selection and iteration
    – Discovery by types
    – Dynamic binding
    – Type conversion
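As a rough illustration of the type checking and discovery-by-type benefits listed above, the following Python sketch attaches a typed signature to each procedure and validates a call against it. The `Procedure` class, `check_call`, `discover`, and the two example signatures are hypothetical names chosen for this sketch; they are not part of VDL or XDTM.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Procedure:
    """A procedure described only by its typed signature (hypothetical model)."""
    name: str
    inputs: Tuple[str, ...]  # logical type names of the arguments, in order
    output: str              # logical type name of the result


def check_call(proc: Procedure, arg_types: Sequence[str]) -> str:
    """Return the result type if the argument types match, else raise TypeError."""
    if tuple(arg_types) != proc.inputs:
        raise TypeError(
            f"{proc.name} expects {proc.inputs}, got {tuple(arg_types)}"
        )
    return proc.output


def discover(procs: Sequence[Procedure], input_type: str) -> List[str]:
    """Discovery by type: names of procedures that accept the given type."""
    return [p.name for p in procs if input_type in p.inputs]


# Example signatures, assumed for illustration only.
reorient = Procedure("reorient", ("Volume", "string"), "Volume")
softmean = Procedure("softmean", ("Run", "string"), "Volume")

# Composition is checked by feeding one procedure's result type to the next.
mean_type = check_call(softmean, ["Run", "string"])
result_type = check_call(reorient, [mean_type, "string"])
```

Because the check works on logical type names only, the same signature applies regardless of how a dataset is physically stored.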

SLIDE 8

XDTM

  • XML Dataset Typing and Mapping
  • Separates logical structure from physical representations
  • Logical structure described by XML Schema
    – Primitive scalar types: int, float, string, date, …
    – Complex types (structs and arrays)
  • Mapping descriptor
    – How dataset elements are mapped to physical representations
    – External parameters (e.g. location)
  • XPath for dataset selection
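To make the idea of XPath-based dataset selection concrete, here is a minimal Python sketch that selects elements from an XML view of a dataset. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset. The element names (`subject`, `run`, `v`), the `id` attribute, and the sample values are assumptions made for this example, not the XDTM representation itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical logical view of one subject with two functional runs.
LOGICAL_VIEW = """
<subject>
  <run id="1"><v>bold1_001</v><v>bold1_002</v></run>
  <run id="2"><v>bold2_001</v></run>
</subject>
"""

root = ET.fromstring(LOGICAL_VIEW)

# Select every volume in every run.
all_volumes = [v.text for v in root.findall("./run/v")]

# Select only the volumes of the run whose id attribute is "1".
run1_volumes = [v.text for v in root.findall("./run[@id='1']/v")]
```

The selection expressions refer only to the logical structure, so they would stay the same if the volumes were stored in a different physical layout.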
SLIDE 9

Mapping

  • Define a common mapping interface
    – Initialize, read, create, write, close
  • Data providers implement the interface
    – Responsible for data access details
  • XView maintains cached logical datasets

[Figure: architecture diagram with components VDS, XViewMgr, XView, Mapper, and Data Source]
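The five operations named on this slide can be pictured as an abstract interface that each data provider implements. The Python sketch below is one possible shape for it. The method signatures, the `params` dictionary, and the `InMemoryMapper` class are assumptions for illustration; the slides name the operations but do not give their signatures.

```python
from abc import ABC, abstractmethod


class Mapper(ABC):
    """Common mapping interface (signatures are assumed, not from the source)."""

    @abstractmethod
    def initialize(self, params): ...

    @abstractmethod
    def read(self, path): ...

    @abstractmethod
    def create(self, path): ...

    @abstractmethod
    def write(self, path, value): ...

    @abstractmethod
    def close(self): ...


class InMemoryMapper(Mapper):
    """Toy provider that keeps dataset elements in a dictionary."""

    def initialize(self, params):
        self.store = dict(params.get("initial", {}))
        self.is_open = True

    def read(self, path):
        return self.store[path]

    def create(self, path):
        # Register a new, still empty, dataset element.
        self.store.setdefault(path, None)

    def write(self, path, value):
        if path not in self.store:
            raise KeyError(f"{path} has not been created")
        self.store[path] = value

    def close(self):
        self.is_open = False


mapper = InMemoryMapper()
mapper.initialize({"initial": {"run1/v1": "bold1_001"}})
mapper.create("run1/v2")
mapper.write("run1/v2", "bold1_002")
second_volume = mapper.read("run1/v2")
mapper.close()
```

A provider for files or a database would implement the same five methods and hide its access details behind them.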

SLIDE 10

Virtual Data Schema

SLIDE 11

Virtual Data System

[Figure: Virtual Data System architecture. Components, in the order shown: VDL Program; Virtual Data Catalog and Workflow Generator; Abstract Workflow; Planner; Execution Plan; Workflow Enactor and Provenance Collector; Launcher; Grid]

SLIDE 12

Use Case – fMRI Data

Logical Structure

    DBIC Archive
      Study #1
        Group #1
          Subject #1
            Anatomy
              high-res volume
            Functional Runs
              run #1
                volume #001
                ...
                volume #275
              ...
              run #5
                volume #001
                ...
              snrun #...
          …
        Group #5
        ...
      Study #...

Physical Representation

    DBIC Archive
      Study_2004.0521.hgd
        Group_1
          Subject_2004.e024
            volume_anat.img
            volume_anat.hdr
            bold1_001.img
            bold1_001.hdr
            ...
            bold1_275.img
            bold1_275.hdr
            ...
            bold5_001.img
            ...
            snrbold*_*
            air*
          ...
        Group_5
        ...
      Study ...
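The relationship between the two views can be sketched in Python: file names following the `bold<run>_<volume>.img` / `.hdr` pattern from the physical listing are grouped into runs of volumes. The function name `map_runs` and the dictionary layout of the result are assumptions for this sketch, and it handles only the `bold` files, not the anatomy, `snrbold`, or `air` files.

```python
import re
from collections import defaultdict

# Pattern taken from the listing: bold<run>_<volume>.img or .hdr
BOLD = re.compile(r"^bold(\d+)_(\d+)\.(img|hdr)$")


def map_runs(filenames):
    """Group bold*.img/.hdr files into {run number: [volume, ...]}.

    Each volume is a dict with 'img' and/or 'hdr' keys; volumes are
    ordered by volume number within a run.
    """
    runs = defaultdict(lambda: defaultdict(dict))
    for name in filenames:
        match = BOLD.match(name)
        if match is None:
            continue  # files outside the bold pattern are ignored here
        run_no, vol_no, ext = int(match.group(1)), int(match.group(2)), match.group(3)
        runs[run_no][vol_no][ext] = name
    return {
        run_no: [vols[vol_no] for vol_no in sorted(vols)]
        for run_no, vols in sorted(runs.items())
    }


FILES = [
    "bold1_002.hdr", "bold1_001.img", "bold1_001.hdr", "bold1_002.img",
    "bold5_001.img", "volume_anat.img", "snrbold1_001.img",
]
LOGICAL_RUNS = map_runs(FILES)
```

The image and header files that share a prefix end up in the same volume entry, which mirrors how metadata encoded in file names determines the logical structure.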

SLIDE 13

Type Definitions in VDL

    type Image {};
    type Header {};
    type Volume { Image img; Header hdr; }
    type Anat Volume;
    type Warp {};
    type NormAnat { Anat aVol; Warp aWarp; Volume nHires; }
    type Run { Volume v[]; }
    type Subject { Anat anat; Run run[]; Run snrun[]; }
    type Group { Subject s[]; }
    type Study { Group g[]; }

Part of the fMRI AIRSN (Spatial Normalization) workflow
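For readers more familiar with a general-purpose language, the struct and array types above can be mirrored with Python dataclasses. This is an analogy only: the `path` field on `Image` and `Header`, the `volume` helper, and the treatment of `Anat` as a plain alias of `Volume` are assumptions of this sketch, and VDL's opaque types carry no such fields.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Image:
    path: str  # assumption: opaque data is identified by a file path


@dataclass
class Header:
    path: str  # assumption, as above


@dataclass
class Volume:
    img: Image
    hdr: Header


@dataclass
class Run:
    v: List[Volume] = field(default_factory=list)


@dataclass
class Subject:
    anat: Volume  # 'type Anat Volume' is treated as an alias here
    run: List[Run] = field(default_factory=list)
    snrun: List[Run] = field(default_factory=list)


def volume(prefix: str) -> Volume:
    """Hypothetical helper: build a Volume from a shared file prefix."""
    return Volume(Image(prefix + ".img"), Header(prefix + ".hdr"))


subject = Subject(
    anat=volume("volume_anat"),
    run=[Run([volume("bold1_001"), volume("bold1_002")])],
)
```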

SLIDE 14

Type Definitions in XML Schema

    <xs:schema targetNamespace="http://www.fmri.org/schema/airsn.xsd"
               xmlns="http://www.fmri.org/schema/airsn.xsd"
               xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:simpleType name="Image"/>
      <xs:simpleType name="Header"/>
      <xs:complexType name="Volume">
        <xs:sequence>
          <xs:element name="img" type="Image"/>
          <xs:element name="hdr" type="Header"/>
        </xs:sequence>
      </xs:complexType>
      <xs:complexType name="Run">
        <xs:sequence minOccurs="0" maxOccurs="unbounded">
          <xs:element name="v" type="Volume"/>
        </xs:sequence>
      </xs:complexType>
    </xs:schema>
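Since the logical types are ordinary XML Schema, generic XML tooling can read them. The Python sketch below parses a trimmed copy of the schema (only the two complex types, without the target namespace) and lists each complex type with its fields. The function name `complex_types` and the returned dictionary layout are choices made for this example.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Trimmed copy of the schema above: complex types only.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="Volume">
    <xs:sequence>
      <xs:element name="img" type="Image"/>
      <xs:element name="hdr" type="Header"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="Run">
    <xs:sequence minOccurs="0" maxOccurs="unbounded">
      <xs:element name="v" type="Volume"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
"""


def complex_types(xsd_text):
    """Return {type name: [(field name, field type), ...]} for each complexType."""
    root = ET.fromstring(xsd_text)
    types = {}
    for ctype in root.findall(f"{{{XS}}}complexType"):
        fields = [
            (elem.get("name"), elem.get("type"))
            for elem in ctype.iter(f"{{{XS}}}element")
        ]
        types[ctype.get("name")] = fields
    return types


TYPES = complex_types(SCHEMA)
```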

SLIDE 15

Procedure Definition in VDL

    (Run snr) functional( Run r, NormAnat a, Air shrink ) {
        Run yroRun = reorientRun( r, "y" );
        Run roRun = reorientRun( yroRun, "x" );
        Volume std = roRun[0];
        Run rndr = random_select( roRun, .1 ); // 10% sample
        AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, [81,3,3] );
        Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
        Volume meanRand = softmean( reslicedRndr, "y", null );
        Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, [81,3,3] );
        Volume mnQA = reslice( meanRand, mnQAAir, "o", "k" );
        Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
        Run nr = reslice_warp_run( boldNormWarp, roRun );
        Volume meanAll = strictmean( nr, "y", null );
        Volume boldMask = binarize( meanAll, "y" );
        snr = gsmoothRun( nr, boldMask, 6, 6, 6 );
    }

SLIDE 16

Dataset Iteration

  • Functional analysis expressed in typed datasets
  • Iterate over each volume in a run

[Figure: workflow graph with nodes reorientRun, reorientRun, reslice_warpRun, random_select, alignlinearRun, resliceRun, softmean, alignlinear, combinewarp, strictmean, gsmoothRun, binarize]
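Iterating over each volume in a run can be pictured as lifting a per-volume procedure to a per-run procedure. The Python sketch below shows that pattern. `lift_to_run` is a hypothetical helper, volumes are represented as plain strings, and the `reorient` stand-in only appends a label instead of processing image data.

```python
from typing import Callable, List


def lift_to_run(volume_proc: Callable[..., str]) -> Callable[..., List[str]]:
    """Turn a procedure on one volume into a procedure on a whole run."""
    def run_proc(run: List[str], *args) -> List[str]:
        # Apply the per-volume procedure to every volume, keeping order.
        return [volume_proc(volume, *args) for volume in run]
    return run_proc


def reorient(volume: str, axis: str) -> str:
    """Stand-in for the real tool: records which axis was applied."""
    return f"{volume}.ro{axis}"


reorient_run = lift_to_run(reorient)

# Mirrors the first two steps of the functional procedure: y, then x.
ro_run = reorient_run(reorient_run(["v001", "v002"], "y"), "x")
```

Each element of the run is processed independently of the others, which is why such a stage can be expanded into one job per volume.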

SLIDE 17

Expanded Execution Plan

[Figure: expanded execution plan. Stage labels: reorient, reorient, alignlinear, reslice, softmean, alignlinear, combine_warp, reslice_warp, strictmean, binarize, gsmooth. Each stage is instantiated as numbered jobs, e.g. reorient/01, reslice_warp/22, strictmean/39, binarize/40, and gsmooth/41 through gsmooth/50]

  • Datasets dynamically instantiated from data sources by mappers
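The expansion from a run-level workflow to per-volume jobs can be sketched for the last four stages (reslice_warp, strictmean, binarize, gsmooth). The Python example below builds a dependency graph for a run of a given size and orders it with the standard library's `graphlib` (Python 3.9 or later). The function name `expand_tail` and the job numbering are illustrative and do not reproduce the numbering in the figure.

```python
from graphlib import TopologicalSorter


def expand_tail(n_volumes):
    """Map each job to the set of jobs it depends on.

    strictmean needs every resliced volume, binarize needs the mean,
    and each gsmooth job needs its own volume plus the shared mask.
    """
    deps = {"strictmean": set(), "binarize": {"strictmean"}}
    for i in range(1, n_volumes + 1):
        warp_job = f"reslice_warp/{i:02d}"
        deps[warp_job] = set()
        deps["strictmean"].add(warp_job)
        deps[f"gsmooth/{i:02d}"] = {warp_job, "binarize"}
    return deps


PLAN = expand_tail(3)

# One valid execution order; independent jobs could also run in parallel.
ORDER = list(TopologicalSorter(PLAN).static_order())
```

The number of jobs grows with the size of the run, while the run-level description stays the same.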

SLIDE 18

Code Size Comparison

Lines of code with different workflow encodings

    Workflow     Script   Generator   VDL
    GENATLAS1        49          72     6
    GENATLAS2        97         135    10
    FILM1            63         134    17
    FEAT             84         191    13
    AIRSN           215        ~400    37
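To put the comparison in relative terms, the short Python sketch below computes how many script lines correspond to one VDL line for each workflow. It assumes the table columns read Workflow, Script, Generator, VDL, and it leaves out the Generator column because the AIRSN entry there is only approximate (~400).

```python
# (script lines, VDL lines) per workflow, taken from the table.
LINES_OF_CODE = {
    "GENATLAS1": (49, 6),
    "GENATLAS2": (97, 10),
    "FILM1": (63, 17),
    "FEAT": (84, 13),
    "AIRSN": (215, 37),
}


def reduction_factor(script_lines, vdl_lines):
    """Script lines per VDL line; larger means a bigger size reduction."""
    return script_lines / vdl_lines


FACTORS = {
    name: reduction_factor(script, vdl)
    for name, (script, vdl) in LINES_OF_CODE.items()
}

for name, factor in FACTORS.items():
    print(f"{name}: {factor:.1f}x fewer lines in VDL than in the script")
```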

SLIDE 19

Conclusion

  • XDTM provides the data model for separating logical structure from physical representations
  • VDL allows workflow composition and dataset iteration based on typed signatures
  • The fMRI use case demonstrates effectiveness and productivity gains

SLIDE 20

For More Information

  • GriPhyN
    – http://www.griphyn.org/
  • VDS
    – http://www.griphyn.org/vds
  • Publications
    – http://people.cs.uchicago.edu/~yongzh