Building an Extensible File System via Policy-based Data Management
Hao Xu, Jewel H. Ward, Mike Conway, Arcot Rajasekar, Reagan W. Moore (iRODS
File System
- Essential functions:
  - Ingest, Store, Access
- Modern file systems are built on top of traditional file systems:
  - Google File System, Amazon S3, Hadoop Distributed File System
  - Driven by the need of a target application
  - Customized toward the target application domain
Data Management Needs in Archive and Scientific Communities
- Discoverability
- Complex Metadata
- Workflow Management
- Data Sharing
- Provenance
- Long Term Preservation
- Technology Migration
- Interoperability Between Infrastructures
Challenges
Can generic infrastructure meet the needs of a diverse set of data management domains?
Flexibility to Define a Wide Range of Application Domain Policies
- User Community → Policies
- File ingest operations:
  - Authentication
  - Authorization
  - Storage Quota
  - Aggregation
  - Resource Selection
  - Replication
  - File Retention
  - Metadata
Infrastructure Support For Non-standard Application Domain Operations
- Standard file system operations have robust support:
  - Metadata
  - Auditing
  - Access Control List
- Non-standard operations that are implemented as a library do not have direct support from the file system. Examples:
  - Preservation: OAIS SIP, AIP, DIP packages
  - Digital library: provenance and discovery metadata
  - Processing pipeline: format transformation
Interoperability with Other Infrastructures
- Emergent scalability mechanisms:
  - Organization change
    - List → Tree → Graph (Internet) → Search
  - Data structure change
    - Files, tables, streams
  - Property enforcement expectations
    - Reproducible data-driven research
- Separation of how files are stored, accessed, and manipulated
Policy-based Data Management
Policy = Metadata + Procedure
- Purpose: reason a collection is assembled
- Properties: attributes needed to ensure the purpose
- Policies: controls for enforcing desired properties
  - Procedural policy. Example: when an object is ingested, run a workflow
  - Assertional policy. Example: a file has three or more replicas
- Metadata: persistent state
  - State information (consistency in a distributed environment)
  - Generated through application of procedures
- Procedures: operations performed within the system
  - What to run: functions that implement the policies
  - How to verify: validation that metadata conforms to the desired purpose
Policy-based Data Management
[Figure: concept graph. Recovered node labels: Collection Purpose, Policy, Property, Procedure, Periodic Assessment Criteria Policy, Metadata. Recovered edge labels: Defines, Controls, Updates, SubType.]
Policy-based Data Management - Collection
[Figure: concept graph, now also showing the node labels Attribute and Digital Object and the edge labels Has and Isa.]
Policy-based Data Management - Collection Properties
[Figure: concept graph, now also showing the labels Completeness, Correctness, Consensus, Consistency, Integrity, Authenticity, and Access control, with additional HasFeature and Isa edges.]
Policy-based Data Management - Collection Policies
[Figure: concept graph, now also showing the labels Replication Policy, Checksum Policy, Quota Policy, and Data Type Policy, with additional Isa edges.]
Policy-based Data Management - Collection Procedures
[Figure: concept graph, now also showing the labels Workflow, Function, and Operation, the edge label Chains, and the labels GetUserACL, SetDataType, SetQuota, DataObjRepl, and SysChksumDataObj, with additional Isa edges.]
Policy-based Data Management - Persistent State
[Figure: concept graph, now also showing the labels DATA_ID, DATA_REPL_NUM, and DATA_CHECKSUM, with additional Isa edges.]
Policy-based Data Management - Policy Enforcement
[Figure: concept graph, now also showing the labels Client Action and Policy Enforcement Point and the edge label Invokes.]
Example of Policy-based Data Management
Policy-based Infrastructure
integrated Rule Oriented Data System
- Biology
  - Cognitive Science: Temporal Dynamics of Learning Center
  - Human genome: Broad Institute, Wellcome Trust Sanger Institute, NGS
  - Medicine: Sick Kids Hospital
  - Neuroscience: International Neuroinformatics Coordinating Facility
  - Plant genome: the iPlant Collaborative
  - Phylogenetics: Phylogenetics at CC IN2P3
- Computer Science
  - Network research: GENI experimental network
- Earth Sciences
  - Atmospheric science: NASA Langley Atmospheric Sciences Center
  - Climate: NOAA National Climatic Data Center, NASA Center for Climate Simulations
  - Ecology: CEED Caveat Emptor Ecological Data
  - Hydrology: Institute for the Environment, UNC-CH; Hydroshare
  - Oceanography: Ocean Observatories Initiative
  - Seismology: Southern California Earthquake Center
- Engineering
  - Education repository: CIBER-U
- Physics
  - Astrophysics: Auger supernova search
  - Cosmic Ray: AMS experiment on the International Space Station
  - Dark Matter Physics: Edelweiss II
  - High Energy Physics: BaBar / Stanford Linear Accelerator
  - Neutrino Physics: T2K and dChooz neutrino experiments
  - Optical Astronomy: National Optical Astronomy Observatory
  - Particle Physics: Indra multi-detector collaboration at IN2P3
  - Quantum Chromodynamics: IN2P3
  - Radio Astronomy: Cyber Square Kilometer Array, TREND, BAOradio
- Social Science: Odum, TerraPop
Policy Applications
- Pre-process policy
  - Applied before an operation is done
- Operation
  - May be policy controlled
- Post-process policy
  - Applied after the operation is done
- Are these sufficient to handle the wide diversity of data management applications?
- Does this minimize the number of required operations?
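The pre-process / operation / post-process pattern above can be sketched as a wrapper around an operation. This is a minimal illustration, not iRODS code: `with_policies`, `check_quota`, `record_checksum`, the in-memory `store`, and the 16-byte quota are all hypothetical names and values chosen for the example.

```python
import hashlib

store = {}       # hypothetical in-memory "file system": name -> bytes
checksums = {}   # hypothetical metadata catalog: name -> SHA-256 hex digest
QUOTA_BYTES = 16  # assumed quota, small so the veto is easy to trigger


def with_policies(operation, pre=(), post=()):
    """Wrap an operation so pre-policies run before it and post-policies after."""
    def enforced(*args):
        for policy in pre:
            policy(*args)           # a pre-policy vetoes by raising
        result = operation(*args)
        for policy in post:
            policy(result, *args)   # a post-policy sees the result too
        return result
    return enforced


def check_quota(name, data):
    """Pre-process policy: refuse writes that would exceed the quota."""
    used = sum(len(v) for v in store.values())
    if used + len(data) > QUOTA_BYTES:
        raise PermissionError("quota exceeded")


def put(name, data):
    """The operation itself: store the bytes and return the name."""
    store[name] = data
    return name


def record_checksum(result, name, data):
    """Post-process policy: record a checksum as persistent state."""
    checksums[name] = hashlib.sha256(data).hexdigest()


enforced_put = with_policies(put, pre=[check_quota], post=[record_checksum])
enforced_put("a.dat", b"12345678")
```

Here the operation stays unaware of the policies; changing the quota rule or adding a replication step only changes the lists passed to `with_policies`.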
Policy (Workflow) in Hydrology
[Figure: RHESSys workflow diagram. Recovered labels: Choose gauge or outlet (HIS); Extract drainage area (NHDPlus); Digital Elevation Model (DEM); Slope; Aspect; Streams (NHD); Roads (DOT); Strata; Hillslope; Patch; Basin; Stream network; Nested watershed structure; Land Use; Leaf Area Index; Phenology; Soil Data; NLCD (EPA); Landsat TM; MODIS; USDA; Soil and vegetation parameter files; Flowtable; Worldfile; RHESSys.]
RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and the full initial system state. For each box, create a micro-service to automate the task, and chain the micro-services into a workflow.
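The idea of one micro-service per box, chained into a workflow, can be sketched as plain function composition. The step names below borrow labels from the diagram, but their bodies are placeholders invented for illustration; they do not call HIS, NHDPlus, or RHESSys.

```python
from functools import reduce


def chain(*steps):
    """Compose steps left to right; each step maps a state dict to a new one."""
    return lambda state: reduce(lambda acc, step: step(acc), steps, state)


# Hypothetical stand-ins for micro-services; real ones would fetch and
# transform data rather than fill in placeholder strings.
def choose_outlet(state):
    return {**state, "outlet": state["gauge_id"]}


def extract_drainage_area(state):
    return {**state, "drainage_area": "area-for-" + state["outlet"]}


def build_worldfile(state):
    inputs = sorted(key for key in state if key != "gauge_id")
    return {**state, "worldfile": inputs}


workflow = chain(choose_outlet, extract_drainage_area, build_worldfile)
result = workflow({"gauge_id": "G1"})
```

Because every step has the same shape, steps can be reordered, replaced, or extended without touching the others.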
Rule Engine
Policies in Software Defined Networking
Control selection of network paths
[Figure: architecture diagram. Recovered labels: GraphDB, Data Policies, Network Policies, OF Controller, iRODS Server (three instances), iCAT.]
Policy in Data Storage
Aggregation / Caching / Replication
Queen Mary University of London
Source: Di Lodovico et al.
Indexing Policies
[Figure: indexing architecture. Recovered labels: iRODS (Data, Metadata), Message Passing (AMQP), DataBook Rules, VIVO, VIVO Search UI, Indexing Framework, External Index, Indexing Service, OSGi, Indexer. Index: Text, Metadata, Events.]
Policies in Digital Libraries
- SILS LifeTime Library
  - Student collections range from 2 GBytes to 150 GBytes
  - Number of files from 2,000 to 12,000
- Library management policies
  - Replication, checksums, versioning, strict access controls, quotas, metadata catalog replication, installation environment archiving
- Ingestion policies
  - Automated synchronization of student directory with LifeTime Library
  - Automated loading of MP3 metadata
Policies in Archives
Formal Aspects of Policy-based Data Management
Domain Model
- Entities
  - Data Object, Replica, Collection, User, Resource, Rules, Metadata, Access
- Relations
  - (Collection) contains (Data Object)
  - (Resource) stores (Replica)
  - (Replica) replicates (Data Object)
  - (User) owns (Data Object)
  - (User) is granted (Access)
  - (Access) is granted on (Data Object)
- Operations
  - Get, put, replicate, etc.
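As a rough sketch, the entities and relations above map onto a few record types. The class and field names are assumptions made for this example (the slides give only the entity and relation names), and Rules and Metadata are left out to keep it short.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class User:
    name: str


@dataclass(frozen=True)
class Resource:
    name: str


@dataclass(frozen=True)
class DataObject:
    path: str
    owner: User            # (User) owns (Data Object)


@dataclass(frozen=True)
class Replica:
    of: DataObject         # (Replica) replicates (Data Object)
    stored_on: Resource    # (Resource) stores (Replica)


@dataclass(frozen=True)
class Access:
    user: User             # (User) is granted (Access)
    obj: DataObject        # (Access) is granted on (Data Object)
    level: str


@dataclass
class Collection:
    name: str
    contains: list = field(default_factory=list)  # (Collection) contains (Data Object)


@dataclass
class Catalog:
    """Hypothetical holder for replicas, with put and replicate operations."""
    replicas: list = field(default_factory=list)

    def put(self, collection, obj, resource):
        collection.contains.append(obj)
        self.replicas.append(Replica(obj, resource))

    def replicate(self, obj, resource):
        self.replicas.append(Replica(obj, resource))

    def replicas_of(self, obj):
        return [r for r in self.replicas if r.of == obj]


alice = User("alice")
obj = DataObject("/zone/home/alice/a.dat", alice)
home = Collection("/zone/home/alice")
catalog = Catalog()
catalog.put(home, obj, Resource("disk1"))
catalog.replicate(obj, Resource("tape1"))
grant = Access(alice, obj, "own")
```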
Policy
- A policy is implemented as a set of procedures defined in terms of the Domain Model
  - Assertion about state: "A file has three or more replicas"
    - A procedure to maintain state consistency: replication rule acPostProcForPut
    - A procedure to check state consistency (against hardware and human errors): periodic integrity check
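The three parts of this policy (the assertion, the maintaining procedure, the checking procedure) can be sketched as three small functions. This is an illustration only: `post_proc_for_put` is a hypothetical Python stand-in named after the acPostProcForPut rule, not the iRODS rule itself, and the catalog is an assumed in-memory dict from object name to the set of resources holding a replica.

```python
REQUIRED_REPLICAS = 3  # from the assertion: three or more replicas


def holds(catalog, obj):
    """Assertion about state: obj has three or more replicas."""
    return len(catalog.get(obj, set())) >= REQUIRED_REPLICAS


def post_proc_for_put(catalog, obj, resources):
    """Maintain the assertion: after a put, replicate until it holds."""
    held = catalog.setdefault(obj, set())
    for resource in resources:
        if len(held) >= REQUIRED_REPLICAS:
            break
        held.add(resource)


def integrity_check(catalog):
    """Periodic check: list the objects that violate the assertion."""
    return [obj for obj in sorted(catalog) if not holds(catalog, obj)]


catalog = {"a.dat": {"r1"}}                        # state right after a put
post_proc_for_put(catalog, "a.dat", ["r1", "r2", "r3", "r4"])
ok_after_put = holds(catalog, "a.dat")

catalog["a.dat"].discard("r2")                     # simulate a lost replica
violations = integrity_check(catalog)
```

The maintaining procedure keeps the assertion true under normal operation; the periodic check catches the cases it cannot see, such as a replica lost to hardware failure.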
Example of Formalism Using Monad
- Monad recap:
  - A monad represents computations (possibly with side effects; in our example, assume only state change)
- Monad constructors:
  - return: trivial computation that returns a value
  - x >> y: do x then y
  - x >>= y: feed return value of x into y
- Monad laws:
  - return x >>= f = f x (Left Id)
  - f >>= return = f (Right Id)
  - f >>= g >>= h = f >>= (λx.g x >>= h) (Associative)
    - (A B) C => A (B C)
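To make the constructors and laws concrete, here is a small state-monad sketch in Python, where a computation is modeled as a function from a state to a (value, new state) pair. The names `unit`, `bind`, and `then` stand for return, >>=, and >>; the steps `f`, `g`, and `h` are arbitrary examples, and the laws are only spot-checked on a few sample states rather than proved.

```python
def unit(x):
    """return: a trivial computation that yields x and leaves the state alone."""
    return lambda s: (x, s)


def bind(m, f):
    """m >>= f: run m, then feed its value into f."""
    def run(s):
        value, s1 = m(s)
        return f(value)(s1)
    return run


def then(m, n):
    """m >> n: run m, discard its value, then run n."""
    return bind(m, lambda _: n)


# Arbitrary example steps; the state is an integer counter.
def f(x):
    return lambda s: (x * 2, s + 1)


def g(x):
    return lambda s: (x + 3, s * 2)


def h(x):
    return lambda s: (x - 1, s + x)


def same(m1, m2, states=(0, 1, 5)):
    """Compare two computations by running both on sample states."""
    return all(m1(s) == m2(s) for s in states)


m = f(7)
left_id = same(bind(unit(7), f), f(7))                      # return x >>= f = f x
right_id = same(bind(m, unit), m)                           # m >>= return = m
assoc = same(bind(bind(m, g), h),
             bind(m, lambda x: bind(g(x), h)))              # associativity
```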
Domain Model
- Entities:
  - DataObject, Content, Replica, Resource
- Relations:
  - replica: r = replica(o, i) means r is the replica of o at resource i
  - replicas: r ∈ replicas(o) means r is a replica of o
Domain Model
- Basic operations:
  - read r: read content of replica r
  - write c r: write content c to replica r
  - aread i: read the ith latest audit log entry
  - awrite s r: append (s, r) to the audit log
  - repl o: replicate o to all resources
  - newest o: the newest replica of object o
Complex Operations and Policy Enforcement Points
- Complex operations:
  - oread o: read the content of object o
  - owrite c o: write content c to object o
- Defined in terms of basic operations + Policy Enforcement Points (PEPs):
  - op args = pre args >>= op' args >>= post args
- We define oread and owrite:
  - oread o = pre o >>= read >>= post o
  - owrite c o = pre c o >>= write c >>= post c o
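The pre >>= op' >>= post pattern can be sketched with the same state-monad encoding (a computation is a function from state to a (value, new state) pair). The pre and post hooks below use the simplest possible choice, a single resource and no extra behavior; the function names, the dict-based state, and the constant `RESOURCE` are assumptions of this sketch.

```python
def unit(x):
    """return"""
    return lambda s: (x, s)


def bind(m, f):
    """m >>= f"""
    def run(s):
        value, s1 = m(s)
        return f(value)(s1)
    return run


RESOURCE = "i"  # the single resource assumed by this sketch


def replica(o, i):
    """Identify the replica of object o at resource i."""
    return (o, i)


# Basic operations; the state maps replica -> content.
def read(r):
    return lambda s: (s[r], s)


def write(c):
    def to(r):
        return lambda s: (None, {**s, r: c})
    return to


# Policy enforcement points: pre picks the replica, post just returns.
def pre_read(o):
    return unit(replica(o, RESOURCE))


def post_read(o):
    return unit


def pre_write(c, o):
    return unit(replica(o, RESOURCE))


def post_write(c, o):
    return unit


def oread(o):
    """oread o = pre o >>= read >>= post o"""
    return bind(bind(pre_read(o), read), post_read(o))


def owrite(c, o):
    """owrite c o = pre c o >>= write c >>= post c o"""
    return bind(bind(pre_write(c, o), write(c)), post_write(c, o))


_, after_write = owrite("hello", "obj")({})
content, after_read = oread("obj")(after_write)
```

Swapping in different pre and post functions changes the policy without touching `read`, `write`, or the shape of `oread` and `owrite`.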
Basic Semantics
- Only one resource i
  - oread
    - pre = return (replica o i): read replica of object o
    - post = return: return content of replica
  - owrite
    - pre = return (replica o i): write replica of object o
    - post = return: simply return
Auditing
- One resource i + audit log
  - oread
    - pre = awrite "read" o >> return (replica o i): audit + read replica of object o
    - post = return: return content of replica
  - owrite
    - pre = awrite "write" o >> return (replica o i): audit + write replica of object o
    - post = return: simply return
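A sketch of the auditing semantics, using the same state-monad encoding as the monad recap: only the pre hook changes, appending an audit entry before choosing the replica. The state layout (a dict with "replicas" and "log"), the 1-based reading of "ith latest" in `aread`, and all Python names are assumptions of this example.

```python
def unit(x):
    """return"""
    return lambda st: (x, st)


def bind(m, f):
    """m >>= f"""
    def run(st):
        value, st1 = m(st)
        return f(value)(st1)
    return run


def then(m, n):
    """m >> n"""
    return bind(m, lambda _: n)


RESOURCE = "i"  # the single resource


def replica(o, i):
    return (o, i)


def read(r):
    return lambda st: (st["replicas"][r], st)


def write(c):
    def to(r):
        return lambda st: (None, {**st, "replicas": {**st["replicas"], r: c}})
    return to


def awrite(action, target):
    """Append (action, target) to the audit log."""
    return lambda st: (None, {**st, "log": st["log"] + ((action, target),)})


def aread(i):
    """Read the ith latest audit entry (assumed 1-based: 1 is the newest)."""
    return lambda st: (st["log"][-i], st)


def oread(o):
    pre = then(awrite("read", o), unit(replica(o, RESOURCE)))
    return bind(bind(pre, read), unit)       # post = return


def owrite(c, o):
    pre = then(awrite("write", o), unit(replica(o, RESOURCE)))
    return bind(bind(pre, write(c)), unit)   # post = return


start = {"replicas": {}, "log": ()}
content, final = then(owrite("v1", "obj"), oread("obj"))(start)
latest, _ = aread(1)(final)
```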
Replication
- Multiple resources
  - oread
    - pre = return (replica o i): read arbitrary replica i of object o
    - post = return: return content of replica
  - owrite
    - pre = return (replica o i'): write arbitrary replica i' of object o
    - post = λx.(repl o >> return x): replicate and return
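A sketch of the replication semantics in the same state-monad encoding: the write goes to one replica, and the post hook runs repl to copy the newest replica to every resource. The slides do not say how "newest" is determined, so this example assumes a logical clock stored in the state; the resource names and the explicit resource argument `i` (standing for the arbitrary choice) are also assumptions.

```python
def unit(x):
    """return"""
    return lambda st: (x, st)


def bind(m, f):
    """m >>= f"""
    def run(st):
        value, st1 = m(st)
        return f(value)(st1)
    return run


def then(m, n):
    """m >> n"""
    return bind(m, lambda _: n)


RESOURCES = ("r1", "r2", "r3")  # hypothetical resource names


def replica(o, i):
    return (o, i)


def read(r):
    return lambda st: (st["replicas"][r][1], st)


def write(c):
    """Write c to a replica, stamping it with the next clock tick."""
    def to(r):
        def run(st):
            tick = st["clock"] + 1
            return None, {"clock": tick,
                          "replicas": {**st["replicas"], r: (tick, c)}}
        return run
    return to


def newest(o):
    """The replica of o carrying the highest clock tick."""
    def run(st):
        held = [r for r in st["replicas"] if r[0] == o]
        return max(held, key=lambda r: st["replicas"][r][0]), st
    return run


def repl(o):
    """Copy the newest replica of o to all resources."""
    def run(st):
        source, _ = newest(o)(st)
        stamped = st["replicas"][source]
        copies = {replica(o, i): stamped for i in RESOURCES}
        return None, {**st, "replicas": {**st["replicas"], **copies}}
    return run


def oread(o, i):
    return bind(bind(unit(replica(o, i)), read), unit)       # post = return


def owrite(c, o, i):
    post = lambda x: then(repl(o), unit(x))                  # replicate and return
    return bind(bind(unit(replica(o, i)), write(c)), post)


start = {"clock": 0, "replicas": {}}
program = then(owrite("v1", "obj", "r2"),
               then(owrite("v2", "obj", "r1"), oread("obj", "r3")))
content, final = program(start)
```

Because replication happens in the post hook of every write, a read from any resource sees the latest content in this sketch.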
Policy-based Data Management Concept Graph
[Figure: the full concept graph, annotated with counts. Recovered labels:]
- Collection Purpose (5 main types): Archive, Data grid, Collection, Digital Library, Processing Pipeline
- Features (HasFeature): Completeness, Correctness, Consensus, Consistency
- Policy (11 default), including Replication Policy, Checksum Policy, Quota Policy, Data Type Policy
- Property (7 default), including Integrity, Authenticity, Access control
- Procedure (11 default)
- Clients (50)
- Policy Enforcement Points (72)
- Micro-service (350), including msiGetUserACL, msiSetDataType, msiSetQuota, msiDataObjRepl, msiSysChksumDataObj
- Persistent State Information (338), including DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM
- Other node labels: Attribute, Periodic Assessment Criteria Policy, Workflow, Operation, Digital Object
- Edge labels: Defines, Has, HasFeature, Controls, Updates, Invokes, Chains, SubType, Isa