Building an Extensible File System via Policy-based Data Management


Building an Extensible File System via Policy-based Data Management Hao Xu Jewel H. Ward Mike Conway Arcot Rajasekar Reagan W. Moore (iRODS


slide-1
SLIDE 1

Building an Extensible File System via Policy-based Data Management

Hao Xu, Jewel H. Ward, Mike Conway, Arcot Rajasekar, Reagan W. Moore (iRODS Consortium, http://irods.org)

slide-2
SLIDE 2

File System

q Essential Functions: § Ingest, Store, Access q Modern File Systems are built on top of traditional file systems: § Google File System, Amazon S3, Hadoop Distributed File System § Driven by the need of a target application § Customized toward the target application domain

slide-3
SLIDE 3

Data Management Needs in Archive and Scientific Communities

q Discoverability q Complex Metadata q Workflow Management q Data Sharing q Provenance q Long Term Preservation q Technology Migration q Interoperability Between Infrastructures

slide-4
SLIDE 4

Challenges

Can generic infrastructure meet the needs of a diverse set of data management domains?

slide-5
SLIDE 5

Flexibility to Define a Wide Range of Application Domain Policies

  • User community → policies
  • File ingest operations:
      • Authentication
      • Authorization
      • Storage quota
      • Aggregation
      • Resource selection
      • Replication
      • File retention
      • Metadata

slide-6
SLIDE 6

Infrastructure Support For Non-standard Application Domain Operations

q Standard file system operations have robust support: § Metadata § Auditing § Access Control List q Non-standard operations that are implemented as a library do not have direct support from the file system. Examples: § Preservation – OAIS: SIP, AIP, DIP packages § Digital library – Provenance & discovery metadata § Processing pipeline – Format transformation

slide-7
SLIDE 7

Interoperability with Other Infrastructures

q Emergent scalability mechanisms: § Organization change

  • List à Tree à Graph (Internet) à Search

§ Data structure change

  • Files, tables, streams

§ Property enforcement expectations

  • Reproducible data-driven research

q Separation of how files are stored, accessed, and manipulated

slide-8
SLIDE 8

Policy-based Data Management

slide-9
SLIDE 9

Policy = Metadata + Procedure

q Purpose ¡ ¡ ¡Reason ¡a ¡collecIon ¡is ¡assembled ¡ q Proper)es ¡ ¡ ¡ALributes ¡needed ¡to ¡ensure ¡the ¡purpose ¡ q Policies ¡ ¡ ¡Controls ¡for ¡enforcing ¡desired ¡proper)es ¡ ¡ § Procedural ¡Policy: ¡Example: ¡When ¡an ¡object ¡is ¡ingested, ¡run ¡workflow ¡ § Asser?onal ¡Policy: ¡Example: ¡A ¡file ¡has ¡three ¡or ¡more ¡replicas ¡ q Metadata ¡ ¡Persistent ¡state ¡ § State ¡informa?on ¡(consistency ¡in ¡a ¡distributed ¡environment) ¡ § Generated ¡through ¡applica?on ¡of ¡procedures ¡ q Procedures ¡OperaIons ¡performed ¡within ¡the ¡system ¡ § What ¡to ¡run: ¡Func?ons ¡that ¡implement ¡the ¡policies ¡ § How ¡to ¡verify: ¡Valida?on ¡that ¡metadata ¡conforms ¡to ¡the ¡desired ¡ purpose ¡

slide-10
SLIDE 10

Policy-based Data Management

[Figure: concept graph. Nodes: Collection Purpose, Property, Policy, Procedure, Metadata, Periodic Assessment Criteria Policy. Edge labels: Defines, Controls, Updates, SubType.]

slide-11
SLIDE 11

Policy-based Data Management - Collection

[Figure: concept graph extended with the collection itself. Nodes: Collection Purpose, Attribute, Property, Policy, Procedure, Metadata, Digital Object, Periodic Assessment Criteria Policy. Edge labels: Defines, Has, Controls, Updates, SubType, Isa.]

slide-12
SLIDE 12

Policy-based Data Management – Collection Properties

[Figure: concept graph extended with properties. Nodes: Collection Purpose, Attribute, Property, Policy, Procedure, Metadata, Digital Object, Periodic Assessment Criteria Policy. Features (HasFeature): Completeness, Correctness, Consensus, Consistency. Properties (Isa): Integrity, Authenticity, Access control. Other edge labels: Defines, Has, Controls, Updates, SubType.]

slide-13
SLIDE 13

Policy-based Data Management – Collection Policies

[Figure: concept graph extended with policies. Nodes: Collection Purpose, Attribute, Property, Policy, Procedure, Metadata, Digital Object, Periodic Assessment Criteria Policy. Features (HasFeature): Completeness, Correctness, Consensus, Consistency. Properties (Isa): Integrity, Authenticity, Access control. Policies (Isa): Replication Policy, Checksum Policy, Quota Policy, Data Type Policy. Other edge labels: Defines, Has, Controls, Updates, SubType.]

slide-14
SLIDE 14

Policy-based Data Management – Collection Procedures

[Figure: concept graph extended with procedures. Nodes: Collection Purpose, Attribute, Property, Policy, Procedure, Workflow, Function, Operation, Metadata, Digital Object, Periodic Assessment Criteria Policy. Features (HasFeature): Completeness, Correctness, Consensus, Consistency. Properties (Isa): Integrity, Authenticity, Access control. Policies (Isa): Replication Policy, Checksum Policy, Quota Policy, Data Type Policy. Procedures (Isa): GetUserACL, SetDataType, SetQuota, DataObjRepl, SysChksumDataObj. Other edge labels: Defines, Has, Controls, Updates, SubType, Chains.]

slide-15
SLIDE 15

Policy-based Data Management – Persistent State

[Figure: concept graph extended with persistent state. Nodes: Collection Purpose, Attribute, Property, Policy, Procedure, Workflow, Function, Operation, Metadata, Digital Object, Periodic Assessment Criteria Policy. Features (HasFeature): Completeness, Correctness, Consensus, Consistency. Properties (Isa): Integrity, Authenticity, Access control. Policies (Isa): Replication Policy, Checksum Policy, Quota Policy, Data Type Policy. Procedures (Isa): GetUserACL, SetDataType, SetQuota, DataObjRepl, SysChksumDataObj. Persistent state (Isa): DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM. Other edge labels: Defines, Has, Controls, Updates, SubType, Chains.]

slide-16
SLIDE 16

Policy-based Data Management – Policy Enforcement

[Figure: concept graph extended with policy enforcement. Nodes: Collection Purpose, Attribute, Property, Policy, Procedure, Client Action, Policy Enforcement Point, Workflow, Function, Operation, Metadata, Digital Object, Periodic Assessment Criteria Policy. Features (HasFeature): Completeness, Correctness, Consensus, Consistency. Properties (Isa): Integrity, Authenticity, Access control. Policies (Isa): Replication Policy, Checksum Policy, Quota Policy, Data Type Policy. Procedures (Isa): GetUserACL, SetDataType, SetQuota, DataObjRepl, SysChksumDataObj. Persistent state (Isa): DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM. Other edge labels: Defines, Has, Controls, Updates, Invokes, SubType, Chains.]

slide-17
SLIDE 17

Example of Policy-based Data Management

slide-18
SLIDE 18

Policy-based Infrastructure

integrated Rule Oriented Data System

  • Biology
      • Cognitive Science: Temporal Dynamics of Learning Center
      • Human genome: Broad Institute, Wellcome Trust Sanger Institute, NGS
      • Medicine: Sick Kids Hospital
      • Neuroscience: International Neuroinformatics Coordinating Facility
      • Plant genome: the iPlant Collaborative
      • Phylogenetics: Phylogenetics at CC IN2P3
  • Computer Science
      • Network research: GENI experimental network
  • Earth Sciences
      • Atmospheric science: NASA Langley Atmospheric Sciences Center
      • Climate: NOAA National Climatic Data Center; NASA Center for Climate Simulations
      • Ecology: CEED Caveat Emptor Ecological Data
      • Hydrology: Institute for the Environment, UNC-CH; Hydroshare
      • Oceanography: Ocean Observatories Initiative
      • Seismology: Southern California Earthquake Center
  • Engineering
      • Education repository: CIBER-U
  • Physics
      • Astrophysics: Auger supernova search
      • Cosmic Ray: AMS experiment on the International Space Station
      • Dark Matter Physics: Edelweiss II
      • High Energy Physics: BaBar / Stanford Linear Accelerator
      • Neutrino Physics: T2K and dChooz neutrino experiments
      • Optical Astronomy: National Optical Astronomy Observatory
      • Particle Physics: Indra multi-detector collaboration at IN2P3
      • Quantum Chromodynamics: IN2P3
      • Radio Astronomy: Cyber Square Kilometer Array, TREND, BAOradio
  • Social Science: Odum, TerraPop

slide-19
SLIDE 19

Policy Applications

q Pre-process policy § Applied before an operation is done q Operation § May be policy controlled q Post-process policy § Applied after the operation is done q Are these sufficient to handle the wide diversity of data management applications? q Does this minimize the number of required

  • perations?
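The pre-process / operation / post-process pattern can be sketched as a wrapper around an operation. This is a minimal illustration, not iRODS code: `with_policies`, `put`, `enforce_quota`, `record_checksum`, and the 1 KiB quota are all hypothetical names and values chosen for the example.

```python
import hashlib

QUOTA_BYTES = 1024   # hypothetical per-file quota
storage = {}         # file name -> bytes (stands in for a storage resource)
checksums = {}       # file name -> SHA-256 hex digest (stands in for metadata)

def with_policies(operation, pre=None, post=None):
    """Wrap an operation with an optional pre-process and post-process policy."""
    def guarded(*args):
        if pre is not None:
            pre(*args)                # may raise to veto the operation
        result = operation(*args)
        if post is not None:
            post(result, *args)       # runs only if the operation succeeded
        return result
    return guarded

def put(name, data):
    """The operation itself: store the bytes under the given name."""
    storage[name] = data
    return name

def enforce_quota(name, data):
    """Pre-process policy: reject files larger than the quota."""
    if len(data) > QUOTA_BYTES:
        raise PermissionError(f"{name} exceeds quota of {QUOTA_BYTES} bytes")

def record_checksum(result, name, data):
    """Post-process policy: record a checksum as metadata."""
    checksums[name] = hashlib.sha256(data).hexdigest()

guarded_put = with_policies(put, pre=enforce_quota, post=record_checksum)
```

Calling `guarded_put("a.txt", b"hello")` stores the file and records its checksum, while an oversized payload is rejected before `put` runs.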
slide-20
SLIDE 20

Policy (Workflow) in Hydrology

[Figure: RHESSys workflow diagram. Box labels: Choose gauge or outlet (HIS); Extract drainage area (NHDPlus); Digital Elevation Model (DEM); Slope; Aspect; Streams (NHD); Roads (DOT); Strata; Hillslope; Patch; Basin; Stream network; Nested watershed structure; Land Use; Leaf Area Index; Phenology; Soil Data; NLCD (EPA); Landsat TM; MODIS; USDA; Soil and vegetation parameter files; Flowtable; Worldfile; RHESSys.]

RHESSys workflow to develop a nested watershed parameter file (worldfile) containing a nested ecogeomorphic object framework and the full initial system state. For each box, create a micro-service to automate the task, and chain the micro-services into a workflow.

slide-21
SLIDE 21

Policies in Software Defined Networking

Control selection of network paths

[Figure: architecture diagram. Labels: Rule Engine, GraphDB, Data Policies, Network Policies, OF Controller, iRODS Server (three instances), iCAT.]

slide-22
SLIDE 22

Policy in Data Storage Aggregation / Caching / Replication

Queen Mary University of London

Source: Di Lodovico et al.

slide-23
SLIDE 23

Indexing Policies

[Figure: indexing architecture. Labels: iRODS, Data, Metadata, Message Passing (AMQP), DataBook, Rules, VIVO, VIVO Search UI, Indexing Framework, External Index, Indexing Service, OSGi, Indexer.]

Index: text, metadata, events

slide-24
SLIDE 24

Policies in Digital Libraries

q SILS LifeTime Library § Student collections range from 2 GBytes to 150 Gbytes § Number of files from 2000 to 12,000 q Library management Policies § Replication, Checksums, Versioning, Strict access controls, Quotas, Metadata catalog replication, Installation environment archiving q Ingestion Policies § Automated synchronization of student directory with LifeTime Library § Automated loading of MP3 metadata

slide-25
SLIDE 25

Policies in Archives

slide-26
SLIDE 26

Formal Aspects of Policy-based Data Management

slide-27
SLIDE 27

Domain Model

q Entities § Data Object, Replica, Collection, User, Resource, Rules, Metadata, Access q Relations § (Collection) contains (Data Object); (Resource) stores (Replica); (Replica) replicates (Data Object); (User) owns (Data Object); (User) is granted (Access); (Access) is granted on (Data Object) q Operations § Get, put, replicate, etc.

slide-28
SLIDE 28

Policy

q A policy is implemented as a set of procedures defined in terms of the Domain Model § Assertion about state: “A file has three or more replicas”

  • A procedure to maintain state consistency:

replication rule acPostProcForPut

  • (Hardware, human errors) A procedure to check

state consistency: periodic integrity check
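The pairing of an assertion with a maintaining procedure and a checking procedure can be sketched as follows. This is not the iRODS rule language: `post_proc_for_put` only mirrors the role the slide gives to `acPostProcForPut`, and the catalog (a mapping from object name to the set of resources holding a replica) is a hypothetical stand-in for persistent state.

```python
MIN_REPLICAS = 3  # the assertion: "A file has three or more replicas"

def has_enough_replicas(catalog, obj):
    """Assertion about state, evaluated against the catalog."""
    return len(catalog.get(obj, set())) >= MIN_REPLICAS

def post_proc_for_put(catalog, obj, resources):
    """Maintain consistency: after a put, replicate until the assertion holds.

    Returns True if the assertion holds afterwards, False if there were
    too few resources to satisfy it.
    """
    holders = catalog.setdefault(obj, set())
    for resource in resources:
        if len(holders) >= MIN_REPLICAS:
            break
        holders.add(resource)
    return has_enough_replicas(catalog, obj)

def periodic_integrity_check(catalog):
    """Check consistency: list objects that violate the assertion,
    for example after a hardware failure or an operator error."""
    return sorted(obj for obj in catalog if not has_enough_replicas(catalog, obj))
```

The maintaining procedure runs at a policy enforcement point, while the check runs on a schedule and catches drift that no enforcement point observed.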

slide-29
SLIDE 29

Example of Formalism Using Monad

q Monad Recap: § A monad represents computations (possibly with side effects, in

  • ur example, assume only state change)

q Monad Constructors

§ return:

trivial computation that returns a value

§ x >> y:

do x then y

§ x >>= y:

feed return value of x into y q Monad Laws

§ return x >>= f = f x (Left Id) § f >>= return = f (Right Id) § f >>= g >>= h = f >>= (λx.g x >>= h) (Associative)

  • A B C => A (B C)
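The three constructors and laws can be tried out with a small state monad, where a computation is modelled as a function from a state to a (value, new state) pair. The names `unit`, `bind`, and `then` are stand-ins for `return`, `>>=`, and `>>`; the helpers `add` and `double` are made up for the example. Checking the laws on one starting state illustrates them but does not prove them.

```python
def unit(x):
    """The slide's `return`: yield x and leave the state unchanged."""
    return lambda state: (x, state)

def bind(m, f):
    """The slide's `m >>= f`: run m, then feed its value into f."""
    def run(state):
        value, next_state = m(state)
        return f(value)(next_state)
    return run

def then(m, n):
    """The slide's `m >> n`: run m, discard its value, then run n."""
    return bind(m, lambda _ignored: n)

# Illustrative stateful computations over an integer state.
def add(n):
    """Add n to the state and return the new state as the value."""
    return lambda state: (state + n, state + n)

def double(x):
    """Return 2 * x and bump the state by one."""
    return lambda state: (2 * x, state + 1)

def check_laws(start=0):
    """Evaluate both sides of each monad law on one starting state."""
    f = add(3)
    left_identity = bind(unit(5), double)(start) == double(5)(start)
    right_identity = bind(f, unit)(start) == f(start)
    associativity = (
        bind(bind(f, double), add)(start)
        == bind(f, lambda x: bind(double(x), add))(start)
    )
    return left_identity, right_identity, associativity
```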
slide-30
SLIDE 30

Domain Model

q Entities:

§ DataObject, Content, Replica, Resource

q Relations:

§ replica: r = replica(o,i)

r is the replica of o at resource i

§ replicas: r ∈ replicas(o)

r is a replica of o

slide-31
SLIDE 31

Domain Model

q Basic Operations: § read : read r read content of replica r § write : write c r write content c to replica r § aread : aread i read ith latest audit log entry § awrite : awrite s r append to audit log (s,r) § repl : repl o replicate o to all resources § newest : newest o the newest replica of object o

slide-32
SLIDE 32

Complex Operations and Policy Enforcement Points

q Complex Operations: § oread : oread o read the content of object o § owrite : owrite c o write content c to object o q Defined in terms of Basic Operations + PEPs

§ op args = pre args >>= op’ args >>= post args

q We define oread and owrite:

§ oread o = pre o >>= read >>= post o § owrite o = pre c o >>= write c >>= post c o
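The composition of a basic operation with its pre and post enforcement points can be sketched with a small state monad. Everything here is illustrative: the state is a dictionary from replica to content, `unit` and `bind` stand in for `return` and `>>=`, and the instantiation at the bottom assumes the simplest policy, in which `pre` resolves the object to its replica on one hypothetical resource `disk1` and `post` passes the result through.

```python
def unit(x):
    """`return`: yield x and leave the state unchanged."""
    return lambda state: (x, state)

def bind(m, f):
    """`m >>= f`: run m, then feed its value into f."""
    def run(state):
        value, next_state = m(state)
        return f(value)(next_state)
    return run

def replica(o, i):
    """The replica of object o at resource i, modelled as a pair."""
    return (o, i)

def read(r):
    """Basic operation: content of replica r (state is unchanged)."""
    return lambda state: (state.get(r), state)

def write(c):
    """Basic operation: write c takes a replica r and stores c in it."""
    def to_replica(r):
        def run(state):
            new_state = dict(state)   # copy so earlier states stay intact
            new_state[r] = c
            return (None, new_state)
        return run
    return to_replica

def make_oread(pre, post):
    """oread o = pre o >>= read >>= post o"""
    return lambda o: bind(bind(pre(o), read), post(o))

def make_owrite(pre, post):
    """owrite c o = pre c o >>= write c >>= post c o"""
    return lambda c, o: bind(bind(pre(c, o), write(c)), post(c, o))

RESOURCE = "disk1"  # hypothetical single resource

oread = make_oread(pre=lambda o: unit(replica(o, RESOURCE)),
                   post=lambda o: unit)
owrite = make_owrite(pre=lambda c, o: unit(replica(o, RESOURCE)),
                     post=lambda c, o: unit)
```

Swapping in different `pre` and `post` functions changes the policy without touching `read` or `write`, which is the point of separating the enforcement points from the basic operations.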

slide-33
SLIDE 33

Basic Semantics

q Only one resource i § oread

  • pre = return (replica o i)

read replica of object o

  • post = return

return content of replica

§ owrite

  • pre = return (replica o i)

write replica of object o

  • post = return

simply return

slide-34
SLIDE 34

Auditing

q One resource i + audit log § oread

  • pre = awrite “read” o >> return (replica o i)

audit + read replica of object o

  • post = return

return content of replica

§ owrite

  • pre = awrite “write” o >> return (replica o i)

audit + write replica of object o

  • post = return

simply return
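The auditing policy differs from the basic one only in `pre`, which appends an audit entry before resolving the replica. The sketch below assumes a state made of a `data` dictionary and an `audit` list, and a hypothetical single resource `disk1`; `unit`, `bind`, and `then` stand in for `return`, `>>=`, and `>>`.

```python
def unit(x):
    """`return`: yield x and leave the state unchanged."""
    return lambda state: (x, state)

def bind(m, f):
    """`m >>= f`: run m, then feed its value into f."""
    def run(state):
        value, next_state = m(state)
        return f(value)(next_state)
    return run

def then(m, n):
    """`m >> n`: run m, discard its value, then run n."""
    return bind(m, lambda _ignored: n)

RESOURCE = "disk1"                     # hypothetical single resource
EMPTY = {"data": {}, "audit": []}      # initial state

def replica(o, i):
    return (o, i)

def read(r):
    """Basic operation: content of replica r."""
    return lambda s: (s["data"].get(r), s)

def write(c):
    """Basic operation: write c takes a replica r and stores c in it."""
    return lambda r: (lambda s: (None, {**s, "data": {**s["data"], r: c}}))

def awrite(action, o):
    """Basic operation: append (action, o) to the audit log."""
    return lambda s: (None, {**s, "audit": s["audit"] + [(action, o)]})

def oread(o):
    """pre = awrite "read" o >> return (replica o i); post = return"""
    pre = then(awrite("read", o), unit(replica(o, RESOURCE)))
    return bind(bind(pre, read), unit)

def owrite(c, o):
    """pre = awrite "write" o >> return (replica o i); post = return"""
    pre = then(awrite("write", o), unit(replica(o, RESOURCE)))
    return bind(bind(pre, write(c)), unit)
```

Running `owrite` and then `oread` leaves two entries in the audit log, in the order the operations were performed.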

slide-35
SLIDE 35

Replication

q Multiples resources § oread

  • pre = return (replica o i)

read arbitrary replica i of object o

  • post = return

return content of replica

§ owrite

  • pre = return (replica o i’)

write arbitrary replica i’ of object o

  • post = λx.(repl o >> return x)

replicate and return
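With several resources, the write policy moves its work into `post`, which replicates the object before returning. In this sketch the "arbitrary" resource is passed in explicitly as `i`, the resource names are hypothetical, and the state tracks which replica was written last so that `repl` knows what to copy; none of these details come from the slides.

```python
def unit(x):
    """`return`: yield x and leave the state unchanged."""
    return lambda state: (x, state)

def bind(m, f):
    """`m >>= f`: run m, then feed its value into f."""
    def run(state):
        value, next_state = m(state)
        return f(value)(next_state)
    return run

def then(m, n):
    """`m >> n`: run m, discard its value, then run n."""
    return bind(m, lambda _ignored: n)

RESOURCES = ("a", "b", "c")            # hypothetical resource names
EMPTY = {"data": {}, "newest": {}}     # replica contents, newest replica per object

def replica(o, i):
    return (o, i)

def read(r):
    """Basic operation: content of replica r."""
    return lambda s: (s["data"].get(r), s)

def write(c):
    """Basic operation: store c in replica r and remember r as the newest."""
    def to_replica(r):
        def run(s):
            return (None, {"data": {**s["data"], r: c},
                           "newest": {**s["newest"], r[0]: r}})
        return run
    return to_replica

def repl(o):
    """Basic operation: copy the newest replica of o to all resources."""
    def run(s):
        content = s["data"][s["newest"][o]]
        data = dict(s["data"])
        for i in RESOURCES:
            data[replica(o, i)] = content
        return (None, {**s, "data": data})
    return run

def oread(o, i):
    """pre = return (replica o i); post = return"""
    return bind(bind(unit(replica(o, i)), read), unit)

def owrite(c, o, i):
    """pre = return (replica o i'); post = λx.(repl o >> return x)"""
    post = lambda x: then(repl(o), unit(x))
    return bind(bind(unit(replica(o, i)), write(c)), post)
```

Because replication happens in `post`, a read from any resource after a write sees the written content, whichever resource the write went to.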

slide-36
SLIDE 36

Policy-based Data Management Concept Graph

[Figure: full concept graph with counts. Nodes: Collection Purpose (5 main types), Attribute, Property (7 default), Policy (11 default), Procedure (11 default), Clients (50), Policy Enforcement Points (72), Periodic Assessment Criteria Policy, Workflow, Micro-service (350), Operation, Persistent State Information (338), Digital Object. Purpose subtypes (SubType): Archive, Data grid, Collection, Digital Library, Processing Pipeline. Features (HasFeature): Completeness, Correctness, Consensus, Consistency. Properties (Isa): Integrity, Authenticity, Access control. Policies (Isa): Replication Policy, Checksum Policy, Quota Policy, Data Type Policy. Micro-services (Isa): msiGetUserACL, msiSetDataType, msiSetQuota, msiDataObjRepl, msiSysChksumDataObj. Persistent state (Isa): DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM. Other edge labels: Defines, Has, Controls, Updates, Invokes, Chains.]

slide-37
SLIDE 37

iRODS Distributed Data Management

slide-38
SLIDE 38

iRODS data grid

  • Integrated Rule Oriented Data System
  • Open source software
  • http://irods.org
  • Supported by the iRODS Consortium