Management (CDLM) for Petascale Projects Arun Jagatheesan - - PowerPoint PPT Presentation

SLIDE 1

Collaborative Data Life-cycle Management (CDLM) for Petascale Projects

Arun Jagatheesan iRODS.org, DICE, SDSC/UCSD

SLIDE 2

Agenda

  • Introductions
  • LSST as use case
  • CDLM
  • Attributes of CDLM
SLIDE 3

History behind the story

  • MDAS (Massive Data Analysis System)
  • Support data-intensive applications that manipulate very large data sets by building upon object-relational database technology and archival storage technology
  • 1995 by DARPA
  • SDSC SRB (Storage Resource Broker)
  • iRODS
  • Flexible license for our community
  • Flexible rules for users
  • Flexible data management
SLIDE 4

My role in iRODS Community

  • Large-scale usage and adoption of iRODS
  • Research and Analysis of large-scale use-cases
  • Design requirements for large-scale users
  • Consult on iRODS-based storage infrastructure
  • Community Growth
  • Tutorials, dissemination
  • iROD-Chat (2006), SRB-Chat (2003)
  • Academic and Industrial users
SLIDE 5

Large Synoptic Survey Telescope (LSST)

  • Survey entire sky every 3 nights
  • Dark Energy, Dark Matter, Near-Earth Asteroids, and more
  • World’s largest digital camera (3 billion pixels)
  • Images 3000 times wider than Hubble
  • Data from Chile to US and rest of the world
  • 15 TB/night; hundred(s) of petabytes in total
  • www.youtube.com/watch?v=LtMJ_WwvBb8
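
To put the nightly rate in perspective, here is a back-of-envelope sketch of cumulative raw volume. Only the 15 TB/night figure comes from the slide; the 365 observing nights per year and the 10-year duration are assumptions chosen for illustration, and 1 PB is taken as 1000 TB.

```python
# Back-of-envelope estimate of raw data volume from the nightly rate.
TB_PER_NIGHT = 15        # from the slide
NIGHTS_PER_YEAR = 365    # assumption: observing every night (an upper bound)
SURVEY_YEARS = 10        # assumption: hypothetical survey duration


def raw_volume_pb(tb_per_night, nights_per_year, years):
    """Return cumulative raw volume in petabytes (1 PB = 1000 TB)."""
    return tb_per_night * nights_per_year * years / 1000


per_year_pb = raw_volume_pb(TB_PER_NIGHT, NIGHTS_PER_YEAR, 1)
total_pb = raw_volume_pb(TB_PER_NIGHT, NIGHTS_PER_YEAR, SURVEY_YEARS)

print(f"{per_year_pb:.3f} PB/year raw")
print(f"{total_pb:.2f} PB raw over {SURVEY_YEARS} years")
```

Under these assumptions the raw images alone come to roughly 5.5 PB per year. The larger total quoted on the slide would then have to include derived products, catalogs, and copies held at partner sites, which is an inference rather than something the slide states.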
SLIDE 6

Data Products

  • Releases
  • Cataloged database
  • Provenance Info
  • Metadata
  • Processed Data Sets
  • Raw Images

SLIDE 7

LSST Data Infrastructure Layout

SLIDE 8

LSST Data Train and iRODS

[Figure labels: /file1..10.fits, /catalog1.db, /nobel.event; UK or IN2P3]

SLIDE 9

LSST CDLM Problem Statement

  • LSST data-lifecycle management infrastructure for:
  • Performance oriented data storage sub-systems
  • Capacity oriented data storage sub-systems
  • Data (usage oriented) distribution networks
  • [Provenance and archive storage systems]
  • Confluence of three major storage dimensions
  • HPC data processing (pipelines to produce our data)
  • Datacenter sharing (data centers that host our data)
  • Data delivery and distribution (usage of our data)
SLIDE 10

CDLM

  • Collaborative Data Lifecycle Management
  • Multiplexing of a single data life-cycle amongst more than one autonomous partner
  • Attributes of the data life-cycle are shared
  • Varying levels of autonomy and inter-dependence

SLIDE 11

Multiplexing a Data Life-cycle

  • Data Creation (Raw data)
  • Data Processing (Derived data)
  • Data Analysis (Data warehouse, ..)
  • Data Namespace
  • Data Dissemination
  • Data Provenance
  • Data Archival
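
One way to picture this multiplexing is a table that assigns each life-cycle stage to one or more partners. The sketch below uses the seven stages from the list above; the partner names (`observatory`, `hpc_center`, `university`, `archive_center`) and the assignment itself are hypothetical and only illustrate the idea.

```python
from enum import Enum


class Stage(Enum):
    CREATION = "Data Creation"
    PROCESSING = "Data Processing"
    ANALYSIS = "Data Analysis"
    NAMESPACE = "Data Namespace"
    DISSEMINATION = "Data Dissemination"
    PROVENANCE = "Data Provenance"
    ARCHIVAL = "Data Archival"


# Hypothetical assignment of stages to autonomous partners.
ASSIGNMENT = {
    Stage.CREATION: ["observatory"],
    Stage.PROCESSING: ["hpc_center"],
    Stage.ANALYSIS: ["hpc_center", "university"],
    Stage.NAMESPACE: ["hpc_center", "archive_center"],
    Stage.DISSEMINATION: ["university"],
    Stage.PROVENANCE: ["archive_center"],
    Stage.ARCHIVAL: ["archive_center"],
}


def stages_for(partner, assignment):
    """List the stages a partner takes part in, in life-cycle order."""
    return [s for s in Stage if partner in assignment.get(s, [])]


def shared_stages(assignment):
    """List the stages handled by more than one partner."""
    return [s for s in Stage if len(assignment.get(s, [])) > 1]


print([s.value for s in stages_for("archive_center", ASSIGNMENT)])
print([s.value for s in shared_stages(ASSIGNMENT)])
```

Stages returned by `shared_stages` are the points where partners depend on each other; the remaining stages are where each partner can stay autonomous.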
SLIDE 12

Levels of Collaboration

  • Collaboration on the data life-cycle does not necessarily mean collaboration between businesses
  • Some types of CDLM
  • Symbiotic - All partner businesses benefit from CDLM
  • Neutral - No effect on businesses due to CDLM
  • Competitive - Partners of CDLM are actually competitors in the resulting business process (forced to have a common platform to compete)
  • Hybrid - Multiple or transient partner relationships
SLIDE 13

Autonomy and inter-dependence must be at the right levels for CDLM to work

SLIDE 14

LSST Data Layout

SLIDE 15

ALMA data flow

SLIDE 16

LSST SC-2008 Prototype

SLIDE 17

CDLM Infrastructure Design

  • Requirements, Expectations and Performance Management
  • Minimize dependencies (without affecting cost)
  • Reduce individual autonomy into hierarchical groups (that can remain autonomous)
  • Hierarchical rules and community rules
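
The combination of hierarchical groups and community rules can be sketched as a lookup that walks up a group tree: a group's own rule wins, and anything it does not define falls back to its parent and finally to the community level. This is a plain-Python illustration, not iRODS rule syntax; the group names, rule keys, and values are all hypothetical.

```python
class Group:
    """A node in a hierarchy of groups, each with its own optional rules."""

    def __init__(self, name, parent=None, rules=None):
        self.name = name
        self.parent = parent
        self.rules = dict(rules or {})

    def resolve(self, key):
        """Walk from this group up to the root; the nearest definition wins."""
        node = self
        while node is not None:
            if key in node.rules:
                return node.rules[key], node.name
            node = node.parent
        raise KeyError(key)


# Hypothetical hierarchy: community-wide defaults, a site, and a team within it.
community = Group("community", rules={"replicas": 2, "retention_years": 10})
site = Group("site_a", parent=community, rules={"replicas": 3})
team = Group("pipeline_team", parent=site)

print(team.resolve("replicas"))         # (3, 'site_a')
print(team.resolve("retention_years"))  # (10, 'community')
```

The site keeps its autonomy by overriding `replicas`, while the team inherits everything, so only the rules that differ need to be stated at each level.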
SLIDE 18

iRODS enabling CDLM

  • Global Namespace
  • Resource allocation and service levels as policies/rules
  • Hierarchical rules and access controls
  • Highly Flexible System
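
The global namespace idea can be illustrated with a small mapping from one logical path to the physical copies held at different sites. This is a conceptual sketch in plain Python and does not use any iRODS API; the class, the site name `site_a`, and the physical locations are hypothetical, while `/catalog1.db` and IN2P3 are borrowed from the earlier data-train slide.

```python
class GlobalNamespace:
    """Maps one logical path to physical replicas held at several sites."""

    def __init__(self):
        self._replicas = {}  # logical path -> {site name: physical location}

    def register(self, logical_path, site, physical_location):
        """Record that a site holds a copy of the logical path."""
        self._replicas.setdefault(logical_path, {})[site] = physical_location

    def locate(self, logical_path, preferred_sites=()):
        """Return (site, physical location), honoring the caller's site order."""
        replicas = self._replicas[logical_path]
        for site in preferred_sites:
            if site in replicas:
                return site, replicas[site]
        # Fall back to a deterministic choice when no preferred site has a copy.
        site = min(replicas)
        return site, replicas[site]


ns = GlobalNamespace()
ns.register("/catalog1.db", "site_a", "/tape/0042/catalog1.db")
ns.register("/catalog1.db", "IN2P3", "/disk/lsst/catalog1.db")

print(ns.locate("/catalog1.db", preferred_sites=["site_a"]))
print(ns.locate("/catalog1.db"))
```

Users refer only to `/catalog1.db`; which copy serves the request is a decision that could be driven by the policies and service levels listed above.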
SLIDE 19

Similar projects? Let’s talk

  • The power of the community
  • Not necessarily “large” scale
  • Symbiotic
  • arun@diceresearch.org