The Data Grid: An Architecture for Distributed Management of Large Scientific Data Sets



SLIDE 1

The Data Grid: An Architecture for Distributed Management of Large Scientific Data Sets

Ann Chervenak, Carl Kesselman
Information Sciences Institute, University of Southern California

Ian Foster, Chuck Salisbury, Steve Tuecke
Argonne National Laboratory

SLIDE 2

Overview

  • Target Environment
  • Design Principles
  • Grid Services

◆ Storage systems
◆ Metadata
◆ Management of replicated files

  • Implementation

SLIDE 3

Data Grid Environment

  • Scientific applications

◆ Global climate change, High energy physics

  • Computationally demanding
  • Large data sets and archives

◆ Terabytes, eventually petabytes
◆ Raw and derived data

  • Geographically dispersed users and resources

◆ Data replication for enhanced performance

  • Broad range of capabilities and resources

◆ Networks, systems, storage, and applications

SLIDE 4

Building a Data Grid: Building Blocks

[Diagram: building blocks of a Data Grid]

Components: ingest/catalog service, data mover service, catalog manager, query manager

Resources:

  • Archive: 1-10 PB, archival, GB/s network, QoS
  • Cache: 10-100 TB, nonarchival, GB/s network, QoS
  • Analysis computer: 1-10 TF/s, GB/s network, on-demand, QoS
  • Network: 1-10 GB/s, striped, secure, QoS

Services: security, resource discovery, measurement, resource management, accounting/payment, fault detection, communications

Existing systems:

  • HPSS: archival, multi-PB, access > 100 MB/s?, no QoS
  • DPSS: fast disk cache, no QoS
  • Computers: preliminary QoS work (e.g., DSRT); XFS: QoS for disk
  • ESnet, MREN, NTON: QoS, e.g., diffserv
  • MCAT, SRB, pftp, GASS, STACS, Condor, others
  • Netlogger; Globus toolkit (security, information, fault detection, resource management, communication, etc.); Autopilot; MPI-IO; Akenti; Condor; ...

SLIDE 5

Data Grid Objectives

  • Integrate heterogeneous data archives into a distributed data management “grid”
  • Identify services for high performance, distributed, data intensive computing

SLIDE 6

Design Principles

  • Mechanism Neutrality

◆ Support heterogeneous systems

  • Policy Neutrality

◆ User / local decision making and control

  • Compatibility with Computational Grid

◆ Integration of storage and computation

  • Uniformity of Information Infrastructure

◆ Data model and interface for metadata

SLIDE 7

Data Grid Services

[Diagram: Data Grid service layers]

High-level services: Replica Selection, Replica Management, other high-level services ...

Core services: Storage System (DPSS, HPSS), Metadata Repository (LDAP, MCAT), Resource Management (LSF, DIFFSERV, ...), other core services ...

SLIDE 8

Data Access Service

  • Uniform access to heterogeneous systems

◆ Remote: e.g., DPSS, HTTP, FTP, HPSS
◆ Local: e.g., UNIX

  • High performance data movement over WANs

◆ Third party transfer

  • Data extraction and filtering functions
  • Access to data is subject to global and local policy constraints
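
The idea of uniform access can be sketched in a few lines of Python. This is only an illustration, not the Data Grid storage API: the names `StorageSystem`, `LocalUnixStorage`, `InMemoryStorage`, `transfer`, and the `policy` callback are all hypothetical, and the "remote" system is an in-memory stand-in that performs no network I/O.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Callable, Dict


class StorageSystem(ABC):
    """Hypothetical uniform interface over heterogeneous storage systems."""

    @abstractmethod
    def read(self, name: str) -> bytes:
        """Return the full contents of the named file."""

    @abstractmethod
    def write(self, name: str, data: bytes) -> None:
        """Store data under the given name."""


class LocalUnixStorage(StorageSystem):
    """Local case: files under a directory on a UNIX file system."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def read(self, name: str) -> bytes:
        return (self.root / name).read_bytes()

    def write(self, name: str, data: bytes) -> None:
        (self.root / name).write_bytes(data)


class InMemoryStorage(StorageSystem):
    """Stand-in for a remote system (assumption: no real network I/O)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def read(self, name: str) -> bytes:
        return self._blobs[name]

    def write(self, name: str, data: bytes) -> None:
        self._blobs[name] = data


def transfer(src: StorageSystem, dst: StorageSystem, name: str,
             policy: Callable[[str], bool] = lambda name: True) -> None:
    """Copy one file between any two systems, subject to a policy check."""
    if not policy(name):
        raise PermissionError(f"policy denies access to {name!r}")
    dst.write(name, src.read(name))
```

Because `transfer` only depends on the abstract interface, the same call works for any pair of back ends; the policy callback is a placeholder for the global and local constraints mentioned above.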

SLIDE 9

Metadata Access Service

  • Uniform treatment for all metadata

◆ Grid components
◆ Application-related metadata
◆ Storage system characteristics
◆ Relationships between data items

  • Uniform access to metadata

◆ LDAP protocol

  • Uniform storage structure

◆ LDAP hierarchical structure for distribution, replication, referral services
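
To make the hierarchical structure concrete, the sketch below models a directory as a mapping from distinguished names to attributes and implements a subtree search in plain Python. It does not use an LDAP library or the LDAP wire protocol; the entry names, the attribute names (`objectclass`, `kind`), and the `search` helper are hypothetical, and distinguished names are compared by simple case-insensitive suffix matching.

```python
from typing import Dict, List

# Hypothetical directory contents: distinguished name -> attributes.
DIRECTORY: Dict[str, Dict[str, str]] = {
    "o=Grid": {"objectclass": "organization"},
    "ou=storage,o=Grid": {"objectclass": "organizationalunit"},
    "cn=hpss-archive,ou=storage,o=Grid": {
        "objectclass": "storagesystem", "kind": "archive"},
    "cn=dpss-cache,ou=storage,o=Grid": {
        "objectclass": "storagesystem", "kind": "cache"},
    "cn=co2-run,ou=collections,o=Grid": {"objectclass": "collection"},
}


def search(entries: Dict[str, Dict[str, str]], base_dn: str,
           **filters: str) -> List[str]:
    """Return the DNs at or below base_dn whose attributes match all filters."""
    base = base_dn.lower()
    suffix = "," + base
    hits = []
    for dn, attrs in entries.items():
        key = dn.lower()
        in_subtree = key == base or key.endswith(suffix)
        if in_subtree and all(attrs.get(k) == v for k, v in filters.items()):
            hits.append(dn)
    return sorted(hits)
```

For example, `search(DIRECTORY, "ou=storage,o=Grid", objectclass="storagesystem")` returns the two storage entries, while the same filter under `"ou=collections,o=Grid"` returns nothing. Storage system characteristics and application metadata share one lookup mechanism, which is the point of the uniform treatment.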

SLIDE 10

Replica Management

  • Collections contain related files
  • Logical files describe replicated physical files
  • Services for managing replicated file instances

◆ Create / delete
◆ Schedule / manage data transfer
◆ Register in the replica catalog
◆ Metadata display
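
The relationship between collections, logical files, and physical replicas can be sketched as a small in-memory catalog. This is an assumed data model for illustration only, not the catalog schema used by the project; the class name `ReplicaCatalog`, its methods, and the example URLs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ReplicaCatalog:
    """Toy replica catalog: collections -> logical files -> physical copies."""

    # collection name -> logical file names it contains
    collections: Dict[str, Set[str]] = field(default_factory=dict)
    # logical file name -> URLs of its physical replicas
    locations: Dict[str, Set[str]] = field(default_factory=dict)

    def register(self, collection: str, logical_name: str, url: str) -> None:
        """Record that a physical copy of a logical file exists at url."""
        self.collections.setdefault(collection, set()).add(logical_name)
        self.locations.setdefault(logical_name, set()).add(url)

    def unregister(self, logical_name: str, url: str) -> None:
        """Forget one physical copy; the logical file itself stays listed."""
        self.locations.get(logical_name, set()).discard(url)

    def replicas(self, logical_name: str) -> List[str]:
        """Return all known physical locations of a logical file, sorted."""
        return sorted(self.locations.get(logical_name, set()))
```

A client would call `register` after a transfer completes and `replicas` before choosing where to read from, so the catalog stays separate from the transfer mechanism itself.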

SLIDE 11

Replica Selection

  • User can optimize access characteristics

◆ Grid structure and performance
◆ Storage system and file characteristics

  • Intelligent scheduling to determine appropriate replica, site for (re)computation, etc.
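
One simple way to picture replica selection is to rank the candidate copies by estimated transfer time. The cost model below (startup latency plus size divided by bandwidth) is an assumption made for this sketch, not the selection policy described in the talk; `Replica`, `estimated_transfer_time`, and `select_replica` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Replica:
    """A physical copy plus assumed performance figures for reaching it."""
    url: str
    bandwidth_mb_s: float  # sustained transfer rate in MB/s
    latency_s: float       # startup delay in seconds (e.g., a tape mount)


def estimated_transfer_time(replica: Replica, size_mb: float) -> float:
    """Assumed cost model: startup latency plus size over bandwidth."""
    return replica.latency_s + size_mb / replica.bandwidth_mb_s


def select_replica(replicas: Sequence[Replica], size_mb: float) -> Replica:
    """Pick the replica with the smallest estimated transfer time."""
    if not replicas:
        raise ValueError("no replicas to choose from")
    return min(replicas, key=lambda r: estimated_transfer_time(r, size_mb))
```

Under this model, a slow-starting but high-bandwidth archive wins for very large files, while a low-latency disk cache wins for small ones, which is why file characteristics and storage system characteristics both feed into the choice.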

SLIDE 12

Climate Data Scenario

[Diagram: climate data scenario]

Example query: “How do midwest flood frequencies under 2xCO2 scenario compare with historical data?”

Resulting plan: “Access datasets A, B; run A -> meso -> hydro; compare result with B”

Components shown: simulation data archive, historical data archive, HPSS, DPSS, caches, resource manager, file access service, query manager, analysis engine (meso, hydro, compare)

SLIDE 13

Current Activity

  • Ongoing collaborations

◆ Climate
◆ High Energy Physics

  • Storage API for uniform access to data

◆ API specification document
◆ Prototype code for HTTP, FTP, DPSS

  • Replica management

◆ Replica catalog based on LDAP
◆ API and GUI tools for catalog access

  • Quality of Service implementation

SLIDE 14

Replica Management

SLIDE 15

Quality of Service

[Figure: Bulk transfer support in GARA. Plot of bandwidth (KB/s) against time, with series labeled background, foreground, and competitive.]

SLIDE 16

Planned Activity

  • Data Access

◆ Integrated quality of service, security
◆ Performance enhancements for networking

  • Performance guarantees for the Data Grid
  • Automatic operation of the Data Grid

◆ Agent technologies used for distributed data replication, selection, and analysis

  • Integrated CPU scheduling

◆ Server-side data reduction, affinity scheduling
