
The Data Grid: An Architecture for Distributed Management of Large Scientific Data Sets



  1. The Data Grid: An Architecture for Distributed Management of Large Scientific Data Sets
     Ann Chervenak, Ian Foster, Carl Kesselman, Chuck Salisbury, Steve Tuecke
     Information Sciences Institute, University of Southern California; Argonne National Laboratory

  2. Overview
     ● Target Environment
     ● Design Principles
     ● Grid Services
       ◆ Storage systems
       ◆ Metadata
       ◆ Management of replicated files
     ● Implementation

  3. Data Grid Environment
     ● Scientific applications
       ◆ Global climate change, high energy physics
     ● Computationally demanding
     ● Large data sets and archives
       ◆ Terabytes, eventually petabytes
       ◆ Raw and derived data
     ● Geographically dispersed users and resources
       ◆ Data replication for enhanced performance
     ● Broad range of capabilities and resources
       ◆ Networks, systems, storage, and applications

  4. Building a Data Grid: Building Blocks
     [Figure: layered building-block diagram; only the labels survive extraction, and the box boundaries are not recoverable]
     ● Data services: ingest, catalog, query, and data mover managers and services; STACS, pftp, MCAT, Condor, GASS, SRB, others
     ● Globus toolkit: security, information, fault detection, resource management, communication, etc.
     ● Other middleware labels: security, measurement, communications, resource discovery, accounting/payment, resource management, fault detection; MPI-IO, Netlogger, Condor, Akenti, Autopilot, ...
     ● Resources: HPSS (archival, multi-PB), DPSS (fast disk cache), computers (on-demand), networks (ESnet, MREN, NTON; striped, GB/s)
     ● Figures quoted: 1-10 GB/s, 1-10 PB, 10-100 TB, 1-10 TF/s, access > 100 MB/s?
     ● QoS annotations: preliminary QoS work (e.g., DSRT); network QoS, e.g., diffserv; XFS: QoS for disk; archive: no QoS

  5. Data Grid Objectives
     ● Integrate heterogeneous data archives into a distributed data management “grid”
     ● Identify services for high-performance, distributed, data-intensive computing

  6. Design Principles
     ● Mechanism Neutrality
       ◆ Support heterogeneous systems
     ● Policy Neutrality
       ◆ User / local decision making and control
     ● Compatibility with Computational Grid
       ◆ Integration of storage and computation
     ● Uniformity of Information Infrastructure
       ◆ Data model and interface for metadata

  7. Data Grid Services
     [Figure: layered service diagram]
     ● High-level services: Replica Selection, Replica Management, other high-level services ...
     ● Core services: Storage System, Metadata Repository, Resource Management, other core services ...
     ● Underlying systems: DPSS, HPSS, LDAP, MCAT, LSF, DIFFSERV

  8. Data Access Service
     ● Uniform access to heterogeneous systems
       ◆ Remote: e.g., DPSS, HTTP, FTP, HPSS
       ◆ Local: e.g., UNIX
     ● High-performance data movement over WANs
       ◆ Third-party transfer
     ● Data extraction and filtering functions
     ● Access to data is subject to global and local policy constraints
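
The idea of uniform access to heterogeneous storage systems can be illustrated with a small adapter pattern. This is only a sketch under stated assumptions, not the storage API described in the slides: the names `StorageBackend`, `LocalFileBackend`, `InMemoryBackend`, and `DataAccess` are hypothetical, and the in-memory class merely stands in for a remote system such as DPSS or HPSS.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from pathlib import Path
from urllib.parse import urlparse


class StorageBackend(ABC):
    """Hypothetical interface that each storage-system adapter implements."""

    @abstractmethod
    def read(self, path: str) -> bytes:
        """Return the full contents stored at path."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None:
        """Store data at path, replacing any existing contents."""


class LocalFileBackend(StorageBackend):
    """Adapter for a local UNIX file system."""

    def read(self, path: str) -> bytes:
        return Path(path).read_bytes()

    def write(self, path: str, data: bytes) -> None:
        Path(path).write_bytes(data)


class InMemoryBackend(StorageBackend):
    """Stand-in for a remote system; keeps objects in a dictionary."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        return self._objects[path]

    def write(self, path: str, data: bytes) -> None:
        self._objects[path] = data


class DataAccess:
    """Routes each URL to the backend registered for its scheme."""

    def __init__(self) -> None:
        self._backends: dict[str, StorageBackend] = {}

    def register(self, scheme: str, backend: StorageBackend) -> None:
        self._backends[scheme] = backend

    def _resolve(self, url: str) -> tuple[StorageBackend, str]:
        parsed = urlparse(url)
        if parsed.scheme not in self._backends:
            raise ValueError(f"no backend registered for scheme {parsed.scheme!r}")
        # Host and path together identify the object inside the backend.
        return self._backends[parsed.scheme], parsed.netloc + parsed.path

    def read(self, url: str) -> bytes:
        backend, path = self._resolve(url)
        return backend.read(path)

    def write(self, url: str, data: bytes) -> None:
        backend, path = self._resolve(url)
        backend.write(path, data)

    def copy(self, src_url: str, dst_url: str) -> None:
        # Data passes through the caller here; a real third-party transfer
        # would move it directly between the two storage servers.
        self.write(dst_url, self.read(src_url))
```

A caller would register one backend per scheme (for example `register("file", LocalFileBackend())`) and then use the same `read`, `write`, and `copy` calls regardless of where the data lives. Policy checks, extraction, and filtering are left out of this sketch.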

  9. Metadata Access Service
     ● Uniform treatment for all metadata
       ◆ Grid components
       ◆ Application-related metadata
       ◆ Storage system characteristics
       ◆ Relationships between data items
     ● Uniform access to metadata
       ◆ LDAP protocol
     ● Uniform storage structure
       ◆ LDAP hierarchical structure for distribution, replication, referral services
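
To make the hierarchical storage structure concrete, here is a minimal in-memory sketch of a directory whose entries are addressed by LDAP-style distinguished names and searched by subtree. It does not speak the LDAP protocol; the class names and the example names and attributes (`o=grid`, `ou=storage`, `media`) are assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entry:
    dn: str  # distinguished name, most specific component first
    attributes: dict[str, str] = field(default_factory=dict)


class MetadataDirectory:
    """Toy hierarchical directory; every kind of metadata is stored the same way."""

    def __init__(self) -> None:
        self._entries: dict[str, Entry] = {}

    def add(self, dn: str, **attributes: str) -> None:
        self._entries[dn] = Entry(dn, dict(attributes))

    def search(self, base: str, **filters: str) -> list[Entry]:
        """Return entries at or below base whose attributes match every filter."""
        results = []
        for dn, entry in self._entries.items():
            in_subtree = dn == base or dn.endswith("," + base)
            matches = all(entry.attributes.get(k) == v for k, v in filters.items())
            if in_subtree and matches:
                results.append(entry)
        return sorted(results, key=lambda e: e.dn)


directory = MetadataDirectory()
directory.add("ou=storage,o=grid", kind="container")
directory.add("cn=hpss1,ou=storage,o=grid", kind="storage-system", media="tape")
directory.add("cn=dpss1,ou=storage,o=grid", kind="storage-system", media="disk")
directory.add("cn=co2run,ou=datasets,o=grid", kind="dataset")

disk_systems = directory.search("ou=storage,o=grid", media="disk")
```

Because storage-system characteristics and application metadata live in the same tree, one search call covers both; distribution, replication, and referral between servers are outside the scope of this sketch.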

  10. Replica Management
      ● Collections contain related files
      ● Logical files describe replicated physical files
      ● Services for managing replicated file instances
        ◆ Create / delete
        ◆ Schedule / manage data transfer
        ◆ Register in the replica catalog
        ◆ Metadata display
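
The relationship between collections, logical files, and physical replicas can be sketched as a small catalog. This is an illustrative data structure only, not the LDAP-based replica catalog itself; the class `ReplicaCatalog`, its method names, and the example URLs are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


class ReplicaCatalog:
    """Maps collections to logical files, and logical files to physical copies."""

    def __init__(self) -> None:
        self._collections: dict[str, set[str]] = defaultdict(set)
        self._replicas: dict[str, set[str]] = defaultdict(set)

    def add_logical_file(self, collection: str, logical_name: str) -> None:
        self._collections[collection].add(logical_name)

    def register(self, logical_name: str, physical_url: str) -> None:
        """Record that a physical copy of the logical file exists at physical_url."""
        self._replicas[logical_name].add(physical_url)

    def unregister(self, logical_name: str, physical_url: str) -> None:
        """Forget a physical copy, for example after it has been deleted."""
        self._replicas[logical_name].discard(physical_url)

    def files_in(self, collection: str) -> list[str]:
        return sorted(self._collections.get(collection, ()))

    def locations(self, logical_name: str) -> list[str]:
        return sorted(self._replicas.get(logical_name, ()))


catalog = ReplicaCatalog()
catalog.add_logical_file("co2-simulation", "jan.dat")
catalog.register("jan.dat", "hpss://archive.example.org/co2/jan.dat")
catalog.register("jan.dat", "dpss://cache.example.org/co2/jan.dat")
```

Creating a replica would then consist of a data transfer followed by `register`, and deleting one of a removal followed by `unregister`; the transfer scheduling itself is not modeled here.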

  11. Replica Selection
      ● User can optimize access characteristics
        ◆ Grid structure and performance
        ◆ Storage system and file characteristics
      ● Intelligent scheduling to determine appropriate replica, site for (re)computation, etc.
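
One simple way to picture replica selection is to rank the available copies by an estimated transfer time. The cost model below (startup latency plus size divided by bandwidth) is an assumption made for illustration, not the selection algorithm from the slides, and the names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaInfo:
    url: str
    bandwidth_mb_s: float  # sustained bandwidth to the requesting site
    latency_s: float       # startup cost, e.g. a tape mount or connection setup


def estimated_transfer_time(replica: ReplicaInfo, size_mb: float) -> float:
    """Assumed cost model: fixed startup latency plus streaming time."""
    return replica.latency_s + size_mb / replica.bandwidth_mb_s


def select_replica(replicas: list[ReplicaInfo], size_mb: float) -> ReplicaInfo:
    """Pick the replica with the smallest estimated transfer time."""
    if not replicas:
        raise ValueError("no replicas available")
    return min(replicas, key=lambda r: estimated_transfer_time(r, size_mb))


tape = ReplicaInfo("hpss://archive.example.org/jan.dat", bandwidth_mb_s=100.0, latency_s=60.0)
disk = ReplicaInfo("dpss://cache.example.org/jan.dat", bandwidth_mb_s=20.0, latency_s=0.1)

small_choice = select_replica([tape, disk], size_mb=10.0)
large_choice = select_replica([tape, disk], size_mb=10_000.0)
```

With these made-up figures the low-latency disk copy is preferred for the small file and the high-bandwidth archive copy for the large one. In keeping with policy neutrality, a user could swap in a different cost function without changing `select_replica`.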

  12. Climate Data Scenario
      [Figure: data flow for a climate analysis query]
      ● User question: “How do midwest flood frequencies under 2xCO2 scenario compare with historical data?”
      ● Query manager plan: “Access datasets A, B; run A -> meso -> hydro; compare result with B”
      ● Components shown: query manager, resource manager, file access service, historical data archive, simulation data archive, analysis engine, caches, DPSS, HPSS, and the meso, hydro, and compare steps

  13. Current Activity
      ● Ongoing collaborations
        ◆ Climate
        ◆ High Energy Physics
      ● Storage API for uniform access to data
        ◆ API specification document
        ◆ Prototype code for HTTP, FTP, DPSS
      ● Replica management
        ◆ Replica catalog based on LDAP
        ◆ API and GUI tools for catalog access
      ● Quality of Service implementation

  14. Replica Management

  15. Quality of Service
      Bulk Transfer support in GARA
      [Plot: Bandwidth (KB/s), 0 to 12000, versus Time, 0 to 250; series: background, foreground, competitive]

  16. Planned Activity
      ● Data Access
        ◆ Integrated quality of service, security
        ◆ Performance enhancements for networking
      ● Performance guarantees for the Data Grid
      ● Automatic operation of the Data Grid
        ◆ Agent technologies used for distributed data replication, selection, and analysis
      ● Integrated CPU scheduling
        ◆ Server-side data reduction, affinity scheduling

