The Data Grid: An Architecture for Distributed Management of Large Scientific Data Sets
Ann Chervenak, Carl Kesselman (Information Sciences Institute, University of Southern California)
Ian Foster, Chuck Salisbury, Steve Tuecke (Argonne National Laboratory)
Data Grid 2
◆ Storage systems
◆ Metadata
◆ Management of replicated files
Data Grid 3
◆ Global climate change, High energy physics
◆ Terabytes, eventually petabytes
◆ Raw and derived data
◆ Data replication for enhanced performance
◆ Networks, systems, storage, and applications
Data Grid 4
[Diagram: target data grid environment and existing technologies]
◆ Services: ingest/catalog service, data mover service, catalog manager, query manager
◆ Archive: 1-10 PB, archival, GB/s network, QoS
◆ Cache: 10-100 TB, non-archival, GB/s network, QoS
◆ Analysis computer: 1-10 TF/s, GB/s network, on-demand QoS
◆ Network: 1-10 GB/s, striped, secure, QoS
◆ Common services: security, resource discovery, measurement, resource management, accounting/payment, fault detection, communications
◆ HPSS: archival, multi-PB, access > 100 MB/s?, no QoS
◆ DPSS: fast disk cache, no QoS
◆ Computers: preliminary QoS work (e.g., DSRT); XFS: QoS for disk
◆ Networks (ESnet, MREN, NTON): QoS, e.g., diffserv
◆ MCAT, SRB, pftp, GASS, STACS, Condor
◆ Netlogger; Globus toolkit (security, information, fault detection, resource management, communication, etc.); Autopilot; MPI-IO; Akenti; Condor; ...
Data Grid 5
Data Grid 6
◆ Support heterogeneous systems
◆ User / local decision making and control
◆ Integration of storage and computation
◆ Data model and interface for metadata
Data Grid 7
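[Diagram: layered architecture]
◆ High-level components: replica selection, replica management, other high-level components
◆ Core services: storage system (e.g., DPSS, HPSS), metadata repository (e.g., LDAP, MCAT), resource management (e.g., LSF, DIFFSERV), other core services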
Data Grid 8
◆ Remote: e.g., DPSS, HTTP, FTP, HPSS
◆ Local: e.g., UNIX
◆ Third party transfer
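The idea of one access interface over remote and local storage, plus third-party transfer, can be sketched as follows. This is a minimal illustration, not the API from the specification: `StorageSystem`, `InMemoryStorage`, and `third_party_transfer` are hypothetical names, and an in-memory dictionary stands in for a real back end such as DPSS, HPSS, FTP, or a UNIX file system.

```python
from abc import ABC, abstractmethod
from typing import Dict


class StorageSystem(ABC):
    """Hypothetical uniform interface over one storage back end."""

    @abstractmethod
    def read(self, name: str) -> bytes:
        """Return the contents of the named file."""

    @abstractmethod
    def write(self, name: str, data: bytes) -> None:
        """Store data under the given name."""


class InMemoryStorage(StorageSystem):
    """Stand-in for a real back end; keeps files in a dictionary."""

    def __init__(self, label: str) -> None:
        self.label = label
        self._files: Dict[str, bytes] = {}

    def read(self, name: str) -> bytes:
        return self._files[name]

    def write(self, name: str, data: bytes) -> None:
        self._files[name] = data


def third_party_transfer(src: StorageSystem, dst: StorageSystem, name: str) -> int:
    """Copy a file from src to dst on behalf of a client.

    The caller supplies only names; it never handles the file contents.
    Returns the number of bytes copied.
    """
    data = src.read(name)
    dst.write(name, data)
    return len(data)


# Usage: move a 1 KB file from an "archive" to a "cache".
archive = InMemoryStorage("archive")
cache = InMemoryStorage("cache")
archive.write("climate/run42.dat", b"\x00" * 1024)
copied = third_party_transfer(archive, cache, "climate/run42.dat")
```

Because callers depend only on the abstract interface, a new back end can be added without changing code that moves or reads data.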
Data Grid 9
◆ Grid components
◆ Application-related metadata
◆ Storage system characteristics
◆ Relationships between data items
◆ LDAP protocol
◆ LDAP hierarchical structure for distribution,
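To illustrate how a hierarchical LDAP-style namespace can organize metadata, the sketch below builds distinguished names (DNs) for catalog entries. The attribute names (`lf`, `lc`, `rc`, `o`) and the entry values are assumptions made for illustration, not the schema used by the project, and the helpers skip the character escaping that real DNs require.

```python
from typing import List, Tuple


def make_dn(rdns: List[Tuple[str, str]]) -> str:
    """Join (attribute, value) pairs into a DN, most specific entry first.

    Simplified: values must not contain commas or other special characters.
    """
    return ",".join(f"{attr}={value}" for attr, value in rdns)


def parent_dn(dn: str) -> str:
    """Drop the leading component to obtain the parent entry's DN."""
    _, _, rest = dn.partition(",")
    return rest


# Hypothetical hierarchy: organization > catalog > collection > file.
file_dn = make_dn([
    ("lf", "jan1998.dat"),
    ("lc", "CO2-1998"),
    ("rc", "ClimateCatalog"),
    ("o", "ExampleOrg"),
])
collection_dn = parent_dn(file_dn)
```

Each subtree of such a hierarchy can in principle be hosted by a different server, which is one reason a hierarchical structure suits distributed metadata.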
Data Grid 10
◆ Create / delete
◆ Schedule / manage data transfer
◆ Register in the replica catalog
◆ Metadata display
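The create, delete, and register operations listed above can be sketched with a small in-memory catalog that maps a logical file name to the locations of its copies. `ReplicaCatalog`, its method names, and the example URLs are hypothetical; the prototype described in this talk stores its catalog in LDAP rather than in a Python dictionary.

```python
from collections import defaultdict
from typing import Dict, List, Set


class ReplicaCatalog:
    """Hypothetical catalog: logical file name -> set of physical locations."""

    def __init__(self) -> None:
        self._locations: Dict[str, Set[str]] = defaultdict(set)

    def register(self, logical_name: str, location: str) -> None:
        """Record that a copy of logical_name exists at location."""
        self._locations[logical_name].add(location)

    def unregister(self, logical_name: str, location: str) -> None:
        """Forget one copy; drop the logical name once no copies remain."""
        self._locations[logical_name].discard(location)
        if not self._locations[logical_name]:
            del self._locations[logical_name]

    def lookup(self, logical_name: str) -> List[str]:
        """Return all known locations, sorted for stable output."""
        return sorted(self._locations.get(logical_name, set()))


# Usage: one logical file with an archive copy and a cache copy.
catalog = ReplicaCatalog()
catalog.register("co2-1998", "hpss://archive.example.org/co2-1998")
catalog.register("co2-1998", "dpss://cache.example.org/co2-1998")
before = catalog.lookup("co2-1998")
catalog.unregister("co2-1998", "dpss://cache.example.org/co2-1998")
after = catalog.lookup("co2-1998")
```

Deleting a replica in a real system would also remove the physical copy; the sketch covers only the bookkeeping side.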
Data Grid 11
◆ Grid structure and performance
◆ Storage system and file characteristics
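One way to combine the two kinds of information listed above is to estimate, for each replica, how long it would take to deliver the file, and then pick the fastest. The cost model below (startup latency plus size divided by bandwidth) and all numbers are assumptions chosen for illustration; the slides do not specify a selection algorithm.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Replica:
    location: str
    size_bytes: int                # file characteristic
    bandwidth_bytes_per_s: float   # measured or predicted path performance
    access_latency_s: float        # storage system characteristic (e.g., tape mount)


def estimated_transfer_time(replica: Replica) -> float:
    """Assumed cost model: startup latency plus streaming time."""
    return replica.access_latency_s + replica.size_bytes / replica.bandwidth_bytes_per_s


def select_replica(replicas: Iterable[Replica]) -> Replica:
    """Return the replica with the smallest estimated transfer time.

    Raises ValueError if no replicas are given.
    """
    return min(replicas, key=estimated_transfer_time)


def candidates(size_bytes: int) -> list:
    """Two hypothetical copies of the same file."""
    return [
        Replica("tape-archive", size_bytes, 100e6, 60.0),  # slow to start, fast link
        Replica("disk-cache", size_bytes, 20e6, 0.1),      # fast to start, slow link
    ]


small_choice = select_replica(candidates(1_000_000_000))   # 1 GB file
large_choice = select_replica(candidates(10_000_000_000))  # 10 GB file
```

With these numbers the disk cache wins for the 1 GB file (about 50 s versus 70 s), while the tape archive wins for the 10 GB file (about 160 s versus 500 s), which shows why both file size and system characteristics matter.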
Data Grid 12
[Diagram: climate analysis example]
◆ User query: “How do midwest flood frequencies under 2xCO2 scenario compare with historical data?”
◆ Query manager: “Access datasets A, B; run A -> meso -> hydro; compare result with B”
◆ Analysis engine: meso, hydro, compare
◆ Resource manager, file access service
◆ Storage: simulation data archive, historical data archive (HPSS), DPSS, caches
Data Grid 13
◆ Climate
◆ High Energy Physics
◆ API specification document
◆ Prototype code for HTTP, FTP, DPSS
◆ Replica catalog based on LDAP
◆ API and GUI tools for catalog access
Data Grid 14
Data Grid 15
[Figure: Bulk transfer support in GARA. Bandwidth (KB/s) versus time for background, foreground, and competitive traffic.]
Data Grid 16
◆ Integrated quality of service, security
◆ Performance enhancements for networking
◆ Agent technologies used for distributed data
◆ Server-side data reduction, affinity scheduling
Data Grid 17