 
Roadmap for Applying Hadoop Distributed File System in Scientific Grid Computing

Garhan Attebury (1), Andrew Baranovski (2), Ken Bloom (1), Brian Bockelman (1), Dorian Kcira (3), James Letts (4), Tanya Levshina (2), Carl Lundestedt (1), Terrence Martin (4), Will Maier (5), Haifeng Pi (4), Abhishek Rana (4), Igor Sfiligoi (4), Alexander Sim (6), Michael Thomas (3), Frank Wuerthwein (4)

1. University of Nebraska Lincoln
2. Fermi National Accelerator Laboratory
3. California Institute of Technology
4. University of California, San Diego
5. University of Wisconsin Madison
6. Lawrence Berkeley National Laboratory

On Behalf of Open Science Grid (OSG) Storage Hadoop Community
Storage, a critical component of Grid
• Grid computing is data-intensive and CPU-intensive, which requires
  – Scalable management system for bookkeeping and discovering data
  – Reliable and fast tools for distributing and replicating data
  – Efficient procedures for processing and extracting data
  – Advanced techniques for analyzing and storing data in parallel
• A scalable, dynamic, efficient and easy-to-maintain storage system is on the critical path to the success of grid computing
  – Meet various data access needs at both the organization and the individual level
  – Maximize the CPU usage and efficiency
  – Fit into sophisticated VO policies (e.g. data security, user privilege)
  – Survive the “unexpected” usage of the storage system
  – Minimize the cost of ownership
  – Easy to expand, reconfigure, commission/decommission as requirements change
A Case Study: Some Requirements for the Storage Element (SE) at the Compact Muon Solenoid (CMS)
• Have a credible support model that meets the reliability, availability, and security expectations consistent with the computing infrastructure
• Demonstrate the ability to interface with the existing global data transfer system and the transfer technology of SRM tools and FTS, as well as the ability to interface to the CMS software locally through ROOT
• Well-defined and reliable behavior for recovery from the failure of any hardware component
• Well-defined and reliable method of replicating files to protect against the loss of any individual hardware system
• Well-defined and reliable procedure for decommissioning hardware without data loss
• Well-defined and reliable procedure for site operators to regularly check the integrity of all files in the SE
• Well-defined interfaces to monitoring systems
• Capable of delivering at least 1 MB/s per batch slot for CMS applications, and capable of writing files from the WAN at a rate of at least 125 MB/s while simultaneously writing data from the local farm at an average rate of 20 MB/s
• Job failures caused by failure to open a file or to deliver the data products from the storage system should occur at a rate below 1 in 10^5
Hadoop Distributed File System (HDFS)
• Open source project hosted by Apache (http://hadoop.apache.org) and used by Yahoo for its search engine, with data at the multiple-PB scale
• Design goals
  – Reduce the impact of hardware failure
  – Streaming data access
  – Handle large datasets
  – Simple coherency model
  – Portability across heterogeneous platforms
• A scalable distributed cluster file system
  – The namespace and image of the whole file system are maintained in the memory of one single machine, the NameNode
  – Files are split into blocks and stored across the cluster on DataNodes
  – File blocks can be replicated. The loss of one DataNode can be recovered from the replica blocks on other DataNodes.
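The block and replica model on this slide can be illustrated with a toy sketch in Python. It is not HDFS code: the block size, the replication factor, the round-robin placement, and the names `place_blocks` and `surviving_blocks` are all assumptions made for illustration, and real HDFS placement is decided by the NameNode with its own policy.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # assumed block size in bytes (illustrative)
REPLICATION = 2                 # assumed replication factor (illustrative)

def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [datanode, ...]} using simple round-robin placement."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    n_blocks = max(1, -(-file_size // block_size))  # ceiling division
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(n_blocks)
    }

def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one replica after one DataNode is lost."""
    remaining = {}
    for block, nodes in placement.items():
        alive = [n for n in nodes if n != failed_node]
        if alive:
            remaining[block] = alive
    return remaining

nodes = ["dn1", "dn2", "dn3"]                        # hypothetical DataNode names
placement = place_blocks(200 * 1024 * 1024, nodes)   # a 200 MB file
after_loss = surviving_blocks(placement, "dn2")
print(len(placement), "blocks,", len(after_loss), "still readable after losing dn2")
```

With two replicas on distinct nodes, every block keeps one copy after a single DataNode failure, which is the recovery property the slide describes.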
Important Components of HDFS-based SE
• Fuse/Fuse-DFS
  – FUSE is a Linux kernel module that allows file systems to be written in userspace; Fuse-DFS uses it to provide a POSIX-like interface to HDFS
  – Important for software applications accessing data in the local SE
• Globus GridFTP
  – Provides WAN transfer between two SEs, or between an SE and a worker node (WN)
  – A special plugin is needed to assemble asynchronously transferred packets for sequential writing to HDFS if multiple streams are used
• BeStMan
  – Provides the SRM interface to HDFS
  – Possible to develop/implement plugins that select GridFTP servers according to the status of the GridFTP servers
A number of software bugs and integration issues have been solved over the last 12 months to bring all the components together and make a production-quality SE
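The reordering job described for the GridFTP plugin can be sketched as follows. HDFS is written sequentially, so chunks that arrive out of order over parallel streams have to be held back until the next expected offset shows up. This is an illustrative Python sketch, not the actual plugin; `SequentialAssembler` and its methods are hypothetical names.

```python
import io

class SequentialAssembler:
    """Buffer out-of-order (offset, data) chunks and write them in order."""

    def __init__(self, sink):
        self.sink = sink          # any object with a write(bytes) method
        self.next_offset = 0      # next byte offset the sink expects
        self.pending = {}         # offset -> data waiting for earlier chunks

    def receive(self, offset, data):
        self.pending[offset] = data
        # Flush every chunk that is now contiguous with what was already written.
        while self.next_offset in self.pending:
            chunk = self.pending.pop(self.next_offset)
            self.sink.write(chunk)
            self.next_offset += len(chunk)

sink = io.BytesIO()               # stands in for a sequential HDFS output stream
asm = SequentialAssembler(sink)
for offset, data in [(6, b"world"), (0, b"hello "), (11, b"!")]:
    asm.receive(offset, data)
result = sink.getvalue()
print(result)
```

The sketch keeps pending chunks in memory without a bound; a production plugin would also need a limit on buffered data.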
HDFS SE Architecture for Scientific Computing
[Diagram. Components shown: NameNode (with secondary NN); dedicated DataNodes, each with a Hadoop client; WorkerNodes + (DataNode) + (GridFTP), each with FUSE + Hadoop client; BeStMan with FUSE + Hadoop client; GridFTP nodes, each with FUSE + Hadoop client; GUMS for proxy-user mapping]
HDFS-based SE at CMS Tier-2
• Three CMS Tier-2 sites, Nebraska, Caltech and UCSD, have currently deployed an HDFS-based SE
  – On average 6-12 months of operational experience, with increasing scale in total disk space
  – Currently around 100 DataNodes and 300 to 500 TB at each site
  – Successfully serves the CMS collaboration, with up to thousands of grid users and hundreds of local users accessing the datasets in HDFS
  – Successfully serves the data operation and Monte Carlo production run by CMS
• Benefits the new SE brings to these sites
  – Reliability: file loss has stopped thanks to the file replication scheme run by HDFS
  – Simple deployment: most of the deployment procedure is streamlined, with fewer commands run by the administrators
  – Easy operation: stable system, little effort for system/file recovery, less than 30 min per day for operation and user support
  – Proven scalability in supporting a large number of simultaneous read/write operations and high throughput in serving data to grid jobs running at the site
Highlights of Operational Performance of HDFS-SE
• Stably delivers ~3 MB/s to applications in the cluster while the cluster is fully loaded with jobs
  – Sufficient for the CMS application's I/O requirement with high CPU efficiency
  – The CMS application is IOPS limited, not bandwidth limited
• HDFS NameNode serves 2500 user requests per second
  – Sufficient for a cluster with thousands of cores running I/O-intensive jobs
• Sustained WAN transfer rate of 400 MB/s
  – Sufficient for CMS Tier-2 data operation (dataset transfer and stage-out of user analysis jobs)
• Simultaneously processes a thousand client requests at BeStMan
  – Sustained endpoint processing rate of 50 Hz
  – Sufficient for high-rate transfers of gigabyte-sized files and uncontrolled, chaotic user jobs
• Observed extremely low file corruption rate
  – Benefits from the robust and fast file replication of HDFS
• Decommissioning a DataNode takes < 1 hour, restarting the NameNode takes 1 minute, and checking the image of the file system (from the memory of the NameNode) takes 10 sec
  – Fast and efficient for operations
• Survives various stress tests that involve HDFS, BeStMan, GridFTP ...
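As a quick arithmetic check, the measured rates on this slide can be set against the CMS requirements quoted earlier in the talk (1 MB/s per batch slot, 125 MB/s WAN write). The Python sketch below only divides the quoted numbers; reading "~3 MB/s to applications" as a per-batch-slot figure is an assumption, and the dictionary keys are made-up labels.

```python
# Required rates come from the CMS SE requirements slide; measured rates from this slide.
# Assumption: "~3 MB/s to applications" is interpreted as a per-batch-slot rate.
requirements = {"per_slot_read_MBps": 1.0, "wan_write_MBps": 125.0}
measured = {"per_slot_read_MBps": 3.0, "wan_write_MBps": 400.0}

# Ratio of measured to required rate for each metric.
headroom = {name: measured[name] / requirements[name] for name in requirements}

for name, factor in headroom.items():
    print(f"{name}: {factor:.1f}x the required rate")
```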
[Figure: Data Transfer to HDFS-SE]
[Figure: NameNode Operation Count]
[Figure: Processing Rate at SRM endpoint]
Monitoring and Routine Test
• Integration with general grid monitoring infrastructure
  – Nagios, Ganglia, MonALISA
  – CPU, memory, network statistics for the NameNode, DataNodes and the whole system
• HDFS monitoring
  – Hadoop web service, Hadoop Chronicle, JConsole
• Status of the file system and users
  – Logs of NameNode, DataNode, GridFTP and BeStMan
• As part of the daily tasks and debugging activities
• Regular low-stress tests performed by the CMS VO
  – Test analysis jobs, load test of file transfer
  – Part of the daily commissioning of the site involves local and remote I/O of the SE
• Intentional failures in various parts of the SE with demonstrated recovery mechanisms
  – Documentation of the recovery procedure
[Figure: Load test between two HDFS-SE]
Data Security and Integrity
• Security concerns
  – HDFS
    • No encryption or strong authentication between client and server. HDFS must only be exposed to a secure internal network
    • In practice a firewall or NAT is needed to properly isolate HDFS from direct “public” access
    • The latest HDFS implements access tokens. Transition to Kerberos-based components is expected in 2010.
  – Grid components (GridFTP and BeStMan)
    • Use standard GSI security with VOMS extensions
• Data integrity and consistency of the file system
  – HDFS checksums for blocks of data
  – Command line tool to check blocks, directories and files
  – HDFS keeps multiple journals and file system images
  – The NameNode periodically requests the entire block report from all DataNodes.
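The idea behind per-block checksums can be sketched in a few lines of Python. This is not the HDFS implementation: the 512-byte chunk size, the use of `zlib.crc32`, and the function names are assumptions chosen for illustration. The point is only that checksums stored at write time let a reader locate which part of the data was damaged.

```python
import zlib

CHUNK = 512  # assumed number of bytes covered by each checksum (illustrative)

def chunk_checksums(data, chunk=CHUNK):
    """Compute one CRC32 per fixed-size chunk of data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]

def find_corrupt_chunks(data, expected, chunk=CHUNK):
    """Return indices of chunks whose checksum no longer matches the stored one."""
    actual = chunk_checksums(data, chunk)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]

original = bytes(range(256)) * 8          # 2048 bytes, i.e. 4 chunks
stored = chunk_checksums(original)        # checksums recorded at write time

damaged = bytearray(original)
damaged[700] ^= 0xFF                      # flip the bits of one byte inside chunk 1
corrupt = find_corrupt_chunks(bytes(damaged), stored)
print("corrupt chunks:", corrupt)
```

In a replicated file system, a detected mismatch would typically be handled by reading the same block from another replica.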