
Berkeley Archival Storage Encapsulation Library (BASE)



  1. Berkeley Archival Storage Encapsulation Library (BASE)
     Alex Sim, Junmin Gu
     Scientific Data Management Research Group
     Computational Research Division
     Lawrence Berkeley National Laboratory
     A. Sim, CRD, LBNL, Sep. 30, 2015

  2. BASE
     • Berkeley Archival Storage Encapsulation Library
     • Support for archival data on the mass storage system is critical to the operations of ESGF
     • One of the fundamental ESGF data management services
     • Large-scale data access from NERSC HPSS
     • Lets the ESGF Gateway system integrate data access to archival files at NERSC HPSS
     • Builds on experience with the Berkeley Storage Manager (BeStMan), 2005-2015, and the Hierarchical Resource Manager (HRM), 1998-2006
     • Ensures efficient data access to the archival storage at NERSC for ESGF

  3. BeStMan system architecture
     [Diagram. Labels: USER 1, USER 2, ... USER n; WAN/LAN; SRM; Security Module; Local Policy Module; Request Queue Management; File Service Queue; USER QUEUE Management; MSS Access Management (PFTP, HSI, MSRCP, SCP...); DISK Management; Network Access Management (GridFTP, FTP, BBFTP, SCP...); MSS; DISK; GridFTP server; FTP server; BBFTP server]

  4. BASE design
     [Diagram. Labels: Python module for NERSC HPSS; Berkeley Archival Storage Encapsulation (BASE) Library; Browse (ls), Retrieve (get), Archive (put); HSI; Checksum enabled; NERSC HPSS; Local DISK Storage]

  5. Main functions
     • Python module for three main functions
       • Browsing, retrieving and archiving
     • File browsing function
       • Getting file size information for files on HPSS
     • File retrieving function
       • Getting the file from the HPSS source location to the local destination disk path
     • File archiving function
       • Putting the file from the local source disk path to the HPSS destination path
       • The ESGF Gateway service would not use this function

  6. Backend calls
     • HSI command
       • Used in the backend to access HPSS
       • Its output log is parsed for the operation status
       • After a successful HSI operation, the output log is removed to reduce disk usage
       • If the HSI operation fails for any reason, the output log is kept so that the cause of the failure can be investigated
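
The keep-the-log-only-on-failure behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not BASE's implementation: the helper name `run_with_log` is hypothetical, and it uses the process exit code as the status signal, whereas the slides say BASE parses the HSI output log itself.

```python
import os
import subprocess
import sys
import tempfile


def run_with_log(cmd, log_dir):
    """Run cmd, capture its output in a log file, and keep the log only on failure.

    Returns (succeeded, log_path); log_path is None when the log was removed.
    Hypothetical helper: the exit code stands in for BASE's log parsing.
    """
    fd, log_path = tempfile.mkstemp(suffix=".log", dir=log_dir)
    with os.fdopen(fd, "w") as log:
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    if result.returncode == 0:
        os.remove(log_path)  # success: drop the log to save disk space
        return True, None
    return False, log_path  # failure: keep the log for diagnosis


if __name__ == "__main__":
    # Stand-in command; a real caller would pass an HSI command line instead.
    ok, log = run_with_log([sys.executable, "-c", "print('done')"], tempfile.gettempdir())
    print(ok, log)
```

A caller that gets `False` back can point the user at the returned log path instead of discarding the evidence.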

  7. Interface
     • User code or higher-level service integrations use the Python methods
     • The Python methods are backed by C++ class methods

  8. Checksum
     • Checksum comparison is enabled
       • If the HPSS file was archived with the checksum option, the checksum value is saved on the HPSS file system
       • If the archived file does not have a checksum value, the checksum comparison is skipped
     • By default, sha256 checksums are compared
       • The checksum type can be configured with an option
       • The manual describes how to change the checksum type
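
The compare-or-skip rule above can be illustrated with Python's standard `hashlib`. This is a sketch under assumptions: the function names are hypothetical, and how BASE actually reads the stored checksum from HPSS is not shown in the slides, so the stored value is simply passed in as a string.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of the file at path, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, stored_checksum, algorithm="sha256"):
    """Compare a local file against a stored checksum.

    Returns None when no checksum was stored (comparison skipped),
    otherwise True or False. Hypothetical helper, not part of BASE.
    """
    if not stored_checksum:
        return None
    return file_checksum(path, algorithm) == stored_checksum.strip().lower()
```

Returning `None` for the skipped case keeps "not checked" distinct from "checked and mismatched", which matters when a caller decides whether to retry a transfer.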

  9. SVN repo
     • SVN repository for the BASE source code
       • https://code.lbl.gov/projects/base/
     • Anonymous access is enabled
       • svn checkout --username anonsvn https://code.lbl.gov/svn/base/trunk/base

  10. Configure/Make/Install
      • Configure
        • Finds the necessary paths and options
      • Make and make install
        • Build the library and place the library file in the lib directory of the distribution directory
      • Example

          cd base
          ./configure
          make
          make install
          ls -l dist/lib

  11. HSI preparation
      • Download from NERSC (version 4.0.1.2)
        • A NERSC NIM account is needed
        • https://www.nersc.gov/users/storage-and-file-systems/hpss/storing-and-retrieving-data/software-downloads/
      • Install
        • HSI installation instructions
        • https://www.nersc.gov/users/storage-and-file-systems/hpss/storing-and-retrieving-data/clients/hsi-configuration-and-installation/
      • Credential setup
        • Before first use, the NERSC HPSS password needs to be set up in the $HOME/.netrc file
        • A NERSC NIM account is needed
        • http://www.nersc.gov/users/storage-and-file-systems/hpss/getting-started/hpss-passwords/#toc-anchor-3
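
Before running the library for the first time, it can help to confirm that the credential file has an entry for the HPSS host. The sketch below uses Python's standard `netrc` module. It assumes the file follows the conventional `machine ... login ... password ...` layout and that the host is `archive.nersc.gov` (the `MSSHostName` value shown in the configuration slide); the NERSC page linked above remains the authority on the actual format. The function name is hypothetical.

```python
import netrc


def has_netrc_entry(netrc_path, host):
    """Return True if netrc_path holds a login and password for host.

    Hypothetical pre-flight check; assumes a conventional netrc layout.
    """
    try:
        entry = netrc.netrc(netrc_path).authenticators(host)
    except (OSError, netrc.NetrcParseError):
        return False  # file missing, unreadable, or malformed
    if entry is None:
        return False  # no entry for this host and no default entry
    login, _account, password = entry
    return bool(login) and bool(password)
```

For example, `has_netrc_entry(os.path.expanduser("~/.netrc"), "archive.nersc.gov")` would report whether the setup step has been done, without printing the password.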

  12. Python samples
      • The samples directory includes all Python code samples

          % ls samples/
          HOW_TO_RUN.txt  sample-ls.py  sample-put.py  esgf_base_mss.py  sample-multi-read.py  sample-get.py  sample-multi-write.py

  13. class sdm
      • esgf_base_mss.py defines the class sdm
      • It includes the getSize(), getFile() and putFile() calls for browsing, retrieving and archiving, respectively

          class sdm(object):
              def getSize(self, src): ...       # method bodies are not shown on the slide
              def getFile(self, src, tgt): ...
              def putFile(self, src, tgt): ...

  14. Browsing
      • Browsing a file
        • Getting file size information for a file on HPSS

          import esgf_base_mss

          mssf = esgf_base_mss.sdm()
          # hpss_file_path: path of the file on HPSS
          filesize = mssf.getSize(hpss_file_path)

  15. Retrieving
      • Retrieving a file
        • Retrieving a file from the NERSC HPSS to the local disk path

          import esgf_base_mss

          mssf = esgf_base_mss.sdm()
          # source on HPSS first, local destination second
          mssf.getFile(hpss_file_path, local_file_path)

  16. Archiving
      • Archiving a file
        • Archiving a file from the local disk path to the NERSC HPSS

          import esgf_base_mss

          mssf = esgf_base_mss.sdm()
          # local source first, HPSS destination second
          mssf.putFile(local_file_path, hpss_file_path)

  17. Configuration options
      • esgf_base_mss.py writes the configuration options for the library
        • base_mss.rc
      • To change the options, update esgf_base_mss.py

          mss*HSI=hsi
          mss*checksum=sha256
          mss*MSSHostName=archive.nersc.gov
          mss*EnableLogging=true
          mss*MSSLogFile=/.../samples/msslogs/mss.log
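
To inspect the options that end up in base_mss.rc, a small reader for the `mss*Key=value` lines shown above could look like the following. This is a hypothetical helper and not part of BASE; the slides do not describe the file syntax beyond these lines, so anything that is not an `mss*` assignment is simply ignored.

```python
def parse_mss_options(text):
    """Parse 'mss*Key=value' lines into a dict keyed by option name.

    Hypothetical reader; lines that are not mss* assignments are skipped.
    """
    options = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        prefix, sep, name = key.strip().partition("*")
        if sep and prefix == "mss":
            options[name] = value.strip()
    return options


# Option lines copied from the slide (the log file path is elided there, so it is left out).
SAMPLE = """\
mss*HSI=hsi
mss*checksum=sha256
mss*MSSHostName=archive.nersc.gov
mss*EnableLogging=true
"""

if __name__ == "__main__":
    print(parse_mss_options(SAMPLE))
```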

  18. Notes on the API runs
      • NERSC policy sets a maximum number of concurrent connections to the NERSC HPSS
        • It used to be 15
        • When a 16th connection is attempted, the HSI connection fails immediately with the error message “421 Service not available - maximum number of sessions exceeded”

  19. Multi-file requests
      • This is optional
      • A multi-file request can be programmed
        • It needs to be done at the user level, using the API, with control over the maximum number of concurrent connections
      • The same multi-file request can also be made as sequential individual file requests in multiple threads
      • In both cases, the maximum number of concurrent connections to the HPSS must be controlled
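
One way to keep a multi-file request under the connection limit is a fixed-size worker pool. The sketch below uses the standard `concurrent.futures` module rather than the library's own task helpers; `run_bounded` and `MAX_HPSS_CONNECTIONS` are hypothetical names, and the value 15 is only the limit the slides say NERSC used to enforce, so the current policy should be checked before relying on it.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: the historical limit quoted in the slides; verify the current policy.
MAX_HPSS_CONNECTIONS = 15


def run_bounded(transfer, requests, max_workers):
    """Run transfer(*args) for each request with at most max_workers in flight.

    Results are returned in request order; an exception raised by any
    transfer propagates to the caller.
    """
    if not 1 <= max_workers <= MAX_HPSS_CONNECTIONS:
        raise ValueError("max_workers must be between 1 and %d" % MAX_HPSS_CONNECTIONS)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(transfer, *args) for args in requests]
        return [future.result() for future in futures]
```

With the API from the earlier slides, a call might look like `run_bounded(mssf.getFile, [(hpss_path_1, local_path_1), (hpss_path_2, local_path_2)], 3)`, where the path variables are placeholders. Whether a single `sdm` object can safely be shared across threads is not stated in the slides, so that usage is an assumption as well.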

  20. Retrieving multiple files
      • For retrieving multiple files from the NERSC HPSS source paths to the local disk destination paths (e.g. sample-multi-read.py for 3 files):

          from multiprocessing import Process, Queue, current_process, freeze_support, Pool
          import esgf_base_mss

          mssf = esgf_base_mss.sdm()
          tasks = []
          # put each file request in an array
          tasks.append((esgf_base_mss.readFromNersc, (mssf, hpss_file_path_1, local_file_path_1)))
          tasks.append((esgf_base_mss.readFromNersc, (mssf, hpss_file_path_2, local_file_path_2)))
          tasks.append((esgf_base_mss.readFromNersc, (mssf, hpss_file_path_3, local_file_path_3)))
          esgf_base_mss.runTask(tasks, 3)

      • Note that N in runTask(tasks, N) must be less than the maximum number of concurrent connections allowed to HPSS
      • Also make sure that the Python multiprocessing package is imported

  21. Archiving multiple files
      • For archiving multiple files from the local disk source paths to the NERSC HPSS destination paths (e.g. sample-multi-write.py for 3 files):

          from multiprocessing import Process, Queue, current_process, freeze_support, Pool
          import esgf_base_mss

          mssf = esgf_base_mss.sdm()
          tasks = []
          # put each file request in an array
          tasks.append((esgf_base_mss.putToNersc, (mssf, local_file_path_1, hpss_file_path_1)))
          tasks.append((esgf_base_mss.putToNersc, (mssf, local_file_path_2, hpss_file_path_2)))
          tasks.append((esgf_base_mss.putToNersc, (mssf, local_file_path_3, hpss_file_path_3)))
          esgf_base_mss.runTask(tasks, 3)

      • Note that N in runTask(tasks, N) must be less than the maximum number of concurrent connections allowed to HPSS
      • Also make sure that the Python multiprocessing package is imported

  22. Summary
      • BASE Library
        • Python API and C/C++ library file
        • HSI access for HPSS
        • NERSC HPSS as the first step
      • Source code is available under the BSD license
        • Anonymous access
        • https://code.lbl.gov/projects/base/
      • Support is available
        • sdmsupport@lbl.gov
