Berkeley Archival Storage Encapsulation Library (BASE)



SLIDE 1
  • A. Sim, CRD, LBNL
  • Sep. 30, 2015

Berkeley Archival Storage Encapsulation Library (BASE)

Alex Sim, Junmin Gu

Scientific Data Management Research Group Computational Research Division Lawrence Berkeley National Laboratory

SLIDE 2

BASE

  • Berkeley Archival Storage Encapsulation Library
  • Support for archival data on mass storage systems is critical to the operations of ESGF
  • One of the fundamental ESGF data management services
  • Large-scale data access from NERSC HPSS
  • Lets the ESGF Gateway system integrate data access to archival files at NERSC HPSS
  • Built on the experience of the Berkeley Storage Manager (BeStMan) during 2005-2015 and the Hierarchical Resource Manager (HRM) during 1998-2006
  • Ensures efficient data access to the archival storage at NERSC for ESGF

SLIDE 3

BeStMan system architecture

[Figure: architecture diagram. Labels: USER 1, USER 2, ... USER n; WAN and WAN/LAN links; SRM; Security Module; Local Policy Module; Request Queue Management; User Queue Management; File Service Queue; Network Access Management (GridFTP, FTP, BBFTP, SCP...); MSS Access Management (PFTP, HSI, MSRCP, SCP...); Disk Management; DISK; MSS; GridFTP server, FTP server, BBFTP server.]

SLIDE 4

BASE design

[Figure: BASE design diagram. Labels: Python module for NERSC HPSS; Berkeley Archival Storage Encapsulation (BASE) Library; Browse (ls), Retrieve (get), Archive (put); HSI; checksum enabled; local disk storage; NERSC HPSS.]

SLIDE 5

Main functions

  • Python module for three main functions: browsing, retrieving and archiving
  • File browsing function: gets file size information for files on HPSS
  • File retrieving function: gets a file from the HPSS source location to the local destination disk path
  • File archiving function: puts a file from the local source disk path to the HPSS destination path
  • The ESGF Gateway service would not use the archiving function

SLIDE 6

Backend calls

  • HSI command
  • Used in the backend to access HPSS
  • Its output log is parsed for the operation status
  • Upon a successful HSI operation, the output log is removed to reduce disk storage usage.
  • When the HSI operation fails for any reason, the output log is kept so that the cause of the failure can be addressed.
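
The log handling described above can be sketched as follows. This is a minimal illustration and not BASE's actual backend: the helper name `run_with_log`, the log path, and the use of the exit code as the success signal are all assumptions (BASE is described as parsing the HSI output log itself). The command is supplied by the caller, so no HSI flags are implied.

```python
import os
import subprocess


def run_with_log(cmd, log_path):
    """Run cmd, capture its output in log_path, and keep the log only on failure.

    Hypothetical helper: success is judged by the exit code here, whereas
    BASE is described as parsing the HSI output log.
    """
    with open(log_path, "w") as log:
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    ok = result.returncode == 0
    if ok:
        os.remove(log_path)  # success: drop the log to save disk space
    return ok  # on failure the log stays on disk for diagnosis
```

Keeping the log only on failure means disk usage grows with the number of failed operations rather than with the total number of transfers.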

SLIDE 7

Interface

  • User code or higher-level service integration would use the Python methods.
  • The Python methods are backed by C++ class methods.

SLIDE 8

Checksum

  • Checksum comparison is enabled if the HPSS file was archived with the checksum option.
  • The checksum value is then saved on the HPSS file system.
  • If the archived file does not have a checksum value, the checksum comparison is skipped.
  • By default, sha256 is compared.
  • The checksum type can be configured with an option.
  • How to change the checksum type is described in the manual.
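
The comparison rule above (compare when a stored checksum exists, skip otherwise) might look like the following in plain Python. This is a sketch using the standard `hashlib` module. The function names and the idea of passing the stored value in as a hex string are assumptions, since the slides do not say how BASE reads the checksum back from HPSS.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a local file in chunks so large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, stored_checksum, algorithm="sha256"):
    """Return None if there is nothing to compare, else True or False."""
    if not stored_checksum:
        return None  # archived without a checksum: comparison is skipped
    return file_checksum(path, algorithm) == stored_checksum.strip().lower()
```

Returning `None` for the skipped case keeps "not checked" distinct from "checked and mismatched", which a plain boolean would blur.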

SLIDE 9

SVN repo

  • SVN repository for the BASE source code
  • https://code.lbl.gov/projects/base/
  • Anonymous access is enabled
  • svn checkout --username anonsvn https://code.lbl.gov/svn/base/trunk/base

SLIDE 10

Configure/Make/Install

  • Configure
  • Finds the necessary paths and options.
  • Make and make install
  • Builds the library and places the library file in the lib directory of the distribution directory
  • Example

    cd base
    ./configure
    make
    make install
    ls -l dist/lib

SLIDE 11

HSI preparation

  • Download from NERSC (version 4.0.1.2)
  • NERSC NIM account is needed
  • https://www.nersc.gov/users/storage-and-file-systems/hpss/storing-and-retrieving-data/software-downloads/
  • Install
  • HSI installation instructions
  • https://www.nersc.gov/users/storage-and-file-systems/hpss/storing-and-retrieving-data/clients/hsi-configuration-and-installation/
  • Credential setup
  • For first-time use, the NERSC HPSS password needs to be set up in the $HOME/.netrc file
  • NERSC NIM account is needed
  • http://www.nersc.gov/users/storage-and-file-systems/hpss/getting-started/hpss-passwords/#toc-anchor-3
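
Before the first run, it can help to confirm that the credential file is in place. The sketch below uses Python's standard `netrc` module to look for an entry. The host name `archive.nersc.gov` is taken from the configuration options slide and is only assumed to be the machine name used in `.netrc`. `has_netrc_entry` is a hypothetical helper, not part of BASE.

```python
import netrc
import os


def has_netrc_entry(host="archive.nersc.gov", path=None):
    """Return True if the netrc file holds a non-empty password for host.

    path=None means the default $HOME/.netrc. The host name is an
    assumption; use whatever the NERSC instructions specify.
    """
    path = path or os.path.join(os.path.expanduser("~"), ".netrc")
    try:
        entry = netrc.netrc(path).authenticators(host)
    except (OSError, netrc.NetrcParseError):
        return False  # missing, unreadable, or malformed file
    # entry is a (login, account, password) tuple, or None if no match
    return entry is not None and bool(entry[2])
```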

SLIDE 12

Python samples

  • The samples directory includes all Python code samples
  • % ls samples/

    HOW_TO_RUN.txt    sample-ls.py           sample-put.py
    esgf_base_mss.py  sample-multi-read.py
    sample-get.py     sample-multi-write.py

SLIDE 13

class sdm

  • esgf_base_mss.py has the class sdm
  • It includes getSize(), getFile() and putFile() calls for browsing, retrieving and archiving respectively.

    class sdm(object):
        def getSize(self, src): ...
        def getFile(self, src, tgt): ...
        def putFile(self, src, tgt): ...

SLIDE 14

Browsing

  • Browsing a file
  • Getting file size information for files on HPSS

    import esgf_base_mss

    mssf = esgf_base_mss.sdm()
    filesize = mssf.getSize(hpss_file_path)

SLIDE 15

Retrieving

  • Retrieving a file
  • Retrieving a file from the NERSC HPSS to the local disk path

    import esgf_base_mss

    mssf = esgf_base_mss.sdm()
    mssf.getFile(hpss_file_path, local_file_path)

SLIDE 16

Archiving

  • Archiving a file
  • Archiving a file from the local disk path to the NERSC HPSS

    import esgf_base_mss

    mssf = esgf_base_mss.sdm()
    mssf.putFile(local_file_path, hpss_file_path)

SLIDE 17

Configuration options

  • esgf_base_mss.py writes the configuration options for the library
  • base_mss.rc
  • To change the options, update esgf_base_mss.py

    mss*HSI=hsi
    mss*checksum=sha256
    mss*MSSHostName=archive.nersc.gov
    mss*EnableLogging=true
    mss*MSSLogFile=/.../samples/msslogs/mss.log
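
The option lines above follow a simple `mss*Name=value` pattern. A reader who wants to inspect a generated `base_mss.rc` could parse it with a few lines of Python. This is a sketch based only on the lines shown on the slide, so the handling of blank lines and `#` comments is an assumption, and `parse_mss_options` is a hypothetical helper rather than part of BASE.

```python
def parse_mss_options(text, prefix="mss*"):
    """Parse 'mss*Name=value' lines into a dict keyed by Name."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumed: skip blanks/comments
            continue
        if not line.startswith(prefix) or "=" not in line:
            continue  # ignore anything that does not match the pattern
        name, value = line[len(prefix):].split("=", 1)
        options[name.strip()] = value.strip()
    return options


sample = """\
mss*HSI=hsi
mss*checksum=sha256
mss*MSSHostName=archive.nersc.gov
mss*EnableLogging=true
"""
print(parse_mss_options(sample)["checksum"])  # prints: sha256
```

Splitting on the first `=` only keeps values that themselves contain `=` intact.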

SLIDE 18

Notes on the API runs

  • By NERSC policy, there is a maximum number of concurrent connections to the NERSC HPSS.
  • It used to be 15.
  • When a 16th connection is attempted, the HSI connection fails immediately with the error message “421 Service not available - maximum number of sessions exceeded”.
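
Because the session limit surfaces as an immediate error rather than a queue, callers may want to recognize it and back off. The following sketch shows one way to do that. `is_session_limit_error` and `retry_on_session_limit` are hypothetical helpers, and the matched text is the message quoted above. How BASE reports this failure to Python (exception, return value, or log only) is not stated in the slides, so the operation here is simply assumed to return its output text.

```python
import time

SESSION_LIMIT_MARKER = "maximum number of sessions exceeded"


def is_session_limit_error(output):
    """True if the output text contains the HPSS session-limit message."""
    return "421" in output and SESSION_LIMIT_MARKER in output


def retry_on_session_limit(operation, attempts=5, delay=1.0, sleep=time.sleep):
    """Call operation() until its output no longer shows the session-limit error.

    operation is assumed to return the command's output text. The wait
    doubles after each rejected attempt (simple exponential backoff).
    """
    for attempt in range(attempts):
        output = operation()
        if not is_session_limit_error(output):
            return output
        if attempt < attempts - 1:
            sleep(delay * (2 ** attempt))
    raise RuntimeError("HPSS session limit still exceeded after retries")
```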

SLIDE 19

Multi-file requests

  • This is optional
  • A multi-file request can be programmed.
  • It needs to be done at the user level using the API, with control over the maximum number of concurrent connections.
  • The same multi-file request can be done with sequential individual file requests in multiple threads.
  • Both cases should control the maximum number of concurrent connections to HPSS.
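
One way to keep a multi-threaded request under the connection limit is a fixed-size worker pool. The sketch below uses `concurrent.futures.ThreadPoolExecutor` from the standard library. `transfer` stands for any per-file call (for example a bound `getFile`), while `MAX_HPSS_CONNECTIONS`, `run_bounded`, and the choice of 10 workers are illustrative assumptions. Whether the BASE Python methods are safe to call from several threads is not stated in the slides.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_HPSS_CONNECTIONS = 10  # assumed value; keep it under the site limit


def run_bounded(transfer, pairs, max_workers=MAX_HPSS_CONNECTIONS):
    """Apply transfer(src, dst) to every pair with at most max_workers at once.

    Returns a list of (src, dst, error) tuples in input order, where error
    is None on success or the exception raised by transfer.
    """
    def one(pair):
        src, dst = pair
        try:
            transfer(src, dst)
            return (src, dst, None)
        except Exception as exc:  # record the failure and keep going
            return (src, dst, exc)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(one, pairs))
```

The pool size is the only place the limit is enforced, so requests beyond it wait for a free worker instead of opening extra connections.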

SLIDE 20

Retrieving multiple files

  • For retrieving multiple files from the NERSC HPSS source paths to the local disk destination paths (e.g. sample-multi-read.py for 3 files):

    from multiprocessing import Process, Queue, current_process, freeze_support, Pool
    import esgf_base_mss

    mssf = esgf_base_mss.sdm()
    tasks = []
    # put each file request in an array
    tasks.append((esgf_base_mss.readFromNersc, (mssf, hpss_file_path_1, local_file_path_1)))
    tasks.append((esgf_base_mss.readFromNersc, (mssf, hpss_file_path_2, local_file_path_2)))
    tasks.append((esgf_base_mss.readFromNersc, (mssf, hpss_file_path_3, local_file_path_3)))
    esgf_base_mss.runTask(tasks, 3)

  • Note that N in runTask(tasks, N) must be less than the maximum number of concurrent connections allowed to HPSS.
  • Also, make sure that the multiprocessing package for Python is imported.
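
For more than a handful of files, the task list can be built in a loop rather than with one `append` per file. The helpers below are a sketch: `build_tasks` and `safe_worker_count` are hypothetical names, the task tuple shape `(function, (mssf, src, dst))` is copied from the sample above, and the default limit of 15 is the historical figure quoted earlier, not a confirmed current value.

```python
def build_tasks(func, mssf, pairs):
    """Build the (function, args) task tuples in the shape the samples use."""
    return [(func, (mssf, src, dst)) for src, dst in pairs]


def safe_worker_count(num_tasks, connection_limit=15):
    """Pick N for runTask: below the connection limit, never more than needed."""
    return max(1, min(num_tasks, connection_limit - 1))
```

With the real module, this would be used as `esgf_base_mss.runTask(build_tasks(esgf_base_mss.readFromNersc, mssf, pairs), safe_worker_count(len(pairs)))`, where `pairs` is a list of (HPSS path, local path) tuples.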

SLIDE 21

Archiving multiple files

  • For archiving multiple files from the local disk source paths to the NERSC HPSS destination paths (e.g. sample-multi-write.py for 3 files):

    from multiprocessing import Process, Queue, current_process, freeze_support, Pool
    import esgf_base_mss

    mssf = esgf_base_mss.sdm()
    tasks = []
    # put each file request in an array
    tasks.append((esgf_base_mss.putToNersc, (mssf, local_file_path_1, hpss_file_path_1)))
    tasks.append((esgf_base_mss.putToNersc, (mssf, local_file_path_2, hpss_file_path_2)))
    tasks.append((esgf_base_mss.putToNersc, (mssf, local_file_path_3, hpss_file_path_3)))
    esgf_base_mss.runTask(tasks, 3)

  • Note that N in runTask(tasks, N) must be less than the maximum number of concurrent connections allowed to HPSS.
  • Also, make sure that the multiprocessing package for Python is imported.

SLIDE 22

Summary

  • BASE Library
  • Python API and C/C++ library file
  • HSI access for HPSS
  • NERSC HPSS as the first step
  • Source code is available under the BSD license
  • Anonymous access
  • https://code.lbl.gov/projects/base/
  • Support is available
  • sdmsupport@lbl.gov