CVMFS for Data Federations, Derek Weitzel, University of Nebraska - Lincoln (PowerPoint presentation)



SLIDE 1

CVMFS for Data Federations

Derek Weitzel University of Nebraska - Lincoln

SLIDE 2

Problem with Data Federations

  • Users must know the exact filenames for each job.
  • They have to use special tools they are unfamiliar with (such as xrdcp or stashcp).
  • Applications may only speak POSIX.
  • They are difficult to set up for opportunistic VOs; OSG has already created one: StashCache.

SLIDE 3

Changes to CVMFS

  • As discussed in Brian’s talk yesterday, changes to CVMFS developed by him and me have enabled CVMFS’s use in data federations.
  • CVMFS can now access data federations through HTTP gateways.
  • Metadata (catalogs) come from the normal OASIS Stratum-1 infrastructure.
  • Data files come from the data federation.
SLIDE 4

Changes to CVMFS

  • File accesses can be redirected to another server.
  • Files retrieved from this other server are not in the standard CVMFS hashed format.
  • Rather, they are uncompressed.
  • Instead, they are pointers to a file on another server, e.g. an XRootD server.
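A client-side sketch of how such redirection might be configured, assuming the external-data parameters that came with the CVMFS 2.2 preview; the gateway host and proxy below are placeholders, not real endpoints:

```shell
# /etc/cvmfs/config.d/nova.osgstorage.org.local (sketch, placeholder values)
# CVMFS_EXTERNAL_URL points the client at the HTTP gateway serving the
# uncompressed data files; catalogs still come from the Stratum-1s.
CVMFS_EXTERNAL_URL="http://xrootd-gateway.example.org:1094"   # placeholder gateway
CVMFS_HTTP_PROXY="http://squid.example.org:3128"              # placeholder proxy
```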

SLIDE 5

Repositories

  • nova.osgstorage.org - Repo from the XRootD data source at FNAL
  • stash.osgstorage.org - Repo built from user-accessible storage at OSG-Connect
  • cms.osgstorage.org - Repo of the CMS data federation
  • ligo.osgstorage.org - Repo of LIGO data stored at Nebraska


SLIDE 7

stash.osgstorage.org

  • 1. At a CVMFS repo, an HTCondor cron job scans the Stash filesystem at UChicago, recording differences since the last scan.
  • This looks at the contents of /stash/$USER/public found on OSG-Connect.
  • 2. The job records the files’ metadata (size, checksum) in the CVMFS repository server. The data stays on Stash.
  • 3. The CVMFS repository is published with the new contents.
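The scan step above can be sketched as a small shell script. This is an illustrative stand-in for the real HTCondor cron job: the temporary directory and the user "alice" are hypothetical, and the one-record-per-file output format is an assumption.

```shell
#!/bin/sh
# Stand-in for the scan step: record each public file's size and checksum.
# STASH_ROOT and user "alice" are hypothetical; the real job scans
# /stash/$USER/public on the Stash filesystem at UChicago.
STASH_ROOT=$(mktemp -d)
mkdir -p "$STASH_ROOT/alice/public"
printf 'hello\n' > "$STASH_ROOT/alice/public/data.txt"

# One metadata record per file: path, size in bytes, SHA-1 checksum.
scan=$(find "$STASH_ROOT" -path '*/public/*' -type f | while read -r f; do
  printf '%s %s %s\n' "$f" "$(wc -c < "$f")" "$(sha1sum "$f" | cut -d' ' -f1)"
done)
echo "$scan"
```

Only the metadata records would be pushed to the CVMFS repository server; the file contents stay on Stash.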
SLIDE 8

StashCache

  • Managing data opportunistically at storage elements requires a CMS- or ATLAS-sized commitment.
  • StashCache uses distributed caches across the country.
  • The data origin is the Stash service on OSG-Connect.
  • Users write data into Stash, and jobs read the data back through StashCache.

stashcache.github.io

For a full overview of StashCache, see Brian’s talk from last year’s AHM.

SLIDE 9

[Diagram: worker-node CVMFS clients at a standard site fetch metadata from the CVMFS repository server and pull data over HTTP from StashCache servers; the StashCache servers contact the StashCache federation (redirector) via XRootD, which reads the actual data files from the Stash origin site.]

Overview of CVMFS and StashCache

  • Regular XRootD StashCache federation.
  • CVMFS contacts the caching servers over HTTP.
  • The caching servers contact the federation for the data.
  • Worker nodes pull data from the caching servers.

SLIDE 10

Uses

  • Large datasets which cannot be cached with Squid:
  • Full BLAST DBs
  • NOvA flux files…
  • Targeting working-set sizes* from 10 GB to 10 TB.

It will work fine for smaller sizes, but OASIS may be more efficient for distribution. *Number of unique bytes touched in 24 hours.

SLIDE 11

User Perspective

  • Copy data onto OSG-Connect using scp, Globus Online, or your favorite tool.
  • Put the data into /stash/<user>/public.
  • Wait a while for the data to be published (~1 hr).
  • Use the data on the worker nodes!
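The steps above imply a mapping from the Stash path to a path under the CVMFS mount point. The mapping sketched here is an assumption based on the /stash/<user>/public convention and the /cvmfs/stash.osgstorage.org/ access point from a later slide; the user "alice" and file name are hypothetical:

```shell
# Hypothetical mapping from a Stash path to the corresponding CVMFS path.
stash_path=/stash/alice/public/input.dat
cvmfs_path="/cvmfs/stash.osgstorage.org/${stash_path#/stash/}"
echo "$cvmfs_path"   # prints /cvmfs/stash.osgstorage.org/alice/public/input.dat
```

A job then reads that path with plain POSIX I/O, with no xrdcp or stashcp needed.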
SLIDE 12

Stash -> CVMFS Delay

  • There is a delay between when a file is created and when it appears in CVMFS.

[Plot: cumulative distribution of the CVMFS publish delay (delay in hours vs. probability of file existence).]

SLIDE 13

Stash -> CVMFS Delay

  • Within 1 hour, the files are largely available.

[Plot: cumulative distribution of the CVMFS publish delay (delay in hours vs. probability of file existence).]
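The plotted curve is an empirical CDF. A minimal sketch of the computation, using synthetic delay values (the measured data behind the plot is not reproduced here):

```shell
# Fraction of files published within 1 hour, from a list of delays in hours.
# The delay values are synthetic, for illustration only.
delays="0.2 0.3 0.4 0.5 0.7 0.8 0.9 1.0 1.1 1.6"
pct=$(printf '%s\n' $delays | awk '{ n++; if ($1 <= 1.0) k++ } END { printf "%d%%", 100*k/n }')
echo "$pct"   # prints 80%: 8 of the 10 synthetic delays fall within 1 hour
```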

SLIDE 14

CVMFS + StashCache

  • This creates a global read-only filesystem
  • Originally, only a select few could put data into CVMFS, using services such as OASIS.
  • Now, everyone can add their own files and software to CVMFS.
  • Access at /cvmfs/stash.osgstorage.org/
SLIDE 15

ligo.osgstorage.org

  • stash.osgstorage.org provides unauthenticated access to public files.
  • LIGO has very specific rules about data access and even namespace visibility.
  • Therefore, new features had to be developed in CVMFS to enable VOMS authentication.

SLIDE 16

Secure CVMFS

  • Pull the certificate from the user’s environment.
  • The namespace is protected by authenticated access to the CVMFS HTTP(S) server.
  • Data is authenticated with XRootD HTTP(S) client authentication.
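How the certificate lookup from the user's environment might look on the client side; X509_USER_PROXY is the conventional grid variable and /tmp/x509up_u$(id -u) the conventional default proxy location, while the placeholder file created here stands in for a real voms-proxy-init credential:

```shell
# Demo setup: a placeholder file standing in for a real VOMS proxy credential.
X509_USER_PROXY=$(mktemp); export X509_USER_PROXY

# Sketch of the lookup: honor X509_USER_PROXY if set, otherwise fall back
# to the conventional default location for the current user.
proxy=${X509_USER_PROXY:-/tmp/x509up_u$(id -u)}
if [ -r "$proxy" ]; then
  msg="proxy found: $proxy"
else
  msg="no proxy found; run voms-proxy-init first"
fi
echo "$msg"
```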

SLIDE 17

Secure CVMFS

  • Requires a special setup of the HTTP server for authenticated access (mod_gridsite).
  • XRootD serves data directly from the data servers; it cannot currently proxy authenticated access.

SLIDE 18

What you can do!

  • Update CVMFS on your worker nodes to the 2.2 preview:
  • Feel free to install this locally and test the interface on OSG-Connect.
  • Requires sites to upgrade their CVMFS client; widespread availability will probably occur in July.

yum install --enablerepo=osg-upcoming cvmfs cvmfs-config-osg
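After installing, a minimal client configuration might look like the following; this is a sketch, and the repository list and proxy value are example values to adapt to your site:

```shell
# /etc/cvmfs/default.local (example values, adapt to your site)
CVMFS_REPOSITORIES=stash.osgstorage.org,nova.osgstorage.org
CVMFS_HTTP_PROXY="http://squid.example.org:3128"   # placeholder site proxy
```

Then verify the mount with `cvmfs_config probe`.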