CMS remote data access (AAA), Giacinto DONVITO (INFN-Bari)



SLIDE 1

CMS remote data access (AAA)

Giacinto DONVITO (INFN-Bari)
Thanks to: Brian Bockelman, Ken Bloom (UNL), Tommaso Boccali, Daniele Bonacorsi (INFN)

SLIDE 2

Any data, Anytime, Anywhere

AAA Project

  • Goal
      • Use resources more effectively through remote data access in CMS
  • Sub-goals
      • Low-ceremony/latency access to any single event
      • Reduce data access error rate
      • Overflow jobs from busy sites to less busy ones
      • Use opportunistic resources
      • Make life at T3s easier

SLIDE 3

General overview

  • Same namespace, irrespective of data location
      ✦ TFile::Open("root://xrootd.unl.edu//store/foo")
  • In CMS the LFN->PFN association is a simple "rule". No DB involved
  • We can use a plain ROOT installation with a given prefix at each site
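The rule-based mapping above can be illustrated with a minimal sketch. It assumes the rule is plain string concatenation; the function names and the site prefix `/hypothetical/site/prefix` are invented for illustration and are not taken from CMS software. Only the `root://xrootd.unl.edu//store/foo` URL comes from the slide.

```python
def lfn_to_pfn(lfn: str, site_prefix: str) -> str:
    """Map a logical file name to a site-local physical name by prefixing it.

    Hypothetical helper: no database lookup, only a string rule.
    """
    if not lfn.startswith("/store/"):
        raise ValueError(f"not a /store/ logical file name: {lfn}")
    return site_prefix.rstrip("/") + lfn


def remote_url(lfn: str, redirector_host: str) -> str:
    """Build a root:// URL for the same logical name via an Xrootd host."""
    # The LFN keeps its leading slash, which gives the double slash
    # seen in the slide's TFile::Open example.
    return f"root://{redirector_host}/{lfn}"


if __name__ == "__main__":
    print(lfn_to_pfn("/store/foo", "/hypothetical/site/prefix"))
    print(remote_url("/store/foo", "xrootd.unl.edu"))
```

The same logical name is used in both functions, which is the point of the shared namespace: only the prefix changes between local and remote access.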

SLIDE 4

Xrootd world-wide Federation

[Diagram labels: CMS jobs; Local storage; EU Xrootd Redirector; EU Tier2 sites; Global Xrootd Redirector; US Xrootd Redirector; US Tier2 sites; EOS CERN]

SLIDE 5

XRootd: Federating Storage Systems

  • Step 1: deploy a seamless global storage interface
  • But preserve site autonomy:
      • xrootd plugin maps from global logical filename to physical filename at site
          • Mapping is typically trivial in CMS: /store/* → /store/*
      • xrootd plugin reads from site storage system
          • Example: HDFS, dCache, Lustre, GPFS, DPM
      • User authentication also pluggable
          • But we use standard GSI + lcmaps + GUMS
          • A VOMS plugin is also used in production

SLIDE 6

Status of CMS Federation

  • US:
      • T1 (disk) + 8/8 T2s federated
      • Covers 100% of the data for analysis
      • Does not cover files only on tape
  • IT:
      • Complete deployment of Xrootd on the Tier1 + all the Tier2s
      • A few Tier3s have also joined
      • Both as clients and as servers
  • World:
      • 4 T1s (FNAL, CNAF, RAL, JINR) + 3/4 T2s accessible
      • Monitored but not a "turns your site red" service (yet)

SLIDE 7

Regulation of Requests

  • On average 1 analysis job on AOD data needs about 250kb/s
  • To 1st order, jobs still run at sites with the data
      • ~0.25 GB/s average remote read rate
      • O(10) GB/s average local read rate
      • ~1.5 GB/s PhEDEx transfer rate
  • Cases where data is read remotely:
      • Interactive
          • limited by # humans
      • Fallback
          • limited by error rate opening files
      • Overflow
          • limited by scheduling policy
      • Opportunistic
          • limited by scheduling policy
      • T3
          • usually a Tier3 is not that "big"
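
To put the rates above in proportion, the following back-of-the-envelope sketch converts them into an equivalent number of concurrent jobs and a remote share of total reads. It assumes "250kb/s" means 250 kilobytes per second and takes the O(10) GB/s local rate as exactly 10 GB/s, so the results are indicative only.

```python
# Figures quoted on the slide. Assumptions: "kb" is read as kilobytes,
# and the order-of-magnitude local rate is taken as exactly 10 GB/s.
PER_JOB_RATE = 250e3      # bytes/s needed by one AOD analysis job
REMOTE_READ_RATE = 0.25e9  # bytes/s, average remote reads
LOCAL_READ_RATE = 10e9     # bytes/s, average local reads


def equivalent_jobs(aggregate_rate: float, per_job_rate: float) -> float:
    """Number of average jobs that a given aggregate rate could feed."""
    return aggregate_rate / per_job_rate


def remote_fraction(remote_rate: float, local_rate: float) -> float:
    """Share of all read traffic that is served remotely."""
    return remote_rate / (remote_rate + local_rate)


if __name__ == "__main__":
    jobs = equivalent_jobs(REMOTE_READ_RATE, PER_JOB_RATE)
    share = remote_fraction(REMOTE_READ_RATE, LOCAL_READ_RATE)
    print(f"remote reads correspond to about {jobs:.0f} average jobs")
    print(f"remote share of read traffic: about {share:.1%}")
```

Under these assumptions the remote read rate corresponds to roughly a thousand average analysis jobs, a few percent of the total read traffic.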

SLIDE 8

More on Fallback

  • On file open error, CMS software can retry via an alternate location/protocol
      • Configured by the site admin
      • We fall back to the regional xrootd federation
          • US, EU
          • Could also have inter-region fallback
      • Can recover from missing-file errors, but not from corrupted files (more on this later)
  • Has more uses than just error recovery ...

SLIDE 9

More about Overflow

  • GlideinWMS scheduling policy
      • Candidates for overflow:
          • Idle jobs with wait time above threshold (6h)
          • Desired data available in a region supporting overflow
      • Regulation of overflow:
          • Limited number of overflow glideins submitted per source site
  • Data access
      • No reconfiguration of job required
          • Uses fallback mechanism
          • Try local access, fall back to remote access on failure
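
The overflow policy above can be pictured with a small selection routine. This is a simplified sketch and not GlideinWMS code: the `IdleJob` record, the region names and the per-site cap expressed in jobs (rather than glideins) are assumptions made for illustration; only the 6-hour threshold comes from the slide.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass
class IdleJob:
    job_id: str
    source_site: str   # site where the job is currently queued
    wait_hours: float  # time spent idle so far
    data_region: str   # region whose federation can serve the input data


def select_overflow(
    jobs: Iterable[IdleJob],
    overflow_regions: Set[str],
    max_per_site: int,
    threshold_hours: float = 6.0,
) -> List[IdleJob]:
    """Pick overflow candidates, capped per source site."""
    picked: List[IdleJob] = []
    per_site: Dict[str, int] = {}
    for job in jobs:
        if job.wait_hours <= threshold_hours:
            continue  # has not waited long enough
        if job.data_region not in overflow_regions:
            continue  # data not reachable through an overflow region
        used = per_site.get(job.source_site, 0)
        if used >= max_per_site:
            continue  # regulation: this source site reached its cap
        per_site[job.source_site] = used + 1
        picked.append(job)
    return picked


if __name__ == "__main__":
    queue = [
        IdleJob("a", "T2_X", 7.0, "US"),
        IdleJob("b", "T2_X", 8.0, "US"),
        IdleJob("c", "T2_Y", 2.0, "US"),
    ]
    print([j.job_id for j in select_overflow(queue, {"US"}, max_per_site=1)])
```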

SLIDE 10

Running Opportunistically

  • To run CMS jobs at non-CMS sites, we need:
      • Outbound network access
      • Access to CMS datafiles
          • Xrootd remote access
      • Access to conditions data
          • HTTP proxy
      • Access to CMS software
          • CVMFS (also needs HTTP proxy)
  • No need for data pre-placement or any kind of "site preparation"
  • Fully opportunistic use of computing resources
  • This is also useful for running in the cloud

SLIDE 11

Fallback++

  • Today we can recover when a file is missing from the local storage system
  • But corrupted files cause jobs to fail
      • And the job may come back and fail again ...
      • Admin may need to intervene to recover the data
      • User may need to resubmit the job
  • Can we do better?

SLIDE 12

Yes, We Hope

  • Concept
      • Fall back on read error
      • Cache remotely read data
      • Insert downloaded data back into storage system

SLIDE 13

File Healing

SLIDE 14

Cross-site Replication

  • Once we have file healing …
      • Could reduce the replication level from 2 to 1 at sites that have an HDFS or similar infrastructure
      • Use cross-site redundancy instead
          • Would need to enforce the replication policy at a higher level
          • May not be a good idea for hot data
          • Need to consider impact on performance

SLIDE 15

Performance

Mostly CMS application-specific stuff

  • Improved remote read performance by combining multiple reads into vector reads
      • Eliminates many round-trips
  • Working on bit-torrent-like capabilities in CMS application
      • Read from multiple xrootd sources
      • Balance load away from slower source
      • React in O(1) minute time frame
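
Why vector reads save round-trips can be seen by counting requests. The sketch below only groups (offset, length) pairs into batches; it does not talk to an xrootd server, and the batch size limit is a hypothetical parameter rather than a documented protocol value.

```python
from typing import List, Sequence, Tuple

Chunk = Tuple[int, int]  # (offset, length) of one read


def batch_reads(requests: Sequence[Chunk], max_chunks_per_request: int) -> List[List[Chunk]]:
    """Group individual reads into vector-read batches, ordered by offset."""
    if max_chunks_per_request < 1:
        raise ValueError("max_chunks_per_request must be at least 1")
    ordered = sorted(requests)
    return [
        ordered[start:start + max_chunks_per_request]
        for start in range(0, len(ordered), max_chunks_per_request)
    ]


if __name__ == "__main__":
    # Ten scattered reads of 4 kB each.
    reads = [(offset * 1_000_000, 4096) for offset in range(10)]
    batches = batch_reads(reads, max_chunks_per_request=4)
    print(f"{len(reads)} round-trips as single reads, {len(batches)} as vector reads")
```

Each batch costs one network round-trip, so on a high-latency wide-area link the saving grows with the number of small reads per batch.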

SLIDE 16

AAA Dashboard Monitoring

SLIDE 17

AAA Dashboard Monitoring

SLIDE 18

AAA Dashboard Monitoring

SLIDE 19

AAA Dashboard Monitoring

SLIDE 20

AAA Accounting

SLIDE 21

Examples of using Xrootd

  • 15/04/2013: T2_Legnaro storage was down for a dCache upgrade. The site kept accepting and running analysis jobs without any problem by exploiting Xrootd fallback
  • INFN tests using XRootd:
      • Reprocessing @ CNAF reading RAW data with Xrootd fallback from FNAL
      • Reprocessing @ T2_IT reading MC RAW with Xrootd fallback from CNAF

SLIDE 22

Regional set-up

  • INFN in Italy has a dedicated Xrootd redirector where all the Italian resources are registered
  • INFN Tier3s could also join the federation as "data providers" and not only as "consumers"
  • All the data available on this redirector could be accessed with good bandwidth and very low latency
      • Thanks to the GARR-X infrastructure

SLIDE 23

High-availability of redirector

  • Working on high availability for the Xrootd redirector
  • We are exploring the possibility of using an intelligent DNS set-up
  • The idea is to have a few instances of the xrootd redirector distributed geographically
  • In case of failure of one server, the DNS is automatically reconfigured to use the others
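
The DNS idea above can be sketched as a routine that decides which redirector instances a DNS alias should point to. Everything here is hypothetical: the host names, the `is_alive` probe and the choice to keep all instances published when every probe fails are assumptions, since the slides describe the set-up as still being explored.

```python
from typing import Callable, List, Sequence


def redirectors_to_publish(
    instances: Sequence[str],
    is_alive: Callable[[str], bool],
) -> List[str]:
    """Return the redirector hosts that the DNS alias should resolve to."""
    alive = [host for host in instances if is_alive(host)]
    # Assumed design choice: if every probe fails, keep all instances
    # published rather than emptying the alias (the monitor itself may be broken).
    return alive if alive else list(instances)


if __name__ == "__main__":
    hosts = ["redirector-a.example", "redirector-b.example", "redirector-c.example"]
    down = {"redirector-b.example"}
    print(redirectors_to_publish(hosts, lambda host: host not in down))
```

In a real deployment the probe would be an actual service check and the result would be pushed to the DNS server; both parts are left out of this sketch.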

SLIDE 24

Summary

  • XRootd storage federation is rapidly expanding and proving useful within CMS
  • We hope to do more
      • Automatic error recovery
      • Opportunistic usage
      • Improving performance
  • Work is ongoing to provide geographic high availability for the Xrootd redirector