CMS remote data access (AAA)
Giacinto DONVITO (INFN-Bari)
Thanks to: Brian Bockelman, Ken Bloom (UNL); Tommaso Boccali, Daniele Bonacorsi (INFN)
AAA: Any data, Anytime, Anywhere
AAA Project
- Goal
  - Use resources more effectively through remote data access in CMS
- Sub-goals
  - Low-ceremony/latency access to any single event
  - Reduce data access error rate
  - Overflow jobs from busy sites to less busy ones
  - Use opportunistic resources
  - Make life at T3s easier
- Same namespace, irrespective of data location
  - TFile::Open("root://xrootd.unl.edu//store/foo")
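To illustrate the "same namespace" idea, here is a minimal sketch of how such a URL can be assembled from a redirector host and a logical file name. The helper name `xrootd_url` is hypothetical and is not part of ROOT or CMSSW; only the host and path come from the slide.

```python
def xrootd_url(redirector: str, lfn: str) -> str:
    """Build a root:// URL for a logical file name (LFN) behind a redirector.

    Hypothetical helper for illustration only.
    """
    if not lfn.startswith("/"):
        raise ValueError("LFN must be an absolute path such as /store/foo")
    # The slash after the host separates host and path; the LFN keeps its own
    # leading slash, which yields the double slash seen in the slide.
    return f"root://{redirector}/{lfn}"


url = xrootd_url("xrootd.unl.edu", "/store/foo")
print(url)
```

The resulting string is what would be handed to `TFile::Open`; the job does not need to know which site actually holds the file.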
General overview
- In CMS, the LFN->PFN association is a simple “rule”; no database is involved
- We can use a plain ROOT installation with a given prefix at each site
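A rule-based LFN->PFN mapping of this kind can be sketched in a few lines. This is an assumption-laden illustration: the function name, the site names, and the prefixes below are all hypothetical, and the real per-site rules live in each site's own configuration.

```python
def lfn_to_pfn(lfn: str, site_prefix: str) -> str:
    """Map a logical file name to a physical one by prepending a site prefix.

    No database lookup is involved: the mapping is a pure string rule.
    """
    if not lfn.startswith("/store/"):
        raise ValueError("expected an LFN under /store/")
    return site_prefix.rstrip("/") + lfn


# Hypothetical per-site prefixes, for illustration only.
SITE_PREFIXES = {
    "T2_EXAMPLE_A": "/mnt/hadoop/cms",
    "T2_EXAMPLE_B": "/pnfs/example.org/data/cms/",
}

for site, prefix in SITE_PREFIXES.items():
    print(site, lfn_to_pfn("/store/foo", prefix))
```

Because the rule is deterministic, any client or server can compute the physical path locally, which is what makes a shared namespace cheap to federate.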
[Diagram: CMS jobs, local storage, EU Xrootd redirector, EU Tier2 sites, global Xrootd redirector, US Tier2 sites, EOS at CERN, US Xrootd redirector]
Xrootd world-wide Federation
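The lookup order suggested by the diagram (regional redirector first, then the rest of the federation) can be mimicked with a toy model. This is only a sketch under stated assumptions: the region and site names and the in-memory catalogue are hypothetical, and a real redirector queries its subscribed servers rather than a static table.

```python
# Hypothetical federation content: region -> site -> set of LFNs.
REGIONS = {
    "EU": {"T2_EU_A": {"/store/a"}, "T2_EU_B": {"/store/b"}},
    "US": {"T2_US_A": {"/store/c"}},
}


def locate(lfn, home_region):
    """Return (region, site) for an LFN, trying the home region first."""
    # Regional redirector first, then the other regions, standing in for
    # the hand-off to the global redirector.
    order = [home_region] + sorted(r for r in REGIONS if r != home_region)
    for region in order:
        for site, files in sorted(REGIONS[region].items()):
            if lfn in files:
                return region, site
    return None


print(locate("/store/a", "EU"))
print(locate("/store/c", "EU"))
```

Trying the home region first keeps most traffic on nearby links and only crosses regions when no closer replica exists.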
XRootd: Federating Storage Systems
- Step 1: deploy a seamless global storage interface
- But preserve site autonomy:
  - xrootd plugin maps from global logical filename to physical filename at site
    - Mapping is typically trivial in CMS: /store/* → /store/*
  - xrootd plugin reads from site storage system
    - Examples: HDFS, dCache, Lustre, GPFS, DPM
  - User authentication is also pluggable
    - But we use standard GSI + lcmaps + GUMS
    - A VOMS plugin is also used in production
Status of CMS Federation
- US:
  - T1 (disk) + 8/8 T2s federated
  - Covers 100% of the data for analysis
  - Does not cover files only on tape
- IT:
  - Complete deployment of Xrootd at the Tier1 and all the Tier2s
  - A few Tier3s have also joined
  - Both as clients and as servers
- World:
  - 4 T1s (FNAL, CNAF, RAL, JINR) + 3/4 T2s accessible
  - Monitored but not a “turns your site red” service (yet)
Regulation of Requests
- On average, one analysis job on AOD data needs about 250 kb/s
- To first order, jobs still run at sites with the data
- ~0.25 GB/s average remote read rate
- O(10) GB/s average local read rate
- ~1.5 GB/s PhEDEx transfer rate
- Cases where data is read remotely:
  - Interactive: limited by the number of humans
  - Fallback: limited by the error rate opening files
  - Overflow: limited by scheduling policy
  - Opportunistic: limited by scheduling policy
  - T3: a Tier3 is usually not that “big”
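A quick back-of-the-envelope check puts these rates in proportion. The sketch below assumes that the per-job figure means 250 kilobytes per second and that "O(10) GB/s" can be taken as 10 GB/s; both are assumptions, not statements from the slide.

```python
# Figures quoted on the slide, converted to bytes per second.
JOB_RATE = 250e3     # assumed: 250 kB/s per analysis job on AOD
REMOTE_RATE = 0.25e9  # ~0.25 GB/s average remote read rate
LOCAL_RATE = 10e9     # assumed: O(10) GB/s taken as 10 GB/s
PHEDEX_RATE = 1.5e9   # ~1.5 GB/s PhEDEx transfer rate

# How many average jobs the remote read rate corresponds to.
equivalent_remote_jobs = REMOTE_RATE / JOB_RATE

# Share of all reads (local + remote) that go over the wide-area network.
remote_fraction = REMOTE_RATE / (REMOTE_RATE + LOCAL_RATE)

# Remote reads compared with scheduled PhEDEx transfers.
remote_vs_phedex = REMOTE_RATE / PHEDEX_RATE

print(equivalent_remote_jobs, round(remote_fraction, 3), round(remote_vs_phedex, 3))
```

Under these assumptions remote reads amount to roughly a thousand average jobs and a few percent of all reads, which is consistent with the statement that jobs mostly still run where the data is.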
More on Fallback
- On file-open error, CMS software can retry via an alternate location/protocol
  - Configured by the site admin
  - We fall back to the regional xrootd federation
    - US, EU
    - Could also have inter-region fallback
- Can recover from missing-file errors, but not from corrupted files (more on this later)
- Has more uses than just error recovery ...
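The fallback behaviour described above can be sketched as an ordered list of sources that are tried in turn. This is a simulation, not the CMS software: `FileOpenError`, `open_with_fallback`, and the fake openers are hypothetical stand-ins for the real I/O layer.

```python
class FileOpenError(Exception):
    """Raised when a source cannot open the requested file."""


def open_with_fallback(lfn, sources):
    """Try each (name, opener) pair in order; return the first success."""
    errors = []
    for name, opener in sources:
        try:
            return name, opener(lfn)
        except FileOpenError as exc:
            errors.append(f"{name}: {exc}")
    raise FileOpenError("; ".join(errors))


def make_opener(available):
    """Build a fake opener that only knows the LFNs in `available`."""
    def opener(lfn):
        if lfn not in available:
            raise FileOpenError(f"{lfn} not found")
        return f"handle:{lfn}"
    return opener


# Local storage is missing the file; the regional federation has it.
sources = [
    ("local", make_opener(set())),
    ("regional-federation", make_opener({"/store/foo"})),
]
used, handle = open_with_fallback("/store/foo", sources)
print(used, handle)
```

Only open errors trigger the next source here, which mirrors the limitation noted on the slide: a file that opens but is corrupted is not caught by this mechanism.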
More about Overflow
- GlideinWMS scheduling policy
  - Candidates for overflow:
    - Idle jobs with wait time above threshold (6h)
    - Desired data available in a region supporting overflow
  - Regulation of overflow:
    - Limited number of overflow glideins submitted per source site
- Data access
  - No reconfiguration of job required
    - Uses fallback mechanism
    - Try local access, fall back to remote access on failure
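The selection rules above can be expressed as a small filter. This is a simplified sketch and not GlideinWMS code: the `Job` record, the function name, and the per-site cap on jobs (standing in for the cap on overflow glideins) are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: str
    source_site: str
    idle_hours: float
    data_regions: frozenset  # regions where the desired data is available


def select_overflow(jobs, overflow_regions, wait_threshold_h=6.0, max_per_site=2):
    """Pick idle jobs eligible for overflow, longest-waiting first."""
    selected = []
    per_site = Counter()
    for job in sorted(jobs, key=lambda j: -j.idle_hours):
        if job.idle_hours <= wait_threshold_h:
            continue  # has not waited long enough
        if not (job.data_regions & overflow_regions):
            continue  # data not in a region supporting overflow
        if per_site[job.source_site] >= max_per_site:
            continue  # regulation: cap per source site
        per_site[job.source_site] += 1
        selected.append(job.job_id)
    return selected


jobs = [
    Job("a", "T2_X", 7.0, frozenset({"US"})),
    Job("b", "T2_X", 9.0, frozenset({"US"})),
    Job("c", "T2_X", 8.0, frozenset({"US"})),
    Job("d", "T2_X", 2.0, frozenset({"US"})),
    Job("e", "T2_Y", 10.0, frozenset({"ASIA"})),
]
print(select_overflow(jobs, frozenset({"US", "EU"})))
```

The cap per source site is what keeps overflow from turning into unbounded remote reading when one site is badly backed up.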
Running Opportunistically
- To run CMS jobs at non-CMS sites, we need:
  - Outbound network access
  - Access to CMS data files
    - Xrootd remote access
  - Access to conditions data
    - http proxy
  - Access to CMS software
    - CVMFS (also needs http proxy)
- No need for data pre-placement or any kind of “site preparation”
- Fully opportunistic use of computing resources
- This is also useful for running in the cloud
Fallback++
- Today we can recover when a file is missing from the local storage system
- But corrupted files cause jobs to fail
  - And the job may come back and fail again ...
  - Admin may need to intervene to recover the data
  - User may need to resubmit the job
- Can we do better?
Yes, We Hope
- Concept
  - Fall back on read error
  - Cache remotely read data
  - Insert downloaded data back into storage system
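The three steps of the concept can be sketched with in-memory dictionaries standing in for the storage systems. Everything here is hypothetical: the function names, the use of SHA-256 as the integrity check, and the catalogue of expected checksums are assumptions chosen to keep the example self-contained.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Integrity check used by this sketch (SHA-256 as a stand-in)."""
    return hashlib.sha256(data).hexdigest()


def read_with_healing(lfn, local_store, remote_store, catalog):
    """Read a file, healing the local copy from a remote one if needed."""
    data = local_store.get(lfn)
    if data is not None and checksum(data) == catalog[lfn]:
        return data, "local"
    # Step 1: fall back on read error (missing or corrupted local copy).
    remote = remote_store.get(lfn)
    if remote is None or checksum(remote) != catalog[lfn]:
        raise IOError(f"no valid replica found for {lfn}")
    # Steps 2 and 3: keep the remotely read data and insert it back locally.
    local_store[lfn] = remote
    return remote, "remote (healed)"


good = b"event data"
catalog = {"/store/foo": checksum(good)}
local = {"/store/foo": b"corrupted"}
remote = {"/store/foo": good}

first = read_with_healing("/store/foo", local, remote, catalog)
second = read_with_healing("/store/foo", local, remote, catalog)
print(first[1], second[1])
```

After the first read the local copy is valid again, so later jobs no longer need the remote source or an admin intervention.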
File Healing
Cross-site Replication
- Once we have file healing …
  - Could reduce the replication level from 2 to 1 at sites that have HDFS or a similar infrastructure
  - Use cross-site redundancy instead
    - Would need to enforce the replication policy at a higher level
    - May not be a good idea for hot data
    - Need to consider the impact on performance
Performance
Mostly CMS application-specific stuff
- Improved remote read performance by combining multiple reads into vector reads
  - Eliminates many round-trips
- Working on BitTorrent-like capabilities in the CMS application
  - Read from multiple xrootd sources
  - Balance load away from the slower source
  - React in O(1) minute time frame
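The round-trip saving from vector reads can be shown with a small planner that batches individual (offset, length) reads into a few requests. This is a sketch only: the function name and the `max_chunks_per_request` parameter are hypothetical, and the chunk limit used in the example is arbitrary.

```python
def plan_vector_reads(reads, max_chunks_per_request=1024):
    """Group (offset, length) reads into batches, one batch per request."""
    if max_chunks_per_request < 1:
        raise ValueError("max_chunks_per_request must be positive")
    ordered = sorted(reads)  # ascending offset keeps access sequential
    return [
        ordered[i:i + max_chunks_per_request]
        for i in range(0, len(ordered), max_chunks_per_request)
    ]


reads = [(8192, 50), (0, 100), (100000, 10), (4096, 200)]
requests = plan_vector_reads(reads, max_chunks_per_request=3)

round_trips_scalar = len(reads)     # one round-trip per individual read
round_trips_vector = len(requests)  # one round-trip per batch
print(round_trips_scalar, round_trips_vector)
```

Over a wide-area link the latency of each round-trip dominates small reads, so cutting the number of requests matters more than the bytes moved.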
AAA Dashboard Monitoring
AAA Accounting
Examples of using Xrootd
15/04/2013: T2_Legnaro storage was down for a dCache upgrade. The site kept accepting and running analysis jobs without any problem by exploiting the Xrootd fallback
INFN tests using XRootd:
- Reprocessing @ CNAF reading RAW data with Xrootd fallback from FNAL
- Reprocessing @ T2_IT reading MC RAW with Xrootd fallback from CNAF
Regional set-up
- INFN in Italy has a dedicated Xrootd redirector where all the Italian resources are registered
- INFN Tier3s could also join the federation as “data providers” and not only as “consumers”
- All the data available on this redirector can be accessed with good bandwidth and very low latency
  - Thanks to the GARR-X infrastructure
High-availability of redirector
- Working on high availability for the Xrootd redirector
- We are exploring the possibility of using an intelligent DNS set-up
  - The idea is to have a few instances of the xrootd redirector distributed geographically
  - In case of failure of one server, the DNS is automatically reconfigured to use the others
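One way to picture the DNS idea is a health check that decides which redirector addresses stay published. The sketch below is an assumption about how such a set-up might behave, not a description of the deployed one: the function name, the liveness callback, and the addresses (taken from the ranges reserved for documentation) are all hypothetical.

```python
def healthy_records(instances, is_alive):
    """Return the addresses that should remain in the DNS record set."""
    alive = [addr for addr in instances if is_alive(addr)]
    # If every check fails, keep the full list rather than publishing an
    # empty record set, since the checker itself may be at fault.
    return alive if alive else list(instances)


# Hypothetical redirector instances at three different locations.
INSTANCES = ["192.0.2.10", "198.51.100.20", "203.0.113.30"]
down = {"198.51.100.20"}

records = healthy_records(INSTANCES, lambda addr: addr not in down)
print(records)
```

Clients keep using a single redirector hostname; only the addresses behind it change when an instance fails.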
Summary
- XRootd storage federation is rapidly expanding and proving useful within CMS
- We hope to do more
  - Automatic error recovery
  - Opportunistic usage
  - Improving performance
- Work on-going to provide geographically