SLIDE 1

U.S. CMS Operations Program

Software and Computing Operations

Christoph Paus, MIT Stephan Lammel, Fermilab

USCMS Operation Budget Review, September 7th, 2017

SLIDE 2

U.S. CMS Operations Program 2018 U.S. CMS Budget Review — Computing Operations Stephan Lammel, 2017-Sep-07

▪ Enable high-quality, timely research by

▪ processing data
▪ distributing data
▪ running the job submission infrastructure
▪ running various data/software/DB services
▪ investigating possible improvements

==> clearly in the US / US-physicist interest

goal matches that of CMS, with a different focus on enabling it

==> strategy is to apply US expertise

increase operations coverage with a second 8-hour shift

S&C Operations, Goals

SLIDE 3

▪ Smooth, effortless operation:

▪ automate where possible
▪ make things robust
▪ off-load monitoring to shifters
▪ effective alerting

▪ Look beyond today:

▪ what is needed next month/year
▪ what becomes available
▪ what needs to be improved/evolved

▪ What is in it for USCMS:

▪ know the data and issues
▪ keep US facilities at peak performance
▪ see computing research and development opportunities early

S&C Operations, Strategy

SLIDE 4

Tier-0

Operate the Tier-0 infrastructure

  • USCMS people designed and built the Tier-0
  • USCMS contributes significantly to the operation

Data Distribution

Operate PhEDEx and Dynamo data distribution

  • USCMS designed and built PhEDEx, AAA, and Dynamo
  • USCMS operates the system with collaboration contribution

Data Processing

Re(re)construct data and produce Monte Carlo datasets

  • USCMS designed and built the processing setup
  • USCMS operates the system with collaboration contribution

Submission Infrastructure

Schedule and execute production and user jobs on Grid and Cloud resources of sites

  • USCMS co-developed glide-in WMS
  • USCMS designed and set up the Global Pool
  • USCMS operates the system with OSG and collab contribution

Central Services

Operate various distributed data, database, and software access services

  • USCMS contributed in the development of several services
  • USCMS contributes significantly to the operation

Site Support

Monitor health and performance of CMS grid sites

  • USCMS people developed the setup based on WLCG tools
  • USCMS contributes significantly to the daily monitoring

S&C Operations, Areas

SLIDE 5

The Tier-0 components consist of:

interface to the StorageManager transfer system at P5

transfer system to get data from P5 to CERN EOS/MSS

Express and PromptCalib

Repack data from streamer format into ROOT files

PromptReco

AlCaSkim

data quality monitoring

file merge

cloud based infrastructure for CPU resources at CERN
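The component chain above (transfer from P5, repack from streamer format into ROOT files, PromptReco, merge) can be sketched as a toy pipeline. This is purely illustrative: the function names and record fields are hypothetical, not the actual WMCore-based Tier-0 code.

```python
# Toy sketch of the Tier-0 chain described above: streamer files arriving
# from P5 are repacked into per-stream (ROOT-format) datasets, which are
# then "promptly reconstructed". All names here are hypothetical.

def repack(streamer_files):
    """Group streamer file records by stream, as the repack step does."""
    datasets = {}
    for f in streamer_files:
        datasets.setdefault(f["stream"], []).append(f["events"])
    return datasets

def prompt_reco(datasets):
    """Stand-in for reconstruction: just tally events per dataset."""
    return {name: sum(chunks) for name, chunks in datasets.items()}

streamers = [
    {"stream": "Express",  "events": 1000},
    {"stream": "PhysicsA", "events": 5000},
    {"stream": "PhysicsA", "events": 3000},
]
reco = prompt_reco(repack(streamers))
# reco == {"Express": 1000, "PhysicsA": 8000}
```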

2017 Activities:

commission new interface to transfer system

transfer performance and lost files in EOS

reduced the amount of data cached on disk

USCMS effort:

CMS/O&C/CompOps/Tier-0 L3 head at CERN (0.5 FTE costed)

Tier-0 operator at CERN (0.3 FTE uncosted)

Tier-0 operator at Fermilab (1 FTE subsistence)

Tier-0 head/operator at CERN (2 FTE cola)

S&C Operations, Tier-0

SLIDE 6

The Data Distribution components consist of:

PhEDEx transfer system

dynamic data management DDM / Dynamo

AAA / xrootd federated data service (redirectors, monitoring)
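As a rough illustration of the kind of decision dynamic data management makes, the sketch below evicts the least recently accessed replicas when a site exceeds its disk quota. This is a simplification under assumed inputs, not Dynamo's actual policy or API.

```python
# Hypothetical DDM-style eviction: when a site is over its disk quota,
# drop the least recently accessed replicas first until under quota.

def replicas_to_evict(replicas, quota_tb):
    """replicas: list of (dataset, size_tb, last_access_unixtime).
    Returns the dataset names to delete, oldest-access first."""
    used = sum(size for _, size, _ in replicas)
    evict = []
    for name, size, _ in sorted(replicas, key=lambda r: r[2]):
        if used <= quota_tb:
            break
        evict.append(name)
        used -= size
    return evict

site = [("/A/Run2016/MINIAOD", 40, 1690000000),
        ("/B/Run2017/MINIAOD", 30, 1700000000),
        ("/C/Run2017/AOD",     50, 1695000000)]
evict = replicas_to_evict(site, 100)
# evict == ["/A/Run2016/MINIAOD"]: dropping the oldest replica
# brings the site from 120 TB down to 80 TB, under the 100 TB quota
```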

2017 Activities:

tape-to-disk staging tests at Tier-1s

expanded DDM use

lost files due to storage system failures

network transfer rates at two of the Tier-1s

storage inconsistencies due to race conditions/exceptions

increase DDM functionality and capabilities

USCMS effort:

AAA/xrootd operations at Nebraska (0.5 FTE costed)

network performance integration at Nebraska (0.2 FTE costed)

storage performance integration at Florida (0.5 FTE costed)

transfer team operator at CERN/MIT (0.6 FTE uncosted)

DDM/Dynamo support and evolution at MIT (0.3 FTE uncosted)

transfer team operator at Fermilab (1 FTE subsistence)

CMS/O&C/CompOps/TT L3 head at CERN (1 FTE cola)

S&C Operations, Data Distribution

SLIDE 7

▪ Data Processing tasks consist of:

▪ reconstruction of cosmic and pp-collision data
▪ re-miniAOD campaign for spring conferences
▪ re-reconstruction of 2016 pp-collision data
▪ making pile-up Monte Carlo samples for pre-mixing
▪ Run 2, Phase-1, and Phase-2 Monte Carlo samples

▪ 2017 Activities:

▪ EOS authentication overload with HLT and Tier-0 resources
▪ stage-out issues
▪ software availability and thus late start of campaigns
▪ network and storage overloads

▪ USCMS effort:

▪ Data Processing operations at Fermilab (1 FTE costed)
▪ CMS/O&C/CompOps/P&R L3 head (0.25 FTE uncosted)

S&C Operations, Data Processing

SLIDE 8

▪ Submission Infrastructure tasks consist of:

▪ operation of the glide-in WMS factories
▪ support and evolution of the Global Pool batch system
▪ interface with glide-in WMS and HTCondor developers and advise on features/priorities
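A toy model of how a multi-core pilot in a glideinWMS-style pool gets filled: jobs request cores and are matched first-fit until the pilot is full. This is a hypothetical simplification of HTCondor partitionable-slot matchmaking, not its real algorithm.

```python
# Illustrative first-fit packing of job core requests into one multi-core
# pilot (a simplification of partitionable-slot matching, names hypothetical).

def fill_pilot(cores_total, job_requests):
    """job_requests: list of (job_id, cores_requested).
    Returns (matched job ids, idle cores left in the pilot)."""
    free = cores_total
    matched = []
    for job_id, cores in job_requests:
        if cores <= free:
            matched.append(job_id)
            free -= cores
    return matched, free

matched, idle = fill_pilot(8, [("prod-1", 4), ("user-1", 8),
                               ("user-2", 2), ("prod-2", 2)])
# prod-1 (4 cores) fits, user-1 (8) no longer does, user-2 and prod-2
# (2 each) fill the rest: matched == ["prod-1", "user-2", "prod-2"], idle == 0
```

Tuning of task priorities, retirement policies, and scheduling efficiency (the 2017 milestones below) is about minimizing exactly this kind of idle-core fraction at pool scale.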

▪ 2017 Activities and Milestones:

▪ multi-core pilot tuning (task priorities, retirement policies, and scheduling efficiency)
▪ Global Pool stability and increased scalability (500k cores)
▪ Singularity integration and deployment (glexec replacement)
▪ including I/O resources in job scheduling

▪ USCMS effort:

▪ GlideIn Factory operations at UCSD (0.2 FTE costed) ▪ Submission Infrastructure leadership at UCSD (0.45 FTE

costed)

S&C Ops, Submission Infrastructure

SLIDE 9

▪ Central Services components consist of:

▪ CVMFS for software and MC gridpack distribution
▪ Frontier/squid infrastructure, the distributed database cache
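The value of the Frontier/squid layer is that database query results are cached near the jobs, so thousands of jobs asking the same conditions question do not each hit the central database. A minimal sketch of that idea, assuming a simple in-memory cache (nothing here is the actual Frontier protocol):

```python
# Hypothetical sketch of a Frontier/squid-style read cache: identical
# queries are answered from cache; only the first one reaches the backend DB.

class QueryCache:
    def __init__(self, backend):
        self.backend = backend          # callable that answers queries (the DB)
        self.cache = {}
        self.hits = self.misses = 0

    def get(self, query):
        if query in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[query] = self.backend(query)
        return self.cache[query]

db_calls = []
def conditions_db(q):                   # stand-in for the central database
    db_calls.append(q)
    return f"payload-for-{q}"

squid = QueryCache(conditions_db)
for _ in range(1000):                   # 1000 jobs asking the same question
    squid.get("alignment/run300000")
# only one backend call was made; the other 999 lookups were cache hits
```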

▪ 2017 Activities:

▪ squids switched from static config to launchpad discovery

▪ USCMS effort:

▪ CVMFS operations at Florida (0.3 FTE costed)
▪ Frontier/squid operations at Johns Hopkins (0.17 FTE costed)
▪ Frontier/squid support at Fermilab (0.1 FTE costed)

S&C Operations, Central Services

SLIDE 10

▪ Site Support components consist of:

▪ SAM and HC of WLCG
▪ site readiness and status metrics
▪ topology description (VO-feed, SITECONF)
▪ dashboard metric displays

▪ 2017 Activities:

▪ decouple VO-feed from BDII, multi-site support, xrootd
▪ finer-granularity tests (SAM, HC, PhEDEx links between sites)
▪ new pilot startup site test
▪ IPv6 storage commissioning/testing

▪ USCMS effort:

▪ Site Support operator at Fermilab (1.0 FTE subsistence)
▪ CMS/O&C/F&S/SS L3 head (0.25 FTE uncosted)
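A site-readiness style metric boils down to an availability fraction over a test window compared against thresholds. A hedged sketch of that computation (the thresholds and status names are illustrative, not the official CMS site-readiness rules):

```python
# Illustrative availability metric from SAM-like test results:
# fraction of passed tests, bucketed into a status. Thresholds hypothetical.

def site_status(test_results, ok_threshold=0.9, warn_threshold=0.8):
    """test_results: list of booleans (did the test pass?).
    Returns (availability fraction, status string)."""
    availability = sum(test_results) / len(test_results)
    if availability >= ok_threshold:
        status = "OK"
    elif availability >= warn_threshold:
        status = "WARNING"
    else:
        status = "ERROR"
    return availability, status

avail, status = site_status([True] * 19 + [False])   # 19 of 20 tests passed
# avail == 0.95, status == "OK"
```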

S&C Operations, Site Support

SLIDE 11

▪ USCMS effort coordinating CMS/O&C:

▪ Submission Infrastructure L2 head (0.1 FTE costed)
▪ Computing Operations L2 head (0.15 FTE uncosted)
▪ Facilities and Services L2 head (0.1 FTE uncosted)

▪ USCMS effort coordinating USCMS Ops/O&C:

▪ Computing Operations L3 (0.2 FTE uncosted)
▪ Guest Scientist Line Management (0.05 FTE uncosted)

S&C Operations, Coordination

SLIDE 12

▪ Operating, operating, operating...

▪ LHC data keeps coming through 2018

▪ Reacting to issues/addressing operational needs

▪ difficult to plan ahead, except in
▪ Areas with a larger evolution component, like

▪ Submission Infrastructure

  • need to stay ahead of CPU/core demand: scalability & efficiency
  • high-availability via IPv6 of Global Pool services
  • feeding HTCondor monitoring and factory logs to MonIT
  • develop/set up a mechanism to suspend matching of production jobs to a site

▪ Data Distribution

  • plan for DDM to become a more sophisticated cache manager

S&C Operations, FY-18 plans

SLIDE 13

High:

Submission Infrastructure

  • danger of losing glide-in WMS investment
  • USCMS makes big impact

Data Distribution/DDM

  • know/coordinate which data are stored at which sites (physics)

AAA/xrootd

  • influence/guide future of remote data access (leadership)

Dbfrontier/squid

  • cross experiment/frontier activity (leadership)

Moderate:

Data Processing

  • direct knowledge of datasets/processing information would be lost (physics)

Site Support

  • watching out for USCMS sites would be lost

Tier-0

  • we lose connection to the data as they are recorded

Storage/Network performance integration

  • not being proactive would incur delays/a slower implementation once plans are ready

CVMFS operation

  • expect CMS to pick this up as service is needed for all sites

Low:

S&C Operations, Priorities

SLIDE 14

▪ USCMS Computing Operations works well and makes a significant impact in CMS!

▪ Tier-0: keeps up with the data taken by the detector
▪ Data Distribution: data at sites are managed actively; to a few sites distribution is limited by network bandwidth
▪ Data Processing: MC and data ready in time for the spring 2017 conferences; keeping up with requests from PPD
▪ Submission Infrastructure: plenty of resources made available for production and user analysis
▪ Central Services: the CVMFS service runs smoothly and Frontier/squid is monitored/managed proactively
▪ Site Support: improved detection and turn-around in case of issues
▪ Anticipated LHC performance is expected to put computing under stress and challenge operations to maintain high efficiency

Summary

4.02 FTEs costed + 3 FTEs subsistence; 2.2 FTEs uncosted; COLA for 3 FTEs
