software and computing operations



  1. U.S. CMS Operations Program: Software and Computing Operations
     Christoph Paus, MIT; Stephan Lammel, Fermilab
     USCMS Operation Budget Review, September 7th, 2017

  2. U.S. CMS S&C Operations, Goals
     ▪ Enable high-quality, timely research by
        ▪ processing data
        ▪ distributing data
        ▪ running job submission infrastructure
        ▪ running various data/software/DB services
        ▪ investigating possible improvements
     ==> clearly in US/US physicist interest; goal matches CMS, different focus on enabling it
     ==> strategy is to apply US expertise; increase operations coverage with second 8h shift
     Stephan Lammel, 2017-Sep-07, 2018 U.S. CMS Budget Review - Computing Operations

  3. U.S. CMS S&C Operations, Strategy
     ▪ Smooth, effortless operation:
        ▪ automate where possible
        ▪ make things robust
        ▪ off-load monitoring to shifter
        ▪ effective alerting
     ▪ Look beyond today:
        ▪ what is needed next month/year
        ▪ what becomes available
        ▪ what needs to be improved/evolved
     ▪ What is in it for USCMS:
        ▪ know the data and issues
        ▪ keep US facilities at peak performance
        ▪ see computing research and development opportunities early

  4. U.S. CMS S&C Operations, Areas
     ▪ Tier-0: Operate the Tier-0 infrastructure
        • USCMS people designed and built the Tier-0
        • USCMS contributes significantly to the operation
     ▪ Data Distribution: Operate PhEDEx and Dynamo data distribution
        • USCMS designed and built PhEDEx, AAA, and Dynamo
        • USCMS operates the system with collaboration contribution
     ▪ Data Processing: Re(re)construct data and produce Monte Carlo datasets
        • USCMS designed and built the processing setup
        • USCMS operates the system with collaboration contribution
     ▪ Submission Infrastructure: Schedule and execute production and user jobs on Grid and Cloud resources of sites
        • USCMS co-developed glide-in WMS
        • USCMS designed and set up the Global Pool
        • USCMS operates the system with OSG and collaboration contribution
     ▪ Central Services: Operate various distributed data, database, and software access services
        • USCMS contributed to the development of several services
        • USCMS contributes significantly to the operation
     ▪ Site Support: Monitor health and performance of CMS grid sites
        • USCMS people developed the setup based on WLCG tools
        • USCMS contributes significantly to the daily monitoring

  5. U.S. CMS S&C Operations, Tier-0
     ▪ Tier-0 components consist of:
        ▪ interface to transfer system of StorageManager at P5
        ▪ transfer system to get data from P5 to CERN EOS/MSS
        ▪ Express and PromptCalib
        ▪ Repack data from streamer format into ROOT files
        ▪ PromptReco
        ▪ AlCaSkim
        ▪ data quality monitoring
        ▪ file merge
        ▪ cloud based infrastructure for CPU resources at CERN
     ▪ 2017 Activities:
        ▪ commission new interface to transfer system
        ▪ transfer performance and lost files in EOS
        ▪ data cached on disk reduced
     ▪ USCMS effort:
        ▪ CMS/O&C/CompOps/Tier-0 L3 head at CERN (0.5 FTE costed)
        ▪ Tier-0 operator at CERN (0.3 FTE uncosted)
        ▪ Tier-0 operator at Fermilab (1 FTE subsistence)
        ▪ Tier-0 head/operator at CERN (2 FTE cola)

  6. U.S. CMS S&C Operations, Data Distribution
     ▪ Data Distribution components consist of:
        ▪ PhEDEx transfer system
        ▪ dynamic data management DDM / Dynamo
        ▪ AAA / xrootd federated data service (redirectors, monitoring)
     ▪ 2017 Activities:
        ▪ tape-to-disk staging tests at Tier-1s
        ▪ expanded DDM use
        ▪ lost files due to storage system failures
        ▪ network transfer rates at two of the Tier-1s
        ▪ storage inconsistencies due to race conditions/exceptions
        ▪ increase DDM functionality and capabilities
     ▪ USCMS effort:
        ▪ AAA/xrootd operations at Nebraska (0.5 FTE costed)
        ▪ network performance integration at Nebraska (0.2 FTE costed)
        ▪ storage performance integration at Florida (0.5 FTE costed)
        ▪ transfer team operator at CERN/MIT (0.6 uncosted)
        ▪ DDM/Dynamo support and evolution at MIT (0.3 uncosted)
        ▪ transfer team operator at Fermilab (1 FTE subsistence)
        ▪ CMS/O&C/CompOps/TT L3 head at CERN (1 FTE cola)

  7. U.S. CMS S&C Operations, Data Processing
     ▪ Data Processing tasks consist of:
        ▪ reconstruction of cosmic and pp-collision data
        ▪ re-miniAOD campaign for spring conferences
        ▪ re-reconstruction of 2016 pp-collision data
        ▪ making pile-up Monte Carlo samples for pre-mixing
        ▪ Run 2, phase 1, and phase 2 Monte Carlo samples
     ▪ 2017 Activities:
        ▪ EOS authentication overload with HLT and Tier-0 resources
        ▪ stage-out issues
        ▪ software availability and thus late start of campaigns
        ▪ network and storage overloads
     ▪ USCMS effort:
        ▪ Data Processing operations at Fermilab (1 FTE costed)
        ▪ CMS/O&C/CompOps/P&R L3 head (0.25 FTE uncosted)

  8. U.S. CMS S&C Ops, Submission Infrastructure
     ▪ Submission Infrastructure tasks consist of:
        ▪ operation of the glide-in WMS factories
        ▪ support and evolution of the batch system Global Pool
        ▪ interface with glide-in WMS and HTCondor developers and advise on features/priorities
     ▪ 2017 Activities and Milestones:
        ▪ multi-core pilot tuning (task priorities, retirement policies, and scheduling efficiency)
        ▪ Global Pool stability and increased scalability (500k cores)
        ▪ Singularity integration and deployment (glexec replacement)
        ▪ including I/O resources in job scheduling
     ▪ USCMS effort:
        ▪ GlideIn Factory operations at UCSD (0.2 FTE costed)
        ▪ Submission Infrastructure leadership at UCSD (0.45 FTE costed)

  9. U.S. CMS S&C Operations, Central Services
     ▪ Central Services components consist of:
        ▪ CVMFS for software and MC gridpack distribution
        ▪ DBfroNtier/squid infrastructure of distributed database cache
     ▪ 2017 Activities:
        ▪ squids switched from static config to launchpad discovery
     ▪ USCMS effort:
        ▪ CVMFS operations at Florida (0.3 FTE costed)
        ▪ DBfroNtier/squid operations at Johns Hopkins (0.17 FTE costed)
        ▪ DBfroNtier/squid support at Fermilab (0.1 FTE costed)

  10. U.S. CMS S&C Operations, Site Support
      ▪ Site Support components consist of:
         ▪ SAM and HC of WLCG
         ▪ site readiness and status metrics
         ▪ topology description (VO-feed, SITECONF)
         ▪ dashboard metric displays
      ▪ 2017 Activities:
         ▪ decouple VO-feed from BDII, multi-site support, xrootd
         ▪ finer granularity tests (SAM, HC, PhEDEx links between sites)
         ▪ new pilot startup site test
         ▪ IPv6 storage commissioning/testing
      ▪ USCMS effort:
         ▪ Site Support operator at Fermilab (1.0 FTE subsistence)
         ▪ CMS/O&C/F&S/SS L3 head (0.25 FTE uncosted)

  11. U.S. CMS S&C Operations, Coordination
      ▪ USCMS effort coordinating CMS/O&C:
         ▪ Submission Infrastructure L2 head (0.1 FTE costed)
         ▪ Computing Operations L2 head (0.15 FTE uncosted)
         ▪ Facilities and Services L2 head (0.1 FTE uncosted)
      ▪ USCMS effort coordinating USCMS Ops/O&C:
         ▪ Computing Operations L3 (0.2 FTE uncosted)
         ▪ Guest Scientist Line Management (0.05 FTE uncosted)

  12. U.S. CMS S&C Operations, FY-18 plans
      ▪ Operating, operating, operating...
         ▪ LHC data keeps coming through 2018
      ▪ Reacting to issues/addressing operational needs
         ▪ difficult to plan ahead, except
      ▪ Areas with more of an evolution component, like
         ▪ Submission Infrastructure:
            ▪ need to stay ahead of CPU/core demand: scalability & efficiency
            ▪ high-availability via IPv6 of Global Pool services
            ▪ feeding HTCondor monitoring and factory logs to MonIT
            ▪ develop/set up mechanism to suspend matching of production jobs to a site
         ▪ Data distribution:
            ▪ plan for DDM to become a more sophisticated cache manager

  13. U.S. CMS S&C Operations, Priorities
      ▪ High:
         ▪ Submission Infrastructure: danger of losing glide-in WMS investment; USCMS makes big impact
         ▪ Data Distribution/DDM: know/coordinate which data are stored at which sites (physics)
         ▪ AAA/xrootd: influence/guide future of remote data access (leadership)
         ▪ DBfroNtier/squid: cross experiment/frontier activity (leadership)
      ▪ Moderate:
         ▪ Data Processing: direct knowledge of datasets/processing information would be lost (physics)
         ▪ Site Support: watching out for USCMS sites would be lost
         ▪ Tier-0: we lose connection to data as they are recorded
         ▪ Storage/Network performance integration: we would not be proactive and would incur delay/slower implementation when a plan is ready
         ▪ CVMFS operation: expect CMS to pick this up as service is needed for all sites
      ▪ Low:
