LHCb, Vac, Vcycle status
Andrew McNab, University of Manchester
LHCb, Vac, Vcycle - Andrew.McNab@cern.ch - GridPP technical meeting 2
LHCb status
- Running production jobs in VMs at 3 UK Vac sites and on 2 IaaS Cloud sites using Vcycle
– Manchester, Lancaster, Oxford; Imperial and CERN
– Both Vac and Vcycle are advertised as GridPP products
– Vac has been a supported LHCb platform since last year
– Vcycle now adopted by LHCb too
- (LHCb hasn't tried running the HLT as a Cloud service, since it has
been a production DIRAC site for several years)
- LHCb's VM architecture is done by us, using the Pilot VM model also
used to make GridPP DIRAC and ATLAS VMs
- DIRAC Pilot 2.0 with improved monitoring, VM support, and
modularity will also be joint CERN and Manchester work
- More VM slots at sites for LHCb would be welcome!
LHCb jobs in VMs
- CLOUD jobs in VMs managed by Vcycle on OpenStack
- CLOUD.CERN.ch is ~500 VM slots
- VAC jobs in VMs managed by Vac of course
Vac
- On each physical node, the Vac VM factory daemon runs to create and apply contextualization to transient VMs
- Multiple VM flavours (“VM types”) are supported, ~1 per experiment
- Each site or Vac “space” is composed of autonomous factory nodes
– All using the same /etc/vac.conf
- Factories communicate with each other via UDP
– Type of VM to start in a free slot based on what else is running and target shares
– So no headnode central point of failure; robust against losing individual nodes
- Aims for reliability and robustness through simplicity
– VM instantiation failure rate << 1/1000 – much better than typical IaaS sites
- Running LHCb production jobs since last year; and ATLAS production jobs at Manchester (40K+ jobs), Lancaster, Oxford since early April
- Documentation, RPMs, links to GitHub at www.gridpp.ac.uk/vac
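As an illustration of choosing the VM type for a free slot from "what else is running and target shares", here is a minimal sketch. It is not Vac's actual algorithm: the function name, the dictionary inputs, and the rule "start the type with the lowest running-to-share ratio" are all assumptions made for the example. In a real factory the running counts would be gathered from the other factories over UDP.

```python
def choose_vm_type(target_shares, running_counts):
    """Return the VM type that is furthest below its target share.

    target_shares  -- hypothetical mapping of VM type name to relative share
    running_counts -- hypothetical mapping of VM type name to the number of
                      VMs of that type currently running in the space
    """
    candidates = [
        (running_counts.get(name, 0) / share, name)
        for name, share in target_shares.items()
        if share > 0  # a type with no share is never started
    ]
    if not candidates:
        raise ValueError("at least one VM type needs a positive target share")
    # Lowest running/share ratio wins; ties fall back to alphabetical order.
    return min(candidates)[1]


# lhcb has twice atlas's share but only 3 VMs against atlas's 2,
# so lhcb has the lower ratio (1.5 against 2.0) and gets the free slot.
next_type = choose_vm_type({"lhcb": 2.0, "atlas": 1.0}, {"lhcb": 3, "atlas": 2})
```

Because every factory can evaluate the same rule on its own view of the space, no headnode is needed to make the decision.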
Vcycle on OpenStack etc
- Use Vac approach to run VMs on IaaS cloud platforms
- Python daemon manages lifecycle of VMs in tenancy
– (Re)creates VMs using boot image and user_data
- Supports multiple tenancies and multiple vmtypes per tenancy
- Doesn't need to know about task queues etc
– VMs are black boxes: created, run, shut down, then deleted
– Vcycle can be run by the experiment, site, or a third-party
- In production for LHCb at CERN since early May (~500 VMs)
- Running production ATLAS and LHCb VMs in the gridpp-vcycle tenancy at Imperial College
- Being evaluated for ATLAS at CERN
- Sources in Vac GitHub area
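The lifecycle described above (create, run, shut down, delete, then recreate) can be sketched as a single pass of a manager loop. This is only an illustration, not Vcycle's code: `FakeCloud`, `one_cycle`, the state strings, and the naming scheme are hypothetical, and the in-memory class stands in for a real IaaS API so that the example runs on its own.

```python
import itertools


class FakeCloud:
    """In-memory stand-in for an IaaS tenancy (hypothetical, for illustration)."""

    def __init__(self):
        self.vms = {}  # VM name -> state ("running" or "shutdown")
        self._ids = itertools.count(1)

    def create(self, vmtype, image, user_data):
        # A real client would pass the boot image and user_data to the cloud API.
        name = f"vcycle-{vmtype}-{next(self._ids)}"
        self.vms[name] = "running"
        return name

    def delete(self, name):
        del self.vms[name]


def one_cycle(cloud, vmtype, max_vms, image, user_data):
    """Delete finished VMs, then create new ones up to max_vms."""
    # The manager only looks at VM state; it knows nothing about task queues.
    for name, state in list(cloud.vms.items()):
        if state == "shutdown":
            cloud.delete(name)
    created = []
    while len(cloud.vms) < max_vms:
        created.append(cloud.create(vmtype, image, user_data))
    return created


cloud = FakeCloud()
first = one_cycle(cloud, "lhcb", 3, "boot-image", "user-data")
cloud.vms[first[0]] = "shutdown"  # the VM ran out of work and shut itself down
second = one_cycle(cloud, "lhcb", 3, "boot-image", "user-data")
```

A daemon would call something like `one_cycle` periodically for each tenancy and vmtype, which is what lets the VMs themselves stay black boxes.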
Immediate plans
- LHCb
– Begin work on Pilot 2.0
– New monitoring framework
– Multiple concurrent single-processor payloads in one pilot job or pilot VM
– Improve TimeLeft handling, for better elastic MC jobs and multiple payload jobs
- Vac
– CloudInit support
– Increase robustness of UDP protocol if high (50%?) packet loss
– Increase scalability from present level of hundreds of VMs
– Generic condor worker VM based on CernVM condor support
- Vcycle
– Man page, Admin Guide, RPMs