Glideins for CMS on OSG Jeff Dost (UCSD) Overview Architecture - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Glideins for CMS on OSG

Jeff Dost (UCSD)

slide-2
SLIDE 2

Overview

  • Architecture
  • Concept of a Global Queue
  • Operations
slide-3
SLIDE 3

What are glideins?

  • GlideinWMS is an implementation of a pilot Workload Management System
  • A pilot is simply a grid job that lands on a worker node and reserves a slot in advance for a user job.
  • When it gets there, it calls home to retrieve the user job
  • We call pilot jobs glideins in GlideinWMS
slide-4
SLIDE 4

Why use glideins?

  • Allows CMS to have a global queue to implement priorities
  • Site failures are not seen by the end user
  • Direct grid submission requires overhead.
  • If a pilot is already on a WN and not currently “claimed” when a user submits a job, the startup overhead is greatly reduced.
  • Efficiency significantly increases on average if you have a continuous workflow of many jobs on sites for long periods of time (like CMS)

slide-5
SLIDE 5

Overview

  • Architecture
  • Concept of a Global Queue
  • Operations
slide-6
SLIDE 6

Architecture

  • Components of WMS
  • Glidein Internals
  • Topologies of Production Systems
  • Support Teams
slide-7
SLIDE 7

GlideinWMS Components

  • User Pool: implementation of the global queue
  • Glidein Frontend: watches the global queue, requests resources
  • Glidein Factory: submits glideins in response to resource requests
slide-8
SLIDE 8

User Pool

  • The user pool looks like any other Condor pool
  • Except that instead of sitting on a local cluster, the pool slots are spread out on Sites all over the grid
  • It has a Condor queue that user jobs join on submission
  • This is what the Frontend checks periodically
  • When new glideins start, the slots they reserve join the Condor pool
  • NOTE: This is independent of the underlying batch system the Site runs!

slide-9
SLIDE 9

Glidein Frontend

  • The Frontend is responsible for checking on waiting user jobs and sending requests to the Factory to submit glideins as needed
  • User Pool / Frontend operators monitor user jobs and spot problem users
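
The Frontend's periodic check can be pictured with a toy model. This is only a sketch and not GlideinWMS code: the names `IdleJob` and `glidein_requests`, and the policy of requesting one glidein per idle job at each acceptable site minus the glideins already waiting unclaimed there, are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class IdleJob:
    """A waiting user job in the User Pool queue (hypothetical model)."""
    job_id: int
    desired_sites: list


def glidein_requests(idle_jobs, unclaimed_glideins):
    """Return {site: number of new glideins to request from the Factory}.

    Assumed policy: demand at a site is the number of idle jobs that
    could run there; glideins already unclaimed at the site are subtracted.
    """
    demand = Counter()
    for job in idle_jobs:
        for site in job.desired_sites:
            demand[site] += 1

    requests = {}
    for site, wanted in demand.items():
        missing = wanted - unclaimed_glideins.get(site, 0)
        if missing > 0:
            requests[site] = missing
    return requests


if __name__ == "__main__":
    jobs = [IdleJob(1, ["UCSD"]), IdleJob(2, ["UCSD", "Nebraska"])]
    print(glidein_requests(jobs, {"UCSD": 1}))
```

A real Frontend applies far more elaborate matchmaking and throttling; the point here is only the division of labour: the Frontend decides how many glideins are needed where, and the Factory does the submitting.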

slide-10
SLIDE 10

Glidein Factory

  • The Factory receives requests from the Frontend and submits glideins to requested Sites using Condor-G
  • Knowledge about how to submit to various Sites is stored in the Factory configuration
  • Factory Operators perform routine maintenance on the Factory as well as monitor glideins to ensure they are running on Sites without error.

slide-11
SLIDE 11

Architecture

  • Components of WMS
  • Glidein Internals
  • Topologies of Production Systems
  • Support Teams
slide-12
SLIDE 12

Startup Validation

  • Users don't need to worry about Site problems
  • Glideins do startup validation. If a WN does not have an adequate environment for a job to run, the glidein terminates immediately and reports why.
  • User jobs will never land on a node that fails validation
  • “Black hole nodes” do not affect the end user
slide-13
SLIDE 13

Validation Examples

  • Checks that CMSSW is available
  • If gLExec is there, test if it works
  • If a Squid proxy cache is available, glideins will try to use it
  • Ensure pilot proxy has long enough lifetime
  • Other internal GlideinWMS checks to ensure the glidein can run before it starts
  • In the future, add validation similar to SAM Tests
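
The "fail fast and report why" behaviour of startup validation can be sketched as below. This is a simplified illustration, not the actual glidein validation scripts: the worker node is modelled as a plain dict, and the key names and the one-hour `min_proxy_seconds` threshold are assumptions.

```python
def validate_node(node, min_proxy_seconds=3600):
    """Return (ok, reason) for a worker node described by a dict.

    All keys and the proxy threshold are hypothetical; a real glidein
    probes the node itself instead of reading a dict.
    """
    if not node.get("cmssw_available", False):
        return False, "CMSSW not available"

    # gLExec is optional, but if it is installed it must actually work.
    if node.get("glexec_installed", False) and not node.get("glexec_works", False):
        return False, "gLExec installed but not working"

    if node.get("proxy_seconds_left", 0) < min_proxy_seconds:
        return False, "pilot proxy lifetime too short"

    # A missing Squid cache is not a failure; the glidein simply does not use one.
    return True, "ok"


if __name__ == "__main__":
    good = {"cmssw_available": True, "proxy_seconds_left": 7200}
    bad = {"cmssw_available": True, "glexec_installed": True,
           "glexec_works": False, "proxy_seconds_left": 7200}
    print(validate_node(good))
    print(validate_node(bad))
```

Because the glidein exits before joining the pool when a check fails, no user job is ever matched to that node.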
slide-14
SLIDE 14

Notes on gLExec

  • If available on the WNs, glideins will use it
  • Two levels of protection:
      • Protects the glidein itself from a malicious user
      • Protects users who run on the same glidein from each other
  • Additional benefit of running gLExec:
      • Admins can find the real user in the gLExec logs
slide-15
SLIDE 15

Glidein Lifetime

  • Glideins don't reserve slots forever.
  • If a glidein is idle for 20 minutes with no user jobs to claim it, it terminates.
  • Factory Operators monitor global time wasted
  • Otherwise the glidein lives as long as we define it to.
  • We typically set its lifetime to the MaxWallClockTime or MaxCPUTime (whichever is shorter) from BDII minus a small delta
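
The lifetime rule can be written out as a small calculation. This is a sketch under stated assumptions: both limits are taken to be in minutes, the 10-minute `delta` is an invented placeholder for the "small delta", and the function names are hypothetical. Only the 20-minute idle timeout and the "shorter of the two limits" rule come from the slide.

```python
IDLE_TIMEOUT_MIN = 20  # idle timeout stated on the slide


def glidein_lifetime_minutes(max_wall_clock, max_cpu, delta=10):
    """Lifetime = shorter of the two published limits, minus a safety delta.

    `delta` is an assumed value; the slide only says "a small delta".
    """
    return min(max_wall_clock, max_cpu) - delta


def should_terminate(idle_minutes, age_minutes, lifetime_minutes):
    """A glidein exits when it has sat idle too long or reached its lifetime."""
    return idle_minutes >= IDLE_TIMEOUT_MIN or age_minutes >= lifetime_minutes


if __name__ == "__main__":
    lifetime = glidein_lifetime_minutes(max_wall_clock=2880, max_cpu=1440)
    print(lifetime)
    print(should_terminate(idle_minutes=5, age_minutes=100, lifetime_minutes=lifetime))
```

Subtracting the delta lets the glidein shut down cleanly on its own before the Site's batch system would kill it at the limit.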

slide-16
SLIDE 16

Glideins Protect User Jobs

  • User jobs are not tied to the pilots they land on
  • If a pilot fails, the user job will just restart on a new pilot somewhere else. It requires no user resubmission

slide-17
SLIDE 17

Architecture

  • Components of WMS
  • Glidein Internals
  • Topologies of Production Systems
  • Support Teams
slide-18
SLIDE 18

CMS Production + MC

[Diagram labels: WMAgent, schedd, schedd, collector, frontend, factory, factory; location labels: CERN, UCSD, CERN, T2s, CERN (x3), FNAL (x3), CERN, T1s]

Single User Pilots; DN with Role=production

* A T1-only gwms system also exists at FNAL
  • Not relevant to T2/T3; left out of this talk
slide-19
SLIDE 19

CMS AnaOps

[Diagram labels: CRAB2, schedd, schedd, collector, frontend, factory, factory, factory; location labels: UCSD, UCSD, UCSD, UCSD, GOC, CERN, T2s, T3s]

Multi-User Pilots; DN with Role=pilot

slide-20
SLIDE 20

Architecture

  • Glidein Internals
  • Components of WMS
  • Topologies of Production Systems
  • Support Teams
slide-21
SLIDE 21

Support Teams

  • cms-wms-support (funded by CMS)
      • cms-wms-support@physics.ucsd.edu
      • James Letts et al.
      • All complaints about Users go here
  • osg-gfactory-support (funded by OSG)
      • osg-gfactory-support@physics.ucsd.edu
      • Dost, Mortensen et al.
      • All complaints about glideins go here
  • T1 Only Support
      • Not relevant to T2s / T3s, thus left out of this talk
slide-22
SLIDE 22

Overview

  • Architecture
  • Concept of a Global Queue
  • Operations
slide-23
SLIDE 23

Global Queue

  • User priority is no longer controlled at the Site level but globally in the glideinWMS User Pool
  • Exploring ways to make the Global Queue even more Site independent by exploiting Frontend matchmaking
  • One such example is the Overflow setup
slide-24
SLIDE 24

Overflow

[Diagram: Queue holding Job 1, Job 2, Job 3. One job is requesting to run at Nebraska (the data it wants is there) and has been pending >6h.]

  • If Jobs for a site are Pending in the Global Queue for more than 6 hours, run the job elsewhere
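
The overflow rule amounts to widening a job's list of acceptable sites once it has waited too long. The sketch below is illustrative only: in practice this is done through Frontend matchmaking configuration, and the function name `sites_for_job` and the idea of a fixed `overflow_sites` list are assumptions. The 6-hour threshold is the one given on the slide.

```python
OVERFLOW_AFTER_HOURS = 6  # threshold stated on the slide


def sites_for_job(desired_site, pending_hours, overflow_sites):
    """Return the sites where a job may run.

    The job keeps its desired site (where its data lives); once it has
    been pending for more than the threshold, the assumed overflow sites
    are added as well.
    """
    if pending_hours > OVERFLOW_AFTER_HOURS:
        extra = [site for site in overflow_sites if site != desired_site]
        return [desired_site] + extra
    return [desired_site]


if __name__ == "__main__":
    print(sites_for_job("Nebraska", 2, ["UCSD", "Wisc"]))
    print(sites_for_job("Nebraska", 7, ["UCSD", "Wisc"]))
```

A job that overflows to another site no longer has its data locally, so it has to read it remotely from the original site's storage.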

slide-25
SLIDE 25

Overflow

[Diagram: Queue holding Job 1, Job 2, Job 3; Frontend; sites Nebraska, Wisc, UCSD. The Frontend requests glideins at UCSD and Wisc.]

slide-26
SLIDE 26

Overflow

[Diagram: sites UCSD, Wisc, Nebraska. Job 2 lands on a glidein at UCSD but then uses xrootd to access Nebraska storage.]

slide-27
SLIDE 27

Overview

  • Architecture
  • Concept of a Global Queue
  • Operations
slide-28
SLIDE 28

Role of cms-wms-support

  • Control which sites glideins are requested at and what should run there
  • Identify problematic user jobs
  • Investigate held user jobs
  • Monitor health of overflow
  • Configure Global Queue
  • Configure special matchmaking such as overflow
  • In the future, configure CMS overflow to opportunistic sites and even to clouds
slide-29
SLIDE 29

Role of osg-gfactory-support

  • Report Site issues through GOC and Savannah ticketing systems
  • Work closely with Site Admins to help debug problems
  • Temporarily stop and resume submission as needed during site downtimes
  • Configure Glidein Factory to submit to new resources
  • Update Factory configuration to reflect Site changes (e.g. decommission / replace CEs)

slide-30
SLIDE 30

Conclusion

  • Glidein System jointly operated between CMS and OSG
  • People power at CERN, FNAL, and UCSD
  • Hardware at GOC, CERN, FNAL, UCSD
  • CMS is one of ~12 Communities served by the OSG Glidein Factory