SLIDE 1

An opportunistic HTCondor pool inside an interactive-friendly Kubernetes cluster

Presented by Igor Sfiligoi, UCSD, for the PRP team

HTCondor Week, May 2019

SLIDE 2

Outline

  • Where do I come from?
  • What we did
  • How is it working?
  • Looking ahead

SLIDE 3

The Pacific Research Platform

  • The PRP was originally created as a regional networking project
  • Establishing end-to-end links between 10Gbps and 100Gbps

SLIDE 4

The Pacific Research Platform

  • The PRP was originally created as a regional networking project
  • Establishing end-to-end links between 10Gbps and 100Gbps
  • Expanded nationally since
  • And beyond, too

[Figure: map of the PRPv2 Nautilus cluster. 40G FIONA nodes across the US (UIUC, U Hawaii, NCAR-WY, UWashington, UIC, Internet2 Chicago, NYC and Kansas City; 160-192TB each), transoceanic 10G/35TB nodes at UvA (Netherlands), KISTI (Korea), U of Guam and U of Queensland (Australia), plus Asia-Pacific RP partners including Singapore, connected via the CENIC/Pacific Wave link.]

SLIDE 5

The Pacific Research Platform

  • Recently the PRP evolved into a major resource provider, too
  • Because scientists really need more than bandwidth tests
  • They need to share their data at high speed and compute on it, too
  • The PRP now also provides
  • Extensive compute power – about 330 GPUs and 3.5k CPU cores
  • A large distributed storage area – about 2 PBytes
  • Select user communities now directly use all the resources the PRP has to offer
  • Still doing all the network R&D in the same setup, too
  • We call it the Nautilus cluster

SLIDE 6

Kubernetes as a resource manager

  • Industry standard – large and active development and support community
  • Container based – more freedom for users
  • Flexible scheduling – allows for easy mixing of service and user workloads

SLIDE 7

Designed for interactive use

  • Users expect to get what they need when they need it – makes for very happy users
  • And use is typically short in duration – so congestion happens only very rarely

SLIDE 8

Opportunistic use

  • No congestion → idle compute resources → time for opportunistic use

SLIDE 9

Kubernetes priorities

  • Priorities natively supported in Kubernetes – low priority pods only start if there is no demand from higher priority ones
  • Preemption out of the box – low priority pods are killed the moment a high priority pod needs the resources
  • Perfect for opportunistic use – just keep enough low-priority pods in the system

https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/
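As a minimal sketch of how this maps onto manifests (the class name, value and images are illustrative, not the actual Nautilus configuration): a low-valued PriorityClass for the opportunistic pods, referenced via priorityClassName in the pod spec.

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: opportunistic              # hypothetical class name
    value: -10                         # below the default of 0: yields to all normal pods
    globalDefault: false
    description: Preemptible opportunistic filler pods
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: opportunistic-example
    spec:
      priorityClassName: opportunistic # starts only when nothing else wants the resources
      containers:
      - name: worker
        image: busybox                 # placeholder workload
        command: ["sleep", "3600"]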


SLIDE 10

HTCondor as the OSG helper

PRP wanted to give opportunistic resources to Open Science Grid (OSG) users
  • Since they can tolerate preemption

But OSG does not have native support for Kubernetes
  • It supports only resources provided by batch systems

We thus instantiated an HTCondor pool
  • As a fully Kubernetes/Containerized deployment

SLIDE 11

HTCondor in a (set of) container(s)

Putting HTCondor in a set of containers is not hard
  • Just create an image with HTCondor binaries in it!
  • Configuration injected through Kubernetes pod config

HTCondor deals nicely with ephemeral IPs
  • The Collector must be discoverable – Kubernetes service
  • Everything else just works from there

Persistency needed for the Schedd(s)
  • And potentially for the Negotiator, if long term accounting is desired
  • Everything else can live off ephemeral storage
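A minimal sketch of the two Kubernetes pieces mentioned above, with hypothetical names and namespace (the actual Nautilus manifests are not shown in the slides): a ConfigMap that injects the HTCondor configuration into the pods, and a Service that gives the Collector a stable, discoverable address.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: condor-config                  # hypothetical name
      namespace: htcondor                  # hypothetical namespace
    data:
      condor_config.local: |
        # All daemons find the collector via its stable Service DNS name
        CONDOR_HOST = condor-collector.htcondor.svc.cluster.local
        DAEMON_LIST = MASTER, STARTD
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: condor-collector               # stable name + routing IP for the collector pod
      namespace: htcondor
    spec:
      selector:
        app: condor-collector
      ports:
      - port: 9618                         # HTCondor's default port
        protocol: TCP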


SLIDE 12

Service vs Opportunistic

Collector and Schedd(s) deployed as high priority service pods
  • Should be running at all times
  • Few pods, and not heavy CPU or GPU users, so OK
  • Using a Kubernetes Deployment to restart the pods in case of HW problems and/or maintenance
  • Kubernetes Service used to get a persistent routing IP to the collector pod

Startds deployed as low priority pods – pure opportunistic
  • Hundreds of pods in the Kubernetes queue at all times, many in Pending state
  • HTCondor Startd configured to accept jobs as soon as it starts and forever after
  • If a pod is preempted, HTCondor gets a SIGTERM and has a few seconds to go away
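Putting the pieces together, a low-priority startd Deployment might look roughly like this (image name, priority class and replica count are illustrative):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: condor-startd
      namespace: htcondor
    spec:
      replicas: 200                            # keep plenty queued; extras just sit Pending
      selector:
        matchLabels:
          app: condor-startd
      template:
        metadata:
          labels:
            app: condor-startd
        spec:
          priorityClassName: opportunistic     # the hypothetical low-priority class from above
          terminationGracePeriodSeconds: 10    # the SIGTERM-to-SIGKILL window on preemption
          containers:
          - name: startd
            image: example/htcondor-startd:latest   # placeholder image with HTCondor binaries
            resources:
              limits:
                cpu: "2"
                nvidia.com/gpu: 1              # when offering GPU slots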


SLIDE 13

Then came the users

Everything was working nicely, until we let in real users
  • Well, until we had more than a single user

OSG users got used to relying on Containers
  • So they can use any weird software they like

But the HTCondor Startd is already running inside a container!
  • Cannot launch a user-provided container
  • Not without elevated privileges

So I need to provide user-specific execute pods
  • How many of each kind?


SLIDE 15

Dealing with many opportunistic pod types

Having idle Startd pods is not OK anymore
  • A different kind of pod could use that resource
  • A glidein-like setup would solve that

Keeping pods without users is not OK anymore
  • They will just terminate without ever running a job
  • Who should regulate the "glidein pressure"?

How do I manage fair share between different pod types?
  • Kubernetes scheduling is basically just priority-FIFO

How am I to know what Container images users want?
  • Ideally, HTCondor should have native Kubernetes support

SLIDE 16

Dealing with many opportunistic pod types (cont.)

I know how to implement this.

SLIDE 17

Dealing with many opportunistic pod types (cont.)

I was told this is coming.

SLIDE 18

Dealing with many opportunistic pod types (cont.)

In OSG-land, glideinWMS solves this for me.

SLIDE 19

Dealing with many opportunistic pod types (cont.)

No concrete plans on how to address these yet.

SLIDE 20

Dealing with many opportunistic pod types

For now, I just periodically adjust the balance
  • A completely manual process

Currently supporting only a few, well-behaved users
  • Maybe not optimal, but good enough

But looking forward to a more automated future
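In practice the manual knob is just the replica count on each user-specific startd Deployment; rebalancing amounts to a hand-run command along these lines (the deployment name and count are made up for illustration):

    kubectl scale deployment startd-groupA --replicas=150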


SLIDE 21

Are side-containers an option?

Ideally, I do want to use user-provided, per-job Containers
  • Running HTCondor and user jobs in separate pods is not an option, due to the opportunistic nature

But Kubernetes pods are made of several Containers
  • Could I run HTCondor in a dedicated Container?
  • Then start the user job in a side-container?

Pretty sure this is currently not supported
  • But, at least in principle, it fits the architecture
  • Would also need HTCondor native support

[Diagram: a single pod containing an HTCondor container and a user job container]
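In sketch form, the idea would be a single pod whose spec lists the two containers side by side (purely illustrative, since, as noted above, nothing currently drives the second container's lifecycle from HTCondor):

    apiVersion: v1
    kind: Pod
    metadata:
      name: execute-pod                      # hypothetical
    spec:
      priorityClassName: opportunistic       # still a preemptible, low-priority pod
      containers:
      - name: htcondor                       # runs the startd and talks to the pool
        image: example/htcondor-startd:latest
      - name: user-job                       # user-provided, per-job container
        image: example/user-analysis:latest
        command: ["sleep", "3600"]           # placeholder; both containers share the pod's network namespace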


SLIDE 22

Will nested Containers be a reality soon?

It has been pointed out to me that the latest CentOS supports unprivileged Singularity
  • Have not tried it out yet
  • Probably I should

Cannot currently assume all of my nodes have a recent-enough kernel
  • But we will eventually get there

SLIDE 23

Looking ahead

  • Looking forward to a more automated future
  • Will do what I have to myself
  • Would be happier if I could use off-the-shelf solutions

SLIDE 24

A final picture

  • Opportunistic GPU usage over the past few months

SLIDE 25

Summary

  • We created an opportunistic HTCondor pool in the PRP Kubernetes cluster
  • OSG users can now use any otherwise-unused cycles
  • The lack of nested containerization forces us to have multiple execute pod types
  • Some micromanagement is currently needed; hoping for more automation in the future

SLIDE 26

Acknowledgements

This work was partially funded by US National Science Foundation (NSF) awards CNS-1456638, CNS-1730158, ACI-1540112, ACI-1541349, OAC-1826967, OAC-1450871, OAC-1659169 and OAC-1841530.