Challenges in Dynamic Deployment of Condor Across Distributed Environments



SLIDE 1

Challenges in Dynamic Deployment of Condor Across Distributed Environments

Andrew Pavlo
Computer Sciences Department
University of Wisconsin-Madison
pavlo@cs.wisc.edu
http://www.cs.wisc.edu/~pavlo/

SLIDE 2

www.cs.wisc.edu/condor

Problem Statement

› Difficult to allocate reliable resources across multiple sites:
  ∘ Batch Systems (Scheduling)
  ∘ Network (Public vs. Private, Firewalls)
  ∘ Availability
  ∘ Capabilities
  ∘ Etiquette

SLIDE 3

Overlay Grid Network

› Create a custom global Condor pool using Glidein technologies.
› Global fair share at user and group level.
› Uniformity across all Grids (OSG, EGEE)
› “Reduces grid-related errors by 50%”

SLIDE 4

CRONUS

› ATLAS Virtual Computing Cluster
› Condor-G Glideins
› Condor-C Job Submissions
› GCB Network Nodes
› Goal: 10,000+ jobs

Sanjay Padhi
HEP @ University of Wisconsin

SLIDE 5

CRONUS

[Diagram: CRONUS architecture spanning Wisconsin and CERN. Labeled components: Job Database, Results Database, Job Submit Script, Data, Matchmaker, Tier-1 Central Managers, GCB Servers, Condor-G/C Submit Nodes, Laptop Users, Condor-G Glideins, Condor-C links. Tier-1 Clouds: IT, DE, UK, TW, FR, JP, NL, ES, US, CA.]

SLIDE 6

Deployment Challenges

› Unknown Network Capabilities
› Cleaning Up on Execution Node
› Retrieving Job Attributes
› Scalability Issues

SLIDE 7

Unknown Network Capabilities

› Problem: How can we determine the network environment of execute nodes?
› Firewalls, Public vs. Private IPs
› GCB mitigates the problem, but is error-prone.

SLIDE 8

Solution: Network Probe

› Contact Condor servers @ Wisconsin to determine network information.
› Only enable GCB if needed.
› Source code is available!

[Diagram: Glidein Node and Probe Server separated by a Firewall, exchanging Test Traffic and Probe Results. Outcome: Enable GCB? Yes/No.]
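The decision step of the probe can be sketched as follows. This is an illustration only, not the actual probe source: the ProbeResult fields and the enable_gcb rule are assumptions about what a probe exchange might report, chosen to match the "only enable GCB if needed" bullet.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Hypothetical summary of one probe exchange (not the real probe's format)."""
    local_address: str       # address the glidein node sees on its own interface
    observed_address: str    # address the probe server saw the test traffic come from
    inbound_reachable: bool  # could the probe server connect back to the node?


def enable_gcb(result: ProbeResult) -> bool:
    """Return True only when the node does not look directly reachable."""
    local = ipaddress.ip_address(result.local_address)
    # A private address, or an address rewritten in transit, suggests NAT.
    behind_nat = local.is_private or result.local_address != result.observed_address
    # A failed connect-back suggests a firewall blocking inbound traffic.
    return behind_nat or not result.inbound_reachable
```

A node with a public address that the server can reach keeps GCB off; a private address or a blocked connect-back turns it on.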

SLIDE 9

Cleaning Up on Execution Node

› Problem: How do we make sure that our Glideins are actually doing work and not wasting cycles?
› Must handle severed network connections.

SLIDE 10

Solution: Shutdown Exprs.

› New expressions allow Condor daemons to shut down individually and not be restarted by the Master.

Glidein Condor configuration file:

    STARTD.DAEMON_SHUTDOWN = \
        State == "Claimed" && \
        Activity == "Idle" && \
        (CurrentTime - EnteredCurrentActivity) > 600

    MASTER.DAEMON_SHUTDOWN = \
        STARTD_StartTime == 0
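To make the logic of the two expressions concrete, here is a plain-Python paraphrase. It is a sketch, not how Condor evaluates ClassAd expressions. The function names are hypothetical, and it assumes STARTD_StartTime reads 0 once the startd is no longer running, as the slide's expression implies.

```python
def startd_should_shutdown(state, activity, current_time,
                           entered_current_activity, idle_limit=600):
    """Mirror STARTD.DAEMON_SHUTDOWN: claimed but idle for more than idle_limit seconds."""
    idle_for = current_time - entered_current_activity
    return state == "Claimed" and activity == "Idle" and idle_for > idle_limit


def master_should_shutdown(startd_start_time):
    """Mirror MASTER.DAEMON_SHUTDOWN: exit once the startd is gone (assumed start time 0)."""
    return startd_start_time == 0
```

Taken together, a glidein whose claim sits idle past the limit stops its startd, and the master then exits instead of restarting it.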

SLIDE 11

Retrieving Job Attributes

› Problem: How can we get additional information about Condor-C jobs when they are executing on Glideins?
› Use only existing, reliable Condor mechanisms.

SLIDE 12

Solution: Copy Attributes List

› Provide a list of attributes to copy back to the Condor-C job's ClassAd on the submit node.
› Resolves $$(<Parameter>) at runtime.

Submit-side Condor configuration file:

    CONDORC_ATTRS_TO_COPY = \
        MATCH_FileSystemDomain, \
        MATCH_UidDomain, ....

Condor-C submission file:

    +Remote_Env = \
        "FileSystemDomain=$$(FileSystemDomain)"
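The two mechanisms above can be illustrated with a small Python sketch. It models ClassAds as plain dictionaries and is not Condor's implementation. The helper names expand_dollar_dollar and copy_attrs are hypothetical, and leaving an unknown $$(Name) untouched is an assumption made for the example.

```python
import re

# Matches $$(Name), where Name is an attribute of the matched machine.
_DOLLAR_DOLLAR = re.compile(r"\$\$\(([A-Za-z_]\w*)\)")


def expand_dollar_dollar(text, machine_ad):
    """Replace each $$(Name) in text with the matched machine's value for Name."""
    # ClassAd attribute names are case-insensitive, so compare in lower case.
    lookup = {name.lower(): value for name, value in machine_ad.items()}

    def substitute(match):
        # Unknown names are left as written (an assumption of this sketch).
        return str(lookup.get(match.group(1).lower(), match.group(0)))

    return _DOLLAR_DOLLAR.sub(substitute, text)


def copy_attrs(remote_job_ad, attrs_to_copy, submit_job_ad):
    """Copy the listed attributes, when present, back into the submit-side job ad."""
    for name in attrs_to_copy:
        if name in remote_job_ad:
            submit_job_ad[name] = remote_job_ad[name]
    return submit_job_ad
```

For example, with a machine ad of {"FileSystemDomain": "cs.wisc.edu"}, the Remote_Env string from the slide expands to "FileSystemDomain=cs.wisc.edu".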

SLIDE 13

Scalability Issues

› Problem: How can we increase the number of jobs per central manager and GCB node?
› Preliminary tests showed only 1,000 jobs could reliably be submitted for each Tier-1 central manager.

SLIDE 14

Solution: Internal Improvements

› Improved core ClassAd library: faster attribute look-ups and parsing.
› Refactored scheduling algorithms.
› Increased scalability of GCB libraries.
› Localhost communication optimizations.
› Effort is still ongoing...

SLIDE 15

Summary

› Network Probe
› Daemon Shutdown Expressions
› Condor-C Copy Attributes List
› Scalability Improvements
› Questions?