Challenges in Dynamic Deployment of Condor Across Distributed Environments

  1. Challenges in Dynamic Deployment of Condor Across Distributed Environments
     Andrew Pavlo
     Computer Sciences Department, University of Wisconsin-Madison
     pavlo@cs.wisc.edu
     http://www.cs.wisc.edu/~pavlo/

  2. Problem Statement
     › Difficult to allocate reliable resources across multiple sites:
       ∘ Batch Systems (Scheduling)
       ∘ Network (Public vs. Private, Firewalls)
       ∘ Availability
       ∘ Capabilities
       ∘ Etiquette
     www.cs.wisc.edu/condor

  3. Overlay Grid Network
     › Create custom global Condor pool using Glidein technologies.
     › Global fair share at user and group level.
     › Uniformity across all Grids (OSG, EGEE)
     › “Reduces grid-related errors by 50%”

  4. CRONUS
     › ATLAS Virtual Computing Cluster
     › Condor-G Glideins
     › Condor-C Job Submissions
     › GCB Network Nodes
     › Goal: +10,000 jobs
     Sanjay Padhi, HEP @ University of Wisconsin

  5. CRONUS
     [Architecture diagram. Labels: Laptop Users, Job Submit Script, Job, Condor-C Submit Nodes, Condor-G/C, Condor-G, Condor-C Matchmaker, GCB Servers, Glideins, Tier-1 Central Managers, Tier-1 Clouds (DE, UK, JP, IT, NL, US, TW, CA, FR, ES), CERN, Wisconsin, Database, Data, Results]

  6. Deployment Challenges
     › Unknown Network Capabilities
     › Cleaning Up on Execution Node
     › Retrieving Job Attributes
     › Scalability Issues

  7. Unknown Network Capabilities
     › Problem: How can we determine the network environment of execute nodes?
     › Firewalls, Public vs. Private IPs
     › GCB mitigates the problem, but is error-prone.

  8. Solution: Network Probe
     › Contact Condor servers @ Wisconsin to determine network information.
     › Only enable GCB if needed.
     › Source code is available!
     [Diagram labels: Glidein Node, Firewall, Probe Server, Test Traffic, Probe Results, Enable GCB? Yes/No]
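The decision the probe feeds ("only enable GCB if needed") might be sketched as follows. This is a minimal illustration, not the actual probe code: the `ProbeResult` fields and the `needs_gcb` function are hypothetical names, and the rule used here (private address, address rewritten in transit, or no inbound reachability) is an assumption about what "needed" means.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Hypothetical summary of one probe exchange (field names are assumptions)."""
    local_ip: str            # address the glidein node sees on its own interface
    observed_ip: str         # address the probe server saw the test traffic come from
    inbound_reachable: bool  # whether the probe server could connect back to the node


def needs_gcb(result: ProbeResult) -> bool:
    """Return True if the node appears to sit behind NAT or a blocking firewall."""
    is_private = ipaddress.ip_address(result.local_ip).is_private
    is_natted = result.local_ip != result.observed_ip
    return is_private or is_natted or not result.inbound_reachable


# A node on a private network, unreachable from outside: enable GCB.
behind_firewall = ProbeResult("192.168.1.5", "128.105.0.10", False)
# A node with a public address that the server can reach: leave GCB off.
open_node = ProbeResult("128.105.0.20", "128.105.0.20", True)
```

Keeping the decision separate from the network exchange means the rule can be checked without any live probe server.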

  9. Cleaning Up on Execution Node
     › Problem: How do we make sure that our Glideins are actually doing work and not wasting cycles?
     › Must handle severed network connections.

  10. Solution: Shutdown Exprs.
     › New expressions allow Condor daemons to shut down individually and not be restarted by the Master.
     Glidein Condor Configuration File:
         STARTD.DAEMON_SHUTDOWN = \
             State == "Claimed" && \
             Activity == "Idle" && \
             (CurrentTime - EnteredCurrentActivity) > 600
         MASTER.DAEMON_SHUTDOWN = \
             STARTD_StartTime == 0
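To make the logic of the two expressions concrete, here is a small sketch that evaluates the same conditions against plain Python dictionaries standing in for daemon ClassAds. The function names are hypothetical, and real ClassAd evaluation (undefined values, type rules) is not modeled. The reading of `STARTD_StartTime == 0` as "the startd is no longer running" is an assumption.

```python
def startd_should_shutdown(ad: dict, idle_limit: int = 600) -> bool:
    """Mirror of STARTD.DAEMON_SHUTDOWN: claimed but idle for more than idle_limit seconds."""
    return (
        ad["State"] == "Claimed"
        and ad["Activity"] == "Idle"
        and (ad["CurrentTime"] - ad["EnteredCurrentActivity"]) > idle_limit
    )


def master_should_shutdown(ad: dict) -> bool:
    """Mirror of MASTER.DAEMON_SHUTDOWN: true once the startd start time reads 0."""
    return ad["STARTD_StartTime"] == 0


# A slot that has been claimed and idle for 700 seconds, e.g. after a severed connection.
stuck_slot = {
    "State": "Claimed",
    "Activity": "Idle",
    "CurrentTime": 10_700,
    "EnteredCurrentActivity": 10_000,
}
# A slot that is claimed and running a job.
busy_slot = dict(stuck_slot, Activity="Busy")
```

With these inputs the stuck slot satisfies the startd condition and the busy slot does not. Once the startd has exited, the master condition becomes true and the whole glidein can go away.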

  11. Retrieving Job Attributes
     › Problem: How can we get additional information about Condor-C jobs when they are executing on Glideins?
     › Use only existing, reliable Condor mechanisms.

  12. Solution: Copy Attributes List
     › Provide a list of attributes to copy back to Condor-C job's ClassAd on submit node.
     › Resolves $$(<Parameter>) at runtime.
     Submit Side Condor Configuration File:
         CONDORC_ATTRS_TO_COPY = \
             MATCH_FileSystemDomain, \
             MATCH_UidDomain, ....
     Condor-C Submission File:
         +Remote_Env = \
             "FileSystemDomain=$$(FileSystemDomain)"
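The two mechanisms on this slide can be illustrated with a short sketch that uses dictionaries in place of ClassAds. It is not Condor's implementation: the function names are hypothetical, and recording each resolved value under a `MATCH_` prefix in the remote job ad is an assumption inferred from the attribute names in the configuration above.

```python
import re

# Matches $$(Name) references such as $$(FileSystemDomain).
_DOLLAR_DOLLAR = re.compile(r"\$\$\((\w+)\)")


def resolve_dollar_dollar(template: str, machine_ad: dict, job_ad: dict) -> str:
    """Substitute $$(Name) from the matched machine ad and record MATCH_Name in job_ad."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        value = str(machine_ad[name])
        job_ad["MATCH_" + name] = value  # assumed bookkeeping, see lead-in
        return value
    return _DOLLAR_DOLLAR.sub(substitute, template)


def copy_attrs(remote_ad: dict, local_ad: dict, attrs_to_copy: list) -> dict:
    """Return a copy of local_ad updated with the listed attributes found in remote_ad."""
    updated = dict(local_ad)
    for name in attrs_to_copy:
        if name in remote_ad:
            updated[name] = remote_ad[name]
    return updated


machine_ad = {"FileSystemDomain": "cs.wisc.edu"}   # example value, not from the slides
remote_job_ad = {}
remote_env = resolve_dollar_dollar(
    "FileSystemDomain=$$(FileSystemDomain)", machine_ad, remote_job_ad
)
submit_side_ad = copy_attrs(
    remote_job_ad,
    {"ClusterId": 1},
    ["MATCH_FileSystemDomain", "MATCH_UidDomain"],
)
```

Attributes on the copy list that the remote ad does not define (here `MATCH_UidDomain`) are skipped rather than copied as empty values. That is a design choice of this sketch, not documented behavior.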

  13. Scalability Issues
     › Problem: How can we increase the number of jobs per central manager and GCB node?
     › Preliminary tests showed only 1,000 jobs could reliably be submitted for each Tier-1 central manager.

  14. Solution: Internal Improvements
     › Improved core ClassAd library: faster attribute look-ups and parsing.
     › Refactored scheduling algorithms.
     › Increased scalability of GCB libraries.
     › Localhost communication optimizations.
     › Effort is still ongoing...

  15. Summary
     › Network Probe
     › Daemon Shutdown Expressions
     › Condor-C Copy Attributes List
     › Scalability Improvements
     › Questions?
