Monitoring HTCondor Andrew Lahiff STFC Rutherford Appleton - PowerPoint PPT Presentation

Monitoring HTCondor Andrew Lahiff STFC Rutherford Appleton Laboratory European HTCondor Site Admins Meeting 2014

Introduction • Two aspects of monitoring – General overview of the system • How many running/idle jobs? By user/VO? By schedd? • How full is the farm? • How many draining worker nodes? – More detailed views • What are individual jobs doing? • What’s happening on individual worker nodes? • Health of the different components of the HTCondor pool • ...in addition to Nagios

Introduction • Methods – Command line utilities – Ganglia – Third-party applications (which run command-line tools or use python API)

Command line • Three useful commands – condor_status • Overview of the pool (including jobs, machines) • Information about specific worker nodes – condor_q • Information about jobs in the queue – condor_history • Information about completed jobs

Overview of jobs -bash-4.1$ condor_status -collector Name Machine RunningJobs IdleJobs HostsTotal RAL-LCG2@condor01.gridpp.rl. condor01.gridpp.rl 10608 8355 11347 RAL-LCG2@condor02.gridpp.rl. condor02.gridpp.rl 10616 8364 11360

Overview of machines -bash-4.1$ condor_status -total Total Owner Claimed Unclaimed Matched Preempting Backfill X86_64/LINUX 11183 95 10441 592 0 0 0 Total 11183 95 10441 592 0 0 0

Jobs by schedd -bash-4.1$ condor_status -schedd Name Machine TotalRunningJobs TotalIdleJobs TotalHeldJobs arc-ce01.gridpp.rl.a arc-ce01.g 2388 1990 13 arc-ce02.gridpp.rl.a arc-ce02.g 2011 1995 31 arc-ce03.gridpp.rl.a arc-ce03.g 4272 1994 9 arc-ce04.gridpp.rl.a arc-ce04.g 1424 2385 12 arc-ce05.gridpp.rl.a arc-ce05.g 1 0 6 cream-ce01.gridpp.rl cream-ce01 266 0 0 cream-ce02.gridpp.rl cream-ce02 247 0 0 lcg0955.gridpp.rl.ac lcg0955.gr 0 0 0 lcgui03.gridpp.rl.ac lcgui03.gr 3 0 0 lcgui04.gridpp.rl.ac lcgui04.gr 0 0 0 lcgvm21.gridpp.rl.ac lcgvm21.gr 0 0 0 TotalRunningJobs TotalIdleJobs TotalHeldJobs Total 10612 8364 71

Jobs by user, schedd -bash-4.1$ condor_status -submitters Name Machine RunningJobs IdleJobs HeldJobs group_ALICE.alice.alice043@g arc-ce01.gridpp.rl 0 0 0 group_ALICE.alice.alicesgm@g arc-ce01.gridpp.rl 540 0 1 group_ATLAS.atlas_pilot.tatl arc-ce01.gridpp.rl 142 0 0 group_ATLAS.prodatls.patls00 arc-ce01.gridpp.rl 82 5 0 group_CMS.cms.cmssgm@gridpp. arc-ce01.gridpp.rl 1 0 0 group_CMS.cms_pilot.ttcms022 arc-ce01.gridpp.rl 214 390 0 group_CMS.cms_pilot.ttcms043 arc-ce01.gridpp.rl 68 100 0 group_CMS.prodcms.pcms004@gr arc-ce01.gridpp.rl 78 476 4 group_CMS.prodcms.pcms054@gr arc-ce01.gridpp.rl 12 910 0 group_CMS.prodcms_multicore. arc-ce01.gridpp.rl 47 102 0 group_DTEAM_OPS.ops.ops047@g arc-ce01.gridpp.rl 0 0 0 group_LHCB.lhcb_pilot.tlhcb0 arc-ce01.gridpp.rl 992 0 2 group_NONLHC.snoplus.snoplus arc-ce01.gridpp.rl 0 0 0 …

…Jobs by user RunningJobs IdleJobs HeldJobs group_ALICE.alice.al 0 0 0 group_ALICE.alice.al 3500 368 5 group_ALICE.alice_pi 0 0 0 group_ATLAS.atlas.at 0 0 0 group_ATLAS.atlas.at 0 0 0 group_ATLAS.atlas_pi 414 12 10 group_ATLAS.atlas_pi 0 0 2 group_ATLAS.prodatls 354 36 11 group_CMS.cms.cmssgm 1 0 0 group_CMS.cms_pilot. 371 2223 0 group_CMS.cms_pilot. 0 0 1 group_CMS.cms_pilot. 68 200 0 group_CMS.prodcms.pc 188 1905 10 group_CMS.prodcms.pc 312 3410 0 group_CMS.prodcms_mu 47 102 0 …

condor_q [root@arc-ce01 ~]# condor_q -- Submitter: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:64454> : arc- ce01.gridpp.rl.ac.uk ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 794717.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob ) 794718.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob ) 794719.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob ) 794720.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob ) 794721.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob ) 794722.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob ) 794723.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob ) 794725.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob ) 794726.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob ) … 3502 jobs; 0 completed, 0 removed, 1528 idle, 1965 running, 9 held, 0 suspended

Multi-core jobs -bash-4.1$ condor_q -global - constraint 'RequestCpus > 1’ -- Schedd: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:39356> ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 832677.0 pcms004 12/5 14:33 0+00:15:07 R 0 2.0 (gridjob ) 832717.0 pcms004 12/5 14:37 0+00:12:02 R 0 0.0 (gridjob ) 832718.0 pcms004 12/5 14:37 0+00:00:00 I 0 0.0 (gridjob ) 832719.0 pcms004 12/5 14:37 0+00:00:00 I 0 0.0 (gridjob ) 832893.0 pcms004 12/5 14:47 0+00:00:00 I 0 0.0 (gridjob ) 832894.0 pcms004 12/5 14:47 0+00:00:00 I 0 0.0 (gridjob ) …

Multi-core jobs • Custom print format -bash-4.1$ condor_q -global -pr queue_mc.cpf -- Schedd: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:39356> ID OWNER SUBMITTED RUN_TIME ST SIZE CMD CORES 832677.0 pcms004 12/5 14:33 0+00:00:00 R 2.0 (gridjob) 8 832717.0 pcms004 12/5 14:37 0+00:00:00 R 0.0 (gridjob) 8 832718.0 pcms004 12/5 14:37 0+00:00:00 I 0.0 (gridjob) 8 832719.0 pcms004 12/5 14:37 0+00:00:00 I 0.0 (gridjob) 8 832893.0 pcms004 12/5 14:47 0+00:00:00 I 0.0 (gridjob) 8 832894.0 pcms004 12/5 14:47 0+00:00:00 I 0.0 (gridjob) 8 … https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=ExperimentalCustomPrintFormats

Jobs with specific DN -bash-4.1$ condor_q -global -constraint 'x509userproxysubject=="/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1”’ -- Schedd: arc-ce03.gridpp.rl.ac.uk : <130.246.181.25:62763> ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 678275.0 tatls015 12/2 17:57 2+06:07:15 R 0 2441.4 (arc_pilot ) 681762.0 tatls015 12/3 03:13 1+21:12:31 R 0 2197.3 (arc_pilot ) 705153.0 tatls015 12/4 07:36 0+16:49:12 R 0 2197.3 (arc_pilot ) 705807.0 tatls015 12/4 08:16 0+16:09:27 R 0 2197.3 (arc_pilot ) 705808.0 tatls015 12/4 08:16 0+16:09:27 R 0 2197.3 (arc_pilot ) 706612.0 tatls015 12/4 09:16 0+15:09:37 R 0 2197.3 (arc_pilot ) 706614.0 tatls015 12/4 09:16 0+15:09:26 R 0 2197.3 (arc_pilot ) …

Jobs killed • Jobs which were removed [root@arc-ce01 ~]# condor_history - constraint 'JobStatus == 3’ ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD 823881.0 alicesgm 12/5 01:01 1+06:13:22 X ??? /var/spool/arc/grid03/CVuMDmBSwGlnCIXDjqi 831849.0 tlhcb005 12/5 13:19 0+18:52:26 X ??? /var/spool/arc/grid09/gWmLDm5x7GlnCIXDjqi 832753.0 tlhcb005 12/5 14:38 0+17:07:07 X ??? /var/spool/arc/grid00/5wqKDm7C9GlnCIXDjqi 819636.0 alicesgm 12/4 19:27 1+12:13:56 X ??? /var/spool/arc/grid00/mlrNDmoErGlnCIXDjqi 825511.0 alicesgm 12/5 03:03 0+18:52:10 X ??? /var/spool/arc/grid04/XpuKDmxLyGlnCIXDjqi 823799.0 alicesgm 12/5 00:56 1+05:58:15 X ??? /var/spool/arc/grid03/DYuMDmzMwGlnCIXDjqi 820001.0 alicesgm 12/4 19:48 1+06:43:22 X ??? /var/spool/arc/grid08/cmzNDmpYrGlnCIXDjqi 833589.0 alicesgm 12/5 16:01 0+14:06:34 X ??? /var/spool/arc/grid09/HKSLDmqUAHlnCIXDjqi 778644.0 tlhcb005 12/2 05:56 4+00:00:10 X ??? /var/spool/arc/grid00/pIJNDm6cvFlnCIXDjqi …

Monitoring HTCondor Andrew Lahiff STFC Rutherford Appleton - PowerPoint PPT Presentation

Monitoring HTCondor Andrew Lahiff STFC Rutherford Appleton Laboratory European HTCondor Site Admins Meeting 2014 Introduction Two aspects of monitoring General overview of the system How many running/idle jobs? By user/VO? By

HTCondor Python Bindings Tutorial Brian Bockelman HTCondor Week 2019 HTCondor Clients in 2012

Whats Next for HTCondor-CE? Brian Bockelman OSG AHM 2015 HTCondor-CE in a slide Submit Host

HTCondor Training Florentia Protopsalti IT-CM-IS 5/12/2017 2 Overview HTCondor Batch System

Installation and Configuration of HTCondor from (our) Repositories Tim Theisen Terminology

Submitting Multiple Jobs With HTCondor Christina Koch HTCondor Week 2020 Why multiple jobs?

HTCondor at HEPiX, WLCG and CERN Status and Outlook Helge Meinhard / CERN HTCondor week 2018

Event-Sourced Monitoring of Your HTCondor Cluster Kevin Retzke HTCondor Week 23 May 2019

HTCondor Architecture HTCondor Week 2020 Todd Tannenbaum Center for High Throughput Computing

HTCondor at Collin Mehring Using HTCondor Since 2011 Animation Studio Background

Several Scenarios at IHEP Zou Jiaheng On behalf of Scheduling Group at IHEP HTCondor Week 2019

HTCondor in Astronomy at NCSA Michael Johnson, Greg Daues, and Hsin-Fang Chiang HTCondor Week

Managing a Dynamic Sharded Pool Anthony Tiradani HTCondor Week 2019 22 May 2019 Introduction

HTCondor S r Securi rity: Philosophy a and Administra ration C Changes FEARLESS SCIENCE

Whats new in HTCondor? Whats coming? Todd Tannenbaum Center for High Throughput Computing

European HTCondor Workshop December 2014 summary Ian Collier (Brial Bockelman, Greg Thain, Todd

HTCondor with Google Cloud Platform Michiru Kaneda The International Center for Elementary

Symbolic Bayesian inference by lazy partial evaluation Chung-chieh Shan (Indiana University)

Noun Phrases February 13, 2017 Next assignments Hundred noun phrases Hundred sentences

GCC CompileFarm 20100725 Thanks to GNU Hacker Meeting 2010 Laurent GUERBY

The Pros and Cons of Wiikpedia Alex Bateman Anyone can edit anything! By David S. Goodsell

Food User Experience Technologies (FUXT): Database & Beyond Shih-Chieh Ilya Li

CS344: Introduction to Artificial Intelligence Pushpak Bhattacharyya CSE Dept., IIT B IIT

Clean Energy States Alliance How SAM Can Be Useful to You and How to Use It Hosted by Todd

TULIP Continuous testing of Linux distributions upgrade Stefane Fermigier Laurent Godard

Monitoring HTCondor Andrew Lahiff STFC Rutherford Appleton - PowerPoint PPT Presentation

Monitoring HTCondor Andrew Lahiff STFC Rutherford Appleton Laboratory European HTCondor Site Admins Meeting 2014 Introduction Two aspects of monitoring General overview of the system How many running/idle jobs? By user/VO? By

HTCondor Python Bindings Tutorial Brian Bockelman HTCondor Week 2019 HTCondor Clients in 2012

Whats Next for HTCondor-CE? Brian Bockelman OSG AHM 2015 HTCondor-CE in a slide Submit Host

HTCondor Training Florentia Protopsalti IT-CM-IS 5/12/2017 2 Overview HTCondor Batch System

Installation and Configuration of HTCondor from (our) Repositories Tim Theisen Terminology

Submitting Multiple Jobs With HTCondor Christina Koch HTCondor Week 2020 Why multiple jobs?

HTCondor at HEPiX, WLCG and CERN Status and Outlook Helge Meinhard / CERN HTCondor week 2018

Event-Sourced Monitoring of Your HTCondor Cluster Kevin Retzke HTCondor Week 23 May 2019

HTCondor Architecture HTCondor Week 2020 Todd Tannenbaum Center for High Throughput Computing

HTCondor at Collin Mehring Using HTCondor Since 2011 Animation Studio Background

Several Scenarios at IHEP Zou Jiaheng On behalf of Scheduling Group at IHEP HTCondor Week 2019

HTCondor in Astronomy at NCSA Michael Johnson, Greg Daues, and Hsin-Fang Chiang HTCondor Week

Managing a Dynamic Sharded Pool Anthony Tiradani HTCondor Week 2019 22 May 2019 Introduction

HTCondor S r Securi rity: Philosophy a and Administra ration C Changes FEARLESS SCIENCE

Whats new in HTCondor? Whats coming? Todd Tannenbaum Center for High Throughput Computing

European HTCondor Workshop December 2014 summary Ian Collier (Brial Bockelman, Greg Thain, Todd

HTCondor with Google Cloud Platform Michiru Kaneda The International Center for Elementary

Symbolic Bayesian inference by lazy partial evaluation Chung-chieh Shan (Indiana University)

Noun Phrases February 13, 2017 Next assignments Hundred noun phrases Hundred sentences

GCC CompileFarm 20100725 Thanks to GNU Hacker Meeting 2010 Laurent GUERBY

The Pros and Cons of Wiikpedia Alex Bateman Anyone can edit anything! By David S. Goodsell

Food User Experience Technologies (FUXT): Database &amp; Beyond Shih-Chieh Ilya Li

CS344: Introduction to Artificial Intelligence Pushpak Bhattacharyya CSE Dept., IIT B IIT

Clean Energy States Alliance How SAM Can Be Useful to You and How to Use It Hosted by Todd

TULIP Continuous testing of Linux distributions upgrade Stefane Fermigier Laurent Godard

Food User Experience Technologies (FUXT): Database & Beyond Shih-Chieh Ilya Li