HTCondor-CE: Troubleshooting (ISGC 2019, Taipei, Taiwan; Brian Lin)



SLIDE 1

HTCondor-CE: Troubleshooting

ISGC 2019 - Taipei, Taiwan
Brian Lin
University of Wisconsin-Madison

SLIDE 2

April 1, 2019 ISGC - HTCondor-CE: Troubleshooting 2

Log Levels

  • Useful for temporary debugging
  • Log level can be adjusted per daemon (e.g., SCHEDD_DEBUG) or across all daemons (ALL_DEBUG)
  • Most common, helpful log levels for HTCondor-CE:
      • D_CAT D_ALL:2 - shows the log level for each line (helpful for debugging HTCondor bugs!) and increases the log level of general messages
      • D_SECURITY - shows authentication messages
      • D_NETWORK - shows messages for TCP/UDP connections
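
The knobs above can be collected into a small configuration fragment. The sketch below only builds the text of such a fragment; the helper name render_debug_config is hypothetical, and where the fragment is saved and how the daemons are told to re-read it are assumptions left to the operator.

```python
def render_debug_config(levels_by_knob):
    """Render HTCondor-style `KNOB = value` lines for debug settings.

    `levels_by_knob` maps a knob such as SCHEDD_DEBUG or ALL_DEBUG to a
    list of log levels such as ["D_CAT", "D_ALL:2"].
    """
    lines = []
    for knob, levels in levels_by_knob.items():
        if not knob.endswith("_DEBUG"):
            raise ValueError("not a debug knob: " + knob)
        lines.append(knob + " = " + " ".join(levels))
    return "\n".join(lines) + "\n"


# Illustrative choice: verbose schedd logging, security messages everywhere.
fragment = render_debug_config({
    "SCHEDD_DEBUG": ["D_CAT", "D_ALL:2"],
    "ALL_DEBUG": ["D_SECURITY"],
})
print(fragment)
```

Since these levels are meant for temporary debugging, the fragment is something to remove again once the problem has been found.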
SLIDE 3

HTCondor-CE Startup

Start commands: systemctl start condor-ce, service condor-ce start, condor_ce_on

  • Master: /var/log/condor-ce/MasterLog
  • Collector: /var/log/condor-ce/CollectorLog
  • Schedd: /var/log/condor-ce/SchedLog
  • Job Router: /var/log/condor-ce/JobRouterLog

Legend: Startup, Authorization, Command/Logs

SLIDE 4

Troubleshooting Startup

If all goes well, command-line queries should show the following daemons:

# condor_ce_status -any
MyType        TargetType  Name
Collector     None        My Pool - fermicloud068.fnal.gov@fermiclo
Scheduler     None        fermicloud068.fnal.gov
DaemonMaster  None        fermicloud068.fnal.gov
Job_Router    None        htcondor-ce@fermicloud068.fnal.gov
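
The check above can be automated. The following is a minimal sketch that assumes the output of condor_ce_status -any has already been captured as text; the function name missing_daemons is hypothetical, and only the first column (MyType) is inspected.

```python
# Daemon types that the slide lists for a healthy CE.
REQUIRED_TYPES = {"Collector", "Scheduler", "DaemonMaster", "Job_Router"}


def missing_daemons(status_output):
    """Return the required MyType values absent from the captured output."""
    seen = set()
    for line in status_output.splitlines():
        fields = line.split()
        if fields:
            seen.add(fields[0])  # first column is MyType
    return REQUIRED_TYPES - seen


sample = """\
MyType        TargetType  Name
Collector     None        My Pool - fermicloud068.fnal.gov@fermiclo
Scheduler     None        fermicloud068.fnal.gov
DaemonMaster  None        fermicloud068.fnal.gov
Job_Router    None        htcondor-ce@fermicloud068.fnal.gov
"""
print(missing_daemons(sample))
```

An empty result means every expected daemon advertised itself to the collector; any name returned points at the log file to read first.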

SLIDE 5

Troubleshooting Startup

Startup diagram as on slide 3 (Master, Collector, Schedd, Job Router and their logs), with legend: Startup, Failed AuthZ, Command/Logs

03/20/19 16:05:58 ERROR: AUTHENTICATE:1003:Failed to authenticate with any method

Update CA certificates and CRLs, verify host cert validity, verify unified mapfile, run condor_ce_host_network_check
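
To spot this failure without reading a whole log, lines can be filtered for the AUTHENTICATE error shown on this slide. This is a sketch only: the pattern is derived from that single example line, other messages may be formatted differently, and find_auth_errors is a hypothetical helper name.

```python
import re

# Timestamp, optional extra text (such as a log-level tag), then the error.
AUTH_ERROR = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) .*"
    r"ERROR: AUTHENTICATE:(?P<code>\d+):(?P<message>.*)$"
)


def find_auth_errors(lines):
    """Return (timestamp, code, message) for each authentication error line."""
    hits = []
    for line in lines:
        match = AUTH_ERROR.match(line.strip())
        if match:
            hits.append((match.group("stamp"),
                         int(match.group("code")),
                         match.group("message")))
    return hits


log_lines = [
    "03/20/19 16:05:58 ERROR: AUTHENTICATE:1003:Failed to authenticate with any method",
    "03/20/19 16:05:59 some unrelated message",
]
print(find_auth_errors(log_lines))
```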

SLIDE 6

Validation

From the CE host:

1. Verify that local job submissions complete successfully from the CE host, e.g. sbatch, condor_submit, qsub, etc.
2. Verify that all required daemons are running with condor_ce_status
3. Verify the CE’s network configuration with condor_ce_host_network_check
4. Verify end-to-end job submission with condor_ce_trace
   a. First, from the CE host
   b. Next, from a remote host with the htcondor-ce-client tools

https://opensciencegrid.org/docs/compute-element/install-htcondor-ce/#validating-htcondor-ce
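
The validation order can be sketched as a small driver that stops at the first failing step. This is an illustration under assumptions: the step names, the helper names, the host ce.example.org, and the file test.sub are hypothetical, and a zero exit status is treated as success, which is only a coarse signal (for instance, it does not by itself prove that every daemon is listed by condor_ce_status).

```python
import subprocess


def default_runner(argv):
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(argv, check=False).returncode == 0


def ce_steps(ce_host, local_submit_argv):
    """Validation steps in the order given on the slide."""
    return [
        ("local submission", local_submit_argv),
        ("daemons running", ["condor_ce_status", "-any"]),
        ("network check", ["condor_ce_host_network_check"]),
        ("end-to-end trace", ["condor_ce_trace", ce_host]),
    ]


def validate(steps, runner=default_runner):
    """Run steps in order; return the first failing step name, or None."""
    for name, argv in steps:
        if not runner(argv):
            return name
    return None


# Dry run with a stand-in runner that pretends only the trace fails.
steps = ce_steps("ce.example.org", ["condor_submit", "test.sub"])
print(validate(steps, runner=lambda argv: argv[0] != "condor_ce_trace"))
```

On a real CE host the default runner would be used instead of the stand-in; running the same steps again from a remote host covers item 4b.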

SLIDE 7

Troubleshooting Jobs: HTCondor

CE Host: CE Schedd, Job Router, Local Schedd

  • Firewall, Auth
  • 1. Grid Job: /var/log/condor-ce/SchedLog (CE Schedd)
  • Job Router: /var/log/condor-ce/JobRouterLog
  • 2. Routed Job: /var/log/condor/SchedLog (Local Schedd)

SLIDE 8

Troubleshooting the CE Schedd

1. No errors in the SchedLog? Make sure that the firewall is open
2. Authentication errors? Check the condor_mapfile; make sure that mapped users exist; ensure CAs, CRLs, and VO information are up to date
   a. Using LCMAPS? Also check /var/log/messages or journalctl

SLIDE 9

Troubleshooting Jobs

# condor_ce_q -nobatch

-- Schedd: lhcb-ce.chtc.wisc.edu : <128.104.100.65:9618?... @ 03/20/19 21:31:19
 ID        OWNER    SUBMITTED    RUN_TIME    ST PRI SIZE   CMD
 153501.0  nu_lhcb  3/18 13:30   2+07:56:31  R  0   733.0  DIRAC_clpM0A_pilotwrapper.py
 154043.0  nu_lhcb  3/19 13:43   1+07:41:29  R  0   1709.0 DIRAC_RpJK9Q_pilotwrapper.py
 154066.0  nu_lhcb  3/19 13:43   1+07:41:31  R  0   1465.0 DIRAC_RpJK9Q_pilotwrapper.py
 154088.0  nu_lhcb  3/19 14:09   1+07:14:33  R  0   1709.0 DIRAC_ekQezG_pilotwrapper.py
 154091.0  nu_lhcb  3/19 14:09   1+07:14:32  R  0   1709.0 DIRAC_ekQezG_pilotwrapper.py
 154258.0  nu_lhcb  3/19 17:36   1+03:37:18  R  0   1221.0 DIRAC_lIr4FB_pilotwrapper.py
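
Rows of this listing are easy to post-process, for example to sort jobs by how long they have been running. The sketch below assumes the whitespace-separated column layout shown on this slide (ID, OWNER, SUBMITTED date and time, RUN_TIME as days+HH:MM:SS, ST, PRI, SIZE, CMD); the function names are hypothetical.

```python
def run_time_seconds(run_time):
    """Convert a RUN_TIME value such as '2+07:56:31' to seconds."""
    days, clock = run_time.split("+")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((int(days) * 24 + hours) * 60 + minutes) * 60 + seconds


def parse_row(line):
    """Split one job row of the listing into named fields."""
    (job_id, owner, sub_date, sub_time,
     run_time, state, prio, size, cmd) = line.split(None, 8)
    return {
        "id": job_id,
        "owner": owner,
        "submitted": sub_date + " " + sub_time,
        "run_seconds": run_time_seconds(run_time),
        "state": state,
        "cmd": cmd,
    }


row = parse_row(
    "153501.0 nu_lhcb 3/18 13:30 2+07:56:31 R 0 733.0 DIRAC_clpM0A_pilotwrapper.py"
)
print(row["id"], row["state"], row["run_seconds"])
```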

SLIDE 10

Troubleshooting Jobs

# condor_ce_q -help status
[...]
JobStatus codes:
 1 I IDLE
 2 R RUNNING
 3 X REMOVED
 4 C COMPLETED
 5 H HELD
 6 > TRANSFERRING_OUTPUT
 7 S SUSPENDED

See hold reasons with condor_ce_q -held
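
The status table translates directly into a lookup. As a sketch, assuming the numeric JobStatus values have already been collected (for example by printing the JobStatus attribute of each job), the hypothetical helper summarize below counts jobs per state.

```python
from collections import Counter

# Code -> (one-letter flag, name), copied from the table on this slide.
JOB_STATUS = {
    1: ("I", "IDLE"),
    2: ("R", "RUNNING"),
    3: ("X", "REMOVED"),
    4: ("C", "COMPLETED"),
    5: ("H", "HELD"),
    6: (">", "TRANSFERRING_OUTPUT"),
    7: ("S", "SUSPENDED"),
}


def summarize(status_codes):
    """Count jobs per status name, given numeric JobStatus values."""
    return Counter(JOB_STATUS[code][1] for code in status_codes)


counts = summarize([2, 2, 5, 1])
print(counts)
```

A non-zero HELD count is the cue to look at the hold reasons.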

SLIDE 11

Common Hold Reasons

  • Spooling input data files: the remote client is sending input files; this should clear up after the transfer is complete
  • HTCondor-CE held job due to…
      • missing/expired user proxy: job X.509 proxy was removed or expired. In these cases, it’s safe to remove the job (pilots are cheap)
      • invalid job universe: HTCondor-CE only accepts vanilla, local, scheduler, and standard universe
      • no matching routes, route job limit, or route failure threshold; see 'HTCondor-CE Troubleshooting Guide': job sat in the queue for > 30 min without being picked up by the job router
          • No routes match the job:
            condor_ce_q -l <JOB ID> | condor_ce_job_router_info -match-jobs \
                -ignore-prior-routing -jobads -
          • All routes are full: condor_ce_router_q
          • Route failure threshold: check the JobRouterLog or GridmanagerLog for local batch system submission failures
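
The hold reasons above can be turned into a quick triage helper. This is a sketch, not an official mapping: the substrings come from the reasons quoted on this slide, the suggested actions paraphrase its advice, and the name hint_for is hypothetical.

```python
# (substring to look for, suggested next step), checked in order.
HOLD_HINTS = [
    ("spooling input data files", "wait for the input transfer to finish"),
    ("user proxy", "proxy missing or expired: safe to remove the job"),
    ("invalid job universe",
     "use the vanilla, local, scheduler, or standard universe"),
    ("no matching routes",
     "check route matching, route limits, and the router logs"),
]


def hint_for(hold_reason):
    """Return a suggested next step for a hold reason string."""
    reason = hold_reason.lower()
    for needle, hint in HOLD_HINTS:
        if needle in reason:
            return hint
    return "unrecognized hold reason: consult the troubleshooting guide"


print(hint_for("HTCondor-CE held job due to missing/expired user proxy"))
```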

SLIDE 12

Troubleshooting the Job Router

  • Wrap ClassAd expressions with the debug() function
  • Ensure that you can submit jobs to your local batch system from the CE host
  • Errors will appear in the JobRouterLog and the local SchedLog if there are communication issues between HTCondor-CE and the local HTCondor

SLIDE 13

Troubleshooting Jobs: Non-HTCondor Edition

CE Host: CE Schedd, Job Router, Gridmanager

  • Firewall, Auth
  • 1. Grid Job
  • 2. Routed Job
  • Gridmanager (routed job): /var/log/condor-ce/GridmanagerLog.<user>

SLIDE 14

Tracking Batch System Jobs

  • Find the routed job ID using one of the following methods:
      • Query the CE schedd: condor_ce_q <ORIGINAL JOB ID> -af RoutedToJobId
      • Find relevant lines in the JobRouterLog:
        09/17/14 15:00:57 JobRouter (src=86.0,dest=205.0,route=Local_Condor): claimed job
      • Query the local schedd (HTCondor-only): condor_q -af RoutedFromJobId
  • For non-HTCondor batch systems, find the batch system job ID:
      • Query the CE schedd routed job*:
        $ condor_ce_q <ROUTED JOB ID> -af GridJobId
        <snip> lsf/20141206/482046
      • If the batch system job has completed, find relevant lines in the GridmanagerLog. Look for <BATCH SYSTEM>/<DATE>/<JOB ID>, e.g. lsf/20141206/482046

* We’re making it easier to track completed batch system jobs: https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6159,86
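
Both identifiers on this slide can be pulled out of text with regular expressions. The sketch below is based only on the two example strings shown here: it assumes the JobRouterLog line carries src, dest, and route in that order, and that the batch job identifier ends in <BATCH SYSTEM>/<DATE>/<JOB ID> with a numeric job ID, which may not hold for every batch system. The function names are hypothetical.

```python
import re

ROUTER_LINE = re.compile(
    r"JobRouter \(src=(?P<src>[\d.]+),dest=(?P<dest>[\d.]+),"
    r"route=(?P<route>[^)]+)\)"
)
BATCH_ID = re.compile(r"(?P<system>[a-z]+)/(?P<date>\d{8})/(?P<job>\d+)\s*$")


def routed_ids(log_line):
    """Return (original job ID, routed job ID, route name), or None."""
    match = ROUTER_LINE.search(log_line)
    if not match:
        return None
    return match.group("src"), match.group("dest"), match.group("route")


def batch_job(grid_job_id):
    """Return the batch system, date, and job ID from a GridJobId tail."""
    match = BATCH_ID.search(grid_job_id)
    return match.groupdict() if match else None


print(routed_ids("09/17/14 15:00:57 JobRouter "
                 "(src=86.0,dest=205.0,route=Local_Condor): claimed job"))
print(batch_job("lsf/20141206/482046"))
```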

SLIDE 15

Troubleshooting the Gridmanager

If you see failures during the GM_SUBMIT phase, the Batch GAHP/BLAHP is having issues submitting jobs to the local batch system:

1. Verify that local job submission to the batch system works
2. Set the following in /usr/libexec/condor/glite/etc/batch_gahp.config:

   blah_debug_save_submit_info=<DIR_NAME>

   This saves to <DIR_NAME> the submit files that HTCondor-CE generates for submission

SLIDE 16

Troubleshooting the Gridmanager

A successful query of the local LSF batch system by the Gridmanager daemon:

09/17/14 15:07:24 [25543] (87.0) gm state change: GM_SUBMITTED -> GM_POLL_ACTIVE
09/17/14 15:07:24 [25543] GAHP[25563] <- 'BLAH_JOB_STATUS 3 lsf/20140917/482046'
09/17/14 15:07:24 [25543] GAHP[25563] -> 'S'
09/17/14 15:07:25 [25543] GAHP[25563] <- 'RESULTS'
09/17/14 15:07:25 [25543] GAHP[25563] -> 'R'
09/17/14 15:07:25 [25543] GAHP[25563] -> 'S' '1'
09/17/14 15:07:25 [25543] GAHP[25563] -> '3' '0' 'No Error' '4' '[ BatchjobId = "482046"; JobStatus = 4; ExitCode = 0; WorkerNode = "atl-prod08" ]'
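
The final result line of such an exchange carries the job's state as a small ClassAd. As a sketch, the hypothetical helper parse_status_ad below extracts it into a dictionary; it assumes the ad is the last quoted field on the line and handles only quoted strings and integers, as in this example.

```python
import re

# The ad is the trailing field: '[ Key = value; ... ]'
TRAILING_AD = re.compile(r"'\[(?P<body>.*)\]'\s*$")


def parse_status_ad(log_line):
    """Return the trailing status ad as a dict, or None if absent."""
    match = TRAILING_AD.search(log_line)
    if not match:
        return None
    ad = {}
    for pair in match.group("body").split(";"):
        if "=" not in pair:
            continue
        key, value = (part.strip() for part in pair.split("=", 1))
        # Quoted values stay strings; everything else is read as an integer.
        ad[key] = value.strip('"') if value.startswith('"') else int(value)
    return ad


line = """09/17/14 15:07:25 [25543] GAHP[25563] -> '3' '0' 'No Error' '4' '[ BatchjobId = "482046"; JobStatus = 4; ExitCode = 0; WorkerNode = "atl-prod08" ]'"""
print(parse_status_ad(line))
```

In this example JobStatus is 4, which is COMPLETED in the JobStatus codes listed on slide 10.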

SLIDE 17

Troubleshooting the Gridmanager

Routed job ID: 87.0, as in

09/17/14 15:07:24 [25543] (87.0) gm state change: GM_SUBMITTED -> GM_POLL_ACTIVE

SLIDE 18

Troubleshooting the Gridmanager

LSF job ID: 482046, as in

09/17/14 15:07:24 [25543] GAHP[25563] <- 'BLAH_JOB_STATUS 3 lsf/20140917/482046'

SLIDE 19

Troubleshooting the Gridmanager

If there are issues, errors should show up in the GAHP result lines, such as:

09/17/14 15:07:25 [25543] GAHP[25563] -> '3' '0' 'No Error' '4' '[ BatchjobId = "482046"; JobStatus = 4; ExitCode = 0; WorkerNode = "atl-prod08" ]'

If the messages do not provide enough information, run the Batch GAHP commands by hand:

/usr/libexec/condor/glite/bin/lsf_status.sh lsf/20140917/482046

SLIDE 20

Additional Resources

  • Troubleshooting Guide: https://opensciencegrid.org/docs/compute-element/troubleshoot-htcondor-ce
  • Additional help: htcondor-users@htcondor.org