
Cartwright 2012 Spring

Computer Sciences 368 Scripting for CHTC

Day 11: Workflows with DAGMan

Suggested reading: Condor 7.7 Manual: http://www.cs.wisc.edu/condor/manual/v7.7/

  • Section 2.10: DAGMan Applications
  • Chapter 9: condor_submit_dag

Turn In Homework

Homework Review

Workflows

Introduction to Workflow

  • Series of related steps to complete a complex task
  • Organize, manage, and make a process reliable
  • Important in science, where repeatability is key

[Figure: example workflow flowchart with the steps Attend class, Review slides, Write code, and Print code, joined by two "OK?" decision points with YES/NO branches]


Workflow Components

  • Workflows are essentially algorithmic!
  • Steps
      – Prerequisites and inputs
      – Process (black box / white box)
      – Outputs
  • Connections
      – Sequence
      – Branching
      – Parallelism
  • Metadata: Resources, owners, timing, etc.

Workflow Example I

Bioinformatics @ Yale: C. Mason, S. Sanders, M. State

Workflow Example II

[Figure: LEAD Weather Forecasting workflow, with 15 numbered steps. Inputs: radar data (Level II and Level III), satellite data, surface data, upper air mesonet data, wind profiler data, terrain data files, surface and terrestrial data files, and NAM, RUC, GFS data. Steps shown: 88D Radar Re-mapper, NIDS Radar Re-mapper, Satellite Data Re-mapper, Terrain Preprocessor, WRF Static Preprocessor, 3D Model Data Interpolator (initial boundary conditions), 3D Model Data Interpolator (lateral boundary conditions), ADAS, ARPS to WRF Data Interpolator, WRF (four instances), WRF to ARPS Data Interpolator, ARPS Ensemble Generator, ARPS Plotting Program, IDV, and ADAM (data mining: look for storm signature). Legend: run once per forecast region; repeat periodically for new data; triggered if a storm is detected; visualization (users request). Stage categories: static data, real time data, initialization, analysis, forecast, data mining, visualization.]


Automated Workflows

  • Ideally, we want to automate workflows

      – Minimize wait times and (certain kinds of) errors
      – Allow humans to concentrate on design and results
  • Broad objectives:
      – Capture whole workflow
      – Define steps clearly
      – Identify easy automation
          ✦ Copying files
          ✦ Changing data formats
          ✦ Running jobs!
      – Balance costs vs. savings!

Workflows in CHTC

Directed Acyclic Graphs (DAGs)

  • Abstract, formal definition of allowable workflows
  • Terminology

      – Step (typically, a job) = Node
      – Connection is directed: Parent → Child
      – No loops (or cycles, hence acyclic)
      – Each node may have 0–n children
      – Each node may have 0–n parents

[Figure: arrow from Parent to Child, read as "Parent must succeed before running Child"]
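
The terminology above can be made concrete with a small sketch. It assumes a workflow stored as a dictionary from each node name to its list of children (a representation chosen here for illustration, not something DAGMan itself uses) and orders the nodes so that every parent comes before its children. If no such order exists, the graph has a cycle and is not a DAG. The function name and the diamond edges are hypothetical.

```python
from collections import deque


def topological_order(children):
    """Return nodes in an order where every parent precedes its children.

    `children` maps each node name to a list of its child node names.
    Raises ValueError if the graph contains a cycle (so it is not a DAG).
    """
    # Count how many parents each node has.
    parent_count = {node: 0 for node in children}
    for kids in children.values():
        for kid in kids:
            parent_count[kid] = parent_count.get(kid, 0) + 1

    # Nodes with no parents are ready to run immediately.
    ready = deque(sorted(n for n, count in parent_count.items() if count == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        # Finishing a node releases each child once all its parents are done.
        for kid in children.get(node, []):
            parent_count[kid] -= 1
            if parent_count[kid] == 0:
                ready.append(kid)

    # Any node left unordered is stuck behind a cycle.
    if len(order) != len(parent_count):
        raise ValueError("cycle detected: not a DAG")
    return order


# Hypothetical diamond: A feeds B and C, which both feed D.
diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
```

Calling `topological_order(diamond)` yields one valid run order; B and C have no dependency on each other, so a real scheduler could run them in parallel.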

Example DAG Shapes

[Figure: three example DAGs, each with nodes A, B, C, and D: Disconnected, Linear / Serial, and Diamond]

A Real Scientific DAG

[Figure: a DAG from the Laser Interferometer Gravitational-wave Observatory (LIGO), with dozens of nodes named datafind_*, tmplt_*, insp_*, trigbank_*, and inca_*]


Condor DAGMan

  • DAGMan: Directed Acyclic Graph Manager
  • Organize Condor jobs into a DAG
  • Condor handles all details of running workflow

      – Submits individual jobs when appropriate
      – Tracks overall workflow
      – Can retry failed nodes and resume failed workflow
      – Can limit amount of work done at once
  • DAGs up to 1,000,000 nodes have been run!

DAGMan Nodes I

A node consists of an optional pre-script, a job (cluster), and an optional post-script.

  • Pre-script
      – prepare data
      – check prerequisites
      – skip node
  • Job (Cluster)
  • Post-script
      – clean up files
      – check success

DAGMan Nodes II

  • Order of execution
      1. Pre-script (submit machine)
      2. Job(s) (pool)
      3. Post-script (submit machine)
  • Failure handling
      – Pre-script exit ≠ 0: Skip job, run post-script (if any)
      – Any job exit ≠ 0: Run post-script (if any)
      – Last exit status determines success/failure of node
  • Make sure scripts exit 0 upon success!
  • Can skip job & post on given pre-script exit status
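
The failure-handling rules above can be traced with a small simulation. This is only a sketch of the rules as the slide states them, not DAGMan's actual implementation, and real behavior may vary with Condor version and configuration. The function name and its parameters are hypothetical; `None` means the node has no such script.

```python
def node_result(pre_exit=None, job_exit=0, post_exit=None):
    """Return (steps_run, node_succeeded) for one node.

    Each argument is the exit status that step would produce if it ran.
    Use None for pre_exit or post_exit when the node has no such script.
    """
    steps = []
    last = 0

    if pre_exit is not None:
        steps.append("pre")
        last = pre_exit

    # A failed pre-script skips the job itself.
    if last == 0:
        steps.append("job")
        last = job_exit

    # The post-script, if any, runs whether or not earlier steps failed.
    if post_exit is not None:
        steps.append("post")
        last = post_exit

    # The last exit status seen decides the node's fate.
    return steps, last == 0
```

For example, `node_result(pre_exit=1, post_exit=0)` skips the job but still reports success, because the post-script ran last and exited 0. That is why a post-script that checks success needs to exit non-zero when the job failed.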

DAGMan Files

Basic DAGMan Submit File

    # Define nodes
    JOB First first.sub
    JOB Analyze1 stats-1.sub
    JOB Analyze2 stats-2.sub
    JOB Sum collate.sub
    SCRIPT PRE Sum verify-all.py 2

    # Define connections
    PARENT First CHILD Analyze1 Analyze2
    PARENT Analyze1 Analyze2 CHILD Sum

  • # lines: Comment
  • JOB lines: Declare node name and its Condor submit file
  • SCRIPT lines: Define pre/post scripts for nodes
  • PARENT ... CHILD lines: Define node connections

[Figure: the resulting diamond DAG: First → Analyze1 and Analyze2 → Sum]
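
For larger workflows, a DAG file like this one is often easier to generate than to type. The sketch below assumes a hypothetical helper, `render_dag`, that takes plain Python lists and returns the file's text; only the JOB, SCRIPT, and PARENT ... CHILD lines shown on the slide are produced.

```python
def render_dag(jobs, scripts, dependencies):
    """Build the text of a DAGMan input file.

    jobs:         list of (node_name, submit_file) pairs
    scripts:      list of (when, node_name, command) with when "PRE" or "POST"
    dependencies: list of (parent_names, child_names) pairs of lists
    """
    lines = ["# Define nodes"]
    for name, submit_file in jobs:
        lines.append(f"JOB {name} {submit_file}")
    for when, name, command in scripts:
        lines.append(f"SCRIPT {when} {name} {command}")

    lines.append("# Define connections")
    for parents, children in dependencies:
        lines.append(f"PARENT {' '.join(parents)} CHILD {' '.join(children)}")

    return "\n".join(lines) + "\n"


# The same four-node workflow as on the slide.
dag_text = render_dag(
    jobs=[
        ("First", "first.sub"),
        ("Analyze1", "stats-1.sub"),
        ("Analyze2", "stats-2.sub"),
        ("Sum", "collate.sub"),
    ],
    scripts=[("PRE", "Sum", "verify-all.py 2")],
    dependencies=[
        (["First"], ["Analyze1", "Analyze2"]),
        (["Analyze1", "Analyze2"], ["Sum"]),
    ],
)
```

Writing `dag_text` to a file gives something that can be handed to condor_submit_dag. Generating the file from a loop is handy when the number of Analyze nodes changes from run to run.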


Define a Job

  • One per node
  • Defines node’s name, unique within this DAG
  • Associated with a Condor submit-file
  • Job must yield 1 cluster; may have many processes

    JOB name submit-file

    JOB Collate collate.sub
    JOB Rjob3 run-r-3.sub


Define Dependencies

  • Defines the “lines” (dependencies) between nodes
  • Parent and child names are node names (cf. JOB)
  • EACH child depends on ALL parents

    PARENT parent1 p2 … CHILD child1 c2 …

    PARENT p1 p2 p3 CHILD c1 c2

[Figure: parents p1, p2, and p3, each connected to both children c1 and c2]
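
The rule that each child depends on all parents means one PARENT ... CHILD line stands for every parent/child pairing. A minimal sketch, using a hypothetical helper that expands one such line into its individual edges (it expects the keywords in upper case, as written on the slide):

```python
def expand_dependency(line):
    """Expand one 'PARENT ... CHILD ...' line into (parent, child) pairs."""
    words = line.split()
    if len(words) < 4 or words[0] != "PARENT" or "CHILD" not in words:
        raise ValueError(f"not a PARENT ... CHILD line: {line!r}")

    split_at = words.index("CHILD")
    parents = words[1:split_at]
    children = words[split_at + 1:]
    if not parents or not children:
        raise ValueError(f"need at least one parent and one child: {line!r}")

    # Every child depends on every parent.
    return [(parent, child) for parent in parents for child in children]


edges = expand_dependency("PARENT p1 p2 p3 CHILD c1 c2")
```

Three parents and two children give six edges, so neither c1 nor c2 can start until p1, p2, and p3 have all succeeded.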


Define Pre- and Post-Scripts

  • Scripts are always optional!
  • Associated with given node name
  • Optional arguments are passed to executable
  • Place scripts in same directory as node’s submit file
  • Scripts run on the submit machine

    SCRIPT PRE name executable arguments
    SCRIPT POST name executable arguments

    JOB First prepare.sub
    SCRIPT PRE First fetch-data.py
    SCRIPT POST Collate sum-stats.py 100
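
As an illustration of what a pre-script might contain, here is a minimal sketch that checks whether its input files exist. It is a hypothetical script, not one from the course: it assumes the arguments on the SCRIPT PRE line are file paths, and it returns 0 only when every file is present, so a missing input makes the pre-script fail.

```python
import os
import sys


def missing_files(paths):
    """Return the subset of paths that are not existing regular files."""
    return [path for path in paths if not os.path.isfile(path)]


def main(argv):
    """Check the files named in argv[1:]; return an exit status."""
    missing = missing_files(argv[1:])
    for path in missing:
        print(f"missing input: {path}", file=sys.stderr)
    # Exit 0 upon success, non-zero upon failure.
    return 1 if missing else 0


# To use this as a real pre-script, end the file with:
#     sys.exit(main(sys.argv))
```

Keeping the logic in `main` and returning the status, rather than calling `sys.exit` deep inside the code, makes the exit behavior easy to check by hand before trusting it in a DAG.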


Logs in DAGMan

  • DAGMan tracks progress via your log files
  • All nodes (i.e., submit files) can use same log file

      – Can be tricky for a person to decode
      – Best DAGMan performance
  • Each node may have own log file
      – More like what you are used to
      – Easier to read for a person
      – Cannot use $(CLUSTER) or $(PROCESS), though!
  • Can omit log statement entirely!
      – DAGMan defaults to dagfile.nodes.log

DAGMan Commands

Submit a DAG

    condor_submit_dag dag-file

  • DAGMan itself runs as a Condor job
  • On the submit machine
  • This command creates submit file and submits it

    condor_submit_dag -no_submit dag-file

  • Just creates DAGMan submit file, if you are curious

    File for submitting this DAG to Condor : dagman.dag.condor.sub
    Log of DAGMan debugging messages : dagman.dag.dagman.out
    Log of Condor library output : dagman.dag.lib.out
    Log of Condor library error messages : dagman.dag.lib.err
    Log of the life of condor_dagman itself : dagman.dag.dagman.log
    Submitting job(s).
    1 job(s) submitted to cluster 65.


Submit Options

    condor_submit_dag -maxjobs N dag-file

  • Maximum number of jobs to submit at once
  • Can help avoid overload on submit machine
  • Can be limited further by administrator

    condor_submit_dag -maxpre N dag-file
    condor_submit_dag -maxpost N dag-file

  • Limits pre- and post-scripts
  • Again, helps avoid overload on submit machine
  • All options are optional and can be combined
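
Because the options can be combined, a script that submits DAGs may want to assemble the command line from settings. The sketch below uses a hypothetical helper that builds the argument list with only the options named on these slides (-no_submit, -maxjobs, -maxpre, -maxpost); it does not run anything itself.

```python
def build_submit_command(dag_file, maxjobs=None, maxpre=None, maxpost=None,
                         no_submit=False):
    """Return the condor_submit_dag command as a list of arguments."""
    command = ["condor_submit_dag"]
    if no_submit:
        command.append("-no_submit")

    # Add each limit only when the caller asked for one.
    limits = (("-maxjobs", maxjobs), ("-maxpre", maxpre), ("-maxpost", maxpost))
    for flag, value in limits:
        if value is not None:
            command += [flag, str(value)]

    command.append(dag_file)
    return command
```

On a submit machine, the returned list could be passed to `subprocess.run(command, check=True)`. Building a list instead of a single string avoids quoting problems if the DAG file name contains spaces.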

Monitor a DAG

    condor_q -dag

  • Same command as always; same options available
  • But: Organizes DAG jobs visually
  • Not required to use -dag option!

    65.0 cat 11/22 15:43 0+00:11:23 R 0 2.2 condor_dagman -f -
    67.0 |-Random1 11/22 15:54 0+00:00:00 I 0 0.0 dag_2.py
    68.0 |-Random2 11/22 15:54 0+00:00:00 I 0 0.0 dag_2.py

  • Other options:
      – Watch log file(s)
      – Email notifications on each job (maybe just on last?)
      – Node status file (later slide)


Remove a DAG

    condor_rm jobID

  • But, which job ID?
  • Essentially: remove the condor_dagman job itself
  • Same cluster printed by condor_submit_dag
  • Removes all jobs (idle & running) within DAG

    65.0 cat 11/22 15:43 0+00:11:23 R 0 2.2 condor_dagman -f -
    67.0 |-Random1 11/22 15:54 0+00:00:00 I 0 0.0 dag_2.py
    68.0 |-Random2 11/22 15:54 0+00:00:00 I 0 0.0 dag_2.py
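
Since the job ID to remove is the cluster printed by condor_submit_dag, a script can capture it at submit time. This sketch assumes the output contains a line of the form "1 job(s) submitted to cluster 65.", as in the sample output on the Submit a DAG slide; both helper names are hypothetical.

```python
import re


def dagman_cluster(submit_output):
    """Return the cluster number from condor_submit_dag output, or None."""
    match = re.search(r"submitted to cluster (\d+)\.", submit_output)
    return int(match.group(1)) if match else None


def build_remove_command(cluster):
    """Return the condor_rm command that removes the DAGMan job."""
    return ["condor_rm", str(cluster)]


sample_output = "Submitting job(s).\n1 job(s) submitted to cluster 65.\n"
cluster = dagman_cluster(sample_output)
```

With the sample output, `build_remove_command(cluster)` gives the arguments for `condor_rm 65`, which targets the condor_dagman job rather than the individual node jobs 67.0 and 68.0.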


Job Recovery

  • Rescue DAG created when DAG does not succeed
      – Due to being removed, or
      – After node fails, when all possible progress completes

    -rw-rw-r-- 1 cat cat 512 Nov 23 10:26 sleep.dag
    -rw-r--r-- 1 cat cat 988 Nov 23 10:38 sleep.dag.condor.sub
    -rw-rw-r-- 1 cat cat 517 Nov 23 10:40 sleep.dag.dagman.log
    -rw-r--r-- 1 cat cat 13179 Nov 23 10:40 sleep.dag.dagman.out
    -rw-r--r-- 1 cat cat 261 Nov 23 10:40 sleep.dag.rescue001

  • Resubmit the DAG to resume, using Rescue DAG
      – Completed nodes are not rerun

    Condor < 7.7.2:  condor_submit_dag dagfile.rescueNNN
    Condor ≥ 7.7.2:  condor_submit_dag dagfile
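
When a DAG has failed more than once, several rescue files may sit side by side. The sketch below picks the highest-numbered one, assuming the dagfile.rescueNNN naming shown above with a three-digit NNN; the helper name is hypothetical. It is mainly relevant for the older style of resubmission, where the rescue file is named explicitly.

```python
import re


def latest_rescue(dag_file, filenames):
    """Return the highest-numbered rescue file for dag_file, or None."""
    pattern = re.compile(re.escape(dag_file) + r"\.rescue(\d{3})$")
    best = None
    for name in filenames:
        match = pattern.match(name)
        if match is None:
            continue
        number = int(match.group(1))
        if best is None or number > best[0]:
            best = (number, name)
    return best[1] if best else None


# File names as in the directory listing, plus a hypothetical second rescue.
listing = [
    "sleep.dag",
    "sleep.dag.condor.sub",
    "sleep.dag.dagman.log",
    "sleep.dag.dagman.out",
    "sleep.dag.rescue001",
    "sleep.dag.rescue002",
]
```

In practice the list of names could come from `os.listdir(".")` in the directory that holds the DAG file.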


Status of DAG Nodes

    NODE_STATUS_FILE filename seconds

  • Writes DAG status info to the given filename
  • Overwrites the file at most once every seconds seconds

    JOB A STATUS_DONE ()
    JOB B STATUS_DONE ()
    JOB C STATUS_DONE ()
    JOB D STATUS_DONE ()
    JOB E STATUS_DONE ()
    JOB F STATUS_SUBMITTED (not_idle)
    JOB G STATUS_SUBMITTED (idle)
    JOB H STATUS_UNREADY ()
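
A node status file in this form is simple to summarize from a script. The sketch below assumes each entry looks like the lines in the sample above ("JOB name STATUS_x (detail)"). A real status file may contain other lines, which this hypothetical parser simply ignores.

```python
import re
from collections import Counter

# One entry: JOB <name> <STATUS_...> (<optional detail>)
STATUS_ENTRY = re.compile(r"JOB\s+(\S+)\s+(STATUS_\w+)\s+\((.*?)\)")


def parse_node_status(text):
    """Return {node_name: (status, detail)} for every entry found in text."""
    return {
        name: (status, detail)
        for name, status, detail in STATUS_ENTRY.findall(text)
    }


def count_by_status(nodes):
    """Return a Counter of how many nodes are in each status."""
    return Counter(status for status, _detail in nodes.values())


sample = """\
JOB A STATUS_DONE ()
JOB F STATUS_SUBMITTED (not_idle)
JOB G STATUS_SUBMITTED (idle)
JOB H STATUS_UNREADY ()
"""
nodes = parse_node_status(sample)
```

Here `count_by_status(nodes)` gives a quick progress summary, for instance how many nodes are done versus still waiting on their parents.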


Homework

  • Run a workflow!
  • The queue simulator is back, but does its own loops
  • If you have an alternate workflow that you would like to work on instead, talk to me