Agent Factory automatic job submission mechanism for Ganga/diane
Presentation overview
- Background
- Motivation
- Algorithm
- Implementation
- Usage scenarios
- Lattice QCD production data
Lattice QCD; searching for QCD critical point
- Discretize space and time into a 4-dimensional grid
- Generate a sample of the most important configurations of quark and gluon fields
- Evolve the "snapshots", one Monte-Carlo step at a time
Computational model
[Diagram: one chain of snapshots per beta value, each advanced one iteration at a time]
- beta 5.18: snapshot iter 77 -> snapshot iter 78
- beta 5.1825: snapshot iter 351 -> snapshot iter 352
- beta 5.1845: snapshot iter 233 -> snapshot iter 234
Setup
[Diagram components: grid, Agent Factory, Worker Agents (three shown), Run Master]
Algorithm
- Automatically create and maintain diane Worker Agents
- Adaptable heuristic approach; independent of the underlying application
- Three simple phases:
  1. Evaluation
  2. Decision (nondeterministic!)
  3. Submission
Relies on application exit code only; no glue (JDL) requirements
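The three phases can be pictured as one pass through a loop body. The sketch below is only an illustration in Python (the language Ganga scripts are written in); `run_cycle`, `workers_needed`, and the three callables are hypothetical names and are not taken from the AgentFactory script.

```python
def run_cycle(evaluate, decide, submit, workers_needed):
    """Run one Evaluation -> Decision -> Submission cycle.

    All four parameters are hypothetical placeholders:
      evaluate()       returns a mapping {target: fitness}
      decide(fitness)  picks one target (may be random)
      submit(target)   submits one Worker Agent to that target
      workers_needed   how many Worker Agents to submit in this cycle
    """
    fitness = evaluate()                 # 1. Evaluation
    chosen = []
    for _ in range(workers_needed):
        target = decide(fitness)         # 2. Decision
        submit(target)                   # 3. Submission
        chosen.append(target)
    return chosen
```

Passing the phases in as callables keeps the loop independent of the application, which matches the stated goal of the heuristic; a real script would plug in grid queries and job submission here.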
Good / bad CE - where to draw the line?
Positive feedback (+):
- running jobs
- successfully completed jobs (without errors)
Negative feedback (-):
- failed jobs
- pending jobs (queueing)
- etc.
Fitness = a measure of reliability
$$\text{fitness} = \frac{\text{running} + \text{completed}}{\text{total}}$$

[Diagram: job breakdown for ce01 into pending, running + completed (without errors), failed, and other]
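Read as a formula, the fitness of a computing element is the share of its jobs that are running or have completed without errors. A minimal sketch follows, assuming that "total" is the sum of the pending, running, completed, failed, and other jobs. The function name and the `default` value returned for a CE with no jobs yet are assumptions for illustration, not something the slides specify.

```python
def ce_fitness(running, completed, pending, failed, other=0, default=1.0):
    """Fitness of one CE: (running + completed) / total.

    `completed` counts only jobs that finished without errors.
    `default` is a hypothetical choice for a CE with no job history.
    """
    total = running + completed + pending + failed + other
    if total == 0:
        return default
    return (running + completed) / total
```

For example, a CE with 3 running, 5 completed, 1 pending, and 1 failed job has 8 good jobs out of 10.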
Nondeterministic decision process
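[Diagram: targets ce01, ce03, and grid* laid out along an interval of length F; a random point p selects the target]
F = total fitness
p ∈ [0, F)
*grid: CE chosen by WMS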
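Drawing p from [0, F) and picking the target whose slice contains p is commonly called fitness-proportionate (roulette-wheel) selection. The sketch below assumes that each target owns a slice of the interval equal to its fitness; `choose_target` and the example fitness values are hypothetical, and "grid" stands for the option of letting the WMS choose the CE.

```python
import random


def choose_target(fitness_by_target, rng=random):
    """Pick a target with probability proportional to its fitness."""
    total = sum(fitness_by_target.values())   # F = total fitness
    if total <= 0:
        raise ValueError("total fitness must be positive")
    p = rng.random() * total                  # p in [0, F)
    cumulative = 0.0
    for target, fitness in fitness_by_target.items():
        cumulative += fitness
        if p < cumulative:
            return target
    return target  # guard against floating-point rounding at the upper edge


# Hypothetical fitness values, for illustration only.
example = {"ce01": 0.8, "ce03": 0.2, "grid": 1.0}
picked = choose_target(example)
```

A target with low fitness is chosen rarely but not never, so a CE that recovers can earn positive feedback again.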
Implementation
- Ganga script, part of diane distribution
- Follows Ganga directory structure:
    ../gangadir/
    ./agent_factory
    ./agent_factory/log
    ./agent_factory/agent_factory_data
    ./agent_factory/failure_log
    ./agent_factory/lockfile_17273
File lock system to prevent concurrent access
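One common way to build such a lock is to create the lock file atomically and treat "file already exists" as "someone else holds the lock". The sketch below shows that general technique with Python's standard library. It is not the AgentFactory implementation: the fixed file name `lockfile` and the function names are assumptions, and the meaning of the numeric suffix in `lockfile_17273` is not stated in the slides.

```python
import os


def acquire_lock(directory, name="lockfile"):
    """Try to take the lock; return the lock path, or None if it is held."""
    path = os.path.join(directory, name)
    try:
        # O_CREAT | O_EXCL makes creation fail if the file already exists,
        # so the check and the creation happen as one atomic step.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))   # record the owner for debugging
    return path


def release_lock(path):
    """Give the lock back by deleting the lock file."""
    os.remove(path)
```

A process that crashes without calling `release_lock` leaves a stale file behind, so a real tool needs some way to detect and clear it.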
Usage scenarios
Live process
ganga --config=config-lostman.gear AgentFactory.py --diane-worker-number=1063 --diane-max-pending=50 --repeat-interval=300
Acrontab
ganga --config=config-lostman.gear AgentFactory.py --diane-worker-number=1063 --diane-max-pending=50 --repeat-interval=300 --run-time=10800 >& /dev/null
ganga --config=config-lostman.gear AgentFactory.py --kill
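The slides do not define the options used above. One plausible reading is that `--diane-worker-number` is the desired number of Worker Agents and `--diane-max-pending` caps how many may wait in queues at once. Under that assumption only, the number of agents to submit in one cycle could be computed as in this sketch; the function and parameter names are hypothetical.

```python
def workers_to_submit(target_workers, max_pending, running, pending):
    """How many Worker Agents to submit now (assumed flag semantics).

    target_workers: assumed meaning of --diane-worker-number
    max_pending:    assumed meaning of --diane-max-pending
    """
    missing = target_workers - (running + pending)   # agents still needed
    room_in_queue = max_pending - pending            # free pending slots
    return max(0, min(missing, room_in_queue))
```

With the values from the command line, 1000 running and 10 pending agents would leave 53 agents missing but only 40 free pending slots, so 40 would be submitted.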
Limitations
- Only one instance per workspace allowed
- Simultaneously running Ganga may result in a crash
- Finite, non-extensible proxy lifetime
Lattice QCD production data
Good computing element
Bad computing element
Run history (part 1)
Run history (part 2)
Lattice QCD: the story so far
Production run using Gear VO
Collaboration sites:
* Swiss National Supercomputing Centre
* National Institute for Subatomic Physics, Netherlands
* CYFRONET, Poland
* CERN