Agent Factory: automatic job submission mechanism for Ganga/diane

SLIDE 1

Agent Factory

automatic job submission mechanism for Ganga/diane

SLIDE 2

Presentation overview

  • Background
  • Motivation
  • Algorithm
  • Implementation
  • Usage scenarios
  • Lattice QCD production data

SLIDE 3

Lattice QCD; searching for QCD critical point

  • Discretize space and time into a 4-dimensional grid
  • Generate a sample of the most important configurations of quark and gluon fields
  • Evolve the “snapshots”, one Monte-Carlo step at a time

SLIDE 4

Computational model

[Diagram: one chain of snapshots per beta value]

  • beta 5.18: ... snapshot iter 77, snapshot iter 78 ...
  • beta 5.1825: ... snapshot iter 351, snapshot iter 352 ...
  • beta 5.1845: ... snapshot iter 233, snapshot iter 234 ...

SLIDE 5

Setup

[Diagram: Agent Factory, grid, three Worker Agents, Run Master]

SLIDE 6

Algorithm

  • Automatically create and maintain diane Worker Agents
  • Adaptable heuristic approach; independent of the underlying application
  • Three simple phases:
      1. Evaluation
      2. Decision (nondeterministic!)
      3. Submission
  • Relies on application exit code only; no glue (JDL) requirements
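
The three phases can be pictured as one repeating cycle. The sketch below is only an illustration of that structure, not the Agent Factory code: `run_factory` and its `evaluate`, `decide`, `submit` and `sleep` parameters are hypothetical names, and the real script's internals are not shown on the slides.

```python
import time
from typing import Callable, Dict, List


def run_factory(
    evaluate: Callable[[], Dict[str, float]],
    decide: Callable[[Dict[str, float]], str],
    submit: Callable[[str], None],
    cycles: int,
    repeat_interval: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run `cycles` rounds of evaluate -> decide -> submit (hypothetical driver)."""
    chosen: List[str] = []
    for i in range(cycles):
        fitness_by_ce = evaluate()        # phase 1: score each computing element
        target = decide(fitness_by_ce)    # phase 2: pick a target (may be random)
        submit(target)                    # phase 3: submit a worker agent there
        chosen.append(target)
        if i + 1 < cycles:
            sleep(repeat_interval)        # wait before the next round
    return chosen
```

Passing the phases in as callables keeps the loop independent of the application, which is the property the slide emphasizes.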

SLIDE 7

Good / bad CE - where to draw the line?

+ positive feedback

  • running jobs
  • successfully completed jobs (without errors)

- negative feedback

  • failed jobs
  • pending jobs (queueing)
  • etc.

SLIDE 8

Fitness = a measure of reliability

fitness = (running + completed) / total

[Diagram: job breakdown for ce01: pending, running + completed (without errors), failed, other]
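
Reading the fitness on this slide as (running + completed) / total, a minimal sketch of the evaluation step could look as follows. The `CEStats` record, the `fitness` function and the `default` value returned for a computing element with no job history are assumptions made for illustration; the slides do not say how an unseen CE is scored.

```python
from dataclasses import dataclass


@dataclass
class CEStats:
    """Job counts observed for one computing element (hypothetical record)."""
    running: int = 0
    completed: int = 0   # completed without errors
    failed: int = 0
    pending: int = 0
    other: int = 0

    @property
    def total(self) -> int:
        return self.running + self.completed + self.failed + self.pending + self.other


def fitness(stats: CEStats, default: float = 1.0) -> float:
    """Return (running + completed) / total, or `default` if no jobs were seen."""
    if stats.total == 0:
        return default  # assumption: an unseen CE gets the benefit of the doubt
    return (stats.running + stats.completed) / stats.total
```

With these definitions a CE with 3 running, 5 completed, 1 failed and 1 pending job scores 0.8, while a CE where every job failed scores 0.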

SLIDE 9

Nondeterministic decision process

[Diagram: ce01, ce03, grid*]

  • F = total fitness
  • p ∈ [0, F)
  • * CE chosen by WMS
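
The labels F = total fitness and p ∈ [0, F) suggest a fitness-proportionate ("roulette wheel") draw, although the slide does not name the technique. The sketch below assumes that reading: `choose_ce` is a hypothetical function, and the `"grid"` key stands in for "let the WMS choose the CE", whose fitness value is not given in the source.

```python
import random
from typing import Dict, Optional


def choose_ce(fitness_by_ce: Dict[str, float],
              rng: Optional[random.Random] = None) -> str:
    """Pick a key with probability proportional to its fitness."""
    rng = rng or random.Random()
    total = sum(fitness_by_ce.values())       # F = total fitness
    if total <= 0:
        raise ValueError("total fitness must be positive")
    p = rng.random() * total                  # p drawn from [0, F)
    cumulative = 0.0
    for name, value in fitness_by_ce.items():
        cumulative += value
        if p < cumulative:                    # p falls inside this CE's segment
            return name
    return name  # guard against floating-point round-off at the upper end
```

A CE with fitness 0 owns an empty segment and is never selected inside the loop, while a perfectly reliable CE is still only chosen some of the time, so the load spreads across all usable targets.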

SLIDE 10

Implementation

  • Ganga script, part of diane distribution
  • Follows Ganga directory structure:

../gangadir/
./agent_factory
./agent_factory/log
./agent_factory/agent_factory_data
./agent_factory/failure_log
./agent_factory/lockfile_17273

  • File lock system to prevent concurrent access
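
One common way to build such a file lock is atomic exclusive file creation. The sketch below shows that general technique only; it is not taken from the Agent Factory source, `workspace_lock` is a hypothetical name, and the slides do not explain how the real lock file (such as `lockfile_17273`) is named or checked.

```python
import os
from contextlib import contextmanager


@contextmanager
def workspace_lock(path: str):
    """Hold an exclusive lock file at `path` for the duration of the block."""
    # O_CREAT | O_EXCL makes creation atomic: os.open raises FileExistsError
    # if another process already created the file.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield path
    finally:
        os.close(fd)
        os.remove(path)                          # release the lock
```

A second process entering `with workspace_lock(...)` on the same path gets a `FileExistsError` instead of touching shared data. A lock file left behind by a crashed process has to be cleaned up separately, which this sketch does not attempt.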

SLIDE 11

Usage scenarios

Live process

ganga --config=config-lostman.gear AgentFactory.py \
    --diane-worker-number=1063 --diane-max-pending=50 --repeat-interval=300

Acrontab

ganga --config=config-lostman.gear AgentFactory.py \
    --diane-worker-number=1063 --diane-max-pending=50 --repeat-interval=300 \
    --run-time=10800 >& /dev/null

ganga --config=config-lostman.gear AgentFactory.py --kill

Limitations

  • only one instance per workspace allowed
  • simultaneously running Ganga may result in a crash
  • finite, non-extensible proxy lifetime

SLIDE 12

Lattice QCD production data

SLIDE 13

Good computing element

SLIDE 14

Bad computing element

SLIDE 15

Run history (part 1)

SLIDE 16

Run history (part 2)

SLIDE 17

Lattice QCD: the story so far

  • Production run using Gear VO
  • Collaboration sites:
      * Swiss National Supercomputing Centre
      * National Institute for Subatomic Physics, Netherlands
      * CYFRONET, Poland
      * CERN
  • ~2 million CPU hours in 3 months = 231 years on a single machine!
  • ~4.3 TB of data transferred
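
As a rough consistency check of the 231-year figure, the conversion can be worked through directly. The year length of 365.25 days and the 91-day length of "3 months" are assumptions chosen for the arithmetic, not values from the slides.

```latex
\[
231\ \text{years} \times 365.25\ \tfrac{\text{days}}{\text{year}} \times 24\ \tfrac{\text{h}}{\text{day}}
  = 2\,024\,946\ \text{h} \approx 2.0 \times 10^{6}\ \text{CPU hours}
\]
\[
\frac{2.0 \times 10^{6}\ \text{CPU hours}}{91\ \text{days} \times 24\ \tfrac{\text{h}}{\text{day}}}
  \approx 9.2 \times 10^{2}\ \text{CPUs busy on average}
\]
```

Under these assumptions the run corresponds to roughly nine hundred CPUs kept busy around the clock for the whole period.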

SLIDE 18

Summary