SLIDE 1

The FIFE Project: Computing for Experiments

Ken Herner for the FIFE Project
DPF 2017, 3 August 2017

SLIDE 2

Introduction to FIFE

  • The FabrIc for Frontier Experiments aims to:

– Lead the development of the computing model for non-LHC experiments
– Provide a robust, common, modular set of tools for experiments, including:

  • Job submission, monitoring, and management software
  • Data management and transfer tools
  • Database and conditions monitoring
  • Collaboration tools such as electronic logbooks, shift schedulers

– Work closely with experiment contacts during all phases of development and testing; standing meetings with developers

  • https://web.fnal.gov/project/FIFE/SitePages/Home.aspx


SLIDE 3

A Wide Variety of Stakeholders

  • At least one experiment in each of the energy, intensity, and cosmic frontiers, studying all physics drivers from the P5 report, uses some or all of the FIFE tools

  • Experiments range from those built in the 1980s to fresh proposals


SLIDE 4

Common problems, common solutions

  • FIFE experiments are on average 1-2 orders of magnitude smaller than LHC experiments; they often lack sufficient expertise or time to tackle every problem, e.g. software frameworks or job submission tools

– It is also much more common in the neutrino world to be a member of multiple experiments

  • By bringing experiments under a common umbrella, FIFE lets them leverage each other's expertise and lessons learned

– Greatly simplifies life for those on multiple experiments

  • A common modular software framework (art, based on CMSSW) is also available for most experiments

  • Example of a common problem: large auxiliary files needed by many jobs

– Provide a storage solution with a combination of dCache and CVMFS (see the sketch below)
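
One way to picture the result: to a grid job, a published auxiliary file is just a POSIX path under /cvmfs. A minimal sketch, where the repository and file path are hypothetical placeholders:

```python
# Illustrative sketch only: once an auxiliary file has been published
# through a dCache+CVMFS setup like the one above, a job simply reads it
# from the /cvmfs namespace. The repository and path here are hypothetical.
AUX_FILE = "/cvmfs/myexpt.osgstorage.org/pnfs/fnal.gov/usr/myexpt/flux/flux_table.root"

with open(AUX_FILE, "rb") as f:
    magic = f.read(4)  # ROOT files begin with the bytes b"root"
print("read auxiliary file, magic bytes:", magic)
```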


SLIDE 5

Common, modular services available from FIFE

  • Submission to distributed computing: JobSub (see the sketch after this list)

– GlideinWMS frontend

  • Workflow monitors, alarms, and automated job submission
  • Data handling and distribution

– Sequential Access Via Metadata (SAM)
– dCache/Enstore (data caching and transfer/long-term tape storage)
– Fermilab File Transfer Service
– Intensity Frontier Data Handling Client (data transfer)

  • Software stack distribution via CVMFS
  • User authentication, proxy generation, and security
  • Electronic logbooks, databases, and beam information
  • Integration with new technologies and projects, e.g. GPUs and HEPCloud
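
As a concrete illustration of the submission interface, here is a hedged sketch of submitting a script with JobSub. The flag names are written from memory of the jobsub_submit client; treat the options, group name, and script path as assumptions and check `jobsub_submit --help`.

```python
# Hedged sketch: submitting a user script to the distributed pool via the
# JobSub client. Flags are from memory and should be verified; "nova" and
# the script path are placeholders.
import subprocess

cmd = [
    "jobsub_submit",
    "-G", "nova",                                          # experiment/VO group
    "--resource-provides=usage_model=DEDICATED,OPPORTUNISTIC,OFFSITE",
    "file:///path/to/analysis_job.sh",                     # script run on the worker
]
subprocess.run(cmd, check=True)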


SLIDE 6

FIFE Experiment Data and Job volumes

  • Nearly 7.4 PB of new data catalogued over the past 6 months across all experiments

  • Average throughput of 3.3 PB/wk through FNAL dCache

  • Typically 16K concurrent running jobs; peaks over 36K
  • Combined numbers are approaching the scale of the LHC experiments (within a factor of 6-7 of ATLAS+CMS)


[Figures: FNAL dCache throughput by experiment, total wall time by experiment, and running jobs by experiment, last 6 months]

SLIDE 7

Going global with user jobs

  • International collaborators can often bring additional computing resources to bear; users want to be able to run seamlessly at all sites with a unified submission command

– The first international location was for NOvA at FZU in Prague. Now expanded to JINR for NOvA; Manchester, Lancaster, and Bern for MicroBooNE; and Imperial College, FZU, Sheffield, and the CERN Tier 0 for DUNE/protoDUNE

  • Following the OSG prescription (OSG is NOT disappearing) makes it easy to have sites around the globe communicate through a common interface, with a variety of job management systems underneath

  • Integration times can be as short as 1-2 weeks, and all sites are accessible via the standard submission tools. The record set-up time is just 2 hours!


[Figure: FIFE jobs at non-US sites (FZU, JINR, Lancaster, Manchester, Bern-LHEP, Imperial, Sheffield), past 6 months]

SLIDE 8

FIFE Monitoring of resource utilization

  • Extremely important to understand the performance of the system
  • Critical for responding to downtimes and identifying inefficiencies
  • Focused on improving the real-time monitoring of distributed jobs, services, and the user experience
  • Enter FIFEMON: a project built on open-source tools (the ELK stack and Graphite, with Grafana for visualization); see the sketch below

– Access to historical information using the same toolset

Code at https://fifemon.github.io
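
For flavor, here is a minimal sketch of feeding a custom datapoint into a Graphite instance like the one behind these dashboards. Graphite's plaintext protocol on TCP port 2003 is standard; the host name and metric path below are hypothetical.

```python
# Minimal sketch: push one datapoint to Graphite over its plaintext line
# protocol ("<path> <value> <timestamp>\n" on TCP port 2003). The host and
# metric path are hypothetical.
import socket
import time

def send_metric(path: str, value: float,
                host: str = "graphite.example.gov", port: int = 2003) -> None:
    line = f"{path} {value} {int(time.time())}\n"
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

send_metric("fife.jobs.running.myexpt", 16000)
```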

SLIDE 9

Full workflow management

  • Now combining job submission, data management, databases, and monitoring tools into a complete workflow management system: the Production Operations Management Service (POMS)
  • Users can specify custom "campaigns" via a GUI, describing job dependencies and automatic resubmission of failed jobs, with complete monitoring and progress tracking in a database

– Visible in the standard job monitoring tools

SLIDE 10

Improving Productivity with Continuous Integration


  • Have built a Jenkins-based Continuous Integration system designed for both common software infrastructure (e.g. art) and experiment-specific software, with a full web UI
  • In addition to software builds, it can also perform physics validation tests of new code (running specific datasets as grid jobs and comparing to reference plots)
  • Supporting SL6/7, with OS X and Ubuntu support in progress; experiments are free to choose any combination of platforms
  • Targeted email notifications for failures

[Figure: the NOvA experiment's CI tests]

SLIDE 11

Access to High Performance Computing

  • Clear push from DOE to use more HPC resources (supercomputers)
  • Somewhat of a different paradigm, but current workflows can be adapted
  • Resources typically require an allocation to access them
  • FIFE can help experiments link allocations to the existing job submission tools

– Looks like just another site to the job, but shields the user from the complexity of gaining access
– Successfully integrated with NOvA at the Ohio Supercomputing Center and MINOS+ at the Texas Advanced Computing Center
– The Mu2e experiment is now testing at NERSC (via HEPCloud)

[Photo by Roy Kaltschmidt, LBNL]

SLIDE 12

Access to GPU Resources

  • Lots of (justified) excitement about GPUs; heard quite a bit already this week
  • Currently no standardized way to access such resources
  • FIFE is now developing such a standard interface within the existing job submission system (see the sketch below)

– Uses a GPU discovery tool from OSG to characterize the system (GPU type, CUDA/OpenCL version, driver info, etc.)
– Advertises GPU capabilities in a standard way across sites; users can simply add the required capabilities to their job requirements (I need GPU type X, I need CUDA > 1.23, etc.) and the system will match jobs and slots accordingly
– Working at two OSG sites: Nebraska Omaha and Syracuse

  • Rolling out to experiments over the next several weeks
  • Starting discussions with non-FIFE experiments (LHC) about speaking a common language as much as possible in this new area
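
A hedged sketch of what requesting a GPU slot through the submission layer could look like. The jobsub options (--lines, --append_condor_requirements) and the advertised attribute name CUDACapability are written from memory of the jobsub client and HTCondor GPU-discovery conventions; treat all of them as assumptions to verify.

```python
# Hedged sketch: steering a job to a GPU slot. Flag and attribute names
# are assumptions based on jobsub and HTCondor conventions, not confirmed
# by this talk; the group and script path are placeholders.
import subprocess

cmd = [
    "jobsub_submit",
    "-G", "myexpt",                                    # hypothetical experiment group
    "--lines", "+RequestGPUs=1",                       # ask the pool for one GPU
    "--append_condor_requirements", "(CUDACapability >= 3.5)",  # match advertised GPU info
    "file:///path/to/gpu_job.sh",
]
subprocess.run(cmd, check=True)
```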

SLIDE 13

FIFE Plans for the future

  • Containers (Docker, Singularity, etc.) are becoming more important in increasingly heterogeneous environments (including GPU machines); help shepherd users through this process and create some common containers for them
  • Help define the overall computing model of the future (see the HEPCloud talk) and guide experiments

– Seamlessly integrate dedicated, opportunistic, HPC, and commercial computing resources
– Usher in easy access to GPU resources for those experiments interested

  • Lower the barriers to accessing computing elements around the world on multiple architectures

– Help connect experimenters and computing professionals to drive experiment software toward increased multithreading and smaller memory-per-core footprints
– Federated identity management (reduced access barriers for international partners)

  • Augment the data management tools (SAM) to also allow a "jobs to the data" model
  • Scale up and improve the UI to existing services

SLIDE 14

Summary

  • FIFE provides access to world-class computing to help accomplish world-class science

– The FIFE Project aims to provide common, modular tools useful for the full range of HEP computing tasks
– Stakeholders in all areas of HEP; wide range of maturity among experiments
– Experiments, datasets, and tools are not limited to Fermilab

  • Overall scale is now approaching that of the LHC experiments; plan to heavily leverage opportunistic resources
  • Now providing a full workflow manager, with functionality not limited to Fermilab resources
  • Working hand-in-hand with experiments and service providers to move into new computing models via HEPCloud

http://fife.fnal.gov/

SLIDE 15

Backup


SLIDE 16

Additional Reading and Documentation


SLIDE 17

Selected results enabled by the FIFE Tools

[Figures: Dark Energy Survey: dwarf planet discovery; MicroBooNE: first results; NOvA: θ23 measurement; MINOS+: limits on large extra dimensions (LEDs)]


SLIDE 18

NOvA – full integration of FIFE Services

  • Jan 2016 - NOvA published its first papers on ν oscillation measurements
  • Averaged 12K CPU hours/day on remote resources
  • > 500 opportunistic CPU cores
  • The FIFE group enabled access to remote resources and helped configure the software stack to operate at remote sites
  • Identified inefficient workflows and helped analyzers optimize them
  • The File Transfer Service stored over 6.5 PB of NOvA data in dCache and Enstore
  • The SAM catalog contains more than 41 million files
  • Helped develop SAM4Users as a lightweight catalog

[Figure: non-FNAL resources only]

SLIDE 19

Overview of Experiment Computing Operations


SLIDE 20

Detailed profiling of experiment operations


SLIDE 21

Monitoring of jobs and experimental dashboards


SLIDE 22

Monitoring of jobs and experiment dashboards


SLIDE 23

Monitoring at user level

  • Users have access to their own page, including a special page with details of held jobs

SLIDE 24

Automated Alerts with FIFEMON


Automated notifications for things like idle slot counts and disk utilization can go to email, Slack, or websites, reaching both sysadmins and experimenters

SLIDE 25

Processing Data with SAM Projects and jobs

When processing data with SAM, one:

  • Defines a dataset containing the files to process
  • Starts a SAM "Project" to hand them out
  • Starts one or more jobs which register as "Consumers" of the Project, including their location
  • Consumer jobs then request a file from the Project, process it, request another file, and so on (see the sketch below)

  • Projects can prestage data while handing out data already on disk, and refer consumers to the "nearest" replica
  • Generally, output is copied to an FFTS dropbox for production work, or to a user's personal disk area
  • Thus the data is sent to the job, not the other way around
  • However, Projects have limits; only so much can be handled in one submission
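
A hedged sketch of that consumer loop with the samweb Python client follows. The method names (startProject, startProcess, getNextFile, releaseFile, stopProject) are written from memory and should be checked against the sam-web-client documentation; the experiment, project, and dataset names are made up.

```python
# Hedged sketch of the SAM consumer loop described above. Method names are
# from memory of the sam-web-client Python API and must be verified; the
# experiment, project, and dataset names are placeholders.
from samweb_client import SAMWebClient

def process_one_file(file_url: str) -> None:
    print("processing", file_url)   # stand-in for the real analysis step

samweb = SAMWebClient(experiment="myexpt")
project_url = samweb.startProject("myuser_demo_project", defname="my_dataset")
process_id = samweb.startProcess(project_url, "demo", "reco", "v1")

while True:
    file_url = samweb.getNextFile(project_url, process_id)  # next replica URL
    if not file_url:
        break                                               # dataset exhausted
    process_one_file(file_url)
    samweb.releaseFile(project_url, process_id, file_url)   # mark as consumed

samweb.stopProject(project_url)
```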

SLIDE 26

  • Provide a modular architecture: experiments do not need to take all services, and can insert experiment-specific services as well (e.g. dedicated local SEs or local lab/university clusters)

SLIDE 27

Mu2e Beam Simulations Campaign

  • Almost no input files
  • Heavy CPU usage
  • <100 MB output per job
  • Ran > 20M CPU-hours in under 5 months
  • Averaged 8,000 simultaneous jobs across > 15 remote sites
  • Usage as high as 20,000 simultaneous jobs and 500,000 CPU-hours in one day; usage peaked in the 1st week of October 2015
  • Achieved the stretch goal of processing 24 times the live-time data for the 3 most important backgrounds
  • Total cost to Mu2e for these resources: $0

SLIDE 28

Job Submission and management architecture

  • The common infrastructure is the fifebatch system: one GlideinWMS pool, 2 schedds, a frontend, collectors, etc.

  • Users interface with the system via "jobsub": middleware that provides a common tool across all experiments and shields the user from the intricacies of Condor

– A simple command-line option steers jobs to different sites

  • Common monitoring provided by the FIFEMON tools

– Now also helps users understand why their jobs aren't running

  • Automatic enforcement of memory, disk, and run time requests (jobs are held if they exceed their request); see the sketch below
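
A hedged sketch of the resource-request flags just mentioned (names per my memory of the jobsub client; verify with `jobsub_submit --help`). Jobs exceeding any of these requests are held by the batch system.

```python
# Hedged sketch: declaring resource requests at submission time. Flag names
# are from memory and should be verified; the group and script path are
# placeholders.
import subprocess

cmd = [
    "jobsub_submit",
    "-G", "myexpt",                    # hypothetical experiment group
    "--memory", "2000MB",              # held if resident memory exceeds this
    "--disk", "10GB",                  # scratch space requested on the worker
    "--expected-lifetime", "8h",       # held if wall time exceeds this
    "file:///path/to/job.sh",
]
subprocess.run(cmd, check=True)
```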


SLIDE 29

Simplifying I/O with IFDH

  • File I/O is a complex problem (best place to read? what protocol? best place to send output?)
  • The Intensity Frontier Data Handling (IFDH) client was developed as a common wrapper around standard data movement tools; it shields the user from site-specific requirements and from choosing transfer protocols
  • Nearly a drop-in replacement for cp, rm, etc., but also has extensive features to interface with SAM (can fetch files directly from a SAM Project, etc.); see the sketch below
  • Supports a wide variety of protocols (including xrootd); automatically chooses the best protocol depending on the host machine, source location, and destination (can be overridden if desired)

– Backend behavior can be changed, or new protocols added, in completely transparent ways
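
A minimal sketch of IFDH as a near drop-in cp replacement, via the Python bindings shipped with the ifdhc product. The cp() call and its list-of-arguments convention are from memory; the file paths are hypothetical.

```python
# Hedged sketch: copying a file with the IFDH Python bindings. The API
# details are from memory of the ifdhc product; paths are placeholders.
import ifdh

handle = ifdh.ifdh()
# IFDH chooses the transfer protocol from the source, destination, and host:
handle.cp(["/pnfs/myexpt/scratch/input.root", "./input.root"])
```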

SLIDE 30

Data management: SAM and FTS

SAM was originally developed for CDF and D0; many FNAL experiments now use it. It provides:

  • A file metadata/provenance catalog
  • A file replica catalog (data need not be at Fermilab)
  • Metadata query-based "dataset" creation (see the sketch below)
  • An optimized file delivery system (command-line, C++, and Python APIs available)
  • Originally an Oracle backend; now PostgreSQL
  • Communication via CORBA for CDF/D0; now via HTTP for everyone

– Eliminates the need to worry about opening ports for communication with the server in nearly all cases
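
A hedged sketch of metadata query-based dataset creation with the samweb Python client. The method names (listFiles, createDefinition) and the dimension syntax are written from memory; the experiment and dataset names are placeholders.

```python
# Hedged sketch: build a named dataset from a metadata query. Method names
# and dimension syntax are from memory of the samweb client; names are
# placeholders.
from samweb_client import SAMWebClient

samweb = SAMWebClient(experiment="myexpt")

dims = "data_tier reconstructed and run_number 12345"
matches = samweb.listFiles(dims)                       # files satisfying the query
print(matches[:5])
samweb.createDefinition("myuser_run12345_reco", dims)  # reusable named dataset
```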

SLIDE 31

Data management: SAM and FTS (2)

The Fermilab File Transfer Service:

  • Watches one or more dropboxes for new files
  • Can extract metadata from files and declare them to SAM, or handle files already declared
  • Copies files to one or more destinations based on file metadata and/or the dropbox used, and registers the locations with SAM
  • Can automatically clean dropboxes, usually N days after files are on tape
  • Does not have to run at Fermilab, nor do the source or destination have to be at Fermilab

SLIDE 32

CI Existing Plans

  • Fermilab has already applied the Continuous Integration practice to the LArSoft-based experiments. Experiments on-boarded in LAr CI are: MicroBooNE, DUNE, LArIAT, and ArgoNeuT.
  • Given this justification, the CI project plan is to apply the Continuous Integration development practice to all IF experiments at Fermilab:

– Extend the LAr CI practice to non-LArSoft-based experiments
– Add additional features to the existing LAr CI
– Improve performance, e.g. speed up the response time to DB/schema changes (this requires some code and dataflow analysis to optimize the queries, and may need some DB model changes; a scalability issue is suspected) and create dynamic plots
– Provide documentation to facilitate the use of the CI practice among the experiments

  • See the CI redmine: https://cdcvs.fnal.gov/redmine/projects/ci
  • Apply the Plan-Do-Check-Act (PDCA) cycle: work together with the experiments to define needs and priorities and receive feedback

SLIDE 33

Monitoring in the CI system - NOvA

  • Found an issue in the reco processing stage and in a user's commit of the NOvA code (the user was contacted and the issue resolved)

SLIDE 34

Monitoring in the CI system - MicroBooNE

  • Memory usage history plot: the uboonecode geant4 stage as an example
  • Using CORSIKA as the cosmic shower generator, memory usage grows from ~2 GB to ~3.5 GB
  • After the intervention of a memory profiling "task force", memory usage went down to ~1.2 GB

SLIDE 35

POMS: Example Campaign Info

SLIDE 36

POMS: Example of Troubleshooting

POMS Project - SPPM Meeting, August 25, 2016