PanDA
Tadashi Maeno (BNL)
NPPS meeting, Jun 19
PanDA in Nutshell
➢ PanDA = Production and Distributed Analysis System
– Designed to meet ATLAS production/analysis requirements for a data-driven workload management system capable of operating at LHC data processing scale
➢ Continuous evolution while running steadily for ATLAS since 2005, including data-taking periods
– Significant refactoring to move to Oracle from MySQL, major system reengineering to implement new paradigm for high level workload management and fine-grained processing mechanism, migration of ATLAS DDMS to Rucio from DQ2, migration to new pilot provisioning machinery, …
➢ ~150k running production+analysis jobs on ~440k cores, ~32M HTTPS sessions per day, 56M Oracle transactions per day, 1.6k individual analysis users in 1 year
➢ ATLAS PanDA, BigPanDA, BigPanDA++, beyond ATLAS, Google projects, ...
➢ Plenty of advanced and interesting functions/activities, but
– Only recent ATLAS ones are shown due to the limited time slot
PanDA in ATLAS Computing
[Architecture diagram: DEFT/JEDI turn production tasks (from production managers) and analysis tasks (from end-users) into jobs for the PanDA server, with configuration from AGIS. Pilots are provisioned by the pilot scheduler and by Harvester: submitted, monitored, and killed through Grid-site CEs, spun up as VMs/containers on clouds, and started by Harvester instances on HPC edge nodes running a subset of pilot components. The PanDA server asks Harvester to increase or throttle pilot submission; running pilots request jobs and get/update them against the server, which can also kill pilots.]
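The central exchange in this picture is the pilot's HTTPS dialogue with the PanDA server (the "request job" and "get/update job" arrows). Below is a minimal sketch of that loop; the server URL, endpoint paths, and field names are illustrative assumptions, not the actual PanDA server API.

```python
# Sketch of the pilot <-> PanDA server dialogue shown in the diagram.
# Endpoint paths, parameters, and response fields are illustrative assumptions.
import requests

PANDA_SERVER = "https://pandaserver.example.org:25443/server/panda"  # hypothetical URL

def get_job(site_name, prod_source_label="managed"):
    """Ask the server for a job matched to this resource ("request job" arrow)."""
    r = requests.post(f"{PANDA_SERVER}/getJob",
                      data={"siteName": site_name,
                            "prodSourceLabel": prod_source_label},
                      verify="/path/to/ca-bundle.pem")  # HTTPS with X.509, as in production
    r.raise_for_status()
    return r.json()

def update_job(panda_id, state, metrics=None):
    """Report heartbeat / final status back ("get/update job" arrow)."""
    payload = {"jobId": panda_id, "state": state}
    payload.update(metrics or {})
    requests.post(f"{PANDA_SERVER}/updateJob", data=payload,
                  verify="/path/to/ca-bundle.pem").raise_for_status()

if __name__ == "__main__":
    job = get_job("EXAMPLE_GRID_SITE")
    # ... run the payload described by `job` ...
    update_job(job.get("PandaID"), "finished")
```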
Harvester 1/2
➢ A resource-facing service between the PanDA server and a collection of pilots (workers) for pilot provisioning
➢ Stateless service plus a database for local bookkeeping
➢ Flexible deployment model and modular design for various resource types and workflows
– On HPC edge nodes with limited runtime environment → a single node + multi-threading + sqlite3
– On dedicated nodes → multiple nodes + multi-processing + MariaDB
– Plugins with native APIs, such as SLURM, LSF, EC2, GCE, k8s, gfal, ..., and plugins using 3rd-party services, such as condor, the ARC interface, Rucio, FTS, Globus Online, ... (see the submitter sketch at the end of this slide)
➢ Objectives
– A common machinery for pilot provisioning on all computing resources
– Better resource monitoring
– Coherent implementations for HPCs
– Timely optimization of CPU allocation among various resource types and removal of batch-level partitioning
– Tight integration between WFMS and resources for new workflows
➢ The project launched in Dec 2016 with 11 developers in the US (BNL, UTA, Duke U, ANL), Norway, Slovenia, Taiwan, Italy, and Russia
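As referenced above, here is a rough sketch of what a resource-facing submitter plugin can look like under this modular design. The class and method names are illustrative assumptions, not the real Harvester plugin API; only the native SLURM CLI is used.

```python
# Sketch of a per-resource submitter plugin: one class per resource type
# (SLURM here), implementing a common interface.  Names are hypothetical.
import subprocess

class BaseSubmitter:
    def submit_workers(self, workers):
        """Return a list of (ok, message) tuples, one per worker."""
        raise NotImplementedError

class SlurmSubmitter(BaseSubmitter):
    """Submit one pilot wrapper per worker through the native SLURM CLI."""
    def __init__(self, batch_script="/path/to/pilot_wrapper.sh"):
        self.batch_script = batch_script

    def submit_workers(self, workers):
        results = []
        for worker in workers:                       # each worker is a plain dict here
            proc = subprocess.run(
                ["sbatch", "--ntasks", str(worker.get("n_cores", 1)), self.batch_script],
                capture_output=True, text=True)
            ok = proc.returncode == 0
            if ok:
                # sbatch prints e.g. "Submitted batch job 123456"
                worker["batch_id"] = proc.stdout.strip().split()[-1]
            results.append((ok, proc.stdout if ok else proc.stderr))
        return results
```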
Harvester 2/2
➢ Entire ATLAS grid migrated by Jan 2019
➢ ATLAS High Level Trigger (HLT) CPU farm with 50k cores, aka Sim@P1, in production
➢ Successfully demonstrated GCE + GCE API + Google Storage + preemptible VMs
➢ All US DOE HPCs in production since Feb 2018
[Plots: migration of UK grid resources; effect of switching from normal VMs to preemptible VMs on GCE; the number of events processed per day at US HPCs around the migration]
Integration of HPCs with Jumbo Payload
➢ Batch jobs are no longer atomic entities in PanDA, thanks to high-level workload management and event-level bookkeeping
➢ Dynamic shaping of jobs based on real-time information on the available compute power and walltime of each resource (see the sizing sketch at the end of this slide)
➢ No dedicated/custom tasks for HPCs
– Old: special tasks to get big jobs at HPCs
– New: common tasks shared among various resources, including HPCs, producing properly sized jobs at each resource
➢ In full production at Theta/ALCF and Cori/NERSC, while at limited scale at Titan/OLCF due to the fragile OLCF file system
➢ Successfully ran at MareNostrum 4 at BSC; will continue on MN5, which was recently granted by EuroHPC
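A back-of-the-envelope sketch of the job-shaping idea referenced above: size each job to the resource it lands on instead of defining HPC-specific tasks. The helper name and all numbers are illustrative assumptions.

```python
# Size a job to the resource it will run on (cores x usable walltime / time per event).
def events_per_job(available_cores, walltime_limit_s, sec_per_event, safety=0.8):
    """How many events fit in one job on this resource within its walltime."""
    usable_s = walltime_limit_s * safety            # keep margin for stage-in/out
    events_per_core = usable_s // sec_per_event
    return int(available_cores * events_per_core)

# A grid slot: 8 cores, 48 h walltime, 300 s/event  -> modest job
print(events_per_job(8, 48 * 3600, 300))            # 3680 events
# A large HPC allocation: 4096 cores, 2 h walltime  -> much bigger job
print(events_per_job(4096, 2 * 3600, 300))          # 77824 events
```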
Resources via Kubernetes
[Diagram: a central Harvester with K8s submitter, monitor, and sweeper components talks to the K8s master to create new pods, poll pod states, and delete failed pods; each pod runs a pilot container doing job I/O against the RSE. Plots: core/job counts with the default K8s scheduler (round-robin load balancing) vs. with policy tuning to pack nodes]
➢ Use Kubernetes as a CE + batch system
– A central Harvester manages remote resources through Kubernetes
➢ Based on SLC6 containers and the CVMFS-csi driver
➢ Proxy passed through a K8s Secret
➢ Still room for evolution, e.g. allow execution of arbitrary containers/options, maybe split I/O into a 1-core container, improve usage of the infrastructure
➢ Tested at scale for some weeks at CERN, being continued at UVic
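A sketch of this Kubernetes-as-CE pattern with the official kubernetes Python client: create pilot pods and sweep failed ones, as the K8s submitter and sweeper do. The namespace, image, secret, and volume names are illustrative assumptions.

```python
# Create pilot pods and clean up failures through the K8s API.
from kubernetes import client, config

config.load_kube_config()                      # or load_incluster_config()
v1 = client.CoreV1Api()
NAMESPACE = "panda-pilots"                     # hypothetical namespace

def create_pilot_pod(worker_id, n_cores=1):
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"pilot-{worker_id}"},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "pilot",
                "image": "example/slc6-pilot:latest",       # hypothetical SLC6-based pilot image
                "command": ["/usr/local/bin/run_pilot.sh"],
                "resources": {"requests": {"cpu": str(n_cores)}},
                # Grid proxy mounted from a K8s Secret, CVMFS via the csi driver
                "volumeMounts": [{"name": "proxy", "mountPath": "/proxy"},
                                 {"name": "cvmfs", "mountPath": "/cvmfs"}],
            }],
            "volumes": [{"name": "proxy", "secret": {"secretName": "grid-proxy"}},
                        {"name": "cvmfs", "persistentVolumeClaim": {"claimName": "cvmfs"}}],
        },
    }
    return v1.create_namespaced_pod(namespace=NAMESPACE, body=pod)

def sweep_failed_pods():
    """Delete pods whose payload failed, as the K8s sweeper does."""
    for pod in v1.list_namespaced_pod(NAMESPACE).items:
        if pod.status.phase == "Failed":
            v1.delete_namespaced_pod(pod.metadata.name, NAMESPACE)
```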
HPC/GPU + ML + MPI
➢ Distributed training on HPCs or GPU clusters through PanDA and Harvester
➢ Multi-node payloads with MPI have to be prepared by users for now (see the sketch below)
– Might provide a common MPI framework in the future
➢ On-demand deployment of user container images
➢ Being tried at the BNL Institutional Cluster
[Workflow diagram: the user submits the task and uploads the container image to Docker Hub; Harvester on the head node fetches the job, gets the image (outbound connection), deploys it to the shared FS, and submits the GPU job to the compute nodes via aprun]
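A sketch of the kind of multi-node MPI payload a user prepares today (no common MPI framework yet): each rank trains on its own data shard and gradients are averaged across ranks. The toy model and all parameters are illustrative assumptions; on a Cray the script is launched with aprun (as in the diagram), elsewhere with mpirun/srun.

```python
# Data-parallel training pattern with mpi4py: local gradient, allreduce, update.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(seed=rank)
# Each rank works on its own data shard (toy linear-regression data)
x = rng.normal(size=(1000, 10))
true_w = np.arange(10, dtype=float)
y = x @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(10)
lr = 0.1
for step in range(200):
    grad = 2.0 / len(y) * x.T @ (x @ w - y)         # local gradient on this shard
    grad = comm.allreduce(grad, op=MPI.SUM) / size   # average across all ranks
    w -= lr * grad

if rank == 0:
    print("learned weights:", np.round(w, 2))
```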
iDDS 1/2
[Architecture diagram: a requester sends a request to the iDDS head; iDDS agents download input data from the source storage, preprocess it, and move it through a cache / hop storage and external services to the destination storage, tracking data info and deleting temporary data; the consumer is notified, gets the delivered data, and reports back]
➢ iDDS: intelligent Data Delivery Service
➢ An intelligent service to preprocess and deliver data to consumers
– Delivered data = files, file fragments, file information, or sets of files
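A very rough sketch of the preprocess-and-deliver loop pictured above: take a delivery request, split the input into fragments (one fine-grained delivery unit each), stage them to a cache/hop area, and notify the consumer. All paths, names, and the notification hook are illustrative assumptions; real transfers and bookkeeping are delegated to DDMS and the cache.

```python
# Toy iDDS-style agent loop: preprocess one input and report what is ready.
from pathlib import Path

CACHE = Path("/tmp/idds-cache")           # stand-in for the cache / hop storage

def preprocess(source: Path, chunk_bytes: int = 64 * 1024 * 1024):
    """Split one input file into fragments (one fine-grained delivery unit each)."""
    CACHE.mkdir(parents=True, exist_ok=True)
    fragments = []
    with source.open("rb") as f:
        for i, chunk in enumerate(iter(lambda: f.read(chunk_bytes), b"")):
            frag = CACHE / f"{source.name}.frag{i:04d}"
            frag.write_bytes(chunk)
            fragments.append(frag)
    return fragments

def notify(consumer, ready_files):
    print(f"notify {consumer}: {len(ready_files)} fragments ready in {CACHE}")

def deliver(request):
    """Process one delivery request and tell the consumer what is ready."""
    fragments = preprocess(Path(request["input"]))
    notify(request["consumer"], [str(p) for p in fragments])

if __name__ == "__main__":
    deliver({"input": "/tmp/sample.dat", "consumer": "analysis-job-001"})
```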
iDDS 2/2
➢ Joint project between ATLAS and IRIS-HEP
➢ Generalizes the concept/workflow of the Event Streaming Service
➢ Not a storage, WFMS, or DDMS
– Delegation of many functions to WFMS, DDMS and Cache
➢ iDDS + WFMS (as preprocessing backend) + DDMS + Cache = CDN
➢ Requirements
– Experiment agnostic
– Flexibility to support more use cases and backend systems
– Easy and cheaper deployment
➢ ATLAS use cases
– Fine-grained processing
– Tape carousel and dynamic data placement
– Data delivery over WAN
– On-demand data transfers at HPCs
– Custom data transformation for hyperparameter optimization
– ...
➢ Potentially a huge R&D effort, but ATLAS manpower is limited for now
➢ Splinter meeting at the S&C workshop next week in NY to reach a consensus within ATLAS before the project “officially” kicks off
– Collaboration with other projects
– Manpower allocation