LS1 Activities of the ATLAS Software Project - Markus Elsing


  1. LS1 Activities of the ATLAS Software Project
     Markus Elsing
     report at the PH-SFT group meeting, December 9th, 2013
     [figure: reconstructed event in the Phase-2 tracker]

  2. Introduction and Outline
     • the challenges
       ➡ pileup drives resource needs
         • not only in Tier-0
       ➡ GRID "luminosity" is limited
         • full simulation is costly
       ➡ physics requires increasing the rate
         • Run-2 data-taking rate of 1 kHz (?)
       ➡ technologies are evolving fast
         • software needs to follow
       ➡ support detector upgrade studies
         • not covered in this talk
     [charts: GRID CPU consumption by activity (MC Simulation, MC Reconstruction, Data Reconstruction, Group Production, Group Analysis, Final Analysis, Others); reconstruction CPU vs pileup for LHC at 25 ns and 50 ns bunch spacing]
     • outline of the talk
       1. work of the Future Software Technologies Forum (FSTF)
       2. algorithmic improvements
       3. the Integrated Simulation Framework (ISF) for Run-2
       4. new Analysis Model for Run-2
       5. goals and plans for Data Challenge-14 (DC-14)
       6. completion of the LS1 program for the restart of data taking

  3. Evolution of WLCG Resources
     • upgrades of existing centers
       ➡ additional resources expected mainly from advancements in technology (CPU or disk)
       ➡ will not match the additional needs in the coming years
     [charts: WLCG disk growth (PB) and CPU growth (kHS06) for CERN, Tier-1 and Tier-2, 2008-2020, with linear extrapolations of the 2008-12 trend]
     • today's infrastructure
       ➡ x86 based, 2-3 GB per core, commodity CPU servers
       ➡ applications running "event" parallel on separate cores
       ➡ jobs are sent to the data to avoid transfers
     • technology is evolving fast
       ➡ network bandwidth is the fastest growing resource
         • data transfer to remote jobs is less of a problem
         • strict MONARC model no longer necessary
         • flexible data placement with data-popularity driven replication, remote I/O and storage federations
       ➡ modern processors: vectorization of the applications and optimization for data locality (avoid cache misses) - see the layout sketch below
       ➡ "many core" processors like the Intel Phi (MIC) or GPGPUs
         • much less memory per core!
     [photo: Intel Phi]
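
A minimal illustration of the data-locality point above (a generic sketch, not ATLAS code; all names are invented for the example): keeping each quantity in its own contiguous array (structure-of-arrays) lets a simple loop stream through memory with few cache misses and gives the compiler a good chance to auto-vectorize it, in contrast to an array-of-structures layout where every iteration drags in a full record.

    // Structure-of-arrays sketch for data locality and auto-vectorization
    // (illustrative only, not ATLAS code).
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Array-of-structures: computing r^2 would touch whole Hit records.
    struct Hit { double x, y, z, charge; };

    // Structure-of-arrays: x and y are each contiguous in memory.
    struct HitsSoA { std::vector<double> x, y, z, charge; };

    // Contiguous, dependence-free loop: a good candidate for SIMD at -O3.
    void radii2(const HitsSoA& h, std::vector<double>& r2) {
      const std::size_t n = h.x.size();
      r2.resize(n);
      for (std::size_t i = 0; i < n; ++i)
        r2[i] = h.x[i] * h.x[i] + h.y[i] * h.y[i];
    }

    int main() {
      HitsSoA hits;
      for (int i = 0; i < 1000; ++i) {
        hits.x.push_back(0.1 * i);
        hits.y.push_back(0.2 * i);
      }
      std::vector<double> r2;
      radii2(hits, r2);
      std::printf("r2[10] = %f\n", r2[10]);
      return 0;
    }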

  4. High Performance Computing in ATLAS
     • infrastructure is getting heterogeneous
       ➡ mostly opportunistic usage of additional resources
         • commercial cloud providers (e.g. Google, Amazon)
         • free CPU in High Performance Computing centers
       ➡ big HPC centers outperform the WLCG in CPU
         • x86, BlueGene, NVIDIA GPUs, ARM, ...
       ➡ GRID (ARC middleware) or cloud (OpenStack) interface
     • suitable applications
       ➡ CPU-resource hungry with low data throughput
         • physics generators or detector simulation
       ➡ x86 based systems
         • small overhead to migrate applications
       ➡ GPU based systems
         • complete rewrite necessary (so far) or dedicated code
     • ATLAS (ADC) working group to evaluate HPC opportunities
       ➡ first successful test productions on commercial clouds and HPC clusters
     [photos: SuperMUC (München), NVIDIA GPU]

  5. Future Software Technologies Forum
     • coordinates all technology R&D efforts in ATLAS
       ➡ drives ATLAS developments on vectorization and parallel programming
         • examples: AthenaMP, AthenaHive, Eigen, VDT/libimf, ...
         • studies of compilers, allocators, auto-vectorization, ... (see the sketch below)
         • explore new languages (ISPC, Cilk Plus, OpenMP 4, etc.)
       ➡ forum for R&D on GPGPUs and other co-processors
         • algorithm development, share experience, identify successful strategies
         • get experience on ARM and Intel Phi
       ➡ pool of experienced programmers
         • educating the development community
       ➡ software optimization with profiling tools (together with the PMB)
         • tools like: perfmon, gperftools, GoODA
         • code optimization and identification of hot spots in ATLAS applications
         • examples: b-field access, z-finder in the HLT, optimizing neural nets
       ➡ liaison with the Concurrency Forum and OpenLab
         • integration of ATLAS efforts in LHC-wide activities
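
To make the auto-vectorization studies concrete, here is a generic sketch (an illustration under assumptions, not FSTF or ATLAS code): a dependence-free loop over contiguous floats is the pattern compilers can map onto SIMD units, and with suitable flags (e.g. gcc -O3 -ffast-math, or a vectorized math library such as VDT or Intel's libimf for the transcendental calls) even the sin/cos evaluations can be vectorized.

    // Auto-vectorization sketch (illustrative only, not ATLAS code).
    // Example build: g++ -O3 -march=native -ffast-math -fopt-info-vec vec.cpp
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Per-element transformation with no loop-carried dependence; the
    // compiler can vectorize the loop, and with -ffast-math the sin/cos
    // calls can be dispatched to a vectorized math library.
    void transform(const std::vector<float>& phi, std::vector<float>& out) {
      out.resize(phi.size());
      for (std::size_t i = 0; i < phi.size(); ++i)
        out[i] = 0.5f * std::sin(phi[i]) + 0.5f * std::cos(phi[i]);
    }

    int main() {
      std::vector<float> phi(1024);
      for (std::size_t i = 0; i < phi.size(); ++i) phi[i] = 0.001f * i;
      std::vector<float> out;
      transform(phi, out);
      std::printf("out[100] = %f\n", out[100]);
      return 0;
    }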

  6. AthenaMP (Multi-Process) - V. Tsulaia
     • not a new development, but not yet in production
       ➡ event-parallel processing, aim to share memory (see GaudiMP)
       ➡ successful simulation, digitization and reconstruction tests recently
         • still issues with I/O, e.g. on EOS
       ➡ goal is to put AthenaMP in full production by ~ this summer
     [figure: memory sharing between worker processes]
     • next version of AthenaMP improves GRID integration
       ➡ including the new "event service" I/O model in ProdSys-2
     (a fork-based copy-on-write sketch follows below)
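
The memory-sharing idea behind AthenaMP can be sketched with a plain fork()-based program (a concept illustration only, not AthenaMP code): large read-only data initialized before the fork stays physically shared between the worker processes through copy-on-write, while each worker processes its own subset of events.

    // Fork/copy-on-write sketch of event-parallel workers
    // (concept illustration only, not AthenaMP code).
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>
    #include <vector>

    int main() {
      // Large read-only data (think geometry/conditions) set up once in the
      // mother process; after fork() the pages remain shared until written.
      std::vector<double> conditions(10 * 1000 * 1000, 1.0);

      const int nWorkers = 4;
      const int nEvents  = 100;
      for (int w = 0; w < nWorkers; ++w) {
        if (fork() == 0) {                                  // child = worker
          double checksum = 0.0;
          for (int evt = w; evt < nEvents; evt += nWorkers) // round-robin events
            checksum += conditions[evt];
          std::printf("worker %d done, checksum %.1f\n", w, checksum);
          _exit(0);
        }
      }
      while (wait(nullptr) > 0) {}                          // reap all workers
      return 0;
    }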

  7. AthenaHive Testbed - C. Leggett
     • based on the GaudiHive project
       ➡ the model is multi-threading at the algorithm level (DAG) - see the scheduling sketch below
       ➡ demonstrator study using calorimeter reconstruction
         • factor 3.3 speedup w.r.t. sequential (on more cores), 28% more memory
     [figures (C. Leggett, 10/23/13): calorimeter testbed dataflow (SGInputLoader, CaloCellMaker, CaloTopoCluster, CmbTowerBldr, CaloClusterMakerSWCmb, CaloCell2TopoCluster, StreamESD), per-algorithm timings (serial total ~2.65 s per event), and memory usage vs time for 100 events:
        1 store,  1 alg:  523 MB, 316 s
        3 stores, 3 algs: 607 MB, 161 s
        3 stores, 5 algs: 618 MB, 134 s
        4 stores, 4 algs: 667 MB, 129 s]
     • still a long way to go
       ➡ all framework services need to support multi-threading
       ➡ making ATLAS services, tools and algorithms thread safe, adapting the configuration
       ➡ in the demonstrator we see the limits of the DAG approach (Amdahl's law at play)
     • work on Hive is a necessary step towards the final multi-threading goal
       ➡ need parallelism at all levels (especially for the tracking algorithms)
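
The algorithm-level DAG scheduling can be illustrated with standard C++ futures (a deliberately simplified sketch based on the calorimeter dataflow named above; GaudiHive itself uses a whiteboard and a task scheduler, and the algorithm signatures here are invented): CaloCellMaker must run first, after which CaloTopoCluster and CmbTowerBldr depend only on its output and can execute concurrently.

    // Simplified dataflow (DAG) scheduling sketch with std::async
    // (not GaudiHive/AthenaHive code; algorithm stand-ins are hypothetical).
    #include <cstdio>
    #include <functional>
    #include <future>

    struct CaloCells { int n = 0; };
    struct Clusters  { int n = 0; };
    struct Towers    { int n = 0; };

    CaloCells caloCellMaker()                      { return {1000}; }
    Clusters  caloTopoCluster(const CaloCells& c)  { return {c.n / 10}; }
    Towers    cmbTowerBldr(const CaloCells& c)     { return {c.n / 50}; }

    int main() {
      // CaloCellMaker produces the input of both downstream algorithms.
      const CaloCells cells = caloCellMaker();

      // The two consumers have no mutual dependency, so the scheduler is
      // free to run them concurrently on different threads.
      auto clusters = std::async(std::launch::async, caloTopoCluster, std::cref(cells));
      auto towers   = std::async(std::launch::async, cmbTowerBldr,   std::cref(cells));

      std::printf("clusters: %d, towers: %d\n", clusters.get().n, towers.get().n);
      return 0;
    }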

  8. Current Tracking Software Chain
     • tracking is the resource driver in reconstruction
       ➡ current software is optimized for early rejection
         • avoid combinatorial overhead as much as possible!
       ➡ early rejection requires strategic candidate processing and hit removal
         • not a heavily parallel approach, it is a SEQUENTIAL approach!
       ➡ good scaling with pileup (factor 6-8 for 4 times the pileup) - still catastrophic
     • implications for making it heavily parallel?
       ➡ Amdahl's law at work: t_parallel = s + p/n
         • the current strategy has a small parallel part p, while it is heavy on the sequential part s
       ➡ hence: if we want to gain with a large number of threads n, we need to reduce s
         • compromise on early rejection, which means more combinatorial overhead
         • as a result, we will spend more CPU if we go parallel
       ➡ this only makes sense if we use additional processing power that otherwise would not be usable! (many-core processors)
     (a worked numerical example follows below)
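
A small numerical illustration of the Amdahl's-law argument (the serial fractions below are made up for the example, not measured ATLAS numbers): the achievable speedup is capped at 1/s no matter how many threads are added, which is why reducing the sequential part s is the real lever.

    // Amdahl's law: t_parallel(n) = s + p/n with p = 1 - s,
    // speedup(n) = 1 / (s + (1 - s)/n).  Serial fractions are illustrative.
    #include <cstdio>

    int main() {
      const double serialFractions[] = {0.50, 0.20, 0.05};
      const int    threads[]         = {2, 4, 8, 16, 64};

      for (double s : serialFractions) {
        std::printf("s = %.2f:", s);
        for (int n : threads)
          std::printf("  n=%-2d -> %.2fx", n, 1.0 / (s + (1.0 - s) / n));
        std::printf("   (limit 1/s = %.0fx)\n", 1.0 / s);
      }
      return 0;
    }

For instance, with s = 0.5 even 64 threads give less than a 2x speedup, matching the slide's conclusion that the sequential early-rejection strategy has to be compromised before threading pays off.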

  9. Tracking Developments during LS1
     • work on technology to improve the CURRENT algorithms
       ➡ modified track seeding to exploit the 4th pixel layer
       ➡ Eigen migration - faster vector+matrix algebra (see the sketch below)
       ➡ use vectorized trigonometric functions (VDT, Intel libimf)
       ➡ F90 to C++ for the b-field (speed improvement in Geant4 as well)
       ➡ simplify the EDM design to be less OO (was the "hip" thing 10 years ago)
       ➡ xAOD: a new analysis EDM, maybe more... (may allow for data locality)
     • work will continue beyond this, examples:
       ➡ (auto-)vectorize the Runge-Kutta, fitter, etc. and take full benefit from Eigen
       ➡ use only the curvilinear frame inside the extrapolator
       ➡ faster tools like a reference Kalman filter
       ➡ optimized seeding strategy for high pileup
     • hence, a mix of SIMD and algorithm tuning
       • may give us a factor 2 (maybe more...)
       ➡ further speedups probably require "new" thinking
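
To illustrate the Eigen migration and the reference Kalman filter mentioned above (a generic sketch with an assumed 5-parameter track state and 2D measurement, not the ATLAS tracking EDM or fitter): fixed-size Eigen matrices let the compiler unroll and vectorize the small algebra that dominates track fitting.

    // Fixed-size Eigen algebra sketch for a Kalman-filter update step
    // (illustrative only; the ATLAS EDM and fitter differ in detail).
    #include <Eigen/Dense>
    #include <iostream>

    using Par5  = Eigen::Matrix<double, 5, 1>;   // local track parameters
    using Cov5  = Eigen::Matrix<double, 5, 5>;   // parameter covariance
    using Meas2 = Eigen::Matrix<double, 2, 1>;   // 2D measurement
    using Proj  = Eigen::Matrix<double, 2, 5>;   // projection to measurement space

    // Standard gain-matrix update; fixed sizes allow unrolling/vectorization.
    void kalmanUpdate(Par5& x, Cov5& C, const Meas2& m, const Proj& H,
                      const Eigen::Matrix2d& R) {
      const Eigen::Matrix2d S = H * C * H.transpose() + R;                    // innovation cov.
      const Eigen::Matrix<double, 5, 2> K = C * H.transpose() * S.inverse();  // gain
      x += K * (m - H * x);
      C = (Cov5::Identity() - K * H) * C;
    }

    int main() {
      Par5 x = Par5::Zero();
      Cov5 C = Cov5::Identity();
      Proj H = Proj::Zero();  H(0, 0) = 1.0;  H(1, 1) = 1.0;
      Meas2 m;  m << 0.1, -0.2;
      kalmanUpdate(x, C, m, H, Eigen::Matrix2d::Identity() * 0.01);
      std::cout << "updated parameters:\n" << x << std::endl;
      return 0;
    }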

  10. Improved Physics Performance
     • algorithms are an essential part of the LS1 development work, examples:
       ➡ improved topo-clustering for calorimeter showers
       ➡ new tau reconstruction exploring substructure
       ➡ new jet and missing E_T software, improved pileup stability
       ➡ particle-flow jets
     [figures: identifying substructure in tau decays (e.g. τ+ → π+ π0 ν) across the EM1/EM2 calorimeter layers; CATIA drawing of the ATLAS IBL staves and module flexes (PP0 to PP1, I-Flexes, stave ring and 6 endblocks; not yet finalized); tracking inefficiency from e+ e- conversions]
     • software for the Phase-0 upgrades
       ➡ full inclusion of the IBL in track reconstruction
       ➡ emulation of the FTK in the Trigger simulation chain (next slide)
