LS1 Activities of the ATLAS Software Project
Markus Elsing report at the PH-SFT group meeting December 9th, 2013
reconstructed event in Phase-2 tracker
➡ pileup drives resource needs
➡ GRID “luminosity” is limited
➡ physics requires increasing the trigger rate
➡ technologies are evolving fast
➡ support for detector upgrade studies
GRID CPU Consumption
[Pie chart: GRID CPU consumption by activity — MC Simulation, MC Reconstruction, Final Analysis, Group Production, Group Analysis, Data Reconstruction, Others (shares: 42%, 20%, 19%, 10%, 4%, 3%, 3%). Inset: CPU vs pileup for LHC @ 25 ns and LHC @ 50 ns.]
➡ additional resources expected mainly from advancements in technology (CPU and disk)
➡ will not match the additional needs in coming years
➡ x86 based, 2-3 GB per core, commodity CPU servers
➡ applications running “event” parallel on separate cores
➡ jobs are sent to the data to avoid transfers
➡ network bandwidth is the fastest growing resource: replication, remote I/O and storage federations
➡ modern processors: vectorization of the applications and optimization for data locality (avoid cache misses)
➡ “many core” processors like Intel Phi (MIC) or GPGPUs
[Charts: WLCG CPU growth (kHS06) and WLCG disk growth (PB), 2008–2020, split into Tier-2, Tier-1 and CERN, with 2008–12 linear extrapolations (CPU: y = 363541x + 16742; disk: y = 34.2x + 0.5). Photo: Intel Phi.]
➡ mostly opportunistic usage of additional resources
➡ big HPC centers outperform WLCG in CPU
➡ GRID (ARC Middleware) or Cloud (OpenStack) interface
➡ CPU resource hungry with low data throughput
➡ x86 based systems
➡ GPU based systems
➡ first successful test productions on commercial clouds and HPC clusters
[Photos: SuperMUC (München); NVIDIA GPUs.]
➡ drives ATLAS developments on vectorization and parallel programming
➡ forum for R&D on GPGPUs and other co-processors
➡ pool of experienced programmers
➡ software optimization with profiling tools (together with PMB)
➡ integration of ATLAS efforts in LHC wide activities
➡ event parallel processing, aiming to share memory (see GaudiMP)
➡ successful simulation, digitization and reconstruction tests recently
➡ goal is to have AthenaMP in full production by ~this summer
➡ including the new “event service” I/O model in ProdSys-2
V.Tsulaia
memory sharing between worker processes
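The memory-sharing idea behind AthenaMP can be sketched with a plain fork-based worker pool — a schematic illustration only, not the actual AthenaMP code; `process_event` and `GEOMETRY` are hypothetical stand-ins:

```python
import multiprocessing as mp

# Large read-only state (detector geometry, conditions data, ...) is built
# once in the mother process; fork()-ed workers share its pages copy-on-write,
# so the physical memory cost is paid only once.
GEOMETRY = list(range(1_000_000))  # stand-in for ~GB of shared data

def process_event(event_id):
    # hypothetical per-event work: reads the shared state, modifies nothing
    return event_id, GEOMETRY[event_id % len(GEOMETRY)]

ctx = mp.get_context("fork")  # the fork start method gives copy-on-write sharing
with ctx.Pool(processes=4) as pool:
    results = pool.map(process_event, range(100))  # event-parallel, AthenaMP-style
print(len(results))
```

Because the workers are forked after the large state is built, each worker only pays for the memory pages it actually modifies.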
➡ model is multi-threading at the algorithm level (DAG)
➡ demonstrator study using calorimeter reconstruction
➡ all framework services need to support multi-threading
➡ making ATLAS services, tools and algorithms thread safe, adapting the configuration
➡ in the demonstrator we see the limits of the DAG approach (Amdahl’s law at play)
[Diagram (C.Leggett, 10/23/13): calorimeter testbed dataflow — SGInputLoader feeds CaloCellMaker, CmbTowerBldr, CaloTopoCluster, CaloClusterMakerSWCmb, CaloCell2TopoCluster and StreamESD via stores such as AllCalo, CombinedTower, CaloTopoCluster/CaloCalTopoCluster, CombinedCluster and the LAr calibration-hit collections.]
Algorithm timing (C.Leggett, 10/23/13): SGInputLoader 0.142 s, CaloCellMaker 0.852 s, CmbTowerBldr 0.082 s, CaloTopoCluster 1.158 s, CaloClusterMakerSWCmb 0.187 s, CaloCell2TopoCluster 0.043 s, StreamESD 0.186 s (branch subtotals 0.994 s, 1.201 s, 0.186 s) — serial: 2.65 s, un-parallelizable: 1.18 s
Configuration scan (C.Leggett, 10/23/13): calo testbed memory usage and timing, 100 events, cloning up to 10 —
serial, 1 store / 1 alg: 523 MB, 316 s
no cloning, 3 stores / 3 algs: 607 MB, 161 s
with cloning, 3 stores / 5 algs: 618 MB, 134 s
with cloning, 4 stores / 4 algs: 667 MB, 129 s
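The algorithm-level DAG scheduling can be sketched with a minimal topological scheduler; the algorithm names follow the testbed slide, but the dependency wiring here is a hypothetical reconstruction:

```python
from graphlib import TopologicalSorter

# Data dependencies between the testbed algorithms: an algorithm may run as
# soon as all of its inputs exist, so independent branches can execute
# concurrently on different threads.  (Hypothetical wiring for illustration.)
deps = {
    "CaloCellMaker":         {"SGInputLoader"},
    "CmbTowerBldr":          {"CaloCellMaker"},
    "CaloTopoCluster":       {"CaloCellMaker"},
    "CaloClusterMakerSWCmb": {"CmbTowerBldr"},
    "CaloCell2TopoCluster":  {"CaloTopoCluster"},
    "StreamESD":             {"CaloClusterMakerSWCmb", "CaloCell2TopoCluster"},
}

ts = TopologicalSorter(deps)
ts.prepare()
schedule = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # algorithms whose inputs are all available
    schedule.append(ready)          # each group could run in parallel
    ts.done(*ready)

print(schedule)
```

The length of this schedule (the critical path) is what bounds the achievable speedup, exactly the Amdahl-style limit the demonstrator exposes.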
➡ current software optimized for early rejection
➡ early rejection requires strategic candidate processing and hit removal
➡ scaling with pileup is comparatively good (CPU factor 6-8 for 4 times the pileup), but the absolute cost is still catastrophic
➡ Amdahl’s law at work:
➡ hence: to gain from a large number of threads N, we need to reduce the serial fraction S
➡ this only makes sense if we use additional processing power that would otherwise not be usable! (many-core processors)
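Spelled out, Amdahl's law for serial fraction S and N threads is:

```latex
\mathrm{speedup}(N) \;=\; \frac{1}{S + \dfrac{1-S}{N}}
\;\xrightarrow{\;N \to \infty\;}\; \frac{1}{S}
```

With the calorimeter demonstrator numbers (1.18 s un-parallelizable out of 2.65 s serial, i.e. S ≈ 0.45), no number of threads can yield more than a ≈ 2.2× speedup.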
➡ modified track seeding to exploit the 4th Pixel layer (IBL)
➡ Eigen migration - faster vector + matrix algebra
➡ use vectorized trigonometric functions (VDT, Intel libimf)
➡ F90 to C++ for the b-field (speed improvement in Geant4 as well)
➡ simplify EDM design to be less OO (OO was the “hip” thing 10 years ago)
➡ xAOD: a new analysis EDM, maybe more... (may allow for data locality)
➡ (auto-)vectorize Runge-Kutta, fitter, etc. and take full benefit from Eigen
➡ use only the curvilinear frame inside the extrapolator
➡ faster tools like the reference Kalman filter
➡ optimized seeding strategy for high pileup
➡ further speedups probably require “new” thinking
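For reference, the Kalman filter update step named above, in its standard textbook form (x is the track state, C its covariance, H the measurement projection, V the measurement covariance, m the measurement):

```latex
\begin{aligned}
K_k &= C_{k-1} H_k^{T} \left( V_k + H_k C_{k-1} H_k^{T} \right)^{-1} \\
x_k &= x_{k-1} + K_k \left( m_k - H_k x_{k-1} \right) \\
C_k &= \left( \mathbb{1} - K_k H_k \right) C_{k-1}
\end{aligned}
```

These small fixed-size matrix products are exactly the operations that benefit from Eigen's compile-time-sized, vectorizable algebra.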
➡ improved topo-clustering for calorimeter showers
➡ new tau reconstruction exploring substructure
➡ new jet and missing ET software, improved pileup stability
➡ particle flow jets
➡ full inclusion of IBL in track reconstruction
➡ emulation of FTK in the Trigger simulation chain (next slide)
[Sketch: identify substructure in tau decays (τ+ → π+ π0 ν) across Pix, SCT, TRT, ECAL (EM1/EM2) and HCAL — “π0 free zone”, π+ track, e+e- conversions, tracking inefficiency.]
[CAD drawing (CATIA): ATLAS IBL — staves, PP0 (I-Flexes), PP0 to PP1 routing (not yet finalized), stave ring & endblocks, stave and module flexes; area (~η = 3.5).]
➡ Level-1: hardware based (~50 kHz)
➡ Level-2: software based with RoI access to full granularity data (~5 kHz)
➡ Event Filter: software trigger (~500 Hz)
➡ descendant of the CDF Silicon Vertex Trigger (SVT)
➡ inputs from Pixel and SCT
➡ two-step reconstruction
➡ provides track information to Level-2 in ~25 μs
➡ FTK is part of digitization & trigger emulation
➡ very resource hungry on CPUs (!)
[Diagram: FTK two-step reconstruction — step 1, step 2; “tracking enters here”.]
➡ various flavours of fast simulation available
➡ the question is what is the best compromise between CPU consumption and accuracy?
➡ e.g. fast treatment of very forward showers in an otherwise full sim.
➡ for large productions of specific samples
[Diagram: simulation hierarchy vs accuracy and CPU consumption — Event Generation (primary interaction, decay, fragmentation; q q g t t) → Geant4 (detector simulation, full physics list; full or alternative/fast shower library) → Digitization → Reconstruction; Fatras (track simulation: material effects, particle decay, photon conversions; fast digitization); Atlfast (track representation smearing; 4-vector, PID; parametric). Accuracy runs from high to low, CPU consumption follows it. Downstream: event reconstruction (efficiency/fakes), physics object creation.]
[Plots by Z. Marshall (i686-slc5-gcc43-opt): total CPU per event with Frozen Showers — minimum bias simulation: 71.7 s; tt simulation: 346.1 s.]
➡ no major code hot spots other than the known ones (EMEC)
➡ a few surprises (pointer sets; physics processes that instantiate a stepper-in-field)
➡ removing all neutrinos and not letting them propagate
➡ removing low energy secondaries from certain processes
➡ revising range cuts at the same time
➡ e.g. debugging a recent issue in G4PolyCone
Zach Marshall
➡ there is a significant number of electrons propagating <100 fm in a step
➡ re-running now to try to drop the x-range of the histogram (batch is slow)
➡ these are steps within a track, not single steps before the electron dies
➡ very few people fully understand the navigation and its interplay with physics processes - this is the major source of headaches and concern in terms of performance
Zach Marshall
➡ full model used in Geant4 with 4.8M placed volumes
➡ reconstruction model for fast tracking
➡ embedded navigation replaces voxelisation
➡ plus: fast adaptive Runge-Kutta-Nyström codes
➡ re-uses track reconstruction infrastructure
➡ combined with particle stack and fast physics processes
➡ optionally: fast digitization codes
Navigation comparison, neutral geantinos, no field lookups (ATLAS G4 / tracking geometry / ratio):
crossed volumes in tracker: 474 / 95 / 5
time in SI2K sec: 19.1 / 2.3 / 8.4
[Sketch: volumes A, B, C with boundary surfaces AB, AC, CB, normals nAC, nCB and path parameters t1, t2.]
A.Salzburger
➡ external particle broker and sim. kernel
➡ simulation codes act as services
➡ based on the RoI guidance used in the Trigger
➡ mix different simulation types in one event (pileup)
➡ exploring the full potential requires:
A.Salzburger, E.Ritsch et al.
Speedup for different simulation mixes (Tracker / Calo. / Muons → speedup):
full / fast / full → ~20
fast / fast / fast+full → >100
RoI guided fast/full → ~100
➡ MC-truth based hit filter to find tracks
➡ replaces pattern recognition in the tracker
➡ real pattern recognition is very efficient and very pure
➡ models the main sources of inefficiencies well
➡ uses the full fit, so resolutions come out right
➡ and it is fast (trivial)!
➡ especially double track resolution
➡ corrections are topology dependent
R.Jansky et al.
[Plots: reconstruction time vs pileup; reconstructed tracks vs truth tracks.]
➡ encountered some technical issues:
➡ new G4-MT version requires some interface changes
➡ make user actions thread-safe
➡ resolve ATHENA integration issues
➡ move from semaphores to TBB
➡ need to understand the best strategy for exploiting parallelization
➡ realistically, the timeline is more towards after LS1 (Run-3?)
Rob Harrington
J.Catmore
In the 2012 data analysis, some physicists had to wait three months for D3PD production before they could start → some results missed their target conferences in 2013.
Each team's format could be incompatible with that of other groups/users → makes cross-checking and inter-group analyses difficult/impossible.
➡ 20% of analysis teams used AOD in ATHENA
➡ mainly based on D3PD - flat ntuples customized per analysis team - and ROOT
➡ the resulting model grew complex and repetitive, with lots of overhead...
➡ factor 2-3 in disk space and CPU time compared to raw reco. + AOD (!!!)
[Diagram: Run-1 analysis flow — Athena reconstruction → (D)AOD (~PB) → many D3PDs (~PB, Athena) → intermediate formats (~TB, ROOT-based tools) → final n-tuple (~GB, ROOT-based tools) → results; software fixes applied at each D3PD production; CP (combined performance) tools used at several stages.]
➡ apply fixes and updates centrally at Tier-0 and update the xAOD on the GRID
➡ more flexibility, reduces production overhead; validation is crucial (!)
[Diagram, “frozen Tier-0 model”: AOD0 is produced at Tier-0 for periods A, B, C and never updated; s/w fixes (0→1, 1→2) are applied only in the many D3PD productions (+ period A, + periods A, B).]
[Diagram, “staged Tier-0 model”: Tier-0 produces AOD0 for period A; after s/w fix 0→1 the Grid reprocesses periods A, B to AOD1; after fix 1→2 periods A, B, C become available as AOD2.]
➡ the xAOD is writeable and readable from both ROOT and ATHENA
➡ ROOT becomes an official ATLAS software framework (for the first time)
[Diagram: new analysis flow — Reconstruction (Athena) → common analysis format = xAOD (~PB) → reduction framework (Athena) → skimmed/slimmed common analysis format (~TB) → ROOT-based or Athena-based analysis with common CP tools → final n-tuple (~GB) → results.]
[Class diagram: IParticle — interface only, holds no data; AuxElement — provides access to the auxiliary store (all data!); concrete types e / µ / τ / jet, TruthParticle, TrackParticle, MyParticle derive from both.]
➡ provides an OO user interface
➡ provides the same flexibility for file content manipulation as the Run-1 D3PD files (flat ntuples)
➡ provides partial & lazy reading from the input file, down to the individual variable level
➡ usable with a small set of EDM libraries (<100 MB)
➡ like for the current D3PD files; see the ROOT I/O workshop discussion
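The interface-plus-auxiliary-store split, with per-variable lazy access, can be sketched in a schematic Python analogue of the C++ design (class and variable names here are illustrative, not the actual xAOD API):

```python
class AuxStore:
    """Columnar payload: one array per variable, each loadable independently."""
    def __init__(self, columns):
        self._columns = columns   # e.g. {"pt": [...], "eta": [...]}
        self.loaded = set()       # records which variables were actually read

    def get(self, name, index):
        self.loaded.add(name)     # lazy: only touched variables are "read"
        return self._columns[name][index]

class Particle:
    """Interface object: holds no data, only a (store, index) handle."""
    def __init__(self, store, index):
        self._store, self._index = store, index

    def __getattr__(self, name):
        # forward any attribute access to the auxiliary store
        return self._store.get(name, self._index)

store = AuxStore({"pt": [25.0, 40.0], "eta": [0.1, -1.2], "phi": [1.0, 2.0]})
electrons = [Particle(store, i) for i in range(2)]

print(electrons[1].pt)          # reads only the "pt" column
print(sorted(store.loaded))     # "eta" and "phi" were never touched
```

The object-oriented call `electrons[1].pt` and the flat, D3PD-like columnar layout coexist; only the variables an analysis touches are ever pulled from the file.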
Attila Krasznahorkay
➡ implementation required updates to the ROOT I/O code
➡ the read rules themselves are very simple - just a way of resetting the cache of the smart pointers after an I/O operation
➡ allows us to read/write DataVector<T> objects as a simple list of T, while still using the special abilities of DataVector transiently
➡ needed to hide differences between classes that ROOT should not be aware of (when the I/O happens inside/outside of our offline software infrastructure)
➡ still to be implemented in ROOT 6
➡ analysis trains per physics team or combined performance activity
➡ ATHENA based, with a concept of smart slimming
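"Smart slimming" — writing out only the variables that downstream tools actually declare — can be illustrated generically (all names here are hypothetical, not the derivation-framework API):

```python
def smart_slim(event, needed):
    """Keep only the (container, variable) pairs requested by downstream tools."""
    return {
        container: {var: payload[var] for var in needed[container] if var in payload}
        for container, payload in event.items()
        if container in needed
    }

# toy event content: containers mapping variable name -> values
event = {
    "Electrons": {"pt": [25.0], "eta": [0.1], "shower_shapes": [7]},
    "Jets":      {"pt": [80.0], "constituents": [42]},
    "CaloCells": {"e": [0.2, 0.3]},   # not requested -> dropped entirely
}
# union of the variables declared by the analysis/CP tools on this train
needed = {"Electrons": {"pt", "eta"}, "Jets": {"pt"}}

slimmed = smart_slim(event, needed)
print(sorted(slimmed))
```

Collecting the declarations from every tool on a train and slimming once is what keeps the derived formats small without each team re-implementing the selection.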
[Diagram: same xAOD analysis flow as above.]
➡ establish new ROOT (and MANA/ATHENA) analysis releases (RootCore / HWAF)
➡ tool interfaces (configuration, messaging, store) transparent to the frameworks
[Diagram: same xAOD analysis flow as above.]
➡ new output format xAOD for the new Analysis Model
➡ redesign of the (simplified) tracking EDM
➡ migration organized following the new tracking EDM
➡ implements xAOD classes for all domains and adapts reconstruction accordingly
➡ deadline: release 19.0.2 next March
[Chart: Jira summary — solved vs. new issues over time.]
➡ test the new Analysis Model, with feedback from the physics groups
➡ commission the ISF in the context of physics analysis
➡ test any updated reconstruction algorithms for Run-2
➡ provide a large scale test of the upgraded distributed computing environment
➡ priority over other activities - necessary to achieve the main goals
[Timeline, Oct 2013 - May 2015: 18.9.0; 19.0.0 (Reco/HLT EDM + algs nearly final); MC samples defined; launch MC14; G4; 19.0.2 validated → launch data and Run-1 MC challenge; 19.0.3 validated → start data analysis challenge, launch Run-2 MC; launch initial MC15; 20.0.Y fully validated → end data challenge; cosmic data; pp collisions.]
Key Deadlines
[Timeline, Oct 2013 - May 2015, ASG milestones: ASG 2 — HWAF, xAOD infrastructure; 18.9.0; ASG 3 — xAOD prototype, CP tool interface, start migration; 19.0.0 (Reco/HLT EDM + algs nearly final); ASG 4 — analysis examples in Root/Mana; 19.0.2 validated; ASG 5 — fully functional for DC14; 19.0.3 validated; 20.0.Y fully validated.]
➡ merging of the ISF simulation branch into the current development release
➡ T/DAQ project branches from the offline dev. release
➡ including (auto-)vectorization and timing optimization
➡ using the Athena T/P layer; non-trivial schema evolution
➡ ASG release and offline releases use the same build system
➡ ATLAS software currently relies heavily on Reflex dictionaries
➡ migration benefits from the Root6 task force and direct help from the Root team (!)
➡ AtlasCore compiles without Reflex in the 17.2.X release branch
➡ smaller, simpler to maintain and much faster “Conversions” and “I/O” code
➡ new Root6 features and improvements
[Diagrams: current ATLAS offline-ROOT relation — datasets read via ROOT dictionaries plus a Reflex dictionary and the Gaudi plug-in manager; step 2: use the new redesign and re-validate the step-1 version to build ATLAS offline against ROOT 6, covering “missed” “use cases”.]
V.Fine, S.Binet
➡ new Analysis Model with an all-new event format (xAOD)
➡ Integrated Simulation Framework with fast and full simulation in one event
➡ integration of Phase-0 detector upgrades in the software chain and algorithmic improvements
➡ code optimization and vectorization, Eigen migration and simplification of the tracking EDM
➡ ADC: new GRID production system and data management system
➡ R&D on multi-threaded applications, new compilers and hardware technologies
New Tracking
pre-processing
➡ Pixel + SCT clustering
➡ TRT drift circle formation
➡ space point formation
combinatorial track finder
➡ iterative:
➡ restricted to roads
➡ bookkeeping to avoid duplicate candidates
ambiguity solution
➡ precise least square fit with full geometry
➡ selection of best silicon tracks using:
extension into TRT
➡ progressive finder
➡ refit of track and selection
TRT segment finder
➡ on remaining drift circles
➡ uses Hough transform
TRT seeded finder
➡ from TRT into SCT+Pixels
➡ combinatorial finder
ambiguity solution
➡ precise fit and selection
➡ TRT seeded tracks
standalone TRT
➡ unused TRT segments
vertexing
➡ primary vertexing
➡ conversion and V0 search
since 17.2.x:
➡ list of selected EM clusters
➡ seed brem. recovery
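The staged chain above can be summarized as a sequential pipeline over the event, with each stage consuming hits that earlier stages left unused — purely schematic; the stage names follow the slide, the toy bookkeeping is hypothetical:

```python
def run_chain(hits, stages):
    """Apply tracking stages in order; each stage removes the hits it used."""
    tracks = []
    for name, stage in stages:
        new_tracks, hits = stage(hits)          # stage returns (tracks, leftover hits)
        tracks.extend((name, t) for t in new_tracks)
    return tracks, hits

# toy stages: inside-out silicon tracking, then TRT-seeded back-tracking,
# then standalone TRT on whatever is left
def silicon_finder(hits):
    return ["si-track"], hits[4:]               # "uses" the first 4 hits

def trt_seeded(hits):
    return ["trt-seeded-track"], hits[2:]       # "uses" 2 more

def standalone_trt(hits):
    return (["trt-only-track"] if hits else []), []

tracks, leftover = run_chain(
    list(range(8)),
    [("inside-out", silicon_finder),
     ("back-tracking", trt_seeded),
     ("TRT-standalone", standalone_trt)],
)
print(tracks, leftover)
```

The ordering encodes the early-rejection strategy: each later, more expensive stage only ever sees what the earlier stages could not use.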