 
Multiprocessing in Athena
I. Performance study of Athena event and job level parallelism on multi-core systems.
II. Performance optimizations in AthenaMP.
Athena multi jobs
Athena MJ - job level parallelism
for i in range(4): $> Athena.py -c "EvtMax=25; SkipEvents=$i*25" Jobo.py
core-0 JOB 0: start, init, end. Events: [0,1,…,24]
core-1 JOB 1: start, init, end. Events: [25,…,49]
core-2 JOB 2: start, init, end. Events: [50,…,74]
core-3 JOB 3: start, init, end. Events: [75,…,99]
PARALLEL: independent jobs
LBL-ATLAS-Computing, 2010
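The job splitting above can be sketched in Python. This is only an illustration of the EvtMax/SkipEvents arithmetic, assuming the total number of events divides evenly among the jobs; the helper name `job_commands` is hypothetical, and the commands are only built, not launched.

```python
def job_commands(total_events, njobs, jobo="Jobo.py"):
    """Build one command line per independent job (hypothetical helper).

    Each job gets a contiguous block of events, as on the slide:
    job i processes EvtMax events after skipping the first i * EvtMax.
    """
    if total_events % njobs != 0:
        raise ValueError("sketch assumes events divide evenly among jobs")
    per_job = total_events // njobs
    commands = []
    for i in range(njobs):
        skip = i * per_job
        options = f"EvtMax={per_job}; SkipEvents={skip}"
        commands.append(["Athena.py", "-c", options, jobo])
    return commands


for cmd in job_commands(100, 4):
    print(" ".join(cmd))
```

Each command could then be started as a separate process, one per core; the jobs share nothing, so every job repeats its own initialization.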
AthenaMP - event level parallelism
$> Athena.py --nprocs=4 -c EvtMax=100 Jobo.py
Input Files
SERIAL: parent-init-fork (init, firstEvnts, OS-fork)
PARALLEL: workers event loop
core-0 WORKER 0: Events: [0, 4, 8,…,96] -> output tmp files
core-1 WORKER 1: Events: [1, 5, 9,…,97] -> output tmp files
core-2 WORKER 2: Events: [2, 6, 10,…,98] -> output tmp files
core-3 WORKER 3: Events: [3, 7, 11,…,99] -> output tmp files
SERIAL: parent-merge and finalize (merge, end)
Output Files
Maximize the shared memory!
AthenaMP Status by S.Binet - http://indico.cern.ch/getFile.py/access?contribId=2&resId=0&materialId=slides&confId=92059
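The init / fork / event loop / merge flow above can be sketched with plain `os.fork` (Unix only). This is a toy model under stated assumptions, not the AthenaMP implementation: `worker_events` and `run_forked` are hypothetical names, "processing" an event just writes its number to a temporary file, and the merge simply concatenates and sorts the workers' files.

```python
import os
import tempfile


def worker_events(worker_id, nworkers, evt_max):
    """Round-robin assignment: worker k gets events k, k+n, k+2n, ..."""
    return list(range(worker_id, evt_max, nworkers))


def run_forked(nworkers, evt_max):
    # SERIAL: parent initialization. Memory allocated before the fork is
    # shared copy-on-write by all workers afterwards.
    config = {"evt_max": evt_max}
    with tempfile.TemporaryDirectory() as tmpdir:
        pids = []
        for wid in range(nworkers):
            pid = os.fork()
            if pid == 0:
                # PARALLEL: child process runs its own event loop.
                status = 1
                try:
                    path = os.path.join(tmpdir, f"worker-{wid}.tmp")
                    with open(path, "w") as out:
                        for evt in worker_events(wid, nworkers, config["evt_max"]):
                            out.write(f"{evt}\n")
                    status = 0
                finally:
                    os._exit(status)  # never fall back into the parent's code
            pids.append(pid)
        for pid in pids:
            os.waitpid(pid, 0)
        # SERIAL: parent merges the per-worker temporary files.
        merged = []
        for wid in range(nworkers):
            with open(os.path.join(tmpdir, f"worker-{wid}.tmp")) as src:
                merged.extend(int(line) for line in src)
    return sorted(merged)


print(run_forked(4, 100)[:8])
```

Forking after initialization is what lets the workers share the memory pages set up by the parent, which is the point of "Maximize the shared memory!".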
Memory footprint of AthenaMP & AthenaMJ
AthenaMP: ~0.5 GB physical memory saved per process
Event throughput of AthenaMP and AthenaMJ
[Plot: curves for AthenaMP and AthenaMJ; annotation: "Hit the memory limit, swapping"]
1. External Optimizations (no touching complex Athena code):
• Hardware optimizations: HT, QPI, NUMA, affinity
• OS optimizations: affinity, numactl, io-related, disks, virtual machines, etc.
• Compiler, Malloc, etc.
2. Gains from AthenaMP/Athena design improvements:
• Shared memory, forking later after init
• Queue event distribution
Endless ground for improvements :)
Architecture upgrades
Intel Nehalem (coors.lbl.gov, rainier.lbl.gov):
• Hyper-Threading -> two logical cores on one physical core
• QPI: Quick Path from CPU to CPU and CPU to memory
• Turbo Boost -> dynamic change of CPU frequency
• CPU-memory non-symmetric access (NUMA)
Intel sub-Nehalem (most of LXPLUS machines: Voatlas91, lxplus250, lxplus251):
• CPU-memory symmetric access
Event throughput per process for RDO to ESD reco on different machines
Gain from Hyper-Threading
[Plot: AthenaMP and Athena MJ]
Setting affinity of workers to cpu-cores
Affinity: pinning each process to a separate CPU core
Floating: each process is scheduled by the OS; core switching is frequent
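Pinning can be tried from Python with `os.sched_setaffinity`, which exists only on Linux. The sketch below is an assumption about how a worker might pin itself, not the AthenaMP code; `pin_to_core` is a hypothetical helper, and it returns None on platforms without the call.

```python
import os


def pin_to_core(worker_index):
    """Pin the calling process to one core (hypothetical helper, Linux only).

    Returns the core id that was chosen, or None if the platform has no
    sched_setaffinity (the process then stays "floating").
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    # Only choose among the cores this process is currently allowed to use.
    allowed = sorted(os.sched_getaffinity(0))
    core = allowed[worker_index % len(allowed)]
    os.sched_setaffinity(0, {core})  # pid 0 means "the calling process"
    return core


core = pin_to_core(0)
print("pinned to core:", core)
```

Each worker would call this once after the fork with its own index, so that workers land on distinct cores when enough cores are available.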
Event workers throughput
[Plots: workers floating; workers pinned to cpu-cores]
Recent Progress: Event distribution using Queue
events = multiprocessing.Queue(EvtMax+ncpus)
events = [0,1,2,3,4,…,99, None,None,None,None]
evt_loop(evt=events.get(); evt != None):
    evt_loop_mgr.seek(evt_nbr)
    evt_loop_mgr.nextEvent()
core-0 WORKER 0: Events: [0, 4, 5,…]
core-1 WORKER 1: Events: [1, 6, 9,…]
core-2 WORKER 2: Events: [2, 8, 10,…]
core-3 WORKER 3: Events: [3, 7, 11,…]
Event order is lost.
Balance the arrival times of workers! A slower worker doesn't get left behind.
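The queue scheme above can be sketched as runnable Python. This is a minimal model, assuming a fork-based start method (Unix only): the event numbers are followed by one None sentinel per worker, and each worker pulls the next event as soon as it is free. The names `worker` and `run_queue` are hypothetical, and appending to a list stands in for the `seek` / `nextEvent` calls.

```python
import multiprocessing as mp


def worker(wid, events, results):
    """Pull event numbers until the None sentinel arrives."""
    done = []
    while True:
        evt = events.get()
        if evt is None:
            break
        done.append(evt)  # stand-in for seek(evt) followed by nextEvent()
    results.put((wid, done))


def run_queue(evt_max, ncpus):
    ctx = mp.get_context("fork")  # assumption: fork is available
    events = ctx.Queue(evt_max + ncpus)
    results = ctx.Queue()
    for evt in range(evt_max):
        events.put(evt)
    for _ in range(ncpus):
        events.put(None)  # one sentinel per worker
    procs = [ctx.Process(target=worker, args=(w, events, results))
             for w in range(ncpus)]
    for p in procs:
        p.start()
    # Collect results before joining so workers never block on a full pipe.
    per_worker = dict(results.get() for _ in procs)
    for p in procs:
        p.join()
    return per_worker


for wid, evts in sorted(run_queue(100, 4).items()):
    print(wid, evts[:5])
```

Which events a given worker receives varies from run to run, which is why event order is lost; in exchange, a slow worker simply takes fewer events, and all workers finish at about the same time.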
Workers throughput for Queue
[Plots: round-robin event distribution; Queue event distribution]
• AthenaMP shares memory: about 0.5 GB of physical memory is saved per worker.
• The Queue balances the workers' arrival times, thus improving mp-scaling.
• Hyper-Threading can give a 25-30% gain in event throughput.
• Affinity settings exploit the CPUs better than Linux CPU scheduling.
• NUMA effects take place on Nehalem CPUs.
1. Externally available performance gains (without touching the Athena code):
• Architectural gains: Hyper-Threading, QPI, NUMA, etc.
• OS gains: affinity, numactl, io-related, disks, virtual machines, etc.
• Compiler, Malloc, etc.
2. Gains from Athena/AthenaMP design improvements:
• Faster initialization…
• Faster distribution of events to workers…
• Faster merging: one writer merges the events processed by workers on the fly, without waiting for the workers to finish…
• Faster finalization…
Endless ground for improvements :)
• Paolo Calafiura, Sebastien Binet, Yushu Yao, Charles Leggett, Wim Lavrijsen
• Keith Jackson, David Levinthal
• Ian Hinchliffe and LBL ATLAS Group
• LBNL and DOE for Funding
• CERN for Research