Multiprocessing in Athena

SLIDE 1

Multiprocessing in Athena

I. Performance study of Athena event- and job-level parallelism on multi-core systems.

II. Performance optimizations in AthenaMP.

SLIDE 2

AthenaMJ (Athena multi-jobs) - job-level parallelism

for i in range(4): $> Athena.py -c "EvtMax=25; SkipEvents=$i*25" Jobo.py

core-0: JOB 0: Events [0, 1, …, 24]
core-1: JOB 1: Events [25, …, 49]
core-2: JOB 2: Events [50, …, 74]
core-3: JOB 3: Events [75, …, 99]

LBL-ATLAS-Computing, 2010

[Timeline diagram: PARALLEL - four independent jobs, each with its own init, start, and end.]
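The launch loop above amounts to partitioning the event range into contiguous per-job slices via EvtMax and SkipEvents. A minimal sketch of that arithmetic (mj_slices is a hypothetical helper, not part of Athena):

```python
# Sketch: how AthenaMJ's EvtMax/SkipEvents settings slice 100 events into
# contiguous per-job ranges, one per core. mj_slices is a hypothetical helper.
def mj_slices(n_events: int, n_jobs: int) -> list:
    per_job = n_events // n_jobs                          # EvtMax for each job
    return [list(range(i * per_job, (i + 1) * per_job))   # SkipEvents = i * per_job
            for i in range(n_jobs)]

slices = mj_slices(100, 4)
# JOB 0 gets events 0..24, JOB 1 gets 25..49, ..., JOB 3 gets 75..99
```

Each job is a fully independent Athena process, so the only coordination needed is choosing non-overlapping slices up front.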

SLIDE 3

AthenaMP - event-level parallelism

$> Athena.py --nprocs=4 -c EvtMax=100 Jobo.py

[Diagram: the parent reads the input files, OS-forks the workers, and merges their outputs into the output files.]

core-0: WORKER 0: Events [0, 4, 8, …, 96]
core-1: WORKER 1: Events [1, 5, 9, …, 97]
core-2: WORKER 2: Events [2, 6, 10, …, 98]
core-3: WORKER 3: Events [3, 7, 11, …, 99]

Each worker writes its events to temporary output files, which the parent merges at the end.

Maximize the shared memory!

SERIAL: parent init and fork → PARALLEL: workers run the event loop → SERIAL: parent merge and finalize.

AthenaMP Status by S. Binet - http://indico.cern.ch/getFile.py/access?contribId=2&resId=0&materialId=slides&confId=92059
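The fork-then-round-robin scheme above can be sketched with Python's multiprocessing module (which forks on Linux, as AthenaMP does). This is a toy model, not AthenaMP itself: worker k simply claims events k, k+nprocs, k+2*nprocs, and so on.

```python
import multiprocessing as mp

# Toy model of AthenaMP's round-robin event assignment (not the real code):
# after the fork, worker k claims events k, k+nprocs, k+2*nprocs, ...
def worker(rank, nprocs, evt_max, results):
    events = list(range(rank, evt_max, nprocs))   # this worker's event numbers
    results.put((rank, events))                   # report back to the parent

def run(nprocs=4, evt_max=100):
    results = mp.Queue()
    procs = [mp.Process(target=worker, args=(k, nprocs, evt_max, results))
             for k in range(nprocs)]
    for p in procs:
        p.start()                                 # "OS-fork" into workers
    out = dict(results.get() for _ in procs)      # collect one report per worker
    for p in procs:
        p.join()
    return out
```

With nprocs=4 and evt_max=100 this reproduces the assignment in the diagram: worker 0 gets [0, 4, 8, …, 96] and worker 3 gets [3, 7, 11, …, 99].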

SLIDE 4

Memory footprint of AthenaMP & AthenaMJ

AthenaMP: ~0.5 GB of physical memory saved per process.
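The saving comes from forking after initialization: each worker inherits the parent's pages copy-on-write, so memory is only duplicated for the pages a worker actually writes to. A toy demonstration, with a Python list standing in for Athena's large post-init state:

```python
import multiprocessing as mp

geometry = []          # stand-in for Athena's large post-init state

def init_parent():
    global geometry
    geometry = list(range(1_000_000))   # pretend: the ~0.5 GB of shared state

def read_last(q):
    # the forked child reads the parent's pages without copying them (COW);
    # only pages a worker *writes* to get physically duplicated
    q.put(geometry[-1])

def demo():
    init_parent()
    ctx = mp.get_context("fork")        # the sharing relies on fork, not spawn
    q = ctx.Queue()
    p = ctx.Process(target=read_last, args=(q,))
    p.start()
    last = q.get()
    p.join()
    return last
```

The child never re-runs init_parent(), yet sees the initialized data: that is exactly why AthenaMP workers are cheaper than independent AthenaMJ jobs, each of which must hold its own full copy.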

SLIDE 5

Event throughput of AthenaMP and AthenaMJ

[Plot: AthenaMP vs AthenaMJ event throughput; annotation: "Hit the memory limit, swapping".]

SLIDE 6

1. External optimizations (no touching complex Athena code):
  • Hardware optimizations: Hyper-Threading, QPI, NUMA, affinity
  • OS optimizations: affinity, numactl, IO-related settings, disks, virtual machines, etc.
  • Compiler, malloc, etc.

2. Gains from AthenaMP/Athena design improvements:
  • Shared memory, forking later (after init)
  • Queue event distribution

Endless ground for improvements :)

SLIDE 7

Architecture upgrades

Intel pre-Nehalem (most LXPLUS machines: Voatlas91, lxplus250, lxplus251):
  • CPU-memory symmetric access

Intel Nehalem (coors.lbl.gov, rainier.lbl.gov):
  • Hyper-Threading: two logical cores on one physical core
  • QPI: Quick Path interconnect from CPU to CPU and from CPU to memory
  • Turbo Boost: dynamic change of CPU frequency
  • CPU-memory non-symmetric access (NUMA)
SLIDE 8

Event throughput per process for RDO-to-ESD reconstruction on different machines

SLIDE 9

Gain from Hyper-Threading

[Plot: Hyper-Threading gain for AthenaMP and AthenaMJ.]

SLIDE 10

Setting affinity of workers to CPU cores

Affinity: pinning each process to a separate CPU core. Floating: each process is scheduled by the OS; core switching is frequent.
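On Linux, this pinning can be done from inside a worker with os.sched_setaffinity (the same effect as running taskset or numactl externally). A minimal sketch:

```python
import os

# Sketch: pin the calling process to one CPU core using the Linux-only
# os.sched_setaffinity call (pid 0 means "this process"); this is what
# "pinning worker k to core-k" amounts to, vs. letting the OS float it.
def pin_to_core(core: int) -> None:
    os.sched_setaffinity(0, {core})

pin_to_core(0)                       # this process now runs only on core 0
allowed = os.sched_getaffinity(0)    # -> {0}
```

In the AthenaMP setting, the parent would call a function like this in each worker right after the fork, with a different core number per worker.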

SLIDE 11

Event workers throughput

[Plot: workers floating vs. workers pinned to CPU cores.]

SLIDE 12

Recent progress: event distribution using a Queue

core-0: WORKER 0: Events [0, 4, 5, …]
core-1: WORKER 1: Events [1, 6, 9, …]
core-2: WORKER 2: Events [2, 8, 10, …]
core-3: WORKER 3: Events [3, 7, 11, …]


events = multiprocessing.Queue(EvtMax + ncpus)
# queue contents: [0, 1, 2, 3, 4, …, 99, None, None, None, None]
# (one None sentinel per worker)
evt = events.get()
while evt is not None:
    evt_loop_mgr.seek(evt)
    evt_loop_mgr.nextEvent()
    evt = events.get()

Balance the arrival times of the workers! A slower worker doesn't get left behind, at the cost of losing the event order.
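A runnable version of the sketch above, using Python's multiprocessing (which AthenaMP builds on): the parent enqueues all event numbers plus one None sentinel per worker, and each worker pulls events until it sees a sentinel, so fast workers naturally take more events.

```python
import multiprocessing as mp

def queue_worker(rank, events, done):
    processed = []
    while True:
        evt = events.get()
        if evt is None:            # sentinel: no more events for this worker
            break
        processed.append(evt)      # stand-in for seek(evt) + nextEvent()
    done.put((rank, processed))

def run_queue(evt_max=100, ncpus=4):
    events, done = mp.Queue(), mp.Queue()
    for evt in range(evt_max):
        events.put(evt)
    for _ in range(ncpus):
        events.put(None)           # one sentinel per worker
    procs = [mp.Process(target=queue_worker, args=(k, events, done))
             for k in range(ncpus)]
    for p in procs:
        p.start()
    out = dict(done.get() for _ in procs)
    for p in procs:
        p.join()
    return out
```

Unlike the fixed round-robin split, the queue hands the next event to whichever worker asks first, which balances arrival times at the cost of a deterministic event order.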

SLIDE 13

Workers throughput for the Queue

[Plot: round-robin event distribution vs. Queue event distribution.]

SLIDE 14

  • AthenaMP shares memory: ~0.5 GB of real memory footprint saved per worker.
  • The Queue balances the workers' arrival times, thus improving MP scaling.
  • Hyper-Threading can give a 25-30% gain in event throughput.
  • Affinity settings exploit CPUs better than the Linux CPU scheduler.
  • NUMA effects take place on Nehalem CPUs.

SLIDE 15

1. Externally available performance gains (without touching the Athena code):
  • Architectural gains: Hyper-Threading, QPI, NUMA, etc.
  • OS gains: affinity, numactl, IO-related settings, disks, virtual machines, etc.
  • Compiler, malloc, etc.

2. Gains from Athena/AthenaMP design improvements:
  • Faster initialization…
  • Faster distribution of events to workers…
  • Faster merging: events processed by the workers are merged on the fly by one writer, without waiting for the workers to finish…
  • Faster finalization…

Endless ground for improvements :)

SLIDE 16

  • Paolo Calafiura, Sebastien Binet, Yushu Yao, Charles Leggett, Wim Lavrijsen
  • Keith Jackson, David Levinthal
  • Ian Hinchliffe and the LBL ATLAS Group
  • LBNL and DOE for funding
  • CERN for research