CHEP 2010
How to harness the performance potential of current Multi-Core CPUs and GPUs
Sverre Jarp, CERN openlab
IT Dept., CERN
Taipei, Monday 18 October 2010
Contents
The hardware situation
Current software
Software prototypes
Some recommendations
Conclusions
In the days of the Pentium
Life was really simple:
– Pipeline
– Superscalar
– Increased frequency with each generation
– Sockets
– Nodes: (single-socket) boxes
Today: Seven dimensions of multiplicative performance
In-core dimensions:
– Pipelining
– Superscalar
– Vector width (SIMD = Single Instruction Multiple Data)
Chip- and system-level dimensions:
– Multithreading
– Multicore
– Sockets
– Nodes
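A rough way to see why these dimensions are multiplicative (this formula is an editorial illustration, not from the slide; the symbols are generic):

\[
\text{Throughput} \;\approx\; \underbrace{f_{\text{clock}} \times \text{IPC} \times W_{\text{SIMD}}}_{\text{per core}} \;\times\; N_{\text{cores}} \times N_{\text{sockets}} \times N_{\text{nodes}}
\]

Pipelining and hardware multithreading mainly help keep the IPC term close to its peak; losing any one factor (for instance using only a single SIMD lane) divides the whole product.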
Moore's law
The number of transistors doubles every other year
Single-core → Multi-core → Many-core
Imagine a chip with 1'000'000'000 transistors!
Adapted from Wikipedia
Real consequence of Moore's law
Large number of cores
Four floating-point data flavours (256 bits)
Longer vectors:
– Scalar single precision: one element (E0) out of eight
– Scalar double precision: one element (E0) out of four
– Packed single precision: eight elements (E0 … E7)
– Packed double precision: four elements (E0 … E3)
Without vectors in our software, we will use 1/4 or 1/8 of the available execution width
The move to many-core systems
Sockets × cores × hardware threads per node keeps growing:
– Dual socket, six cores, no SMT: 2 * 6 * 1 = 12
– Dual socket, six cores, 2-way SMT: 2 * 6 * 2 = 24
– Quad socket, twelve cores, no SMT: 4 * 12 * 1 = 48
– Quad socket, eight cores, 2-way SMT: 4 * 8 * 2 = 64
– Quad socket Sun Niagara (T3) processors w/16 cores and 8 threads (each): 4 * 16 * 8 = 512
Accelerators (1): Intel MIC
Based on the x86 architecture, 22 nm (in 2012?)
Each core: in order, 4 hardware threads, SIMD-16 vector unit, own I$ and D$
[Block diagram: many such cores sharing an L2 cache, memory and system interface, display interface, fixed-function and texture logic]
Accelerators (2): Nvidia Fermi GPU
Streaming Multiprocessor (SM) architecture:
– Register file, dual thread scheduler with dispatch units
– 32 cores per SM
– L1 cache (configurable)
[Block diagram adapted from Nvidia; peak performance quoted at 1.15 GHz]
Lots of interest in the HEP on-line community
SW performance: A complicated story!
We write high-level code, for instance to transport particles through matter
A compiler (or an interpreter) transforms the high-level code to machine-level code
A sophisticated processor with a complex architecture and even more complex micro-architecture executes the code
In most cases we have little clue as to the efficiency of this transformation process
We need forward scalability
Software that extracts maximum performance from today's hardware, automatically
More cores, more threads, and longer vectors would automatically be put to good use
Ideally without rewriting the software at every hardware change!
Concurrency in HEP
Events
I/O streams (ROOT trees, branches)
These are the natural units of concurrency in today's software frameworks
HEP programming paradigm
Each job is responsible for processing M events
Typically one job per core (or more); today SMT is often switched off in the BIOS (!)
The aim is to keep every core busy with useful compute!
What are the multi-core options?
Possible way(s) forward:
1) Stay with event-level parallelism (and entirely independent processes)
2) Rely on forking
3) Move to a fully multi-threaded paradigm
– But, watch out for increased complexity
Achieving an efficient memory footprint
[Diagram: Core 0 – Core 3 each hold their own event-specific data; global data (physics processes, physics data, magnetic field) is shared across cores]
Slide shown in my talk at CHEP2007
Today: Multithreaded Geant4 prototype developed at Northeastern University
– Reentrant code
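As a toy illustration of the layout in the diagram (not actual Geant4 code; Geometry, FieldMap, Event and processEvent are invented names, and a C++11 compiler is assumed), the shared data is set up once and only read by the workers, while each worker thread keeps its own event state:

```cpp
// Sketch only: the memory layout of the diagram above, not real Geant4 code.
#include <thread>
#include <vector>

struct Geometry { /* detector description, read-only after setup */ };
struct FieldMap { /* magnetic field, read-only after setup */ };
struct Event    { std::vector<double> hits; /* per-event state */ };

// Shared: written once during initialisation, then only read by all workers.
static const Geometry* gGeometry = 0;
static const FieldMap* gField    = 0;

// One instance per worker thread: the "event-specific data" of the diagram.
thread_local Event currentEvent;

void processEvent(long id)
{
    currentEvent.hits.clear();
    // ... transport particles through *gGeometry using *gField,
    //     filling currentEvent.hits ...
    (void)id;
}

void worker(long first, long count)
{
    for (long i = 0; i < count; ++i)
        processEvent(first + i);
}

int main()
{
    static Geometry geo;  static FieldMap field;
    gGeometry = &geo;     gField = &field;      // set up shared data once

    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t)                 // Core 0 .. Core 3
        pool.emplace_back(worker, t * 1000L, 1000L);
    for (std::size_t t = 0; t < pool.size(); ++t) pool[t].join();
}
```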
Examples of parallelism: CBM/ALICE track fitting
ALICE High Level Trigger (HLT) code
I.Kisel/GSI: "Fast SIMDized Kalman filter based track fit" http://www-linux.gsi.de/~ikisel/17_CPC_178_2008.pdf
Originally ported to the Cell processor
Uses a simplified parameterisation of the magnetic field
Vectorized code operates on packed single-precision data rather than scalars, on several SIMD-capable systems
CBM = "Compressed Baryonic Matter" (experiment at FAIR)
CBM/ALICE track fitting
Operator overloading hides the SIMD details from the user code
The vector classes map directly onto SSE instructions:
– _mm_add_ps corresponds directly to ADDPS, the instruction that operates on four packed, single-precision FP numbers
– P4_F32vec4 – packed single-precision class with overloaded operators
– The overloaded addition is essentially { return _mm_add_ps(a, b); } (single precision) – see the sketch below
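A minimal sketch in the same spirit (not the actual P4_F32vec4 / Kalman-filter code; the class name and main() are invented): the overloaded operators wrap SSE intrinsics, so code written once in a scalar style operates on four values per instruction:

```cpp
// Sketch only: a 4-wide single-precision vector class whose overloaded
// operators map 1:1 onto SSE instructions.
#include <xmmintrin.h>   // SSE intrinsics: __m128, _mm_add_ps, ...
#include <cstdio>

class F32vec4Sketch {
public:
    __m128 v;
    F32vec4Sketch() : v(_mm_setzero_ps()) {}
    F32vec4Sketch(__m128 x) : v(x) {}
    explicit F32vec4Sketch(float s) : v(_mm_set1_ps(s)) {}   // broadcast

    friend F32vec4Sketch operator+(F32vec4Sketch a, F32vec4Sketch b)
    { return _mm_add_ps(a.v, b.v); }                          // ADDPS
    friend F32vec4Sketch operator*(F32vec4Sketch a, F32vec4Sketch b)
    { return _mm_mul_ps(a.v, b.v); }                          // MULPS
};

int main()
{
    float xa[4] = {1, 2, 3, 4}, xb[4] = {10, 20, 30, 40}, out[4];
    F32vec4Sketch a(_mm_loadu_ps(xa)), b(_mm_loadu_ps(xb));

    // One statement, four values updated at once:
    F32vec4Sketch r = a * F32vec4Sketch(2.0f) + b;

    _mm_storeu_ps(out, r.v);
    std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
}
```

In such a fitter the SIMD lanes can be filled with different tracks, so one pass of the filter processes four tracks at once.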
Examples of parallelism: CBM track fitting
From H.Bjerke/CERN openlab, I.Kisel/GSI
Examples of parallelism: GEANT4
Earlier work parallelized Geant4 by distributing events across remote nodes.
The multithreaded prototype exploits event-level parallelism inside a multi-core node
Done by NEU PhD student Xin Dong, using the FullCMS and TestEM examples
– Thread-unsafe state is made thread-private: especially global, "extern", and static declarations
– Preprocessor used for automating the work.
– Large read-only data is shared between threads: physics tables, geometry, stepping, etc.
Dong, Cooperman, Apostolakis: "Multithreaded Geant4: Semi-Automatic Transformation into Scalable Thread-Parallel Software", Europar 2010
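A toy example of that kind of transformation (not the actual Geant4MT patch; the macro name is invented, and __thread is a GCC/ICC extension): a formerly global counter becomes per-thread, so the worker threads no longer race on it:

```cpp
// Sketch only: a global/static declaration rewritten so that each worker
// thread gets its own copy; a macro keeps the change mechanical enough for a
// preprocessor/script to apply it everywhere.
#include <pthread.h>
#include <cstdio>

// Hypothetical macro: expands to GCC/ICC thread-local storage.
#define G4_THREAD_LOCAL __thread

// Before:  static long nSecondaries = 0;        (shared, racy)
// After:   one independent counter per thread   (no locking needed)
static G4_THREAD_LOCAL long nSecondaries = 0;

void* trackEvents(void* arg)
{
    long nEvents = *static_cast<long*>(arg);
    for (long i = 0; i < nEvents; ++i)
        nSecondaries += 3;                 // toy stand-in for real tracking
    std::printf("thread-local count: %ld\n", nSecondaries);
    return 0;
}

int main()
{
    long nEvents = 1000;
    pthread_t t[4];
    for (int i = 0; i < 4; ++i) pthread_create(&t[i], 0, trackEvents, &nEvents);
    for (int i = 0; i < 4; ++i) pthread_join(t[i], 0);
}
```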
Multithreaded GEANT4 benchmark
[Scaling plot, measured on a machine with 4 sockets]
From A.Nowak/CERN openlab
Example: ROOT minimization and fitting
Minuit parallelization is independent of user code
A complex BaBar fitting example was provided and parallelized using MPI
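A generic sketch of the MPI pattern involved (not the actual ROOT/Minuit code; the dataset, pdf() and parameter handling are placeholders): each rank evaluates the negative log-likelihood on its share of the events and the partial sums are combined, so the minimizer only ever sees the total:

```cpp
// Sketch only: the general MPI pattern for parallelizing a likelihood fit.
#include <mpi.h>
#include <cmath>
#include <cstdio>
#include <vector>

// Toy probability density; in a real fit this is the user's model.
static double pdf(double x, double mean) { return std::exp(-0.5*(x-mean)*(x-mean)); }

// Each rank sums -log(pdf) over its slice of the events; the partial sums
// are then combined across all ranks.
double negLogLikelihood(const std::vector<double>& events, double mean,
                        int rank, int nranks)
{
    double local = 0.0;
    for (std::size_t i = rank; i < events.size(); i += nranks)
        local -= std::log(pdf(events[i], mean) + 1e-300);

    double total = 0.0;
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return total;
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    std::vector<double> events(100000, 1.2);        // toy dataset
    double nll = negLogLikelihood(events, 1.0, rank, nranks);
    if (rank == 0) std::printf("NLL = %f\n", nll);

    MPI_Finalize();
}
```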
AthenaMP: event level parallelism
$> Athena.py --nprocs=4 -c EvtMax=100 Jobo.py
[Diagram: the parent process runs init serially on the input files, OS-forks four workers (core-0 … core-3) after the first events; each worker processes its own subset of events in random order (WORKER 0: [0, 4, 5, …], WORKER 1: [1, 6, 9, …], WORKER 2: [2, 8, 10, …], WORKER 3: [3, 7, 11, …]) and writes output tmp files, which the parent merges into the output files at the end]
SERIAL: parent init and fork
PARALLEL: workers event loop
SERIAL: parent merge and finalize
Maximize the shared memory!
From: Mous TATARKHANOV/May 2010
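The same fork/copy-on-write pattern, reduced to a toy standalone C++ program (AthenaMP itself is Python/Athena, not this code; names and numbers are illustrative):

```cpp
// Sketch only: initialisation happens once in the parent; fork() then gives
// each worker a copy-on-write view of that memory, so read-only data stays
// physically shared between the workers.
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main()
{
    const int nWorkers = 4, nEvents = 100;

    // SERIAL: parent init (geometry, conditions, ...) - shared via copy-on-write
    std::vector<double> conditions(1 << 20, 1.0);

    for (int w = 0; w < nWorkers; ++w) {
        if (fork() == 0) {                        // child = worker w
            // PARALLEL: worker event loop over its own subset of events
            for (int evt = w; evt < nEvents; evt += nWorkers) {
                double c = conditions[evt];       // read-only use of shared data
                (void)c;                          // ... process event 'evt' ...
            }
            std::printf("worker %d (pid %d) done, would write tmp file\n",
                        w, (int)getpid());
            _exit(0);
        }
    }

    // SERIAL: parent waits for all workers, then merges their tmp outputs
    while (wait(0) > 0) { /* reap workers */ }
    std::printf("parent: merge tmp files and finalize\n");
}
```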
Memory footprint of AthenaMP
From ~1.5 GB down to ~1.0 GB per process
AthenaMP: ~0.5 GB of physical memory saved per process
From: Mous TATARKHANOV/May 2010
Scalability plots for AthenaMP
[Scaling plots for AthenaMP on a machine with 8 physical cores (16 logical)]
From: Mous TATARKHANOV/May 2010
Some recommendations (based on observations in openlab)
Shortlist
1) Broad Programming Talent
2) Holistic View with a clear split: Prepare to compute – Compute
3) Controlled Memory Usage
4) C++ for Performance
5) Best-of-breed Tools
Broad Programming Talent
Levels of transformation, from problem to electrons:
Problem → Algorithms, abstraction → Source program → Compiled code, libraries → System architecture, instruction-set architecture → Circuits → Electrons
We need both "solution" specialists (top of the stack) and "technology" specialists (bottom of the stack)
Adapted from Y.Patt, U-Austin
Performance guidance (cont'd)
Take the "prepare to compute – compute" split into account
– Via early prototypes
– Structure the job as Pre (prepare) – Heavy compute – Post (finalize)
– The heavy-compute phase is where most of the available parallelism lies
– Use affinity scheduling to pin threads/processes to cores (see the sketch below)
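One possible way to apply affinity scheduling on Linux (a sketch, assuming the slide indeed refers to CPU affinity; pthread_setaffinity_np is a GNU extension and other OSes need different calls):

```cpp
// Sketch only: pin each worker thread to its own core so it keeps its
// caches warm during the heavy-compute phase.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

static void* worker(void* arg)
{
    long core = (long)arg;
    std::printf("worker pinned to core %ld\n", core);
    // ... heavy compute phase runs here ...
    return 0;
}

int main()
{
    const int nThreads = 4;
    pthread_t t[nThreads];

    for (long c = 0; c < nThreads; ++c) {
        pthread_create(&t[c], 0, worker, (void*)c);

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)c, &set);                      // allow only core c
        pthread_setaffinity_np(t[c], sizeof(set), &set);
    }
    for (int c = 0; c < nThreads; ++c) pthread_join(t[c], 0);
}
```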
Performance guidance (cont'd)
Minimize data movement (especially in an accelerator environment)
Optimize the cache hierarchy
Minimize the use of expensive math operations:
– SQRT, DIV; LOG, EXP, POW; ATAN2, SIN, COS
C++ parallelization support
Several options (either in the compiler or as additions):
– Native: pthreads/Windows threads
– OpenMP
– Threading Building Blocks (TBB) and Array Building Blocks (ArBB) from Intel (ArBB integrating RapidMind)
We must also keep a close eye on OpenCL (www.khronos.org/opencl)
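For example, with OpenMP (one of the options above) an independent event loop parallelizes with a single pragma; the work per event below is only a placeholder:

```cpp
// Sketch only: OpenMP applied to a toy event loop. Compile with e.g. -fopenmp.
#include <omp.h>
#include <cstdio>
#include <vector>

int main()
{
    const int nEvents = 100000;
    std::vector<double> result(nEvents);

    // Each iteration (event) is independent, so the loop can be split
    // statically across all available cores/hardware threads.
    #pragma omp parallel for schedule(static)
    for (int evt = 0; evt < nEvents; ++evt) {
        double x = 0.0;
        for (int step = 0; step < 1000; ++step)   // stand-in for real work
            x += evt * 1e-6 + step * 1e-9;
        result[evt] = x;
    }

    std::printf("threads available: %d, result[0]=%g\n",
                omp_get_max_threads(), result[0]);
}
```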
Organization of data: AoS vs SoA
In general, compilers and hardware prefer the latter!
AoS (Array of Structures): spacepoints SP1 … SP6, each holding its own X, Y, Z
SoA (Structure of Arrays): one Spacepoints structure holding X1…X6, Y1…Y6, Z1…Z6 as contiguous arrays
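The two layouts in (illustrative) C++; in the SoA version the x values are contiguous, which is what lets a compiler or hand-written SIMD code load several of them at once:

```cpp
// Sketch only: the two layouts from the slide, with invented names.
#include <cstddef>
#include <vector>

// AoS: one structure per spacepoint (SP1, SP2, ...)
struct SpacePoint { float x, y, z; };
typedef std::vector<SpacePoint> SpacePointsAoS;

// SoA: one structure for all spacepoints, one array per coordinate
struct SpacePointsSoA {
    std::vector<float> x, y, z;
};

// The same loop on both layouts: shift every point along x.
void shiftAoS(SpacePointsAoS& sp, float dx)
{
    for (std::size_t i = 0; i < sp.size(); ++i)
        sp[i].x += dx;               // 12-byte stride between x values
}

void shiftSoA(SpacePointsSoA& sp, float dx)
{
    for (std::size_t i = 0; i < sp.x.size(); ++i)
        sp.x[i] += dx;               // unit stride: easy to auto-vectorize
}

int main()
{
    SpacePointsAoS aos(6);           // SP1 ... SP6
    SpacePointsSoA soa;
    soa.x.resize(6); soa.y.resize(6); soa.z.resize(6);
    shiftAoS(aos, 1.0f);
    shiftSoA(soa, 1.0f);
}
```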
Performance guidance (cont'd)
Libraries
Lots of related presentations during this CHEP conference (Sorry if I missed some!)
– CBM Experiment at FAIR [164]
Concluding remarks
In most HEP programming domains, event-level processing will and should continue to dominate
But it should be able to profit from ALL the available hardware:
– Accelerators with limited memory, as well as …
"Intel platform 2015" (and beyond)
Today's silicon processes:
– 32 nm (we are here)
– 22 nm (2011/12)
[Timeline of process generations set against the LHC data-taking period]
S. Borkar et al. (Intel), "Platform 2015: Intel Platform Evolution for the Next Decade", 2005.
– Source: Bill Camp/Intel HPC
HEP and vectors
For exploiting vectorization, dedicated vector classes can help, e.g.:
http://wiki.physik.uni-heidelberg.de/…/Vc/ (Vc vector classes)