Updates on VecGeom: focus on SIMD performance and developments
Sandro Wenzel / CERN-PH-SFT, for the VecGeom team
Geant4 collaboration meeting, Fermilab, 31.09.2015
Sandro Wenzel Geant4 collaboration meeting (Vector session), Fermilab, 31/09/2015
Primary Goals of VecGeom
Provide multi-track interface/API to important shape functions and geometry navigation
[Figure: vectors of particles (labels x1, x2, x3, x4, d1, s); ComputeStep for multiple tracks]
Gain from CPU SIMD units when processing multiple tracks: for simple shapes, for logical volumes with few daughters
Alternatively: gain from CPU SIMD units when processing single tracks: for complicated shapes, for logical volumes with many daughters
Code re-usage/compilation on many platforms (including GPUs)
Main components of VecGeom
„Shapes“: Box, Tube, ...
Geometry modeller: LogicalVolume, PlacedVolume, Transformations
Navigation: NavigationState, Navigator
Scalar API (shapes): double DistanceToOut(Vector3D const &p, Vector3D const &d)
Scalar API (navigation): double ComputeStep(Vector3D, Vector3D)
Vector API (shapes): void DistanceToOut(„multi-track interface“)
Vector API (navigation): void ComputeStep(... „multi-track“ interface ...)
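The difference between the scalar and the multi-track API can be illustrated with a minimal sketch for a box. Everything here is an assumption made for illustration: the `Vector3D`, `TrackSOA` and `Box` types, the helper `AxisDistanceToOut`, and the structure-of-arrays layout are hypothetical and are not the VecGeom classes or signatures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical minimal types for illustration; these are not the VecGeom classes.
struct Vector3D { double x, y, z; };

// Structure-of-arrays container for many tracks (assumed layout).
struct TrackSOA {
  std::vector<double> px, py, pz; // positions
  std::vector<double> dx, dy, dz; // normalized directions
  std::size_t size() const { return px.size(); }
};

// Distance along one axis from p to the face at +h or -h, for direction component d.
inline double AxisDistanceToOut(double p, double d, double h) {
  if (d > 0) return (h - p) / d;
  if (d < 0) return (-h - p) / d;
  return std::numeric_limits<double>::infinity();
}

// Axis-aligned box centred at the origin with half-lengths hx, hy, hz.
struct Box {
  double hx, hy, hz;

  // Scalar API: one track in, one distance out.
  double DistanceToOut(Vector3D const &p, Vector3D const &d) const {
    return std::min({AxisDistanceToOut(p.x, d.x, hx),
                     AxisDistanceToOut(p.y, d.y, hy),
                     AxisDistanceToOut(p.z, d.z, hz)});
  }

  // Multi-track API: many tracks in, results written to 'out'.
  // The same operation is applied to every track, which is the
  // pattern that SIMD processing exploits.
  void DistanceToOut(TrackSOA const &t, std::vector<double> &out) const {
    assert(out.size() == t.size());
    for (std::size_t i = 0; i < t.size(); ++i) {
      out[i] = std::min({AxisDistanceToOut(t.px[i], t.dx[i], hx),
                         AxisDistanceToOut(t.py[i], t.dy[i], hy),
                         AxisDistanceToOut(t.pz[i], t.dz[i], hz)});
    }
  }
};
```

Both overloads share one per-track computation; only the calling convention differs. A production kernel would likely replace the branches in the helper with branch-free selects so that the loop vectorizes well.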
Recap of prototype status early 2014
provided SIMD optimized vector interfaces and algorithms for a few elementary solids and geometry base functions (implemented important functions for particle navigation)
can run chain of algorithms in vector/SIMD mode
vector flow (stages marked SIMD on the slide): distFromInside mothervolume -> pick next daughter volume -> transform coordinates to daughter frame -> distToOutside daughtervol -> update step + boundary
CHEP13 paper: http://arxiv.org/pdf/1312.0816.pdf
Platform                    16 particles   1024 particles   SIMD max
Intel IvyBridge (AVX)       ~2.8x          ~4.0x            4x
Intel Haswell (AVX2)        ~3.0x          ~5.0x            4x
Intel Xeon Phi (AVX512)     ~4.1x          ~4.8x            8x
gcc 4.8; -O3 -funroll-loops -mavx; no FMA
good overall performance gains for such an algorithm (in a toy detector with 4 boxes, 3 tubes, 2 cones), compared to ROOT/5.34.17
Summary of developments after prototype
transition of prototype into true library development
design work...; integration with USolids developments, ...
porting considerable portion of solid code to VecGeom
ported/adapted existing (USolids) code into generic, templated and platform-independent code which can be instantiated for the scalar + GPU + multi-track interfaces (following the VecGeom development model); see table on next slide
focused somewhat on getting CMS geometry treatable with VecGeom; now possible
a lot of effort went into validating shape algorithms
worked on navigator structure, geometry model, etc.
very much ongoing (active R&D)
integration of VecGeom into Geant-V simulation framework
more or less achieved but more effort needed
gitlab.cern.ch/VecGeom/VecGeom
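The idea of one generic templated kernel that is instantiated for several interfaces can be sketched as follows. This is only an illustration under stated assumptions: the backend names `ScalarBackend` and `Vec4Backend`, their static helper functions, and the kernel `BoxSafetyToOut` are hypothetical, and the 4-lane array type merely stands in for a real SIMD wrapper type.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical backends for illustration; names are not taken from VecGeom.
struct ScalarBackend {
  using Real = double;
  static Real Broadcast(double a) { return a; }
  static Real Abs(Real a) { return std::fabs(a); }
  static Real Sub(Real a, Real b) { return a - b; }
  static Real Min(Real a, Real b) { return std::min(a, b); }
};

// Toy 4-lane "vector" type standing in for a real SIMD wrapper type.
struct Vec4Backend {
  using Real = std::array<double, 4>;
  static Real Broadcast(double a) { Real r; r.fill(a); return r; }
  static Real Abs(Real a) {
    for (double &e : a) e = std::fabs(e);
    return a;
  }
  static Real Sub(Real a, Real const &b) {
    for (std::size_t i = 0; i < a.size(); ++i) a[i] -= b[i];
    return a;
  }
  static Real Min(Real a, Real const &b) {
    for (std::size_t i = 0; i < a.size(); ++i) a[i] = std::min(a[i], b[i]);
    return a;
  }
};

// One generic kernel: safety distance from a point inside an origin-centred
// box (half-lengths hx, hy, hz) to the nearest face. The algorithm is
// written once; the backend decides whether it runs on one value or on lanes.
template <class Backend>
typename Backend::Real BoxSafetyToOut(double hx, double hy, double hz,
                                      typename Backend::Real x,
                                      typename Backend::Real y,
                                      typename Backend::Real z) {
  assert(hx > 0 && hy > 0 && hz > 0);
  using Real = typename Backend::Real;
  const Real sx = Backend::Sub(Backend::Broadcast(hx), Backend::Abs(x));
  const Real sy = Backend::Sub(Backend::Broadcast(hy), Backend::Abs(y));
  const Real sz = Backend::Sub(Backend::Broadcast(hz), Backend::Abs(z));
  return Backend::Min(sx, Backend::Min(sy, sz));
}
```

Calling `BoxSafetyToOut<ScalarBackend>(...)` gives a scalar function, while `BoxSafetyToOut<Vec4Backend>(...)` evaluates four points with the same source code. A GPU instantiation would follow the same pattern with a backend suited to that platform.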
Shape development status mid 2015
Shape                        VecGeom
Box                          yes
Trap + Trd                   yes
Tube[s]                      yes
Cone[s]                      yes
GenericTrap/Arb8             (yes)
Tet
Polycone                     yes
Polyhedron                   yes
Torus                        yes
Parallelepiped               yes
Extruded solid
MultiUnion
Tessellated Solid
Composites                   yes
- Templat. Composites        (yes)
Hype, Ellipsoid, Parab       yes
Orb/Sphere                   yes
... the rest ...
the rest is „Eltu, Twisted[*], ScaledShape, ...“
SIMD acceleration columns (Multi-Track SIMD impr, Internal SIMD), cell values in extraction order: yes, yes, yes, (incomplete), (yes), (yes), (targeted), (targeted), yes, yes, yes, (targeted), (targeted), (targeted), (yes), yes, yes
Example for multi-track SIMD Performance
performance of hollow tube segment
gcc 4.7; -O3 -funroll-loops -mavx; no FMA; Geant4 10.1 (Release); Root 5.34.18 (Release); benchmark with 1000 particles
[Bar chart: time units for DistanceToIn, SafetyToIn and In-or-Out?; series: ROOT, Geant4, USolids, VecGeom scalar API, VecGeom many-track API]
excellent SIMD vector performance; total speedup compared to USolids: 3.3x, 7x, 13.62x
Multi-particle SIMD performance on Xeon Phi
Often achieving considerable vector performance on the Intel Xeon Phi with the multi-track interface (example for the trapezoid and simple tube)
theoretical max vector gain is 8 for double precision (register width = 512 bits)
benchmark performed by Sofia Vallecorsa + Guilherme Amadio (Intel IPCC)
[Figures: trapezoid benchmark and tube benchmark, Vc vectorization, Intel(R) Xeon Phi(TM); functions shown in each: Inside, Contains, SafetyToIn, SafetyToOut, DistanceToIn, DistanceToOut]
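The theoretical maximum vector gain quoted above follows from dividing the register width by the size of one value. A small sketch of that arithmetic, assuming an 8-byte `double` and a 4-byte `float` (true on common platforms); the function names `SimdLanes` and `VectorEfficiency` are made up for this example:

```cpp
#include <cassert>
#include <cstddef>

// Number of values of type T that fit into one SIMD register of the given
// width in bits. This is the theoretical maximum gain from vectorization.
template <class T>
constexpr std::size_t SimdLanes(std::size_t registerBits) {
  return registerBits / (8 * sizeof(T));
}

// Fraction of the theoretical maximum that a measured speedup reaches.
inline double VectorEfficiency(double measuredSpeedup, std::size_t lanes) {
  assert(lanes > 0);
  return measuredSpeedup / static_cast<double>(lanes);
}
```

With these assumptions, a 512-bit register holds 8 doubles, a 256-bit register 4, and a 128-bit register 2, which matches the SIMD maxima and the group size of 2 for SSE mentioned elsewhere in the talk.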
Example for 1-track SIMD improvement: Polyhedron
for some polyhedra, considerable overall improvement compared to the USolids implementation
for very complex shapes, the USolids implementation might be the better choice
demonstrated gain from internal vectorization (typically a factor of about 1.4)
test done on an Intel Core i7 (AVX) with 1000 particles
[Figure: timings of DistToIn, DistToOut and SafetyToOut for USolids, VecGeom noSIMD and VecGeom SIMD; panels „HBHalf@CMS“ and „small test“]
Global library performance evaluations
A global performance evaluation of 1-track mode
Trying to benchmark the complete geometry modeller: shapes + navigation
Developed X-Ray benchmark: propagate geantinos pixel-by-pixel
not a realistic benchmark (G4 is not optimized for geantino tracing), but an indication that we are globally moving in the right direction
Time to obtain the X-Ray image for the CMS calorimeter along different propagation directions:
dir   G4      ROOT    VecGeom*
y     21.5s   12.7s   5.9s
z     10.7s   6.58s   4.09s
(* current stable state of master branch; further improvements expected)
Scaling on the Xeon Phi
preliminary, plot provided by Sofia Vallecorsa (Intel IPCC@CERN)
Cannot yet compile Geant-V on the Xeon Phi
But we can compile the VecGeom X-Ray benchmark and use it for some scaling studies
Idea: treat different pixels in different threads (OpenMP)
Plot shows thread speedup for x-raying the CMS calorimeter
Demonstrating: thread safety of VecGeom; sharing of the geometry among all threads (memory reduction); and perfect scaling up to the number of physical cores
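The idea of treating different pixels in different OpenMP threads while sharing one read-only geometry can be sketched with a toy example. This is not the VecGeom X-Ray benchmark: `ToyGeometry` (a single sphere), `ChordLength` and `XRay` are hypothetical names, and the per-pixel "tracing" is reduced to an analytic path length.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a read-only geometry shared by all threads:
// a sphere of the given radius centred on the image axis.
struct ToyGeometry { double radius; };

// Path length inside the sphere of a ray travelling along z through (x, y).
double ChordLength(ToyGeometry const &geom, double x, double y) {
  const double r2 = geom.radius * geom.radius - x * x - y * y;
  return r2 > 0.0 ? 2.0 * std::sqrt(r2) : 0.0;
}

// Fill an nx-by-ny image covering [-1, 1] x [-1, 1]; one ray per pixel.
std::vector<double> XRay(ToyGeometry const &geom, int nx, int ny) {
  assert(nx > 0 && ny > 0);
  std::vector<double> image(static_cast<std::size_t>(nx) * ny);
  // Each iteration writes only its own pixels and only reads 'geom',
  // so rows can be distributed over threads without locking.
  // Without OpenMP support the pragma is ignored and the loop runs serially.
#pragma omp parallel for
  for (int iy = 0; iy < ny; ++iy) {
    for (int ix = 0; ix < nx; ++ix) {
      const double x = -1.0 + 2.0 * (ix + 0.5) / nx;
      const double y = -1.0 + 2.0 * (iy + 0.5) / ny;
      image[static_cast<std::size_t>(iy) * nx + ix] = ChordLength(geom, x, y);
    }
  }
  return image;
}
```

Because the geometry is passed by const reference and never copied per thread, memory use does not grow with the thread count, which is the sharing property the slide refers to.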
Comparing VecGeom/TGeo in Geant-V
preliminary, plot provided by Andrei Gheata
Spent considerable time this year to make CMS@Geant-V run with VecGeom
many, many debugging sessions :-)
more or less stable now (validated by number of steps + simple observables)
Allows for a first realistic estimate of the overall impact on total simulation time
10 p-p events at 7 TeV in CMS: factor ~1.6 improvement in simulation runtime when switching from ROOT to VecGeom
using only the scalar mode of VecGeom so far; further speedup expected in future
Comparing VecGeom/TGeo in Geant-V
preliminary, plot provided by Andrei Gheata
VecGeom has thin „NavigationStates“ (no caching of global matrix; usage of 32-bit indices rather than 64-bit volume pointers)
this leads to considerable memory reduction in Geant-V track objects and in the overall simulation (which also contributes positively to the speed gain)
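A rough sketch of why a thin navigation state saves memory, assuming a fixed maximum depth and a global table of placed volumes. The types `PlacedVolume`, `FatState` and `ThinState`, the constant `kMaxDepth` and the 12-double matrix layout are all assumptions for illustration and do not reproduce the VecGeom or ROOT data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct PlacedVolume { int id; }; // hypothetical stand-in

constexpr std::size_t kMaxDepth = 16; // assumed maximum geometry depth

// "Fat" state: one pointer per level (64 bits on typical platforms)
// plus a cached global transformation.
struct FatState {
  PlacedVolume const *path[kMaxDepth];
  double globalMatrix[12]; // rotation (9) + translation (3)
  std::uint32_t depth;
};

// "Thin" state: one 32-bit index per level into a shared volume table,
// and no cached matrix (it is recomputed when needed).
struct ThinState {
  std::uint32_t path[kMaxDepth] = {};
  std::uint32_t depth = 0;

  void Push(std::uint32_t volumeIndex) {
    assert(depth < kMaxDepth);
    path[depth++] = volumeIndex;
  }
  void Pop() {
    assert(depth > 0);
    --depth;
  }
  // Resolve the deepest level through the shared table.
  PlacedVolume const &Top(std::vector<PlacedVolume> const &table) const {
    assert(depth > 0);
    return table[path[depth - 1]];
  }
};
```

Since every track carries at least one such state, shrinking it reduces the footprint of each track object; the price is an extra table lookup and recomputation of the transformation on demand.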
Latest developments in navigation
SIMD acceleration of „Voxel“ navigation
currently following ideas based on using (aligned) bounding boxes of geometry objects to filter good hit candidates
goal is to implement scalable 1-track navigation in VecGeom that also gains from SIMD vectorization
see, e.g., 10.1111/j.1467-8659.2008.01261.x
get SIMD gain from treating a group of boxes at the same time
get scaling from hierarchies of bounding box groups
[Figure: one box = group of boxes, „done in same CPU time“]
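Treating a group of bounding boxes at the same time can be sketched with a structure-of-arrays layout and the standard slab test for ray/box intersection. This is a minimal sketch and not the VecGeom navigator: `BoxGroup`, `HitMask` and the group size `kGroup = 4` (four doubles per 256-bit register) are assumptions. The code relies on IEEE arithmetic for zero direction components and assumes the ray origin does not lie exactly on a box face in such a direction.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>

constexpr int kGroup = 4; // assumed group size: 4 doubles per 256-bit register

// Bounds of kGroup axis-aligned boxes, stored per axis so that the same
// coordinate of all boxes is contiguous in memory.
struct BoxGroup {
  double lo[3][kGroup];
  double hi[3][kGroup];
};

// Returns a bit mask with bit i set if the ray o + t*d (t >= 0) hits box i.
unsigned HitMask(BoxGroup const &g, double const o[3], double const d[3]) {
  double tmin[kGroup];
  double tmax[kGroup];
  for (int i = 0; i < kGroup; ++i) {
    tmin[i] = 0.0;
    tmax[i] = std::numeric_limits<double>::infinity();
  }
  for (int a = 0; a < 3; ++a) {
    const double inv = 1.0 / d[a]; // +-inf for a zero component (IEEE)
    // Identical arithmetic for all boxes of the group: a natural SIMD loop.
    for (int i = 0; i < kGroup; ++i) {
      const double t1 = (g.lo[a][i] - o[a]) * inv;
      const double t2 = (g.hi[a][i] - o[a]) * inv;
      tmin[i] = std::max(tmin[i], std::min(t1, t2));
      tmax[i] = std::min(tmax[i], std::max(t1, t2));
    }
  }
  unsigned mask = 0;
  for (int i = 0; i < kGroup; ++i) {
    if (tmin[i] <= tmax[i]) mask |= 1u << i;
  }
  return mask;
}
```

Only the daughters whose bit is set in the returned mask need the more expensive exact distance calculation, which is the filtering step described above.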
Bounding volume hierarchies
pure BVH tree
CPU/SIMD architecture has influence on grouping of boxes. For SSE instruction set we would make groups of 2 bounding boxes for double precision
[Figure legend: box symbol = group of bounding boxes]
pure tree structure: O(log(N)) scaling, but overhead in tree traversal and non-optimal data locality
many other choices possible ...
shallow tree: „hybrid“
tree of depth 2: „a list of groups of bounding boxes“
not O(log(N)) scaling asymptotically, but improved data locality and better pipelining
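The „hybrid“ layout, a list of groups of bounding boxes, can be sketched as a two-level search: one test per group, then tests on the boxes inside the groups that were hit. The types `Aabb` and `Cluster` and the functions `Hits` and `Candidates` are hypothetical, the box test is a plain scalar slab test, and the same IEEE assumptions about zero direction components apply (the ray origin must not lie exactly on a box face in such a direction).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Aabb { double lo[3], hi[3]; }; // axis-aligned bounding box

// Scalar slab test: does the ray o + t*d (t >= 0) hit the box?
bool Hits(Aabb const &b, double const o[3], double const d[3]) {
  double tmin = 0.0;
  double tmax = std::numeric_limits<double>::infinity();
  for (int a = 0; a < 3; ++a) {
    const double inv = 1.0 / d[a]; // +-inf for a zero component (IEEE)
    const double t1 = (b.lo[a] - o[a]) * inv;
    const double t2 = (b.hi[a] - o[a]) * inv;
    tmin = std::max(tmin, std::min(t1, t2));
    tmax = std::min(tmax, std::max(t1, t2));
  }
  return tmin <= tmax;
}

// Level 1: one enclosing box per group. Level 2: the daughters' boxes.
struct Cluster {
  Aabb bounds;
  std::vector<int> daughterIds;
  std::vector<Aabb> daughterBoxes;
};

// Walk the flat list of groups; descend only into groups whose bounds are hit.
std::vector<int> Candidates(std::vector<Cluster> const &clusters,
                            double const o[3], double const d[3]) {
  std::vector<int> out;
  for (Cluster const &c : clusters) {
    assert(c.daughterIds.size() == c.daughterBoxes.size());
    if (!Hits(c.bounds, o, d)) continue; // whole group rejected by one test
    for (std::size_t i = 0; i < c.daughterBoxes.size(); ++i) {
      if (Hits(c.daughterBoxes[i], o, d)) out.push_back(c.daughterIds[i]);
    }
  }
  return out;
}
```

The outer loop is linear in the number of groups, so the scaling is not logarithmic, but both levels iterate over contiguous arrays. In a SIMD version the inner loop would be the place for a grouped box test.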
Navigation performance example
[Bar chart: „factor slower compared to best“ for VecGeom naive, ROOT voxel, Geant4 voxel, BVH AVX, Hybrid naive and Hybrid AVX, on ZDC_EMLayer (111 daughters) and MBWheel_1N (789 daughters); bar labels in extraction order: 8.15, 4.2, 1.28, 1.33, 1.8, 1, 2.12, 1.27, 1.98, 1, 31.7, 33.6]
preliminary, Yang Zhang (KIT) + Sandro Wenzel (CERN)
MBWheel_1N (~700 volumes); most complex element in CMS detector
implemented clustering of daughter volumes into hierarchical structures
benchmarked navigation for some complicated shapes from CMS; compared BVH + Hybrid + ...
hybrid best in this regime so far; demonstrated gain from SIMD