

SLIDE 1

Updates on VecGeom: Focus on SIMD Performance and Developments

Geant4 collaboration meeting, Fermilab, 31.09.2015

Sandro Wenzel / CERN-PH-SFT

For the VecGeom team

SLIDE 2

Sandro Wenzel Geant4 collaboration meeting (Vector session), Fermilab, 31/09/2015

Primary Goals of VecGeom


Provide multi-track interface/API to important shape functions and geometry navigation

[Figure: vectors of particles (labels x1, x2, x3, x4, d1, s); ComputeStep for multiple tracks]

SLIDE 3

Primary Goals of VecGeom

Provide multi-track interface/API to important shape functions and geometry navigation

  • Gain from CPU SIMD units when processing multiple tracks (for simple shapes; for logical volumes with few daughters)
  • Alternatively: gain from CPU SIMD units when processing single tracks (for complicated shapes; for logical volumes with many daughters)
  • Code reuse/compilation on many platforms (including GPUs)

SLIDE 4

Main components of VecGeom

"Shapes": Box, Tube, ...
Geometry Modeller: LogicalVolume, PlacedVolume, Transformations
Navigation: NavigationState, Navigator

Scalar API:
  double DistanceToOut(Vector3D const &p, Vector3D const &d)
  double ComputeStep(Vector3D, Vector3D)

Vector API:
  void DistanceToOut("multi-track" interface)
  void ComputeStep(... "multi-track" interface ...)
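The difference between the scalar and the multi-track flavour of an interface can be sketched as follows. This is a minimal illustration and not VecGeom code: the `Vector3D` struct, the `TrackSOA` container, the free-function form and the choice of an origin-centred box are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for a 3-vector type.
struct Vector3D { double x, y, z; };

// Hypothetical structure-of-arrays track container: element i of every
// array belongs to track i, so a loop over i reads contiguous memory.
struct TrackSOA {
  std::vector<double> px, py, pz;  // positions
  std::vector<double> dx, dy, dz;  // unit directions
  std::size_t size() const { return px.size(); }
};

// Distance along one axis to the exit face of the slab [-half, half].
inline double SlabExit(double p, double d, double half) {
  if (d > 0.0) return (half - p) / d;
  if (d < 0.0) return (-half - p) / d;
  return std::numeric_limits<double>::infinity();  // parallel to the faces
}

// Scalar flavour: one track in, one distance out (point assumed inside).
double DistanceToOut(Vector3D const &half, Vector3D const &p, Vector3D const &d) {
  return std::min({SlabExit(p.x, d.x, half.x),
                   SlabExit(p.y, d.y, half.y),
                   SlabExit(p.z, d.z, half.z)});
}

// Multi-track flavour: many tracks in, results written to an output array.
void DistanceToOut(Vector3D const &half, TrackSOA const &tracks, double *out) {
  assert(out != nullptr);
  for (std::size_t i = 0; i < tracks.size(); ++i) {
    out[i] = std::min({SlabExit(tracks.px[i], tracks.dx[i], half.x),
                       SlabExit(tracks.py[i], tracks.dy[i], half.y),
                       SlabExit(tracks.pz[i], tracks.dz[i], half.z)});
  }
}
```

The multi-track version does the same arithmetic for every track, which is the property a SIMD backend would exploit; here the loop is left to the compiler.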

SLIDE 5

Recap of prototype status early 2014

  • provided SIMD-optimized vector interfaces and algorithms for a few elementary solids and geometry base functions (implemented important functions for particle navigation)
  • can run a chain of algorithms in vector/SIMD mode

Vector flow (four stages are labelled SIMD in the figure):
distFromInside mother volume -> pick next daughter volume -> transform coordinates to daughter frame -> distToOutside daughter volume -> update step + boundary

CHEP13 paper: http://arxiv.org/pdf/1312.0816.pdf
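The chain of algorithms listed on this slide can be sketched for a single track as below. This is a toy version under stated assumptions, not the prototype's code: all shapes are axis-aligned boxes, daughters are placed by translation only, and the names `Slab`, `DistFromInside`, `DistFromOutside`, `Step` and `ComputeStep` are hypothetical helpers introduced for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

struct Vec3 { double x, y, z; };
struct Box { Vec3 half; };                     // origin-centred box
struct Daughter { Box shape; Vec3 origin; };   // translation-only placement

const double kInf = std::numeric_limits<double>::infinity();

// Entry and exit distance of a ray along one axis through the slab [-h, h].
inline void Slab(double p, double d, double h, double &tIn, double &tOut) {
  if (d == 0.0) {  // parallel: either always or never inside this slab
    bool inside = std::fabs(p) <= h;
    tIn = inside ? -kInf : kInf;
    tOut = inside ? kInf : -kInf;
    return;
  }
  double t1 = (-h - p) / d, t2 = (h - p) / d;
  tIn = std::min(t1, t2);
  tOut = std::max(t1, t2);
}

// Distance from a point inside the box to its surface along d.
double DistFromInside(Box const &b, Vec3 p, Vec3 d) {
  double in, ox, oy, oz;
  Slab(p.x, d.x, b.half.x, in, ox);
  Slab(p.y, d.y, b.half.y, in, oy);
  Slab(p.z, d.z, b.half.z, in, oz);
  return std::min({ox, oy, oz});
}

// Distance from a point outside the box to its surface (infinity on a miss).
double DistFromOutside(Box const &b, Vec3 p, Vec3 d) {
  double ix, iy, iz, ox, oy, oz;
  Slab(p.x, d.x, b.half.x, ix, ox);
  Slab(p.y, d.y, b.half.y, iy, oy);
  Slab(p.z, d.z, b.half.z, iz, oz);
  double tNear = std::max({ix, iy, iz});
  double tFar = std::min({ox, oy, oz});
  return (tNear <= tFar && tNear >= 0.0) ? tNear : kInf;
}

struct Step { double length; int nextDaughter; };  // -1: track leaves the mother

Step ComputeStep(Box const &mother, std::vector<Daughter> const &daughters,
                 Vec3 p, Vec3 d) {
  Step step{DistFromInside(mother, p, d), -1};       // distance out of the mother
  assert(step.length >= 0.0);                        // point must be inside
  for (int i = 0; i < static_cast<int>(daughters.size()); ++i) {  // pick next daughter
    Vec3 o = daughters[i].origin;
    Vec3 local{p.x - o.x, p.y - o.y, p.z - o.z};     // transform to daughter frame
    double dist = DistFromOutside(daughters[i].shape, local, d);
    if (dist < step.length) step = Step{dist, i};    // update step + boundary
  }
  return step;
}
```

In vector mode each of these stages would process a whole basket of tracks before the next stage starts; the sketch only shows the order of the stages.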

SLIDE 6

Recap of prototype status early 2014


Platform                   16 particles   1024 particles   SIMD max
Intel IvyBridge (AVX)      ~2.8x          ~4.0x            4x
Intel Haswell (AVX2)       ~3.0x          ~5.0x            4x
Intel Xeon Phi (AVX512)    ~4.1x          ~4.8x            8x

gcc 4.8; -O3 -funroll-loops -mavx; no FMA

Good overall performance gains for such an algorithm (in a toy detector with 4 boxes, 3 tubes, 2 cones), compared to ROOT 5.34.17.

SLIDE 7

Summary of developments after prototype

  • transition of prototype into true library development
      design work ...; integration with USolids developments, ...
  • porting a considerable portion of solid code to VecGeom
      ported/adapted existing (USolids) code into generic, templated and platform-independent code which can be instantiated for the scalar + GPU + multi-track interfaces (following the VecGeom development model); see table on next slide
  • focused somewhat on getting the CMS geometry treatable with VecGeom; now possible
  • a lot of effort went into validating shape algorithms
  • worked on navigator structure, geometry model, etc.
      very much ongoing (active R&D)
  • integration of VecGeom into the Geant-V simulation framework
      more or less achieved, but more effort needed

gitlab.cern.ch/VecGeom/VecGeom
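The idea of one templated kernel that is instantiated for several interfaces can be sketched as follows. This is only an illustration of the pattern, not the VecGeom development model itself: the `Double4` type is a hypothetical 4-lane stand-in for a real SIMD wrapper, and `Abs`, `Min`, `Splat`, `Load`, `BoxSafetyToOut` and `BoxSafetyToOutMany` are names invented for the example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical 4-lane "vector" type standing in for a real SIMD wrapper.
struct Double4 { std::array<double, 4> v; };

inline Double4 Splat(double s) { return Double4{{s, s, s, s}}; }
inline Double4 Load(const double *p) { return Double4{{p[0], p[1], p[2], p[3]}}; }
inline Double4 operator-(Double4 a, Double4 b) {
  for (std::size_t i = 0; i < 4; ++i) a.v[i] -= b.v[i];
  return a;
}

// Helper operations, overloaded for the scalar and for the vector type.
inline double Abs(double a) { return std::fabs(a); }
inline double Min(double a, double b) { return std::min(a, b); }
inline Double4 Abs(Double4 a) {
  for (double &e : a.v) e = std::fabs(e);
  return a;
}
inline Double4 Min(Double4 a, Double4 b) {
  for (std::size_t i = 0; i < 4; ++i) a.v[i] = std::min(a.v[i], b.v[i]);
  return a;
}

// Generic kernel, written once: safety distance from an inside point to the
// surface of an origin-centred box with half-lengths (hx, hy, hz).
template <typename Real>
Real BoxSafetyToOut(Real hx, Real hy, Real hz, Real x, Real y, Real z) {
  return Min(Min(hx - Abs(x), hy - Abs(y)), hz - Abs(z));
}

// Multi-track interface: instantiates the same kernel with the vector type.
void BoxSafetyToOutMany(double hx, double hy, double hz, const double *x,
                        const double *y, const double *z, double *out,
                        std::size_t n) {
  assert(n % 4 == 0);  // sketch only: no handling of a partial last group
  for (std::size_t i = 0; i < n; i += 4) {
    Double4 s = BoxSafetyToOut(Splat(hx), Splat(hy), Splat(hz),
                               Load(x + i), Load(y + i), Load(z + i));
    std::copy(s.v.begin(), s.v.end(), out + i);
  }
}
```

Calling `BoxSafetyToOut` with plain `double` arguments gives the scalar interface; the kernel body contains no branches, so the same source works for both instantiations.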

SLIDE 8

Shape development status mid 2015

Shape                      VecGeom
Box                        yes
Trap + Trd                 yes
Tube[s]                    yes
Cone[s]                    yes
GenericTrap/Arb8           (yes)
Tet
Polycone                   yes
Polyhedron                 yes
Torus                      yes
Parallelepiped             yes
Extruded solid
MultiUnion
Tessellated Solid
Composites                 yes
Templat. Composites        (yes)
Hype, Ellipsoid, Parab     yes
Orb/Sphere                 yes
... the rest ...

"The rest" is Eltu, Twisted[*], ScaledShape, ...

SLIDE 9

Shape development status mid 2015

Same shape table as on the previous slide, extended by SIMD acceleration columns (Multi-Track SIMD improvement; Internal SIMD).

Column entries in extraction order: yes, yes, yes, (incomplete), (yes), (yes), (targeted), (targeted), yes, yes, yes, (targeted), (targeted), (targeted), (yes), yes, yes

SLIDE 10

Example for multi-track SIMD Performance

[Figure: performance of a hollow tube segment, in time units, for DistanceToIn, SafetyToIn and "In-or-Out?"; compared: ROOT, Geant4, USolids, VecGeom scalar API, VecGeom many-track API]

gcc 4.7; -O3 -funroll-loops -mavx; no FMA; Geant4 10.1 (Release); ROOT 5.34.18 (Release); benchmark with 1000 particles

Excellent SIMD vector performance. Total speedup compared to USolids: 3.3x, 7x, 13.62x

SLIDE 11

Multi-particle SIMD performance on Xeon Phi

  • Often achieving considerable vector performance on the Intel Xeon Phi with the multi-track interface (examples: trapezoid and simple tube)
  • theoretical maximum vector gain is 8 for double precision (register width = 512 bits)
  • benchmark performed by Sofia Vallecorsa + Guilherme Amadio (Intel IPCC)

[Figures: trapezoid benchmark and tube benchmark, Vc vectorization, Intel(R) Xeon Phi(TM); functions shown in each: Inside, Contains, SafetyToIn, SafetyToOut, DistanceToIn, DistanceToOut]

SLIDE 12

Example for 1-track SIMD improvement: Polyhedron

[Figure: timings of DistToIn, DistToOut and SafetyToOut for USolids, VecGeom noSIMD and VecGeom SIMD; panels "HBHalf@CMS" and "small test"]

  • for some polyhedra, considerable overall improvement compared to the USolids implementation
  • for very complex shapes, the USolids implementation might be the better choice
  • demonstrated gain from internal vectorization (typically a factor of about 1.4)
  • test done on an Intel Core i7 (AVX) with 1000 particles
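Internal (1-track) vectorization means that the SIMD lanes run over the parts of one shape instead of over many tracks. A minimal sketch of that loop structure is given below. It assumes a convex polyhedron described by bounding planes, which is simpler than the polyhedron algorithm benchmarked on this slide; `PlaneSOA` and the free function `DistanceToOut` are hypothetical names for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical SoA storage of the bounding planes of a convex polyhedron:
// plane i is n_i . x = d_i, with outward unit normal n_i.
struct PlaneSOA {
  std::vector<double> nx, ny, nz, d;
  std::size_t size() const { return d.size(); }
};

// Distance from an inside point p to the surface along direction v.
// One track, many planes: every iteration does the same arithmetic on
// contiguous data, which is the form a SIMD backend can map onto lanes.
double DistanceToOut(PlaneSOA const &pl, double px, double py, double pz,
                     double vx, double vy, double vz) {
  assert(pl.nx.size() == pl.size() && pl.ny.size() == pl.size() &&
         pl.nz.size() == pl.size());
  const double inf = std::numeric_limits<double>::infinity();
  double best = inf;
  for (std::size_t i = 0; i < pl.size(); ++i) {
    double nDotV = pl.nx[i] * vx + pl.ny[i] * vy + pl.nz[i] * vz;
    double gap = pl.d[i] - (pl.nx[i] * px + pl.ny[i] * py + pl.nz[i] * pz);
    double t = (nDotV > 0.0) ? gap / nDotV : inf;  // only planes we move towards
    best = (t < best) ? t : best;                  // branch-free minimum
  }
  return best;
}
```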

SLIDE 13

Global library performance evaluations


SLIDE 14

A global performance evaluation of 1-track mode

  • Trying to benchmark the complete geometry modeller: shapes + navigation
  • Developed an X-Ray benchmark: propagate geantinos pixel by pixel
  • not a realistic benchmark (G4 is not optimized for geantino tracing) ...
  • ... but an indication that we are globally moving in the right direction

dir   G4      ROOT    VecGeom*
y     21.5s   12.7s   5.9s
z     10.7s   6.58s   4.09s

Time to obtain the X-Ray image of the CMS calorimeter along different propagation directions (* current stable state of master branch; further improvements expected).
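The relative gains implied by these timings follow from simple ratios. The helper below is only a convenience for reading the numbers; `Speedup` and `XRayTimes` are names introduced here, and the constants are the timings quoted on this slide.

```cpp
#include <cassert>

// Ratio of a baseline time to a candidate time; > 1 means the candidate is faster.
double Speedup(double baselineSeconds, double candidateSeconds) {
  assert(candidateSeconds > 0.0);
  return baselineSeconds / candidateSeconds;
}

// Timings quoted on the slide, in seconds.
struct XRayTimes { double g4, root, vecgeom; };
const XRayTimes kAlongY{21.5, 12.7, 5.9};
const XRayTimes kAlongZ{10.7, 6.58, 4.09};
```

With these numbers, VecGeom is roughly 3.6x faster than G4 and 2.2x faster than ROOT along y, and roughly 2.6x and 1.6x faster along z.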

SLIDE 15

Scaling on the Xeon Phi

  • Cannot yet compile Geant-V on the Xeon Phi
  • But we can compile the VecGeom X-Ray benchmark and use it for some scaling studies
  • Idea: treat different pixels in different threads (OpenMP)
  • Plot shows thread speedup for x-raying the CMS calorimeter
  • Demonstrating: thread safety of VecGeom; sharing of the geometry among all threads (memory reduction); and perfect scaling up to the number of physical cores

(preliminary; plot provided by Sofia Vallecorsa, Intel IPCC@CERN)
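The idea of treating different pixels in different threads can be sketched with a toy image loop. This is not the VecGeom X-Ray benchmark: the geometry is reduced to hypothetical box footprints (`Footprint`), rays run along z, and the "image" simply counts how many boxes each ray crosses. The OpenMP pragma only takes effect when the compiler is invoked with OpenMP support (for example `-fopenmp` with gcc); otherwise the loop runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy geometry: axis-aligned boxes given by their x/y footprint.
struct Footprint { double xmin, xmax, ymin, ymax; };

// Toy "X-ray": for every pixel, count the boxes crossed by a ray along z.
std::vector<int> XRay(std::vector<Footprint> const &geometry, int nx, int ny,
                      double xlo, double xhi, double ylo, double yhi) {
  assert(nx > 0 && ny > 0);
  std::vector<int> image(static_cast<std::size_t>(nx) * ny, 0);
  // Pixels are independent: the geometry is shared read-only and every
  // iteration writes only its own image element, so no locking is needed.
  #pragma omp parallel for
  for (int iy = 0; iy < ny; ++iy) {
    for (int ix = 0; ix < nx; ++ix) {
      double x = xlo + (ix + 0.5) * (xhi - xlo) / nx;  // pixel centre
      double y = ylo + (iy + 0.5) * (yhi - ylo) / ny;
      int crossed = 0;
      for (Footprint const &f : geometry)
        if (x >= f.xmin && x <= f.xmax && y >= f.ymin && y <= f.ymax) ++crossed;
      image[static_cast<std::size_t>(iy) * nx + ix] = crossed;
    }
  }
  return image;
}
```

Sharing one read-only geometry among all threads is what gives the memory reduction mentioned above; only the image buffer is written.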
SLIDE 16

Comparing VecGeom/TGeo in Geant-V

  • Spent considerable time this year to make CMS@Geant-V run with VecGeom
      many, many debugging sessions :-)
      more or less stable now (validated by number of steps + simple observables)
  • Allows for a first realistic estimate of the overall impact on total simulation time

10 p-p events at 7 TeV in CMS: factor ~1.6 improvement in simulation runtime when switching from ROOT to VecGeom. Only the scalar mode of VecGeom is used so far; further speedup is expected in future.

(preliminary; plot provided by Andrei Gheata)

SLIDE 17

Comparing VecGeom/TGeo in Geant-V

VecGeom has thin "NavigationStates" (no caching of the global matrix; usage of 32-bit indices rather than 64-bit volume pointers). This leads to a considerable memory reduction in Geant-V track objects and in the overall simulation (which also contributes positively to the speed gain).

(preliminary; plot provided by Andrei Gheata)
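Why indices make the state thinner can be seen from a small layout comparison. The sketch below is not the VecGeom NavigationState: `PlacedVolume`, `PointerPath`, `IndexPath` and the maximum depth of 16 are assumptions chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

struct PlacedVolume { int id; };  // hypothetical stand-in

constexpr std::size_t kMaxDepth = 16;  // assumed maximum geometry depth

// Path through the hierarchy stored as pointers to placed volumes.
struct PointerPath {
  std::array<const PlacedVolume *, kMaxDepth> level;
};

// Same path stored as 32-bit indices into a global table of placed volumes.
struct IndexPath {
  std::array<std::uint32_t, kMaxDepth> level;

  // Resolve the volume at a given depth through the shared table.
  const PlacedVolume *At(std::size_t depth, const PlacedVolume *table) const {
    assert(depth < kMaxDepth);
    return table + level[depth];
  }
};
```

On a typical 64-bit platform `PointerPath` occupies 8 bytes per level and `IndexPath` 4 bytes per level, so the per-track state is halved at the cost of one table lookup when a volume is needed.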

SLIDE 18

Latest developments in navigation


SLIDE 19

SIMD acceleration of „Voxel“ navigation

  • currently following ideas based on using (aligned) bounding boxes of geometry objects to filter good hit candidates
  • goal is to implement scalable 1-track navigation in VecGeom that also gains from SIMD vectorization

see, e.g., 10.1111/j.1467-8659.2008.01261.x

SLIDE 20

SIMD acceleration of „Voxel“ navigation

  • get SIMD gain from treating a group of boxes at the same time
  • get scaling from hierarchies of bounding box groups

[Figure: intersection test for a group of bounding boxes, "done in same CPU time"]
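Treating a group of bounding boxes at the same time can be sketched with a slab test over a structure-of-arrays group. This is an illustration under assumptions, not the VecGeom navigator: the group size of 4, the `BoxGroup` layout and the `HitMask` function are invented for the example, and all direction components are assumed non-zero so that the inverse direction is finite.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>

constexpr std::size_t kLanes = 4;  // assumed group size

// Hypothetical SoA group of axis-aligned bounding boxes: lane i is box i.
struct BoxGroup {
  double lo[3][kLanes];  // lower corner, per axis
  double hi[3][kLanes];  // upper corner, per axis
};

// Returns a bit mask: bit i is set if the ray p + t*d (t >= 0) hits box i.
// invD holds 1/d per axis.
unsigned HitMask(BoxGroup const &g, const double p[3], const double invD[3]) {
  assert(p != nullptr && invD != nullptr);
  double tNear[kLanes], tFar[kLanes];
  for (std::size_t i = 0; i < kLanes; ++i) {
    tNear[i] = 0.0;
    tFar[i] = std::numeric_limits<double>::infinity();
  }
  for (int a = 0; a < 3; ++a) {
    for (std::size_t i = 0; i < kLanes; ++i) {  // identical work in every lane
      double t1 = (g.lo[a][i] - p[a]) * invD[a];
      double t2 = (g.hi[a][i] - p[a]) * invD[a];
      tNear[i] = std::max(tNear[i], std::min(t1, t2));
      tFar[i] = std::min(tFar[i], std::max(t1, t2));
    }
  }
  unsigned mask = 0;
  for (std::size_t i = 0; i < kLanes; ++i)
    if (tNear[i] <= tFar[i]) mask |= 1u << i;
  return mask;
}
```

Only the boxes flagged in the mask need the more expensive exact shape test, which is the filtering role of the bounding boxes.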

SLIDE 21

Bounding volume hierarchies

Pure BVH tree (each node = a group of bounding boxes):

  • pure tree structure
  • O(log(N)) scaling
  • but overhead in tree traversal and non-optimal data locality

The CPU/SIMD architecture influences the grouping of boxes. For the SSE instruction set we would make groups of 2 bounding boxes for double precision.
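The group size follows from the register width divided by the element size. The helpers below make that arithmetic explicit; `Lanes` and `GroupCount` are names introduced for this sketch, and the register widths are the standard ones for SSE (128 bits), AVX (256 bits) and 512-bit vector units.

```cpp
#include <cassert>
#include <cstddef>

// Number of elements of type T that fit into one SIMD register.
template <typename T>
constexpr std::size_t Lanes(std::size_t registerBits) {
  return registerBits / (8 * sizeof(T));
}

static_assert(Lanes<double>(128) == 2, "SSE: groups of 2 boxes in double precision");
static_assert(Lanes<double>(256) == 4, "AVX: 4 doubles per register");
static_assert(Lanes<double>(512) == 8, "512-bit registers: 8 doubles");

// Number of groups needed to hold nBoxes bounding boxes (last group may be partial).
std::size_t GroupCount(std::size_t nBoxes, std::size_t lanes) {
  assert(lanes > 0);
  return (nBoxes + lanes - 1) / lanes;
}
```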

SLIDE 22

Bounding volume hierarchies

Shallow tree ("hybrid"):

  • tree of depth 2: "a list of groups of bounding boxes"
  • not O(log(N)) scaling asymptotically
  • but improved data locality, better pipelining

Many other choices are possible ...
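A depth-2 structure of this kind can be sketched as a flat list of clusters, each with one enclosing box and the boxes of its members. This is a simplified scalar illustration, not the implementation benchmarked in the talk: `AABB`, `Cluster`, `Hits` and `HitCandidates` are hypothetical names, and direction components are assumed non-zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct AABB { double lo[3], hi[3]; };

// Slab test: does the ray p + t*d (t >= 0) hit the box? invD = 1/d per axis.
bool Hits(AABB const &b, const double p[3], const double invD[3]) {
  double tNear = 0.0, tFar = std::numeric_limits<double>::infinity();
  for (int a = 0; a < 3; ++a) {
    double t1 = (b.lo[a] - p[a]) * invD[a];
    double t2 = (b.hi[a] - p[a]) * invD[a];
    tNear = std::max(tNear, std::min(t1, t2));
    tFar = std::min(tFar, std::max(t1, t2));
  }
  return tNear <= tFar;
}

// One cluster: an enclosing box plus the boxes of its member daughters.
struct Cluster {
  AABB bounds;
  std::vector<AABB> boxes;
  std::vector<int> daughterIds;
};

// Depth-2 "hybrid": the list of clusters is scanned linearly; member boxes
// are only tested when the enclosing box of their cluster is hit.
std::vector<int> HitCandidates(std::vector<Cluster> const &clusters,
                               const double p[3], const double invD[3]) {
  std::vector<int> candidates;
  for (Cluster const &c : clusters) {
    assert(c.boxes.size() == c.daughterIds.size());
    if (!Hits(c.bounds, p, invD)) continue;            // level 1: skip whole group
    for (std::size_t i = 0; i < c.boxes.size(); ++i)   // level 2: group members
      if (Hits(c.boxes[i], p, invD)) candidates.push_back(c.daughterIds[i]);
  }
  return candidates;
}
```

The linear scan over clusters is what gives up O(log(N)) scaling, while the contiguous per-cluster arrays are what improve data locality.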

SLIDE 23

Navigation performance example

[Figure: navigation time as "factor slower compared to best" for ZDC_EMLayer (111 daughters) and MBWheel_1N (789 daughters); compared: VecGeom naive, ROOT voxel, Geant4 voxel, BVH AVX, Hybrid naive, Hybrid AVX. Bar labels in extraction order: 8.15, 4.2, 1.28, 1.33, 1.8, 1, 2.12, 1.27, 1.98, 1; additional labels: 31.7, 33.6]

MBWheel_1N (~700 volumes): most complex element in the CMS detector

  • implemented clustering of daughter volumes into hierarchic structures
  • benchmark navigation for some complicated shapes from CMS; compare BVH + Hybrid + ...
  • hybrid best in this regime so far; demonstrated gain from SIMD

(preliminary; Yang Zhang (KIT) + Sandro Wenzel (CERN))

SLIDE 24

Summary

  • VecGeom performance for detector simulation is promising (and we are working hard to make it better every day)
  • Offers geometry SIMD gains in both single-track and multi-track modes
  • To do next (my personal, biased opinion):
      better coupling of multi-track mode to Geant-V
      find the optimal combination of single-track and multi-track modes for Geant-V, to benefit as much as possible from all gains
      focus more on the GPU part