 
Vectorization of single particle Tube DistanceToIn
• A particle can hit a tube in three ways:
  – Top or bottom Z faces
    • There is no need to calculate both: take the absolute value of Z and compute the distance to the closest face only, since the other one cannot be hit
    • No benefit from vectorization
  – Inner or outer cylinders
    • Involves solving the set of equations:
      – $x'^2 + y'^2 = r^2$
      – $x + dist \cdot dir.x = x'$ (similarly for y)
    • Need to solve a quadratic equation of the form $ax^2 + bx + c = 0$, where the unknown is the distance
    • Can solve for both Rmin and Rmax at the same time using vectorization
  – Two phi planes
    • Calculate the intersection between two vectors: the trajectory of the particle and the vector of the phi plane
    • Can calculate both distances at the same time using vectorization
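The three hit types listed above can be sketched in plain C++ as follows. This is a minimal sketch, not the USolids or Vc implementation: the names `TwoLanes`, `DistanceToZFace`, `DistanceToCylinders` and `DistanceToPhiPlanes` are hypothetical, the two-element loops stand in for a two-lane SIMD vector, and the simplifications (infinite cylinders, first crossing only, no cross-checks between surfaces) are assumptions made for brevity.

```cpp
#include <array>
#include <cmath>
#include <limits>

// Illustrative names only; none of this is the USolids or Vc API.
constexpr double kInf = std::numeric_limits<double>::infinity();

struct TwoLanes {
    std::array<double, 2> dist;  // distance per lane, kInf when there is no hit
    std::array<bool, 2> valid;   // true when the lane holds a usable hit
};

// Z faces: by symmetry only the face on the particle's side can be hit.
// The radial bounds at the hit point are not checked in this sketch.
inline double DistanceToZFace(double z, double dirZ, double halfZ) {
    const double absZ = std::fabs(z);
    const double approach = (z >= 0.0) ? -dirZ : dirZ;  // > 0 when moving toward the tube
    if (absZ <= halfZ || approach <= 0.0) return kInf;
    return (absZ - halfZ) / approach;
}

// Cylinders: lane 0 = Rmin, lane 1 = Rmax. Substituting x' = x + t*dx and
// y' = y + t*dy into x'^2 + y'^2 = r^2 gives a*t^2 + b*t + c = 0, where only
// c depends on the radius. Returns the smaller root per lane (simplification:
// a full version would take the larger root for Rmin when starting in the hole).
inline TwoLanes DistanceToCylinders(double x, double y, double dx, double dy,
                                    double rmin, double rmax) {
    const double a = dx * dx + dy * dy;
    const double b = 2.0 * (x * dx + y * dy);
    const double rho2 = x * x + y * y;
    const std::array<double, 2> r = {rmin, rmax};
    TwoLanes out{};
    for (int lane = 0; lane < 2; ++lane) {
        const double c = rho2 - r[lane] * r[lane];
        const double disc = b * b - 4.0 * a * c;
        const bool ok = (a > 0.0) && (disc >= 0.0);
        const double t = ok ? (-b - std::sqrt(disc)) / (2.0 * a) : kInf;
        out.valid[lane] = ok && (t >= 0.0);
        out.dist[lane] = out.valid[lane] ? t : kInf;
    }
    return out;
}

// Phi planes: lane 0 = start plane, lane 1 = end plane. Each plane contains
// the Z axis and the direction (cos phi, sin phi); its normal is (-sin, cos).
inline TwoLanes DistanceToPhiPlanes(double x, double y, double dx, double dy,
                                    double phiStart, double phiEnd) {
    const std::array<double, 2> phi = {phiStart, phiEnd};
    TwoLanes out{};
    for (int lane = 0; lane < 2; ++lane) {
        const double c = std::cos(phi[lane]);
        const double s = std::sin(phi[lane]);
        const double num = s * x - c * y;     // -(normal . position)
        const double den = -s * dx + c * dy;  //   normal . direction
        const bool parallel = std::fabs(den) < 1e-12;
        const double t = parallel ? 0.0 : num / den;
        // The hit must lie on the half-plane pointing along (cos, sin).
        const double radial = (x + t * dx) * c + (y + t * dy) * s;
        out.valid[lane] = !parallel && (t >= 0.0) && (radial >= 0.0);
        out.dist[lane] = out.valid[lane] ? t : kInf;
    }
    return out;
}
```

For example, `DistanceToCylinders(-5, 0, 1, 0, 1, 2)` describes a particle on the X axis moving toward the tube; the Rmax lane holds the crossing at distance 3 and the Rmin lane the crossing at distance 4. In both two-lane functions the loop body is identical for the two lanes, which is what makes the pair a candidate for a single vector operation.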
  – An important observation: each time, we will be filling the vectors with only two elements
    • In AVX (four doubles per vector), two of the slots in the vector will be empty!
      – Should not be a problem in most cases, but divide and square root are considerably slower in their 256-bit AVX form (latencies in cycles on Haswell):
        » DIVPD latency: 10-20 cycles
        » VDIVPD latency: 19-35 cycles
        » SQRTPD latency: 8-14 cycles
        » VSQRTPD latency: 16-28 cycles
    • In that case we would want to emit SSE instructions, not AVX!
      – Vc has no ability to switch between instruction sets "on the fly"
      – Would need to compile for SSE (or wait until Vc 1.0 adds support for SIMD arrays of arbitrary length)
        » Would still need to be careful about AVX-SSE transition penalties (the vzeroupper instruction might help?)
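The two-element case can also be written against 128-bit registers explicitly. The sketch below is an illustration under assumptions, not what Vc generates: the function name `SqrtOverDenominator` is hypothetical, the SSE2 intrinsics come from `<emmintrin.h>`, and a scalar fallback is used when `__SSE2__` is not defined. It performs the square root and division of a two-lane quadratic solve in a vector that is exactly two doubles wide, so no slots are left empty.

```cpp
#include <array>
#include <cmath>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics: 128-bit vectors of two doubles
#endif

// Hypothetical helper: computes sqrt(disc) / denom for two discriminants at once.
inline std::array<double, 2> SqrtOverDenominator(double disc0, double disc1,
                                                 double denom) {
    std::array<double, 2> out{};
#if defined(__SSE2__)
    // _mm_set_pd takes the high element first, so disc0 ends up in lane 0.
    const __m128d disc = _mm_set_pd(disc1, disc0);
    const __m128d root = _mm_sqrt_pd(disc);                     // two square roots
    const __m128d quot = _mm_div_pd(root, _mm_set1_pd(denom));  // two divisions
    _mm_storeu_pd(out.data(), quot);
#else
    // Scalar fallback for targets without SSE2 (assumption: same semantics).
    out[0] = std::sqrt(disc0) / denom;
    out[1] = std::sqrt(disc1) / denom;
#endif
    return out;
}
```

Which encoding the compiler chooses for these intrinsics depends on the target flags, so the generated assembly is worth inspecting before drawing conclusions about transition penalties.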
    • Compiling for SSE 4.2 performed around 10-15% faster than AVX in this particular instance

Speedups
• Averaged over 1000 repetitions
• 50% hit rate, no points inside
• Turbo off, process pinned to core #2
• CPU governor set to maximum
[Figure: One-particle vectorized DistanceToIn vs USolids. Y axis: speedup (0 to 4). X axis: number of particles (10 to 10000).]