Vectorization of single particle Tube DistanceToIn



SLIDE 1

Vectorization of single particle Tube DistanceToIn

  • A particle can hit a tube in three ways:

– Top or bottom Z faces

  • There’s no need to calculate both: only calculate the distance to the closest one (found by taking the absolute value of Z), since it is impossible to hit the other one

  • No benefit from vectorization
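
The Z-face case can be sketched in scalar C++ as follows. This is a minimal illustration, assuming a tube centred on the origin with its axis along z and half-length `dz`; the function name and parameters are hypothetical and are not taken from USolids or Vc. It only returns the distance to the plane of the nearer face, so a caller would still have to check that the hit point lies between Rmin and Rmax and inside the phi range.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper (not a USolids/Vc API).
// Returns the distance along the direction to the plane of the nearer Z face,
// or +infinity if the particle is already within |z| <= dz or is moving away.
inline double DistanceToNearestZPlane(double z, double dirZ, double dz) {
    assert(dz > 0.0);
    const double kInf = std::numeric_limits<double>::infinity();

    // Taking |z| folds the top and bottom faces into a single case.
    const double gap = std::fabs(z) - dz;
    if (gap <= 0.0) return kInf;       // not outside the Z extent

    // The particle approaches the nearer face only if z and dirZ have opposite signs.
    if (z * dirZ >= 0.0) return kInf;

    return gap / std::fabs(dirZ);
}
```

There is a single candidate face and a single division, which is why this part offers nothing to spread across SIMD lanes.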
SLIDE 2

– Inner or outer cylinders

  • Involves solving the set of equations:

– $x'^2 + y'^2 = r^2$
– $x + \mathrm{dist} \cdot \mathrm{dir.x} = x'$ (similarly for $y$)
– Need to solve a quadratic equation of the form $ax^2 + bx + c = 0$
– Can solve for both Rmin and Rmax at the same time using vectorization
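
Substituting the trajectory into the circle equation gives a quadratic in dist with a = dir.x² + dir.y², b = 2(x·dir.x + y·dir.y) and c = x² + y² − r². Only c depends on the radius, so the Rmin and Rmax cases differ in a single term. The sketch below illustrates this in plain C++, with a two-element array standing in for a two-lane SIMD vector; it does not use Vc, and the struct and function names are hypothetical. It returns both roots per lane and leaves the choice of root (and the Z and phi checks) to the caller.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <limits>

// Lane 0 holds the Rmin solution, lane 1 the Rmax solution.
struct CylinderRoots {
    std::array<double, 2> nearRoot;  // smaller root of the quadratic
    std::array<double, 2> farRoot;   // larger root of the quadratic
};

// Hypothetical helper (not a USolids/Vc API).
// Intersects the ray (x, y) + dist * (dirX, dirY) with both cylinders at once.
inline CylinderRoots SolveBothCylinders(double x, double y,
                                        double dirX, double dirY,
                                        double rmin, double rmax) {
    const double kInf = std::numeric_limits<double>::infinity();
    const double a = dirX * dirX + dirY * dirY;
    assert(a > 0.0);  // assumption: the particle is not travelling parallel to z

    const double halfB = x * dirX + y * dirY;  // b / 2, shared by both lanes
    const double rho2 = x * x + y * y;
    const std::array<double, 2> radius = {rmin, rmax};

    CylinderRoots out{};
    // Each iteration is independent: this is the loop a 2-wide vector would replace.
    for (int lane = 0; lane < 2; ++lane) {
        const double c = rho2 - radius[lane] * radius[lane];
        const double disc = halfB * halfB - a * c;   // (b/2)^2 - a*c
        const bool hit = disc >= 0.0;                // plays the role of a SIMD mask
        const double s = std::sqrt(hit ? disc : 0.0);
        out.nearRoot[lane] = hit ? (-halfB - s) / a : kInf;
        out.farRoot[lane]  = hit ? (-halfB + s) / a : kInf;
    }
    return out;
}
```

For a particle at (−10, 0) moving along +x with Rmin = 2 and Rmax = 5, the roots are 8 and 12 for the inner cylinder and 5 and 15 for the outer one.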

SLIDE 3

– Two phi planes

  • Calculate the intersection between two vectors: the trajectory of the particle and the vector of the phi plane

  • Can calculate both distances at the same time using vectorization
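
One way to picture the phi-plane step is in the xy projection, where each phi plane becomes a half-line from the origin at angle phi. The sketch below assumes that model and treats the start and end planes as two independent lanes; it is plain C++ rather than Vc, and the function name and parameters are hypothetical. Radius and Z checks on the hit point are left out.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper (not a USolids/Vc API).
// Lane 0 = plane at phiStart, lane 1 = plane at phiEnd.
// Returns, per lane, the distance along (dirX, dirY) from (x, y) to the
// half-line at angle phi, or +infinity if there is no forward intersection.
inline std::array<double, 2> DistanceToPhiPlanes(double x, double y,
                                                 double dirX, double dirY,
                                                 double phiStart, double phiEnd) {
    assert(dirX != 0.0 || dirY != 0.0);  // assumption: not travelling parallel to z
    const double kInf = std::numeric_limits<double>::infinity();
    const std::array<double, 2> phi = {phiStart, phiEnd};

    std::array<double, 2> dist{};
    for (int lane = 0; lane < 2; ++lane) {
        const double c = std::cos(phi[lane]);
        const double s = std::sin(phi[lane]);

        // Plane normal in xy is (-sin phi, cos phi); solve n.(p + t d) = 0 for t.
        const double nDotP = -s * x + c * y;
        const double nDotD = -s * dirX + c * dirY;
        const bool notParallel = nDotD != 0.0;
        const double t = notParallel ? -nDotP / nDotD : kInf;

        // The hit must lie on the half-line side of the origin, not its mirror image.
        const double along = (x + t * dirX) * c + (y + t * dirY) * s;
        const bool valid = notParallel && t >= 0.0 && along >= 0.0;
        dist[lane] = valid ? t : kInf;
    }
    return dist;
}
```

Both lanes perform the same arithmetic on different angles, including one division each, which is the part a two-wide vector would cover in a single instruction.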

SLIDE 4

– An important observation: each time, we will be filling the vectors with only two elements

  • In AVX, two of the slots in the vector will be empty!

– Should not be a problem in most cases, alas…

– DIVPD latency: 10-20 cycles
– VDIVPD latency: 19-35 cycles
– SQRTPD latency: 8-14 cycles
– VSQRTPD latency: 16-28 cycles
– (number of cycles for a Haswell)

  • In which case we would want to emit SSE instructions, not AVX!

– No ability to switch between instruction sets in Vc “on the fly”
– Would need to compile for SSE (or wait until Vc 1.0 adds support for SIMD arrays of arbitrary length)
» Still would need to be careful: AVX-SSE transition penalties (vzeroupper instruction might help?)

SLIDE 5

Speedups

  • Averaged over 1000 repetitions
  • 50% hit rate – no points inside
  • Turbo off – pinned to core #2
  • CPU governor set to maximum
  • Compiling for SSE 4.2 performed around 10-15% faster than AVX in this particular instance
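
The averaging step of such a measurement could look like the sketch below. It is not the harness used for these slides: the function name is hypothetical, it uses only `std::chrono` from the standard library, and it does not cover core pinning, turbo, or CPU governor settings, which are configured outside the program.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` the given number of times and
// returns the mean wall-clock time per repetition, in seconds.
template <typename Work>
double AverageSeconds(Work&& work, int repetitions = 1000) {
    assert(repetitions > 0);
    using Clock = std::chrono::steady_clock;  // monotonic, unaffected by clock adjustments

    const auto start = Clock::now();
    for (int i = 0; i < repetitions; ++i) {
        work();
    }
    const std::chrono::duration<double> elapsed = Clock::now() - start;
    return elapsed.count() / repetitions;
}
```

A speedup figure would then be the ratio of two such averages, one for the reference implementation and one for the vectorized one, measured on the same input particles.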

SLIDE 6

[Chart: One-particle vectorized DistanceToIn vs USolids. X axis: number of particles (10 to 10000); Y axis: speedup (0 to 4).]