RICH Cherenkov angle status report March 2017

SLIDE 1

Memory layout Performance improvements

RICH Cherenkov angle status report March 2017

Christina Quast March 6, 2017

Christina Quast RICH Cherenkov angle status report March 2017

SLIDE 2

Nanoseconds per photon

Theoretical limit: 52.0 B / 340 GB/s = 0.153 ns
For 33554432 photons:
Old solver float: 1000.26 ns
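The theoretical limit is the number of bytes moved per photon divided by the memory bandwidth. A minimal sketch of that arithmetic follows. The function name is hypothetical, and reading the 52 B as 13 single-precision values per photon (ten inputs loaded, three reflection-point coordinates stored) is an assumption, not something the slides state.

```cpp
#include <cassert>

// Hypothetical helper: lower bound on the time per photon for a kernel that
// is limited purely by memory bandwidth.
//   bytes_per_photon   : bytes read plus written per photon (52.0 on the slide)
//   bandwidth_gb_per_s : memory bandwidth in GB/s (340 on the slide)
// 1 GB/s is 1 byte per nanosecond, so the quotient is already in nanoseconds.
inline double bandwidth_limit_ns(double bytes_per_photon, double bandwidth_gb_per_s) {
    assert(bandwidth_gb_per_s > 0.0);
    return bytes_per_photon / bandwidth_gb_per_s;
}

// Assumed breakdown of the 52 B: 13 single-precision values of 4 bytes each.
static_assert(13 * sizeof(float) == 52, "13 floats occupy 52 bytes");
```

Calling bandwidth_limit_ns(52.0, 340.0) gives roughly 0.153 ns, the figure quoted on the slide.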

SLIDE 3

Memory layout before

SLIDE 4

Memory layout before 2

SLIDE 5

Memory layout after
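The layout figures did not survive extraction. Judging only from the accessors in the diffs on the later slides (data.emissPnt.x()[0], data.radius[0], DIM = 16), the new layout appears to group 16 photons per block and store each coordinate contiguously. The sketch below illustrates that idea under this assumption. The type names Vec3Block and PhotonBlock and the helper on_cache_line are hypothetical, not the author's.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical coordinate block: DIM x values, then DIM y values, then DIM z
// values, each array starting on a 64-byte boundary.
template <typename T, std::size_t DIM>
struct Vec3Block {
    alignas(64) std::array<T, DIM> xs;
    alignas(64) std::array<T, DIM> ys;
    alignas(64) std::array<T, DIM> zs;

    std::array<T, DIM>& x() { return xs; }
    std::array<T, DIM>& y() { return ys; }
    std::array<T, DIM>& z() { return zs; }
};

// Hypothetical block of DIM photons (array of structures of arrays).
// With T = float and DIM = 16 every member array fills one 64-byte cache line.
template <typename T, std::size_t DIM = 16>
struct PhotonBlock {
    Vec3Block<T, DIM> emissPnt;      // emission point
    Vec3Block<T, DIM> centOfCurv;    // mirror centre of curvature
    Vec3Block<T, DIM> virtDetPoint;  // virtual detection point
    alignas(64) std::array<T, DIM> radius;
    Vec3Block<T, DIM> sphReflPoint;  // output: reflection point
};

// True when p sits on a 64-byte boundary.
inline bool on_cache_line(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}
```

With this arrangement one aligned vector load per coordinate fetches the same component for 16 photons at once, which is what the load_a calls in the diffs require.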

SLIDE 6

Nanoseconds per photon

Theoretical limit: 52.0 B / 340 GB/s = 0.153 ns
For 33554432 photons, 1024 wg size, 256 threads:
Old solver float: 1000.26 ns
Agner Fog’s Vectorclass: 1.04248 ns

SLIDE 7

--- a/QuarticSolverCacheline.h
+++ b/QuarticSolverCacheline.h
-    T reflPointX;
-    T reflPointY;
-    T reflPointZ;
+    T reflPointX __attribute__((__aligned__(64)));
+    T reflPointY __attribute__((__aligned__(64)));
+    T reflPointZ __attribute__((__aligned__(64)));
     reflPointX = ex + CoCX;
     reflPointY = ey + CoCY;
@@
     // TODO: align 64
     // FIXME: add const everywhere?
-    VECT emissionPointVecX;
+    VECT emissionPointVecX __attribute__((__aligned__(64)));
     emissionPointVecX.load_a(&data.emissPnt.x()[0]);
-    VECT emissionPointVecY;
+    VECT emissionPointVecY __attribute__((__aligned__(64)));
     emissionPointVecY.load_a(&data.emissPnt.y()[0]);
-    VECT emissionPointVecZ;
+    VECT emissionPointVecZ __attribute__((__aligned__(64)));
     emissionPointVecZ.load_a(&data.emissPnt.z()[0]);
-    VECT CoCX;
+    VECT CoCX __attribute__((__aligned__(64)));
     CoCX.load_a(&data.centOfCurv.x()[0]);
-    VECT CoCY;
+    VECT CoCY __attribute__((__aligned__(64)));
     CoCY.load_a(&data.centOfCurv.y()[0]);
-    VECT CoCZ;

SLIDE 8

+    VECT CoCZ __attribute__((__aligned__(64)));
     CoCZ.load_a(&data.centOfCurv.z()[0]);
@@
     VECT e2 = evecX*evecX + evecY*evecY + evecZ*evecZ;
     // vector from mirror centre of curvature to virtual detec
-    VECT virtDetPointVecX;
+    VECT virtDetPointVecX __attribute__((__aligned__(64)));
     virtDetPointVecX.load_a(&data.virtDetPoint.x()[0]);
-    VECT virtDetPointVecY;
+    VECT virtDetPointVecY __attribute__((__aligned__(64)));
     virtDetPointVecY.load_a(&data.virtDetPoint.y()[0]);
-    VECT virtDetPointVecZ;
+    VECT virtDetPointVecZ __attribute__((__aligned__(64)));
     virtDetPointVecZ.load_a(&data.virtDetPoint.z()[0]);
     // const Vector dvec(virtDetPoint - CoC);
@@ -220,7 +220,7 @@ namespace RichCacheline
-    VECT radius;
+    VECT radius __attribute__((__aligned__(64)));
     radius.load_a(&data.radius[0]);
-    VECT reflPointX;
-    VECT reflPointY;
-    VECT reflPointZ;
+    VECT reflPointX __attribute__((__aligned__(64)));
+    VECT reflPointY __attribute__((__aligned__(64)));
+    VECT reflPointZ __attribute__((__aligned__(64)));
--- a/main.cpp

SLIDE 9

+++ b/main.cpp
@@ -227,8 +227,8 @@ int main(int argc, char** argv)
-    VECTYPE::PhotonReflections<float> dataV0_vect;
-    VECTYPE::PhotonReflections<float> dataV1_vect;
+    VECTYPE::PhotonReflections<float> dataV0_vect __attribute__((__aligned__(64)));
+    VECTYPE::PhotonReflections<float> dataV1_vect __attribute__((__aligned__(64)));
diff --git a/vectype.h b/vectype.h
index 75c05bf..72db553 100644
--- a/vectype.h
+++ b/vectype.h
 template <typename T, std::size_t DIM = 16>
-using PhotonReflections = std::vector<PhotonReflection<T, DIM>>;
+using PhotonReflections = std::vector<PhotonReflection<T, DIM>, aligned_alloca
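The added template argument is cut off on the slide, so the allocator actually used is not known. As an illustration only, a minimal C++17 allocator that gives std::vector 64-byte-aligned storage could look like the sketch below. AlignedAllocator and is_aligned are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical minimal allocator returning storage aligned to Alignment bytes.
// Requires C++17 for the aligned forms of operator new and operator delete.
template <typename T, std::size_t Alignment = 64>
struct AlignedAllocator {
    using value_type = T;
    static_assert(Alignment >= alignof(T), "alignment too small for T");
    static_assert((Alignment & (Alignment - 1)) == 0, "alignment must be a power of two");

    AlignedAllocator() noexcept = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    // Explicit rebind: the non-type Alignment parameter prevents
    // std::allocator_traits from deducing it automatically.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    T* allocate(std::size_t n) {
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t(Alignment)));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t(Alignment));
    }
};

// Stateless allocators of the same alignment are interchangeable.
template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }

// True when p is a multiple of alignment.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A std::vector<float, AlignedAllocator<float, 64>> then starts on a cache-line boundary. Aligned loads such as load_a need that guarantee, which the __aligned__ attribute on the vector object itself does not extend to its heap storage.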

SLIDE 10

Nanoseconds per photon

Theoretical limit: 52.0 B / 340 GB/s = 0.153 ns
For 33554432 photons, 1024 wg size, 256 threads:
Old solver float: 1000.26 ns
Agner Fog’s Vectorclass: 1.04248 ns
Aligned allocator: 0.946315 ns

SLIDE 11

Nanoseconds per photon

Theoretical limit: 52.0 B / 340 GB/s = 0.153 ns
For 33554432 photons, 1024 wg size, 256 threads:
Old solver float: 1000.26 ns
Agner Fog’s Vectorclass: 1.04248 ns
Aligned allocator: 0.946315 ns
Const variables: 0.932545 ns

SLIDE 12

--- a/QuarticSolverCacheline.h
+++ b/QuarticSolverCacheline.h
@@ -81,8 +81,8 @@ namespace RichCacheline
-    const T divnorm = 1.0f/norm;
-    const T norm_sqrt = sqrt(norm);
+    const T divnorm = approx_recipr(norm);
+    const T norm_sqrt = approx_recipr(approx_rsqrt(norm));
     nx *= divnorm;
     ny *= divnorm;
     nz *= divnorm;
@@
-    const auto enorm = radius/e;
+    const auto enorm = radius*approx_recip
@@
-    VECT cosgamma2 = (evecDvec * evecDvec)/ed2;
+    VECT cosgamma2 = (evecDvec * evecDvec) * approx_recipr(ed2);
-    const VECT e = sqrt(e2);
-    const VECT d = sqrt(d2);
+    const VECT e = approx_recipr(approx_rsqrt(e2));
+    const VECT d = approx_recipr(approx_rsqrt(d2));
-    const VECT singamma = sqrt(1.0f - cosgamma2));
-    const VECT cosgamma = approx_recipr(approx_rsqrt(cosgamma2));
+    const VECT singamma = approx_recipr(approx_rsqrt(1.0f - cosgamma2));
+    const VECT cosgamma = approx_recipr(approx_rsqrt(cosgamma2));
@@
     const VECT maxval = std::numeric_limits<SKALART>::max();
-    const VECT inv_a0 = ((a0 > 0)? 1.0f/a0: maxval);
+    const VECT inv_a0 = ((a0 > 0)? approx_recipr(a0): maxval);
@@
-    const auto toberooted = (abs(R) + sqrt(abs(R2-Q3)));
+    const auto toberooted = (abs(R) + approx_recipr(approx_rsqrt(abs(R2-Q3))));

SLIDE 13

     // FIXME: or first into a normal array, then load?
     // FIXME: also for double?
@@
     const auto A = sgnR * rooted;
     PR(A);
-    const auto B = Q / A;
+    const auto B = Q * approx_recipr(A);
-    const auto u1 = -0.5 * (A + B) - rc / 3.0;
+    const auto u1 = -0.5 * (A + B) - rc * (1.0f / 3.0f);
     // FIXME: saturated or not?
     // const const auto u2 = UU * abs_saturated(A-B);
     const auto u2 = UU * abs(A-B);
-    const auto V = sqrt(u1*u1 + u2*u2);
+    const auto V = approx_recipr(approx_rsqrt(u1*u1 + u2*u2));
     // std::complex<TYPE> w3 = ( abs_satured(V) != 0.0 ? (TYPE)( qq * -0.125 ) / V :
     //                           std::complex<TYPE>(0,0) );
     // FIXME: why abs saturated when compared to 0.0 ??
-    const auto w3r = ((V != 0.0)? (qq * -0.125)/V : 0.0);
+    const auto w3r = ((V != 0.0)? (qq * -0.125)*approx_recipr(V) : 0.0);
     // TYPE res = std::real(w1) + std::real(w2) + std::real(w3) - (r4*a);
-    const auto res = sqrt((u1+V)*2) + w3r - (r4*a);
+    const auto res = approx_recipr(approx_rsqrt((u1+V)*2)) + w3r - (r4*a);
     // return the final result
     // FIXME: std::move ?
     const auto r = ((res > 1.0)? 1.0: ((res < -1.0)? -1.0: res));

SLIDE 14

Nanoseconds per photon

Theoretical limit: 52.0 B / 340 GB/s = 0.153 ns
For 33554432 photons, 1024 wg size, 256 threads:
Old solver float: 1000.26 ns
Agner Fog’s Vectorclass: 1.04248 ns
Aligned allocator: 0.946315 ns
Const variables: 0.932545 ns
Approx. functions: 0.851242 ns
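The approximate-function step replaces every sqrt(x) by approx_recipr(approx_rsqrt(x)) and every division by a multiplication with approx_recipr. The scalar sketch below shows the same trade of accuracy for speed with the SSE estimate instructions. It assumes the approx_* functions map onto such hardware estimates, which the slides do not state. approx_sqrt and relative_error are hypothetical names, and non-x86 targets fall back to the exact std::sqrt.

```cpp
#include <cassert>
#include <cmath>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_SSE_ESTIMATES 1
#endif

// Hypothetical scalar stand-in for approx_recipr(approx_rsqrt(x)):
// sqrt(x) computed as 1 / (1 / sqrt(x)) from two hardware estimates.
// Intel documents a relative error below 1.5 * 2^-12 for each estimate.
inline float approx_sqrt(float x) {
#ifdef HAVE_SSE_ESTIMATES
    const __m128 v = _mm_set_ss(x);
    const __m128 inv_root = _mm_rsqrt_ss(v);   // estimate of 1/sqrt(x)
    const __m128 root = _mm_rcp_ss(inv_root);  // estimate of 1/(1/sqrt(x))
    return _mm_cvtss_f32(root);
#else
    // Portable fallback: exact result, no speed benefit.
    return std::sqrt(x);
#endif
}

// Relative deviation of the approximation from std::sqrt, for x > 0.
inline float relative_error(float x) {
    assert(x > 0.0f);
    const float exact = std::sqrt(x);
    return std::fabs(approx_sqrt(x) - exact) / exact;
}
```

The two estimates compound, so the result carries only about three significant decimal digits. Whether that suffices has to be checked against the physics output.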

SLIDE 15

--- a/QuarticSolverCacheline.h
+++ b/QuarticSolverCacheline.h
@@ -142,6 +142,18 @@ namespace RichCacheline {
+    builtin_prefetch(&(((&data)+0)->radius[0]), 0, 3);
+    builtin_prefetch(&(((&data+1)->emissPnt.x())[0]), 0, 3);
+    builtin_prefetch(&(((&data+1)->emissPnt.y())[0]), 0, 3);
+    builtin_prefetch(&(((&data+1)->emissPnt.z())[0]), 0, 3);
+    builtin_prefetch(&(((&data+1)->centOfCurv.x())[0]), 0, 3);
+    builtin_prefetch(&(((&data+1)->centOfCurv.y())[0]), 0, 3);
+    builtin_prefetch(&(((&data+1)->centOfCurv.z())[0]), 0, 3);
+    builtin_prefetch(&(((&data+1)->virtDetPoint.x())[0]), 0, 3);
+    builtin_prefetch(&(((&data+1)->virtDetPoint.y())[0]), 0, 3);
+    builtin_prefetch(&(((&data+1)->virtDetPoint.z())[0]), 0, 3);
     VECT emissionPointVecX __attribute__((__aligned__(64)));
     emissionPointVecX.load_a(&data.emissPnt.x()[0]);
     VECT emissionPointVecY __attribute__((__aligned__(64)));
@@
+    __builtin_prefetch(&data.sphReflPoint.x()[0], 1, 0);
+    __builtin_prefetch(&data.sphReflPoint.y()[0], 1, 0);
+    __builtin_prefetch(&data.sphReflPoint.z()[0], 1, 0);
     reflPointX.store_a(&data.sphReflPoint.x()[0]);
     reflPointY.store_a(&data.sphReflPoint.y()[0]);

SLIDE 16

Nanoseconds per photon

Theoretical limit: 52.0 B / 340 GB/s = 0.153 ns
For 33554432 photons, 1024 wg size, 256 threads:
Old solver float: 1000.26 ns
Agner Fog’s Vectorclass: 1.04248 ns
Aligned allocator: 0.946315 ns
Const variables: 0.932545 ns
Approx. functions: 0.851242 ns
Prefetch: 0.854942 ns
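The measurement shows no gain from the explicit prefetches (0.854942 ns against 0.851242 ns for the approximate-function step). For reference, the sketch below shows the general pattern of hinting the next block while working on the current one, using the GCC/Clang builtin __builtin_prefetch(addr, rw, locality). The function name and the block size are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum a vector block by block while hinting the following block into cache.
// rw = 0 requests a read prefetch; locality = 3 asks to keep the line in all
// cache levels. The hint never changes the result, only (possibly) timing.
inline float sum_with_prefetch(const std::vector<float>& data, std::size_t block = 16) {
    assert(block > 0);
    float total = 0.0f;
    for (std::size_t i = 0; i < data.size(); i += block) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + block < data.size()) {
            __builtin_prefetch(&data[i + block], 0, 3);  // read hint for the next block
        }
#endif
        const std::size_t end = (i + block < data.size()) ? i + block : data.size();
        for (std::size_t j = i; j < end; ++j) {
            total += data[j];
        }
    }
    return total;
}
```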

SLIDE 17

--- a/QuarticSolverCacheline.h
+++ b/QuarticSolverCacheline.h
@@ -81,27 +81,19 @@ namespace RichCacheline
     const T norm = (nx*nx+ny*ny+nz*nz);
-    const T divnorm = approx_recipr(norm);
     const T norm_sqrt = approx_recipr(approx_rsqrt(norm));
-    nx *= divnorm;
-    ny *= divnorm;
-    nz *= divnorm;
-    // auto beta = asin(sinbeta);
-    const auto beta = asin(sinbeta);
-    // const auto beta = T(asin(sinbeta.get_low()), asin(sinbeta.get_high()));
-    const auto a = sinbeta*norm_sqrt;
-    const auto b = (1.0f-cos(beta))*(norm);
-    const auto enorm = radius*approx_recipr(e);
+    const auto b = (1.0f-approx_recipr(approx_rsqrt(1.0f-(sinbeta*sinbeta))));
+    const auto enorm = radius*approx_recipr(e*norm);
-    std::array<T, 9> M = {1+b*(-nz*nz-ny*ny), a*nz+b*nx*ny, -a*ny+b*nx*nz,
-                          -a*nz+b*nx*ny, 1+b*(-nx*nx-nz*nz), a*nx+b*ny*nz,
-                          a*ny+b*nx*nz, -a*nx+b*ny*nz, 1+b*(-ny*ny-nx*nx)};
+    const std::array<T, 9> M = {norm+b*(-nz*nz-ny*ny), a*nz+b*nx*ny, -a*ny+b*nx*nz,
+                                -a*nz+b*nx*ny, norm+b*(-nx*nx-nz*nz), a*nx+b*ny*nz,
+                                a*ny+b*nx*nz, -a*nx+b*ny*nz, norm+b*(-ny*ny-nx*nx)};

SLIDE 18

Nanoseconds per photon

Theoretical limit: 52.0 B / 340 GB/s = 0.153 ns
For 33554432 photons, 1024 wg size, 256 threads:
Old solver float: 1000.26 ns
Agner Fog’s Vectorclass: 1.04248 ns
Aligned allocator: 0.946315 ns
Const variables: 0.932545 ns
Approx. functions: 0.851242 ns
Prefetch: 0.854942 ns
Transform improved: 0.833473 ns
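Part of the transform change is dropping the asin/cos pair. For beta = asin(sinbeta) the angle lies in [-pi/2, pi/2], where the cosine is non-negative, so cos(beta) = sqrt(1 - sinbeta^2). The sketch below checks that identity with the exact std::sqrt rather than the approximate functions. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// 1 - cos(beta) computed the old way, through the angle itself.
inline float one_minus_cos_via_asin(float sinbeta) {
    return 1.0f - std::cos(std::asin(sinbeta));
}

// The same quantity without any trigonometric call:
// cos(asin(s)) == sqrt(1 - s*s) for s in [-1, 1].
inline float one_minus_cos_via_sqrt(float sinbeta) {
    return 1.0f - std::sqrt(1.0f - sinbeta * sinbeta);
}
```

Both functions agree to within single-precision rounding for inputs in [-1, 1], so the rewritten expression needs one square root (or its approximate replacement) instead of two transcendental calls.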

SLIDE 19

Newton
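The content of this slide is missing. The label "New NR Quartic" on the plot of the next slide suggests a Newton-Raphson iteration applied to the quartic equation. As a generic illustration of that technique only, the sketch below iterates on a monic quartic. The coefficient ordering, starting value and iteration count are assumptions, and newton_quartic is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// Newton-Raphson for the monic quartic x^4 + a*x^3 + b*x^2 + c*x + d = 0.
// x0 is the starting guess; a fixed iteration count keeps the loop
// branch-free apart from the guard against a vanishing derivative.
inline double newton_quartic(double a, double b, double c, double d,
                             double x0, int iterations) {
    double x = x0;
    for (int i = 0; i < iterations; ++i) {
        const double f  = (((x + a) * x + b) * x + c) * x + d;          // polynomial (Horner form)
        const double df = ((4.0 * x + 3.0 * a) * x + 2.0 * b) * x + c;  // its derivative
        if (df == 0.0) {
            break;  // stationary point: no Newton step possible
        }
        x -= f / df;
    }
    return x;
}
```

For x^4 - 1 = 0 with starting value 2, the iteration converges to the root x = 1.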

SLIDE 20

Newton

[Figure: RICH Kaon ID. Kaon ID Efficiency / % (80 to 100) versus Pion MisID Efficiency / % (1 to 10), shown for Old Quartic and New NR Quartic.]
Selection for both: RichDLLk-RichDLLpi > cut; Long tracks | 3<P(GeV)<100 | 0.5<Pt(GeV)<100 | 30<TkAng(mrad)<300; Required Dets: AnyRICH; 13087 Kaons in Acceptance
Plot annotation, Old Quartic: -6.91919, 1.16162, 9.24242
Plot annotation, New NR Quartic: -7.42424, 0.151515, 7.72727

SLIDE 21

Nanoseconds per photon

Theoretical limit: 52.0 B / 340 GB/s = 0.153 ns
For 33554432 photons, 1024 wg size, 256 threads:
Old solver float: 1000.26 ns
Agner Fog’s Vectorclass: 1.04248 ns
Aligned allocator: 0.946315 ns
Const variables: 0.932545 ns
Approx. functions: 0.851242 ns
Prefetch: 0.854942 ns
Transform improved: 0.833473 ns
Numactl on MCDRAM: 0.22772 ns
