SLIDE 1

The math behind Memory layout Performance improvements Results

Roofline plot for RICH pattern detection algorithm on Intel's Knights Landing Platform

Christina Quast
tCSC 2017, June 9, 2017

SLIDE 2

Intel Xeon Phi Knights Landing

Figure: KNL (image source: https://www.extremetech.com/wp-content/uploads/2016/04/KnightsLanding.png)

SLIDE 3

Theoretical performance

  • 40 B input data + 12 B output data = 52 B per photon
  • DRAM: ≈ 80 GB/s
  • MCDRAM (high-bandwidth memory on KNL): ≈ 340 GB/s
  • Theoretical best time per photon: 52 B / 340 GB/s ≈ 0.153 ns
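The arithmetic behind this bound can be written down in a few lines. This is only a sketch: the function name `min_ns_per_photon` is made up for illustration, and it assumes the kernel is purely bandwidth-bound (every photon moves 52 B and nothing is reused from cache).

```cpp
#include <cassert>

// Lower bound on the time per photon (in ns) for a purely bandwidth-bound kernel.
// 1 GB/s is 1 byte per ns, so bytes divided by GB/s is already in ns.
// Hypothetical helper, not taken from the original code.
double min_ns_per_photon(double bytes_per_photon, double bandwidth_gb_per_s) {
    assert(bandwidth_gb_per_s > 0.0);
    return bytes_per_photon / bandwidth_gb_per_s;
}
```

With the numbers above, `min_ns_per_photon(52.0, 340.0)` is about 0.153 ns for MCDRAM and `min_ns_per_photon(52.0, 80.0)` is about 0.65 ns for DRAM.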

SLIDE 4

Memory layout: AOS to SOA

Rearrange memory from an array of structures (AOS) into a structure of arrays (SOA) so that accesses are unit-stride

Figure: AOS to SOA (image source: http://www.spuify.co.uk/?p=645)
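A minimal sketch of the two layouts may help here. The type and function names (`PointAoS`, `PointsSoA`, `to_soa`) are invented for illustration and are not the ones used in the RICH code.

```cpp
#include <array>
#include <cstddef>

// Array of structures (AOS): x, y, z of one point are adjacent,
// so reading all x values strides over the y and z values in between.
struct PointAoS {
    float x, y, z;
};

// Structure of arrays (SOA): all x values are contiguous (unit stride),
// which maps directly onto vector loads.
template <std::size_t N>
struct PointsSoA {
    std::array<float, N> x, y, z;
};

// Convert N points from AOS to SOA layout.
template <std::size_t N>
PointsSoA<N> to_soa(const std::array<PointAoS, N>& aos) {
    PointsSoA<N> soa{};
    for (std::size_t i = 0; i < N; ++i) {
        soa.x[i] = aos[i].x;
        soa.y[i] = aos[i].y;
        soa.z[i] = aos[i].z;
    }
    return soa;
}
```

After the conversion, a loop over `soa.x` touches consecutive floats, so one 64-byte cache line delivers 16 useful values instead of 5 or 6.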

SLIDE 5

Memory and Cacheline optimizations

  • Alignment of variables to 64-byte boundaries (cache-line size)
  • Vectorization through the vectorclass library (essentially an abstraction over intrinsics)
  • The const keyword helps the compiler to optimize
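The alignment and const points can be illustrated with standard C++ alone. The names below (`kCacheLine`, `FloatBlock`, `is_cacheline_aligned`, `sum`) are hypothetical, and the sketch leaves out the vectorclass part on purpose.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // cache-line size in bytes, as stated on the slide

// 16 floats = 64 bytes: exactly one cache line and one 512-bit vector register.
struct alignas(kCacheLine) FloatBlock {
    float v[16];
};

// True if `p` sits on a cache-line boundary.
inline bool is_cacheline_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kCacheLine == 0;
}

// The const-qualified reference documents that `in` is read-only inside the function.
inline float sum(const FloatBlock& in) {
    float s = 0.0f;
    for (float x : in.v) s += x;
    return s;
}
```

Because the alignment is part of the type, every `FloatBlock` object starts on a cache-line boundary, so a full-width load never straddles two lines.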

SLIDE 6

Approximate functions

  • Approximate inverse functions are up to 10 times faster
  • sqrt() replaced with approx_recipr(approx_rsqrt())
  • Division replaced by approx_recipr()

Instruction   uops   Reciprocal throughput (cycles)
VSQRTPS       18     16
VRSQRT14PS    1      3
VDIVPS        18     32
VRCP28PS      1      3
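The two substitutions rest on the identities sqrt(x) = 1 / (1 / sqrt(x)) and a / b = a * (1 / b). The sketch below is a scalar model of that idea, not the vectorclass API: the `*_model` functions and `truncate_mantissa` are hypothetical stand-ins that mimic a low-precision hardware result by keeping 14 mantissa bits (a pessimistic assumption for the reciprocal, which VRCP28PS computes more accurately).

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Stand-in for a low-precision hardware result: keep only `bits` mantissa bits.
// Hypothetical helper for illustration; real code would call the vector library.
inline float truncate_mantissa(float x, int bits) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    u &= ~((std::uint32_t{1} << (23 - bits)) - 1u);
    std::memcpy(&x, &u, sizeof x);
    return x;
}

// Models of an approximate reciprocal square root and an approximate reciprocal.
inline float approx_rsqrt_model(float x) { return truncate_mantissa(1.0f / std::sqrt(x), 14); }
inline float approx_recipr_model(float x) { return truncate_mantissa(1.0f / x, 14); }

// sqrt(x) == 1 / (1 / sqrt(x)): two cheap approximations replace one sqrt.
inline float approx_sqrt_model(float x) { return approx_recipr_model(approx_rsqrt_model(x)); }

// a / b == a * (1 / b): one cheap approximation plus a multiply replaces one division.
inline float approx_div_model(float a, float b) { return a * approx_recipr_model(b); }
```

In this model the relative error stays at the level of a few times 2^-14, which is the trade made for the lower instruction cost listed in the table.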

SLIDE 7

Mathematical improvements

  • Removed some divisions
  • Extracted multiplication factors
  • Term cos(arcsin(sin(β))) replaced by √(1 − sin²(β))
  • Removed cubic root (very expensive) and quartic solver, replaced with Newton iteration
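Two of these items can be sketched in code. The first function shows the trigonometric identity; the second shows a fixed-count Newton iteration on a quartic. The `Quartic` type, its coefficients and the function names are hypothetical, and the actual polynomial solved in the RICH reconstruction is not reproduced here.

```cpp
#include <cmath>

// cos(arcsin(s)) == sqrt(1 - s^2) for s in [-1, 1], because arcsin returns an
// angle in [-pi/2, pi/2], where the cosine is non-negative.
inline double cos_of_asin(double s) { return std::sqrt(1.0 - s * s); }

// Hypothetical quartic p(x) = c[4] x^4 + c[3] x^3 + c[2] x^2 + c[1] x + c[0].
struct Quartic {
    double c[5];
};

// Horner evaluation of p(x).
inline double eval(const Quartic& p, double x) {
    return (((p.c[4] * x + p.c[3]) * x + p.c[2]) * x + p.c[1]) * x + p.c[0];
}

// Horner evaluation of p'(x).
inline double deriv(const Quartic& p, double x) {
    return ((4.0 * p.c[4] * x + 3.0 * p.c[3]) * x + 2.0 * p.c[2]) * x + p.c[1];
}

// Newton iteration x <- x - p(x) / p'(x), started from x0.
inline double newton_root(const Quartic& p, double x0, int iterations) {
    double x = x0;
    for (int i = 0; i < iterations; ++i) {
        const double d = deriv(p, x);
        if (d == 0.0) break;  // stationary point: no Newton step is defined
        x -= eval(p, x) / d;
    }
    return x;
}
```

A fixed iteration count avoids data-dependent branching, which tends to suit vectorized code; how many steps are enough depends on the quality of the starting value.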

SLIDE 8

MCDRAM (≈ 340 GB/s)

Use numactl to bind execution to CPUs and to MCDRAM memory

Figure: Quadrant Clustering mode [2]

SLIDE 9

Nanoseconds per photon

Theoretical limit: 52 B / 340 GB/s ≈ 0.153 ns per photon

Improvement                              Execution time per photon   Speedup over baseline code with OMP
Baseline code without OMP                1000.26 ns
(from here on: always OpenMP, 256 threads)
Baseline code                            7.13 ns
Pinned on MCDRAM (with numactl)          6.63 ns                     1.07x
Mathematical improvement                 4.67 ns                     1.53x
Vectorization and memory alignment       0.933 ns                    7.64x
All three                                0.195 ns                    36.47x
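To relate the measured times to the theoretical limit, one can turn a time per photon back into an effective bandwidth. This is a sketch with hypothetical function names; it assumes every photon moves exactly 52 B to and from memory.

```cpp
#include <cassert>

// Effective bandwidth in GB/s implied by a measured time per photon.
// 1 byte per ns is 1 GB/s, so bytes divided by ns is already in GB/s.
double effective_bandwidth_gb_per_s(double bytes_per_photon, double ns_per_photon) {
    assert(ns_per_photon > 0.0);
    return bytes_per_photon / ns_per_photon;
}

// Fraction of the bandwidth-bound limit that a measurement reaches (1.0 = at the limit).
double fraction_of_limit(double limit_ns, double measured_ns) {
    assert(measured_ns > 0.0);
    return limit_ns / measured_ns;
}
```

For the best result in the table, `effective_bandwidth_gb_per_s(52.0, 0.195)` is roughly 267 GB/s and `fraction_of_limit(0.153, 0.195)` is roughly 0.78, that is, about 78% of the MCDRAM bound.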

SLIDE 10

Roofline plot

Figure: Roofline plot with mathematical improvements
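For readers unfamiliar with the model behind the plot: a roofline caps attainable performance by the smaller of peak compute and bandwidth times arithmetic intensity. The sketch below states only that general formula; the function name is hypothetical, and any peak value passed in is a placeholder rather than a figure from the talk.

```cpp
#include <algorithm>

// Roofline model: attainable performance (GFLOP/s) is limited either by peak
// compute or by memory bandwidth (GB/s) times arithmetic intensity (FLOP per byte).
inline double attainable_gflops(double peak_gflops, double bandwidth_gb_per_s,
                                double flop_per_byte) {
    return std::min(peak_gflops, bandwidth_gb_per_s * flop_per_byte);
}
```

With the MCDRAM bandwidth of about 340 GB/s and an assumed, purely illustrative peak of 6000 GFLOP/s, a kernel at 1 FLOP/byte is bandwidth-bound at 340 GFLOP/s, and the two limits meet near 6000 / 340 ≈ 17.6 FLOP/byte.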

SLIDE 11

Speedup and Efficiency

Figure: Strong scaling speedup (left) and efficiency (right) versus #OMP_threads (1 to 256), for 10485760 photons and an OMP workgroup size of 128
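The two quantities plotted follow the usual strong-scaling definitions: speedup is the single-thread time divided by the n-thread time, and efficiency is speedup divided by n. A minimal sketch, with hypothetical names:

```cpp
#include <cassert>

struct Scaling {
    double speedup;     // t(1) / t(n)
    double efficiency;  // speedup / n; 1.0 is ideal
};

// Strong scaling for a fixed problem size: t1 is the time with one thread,
// tn the time with `threads` threads (same units).
inline Scaling strong_scaling(double t1, double tn, int threads) {
    assert(threads > 0 && tn > 0.0);
    const double s = t1 / tn;
    return {s, s / threads};
}
```

As a rough cross-check, and assuming the "without OMP" row of the results table is the single-thread time of the same code, `strong_scaling(1000.26, 7.13, 256)` gives a speedup of about 140 and an efficiency of about 0.55.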

SLIDE 12

References

[1] R. Forty and O. Schneider. RICH pattern recognition. LHCB/98-040, 30 April 1998.

[2] A. Vladimirov and R. Asai. Clustering modes in Knights Landing processors: Developer's guide. Colfax International, May 11, 2016.

SLIDE 13

Cherenkov angle

Figure: Cherenkov angle calculation [1]

SLIDE 14

Struct

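#include <array>
#include <cstddef>

// Data for DIM photons at once; XYZPoints<T, DIM> is defined elsewhere in the code base.
template <typename T, std::size_t DIM = 16>
class PhotonReflection {
public:
  using vector = typename XYZPoints<T, DIM>::vec_type;

  XYZPoints<T, DIM> emissPnt;      // emission points
  XYZPoints<T, DIM> centOfCurv;    // centres of curvature
  XYZPoints<T, DIM> virtDetPoint;  // virtual detection points
  XYZPoints<T, DIM> sphReflPoint;  // reflection points on the spherical mirror
  std::array<T, DIM> radius;       // one radius per photon
};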
