High performance numerical validation with stochastic arithmetic



SLIDE 1

High performance numerical validation with stochastic arithmetic

Pacôme Eberhart
Joint work with: Fabienne Jézéquel, Pierre Fortin
In collaboration with Julien Brajard from LOCEAN
RAIM2015, April 7, 2015, IRISA, Rennes

Pacôme Eberhart High performance stochastic arithmetic RAIM2015 1 / 21

SLIDE 2

Estimation of rounding error propagation

Evaluating the accuracy of numerical results

Accumulation of rounding errors ⇒ numerical results differ from mathematical results
Measure of the reliability and reproducibility of the computation
Particularly important in HPC environments and on future exascale supercomputers

◮ increased parallelism
◮ higher amount of computation

Some methods

Backward error analysis: low overhead, but unfit for some types of code
Interval arithmetic: guaranteed enclosure of the result, but usually needs code rewriting
Stochastic arithmetic: probabilistic approach, easy to use in real-life applications

◮ overhead needs to be reduced for high performance

SLIDE 3

1. High performance numerical validation
2. Stochastic arithmetic and the CADNA library
3. Overhead of the CADNA library
4. Towards a high performance CADNA library
5. Scalar performance
6. SIMD performance
7. Conclusion and future works

SLIDE 4

Stochastic arithmetic and the CADNA library

CESTAC method

Each arithmetic operation is performed N times
Each result is randomly rounded towards +∞ or −∞ with probability 0.5
Number of exact significant digits estimated with statistical analysis
First-order approximation method: validity compromised if second-order errors are greater than first-order errors

Implementation of the CADNA library

Implementation of stochastic arithmetic in C/C++
Classes and operator overloading for ease of use
Each stochastic type contains N = 3 floating-point values and 1 integer
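
As a rough illustration of the CESTAC idea and of the class-based design described above, the following C++ sketch performs each addition N = 3 times, each time with a rounding direction drawn at random and applied through the standard fesetround call. The names Stochastic, random_bit, random_rounded_add and make_stochastic are hypothetical and are not CADNA's API. The role given to the integer field (a cached accuracy estimate) and the choice of random generator are assumptions.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>
#include <random>

// Hypothetical stochastic type (not CADNA's actual class):
// N = 3 samples of the same quantity plus one integer.
struct Stochastic {
    double x[3];
    int accuracy;  // assumed role: cached estimate of exact significant digits
};

// Shared random bit source (an assumption; any fair bit generator would do).
inline bool random_bit() {
    static std::mt19937 gen(12345u);
    return (gen() & 1u) != 0u;
}

// One addition rounded towards +infinity or -infinity with probability 0.5.
inline double random_rounded_add(double a, double b) {
    const int saved = std::fegetround();
    std::fesetround(random_bit() ? FE_UPWARD : FE_DOWNWARD);
    volatile double va = a, vb = b;  // volatile keeps the addition at run time
    volatile double r = va + vb;
    std::fesetround(saved);          // restore the caller's rounding mode
    return r;
}

// Overloaded operator: the same addition is performed N = 3 times.
inline Stochastic operator+(const Stochastic& a, const Stochastic& b) {
    Stochastic r{};
    for (int i = 0; i < 3; ++i) r.x[i] = random_rounded_add(a.x[i], b.x[i]);
    r.accuracy = -1;  // sentinel: accuracy not estimated yet
    return r;
}

// Build a stochastic value from an exactly known double.
inline Stochastic make_stochastic(double v) {
    assert(std::isfinite(v));
    return Stochastic{{v, v, v}, -1};
}
```

With this sketch, adding make_stochastic(1.0) and make_stochastic(1e-20) yields three samples that are each either 1.0 or the next representable double above it, which is the spread the statistical analysis later exploits. Note that this version changes the rounding mode twice per operation.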

SLIDE 5

The CADNA library: self-validation and anomaly detection

Anomaly detection

Self-validation to ensure validity of stochastic arithmetic
Anomaly detection for numerical analysis of the code

Warning types

Self-validation: both operands of a multiplication, or a divisor, not significant
Cancellation detection: sudden loss of accuracy in an addition or subtraction
Mathematical instability: instability in a mathematical function
Branching instability: non-determinism in a branching test

SLIDE 7

Overhead

Computation time

Depends on the program and the level of detection
Usually at least one order of magnitude higher on real-life applications
Even higher on highly optimised routines

Causes

Cost of anomaly detection
Cost of stochastic operations

SLIDE 8

Cost of anomaly detection

Detection types

Self-validation and branching instability: relatively low-cost tests
Mathematical instability: inexpensive compared to the cost of the mathematical function calls
Cancellation detection: requires computing the number of exact significant digits of both operands and of the result

Calculating the number of exact significant digits

Uses the mean value and the standard deviation of the set of samples
Relies on a costly logarithm evaluation
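
The computation described above can be sketched as follows. The slides do not give the formula, so this assumes the estimate usually associated with CESTAC, log10(sqrt(N) · |mean| / (τ · σ)), with τ = 4.303 (Student's t distribution, 2 degrees of freedom, 95% confidence) for N = 3. The function name exact_digits and the cap at 15 digits for double precision are hypothetical choices.

```cpp
#include <cassert>
#include <cmath>

// Estimate the number of exact significant decimal digits shared by n samples.
// Assumption: the usual CESTAC estimate log10(sqrt(n) * |mean| / (tau * sigma)),
// with tau = 4.303 for n = 3 (Student's t, 2 degrees of freedom, 95%).
inline int exact_digits(const double* x, int n = 3, double tau = 4.303) {
    assert(n >= 2);
    double mean = 0.0;
    for (int i = 0; i < n; ++i) mean += x[i];
    mean /= n;

    double var = 0.0;  // unbiased sample variance
    for (int i = 0; i < n; ++i) var += (x[i] - mean) * (x[i] - mean);
    var /= (n - 1);
    const double sigma = std::sqrt(var);

    if (sigma == 0.0) return 15;  // identical samples: cap at double precision
    if (mean == 0.0) return 0;    // samples scattered around zero: no exact digit

    // The costly part: one logarithm per estimate.
    const double c =
        std::log10(std::sqrt(static_cast<double>(n)) * std::fabs(mean) / (tau * sigma));
    if (c < 0.0) return 0;
    if (c > 15.0) return 15;
    return static_cast<int>(c);  // truncate towards zero
}
```

For the samples {1.0, 1.001, 0.999} the standard deviation is about 1e-3 and the estimate is 2 digits. Cancellation detection needs three such estimates per addition (both operands and the result), which is why the logarithm dominates its cost.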

SLIDE 9

Cost of stochastic operations

FPU (Floating Point Unit) rounding modes

Stochastic operations frequently change the rounding mode of the FPU
Pipeline flushed when the rounding mode changes, hindering performance
Prevents vectorisation, as the rounding mode is the same for all lanes

Overloaded operators

Operators replaced by functions compiled in the library
FPU instructions replaced by function calls, causing performance overhead, especially in arithmetic-intensive code

SLIDE 11

Cancellation detection

Logarithm approximation

Cancellation detection: number of exact significant digits computed with log10
Using the base-2 exponent (multiplied by log10(2)) as a fast approximation of the logarithm
Easily obtained from the binary representation of floating-point numbers

Difference with the previous evaluation

Estimated number of exact significant digits can vary
However, since log10(2) < 0.31, the difference is at most 1 digit
The approximation gives a more pessimistic estimate of the number of digits
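
A minimal sketch of this approximation, assuming IEEE 754 binary64 doubles and ignoring zero, subnormal, infinite and NaN inputs. The function names exponent2 and approx_log10 are hypothetical, and how the approximation is wired into CADNA's digit estimate is not shown on the slides.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Unbiased base-2 exponent of a normal, finite, non-zero double, read
// directly from its IEEE 754 binary64 representation.
inline int exponent2(double v) {
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    const int biased = static_cast<int>((bits >> 52) & 0x7FFu);
    assert(biased != 0 && biased != 0x7FF);  // special values not handled here
    return biased - 1023;
}

// Fast approximation of log10(|v|) as floor(log2|v|) * log10(2).
// Mathematically it never exceeds log10(|v|) and lies less than
// log10(2) ~ 0.30103 below it, hence the more pessimistic digit count.
inline double approx_log10(double v) {
    constexpr double log10_2 = 0.30102999566398120;
    return exponent2(v) * log10_2;
}
```

For example, 1000.0 has base-2 exponent 9, so the approximation returns about 2.709 instead of 3: a shift, a mask and one multiplication replace the call to log10.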

SLIDE 12

Stochastic operations

Removing the change of rounding mode during computation

With ⊕+∞, ⊗+∞ (resp. ⊕−∞, ⊗−∞) denoting operations rounded towards +∞ (resp. −∞):
As a ⊕+∞ b = −((−a) ⊕−∞ (−b)) (likewise for subtraction),
and a ⊗+∞ b = −(a ⊗−∞ (−b)) (likewise for division),
the rounded-up value is obtained from rounded-down operations (or conversely) by changing signs
Implemented through a random flip of the sign bit of the IEEE binary representation

Inlining the functions

Minimise the cost of function calls
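
The sign identities above can be sketched in C++ as follows, assuming the rounding mode has been set to round towards −∞ once, before the computation starts. The function names flip_sign, add_random and mul_random are hypothetical, and the flip argument stands for the random bit. This is an illustration of the technique, not CADNA's code.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>
#include <cstdint>
#include <cstring>

// XOR the IEEE 754 sign bit of v with flip (0 or 1): flip == 1 negates v.
inline double flip_sign(double v, std::uint64_t flip) {
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    bits ^= flip << 63;
    std::memcpy(&v, &bits, sizeof v);
    return v;
}

// Addition, assuming the FPU already rounds towards -infinity.
// flip == 0: a + b rounded down.
// flip == 1: -((-a) + (-b)), which equals a + b rounded up.
inline double add_random(double a, double b, std::uint64_t flip) {
    volatile double na = flip_sign(a, flip);
    volatile double nb = flip_sign(b, flip);
    volatile double s = na + nb;  // volatile keeps the operation at run time
    return flip_sign(s, flip);
}

// Multiplication: flipping the sign of one operand only is enough.
inline double mul_random(double a, double b, std::uint64_t flip) {
    volatile double na = a;
    volatile double nb = flip_sign(b, flip);
    volatile double p = na * nb;
    return flip_sign(p, flip);
}
```

No rounding mode change occurs inside these functions, so they are small enough to inline and contain no instruction that would have to apply to all SIMD lanes at once.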

SLIDE 17

Vectorising CADNA

Prerequisites

FPU rounding mode changes no longer necessary
Random generator changed to ease vectorisation through replication

Vectorising methods

Using intrinsics: tedious and difficult to use due to data types
Automatic vectorisation: impossible due to the added dependency from random bit generation
Compiler directives: problematic due to the lack of a lane identifier for the random generator
SPMD (Single Program Multiple Data) on SIMD

◮ Scalar programming with simple C-like syntax, with lane identifier
◮ Compiler generates SIMD instructions
◮ ispc (Intel SPMD Program Compiler) supports operator overloading, chosen over OpenCL
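
One possible reading of "replication" is that each SIMD lane owns its own generator state, selected by the lane identifier (in ispc, the built-in programIndex). The C++ sketch below simulates this with an explicit lane argument. The lane count of 8, the xorshift32 generator and the names LaneRng, make_lane_rng and next_bit are assumptions made for illustration, not the generator used in CADNA.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kLanes = 8;  // assumed lane count for this sketch

// One independent generator state per SIMD lane (replication).
struct LaneRng {
    std::uint32_t state[kLanes];
};

// Seed every lane with a distinct non-zero value.
inline LaneRng make_lane_rng(std::uint32_t seed) {
    assert(seed != 0u && seed <= 0xFFFFFFFFu - kLanes);
    LaneRng r{};
    for (int lane = 0; lane < kLanes; ++lane) r.state[lane] = seed + lane;
    return r;
}

// Advance the xorshift32 generator of one lane and return a random bit.
// No state is shared between lanes, so there is no cross-lane dependency.
inline unsigned next_bit(LaneRng& r, int lane) {
    std::uint32_t x = r.state[lane];
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    r.state[lane] = x;
    return x & 1u;
}
```

Because each lane only reads and writes its own state, the update maps directly onto SIMD instructions. Without a lane identifier, as with compiler directives, there is no simple way to express which state a given lane should use.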

SLIDE 18

Execution masks

Divergence in control flow when vectorising

Vectorised code containing conditional branches
Instructions executed even when they should not be
Changes not committed to memory, through the use of an execution mask
Usually implemented in software and costly in terms of performance

Reducing the use of execution masks

Tests on whether an instability is detected or not
Replacing these tests with preprocessor directives evaluated at compile time
Removes the possibility of changing the detection mode during execution
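
The trade-off can be sketched in C++ as follows. The macro DETECT_CANCELLATION, the counter and both function names are hypothetical. The sketch only contrasts a test on a run-time detection flag with the same test resolved by the preprocessor.

```cpp
#include <cassert>

// Hypothetical compile-time switch: build with -DDETECT_CANCELLATION=0
// to compile the detection out entirely.
#ifndef DETECT_CANCELLATION
#define DETECT_CANCELLATION 1
#endif

static int cancellation_count = 0;  // hypothetical anomaly counter

// Run-time flag: the test on detect_enabled stays in the generated code and,
// once vectorised, has to be handled with an execution mask.
inline void report_runtime(bool detect_enabled, int digits_lost, int threshold) {
    assert(threshold > 0);
    if (detect_enabled && digits_lost >= threshold) ++cancellation_count;
}

// Compile-time flag: the test on the detection mode disappears from the
// binary, but the mode can no longer be changed during execution.
inline void report_static(int digits_lost, int threshold) {
    assert(threshold > 0);
#if DETECT_CANCELLATION
    if (digits_lost >= threshold) ++cancellation_count;
#else
    (void)digits_lost;
#endif
}
```

Switching a detection on or off then requires recompiling, which is the loss of flexibility noted above.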

SLIDE 20

Performance setup

Hardware

Intel Xeon E3-1275 3.5GHz, 1 core used only

Benchmarks

Pure arithmetic benchmarks
◮ Addition (multiplication) over a long vector

More realistic benchmarks
◮ Mandelbrot set computation
◮ Finite difference stencil computation

Application code compiled with gcc -O3

SLIDE 21

Versions of the CADNA library

Compared versions of the benchmarks

ieee, an IEEE version used as a baseline
1.1.9, the previous version of CADNA
mask, removing the FPU rounding mode change during operations and adding the change of sign through masks
inline, using mask and inlining the operators
dyn, using inline and changing the random generator to produce numbers dynamically

Compiling the libraries

1.1.9 compiled with gcc -O0 due to a known gcc bug
mask, inline and dyn compiled with gcc -O3

SLIDE 22

Cancellation detection

[Figure: execution time (s, logarithmic scale) of the Addition benchmark for IEEE, 1.1.9 and log approx. Annotated overheads: x 303 and x 184.]

Analysis

Addition only, as cancellation only applies to addition
All detections activated
Overhead reduced by 40%

SLIDE 23

Arithmetic benchmarks

[Figure: execution time (s) of the Addition and Multiplication benchmarks for IEEE, 1.1.9, mask, inline and dyn. Annotated overheads, in extraction order: x 52.7, x 33.9, x 39.6, x 26.2, x 9.75, x 5.28, x 7.86, x 4.94.]

Analysis

Overhead reduced by up to 84%

SLIDE 24

Realistic benchmarks

[Figure: execution time (s) of the Mandelbrot benchmark for IEEE, 1.1.9, mask, inline and dyn. Annotated overheads, in extraction order: x 115, x 91.4, x 22.2, x 17.0.]

[Figure: execution time (s) of the Stencil benchmark for IEEE, 1.1.9, mask, inline and dyn. Annotated overheads, in extraction order: x 98.3, x 78.7, x 20.9, x 13.1.]

Analysis

Initial overhead higher than for arithmetic benchmarks
Due to the nature of the chosen applications

◮ Mandelbrot more arithmetically intensive than arithmetic benchmarks
◮ Stencil 3D memory access pattern lowers memory cache and prefetch efficiency

Overhead reduced by up to 87%

SLIDE 26

Performance setup

Hardware

Same CPU as for the scalar benchmarks, only 1 core used
AVX2 instruction set

Compared versions of the benchmarks

ieee, an IEEE version used as a baseline
dyn, adapted from the scalar dyn version
define, using dyn and replacing tests on anomaly detection flags by #define directives evaluated at compile time

SLIDE 27

Arithmetic benchmarks on AVX2

[Figure: execution time (s) of the Addition and Multiplication benchmarks for IEEE, dyn and define, each in scalar and AVX2 versions. Annotated speedups, in extraction order: x 7.98, x 7.38, x 2.26, x 2.19, x 5.63, x 2.19.]

Analysis

Vectorisation achieved (was impossible with original CADNA)
CADNA speedup up to 5.63 times, lower than IEEE speedup

◮ stochastic types are structures (AoS vs SoA)

define version removes execution masks for addition
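
The AoS vs SoA remark can be illustrated with the two layouts below. Field and type names are hypothetical, and the slides do not show CADNA's actual data layout beyond stating that stochastic types are structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures (AoS): the layout obtained when each stochastic value
// is a structure stored in an array.
struct StochasticAoS {
    double x, y, z;  // the N = 3 samples
    int accuracy;
};

// Structure of arrays (SoA): each field stored contiguously, so one SIMD
// load fills a register with the same field of consecutive elements.
struct StochasticSoA {
    std::vector<double> x, y, z;
    std::vector<int> accuracy;
    explicit StochasticSoA(std::size_t n) : x(n), y(n), z(n), accuracy(n, -1) {}
};

// Convert AoS to SoA: the strided gather written out explicitly.
inline StochasticSoA to_soa(const std::vector<StochasticAoS>& in) {
    StochasticSoA out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out.x[i] = in[i].x;
        out.y[i] = in[i].y;
        out.z[i] = in[i].z;
        out.accuracy[i] = in[i].accuracy;
    }
    return out;
}

// Byte distance between the same field of two consecutive elements.
constexpr std::size_t aos_stride = sizeof(StochasticAoS);  // typically 32 with padding
constexpr std::size_t soa_stride = sizeof(double);         // unit stride
```

With the AoS layout, loading the x samples of consecutive elements into one vector register needs strided accesses or shuffles, whereas plain IEEE arrays are already contiguous. This is consistent with the stochastic versions showing a lower speedup than the IEEE baseline.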

SLIDE 28

Realistic benchmarks on AVX2

[Figure: execution time (s) of the Mandelbrot benchmark for IEEE, dyn and define, each in scalar and AVX2 versions. Annotated speedups, in extraction order: x 4.44, x 3.14, x 3.01.]

[Figure: execution time (s) of the Stencil benchmark for IEEE, dyn and define, each in scalar and AVX2 versions. Annotated speedups, in extraction order: x 4.28, x 2.05, x 2.70.]

Analysis

Better performance of dyn on Mandelbrot confirms the AoS vs SoA problem, as Mandelbrot has no memory accesses for stochastic types
Improvement of speedup for stencil computation with the define version, due to the removal of execution masks
Speedup up to 3.14 times

SLIDE 30

Conclusion

Scalar improvements

Improved performance on cancellation detection

◮ Use of an approximation of the log10 function
◮ Overhead reduced by 40%

Changes to CADNA enable large performance improvements on stochastic arithmetic operations
Overhead reduced by 81% to 87% on a variety of benchmarks

Vectorising

SIMD computing enabled with stochastic arithmetic
Speedup of up to 5.63 on AVX2
Lower speedup than IEEE due to AoS

SLIDE 31

Future prospects

New CADNA versions

New CADNA version for shared memory (OpenMP)
Extend the current SPMD-on-SIMD version to target CPUs and GPUs

◮ either with OpenCL,
◮ or with ispc/CUDA.