
SLIDE 1

A Reproducible Accurate Summation Algorithm for High-Performance Computing

Sylvain Collange1, David Defour2, Stef Graillat4, and Roman Iakymchuk3,4

1 INRIA – Centre de recherche Rennes – Bretagne Atlantique
2 DALI–LIRMM, Université de Perpignan
3 Sorbonne Universités, UPMC Univ Paris VI, UMR 7606, LIP6
4 Sorbonne Universités, UPMC Univ Paris VI, ICS

roman.iakymchuk@lip6.fr

The SIAM EX14 Workshop July 6th, 2014 Chicago, Illinois, USA

Roman Iakymchuk (ICS & LIP6, UPMC) Reproducible Accurate Summation July 6th, 2014 1 / 21

SLIDE 2

The Patriot Missile Failure

The 1st Gulf War, 1991: an American Patriot missile battery failed to intercept an Iraqi Scud missile
The Scud missile hit a US garrison, killing 28 soldiers

Analysis

The Patriot hardware clock delivers time in 1/10ths of a second
0.1 is not representable with a finite number of digits in base 2:
0.1 = 0.0001100110011001100110011001100...
The Patriot system had been running for more than 100 hours; the accumulated clock drift was 10 · 100 · 3600 · 5.96 · 10⁻⁸ ≈ 0.21 secs

In this time, a Scud missile travels roughly 360 m
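The drift above is simple arithmetic to check. The sketch below reproduces both figures; the per-tick truncation error of 5.96 · 10⁻⁸ is taken from the slide, and the Scud velocity of roughly 1676 m/s is an assumed value, not stated on the slide.

```python
# Accumulated clock drift of the Patriot system, using the slide's figures.
ticks_per_second = 10        # the clock ticks in 1/10ths of a second
hours_running = 100
error_per_tick = 5.96e-8     # truncation error per tick (slide's figure)

drift = ticks_per_second * hours_running * 3600 * error_per_tick
print(f"accumulated drift: {drift:.2f} s")            # accumulated drift: 0.21 s

scud_speed = 1676            # m/s, assumed approximate Scud velocity
print(f"tracking error: {drift * scud_speed:.0f} m")  # tracking error: 360 m
```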

SLIDE 3

Outline

1. Computer Arithmetic: Accuracy and Reproducibility
2. Existing Solutions
3. Multi-Level Reproducible and Accurate Algorithm
4. Conclusions and Future Work

SLIDE 4

Computer Arithmetic

Problems

Floating-point arithmetic suffers from rounding errors
Floating-point operations (+, ×) are commutative but not associative:
(−1 + 1) + 2⁻⁵³ ≠ −1 + (1 + 2⁻⁵³) in double precision

SLIDE 5

Computer Arithmetic

Problems

Floating-point arithmetic suffers from rounding errors
Floating-point operations (+, ×) are commutative but not associative:
(−1 + 1) + 2⁻⁵³ = 2⁻⁵³, whereas −1 + (1 + 2⁻⁵³) = 0 in double precision

SLIDE 6

Computer Arithmetic

Problems

Floating-point arithmetic suffers from rounding errors
Floating-point operations (+, ×) are commutative but not associative:
(−1 + 1) + 2⁻⁵³ ≠ −1 + (1 + 2⁻⁵³) in double precision
Consequence: the results of floating-point computations depend on the order of computation
Results computed by performance-optimized parallel floating-point libraries are frequently inconsistent: each run may return a different result
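The non-associativity above can be observed directly; since Python floats are IEEE-754 doubles, a minimal demonstration is:

```python
# The same three summands give different results depending on grouping.
eps = 2.0 ** -53   # half an ulp of 1.0 in IEEE double precision

left  = (-1.0 + 1.0) + eps   # 0.0 + eps  -> eps survives
right = -1.0 + (1.0 + eps)   # 1.0 + eps rounds back to 1.0 -> result is 0.0

print(left, right)   # 1.1102230246251565e-16 0.0
assert left != right
```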

SLIDE 7

Reproducibility and ExaScale

Challenges

Increasing power of current computers

GPU accelerators, Intel Phi processors, etc.

Enables solving more complex problems

Quantum field theory, supernova simulation, etc.

A high number of floating-point operations performed

Each of them leads to round-off error

SLIDE 8

Reproducibility and ExaScale

Challenges

Increasing power of current computers

GPU accelerators, Intel Phi processors, etc.

Enables solving more complex problems

Quantum field theory, supernova simulation, etc.

A high number of floating-point operations performed

Each of them leads to round-off error

Needs for Reproducibility

Debugging

Stepping through the code may require rerunning it multiple times on the same input data

Understanding the reliability of the output
Contractual reasons (for security, ...)

SLIDE 9

Sources of Non-Reproducibility

A performance-optimized floating-point library is prone to non-reproducibility for various reasons:

Changing Data Layouts:

Data partitioning
Data alignment

SLIDE 10

Sources of Non-Reproducibility

A performance-optimized floating-point library is prone to non-reproducibility for various reasons:

Changing Data Layouts:

Data partitioning
Data alignment

Changing Hardware Resources

Number of threads
Fused Multiply-Add support
Intermediate precision (64 bits, 80 bits, 128 bits, etc.)
Data path (SSE, AVX, GPU warp, etc.)
Cache line size
Number of processors
Network topology
...

SLIDE 11

Existing Solutions

To Obtain Reproducibility

Fix the Order of Computations

Sequential mode: intolerably costly on large-scale systems
Fixed reduction trees: substantial communication overhead
→ Example: Intel Conditional Numerical Reproducibility (slow, no accuracy guarantees)

SLIDE 12

Existing Solutions

To Obtain Reproducibility

Fix the Order of Computations

Sequential mode: intolerably costly on large-scale systems
Fixed reduction trees: substantial communication overhead
→ Example: Intel Conditional Numerical Reproducibility (slow, no accuracy guarantees)

Eliminate/Reduce the Rounding Errors

Fixed-point arithmetic: limited range of values
Fixed FP expansions with Error-Free Transformations (EFT)
→ Example: double-double or quad-double (Briggs, Bailey, Hida, Li); works well on sets of relatively close numbers

SLIDE 13

Existing Solutions

To Obtain Reproducibility

Fix the Order of Computations

Sequential mode: intolerably costly on large-scale systems
Fixed reduction trees: substantial communication overhead
→ Example: Intel Conditional Numerical Reproducibility (slow, no accuracy guarantees)

Eliminate/Reduce the Rounding Errors

Fixed-point arithmetic: limited range of values
Fixed FP expansions with Error-Free Transformations (EFT)
→ Example: double-double or quad-double (Briggs, Bailey, Hida, Li); works well on sets of relatively close numbers
“Infinite” precision: reproducible independently of the inputs
→ Example: the Kulisch accumulator (considered inefficient)

SLIDE 14

Our Approach

Algorithm 1: EFT of size 2 (Dekker and Knuth)
function [r, s] = TwoSum(a, b)
1: r ← a + b
2: z ← r − a
3: s ← (a − (r − z)) + (b − z)
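Algorithm 1 transliterates directly into code. The sketch below (Python floats are IEEE doubles; `fractions.Fraction` is used only to verify exactness) checks the error-free property a + b = r + s:

```python
from fractions import Fraction

def two_sum(a: float, b: float):
    """Knuth's branch-free TwoSum: r = fl(a + b), and a + b = r + s exactly
    (s is the rounding error of the floating-point addition)."""
    r = a + b
    z = r - a
    s = (a - (r - z)) + (b - z)
    return r, s

a, b = 1.0, 2.0 ** -53
r, s = two_sum(a, b)
assert r == 1.0 and s == 2.0 ** -53   # the rounded-away bit is recovered in s
# The transformation is error-free: r + s equals a + b as exact rationals.
assert Fraction(r) + Fraction(s) == Fraction(a) + Fraction(b)
```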

SLIDE 15

Our Approach

Algorithm 1: EFT of size 2 (Dekker and Knuth)
function [r, s] = TwoSum(a, b)
1: r ← a + b
2: z ← r − a
3: s ← (a − (r − z)) + (b − z)

Algorithm 2: EFT of size n (init. by Priest and Shewchuk)
function ExpansionAccumulate(x)
1: for i = 0 → n − 1 do
2:   (aᵢ, x) ← TwoSum(aᵢ, x)
3: end for
4: if x ≠ 0 then
5:   Superaccumulate(x)
6: end if
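Algorithm 2 can be sketched in a few lines; here the fallback `Superaccumulate` is modeled by a plain list (in the real algorithm it spills into the Kulisch accumulator), and the `Expansion` class and its `total` helper are illustrative names, not the authors' code:

```python
def two_sum(a, b):
    # Knuth's TwoSum: returns fl(a + b) and the exact rounding error.
    r = a + b
    z = r - a
    return r, (a - (r - z)) + (b - z)

class Expansion:
    """Fixed-size floating-point expansion: each incoming value is sifted
    through n accumulators with TwoSum; whatever error remains after the
    last slot is spilled to a fallback accumulator."""
    def __init__(self, n=4):
        self.a = [0.0] * n
        self.spill = []          # stand-in for Superaccumulate(x)

    def accumulate(self, x):
        for i in range(len(self.a)):
            self.a[i], x = two_sum(self.a[i], x)
        if x != 0.0:
            self.spill.append(x)

    def total(self):
        # For illustration only; the real algorithm rounds via the
        # superaccumulator rather than a naive sum.
        return sum(self.a) + sum(self.spill)

e = Expansion(n=4)
for v in [1.0, 2.0 ** -53, -1.0]:
    e.accumulate(v)
print(e.total())   # 1.1102230246251565e-16, i.e. 2**-53 recovered exactly
```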

SLIDE 16

Our Approach

Algorithm 1: EFT of size 2 (Dekker and Knuth)
function [r, s] = TwoSum(a, b)
1: r ← a + b
2: z ← r − a
3: s ← (a − (r − z)) + (b − z)

Algorithm 2: EFT of size n (init. by Priest and Shewchuk)
function ExpansionAccumulate(x)
1: for i = 0 → n − 1 do
2:   (aᵢ, x) ← TwoSum(aᵢ, x)
3: end for
4: if x ≠ 0 then
5:   Superaccumulate(x)
6: end if

Kulisch long accumulator
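Since every finite double is an integer multiple of 2⁻¹⁰⁷⁴, a Kulisch long accumulator can be modeled with a single very wide integer. The toy Python sketch below (the class name and scaling are illustrative, not the paper's implementation; Python's arbitrary-precision ints play the role of the fixed-point hardware register) keeps sums of doubles exact and order-independent:

```python
from fractions import Fraction

class KulischAccumulator:
    """Toy Kulisch long accumulator: sums of doubles are kept exactly in
    one wide integer, in units of 2**-1074 (the smallest subnormal)."""
    SCALE = 1074

    def __init__(self):
        self.acc = 0  # integer, in units of 2**-1074

    def add(self, x: float):
        # Fraction(x) is the exact value of the double; scaling by
        # 2**1074 turns it into an integer for any finite double.
        self.acc += int(Fraction(x) * 2 ** self.SCALE)

    def value(self) -> Fraction:
        return Fraction(self.acc, 2 ** self.SCALE)

k = KulischAccumulator()
for v in [1.0, 2.0 ** -53, -1.0]:
    k.add(v)
assert k.value() == Fraction(1, 2 ** 53)   # exact, regardless of order
```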

SLIDE 17

Our Multi-Level Algorithm

Objective: to compute deterministic sums of floating-point numbers efficiently and with the best possible accuracy

Accurate and Reproducible Parallel Summation:

Based on FP expansions with EFT and the Kulisch accumulator
Parallel algorithm with 5 levels
Suitable for today’s parallel architectures
Guarantees “infinite” precision ⇒ bit-wise reproducibility

SLIDE 18

Level 1: Filtering

SLIDE 19

Level 2 and 3: Scalar Superaccumulator

SLIDE 20

Level 4 and 5: Reduction and Rounding

SLIDE 21

Experimental Environments

Table: Hardware platforms employed in the experimental evaluation¹

A  Intel Core i7-4770 (Haswell)           4 cores with HT
B  Intel Xeon E5-2450 (Sandy Bridge-EN)   2 × 8 cores
C  Intel Xeon Phi 3110P                   60 cores × 4-way MT
D  NVIDIA Tesla K20c                      13 SMs × 192 CUDA cores
E  AMD Radeon HD 7970                     32 CUs × 64 units

¹ S. Collange, D. Defour, S. Graillat, and R. Iakymchuk. Full-Speed Deterministic Bit-Accurate Parallel Floating-Point Summation on Multi- and Many-Core Architectures. Feb. 2014. HAL-ID: hal-00949355.

SLIDE 22

Performance Results on Intel Phi

Parallel Summation: Performance Scaling

[Figure: parallel summation throughput (Gacc/s) vs. array size (10³–10⁹) on the Intel Xeon Phi; series: parallel FP sum, TBB deterministic, superaccumulator, and expansions of size 2, 3, 4, and 8 with early exit]

SLIDE 23

Performance Results on Intel Phi

Parallel Summation: Data-Dependent Performance

[Figure: throughput (Gacc/s) vs. dynamic range of the input (1–10¹⁴⁰) on the Intel Xeon Phi; series: parallel FP sum and expansions of size 2, 3, 4, and 8 with early exit]

SLIDE 24

Performance Results on NVIDIA Tesla

Parallel Summation: Performance Scaling

[Figure: parallel summation throughput (Gacc/s) vs. array size (10³–10⁹) on the NVIDIA Tesla K20c; series: parallel FP sum, superaccumulator, expansions of size 2, 3, 4, and 8, and expansion 8 with early exit]

SLIDE 25

Performance Results on NVIDIA Tesla

Parallel Summation: Data-Dependent Performance

[Figure: throughput (Gacc/s) vs. dynamic range of the input (1–10¹⁴⁰) on the NVIDIA Tesla K20c; series: parallel FP sum, superaccumulator, expansions of size 2, 3, 4, and 8, and expansion 8 with early exit]

SLIDE 26

Conclusions

The Proposed Multi-Level Summation Algorithm

Computes the results with no errors due to rounding
Provides bit-wise identical reproducibility, regardless of:

Data permutation, data assignment
Thread scheduling, etc.

SLIDE 27

Conclusions

The Proposed Multi-Level Summation Algorithm

Computes the results with no errors due to rounding
Provides bit-wise identical reproducibility, regardless of:

Data permutation, data assignment
Thread scheduling, etc.

Is efficient: delivers performance comparable to standard parallel summation
Scales well as the problem size or the number of cores increases

SLIDE 28

Conclusions

The Proposed Multi-Level Summation Algorithm

Computes the results with no errors due to rounding
Provides bit-wise identical reproducibility, regardless of:

Data permutation, data assignment
Thread scheduling, etc.

Is efficient: delivers performance comparable to standard parallel summation
Scales well as the problem size or the number of cores increases
Can be applied to other operations that use summation or dot products
Is suitable for very large-scale systems (ExaScale), with one more reduction step between nodes

SLIDE 29

Future Work

ExBLAS – Exact BLAS

ExBLAS-1: ExSCAL, ExDOT, ExAXPY, ... ExBLAS-2: ExGER, ExGEMV, ExSYR, ... ExBLAS-3: ExGEMM, ExTRMM, ExSYR2K, ...

SLIDE 30

Future Work

DDOT: α := xᵀy = Σ_{i=1..N} x_i · y_i
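Extending the approach to DDOT needs an error-free transformation for products as well as sums. A classic choice, shown here as a sketch and not necessarily the authors' exact kernel, is Dekker's TwoProd via Veltkamp splitting (exact barring overflow/underflow):

```python
from fractions import Fraction

SPLITTER = 2 ** 27 + 1   # Veltkamp splitting constant for IEEE doubles

def split(a: float):
    """Split a into hi + lo, each exactly representable with 26-27 bits."""
    c = SPLITTER * a
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a: float, b: float):
    """Dekker/Ogita-Rump TwoProduct: returns (p, e) with p = fl(a * b)
    and a * b = p + e exactly (no overflow/underflow assumed)."""
    p = a * b
    ahi, alo = split(a)
    bhi, blo = split(b)
    e = alo * blo - (((p - ahi * bhi) - alo * bhi) - ahi * blo)
    return p, e

# The error term is exact: p + e equals a * b as exact rationals.
a, b = 0.1, 0.3
p, e = two_prod(a, b)
assert Fraction(p) + Fraction(e) == Fraction(a) * Fraction(b)
```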

[Figure: parallel DDOT throughput (Gacc/s) vs. array size (10³–10⁹); series: parallel DDOT, superaccumulator, expansions of size 2, 3, 4, and 8, and expansions of size 4, 6, and 8 with early exit]

SLIDE 31

Future Work

ExBLAS – Exact BLAS

ExBLAS-1: ExSCAL, ExDOT, ExAXPY, ... ExBLAS-2: ExGER, ExGEMV, ExSYR, ... ExBLAS-3: ExGEMM, ExTRMM, ExSYR2K, ...

Distributed architectures

Parallelization with MPI Computation on network cards

SLIDE 32

Acknowledgement

Thank you for your attention!

This work was undertaken (partially) in the framework of CALSIMLAB and is supported by the public grant ANR-11-LABX-0037-01 overseen by the French National Research Agency (ANR) as part of the “Investissements d’Avenir” program (reference: ANR-11-IDEX-0004-02).

SLIDE 33

References

• S. Collange, D. Defour, S. Graillat, and R. Iakymchuk. Numerical Reproducibility for the Summation Problem on Multi- and Many-Core Architectures. Submitted to the Parallel Computing Journal.

• S. Collange, D. Defour, S. Graillat, and R. Iakymchuk. Full-Speed Deterministic Bit-Accurate Parallel Floating-Point Summation on Multi- and Many-Core Architectures. Tech. report, Feb. 2014. HAL-ID: hal-00949355.

• J. Demmel and H. D. Nguyen. Fast Reproducible Floating-Point Summation. Proceedings of the 21st IEEE Symposium on Computer Arithmetic, Austin, Texas, USA, 2013.

• J. Demmel and H. D. Nguyen. Parallel Reproducible Summation. To appear in IEEE Transactions on Computers, Special Section on Computer Arithmetic, 2014.

• J. Demmel and H. D. Nguyen. Numerical Reproducibility and Accuracy at ExaScale (invited talk). The 21st IEEE Symposium on Computer Arithmetic, Austin, Texas, USA, 2013.

• A. Arteaga, O. Fuhrer, and T. Hoefler. Designing Bit-Reproducible Portable High-Performance Applications. Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2014.

• G. Michelogiannakis, X. S. Li, D. H. Bailey, and J. Shalf. Extending Summation Precision for Network Reduction Operations. Symposium on Computer Architecture and High Performance Computing, Porto de Galinhas, Brazil, 2013.
