ACCURATE FLOATING-POINT SUMMATION IN CUB URI VERNER Summer intern

OUTLINE Who needs accurate floating-point summation? ! Round-off error: source and recovery A new method for accurate FP summation on a GPU Added as a function to the open-source CUB library How fast is it? Download link

NORMAL FLOATING-POINT SUMMATION … INPUT 1. 000000 x10 0 -3.333333 x10 -1 -2.467579 x10 -19 RESULT 0.74999988 0.75000000 0.75000024 INACCURATE RESULTS NON-DETERMINISTIC RESULTS!

ACCURATE FP SUMMATION … … INPUT 1. 000000 x10 0 -3.333333 x10 -1 -2.467579 x10 -19 EXACT SUM 0.74999999082897 … ACCURATE SUM AS FLOATING-POINT 0.7500000 Our method computes both!

EXISTING WORK: EXBLAS (OPENCL) By Iakymchuk, Collange, et al. Uses Kulisch accumulators (very wide fixed-precision variables) Our method uses a different approach

WHERE IS THIS USEFUL? High Performance Computing applications An example coming next Cross-platform applications Debugging: bit-exact results for floating point! if (d_result != d_reference) error(“wrong answer!”);

EXAMPLE: LATTICE QCD COMPUTATIONS QCD - Quantum Chromodynamics Describes the strong force that binds quarks and gluons GPU accelerated QUDA library lattice.github.io/quda Accurate summation can potentially improve convergence and reduce computation time

CONVERGENCE OF ITERATIVE ALGORITHM BiCGstab Algorithm: Dirac equation solver convergence stagnation

ROUND-OFF ERROR: SOURCE AND RECOVERY

IEEE-754 FLOATING-POINT STANDARD 32/64 bits S EXP SIGNIFICANT DIGITS 𝑦 = (−1) 𝑡 × 2 𝐹𝑌𝑄 × 1. 𝐸 𝑐 single-precision double-precision Format width 32 bits 64 bits Exponent range -126 .. 127 (8 bits) -1022 .. 1023 (11 bits) Significant digits 23 (+1 implicit) 52 (+1 implicit) Special cases: +/-NaN, +/-Inf, +/-0, subnormals

SOURCE OF NON-REPRODUCIBILITY Non-associative operations Order of operations matters: 1,000,000 + (0.4 + 0.4) -> 1,000,001 (1,000,000 + 0.4) + 0.4 -> 1,000,000 different implementations return different sum values

SOURCE OF ACCURACY LOSS Round-off error in compute operations Computer sum: 1234.567 + 1.234567 accurate actual 1235.801567 1235.802 bigger difference in magnitude => more digits lost

TWO-SUM ALGORITHM (KNUTH) TwoSum(a,b) Round(a+b) round-off error [s,r] = TwoSum(a,b) 6 FP operations s <- a+b z <- s-b r <- (b-(s-z)) + (a-z)

FAST TWO-SUM (DEKKER) FastTwoSum(a,b) Round(a+b) round-off error [s,r] = FastTwoSum(a,b) 3 FP operations s <- a+b z <- s-a Requires EXP(a) ≥ EXP(b) r <- b-z

ERROR-FREE PARALLEL SUMMATION

INTEGRATION INTO CUB LIBRARY CUB: Parallel primitives in CUDA Includes parallel primitives like Sum, Scan, Sort, etc. Performance tuned for every NVIDIA GPU architecture Reduction Aim: use Reduction with TwoSum() for an error-free sum

REDUCTION+TwoSum: PROBLEM #1 The output of TwoSum is two FPs, instead of one! (x1,x2)=2Sum(s1,s2) (y1,y2)=2Sum(r1,r2) + = + = 1 1 1 2 1 1 x2=x2+y1 (x1,x2)=fast(x1,x2) Parallel reduction TwoSum x2=x2+y2 (s3,r3)=fast(x1,x2) 1. Convert x (x, 0.0) + 2. Define (s1,r1) (s2,r2) (s3,r3)

REDUCTION+TwoSum: PROBLEM #2 Limited accuracy E.g.: 10 100 + 10 0 + 10 -100 Multiple values with similar exponents can be added without overflow

DIVIDE THE EXPONENT RANGE INTO BINS Number of EXP values: 2048 (double) 0 1 2 3 4 5 6 7 s r s r s r s r s r s r s r s r add to bin #3 (bin id) 3 𝐠𝐦𝐩𝐩𝐬 𝟐𝟏𝟑𝟒 / 𝟑𝟏𝟓𝟗 = 𝟒 𝟗 0 1023 010101010101010101

HOW MANY EXPONENT VALUES PER BIN? 1. 00000000000000000001 1 000000000000000000000 106 binary digits Suppose we add n numbers to a bin: 𝑏 𝑗 = 2 𝑓 𝑗 ∙ 𝑛 𝑗 , where 𝑓 𝑚 ≤ 𝑓 𝑗 < 𝑓 ℎ . Our budget is 106 digits 𝑏 𝑚 = 1 00000000000111111111 106 ≥ 53 + 𝑚𝑝𝑕 2 𝑜 + 𝑓 ℎ − 𝑓 𝑚 𝑏 ℎ = 1 11111111111111111111 000000000 𝑓 ℎ − 𝑓 𝑚 For 𝑜 = 2 20 , 𝑓 ℎ − 𝑓 𝑚 = 32 different exponents!

ALGORITHM: ERROR-FREE SUMMATION ON GPUS

EACH THREAD DOES THE FOLLOWING EXPONENT=11bit binidx=6bit bin=5bit read input | 𝟑 −𝟗𝟏 | 𝟑 𝟏 | 𝟑 𝟒𝟐 | 𝟑 𝟕𝟏 | 𝟑 𝟔𝟏 | 𝟑 −𝟘𝟏 | 𝟑 𝟐 | 𝟑 𝟏 | | 𝟑 −𝟗𝟏 | 𝟑 −𝟘𝟏 | 𝟑 𝟒𝟐 | 𝟑 𝟏 | 𝟑 𝟐 | 𝟑 𝟏 | 𝟑 𝟕𝟏 | 𝟑 𝟔𝟏 | radix-sort by binidx reduce-by-key binidx update smem bins s r s r s r s r s r s r s r s r 0 1 … 63 SHARED MEMORY

FINAL SUMMATION PHASE Bin 0 . . . Bin 63 Block 0 s r s r s r s r s r s r s r s r Block 1 s r s r s r s r s r s r s r s r … s r s r s r s r s r s r s r s r + s r r2 s r r2 s r r2 s r r2 s r r2 s r r2 s r r2 s r r2 (serial phase) SUM X x x x x x x x x x x x x x x x x x x x x x x x Result

ALGORITHM SUMMARY 1. For each thread block: Repeat: Read input tile in registers Radix-sort items by bin ID in registers (+ temp buffer in shared) Compute sum for each bin with Reduce-by-key in registers (+ temp buffer in shared) Update bins in shared memory in shared memory Save bins to global memory in global memory 2. Merge bins with the same bin ID 3. “Normalize” bins by adding them from low to high 4. Rounded result is in the highest word

PERFORMANCE (K40) 6 5.2 5 4.7 4 5 Billion GItems/sec Items per 3 second Const 2 Random 1 0 Normal summation is ~6 times faster Number of items

DOWNLOAD AND CONTRIBUTE Get it at: https://github.com/uriv/accusum Usage instructions in README.ACCUSUM It’s open source. Use it, improve it!

THANK YOU

BACKUP

ACCURATE FLOATING-POINT SUMMATION IN CUB URI VERNER Summer intern - PowerPoint PPT Presentation

ACCURATE FLOATING-POINT SUMMATION IN CUB URI VERNER Summer intern OUTLINE Who needs accurate floating-point summation? ! Round-off error: source and recovery A new method for accurate FP summation on a GPU Added as a function to the open-source

Debugging Floating-Point Debugging Floating-Point Debugging Floating-Point Math in Racket Math

Formal verification of floating-point algorithms John Harrison Intel Corporation Floating

Floating-point numbers Fractional binary numbers IEEE floating-point standard Floating-point

Lecture 3 Floating Point Representations 1 Floating-point arithmetic We often incur

Machine numbers: how floating point numbers are stored? Floating-point number representation

Floating point Today ! IEEE Floating Point Standard ! Rounding ! Floating Point Operations !

15-213 The course that gives CMU its Zip! Floating Point Sept 6, 2006 Topics Topics

ECS 231 Computer Arithmetic 1 / 27 Outline Floating-point numbers and representations 1

9/20/2018 Today: Floating Point Background: Fractional binary numbers IEEE floating point

2/10/2020 Today: Floating Point Background: Fractional binary numbers IEEE floating point

ZA 2016-1920-CUB ENV-2016-192-EAF CUB 6/1/2016 Lisette Covarrubias CONDITIONAL USE A T T A C H

Correctly Rounded Arbitrary-Precision Floating-Point Summation Vincent LEFVRE AriC, Inria

Pavel Alex James Zach Panchekha Sanchez-Stern Wilcox Tatlock Floating Points Wild

CS 356 Unit 3 IEEE 754 Floating Point Representation 3.2 Floating Point Used to represent

Unit 3 IEEE 754 Floating Point Representation 3.2 Floating Point Used to represent very

for Optimization and Analysis of Floating-Point Computations Heiko Becker, Pavel Panchekha, Eva

Junior Prep Wednesday 20 th March 2019 Kerry Walsh Lead Practitioner in Numeracy A lot of

2 Digit Addition Strategies Example of addition using 100 chart. 25 + 12 =37 S tep 1. Partition

HALF DAY WORKSHOP ON SABKA VISHWAS (LEGACY DISPUTE RESOLUTION) SCHEME, 2019 FACULTIES Speaker: CA

Building Forum Monday 5 December 2016 Welcome Mark Devine, General Manager, Project Delivery

School Maths and the new curriculum The essence of mathematics lies in its freedom.

Description of Presentation Appropriate use of technology is an essential component of preparing

July 22, 2019 Acquisition of Safe Harbor ABOUT THE FIRST BANCSHARES, INC. ADDITIONAL

Overview of 3Q FY2011 Financial Results 1. Consolidated Results of Tokio Marine Holdings