Wanted: Floating-Point Add Round-off Error Instruction - PowerPoint PPT Presentation


SLIDE 1

Wanted: Floating-Point Add Round-off Error Instruction

Marat Dukhan Richard Vuduc Jason Riedy

School of Computational Science and Engineering College of Computing Georgia Institute of Technology

June 23, 2016


SLIDE 2

Outline

1. Introduction
2. Error-Free Transformations
3. Performance Evaluation


SLIDE 3

High-Precision Arithmetic in High Demand

Numerical reproducibility

◮ Dynamic work distribution across threads
◮ Variations in SIMD- and instruction-level parallelism

Mathematical functions

◮ IEEE754-2008 recommends correct rounding for LibM functions

Growing number of scientific applications

◮ David Bailey's presentation (2005): 8 areas of science
◮ His recent presentation at an SC BoF (2014): 12 areas of science


SLIDE 4

High-Precision Arithmetic Algorithms

Quadruple precision

◮ Software implementation using integer arithmetic

Double-double arithmetic

◮ Represent a number as an unevaluated sum of two doubles: x = x_hi + x_lo (see the C sketch below)

Compensated algorithms

◮ High-precision summation, dot product, polynomial evaluation

[Figure: addition latency in cycles for quad, double-double, double-double with FPADDRE, and plain double formats on Intel Skylake, Intel Haswell, AMD Steamroller, and Intel Xeon Phi (Knights Corner).]
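For reference, the double-double representation above can be modeled in C as a pair of doubles. This is an illustrative sketch; the type and helper names below are hypothetical, not code from the slides or the FPplus library.

    /* Illustrative double-double: the value is the exact, unevaluated
       sum hi + lo, with |lo| no larger than half an ulp of hi.
       Names are hypothetical, not taken from the presentation. */
    typedef struct {
        double hi;  /* leading component: double nearest the full value */
        double lo;  /* trailing component: roundoff not representable in hi */
    } double_double;

    /* Collapsing back to ordinary double precision just re-adds the parts. */
    static inline double dd_to_double(double_double x) {
        return x.hi + x.lo;
    }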


SLIDE 5

Outline

1. Introduction
2. Error-Free Transformations
3. Performance Evaluation


SLIDE 6

Error-Free Multiplication

p + e = a · b, where p = double(a · b)

Error-Free Multiplication with FMA

p := FPMUL a * b
e := FMA   a * b - p
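In C, this two-instruction sequence maps directly onto the C99 fma function from math.h. The helper name two_prod is the conventional one in the error-free transformation literature, not something the slide defines; this is a sketch assuming round-to-nearest and hardware FMA.

    #include <math.h>

    /* Error-free multiplication: p = double(a*b), and the single FMA
       a*b - p is exact, so p + e == a*b with no information lost. */
    static void two_prod(double a, double b, double *p, double *e) {
        *p = a * b;            /* FPMUL: correctly rounded product */
        *e = fma(a, b, -*p);   /* FMA: a*b - p with no intermediate rounding */
    }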


SLIDE 7

Error-Free Addition

s + e = a + b, where s = double(a + b)

Error-Free Addition (Knuth, 1997)

s         := FPADD a + b
bvirtual  := FPADD s - a
avirtual  := FPADD s - bvirtual
broundoff := FPADD b - bvirtual
aroundoff := FPADD a - avirtual
e         := FPADD aroundoff + broundoff
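Transcribed into C, the sequence looks as follows. This is a sketch under the usual assumptions: round-to-nearest, no overflow, and no compiler reassociation (i.e., compile without -ffast-math); the name two_sum is the conventional one, not defined on the slide.

    /* Knuth's error-free addition (TwoSum): after the call,
       s + e == a + b exactly, where s is the rounded sum and
       e the roundoff error. */
    static void two_sum(double a, double b, double *s, double *e) {
        *s = a + b;
        double b_virtual  = *s - a;          /* part of s contributed by b */
        double a_virtual  = *s - b_virtual;  /* part of s contributed by a */
        double b_roundoff = b - b_virtual;
        double a_roundoff = a - a_virtual;
        *e = a_roundoff + b_roundoff;        /* exact roundoff of a + b */
    }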


SLIDE 8

FPADD3 Instruction

Ogita et al. (2005) suggested the FPADD3 instruction to accelerate error-free addition. FPADD3 adds 3 floating-point numbers without intermediate rounding. No general-purpose CPU or GPU has ever implemented this instruction.

Error-Free Addition with FPADD3 (Ogita et al, 2005)

s := FPADD  a + b
e := FPADD3 a + b - s


SLIDE 9

FPADDRE Instruction

We suggest an instruction, Floating-Point Add Round-off Error (FPADDRE), that computes the roundoff error of floating-point addition. The instruction offers two benefits for error-free addition:
◮ It replaces 5 FPADD instructions with 1 FPADDRE
◮ It breaks the dependency chain between the sum and the roundoff error

Error-Free Addition with FPADDRE

s := FPADD   a + b
e := FPADDRE a + b
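Since no shipping hardware implements FPADDRE, experiments have to emulate it in software. A minimal stand-in built from Knuth's sequence might look like this; the function name is hypothetical and the emulation is only functional, not representative of the instruction's performance.

    /* Hypothetical software stand-in for the proposed FPADDRE
       instruction: returns only the roundoff error of a + b. In
       hardware this would be one instruction, independent of the
       separately computed sum. */
    static double fpaddre(double a, double b) {
        double s = a + b;
        double b_virtual = s - a;
        double a_virtual = s - b_virtual;
        return (a - a_virtual) + (b - b_virtual);
    }

With such an instruction, the sum s and the error e are two independent operations on the same inputs, which is exactly what lets an out-of-order core overlap them.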


SLIDE 10

Reusing FPADD logic in FPADDRE

[Figure: bit-level schema of adding two example operands, showing mantissa alignment and which result bits FPADD versus FPADDRE keeps.]

Schema of the FPADD and FPADDRE operations (for operands with the same sign and overlapping mantissas). The operations differ in only two aspects: the addition or subtraction of the sticky bit, and which bits are copied to the resulting mantissa.


SLIDE 11

Outline

1. Introduction
2. Error-Free Transformations
3. Performance Evaluation


SLIDE 12

Simulation

To estimate the performance effect of the FPADDRE instruction, we implemented several high-precision algorithms:

◮ Double-double scalar addition and multiplication
◮ Double-double matrix multiplication
◮ Compensated dot product
◮ Polynomial evaluation via the compensated Horner scheme

We then replaced FPADDRE with an existing instruction that has the performance characteristics of addition, and benchmarked the algorithms on four microarchitectures:

◮ Intel Haswell
◮ Intel Skylake
◮ AMD Steamroller
◮ Intel Knights Corner coprocessor
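As an example of the kernels benchmarked here, a compensated dot product in the style of Ogita, Rump, and Oishi can be assembled from the error-free transformations above. This sketch reuses the two_prod and two_sum helpers from the earlier slides' notes; it illustrates the technique and is not the FPplus implementation itself.

    #include <stddef.h>

    /* Compensated dot product: sum carries the rounded result, comp
       gathers every multiply/add roundoff; folding comp back in at the
       end roughly doubles the working precision of a plain dot product. */
    static double compensated_dot(const double *x, const double *y, size_t n) {
        double sum = 0.0, comp = 0.0;
        for (size_t i = 0; i < n; i++) {
            double p, e_mul, e_add;
            two_prod(x[i], y[i], &p, &e_mul);  /* exact product split */
            two_sum(sum, p, &sum, &e_add);     /* exact accumulation split */
            comp += e_mul + e_add;             /* accumulate roundoff terms */
        }
        return sum + comp;
    }

In this loop, the two_sum call is the six-instruction dependency chain that FPADDRE would collapse into a single independent instruction.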


SLIDE 13

Double-double Latency

[Figure: latency reduction (%) from FPADDRE for double-double addition and multiplication on Intel Skylake, Intel Haswell, AMD Steamroller, and Intel Xeon Phi (Knights Corner); chart data labels (addition/multiplication): 55%/3%, 45%/0%, 53%/1%, 36%/11%.]


SLIDE 14

Double-double Throughput

[Figure: throughput improvement (%) from FPADDRE for double-double addition and multiplication on Intel Skylake, Intel Haswell, AMD Steamroller, and Intel Xeon Phi (Knights Corner); chart data labels (addition/multiplication): 36%/11%, 18%/16%, 103%/14%, 34%/0%.]


SLIDE 15

Double-double Matrix Multiplication

[Figure: double-double matrix multiplication acceleration with the FPADDRE instruction; speedups of 28% on Intel Xeon Phi (Knights Corner), 84% on AMD Steamroller, 90% on Intel Haswell, and 93% on Intel Skylake.]


SLIDE 16

Compensated Dot Product

[Figure: four panels plotting cycles per element against array size (1K to 16M elements) for a plain dot product, the compensated dot product, and the compensated dot product with FPADDRE, one panel each for the Intel Skylake, Intel Haswell, AMD Steamroller, and Intel Knights Corner microarchitectures.]


SLIDE 17

Compensated Polynomial Evaluation

[Figure: evaluation latency in cycles for the plain Horner scheme, the compensated Horner scheme, and the compensated Horner scheme with FPADDRE on Intel Skylake, Intel Haswell, AMD Steamroller, and Intel Xeon Phi (Knights Corner); chart data labels: 157, 57, 130; 157, 75, 136; 191, 69, 153; 323, 64, 229.]
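The compensated Horner scheme benchmarked above runs the ordinary Horner recurrence in double while a second recurrence accumulates the roundoff of every multiply and add. A sketch, again reusing the two_prod/two_sum helpers from the earlier notes (after Graillat, Langlois, and Louvet), not the exact benchmarked code:

    /* Compensated Horner: r tracks the ordinary Horner recurrence, c the
       accumulated roundoff; r + c is about as accurate as evaluating in
       double-double. coeff[0..degree] holds the coefficients, with
       coeff[degree] the leading one. */
    static double compensated_horner(const double *coeff, int degree, double x) {
        double r = coeff[degree];
        double c = 0.0;
        for (int i = degree - 1; i >= 0; i--) {
            double p, e_mul, e_add;
            two_prod(r, x, &p, &e_mul);        /* exact split of r * x */
            two_sum(p, coeff[i], &r, &e_add);  /* exact split of p + coeff[i] */
            c = c * x + (e_mul + e_add);       /* Horner recurrence on the errors */
        }
        return r + c;
    }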


SLIDE 18

Public release

We open-sourced the software that was developed as part of this research. The implementation, unit tests, and benchmarks are available at github.com/Maratyszcza/FPplus. The paper preprint is at arxiv.org/abs/1603.00491.


SLIDE 19

Summary

We suggest a new instruction, Floating-Point Add Round-off Error, that computes the roundoff error of floating-point addition. Performance simulations suggest that the proposed instruction could accelerate high-precision computations by up to 2x.


SLIDE 20

Funding

This research was supported in part by:

The National Science Foundation (NSF) under NSF CAREER award number 1339745
The U.S. Dept. of Energy (DOE), Office of Science, Advanced Scientific Computing Research under award DE-FC02-10ER26006/DE-SC0004915
The Defense Advanced Research Projects Agency (DARPA) under agreement #HR0011-13-2-0001

Disclaimer

Any opinions, conclusions, or recommendations expressed in this presentation are those of the authors and do not necessarily reflect those of NSF, DOE, or DARPA.