Wanted: Floating-Point Add Round-off Error Instruction - PowerPoint PPT Presentation


SLIDE 1

Wanted: Floating-Point Add Round-off Error Instruction

Marat Dukhan Richard Vuduc Jason Riedy

School of Computational Science and Engineering College of Computing Georgia Institute of Technology

June 23, 2016


SLIDE 2

Outline

1. Introduction
2. Error-Free Transformations
3. Performance Evaluation


SLIDE 3

High-Precision Arithmetic in High Demand

Numerical reproducibility

◮ Dynamic work distribution across threads
◮ Variations in SIMD- and instruction-level parallelism

Mathematical functions

◮ IEEE754-2008 recommends correct rounding for LibM functions

Growing number of scientific applications

◮ David Bailey's presentation (2005): 8 areas of science
◮ His recent presentation at an SC BoF (2014): 12 areas of science


SLIDE 4

High-Precision Arithmetic Algorithms

Quadruple precision

◮ Software implementation using integer arithmetic

Double-double arithmetic

◮ Represent a number as an unevaluated sum of two doubles: x = x_hi + x_lo (see the C sketch below)

Compensated algorithms

◮ High-precision summation, dot product, polynomial evaluation

[Figure: addition latency in cycles for quad, double-double, double-double with FPADDRE, and plain double formats on Intel Skylake, Intel Haswell, AMD Steamroller, and Intel Xeon Phi (Knights Corner).]
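For reference, the double-double representation above can be modeled in C as a pair of doubles. This is an illustrative sketch; the type and helper names below are hypothetical, not code from the slides or the FPplus library.

    /* Illustrative double-double: the value is the exact, unevaluated
       sum hi + lo, with |lo| no larger than half an ulp of hi.
       Names are hypothetical, not taken from the presentation. */
    typedef struct {
        double hi;  /* leading component: double nearest the full value */
        double lo;  /* trailing component: roundoff not representable in hi */
    } double_double;

    /* Collapsing back to ordinary double precision just re-adds the parts. */
    static inline double dd_to_double(double_double x) {
        return x.hi + x.lo;
    }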


SLIDE 5

Outline

1. Introduction
2. Error-Free Transformations
3. Performance Evaluation


SLIDE 6

Error-Free Multiplication

p + e = a · b, where p = double(a · b)

Error-Free Multiplication with FMA

p := FPMUL a * b
e := FMA   a * b - p
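In C, this two-instruction sequence maps directly onto the C99 fma function from math.h. The helper name two_prod is the conventional one in the error-free transformation literature, not something the slide defines; this is a sketch assuming round-to-nearest and hardware FMA.

    #include <math.h>

    /* Error-free multiplication: p = double(a*b), and the single FMA
       a*b - p is exact, so p + e == a*b with no information lost. */
    static void two_prod(double a, double b, double *p, double *e) {
        *p = a * b;            /* FPMUL: correctly rounded product */
        *e = fma(a, b, -*p);   /* FMA: a*b - p with no intermediate rounding */
    }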


SLIDE 7

Error-Free Addition

s + e = a + b, where s = double(a + b)

Error-Free Addition (Knuth, 1997)

s         := FPADD a + b
bvirtual  := FPADD s - a
avirtual  := FPADD s - bvirtual
broundoff := FPADD b - bvirtual
aroundoff := FPADD a - avirtual
e         := FPADD aroundoff + broundoff
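Transcribed into C, the sequence looks as follows. This is a sketch under the usual assumptions: round-to-nearest, no overflow, and no compiler reassociation (i.e., compile without -ffast-math); the name two_sum is the conventional one, not defined on the slide.

    /* Knuth's error-free addition (TwoSum): after the call,
       s + e == a + b exactly, where s is the rounded sum and
       e the roundoff error. */
    static void two_sum(double a, double b, double *s, double *e) {
        *s = a + b;
        double b_virtual  = *s - a;          /* part of s contributed by b */
        double a_virtual  = *s - b_virtual;  /* part of s contributed by a */
        double b_roundoff = b - b_virtual;
        double a_roundoff = a - a_virtual;
        *e = a_roundoff + b_roundoff;        /* exact roundoff of a + b */
    }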


SLIDE 8

FPADD3 Instruction

Ogita et al. (2005) suggested the FPADD3 instruction to accelerate error-free addition. FPADD3 adds 3 floating-point numbers without intermediate rounding. No general-purpose CPU or GPU has ever implemented this instruction.

Error-Free Addition with FPADD3 (Ogita et al, 2005)

s := FPADD  a + b
e := FPADD3 a + b - s


SLIDE 9

FPADDRE Instruction

We suggest an instruction, Floating-Point Add Round-off Error (FPADDRE), that computes the roundoff error of floating-point addition. The instruction offers two benefits for error-free addition:
◮ It replaces 5 FPADD instructions with 1 FPADDRE
◮ It breaks the dependency chain between the sum and the roundoff error

Error-Free Addition with FPADDRE

s := FPADD   a + b
e := FPADDRE a + b
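Since no shipping hardware implements FPADDRE, experiments have to emulate it in software. A minimal stand-in built from Knuth's sequence might look like this; the function name is hypothetical and the emulation is only functional, not representative of the instruction's performance.

    /* Hypothetical software stand-in for the proposed FPADDRE
       instruction: returns only the roundoff error of a + b. In
       hardware this would be one instruction, independent of the
       separately computed sum. */
    static double fpaddre(double a, double b) {
        double s = a + b;
        double b_virtual = s - a;
        double a_virtual = s - b_virtual;
        return (a - a_virtual) + (b - b_virtual);
    }

With such an instruction, the sum s and the error e are two independent operations on the same inputs, which is exactly what lets an out-of-order core overlap them.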


SLIDE 10

Reusing FPADD logic in FPADDRE

[Figure: bit-level schema of adding two example operands, showing mantissa alignment and which result bits FPADD versus FPADDRE keeps.]

Schema of the FPADD and FPADDRE operations (for operands with the same sign and overlapping mantissas). The operations differ in only two aspects: the addition or subtraction of the sticky bit, and which bits are copied to the resulting mantissa.


SLIDE 11

Outline

1. Introduction
2. Error-Free Transformations
3. Performance Evaluation


SLIDE 12

Simulation

To estimate the performance effect of the FPADDRE instruction, we implemented several high-precision algorithms:

◮ Double-double scalar addition and multiplication
◮ Double-double matrix multiplication
◮ Compensated dot product
◮ Polynomial evaluation via the compensated Horner scheme

We then replaced FPADDRE with an existing instruction that has the performance characteristics of addition, and benchmarked the algorithms on four microarchitectures:

◮ Intel Haswell
◮ Intel Skylake
◮ AMD Steamroller
◮ Intel Knights Corner coprocessor
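As an example of the kernels benchmarked here, a compensated dot product in the style of Ogita, Rump, and Oishi can be assembled from the error-free transformations above. This sketch reuses the two_prod and two_sum helpers from the earlier slides' notes; it illustrates the technique and is not the FPplus implementation itself.

    #include <stddef.h>

    /* Compensated dot product: sum carries the rounded result, comp
       gathers every multiply/add roundoff; folding comp back in at the
       end roughly doubles the working precision of a plain dot product. */
    static double compensated_dot(const double *x, const double *y, size_t n) {
        double sum = 0.0, comp = 0.0;
        for (size_t i = 0; i < n; i++) {
            double p, e_mul, e_add;
            two_prod(x[i], y[i], &p, &e_mul);  /* exact product split */
            two_sum(sum, p, &sum, &e_add);     /* exact accumulation split */
            comp += e_mul + e_add;             /* accumulate roundoff terms */
        }
        return sum + comp;
    }

In this loop, the two_sum call is the six-instruction dependency chain that FPADDRE would collapse into a single independent instruction.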


SLIDE 13

Double-double Latency

[Figure: latency reduction (%) from FPADDRE for double-double addition and multiplication on Intel Skylake, Intel Haswell, AMD Steamroller, and Intel Xeon Phi (Knights Corner); chart data labels (addition/multiplication): 55%/3%, 45%/0%, 53%/1%, 36%/11%.]


SLIDE 14

Double-double Throughput

[Figure: throughput improvement (%) from FPADDRE for double-double addition and multiplication on Intel Skylake, Intel Haswell, AMD Steamroller, and Intel Xeon Phi (Knights Corner); chart data labels (addition/multiplication): 36%/11%, 18%/16%, 103%/14%, 34%/0%.]


SLIDE 15

Double-double Matrix Multiplication

[Figure: double-double matrix multiplication acceleration with the FPADDRE instruction; speedups of 28% on Intel Xeon Phi (Knights Corner), 84% on AMD Steamroller, 90% on Intel Haswell, and 93% on Intel Skylake.]


SLIDE 16

Compensated Dot Product

[Figure: four panels plotting cycles per element against array size (1K to 16M elements) for a plain dot product, the compensated dot product, and the compensated dot product with FPADDRE, one panel each for the Intel Skylake, Intel Haswell, AMD Steamroller, and Intel Knights Corner microarchitectures.]


SLIDE 17

Compensated Polynomial Evaluation

[Figure: evaluation latency in cycles for the plain Horner scheme, the compensated Horner scheme, and the compensated Horner scheme with FPADDRE on Intel Skylake, Intel Haswell, AMD Steamroller, and Intel Xeon Phi (Knights Corner); chart data labels: 157, 57, 130; 157, 75, 136; 191, 69, 153; 323, 64, 229.]
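The compensated Horner scheme benchmarked above runs the ordinary Horner recurrence in double while a second recurrence accumulates the roundoff of every multiply and add. A sketch, again reusing the two_prod/two_sum helpers from the earlier notes (after Graillat, Langlois, and Louvet), not the exact benchmarked code:

    /* Compensated Horner: r tracks the ordinary Horner recurrence, c the
       accumulated roundoff; r + c is about as accurate as evaluating in
       double-double. coeff[0..degree] holds the coefficients, with
       coeff[degree] the leading one. */
    static double compensated_horner(const double *coeff, int degree, double x) {
        double r = coeff[degree];
        double c = 0.0;
        for (int i = degree - 1; i >= 0; i--) {
            double p, e_mul, e_add;
            two_prod(r, x, &p, &e_mul);        /* exact split of r * x */
            two_sum(p, coeff[i], &r, &e_add);  /* exact split of p + coeff[i] */
            c = c * x + (e_mul + e_add);       /* Horner recurrence on the errors */
        }
        return r + c;
    }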


SLIDE 18

Public release

We open-sourced the software that was developed as part of this research. The implementation, unit tests, and benchmarks are available at github.com/Maratyszcza/FPplus. The paper preprint is at arxiv.org/abs/1603.00491.


SLIDE 19

Summary

We suggest a new instruction, Floating-Point Add Round-off Error, that computes the roundoff error of floating-point addition. Performance simulations suggest that the proposed instruction could accelerate high-precision computations by up to 2x.


SLIDE 20

Funding

This research was supported in part by:

The National Science Foundation (NSF) under NSF CAREER award number 1339745
The U.S. Dept. of Energy (DOE), Office of Science, Advanced Scientific Computing Research under award DE-FC02-10ER26006/DE-SC0004915
The Defense Advanced Research Projects Agency (DARPA) under agreement #HR0011-13-2-0001

Disclaimer

Any opinions, conclusions, or recommendations expressed in this presentation are those of the authors and do not necessarily reflect those of NSF, DOE, or DARPA.