
Tight and rigorous error bounds for basic building blocks of double-word arithmetic
Mioara Joldes, Jean-Michel Muller, Valentina Popescu
JNCF, January 2017
CAMPARY: CudA Multiple Precision ARithmetic librarY


  1. Tight and rigorous error bounds for basic building blocks of double-word arithmetic
     Mioara Joldes, Jean-Michel Muller, Valentina Popescu
     JNCF, January 2017
     CAMPARY: CudA Multiple Precision ARithmetic librarY


  4. Floating-point arithmetic
     A real number X is approximated in a machine by a rational x = M_x · 2^(e_x − p + 1), where
     – M_x is the significand, a signed integer of p digits in radix 2 such that 2^(p−1) ≤ |M_x| ≤ 2^p − 1;
     – e_x is the exponent, a signed integer (e_min ≤ e_x ≤ e_max).
     IEEE 754-2008, most common formats:
     – single (binary32) precision: 1 sign bit, 8 exponent bits, 23 significand bits;
     – double (binary64) precision: 1 sign bit, 11 exponent bits, 52 significand bits.
     Rounding:
     – 4 rounding modes: RD, RU, RZ, RN;
     – correct rounding for +, −, ×, ÷, √;
     – portability, determinism.

  5. Applications
     → Computations with increased (multiple) precision in numerical applications.
     Chaotic dynamical systems:
     – bifurcation analysis;
     – computing periodic orbits (e.g., Hénon map, Lorenz attractor);
     – celestial mechanics (e.g., stability of the solar system).
     [Figure: bifurcation diagram, axes x1 and x2]
     Experimental mathematics:
     – computational geometry (e.g., kissing numbers);
     – polynomial optimization, etc.


  7. What is a double-word number?
     Definition: a double-word number x is the unevaluated sum x_h + x_ℓ of two floating-point numbers x_h and x_ℓ such that x_h = RN(x).
     → Called double-double when the underlying format is binary64.
     Example: π in double-double
     p_h = 11.001001000011111101101010100010001000010110100011000_2, and
     p_ℓ = 1.0001101001100010011000110011000101000101110000000111_2 × 2^−53;
     p_h + p_ℓ gives a 107-bit floating-point approximation.


  12. Remark
     → Not the same as the IEEE 754-2008 standard's binary128/quadruple precision.
     Double-word (using binary64/double precision):
     – "wobbling precision": ≥ 107 bits of precision;
     – exponent range limited by binary64 (11 bits), i.e. −1022 to 1023;
     – lack of clearly defined rounding modes;
     – manipulated using error-free transforms (next slide).
     Binary128/quadruple precision:
     – 113 bits of precision;
     – larger exponent range (15 bits): −16382 to 16383;
     – defined with all rounding modes;
     – not implemented in hardware on widely available processors.

  13. Error-Free Transforms
     Theorem 1 (2Sum algorithm). Let a and b be FP numbers. Algorithm 2Sum computes two FP numbers s and e that satisfy:
     – s + e = a + b exactly;
     – s = RN(a + b).
     (RN stands for performing the operation in round-to-nearest mode.)
     Algorithm 1 (2Sum(a, b)):
       s ← RN(a + b)
       t ← RN(s − b)
       e ← RN(RN(a − t) + RN(b − RN(s − t)))
       return (s, e)
     → 6 FP operations (proved optimal unless we have information on the ordering of |a| and |b|).

  14. Error-Free Transforms
     Theorem 2 (Fast2Sum algorithm). Let a and b be FP numbers that satisfy e_a ≥ e_b (e.g., |a| ≥ |b|). Algorithm Fast2Sum computes two FP numbers s and e that satisfy:
     – s + e = a + b exactly;
     – s = RN(a + b).
     Algorithm 2 (Fast2Sum(a, b)):
       s ← RN(a + b)
       z ← RN(s − a)
       e ← RN(b − z)
       return (s, e)
     → 3 FP operations

  15. Error-Free Transforms
     Theorem 3 (2ProdFMA algorithm). Let a and b be FP numbers with e_a + e_b ≥ e_min + p − 1. Algorithm 2ProdFMA computes two FP numbers p and e that satisfy:
     – p + e = a · b exactly;
     – p = RN(a · b).
     Algorithm 3 (2ProdFMA(a, b)):
       p ← RN(a · b)
       e ← fma(a, b, −p)
       return (p, e)
     → 2 FP operations
     → hardware-implemented FMA is available in recent processors.


  19. Previous work:
     – concept introduced by Dekker [DEK71], together with algorithms for basic operations;
     – Linnainmaa [LIN81] suggested similar algorithms, assuming an underlying wider format;
     – library written by Briggs [BRI98], no longer maintained;
     – QD library written by Bailey et al. [Li.et.al02].
     Problems:
     1. most algorithms come without a correctness proof or error bound;
     2. some error bounds were published without a proof;
     3. there are differences between published and implemented algorithms.
     → Strong need to “clean up” the existing literature.
     Notation:
     – p is the precision of the underlying FP format;
     – ulp(x) = 2^(⌊log2|x|⌋ − p + 1), for x ≠ 0;
     – u = 2^−p = (1/2)·ulp(1) denotes the roundoff error unit.


  21. Addition: DWPlusFP(x_h, x_ℓ, y)
     Algorithm 4:
       1: (s_h, s_ℓ) ← 2Sum(x_h, y)
       2: v ← RN(x_ℓ + s_ℓ)
       3: (z_h, z_ℓ) ← Fast2Sum(s_h, v)
       4: return (z_h, z_ℓ)
     – implemented in the QD library;
     – no previous error bound published;
     – relative error bounded by 2·2^−2p / (1 − 2·2^−p) = 2·2^−2p + 4·2^−3p + 8·2^−4p + · · · , which is less than 2·2^−2p + 5·2^−3p as soon as p ≥ 4.
     Asymptotically optimal bound: let x_h = 1, x_ℓ = (2^p − 1)·2^−2p, and y = −(1/2)(1 − 2^−p). Then:
     – z_h + z_ℓ = 1/2 + 3·2^(−p−1);
     – x + y = 1/2 + 3·2^(−p−1) − 2^−2p;
     – relative error 2·2^−2p / (1 + 3·2^−p − 2·2^−2p) ≈ 2·2^−2p − 6·2^−3p.

  22. Sketch of the proof
     Lemma 4 (Sterbenz Lemma). Let a and b be two positive FP numbers. If a/2 ≤ b ≤ 2a, then a − b is a floating-point number, so that RN(a − b) = a − b.
     Lemma 5. Let a and b be FP numbers, and let s = RN(a + b). If s ≠ 0, then |s| ≥ max( (1/2)·ulp(a), (1/2)·ulp(b) ).
