ECS 231 Computer Arithmetic

  1. ECS 231 Computer Arithmetic

  2. Outline
     1. Floating-point numbers and representations
     2. Floating-point arithmetic
     3. Floating-point error analysis
     4. Further reading

  3. Outline (Section 1: Floating-point numbers and representations)

  4. Floating-point numbers and representations
     1. Floating-point (FP) representation of numbers (scientific notation), e.g.
        -3.1416 × 10^1, made up of a sign (-), a significand (3.1416), a base (10),
        and an exponent (1).
     2. FP representation of a nonzero binary number:
            x = ± b_0.b_1 b_2 ··· b_{p-1} × 2^E.    (1)
        - It is normalized, i.e., b_0 = 1 (the hidden bit).
        - Precision (= p) is the number of bits in the significand (mantissa),
          including the hidden bit.
        - Machine epsilon ε = 2^{-(p-1)} is the gap between the number 1 and the
          smallest FP number that is greater than 1.
     3. Special numbers: 0, -0, ∞, -∞, NaN (= "Not a Number").
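As a quick sanity check (not part of the slides), the definition of machine epsilon for IEEE double precision (p = 53) can be verified directly in Python, whose `float` is an IEEE double:

```python
import sys
import math

# Machine epsilon for IEEE double: eps = 2**-(p-1) with p = 53.
eps = 2.0 ** -52
print(eps == sys.float_info.epsilon)  # True: the gap between 1.0 and the next double

# The next representable double after 1.0 is exactly 1 + eps.
print(math.nextafter(1.0, 2.0) == 1.0 + eps)  # True
```

(`math.nextafter` requires Python 3.9 or later.)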

  5. IEEE standard
     - All computers designed since 1985 use the IEEE Standard for Binary
       Floating-Point Arithmetic (ANSI/IEEE Std 754-1985): each number is
       represented as a binary number and binary arithmetic is used.
     - Essentials of the IEEE standard:
       - consistent representation of FP numbers,
       - correctly rounded FP operations (using various rounding modes),
       - consistent treatment of exceptional situations such as division by zero.

  6. IEEE single precision format
     - Single format is 32 bits (= 4 bytes) long:
           | s (1 bit) | E (8 bits) | f (23 bits) |
           sign        exponent      fraction (binary point precedes f)
     - It represents the number (-1)^s · (1.f) × 2^{E-127}.
     - The leading 1 in the significand need not be stored explicitly since it is
       always 1 (hidden bit).
     - E_min = (00000001)_2 = (1)_10, E_max = (11111110)_2 = (254)_10.
     - The bias in "E - 127" avoids the need to store a sign bit for the exponent.
     - The range of positive normalized numbers:
           N_min = 1.00···0 × 2^{E_min - 127} = 2^{-126} ≈ 1.2 × 10^{-38},
           N_max = 1.11···1 × 2^{E_max - 127} ≈ 2^{128} ≈ 3.4 × 10^{38}.
     - Special representations for 0, ±∞ and NaN.
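The (s, E, f) layout can be inspected in Python by bit-twiddling the raw single-precision encoding; the helper name `decode_single` below is ours, not from the slides:

```python
import struct

def decode_single(x: float):
    """Split the IEEE single-precision encoding of x into (s, E, f)."""
    bits, = struct.unpack('<I', struct.pack('<f', x))  # 32 raw bits
    s = bits >> 31                # 1-bit sign
    E = (bits >> 23) & 0xFF       # 8-bit biased exponent
    f = bits & 0x7FFFFF           # 23-bit fraction
    return s, E, f

# 1.0 is stored as (-1)^0 * (1.0) * 2^(127-127): sign 0, E = 127, fraction 0.
print(decode_single(1.0))   # (0, 127, 0)
# -2.0 is (-1)^1 * (1.0) * 2^(128-127): sign 1, E = 128, fraction 0.
print(decode_single(-2.0))  # (1, 128, 0)
```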

  7. IEEE double precision format
     - Double format is 64 bits (= 8 bytes) long:
           | s (1 bit) | E (11 bits) | f (52 bits) |
           sign        exponent       fraction (binary point precedes f)
     - It represents the number (-1)^s · (1.f) × 2^{E-1023}.
     - The range of positive normalized numbers is from
           N_min = 1.00···0 × 2^{-1022} ≈ 2.2 × 10^{-308}  to
           N_max = 1.11···1 × 2^{1023} ≈ 2^{1024} ≈ 1.8 × 10^{308}.
     - Special representations for 0, ±∞ and NaN.

  8. Summary I
     - Precision and machine epsilon of the IEEE single, double and extended
       formats (machine epsilon ε = 2^{-(p-1)}):

           Format     Precision p   Machine epsilon ε
           single     24            ε = 2^{-23} ≈ 1.2 × 10^{-7}
           double     53            ε = 2^{-52} ≈ 2.2 × 10^{-16}
           extended   64            ε = 2^{-63} ≈ 1.1 × 10^{-19}

     - Extra: Higham's lecture covers additional formats, such as half (16 bits)
       and quadruple (128 bits).
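The single-precision row of the table can be demonstrated in Python by round-tripping a double through the 32-bit format (a sketch using the standard `struct` module; the helper name `to_single` is ours):

```python
import struct

def to_single(x: float) -> float:
    """Round a double to IEEE single precision and back."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

# In single precision (p = 24, eps = 2**-23), adding eps/2 to 1.0 is lost
# to rounding, while adding a full eps is representable and survives.
print(to_single(1.0 + 2.0 ** -24))  # 1.0
print(to_single(1.0 + 2.0 ** -23) == 1.0 + 2.0 ** -23)  # True
```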

  9. Rounding modes
     - Let a positive real number x be in the normalized range, i.e.,
       N_min ≤ x ≤ N_max, and write it in the normalized form
           x = (1.b_1 b_2 ··· b_{p-1} b_p b_{p+1} ...) × 2^E.
     - Then the closest FP number less than or equal to x is
           x_- = 1.b_1 b_2 ··· b_{p-1} × 2^E,
       i.e., x_- is obtained by truncating.
     - The next FP number bigger than x_- (also the next one bigger than x) is
           x_+ = ((1.b_1 b_2 ··· b_{p-1}) + (0.00···01)) × 2^E.
     - If x is negative, the situation is reversed.

  10. Correctly rounding modes:
      - round down: round(x) = x_-
      - round up: round(x) = x_+
      - round towards zero: round(x) = x_- if x ≥ 0, round(x) = x_+ if x ≤ 0
      - round to nearest: round(x) = x_- or x_+, whichever is nearer to x,
        except that if x > N_max, round(x) = ∞, and if x < -N_max,
        round(x) = -∞. In the case of a tie, i.e., x_- and x_+ are the same
        distance from x, the one with its least significant bit equal to zero
        is chosen.
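The tie-breaking rule of round to nearest ("round to even") can be observed directly in Python (a sketch, not from the slides): above 2^53, consecutive doubles differ by 2, so odd integers fall exactly halfway between two representable values.

```python
# 2**53 + 1 lies halfway between 2**53 (even last significand bit) and
# 2**53 + 2 (odd last bit); the tie goes to the even neighbor, 2**53.
print(float(2 ** 53 + 1) == 2 ** 53)      # True: rounds down

# 2**53 + 3 lies halfway between 2**53 + 2 (odd) and 2**53 + 4 (even);
# this tie rounds *up* to 2**53 + 4.
print(float(2 ** 53 + 3) == 2 ** 53 + 4)  # True
```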

  11. Rounding error
      - When round to nearest (the IEEE default rounding mode) is in effect,
            relerr(x) = |round(x) - x| / |x| ≤ (1/2)ε.
      - Therefore, we have
            relerr ≤ (1/2) · 2^{-23} = 2^{-24} ≈ 5.96 × 10^{-8}   (single),
            relerr ≤ (1/2) · 2^{-52} = 2^{-53} ≈ 1.11 × 10^{-16}  (double).

  12. Outline (Section 2: Floating-point arithmetic)

  13. Floating-point arithmetic
      - IEEE rules for correctly rounded FP operations: if x and y are correctly
        rounded FP numbers, then
            fl(x + y) = round(x + y) = (x + y)(1 + δ),
            fl(x - y) = round(x - y) = (x - y)(1 + δ),
            fl(x × y) = round(x × y) = (x × y)(1 + δ),
            fl(x / y) = round(x / y) = (x / y)(1 + δ),
        where |δ| ≤ (1/2)ε.
      - The IEEE standard also requires that correctly rounded remainder and
        square root operations be provided.
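The bound |δ| ≤ (1/2)ε can be tested empirically in Python by comparing each double-precision sum against exact rational arithmetic with the standard `fractions` module (a sketch, not part of the slides):

```python
from fractions import Fraction
import random

# For doubles, eps = 2**-52, so correct rounding guarantees |delta| <= 2**-53.
half_eps = Fraction(1, 2 ** 53)

random.seed(0)
for _ in range(1000):
    x, y = random.uniform(1.0, 2.0), random.uniform(1.0, 2.0)
    exact = Fraction(x) + Fraction(y)          # exact sum of the two doubles
    delta = (Fraction(x + y) - exact) / exact  # exact relative rounding error
    assert abs(delta) <= half_eps

print("bound held for all 1000 trials")
```

The operands are drawn from [1, 2] so that neither overflow nor underflow can interfere with the bound.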

  14. Floating-point arithmetic, cont'd
      IEEE standard response to exceptions:

          Event               Example                    Set result to
          Invalid operation   0/0, 0 × ∞                 NaN
          Division by zero    finite nonzero / 0         ±∞
          Overflow            |x| > N_max                ±∞ or ±N_max
          Underflow           x ≠ 0, |x| < N_min         ±0, ±N_min or subnormal
          Inexact             whenever fl(x ∘ y) ≠ x ∘ y correctly rounded value
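Several of these exceptional results can be observed in Python, where by default the IEEE result is simply delivered and execution continues (a sketch, not from the slides):

```python
import math
import sys

print(1e308 * 10)           # inf: overflow
print(math.inf - math.inf)  # nan: invalid operation

tiny = 1e-308 / 1e10        # far below N_min: gradual underflow to a subnormal
print(0.0 < tiny < sys.float_info.min)  # True: subnormal, not flushed to zero

# Caveat: Python raises ZeroDivisionError for 1.0 / 0.0 instead of returning
# inf, so "division by zero" is the one IEEE event it does not pass through.
```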

  15. Floating-point arithmetic error
      - Let x̂ and ŷ be FP numbers such that x̂ = x(1 + τ_1) and ŷ = y(1 + τ_2),
        for |τ_i| ≤ τ ≪ 1, where the τ_i could be the relative errors introduced
        in collecting/getting the data from the original source or in previous
        operations.
      - Question: how do the four basic arithmetic operations behave?

  16. Floating-point arithmetic error: +, -
      Addition and subtraction:
          fl(x̂ + ŷ) = (x̂ + ŷ)(1 + δ)
                    = x(1 + τ_1)(1 + δ) + y(1 + τ_2)(1 + δ)
                    = x + y + x(τ_1 + δ + O(τε)) + y(τ_2 + δ + O(τε))
                    = (x + y) [ 1 + x/(x + y) · (τ_1 + δ + O(τε))
                                  + y/(x + y) · (τ_2 + δ + O(τε)) ]
                    ≡ (x + y)(1 + δ̂),
      where
          |δ| ≤ (1/2)ε,    |δ̂| ≤ (|x| + |y|)/|x + y| · (τ + (1/2)ε + O(τε)).

  17. Floating-point arithmetic error: +, -
      Three possible cases:
      1. If x and y have the same sign, i.e., xy > 0, then |x + y| = |x| + |y|;
         this implies
             |δ̂| ≤ τ + (1/2)ε + O(τε) ≪ 1.
         Thus fl(x̂ + ŷ) approximates x + y well.
      2. If x ≈ -y, then |x + y| ≈ 0 and (|x| + |y|)/|x + y| ≫ 1; this implies
         that |δ̂| could be nearly as large as, or much bigger than, 1. This is
         the so-called catastrophic cancellation: it causes the relative errors
         or uncertainties already present in x̂ and ŷ to be magnified.
      3. In general, if (|x| + |y|)/|x + y| is not too big, fl(x̂ + ŷ) provides
         a good approximation to x + y.

  18. Catastrophic cancellation: example 1
      - Computing √(x+1) - √x straightforwardly causes a substantial loss of
        significant digits for large x:

            x         fl(√(x+1))                fl(√x)                    fl(fl(√(x+1)) - fl(√x))
            1.00e+10  1.00000000004999994e+05   1.00000000000000000e+05   4.99999441672116518e-06
            1.00e+11  3.16227766018419061e+05   3.16227766016837908e+05   1.58115290105342865e-06
            1.00e+12  1.00000000000050000e+06   1.00000000000000000e+06   5.00003807246685028e-07
            1.00e+13  3.16227766016853740e+06   3.16227766016837955e+06   1.57859176397323608e-07
            1.00e+14  1.00000000000000503e+07   1.00000000000000000e+07   5.02914190292358398e-08
            1.00e+15  3.16227766016838104e+07   3.16227766016837917e+07   1.86264514923095703e-08
            1.00e+16  1.00000000000000000e+08   1.00000000000000000e+08   0.00000000000000000e+00

      - Catastrophic cancellation can sometimes be avoided if a formula is
        properly reformulated.
      - In the present case, one can compute √(x+1) - √x almost to full
        precision by using the equality
            √(x+1) - √x = 1 / (√(x+1) + √x).
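The last row of the table, and the fix, can be reproduced in a few lines of Python (a sketch, not from the slides):

```python
import math

x = 1e16
# Naive form: fl(x + 1) == x at this magnitude, so every digit cancels.
naive = math.sqrt(x + 1) - math.sqrt(x)
# Reformulated: 1 / (sqrt(x+1) + sqrt(x)) keeps nearly full precision.
stable = 1.0 / (math.sqrt(x + 1) + math.sqrt(x))

print(naive)   # 0.0
print(stable)  # 5e-09
```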

  19. Catastrophic cancellation: example 2
      - Consider the function
            f(x) = (1 - cos x) / x².
        Note that 0 ≤ f(x) < 1/2 for all x ≠ 0.
      - Let x = 1.2 × 10^{-8}; then the computed fl(f(x)) = 0.770988... is
        completely wrong!
      - Alternatively, using the identity 1 - cos x = 2 sin²(x/2), the function
        can be rewritten as
            f(x) = (1/2) · (sin(x/2) / (x/2))².
        Consequently, for x = 1.2 × 10^{-8}, the computed
        fl(f(x)) = 0.499999... < 1/2 is fine!
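Both evaluations are easy to reproduce in Python; the half-angle form below includes the factor 1/2 coming from 1 - cos x = 2 sin²(x/2) (a sketch, not from the slides):

```python
import math

def f_naive(x: float) -> float:
    # Cancellation: 1 - cos x loses almost all significant digits near 0.
    return (1.0 - math.cos(x)) / x ** 2

def f_stable(x: float) -> float:
    # Uses 1 - cos x = 2 sin^2(x/2), so f(x) = (1/2) * (sin(x/2) / (x/2))**2.
    s = math.sin(x / 2) / (x / 2)
    return 0.5 * s * s

x = 1.2e-8
print(f_naive(x))   # about 0.77, far from the true value
print(f_stable(x))  # 0.5 to machine precision (true value is just below 1/2)
```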

  20. Floating-point arithmetic error: ×, /
      Multiplication and division:
          fl(x̂ × ŷ) = (x̂ × ŷ)(1 + δ) = xy(1 + τ_1)(1 + τ_2)(1 + δ)
                    ≡ xy(1 + δ̂_×),
          fl(x̂ / ŷ) = (x̂ / ŷ)(1 + δ) = (x/y)(1 + τ_1)(1 + τ_2)^{-1}(1 + δ)
                    ≡ (x/y)(1 + δ̂_÷),
      where
          δ̂_× = τ_1 + τ_2 + δ + O(τε),    δ̂_÷ = τ_1 - τ_2 + δ + O(τε).
      Thus
          |δ̂_×| ≤ 2τ + (1/2)ε + O(τε),    |δ̂_÷| ≤ 2τ + (1/2)ε + O(τε),
      and we can conclude that multiplication and division are very
      well-behaved!

  21. Outline (Section 3: Floating-point error analysis)
