

SLIDE 1

ECS 231 Computer Arithmetic

SLIDE 2

Outline

1. Floating-point numbers and representations
2. Floating-point arithmetic
3. Floating-point error analysis
4. Further reading

SLIDE 3

Outline

1. Floating-point numbers and representations
2. Floating-point arithmetic
3. Floating-point error analysis
4. Further reading

SLIDE 4

Floating-point numbers and representations

  • 1. Floating-point (FP) representation of numbers (scientific notation):

    −3.1416 × 10^1    (sign: −, significand: 3.1416, base: 10, exponent: 1)

  • 2. FP representation of a nonzero binary number:

    x = ± b_0.b_1 b_2 ··· b_{p−1} × 2^E.    (1)

    ◮ It is normalized, i.e., b_0 = 1 (the hidden bit).
    ◮ Precision (= p) is the number of bits in the significand (mantissa), including the hidden bit.
    ◮ Machine epsilon ε = 2^{−(p−1)} is the gap between the number 1 and the smallest FP number that is greater than 1.

  • 3. Special numbers: 0, −0, ∞, −∞, NaN (= "Not a Number").

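The machine epsilon defined above can be measured directly. A minimal Python sketch (Python is not part of the slides; it assumes IEEE double precision, p = 53, with round to nearest):

```python
def machine_epsilon():
    # Halve eps until 1 + eps/2 rounds back to 1: the last eps kept is
    # the gap between 1 and the next larger floating-point number.
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps

print(machine_epsilon())  # 2^-52 ≈ 2.22e-16 for IEEE double (p = 53)
```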
SLIDE 5

IEEE standard

◮ All computers designed since 1985 use the IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985): they represent each number as a binary number and use binary arithmetic.

◮ Essentials of the IEEE standard:
  ◮ consistent representation of FP numbers
  ◮ correctly rounded FP operations (using various rounding modes)
  ◮ consistent treatment of exceptional situations such as division by zero.

SLIDE 6

IEEE single precision format

◮ The single format is 32 bits (= 4 bytes) long:

    s | E (8 bits) | f (23 bits)
    sign, biased exponent, fraction (binary point before f)

◮ It represents the number

    (−1)^s · (1.f) × 2^{E−127}

◮ The leading 1 in the fraction need not be stored explicitly, since it is always 1 (hidden bit).
◮ Emin = (00000001)_2 = (1)_10, Emax = (11111110)_2 = (254)_10.
◮ The bias "E − 127" in the exponent avoids the need to store a sign bit for the exponent.
◮ The range of positive normalized numbers:

    Nmin = 1.00···0 × 2^{Emin−127} = 2^{−126} ≈ 1.2 × 10^{−38}
    Nmax = 1.11···1 × 2^{Emax−127} ≈ 2^{128} ≈ 3.4 × 10^{38}

◮ Special representations for 0, ±∞ and NaN.

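The 1/8/23 field layout above can be inspected directly. A small Python sketch (not from the slides) packs a value into the IEEE single format and splits out the three fields:

```python
import struct

def single_fields(x):
    # Round x to IEEE single, view the 4 bytes as a 32-bit integer, and
    # split off sign (1 bit), biased exponent E (8 bits), fraction f (23 bits).
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31
    E = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    return s, E, f

print(single_fields(1.0))   # (0, 127, 0): (-1)^0 * 1.0 * 2^(127-127)
print(single_fields(-2.0))  # (1, 128, 0): (-1)^1 * 1.0 * 2^(128-127)
```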
SLIDE 7

IEEE double precision format

◮ The double format is 64 bits (= 8 bytes) long:

    s | E (11 bits) | f (52 bits)
    sign, biased exponent, fraction (binary point before f)

◮ It represents the number

    (−1)^s · (1.f) × 2^{E−1023}

◮ The range of positive normalized numbers is

    Nmin = 1.00···0 × 2^{−1022} ≈ 2.2 × 10^{−308}
    Nmax = 1.11···1 × 2^{1023} ≈ 2^{1024} ≈ 1.8 × 10^{308}

◮ Special representations for 0, ±∞ and NaN.

SLIDE 8

Summary I

◮ Precision and machine epsilon of the IEEE single, double and extended formats:

    Format   | Precision p | Machine epsilon ε = 2^{−(p−1)}
    single   | 24          | ε = 2^{−23} ≈ 1.2 × 10^{−7}
    double   | 53          | ε = 2^{−52} ≈ 2.2 × 10^{−16}
    extended | 64          | ε = 2^{−63} ≈ 1.1 × 10^{−19}

◮ Extra: Higham's lecture covers additional formats, such as half (16 bits) and quadruple (128 bits).

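The table can be checked programmatically. A sketch assuming NumPy is available (NumPy is not mentioned on the slides; the extended format is not portable in NumPy, so half, single and double are shown, with half's p = 11 taken from Higham's "extra" formats):

```python
import numpy as np

# Machine epsilon eps = 2^-(p-1) for the IEEE formats NumPy exposes.
for dtype, p in [(np.float16, 11), (np.float32, 24), (np.float64, 53)]:
    print(dtype.__name__, p, np.finfo(dtype).eps)
```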
SLIDE 9

Rounding modes

◮ Let a positive real number x be in the normalized range, i.e., Nmin ≤ x ≤ Nmax, and write it in the normalized form

    x = (1.b_1 b_2 ··· b_{p−1} b_p b_{p+1} ...) × 2^E.

◮ Then the closest FP number less than or equal to x is

    x− = 1.b_1 b_2 ··· b_{p−1} × 2^E,

  i.e., x− is obtained by truncating.
◮ The next FP number bigger than x− (hence also the next one bigger than x) is

    x+ = (1.b_1 b_2 ··· b_{p−1} + 0.00···01) × 2^E.

◮ If x is negative, the situation is reversed.

SLIDE 10

Correctly rounded modes:

◮ round down: round(x) = x−
◮ round up: round(x) = x+
◮ round towards zero: round(x) = x− if x ≥ 0, round(x) = x+ if x ≤ 0
◮ round to nearest: round(x) = x− or x+, whichever is nearer to x.¹

¹Except that if x > Nmax, round(x) = ∞, and if x < −Nmax, round(x) = −∞. In the case of a tie, i.e., x− and x+ the same distance from x, the one with its least significant bit equal to zero is chosen.

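Binary FP hardware fixes one rounding mode at a time, but Python's decimal module (base 10, not part of the slides) exposes the same four modes and makes the definitions easy to try on a halfway case:

```python
from decimal import Decimal, ROUND_FLOOR, ROUND_CEILING, ROUND_DOWN, ROUND_HALF_EVEN

x = Decimal("2.5")  # exactly halfway between 2 and 3
for name, mode in [("round down", ROUND_FLOOR),
                   ("round up", ROUND_CEILING),
                   ("round towards zero", ROUND_DOWN),
                   ("round to nearest (ties to even)", ROUND_HALF_EVEN)]:
    # quantize to integer precision under the chosen rounding mode
    print(name, x.quantize(Decimal("1"), rounding=mode))
```

Note that round to nearest sends 2.5 to 2, not 3: the tie goes to the neighbor whose least significant digit is even, exactly as in the footnote above.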
SLIDE 11

Rounding error

◮ When round to nearest (the IEEE default rounding mode) is in effect,

    relerr(x) = |round(x) − x| / |x| ≤ (1/2)ε.

◮ Therefore, we have

    relerr(x) ≤ (1/2) · 2^{−23} = 2^{−24} ≈ 5.96 × 10^{−8}    (single)
    relerr(x) ≤ (1/2) · 2^{−52} = 2^{−53} ≈ 1.11 × 10^{−16}   (double)

SLIDE 12

Outline

1. Floating-point numbers and representations
2. Floating-point arithmetic
3. Floating-point error analysis
4. Further reading

SLIDE 13

Floating-point arithmetic

◮ IEEE rules for correctly rounded FP operations: if x and y are correctly rounded FP numbers, then

    fl(x + y) = round(x + y) = (x + y)(1 + δ)
    fl(x − y) = round(x − y) = (x − y)(1 + δ)
    fl(x × y) = round(x × y) = (x × y)(1 + δ)
    fl(x / y) = round(x / y) = (x / y)(1 + δ)

  where |δ| ≤ (1/2)ε.

◮ The IEEE standard also requires that correctly rounded remainder and square root operations be provided.

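The model above can be verified empirically. A Python sketch (not from the slides) that checks |δ| ≤ (1/2)ε for double precision, using exact rational arithmetic as the reference:

```python
from fractions import Fraction

# Verify fl(x op y) = (x op y)(1 + delta), |delta| <= eps/2, for IEEE
# double (eps = 2^-52). Fraction(float) is exact, so "exact" below is
# the true value of the operation on the stored FP inputs.
eps = Fraction(1, 2 ** 52)
x, y = 0.1, 0.3   # FP numbers: the doubles nearest 0.1 and 0.3
cases = [
    (x + y, Fraction(x) + Fraction(y)),
    (x - y, Fraction(x) - Fraction(y)),
    (x * y, Fraction(x) * Fraction(y)),
    (x / y, Fraction(x) / Fraction(y)),
]
for computed, exact in cases:
    delta = (Fraction(computed) - exact) / exact
    assert abs(delta) <= eps / 2
print("all four operations correctly rounded")
```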
SLIDE 14

Floating-point arithmetic, cont’d

IEEE standard response to exceptions:

    Event             | Example                    | Set result to
    Invalid operation | 0/0, 0 × ∞                 | NaN
    Division by zero  | finite nonzero/0           | ±∞
    Overflow          | |x| > Nmax                 | ±∞ or ±Nmax
    Underflow         | x ≠ 0, |x| < Nmin          | ±0, ±Nmin or subnormal
    Inexact           | whenever fl(x ◦ y) ≠ x ◦ y | correctly rounded value

SLIDE 15

Floating-point arithmetic error

◮ Let x̂ and ŷ be FP numbers with

    x̂ = x(1 + τ_1)  and  ŷ = y(1 + τ_2),  |τ_i| ≤ τ ≪ 1,

  where the τ_i could be the relative errors incurred in collecting/getting the data from the original source or from previous operations.

◮ Question: how do the four basic arithmetic operations behave?

SLIDE 16

Floating-point arithmetic error: +, −

Addition and subtraction:

    fl(x̂ + ŷ) = (x̂ + ŷ)(1 + δ)
              = x(1 + τ_1)(1 + δ) + y(1 + τ_2)(1 + δ)
              = x + y + x(τ_1 + δ + O(τε)) + y(τ_2 + δ + O(τε))
              = (x + y) [ 1 + x/(x + y) · (τ_1 + δ + O(τε)) + y/(x + y) · (τ_2 + δ + O(τε)) ]
              ≡ (x + y)(1 + δ̂),

where

    |δ| ≤ (1/2)ε,   |δ̂| ≤ (|x| + |y|)/|x + y| · ( τ + (1/2)ε + O(τε) ).

SLIDE 17

Floating-point arithmetic error: +, −

Three possible cases:

  • 1. If x and y have the same sign, i.e., xy > 0, then |x + y| = |x| + |y|; this implies |δ̂| ≤ τ + (1/2)ε + O(τε) ≪ 1. Thus fl(x̂ + ŷ) approximates x + y well.

  • 2. If x ≈ −y, then |x + y| ≈ 0 and (|x| + |y|)/|x + y| ≫ 1; this implies that |δ̂| could be nearly as big as, or much bigger than, 1. This is the so-called catastrophic cancellation: it causes the relative errors or uncertainties already present in x and y to be magnified.

  • 3. In general, if (|x| + |y|)/|x + y| is not too big, fl(x̂ + ŷ) provides a good approximation to x + y.

SLIDE 18

Catastrophic cancellation: example 1

◮ Computing √(x+1) − √x directly causes substantial loss of significant digits for large x:

    x        | fl(√(x+1))              | fl(√x)                  | fl(fl(√(x+1)) − fl(√x))
    1.00e+10 | 1.00000000004999994e+05 | 1.00000000000000000e+05 | 4.99999441672116518e-06
    1.00e+11 | 3.16227766018419061e+05 | 3.16227766016837908e+05 | 1.58115290105342865e-06
    1.00e+12 | 1.00000000000050000e+06 | 1.00000000000000000e+06 | 5.00003807246685028e-07
    1.00e+13 | 3.16227766016853740e+06 | 3.16227766016837955e+06 | 1.57859176397323608e-07
    1.00e+14 | 1.00000000000000503e+07 | 1.00000000000000000e+07 | 5.02914190292358398e-08
    1.00e+15 | 3.16227766016838104e+07 | 3.16227766016837917e+07 | 1.86264514923095703e-08
    1.00e+16 | 1.00000000000000000e+08 | 1.00000000000000000e+08 | 0.00000000000000000e+00

◮ Catastrophic cancellation can sometimes be avoided if a formula is properly reformulated.
◮ In the present case, one can compute √(x+1) − √x almost to full precision by using the equality

    √(x+1) − √x = 1 / (√(x+1) + √x).

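The last table row and the reformulation can be reproduced in a few lines. A Python sketch (not from the slides; assumes IEEE double precision):

```python
import math

def naive(x):
    return math.sqrt(x + 1) - math.sqrt(x)          # catastrophic cancellation

def stable(x):
    return 1.0 / (math.sqrt(x + 1) + math.sqrt(x))  # reformulated equality

x = 1.0e16
print(naive(x))   # 0.0: x + 1 already rounds to x, so all digits cancel
print(stable(x))  # 5e-09, essentially full precision
```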
SLIDE 19

Catastrophic cancellation: example 2

◮ Consider the function

    f(x) = (1 − cos x) / x².

  Note that 0 ≤ f(x) < 1/2 for all x ≠ 0.

◮ Let x = 1.2 × 10^{−8}; then the computed fl(f(x)) = 0.770988... is completely wrong!

◮ Alternatively, the function can be rewritten as

    f(x) = (1/2) · ( sin(x/2) / (x/2) )².

  Consequently, for x = 1.2 × 10^{−8}, the computed fl(f(x)) = 0.499999... < 1/2 is fine!

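A Python sketch of the two formulas (not from the slides; the exact value printed for the naive form depends on the platform's cos, but on a typical machine it reproduces the 0.77 above):

```python
import math

def f_naive(x):
    return (1.0 - math.cos(x)) / x ** 2   # cancellation in 1 - cos x

def f_stable(x):
    s = math.sin(x / 2) / (x / 2)
    return 0.5 * s * s                    # f(x) = (1/2)(sin(x/2)/(x/2))^2

x = 1.2e-8
print(f_naive(x))   # ~0.77 on a typical machine: wrong, since f(x) < 1/2 always
print(f_stable(x))  # ~0.5: correct to full precision
```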
SLIDE 20

Floating-point arithmetic error: ×, /

Multiplication and Division: fl( x × y) = ( x × y)(1 + δ) = xy(1 + τ1)(1 + τ2)(1 + δ) ≡ xy(1 + δ×), fl( x/ y) = ( x/ y)(1 + δ) = (x/y)(1 + τ1)(1 + τ2)−1(1 + δ) ≡ xy(1 + δ÷), where

  • δ× = τ1 + τ2 + δ + O(τǫ),
  • δ÷ = τ1 − τ2 + δ + O(τǫ).

Thus | δ×| ≤ 2τ + 1 2ǫ + O(τǫ), | δ÷| ≤ 2τ + 1 2ǫ + O(τǫ) we can conclude that multiplication and division are very well-behaved!

SLIDE 21

Outline

1. Floating-point numbers and representations
2. Floating-point arithmetic
3. Floating-point error analysis
4. Further reading

SLIDE 22

Floating-point error analysis

◮ We illustrate the basic idea of error analysis through a simple example. Consider the inner product

    x^T y = x_1 y_1 + x_2 y_2 + x_3 y_3,

  assuming the x_i and y_j are already FP numbers.

◮ fl(x^T y) is computed in the following order:

    fl(x^T y) = fl( fl( fl(x_1 y_1) + fl(x_2 y_2) ) + fl(x_3 y_3) ).

◮ By the FP arithmetic model, we have

    fl(x^T y) = fl( fl( x_1 y_1(1 + ε_1) + x_2 y_2(1 + ε_2) ) + x_3 y_3(1 + ε_3) )
              = fl( ( x_1 y_1(1 + ε_1) + x_2 y_2(1 + ε_2) )(1 + δ_1) + x_3 y_3(1 + ε_3) )
              = ( ( x_1 y_1(1 + ε_1) + x_2 y_2(1 + ε_2) )(1 + δ_1) + x_3 y_3(1 + ε_3) )(1 + δ_2)
              = x_1 y_1(1 + ε_1)(1 + δ_1)(1 + δ_2) + x_2 y_2(1 + ε_2)(1 + δ_1)(1 + δ_2) + x_3 y_3(1 + ε_3)(1 + δ_2),

  where |ε_i| ≤ (1/2)ε and |δ_j| ≤ (1/2)ε.

SLIDE 23

Floating-point error analysis, cont’d

There are two ways to interpret the errors in the computed fl(x^T y):

◮ Forward error analysis
◮ Backward error analysis

SLIDE 24

Forward error analysis

◮ We have

    fl(x^T y) = x^T y + E,

  where

    E = x_1 y_1(ε_1 + δ_1 + δ_2) + x_2 y_2(ε_2 + δ_1 + δ_2) + x_3 y_3(ε_3 + δ_2) + O(ε²).

◮ It implies that

    |E| ≤ (1/2)ε ( 3|x_1 y_1| + 3|x_2 y_2| + 2|x_3 y_3| ) + O(ε²) ≤ (3/2)ε · |x|^T |y| + O(ε²).

◮ This bound on E gives the worst-case difference between the "exact" x^T y and its computed value.

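The forward bound can be checked numerically. A Python sketch (not from the slides) that accumulates the inner product left to right as above and compares against the exact rational value of x^T y for the stored FP inputs:

```python
from fractions import Fraction

eps = Fraction(1, 2 ** 52)   # double-precision machine epsilon

x = [0.1, -0.2, 0.3]
y = [0.7, 0.8, -0.9]

# Computed value, accumulated left to right as on the slides.
fl_xty = (x[0] * y[0] + x[1] * y[1]) + x[2] * y[2]

# "Exact" inner product of the stored FP inputs, via rational arithmetic.
exact = sum(Fraction(a) * Fraction(b) for a, b in zip(x, y))

E = abs(Fraction(fl_xty) - exact)
bound = Fraction(3, 2) * eps * sum(abs(Fraction(a) * Fraction(b))
                                   for a, b in zip(x, y))
assert E <= bound            # |E| <= (3/2) eps |x|^T |y|, up to O(eps^2)
print(float(E), "<=", float(bound))
```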
SLIDE 25

Backward error analysis

◮ We can also write

    fl(x^T y) = x̂^T ŷ = (x + ∆x)^T (y + ∆y),

  where

    x̂_1 = x_1(1 + ε_1),   ŷ_1 = y_1(1 + δ_1)(1 + δ_2) ≡ y_1(1 + δ̂_1),
    x̂_2 = x_2(1 + ε_2),   ŷ_2 = y_2(1 + δ_1)(1 + δ_2) ≡ y_2(1 + δ̂_2),
    x̂_3 = x_3(1 + ε_3),   ŷ_3 = y_3(1 + δ_2) ≡ y_3(1 + δ̂_3),

  and |δ̂_1| = |δ̂_2| ≤ ε + O(ε²) and |δ̂_3| ≤ (1/2)ε.

◮ This says the computed value fl(x^T y) is the "exact" inner product of a slightly perturbed x and y.

SLIDE 26

Outline

1. Floating-point numbers and representations
2. Floating-point arithmetic
3. Floating-point error analysis
4. Further reading

SLIDE 27

Further reading

  • 1. D. Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, 23(1):5–48, 1991.

  • 2. Recent lecture by N. Higham on the latest developments in low-precision and multiprecision arithmetic: http://bit.ly/kacov18

  • 3. Discussions of numerical disasters:
    ◮ T. Huckle, Collection of software bugs: http://www5.in.tum.de/~huckle/bugse.html
    ◮ "Bits and Bugs: A Scientific and Historical Review of Software Failures in Computational Science" by T. Huckle and T. Neckel, SIAM, March 2019.
