  1. Mathematical Preliminaries and Error Analysis. Instructor: Wei-Cheng Wang, Department of Mathematics, National Tsing Hua University, Fall 2011. (These slides are based on Prof. Tsung-Ming Huang (NTNU)'s original slides.)

  2. Error, Algorithms and Convergence — Outline. 1. Round-off errors and computer arithmetic: IEEE standard floating-point format; absolute and relative errors; machine epsilon; loss of significance. 2. Algorithms and convergence: algorithm stability; rate of convergence.

  3. IEEE standard floating-point format — Terminology. binary: 二進位; decimal: 十進位; hexadecimal: 十六進位; exponent: 指數; mantissa: 尾數; floating-point numbers: 浮點數; chopping: 無條件捨去; rounding: 四捨五入; single precision: 單精度; double precision: 雙精度; round-off error: 捨入誤差; significant digits: 有效位數; loss of significance: 有效位數喪失.

  4. Example. What is the binary representation of 2/3? Solution: to determine the binary representation of 2/3, write 2/3 = (0.a_1 a_2 a_3 ...)_2. Multiplying by 2 gives 4/3 = (a_1.a_2 a_3 ...)_2, so taking the integer part of both sides yields a_1 = 1.

  5. Subtracting 1, we have 1/3 = (0.a_2 a_3 a_4 ...)_2. Repeating the previous step, we arrive at 2/3 = (0.101010...)_2.
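The multiply-by-2 scheme above is easy to mechanize. A minimal Python sketch (the function name `binary_fraction_digits` is mine, not from the slides), using exact rational arithmetic so the repeating pattern is visible:

```python
from fractions import Fraction

def binary_fraction_digits(x, n):
    """First n binary digits of a fraction 0 < x < 1, using the
    multiply-by-2 / take-integer-part scheme from the slides."""
    digits = []
    for _ in range(n):
        x *= 2
        a = int(x)        # the integer part is the next binary digit
        digits.append(a)
        x -= a            # subtract it off and repeat
    return digits

# 2/3 = (0.101010...)_2
print(binary_fraction_digits(Fraction(2, 3), 8))  # -> [1, 0, 1, 0, 1, 0, 1, 0]
```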

  6. In the computational world, each representable number has only a fixed, finite number of digits. For any real number x, let x = ±1.a_1 a_2 ··· a_t a_{t+1} a_{t+2} ··· × 2^m denote the normalized scientific binary representation of x. In 1985, the IEEE (Institute of Electrical and Electronics Engineers) published a report called Binary Floating Point Arithmetic Standard 754-1985. This report specifies formats for single, double, and extended precision, and these standards are generally followed by microcomputer manufacturers when designing floating-point hardware.

  7. Single precision. The single-precision IEEE standard floating-point format allocates 32 bits for the normalized floating-point number ±q × 2^m, laid out as follows: bit 0 holds the sign of the mantissa, bits 1–8 the 8-bit exponent, and bits 9–31 the 23-bit normalized mantissa. The first bit is a sign indicator, denoted s. It is followed by an 8-bit exponent c and a 23-bit mantissa f. The base for the exponent and mantissa is 2, and the actual exponent is c − 127. The value of c is restricted to 0 ≤ c ≤ 255.

  8. The actual exponent of the number is therefore restricted to −127 ≤ c − 127 ≤ 128. Normalization is imposed so that the leading digit of the fraction is 1, and this leading "1" is not stored as part of the 23-bit mantissa f. The resulting floating-point number takes the form (−1)^s · 2^(c−127) · (1 + f).

  9. Example. What is the decimal value of the machine number 01000000101000000000000000000000? The leftmost bit is zero, which indicates that the number is positive. The next 8 bits, 10000001, give the exponent c = 1·2^7 + 0·2^6 + ··· + 0·2^1 + 1·2^0 = 129, so the exponential part of the number is 2^(129−127) = 2^2. The final 23 bits specify the mantissa f = 0·2^−1 + 1·2^−2 + 0·2^−3 + ··· + 0·2^−23 = 0.25. Consequently, this machine number precisely represents the decimal number (−1)^s · 2^(c−127) · (1 + f) = 2^2 · (1 + 0.25) = 5.
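The decoding above can be checked mechanically. A Python sketch using the standard `struct` module (`decode_single` is a helper name of my own):

```python
import struct

def decode_single(bits):
    """Interpret a 32-character bit string as an IEEE 754
    single-precision number (bit 0 = sign, as in the slides)."""
    return struct.unpack('>f', int(bits, 2).to_bytes(4, 'big'))[0]

print(decode_single('01000000101000000000000000000000'))  # -> 5.0
```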

  10. Example. What is the decimal value of the machine number 01000000100111111111111111111111? The sign bit and exponent are the same as in the previous example, but the final 23 bits specify the mantissa f = 0·2^−1 + 0·2^−2 + 1·2^−3 + ··· + 1·2^−23 = 0.2499998807907105. Consequently, this machine number precisely represents the decimal number (−1)^s · 2^(c−127) · (1 + f) = 2^2 · (1 + 0.2499998807907105) = 4.999999523162842.

  11. Example. What is the decimal value of the machine number 01000000101000000000000000000001? The final 23 bits specify the mantissa f = 0·2^−1 + 1·2^−2 + 0·2^−3 + ··· + 0·2^−22 + 1·2^−23 = 0.2500001192092896. Consequently, this machine number precisely represents the decimal number (−1)^s · 2^(c−127) · (1 + f) = 2^2 · (1 + 0.2500001192092896) = 5.000000476837158.

  12. Summary of the three examples above: 01000000100111111111111111111111 ⇒ 4.999999523162842; 01000000101000000000000000000000 ⇒ 5; 01000000101000000000000000000001 ⇒ 5.000000476837158. Only a relatively small subset of the real number system is used for the representation of all the real numbers. This subset, called the floating-point numbers, contains only rational numbers, both positive and negative. When a number cannot be represented exactly with the fixed, finite number of digits in a computer, a nearby floating-point number is chosen as its approximate representation.
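This last point is easy to see in Python, whose floats are IEEE 754 doubles: 0.1 has no finite binary expansion, so a nearby machine number is stored instead. `fractions.Fraction` recovers the exact rational value actually stored:

```python
from fractions import Fraction

# The double closest to 0.1, as an exact rational number:
stored = Fraction(0.1)

print(stored == Fraction(1, 10))             # -> False: 0.1 is not representable
print(float(abs(stored - Fraction(1, 10))))  # the (tiny) representation error
```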

  13. The smallest (normalized) positive number: take s = 0, c = 1, and f = 0, which corresponds to 2^−126 · (1 + 0) ≈ 1.175 × 10^−38. The largest number: take s = 0, c = 254, and f = 1 − 2^−23, which corresponds to 2^127 · (2 − 2^−23) ≈ 3.403 × 10^38. Definition: if a number x satisfies 0 < |x| < 2^−126 · (1 + 0), we say that an underflow has occurred; the result is generally set to zero, or stored as an IEEE 'subnormal' (or 'denormal') number, which corresponds to c = 0. If |x| > 2^127 · (2 − 2^−23), we say that an overflow has occurred.
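Both extreme values can be assembled bit by bit from the (s, c, f) fields and checked against the closed forms above. A Python sketch (`bits_to_single` is an illustrative helper name):

```python
import struct

def bits_to_single(s, c, f_bits):
    """Pack sign bit s, 8-bit exponent c, and 23 mantissa bits f_bits
    into the single-precision layout and decode the result."""
    word = (s << 31) | (c << 23) | f_bits
    return struct.unpack('>f', word.to_bytes(4, 'big'))[0]

smallest = bits_to_single(0, 1, 0)             # 2**-126
largest  = bits_to_single(0, 254, 2**23 - 1)   # 2**127 * (2 - 2**-23)
print(smallest)   # ≈ 1.175e-38
print(largest)    # ≈ 3.403e+38
```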

  14. Double precision. A floating-point number in the double-precision IEEE standard format uses two words (64 bits) to store the number, laid out as follows: bit 0 holds the sign of the mantissa, bits 1–11 the 11-bit exponent, and bits 12–63 the 52-bit normalized mantissa. The first bit is a sign indicator, denoted s. It is followed by an 11-bit exponent c and a 52-bit mantissa f. The actual exponent is c − 1023.

  15. Format of a double-precision floating-point number: (−1)^s × (1 + f) × 2^(c−1023). The smallest (normalized) positive number: take s = 0, c = 1, and f = 0, which is equivalent to 2^−1022 · (1 + 0) ≈ 2.225 × 10^−308. The largest number: take s = 0, c = 2046, and f = 1 − 2^−52, which is equivalent to 2^1023 · (2 − 2^−52) ≈ 1.798 × 10^308.
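Since Python's built-in float is an IEEE 754 double, these two limits can be verified directly against `sys.float_info`:

```python
import sys

# Smallest normalized positive double: 2**-1022
print(sys.float_info.min == 2.0**-1022)                # -> True
# Largest finite double: 2**1023 * (2 - 2**-52)
print(sys.float_info.max == (2 - 2**-52) * 2.0**1023)  # -> True
```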

  16. Chopping and rounding. For any real number x, let x = ±1.a_1 a_2 ··· a_t a_{t+1} a_{t+2} ··· × 2^m denote the normalized scientific binary representation of x. 1. Chopping: simply discard the excess bits a_{t+1}, a_{t+2}, ... to obtain fl(x) = ±1.a_1 a_2 ··· a_t × 2^m. 2. Rounding: add ±2^−(t+1) × 2^m to x and then chop the excess bits to obtain a number of the form fl(x) = ±1.δ_1 δ_2 ··· δ_t × 2^m. In this method, if a_{t+1} = 1, we add 1 to a_t to obtain fl(x), and if a_{t+1} = 0, we merely chop off all but the first t digits.
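Both rules can be sketched in a few lines of Python. This is a simplified illustration (function name `fl` is mine; it assumes x ≠ 0 and ignores exponent-range limits): scale x so the t kept bits form an integer, then floor (chop) or add one half and floor (round).

```python
import math

def fl(x, t, mode='chop'):
    """Keep t bits after the binary point of the normalized form
    x = ±1.a_1 a_2 ... × 2^m.  A sketch; assumes x != 0."""
    m = math.floor(math.log2(abs(x)))      # exponent of the normalized form
    scaled = abs(x) / 2**m * 2**t          # kept bits become the integer part
    kept = math.floor(scaled) if mode == 'chop' else math.floor(scaled + 0.5)
    return math.copysign(kept * 2**(m - t), x)

# 2/3 = (1.0101...)_2 x 2^-1, kept to t = 3 bits:
print(fl(2/3, 3, 'chop'))    # -> 0.625   ( (1.010)_2 x 2^-1 )
print(fl(2/3, 3, 'round'))   # -> 0.6875  ( (1.011)_2 x 2^-1 )
```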

  17. Absolute and Relative Errors. Definition (round-off error): the error resulting from replacing a number with its floating-point form is called round-off error or rounding error. Definition (absolute error and relative error): if x* is an approximation to the exact value x, the absolute error is |x* − x| and the relative error is |x* − x| / |x|, provided that x ≠ 0. Example: (a) if x = 0.3000 × 10^−3 and x* = 0.3100 × 10^−3, then the absolute error is 0.1 × 10^−4 and the relative error is 0.3333 × 10^−1; (b) if x = 0.3000 × 10^4 and x* = 0.3100 × 10^4, then the absolute error is 0.1 × 10^3 and the relative error is again 0.3333 × 10^−1.
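The two cases can be reproduced with a tiny helper (`abs_rel_error` is an illustrative name): both pairs differ by vastly different absolute amounts yet have the same relative error 1/30.

```python
def abs_rel_error(x, x_star):
    """Absolute and relative error of the approximation x_star to x (x != 0)."""
    return abs(x_star - x), abs(x_star - x) / abs(x)

print(abs_rel_error(0.3000e-3, 0.3100e-3))  # case (a): abs ~ 1e-5, rel ~ 1/30
print(abs_rel_error(0.3000e4, 0.3100e4))    # case (b): abs = 100,  rel ~ 1/30
```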

  18. Remark: as a measure of accuracy, the absolute error may be misleading and the relative error more meaningful. Definition: in decimal expressions, the number x* is said to approximate x to t significant digits if t is the largest non-negative integer for which |x − x*| / |x| ≤ 5 × 10^−t.
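The definition translates directly into code. A Python sketch (`significant_digits` is my name for it; it returns 0 when even t = 0 fails, and infinity for an exact match):

```python
import math

def significant_digits(x, x_star):
    """Largest non-negative integer t with |x - x_star|/|x| <= 5 * 10**-t."""
    rel = abs(x - x_star) / abs(x)
    if rel == 0:
        return math.inf           # exact: every t satisfies the bound
    t = 0
    while rel <= 5 * 10.0**-(t + 1):
        t += 1
    return t

# Example (b) above: rel = 1/30, so t = 2.
print(significant_digits(0.3000e4, 0.3100e4))  # -> 2
```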

  19. In binary expressions, if the floating-point representation fl_chop(x) of the number x is obtained by chopping to t digits, then the relative error is |x − fl_chop(x)| / |x| = |0.00···0 a_{t+1} a_{t+2} ··· × 2^m| / |1.a_1 a_2 ··· a_t a_{t+1} a_{t+2} ··· × 2^m| = (|0.a_{t+1} a_{t+2} ···| / |1.a_1 a_2 ··· a_t a_{t+1} a_{t+2} ···|) × 2^−t. The minimal value of the denominator is 1, and the numerator is bounded above by 1. As a consequence, |x − fl_chop(x)| / |x| ≤ 2^−t.
