
ECE232: Hardware Organization and Design Lecture 9: Floating Point - PowerPoint PPT Presentation



  1. ECE232: Hardware Organization and Design
     Lecture 9: Floating Point
     Adapted from Computer Organization and Design, Patterson & Hennessy, UCB

  2. Floating Point
     • Representation for non-integral numbers
         • Including very small and very large numbers
     • Like scientific notation
         • -2.34 × 10^56
         • +0.002 × 10^-4
         • +987.02 × 10^9
     • In binary
         • Normalized: ±1.xxxxxxx_2 × 2^yyyy
     • Types float and double in C
     ECE232: Floating Point 2

  3. Floating Point Numbers
     • The largest 32-bit unsigned integer is
       1111 1111 1111 1111 1111 1111 1111 1111_2 = 4,294,967,295
     • What if we want to encode the approximate age of the earth?
       4,600,000,000, or 4.6 × 10^9
     • Or the mass in kg of one a.m.u. (atomic mass unit)?
       0.00000000000000000000000000166, or 1.66 × 10^-27
     • Neither value can be encoded in a 32-bit integer: the first is too large, and the second is a fraction far smaller than 1.

  4. Exponential Notation
     • The following are equivalent representations of 1,234:
         123,400.0 × 10^-2
          12,340.0 × 10^-1
           1,234.0 × 10^0
             123.4 × 10^1
             12.34 × 10^2
             1.234 × 10^3
            0.1234 × 10^4
           0.01234 × 10^5
     • The representations differ in that the decimal place (the "point") "floats" to the left or right, with the appropriate adjustment in the exponent.

  5. Parts of a Floating Point Number
     Example: -0.9876 × 10^-3
     • Sign of mantissa: -
     • Location of decimal point: between 0 and 9876
     • Mantissa: 0.9876
     • Base: 10
     • Sign of exponent: -
     • Exponent: 3
     • Mantissa is also called significand

  6. Single Precision Format
     • 32 bits in total:
         S: Sign of mantissa (1 bit)
         E: Exponent (8 bits)
         M: Mantissa (23 bits)
     • Note that the exponent has no explicit sign bit
     • Base?

  7. Normalization
     • The mantissa M is a normalized fraction
         • Has an implied binary point on the left
         • Has an implied (hidden) "1" to the left of the binary point
     • E.g.,
         • Fraction 10100000000000000000000
         • Represents 1.101_2 = 1.625_10
     • The significand 1.f is in the range [1, 2 - ulp]
         • ulp: unit in the last position
     • $F = (-1)^S \times 1.f \times 2^{E - \text{Bias}}$
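
As a small illustration of the hidden bit, the Python sketch below rebuilds the significand 1.f from a 23-bit fraction field. The function name `significand_from_fraction` is made up for this example and does not come from the lecture.

```python
def significand_from_fraction(fraction_bits: str) -> float:
    """Return 1.f for a fraction field given as a string of '0'/'1' characters."""
    # The fraction field holds the bits to the right of the binary point.
    f = int(fraction_bits, 2) / 2 ** len(fraction_bits)
    # Restore the hidden leading 1.
    return 1.0 + f


example = significand_from_fraction("10100000000000000000000")  # 1.101 in binary
ulp = 2.0 ** -23  # unit in the last position for a 23-bit fraction
largest = significand_from_fraction("1" * 23)  # all-ones fraction
```

Here `example` is 1.625 and `largest` equals 2 - ulp, the upper end of the range quoted on the slide.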

  8. IEEE Floating-Point Format
     Field layout: | S | Exponent | Fraction |
         Exponent: 8 bits (single), 11 bits (double)
         Fraction: 23 bits (single), 52 bits (double)
     $x = (-1)^S \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - \text{Bias})}$
     • S: sign bit (0 ⇒ non-negative, 1 ⇒ negative)
     • Normalize significand: 1.0 ≤ |significand| < 2.0
         • Always has a leading pre-binary-point 1 bit, so no need to represent it explicitly (hidden bit)
         • Significand is Fraction with the "1." restored
     • Exponent: excess representation: actual exponent + Bias
         • Ensures exponent is unsigned
         • Single: Bias = 127; Double: Bias = 1023
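
To make the field layout concrete, here is a minimal Python sketch that splits a single-precision value into S, Exponent, and Fraction with the standard `struct` module, then re-evaluates the slide's formula. The helper names `float32_fields` and `float32_value` are illustrative only, and the formula as written covers normalized numbers only (exponent field 1 to 254).

```python
import struct


def float32_fields(x: float):
    """Return (sign, exponent_field, fraction_field) of x stored as a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with the bias added
    fraction = bits & 0x7FFFFF       # 23 bits, hidden 1 not stored
    return sign, exponent, fraction


def float32_value(sign: int, exponent: int, fraction: int, bias: int = 127) -> float:
    """Evaluate (-1)^S * (1 + Fraction) * 2^(Exponent - Bias) for a normalized number."""
    return (-1) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - bias)


fields_of_one = float32_fields(1.0)
round_trip = float32_value(*float32_fields(-0.75))
```

For 1.0 the fields are (0, 127, 0): the stored exponent equals the bias because the actual exponent is 0.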

  9. Single-Precision Range
     • Exponents 00000000 and 11111111 are reserved
     • Smallest value
         • Exponent: 00000001 ⇒ actual exponent = 1 - 127 = -126
         • Fraction: 000…00 ⇒ significand = 1.0
         • ±1.0 × 2^-126 ≈ ±1.2 × 10^-38
     • Largest value
         • Exponent: 11111110 ⇒ actual exponent = 254 - 127 = +127
         • Fraction: 111…11 ⇒ significand ≈ 2.0
         • ±2.0 × 2^+127 ≈ ±3.4 × 10^+38
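
The two extremes can be checked by building the bit patterns directly. This is a sketch using Python's `struct` module; `float32_from_bits` is a hypothetical helper name introduced for the example.

```python
import struct


def float32_from_bits(bits: int) -> float:
    """Interpret a 32-bit integer as an IEEE single-precision value."""
    return struct.unpack(">f", bits.to_bytes(4, "big"))[0]


# Exponent field 00000001, fraction 000...00
smallest_normal = float32_from_bits(0x00800000)
# Exponent field 11111110, fraction 111...11
largest_normal = float32_from_bits(0x7F7FFFFF)
```

The largest significand is 2 - 2^-23 rather than exactly 2.0, which is why the slide writes "≈ 2.0".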

  10. Floating-Point Example
      • Represent -0.75
          • -0.75 = (-1)^1 × 1.1_2 × 2^-1
          • S = 1
          • Fraction = 1000…00_2
          • Exponent = -1 + Bias
              • Single: -1 + 127 = 126 = 01111110_2
              • Double: -1 + 1023 = 1022 = 01111111110_2
      • Single: 1 01111110 1000…00
      • Double: 1 01111111110 1000…00
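
The encoding of -0.75 can be reproduced with Python's `struct` module, which packs a value into its single- and double-precision bit patterns. The variable names below are chosen for this sketch only.

```python
import struct

# Pack -0.75 as single (">f") and double (">d"), then read the raw bits back.
single = struct.unpack(">I", struct.pack(">f", -0.75))[0]
double = struct.unpack(">Q", struct.pack(">d", -0.75))[0]

single_bits = f"{single:032b}"
double_bits = f"{double:064b}"

# Split into sign | exponent | fraction.
single_fields = (single_bits[0], single_bits[1:9], single_bits[9:])
double_fields = (double_bits[0], double_bits[1:12], double_bits[12:])
```

`single_fields` comes out as sign 1, exponent 01111110, and a fraction of 1 followed by 22 zeros, matching the slide.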

  11. Floating-Point Example
      • What number is represented by the single-precision float 1 10000001 01000…00?
          • S = 1
          • Fraction = 01000…00_2
          • Exponent = 10000001_2 = 129
      • x = (-1)^1 × (1 + 0.01_2) × 2^(129 - 127) = (-1) × 1.25 × 2^2 = -5.0
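
The same decoding can be done step by step in Python and compared against what `struct` reports for the identical bit pattern. This is only a sketch; the variable names are not from the slides.

```python
import struct

# sign | exponent | fraction (23 bits)
pattern = "1" + "10000001" + "01" + "0" * 21

sign = int(pattern[0], 2)
exponent = int(pattern[1:9], 2)
fraction = int(pattern[9:], 2) / 2 ** 23  # value of 0.f

# Apply x = (-1)^S * (1 + Fraction) * 2^(Exponent - 127)
by_formula = (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - 127)

# Let the struct module interpret the same 32 bits as a float.
by_struct = struct.unpack(">f", int(pattern, 2).to_bytes(4, "big"))[0]
```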

  12. Floating-Point Addition
      • Consider a 4-digit decimal example
          • 9.999 × 10^1 + 1.610 × 10^-1
      • 1. Align decimal points
          • Shift number with smaller exponent
          • 9.999 × 10^1 + 0.016 × 10^1
      • 2. Add significands
          • 9.999 × 10^1 + 0.016 × 10^1 = 10.015 × 10^1
      • 3. Normalize result & check for over/underflow
          • 1.0015 × 10^2
      • 4. Round and renormalize if necessary
          • 1.002 × 10^2
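
Python's `decimal` module can mimic a 4-digit decimal machine, which gives a quick check of the final answer. Note the assumptions in this sketch: `decimal` adds the operands exactly and rounds once at the end, whereas the slide drops digits during alignment, and the rounding mode (`ROUND_HALF_UP`) is chosen here for illustration. For this example both routes agree.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# A context limited to 4 significant digits, like the slide's 4-digit example.
ctx = Context(prec=4, rounding=ROUND_HALF_UP)

a = Decimal("9.999E1")
b = Decimal("1.610E-1")

# Exact sum is 100.151; rounding to 4 digits gives 100.2 = 1.002 x 10^2.
result = ctx.add(a, b)
```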

  13. Floating-Point Addition
      • Now consider a 4-digit binary example
          • 1.000_2 × 2^-1 + -1.110_2 × 2^-2 (0.5 + -0.4375)
      • 1. Align binary points
          • Shift number with smaller exponent
          • 1.000_2 × 2^-1 + -0.111_2 × 2^-1
      • 2. Add significands
          • 1.000_2 × 2^-1 + -0.111_2 × 2^-1 = 0.001_2 × 2^-1
      • 3. Normalize result & check for over/underflow
          • 1.000_2 × 2^-4, with no over/underflow
      • 4. Round and renormalize if necessary
          • 1.000_2 × 2^-4 (no change) = 0.0625

  14. Steps in Addition/Subtraction
      • Step 1: Calculate the difference d of the two exponents: d = |E1 - E2|
      • Step 2: Shift the significand of the smaller number by d positions to the right
      • Step 3: Add the aligned significands and set the exponent of the result to the exponent of the larger operand
      • Step 4: Normalize the resultant significand and adjust the exponent if necessary
      • Step 5: Round the resultant significand and adjust the exponent if necessary
      Source: I. Koren, Computer Arithmetic Algorithms, 2nd Edition, 2002
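
The five steps can be sketched in Python on a toy format with a 4-bit significand, as in the 4-digit binary example. Everything here is an assumption made for illustration: the `(sign, exponent, sig)` tuple encoding, the names `fp_add` and `to_value`, the use of round-to-nearest-even (the slides do not name a rounding mode), and the omission of overflow/underflow checks. Alignment is done by scaling the larger operand up, which is arithmetically equivalent to shifting the smaller significand right while keeping every shifted-out bit.

```python
def fp_add(a, b, p=4):
    """Add two toy floating-point numbers.

    Each operand is (sign, exponent, sig), meaning
    (-1)**sign * (sig / 2**(p - 1)) * 2**exponent, where sig is a p-bit
    integer whose top bit is the leading 1.
    """
    # Step 1: exponent difference d; let `a` be the operand with the larger exponent.
    if a[1] < b[1]:
        a, b = b, a
    (sa, ea, ma), (sb, eb, mb) = a, b
    d = ea - eb

    # Step 2: align (scale the larger operand up by d bits, losing nothing).
    # Step 3: add the aligned, signed significands.
    total = (-1) ** sa * (ma << d) + (-1) ** sb * mb
    if total == 0:
        return (0, 0, 0)  # toy encoding of zero
    sign = 1 if total < 0 else 0
    mag = abs(total)

    # Step 4: normalize so that exactly p significant bits remain.
    shift = mag.bit_length() - p
    if shift <= 0:
        return (sign, eb + shift, mag << -shift)

    # Step 5: round to nearest, ties to even, and renormalize if rounding carries.
    sig = mag >> shift
    rem = mag & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if rem > half or (rem == half and sig & 1):
        sig += 1
        if sig == 1 << p:
            sig >>= 1
            shift += 1
    return (sign, eb + shift, sig)


def to_value(x, p=4):
    """Convert a toy (sign, exponent, sig) tuple to a Python float."""
    sign, exponent, sig = x
    return (-1) ** sign * sig / 2 ** (p - 1) * 2.0 ** exponent


# 1.000_2 x 2^-1 + (-1.110_2 x 2^-2), i.e. 0.5 + (-0.4375)
result = fp_add((0, -1, 0b1000), (1, -2, 0b1110))
```

`result` is `(0, -4, 0b1000)`, that is 1.000_2 × 2^-4 = 0.0625, the same answer as the worked binary example.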

  15. Example: Single precision
      0 10000010 11010000000000000000000
      • Sign: 0 = positive mantissa
      • Exponent: 10000010_2 = 130; 130 - 127 = 3
      • Significand: 1.1101_2
      • +1.1101_2 × 2^3 = 1110.1_2 = 14.5_10

  16. Converting to IEEE format
      • Example - decimal number: -3.154 × 10^0
          • What is the sign?
          • What is the exponent?
          • What is the mantissa?
      • Converting Mixed Numbers: Decimal to Binary
          456.78_10 = 4 × 10^2 + 5 × 10^1 + 6 × 10^0 + 7 × 10^-1 + 8 × 10^-2
          1011.11_2 = 1 × 2^3 + 0 × 2^2 + 1 × 2^1 + 1 × 2^0 + 1 × 2^-1 + 1 × 2^-2
                    = 8 + 0 + 2 + 1 + 1/2 + 1/4
                    = 11 + 0.5 + 0.25 = 11.75_10
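
The positional expansion of 1011.11_2 can be written as a short Python function. This is a sketch; `binary_mixed_to_decimal` is a name invented for the example.

```python
def binary_mixed_to_decimal(text: str) -> float:
    """Evaluate a binary string such as '1011.11' by summing bit * 2**position."""
    whole, _, frac = text.partition(".")
    value = 0.0
    # Bits left of the point carry weights 2^0, 2^1, ... from right to left.
    for position, bit in enumerate(reversed(whole)):
        value += int(bit) * 2 ** position
    # Bits right of the point carry weights 2^-1, 2^-2, ...
    for position, bit in enumerate(frac, start=1):
        value += int(bit) * 2.0 ** -position
    return value


converted = binary_mixed_to_decimal("1011.11")
```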

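  17. How to convert whole Decimal to Binary
      • Successive division by 2: 57143_10 = 1101111100110111_2
      • Start at the bottom row; each row above holds the previous value divided by 2 (integer part). The remainder of each division is that row's bit, and the bits read from top to bottom give the binary number.

          Value    Remainder (bit)
              1    1
              3    1
              6    0
             13    1
             27    1
             55    1
            111    1
            223    1
            446    0
            892    0
           1785    1
           3571    1
           7142    0
          14285    1
          28571    1
          57143    1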
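
Successive division by 2 translates directly into a loop. The sketch below uses a made-up function name, `whole_to_binary`, and collects the remainders from least to most significant before reversing them.

```python
def whole_to_binary(n: int) -> str:
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, r = divmod(n, 2)       # quotient becomes the next value, r is the bit
        remainders.append(str(r))  # least significant bit comes out first
    return "".join(reversed(remainders))


bits_57143 = whole_to_binary(57143)
```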

  18. Converting fractional Decimal to Binary
      • Successive multiplication by 2: the integer part of each product is the next bit, and only the fractional part is carried to the next step.

          Step   Product   Bit        Step   Product   Bit
             0   0.154                  12   0.784     0
             1   0.308     0            13   1.568     1
             2   0.616     0            14   1.136     1
             3   1.232     1            15   0.272     0
             4   0.464     0            16   0.544     0
             5   0.928     0            17   1.088     1
             6   1.856     1            18   0.176     0
             7   1.712     1            19   0.352     0
             8   1.424     1            20   0.704     0
             9   0.848     0            21   1.408     1
            10   1.696     1            22   0.816     0
            11   1.392     1            23   1.632     1

      • Decimal 0.154 = .0010 0111 0110 1100 1000 101_2 (first 23 bits)
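
The multiplication table can be reproduced in Python. This sketch uses `fractions.Fraction` so that 0.154 is held exactly (a binary float could not hold it exactly, which is the point of the exercise); the function name `fraction_to_binary` is illustrative.

```python
from fractions import Fraction


def fraction_to_binary(x: Fraction, nbits: int) -> str:
    """Return the first nbits binary digits of 0 <= x < 1 by repeated doubling."""
    bits = []
    for _ in range(nbits):
        x *= 2
        bit = 1 if x >= 1 else 0  # integer part of the product is the next bit
        bits.append(str(bit))
        x -= bit                  # keep only the fractional part
    return "".join(bits)


bits_0154 = fraction_to_binary(Fraction("0.154"), 23)
```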

  19. Floating Point Special Representations
      $F = (-1)^S \times 1.f \times 2^{E - 127}$
      • There are two Zeroes, ±0, and two Infinities, ±∞
      • NaN (Not-a-Number) may have a sign and has a non-zero fraction; used for program diagnostics
      • NaNs and Infinities have all 1s in the Exp field, E = 255
      • F + ∞ = ∞, F / ∞ = 0
      Source: I. Koren, Computer Arithmetic Algorithms, 2nd Edition, 2002
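
These rules can be observed from Python, whose `float` is an IEEE double rather than a single, so this is only an illustration of the behavior and not of the 32-bit encoding. The variable names are chosen for the sketch.

```python
import math

inf = float("inf")
finite = 1.5

sum_with_inf = finite + inf   # F + infinity
quotient = finite / inf       # F / infinity
not_a_number = inf - inf      # an invalid operation yields NaN
neg_zero = -0.0               # the second zero

# The two zeroes compare equal, but the sign bit is still there.
zero_sign = math.copysign(1.0, neg_zero)
```

A NaN is the only value that compares unequal to itself, which is why `math.isnan` is the usual way to test for it.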

  20. Floating Point Special Representations
      $F = (-1)^S \times 1.f \times 2^{E - 127}$, for $1 \le E \le 254$

      Single Precision         Double Precision         Object represented
      Exponent   Fraction      Exponent   Fraction
      0          0             0          0             0
      0          nonzero       0          nonzero       ± denormalized number
      1-254      anything      1-2046     anything      ± floating point number
      255        0             2047       0             ± infinity
      255        nonzero       2047       nonzero       NaN (not a number)
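
The single-precision columns of the table map onto a few comparisons. Below is a minimal Python sketch; `classify_float32` and its return strings are names invented here.

```python
def classify_float32(bits: int) -> str:
    """Classify a 32-bit pattern using the exponent and fraction fields."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"  # exponent field 1 to 254


kinds = [classify_float32(b) for b in
         (0x00000000, 0x80000000, 0x00400000, 0x3F800000, 0x7F800000, 0x7FA00000)]
```

The sign bit is ignored on purpose: every category in the table exists in both a positive and a negative form.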

  21. Smallest & Largest Numbers
      • The smallest non-zero positive and largest non-zero negative normalized numbers (represented by 1 in the Exp field and 0…0 in the Fraction field) are
          • ±2^-126 ≈ ±1.175494351 × 10^-38
      • The smallest non-zero positive and largest non-zero negative denormalized numbers (represented by all 0s in the Exp field and 0…01 in the Fraction field) are
          • ±2^-149 ≈ ±1.4012985 × 10^-45
      • The largest finite positive and smallest finite negative numbers (represented by 254 in the Exp field and 1…1 in the Fraction field) are
          • ±(2 - 2^-23) × 2^127 ≈ ±3.40 × 10^38
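
The denormalized values can be checked from their bit patterns. One fact is assumed here that the slides do not spell out: for an all-zero exponent field, IEEE 754 evaluates the number as (-1)^S × 0.f × 2^-126, with no hidden 1. The helper name `float32_from_bits` is hypothetical.

```python
import struct


def float32_from_bits(bits: int) -> float:
    """Interpret a 32-bit integer as an IEEE single-precision value."""
    return struct.unpack(">f", bits.to_bytes(4, "big"))[0]


# Exp field all 0s, fraction 0...01: 2^-23 * 2^-126 = 2^-149
smallest_denormal = float32_from_bits(0x00000001)
# Exp field all 0s, fraction 10...0: 0.1_2 * 2^-126 = 2^-127
half_denormal = float32_from_bits(0x00400000)
# Exp field all 0s, fraction 1...1: just below the smallest normalized number
largest_denormal = float32_from_bits(0x007FFFFF)
```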

  22. FP Adder Hardware
      [Figure: floating-point adder datapath, annotated with Step 1, Step 2, Step 3, and Step 4]

  23. Single Precision Summary

      Type                         Exponent    Mantissa                       Value
      Zero                         0000 0000   000 0000 0000 0000 0000 0000   0
      One                          0111 1111   000 0000 0000 0000 0000 0000   1
      Denormalized number          0000 0000   100 0000 0000 0000 0000 0000   5.9 × 10^-39
      Largest normalized number    1111 1110   111 1111 1111 1111 1111 1111   3.4 × 10^38
      Smallest normalized number   0000 0001   000 0000 0000 0000 0000 0000   1.18 × 10^-38
      Infinity                     1111 1111   000 0000 0000 0000 0000 0000   Infinity
      NaN                          1111 1111   010 0000 0000 0000 0000 0000   NaN

  24. Summary
      • Floating point numbers represent large numbers with fractions
      • Number formats are different from 2's complement
          • Requires some memorization
      • Addition requires aligning, adding, and then realigning
      • Do examples!
          • The best way to learn floating point operations
