
EE 109 Unit 19: IEEE 754 Floating Point Representation and Floating Point Arithmetic



  1. Slide 1: EE 109 Unit 19
     IEEE 754 Floating Point Representation
     Floating Point Arithmetic

     Slide 2: Floating Point
     • Used to represent very small numbers (fractions) and very large numbers
       – Avogadro's Number: +6.0247 * 10^23
       – Planck's Constant: +6.6254 * 10^-27
       – Note: 32- or 64-bit integers can't represent this range
     • Floating Point representation is used in HLLs like C by declaring variables as float or double

     Slide 3: Fixed Point
     • Unsigned and 2's complement fall under a category of representations called fixed point
     • The radix point is assumed to be in a fixed location for all numbers [Note: we could represent fractions by implicitly assuming the binary point is at the left…A variable just stores bits…you can assume the binary point is anywhere you like]
       – Integers: 10011101. (binary point to right of LSB)
         • For 32 bits, unsigned range is 0 to ~4 billion
       – Fractions: .10011101 (binary point to left of MSB)
         • Range [0 to 1)
     • Main point: By fixing the radix point, we limit the range of numbers that can be represented
       – Floating point allows the radix point to be in a different location for each value
     (Figure: "Bit storage" read through a "Fixed point Rep.")

     Slide 4: Floating Point Representation
     • Similar to scientific notation used with decimal numbers
     • Floating Point representation uses the following form
       – ±b.bbbb * 2^(±exp)
       – 3 Fields: sign, exponent, fraction (also called mantissa or significand)
     (Figure label: "Overall Sign of #")
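
The fixed-point idea above (the same stored bits, read with a different assumed binary point) can be sketched in a few lines of Python. This is only an illustration: the names `BITS` and `to_float32` are assumptions introduced here, and Python's standard `struct` module is used to round a value to a 32-bit float.

```python
import struct

BITS = 0b10011101  # the 8-bit pattern from the fixed-point slide

# Fixed point: the stored bits never change; only the assumed position
# of the binary point decides what value they stand for.
as_integer = BITS            # binary point to the right of the LSB
as_fraction = BITS / 2**8    # binary point to the left of the MSB


def to_float32(x):
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


avogadro = to_float32(6.0247e23)   # larger than any 64-bit integer
planck = to_float32(6.6254e-27)    # a tiny fraction, yet not rounded to zero

print(as_integer, as_fraction)
print(avogadro > 2**64, 0.0 < planck < 1e-26)
```

Both constants survive the trip through a 32-bit float with roughly 7 significant digits, which a 32-bit fixed-point integer or fraction cannot offer at either end of that range.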

  2. Slide 5: Normalized FP Numbers
     • Decimal Example
       – +0.754*10^15 is not correct scientific notation
       – Must have exactly one significant digit before decimal point: +7.54*10^14
     • In binary the only significant digit is 1
     • Thus normalized FP format is: ±1.bbbbbb * 2^(±exp)
     • FP numbers will always be normalized before being stored in memory or a reg.
       – The 1. is actually not stored but assumed since we always will store normalized numbers
       – If HW calculates a result of 0.001101*2^5 it must normalize to 1.101000*2^2 before storing

     Slide 6: IEEE Floating Point Formats
       Field                              Single Precision (32-bit)   Double Precision (64-bit)
       Sign bit (0=pos/1=neg)             1                           1
       Exponent bits                      8, Excess-127               11, Excess-1023
       Fraction (significand/mantissa)    23                          52
       Equiv. decimal range               7 digits x 10^±38           16 digits x 10^±308
       Layout (S | Exp. | Fraction)       1 | 8 | 23                  1 | 11 | 52
     • More on the exponent representation on the next slides

     Slide 7: Exponent Representation
     • Exponent needs its own sign (+/-)
     • Rather than using 2's comp. system we use Excess-N representation
       – Single-Precision uses Excess-127
       – Double-Precision uses Excess-1023
       – This representation allows FP numbers to be easily compared
     • Let E' = stored exponent code and E = true exponent value
     • For single-precision: E' = E + 127
       – 2^1 => E = 1, E' = 128 (decimal) = 10000000 (binary)
     • For double-precision: E' = E + 1023
       – 2^-2 => E = -2, E' = 1021 (decimal) = 01111111101 (binary)
     Comparison of 2's comp. & Excess-N (range of 8 bits shown):
       2's comp.   E' (stored Exp.)   Excess-127
       -1          1111 1111          +128
       -2          1111 1110          +127
       …
       -128        1000 0000          +1
       +127        0111 1111          0
       +126        0111 1110          -1
       …
       +1          0000 0001          -126
       0           0000 0000          -127
     Q: Why don't we use Excess-N more to represent negative #'s?

     Slide 8: Exponent Representation
     • FP formats reserve the exponent values of all 1's and all 0's for special purposes
     • Thus, for single-precision the range of exponents is -126 to +127
       E'          E (= E' - 127)
       11111111    reserved
       11111110    +127
       …
       10000000    +1
       01111111    0
       01111110    -1
       …
       00000001    -126
       00000000    reserved
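
As a minimal sketch of the field layout and the Excess-N exponent described above, the Python snippet below splits a value into its sign, stored exponent E', and fraction. The helper names `fields32` and `fields64` are illustrative and not from the slides; the packing is done with Python's standard `struct` module.

```python
import struct


def fields32(x):
    """Split x into (sign, stored exponent E', fraction) as single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def fields64(x):
    """Split x into (sign, stored exponent E', fraction) as double precision."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


# 2^1: E = 1, so E' = 1 + 127 = 128 in single precision.
print(fields32(2.0))

# 2^-2: E = -2, so E' = -2 + 1023 = 1021 in double precision.
print(fields64(0.25))

# 0.001101 * 2^5 = 6.5 is stored normalized as 1.101000 * 2^2:
# E' = 2 + 127 = 129, and the fraction holds 101 followed by zeros
# (the leading 1 is implied, not stored).
sign, e_stored, fraction = fields32(6.5)
print(sign, e_stored, format(fraction, "023b"))
```

Because the stored exponent is a plain unsigned code sitting above the fraction, a larger positive value always has a larger bit pattern, which is the "easily compared" property the slide mentions.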

  3. Slide 9: IEEE Exponent Special Values
     (Table with columns: E' | Fraction | Meaning)

     Slide 10: Single-Precision Examples
       1. 1 | 1000 0010 | 110 0110 0000 0000 0000 0000
       2. +0.6875 = +0.1011 (binary)

     Slide 11: Floating Point vs. Fixed Point
     • Single Precision (32-bits) Equivalent Decimal Range:
       – 7 significant decimal digits * 10^±38
       – Compare that to 32-bit signed integer where we can represent ±2 billion. How does a 32-bit float allow us to represent such a greater range?
       – FP allows for range but sacrifices precision (can't represent all numbers in its range)
     • Double Precision (64-bits) Equivalent Decimal Range:
       – 16 significant decimal digits * 10^±308

     Slide 12: IEEE Shortened Format
     • 12-bit format defined just for this class (doesn't really exist)
       – 1 Sign Bit
       – 5 Exponent bits (using Excess-15)
         • Same reserved codes
       – 6 Fraction (significand) bits
       Field            Width    Meaning
       S  (Sign Bit)    1 bit    0=pos., 1=neg.
       E' (Exponent)    5 bits   Excess-15: E' = E + 15, E = E' - 15
       F  (Fraction)    6 bits   1.bbbbbb
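
The two single-precision examples and the reserved exponent codes can be checked with a short Python sketch. The helper names (`from_bits32`, `to_bits32`, `pack_fields`) are assumptions made for this illustration, and the meanings attached to the reserved codes follow the IEEE 754 standard rather than the slide's own table.

```python
import math
import struct


def from_bits32(bits):
    """Interpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def to_bits32(x):
    """Return the single-precision bit pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def pack_fields(sign, e_stored, fraction):
    """Assemble sign (1 bit), E' (8 bits), and fraction (23 bits)."""
    return (sign << 31) | (e_stored << 23) | fraction


# Example 1: 1 | 1000 0010 | 110 0110 0000 0000 0000 0000
# E' = 130, so E = 130 - 127 = 3; value = -1.110011 (binary) * 2^3.
example1 = from_bits32(pack_fields(1, 0b10000010, 0b1100110 << 16))

# Example 2: +0.6875 = +0.1011 (binary) = +1.011 * 2^-1, so E' = 126.
example2_bits = to_bits32(0.6875)

# Reserved exponent codes (all 0's and all 1's), per IEEE 754:
pos_zero = from_bits32(pack_fields(0, 0x00, 0))          # E' = 0,   F = 0
denormal = from_bits32(pack_fields(0, 0x00, 1))          # E' = 0,   F != 0
pos_inf = from_bits32(pack_fields(0, 0xFF, 0))           # E' = 255, F = 0
not_a_num = from_bits32(pack_fields(0, 0xFF, 1 << 22))   # E' = 255, F != 0

# Range versus precision: 2^24 + 1 has no single-precision representation.
rounded = from_bits32(to_bits32(16777217.0))

print(example1, format(example2_bits, "032b"))
print(pos_zero, denormal, pos_inf, not_a_num, rounded)
```

The last line of the computation shows the trade-off from the "Floating Point vs. Fixed Point" slide: a 32-bit integer holds 16777217 exactly, while a 32-bit float has to round it to a neighbouring value.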

  4. Slide 13: Examples
       1. 1 | 10100 | 101101
       2. +21.75 = +10101.11 (binary)
       3. 1 | 01101 | 100000
       4. +3.625 = +11.101 (binary)

     Slide 14: Rounding Methods
     • +213.125 = 1.1010101001*2^7 => Can't keep all fraction bits
     • 4 Methods of Rounding (you are only responsible for the first 2)
       – Round to Nearest: Normal rounding you learned in grade school. Round to the nearest representable number. If exactly halfway between, round to representable value w/ 0 in LSB.
       – Round towards 0 (Chopping): Round to the representable value closest to but not greater in magnitude than the precise value. Equivalent to just dropping the extra bits.
       – Round toward +∞ (Round Up): Round to the closest representable value greater than the number
       – Round toward -∞ (Round Down): Round to the closest representable value less than the number

     Slide 15: Rounding Implementation
     • There may be a large number of bits after the fraction
     • To implement any of the methods we can keep only a subset of the extra bits after the fraction [hardware is finite]
       – Guard bits: bits immediately after LSB of fraction (in this class we will usually keep only 1 guard bit)
       – Round bit: bit to the right of the guard bits
       – Sticky bit: logical OR of all other bits after G & R bits
     • 1.01001010010 x 2^4 becomes 1.010010 | 101 x 2^4 (GRS = 101)
     • We can perform rounding to a 6-bit fraction using just these 3 bits.

     Slide 16: Number Line View Of Rounding Methods
     • Green lines are numbers that fall between two representable values (dots) and thus need to be rounded
     (Figure: four number lines running from -∞ through 0 to +∞, one each for Round to Nearest, Round to Zero, Round to +Infinity, and Round to -Infinity, with example values +5.8 and -3.75)
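
The 12-bit class format and the four rounding methods can be tried out with exact rational arithmetic. The sketch below is an illustration only: `decode12` and `round_to_fraction_bits` are hypothetical helper names, the mode strings are made up for this example, and the exponent range of the 12-bit format is not checked.

```python
from fractions import Fraction
import math

BIAS = 15        # Excess-15 exponent
FRAC_BITS = 6    # fraction bits kept by the 12-bit class format


def decode12(bits):
    """Decode a normalized 12-bit pattern S | E' (5 bits) | F (6 bits)."""
    sign = -1 if bits >> 11 else 1
    e_stored = (bits >> FRAC_BITS) & 0x1F
    fraction = bits & 0x3F
    significand = 1 + Fraction(fraction, 2**FRAC_BITS)   # implied leading 1
    return sign * significand * Fraction(2) ** (e_stored - BIAS)


def round_to_fraction_bits(x, mode):
    """Round x to a value with a 6-bit fraction (exponent range is ignored)."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    e = 0   # find e with 2^e <= |x| < 2^(e+1)
    while abs(x) >= Fraction(2) ** (e + 1):
        e += 1
    while abs(x) < Fraction(2) ** e:
        e -= 1
    step = Fraction(2) ** (e - FRAC_BITS)   # gap between representable values
    q = x / step
    if mode == "nearest":
        n = round(q)          # Fraction rounds exact halves to the even integer
    elif mode == "zero":
        n = math.trunc(q)     # chopping: drop the extra bits
    elif mode == "+inf":
        n = math.ceil(q)
    elif mode == "-inf":
        n = math.floor(q)
    else:
        raise ValueError(mode)
    return n * step


print(float(decode12(0b110100101101)))   # example 1: 1 | 10100 | 101101
print(float(decode12(0b101101100000)))   # example 3: 1 | 01101 | 100000
for mode in ("nearest", "zero", "+inf", "-inf"):
    print(mode, float(round_to_fraction_bits(Fraction("213.125"), mode)))
```

With a 6-bit fraction at exponent 7 the representable values are 2 apart, so +213.125 can only land on 212 or 214. Repeating the loop with -213.125 shows where chopping and rounding toward -∞ differ.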

  5. Slide 17: Rounding to Nearest Method
     • Round to the closest representable value
       – If precise value is exactly halfway between 2 representable values, round towards the number with 0 in the LSB
     • 1.11111011010 x 2^4 becomes 1.111110 | 111 x 2^4 (GRS = 111): Round Up
       +1.111110 x 2^4  <  +1.111110111 x 2^4  <  +1.111111 x 2^4
     • Precise value will be rounded to one of the representable values it lies between. In this case, round up because the precise value is closer to the next higher representable value.

     Slide 18: Round to Nearest Method
     • Same idea as rounding in decimal
       – .51 and up, round up,
       – .49 and down, round down,
       – .50 exactly we round up in decimal
     • In this method we treat it differently…If precise value is exactly halfway between 2 representable values, round towards the number with 0 in the LSB

     Slide 19: Rounding to Nearest Method
     • 3 Cases in binary FP:
       – G = 1 and (R = 1 or S = 1) =>
         • round fraction up (add 1 to fraction)
         • may require a re-normalization
       – G = 1 and R = S = 0 =>
         • round to the closest fraction value with a '0' in the LSB
         • may require a re-normalization
       – G = 0 =>
         • leave fraction alone (add 0 to fraction)

     Slide 20: Round to Nearest
       Precise value              GRS    Rounded value
       1.001100 | 110 x 2^4       110    1.001101 x 2^4 (round up)
       1.111111 | 101 x 2^4       101    1.000000 x 2^5 (round up, then re-normalize)
       1.001101 | 001 x 2^4       001    1.001101 x 2^4 (G = '0', leave fraction alone)
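
The three round-to-nearest cases can be sketched directly on bit strings, using only the guard, round, and sticky bits. This is an illustrative sketch rather than the course's reference implementation: the function names `split_grs` and `round_to_nearest` and the string-based interface are assumptions, and a single guard bit is kept.

```python
def split_grs(bits_after_point, frac_bits=6):
    """Return (kept fraction bits, G, R, S) for a string of bits after '1.'."""
    kept = bits_after_point[:frac_bits].ljust(frac_bits, "0")
    extra = bits_after_point[frac_bits:].ljust(2, "0")
    guard = int(extra[0])
    round_bit = int(extra[1])
    sticky = int("1" in extra[2:])   # logical OR of every remaining bit
    return kept, guard, round_bit, sticky


def round_to_nearest(bits_after_point, exp, frac_bits=6):
    """Round 1.bits * 2^exp to frac_bits fraction bits; return (bit string, exp)."""
    kept, g, r, s = split_grs(bits_after_point, frac_bits)
    significand = int("1" + kept, 2)        # leading 1 made explicit
    if g == 1 and (r == 1 or s == 1):
        significand += 1                    # more than halfway: round up
    elif g == 1 and significand & 1:
        significand += 1                    # exactly halfway: pick LSB = 0
    # g == 0: less than halfway, leave the fraction alone
    if significand >> (frac_bits + 1):      # overflowed to 10.000000
        significand >>= 1                   # re-normalize
        exp += 1
    text = format(significand, "b")
    return text[0] + "." + text[1:], exp


print(split_grs("01001010010"))             # 1.01001010010 x 2^4
print(round_to_nearest("11111011010", 4))   # GRS = 111
print(round_to_nearest("001100110", 4))     # GRS = 110
print(round_to_nearest("111111101", 4))     # GRS = 101, needs re-normalizing
print(round_to_nearest("001101001", 4))     # GRS = 001
print(round_to_nearest("001101100", 4))     # GRS = 100, exactly halfway
```

Working on the integer significand keeps the halfway rule simple: an odd significand has a 1 in its LSB, so adding 1 moves it to the neighbour whose LSB is 0.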

  6. Slide 21: Round to Nearest
     • In all these cases, the numbers are halfway between the 2 possible round values
     • Thus, we round to the value w/ 0 in the LSB
       Precise value              GRS    Rounded value
       1.001100 | 100 x 2^4       100    1.001100 x 2^4
       1.111111 | 100 x 2^4       100    1.000000 x 2^5
       1.001101 | 100 x 2^4       100    1.001110 x 2^4

     Slide 22: Round to 0 (Chopping)
     • Simply drop the G,R,S bits and take fraction as is
       Precise value              GRS    Stored (S | E' | F)
       1.001100 | 001 x 2^4       001    0 | 10011 | 001100
       1.001101 | 101 x 2^4       101    0 | 10011 | 001101
       1.001100 | 111 x 2^4       111    0 | 10011 | 001100

     Slide 23: Important Warning For Programmers
     • FP addition/subtraction is NOT associative
       – Because of rounding / inability to precisely represent fractions, (a+b)+c ≠ a+(b+c)
     • (small + LARGE) – LARGE ≠ small + (LARGE – LARGE)
     • Why? Because of rounding and special values like Inf.
       (0.0001 + 98475) – 98474 ≠ 0.0001 + (98475 – 98474)
                  98475 – 98474 ≠ 0.0001 + 1
                              1 ≠ 1.0001
     • Another Example: 1 + 1.11…1*2^127 – 1.11…1*2^127
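
The warning about associativity can be reproduced by rounding every intermediate result to single precision. The sketch below does this in Python with the standard `struct` module; the helper names `f32`, `add32`, and `sub32` are assumptions for this illustration. Python's own floats are 64-bit doubles, which is why each result is rounded explicitly.

```python
import struct


def f32(x):
    """Round x to single precision, emulating a 32-bit float result."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def add32(a, b):
    """Single-precision addition: add, then round the result to 32 bits."""
    return f32(a + b)


def sub32(a, b):
    """Single-precision subtraction: subtract, then round to 32 bits."""
    return f32(a - b)


small, large, nearly = f32(0.0001), f32(98475.0), f32(98474.0)

left = sub32(add32(small, large), nearly)    # (0.0001 + 98475) - 98474
right = add32(small, sub32(large, nearly))   # 0.0001 + (98475 - 98474)

# Largest finite single-precision value: 1.11...1 (binary) * 2^127.
BIG = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]
absorbed = sub32(add32(1.0, BIG), BIG)       # (1 + BIG) - BIG
kept = add32(1.0, sub32(BIG, BIG))           # 1 + (BIG - BIG)

print(left, right)
print(absorbed, kept)
```

In the first grouping the small term is lost when it is added to the large one, because neighbouring single-precision values near 98475 are 2^-7 apart. In the second grouping the large terms cancel first, so the small term survives.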
