Floating point representation


  1. Floating point representation

  2. (Unsigned) Fixed-point representation. The numbers are stored with a fixed number of bits for the integer part and a fixed number of bits for the fractional part. Suppose we have 8 bits to store a real number, where 5 bits store the integer part and 3 bits store the fractional part:

$(10111.011)_2$, with place values $2^4\ 2^3\ 2^2\ 2^1\ 2^0 \,.\, 2^{-1}\ 2^{-2}\ 2^{-3}$

Smallest number: $(00000.001)_2 = 0.125$
Largest number: $(11111.111)_2 = 31.875$
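
To make the 8-bit layout concrete, here is a minimal Python sketch (the helper name fixed_to_real is mine, not from the slides) that interprets a 5.3 fixed-point bit string and reproduces the smallest and largest values above.

```python
# A minimal sketch (not from the slides) of the 8-bit unsigned fixed-point
# format above: 5 integer bits and 3 fraction bits.

def fixed_to_real(bits: str) -> float:
    """Interpret an 8-bit string a4 a3 a2 a1 a0 . b1 b2 b3 as a real number."""
    assert len(bits) == 8 and set(bits) <= {"0", "1"}
    integer, fraction = bits[:5], bits[5:]
    value = int(integer, 2)                      # contributes 2^4 ... 2^0
    for k, bit in enumerate(fraction, start=1):  # contributes 2^-1, 2^-2, 2^-3
        value += int(bit) * 2.0 ** (-k)
    return value

print(fixed_to_real("10111011"))  # 10111.011_2 = 23.375
print(fixed_to_real("00000001"))  # smallest nonzero: 0.125
print(fixed_to_real("11111111"))  # largest: 31.875
```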

  3. (Unsigned) Fixed-point representation. Suppose we have 64 bits to store a real number, where 32 bits store the integer part and 32 bits store the fractional part:

$(a_{31} \dots a_2 a_1 a_0 \,.\, b_1 b_2 b_3 \dots b_{32})_2 = \sum_{j=0}^{31} a_j 2^j + \sum_{j=1}^{32} b_j 2^{-j} = a_{31} \times 2^{31} + a_{30} \times 2^{30} + \dots + a_0 \times 2^0 + b_1 \times 2^{-1} + b_2 \times 2^{-2} + \dots + b_{32} \times 2^{-32}$

Smallest number: $a_j = 0 \;\forall j$ and $b_1, b_2, \dots, b_{31} = 0$, $b_{32} = 1 \;\rightarrow\; 2^{-32} \approx 10^{-10}$
Largest number: $a_j = 1 \;\forall j$ and $b_j = 1 \;\forall j \;\rightarrow\; 2^{31} + \dots + 2^0 + 2^{-1} + \dots + 2^{-32} \approx 10^{9}$

  4. (Unsigned) Fixed-point representation. Suppose we have 64 bits to store a real number, where 32 bits store the integer part and 32 bits store the fractional part:

$(a_{31} \dots a_2 a_1 a_0 \,.\, b_1 b_2 b_3 \dots b_{32})_2 = \sum_{j=0}^{31} a_j 2^j + \sum_{j=1}^{32} b_j 2^{-j}$

Smallest number $\approx 10^{-10}$, largest number $\approx 10^{9}$. (Slide figure: number line from 0 to $\infty$ showing the representable window.)

  5. (Unsigned) Fixed-point representation. Range: the difference between the largest and smallest numbers possible. More bits for the integer part → larger range. Precision: the smallest possible difference between any two numbers. More bits for the fractional part → higher precision.

$a_2 a_1 a_0 \,.\, b_1 b_2 b_3$   OR   $a_1 a_0 \,.\, b_1 b_2 b_3 b_4$

Wherever we put the binary point, there is a trade-off between the amount of range and precision. It can be hard to decide how much you need of each! Fix: let the binary point "float".
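
The trade-off can be seen by comparing the two 6-bit splits shown on the slide. This is a small sketch under my own assumptions (the function name and the choice to report only the largest value and the gap are mine).

```python
# A small sketch (assumptions mine) comparing two splits of the same 6 bits
# from the slide: 3 integer + 3 fraction bits vs. 2 integer + 4 fraction bits.

def fixed_point_stats(int_bits: int, frac_bits: int):
    precision = 2.0 ** (-frac_bits)          # gap between adjacent values
    largest = 2.0 ** int_bits - precision    # all bits set to 1
    return largest, precision

for int_bits, frac_bits in [(3, 3), (2, 4)]:
    largest, precision = fixed_point_stats(int_bits, frac_bits)
    print(f"{int_bits}.{frac_bits} split: largest = {largest}, gap = {precision}")
# 3.3 split: largest = 7.875,  gap = 0.125   (more range)
# 2.4 split: largest = 3.9375, gap = 0.0625  (more precision)
```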

  6. Floating-point numbers. A floating-point number can represent numbers of different orders of magnitude (very large and very small) with the same number of fixed digits. In general, in the binary system, a floating point number can be expressed as

$x = \pm q \times 2^m$

where $q$ is the significand, normally a fractional value in the range $[1.0, 2.0)$, and $m$ is the exponent.

  7. Floating-point numbers. Numerical form:

$x = \pm q \times 2^m = \pm b_0 . b_1 b_2 b_3 \dots b_n \times 2^m$

Fractional part of the significand ($n$ digits), $b_i \in \{0, 1\}$. Exponent range: $m \in [L, U]$. Precision: $p = n + 1$.

  8. "Floating" the binary point.

$(1011.1)_2 = 1 \times 8 + 0 \times 4 + 1 \times 2 + 1 \times 1 + 1 \times \tfrac{1}{2} = (11.5)_{10}$
$(10111)_2 = 1 \times 16 + 0 \times 8 + 1 \times 4 + 1 \times 2 + 1 \times 1 = (23)_{10} = (1011.1)_2 \times 2^1$
$(101.11)_2 = 1 \times 4 + 0 \times 2 + 1 \times 1 + 1 \times \tfrac{1}{2} + 1 \times \tfrac{1}{4} = (5.75)_{10} = (1011.1)_2 \times 2^{-1}$

Move the binary point to the left by one bit position: divide the decimal number by 2. Move the binary point to the right by one bit position: multiply the decimal number by 2.
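
A quick check of the shift rule in Python; the snippet below just re-evaluates the slide's numbers, so the variable name is incidental.

```python
# Quick check (my own sketch) that moving the binary point one place
# multiplies or divides the value by 2, using the numbers from the slide.

x = 0b10111          # 10111_2 = 23
print(x)             # 23
print(x / 2)         # 1011.1_2 -> 11.5  (point moved left once:  divide by 2)
print(x / 4)         # 101.11_2 -> 5.75  (point moved left twice: divide by 4)
print(23 == 11.5 * 2 == 5.75 * 4)  # same digits, different scale: True
```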

  9. Converting floating points. Convert $(39.6875)_{10} = (100111.1011)_2$ into floating point representation:

$(39.6875)_{10} = (100111.1011)_2 = (1.001111011)_2 \times 2^5$
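
The same conversion can be sketched in Python. This is not the course's code: normalize and the digit-extraction loop are my own illustration, built on math.frexp from the standard library.

```python
# A sketch (helper names are mine) that reproduces the conversion above:
# split off the exponent with math.frexp, then print the binary significand.

import math

def normalize(x: float):
    q, m = math.frexp(x)          # x = q * 2**m with q in [0.5, 1)
    return 2 * q, m - 1           # rescale so the significand is in [1, 2)

q, m = normalize(39.6875)
print(q, m)                       # 1.240234375  5

# significand in binary: 1.001111011_2
bits = []
frac = q - 1.0
for _ in range(9):
    frac *= 2
    bits.append(str(int(frac)))
    frac -= int(frac)
print("1." + "".join(bits), "x 2^", m)   # 1.001111011 x 2^ 5
```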

  10. Normalized floating-point numbers. Normalized floating point numbers are expressed as

$x = \pm 1 . b_1 b_2 b_3 \dots b_n \times 2^m = \pm 1.f \times 2^m$

where $f$ is the fractional part of the significand, $m$ is the exponent, and $b_i \in \{0, 1\}$. Hidden bit representation: the first bit to the left of the binary point, $b_0 = 1$, does not need to be stored, since its value is fixed. This representation "adds" 1 bit of precision (we will show some exceptions later, including the representation of the number zero).

  11. iClicker question. Determine the normalized floating point representation $1.f \times 2^m$ of the decimal number $x = 47.125$ ($f$ in binary representation and $m$ in decimal).
A) $(1.01110001)_2 \times 2^6$
B) $(1.01110001)_2 \times 2^5$
C) $(1.01111001)_2 \times 2^6$
D) $(1.01111001)_2 \times 2^5$

  12. Normalized floating-point numbers.

$x = \pm q \times 2^m = \pm 1 . b_1 b_2 b_3 \dots b_n \times 2^m = \pm 1.f \times 2^m$

• Exponent range: $m \in [L, U]$
• Precision: $p = n + 1$
• Smallest positive normalized FP number: $\mathrm{UFL} = 2^L$
• Largest positive normalized FP number: $\mathrm{OFL} = 2^{U+1}(1 - 2^{-p})$

  13. Normalized floating point number scale. (Slide figure: number line marked $-\infty$, 0, $+\infty$.)

  14. Floating-point numbers: simple example. A "toy" number system can be represented as $x = \pm 1.b_1 b_2 \times 2^m$ for $m \in [-4, 4]$ and $b_i \in \{0, 1\}$. The positive numbers are:

$m = 0$:  $(1.00)_2 \times 2^0 = 1$,  $(1.01)_2 \times 2^0 = 1.25$,  $(1.10)_2 \times 2^0 = 1.5$,  $(1.11)_2 \times 2^0 = 1.75$
$m = 1$:  $(1.00)_2 \times 2^1 = 2.0$,  $(1.01)_2 \times 2^1 = 2.5$,  $(1.10)_2 \times 2^1 = 3.0$,  $(1.11)_2 \times 2^1 = 3.5$
$m = 2$:  $(1.00)_2 \times 2^2 = 4.0$,  $(1.01)_2 \times 2^2 = 5.0$,  $(1.10)_2 \times 2^2 = 6.0$,  $(1.11)_2 \times 2^2 = 7.0$
$m = 3$:  $(1.00)_2 \times 2^3 = 8.0$,  $(1.01)_2 \times 2^3 = 10.0$,  $(1.10)_2 \times 2^3 = 12.0$,  $(1.11)_2 \times 2^3 = 14.0$
$m = 4$:  $(1.00)_2 \times 2^4 = 16.0$,  $(1.01)_2 \times 2^4 = 20.0$,  $(1.10)_2 \times 2^4 = 24.0$,  $(1.11)_2 \times 2^4 = 28.0$
$m = -1$:  $(1.00)_2 \times 2^{-1} = 0.5$,  $(1.01)_2 \times 2^{-1} = 0.625$,  $(1.10)_2 \times 2^{-1} = 0.75$,  $(1.11)_2 \times 2^{-1} = 0.875$
$m = -2$:  $(1.00)_2 \times 2^{-2} = 0.25$,  $(1.01)_2 \times 2^{-2} = 0.3125$,  $(1.10)_2 \times 2^{-2} = 0.375$,  $(1.11)_2 \times 2^{-2} = 0.4375$
$m = -3$:  $(1.00)_2 \times 2^{-3} = 0.125$,  $(1.01)_2 \times 2^{-3} = 0.15625$,  $(1.10)_2 \times 2^{-3} = 0.1875$,  $(1.11)_2 \times 2^{-3} = 0.21875$
$m = -4$:  $(1.00)_2 \times 2^{-4} = 0.0625$,  $(1.01)_2 \times 2^{-4} = 0.078125$,  $(1.10)_2 \times 2^{-4} = 0.09375$,  $(1.11)_2 \times 2^{-4} = 0.109375$

The same steps are performed to obtain the negative numbers. For simplicity, we show only the positive numbers in this example.
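
A short sketch (my own, not from the slides) that enumerates the 36 positive numbers of this toy system and confirms the extreme values listed above.

```python
# A sketch that enumerates the positive numbers of the toy system
# x = +-1.b1 b2 x 2^m with m in [-4, 4] (names follow the slide; code is mine).

values = sorted(
    (1 + b1 / 2 + b2 / 4) * 2.0 ** m
    for b1 in (0, 1)
    for b2 in (0, 1)
    for m in range(-4, 5)
)
print(len(values))        # 36 positive machine numbers
print(values[:4])         # [0.0625, 0.078125, 0.09375, 0.109375]
print(values[-4:])        # [16.0, 20.0, 24.0, 28.0]
```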

  15. $x = \pm 1.b_1 b_2 \times 2^m$ for $m \in [-4, 4]$ and $b_i \in \{0, 1\}$
• Smallest normalized positive number: $(1.00)_2 \times 2^{-4} = 0.0625$
• Largest normalized positive number: $(1.11)_2 \times 2^{4} = 28.0$
• Any number $x$ closer to zero than 0.0625 would UNDERFLOW to zero.
• Any number $x$ outside the range $[-28.0, +28.0]$ would OVERFLOW to infinity.

  16. Machine epsilon. Machine epsilon ($\epsilon_m$) is defined as the distance (gap) between 1 and the next larger floating point number. For the toy system $x = \pm 1.b_1 b_2 \times 2^m$ with $m \in [-4, 4]$ and $b_i \in \{0, 1\}$:

$(1.00)_2 \times 2^0 = 1$ and the next larger number is $(1.01)_2 \times 2^0 = 1.25$, so

$\epsilon_m = (0.01)_2 \times 2^0 = 2^{-2} = 0.25$
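
As a sanity check, a two-line sketch (mine) of the same gap:

```python
# Sketch: the gap between 1 and the next number in the toy system is 2**-2,
# since only two fraction bits b1 b2 are stored.

n_fraction_bits = 2
eps_m = 2.0 ** (-n_fraction_bits)
print(eps_m)       # 0.25  ( = 0.01_2 x 2^0 )
print(1 + eps_m)   # 1.25, the next representable number after 1
```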

  17. Machine numbers: how are floating point numbers stored?

  18. Floating-point number representation. What do we need to store when representing floating point numbers in a computer? For $x = \pm 1.f \times 2^m$ we need the sign, the exponent $m$, and the significand fraction $f$. Initially, different floating-point representations were used in computers, generating inconsistent program behavior across different machines. Around the 1980s, computer manufacturers started adopting a standard representation for floating-point numbers: the IEEE (Institute of Electrical and Electronics Engineers) 754 standard.

  19. Floating-point number representation. Numerical form: $x = \pm 1.f \times 2^m$. Representation in memory (three fields: sign $s$, stored exponent $c$, significand fraction $f$):

$x = (-1)^s \, 1.f \times 2^{c - \text{shift}}$, where the true exponent is $m = c - \text{shift}$.

  20. Precisions (finite representation: not all numbers can be represented exactly!)

IEEE-754 Single precision (32 bits): sign (1 bit), exponent $c$ (8 bits), significand $f$ (23 bits), with $c = m + 127$.
IEEE-754 Double precision (64 bits): sign (1 bit), exponent $c$ (11 bits), significand $f$ (52 bits), with $c = m + 1023$.
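
To see these fields in an actual machine word, the standard struct module can reinterpret a Python float as its 32-bit single-precision pattern. The helper name float32_fields and the example values are my own illustration, not part of the slides.

```python
# A sketch (not from the slides) that unpacks the three fields of an
# IEEE-754 single-precision number using the standard struct module.

import struct

def float32_fields(x: float):
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # 32 raw bits, big-endian
    s = bits >> 31                 # 1-bit sign
    c = (bits >> 23) & 0xFF        # 8-bit stored exponent, c = m + 127
    f = bits & 0x7FFFFF            # 23-bit fraction
    return s, c, f

s, c, f = float32_fields(1.0)
print(s, c, f)                     # 0 127 0           -> m = 127 - 127 = 0
print(float32_fields(0.15625))     # (0, 124, 2097152) -> 1.01_2 x 2^-3
```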

  21. Special values. With $x = (-1)^s \, 1.f \times 2^m$ stored as the fields $s$, $c$, $f$:

1) Zero: $c = 000\ldots000$, $f = 0000\ldots\ldots0000$
2) Infinity: $+\infty$ ($s = 0$) and $-\infty$ ($s = 1$): $c = 111\ldots111$, $f = 0000\ldots\ldots0000$
3) NaN (results from operations with undefined results): $c = 111\ldots111$, $f = \text{anything} \neq 00\ldots00$

Note that the exponents $c = 000\ldots000$ and $c = 111\ldots111$ are reserved for these special cases, which limits the exponent range for the other numbers.
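
Reusing the hypothetical float32_fields helper from the previous sketch, the reserved patterns can be inspected directly; the exact NaN fraction bits may vary by platform.

```python
# Sketch: the reserved exponent patterns from the slide, viewed through the
# float32_fields helper defined in the previous sketch (helper name is mine).

print(float32_fields(0.0))            # (0, 0, 0)    exponent field all zeros
print(float32_fields(float("inf")))   # (0, 255, 0)  exponent all ones, f = 0
print(float32_fields(float("-inf")))  # (1, 255, 0)
print(float32_fields(float("nan")))   # exponent 255, f != 0 (exact bits vary)
```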

  22. IEEE-754 Single Precision (32-bit). $x = (-1)^s \, 1.f \times 2^m$, stored as sign (1 bit), exponent $c = m + 127$ (8 bits), significand $f$ (23 bits).
$s = 0$: positive sign, $s = 1$: negative sign.
Reserved exponent numbers for special cases: $c = (11111111)_2 = 255$ and $c = (00000000)_2 = 0$; therefore $0 < c < 255$.
The largest exponent is $U = 254 - 127 = 127$.
The smallest exponent is $L = 1 - 127 = -126$.

  23. IEEE-754 Single Precision (32-bit). $x = (-1)^s \, 1.f \times 2^m$. Example: represent the number $x = -67.125$ using the IEEE single-precision standard.

$67.125 = (1000011.001)_2 = (1.000011001)_2 \times 2^6$
$c = 6 + 127 = 133 = (10000101)_2$

Stored fields: sign = 1 (1 bit), exponent = 10000101 (8 bits), fraction = 00001100100000…000 (23 bits).
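
The worked example can be double-checked with a short sketch (mine, using only the standard struct module):

```python
# Sketch (my own check, not from the slides): the bit pattern worked out above
# for x = -67.125 can be confirmed by reinterpreting the float32 bytes.

import struct

(bits,) = struct.unpack(">I", struct.pack(">f", -67.125))
print(f"{bits:032b}")
# 11000010100001100100000000000000
# = 1 | 10000101 | 00001100100000000000000   (sign, c = 133, 23-bit fraction)
```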

  24. IEEE-754 Single Precision (32-bit). $x = (-1)^s \, 1.f \times 2^m$, with $c = m + 127$ stored as the fields $s$, $c$, $f$.
• Machine epsilon ($\epsilon_m$) is defined as the distance (gap) between 1 and the next larger floating point number:
$(1)_{10} = 0\;|\;01111111\;|\;00000000000000000000000$
$(1)_{10} + \epsilon_m = 0\;|\;01111111\;|\;00000000000000000000001$
$\epsilon_m = 2^{-23} \approx 1.2 \times 10^{-7}$
• Smallest positive normalized FP number: $\mathrm{UFL} = 2^L = 2^{-126} \approx 1.2 \times 10^{-38}$
• Largest positive normalized FP number: $\mathrm{OFL} = 2^{U+1}(1 - 2^{-p}) = 2^{128}(1 - 2^{-24}) \approx 3.4 \times 10^{38}$
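
These three constants can also be read off NumPy's float32 metadata; this sketch assumes numpy is installed and simply queries its finfo attributes (eps, tiny, max).

```python
# Sketch: the three quantities above, queried from NumPy's float32 metadata
# (numpy is an assumed dependency; values match the formulas on the slide).

import numpy as np

info = np.finfo(np.float32)
print(info.eps)      # 1.1920929e-07   = 2**-23
print(info.tiny)     # 1.1754944e-38   = 2**-126  (smallest positive normalized)
print(info.max)      # 3.4028235e+38   = 2**128 * (1 - 2**-24)
```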
