Real Number Representation (PowerPoint presentation)



slide-1
SLIDE 1

1

Real Number Representation

slide-2
SLIDE 2

2

Topics

  • Terminology
  • IEEE standard for floating-point representation
  • Floating point arithmetic
  • Limitations
slide-3
SLIDE 3

3

Terminology

  • All digits in a number following any leading zeros are significant digits:

12.345
0.12345
0.00012345

slide-4
SLIDE 4

4

Terminology (cont)

  • The scientific notation for real numbers is:

\(\text{mantissa} \times \text{base}^{\text{exponent}}\)

  • In C, the expression 12.456e-2 means \(12.456 \times 10^{-2}\)

slide-5
SLIDE 5

5

Terminology (cont)

  • The mantissa is always normalized between 1 and the base (i.e., exactly one significant digit before the point)

Unnormalized ⇒ Normalized
\(2997.9 \times 10^{5}\) ⇒ \(2.9979 \times 10^{8}\)
\(\text{B1.39FC} \times 16^{11}\) ⇒ \(\text{B.139FC} \times 16^{12}\)
\(0.010110110101 \times 2^{-1}\) ⇒ \(1.0110110101 \times 2^{-3}\)

slide-6
SLIDE 6

6

Terminology (cont)

  • The precision of a number is how many digits (or bits) we use to represent it
  • For example:

3
3.14
3.1415926
3.1415926535897932384626433832795028

slide-7
SLIDE 7

7

Representing Numbers

  • A real number n is represented by a floating-point approximation n*
  • The computer uses 32 bits (or more) to store each approximation
  • It needs to store
    – the mantissa
    – the sign of the mantissa
    – the exponent (with its sign)

slide-8
SLIDE 8

8

Representing Numbers (cont)

  • The standard way to allocate 32 bits (specified by IEEE Standard 754) is:
    – 23 bits for the mantissa (bits 22 to 0)
    – 1 bit for the mantissa's sign (bit 31)
    – 8 bits for the exponent (bits 30 to 23)

slide-12
SLIDE 12

12

Representing the Mantissa

  • The mantissa has to be in the range \(1 \le \text{mantissa} < \text{base}\)
  • Therefore
    – If we use base 2, the digit before the point must be a 1
    – So we don't have to worry about storing it
  • We get 24 bits of precision using 23 bits

slide-13
SLIDE 13

13

Representing the Mantissa (cont)

  • 24 bits of precision are equivalent to a little over 7 decimal digits:

\(\dfrac{24}{\log_2 10} \approx 7.2\)

slide-14
SLIDE 14

14

Representing the Mantissa (cont)

  • Suppose we want to represent π:

3.1415926535897932384626433832795...

  • That means that we can only represent it as:

3.141592 (if we truncate)
3.141593 (if we round)
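As a rough check of the two claims above, the following sketch computes how many decimal digits 24 bits carry and rounds π to single precision. It is only an illustration: the helper name `to_float32` is made up here, and because Python floats are doubles, the `struct` module is used to emulate 32-bit storage.

```python
import math
import struct

def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

decimal_digits = 24 / math.log2(10)   # decimal digits carried by 24 bits
pi32 = to_float32(math.pi)            # pi as a 32-bit float would hold it

print(round(decimal_digits, 1))       # 7.2
print(f"{pi32:.7f}")                  # agrees with pi in the first 7 significant digits
```

The stored value differs from π from about the eighth significant digit onward, which matches the "a little over 7 digits" estimate.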

slide-15
SLIDE 15

15

Representing the Exponent

  • The exponent is represented as excess-127. E.g.,

Actual Exponent ↔ Stored Value
–127 ↔ 00000000
–126 ↔ 00000001
. . .
0 ↔ 01111111
+1 ↔ 10000000
. . .
i ↔ \((i+127)_2\)
. . .
+128 ↔ 11111111
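The excess-127 mapping in the table can be sketched in a couple of lines. The function name `stored_exponent` is hypothetical and only serves this illustration; it adds the bias and prints the result as an 8-bit pattern.

```python
BIAS = 127  # excess-127, as in the table above

def stored_exponent(actual):
    """Return the 8-bit stored pattern for an actual exponent."""
    return format(actual + BIAS, "08b")

for actual in (-127, -126, 0, 1, 128):
    print(f"{actual:+5d} <-> {stored_exponent(actual)}")
```

Each printed row should reproduce the corresponding row of the table.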

slide-16
SLIDE 16

16

Representing the Exponent (cont)

  • The IEEE standard restricts exponents to the range: –126 ≤ exponent ≤ +127
  • The exponents –127 and +128 have special meanings:
    – If exponent = –127, the value represented is 0
    – If exponent = 128, the value represented is ∞

slide-17
SLIDE 17

17

Representing Numbers -- Example 1

What is 01011011 (8-bit machine)?

0 101 1011
sign exp mantissa

  • Mantissa: 1.1011
  • Exponent (excess-3 format): 5 - 3 = 2

\(1.1011_2 \times 2^{2} \Rightarrow 110.11_2\)

\(110.11_2 = 2^{2} + 2^{1} + 2^{-1} + 2^{-2} = 4 + 2 + 0.5 + 0.25 = 6.75\)
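The decoding in Example 1 can be written out as a small function. This is a sketch for the slide's toy 8-bit layout only (1 sign bit, 3 exponent bits in excess-3, 4 mantissa bits with an implied leading 1); the name `decode_8bit` is invented for the example, and special values are not handled.

```python
def decode_8bit(bits):
    """Decode the toy format: 1 sign bit, 3 exponent bits (excess-3), 4 mantissa bits."""
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:4], 2) - 3        # remove the excess-3 bias
    mantissa = 1 + int(bits[4:], 2) / 16    # implied leading 1, then 4 fraction bits
    return sign * mantissa * 2 ** exponent

print(decode_8bit("01011011"))  # 6.75
```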

slide-18
SLIDE 18

18

Representing Numbers -- Example 2

Represent -10.375 (32-bit machine)

\(10.375_{10} = 10 + 0.25 + 0.125 = 2^{3} + 2^{1} + 2^{-2} + 2^{-3} = 1010.011_2 \Rightarrow 1.010011_2 \times 2^{3}\)

  • Sign: 1
  • Mantissa: 010011
  • Exponent (excess-127 format): \(3 + 127 = 130_{10} = 10000010_2\)

1 10000010 01001100000000000000000
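One way to check Example 2 is to let the machine produce the bit pattern. The sketch below uses Python's `struct` module to pack -10.375 as a big-endian 32-bit float and splits the bits into the three fields; `float32_fields` is an illustrative name, not part of any library.

```python
import struct

def float32_fields(x):
    """Return the (sign, exponent, fraction) bit strings of x stored as a 32-bit float."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(word, "032b")
    return bits[0], bits[1:9], bits[9:]

print(float32_fields(-10.375))
# ('1', '10000010', '01001100000000000000000')
```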

slide-19
SLIDE 19

19

Floating Point Overflow

  • Floating point representations can overflow, e.g.,

\(1.111111 \times 2^{127} + 1.111111 \times 2^{127}\)
\(= 11.111110 \times 2^{127}\)
\(= 1.1111110 \times 2^{128}\)
\(= \infty\)

slide-20
SLIDE 20

20

Floating Point Underflow

  • Floating point numbers can also get too small, e.g.,

\(10.010000 \times 2^{-126} \div 11.000000 \times 2^{0}\)
\(= 0.110000 \times 2^{-126}\)
\(= 1.100000 \times 2^{-127}\)
\(= 0\)
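Overflow and underflow are easy to observe directly. The sketch below uses Python floats, which are double precision rather than the 32-bit format of the slides, so the limits are larger, but the behaviour is the same: a result beyond the exponent range becomes infinity, and a result too close to zero becomes 0.

```python
import math
import sys

largest = sys.float_info.max      # largest finite double
overflowed = largest + largest    # exceeds the exponent range
smallest = 5e-324                 # smallest positive (denormalized) double
underflowed = smallest / 2        # too small to represent

print(overflowed)    # inf
print(underflowed)   # 0.0
```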

slide-21
SLIDE 21

21

“Normalized”

  • Condition
    – exp ≠ 000…0 and exp ≠ 111…1
  • Exponent coded as biased value: E = Exp – Bias
    – Exp: unsigned value denoted by exp
    – Bias: bias value
      • Single precision: 127 (Exp: 1…254, E: -126…127)
      • Double precision: 1023 (Exp: 1…2046, E: -1022…1023)
      • In general: \(\text{Bias} = 2^{e-1} - 1\), where e is the number of exponent bits
  • Significand coded with implied leading 1: \(M = 1.xxx \ldots x_2\)
    – xxx…x: bits of frac
    – Minimum when 000…0 (M = 1.0)
    – Maximum when 111…1 (M = 2.0 – ε)
    – Get extra leading bit for “free”
slide-22
SLIDE 22

22

Denormalized Values

  • Condition
    – exp = 000…0
  • Value
    – Exponent value E = –Bias + 1
    – Significand value \(M = 0.xxx \ldots x_2\)
      • xxx…x: bits of frac
  • Cases
    – exp = 000…0, frac = 000…0
      • Represents value 0
      • Note that there are distinct values +0 and –0
    – exp = 000…0, frac ≠ 000…0
      • Numbers very close to 0.0
      • Lose precision as they get smaller
      • “Gradual underflow”
slide-23
SLIDE 23

23

Special Values

  • Condition
    – exp = 111…1
  • Cases
    – exp = 111…1, frac = 000…0
      • Represents value ∞ (infinity)
      • Operation that overflows
      • Both positive and negative
      • E.g., 1.0/0.0 = −1.0/−0.0 = +∞, 1.0/−0.0 = −∞
    – exp = 111…1, frac ≠ 000…0
      • Not-a-Number (NaN)
      • Represents case when no numeric value can be determined
      • E.g., sqrt(–1), ∞ − ∞
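The three encoding cases (normalized, denormalized, special) can be combined into one decoder for single precision. The sketch below is an illustration under the rules stated on these slides (bias 127, implied leading 1 for normalized values, E = –126 for denormalized ones); `decode_float32` is a hypothetical helper that takes a 32-character bit string.

```python
def decode_float32(bits):
    """Classify and decode a 32-character bit string as a single-precision value."""
    sign = -1.0 if bits[0] == "1" else 1.0
    exp = int(bits[1:9], 2)      # unsigned value of the exponent field
    frac = int(bits[9:], 2)      # 23 fraction bits as an integer
    if exp == 255:               # exp = 111...1: special values
        if frac == 0:
            return "infinity", sign * float("inf")
        return "NaN", float("nan")
    if exp == 0:                 # exp = 000...0: zero or denormalized, M = 0.frac
        kind = "zero" if frac == 0 else "denormalized"
        return kind, sign * (frac / 2 ** 23) * 2.0 ** -126
    # normalized: M = 1.frac, E = Exp - Bias
    return "normalized", sign * (1 + frac / 2 ** 23) * 2.0 ** (exp - 127)

print(decode_float32("1" + "10000010" + "01001100000000000000000"))
# ('normalized', -10.375)
```

The bit pattern in the usage line is the encoding of -10.375 worked out in Example 2 earlier in the deck.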
slide-24
SLIDE 24

24

Floating Point Representation

Most standard floating point representations use:

  • 1 bit for the sign (positive or negative)
  • 8 bits for the range (exponent field)
  • 23 bits for the precision (fraction field)

S (1 bit) | exponent (8 bits) | fraction (23 bits)

\[
N = \begin{cases}
(-1)^{S} \times 1.\text{fraction} \times 2^{\text{exponent} - 127}, & 1 \le \text{exponent} \le 254 \\
(-1)^{S} \times 0.\text{fraction} \times 2^{-126}, & \text{exponent} = 0
\end{cases}
\]

slide-25
SLIDE 25

25

Floating Point Representation

Example: How is the number \(-6\tfrac{5}{8}\) represented in floating point?

\[
-6\tfrac{5}{8} = -\left(4 + 2 + \tfrac{4}{8} + \tfrac{1}{8}\right) = -\left(4 + 2 + \tfrac{1}{2} + \tfrac{1}{8}\right)
\]
\[
= -\left(1 \times 2^{2} + 1 \times 2^{1} + 0 \times 2^{0} + 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3}\right) = -110.101_2 = -1.10101_2 \times 2^{2}
\]

Thus the exponent is given by:

\(\text{exponent} - 127 = 2 \Rightarrow \text{exponent} = 129\)

1 10000001 10101000000000000000000

slide-26
SLIDE 26

26

Floating Point Representation (example)

What is the decimal value of the following floating point number?

00111101100000000000000000000000

exponent = 64 + 32 + 16 + 8 + 2 + 1 = (128 - 8) + 3 = 120 + 3 = 123

\(N = (-1)^{0} \times 1.0_2 \times 2^{123 - 127} = 1.0 \times 2^{-4} = \dfrac{1}{16}\)

slide-27
SLIDE 27

27

Floating Point Representation (example)

What is the decimal value of the following floating point number?

01000001100101000000000000000000

exponent = 128 + 2 + 1 = 131

\(N = (-1)^{0} \times 1.00101_2 \times 2^{131 - 127} = 1.00101_2 \times 2^{4} = 10010.1_2\)

\(N = 2^{4} + 2^{1} + 2^{-1} = 16 + 2 + \tfrac{1}{2} = 18.5\)

slide-28
SLIDE 28

28

Floating Point Representation (example)

What is the decimal value of the following floating point number?

11000001000101000000000000000000

exponent = 128 + 2 = 130

\(N = (-1)^{1} \times 1.00101_2 \times 2^{130 - 127} = -1.00101_2 \times 2^{3} = -1001.01_2\)

\(N = -\left(2^{3} + 2^{0} + 2^{-2}\right) = -\left(8 + 1 + \tfrac{1}{4}\right) = -9.25\)

slide-29
SLIDE 29

29

Floating Point

What is the largest number that can be represented in 32 bits floating point using the IEEE 754 format above?

01111111011111111111111111111111

exponent = 254

\(\text{fraction} = 1 \times 2^{-1} + 1 \times 2^{-2} + \dots + 1 \times 2^{-22} + 1 \times 2^{-23}\)

\(2^{23} \times \text{fraction} = 2^{23} - 1 \Rightarrow \text{fraction} = 1 - \dfrac{1}{2^{23}} = 1 - \dfrac{1}{8 \times 1024 \times 1024} = 0.99999988079\)

slide-30
SLIDE 30

30

Floating Point

What is the largest number that can be represented in 32 bits floating point using the IEEE 754 format above?

01111111011111111111111111111111

actual exponent = 254 - 127 = 127

fraction = 0.99999988079

\(N = (-1)^{0} \times 1.99999988079 \times 2^{127} \approx 2^{128}\)

slide-31
SLIDE 31

31

Floating Point

What is the smallest number (closest to zero) that can be represented in 32 bits floating point using the IEEE 754 format above?

00000000000000000000000000000001

The stored exponent is 0, so the actual exponent is -126.

\(\text{fraction} = 1 \times 2^{-23}\)

\(N = (-1)^{0} \times 2^{-23} \times 2^{-126} = 2^{-149}\)
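The largest and smallest values can be cross-checked by handing the bit patterns to the machine. The sketch below assumes Python's `struct` module as a stand-in for 32-bit hardware; `bits_to_float32` is an illustrative helper that reinterprets a 32-character bit string as a big-endian single-precision float.

```python
import struct

def bits_to_float32(bits):
    """Interpret a 32-character bit string as a single-precision float."""
    return struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]

largest = bits_to_float32("0" + "11111110" + "1" * 23)
smallest = bits_to_float32("0" * 31 + "1")

print(largest == (2 - 2.0 ** -23) * 2.0 ** 127)  # True: just under 2**128
print(smallest == 2.0 ** -149)                   # True
```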

slide-32
SLIDE 32

32

Special Floating Point Representations

In the 8-bit field of the exponent we can represent numbers from 0 to 255. We studied how to read numbers with exponents from 0 to 254.

What is the value represented when the exponent is 255 (i.e., \(11111111_2\))?

  • An exponent equal to 255 in a floating point representation indicates a special value.
  • When the exponent is 255 and the fraction is 0, the value represented is ± infinity.
  • When the exponent is 255 and the fraction is non-zero, the value represented is Not a Number (NaN).

slide-33
SLIDE 33

33

Double Precision

32-bit floating point representation is usually called single precision representation. A double precision floating point representation requires 64 bits.

In double precision the following number of bits are used:

  • 1 sign bit
  • 11 bits for exponent
  • 52 bits for fraction (also called significand)

slide-34
SLIDE 34

34

Summary of Floating Point Real Number Encodings

[Figure: number line of the encodings, from left to right: NaN, −∞, −Normalized, −Denorm, −0, +0, +Denorm, +Normalized, +∞, NaN]

slide-35
SLIDE 35

35

Special Properties of Encoding

  • FP zero same as integer zero
    – All bits = 0
  • Can (almost) use unsigned integer comparison
    – Must first compare sign bits
    – Must consider -0 = 0
    – NaNs problematic
      • Will be greater than any other values
      • What should comparison yield?
    – Otherwise OK
      • Denorm vs. normalized
      • Normalized vs. infinity
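The comparison property can be illustrated by reading the bit patterns of a few non-negative values as unsigned integers. This is a sketch using Python's `struct` module, with `float32_bits` as an invented helper name; it shows that the integer order matches the floating-point order across denormalized, normalized, and infinite values, and that -0 is the case needing special treatment.

```python
import struct

def float32_bits(x):
    """Return the 32-bit pattern of x (stored as single precision) as an unsigned int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

ascending = [0.0, 2.0 ** -149, 2.0 ** -126, 1.0, 18.5, float("inf")]
patterns = [float32_bits(v) for v in ascending]

print(patterns == sorted(patterns))            # True
print(float32_bits(0.0), float32_bits(-0.0))   # 0 2147483648
```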
slide-36
SLIDE 36

36

Floating Point Addition

Five steps to add two floating point numbers:

  1. Express the numbers with the same exponent (denormalize)
  2. Add the mantissas
  3. Adjust the mantissa to one digit/bit before the point (renormalize)
  4. Round or truncate to required precision
  5. Check for overflow/underflow
slide-37
SLIDE 37

37

Floating Point Addition -- Example 1 (Assume precision 4 decimal digits)

\(x = 9.876 \times 10^{7}\)
\(y = 1.357 \times 10^{6}\)

slide-38
SLIDE 38

38

Floating Point Addition -- Example 1 (cont) (Assume precision 4 decimal digits)

  1. Use the same exponents:

\(x = 9.876 \times 10^{7}\)
\(y = 0.1357 \times 10^{7}\)

slide-39
SLIDE 39

39

Floating Point Addition -- Example 1 (cont) (Assume precision 4 decimal digits)

  2. Add the mantissas:

\(x = 9.876 \times 10^{7}\)
\(y = 0.136 \times 10^{7}\)
\(x + y = 10.012 \times 10^{7}\)

slide-40
SLIDE 40

40

Floating Point Addition -- Example 1 (cont) (Assume precision 4 decimal digits)

  3. Renormalize the sum:

\(x = 9.876 \times 10^{7}\)
\(y = 0.136 \times 10^{7}\)
\(x + y = 1.0012 \times 10^{8}\)

slide-41
SLIDE 41

41

Floating Point Addition -- Example 1 (cont) (Assume precision 4 decimal digits)

  4. Truncate or round:

\(x = 9.876 \times 10^{7}\)
\(y = 0.136 \times 10^{7}\)
\(x + y = 1.001 \times 10^{8}\)

slide-42
SLIDE 42

42

Floating Point Addition -- Example 1 (cont) (Assume precision 4 decimal digits)

  5. Check overflow and underflow:

\(x = 9.876 \times 10^{7}\)
\(y = 0.136 \times 10^{7}\)
\(x + y = 1.001 \times 10^{8}\)

slide-43
SLIDE 43

43

Floating Point Addition -- Example 2 (Assume precision 4 decimal digits)

\(x = 3.506 \times 10^{-5}\)
\(y = -3.497 \times 10^{-5}\)

slide-44
SLIDE 44

44

Floating Point Addition -- Example 2 (cont) (Assume precision 4 decimal digits)

  1. Use the same exponents:

\(x = 3.506 \times 10^{-5}\)
\(y = -3.497 \times 10^{-5}\)

slide-45
SLIDE 45

45

Floating Point Addition -- Example 2 (cont) (Assume precision 4 decimal digits)

  2. Add the mantissas:

\(x = 3.506 \times 10^{-5}\)
\(y = -3.497 \times 10^{-5}\)
\(x + y = 0.009 \times 10^{-5}\)

slide-46
SLIDE 46

46

Floating Point Addition -- Example 2 (cont) (Assume precision 4 decimal digits)

  3. Renormalize the sum:

\(x = 3.506 \times 10^{-5}\)
\(y = -3.497 \times 10^{-5}\)
\(x + y = 9.000 \times 10^{-8}\)

slide-47
SLIDE 47

47

Floating Point Addition -- Example 2 (cont) (Assume precision 4 decimal digits)

  4. Truncate or round:

\(x = 3.506 \times 10^{-5}\)
\(y = -3.497 \times 10^{-5}\)
\(x + y = 9.000 \times 10^{-8}\) (no change)

slide-48
SLIDE 48

48

Floating Point Addition -- Example 2 (cont) (Assume precision 4 decimal digits)

  5. Check overflow and underflow:

\(x = 3.506 \times 10^{-5}\)
\(y = -3.497 \times 10^{-5}\)
\(x + y = 9.000 \times 10^{-8}\)
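The five steps and both worked examples can be traced with a short routine. The sketch below is one possible rendering, not the deck's own code: numbers are passed as hypothetical (mantissa string, exponent) pairs, Python's `decimal` module supplies exact decimal arithmetic, and rounding is applied once at step 4 rather than while aligning the operands. The final overflow/underflow check is left as a comment.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def fp_add(x, y, digits=4):
    """Add two decimal floats given as (mantissa string, exponent) pairs."""
    (mx, ex), (my, ey) = x, y
    # 1. Express both numbers with the larger exponent (denormalize).
    e = max(ex, ey)
    mx = Decimal(mx).scaleb(ex - e)
    my = Decimal(my).scaleb(ey - e)
    # 2. Add the mantissas.
    m = mx + my
    # 3. Renormalize: one nonzero digit before the point.
    if m != 0:
        shift = m.adjusted()
        m, e = m.scaleb(-shift), e + shift
    # 4. Round to the required number of digits.
    unit = Decimal(1).scaleb(-(digits - 1))
    m = m.quantize(unit, rounding=ROUND_HALF_EVEN)
    if abs(m) >= 10:   # rounding carried into a new digit: renormalize again
        m, e = m.scaleb(-1).quantize(unit), e + 1
    # 5. Overflow/underflow checks on e would go here (omitted in this sketch).
    return m, e

print(fp_add(("9.876", 7), ("1.357", 6)))     # (Decimal('1.001'), 8)
print(fp_add(("3.506", -5), ("-3.497", -5)))  # (Decimal('9.000'), -8)
```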

slide-49
SLIDE 49

49

Limitations

  • Floating-point representations only approximate real numbers
  • The normal laws of arithmetic don't always hold, e.g., associativity is not guaranteed

slide-50
SLIDE 50

50

Limitations -- Example

(Assume precision 4 decimal digits)

\(x = 3.002 \times 10^{3}\)
\(y = -3.000 \times 10^{3}\)
\(z = 6.531 \times 10^{0}\)

slide-51
SLIDE 51

51

Limitations -- Example (cont)

(Assume precision 4 decimal digits)

\(x = 3.002 \times 10^{3}\)
\(y = -3.000 \times 10^{3}\)
\(z = 6.531 \times 10^{0}\)
\(x + y = 2.000 \times 10^{0}\)

slide-52
SLIDE 52

52

Limitations -- Example (cont)

(Assume precision 4 decimal digits)

\(x = 3.002 \times 10^{3}\)
\(y = -3.000 \times 10^{3}\)
\(z = 6.531 \times 10^{0}\)
\(x + y = 2.000 \times 10^{0}\)
\((x + y) + z = 8.531 \times 10^{0}\)

slide-53
SLIDE 53

53

Limitations -- Example (cont)

(Assume precision 4 decimal digits)

\(x = 3.002 \times 10^{3}\)
\(y = -3.000 \times 10^{3}\)
\(z = 6.531 \times 10^{0}\)

slide-54
SLIDE 54

54

Limitations -- Example (cont)

(Assume precision 4 decimal digits)

\(x = 3.002 \times 10^{3}\)
\(y = -3.000 \times 10^{3}\)
\(z = 6.531 \times 10^{0}\)
\(y + z = -2.993 \times 10^{3}\)

slide-55
SLIDE 55

55

Limitations -- Example (cont)

(Assume precision 4 decimal digits)

\(x = 3.002 \times 10^{3}\)
\(y = -3.000 \times 10^{3}\)
\(z = 6.531 \times 10^{0}\)
\(y + z = -2.993 \times 10^{3}\)
\(x + (y + z) = 0.009 \times 10^{3}\)

slide-56
SLIDE 56

56

Limitations -- Example (cont)

(Assume precision 4 decimal digits)

\(x = 3.002 \times 10^{3}\)
\(y = -3.000 \times 10^{3}\)
\(z = 6.531 \times 10^{0}\)
\(y + z = -2.993 \times 10^{3}\)
\(x + (y + z) = 9.000 \times 10^{0}\)

slide-57
SLIDE 57

57

Limitations -- Example (cont)

(Assume precision 4 decimal digits)

\(x = 3.002 \times 10^{3}\)
\(y = -3.000 \times 10^{3}\)
\(z = 6.531 \times 10^{0}\)
\(x + (y + z) = 9.000 \times 10^{0}\)
\((x + y) + z = 8.531 \times 10^{0}\)
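The associativity example can be replayed with Python's `decimal` module, which lets the working precision be set to 4 significant digits. This is a sketch to reproduce the slide's numbers under that assumption; it is not how binary hardware rounds, but the effect is the same.

```python
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 4                     # 4 significant decimal digits
    x = Decimal("3.002E3")
    y = Decimal("-3.000E3")
    z = Decimal("6.531")
    left = (x + y) + z               # x + y is exact, so nothing is lost
    right = x + (y + z)              # y + z is rounded to -2993 first

print(left, right)                   # 8.531 9
```

The two groupings differ because the digits of z below the units place are discarded when z is added to the much larger y.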

slide-58
SLIDE 58

58

Mathematical Properties of FP Add

  • Compare to those of Abelian group
    – Closed under addition? YES
      • But may generate infinity or NaN
    – Commutative? YES
    – Associative? NO
      • Overflow and inexactness of rounding
    – 0 is additive identity? YES
    – Every element has additive inverse? ALMOST
      • Except for infinities & NaNs
slide-59
SLIDE 59

59

Circuitry for Addition/Subtraction

slide-60
SLIDE 60

60

Multiplication

  • Multiply significands
  • Add exponents
  • Normalize
    – Shift significand
    – Add or subtract shift amount to exponent
  • Round
    – To number of bits for significand
    – Need to keep extra bits during computation
  • Normalize again if necessary
slide-61
SLIDE 61

61

For multiplication of P = X × Y:

  1. Compute exponent: ExpP = (ExpY + ExpX) - Bias
  2. Compute product: (1 + SigX) × (1 + SigY)
  3. Normalize if necessary; continue until most significant bit is 1
  4. Too small (e.g., 0.001xx...) → left shift result, decrement result exponent
  4'. Too big (e.g., 10.1xx...) → right shift result, increment result exponent
  5. If (result significand is 0) then set exponent to 0
  6. If (SgnX == SgnY) then SgnP = positive (0) else SgnP = negative (1)

slide-62
SLIDE 62

62

FP Multiplication Algorithm

Start

  1. Add the biased exponents of the two numbers, subtracting the bias from the sum to get the new biased exponent.
  2. Multiply the significands.
  3. Normalize the product if necessary, shifting it right and incrementing the exponent.
     Overflow or underflow? If yes: Exception. If no: continue.
  4. Round the significand to the appropriate number of bits.
     Still normalized? If no: go back to step 3. If yes: continue.
  5. Set the sign of the product to positive if the signs of the original operands are the same. If they differ, make the sign negative.

Done

slide-63
SLIDE 63

63

[Figure: floating-point adder datapath. Blocks: control, small ALU (exponent difference), three multiplexers, shift right, big ALU, shift left or right with increment or decrement of the exponent, rounding hardware. Stages: compare exponents, shift smaller number right, add, normalize, round.]

  • FP ADD: exponents are subtracted by the small ALU; the difference controls the 3 MUXes
  • The significand of the number with the smaller exponent is shifted right until the exponents match
  • Significands are added in the big ALU
  • Normalization step shifts the result left or right and adjusts the exponent
  • Rounding and possible renormalization

slide-64
SLIDE 64

64

Floating Point Multiplication (Decimal)

Assume that we can store only four digits of the significand and two digits of the exponent in a decimal floating point representation. How would you multiply \(1.110_{10} \times 10^{10}\) by \(9.200_{10} \times 10^{-5}\) in this representation?

Step 1: Add the exponents: new exponent = 10 - 5 = 5

Step 2: Multiply the significands:

       1.110
     × 9.200
     -------
        0000
       0000
      2220
     9990
    --------
    10.212000

Step 3: Normalize the product: \(10.212_{10} \times 10^{5} = 1.0212_{10} \times 10^{6}\)

Step 4: Round off the product: \(1.0212_{10} \times 10^{6} \rightarrow 1.021_{10} \times 10^{6}\)
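The four steps of the decimal multiplication can be sketched as a small routine. As before this is only an illustration: the (mantissa string, exponent) pair representation and the name `fp_mul` are assumptions made here, exact decimal arithmetic comes from Python's `decimal` module, and the two-digit exponent limit is not enforced.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def fp_mul(x, y, digits=4):
    """Multiply two decimal floats given as (mantissa string, exponent) pairs."""
    (mx, ex), (my, ey) = x, y
    e = ex + ey                          # Step 1: add the exponents
    m = Decimal(mx) * Decimal(my)        # Step 2: multiply the significands
    if m != 0:                           # Step 3: normalize the product
        shift = m.adjusted()
        m, e = m.scaleb(-shift), e + shift
    unit = Decimal(1).scaleb(-(digits - 1))
    m = m.quantize(unit, rounding=ROUND_HALF_EVEN)   # Step 4: round off
    if abs(m) >= 10:                     # rounding carried out: normalize again
        m, e = m.scaleb(-1).quantize(unit), e + 1
    return m, e

print(fp_mul(("1.110", 10), ("9.200", -5)))  # (Decimal('1.021'), 6)
```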

slide-65
SLIDE 65

65

Math. Properties of FP Mult

  • Compare to commutative ring
    – Closed under multiplication? YES
      • But may generate infinity or NaN
    – Multiplication commutative? YES
    – Multiplication is associative? NO
      • Possibility of overflow, inexactness of rounding
    – 1 is multiplicative identity? YES
    – Multiplication distributes over addition? NO
      • Possibility of overflow, inexactness of rounding
  • Monotonicity
    – a ≥ b & c ≥ 0 ⇒ a * c ≥ b * c? ALMOST
      • Except for infinities & NaNs
slide-66
SLIDE 66

66

Floating-Point Division

  • Significands divided - exponents subtracted - bias added to difference E1-E2
  • If resulting exponent out of range - overflow or underflow indication must be generated
  • Resultant significand satisfies 1/β ≤ M1/M2 < β
  • A single base-β shift right of significand + increase of 1 in exponent may be needed in postnormalization step - may lead to an overflow
slide-67
SLIDE 67

67

Remainder in Floating-Point Division

  • Fixed-point remainder - R=X-QD (X, Q, D - dividend, quotient, divisor) - |R| ≤ |D| - generated by division algorithm (restoring or nonrestoring)
  • Flp division - algorithm generates quotient but not remainder - F1 REM F2 = F1-F2⋅Int(F1/F2) (Int(F1/F2) - quotient F1/F2 converted to integer)
  • Conversion to integer - either truncation (removing fractional part) or rounding-to-nearest
  • The IEEE standard uses the round-to-nearest-even mode - |F1 REM F2| ≤ |F2|/2

slide-68
SLIDE 68

68

Floating-Point Remainder

  • Brute-force - continue direct division algorithm for E1-E2 steps
  • Problem - E1-E2 can be much greater than number of steps needed to generate m bits of quotient's significand - may take an arbitrary number of clock cycles
  • Solution - calculate remainder in software
  • Alternative - define a REM-step operation - X REM F2 - performs a limited number of divide steps (e.g., limited to number of divide steps required in a regular divide operation)
    – Initial X=F1, then X=remainder of previous REM-step operation
    – REM-step repeated until remainder ≤ F2/2
slide-69
SLIDE 69

69

Floating Point in C

  • C guarantees two levels
    – float: single precision
    – double: double precision
  • Conversions
    – Casting between int, float, and double changes numeric values
    – double or float to int
      • Truncates fractional part
      • Like rounding toward zero
      • Not defined when out of range (generally saturates to TMin or TMax)
    – int to double
      • Exact conversion, as long as int has ≤ 53 bit word size
    – int to float
      • Will round according to rounding mode
slide-70
SLIDE 70

70

Floating Point Puzzles

int x = …; float f = …; double d = …;
Assume neither d nor f is NaN

  – For each of the following C expressions, either:
    • Argue that it is true for all argument values
    • Explain why not true

  • x == (int)(float) x
  • x == (int)(double) x
  • f == (float)(double) f
  • d == (float) d
  • f == -(-f);
  • 2/3 == 2/3.0
  • d < 0.0 ⇒ ((d*2) < 0.0)
  • d > f ⇒ -f > -d
  • d * d >= 0.0
  • (d+f)-d == f

slide-71
SLIDE 71

71

Answers to Floating Point Puzzles

int x = …; float f = …; double d = …;
Assume neither d nor f is NaN

  • x == (int)(float) x
    No: 24 bit significand
  • x == (int)(double) x
    Yes: 53 bit significand
  • f == (float)(double) f
    Yes: increases precision
  • d == (float) d
    No: loses precision
  • f == -(-f);
    Yes: just change sign bit
  • 2/3 == 2/3.0
    No: 2/3 == 0
  • d < 0.0 ⇒ ((d*2) < 0.0)
    Yes!
  • d > f ⇒ -f > -d
    Yes!
  • d * d >= 0.0
    Yes!
  • (d+f)-d == f
    No: not associative
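Some of the "No" answers can be demonstrated with concrete counterexamples. The puzzles are about C, so the sketch below only emulates the casts: Python floats are doubles, and the hypothetical helper `to_float32` uses the `struct` module to mimic a C `(float)` cast. The specific values are chosen here for illustration.

```python
import struct

def to_float32(x):
    """Emulate a C (float) cast: round to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", float(x)))[0]

x = 2 ** 24 + 1                       # an int that needs 25 significant bits
print(int(to_float32(x)) == x)        # False: float keeps only 24 bits
print(int(float(x)) == x)             # True: double keeps 53 bits

d = 0.1
print(to_float32(d) == d)             # False: narrowing to float loses precision

f = to_float32(0.1)
print(to_float32(float(f)) == f)      # True: float -> double -> float is exact

big, small = 1e20, 1.0
print((big + small) - big == small)   # False: small is absorbed when added to big
```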

slide-72
SLIDE 72

72

MIPS Coprocessors

[Figure: MIPS block diagram. Memory connects to the CPU (registers $0 to $31, arithmetic unit, multiply/divide unit with Lo and Hi registers), Coprocessor 1 (FPU: registers $0 to $31, arithmetic unit), and Coprocessor 0 (traps and memory: registers BadVAddr, Status, Cause, EPC).]

slide-73
SLIDE 73

73

Floating Point in MIPS

MIPS supports the IEEE 754 single-precision and double-precision formats.

MIPS has a separate set of registers to store floating point operands: $f0, $f1, $f2, ...

In single precision, each individual register $f0, $f1, $f2, … contains one single precision (32-bit) value.

In double precision, each pair of registers $f0-$f1, $f2-$f3, … contains one double precision (64-bit) value.
slide-74
SLIDE 74

74

Floating Point in MIPS

In order to load a value into a floating point register, MIPS offers the load word coprocessor instructions, lwcz. Because the floating point coprocessor is coprocessor number 1, the instruction is lwc1. Similarly, to store the value of a floating point register into memory, MIPS offers the store word coprocessor instruction, swc1.

add.s   add.d   FP addition, single or double
sub.s   sub.d   FP subtraction, single or double
mul.s   mul.d   FP multiplication, single or double
div.s   div.d   FP division, single or double
c.x.s   c.x.d   FP comparison, single or double (x = eq, neq, lt, le, gt, or ge)
bc1t            FP branch true
bc1f            FP branch false

slide-75
SLIDE 75

75

Floating Point Instruction in MIPS

What does the following assembly code do?

    lwc1  $f4, 4($sp)
    lwc1  $f6, 8($sp)
    add.s $f2, $f4, $f6
    swc1  $f2, 12($sp)

Reads two floating point values from the stack, performs their addition and stores the result in the stack.

slide-76
SLIDE 76

76

Pentium Bug

  • Pentium FP divider uses an algorithm that generates multiple bits per step
    – FPU uses most significant bits of divisor & dividend/remainder to guess next 2 bits of quotient
    – Guess is taken from lookup table: -2, -1, 0, +1, +2 (if previous guess leaves too large a remainder, quotient is adjusted in subsequent pass of -2)
    – Guess is multiplied by divisor and subtracted from remainder to generate a new remainder
    – Called SRT division after 3 people who came up with idea
  • Pentium table uses 7 bits of remainder + 4 bits of divisor = \(2^{11}\) entries
  • 5 entries of divisors omitted: 1.0001, 1.0100, 1.0111, 1.1010, 1.1101 from PLA (fix is just add 5 entries back into PLA: cost $200,000)
  • Self correcting nature of SRT => string of 1s must follow error
    – e.g., 1011 1111 1111 1111 1111 1011 1000 0010 0011 0111 1011 0100 (2.99999892918)
  • Since indexed also by divisor/remainder bits, sometimes bug doesn’t show even with dangerous divisor value

slide-77
SLIDE 77

77

Pentium bug appearance

  • First 11 bits to right of decimal point always correct: bits 12 to 52 are where the bug can occur (4th to 15th decimal digits)
  • FP divisors near integers 3, 9, 15, 21, 27 are dangerous ones:
    – \(3.0 > d \ge 3.0 - 36 \times 2^{-22}\), \(9.0 > d \ge 9.0 - 36 \times 2^{-20}\)
    – \(15.0 > d \ge 15.0 - 36 \times 2^{-20}\), \(21.0 > d \ge 21.0 - 36 \times 2^{-19}\)
  • 0.333333 x 9 could be problem
  • In Microsoft Excel, try (4,195,835 / 3,145,727) * 3,145,727
    – = 4,195,835 => not a Pentium with bug
    – = 4,195,579 => Pentium with bug (assuming Excel doesn’t have SW bug patch)
    – Rarely noticed since error in 5th significant digit
    – Success of IEEE standard made discovery possible: all computers should get same answer
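The spreadsheet check translates directly into code. The sketch below performs the same division and multiplication in double precision; on a correct divider the residual is zero or negligibly small, whereas the slide's flawed result of 4,195,579 corresponds to a residual of 256. The variable names are chosen here for illustration.

```python
x, y = 4195835.0, 3145727.0
residual = x - (x / y) * y    # should vanish (up to rounding) on a correct divider

print(abs(residual) < 1.0)    # True on hardware without the division flaw
```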
slide-78
SLIDE 78

78

Pentium Bug Time line

  • June 1994: Intel discovers bug in Pentium: takes months to make change, reverify, put into production: plans good chips in January 1995; 4 to 5 million Pentiums produced with bug
  • Scientist suspects errors and posts on Internet in September 1994
  • Nov. 22 Intel press release: “Can make errors in 9th digit ... Most engineers and financial analysts need only 4 or 5 digits. Theoretical mathematician should be concerned. ... So far only heard from one.”
  • Intel claims happens once in 27,000 years for typical spreadsheet user:
    – 1000 divides/day x error rate assuming numbers random
  • Dec 12: IBM claims happens once per 24 days: bans Pentium sales
    – 5000 divides/second x 15 minutes = 4,200,000 divides/day
    – Intel said it regards IBM's decision to halt shipments of its Pentium processor-based systems as unwarranted.

slide-79
SLIDE 79

79

Pentium jokes

  • Q: What's another name for the "Intel Inside" sticker they put on Pentiums?
    A: Warning label.
  • Q: Have you heard the new name Intel has chosen for the Pentium?
    A: The Intel Inacura.
  • Q: According to Intel, the Pentium conforms to the IEEE standards for floating point arithmetic. If you fly in aircraft designed using a Pentium, what is the correct pronunciation of "IEEE"?
    A: Aaaaaaaiiiiiiiiieeeeeeeeeeeee!
  • TWO OF TOP TEN NEW INTEL SLOGANS FOR THE PENTIUM
    9.9999973251 It's a FLAW, Dammit, not a Bug
    7.9999414610 Nearly 300 Correct Opcodes

slide-80
SLIDE 80

80

Pentium conclusion: Dec. 21, 1994 $500M write-off

“To owners of Pentium processor-based computers and the PC community: We at Intel wish to sincerely apologize for our handling of the recently publicized Pentium processor flaw. The Intel Inside symbol means that your computer has a microprocessor second to none in quality and performance. Thousands of Intel employees work very hard to ensure that this is true. But no microprocessor is ever perfect. What Intel continues to believe is technically an extremely minor problem has taken on a life of its own. Although Intel firmly stands behind the quality of the current version of the Pentium processor, we recognize that many users have concerns. We want to resolve these concerns. Intel will exchange the current version of the Pentium processor for an updated version, in which this floating-point divide flaw is corrected, for any owner who requests it, free of charge anytime during the life of their computer. Just call 1-800-628-8686.”

Sincerely,
Andrew S. Grove, President/CEO
Craig R. Barrett, Executive Vice President & COO
Gordon E. Moore, Chairman of the Board

slide-81
SLIDE 81

81

  • Pentium: difference between bugs that board designers must know about and bugs that potentially affect all users
    – Why not make public a complete description of bugs in the latter category?
    – $200,000 cost in June to repair design
    – $500,000,000 loss in December in profits to replace bad parts
    – How much to repair Intel’s reputation?
  • What is the technologist's responsibility in disclosing bugs?

slide-82
SLIDE 82

82

Rounding

  • Why is rounding needed?
  • Infinitely many numbers ⇒ finite representation
  • Integers only overflow
  • Almost all operations need rounding
  • IEEE - specifies algorithms for arithmetic
slide-83
SLIDE 83

83

Numbers need rounding

  • Out of range:
    – \(x > 2 \cdot 2^{E_{max}}\), \(x < 1 \cdot 2^{E_{min}}\)
  • Between 2 floats:
    – \(0.1_{10} = 0.00011001100\ldots_2 = 1.1001100\ldots_2 \cdot 2^{-4}\)
    – \(1.1001_2 \cdot 2^{-4}\)

slide-84
SLIDE 84

84

Measuring Error

  • ULPs (units in the last place)
    – \(1.2 \cdot 10^{-1}\) vs 0.124: 0.4 ulps
    – \(1.2 \cdot 10^{-1}\) vs 0.118: 0.2 ulps
  • Relative error
    – Difference/Original
    – \(1.2 \cdot 10^{-1}\) vs 0.124: Err = 0.004/0.124 = 0.032
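Both error measures can be computed mechanically. The sketch below takes the approximation to be 1.2 × 10^-1 (two significant digits), which is the value for which the slide's figures of 0.4 ulps, 0.2 ulps, and 0.032 work out. The function names and the (mantissa string, exponent) representation are assumptions made for this illustration.

```python
from decimal import Decimal

def error_in_ulps(mantissa, exponent, exact):
    """Error of mantissa x 10^exponent against exact, in units of the last mantissa digit."""
    m = Decimal(mantissa)
    approx = m.scaleb(exponent)
    # One unit in the last place: weight of the final mantissa digit.
    ulp = Decimal(1).scaleb(m.as_tuple().exponent + exponent)
    return abs(approx - Decimal(exact)) / ulp

def relative_error(mantissa, exponent, exact):
    """Difference divided by the original (exact) value."""
    approx = Decimal(mantissa).scaleb(exponent)
    return abs(approx - Decimal(exact)) / Decimal(exact)

print(error_in_ulps("1.2", -1, "0.124"))              # 0.4
print(error_in_ulps("1.2", -1, "0.118"))              # 0.2
print(round(relative_error("1.2", -1, "0.124"), 3))   # 0.032
```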

slide-85
SLIDE 85

85

Calculate Using Rounding

  • Benign cancellation
    – Calculate 10.1 - 9.93 (= 0.17)

      \(1.01 \cdot 10^{1}\)
      \(- \; 0.99 \cdot 10^{1}\)
      \(= 0.02 \cdot 10^{1} = 2.00 \cdot 10^{-1}\)

    – 30 ulps!

slide-86
SLIDE 86

86

Rounding problems

  • Catastrophic cancellation
    – \(b^{2} - 4ac\)
    – both \(b^{2}\) and \(4ac\) are rounded
    – the (-) exposes the error
    – b = 3.34, a = 1.22, c = 2.28: \(b^{2} = 11.2\), \(4ac = 11.1\), \(b^{2} - 4ac = 0.10\), correct = 0.0292 (70.8 ulps)
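The cancellation can be reproduced with three-digit decimal arithmetic. The sketch below uses Python's `decimal` module with the precision set to 3 for the rounded computation and the default precision for the exact one; it is meant only to replay the slide's numbers.

```python
from decimal import Decimal, localcontext

a, b, c = Decimal("1.22"), Decimal("3.34"), Decimal("2.28")

with localcontext() as ctx:
    ctx.prec = 3                # three significant decimal digits
    b2 = b * b                  # 11.1556 rounds to 11.2
    four_ac = 4 * a * c         # 11.1264 rounds to 11.1
    rounded = b2 - four_ac      # the subtraction itself is exact

exact = b * b - 4 * a * c       # default context keeps all digits
print(rounded, exact)           # 0.1 0.0292
```

The subtraction introduces no error of its own; it only exposes the rounding already present in the two products.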

slide-87
SLIDE 87

87

IEEE Arithmetic

  • Requirement:
    – + - • ÷ should be EXACTLY rounded
    – remainder should be EXACTLY rounded
    – Integer conv. should be EXACTLY rounded
  • Not all (transcendental, binary to decimal)
  • “Tie break” - Round to Even
slide-88
SLIDE 88

88

Rounding and IEEE Rounding Modes

  • When we perform math on “real” numbers, we have to worry about rounding to fit the result in the significand field.
  • The FP hardware carries two extra bits of precision, and then rounds to get the proper value
  • Rounding also occurs when converting a double to a single precision value, or converting a floating point number to an integer

Round towards +∞
  • ALWAYS round “up”: 2.001 → 3, -2.001 → -2

Round towards -∞
  • ALWAYS round “down”: 1.999 → 1, -1.999 → -2

Truncate
  • Just drop the last bits (round towards 0)

Round to (nearest) even
  • Normal rounding, almost
slide-89
SLIDE 89

89

Round to Even

  • Round like you learned in grade school
  • Except if the value is right on the borderline, in which case we round to the nearest EVEN number: 2.5 -> 2, 3.5 -> 4
  • Ensures fairness on calculation: this way, half the time we round up on a tie, the other half of the time we round down
  • This is the default rounding mode
slide-90
SLIDE 90

90

Round to Even

  • How will 1.005 be rounded?
    – Round up: 1.01
    – Round even: 1.00
  • Why? Example:
    – \(x_i = x_{i-1} + y - y\), \(x_0 = 1.00\), \(y = 0.125\)
    – Round up: 1.00, 1.01, 1.02, …
    – Round even: 1.00, 1.00, 1.00, …
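The rounding modes and the tie cases can be tried out with Python's `decimal` module, whose rounding constants correspond to the modes named on these slides (ROUND_CEILING for towards +∞, ROUND_FLOOR for towards -∞, ROUND_DOWN for truncation, ROUND_HALF_EVEN for round to even). The helper `rounded` is a hypothetical convenience wrapper, and decimal digits stand in for the bits a real FPU would round.

```python
from decimal import (Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN,
                     ROUND_HALF_EVEN, ROUND_HALF_UP)

def rounded(value, step, mode):
    """Round the decimal string value to a multiple of step using the given mode."""
    return Decimal(value).quantize(Decimal(step), rounding=mode)

print(rounded("2.001", "1", ROUND_CEILING), rounded("-2.001", "1", ROUND_CEILING))  # 3 -2
print(rounded("1.999", "1", ROUND_FLOOR), rounded("-1.999", "1", ROUND_FLOOR))      # 1 -2
print(rounded("-1.999", "1", ROUND_DOWN))                                           # -1
print(rounded("2.5", "1", ROUND_HALF_EVEN), rounded("3.5", "1", ROUND_HALF_EVEN))   # 2 4
print(rounded("1.005", "0.01", ROUND_HALF_UP),
      rounded("1.005", "0.01", ROUND_HALF_EVEN))                                    # 1.01 1.00
```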

slide-91
SLIDE 91

91

Addition:

      1.xxxxx              1.xxxxx              1.xxxxx
    + 1.xxxxx            + 0.001xxxxx         + 0.01xxxxx
     1x.xxxxy              1.xxxxxyyy          1x.xxxxyyy
    post-normalization   pre-normalization    pre and post

  • Guard digits: digits to the right of the first p digits of the significand to guard against loss of digits; they can later be shifted left into the first p places during normalization
  • Addition: carry-out shifted in
  • Subtraction: borrow digit and guard
  • Multiplication: carry and guard; division requires guard

slide-92
SLIDE 92

92

Normalized result, but some non-zero digits to the right of the significand: the number should be rounded.

E.g., B = 10, p = 3 (fields: sign, exponent, significand):

    0 2 1.69 = \(1.6900 \times 10^{2-\text{bias}}\)
  - 0 0 7.85 = \(-0.0785 \times 10^{2-\text{bias}}\)
  = 0 2 1.61 = \(1.6115 \times 10^{2-\text{bias}}\)

One round digit must be carried to the right of the guard digit so that after a normalizing left shift, the result can be rounded, according to the value of the round digit.

IEEE Standard: four rounding modes:
  • round to nearest even (default)
  • round towards plus infinity
  • round towards minus infinity
  • round towards 0

Round to nearest:
  • round digit < B/2: truncate
  • round digit > B/2: round up (add 1 to ULP: unit in last place)
  • round digit = B/2: round to nearest even digit

It can be shown that this strategy minimizes the mean error introduced by rounding.

slide-93
SLIDE 93

93

Sticky Bit

Additional bit to the right of the round digit to better fine tune rounding.

[Figure: the aligned operands d0 . d1 d2 d3 . . . dp-1 0 0 0 and 0 . 0 0 X . . . X X X, extended on the right by the extra digit positions and the sticky bit S. Two subtraction cases are shown, one with sticky bit 0 and one with sticky bit 1; with sticky bit 1 the subtraction generates a borrow.]

Sticky bit: set to 1 if any 1 bits fall off the end of the round digit.

Rounding Summary
  • Radix 2 minimizes wobble in precision
  • Normal operations in +, -, *, / require one carry/borrow bit + one guard digit
  • One round digit needed for correct rounding
  • Sticky bit needed when round digit is B/2 for max accuracy
  • Rounding to nearest has mean error = 0, if a uniform distribution of digits is assumed

slide-94
SLIDE 94

94

Floating-Point Division

  • Significands divided - exponents subtracted - bias added to difference E1-E2
  • If resulting exponent out of range - overflow or underflow indication must be generated
  • Resultant significand satisfies 1/β ≤ M1/M2 < β
  • A single base-β shift right of significand + increase of 1 in exponent may be needed in postnormalization step - may lead to an overflow
  • If divisor=0 - indication of division by zero generated - quotient set to ±∞
  • If both divisor and dividend=0 - result undefined - in the IEEE 754 standard represented by NaN - not a number - also representing uninitialized variables and the result of 0 ⋅ ∞

slide-95
SLIDE 95

95

Remainder

  • Fixed-point remainder - R=X-QD (X, Q, D - dividend, quotient, divisor) - |R| ≤ |D| - generated by division algorithm (restoring or nonrestoring)
  • Flp division - algorithm generates quotient but not remainder - F1 REM F2 = F1-F2⋅Int(F1/F2) (Int(F1/F2) - quotient F1/F2 converted to integer)
  • Conversion to integer - either truncation (removing fractional part) or rounding-to-nearest
  • The IEEE standard uses the round-to-nearest-even mode - |F1 REM F2| ≤ |F2|/2
  • Int(F1/F2) as large as \(\beta^{E_{max} - E_{min}}\) - high complexity
  • Floating-point remainder calculated separately - only when required - for example, in argument reduction for periodic functions like sine and cosine
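The difference between the two integer conversions shows up clearly in Python's standard library, which offers both flavours: `math.remainder` rounds the quotient to the nearest integer with ties going to even, while `math.fmod` truncates it. The sketch assumes Python 3.7 or later, where `math.remainder` is available.

```python
import math

print(math.remainder(11.0, 3.0))  # -1.0: quotient 11/3 rounds to 4
print(math.fmod(11.0, 3.0))       # 2.0: quotient is truncated to 3
print(math.remainder(5.0, 2.0))   # 1.0: the tie 2.5 goes to the even integer 2
print(math.remainder(7.0, 2.0))   # -1.0: the tie 3.5 goes to the even integer 4
```

In each round-to-nearest case the magnitude of the result is at most half the magnitude of the divisor, as stated in the bound above.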

slide-96
SLIDE 96

96

Speeding up

  • Different algorithms may be used
  • Result should be exact
  • Divide: SRT algorithm in Pentium
    – 5/2048 entries in a table
    – 1/9,000,000 chance
    – check:

slide-97
SLIDE 97

97

MIPS R10000 arithmetic units

  • Integer ALU + shifter
    – All instructions take one cycle
  • Integer ALU + multiplier
    – Booth’s algorithm for multiplication (5-10 cycles)
    – Non-restoring division (34-67 cycles)
  • Floating point adder
    – Carry propagate (2 cycles)
  • Floating point multiplier (3 cycles)
    – Booth’s algorithm
  • Floating point divider (12-19 cycles)
  • Floating point square root unit
  • Separate unit for EA calculations
  • Can start up to 5 instructions in 1 cycle
slide-98
SLIDE 98

98

Effect of Loss of Precision

  • According to the General Accounting Office of the U.S. Government, a loss of precision in converting 24-bit integers into 24-bit floating point numbers was responsible for the failure of a Patriot anti-missile battery.

slide-99
SLIDE 99

99

Ariane 5

  – Exploded 37 seconds after liftoff
  – Cargo worth $500 million
  • Why
    – Computed horizontal velocity as floating point number
    – Converted to 16-bit integer
    – Worked OK for Ariane 4
    – Overflowed for Ariane 5
      • Used same software
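The failure mode described above, a floating point value that no longer fits once converted to a 16-bit integer, can be guarded against with an explicit range check. The sketch below is a generic illustration with made-up values and a hypothetical helper name; it is not a reconstruction of the flight software.

```python
def to_int16(value):
    """Convert a float to a signed 16-bit integer, refusing out-of-range values."""
    n = int(value)                       # drops the fractional part
    if not -32768 <= n <= 32767:
        raise OverflowError(f"{value} does not fit in 16 bits")
    return n

print(to_int16(12345.6))                 # 12345
try:
    to_int16(40000.0)                    # larger than the 16-bit maximum of 32767
except OverflowError as err:
    print("overflow:", err)
```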