numbers

slide-1
SLIDE 1

numbers

1

slide-2
SLIDE 2

base-10 numbers

$12345 = 1 \cdot 10^4 + 2 \cdot 10^3 + 3 \cdot 10^2 + 4 \cdot 10^1 + 5 \cdot 10^0$

$987.65 = 9 \cdot 10^2 + 8 \cdot 10^1 + 7 \cdot 10^0 + 6 \cdot 10^{-1} + 5 \cdot 10^{-2}$

2

slide-3
SLIDE 3

base-2 numbers

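$29_{\text{TEN}}$ (or $29_{10}$) $= 11101_{\text{TWO}}$ (or $11101_2$) $= 1 \cdot 2^4 + 1 \cdot 2^3 + 1 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0$

$4_{\text{TEN}} = 100_{\text{TWO}} = 1 \cdot 2^2 + 0 \cdot 2^1 + 0 \cdot 2^0$

$1.25_{\text{TEN}} = 1.01_{\text{TWO}} = 1 \cdot 2^0 + 0 \cdot 2^{-1} + 1 \cdot 2^{-2}$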

3

slide-4
SLIDE 4

base-16 numbers

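digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F

$15_{\text{TEN}} = \text{F}_{\text{SIXTEEN}} = 15 \cdot 16^0$

$100_{\text{TEN}} = 64_{\text{SIXTEEN}} = 6 \cdot 16^1 + 4 \cdot 16^0$

$0.5_{\text{TEN}} = 0.8_{\text{SIXTEEN}} = 8 \cdot 16^{-1}$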

4

slide-5
SLIDE 5

integers in C++

decimal   C++     octal   C++     hexadecimal   C++
15        15      17      017     F             0xF
99        99      143     0143    63            0x63
16        16      20      020     10            0x10

5

slide-6
SLIDE 6

terminology

base-X number: X is the radix

I will call components of a base-X number “digits”

but not a great term: “digit” sometimes implies base-10
sometimes “radit”
base-2 digit = bit
base-16 digit = nibble (sometimes)

base-10 = decimal
base-2 = binary
base-8 = octal
base-16 = hexadecimal

6

slide-11
SLIDE 11

convert to decimal

$42_{\text{FIVE}} = 4 \cdot 5^1 + 2 \cdot 5^0 = 20_{\text{TEN}} + 2 = 22_{\text{TEN}}$

$121_{\text{THREE}} = 1 \cdot 3^2 + 2 \cdot 3^1 + 1 \cdot 3^0 = 9 + 6 + 1 = 16_{\text{TEN}}$

7

slide-15
SLIDE 15

convert to something (1)

$42_{\text{TEN}}$ as radix 5 $= 132_{\text{FIVE}}$

$42 \div 5 = 8 + \ldots$, $42 \bmod 5 = 2$

$42 = 8 \cdot 5 + 2$

$8 = 1 \cdot 5 + 3$

$1 = 0 \cdot 5 + 1$

8

slide-19
SLIDE 19

convert to something (2)

$121_{\text{TEN}}$ as radix 11 $= 100_{\text{ELEVEN}}$

$121 \div 11 = 11$, $121 \bmod 11 = 0$

$121 = 11 \cdot 11 + 0$

$11 = 1 \cdot 11 + 0$

$1 = 0 \cdot 11 + 1$

9

slide-20
SLIDE 20

special case: base-16 to base-2

each “nibble” (hexadecimal digit) = 4 binary bits

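$uz_{\text{SIXTEEN}} = u \cdot 16^1 + z \cdot 16^0$

$= (u_3 \cdot 2^3 + u_2 \cdot 2^2 + u_1 \cdot 2^1 + u_0 \cdot 2^0) \cdot 2^4 + z_3 \cdot 2^3 + \ldots$

$= u_3 \cdot 2^7 + u_2 \cdot 2^6 + u_1 \cdot 2^5 + u_0 \cdot 2^4 + z_3 \cdot 2^3 + \ldots$

$= (u_3u_2u_1u_0z_3z_2z_1z_0)_{\text{TWO}}$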

10

slide-26
SLIDE 26

special case: base-16 to base-2

each “nibble” (hexadecimal digit) = 4 binary bits

1 2 3 4 SIXTEEN = 0001 0010 0011 0100 TWO

1101 1110 0011 0000 TWO = D E 3 0 SIXTEEN

11

slide-27
SLIDE 27

a note on bytes

one byte = one “octet” = two nibbles (hexadecimal digits) = eight bits

this class: byte is always eight bits

(some very old machines called different sizes “bytes”)

12

slide-31
SLIDE 31

exercise

$17_{\text{NINE}} = {?}_{\text{SEVEN}}$

$17_{\text{NINE}} = 7 + 9 = 16_{\text{TEN}} = 2 \cdot 7 + 2$

$17_{\text{NINE}} = 22_{\text{SEVEN}}$

13

slide-32
SLIDE 32
on math in other bases

you can do math in other bases
usually makes most sense for base 2…

    carries:      1 1 1
              1 2 3 4 4  SIXTEEN
    ×               1 5  SIXTEEN
    --------------------
              5 B 0 5 4
    +       1 2 3 4 4
    --------------------
            1 7 E 4 9 4  SIXTEEN

    $ python3 -c 'print("{:x}".format(0x12344*0x15))'
    17e494

14

slide-33
SLIDE 33

integer representation

modern machines represent integers as a series of bits (base-2)

why not base-10?

15

slide-36
SLIDE 36

ENIAC: base-10 representation

ENIAC: 1946 computer
stored base-10 digits
“ring counter” of ten electronic switches per digit
flipped switch indicates digit stored
tens of vacuum tubes total (each ∼1¼" diameter by 2¾" height)

16

slide-38
SLIDE 38

base-2 representation

base 2: each switch represents one “digit”

much more efficient use of switches

used in some pre-ENIAC electronic computers

Atanasoff-Berry computer (1937, Iowa State)
Z3 (1941, German Laboratory for Aviation)

why not used in ENIAC?

Eckert (ENIAC designer), 1953: “Although [binary-based digit counters] were known at the time of the construction of the ENIAC, it was not used because it required stable resistors, which were then much more expensive than they are now.”

also, important to input/output decimal digits directly

quote: “A Survey of Digital Computer Memory Systems”, Proceedings of the Institute of Radio Engineers, October 1953

17

slide-41
SLIDE 41

base-2 bit addition

    +     0     1
    0    00    01
    1    01    10

exactly one set to 1: result (w/o carry) is 1; otherwise 0
both set to 1: carry is 1; otherwise 0

18

slide-45
SLIDE 45

base-2 capacity

n-bit number:

$b_{n-1}b_{n-2}b_{n-3} \ldots b_2b_1b_0 = \sum_{i=0}^{n-1} b_i \cdot 2^i \le \sum_{i=0}^{n-1} 1 \cdot 2^i = 2^n - 1$

missing pieces:

negative numbers?
non-whole numbers?
what is n?

19

slide-48
SLIDE 48

integer size in C++

varies between machines

compiler uses what makes most sense on each machine?

                    size in bits
type                minimum   on lab machines
unsigned char       8         8
unsigned short      16        16
unsigned int        16        32
unsigned long       32        64

“unsigned”: can’t be negative (no ± sign)
minimum: size required by standard for all C++ compilers
all allowed to be bigger

20

slide-49
SLIDE 49

querying sizes in C++

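    #include <climits>   // C: <limits.h>
    #include <iostream>
    #include <limits>

    int main() {
        // ULONG_MAX, UINT_MAX, USHRT_MAX, UCHAR_MAX: largest value of each unsigned type
        std::cout << USHRT_MAX << '\n';   // e.g. 65535 on lab machines
        // same value as ULONG_MAX
        std::cout << std::numeric_limits<unsigned long>::max() << '\n';
        // sizeof gives the number of *bytes*, e.g. 8 on lab machines
        std::cout << sizeof(unsigned long) << '\n';
        return 0;
    }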

21

slide-52
SLIDE 52

numbering bits

option 1: n-bit number:

$b_{n-1}b_{n-2}b_{n-3} \ldots b_2b_1b_0 = \sum_{i=0}^{n-1} b_i \cdot 2^i$

option 2: n-bit number:

$b_0b_1b_2 \ldots b_{n-3}b_{n-2}b_{n-1} = \sum_{i=0}^{n-1} b_i \cdot 2^{n-i-1}$

two viable ways to number bits
does it matter which I use?

do I have a way to ask for bit i?

22

slide-55
SLIDE 55

numbering bytes

option 1: 4-byte number:

$B_3B_2B_1B_0 = \sum_{i=0}^{3} B_i \cdot 256^i$

option 2: 4-byte number:

$B_0B_1B_2B_3 = \sum_{i=0}^{3} B_i \cdot 256^{3-i}$

two viable ways to number bytes
does it matter which I use?

in memory, yes: each byte needs an address (number)

23

slide-57
SLIDE 57

memory

memory (as 64-bit values):

    addr.   value (64-bit)
    0       123999
    8       323232
    16      434093
    …       …

$123999 = 1 \cdot 256^2 + 228 \cdot 256^1 + 95 \cdot 256^0$

if little endian (as 8-bit values):

    addr.   0    1    2   3   4   5   6   7   8    9    10  11  …
    value   95   228  1   0   0   0   0   0   160  238  4   0   …

if big endian (as 8-bit values):

    addr.   0   1   2   3   4   5   6    7    8   9   10  11  …
    value   0   0   0   0   0   1   228  95   0   0   0   0   …

24

slide-64
SLIDE 64

finding endianness in C++

    #include <cstddef>
    #include <iostream>
    using std::cout; using std::hex; using std::endl;

    int main() {
        unsigned long value = 0x0123456789ABCDEF;  // assumes 64-bit unsigned long
        cout << hex << value << endl;
        // pointer to the byte with the lowest address in value
        unsigned char *ptr = (unsigned char*) &value;
        for (std::size_t i = 0; i < sizeof(unsigned long); ++i) {
            // cast to int to output as number, not character
            cout << (int) ptr[i] << " ";
        }
        cout << endl;
        return 0;
    }

little endian (e.g. lab machine):

    123456789abcdef
    ef cd ab 89 67 45 23 1

big endian:

    123456789abcdef
    1 23 45 67 89 ab cd ef

get pointer to byte with lowest address in value
unless you do something like this, won’t see endianness
use pointer to get ith byte of value (cast to int to output as number, not character)
little endian: byte 0 is least significant (affects overall value the least)
big endian: byte 0 is most significant (affects overall value the most)
but we don’t write numbers in a different order based on which end we call “part 0”

25

slide-65
SLIDE 65

little versus big endian

little endian: least significant part has the lowest address

i.e. index 0 is the one’s place

big endian: most significant part has the lowest address

i.e. index n − 1 is the one’s place

26

slide-66
SLIDE 66

endianness in the real world

today and this course: little endian is dominant

e.g. x86, typically ARM

historically: big endian was dominant

e.g. typically SPARC, POWER, Alpha, MIPS, …
still commonly used for networking because of this

many architectures have switchable endianness

e.g. ARM, SPARC, POWER, MIPS
usually, OS chooses one endianness

27

slide-67
SLIDE 67

middle endian

sometimes not just big/little endian
e.g. number bytes most to least significant as 5, 6, 7, 8, 1, 2, 3, 4
e.g. doubles on little-endian ARM
generally some sort of historical accident

e.g. ARM floating point designed for big endian?

28

slide-68
SLIDE 68

endianness is about addresses

endianness is about numbering, not (necessarily) placement on the page

but, probably assume English order (left to right, etc.) if not otherwise specified

    addr.   0    1    2   3   4   5   6   7   8    9    10  11  …
    value   95   228  1   0   0   0   0   0   160  238  4   0   …

=

    addr.   …   11  10  9    8    7   6   5   4   3   2   1    0
    value   …   0   4   238  160  0   0   0   0   0   1   228  95

29

slide-69
SLIDE 69

endianness and bit-order

we won’t talk about bit order because bits don’t have addresses

if I say “bit 0”, question: “numbering from least significant or most significant”?

nothing about how pointers, etc. work suggests either answer is correct

30

slide-70
SLIDE 70

endianness and writing out bytes

0x0102 in binary: 0000000100000010

English’s order: most significant first

bytes of 0x0102 in big endian: (byte 0) 00000001 (byte 1) 00000010
bytes of 0x0102 in little endian: (byte 0) 00000010 (byte 1) 00000001

usually, we don’t change the order we write bits
if writing out bytes, first in reading order is usually lowest address

(we’ll specify if not)

31

slide-71
SLIDE 71

representing negative numbers

unsigned integers

$0000 \ldots 0000 = 0$
$1111 \ldots 1111 = 2^n - 1$
$0111 \ldots 1111 = 2^{n-1} - 1$
$1000 \ldots 0000 = 2^{n-1}$

signed integers

$0000 \ldots 0000 = 0$
$1111 \ldots 1111 =$ ???
$0111 \ldots 1111 = 2^{n-1} - 1$
$1000 \ldots 0000 =$ ???

positive numbers up to $2^{n-1} - 1$
goal: same bits, signed or not

unsigned: $2^{n-1}$ and bigger
signed: negative numbers, but how?

    bits       sign & magnitude    1’s complement     2’s complement
    000…000    $0$                 $0$                $0$
    011…111    $2^{n-1} - 1$       $2^{n-1} - 1$      $2^{n-1} - 1$
    100…000    $-0$                $-2^{n-1} + 1$     $-2^{n-1}$
    111…111    $-2^{n-1} + 1$      $-0$               $-1$

two representations of zero? x == y needs to do something special
more negative values than positive values?
all 1’s: least negative?
all 1’s: most negative?

32

slide-81
SLIDE 81

sign and magnitude

unsigned integers

$0000 \ldots 0000 = 0$
$1111 \ldots 1111 = 2^n - 1$
$0111 \ldots 1111 = 2^{n-1} - 1$
$1000 \ldots 0000 = 2^{n-1}$

first bit is “sign bit”: 0 = positive, 1 = negative
flip sign bit to negate number
adding 1 goes in a different direction if negative

$0000 \ldots 0000 = +0$
$1111 \ldots 1111 = -2^{n-1} + 1$
$0111 \ldots 1111 = 2^{n-1} - 1$
$1000 \ldots 0000 = -0$

$0000 \ldots 0101 = 5$
$1000 \ldots 0101 = -5$

33

slide-85
SLIDE 85

1’s complement

unsigned integers

$0000 \ldots 0000 = 0$
$1111 \ldots 1111 = 2^n - 1$
$0111 \ldots 1111 = 2^{n-1} - 1$
$1000 \ldots 0000 = 2^{n-1}$

flip all bits to negate number
adding 1 goes in the same direction, no matter the original sign

$0000 \ldots 0000 = +0$
$1111 \ldots 1111 = -0$
$0111 \ldots 1111 = 2^{n-1} - 1$
$1000 \ldots 0000 = -2^{n-1} + 1$

$0000 \ldots 0101 = 5$
$1111 \ldots 1010 = -5$

34

slide-89
SLIDE 89

two’s complement

unsigned integers

$0000 \ldots 0000 = 0$
$1111 \ldots 1111 = 2^n - 1$
$0111 \ldots 1111 = 2^{n-1} - 1$
$1000 \ldots 0000 = 2^{n-1}$

flip all bits and add 1 to negate number
adding 1 goes in the same direction, no matter the original sign

$0000 \ldots 0000 = 0$
$1111 \ldots 1111 = -1$
$0111 \ldots 1111 = 2^{n-1} - 1$
$1000 \ldots 0000 = -2^{n-1}$

$0000 \ldots 0101 = 5$
$1111 \ldots 1011 = -5$

35

slide-93
SLIDE 93

2’s complement (alt. perspective)

2’s complement (5 bit): the top bit is the “$-2^4$s place”

$+10 = 01010_{\text{TWO}}$

$0 \cdot (-2^4) + 1 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0 = 2^3 + 2^1 = 10$

$-10 = 10110_{\text{TWO}}$

$1 \cdot (-2^4) + 0 \cdot 2^3 + 1 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0 = -2^4 + 2^2 + 2^1 = -10$

36

slide-99
SLIDE 99

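unsigned v. 2’s complement

    bits   unsigned   2’s complement
    000    0          0
    001    1          1
    010    2          2
    011    3          3
    100    4          −4
    101    5          −3
    110    6          −2
    111    7          −1

(figure: the eight patterns arranged in a circle; one step in one direction is “add 1”, one step in the other direction is “add 7 or −1”)

2’s complement addition is same as unsigned addition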

37

slide-100
SLIDE 100
other 2’s complement arithmetic

subtraction: also the same as unsigned
multiplication (repeated addition): mostly the same
(but needs some extra precision)

slide-101
SLIDE 101

converting to 2’s complement (version 1)

take absolute value, convert to bits
if negative, flip all the bits and add one

−14 → −00001110 → 11110001 + 1 → 11110010
−127 → −01111111 → 10000000 + 1 → 10000001
−128 → −10000000 → 01111111 + 1 → 10000000

slide-102
SLIDE 102

converting to 2’s complement (version 2)

if negative, take absolute value, subtract from $2^n$, encode that

−14 → $2^8 - 14 = 242$ → 11110010
−127 → $2^8 - 127 = 129$ → 10000001
−128 → $2^8 - 128 = 128$ → 10000000

slide-104
SLIDE 104

sign extension

have 8-bit 2’s complement number 1101 0111
what is this as a 16-bit 2’s complement number?

general rule: extend by copying the sign bit (“sign extension”):

1101 0111 → 1111 1111 1101 0111
0010 1111 → 0000 0000 0010 1111

slide-107
SLIDE 107

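two’s complement summary

32-bit place values, shown with every bit set:

bit:     1          1          1         …   1        1        1
place:   $-2^{31}$  $+2^{30}$  $+2^{29}$  …   $+2^2$   $+2^1$   $+2^0$

$-2^{31} + 2^{30} + 2^{29} + \dots + 2^2 + 2^1 + 2^0 = -1$

1111 1111… 1111 = −1
0000 0000… 0001 = 1
0111 1111… 1111 = $2^{31} - 1$
1000 0000… 0000 = $-2^{31}$
1000 0000… 0001 = $-2^{31} + 1$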

slide-108
SLIDE 108

integer overflow

“wrap around”

8-bit signed: 127 + 1 → −128
8-bit unsigned: 255 + 1 → 0
16-bit signed: 32 767 + 1 → −32 768
16-bit unsigned: 65 535 + 1 → 0
32-bit signed: maximum around 2 billion
64-bit signed: maximum around $9 \times 10^{18}$
…

slide-109
SLIDE 109
on integer overflow in C++ (1)

unsigned int x;  // lab machines: 32-bit unsigned
x = 4294967295;  // (2 to the 32) minus 1
x += 10;
cout << x << endl;
// OUTPUT: 9

slide-110
SLIDE 110
on integer overflow in C++ (2)

int x;           // lab machines: 32-bit signed
x = 2147483647;  // maximum integer
x += 10;         // UNDEFINED!
cout << x << endl;
// EXPECT big negative number,
// but not guaranteed

in practice: usually get wraparound behavior…
but the compiler is not required to do this for signed numbers
and sometimes takes advantage of this to optimize

slide-111
SLIDE 111

some real numbers

1/3, −100/7, π, 0.1, √2, …

want to represent these: accurately? compactly? efficiently?

slide-113
SLIDE 113

fixed point

1/3 = $0.0101010101\ldots_{\text{TWO}} \approx +0000.0101_{\text{TWO}}$: represent as 0 0000 0101
−100/7 = $-1110.010010010\ldots_{\text{TWO}} \approx -1110.0100_{\text{TWO}}$: represent as 1 0001 1100 (the 2’s complement of 0 1110 0100)

$x \approx y/2^K$: represent with fixed-sized signed integer $y$

this case: $y/2^4$ and $y$ is 9 bits

slide-114
SLIDE 114

why fixed-point?

$x \approx y/2^K$ ($y$ fixed-sized signed integer)

math similar to integer math:
    addition/subtraction: same
    multiplication: same except divide by $2^K$
    division: same except multiply by $2^K$

easy to understand what values are represented well

slide-115
SLIDE 115

why not fixed-point?

pretty small range of numbers for space used
hard to choose a $2^K$ that works for lots of applications

slide-121
SLIDE 121

recall (?): scientific notation

+1/3 = +0.33333333… ≈ $+3.33 \cdot 10^{-1}$
−100/7 = −14.285714… ≈ $-1.42 \cdot 10^{+1}$

$\pm\,\text{mantissa} \cdot \text{base}^{\text{exponent}}$

slide-122
SLIDE 122

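base-2 scientific notation

1/3 = $0.0101010101\ldots_{\text{TWO}} \approx 0.010101010101_{\text{TWO}} = +1.0101010101_{\text{TWO}} \cdot 2^{-2}$
−125/4 = $-11111.01_{\text{TWO}} = -1.111101_{\text{TWO}} \cdot 2^{4}$
−100/7 = $-1110.01001001\ldots_{\text{TWO}} \approx -1110.0100101_{\text{TWO}} = -1.1100100101_{\text{TWO}} \cdot 2^{3}$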

slide-126
SLIDE 126

IEEE half-precision floating point

$-1.1100100101_{\text{TWO}} \cdot 2^{3}$

sign (1 bit): 0 for +, 1 for −
exponent (5 bits): store 3 + 15 = 18; 15 is the ‘bias’
mantissa (10 bits): don’t store the leading “1.” (because it is always present)

1 10010 1100100101

on a typical little-endian system:
byte 0: 00100101
byte 1: 11001011

slide-127
SLIDE 127

IEEE half precision float

1 sign bit (1 for negative)
5 exponent bits
    bias of 15: if the bits read as unsigned are e, the exponent is E = e − 15
10 mantissa bits
    leading “1.” not stored

value = $(1 - 2 \cdot \text{sign}) \cdot (1.\text{mantissa}_{\text{TWO}}) \cdot 2^{e - 15}$

slide-128
SLIDE 128

approximation

example: represented 100/7 ≈ 14.285 as 1829/128 ≈ 14.289
too large by 3/896

10 bits mantissa + implicit “1”: about $\log_{10}(2^{11}) \approx 3.3$ decimal digits

slide-129
SLIDE 129
other IEEE precisions

                  half              single             double                quad
C++*/Java type    none              float              double                none
sign bits         1                 1                  1                     1
exponent bits     5                 8                  11                    15
exponent bias     15 ($2^4 - 1$)    127 ($2^7 - 1$)    1023 ($2^{10} - 1$)   16383 ($2^{14} - 1$)
mantissa bits     10                23                 52                    112
total bits        16                32                 64                    128

(* = typical C++ type; might vary in some implementations)

slide-130
SLIDE 130
on exponent bias

general rule: $2^{\text{exponent bits} - 1} - 1$
i.e. 0111…1 means $2^0$
idea: best at representing numbers around 1
slide-131
SLIDE 131

diversion: 25.25 to binary

$25.25 = 25 + \frac{1}{4} = \frac{101}{4} = \frac{1100101_{\text{TWO}}}{2^2} = 11001.01_{\text{TWO}}$

slide-132
SLIDE 132

diversion: 25.25 to binary

$25.25 = 2^4 + (25.25 - 2^4) = 2^4 + 9.25$
$= 2^4 + 2^3 + (9.25 - 2^3) = 2^4 + 2^3 + 1.25$   ($1.25 < 2^2$, $1.25 < 2^1$)
$= 2^4 + 2^3 + 2^0 + (1.25 - 2^0) = 2^4 + 2^3 + 2^0 + 0.25$   ($0.25 < 2^{-1}$)
$= 2^4 + 2^3 + 2^0 + 2^{-2} + (0.25 - 2^{-2}) = 2^4 + 2^3 + 2^0 + 2^{-2}$

slide-133
SLIDE 133

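float example: manually (1)

$25.25 = \frac{101}{4} = \frac{101}{2^2}$

largest power of two < 25.25? $16 = 2^4$ (means $1 < 25.25/16 < 2$)

$\frac{101}{4} \cdot \frac{2^4}{2^4} = \frac{101 \cdot 2^4}{2^6} = \frac{101}{2^6} \times 2^4 = \frac{1100101_{\text{TWO}}}{2^6} \times 2^4 = 1.100101_{\text{TWO}} \times 2^4$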

slide-136
SLIDE 136

float example: manually (2)

$25.25 = \frac{101}{4} = 11001.01_{\text{TWO}} = +1.1001\,0100\,0000\,0000\,0000\,000_{\text{TWO}} \cdot 2^4$

sign (1 bit): 0 for +
exponent (8 bits): store $4 + 127 = 131 = 1000\,0011_{\text{TWO}}$; 127 is the bias for float
mantissa (23 bits): leading “1.” not stored

0 1000 0011 1001 0100 0000 0000 0000 000

slide-137
SLIDE 137

float example: from C++

#include <iostream>
using std::cout; using std::hex; using std::endl;

// union: all elements use the *same memory*
union floatOrInt {
    float f;
    unsigned int u;
};

int main() {
    union floatOrInt x;
    x.f = 25.25;
    cout << hex << x.u << endl;
    // OUTPUT: 41ca0000
}

hex:     4    1    c    a    0    0    0    0
bits: 0100 0001 1100 1010 0000 0000 0000 0000

slide-141
SLIDE 141

float example 2: manually

$0.1_{\text{TEN}} = \frac{1}{16} + 0.0375$
$= \frac{1}{16} + \frac{1}{32} + 0.00625$
$= \frac{1}{16} + \frac{1}{32} + \frac{1}{256} + 0.00234375$
$= \ldots$
$= 0.00011001100110011\ldots_{\text{TWO}}$
$\approx +1.1001\,1001\,1001\,1001\,1001\,101_{\text{TWO}} \cdot 2^{-4}$

sign (1 bit): 0 for +
exponent (8 bits): store $-4 + 127 = 123 = 0111\,1011_{\text{TWO}}$
mantissa (23 bits): last 1 from rounding

0 0111 1011 1001 1001 1001 1001 1001 101

closest float to 0.1 is between 0.1 and 0.1000001

slide-142
SLIDE 142

aside: binary long division


slide-143
SLIDE 143

float example 2: inaccurate (1)

#include <iostream>
using std::cout; using std::endl;

int main(void) {
    int count;
    float base = 0.1f;
    for (count = 0; base * count < 10000000; ++count) {}
    cout << count << endl;
    // OUTPUT: 99999996
    return 0;
}

slide-144
SLIDE 144

float example 2: inaccurate (2)

#include <iostream>
using std::cout; using std::endl;

int main(void) {
    int count = 0;
    for (float f = 0; f < 2000.0; f += 0.1) {
        ++count;
    }
    cout << count << endl;
    // OUTPUT: 20004
    return 0;
}

slide-145
SLIDE 145

float example 2: inaccurate (3)

#include <iostream>
using std::cout; using std::endl;

int main(void) {
    cout.precision(30);
    for (float f = 0; f < 2000.0; f += 0.1) {
        cout << f << endl;
    }
    return 0;
}

0.100000001490116119384765625
0.20000000298023223876953125
…
2.2000000476837158203125
2.2999999523162841796875
…

slide-150
SLIDE 150

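float to number (1)

1 sign bit
8 exponent bits ($2^{8-1} - 1$ bias)
23 mantissa bits

1 1000 0000 1100 0000 0000 0000 0000 000 = ???

$10000000_{\text{TWO}} = 128_{\text{TEN}}$

$-1.1100\ldots_{\text{TWO}} \cdot 2^{128 - 127} = -1.11_{\text{TWO}} \cdot 2^{1} = -11.1_{\text{TWO}} = -3.5_{\text{TEN}}$

or: $-1.11_{\text{TWO}} \cdot 2^1 = -(2^0 + 2^{-1} + 2^{-2}) \cdot 2^1 = -(1.75) \cdot 2 = -3.5$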

slide-156
SLIDE 156

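float to number (2)

1 sign bit
8 exponent bits ($2^{8-1} - 1$ bias)
23 mantissa bits

0 1000 0011 1001 0000 0000 0000 0000 000 = ???

$10000011_{\text{TWO}} = 131_{\text{TEN}}$

$+1.10010000\ldots_{\text{TWO}} \cdot 2^{131 - 127} = +1.1001_{\text{TWO}} \cdot 2^4 = +11001_{\text{TWO}} = 25_{\text{TEN}}$

or: $(2^0 + 2^{-1} + 2^{-4}) \cdot 2^4 = (1 + .5 + .0625) \cdot 16 = (1.5625) \cdot 16 = 25$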

slide-159
SLIDE 159

float addition

1 sign bit
8 exponent bits ($2^{8-1} - 1$ bias)
23 mantissa bits

  0 1000 0000 1000 1000 0000 0000 0000 000
+ 0 0111 1111 0001 0000 0000 0000 0000 000
= ???

$1.10001_{\text{TWO}} \cdot 2^1 + 1.0001_{\text{TWO}} \cdot 2^0 = (11.0001_{\text{TWO}} + 1.0001_{\text{TWO}}) \cdot 2^0 = 100.0010_{\text{TWO}} \cdot 2^0 = 4.125_{\text{TEN}}$

use difference between exponents to ‘shift’ mantissa; then add

slide-160
SLIDE 160

floating point is not uniform

in half-precision, next number after:

$1 = 1.000\,000\,000\,0_{\text{TWO}} \cdot 2^0$ is $1.000\,000\,000\,1_{\text{TWO}} \cdot 2^0 \approx 1.0010_{\text{TEN}}$ (∼ +.001)
$100 = 1.100\,100\,000\,0_{\text{TWO}} \cdot 2^6$ is $1.100\,100\,000\,1_{\text{TWO}} \cdot 2^6 \approx 100.06_{\text{TEN}}$ (∼ +.06)

possible numbers are unevenly spaced
same as with ‘normal’ scientific notation:
$1 = 1.00 \cdot 10^0 \to 1.01 \cdot 10^0 = 1.01$ versus $1.00 \cdot 10^2 \to 1.01 \cdot 10^2 = 101$

slide-161
SLIDE 161

don’t compare with ==/!=

double x = 0.3;
double y = 0.1;
double y3 = y * 3;
if (x != y3) {
    cout << "not equal" << endl;
}
cout.precision(30);
cout << x << endl;
cout << y3 << endl;

OUTPUT:
not equal
0.299999999999999988897769753748
0.300000000000000044408920985006

slide-162
SLIDE 162
on comparing floats

#include <cmath>
using std::fabs;
// or #include <math.h> and use fabs
// without a using statement
...
// choose based on expected accuracy
const float EPSILON = 1e-6;
float x, y;
...
if (fabs(x - y) < EPSILON) {
    ...
}

slide-163
SLIDE 163

floating point accuracy

float: about 7 significant decimal digits
double: about 15 significant decimal digits

slide-164
SLIDE 164

rounding errors (1)

$2^{100} + 1$ cannot be represented exactly
    would need 100 mantissa bits
    rounds to $2^{100}$

(but $2^{100}$ and 1 can)

slide-165
SLIDE 165

rounding errors (2)

$(2^{100} + 1) - 2^{100}$ → $2^{100} - 2^{100}$ → 0

$(2^{100} - 2^{100}) + 1$ → $0 + 1$ → 1

slide-166
SLIDE 166

dealing with rounding errors

avoid adding and subtracting values of very different magnitudes
    tend to have big errors
    tend to have errors in one direction (compound over a calculation)

…by reordering and rearranging calculations

slide-167
SLIDE 167

the problem of 0

0 is a very important number
can’t be represented with implicit “1.”
solution: special cases

slide-168
SLIDE 168

IEEE float special cases

exponent bits   mantissa bits   meaning
00000000        000…000         ±0
00000000        non-zero        denormal number
11111111        000…000         ±∞
11111111        non-zero        not a number (NaN)

(+1/1000000000) ÷ huge positive number = +0
(−1/1000000000) ÷ huge positive number = −0
(+1000000000) × huge positive number = +∞
(−1000000000) × huge positive number = −∞
1 ÷ 0 = +∞
0 ÷ 0 = NaN
√−1 = NaN

slide-169
SLIDE 169

float min magnitude value

exponent of 0000 0001 (not 0 since that’s special)
mantissa of 000…000

$1.000000\ldots_{\text{TWO}} \cdot 2^{1 - \text{bias}} = 2^{-126}$

slide-170
SLIDE 170

float max magnitude value

exponent of 1111 1110 (not all 1s since that’s special)
mantissa of 111…111

$1.111111\ldots11_{\text{TWO}} \cdot 2^{254 - \text{bias}} = 1.11111\ldots1_{\text{TWO}} \cdot 2^{127} = 2^{128} - 2^{104}$

slide-171
SLIDE 171
on denormals

denormals: minimum exponent bits, non-zero mantissa
    smaller in magnitude than the “normal” minimum value
    ignore the “implicit 1.” rule

notorious for being super slow on some systems
    some CPUs take 100s of times longer to compute on them

we won’t ask you about them

slide-172
SLIDE 172

decimal floating point

what if storing 0.001 exactly is important?
floating point formats with a base of 10 instead of 2
    $1.000 \times 10^{-3}$

example: IEEE decimal floating point
    32, 64, 128-bit formats
    still store exponent + mantissa
    no leading “1.” trick (doesn’t work with $10^x$)

slide-173
SLIDE 173

binary-coded decimal

what if integer conversion to/from base-10 is important, but we want to use binary hardware?

one option: every 4 bits is a decimal digit
    not all possible bit patterns used
    e.g. represent 147TEN as 0001 0100 0111

part of a family of decimal-in-binary encodings
    some more compact than this (e.g. store 2 digits at a time)

slide-174
SLIDE 174

backup slides


slide-175
SLIDE 175
optimizing with overflow example

void foo(int x) {
    while (--x < 0) {
        bar();
    }
}

in the latest version of clang++ or g++, this compiles into an infinite loop if x is initially negative
and maximum optimizations are enabled (-O3 command-line option, not the default)

slide-178
SLIDE 178

IEEE single precision floating point

$-1.1111\,1010\,0000\,0000\,0000\,000_{\text{TWO}} \cdot 2^{2}$

sign (1 bit): 0 for +, 1 for −
exponent (8 bits): store 2 + 127 = 129; 127 is the ‘bias’
mantissa (23 bits): don’t store the leading “1.” (because it is always present)

1 1000 0001 1111 1010 0000 0000 0000 000

slide-179
SLIDE 179

IEEE single precision float

1 sign bit (1 for negative)
8 exponent bits
    bias of 127: if the bits read as unsigned are e, the exponent is E = e − 127
23 mantissa bits
    leading “1.” not stored

value = $(1 - 2 \cdot \text{sign}) \cdot (1.\text{mantissa}_{\text{TWO}}) \cdot 2^{e - 127}$