61A Extra Lecture 4: Announcements, Encoding Strings

slide-1
SLIDE 1

61A Extra Lecture 4

slide-2
SLIDE 2

Announcements

slide-3
SLIDE 3

Encoding Strings

slide-17
SLIDE 17

Representing Strings: UTF-8 Encoding

UTF (UCS (Universal Character Set) Transformation Format)

Unicode: Correspondence between characters and integers

UTF-8: Correspondence between those integers and bytes

A byte is 8 bits and can encode any integer 0-255.

bytes      integers
00000000   0
00000001   1
00000010   2
00000011   3

Variable-length encoding: integers vary in the number of bytes required to encode them.

In Python: string length is measured in characters, bytes length in bytes.

(Demo)
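The demo itself is not part of the slides. A minimal sketch of what it might cover, using only the built-in `str.encode`, `bytes.decode`, and `len`; the sample strings are illustrative assumptions, not taken from the lecture:

```python
# Sample characters chosen for illustration (written as escapes so the
# code points are unambiguous): a, e-acute, euro sign, grinning face.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

for s in samples:
    encoded = s.encode("utf-8")   # str -> bytes using UTF-8
    # Unicode integer, length in characters, length in bytes, byte values
    print(ord(s), len(s), len(encoded), list(encoded))

# len() of a str counts characters; len() of a bytes object counts bytes.
word = "na\u00efve"               # five characters, one of them non-ASCII
word_bytes = word.encode("utf-8")
char_count = len(word)
byte_count = len(word_bytes)

# Decoding the bytes recovers the original string.
round_trip = word_bytes.decode("utf-8")
```

Each sample is a single character, yet the number of bytes grows with the size of its Unicode integer, which is the variable-length behavior described above.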

slide-18
SLIDE 18

Fixed-Length Encodings

slide-21
SLIDE 21

A First Attempt

  • Let’s use an encoding

Letter  Binary    Letter  Binary
a       0         n       1
b       1         o       0
c       0         p       1
d       1         q       1
e       1         r       0
f       0         s       1
g       0         t       0
h       1         u       0
i       1         v       1
j       1         w       1
k       0         x       1
l       1         y       0
m       1         z       0

slide-26
SLIDE 26

Decoding

  • An encoding without a deterministic decoding procedure is not very useful
  • How many bits do we need to encode each letter uniquely?
  • Lowercase alphabet: 26 letters
  • 5 bits (2^5 = 32 ≥ 26, while 2^4 = 16 < 26)
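The bit count can be checked with a short calculation. This is a sketch; the helper name `bits_for` is hypothetical and only used here for illustration:

```python
import string

def bits_for(symbol_count):
    """Smallest number of bits that gives each of symbol_count symbols
    its own fixed-length bit pattern (assumes symbol_count >= 1)."""
    # The largest index is symbol_count - 1; its binary length is the answer.
    return (symbol_count - 1).bit_length()

alphabet = string.ascii_lowercase      # the 26 lowercase letters
bits_needed = bits_for(len(alphabet))
print(bits_needed)
```

With 5 bits there are 32 patterns, so 6 of them go unused for a 26-letter alphabet.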

slide-29
SLIDE 29

A Second Attempt

  • Let’s try another encoding

Letter  Binary    Letter  Binary
a       00000     n       01101
b       00001     o       01110
c       00010     p       01111
d       00011     q       10000
e       00100     r       10001
f       00101     s       10010
g       00110     t       10011
h       00111     u       10100
i       01000     v       10101
j       01001     w       10110
k       01010     x       10111
l       01011     y       11000
m       01100     z       11001
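The table assigns each letter its alphabet position written as a 5-bit binary number. A minimal sketch of encoding and decoding with it; the names `ENCODE`, `DECODE`, `encode`, and `decode` are illustrative, not from the lecture:

```python
import string

BITS = 5  # fixed codeword length used in the table

# Letter at alphabet position i gets i written as a 5-bit binary string.
ENCODE = {ch: format(i, "05b") for i, ch in enumerate(string.ascii_lowercase)}
DECODE = {bits: ch for ch, bits in ENCODE.items()}

def encode(text):
    """Concatenate the 5-bit codeword of every lowercase letter in text."""
    return "".join(ENCODE[ch] for ch in text)

def decode(bits):
    """Cut bits into 5-bit chunks and look each chunk up."""
    if len(bits) % BITS != 0:
        raise ValueError("length must be a multiple of 5")
    return "".join(DECODE[bits[i:i + BITS]] for i in range(0, len(bits), BITS))

message = encode("cab")
print(message, decode(message))
```

Decoding is deterministic because every codeword has the same length, so the chunk boundaries are known in advance.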

slide-37
SLIDE 37

Analysis

Pros

  • Encoding was easy
  • Decoding was deterministic

Cons

  • Takes more space…
  • What restriction did we place that’s unnecessary?
  • Fixed length

slide-38
SLIDE 38

Variable-Length Encodings

slide-44
SLIDE 44

Variable Length Encoding

  • Encoding Candidate 1: A: 1, B: 01, C: 10, D: 11, E: 100, F: 101, ...
  • What does 01111 encode?
  • Encoding Candidate 2: A: 00, B: 01, C: 100, D: 101, E: 1100, F: 1101, ...
  • What does 0100101 encode? How about 10111001101001001100?
  • Deterministic decoding from left to right is possible if the encoding of one character is never a proper prefix of the encoding of another character.
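The two candidates can be compared by enumerating every way a bit string splits into codewords. This sketch covers only the six letters listed on the slide (the codes behind the "..." are unknown), and the function names are hypothetical:

```python
CANDIDATE_1 = {"A": "1", "B": "01", "C": "10", "D": "11", "E": "100", "F": "101"}
CANDIDATE_2 = {"A": "00", "B": "01", "C": "100", "D": "101", "E": "1100", "F": "1101"}

def is_prefix_free(code):
    """True if no codeword is a proper prefix of another codeword."""
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

def all_decodings(bits, code):
    """Return every string of letters whose encoding is exactly bits."""
    if not bits:
        return [""]
    results = []
    for letter, word in code.items():
        if bits.startswith(word):
            rest = all_decodings(bits[len(word):], code)
            results += [letter + tail for tail in rest]
    return results

print(is_prefix_free(CANDIDATE_1), all_decodings("01111", CANDIDATE_1))
print(is_prefix_free(CANDIDATE_2), all_decodings("0100101", CANDIDATE_2))
print(all_decodings("10111001101001001100", CANDIDATE_2))
```

Under Candidate 1 the string 01111 has several readings, while each Candidate 2 string has exactly one.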

slide-51
SLIDE 51

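Deterministic Codes Have a Tree Structure

Letter  Binary
A       00
B       01
C       1

[Figure: binary tree with edges labeled 0 and 1. From the root, edge 1 leads to the leaf C and edge 0 leads to an internal node whose 0 and 1 edges lead to the leaves A and B.]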
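One way to see the tree structure is to build the tree from the table and decode by walking it one bit at a time. This is a sketch using nested dictionaries as an assumed representation; `build_tree` and `decode` are illustrative names:

```python
CODE = {"A": "00", "B": "01", "C": "1"}

def build_tree(code):
    """Internal nodes are dicts mapping '0'/'1' to children; leaves are letters."""
    root = {}
    for letter, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})   # descend, creating nodes as needed
        node[word[-1]] = letter               # the last bit points at the leaf
    return root

def decode(bits, tree):
    """Follow one edge per bit; emit a letter and restart at each leaf."""
    out, node = [], tree
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):             # reached a leaf
            out.append(node)
            node = tree
    if node is not tree:
        raise ValueError("bits end in the middle of a codeword")
    return "".join(out)

tree = build_tree(CODE)
print(tree, decode("100011", tree))
```

Because letters appear only at leaves, reaching a letter always ends a codeword, which is why no codeword can be a prefix of another.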

slide-58
SLIDE 58

Huffman Encoding

  • Let’s pretend we want to come up with the optimal encoding:
  • AAAAAAAAAABBBBBCCCCCCCDDDDDDDDD
  • A appears 10 times
  • B appears 5 times
  • C appears 7 times
  • D appears 9 times

slide-64
SLIDE 64

Huffman Encoding

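  • Start with the two smallest frequencies
  • A appears 10 times, B appears 5 times, C appears 7 times, D appears 9 times

[Figure: the four leaves A, B, C, D; then B and C, the two least frequent letters, joined under a new node with edges labeled 0 and 1, with A and D still separate.]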

slide-70
SLIDE 70

Huffman Encoding

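  • Continue…
  • A appears 10 times, B & C appear a combined 12 times, D appears 9 times

[Figure: the B/C subtree next to the separate leaves A and D; then A and D, now the two smallest with a combined 19, joined under a second node with edges labeled 0 and 1.]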

slide-75
SLIDE 75

Huffman Encoding

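  • And finally…

[Figure: the B/C subtree (12) and the A/D subtree (19) joined under a single root with edges labeled 0 and 1. Every letter sits two edges below the root, so each gets a 2-bit code.]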
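The merging steps above can be sketched with a priority queue. This is one possible implementation using the standard `heapq` module, not necessarily the one used in lecture; it assumes a non-empty frequency table, and which branch gets 0 or 1 is an arbitrary choice that may differ from the slide's figure:

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a {letter: bit string} code by repeatedly merging the two
    least frequent subtrees. Assumes freqs is non-empty."""
    tiebreak = count()  # unique counter so ties never compare the subtrees
    heap = [(weight, next(tiebreak), letter) for letter, weight in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w0, _, t0 = heapq.heappop(heap)   # smallest weight
        w1, _, t1 = heapq.heappop(heap)   # second smallest weight
        heapq.heappush(heap, (w0 + w1, next(tiebreak), (t0, t1)))

    code = {}
    def walk(tree, prefix):
        if isinstance(tree, str):         # leaf: record the path taken
            code[tree] = prefix or "0"    # a one-letter alphabet still needs a bit
        else:
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
    walk(heap[0][2], "")
    return code

freqs = {"A": 10, "B": 5, "C": 7, "D": 9}
code = huffman_code(freqs)
total_bits = sum(freqs[ch] * len(code[ch]) for ch in freqs)
print(code, total_bits)
```

With these frequencies every letter ends up with a 2-bit code, so the Huffman code is no shorter than a fixed-length one here.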

slide-82
SLIDE 82

Huffman Encoding

  • Another example…
  • AAAAAAAAAABCCD
  • A appears 10 times
  • B appears 1 time
  • C appears 2 times
  • D appears 1 time

slide-88
SLIDE 88

Huffman Encoding

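  • Start with the two smallest frequencies
  • A appears 10 times, B appears 1 time, C appears 2 times, D appears 1 time

[Figure: the four leaves A, B, C, D; then B and D, the two least frequent letters, joined under a new node with edges labeled 0 and 1, with A and C still separate.]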

slide-94
SLIDE 94

Huffman Encoding

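  • Start with the two smallest frequencies
  • A appears 10 times, B & D appear a combined 2 times, C appears 2 times

[Figure: the B/D subtree next to the separate leaves A and C; then C and the B/D subtree joined under a new node with edges labeled 0 and 1, with A still separate.]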

slide-99
SLIDE 99

Huffman Encoding

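  • And finally…

[Figure: A and the C/B/D subtree joined under a single root with edges labeled 0 and 1. A sits one edge below the root, C two edges, and B and D three edges, so the most frequent letter gets the shortest code.]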
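To see what the skewed frequencies buy, the total encoded length can be computed without building the tree. This sketch relies on the fact that each merge adds one bit to every letter under the merged node, so the total is the sum of the merged weights; `huffman_total_bits` is a hypothetical helper name:

```python
import heapq

def huffman_total_bits(freqs):
    """Total number of bits needed to encode the text under a Huffman code."""
    heap = list(freqs.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        # Merge the two smallest weights; the merged weight counts one
        # extra bit for every occurrence of every letter underneath it.
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total

freqs = {"A": 10, "B": 1, "C": 2, "D": 1}
huffman_bits = huffman_total_bits(freqs)   # merges of weight 2, 4, and 14
fixed_bits = 2 * sum(freqs.values())       # 2 bits per letter for 4 letters
print(huffman_bits, fixed_bits)
```

For this string the variable-length code needs 20 bits where a fixed 2-bit code needs 28, whereas the earlier, more balanced example comes out the same either way.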