SLIDE 1
61A Extra Lecture 4
SLIDE 2
SLIDE 3
Encoding Strings
SLIDE 17
Representing Strings: UTF-8 Encoding
UTF (UCS (Universal Character Set) Transformation Format)
Unicode: Correspondence between characters and integers
UTF-8: Correspondence between those integers and bytes
A byte is 8 bits and can encode any integer 0-255.
Variable-length encoding: integers vary in the number of bytes required to encode them.

bytes     integers
00000000  0
00000001  1
00000010  2
00000011  3

In Python: string length is measured in characters, bytes length in bytes.
(Demo)
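The demo itself is not part of this transcript. A minimal sketch of what such a demo might look like follows; the sample characters and the helper name utf8_length are assumptions chosen for illustration, not taken from the lecture.

```python
def utf8_length(text):
    """Number of bytes needed to store text in UTF-8 (hypothetical helper)."""
    return len(text.encode("utf-8"))

# Escapes are used so the example does not depend on the file's own encoding.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    bits = " ".join(format(b, "08b") for b in encoded)
    # ord gives the Unicode integer; the bytes are its UTF-8 encoding.
    print(f"{ch!r}: integer {ord(ch)}, {len(encoded)} byte(s): {bits}")

word = "na\u00efve"
print(len(word), utf8_length(word))  # characters vs. bytes
```

Each sample character is a single character to len, but needs a different number of bytes once encoded.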
SLIDE 18
Fixed-Length Encodings
SLIDE 21
A First Attempt
- Let’s use an encoding
Letter  Binary  Letter  Binary
a       0       n       1
b       1       o       0
c       0       p       1
d       1       q       1
e       1       r       0
f       0       s       1
g       0       t       0
h       1       u       0
i       1       v       1
j       1       w       1
k       0       x       1
l       1       y       0
m       1       z       0
SLIDE 26
Decoding
- An encoding without a deterministic decoding procedure is not very useful
- How many bits do we need to encode each letter uniquely?
- Lowercase alphabet: 26 letters
- 5 bits, since 2^4 = 16 < 26 <= 2^5 = 32
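The bit count can be checked with a short calculation. This is a sketch only; the function name bits_needed is an assumption made for this example.

```python
def bits_needed(n_symbols):
    """Smallest width k with 2**k >= n_symbols, for a fixed-length code."""
    if n_symbols < 1:
        raise ValueError("need at least one symbol")
    # (n - 1).bit_length() is the number of bits needed to write n - 1,
    # the largest index when symbols are numbered from 0.
    return max(1, (n_symbols - 1).bit_length())

print(bits_needed(26))  # lowercase alphabet
```

With 26 symbols, 4 bits give only 16 distinct patterns, so 5 bits is the minimum.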
SLIDE 29
A Second Attempt
- Let’s try another encoding
Letter  Binary  Letter  Binary
a       00000   n       01101
b       00001   o       01110
c       00010   p       01111
d       00011   q       10000
e       00100   r       10001
f       00101   s       10010
g       00110   t       10011
h       00111   u       10100
i       01000   v       10101
j       01001   w       10110
k       01010   x       10111
l       01011   y       11000
m       01100   z       11001
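The table above assigns each letter its alphabet position written as a 5-bit binary number. A minimal sketch of encoding and decoding with it follows; the names ENCODE, DECODE, encode, and decode are assumptions for this example.

```python
import string

WIDTH = 5  # bits per letter, as in the table

# a -> 00000, b -> 00001, ..., z -> 11001
ENCODE = {ch: format(i, "05b") for i, ch in enumerate(string.ascii_lowercase)}
DECODE = {bits: ch for ch, bits in ENCODE.items()}

def encode(text):
    """Concatenate the 5-bit codeword of every lowercase letter in text."""
    return "".join(ENCODE[ch] for ch in text)

def decode(bits):
    """Cut the bit string into 5-bit chunks and look each one up."""
    if len(bits) % WIDTH:
        raise ValueError("length is not a multiple of 5")
    chunks = (bits[i:i + WIDTH] for i in range(0, len(bits), WIDTH))
    return "".join(DECODE[chunk] for chunk in chunks)

print(encode("cab"))
print(decode(encode("hello")))
```

Because every codeword has the same width, the decoder always knows where one letter ends and the next begins.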
SLIDE 37
Analysis
Pros
- Encoding was easy
- Decoding was deterministic
Cons
- Takes more space…
- What restriction did we place that’s unnecessary?
- Fixed length
SLIDE 38
Variable-Length Encodings
SLIDE 44
Variable Length Encoding
- Encoding Candidate 1: A: 1, B: 01, C: 10, D: 11, E: 100, F: 101, ...
- What does 01111 encode?
- Encoding Candidate 2: A: 00, B: 01, C: 100, D: 101, E: 1100, F: 1101, ...
- What does 0100101 encode? How about 10111001101001001100?
- Deterministic decoding from left to right is possible if the encoding of one character is never a proper prefix of the encoding of another character.
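The two candidates can be compared with a short experiment. This is a sketch under assumptions: only the six letters listed on the slide are included (the "..." continuation is left out), and the function names are invented for illustration.

```python
CANDIDATE_1 = {"A": "1", "B": "01", "C": "10", "D": "11", "E": "100", "F": "101"}
CANDIDATE_2 = {"A": "00", "B": "01", "C": "100", "D": "101", "E": "1100", "F": "1101"}

def is_prefix_free(code):
    """True if no codeword is a proper prefix of another codeword."""
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

def all_decodings(bits, code):
    """Every way to split bits into codewords, found by exhaustive search."""
    if not bits:
        return [""]
    results = []
    for letter, word in code.items():
        if bits.startswith(word):
            rest = all_decodings(bits[len(word):], code)
            results += [letter + tail for tail in rest]
    return results

def decode_left_to_right(bits, code):
    """Emit a letter as soon as the bits read so far match a codeword.
    Only reliable when the code is prefix-free."""
    lookup = {word: letter for letter, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:
            out.append(lookup[current])
            current = ""
    if current:
        raise ValueError("leftover bits do not form a codeword")
    return "".join(out)

print(is_prefix_free(CANDIDATE_1), all_decodings("01111", CANDIDATE_1))
print(is_prefix_free(CANDIDATE_2), decode_left_to_right("0100101", CANDIDATE_2))
```

Candidate 1 admits several readings of the same bits, while Candidate 2 can be decoded in a single left-to-right pass.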
SLIDE 51
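Deterministic Codes Have a Tree Structure
Letter  Binary
A       00
B       01
C       1
[Figure: binary tree for this code. From the root, the 0 branch leads to an internal node whose 0 branch is A and whose 1 branch is B; the 1 branch leads to the leaf C.]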
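One way to see the tree structure is to build it from the table and decode by walking it. The sketch below represents the tree as nested dictionaries, which is an assumption made for this example rather than the lecture's own representation.

```python
CODE = {"A": "00", "B": "01", "C": "1"}

def build_tree(code):
    """Nested dicts keyed by '0'/'1'; a letter (str) marks a leaf."""
    root = {}
    for letter, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})  # create internal nodes as needed
        node[word[-1]] = letter              # the last bit leads to the leaf
    return root

def decode(bits, tree):
    """Follow one branch per bit; on reaching a leaf, emit it and restart at the root."""
    out, node = [], tree
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):
            out.append(node)
            node = tree
    if node is not tree:
        raise ValueError("input ended in the middle of a codeword")
    return "".join(out)

TREE = build_tree(CODE)
print(TREE)
print(decode("100011", TREE))
```

Letters sit only at leaves, so no codeword can be a prefix of another, which is the condition for deterministic decoding.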
SLIDE 58
Huffman Encoding
- Let’s pretend we want to come up with the optimal encoding:
- AAAAAAAAAABBBBBCCCCCCCDDDDDDDDD
- A appears 10 times
- B appears 5 times
- C appears 7 times
- D appears 9 times
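Counting the letters is the first step. A minimal sketch follows; it rebuilds the message from the counts stated on the slide rather than copying the literal string, and the variable names are assumptions for this example.

```python
from collections import Counter

# Message rebuilt from the slide's counts: 10 A, 5 B, 7 C, 9 D.
message = "A" * 10 + "B" * 5 + "C" * 7 + "D" * 9

freqs = Counter(message)

# Sorted from least to most frequent.
by_weight = sorted(freqs.items(), key=lambda item: item[1])
print(by_weight)
```

Sorting by frequency puts the least common letters first.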
SLIDE 64
Huffman Encoding
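- Start with the two smallest frequencies
- A appears 10 times, B appears 5 times, C appears 7 times, D appears 9 times
[Figure: the four leaves A, B, C, D; then B and C, the two least frequent, joined under a new node with branches labeled 0 and 1, alongside A and D.]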
SLIDE 70
Huffman Encoding
- Continue…
- A appears 10 times, B & C appear a combined 12 times, D appears 9 times
[Figure: A and D, now the two smallest weights (10 and 9), joined under a second new node, next to the B/C subtree.]
SLIDE 75
Huffman Encoding
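- And finally…
[Figure: the B/C subtree (weight 12) and the A/D subtree (weight 19) joined under the root, giving each of A, B, C, and D a 2-bit code.]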
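The merging steps above can be sketched with a priority queue. This is an illustrative implementation under assumptions, not the lecture's code: the function name huffman_code is invented, symbols are assumed to be strings, and which child gets 0 or 1 is arbitrary, so the exact bit patterns may differ from the slide's tree even though the code lengths agree.

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Return {symbol: bitstring} by repeatedly merging the two lightest trees."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(weight, next(tiebreak), sym) for sym, weight in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # a lone symbol still needs one bit
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # lightest tree
        w2, _, right = heapq.heappop(heap)  # second lightest tree
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))

    code = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):  # internal node: (left, right)
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                        # leaf: a symbol
            code[tree] = prefix

    walk(heap[0][2], "")
    return code

first = huffman_code({"A": 10, "B": 5, "C": 7, "D": 9})
print(first)
```

For these frequencies the merges are B with C (12), then D with A (19), then the two subtrees, so every letter ends up two levels below the root.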
SLIDE 82
Huffman Encoding
- Another example…
- AAAAAAAAAABCCD
- A appears 10 times
- B appears 1 time
- C appears 2 times
- D appears 1 time
SLIDE 88
Huffman Encoding
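- Start with the two smallest frequencies
- A appears 10 times, B appears 1 time, C appears 2 times, D appears 1 time
[Figure: the four leaves A, B, C, D; then B and D, the two least frequent, joined under a new node with branches labeled 0 and 1, alongside A and C.]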
SLIDE 94
Huffman Encoding
- Continue with the two smallest frequencies
- A appears 10 times, B & D appear a combined 2 times, C appears 2 times
[Figure: C and the B/D subtree, now the two smallest weights (2 each), joined under a second new node, next to A.]
SLIDE 99
Huffman Encoding
- And finally…
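The last merge joins A (10) with the C/B/D subtree (4). The payoff can be checked without building the tree: the total encoded length equals the sum of the weights created by each merge. The sketch below relies on that property; the function name huffman_total_bits and the 2-bit fixed-length baseline for four letters are assumptions for this example.

```python
import heapq

def huffman_total_bits(weights):
    """Total bits to encode the message with a Huffman code.
    Each merge adds one bit to every symbol beneath it, so the
    merged weights sum to the encoded length."""
    heap = list(weights)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total

skewed = [10, 1, 2, 1]    # AAAAAAAAAABCCD
balanced = [10, 5, 7, 9]  # the earlier example

for weights in (skewed, balanced):
    fixed = 2 * sum(weights)  # 2 bits per letter for a 4-letter alphabet
    print(weights, "fixed:", fixed, "huffman:", huffman_total_bits(weights))
```

For the skewed message the variable-length code needs 20 bits against 28 for the fixed-length one; for the more balanced message both need 62 bits.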