SLIDE 1
Graphemic Standardisation and Human Writing Systems
Workshop
Victor Zimmermann 2019-06-22
Department of Computational Linguistics Heidelberg University
SLIDE 2 Writing: A Human Invention
Human writing systems have independently been invented about five times in human history:
- Indus Valley (Harappan Script)
- Sumer (Cuneiform)
- Egypt (Hieroglyphs)
- Huang He (Chinese)
- Central America (Maya Script)
This sets writing apart from speech, which exists in every human civilisation. There is no known case of a child acquiring writing without being taught. Writing is a technology, just like the wheel or the iPhone.
tacos29 | unicode 1
SLIDE 3 Scripts vs. Character Sets
- Script / Writing System: Method of visualising verbal communication.
- Character set: Set of symbols used in a writing system.
- Alphabet: Set of characters representing phonemes.
- Abjad: Set of characters representing consonants.
- Abugida: Set of characters representing consonants with vowel
notations.
- Syllabary: Set of characters representing syllables.
- Logography: Set of characters representing semantic units.
SLIDE 4
Writing Systems around the World
SLIDE 5 Graphemic Standardisation
“When he wished to print, he took an iron frame and set it on the iron plate. In this he placed the types, set close together. When the frame was full, the whole made one solid block of type. He then placed it near the fire to warm it. When the paste [at the back] was slightly melted, he took a smooth board and pressed it over the surface, so that the block of type became as even as a whetstone.” - Shen Kuo (1031–1095)
SLIDE 6 Graphemic Standardisation
- Early printing standards were tightly linked to typography.
- What could be printed depended on the physical movable type available.
- The use of German metal type in early printing presses led to the removal of characters like thorn (þ) and eth (ð) from the English alphabet.
SLIDE 7 Digital Standards
- Early codes like Morse or Baudot code represent characters as bit-like signal sequences.
- The American 7-bit ASCII encoding was used throughout the Latin-script world.
- Various 8-bit extensions of ASCII emerged, e.g. adding letters with diacritics for Northern European countries.
- IBM released its own 8-bit standard, the Extended Binary Coded Decimal Interchange Code (EBCDIC).
- Many competing standards for Japanese led to mojibake (garbled text) when moving data between companies.
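A minimal sketch of how competing encodings garble text, assuming Python's built-in `shift_jis` and `euc_jp` codecs as stand-ins for two incompatible Japanese standards. The helper name `garble` is hypothetical and only serves this illustration.

```python
def garble(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one codec and decode the resulting bytes with another."""
    raw = text.encode(written_as)
    # errors="replace" substitutes U+FFFD for bytes the reading codec cannot decode
    return raw.decode(read_as, errors="replace")


original = "文字化け"  # the Japanese word "mojibake" itself

# Writer and reader agree on the encoding: the text survives.
print(garble(original, "shift_jis", "shift_jis"))

# Writer and reader disagree: the same bytes turn into garbled text.
print(garble(original, "shift_jis", "euc_jp"))
```

The bytes on disk are identical in both calls; only the reader's assumption about the encoding changes.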
SLIDE 8
Workshop: Canadian Aboriginal Syllabics
Figure 1: Evans’ script, as published in 1841
SLIDE 9
Workshop: Cyrillic Scripts
SLIDE 10
Workshop: Cyrillic Scripts
SLIDE 11
Workshop: Hangul
SLIDE 12
Workshop: Hangul
SLIDE 13 Workshop Task
- Are there any cultural aspects you need to consider when
creating a standard? How should they influence your design?
- How would a standard deal with the syllabic blocks, ligatures or
diacritics present in your language?
- Is there a particularly efficient way to represent this script?
- Are there things you wish to include in your standard that go
beyond character sets?
- Does a program using your encoding need special instructions to
represent your characters? What are they?
SLIDE 14
UTF-32, UTF-16, UTF-8
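- UTF-32: Every code point is encoded in 32 bits.
- UTF-16: Every code point is encoded in 16 or 32 bits (one or two 16-bit code units).
- Little Endian: the least significant byte comes first. Big Endian: the most significant byte comes first. Only the order of the bytes changes, not the order of the bits within a byte.
- UTF-8: Variable-width encoding built from 8-bit chunks (one to four per code point); marker bits at the start of each byte distinguish lead bytes from continuation bytes.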
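A minimal sketch that makes the encoding forms visible, assuming Python's built-in codecs `utf-32-be`, `utf-16-be`, `utf-16-le` and `utf-8`. The helper name `to_bits` is hypothetical.

```python
def to_bits(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))


# Compare a one-byte ASCII letter with a code point outside the 16-bit range.
for char in ("A", "\U0001F32E"):
    for encoding in ("utf-32-be", "utf-16-be", "utf-16-le", "utf-8"):
        print(f"U+{ord(char):04X} {encoding:10} {to_bits(char, encoding)}")
```

Using the explicit `-be` and `-le` codec names avoids the byte order mark that the plain `utf-16` and `utf-32` codecs prepend.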
SLIDE 15
Examples
LATIN CAPITAL LETTER A (U+0041)
UTF-32 BE  00000000 00000000 00000000 01000001
UTF-16 BE  00000000 01000001
UTF-16 LE  01000001 00000000
UTF-8      01000001

TACO (U+1F32E)
UTF-32 BE  00000000 00000001 11110011 00101110
UTF-16 BE  11011000 00111100 11011111 00101110
UTF-16 LE  00111100 11011000 00101110 11011111
UTF-8      11110000 10011111 10001100 10101110
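The bit patterns for a code point can be derived by hand. The following is a sketch under simplifying assumptions, not a full implementation: the function names `utf16_code_units` and `utf8_bytes` are hypothetical, and inputs are assumed to be valid code points outside the surrogate range U+D800 to U+DFFF.

```python
from typing import List


def utf16_code_units(code_point: int) -> List[int]:
    """Return the one or two 16-bit UTF-16 code units for a code point."""
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits go into the low surrogate
    return [high, low]


def utf8_bytes(code_point: int) -> bytes:
    """Return the UTF-8 bytes: a lead byte plus 0 to 3 continuation bytes (10xxxxxx)."""
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


taco = 0x1F32E
print([hex(unit) for unit in utf16_code_units(taco)])  # ['0xd83c', '0xdf2e']
print(utf8_bytes(taco).hex())                          # f09f8cae
```

Writing each UTF-16 code unit with its most significant byte first gives the big-endian byte sequence; swapping the two bytes within each unit gives the little-endian one.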
SLIDE 16
History of Emoji
The history of Unicode is a history of compromise. UTF-8 came to be because ASCII users did not want to move from 7-bit to 16-bit or even 32-bit systems. Because its writing system is largely logographic, Japanese needs a lot of code points. Since the number of available code points doubles with each additional bit, Japanese encodings were left with many unassigned code points. Some Japanese phone companies used this space for symbols such as emoticons or the poop emoji.
SLIDE 17
History of Emoji
In 2010, emoji were incorporated into the Unicode Standard to allow compatibility with Japanese phones. iPhone users found out, and poop emoji appeared in messages around the world soon after. Today, emoji are spread over multiple Unicode blocks, including the Japanese blocks and a special miscellaneous block. The addition of new emoji is governed by the Unicode Consortium (not Apple).
SLIDE 18
References
SLIDE 19
References
[Uni19] The Unicode Consortium. The Unicode Standard Version 12.0 - Core Specification. Vol. 1. Mountain View, CA: The Unicode Consortium, 2019. isbn: 9781936213238.
coll | references 17
SLIDE 20
Thank you!
That’s it! Thank you for coming! Have fun at TaCoS29!
en.axtimhaus.eu
zimmermann@cl.uni-heidelberg.de