Graphemic Standardisation and Human Writing Systems Workshop

SLIDE 1

Graphemic Standardisation and Human Writing Systems

Workshop

Victor Zimmermann 2019-06-22

Department of Computational Linguistics Heidelberg University

SLIDE 2

Writing: A Human Invention

Human writing systems have independently been invented about five times in human history:

Indus Valley (Harappan Script)
Sumer (Cuneiform)
Egypt (Hieroglyphs)
Huang He (Chinese)
Central America (Maya Script)

This sets writing apart from speech, which exists in every human civilisation. There is no known case of a child acquiring writing by itself. Writing is a technology, just like the wheel or the iPhone.

tacos29 | unicode 1

SLIDE 3

Scripts vs. Character Sets

Script / Writing System: Method of visualising verbal communication.

Character set: Set of symbols used in a writing system.
Alphabet: Set of characters representing phonemes.
Abjad: Set of characters representing consonants.
Abugida: Set of characters representing consonants with vowel notations.
Syllabary: Set of characters representing syllables.
Logography: Set of characters representing semantic units.

SLIDE 4

Writing Systems around the World

SLIDE 5

Graphemic Standardisation

“When he wished to print, he took an iron frame and set it on the iron plate. In this he placed the types, set close together. When the frame was full, the whole made one solid block of type. He then placed it near the fire to warm it. When the paste [at the back] was slightly melted, he took a smooth board and pressed it over the surface, so that the block of type became as even as a whetstone.” - Shen Kuo (1031–1095)

SLIDE 6

Graphemic Standardisation

Early printing standards were tightly linked to typography.
What could be printed depended on the physical movable type available.
The use of German metal type in the early printing presses led to the removal of characters like Thorn and Eth from the English alphabet.

SLIDE 7

Digital Standards

Early encodings like Morse or Baudot code use bit-like character encoding.
American 7-bit ASCII encoding was used throughout the Latin-script world.
Various 8-bit extensions of ASCII emerged, e.g. adding characters with diacritics for Northern European countries.
IBM released its own 8-bit standard, the Extended Binary Coded Decimal Interchange Code (EBCDIC).
Many competing standards for Japanese led to mojibake (garbled text) when moving data between companies.
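
The effect of competing standards can be reproduced with Python's built-in codecs. The sketch below is illustrative only: the codec names `latin-1`, `cp037`, `shift_jis` and `euc_jp` are examples chosen here (one 8-bit ASCII extension, one EBCDIC code page, two Japanese encodings), not encodings named on the slide.

```python
# One letter, three encodings: 7-bit ASCII, an 8-bit ASCII extension, EBCDIC.
ascii_a = "A".encode("ascii")      # 0x41
latin1_e = "é".encode("latin-1")   # 0xE9, outside the 7-bit ASCII range
ebcdic_a = "A".encode("cp037")     # 0xC1 in this EBCDIC code page (assumed example)

# Garbled text: bytes written in one Japanese encoding, read back with another.
text = "文字化け"
raw = text.encode("shift_jis")
garbled = raw.decode("euc_jp", errors="replace")  # wrong codec on purpose
restored = raw.decode("shift_jis")                # matching codec

print(ascii_a.hex(), latin1_e.hex(), ebcdic_a.hex())
print("read with the wrong codec:", garbled)
print("read with the right codec:", restored)
```

The same bytes are only meaningful together with the name of the encoding that produced them, which is the problem a single shared standard is meant to remove.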

SLIDE 8

Workshop: Canadian Aboriginal Syllabics

Figure 1: Evans’ script, as published in 1841

SLIDE 9

Workshop: Cyrillic Scripts

SLIDE 10

Workshop: Cyrillic Scripts

SLIDE 11

Workshop: Hangul

SLIDE 12

Workshop: Hangul

SLIDE 13

Workshop Task

Are there any cultural aspects you need to consider when creating a standard? How should they influence your design?
How would a standard deal with the syllabic blocks, ligatures or diacritics present in your language?
Is there a particularly efficient way to represent this script?
Are there things you wish to include in your standard that go beyond character sets?
Does a program using your encoding need special instructions to represent your characters? What are they?

SLIDE 14

UTF-32, UTF-16, UTF-8

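UTF-32: Every code point is encoded in 32 bits.
UTF-16: Every code point is encoded in 16 or 32 bits (one or two 16-bit code units).
Byte order: Big Endian stores the most significant byte first, Little Endian stores the least significant byte first.
UTF-8: Variable-width encoding in 8-bit units (1 to 4 bytes per code point), using a lead byte followed by continuation bytes.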
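
To make the variable-width scheme concrete, here is a minimal sketch of a hand-written UTF-8 encoder for a single code point. The function name `utf8_encode` is hypothetical and exists only for illustration; in practice Python's built-in `str.encode("utf-8")` does this job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative only)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx (identical to ASCII)
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare the hand-written encoder with the built-in codec.
for char in ("A", "é", "€", "\U0001F32E"):
    encoded = utf8_encode(ord(char))
    print(f"U+{ord(char):04X}", encoded.hex(" "), encoded == char.encode("utf-8"))
```

Every byte after the first starts with the bits `10`, which is how a decoder can tell a continuation byte from the start of a new character.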

SLIDE 15

Examples

LATIN CAPITAL LETTER A (U+0041)
UTF-32 BE  00000000 00000000 00000000 01000001
UTF-16 BE  00000000 01000001
UTF-16 LE  01000001 00000000
UTF-8      01000001

TACO (U+1F32E)
UTF-32 BE  00000000 00000001 11110011 00101110
UTF-16 BE  11011000 00111100 11011111 00101110
UTF-16 LE  00111100 11011000 00101110 11011111
UTF-8      11110000 10011111 10001100 10101110
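
Bit patterns like these can be checked with Python's built-in codecs. The following is a small sketch; the helper names `bits` and `surrogate_pair` are made up for this example, and the codec names with explicit byte order (`utf-32-be`, `utf-16-be`, `utf-16-le`) are used so that no byte order mark is prepended.

```python
def bits(data: bytes) -> str:
    """Render bytes as space-separated groups of eight binary digits."""
    return " ".join(f"{byte:08b}" for byte in data)


def surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


ENCODINGS = ("utf-32-be", "utf-16-be", "utf-16-le", "utf-8")

for char in ("A", "\U0001F32E"):
    print(f"U+{ord(char):04X}")
    for encoding in ENCODINGS:
        print(f"  {encoding:<9} {bits(char.encode(encoding))}")

high, low = surrogate_pair(0x1F32E)
print(f"UTF-16 surrogates for U+1F32E: {high:04X} {low:04X}")
```

Note that byte order only changes the order of whole bytes within each 16-bit or 32-bit unit; the bits inside a byte are never reversed.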

SLIDE 16

History of Emoji

The history of Unicode is the history of compromise. UTF-8 came to be because ASCII users did not want to move from 7-bit to 16-bit or even 32-bit encodings. Because of its logographic nature, Japanese uses a lot of code points. Since every additional bit doubles the number of available code points, Japanese encodings were left with a lot of unassigned code points. Some Japanese phone companies used this space for symbols like emoticons or the poop emoji.

SLIDE 17

History of Emoji

In 2010, emoji were incorporated into the Unicode Standard to allow compatibility with Japanese phones. iPhone users found out, and poop emoji appeared in messages around the world soon after. Today, emoji are spread over multiple Unicode blocks: the Japanese blocks and a special miscellaneous block. The addition of new emoji is governed by the Unicode Consortium (not Apple).

SLIDE 18

References

SLIDE 19

References

[Uni19] The Unicode Consortium. The Unicode Standard Version 12.0 - Core Specification. Vol. 1. Mountain View, CA: The Unicode Consortium, 2019. isbn: 9781936213238.

coll | references 17

SLIDE 20

Thank you!

That’s it! Thank you for coming! Have fun at TaCoS29!

en.axtimhaus.eu
zimmermann@cl.uni-heidelberg.de
