SLIDE 1

Mined: An Editor with Extensive Unicode and CJK Support for the Text-based Terminal Environment

IUC 27, Berlin, 2005-04-08 Thomas Wolff

SLIDE 2

Mined

Dr. Thomas Wolff

1 IUC 27

Mined: introduction

A text editor

suitable for editing text
suitable for editing programs

Editing environment

text-mode terminal (xterm, mlterm, hanterm, cxterm, rxvt, Linux console)
non-graphical UI: more lightweight interaction, seamless integration into command-line workflows

Mined was the first editor that supported UTF-8 in this environment

SLIDE 3

Mined: «IUC» support

Internationalisation support

encoding support: UTF-8, CJK, and others
character and script information
character input support, input methods

Encoding environment

text encoding: automatic detection, flexible handling

» configurable detection
» mixed-encoding editing, switching online

terminal encoding: automatic detection

» 8 bit vs. UTF-8 vs. CJK
» various sets of character width properties
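
The automatic text-encoding detection listed above can be illustrated with a minimal sketch. This is not Mined's actual algorithm. It assumes a simple heuristic: pure 7-bit data is ASCII, data that decodes strictly as UTF-8 is UTF-8, and anything else falls back to an 8-bit encoding (Latin-1 is picked here only for illustration). The name `guess_encoding` is hypothetical.

```python
def guess_encoding(data: bytes, fallback: str = "latin-1") -> str:
    """Guess a text encoding: 'ascii', 'utf-8', or the given 8-bit fallback.

    Assumption: a byte sequence that is valid UTF-8 and contains
    non-ASCII bytes is very unlikely to be meaningful 8-bit text.
    """
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


if __name__ == "__main__":
    samples = [b"plain", "Größe".encode("utf-8"), "Größe".encode("latin-1")]
    for sample in samples:
        print(sample, "->", guess_encoding(sample))
```

A real detector would also have to tell the various CJK encodings apart, which this sketch does not attempt.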

SLIDE 4

Mined: overview

User interface

simple and intuitive
affirmative toward modern interaction paradigms

» comprehensive menus
» mouse control, wheel scrolling
» scrollbar navigation

Program editing features

program structure, auto-indent
identifier search functions (multi-file)
HTML highlighting and tag matching
multiple lines in search/replacement patterns
multi-file copy/paste
visual indications, binary transparency

SLIDE 5

Typographic editing support

Smart quotes

smart quotes mode with different styles
straight quotes (as from keyboard) are replaced with typographic quote marks for insertion
opening/closing quote mark

» depending on context
» nested handling for double/single quotes
» special heuristics for CJK quotes (without space context)

Smart dashes

“ --” → “–”
“--” → “—”
<Hebrew context> “-” → “־” (Maqaf)
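
The quote and dash rules above can be sketched as follows. This is a simplified illustration, not Mined's implementation: it assumes that a quote is "opening" when it follows whitespace, an opening bracket, or another opening quote, and that the space in front of " --" is kept. The CJK and Hebrew rules are left out. The function names are hypothetical.

```python
OPENERS = {'"': "\u201c", "'": "\u2018"}   # left double / left single
CLOSERS = {'"': "\u201d", "'": "\u2019"}   # right double / right single


def smart_quotes(text: str) -> str:
    """Replace straight quotes by typographic ones, depending on context."""
    out = []
    for ch in text:
        if ch in OPENERS:
            # Look at the already converted output so that nested
            # quotes (a single quote right after an opening double
            # quote) are treated as opening, too.
            prev = out[-1] if out else " "
            opening = prev.isspace() or prev in "([{\u201c\u2018"
            out.append(OPENERS[ch] if opening else CLOSERS[ch])
        else:
            out.append(ch)
    return "".join(out)


def smart_dashes(text: str) -> str:
    """' --' becomes an en dash, a remaining '--' becomes an em dash."""
    # Order matters: handle the space-prefixed form first.
    return text.replace(" --", " \u2013").replace("--", "\u2014")


if __name__ == "__main__":
    print(smart_quotes("He said \"it's 'fine'\"."))
    print(smart_dashes("pages 3 -- 5, and then--suddenly"))
```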

SLIDE 6

Input support: character input

Mnemonic input

RFC 1345 with completions
HTML mnemonics, TeX mnemonics
useful supplements

Numeric input

hex, octal, decimal
native encoding, Unicode value
ISO 14755

Accented input

accent prefix function keys
extensions for Vietnamese multiple accent combinations
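
The numeric input modes listed above (hex, octal, decimal; Unicode value or native encoding) can be illustrated with a short sketch. The function names are hypothetical and the key bindings that trigger such input in Mined are not modelled here.

```python
def char_from_number(digits: str, base: int = 16) -> str:
    """Turn a Unicode value typed in hex, octal or decimal into a character."""
    if base not in (8, 10, 16):
        raise ValueError("base must be 8, 10 or 16")
    code = int(digits, base)
    # Reject values outside the Unicode range and surrogate code points.
    if code < 0 or code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {digits!r}")
    return chr(code)


def char_from_native(hex_bytes: str, encoding: str) -> str:
    """Interpret hex digits as a byte sequence in a native (non-Unicode) encoding."""
    return bytes.fromhex(hex_bytes).decode(encoding)


if __name__ == "__main__":
    print(char_from_number("20AC"))                # Unicode value, hex
    print(char_from_number("8364", base=10))       # same value, decimal
    print(char_from_native("A4", "iso-8859-15"))   # native encoding value
```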

SLIDE 7

Input support: input methods

Simple keyboard mapping

Greek, Cyrillic, Hebrew, Arabic, Thai

Input methods

CJK
extended keyboard mapping

» multi-character input sequences
» ambiguous mappings
» resolution with selection menu (“pick-list”)

Radical/Stroke two-level selection menus

“Out-of-the-box” approach

all mapping tables built-in

» all input methods are always available, even on legacy systems
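
How a built-in mapping table with ambiguous entries and a pick-list might work can be sketched as below. The table is a tiny made-up sample with Pinyin-like keys, not one of Mined's actual tables, and `candidates` and `resolve` are hypothetical names.

```python
# Illustrative built-in table: key sequence -> candidate characters.
TABLE = {
    "ni": ["你"],
    "hao": ["好", "号"],
    "ma": ["妈", "马", "吗"],
}


def candidates(keys: str) -> list:
    """Return the pick-list for a multi-character input sequence."""
    return TABLE.get(keys, [])


def resolve(keys: str, pick: int = 0):
    """Return the chosen character, or None if the sequence is unmapped.

    An unambiguous mapping yields its single candidate; for an
    ambiguous one, `pick` is the index selected from the pick-list.
    """
    options = candidates(keys)
    if not options:
        return None
    return options[pick]


if __name__ == "__main__":
    print(resolve("ni"))        # single candidate, no menu needed
    print(candidates("ma"))     # ambiguous: would be shown as a menu
    print(resolve("ma", 1))     # user picks the second entry
```

Because the table is part of the program, such a lookup does not depend on any input method framework being installed on the system.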

SLIDE 8

Character handling: combining characters

Combined display mode

“normal” handling of combining characters

Combined editing

move cursor “into” a combined character, edit parts

Separated display mode

provides clear view of combining characters in text
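
The difference between the combined and the separated view can be sketched with Python's `unicodedata` module. This is an illustration only: it treats every character of general category M (mark) as combining, and the `<U+XXXX>` notation for the separated view is an assumption, not Mined's actual screen rendering.

```python
import unicodedata


def is_combining(ch: str) -> bool:
    """True for characters of general category Mn, Mc or Me."""
    return unicodedata.category(ch).startswith("M")


def combined_clusters(text: str) -> list:
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if is_combining(ch) and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def separated_view(text: str) -> str:
    """Show every combining mark as a visible <U+XXXX> label."""
    return "".join(
        f"<U+{ord(ch):04X}>" if is_combining(ch) else ch for ch in text
    )


if __name__ == "__main__":
    sample = "a\u0321b"                 # a + COMBINING PALATALIZED HOOK BELOW + b
    print(combined_clusters(sample))    # two clusters
    print(separated_view(sample))       # a<U+0321>b
```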

SLIDE 9

Character handling: script highlighting

Distinguish characters with similar-looking glyphs

coloured display of certain scripts
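
A rough sketch of script-based colouring follows. Python's standard library has no script property, so the script is approximated by the first word of the Unicode character name; the colour assignment is arbitrary and both function names are hypothetical.

```python
import unicodedata

# ANSI SGR foreground colours per script (arbitrary illustrative choice).
COLOURS = {"CYRILLIC": "31", "GREEK": "34"}


def script_of(ch: str) -> str:
    """Approximate the script by the first word of the character name."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:          # unnamed character, e.g. a control code
        return "UNKNOWN"


def highlight(text: str) -> str:
    """Wrap characters of selected scripts in ANSI colour sequences."""
    out = []
    for ch in text:
        colour = COLOURS.get(script_of(ch))
        out.append(f"\x1b[{colour}m{ch}\x1b[0m" if colour else ch)
    return "".join(out)


if __name__ == "__main__":
    # Latin "a" followed by the similar-looking Cyrillic "а" (U+0430).
    print(highlight("a\u0430"))
```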

SLIDE 10

Character handling: Han character information

Information from Unihan database

pronunciation entries (configurable selection)

» Mandarin, Cantonese, (Sino-)Japanese, Korean, Vietnamese

semantic description of character (in English)

» automatic typographic fixes to descriptions

SLIDE 11

Character handling: interactive conversion

Case toggle

handles Greek final sigma, Turkish i

Hiragana/Katakana toggle

using case toggle function key

Latin-1 / UTF-8 toggle

search for non-matching character codes
conversion function

Mnemonic conversion

cursor on “oe”

» ESC ä → “ö”
» ESC ç → “œ”
» ESC å → “ø”
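
A single toggle function covering both letter case and Hiragana/Katakana can be sketched as below. It relies on the fact that the Hiragana block (U+3041 to U+3096) and the corresponding Katakana letters (U+30A1 to U+30F6) are a fixed 0x60 apart. The context-dependent cases named above (Greek final sigma, Turkish i) are not handled in this sketch, and `toggle_char` is a hypothetical name.

```python
KANA_OFFSET = 0x60   # distance between Hiragana and Katakana code points


def toggle_char(ch: str) -> str:
    """Toggle Hiragana <-> Katakana, otherwise toggle letter case."""
    code = ord(ch)
    if 0x3041 <= code <= 0x3096:        # Hiragana -> Katakana
        return chr(code + KANA_OFFSET)
    if 0x30A1 <= code <= 0x30F6:        # Katakana -> Hiragana
        return chr(code - KANA_OFFSET)
    return ch.swapcase()                # plain case toggle, no locale rules


if __name__ == "__main__":
    for ch in "aZあカ":
        print(ch, "->", toggle_char(ch))
```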

SLIDE 12

Terminal feature detection: locale trouble

Why the locale mechanism fails

» export LC_CTYPE=my_country.UTF-8
» installed locale: vendor_country.utf8

user has trouble finding a suitable locale on each machine

» if proper encoding is found, language/country may not match

want to install fr_FR.UTF-8? system admin doesn't care!
only some tools / terminals use system locale data

» esp. xterm maintains its own width data

heterogeneous network: rlogin / telnet

» terminal and application run on different machines
» they see different system locale data
» it is in principle not possible to make sure that the locale data you get from the system matches the behaviour of your terminal

SLIDE 13

Terminal property auto-detection

Auto-detection mechanism

send test strings to terminal, request cursor position report

» determine width of test strings, conclude width properties

String / Width / Conclusion

åلاษษ刈墢
» 6: UTF-8 terminal, no double-width support, with Arabic LAM/ALEF ligature joining
» 7: UTF-8, no double-width, no LAM/ALEF ligature joining
» 8: UTF-8, double-width, with LAM/ALEF ligature joining (probably mlterm)
» 9: UTF-8, double-width, no LAM/ALEF ligature joining
» 10, 11, 14, 16, 17: CJK encoded terminal
» 15, 18: 8 bit terminal or CJK terminal

a U+0321
» 1: terminal supports combining characters
» 2: terminal does not support combining characters

《》U+301A U+301B U+FF60
» < 8: xterm version 157–166, corresponding Unicode version 3.2

‘’“”…―-
» > 8: xterm with option -cjk_width

0x8130A132 («ǆ» in GB18030)
» > 2: not a GB18030 terminal

0xA1A4A1B1EAA5A6A1
» < 10: non-native CJK terminal, Unicode width data (luit)
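
The probing step can be sketched as follows: write a test string starting at column 1, request a cursor position report (`ESC [ 6 n`), and derive the displayed width from the reply `ESC [ row ; col R`. The sketch covers only building the probe, parsing the reply and mapping the width of the first test string to a conclusion; the raw-mode terminal I/O is omitted, and all function names are hypothetical rather than taken from Mined.

```python
import re

CPR_REQUEST = "\x1b[6n"                          # request cursor position report
CPR_REPLY = re.compile(r"\x1b\[(\d+);(\d+)R")    # reply: ESC [ row ; col R

# Conclusions for the first test string, keyed by measured width.
CONCLUSIONS = {
    6: "UTF-8, no double-width, with LAM/ALEF ligature joining",
    7: "UTF-8, no double-width, no LAM/ALEF ligature joining",
    8: "UTF-8, double-width, with LAM/ALEF ligature joining",
    9: "UTF-8, double-width, no LAM/ALEF ligature joining",
}


def build_probe(test_string: str) -> str:
    """Carriage return, the test string, then the position request."""
    return "\r" + test_string + CPR_REQUEST


def width_from_reply(reply: str) -> int:
    """Displayed width, assuming the string started in column 1."""
    match = CPR_REPLY.search(reply)
    if match is None:
        raise ValueError("no cursor position report in reply")
    return int(match.group(2)) - 1


def classify(width: int) -> str:
    """Map the measured width of the first test string to a conclusion."""
    if width in CONCLUSIONS:
        return CONCLUSIONS[width]
    if width in (10, 11, 14, 16, 17):
        return "CJK encoded terminal"
    if width in (15, 18):
        return "8 bit terminal or CJK terminal"
    return "unknown"


if __name__ == "__main__":
    simulated_reply = "\x1b[5;8R"    # cursor ended up in column 8
    print(classify(width_from_reply(simulated_reply)))
```

Measuring what the terminal actually does avoids relying on locale data that may not match the terminal's behaviour.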

SLIDE 14

Mined: conclusion

Character encoding supported “out-of-the-box”

including input methods


» users may quickly edit international text without config. trouble

Terminal interoperability

» xterm, mlterm, hanterm, cxterm, rxvt, Linux console

Good range of text and program editing features
Intuitive user interface
Robust text and file handling engine
Portability

» Unix (Linux, Sun, HP, BSD, Mac, …), Windows (cygwin), DOS

Small-footprint behaviour