SLIDE 1

upTeX – Unicode version of pTeX with CJK extensions

Takuji Tanaka 田中琢爾

upTEX project

Oct 26, 2013

Takuji Tanaka 田中 琢 爾 (upTEX project) upTEX – Unicode version of pTEX with CJK extensions Oct 26, 2013 1 / 42

SLIDE 2

Outline / 概要

(1) Introduction
(2) Unicodization / Unicode 化
  ◮ Japanese / 日本語
  ◮ CJK / 中韓 / 中・日・한
  ◮ with European languages / 欧文との親和性
  ◮ world languages / 世界の言語
(3) Implementation / 実装
  ◮ Unicodization / Unicode 化
  ◮ \kcatcode
  ◮ set3
(4) upTeX vs. Ω, XeTeX, ...
(5) Present & future / 現在と今後

SLIDE 3

Part I

Introduction

SLIDE 4

Introduction pTeX/pLaTeX

ASCII pTeX/pLaTeX

It’s great: high-quality Japanese typesetting
  • incl. vertical writing, Japanese hyphenation, ...
Japanese standard TeX/LaTeX
Strong support by environment
  • DVIware, packages, macros, software, books, ...

but it has weaknesses:
  • Japanese local: 8bit Latin/Chinese/Korean are not available
  • Limited character set due to legacy encodings (Shift_JIS, EUC-JP)

SLIDE 5

Introduction Motivation

Motivation

  • Support a wider Japanese character set through Unicode
  • Support Babel by switching Latin–CJK tokens
  • Support Chinese/Korean
  • Keep the quality & environment of pTeX

SLIDE 6

Introduction Feature

Features of upTeX/upLaTeX

(1) High quality CJK typesetting based on pTeX/pLaTeX
(2) Compatible with pTeX/pLaTeX
(3) Unicode / UTF-8
(4) Switching Latin (12bit) / CJK (29bit) tokens
(5) CJK with Babel (Latin/Cyrillic/Greek ...)
(6) Over BMP, incl. SIP (U+2xxxx)

SLIDE 7

Part II

Unicodization / Unicode化

SLIDE 8

Unicodization / Unicode 化 Unicodization / Unicode 化

Unicodization / Unicode化

Strategies of Unicodization
(1) Unicodize only I/O
    Ex: \usepackage[utf8]{inputenc}
(2) Implement Unicode functions
    Ex: XeTeX
(3) Compromise
    upTeX: Internal: Unicodize only CJK; I/O: fully Unicodize
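
As a hedged illustration of strategy (3), here is a minimal upLaTeX source sketch. It is not taken from the slides: the file name, the sample text, the `ujarticle` class, and the `uplatex` + `dvipdfmx` build line are assumptions about a typical installation. The idea is that CJK characters are read by the engine as Unicode tokens, while accented Latin goes through inputenc as in ordinary LaTeX.

```latex
% sample.tex -- minimal sketch, not taken from the slides.
% Assumed build: uplatex sample.tex && dvipdfmx sample.dvi
\documentclass{ujarticle}      % assumption: upLaTeX standard Japanese class
\usepackage[utf8]{inputenc}    % Latin side: Unicodize only I/O, as in strategy (1)
\usepackage[T1]{fontenc}
\kcatcode"00E9=15              % treat the block containing U+00E9 (é) as Latin
\begin{document}
café, naïve % Latin part: handled as in original TeX

日本語、漢字、かな % CJK part: handled as Unicode tokens by upTeX
\end{document}
```

The \kcatcode line uses a numeric code point rather than a literal character so that it does not depend on how the character itself is currently tokenized.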

SLIDE 9

Unicodization / Unicode 化 Partial Unicodization / 折衷的 Unicode 化

Partial Unicodization / 折衷的Unicode化

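Character class          e.g.            TeX              pTeX   upTeX
Latin, 7bit Latin        azAZ            yes              yes    yes
Latin, 8bit Latin        æœÆŒ гдГД       yes (inputenc)   no     yes (inputenc)
Japanese, JIS X 0208     あア亜          no               yes    yes
Japanese, Unicode        ① Ⅳ 髙 汉字     no               no     yes
CK, Unicode              漢字 한글       no               no     yes

pTeX and upTeX each consist of two parts:
(1) A part that is the same as the original TeX
(2) A CJK part: JIS X 0208 in pTeX, Unicode in upTeX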

SLIDE 10

Japanese / 日本語 New JIS / 新 JIS

New JIS : JIS X 0213

upTeX handles the new JIS X 0213, which extends JIS X 0208.

〼 〽 ♮ ♫ ♬ ♩ ♤ ♠ ♢ ♦ ♡ ♥ ♧ ♣ ☖ ☗ 〠 ☎ ☀ ☁ ☂ ☃ ♨ ゔ ゕ ゖ ヷ ヸ ヹ ヺ ⅓ ⅔ ⅕ ✓ ⌘ ␣ ⏎ ㈱ ㈲
① ② ③ ❶ ❷ ❸ ⓵ ⓶ ⓷ ⅰ ⅱ ⅲ Ⅰ Ⅱ Ⅲ ⓐ ⓑ ⓒ ㋐ ㋑ ㋒
鄧小平 李承燁 里見弴 草彅剛 朴璐美 森鷗外 森雞二 王銘琬 宮﨑あおい 蔣介石 你好 深圳 東日本旅客鉃道株式会社 尾骶骨 生酛仕込 凮月堂 㐂寿 仐寿 圓壔函數 啞然 火焰 嚙む 任俠 長身瘦軀 石鹼 屢〻 刺繡 醬油 蟬時雨 隔靴搔痒 奥飛驒 簞笥 摑む 充塡 顚末 祈禱 瀆職 土囊 潑溂 醱酵 頰紅 素麵 麴町 蓬萊 蠟燭 攢竹

SLIDE 11

Japanese / 日本語 Characters out of JIS / JIS 外字

Characters out of JIS / JIS外字

Over JIS X 0213 (new JIS)

source:
    髙島屋、内田百閒、杮落とし、安全㐧一、𠮷野家

output:
    髙島屋、内田百閒、杮落とし、安全㐧一、𠮷野家

Platform dependent characters are now in Unicode:
① ② ③ ④ ⑤ ⑥ ⑦ ⑧ ⑨ ⑩ ⑪ ⑫ ⑬ ⑭ ⑮ ⑯ ⑰ ⑱ ⑲ ⑳
Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ
㍉㌔㌢㍍㌘㌧㌃㌶㍑㍗㌍㌦㌣㌫㍊㌻ ㎜㎝㎞㎎㎏㏄㎡㍻
〝〟 № ㏍℡ ㊤㊥㊦㊧㊨㈱㈲㈹㍾㍽㍼
≒ ≡ ∫ ∮ √ ⊥ ∠ ∟ ⊿ ∵ ∩ ∪
髙閒塚 德豐﨑 彅弴燁珉鄧

SLIDE 12

CJK / 中・日・한 basis

Chinese/Japanese/Korean

中・日・한

source:
    \schrm 简体中文: 你好
    \tchrm 繁體中文: 早晨
    \jpnrm 日本語: こんにちは
    \korrm 한국어: 안녕하세요

output:
    简体中文: 你好
    繁體中文: 早晨
    日本語: こんにちは
    한국어: 안녕하세요

SLIDE 13

CJK / 中・日・한 glyphs

Difference of glyphs among CJK / CJKのグリフの違い

Simplified Chinese

骨練,平直。神祀,才次.

Traditional Chinese

骨練,平直。神祀,才次.

Japanese

骨練,平直。神祀,才次.

Korean

骨練,平直。神祀,才次.

SLIDE 14

CJK / 中・日・한 end-of-line

end-of-line

source (↓ marks a line break in the input):
    Please give↓
    me beer.
    请给我↓
    啤酒。
    ビールを私に↓
    下さい。
    맥주를 나에게↓
    주세요.

output:
    Please give me beer.       (line break treated as space)
    请给我啤酒。               (line break ignored)
    ビールを私に下さい。       (line break ignored)
    맥주를 나에게 주세요.      (line break treated as space)
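
The end-of-line behaviour can be tried with a small upLaTeX file. This is a sketch under assumptions: the `ujarticle` class and the build line are guesses about a typical setup, and only the English and Japanese sentences are included because the Chinese and Korean ones additionally need fonts that cover those scripts.

```latex
% eol.tex -- minimal sketch, not taken from the slides.
% Assumed build: uplatex eol.tex && dvipdfmx eol.dvi
\documentclass{ujarticle}   % assumption: upLaTeX standard Japanese class
\begin{document}
% Latin letters: the line break becomes an interword space.
Please give
me beer.

% Kana and Kanji: the line break is ignored, so no space appears.
ビールを私に
下さい。
\end{document}
```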

SLIDE 15

CJK / 中・日・한 control words

Control word by CJK characters

source:
    \def\오늘{%
      \number\year 연%
      \number\month 월%
      \number\day 일%
    }
    Today: 《\오늘》

output:
    Today:《2013연10월26일》

SLIDE 16

CJK / 中・日・한 Japanese-OTF package

Japanese-OTF package

source:
    \usepackage[uplatex,...]{otf}
    ...
    Adobe-Korea1-1:\\
    \CIDK{8322}\CIDK{8588}
    ...
    Adobe-Japan1-5:\\
    \● 問\◇ 答\ajRecycle{10}%
    \ajLig{学校法人}%
    \ajPICT{野球}\\
    \ajMaru{1}...

output:
    Adobe-Korea1-1: 1⃞☯약⃝
    Adobe-Japan1-5: 問 答 ♼ 学校法人 野球 ① ❷ 3 4 ⑸ ⒍ ㈦㊇ Ⅸ

The Japanese-OTF package also supports Chinese and Korean (CK).

SLIDE 17

CJK / 中・日・한 Unification / 統合

Unification / 統合

            standard     full-width
Cyrillic    Ж U+0416     Ж U+0416
Latin       W U+0057     Ｗ U+FF37

Unicode has no separate “full-width” code points for Greek or Cyrillic. This is a barrier to Unicodizing Japanese software. upTeX can handle full-width Greek and Cyrillic through markup.
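
One way such markup could look is sketched below; it is an illustration, not the author's own example. It assumes the `ujarticle` class, installed T2A Cyrillic fonts, and that switching \kcatcode for the Cyrillic block between 15 (Latin side) and 18 (CJK symbol) selects the standard or the full-width glyph for the same code point U+0416. The \kcatcode assignments sit on their own lines so that they take effect before the following line is read.

```latex
% width.tex -- minimal sketch, not taken from the slides.
% Assumed build: uplatex width.tex && dvipdfmx width.dvi
\documentclass{ujarticle}      % assumption: upLaTeX standard Japanese class
\usepackage[T2A,T1]{fontenc}   % assumption: T2A Cyrillic fonts are installed
\usepackage[utf8]{inputenc}
\begin{document}
% Standard width: U+0416 is read as Latin-side input and goes through inputenc.
\begingroup
\kcatcode"0416=15
\fontencoding{T2A}\selectfont Ж\par
\endgroup
% Full width: U+0416 is read as a CJK token and set in the Japanese font.
\begingroup
\kcatcode"0416=18
Ж\par
\endgroup
\end{document}
```

The groups keep each \kcatcode change local, so the rest of the document keeps whatever setting was in force before.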

SLIDE 18

with European languages / 欧文との親和性 inputenc

inputenc & UTF-8

source:
    \usepackage[utf8]{inputenc}
    \usepackage[T1]{fontenc}
    \kcatcode`ç=15
    ...
    “¿But aren’t Kafka’s Schloß and Æsop’s Œuvres often naïve vis-à-vis the dæmonic phœnix’s official rôle in fluffy soufflés?”

output:
    “¿But aren’t Kafka’s Schloß and Æsop’s Œuvres often naïve vis-à-vis the dæmonic phœnix’s official rôle in fluffy soufflés?”

SLIDE 19

with European languages / 欧文との親和性 Babel

Babel

source:
    \usepackage[french,...]%
      {babel}
    ...
    \selectlanguage{english}
    English ... \today
    ...
    \selectlanguage{russian}
    Русский ... \today
    \selectlanguage{japanese}
    日本語 ... \today

output:
    English    October 26, 2013
    Français   26 octobre 2013
    Deutsch    26. Oktober 2013
    Czech      26. října 2013
    Русский    26 октября 2013 г.
    日本語     2013 年 10 月 26 日

SLIDE 20

with European languages / 欧文との親和性 It’s a small world

It’s a small world

upTeX can handle CJK, Latin, Cyrillic, and Greek.
upTeX cannot directly handle Arabic, Brahmic scripts, ...

SLIDE 21

Part III

Implementation / 実装

SLIDE 22

Implementation / 実装 Unicodization / Unicode 化

Unicodization / Unicode化

(1) I/O: EUC/SJIS in pTeX → UTF-8 in upTeX (ptexenc library)
(2) Internal buffer: 16bit in pTeX → 29bit in upTeX (Ref. Omega)
(3) Unicodize standard macros, libraries
(4) upTeX support in DVIware

SLIDE 23

Implementation / 実装 DVIware

DVIware

ptetex3+ / Linux
W32TeX / Windows

dvipdfmx, dvips, xdvi, dvi2tty, and DVIOUT are available.

SLIDE 24

Implementation / 実装 \kcatcode

\kcatcode

kcatcode  catcode  kind         e.g.      control word  end of line
15        ...      ...
15        10       space
15        11       char         azAZ      yes           as space
15        12       other char   (.!?      no            as space
15        ...      ...
16        -        Kanji        汉漢      yes           ignore
17        -        Kana         かナ      yes           ignore
18        -        CJK symbol   《・。 』   no            ignore
19        -        Hangul       한글      yes           as space

If \kcatcode is 15, the character is treated as Latin and upTeX works the same as the original TeX.
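
The current \kcatcode values can be inspected from inside a document. The sketch below is an illustration under assumptions (the `ujarticle` class and the build line are guesses about a typical setup); it prints the value for one sample character per class so the numbers can be compared with the table, then switches the block containing U+00E7 (ç) to 15 and prints that value too.

```latex
% kcat.tex -- minimal sketch, not taken from the slides.
% Assumed build: uplatex kcat.tex && dvipdfmx kcat.dvi
\documentclass{ujarticle}   % assumption: upLaTeX standard Japanese class
\begin{document}
% Print the current \kcatcode of one sample character per class.
Kanji: \the\kcatcode`漢\par
Kana: \the\kcatcode`か\par
CJK symbol: \the\kcatcode`。\par
Hangul: \the\kcatcode`한\par
% Switch the block containing U+00E7 to Latin handling, then print its value.
\kcatcode"00E7=15
Latin-1 block: \the\kcatcode"00E7\par
\end{document}
```

The values printed for the CJK samples depend on the defaults of the installed format, so they are best read against the table rather than assumed.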

SLIDE 25

Implementation / 実装 set3 & over BMP

set3 & over BMP

𠂉 𠀋 𠂢 𠂤 𠆢 𠈓 𠌫 𠎁 𠍱 𠏹 𠑊 𠔉 𠗖 ⺇ 𠝏 𠠇 𠠺 𠢹 𠥼 𠦝 𠫓 𠬝 𠵅 𠷡 𠺕 𠹭 𠹤 𠽟 𡈁 𡈽 𡉕 𡉻 𡉴 𡋤 𡋗 𡌛 𡋽 𡌶 𡍄 𡏄 𡑮 𡑭 𡗗 𦰩 𡙇 𡜆 𡝂 𡢽 𡧃 𡱖 𡴭 𡚴 𡵅 𡵸 𡵢 𡶡 𡶜 𡶒 𡶷 𡷠 𡸴 𡸳 𡼞 𡽶 𡿺 𢅻 𢌞 𢎭 𢛳 𢡛 𢢫 𢦏 𢪸 𢭐 𢭑 𢭆 𢰝 𢮦 𢰤 𢷡 𣇄 𣇃 𣇵 𣆶 𣍲 𣏓 𣏒 𣏐 𣏤 𣏕 𣏚 𣏟 𣑊 𣑑 𣑋 𣑥 𣓤 𣕚 𣗄 𣖔 𣘹 𣙇 𣘸 𣘺 𣜿 𣜜 𣝣 𣜌 𣝤 𣟿 𣟧 𣠤 𣠽 𣪘 𣱿 𣳾 𣴀 𣵀 𣷺 𣷹 𣷓 𣽾 𤂖 𤄃 𤇆 𤇾 𤎼 𤘩 𤚥 𤟱 𤢖 𤩍 𤭖 𤭯 𤰖 ⺪ 𤸎 𤸷 𤹪 𤺋 𥁊 𥁕 𥄢 𥆩 𥇥 𥇍 𥈞 𥉌 𥐮 𥒎 𥓙 𥔎 𥖧 𥝱 𥞩 𥞴 𥧄 𥧔 𥫤 𥫣 𥫱 𥮲 𥱋 𥱤 𥶡 𥸮 𥹖 𥹥 𥹢 𥻘 𥻂 𥻨 𥼣 𥽜 𥿠 𥿔 𦀌 𥿻 𦀗 𦁠 𦃭 𦉰 𦊆 𦍌 𣴎 𦐂 𦙾 𦚰 𦜝 𦣝 𦣪 ⺽ 𦥯 𦧝 𦨞 𦩘 𦪌 𦪷 𦫿 𦱳 𦳝 𦹀 𦹥 𦾔 𦿸 𦿷 𦿸 𧃴 𧄍 𧄹 𧏛 𧏚 𧏾 𧐐 𧑉 𧘕 𧘔 𧘱 𧚄 𧚓 𧜎 𧜣 𧝒 𧦅 𧪄 𧮳 𧮾 𧯇 𧲸 𧶠 𧸐 ⻊ 𨂊 𨂻 𨉷 𨊂 𨋳 𨏍 𨐌 𨑕 𨕫 𨗉 𨗊 𨛗 𨛺 𨥉 𨥆 𨥫 𨦇 𨦈 𨦻 𨦼 𨨞 𨨩 𨩱 𨩃 𨪙 𨫍 𨫤 𨫝 𨯁 𨯯 𨴐 𨵱 𨷻 𨸟 𨸶 𨺉 𨻫 𨼲 𨿸 𩊠 𩊱 𩒐 𩗏 ⻞ 𩛰 𩜙 𩝐 𩣆 𩩲 𩷛 𩸽 𩸕 𩺊 𩹉 𩻄 𩻩 𩻛 𩿗 𪀯 𪀚 𪃹 𪂂 𪆐 𢈘 𪎌 𪐷 𪗱 𪘂 𪘚 𪚲 𠮟 (JIS2004 includes a lot of CJK Ideograph Extension B)

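upTeX supports the SIP (Supplementary Ideographic Plane, U+2xxxx) by using the DVI command set3, which takes a three-byte character code and can therefore carry code points above U+FFFF. How visionary Knuth is!!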

SLIDE 26

Part IV

upTeX vs. Ω, XeTeX, ...

SLIDE 27

upTeX vs. Ω, XeTeX, ...

                          TeX   pTeX   upTeX   Ω    XeTeX
Compatibility  Latin      ◎     ○      ◎       ○    △
               Japanese   ー    ◎      ◎       ×    ×
Advancedness              ×     ×      ×       ×    ◎
Multilingual   Latin      ◎     ○      ◎       ◎    ◎
               Japanese   ー    ○      ◎       △    △
               CK         ー    ー     ◎       △    △
               others     ー    ー     ー      △    ◎
Integrity (Japanese)      ◎     ◎      ◎       △    △
Popularity     Japan      ◎     ◎      ○       △    △
               World      ◎     △      △       △    ○

Rating: ◎ > ○ > △ > ×

SLIDE 28

Part V

Present & Future / 現在と今後

SLIDE 29

Present & Future / 現在と今後 History

History

Year  Event
1995  ASCII pTeX ver.2, pLaTeX2e
2007  upTeX first release, alpha version
2007  upTeX is in W32TeX
2008  e-upTeX by Kitagawa-san
2012  upTeX 1.00
2012  upTeX is in TeX Live
2013  upTeX presentation in TUG2013

SLIDE 30

Present & Future / 現在と今後 Future

Future / 今後

Currently, upTeX is capable of multilingual (CJK, Latin, Cyrillic, Greek) typesetting.
Possible items for the future are:
(1) Document classes for Chinese/Korean (Any volunteer?)
(2) Babel options for Chinese/Korean (It will be useful in ko.TeX etc. Any volunteer?)
(3) Does upTeX have the potential to be a useful CJK TeX?

SLIDE 31

Part VI

Appendix / おまけ

SLIDE 32

Appendix / おまけ Latin/CJK tokens

Latin/CJK tokens

                             TeX                   pTeX                       upTeX
Latin   I/O                  8bit (multibytes)†    7bit, 1byte                8bit (multibytes)†
        token: charcode      8bit                  8bit                       8bit
        token: catcode       4bit                  4bit                       4bit
CJK     I/O                  -                     EUC etc. (8bit, 2bytes)    UTF-8 (8bit, 2–4bytes)
        token: charcode      -                     16bit                      24bit
        token: kcatcode      -                     -                          5bit
Latin/CJK classification     -                     fixed                      customizable
inputenc                     OK                    NG                         OK
Babel                        full                  partial                    full

†: with inputenc

SLIDE 33

Appendix / おまけ Encoding

Character encoding in upTEX

                Latin                  CJK (upTeX extended)
                (TeX compatible)
                <256                   BMP                  over BMP               comment
.tex / .aux     UTF8                   UTF8                 UTF8
I/O buffer      1byte                  2–3bytes             4bytes
token           12bit                  29bit                29bit                  with (k)catcode
.dvi / .vf      set1, T1 etc., 8bit    set2, UCS2, 16bit    set3, UTF32, 24bit
.tfm            T1 etc., 8bit          UCS2, 16bit          -†                     ‘jfm’ for CJK
.ps / CMap      T1 etc., 8bit          UCS2, 16bit          UTF16, 2×16bit

†: treated as Kanji

SLIDE 34

Appendix / おまけ kcatcode

kcatcode

kcatcode  catcode  kind         e.g.      control word  end of line
15        ...      ...
15        10       space
15        11       char         azAZ      yes           as space
15        12       other char   (.!?      no            as space
15        ...      ...
16        -        Kanji        汉漢      yes           ignore
17        -        Kana         かナ      yes           ignore
18        -        CJK symbol   《・。 』   no            ignore
19        -        Hangul       한글      yes           as space
