The Multilingual Lion: TeX learns to speak Unicode

Jonathan Kew, SIL International. Presented April 7, 2005 at the 27th Internationalization and Unicode Conference, Berlin, Germany.


SLIDE 1

The Multilingual Lion: TeX learns to speak Unicode

Jonathan Kew, SIL International

April 7, 2005. 27th Internationalization and Unicode Conference, Berlin, Germany, April 2005

SLIDE 2

Background

  • TeX: free typesetting system with a 25-year history
      • stable, reliable, flexible, widely implemented
      • experienced user community
      • rich collection of supporting tools
  • Originally designed for English typesetting
      • support for accents and other European characters
      • language support extended via custom fonts, macros, and preprocessors

SLIDE 3

Traditional TeX input conventions

  • Input text is ASCII (or 8-bit codepage)

Source text      Typeset output   Notes
\'{a}            á                typical accent command
\c{c}            ç
\aa              å
                                  ligature in typical TeX fonts
$\alpha$         α                math mode symbol
{\dn acchaa}     अच्छा             using custom preprocessor
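
As a rough illustration of what the accent commands listed above stand for, the following Python sketch maps two of them to the Unicode characters they produce. The function name `accent_to_unicode` and the lookup table are hypothetical and cover only the two accents shown; this is not how TeX itself builds accented glyphs.

```python
import unicodedata

# Combining marks for two of the accent commands listed above.
# This table is illustrative only; TeX defines many more accents.
COMBINING = {
    "'": "\u0301",  # \'{a}: combining acute accent
    "c": "\u0327",  # \c{c}: combining cedilla
}


def accent_to_unicode(command: str, base: str) -> str:
    """Return the precomposed Unicode character for an accent command."""
    # Compose base letter + combining mark into a single code point (NFC).
    return unicodedata.normalize("NFC", base + COMBINING[command])


print(accent_to_unicode("'", "a"))  # U+00E1
print(accent_to_unicode("c", "c"))  # U+00E7
```

With Unicode input, the characters on the right can simply be typed directly, which is the point the later slides develop.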

SLIDE 4

Multilingual typesetting with TeX

  • Text input
      • Escape sequences for non-ASCII characters
      • Multiple 8-bit codepages
      • Preprocessors for complex scripts
  • Font support
      • Fonts limited to 256 glyphs
      • Custom-encoded fonts with specific glyph sets
  • All tied together via complex TeX macros
      • Difficult to understand and extend
      • Difficult to integrate with other packages

SLIDE 5

Towards a cleaner solution

  • Unicode: all required characters directly represented
      • no need for “escape sequences” to access characters not included in the current codepage
      • no need to switch between codepages according to the language/script being typeset
      • characters rendered via standard access codes
  • Character/glyph model and modern font rendering technologies
      • complex script handling moved out of the domain of the text data stream

SLIDE 6

Typesetting Unicode text with XeTeX

  • Accented characters

    \halign{#\hfil\quad& #\hfil\cr
      dan&dan\cr
      dubok&dubok\cr
      džabe&đak\cr
      džin&džabe\cr
      Džin&džin\cr
      đak&Džin\cr
      Evropa&Evropa\cr}

    dan       dan
    dubok     dubok
    džabe     đak
    džin      džabe
    Džin      džin
    đak       Džin
    Evropa    Evropa

SLIDE 7

Typesetting Unicode text with XeTeX

  • CJK ideographs

    \font\han="STSong" at 16pt
    \font\rom="Gentium" at 8pt
    \def\hc#1#2{\vtop{\hbox{\han#1}
      \hbox{\kern10pt\rom#2}}}
    \vtop{\hc{書く}{ka-ku}
      \hc{最も}{motto-mo}
      \hc{最後}{sai-go}
      \hc{働く}{hatara-ku}
      \hc{海}{umi}}

    書く    ka-ku
    最も    motto-mo
    最後    sai-go
    働く    hatara-ku
    海      umi

SLIDE 8

Typesetting Unicode text with XeTeX

  • Complex scripts

[Figure: source text in an Arabic-script language, marked up with \c1, \s, \p, \v1, \v2 and \v3 tags, beside its typeset right-to-left output]

SLIDE 9

Key changes from TeX to XeTeX

  • Unicode as the text encoding
      • directly use Unicode input text, Unicode-encoded fonts
  • Fonts and rendering technologies
      • use any fonts available in the host computer
      • use existing smart-font rendering systems
  • Additional features for multilingual typesetting
      • optional font features
      • line breaking for Asian scripts
  • Backward compatibility issues
      • support for legacy TeX fonts and documents

SLIDE 10

From 8 to 16 bits…

  • Character type in TeX code was 8-bit value
      • one option: process text as UTF-8
  • Character codes used to index a number of tables
      • character category, case pairs, etc.
  • Decision to use 16-bit character codes
      • all 256-element tables enlarged to 65,536 elements to match the extended character set
      • extended TeX commands that refer to character codes
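
The table-indexing point can be pictured with a small Python sketch. It contrasts a 256-entry table indexed by UTF-8 bytes with a 65,536-entry table indexed by 16-bit character codes. The table names and the helper `mark_letter` are hypothetical; only the category values 11 (letter) and 12 (other) are taken from TeX.

```python
# TeX category codes: 11 = letter, 12 = "other" (used as the default here).
LETTER, OTHER = 11, 12

# Classic layout: one table entry per 8-bit value.
catcode_8bit = [OTHER] * 256

# Enlarged layout: one table entry per 16-bit character code.
catcode_16bit = [OTHER] * 65536


def mark_letter(table, code):
    """Record that the given character code is a letter."""
    table[code] = LETTER


ch = "đ"  # U+0111, used in the Serbo-Croatian sample earlier in the talk

# Processed as UTF-8, the character arrives as two separate bytes,
# so no single entry of the 256-element table describes it.
utf8_bytes = ch.encode("utf-8")
per_byte_categories = [catcode_8bit[b] for b in utf8_bytes]

# With 16-bit codes, the character indexes exactly one table entry.
mark_letter(catcode_16bit, ord(ch))

print(list(utf8_bytes), per_byte_categories)
print(hex(ord(ch)), catcode_16bit[ord(ch)])
```

The second table costs more memory, but every per-character property stays a single lookup, which is presumably why enlarging the tables was the simpler change.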

SLIDE 11

From 8 to 16 bits… and beyond?

  • Unicode does not fit in 16 bits either!
  • XeTeX handles non-BMP characters as UTF-16 surrogate pairs
      • properties of individual characters cannot be set
      • unlikely to matter for typesetting usage: all surrogate codes can be treated as simple printable characters
      • keeps size of internal tables moderate, without extensive restructuring
  • Using UTF-16 happens to match the font rendering APIs that XeTeX uses
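
For readers unfamiliar with surrogate pairs, the standard UTF-16 arithmetic is sketched below in Python. The function name `to_surrogate_pair` is hypothetical, and the code illustrates the encoding itself rather than XeTeX's internal implementation.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))
```

Both halves are ordinary 16-bit values, so they fit the 65,536-entry tables without further restructuring, at the price that the two halves cannot carry per-character properties of their own.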

SLIDE 12

Implementing the character/glyph model

  • Required for support of complex scripts in Unicode
  • Significant change from traditional TeX model
      • TeX regards “a specific character code in a specific font” as the fundamental unit of text to be typeset
      • assumes such a character has known, fixed dimensions
      • provision for ligatures by character substitutions
      • a paragraph consists of a sequence of “character” nodes, to be precisely placed, and intervening “glue” nodes
  • A Unicode character may not map to a single, known glyph
      • many scripts require contextual selection of glyphs
      • must measure characters in context, not in isolation
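
A toy Python sketch of why measurement must happen in context: in a script with positional forms, the glyph chosen for a letter, and therefore its width, depends on where the letter sits in the word. The widths and the four-form model below are invented for illustration and ignore real-world details such as non-joining letters.

```python
# Hypothetical advance widths (font units) for the positional forms of a letter.
FORM_WIDTH = {"isolated": 600, "initial": 350, "medial": 300, "final": 550}


def form_for(index: int, length: int) -> str:
    """Pick the positional form of the letter at `index` in a word."""
    if length == 1:
        return "isolated"
    if index == 0:
        return "initial"
    if index == length - 1:
        return "final"
    return "medial"


def width_in_isolation(word: str) -> int:
    """Fixed-dimension model: every letter measured on its own."""
    return len(word) * FORM_WIDTH["isolated"]


def width_in_context(word: str) -> int:
    """Contextual model: each letter measured in the form it actually takes."""
    return sum(FORM_WIDTH[form_for(i, len(word))] for i in range(len(word)))


word = "abc"  # stands in for any three-letter word in a joining script
print(width_in_isolation(word), width_in_context(word))
```

The two totals differ, so a model that assumes fixed per-character dimensions would place such a word incorrectly.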

SLIDE 13

Implementing the character/glyph model

  • Initial implementation using ATSUI on Mac OS X
      • typesetting process collects runs of characters (words)
      • calls ATSUI text layout APIs to measure width
      • a XeTeX paragraph consists of a sequence of “word” nodes separated by “glue”
  • Typesetting engine positions words, not glyphs
      • positioning glyphs is the job of the font rendering engine
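
The word-node idea can be sketched in a few lines of Python. The classes `Word` and `Glue`, the function `build_paragraph`, and the stand-in `measure` function are all hypothetical; in XeTeX the measuring step is delegated to the layout engine (ATSUI in the initial implementation).

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    width: float


@dataclass
class Glue:
    width: float


def measure(text: str) -> float:
    """Stand-in for a layout-engine call: assume half an em per character."""
    return 0.5 * len(text)


def build_paragraph(text: str, space_width: float = 0.33) -> list:
    """Turn a line of text into word nodes separated by glue nodes."""
    nodes = []
    for i, word in enumerate(text.split()):
        if i:
            nodes.append(Glue(space_width))      # glue between adjacent words
        nodes.append(Word(word, measure(word)))  # one node per whole word
    return nodes


for node in build_paragraph("Two different foxes"):
    print(node)
```

The line breaker only ever sees word widths and glue, so it needs no knowledge of how the glyphs inside each word are chosen or positioned.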

SLIDE 14

Implementing the character/glyph model

[Figure: nodes in a TeX paragraph (one node per character, with glue nodes for word spaces) beside the corresponding nodes in XeTeX (one node per word, with the same glue nodes)]

SLIDE 15

Implementing the character/glyph model

  • OpenType Layout support using ICU library
      • alternative font layout engine
      • provides support for OpenType features in Latin fonts
      • supports a number of complex (Indic/Asian) scripts
  • XeTeX uses either ATSUI or ICU according to layout tables found in fonts
      • overall typesetting process is independent of font technology in use
      • distinction required only at lowest level of measuring a run of text in a given font
      • documents may freely mix AAT and OT fonts

SLIDE 16

Implementing the character/glyph model

  • ATSUI APIs used in typesetting
      • ATSUCreateStyle, ATSUSetAttributes
      • ATSUCreateTextLayout, ATSUSetTextPointerLocation, ATSUSetRunStyle
      • ATSUGetUnjustifiedBounds, ATSUDrawText
  • ICU APIs used in typesetting
      • ubidi_open, ubidi_close, ubidi_setPara, ubidi_getDirection, ubidi_countRuns, ubidi_getVisualRun
      • LayoutEngine::layoutChars, getGlyphs, getGlyphPositions

SLIDE 17

Hyphenation support

  • Paragraphs formed of lists of “word boxes”
      • treated as indivisible units in the token list
      • allows TeX to remain unaware of low-level details
  • If acceptable line breaks not found, hyphenation required
      • extract text characters from word nodes
      • find hyphen positions using TeX’s algorithm
      • repackage words as word fragments and discretionary break nodes

SLIDE 18

Hyphenation support

  • Modifying the node list to allow hyphenation

    [Two] [glue] [different] [glue] [foxes]
    [Two] [glue] [dif] [hyphen] [fer] [hyphen] [ent] [glue] [foxes]

  • Problem: unused hyphen points break rendering

    [Two] [glue] [dif] [fer] [ent] [glue] [foxes]
    Two differ- ent foxes

  • Need to re-merge word nodes after choosing breaks

    [Two] [glue] [differ-] [ent] [glue] [foxes]
    Two differ- ent foxes
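
The split-then-re-merge step can be sketched in Python as follows. The function names and the hard-coded hyphen offsets are hypothetical; in XeTeX the hyphen positions come from TeX's own hyphenation algorithm, which is not reproduced here.

```python
def split_at_hyphen_points(word: str, points: list) -> list:
    """Split a word into fragments at the given character offsets."""
    cuts = [0, *points, len(word)]
    return [word[a:b] for a, b in zip(cuts, cuts[1:])]


def remerge(fragments: list, chosen):
    """Re-join fragments, keeping only the break after fragment `chosen`.

    `chosen` is the index of the fragment that ends the line,
    or None if the line breaker used no hyphen in this word.
    """
    if chosen is None:
        return ["".join(fragments)]
    head = "".join(fragments[: chosen + 1]) + "-"
    tail = "".join(fragments[chosen + 1:])
    return [head, tail]


# Assumed hyphen points for "different": dif-fer-ent.
fragments = split_at_hyphen_points("different", [3, 6])
print(fragments)

# Suppose the line breaker chose the break after "fer".
print(remerge(fragments, 1))
```

Re-merging matters because each fragment would otherwise be shaped on its own, losing ligatures and contextual forms that span an unused hyphen point.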

SLIDE 19

Advanced font features

  • OpenType language systems

    \font\Doulos="Doulos SIL/ICU"
    \font\DoulosViet="Doulos SIL/ICU:language=VIT"

    \Doulos:      Unicode cung cấp một con số duy nhất cho mỗi ký tự
    \DoulosViet:  Unicode cung cấp một con số duy nhất cho mỗi ký tự

    \font\Brioso="Brioso Pro"
    \font\BriosoTrk="Brioso Pro:language=TRK"

    \Brioso:      … gelen firmaları … tarafından …
    \BriosoTrk:   … gelen firmaları … tarafından …

SLIDE 20

Advanced font features

  • Custom AAT features

    \font\Doulos="Doulos SIL/AAT"
    \font\DoulosAlt="Doulos SIL/AAT:Alternate forms=Literacy alternates, Small v-hook straight style; Uppercase Eng alternates=Capital N with tail"

    Sample text, set first in \Doulos and then in \DoulosAlt with the alternate glyph forms:

    Xɔsee na Mose ɖo Ŋutitotoŋkeke la anyi, eye wòna wohlẽ ʋu ɖe ʋɔtrutiwo ŋu bene dɔla si atsr ŋgɔgbeviwo la nagawɔ nuvevi Israel viwo ya o.

SLIDE 21

East Asian languages

  • Line breaking without word spaces
      • TeX normally breaks lines at “glue” arising from spaces
      • Chinese, Japanese, Thai, etc. do not use word spaces
      • sample paragraph, set without word-break information:

        โดยพื้นฐานแล้ว, คอมพิวเตอร์จะเกี่ยวข้องกับเรื่องของตัวเลข. คอมพิวเตอร์จัดเก็บตัวอักษรและอักขระอื่นๆ โดยการกำหนดหมายเลขให้สำหรับแต่ละตัว. ก่อนหน้าที่ Unicode จะถูกสร้างขึ้น, ได้มีระบบ encoding อยู่หลายร้อยระบบสำหรับการกำหนดหมายเลขเหล่านี้.

  • Use ICU line-break: \XeTeXlinebreaklocale "th"
      • the same paragraph, now broken at Thai word boundaries

SLIDE 22

Backward compatibility

  • Legacy TeX fonts, especially for math mode
      • supported via TeX font metrics and Type 1 font files
      • allow many existing TeX documents to work
      • not Unicode-compliant!

$$\left(\int_{-\infty}^{\infty} e^{-x^2}\,dx\right)^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2+y^2)}\,dx\,dy = \int_0^{2\pi}\int_0^{\infty} e^{-r^2} r\,dr\,d\theta = \int_0^{2\pi}\left(-\frac{e^{-r^2}}{2}\,\bigg|_{r=0}^{r=\infty}\right)d\theta = \pi.$$

SLIDE 23

Backward compatibility

  • Non-Unicode input text
      • by default, input read as Unicode (UTF-8 or UTF-16)
      • legacy codepages supported via ICU converters
      • set codepage of current input file:

        \XeTeXinputencoding "charset-name"

      • set initial codepage for newly-opened input files:

        \XeTeXdefaultencoding "charset-name"
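
What a codepage converter does for legacy input can be shown with Python's built-in codecs, used here as a stand-in for the ICU converters. The function `decode_input` and the choice of ISO 8859-2 as the legacy codepage are assumptions made for the example.

```python
DEFAULT_ENCODING = "utf-8"  # plays the role of the default input encoding


def decode_input(raw: bytes, encoding: str = DEFAULT_ENCODING) -> str:
    """Convert the bytes of an input file to Unicode text."""
    return raw.decode(encoding)


# A word stored in a legacy 8-bit codepage (ISO 8859-2, Central European).
legacy_bytes = "džabe".encode("iso-8859-2")

# Declaring the right codepage recovers the intended characters.
print(decode_input(legacy_bytes, "iso-8859-2"))

# Reading the same bytes as UTF-8 fails, because 0xBE is not valid on its own.
try:
    decode_input(legacy_bytes)
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)
```

Once the bytes are converted, the rest of the pipeline sees ordinary Unicode text regardless of how the file was stored.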

SLIDE 24

Backward compatibility

  • Support for legacy keying practices
      • typical input:

        ``\TeX''---a typesetting system

      • generates: ``TeX''---a typesetting system
  • Font mapping for compatibility

        ; TECkit mapping for TeX input conventions
        U+002D U+002D <> U+2013 ; -- -> en dash
        U+002D U+002D U+002D <> U+2014 ; --- -> em dash
        U+0027 <> U+2019 ; ' -> right single quote
        U+0027 U+0027 <> U+201D ; '' -> right double quote
        U+0022 > U+201D ; " -> right double quote

      • generates: “TeX”—a typesetting system
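
The effect of such a mapping can be imitated with a longest-match substitution in Python. This is only a sketch of the idea, not TECkit's actual engine or file format; the function `apply_mapping` is hypothetical, and the two backquote rules are an assumption added here because the excerpt above lists only the right-hand quotes.

```python
# Input sequence -> replacement character.
RULES = {
    "---": "\u2014",  # em dash
    "--": "\u2013",   # en dash
    "''": "\u201d",   # right double quote
    "'": "\u2019",    # right single quote
    '"': "\u201d",    # right double quote
    # Assumed rules (not in the excerpt) for the opening quotes:
    "``": "\u201c",   # left double quote
    "`": "\u2018",    # left single quote
}


def apply_mapping(text: str) -> str:
    """Replace each input sequence, always preferring the longest match."""
    keys = sorted(RULES, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(RULES[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # no rule applies: copy the character
            i += 1
    return "".join(out)


print(apply_mapping("``TeX''---a typesetting system"))
```

Trying longer sequences first is what keeps `---` from being read as an en dash followed by a hyphen.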

SLIDE 25

More fun with font mappings

    \def\SampleText{Unicode - это уникальный код для любого символа,\\
      независимо от платформы,\\
      независимо от программы,\\
      независимо от языка.}
    \font\gen="Gentium"
    \gen\SampleText
    \bigskip
    \font\gentrans="Gentium:mapping=cyr-lat-iso9"
    \gentrans\SampleText

Unicode - это уникальный код для любого символа,
независимо от платформы,
независимо от программы,
независимо от языка.

Unicode - èto unikal'nyj kod dlâ lûbogo simvola,
nezavisimo ot platformy,
nezavisimo ot programmy,
nezavisimo ot âzyka.
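
A transliterating mapping of this kind is, at its core, a character-by-character lookup. The Python sketch below uses a hypothetical table covering only the handful of letters needed for three words of the sample; the correspondences are read off the output shown above, and the real cyr-lat-iso9 mapping covers the full alphabet.

```python
# Partial Cyrillic -> Latin table, taken from the sample output above.
CYR_TO_LAT = {
    "а": "a", "д": "d", "з": "z", "к": "k", "о": "o",
    "т": "t", "ы": "y",
    "э": "\u00e8",  # e with grave
    "я": "\u00e2",  # a with circumflex
}


def transliterate(text: str) -> str:
    """Map each known Cyrillic letter to Latin; pass everything else through."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)


for word in ("это", "код", "языка"):
    print(word, "->", transliterate(word))
```

Because the substitution happens in the font mapping, the source text stays in Cyrillic and only the typeset output changes.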

SLIDE 26

XeTeX and other TeX extensions

  • TeXGX
      • a direct ancestor of XeTeX, but now obsolete
  • e-TeX
      • basis of current XeTeX implementation
      • provides a number of features, especially bidi support
  • Omega, Aleph
      • ambitious project to extend TeX to all scripts
      • complex configuration, no direct smart-font support
  • pdfTeX
      • widely-used extension providing rich PDF support
      • no native Unicode or smart-font support

SLIDE 27

For more information

  • XeTeX web site and mailing list
      • http://scripts.sil.org/xetex
      • http://tug.org/mailman/listinfo/xetex
  • Contact information
      • mailto:jonathan_kew@sil.org
  • Questions… and answers?

ما هي الشفرة الموحدة "يونِكود"؟
什麽是Unicode (統一碼/標準萬國碼)?
Što je Unicode?
Τί εἶναι τὸ Unicode;
מה זה יוניקוד?
यूनिकोड क्या है?
Hvað er Unicode?
ユニコードとは何か?
유니코드에 대해?
یونی‌کُد چیست؟
Что такое Unicode?
Unicode คืออะไร?
