
The Multilingual Lion: TeX learns to speak Unicode (Jonathan Kew, SIL International, April 7, 2005)



  1. The Multilingual Lion: TeX learns to speak Unicode. Jonathan Kew, SIL International. April 7, 2005. 27th Internationalization and Unicode Conference, Berlin, Germany, April 2005

  2. Background
     • TeX: free typesetting system with a 25-year history
       • stable, reliable, flexible, widely implemented
       • experienced user community
       • rich collection of supporting tools
     • Originally designed for English typesetting
       • support for accents and other European characters
       • language support extended via custom fonts, macros, and preprocessors

  3. Traditional TeX input conventions
     • Input text is ASCII (or 8-bit codepage)

     Source text      Typeset output   Notes
     \'{a}            á                typical accent command
     \c{c}            ç
     \aa              å
     ---              —                ligature in typical TeX fonts
     $\alpha$         α                math mode symbol
     {\dn acchaa}     अच्छा             using custom preprocessor
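
The table above can be read as a lookup that authors had to apply by hand or through a preprocessor. The following Python sketch is only an illustration of that burden: the `ESCAPES` dictionary and `to_legacy_tex` function are hypothetical names, and the mapping contains only the rows shown on the slide.

```python
# Hypothetical helper, not part of TeX or XeTeX: converts a few non-ASCII
# characters into the traditional escape sequences listed on the slide.
ESCAPES = {
    "á": r"\'{a}",     # typical accent command
    "ç": r"\c{c}",
    "å": r"\aa",       # in a real document a following letter needs a delimiter
    "—": "---",        # ligature in typical TeX fonts
    "α": r"$\alpha$",  # math mode symbol
}

def to_legacy_tex(text):
    """Replace each known character with its escape; leave the rest unchanged."""
    return "".join(ESCAPES.get(ch, ch) for ch in text)

legacy = to_legacy_tex("façade")
```

With Unicode input, as the later slides describe, the string could be passed to the typesetter unchanged.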

  4. Multilingual typesetting with TeX
     • Text input
       • Escape sequences for non-ASCII characters
       • Multiple 8-bit codepages
       • Preprocessors for complex scripts
     • Font support
       • Fonts limited to 256 glyphs
       • Custom-encoded fonts with specific glyph sets
     • All tied together via complex TeX macros
       • Difficult to understand and extend
       • Difficult to integrate with other packages

  5. Towards a cleaner solution
     • Unicode: all required characters directly represented
       • no need for “escape sequences” to access characters not included in the current codepage
       • no need to switch between codepages according to the language/script being typeset
       • characters rendered via standard access codes
     • Character/glyph model and modern font rendering technologies
       • complex script handling moved out of the domain of the text data stream

  6. Typesetting Unicode text with XeTeX
     • Accented characters

     Source:
       \halign{#\hfil\quad&#\hfil\cr
       dan&    dan\cr
       dubok&  dubok\cr
       džabe&  đak\cr
       džin&   džabe\cr
       Džin&   džin\cr
       đak&    Džin\cr
       Evropa& Evropa\cr}

     Typeset output:
       dan      dan
       dubok    dubok
       džabe    đak
       džin     džabe
       Džin     džin
       đak      Džin
       Evropa   Evropa

  7. Typesetting Unicode text with XeTeX
     • CJK ideographs

     Source:
       \font\han="STSong" at 16pt
       \font\rom="Gentium" at 8pt
       \def\hc#1#2{\vtop{\hbox{\han #1}
         \hbox{\kern10pt\rom #2}}}
       \vtop{\hc{書く}{ka-ku}
         \hc{最も}{motto-mo}
         \hc{最後}{sai-go}
         \hc{働く}{hatara-ku}
         \hc{海}{umi}}

     Typeset output:
       書く   ka-ku
       最も   motto-mo
       最後   sai-go
       働く   hatara-ku
       海     umi

  8. Typesetting Unicode text with XeTeX
     • Complex scripts
     [Example: Arabic-script source text with \c, \s, \p and \v markup, shown alongside its typeset output]

  9. Key changes from TeX to XeTeX
     • Unicode as the text encoding
       • directly use Unicode input text, Unicode-encoded fonts
     • Fonts and rendering technologies
       • use any fonts available in the host computer
       • use existing smart-font rendering systems
     • Additional features for multilingual typesetting
       • optional font features
       • line breaking for Asian scripts
     • Backward compatibility issues
       • support for legacy TeX fonts and documents

  10. From 8 to 16 bits…
     • Character type in TeX code was an 8-bit value
       • one option: process text as UTF-8
     • Character codes used to index a number of tables
       • character category, case pairs, etc.
     • Decision to use 16-bit character codes
       • all 256-element tables enlarged to 65,536 elements to match the extended character set
       • extended TeX commands that refer to character codes
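
To make the table-size argument concrete, here is a minimal Python sketch. It is not XeTeX's code, which the slides do not show; `make_table`, `tex_table` and `xetex_table` are hypothetical names. The values 11 and 12 are TeX's category codes for "letter" and "other".

```python
# Hypothetical illustration of per-character property tables; not XeTeX source.

def make_table(size, default=12):
    """Build a category-style table with one entry per character code."""
    return [default] * size

tex_table = make_table(256)      # classic TeX: one entry per 8-bit code
xetex_table = make_table(65536)  # 16-bit design: one entry per 16-bit code

ch = "đ"                         # U+0111, outside the 8-bit range
code = ord(ch)

# Processed as UTF-8, the character becomes two 8-bit codes, so no single
# slot of a 256-entry table can describe it.
utf8_codes = list(ch.encode("utf-8"))

# With 16-bit codes the character indexes the enlarged table directly.
xetex_table[code] = 11           # mark it as a "letter"
```

The trade-off is memory: each enlarged table holds 256 times as many entries as before.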

  11. From 8 to 16 bits… and beyond?
     • Unicode does not fit in 16 bits either!
     • XeTeX handles non-BMP characters as UTF-16 surrogate pairs
       • properties of individual characters cannot be set
       • unlikely to matter for typesetting usage: all surrogate codes can be treated as simple printable characters
       • keeps size of internal tables moderate, without extensive restructuring
     • Using UTF-16 happens to match the font rendering APIs that XeTeX uses
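
For readers unfamiliar with surrogate pairs, the arithmetic defined by UTF-16 can be sketched as follows. The function name `to_surrogate_pair` is hypothetical and the code illustrates the encoding itself, not XeTeX's internals.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point (U+10000..U+10FFFF) into two 16-bit codes."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is inside the BMP or out of range")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
high, low = to_surrogate_pair(0x1D11E)
```

Both halves fall in the range U+D800..U+DFFF, which is why a 65,536-entry table can still hold an entry for each of them even though it cannot hold one for the character they jointly represent.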

  12. Implementing the character/glyph model
     • Required for support of complex scripts in Unicode
     • Significant change from traditional TeX model
       • TeX regards “a specific character code in a specific font” as the fundamental unit of text to be typeset
       • assumes such a character has known, fixed dimensions
       • provision for ligatures by character substitutions
       • a paragraph consists of a sequence of “character” nodes, to be precisely placed, and intervening “glue” nodes
     • A Unicode character may not map to a single, known glyph
       • many scripts require contextual selection of glyphs
       • must measure characters in context, not in isolation

  13. Implementing the character/glyph model
     • Initial implementation using ATSUI on Mac OS X
       • typesetting process collects runs of characters (words)
       • calls ATSUI text layout APIs to measure width
       • a XeTeX paragraph consists of a sequence of “word” nodes separated by “glue”
     • Typesetting engine positions words, not glyphs
       • positioning glyphs within a word is the job of the font rendering engine
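
The word-node idea can be sketched in a few lines of Python. This is an assumption-laden illustration, not XeTeX's data structure: `build_nodes` is a hypothetical name, and `fake_measure` stands in for a call to a text layout API such as ATSUI, using an invented width of 5 units per character.

```python
def build_nodes(text, measure):
    """Turn a paragraph into ("word", run, width) nodes separated by ("glue",)."""
    nodes = []
    for i, run in enumerate(text.split()):
        if i > 0:
            nodes.append(("glue",))
        # The whole run is measured at once, so contextual shaping inside
        # the word is left to whatever the measure function wraps.
        nodes.append(("word", run, measure(run)))
    return nodes

def fake_measure(run):
    # Placeholder for a layout-engine call; the width is invented.
    return 5 * len(run)

nodes = build_nodes("TeX learns Unicode", fake_measure)
```

The line breaker then only needs the width of each word node, never the positions of individual glyphs.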

  14. Implementing the character/glyph model
     [Figure: nodes in a TeX paragraph compared with the corresponding nodes in XeTeX]

  15. Implementing the character/glyph model
     • OpenType Layout support using ICU library
       • alternative font layout engine
       • provides support for OpenType features in Latin fonts
       • supports a number of complex (Indic/Asian) scripts
     • XeTeX uses either ATSUI or ICU according to layout tables found in fonts
       • overall typesetting process is independent of font technology in use
       • distinction required only at lowest level of measuring a run of text in a given font
       • documents may freely mix AAT and OT fonts
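
One way to picture the per-font engine choice is the Python sketch below. The table tags are the standard ones (`mort`/`morx` for AAT, `GSUB`/`GPOS` for OpenType Layout), but everything else is an assumption: `choose_engine` is a hypothetical name, and the slides do not say which engine wins when a font carries both kinds of table or what happens when it carries neither.

```python
AAT_TABLES = {"mort", "morx"}  # AAT layout tables
OT_TABLES = {"GSUB", "GPOS"}   # OpenType Layout tables

def choose_engine(font_tables):
    """Pick a layout engine name from the table tags present in a font."""
    tables = set(font_tables)
    if tables & AAT_TABLES:
        return "ATSUI"         # assumption: AAT tables are checked first
    if tables & OT_TABLES:
        return "ICU"
    return None                # no layout tables: not covered by the slides

engine = choose_engine(["cmap", "glyf", "GSUB", "GPOS"])
```

Because the choice is made per font, a document can mix fonts of both kinds while the rest of the typesetting process stays unchanged.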
