Taming UTF-8 in pdfTeX, Frank Mittelbach, TUG 2019, Palo Alto





Presentation given at the TUG 2019 Conference, Palo Alto

file version: August 25, 2019 0:02

Taming UTF-8 in pdfTeX

Frank Mittelbach

Abstract

To understand the concepts in pdfLaTeX for processing UTF-8 encoded files it is helpful to first take a look at the models used by the TeX engine and earlier attempts made by LaTeX on top of TeX. The talk provides a short historical review of that area and explains

  • how it is possible in a TeX system that only understands 8-bit input to nevertheless interpret and process UTF-8 files successfully;
  • what the obstacles are and how they can be overcome; and
  • what restrictions will remain if one doesn’t switch to a Unicode-aware engine such as LuaTeX or XeTeX.

It will finish with an overview of the improvements with respect to UTF-8 handling that will be activated in LaTeX within 2019 and explain how they can already be tested today.

The slides have been retrospectively constructed from the mindmap used during the presentation.

⋄ Frank Mittelbach Mainz, Germany https://www.latex-project.org

Taming UTF-8 in pdfTeX

Frank Mittelbach

TUG 2019, Palo Alto

  • A short history lesson
  • LaTeX2e solution
  • A short history lesson part 2 (UTF-8)
  • Restrictions
  • New - upcoming

Slide #1


A short history lesson

  • TeX79
      • 7bit
      • Input „key code“ = Font slot
  • TeX82
      • 8bit
      • Input „key code“ = Font slot
      • 8 bit code pages differ country by country
      • Font slots 129-255 not really used
  • TeX 3
      • \language
      • Cork font encoding!
  • Problems 1995

Slide #2

Problems 1995

  • The German word „Größe“
  • A „pound“ symbol in italics

Slide #3


LaTeX2e solution

  • inputenc package
      • maps to LICR (LaTeX Internal Character Representation)
  • LICR (LaTeX Internal Character Representation)
      • LICR survives the roundtrip through .aux, .toc, etc., since only ASCII chars are used and commands are protected from expansion
      • no dependencies on code pages
      • no dependencies on font slots
      • translated by the fontenc package
  • fontenc package
      • Font encodings are (in theory) well-defined
      • Fonts with the same encoding map exactly the same characters (or not)
      • no tofu!

Slide #4
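The inputenc/fontenc pipeline on this slide can be illustrated with a small document. This is only a sketch: it assumes the file is saved in Latin-1 (ISO 8859-1), and the comments describe the mapping as outlined on the slide rather than the exact internals.

```latex
% Minimal sketch of the LaTeX2e solution (pdfLaTeX).
% Assumption: this file is stored in Latin-1, one byte per character.
\documentclass{article}
\usepackage[latin1]{inputenc} % 8-bit input byte -> LICR
\usepackage[T1]{fontenc}      % LICR -> slot in a T1 (Cork) encoded font
\begin{document}
% The bytes for "ö" and "ß" are mapped to the LICR objects \"o and \ss ...
Größe

% ... so this ASCII-only spelling is what travels through .aux and .toc files:
Gr\"o\ss e
\end{document}
```

Both lines should typeset the same word; the second one shows the ASCII-only form that is independent of code pages and font slots.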

A short history lesson part 2 (UTF-8)

  • Encoding
      • 1 byte = ASCII: 0xxxxxxx
      • 2 bytes: 110xxxxx 10xxxxxx
      • 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
      • ...
  • Approach (in pdfTeX)
  • Features

Slide #5
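As a worked example of the 2-byte pattern above, here is the arithmetic for „ö“ (U+00F6), written as a small LaTeX document. The choice of character is only an illustration; the amsmath package is assumed for align* and \text.

```latex
% Worked example: encoding U+00F6 as two UTF-8 bytes (110xxxxx 10xxxxxx).
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\begin{align*}
\texttt{U+00F6} &= 000\,1111\,0110_2 && \text{(11 payload bits)}\\
                &= \underbrace{00011}_{\text{5 bits}}\;
                   \underbrace{110110}_{\text{6 bits}}\\
\text{byte 1}   &= 110\,00011_2 = \texttt{0xC3}\\
\text{byte 2}   &= 10\,110110_2 = \texttt{0xB6}
\end{align*}
\end{document}
```

An 8-bit engine therefore sees two separate bytes, 0xC3 and 0xB6, where the author typed one character.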


Approach (in pdfTeX)

  • Program reads only bytes
  • Start byte is made „active“
  • This reads all necessary further bytes
  • Determines the Unicode slot
  • Unicode slot maps to LICR (LaTeX Internal Character Representation)
  • LICR is translated by fontenc (we know that already)

Features

  • ...
  • UTF-8 characters are only supported if the glyph exists in the loaded fonts
  • But those again without tofu!

Slide #6
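The last step of the approach, mapping a Unicode slot to something LaTeX can typeset, is exposed to users through \DeclareUnicodeCharacter. The following is a minimal sketch; the character U+2261 and its definition are chosen purely for illustration.

```latex
% Minimal sketch (pdfLaTeX); U+2261 is only an illustrative choice.
\documentclass{article}
\usepackage[utf8]{inputenc} % makes the UTF-8 start bytes active
\usepackage[T1]{fontenc}
% Once the active start byte has read its continuation bytes and
% determined the Unicode slot U+2261, this definition is used:
\DeclareUnicodeCharacter{2261}{\ensuremath{\equiv}}
\begin{document}
a ≡ b % "≡" arrives as the three bytes E2 89 A1
\end{document}
```

Without such a declaration (and without a loaded font encoding that provides one), LaTeX should report an error for the character instead of silently printing a missing-glyph box.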

Restrictions

  • Each UTF-8 document needs \usepackage[utf8]{inputenc}
  • Multi-byte UTF-8 can’t be used as part of command names
      • no \Straße and also no \label{Überblick}
  • Multi-byte UTF-8 has problems in \typeout etc.
  • In file names only restricted usage possible (if at all)
      • Problems with \input, \include or graphic files
      • Example:

Slide #7


Example:

Input

    \input{fürchterlich}
    \input{Straße}

OT1

    ! LaTeX Error: File `f\unhbox \voidb@x \bgroup \let \unhbox \voidb@x \setbox @tempboxa \hbox {u\global \mathchardef \accent@spacefactor\spacefactor } \accent 127 u\egroup \spacefactor \accent@spacefactor rchterlich.tex' not found.
    ! LaTeX Error: File `Stra\OT1\ss e.tex' not found.

T1

    ! LaTeX Error: File `fürchterlich.tex' not found.
    (only an error because the file didn’t exist)
    ! LaTeX Error: File `Stra\T1\ss e.tex' not found.

Slide #8
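While these restrictions apply, the usual way around them is to keep command names, label keys and file names in plain ASCII and to use UTF-8 only in the text itself. A minimal sketch follows; the names \Strasse and sec:ueberblick are hypothetical and chosen only for this example.

```latex
% Sketch of the ASCII-only workaround (pdfLaTeX).
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
% Hypothetical macro: the name is ASCII, only the replacement text is UTF-8.
\newcommand\Strasse{Straße}
\begin{document}
\section{Überblick}\label{sec:ueberblick} % ASCII-only label key
\Strasse{} is discussed in section~\ref{sec:ueberblick}.
\end{document}
```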

New - upcoming

  • April 2018
      • UTF-8 is the new default input encoding
      • No longer applies: „Each UTF-8 document needs \usepackage[utf8]{inputenc}“
      • Combining chars, e.g., „a“ + „U+0301“ = á, are not supported
        (This is a pdfTeX restriction which can’t be realistically overcome)
  • Fall 2019
      • Multi-byte UTF-8 works now in ...
          • \label, \cite, \typeout, etc.
          • in file names
          • characters not in the loaded fonts are allowed too!
      • but still not in command names
        (This is a pdfTeX restriction which can’t be realistically overcome)
      • Spaces in file names are allowed without quoting the name
  • Available now for testing
      • LaTeX-dev format

Slide #9
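A small test file for the Fall 2019 behaviour could look like the sketch below. It assumes a TeX distribution that ships the LaTeX-dev format, which is typically run as pdflatex-dev instead of pdflatex; with an older format the \label line is expected to fail as described under Restrictions.

```latex
% Test sketch for the LaTeX-dev format (assumption: compiled with pdflatex-dev).
% No inputenc line is needed, since UTF-8 is the default input encoding.
\documentclass{article}
\begin{document}
\section{Überblick}\label{Überblick} % multi-byte UTF-8 in a label key
\typeout{Größe}                      % multi-byte UTF-8 written to the log
See section~\ref{Überblick}.
\end{document}
```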
