ICANN IDN TLD Variant Issues Project: Presentation to the Unicode Technical Committee


Jan 27, 2023


SLIDE 1

ICANN IDN TLD Variant Issues Project

Presentation to the Unicode Technical Committee
Andrew Sullivan (consultant)
ajs@anvilwalrusden.com

L2/11-426

SLIDE 2

I’m a consultant

Blame me for mistakes here, not staff or ICANN


SLIDE 3

Background

DNS labels were always in (a subset of) ASCII
Lots of people don’t normally use ASCII
Internationalized Domain Names in Applications (IDNA) invented to help

SLIDE 4

Reminder: two flavours

IDNA2003
IDNA2008

SLIDE 5

Basic problem

IDNA (2003 & 2008) expands DNS label repertoire
The LDH pattern does not fit perfectly in other languages, scripts, or both
People want DNS labels to work like parts of natural language

SLIDE 6

What makes a DNS label?

DNS labels are octets
Preferred syntax (RFC 1035) is Letters, Digits, and Hyphen (“LDH”)
Special DNS rule for ASCII: case-insensitive but case-preserving
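A minimal sketch of the LDH pattern and the ASCII case rule described on this slide. The function names are hypothetical, and the regular expression assumes the common reading of the preferred syntax (1 to 63 ASCII letters, digits, or hyphens, with no leading or trailing hyphen, and a leading digit tolerated as in later host-name practice).

```python
import re

# Assumption: 1-63 characters, letters/digits/hyphen only, no hyphen at
# either end. Leading digits are tolerated here, which is looser than the
# strict RFC 1035 grammar.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_label(label: str) -> bool:
    """Return True if the label fits the assumed LDH pattern."""
    return LDH_LABEL.fullmatch(label) is not None


def same_label(a: str, b: str) -> bool:
    """Compare two ASCII labels the way DNS does: ignoring ASCII case.

    bytes.lower() only folds A-Z, so no non-ASCII folding sneaks in.
    Raises UnicodeEncodeError if either label is not pure ASCII.
    """
    return a.encode("ascii").lower() == b.encode("ascii").lower()


if __name__ == "__main__":
    for candidate in ["example", "Example-1", "-bad", "b\u00fccher"]:
        print(repr(candidate), is_ldh_label(candidate))
    print(same_label("Example", "EXAMPLE"))
```

The comparison folds case only at lookup time; the stored label keeps whatever case it was registered with, which is the "case-preserving" half of the rule.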

SLIDE 7

IDNA

Permit non-LDH characters in label
Be as compatible as practical with deployed software
No changes to deployed DNS software or protocol

SLIDE 8

IDNA2003

Provide a list of code points that are allowed
Map cases that are troublesome (e.g. ZWNJ, upper-to-lowercase) using Nameprep
To the extent there’s an installed base, this is it
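The mapping behaviour on this slide can be observed with Python's built-in "idna" codec, which implements IDNA2003 (RFC 3490 with Nameprep). This is a sketch, not part of the original presentation; the helper name and the sample strings are chosen only for illustration.

```python
def to_ascii_2003(name: str) -> bytes:
    """Convert a name to its ASCII-compatible form using the stdlib
    IDNA2003 codec (Nameprep mapping, then Punycode with the xn-- prefix)."""
    return name.encode("idna")


# Upper case is mapped to lower case before encoding.
upper = to_ascii_2003("B\u00dcCHER")   # BUCHER with U-diaeresis, upper case
lower = to_ascii_2003("b\u00fccher")   # same word, lower case

# ZWNJ (U+200C) is "mapped to nothing" by Nameprep, so it disappears.
# Illustrative Arabic-script string: NOON ALEF MEEM [ZWNJ] HEH ALEF.
with_zwnj = to_ascii_2003("\u0646\u0627\u0645\u200c\u0647\u0627")
without_zwnj = to_ascii_2003("\u0646\u0627\u0645\u0647\u0627")

if __name__ == "__main__":
    print(upper, lower)
    print(with_zwnj == without_zwnj)
```

Under IDNA2003 the label with ZWNJ and the label without it end up as the same DNS label, which is one reason the ZWNJ question comes up again for IDNA2008.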

SLIDE 9

IDNA2008

Attempt to address some perceived limitations of IDNA2003
Permits or disallows code points based on code point properties
Certain incompatibilities with IDNA2003

SLIDE 10

What’s a variant?

Exactly


SLIDE 11

Origins of variants

Starts because of Simplified Chinese/Traditional Chinese issue
JET Guidelines (RFC 3743)
Became model for other issues, not always related

SLIDE 12

Things people have claimed

Characters that are substitutable
“Same words” or “same meaning”
Sometimes a constraint on child names, sometimes not

SLIDE 13

Why now?

ccTLD IDN “Fast Track” process delegated some
Not uncontroversial
New gTLDs under development
If we’re going to create “variants”, we should be able to say what they are.

SLIDE 14

IDN Variant Issues Project

SLIDE 15

IDN Variant Issues Project

We are here

SLIDE 16

Comment period to 14 Nov

http://www.icann.org/en/announcements/announcement-4-03oct11-en.htm and http://www.icann.org/en/public-comment/

SLIDE 17

Reports are only about the root

While some of the conclusions may apply to other types of zones, the reports discuss variants for TLDs only

SLIDE 18

A planned constraint for TLDs

Current rule is “only letters” (strictly, General Category {Ll, Lo, Lm, Mn})

No numerals
No HYPHEN-MINUS
No ZWNJ/ZWJ


From the guidebook
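A rough sketch of the "only letters" rule above, using the General Category values listed on the slide. The function names are hypothetical, and the check looks at General Category only; it does not implement the rest of IDNA2008 or any other guidebook rule.

```python
import unicodedata

# General Category values allowed by the rule as stated on the slide.
ALLOWED_CATEGORIES = {"Ll", "Lo", "Lm", "Mn"}


def violations(label: str) -> list:
    """List (code point, category) pairs for characters outside the allowed set."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch))
        for ch in label
        if unicodedata.category(ch) not in ALLOWED_CATEGORIES
    ]


def is_letters_only(label: str) -> bool:
    """True if the label is non-empty and every character is in an allowed category."""
    return bool(label) and not violations(label)


if __name__ == "__main__":
    for candidate in ["example", "examp1e", "ex-ample", "a\u200cb"]:
        print(repr(candidate), is_letters_only(candidate), violations(candidate))
```

Digits are category Nd, HYPHEN-MINUS is Pd, and ZWNJ/ZWJ are Cf, so the three "No ..." bullets fall out of the category rule without separate special cases.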

SLIDE 19

Restrictions suggested in report

No combining marks
No digits
No archaic
No Quranic marks


Arabic team

SLIDE 20

ZWNJ

Arguments for and against
Refinement of IDNA2008 context rule
Issue is lack of shape change
Questions about resulting variants

Arabic team

SLIDE 21

Groups of characters

Identical shape at some position (e.g. YEH)
Similar shape at some position (e.g. ALEF w/ HAMZA ABOVE)
Interchangeable use (e.g. KAF vs SWASH KAF)

Arabic team

SLIDE 22

“NFC” issues

Not exactly an issue with NFC
Example: U+06C7 vs. U+0648, U+064F
Perhaps could be caught by “confusables” algorithms?

Arabic team
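The U+06C7 example can be checked directly with Python's unicodedata module. This is only an illustration of why the slide says the problem is "not exactly" an NFC issue: U+06C7 has no canonical decomposition, so normalization leaves the two spellings distinct. The contrast case (U+0624) is added here for comparison and is not from the presentation.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize text to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


precomposed = "\u06C7"      # ARABIC LETTER U
sequence = "\u0648\u064F"   # ARABIC LETTER WAW + ARABIC DAMMA

# False: NFC does not unify the single code point with the sequence.
nfc_unifies = nfc(precomposed) == nfc(sequence)

# Contrast: WAW + HAMZA ABOVE does compose, because U+0624 has a
# canonical decomposition to exactly that sequence.
composes = nfc("\u0648\u0654") == "\u0624"

if __name__ == "__main__":
    print(nfc_unifies, composes)
```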

SLIDE 23

Recommendations

Whenever there is a variant, all resulting labels are available to the applicant
It is up to the applicant which ones to activate

Arabic team
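To make "all resulting labels" concrete, here is a small sketch that expands a label through a variant table. The table is hypothetical and holds only the KAF / SWASH KAF pair named on an earlier slide; a real table would come from the team's report, and the function name is invented for this example.

```python
from itertools import product

# Hypothetical variant table: each code point maps to the set of code
# points treated as interchangeable with it (including itself).
VARIANTS = {
    "\u0643": ["\u0643", "\u06AA"],  # ARABIC LETTER KAF <-> SWASH KAF
    "\u06AA": ["\u0643", "\u06AA"],
}


def variant_labels(label: str, table=VARIANTS) -> list:
    """Return every label obtainable by substituting variant code points."""
    choices = [table.get(ch, [ch]) for ch in label]
    return sorted({"".join(combo) for combo in product(*choices)})


if __name__ == "__main__":
    one_kaf = "\u0643\u0627"            # KAF + ALEF
    two_kafs = "\u0643\u0627\u0643"     # KAF + ALEF + KAF
    print(len(variant_labels(one_kaf)), len(variant_labels(two_kafs)))
```

The number of resulting labels is the product of the variant-set sizes at each position, so it grows quickly with label length. That is part of why leaving activation to the applicant matters.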

SLIDE 24

Focus on Chinese Language

Reports in principle about “script”, but report primarily about Chinese
Some consideration of effects on Japanese and Korean

Chinese team

SLIDE 25

RFC 3743, experience

Experience at other levels of DNS
RFC 3743 a good fit for CJK use

Chinese team

SLIDE 26

Two fundamental cases

Traditional vs Simplified
Variation due to Source Separation Rule (e.g. U+6237 versus U+6236)

Chinese team

SLIDE 27

Focus on reducing confusion

Mainly interested in confusion of strings between languages
Unlike Chinese and Arabic, no strong recommendation that “everything works”

Cyrillic team

SLIDE 28

Different from other cases

Many more languages than some other scripts
Extremely fraught political environment:
Cyrillic vs. Latin
Cyrillic vs. Arabic
Many spelling & character reforms

Cyrillic team

SLIDE 29

One language can cause issues

Substitutions in one language obliterate differences in others
E.g. U+0435 vs U+0451, U+0433 vs U+0491
Some characters not on keyboards

Cyrillic team
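A sketch of how a substitution rule that suits one language can erase a distinction another language relies on. The folding table uses the two code point pairs from the slide; treating them as a single script-wide rule is an assumption made only for this illustration, and the sample strings are illustrative.

```python
# Hypothetical script-wide fold built from the pairs on the slide:
# U+0451 (IO) -> U+0435 (IE), U+0491 (GHE WITH UPTURN) -> U+0433 (GHE).
FOLD = str.maketrans({"\u0451": "\u0435", "\u0491": "\u0433"})


def fold(label: str) -> str:
    """Apply the hypothetical variant fold to a label."""
    return label.translate(FOLD)


# Two distinct strings that differ only in GHE vs GHE WITH UPTURN.
with_ghe = "\u0433\u0440\u0430\u0442\u0438"
with_upturn = "\u0491\u0440\u0430\u0442\u0438"

collide = fold(with_ghe) == fold(with_upturn)

if __name__ == "__main__":
    print(with_ghe != with_upturn, collide)
```

Once the fold is applied across the whole script, the two strings can no longer be registered as different labels, even for users of a language that treats the letters as distinct.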

SLIDE 30

Interaction with other scripts

Issue of relation to Greek and Latin raised
Declared out of scope, but problematic

Cyrillic team

SLIDE 31

Very different issues

Confusing similarity a high priority issue
Especially worried about URL bar display
Concern about ill-formed akshars

Devanagari team

SLIDE 32

Environment issues

Display of Devanagari script can be problematic
Rendering engines
Fonts

Devanagari team

SLIDE 33

ZWJ and ZWNJ

Some Devanagari-using languages rely on ZWJ
Even if there is a precomposed version that will do
ZWNJ needed for noun paradigms
Use in TLDs not clear

Devanagari team

SLIDE 34

Inter-script issues

Relationship between Devanagari and other Brahmi-derived scripts?
Ruled out of scope, but may be important

Devanagari team

SLIDE 35

Unusual case

Greek alone in studied scripts in being used for only one language

Greek team

SLIDE 36

Additional restrictions

Team recommends excluding ancient characters
Team recommends sticking to Monotonic characters

Greek team

SLIDE 37

Sigma and Tonos

IDNA2003 maps upper case to lower case: Tonos can be lost
IDNA2003 maps away final form sigma
Transformations in applications in IDNA2008

Greek team
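Both IDNA2003 effects on this slide can be observed with Python's built-in "idna" codec, which implements IDNA2003. The sample word and the helper name are chosen for illustration only; the "tonos can be lost" case assumes the upper-case form is typed without tonos, as is usual in Greek.

```python
def to_ascii_2003(label: str) -> bytes:
    """Encode a label with the stdlib IDNA2003 codec (Nameprep + Punycode)."""
    return label.encode("idna")


final_sigma = "\u03B2\u03CC\u03BB\u03BF\u03C2"     # lower case, tonos, final sigma
small_sigma = "\u03B2\u03CC\u03BB\u03BF\u03C3"     # same, but small sigma at the end
upper_no_tonos = "\u0392\u039F\u039B\u039F\u03A3"  # upper case, written without tonos
lower_no_tonos = "\u03B2\u03BF\u03BB\u03BF\u03C3"  # lower case, no tonos, small sigma

# Final sigma is mapped to small sigma, so the two spellings collapse.
sigma_mapped_away = to_ascii_2003(final_sigma) == to_ascii_2003(small_sigma)

# The upper-case form maps to the lower-case form without tonos ...
upper_maps_to_plain = to_ascii_2003(upper_no_tonos) == to_ascii_2003(lower_no_tonos)

# ... which is a different label from the lower-case form with tonos.
tonos_lost = to_ascii_2003(upper_no_tonos) != to_ascii_2003(final_sigma)

if __name__ == "__main__":
    print(sigma_mapped_away, upper_maps_to_plain, tonos_lost)
```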

SLIDE 38

Final sigma

Recommend registering final form sigmas wherever requested
Also register without the final sigma (i.e. with small sigma in place of final sigma)

Greek team

SLIDE 39

Tonos

Recommend registering with Tonos where requested
Also register with Tonos stripped

Greek team
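A sketch of the companion labels implied by the final sigma and tonos recommendations. The function names are hypothetical, and the tonos stripping assumes monotonic input: it removes every U+0301 after decomposition, which would also alter characters carrying dialytika with tonos.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # how tonos appears after NFD decomposition


def strip_tonos(label: str) -> str:
    """Remove tonos by decomposing, dropping U+0301, and recomposing."""
    decomposed = unicodedata.normalize("NFD", label)
    return unicodedata.normalize("NFC", decomposed.replace(COMBINING_ACUTE, ""))


def without_final_sigma(label: str) -> str:
    """Replace final sigma (U+03C2) with small sigma (U+03C3)."""
    return label.replace("\u03C2", "\u03C3")


def companion_labels(requested: str) -> list:
    """Return the requested label plus the forms the slides say to register too."""
    return sorted({
        requested,
        without_final_sigma(requested),
        strip_tonos(requested),
        strip_tonos(without_final_sigma(requested)),
    })


if __name__ == "__main__":
    requested = "\u03B2\u03CC\u03BB\u03BF\u03C2"  # tonos and final sigma
    for label in companion_labels(requested):
        print(label, [f"U+{ord(ch):04X}" for ch in label])
```

For a label that has both a tonos and a final sigma, this yields four labels in total; a label with neither yields only itself.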

SLIDE 40

Dimotiki and Katharevousa

Recommendation that, if Katharevousa string is requested, the “same” Dimotiki “word” is blocked
Only report that requests variant behaviour because of whole-string meaning

Greek team

SLIDE 41

The impossible dream

There are too many relationships among characters in Latin-using languages
There’s no way to decide
Therefore, no variants

Latin team

SLIDE 42

Remember, please comment

Open until 14 November
http://www.icann.org/en/public-comment/

SLIDE 43

Questions
