ICANN IDN TLD Variant Issues Project: Presentation to the Unicode Technical Committee



SLIDE 1

ICANN IDN TLD Variant Issues Project

Presentation to the Unicode Technical Committee
Andrew Sullivan (consultant)
ajs@anvilwalrusden.com

L2/11-426

SLIDE 2

I’m a consultant

Blame me for mistakes here, not staff or ICANN

SLIDE 3

Background

  • DNS labels were always in (a subset of) ASCII
  • Lots of people don’t normally use ASCII
  • Internationalizing Domain Names in Applications (IDNA) invented to help

SLIDE 4

Reminder: two flavours

  • IDNA2003
  • IDNA2008

SLIDE 5

Basic problem

  • IDNA (2003 & 2008) expands DNS label repertoire
  • The LDH pattern does not fit perfectly in other languages, scripts, or both
  • People want DNS labels to work like parts of natural language

SLIDE 6

What makes a DNS label?

  • DNS labels are octets
  • Preferred syntax (RFC 1035) is Letters, Digits, and Hyphen (“LDH”)
  • Special DNS rule for ASCII
  • Case insensitive but case-preserving
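The LDH syntax and the ASCII-only case rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a validator: the function names are hypothetical, and the pattern assumes the "preferred name syntax" of RFC 1035 section 2.3.1 (leading letter, trailing letter or digit, at most 63 octets).

```python
import re

# Assumption: RFC 1035 "preferred name syntax" for a single label.
# A label starts with a letter, ends with a letter or digit, and has
# only letters, digits, and hyphens in between.
LDH_LABEL = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_ldh_label(label: str) -> bool:
    """Return True if the label fits the LDH pattern and the 63-octet limit."""
    return len(label) <= 63 and LDH_LABEL.fullmatch(label) is not None


def same_label(a: bytes, b: bytes) -> bool:
    """Compare two labels as the DNS does: only ASCII letters fold case.

    bytes.lower() changes only the ASCII range A-Z, so every other octet
    is compared exactly.
    """
    return a.lower() == b.lower()


if __name__ == "__main__":
    print(is_ldh_label("example"))             # LDH label
    print(is_ldh_label("b\u00fccher"))         # contains a non-LDH character
    print(same_label(b"Example", b"eXAMPLE"))  # ASCII case is ignored
```

Comparing octets rather than text is deliberate here: the case rule applies to ASCII only, so two UTF-8 byte strings that differ only in the case of a non-ASCII letter are different labels to the DNS.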

SLIDE 7

IDNA

  • Permit non-LDH characters in label
  • Be as compatible as practical with deployed software
  • No changes to deployed DNS software or protocol
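One way to see how these goals fit together: IDNA leaves the DNS untouched by converting a non-LDH label into an ASCII-compatible (ACE) form that is itself an LDH label. The sketch below uses Python's built-in "idna" codec, which implements IDNA2003; the helper names are hypothetical.

```python
def to_ace(label: str) -> bytes:
    """Convert one label to its ASCII-compatible form (IDNA2003 ToASCII)."""
    return label.encode("idna")


def from_ace(label: bytes) -> str:
    """Convert an ASCII-compatible label back to Unicode (IDNA2003 ToUnicode)."""
    return label.decode("idna")


if __name__ == "__main__":
    ace = to_ace("b\u00fccher")  # "bücher"
    print(ace)                   # an LDH label starting with "xn--"
    print(from_ace(ace))
```

The DNS only ever sees the "xn--" form, which is why no server or protocol change is needed.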

SLIDE 8

IDNA2003

  • Provide a list of code points that are allowed
  • Map cases that are troublesome (e.g. ZWNJ, upper-to-lowercase) using Nameprep
  • To the extent there’s an installed base, this is it
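The two mappings named on this slide can be observed with the Nameprep implementation in Python's standard library (`encodings.idna.nameprep`, which follows IDNA2003). The wrapper name is hypothetical, and this is only a sketch of the mapping step, not of the whole registration or lookup process.

```python
from encodings.idna import nameprep

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def idna2003_map(label: str) -> str:
    """Apply Nameprep (map, normalize, prohibit, bidi check) to one label."""
    return nameprep(label)


if __name__ == "__main__":
    # Upper case is mapped to lower case.
    print(idna2003_map("EXAMPLE"))
    # ZWNJ is "mapped to nothing", so it disappears from the label.
    print(idna2003_map("a" + ZWNJ + "b") == "ab")
```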

SLIDE 9

IDNA2008

  • Attempt to address some perceived limitations of IDNA2003
  • Permits or disallows code points based on code point properties
  • Certain incompatibilities with IDNA2003

SLIDE 10

What’s a variant?

Exactly

SLIDE 11

Origins of variants

  • Starts because of Simplified Chinese/Traditional Chinese issue
  • JET Guidelines (RFC 3743)
  • Became model for other issues, not always related

SLIDE 12

Things people have claimed

  • Characters that are substitutable
  • “Same words” or “same meaning”
  • Sometimes a constraint on child names, sometimes not

SLIDE 13

Why now?

  • ccTLD IDN “Fast Track” process delegated some
  • Not uncontroversial
  • New gTLDs under development
  • If we’re going to create “variants”, we should be able to say what they are.

SLIDE 14

IDN Variant Issues Project

SLIDE 15

IDN Variant Issues Project

We are here

SLIDE 16

Comment period to 14 Nov

http://www.icann.org/en/announcements/announcement-4-03oct11-en.htm and http://www.icann.org/en/public-comment/

SLIDE 17

Reports are only about the root

While some of the conclusions may apply to other types of zones, the reports discuss variants for TLDs only

SLIDE 18

A planned constraint for TLDs

Current rule is “only letters” (strictly, General Category {Ll, Lo, Lm, Mn})

  • No numerals
  • No HYPHEN-MINUS
  • No ZWNJ/ZWJ

From the guidebook
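The "only letters" rule above is a test on the Unicode General Category, which can be sketched with Python's `unicodedata` module. The function name is hypothetical, and the check covers only the category rule stated on the slide; it assumes the label has already passed whatever other IDNA checks apply.

```python
import unicodedata

# The General Category values named on the slide.
ALLOWED_CATEGORIES = {"Ll", "Lo", "Lm", "Mn"}


def disallowed_code_points(label: str) -> list[tuple[str, str]]:
    """List (code point, General Category) for every character that fails the rule."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch))
        for ch in label
        if unicodedata.category(ch) not in ALLOWED_CATEGORIES
    ]


if __name__ == "__main__":
    print(disallowed_code_points("example"))  # nothing to report
    print(disallowed_code_points("abc1"))     # the digit is category Nd
    print(disallowed_code_points("a-b"))      # HYPHEN-MINUS is category Pd
    print(disallowed_code_points("a\u200cb")) # ZWNJ is category Cf
```

The three exclusions listed on the slide fall out of the category test directly: digits are Nd, HYPHEN-MINUS is Pd, and ZWNJ/ZWJ are Cf.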

SLIDE 19

Restrictions suggested in report

  • No combining marks
  • No digits
  • No archaic
  • No Quranic marks

Arabic team

SLIDE 20

ZWNJ

  • Arguments for and against
  • Refinement of IDNA2008 context rule
  • Issue is lack of shape change
  • Questions about resulting variants

Arabic team

SLIDE 21

Groups of characters

  • Identical shape at some position (e.g. YEH)
  • Similar shape at some position (e.g. ALEF w/ HAMZA ABOVE)
  • Interchangeable use (e.g. KAF vs SWASH KAF)

Arabic team

SLIDE 22

“NFC” issues

  • Not exactly issue with NFC
  • Example: U+06C7 vs. U+0648, U+064F
  • Perhaps could be caught by “confusables” algorithms?

Arabic team
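The example on the "NFC" slide can be checked directly: U+06C7 has no canonical decomposition, so normalization does not unify it with the sequence U+0648, U+064F. A minimal sketch using Python's `unicodedata` (the helper name is hypothetical):

```python
import unicodedata

PRECOMPOSED = "\u06c7"    # ARABIC LETTER U
SEQUENCE = "\u0648\u064f"  # ARABIC LETTER WAW + ARABIC DAMMA


def nfc_equal(a: str, b: str) -> bool:
    """Return True if two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    # For contrast: a pair that NFC does unify (e + combining acute vs. é).
    print(nfc_equal("e\u0301", "\u00e9"))
    # The pair from the slide stays distinct under NFC.
    print(nfc_equal(PRECOMPOSED, SEQUENCE))
```

This is why the slide says the issue is "not exactly" one with NFC: normalization behaves as specified, and the two spellings simply are not canonically equivalent.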

SLIDE 23

Recommendations

  • Whenever there is a variant, all resulting labels are available to the applicant
  • It is up to the applicant which ones to activate

Arabic team

SLIDE 24

Focus on Chinese Language

  • Reports in principle about “script”, but report primarily about Chinese
  • Some consideration of effects on Japanese and Korean

Chinese team

SLIDE 25

RFC 3743, experience

  • Experience at other levels of DNS
  • RFC 3743 a good fit for CJK use

Chinese team

SLIDE 26

Two fundamental cases

  • Traditional vs Simplified
  • Variation due to Source Separation Rule (e.g. U+6237 versus U+6236)

Chinese team
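A character-level variant table of the kind RFC 3743 describes turns one applied-for label into a set of variant labels. The sketch below is an illustration only: the table is hypothetical and holds just the pair from the slide, and the function name is an assumption, not anything defined by the report or by RFC 3743.

```python
from itertools import product

# Hypothetical variant table: each code point maps to the code points
# treated as its variants. Only the pair from the slide is included.
VARIANTS = {
    "\u6237": {"\u6236"},
    "\u6236": {"\u6237"},
}


def variant_labels(label: str) -> set[str]:
    """Return the label plus every label reachable by swapping in variants."""
    choices = [sorted({ch} | VARIANTS.get(ch, set())) for ch in label]
    return {"".join(combo) for combo in product(*choices)}


if __name__ == "__main__":
    print(sorted(variant_labels("\u6237")))
    # Each position with a variant doubles the set.
    print(len(variant_labels("\u6237\u6237")))
```

The set grows multiplicatively with the number of variant-bearing positions, which is one reason the choice of which variants to activate matters for longer labels.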

SLIDE 27

Focus on reducing confusion

  • Mainly interested in confusion of strings between languages
  • Unlike Chinese and Arabic, no strong recommendation that “everything works”

Cyrillic team

SLIDE 28

Different from other cases

  • Many more languages than some other scripts
  • Extremely fraught political environment:
    • Cyrillic vs. Latin
    • Cyrillic vs. Arabic
  • Many spelling & character reforms

Cyrillic team

SLIDE 29

One language can cause issues

  • Substitutions in one language obliterate differences in others
  • E.g. U+0435 vs U+0451, U+0433 vs U+0491
  • Some characters not on keyboards

Cyrillic team
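The first bullet above can be made concrete with a script-wide folding table. In this sketch the table is hypothetical and contains only the two pairs from the slide; the sample strings are arbitrary labels chosen to differ in exactly one of those code points.

```python
# Hypothetical script-wide fold built from the two pairs on the slide:
# U+0451 is folded to U+0435, and U+0491 is folded to U+0433.
FOLD = str.maketrans({"\u0451": "\u0435", "\u0491": "\u0433"})


def fold(label: str) -> str:
    """Replace each folded code point with its base code point."""
    return label.translate(FOLD)


def collide(a: str, b: str) -> bool:
    """Return True if two different labels become identical after folding."""
    return a != b and fold(a) == fold(b)


if __name__ == "__main__":
    # Two labels that differ only in U+0433 vs. U+0491.
    first = "\u0433\u0440\u0430\u0442\u0438"
    second = "\u0491\u0440\u0430\u0442\u0438"
    print(collide(first, second))
```

A substitution that is harmless where the two letters are used interchangeably erases a real distinction in any language that treats them as separate letters, because the table applies to the whole script rather than to one language.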

SLIDE 30

Interaction with other scripts

  • Issue of relation to Greek and Latin raised
  • Declared out of scope, but problematic

Cyrillic team

SLIDE 31

Very different issues

  • Confusing similarity a high priority issue
  • Especially worried about URL bar display
  • Concern about ill-formed akshars

Devanagari team

SLIDE 32

Environment issues

  • Display of Devanagari script can be problematic
  • Rendering engines
  • Fonts

Devanagari team

SLIDE 33

ZWJ and ZWNJ

  • Some Devanagari-using languages rely on ZWJ
  • Even if there is a precomposed version that will do
  • ZWNJ needed for noun paradigms
  • Use in TLDs not clear

Devanagari team

SLIDE 34

Inter-script issues

  • Relationship between Devanagari and other Brahmi-derived scripts?
  • Ruled out of scope, but may be important

Devanagari team

SLIDE 35

Unusual case

  • Greek alone in studied scripts in being used for only one language

Greek team

SLIDE 36

Additional restrictions

  • Team recommends excluding ancient characters
  • Team recommends sticking to Monotonic characters

Greek team

SLIDE 37

Sigma and Tonos

  • IDNA2003 maps upper case to lower case: Tonos can be lost
  • IDNA2003 maps away final form sigma
  • Transformations in applications in IDNA2008

Greek team
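Both IDNA2003 effects on this slide can be reproduced with the Nameprep implementation in Python's standard library. The wrapper name is hypothetical. The reading of "Tonos can be lost" used here is an assumption: a string typed in capitals without tonos is lower-cased to an unaccented string, and the mapping cannot put the tonos back.

```python
from encodings.idna import nameprep


def idna2003_map(label: str) -> str:
    """Apply the IDNA2003 Nameprep mapping to one label."""
    return nameprep(label)


if __name__ == "__main__":
    # Final sigma (U+03C2) is mapped to small sigma (U+03C3).
    print(idna2003_map("\u03c2") == "\u03c3")

    # Capitals written without tonos lower-case to an unaccented string,
    # which differs from the lower-case spelling that carries tonos.
    capitals = "\u0391\u0398\u0397\u039d\u0391"    # no tonos
    with_tonos = "\u03b1\u03b8\u03ae\u03bd\u03b1"  # U+03AE carries tonos
    print(idna2003_map(capitals) == with_tonos)
```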

SLIDE 38

Final sigma

  • Recommend registering final form sigmas wherever requested
  • Also register without the final sigma (i.e. with small sigma in place of final sigma)

Greek team

SLIDE 39

Tonos

  • Recommend registering with Tonos where requested
  • Also register with Tonos stripped

Greek team
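A companion label with the tonos stripped can be derived by decomposing the string, removing the combining mark, and recomposing. This is a sketch under stated assumptions: the function names are hypothetical, and tonos is taken to be U+0301 as it appears in the canonical decompositions of the monotonic Greek vowels.

```python
import unicodedata

TONOS = "\u0301"  # COMBINING ACUTE ACCENT, as found in NFD of tonos-bearing vowels


def strip_tonos(label: str) -> str:
    """Remove tonos from a label, keeping other marks such as dialytika."""
    decomposed = unicodedata.normalize("NFD", label)
    return unicodedata.normalize("NFC", decomposed.replace(TONOS, ""))


def labels_to_register(label: str) -> set[str]:
    """Return the requested label together with its tonos-stripped form."""
    return {label, strip_tonos(label)}


if __name__ == "__main__":
    requested = "\u03b1\u03b8\u03ae\u03bd\u03b1"  # contains U+03AE
    print(sorted(labels_to_register(requested)))
```

Working on the decomposed form means a vowel carrying both dialytika and tonos keeps its dialytika, instead of being reduced to the bare vowel.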

SLIDE 40

Dimotiki and Katharevousa

  • Recommendation that, if Katharevousa string is requested, the “same” Dimotiki “word” is blocked
  • Only report that requests variant behaviour because of whole-string meaning

Greek team

SLIDE 41

The impossible dream

  • There are too many relationships among characters in Latin-using languages
  • There’s no way to decide
  • Therefore, no variants

Latin team

SLIDE 42

Remember, please comment

Open until 14 November
http://www.icann.org/en/public-comment/

SLIDE 43

Questions