7. International character sets


  1. 7. International character sets
     • Default character set: Unicode
     • Characters correspond to numbers.
     • Different encodings exist for these numbers.
     • On Web servers, HTTP metadata specifies the encoding (as the charset parameter of the so-called MIME media type), e.g.
         HTTP/1.1 ...
         Content-Type: text/xml; charset=ISO-8859-1
     • If the MIME media type is application/xml without a charset parameter, then the parser tries to determine the character set from the first few bytes of the document, i.e. from the XML declaration:
         <?xml ... encoding=...?>
     XML-7 J. Teuhola 2013
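
As a rough illustration of the HTTP case above, the following Python sketch pulls the media type and the charset parameter out of a Content-Type header value. The helper name `charset_from_content_type` is an assumption made for this example; it relies on the standard library's `email.message.Message`, which reports the charset in lower case and returns `None` when no charset parameter is present.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return (media_type, charset) for a Content-Type header value.

    Hypothetical helper for illustration; charset is None if absent.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


# Header from the slide: the charset is stated explicitly.
print(charset_from_content_type("text/xml; charset=ISO-8859-1"))
# No charset parameter: the parser has to look inside the document.
print(charset_from_content_type("application/xml"))
```

When the second element is `None`, the encoding has to come from the document itself, as the last bullet describes.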

  2. Character sets in external entities
     • External entities (to be included in a document) may have a different encoding than the document itself.
     • An external parsed entity should start with a text declaration (similar to the XML declaration, but the version may be omitted):
         <?xml encoding="KOI8-R"?>   [denoting Russian characters]
     • The same holds for external DTD subsets.

  3. ISO character sets
     • The ISO-8859 family: 16 different character sets, each with character numbers 0..255:
       – 0..127: Normal ASCII
       – 128..159: Control characters
       – 160..255: Language-specific characters
     • Examples:
       – ISO-8859-1 (= Latin-1): language-specific characters for Danish, Dutch, Finnish, French, German, Spanish, Swedish, ...
       – ISO-8859-2 (= Latin-2): Eastern European characters
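
To make the split between the shared ASCII half and the language-specific half concrete, this small Python sketch decodes the same two bytes under Latin-1 and Latin-2. The byte values 0xA3 and 0xF5 are picked only for illustration; the codec names are the ones Python's standard library uses.

```python
# The same two bytes from the language-specific range 160..255.
raw = bytes([0xA3, 0xF5])

latin1 = raw.decode("iso-8859-1")  # pound sign, o with tilde
latin2 = raw.decode("iso-8859-2")  # L with stroke, o with double acute
print(latin1, latin2)

# The lower half 0..127 is plain ASCII in both character sets.
ascii_part = bytes(range(128))
same_lower_half = (
    ascii_part.decode("iso-8859-1")
    == ascii_part.decode("iso-8859-2")
    == ascii_part.decode("ascii")
)
print(same_lower_half)
```

The bytes are identical in both cases; only the chosen character set decides which characters they denote.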

  4. Unicode
     • International character set for almost all languages (English, Greek, Cyrillic, Han Chinese, Arabic, Hebrew, Thai, Bengali, ...)
     • Unique numbers ('code points') for all characters
     • Version 6.0 (2010) contains 109242 graphic characters.
     • Codes are divided into 17 planes of 65536 code points each = 1114112 code points altogether. The first plane (Basic Multilingual Plane) covers the characters used in practice.
     • Five ways of encoding the numbers: UTF-8, UTF-16, UTF-32, UCS-2, UCS-4
     • XML parsers are required to understand UTF-8 and UTF-16, but are allowed to understand others, such as ISO-8859-1.
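
The plane arithmetic above can be checked with a few lines of Python. This is only a sketch: `plane_of` and the two constants are names invented for the example, and the emoji code point U+1F600 is an arbitrary character outside the Basic Multilingual Plane.

```python
import sys

PLANE_SIZE = 0x10000  # 65536 code points per plane
NUM_PLANES = 17


def plane_of(ch):
    """Return the plane number (0..16) of a single character."""
    return ord(ch) // PLANE_SIZE


total = NUM_PLANES * PLANE_SIZE
print(total)                    # size of the whole code space
print(sys.maxunicode + 1)       # Python's own upper bound, for comparison
print(plane_of("\u03b1"))       # Greek alpha: Basic Multilingual Plane
print(plane_of("\U0001F600"))   # an emoji: outside the BMP
```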

  5. Variable-length Unicode encodings
     • UTF-8 (UCS Transformation Format, 8-bit):
       – Default for XML processors
       – Characters 0..127 are encoded with 1 byte (= ASCII)
       – Characters 128..2047 are encoded with 2 bytes
       – Characters 2048..65535 are encoded with 3 bytes
       – Characters 65536..1114111 are encoded with 4 bytes
     • UTF-16:
       – Extended from UCS-2 (incl. big-endian/little-endian options)
       – So-called surrogate pairs of two 16-bit codes represent the additional characters above 65535, using 32 bits each.
     • UTF-32:
       – Extended from UCS-4; now these two are identical
       – Fixed-length 4-byte codes
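
The byte-length ranges and the surrogate-pair construction can be sketched in Python as follows. Both function names are made up for this example; the results are compared against Python's built-in codecs rather than taken on trust.

```python
def utf8_length(cp):
    """Number of bytes UTF-8 needs for code point cp (ranges as listed above)."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def surrogate_pair(cp):
    """Split a code point above 65535 into its two 16-bit UTF-16 codes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


for cp in (0x41, 0x3B1, 0x20AC, 0x1F600):
    # Cross-check against the built-in UTF-8 codec.
    print(hex(cp), utf8_length(cp), len(chr(cp).encode("utf-8")))

high, low = surrogate_pair(0x1F600)
print(hex(high), hex(low))
```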

  6. Obsolete encodings of Unicode
     • UCS-2 (Universal Character Set 2):
       – 2-byte unsigned integer 0..65535, e.g. "A" = 00000000 01000001 = 65 (decimal) = #x0041 (hex)
       – Drawbacks: twice the size of ASCII, not compatible with ASCII, 65536 characters are not enough
       – Versions: big-endian (most significant byte first), little-endian (least significant byte first)
     • UCS-4:
       – 4 bytes (32 bits) per character
       – Wasteful for small character sets
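
The byte layouts for "A" can be reproduced in Python. Python ships no codec named UCS-2 or UCS-4, so this sketch assumes the UTF-16 and UTF-32 codecs as stand-ins, which produce the same bytes for a character inside the Basic Multilingual Plane.

```python
ch = "A"

bits = format(ord(ch), "016b")      # the 16-bit pattern shown on the slide
big = ch.encode("utf-16-be")        # most significant byte first
little = ch.encode("utf-16-le")     # least significant byte first
four_bytes = ch.encode("utf-32-be") # 4 bytes per character

print(bits)
print(big, little, four_bytes)
# Both byte orders denote the same number, 65.
print(int.from_bytes(big, "big"), int.from_bytes(little, "little"))
```

The zero bytes in every variant also show why these encodings are not ASCII-compatible and are wasteful for small character sets.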

  7. Miscellaneous
     • Conversion tools between character sets, see e.g.:
       – http://dataconv.org/apps_unicode_utf8.html
       – http://download.oracle.com/javase/1.5.0/docs/tooldocs/solaris/native2ascii.html
     • Character sets supported by Java JDK 5.0: http://download.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html
     • Platform-dependent character sets:
       – Invented by vendors like Microsoft and Apple, e.g. Cp1252 ("Windows ANSI"), MacRoman, MacGreek
       – To be used only within a single system, not in transported data
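
Conversion between character sets does not require a dedicated tool; a minimal Python sketch is shown below. The `convert` helper and the sample bytes are assumptions for illustration. The example also hints at why platform-dependent sets are risky in transported data: byte 0x80 is the euro sign in Cp1252 but a control character in ISO-8859-1.

```python
def convert(data, source, target):
    """Re-encode bytes from one character set to another (hypothetical helper)."""
    return data.decode(source).encode(target)


# "café € 5" as a Windows program would store it in Cp1252.
windows_bytes = b"caf\xe9 \x80 5"

utf8_bytes = convert(windows_bytes, "cp1252", "utf-8")
print(utf8_bytes.decode("utf-8"))

# Misreading the same bytes as Latin-1 silently turns the euro sign
# into the control character U+0080.
misread = windows_bytes.decode("iso-8859-1")
print(repr(misread[5]))
```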

  8. How does the parser find out the character set?
       if external meta-information exists then
           use it
       else if the first 4 bytes are "<?xm" (= #x3C3F786D in ASCII) then
           the encoding is (a superset of) ASCII, and the exact encoding can be
           read from the encoding declaration on the first line (which is pure
           ASCII and thus coded identically in UTF-8).
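
The decision procedure above might be sketched in Python as follows. This is a simplified illustration, not a full implementation of what real parsers do: the function name `detect_xml_encoding` is invented, byte-order marks and UTF-16/UTF-32 input are deliberately ignored, and falling back to UTF-8 when no encoding is declared is an assumption based on UTF-8 being the default for XML processors.

```python
import re

# Matches encoding="..." or encoding='...' inside the XML declaration.
ENCODING_RE = re.compile(r'''encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']''')


def detect_xml_encoding(data, external_charset=None):
    """Guess the encoding of an XML document given as bytes (simplified)."""
    # Step 1: external meta-information (e.g. the HTTP charset) wins.
    if external_charset:
        return external_charset
    # Step 2: "<?xm" means an ASCII-compatible encoding, so the
    # declaration itself can safely be read as ASCII.
    if data[:4] == b"<?xm":
        declaration = data.split(b"?>", 1)[0].decode("ascii", errors="replace")
        match = ENCODING_RE.search(declaration)
        if match:
            return match.group(1)
    # Assumed fallback: no declaration or no encoding pseudo-attribute.
    return "UTF-8"


print(detect_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(detect_xml_encoding(b"<a/>"))
```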

  9. How to type characters outside ASCII
     • Character references to numeric values, e.g. Greek α:
       – either decimal: &#945;
       – or hexadecimal: &#x3B1;
     • Character references may be used in element contents, attribute values, comments, DTD attribute defaults, DTD entity replacement text.
     • Character references may not be used in element and attribute names, processing instruction targets, XML keywords.
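
A short Python sketch of both directions: building the decimal and hexadecimal references for α, letting an XML parser resolve them, and producing references automatically for text that has to be stored as pure ASCII. The element name `g` and attribute name `a` are arbitrary choices for the example.

```python
import xml.etree.ElementTree as ET

alpha = "\u03b1"  # Greek small letter alpha

decimal_ref = "&#%d;" % ord(alpha)
hex_ref = "&#x%X;" % ord(alpha)
print(decimal_ref, hex_ref)

# The parser resolves references in attribute values and element content.
elem = ET.fromstring("<g a='%s'>%s</g>" % (decimal_ref, hex_ref))
print(elem.get("a") == alpha, elem.text == alpha)

# Going the other way: replace every non-ASCII character by a reference.
ascii_only = "\u03b1\u03b2\u03b3".encode("ascii", "xmlcharrefreplace")
print(ascii_only)
```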

  10. Examples
        <αβγ>&#x3B1;&#x3B2;&#x3B3;</αβγ>
          Legal, if α, β and γ can be included natively (i.e. the document's encoding can represent them)
        <&#x3B1;&#x3B2;&#x3B3;>&#x3B1;&#x3B2;&#x3B3;</&#x3B1;&#x3B2;&#x3B3;>
          Illegal (character references are not allowed in element names)
      • Codes, see http://www.unicode.org/charts/
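
The two examples can be tried out with Python's standard XML parser. This is only a sketch; `is_well_formed` is a helper name invented here, and the Greek letters are written as escapes so that the snippet does not depend on the encoding of the source file.

```python
import xml.etree.ElementTree as ET

name = "\u03b1\u03b2\u03b3"              # the native characters alpha, beta, gamma
refs = "&#x3B1;&#x3B2;&#x3B3;"          # the same characters as references

legal = "<%s>%s</%s>" % (name, refs, name)
illegal = "<%s>%s</%s>" % (refs, refs, refs)


def is_well_formed(text):
    """Return True if the parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(legal))    # native name, references only in content
print(is_well_formed(illegal))  # references inside the element name

root = ET.fromstring(legal)
print(root.tag == name, root.text == name)
```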

  11. Character entities
      • Character references can be given entity names in DTDs, e.g.
          <!ENTITY alpha "&#x3B1;">
          <!ENTITY beta "&#x3B2;">
      • Usage: &alpha; &beta;
      • Some DTDs contain only entities (file *.ent). They can be included in the actual DTD as external parameter entities.
      • Predefined ent-files exist for Latin-1, Greek, etc. and can be included as PUBLIC declarations.
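
As an illustration, the entity declarations above can be placed in an internal DTD subset and resolved with Python's standard parser. The sketch assumes an internal subset for simplicity (the slide's external *.ent files are not loaded here), and the element name `g` is arbitrary.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE g [
  <!ENTITY alpha "&#x3B1;">
  <!ENTITY beta "&#x3B2;">
]>
<g>&alpha;&beta;</g>"""

root = ET.fromstring(doc)
print(root.text == "\u03b1\u03b2")

# An entity name that was never declared makes the document ill-formed.
try:
    ET.fromstring("<g>&gamma;</g>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print(undeclared_ok)
```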

  12. Multilingual documents
      • An element may have an attribute xml:lang. It specifies the language (not the character set) used within the element.
      • The language is useful information to the processor (e.g. a spell-checker).
      • Declaration needed: <!ATTLIST elem xml:lang NMTOKEN #IMPLIED>
      • Language codes have 2-4 letters, defined in ISO-639, altogether 7589 languages (ISO-639-3)
      • 2-letter example codes (English = "EN", Finnish = "FI", Swedish = "SV", Greek = "EL", ...)
      • Subcodes may be defined for dialects.
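
A processor that wants to use xml:lang could collect the language in effect for every element roughly as in the Python sketch below. The `languages` generator and the sample document are assumptions for illustration. The sketch relies on the rule from the XML specification that an xml:lang value applies to the whole content of its element until a descendant overrides it; ElementTree exposes the attribute under the reserved XML namespace.

```python
import xml.etree.ElementTree as ET

# ElementTree's name for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages(elem, inherited=None):
    """Yield (tag, language) for elem and all its descendants."""
    lang = elem.get(XML_LANG, inherited)  # own value, else the inherited one
    yield elem.tag, lang
    for child in elem:
        yield from languages(child, lang)


doc = ET.fromstring(
    '<doc xml:lang="en">'
    "<p>Hello</p>"
    '<p xml:lang="fi">Hei</p>'
    '<p xml:lang="sv">Hej</p>'
    "</doc>"
)
result = list(languages(doc))
print(result)
```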
