SLIDE 1

Unicode 4.0 In Common Lisp

Adoption of Unicode In CLforJava

Jerry Boetje ILC 2005 boetjeg@cofc.edu

(defun the-थचऊॉ-old-صﻩطب-fn (aגףפd)
  (if (eql aגףפd 2)
      2
      (* aגףפd (the-थचऊॉ-old-صﻩطب-fn (1- aגףפd)))))

SLIDE 2

CLforJava

ASCII Legacy

  • In the beginning (1983), there was
      • ASCII (universally recognized)
      • Everything else - mostly 8-bit encodings
          • ISO-8859-x
          • Code Pages (IBM PC)
          • JIS and some Chinese encodings (16 bit)
  • Couldn’t mix encodings
      • Doc in Hebrew, Kanji, and Serbo-Croatian

SLIDE 3

Lisp Response

  • Agree on a subset of ASCII that works everywhere (standard char)
  • Add font and bits attributes to characters (later dropped)
  • Fuzzy distinction between types of chars
  • Non-portable method for specifying file encoding
  • Define functions that would work with ASCII

SLIDE 4

Pretty Good For Its Time

SLIDE 5

The Rest of the World’s Response

  • Define a uniform encoding for all characters on Earth
  • Deal with the hard issues
      • Collation
      • Line breaks
      • Equivalence
      • Composition
      • etc.

Unicode

SLIDE 6

20 Years Later

  • Globalization requires speaking all languages
  • Many vendor-specific solutions
  • Unicode version 4 has answers to many of the issues raised by Common Lisp - and then some
  • It’s time to formally integrate Unicode into the Common Lisp Standard
  • But it’s not going to be easy!

SLIDE 7

Unicode 4 in Brief

SLIDE 8

Nature of Characters

  • It’s not enough to assign a number to a char
  • Characters are no longer atomic
      • A run of chars may be equivalent to one char
  • Some provide information but not content
      • Direction
      • Formatting

SLIDE 9

Nature of Characters

  • Never confuse the encoding with an ordering
      • Collation is entirely context-dependent
      • Does ‘o’ come before, after, or the same as ‘ö’?
      • Different if you’re German or Swedish
  • Chars have a rich set of properties
      • Simple - digit?, whitespace?
      • Complex - composition, direction, mirrored?

SLIDE 10

Encoding

  • Number assignments are called ‘code points’
      • Range #x0000 to #x10FFFF (21 bits)
      • ASCII range is the same in Unicode
  • Chars grouped into named ‘blocks’
      • E.g. Tamil, Arabic, Number Forms
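
A minimal sketch of the code point model, written in Python purely for illustration (it is not CLforJava code). It uses only the built-ins `ord` and `chr` plus `sys.maxunicode`; the helper name `is_ascii` is an assumption made for this example.

```python
import sys

# Code points are plain integers; ord() and chr() convert in both directions.
latin_a = ord("A")                 # 0x41, identical to its ASCII value
tamil_a = chr(0x0B85)              # TAMIL LETTER A, from the Tamil block

# The highest code point is #x10FFFF, which fits in 21 bits.
max_code_point = sys.maxunicode
bits_needed = max_code_point.bit_length()


def is_ascii(ch: str) -> bool:
    """True when the character lies in the ASCII range shared with Unicode."""
    return ord(ch) < 0x80
```

The number alone says nothing about what a character means or how it sorts; that information lives in the character properties discussed on the surrounding slides.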

SLIDE 11

Composition / Normalization

  • Some chars are composed of others
      • E.g. ‘Ä’ decomposes to ‘A’ and ‘◌̈’ (U+0308, combining diaeresis)
  • 2 chars are equivalent iff their decomposed, binary forms are identical
  • But some chars are really “the same” even if they’re different
      • E.g. some Katakana full and half-width chars
  • There are 2 definitions of equivalence
      • Canonical and Compatibility
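
The two kinds of decomposition can be observed with Python's standard `unicodedata` module. This is an illustrative sketch only; the helper `code_points` and the variable names are assumptions for this example.

```python
import unicodedata


def code_points(text: str) -> list:
    """Render each character of text as a U+XXXX label."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Canonical decomposition (NFD): 'Ä' becomes 'A' plus a combining diaeresis.
a_umlaut = "\u00C4"
a_decomposed = unicodedata.normalize("NFD", a_umlaut)

# Half-width katakana KA has no canonical decomposition, but it does have a
# compatibility decomposition (NFKD) to the full-width letter.
half_ka = "\uFF76"
ka_canonical = unicodedata.normalize("NFD", half_ka)
ka_compatible = unicodedata.normalize("NFKD", half_ka)
```

Here `code_points(a_decomposed)` lists U+0041 followed by U+0308, while the half-width KA is left unchanged by NFD and only maps to U+30AB under NFKD.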

SLIDE 12

Collation

  • Context-dependent (locales)
  • Unicode defines a table-driven mechanism
      • Very configurable (originally from IBM)
      • Specifically not required
      • Other mechanisms ok if equivalent results
      • Sun/Java uses a rule-based system
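
To make the idea of table-driven, locale-dependent collation concrete, here is a toy sketch in Python. The two tables are hypothetical and hold a single entry each; they are not real Unicode Collation Algorithm data, and `sort_key` is a name invented for this example. It only encodes two well-known conventions: German dictionary order files ‘ö’ with ‘o’, while Swedish treats ‘ö’ as a separate letter after ‘z’.

```python
# Each entry maps a character to (primary weight, secondary weight).
# A primary weight is itself a (letter, rank) pair so it can be compared.
GERMAN = {"ö": (("o", 0), 1)}    # same primary as 'o'; the umlaut only breaks ties
SWEDISH = {"ö": (("z", 3), 0)}   # a distinct letter that sorts after 'z'


def sort_key(word: str, table: dict) -> tuple:
    """Build a two-level key: all primary weights first, then all secondary."""
    weights = [table.get(ch, ((ch, 0), 0)) for ch in word]
    primary = [w[0] for w in weights]
    secondary = [w[1] for w in weights]
    return (primary, secondary)


words = ["zebra", "öl", "ofen"]
by_code_point = sorted(words)
german_order = sorted(words, key=lambda w: sort_key(w, GERMAN))
swedish_order = sorted(words, key=lambda w: sort_key(w, SWEDISH))
```

With these tables the German order places "öl" between "ofen" and "zebra", whereas the Swedish order places it last. Plain code point order happens to agree with Swedish here, which is a coincidence of the encoding rather than a rule.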

SLIDE 13

Bi-directional Algorithm

  • Unicode specifies algorithm to handle nested changes in direction (R to L, L to R)
      • Locale-dependent
      • Very important with mixed languages
  • Impacts the printer
      • Characters not printed in memory order
      • Some characters are mirrored
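
The full bidirectional algorithm is well beyond a slide-sized example, but the per-character data it consumes is easy to inspect. The Python sketch below uses the standard `unicodedata` module; `strong_direction` is a hypothetical helper that applies only the "first strong character" rule for choosing a paragraph's base direction and does no reordering.

```python
import unicodedata


def bidi_class(ch: str) -> str:
    """Bidirectional class of a character, e.g. 'L', 'R', 'AL', 'EN', 'ON'."""
    return unicodedata.bidirectional(ch)


def is_mirrored(ch: str) -> bool:
    """True for characters such as '(' that are mirrored in right-to-left text."""
    return unicodedata.mirrored(ch) == 1


def strong_direction(text: str) -> str:
    """Base direction taken from the first strongly directional character."""
    for ch in text:
        cls = bidi_class(ch)
        if cls == "L":
            return "ltr"
        if cls in ("R", "AL"):
            return "rtl"
    return "ltr"  # assumed default when no strong character is present
```

A printer that honors direction needs at least this information before it can decide in which visual order to emit the characters held in memory order.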

SLIDE 14

Line Break Algorithm

  • Unicode specifies algorithm to determine possible line breaks
      • Handles the <cr>, <lf>, <crlf> problem
      • Locale-dependent
      • Very important with mixed languages
  • Impacts the pretty printer
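
As a small illustration of the <cr>, <lf>, <crlf> problem only, Python's built-in `str.splitlines` treats all three, plus the Unicode line separator U+2028, as line terminators. It does not compute the optional break opportunities that the Unicode line break algorithm defines, so this is a sketch of the mandatory-break part alone.

```python
# One string that mixes every common line terminator.
text = "one\rtwo\nthree\r\nfour\u2028five"

# splitlines() recognizes CR, LF, CRLF (as a single break) and U+2028.
lines = text.splitlines()
```

A pretty printer additionally needs to know where a line may be broken inside running text, which depends on character properties and, for some scripts, on the language.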

SLIDE 15

Implies Pervasive Changes to Several Lisp Components

SLIDE 16

CLforJava Implementation

SLIDE 17

CLforJava Project

  • Capstone software engineering course
      • Multi-semester undergraduate project
      • Gives students a “real world” experience
  • New, original implementation of Common Lisp
      • Written in Java and Lisp
  • See “Common Lisp for Java: A New Implementation Intertwined with Java” Wed 11am

SLIDE 18

Character Types

  • CL standard defines
      • Standard-Char - 96 ASCII chars
      • Base-char, Extended-char - up to the impl
  • CLforJava defines
      • Standard-Char - same as standard
      • Base-char - Unicode definition of base character
          • Can’t be composed with char to the left
      • Extended-char - all the rest

SLIDE 19

Character Naming

  • Official names - LATIN SMALL LETTER A
  • Unofficial names - a
  • Lispified names - LATIN-SMALL-LETTER-A
      • #\a, #\|LATIN SMALL LETTER A|, #\LATIN-SMALL-LETTER-A
  • Lisp names - RETURN, LINEFEED

SLIDE 20

Character Naming in Java

  • 4 interfaces
      • lisp.common.type.Character
      • lisp.common.type.BaseChar
      • lisp.common.type.StandardChar
      • lisp.common.type.ExtendedChar
  • Standard chars available as static fields in StandardChar
      • public static final Character a;
      • public static final Character slash;

SLIDE 21

Loading Character Database

  • XML file derived from Unicode database
      • Approx 15,100 chars
      • Contains all names, code points, etc
  • Loaded on startup
      • All chars are singleton objects
      • Stored in a hash map keyed by code point and by all names
      • Factory class is always a lookup
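
The singleton-plus-lookup design can be sketched as follows. This is a Python illustration, not the CLforJava source: the class name `LispChar` and the methods `load` and `of` are hypothetical, and Python's `unicodedata` module stands in for the XML file derived from the Unicode database.

```python
import unicodedata


class LispChar:
    """One shared object per code point, reachable by code point or by name."""

    _by_code_point = {}
    _by_name = {}

    def __init__(self, code_point, names):
        self.code_point = code_point
        self.names = names

    @classmethod
    def load(cls, code_points):
        """Populate both tables once, at startup."""
        for cp in code_points:
            official = unicodedata.name(chr(cp), None)
            if official is None:
                continue  # unnamed or unassigned code point
            char = cls(cp, [official, official.replace(" ", "-")])
            cls._by_code_point[cp] = char
            for name in char.names:
                cls._by_name[name] = char

    @classmethod
    def of(cls, key):
        """Factory: never constructs, only looks up an existing object."""
        table = cls._by_code_point if isinstance(key, int) else cls._by_name
        return table[key]


# Load only printable ASCII here to keep the example small.
LispChar.load(range(0x20, 0x7F))
```

Because the factory only performs lookups, `LispChar.of(0x61)` and `LispChar.of("LATIN-SMALL-LETTER-A")` return the very same object, so identity comparison is enough to test whether two characters are the same.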

SLIDE 22

Character I/O Streams

  • Lisp character I/O streams extend the Java buffered Reader and Writer classes
  • Necessary to specify the input encoding
      • Java system default if not specified
      • No “guessing” function implemented

SLIDE 23

Other CLs and Unicode

SLIDE 24

Comparison Table

  • 4 Common Lisp Implementations
      • Allegro (Franz), CLisp, LispWorks, CLforJava
  • 16 aspects

Aspect groups: General, File Encoding, Characters, Strings

Aspects: Unicode level; Base Char definition; System default; Reader support; Reader support; Comparison algorithm; Printing support; Discovery support; Comparison algorithm; Comparison algorithm; Custom Collation; Locale support; Available encodings; Printing support; Printing support; Char Width

SLIDE 25

The Highlights

  • Allegro and CLforJava support
      • Unicode 4, Naming, and Collation
  • Allegro and LispWorks support encoding discovery
  • CLforJava only one to escape Unicode chars in strings
  • Each has a different definition of base-char

SLIDE 26

Proposal for Unicode in the Common Lisp Standard

“Someone had to do it.” - Michael Palin

SLIDE 27

Components of the Proposal

  • Characters - type, naming, properties, functions
  • Strings - types, encoding, functions
  • The Reader - read macros, strings, numbers
  • The Printer - characters, strings, direction, line breaks, char width
  • Character I/O - types, functions, locales

SLIDE 28

Characters

SLIDE 29

Characters - Types

  • Retain the current Standard-Char definition
  • Retain the current Extended-char definition
      • (not base-char)
  • Redefine Base-Char to conform to the Unicode definition of base character
      • Canonical Combining Class value of 0
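
The criterion on this slide, a Canonical Combining Class of 0, is directly available from Python's `unicodedata.combining`. The sketch below applies exactly that test; the function names `is_base_char` and `is_base_string` are assumptions chosen to mirror the Lisp type names.

```python
import unicodedata


def is_base_char(ch: str) -> bool:
    """True when the character's Canonical Combining Class is 0."""
    return unicodedata.combining(ch) == 0


def is_base_string(text: str) -> bool:
    """True when every character in the string passes the base-char test."""
    return all(is_base_char(ch) for ch in text)
```

Under this test "A" followed by U+0308 is not a base string, but its composed form "Ä" is, which is why inserting an extended char into a base string implies composing on the fly.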

SLIDE 30

Characters - Naming

  • Characters accessible via their Unicode name
      • (name-char "LATIN SMALL LETTER A") => #\a
      • (char-name #\|LATIN SMALL LETTER A|) => "LATIN SMALL LETTER A"
  • Unicode names are also lispified by ‘-’
      • LATIN-SMALL-LETTER-A
  • Standard-Chars retain their legacy names as well
  • Characters have a ‘preferred’ name
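
A rough Python analogue of the proposed naming functions, using `unicodedata.lookup` and `unicodedata.name`. The function names mirror the Lisp ones but are assumptions for this sketch, and the returned values are plain Python strings rather than Lisp character objects.

```python
import unicodedata


def lispify(name: str) -> str:
    """Turn an official Unicode name into its hyphenated form."""
    return name.replace(" ", "-")


def name_char(name: str):
    """Find a character by official or lispified name; None if there is none."""
    # Try the name as given first, because some official names already
    # contain hyphens (e.g. HYPHEN-MINUS); then try it with spaces restored.
    for candidate in (name, name.replace("-", " ")):
        try:
            return unicodedata.lookup(candidate)
        except KeyError:
            continue
    return None


def char_names(ch: str) -> list:
    """All names known to this sketch, preferred (official) name first."""
    official = unicodedata.name(ch)
    return [official, lispify(official)]
```

Restoring spaces is ambiguous for the few official names that mix spaces and hyphens, so a real implementation would more likely store every lispified name in its lookup table than reverse the transformation.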

SLIDE 31

Characters - Properties

  • Unicode chars have a wealth (49) of properties
      • Digit, whitespace, direction, combining, etc
  • Functions, macros, and constants for support
      • char-available-properties => list of all char properties
      • char-properties char => property list for the char
      • getf char indicator &optional default => value of the indicated property
      • maximum-surrogate-code-point, minimum-surrogate-code-point - values of the high/low surrogate code points

SLIDE 32

Characters - Modified Fns

  • Comparison functions conform to the 2 types of equivalence and of decomposition
      • char= and char> (and similar) compare characters after canonical decomposition
      • char-equal and char-greaterp (and similar) compare characters after compatibility decomposition. They are also case-insensitive.
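
The proposed comparison semantics can be approximated in a few lines of Python. This is a sketch under the assumption that "compare after decomposition" means comparing the NFD or NFKD forms; `char_eq` and `char_equal` are stand-in names for `char=` and `char-equal`, and `str.casefold` is used here as the case-insensitive step.

```python
import unicodedata


def char_eq(a: str, b: str) -> bool:
    """Sketch of char=: equal after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def char_equal(a: str, b: str) -> bool:
    """Sketch of char-equal: compatibility decomposition (NFKD), case ignored."""
    def fold(ch: str) -> str:
        return unicodedata.normalize("NFKD", ch).casefold()

    return fold(a) == fold(b)
```

For example, ANGSTROM SIGN (U+212B) and ‘Å’ (U+00C5) are equal under `char_eq`, while FULLWIDTH LATIN CAPITAL LETTER A (U+FF21) matches ‘a’ only under `char_equal`.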

SLIDE 33

Characters - Modified Fns

  • char-code, char-int char => code-point (an integer)
  • code-char code-point => character at that code point
  • char-name char => the preferred name of the character. The preferred name can be changed to another of the char names by setf.
  • digit-char-p char &optional radix => true if its digit property is true. Radix is honored except for Roman numerals.
  • alpha-char-p char => true if its letter property is true.
  • graphic-char-p char => true if char is not ignorable
  • char-code-limit - upper bound for code points for the supported Unicode level (v4 is #x10FFFF)
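
A hedged Python sketch of how property-driven predicates might behave. The names mirror the Lisp functions but the bodies are assumptions: `unicodedata.decimal` supplies digit weights for decimal digits of any script, ASCII letters extend the digits for radixes above 10, and "letter property" is read as a general category starting with L. Roman numerals carry only a numeric value, not a decimal digit weight, so they fall outside the radix check here.

```python
import sys
import unicodedata

MAX_CODE_POINT = sys.maxunicode  # 0x10FFFF, the value quoted on the slide


def digit_char_p(ch: str, radix: int = 10):
    """Weight of ch as a digit in the given radix, or None if it is not one."""
    weight = unicodedata.decimal(ch, None)       # decimal digits of any script
    if weight is None and ch.isascii() and ch.isalpha():
        weight = ord(ch.lower()) - ord("a") + 10  # a-z for radix above 10
    if weight is not None and weight < radix:
        return weight
    return None


def alpha_char_p(ch: str) -> bool:
    """True when the character's general category is a letter (L*)."""
    return unicodedata.category(ch).startswith("L")


# ROMAN NUMERAL EIGHT has a numeric value but no decimal digit weight.
roman_eight_value = unicodedata.numeric("\u2167")
```

Returning the weight rather than a bare boolean follows the convention of the existing Common Lisp `digit-char-p`.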

SLIDE 34

Characters - New Fns

  • char-names char => list of names of the char. The first name is the preferred name.
  • char-compose base-char &rest extended-chars => a compatibility composed char
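
One possible reading of `char-compose`, sketched in Python: concatenate the base character with the combining characters, apply compatibility composition (NFKC), and succeed only if a single character results. Returning `None` when no precomposed character exists is an assumption of this sketch; the slides do not say what happens in that case.

```python
import unicodedata


def char_compose(base: str, *combining: str):
    """Compose base with combining chars; None if no single char results."""
    composed = unicodedata.normalize("NFKC", base + "".join(combining))
    return composed if len(composed) == 1 else None
```

For instance, `char_compose("A", "\u0308")` yields ‘Ä’, whereas ‘q’ with the same diaeresis has no precomposed form and yields `None`.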

SLIDE 35

Strings

SLIDE 36

Strings - Types

  • base-string contains only base-chars (current)
  • Implications of this restriction
      • Does not contain any combining chars
      • Affects alterations of base-strings and coercion to a base-string
      • Insertion of an extended-char changes the preceding base-char
          • Composed on the fly

SLIDE 37

Strings - Encoding

  • Standard does not specify an internal encoding
      • It must support all of the updated and new functions
      • Common choices would be UTF-8 and UTF-16

SLIDE 38

Strings - Modified Fns

  • String comparison - similar to Character compare
      • string=, string<, etc use canonical decomposition and either binary or locale-based comparison (Unicode NFC)
      • string-equal, string-lessp, etc use compatibility decomposition for equivalence or locale-based comparison (Unicode NFKC)
  • Implementations may support sort keys (pre-computed comparison key)

SLIDE 39

Strings - New Fns

  • Support for Unicode decomposition and composition algorithms
      • string-decompose-canonical string => new string in NFD form
      • string-decompose-compatible string => new string in NFKD form
      • string-compose-canonical string => new string in NFC if string is in NFD form
      • […] => new string in NFKC if string is in NFKD form

SLIDE 40

The Reader

SLIDE 41

Reader - The Basics

  • The Reader is always presented with Unicode characters
      • Reader never has to translate
      • Affects the stream functions (e.g. read-char)

SLIDE 42

Reader - Read Macros

  • #\
      • Supports the Unicode char names and their lispified form
  • #U, #U+
      • Takes 4 or 6 hex digits representing the code point of the char
  • "" - the string read macro
      • Works as now, but recognizes #U and #U+ read macros embedded in the string
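
To illustrate how embedded `#U` and `#U+` escapes could be expanded inside a string, here is a small Python sketch built on the standard `re` module. The function name `expand_unicode_escapes` and the choice to try six hex digits before four are assumptions; the slides do not specify how the reader decides between the two lengths.

```python
import re

# '#U' or '#U+' followed by six hex digits, or failing that, four.
_ESCAPE = re.compile(r"#U\+?([0-9A-Fa-f]{6}|[0-9A-Fa-f]{4})")


def expand_unicode_escapes(text: str) -> str:
    """Replace every #U/#U+ escape with the character at that code point."""
    # Note: a six-digit value above 10FFFF makes chr() raise ValueError.
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
```

Trying the six-digit form first means that a four-digit escape immediately followed by two characters that happen to be hex digits is read as one six-digit escape, an ambiguity that any fixed-width notation of this kind has to resolve somehow.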

SLIDE 43

Reader - Numbers

  • Potential numbers
      • Definition includes any character whose ‘digit’ property is true - includes Roman numerals
  • Legal integer numbers must come from the same Unicode block
      • E.g. can’t mix European (1, 2...) with Devanagari (१, २ ...)
      • Question of hex definition (#x१२FF)
  • Recognizes ratio characters (⅔, ⅘)
      • 8⅔ => 26/3
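
A sketch of the two number rules in Python. The standard library has no block lookup, so "same block" is approximated here by requiring that all digits belong to the same run of ten decimal digits (every digit minus its value lands on the same zero character). The function names are hypothetical, and `limit_denominator` is used to recover an exact ratio from the floating-point value that `unicodedata.numeric` returns.

```python
import unicodedata
from fractions import Fraction


def parse_integer(token: str) -> int:
    """Parse decimal digits that all come from one script's digit run."""
    # unicodedata.decimal raises ValueError for anything that is not a digit.
    zeros = {ord(ch) - unicodedata.decimal(ch) for ch in token}
    if len(zeros) != 1:
        raise ValueError("digits must all come from the same script")
    value = 0
    for ch in token:
        value = value * 10 + unicodedata.decimal(ch)
    return value


def parse_mixed_number(token: str) -> Fraction:
    """Parse an integer followed by one fraction character, e.g. 8⅔."""
    whole, fraction_char = token[:-1], token[-1]
    fraction = Fraction(unicodedata.numeric(fraction_char)).limit_denominator(1000)
    return parse_integer(whole) + fraction
```

With these definitions "१२" parses to 12, "8⅔" parses to 26/3 as on the slide, and a token that mixes European and Devanagari digits is rejected.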

SLIDE 44

The Printer

SLIDE 45

Printer - *Print-Escape*

  • Characters
      • If nil, the character is sent uninterpreted to the stream
          • Stream encoding may lose information
      • Otherwise, character is printed using #\ notation

SLIDE 46

Printer - *Print-Escape*

  • Strings
      • If nil, the string is composed (NFC or NFKC) and the characters are sent to the output. The printer must honor bi-directional information. This may also require mirroring.
      • Otherwise, the characters are streamed in memory order between " ". If the stream encoding supports a char, the char is streamed. If not, the char is escaped using #U or #U+ syntax.
      • Ignorable chars are always passed
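
The escaped-printing rule for strings can be sketched in Python by asking the target encoding whether it can represent each character. The function name `print_string_escaped` is an assumption, as is the choice of four hex digits for code points up to FFFF and six above; the sketch also leaves embedded double quotes untouched.

```python
def print_string_escaped(text: str, encoding: str) -> str:
    """Render text between double quotes, escaping what the encoding lacks."""
    pieces = []
    for ch in text:
        try:
            ch.encode(encoding)          # representable: stream it unchanged
            pieces.append(ch)
        except UnicodeEncodeError:       # not representable: use #U+ syntax
            digits = 4 if ord(ch) <= 0xFFFF else 6
            pieces.append(f"#U+{ord(ch):0{digits}X}")
    return '"' + "".join(pieces) + '"'
```

Printing "café" to an ASCII stream this way gives "caf#U+00E9", which a reader that recognizes embedded `#U+` escapes can turn back into the original string without loss.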

SLIDE 47

Pretty Printer

  • All of the behavior for the Printer
  • Pretty Printer must also conform to
      • Unicode line break algorithm to determine potential line break locations
      • Char width information
          • Unicode chars may be zero, half, or full width characters - format
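
A heuristic Python sketch of zero, half, and full width, based on `unicodedata.east_asian_width`, the combining class, and the general category. The mapping to 0, 1, and 2 columns is an assumption that suits fixed-pitch output; actual rendering depends on the font and the display.

```python
import unicodedata


def char_width(ch: str) -> int:
    """Columns occupied by ch: 0 (zero width), 1 (half) or 2 (full)."""
    # Combining marks and format characters take no column of their own.
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide and Fullwidth characters take two columns.
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def display_width(text: str) -> int:
    """Total columns needed to show text in a fixed-pitch layout."""
    return sum(char_width(ch) for ch in text)
```

A pretty printer would use something like `display_width` instead of the string length when deciding whether a form still fits on the current line.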

SLIDE 48

Character I/O

SLIDE 49

Character I/O - Types

  • encoding
      • A CLOS class that translates between Unicode encoding and some other encoding (e.g. ISO-8859-1)
      • An encoding instance may be passed to the open function’s :external-format parameter
      • An encoding instance is one of the IANA recognized encodings or an implementation-specific encoding
      • Encodings may be combined in a stream

SLIDE 50

Character I/O - Modified Fns

  • open
      • :external-format arg takes an encoding
          • Current *locale* provides a default
      • :probe argument
          • Returns a stream that contains an encoding
  • probe-file
      • Returns a second value that is the file encoding
  • read-char returns a valid Unicode character

SLIDE 51

Character I/O - New Fns

  • list-encodings => returns a list of the encodings supported by this implementation
  • encoding-name encoding => name of the encoding
  • stream-encoding stream => encoding of the stream
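
For comparison, Python's standard library offers rough counterparts to the three proposed functions. The sketch below is an analogy only: `list_encodings`, `encoding_name`, and `round_trip` are names invented for this example, and the `encoding` argument of `open` plays the role that `:external-format` plays in the proposal.

```python
import codecs
import os
import tempfile
from encodings.aliases import aliases


def list_encodings() -> list:
    """Codec names known to this Python installation's alias table."""
    return sorted(set(aliases.values()))


def encoding_name(name: str) -> str:
    """Canonical codec name for any accepted alias."""
    return codecs.lookup(name).name


def round_trip(text: str, encoding: str) -> tuple:
    """Write text with an explicit encoding, then read it back the same way."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "sample.txt")
        with open(path, "w", encoding=encoding) as out:
            out.write(text)
        with open(path, "r", encoding=encoding) as stream:
            # stream.encoding is the analogue of stream-encoding.
            return stream.encoding, stream.read()
```

Whatever the external encoding, the program only ever sees Unicode characters, which matches the requirement that `read-char` always returns a valid Unicode character.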

SLIDE 52

Summary

SLIDE 53

Unicode Integration Implications

  • Goes beyond just adding some characters
  • Pervasive effects in major subsystems
      • Characters, Strings
      • Reader, Printer
      • Character I/O
      • Sorting, comparisons

SLIDE 54

Unicode Implications

  • It’s so complex an issue...
      • Small differences in implementation can disrupt portability
  • What to do?
      • Update the Common Lisp standard
      • Give it a name - How about...?

Common Lisp 2006 ☠ Optimist! ☢

SLIDE 55

A Demo!

SLIDE 56

There’s a Discussion Forum

  • http://clforjava.cs.cofc.edu/forum/
  • Go to the “Dealing with Unicode” board
  • There’s even a voting system built in

SLIDE 57

Q & A