Towards a roadmap for standardization in language technology
Laurent Romary & Nancy Ide
Loria-INRIA, Vassar College
Exchange of data
Interoperability between software components
Comparability of results
From a technological point of view
Stabilizing existing practices
Looking ahead for potential roadblocks
From an organizational point of view
International consensus, long term availability and maintenance
TEI (Text Encoding Initiative)
LISA (Localization Industry Standards Association)
Basic building blocks:
XML, XML Schemas (Note: growing importance of alternative RelaxNG schemas), XSL
Web services activity
WSDL, SOAP
Semantic web activity
RDF, RDFS, OWL
Specific (vertical) activities with little critical mass
VoiceXML, EMMA, etc.
Basic infrastructural (horizontal) standards
Character encoding (cf. IPA): ISO 10646/Unicode
Language codes: ISO 639 (e.g. ‘fr’) and ISO 639-2 (e.g. ‘fra’/‘fre’)
Note: under ISO/TC 37/SC 2
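The relation between the two-letter and three-letter language codes mentioned above can be sketched as a small lookup. This is only an illustration: the table covers three languages, and the names `ISO_639_TO_639_2`, `to_three_letter` and `same_language` are hypothetical helpers, not part of any standard or library. It shows why ‘fra’ and ‘fre’ have to be treated as the same language when exchanging data.

```python
# Illustrative subset only; table and function names are hypothetical.
# Each entry maps a two-letter ISO 639 code to its ISO 639-2 codes,
# given as (terminological, bibliographic).
ISO_639_TO_639_2 = {
    "fr": ("fra", "fre"),
    "de": ("deu", "ger"),
    "en": ("eng", "eng"),
}


def to_three_letter(code, bibliographic=False):
    """Return the ISO 639-2 code for a two-letter ISO 639 code."""
    try:
        terminological, biblio = ISO_639_TO_639_2[code.lower()]
    except KeyError:
        raise ValueError(f"code not in the illustrative table: {code!r}") from None
    return biblio if bibliographic else terminological


def _two_letter(code):
    """Normalize a two- or three-letter code to its two-letter form."""
    code = code.lower()
    for two, three_letter_codes in ISO_639_TO_639_2.items():
        if code == two or code in three_letter_codes:
            return two
    raise ValueError(f"code not in the illustrative table: {code!r}")


def same_language(code_a, code_b):
    """True if both codes denote the same language in the table."""
    return _two_letter(code_a) == _two_letter(code_b)
```

With this table, `to_three_letter("fr")` gives `"fra"`, `to_three_letter("fr", bibliographic=True)` gives `"fre"`, and `same_language("fre", "fra")` is true.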
Vertical standards
MPEG7 for multimedia information: hardly implementable :-(
Terminology standards: ISO 12200 (Martif), ISO 12620 (Data categories), ISO 16642 (Terminological markup framework)
Note: under ISO/TC 37/SC 3
Collaboration on language aspects
Strong basis provided by ISO 11179
ISO/IEC 9126-1 [2 & 3 in progress]
ISO/IEC 14598-1 to 6
SC3 - Computer applications in terminology
ISO 12200 - Martif
Latest version of TEI Terminology chapter
ISO 12620 - Data categories (under revision)
ISO 16642 - TMF (Terminological Markup Framework)
SC4 - Language Resource Management (May 2002)
WG1 - Basic descriptors and mechanisms for language resources
WG2 - Representation schemes
WG3 - Multilingual text representation
WG4 - Lexical databases
WG5 - Workflow of language resource management
Data categories
Joint activity with the TEI; CD document almost achieved; planned project on FS declaration
E.g. principles of annotation scheme specification and representation, pointing mechanisms for stand-off mark-up; draft document available
Stable working draft under dissemination for evaluation
A general specification platform for lexical structures
Preliminary proposals: core model + lexical extensions
Objective: marketplace of descriptors for all types of language resources and annotation schemes
E.g.: /grammatical gender/, /paucal number/, /ablative case/, etc.
On-line tool available: http://syntax.loria.fr
Three ad hoc groups created
Metadata for language resources
Morphosyntactic descriptors (SC4 plenary last Tuesday)
Cf. Morphosyntactic Annotation Framework
Semantic content descriptors
Exploratory: discourse relations, dialogue acts, referential links, etc.
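The idea of a shared pool of descriptors, written in the /.../ notation used above, can be sketched as a check of an annotation scheme against a registry. This is a minimal sketch under stated assumptions: the set `REGISTERED` holds only the three examples from this section, and `parse_descriptor` and `unregistered` are hypothetical helpers, not the interface of the actual Data Category Registry or of the on-line tool.

```python
# Hypothetical miniature registry holding only the examples named in the text.
# The real registry's content and format are not assumed here.
REGISTERED = {"grammatical gender", "paucal number", "ablative case"}


def parse_descriptor(text):
    """Turn '/ablative case/' into the normalized name 'ablative case'."""
    text = text.strip()
    if len(text) < 3 or not (text.startswith("/") and text.endswith("/")):
        raise ValueError(f"not in /descriptor/ notation: {text!r}")
    # Collapse internal whitespace and lowercase for comparison.
    return " ".join(text[1:-1].split()).lower()


def unregistered(scheme_descriptors, registry=REGISTERED):
    """Return, sorted, the descriptors a scheme uses that the registry lacks."""
    used = {parse_descriptor(d) for d in scheme_descriptors}
    return sorted(used - registry)
```

For a scheme that declares `["/ablative case/", "/dual number/"]`, `unregistered` reports `["dual number"]`, i.e. the descriptor that would still have to be proposed to the registry.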
Wide dissemination of existing standards
Two priorities in ISO/TC 37/SC 4: morphosyntax and lexical structures
Validation of on-going documents by our community
Feedback on documents, reference implementations
Gathering up samples and/or test suites (manpower needed)
Organizing the work on the Data Category Registry
Which additional topics should be addressed?
How to involve a wide variety of experts?
Specific publication and information days
Which formats should be ‘frozen’ within the LMF framework?
Semantic content representation
Multilingual information representation
How to relate on-going activities on translation memories, localization, iTV, multimedia information (e.g. sub-titling)?
Evaluation of NLP components
General principles: linguistic coverage, metrics
Application specific evaluation methods: machine translation, information extraction
Workflow of language resources
The life cycle of language resources: creation, enrichment, validation, dissemination
Sign languages…
Standards as the identification of stable concepts in a field
Introduction in academic curricula
Defining priorities
Contribution to technical work
E.g. Evaluation related standards