MULTIFLEX - a Formalism and a Tool for the Computational Morphology of Multi-Word Units
Agata SAVARY, Université François Rabelais, Laboratoire d'Informatique, IUT de Blois
IPI PAN Seminar, Warszawa, Jan 5, 2007
2
Multi-Word Units (MWUs)
- MWUs = hard-to-define and controversial linguistic objects called, in
various contexts, compounds, frozen expressions, complex terms, multi-word named entities, burkinostka (pl.), etc.
- Numerous linguistic and pragmatic definitions (Benveniste 1974, Downing
1977, Levi 1978, Bauer 1983, Gross 1990, Anscombre 1990, Silberztein 1993, Cadiot 1992, Gross 1996, Derwojedowa & Rudolf 2003) and applications (Sparck-Jones & Tait 1984, Smadja 1993, Silberztein 1993, Daille 1994, Enguehard & Pantera 1994, Jacquemin 2001, Paumier 2003)
- Major features invoked in the bibliography (based on controversial
elementary notions and measures, in italics on the slide):
– Composed of two or more words
– Showing some degree of non-compositionality
– Having unique and constant references
3
MWUs: pragmatic definition
- MWU = a contiguous sequence of graphical units which,
for some application-dependent reasons, has to be listed, described (morphologically, syntactically, semantically, etc.) and analyzed as a unit
- Examples:
– (en.) battle of nerves, Papua New Guinea, -calculus
– (fr.) porte-avions, Windows 3.11, liaison multiple par satellite
– (pl.) pranie mózgu, Rondo De Gaulle’a, trzy po trzy
4
Inflectional Morphology of MWUs
- What is a MWU’s morphological class (noun, adjective, etc.) and what
inflection categories (number, gender, etc.), with fixed or variable values, are relevant to it? E.g. pranie mózgu is a noun, it has neuter gender and it inflects for number and case.
- What are the exceptions to the inflection categories determined above?
E.g. wybory powszechne is a noun but does not have a singular form.
- What are the inflectional characteristics (base form, morphological
class, inflection paradigm) of each single constituent of the MWU? E.g. porte is an uninflected verb in porte-avions and an inflected noun in porte-fenêtre.
- How do inflected forms of single constituents combine in the inflection
process of the whole compound? E.g.
– battle cry → battle cries
– battle royal → battle royals, or battles royal (*battles royals)
– battle of nerves → battles of nerves
5
State of the Art (1/2)
- Morphological analysis: stemming or lemmatizing of the
constituent words
- Problems:
– cross-roads → *cross-road (should be: cross-roads)
– court martials → court ? (martial is not an individual English noun)
– des deux-chevaux → ? (non-standard French nominal construction)
- Morphological generation: grammar-based or bag-of-words
approaches
- Problems:
– notary public → notary publics (should also be: notaries public and notaries publics)
– battle cry → battle cries, *battles cry, *battles cries
6
State of the Art (2/2)
- Xerox – lexical transducers allowing compounding and
unification. Elegant and mathematically well-defined model.
A comparative study with Multiflex is in progress.
- Greek DELA – all possible combinations of all inflected
forms of the single constituents + restriction filters. Drawbacks: a graphical unit has a fixed definition, some restrictions on forms cannot be described, heterogeneous and non generic rules (hard to adapt to a different language), separators are impossible to describe.
- Intex – a MWU’s inflection formalism extends the simple
word morphology with new operators like “go to the end of the first word, add an s”, etc. Drawbacks: Redundant description of single words’ morphology.
7
Example of a Formalism for the Morphology of Single-Word
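[Figure: inflection graph A72 for the entries nouveau,A72 and beau,A72. Box labels recovered from the slide: 2lle, x, s. Output features: :ms, :fs, :mp, :fp]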
8
My aim
[Figure: the entries battle of nerves,CN23 and man-of-war,CN23, with the content of their inflection graph CN23 left open (?)]
Propose a (“universal”) formalism that allows all inflected forms of a MWU to be described explicitly and precisely. An inflectional paradigm for a MWU should be independent from the inflectional paradigms of its single constituents.
9
MULTIFLEX
A Formalism and a Tool
10
Morphological description on the language level
11
The alphabet
French: Aa Aà Àà Aâ Ââ Aä Ää Bb Cc Cç …
Polish: Aa Ąą Bb Cc Ćć Dd Ee Ęę Éé Ëë …
Serbian (encoded in ASCII only): Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj …
List of alphabet characters with upper/lowercase equivalences and sorting keys.
12
Inflectional classes, categories and values
French
<CATEGORIES>
Nb: s, p
Gen: m, f
<CLASSES>
noun: (Nb,<var>),(Gen,<fixed>)
adj: (Nb,<var>),(Gen,<var>)
adv: …
Category name: Nb
Possible values for this category: s, p
Class name: noun
Possible categories for this class (variable or fixed):
- Nb (a noun inflects in number)
- Gen (a noun has a gender but does not inflect in gender (?))
13
Inflectional classes, categories and values
Polish
<CATEGORIES>
Nb:sing,pl
Gen:pers_masc,anim_masc,inanim_masc,fem,neut
Case:Nom,Gen,Dat,Acc,Inst,Loc,Voc
<CLASSES>
noun:(Nb,<var>),(Case,<var>),(Gen,<fixed>)
adj:(Nb,<var>),(Case,<var>),(Gen,<var>)
Multi-character names are admitted for classes and categories
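The language-level description shown on the two slides above can be read into plain dictionaries. The following is only a sketch: `parse_morphology` is a hypothetical helper, not part of MULTIFLEX, and the whitespace handling is an assumption.

```python
import re

# The Polish description from the slide, copied as text.
POLISH_MORPHOLOGY = """\
<CATEGORIES>
Nb:sing,pl
Gen:pers_masc,anim_masc,inanim_masc,fem,neut
Case:Nom,Gen,Dat,Acc,Inst,Loc,Voc
<CLASSES>
noun:(Nb,<var>),(Case,<var>),(Gen,<fixed>)
adj:(Nb,<var>),(Case,<var>),(Gen,<var>)
"""

def parse_morphology(text):
    """Return (categories, classes) read from a Morphology-style description."""
    categories, classes = {}, {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line in ("<CATEGORIES>", "<CLASSES>"):
            section = line
            continue
        # Split at the first colon: name on the left, definition on the right.
        name, _, rest = line.partition(":")
        name = name.strip()
        if section == "<CATEGORIES>":
            categories[name] = [value.strip() for value in rest.split(",")]
        elif section == "<CLASSES>":
            # Each pair looks like (Nb,<var>) or (Gen,<fixed>).
            pairs = re.findall(r"\(\s*(\w+)\s*,\s*<(var|fixed)>\s*\)", rest)
            classes[name] = {category: kind for category, kind in pairs}
    return categories, classes

categories, classes = parse_morphology(POLISH_MORPHOLOGY)
print(categories["Nb"])
print(classes["noun"])
```

Keeping "var" and "fixed" per class makes it easy to check later that a paradigm only varies the categories its class allows.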
14
Morphological description on the level of a multi-word unit: inflection graphs
15
Inflection Graphs for MWUs: battle royal
[Figure: inflection graph for battle royal, with constituents $1 $2 $3]
Callouts on the graph:
- Entry box
- The 1st constituent remains intact
- The whole MWU is in singular
- The 3rd constituent gets inflected into plural
- Stop box
- Each path describes one or more inflected forms of a MWU
Generated forms for battle royal: (battle royal, [Nb=s]), (battle royals, [Nb=p]), (battles royal, [Nb=p])
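The idea that each path of the graph yields an inflected form can be sketched in a few lines. The graph itself did not survive in this transcript, so the three paths below are inferred from the three generated forms, and the tiny lexicon `SIMPLE_FORMS` is a stand-in for a real single-word morphology module; all names are hypothetical.

```python
# Toy lexicon standing in for the single-word morphology module (assumption).
SIMPLE_FORMS = {
    ("battle", "s"): "battle", ("battle", "p"): "battles",
    ("royal", "s"): "royal", ("royal", "p"): "royals",
}

# $1, $2, $3: the graphical units of the lemma; $2 is the space.
CONSTITUENTS = ["battle", " ", "royal"]

# One tuple per path: the Nb value requested for each constituent
# (None = left intact) and the features of the resulting MWU form.
PATHS = [
    ((None, None, None), {"Nb": "s"}),
    ((None, None, "p"), {"Nb": "p"}),
    (("p", None, None), {"Nb": "p"}),
]

def generate(constituents, paths):
    """Follow every path and collect (form, features) pairs."""
    forms = []
    for requested, features in paths:
        parts = [
            unit if nb is None else SIMPLE_FORMS[(unit, nb)]
            for unit, nb in zip(constituents, requested)
        ]
        forms.append(("".join(parts), features))
    return forms

for form, features in generate(CONSTITUENTS, PATHS):
    print(form, features)
```

Because no path asks for both constituents in the plural, the unwanted form *battles royals is never produced.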
16
Inflection Graphs for MWUs: bateau-mouche
[Figure: inflection graph for bateau-mouche, with constituents $1 $2 $3]
Generated forms for bateau-mouche: (bateau-mouche, [Gen=m,Nb=s]), (bateaux-mouches, [Gen=m,Nb=p])
- Gender category has a fixed value in this class (noun). In this paradigm it is masculine.
- This class (noun) inflects for number.
- Inflection features for a single constituent may be a partial set. Here the gender is not specified: it is the same as this constituent has in the base form of the MWU. This allows the same graph to apply to e.g. homme politique.
17
Unification variables
[Figure: inflection graph with a unification variable, shown for bateau-mouche and for homme politique, each with constituents $1 $2 $3]
- The unification variable $n may be instantiated to any value of its category’s domain (here: Nb:s,p).
- The instantiation of a variable is identical for all its appearances in one path.
- The generated forms are identical to those of the previous graph.
Generated forms for bateau-mouche: (bateau-mouche, [Gen=m,Nb=s]), (bateaux-mouches, [Gen=m,Nb=p])
Generated forms for homme politique: (homme politique, [Gen=m,Nb=s]), (hommes politiques, [Gen=m,Nb=p])
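A unification variable can be read as a loop over the domain of its category, with the same value used everywhere in the path. The sketch below assumes a single path in which $1 and $3 both carry Nb=$n; the lexicon and the function name are hypothetical.

```python
CATEGORIES = {"Nb": ["s", "p"], "Gen": ["m", "f"]}

# Toy single-word lexicon (assumption), indexed by (lemma, Nb).
SIMPLE_FORMS = {
    ("bateau", "s"): "bateau", ("bateau", "p"): "bateaux",
    ("mouche", "s"): "mouche", ("mouche", "p"): "mouches",
    ("homme", "s"): "homme", ("homme", "p"): "hommes",
    ("politique", "s"): "politique", ("politique", "p"): "politiques",
}

def inflect_with_variable(constituents, inflected_positions, fixed):
    """Instantiate $n once per path and reuse it for every marked constituent."""
    results = []
    for n in CATEGORIES["Nb"]:  # $n ranges over the whole domain of Nb
        parts = [
            SIMPLE_FORMS[(unit, n)] if index in inflected_positions else unit
            for index, unit in enumerate(constituents)
        ]
        results.append(("".join(parts), {**fixed, "Nb": n}))
    return results

print(inflect_with_variable(["bateau", "-", "mouche"], {0, 2}, {"Gen": "m"}))
print(inflect_with_variable(["homme", " ", "politique"], {0, 2}, {"Gen": "m"}))
```

The same call works for both entries because the separator (hyphen or space) is just another constituent that the path leaves intact.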
18
Unification variables and value inheritance
The previous graph is limited to masculine MWUs. However, bateau-mouche and homme politique inflect basically in the same way as moissonneuse-batteuse, liaison numérique, etc.: they inflect in number only, and to get the plural we need to put the 1st and the 3rd constituent into plural. The generated forms are identical to those of the previous graph. The double equal sign (==) means that variable $g has a fixed value in the whole path. It is unified with the gender value that the first constituent has in the base form of the MWU (e.g. masculine for bateau, and feminine for liaison). The whole MWU inherits this value.
Generated forms for bateau-mouche: (bateau-mouche, [Gen=m,Nb=s]), (bateaux-mouches, [Gen=m,Nb=p])
Generated forms for liaison numérique: (liaison numérique, [Gen=f,Nb=s]), (liaisons numériques, [Gen=f,Nb=p])
19
Graph size reduction via unification variables: pranie mózgu
pranie mózgu (brainwashing)
[Figure: inflection graph for pranie mózgu, with constituents $1 $2 $3, written without unification variables]
With no use of unification variables the inflection graph would have to contain 28 different paths.
Generated forms for pranie mózgu (all with Gen=neut; each cell lists the two forms that share the same features):
Case | Nb=sing | Nb=pl
Nom | pranie mózgu, pranie mózgów | prania mózgu, prania mózgów
Gen | prania mózgu, prania mózgów | prań mózgu, prań mózgów
Dat | praniu mózgu, praniu mózgów | praniom mózgu, praniom mózgów
Acc | pranie mózgu, pranie mózgów | prania mózgu, prania mózgów
Inst | praniem mózgu, praniem mózgów | praniami mózgu, praniami mózgów
Loc | praniu mózgu, praniu mózgów | praniach mózgu, praniach mózgów
Voc | pranie mózgu, pranie mózgów | prania mózgu, prania mózgów
20
Graph size reduction via unification variables: pranie mózgu
[Figure: inflection graph for pranie mózgu, with constituents $1 $2 $3, using unification variables]
- The 1st and the 3rd constituent inflect in number independently from each other
- The whole MWU inherits its gender, number and case from the 1st constituent
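The reduction from 28 explicit paths to a single path with variables can be checked by enumeration. In the sketch below the variable names $n, $c and $m are hypothetical (the graph itself is not recoverable from the transcript); the word forms are the ones listed on the previous slide.

```python
from itertools import product

CASES = ["Nom", "Gen", "Dat", "Acc", "Inst", "Loc", "Voc"]
NUMBERS = ["sing", "pl"]

# Forms of "pranie" as listed on the slide, indexed by (Nb, Case).
PRANIE = {
    ("sing", "Nom"): "pranie", ("sing", "Gen"): "prania",
    ("sing", "Dat"): "praniu", ("sing", "Acc"): "pranie",
    ("sing", "Inst"): "praniem", ("sing", "Loc"): "praniu",
    ("sing", "Voc"): "pranie",
    ("pl", "Nom"): "prania", ("pl", "Gen"): "prań",
    ("pl", "Dat"): "praniom", ("pl", "Acc"): "prania",
    ("pl", "Inst"): "praniami", ("pl", "Loc"): "praniach",
    ("pl", "Voc"): "prania",
}

# Genitive forms of "mózg", indexed by Nb.
MOZG_GEN = {"sing": "mózgu", "pl": "mózgów"}

def pranie_mozgu_forms():
    """One path with three variables instead of 28 explicit paths."""
    forms = []
    # $n, $c: number and case of the 1st constituent, inherited by the MWU.
    # $m: number of the 3rd constituent, independent of $n.
    for n, c, m in product(NUMBERS, CASES, NUMBERS):
        form = f"{PRANIE[(n, c)]} {MOZG_GEN[m]}"
        forms.append((form, {"Gen": "neut", "Nb": n, "Case": c}))
    return forms

forms = pranie_mozgu_forms()
print(len(forms))
```

The count is 2 numbers x 7 cases for the first constituent times 2 numbers for the third, i.e. the 28 forms of the slide.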
21
Graphical variant description: student union
[Figure: inflection graph for student union, with constituents $1 $2 $3]
Insertions and/or deletions of different elements are possible, which may lead to a more complete variant description.
Generated forms for student union: (student union, [Nb=s]), (students union, [Nb=s]), (students’ union, [Nb=s]), (student unions, [Nb=p]), (students unions, [Nb=p]), (students’ unions, [Nb=p])
22
Syntactic variant description: birth date
[Figure: inflection graph for birth date, with constituents $1 $2 $3]
Insertions and/or inversions of different elements are possible.
Generated forms for birth date: (birth date, [Nb=s]), (date of birth, [Nb=s]), (birth dates, [Nb=p]), (dates of birth, [Nb=p])
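Inversion and insertion can be modelled by letting a path refer to constituents in any order and mix in literal text. This is a sketch under assumptions: the path encoding (integers, `("Nb", i)` pairs and strings) is invented for illustration and is not the MULTIFLEX graph syntax.

```python
# Toy plural lookup for the only constituent that inflects here (assumption).
PLURALS = {"date": "dates"}

# $1, $2, $3 of the lemma "birth date"; $2 is the space.
CONSTITUENTS = ["birth", " ", "date"]

# A path is a sequence of items:
#   int i      -> constituent $i, left intact
#   ("Nb", i)  -> constituent $i, inflected for the MWU's number
#   str        -> literal text inserted by the graph
PATHS = [
    [1, 2, ("Nb", 3)],        # birth date / birth dates
    [("Nb", 3), " of ", 1],   # date of birth / dates of birth
]

def realize(path, number):
    """Build one surface form from a path and a number value."""
    out = []
    for item in path:
        if isinstance(item, str):
            out.append(item)
        elif isinstance(item, int):
            out.append(CONSTITUENTS[item - 1])
        else:
            _, index = item
            unit = CONSTITUENTS[index - 1]
            out.append(PLURALS[unit] if number == "p" else unit)
    return "".join(out)

def birth_date_forms():
    return [(realize(path, n), {"Nb": n}) for n in ("s", "p") for path in PATHS]

print(birth_date_forms())
```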
23
Squeezed compound description: bonhomme
If only we manage to distinguish the graphical units properly, they can be inflected as simple words.
[Figure: inflection graph for bonhomme, with constituents $1 $2]
Generated forms for bonhomme: (bonhomme, [Gen=m,Nb=s]), (bonshommes, [Gen=m,Nb=p])
24
Perspectives: Modular Approach to Morphology of Simple & Compound Units
[Figure: two cooperating modules, the Morphological Module for Simple Words and the Morphological Module for MWUs. Components shown: Tokenizer, Analyzer + “Tagger”, Generator, “Grammar” (inflection graph NC67), Filtering, Generator. Data shown on the arrows: (mémoire vive), ([mémoire][ ][vive]), (3, [mémoire vive]), (vive,vif,A23:fs), (vif,A23,fp), (vives). Output: mémoire vive,.N:fs and mémoires vives,.N:fp]
25
Interface with the Unitex system
26
Common features
- MULTIFLEX uses the same character encoding standards as
UNITEX (Unicode 3.0)
- MULTIFLEX uses the UNITEX graph editor for the
representation of MWUs’ inflection graphs
- MULTIFLEX admits the DELA model of morphological
description (an inflection paradigm is a set of actions to be performed
on the lemma in order to generate its inflected forms, and of
corresponding inflection features to be attached to the generated forms)
- MULTIFLEX extends the UNITEX dictionary
treatment to DELAC → DELACF inflection.
- MULTIFLEX uses some of the free UNITEX libraries (for
the treatment of: Alphabet, AutomateFst2, unicode)
27
Equivalence between DELA features and MULTIFLEX features
Each DELA morphological feature (left) is paired with the corresponding morphological category and value as listed in the “Morphology” file (right, cf slide 13).
Polish
s : Nb=sing
p : Nb=pl
o : Gen=pers_masc
z : Gen=anim_masc
r : Gen=inanim_masc
f : Gen=fem
n : Gen=neut
M : Case=Nom
D : Case=Gen
…
French
s : Nb=s
p : Nb=p
m : Gen=m
f : Gen=f
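The equivalence table amounts to a two-way lookup between one-letter DELA features and category=value pairs. The sketch below uses the French table from the slide; the function names and the output order (gender before number, as in ":fs") are assumptions.

```python
# French equivalences as listed on the slide: DELA letter -> (category, value).
FRENCH_EQUIVALENCES = {
    "s": ("Nb", "s"),
    "p": ("Nb", "p"),
    "m": ("Gen", "m"),
    "f": ("Gen", "f"),
}

def dela_to_multiflex(dela_features, table):
    """Turn a DELA feature string such as 'fs' into {'Gen': 'f', 'Nb': 's'}."""
    return dict(table[letter] for letter in dela_features)

def multiflex_to_dela(features, table, order=("Gen", "Nb")):
    """Turn category=value pairs back into a DELA feature string.

    The order of letters is an assumption modelled on codes like ':fs'.
    """
    reverse = {pair: letter for letter, pair in table.items()}
    return "".join(
        reverse[(category, features[category])]
        for category in order
        if category in features
    )

print(dela_to_multiflex("fs", FRENCH_EQUIVALENCES))
print(multiflex_to_dela({"Nb": "p", "Gen": "m"}, FRENCH_EQUIVALENCES))
```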
28
DELAC entry
mémoire(mémoire.N21:fs) vive(vif.A48:fs),NC_NN
- mémoire(mémoire.N21:fs): a single constituent with its lemma, inflection code and inflection features
- the space between the two words: an uninflected constituent
- vive(vif.A48:fs): a single constituent with its lemma, inflection code and inflection features
- NC_NN: the MWU’s inflection code (same as bateau-mouche)
Resulting DELACF entries
mémoire vive,mémoire vive.NC_NN:fs
mémoires vives,mémoire vive.NC_NN:fp
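A DELAC entry of this shape can be split into its constituents with a regular expression. This is a hypothetical reader written for illustration, not the MULTIFLEX parser; it assumes that the MWU's inflection code follows the last comma and that every character outside a word counts as its own uninflected unit.

```python
import re

# Annotated constituent: form(lemma.code:features); otherwise a bare word
# or a single separator character (space, hyphen, ...).
TOKEN = re.compile(
    r"(?P<form>\w+)\((?P<lemma>\w+)\.(?P<code>\w+):(?P<feats>\w+)\)"
    r"|(?P<word>\w+)"
    r"|(?P<sep>[^\w])"
)

def parse_delac(entry):
    """Return (units, mwu_code) for one DELAC line."""
    body, _, mwu_code = entry.rpartition(",")
    units = []
    for match in TOKEN.finditer(body):
        if match.group("form"):
            units.append({
                "form": match.group("form"),
                "lemma": match.group("lemma"),
                "code": match.group("code"),
                "features": match.group("feats"),
            })
        else:
            # Uninflected unit: a plain word or a separator.
            units.append({"form": match.group("word") or match.group("sep")})
    return units, mwu_code.strip()

units, code = parse_delac("mémoire(mémoire.N21:fs) vive(vif.A48:fs),NC_NN")
print(code, units)
```

With this reading, café(café.N1:ms) au lait yields five units, of which only the first carries a lemma and an inflection code.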
29
Tagged DELAC
avant-garde(garde.N21:fs),NC_XXN
bateau(bateau.N3:ms)-mouche(mouche.N21:fs),NC_NN
bon(bon.N41:ms)homme(homme.N1:ms),NC_AN
café(café.N1:ms) au lait,NC_NXXXX
carte(carte.N21:fs) postale(postal.A8:fs),NC_NN
cousin(cousin.N8:ms) germain(germain.A8:ms),NC_NNmf
franc(franc.A47:ms) maçon(maçon.N41:ms),NC_AXN1
mémoire(mémoire.N21:fs) vive(vif.A48:fs),NC_NN
microscope(microscope.N1:ms) à effet tunnel,NC_NXXXXXX
porte-serviette(serviette.N21:fs),NC_VNm
[Figure: inflection graph NC_XXN]
30
Examples of MWUs’ inflection graphs
[Figures: inflection graphs NC_NXXXX, NC_NN and NC_NNmf]
31
Examples of MWUs’ inflection graphs
[Figures: inflection graphs NC_NXXXXXX, NC_AXN1 and NC_VNm]
32
Resulting DELACF
avant-garde,avant-garde.NC_XXN:fs
avant-gardes,avant-garde.NC_XXN:fp
bateau-mouche,bateau-mouche.NC_NN:ms
bateaux-mouches,bateau-mouche.NC_NN:mp
bonhomme,bonhomme.NC_AN:ms
bonshommes,bonhomme.NC_AN:mp
café au lait,café au lait.NC_NXXXX:ms
cafés au lait,café au lait.NC_NXXXX:mp
carte postale,carte postale.NC_NN:fs
cartes postales,carte postale.NC_NN:fp
cousin germain,cousin germain.NC_NNmf:ms
cousins germains,cousin germain.NC_NNmf:mp
cousine germaine,cousin germain.NC_NNmf:fs
cousines germaines,cousin germain.NC_NNmf:fp
franc-maçon,franc maçon.NC_AXN1:ms
franc-maçonne,franc maçon.NC_AXN1:fs
franc maçon,franc maçon.NC_AXN1:ms
franc maçonne,franc maçon.NC_AXN1:fs
francs-maçons,franc maçon.NC_AXN1:mp
francs-maçonnes,franc maçon.NC_AXN1:fp
francs maçons,franc maçon.NC_AXN1:mp
francs maçonnes,franc maçon.NC_AXN1:fp
mémoire vive,mémoire vive.NC_NN:fs
mémoires vives,mémoire vive.NC_NN:fp
microscope à effet tunnel,microscope à effet tunnel.NC_NXXXXXX:ms
microscopes à effet tunnel,microscope à effet tunnel.NC_NXXXXXX:mp
porte-serviette,porte-serviette.NC_VNm:ms
porte-serviettes,porte-serviette.NC_VNm:ms
porte-serviettes,porte-serviette.NC_VNm:mp
33
Some remaining problems
- Application to proper names – upper/lower case
spelling should be describable on the simple word level.
Ujedinxene(ujedinxen.A1:aefp1g) nacije(nacija.N600:fp1q),NC_AXN3
At present, in order to inflect this entry with an uppercase letter, its lemma has to be in uppercase.
- The POS of a word shouldn’t have to be deduced from
its inflection code, contrary to the Unitex convention
Nations Unies,Nations Unies.NC_XXXInvpl
Unitex does not allow this entry to be analyzed as a noun (but as “NC”).
faciles comme bonjour,facile comme bonjour.ACm12
Unitex does not allow this entry to be analyzed as an adjective (but as “ACm”).
34
Perspectives
- Full integration into Unitex graphical interface
- Large-scale application to a DELAC construction –
Serbian DELAC in progress (with C. Krstev, D. Vitas): 1000 compound lemmas described and inflected
- Automating the DELAC creation – cf WS2LR
(by Ranka Stanković, Belgrade)
- New operators in view of the graph size and
number reduction (e.g. to express that the plural of all of the
following words is obtained by putting only the last constituent
into plural: main memory, random access memory, programmable random
access memory, erasable programmable random access memory, etc.)