MULTIFLEX - a Formalism and a Tool for the Computational Morphology of Multi-Word Units



SLIDE 1

MULTIFLEX - a Formalism and a Tool for the Computational Morphology of Multi-Word Units

Agata SAVARY

Université François Rabelais Laboratoire d’Informatique IUT de Blois IPI PAN Seminar Warszawa, Jan 5, 2007

SLIDE 2

Multi-Word Units (MWUs)

  • MWUs = hard-to-define and controversial linguistic objects called, in various contexts, compounds, frozen expressions, complex terms, multi-word named entities, burkinostka (pl.), etc.

  • Numerous linguistic and pragmatic definitions (Benveniste 1974, Downing 1977, Levi 1978, Bauer 1983, Gross 1990, Anscombre 1990, Silberztein 1993, Cadiot 1992, Gross 1996, Derwojedowa & Rudolf 2003) and applications (Sparck-Jones & Tait 1984, Smadja 1993, Silberztein 1993, Daille 1994, Enguehard & Pantera 1994, Jacquemin 2001, Paumier 2003)

  • Major features invoked in the bibliography (based on controversial elementary notions and measures, here in italic):

– Composed of two or more words
– Showing some degree of non-compositionality
– Having unique and constant references

SLIDE 3

MWUs: pragmatic definition

  • MWU = a contiguous sequence of graphical units which, for some application-dependent reasons, has to be listed, described (morphologically, syntactically, semantically, etc.) and analyzed as a unit

  • Examples:

– (en.) battle of nerves, Papua New Guinea, -calculus
– (fr.) porte-avions, Windows 3.11, liaison multiple par satellite
– (pl.) pranie mózgu, Rondo De Gaulle’a, trzy po trzy

SLIDE 4

Inflectional Morphology of MWUs

  • What is a MWU’s morphological class (noun, adjective, etc.) and what inflection categories (number, gender, etc.), with fixed or variable values, are relevant to it? E.g. pranie mózgu is a noun, it has neuter gender and it inflects for number and case.

  • What are the exceptions to the inflection categories determined above? E.g. wybory powszechne is a noun but does not have a singular form.

  • What are the inflectional characteristics (base form, morphological class, inflection paradigm) of each single constituent of the MWU? E.g. porte is an uninflected verb in porte-avions and an inflected noun in porte-fenêtre.

  • How do inflected forms of single constituents combine in the inflection process of the whole compound? E.g.

– battle cry → battle cries
– battle royal → battle royals, or battles royal (*battles royals)
– battle of nerves → battles of nerves

SLIDE 5

State of the Art (1/2)

  • Morphological analysis: stemming or lemmatizing of the constituent words

  • Problems:

– cross-roads → *cross-road (should be: cross-roads)
– court martials → court ? (martial is not an individual English noun)
– des deux-chevaux → ? (non-standard French nominal construction)

  • Morphological generation: grammar-based or bag-of-words approaches

  • Problems:

– notary public → notary publics (should also be: notaries public and notaries publics)
– battle cry → battle cries, *battles cry, *battles cries

SLIDE 6

State of the Art (2/2)

  • Xerox – lexical transducers allowing compounding and unification. Elegant and mathematically well-defined model. A comparative study with Multiflex is in progress.

  • Greek DELA – all possible combinations of all inflected forms of the single constituents + restriction filters. Drawbacks: a graphical unit has a fixed definition, some restrictions on forms cannot be described, the rules are heterogeneous and non-generic (hard to adapt to a different language), separators are impossible to describe.

  • Intex – a MWU’s inflection formalism extends the simple-word morphology with new operators like “go to the end of the first word, add an s”, etc. Drawbacks: redundant description of single words’ morphology.

SLIDE 7

Example of a Formalism for the Morphology of Single-Word

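[Figure: single-word inflection graph A72, with suffix boxes 2lle, x and s and output features :ms, :fs, :mp, :fp]

Entries using this paradigm: nouveau,A72 beau,A72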
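
The labels 2lle, x and s of paradigm A72 can be read as suffix operations on the lemma. The following is a minimal Python sketch, assuming the convention that a leading digit deletes that many final characters before the remaining letters are appended. The function names and the table `A72` (in particular the reading of the feminine plural as 2lle followed by s) are illustrative reconstructions from the slide, not the tool's actual code.

```python
import re


def apply_suffix_op(lemma: str, op: str) -> str:
    """Apply an operator such as '2lle': drop 2 final letters, append 'lle'."""
    match = re.fullmatch(r"(\d*)(.*)", op)
    n_delete = int(match.group(1) or 0)
    stem = lemma[:len(lemma) - n_delete] if n_delete else lemma
    return stem + match.group(2)


# Assumed reading of paradigm A72: one operator per inflection feature set.
A72 = {"ms": "", "fs": "2lle", "mp": "x", "fp": "2lles"}


def inflect(lemma: str, paradigm: dict) -> dict:
    """Return a mapping from feature string to inflected form."""
    return {feat: apply_suffix_op(lemma, op) for feat, op in paradigm.items()}


print(inflect("nouveau", A72))
print(inflect("beau", A72))
```

Both entries listed for A72 share the same operators, which is why a single paradigm code suffices for nouveau and beau.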

SLIDE 8

My aim

CN23: ?

Entries: battle of nerves,CN23 man-of-war,CN23

Propose a (“universal”) formalism that allows all inflected forms of a MWU to be described explicitly and precisely. An inflectional paradigm for a MWU should be independent of the inflectional paradigms of its single constituents.

SLIDE 9

MULTIFLEX

A Formalism and a Tool

SLIDE 10

Morphological description

On the language level
SLIDE 11

The alphabet

French: Aa Aà Àà Aâ Ââ Aä Ää Bb Cc Cç …
Polish: Aa Ąą Bb Cc Ćć Dd Ee Ęę Éé Ëë …
Serbian (encoded in ASCII only): Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj …

List of alphabet characters with upper/lowercase equivalences and sorting keys.

SLIDE 12

Inflectional classes, categories and values

French
<CATEGORIES>
Nb: s, p
Gen: m, f
<CLASSES>
noun: (Nb,<var>),(Gen,<fixed>)
adj: (Nb,<var>),(Gen,<var>)
adv: …

Category name: Nb
Possible values for this category: s, p
Class name: noun
Possible categories for this class (variable or fixed):

  • Nb (a noun inflects in number)
  • Gen (a noun has a gender but does not inflect in gender (?))
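
To make the structure of this description concrete, here is a minimal Python sketch that reads the two sections shown above into dictionaries. The parser, its function name and the returned data layout are assumptions for illustration; only the `<CATEGORIES>` / `<CLASSES>` layout itself comes from the slide.

```python
import re


def parse_morphology(text: str):
    """Parse a <CATEGORIES>/<CLASSES> description into two dictionaries."""
    categories, classes, section = {}, {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line in ("<CATEGORIES>", "<CLASSES>"):
            section = line
            continue
        name, _, rest = line.partition(":")
        name = name.strip()
        if section == "<CATEGORIES>":
            # e.g. "Nb: s, p" -> {"Nb": ["s", "p"]}
            categories[name] = [value.strip() for value in rest.split(",")]
        elif section == "<CLASSES>":
            # e.g. "(Nb,<var>),(Gen,<fixed>)" -> {"Nb": "var", "Gen": "fixed"}
            pairs = re.findall(r"\(\s*(\w+)\s*,\s*<(var|fixed)>\s*\)", rest)
            classes[name] = dict(pairs)
    return categories, classes


FRENCH = """
<CATEGORIES>
Nb: s, p
Gen: m, f
<CLASSES>
noun: (Nb,<var>),(Gen,<fixed>)
adj: (Nb,<var>),(Gen,<var>)
"""

categories, classes = parse_morphology(FRENCH)
print(categories)
print(classes)
```

Keeping the variable/fixed distinction per class is what later lets a noun paradigm vary in number while its gender stays constant.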

SLIDE 13

Inflectional classes, categories and values

Polish
<CATEGORIES>
Nb:sing,pl
Gen:pers_masc,anim_masc,inanim_masc,fem,neut
Case:Nom,Gen,Dat,Acc,Inst,Loc,Voc
<CLASSES>
noun:(Nb,<var>),(Case,<var>),(Gen,<fixed>)
adj:(Nb,<var>),(Case,<var>),(Gen,<var>)

Multi-character names are admitted for classes and categories

SLIDE 14

Morphological description

On the level of a multi-word unit: inflection graphs

SLIDE 15

Inflection Graphs for MWUs: battle royal

battle royal

$1 $2 $3

[Figure: inflection graph for battle royal, with the following callouts]
– Entry box
– The 1st constituent remains intact
– The whole MWU is in singular
– The 3rd constituent gets inflected into plural
– Stop box
– Each path describes one or more inflected forms of a MWU

battle royal → (battle royal, [Nb=s]), (battle royals, [Nb=p]), (battles royal, [Nb=p])
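
The idea that each path through the graph yields inflected forms can be sketched as follows. This is a minimal Python illustration under stated assumptions: the table `FORMS` stands in for a single-word morphology module, and the `PATHS` list is a hand-written reading of the three forms listed on the slide, not the actual graph file.

```python
# Hypothetical mini-lexicon of single-word forms: (lemma, Nb) -> form.
FORMS = {
    ("battle", "s"): "battle", ("battle", "p"): "battles",
    ("royal", "s"): "royal", ("royal", "p"): "royals",
}

# One entry per path: number required of $1 and $3, and number of the whole MWU.
PATHS = [
    {"$1": "s", "$3": "s", "mwu": "s"},
    {"$1": "s", "$3": "p", "mwu": "p"},
    {"$1": "p", "$3": "s", "mwu": "p"},
]


def generate(constituents, paths):
    """Follow every path and return (inflected MWU, features) pairs."""
    first, separator, third = constituents  # $1, $2, $3
    results = []
    for path in paths:
        form = FORMS[(first, path["$1"])] + separator + FORMS[(third, path["$3"])]
        results.append((form, f"Nb={path['mwu']}"))
    return results


for form, features in generate(("battle", " ", "royal"), PATHS):
    print(form, features)
```

Because no path combines a plural $1 with a plural $3, the ungrammatical *battles royals is never produced.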

SLIDE 16

Inflection Graphs for MWUs: bateau-mouche

bateau-mouche → (bateau-mouche, [Gen=m,Nb=s]), (bateaux-mouches, [Gen=m,Nb=p])

bateau - mouche

$1 $2 $3

– Gender category has a fixed value in this class (noun). In this paradigm it is masculine.
– This class (noun) inflects for number.
– Inflection features for a single constituent may be a partial set. Here the gender is not specified: it is the same as this constituent has in the base form of the MWU. This allows the same graph to apply to e.g. homme politique.

SLIDE 17

Unification variables

bateau-mouche → (bateau-mouche, [Gen=m,Nb=s]), (bateaux-mouches, [Gen=m,Nb=p])

bateau - mouche

$1 $2 $3

– The unification variable $n may be instantiated to any value of its category’s domain (here: Nb:s,p).
– The instantiation of a variable is identical for all its appearances in one path.
– The generated forms are identical to those of the previous graph.

homme politique

$1 $2 $3

homme politique → (homme politique, [Gen=m,Nb=s]), (hommes politiques, [Gen=m,Nb=p])
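
The behaviour of a unification variable can be sketched in Python as below: a single path mentions $n in several places, and each value of the domain produces one consistent instantiation. The data layout (`PATH`, `DOMAINS`, `FORMS`) and the function `expand` are assumptions made for this illustration only.

```python
from itertools import product

DOMAINS = {"Nb": ["s", "p"]}

# Hypothetical single-word forms: (lemma, Nb) -> form.
FORMS = {
    ("bateau", "s"): "bateau", ("bateau", "p"): "bateaux",
    ("mouche", "s"): "mouche", ("mouche", "p"): "mouches",
    ("homme", "s"): "homme", ("homme", "p"): "hommes",
    ("politique", "s"): "politique", ("politique", "p"): "politiques",
}

# One path: $1 and $3 both carry Nb=$n; the whole MWU gets Gen=m and Nb=$n.
PATH = {
    "constituents": {0: {"Nb": "$n"}, 2: {"Nb": "$n"}},
    "mwu": {"Gen": "m", "Nb": "$n"},
}


def expand(units, path, variables=(("$n", "Nb"),)):
    """Instantiate every variable over its domain, identically along the path."""
    names = [name for name, _ in variables]
    results = []
    for values in product(*(DOMAINS[cat] for _, cat in variables)):
        binding = dict(zip(names, values))

        def resolve(features):
            # Replace variables by their bound value, keep constants as they are.
            return {cat: binding.get(val, val) for cat, val in features.items()}

        words = list(units)
        for index, features in path["constituents"].items():
            words[index] = FORMS[(units[index], resolve(features)["Nb"])]
        results.append(("".join(words), resolve(path["mwu"])))
    return results


print(expand(["bateau", "-", "mouche"], PATH))
print(expand(["homme", " ", "politique"], PATH))
```

The same path serves both entries, since nothing in it depends on the particular constituents.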

SLIDE 18

Unification variables and value inheritance

The previous graph is limited to masculine MWUs only. However, bateau-mouche and homme politique inflect in basically the same way as moissonneuse-batteuse, liaison numérique, etc.: they inflect in number only, and in order to get the plural we need to put the 1st and the 3rd constituent into the plural.

– The double equal sign (==) means that variable $g has a fixed value in the whole path. It is to be unified with the gender value that the first constituent has in the base form of the MWU (e.g. masculine for bateau, and feminine for liaison). The whole MWU inherits this value.
– The generated forms are identical to those of the previous graph.

bateau-mouche → (bateau-mouche, [Gen=m,Nb=s]), (bateaux-mouches, [Gen=m,Nb=p])
liaison numérique → (liaison numérique, [Gen=f,Nb=s]), (liaisons numériques, [Gen=f,Nb=p])
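
Value inheritance can be illustrated with a short Python sketch: the gender is read once from the first constituent's base form and attached to every generated form, while the number varies. The `LEXICON` contents and the function name are hypothetical; only the behaviour (gender fixed and inherited, number shared by $1 and $3) is taken from the slide.

```python
# Hypothetical lexicon: gender of the base form plus number-indexed forms.
LEXICON = {
    "bateau": {"Gen": "m", "forms": {"s": "bateau", "p": "bateaux"}},
    "mouche": {"Gen": "f", "forms": {"s": "mouche", "p": "mouches"}},
    "liaison": {"Gen": "f", "forms": {"s": "liaison", "p": "liaisons"}},
    "numérique": {"Gen": "f", "forms": {"s": "numérique", "p": "numériques"}},
}


def inflect_mwu(first, separator, third):
    """Inflect $1 and $3 in number; the MWU inherits the gender of $1."""
    gender = LEXICON[first]["Gen"]  # Gen==$g: fixed for the whole path
    results = []
    for number in ("s", "p"):       # Nb=$n: same value for $1 and $3
        form = (LEXICON[first]["forms"][number]
                + separator
                + LEXICON[third]["forms"][number])
        results.append((form, {"Gen": gender, "Nb": number}))
    return results


print(inflect_mwu("bateau", "-", "mouche"))
print(inflect_mwu("liaison", " ", "numérique"))
```

Note that mouche keeps its own gender inside the compound; only the gender of the first constituent is passed up to the MWU.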

SLIDE 19

Graph size reduction via unification variables: pranie mózgu

pranie mózgu →
(pranie mózgu, [Gen=neut,Nb=sing,Case=Nom]), (pranie mózgów, [Gen=neut,Nb=sing,Case=Nom]),
(prania mózgu, [Gen=neut,Nb=sing,Case=Gen]), (prania mózgów, [Gen=neut,Nb=sing,Case=Gen]),
(praniu mózgu, [Gen=neut,Nb=sing,Case=Dat]), (praniu mózgów, [Gen=neut,Nb=sing,Case=Dat]),
(pranie mózgu, [Gen=neut,Nb=sing,Case=Acc]), (pranie mózgów, [Gen=neut,Nb=sing,Case=Acc]),
(praniem mózgu, [Gen=neut,Nb=sing,Case=Inst]), (praniem mózgów, [Gen=neut,Nb=sing,Case=Inst]),
(praniu mózgu, [Gen=neut,Nb=sing,Case=Loc]), (praniu mózgów, [Gen=neut,Nb=sing,Case=Loc]),
(pranie mózgu, [Gen=neut,Nb=sing,Case=Voc]), (pranie mózgów, [Gen=neut,Nb=sing,Case=Voc]),
(prania mózgu, [Gen=neut,Nb=pl,Case=Nom]), (prania mózgów, [Gen=neut,Nb=pl,Case=Nom]),
(prań mózgu, [Gen=neut,Nb=pl,Case=Gen]), (prań mózgów, [Gen=neut,Nb=pl,Case=Gen]),
(praniom mózgu, [Gen=neut,Nb=pl,Case=Dat]), (praniom mózgów, [Gen=neut,Nb=pl,Case=Dat]),
(prania mózgu, [Gen=neut,Nb=pl,Case=Acc]), (prania mózgów, [Gen=neut,Nb=pl,Case=Acc]),
(praniami mózgu, [Gen=neut,Nb=pl,Case=Inst]), (praniami mózgów, [Gen=neut,Nb=pl,Case=Inst]),
(praniach mózgu, [Gen=neut,Nb=pl,Case=Loc]), (praniach mózgów, [Gen=neut,Nb=pl,Case=Loc]),
(prania mózgu, [Gen=neut,Nb=pl,Case=Voc]), (prania mózgów, [Gen=neut,Nb=pl,Case=Voc])

pranie mózgu (brainwashing)

$1 $2 $3

With no use of unification variables the inflection graph would have to contain 28 different paths.

SLIDE 20

Graph size reduction via unification variables: pranie mózgu

pranie mózgu

$1 $2 $3

– The 1st and the 3rd constituent ($1 and $3) inflect in number independently of each other
– The whole MWU inherits its gender, number and case from the 1st constituent
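
The figure of 28 paths follows from simple counting, which the short Python sketch below reproduces: without variables, one explicit path is needed for every combination of number and case of pranie and number of mózgu. The list names are illustrative; the category values are those of the Polish description on slide 13.

```python
from itertools import product

NUMBERS = ["sing", "pl"]
CASES = ["Nom", "Gen", "Dat", "Acc", "Inst", "Loc", "Voc"]

# Without unification variables, each combination needs its own path:
# (number of pranie, case of pranie, number of mózgu).
explicit_paths = list(product(NUMBERS, CASES, NUMBERS))
print(len(explicit_paths))  # 28
```

With unification variables, the same set of forms is covered by letting each of these three values be bound along a single path.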

SLIDE 21

Graphical variant description: student union

student union

$1 $2 $3

Insertions and/or deletions of different elements are possible, which may lead to a more complete variant description.

student union → (student union, [Nb=s]), (students union, [Nb=s]), (students’ union, [Nb=s]), (student unions, [Nb=p]), (students unions, [Nb=p]), (students’ unions, [Nb=p])

SLIDE 22

Syntactic variant description: birth date

birth date → (birth date, [Nb=s]), (date of birth, [Nb=s]), (birth dates, [Nb=p]), (dates of birth, [Nb=p])

birth date

$1 $2 $3

Insertions and/or inversions of different elements are possible.

SLIDE 23

Squeezed compound description: bonhomme

If only we manage to distinguish the graphical units properly, they can be inflected as simple words.

bonhomme → (bonhomme, [Gen=m,Nb=s]), (bonshommes, [Gen=m,Nb=p])

bonhomme

$1 $2

SLIDE 24

Perspectives: Modular Approach to Morphology of Simple & Compound Units

[Diagram: modular architecture linking a Morphological Module for Simple Words and a Morphological Module for MWUs. Component labels: Tokenizer, Analyzer + “Tagger”, “Grammar” Filtering, Generator (twice). Data labels: (mémoire vive); ([mémoire][ ][vive]); (3, [mémoire vive]); (vive,vif,A23:fs); (vif,A23,fp); (vives); NC67. Generated entries: mémoire vive,.N:fs and mémoires vives,.N:fp]

SLIDE 25

Interface with the Unitex system

SLIDE 26

Common features

  • MULTIFLEX uses the same character encoding standard as UNITEX (Unicode 3.0)

  • MULTIFLEX uses the UNITEX graph editor for the representation of MWUs’ inflection graphs

  • MULTIFLEX admits the DELA model of morphological description (an inflection paradigm is a set of actions to be performed on the lemma in order to generate its inflected forms, and of corresponding inflection features to be attached to the generated forms)

  • MULTIFLEX allows the UNITEX dictionary treatment to be extended to DELAC → DELACF inflection.

  • MULTIFLEX uses some of the UNITEX free libraries (for the treatment of: Alphabet, AutomateFst2, unicode)

SLIDE 27

Equivalence between DELA features and MULTIFLEX features

DELA morphological feature : the corresponding morphological category and value as listed in the “Morphology” file (cf. slide 13)

Polish
s : Nb=sing
p : Nb=pl
o : Gen=pers_masc
z : Gen=anim_masc
r : Gen=inanim_masc
f : Gen=fem
n : Gen=neut
M : Case=Nom
D : Case=Gen
…

French
s : Nb=s
p : Nb=p
m : Gen=m
f : Gen=f
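
Such an equivalence table is in effect a two-way lookup between one-letter DELA codes and category=value pairs. The following Python sketch shows one way to apply it, assuming a plain dictionary as the table; the function names are hypothetical, and the Polish table is a subset limited to codes that are legible on the slide.

```python
FRENCH_EQUIV = {
    "s": ("Nb", "s"), "p": ("Nb", "p"),
    "m": ("Gen", "m"), "f": ("Gen", "f"),
}

# Subset of the Polish table; the remaining codes are elided on the slide.
POLISH_EQUIV = {
    "s": ("Nb", "sing"), "p": ("Nb", "pl"),
    "f": ("Gen", "fem"), "n": ("Gen", "neut"),
    "M": ("Case", "Nom"), "D": ("Case", "Gen"),
}


def dela_to_multiflex(codes: str, table: dict) -> dict:
    """'fs' -> {'Gen': 'f', 'Nb': 's'} (one category=value pair per letter)."""
    return dict(table[code] for code in codes)


def multiflex_to_dela(features: dict, table: dict) -> str:
    """Inverse lookup; letters come out in the order of the feature dict."""
    reverse = {pair: code for code, pair in table.items()}
    return "".join(reverse[(cat, val)] for cat, val in features.items())


print(dela_to_multiflex("fs", FRENCH_EQUIV))
print(multiflex_to_dela({"Gen": "neut", "Nb": "pl", "Case": "Gen"}, POLISH_EQUIV))
```

The same letter may map to different values per language (s is Nb=s in French but Nb=sing in Polish), so the table has to be selected per language.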

SLIDE 28

DELAC entry

mémoire(mémoire.N21:fs) vive(vif.A48:fs),NC_NN

– mémoire(mémoire.N21:fs) and vive(vif.A48:fs): a single constituent with its lemma, inflection code and inflection features
– the space between them: an uninflected constituent
– NC_NN: the MWU’s inflection code (same as for bateau-mouche)

Resulting DELACF entries

mémoire vive,mémoire vive.NC_NN:fs
mémoires vives,mémoire vive.NC_NN:fp
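
The structure of a tagged DELAC entry can be made explicit with a small parser. The Python sketch below is an illustration under assumptions: the regular expression, the function name `parse_delac` and the dictionary keys are not part of MULTIFLEX, and only the entry syntax shown on the slide is taken as given.

```python
import re

# A tagged constituent form(lemma.code:features), a plain word, or a separator.
TOKEN = re.compile(r"(\w+)\((\w+)\.(\w+):(\w+)\)|(\w+)|(\W)")


def parse_delac(entry: str):
    """Split a DELAC entry into its graphical units and the MWU inflection code."""
    body, _, mwu_code = entry.rpartition(",")
    units = []
    for match in TOKEN.finditer(body):
        form, lemma, code, features, plain_word, separator = match.groups()
        if form is not None:
            units.append({"form": form, "lemma": lemma,
                          "code": code, "features": features})
        else:
            # Uninflected constituent: an untagged word or a separator.
            units.append({"form": plain_word or separator})
    return units, mwu_code.strip()


units, mwu_code = parse_delac("mémoire(mémoire.N21:fs) vive(vif.A48:fs),NC_NN")
print(mwu_code)
for position, unit in enumerate(units, start=1):
    print(f"${position}", unit)
```

The positions printed here correspond to the $1, $2, $3 references used in the inflection graphs, with the separator counted as a constituent of its own.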

SLIDE 29

Tagged DELAC

avant-garde(garde.N21:fs),NC_XXN
bateau(bateau.N3:ms)-mouche(mouche.N21:fs),NC_NN
bon(bon.N41:ms)homme(homme.N1:ms),NC_AN
café(café.N1:ms) au lait,NC_NXXXX
carte(carte.N21:fs) postale(postal.A8:fs),NC_NN
cousin(cousin.N8:ms) germain(germain.A8:ms),NC_NNmf
franc(franc.A47:ms) maçon(maçon.N41:ms),NC_AXN1
mémoire(mémoire.N21:fs) vive(vif.A48:fs),NC_NN
microscope(microscope.N1:ms) à effet tunnel,NC_NXXXXXX
porte-serviette(serviette.N21:fs),NC_VNm

[Figure: inflection graph NC_XXN]

SLIDE 30

Examples of MWUs’ inflection graphs

[Figures: inflection graphs NC_NXXXX, NC_NN, NC_NNmf]

SLIDE 31

Examples of MWUs’ inflection graphs

[Figures: inflection graphs NC_NXXXXXX, NC_AXN1, NC_VNm]

SLIDE 32

Resulting DELACF

avant-garde,avant-garde.NC_XXN:fs
avant-gardes,avant-garde.NC_XXN:fp
bateau-mouche,bateau-mouche.NC_NN:ms
bateaux-mouches,bateau-mouche.NC_NN:mp
bonhomme,bonhomme.NC_AN:ms
bonhommes,bonhomme.NC_AN:mp
café au lait,café au lait.NC_NXXXX:ms
cafés au lait,café au lait.NC_NXXXX:mp
carte postale,carte postale.NC_NN:fs
cartes postales,carte postale.NC_NN:fp
cousin germain,cousin germain.NC_NNmf:ms
cousins germains,cousin germain.NC_NNmf:mp
cousine germaine,cousin germain.NC_NNmf:fs
cousines germaines,cousin germain.NC_NNmf:fp
franc-maçon,franc maçon.NC_AXN1:ms
franc-maçonne,franc maçon.NC_AXN1:fs
franc maçon,franc maçon.NC_AXN1:ms
franc maçonne,franc maçon.NC_AXN1:fs
francs-maçons,franc maçon.NC_AXN1:ms
francs-maçonnes,franc maçon.NC_AXN1:fs
francs maçons,franc maçon.NC_AXN1:ms
francs maçonnes,franc maçon.NC_AXN1:fs
mémoire vive,mémoire vive.NC_NN:fs
mémoires vives,mémoire vive.NC_NN:fp
microscope à effet tunnel,microscope à effet tunnel.NC_NXXXXXX:ms
microscopes à effet tunnel,microscope à effet tunnel.NC_NXXXXXX:mp
porte-serviette,porte-serviette.NC_VNm:ms
porte-serviettes,porte-serviette.NC_VNm:ms
porte-serviettes,porte-serviette.NC_VNm:mp

SLIDE 33

Some remaining problems

  • Application to proper names – upper/lower case spelling should be describable on the simple word level.

Ujedinxene(ujedinxen.A1:aefp1g) nacije(nacija.N600:fp1q),NC_AXN3
At present, in order to inflect this entry with an uppercase letter, its lemma has to be in uppercase.

  • The POS of a word shouldn’t have to be deduced from its inflection code, contrary to the Unitex convention

Nations Unies,Nations Unies.NC_XXXInvpl
Unitex does not allow this entry to be analyzed as a noun (but as “NC”).

faciles comme bonjour,facile comme bonjour.ACm12
Unitex does not allow this entry to be analyzed as an adjective (but as “ACm”).

SLIDE 34

Perspectives

  • Full integration into the Unitex graphical interface
  • Large-scale application to a DELAC construction – Serbian DELAC in progress (with C. Krstev, D. Vitas): 1000 compound lemmas described and inflected
  • Automating the DELAC creation – cf. WS2LR (by Ranka Stanković, Belgrade)
  • New operators in view of reducing the size and number of graphs (e.g. to express that the plural of all of the following words is obtained by putting only the last constituent into the plural: main memory, random access memory, programmable random access memory, erasable programmable random access memory, etc.)