

SLIDE 1

Representation and Processing of Composition, Variation and Approximation in Language Resources and Tools

Towards an accreditation to supervise research
Vers une habilitation à diriger des recherches (HDR)

Agata Savary

Laboratoire d’informatique Université François Rabelais Tours, Blois

March 27, 2014


SLIDE 4

Composition&Variation MWEs NEs FSMs Conclusions CV

Compositionality – controversial notion

Key notion in linguistics, philosophy, logic and computer science.

The possibility for us to understand sentences which we have never heard before is evidently based on the fact that we construct the sense of a sentence from parts which correspond to the words. (Frege, XIX c.)

A compound expression is compositional if its meaning is a function of the meanings of its parts and of the syntactic rule by which they are combined. (Partee et al., 1990)

Example: horse races vs. race horses

Compositionality is a property of a grammar. (Kracht, 2007)

Benefits for modeling and computation: preventing a combinatorial explosion of lexicalized cases.
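The definition attributed to Partee et al. can be written as a homomorphism condition. The notation is introduced here for illustration only and does not come from the slides: \mu is the meaning function, \sigma a syntactic rule combining the parts e_1, ..., e_n, and f_\sigma the semantic operation paired with that rule.

```latex
\mu\bigl(\sigma(e_1, \dots, e_n)\bigr) = f_\sigma\bigl(\mu(e_1), \dots, \mu(e_n)\bigr)
```

Under this reading, horse races and race horses use the same two parts in a different order, so the arguments of f_\sigma are swapped and the two meanings may differ.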


SLIDE 6

Non-compositionality of compounds

  • Semantic non-compositionality: cordon bleu ’expert cook’ is not a blue cord.
  • Morphosyntactic non-compositionality (Savary et al., 2007):
      chief justices vs. lord justices, lords justice, lords justices
      [czerwony pająk_mascAnim]_mascHum ’red spider (ex-communist)’

Lexicalization: an expression E has a meaning, a reference or inflectional properties that are not totally compositional ⇒ E has to be explicitly mentioned and described in a lexicon.

SLIDE 7

“Frozenness” – a measure of non-compositionality

“Frozenness” (G. Gross 1988; Sag et al. 2002; Mel’čuk, 2010): blocking the linguistic transformations typical for a syntactic structure under study.

  • Luc a pris un train de campagne ⇒ Luc a pris un train.
    ’Luc took a suburb train ⇒ Luc took a train’
  • Le gouvernement a pris un train de mesures ⇏ Le gouvernement a pris un train.
    ’The government took a “train of measures” ⇏ The government took a train’

Degree of “frozenness” (G. Gross 1990)

SLIDE 8

Linguistic variation

Types of variants (Jacquemin 2001; Savary & Jacquemin, 2003):

  • graphical variants: behavioral model → behavioural model
  • morphological variants: image converter → image conversion
  • semantic variants: automobile cleaning → car washing
  • syntactic variants: processing of cardiac image → image processing

SLIDE 9

Linguistic variation – a central challenge in NLP

The same concept has different surface realizations in texts.

Example in IR:

  document phrase: the philosophy and implementation of an experimental interface
    ⇓
  terms (for extraction or indexing): interface philosophy, interface implementation, *philosophy implementation.

SLIDE 10

Contents

1 Composition and Variation – an Introduction
2 Multi-Word Expressions
3 Compound Named Entities and Beyond
4 Finite-State Methods for Word and Tree Approximation
5 Conclusions and Perspectives
6 Research Framework and Management


SLIDE 13

Multi-Word Expressions – controversial objects

The prime time speech by first lady Michelle Obama set the house on fire. She made crystal clear which issues she took to heart but she was preaching to the choir.

MWEs – definition criteria:

  • are composed of 2 or more words,
  • show some degree of morphological, distributional or semantic non-compositionality,
  • have unique and constant references.

Pragmatic definition (Savary, 2005): MWE = a sequence of graphical items which, for some application-dependent reasons, has to be listed, described and processed as a unit.

SLIDE 14

Multi-Word Expressions

MWEs – basic facts:

  • prevalence (40% of text items belong to MWEs),
  • idiosyncrasy at different levels (lexicon, grammar, meaning, ...),
  • sparseness (most MWEs appear rarely in corpora),
  • MWEs are under-represented in language resources and tools,
  • MWEs are hard to detect, understand, translate, etc.

SLIDE 15

Idiosyncrasy of MWEs at different NLP levels

  • segmentation: bonshommes ’fellows’; personal computer; put sth. off
  • morphology: grand-mères ’grand_sing.masc-mothers_pl.fem’; wybory powszechne ’general elections’, *wybór powszechny
  • syntax: all of a sudden; he kicked the bucket, *the bucket was kicked by him
  • semantics: to spill the beans = to reveal a secret

SLIDE 16

MWEs in NLP - State of the art

Lexical description of MWEs (SOA: Savary, 2008):

  • DELA e-dictionaries (Courtois et al., 1990; Silberztein, 1993a; Savary, 2000; Kyriacopoulou et al., 2002; Silberztein, 2005)
  • two-level morphology (Beesley & Karttunen, 2003; Karttunen et al., 1992; Karttunen, 1993; Breidt et al., 1996; Oflazer et al., 2004)
  • relational DB (Alegria et al., 2004; Itai & Wintner, 2013)
  • parameterized equivalence classes (Grégoire, 2010)
  • unification grammars and meta-grammars (Sag et al., 2002; Copestake et al., 2002; Villavicencio et al., 2004; Jacquemin, 2001)

SLIDE 17

MWEs in NLP - state of the art ctd.

  • MWE extraction (SOA: Savary & Jacquemin, 2003)
      monolingual (Smadja, 1992; Daille, 1996; Pecina, 2010; Al-Haj & Wintner, 2010; Ramisch et al., 2010; Davis & Barrett, 2013)
      bilingual (Tsvetkov & Wintner, 2010; Morin & Daille, 2010; Delpech et al., 2012)
  • MWE identification (NER systems; Vincze et al., 2013)
  • MWE annotation (Abeillé et al., 2003; Bejček & Straňák, 2010; Laporte et al., 2008a,b; Kaalep & Muischnek, 2008)
  • parsing and MWEs (Abeillé & Schabes, 1989; Sag et al., 2002; Copestake et al., 2002; Villavicencio et al., 2004; Nivre & Nilsson, 2004; Attia, 2006; Finkel & Manning, 2009a; Wehrli et al., 2010; Constant et al., 2013; Green et al., 2013)

SLIDE 18

Multiflex – describing the morphosyntax of contiguous MWEs

(Savary, 2005, 2008, 2009; Savary et al., 2007, 2009; Graliński et al., 2010)

Two-layer approach:

  • single words are analysed and generated by an external module,
  • MWE inflection graphs combine single forms into MWE forms.

Interoperability constraints for the underlying single-word module:

  • same morphological model for the language under study,
  • clear-cut definition of a token,
  • generation of inflected forms of simple words.


SLIDE 21

Multiflex – inflection, agreement and non-compositionality

15 variants          Lemma            Features
czerwony pająk       czerwony pająk   sg:nom:m2
czerwone pająki      czerwony pająk   pl:acc:m2
czerwonych pająków   czerwony pająk   pl:acc:m1
...

Units of the lemma czerwony pająk, numbered $1 $2 $3:

  • lemma: czerwony, class: adj, Nb: sg, Case: nom, Gen: m2, Deg: pos
  • lemma: pająk, class: subst, Case: nom, Gen: m2

Inflection graph (two paths, each followed by the features of the resulting MWE form):

  <$1:Case=$c;Nb=$n> <$2> <$3:Case=$c;Nb=$n>
      output: <Case=$c;Nb=$n;Gen=$3.Gen>
  <$1:Case=gen;Nb=pl> <$2> <$3:Case=gen;Nb=pl>
      output: <Case=acc;Nb=pl;Gen=m1>
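The two-layer idea behind the czerwony pająk graph can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `SINGLE_FORMS` stands in for the external single-word module and lists only a handful of forms, `NOUN_GENDER` and `inflect` are hypothetical names, and the code is not Multiflex's actual implementation or file format. The first loop mimics the path with agreement variables ($c, $n); the second step mimics the path that reuses genitive plural forms for the human-masculine (m1) accusative plural.

```python
# Toy stand-in for the external single-word module (hypothetical data layout).
# Only the forms needed for this illustration are listed.
SINGLE_FORMS = {
    "czerwony": {("nom", "sg"): "czerwony", ("acc", "pl"): "czerwone", ("gen", "pl"): "czerwonych"},
    "pająk": {("nom", "sg"): "pająk", ("acc", "pl"): "pająki", ("gen", "pl"): "pająków"},
}
NOUN_GENDER = "m2"  # gender of unit $3 (pająk), as given on the slide


def inflect(adj, sep, noun):
    """Return {(number, case, gender): MWE form} for the two graph paths."""
    variants = {}
    # Path 1: <$1:Case=$c;Nb=$n> <$2> <$3:Case=$c;Nb=$n>
    #         output <Case=$c;Nb=$n;Gen=$3.Gen> (regular agreement)
    for (case, number), noun_form in SINGLE_FORMS[noun].items():
        adj_form = SINGLE_FORMS[adj].get((case, number))
        if adj_form is not None:
            variants[(number, case, NOUN_GENDER)] = adj_form + sep + noun_form
    # Path 2: <$1:Case=gen;Nb=pl> <$2> <$3:Case=gen;Nb=pl>
    #         output <Case=acc;Nb=pl;Gen=m1> (non-compositional m1 reading)
    key = ("gen", "pl")
    variants[("pl", "acc", "m1")] = SINGLE_FORMS[adj][key] + sep + SINGLE_FORMS[noun][key]
    return variants


forms = inflect("czerwony", " ", "pająk")
for features, form in sorted(forms.items()):
    print(":".join(features), form)
```

The design point the sketch tries to show is that the MWE layer never inflects words itself: it only selects and concatenates forms supplied by the single-word layer, and overrides the features of the result where the compound behaves non-compositionally.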

SLIDE 22

Multiflex – syntactic variation & agreement

126 variants           Lemma                  Features
Jan Rodowicz „Anoda”   Jan Rodowicz „Anoda”   sg:nom:m1:offic
Jan Rodowicz Anoda     Jan Rodowicz „Anoda”   sg:nom:m1
Jan „Anoda” Rodowicz   Jan Rodowicz „Anoda”   sg:nom:m1
J. Rodowicz „Anoda”    Jan Rodowicz „Anoda”   sg:nom:m1
J. Rodowicz            Jan Rodowicz „Anoda”   sg:nom:m1
„Anoda” Rodowicz       Jan Rodowicz „Anoda”   sg:nom:m1
Rodowicz               Jan Rodowicz „Anoda”   sg:nom:m1:spok
...


SLIDE 25

MULTIFLEX – nesting

336 variants                   Lemma                          Features
aleja Jana Rodowicza „Anody”   aleja Jana Rodowicza „Anody”   sg:nom:f:offic
al. Rodowicza                  aleja Jana Rodowicza „Anody”   sg:nom:f:neut
Rodowicza                      aleja Jana Rodowicza „Anody”   sg:nom:f:spok
...

Units of the lemma aleja Jana Rodowicza „Anody”, numbered $1 $2 $3:

  • lemma: aleja, class: subst, ...
  • lemma: Jan Rodowicz „Anoda”, class: subst, ...

⇓


SLIDE 27

Multiflex – applications

Software integration:

  • Unitex (LGPL) (Paumier, 2008),
  • LeXimir (Krstev et al., 2013),
  • Toposław (Marciniak et al., 2009b; Sikora & Woliński, 2009).

MWE e-dictionaries (dictionary size given in lemmas, graphs and forms; "-" marks a value not given on the slide):

Dictionary name  Language      Type                Lexicogr. framework  Lemmas  Graphs  Forms    License
Serbian DELAC    Serbian       general-purpose     LeXimir              11,000  115     204,500  -
Greek DELAC      modern Greek  general-purpose     Unitex               -       -       -        -
SAWA             Polish        urban proper names  Toposław             9,000   450     309,000  cc-by sa
SEJF             Polish        general-purpose     Toposław             3,200   140     68,000   cc-by sa
SEJFEK           Polish        economic terms      Toposław             11,000  290     146,000  cc-by sa


SLIDE 30

Named and naming entities

Terminological problem: in NLP, naming entities are usually called named entities.


SLIDE 32

Named/Naming Entities and beyond

NEs – NLP-central objects:

  • are/refer to persons, places, objects, events, ...,
  • crucial for text understanding,
  • hard to translate,
  • central to IR, IE and QA.

NEs – controversial objects (Ehrmann, 2008):

  • theoretical studies vs. applicative motivations,
  • onomasiological vs. semasiological definitions.

Beyond NEs:

  • mentions (coreference annotation and resolution),
  • entities (entity linking).

SLIDE 33

Most NEs are MWEs

Multi-word NEs in lexicons:

  • SAWA (Grammatical Lexicon of Warsaw Urban Proper Names): 98% of entries are MWEs,
  • Prolexbase: 66% of entries are MWEs.

Multi-word NEs in corpora:

  • National Corpus of Polish: 53% of the (outermost) NEs are MWEs or ellipses of MWEs.

SLIDE 34

NEs in NLP - State of the art

SLIDE 36

National Corpus of Polish

NKJP (National Corpus of Polish):

  • 1.5-billion (1.5 × 10^9) word corpus,
  • 300-million word balanced subcorpus,
  • 1-million word manually annotated subcorpus (2 parallel annotators + 1 adjudicator),
  • multilevel annotation: segmentation, morphosyntax, WSD, syntactic words, syntactic groups, NEs,
  • additional coreference level (Polish Coreference Corpus),
  • distributed under GNU GPL v3 and CC BY v.3.


SLIDE 39

NEs and mentions in NKJP and PCC – novelty

Common annotation aspects:

  • recursively nested NEs and mentions,
  • coordinated and discontinuous NEs and mentions.

NE annotation aspects (Savary et al., 2012):

  • relative adjectives, personal derivations and derivational bases (amerykański ← Stany Zjednoczone, ’American ← United States’).

Coreference annotation aspects (Ogrodniczuk et al., 2013):

  • dominant expressions & semantic heads.

SLIDE 40

NEs as annotation trees

Annotated phrase: działkowcy z województw: poznańskiego i bydgoskiego
’garden-owners from the regions_gen: Poznań-adj_gen and Bydgoszcz-adj_gen’

Annotation trees:

  • województwo poznańskie, placeName.region
      poznański, relAdj(placeName.settlement), derivational base: Poznań
  • województwo bydgoskie, placeName.region
      bydgoski, relAdj(placeName.settlement), derivational base: Bydgoszcz


SLIDE 43

Annotation tools for nested NEs

  • TrEd (Pajas & Štěpánek, 2008): customized to constituency trees, adjudication.
  • SProUT (Savary & Piskorski, 2011): customized rule-based NER, P 78%, R 38%.
  • NERF (Waszczuk et al., 2013): ML-based NER, P 80%, R 74%.
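The slide reports precision (P) and recall (R) only. To compare the two NER tools on a single figure, the usual balanced F1 (harmonic mean of P and R) can be derived from them. The F1 values computed below are an arithmetic consequence of the reported P and R, not numbers taken from the slides, and `f1_score` is a helper name chosen here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported on the slide.
reported = {"SProUT": (0.78, 0.38), "NERF": (0.80, 0.74)}
for tool, (p, r) in reported.items():
    print(f"{tool}: F1 = {f1_score(p, r):.2f}")
```

With these inputs the derived F1 is about 0.51 for the rule-based tool and about 0.77 for the ML-based one; the gap comes almost entirely from recall.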

SLIDE 44

NKJP & PCC – results

One of the largest multilevel-annotated corpora worldwide:

  • 87,000 (gold standard) NEs,
  • 180,000 mentions; 109,000 coref. clusters,
  • inter-annotator agreement: F1 = 0.83 (NEs), κ = 0.74 (mentions).
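The slide does not say which κ statistic was used for mentions. Assuming it is Cohen's kappa for two annotators, the coefficient corrects observed agreement for the agreement expected by chance. The sketch below uses made-up toy labels ("M" for mention, "O" for other); neither the data nor the function name `cohen_kappa` comes from the source.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy annotations, for illustration only.
ann_a = ["M", "M", "O", "O", "M", "O", "M", "O"]
ann_b = ["M", "M", "O", "M", "M", "O", "O", "O"]
print(cohen_kappa(ann_a, ann_b))
```

In the toy data the annotators agree on 6 of 8 items (0.75) while chance agreement is 0.5, which gives κ = 0.5.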


SLIDE 47

String approximation - state of the art

String-to-string correction (Damerau, 1964; Wagner & Fischer, 1974; Lowrance & Wagner, 1975; Du & Chang, 1992):

  • Context: elementary edit operations on letters with costs; allowed edit sequences.
  • Input: two strings x and y.
  • Output: ed(x, y) – edit distance between x and y.

String-to-language correction (SOA by Boytsov, 2011; Savary, 2003):

  • Context: as above.
  • Input: string language (dictionary) L, string x, threshold th.
  • Output: strings y ∈ L such that ed(x, y) ≤ th.
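The two problem statements can be made concrete with a small Python sketch. It makes simplifying assumptions that are not claims about the cited work: all edit operations cost 1 (plain Levenshtein distance, without the transposition operation of Damerau's model), and the dictionary L is scanned exhaustively instead of being traversed as a finite-state automaton. The names `edit_distance`, `correct` and `lexicon` are chosen here for illustration.

```python
def edit_distance(x, y):
    """Levenshtein distance: unit-cost insertion, deletion and substitution."""
    previous = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        current = [i]
        for j, cy in enumerate(y, start=1):
            current.append(min(
                previous[j] + 1,               # delete cx
                current[j - 1] + 1,            # insert cy
                previous[j - 1] + (cx != cy),  # substitute, or match at cost 0
            ))
        previous = current
    return previous[-1]


def correct(x, dictionary, th):
    """String-to-language correction: all y in the dictionary with ed(x, y) <= th."""
    scored = ((edit_distance(x, y), y) for y in dictionary)
    return sorted((d, y) for d, y in scored if d <= th)


# Hypothetical toy dictionary.
lexicon = ["behavioral", "behavioural", "behaviour", "image", "interface"]
print(correct("behavioral", lexicon, 1))
```

The exhaustive scan costs one distance computation per dictionary entry, which is the inefficiency that dictionary-aware methods are designed to avoid.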


SLIDE 49

Tree approximation - state of the art

Tree-to-tree correction (Selkow, 1977; Tai, 1979; Zhang & Shasha, 1989):

  • Context: elementary edit operations on tree nodes or subtrees; edit sequences.
  • Input: two trees x and y.
  • Output: ed(x, y) – edit distance between x and y.

Tree-to-language correction (SOA by Tekli et al., 2011):

  • Context: as above.
  • Input: tree language L, tree x, threshold th.
  • Output: trees y ∈ L such that ed(x, y) ≤ th.
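As an illustration of tree-to-tree correction, here is a minimal sketch of a top-down distance in the spirit of Selkow (1977): a node may be relabelled, and whole subtrees may be inserted or deleted among the children of matched nodes. The cost model is an assumption (relabelling costs 1, inserting or deleting a subtree costs its number of nodes), the tree encoding `(label, [children])` and the function names are chosen here, and the more general node-level model of Tai or Zhang & Shasha is not covered.

```python
def size(tree):
    """Number of nodes in a tree encoded as (label, [children])."""
    _, children = tree
    return 1 + sum(size(child) for child in children)


def tree_distance(t1, t2):
    """Top-down distance: relabel roots, then align the two child sequences."""
    (label1, kids1), (label2, kids2) = t1, t2
    rows, cols = len(kids1) + 1, len(kids2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + size(kids1[i - 1])  # delete a whole subtree
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + size(kids2[j - 1])  # insert a whole subtree
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + size(kids1[i - 1]),
                d[i][j - 1] + size(kids2[j - 1]),
                d[i - 1][j - 1] + tree_distance(kids1[i - 1], kids2[j - 1]),
            )
    return (label1 != label2) + d[-1][-1]


# Hypothetical example trees.
t = ("root", [("a", []), ("b", [("c", [])])])
u = ("root", [("b", [("c", [])]), ("b", [("c", [])])])
print(tree_distance(t, u))
```

For the example pair the cheapest script relabels a to b and inserts a child c under it, so the distance is 2. Tree-to-language correction generalizes this by searching over every tree of a language rather than comparing against a single target.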

SLIDE 50

XMLCorrector: XML document correction wrt. a DTD


SLIDE 53

XMLCorrector (Amavi et al., 2013)

Input:

  • t – XML tree,
  • S – a structure description (DTD),
  • th – threshold,
  • c – intended root node.

Output: node-edit operation sequences that yield all trees t′ ∈ L(S) such that ed(t, t′) ≤ th.

SLIDE 54

XMLCorrector: example

S = {root: b*|ab*c; b: cd; b: c; c: ε; d: ε}, th = 2

[Figure: an input tree and three candidate corrections t′1, t′2 and t′3, each rooted at "root" (position ε) with nodes addressed by positions such as 0.0, 0.1, 1, 1.0, 2, 2.0. Node labels below the root, in extraction order: input tree: a c d b c b c; t′1: b c b c b c; t′2: a c d b c b c c; t′3: a c d b c c.]

SLIDE 55

XMLCorrector

First full-fledged tree-to-language correction algorithm and implementation:

  • correction trees, sequences and distances returned,
  • all candidates within a threshold found,
  • complexity, correctness and soundness proofs,
  • GNU LGPL license,
  • test data available (reproducibility).


SLIDE 58

Compositional modeling and computation

Advantages:

  • Preventing a combinatorial explosion of lexicalized cases:
      compositional calculus of emotional valency (Tallec et al., 2010),
      nested description of MWEs in Multiflex (Savary et al., 2009).
  • Better modeling of semantic relations:
      nested NE annotation in NKJP (Savary et al., 2012),
      nested mention annotation in PCC (Ogrodniczuk et al., 2013).

Challenges: MWEs

  • MWEs defy compositionality principles (Savary et al., 2007),
  • MWEs are usually partly frozen and partly variable,
  • heterogeneous properties should be accounted for simultaneously.


SLIDE 61

Variability – central challenge in NLP

Objective: conflate different surface realizations of the same underlying concept.

Means:

  • lexical and grammatical description,
  • algorithmic approximation.

Multilinguality provides a better understanding of linguistic variability.

SLIDE 62

Perspectives

Objectives:

  • enhancing and extending language resources and tools,
  • integrating language data into Linked Open Data (LOD),
  • integrating MWEs in deep parsing,
  • taxonomy and benchmarking for tree-to-language correction,
  • modeling MWE identification as a tree-to-language correction problem.


SLIDE 64

External collaborations

  • Polish Academy of Sciences, Warsaw
  • Gdańsk University of Technology
  • University of Poznań
  • University of Olsztyn
  • University of Belgrade
  • Université Paris Est Marne-la-Vallée
  • University of Orléans
  • Tomsk State University

SLIDE 65

Event organisation – OC co-chair

Blois, 12–16 July, 2011:

  • 16th International Conference on Implementation and Application of Automata (CIAA-2011),
  • 9th International Workshop on Finite State Methods and Natural Language Processing (FSMNLP-2011),
  • 95 participants,
  • 65,000 € budget.

SLIDE 66

Participation in funded collaborative projects

("-" marks a budget not given on the slide)

Project      Dates      Budget       Coordinator           Funding
PARSEME      2013–2017  680,000 €    A. Savary             COST
CORE         2011–2014  120,000 €    IPIPAN                NCN
CESAR        2011–2013  -            Hungarian Ac. of Sc.  EC (PSP)
NEKST        2009–2014  3,500,000 €  IPIPAN & PWr          ERDF
CODEX        2009–2012  68,336 €     INRIA                 ANR
LUNA.PL      2008–2009  -            IPIPAN                MNSW
NKJP         2007–2010  600,000 €    IPIPAN                MNSW
EmotiRob     2007–2009  85,200 €     Univ. Bretagne        ANR
Polonium     2007–2008  6,070 €      LI & IPIPAN           PHC EGIDE
Pavle Savic  2004–2005  5,500 €      LI & Belgrade Univ.   PHC EGIDE
NomsPropres  2003–2005  94,000 €     LI                    RNTL
Outilex      2002–2006  -            Paris-Est             RNTL

SLIDE 67

PARSEME (PARsing and Multi-word Expressions)

IC1207 COST action:

  • scientific network: 30 COST countries,
  • bottom-up approach,
  • 114 members,
  • 4 working groups,
  • 29 languages from 9 language families,
  • duration: 2013–2017.

Scientific objective: to bridge the gap between linguistic precision and computational efficiency in NLP applications.

Key issue: MWEs and their links to (deep) parsing.


SLIDE 69

Thank you!

QA ...


SLIDE 72

MWE processing and tree-to-language correction

Modelling an MWE as a tree language:

  • An MWE can have (infinitely) many potential instantiations:
      to count somebody in, he has counted me and Laura in, I have never counted this idiot with strange ideas in
  • These instantiations form a tree language (of which type?).

Recognizing an MWE in a syntax tree:

  • A selected subtree is corrected wrt. the tree language.

Applications:

  • Annotating MWEs in treebanks.
  • Recognizing MWEs after parsing.
  • Approximate MWE recognition in noisy input.