Automated Acquisition of Linguistic Knowledge for Robust Multilingual Grammar Development – PowerPoint PPT Presentation

Outline: Introduction; Acquisition of MWEs: Theoretical Background & Motivation; Detection of MWE Candidates; Evaluation of the Identification of MWEs; Extension of the English Resource Grammar with MWEs; Enhancing Robustness of the German Grammar


SLIDE 1

Automated Acquisition of Linguistic Knowledge for Robust Multilingual Grammar Development

Valia Kordoni & Yi Zhang

German Research Centre for Artificial Intelligence (DFKI GmbH), and Department of Computational Linguistics (COLI), Saarland University

FEAST – Forum Entwicklung und Anwendung von Sprachtechnologien, 02/11/2009
Cluster of Excellence “Multimodal Computing and Interaction”

Valia Kordoni Automated Acquisition of Linguistic Knowledge

SLIDE 2

Road Map

1 Introduction

2 Acquisition of MWEs: Theoretical Background & Motivation

3 Detection of MWE Candidates

4 Evaluation of the Identification of MWEs (Resources; Comparing Corpora; Comparing Statistical Measures)

5 Extension of the English Resource Grammar with MWEs (Setup; Grammar Performance)

6 Enhancing Robustness of the German Grammar

SLIDE 4

Introduction

1 Robust (multilingual) grammars are at the heart of “deep” linguistic processing

2 Acquisition of linguistic knowledge, especially automated lexical acquisition, is at the core of the development of large-scale, wide-coverage, robust (multilingual) grammars for “deep” linguistic processing systems

SLIDE 5

“Deep” Linguistic Processing

“Deep” linguistic processing delivers fine-grained syntactic and semantic analyses. Its core component is a complex rule system, the so-called “deep” grammar.

SLIDE 6

“Deep” Grammars

Most of the so-called “deep” grammars are strongly lexicalised (cf. Head-Driven Phrase Structure Grammar (HPSG), Lexical Functional Grammar (LFG), Combinatory Categorial Grammar (CCG), etc.). Moreover, most of them are nowadays at the heart of application-oriented (multilingual) grammar engineering systems and platforms.

SLIDE 7

Robustness of “Deep” Grammars

Hence, “deep” grammars need to be robust, since for many applications it is important that an analysis be generated.

SLIDE 8

Application-Oriented “Deep” Linguistic Processing: Requirements

Stable formalisms:

HPSG (cf. LKB [Copestake, 2002], ALPINO [Bouma et al., 2001], TRALE [Meurers et al., 2002])
LFG (cf. XLE [Maxwell and Kaplan, 1996])
CCG (cf. OpenCCG [Baldridge and Kruijff, 2003])
MP (cf. Minimalist Grammar [Stabler, 2000], [Churng, 2006])
...

Parsing, generation, and grammar development tools
Test suite management tools

SLIDE 9

Application-Oriented “Deep” Linguistic Processing Systems: Issues

The main issues for the practical application of “deep” processing systems are

re-usability – in domains other than the one the “deep” grammar was originally developed in/for
specificity – more analyses generated by the “deep” grammar at the heart of the system than expected
robustness – fewer or even no analyses generated by the “deep” grammar at the core of the system

From the grammar engineering point of view, specificity and robustness are ‘a pair of dual problems’: gain on the one side means potential loss on the other.

SLIDE 10

An Example: the DELPH-IN German Grammar (GG)

The DELPH-IN German Grammar (GG; [Crysmann, 2003]) is a relatively large-scale deep grammar of German, developed within the framework of HPSG The grammar originates from [Müller and Kasper, 2000], but has continued to develop after the end of the Verbmobil project [Wahlster, 2000] and currently consists of:

5K types
115 rules
a lexicon of about 35K entries, mapped onto 386 distinct lexical types

SLIDE 11

The DELPH-IN GG: Overall Performance

Result   #Sentences   Percentage
P        62,768       10.22%
L        464,112      75.55%
N        87,415       14.23%
E        3            –
Total    614,298      100%

Table: Parsing results with GG and Frankfurter Rundschau

Result   #Sentences   Percentage
P        109,498      4.3%
L        2,328,490    90.5%
N        134,917      5.2%
E        14           –
Total    2,572,919    100%

Table: Parsing results with GG and deWaC

SLIDE 12

The DELPH-IN GG: Overall Performance (cont.)

It is obvious from these results that GG has full lexical span/coverage for only a very small portion of the sentences. Thus, low (lexical) coverage is the main problem for the robustness of GG. Similar results are shown by [Baldwin et al., 2004] and [Zhang and Kordoni, 2006] for the English Resource Grammar (ERG; [Copestake and Flickinger, 2000]) and by [van Noord, 2004a] for the Dutch Alpino grammar.

SLIDE 13

What is to be done, then? Acquisition of Linguistic Knowledge for Robust Grammar Engineering

Acquisition as the process of (semi-)automatically learning linguistic properties (usually defined by a given language resource): “fill in the gaps” in a language resource
Automated prediction of words, multi-word units/expressions (MWEs) and, possibly, constructions
Validation and evaluation of the acquired linguistic knowledge as a means for better/more consistent development (and maintenance) of robust (multilingual) grammars for application-oriented “deep” linguistic processing (systems)

SLIDE 14

So how does all this fit together?

In what follows

1 we will “learn” (incl. validate and evaluate) MWEs for the DELPH-IN English Resource Grammar (ERG; [Flickinger, 2000]) and contribute to boosting its coverage

2 we will tackle the robustness problem of the DELPH-IN German Grammar (GG; [Crysmann, 2003]) when employed for real-life applications by enhancing it automatically with the linguistic knowledge it lacks

SLIDE 15

The DELPH-IN Collaboration

Deep Linguistic Processing with HPSG Initiative

Grammars:

English: LinGO ERG (23K lexical entries); German: GG (35K lexical entries); Japanese: JaCY (48K lexical entries)
Others: Norwegian, Modern Greek, Korean, Chinese, ...

Processing software:

LKB: grammar engineering platform
PET: efficient parser
[incr tsdb()]: profiling platform
HoG: infrastructure for building hybrid NLP applications based on RMRS semantic representations

Applications: Machine Translation, IE, Email Autoresponse, ...
All of these are available online: http://wiki.delph-in.net/moin

SLIDE 17

MWEs: Theoretical Linguistic Background

Multiword Expressions: Definition
A multiword expression (MWE) is decomposable into multiple simplex words and is lexically, syntactically, semantically, pragmatically and/or statistically idiosyncratic.

SLIDE 18

MWEs: Theoretical Linguistic Background

Some Examples
San Francisco, ad hoc, by and large, Where Eagles Dare, kick the bucket, part of speech, in step, the Oakland Raiders, trip the light fantastic, telephone box, call (someone) up, take a walk, do a number on (someone), take (unfair) advantage of, pull strings, kindle excitement, fresh air, ...

SLIDE 19

MWEs: Theoretical Linguistic Background

MWE or not MWE? “... there is no unified phenomenon to describe but rather a complex of features that interact in various, often untidy, ways and represent a broad continuum between non-compositional (or idiomatic) and compositional groups of words.” [Moon, 1998]

SLIDE 20

MWEs: Theoretical Linguistic Background

Lexicosyntactic Idiomaticity
by and large (???) = by (P) and (conj) large (Adj)
wine and dine (V [trans]) = wine (V [intrans]) and (conj) dine (V [intrans])
ad hoc (Adj) = ad (?) hoc (?)

SLIDE 21

MWEs: Theoretical Linguistic Background

Semantic Idiomaticity
kick the bucket = die’
spill the beans = reveal’ (secret’)
kindle excitement = kindle’ (excitement’)

SLIDE 22

MWEs: Theoretical Linguistic Background

Pragmatic Idiomaticity
Situatedness: the expression is associated with a fixed pragmatic point

situated MWEs: good morning, all aboard
non-situated MWEs: first off, to and fro

The “Wheel of Fortune” factor: how to represent the jumble of phrases stored in the mental lexicon?

The “Monty Python” factor: mish-mash of evocative language fragments

SLIDE 23

MWEs: Theoretical Linguistic Background

Statistical Idiomaticity

             unblemished  spotless  flawless  immaculate  impeccable
eye          −            −         −         −           +
gentleman    −            −         ?         −           +
home         ?            +         −         +           ?
lawn         −            −         ?         +           −
memory       −            −         +         −           ?
quality      −            −         −         −           +
record       +            +         +         +           +
reputation   +            −         −         +           +
taste        −            −         −         −           +

Table: Adapted from [Cruse, 1986]

SLIDE 24

MWEs: Theoretical Linguistic Background

MWE Markedness

MWE               Marked: Lex  Syn  Sem  Prag  Stat
ad hominem                ✔    ?    ?    ?     ✔
at first                  ✗    ✔    ✗    ✗     ✗
first aid                 ✗    ✗    ✔    ✗     ?
salt and pepper           ✗    ✗    ✗    ✗     ✔
good morning              ✗    ✗    ✗    ✔     ✔
cat’s cradle              ✔    ✔    ✔    ✗     ?

SLIDE 25

MWEs: Theoretical Linguistic Background

Other Indicators of MWE-hood ([Fillmore et al., 1988], [Liberman and Sproat, 1992], [Nunberg et al., 1994])
Institutionalisation/conventionalisation: bread and butter
Non-identifiability: meaning cannot be predicted from surface form

idiom of decoding (non-identifiable): kick the bucket, fly off the handle
idiom of encoding (identifiable): wide awake, plain truth

SLIDE 26

MWEs: Theoretical Linguistic Background

Other Indicators of MWE-hood ([Fillmore et al., 1988], [Liberman and Sproat, 1992], [Nunberg et al., 1994])
Figuration: the expression encodes some metaphor, metonymy, hyperbole, etc.

figurative expressions: bull market, beat around the bush
non-figurative expressions: first off, to and fro

SLIDE 27

MWEs: Theoretical Linguistic Background

Other Indicators of MWE-hood ([Fillmore et al., 1988], [Liberman and Sproat, 1992], [Nunberg et al., 1994])
Single-word paraphrasability: the expression has a single-word paraphrase

paraphrasable MWEs: leave out = omit, take off (clothes) = undress
non-paraphrasable MWEs: look up

SLIDE 28

MWEs: Theoretical Linguistic Background

Other Indicators of MWE-hood ([Fillmore et al., 1988], [Liberman and Sproat, 1992], [Nunberg et al., 1994])
Proverbiality: the expression is used ‘to “describe” – and implicitly, to explain – a recurrent situation of particular social interest ... in virtue of its resemblance or relation to a scenario involving homely, concrete things and relations’ [Nunberg et al., 1994]
Informality: the expression is associated with more informal or colloquial registers
Affect: the expression encodes a certain evaluation or affective stance toward the thing it denotes

SLIDE 29

MWEs: Theoretical Linguistic Background

Other Indicators of MWE-hood ([Fillmore et al., 1988], [Liberman and Sproat, 1992], [Nunberg et al., 1994])
Prosody: the expression has a distinctive stress pattern which diverges from the norm

prosodically-marked MWE: soft spot
prosodically-unmarked MWEs: first aid, red herring
prosodically-marked non-MWE: dental operation

SLIDE 30

MWEs: Theoretical Linguistic Background

MWEs and the Notion of Compositionality: Definition
The degree to which the features of the parts of an MWE combine to predict the features of the whole

SLIDE 31

MWEs: Theoretical Linguistic Background

MWEs and the Notion of Compositionality
Generally considered in the context of semantic compositionality, but we can equally talk about:

lexical compositionality
syntactic compositionality
pragmatic compositionality

SLIDE 32

MWEs: Theoretical Linguistic Background

Example: Syntactic Compositionality
Definition: the degree to which the syntactic features of the parts of an MWE combine to predict the syntax of the whole

Fixed expressions: by and large, San Francisco
Verb particles: eat up vs. chicken out

Syntactic compositionality has a binary effect: non-compositional MWEs are lexicalised

SLIDE 33

MWEs: Theoretical Linguistic Background

Question: Given that compositionality extends over all aspects of markedness that affect MWEs, is it the be-all and end-all of MWEs?
Almost, but there are subtleties due to:
statistical markedness
decomposability

SLIDE 35

MWEs: Theoretical Linguistic Background

Statistical Markedness (Revisited)
Statistical markedness is (often) a reflection of statistical non-compositionality rather than of a lack of compositionality:

1 p(impeccable | N) × p(Adj | eye) ≈ p(impeccable eye), BUT

2 p(unblemished | N) × p(Adj | eye) ≫ p(unblemished eye)

3 p(spotless | N) × p(Adj | eye) ≫ p(spotless eye)

4 p(flawless | N) × p(Adj | eye) ≫ p(flawless eye)
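The contrast above can be checked with simple corpus arithmetic: under an independence assumption, the expected probability of an Adj–N pair is the product of the marginals, and a statistically marked gap shows up as observed ≪ expected. A minimal Python sketch with invented toy counts (the figures are illustrative, not from the talk):

```python
from collections import Counter

# Invented toy counts of Adj-N pairs (illustrative only, not corpus data).
pairs = Counter({
    ("impeccable", "taste"): 40,
    ("impeccable", "eye"): 12,
    ("unblemished", "record"): 30,
    ("unblemished", "eye"): 0,
    ("spotless", "record"): 25,
    ("spotless", "lawn"): 8,
})

total = sum(pairs.values())
adj_counts, noun_counts = Counter(), Counter()
for (adj, noun), c in pairs.items():
    adj_counts[adj] += c
    noun_counts[noun] += c

def observed(adj, noun):
    """Relative frequency of the pair in the toy corpus."""
    return pairs[(adj, noun)] / total

def expected(adj, noun):
    """Pair probability predicted under independence: p(adj) * p(noun)."""
    return (adj_counts[adj] / total) * (noun_counts[noun] / total)

# 'impeccable eye' is attested, while 'unblemished eye' never occurs,
# even though independence predicts that it should.
for adj in ("impeccable", "unblemished"):
    print(f"{adj} eye: observed={observed(adj, 'eye'):.4f} "
          f"expected={expected(adj, 'eye'):.4f}")
```

The same observed-vs-expected comparison underlies association measures such as pointwise mutual information, which is one way such statistical measures are compared later in the talk.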

SLIDE 36

MWEs: Theoretical Linguistic Background

Decomposability: Definition
The degree to which the features of an MWE can be ascribed to those of its parts

SLIDE 37

MWEs: Theoretical Linguistic Background

Decomposability and Semantic Idiomaticity
kick the bucket = die’
spill the beans = reveal’ (secret’)
kindle excitement = kindle’ (excitement’)

SLIDE 38

MWEs: Theoretical Linguistic Background

Decomposability: Three Classes of MWEs
Classification of MWEs into three classes:

non-decomposable MWEs: kick the bucket, shoot the breeze, hot dog
idiosyncratically decomposable MWEs: spill the beans, let the cat out of the bag, radar footprint
simple decomposable MWEs: kindle excitement, traffic light

SLIDE 39

MWEs: Theoretical Linguistic Background

Decomposability: Three Classes of MWEs
There is a cline of “markedness” for idiosyncratically decomposable MWEs: chicken out vs. home office vs. radar footprint

SLIDE 40

MWEs: Theoretical Linguistic Background

Decomposability and Syntactic Flexibility
Consider:

*the bucket was kicked by Kim
Strings were pulled to get Sandy the job.
The FBI kept closer tabs on Kim than they kept on Sandy.
... the considerable advantage that was taken of the situation

The syntactic flexibility of an idiom can generally be explained in terms of its decomposability

SLIDE 41

MWEs: Theoretical Linguistic Background

So What was the Answer to our Question?
Yes and No:

simple compositionality is adequate for describing many instances of lexical, syntactic, semantic and pragmatic markedness, BUT
the notion of compositionality is significantly different for statistically marked MWEs, AND
decomposability diffuses the markedness boundary

SLIDE 42

MWEs: Theoretical Linguistic Background

And Why is it we Care about Compositionality?
For all the reasons we care about MWEs:

Lexicography/dictionary making
Idiomaticity (coherent semantics)
Overgeneration
Undergeneration
Relevance in applications, including MT, IR, QA, ...

SLIDE 43

MWEs in NLP: Motivation

MWEs in NLP: It is difficult to provide a unified account for the detection of these distinct but related phenomena. We will show how we build on compositionality to also deal with MWEs in NLP.

Challenge for Grammar Engineering and “Deep” Linguistic Processing: Lexical coverage is the major barrier to broad-coverage “deep” linguistic processing. MWEs constitute a significant part of the problem; this should not be surprising, since they are roughly equivalent in number to the single words in a speaker’s lexicon [Jackendoff, 1997].

SLIDE 45

Lexical coverage as a major barrier: an example from English

BNC Coverage Test (ERG jan-06 [Flickinger, 2000])
1.8M sentences (21.2M words) from the BNC written component, with only ASCII characters and no more than 20 words each

Result                     # Sentences   Percentage
Parsed                     644,940       35.80%
Lex. Missing               969,452       53.82%
Full Lex. Span, No Parse   186,883       10.38%

SLIDE 48

Error Mining [van Noord, 2004b]

Parsability of a word sequence:

    R(w_i ... w_j) = C(w_i ... w_j, OK) / C(w_i ... w_j)

where C(w_i ... w_j) is the number of occurrences of the sequence and C(w_i ... w_j, OK) the number of its occurrences in successfully parsed sentences. If the parsability of a particular word sequence is very low, this indicates that something is wrong. Parsabilities can be calculated efficiently for large corpora with suffix arrays and perfect hashing [Lucchesi and Kowaltowski, 1993].
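The parsability computation can be sketched in a few lines. This is a minimal illustration on toy data, not van Noord's suffix-array implementation; the function and variable names are ours:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def parsability(parsed, unparsed, n=3, min_count=5):
    """R(w_i..w_j) = C(w_i..w_j, OK) / C(w_i..w_j) for every n-gram
    occurring at least min_count times.  `parsed` and `unparsed` are
    lists of tokenised sentences."""
    ok, total = Counter(), Counter()
    for sent in parsed:
        for g in ngrams(sent, n):
            ok[g] += 1
            total[g] += 1
    for sent in unparsed:
        for g in ngrams(sent, n):
            total[g] += 1
    return {g: ok[g] / c for g, c in total.items() if c >= min_count}

parsed = [["it", "was", "fine"]] * 9
unparsed = [["by", "and", "large", "it", "was", "fine"]] * 6
R = parsability(parsed, unparsed, n=3, min_count=5)
print(R[("by", "and", "large")])   # 0.0: never parses, a suspect n-gram
print(R[("it", "was", "fine")])    # 0.6: parses in 9 of 15 occurrences
```

A real run would replace the toy sentence lists with the parsed/unparsed splits of the corpus.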


Error Mining Experiment

The experiment was run on the BNC: the parsed sentences and the unparsed sentences (with full lexical span). Low-parsability n-grams were extracted, and the trigrams were taken for further investigation.

                Num.       %
    uni-gram     798   20.84%
    bi-gram    2,011   52.52%
    tri-gram     937   24.47%

Table: Distribution of n-grams with R < 0.1



Example of Low Parsability N-grams

    N-gram               R       Count
    the burden of        0.000      49
    by and large         0.000      37
    face of it           0.000      34
    frame of mind        0.000      23
    points of view       0.000      20
    hair and a           0.000      17
    the to infinitive    0.000      15
    of alcohol and       0.000       8
    a great many         0.083      44
    glance up at         0.083      33
    for and against      0.086      21
    form of government   0.142       6


Summary

Error mining-based MWE detection works, but the detected MWE candidates still need validation.


Road Map

1. Introduction
2. Acquisition of MWEs: Theoretical Background & Motivation
3. Detection of MWEs candidates
4. Evaluation of the Identification of MWEs (Resources; Comparing Corpora; Comparing Statistical Measures)
5. Extension of the English Resource Grammar with MWEs (Setup; Grammar Performance)
6. Enhancing Robustness of the German Grammar


Identification of MWEs

The aim: given a list of word sequences, to distinguish MWEs (e.g., in the red) from random sequences of words (e.g., of alcohol and).


Identification of MWEs

Why so many statistical tests in the literature? Complications in evaluation:

- it is hard to say which is the "best" test
- there are conflicting results from different researchers
- different corpora have different distributional idiosyncrasies
- different tests have different statistical idiosyncrasies


Identification of MWEs

Thus, there are two important questions:

- How reliable is the corpus used?
- How precise is a statistical measure in distinguishing the phenomena studied?


Resources

1,039 trigrams from the error mining system [van Noord, 2004b]; 4 corpora:

- BNCf: fragment of the BNC used in the error-mining experiments
- BNC: complete BNC (from the site http://pie.usna.edu/)
- Google: Web using Google
- Yahoo: Web using Yahoo

    Corpus   Frequency of the 1,039 trigrams
    BNCf                              66,101
    BNC                              322,325
    Google                       224,479,065
    Yahoo                      6,081,786,313


Comparing corpora

Hypothesis: the relative frequency ordering of different n-grams is preserved across corpora in the same domain. If not, different conclusions may be drawn from different corpora.


Comparing corpora – first test

Relative frequency rank for the trigrams

[Figure: relative frequency vs. rank for the trigrams in BNCf, BNC, Google, and Yahoo, on log-log axes]

The overall ranking distribution is very similar for these corpora, showing the expected Zipf-like behaviour.


Comparing corpora – second test

Measuring Kendall's τ scores between corpora, a significant correlation was found, with p < 0.000001. But what is the degree of correlation among them?

To estimate the correlation: Q, the probability that any 2 trigrams chosen from two corpora have the same relative ordering in frequency.
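Q can be estimated directly as the fraction of trigram pairs whose frequencies are ordered the same way in both corpora (for tie-free rankings this equals (τ + 1)/2). A small sketch, with hypothetical frequency counts for illustration:

```python
from itertools import combinations

def same_order_probability(freq_a, freq_b):
    """Q: probability that a random pair of items has the same relative
    frequency ordering in both corpora (concordant-pair fraction;
    tied pairs count as not concordant in this sketch)."""
    items = sorted(set(freq_a) & set(freq_b))
    concordant = pairs = 0
    for x, y in combinations(items, 2):
        da = freq_a[x] - freq_a[y]
        db = freq_b[x] - freq_b[y]
        pairs += 1
        if da * db > 0:
            concordant += 1
    return concordant / pairs

# toy frequencies for four trigrams in two corpora (invented numbers)
bnc = {"by and large": 37, "frame of mind": 23, "a great many": 44, "hair and a": 17}
web = {"by and large": 400, "frame of mind": 300, "a great many": 350, "hair and a": 90}
q = same_order_probability(bnc, web)
print(q)  # 5 of the 6 pairs are ordered the same way: 5/6
```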


Comparing corpora – second test

              BNC   Google   Yahoo
    BNCf     0.81     0.73    0.78
    BNC               0.73    0.77
    Google                    0.86

The corpora are correlated, and can probably be used interchangeably for the statistical properties of the trigrams. A higher correlation was observed between Yahoo and Google.


Comparing statistical measures

Using a single corpus, BNCf, we compare Mutual Information (MI), χ2 and Permutation Entropy (PE) for MWE identification. MI and χ2 are typical measures of association: they compare the joint probability of occurrence of a certain group of events, p(abc), with a prediction derived from the null hypothesis of statistical independence between these events, p∅(abc) = p(a) · p(b) · p(c).


MI and χ2

    χ2 = Σ_{a,b,c} [n(abc) − n∅(abc)]² / n∅(abc)

    MI = Σ_{a,b,c} (n(abc) / N) · log₂ [n(abc) / n∅(abc)]
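These association scores can be sketched in code. Note that this is a simplified, pointwise per-trigram variant that compares each observed count with its independence prediction; the formulas above sum over all cells of the contingency table. All names and counts are ours:

```python
import math

def association_scores(trigram_counts, unigram_counts):
    """Pointwise MI and chi-square for each trigram, comparing its
    observed count n(abc) against the independence prediction
    n0(abc) = N * p(a) * p(b) * p(c)."""
    N = sum(trigram_counts.values())   # total trigram tokens
    U = sum(unigram_counts.values())   # total word tokens
    scores = {}
    for (a, b, c), n in trigram_counts.items():
        n0 = N * (unigram_counts[a] / U) * (unigram_counts[b] / U) \
               * (unigram_counts[c] / U)
        chi2 = (n - n0) ** 2 / n0
        mi = (n / N) * math.log2(n / n0)
        scores[(a, b, c)] = {"MI": mi, "chi2": chi2}
    return scores

# invented toy counts: the MWE candidate co-occurs far above chance
unigrams = {"by": 10, "and": 20, "large": 10, "red": 10}
trigrams = {("by", "and", "large"): 5, ("by", "and", "red"): 1}
s = association_scores(trigrams, unigrams)
# the non-compositional candidate scores higher on both measures
assert s[("by", "and", "large")]["MI"] > s[("by", "and", "red")]["MI"]
assert s[("by", "and", "large")]["chi2"] > s[("by", "and", "red")]["chi2"]
```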


Permutation Entropy (PE)

Permutation entropy is a measure of order association:

    PE = − Σ_{(i,j,k)} p(w_i w_j w_k) ln [ p(w_i w_j w_k) ]

    p(w₁ w₂ w₃) = n(w₁ w₂ w₃) / Σ_{(i,j,k)} n(w_i w_j w_k)

where the sums run over all permutations of the three words (e.g., for by and large: by and large, by large and, and by large, and large by, large by and, large and by).
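The definition above can be sketched directly; a minimal illustration with invented counts (function names are ours):

```python
import math
from itertools import permutations

def permutation_entropy(counts, words):
    """PE = -sum over permutations of p * ln(p), where p is normalised
    over all six orderings of the three words; `counts` maps trigram
    tuples to corpus counts."""
    perms = list(permutations(words))
    total = sum(counts.get(p, 0) for p in perms)
    pe = 0.0
    for p in perms:
        n = counts.get(p, 0)
        if n:
            prob = n / total
            pe -= prob * math.log(prob)
    return pe

# rigid MWE: only one word order is ever observed
counts = {("by", "and", "large"): 37}
pe_mwe = permutation_entropy(counts, ("by", "and", "large"))

# maximally free combination: all six orders equally frequent
counts_free = {p: 10 for p in permutations(("red", "and", "blue"))}
pe_free = permutation_entropy(counts_free, ("red", "and", "blue"))

print(pe_mwe)             # 0.0, fully rigid
print(round(pe_free, 3))  # 1.792, i.e. ln 6, the maximum for trigrams
```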


Permutation Entropy (PE)

PE for MWE detection. Hypothesis: MWEs are more rigid to permutations and therefore have smaller PEs; the more independent the words are, the closer the PE is to its maximal value (ln 6, for trigrams). PE does not rely on single-word counts, which are less accurate in Web-based corpora.


Are they equivalent?

Kendall's τ is used for assessing the correlation of the rankings produced by these measures and its significance; Q is the probability of finding the same ordering in them.

         MI×χ2   MI×PE   χ2×PE
    Q     0.71    0.55    0.45

The correlations found are statistically significant, but the measures order the trigrams differently: there is a 70% chance of getting the same order from MI and χ2, and both are very different from PE.


Are they useful for MWE detection?

To check this, we compare the measures' distributions for MWEs and non-MWEs. Gold standard: a set of 382 MWE candidates annotated by a native speaker, of which 90 are MWEs and 292 are non-MWEs. MI or PE seem to differentiate between MWEs and non-MWEs.


Are they useful?

Normalised histograms for MWEs and non-MWEs

The ideal scenario: non-overlapping distributions for MWEs and non-MWEs. A simple threshold operation would then be enough to distinguish between them.
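Such a threshold operation is trivial to implement. A sketch with hypothetical scores and a toy gold set (the real 90/292 gold standard is not reproduced here):

```python
def threshold_classify(scores, gold, threshold):
    """Label a candidate as an MWE when its association score exceeds
    the threshold; return (precision, recall) against the gold set."""
    predicted = {c for c, s in scores.items() if s > threshold}
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold)
    return precision, recall

# invented MI-like scores for four candidates
scores = {"by and large": 4.7, "frame of mind": 3.9,
          "of alcohol and": 0.4, "hair and a": 0.6}
gold = {"by and large", "frame of mind"}

print(threshold_classify(scores, gold, threshold=1.0))  # (1.0, 1.0)
print(threshold_classify(scores, gold, threshold=4.0))  # (1.0, 0.5)
```

With overlapping distributions, as in the real data, no single threshold reaches both perfect precision and perfect recall.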



[Figure: normalised histograms of log(MI) (BNCf), log(χ2) (BNCf), and log(PE) (Yahoo) for MWEs vs. non-MWEs; the distributions overlap for all three measures]

As some types of MWEs may have stronger constraints on word order, more visible effects will probably be seen if we look at the application of the measures to individual types of MWEs [Evert and Krenn, 2005].


Summary

So far we have detected n-grams which are candidate MWEs, and we have validated them using statistical measures on corpora. For grammar engineering, though, we still need a way of acquiring new lexical entries for MWEs and of evaluating their influence on grammar performance.


Road Map

1. Introduction
2. Acquisition of MWEs: Theoretical Background & Motivation
3. Detection of MWEs candidates
4. Evaluation of the Identification of MWEs (Resources; Comparing Corpora; Comparing Statistical Measures)
5. Extension of the English Resource Grammar with MWEs (Setup; Grammar Performance)
6. Enhancing Robustness of the German Grammar


English Resource Grammar [Flickinger, 2000]

A large-scale, broad-coverage precision HPSG grammar. Lexicon coverage is a major problem, and MWEs comprise a large portion of the missing lexical entries.


Lexical hierarchy and atomic lexical types

The lexical information is encoded in atomic lexical types. A lexicon is an n : n mapping between lexemes and atomic lexical types.


Maximum Entropy Model-based Lexical Type Predictor

A statistical classifier that predicts an atomic lexical type for each occurrence of an unknown word or a missing lexical entry. Input: features from the context. Output: atomic lexical types.

    p(t|c) = exp(Σ_i θ_i f_i(t, c)) / Σ_{t′∈T} exp(Σ_i θ_i f_i(t′, c))
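The scoring side of such a maximum entropy model is a softmax over weighted feature sums. A minimal sketch; the feature names, weights, and type names below are invented for illustration and are not the ERG's actual feature set:

```python
import math

def maxent_probs(features, weights, types):
    """p(t|c) = exp(sum_i theta_i * f_i(t, c)) / normaliser over all t'.
    `features(t)` returns the active (binary) feature names for type t
    in the current context; `weights` maps feature names to theta_i."""
    scores = {t: sum(weights.get(f, 0.0) for f in features(t)) for t in types}
    z = sum(math.exp(s) for s in scores.values())
    return {t: math.exp(s) / z for t, s in scores.items()}

# hypothetical context: unknown word after a determiner ("the ___ of");
# trained weights would favour a noun type over a verb type here
weights = {"prev=det&type=n_-_c-n": 2.0, "prev=det&type=v_np": -1.0}
features = lambda t: [f"prev=det&type={t}"]

probs = maxent_probs(features, weights, ["n_-_c-n", "v_np"])
print(max(probs, key=probs.get))  # n_-_c-n
```

Training the weights θ_i (e.g. with L-BFGS on treebank data) is omitted; only the prediction step is shown.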


“Words-with-spaces” vs. compositional approaches

Words-with-spaces approach [Zhang et al., 2006]: assign lexical types to the entire MWE. Grammar coverage significantly improves, but there is a loss in generality for productive MWEs.

Compositional approach: assign new lexical entries for the head word, treating the MWE as compositional. Hopefully the grammar coverage improves without a drop in accuracy.


Experiment

Rank all the MWE candidates according to the three statistical measures (MI, χ2, PE) and select the top 30 MWEs with the highest average ranking. Extract a sub-corpus from BNCf which contains at least one of the MWEs, for evaluation (674 sentences). Use heuristics to extract head words (20 head words). Run lexical acquisition for the head words on the sub-corpus (21 new entries).


Grammar Coverage

                 item #   parsed #   avg. analysis #   coverage %
    ERG             674         48            335.08         7.1%
    ERG + MWE       674        153            285.01        22.7%

The coverage improvement is largely compatible with the results of the "words-with-spaces" approach reported in [Zhang et al., 2006] (about 15%), with a great reduction in the number of lexical entries added.


Grammar Accuracy

The 153 parsed sentences were analysed by hand: 124 (81.0%) of them receive at least one correct/acceptable analysis (comparable to the accuracy reported by [Baldwin et al., 2004]). The parse selection model finds the best analysis in the top 5 for 66% of the cases, and in the top 10 for 75%.


Summary

MWE candidates have been detected with error mining. Different corpora have been compared for the purpose of MWE validation. Different statistical measures have been compared for identifying MWEs. Grammar performance has been evaluated for automated MWE acquisition using a compositional approach.


Outlook

Hand-crafted precision grammars usually face coverage/robustness challenges when applied to unseen data, with unknown words/MWEs, unknown constructions, etc., all over the place. [Baldwin et al., 2004] reported parsing coverage of 18% on unseen BNC data parsed with the ERG, with the majority of parsing failures related to missing lexical entries.

The lexical type prediction model presented above is used to handle unknown words (simplex and MWE) on the fly. With this model, the ERG achieves around 84% parsing coverage on unseen WSJ data.


Outlook

Other "deep" parsing systems:

    LFG     XLE     79.6%  F-score  [Kaplan et al., 2004]
    CCG     C&C    81.86%  F-score  [Clark and Curran, 2007]
    HPSG    Enju   82.64%  F-score  [Sagae et al., 2008]

The aforementioned systems are evaluated on 700 sentences selected from WSJ data (PARC 700), using Grammatical Relations (GR).


Road Map

1. Introduction
2. Acquisition of MWEs: Theoretical Background & Motivation
3. Detection of MWEs candidates
4. Evaluation of the Identification of MWEs (Resources; Comparing Corpora; Comparing Statistical Measures)
5. Extension of the English Resource Grammar with MWEs (Setup; Grammar Performance)
6. Enhancing Robustness of the German Grammar


Background

German has rich morphology, and this also affects the design of the lexicon of the German Grammar (GG; [Crysmann, 2003]): a large amount of linguistic information is encoded as constraints in the feature structures of the various types.


Importance of Linguistic Constraints

Assumption: we have to use the linguistic information contained in these constraints in order to develop linguistically oriented and well-motivated ("deep") lexical acquisition (DLA) methods.


Expanded Atomic Lexical Types

How do we capture the linguistic information from the feature structures of the GG lexical types? Expand the type definitions of the 38 selected atomic types with the relevant linguistic information contained in their feature values. Which information is relevant? Not every feature should be considered: the target type inventory would be too sparse; moreover, not every feature is useful for the DLA process.


Expanded Atomic Lexical Types (cont.)

Solution: perform an extensive linguistic analysis of the features to be considered for DLA, i.e., linguistically motivated DLA.


Relevant Linguistic Features

    Feature                           Values        Meaning
    SUBJOPT                           +             the article for the noun can be omitted (nouns);
    (subject options)                               raising verb (verbs)
                                      -             the noun always goes with an article (nouns);
                                                    non-raising verb (verbs)
    KEYAGR                            c-s-n,        case-number-gender information for nouns, e.g.
    (key agreement)                   c-p-g, ...    underspecified case, singular, neuter;
                                                    underspecified case, plural, underspecified gender
    (O)COMPAGR                        a-n-g,        case-number-gender information for (oblique)
    ((oblique) complement agreement)  d-n-g, etc.   verb complements; case-number-gender of the
                                                    modified noun (for adjectives)
    (O)COMPOPT                        +             the respective (oblique) complement is present
    ((oblique) complement options)    -             the respective (oblique) complement is absent
                                                    (verbs can take a different number of complements)
    KEYFORM                           haben         the auxiliary verb for forming the perfect
                                                    tense is 'haben'
                                      sein          the auxiliary verb is 'sein'

Table: Relevant features used for type expanding


Expanded Lexical Type Example

Before expanding:

    abenteuer-n := count-noun-le &
    [ --SUBJOPT -,
      KEYAGR c-n-n,
      KEYREL "_abenteuer_n_rel",
      KEYSORT situation,
      MCLASS nclass-2_-u_-e ].

After expanding:

    abenteuer-n := count-noun-le_-_c-n-n

(the values of the SUBJOPT and KEYAGR attributes are attached to the original type definition)
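The naming convention of the expansion can be sketched as follows. This is a toy reimplementation of the name-mangling step only; the actual GG machinery operates on TDL feature structures, and the function name is ours:

```python
def expand_type(base_type, feature_values, relevant=("SUBJOPT", "KEYAGR")):
    """Attach the values of the selected relevant features to the atomic
    type name, producing the expanded atomic lexical type."""
    suffix = "_".join(str(feature_values[f])
                      for f in relevant if f in feature_values)
    return f"{base_type}_{suffix}" if suffix else base_type

# the slide's example entry: SUBJOPT -, KEYAGR c-n-n
entry = {"SUBJOPT": "-", "KEYAGR": "c-n-n"}
print(expand_type("count-noun-le", entry))  # count-noun-le_-_c-n-n
```

Restricting `relevant` to a few features is what keeps the expanded type inventory small enough to avoid data sparseness.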


Expanded Lexicon

                               Original lexicon   Expanded lexicon
    Number of lexical types                 386                485
    Atomic lexical types                     38                137
      nouns                                   9                 72
      verbs                                  19                 53
      adjectives                              3                  5
      adverbs                                 7                  7

Table: Expanded atomic lexical types

Such a target type inventory ensures that the learning process will deliver fine-grained linguistic results while not suffering from sparse data problems.


Grammar Performance in Practical Applications

    Corpus         Coverage   Accuracy
    FR                8.89%        85%
    FR + DLA         21.08%        83%
    deWaC             7.46%          –
    deWaC + DLA      16.95%          –

Table: Coverage results

The coverage for FR improves by more than 12 percentage points. Given that deWaC is an open and unbalanced corpus, the roughly 10-point increase in its coverage is also a significant improvement.


Goal Achieved!

Assumption proven! With our linguistically oriented DLA methods, we have managed to increase parsing coverage while preserving the high accuracy of the grammar.

SLIDE 98

Summary

We have tackled, from a more linguistically oriented point of view, the robustness problem that arises when lexicalised grammars are employed as part of larger processing architectures in real-life applications.
We have shown clearly that missing lexical entries are the main cause of parsing failures, and have thus illustrated the importance of increasing the lexical coverage of lexicalised grammars.
We have also illustrated the importance of morphology in the lexical prediction process for languages like German.

SLIDE 99

Summary

With our linguistically motivated DLA methods, the parsing coverage of lexicalised grammars improves significantly while the linguistic quality of the grammars remains intact.
Since our DLA methods are formalism- and language-independent, it will be interesting, in future research, to apply them to other systems and languages.

SLIDE 100

Appendix For Further Reading

For Further Reading I

Baldwin, T., Bender, E. M., Flickinger, D., Kim, A., and Oepen, S. (2004). Road-testing the English Resource Grammar over the British National Corpus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal.

Clark, S. and Curran, J. (2007). Formalism-independent parser evaluation with CCG and DepBank. In Proceedings of ACL 2007.

Copestake, A. and Flickinger, D. (2000). An open-source grammar development environment and broad-coverage English grammar using HPSG. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece.

Cruse, A. (1986). Lexical Semantics. Cambridge University Press, Cambridge, UK.

Crysmann, B. (2003). On the efficient implementation of German verb placement in HPSG. In Proceedings of RANLP 2003, pages 112–116, Borovets, Bulgaria.

SLIDE 101

For Further Reading II

Evert, S. and Krenn, B. (2005). Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4):450–466.

Fillmore, C., Kay, P., and O'Connor, M. (1988). Regularity and idiomaticity in grammatical constructions. Language, 64:501–538.

Flickinger, D. (2000). On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15–28.

Jackendoff, R. (1997). Twistin' the night away. Language, 73:534–559.

Kaplan, R., Riezler, S., King, T. H., Maxwell, J., and Vasserman, A. (2004). Speed and accuracy in shallow and deep stochastic processing. In Proceedings of HLT-NAACL '04.

Liberman, M. and Sproat, R. (1992). The stress and structure of modified noun phrases in English. Lexical Matters – CSLI Lecture Notes, 24:99–108.

SLIDE 102

For Further Reading III

Lucchesi, C. and Kowaltowski, T. (1993). Applications of finite automata representing large vocabularies. Software Practice and Experience, 23(1):15–30.

Moon, R. (1998). Fixed Expressions and Idioms in English: A Corpus-based Approach. Oxford University Press, Oxford, UK.

Müller, S. and Kasper, W. (2000). HPSG analysis of German. In Wahlster, W., editor, Verbmobil: Foundations of Speech-to-Speech Translation, pages 238–253. Springer-Verlag.

Nunberg, G., Sag, I., and Wasow, T. (1994). Idioms. Language, 70:491–538.

Sagae, K., Miyao, Y., Matsuzaki, T., and Tsujii, J. (2008). Challenges in mapping of syntactic representations for framework-independent parser evaluation. In Proceedings of the Workshop on Automated Syntactic Annotations for Interoperable Language Resources at the First International Conference on Global Interoperability for Language Resources (ICGL '08), Hong Kong.

SLIDE 103

For Further Reading IV

van Noord, G. (2004). Error mining for wide-coverage grammar engineering. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL '04), Main Volume, pages 446–453, Barcelona, Spain.

Wahlster, W., editor (2000). Verbmobil: Foundations of Speech-to-Speech Translation. Artificial Intelligence. Springer.

Zhang, Y. and Kordoni, V. (2006). Automated deep lexical acquisition for robust open text processing. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy.

Zhang, Y., Kordoni, V., Villavicencio, A., and Idiart, M. (2006). Automated multiword expression prediction for grammar engineering. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 36–44, Sydney, Australia. Association for Computational Linguistics.