From dependency structures to LFG representations (Dag Haug, seminar presentation)



slide-1
SLIDE 1

The corpus Conversion LFG101 F-structures C-structure Conclusions

From dependency structures to LFG representations

Dag Haug
Seminar in computational linguistics, April 18

Dag Haug dg2lfg April 18 1 / 57

slide-4
SLIDE 4

The texts

Core: parallel corpus of New Testament translations
  Ancient Greek (original, 1st century AD)
  Gothic (4th century AD)
  Latin (ca. 400 AD)
  Classical Armenian (ca. 400 AD)
  Old Church Slavic (9th century AD)

Extensions:
  Herodotus’ Histories (Greek, 5th century BC)
  Caesar’s Gallic War (Latin, 1st century BC)
  Cicero’s Letters to Atticus (Latin, 1st century BC)
  Peregrinatio Aetheriae (Vulgar Latin, ca. 400 AD)
  Hagiographies (the Slavic Codex Suprasliensis, 11th century AD)

Ultimate goal: a representative corpus of early Indo-European (IE) languages

slide-5
SLIDE 5

Small but beautiful

Language   Tokens
chu         64031
got         56315
grc        137750
lat        120253
xcl         22614
total      400963

(chu = Old Church Slavic, got = Gothic, grc = Ancient Greek, lat = Latin, xcl = Classical Armenian)

slide-10
SLIDE 10

On the languages

Old languages → no native speakers
But fairly well-understood and much-studied texts
Morphologically rich
Non-configurational: grammatical functions are indicated by case rather than word order
All in all quite different from English, which creates lots of problems...

slide-15
SLIDE 15

Workflow for annotation

International team of student annotators
Manual disambiguation of morphology and lemmatization
Syntactic annotation
Review by project members
Advanced annotation done by project members

slide-19
SLIDE 19

Morphology

Verbs inflect for tense, mood, voice, person and number
Nominals inflect for case, number and gender, plus possibly grade and definiteness
All in all, this makes for 1817 unique morphosyntactic description (MSD) tags
In addition, there are 25 POS tags (fairly traditional, with some subdivisions, especially among the pronouns)

slide-25
SLIDE 25

Morphological annotation

Started out with manual disambiguation of alternatives from a transducer
  The transducer ignores the context and offers spurious ambiguities
When we have enough data within a domain, we now use TnT to pretag the text
MSDs are supplemented with lemmatization from the transducer
Skjærholt (2011, 2012):
  Experiment                Token accuracy
  Cross-validation on BG    84.3%
  Vulgate → BG              62.8%
Annotation accuracy goes up and time goes down

slide-32
SLIDE 32

The syntactic annotation scheme: dependency grammar

Information about syntactic relations and word order is stored separately
Reliance on overt elements
Inherent problems with (asyndetic) coordination and structure sharing
Dependency grammar with LFG adjustments:
  Limited set of empty nodes (for asyndetic coordination and ellipsis)
  Secondary dependencies (for structure sharing, incl. control/raising)
  More granular syntactic relations than usual

slide-33
SLIDE 33

Syntactic relations

Label     Function
PRED      Predicate
SUB       Subject
OBJ       Object
OBL       Oblique
AG        Agent
ADV       Adverbial
ATR       Attribute
APOS      Apposition
NARG      Nominal argument
XADV      Free predicative
XOBJ      Open complement
AUX       Auxiliary
XOBJ      Open complement clause
COMP      Complement clause
PART      Partitive
PARPRED   Parenthetical
VOC       Vocative

slide-34
SLIDE 34

Empty nodes

Null conjunctions for asyndetic parataxis
Null verbs for null copulas and elided verbs

slide-35
SLIDE 35

Eliminability of empty nodes

[Figure not preserved in the transcript]

slide-38
SLIDE 38

Human processing

Of which the Belgians inhabit one, the Aquitani [V] another, [C] those who are called Celts in their own language – [C] Gauls [V] in our – [V] the third.

([V] marks an empty verb node and [C] an empty conjunction node.)

slide-39
SLIDE 39

Structure sharing

Subject control: [example not preserved in the transcript]
Object control: [example not preserved in the transcript]
Various other possibilities
Could also be encoded in the label, but typically not with the same precision

slide-40
SLIDE 40

Projectivity

Language   Source               Nonprojective   Projective
Latin      Gallic War                    1887        22717
           Letters to Atticus            2006        20416
           Vulgate                       4217        92186
           Per. Aeth.                    1279        14890
Greek      Herodotus                     6606        56175
           NT                            4377       103418
OCS        Zographensis                    36         1034
           Suprasliensis                  416         7780
           Marianus                      1828        47731
Gothic     NT                            1886        46884
Armenian   NT                            1231        59556
           Koriwn                          48         1556
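As an illustration of what the two columns distinguish, here is a minimal Python sketch of a projectivity test for a single dependency arc. It uses the usual definition (an arc is projective if every word between head and dependent is dominated by the head). The function names and the head-list encoding are assumptions made for this example, and the slides do not say how the corpus counts were actually computed. The test sentence is the mock Latin example used later in the talk, with bonum assumed to depend on Fredericum.

```python
from typing import List


def arc_is_projective(heads: List[int], dep: int) -> bool:
    """heads[i] is the head of token i (tokens are 1-based, 0 is the root)."""
    head = heads[dep]
    if head == 0:
        return True  # simplification: arcs from the artificial root are not checked
    left, right = min(head, dep), max(head, dep)
    for between in range(left + 1, right):
        # Climb from the intervening token towards the root; the arc is
        # projective only if we pass through `head` on the way.
        node = between
        while node != 0 and node != head:
            node = heads[node]
        if node != head:
            return False
    return True


def count_nonprojective(heads: List[int]) -> int:
    """Number of non-projective arcs in one sentence (assumes a well-formed tree)."""
    return sum(not arc_is_projective(heads, dep) for dep in range(1, len(heads)))


# Maximilianus(1) bonum(2) trusit(3) Fredericum(4); bonum depends on Fredericum
mock_latin = [0, 3, 4, 0, 3]
# Same dependencies with the adjective adjacent to its noun:
# Maximilianus(1) bonum(2) Fredericum(3) trusit(4)
contiguous = [0, 4, 3, 4, 0]

print(count_nonprojective(mock_latin))  # 1
print(count_nonprojective(contiguous))  # 0
```

The discontinuous noun phrase yields exactly one non-projective arc (Fredericum → bonum, which crosses the verb).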

slide-44
SLIDE 44

Token alignments

The translations of the NT have been aligned with the Greek original
A ‘dictionary’ based on the likelihood of two words occurring in the same Bible verse
Information from the annotation: syntax, morphology, word order
Manual correction of the Slavic indicates very good results (and a very literal translation):
  Precision   Recall   F-score
  95.27%      92.97%   94.11%
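The reported F-score is consistent with the balanced F-measure (the harmonic mean of precision and recall). The short check below assumes that this is the measure meant; the function name is chosen for this example only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values from the Slavic alignment evaluation, in percent
print(round(f_score(95.27, 92.97), 2))  # 94.11
```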

slide-50
SLIDE 50

Givenness

Givenness tags are based on which context the hearer uses to establish reference:
  Discourse (anaphora) → OLD
  Situation (deixis) → ACC-sit
  Scenarios (inferences) → ACC-inf
  Encyclopedic knowledge → ACC-gen
  No context (no extra-NP information) → NEW

slide-51
SLIDE 51

Modal subordination

Luke 5:39: Und niemand ist, der vom alten trinkt und wolle bald den neuen; denn er spricht: Der alte ist milder. (Roughly: “And there is no one who drinks of the old and straightway wants the new; for he says: the old is milder.”)
The subject and the old and the new wine are embedded under subordination
They should be inaccessible (Karttunen, COLING 69), but they aren’t
We ignore recursive embeddings and use a special tagset for all embedded referents

slide-53
SLIDE 53

Tagset for embedded referents

nonspec (but quant for quantification)
nonspec inf
nonspec old
No counterparts to acc-gen or acc-sit, as these belong in the main DRS by definition

slide-58
SLIDE 58

Interannotator agreement

Towards the end of the NT tagging projects, kappa values were around 0.8 (after long periods of weekly meetings)
New project: Caesar’s Gallic War
  Supervised tagging of 8 chapters (ca. 400 taggables)
  Unsupervised tagging of 5 chapters (ca. 250 taggables)
    κ = 0.66 counting divergences in taggables
    κ = 0.75 on tags set by both annotators
Decent, but much potential for more agreement, especially in taggables
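The slides do not say which kappa statistic was used. Assuming Cohen's kappa for two annotators, the computation can be sketched as follows; the function name and the toy tag sequences are invented for illustration and are not corpus data.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(tags_a: Sequence[str], tags_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who tagged the same items."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty, equally long tag sequences")
    n = len(tags_a)
    # Observed agreement: share of items with identical tags
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Chance agreement from each annotator's own tag distribution
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    # Undefined (division by zero) if both annotators use one identical tag throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["old", "old", "new", "acc-gen", "old", "new"]
annotator_2 = ["old", "new", "new", "acc-gen", "old", "old"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))  # 0.455
```

One way to obtain a figure that also counts divergences in taggables would be to give every token a tag and use a placeholder (say, a hypothetical "none") where an annotator set no tag; restricting the sequences to items tagged by both annotators would correspond to the second figure. Whether the reported values were computed this way is an assumption.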

slide-59
SLIDE 59

Size of IS corpus

Tag             Freq
old            34430
old inact       1395
acc gen         3755
acc inf         2634
acc sit          883
new             5768
kind            1178
non spec        4485
non spec inf     408
non spec old    1799
quant           2021
total          58756

Edge type      Freq
coreference   36650
bridging       2847
total         39497

slide-64
SLIDE 64

Storing linguistic analyses

Theory-neutrality →
  data for larger audiences
  widening gulf between corpus linguistics and linguistic theory
DG corpora (Prague, PROIEL) → DG is not really in use as a linguistic theory
PS corpora (Penn, NEGRA) typically use flatter tree structures than anyone believes in
On the other hand, LFG and HPSG corpora can be hard to use for people who do not share the theoretical assumptions of these theories

slide-68
SLIDE 68

Our take

Principles
  1. Encode no more structure than is common to all frameworks
  2. Encoded structure could be seen as derived/secondary in some frameworks
  3. Encode enough structure to allow reconstruction of theoretically motivated structures

In the ideal situation, the information in the annotation can be (monotonically) expanded to structures conforming to a particular theory by adding information from the assumptions of that theory

slide-71
SLIDE 71

The ideal situation

The added assumptions will typically be about phrase structure, such as various versions of X′ theory
Given information about what the subject is, it will be possible to create a structure where the subject has a specific position if the theory requires that (unless the data contradict the theory)
Useful for hypothesis testing

slide-77
SLIDE 77

Basic principles

Modular: several levels of grammatical description connected by projections (functions)
The c-structure is a tree structure described by a CFG
The f-structure is a set of ordered attribute-value pairs
  The attribute is a grammatical function or feature, and the value is
    a symbol
    a semantic form
    an f-structure
    a set of f-structures (for adjuncts)
Lexical items and CFG rules can contribute f-descriptions
Lexical-functional languages ⊆ context-sensitive languages
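The four kinds of values can be made concrete with a small Python sketch. The encoding is an assumption chosen for illustration: an f-structure is a dict, symbols and semantic forms are strings, and a set of f-structures is a list (dicts are not hashable, so a Python set cannot hold them). The example sentence and its adjunct are invented.

```python
def is_value(value) -> bool:
    """True if `value` is one of the four kinds of f-structure values."""
    if isinstance(value, str):    # a symbol or a semantic form
        return True
    if isinstance(value, dict):   # an embedded f-structure
        return is_fstructure(value)
    if isinstance(value, list):   # a set of f-structures (adjuncts)
        return all(is_fstructure(member) for member in value)
    return False


def is_fstructure(fs) -> bool:
    """An f-structure maps attribute names to admissible values."""
    return isinstance(fs, dict) and all(
        isinstance(attr, str) and is_value(val) for attr, val in fs.items()
    )


example = {
    "pred": "push<subj, obj>",                # semantic form
    "tense": "past",                          # symbol
    "subj": {"pred": "Max"},                  # f-structure
    "obj": {"pred": "Fred"},
    "adj": [{"pred": "yesterday"}],           # set of f-structures
}
print(is_fstructure(example))                 # True
```

Because a dict maps each attribute to exactly one value, this encoding also enforces the functional character of f-structures.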

slide-78
SLIDE 78

Configurational encoding

Annotated c-structure for “Max pushed Fred”:

IP1
  ↑subj=↓  NP2
    ↑=↓  N3  Max  (pred=‘Max’)
  ↑=↓  I′4
    ↑=↓  I5  pushed  (pred=‘push⟨subj, obj⟩’, tense=past)
    ↑obj=↓  NP6
      ↑=↓  N7  Fred  (pred=‘Fred’)

F-description (each number stands for the f-structure of the node with that index):

1 subj = 2
2 = 3
1 = 4
4 = 5
4 obj = 6
6 = 7

slide-79
SLIDE 79

Configurational encoding

Resulting f-structure for “Max pushed Fred” (numbers refer to the c-structure nodes that map to each f-structure):

1,4,5: [ pred   ‘push⟨subj, obj⟩’
         subj   2,3: [ pred ‘Max’ ]
         obj    6,7: [ pred ‘Fred’ ]
         tense  past ]
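The step from the numbered equations to the f-structure can be sketched in Python: equalities merge variables into classes, and attribute equations then fill in one dict per class. This is a deliberately small illustration, not a full LFG solver: it performs no consistency checking, does not handle set membership or existential constraints, and copies shared values instead of sharing them. All names are hypothetical.

```python
class UnionFind:
    """Tracks which f-structure variables have been equated."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def solve(equalities, functions, atoms, root):
    """Build the f-structure of `root` as a nested dict.

    equalities: (a, b)            meaning  a = b
    functions:  (a, attr, b)      meaning  a attr = b
    atoms:      (a, attr, value)  meaning  a attr = value
    """
    uf = UnionFind()
    for a, b in equalities:
        uf.union(a, b)

    def build(var):
        rep = uf.find(var)
        fs = {}
        for v, attr, value in atoms:
            if uf.find(v) == rep:
                fs[attr] = value
        for v, attr, target in functions:
            if uf.find(v) == rep:
                fs[attr] = build(target)  # assumes no cyclic f-structures
        return fs

    return build(root)


# Equations for "Max pushed Fred"; the atoms come from the lexical entries
equalities = [(2, 3), (1, 4), (4, 5), (6, 7)]
functions = [(1, "subj", 2), (4, "obj", 6)]
atoms = [
    (3, "pred", "Max"),
    (5, "pred", "push<subj, obj>"),
    (5, "tense", "past"),
    (7, "pred", "Fred"),
]
result = solve(equalities, functions, atoms, root=1)
print(result)
```

The printed dict has the four attributes pred, tense, subj and obj, with the subj and obj values holding the pred of Max and Fred respectively.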

slide-80
SLIDE 80

Structure sharing

IP1
  ↑subj=↓  NP2
    ↑=↓  N3  Max  (pred=‘Max’)
  ↑=↓  I′4
    ↑=↓  I5  seemed  (pred=‘seem⟨xcomp⟩, subj’, ↑subj = ↑xcomp subj, tense=past)
    ↑xcomp=↓  VP
      V  push  (pred=‘push⟨subj, obj⟩’)
      ↑obj=↓  NP6
        ↑=↓  N7  Fred  (pred=‘Fred’)

Resulting f-structure:

[ pred   ‘seem⟨xcomp⟩, subj’
  subj   [ pred ‘Max’ ]
  xcomp  [ pred  ‘push⟨subj, obj⟩’
           subj  (shared with the subj of ‘seem’)
           obj   [ pred ‘Fred’ ] ]
  tense  past ]

slide-81
SLIDE 81

Non-configurational encoding

S1
  ↑gf=↓  NP2
    ↑=↓  N3  Maximilianus  ((subj ↑), pred=‘Max.’)
  ↑=↓  I4  trusit  (pred=‘push⟨subj, obj⟩’, tense=past)
  ↑gf=↓  NP5
    ↑=↓  N6  Fredericum  ((obj ↑), pred=‘Fred.’)

F-description:

1 gf = 2
2 = 3
∃f. f subj = 3
1 = 4
4 gf = 5
5 = 6
∃f. f obj = 6

slide-82
SLIDE 82

Non-configurational encoding

Resulting f-structure for “Maximilianus trusit Fredericum” (numbers refer to the c-structure nodes S1, NP2, N3, I4, NP5, N6):

1,4: [ pred   ‘push⟨subj, obj⟩’
       subj   2,3: [ pred ‘Max.’ ]
       obj    5,6: [ pred ‘Fred.’ ]
       tense  past ]

slide-83
SLIDE 83

Non-projectivity

A mock Latin example:

Maximilianus     bonum      trusit    Fredericum
Maximilian.nom   good.acc   pushed    Frederick.acc

slide-84
SLIDE 84

Non-projectivity

S1
  ↑gf=↓  NP2
    ↑=↓  N3  Maximilianus  ((subj ↑), pred=‘Max.’)
  ↑gf=↓  NP4
    ↑=↓  N′5
      ↓ ∈ ↑adj  AdjP6
        ↑=↓  Adj7  bonum  (pred=‘good’, case=acc, ((adj ∈ ↑) case)=acc)
  ↑=↓  I8  trusit  (pred=‘push⟨subj, obj⟩’, tense=past)
  ↑gf=↓  NP9
    ↑=↓  N10  Fredericum  ((obj ↑), pred=‘Fred.’, case=acc)

F-description:

1 gf = 2
2 = 3
∃f. f subj = 3
1 gf = 4
4 = 5
6 ∈ 5 adj
7 = 6
7 case = acc
5 case = acc
1 = 8
1 gf = 9
9 = 10
5 case = acc
∃f. f obj = 10

slide-85
SLIDE 85

Non-projectivity

Resulting f-structure (numbers refer to the c-structure nodes; the two discontinuous noun phrases NP4 and NP9 map to the same f-structure):

1,8: [ pred   ‘push⟨subj, obj⟩’
       subj   2,3: [ pred ‘Max.’ ]
       obj    4,5,9,10: [ pred  ‘Fred.’
                          case  acc
                          adj   { 6,7: [ pred ‘good’
                                         case acc ] } ]
       tense  past ]

slide-89
SLIDE 89

The corpus Conversion LFG101 F-structures C-structure Conclusions

Relationship to DG

F-structures and DGs both encode labelled syntactic dependencies

Two major differences

    LFG’s structure sharing runs against DG’s unique head principle
    In DG, every word introduces depth in the graph, whereas multiple words can contribute to the same f-structure (without nesting)

We have already given up the unique head principle in our DG

The words that do not introduce separate layers of f-structure are typically function words, so they can be identified from the labels


Label mapping

Function            Label     LFG
Adverbial           adv       adj
Agent               ag        oblAG
Apposition          apos      adj
Attribute           atr       adj
Auxiliary           aux       -
Complement          comp      comp
Argument of noun    narg      ≈ obl
Object              obj       obj
Oblique             obl       objθ/obl
Parenthetical       parpred   -
Partitive           part      adj
Predicate           pred      -
Subject             sub       subj
Vocative            voc       -
Free predicative    xadv      xadj
Open complement     xobj      xcomp
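
The label mapping can be read as a lookup table. The following is a minimal Python sketch of that reading, not the converter's actual code: the names LABEL_TO_LFG and lfg_functions are hypothetical, the spellings obl_AG and obj_theta are ASCII stand-ins for oblAG and objθ, and labels with no LFG function in the table get an empty tuple.

```python
# Dependency label -> candidate LFG functions, transcribed from the mapping table.
# An empty tuple means the label has no LFG function of its own in the table;
# "obl" keeps both alternatives that the table lists.
LABEL_TO_LFG = {
    "adv": ("adj",),
    "ag": ("obl_AG",),
    "apos": ("adj",),
    "atr": ("adj",),
    "aux": (),
    "comp": ("comp",),
    "narg": ("obl",),              # the table marks this mapping as approximate
    "obj": ("obj",),
    "obl": ("obj_theta", "obl"),
    "parpred": (),
    "part": ("adj",),
    "pred": (),
    "sub": ("subj",),
    "voc": (),
    "xadv": ("xadj",),
    "xobj": ("xcomp",),
}


def lfg_functions(label):
    """Return the candidate LFG functions for a dependency label."""
    return LABEL_TO_LFG[label]
```

Keeping alternatives as a tuple leaves the choice between objθ and obl to a later step, since the table alone does not decide it.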


A simple example

Dependency structure: root --pred--> amat --obj--> puellam --atr--> pulchram

Each node maps to an attribute-value matrix (AVM) with morphological features and a semantic form

    [ pred  ‘pulcher’
      case  acc
      gend  fem ]

The relations are translated to attributes with the dependent’s AVM as value

    [ adj  [ pred  ‘pulcher’
             case  acc
             gend  fem ] ]

We do this for all nodes in the structure

    [ pred  ‘puella’
      case  acc
      gend  fem ]

The AVMs of the head and of the relation + dependent are unified

    [ pred  ‘puella’
      case  acc
      gend  fem
      adj   [ pred  ‘pulcher’
              case  acc
              gend  fem ] ]

The process terminates with the main verb. NB: the dependency label pred is not the f-structure attribute pred

    [ pred  ‘amare⟨subj, obj⟩’ ]

The final result

    [ pred  ‘amare⟨subj, obj⟩’
      obj   [ pred  ‘puella’
              case  acc
              gend  fem
              adj   [ pred  ‘pulcher’
                      case  acc
                      gend  fem ] ] ]
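
The walk-through of amat puellam pulchram can be sketched in Python as a bottom-up recursion. This is only an illustration under stated assumptions: the names unify, to_fstructure, LABEL_TO_ATTR, tree and lexicon are hypothetical, AVMs are plain dicts, adjuncts are collected in a list, and unify is a simplified stand-in for feature-structure unification.

```python
# Only the labels needed for this example; see the label mapping table.
LABEL_TO_ATTR = {"obj": "obj", "atr": "adj"}


def unify(a, b):
    """Unify two AVMs given as nested dicts; atomic values must agree."""
    out = dict(a)
    for key, val in b.items():
        if key not in out:
            out[key] = val
        elif isinstance(out[key], dict) and isinstance(val, dict):
            out[key] = unify(out[key], val)
        elif isinstance(out[key], list) and isinstance(val, list):
            out[key] = out[key] + val          # adjuncts accumulate
        elif out[key] != val:
            raise ValueError(f"clash on {key!r}: {out[key]!r} vs {val!r}")
    return out


def to_fstructure(word, tree, lexicon):
    """Convert the dependency subtree below `word` into an AVM."""
    avm = dict(lexicon[word])                  # semantic form + morphology
    for label, dependent in tree.get(word, []):
        attr = LABEL_TO_ATTR[label]            # relation -> attribute
        dep_avm = to_fstructure(dependent, tree, lexicon)
        value = [dep_avm] if attr == "adj" else dep_avm
        avm = unify(avm, {attr: value})        # head AVM + relation/dependent
    return avm


# The example sentence: head -> [(label, dependent), ...]
tree = {"amat": [("obj", "puellam")], "puellam": [("atr", "pulchram")]}
lexicon = {
    "amat": {"pred": "amare<subj, obj>"},
    "puellam": {"pred": "puella", "case": "acc", "gend": "fem"},
    "pulchram": {"pred": "pulcher", "case": "acc", "gend": "fem"},
}
fstructure = to_fstructure("amat", tree, lexicon)
```

The recursion ends at the main verb, whose own dependency label (pred) contributes no attribute; only its lexical entry does.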


Structure sharing 1

[Figure: dependency graph over dixerunt, et, conversi, Gaius, Aristarchus, comites, Pauli, with edge labels pred, sub, atv, apos, atr; conversi carries the label atv]

Conjunct participles challenge the unique head principle

There are two candidate heads for the participle: the main verb and the participle’s subject


Structure sharing 2

[Figure: the same dependency graph, with conversi now labelled xadv and an additional secondary edge labelled xsub]

With secondary edges we can represent both dependencies


F-structure representation

    [ pred  ‘dico⟨sub, obl, comp⟩’
      subj  . . .
      xadj  [ pred  ‘convertor⟨sub⟩’
              subj  . . . ] ]
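
Structure sharing means that the two subj attributes point to one and the same f-structure rather than to two copies. A small Python sketch of the idea follows, using object identity for sharing. The names build_fstructures, avms and edges are hypothetical, the subject AVM is a stand-in (its content is elided in the slide), and treating the secondary xsub edge as a second subj attribute is an assumption of this sketch.

```python
def build_fstructures(avms, edges):
    """Attach each dependent's AVM to its head under the given attribute.

    The dependent's dict is stored by reference, so a word that is the
    target of both a primary and a secondary edge ends up shared.
    """
    for head, attr, dependent in edges:
        avms[head][attr] = avms[dependent]
    return avms


avms = {
    "dixerunt": {"pred": "dico<sub, obl, comp>"},
    "conversi": {"pred": "convertor<sub>"},
    "et": {"num": "pl"},                 # stand-in for the coordinated subject
}
edges = [
    ("dixerunt", "subj", "et"),          # primary edge sub
    ("dixerunt", "xadj", "conversi"),    # primary edge xadv
    ("conversi", "subj", "et"),          # secondary edge xsub (assumed -> subj)
]
main = build_fstructures(avms, edges)["dixerunt"]
```

Because the subject is shared, any feature later added to it is visible from both the main verb and the participle.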

        


Features in coordination

[Figure: dependency graph over et, Gaius, Aristarchus, comites, Pauli, with edge labels sub, sub, apos, atr]

    [ num  pl
      {  [ adj   [ pred  ‘comes⟨obl⟩’
                   obl   [ pred ‘Paulus’ ] ]
           pred  ‘Gaius’
           num   sg ]

         [ adj   (shared with the adj above)
           pred  ‘Aristarchus’
           num   sg ]  } ]

The adjunct is a distributive feature

Non-distributive features are computed from the set members

Number, gender and person are such features
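
The difference between distributive and non-distributive features can be illustrated with a short Python sketch. It rests on assumptions that the slides do not spell out: the names coordinate and resolve_number are hypothetical, and the resolution rule used here (two or more conjuncts give plural) is only one simple possibility for number; gender and person would need their own rules.

```python
def resolve_number(conjuncts):
    """Non-distributive feature: computed from the set members.

    Assumed rule for this sketch: two or more conjuncts resolve to plural.
    """
    return "pl" if len(conjuncts) > 1 else conjuncts[0]["num"]


def coordinate(conjuncts, shared_adjuncts):
    """Build the f-structure of a coordination.

    Adjuncts attached to the coordination are distributive: every member
    receives them (the same objects, not copies).
    """
    members = [
        dict(c, adj=list(c.get("adj", [])) + list(shared_adjuncts))
        for c in conjuncts
    ]
    return {"num": resolve_number(conjuncts), "members": members}


comites = {"pred": "comes<obl>", "obl": {"pred": "Paulus"}}
coordination = coordinate(
    [{"pred": "Gaius", "num": "sg"}, {"pred": "Aristarchus", "num": "sg"}],
    [comites],
)
```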


Preliminaries

C-structures and DGs contain very different information

Instead of syntactic dependencies, c-structures contain information about

    category
    word order
    word groupings (constituents)

Of these, only word order is present in a DG (assuming there is a precedence order on terminals)

We will see how we can enrich DGs with ‘projections’ that include the other information

The makeup of constituents is a matter of theoretical debate, so we need to introduce theoretical assumptions from LFG


Basic DG

What’s in a DG?

A DG is a tuple ⟨W, r, R_D⟩ where

    W is the set of words, totally ordered by ≺
    R_D is a set of dependency relations that forms a tree over W rooted in r (∈ W)
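
As an illustration of this definition, here is a minimal Python sketch of a DG together with a check of the tree condition. The class and field names are hypothetical, the precedence order is encoded by list position, and the dependency labels in the example are assumed for illustration.

```python
from dataclasses import dataclass


@dataclass
class DependencyGraph:
    words: list   # W, listed in precedence order (the index encodes the order)
    root: int     # index of r in words
    heads: dict   # R_D: dependent index -> (head index, label)

    def is_tree(self):
        """Check that R_D forms a tree over W rooted in r."""
        n = len(self.words)
        # Every word except the root needs exactly one head.
        if set(self.heads) != set(range(n)) - {self.root}:
            return False
        if any(not 0 <= head < n for head, _ in self.heads.values()):
            return False
        # Following heads from any word must reach the root without looping.
        for start in self.heads:
            seen, node = set(), start
            while node != self.root:
                if node in seen:
                    return False
                seen.add(node)
                node = self.heads[node][0]
        return True


words = ["malus", "Maximilianus", "bonum", "trusit", "Fredericum"]
dg = DependencyGraph(
    words, root=3,
    heads={0: (1, "atr"), 1: (3, "sub"), 2: (4, "atr"), 4: (3, "obj")},
)
```

Note that this basic definition enforces the unique head principle; secondary edges would have to be stored separately.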


DG with categories

The basic point is to note that category constraints are in principle independent of other constraints

The classic case is the German Mittelfeld (Bröker 1998)

We can simply extend our model with a class of categories C and a function V_C : W → C

In practice we will use the morphological annotations on the words and map them to a set of theoretically motivated categories

Notice that if we conceive of V_C as a projection, it is different from LFG projections since it embodies linguistic knowledge (the φ function is not similarly restricted)
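
A category function of this kind can be sketched as a lookup from morphological tags to categories. The tag names below ("noun", "adjective", "finite verb") and the names POS_TO_CATEGORY and make_vc are hypothetical and do not reproduce the corpus's actual tag set; the categories N, Adj and I are the ones used in the c-structure examples.

```python
# Hypothetical morphological tags -> c-structure categories.
POS_TO_CATEGORY = {
    "noun": "N",
    "adjective": "Adj",
    "finite verb": "I",
}


def make_vc(pos_of_word):
    """Return V_C : W -> C for a sentence, given each word's morphological tag."""
    def vc(word):
        return POS_TO_CATEGORY[pos_of_word[word]]
    return vc


vc = make_vc({
    "malus": "adjective",
    "Maximilianus": "noun",
    "bonum": "adjective",
    "trusit": "finite verb",
    "Fredericum": "noun",
})
```

The linguistic knowledge sits entirely in the lookup table, which is the sense in which V_C differs from a purely structural projection.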


Order domains (adapted from Bröker 1998)

Definition: The order domain Dw of a word w is the largest subset of W such that

    1 w ∈ Dw
    2 all words in Dw are dominated by w
    3 Dw is continuous, i.e. for any two words in Dw, all words in between are also contained in Dw

Intuitively, the order domain corresponds to all of the node’s dependents that are not ‘displaced’
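
The definition translates directly into a small procedure: start at w and extend the span to the left and right for as long as the neighbouring word is dominated by w. A Python sketch follows, assuming a hypothetical encoding in which heads[i] is the position of the head of word i (None for the root) and domination is taken to be reflexive.

```python
def order_domain(w, heads, words):
    """Return the words in the order domain of the word at position w.

    heads[i] is the position of the head of word i, or None for the root.
    """
    def dominated(i):
        # Walk up the head chain from i; w dominates i if we pass through w.
        while i is not None:
            if i == w:
                return True
            i = heads[i]
        return False

    lo = hi = w
    while lo > 0 and dominated(lo - 1):
        lo -= 1
    while hi + 1 < len(heads) and dominated(hi + 1):
        hi += 1
    return words[lo:hi + 1]


# malus Maximilianus bonum trusit Fredericum, with bonum depending on Fredericum
words = ["malus", "Maximilianus", "bonum", "trusit", "Fredericum"]
heads = [1, 3, 4, None, 3]
```

With this input, bonum is dominated by Fredericum but separated from it by trusit, so it falls outside Fredericum's order domain: it is ‘displaced’.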


Order domain structures

Order domain structure: The set of order domains of all words w ∈ W is a semi-lattice ordered by set inclusion. The join/meet of the semi-lattice is W. Every order domain except W is immediately included in exactly one other order domain, and the order domains are ordered by precedence, so the order domain structure is in effect an ordered tree

These trees are similar to those generated by CFGs, but without the categorial information


Example

C-structure: IP and I′ dominate the following constituents, in surface order

    [NP [N′ [AdjP [Adj malus]] [N Maximilianus]]]
    [NP [N′ [AdjP [Adj bonum]]]]
    [I trusit]
    [NP [N Fredericum]]

Order domain tree, with explicit heads

    {Maximilianus, bonum, trusit, Fredericum}
        {malus, Maximilianus}
            {malus}
                malus
            Maximilianus
        {bonum}
            bonum
        trusit
        {Fredericum}
            Fredericum

Each Bröker node corresponds to an X′′ - X′ - X spine

We can add explicit heads (each w is the head of Dw)

Probably as close as we can come in a pure projection from the DG

What we are lacking is a theory of the internal structure of phrases


Internal structure of phrases

Questions (from Xia 2001)

    1 for a category X, what kind of projections can X have?
    2 if a category Y depends on a category X in a dependency structure, how far should Y project before it attaches to X’s projection?
    3 if a category Y depends on a category X in a dependency structure, to what position on X’s projection chain should Y’s projection attach?


Answers

    1 all categories X project two levels, X′ and XP.
    2 a dependent Y always projects to Y′, then YP, and the YP attaches to the head’s projection
    3 dependents are divided into three types using a set of handwritten rules: specifiers, modifiers and arguments. Specifiers are made sisters of X′ and arguments are made daughters of X. Modifiers Chomsky-adjoin to either X′ or XP depending on whether they are restrictive, as indicated by the dependency edge label (atr or apos).


An algorithm

    L = {}
    function CreateProjection(n)
        D = {}
        for all d: daughters of n do
            put CreateProjection(d) in D
        end for
        for all d ∈ D ∪ L do
            if d is in n’s order domain then
                put/leave d′ in D
            else
                put/leave d in L
            end if
        end for
        make the elements in D daughters of n
    end function
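
One possible Python rendering of this pseudocode is sketched below. It is an interpretation, not the author's implementation: the names create_projection, order_domain and loose are hypothetical, heads[i] gives the position of the head of word i (None for the root), "d is in n's order domain" is read as "the head word of d lies in n's order domain", and d is reused unchanged where the pseudocode writes d′.

```python
def order_domain(w, heads):
    """Positions of the largest contiguous span around w dominated by w."""
    def dominated(i):
        while i is not None:
            if i == w:
                return True
            i = heads[i]
        return False

    lo = hi = w
    while lo > 0 and dominated(lo - 1):
        lo -= 1
    while hi + 1 < len(heads) and dominated(hi + 1):
        hi += 1
    return range(lo, hi + 1)


def create_projection(n, heads, words, loose):
    """Build the order-domain tree below word n; `loose` plays the role of L."""
    # D: the projections of n's dependency daughters.
    attached = [create_projection(d, heads, words, loose)
                for d in range(len(words)) if heads[d] == n]
    domain = order_domain(n, heads)
    candidates = attached + loose          # D ∪ L
    loose.clear()
    attached = []
    for d in candidates:
        # Displaced material stays loose until an ancestor's domain covers it.
        (attached if d["head"] in domain else loose).append(d)
    attached.sort(key=lambda d: d["head"])  # daughters in surface order
    return {"head": n, "word": words[n], "daughters": attached}


words = ["malus", "Maximilianus", "bonum", "trusit", "Fredericum"]
heads = [1, 3, 4, None, 3]
loose = []
projection = create_projection(3, heads, words, loose)
```

In the example, bonum is put in L while Fredericum is processed and is picked up again at trusit, whose order domain covers the whole sentence.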


Adding linguistic knowledge

This algorithm gives us the Bröker trees

We can enrich these with linguistic knowledge

We will use our X′ assumptions, the category mapping and handwritten phrase structure rules

A sample rule

    N:
      :phrase adjuncts:
        - NP
        - AdjP
      :specifier:
        - DP
      :bar adjuncts:
        - AdjP
        - NP
      :complements:
        - NP

We can recursively embed loose nodes under headless structures to achieve the LFG analysis of non-projectivity


Where to add linguistics

    L = {}
    function CreateProjection(n)
        D = {}
        for all d: daughters of n do
            put CreateProjection(d) in D
        end for
        for all d ∈ D ∪ L do
            if d is in n’s order domain then
                put/leave d′ in D
            else
                put/leave d in L
            end if
        end for
        make the elements in D daughters of n
    end function


Contents of D and L, in three successive states

    D: [NP [N′ [AdjP [Adj malus]] [N Maximilianus]]]
    L: (empty)

    D: [NP [N′ [AdjP [Adj malus]] [N Maximilianus]]]  [NP [N′ [N Fredericum]]]
    L: [AdjP [Adj bonum]]

    D: [NP [N′ [AdjP [Adj malus]] [N Maximilianus]]]  [NP [N′ [N Fredericum]]]
    L: [NP [N′ [AdjP [Adj bonum]]]]


The result

C-structure: IP and I′ dominate the following constituents, in surface order

    [NP [N′ [AdjP [Adj malus]] [N Maximilianus]]]
    [NP [N′ [AdjP [Adj bonum]]]]
    [I trusit]
    [NP [N Fredericum]]

Hyperbaton, WH-movement


Summary

We have seen that the PROIEL corpus is a small but deeply annotated corpus

    Morphology
    Syntax
    Information structure
    Discourse (experimental, not shown)

The syntax is as theory-neutral as possible

But conversion is possible and interesting for hypothesis testing

The output could be used as a test suite for implementing an LFG grammar

It can also make the data more widely available to researchers in other frameworks


Outlook

The New Testament text is available for many low-resource languages

The fine-grained reference system (book, chapter, verse) makes alignment feasible

We will experiment with annotation transfer

    Cooperation with the Linguistic Data Consortium at Penn: alignment, comparison, annotation transfer with phrase structure-based NT corpora
    Cooperation with Iceland and Språkbanken in Gothenburg: alignment and annotation transfer between annotated and unannotated Nordic Bible texts (Old Swedish, Icelandic, possibly Old Finnish)


Availability

The corpus is available for everyone to use. We publish XML files with raw data as well. All our data is released under a Creative Commons license. Visit http://www.hf.uio.no/ifikk/proiel/ for details.
