The PROIEL corpora

Dag Trygve Truslew Haug
Milan, 4 June 2019

Dag Haug PROIEL Milan, 4 June 2019 1 / 23

The background

- A corpus for linguists: focus on making the most of a limited data set for linguistic research
- Pragmatic Resources in Old Indo-European Languages (PROIEL, 2008-2012)
  - word order
  - anaphoric expressions
  - definiteness
  - participles (background events)
  - discourse particles
- The corpus should help this research, but also be useful for others
- Annotation continues (with fewer resources)

Texts

- NT and translations (Greek, Latin, Gothic, Armenian, Old Church Slavonic (OCS))
- Latin additional texts:
  - extracts from the Gallic War, Letters to Atticus, De officiis
  - Peregrinatio Aetheriae, Palladius’ Opus Agriculturae
  - 225,064 tokens
  - Petronius’ Satyricon (32,020 tokens) annotated but not reviewed

Cooperation with other projects

Many projects have used our platform for annotation of other texts

- Poetic Edda (Greinir skáldskapar)
- Old Norwegian (Mediaeval Nordic Text Archive)
- Old Swedish (Språkbanken in Gothenburg)
- Mediaeval English and Romance (ISWOC)
- Old Slavic texts (TOROT)
- Rigveda just starting (Erica Biagetti)

Adds up to a sizeable knowledge base on ancient and medieval Indo-European languages!

The PROIEL annotation

Many-layered annotation:

- Morphological annotation
- Syntactic annotation (dependency/LFG-based)
- Semantic and other customised annotation (e.g. animacy)
- Annotation of information structure and anaphoric links
- Token alignments

Workflow for annotation

- International team of student annotators
- Manual disambiguation of morphology and lemmatization
- Syntactic annotation
- Review by project members
- Advanced annotation by project members

Morphology

- All standard categories of Latin annotated
- Fully lemmatized
- No known compatibility issues with other tagsets
- Preprocessed with statistical taggers and checked manually
- Interesting non-standard forms in the postclassical texts

Dependency Grammar 1

- Dependencies are asymmetric relations between words
- We label these dependencies with the function of the dependent
- The dependencies form a tree under an abstract root
- No explicit constituency
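
The four properties above can be illustrated with a minimal sketch. The example sentence, the `Token` class and the relation labels (`pred`, `sub`, `obj`) are assumptions chosen for illustration; they are not taken from the slides and are not claimed to match the exact PROIEL tagset.

```python
from dataclasses import dataclass

ROOT = 0  # abstract root: not a word of the sentence


@dataclass(frozen=True)
class Token:
    id: int
    form: str
    head: int      # id of the unique head (ROOT for the top word)
    relation: str  # function of the dependent, e.g. "sub"


# Hypothetical example sentence: "Caesar Gallos vicit" (Caesar defeated the Gauls).
sentence = [
    Token(1, "Caesar", head=3, relation="sub"),
    Token(2, "Gallos", head=3, relation="obj"),
    Token(3, "vicit", head=ROOT, relation="pred"),
]


def is_tree(tokens: list[Token]) -> bool:
    """True if every token reaches ROOT by following heads without a cycle."""
    heads = {t.id: t.head for t in tokens}
    for start in heads:
        seen = set()
        node = start
        while node != ROOT:
            if node in seen or node not in heads:
                return False  # cycle or dangling head
            seen.add(node)
            node = heads[node]
    return True


def dependents(tokens: list[Token], head_id: int) -> list[Token]:
    """Dependents of a head; word groups are implicit, never stored."""
    return [t for t in tokens if t.head == head_id]
```

Each token stores exactly one head and one label, so the asymmetry and the labelling are part of the data structure itself. Constituents are never stored and can only be derived by collecting dependents.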

Dependency Grammar 2

- Inherent limitations: unique head and overt tokens
- Other Latin treebanks live with this, but for our purposes we could not
  - ⇒ introduction of structure sharing (from LFG)
  - ⇒ explicit empty nodes
- In itself a monotonic increase of information compared to LDT/ITTB, but it led to some different annotation decisions which make conversion non-trivial (but possible!)
- Also, more fine-grained syntactic relations (easily removable in one direction)
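
A rough sketch of why the extra information is easy to remove but hard to restore. The `Node` class, the example sentence, the labels (`xobj`, `xsub`) and the `COARSE` mapping are illustrative assumptions, not the project's actual data model or conversion rules.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int
    form: str
    head: int      # primary (unique) head, 0 = abstract root
    relation: str
    # Secondary edges as (target_id, relation): the structure-sharing layer.
    secondary: list[tuple[int, str]] = field(default_factory=list)


# Hypothetical control example: "Caesar venire vult" (Caesar wants to come).
# The understood subject of the infinitive is shared with the matrix subject.
graph = [
    Node(1, "Caesar", head=3, relation="sub"),
    Node(2, "venire", head=3, relation="xobj", secondary=[(1, "xsub")]),
    Node(3, "vult", head=0, relation="pred"),
]

# Assumed many-to-one mapping from fine-grained to coarser labels.
COARSE = {"xobj": "obj", "xadv": "adv"}


def to_plain_tree(nodes: list[Node]) -> list[tuple[int, str, int, str]]:
    """Drop secondary edges and coarsen labels: the easy direction."""
    return [
        (n.id, n.form, n.head, COARSE.get(n.relation, n.relation))
        for n in nodes
    ]


plain = to_plain_tree(graph)
```

Going the other way would mean deciding which `obj` edges were originally `xobj` and where arguments are shared, and the plain tree does not record either.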

Empty nodes

Empty nodes appear in the analysis of some very frequent phenomena in Latin

- Null conjunctions for asyndetic parataxis
- Null verbs for null copulas and elided verbs
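
As a hedged illustration of the two cases, the sketch below encodes one null verb and one null conjunction. The `Node` class, the sort codes `"V"` and `"C"`, the relation labels and both analyses are assumptions made for the example, not the corpus's actual annotation of these phrases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    id: int
    form: Optional[str]               # None marks an empty (non-overt) node
    head: int                         # 0 = abstract root
    relation: str
    empty_sort: Optional[str] = None  # assumed codes: "V" verb, "C" conjunction


# Null copula, hypothetical example "vita brevis" (life [is] short).
null_copula = [
    Node(1, "vita", head=3, relation="sub"),
    Node(2, "brevis", head=3, relation="xobj"),
    Node(3, None, head=0, relation="pred", empty_sort="V"),
]

# Asyndetic parataxis, "veni vidi vici": a null conjunction heads the conjuncts.
asyndeton = [
    Node(1, "veni", head=4, relation="pred"),
    Node(2, "vidi", head=4, relation="pred"),
    Node(3, "vici", head=4, relation="pred"),
    Node(4, None, head=0, relation="pred", empty_sort="C"),
]


def surface(nodes: list[Node]) -> str:
    """The overt text: empty nodes contribute nothing to the string."""
    return " ".join(n.form for n in nodes if n.form is not None)


def empty_nodes(nodes: list[Node]) -> list[Node]:
    """All non-overt nodes in a sentence."""
    return [n for n in nodes if n.form is None]
```

In both toy analyses the overt words depend on a node that has no surface form, which a tree restricted to overt tokens cannot express.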

Human processing

Secondary dependencies 1: Control

Secondary dependencies 2: Shared arguments

Secondary dependencies 3: Ellipsis

Too few empty nodes

- Syntactic structure cannot actually be reduced to relations between words
- In retrospect we erred on the side of conservatism by not using more empty nodes
- Makes LiLa’s life easier!

Semantic annotation – animacy

- Tags: HUMAN, ORG, ANIMAL, VEH, CONC, PLACE, NONCONC, TIME
- 1745 Latin nominal lemmata tagged (out of 6612 total, roughly 26%)
- Mainly the Biblical language

Givenness

Givenness tags based on which context the hearer uses to establish reference

- Discourse (anaphora) → OLD
- Situation (deixis) → ACC-sit
- Scenarios (inferences) → ACC-inf
- Encyclopedic knowledge → ACC-gen
- No context (no extra-NP information) → NEW

In Latin, this is available for Peregrinatio, and parts of the Letters to Atticus and the Gallic War
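
The context-to-tag scheme amounts to a lookup table. A minimal sketch follows; the key names and the `givenness_tag` helper are hypothetical, and only the tag values come from the slide.

```python
# Context the hearer uses to establish reference -> givenness tag.
# Key names are illustrative assumptions; tag values are from the slide.
GIVENNESS_BY_CONTEXT = {
    "discourse": "OLD",         # anaphora
    "situation": "ACC-sit",     # deixis
    "scenario": "ACC-inf",      # inferences
    "encyclopedic": "ACC-gen",  # general knowledge
    None: "NEW",                # no context beyond the NP itself
}


def givenness_tag(context):
    """Return the tag for a context type, rejecting unknown types."""
    try:
        return GIVENNESS_BY_CONTEXT[context]
    except KeyError:
        raise ValueError(f"unknown context type: {context!r}") from None
```

For example, a referent resolved through an earlier mention would get `givenness_tag("discourse")`, while a first mention with no supporting context gets `givenness_tag(None)`.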

Dag Haug PROIEL Milan, 4 June 2019 18 / 23

slide-33
SLIDE 33

Dag Haug PROIEL Milan, 4 June 2019 19 / 23

slide-34
SLIDE 34

Watch the null node!

Alignment: Translating participles in the Vulgate

- Our NT translations are aligned with the original Greek
- Automatic alignment of high quality, and hand-corrected for many languages (not Latin!)
- A useful tool to explore syntax and translation strategies

[Figure panels: Translation w/ participle; Translation w/ imperative]
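
One way such an alignment could be used is to count how Greek participles are rendered in the Latin. The sketch below is a toy under stated assumptions: the `AlignedPair` record and the three rows are typed by hand for illustration and are not a corpus export or the project's query tooling.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    greek_form: str
    greek_mood: str
    latin_form: str
    latin_mood: str


# Hand-typed toy rows for illustration only, not a corpus export.
pairs = [
    AlignedPair("πορευθέντες", "participle", "euntes", "participle"),
    AlignedPair("πορευθέντες", "participle", "ite", "imperative"),
    AlignedPair("λέγει", "indicative", "dicit", "indicative"),
]


def translation_strategies(rows, source_mood="participle"):
    """Count target-side moods for source tokens with the given mood."""
    return Counter(r.latin_mood for r in rows if r.greek_mood == source_mood)


strategies = translation_strategies(pairs)
```

On real aligned data the same grouping would show how often the translator keeps a participle and how often another construction is chosen instead.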

Accessibility

Data:

- XML exports containing morphology, syntax, information status and alignment are available at https://proiel.github.io/
- Semantic annotation so far only on request

Tools:

- Simple query interface at http://foni.uio.no:3000
- Syntactic query interface at INESS, Bergen
- Command line interface for working with the files
- Little for the computationally illiterate, but http://syntacticus.org is a start
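
For readers who prefer a script to the command line interface, an export could be read with the Python standard library roughly as follows. The sample document is hand-written and simplified, and its element and attribute names (`sentence`, `token`, `head-id`, `relation`, `empty-token-sort`) are assumptions about the export format that should be checked against the published schema.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written sample. Element and attribute names are
# assumptions about the export format, not a verified schema.
SAMPLE = """
<proiel>
  <source id="example" language="lat">
    <div>
      <sentence id="1">
        <token id="1" form="vita" lemma="vita" head-id="3" relation="sub"/>
        <token id="2" form="brevis" lemma="brevis" head-id="3" relation="xobj"/>
        <token id="3" empty-token-sort="V" relation="pred"/>
      </sentence>
    </div>
  </source>
</proiel>
"""


def read_tokens(xml_text):
    """Flatten every token element into a dict, one row per token."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for sentence in root.iter("sentence"):
        for token in sentence.iter("token"):
            rows.append({
                "sentence": sentence.get("id"),
                "id": token.get("id"),
                "form": token.get("form"),      # None for empty nodes
                "lemma": token.get("lemma"),
                "head": token.get("head-id"),   # None for the top node
                "relation": token.get("relation"),
                "empty": token.get("empty-token-sort"),
            })
    return rows


tokens = read_tokens(SAMPLE)
overt = [t["form"] for t in tokens if t["form"] is not None]
```

Missing attributes come back as `None`, so empty nodes are kept as rows instead of being silently skipped.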

Interoperability?

- All the Latin treebanks are converted to UD (Universal Dependencies)
- Useful for comparing annotations, but a lot of harmonization needed
- And the conversion is very lossy (especially for PROIEL)

Future plans

Pending funding...

- add texts (finish Satyricon, Plautus)
- add languages: ideally get good coverage of the various IE branches
- develop Syntacticus as a reading portal
