SLIDE 1

Linked Open Treebanks

Latin treebanks in the LiLa Knowledge Base
Francesco Mambrini and Marco Passarotti
{francesco.mambrini}{marco.passarotti}@unicatt.it
SyntaxFest – TLT 2019 | Paris | August 29, 2019

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme - Grant Agreement No. 769994.

SLIDE 2

Table of Contents

◮ Introduction
  ◮ Latin treebanks
  ◮ The LiLa Knowledge Base
◮ Populating LiLa
  ◮ Lemmas
  ◮ Treebanks
◮ Potential use cases
◮ Conclusions

  • F. Mambrini & M. Passarotti | LiLa – Linking Latin

SLIDE 3

4 treebanks of Latin

◮ Latin Dependency Treebank (2006-): Classical Latin, prose and poetry, about 50k tokens;
◮ Index Thomisticus Treebank (2006-): Medieval Latin, a single author (Thomas Aquinas), about 400k tokens;
◮ PROIEL (2008): Late and Classical prose, translation of the New Testament (Jerome’s Vulgate, 4th century CE), plus other prose, about 250k tokens;
◮ Late Latin Charter Treebank (2011-): 8th-9th century notarial documents (charters) from Central Italy, about 250k tokens.

SLIDE 6

Aims

◮ Create a Knowledge Base of linguistic resources for Latin
  ◮ corpora
  ◮ lexicons
  ◮ NLP tools
◮ Create common vocabularies to describe them
◮ Use the Linked Open Data (LOD) paradigm

SLIDE 7

The lemma

a gateway to Latin linguistic resources

[Diagram: Lemmas at the centre, connected to Lexical Entries, Tokens and NLP Output]

Textual Resources

  • Digital libraries
  • Treebanks
  • Textual corpora...

NLP Tools

  • Tokenizers
  • Taggers/parsers
  • Lemmatizers...

Lexical Resources

  • Latin Wordnet
  • Valency Lexicon
  • Dictionaries...

SLIDE 8

LEMLAT: the foundation stone

http://www.lemlat3.eu/

◮ 43,432 lemmas from Georges, 1913-1918; OLD and Gradenwitz, 1904;
◮ 82,556 lemmas from Du Cange, 1883-1887;
◮ 26,250 lemmas from Forcellini, 1940.

SLIDE 9

Towards an ontology of Latin lemmas

[Diagram: RDF graph for the lemma amo, read as triples]

Lemma rdfs:subClassOf ontolex:Form
lemma:2012 a Lemma
lemma:2012 a olia:Verb
lemma:2012 ontolex:writtenRep amo
lemma:2012 vocab:lemmario_upostag VERB
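
As a rough illustration of the lemma description on this slide, the node for amo can be written down as subject-predicate-object triples. The sketch below uses plain Python tuples rather than an RDF library. The prefixed names are copied from the slide as strings; which node each edge attaches to is our reading of the diagram, and `LEMMA_TRIPLES` and `describe` are hypothetical names introduced only for this example.

```python
# Triples for the lemma node shown on the slide, kept as plain strings.
# Assumption: the edge-to-node assignment follows our reading of the diagram.
LEMMA_TRIPLES = [
    ("Lemma", "rdfs:subClassOf", "ontolex:Form"),
    ("lemma:2012", "a", "Lemma"),
    ("lemma:2012", "a", "olia:Verb"),
    ("lemma:2012", "ontolex:writtenRep", "amo"),
    ("lemma:2012", "vocab:lemmario_upostag", "VERB"),
]


def describe(subject, triples):
    """Collect predicate -> list of objects for one subject (hypothetical helper)."""
    description = {}
    for s, p, o in triples:
        if s == subject:
            description.setdefault(p, []).append(o)
    return description


amo = describe("lemma:2012", LEMMA_TRIPLES)
print(amo["ontolex:writtenRep"])  # ['amo']
print(amo["a"])                   # ['Lemma', 'olia:Verb']
```

Keeping the lemma as a node with its own typed properties is what would let tokens and lexical entries point to it later.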

SLIDE 10

Workflow

◮ start from a shallow conversion from treebank (TB) format to RDF triples
◮ compare the string of the lemmatized token with the written representation(s) of a LEMLAT lemma
◮ link the token to the lemma via the hasLemma property
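
The three workflow steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: triples are plain Python tuples, `LEMMA_BANK` is a hypothetical two-entry stand-in for the LEMLAT lemma collection, and `token_to_triples` and `link_token` are names invented for this example. The identifiers and property names are those that appear elsewhere in the slides.

```python
# Hypothetical lemma bank: lemma id -> written representation(s).
# Only two entries, using identifiers that appear in the slides.
LEMMA_BANK = {
    "lemma:20369": ["infernus"],
    "lemma:2012": ["amo"],
}


def token_to_triples(token_id, form, lemma, upos):
    """Step 1: shallow conversion of one treebank token into triples."""
    return [
        (token_id, "a", "nif:Word"),
        (token_id, "conll:WORD", form),
        (token_id, "conll:LEMMA", lemma),
        (token_id, "conll:UPOS", upos),
    ]


def link_token(triples, lemma_bank):
    """Steps 2-3: compare the lemma string, then add hasLemma links."""
    links = []
    for subject, predicate, lemma_string in triples:
        if predicate != "conll:LEMMA":
            continue
        for lemma_id, written_reps in lemma_bank.items():
            # Exact string comparison against each written representation.
            if lemma_string in written_reps:
                links.append((subject, "hasLemma", lemma_id))
    return triples + links


token = token_to_triples("proiel:s17835_6", "inferni", "infernus", "NOUN")
linked = link_token(token, LEMMA_BANK)
print(linked[-1])  # ('proiel:s17835_6', 'hasLemma', 'lemma:20369')
```

A plain string comparison like this would link a token to every lemma that shares the same written representation; how such homographs are resolved is not stated on the slide.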

SLIDE 11

Linking corpora and lemmas

[Diagram: RDF graph linking a PROIEL token to LiLa lemmas; node classes shown: nif:Sentence, nif:Word, Lemma, Suffix, olia:CommonNoun]

Token proiel:s17835_6

  • conll:WORD inferni
  • conll:LEMMA infernus
  • conll:UPOS NOUN
  • conll:EDGE conj
  • conll:HEAD proiel:17835_4
  • conll:MISC ref=REV_1.18
  • isPartOfSent proiel:s17835_0
  • hasLemma lemma:20369

Sentence proiel:s17835_0

  • nif:nextSentence proiel:s17836_0

Lemma nodes: lemma:20369 (ontolex:writtenRep infernus), lemma:arcanus, lemma:inferiae, lemma:infernalis, together with base:639 and suffix:7. Other edges shown: hasBase, hasSuffix, rdfs:label.

SLIDE 18

A wealth of interlinked information that can be queried!

[Diagram: graph connecting Token 1, Token 2, Token 3 and Token 4 with Lemma 1, Lemma 2 and Lemma 3, plus Synset and Morph nodes]
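
To make the idea of querying through the lemma concrete, here is a small sketch that walks from one token to other tokens via a shared synset. Everything in it is assumed for illustration: the toy graph, the identifiers, the `inSynset` property name, and the helper functions are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical toy graph: every identifier below is made up for illustration.
TRIPLES = [
    ("tokenA", "hasLemma", "lemma1"),
    ("tokenB", "hasLemma", "lemma1"),
    ("tokenC", "hasLemma", "lemma2"),
    ("lemma1", "inSynset", "synset1"),
    ("lemma2", "inSynset", "synset1"),
]


def index_by_predicate(triples):
    """Build predicate -> {subject: [objects]} for quick lookups."""
    index = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        index[p][s].append(o)
    return index


def tokens_sharing_synset(token, triples):
    """Walk token -> lemma -> synset, then back to every other token."""
    index = index_by_predicate(triples)
    synsets = {
        synset
        for lemma in index["hasLemma"][token]
        for synset in index["inSynset"][lemma]
    }
    related = set()
    for other, lemmas in index["hasLemma"].items():
        if other == token:
            continue
        if any(s in synsets for lemma in lemmas for s in index["inSynset"][lemma]):
            related.add(other)
    return sorted(related)


print(tokens_sharing_synset("tokenA", TRIPLES))  # ['tokenB', 'tokenC']
```

The tokens never point at each other directly; they are related only because their lemmas meet in another resource, which is the role the slides assign to the lemma.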

SLIDE 19

Querying with SPARQL

All verbs that govern subjects formed with affix “-(t)or”
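
The SPARQL query itself is not reproduced in this text. As a stand-in, the Python sketch below runs the same kind of graph pattern over an in-memory list of triples: find subject tokens whose lemma carries the suffix, then return the lemma of the governing verb. The token and suffix identifiers, the `nsubj` edge label and the helper names are assumptions made for this example; the toy data is only loosely modelled on the Cicero example in these slides.

```python
# Toy data; identifiers, the "nsubj" label and the suffix id are assumptions.
TRIPLES = [
    ("tok:1", "conll:WORD", "gladiatores"),
    ("tok:1", "conll:EDGE", "nsubj"),
    ("tok:1", "conll:HEAD", "tok:3"),
    ("tok:1", "hasLemma", "lemma:gladiator"),
    ("tok:2", "conll:WORD", "audio"),
    ("tok:2", "hasLemma", "lemma:audio"),
    ("tok:3", "conll:WORD", "pugnare"),
    ("tok:3", "conll:UPOS", "VERB"),
    ("tok:3", "hasLemma", "lemma:pugno"),
    ("lemma:gladiator", "hasSuffix", "suffix:tor"),
    ("suffix:tor", "rdfs:label", "-(t)or"),
]


def objects(subject, predicate, triples):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def verbs_governing_suffixed_subjects(suffix_label, triples):
    """Return lemmas of verbs whose subject's lemma carries the given suffix."""
    verbs = set()
    for token, predicate, edge in triples:
        if predicate != "conll:EDGE" or edge != "nsubj":
            continue
        # Derivational side: token -> lemma -> suffix -> label.
        has_suffix = any(
            suffix_label in objects(suffix, "rdfs:label", triples)
            for lemma in objects(token, "hasLemma", triples)
            for suffix in objects(lemma, "hasSuffix", triples)
        )
        if not has_suffix:
            continue
        # Syntactic side: token -> head, keeping only verbal heads.
        for head in objects(token, "conll:HEAD", triples):
            if "VERB" in objects(head, "conll:UPOS", triples):
                verbs.update(objects(head, "hasLemma", triples))
    return sorted(verbs)


print(verbs_governing_suffixed_subjects("-(t)or", TRIPLES))  # ['lemma:pugno']
```

The query combines two layers that a treebank alone does not hold together: dependency relations from the annotation and derivational information attached to the lemma.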

SLIDE 20

Sample of results from PROIEL

from Cicero’s Letters to Atticus

(1) gladiatores        audio     pugnare    mirifice
    gladiators.ACC.PL  hear.1SG  fight.INF  superbly
    I hear that your gladiators fight superbly. (Cic. Att. 4.4a.2)

SLIDE 21

Wordcloud of results from the Index Thomisticus

“the Interpreter (of Aristotle, i.e. Averroes) says...”

SLIDE 25

Conclusions

and future work

◮ Language is complex! Morpho-syntactic description is not enough to capture all complexities
◮ LOD provides a way to link treebank annotation and information on other levels (semantics, derivational morphology...)
◮ a lexically based approach (using lemmas as hub nodes) is one way to do it!
◮ but (future work)...
  ◮ we need to harmonize the tagsets (ontologies)
  ◮ we need to find sustainable, scalable solutions together with the projects that own and maintain the resources

SLIDE 26

Thanks!

Get in touch

The LiLa Team

Università Cattolica del Sacro Cuore
CIRCSE Research Centre
Largo Gemelli 1, 20123 Milan, Italy
info@lila-erc.eu
https://lila-erc.eu
https://github.com/CIRCSE
@ERC_LiLa

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme - Grant Agreement No. 769994.
