Linked Open Treebanks: Latin treebanks in the LiLa Knowledge Base

  1. Linked Open Treebanks: Latin treebanks in the LiLa Knowledge Base
     Francesco Mambrini and Marco Passarotti, {francesco.mambrini, marco.passarotti}@unicatt.it
     SyntaxFest – TLT 2019 | Paris | August 29, 2019
     This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme - Grant Agreement No. 769994.

  2. Table of Contents
     ◮ Introduction
     ◮ Latin treebanks
     ◮ The LiLa Knowledge Base
     ◮ Populating LiLa
     ◮ Lemmas
     ◮ Treebanks
     ◮ Potential use cases
     ◮ Conclusions
     F. Mambrini & M. Passarotti | LiLa – Linking Latin

  3. 4 treebanks of Latin
     ◮ Latin Dependency Treebank (2006-): Classical Latin, prose and poetry, about 50k tokens;
     ◮ Index Thomisticus Treebank (2006-): Medieval Latin, only 1 author (Thomas Aquinas), about 400k tokens;
     ◮ PROIEL (2008): Late and Classical prose, translation of the NT (Jerome’s Vulgate, 4th c. CE) plus other prose, about 250k tokens;
     ◮ Late Latin Charter Treebank (2011-): 8th-9th century notary documents (charters) from Central Italy, about 250k tokens.

  6. Aims
     ◮ Create a Knowledge Base of linguistic resources for Latin:
         ◮ corpora
         ◮ lexicons
         ◮ NLP tools
     ◮ Create common vocabularies to describe them
     ◮ Use the Linked Open Data (LOD) paradigm

  7. The lemma: a gateway to Latin linguistic resources
     [Diagram: Lemmas sit at the centre and connect to Lexical Entries, Tokens, and NLP Output.]
     ◮ Lexical Resources (Lexical Entries): Latin Wordnet, Valency Lexicon, Dictionaries...
     ◮ Textual Resources (Tokens): Digital libraries, Treebanks, Textual corpora...
     ◮ NLP Tools (NLP Output): Tokenizers, Taggers/parsers, Lemmatizers...

  8. LEMLAT: the foundation stone (http://www.lemlat3.eu/)
     ◮ 43,432 lemmas from Georges, 1913-1918; OLD and Gradenwitz, 1904;
     ◮ 82,556 lemmas from Du Cange, 1883-1887;
     ◮ 26,250 lemmas from Forcellini, 1940.

  9. Towards an ontology of Latin lemmas
     [Diagram. Recoverable labels: class Lemma, rdfs:subClassOf ontolex:Form; node lemma:2012, typed (a) as Lemma and as olia:Verb, with ontolex:writtenRep "amo" and vocab:lemmario_upostag "VERB".]
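The lemma node on this slide can be written out as plain triples. The sketch below is one plausible reading of the diagram, not the LiLa data model itself: which label attaches to which edge is an assumption, the `LemmaNode` class is a hypothetical helper, and the prefixes are left unexpanded because the slide gives no namespace IRIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LemmaNode:
    """Hypothetical container for the properties shown on the slide."""
    iri: str
    written_rep: str
    upostag: str
    pos_class: str

    def triples(self):
        # Edge assignment is an assumption read off the flattened diagram.
        return [
            (self.iri, "a", "Lemma"),
            (self.iri, "a", self.pos_class),
            (self.iri, "ontolex:writtenRep", f'"{self.written_rep}"'),
            (self.iri, "vocab:lemmario_upostag", f'"{self.upostag}"'),
        ]


# Class-level statement from the slide: Lemma is a subclass of ontolex:Form.
SCHEMA = [("Lemma", "rdfs:subClassOf", "ontolex:Form")]

amo = LemmaNode("lemma:2012", "amo", "VERB", "olia:Verb")

# Turtle-like lines with prefixed names only (no @prefix declarations).
lines = [f"{s} {p} {o} ." for s, p, o in SCHEMA + amo.triples()]
print("\n".join(lines))
```

Keeping the lemma as a small record and deriving the triples from it makes it easy to add further written representations later without touching the serialisation step.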

  10. Workflow
      ◮ Start from a shallow conversion from the treebank (TB) format to RDF triples
      ◮ Compare the string of the lemmatized token with the written representation(s) of a LEMLAT lemma
      ◮ Link the token to the lemma via the hasLemma property
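The three workflow steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the LiLa pipeline: the token IRIs (`ex:tok1` etc.) and the two-entry lemma bank are hypothetical, case-insensitive string comparison is an assumption, and ambiguous matches are simply all kept.

```python
# Miniature lemma bank: lemma IRI -> written representation(s).
# The two IRIs and strings appear on the slides; everything else is illustrative.
LEMMA_BANK = {
    "lemma:2012": ["amo"],
    "lemma:20369": ["infernus"],
}

# Step 1 (assumed already done): shallow conversion of treebank rows into
# (token IRI, word form, lemma string). Token IRIs are hypothetical.
TOKENS = [
    ("ex:tok1", "inferni", "infernus"),
    ("ex:tok2", "amat", "amo"),
    ("ex:tok3", "gladiatores", "gladiator"),  # lemma absent from the mini bank
]


def build_index(bank):
    """Map each written representation (lowercased) to its lemma IRIs."""
    index = {}
    for lemma_iri, reps in bank.items():
        for rep in reps:
            index.setdefault(rep.lower(), []).append(lemma_iri)
    return index


def link_tokens(tokens, bank):
    """Steps 2 and 3: compare lemma strings, then emit hasLemma triples."""
    index = build_index(bank)
    triples, unmatched = [], []
    for token_iri, _form, lemma_string in tokens:
        candidates = index.get(lemma_string.lower(), [])
        if not candidates:
            unmatched.append(token_iri)  # left for manual or later linking
        for lemma_iri in candidates:
            triples.append((token_iri, "hasLemma", lemma_iri))
    return triples, unmatched


triples, unmatched = link_tokens(TOKENS, LEMMA_BANK)
for triple in triples:
    print(triple)
print("unmatched:", unmatched)
```

Returning the unmatched tokens separately keeps the string-matching step honest: tokens whose lemma string has no counterpart in the lemma bank are reported rather than silently dropped.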

  11. Linking corpora and lemmas
      [Graph figure. Recoverable labels: sentence nodes proiel:s17835_0 and proiel:s17836_0 (nif:Sentence, joined by nif:nextSentence); token nodes proiel:17835_4 and proiel:s17835_6 (nif:Word, isPartOfSent, conll:HEAD) with conll:WORD "inferni", conll:LEMMA "infernus", conll:UPOS NOUN, conll:EDGE conj, conll:MISC ref=REV_1.18; lemma:20369 (Lemma, olia:CommonNoun, ontolex:writtenRep "infernus") reached via hasLemma; derivational nodes suffix:7 (Suffix, rdfs:label "-n") and base:639, attached through hasSuffix and hasBase to lemmas including lemma:arcanus, lemma:infernalis, and lemma:inferiae.]

  18. A wealth of interlinked information that can be queried!
      [Diagram, built up step by step: nodes Token 1 to Token 4, Lemma 1 to Lemma 3, Synset, and Morph.]

  19. Querying with SPARQL
      Query: all verbs that govern subjects formed with the affix “-(t)or”
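The SPARQL query itself is not preserved in this transcript. The sketch below emulates the same join in plain Python over an in-memory list of triples, only to show which links the query has to follow (suffix label, lemma, token, dependency edge, head). The `ex:` IRIs, the `nsubj` edge label, and the `ex:suffix_tor` node are assumptions for illustration; the property names come from the earlier slides.

```python
TRIPLES = [
    # Hypothetical tokens for "gladiatores audio pugnare mirifice".
    ("ex:t1", "conll:WORD", "gladiatores"),
    ("ex:t1", "conll:EDGE", "nsubj"),      # subject label is an assumption
    ("ex:t1", "conll:HEAD", "ex:t3"),
    ("ex:t1", "hasLemma", "ex:lemma_gladiator"),
    ("ex:t2", "conll:WORD", "audio"),
    ("ex:t2", "conll:UPOS", "VERB"),
    ("ex:t3", "conll:WORD", "pugnare"),
    ("ex:t3", "conll:UPOS", "VERB"),
    ("ex:t4", "conll:WORD", "mirifice"),
    ("ex:t4", "conll:EDGE", "advmod"),
    ("ex:t4", "conll:HEAD", "ex:t3"),
    # Derivational information attached to the lemma, not to the token.
    ("ex:lemma_gladiator", "hasSuffix", "ex:suffix_tor"),
    ("ex:suffix_tor", "rdfs:label", "-(t)or"),
]


def objects(triples, subject, predicate):
    """All o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(triples, predicate, obj):
    """All s such that (s, predicate, obj) is in the graph."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def verbs_governing_suffixed_subjects(triples, suffix_label, subject_edge="nsubj"):
    """Return (verb form, subject form) pairs for subjects built with the suffix."""
    results = []
    for suffix in subjects(triples, "rdfs:label", suffix_label):
        for lemma in subjects(triples, "hasSuffix", suffix):
            for token in subjects(triples, "hasLemma", lemma):
                if subject_edge not in objects(triples, token, "conll:EDGE"):
                    continue  # token is not a subject
                for head in objects(triples, token, "conll:HEAD"):
                    if "VERB" not in objects(triples, head, "conll:UPOS"):
                        continue  # governor is not a verb
                    for verb_form in objects(triples, head, "conll:WORD"):
                        for subj_form in objects(triples, token, "conll:WORD"):
                            results.append((verb_form, subj_form))
    return results


results = verbs_governing_suffixed_subjects(TRIPLES, "-(t)or")
print(results)
```

The point of the example is that the syntactic condition (subject of a verb) lives in the treebank triples while the morphological condition (the suffix) lives in the lemma bank, and the hasLemma link is what lets one query combine them.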

  20. Sample of results from PROIEL, from Cicero’s Letters to Atticus
      (1) gladiatores        audio     pugnare    mirifice
          gladiators.ACC.PL  hear.1SG  fight.INF  superbly
          “I hear that your gladiators fight superbly.” (Cic. Att. 4.4a.2)

  21. Wordcloud of results from the Index Thomisticus
      [Wordcloud figure.] Example: “the Interpreter (of Aristotle, i.e. Averroes) says...”

  25. Conclusions and future work
      ◮ Language is complex! Morpho-syntactic description is not enough to capture all complexities
      ◮ LOD provides a way to link treebank annotation and information on other levels (semantics, derivational morphology...)
      ◮ A lexically based approach (using lemmas as hub nodes) is one way to do it!
      ◮ But (future work)...
          ◮ we need to harmonize the tagsets (ontologies)
          ◮ we need to find sustainable, scalable solutions together with the projects that own and maintain the resources

  26. Thanks! Get in touch
      The LiLa Team, CIRCSE Research Centre, Università Cattolica del Sacro Cuore
      Largo Gemelli 1, 20123 Milan, Italy
      info@lila-erc.eu | https://lila-erc.eu | https://github.com/CIRCSE | @ERC_LiLa
