CHLT Project (IST-2001-32745) Workpackage 5. Neo-Latin



SLIDE 1

Bozzi, Passarotti, CHLT LEMLAT 1

CHLT Project

(IST-2001-32745)

Workpackage 5. Neo-Latin Morphological Analyser

C.N.R.

Istituto di Linguistica Computazionale

Andrea Bozzi Giuseppe Cappelli Marco Passarotti Paolo Ruffolo

SLIDE 2

1. LEMLAT

A Latin morphological analyser for CHLT

SLIDE 3

LEMLAT

  • Collated lexical sources:

– Georges
– Gradenwitz
– Oxford Latin Dictionary
– TLL (partially)

  • Number of entries

– 58147 LES (invariable parts of the inflected forms)

SLIDE 4

The LEMLAT dictionary structure

======= = ============ =======
ID Num. V LES          COD LES
======= = ============ =======
A0014     ABALIENATION N31
A0015   V ABALIEN      V1
A0015     ABALEN       V1
A0016     ABALIUD      I
A0017     ABALTERUTRUM I
A0018     ABAMBUL      V1I
A0019     ABAMIT       N1
A0020     ABANTE       I
A0021   V ABARC        V2
A0021     ABERC        V2
======= = ============ =======

Different LES receive the same ID number if they share a lemma (the lemma is generated from the LES registered with the V code):

A0015 V ABALIEN V1
A0015 ABALEN V1

Lemma: abalieno
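
The grouping rule above can be sketched in a few lines of Python. This is only an illustration, not LEMLAT's actual code: the whitespace-separated row format, the function names, and the rule that a lone LES generates its own lemma are assumptions made for the example.

```python
from collections import defaultdict

# A few rows copied from the slide: ID number, optional "V" flag, LES, COD LES.
ROWS = """\
A0014 ABALIENATION N31
A0015 V ABALIEN V1
A0015 ABALEN V1
A0016 ABALIUD I
A0021 V ABARC V2
A0021 ABERC V2
"""

def parse_row(line):
    """Split one row into (id_num, is_lemma_source, les, cod_les)."""
    fields = line.split()
    if len(fields) == 4 and fields[1] == "V":
        id_num, _, les, cod = fields
        return id_num, True, les, cod
    id_num, les, cod = fields  # rows with extra fields are not handled here
    return id_num, False, les, cod

def group_by_id(text):
    """Collect all LES that share an ID number."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if line.strip():
            id_num, is_source, les, cod = parse_row(line)
            groups[id_num].append((les, cod, is_source))
    return dict(groups)

def lemma_source(entries):
    """Return the V-flagged LES, or the only LES of a one-row group (assumption)."""
    flagged = [les for les, _, is_source in entries if is_source]
    if flagged:
        return flagged[0]
    return entries[0][0] if len(entries) == 1 else None

groups = group_by_id(ROWS)
print(lemma_source(groups["A0015"]))  # ABALIEN
```

Here ABALIEN and ABALEN end up in the same group, and the V flag picks ABALIEN as the LES from which the lemma abalieno is built.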

SLIDE 5

The LEMLAT morphological analysis

Labelled output fields: input form, lemma, segmentation attempts, COD LEM, ID Num.

SLIDE 6

LEMLAT tests

  • Checking of the Decretum Gratiani lemmatization
  • Production of a lexical index for LIE (Lessico Intellettuale Europeo, Roma), in Leibniz texts
  • Lemmatization of the Latin Grammarians Corpus (not published)
  • Lexical resource for the Olissipo Project (University of Lisbon)

SLIDE 7

Why LEMLAT for CHLT?

  • Lexical quantity
  • Graphical variants management
  • Open-source usable tool
SLIDE 8

Comparison LEMLAT/Other Latin morphological analysers 1.

  • Compared analysers

– Words: Version 1.97 by William Whitaker http://www.erols.com/whitaker/words.htm
– Nomen: by Paravia (Italian publishing house)
– Perseus Latin Morphological Analysis: by Perseus Project

SLIDE 9

Comparison LEMLAT/Other Latin morphological analysers 2.

  • Lexical quantity

– LEMLAT: 58147 LES
– Words: 48698 stems
– Nomen: 31903 lemmas
– Perseus: ?

Example

– pardalios

  • LEMLAT: analysed
  • Words: not analysed
  • Nomen: not analysed
  • Perseus: not analysed
SLIDE 10

Comparison LEMLAT/Other Latin morphological analysers 3.

  • Graphical variants management

– vies (form of via: abl., pl. in Corp. Inscr. Lat. 4, 1410)

  • LEMLAT: lemmatized as a form of via, vieo and vio
  • Words: lemmatized as a form of vieo
  • Nomen: lemmatized as a form of vieo and vio
  • Perseus: lemmatized as a form of vio
SLIDE 11

2. What has to be done on LEMLAT to meet CHLT requirements

Aims, means and problems

SLIDE 12

Aims

  • Completing LEMLAT's synthetic morphological analysis with an analytical one, by adding the following items to the LEMLAT lemmatization results:

– new morphological information

aquai

  • LEMLAT: aqu-ai (segmented form), aqua (lemma), n1 (COD LEM)
  • CHLT LEMLAT: aqua (lemma): Common, Noun, I Decl., Gen., Sing., Fem.

– new stylistic and historical-linguistic information

aquai

  • CHLT LEMLAT: aqua (lemma): Common, Noun, I Decl., Gen., Sing., Fem., Poetic., Arch.

SLIDE 13

How to achieve these aims

  • New coding of the basic LEMLAT wordform segments (morphemes) recognized by the segmentation module:

– LES: antiqu-
– SM (paradigmatic suffixes): -issim-
– SF (endings): -orum
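
To make the three segment types concrete, here is a minimal Python sketch that splits a wordform into LES, optional SM, and SF by brute-force lookup. The tiny inventories contain only the three segments named on the slide, and the function `segment` is hypothetical; LEMLAT's real segmentation module is not described here and may work quite differently.

```python
# Toy inventories holding only the segments named on the slide.
LES = {"antiqu"}
SM = {"issim"}
SF = {"orum"}

def segment(form):
    """Return every (LES, SM, SF) split of `form`; SM is '' when absent."""
    results = []
    for i in range(1, len(form)):
        les, rest = form[:i], form[i:]
        if les not in LES:
            continue
        # First try LES + SF with no paradigmatic suffix.
        if rest in SF:
            results.append((les, "", rest))
        # Then try LES + SM + SF.
        for j in range(1, len(rest)):
            sm, sf = rest[:j], rest[j:]
            if sm in SM and sf in SF:
                results.append((les, sm, sf))
    return results

print(segment("antiquissimorum"))  # [('antiqu', 'issim', 'orum')]
print(segment("antiquorum"))       # [('antiqu', '', 'orum')]
```

Each recognized segment is then the place where a code can be attached, which is what the new coding relies on.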

SLIDE 14

Type of codes

  • Definition of codes according to the morphological coding conventions developed by the EAGLES project (Expert Advisory Group on Language Engineering Standards)

EAGLES coding advantages:

– accepted standard
– widely tested on a number of languages
– flexibility and personalization (useful for this first application to a dead language)

SLIDE 15

SF coding: code positions and their attributes

====== ==================
Code P ATTRIBUTE
====== ==================
1      PoS
2      Type
3      Flexive Category
4      Mood
5      Tense
6      Case
7      Gender
8      Number
9      Person
10     Degree
====== ==================
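
Because every code has one character per position, a code can be read by pairing its characters with the attribute names above. The sketch below does just that; the function name `split_code` is an assumption for illustration, and the sample code NcA--bfs-- is taken from the coding samples later in the deck.

```python
# Attribute names for positions 1 to 10, in the order listed on the slide.
POSITIONS = [
    "PoS", "Type", "Flexive Category", "Mood", "Tense",
    "Case", "Gender", "Number", "Person", "Degree",
]

def split_code(code):
    """Pair each character of a ten-position code with its attribute name."""
    if len(code) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} positions, got {len(code)}")
    return dict(zip(POSITIONS, code))

fields = split_code("NcA--bfs--")
print(fields["Case"], fields["Gender"], fields["Number"])  # b f s
```

A hyphen in a position simply means that the attribute does not apply to that segment.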

SLIDE 16

Example: third position, values and codes

= ===================== ===================== =
P ATTRIBUTE             VALUE                 C
= ===================== ===================== =
3 Flexive Category      I decl.               A
                        II decl.              B
                        III decl.             C
                        IV decl.              D
                        V decl.               E
                        I conjug.             F
                        II conjug.            G
                        III conjug.           H
                        IV conjug.            L
                        Conjug e/i            M
                        Exceptional Conjug.   N
                        No Flexive Category   -
= ===================== ===================== =
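
The third-position table maps directly onto a lookup. The following sketch copies the values from the table; the function name `flexive_category` is hypothetical, and the two sample codes come from other slides of this deck.

```python
# Third-position codes and their values, copied from the table above.
FLEXIVE_CATEGORY = {
    "A": "I decl.",
    "B": "II decl.",
    "C": "III decl.",
    "D": "IV decl.",
    "E": "V decl.",
    "F": "I conjug.",
    "G": "II conjug.",
    "H": "III conjug.",
    "L": "IV conjug.",
    "M": "Conjug e/i",
    "N": "Exceptional Conjug.",
    "-": "No Flexive Category",
}

def flexive_category(code):
    """Decode the third position (index 2) of a morphological code."""
    return FLEXIVE_CATEGORY[code[2]]

print(flexive_category("NcA--bfs--"))  # I decl.
print(flexive_category("VmFa6--p3-"))  # I conjug.
```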

SLIDE 17

Coding samples

==== =========== =========== =========
SF   LEMLAT Cod. EAGLES Cod. Examples
==== =========== =========== =========
a    n1          NcA--bfs--  ros-a
a    n1          NcA--bms--  pirat-a
a    n1          NcA--nfs--  ros-a
a    n1          NcA--nms--  pirat-a
a    n1          NcA--vfs--  ros-a
a    n1          NcA--vms--  pirat-a
a    n1e         NcA--bfs--  plastic-a
a    n1e         NcA--bms--  poet-a
a    n1e         NcA--nfs--  plastic-a
a    n1e         NcA--nms--  poet-a
a    n1e         NcA--vfs--  plastic-a
a    n1e         NcA--vms--  poet-a
abus n1e         NcA--bfp--  de-abus
abus n1e         NcA--dfp--  de-abus
==== =========== =========== =========
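
The samples show that one ending under one LEMLAT code usually carries several EAGLES codes. A possible way to hold this, sketched below, is an index keyed by the pair (SF, LEMLAT code). The data structure and the name `build_index` are assumptions for illustration, not the project's storage format.

```python
from collections import defaultdict

# The fourteen sample rows from the slide: SF, LEMLAT code, EAGLES code, example.
SAMPLES = """\
a n1 NcA--bfs-- ros-a
a n1 NcA--bms-- pirat-a
a n1 NcA--nfs-- ros-a
a n1 NcA--nms-- pirat-a
a n1 NcA--vfs-- ros-a
a n1 NcA--vms-- pirat-a
a n1e NcA--bfs-- plastic-a
a n1e NcA--bms-- poet-a
a n1e NcA--nfs-- plastic-a
a n1e NcA--nms-- poet-a
a n1e NcA--vfs-- plastic-a
a n1e NcA--vms-- poet-a
abus n1e NcA--bfp-- de-abus
abus n1e NcA--dfp-- de-abus
"""

def build_index(text):
    """Map (SF, LEMLAT code) to the list of EAGLES codes it can carry."""
    index = defaultdict(list)
    for line in text.splitlines():
        sf, lemlat_cod, eagles_cod, _example = line.split()
        index[(sf, lemlat_cod)].append(eagles_cod)
    return dict(index)

index = build_index(SAMPLES)
print(len(index[("a", "n1")]))   # 6
print(index[("abus", "n1e")])    # ['NcA--bfp--', 'NcA--dfp--']
```

The ending -a under n1 is six ways ambiguous in these samples, which is why further information about the LES is needed to narrow the analysis.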

SLIDE 18

A coding problem

  • The following kinds of forms are lemmatized by LEMLAT with no segmentation:

– FE (exceptional forms): registered as such in the look-up table, with COD LES FE (e.g. amassint)

A1705 AMASSINT FE
A1705 V AM V1

– LE (exceptional lemmas): generated through special information registered in the fourth field of the look-up table (e.g. agape)

A1128 AGAP N1E -E

– I (invariable forms): registered as such in the look-up table, with COD LES I (e.g. assultim)

A3200 ASSULTIM I

SLIDE 19

Why is this a problem?

  • Remember! The analytical morphological analysis we need derives from the coding of wordform segments (LES/SM/SF)
  • No segmentation of the input wordform means no recognition of its segments
  • No recognition of the input wordform's segments means no analytical morphological analysis of that wordform

SLIDE 20

Problem solution 1. FE and I

  • Every single FE and I will be manually coded in an ad hoc file, where all FE and I are listed

AMASSINT FE VmFa6--p3-

ASSULTIM I Ri-------
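
A minimal sketch of how such an ad hoc file could be consulted for unsegmented forms follows. The three-column whitespace format and the function names are assumptions. The AMASSINT code is written with plain hyphens in the unassigned positions, and the ASSULTIM code is copied as it appears on the slide.

```python
# The two entries shown on the slide; a real file would list every FE and I.
AD_HOC = """\
AMASSINT FE VmFa6--p3-
ASSULTIM I Ri-------
"""

def load_ad_hoc(text):
    """Map each listed form (lowercased) to its (COD LES, morphological code)."""
    table = {}
    for line in text.splitlines():
        form, cod_les, code = line.split()
        table[form.lower()] = (cod_les, code)
    return table

def analyse_unsegmented(form, table):
    """Return the manually assigned analysis, or None if the form is not listed."""
    return table.get(form.lower())

table = load_ad_hoc(AD_HOC)
print(analyse_unsegmented("amassint", table))  # ('FE', 'VmFa6--p3-')
print(analyse_unsegmented("rosa", table))      # None
```

Forms that are not in the file fall through to the normal segmentation route.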

SLIDE 21

Problem solution 2. LE

  • Every LE will receive its morphological analysis according to:

– the COD LES of the LE's LES
– the kind of information registered in the fourth field of the LE's LES row in the look-up table: LEMLAT adds this information to the LES to generate the LE

A1128 AGAP N1E -E

LE: agape (AGAP plus -E), no segmented wordform!

COD LES: N1E + Fourth field: -E

Morphological analysis:

Common, Noun, I Decl., Nomin., Sing., Fem.
Common, Noun, I Decl., Voc., Sing., Fem.
Common, Noun, I Decl., Abl., Sing., Fem.
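
The LE rule can be illustrated with a short sketch: the fourth field is appended to the LES to produce the form, and the pair (COD LES, fourth field) selects the analyses. The table `LE_ANALYSES` holds only the one case given on the slide, and the function name and row parsing are assumptions for illustration.

```python
# Look-up row from the slide: ID number, LES, COD LES, fourth field.
ROW = "A1128 AGAP N1E -E"

# Analyses the slide gives for COD LES N1E combined with fourth field -E.
LE_ANALYSES = {
    ("N1E", "-E"): [
        "Common, Noun, I Decl., Nomin., Sing., Fem.",
        "Common, Noun, I Decl., Voc., Sing., Fem.",
        "Common, Noun, I Decl., Abl., Sing., Fem.",
    ],
}

def generate_le(row):
    """Build the exceptional lemma and fetch its analyses from one row."""
    _id_num, les, cod_les, fourth = row.split()
    le = (les + fourth.lstrip("-")).lower()  # AGAP plus -E gives agape
    return le, LE_ANALYSES.get((cod_les, fourth), [])

le, analyses = generate_le(ROW)
print(le)             # agape
print(len(analyses))  # 3
```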

SLIDE 22

3. The future

Next steps and CHLT LEMLAT developments and applications

SLIDE 23

Next steps 1.

  • To add gender codes to every single nominal LES in the look-up table (a partially automatic operation)

A0019f ABAMIT N1

– Input form: abamitas
– Segmentation: abamit-as

  • SF: -as n1

as n1 NcA--afp--
as n1 NcA--amp--

  • LES: abamit-

A0019f ABAMIT N1

Selected SF:

as n1 NcA--afp--
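
The selection step for abamitas can be sketched as a filter on the Gender position. Several details are assumptions made for the example: the gender letter is read as the last character of the ID field, the SF codes are written out to the full ten positions used in the coding samples, and COD LES is matched to the SF code case-insensitively.

```python
# Candidate codes for the ending -as under LEMLAT code n1, as on the slide.
SF_CODES = {("as", "n1"): ["NcA--afp--", "NcA--amp--"]}

GENDER_INDEX = 6  # position 7 (Gender) of the code, counted from zero

def parse_les_row(row):
    """Split 'A0019f ABAMIT N1' into (id_num, gender, les, cod_les)."""
    id_field, les, cod_les = row.split()
    return id_field[:-1], id_field[-1], les, cod_les

def select_sf(row, ending):
    """Keep only the SF codes whose gender matches the gender of the LES."""
    _id_num, gender, _les, cod_les = parse_les_row(row)
    candidates = SF_CODES.get((ending, cod_les.lower()), [])
    return [code for code in candidates if code[GENDER_INDEX] == gender]

print(select_sf("A0019f ABAMIT N1", "as"))  # ['NcA--afp--']
```

Without the gender code on the LES, both the feminine and the masculine reading of -as would survive.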

SLIDE 24

Next steps 2.

  • To code SM
  • To code FE and I
  • To code stylistic and historical-linguistic information
  • Software

– To choose an RDBMS (Relational Database Management System) among the available open-source systems
– To use the chosen RDBMS in LEMLAT
– Software development to implement the new features

SLIDE 25

Next steps 3.

(but additional funds are needed)

  • To add proper nouns (Onomasticon) to the look-up table
  • To add late Latin items (from Humanism and Renaissance) to the look-up table

SLIDE 26

Future CHLT LEMLAT developments and applications

Proposal for EU Sixth Framework, 2003

  • Latin Lexical Database for content extraction

– To be added to lemmas:

  • Encyclopedic and dictionary information
  • Etymological information
  • Information about people, places and things
  • Images
  • Movies and sounds
  • Syntactic analysis (syntactic disambiguator)
  • Metric structure analyser

– Metrical reading through a multimedia tool (text-to-speech and sound reproduction)