SLIDE 1

June 2016

Computational Lexicography: Some proposals

David Lindemann, UPV/EHU (University of the Basque Country), david.lindemann@ehu.eus

SLIDE 2

Overview

(1) Intro: Computational Lexicography
(2) EuDeLex: a German-Basque electronic dictionary for Basque-L1 learners of German (PhD project)
(3) Bilingual Dictionary Drafting: connecting Basque word senses to multilingual equivalents (postdoc project)

SLIDE 3

Computational Lexicography

Storing and editing of lexical data

Dictionary Writing Systems: interface for data import and editing, storage in a relational database, representation (export) as XML.
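
The store-then-export workflow of a Dictionary Writing System can be sketched in a few lines. This is only an illustration, not the system used for EuDeLex: the table names (`entry`, `translation`), the XML element names (`dictionary`, `entry`, `lemma`, `pos`, `te`) and the sample rows are all assumptions made for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_entries_as_xml(conn):
    """Serialize all entries and their equivalents as one XML string."""
    root = ET.Element("dictionary")
    rows = conn.execute(
        "SELECT e.id, e.lemma, e.pos, t.equivalent "
        "FROM entry e LEFT JOIN translation t ON t.entry_id = e.id "
        "ORDER BY e.id, t.id"
    )
    entries = {}
    for entry_id, lemma, pos, equivalent in rows:
        # Create the <entry> node the first time an entry id is seen.
        if entry_id not in entries:
            node = ET.SubElement(root, "entry", id=str(entry_id))
            ET.SubElement(node, "lemma").text = lemma
            ET.SubElement(node, "pos").text = pos
            entries[entry_id] = node
        # Entries without equivalents yield a NULL from the LEFT JOIN.
        if equivalent is not None:
            ET.SubElement(entries[entry_id], "te").text = equivalent
    return ET.tostring(root, encoding="unicode")


# Hypothetical schema and sample data, held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entry (id INTEGER PRIMARY KEY, lemma TEXT, pos TEXT);
    CREATE TABLE translation (
        id INTEGER PRIMARY KEY, entry_id INTEGER, equivalent TEXT);
    INSERT INTO entry VALUES (1, 'Begriff', 'noun');
    INSERT INTO translation VALUES (1, 1, 'kontzeptu'), (2, 1, 'ideia');
""")
xml_text = export_entries_as_xml(conn)
print(xml_text)
```

Keeping the relational tables as the editing format and generating XML only on export means the XML layout can change without touching the stored data.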

SLIDE 4

Computational Lexicography

Dictionary Publishing

parsing OCR

SLIDE 5

Computational Lexicography

Corpus Linguistics + Lexicography

Corpus data provides

Frequencies of lemma, word form, multiword expression, collocation, syntactic pattern... (“word sketches”)
Evidence for the definition of meaning
Example sentences
Parallel corpora: translated example sentences
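
The kinds of frequency data listed above can be counted from a lemmatized and POS-tagged corpus along these lines. This is a minimal sketch with a two-sentence toy corpus made up for the example. It produces raw counts only and is not a reimplementation of Sketch Engine word sketches, which rely on grammatical relations and association scores.

```python
from collections import Counter

# Toy corpus: each token is (word form, lemma, POS tag); invented data.
corpus = [
    [("Der", "der", "ART"), ("Begriff", "Begriff", "NN"),
     ("ist", "sein", "V"), ("klar", "klar", "ADJ")],
    [("Die", "der", "ART"), ("Begriffe", "Begriff", "NN"),
     ("sind", "sein", "V"), ("klar", "klar", "ADJ")],
]


def frequency_lists(sentences):
    """Count word forms, lemma+POS pairs, and adjacent lemma bigrams."""
    word_forms, lempos, bigrams = Counter(), Counter(), Counter()
    for sentence in sentences:
        for form, lemma, pos in sentence:
            word_forms[form] += 1
            lempos[(lemma, pos)] += 1
        lemmas = [lemma for _, lemma, _ in sentence]
        # Adjacent lemma pairs as a very rough stand-in for collocations.
        bigrams.update(zip(lemmas, lemmas[1:]))
    return word_forms, lempos, bigrams


word_forms, lempos, bigrams = frequency_lists(corpus)
print(lempos.most_common(3))
```

Counting on the lemma level merges "Begriff" and "Begriffe" into one lexical item, which is what a frequency-based lemma list needs.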

SLIDE 6

Computational Lexicography

Natural Language Processing (NLP) + Lexicography

NLP Resources

Extraction of monolingual and multilingual lexical data
Linking to dictionary entries
EDBL

NLP Tools

Corpus building
Bilingual document alignment
Sentence alignment (build parallel corpus, PC)
Word alignment (extract translation equivalents, TE)
etc.
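
How translation equivalent (TE) candidates can be pulled out of a sentence-aligned parallel corpus is sketched below. Note the substitution: instead of statistical word alignment, the sketch ranks target lemmas by the Dice coefficient of sentence-level co-occurrence, which is a much simpler technique. The function name `dice_candidates` and the three aligned toy sentences are assumptions made for the example.

```python
from collections import Counter


def dice_candidates(pairs, source_lemma, top_n=3):
    """Rank target lemmas by Dice co-occurrence with one source lemma."""
    source_count = 0          # aligned pairs containing the source lemma
    target_count = Counter()  # aligned pairs containing each target lemma
    joint = Counter()         # aligned pairs containing both
    for source_sentence, target_sentence in pairs:
        target_types = set(target_sentence)
        target_count.update(target_types)
        if source_lemma in source_sentence:
            source_count += 1
            joint.update(target_types)
    scores = {
        lemma: 2 * joint[lemma] / (source_count + target_count[lemma])
        for lemma in joint
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy lemmatized DE-EU sentence pairs (invented for illustration).
pairs = [
    (["der", "Begriff", "sein", "neu"], ["kontzeptu", "berri", "izan"]),
    (["ein", "Begriff", "haben"], ["kontzeptu", "bat", "ukan"]),
    (["das", "Haus", "sein", "neu"], ["etxe", "berri", "izan"]),
]
print(dice_candidates(pairs, "Begriff"))
```

With so little data, function words still score well; on real corpora a frequency threshold or stop list would be needed before the candidates are useful to a lexicographer.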

SLIDE 7

Teamwork with computational linguists

I. San Vicente, I. Manterola, X. Saralegi: Elhuyar Foundation

R. Nazar: UPF (Barcelona), U. Valparaíso (Chile)

► Lindemann & Nazar (2013)
► Lindemann, Manterola, Nazar, San Vicente & Saralegi (2014)
► Lindemann & San Vicente (2015a; 2015b; in prep. 2016)

SLIDE 8

Creation of EuDeLex: working steps

✔ Definition of a corpus-based DE macrostructure
✔ Definition of a microstructure suited to the needs of Basque-L1 learners of German
✔ Compilation of a DE-EU parallel corpus (SkE)
✔ 10% of the DE lemma list (letters A, B): editing of dictionary entries DE>EU
✔ Investigation: Bilingual Dictionary Drafting
✔ Definition of methods
✔ Application and evaluation on German letter A
✗ Editing and publication of all dictionary entries DE>EU
✗ Editing and publication of dictionary entries EU>DE
✗ Drafting by inverting DE>EU articles

http://www.ehu.es/eudelex/
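
The last step, drafting EU>DE entries by inverting DE>EU articles, could look roughly like this. It is a sketch under simplifying assumptions: each article is reduced to a headword with a flat list of equivalents, and the second sample entry ("Idee") is invented for illustration. Real articles carry senses, examples, and grammatical information that a simple inversion would not carry over.

```python
from collections import defaultdict


def invert_entries(de_eu):
    """Turn {German lemma: [Basque TE, ...]} into a Basque-German draft."""
    eu_de = defaultdict(list)
    for de_lemma, equivalents in de_eu.items():
        for eu_lemma in equivalents:
            # Avoid duplicate German equivalents under one Basque headword.
            if de_lemma not in eu_de[eu_lemma]:
                eu_de[eu_lemma].append(de_lemma)
    # Sort headwords and equivalents so the draft is reproducible.
    return {eu: sorted(des) for eu, des in sorted(eu_de.items())}


# Hypothetical, heavily simplified DE>EU articles.
de_eu = {
    "Begriff": ["kontzeptu", "ideia"],
    "Idee": ["ideia"],
}
eu_de_draft = invert_entries(de_eu)
print(eu_de_draft)
```

The output is a draft only: every inverted entry still needs manual editing, since equivalence is often not symmetric.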

SLIDE 9

Parallel and comparable DE-EU corpora

Literary Corpus (parallel)
Created at UPV-EHU (Sanz, Uribarri & Zubillaga 2011-13)
About 2 million tokens per language
146,457 sentence pairs
Sentence alignment hand-revised

Bible Corpus (parallel)
Created by X. Saralegi
640,000 tokens per language
30,440 sentence pairs

Comparable Corpora
Created at Elhuyar
DE and EU Wikipedia
“Die Zeit” and “Berria” newspaper corpus

Corpora lemmatized and POS-tagged
DE: TreeTagger (SkE)
EU: Eustagger (Gorka Labaka)

SLIDE 10

Working with the Literary Parallel Corpus (1)

Parallel KWIC from lemmatized and POS-tagged DE-EU corpus using SkE. Example: “Begriff” (noun)

„Begriff“

kontzeptu 18, ideia 11, hitza 2, adigai 1, burutapena 1, ezagutza 1, gai 1, hitz 1, ikusmolde 1, pentsakera 1, termino 1, ulerbide 1, ulerkera 1

„im Begriff sein“

tzera joan 5
tzekotan izan 1
tzeko zorian 4
tzera apoderatzen izan 1
tzear egon 3
ekarri 1
tzeko asmoa izan 3
gutxi falta + subjkt. 1
tzeko asmotan egon 2
hasia izan 1
tzea pentsatu 1
inf. nahi izan 1
tzear izan 1
inf. nahian 1
tzeko duda egin 1
egin behar izan 1

Other phrases

kein Begriff sein: ez ezagutu 1
kein Begriff sein: horren entzuterik ere ez izan 1
keinen andern Begriff haben als ...: ... baino ez pentsatu 1
nicht allzu schnell von Begriff sein: ez oso azkarra izan 1
schwer von Begriff sein: burugogorra izan 1
sich einen stillen Begriff machen: gutxi gora behera irudika ahal izan 1
sich kaum einen Begriff machen: ozta-ozta ideia bat izan 1
sich keinen Begriff machen: ez jakin 1
sich keinen Begriff machen: ezin imajinatu ere egin 1
über alle Begriffe: ezin esan bezain 1
Der Begriff X: X 1
einen Begriff geben: aditzera eman 1

LU and TE candidates in Parallel Corpus (SkE). Example: “Begriff” (noun)

SLIDE 11

Working with the Literary Parallel Corpus (2)

Gemütlichkeit: 4 hits

goxotasun
patxada
lasaitasun
konfortea

Schadenfreude: 10 hits

(voll Schadenfreude sein) zoritxarraz poztu
(Schadenfreude empfinden) maltzur sentitu
bozkario
gozatze modu bat
poz txiki bat
alaitasun maltzur
besteak umiliatzeko poza
poz gaizto
(aus Schadenfreude) besteren gaitzak ninduen pozten
kalte poz

2 terms “very hard to translate”
all hapax TE
cf. Teubert 2002

SLIDE 12

Automatic TE pairing for Bilingual Dictionary Drafting

Corpus based

1. DE-EU parallel corpora: GIZA++ (Elhuyar)
2. DE-EU parallel corpus: Bifid (R. Nazar)

Mixed

3. Elhuyar Pibolex mixed approach

Lexical Knowledge based

4. Wikipedia interlanguage links (IL-links)
5. de.wiktionary: Basque links
6. GermaNet / EusWN (PWN as pivot)
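
The pivot idea behind method 6 can be sketched as a join on shared Princeton WordNet (PWN) synset identifiers. The data structures here are assumptions made for the example: each wordnet is reduced to a mapping from lemma to a set of PWN synset labels, the German entries are invented, and the labels are illustrative rather than real GermaNet or EusWN records.

```python
from collections import defaultdict


def pivot_pairs(de_synsets, eu_synsets):
    """Pair German and Basque lemmas that share a pivot synset."""
    # Index the Basque side by synset: synset label -> Basque lemmas.
    by_synset = defaultdict(set)
    for eu_lemma, synsets in eu_synsets.items():
        for synset in synsets:
            by_synset[synset].add(eu_lemma)
    drafts = {}
    for de_lemma, synsets in de_synsets.items():
        candidates = set()
        for synset in synsets:
            candidates |= by_synset.get(synset, set())
        # Only lemmas with at least one candidate enter the draft.
        if candidates:
            drafts[de_lemma] = sorted(candidates)
    return drafts


# Hypothetical lemma-to-synset mappings on both sides of the pivot.
de_synsets = {"Horn": {"horn_2"}, "Ast": {"branch_2"}, "Zug": {"train_1"}}
eu_synsets = {"adar": {"horn_2", "branch_2"}, "abar": {"branch_2"}}
te_draft = pivot_pairs(de_synsets, eu_synsets)
print(te_draft)
```

Because pairing happens at synset level, each candidate comes with a sense label attached, which is what makes this route attractive for sense-level drafting.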

SLIDE 13

BDD methods: evaluation results overview

Criterion | Giza++ | Bifid | Pibolex | Wikipedia | Wiktionary | WordNet
BG: Recall on GS lemma list | 19% | 5% | 16% | 7% | 4% | 21%
BG: lemma with 1+ good TE | 63% | 95% | 79% | 89% | 100% | 90%
BG: lemma with all TE “good” | 42% | 95% | 57% | 89% | 97% | 78%
Lemma with 1+ good TE: Recall on GS lemma list | 13% | 5% | 13% | 7% | 4% | 21%
Lemma with all TE “good”: Recall on GS lemma list | 9% | 5% | 9% | 7% | 4% | 18%
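
Figures of this kind can be computed from a drafted TE list, a gold standard (GS) lemma list, and a set of TE pairs judged good. The slide does not spell out the exact definitions, so the ones in this sketch are assumptions: recall is the share of GS lemmas for which the method proposes any TE, and the "1+ good" and "all good" shares are taken over the covered lemmas. All data are toy values.

```python
def evaluate(draft, gold_lemmas, good_pairs):
    """Compute coverage and precision-style shares for one drafting method."""
    covered = [lemma for lemma in draft
               if lemma in gold_lemmas and draft[lemma]]
    one_good = [lemma for lemma in covered
                if any((lemma, te) in good_pairs for te in draft[lemma])]
    all_good = [lemma for lemma in covered
                if all((lemma, te) in good_pairs for te in draft[lemma])]
    n_gold = len(gold_lemmas)
    n_covered = len(covered)
    return {
        "recall_on_gs": n_covered / n_gold,
        "share_one_good": len(one_good) / n_covered if n_covered else 0.0,
        "share_all_good": len(all_good) / n_covered if n_covered else 0.0,
        "one_good_recall_on_gs": len(one_good) / n_gold,
        "all_good_recall_on_gs": len(all_good) / n_gold,
    }


# Toy data: four GS lemmas, a draft covering two of them.
gold_lemmas = {"Abend", "Apfel", "Arbeit", "Auge"}
draft = {"Abend": ["arratsalde"], "Apfel": ["sagar", "etxe"]}
good_pairs = {("Abend", "arratsalde"), ("Apfel", "sagar")}
scores = evaluate(draft, gold_lemmas, good_pairs)
print(scores)
```

Reporting both the share among covered lemmas and the recall on the whole GS list separates how clean a method's output is from how much of the lemma list it reaches.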

SLIDE 14

BDD methods: result interpretation

Overall Outcome (mixed methods)

80% recall
Up to 60% of lemmata in the Gold Standard lemma list provided with 1+ good TE
Half of these (30%) are noise-free
73% of these are nouns (54% of GS lemmas are nouns; Wikipedia: only nouns)

Three groups of Dictionary Draft Data

(1) Wiktionary, Wikipedia, Bifid (Lit Corpus): GS: 40.6% 1+ good TE

Nearly 100% precision
Direct pasting into Dictionary Database, post-editing

(2) Bifid (Bible Corpus), WordNet: GS: 20% 1+ good TE

High precision
Revision by hand, then paste into Dictionary Database

(3) Giza, Elhuyar Pivot: GS: 18.7% 1+ good TE

Lower precision
More strategies for noise reduction needed
Data displayed as support in Dictionary entry editing

SLIDE 15

EuDeLex Online-Publishing

SLIDE 16

From EuDeLex (predoc) to EuMultiLex (postdoc)

Application of BDD methods to more language pairs

Wikipedia? 292 languages
Wiktionary? 170 languages
Parallel corpora? With Basque, difficult
WordNet? 34 with free licence, around 30 non-free

Starting from Basque

Definition of a basic lemma list we want to cover

Semi-automatically obtained draft for manual editing

Preparation of draft dictionary set
Manual editing of German-Basque-German: David
Other combinations: lexicographers skilled in both languages

SLIDE 17

Definition of a basic lemma list

Corpus-based frequency lemma list for Basque

Lemmata extracted from ETC (Sarasola, Salaburu & Landa 2013) and Elh200 (Leturia 2014)
Comparison to 6 reference resources

► Lindemann & San Vicente (2015a; 2015b)

SLIDE 18

Basque Dictionary Draft: (1) Homograph Level

Basic list of lemma-signs, frequency data from Elh200 corpus

SLIDE 19

(1) Homograph, (2) Syntactical Entity

Syntactical Entities (LemPos entities) from Elh200 corpus
Corpus tagged with EusTagger, based on EDBL data
Frequency data for each entity

SLIDE 20

(1) Homograph, (2) Syntactical Entity, (3) Sense

Word senses from EusWN
Linking of senses to syntactical entities (as child elements)

► Lindemann & San Vicente (in press, 2016)
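
The three levels (homograph, syntactical entity, sense) nest naturally as parent and child records. The sketch below shows one possible in-memory representation; the class and field names are assumptions, and the frequency values are placeholders, not corpus figures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sense:
    euswn_id: str            # e.g. an EusWN lexical unit label
    definition_en: str = ""


@dataclass
class SyntacticalEntity:
    lemma: str
    pos: str
    frequency: int           # corpus frequency of this lemma+POS pair
    senses: List[Sense] = field(default_factory=list)


@dataclass
class Homograph:
    lemma_sign: str
    frequency: int           # corpus frequency of the lemma sign
    entities: List[SyntacticalEntity] = field(default_factory=list)


# Placeholder frequencies; only the nesting is the point here.
adar = Homograph(
    lemma_sign="adar",
    frequency=100,
    entities=[
        SyntacticalEntity(
            lemma="adar", pos="noun", frequency=100,
            senses=[
                Sense("adar_1", "one of the bony outgrowths on the heads "
                                "of certain ungulates"),
                Sense("adar_2", "a railway line connected to a trunk line"),
            ],
        )
    ],
)
print(len(adar.entities[0].senses))
```

A syntactical entity with an empty `senses` list is exactly the kind of gap that later steps of the drafting process need to detect and fill.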

SLIDE 21

Drafted Basque dictionary content

Part of speech | Corpus-based SE | SE with one or more EusWN word senses | Total EusWN word senses | Polysemy ratio | SE present in corpus but not in EusWN | SE present in EusWN but not found in corpus
Verbs | 4,151 | 1,636 | 6,567 | 4.01 | 2,515 | 279
Common Nouns | 23,921 | 15,193 | 30,613 | 2.01 | 8,728 | 3,479
Proper Nouns | 2,443 | 132 | 153 | 1.16 | 2,311 | 60
Adjectives | 6,147 | 50 | 141 | 2.82 | 6,097 | 8
Adverbs | 1,556 | 0 | 0 | 0.00 | 1,556 | 0
Total | 38,218 | 17,011 | 37,474 | 2.20 | 21,207 | 3,826
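
The derived columns can be checked against the raw counts. The sketch below assumes, as an inference from the figures rather than a stated definition, that the polysemy ratio is the total number of EusWN word senses divided by the number of SE with one or more senses, and that the corpus-only column is the difference of the first two columns. It uses the Proper Nouns, Adjectives and Total rows.

```python
# Raw counts copied from three rows of the table above.
rows = {
    "Proper Nouns": {"se": 2443, "se_with_senses": 132, "senses": 153},
    "Adjectives": {"se": 6147, "se_with_senses": 50, "senses": 141},
    "Total": {"se": 38218, "se_with_senses": 17011, "senses": 37474},
}


def polysemy_ratio(row):
    """Average number of senses per SE that has at least one sense."""
    if row["se_with_senses"] == 0:
        return 0.0
    return round(row["senses"] / row["se_with_senses"], 2)


def corpus_only(row):
    """SE found in the corpus that have no EusWN word sense."""
    return row["se"] - row["se_with_senses"]


for name, row in rows.items():
    print(name, polysemy_ratio(row), corpus_only(row))
```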

SLIDE 22

Dictionary Draft SE Gap Detection: semi-automatic

Blank SE (present in EDBL, not in EusWN): Find corresponding synset in Princeton WordNet, copy ID
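
The automatic half of this step amounts to a set difference between the SE attested in the corpus and the SE covered by EusWN; finding the matching Princeton WordNet synset remains manual. A minimal sketch, assuming each SE is identified by a (lemma, POS) pair and using invented sample entries:

```python
def detect_gaps(corpus_se, euswn_se):
    """Split SE into blanks (no EusWN sense) and EusWN-only entities."""
    blank = sorted(corpus_se - euswn_se)        # need a synset lookup
    not_in_corpus = sorted(euswn_se - corpus_se)
    return blank, not_in_corpus


# Hypothetical (lemma, POS) inventories.
corpus_se = {("adar", "noun"), ("etxe", "noun"), ("azkar", "adv")}
euswn_se = {("adar", "noun"), ("etxe", "noun"), ("zuhaitz", "noun")}

blank_se, euswn_only = detect_gaps(corpus_se, euswn_se)
print(blank_se)
print(euswn_only)
```

The sorted list of blank SE can then be handed to a lexicographer as a worklist for the manual synset lookup.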

SLIDE 23

Dictionary Draft Sense Gap Detection: Hand work!

EusWN Lexical Unit | Definition EN | EusWN 3.0 synset | EN synset | CAT synset
adar_1 | one of the bony outgrowths on the heads of certain ungulates | adar_1 | horn_2 | banya_1
adar_2 | a railway line connected to a trunk line | adar_2 | branch_line_1, spur_track_1, spur_5 | enforcall_1, forcall_1
adar_3 | a warning signal that is a loud wailing sound | adar_3, sirena_2, turuta_5 | siren_3 |
adar_4 | a local branch of some fraternity or association | adar_4 | chapter_3 | capítol_2
adar_5 | a division of a stem, or secondary stem arising from the main stem of a plant | adar_5, abar_2, besanga_1, beso_12 | branch_2 | branca_1, branc_1
adar_6 | an alarm device that makes a loud warning sound | sirena_4, adar_6, turuta_6 | horn_9 |
adar_7 | a device used for easing the foot into a shoe | zapata_sartzeko_1 | shoehorn_1 | calçador_1

SLIDE 24

A big advantage of this approach:

Every single bit of handwork invested, every gap that is filled, every link that is set, every error that is found, will lead to an enrichment of both EDBL and EusWN.

SLIDE 25

Thank you very much for your attention

David Lindemann david.lindemann@ehu.eus