June 2016
Computational Lexicography: Some proposals
David Lindemann
UPV/EHU University of the Basque Country
david.lindemann@ehu.eus

Overview
(1) Intro: Computational Lexicography
(2) EuDeLex
► Lindemann & Nazar (2013)
► Lindemann, Manterola, Nazar, San Vicente & Saralegi (2014)
► Lindemann & San Vicente (2015a; 2015b; in prep. 2016)
[Status checklist; surviving item labels: Definition of Methods; Application and evaluation on German Letter A]
Literary Corpus (parallel)
    Created at UPV-EHU (Sanz, Uribarri & Zubillaga 2011-13)
    about 2 million tokens per language
    146,457 sentence pairs
    Sentence alignment hand-revised
Bible Corpus (parallel)
    Created by X. Saralegi
    640,000 tokens per language
    30,440 sentence pairs
Comparable Corpora
    Created at Elhuyar
    DE and EU Wikipedia
    “Die Zeit” and “Berria” newspaper corpus
Corpora lemmatized and POS-tagged
    DE: TreeTagger (SkE)
    EU: Eustagger (Gorka Labaka)
Parallel KWIC from lemmatized and POS-tagged DE-EU corpus using SkE. Example: “Begriff” (noun)
TE candidate    Freq.
kontzeptu       18
ideia           11
hitza           2
adigai          1
burutapena      1
ezagutza        1
gai             1
hitz            1
ikusmolde       1
pentsakera      1
termino         1
ulerbide        1
ulerkera        1
LU (German)                             TE candidate (Basque)                  Freq.
[LU missing]                            apoderatzen izan                       1
[LU missing]                            ekarri                                 1
[LU missing]                            izan                                   3
[LU missing]                            gutxi falta + subjkt.                  1
[LU missing]                            egon                                   2
[LU missing]                            hasia izan                             1
[LU missing]                            egin behar izan                        1
kein Begriff sein                       ez ezagutu                             1
kein Begriff sein                       horren entzuterik ere ez izan          1
keinen andern Begriff haben als ...     ... baino ez pentsatu                  1
nicht allzu schnell von Begriff sein    ez oso azkarra izan                    1
schwer von Begriff sein                 burugogorra izan                       1
sich einen stillen Begriff machen       gutxi gora behera irudika ahal izan    1
sich kaum einen Begriff machen          [TE missing]                           1
sich keinen Begriff machen              ez jakin                               1
sich keinen Begriff machen              ezin imajinatu ere egin                1
über alle Begriffe                      ezin esan bezain                       1
Der Begriff X                           X                                      1
einen Begriff geben                     aditzera eman                          1
LU and TE candidates in Parallel Corpus (SkE). Example: “Begriff” (noun)
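A frequency list of this kind could be tallied roughly as follows. This is a minimal sketch, assuming that (German lemma, Basque lemma) pairs have already been extracted from the parallel concordance. The function name `te_candidates` and the toy pairs are hypothetical and do not reproduce the SkE workflow or the corpus counts.

```python
from collections import Counter


def te_candidates(aligned_pairs, lemma):
    """Count the Basque candidates paired with one German lemma.

    aligned_pairs: iterable of (german_lemma, basque_lemma) tuples
    (assumed input format). Returns (candidate, frequency) tuples,
    most frequent first.
    """
    counts = Counter(te for lu, te in aligned_pairs if lu == lemma)
    return counts.most_common()


# Toy data for illustration only (hypothetical, not the corpus figures).
pairs = [
    ("Begriff", "kontzeptu"),
    ("Begriff", "ideia"),
    ("Begriff", "kontzeptu"),
    ("Idee", "ideia"),
]
result = te_candidates(pairs, "Begriff")
print(result)
```

Multiword units such as "kein Begriff sein" would need an extra matching step on the German side before counting; the sketch covers single lemmas only.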
Criterion                                        Giza++   Bifid   Pibolex   Wikipedia   Wiktionary   WordNet
BG: Recall on GS lemmalist                       19%      5%      16%       7%          4%           21%
BG: lemma with 1+ good TE                        63%      95%     79%       89%         100%         90%
BG: lemma with all TE "good"                     42%      95%     57%       89%         97%          78%
Lemma with 1+ good TE: Recall on GS lemmalist    13%      5%      13%       7%          4%           21%
Lemma with all TE "good": Recall on GS lemmalist 9%       5%      9%        7%          4%           18%
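The criteria in the table could be computed along these lines. This is only a sketch under assumptions: it reads "recall on GS lemmalist" as the share of gold-standard lemmas for which a resource offers at least one candidate, and "1+ good" / "all good" as shares of those covered lemmas. The function `evaluate`, its argument names, and the toy data are hypothetical and are not taken from the evaluation itself.

```python
def evaluate(candidates, gold_lemmas, good_pairs):
    """Compute coverage and quality shares for one resource.

    candidates:  dict mapping lemma -> list of TE candidates (assumed format)
    gold_lemmas: set of gold-standard (GS) lemmas
    good_pairs:  set of (lemma, te) pairs judged "good" by an evaluator
    """
    n_gold = len(gold_lemmas)
    # Lemmas of the GS list for which the resource proposes anything at all.
    covered = [l for l in gold_lemmas if candidates.get(l)]
    one_good = [l for l in covered
                if any((l, te) in good_pairs for te in candidates[l])]
    all_good = [l for l in covered
                if all((l, te) in good_pairs for te in candidates[l])]

    def share(part, whole):
        return len(part) / whole if whole else 0.0

    return {
        "recall": share(covered, n_gold),
        "one_good_share": share(one_good, len(covered)),
        "all_good_share": share(all_good, len(covered)),
        "one_good_recall": share(one_good, n_gold),
        "all_good_recall": share(all_good, n_gold),
    }


# Toy data for illustration only (hypothetical).
gold = {"a", "b", "c", "d"}
cands = {"a": ["x", "y"], "b": ["z"], "e": ["q"]}
good = {("a", "x"), ("b", "z")}
scores = evaluate(cands, gold, good)
print(scores)
```

In the toy run, two of the four gold lemmas are covered, both have at least one good candidate, and only one has exclusively good candidates.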
Nearly 100% precision: direct pasting into Dictionary Database, post-editing
High precision: revision by hand, then paste into Dictionary Database
Lower precision: more strategies for noise reduction needed; data displayed as support in Dictionary entry editing
► Lindemann & San Vicente (2015a; 2015b)
► Lindemann & San Vicente (in press, 2016)