Lexical Resources in GF, Krasimir Angelov, University of Gothenburg

SLIDE 1

Lexical Resources in GF

Krasimir Angelov

University of Gothenburg

July 15, 2015

SLIDE 2

1. History
2. English
3. Translations
4. GF Lexicon vs WordNet

SLIDE 3

Some History

2008 OALD imported (Björn Bringert)
2010 Further development for wide-coverage parsing in English (Krasimir Angelov)
2012 Translation to Swedish, Finnish, Hindi, Urdu, Bulgarian (Aarne Ranta, Shafqat Virk, Krasimir Angelov)
2013 First Mobile Translator (Björn Bringert, Krasimir Angelov)
.... Many more languages added

SLIDE 4

1. History
2. English
3. Translations
4. GF Lexicon vs WordNet

SLIDE 5

English Lexicon

Nouns, Verbs, Adjectives, Adverbs
  Oxford Advanced Learner's Dictionary
  Princeton WordNet
  Spelling variants (British/American/Others)
  Harmonized with RGL

Prepositions
  Penn Treebank
  Wikipedia

Verb Frames
  Penn Treebank
  VerbNet (TODO)

Phrasal Verbs
  Web Sites for Learning English

SLIDE 6

English Lexicon

Example:

lin house_N = mkN "house" "houses";
lin play_V = mkV "play";
lin beautiful_A = compoundA (mkA "beautiful");
lin behind_Adv = mkAdv "behind";
lin instead_of_Prep = mkPrep "instead of";
lin theatre_N = variants {mkN "theatre"; mkN "theater"};
lin maharaja_N = variants {mkN "maharaja"; mkN "maharajah"};

SLIDE 7

Verb Frames

Currently a limited inventory of verb frames from OALD and Penn Treebank:

lin make_V = IrregEng.make_V;
lin make_V2 = mkV2 (IrregEng.make_V);
lin make_V2A = mkV2A (IrregEng.make_V) noPrep;
lin make_V2V = mkV2V (IrregEng.make_V) noPrep noPrep;

VerbNet has a better inventory, which should be incorporated. This would also require extensions in the RGL.

SLIDE 8

Multiword Units

There are a number of multiword units:

lin cod_liver_oil_N = mkN "cod-liver oil" ;

These are all inherited, and there are no clear criteria for which units should be in the lexicon.

SLIDE 9

1. History
2. English
3. Translations
4. GF Lexicon vs WordNet

SLIDE 10

Translations

Free Electronic Dictionaries (Bulgarian, Swedish)
WordNet (Finnish)
Universal WordNet (Bulgarian)
Apertium (Bulgarian, Others?)
Google Translate (Bulgarian, Swedish)
Phrase Tables (Bulgarian)
PanLex (Thai)
Manual Translation (Bulgarian, Chinese)
Wiktionary (Most Other Languages)

SLIDE 11

Sense Splits

Sense Ambiguities in English

            English  Swedish
letter_1_N  letter   brev
letter_2_N  letter   bokstav

Gender Ambiguities in English

               English  Bulgarian  German
teacherMasc_N  teacher  učitel     Lehrer
teacherFem_N   teacher  učitelka   Lehrerin
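
The effect of a sense split can be sketched as a small lookup table. The Python below is only an illustration and not part of GF: the dictionary `LEXICON` and the helper `senses_of` are hypothetical names, the language codes are chosen for this example, and the entries are copied from this slide with the identifiers written in the usual GF style with underscores.

```python
# Illustration only: abstract identifiers mapped to per-language lemmas.
# LEXICON and senses_of are hypothetical names, not part of GF.
LEXICON = {
    "letter_1_N":    {"Eng": "letter",  "Swe": "brev"},
    "letter_2_N":    {"Eng": "letter",  "Swe": "bokstav"},
    "teacherMasc_N": {"Eng": "teacher", "Bul": "učitel",   "Ger": "Lehrer"},
    "teacherFem_N":  {"Eng": "teacher", "Bul": "učitelka", "Ger": "Lehrerin"},
}

def senses_of(lemma, lang):
    """Return the sorted abstract identifiers whose lemma in `lang` is `lemma`."""
    return sorted(fid for fid, lins in LEXICON.items() if lins.get(lang) == lemma)

print(senses_of("letter", "Eng"))  # ['letter_1_N', 'letter_2_N']
print(senses_of("brev", "Swe"))    # ['letter_1_N']
```

English "letter" maps to two identifiers while Swedish "brev" maps to one, which is why the split has to live in the shared abstract lexicon rather than in any single language.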

SLIDE 12

Morphology

Smart Paradigms
IrregXXX modules
Free Morphological Lexicons (OALD, Open Office, SALDO, KOTUS)

SLIDE 13

Validation

There are still many errors in the dictionaries. English, Swedish and Bulgarian seem to be in the best shape.

Go Through the Word List in Frequency Order
Use Your Vacation to Test the Translator

SLIDE 14

1. History
2. English
3. Translations
4. GF Lexicon vs WordNet

SLIDE 15

GF Lexicon vs WordNet

GF Lexicon
  Mostly one sense per word
  Focus on the primary sense
  Many sense confusions

WordNet
  No morphology
  Coarse POS tags
  Not focused on translation

SLIDE 16

Ongoing and Past Work on Integration

Past
  Shafqat Virk, K.V.S. Prasad, Aarne Ranta, Krasimir Angelov. Developing an interlingual translation lexicon using WordNets and Grammatical Framework.

Ongoing
  Selective Translation Choice from WordNet

SLIDE 17

A New Statistical Model

The current model is trained on the English Penn Treebank. With more split senses we will need something else:

Princeton WordNet has some sense frequency information
This can be complemented by using the EM algorithm

Example:

               English  Swedish  Bulgarian  German
letter_1_N     letter   brev     pismo      Brief
letter_2_N     letter   bokstav  bukva      Buchstabe
teacherMasc_N  teacher  lärare   učitel     Lehrer
teacherFem_N   teacher  lärare   učitelka   Lehrerin
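
The slide does not say how the EM algorithm would be applied, so the following Python is only one possible sketch under stated assumptions. It assumes that each observation is a set of candidate senses with a count (an ambiguous word such as English "letter" yields two candidates, an unambiguous one such as Swedish "brev" yields one) and that the model is a single unigram distribution over senses. The counts, the function name `em_sense_probs`, and the data layout are invented for illustration.

```python
from collections import defaultdict

def em_sense_probs(observations, iterations=50):
    """Estimate a unigram distribution over senses from ambiguous counts.

    observations: list of (candidate_senses, count) pairs (assumed layout).
    """
    senses = sorted({s for cands, _ in observations for s in cands})
    probs = {s: 1.0 / len(senses) for s in senses}  # uniform starting point
    total = sum(count for _, count in observations)
    for _ in range(iterations):
        expected = defaultdict(float)
        # E-step: split each count among its candidates by current probability.
        for cands, count in observations:
            norm = sum(probs[s] for s in cands)
            for s in cands:
                expected[s] += count * probs[s] / norm
        # M-step: renormalise the expected counts into probabilities.
        probs = {s: expected[s] / total for s in senses}
    return probs

# Made-up counts, for illustration only.
observations = [
    (("letter_1_N", "letter_2_N"), 10),  # ambiguous: English "letter"
    (("letter_1_N",), 6),                # unambiguous: Swedish "brev"
    (("letter_2_N",), 2),                # unambiguous: Swedish "bokstav"
]
probs = em_sense_probs(observations)
print({s: round(p, 3) for s, p in probs.items()})
# {'letter_1_N': 0.75, 'letter_2_N': 0.25}
```

With these invented counts the ambiguous English observations are shared out in the ratio set by the unambiguous ones. In a fuller setup, the sense frequencies from Princeton WordNet might serve as the starting point or as additional counts, but that is an assumption rather than something the slide specifies.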