A General Introduction to Lexical Databases
Emmanuel Keuleers Department of Experimental Psychology Ghent University
EMLAR 2015 - Utrecht, April 15-17, 2015
- What can you find in a lexical database?
- How can you find it?
Interfaculty Research Unit for Language and Speech of the University of Nijmegen (now CLS)
Dutch, English, and German
Dutch (1984)
(1954), plus later revisions, including the 1994 spelling reform
Institute for Dutch Lexicology (INL) 42,380,000 words in all
and Bonner Zeitungskorpus 1
fiction
Verb      Frequency  Deviation  Freq/Million
accept         3712                   207.37
accord         2010         12        112.29
achieve        2121                   118.49
act            2212        430        123.58
add            4190                   234.08
agree          3424                   191.28
Lexicon   Form    Frequency  Deviation  Frequency/million
lemma     act          2212        430             123.58
wordform  act           269       1233              15.03
wordform  acted          92        366               5.14
wordform  acting        489        103              27.32
wordform  acts          187         80              10.45
frequency of zero
found in the frequency sources
sources but are left with a zero frequency after disambiguation
decomposition of other lemmas have a frequency of zero
word   frequency  rank
the    1 093 546     1
         540 085     2
and      514 946     3
to       483 428     4
a        422 334     5
in       337 995     6
that     217 376     7
it       199 920     8
i        198 139     9
the
and to in a all these words have frequency 1
10^6 = 1,000,000
10^3 = 1,000
10^0 = 1
word   frequency  frequency per million  rank
the    1 093 546              60 955.74     1
         540 085              30 105.07     2
and      514 946              28 703.79     3
to       483 428              26 946.93     4
a        422 334              23 541.47     5
in       337 995              18 840.30     6
that     217 376              12 116.83     7
it       199 920              11 143.81     8
i        198 139              11 044.54     9
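The frequency-per-million column is the raw count divided by the corpus size in millions of tokens. A minimal sketch of that calculation follows. The corpus size of 17.94 million tokens is an assumption, back-calculated from the first row of the table (1 093 546 / 60 955.74 × 10^6); it is not stated on the slide.

```python
# Assumed corpus size in tokens, inferred from the table rather than
# taken from the slides.
CORPUS_SIZE = 17_940_000


def per_million(count, corpus_size=CORPUS_SIZE):
    """Convert a raw corpus count to a frequency per million tokens."""
    return count / corpus_size * 1_000_000


# Raw counts copied from the table above.
counts = {"the": 1_093_546, "and": 514_946, "to": 483_428}

for word, count in counts.items():
    print(word, round(per_million(count), 2))
```

Because the value depends on the assumed corpus size, results computed this way should only be compared across words counted in the same corpus.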
Van Heuven, Mandera, Keuleers, & Brysbaert (2014)
word   frequency  Relative Frequency  log10(fpm)  zipf
the    1 093 546           0.0602191        4.78  7.78
         540 085           0.0297413        4.47  7.47
and      514 946           0.0283569        4.45  7.45
to       483 428           0.0266213        4.43  7.43
a        422 334           0.0232570        4.37  7.37
in       337 995           0.0186127        4.27  7.27
that     217 376           0.0119704        4.08  7.08
it       199 920           0.0110092        4.04  7.04
i        198 139           0.0109111        4.04  7.04
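In the table, the zipf column is always the log10(fpm) column plus 3, which is the same as the base-10 logarithm of the frequency per billion words. A minimal sketch of both routes to the same value follows. The function names are illustrative, and the sketch leaves out any smoothing for words with a zero count.

```python
import math


def zipf_from_fpm(fpm):
    """Zipf value from a frequency per million: log10(fpm) + 3."""
    return math.log10(fpm) + 3


def zipf_from_relative(relative_frequency):
    """Zipf value from a relative frequency (count / corpus size)."""
    return math.log10(relative_frequency * 1e9)


# Relative frequency of "the", copied from the table above.
rel_the = 0.0602191
print(round(zipf_from_relative(rel_the), 2))
print(round(zipf_from_fpm(rel_the * 1e6), 2))
```

A word that occurs once per million words has a Zipf value of 3, and one that occurs once per thousand words has a Zipf value of 6.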
with separate frequencies
INL corpus
Lemma ID  Form              Status         Frequency
1070      aardappelcroquet  preferred
1070      aardappelkroket   non-preferred
1138      aardelektrode     preferred
1138      aardelectrode     non-preferred
1202      aardolieprodukt   preferred      6
1202      aardolieproduct   non-preferred
1357      abductie          preferred
1357      abduktie          non-preferred

Lemma ID  Form          Status    Frequency
1359      anaesthesia   British   12
1359      anesthesia    American  1
1360      anaesthetic   British   47
1360      anesthetic    American  4
1361      anaesthetic   British   8
1361      anesthetic    American
1362      anaesthetist  British   16
Type       Stem               Abstract Stem
Adjective  approximatief      approximatiev
Noun       arbeidershuis      arbeidershuiz
Noun       arbeidersparadijs  arbeidersparadijz
Noun       arbeidsbeurs       arbeidsbeurz
Adjective  arbeidsextensief   arbeidsextensiev
Adjective  arbeidsintensief   arbeidsintensiev
Adjective  arbeidsloos        arbeidslooz
Idnum  Spelling    Status     DISC       Syllables (IPA)  Stress  CV
42577  sleekness   primary    sliknIs    sliːk-nɪs        10      CCVVC-CVC
42577  sleekness   secondary  sliknIs    sliːk-nəs        10      CCVVC-CVC
42582  sleepily    primary    slipIlI    sliː-pɪ-lɪ       100     CCVV-CV-CV
42582  sleepily    secondary  slipIlI    sliː-pə-lɪ       100     CCVV-CV-CV
42584  sleepiness  primary    slipInIs   sliː-pɪ-nɪs      100     CCVV-CV-CVC
42584  sleepiness  secondary  slipInIs   sliː-pɪ-nəs      100     CCVV-CV-CVC
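The CV column can be read as a skeleton of the syllabified transcription: each consonant becomes C, each vowel becomes V, and a long vowel takes two V slots. A minimal sketch that reproduces the rows above follows. The vowel inventory is a partial, hypothetical one chosen for this illustration, and diphthongs or other special symbols are not handled.

```python
# Partial vowel inventory, assumed for this sketch only.
VOWELS = set("iɪeɛæaɑɒɔoʊuʌəɜ")
LENGTH_MARK = "ː"


def cv_skeleton(syllabified_ipa):
    """Map a hyphen-syllabified IPA string to a CV pattern."""
    pattern = []
    for symbol in syllabified_ipa:
        if symbol == "-":
            pattern.append("-")   # keep syllable boundaries
        elif symbol == LENGTH_MARK:
            pattern.append("V")   # a long vowel fills a second V slot
        elif symbol in VOWELS:
            pattern.append("V")
        else:
            pattern.append("C")   # everything else is treated as a consonant
    return "".join(pattern)


for form in ["sliːk-nɪs", "sliː-pɪ-lɪ", "sliː-pɪ-nəs"]:
    print(form, cv_skeleton(form))
```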
and stems
in the English Pronouncing Dictionary
English, transcriptions in CELEX are probably RP.
Stem           Phonological Transcription  Phonetic Transcription
Arbeiter       arbait+@r                   [ar][bai][t@r]
Arbeitsplatz   arbait+s#plats              [ar][baits][plats]
Arbeitgeber    arbait#ge:b+@r              [ar][bait][ge:][b@r]
arbeitsamkeit  arbait#za:m#kait            [ar][bait][za:m][kait]
aansprakelijkheidsverzekering = aansprakelijkheid + s + verzekering
aansprakelijkheidsverzekering = aan + spreek + elijk + heid + s + ver + zeker + ing
aansprakelijkheidsverzekering
  aansprakelijkheid
    aansprakelijk
      aanspreek
        aan
        spreek
      elijk
    heid
  s
  verzekering
    verzeker
      ver
      zeker
    ing
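The three decompositions above are views of one structure: the first lists the direct constituents, the second lists the smallest segments, and the third gives the full hierarchy. A minimal sketch of how the first two can be derived from the third follows. The nested-tuple representation and the function names are assumptions made for this illustration, and the nesting is a reading of the slide, not a database export.

```python
# Each node is (label, children); segments with no children are leaves.
TREE = ("aansprakelijkheidsverzekering", [
    ("aansprakelijkheid", [
        ("aansprakelijk", [
            ("aanspreek", [("aan", []), ("spreek", [])]),
            ("elijk", []),
        ]),
        ("heid", []),
    ]),
    ("s", []),
    ("verzekering", [
        ("verzeker", [("ver", []), ("zeker", [])]),
        ("ing", []),
    ]),
])


def immediate_constituents(node):
    """Labels of the direct children of a node."""
    _, children = node
    return [label for label, _ in children]


def flat_segments(node):
    """Labels of all leaves, in left-to-right order."""
    label, children = node
    if not children:
        return [label]
    return [leaf for child in children for leaf in flat_segments(child)]


print(immediate_constituents(TREE))
print(flat_segments(TREE))
```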
Form         Flection             Frequency
adjusted     past,1p,singular            35
adjuster     singular                     5
adjusters    plural                       1
adjusting    present,participle          71
adjustment   singular                   150
adjustments  plural                      84
adjusts      present,3p,singular         14
Form        Class         Subclasses
videotape   Noun          uncountable
videotape   Verb          transitive
vide supra  Interjection
vie         Verb          linking
Vietnam     Noun          proper
Vietnamese  Adjective
Vietnamese  Noun          countable
Form             Class      Subclasses
magnetisch       Adjective  nonadverbial
magnetiseren     Verb       lexical intransitive transitive
magnetiseerbaar  Adjective  nonadverbial
magnifiek        Adjective  adverbial
Magyaars         Adjective  adverbial
maharadja        Noun
maharishi        Noun
Mahdi            Noun
Mahler           Noun       propername
To explore these questions, we conducted a replication simulation using a much larger corpus. Monosyllables were extracted from the CELEX electronic corpus (Baayen, Piepenbrock, & van Rijn, 1993). All items fitting a CCCVVCCC template were used, yielding 7,839 words. Most
ical network was expanded from 66 to 88 units to accommo-
Harm, M. W., & Seidenberg, M. S. (1999). Phonology, reading acquisition, and dyslexia: Insights from connectionist models. Psychological Review, 106(3), 491-528.
We sought to feed both our rule-based and analogical models a diet of stem/past tense pairs that would resemble what had been encountered by our experimental
database (Baayen, Piepenbrock, & Gulikers, 1995), selecting all the verbs that had a lemma frequency of 10 or greater. In addition, for verbs that show more than one past tense (like dived/dove), we included both as separate entries (e.g. both dive-dived and dive-dove). The resulting corpus consisted of 4253 stem/past tense pairs, 4035 regular and 218 irregular. Verb forms were listed in a phonemic transcription reflecting American English pronunciation.
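The selection described in this excerpt has two steps: keep verbs with a lemma frequency of 10 or greater, then create one stem/past tense pair per attested past tense form. A minimal sketch of that logic follows. The field names and all frequency values are hypothetical and chosen for illustration, and the sketch uses spellings where the authors used phonemic transcriptions.

```python
MIN_LEMMA_FREQUENCY = 10  # threshold stated in the excerpt

# Hypothetical records; the frequencies are NOT real CELEX values.
verbs = [
    {"stem": "dive", "lemma_frequency": 25, "past_forms": ["dived", "dove"]},
    {"stem": "walk", "lemma_frequency": 300, "past_forms": ["walked"]},
    {"stem": "gyre", "lemma_frequency": 2, "past_forms": ["gyred"]},
]


def stem_past_pairs(records, min_frequency=MIN_LEMMA_FREQUENCY):
    """Return one (stem, past) pair per past form of each frequent verb."""
    pairs = []
    for record in records:
        if record["lemma_frequency"] < min_frequency:
            continue  # drop verbs below the frequency threshold
        for past in record["past_forms"]:
            pairs.append((record["stem"], past))
    return pairs


print(stem_past_pairs(verbs))
```

A verb with two past tenses contributes two pairs, as in the dive-dived and dive-dove example from the excerpt.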
Albright, A. & Hayes, B. (2003). Rules vs. analogy in English past tenses: A computational/experimental
A current debate in the acquisition literature (Bybee, 1995; Clahsen & Rothweiler, 1992; Marcus, Brinkmann, Clahsen, Wiese, & Pinker, 1995) concerns whether prefixed forms of the same stem (e.g. do/redo/outdo) should be counted separately for purposes
were removed, thus cutting its size down to 3308 input pairs (3170 regular, 138 irregular), and ran both learning models on both sets. As it turned out, the rule-based model did slightly better on the full set, and the analogical model did slightly better on the edited set. The results below reflect the performance of each model on its own best learning set.
Albright, A. & Hayes, B. (2003). Rules vs. analogy in English past tenses: A computational/experimental
Stimulus materials
The word stimuli were two lists of 24 nouns drawn from the Celex database (Baayen, Piepenbrock, & Gulikers, 1995), based on a corpus of 16.6 million written words and 1.3 million spoken words. The first list consisted of singular dominant items, with an average frequency of 25 per million for the singular forms and eight for the plural forms. The second list consisted of plural dominant items with average frequencies of 9 and 26, respectively. The base frequencies (34 vs. 35) did not differ between the lists. The stimuli were further matched on the number of letters (6.3 and 6.3) and the number of syllables (2 and 2). A complete list of the stimuli is presented in Appendix A. As in Experiment 1, two versions of the word list were created, so that each participant saw only one form of a word.
New, B., Brysbaert, M., Segui, J., Ferrand, L., & Rastle, K. (2004). The processing of singular and plural nouns in French and English. Journal of Memory and Language, 51(4), 568-585.
Materials used in Experiment 3

             Singular                              Plural
Word         Mean reaction time   SD   Frequency   Mean reaction time   SD   Frequency

Singular dominant items
beast                       468   68          17                  516   96          11
belief                      439   43          67                  477   58          24
cathedral                   476   68          15                  553   83           3
clinic                      480   65          15                  548   77           5
dragon                      482   43           8                  495   80           2
famine                      525   52           7                  642  145           1
hat                         422   33          53                  439   51          15
journal                     495   78          18                  456   52           6
lieutenant                  603   67          14                  609  132           1

Plural dominant items
acre                        584  101          15                  559   85          23
ancestor                    569   77           6                  568   95          22
biscuit                     434   43           5                  458   85          11
critic                      546   92          12                  534   85          23
disciple                    560  114           4                  533  137          13
dollar                      514   65          15                  503   76          53
glove                       455   59           5                  441   50          15
heel                        490   94          11                  479   48          18
ingredient                  535   75           4                  553  110          11
lip                         444   50          17                  482   67          61
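The grouping into singular dominant and plural dominant items depends only on which of the two form frequencies is higher. A minimal sketch using a few rows from the materials follows. Treating the base frequency as the sum of the singular and plural frequencies is an assumption of this sketch, and the function names are illustrative.

```python
# (singular frequency, plural frequency) per million, copied from the materials.
items = {
    "beast": (17, 11),
    "belief": (67, 24),
    "acre": (15, 23),
    "dollar": (15, 53),
}


def dominance(singular, plural):
    """Classify an item by which form is more frequent."""
    if singular > plural:
        return "singular dominant"
    if plural > singular:
        return "plural dominant"
    return "balanced"


def base_frequency(singular, plural):
    """Summed frequency of both forms (assumed definition)."""
    return singular + plural


for word, (singular, plural) in items.items():
    print(word, dominance(singular, plural), base_frequency(singular, plural))
```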