
SLIDE 1

A General Introduction to Lexical Databases

Emmanuel Keuleers Department of Experimental Psychology Ghent University

EMLAR 2015 - Utrecht, April 15-17, 2015

What can you find in a lexical database?
How can you find it?

SLIDE 2

Lexical Databases

Like a dictionary
Lexical properties of interest to psycholinguists
Frequency, orthography, phonology, morphology, syntax, …

Subjective ratings of those words
Behavioural responses to those data

Lexical Databases

No standard: each database has its own format, peculiarities, ...
Text files, web interfaces, e-mail services, etc.
In essence, a lexical database is just a list with a bunch of information about words.

SLIDE 3

Lexical Databases

The truth: you'll have to find out where to find something and be prepared to do some processing work.

CELEX: the big and complex lexical database

SLIDE 4

History

Centre for Lexical Information
Founded in Nijmegen in 1986
Max Planck Institute for Psycholinguistics & Interfaculty Research Unit for Language and Speech of the University of Nijmegen (now CLS)
Project ended in 2000
Three large databases with lexical information for Dutch, English, and German

Dutch Database
124,136 lemmata
381,292 wordforms
211,389 corpus types
English Database
52,446 lemmata
160,594 wordforms
220,271 corpus types
German database
51,728 lemmata
365,530 wordforms
290,712 corpus types

SLIDE 5

Wordforms, lemmas, and corpus types

Corpus types
Letter strings, regardless of part of speech
a walk in the park = to walk slowly = i walk alone = you walk alone

SLIDE 6

Wordforms
Letter strings disambiguated for part of speech (and sometimes meaning)
a walk in the park ≠ to walk slowly ≠ i walk alone ≠ you walk alone
(walk, noun, singular), (walk, verb, infinitive), (walk, verb, 1p), (walk, verb, 2p)

Lemmas
Headwords
(walk, noun): a walk in the park = the long walks
(walk, verb): I'm walking slowly = i walk alone = he walks too fast

SLIDE 7

Celex Build Up

Information from dictionary sources
Corpus counts or correlation with existing frequency counts
Almost completely biased towards written language

Dutch Database Sources

Van Dale's Comprehensive Dictionary of Contemporary Dutch (1984)
80,000 lemmata
Word List of the Dutch Language ('Het Groene Boekje') (1954), plus later revisions, including the 1994 spelling reform
65,000 lemmata
The most frequent lemmata from the text corpus of the Institute for Dutch Lexicology (INL), 42,380,000 words in all
15,000 lemmata

SLIDE 8

English Database Sources

Oxford Advanced Learner's Dictionary (1974)
41,000 lemmata
Longman Dictionary of Contemporary English (1978)
53,000 lemmata

German Database Sources

Bonnlex, supplied by the Institute for Communication Research and Phonetics in Bonn
Molex, supplied by the Institute for German Language in Mannheim
Noetic Circle Services (MIT) German spelling lexicon

SLIDE 9

Dutch Frequency Sources

INL Corpus (42 million tokens)
930 entire fiction and non-fiction books (approx. 30% fiction, 70% non-fiction) published between 1970 and 1988. Newspapers, magazines, children's books, textbooks and specialist literature do not feature in the collection.

English Frequency Sources

COBUILD/Birmingham corpus (17.9 million tokens)
16.6 million tokens from written texts
1.3 million tokens from transcribed dialogue

SLIDE 10

German Frequency Sources

Mannheimer Korpus I, Mannheimer Korpus II and Bonner Zeitungskorpus 1
5.4 million tokens
written texts like newspapers, fiction and non-fiction

Freiburger Korpus
600,000 tokens
transcribed speech
Corpus Types
Frequency
Orthography

SLIDE 11

Lemma lexica
Frequency
Orthography
Phonology
Derivational Morphology
Grammatical information
Wordform Lexica
Frequency
Orthography
Phonology
Inflectional Morphology

SLIDE 12

Frequency

Verb     Frequency  Deviation  Freq/Million
accept   3712                  207.37
accord   2010       12         112.29
achieve  2121                  118.49
act      2212       430        123.58
add      4190                  234.08
agree    3424                  191.28

SLIDE 13

Lexicon   Form    Frequency  Deviation  Frequency/million
lemma     act     2212       430        123.58
wordform  act     269        1233       15.03
wordform  acted   92         366        5.14
wordform  acting  489        103        27.32
wordform  acts    187        80         10.45

Lemma frequency
Frequency over all wordforms of the lemma
Wordform frequency
Deviation == 0 : exact count
Deviation > 0 : result of disambiguation

SLIDE 14

Less than 100 tokens
Manual disambiguation
More than 100 tokens
Disambiguation on a sample of 100 tokens
Frequency ± deviation = 95% confidence interval

No disambiguation for verbal flection
Frequency divided between forms
Frequency Deviation > Frequency
No disambiguation for German
Frequency divided between forms
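
A minimal sketch of how the deviation field can be read, using the values shown for "act" on the previous slide. The function name and the clipping of the lower bound at zero are assumptions made for illustration; they are not part of CELEX.

```python
def frequency_interval(frequency, deviation):
    """Return (lower, upper) for frequency ± deviation.

    Assumption: the lower bound is clipped at 0, because a count cannot
    be negative even when the deviation exceeds the frequency.
    """
    lower = max(0, frequency - deviation)
    upper = frequency + deviation
    return lower, upper


# Lemma "act": frequency 2212, deviation 430
print(frequency_interval(2212, 430))
# Wordform "act": frequency 269, deviation 1233 (deviation > frequency)
print(frequency_interval(269, 1233))
```

A deviation of 0 gives an interval of zero width, which corresponds to an exact count.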

SLIDE 15

English and German databases have separate fields for written and spoken frequencies
Spoken frequencies based on very small corpora

1.3 million for English
0.4 million for German
What does it mean when an entry in CELEX has a frequency of zero?
Many entries in the database sources were not found in the frequency sources
A few entries do not come from database sources but are left with a zero frequency after disambiguation
These will have deviation > zero
Many entries added to CeLex for the morphological decomposition of other lemmas have a frequency of zero

SLIDE 16

Word frequency distributions

word  frequency  rank
the   1 093 546  1
of    540 085    2
and   514 946    3
to    483 428    4
a     422 334    5
in    337 995    6
that  217 376    7
it    199 920    8
i     198 139    9

SLIDE 17

Let's plot the rank of each word in the COBUILD corpus against its frequency.
The word with the highest frequency gets the highest rank (1), the word with the lowest frequency gets the lowest rank (220,270).
In total there are 17.9 million word tokens in the COBUILD corpus.

SLIDE 18

Not very clear.
Let's plot it again so that the difference between a frequency of 1 and a frequency of 10 is the same as the difference between a frequency of 10 and a frequency of 100

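[Figure: word frequency on a logarithmic scale plotted against rank. Labelled points: the, of, and, to, in, a. Annotation at the low end: "all these words have frequency 1". Axis ticks: 10^6 = 1,000,000; 10^3 = 1,000; 10^0 = 1]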

SLIDE 19

Word frequency lists are composed of very few words with a very high frequency
Most words (corpus types) occur only once in the corpus!
The relation between word frequency and rank is log linear.
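
As a rough sketch of the ranking step described above, the Python below sorts a handful of the COBUILD counts quoted earlier and returns each word's rank together with log10 of its frequency. The function name is made up for this example, and ties are simply left in sort order, which is an assumption rather than a documented convention.

```python
import math


def rank_by_frequency(counts):
    """Return (word, rank, log10 frequency) tuples, most frequent first."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [
        (word, rank, math.log10(freq))
        for rank, (word, freq) in enumerate(ordered, start=1)
    ]


# Five of the raw COBUILD frequencies from the slides, in arbitrary order.
counts = {"and": 514946, "the": 1093546, "to": 483428, "of": 540085, "a": 422334}

for word, rank, log_freq in rank_by_frequency(counts):
    print(rank, word, round(log_freq, 2))
```

Plotting the log-transformed frequencies against rank is what makes the low-frequency end of the list visible.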

SLIDE 20

Word frequencies from different databases cannot be easily compared because of different corpus sizes
Example: Celex Dutch ±42 million vs Celex English ±18 million
Solution: frequency per million words

Comparing frequencies

Frequency per million

word  frequency  frequency per million  rank
the   1 093 546  60 955.74              1
of    540 085    30 105.07              2
and   514 946    28 703.79              3
to    483 428    26 946.93              4
a     422 334    23 541.47              5
in    337 995    18 840.30              6
that  217 376    12 116.83              7
it    199 920    11 143.81              8
i     198 139    11 044.54              9

SLIDE 21

Beware! Some frequency lists contain words with a frequency of 0
Log10(0) is not something that can be computed
Solution: always add 1 to the raw frequencies when you are transforming to frequencies per million

Comparing frequencies Formula

Frequency per million (adjusted) = (Raw Frequency + 1) / Corpus size in millions
FPM('that') = (217 376 + 1) / 17.94 = 12116.89
log10(12116.89) = 4.08
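
The adjusted formula can be sketched in a few lines of Python. The function name is an assumption made for this example; the numbers are the ones for "that" given above.

```python
import math


def frequency_per_million(raw_frequency, corpus_size_in_millions):
    """Adjusted frequency per million: add 1 to the raw count first."""
    return (raw_frequency + 1) / corpus_size_in_millions


fpm_that = frequency_per_million(217376, 17.94)
print(round(fpm_that, 2))              # 12116.89
print(round(math.log10(fpm_that), 2))  # 4.08

# A word with a raw frequency of 0 still gets a value that can be logged.
fpm_zero = frequency_per_million(0, 17.94)
print(math.log10(fpm_zero) < 0)        # True
```

Adding 1 before dividing is what keeps zero-frequency entries usable after the log transform.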

SLIDE 22

Zipf Values

Van Heuven, Mandera, Keuleers, & Brysbaert (2014)

Formula

Freq. per billion = (Raw Frequency + 1) / Corpus size in billions
FPB('that') = (217 376 + 1) / 0.01794 = 12116889.63
log10(12116889.63) = 7.08
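
A short sketch of the Zipf computation, following the frequency-per-billion formula on this slide. The function name and the choice to pass the corpus size in millions are assumptions for illustration.

```python
import math


def zipf_value(raw_frequency, corpus_size_in_millions):
    """log10 of the adjusted frequency per billion words."""
    corpus_size_in_billions = corpus_size_in_millions / 1000
    return math.log10((raw_frequency + 1) / corpus_size_in_billions)


zipf_that = zipf_value(217376, 17.94)
print(round(zipf_that, 2))  # 7.08

# The Zipf value is the log10 frequency per million plus 3.
log_fpm_that = math.log10((217376 + 1) / 17.94)
print(abs(zipf_that - (log_fpm_that + 3)) < 1e-9)  # True
```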

SLIDE 23

word  frequency  relative frequency  log10(fpm)  Zipf
the   1 093 546  0.0602191           4.78        7.78
of    540 085    0.0297413           4.47        7.47
and   514 946    0.0283569           4.45        7.45
to    483 428    0.0266213           4.43        7.43
a     422 334    0.0232570           4.37        7.37
in    337 995    0.0186127           4.27        7.27
that  217 376    0.0119704           4.08        7.08
it    199 920    0.0110092           4.04        7.04
i     198 139    0.0109111           4.04        7.04

Orthography

SLIDE 24

Lemma and wordform lexica list orthographic variants with separate frequencies
Dutch: preferred, non-preferred, informal
preferred & non-preferred: in “Groene Boekje”
informal: non-standard forms occurring at least once in INL corpus

English: British, American
British: acceptable for British
American: occurs only in American
German
No orthographic variants

Status

SLIDE 25

Lemma ID  Form              Status         Frequency
1070      aardappelcroquet  preferred
1070      aardappelkroket   non-preferred
1138      aardelektrode     preferred
1138      aardelectrode     non-preferred
1202      aardolieprodukt   preferred      6
1202      aardolieproduct   non-preferred
1357      abductie          preferred
1357      abduktie          non-preferred

Lemma ID  Form          Status    Frequency
1359      anaesthesia   British   12
1359      anesthesia    American  1
1360      anaesthetic   British   47
1360      anesthetic    American  4
1361      anaesthetic   British   8
1361      anesthetic    American
1362      anaesthetist  British   16

SLIDE 26

Abstract stems for Dutch
If a stem with final s or f changes to z or v anywhere in its inflectional paradigm, an abstract stem is given ending in z or v.

Type       Stem               Abstract Stem
Adjective  approximatief      approximatiev
Noun       arbeidershuis      arbeidershuiz
Noun       arbeidersparadijs  arbeidersparadijz
Noun       arbeidsbeurs       arbeidsbeurz
Adjective  arbeidsextensief   arbeidsextensiev
Adjective  arbeidsintensief   arbeidsintensiev
Adjective  arbeidsloos        arbeidslooz

SLIDE 27

Phonology

Canonical phonetic transcriptions for written forms
English: primary and secondary pronunciation

Dutch, German: no phonetic variants
Syllabified
Stress and CV patterns

SLIDE 28

IPA-like character sets
SAM-PA
CeLex
CPA
DISC character set
one character per phoneme
no ambiguity
unreadable

Idnum  Spelling    Status     DISC       Syllables IPA  Stress  CV
42577  sleekness   primary    sliknIs    sliːk-nɪs      10      CCVVC-CVC
42577  sleekness   secondary  sliknIs    sliːk-nəs      10      CCVVC-CVC
42582  sleepily    primary    slipIlI    sliː-pɪ-lɪ     100     CCVV-CV-CV
42582  sleepily    secondary  slipIlI    sliː-pə-lɪ     100     CCVV-CV-CV
42584  sleepiness  primary    slipInIs   sliː-pɪ-nɪs    100     CCVV-CV-CVC
42584  sleepiness  secondary  slipInIs   sliː-pɪ-nəs    100     CCVV-CV-CVC

SLIDE 29

Dutch and German
Separate phonetic transcriptions for headwords and stems

English
First variant is always the primary one, as listed in the English Pronouncing Dictionary
Newer versions of the dictionary use BBC English and Network English; the transcriptions in CeLex are probably RP.
Phonological transcriptions for morphologically complex Dutch and German stems with indication of morpheme boundaries
Only with CELEX and CPA character sets

SLIDE 30

Stem           Phonological Transcription  Phonetic Transcription
Arbeiter       arbait+@r                   [ar][bai][t@r]
Arbeitsplatz   arbait+s#plats              [ar][baits][plats]
Arbeitgeber    arbait#ge:b+@r              [ar][bait][ge:][b@r]
arbeitsamkeit  arbait#za:m#kait            [ar][bait][za:m][kait]

Morphology

SLIDE 31

Lemma Morphology
Morphstatus: indicates if the lemma has a relevant morphological decomposition
Segmentation
Immediate, Flat, Hierarchical

Immediate Segmentation

aansprakelijkheidsverzekering aansprakelijkheid s verzekering

SLIDE 32

Flat Segmentation

aansprakelijkheidsverzekering aan spreek elijk heid s ver zeker ing

Hierarchical Segmentation

aansprakelijkheidsverzekering
  aansprakelijkheid
    aansprakelijk
      aanspreek
        aan
        spreek
      elijk
    heid
  s
  verzekering
    verzeker
      ver
      zeker
    ing
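
The three segmentations are related: given the hierarchical one, the immediate and flat segmentations can be derived from it. The sketch below illustrates this with a nested Python structure. The representation (tuples of form and children) and the function names are assumptions for this example and do not reflect how CELEX stores the fields.

```python
# Each node is (form, children); a morpheme has an empty list of children.
TREE = ("aansprakelijkheidsverzekering", [
    ("aansprakelijkheid", [
        ("aansprakelijk", [
            ("aanspreek", [("aan", []), ("spreek", [])]),
            ("elijk", []),
        ]),
        ("heid", []),
    ]),
    ("s", []),
    ("verzekering", [
        ("verzeker", [("ver", []), ("zeker", [])]),
        ("ing", []),
    ]),
])


def immediate_segmentation(node):
    """Forms of the direct constituents of the top-level word."""
    _, children = node
    return [child_form for child_form, _ in children]


def flat_segmentation(node):
    """Forms of all morphemes (leaves), left to right."""
    form, children = node
    if not children:
        return [form]
    leaves = []
    for child in children:
        leaves.extend(flat_segmentation(child))
    return leaves


print(immediate_segmentation(TREE))
print(flat_segmentation(TREE))
```

The two printed lists match the immediate and flat segmentations shown on the slides.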

SLIDE 33

Wordform morphology
Inflectional Features

Form         Flection             Frequency
adjusted     past,1p,singular     35
adjuster     singular             5
adjusters    plural               1
adjusting    present,participle   71
adjustment   singular             150
adjustments  plural               84
adjusts      present,3p,singular  14

SLIDE 34

Grammatical Information

Syntactic class for lemmas
Dutch: Expression, Noun, Adjective, Quantifier/Numeral, Verb, Article, Pronoun, Adverb, Preposition, Conjunction, Interjection
English: Noun, Adjective, Numeral, Verb, Article, Pronoun, Adverb, Preposition, Conjunction, Interjection, Single, Complex, Letter, Abbreviation, Infinitival

SLIDE 35

German
Noun, Adjective, Quantifier/Numeral, Verb, Article, Pronoun, Adverb, Preposition, Conjunction, Interjection

Additional subclassification

Form        Class         Subclasses
videotape   Noun          uncountable
videotape   Verb          transitive
vide supra  Interjection
vie         Verb          linking
Vietnam     Noun          proper
Vietnamese  Adjective     ordinary
Vietnamese  Noun          countable

SLIDE 36

Form             Class      Subclasses
magnetisch       Adjective  nonadverbial
magnetiseren     Verb       lexical intransitive transitive
magnetiseerbaar  Adjective  nonadverbial
magnifiek        Adjective  adverbial
Magyaars         Adjective  adverbial
maharadja        Noun
maharishi        Noun
Mahdi            Noun
Mahler           Noun       propername

How to get information from CeLex

SLIDE 37

Option 1: CeLex CD with textfiles
Typical ‘text processing’ languages (AWK, Perl)

Elegant language (Python)
Import in spreadsheet application
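
A minimal sketch of what processing such a text file with Python might look like. It assumes one entry per line with backslash-separated fields and a simplified, hypothetical three-column layout (id, form, frequency); real CELEX files have many more fields, so the column order should be checked against the documentation. The sample rows reuse frequencies shown earlier in these slides, and the id numbers are made up.

```python
import io

# Hypothetical sample in an assumed layout: id \ form \ frequency
SAMPLE = io.StringIO("1\\accept\\3712\n2\\accord\\2010\n3\\act\\2212\n")


def read_entries(handle, separator="\\"):
    """Parse one entry per line into a dict (assumed three-field layout)."""
    entries = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        ident, form, frequency = line.split(separator)
        entries.append(
            {"id": int(ident), "form": form, "frequency": int(frequency)}
        )
    return entries


entries = read_entries(SAMPLE)

# Example selection: keep forms with a frequency of at least 2200.
frequent = [entry["form"] for entry in entries if entry["frequency"] >= 2200]
print(frequent)
```

With a real file, `SAMPLE` would be replaced by a handle from `open(...)` on the file in question.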

SLIDE 38

SLIDE 39

Option 2: Public web interface at MPI (WebCeLex)

Good tool for selection
Process with scripting language

SLIDE 40

SLIDE 41

Examples

SLIDE 42

To explore these questions, we conducted a replication simulation using a much larger corpus. Monosyllables were extracted from the CELEX electronic corpus (Baayen, Piepenbrock, & van Rijn, 1993). All items fitting a CCCVVCCC template were used, yielding 7,839 words. Most of the additional words are inflected items. The phonological network was expanded from 66 to 88 units to accommo-

Harm, M. W. & Seidenberg, M. S. (1999). Phonology, reading acquisition, and dyslexia: insights from connectionist models. Psychol Rev, 106(3), 491-528.

We sought to feed both our rule-based and analogical models a diet of stem/past tense pairs that would resemble what had been encountered by our experimental participants. We took our set of input forms from the English portion of the CELEX database (Baayen, Piepenbrock, & Gulikers, 1995), selecting all the verbs that had a lemma frequency of 10 or greater. In addition, for verbs that show more than one past tense (like dived/dove), we included both as separate entries (e.g. both dive-dived and dive-dove). The resulting corpus consisted of 4253 stem/past tense pairs, 4035 regular and 218 irregular. Verb forms were listed in a phonemic transcription reflecting American English pronunciation.

Albright, A. & Hayes, B. (2003). Rules vs. analogy in English past tenses: A computational/experimental study. Cognition, 90(2), 119-161.

SLIDE 43

A current debate in the acquisition literature (Bybee, 1995; Clahsen & Rothweiler, 1992; Marcus, Brinkmann, Clahsen, Wiese, & Pinker, 1995) concerns whether prefixed forms of the same stem (e.g. do/redo/outdo) should be counted separately for purposes of learning. We prepared a version of our learning set from which all prefixed forms were removed, thus cutting its size down to 3308 input pairs (3170 regular, 138 irregular), and ran both learning models on both sets. As it turned out, the rule-based model did slightly better on the full set, and the analogical model did slightly better on the edited set. The results below reflect the performance of each model on its own best learning set.

Albright, A. & Hayes, B. (2003). Rules vs. analogy in English past tenses: A computational/experimental study. Cognition, 90(2), 119-161.

Stimulus materials The word stimuli were two lists of 24 nouns drawn from the Celex database (Baayen, Piepenbrock, & Gulikers, 1995), based on a corpus of 16.6 million written words and 1.3 million spoken words. The first list consisted of singular dominant items, with an average frequency of 25 per million for the singular forms and eight for the plural forms. The second list consisted of plural dominant items with average frequencies of 9 and 26, respectively. The base frequencies (34 vs. 35) did not differ between the lists. The stimuli were further matched on the number of letters (6.3 and 6.3) and the number of syllables (2 and 2). A complete list of the stimuli is presented in Appendix A. As in Experiment 1, two versions of the word list were created, so that each participant saw only one form of a word.

New, B., Brysbaert, M., Segui, J., Ferrand, L., & Rastle, K. (2004). The processing of singular and plural nouns in French and English. Journal of Memory and Language, 51(4), 568-585.

SLIDE 44

Materials used in Experiment 3

            Singular                            Plural
Word        Mean reaction time  SD   Frequency  Mean reaction time  SD   Frequency

Singular dominant items
beast       468                 68   17         516                 96   11
belief      439                 43   67         477                 58   24
cathedral   476                 68   15         553                 83   3
clinic      480                 65   15         548                 77   5
dragon      482                 43   8          495                 80   2
famine      525                 52   7          642                 145  1
hat         422                 33   53         439                 51   15
journal     495                 78   18         456                 52   6
lieutenant  603                 67   14         609                 132  1

Plural dominant items
acre        584                 101  15         559                 85   23
ancestor    569                 77   6          568                 95   22
biscuit     434                 43   5          458                 85   11
critic      546                 92   12         534                 85   23
disciple    560                 114  4          533                 137  13
dollar      514                 65   15         503                 76   53
glove       455                 59   5          441                 50   15
heel        490                 94   11         479                 48   18
ingredient  535                 75   4          553                 110  11
lip         444                 50   17         482                 67   61

SLIDE 45