

slide-1
SLIDE 1

A General Introduction to Lexical Databases

Emmanuel Keuleers Department of Experimental Psychology Ghent University

EMLAR 2015 - Utrecht, April 15-17, 2015

  • What can you find in a lexical database?
  • How can you find it?
slide-2
SLIDE 2

Lexical Databases

  • Like a dictionary
  • Lexical properties of interest to psycholinguists
  • Frequency, orthography, phonology, morphology, syntax, …

  • Subjective ratings of those words
  • Behavioural responses to those data

Lexical Databases

  • No standard: each database has its own format, peculiarities, ...
  • Text files, web interfaces, e-mail services, etc.
  • In essence, a lexical database is just a list with a bunch of information about words.

slide-3
SLIDE 3

Lexical Databases

  • The truth: you'll have to find out where to find something and be prepared to do some processing work.

CELEX: the big and complex lexical database

slide-4
SLIDE 4

History

  • Centre for Lexical Information
  • Founded in Nijmegen in 1986
  • Max Planck Institute for Psycholinguistics & Interfaculty Research Unit for Language and Speech of the University of Nijmegen (now CLS)
  • Project ended in 2000
  • Three large databases with lexical information for Dutch, English, and German
  • Dutch database
      • 124,136 lemmata
      • 381,292 wordforms
      • 211,389 corpus types
  • English database
      • 52,446 lemmata
      • 160,594 wordforms
      • 220,271 corpus types
  • German database
      • 51,728 lemmata
      • 365,530 wordforms
      • 290,712 corpus types
slide-5
SLIDE 5

Wordforms, lemmas, and corpus types

Corpus types

  • Letter strings, regardless of part of speech
  • a walk in the park = to walk slowly = i walk alone = you walk alone

slide-6
SLIDE 6

Wordforms

  • Letter strings disambiguated for part of speech (and sometimes meaning)
  • a walk in the park ≠ to walk slowly ≠ i walk alone ≠ you walk alone
  • (walk, noun, singular), (walk, verb, infinitive), (walk, verb, 1p), (walk, verb, 2p)

Lemmas

  • Headwords
  • (walk, noun): a walk in the park = the long walks
  • (walk, verb): I'm walking slowly = i walk alone = he walks too fast
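The three counting levels can be sketched in a few lines of Python. This is only an illustration: the six tagged tokens and the `HEADWORD` lookup are hypothetical stand-ins built from the "walk" examples, not CELEX data.

```python
from collections import Counter

# Hypothetical mini-corpus: (letter string, part of speech, inflectional features).
tokens = [
    ("walk", "noun", "singular"),    # a walk in the park
    ("walk", "verb", "infinitive"),  # to walk slowly
    ("walk", "verb", "1p"),          # i walk alone
    ("walk", "verb", "2p"),          # you walk alone
    ("walks", "noun", "plural"),     # the long walks
    ("walks", "verb", "3p"),         # he walks too fast
]

# Toy headword lookup (assumption: it only covers this example).
HEADWORD = {"walk": "walk", "walks": "walk"}

# Corpus types: letter strings only, part of speech ignored.
corpus_types = Counter(form for form, _, _ in tokens)

# Wordforms: letter strings disambiguated for part of speech and features.
wordforms = Counter(tokens)

# Lemmas: headword plus part of speech, pooling all inflected forms.
lemmas = Counter((HEADWORD[form], pos) for form, pos, _ in tokens)

print(corpus_types)
print(wordforms)
print(lemmas)
```

The same six tokens give two corpus types, six wordforms, and two lemmas, which is why the three kinds of counts in a database are not interchangeable.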

slide-7
SLIDE 7

Celex Build Up

  • Information from dictionary sources
  • Corpus counts or correlation with existing frequency counts
  • Almost completely biased towards written language

Dutch Database Sources

  • Van Dale's Comprehensive Dictionary of Contemporary Dutch (1984)
      • 80,000 lemmata
  • Word List of the Dutch Language ('Het Groene Boekje') (1954), plus later revisions, including the 1994 spelling reform
      • 65,000 lemmata
  • The most frequent lemmata from the text corpus of the Institute for Dutch Lexicology (INL), 42,380,000 words in all
      • 15,000 lemmata
slide-8
SLIDE 8

English Database Sources

  • Oxford Advanced Learner's Dictionary (1974)
      • 41,000 lemmata
  • Longman Dictionary of Contemporary English (1978)
      • 53,000 lemmata

German Database Sources

  • Bonnlex, supplied by the Institute for Communication Research and Phonetics in Bonn
  • Molex, supplied by the Institute for German Language in Mannheim
  • Noetic Circle Services (MIT) German spelling lexicon

slide-9
SLIDE 9

Dutch Frequency Sources

  • INL Corpus (42 million tokens)
  • 930 entire fiction and non-fiction books (approx. 30% fiction, 70% non-fiction) published between 1970 and 1988. Newspapers, magazines, children's books, textbooks and specialist literature do not feature in the collection.

English Frequency Sources

  • COBUILD/Birmingham corpus (17.9 million tokens)
      • 16.6 million tokens from written texts
      • 1.3 million tokens from transcribed dialogue

slide-10
SLIDE 10

German Frequency Sources

  • Mannheimer Korpus I, Mannheimer Korpus II and Bonner Zeitungskorpus 1
      • 5.4 million tokens
      • written texts like newspapers, fiction and non-fiction
  • Freiburger Korpus
      • 600,000 tokens
      • transcribed speech
  • Corpus types
      • Frequency
      • Orthography
slide-11
SLIDE 11
  • Lemma lexica
      • Frequency
      • Orthography
      • Phonology
      • Derivational morphology
      • Grammatical information
  • Wordform lexica
      • Frequency
      • Orthography
      • Phonology
      • Inflectional morphology
slide-12
SLIDE 12

Frequency

Verb      Frequency  Deviation  Freq/Million
accept         3712                   207.37
accord         2010         12        112.29
achieve        2121                   118.49
act            2212        430        123.58
add            4190                   234.08
agree          3424                   191.28

slide-13
SLIDE 13

Lexicon    Form     Frequency  Deviation  Frequency/million
lemma      act           2212        430             123.58
wordform   act            269       1233              15.03
wordform   acted           92        366               5.14
wordform   acting         489        103              27.32
wordform   acts           187         80              10.45
wordform   act            269       1233              15.03
wordform   acted           92        366               5.14
wordform   act            269       1233              15.03
wordform   acted           92        366               5.14
wordform   act            269       1233              15.03
wordform   acted           92        366               5.14
wordform   acted           92        366               5.14

  • Lemma frequency
  • Frequency over all wordforms of the lemma
  • Wordform frequency
  • Deviation == 0 : exact count
  • Deviation > 0 : result of disambiguation
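As a quick check on "frequency over all wordforms of the lemma", the wordform rows for the verb "act" can be summed. The rows below are copied from the table; reading the repeated "act" and "acted" rows as separate wordform entries that share one letter string (different inflectional features) is an interpretation, not something the table states.

```python
# (form, frequency, deviation) rows for the lemma "act", copied from the table.
wordform_rows = [
    ("act", 269, 1233), ("acted", 92, 366), ("acting", 489, 103), ("acts", 187, 80),
    ("act", 269, 1233), ("acted", 92, 366),
    ("act", 269, 1233), ("acted", 92, 366),
    ("act", 269, 1233), ("acted", 92, 366),
    ("acted", 92, 366),
]

# Lemma frequency: sum over every wordform entry of the lemma.
lemma_frequency = sum(frequency for _, frequency, _ in wordform_rows)

# Frequency pooled per letter string (4 x "act", 5 x "acted", ...).
per_string = {}
for form, frequency, _ in wordform_rows:
    per_string[form] = per_string.get(form, 0) + frequency

print(lemma_frequency)
print(per_string)
```

The total matches the lemma row (2212), which suggests that the count for an ambiguous letter string was split evenly over its wordform entries.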
slide-14
SLIDE 14
  • Fewer than 100 tokens
      • Manual disambiguation
  • More than 100 tokens
      • Disambiguation on a sample of 100 tokens
      • Frequency ± deviation = 95% confidence interval
  • No disambiguation for verbal flection
      • Frequency divided between forms
      • Frequency deviation > frequency
  • No disambiguation for German
      • Frequency divided between forms
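A minimal sketch of reading "frequency ± deviation" as an interval. The function name is hypothetical, and clipping the lower bound at zero is an assumption added here for the cases where the deviation is larger than the frequency.

```python
def frequency_interval(frequency, deviation):
    """Return (lower, upper) for frequency ± deviation, lower bound clipped at 0."""
    lower = max(0, frequency - deviation)  # assumption: a count cannot be negative
    upper = frequency + deviation
    return lower, upper

print(frequency_interval(2212, 430))   # lemma "act"
print(frequency_interval(269, 1233))   # wordform "act": deviation > frequency
```

For the undisambiguated verb forms the interval is so wide that the frequency value on its own says very little.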
slide-15
SLIDE 15
  • English and German databases have separate fields for written and spoken frequencies
  • Spoken frequencies based on very small corpora
      • 1.3 million for English
      • 0.4 million for German
  • What does it mean when an entry in CELEX has a frequency of zero?
  • Many entries in the database sources were not found in the frequency sources
  • A few entries do not come from database sources but are left with a zero frequency after disambiguation
      • will have deviation > zero
  • Many entries added to CeLex for morphological decomposition of other lemmas have a frequency of zero

slide-16
SLIDE 16

Word frequency distributions

Word frequency distributions

word   frequency   rank
the    1 093 546      1
of       540 085      2
and      514 946      3
to       483 428      4
a        422 334      5
in       337 995      6
that     217 376      7
it       199 920      8
i        198 139      9

slide-17
SLIDE 17
  • Let's plot the rank of each word in the COBUILD corpus against its frequency.
  • The word with the highest frequency gets the highest rank (1), the word with the lowest frequency gets the lowest rank (220,270).
  • In total there are 17.9 million word tokens in the COBUILD corpus.

slide-18
SLIDE 18
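  • Not very clear.
  • Let's plot it again so that the difference between a frequency of 1 and a frequency of 10 is the same as the difference between a frequency of 10 and a frequency of 100

[Figure: frequency plotted against rank with a logarithmic frequency axis (10^0 = 1, 10^3 = 1,000, 10^6 = 1,000,000). Labelled points: the, of, and, to, in, a. Annotation on the tail: "all these words have frequency 1".]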

slide-19
SLIDE 19
  • Word frequency lists are composed of very few words with a very high frequency
  • Most words (corpus types) occur only once in the corpus!
  • The relation between word frequency and rank is log linear.
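The rank/frequency relation can be explored without a plotting library. The sketch below uses only the nine COBUILD frequencies quoted in the word frequency table earlier and prints the log10 values that a log-scaled plot would show; variable names are illustrative.

```python
import math

# Frequencies of the nine most frequent words, as quoted in the table earlier.
frequencies = {
    "the": 1_093_546, "of": 540_085, "and": 514_946, "to": 483_428, "a": 422_334,
    "in": 337_995, "that": 217_376, "it": 199_920, "i": 198_139,
}

# Rank 1 goes to the most frequent word.
ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)

points = [
    (word, rank, math.log10(rank), math.log10(freq))
    for rank, (word, freq) in enumerate(ranked, start=1)
]

for word, rank, log_rank, log_freq in points:
    print(f"{word:5s} rank={rank}  log10(rank)={log_rank:.2f}  log10(freq)={log_freq:.2f}")
```

With a full frequency list, plotting `log10(rank)` against `log10(freq)` is one way to inspect how close the relation is to a straight line.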

slide-20
SLIDE 20
  • Word frequencies from different databases cannot be easily compared because of different corpus sizes
  • Example: Celex Dutch ±42 million vs Celex English ±18 million
  • Solution: frequency per million words
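A minimal sketch of the conversion, assuming a COBUILD corpus size of 17.94 million tokens (the value used in the worked formula on a later slide). The function and constant names are illustrative.

```python
def frequency_per_million(raw_frequency, corpus_size_in_millions):
    """Scale a raw count to occurrences per million tokens."""
    return raw_frequency / corpus_size_in_millions

COBUILD_SIZE_MILLIONS = 17.94  # assumed corpus size, in millions of tokens

print(round(frequency_per_million(1_093_546, COBUILD_SIZE_MILLIONS), 2))  # 'the'
print(round(frequency_per_million(217_376, COBUILD_SIZE_MILLIONS), 2))    # 'that'
```

Dividing by the size of each corpus puts counts from a 42-million-token corpus and an 18-million-token corpus on the same scale.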

Comparing frequencies

Frequency per million

word   frequency   frequency per million   rank
the    1 093 546               60 955.74      1
of       540 085               30 105.07      2
and      514 946               28 703.79      3
to       483 428               26 946.93      4
a        422 334               23 541.47      5
in       337 995               18 840.30      6
that     217 376               12 116.83      7
it       199 920               11 143.81      8
i        198 139               11 044.54      9

slide-21
SLIDE 21
  • Beware! Some frequency lists contain words with a frequency of 0
  • log10(0) is undefined, so it cannot be computed
  • Solution: always add 1 to the raw frequencies when you are transforming to frequencies per million

Comparing frequencies Formula

Frequency per million (adjusted):

$$\text{FPM} = \frac{\text{raw frequency} + 1}{\text{corpus size in millions}}$$

$$\text{FPM}(\text{'that'}) = \frac{217\,376 + 1}{17.94} = 12\,116.89, \qquad \log_{10}(12\,116.89) = 4.08$$
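The adjusted formula as a short Python sketch, reproducing the worked value for 'that' and showing that a raw frequency of 0 no longer breaks the logarithm. The function name is illustrative.

```python
import math

def adjusted_fpm(raw_frequency, corpus_size_in_millions):
    """Frequency per million with 1 added to the raw count, so zero counts can be logged."""
    return (raw_frequency + 1) / corpus_size_in_millions

fpm_that = adjusted_fpm(217_376, 17.94)
print(round(fpm_that, 2), round(math.log10(fpm_that), 2))

# A word that never occurs in the corpus still gets a finite log value.
fpm_zero = adjusted_fpm(0, 17.94)
print(round(math.log10(fpm_zero), 2))
```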

slide-22
SLIDE 22

Zipf Values

Van Heuven, Mandera, Keuleers, & Brysbaert (2014)

Formula

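  • Zipf value = log10(frequency per billion), with

$$\text{FPB} = \frac{\text{raw frequency} + 1}{\text{corpus size in billions}}$$

$$\text{FPB}(\text{'that'}) = \frac{217\,376 + 1}{0.01794} = 12\,116\,889.63, \qquad \log_{10}(12\,116\,889.63) = 7.08$$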
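A sketch of the Zipf value as given on this slide (log10 of the adjusted frequency per billion). It follows the slide's simplified formula and is not meant as a restatement of the published definition; the function name is illustrative.

```python
import math

def zipf_value(raw_frequency, corpus_size_in_millions):
    """log10 of frequency per billion tokens, with 1 added to the raw count."""
    corpus_size_in_billions = corpus_size_in_millions / 1000
    return math.log10((raw_frequency + 1) / corpus_size_in_billions)

zipf_that = zipf_value(217_376, 17.94)
log_fpm_that = math.log10((217_376 + 1) / 17.94)

print(round(zipf_that, 2))
# Per billion is per million times 1000, so the Zipf value is log10(fpm) + 3.
print(round(zipf_that - log_fpm_that, 2))
```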

slide-23
SLIDE 23

word   frequency   relative frequency   log10(fpm)   zipf
the    1 093 546            0.0602191         4.78   7.78
of       540 085            0.0297413         4.47   7.47
and      514 946            0.0283569         4.45   7.45
to       483 428            0.0266213         4.43   7.43
a        422 334            0.0232570         4.37   7.37
in       337 995            0.0186127         4.27   7.27
that     217 376            0.0119704         4.08   7.08
it       199 920            0.0110092         4.04   7.04
i        198 139            0.0109111         4.04   7.04

Orthography

slide-24
SLIDE 24
  • Lemma and wordform lexica list orthographic variants with separate frequencies
  • Dutch: preferred, non-preferred, informal
      • preferred & non-preferred: in “Groene Boekje”
      • informal: non-standard forms occurring at least once in INL corpus
  • English: British, American
      • British: acceptable for British
      • American: occurs only in American
  • German
      • No orthographic variants

Status

slide-25
SLIDE 25

Lemma ID  Form              Status         Frequency
1070      aardappelcroquet  preferred
1070      aardappelkroket   non-preferred
1138      aardelektrode     preferred
1138      aardelectrode     non-preferred
1202      aardolieprodukt   preferred      6
1202      aardolieproduct   non-preferred
1357      abductie          preferred
1357      abduktie          non-preferred

Lemma ID  Form          Status    Frequency
1359      anaesthesia   British   12
1359      anesthesia    American  1
1360      anaesthetic   British   47
1360      anesthetic    American  4
1361      anaesthetic   British   8
1361      anesthetic    American
1362      anaesthetist  British   16

slide-26
SLIDE 26
  • Abstract stems for Dutch
  • If a stem with final s or f changes to z or v anywhere in its inflectional paradigm, an abstract stem is given ending with z or v.

Type       Stem               Abstract Stem
Adjective  approximatief      approximatiev
Noun       arbeidershuis      arbeidershuiz
Noun       arbeidersparadijs  arbeidersparadijz
Noun       arbeidsbeurs       arbeidsbeurz
Adjective  arbeidsextensief   arbeidsextensiev
Adjective  arbeidsintensief   arbeidsintensiev
Adjective  arbeidsloos        arbeidslooz
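The abstract-stem rule can be sketched as a small function. Whether a given stem actually alternates in its paradigm has to come from the database; here it is passed in as a hypothetical `alternates` flag.

```python
FINAL_ALTERNATIONS = {"s": "z", "f": "v"}

def abstract_stem(stem, alternates):
    """Return the abstract stem.

    `alternates` (hypothetical flag) says whether the final s/f turns into z/v
    somewhere in the inflectional paradigm.
    """
    if alternates and stem and stem[-1] in FINAL_ALTERNATIONS:
        return stem[:-1] + FINAL_ALTERNATIONS[stem[-1]]
    return stem

print(abstract_stem("approximatief", True))
print(abstract_stem("arbeidershuis", True))
print(abstract_stem("arbeidsloos", True))
```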

slide-27
SLIDE 27

Phonology

  • Canonical phonetic transcriptions for written forms
  • English: primary and secondary pronunciation
  • Dutch, German: no phonetic variants
  • Syllabified
  • Stress and CV patterns
slide-28
SLIDE 28
  • IPA-like character sets
      • SAM-PA
      • CeLex
      • CPA
  • DISC character set
      • one character per phoneme
      • no ambiguity
      • unreadable

Idnum  Spelling    Status     DISC      Syllables (IPA)  Stress  CV
42577  sleekness   primary    sliknIs   sliːk-nɪs        10      CCVVC-CVC
42577  sleekness   secondary  sliknIs   sliːk-nəs        10      CCVVC-CVC
42582  sleepily    primary    slipIlI   sliː-pɪ-lɪ       100     CCVV-CV-CV
42582  sleepily    secondary  slipIlI   sliː-pə-lɪ       100     CCVV-CV-CV
42584  sleepiness  primary    slipInIs  sliː-pɪ-nɪs      100     CCVV-CV-CVC
42584  sleepiness  secondary  slipInIs  sliː-pɪ-nəs      100     CCVV-CV-CVC
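To illustrate how the CV column relates to the syllabified transcription, here is a rough sketch that derives a CV pattern from the IPA strings shown in the table. It is not how CELEX computes the field: the vowel set is an assumption that only covers the symbols in these rows, and the length mark is simply counted as a second V.

```python
# Assumption: only the vowel symbols that occur in the rows above.
VOWELS = set("iɪə")
LENGTH_MARK = "ː"  # counted as a second V slot, so "iː" becomes "VV"

def cv_pattern(syllabified_ipa):
    """Map each symbol to C or V, keeping the syllable hyphens."""
    pattern = []
    for symbol in syllabified_ipa:
        if symbol == "-":
            pattern.append("-")
        elif symbol in VOWELS or symbol == LENGTH_MARK:
            pattern.append("V")
        else:
            pattern.append("C")
    return "".join(pattern)

print(cv_pattern("sliːk-nɪs"))
print(cv_pattern("sliː-pɪ-lɪ"))
print(cv_pattern("sliː-pɪ-nɪs"))
```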

slide-29
SLIDE 29
  • Dutch and German
      • Separate phonetic transcriptions for headwords and stems
  • English
      • First variant is always the primary one, as listed in the English Pronouncing Dictionary
      • Newer versions use BBC English and Network English; transcriptions in CeLex are probably RP.
  • Phonological transcriptions for morphologically complex Dutch and German stems with indication of morpheme boundaries
      • Only with CELEX and CPA character sets
slide-30
SLIDE 30

Stem           Phonological Transcription  Phonetic Transcription
Arbeiter       arbait+@r                   [ar][bai][t@r]
Arbeitsplatz   arbait+s#plats              [ar][baits][plats]
Arbeitgeber    arbait#ge:b+@r              [ar][bait][ge:][b@r]
arbeitsamkeit  arbait#za:m#kait            [ar][bait][za:m][kait]

Morphology

slide-31
SLIDE 31
  • Lemma Morphology
  • Morphstatus: indicates if the lemma has a relevant morphological decomposition
  • Segmentation
  • Immediate, Flat, Hierarchical

Immediate Segmentation

aansprakelijkheidsverzekering = aansprakelijkheid + s + verzekering

slide-32
SLIDE 32

Flat Segmentation

aansprakelijkheidsverzekering = aan + spreek + elijk + heid + s + ver + zeker + ing

Hierarchical Segmentation

aansprakelijkheidsverzekering
    aansprakelijkheid
        aansprakelijk
            aanspreek
                aan
                spreek
            elijk
        heid
    s
    verzekering
        verzeker
            ver
            zeker
        ing
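The three segmentations are views of one structure. A sketch, assuming the hierarchical segmentation is stored as a nested tree (the tuple representation is an illustration, not the CELEX file format): the immediate segmentation is the top level of the tree and the flat segmentation is its leaves.

```python
# Nested (label, children) pairs; leaves have an empty child list.
TREE = ("aansprakelijkheidsverzekering", [
    ("aansprakelijkheid", [
        ("aansprakelijk", [
            ("aanspreek", [("aan", []), ("spreek", [])]),
            ("elijk", []),
        ]),
        ("heid", []),
    ]),
    ("s", []),
    ("verzekering", [
        ("verzeker", [("ver", []), ("zeker", [])]),
        ("ing", []),
    ]),
])

def immediate(node):
    """Labels of the direct constituents of a node."""
    return [label for label, _ in node[1]]

def flat(node):
    """Labels of all leaves, left to right."""
    label, children = node
    if not children:
        return [label]
    return [leaf for child in children for leaf in flat(child)]

print(immediate(TREE))
print(flat(TREE))
```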

slide-33
SLIDE 33
  • Wordform morphology
  • Inflectional Features

Form         Flection              Frequency
adjusted     past,1p,singular             35
adjuster     singular                      5
adjusters    plural                        1
adjusting    present,participle           71
adjustment   singular                    150
adjustments  plural                       84
adjusts      present,3p,singular          14

slide-34
SLIDE 34

Grammatical Information

  • Syntactic class for lemmas
  • Dutch: Expression, Noun, Adjective, Quantifier/Numeral, Verb, Article, Pronoun, Adverb, Preposition, Conjunction, Interjection
  • English: Noun, Adjective, Numeral, Verb, Article, Pronoun, Adverb, Preposition, Conjunction, Interjection, Single, Complex, Letter, Abbreviation, Infinitival

slide-35
SLIDE 35
  • German: Noun, Adjective, Quantifier/Numeral, Verb, Article, Pronoun, Adverb, Preposition, Conjunction, Interjection

  • Additional subclassification

Form        Class         Subclasses
videotape   Noun          uncountable
videotape   Verb          transitive
vide supra  Interjection
vie         Verb          linking
Vietnam     Noun          proper
Vietnamese  Adjective     ordinary
Vietnamese  Noun          countable

slide-36
SLIDE 36

Form             Class      Subclasses
magnetisch       Adjective  nonadverbial
magnetiseren     Verb       lexical intransitive transitive
magnetiseerbaar  Adjective  nonadverbial
magnifiek        Adjective  adverbial
Magyaars         Adjective  adverbial
maharadja        Noun
maharishi        Noun
Mahdi            Noun
Mahler           Noun       propername

How to get information from CeLex

slide-37
SLIDE 37
  • Option 1: CeLex CD with text files
      • Typical ‘text processing’ languages (AWK, Perl)
      • Elegant language (Python)
      • Import in spreadsheet application
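A sketch of the text-file route in Python. Everything about the layout here is an assumption for illustration: the backslash field separator, the column order (id, headword, frequency, deviation, frequency per million), and the ID numbers are hypothetical, so check the documentation of the actual file before reusing this. The frequency values are the "accord" and "act" rows from the frequency table earlier.

```python
# Hypothetical sample lines in an assumed backslash-delimited layout.
SAMPLE_LINES = [
    r"1\accord\2010\12\112.29",
    r"2\act\2212\430\123.58",
]

def parse_line(line):
    """Split one line into named fields (assumed column order)."""
    ident, head, frequency, deviation, fpm = line.rstrip("\n").split("\\")
    return {
        "id": int(ident),
        "head": head,
        "frequency": int(frequency),
        "deviation": int(deviation),
        "fpm": float(fpm),
    }

rows = [parse_line(line) for line in SAMPLE_LINES]

# Example selection: headwords with a frequency of at least 2100.
frequent = [row["head"] for row in rows if row["frequency"] >= 2100]
print(frequent)
```

With a real file, `SAMPLE_LINES` would be replaced by iterating over the open file object; the selection step stays the same.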
slide-38
SLIDE 38
slide-39
SLIDE 39
  • Option 2: Public web interface at MPI (WebCeLex)
      • Good tool for selection
      • Process with scripting language
slide-40
SLIDE 40
slide-41
SLIDE 41

Examples

slide-42
SLIDE 42

To explore these questions, we conducted a replication simulation using a much larger corpus. Monosyllables were extracted from the CELEX electronic corpus (Baayen, Piepenbrock, & van Rijn, 1993). All items fitting a CCCVVCCC template were used, yielding 7,839 words. Most of the additional words are inflected items. The phonological network was expanded from 66 to 88 units to accommo-

Harm, M. W. & Seidenberg, M. S. (1999). Phonology, reading acquisition, and dyslexia: insights from connectionist models. Psychol Rev, 106(3), 491-528.

We sought to feed both our rule-based and analogical models a diet of stem/past tense pairs that would resemble what had been encountered by our experimental participants. We took our set of input forms from the English portion of the CELEX database (Baayen, Piepenbrock, & Gulikers, 1995), selecting all the verbs that had a lemma frequency of 10 or greater. In addition, for verbs that show more than one past tense (like dived/dove), we included both as separate entries (e.g. both dive-dived and dive-dove). The resulting corpus consisted of 4253 stem/past tense pairs, 4035 regular and 218 irregular. Verb forms were listed in a phonemic transcription reflecting American English pronunciation.

Albright, A. & Hayes, B. (2003). Rules vs. analogy in English past tenses: A computational/experimental study. Cognition, 90(2), 119-161.
slide-43
SLIDE 43

A current debate in the acquisition literature (Bybee, 1995; Clahsen & Rothweiler, 1992; Marcus, Brinkmann, Clahsen, Wiese, & Pinker, 1995) concerns whether prefixed forms of the same stem (e.g. do/redo/outdo) should be counted separately for purposes of learning. We prepared a version of our learning set from which all prefixed forms were removed, thus cutting its size down to 3308 input pairs (3170 regular, 138 irregular), and ran both learning models on both sets. As it turned out, the rule-based model did slightly better on the full set, and the analogical model did slightly better on the edited set. The results below reflect the performance of each model on its own best learning set.

Albright, A. & Hayes, B. (2003). Rules vs. analogy in English past tenses: A computational/experimental study. Cognition, 90(2), 119-161.

Stimulus materials

The word stimuli were two lists of 24 nouns drawn from the Celex database (Baayen, Piepenbrock, & Gulikers, 1995), based on a corpus of 16.6 million written words and 1.3 million spoken words. The first list consisted of singular dominant items, with an average frequency of 25 per million for the singular forms and eight for the plural forms. The second list consisted of plural dominant items with average frequencies of 9 and 26, respectively. The base frequencies (34 vs. 35) did not differ between the lists. The stimuli were further matched on the number of letters (6.3 and 6.3) and the number of syllables (2 and 2). A complete list of the stimuli is presented in Appendix A. As in Experiment 1, two versions of the word list were created, so that each participant saw only one form of a word.

New, B., Brysbaert, M., Segui, J., Ferrand, L., & Rastle, K. (2004). The processing of singular and plural nouns in French and English. Journal of Memory and Language, 51(4), 568-585.

slide-44
SLIDE 44

Materials used in Experiment 3 (Mean RT = mean reaction time)

             Singular                      Plural
Word         Mean RT   SD   Frequency      Mean RT   SD   Frequency

Singular dominant items
beast            468   68          17          516   96          11
belief           439   43          67          477   58          24
cathedral        476   68          15          553   83           3
clinic           480   65          15          548   77           5
dragon           482   43           8          495   80           2
famine           525   52           7          642  145           1
hat              422   33          53          439   51          15
journal          495   78          18          456   52           6
lieutenant       603   67          14          609  132           1

Plural dominant items
acre             584  101          15          559   85          23
ancestor         569   77           6          568   95          22
biscuit          434   43           5          458   85          11
critic           546   92          12          534   85          23
disciple         560  114           4          533  137          13
dollar           514   65          15          503   76          53
glove            455   59           5          441   50          15
heel             490   94          11          479   48          18
ingredient       535   75           4          553  110          11
lip              444   50          17          482   67          61

slide-45
SLIDE 45

Exercises and practical applications http://crr.ugent.be/emlar2015