Morphology and Corpora: Introduction. Marco Baroni, University of Bologna.



SLIDE 1

Morphology and Corpora: Introduction

Marco Baroni

University of Bologna

Granada “Morphology and Corpora” Seminar

SLIDE 2

Outline

◮ Corpora
  ◮ General overview
  ◮ Data sparseness and the need for larger corpora
◮ Morphology
  ◮ Derivational vs. inflectional morphology
  ◮ Data in morphology

SLIDE 7

Corpora: what and why

◮ Collections of natural text stored on computer
◮ Useful for:
  ◮ NLP (e.g., speech recognition, text categorization, question answering, machine translation...)
  ◮ lexicography, grammar writing, language teaching
  ◮ theoretical linguistics?

SLIDE 11

Typology

◮ Balanced, representative, ‘reference’ corpora: Brown/LOB (1M tokens), COBUILD (10M, ...), BNC (100M)
◮ Opportunistic: WSJ, la Repubblica-SSLMIT, Gigaword (1B)
◮ Web-derived corpora (WaCky project: 1.65B tokens of German, 1.9B tokens of Italian)
◮ Specialized, parallel, comparable, diachronic...

SLIDE 18

Standard requirements for a modern corpus

◮ POS-tagging and lemmatization
◮ Indexing with specialized software that allows sophisticated linguistic queries
◮ Many other desirable features:
  ◮ Meta-data
  ◮ Syntactic parsing
  ◮ Web interface
  ◮ ...

SLIDE 19

Zipf’s Law

[Figure: LOB frequency spectrum, plotting the number of types against frequency class]
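As a minimal sketch of what a frequency spectrum of this kind records: for each frequency class m, it counts how many word types occur exactly m times. The toy sentence and the function name `frequency_spectrum` are illustrative assumptions, not LOB data or code from the seminar.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map each frequency class m to the number of types occurring exactly m times."""
    type_freqs = Counter(tokens)              # type -> token frequency
    spectrum = Counter(type_freqs.values())   # frequency class m -> number of types
    return dict(sorted(spectrum.items()))

# Toy example (hypothetical text, not taken from LOB)
toy_tokens = "the cat saw the dog and the dog saw a bird".split()
spectrum = frequency_spectrum(toy_tokens)
print(spectrum)
```

Even in this tiny sample the lowest frequency class holds the most types, which is the Zipfian shape the plot shows for a real corpus.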

SLIDE 24

There is no data like more data!

◮ In NLP (Banko and Brill, 2001), lexicography (Kilgarriff, 2005) as well as corpus-based linguistics (Mair, 2003), often...
◮ more data is better data!
◮ This implies:
  ◮ Less clean data sources (the Web)
  ◮ Automated processing

SLIDE 25

Outline

◮ Corpora
  ◮ General overview
  ◮ Data sparseness and the need for larger corpora
◮ Morphology
  ◮ Derivational vs. inflectional morphology
  ◮ Data in morphology

SLIDE 29

Derivation vs. inflection

◮ Derivational morphology: word formation, e.g.: compounding, nominalizations, English prefixing
◮ Inflectional morphology: syntax-driven morphology, e.g.: agreement, plural formation, verbal paradigms
◮ Corpus data especially relevant to derivational morphology (productivity, lexicalization, close link to lexical semantics)

SLIDE 33

Data in morphology

◮ Unlike syntacticians, morphologists have traditionally recognized the importance of extensional linguistic data
◮ In word formation, attestedness matters, cf. notion of possible vs. existing word, issues of lexical storage
◮ (In syntax, except in recent “constructional” approaches, it makes no sense to distinguish between possible and existing well-formed sentences)
◮ Traditionally, data in morphology come from dictionaries

SLIDE 40

Problems with dictionaries

◮ Underestimation of very productive, “unintentional” word formation processes
◮ Overestimation of “fancy” word formation (e.g., latinate/neoclassical word formation in specialized lexicon)
◮ History and contemporary language mixed
◮ Criteria for selection of entries not clear
◮ No frequency information
◮ Very little contextual information
◮ More and more dictionaries are corpus-based in any case

SLIDE 46

The importance of the past tense debate

◮ The English past tense debate between connectionists and defenders of the symbolic approach...
  ◮ not quite corpus-based
  ◮ and for some participants focus on morphology feels “incidental”
  ◮ but stressed importance of frequency data
  ◮ and relevance of computational simulations of learning to theoretical debate
◮ (See Albright and Hayes 2003 for a take on English past tense from a linguist’s point of view)

SLIDE 50

Corpus-based simulations of morphological learning

◮ Lots of recent NLP work; on the linguistic side, Goldsmith’s Linguistica project, my Ph.D. work, Vito Pirrelli’s self-organizing maps (SOMs; focus on inflectional paradigms, e.g., Pirrelli et al. 2003)
◮ Emphasis on unsupervised models: ultimate frontier of learning simulations
◮ Early models word-frequency-list-based, but increasing role played by context
◮ Not much contact with corpus linguistics

SLIDE 57

Corpora in productivity studies

◮ Focus of this seminar
◮ Work by Baayen and colleagues
◮ Productivity: the “readiness” with which a word-formation process can form new words in a language (-ness vs. -ity, re- vs. en-)
◮ Early (earliest?) tradition of corpus use in work published in “mainstream” theoretical linguistics journals (from the late eighties)
◮ Corpus seen as word frequency list
◮ Links to old tradition of lexical statistics, stylometry, authorship attribution (Baayen 2001)
◮ Less affected by later developments in corpus linguistics and corpus-based NLP
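The slides do not give a formula for productivity. One measure commonly associated with Baayen's work is the share of hapax legomena (types occurring once) among all tokens formed by a process, P = V1 / N. The sketch below assumes that measure and treats the corpus purely as a word frequency list. The function name and all counts are invented for illustration and are not real corpus figures.

```python
def hapax_productivity(type_freqs):
    """Return P = V1 / N for one word-formation process.

    type_freqs maps each word type formed by the process to its token frequency.
    V1 is the number of types occurring exactly once; N is the total token count.
    """
    n_tokens = sum(type_freqs.values())
    n_hapaxes = sum(1 for freq in type_freqs.values() if freq == 1)
    return n_hapaxes / n_tokens

# Invented toy frequency lists (assumption: NOT real corpus counts)
ness = {"happiness": 50, "darkness": 30, "awkwardness": 1, "greenness": 1, "aloofness": 1}
ity = {"ability": 120, "reality": 90, "opacity": 1}

p_ness = hapax_productivity(ness)
p_ity = hapax_productivity(ity)
print(f"-ness: {p_ness:.4f}  -ity: {p_ity:.4f}")
```

With these made-up counts the process with proportionally more one-off formations gets the higher score, which is the intuition behind contrasts such as -ness vs. -ity.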

SLIDE 59

Word-formation, lexical semantics, corpora

◮ Recent burst of interest in semantic aspects of morphology (Lieber, 2004)
◮ A good moment to explore how corpora and corpus-linguistic methodology (collocational analysis, contextual approaches to meaning, emphasis on lexico-grammar) can help morphological research

SLIDE 63

The “importance of low frequency events” dilemma

◮ Students of word formation, by definition, trade in low frequency words
◮ Very large corpora are needed to find enough rare events (e.g., in a project with Lüdeling and Evert, we are studying compounding with metaphorical obsession; we find only 23 relevant tokens in a 1.65B-word German corpus)
◮ Very large corpora require automated processing, and acceptance of a high degree of noise
◮ Automated processing is more likely to fail on low frequency events, and especially new formations!
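To make the scale of the problem concrete, the arithmetic below takes the 23 tokens found in 1.65B words and asks how many hits a smaller corpus would yield at the same relative frequency. The corpus sizes are those given earlier in the deck. Treating the rate as constant across corpora (and languages) is a simplifying assumption made only for this size comparison.

```python
hits = 23                # relevant tokens reported for the German corpus
corpus_size = 1.65e9     # tokens in that corpus

rate = hits / corpus_size    # relative frequency per token

# Corpus sizes quoted earlier in the deck (in tokens)
sizes = {"Brown/LOB": 1e6, "BNC": 1e8, "Gigaword": 1e9}

# Expected hits at the same rate (assumption: constant relative frequency)
expected = {name: rate * size for name, size in sizes.items()}

for name, count in expected.items():
    print(f"{name}: about {count:.2f} expected hits")
```

Under that assumption a 100M-token reference corpus would be expected to contain between one and two relevant tokens, and a 1M-token corpus essentially none.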