SLIDE 1

Combining Data-Intense and Compute-Intense Methods for Fine-Grained Morphological Analyses

Petra Steiner
Friedrich Schiller University Jena
Jena, Germany
September 19, 2019

SLIDE 2

Outline

1. Introduction
  • German Word-Formation
2. Combining Data-Intense Methods with Contextual Retrieval
  • Overview
  • Data-Intense Methods
  • Word Splitting and Contextual Retrieval
  • Contextual Search in Wikipedia Corpus
  • Morphological Segmentation based on Corpus Frequencies
  • The Relation between Length and Frequency
3. Evaluation
  • Test Data
  • Results of Hybrid Word Analyzing
4. Conclusions and Future Work

Petra Steiner @ DeriMo2019 Fine-Grained Morphological Analyses September 19, 2019 2 / 29

SLIDE 3

Introduction German Word-Formation

Characteristics of German Word-Formation I

  • German has highly productive and complex word-formation processes.
  • Most common: compounding and derivation.
  • Orthographic word forms are long and allow many combinatorially possible analyses, e.g. Arbeitsaufwand ‘work effort, expenditure of labor’.

Arbeitsaufwand
  N Arbeit ‘work’
  x s ‘filler letter’
  N Aufwand ‘expense’

♯Arbeitsaufwand
  N Arbeit ‘work’
  V Sauf ‘to booze’
  N Wand ‘wall’

Figure 1: Ambiguous analysis of Arbeitsaufwand ‘expenditure of labor’

SLIDE 5

Introduction German Word-Formation

Characteristics of German Word-Formation II

Arbeitsaufwand
  N Arbeit ‘work’
  x s ‘filler letter’
  N Aufwand ‘expense’
    V aufwenden ‘to expend’
      x auf ‘on, prefix’
      V wenden ‘to turn’

Figure 2: Deep analysis of Arbeitsaufwand ‘expenditure of labor’

SLIDE 7

Combining Data-Intense Methods with Contextual Retrieval Overview

Combining Data-Intense Methods with Contextual Retrieval

A hybrid approach for finding the correct splits of complex German words, using:

  • a previously derived database of morphological trees (Steiner, 2017)
  • the adjusted output of a morphological splitter
  • co(n)texts from 1.8 million Wikipedia texts
  • morphological segmentation based on corpus frequencies
  • quantitative properties of German morpheme lengths

SLIDE 12

Combining Data-Intense Methods with Contextual Retrieval Overview

Flowchart of the Hybrid Word Analyzer

Input: wordlists (Arbeitsaufwand, Chefredakteurin, Bambussieb, preußisch-europäisch)

  1. Lexical DBs (morphological trees, new splits, monomorphemic lexemes): found in databases?
     yes: output the stored analysis, e.g. (*Arbeit* arbeiten)|s|(*Aufwand* (*aufwenden* auf|wenden))
  2. no: SMOR & Moremorph, e.g. Chefredakteurin: Chef redakt eur in (NN V NNSUFF NNSUFF)
  3. Build all combinations
  4. Filter out implausibles
  5. Contextual search in Wikipedia corpus: Chefredakteur|in, ♯Bambussieb
  6. Recheck simple analyses: Chefredakteur|in
  7. Frequencies in Wikipedia corpus: Bambus|Sieb, (♯preußisch|-|Europa|isch)
  8. Weighting by word lengths: preußisch|-|europäisch
  9. Analyzable? yes: new splits, checked for subanalyses; no: monomorphemic lexemes

Results:
  Arbeitsaufwand: (*Arbeit* arbeiten)|s|(*Aufwand* (*aufwenden* auf|wenden))
  Chefredakteurin: (*Chefredakteur* Chef|(*Redakteur* redakt|eur))|in
  Bambussieb: Bambus|Sieb

SLIDE 13

Combining Data-Intense Methods with Contextual Retrieval Data-Intense Methods

Morphological Trees Database

  • Built from the German part of CELEX and from the GermaNet database
  • 101,588 entries
  • Example: Arbeitsaufwand (*Arbeit* arbeiten)|s|(*Aufwand* (*aufwenden* auf|wenden))

[Diagram components: CELEX-German, OrthCELEX, Refurbished CELEX-German, CELEXextract, GermaNet, GNextract (with CELEX), CELEX Trees DB, GermaNet Trees DB, Merged Morphological Trees DB]

Figure 3: Extracting and Merging Morphological Trees
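
The bracket notation of the tree entries can be read mechanically. The following Python sketch is only an illustration, assuming that `(*X* ...)` marks a complex constituent X with its own analysis and that `|` separates sister constituents, as in the Arbeitsaufwand example. The names `Node`, `parse_analysis` and `top_level` are hypothetical and not part of the presented tools.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Node:
    """A complex constituent written as (*label* part|part|...)."""
    label: str
    parts: List["Tree"]


Tree = Union[str, Node]  # a plain string is an unanalyzed morph


def parse_analysis(text: str) -> List[Tree]:
    """Parse a whole analysis string into its immediate constituents."""
    parts, pos = _parse_sequence(text, 0)
    if pos != len(text):
        raise ValueError(f"unexpected {text[pos]!r} at position {pos}")
    return parts


def _parse_sequence(text: str, pos: int) -> Tuple[List[Tree], int]:
    """Parse constituents separated by '|', starting at pos."""
    parts: List[Tree] = []
    while True:
        part, pos = _parse_element(text, pos)
        parts.append(part)
        if pos < len(text) and text[pos] == "|":
            pos += 1  # skip the separator and read the next constituent
        else:
            return parts, pos


def _parse_element(text: str, pos: int) -> Tuple[Tree, int]:
    """Parse either a bracketed subtree or a plain morph."""
    if text.startswith("(*", pos):
        end = text.index("*", pos + 2)  # closing asterisk of the label
        label = text[pos + 2:end]
        pos = end + 1
        while pos < len(text) and text[pos] == " ":
            pos += 1  # skip the blank between label and sub-analysis
        parts, pos = _parse_sequence(text, pos)
        if pos >= len(text) or text[pos] != ")":
            raise ValueError(f"missing ')' for constituent {label!r}")
        return Node(label, parts), pos + 1
    start = pos
    while pos < len(text) and text[pos] not in "|)":
        pos += 1
    return text[start:pos], pos


def top_level(parts: List[Tree]) -> List[str]:
    """Return the immediate constituents without their sub-analyses."""
    return [p.label if isinstance(p, Node) else p for p in parts]
```

For the example entry, `top_level(parse_analysis("(*Arbeit* arbeiten)|s|(*Aufwand* (*aufwenden* auf|wenden))"))` gives the three immediate constituents Arbeit, s and Aufwand, while the nested `Node` objects keep the deeper levels.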

SLIDE 16

Combining Data-Intense Methods with Contextual Retrieval Word Splitting and Contextual Retrieval

SMOR: A Morphological Tool for German

  • Stuttgarter Morphological Analysis Tool, adjusted by the add-on Moremorph
  • Main lexicon with 42,205 entries, proper-name lexicons with 16,718 entries, and different datasets with other morphological information

(1) Example output for Chefredakteurin ‘editor-in-chief (female)’:

    Chef R:redakteur in
    Chef rede:<>n:<> A:akte U:urin
    Chef rede:<>n:<> A:akteur in
    Chef rede:<>n:<> A:akt e U:urin
    Chef rede:<>n:<> akt eur in
    Chef rede:<>n:<> A:akte U:urin
    Chef rede:<>n:<> A:akteur in
    Chef rede:<>n:<> A:akt e U:urin
    Chef rede:<>n:<> akt eur in
    Chef redakt eur in

(2)
  • a. [[NN,NN],[NNSUFF]] Chefredakteur|in
  • b. ♯[[NN],[NN, NNSUFF]] Chef|redakteurin
  • c. ♯[[NN, NN, NNSUFF]] Chefredakteurin
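
The groupings in (2) can be enumerated systematically. The sketch below is one possible reading of the "build all combinations" step, assuming that a combination is a grouping of the tag sequence into contiguous blocks. The function name `contiguous_groupings` is hypothetical, and the subsequent filtering of implausible groupings is not reproduced here.

```python
from itertools import combinations
from typing import List, Sequence


def contiguous_groupings(items: Sequence[str]) -> List[List[List[str]]]:
    """Return every way to cut a sequence into contiguous groups."""
    n = len(items)
    result = []
    for k in range(n):  # k = number of cut points between neighbours
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            groups = [list(items[a:b]) for a, b in zip(bounds, bounds[1:])]
            result.append(groups)
    return result


# Tag sequence of the analysis "Chef R:redakteur in" from example (1).
groupings = contiguous_groupings(["NN", "NN", "NNSUFF"])
```

A sequence of n tags yields 2^(n-1) groupings, so the three tags give four candidates, among them the groupings (2a) to (2c).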

SLIDE 21

Combining Data-Intense Methods with Contextual Retrieval Contextual Search in Wikipedia Corpus

Idea: For splits of unknown compounds, each immediate constituent should occur within the context of the compound, or at least somewhere in a large corpus. For derivatives, this holds only for hypothetical constituents which are free morphs or lexemes.

  • Contexts: the texts of a corpus in which the analyzed word form occurs.
  • Corpus: 1.8 million articles of the annotated German Wikipedia Korpus of 2015 (Margaretha and Lüngen, 2014)
  • Tokenizer: a modified version of the tool from Dipper (2016); lemmatizer: TreeTagger (Schmid, 1999)
  • Text indices: for the tokenized and lemmatized forms.

For each text containing the input word form $W_{wf}$, the document frequencies $(df_1 \dots df_m)$ of the free hypothetical immediate constituents $(c_{wf,s,1} \dots c_{wf,s,n})$ are retrieved and summed. This yields a text frequency score $S_{wf,s,t}$ for each text $t$ and each split $s$ of $n$ constituents:

$$S_{wf,s,t} = \sum_{c=1}^{n} df_c \qquad (1)$$

Of all morphological analyses for $W_{wf}$, the one with the largest score is passed on for storage. A hypothetical constituent that is missing from a text containing $W_{wf}$ gets a document frequency of 0, which can be compensated by the frequencies of the other constituents of the split sequence.
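
The scoring in (1) can be sketched in a few lines of Python. This is a minimal illustration, not the presented implementation: it assumes that the document frequency of a constituent is its count within one text, that bound suffixes such as -in are left out of the sum, and that the index is a plain dictionary. The names `text_score`, `best_split` and `TOY_INDEX` are hypothetical, and the counts are invented.

```python
from collections import Counter
from typing import Dict, Optional, Sequence, Tuple


def text_score(split: Sequence[str], counts: Counter) -> int:
    """Sum the frequencies of the free constituents of one split in one text."""
    # A Counter returns 0 for a missing constituent, as described above.
    return sum(counts[c] for c in split)


def best_split(
    word: str,
    splits: Sequence[Sequence[str]],
    index: Dict[str, Counter],
) -> Optional[Tuple[int, Tuple[str, ...], str]]:
    """Return (score, split, text id) with the largest score, or None."""
    best = None
    for text_id, counts in index.items():
        if counts[word] == 0:
            continue  # only texts containing the word form are contexts
        for split in splits:
            score = text_score(split, counts)
            if best is None or score > best[0]:
                best = (score, tuple(split), text_id)
    return best


# Invented toy index: text id -> lemma counts.
TOY_INDEX = {
    "t1": Counter({"Chefredakteurin": 1, "Chefredakteur": 3, "Chef": 1}),
    "t2": Counter({"Chefredakteurin": 1, "Redakteurin": 1}),
    "t3": Counter({"Chefredakteur": 9}),  # ignored: word form is absent
}

# Free constituents of the three hypothetical splits from (2).
TOY_SPLITS = [["Chefredakteur"], ["Chef", "Redakteurin"], ["Chefredakteurin"]]

winner = best_split("Chefredakteurin", TOY_SPLITS, TOY_INDEX)
```

With these toy counts the split Chefredakteur|in wins in text t1, because Chefredakteur occurs three times there, while the missing Redakteurin contributes 0 to the competing split.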

SLIDE 28

Combining Data-Intense Methods with Contextual Retrieval Morphological Segmentation based on Corpus Frequencies

Corpus Frequencies

  • The corpus itself is considered a context in the widest sense if no text contains the word form $W_{wf}$.
  • A double check for longer word forms is advisable.

(3) Bambussieb ‘bamboo screen’
  • a. [[NN],[NN]] Bambus|Sieb
  • b. ♯[[NN, NN]] Bambussieb

Investigations on the lengths of German morphs show that German simplex lexemes rarely have more than 7 phonemes (98.41% have at most 7) (Menzerath, 1954; Gerlach, 1982). The number of graphemes is proportional and slightly larger (Krott, 1996).

Check: all word forms with more than 8 characters are rechecked if
  • a. the contextual search found only splits comprising just one constituent, but
  • b. hypothetical splits with more than one constituent do exist.
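
The recheck condition can be written as a small predicate. The following Python sketch assumes that splits are represented as lists of constituents and that length is counted in characters; the function name `needs_recheck` and the parameter `max_simplex_length` are hypothetical.

```python
from typing import Sequence


def needs_recheck(
    word_form: str,
    context_splits: Sequence[Sequence[str]],
    hypothetical_splits: Sequence[Sequence[str]],
    max_simplex_length: int = 8,
) -> bool:
    """Decide whether a simple analysis should be double-checked."""
    # Word forms up to 8 characters may well be simplex lexemes.
    is_long = len(word_form) > max_simplex_length
    # (a) the contextual search produced nothing but one-constituent splits;
    #     an empty result is treated the same way (an assumption).
    only_simple = all(len(split) == 1 for split in context_splits)
    # (b) at least one hypothetical split has more than one constituent.
    has_complex = any(len(split) > 1 for split in hypothetical_splits)
    return is_long and only_simple and has_complex
```

For Bambussieb (10 characters), a contextual result of only `[["Bambussieb"]]` together with the hypothetical split `["Bambus", "Sieb"]` triggers the recheck, whereas a short form such as Sieb never does.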

SLIDE 30

Combining Data-Intense Methods with Contextual Retrieval The Relation between Length and Frequency

The frequency-based weighting has a bias towards constructions with short constituents.

(4)
  • a. ♯Figur|Kombi|Nation ‘figure|combi (short form of combination)|nation’
  • b. Figur|Kombination ‘figure|combination’
  • c. Figur|(*Kombination* kombin|ation)

The functional dependency between morph/lexeme frequency and length is mutual (Köhler, 1986; Krott, 2004) and influenced by other factors such as the age of words and lexicon size.

For each constituent $c$ with a length of $l$ characters, the frequency of its word length class $L_l$ is used as an inversely proportional factor for the document frequencies:

$$\mathit{WeightedS}_{wf,s,t} = \sum_{c=1}^{n} \frac{df_c}{\mathit{freq}(L_{l(c)})} \qquad (2)$$
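
The effect of the length weighting in (2) can be illustrated with a short Python sketch. All numbers below are invented for the illustration and do not come from the Wikipedia corpus; the names `plain_score`, `weighted_score`, `TOY_DF` and `TOY_LENGTH_CLASS_FREQ` are hypothetical.

```python
from typing import Mapping, Sequence


def plain_score(split: Sequence[str], df: Mapping[str, int]) -> int:
    """Unweighted score: sum of the document frequencies."""
    return sum(df.get(c, 0) for c in split)


def weighted_score(
    split: Sequence[str],
    df: Mapping[str, int],
    length_class_freq: Mapping[int, int],
) -> float:
    """Weighted score: each frequency is divided by the frequency of the
    constituent's word length class (length counted in characters)."""
    total = 0.0
    for c in split:
        # Raises KeyError if a length class is missing from the table.
        total += df.get(c, 0) / length_class_freq[len(c)]
    return total


# Invented document frequencies for one text.
TOY_DF = {"Figur": 2, "Kombi": 3, "Nation": 4, "Kombination": 2}
# Invented frequencies of word length classes (short words are frequent).
TOY_LENGTH_CLASS_FREQ = {5: 1000, 6: 900, 11: 100}

flat = ["Figur", "Kombi", "Nation"]   # (4a), the unwanted split
good = ["Figur", "Kombination"]       # (4b)
```

With these toy values the unweighted score prefers the unwanted split (9 against 4), whereas the weighted score prefers Figur|Kombination, because the frequent short length classes are damped.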

SLIDE 33

Combining Data-Intense Methods with Contextual Retrieval The Relation between Length and Frequency

[Flowchart of the Hybrid Word Analyzer, repeated with the final results: Arbeitsaufwand (*Arbeit* arbeiten)|s|(*Aufwand* (*aufwenden* auf|wenden)); Chefredakteurin (*Chefredakteur* Chef|(*Redakteur* redakt|eur))|in; preußisch|-|(*europäisch* Europa|isch). The step "Weighting by word lengths" yields preußisch|-|europäisch and Bambus|Sieb.]

SLIDE 35

Evaluation Test Data

  • Corpus: Korpus Magazin Lufthansa Bordbuch (MLD), part of the DeReKo-2016-I corpus (Institut für Deutsche Sprache 2016; see Kupietz et al. 2010), an in-flight magazine with articles on traveling, consumption and aviation.
  • Tokenization: extended and customized tokenizer by Dipper (2016)
  • 276 texts with 5,202 paragraphs, 16,046 sentences and 260,114 tokens
  • 38,337 word-form types and 27,902 lemma types
  • 15,622 of these lemma types are in the databases of trees or monomorphemic words: a coverage of 55.99%, with an accuracy of nearly 100% due to the quality of the CELEX and GermaNet data.
  • The remaining 44.01% of all lemma types were processed by SMOR and Moremorph with a coverage of 100%.
  • Sample of 1,006 word forms
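
The coverage figures follow directly from the type counts given above. A brief Python check, using only the numbers stated on this slide (the variable names are hypothetical):

```python
lemma_types = 27_902   # lemma types in the test corpus
in_databases = 15_622  # lemma types found in the tree or monomorpheme DBs

db_coverage = in_databases / lemma_types
remaining = (lemma_types - in_databases) / lemma_types

print(f"database coverage: {db_coverage:.2%}")  # 55.99%
print(f"left for SMOR:     {remaining:.2%}")    # 44.01%
```

The remaining share corresponds to 12,280 lemma types that have to be analyzed by SMOR and Moremorph.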

SLIDE 36

Evaluation Results of Hybrid Word Analyzing

Method                            correct analysis (all levels)   no analysis   (partially) flat analysis   wrong analysis
DBs                               ≈ 55.99%
DBs + Context + Corpus Look-up    87.77%                          7.45%         3.48%                       1.29%
+ Recheck                         92.44%                          2.68%         2.88%                       1.99%
+ Recheck + Weighting             93.34%                          2.58%         2.78%                       1.29%

(5)
  • a. adjective vs. participle: folgend ‘following’, gewandt ‘turned (verb), skillful (adjective)’
  • b. constituents not in context: Metallkäfig ‘metal cage’, Tierärztin ‘vet (female)’
  • c. flat analyses: Roll|vor|Gang ‘♯?(to roll|prefix, before|gait), rolling procedure’
  • d. analysis from GermaNet: ♯?(Land|Nahme) ‘(land|“take”), settlement’
  • e. frequent homograph: ♯(Parlament|(*arisch* ar|isch)) ‘(parliament|(*Aryan* Ar|ian)), parliamentary’
  • f. correction by word-length weighting: rollen|(*Vorgang* (*vorgehen* vor|gehen)) ‘to roll|(*procedure* (*to proceed* pro|ceed))’

5,696 new entries for monomorphemic lexemes; 8,448 for the new splits.

slide-39
SLIDE 39

Evaluation Results of Hybrid Word Analyzing

correct analysis (all levels) no analysis (partially) fmat analysis wrong analysis DBs ≈ 55.99% DBs + Con- text + Corpus Look-up 87.77% 7.45% 3.48% 1.29% + Recheck 92.44% 2.68% 2.88% 1.99% + Recheck + Weighting 93.34% 2.58% 2.78% 1.29%

(5)

  • a. adjective vs. participle: folgend ‘following’, gewandt ‘turnedv, skillfuladj’
  • b. constituents not in context: Metallkäfjg ‘metal cage’, Tierärztin ‘vetf em’
  • c. fmat analyses: Roll|vor|Gang ‘♯?(to roll|prefjx, before|gait), rolling procedure’
  • d. analysis from GermaNet: ♯?(Land|Nahme) ‘(land|”take”), settlement’
  • e. frequent homograph:

♯(Parlament|(*arisch* ar|isch)) ‘(parliament|(*Aryan* Ar|ian), parliamentary’

  • f. correction by word-length weighting:

rollen|(*Vorgang* (*vorgehen* vor|gehen)) ‘to roll|(*procedure* (*to proceed* pro|ceed))’ 5,696 new entries for monomorphemic lexemes; 8,448 for the new splits.

Petra Steiner @ DeriMo2019 Fine-Grained Morphological Analyses September 19, 2019 22 / 29

slide-40
SLIDE 40

Evaluation Results of Hybrid Word Analyzing

correct analysis (all levels) no analysis (partially) fmat analysis wrong analysis DBs ≈ 55.99% DBs + Con- text + Corpus Look-up 87.77% 7.45% 3.48% 1.29% + Recheck 92.44% 2.68% 2.88% 1.99% + Recheck + Weighting 93.34% 2.58% 2.78% 1.29%

(5)

  • a. adjective vs. participle: folgend ‘following’, gewandt ‘turnedv, skillfuladj’
  • b. constituents not in context: Metallkäfjg ‘metal cage’, Tierärztin ‘vetf em’
  • c. fmat analyses: Roll|vor|Gang ‘♯?(to roll|prefjx, before|gait), rolling procedure’
  • d. analysis from GermaNet: ♯?(Land|Nahme) ‘(land|”take”), settlement’
  • e. frequent homograph:

♯(Parlament|(*arisch* ar|isch)) ‘(parliament|(*Aryan* Ar|ian), parliamentary’

  • f. correction by word-length weighting:

rollen|(*Vorgang* (*vorgehen* vor|gehen)) ‘to roll|(*procedure* (*to proceed* pro|ceed))’ 5,696 new entries for monomorphemic lexemes; 8,448 for the new splits.

Petra Steiner @ DeriMo2019 Fine-Grained Morphological Analyses September 19, 2019 22 / 29

slide-41
SLIDE 41

Evaluation Results of Hybrid Word Analyzing

correct analysis (all levels) no analysis (partially) fmat analysis wrong analysis DBs ≈ 55.99% DBs + Con- text + Corpus Look-up 87.77% 7.45% 3.48% 1.29% + Recheck 92.44% 2.68% 2.88% 1.99% + Recheck + Weighting 93.34% 2.58% 2.78% 1.29%

(5)

  • a. adjective vs. participle: folgend ‘following’, gewandt ‘turnedv, skillfuladj’
  • b. constituents not in context: Metallkäfjg ‘metal cage’, Tierärztin ‘vetf em’
  • c. fmat analyses: Roll|vor|Gang ‘♯?(to roll|prefjx, before|gait), rolling procedure’
  • d. analysis from GermaNet: ♯?(Land|Nahme) ‘(land|”take”), settlement’
  • e. frequent homograph:

♯(Parlament|(*arisch* ar|isch)) ‘(parliament|(*Aryan* Ar|ian), parliamentary’

  • f. correction by word-length weighting:

rollen|(*Vorgang* (*vorgehen* vor|gehen)) ‘to roll|(*procedure* (*to proceed* pro|ceed))’ 5,696 new entries for monomorphemic lexemes; 8,448 for the new splits.

Petra Steiner @ DeriMo2019 Fine-Grained Morphological Analyses September 19, 2019 22 / 29

slide-42
SLIDE 42

Evaluation Results of Hybrid Word Analyzing

correct analysis (all levels) no analysis (partially) fmat analysis wrong analysis DBs ≈ 55.99% DBs + Con- text + Corpus Look-up 87.77% 7.45% 3.48% 1.29% + Recheck 92.44% 2.68% 2.88% 1.99% + Recheck + Weighting 93.34% 2.58% 2.78% 1.29%

(5)

  • a. adjective vs. participle: folgend ‘following’, gewandt ‘turnedv, skillfuladj’
  • b. constituents not in context: Metallkäfjg ‘metal cage’, Tierärztin ‘vetf em’
  • c. fmat analyses: Roll|vor|Gang ‘♯?(to roll|prefjx, before|gait), rolling procedure’
  • d. analysis from GermaNet: ♯?(Land|Nahme) ‘(land|”take”), settlement’
  • e. frequent homograph:

♯(Parlament|(*arisch* ar|isch)) ‘(parliament|(*Aryan* Ar|ian), parliamentary’

  • f. correction by word-length weighting:

rollen|(*Vorgang* (*vorgehen* vor|gehen)) ‘to roll|(*procedure* (*to proceed* pro|ceed))’ 5,696 new entries for monomorphemic lexemes; 8,448 for the new splits.

Petra Steiner @ DeriMo2019 Fine-Grained Morphological Analyses September 19, 2019 22 / 29

slide-43
SLIDE 43

Conclusions and Future Work

Outline

1 Introduction
2 Combining Data-Intense Methods with Contextual Retrieval
3 Evaluation
4 Conclusions and Future Work


slide-44
SLIDE 44

Conclusions and Future Work

A Hybrid Approach for Deep-Level Morphological Analysis

Starting points:

  • a. a morphological trees database
  • b. flat structures from a morphological segmentation tool

All plausible combinations of the immediate constituents were evaluated by look-ups in textual environments of a large corpus or, as a back-off strategy, in the set of all types. Biases towards small constituents with high frequencies on the one hand and towards unsplit words on the other were tackled with insights from quantitative linguistics. The combination of the methods led to an accuracy of 93% for complex structures and 98.7% for acceptable output.
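The look-up strategy summarized above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the author's implementation: the function name `choose_split`, the scoring by counting attested constituents, and the lowercase vocabularies are all hypothetical. The frequency and word-length weighting mentioned on the slide is left out.

```python
from typing import List, Sequence, Set


def choose_split(candidates: Sequence[List[str]],
                 context_types: Set[str],
                 all_types: Set[str]) -> List[str]:
    """Pick the candidate split whose constituents are best attested.

    Constituents are first looked up in the textual environment of the word;
    only if no candidate has a hit there does the search back off to the set
    of all corpus types. Both vocabularies are assumed to be lowercased.
    """
    def hits(parts: List[str], vocabulary: Set[str]) -> int:
        return sum(part.lower() in vocabulary for part in parts)

    for vocabulary in (context_types, all_types):
        best = max(candidates, key=lambda parts: hits(parts, vocabulary))
        if hits(best, vocabulary) > 0:
            return list(best)
    return list(candidates[0])  # nothing attested anywhere: keep the first candidate


# The two competing analyses of Arbeitsaufwand from the introduction.
candidates = [["Arbeit", "s", "Aufwand"], ["Arbeit", "Sauf", "Wand"]]
print(choose_split(candidates, {"arbeit", "aufwand", "kosten"}, set()))
```

Counting raw hits favors analyses with many small constituents, which is the kind of bias the slide says was corrected with length and frequency information.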

slide-48
SLIDE 48

Conclusions and Future Work

Future Work

There are two directions for improvement. The first is to use larger corpora, which might yield a better fit of the word length-frequency relationship. On the other hand, inhomogeneous data can blur models. The second is therefore to analyze words text by text, which could strengthen contextual dependency and help to find morphological structures that fit the direct environment. This would result in different structures for orthographical words according to their contexts.


slide-49
SLIDE 49

Thank you for your ‘permanent|attention’

Daueraufmerksamkeit
  N Dauer ‘endurance’
  N Aufmerksamkeit ‘attention’
    Adj aufmerksam ‘attentive’
      V aufmerken ‘to attend to’
        Prefix auf
        V merken ‘to notice’
      Suffix sam
    Suffix keit


slide-50
SLIDE 50

References I

Harald Baayen, Richard Piepenbrock, and Léon Gulikers. 1995. The CELEX lexical database (CD-ROM).

Stefanie Dipper. 2016. Tokenizer for German. https://www.linguistics.rub.de/~dipper/resources/tokenizer.html.

Rainer Gerlach. 1982. Zur Überprüfung des Menzerathschen Gesetzes im Bereich der Morphologie. In W. Lehfeldt and U. Strauss, editors, Glottometrika 4, Brockmeyer, Quantitative Linguistics 14, pages 95–102.

Birgit Hamp and Helmut Feldweg. 1997. GermaNet - a Lexical-Semantic Net for German. In Proceedings of ACL Workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications. pages 9–15. http://www.aclweb.org/anthology/W97-0802.

Verena Henrich and Erhard Hinrichs. 2011. Determining Immediate Constituents of Compounds in GermaNet. In Proceedings of the International Conference Recent Advances in Natural Language Processing, Hissar, Bulgaria, 2011. Association for Computational Linguistics, pages 420–426. http://www.aclweb.org/anthology/R11-1058.

Reinhard Köhler. 1986. Zur linguistischen Synergetik: Struktur und Dynamik der Lexik. Quantitative Linguistics 31. Studienverlag Dr. N. Brockmeyer, Bochum.

Andrea Krott. 1996. Some remarks on the relation between word length and morpheme length. Journal of Quantitative Linguistics 3(1):29–37. https://doi.org/10.1080/09296179608590061.


slide-51
SLIDE 51

References II

Andrea Krott. 2004. Ein funktionalanalytisches Modell der Wortbildung [A functional analytical model of word formation]. In Reinhard Köhler, editor, Korpuslinguistische Untersuchungen zur Quantitativen und Systemtheoretischen Linguistik [Corpus-linguistic Investigations of Quantitative and System-theoretical Linguistics], Elektronische Hochschulschriften an der Universität Trier, Trier, pages 75–126. http://ubt.opus.hbz-nrw.de/volltexte/2004/279/pdf/04_krott.pdf.

Eliza Margaretha and Harald Lüngen. 2014. Building linguistic corpora from wikipedia articles and discussions. Journal of Language Technology and Computational Linguistics. Special issue on building and annotating corpora of computer-mediated communication. Issues and challenges at the interface between computational and corpus linguistics 29(2):59–82. http://nbn-resolving.de/urn:nbn:de:bsz:mh39-33306, http://www.jlcl.org/2014_Heft2/3MargarethaLuengen.pdf.

Paul Menzerath. 1954. Die Architektonik des deutschen Wortschatzes. Phonetische Studien. Dümmler, Bonn; Hannover; Stuttgart.

Helmut Schmid. 1999. Improvements in Part-of-Speech Tagging with an Application to German. In Susan Armstrong, Kenneth Church, Pierre Isabelle, Sandra Manzi, Evelyne Tzoukermann, and David Yarowsky, editors, Natural Language Processing Using Very Large Corpora, Springer Netherlands, Dordrecht, pages 13–25. https://doi.org/10.1007/978-94-017-2390-9_2.


slide-52
SLIDE 52

References III

Helmut Schmid, Arne Fitschen, and Ulrich Heid. 2004. SMOR: A German Computational Morphology Covering Derivation, Composition and Inflection. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004, May 26-28, 2004, Lisbon, Portugal. European Language Resources Association (ELRA). http://www.aclweb.org/anthology/L04-1275.

Petra Steiner. 2017. Merging the Trees - Building a Morphological Treebank for German from Two Resources. In Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories, January 23-24, 2018, Prague, Czech Republic. pages 146–160. https://aclweb.org/anthology/W17-7619.

Petra Steiner and Reinhard Rapp. in press. Building and Exploiting Lexical Databases for Morphological Parsing. In Proceedings of The International Conference on Contemporary Issues in Data Science, March 5-8, 2019, Zanjan, Iran. Springer, Lecture Notes in Computer Science.

Petra Steiner and Josef Ruppenhofer. 2018. Building a Morphological Treebank for German from a Linguistic Database. In Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA). https://www.aclweb.org/anthology/L18-1613.


slide-55
SLIDE 55

More slides ....


slide-56
SLIDE 56

Figure 4: length/frequency of lemmas from MLD corpus


slide-57
SLIDE 57

Databases

Reliable Morphological Data

CELEX: standard resource for German lexical data

  • 51,728 entries
  • 38,650 derivatives or compounds, 2,402 conversions
  • core vocabulary
  • outdated format
  • outdated spelling

GermaNet:

  • rich vocabulary, complex lexemes
  • segmentation restricted to nominal compounds
  • approx. 68,000 entries of compounds


slide-58
SLIDE 58

Databases Revised CELEX Database

Transfer of CELEX to the Modern Standard

  • outdated format, e.g. Abschlu$ (orthographical dataset), Abschluss (morphological dataset) → Abschluß ‘conclusion’
  • outdated spelling, e.g. Abschluß ‘conclusion’ → Abschluss

[Flowchart: transfer of CELEX to the modern standard. Node labels as they appear on the slide: Morphological Dataset, German morphology lemmas (GML); Orthographical Dataset, German orthography lemmas (GOL); Change character; Diacritic; more than one; Substitution with cotext; no change; Simple substitution; Morph-orthographical data; Intermediate Results; Control output; Substitution of morphs (check and add); Revised German Morphology - Lemmas; revise spelling; Consonant rules; Generate substitution rules; New German Morphology - Lemmas (GMOL) (check and add).]

see ?

slide-59
SLIDE 59

Databases Revised CELEX Database

Overview of the Data Processing

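[Diagram: processing pipeline. Components as listed on the slide, from the final resource back to the sources: Merged Morphological Trees DB; CELEX Trees DB and GermaNet Trees DB; CELEXextract and GNextract (with CELEX); GermaNet and Refurbished CELEX-German; OrthCELEX; CELEX-German.]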

slide-61
SLIDE 61

Databases Morphological Trees of CELEX

CELEX Trees I

Examples

97\Abdrift\0\C\1\Y\Y\Y\ab+drift\xV\N\N\N\((ab)[N|.V],((treib)[V])[V])[N]\Y\N\N\N\S3/P3\N
‘leeway - away|to float’

207\Abgangszeugnis\4\C\1\Y\Y\Y\Abgang+s+Zeugnis\NxN\N\N\N\((((ab)[V|.V],(geh)[V])[V])[N],(s)[N|N.N],((zeug)[V],(nis)[N|V.])[N]) [...]
‘leaving certificate - leave|certificate’

605\Abschlussprüfung\C\1\Y\Y\Y\Abschluss+Prüfung\NN\N\N\N\((((ab)[V|.V],(schließ)[V])[V])[N],((prüf)[V],(ung)[N|V.])[N]\[...]
‘final exam - conclusion|exam’
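The bracketed structure field of such a record can be turned into a tree with a small recursive-descent parser. The sketch below is an illustration, not the author's CELEXextract tool: the `Node` class and function names are hypothetical. It assumes that each unit has the shape `(content)[label]`, as in the Abdrift example, and that the structure is the only backslash-separated field starting with an opening parenthesis.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    label: str                      # bracket label, e.g. "N", "V" or "N|.V"
    morph: Optional[str] = None     # set for leaves such as "ab" or "treib"
    children: List["Node"] = field(default_factory=list)


def _parse_node(text: str, pos: int) -> Tuple[Node, int]:
    """Parse one '(...)[label]' unit starting at text[pos]."""
    if text[pos] != "(":
        raise ValueError(f"expected '(' at position {pos}")
    pos += 1
    morph, children = None, []
    if text[pos] == "(":
        # Inner node: comma-separated list of sub-units.
        while True:
            child, pos = _parse_node(text, pos)
            children.append(child)
            if text[pos] != ",":
                break
            pos += 1
    else:
        # Leaf: plain morph up to the closing parenthesis.
        end = text.index(")", pos)
        morph, pos = text[pos:end], end
    if text[pos:pos + 2] != ")[":
        raise ValueError(f"expected ')[' at position {pos}")
    end = text.index("]", pos)
    return Node(text[pos + 2:end], morph, children), end + 1


def parse_structure(text: str) -> Node:
    node, pos = _parse_node(text, 0)
    if pos != len(text):
        raise ValueError("trailing characters after the structure")
    return node


def leaves(node: Node) -> List[str]:
    """Collect the morphs at the leaves, left to right."""
    if node.morph is not None:
        return [node.morph]
    return [m for child in node.children for m in leaves(child)]


record = r"97\Abdrift\0\C\1\Y\Y\Y\ab+drift\xV\N\N\N\((ab)[N|.V],((treib)[V])[V])[N]\Y\N\N\N\S3/P3\N"
structure = next(f for f in record.split("\\") if f.startswith("("))
tree = parse_structure(structure)
print(tree.label, leaves(tree))
```

The truncated records on the slide (marked `[...]`) would raise a `ValueError` in this sketch, since their brackets are incomplete.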


slide-62
SLIDE 62

Databases Morphological Trees of CELEX

CELEX Trees II

NN
  N
    ab ‘away’
    V schließ ‘close’
  N
    V prüf ‘examine’
    ung suffix

Figure 5: Morphological analysis of Abschlussprüfung ‘final exam’


slide-63
SLIDE 63

Databases Morphological Trees of CELEX

CELEX Trees III

NN
  N
    ab ‘away’
    V geh ‘to go’
  x s ‘interfix’
  N
    V zeug ‘to witness’
    nis suffix

Figure 6: Morphological analysis of Abgangszeugnis ‘leaving certificate’


slide-66
SLIDE 66

Databases Morphological Trees from GermaNet

GermaNet

Lexical-semantic database, hierarchically structured in synsets

Same approach as WordNet (Hamp and Feldweg, 1997)

<synset id="s5552" category="nomen" class="Artefakt">
  <lexUnit id="l8355" sense="1" source="core" namedEntity="no" artificial="no" styleMarking="no">
    <orthForm>Werkstück</orthForm>
    <compound>
      <modifier category="Nomen">Werk</modifier>
      <modifier category="Verb">werken</modifier>
      <head>Stück</head>
    </compound>
  </lexUnit>
</synset>
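The compound information in such an entry can be read with a standard XML parser. The following sketch uses Python's `xml.etree.ElementTree` on the snippet shown above. The function name `compound_parts` and the returned dictionary layout are illustrative assumptions; real GermaNet files contain many more elements than this excerpt.

```python
import xml.etree.ElementTree as ET

# The lexical unit from the slide, with straight quotes restored.
SYNSET_XML = """
<synset id="s5552" category="nomen" class="Artefakt">
  <lexUnit id="l8355" sense="1" source="core" namedEntity="no"
           artificial="no" styleMarking="no">
    <orthForm>Werkstück</orthForm>
    <compound>
      <modifier category="Nomen">Werk</modifier>
      <modifier category="Verb">werken</modifier>
      <head>Stück</head>
    </compound>
  </lexUnit>
</synset>
"""


def compound_parts(xml_text: str) -> dict:
    """Map each compound's orthographic form to its modifiers and head."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for lex_unit in root.iter("lexUnit"):
        compound = lex_unit.find("compound")
        if compound is None:
            continue  # simplex word: no compound annotation
        word = lex_unit.findtext("orthForm").strip()
        modifiers = [(m.get("category"), m.text.strip())
                     for m in compound.findall("modifier")]
        result[word] = {"modifiers": modifiers,
                        "head": compound.findtext("head").strip()}
    return result


print(compound_parts(SYNSET_XML))
```

The two `modifier` elements in this entry are alternative readings of the first constituent (noun Werk or verb werken), which is one source of the ambiguous structures mentioned for the GermaNet compounds.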

slide-68
SLIDE 68

Databases Morphological Trees from GermaNet

GN Trees I: Compounds of GermaNet

GermaNet compounds (Henrich and Hinrichs, 2011), version 11, with 66,059 compounds, some of which have ambiguous structures

  • remove proper names and foreign-word expressions (After-Show-Party, Bodenseeregion ‘Lake Constance region’)
  • remove deficient entries, e.g. with missing part-of-speech classes or affixoids
  • add interfixes (Fugen/filler letters) by heuristics: Abfahrtszeit ‘departure time’, GermaNet: Abfahrt|zeit → Abfahrt|s|zeit
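The slide does not spell out the interfix heuristics. One simple possibility, shown here purely as an assumption, is to compare the full word with its two constituents and treat a short leftover string between them as a filler. The function name and the list of filler letters are hypothetical.

```python
from typing import List

# Common German linking elements; this list is an assumption for illustration.
KNOWN_FILLERS = {"s", "es", "n", "en", "er", "e", "ns", "ens"}


def insert_filler(word: str, modifier: str, head: str) -> List[str]:
    """Return [modifier, filler, head] if a known filler sits between the parts."""
    lower, m, h = word.lower(), modifier.lower(), head.lower()
    if not (lower.startswith(m) and lower.endswith(h)):
        # Stem changes (umlaut, elision) are out of scope for this sketch.
        return [modifier, head]
    gap = lower[len(m):len(lower) - len(h)]
    if gap in KNOWN_FILLERS:
        return [modifier, gap, head]
    return [modifier, head]


print(insert_filler("Abfahrtszeit", "Abfahrt", "zeit"))
```

For the example on the slide this yields the split Abfahrt|s|zeit. Words without a gap, such as Werkstück, are returned unchanged.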

slide-72
SLIDE 72

Databases Morphological Trees from GermaNet

GN Trees II: Compound structures from GermaNet

1 generate flat compound entries

Beitragssatz : Beitrag|s|Satz ‘contribution rate’
Beitragssatzsicherung : Beitragssatz|Sicherung ‘contribution rate safeguarding’
Beitragssatzsicherungsgesetz : Beitragssatzsicherung|s|Gesetz ‘contribution rate safeguarding law’

2 infer GN complex structure by recursive look-up

Beitragssatz → Beitrag|s|Satz
Beitragssatz|Sicherung → (Beitrag|s|Satz)|Sicherung (insert)
Beitragssatzsicherung|s|Gesetz → ((Beitrag|s|Satz)|Sicherung)|s|Gesetz (insert)
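The recursive look-up in step 2 can be sketched in a few lines. The function below is a minimal illustration, not the GNextract implementation: the names `FLAT_ENTRIES` and `expand` and the `max_depth` parameter are hypothetical, the latter loosely modeled on the "depth of analysis" option of the tools.

```python
from typing import Dict, List

# Flat compound entries from step 1.
FLAT_ENTRIES: Dict[str, List[str]] = {
    "Beitragssatz": ["Beitrag", "s", "Satz"],
    "Beitragssatzsicherung": ["Beitragssatz", "Sicherung"],
    "Beitragssatzsicherungsgesetz": ["Beitragssatzsicherung", "s", "Gesetz"],
}


def expand(word: str, entries: Dict[str, List[str]],
           max_depth: int = 6, _top: bool = True) -> str:
    """Replace each constituent that has its own entry by its bracketed split."""
    parts = entries.get(word)
    if parts is None or max_depth == 0:
        return word  # simplex constituent, or depth limit reached
    inner = "|".join(expand(p, entries, max_depth - 1, _top=False) for p in parts)
    return inner if _top else f"({inner})"


for compound in FLAT_ENTRIES:
    print(compound, "->", expand(compound, FLAT_ENTRIES))
```

The depth limit also guards against endless recursion if an entry were ever to list itself as a constituent.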


slide-73
SLIDE 73

Databases Morphological Trees from GermaNet

Overview of the Data Processing

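[Diagram: processing pipeline. Components as listed on the slide, from the final resource back to the sources: Merged Morphological Trees DB; CELEX Trees DB and GermaNet Trees DB; CELEXextract and GNextract (with CELEX); GermaNet and Refurbished CELEX-German; OrthCELEX; CELEX-German.]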

(Steiner, 2017)

slide-75
SLIDE 75

Combining CELEX and GermaNet Building the Trees

Combining GN Trees and CELEX Trees

GermaNet compound structures

Immediate constituents and other information from CELEX:

Beitrag ‘contribution’: (*beitragen_V* bei_x|tragen_V)
Sicherung ‘safeguarding’: (*sichern_V* sicher_A|n_x)|ung_x
Gesetz ‘law’: ge_x|setzen_V

Infer complex (derivative) structures by recursive look-up:

Beitragssatz: Beitrag|s|Satz
Beitragssatzsicherung: (Beitrag|s|Satz)|(sichern|ung)
Beitragssatzsicherungsgesetz: ((Beitrag|s|Satz)|(sichern|ung))|s|(ge|setzen)

Some mistakes, questionable or missing analyses:

  • approx. 2000 missing segmentations
  • Restrukturierungsmaßnahmen: Restrukturierung|s|(Maß|Nahme)


slide-76
SLIDE 76

Combining CELEX and GermaNet Building the Trees

Formats of Output

Optional parameters:

  • Depth of analysis for compounds
  • Parts of speech for the constructs and/or the smallest constituents
  • Choice of the output format (parentheses or a notation with | for the splits on the same level)
  • Addition of filler letters for GN
  • Transferring the GN annotation scheme to the CELEX scheme
  • Removing compounds with proper names and/or foreign words as constituents for GN
  • Analysis of conversions for CELEX
  • Depth of analysis for conversions for CELEX
  • Dissimilarity measure for CELEX diachronic analyses


slide-77
SLIDE 77

Combining CELEX and GermaNet Merged German Morphological Trees

Example

Währungsausgleichsfonds
  N Währungsausgleich ‘currency adjustment’
    N Währung ‘currency’
    x s
    N Ausgleich ‘adjustment’
  x s
  N Fonds ‘fund’

Ausgleich
  V ausgleichen ‘to adjust’
    x aus
    V gleichen ‘to equal’
      Adj gleich ‘equal’
      x en

Figure 7: Merged morphological analysis of Währungsausgleichsfonds ‘currency adjustment fund’

slide-80
SLIDE 80

Combining CELEX and GermaNet Merged German Morphological Trees

Example

Währungsausgleichsfonds
  N Währungsausgleich ‘currency adjustment’
    N Währung ‘currency’
    x s
    N Ausgleich ‘adjustment’
      V ausgleichen ‘to adjust’
        x aus
        V gleichen ‘to equal’
          Adj gleich ‘equal’
          x en
  x s
  N Fonds ‘fund’

Figure 8: Merged morphological analysis of Währungsausgleichsfonds ‘currency adjustment fund’


slide-81
SLIDE 81

Combining CELEX and GermaNet Merged German Morphological Trees

Small Examples of List Representations

  • a. Abschlussprüfung (*Abschluss_N* (*abschließen_V* ab_x|schließen_V))|(*Prüfung_N* prüfen_V|ung_x)
  • b. Abschlussprüfung (*Abschluss_N* (*abschließen_V* (ab_x)(schließen_V)))(*Prüfung_N* (prüfen_V)(ung_x))
  • c. Abschlussprüfung Abschluss_N|Prüfung_N

  • a. Abdrift ab_x|(driften_V)
  • b. Abdrift (ab_x)(*driften_V* treiben_V)
  • c. Abdrift ab_x|driften_V

  • a. Abgangszeugnis (*Abgang_N* (*abgehen_V* ab_x|gehen_V))|s_x|*Zeugnis_N* (zeugen_V|nis_x)
  • b. Abgangszeugnis (*Abgang_N* (*abgehen_V* (ab_x)(gehen_V)))(s_x)(*Zeugnis_N* (zeugen_V)(nis_x))
  • c. Abgangszeugnis Abgang_N|s_x|Zeugnis_N

a: | notation, threshold 0.5; b: parenthesis notation and no restrictions on diachronic conversions; c: flat representation of the immediate constituents.


slide-82
SLIDE 82

Combining CELEX and GermaNet Merged German Morphological Trees

Merged GermanTrees

The parameters for the deep-level analyses are 6 for the levels of complex words and 2 for conversions. The Levenshtein dissimilarity threshold was set to 0.5. Duplicate entries were removed.

Structures                              GN entries   CELEX entries   German Trees
flat                                    67,452       40,097          100,095
deep-level                              68,163       40,097          104,424
merged with CELEX                       68,171       n/a             100,986
merged with CELEX plus simplex words    68,171       n/a             112,086

Table 1: Databases of German word trees
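A dissimilarity threshold of this kind can be illustrated with a plain Levenshtein distance. The slides do not say how the distance is normalized, so the sketch below assumes division by the length of the longer string; the function names and the `keep_conversion` rule are hypothetical as well.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs for insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def dissimilarity(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def keep_conversion(derived: str, base: str, threshold: float = 0.5) -> bool:
    """Keep a diachronic analysis only if both forms are similar enough."""
    return dissimilarity(derived, base) <= threshold


print(levenshtein("driften", "treiben"), keep_conversion("driften", "treiben"))
```

Under this assumed normalization, driften and treiben differ by 4 edits over 7 characters, which is above 0.5. That is at least consistent with variant a. of the Abdrift example, where treiben does not appear.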


slide-83
SLIDE 83

Combining Databases and Segmenters by Hybrid Word Analysis

One Complex Database, Two Segmenters

[Diagram: word lists (example: Abgangszeugnis) are passed to the Hybrid Word Splitter, which can draw on three sources.]

  • Morphological Trees DB (CELEX Trees & simplex words, GermaNet Trees; built via CELEXextract, GNextract (with CELEX), GermaNet, Refurbished CELEX-German, OrthCELEX, CELEX-German): (*Abgang_N* (*abgehen_V* (ab_x) (gehen_V))) (s_x) (*Zeugnis_N* (zeugen_V) (nis_x))
  • SMOR/Moremorph: Abgang<N>s<FL> Zeugnis<NN>
  • Morphy: SUB NOM SIN NEU KMP Abgang/Zeugnis

Figure 9: Morphological trees database and two different word segmenters as alternative methods for word splitting
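To feed segmenter output into the same pipeline as the database trees, the tagged string has to be brought into the | notation. The sketch below handles only the SMOR/Moremorph string exactly as it is displayed in the figure; the regular expression and function names are illustrative assumptions, and actual SMOR output may use other tag inventories and layouts.

```python
import re
from typing import List, Tuple

# A morph is any run of characters without '<', '>' or whitespace,
# immediately followed by a tag in angle brackets.
TAGGED_MORPH = re.compile(r"([^<>\s]+)<([^<>]+)>")


def split_tagged(analysis: str) -> List[Tuple[str, str]]:
    """Return (morph, tag) pairs from a string like 'Abgang<N>s<FL> Zeugnis<NN>'."""
    return TAGGED_MORPH.findall(analysis)


def to_pipe_notation(analysis: str) -> str:
    """Join the morphs with '|' as in the flat database representation."""
    return "|".join(morph for morph, _tag in split_tagged(analysis))


print(split_tagged("Abgang<N>s<FL> Zeugnis<NN>"))
print(to_pipe_notation("Abgang<N>s<FL> Zeugnis<NN>"))
```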


slide-84
SLIDE 84

Evaluation: Testcorpus and Recall

Coverage of the Lemma Forms

Corpus: Korpus Magazin Lufthansa Bordbuch (MLD), part of the DeReKo-2016-I corpus (Institut für Deutsche Sprache 2016; see Kupietz et al. 2010), an in-flight magazine with articles on traveling, consumption and aviation.

Tokenization: enlarged and customized tokenizer by Dipper (2016).

276 texts with 5,202 paragraphs, 16,046 sentences and 260,115 tokens

                      lemma types   recall   lemmas in text   recall
corpus size           29,313                 260,014
MergedDB + simplex    14,446        49.29%   157,535          60.59%
+ Morphy              21,953        74.89%   241,117          92.73%
+ Moremorphs          27,907        95.20%   256,903          98.80%

Table 2: Recall of Tree DBs (Steiner and Rapp, in press)

slide-87
SLIDE 87

Conclusion

Conclusion

  • 100,986 merged trees
  • Currently the biggest available data resource of its kind
  • Text coverage of 60.59%
  • Combined with Morphy: 92.73%
  • Combined with SMOR: 98.80%
  • Downloads without data: https://github.com/petrasteiner/morphology

The authors were partially supported by the German Research Foundation (DFG) under grant RU 1873/2-1 and by a Marie Curie Career Integration Grant within the 7th European Community Framework Programme.


slide-88
SLIDE 88

Conclusion Revised CELEX Database

The Lexical Database CELEX

  • Dutch, English, and German
  • lexical information combined with information on word-formation types and frequencies
  • manually annotated multi-tiered word structures (Baayen et al., 1995)
  • outdated format, e.g. Abschlu$ (orthographical dataset), Abschluss (morphological dataset) → Abschluß (modern format) ‘conclusion’
  • outdated spelling, e.g. Abschluß ‘conclusion’ → Abschluss (modern spelling)
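To illustrate the format conversion, here is a minimal Python sketch that decodes CELEX-style ASCII escapes into modern characters. The mapping (`$` for ß, a double quote before a vowel for an umlaut) is an assumption inferred from the examples on these slides, not the full CELEX specification, and `decode_celex` is a hypothetical helper name. The subsequent spelling revision (Abschluß → Abschluss) depends on lexical rules and is not covered here.

```python
import re

# Assumed escape conventions, inferred from the slide examples only.
UMLAUTS = {"a": "ä", "o": "ö", "u": "ü", "A": "Ä", "O": "Ö", "U": "Ü"}


def decode_celex(form: str) -> str:
    """Replace '"' + vowel with the umlaut and '$' with 'ß'."""
    form = re.sub(r'"([aouAOU])', lambda m: UMLAUTS[m.group(1)], form)
    return form.replace("$", "ß")


print(decode_celex('Abschlu$pr"ufung'))
```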

slide-89
SLIDE 89

Conclusion Revised CELEX Database

Transfer of CELEX to the Modern Standard

[Figure: flowchart of the transfer. Inputs: German morphology lemmas (GML, morphological dataset) and German orthography lemmas (GOL, orthographical dataset). Decision nodes: changed character, diacritic, more than one. Branches: no change, simple substitution, substitution with cotext. Further nodes: morph-orthographical data, intermediate results, control output, substitution of morphs (check and add), Revised German Morphology - Lemmas, revise spelling, consonant rules, generate substitution rules, New German Morphology - Lemmas (GMOL) (check and add).]

slide-90
SLIDE 90

Conclusion Revised CELEX Database

CELEX Revision - Facts and Figures

  • 51,728 entries
  • 10,106 entries with diacritics
  • 576 entries with updated spelling
  • 38,683 complex entries (morphological deep-level analyses)

slide-91
SLIDE 91

Conclusion Revised CELEX Database

The Lexical Database CELEX: Morphological Structures

  • manually annotated multi-tiered word structures (Baayen et al., 1995)
  • needs only a few repairs of missing constituents or wrong analyses

Example (Morphological analysis of Abschlußprüfung ‘final exam’)
Abschlusspruefung (‘final exam’)
Abschluss+Pruefung
((((ab)[V|.V],(schliess)[V])[V])[N],((pruef)[V],(ung)[N|V.])[N])

[Figure: tree for the example. NN branches into N (VPREF ab ‘away’, V schließ ‘close’) and N (V prüf ‘examine’, NSUFF ung, suffix).]
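The bracketed CELEX structure above can be read into a tree with a small recursive-descent parser. The following Python sketch (3.9+) is only an illustration under the assumption that every node has the form `(content)[label]`, with an optional label at the root; the names `Node`, `parse_celex` and `leaves` are hypothetical and not part of CELEX or of the author's tools.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                  # CELEX category string, e.g. "N" or "V|.V"
    morph: str = ""             # filled for leaves only
    children: list["Node"] = field(default_factory=list)


def parse_celex(s: str) -> Node:
    """Parse one structure such as ((ab)[V|.V],(schliess)[V])[V]."""
    pos = 0

    def node() -> Node:
        nonlocal pos
        assert s[pos] == "(", f"expected '(' at position {pos}"
        pos += 1
        children, morph = [], ""
        if s[pos] == "(":                   # inner nodes, comma-separated
            children.append(node())
            while s[pos] == ",":
                pos += 1
                children.append(node())
        else:                               # leaf: read the morph itself
            start = pos
            while s[pos] != ")":
                pos += 1
            morph = s[start:pos]
        pos += 1                            # skip ')'
        label = ""
        if pos < len(s) and s[pos] == "[":  # the label may be missing at the root
            end = s.index("]", pos)
            label = s[pos + 1:end]
            pos = end + 1
        return Node(label, morph, children)

    tree = node()
    assert pos == len(s), "trailing characters after the structure"
    return tree


def leaves(n: Node) -> list[str]:
    """Morphs of the tree from left to right."""
    if not n.children:
        return [n.morph]
    return [m for child in n.children for m in leaves(child)]


tree = parse_celex("((((ab)[V|.V],(schliess)[V])[V])[N],((pruef)[V],(ung)[N|V.])[N])[N]")
print(tree.label, leaves(tree))
```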

slide-92
SLIDE 92

Conclusion

CELEX Trees IV: Restriction of Diachronic Information

Cut two forms f1, f2 with lengths l1 and l2 to strings s1, s2 of the smaller length min(l1, l2) and calculate the Levenshtein distance (LD) of these. Special characters such as ä or ß are transformed to a and ss, uppercase characters to lowercase. Then the quotient of both values is compared to a threshold t as in (3):

$$\frac{\mathrm{LD}(s_1, s_2)}{\min(l_1, l_2)} < t \tag{3}$$

Example: the stem of the derived form treib and its component driften are reduced to the smaller size (5): drift and treib. (4) shows that the analysis will stop for a threshold of 0.8 or below.

$$\frac{\mathrm{LD}(\mathit{drift}, \mathit{treib})}{\min(l_1, l_2)} = \frac{4}{5} \tag{4}$$

Plus a small list of exceptions. Steiner and Ruppenhofer (2018)
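The similarity test in (3) can be sketched in a few lines of Python. This is an illustration, not the author's implementation: the function names, the default threshold of 0.8, the extra mappings for ö and ü, and the choice to normalize before cutting are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


def normalize(form: str) -> str:
    """Lowercase and replace special characters (ö and ü are assumed additions)."""
    table = {"ä": "a", "ö": "o", "ü": "u", "ß": "ss"}
    return "".join(table.get(ch, ch) for ch in form.lower())


def distance_ratio(f1: str, f2: str) -> float:
    """LD of both forms cut to the shorter length, divided by that length."""
    s1, s2 = normalize(f1), normalize(f2)
    n = min(len(s1), len(s2))
    return levenshtein(s1[:n], s2[:n]) / n


def continue_analysis(f1: str, f2: str, t: float = 0.8) -> bool:
    """True if the forms are similar enough to analyse deeper, as in (3)."""
    return distance_ratio(f1, f2) < t


print(distance_ratio("treib", "driften"))  # 0.8
```

With a threshold of 0.8 or below, `continue_analysis("treib", "driften")` is false, which matches the stop condition described for (4).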

slide-93
SLIDE 93

Conclusion

Algorithm 1: Building a merged morphological treebank
Input: CELEX-German revised, GN flat compounds
Output: A DB of Morphological Trees

initialization of parameters: depth of analysis, linguistic information, levenshtein threshold, parts of speech, style of output;
add CELEX data to the knowledge base
forall entries of GN flat compounds do
    if entry is a compound then
        foreach constituent of entry do
            if depth of analysis reached then
                retrieve linguistic information/PoS as required;
                return linguistic information and constituent
            end
            else if constituent not found in GN data then
                depth of analysis++;
                analysedeepercelex part with parameters and depth;
                return result of analysedeepercelex
            end
            else
                foreach part of constituent do
                    depth of analysis++;
                    analysedeeper part with parameters and depth;
                    return result of analysedeeper
                end
            end
        end
    end
end

sub analysedeeper part (parameters and level)
    if part is simplex or depth of analysis reached then
        retrieve linguistic information/PoS as required;
        return linguistic information and part
    end
    else if constituent not found in GN data then
        depth of analysis++;
        analysedeepercelex part with parameters and depth;
        return result of analysedeepercelex
    end
    else
        depth of analysis++;
        foreach subpart of part do
            analysedeeper subpart
            return result of analysedeeper subpart
        end
    end

sub analysedeepercelex part (parameters and level)
    if part is simplex or depth of analysis reached then
        retrieve linguistic information/PoS as required;
        return linguistic information and part
    end
    else
        foreach subpart of part do
            analysedeepercelex subpart
            if levenshtein threshold and analysedeepercelex subpart is dissimilar then
                skip deeper analysis;
                return subpart
            end
            else
                return result of analysedeepercelex subpart
            end
        end
    end
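The control flow of Algorithm 1 can be approximated by two mutually dependent recursive functions. The Python sketch below is a strong simplification under stated assumptions: the dictionaries `GN` and `CELEX` are toy stand-ins with a handful of illustrative entries, parts of speech and filler letters are left out, and the similarity test is passed in as a plain callable. It is not the author's implementation.

```python
# Toy stand-ins for the two resources; the entries are illustrative only.
GN = {"Abgangszeugnis": ["Abgang", "Zeugnis"]}     # flat compound splits
CELEX = {                                          # immediate constituents
    "Abgang": ["abgehen"],
    "abgehen": ["ab", "gehen"],
    "Zeugnis": ["zeugen", "nis"],
}


def analyse_celex(part, depth, max_depth, similar):
    """Descend through CELEX only; dissimilar subparts are kept as leaves."""
    subparts = CELEX.get(part)
    if subparts is None or depth >= max_depth:     # simplex or depth reached
        return (part, [])
    children = []
    for sub in subparts:
        if similar(part, sub):
            children.append(analyse_celex(sub, depth + 1, max_depth, similar))
        else:
            children.append((sub, []))             # skip deeper analysis
    return (part, children)


def analyse_gn(part, depth, max_depth, similar):
    """Prefer GermaNet splits, fall back to CELEX for parts not found there."""
    if depth >= max_depth:
        return (part, [])
    if part not in GN:
        return analyse_celex(part, depth, max_depth, similar)
    return (part, [analyse_gn(sub, depth + 1, max_depth, similar)
                   for sub in GN[part]])


def always_similar(parent, child):
    """Placeholder for the Levenshtein-based test."""
    return True


print(analyse_gn("Abgangszeugnis", 0, 4, always_similar))
```

Lowering `max_depth` yields shallower trees, which corresponds to the depth-of-analysis parameter in the pseudocode.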

slide-94
SLIDE 94

Conclusion Conflicts

Conversion or ambiguity?

Examples (GermaNet vs. CELEX)
Werkstück        Werk|Stück              ‘work(noun)|piece’
Werkstück        werken|Stück            ‘to work|piece’
Glaswerkstück    Glas|(Werk|Stück)       ‘glass|work(noun)|piece’
Glaswerkstück    Glas|(werken|Stück)     ‘glass|to work|piece’
Werkstück        (*werken_V* (Werk_N)(en_x))(Stück_N)    ‘(*to work_V* (work_N)(en(suffix)))(piece_N)’

slide-95
SLIDE 95

Conclusion Conflicts

Compounding or Prefixation/Conversion?

Examples (GermaNet vs. CELEX)
Abwasser            (ab_P)(Wasser_N)            ‘(away_P)(water_N) waste water’
                    (ab_x)(Wasser_N)            ‘(away_x)(water_N) waste water’
afroasiatisch       (afro_R)(Asiatisch_N)       ‘(afro_R)(Asian_N)’
afroamerikanisch    (afro_x)(amerikanisch_A)    ‘(afro_x)(American_A)’
Maßnahme            (Maß_N)(Nahme_N)            ‘(measure_n)(taking_N) measure’
                    maßnehmen_V                 ‘(to measure_take_V) measure’

slide-96
SLIDE 96

Conclusion Conflicts

Mapping the Morphological Tagsets

Part of speech/morph type    GN              CELEX    GN Trees
noun                         nomen, Nomen    N        N
adjective                    Adjektiv        A        A
adverb                       Adverb          B        B
preposition                  Präposition     P        P
verb                         Verb, verben    V        V
article                      Artikel         D        D
interjection                 Interjektion    I        I
pronoun                      Pronomen        O        O
abbreviation                 Abkürzung       X        X
word group                   Wortgruppe      n        n
root/confix                  Konfix          R        R
filler letters, affixes      -               x        x

Table 3: Mapping of two morphological tagsets
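In code, a mapping like Table 3 is naturally a lookup table. The short Python sketch below copies the GermaNet category names and one-letter tags from the table; the names `GN_TO_TREE_TAG` and `tree_tag`, and the choice to fall back to `x` for anything unlisted, are assumptions made for illustration.

```python
# GermaNet category name -> one-letter tree tag, as listed in Table 3.
GN_TO_TREE_TAG = {
    "nomen": "N", "Nomen": "N",
    "Adjektiv": "A",
    "Adverb": "B",
    "Präposition": "P",
    "Verb": "V", "verben": "V",
    "Artikel": "D",
    "Interjektion": "I",
    "Pronomen": "O",
    "Abkürzung": "X",
    "Wortgruppe": "n",
    "Konfix": "R",
}


def tree_tag(gn_category: str, default: str = "x") -> str:
    """Look up the tree tag; unlisted categories get `default` (an assumption)."""
    return GN_TO_TREE_TAG.get(gn_category, default)


print(tree_tag("Nomen"), tree_tag("verben"), tree_tag("Konfix"))
```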

slide-97
SLIDE 97

Conclusion Outlook

Outlook

[Figure: planned architecture. Components shown: word lists (example: Abgangszeugnis); Moremorph (output: Abgang<N>s<FL>Zeugnis<NN>); Contextual Word Splitter; Morphological Trees DB; Indices; Buildindex; Wikipedia Corpus; CELEX Trees; GermaNet Trees; CELEXextract; GNextract (with CELEX); GermaNet; Refurbished CELEX-German; OrthCELEX; CELEX-German. Example tree: (*Abgang_N* (*abgehen_V* (ab_x)(gehen_V)))(s_x)(*Zeugnis_N* (zeugen_V)(nis_x))]

slide-98
SLIDE 98

Frequencies Frequencies from CELEX

Morphs and immediate constituents I

Three datasets with frequency information were extracted from CELEX:

  • all morphs with their frequencies within the lemmas (13,419 entries)

    morph   freq.     morph   freq.
    ung     3588      ein     755
    er      3066      ge      750
    ig      2531      auf     681
    s       2327      über    630
    e       2120      um      557
    ver     1694      vor     517
    n       1581      bar     485
    lich    1273      heit    475
    be      1236      ent     475
    ier     1215      los     455
    un      1141      en      423
    aus     983       ation   394
    keit    974       t       382
    ab      896       unter   381
    an      845       zu      378
    isch    836       in      374

  • all immediate constituents with their frequencies within the lemmas (21,406 entries)
  • all immediate constituents within the lemmas with their frequencies as found in the Mannheim Corpus
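A frequency list of this kind can be derived by counting morphs over segmented lemmas. The Python sketch below uses three toy segmentations taken from examples elsewhere in these slides; the data structure, the function name `morph_frequencies`, and the decision to count every occurrence (rather than one count per lemma) are assumptions.

```python
from collections import Counter

# Toy input: lemma -> morph segmentation (illustrative, not the CELEX data).
segmented_lemmas = {
    "Abschlussprüfung": ["ab", "schließ", "prüf", "ung"],
    "Benutzerunterstützung": ["be", "nutz", "er", "unter", "stütz", "ung"],
    "Abgangszeugnis": ["ab", "gehen", "s", "zeugen", "nis"],
}


def morph_frequencies(lemmas: dict) -> Counter:
    """Count how often each morph occurs across all segmented lemmas."""
    counts = Counter()
    for morphs in lemmas.values():
        counts.update(morphs)
    return counts


freq = morph_frequencies(segmented_lemmas)
for morph, count in freq.most_common(3):
    print(morph, count)
```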

slide-99
SLIDE 99

Frequencies Frequencies from GermaNet + CELEX

Morphs and immediate constituents II

Preliminary results: smallest parts of GN trees.

freq.    morph      freq.   morph
11905    s          771     al
8961     ung        730     Land
5939     n          728     ation
5339     e          721     ion
5324     er         715     Zeit
2198     ge         634     stellen
1564     be         631     fahren
1537     ver        622     Bau
1452     en         620     heit
1197     schaft     577     bauen
913      es         569     Arbeit

slide-100
SLIDE 100

Frequencies Frequencies from GermaNet + CELEX

Characteristics of German Word-Formation Version 2

  • language with complex processes of word formation
  • most common are compounding and derivation, e.g. Oberklasse-Kompaktschlagbohrmaschine ‘Premium class compact hammer drill (machine)’
  • many combinatorially possible analyses

Oberklassenschlagbohrmaschine: Oberklasse ‘premium class’ + Schlagbohrmaschine ‘hammer drill’

♯Oberklassenschlagbohrmaschine: Ober ‘premium’ + Klassenschlag ‘*class hit’ + Bohrmaschine ‘drill’

slide-101
SLIDE 101

Frequencies Overview of a Morphological Analyzer

Overview - New Version

[Figure: overview of the analyzer, new version. Components shown: revised SMOR output; filter (weights + heuristics); weighted morphological analyses; corpora, subcorpora, texts; frequencies of cotexts; revised CELEX-DB; morph trees (GermaNet, other DBs); evaluation against gold standards (morphological segmentations and parses).]

slide-102
SLIDE 102

Frequencies Revised SMOR analyses

SMOR

  • Stuttgarter Morphologisches Analysewerkzeug
  • Morphological analyzer based on two-level morphology, implemented as a set of finite-state transducers (Schmid et al., 2004)
  • Main lexicon with 41,944 entries, proper name lexicons with 15,188 entries and different datasets with other morphological information

(6)
ober<PREF>Klasse<NN>Schlag<NN>bohren<V>Maschine<+NN><Fem><Acc><Sg>
ober<PREF>Klasse<NN>Schlag<NN>bohren<V>Maschine<+NN><Fem><Dat><Sg>
ober<PREF>Klasse<NN>Schlag<NN>bohren<V>Maschine<+NN><Fem><Gen><Sg>
ober<PREF>Klasse<NN>Schlag<NN>bohren<V>Maschine<+NN><Fem><Nom><Sg>
ober<PREF>Klasse<NN>schlagen<V><NN><SUFF>bohren<V>Maschine<+NN><Fem><Acc><Sg>
ober<PREF>Klasse<NN>schlagen<V><NN><SUFF>bohren<V>Maschine<+NN><Fem><Dat><Sg>
ober<PREF>Klasse<NN>schlagen<V><NN><SUFF>bohren<V>Maschine<+NN><Fem><Gen><Sg>
ober<PREF>Klasse<NN>schlagen<V><NN><SUFF>bohren<V>Maschine<+NN><Fem><Nom><Sg>
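The eight analyses in (6) differ only in case and in the treatment of Schlag/schlagen. A possible post-processing step, sketched here in Python, splits each analysis string into morph and tag pairs and collapses inflectional variants. The regular expression, the function name `segments`, and the set of inflection tags (only Fem, Acc, Dat, Gen, Nom and Sg occur on the slide, the rest are assumed) are illustrative; this does not call SMOR itself.

```python
import re

# One morph followed by one or more <TAG> groups.
TOKEN = re.compile(r"([^<>]+)((?:<[^<>]+>)+)")

# Inflection tags to ignore; names beyond those on the slide are assumptions.
INFLECTION = {"Fem", "Masc", "Neut", "Nom", "Gen", "Dat", "Acc", "Sg", "Pl"}


def segments(analysis: str) -> tuple:
    """Split an analysis into (morph, tags) pairs without inflection tags."""
    result = []
    for morph, tags in TOKEN.findall(analysis):
        kept = [t for t in re.findall(r"<([^<>]+)>", tags) if t not in INFLECTION]
        result.append((morph, tuple(kept)))
    return tuple(result)


# Rebuild the eight analysis strings of example (6).
analyses = [
    f"ober<PREF>Klasse<NN>{middle}bohren<V>Maschine<+NN><Fem><{case}><Sg>"
    for middle in ("Schlag<NN>", "schlagen<V><NN><SUFF>")
    for case in ("Acc", "Dat", "Gen", "Nom")
]

distinct = {segments(a) for a in analyses}
print(len(analyses), "analyses,", len(distinct), "distinct segmentations")
```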

slide-106
SLIDE 106

Frequencies Revised SMOR analyses

SMOR tag <TRUNC>

(7) {Oberklassen}-<TRUNC>Schlag<NN>bohren<V>Maschine<+NN><Fem><Acc><Sg>

  • SMOR output: {Oberklassen}-<TRUNC>Schlag<NN>
  • Remove hyphen, reuse SMOR lexicons for <TRUNC>: Oberklassenschlag...
    o:Ober<PREF>:<>K:klasse<>:n<NN>:<>Schlag<NN>...
  • Restore hyphens: Ober<PREF>Klasse<NN>n<FL>-<HYPHEN>Schlag<NN>...

(8) Ober<PREF>Klasse<NN>n<FL>-<HYPHEN>Schlag<NN>bohren<V>Maschine<+NN>

slide-107
SLIDE 107

Frequencies Revised SMOR analyses

Coverage / Form of Output

Table 4 summarizes the changes for 1,101 items from our gold standard data.

Method                     t      n    a      r
(a) SMOR baseline          105    3    0      0.00
(b) remove hyphens         48     2    58     0.54
(c) reanalyze TRUNC        39     3    66     0.61
(d) combine (b) and (c)    2      2    104    0.96

Table 4: Analyzed hyphenated forms; t: analyses containing TRUNC; n: hyphenated forms without analyses; a: correctly pre-analyzed hyphenated word forms; r: relative frequency of a

slide-108
SLIDE 108

Methods Geometric Mean Score

Geometric Mean Score

We use the geometric mean as in (5):

$$\left(\prod_{i=1}^{n} x_i\right)^{1/n} \quad \text{for } x_1, \ldots, x_n \tag{5}$$

Example: Anbaumenge, with x_1 = 845 for an, x_2 = 168 for bau, x_3 = 8 for Menge:
gm(An|bau|Menge) = 104.33

Frequencies of morphs, of constituents, of corpus frequencies
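The score in (5) can be reproduced with a few lines of Python. The sketch computes the mean via logarithms to avoid very large products; the function name and the convention of returning 0.0 when a frequency is missing or zero are assumptions, not taken from the slides.

```python
import math


def geometric_mean(frequencies: list) -> float:
    """n-th root of the product of the frequencies, computed via logarithms."""
    if not frequencies or any(f <= 0 for f in frequencies):
        return 0.0  # assumption: an unseen part gives the whole split a zero score
    return math.exp(sum(math.log(f) for f in frequencies) / len(frequencies))


# An|bau|Menge with the frequencies from the slide
score = geometric_mean([845, 168, 8])
print(round(score, 2))  # 104.33
```

Because the product is dominated by its smallest factor, a single rare part such as Menge (8) pulls the score down much more than it would in an arithmetic mean.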

slide-111
SLIDE 111

Methods Word structures as Integer Compositions

Combinatorial structure of morphological analyses

The possible analyses are isomorphic to the permuted integer partitions (integer compositions) of n.

(9) Drahtseilakt ‘High-wire act’

  • a. [[Draht],[seil],[akt]]
  • b. [[Draht],[seilakt]]
  • c. [[Drahtseil],[akt]]
  • d. [[Drahtseilakt]]

Corresponding integer compositions

(10)

  • a. 1-1-1
  • b. 1-2
  • c. 2-1
  • d. 3

The algorithm for processing the combinatorially possible analyses makes use of this analogy.

$$c(n) = 2^{n-1}, \quad n \geq 1$$
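The correspondence between groupings and integer compositions can be made concrete with a short enumeration. In this Python sketch, each of the n-1 gaps between adjacent morphs is either a cut or not, which gives 2^(n-1) groupings; the function name `compositions` and the list-of-lists output format are illustrative choices, not the author's code.

```python
from itertools import product


def compositions(morphs: list):
    """Yield every grouping of adjacent morphs (2**(n-1) groupings for n morphs)."""
    if not morphs:
        return
    for cuts in product([True, False], repeat=len(morphs) - 1):
        groups, current = [], [morphs[0]]
        for morph, cut in zip(morphs[1:], cuts):
            if cut:                 # close the current group at this gap
                groups.append(current)
                current = [morph]
            else:                   # merge with the preceding morph
                current.append(morph)
        groups.append(current)
        yield groups


for grouping in compositions(["Draht", "seil", "akt"]):
    sizes = "-".join(str(len(group)) for group in grouping)
    print(sizes, ["".join(group) for group in grouping])
```

For Drahtseilakt this yields the four groupings of (9) in the order 1-1-1, 1-2, 2-1, 3.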

slide-114
SLIDE 114

Methods Word structures as Integer Compositions

Path pruning

(11) Compositions of abwechslungsreich

  • a. [[ab],[wechsl],[ung,s],[reich]]
  • b. [[ab],[wechsl],[ung,s,reich]]
  • c. [[ab],[wechsl,ung,s],[reich]]
  • d. [[ab],[wechsl,ung,s,reich]]
  • e. [[ab,wechsl],[ung,s],[reich]]
  • f. [[ab,wechsl],[ung,s,reich]]
  • g. [[ab,wechsl,ung,s],[reich]]
  • h. [[ab,wechsl,ung,s,reich]]

(12)

morph    tag       gloss
Be       VPREF     be.pref
nutz     V         use
er       NNSUFF    er.suff
unter    VPART     below
stütz    V         support
ung      NNSUFF    ung.suff

slide-115
SLIDE 115

Methods Word structures as Integer Compositions

The Lexical Database CELEX: Example

  • orthographical data, e.g.
    605\Abschlu$pr"ufung\14\Ab-schlu$-prü-fung\N\Abschlu$prüfung\Ab-schlu$-prü-fung\N
  • morphological data, e.g.
    605\Abschlusspruefung\14\C\1\Y\Y\Y\Abschluss+Pruefung\NN\N\N\N\((((ab)[V|.V],(schliess)[V])[V])[N],((pruef)[V],(ung)[N|V.])[N])[N]\Y\N\N\N\S3/P3\N
  • phonological data, e.g.
    605\Abschlusspruefung\14\'&p-SlUs-pry-fUN\[ap][SlUs][pry:][fUN]\'&p-SlUs-pry-fUN\[ap][SlUs][pry:][fUN]\[VC][CCVC][CCVV][CVC]\[VC][CCVC][CCVV][CVC]\ap#Sli:s#pry:f+UN\ap#Sli:s#pry:f+UN
  • syntactic data, e.g.
    605\Abschlusspruefung\14\1\2\\N\N\\\\\\\\\\\\\\\\
  • frequency data

slide-116
SLIDE 116

Methods Word structures as Integer Compositions

CELEX Trees IV: Some repairs

  • missing constituents and missing parts of speech information within the morphological trees
  • missing constituents within the field of immediate constituency information
  • inconsistent morphological analyses, e.g. for phrasal compounds

[Adj [Adj warm ‘warm’] [N Herz ‘heart’] [ig suffix]]

♯[Adj [N Kopf ‘head’] [Adj [N Last ‘load’] [ig suffix]]]  ⇒  [Adj [N Kopf ‘head’] [N Last ‘load’] [ig suffix]]

Figure 10: warmherzig ‘warm-heartedly’ and kopflastig ‘top heavy’
