Morphology in CLARIN-D (Daniël de Kok)


Jun 08, 2023


SLIDE 1

Morphology in CLARIN-D

Daniël de Kok

SLIDE 2

Introduction

A whirlwind introduction:

CLARIN-D tools: WebLicht, TüNDRA
Resources: corpora with morphology
Mostly oriented towards inflectional morphology

SLIDE 3

WebLicht

WebLicht is a web application for creating and running NLP pipelines

SLIDE 4

SLIDE 5

SLIDE 6

SLIDE 7

Services

Centers provide RESTful annotation services

○ Input: Text Corpus Format (TCF)
○ Output: TCF with the added layers

Centers create metadata for their annotation services and put it in their repository
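The service contract described above (TCF in, the same TCF plus one new layer out, with metadata stating what a service needs and adds) can be sketched roughly as follows. This is a toy illustration, not WebLicht code: the XML is a namespace-free stand-in for real TCF, and the `SERVICES` table, `can_run`, and `toy_morphology_service` are hypothetical names introduced only for this example.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a TCF document (assumption: real TCF uses
# XML namespaces and carries a metadata header, both omitted here).
TCF_IN = """<TextCorpus lang="de">
  <text>Das Haus</text>
  <tokens>
    <token ID="t1">Das</token>
    <token ID="t2">Haus</token>
  </tokens>
</TextCorpus>"""

# Hypothetical service metadata: layers a service requires and the layer it adds.
SERVICES = {
    "tokenizer": {"requires": {"text"}, "produces": "tokens"},
    "morphology": {"requires": {"tokens"}, "produces": "morphology"},
}


def layers(doc):
    """Names of the annotation layers present in the document, in order."""
    return [child.tag for child in doc]


def can_run(service, doc):
    """A service can run if every layer it requires is already present."""
    return SERVICES[service]["requires"] <= set(layers(doc))


def toy_morphology_service(tcf_xml):
    """Return the input document unchanged except for one added layer."""
    doc = ET.fromstring(tcf_xml)
    if not can_run("morphology", doc):
        raise ValueError("input lacks a required layer")
    morph = ET.SubElement(doc, "morphology")
    for token in doc.find("tokens"):
        analysis = ET.SubElement(morph, "analysis", tokenIDs=token.get("ID"))
        analysis.text = "UNKNOWN"  # placeholder, no real analysis is performed
    return ET.tostring(doc, encoding="unicode")


out = ET.fromstring(toy_morphology_service(TCF_IN))
print(layers(out))  # ['text', 'tokens', 'morphology']
```

Because every service only appends a layer, services can be chained as long as each one finds the layers it requires in the output of the previous one.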

SLIDE 8

WebLicht architecture

SLIDE 9

Morphology services

Currently available (morphological tagging):

○ German: Stuttgart Morphology (RFTagger), SMOR
○ Dutch: Alpino
○ English: MorphAdorner

Adding new services for morphology:

○ Since WebLicht is decentralized, any CLARIN center could add additional morphology services.
○ If some interesting tool is missing, let us know!

SLIDE 10

Stuttgart morphology (German)

HMM tagger specialized for large, feature-rich tag sets.
Trained on the Tiger treebank.
Uses a supplementary lexicon.
Outputs morphological tags in the TIGER morphology scheme:
○ Part-of-speech
○ Gender
○ Case
○ Number
○ Degree
○ Person
○ Tense
○ Mood
○ Finiteness
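To make the feature list above concrete, here is a minimal sketch that splits a morphological tag into named features. It assumes dot-separated tag strings such as `N.Reg.Nom.Sg.Neut`; the value inventory in `FEATURE_VALUES` is illustrative and incomplete (finiteness, for instance, is left out), and `parse_tag` is a hypothetical helper, not part of any tagger.

```python
# Illustrative (assumed, incomplete) inventory of feature values per category.
FEATURE_VALUES = {
    "gender": {"Masc", "Fem", "Neut"},
    "case": {"Nom", "Gen", "Dat", "Acc"},
    "number": {"Sg", "Pl"},
    "degree": {"Pos", "Comp", "Sup"},
    "person": {"1", "2", "3"},
    "tense": {"Pres", "Past"},
    "mood": {"Ind", "Subj"},
}


def parse_tag(tag):
    """Split a dot-separated tag into a dict of named features.

    The first field is taken as the part of speech; each remaining value
    is assigned to the first category that lists it. Values that match no
    category are collected under "other" instead of being dropped.
    """
    pos, *values = tag.split(".")
    features = {"pos": pos}
    for value in values:
        for name, allowed in FEATURE_VALUES.items():
            if value in allowed:
                features[name] = value
                break
        else:
            features.setdefault("other", []).append(value)
    return features


noun = parse_tag("N.Reg.Nom.Sg.Neut")
print(noun["case"], noun["number"], noun["gender"])  # Nom Sg Neut
```

Looking features up by value rather than by position keeps the sketch independent of how many fields a given part of speech carries.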

SLIDE 11

Alpino (Dutch)

Wide-coverage dependency parser for Dutch.
It also provides:

○ An extensive lexicon with subcategorization frames.
○ A guesser for unknown words.

The final frames are determined by:

○ Filtering by n-best tagging.
○ The parse selected by the disambiguation model.

SLIDE 12

Resources for German

Semi-automatically annotated

○ Tiger treebank
○ TüBa-D/Z

Automatically annotated

○ TüBa-D/W

SLIDE 13

Tiger treebank

~50,000 sentences
Newspaper text (Frankfurter Rundschau)
Semi-automatically annotated
Annotations:

○ STTS part-of-speech tags
○ Lemmas
○ Inflectional morphology
○ Constituency structure
○ Dependency conversion (subset hand-annotated)

SLIDE 14

TüBa-D/Z

~95,500 sentences
Newspaper text (taz)
Semi-automatically annotated
Annotations:

○ STTS part-of-speech tags
○ Lemmas
○ Inflectional morphology
○ Constituency structure
○ Dependency conversion
○ Anaphora and coreference relations
○ Subset with GermaNet word senses
○ Named entity class

SLIDE 15

TüBa-D/W

36.1 million sentences
German Wikipedia
Automatically annotated
Annotations:

○ STTS part-of-speech tags
○ Lemmas
○ Inflectional morphology
○ Dependency structure

Processed using WebLicht :-)

SLIDE 16

TüBa-D/W

TüBa-D/W is fully searchable using the TüNDRA treebank viewer

SLIDE 17

SLIDE 18

SLIDE 19

Links

WebLicht: https://weblicht.sfs.uni-tuebingen.de/
TüNDRA: https://weblicht.sfs.uni-tuebingen.de/Tundra/