Morphology in CLARIN-D, Daniël de Kok



slide-1
SLIDE 1

Morphology in CLARIN-D

Daniël de Kok

slide-2
SLIDE 2

Introduction

A whirlwind introduction:

  • CLARIN-D tools: WebLicht, TüNDRA
  • Resources: corpora with morphology
  • Mostly oriented towards inflectional morphology
slide-3
SLIDE 3

WebLicht

WebLicht is a web application for creating and running NLP pipelines

slide-4
SLIDE 4
slide-5
SLIDE 5
slide-6
SLIDE 6
slide-7
SLIDE 7

Services

  • Centers provide RESTful annotation services
      ○ Input: Text Corpus Format (TCF)
      ○ Output: TCF with the added layers
  • Centers create metadata for their annotation services and put it in their repository
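
As a rough illustration of the "TCF in, TCF with added layers out" contract, here is a minimal sketch of a toy service function. It is not WebLicht code: the namespace URI and element names are written from memory of TCF 0.4 and should be checked against the TCF specification, and the whitespace tokenizer is only a stand-in for a real annotation tool.

```python
import xml.etree.ElementTree as ET

# Assumed TCF 0.4 text corpus namespace (verify against the TCF specification).
TC_NS = "http://www.dspin.de/data/textcorpus"


def add_token_layer(tcf_xml: str) -> str:
    """Toy service: read TCF that has a text layer, return TCF with a tokens layer added."""
    root = ET.fromstring(tcf_xml)
    corpus = root.find(f"{{{TC_NS}}}TextCorpus")
    text = corpus.find(f"{{{TC_NS}}}text").text or ""

    # Append the new layer next to the existing ones; earlier layers stay untouched.
    tokens = ET.SubElement(corpus, f"{{{TC_NS}}}tokens")
    for i, word in enumerate(text.split(), start=1):  # naive whitespace tokenization
        token = ET.SubElement(tokens, f"{{{TC_NS}}}token", {"ID": f"t{i}"})
        token.text = word
    return ET.tostring(root, encoding="unicode")


# Simplified input document (metadata section omitted for brevity).
INPUT_TCF = (
    '<D-Spin xmlns="http://www.dspin.de/data" version="0.4">'
    f'<TextCorpus xmlns="{TC_NS}" lang="de">'
    "<text>Das ist gut</text>"
    "</TextCorpus></D-Spin>"
)

print(add_token_layer(INPUT_TCF))
```

Because each service only appends layers, services can be chained: the output of one is valid input for the next.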

slide-8
SLIDE 8

WebLicht architecture

slide-9
SLIDE 9

Morphology services

  • Currently available (morphological tagging):
      ○ German: Stuttgart Morphology (RFTagger), SMOR
      ○ Dutch: Alpino
      ○ English: MorphAdorner
  • Adding new services for morphology:
      ○ Since WebLicht is decentralized, any CLARIN center could add additional morphology services.
      ○ If some interesting tool is missing, let us know!

slide-10
SLIDE 10

Stuttgart morphology (German)

  • HMM tagger specialized for large, feature-rich tag sets.
  • Trained on the Tiger treebank.
  • Uses a supplementary lexicon.
  • Outputs morphological tags in the TIGER morphology scheme:
      ○ Part-of-speech
      ○ Gender
      ○ Case
      ○ Number
      ○ Degree
      ○ Person
      ○ Tense
      ○ Mood
      ○ Finiteness
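
The feature categories above can be pictured as one record per token. The sketch below is an illustration only: the `key=value|key=value` feature string and the value names (`fem`, `nom`, `sg`) are hypothetical and do not reproduce the tagger's actual output format.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class MorphTag:
    """One morphological analysis; features that do not apply stay None."""
    pos: str
    gender: Optional[str] = None
    case: Optional[str] = None
    number: Optional[str] = None
    degree: Optional[str] = None
    person: Optional[str] = None
    tense: Optional[str] = None
    mood: Optional[str] = None
    finiteness: Optional[str] = None


def parse_tag(pos: str, feats: str) -> MorphTag:
    """Parse a hypothetical 'key=value|key=value' feature string."""
    allowed = {f.name for f in fields(MorphTag)} - {"pos"}
    values = {}
    for pair in filter(None, feats.split("|")):
        key, _, value = pair.partition("=")
        if key not in allowed:
            raise ValueError(f"unknown feature: {key}")
        values[key] = value
    return MorphTag(pos=pos, **values)


# A feminine nominative singular noun; verbal features remain unset.
tag = parse_tag("NN", "gender=fem|case=nom|number=sg")
print(tag)
```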
slide-11
SLIDE 11

Alpino (Dutch)

  • Wide-coverage dependency parser for Dutch.
  • It also includes:
      ○ An extensive lexicon with subcategorization frames.
      ○ A guesser for unknown words.
  • The final frames are determined by:
      ○ Filtering by n-best tagging.
      ○ The parse selected by the disambiguation model.
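
The idea of filtering lexical candidates by n-best tagging can be sketched as follows. This is a toy model, not Alpino's implementation: the frame names and probabilities are made up, and sequences are scored as a plain product of per-token probabilities, found by brute-force enumeration, whereas a real tagger would also use context.

```python
from itertools import product
from math import prod


def nbest_filter(candidates, n):
    """Keep, per token, only the frames that occur in the n best-scoring sequences.

    candidates: one dict per token, mapping a frame name to a probability.
    """
    best = sorted(
        product(*(c.items() for c in candidates)),
        key=lambda seq: prod(p for _, p in seq),
        reverse=True,
    )[:n]
    kept = [set() for _ in candidates]
    for seq in best:
        for position, (frame, _) in enumerate(seq):
            kept[position].add(frame)
    return kept


# Hypothetical lexicon output for a two-token input.
candidates = [
    {"det": 0.9, "pron": 0.1},
    {"noun": 0.7, "verb": 0.3},
]
print(nbest_filter(candidates, n=2))
```

With n=2, the two best sequences both start with `det`, so `pron` is filtered out for the first token while both frames survive for the second. The parser then only has to consider the surviving frames, and the disambiguation model picks among the resulting parses.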

slide-12
SLIDE 12

Resources for German

  • Semi-automatically annotated
      ○ Tiger treebank
      ○ TüBa-D/Z
  • Automatically annotated
      ○ TüBa-D/W

slide-13
SLIDE 13

Tiger treebank

  • ~50,000 sentences
  • Newspaper text (Frankfurter Rundschau)
  • Semi-automatically annotated
  • Annotations:
      ○ STTS part-of-speech tags
      ○ Lemmas
      ○ Inflectional morphology
      ○ Constituency structure
      ○ Dependency conversion (subset hand-annotated)

slide-14
SLIDE 14

TüBa-D/Z

  • ~95,500 sentences
  • Newspaper text (taz)
  • Semi-automatically annotated
  • Annotations:
      ○ STTS part-of-speech tags
      ○ Lemmas
      ○ Inflectional morphology
      ○ Constituency structure
      ○ Dependency conversion
      ○ Anaphora and coreference relations
      ○ Subset with GermaNet word senses
      ○ Named entity class

slide-15
SLIDE 15

TüBa-D/W

  • 36.1 million sentences
  • German Wikipedia
  • Automatically annotated
  • Annotations:
      ○ STTS part-of-speech tags
      ○ Lemmas
      ○ Inflectional morphology
      ○ Dependency structure

  • Processed using WebLicht :-)
slide-16
SLIDE 16

TüBa-D/W

TüBa-D/W is fully searchable using the TüNDRA treebank viewer

slide-17
SLIDE 17
slide-18
SLIDE 18
slide-19
SLIDE 19

Links

  • WebLicht: https://weblicht.sfs.uni-tuebingen.de/
  • TüNDRA: https://weblicht.sfs.uni-tuebingen.de/Tundra/