Morphology in CLARIN-D, Daniël de Kok


  1. Morphology in CLARIN-D Daniël de Kok

  2. Introduction
     A whirlwind introduction:
     ● CLARIN-D tools: WebLicht, TüNDRA
     ● Resources: corpora with morphology
     ● Mostly oriented towards inflectional morphology

  3. WebLicht
     WebLicht is a web application for creating and running NLP pipelines.

  4. Services
     ● Centers provide RESTful annotation services
       ○ Input: Text Corpus Format (TCF)
       ○ Output: TCF with the added layers
     ● Centers create metadata for their annotation services and put it in their repository
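
The "TCF in, TCF with added layers out" contract can be sketched locally in Python. This is only an illustration, not WebLicht's implementation: the namespace URIs and element names follow TCF 0.4 as far as I recall and should be checked against the TCF specification, and `make_tcf`, `add_tokens_layer`, and `run_pipeline` are hypothetical helper names. A real WebLicht service would be called over HTTP rather than as a local function.

```python
import xml.etree.ElementTree as ET

# Assumed TCF 0.4 namespace URIs; verify against the TCF specification.
DSPIN_NS = "http://www.dspin.de/data"
TC_NS = "http://www.dspin.de/data/textcorpus"


def make_tcf(text, lang="de"):
    """Build a minimal TCF-like document that contains only a text layer."""
    root = ET.Element(f"{{{DSPIN_NS}}}D-Spin", {"version": "0.4"})
    corpus = ET.SubElement(root, f"{{{TC_NS}}}TextCorpus", {"lang": lang})
    ET.SubElement(corpus, f"{{{TC_NS}}}text").text = text
    return root


def corpus_of(root):
    """Return the TextCorpus element that holds the annotation layers."""
    return root.find(f"{{{TC_NS}}}TextCorpus")


def add_tokens_layer(root):
    """Toy stand-in for a service: whitespace-tokenize and add a tokens layer."""
    corpus = corpus_of(root)
    text = corpus.find(f"{{{TC_NS}}}text").text
    tokens = ET.SubElement(corpus, f"{{{TC_NS}}}tokens")
    for i, word in enumerate(text.split(), start=1):
        ET.SubElement(tokens, f"{{{TC_NS}}}token", {"ID": f"t{i}"}).text = word
    return root


def layer_names(root):
    """List the local names of the layers currently in the document."""
    return [child.tag.split("}")[1] for child in corpus_of(root)]


def run_pipeline(root, services):
    """Apply each service in order; each one returns the enriched document."""
    for service in services:
        root = service(root)
    return root


doc = run_pipeline(make_tcf("Der Hund schläft"), [add_tokens_layer])
print(layer_names(doc))  # ['text', 'tokens']
```

Each step only appends a layer and leaves the existing ones untouched, which is what lets independently hosted services be chained in any compatible order.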

  5. WebLicht architecture

  6. Morphology services
     ● Currently available (morphological tagging):
       ○ German: Stuttgart Morphology (RFTagger), SMOR
       ○ Dutch: Alpino
       ○ English: MorphAdorner
     ● Adding new services for morphology:
       ○ Since WebLicht is decentralized, any CLARIN center could add additional morphology services.
       ○ If some interesting tool is missing, let us know!

  7. Stuttgart morphology (German)
     ● HMM tagger specialized for large, feature-rich tag sets.
     ● Trained on the Tiger treebank.
     ● Uses a supplementary lexicon.
     ● Outputs morphological tags in the TIGER morphology scheme:
       ○ Part-of-speech
       ○ Gender
       ○ Case
       ○ Number
       ○ Degree
       ○ Person
       ○ Tense
       ○ Mood
       ○ Finiteness
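
One way to hold the categories listed above in client code is a record with one optional field per category. The sketch below is an assumption-laden illustration: the `key=value|key=value` encoding, the feature values, and the names `MorphTag` and `parse_morph` are hypothetical and are not the actual RFTagger or TIGER output format.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class MorphTag:
    """One optional slot per category named on the slide."""
    pos: Optional[str] = None
    gender: Optional[str] = None
    case: Optional[str] = None
    number: Optional[str] = None
    degree: Optional[str] = None
    person: Optional[str] = None
    tense: Optional[str] = None
    mood: Optional[str] = None
    finiteness: Optional[str] = None


def parse_morph(encoded: str) -> MorphTag:
    """Parse a hypothetical 'key=value|key=value' string into a MorphTag."""
    allowed = {f.name for f in fields(MorphTag)}
    values = {}
    for pair in encoded.split("|"):
        if not pair:
            continue  # tolerate empty segments such as a trailing '|'
        key, sep, value = pair.partition("=")
        if not sep or key not in allowed:
            raise ValueError(f"unexpected feature: {pair!r}")
        values[key] = value
    return MorphTag(**values)


tag = parse_morph("pos=NN|gender=masc|case=nom|number=sg")
print(tag.case, tag.number)  # nom sg
```

Categories that do not apply to a word class (for example tense on a noun) simply stay `None`, which keeps one record type usable across the whole tag set.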

  8. Alpino (Dutch)
     ● Wide-coverage dependency parser for Dutch.
     ● But also has:
       ○ An extensive lexicon with subcategorization frames.
       ○ A guesser for unknown words.
     ● The final frames are decided by:
       ○ Filtering by n-best tagging.
       ○ The parse selected by the disambiguation model.

  9. Resources for German
     ● Semi-automatically annotated
       ○ Tiger treebank
       ○ TüBa-D/Z
     ● Automatically annotated
       ○ TüBa-D/W

  10. Tiger treebank
      ● ~50,000 sentences
      ● Newspaper text (Frankfurter Rundschau)
      ● Semi-automatically annotated
      ● Annotations:
        ○ STTS part-of-speech tags
        ○ Lemmas
        ○ Inflectional morphology
        ○ Constituency structure
        ○ Dependency conversion (subset hand-annotated)

  11. TüBa-D/Z
      ● ~95,500 sentences
      ● Newspaper text (taz)
      ● Semi-automatically annotated
      ● Annotations:
        ○ STTS part-of-speech tags
        ○ Lemmas
        ○ Inflectional morphology
        ○ Constituency structure
        ○ Dependency conversion
        ○ Anaphora and coreference relations
        ○ Subset with GermaNet word senses
        ○ Named entity class

  12. TüBa-D/W
      ● 36.1 million sentences
      ● German Wikipedia
      ● Automatically annotated
      ● Annotations:
        ○ STTS part-of-speech tags
        ○ Lemmas
        ○ Inflectional morphology
        ○ Dependency structure
      ● Processed using WebLicht :-)

  13. TüBa-D/W
      TüBa-D/W is fully searchable using the TüNDRA treebank viewer.

  14. Links
      WebLicht: https://weblicht.sfs.uni-tuebingen.de/
      TüNDRA: https://weblicht.sfs.uni-tuebingen.de/Tundra/
