
Maca: a configurable tool to integrate Polish morphological data
Adam Radziszewski, Tomasz Śniatowski
Wrocław University of Technology

Outline
● Morphological resources for Polish
● Tagset and segmentation differences
● Requirements
● Our solution
● Usage scenarios
● Summary

Introduction
● Morphological analysis: assigning morphological descriptions to tokens
  ● Token → set of (MSD tag, lemma) pairs
● MSD: morphosyntactic description tag
  ● Part of speech / grammatical class
  ● Values of inflectional and syntactic attributes, e.g. case
Example: analysis of the form myśl
  myśl    subst:sg:nom:f    (thought)
  myśleć  impt:sg:imperf    (think!)
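The token-to-analyses mapping above can be pictured as a small data structure. The following is a minimal Python sketch and not Maca's actual API: the names `Analysis`, `ANALYSES`, `analyse` and `grammatical_class` are hypothetical, and the only data is the example from the slide.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    """One morphological interpretation: a lemma paired with an MSD tag."""
    lemma: str
    tag: str


# Hypothetical toy lexicon holding only the example from the slide.
ANALYSES = {
    "myśl": {
        Analysis(lemma="myśl", tag="subst:sg:nom:f"),    # noun: "thought"
        Analysis(lemma="myśleć", tag="impt:sg:imperf"),  # verb: "think!"
    },
}


def analyse(orth: str) -> set[Analysis]:
    """Return every (lemma, tag) pair known for a token; empty set if unknown."""
    return ANALYSES.get(orth, set())


def grammatical_class(tag: str) -> str:
    """The first colon-separated field of a tag names the grammatical class."""
    return tag.split(":")[0]
```

A set is used because the analyses of one form are unordered alternatives; choosing among them is the job of a tagger, not of the analyser.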

Morphological resources for Polish
IPI PAN Corpus tagset
● Analyser: Morfeusz SIAT
  ● Large dictionary
  ● Data by recognised Polish linguists
  ● Very restrictive licence
● Corpus: IPI PAN (fragment)
  ● 660 000 tokens manually annotated
  ● 84 000 different forms
  ● GNU GPL
Morfologik tagset
● Analyser: Morfologik
  ● Large dictionary (3.5 mln forms)
  ● Data from ispell/myspell
  ● GNU LGPL or CC BY-SA
* Free: source available
There are more non-free analysers & corpora with various tagsets

Morphological resources for Polish (2)
● Important to have corpus and analyser in the same tagset
  ● Corpus usually too small to obtain a reliable lexical model
  ● POS/MSD taggers for Polish rely on external analysers
● Goal: to integrate corpus morphological data with available analysers
● Important to be able to modify an existing dictionary
  ● Correct erroneous entries
  ● Extend
  ● Supersede entries with domain-specific terminology
  ● Integrate multiple dictionaries
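The dictionary operations listed above (extending, superseding, integrating) can be illustrated with plain mappings. This is a hedged Python sketch under the assumption that a dictionary maps each form to a set of (lemma, tag) pairs; `merge` and `override` are hypothetical names and do not reflect how Maca stores or combines its data.

```python
from collections import defaultdict

Entry = tuple[str, str]  # (lemma, tag)


def merge(*dictionaries: dict[str, set[Entry]]) -> dict[str, set[Entry]]:
    """Integrate several dictionaries: union the analyses of each form."""
    merged: dict[str, set[Entry]] = defaultdict(set)
    for dictionary in dictionaries:
        for form, entries in dictionary.items():
            merged[form] |= entries
    return dict(merged)


def override(base: dict[str, set[Entry]],
             patch: dict[str, set[Entry]]) -> dict[str, set[Entry]]:
    """Supersede: a form present in the patch replaces the base entries entirely."""
    result = {form: set(entries) for form, entries in base.items()}
    for form, entries in patch.items():
        result[form] = set(entries)
    return result


# Toy data built from the slide's single example form.
nouns = {"myśl": {("myśl", "subst:sg:nom:f")}}
verbs = {"myśl": {("myśleć", "impt:sg:imperf")}}
combined = merge(nouns, verbs)       # both interpretations of "myśl"
patched = override(combined, nouns)  # a patch that keeps only the noun reading
```

Merging keeps every interpretation, which suits integrating general-purpose dictionaries; overriding discards the base entries for a form, which suits correcting errors or imposing domain-specific terminology.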

Tagset differences
● Traditional parts of speech (nouns, pronouns, verbs…)
  ● Non-free analysers, e.g. POLEX PMDBF
  ● Partially Morfologik
● PoS classes based on inflectional properties
  ● Morfeusz / IPI PAN Corpus, partially Morfologik
  ● Each class is assigned a set of attributes whose values must be given
  ● If some subset of a PoS is not specified for an attribute, it should constitute a separate class
  ● Moja (my, fem. sg.) inflects as an adjective, thus it is labelled so
  ● Jasno (light) is gradable → adverb; dziś (today) is not → particle
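The rule that each class carries a fixed set of attributes whose values must all be given can be made concrete with a small validator. The tagset below is a deliberately simplified, hypothetical one (two classes, three attributes), loosely modelled on the `subst:sg:nom:f` example; it is not the real IPI PAN tagset definition, and the class name `qub` for an attribute-less particle class is an assumption.

```python
# Simplified, hypothetical tagset: permitted values for each attribute.
ATTRIBUTES = {
    "number": {"sg", "pl"},
    "case": {"nom", "gen", "dat", "acc", "inst", "loc", "voc"},
    "gender": {"m1", "m2", "m3", "f", "n"},
}

# Each grammatical class lists, in order, the attributes it requires.
CLASSES = {
    "subst": ["number", "case", "gender"],  # noun
    "qub": [],                              # particle: no attributes (assumed name)
}


def is_valid(tag: str) -> bool:
    """Check that a tag names a known class and gives every required value."""
    name, *values = tag.split(":")
    required = CLASSES.get(name)
    if required is None or len(values) != len(required):
        return False
    return all(value in ATTRIBUTES[attr] for attr, value in zip(required, values))
```

Under this scheme a word that cannot take a value for some attribute (such as degree for a non-gradable word) has to live in a class that does not require it, which is why such words end up in a separate class.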

Segmentation differences
● When attaching MSD tags, we need to know what kind of units (tokens) we want to account for
● Traditionally: strings of letters delimited by punctuation and white space (Morfologik, POLEX PMDBF)
● Morfeusz: some verb forms are split into parts
  ● Miałem (I had, masc.) → miał (sg. masc.) + em (sg. 1st person)
  ● Miałbym (I'd have, masc.) → miał (sg. masc.) + by (conj. part.) + m (sg. 1st person)
  ● Motivation: occasional scrambling, e.g. gdyby + m miał (if I had, masc.)
  ● Segmentation ambiguities: miałem is also a noun in the instrumental case (dust)
  ● Morfeusz outputs graphs, e.g. miał + em alongside miałem
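The segmentation graph for miałem can be pictured as a small directed acyclic graph whose paths are the alternative segmentations. The Python sketch below is illustrative only: the edge-list encoding and the names `EDGES` and `paths` are assumptions and do not reproduce Morfeusz's actual output format.

```python
# Edges of a hypothetical segmentation graph for "miałem":
# (start node, end node, segment). Node 0 is the start, node 2 the end.
EDGES = [
    (0, 1, "miał"),    # verb part (sg. masc.)
    (1, 2, "em"),      # sg. 1st person ending
    (0, 2, "miałem"),  # noun in the instrumental case ("dust")
]


def paths(edges, start, end):
    """Enumerate every segmentation, i.e. every path from start to end.

    Assumes the graph is acyclic, so the recursion always terminates.
    """
    if start == end:
        return [[]]
    found = []
    for src, dst, segment in edges:
        if src == start:
            for rest in paths(edges, dst, end):
                found.append([segment] + rest)
    return found


segmentations = paths(EDGES, 0, 2)
```

Here `segmentations` holds two alternatives, the split verb reading and the unsplit noun reading, which is exactly the ambiguity a downstream tool has to resolve or preserve.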

Requirements (functional)
● Integrate available morphological data under different settings, providing multiple configurations
  ● Select analysers to use at the moment
  ● Be able to use Morfeusz until enough free data is available
  ● Support overriding entries and extending dictionaries
● Tight coupling with tokeniser
  ● Take advantage of knowing token type (numbers, words, punctuation)
  ● Tie different analysis pipelines to different token types
● Handle some differences in tagsets and segmentation strategies
● Handle large dictionaries efficiently (transducers)
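Tying analysis pipelines to token types amounts to a dispatch on the label the tokeniser assigns. The following Python sketch shows the idea under stated assumptions: the labels `w` (word) and `p` (punctuation) appear in the slides, while the label `n` for numbers, the tag names `PUNCT`, `NUMBER` and `UNKNOWN`, and all function names are hypothetical placeholders rather than Maca's configuration vocabulary.

```python
def fixed_tag(tag):
    """Analyser that attaches one fixed tag and uses the form itself as lemma."""
    return lambda orth: {(orth, tag)}


def lookup(dictionary):
    """Analyser backed by a form → set of (lemma, tag) mapping."""
    return lambda orth: set(dictionary.get(orth, ()))


def first_nonempty(*analysers):
    """Pipeline: try each analyser in turn and keep the first non-empty result."""
    def run(orth):
        for analyser in analysers:
            result = analyser(orth)
            if result:
                return result
        return set()
    return run


DICTIONARY = {"myśl": {("myśl", "subst:sg:nom:f"), ("myśleć", "impt:sg:imperf")}}
UNKNOWN = fixed_tag("UNKNOWN")

# One pipeline per tokeniser label.
PIPELINES = {
    "p": fixed_tag("PUNCT"),
    "n": fixed_tag("NUMBER"),
    "w": first_nonempty(lookup(DICTIONARY), UNKNOWN),
}


def analyse_token(label, orth):
    """Route a token to the pipeline registered for its label."""
    return PIPELINES.get(label, UNKNOWN)(orth)
```

Because punctuation and numbers never reach the dictionary lookup, the costly analyser runs only on tokens that can actually benefit from it.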

Requirements (technical)
● Whole functionality as command-line tools and C/C++ library for use in NLP software
  ● Performance, low start-up time (no VM)
  ● Easy integration with Python and C++
● Re-usability
  ● Division into libraries wrt. functionality (I/O, tokeniser, analyser)
  ● Useful command-line tools also serving as library API usage examples
● Supporting standards and available resources
  ● SRX: segmentation rule exchange format for MT systems
  ● Unicode (using ICU library)
  ● SFST transducers
  ● Support for Morfeusz data (graphs) and XCES XML format (IPI PAN Corpus)

Our solution: MACA system
Toki: configurable tokeniser
● Running text → sequence of tokens, or of sentences containing tokens
● Tokenisation rules defined in INI files
● May point to an SRX file (sentence splitting rules); SRX rules for Polish by Miłkowski
● Example tokens:
  ● Orth: Aaa, Label: w, Space before: newlines
  ● Orth: ., Label: p, Space before: none
corpus2 library
● Data structs
● Corpus XML I/O
● Tags, tagsets
MACA: Morphological Analysis Converter and Aggregator
● Toki tokens → sequence of corpus2 tokens
● Analyser configs defined in INI files
● Tagset conversion routines as INI files
● May point to SFST transducers, Morfeusz, txt files
● Define analyser pipelines and use Toki labels

Usage scenarios (1)
● Compiling a working analyser from existing data
  ● Use one of the provided Toki configs or tailor a specific one
  ● Compile a text file with a dictionary into SFST format
  ● Simple Maca config: attaches fixed tags to punctuation and digits, and the compiled SFST transducer to the rest
  ● Practical usage in another project: Morfologik data converted into the IPIC tagset, resulting in a free replacement for Morfeusz
● Using and patching Morfeusz
  ● Morfeusz is a library plus a rudimentary utility to pose queries
  ● Morfeusz + Maca is able to analyse running text or XML files
  ● When a segmentation ambiguity is encountered, Maca warns and selects the shortest path
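The "warn and select the shortest path" behaviour can be sketched in a few lines. This Python example assumes that "shortest" means the path with the fewest segments and that candidate paths are given as lists of segments; `select_segmentation` is a hypothetical name, and the warning text is invented for illustration.

```python
import warnings


def select_segmentation(candidates):
    """Pick the path with the fewest segments; warn when there was a choice."""
    if not candidates:
        raise ValueError("no segmentation paths to choose from")
    if len(candidates) > 1:
        warnings.warn(f"segmentation ambiguity, {len(candidates)} paths: {candidates}")
    # min() keeps the first of equally short paths, so ties resolve deterministically.
    return min(candidates, key=len)
```

For the two paths of miałem (miał + em and miałem), this picks the unsplit single-token path, which also matches the traditional tokenisation used by the other analysers.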

Usage scenarios (2)
● Simple tag/segmentation conversions
  ● Serious tagset conversion is better performed off-line
  ● MACA: mapping rules, conditional token joining and splitting
  ● Differences in attribute value sets across corpus versions
  ● Reducing a tagset to PoS-only tags
● Reducing ambiguity in Morfeusz output: conversion routines may be applied to graph paths separately before joining
[Figure: graph paths for miałem (miał + em, miałem) before and after conversion]
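Reducing a tagset to PoS-only tags is the simplest of the conversions listed above. A minimal Python sketch, assuming colon-separated tags whose first field is the grammatical class (as in `subst:sg:nom:f`); the function name `to_pos_only` is hypothetical and this is not Maca's conversion-rule syntax.

```python
def to_pos_only(analyses):
    """Map each (lemma, tag) pair to (lemma, class), dropping attribute values."""
    return {(lemma, tag.split(":")[0]) for lemma, tag in analyses}


# Nominative and accusative readings of the same noun collapse into one.
full = {
    ("myśl", "subst:sg:nom:f"),
    ("myśl", "subst:sg:acc:f"),
    ("myśleć", "impt:sg:imperf"),
}
reduced = to_pos_only(full)
```

Since the result is a set, interpretations that differ only in attribute values merge automatically, so the conversion reduces ambiguity as a side effect.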

Summary
● A working system, bundled with practical configs & data
● C++ framework to build NLP applications on
● Released under GNU GPL 3.0 at http://nlp.pwr.wroc.pl/redmine/projects/libpltagger
● First open-source C/C++ SRX implementation
● Further work:
  ● Python wrappers
  ● Support additional corpus formats
  ● Support MULTEXT-EAST tag string representation
  ● Test for other languages
