Learning Morphology from the Corpus (Ondřej Dušek, Institute of Formal and Applied Linguistics)

Learning Morphology from the Corpus (slide 1/22)
Ondřej Dušek
Institute of Formal and Applied Linguistics, Charles University in Prague
November 11, 2013

Motivation (general) (slide 2/22)
Morphology is needed in most NLP tasks:
• Parsing
• Structural MT
• Factored phrase-based MT
• Corpora
• User interfaces
• Dialogue systems
The morphology module influences the overall quality of these systems.

Motivation (personal) (slide 3/22)
“Avoid the X@ tag in Czech as much as possible”
• Words unknown to the Czech dictionary are relatively common in some applications
  • KHRESMOI (translation of medical text): terms
  • ALEX dialogue system (public transport): stop names
  • Up to 5% of words are not recognized in special domains
  • There's no guesser in Treex (that I know of)
Examples of unrecognized words and the tags they receive:
  Dolnokrčská X@-------------
  artroplastika X@-------------
“Inflect anything”
• Translate and create unseen phrases
• Speak freely in dialogue systems
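To make the "up to 5% of words are not recognized" figure concrete, the share of unknown words in tagged output can be measured by counting tags that start with X@. The following is a minimal sketch, assuming the tagger output is available as (form, tag) pairs; the function names and the placeholder tag "KNOWN" are illustrative and not part of Treex or any tagger's API.

```python
UNKNOWN_PREFIX = "X@"  # tag prefix shown on the slide for unrecognized words


def unknown_words(tagged):
    """Return the forms whose tag marks them as unrecognized."""
    return [form for form, tag in tagged if tag.startswith(UNKNOWN_PREFIX)]


def unknown_rate(tagged):
    """Share of tokens carrying the unknown-word tag (0.0 for empty input)."""
    if not tagged:
        return 0.0
    return len(unknown_words(tagged)) / len(tagged)


# Toy input: the two X@ entries come from the slide; the other tokens and
# their "KNOWN" tags are illustrative placeholders, not real tagger output.
sample = [
    ("Dolnokrčská", "X@-------------"),
    ("artroplastika", "X@-------------"),
    ("je", "KNOWN"),
    ("zde", "KNOWN"),
]

print(unknown_words(sample))
print(unknown_rate(sample))
```

Run over a domain-specific corpus, a count like this shows how often a guesser would be needed.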

Exploiting the regularities in morphology (slide 4/22)
• The morphology of many languages is mostly regular, apart from a certain number of exceptions
• The size, number, and shape of inflection patterns differ

Possible approaches to morphology (slide 5/22)
Dictionaries?
• Work well, reliable
• Limited coverage and/or availability
Hand-written rules?
• Hard to maintain with complex morphology
[Figure: rule diagram; surviving labels: rule, x, y, B, C]
Learning from the data!
• Obtaining the rules automatically
• Plenty of corpora of sufficient size available
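As a rough illustration of "obtaining the rules automatically", the sketch below derives suffix-rewrite rules from (lemma, tag, form) triples and applies the most frequent matching rule to an unseen lemma. This is a generic toy technique chosen for brevity, not the method used in the talk; the tag label "gen" (genitive singular) and the three training triples are assumptions made up for the example.

```python
import os
from collections import Counter, defaultdict


def suffix_rule(lemma, form):
    """Return (lemma suffix to strip, form suffix to add) after the common prefix."""
    prefix_len = len(os.path.commonprefix([lemma, form]))
    return lemma[prefix_len:], form[prefix_len:]


def learn_rules(examples):
    """Count suffix rules per tag from (lemma, tag, form) triples."""
    rules = defaultdict(Counter)
    for lemma, tag, form in examples:
        rules[tag][suffix_rule(lemma, form)] += 1
    return rules


def inflect(lemma, tag, rules):
    """Apply the most frequent rule whose stripped suffix matches the lemma.

    Returns the lemma unchanged when no learned rule applies.
    """
    for (strip, add), _count in rules[tag].most_common():
        if lemma.endswith(strip):
            return lemma[: len(lemma) - len(strip)] + add
    return lemma


# Hypothetical training triples; "gen" is an ad hoc label for genitive singular.
training = [
    ("žena", "gen", "ženy"),
    ("ryba", "gen", "ryby"),
    ("Jana", "gen", "Jany"),
]
rules = learn_rules(training)

# "artroplastika" is one of the out-of-dictionary words from the slides.
print(inflect("artroplastika", "gen", rules))
```

A rule set this naive picks by frequency alone and ignores stem alternations and competing inflection patterns, which is one reason to predict the rule with a trained classifier over richer features instead.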

My experiments with morphology (slide 6/22)
• in chronological (less logical) order
1. Generation
  • with Filip Jurčíček (see also: our paper at ACL-SRW 2013)
  • Flect: statistical morphology generator
2. Analysis
  • recent, only partially finished experiments on Czech
  • a simple morphology module to go with the Featurama tagger, comparison with others
3. Discussion

Flect: Morphology generator (slide 7/22)
• Using machine learning to predict inflection
• Only previous statistical morphology module known to us: Bohnet et al. (2010)
• Flect tested on 6 languages from the CoNLL 2009 data set with a varying degree of morphological richness: EN, DE, ES, CA, JA, CS
[Figure: natural language generation pipeline; labels: Semantics, Syntax, Morphology, Text]

The need to generate morphology (slide 8/22)
• English: not so much; hard-coded solutions often work well enough
• Languages with more inflection (e.g. Czech): even the simplest applications have trouble with morphology
Example 1: Toto se líbí uživateli Jana Nováková.
  Gloss: This is liked by user (name)
  "uživateli" is [masc] [dat], but the inserted name is [fem] [nom]; the slide marks the corrected endings ě and é (Janě Novákové)
Example 2: Děkujeme, Jan Novák, vaše hlasování bylo vytvořeno.
  Gloss: Thank you, (name), your poll has been created
  The inserted name is [nom]; the slide marks the corrected endings e and u (Jane Nováku)
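The two Czech sentences look like the output of hard-coded templates with a name slot. The sketch below reproduces that failure and shows what changes once each slot states the case it requires. It is a toy under stated assumptions: the FORMS table is a hypothetical stand-in for a real morphology generator, and the case labels "dat" and "voc" are my reading of the corrected endings on the slide.

```python
# Templates modelled on the slide's two sentences; each slot names the case
# it requires ("dat" = dative, "voc" = vocative; labels are assumptions).
TEMPLATES = {
    "like": ("Toto se líbí uživateli {name}.", "dat"),
    "thanks": ("Děkujeme, {name}, vaše hlasování bylo vytvořeno.", "voc"),
}

# Hypothetical lookup table standing in for a morphology generator.
FORMS = {
    ("Jana Nováková", "dat"): "Janě Novákové",
    ("Jan Novák", "voc"): "Jane Nováku",
}


def fill_naive(key, name):
    """Insert the name as given, ignoring the case the slot requires."""
    template, _case = TEMPLATES[key]
    return template.format(name=name)


def fill_inflected(key, name):
    """Insert the inflected name, falling back to the base form if unknown."""
    template, case = TEMPLATES[key]
    return template.format(name=FORMS.get((name, case), name))


print(fill_naive("like", "Jana Nováková"))
print(fill_inflected("like", "Jana Nováková"))
print(fill_inflected("thanks", "Jan Novák"))
```

A lookup table covers only the names it lists, so any new user name falls back to the uninflected form, which is the coverage problem a statistical generator is meant to address.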
