Unsupervised discovery of Construction Grammar representations for under-resourced languages (PowerPoint PPT presentation)
Unsupervised discovery of Construction Grammar representations for under-resourced languages
Bogdan Babych, University of Leeds, Centre for Translation Studies (CTS)
http://www.comp.leeds.ac.uk/bogdan
b.babych@leeds.ac.uk
Corpus annotation for under- resourced languages
- Getting a language on a ‘technology map’
- Morphosyntactic annotation & generation
– Part-of-speech taggers, lemmatisers, paradigms
– Dependency / constituency parsing, chunking
– Annotated general-purpose & domain-specific corpora, treebanks
- Starting point for computational applications
– Addressing data sparseness for inflected languages
– Language models (for Speech Recognition, MT)
– Text normalization (Text-to-Speech)
Technological value of morphosyntax: MT for under-resourced languages
- Neural MT: generation of lemma sequence + morphological tagging (Conforti et al., 2018)
- Factored SMT: data sparseness & disambiguation (Koehn, 2009)
- RBMT: Analysis, Generation & Transfer pipelines
– Successful morphological disambiguation → correct translation equivalents
– Cascaded disambiguation
- Morphological ambiguities resolved at the syntax level:
– Häuser → Haus | NN.plur.nom.neut
– Haus → house; NN.plur.nom.neut → N.plur
– house | N.plur → houses
- Their weight changes (VERB.3pers.sing) every day
- Some people record their weight changes (NOUN.plur) every day
Corpus annotation practice vs. theoretical lexicogrammar
- Annotation schemes traditionally relied on theory-neutral, consensual annotation (Leech, 1993; Straka and Straková, 2017)
– Theoretically sensitive decisions (Garside et al., 1997)
– Possibility of linguistically unsound, ad-hoc or contradictory solutions
– Potential errors reduce the usefulness of annotation
– Conservative view of linguistic material, missing recent theoretical developments
- Traditionally two separate stages: grammar and lexicon development
– Tagsets & morphosyntactic features, disambiguated tags in a sub-corpus
– Emission tags for word forms, paradigm classes for lemmas
Corpus annotation practice vs. theoretical lexicogrammar
- Limitation: morphological disambiguation depends on lexical features, e.g.:
– [Prep (Adj.Case+Num)? N.Case+Num]PP
– в (Prep_Case=Gen|Acc|Loc) книжки (Gen+Sing|Nom+Plur|Acc+Plur) (with a book; into books)
– на (Prep_Case=Acc|Loc) книжки (Gen+Sing|Nom+Plur|Acc+Plur) (onto books)
– до (Prep_Case=Gen) книжки (Gen+Sing|Nom+Plur|Acc+Plur) (to a book)
- The need for lexicalized morphosyntactic representations
– A systematic lexicalized theoretical framework
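The case-intersection logic behind these examples can be sketched in a few lines of Python. This is a toy illustration, not the tagger's actual resources: the names `PREP_CASES`, `NOUN_READINGS` and `disambiguate_pp`, and the data in them, are all hypothetical.

```python
# Toy sketch: lexicalized case disambiguation inside a Ukrainian PP.
# A preposition licenses a set of cases; intersecting that set with the
# ambiguous case+number readings of the noun form prunes the analyses.

PREP_CASES = {
    "в":  {"Gen", "Acc", "Loc"},
    "на": {"Acc", "Loc"},
    "до": {"Gen"},
}

# Ambiguous readings of the form "книжки": (case, number) pairs
NOUN_READINGS = {
    "книжки": {("Gen", "Sing"), ("Nom", "Plur"), ("Acc", "Plur")},
}

def disambiguate_pp(prep, noun):
    """Keep only the noun readings whose case the preposition licenses."""
    allowed = PREP_CASES[prep]
    return {r for r in NOUN_READINGS[noun] if r[0] in allowed}

print(disambiguate_pp("до", "книжки"))  # a single Gen+Sing reading survives
print(disambiguate_pp("в", "книжки"))   # still ambiguous: Gen+Sing vs Acc+Plur
```

As the slide argues, lexical features alone do not always suffice: with до the PP is fully disambiguated, while with в two readings remain and further (syntactic or lexical) evidence is needed.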
Unsupervised linguistic annotation of under-resourced languages
- Supervised methods need manual annotation
– Not available for under-resourced languages
- Unsupervised & weakly-supervised methods:
– More suitable for under-resourced scenarios
– Smaller and more qualified development effort
– Strong assumptions about expected linguistic structures
– Models of expected variation (phonological, morphological, syntactic…)
Context: Experience of HyghTra project (FP7 MC IAPP)
- RBMT core architecture (Lingenio GmbH)
– Transfer-based, syntactic dependencies + semantic features for selectional restrictions
- Corpus-based resource creation & disambiguation
– Faster development for new translation directions
– Exploiting similarities between closely-related languages (nl→de; pt,es→fr; uk→ru)
– Alignment of richly-annotated, morphologically and syntactically disambiguated corpora
- Under-specified representations: morph., synt., sem.
Lingenio’s RBMT lexicon
Ukrainian news corpus
- Low-resource scenario: ~250 million words, not balanced
- News texts collected via targeted crawling
– Part-of-speech annotation via transfer learning (Babych & Sharoff, 2016)
– Coverage of tag-emission & lemmatization lexicon: ~15k words (~91% on news texts)
– Accuracy: 93% on known & 72% on unknown words
- Available on: http://corpus.leeds.ac.uk/internet.html
- Tasks for unsupervised learning:
– T1: Discovery of Construction Grammar representations
– T2: Induction of wide-coverage morphological lexicon
T1. Discovery of Construction Grammar representations in a Ukrainian corpus
- Construction Grammar framework (Kay & Fillmore, 1999; Fillmore, 2002)
– Lexicalized morphosyntactic representations:
- specify syntactic relations, valencies and semantics for associated linguistic structures (cf. Fillmore, 2013: 112)
- include different levels: morphosyntactic, lexical, phraseological
- have underspecified slots for lexical or grammatical valencies that are lexically or morphologically restricted
- Examples: What’s X doing Y; to look forward to X
– Single-stage induction of a morphosyntactic lexicon
- Syntax is lexicalized = the lexicon carries morphosyntactic annotation
– Unified framework for Single- and Multiword Expressions (MWEs)
- Words are not elementary units; MWEs have structure
- Explaining valencies & syntactic variation (cf. lexicalized TAG)
(to) look forward to V-ing
- Representations of lexicalized structures (CWB format)
- Modeling variation:
– I look forward to receiving President Tadic
– He looked forward to arguing the case in court
– I’m looking forward to being able to see his talk online
- (an overlap with “(to) be able to X” construction)
– Hawking looks forward to knowing (metaphorically, of course) the "mind of God”
TAG representations: syntactic variation (initial & auxiliary trees)
Unsupervised discovery of lexicalized constructions
- Methodology ~ discovering multiword expressions in PoS-annotated corpora (Justeson & Katz, 1995; Babych & Hartley, 2009)
– Collecting & sorting lexical N-grams and skip-grams
– Filtering: frequency & lexical salience
- Frequency threshold (>4); association measures (log-likelihood, Mutual Information…)
- PoS configurations: positive vs. negative filters, statistical tf.idf filters
– *user interface of; *with user interface
- Generalizing methods for multilevel annotation
– [word, lemma, PoS, sub-classes, syntactic dependencies…]
– Computationally intensive: deals with “longer” N-grams
– Recurring feature patterns across annotation levels
– Underspecified representations: partially filled positions
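The collection-and-filtering step can be illustrated with a minimal Python sketch. The helper `pmi_bigrams` is hypothetical and the corpus is a toy; it applies only the frequency threshold and Mutual Information, one of the association measures named above (the talk also uses log-likelihood and PoS filters).

```python
# Sketch of MWE-candidate filtering: count bigrams over a token stream,
# drop candidates below the frequency threshold, and score the rest
# with pointwise mutual information (PMI).
import math
from collections import Counter

def pmi_bigrams(tokens, min_freq=5):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = {}
    for (w1, w2), f in bigrams.items():
        if f < min_freq:
            continue  # frequency threshold (>4 in the talk)
        p_bigram = f / (n - 1)
        p_indep = (unigrams[w1] / n) * (unigrams[w2] / n)
        scored[(w1, w2)] = math.log2(p_bigram / p_indep)
    return scored

tokens = "a lot of time a lot of money a lot of work a waste of time".split()
print(pmi_bigrams(tokens, min_freq=3))  # only "a lot" and "lot of" survive
```

High-PMI, high-frequency bigrams such as *lot of* are exactly the kind of candidates that feed the later PoS-pattern filters.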
Underspecified N-grams: construction candidates & selected lexical classes
Fully lexicalized constructions [NN IN NN] (frequency + instance):
2393 point of view; 2104 sort of thing; 1272 cup of tea; 1014 way of life; 865 period of time;
841 lot of money; 710 value for money; 692 kind of thing; 595 quality of life; 566 piece of paper;
551 sense of humour; 524 length of time; 521 division of labour; 519 side by side; 518 lot of time;
513 rate of interest; 510 amount of money; 477 cup of coffee; 454 waste of time; 449 member of staff;
437 amount of time; 424 time of year; 419 rate of inflation; 417 course of action; 405 head of state;
384 matter of fact; 342 lot of work; 336 person per night; 318 sheet of paper; 301 work of art;
296 rule of law; 294 state of emergency; 286 balance of power; 281 breach of contract; 277 sum of money;
277 state of mind; 277 rate of return; 269 hand in hand; 262 duty of care; 255 time of day;
255 secretary of state; 250 source of information; 250 rate of growth; 250 friend of mine; 247 cause of death;
242 sort of person
Fully lexicalized constructions [JJ NN NN] (frequency + instance):
308 other way round; 284 inflammatory bowel disease; 226 second world war; 178 high blood pressure;
170 criminal justice system; 166 national health service; 159 right hand side; 156 social security system;
147 graphic user interface; 142 local education authority; 140 left hand side; 134 nuclear power station;
132 coronary heart disease; 129 real wage rate; 122 fourth quarter net; 122 first time I;
116 primary health care; 106 intensive care unit; 104 first world war; 101 local planning authority;
100 third quarter net; 100 social history discipline; 97 substantive research contract; 92 local government finance;
83 wrong way round; 80 second quarter net; 72 public sector borrowing; 71 local authority control;
70 irritable bowel syndrome; 69 hot water cylinder; 66 local income tax; 66 foreign exchange market;
63 local authority housing; 62 net asset value; 62 alcoholic liver disease; 60 primary sclerosing cholangitis;
59 duodenal ulcer disease; 58 retail price index; 58 first time round; 57 infant death syndrome;
57 aggregate demand curve; 56 regional health authority; 55 environmental impact assessment; 54 new world order;
53 social work practice; 53 public service broadcasting
Discovery of Ukrainian constructions
Construction Grammar: underspecified MWE lexicon with syntactic relations
- Automated discovery of ‘diagnostic contexts’
– Characterize lexical classes via configurations of formal features, e.g.:
- “NN IN _” → abstract nouns with specific valency patterns: sort kind lot type amount sense form lack level piece point degree rate use bit number period source time loss matter range state process increase deal quality cup change way cost system area need question development course method nature concept
- Ukrainian nouns in the same diagnostic context: perspective project change event record clash participation trading decision struggle committee candidate gas action head comparison relation situation election course meeting
- Constructions are found via distributional analysis
– Lexicogrammatical classes identified automatically
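A minimal sketch of how such diagnostic contexts can be harvested from a PoS-tagged corpus. The helper `slot_fillers` and the tiny tagged corpus are illustrative assumptions, not the project's actual code: a context such as “NN IN _” collects the words that fill the open slot, and words sharing many contexts fall into the same lexicogrammatical class.

```python
# Sketch: harvest fillers of a diagnostic context from a PoS-tagged
# token stream. For every trigram whose first two tags match the
# pattern (e.g. NN IN), record the word in the third (slot) position.
from collections import defaultdict

def slot_fillers(tagged, pattern=("NN", "IN")):
    """Map each matching (word1, word2) context to the set of slot fillers."""
    contexts = defaultdict(set)
    for (w1, t1), (w2, t2), (w3, _t3) in zip(tagged, tagged[1:], tagged[2:]):
        if (t1, t2) == pattern:
            contexts[(w1, w2)].add(w3)
    return contexts

tagged = [("sort", "NN"), ("of", "IN"), ("thing", "NN"),
          ("and", "CC"), ("cup", "NN"), ("of", "IN"), ("tea", "NN")]
print(dict(slot_fillers(tagged)))
```

Run over a large corpus, the inverse mapping (slot filler → set of contexts) yields the distributional word classes shown on the slide.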
T2: Induction of wide-coverage morphological lexicon for Ukrainian
- Existing resources
– Standardized tagset for the Slavonic family (MULTEXT)
– Morphological lexicon: ≈15k frequent lemmas = 200k inflected forms from a non-disambiguating tagger
- No manually disambiguated corpus
- Automatically derived morphological disambiguation
– Babych and Sharoff (2016). Rapid induction of morphological disambiguation resources from a closely related language. HyTra-2016 workshop at EAMT
Existing Ukrainian morphological lexicon (Kotsyba et al., 2009)
+ Ukrainian 250 MW news corpus
Why needed & important:
- For under-resourced languages: wide-coverage systems
- For well-resourced languages: keeping up with lexicon change (new words and expressions)
Transfer learning via a closely-related, better-resourced language (Babych, B., Sharoff, S. 2016).
Existing approaches:
Ahlberg, M., Forsberg, M., & Hulden, M. (2015): Paradigm classification in supervised learning of morphology; (2014): Semi-supervised learning of morphological paradigms and lexicons
Problems with existing approaches: under-resourced languages
- Need annotated data sets for supervised training
– May not be available from the start
– No guarantee of coverage / representativeness
- Lexicon induction: token in corpus → inflection
– ‘Trusting the corpus’ too much (noise problems)
– Needs ‘additional information’: forms from the corpus need to be (a) of known PoS; (b) lemmatized (!!!)
- They mix up the terms ‘inflection table’ and ‘paradigm’; correct use:
– Paradigm = system of word forms
– Inflection table = table of inflections for several paradigms
Problems of complexity and feasibility for under-resourced languages
- Order of complexity: typically training on ~2000–7000 paradigms → 100–300 tables
– Developed by a linguist/informant or derived from a morphologically analyzed corpus
- All possible paradigms assigned to each word
– If lexicon induction is done from lemmas (base forms), an oracle vocabulary is needed
– Unlabeled corpora used only to weight paradigms
- Hard to re-implement in a realistic scenario for under-resourced languages
- Engineering value vs. model for language acquisition
Problems with ambiguity
- Forces a single paradigm per lemma
– Paradigm can depend on meaning, e.g. abstract/concrete (uk):
– ‘block’.N.gen.sing: блока (concrete) vs. блоку (abstract ‘mil. block’)
– ‘order’.N.gen.sing: ордена (concrete ‘award’) vs. ордену (abstract ‘group’)
- A labeled form = evidence for a single paradigm
– If paradigms are collected from a corpus, PoS-tagged tokens are already disambiguated and cannot be assigned to multiple paradigms
– No guarantee that possible rare forms will be correctly tagged or present: e.g. V.imperative is rare in written corpora; a subject-specific spoken corpus is needed
- вибори /vybory/ = N.masc.{nom|acc}.plur (‘choices’/‘elections’) & V.imper.2pers.sing (‘win’/‘fight for’)
– Combination of lexical and morphological ambiguity: ‘pryklady’
- Приклади = N.masc.{nom|acc|voc}.plur – ‘examples’
- Приклади = N.masc.{nom|acc|voc}.plur – ‘holders’
- Приклади = V.imper.2pers.sing – ‘attach’
Problem with inflection table coverage:
- Stem alternations are determined phonologically; changes happen across paradigms and are not always regular:
– /o, e/ → /i/ in a “newly-closed syllable”, ~13th century AD:
- Noun: столъ /stolô/ → стіл /stil/; стола /stola/ [no change]
- Verb: неслъ /neslô/ → несъ /nesô/ → ніс /nis/; нести /nesty/ [no change]
– Does not happen in new words since the 15th century (порт ‘port’)
- Hard to predict the number of paradigms needed to cover existing inflection tables
– Related languages require different numbers of paradigms
– Can be large due to the interaction of purely phonetic & morphonological shifts
Unsupervised lexicon induction for given paradigms
- Inflection tables based on comprehensive grammar descriptions
– Guaranteed to cover inflection types, but not stem alternations
- ‘Latent’ paradigm induction
– Getting more evidence to filter out noise (thresholds)
- Filling in all paradigm types
– Possibility to justify several paradigms with the same token (‘pryklady’ → V and N paradigms)
Idea for our approach
- Hand-coded ‘textbook’ inflection tables + features
- For each word type in a corpus (frequency list):
– Each inflection in each table is tested for whether it can split the word into {stem + inflection}
– If yes, the corresponding inflection table generates ‘expectations’: hypothetical word forms consistent with the separated stem + all other inflections in the table
– If for a table N > threshold expectations exist in the corpus (or phonologically close forms, measured by ‘graphonological Levenshtein edit distance’: Babych, 2016), then the paradigm for the stem based on that inflection table is confirmed & the lemma is added to the lexicon
– For any word, several paradigms can be confirmed
- Paradigms are ‘latent’: confirmed by observations indirectly
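The confirmation loop can be sketched as follows. This is a minimal, exact-match version under stated assumptions: the function name `confirm_paradigms`, the toy inflection table and the threshold are all illustrative, and the graphonological-distance relaxation for phonologically close forms is omitted.

```python
# Minimal sketch of latent paradigm confirmation. An inflection table is
# a list of endings; for each corpus word type we try every ending split,
# generate the "expected" sister forms from the same table, and confirm
# the paradigm when enough expectations are attested in the corpus.

def confirm_paradigms(word, corpus_types, tables, threshold=2):
    confirmed = []
    for name, endings in tables.items():
        for ending in endings:
            if not word.endswith(ending):
                continue
            stem = word[: len(word) - len(ending)]
            expected = {stem + e for e in endings} - {word}
            attested = expected & corpus_types
            if len(attested) >= threshold:
                confirmed.append((name, stem))
                break  # one confirming split per table is enough
    return confirmed

# Toy table for -а feminine nouns (cf. фабрика ‘factory’ on the next slide)
tables = {"noun_fem_a": ["а", "и", "і", "у", "ою"]}
corpus = {"рука", "руки", "руку", "рукою"}
print(confirm_paradigms("рука", corpus, tables))  # paradigm confirmed for stem рук-
```

Note that the expectation рук+і is not attested (the real form is руці, with the к→ц alternation); this is exactly the gap the graphonological edit distance below is meant to bridge.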
Example: inflection tables
input: рука (‘ruka’, hand); inflection table for фабрика (‘fabryka’, factory), etc.
Example: inflection tables
рук|а (‘ruka’, hand)
Paradigm confirmed
Models for morphonological distortion
- Graphonological Levenshtein edit distance (Babych,
2016, EAMT)
– рука (ruka) → руці (rutsi)
– к [k] = cons|backtongue|velar|plosive|unvoiced
– ц [ts] = cons|fronttongue|dental|plosive|unvoiced
V. Levenshtein (1935–)
+
Roman Jakobson’s phonological features
$$
\mathrm{lev}_{a,b}(i,j)=
\begin{cases}
\max(i,j) & \text{if } \min(i,j)=0\\[4pt]
\min
\begin{cases}
\mathrm{lev}_{a,b}(i-1,j)+1 & \text{(delete)}\\
\mathrm{lev}_{a,b}(i,j-1)+1 & \text{(insert)}\\
\mathrm{lev}_{a,b}(i-1,j-1)+\mathbb{1}_{(a_i\neq b_j)} & \text{(substitute)}
\end{cases} & \text{otherwise}
\end{cases}
$$
where $\mathbb{1}_{(a_i\neq b_j)} = f(a_i,b_j)$, with $f(a_i,b_j)=0$ if $a_i=b_j$ and $f(a_i,b_j)=1$ otherwise.
feature hierarchy (to be trained on known paradigms)
+
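The distance can be sketched in Python: standard Levenshtein dynamic programming, but with a substitution cost equal to the share of phonological features on which two letters disagree. The feature vectors below (only к and ц) and the helper names are illustrative assumptions; the real model uses a feature hierarchy trained on known paradigms.

```python
# Sketch of graphonological Levenshtein distance: the substitution cost
# between two letters is the fraction of phonological features they
# disagree on, so к [k] vs ц [ts] (both unvoiced plosives) costs less
# than a full substitution of unrelated letters.

FEATURES = {
    "к": {"cons", "backtongue", "velar", "plosive", "unvoiced"},
    "ц": {"cons", "fronttongue", "dental", "plosive", "unvoiced"},
}

def sub_cost(a, b):
    if a == b:
        return 0.0
    fa, fb = FEATURES.get(a), FEATURES.get(b)
    if fa is None or fb is None:
        return 1.0  # no feature info: fall back to plain Levenshtein cost
    return len(fa ^ fb) / len(fa | fb)  # share of disagreeing features

def graphon_lev(a, b):
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = float(i)
    for j in range(len(b) + 1):
        d[0][j] = float(j)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,              # delete
                          d[i][j - 1] + 1,              # insert
                          d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return d[len(a)][len(b)]

print(graphon_lev("рука", "руці"))  # ≈ 1.57, vs 2.0 for plain Levenshtein
```

Because рука/руці score below the plain edit distance, the alternating form can still count as evidence for the same latent paradigm.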
Advantages of unsupervised paradigm induction
- Does not require labeled data (PoS + lemma)
– Smaller but more qualified input from language specialists (paradigm types, patterns of historical changes)
- Robust against noise in the corpus
– Because of indirect (latent) induction, ambiguity is allowed
– A word = possible evidence for several paradigms
- Can be used to update the lexicon from a new corpus
– Achieving wide coverage: 15k → >200k; domain-specific words, proper names (may change quickly)
- More realistic for:
– morphological development for under-resourced languages
– models of morphological acquisition, learning morphonological variation from non-annotated corpora
Open questions
- Size of the corpus
– What is the relation between corpus size & threshold values?
- Learning graphonological distortion models
– Weights for substitution of phonological features need to be extended with insertion & deletion models
– Learning distortion models from corpora
Conclusions & Future work
- Methodology for unsupervised discovery of Construction Grammar representations
– Syntactically annotated, underspecified Multiword Expressions
– Relies on a wide-coverage morphologically annotated lexicon
- Lexicon induction by confronting hand-coded ‘textbook’ inflection tables with a large corpus
– Confirmed paradigms from corpus frequency lists
- Future work: lexicon & construction grammar