Unsupervised discovery of Construction Grammar representations for under-resourced languages (PowerPoint PPT presentation)
Unsupervised discovery of Construction Grammar representations for under-resourced languages
Bogdan Babych, University of Leeds, Centre for Translation Studies (CTS)
http://www.comp.leeds.ac.uk/bogdan
b.babych@leeds.ac.uk
Corpus annotation for under- resourced languages
- Getting a language on a ‘technology map’
- Morphosyntactic annotation & generation
– Part-of-speech taggers, lemmatisers, paradigms
– Dependency / constituency parsing, chunking
– Annotated general-purpose & domain-specific corpora, treebanks
- Starting point for computational applications
– Addressing data sparseness for inflected languages
– Language models (for Speech Recognition, MT)
– Text normalization (Text-to-Speech)
Technological value of morphosyntax: MT for under-resourced languages
- Neural MT: generation of lemma sequence + morphological tagging (Conforti et al., 2018)
- Factored SMT: data sparseness & disambiguation (Koehn, 2009)
- RBMT: Analysis, Generation & Transfer pipelines
– Successful morphological disambiguation → correct translation equivalents
– Cascaded disambiguation
- Morphological ambiguities resolved at the syntax level:
– Häuser → Haus | NN.plur.nom.neut
– Haus → house; NN.plur.nom.neut → N.plur
– house | N.plur → houses
- Their weight changes (VERB.3pers.sing) every day
- Some people record their weight changes (NOUN.plur) every day
Corpus annotation practice vs. theoretical lexicogrammar
- Annotation schemes traditionally relied on theory-neutral, consensual annotation (Leech, 1993; Straka and Straková, 2017)
– Theoretically sensitive decisions (Garside et al., 1997)
– Possibility of linguistically unsound, ad-hoc or contradictory solutions
– Potential errors reduce the usefulness of annotation
– Conservative view of linguistic material, missing recent theoretical developments
- Traditionally two separate stages: grammar and lexicon development
– Tagsets & morphosyntactic features, disambiguated tags in a sub-corpus
– Emission tags for word forms, paradigm classes for lemmas
Corpus annotation practice vs. theoretical lexicogrammar
- Limitation: morphological disambiguation depends on lexical features, e.g.:
– [Prep (Adj.Case+Num)? N.Case+Num]PP
– в (Prep_Case=Gen|Acc|Loc) книжки (Gen+Sing|Nom+Plur|Acc+Plur) (with a book; into books)
– на (Prep_Case=Acc|Loc) книжки (Gen+Sing|Nom+Plur|Acc+Plur) (onto books)
– до (Prep_Case=Gen) книжки (Gen+Sing|Nom+Plur|Acc+Plur) (to a book)
- The need for lexicalized morphosyntactic representations
– A systematic lexicalized theoretical framework
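The case-intersection logic behind these examples can be sketched in a few lines of Python. This is a toy illustration, not the tagger's actual resources: the names `PREP_CASES`, `NOUN_READINGS` and `disambiguate_pp`, and the data in them, are all hypothetical.

```python
# Toy sketch: lexicalized case disambiguation inside a Ukrainian PP.
# A preposition licenses a set of cases; intersecting that set with the
# ambiguous case+number readings of the noun form prunes the analyses.

PREP_CASES = {
    "в":  {"Gen", "Acc", "Loc"},
    "на": {"Acc", "Loc"},
    "до": {"Gen"},
}

# Ambiguous readings of the form "книжки": (case, number) pairs
NOUN_READINGS = {
    "книжки": {("Gen", "Sing"), ("Nom", "Plur"), ("Acc", "Plur")},
}

def disambiguate_pp(prep, noun):
    """Keep only the noun readings whose case the preposition licenses."""
    allowed = PREP_CASES[prep]
    return {r for r in NOUN_READINGS[noun] if r[0] in allowed}

print(disambiguate_pp("до", "книжки"))  # a single Gen+Sing reading survives
print(disambiguate_pp("в", "книжки"))   # still ambiguous: Gen+Sing vs Acc+Plur
```

As the slide argues, lexical features alone do not always suffice: with до the PP is fully disambiguated, while with в two readings remain and further (syntactic or lexical) evidence is needed.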
Unsupervised linguistic annotation of under-resourced languages
- Supervised methods need manual annotation
– Not available for under-resourced languages
- Unsupervised & weakly-supervised methods:
– More suitable for under-resourced scenarios
– Smaller and more qualified development effort
– Strong assumptions about expected linguistic structures
– Models of expected variation (phonological, morphological, syntactic…)
Context: Experience of HyghTra project (FP7 MC IAPP)
- RBMT core architecture (Lingenio GmbH)
– Transfer-based, syntactic dependencies + semantic features for selectional restrictions
- Corpus-based resource creation & disambiguation
– Faster development for new translation directions
– Exploiting similarities between closely-related languages (nl→de; pt,es→fr; uk→ru)
– Alignment of richly-annotated, morphologically and syntactically disambiguated corpora
- Under-specified representations: morph., synt., sem.
Lingenio’s RBMT lexicon
Ukrainian news corpus
- Low-resource scenario: ~250 million words, not balanced
- News texts collected via targeted crawling
– Part-of-speech annotation via transfer learning (Babych & Sharoff, 2016)
– Coverage of tag-emission & lemmatization lexicon: ~15k words (~91% on news texts)
– Accuracy: 93% on known & 72% on unknown words
- Available on: http://corpus.leeds.ac.uk/internet.html
- Tasks for unsupervised learning:
– T1: Discovery of Construction Grammar representations
– T2: Induction of wide-coverage morphological lexicon
T1. Discovery of Construction Grammar representations in a Ukrainian corpus
- Construction Grammar framework (Kay & Fillmore, 1999; Fillmore, 2002)
– Lexicalized morphosyntactic representations:
- specify syntactic relations, valencies and semantics for associated linguistic structures (cf. Fillmore, 2013: 112)
- include different levels: morphosyntactic, lexical, phraseological
- have underspecified slots for lexical or grammatical valencies that are lexically or morphologically restricted
- Examples: What’s X doing Y; to look forward to X
– Single-stage induction of a morphosyntactic lexicon
- Syntax is lexicalized = the lexicon carries morphosyntactic annotation
– Unified framework for Single- and Multiword Expressions (MWEs)
- Words are not elementary units; MWEs have structure
- Explaining valencies & syntactic variation (cf. lexicalized TAG)
(to) look forward to V-ing
- Representations of lexicalized structures (CWB format)
- Modeling variation:
– I look forward to receiving President Tadic
– He looked forward to arguing the case in court
– I’m looking forward to being able to see his talk online
- (an overlap with “(to) be able to X” construction)
– Hawking looks forward to knowing (metaphorically, of course) the "mind of God”
TAG representations: syntactic variation (initial & auxiliary trees)
Unsupervised discovery of lexicalized constructions
- Methodology ~ discovering multiword expressions in PoS-annotated corpora (Justeson & Katz, 1995; Babych & Hartley, 2009)
– Collecting & sorting lexical N-grams and skip-grams
– Filtering: frequency & lexical salience
- Frequency threshold (>4); association measures (log-likelihood, Mutual Information…)
- PoS configurations: positive vs. negative filters, statistical tf.idf filters
– *user interface of; *with user interface
- Generalizing methods for multilevel annotation
– [word, lemma, PoS, sub-classes, syntactic dependencies…]
– Computationally intensive: deals with “longer” N-grams
– Recurring feature patterns across annotation levels
– Underspecified representations: partially filled positions
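The collection-and-filtering step can be illustrated with a minimal Python sketch. The helper `pmi_bigrams` is hypothetical and the corpus is a toy; it applies only the frequency threshold and Mutual Information, one of the association measures named above (the talk also uses log-likelihood and PoS filters).

```python
# Sketch of MWE-candidate filtering: count bigrams over a token stream,
# drop candidates below the frequency threshold, and score the rest
# with pointwise mutual information (PMI).
import math
from collections import Counter

def pmi_bigrams(tokens, min_freq=5):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = {}
    for (w1, w2), f in bigrams.items():
        if f < min_freq:
            continue  # frequency threshold (>4 in the talk)
        p_bigram = f / (n - 1)
        p_indep = (unigrams[w1] / n) * (unigrams[w2] / n)
        scored[(w1, w2)] = math.log2(p_bigram / p_indep)
    return scored

tokens = "a lot of time a lot of money a lot of work a waste of time".split()
print(pmi_bigrams(tokens, min_freq=3))  # only "a lot" and "lot of" survive
```

High-PMI, high-frequency bigrams such as *lot of* are exactly the kind of candidates that feed the later PoS-pattern filters.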
Underspecified N-grams: construction candidates & selected lexical classes
Fully lexicalized constructions [NN IN NN] (frequency + instance):
2393 point of view; 2104 sort of thing; 1272 cup of tea; 1014 way of life; 865 period of time;
841 lot of money; 710 value for money; 692 kind of thing; 595 quality of life; 566 piece of paper;
551 sense of humour; 524 length of time; 521 division of labour; 519 side by side; 518 lot of time;
513 rate of interest; 510 amount of money; 477 cup of coffee; 454 waste of time; 449 member of staff;
437 amount of time; 424 time of year; 419 rate of inflation; 417 course of action; 405 head of state;
384 matter of fact; 342 lot of work; 336 person per night; 318 sheet of paper; 301 work of art;
296 rule of law; 294 state of emergency; 286 balance of power; 281 breach of contract; 277 sum of money;
277 state of mind; 277 rate of return; 269 hand in hand; 262 duty of care; 255 time of day;
255 secretary of state; 250 source of information; 250 rate of growth; 250 friend of mine; 247 cause of death;
242 sort of person
Fully lexicalized constructions [JJ NN NN] (frequency + instance):
308 other way round; 284 inflammatory bowel disease; 226 second world war; 178 high blood pressure;
170 criminal justice system; 166 national health service; 159 right hand side; 156 social security system;
147 graphic user interface; 142 local education authority; 140 left hand side; 134 nuclear power station;
132 coronary heart disease; 129 real wage rate; 122 fourth quarter net; 122 first time I;
116 primary health care; 106 intensive care unit; 104 first world war; 101 local planning authority;
100 third quarter net; 100 social history discipline; 97 substantive research contract; 92 local government finance;
83 wrong way round; 80 second quarter net; 72 public sector borrowing; 71 local authority control;
70 irritable bowel syndrome; 69 hot water cylinder; 66 local income tax; 66 foreign exchange market;
63 local authority housing; 62 net asset value; 62 alcoholic liver disease; 60 primary sclerosing cholangitis;
59 duodenal ulcer disease; 58 retail price index; 58 first time round; 57 infant death syndrome;
57 aggregate demand curve; 56 regional health authority; 55 environmental impact assessment; 54 new world order;
53 social work practice; 53 public service broadcasting
Discovery of Ukrainian constructions
Construction Grammar: underspecified MWE lexicon with syntactic relations
- Automated discovery of ‘diagnostic contexts’
– Characterize lexical classes via configurations of formal features, e.g.:
- “NN IN _” → abstract nouns with specific valency patterns: sort kind lot type amount sense form lack level piece point degree rate use bit number period source time loss matter range state process increase deal quality cup change way cost system area need question development course method nature concept
- Ukrainian nouns in the same diagnostic context: perspective project change event record clash participation trading decision struggle committee candidate gas action head comparison relation situation election course meeting
- Constructions are found via distributional analysis
– Lexicogrammatical classes identified automatically
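A minimal sketch of how such diagnostic contexts can be harvested from a PoS-tagged corpus. The helper `slot_fillers` and the tiny tagged corpus are illustrative assumptions, not the project's actual code: a context such as “NN IN _” collects the words that fill the open slot, and words sharing many contexts fall into the same lexicogrammatical class.

```python
# Sketch: harvest fillers of a diagnostic context from a PoS-tagged
# token stream. For every trigram whose first two tags match the
# pattern (e.g. NN IN), record the word in the third (slot) position.
from collections import defaultdict

def slot_fillers(tagged, pattern=("NN", "IN")):
    """Map each matching (word1, word2) context to the set of slot fillers."""
    contexts = defaultdict(set)
    for (w1, t1), (w2, t2), (w3, _t3) in zip(tagged, tagged[1:], tagged[2:]):
        if (t1, t2) == pattern:
            contexts[(w1, w2)].add(w3)
    return contexts

tagged = [("sort", "NN"), ("of", "IN"), ("thing", "NN"),
          ("and", "CC"), ("cup", "NN"), ("of", "IN"), ("tea", "NN")]
print(dict(slot_fillers(tagged)))
```

Run over a large corpus, the inverse mapping (slot filler → set of contexts) yields the distributional word classes shown on the slide.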
T2: Induction of wide-coverage morphological lexicon for Ukrainian
- Existing resources
– Standardized tagset for the Slavonic family (MULTEXT)
– Morphological lexicon: ≈15k frequent lemmas = 200k inflected forms from a non-disambiguating tagger
- No manually disambiguated corpus
- Automatically derived morphological disambiguation
– Babych and Sharoff (2016). Rapid induction of morphological disambiguation resources from a closely related language. HyTra-2016 workshop at EAMT
Existing Ukrainian morphological lexicon (Kotsyba et al., 2009)
+ Ukrainian 250 MW news corpus
Why needed & important:
- For under-resourced languages: wide-coverage systems
- For well-resourced languages: keeping up with lexicon change (new words and expressions)
Transfer learning via a closely-related, better-resourced language (Babych, B., Sharoff, S. 2016).
Existing approaches:
Ahlberg, M., Forsberg, M., & Hulden, M. (2015): Paradigm classification in supervised learning of morphology; (2014): Semi-supervised learning of morphological paradigms and lexicons
Problems with existing approaches: under-resourced languages
- Need annotated data sets for supervised training
– May not be available from the start
– No guarantee of coverage / representativeness
- Lexicon induction: token in corpus → inflection
– ‘Trusting the corpus’ too much (noise problems)
– Needs ‘additional information’: forms from the corpus need to be (a) of known PoS; (b) lemmatized (!!!)
- They mix up the terms ‘inflection table’ and ‘paradigm’; correct use:
– Paradigm = system of word forms
– Inflection table = table of inflections for several paradigms
Problems of complexity and feasibility for under-resourced languages
- Order of complexity: typically training on ~2000–7000 paradigms → 100–300 tables
– Developed by a linguist/informant or derived from a morphologically analyzed corpus
- All possible paradigms assigned to each word
– If lexicon induction is done from lemmas (base forms), an oracle vocabulary is needed
– Unlabeled corpora used only to weight paradigms
- Hard to re-implement in a realistic scenario for under-resourced languages
- Engineering value vs. model for language acquisition
Problems with ambiguity
- Forces a single paradigm per lemma
– Paradigm can depend on meaning, e.g. abstract/concrete (uk):
– ‘block’.N.gen.sing: блока (concrete) vs. блоку (abstract ‘mil. block’)
– ‘order’.N.gen.sing: ордена (concrete ‘award’) vs. ордену (abstract ‘group’)
- A labeled form = evidence for a single paradigm
– If paradigms are collected from a corpus, PoS-tagged tokens are already disambiguated and cannot be assigned to multiple paradigms
– No guarantee that possible rare forms will be correctly tagged or present: e.g. V.imperative is rare in written corpora; a subject-specific spoken corpus is needed
- вибори /vybory/ = N.masc.{nom|acc}.plur (‘choices’/‘elections’) & V.imper.2pers.sing (‘win’/‘fight for’)
– Combination of lexical and morphological ambiguity: ‘pryklady’
- Приклади = N.masc.{nom|acc|voc}.plur – ‘examples’
- Приклади = N.masc.{nom|acc|voc}.plur – ‘holders’
- Приклади = V.imper.2pers.sing – ‘attach’
Problem with inflection table coverage:
- Stem alternations are determined phonologically; changes happen across paradigms and are not always regular:
– /o, e/ → /i/ in a “newly-closed syllable”, ~13th century AD:
- Noun: столъ /stolô/ → стіл /stil/; стола /stola/ [no change]
- Verb: неслъ /neslô/ → несъ /nesô/ → ніс /nis/; нести /nesty/ [no change]
– Does not happen in new words since the 15th century (порт ‘port’)
- Hard to predict the number of paradigms needed to cover existing inflection tables
– Related languages require different numbers of paradigms
– Can be large due to the interaction of purely phonetic & morphonological shifts
Unsupervised lexicon induction for given paradigms
- Inflection tables based on comprehensive grammar descriptions
– Guaranteed to cover inflection types, but not stem alternations
- ‘Latent’ paradigm induction
– Getting more evidence to filter out noise (thresholds)
- Filling in all paradigm types
– Possibility to justify several paradigms with the same token (‘pryklady’ → V and N paradigms)
Idea for our approach
- Hand-coded ‘textbook’ inflection tables + features
- For each word type in a corpus (frequency list):
– Each inflection in each table is tested for whether it can split the word into {stem + inflection}
– If yes, the corresponding inflection table generates ‘expectations’: hypothetical word forms consistent with the separated stem + all other inflections in the table
– If for a table N > threshold expectations exist in the corpus (or phonologically close forms, measured by ‘graphonological Levenshtein edit distance’: Babych, 2016), then the paradigm for the stem based on that inflection table is confirmed & the lemma is added to the lexicon
– For any word, several paradigms can be confirmed
- Paradigms are ‘latent’: confirmed by observations indirectly
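The confirmation loop can be sketched as follows. This is a minimal, exact-match version under stated assumptions: the function name `confirm_paradigms`, the toy inflection table and the threshold are all illustrative, and the graphonological-distance relaxation for phonologically close forms is omitted.

```python
# Minimal sketch of latent paradigm confirmation. An inflection table is
# a list of endings; for each corpus word type we try every ending split,
# generate the "expected" sister forms from the same table, and confirm
# the paradigm when enough expectations are attested in the corpus.

def confirm_paradigms(word, corpus_types, tables, threshold=2):
    confirmed = []
    for name, endings in tables.items():
        for ending in endings:
            if not word.endswith(ending):
                continue
            stem = word[: len(word) - len(ending)]
            expected = {stem + e for e in endings} - {word}
            attested = expected & corpus_types
            if len(attested) >= threshold:
                confirmed.append((name, stem))
                break  # one confirming split per table is enough
    return confirmed

# Toy table for -а feminine nouns (cf. фабрика ‘factory’ on the next slide)
tables = {"noun_fem_a": ["а", "и", "і", "у", "ою"]}
corpus = {"рука", "руки", "руку", "рукою"}
print(confirm_paradigms("рука", corpus, tables))  # paradigm confirmed for stem рук-
```

Note that the expectation рук+і is not attested (the real form is руці, with the к→ц alternation); this is exactly the gap the graphonological edit distance below is meant to bridge.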
Example: inflection tables
input: рука (‘ruka’, hand); inflection table for фабрика (‘fabryka’, factory), etc.
Example: inflection tables
рук|а (‘ruka’, hand)
Paradigm confirmed
Models for morphonological distortion
- Graphonological Levenshtein edit distance (Babych,
2016, EAMT)
– рука (ruka) → руці (rutsi)
– к [k] = cons|backtongue|velar|plosive|unvoiced
– ц [ts] = cons|fronttongue|dental|plosive|unvoiced
V. Levenshtein (1935–)
+
Roman Jakobson’s phonological features
$$
\mathrm{lev}_{a,b}(i,j)=
\begin{cases}
\max(i,j) & \text{if } \min(i,j)=0\\[4pt]
\min
\begin{cases}
\mathrm{lev}_{a,b}(i-1,j)+1 & \text{(delete)}\\
\mathrm{lev}_{a,b}(i,j-1)+1 & \text{(insert)}\\
\mathrm{lev}_{a,b}(i-1,j-1)+\mathbb{1}_{(a_i\neq b_j)} & \text{(substitute)}
\end{cases} & \text{otherwise}
\end{cases}
$$
where $\mathbb{1}_{(a_i\neq b_j)} = f(a_i,b_j)$, with $f(a_i,b_j)=0$ if $a_i=b_j$ and $f(a_i,b_j)=1$ otherwise.
feature hierarchy (to be trained on known paradigms)
+
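The distance can be sketched in Python: standard Levenshtein dynamic programming, but with a substitution cost equal to the share of phonological features on which two letters disagree. The feature vectors below (only к and ц) and the helper names are illustrative assumptions; the real model uses a feature hierarchy trained on known paradigms.

```python
# Sketch of graphonological Levenshtein distance: the substitution cost
# between two letters is the fraction of phonological features they
# disagree on, so к [k] vs ц [ts] (both unvoiced plosives) costs less
# than a full substitution of unrelated letters.

FEATURES = {
    "к": {"cons", "backtongue", "velar", "plosive", "unvoiced"},
    "ц": {"cons", "fronttongue", "dental", "plosive", "unvoiced"},
}

def sub_cost(a, b):
    if a == b:
        return 0.0
    fa, fb = FEATURES.get(a), FEATURES.get(b)
    if fa is None or fb is None:
        return 1.0  # no feature info: fall back to plain Levenshtein cost
    return len(fa ^ fb) / len(fa | fb)  # share of disagreeing features

def graphon_lev(a, b):
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = float(i)
    for j in range(len(b) + 1):
        d[0][j] = float(j)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,              # delete
                          d[i][j - 1] + 1,              # insert
                          d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return d[len(a)][len(b)]

print(graphon_lev("рука", "руці"))  # ≈ 1.57, vs 2.0 for plain Levenshtein
```

Because рука/руці score below the plain edit distance, the alternating form can still count as evidence for the same latent paradigm.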
Advantages of unsupervised paradigm induction
- Does not require labeled data (PoS + lemma)
– Smaller but more qualified input from language specialists (paradigm types, patterns of historical changes)
- Robust against noise in the corpus
– Because of indirect (latent) induction, ambiguity is allowed
– A word = possible evidence for several paradigms
- Can be used to update the lexicon from a new corpus
– Achieving wide coverage: 15k → >200k; domain-specific words, proper names (may change quickly)
- More realistic for:
– morphological development for under-resourced languages
– models of morphological acquisition, learning morphonological variation from non-annotated corpora
Open questions
- Size of the corpus
– What is the relation between corpus size & threshold values?
- Learning graphonological distortion models
– Weights for substitution of phonological features need to be extended with insertion & deletion models
– Learning distortion models from corpora
Conclusions & Future work
- Methodology for unsupervised discovery of Construction Grammar representations
– Syntactically annotated, underspecified Multiword Expressions
– Relies on a wide-coverage morphologically annotated lexicon
- Lexicon induction by confronting hand-coded ‘textbook’ inflection tables with a large corpus
– Confirmed paradigms from corpus frequency lists
- Future work: lexicon & construction grammar