NLU lecture 6: Compositional character representations
Adam Lopez alopez@inf.ed.ac.uk Credits: Clara Vania 2 Feb 2018
Let's revisit an assumption in language modeling (& word2vec).
When does this assumption make sense for language modeling?
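The assumption in question: every word is drawn from a fixed, finite vocabulary V, and the model normalizes its scores over all of V. A minimal sketch of that closed-vocabulary softmax (the toy vocabulary and logits are hypothetical):

```python
import math

def softmax(scores):
    """Turn one score per vocabulary entry into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# A closed-vocabulary LM scores exactly |V| outcomes; any word outside V
# (new coinages, rare names, numbers) must be mapped to <unk>.
vocab = ["the", "cat", "sat", "<unk>"]
logits = [2.0, 1.0, 0.5, 0.0]        # hypothetical scores for some history
probs = softmax(logits)
assert abs(sum(probs) - 1.0) < 1e-9  # normalizes over the whole vocabulary
```

The cost of this normalization grows with |V|, which motivates every trick on the next few slides.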
"…merged into a single symbol, reducing the vocabulary size to |V| = 16,383."
"…frequent words in each language to train our …"
"…mapped to a special token ([UNK])."
Ref | the main crop of japan is rice .
Hyp | the _UNK is popular of _UNK . _EOS
Class-based LM
Group words into classes and predict hierarchically (hierarchical softmax). Brown clustering: hard clustering based on mutual information. Allocate more parameters to frequent words, fewer to less frequent words.
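The class-based idea can be sketched as a two-step factorization, p(w | h) = p(c | h) · p(w | c, h), so each step normalizes over far fewer outcomes than |V|. The toy classes and probabilities below are hypothetical:

```python
# Two-step prediction: first the word's class, then the word within it.
p_class = {"noun": 0.6, "verb": 0.4}     # p(c | h), hypothetical
p_word_in_class = {                      # p(w | c, h), hypothetical
    "noun": {"tea": 0.7, "sushi": 0.3},
    "verb": {"is": 0.9, "have": 0.1},
}

def p_word(word, cls):
    """p(w | h) under the class-based factorization."""
    return p_class[cls] * p_word_in_class[cls][word]

# The factorization still defines a proper distribution over all words:
total = sum(p_word(w, c) for c in p_class for w in p_word_in_class[c])
assert abs(total - 1.0) < 1e-9
```

With |V| words in C balanced classes, each prediction step scores roughly C or |V|/C outcomes instead of |V|.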
Source: Strategies for training large vocabulary language models. Chen, Auli, and Grangier, 2015
Noise contrastive estimation
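NCE replaces the expensive softmax with a binary classification problem: is this word drawn from the data or from a noise distribution? A minimal sketch of the per-example loss, with hypothetical scores and noise probabilities (not taken from the cited paper):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def nce_loss(s_data, logp_noise_data, noise, k):
    """NCE loss for one observed word and k noise samples.
    Scores s are unnormalized model outputs, so no softmax over V is needed:
    P(data | w) = sigmoid(s(w) - log(k * p_noise(w)))."""
    loss = -math.log(sigmoid(s_data - math.log(k) - logp_noise_data))
    for s_noise, logp_noise in noise:
        loss -= math.log(1.0 - sigmoid(s_noise - math.log(k) - logp_noise))
    return loss

# Raising the model's score for the observed word lowers the loss:
noise = [(0.0, math.log(0.2))]  # (score, log p_noise) for one noise sample
assert nce_loss(3.0, math.log(0.1), noise, k=1) < nce_loss(2.0, math.log(0.1), noise, k=1)
```

The loss touches only the observed word and k sampled words, so its cost is independent of |V|.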
Skip normalization step altogether
Room for improvement
All of these methods still assume a fixed vocabulary size — but the vocabulary of a language is unbounded.
Proof 1: productive morphology, loanwords, "fleek".
Proof 2: 1, 2, 3, 4, …
Characters. More precisely, unicode code points. Are you sure? 🤸 Not all characters are the same, because not all languages have alphabets. Some have syllabaries (e.g. Japanese kana) and/ or logographies (Chinese hànzì).
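Why "are you sure?": even before syllabaries and logographies, "character" is ambiguous — bytes, code points, and user-perceived symbols can all disagree. A small Python illustration:

```python
import unicodedata

# The same user-perceived word as two different code-point sequences:
cafe_composed = "caf\u00e9"    # é as one precomposed code point
cafe_combining = "cafe\u0301"  # e followed by a combining acute accent
assert cafe_composed != cafe_combining
assert len(cafe_composed) == 4 and len(cafe_combining) == 5  # code points
assert len(cafe_composed.encode("utf-8")) == 5               # bytes differ again
assert unicodedata.normalize("NFC", cafe_combining) == cafe_composed

# An emoji is one code point here, but four UTF-8 bytes:
assert len("\U0001F938") == 1
assert len("\U0001F938".encode("utf-8")) == 4
```

So a "character-level" model's inputs depend on the encoding and normalization choices made before training.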
Source: Finding function in form: compositional character models for open vocabulary word representation. Ling et al., 2015
Source: Character-aware neural language models, Kim et al. 2015
anterest artifactive capacited capitaling compensive dermitories despertator dividement extremilated faxemary follect hamburgo identimity ipoteca nightmale
patholicism pinguenas sammitment tasteman understrumental wisholver
Wow, the disconversated vocabulations of their system are fantastics! —Sharon Goldwater
Implied(?): character-level neural models learn everything they need to know about language.
Morpheme: the smallest meaningful unit of language.
Example: "loves" = love + s (root/stem: love, affix: -s)
Analytic languages: few morphemes per word.
Synthetic languages: many morphemes per word.
A spectrum: English (analytic) → Turkish → West Greenlandic (highly synthetic).
"love" (VB)
Inflectional morphology: love (VB), loves (VB), loving (VB), loved (VB)
Derivational morphology: lover (NN), lovely (ADJ), lovable (ADJ)
Fusional languages: many features per morpheme.
(English) read-s | read-3SG.SG | 'reads'
Agglutinative languages: one feature per morpheme.
(Turkish) read-AOR.COND.1SG | 'If I read …'
Affixation: (English) studying = study + ing
Compounding: many stems per word. (German) Rettungshubschraubernotlandeplatz = Rettung + s + hubschrauber + not + lande + platz (rescue + LNK + helicopter + emergency + landing + place), 'Rescue helicopter emergency landing pad'
Base modification: (English) drink, drank, drunk
Root & pattern: (Arabic) k(a)t(a)b(a) | write-PST.3SG.M | 'he wrote'
Reduplication: (Indonesian) kemerah~merahan | red-ADJ | 'reddish'
al., 2016, Lee et al., 2016)
Wieting et al., 2016, Bojanowski et al., 2016)
& Blunsom, 2014, Sennrich et al., 2016)
Schütze, 2015, Kann & Schütze, 2016)
Addition, Bidirectional LSTMs, Convolutional NN, …
Basic units of representation → compositional function → word vector [0.01, …, 0.3, …, 0.12, …, 0.05]
Research questions:
1. How do representations based on morphemes compare with those based on characters?
2. What is the best compositional function for subword representations?
3. Do character-level models have the same predictive utility as models with knowledge of morphology?
4. How do these results vary across languages of different morphological typologies?
What we vary: open-vocabulary history with closed-vocabulary prediction. Open-vocabulary prediction is interesting, but our goal is to understand representations, not build a better neural LM.
Unit         | Examples ("wants")
Morfessor    | ^want, s$
BPE          | ^w, ants$
char-trigram | ^wa, wan, ant, nts, ts$
character    | ^, w, a, n, t, s, $
analysis     | want+VB, +3rd, +SG, +Pres
The first four rows are approximations to morphology; the analysis row is annotated morphology.
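The character-trigram segmentation shown in the table can be reproduced in a few lines, using the same boundary markers ^ and $:

```python
def char_trigrams(word):
    """All character trigrams of a word, including boundary markers."""
    marked = "^" + word + "$"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

assert char_trigrams("wants") == ["^wa", "wan", "ant", "nts", "ts$"]
```

BPE and Morfessor segmentations, by contrast, are learned from corpus statistics rather than computed by a fixed rule.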
The last row is part of an oracle experiment: suppose you had an oracle that could tell you the true morphology. In this case, the oracle is a human annotator.
Fusional (English): read-s | read-3SG.SG | 'reads'
Agglutinative (Turkish): read-AOR.COND.1SG | 'If I read …'
Root & Pattern (Arabic): k(a)t(a)b(a) | write-PST.3SG.M | 'he wrote'
Reduplication (Indonesian): anak~anak | child-PL | 'children'
Perplexity results (lower is better); add / bi-LSTM / CNN name the compositional function, %imp is the best subword model's improvement over the word baseline.

Language   | word  | character       | char-trigrams   | BPE             | Morfessor       | %imp
           |       | bi-LSTM | CNN   | add    | bi-LSTM| add    | bi-LSTM| add    | bi-LSTM|
Czech      | 41.46 | 34.25   | 36.60 | 42.73  | 33.59  | 49.96  | 33.74  | 47.74  | 36.87  | 18.98
English    | 46.40 | 43.53   | 44.67 | 45.41  | 42.97  | 47.51  | 43.30  | 49.72  | 49.72  |  7.39
Russian    | 34.93 | 28.44   | 29.47 | 35.15  | 27.72  | 40.10  | 28.52  | 39.60  | 31.31  | 20.64
Finnish    | 24.21 | 20.05   | 20.29 | 24.89  | 18.62  | 26.77  | 19.08  | 27.79  | 22.45  | 23.09
Japanese   | 98.14 | 98.14   | 91.63 | 101.99 | 101.09 | 126.53 | 96.80  | 111.97 | 99.23  |  6.63
Turkish    | 66.97 | 54.46   | 55.07 | 50.07  | 54.23  | 59.49  | 57.32  | 62.20  | 62.70  | 25.24
Arabic     | 48.20 | 42.02   | 43.17 | 50.85  | 39.87  | 50.85  | 42.79  | 52.88  | 45.46  | 17.28
Hebrew     | 38.23 | 31.63   | 33.19 | 39.67  | 30.40  | 44.15  | 32.91  | 44.94  | 34.28  | 20.48
Indonesian | 46.07 | 45.47   | 46.60 | 58.51  | 45.96  | 59.17  | 43.37  | 59.33  | 44.86  |  5.86
Malay      | 54.67 | 53.01   | 50.56 | 68.51  | 50.74  | 68.99  | 51.21  | 68.20  | 52.50  |  7.52
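The %imp column appears to be the relative perplexity reduction of the best subword configuration over the word-level baseline; a quick check against three rows of the table:

```python
def pct_improvement(word_ppl, best_subword_ppl):
    """Relative perplexity reduction over the word-level baseline, in %."""
    return round((word_ppl - best_subword_ppl) / word_ppl * 100, 2)

assert pct_improvement(41.46, 33.59) == 18.98  # Czech, char-trigram bi-LSTM
assert pct_improvement(24.21, 18.62) == 23.09  # Finnish, char-trigram bi-LSTM
assert pct_improvement(66.97, 50.07) == 25.24  # Turkish, char-trigram add
```

Note the pattern: the biggest gains land on morphologically rich languages (Czech, Russian, Finnish, Turkish).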
Still lots of work to do on unsupervised morphology…
No morphology: (^, r, e, a, d, s, $) vs. actual morphology: (read, VB, 3rd, SG, Present)
[Chart: perplexity with annotated morphology vs. character models.
Czech: character 34.3, char-trigram 33.6, annotated morphology 30.1.
Russian: character 28.4, char-trigram 27.7, annotated morphology 26.4.]
[Chart: perplexity vs. training data size (1M, 5M, 10M tokens) for word, char-trigram, and char-CNN models.]
Targeted perplexity: computed over only a subset of words in the test data, where the words of interest are in the most recent history: nouns and verbs.
Green tea or white tea ? The sushi is great , and they have a great selection .
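Targeted perplexity is ordinary perplexity restricted to chosen token positions; a sketch with hypothetical per-token log-probabilities:

```python
import math

def targeted_perplexity(logprobs, target_positions):
    """exp of the average negative log-probability, computed only over
    the selected positions (e.g. the nouns and verbs)."""
    sel = [logprobs[i] for i in target_positions]
    return math.exp(-sum(sel) / len(sel))

# Hypothetical natural-log probabilities for "the sushi is great":
lps = [-0.5, -4.0, -0.7, -3.0]
ppl_all = targeted_perplexity(lps, range(4))
ppl_targeted = targeted_perplexity(lps, [1, 3])  # just "sushi" and "great"
```

Restricting to content words typically raises the number, since function words are the easiest to predict.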
[Chart: targeted perplexity on all / frequent / rare words for word, character, char-trigram, and BPE models.]
[Chart: targeted perplexity on all / frequent / rare words for word, character, char-trigram, and BPE models, on a second language with higher perplexities overall.]
Percentage of full reduplication in the training data:
Language   | type-level (%) | token-level (%)
Indonesian | 1.1            | 2.6
Malay      | 1.3            | 2.9
[Chart: targeted perplexity on all / frequent / rare words. word: 156.8 / 108.9 / 117.2; characters: 137.4 / 91.4 / 99.2; BPE: 157.0 / 91.7 / 101.7.]
Nearest neighbours of frequent (man, including), rare (unconditional, hydroplane), and OOV (uploading, foodism) query words:

word
  man: person, featuring, include | including: like, fairly, joints | unconditional: nazi, your, imperial | hydroplane: molybdenum | uploading, foodism: (OOV, no entry)
bi-LSTM
  man: ii, hill, text | including: called, involve, like | unconditional: unintentional, ungenerous, unanimous | hydroplane: emphasize, heartbeat, hybridized | uploading: upbeat, uprising, handling | foodism: vigilantism, pyrethrum, pausanias
char-trigram bi-LSTM
  man: mak, vill, cow | including: include, includes, undermining | unconditional: unconstitutional, constitutional, unimolecular | hydroplane: selenocysteine, guerrillas, scrofula | uploading: drifted, affected, conflicted | foodism: tuaregs, quft, subjectivism
char bi-LSTM
  man: mayr, many, may | including: inclusion, insularity, include | unconditional: relates, unmyelinated, uncoordinated | hydroplane: hydrolyzed, hydraulics, hysterotomy | uploading: musagte, mutualism, mutualist | foodism: formulas, formally, fecal
char CNN
  man: mtn, mann, nun | including: include, includes, excluding | unconditional: unconventional, unintentional, unconstitutional | hydroplane: hydroxyproline, hydrate, hydrangea | uploading: unloading, loading, upgrading | foodism: fordham, dadaism, popism

Good at frequent words. Maybe they learn "word classes"?
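Tables like this come from a cosine nearest-neighbour query over the learned word vectors; a toy sketch (the embeddings are hypothetical):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest(query_vec, emb, k=3):
    """The k words whose vectors are most cosine-similar to the query."""
    return sorted(emb, key=lambda w: -cosine(emb[w], query_vec))[:k]

# Hypothetical 2-d embeddings; character models place orthographically
# similar words close together, so they dominate the neighbour list:
emb = {"include": [1.0, 0.1], "includes": [0.9, 0.2], "banana": [-0.5, 1.0]}
neighbours = nearest([1.0, 0.0], emb, k=2)
```

For character models the query word itself need not be in the vocabulary: its vector is composed on the fly from its characters.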
Source: Yova Kementchedjhieva, Morpho-syntactic awareness in a character-level language model 2017 Informatics M.Sc. thesis
Character-level models outperform word-level models, especially for agglutinative morphology.
But none of them match the predictive utility of a model with explicit knowledge of morphology (or POS).
Character-level models appear to capture orthographic subpatterns: memorization and local generalization.
Morphology is latent in the data, but it does not come for free from end-to-end learning, and these models could be better if they knew about morphology.