Mimicking Word Embeddings using Subword RNNs
Yuval Pinter, Robert Guthrie, Jacob Eisenstein
@yuvalpi
Presented at EMNLP September 2017, Copenhagen
The Word Embedding Pipeline

Unlabeled corpus (Wikipedia, GigaWord, Reddit, ...)
→ Embedding model (vectors): W2V, GloVe, Polyglot, FastText, ...
Supervised task corpus (Penn TreeBank, SemEval, OntoNotes, ...)
+ embeddings → Task model: Tagging, Parsing, Sentiment, NER, ...
Coverage (nested sets): All possible text ⊃ Unlabeled text ⊃ Supervised task text
Suggested: a subword OOV model
Sentences with OOV words:
○ Chalabi has increasingly marginalized within Iraq, ...
○ Important species (...) include shrimp, (...) and some varieties of flatfish.
○ This term was first used in German (Hochrenaissance), …
○ Without George Martin the Beatles would have been just another untalented band as Oasis.
○ What if Google morphed into GoogleOS?
○ We’ll have four bands, and Big D is cookin’. lots of fun and great prizes.
○ I dislike this urban society and I want to leave this whole enviroment.
○ ???
The OOV word appears in the supervised task corpus but not in the unlabeled corpus: ???
Existing approach: OOV → UNK
○ Average existing embeddings
○ Trained with embeddings (stochastic unking)
○ Bhatia et al. (2016), Wieting et al. (2016)
○ What if we don’t have access to the original corpus? (e.g. FastText)
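Both UNK strategies (averaging the embeddings of rare words, and stochastically replacing rare training tokens with UNK) can be sketched as follows; the function names and thresholds are illustrative, not the talk's code:

```python
import random

def build_unk_vector(embeddings, counts, rare_threshold=2):
    """Average the embeddings of rare words to form a generic UNK vector.
    embeddings: dict word -> list[float]; counts: dict word -> int."""
    rare = [w for w in embeddings if counts.get(w, 0) <= rare_threshold]
    dim = len(next(iter(embeddings.values())))
    unk = [0.0] * dim
    for w in rare:
        for i, x in enumerate(embeddings[w]):
            unk[i] += x
    return [x / len(rare) for x in unk] if rare else unk

def stochastic_unk(tokens, counts, p=0.25, threshold=2, rng=random):
    """Stochastic unking: during training, replace rare tokens with <UNK>
    with probability p, so the model learns a useful UNK embedding."""
    return ["<UNK>" if counts.get(t, 0) <= threshold and rng.random() < p else t
            for t in tokens]
```

Either way, a single UNK vector stands in for every unseen word, regardless of its spelling.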
Character-level models within the task model:
○ Ling et al. (2015), Plank et al. (2016)
○ Efficient transfer to test-set OOVs
MIMICK: train a subword model to reproduce the pre-trained vectors
Inputs:
○ Vector dictionary
○ Orthography (the way words are spelled)
Properties:
○ No need to access original unlabeled corpus
○ Many training examples
○ (No context)
○ Very extensible
○ (Character inventory changes?)
MIMICK training, for in-vocabulary word “make”:
Character embeddings (m, a, k, e)
→ Forward LSTM + Backward LSTM
→ Multilayered Perceptron
→ Mimicked embedding
→ Loss (L2) against the pre-trained embedding of “make” (Polyglot/FastText/etc.)
At test time, the same network produces an embedding for an OOV word like “blah”: character embeddings (b, l, a, h) → Forward/Backward LSTM → Multilayered Perceptron → Mimicked embedding.
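A minimal sketch of this character-BiRNN mimicking setup; plain tanh RNNs stand in for the LSTMs, and all dimensions, parameter names, and the random initialization are illustrative, not the talk's DyNet implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
CHARS = "abcdefghijklmnopqrstuvwxyz"
C, H, D = len(CHARS), 16, 8          # char inventory, hidden size, word-emb dim

# parameters: char embeddings, forward/backward RNNs, MLP head
E = rng.normal(0, 0.1, (C, 8))
Wf, Uf = rng.normal(0, 0.1, (H, 8)), rng.normal(0, 0.1, (H, H))
Wb, Ub = rng.normal(0, 0.1, (H, 8)), rng.normal(0, 0.1, (H, H))
W1, W2 = rng.normal(0, 0.1, (H, 2 * H)), rng.normal(0, 0.1, (D, H))

def run_rnn(W, U, xs):
    """Simple tanh RNN over a character sequence; returns the final state."""
    h = np.zeros(H)
    for x in xs:
        h = np.tanh(W @ x + U @ h)
    return h

def mimick(word):
    """Spell the word, read it forward and backward, project to word-emb space."""
    xs = [E[CHARS.index(c)] for c in word]
    hf = run_rnn(Wf, Uf, xs)          # forward over characters
    hb = run_rnn(Wb, Ub, xs[::-1])    # backward over characters
    return W2 @ np.tanh(W1 @ np.concatenate([hf, hb]))

def l2_loss(word, target):
    """Squared L2 distance to the pre-trained target vector."""
    d = mimick(word) - target
    return float(d @ d)
```

Training would loop over the vector dictionary, minimizing `l2_loss(word, pretrained[word])`; at test time `mimick("blah")` yields a vector for any spelled-out OOV.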
Nearest neighbors of mimicked OOV embeddings:
○ MCT → AWS, OTA, APT, PDM
○ pesky → euphoric, disagreeable, horrid, ghastly
○ lawnmower → tradesman, bookmaker, postman, hairdresser
Hebrew:
○ “(she/you-3p.sg.) will come true” → “(she/you-3p.sg.) will solve”
○ “geometric” (m.pl., nontraditional spelling) → גיאומטריים, “geometric” (m.pl.)
○ Eustrach → ריצ'רדסון (Richardson)
✔ Syntactic properties ✘ Semantics
Derivations:
○ Nearest: programmatic, transformational, mechanistic, transactional, contextual. NN FUN!!!
Morphosyntactic attributes:
○ Eng: his stated goals
  Tense=Past|VerbForm=Part
○ Cze: osoby v pokročilém věku (“people of advanced age”)
  Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing

Tagging model: word embeddings for “the cat is sitting” feed a Forward LSTM and a Backward LSTM, predicting per-token tags (DT, NN, VBZ, VBG) and attributes (POS, Number, Tense, e.g. Tense=pres).
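Inside such a tagger, mimicked vectors can simply back the embedding lookup for OOVs; a hedged sketch, where `make_lookup` and its arguments are hypothetical names, not the Mimick repo's API:

```python
def make_lookup(pretrained, mimick_fn):
    """Embedding lookup with an OOV fallback: in-vocabulary words use the
    pre-trained table; anything else gets a mimicked vector.
    pretrained: dict word -> vector; mimick_fn: word -> vector."""
    def lookup(word):
        return pretrained[word] if word in pretrained else mimick_fn(word)
    return lookup
```

The tagger itself is unchanged; only the source of each word vector differs.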
Evaluation metric: Micro F1

Languages (23 total, from Universal Dependencies):
○ 12 fusional, 3 analytic, 1 isolating, 7 agglutinative
○ 13 Indo-European (7 different branches), 10 from 8 non-IE branches
○ Much word-level data
○ Relatively free word order
(Language-family tree illustration: Minna Sundberg)
Scripts and OOV rates:
○ 7 in non-alphabetic scripts
○ Ideographic (Chinese) - ~12K characters
○ Hebrew, Arabic - no casing, no vowels, syntactic fusion
○ Vietnamese - tokens are non-compositional syllables
○ Range from 0% (Vietnamese) to 92.4% (Hindi)
○ OOV rates: 16.9%-70.8% type-level (median 29.1%), 2.2%-33.1% token-level (median 9.2%)
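Type-level and token-level OOV rates of a corpus against an embedding vocabulary can be computed as:

```python
def oov_rates(tokens, vocab):
    """Return (type-level, token-level) OOV rates of a token list
    with respect to an embedding vocabulary (a set of words)."""
    types = set(tokens)
    oov_types = {t for t in types if t not in vocab}
    oov_tokens = sum(1 for t in tokens if t not in vocab)
    return len(oov_types) / len(types), oov_tokens / len(tokens)
```

Token-level rates are lower than type-level rates because frequent words are almost always in vocabulary.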
CHAR2TAG baseline: each token of “the flatfish is sitting” goes through its own Char-LSTM inside the tagger
○ 3x training time
Results: POS tags (accuracy), comparing NONE / MIMICK / CHAR2TAG / BOTH
○ Slavic languages: POS
○ Agglutinative languages: morphological attribute F1
Code & models: https://github.com/yuvalpinter/Mimick
Try it on other tasks:
○ Sentiment! ○ Parsing! ○ IE! ○ QA! ○ ...
Trained models:
○ <1MB each, dynet format
○ Learn all OOVs in advance and add to param table, or
○ Load into memory and infer on-line
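The first deployment mode (learn all OOVs in advance and add them to the parameter table) might look like this sketch; the function and argument names are illustrative:

```python
def extend_param_table(pretrained, corpus_tokens, mimick_fn):
    """Scan the task corpus once, mimic every OOV type, and return a copy
    of the embedding table extended with the mimicked vectors."""
    table = dict(pretrained)
    for t in set(corpus_tokens):
        if t not in table:
            table[t] = mimick_fn(t)
    return table
```

The alternative mode keeps the mimick model in memory and calls `mimick_fn` lazily whenever an unseen word arrives at inference time.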
Takeaways:
○ Morphologically-rich languages
○ Large character vocabulary
Future work:
○ Vietnamese - syllabic vocabulary
○ Hebrew and Arabic - nontrivial tokenization, no case
○ Try other subword levels (morphemes, phonemes, bytes)
○ Improve morphosyntactic attribute tagging scheme