Mimicking Word Embeddings using Subword RNNs Yuval Pinter, Robert - PowerPoint PPT Presentation

Enter MIMICK ● What data do we have, post-unlabeled corpus? ○ Vector dictionary ○ Orthography (the way words are spelled) ● Use the former as training objective, latter as input ● Pre-trained vectors as target ○ No need to access original unlabeled corpus ○ Many training examples Unlabeled Supervised ○ (No context) Unlabeled corpus task corpus Unlabeled corpus Unlabeled corpus corpus OOV OOV 11

Enter MIMICK ● What data do we have, post-unlabeled corpus? ○ Vector dictionary ○ Orthography (the way words are spelled) ● Use the former as training objective, latter as input ● Pre-trained vectors as target ○ No need to access original unlabeled corpus ○ Many training examples Unlabeled Supervised ○ (No context) Unlabeled corpus task corpus Unlabeled corpus Unlabeled ● Subword units as inputs corpus corpus ○ Very extensible ○ (Character inventory changes?) OOV OOV 11

MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Unlabeled text make m a k e 12

MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Unlabeled text make Character embeddings m a k e 12

MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Unlabeled text make Forward LSTM Character embeddings m a k e 12

MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Unlabeled text make Backward LSTM Forward LSTM Character embeddings m a k e 12

MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Multilayered Perceptron Unlabeled text make Backward LSTM Forward LSTM Character embeddings m a k e 12

MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) Loss (L 2 ) All possible text Mimicked Embedding Multilayered Perceptron Unlabeled text make Backward LSTM Forward LSTM Character embeddings m a k e 12

MIMICK Inference All possible text Mimicked Embedding Multilayered Perceptron Unlabeled text blah Backward LSTM Forward LSTM Character embeddings b l a h 13

Observation – Nearest Neighbors 14

Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) 14

Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM 14

Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly 14

Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser 14

Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew 14

Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew ○ רותפת → גתת (she/you-3p.sg.) will come true (she/you-3p.sg.) will solve 14

Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew ○ רותפת → גתת (she/you-3p.sg.) will come true (she/you-3p.sg.) will solve ○ םיירט → םיירטמואיג geometric (m.pl., nontrad. spelling) geometric (m.pl.) 14

Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew ○ רותפת → גתת (she/you-3p.sg.) will come true (she/you-3p.sg.) will solve ○ םיירט → םיירטמואיג geometric (m.pl., nontrad. spelling) geometric (m.pl.) ○ ךרא → ציר ’ ןוסדר Richardson Eustrach 14

Observation – Nearest Neighbors ● English (OOV  Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew ○ רותפת → גתת (she/you-3p.sg.) will come true (she/you-3p.sg.) will solve ○ םיירט → םיירטמואיג geometric (m.pl., nontrad. spelling) geometric (m.pl.) ○ ךרא → ציר ’ ןוסדר Richardson Eustrach ● ✔ Surface form ✔ Syntactic properties ✘ Semantics 14

Intrinsic Evaluation – RareWords 15

Intrinsic Evaluation – RareWords ● RareWords similarity task: morphologically-complex, mostly unseen words 15

Intrinsic Evaluation – RareWords ● RareWords similarity task: morphologically-complex, mostly unseen words Names ● Domain-specific jargon ● Foreign words ● Rare(-ish) morphological ● derivations Nonce words ● Nonstandard orthography ● Typos and other errors ● ... ● 15

Intrinsic Evaluation – RareWords ● RareWords similarity task: morphologically-complex, mostly unseen words NN FUN!!! Nearest: programmatic transformational Names ● mechanistic Domain-specific jargon ● transactional Foreign words ● contextual Rare(-ish) morphological ● derivations Nonce words ● Nonstandard orthography ● Typos and other errors ● ... ● 15

Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● Names ● Domain-specific jargon ● Foreign words ● Rare(-ish) morphological derivations Nonce words ● ● Nonstandard orthography ● Typos and other errors ● ... 16

Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● POS model from Ling et al. (2015) DT NN VBZ VBG ● Names ● Domain-specific jargon Backward LSTM ● Foreign words ● Rare(-ish) morphological Forward derivations LSTM Nonce words ● ● Nonstandard Word orthography embeddings ● Typos and other errors the cat is sitting ● ... 16

Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● POS model from Ling et al. (2015) Tense -- -- pres -- Number -- sing -- -- ● Attributes - same as POS layer POS DT NN VBZ VBG Backward LSTM Forward LSTM Word embeddings the cat is sitting 17

Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● POS model from Ling et al. (2015) Tense -- -- pres -- Number -- sing -- -- ● Attributes - same as POS layer POS DT NN VBZ VBG ● Negative effect on POS Backward LSTM Forward LSTM Word embeddings the cat is sitting 17

Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● POS model from Ling et al. (2015) Tense -- -- pres -- Number -- sing -- -- ● Attributes - same as POS layer POS DT NN VBZ VBG ● Negative effect on POS ● Attribute evaluation metric Backward LSTM ○ Micro F1 Forward LSTM Word embeddings the cat is sitting 17

Language Selection Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) ○ 10 from 8 non-IE branches Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) ○ 10 from 8 non-IE branches MRLs (e.g. Slavic languages) ● Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) ○ 10 from 8 non-IE branches MRLs (e.g. Slavic languages) ● ○ Much word-level data Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) ○ 10 from 8 non-IE branches MRLs (e.g. Slavic languages) ● ○ Much word-level data ○ Relatively free word order Minna Sundberg 18

Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative Institutional ● Geneological diversity Entrepreneurial Linguistic ○ 13 Indo-European (7 different branches) Anatomical ○ 10 from 8 non-IE branches Ideological MRLs (e.g. Slavic languages) ● ○ Much word-level data ○ Relatively free word order Minna Sundberg 18

Language Selection (contd.) 19

Language Selection (contd.) ● Script type 19

Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts 19

Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters 19

Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion 19

Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables 19

Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens 19

Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens ○ Range from 0% (Vietnamese) to 92.4% (Hindi) 19

Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens ○ Range from 0% (Vietnamese) to 92.4% (Hindi) ● OOV rate (UD against Polyglot vocabulary) 19

Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens ○ Range from 0% (Vietnamese) to 92.4% (Hindi) ● OOV rate (UD against Polyglot vocabulary) ○ 16.9%-70.8% type-level (median 29.1%) 19

Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens ○ Range from 0% (Vietnamese) to 92.4% (Hindi) ● OOV rate (UD against Polyglot vocabulary) ○ 16.9%-70.8% type-level (median 29.1%) ○ 2.2%-33.1% token-level (median 9.2%) 19

Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding the flatfish is sitting 20

Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding ● MIMICK the flatfish is sitting 20

Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding ● MIMICK ● CHAR2TAG - additional RNN layer ○ 3x Training time the flatfish is sitting Char- Char- Char- Char- LSTM LSTM LSTM LSTM 20

Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding ● MIMICK ● CHAR2TAG - additional RNN layer ○ 3x Training time ● BOTH: MIMICK + CHAR2TAG the flatfish is sitting Char- Char- Char- Char- LSTM LSTM LSTM LSTM 20

Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding ● MIMICK POINT ● CHAR2TAG - additional RNN layer UNION ○ 3x Training time ROAD LIGHT ● BOTH: MIMICK + CHAR2TAG LONG the flatfish is sitting Char- Char- Char- Char- LSTM LSTM LSTM LSTM 20

Results - Full Data NONE MIMICK CHAR2TAG BOTH NONE MIMICK CHAR2TAG BOTH Morpho. Attributes (micro F1) POS tags (accuracy) 21

Results - 5,000 training tokens NONE MIMICK CHAR2TAG BOTH NONE MIMICK CHAR2TAG BOTH Morpho. Attributes (micro F1) POS tags (accuracy) 22

Results - Language Types (5,000 tokens) NONE MIMICK CHAR2TAG BOTH Slavic languages POS 23

Results - Language Types (5,000 tokens) NONE MIMICK CHAR2TAG BOTH NONE MIMICK CHAR2TAG BOTH Agglutinative languages morpho. attribute F1 Slavic languages POS 23

Results - Chinese NONE MIMICK CHAR2TAG BOTH NONE MIMICK CHAR2TAG BOTH Morpho. Attributes (micro F1) POS tags (accuracy) 24

Mimicking Word Embeddings using Subword RNNs Yuval Pinter, Robert - PowerPoint PPT Presentation

Mimicking Word Embeddings using Subword RNNs Yuval Pinter, Robert Guthrie, Jacob Eisenstein @yuvalpi Presented at EMNLP September 2017, Copenhagen The Word Embedding Pipeline Unlabeled Unlabeled corpus Unlabeled corpus Unlabeled corpus

Word Embeddings Natural Language Processing VU (706.230) - Andi Rexha 02/04/2020 Word Embeddings

Word embeddings Rappel Embeddings ( pas Word Embeddings ) Est une lookup table Formalisme:

Word Embeddings Revisited: Contextual Embeddings CS 6956: Deep Learning for NLP Overview

Word Embeddings CS 6956: Deep Learning for NLP Overview Representing meaning Word

Word Embeddings CS 6956: Deep Learning for NLP Overview Representing meaning Word

Word Embeddings CS 6956: Deep Learning for NLP Overview Representing meaning Word

Word Embeddings CS 6956: Deep Learning for NLP Overview Representing meaning Word

Embeddings @ Twitter Making ML easy with Embeddings !!! Sept 2018 Agenda 1 Team 2 Whats an

Word Embeddings Tutorial HILA GONEN PHD STUDENT AT YOAV GOLDBERGS LAB BAR ILAN UNIVERSITY

Mixed membership word embeddings: Corpus-specific embeddings without big data James Foulds

Symmetric Pattern Based Word Embeddings for Improved Word Similarity Prediction Roy Schwartz + ,

Dense Word Embeddings CMSC 470 Marine Carpuat Slides credit: Jurasky & Martin How to

Dense Word Embeddings CMSC 470 Marine Carpuat Slides credit: Jurasky & Martin How to

Memory Memory Decoders M bits M bits RWM NVRWM ROM S 0 S 0 Word 0 Word 0 S 1 Word 1 Word

Lecture 8: NLP and Word Embeddings Alireza Akhavan Pour CLASS.VISION

Word Embeddings through Hellinger PCA Rmi Lebret and Ronan Collobert Idiap Research Institute /

Connect: Live - Online Synch Courses Medical billers and coders are the connection between

Significance Significance of of Guanx Guanxi Yan anji jie e Bian Bian University of

Councillors Workshop 29 th May 2019 Thames-Coromandel SMPs Governance and Engagement SMP

eLearning Parent Night 1/8/2019 What is eLearning? 2009 textbook 1:1 options allowed for schools

SSReflect , A Small Scale Reflection Extension for the Coq system Robbert Krebbers 1 June 26, 2009

Tim Tyler Principal Solutions Engineer, MetLife, Inc. @timotyler ttyler Our Products are

Adolescent Literacy: Issues and Opportunities An invited keynote address for Advancing

DETERMINANTS OF FOOD WASTE BEHAVIOUR IN GREEK HOUSEHOLDS T. KRITIKOU, D. PANAGIOTAKOS AND K.

Mimicking Word Embeddings using Subword RNNs Yuval Pinter, Robert - PowerPoint PPT Presentation

Mimicking Word Embeddings using Subword RNNs Yuval Pinter, Robert Guthrie, Jacob Eisenstein @yuvalpi Presented at EMNLP September 2017, Copenhagen The Word Embedding Pipeline Unlabeled Unlabeled corpus Unlabeled corpus Unlabeled corpus

Word Embeddings Natural Language Processing VU (706.230) - Andi Rexha 02/04/2020 Word Embeddings

Word embeddings Rappel Embeddings ( pas Word Embeddings ) Est une lookup table Formalisme:

Word Embeddings Revisited: Contextual Embeddings CS 6956: Deep Learning for NLP Overview

Word Embeddings CS 6956: Deep Learning for NLP Overview Representing meaning Word

Word Embeddings CS 6956: Deep Learning for NLP Overview Representing meaning Word

Word Embeddings CS 6956: Deep Learning for NLP Overview Representing meaning Word

Word Embeddings CS 6956: Deep Learning for NLP Overview Representing meaning Word

Embeddings @ Twitter Making ML easy with Embeddings !!! Sept 2018 Agenda 1 Team 2 Whats an

Word Embeddings Tutorial HILA GONEN PHD STUDENT AT YOAV GOLDBERGS LAB BAR ILAN UNIVERSITY

Mixed membership word embeddings: Corpus-specific embeddings without big data James Foulds

Symmetric Pattern Based Word Embeddings for Improved Word Similarity Prediction Roy Schwartz + ,

Dense Word Embeddings CMSC 470 Marine Carpuat Slides credit: Jurasky &amp; Martin How to

Dense Word Embeddings CMSC 470 Marine Carpuat Slides credit: Jurasky &amp; Martin How to

Memory Memory Decoders M bits M bits RWM NVRWM ROM S 0 S 0 Word 0 Word 0 S 1 Word 1 Word

Lecture 8: NLP and Word Embeddings Alireza Akhavan Pour CLASS.VISION

Word Embeddings through Hellinger PCA Rmi Lebret and Ronan Collobert Idiap Research Institute /

Connect: Live - Online Synch Courses Medical billers and coders are the connection between

Significance Significance of of Guanx Guanxi Yan anji jie e Bian Bian University of

Councillors Workshop 29 th May 2019 Thames-Coromandel SMPs Governance and Engagement SMP

eLearning Parent Night 1/8/2019 What is eLearning? 2009 textbook 1:1 options allowed for schools

SSReflect , A Small Scale Reflection Extension for the Coq system Robbert Krebbers 1 June 26, 2009

Tim Tyler Principal Solutions Engineer, MetLife, Inc. @timotyler ttyler Our Products are

Adolescent Literacy: Issues and Opportunities An invited keynote address for Advancing

DETERMINANTS OF FOOD WASTE BEHAVIOUR IN GREEK HOUSEHOLDS T. KRITIKOU, D. PANAGIOTAKOS AND K.

Dense Word Embeddings CMSC 470 Marine Carpuat Slides credit: Jurasky & Martin How to

Dense Word Embeddings CMSC 470 Marine Carpuat Slides credit: Jurasky & Martin How to