
Mimicking Word Embeddings using Subword RNNs
Yuval Pinter, Robert Guthrie, Jacob Eisenstein (@yuvalpi)
Presented at EMNLP, September 2017, Copenhagen

The Word Embedding Pipeline
[Diagram: unlabeled corpus]


  2. Enter MIMICK
     ● What data do we have, post-unlabeled corpus?
       ○ Vector dictionary
       ○ Orthography (the way words are spelled)
     ● Use the former as training objective, latter as input
     ● Pre-trained vectors as target
       ○ No need to access original unlabeled corpus
       ○ Many training examples
       ○ (No context)
     ● Subword units as inputs
       ○ Very extensible
       ○ (Character inventory changes?)
     [Diagram: unlabeled corpus and supervised task corpus, with OOV words marked]

  8. MIMICK Training
     [Diagram, bottom to top, for the word "make"; regions labeled "All possible text" and "Unlabeled text":]
     ● Characters: m a k e
     ● Character embeddings
     ● Forward LSTM, Backward LSTM
     ● Multilayered Perceptron
     ● Mimicked Embedding
     ● Loss (L2) against the Pre-trained Embedding of "make" (Polyglot/FastText/etc.)
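The training diagram can be read as a forward pass followed by a distance to the pre-trained vector. The following is a minimal sketch of that pass in plain Python, not the authors' implementation: the class names `LSTM` and `Mimick`, the tiny dimensions, the random untrained weights, the omitted bias terms, and the use of squared L2 distance as the loss are all assumptions made for illustration. Training itself (backpropagation) is left out.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class LSTM:
    """A single-layer LSTM without bias terms (illustrative simplification)."""

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        # One weight matrix per gate, applied to [input; previous hidden state].
        self.W = {g: rand_matrix(hid_dim, in_dim + hid_dim, rng) for g in "ifog"}

    def run(self, inputs):
        """Consume a sequence of vectors and return the final hidden state."""
        h = [0.0] * self.hid_dim
        c = [0.0] * self.hid_dim
        for x in inputs:
            xh = x + h  # list concatenation
            i = [sigmoid(v) for v in matvec(self.W["i"], xh)]
            f = [sigmoid(v) for v in matvec(self.W["f"], xh)]
            o = [sigmoid(v) for v in matvec(self.W["o"], xh)]
            g = [math.tanh(v) for v in matvec(self.W["g"], xh)]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h


class Mimick:
    """Characters -> char embeddings -> forward/backward LSTM -> MLP -> word vector."""

    def __init__(self, alphabet, char_dim=4, hid_dim=5, out_dim=3, seed=0):
        rng = random.Random(seed)
        self.char_emb = {ch: [rng.uniform(-0.1, 0.1) for _ in range(char_dim)]
                         for ch in alphabet}
        self.fwd = LSTM(char_dim, hid_dim, rng)
        self.bwd = LSTM(char_dim, hid_dim, rng)
        self.W1 = rand_matrix(hid_dim, 2 * hid_dim, rng)  # MLP hidden layer
        self.W2 = rand_matrix(out_dim, hid_dim, rng)      # MLP output layer

    def embed(self, word):
        chars = [self.char_emb[ch] for ch in word]  # KeyError on unseen characters
        state = self.fwd.run(chars) + self.bwd.run(chars[::-1])
        hidden = [math.tanh(v) for v in matvec(self.W1, state)]
        return matvec(self.W2, hidden)


def l2_loss(predicted, target):
    """Squared L2 distance between the mimicked and the pre-trained vector."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


model = Mimick("abcdefghijklmnopqrstuvwxyz")
pretrained_make = [0.2, -0.1, 0.4]  # made-up stand-in for a dictionary vector
print(l2_loss(model.embed("make"), pretrained_make))
```

At inference time only `embed` is needed, so a word such as "blah" that has no dictionary vector still receives one.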

  9. MIMICK Inference
     [Diagram, bottom to top, for the word "blah"; regions labeled "All possible text" and "Unlabeled text":]
     ● Characters: b l a h
     ● Character embeddings
     ● Forward LSTM, Backward LSTM
     ● Multilayered Perceptron
     ● Mimicked Embedding

  19. Observation – Nearest Neighbors
      ● English (OOV → Nearest in-vocab words)
        ○ MCT → AWS, OTA, APT, PDM
        ○ pesky → euphoric, disagreeable, horrid, ghastly
        ○ lawnmower → tradesman, bookmaker, postman, hairdresser
      ● Hebrew
        ○ רותפת → גתת: (she/you-3p.sg.) will come true; (she/you-3p.sg.) will solve
        ○ םיירט → םיירטמואיג: geometric (m.pl., nontrad. spelling); geometric (m.pl.)
        ○ ךרא → ציר ’ ןוסדר: Richardson; Eustrach
      ● ✔ Surface form
      ● ✔ Syntactic properties
      ● ✘ Semantics
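A nearest-neighbor list like the ones above can be produced by ranking in-vocabulary vectors by their similarity to the mimicked vector of the OOV word. The sketch below assumes cosine similarity, which the slide does not state, and uses made-up two-dimensional vectors; `nearest` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query_vec, vectors, k=4):
    """Return the k in-vocabulary words whose vectors are most similar to query_vec."""
    ranked = sorted(vectors, key=lambda w: cosine(query_vec, vectors[w]), reverse=True)
    return ranked[:k]


# Toy vocabulary; real Polyglot or FastText vectors have far more dimensions.
toy_vectors = {"horrid": [1.0, 0.0], "postman": [0.0, 1.0], "ghastly": [1.0, 1.0]}
mimicked_oov = [1.0, 0.1]  # stand-in for the mimicked vector of an OOV word
print(nearest(mimicked_oov, toy_vectors, k=2))
```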

  24. Intrinsic Evaluation – RareWords
      ● RareWords similarity task: morphologically-complex, mostly unseen words
      ● Names
      ● Domain-specific jargon
      ● Foreign words
      ● Rare(-ish) morphological derivations
      ● Nonce words
      ● Nonstandard orthography
      ● Typos and other errors
      ● ...
      NN FUN!!! Nearest: programmatic, transformational, mechanistic, transactional, contextual
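Word-similarity benchmarks of this kind are commonly scored by the rank correlation between model similarities and human ratings. The slide does not name the metric, so the sketch below assumes Spearman's rho, computed as the Pearson correlation of average ranks; the numbers in the example are made up.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


model_similarities = [0.1, 0.4, 0.9]  # e.g. cosine similarity per word pair
human_ratings = [2.0, 5.0, 9.5]       # e.g. annotator scores for the same pairs
print(spearman(model_similarities, human_ratings))
```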

  29. Extrinsic Evaluation – POS + Attribute Tagging
      ● UD is annotated for POS and morphosyntactic attributes
        ○ Eng: his stated goals
          Tense=Past|VerbForm=Part
        ○ Cze: osoby v pokročilém věku (people of advanced age)
          Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing
      ● POS model from Ling et al. (2015)
      ● Attributes - same as POS layer
      ● Negative effect on POS
      ● Attribute evaluation metric
        ○ Micro F1
      [Diagram, bottom to top, for "the cat is sitting": Word embeddings; Forward LSTM; Backward LSTM; then one output row per tag type:]
        POS:    DT  NN    VBZ   VBG
        Number: --  sing  --    --
        Tense:  --  --    pres  --
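Micro F1 pools true positives, false positives, and false negatives over all attribute predictions before computing precision and recall. The sketch below assumes each token's attributes are represented as a set of `Attribute=Value` strings, which is an illustrative choice and not necessarily how the authors' evaluation counts them; the predictions are made up.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over per-token sets of "Attribute=Value" strings."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)  # attribute values predicted correctly
        fp += len(p - g)  # predicted but not in the gold annotation
        fn += len(g - p)  # in the gold annotation but not predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Three tokens; the first carries the gold attributes of "stated" from the slide.
gold = [{"Tense=Past", "VerbForm=Part"}, set(), {"Number=Sing"}]
predicted = [{"Tense=Past"}, {"Number=Sing"}, {"Number=Sing"}]
print(micro_f1(gold, predicted))
```

Tokens without attributes contribute nothing unless the tagger predicts something for them, so frequent attribute-free tokens do not inflate the score.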

  43. Language Selection
      ● |UD ∩ Polyglot| = 44, we took 23
      ● Morphological structure
        ○ 12 fusional
        ○ 3 analytic
        ○ 1 isolating
        ○ 7 agglutinative
      ● Genealogical diversity
        ○ 13 Indo-European (7 different branches)
        ○ 10 from 8 non-IE branches
      ● MRLs (morphologically rich languages, e.g. Slavic languages)
        ○ Much word-level data
        ○ Relatively free word order
      [Image credit: Minna Sundberg]
      [Overlay words: Institutional, Entrepreneurial, Linguistic, Anatomical, Ideological]

  54. Language Selection (contd.)
      ● Script type
        ○ 7 in non-alphabetic scripts
        ○ Ideographic (Chinese) - ~12K characters
        ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion
        ○ Vietnamese - tokens are non-compositional syllables
      ● Attribute-carrying tokens
        ○ Range from 0% (Vietnamese) to 92.4% (Hindi)
      ● OOV rate (UD against Polyglot vocabulary)
        ○ 16.9%-70.8% type-level (median 29.1%)
        ○ 2.2%-33.1% token-level (median 9.2%)
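The two OOV rates differ only in what is counted: distinct word types or running tokens. A minimal sketch, with a made-up sentence and vocabulary standing in for a UD treebank and the Polyglot vocabulary (`oov_rates` is a hypothetical helper name):

```python
def oov_rates(tokens, vocab):
    """Return (type-level, token-level) out-of-vocabulary rates."""
    types = set(tokens)
    type_rate = sum(1 for t in types if t not in vocab) / len(types)
    token_rate = sum(1 for t in tokens if t not in vocab) / len(tokens)
    return type_rate, token_rate


tokens = "the flatfish is sitting and the cat is sitting".split()
vocab = {"the", "is", "sitting", "and", "cat"}
type_rate, token_rate = oov_rates(tokens, vocab)
print(f"type-level: {type_rate:.1%}, token-level: {token_rate:.1%}")
```

Frequent words are usually in the vocabulary and are counted once per occurrence at the token level, which is consistent with the type-level rates on the slide being higher than the token-level ones.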

  59. Evaluated Systems
      ● NONE: Polyglot's default UNK embedding
      ● MIMICK
      ● CHAR2TAG - additional RNN layer
        ○ 3x Training time
      ● BOTH: MIMICK + CHAR2TAG
      [Diagram: "the flatfish is sitting", with a Char-LSTM over each of the four words]
      [Overlay words: POINT, UNION, ROAD, LIGHT, LONG]
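The difference between NONE and MIMICK comes down to what an OOV word such as "flatfish" receives as its word embedding. The sketch below illustrates that lookup under assumed names: `embed_token` and `toy_mimick` are hypothetical, and the toy function only stands in for a trained MIMICK model.

```python
def embed_token(word, vectors, unk_vector, mimick_fn=None):
    """Look up a word vector, with a fallback for out-of-vocabulary words.

    mimick_fn=None corresponds to NONE (shared UNK embedding);
    passing a function corresponds to MIMICK (vector predicted from spelling).
    """
    if word in vectors:
        return vectors[word]
    if mimick_fn is None:
        return unk_vector
    return mimick_fn(word)


def toy_mimick(word):
    # Placeholder only: a real model would run the character-level network.
    return [float(len(word)), 1.0]


vectors = {"the": [1.0, 0.0], "is": [0.0, 1.0], "sitting": [0.5, 0.5]}
unk = [0.0, 0.0]
sentence = "the flatfish is sitting".split()
print([embed_token(w, vectors, unk) for w in sentence])              # NONE
print([embed_token(w, vectors, unk, toy_mimick) for w in sentence])  # MIMICK
```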

  60. Results - Full Data
      [Figure: NONE / MIMICK / CHAR2TAG / BOTH compared on Morpho. Attributes (micro F1) and POS tags (accuracy)]

  61. Results - 5,000 training tokens
      [Figure: NONE / MIMICK / CHAR2TAG / BOTH compared on Morpho. Attributes (micro F1) and POS tags (accuracy)]

  63. Results - Language Types (5,000 tokens)
      [Figure: NONE / MIMICK / CHAR2TAG / BOTH compared on Slavic languages (POS) and Agglutinative languages (morpho. attribute F1)]

  64. Results - Chinese
      [Figure: NONE / MIMICK / CHAR2TAG / BOTH compared on Morpho. Attributes (micro F1) and POS tags (accuracy)]
