SLIDE 1

Mimicking Word Embeddings using Subword RNNs

Yuval Pinter, Robert Guthrie, Jacob Eisenstein

@yuvalpi

Presented at EMNLP September 2017, Copenhagen

SLIDE 5

The Word Embedding Pipeline

Unlabeled corpus (Wikipedia, GigaWord, Reddit, ...) → Embedding model (vectors: word2vec, GloVe, Polyglot, FastText, ...) → Supervised task corpus (Penn Treebank, SemEval, OntoNotes, Universal Dependencies, ...) → Task model (tagging, parsing, sentiment, NER, ...)

SLIDE 6

Assumed Pattern

[Diagram: supervised task text ⊂ unlabeled text ⊂ all possible text]
SLIDE 12

Actual Pattern

[Diagram: supervised task text only partly overlaps unlabeled text within all possible text]

  • No pre-trained vectors
  • Affects supervised tasks
  • Multiple treatments suggested
  • Our method: a compositional subword OOV model

SLIDE 21

Sources of OOVs

  • Names: "Chalabi has increasingly marginalized within Iraq, ..."
  • Domain-specific jargon: "Important species (...) include shrimp, (...) and some varieties of flatfish."
  • Foreign words: "This term was first used in German (Hochrenaissance), ..."
  • Rare morphological derivations: "Without George Martin the Beatles would have been just another untalented band as Oasis."
  • Nonce words: "What if Google morphed into GoogleOS?"
  • Nonstandard orthography: "We’ll have four bands, and Big D is cookin’. lots of fun and great prizes."
  • Typos and other errors: "I dislike this urban society and I want to leave this whole enviroment."

SLIDE 28

Common OOV handling techniques

  • None (random init)
  • One UNK to rule them all
    ○ Average existing embeddings
    ○ Trained with embeddings (stochastic unking)
  • Add subword model during WE training
    ○ Bhatia et al. (2016), Wieting et al. (2016)
    ○ What if we don’t have access to the original corpus? (e.g. FastText)
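Neither UNK variant is spelled out on the slide; the sketch below is a minimal illustration of both. The toy three-dimensional embedding table and the fixed `unk_prob` are invented for the example — in practice the unking probability usually depends on word frequency.

```python
import random

random.seed(0)

# Toy pre-trained embedding table (hypothetical 3-d vectors).
emb = {
    "the": [0.1, 0.2, 0.3],
    "cat": [0.4, 0.0, -0.2],
    "sat": [-0.1, 0.3, 0.5],
}

# Variant 1: one UNK vector = average of all existing embeddings.
dim = len(next(iter(emb.values())))
unk = [sum(v[i] for v in emb.values()) / len(emb) for i in range(dim)]

def lookup(word):
    """Fall back to the shared UNK vector for any OOV word."""
    return emb.get(word, unk)

# Variant 2: "stochastic unking" -- during supervised training, in-vocab
# words are sometimes replaced by the UNK symbol so that its embedding
# gets trained; a fixed probability stands in for frequency-based schedules.
def maybe_unk(word, unk_prob=0.3):
    return "<UNK>" if random.random() < unk_prob else word

print(lookup("flatfish"))  # OOV word -> the averaged UNK vector
```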

SLIDE 32

Char2Tag

  • Add subword layer to supervised task
    ○ Ling et al. (2015), Plank et al. (2016)
  • OOVs benefit from co-trained character model
  • Requires a large supervised training set for efficient transfer to test-set OOVs

SLIDE 38

Enter MIMICK

  • What data do we have, post-unlabeled corpus?
    ○ Vector dictionary
    ○ Orthography (the way words are spelled)
  • Use the former as training objective, the latter as input
  • Pre-trained vectors as target
    ○ No need to access original unlabeled corpus
    ○ Many training examples
    ○ (No context)
  • Subword units as inputs
    ○ Very extensible
    ○ (Character inventory changes?)

SLIDE 44

MIMICK Training

Input word "make" → character embeddings (m, a, k, e) → forward LSTM + backward LSTM → multilayer perceptron → mimicked embedding → L2 loss against the pre-trained embedding of "make" (Polyglot/FastText/etc.)

SLIDE 45

MIMICK Inference

OOV word "blah" → character embeddings (b, l, a, h) → forward LSTM + backward LSTM → multilayer perceptron → mimicked embedding
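The training and inference pipeline on the last two slides can be sketched in a few dozen lines. Two simplifications to note: plain tanh (Elman) recurrent cells stand in for the paper's LSTMs, and a single linear projection stands in for the multilayer perceptron. All weights are random and untrained, and the toy dimensions and target vector are invented — this shows the data flow, not a trained model.

```python
import math
import random

random.seed(0)
DIM_CHAR, DIM_HID, DIM_EMB = 4, 8, 6  # toy sizes

def rand_mat(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Character embeddings for a toy inventory.
char_emb = {c: [random.uniform(-0.5, 0.5) for _ in range(DIM_CHAR)]
            for c in "abcdefghijklmnopqrstuvwxyz"}

# One recurrent cell per direction (tanh Elman cells standing in for
# LSTMs) and a single linear head standing in for the MLP.
W_f, U_f = rand_mat(DIM_HID, DIM_CHAR), rand_mat(DIM_HID, DIM_HID)
W_b, U_b = rand_mat(DIM_HID, DIM_CHAR), rand_mat(DIM_HID, DIM_HID)
W_out = rand_mat(DIM_EMB, 2 * DIM_HID)

def run_rnn(chars, W, U):
    """Run one direction over the character sequence; return final state."""
    h = [0.0] * DIM_HID
    for c in chars:
        pre = [a + b for a, b in zip(matvec(W, char_emb[c]), matvec(U, h))]
        h = [math.tanh(x) for x in pre]
    return h

def mimick(word):
    """Spelling -> embedding: concat final fwd/bwd states, then project."""
    h = run_rnn(word, W_f, U_f) + run_rnn(reversed(word), W_b, U_b)
    return matvec(W_out, h)

def l2_loss(pred, target):
    """Training objective: squared distance to the pre-trained vector."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

target_make = [0.2, -0.1, 0.0, 0.3, -0.4, 0.1]  # stand-in pre-trained vector
loss = l2_loss(mimick("make"), target_make)     # minimized during training
oov_vec = mimick("blah")                        # inference on an OOV word
```

Training would backpropagate `l2_loss` through the whole stack over every (spelling, vector) pair in the embedding dictionary; inference is just a forward pass on any spelling, in-vocabulary or not.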

SLIDE 55

Observation – Nearest Neighbors

  • English (OOV → nearest in-vocab words)
    ○ MCT → AWS, OTA, APT, PDM
    ○ pesky → euphoric, disagreeable, horrid, ghastly
    ○ lawnmower → tradesman, bookmaker, postman, hairdresser
  • Hebrew (OOV → nearest, English glosses)
    ○ "will come true (she/you-3p.sg.)" → "will solve (she/you-3p.sg.)"
    ○ "geometric (m.pl., nontraditional spelling)" → "geometric (m.pl.)"
    ○ "Richardson" → "Eustrach"
  • ✔ Surface form ✔ Syntactic properties ✘ Semantics
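Neighbor lists like these are produced by ranking the in-vocabulary table by cosine similarity to a mimicked vector. A minimal sketch, using hypothetical 2-d vectors (not real Polyglot embeddings) chosen so the adjective cluster wins:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest(query_vec, vocab_vecs, k=4):
    """Top-k in-vocabulary words by cosine similarity to a query vector."""
    ranked = sorted(vocab_vecs,
                    key=lambda w: cosine(query_vec, vocab_vecs[w]),
                    reverse=True)
    return ranked[:k]

# Hypothetical toy vectors: adjectives near one axis, nouns near another.
vocab = {
    "horrid":    [0.9, 0.1],
    "ghastly":   [0.8, 0.2],
    "postman":   [-0.7, 0.6],
    "tradesman": [-0.6, 0.7],
}
mimicked_pesky = [0.85, 0.15]  # stand-in for MIMICK's output for "pesky"
print(nearest(mimicked_pesky, vocab, k=2))  # the two adjectives rank first
```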

SLIDE 60

Intrinsic Evaluation – RareWords

  • RareWords similarity task: morphologically-complex, mostly unseen words
  • Covers the OOV sources above: names, domain-specific jargon, foreign words, rare(-ish) morphological derivations, nonce words, nonstandard orthography, typos and other errors, ...
  • Example nearest neighbors: programmatic, transformational, mechanistic, transactional, contextual

SLIDE 65

Extrinsic Evaluation – POS + Attribute Tagging

  • UD is annotated for POS and morphosyntactic attributes
    ○ Eng: "his stated goals" → Tense=Past|VerbForm=Part
    ○ Cze: "osoby v pokročilém věku" ("people of advanced age") → Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing
  • POS model from Ling et al. (2015): word embeddings → forward + backward LSTM → tags (DT NN VBZ VBG for "the cat is sitting")
  • Attributes: same architecture as the POS layer, with separate outputs per attribute (e.g. POS=VBZ, Number=sing, Tense=pres)
  • Negative effect on POS
  • Attribute evaluation metric
    ○ Micro F1
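Micro F1 pools the attribute:value decisions over all tokens before computing precision and recall, so frequent attributes dominate the score. The exact pooling used in the paper is an assumption here, and the gold/predicted sets below are invented:

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over "Attr=Val" pairs, pooled across tokens.

    gold/pred: one set of "Attr=Val" strings per token.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # pairs predicted and gold
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # gold but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Two toy tokens: one missed attribute, one spurious one.
gold = [{"Tense=Past", "VerbForm=Part"}, {"Number=Sing"}]
pred = [{"Tense=Past"}, {"Number=Sing", "Case=Loc"}]
score = micro_f1(gold, pred)  # precision = recall = 2/3 here
```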

SLIDE 79

Language Selection

  • |UD ∩ Polyglot| = 44; we took 23
  • Morphological structure
    ○ 12 fusional
    ○ 3 analytic
    ○ 1 isolating
    ○ 7 agglutinative
  • Genealogical diversity
    ○ 13 Indo-European (7 different branches)
    ○ 10 from 8 non-IE branches
  • MRLs (e.g. Slavic languages)
    ○ Much word-level data
    ○ Relatively free word order

[Language family tree illustration by Minna Sundberg]

SLIDE 90

Language Selection (contd.)

  • Script type
    ○ 7 in non-alphabetic scripts
    ○ Ideographic (Chinese): ~12K characters
    ○ Hebrew, Arabic: no casing, no vowels, syntactic fusion
    ○ Vietnamese: tokens are non-compositional syllables
  • Attribute-carrying tokens
    ○ Range from 0% (Vietnamese) to 92.4% (Hindi)
  • OOV rate (UD against Polyglot vocabulary)
    ○ 16.9%–70.8% type-level (median 29.1%)
    ○ 2.2%–33.1% token-level (median 9.2%)
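Type-level vs token-level OOV rates as above can be computed like this, with a toy corpus and vocabulary standing in for the actual UD/Polyglot data:

```python
def oov_rates(corpus_tokens, embedding_vocab):
    """Return (type-level, token-level) OOV rates of a corpus vs a vocabulary."""
    types = set(corpus_tokens)
    oov_types = {t for t in types if t not in embedding_vocab}
    oov_tokens = sum(1 for t in corpus_tokens if t not in embedding_vocab)
    return len(oov_types) / len(types), oov_tokens / len(corpus_tokens)

# Toy corpus: 8 tokens, 5 types, 2 OOV types ("flatfish", "cookin'").
tokens = ["the", "cat", "sat", "the", "flatfish", "sat", "the", "cookin'"]
vocab = {"the", "cat", "sat"}
type_rate, token_rate = oov_rates(tokens, vocab)  # 0.4 and 0.25 here
```

Type rates run well above token rates because OOVs are concentrated in the long tail of rare words, which is exactly the pattern in the 23-language table.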

SLIDE 95

Evaluated Systems

  • NONE: Polyglot’s default UNK embedding
  • MIMICK
  • CHAR2TAG: additional RNN layer (per-word char-LSTMs in the tagger)
    ○ 3x training time
  • BOTH: MIMICK + CHAR2TAG

SLIDE 96

Results – Full Data

[Charts: POS tag accuracy and morphosyntactic attribute micro F1 for NONE, MIMICK, CHAR2TAG, BOTH]

SLIDE 97

Results – 5,000 training tokens

[Charts: POS tag accuracy and morphosyntactic attribute micro F1 for NONE, MIMICK, CHAR2TAG, BOTH]

SLIDE 99

Results – Language Types (5,000 tokens)

[Charts: POS accuracy for Slavic languages; morphosyntactic attribute F1 for agglutinative languages — NONE, MIMICK, CHAR2TAG, BOTH]

SLIDE 100

Results – Chinese

[Charts: POS tag accuracy and morphosyntactic attribute micro F1 for NONE, MIMICK, CHAR2TAG, BOTH]

SLIDE 113

A Word (Model) from our Sponsor

  • Our extrinsic results are on tagging
  • Please consider us for all your WE use cases!
    ○ Sentiment! Parsing! IE! QA! ...
  • Code compatible with word2vec, Polyglot, FastText
  • Models for Polyglot also on GitHub
    ○ <1MB each, DyNet format
    ○ Learn all OOVs in advance and add them to the parameter table, or
    ○ Load the model into memory and infer on-line

Code & models: https://github.com/yuvalpinter/Mimick
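The two deployment modes in the last bullets can be sketched as follows; `mimick` here is a hypothetical stand-in for a trained model (a real one would run the char-RNN), not the repository's API:

```python
# Hypothetical stand-in for a trained MIMICK model: any spelling -> vector.
def mimick(word):
    return [float(len(word)), float(word.count("a")), 0.0]

pretrained = {"the": [1.0, 0.0, 0.0], "cat": [0.0, 1.0, 0.0]}

# Mode 1: learn all task-corpus OOVs in advance and add them to the
# parameter table, so the downstream model sees one fixed lookup table.
def extend_table(table, task_vocab, oov_model):
    extended = dict(table)
    for word in task_vocab:
        if word not in extended:
            extended[word] = oov_model(word)
    return extended

# Mode 2: keep the table small and infer a vector on-line at lookup time.
def lookup(table, word, oov_model):
    return table[word] if word in table else oov_model(word)

task_vocab = ["the", "cat", "flatfish"]
table = extend_table(pretrained, task_vocab, mimick)
```

Mode 1 costs memory but keeps inference a plain lookup; mode 2 handles words never seen at setup time, at the price of running the model inside the serving loop.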

SLIDE 125

Conclusions

  • MIMICK: an OOV-extension embedding processing step for downstream tasks
  • Compositional model complementing a distributional artifact
  • Powerful technique for low-resource scenarios
  • Especially good for:
    ○ Morphologically-rich languages
    ○ Large character vocabularies
  • Sore spots and future work
    ○ Vietnamese: syllabic vocabulary
    ○ Hebrew and Arabic: nontrivial tokenization, no case
    ○ Try other subword levels (morphemes, phonemes, bytes)
    ○ Improve the morphosyntactic attribute tagging scheme

SLIDE 126

Questions?

Code & models: https://github.com/yuvalpinter/Mimick