
SLIDE 1

Word Representations, Seed Lexicons, Mapping Procedures, and Reference Lists: What Matters in Bilingual Lexicon Induction from Comparable Corpora?

Canadian AI 2020

Martin Laville¹, Mérième Bouhandi¹, Emmanuel Morin¹, and Philippe Langlais²

¹LS2N, Université de Nantes, France
²RALI, Université de Montréal, Canada

SLIDE 2

What is Bilingual Lexicon Induction (BLI)?

  • Finding translations of words between languages
  • Useful for Machine Translation, Information Retrieval…
SLIDE 5

English/French Data

  • Corpora

    ○ General (Wikipedia): 200M words
    ○ Specialized (Breast Cancer): 1M words

  • Reference Lists

    ○ 1,446 pairs of terms from MUSE (Conneau 2017) for general corpus
    ○ 248 pairs of terms from UMLS for breast cancer corpus

  • Seed lexicon

    ○ MUSE: 10,872 pairs (Conneau 2017)
    ○ ELRA-M0033: 243,539 pairs

SLIDE 7

Methods

  • BoW: Bag of Words (Rapp 1999, Fung 1998)

    ○ Count of co-occurrences of each pair of words
    ○ Translation using a seed lexicon

  • 2 embeddings

    ○ fastText (Bojanowski 2016): uncontextualised, one vector per word
    ○ ELMo (Peters 2018): contextualised, one vector per token
      ■ Anchor embeddings (Schuster 2019): mean of all the vectors of the same word

  • Mapping (Mikolov 2013, Artetxe 2018)

    ○ Supervised: needs a seed lexicon
    ○ Unsupervised: no seed lexicon
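
The BoW approach can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes raw co-occurrence counts in a fixed window (real systems usually reweight counts with an association measure), and the function names, the window size, and the toy sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    vecs = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vecs[word][tokens[j]] += 1
    return vecs


def translate_vector(vec, seed):
    """Project a source context vector into the target vocabulary.

    Context words missing from the seed lexicon are simply dropped.
    """
    out = Counter()
    for word, count in vec.items():
        for target in seed.get(word, ()):
            out[target] += count
    return out


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_candidates(src_word, src_vecs, tgt_vecs, seed):
    """Rank all target words by similarity to the translated source vector."""
    query = translate_vector(src_vecs[src_word], seed)
    return sorted(tgt_vecs, key=lambda t: cosine(query, tgt_vecs[t]), reverse=True)


# Toy comparable "corpora" and seed lexicon (illustrative data only).
en = [["the", "tumor", "grows"], ["the", "cat", "sleeps"]]
fr = [["la", "tumeur", "grossit"], ["le", "chat", "dort"]]
seed = {"the": ["la", "le"], "grows": ["grossit"], "sleeps": ["dort"]}

ranking = rank_candidates("tumor", context_vectors(en), context_vectors(fr), seed)
print(ranking[0])
```

The seed lexicon matters here because any context word it does not cover is lost when the source vector is projected into the target vocabulary.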

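Supervised mapping can be illustrated with a least-squares linear map fitted on seed pairs, in the spirit of Mikolov 2013. The sketch below is an assumption-laden toy: it uses plain batch gradient descent on 2-dimensional vectors, and it does not reproduce the normalisation, orthogonality constraints, or unsupervised initialisation used in later work. All names, hyperparameters, and vectors are hypothetical.

```python
import math


def apply_mapping(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def learn_mapping(src_vecs, tgt_vecs, lr=0.1, epochs=500):
    """Fit W minimising the mean squared error between W x and y over seed pairs."""
    dim_s, dim_t = len(src_vecs[0]), len(tgt_vecs[0])
    n = len(src_vecs)
    W = [[0.0] * dim_s for _ in range(dim_t)]
    for _ in range(epochs):
        grad = [[0.0] * dim_s for _ in range(dim_t)]
        for x, y in zip(src_vecs, tgt_vecs):
            pred = apply_mapping(W, x)
            for i in range(dim_t):
                err = pred[i] - y[i]
                for j in range(dim_s):
                    grad[i][j] += 2.0 * err * x[j] / n
        for i in range(dim_t):
            for j in range(dim_s):
                W[i][j] -= lr * grad[i][j]
    return W


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na and nb else 0.0


def translate(word, src_emb, tgt_emb, W):
    """Map the source vector and return the nearest target word by cosine."""
    mapped = apply_mapping(W, src_emb[word])
    return max(tgt_emb, key=lambda t: cosine(mapped, tgt_emb[t]))


# Toy embeddings: the target space is the source space rotated by 90 degrees.
src_emb = {"dog": [1.0, 0.0], "cat": [0.0, 1.0], "puppy": [0.9, 0.1]}
tgt_emb = {"chien": [0.0, 1.0], "chat": [-1.0, 0.0], "chiot": [-0.1, 0.9]}
seed_pairs = [("dog", "chien"), ("cat", "chat")]  # "puppy" is held out

W = learn_mapping([src_emb[s] for s, _ in seed_pairs],
                  [tgt_emb[t] for _, t in seed_pairs])
print(translate("puppy", src_emb, tgt_emb, W))
```

With contextualised models such as ELMo, the per-word vectors fed to such a map would be the anchor embeddings, that is, the mean of all token vectors of the same word.
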
SLIDE 8

Evaluation

  • Our general reference list is used in a lot of work, but what does it really contain? Some pairs:

    ○ Enjoy / Enjoy
    ○ Madagascar / Madagascar
    ○ Hugo / Hugo

  • We create 3 sublists by:

    ○ Removing pairs with words not found in a monolingual dictionary
    ○ Removing pairs that are too graphically close (Levenshtein distance)
    ○ Removing pairs that are too frequent
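
The edit-distance filter could look like the sketch below. The slides do not give the threshold or any normalisation, so `min_distance=2`, the lowercasing step, and the function names are assumptions, and the last pair in the toy list is invented for illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def drop_close_pairs(pairs, min_distance=2):
    """Keep only pairs whose two sides differ by at least `min_distance` edits."""
    return [(s, t) for s, t in pairs
            if levenshtein(s.lower(), t.lower()) >= min_distance]


pairs = [("Enjoy", "Enjoy"), ("Madagascar", "Madagascar"),
         ("Hugo", "Hugo"), ("dog", "chien")]
print(drop_close_pairs(pairs))
```

Identical pairs such as the three listed on the slide have distance 0 and are removed by any positive threshold.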

SLIDE 10

Experiments

  • Results are reported as Precision@1
  • We vary:

    ○ 3 approaches: BoW, contextualised embeddings, and uncontextualised embeddings
    ○ 2 corpora: specialized or general
    ○ 2 seed lexicons: a small one (MUSE) and a bigger, better one (ELRA)
    ○ 4 reference lists: original, in-dictionary, edit distance, frequency

  • We seek to determine which parameters matter
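
Precision@1 counts a source word as correct when its top-ranked candidate is among the reference translations. A minimal sketch follows, assuming predictions are stored as ranked lists and the reference list as sets of accepted translations; both data structures and the toy entries are hypothetical.

```python
def precision_at_1(predictions, reference):
    """Share of reference source words whose first candidate is a gold translation.

    predictions: dict mapping a source word to a ranked list of candidates
    reference:   dict mapping a source word to a set of accepted translations
    """
    if not reference:
        return 0.0
    hits = 0
    for source, gold in reference.items():
        candidates = predictions.get(source)
        if candidates and candidates[0] in gold:
            hits += 1
    return hits / len(reference)


predictions = {"dog": ["chien", "chat"], "cat": ["chien", "chat"], "house": []}
reference = {"dog": {"chien"}, "cat": {"chat"}, "house": {"maison"}}
print(precision_at_1(predictions, reference))
```

Because the score is computed over the reference list, changing that list (for example by filtering identical pairs) changes the score even when the system output stays the same.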
SLIDE 15

Results

  • Supervised is still the way to go
  • fastText is better
  • A bigger/better seed lexicon degrades results
  • Specialized domain, BoW: having a bigger seed lexicon is better
  • fastText does worse while ELMo does better
  • Huge loss for all methods
  • BoW is really bad
SLIDE 18

Analysis (general domain)

  • fastText finds graphically close words
  • ELMo seems to capture the meaning
  • BoW is strongly affected by the number of occurrences
SLIDE 21

Analysis (specialized domain)

  • Seems easier, as words are less likely to be found in varying contexts
  • Still OK for lower-frequency words
  • fastText still finds graphically close words when frequency is low

SLIDE 22

Conclusion

  • Reference lists need to be questioned, not used as is
  • We hope this work will help people better consider this aspect