SLIDE 1

Word Representations, Seed Lexicons, Mapping Procedures, and Reference Lists: What Matters in Bilingual Lexicon Induction from Comparable Corpora?

Canadian AI 2020
Martin Laville¹, Mérième Bouhandi¹, Emmanuel Morin¹, and Philippe Langlais²

¹LS2N, Université de Nantes, France
²RALI, Université de Montréal, Canada

SLIDE 2

What is Bilingual Lexicon Induction (BLI)?

Finding translations of words between languages
Useful for Machine Translation, Information Retrieval…

SLIDE 5

English/French Data

Corpora

○ General (Wikipedia): 200M words
○ Specialized (Breast Cancer): 1M words

Reference Lists

○ 1,446 pairs of terms from MUSE (Conneau 2017) for the general corpus
○ 248 pairs of terms from UMLS for the breast cancer corpus

Seed Lexicons

○ MUSE: 10,872 pairs (Conneau 2017)
○ ELRA-M0033: 243,539 pairs

SLIDE 7

Methods

BoW: Bag of Words (Rapp 1999, Fung 1998)

○ Count the co-occurrences of each pair of words
○ Translate the resulting context vectors using a seed lexicon
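
The two BoW steps can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the window size, the use of raw counts (no association measure), the cosine ranking, the toy corpora, and all function names are assumptions.

```python
from collections import Counter
from math import sqrt

def context_vector(tokens, target, window=2):
    """Count the words co-occurring with `target` within +/- `window` tokens."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts

def translate_vector(vec, seed_lexicon):
    """Move a source context vector into the target language.
    Context words missing from the seed lexicon are dropped."""
    out = Counter()
    for word, count in vec.items():
        if word in seed_lexicon:
            out[seed_lexicon[word]] += count
    return out

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy comparable corpora and seed lexicon.
en = "the doctor treats the tumor and the doctor removes the tumor".split()
fr = "le médecin traite la tumeur et le chat mange la souris".split()
seed = {"doctor": "médecin", "treats": "traite", "the": "le"}

src = translate_vector(context_vector(en, "tumor"), seed)
candidates = ["tumeur", "souris"]
ranked = sorted(candidates, key=lambda w: cosine(src, context_vector(fr, w)), reverse=True)
print(ranked)
```

The size of the seed lexicon matters directly here: every context word that is missing from it is lost when the vector is translated.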

2 Embeddings

○ fastText (Bojanowski 2016): uncontextualised, one vector per word
○ ELMo (Peters 2018): contextualised, one vector per token
■ Anchor embeddings (Schuster 2019): mean of all the vectors of the same word
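
Anchor embeddings reduce the per-token vectors of a contextual model to one vector per word by averaging. A minimal sketch, assuming the token vectors are already available as (word, vector) pairs; the function name and the toy 2-dimensional vectors are hypothetical:

```python
from collections import defaultdict

def anchor_embeddings(token_vectors):
    """Average the contextual vectors of every occurrence of each word.

    `token_vectors` is an iterable of (word, vector) pairs, one per token
    occurrence, such as a contextual model like ELMo would produce.
    """
    sums = {}
    counts = defaultdict(int)
    for word, vec in token_vectors:
        if word not in sums:
            sums[word] = list(vec)
        else:
            sums[word] = [a + b for a, b in zip(sums[word], vec)]
        counts[word] += 1
    return {w: [x / counts[w] for x in s] for w, s in sums.items()}

# Two occurrences of "bank" in different contexts, one of "river" (toy values).
occurrences = [("bank", [1.0, 0.0]), ("bank", [0.0, 1.0]), ("river", [0.5, 0.5])]
anchors = anchor_embeddings(occurrences)
print(anchors["bank"])
```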

Mapping of the source and target embedding spaces (Mikolov 2013, Artetxe 2018)

■ Supervised: needs a seed lexicon
■ Unsupervised: no seed lexicon
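
For the supervised case, one common way to learn the mapping from seed-lexicon pairs is the orthogonal Procrustes solution. The sketch below assumes that choice, which the slides do not confirm; the random toy data, the known rotation used to build it, and the function name are illustrative only.

```python
import numpy as np

def procrustes_mapping(X, Z):
    """Return the orthogonal matrix W minimising ||X @ W - Z||_F.

    Row i of X (source) and row i of Z (target) hold the embeddings of
    the i-th seed-lexicon pair.
    """
    U, _, Vt = np.linalg.svd(X.T @ Z)
    return U @ Vt

# Toy data: the "target" space is the source space rotated by a known matrix R.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
theta = 0.7
R = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
Z = X @ R

W = procrustes_mapping(X, Z)
print(np.allclose(W, R))
```

Once W is learned, a source word is translated by mapping its vector with W and retrieving the nearest target vector.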

SLIDE 8

Evaluation

Our general reference list is used in a lot of work, but what does it really contain? Some pairs:

○ Enjoy / Enjoy
○ Madagascar / Madagascar
○ Hugo / Hugo

We create 3 sublists by:

○ Removing pairs with words not in a monolingual dictionary
○ Removing pairs that are too graphically close (Levenshtein distance)
○ Removing pairs that are too frequent
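
The three filters can be sketched as below. The thresholds (`min_edit`, `max_freq`), the toy dictionaries and frequencies, and the choice to apply each filter independently to the original list are assumptions; the slides do not give these details.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def build_sublists(pairs, src_dict, tgt_dict, src_freq, min_edit=2, max_freq=1000):
    """Return the three filtered sublists (hypothetical thresholds)."""
    in_dictionary = [(s, t) for s, t in pairs if s in src_dict and t in tgt_dict]
    not_too_close = [(s, t) for s, t in pairs
                     if levenshtein(s.lower(), t.lower()) >= min_edit]
    not_too_frequent = [(s, t) for s, t in pairs if src_freq.get(s, 0) <= max_freq]
    return in_dictionary, not_too_close, not_too_frequent

# Toy reference list with hypothetical dictionaries and corpus frequencies.
pairs = [("enjoy", "enjoy"), ("Madagascar", "Madagascar"), ("Hugo", "Hugo"),
         ("dog", "chien"), ("the", "le")]
src_dict = {"enjoy", "dog", "the"}
tgt_dict = {"chien", "le"}
src_freq = {"the": 50000, "dog": 120, "enjoy": 80}

in_dict, not_close, not_frequent = build_sublists(pairs, src_dict, tgt_dict, src_freq)
print(in_dict, not_close, not_frequent, sep="\n")
```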

SLIDE 10

Experiments

Results are reported as Precision@1 (P@1)
We vary:

○ 3 approaches: BoW, Contextualised Embeddings and Uncontextualised Embeddings
○ 2 corpora: Specialized or General
○ 2 seed lexicons: a small one (MUSE) and a bigger, better one (ELRA)
○ 4 reference lists: Original, in-dictionary, edit distance, frequency

We want to find out which parameters matter
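
Precision@1 is the share of reference source words whose top-ranked candidate is an accepted translation. A minimal sketch with hypothetical data structures (a ranked candidate list per source word, and a set of gold translations per source word):

```python
def precision_at_1(predictions, reference):
    """Fraction of reference words whose first candidate is a gold translation.

    predictions: source word -> ranked list of candidate translations
    reference:   source word -> set of accepted translations
    """
    hits = sum(
        1 for src, gold in reference.items()
        if predictions.get(src) and predictions[src][0] in gold
    )
    return hits / len(reference)

# Toy example: 2 of the 4 reference words are translated correctly at rank 1.
reference = {
    "dog": {"chien"},
    "cat": {"chat"},
    "house": {"maison", "domicile"},
    "tumor": {"tumeur"},
}
predictions = {
    "dog": ["chien", "chat"],
    "cat": ["chien", "chat"],   # correct answer only at rank 2
    "house": ["domicile"],
    "tumor": [],                # no candidate returned
}
score = precision_at_1(predictions, reference)
print(score)
```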

SLIDE 15

Results

Supervised is still the way to go
fastText is better
A bigger/better seed lexicon degrades results
Specialized domain, BoW: a bigger seed lexicon is better
fastText does worse while ELMo does better
Huge loss for all methods
BoW performs really badly

SLIDE 18

Analysis (general domain)

fastText finds graphically close words
ELMo seems to capture the meaning
BoW is really affected by occurrences

SLIDE 21

Analysis (specialized domain)

Seems easier, as words are less likely to be found in varying contexts
Still OK for lower frequencies
fastText still finds graphically close words when frequency is low

SLIDE 22

Conclusion

Reference lists need to be questioned and not used as is
We hope this work will help people better consider this aspect