

SLIDE 1

Visual Grounding in Video for Unsupervised Word Translation

  • G. Sigurdsson, J-B. Alayrac, A. Nematzadeh, L. Smaira, M. Malinowski, J. Carreira, P. Blunsom, A. Zisserman

6/15/20

Video Pentathlon Workshop: The End-of-End-to-End: A Video Understanding Pentathlon

SLIDE 2

How can we learn a link between different languages from unpaired narrated videos?

SLIDE 3

Our goal: relate different languages through the visual domain

Je casse les œufs.*

I need to mix the eggs with the flour.

* I break the eggs.

SLIDE 4

Our setup: unsupervised word translation

Different videos in each language (no paired data).

I need to mix the eggs with the flour.

Je casse les œufs.
SLIDE 5

Dataset: HowToWorld

We extend the HowTo100M [a] dataset to other languages: we follow the same collection procedure but obtain different videos, each narrated in its original language.

[a] HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. Miech, Zhukov, Alayrac, Tapaswi, Laptev and Sivic, ICCV 2019.

SLIDE 6

Base Model: learn a joint space between languages and video

[Diagram: a video clip and its English narration "mix eggs with flour" are embedded into a joint space, trained with a contrastive loss [b]]

[b] MIL-NCE: End-to-End Learning of Visual Representations from Uncurated Instructional Videos. Miech, Alayrac, Smaira, Laptev, Sivic and Zisserman, CVPR 2020.
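
For intuition, here is a minimal NumPy sketch of an InfoNCE-style contrastive objective of the kind referenced above (a simplification, not the authors' exact MIL-NCE loss, which additionally handles multiple candidate narrations per clip):

import numpy as np

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """InfoNCE-style contrastive loss over a batch.

    video_emb, text_emb: (batch, dim) L2-normalized embeddings; row i of
    each matrix comes from the same narrated clip (positive pair), and
    every other row in the batch serves as a negative.
    """
    logits = (video_emb @ text_emb.T) / temperature       # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()                     # pull matching pairs together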

SLIDE 7

Base Model: learn a joint space between languages and video

[Diagram: the French narration "je casse les œufs" is embedded with a contrastive loss into the same bilingual-visual joint space]

SLIDE 8

Base Model: learn a joint space between languages and video

[Diagram: as on the previous slide, both narrations are embedded with a contrastive loss into the bilingual-visual joint space]

Next, we evaluate the quality of the joint bilingual space with English-to-French word retrieval: given a word in English, we score all French words using dot products in the joint space and report the percentage of times a correct translation is retrieved at rank 1 (R@1).
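
A minimal sketch of this retrieval protocol (variable names hypothetical; assumes the word embeddings already live in the joint space and a ground-truth dictionary is available for scoring):

import numpy as np

def recall_at_1(en_emb, fr_emb, gold):
    """en_emb: (n_en, dim) and fr_emb: (n_fr, dim) word embeddings in the
    joint space; gold maps each English word index to the set of indices
    of its correct French translations.
    """
    scores = en_emb @ fr_emb.T            # dot products against all French words
    top1 = scores.argmax(axis=1)          # highest-scoring French word per query
    return np.mean([top1[i] in fr for i, fr in gold.items()])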


SLIDE 9

Quantitative results for the Base Model

English-French (reporting recall@1)

                 Dictionary                  Simple Words
                 (Conneau et al., 2017)      (top 1000 words, Wikipedia)
                 All      Visual             All      Visual
Random Chance    0.1      0.2                0.1      0.2
Base Model       9.1      15.2               28.0     45.3

Dictionary: 1000 words in English and French from (Conneau et al., 2017).
Simple Words: top 1000 words from Wikipedia.
Visual: restricted to "visual" words (abstract concepts removed).

SLIDE 10

Do we need videos at all?

It has been shown that word embeddings in different languages can be aligned via a simple transformation (a rotation). Only a few correspondences are required to estimate this transformation, and unsupervised approaches exist that learn the alignment without any paired data: [MUSE: Conneau et al., ICLR 2018], [VecMap: Artetxe et al., ACL 2018].

However, these methods have robustness issues (e.g. sensitivity to language similarity and to training-corpora statistics). Can vision help there?
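
For intuition, the rotation mentioned above has a closed-form estimate from a handful of word correspondences, via orthogonal Procrustes (a minimal sketch with hypothetical names; MUSE and VecMap build refinements around this idea):

import numpy as np

def procrustes_rotation(X, Y):
    """Orthogonal matrix W minimizing ||X @ W - Y||_F.

    X: (n, dim) source-language embeddings of seed word pairs.
    Y: (n, dim) target-language embeddings of the same pairs.
    Closed-form solution via the SVD of X^T Y.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt   # apply as X @ W to map source words into the target space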

SLIDE 11

MUSE:
1) Find an initial linear mapping via an adversarial approach.
2) From this initialization, find the most-aligned word pairs and use them as anchors to refine the mapping with the Procrustes algorithm.
3) Normalize the distances using the local neighborhood.

[Diagram: MUSE maps the two monolingual embedding spaces into an aligned space]
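
Step 3 corresponds to MUSE's CSLS criterion (Cross-domain Similarity Local Scaling); a minimal NumPy sketch, assuming both embedding matrices are L2-normalized so dot products are cosine similarities:

import numpy as np

def csls_scores(src_emb, tgt_emb, k=10):
    """Hub-corrected similarities between two aligned embedding spaces.

    Subtracts from each cosine similarity the mean similarity of both
    words to their k nearest neighbors in the other language, penalizing
    'hub' words that are close to everything.
    """
    sims = src_emb @ tgt_emb.T                          # (n_src, n_tgt) cosines
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)  # source-side neighborhoods
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)  # target-side neighborhoods
    return 2 * sims - r_src[:, None] - r_tgt[None, :]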

SLIDE 12

MUVE: Multilingual Unsupervised Visual Embeddings

MUVE modifies step 1 of MUSE:
1) Use the AdaptLayer as the initial linear mapping (instead of the adversarial approach).
2) Refine the mapping with the Procrustes algorithm on the most-aligned word pairs, as in MUSE.
3) Normalize the distances using the local neighborhood, as in MUSE.

[Diagram: MUVE maps the two embedding spaces into an aligned space]
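
Putting the pieces together, a hedged sketch of the recipe on this slide (names hypothetical; `adapt_layer_W` stands for the AdaptLayer weights learned with the video-grounded base model, and `procrustes_rotation` / `csls_scores` are the sketches from the previous slides):

import numpy as np

def muve_align(src_emb, tgt_emb, adapt_layer_W, n_iters=5, k=10):
    """MUVE-style alignment sketch: start from the AdaptLayer's linear
    mapping instead of MUSE's adversarial initialization, then refine
    with Procrustes on mutual nearest neighbors scored by CSLS.
    """
    W = adapt_layer_W                                     # step 1: visual init
    for _ in range(n_iters):
        scores = csls_scores(src_emb @ W, tgt_emb, k=k)   # step 3: hub correction
        fwd, bwd = scores.argmax(axis=1), scores.argmax(axis=0)
        # Keep mutual nearest neighbors as anchor pairs for refinement.
        pairs = [(i, j) for i, j in enumerate(fwd) if bwd[j] == i]
        src_idx, tgt_idx = (np.array(p) for p in zip(*pairs))
        W = procrustes_rotation(src_emb[src_idx], tgt_emb[tgt_idx])  # step 2
    return W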

SLIDE 13

MUVE vs Base model

English-French (reporting R@1)

                 Dictionary                  Simple Words
                 (Conneau et al., 2017)      (top 1000 words)
                 All      Visual             All      Visual
Random Chance    0.1      0.2                0.1      0.2
Base Model       9.1      15.2               28.0     45.3
MUVE             28.9     39.5               58.3     67.5

SLIDE 14

Performance of models across language pairs

Larger gap in performance for more distant languages.

reporting R@1                      En-Fr    En-Ko    En-Ja
MUSE (Conneau et al., 2017)        26.3     11.8     11.6
VecMap (Artetxe et al., 2018)      28.4     13.0     13.7
MUVE (ours)                        28.9     17.7     15.1
Supervised                         57.9     41.8     41.1

SLIDE 15

Robustness to dissimilarity of text corpora for embedding pretraining

reporting R@10 (French embeddings fixed to HowTo-Fr in all rows)

English corpus    MUSE                        VecMap                    MUVE
                  (Conneau et al., 2017)      (Artetxe et al., 2018)    (ours)
HowTo-En          45.8                        45.4                      47.3
WMT-En            0.3                         0.2                       26.4
Wiki-En           0.3                         0.1                       32.6

SLIDE 16

Conclusion and Future work

SLIDE 17

Takeaways

Conclusion:

  • Unsupervised word translation through visual grounding is possible from unpaired and uncurated narrated videos.
  • The information contained in vision is complementary to that contained in the structure of the languages, which enables a better and more robust approach (MUVE) to unsupervised word translation.

Future work:

  • From words to sentences.
  • Using multilingual datasets to learn better visual representations (more data sources, less biased, …).

Links: Paper, Blog post

Q&A Time: CVPR 2020, Thursday, June 18, 2020, 9-11 AM and 9-11 PM PT