Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces
Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology, Cambridge, MA
Machine Translation (MT), Automatic Speech Recognition (ASR), and Text-to-Speech Synthesis (TTS) systems all require parallel corpora for training, which are expensive to collect:
- MT system: (English text, French translation) pairs, e.g., ("the cat is black", "le chat est noir")
- ASR system: (English audio, English transcription) pairs
- TTS system: (English text, English audio) pairs, e.g., for "dogs are cute", "cats are adorable"
Framework

Example monolingual text corpus (Language 2): "Wikipedia is a multilingual, web-based, free encyclopedia based on a model of openly editable and viewable content, a wiki. It is the largest and most popular …"

The speech corpus (Language 1) and the text corpus (Language 2) do not need to be parallel.

- Speech2vec [Chung & Glass, 2018] embeds the speech corpus into X = [x_1, x_2, …, x_n] ∈ ℝ^(d×n)
- Word2vec [Mikolov et al., 2013] embeds the text corpus into Y = [y_1, y_2, …, y_n] ∈ ℝ^(d×n)
- MUSE [Lample et al., 2018] learns a linear mapping W between the two spaces such that
  W* = argmin_{W ∈ ℝ^(d×d)} ‖WX − Y‖_F

Components
- Word2vec: learns distributed representations of words from a text corpus that model word semantics in an unsupervised manner
- Speech2vec: a speech version of word2vec that learns semantic word representations from a speech corpus; also unsupervised
- MUSE: an unsupervised way to learn W, under the assumption that the two embedding spaces are approximately isomorphic

Advantages
- Relies only on monolingual corpora of speech and text that:
  - Do not need to be parallel
  - Can be collected independently, greatly reducing human labeling effort
- The framework is unsupervised:
  - Each component uses unsupervised learning
  - Applicable to low-resource language pairs that lack bilingual resources
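The word2vec component can be illustrated with a minimal skip-gram model trained by SGD with a full softmax. This is a toy sketch only: real word2vec uses negative sampling or hierarchical softmax on far larger corpora, and the corpus, dimensions, and learning rate below are illustrative choices, not values from the paper.

```python
import numpy as np

# Toy skip-gram word2vec: predict each context word from the center word.
corpus = "the dog barks the cat meows the dog runs".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, d, window = len(vocab), 8, 1

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, d))   # word embeddings (what we keep)
W_out = rng.normal(scale=0.1, size=(V, d))  # context ("output") embeddings

# All (center, context) index pairs within the window.
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in range(max(0, i - window), min(len(corpus), i + window + 1))
         if i != j]

lr = 0.1
for _ in range(200):
    for center, context in pairs:
        v = W_in[center].copy()
        scores = W_out @ v
        p = np.exp(scores - scores.max())
        p /= p.sum()
        p[context] -= 1.0          # gradient of -log p(context | center)
        W_in[center] -= lr * (W_out.T @ p)
        W_out -= lr * np.outer(p, v)

# Words sharing contexts ("dog"/"cat") should end up with correlated vectors.
def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(W_in[idx["dog"]], W_in[idx["cat"]]))
```

Speech2vec follows the same idea, but the "words" are audio segments encoded by an RNN encoder-decoder rather than one-hot rows of a lookup table.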
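When W is constrained to be orthogonal and paired embeddings are available, the mapping objective has a closed-form solution via SVD (the Procrustes refinement MUSE applies after its unsupervised adversarial step). A minimal numpy sketch on synthetic data, where the "speech" and "text" spaces differ only by a hidden rotation (all matrices here are toy stand-ins, not real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 50                       # embedding dimension, number of word pairs

X = rng.standard_normal((d, n))    # toy "speech" embeddings (columns = words)
R_true, _ = np.linalg.qr(rng.standard_normal((d, d)))  # hidden rotation
Y = R_true @ X                     # toy "text" embeddings

# Orthogonal Procrustes: argmin over orthogonal W of ||WX - Y||_F,
# solved in closed form by the SVD of Y X^T.
U, _, Vt = np.linalg.svd(Y @ X.T)
W = U @ Vt

print(np.linalg.norm(W @ X - Y))   # ~0: the hidden rotation is recovered
```

In the fully unsupervised setting there are no known pairs, so MUSE first learns W adversarially (a discriminator tries to tell mapped speech embeddings WX from text embeddings Y) and only then builds a synthetic dictionary for this refinement step.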
Usage of the learned W

#1: Unsupervised spoken word recognition: when Language 1 = Language 2 (e.g., both English).
Nearest neighbor search for an input spoken word "dog" retrieves: "dog", "dogs", "puppy", "pet", …

#2: Unsupervised spoken word translation: when Language 1 ≠ Language 2 (e.g., English to French).
Nearest neighbor search for an input spoken word "dog" retrieves: "chien", "chiot", …
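Both uses reduce to a cosine-similarity nearest-neighbor search over the text vocabulary after mapping the spoken word's embedding with W. A small sketch, where the vocabulary, vectors, and query are illustrative toy values rather than real embeddings:

```python
import numpy as np

def nearest_neighbors(query, vocab_vecs, vocab_words, k=3):
    """Return the k vocabulary words whose embeddings have the highest
    cosine similarity to the (already mapped) query vector."""
    q = query / np.linalg.norm(query)
    V = vocab_vecs / np.linalg.norm(vocab_vecs, axis=1, keepdims=True)
    sims = V @ q                     # cosine similarity to every vocab word
    top = np.argsort(-sims)[:k]
    return [vocab_words[i] for i in top]

# Toy aligned text space: related words cluster, unrelated words do not.
words = ["dog", "dogs", "puppy", "car"]
vecs = np.array([[1.0, 0.0],
                 [0.9, 0.1],
                 [0.8, 0.3],
                 [0.0, 1.0]])

# A mapped spoken-word embedding (W @ x) landing near the "dog" cluster:
print(nearest_neighbors(np.array([0.95, 0.05]), vecs, words))
# -> ['dog', 'dogs', 'puppy']
```

For recognition the vocabulary is in the same language as the speech; for translation it is the target language's word2vec vocabulary. MUSE's actual retrieval uses a cross-domain similarity criterion (CSLS) rather than raw cosine to mitigate hubness, which this sketch omits.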
An interesting property of our approach is synonym retrieval: the list of nearest neighbors contains both synonyms and different lexical forms of the input spoken word.

The framework serves as a foundation for unsupervised automatic speech recognition (when the two languages match) and for unsupervised speech-to-text translation (when they differ).