Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces - PowerPoint PPT Presentation


  1. Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces. Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA. NeurIPS, Montréal, Québec, Canada, December 2018.

  2. Training data for three common tasks:
  • Machine Translation (MT): (French text, English text) pairs; the MT system learns to translate "le chat est noir" into "the cat is black".
  • Automatic Speech Recognition (ASR): (English audio, English transcription) pairs; the ASR system transcribes audio as "dogs are cute".
  • Text-to-Speech Synthesis (TTS): (English text, English audio) pairs; the TTS system synthesizes "cats are adorable" as audio.

  3. All three tasks (MT, ASR, TTS) require parallel corpora for training → expensive to collect!

  4. Framework. Start from two monolingual corpora: a speech corpus in Language 1 and a text corpus in Language 2 (e.g., Wikipedia: "Wikipedia is a multilingual, web-based, free encyclopedia based on a model of openly editable and viewable content, a wiki. It is the largest and most popular …").

  5. The key point: the two corpora do not need to be parallel!

  6. Word2vec [Mikolov et al., 2013] is applied to the Language 2 text corpus, producing a word embedding matrix Y = [y_1, y_2, …, y_n] ∈ ℝ^{d×n}, one column per word type.
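To make the construction of Y concrete, here is a minimal Python sketch using gensim (a tooling assumption; the slides do not name an implementation), with a toy corpus and illustrative hyperparameters:

# Train word2vec on a text corpus and stack the vectors into Y.
import numpy as np
from gensim.models import Word2Vec

# Toy corpus; in practice, a large monolingual text corpus.
sentences = [
    ["the", "cat", "is", "black"],
    ["dogs", "are", "cute"],
    ["cats", "are", "adorable"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

# One column per word type: Y ∈ ℝ^{d×n}.
vocab = list(model.wv.index_to_key)
Y = np.stack([model.wv[w] for w in vocab], axis=1)  # shape (50, n)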

  7. Speech2vec [Chung & Glass, 2018] is likewise applied to the Language 1 speech corpus, producing a spoken-word embedding matrix X = [x_1, x_2, …, x_n] ∈ ℝ^{d×n}.
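A rough sketch of the Speech2vec idea in PyTorch: an RNN encoder compresses one spoken word's acoustic frames into a fixed vector, and a decoder is trained to reconstruct a neighboring word's frames (a skipgram-style objective, as in Chung & Glass, 2018). The layer sizes, feature choice (MFCCs), and training step below are illustrative assumptions, not the authors' exact model:

import torch
import torch.nn as nn

class Speech2VecSkipgram(nn.Module):
    def __init__(self, n_mfcc=13, emb_dim=50):
        super().__init__()
        self.encoder = nn.GRU(n_mfcc, emb_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, emb_dim, batch_first=True)
        self.proj = nn.Linear(emb_dim, n_mfcc)

    def forward(self, segment, target_len):
        # Encode a variable-length MFCC segment of one spoken word
        # into a fixed-dimensional vector x_i.
        _, h = self.encoder(segment)              # (1, batch, emb_dim)
        x = h[-1]                                 # (batch, emb_dim)
        # Skipgram objective: from x, reconstruct the frames of a
        # *neighboring* word in the same utterance.
        dec_in = x.unsqueeze(1).repeat(1, target_len, 1)
        out, _ = self.decoder(dec_in)
        return x, self.proj(out)                  # embedding, reconstruction

# Illustrative training step: minimize the reconstruction error
# against the neighboring word's frames.
model = Speech2VecSkipgram()
center = torch.randn(8, 30, 13)                   # batch of word segments
neighbor = torch.randn(8, 25, 13)                 # their context words
x, recon = model(center, neighbor.shape[1])
loss = nn.functional.mse_loss(recon, neighbor)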

  8. MUSE [Lample et al., 2018] then aligns the two embedding spaces by learning a linear mapping W such that

W* = argmin_{W ∈ ℝ^{d×d}} ‖WX − Y‖_F
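In the fully unsupervised MUSE pipeline, an initial W is found with adversarial training and then refined; the refinement step has the closed-form orthogonal Procrustes solution sketched below. This sketch assumes the columns of X and Y are already paired by some seed dictionary (in the unsupervised setting that seed comes from the adversarial stage, omitted here), and uses synthetic data only to check the solver:

import numpy as np

def procrustes(X, Y):
    """Closed form of min_W ||WX - Y||_F subject to W orthogonal:
    W = U V^T, where U S V^T is the SVD of Y X^T."""
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

d, n = 50, 1000
X = np.random.randn(d, n)                   # speech embeddings (columns)
W_true = np.linalg.qr(np.random.randn(d, d))[0]   # a random rotation
Y = W_true @ X                              # synthetic "text" embeddings
W = procrustes(X, Y)
print(np.allclose(W, W_true))               # True: recovers the rotation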

  9. Components:
  • Word2vec: learns distributed representations of words from a text corpus that model word semantics in an unsupervised manner.
  • Speech2vec: a speech version of word2vec that learns semantic word representations from a speech corpus; also unsupervised.
  • MUSE: an unsupervised way to learn W, under the assumption that the two embedding spaces are approximately isomorphic.

  10. Advantages:
  • The framework relies only on monolingual corpora of speech and text, which do not need to be parallel and can be collected independently, greatly reducing human labeling effort.
  • The framework is unsupervised: each component uses unsupervised learning, making it applicable to low-resource language pairs that lack bilingual resources.

  11. Usage of the learned W:
  #1: Unsupervised spoken word recognition, when Language 1 = Language 2 (e.g., both English). An input spoken word such as “dog” is embedded with speech2vec, mapped into the text space by W, and resolved by nearest-neighbor search, retrieving “dog”, “dogs”, “puppy”, “pet”, … This serves as a foundation for unsupervised automatic speech recognition. An interesting property of the approach is synonym retrieval → the list of nearest neighbors contains both synonyms and different lexical forms of the input spoken word.
  #2: Unsupervised spoken word translation, when Language 1 ≠ Language 2 (e.g., English to French). The input spoken word “dog” retrieves “chien”, “chiot”, … This serves as a foundation for unsupervised speech-to-text translation.
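A minimal sketch of this retrieval step: map the spoken word's speech2vec embedding through W, then rank the text vocabulary by similarity. Plain cosine similarity is a simplifying assumption (MUSE actually retrieves with the hubness-corrected CSLS criterion), and the function and variable names here are hypothetical:

import numpy as np

def retrieve(x, W, Y, vocab, k=5):
    """Rank text words by cosine similarity to the mapped speech vector."""
    z = W @ x                                   # map speech -> text space
    sims = (Y.T @ z) / (np.linalg.norm(Y, axis=0) * np.linalg.norm(z) + 1e-8)
    top = np.argsort(-sims)[:k]
    return [vocab[i] for i in top]

# Tiny demo with synthetic embeddings: the query is a noisy copy of
# the first vocabulary word, so it should be retrieved first.
d = 50
Y = np.random.randn(d, 4)
W = np.eye(d)                                   # identity stands in for W
x = Y[:, 0] + 0.01 * np.random.randn(d)
print(retrieve(x, W, Y, ["dog", "dogs", "puppy", "cat"], k=2))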

  12. Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces. Poster: 10:45 AM – 12:45 PM, Room 210 & 230 AB, #156.
