Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces
Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology, Cambridge, MA
Machine Translation (MT), Automatic Speech Recognition (ASR), and Text-to-Speech Synthesis (TTS) systems all require parallel corpora for training, which are expensive to collect:
- MT system: (English text, French translation) pairs, e.g., ("the cat is black", "le chat est noir")
- ASR system: (English audio, English transcription) pairs
- TTS system: (English text, English audio) pairs, e.g., for "dogs are cute", "cats are adorable"
Framework

Example monolingual text corpus (Language 2): "Wikipedia is a multilingual, web-based, free encyclopedia based on a model of openly editable and viewable content, a wiki. It is the largest and most popular …"

The speech corpus (Language 1) and the text corpus (Language 2) do not need to be parallel.

- Speech2vec [Chung & Glass, 2018] embeds the speech corpus into X = [x_1, x_2, …, x_n] ∈ ℝ^(d×n)
- Word2vec [Mikolov et al., 2013] embeds the text corpus into Y = [y_1, y_2, …, y_n] ∈ ℝ^(d×n)
- MUSE [Lample et al., 2018] learns a linear mapping W between the two spaces such that
  W* = argmin_{W ∈ ℝ^(d×d)} ‖WX − Y‖_F

Components
- Word2vec: learns distributed representations of words from a text corpus that model word semantics in an unsupervised manner
- Speech2vec: a speech version of word2vec that learns semantic word representations from a speech corpus; also unsupervised
- MUSE: an unsupervised way to learn W, under the assumption that the two embedding spaces are approximately isomorphic

Advantages
- Relies only on monolingual corpora of speech and text that:
  - Do not need to be parallel
  - Can be collected independently, greatly reducing human labeling effort
- The framework is unsupervised:
  - Each component uses unsupervised learning
  - Applicable to low-resource language pairs that lack bilingual resources
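The word2vec component can be illustrated with a minimal skip-gram model trained by SGD with a full softmax. This is a toy sketch only: real word2vec uses negative sampling or hierarchical softmax on far larger corpora, and the corpus, dimensions, and learning rate below are illustrative choices, not values from the paper.

```python
import numpy as np

# Toy skip-gram word2vec: predict each context word from the center word.
corpus = "the dog barks the cat meows the dog runs".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, d, window = len(vocab), 8, 1

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, d))   # word embeddings (what we keep)
W_out = rng.normal(scale=0.1, size=(V, d))  # context ("output") embeddings

# All (center, context) index pairs within the window.
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in range(max(0, i - window), min(len(corpus), i + window + 1))
         if i != j]

lr = 0.1
for _ in range(200):
    for center, context in pairs:
        v = W_in[center].copy()
        scores = W_out @ v
        p = np.exp(scores - scores.max())
        p /= p.sum()
        p[context] -= 1.0          # gradient of -log p(context | center)
        W_in[center] -= lr * (W_out.T @ p)
        W_out -= lr * np.outer(p, v)

# Words sharing contexts ("dog"/"cat") should end up with correlated vectors.
def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(W_in[idx["dog"]], W_in[idx["cat"]]))
```

Speech2vec follows the same idea, but the "words" are audio segments encoded by an RNN encoder-decoder rather than one-hot rows of a lookup table.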
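When W is constrained to be orthogonal and paired embeddings are available, the mapping objective has a closed-form solution via SVD (the Procrustes refinement MUSE applies after its unsupervised adversarial step). A minimal numpy sketch on synthetic data, where the "speech" and "text" spaces differ only by a hidden rotation (all matrices here are toy stand-ins, not real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 50                       # embedding dimension, number of word pairs

X = rng.standard_normal((d, n))    # toy "speech" embeddings (columns = words)
R_true, _ = np.linalg.qr(rng.standard_normal((d, d)))  # hidden rotation
Y = R_true @ X                     # toy "text" embeddings

# Orthogonal Procrustes: argmin over orthogonal W of ||WX - Y||_F,
# solved in closed form by the SVD of Y X^T.
U, _, Vt = np.linalg.svd(Y @ X.T)
W = U @ Vt

print(np.linalg.norm(W @ X - Y))   # ~0: the hidden rotation is recovered
```

In the fully unsupervised setting there are no known pairs, so MUSE first learns W adversarially (a discriminator tries to tell mapped speech embeddings WX from text embeddings Y) and only then builds a synthetic dictionary for this refinement step.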
Usage of the learned W

#1: Unsupervised spoken word recognition: when Language 1 = Language 2 (e.g., both English).
Nearest neighbor search for an input spoken word "dog" retrieves: "dog", "dogs", "puppy", "pet", …

#2: Unsupervised spoken word translation: when Language 1 ≠ Language 2 (e.g., English to French).
Nearest neighbor search for an input spoken word "dog" retrieves: "chien", "chiot", …
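Both uses reduce to a cosine-similarity nearest-neighbor search over the text vocabulary after mapping the spoken word's embedding with W. A small sketch, where the vocabulary, vectors, and query are illustrative toy values rather than real embeddings:

```python
import numpy as np

def nearest_neighbors(query, vocab_vecs, vocab_words, k=3):
    """Return the k vocabulary words whose embeddings have the highest
    cosine similarity to the (already mapped) query vector."""
    q = query / np.linalg.norm(query)
    V = vocab_vecs / np.linalg.norm(vocab_vecs, axis=1, keepdims=True)
    sims = V @ q                     # cosine similarity to every vocab word
    top = np.argsort(-sims)[:k]
    return [vocab_words[i] for i in top]

# Toy aligned text space: related words cluster, unrelated words do not.
words = ["dog", "dogs", "puppy", "car"]
vecs = np.array([[1.0, 0.0],
                 [0.9, 0.1],
                 [0.8, 0.3],
                 [0.0, 1.0]])

# A mapped spoken-word embedding (W @ x) landing near the "dog" cluster:
print(nearest_neighbors(np.array([0.95, 0.05]), vecs, words))
# -> ['dog', 'dogs', 'puppy']
```

For recognition the vocabulary is in the same language as the speech; for translation it is the target language's word2vec vocabulary. MUSE's actual retrieval uses a cross-domain similarity criterion (CSLS) rather than raw cosine to mitigate hubness, which this sketch omits.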
An interesting property of our approach is synonym retrieval: the list of nearest neighbors contains both synonyms and different lexical forms of the input spoken word.

The framework serves as a foundation for unsupervised automatic speech recognition (when the two languages match) and for unsupervised speech-to-text translation (when they differ).