Pre-training on high-resource speech recognition improves - PowerPoint PPT Presentation

Pre-training on high-resource speech recognition improves low-resource speech-to-text translation Sameer Bansal Herman Kamper Karen Livescu Adam Lopez Sharon Goldwater

Current systems Spanish Audio: ? English text: 2

Current systems Spanish Audio: Automatic Speech ola mi nombre es hodor Spanish text: Recognition ? English text: 3

Current systems Spanish Audio: Automatic Speech ola mi nombre es hodor Spanish text: Recognition hi my name is hodor Machine English text: Translation 4

~100 languages supported by Google Translate ... 5

Unwritten languages Mboshi: Bantu language, Republic of Congo, ~160K speakers ~3000 languages with no writing system Automatic Speech Mboshi text: not available Recognition 6

Unwritten languages Mboshi: paired with French translations (Godard et al. 2018) ~3000 languages with no writing system Efforts to collect speech and translations using mobile apps ○ Aikuma : Bird et al. 2014, LIG-Aikuma : Blachon et al. 2016 7

Haiti Earthquake, 2010 Survivors sent text messages to helpline People trapped in Moun kwense nan Sacred Heart Sakre Kè nan Church, PauP Pòtoprens ● International rescue teams face language barrier ● No automated tools available ● Volunteers from global Haitian diaspora help create parallel text corpora in short time [Munro 2010] 8

Are we better prepared in 2019? Moun kwense nan Sakre Kè nan Pòtoprens People trapped in Sacred Heart Church, PauP Voice messages 9

Can we build a speech-to-text translation ( ST ) system? … given as training data: (source audio) paired with translations ● Tens of hours of speech paired with text translations ● No source text available 10

Neural models ... Spanish Audio: Weiss et al. (2017) Sequence-to-Sequence hi my name is hodor English text: Directly translate speech 11

Spanish speech to English text Spanish Audio ● telephone speech (unscripted) ● realistic noise conditions ● multiple speakers and dialects Encoder ● crowdsourced English text translations Attention Closer to real-world conditions Decoder English text 12

Spanish speech to English text Weiss et al. Good performance if trained on 100+ hours *for comparison text-to-text = 58 13

But ... Weiss et al. Poor performance in low-resource settings *for comparison text-to-text = 58 14

Goal: to improve translation performance 15

Goal: to improve translation performance … without labeling more low-resource speech 16

100s of hours of monolingual speech paired with text available … typically used to train ASR systems English text French text (English Audio) (French Audio) Key idea: leverage monolingual data from a different high-resource language 17

100s of hours of monolingual speech paired with text available … typically used to train ASR systems English text French text (English Audio) (French Audio) Sequence-to-Sequence English text Spanish Audio ~20 hours of Spanish-English ? 18

100s of hours of monolingual speech paired with text available … typically used to train ASR systems Weiss et al. 2017 Spanish text Anastasopoulos and Chiang 2018 Bérard et al. 2018 ( Spanish Audio ) Sperber et al. 2019 Sequence-to-Sequence English text Spanish Audio ~20 hours of Spanish-English 19

100s of hours of monolingual speech paired with text available … typically used to train ASR systems English text French text (English Audio) (French Audio) Sequence-to-Sequence English text Spanish Audio ~20 hours of Spanish-English ? 20

Why Spanish-English? 21

Why Spanish-English? simulate low-resource settings and test our method 22

Why Spanish-English? simulate low-resource settings and test our method Later: results on truly low-resource language --- Mboshi to French 23

Method Audio Same model architecture for ASR and ST Encoder Attention *randomly initialized parameters Decoder text 24

Pretrain on high-resource English audio 300 hours of English audio and text Encoder Attention *train until convergence Decoder English text 25

Fine-tune on low-resource 20 hours Spanish-English English audio Spanish audio Encoder Encoder Attention Attention transfer from English ASR Decoder Decoder English text English text 26

Fine-tune on low-resource 20 hours Spanish-English Spanish audio Encoder Attention *train until convergence Decoder English text 27

Will this work? 28

Spanish-English BLEU scores *for comparison Weiss et al. = 47.3 baseline 29

Spanish-English BLEU scores pretraining *for comparison Weiss et al. = 47.3 baseline 30

Spanish-English BLEU scores pretraining ● +9 BLEU *for comparison Weiss et al. = 47.3 baseline 31

Spanish-English BLEU scores pretraining ● better performance with half the data *for comparison Weiss et al. = 47.3 baseline 32

Further analysis pretraining 20 hours Spanish-English *for comparison Weiss et al. = 47.3 baseline 33

Faster training time pretraining baseline 34

Faster training time ● potentially useful in time critical scenarios pretraining 2 hours ~20 hours baseline 35

Ablation: model parameters English Spanish Spanish to English, N = 20 hours BLEU Encoder Encoder baseline 10.8 Attention Attention +English ASR 19.9 Decoder Decoder English text English text 36

Ablation: model parameters English Spanish Spanish to English, N = 20 hours BLEU Encoder Encoder baseline 10.8 random Attention Attention +English ASR 19.9 Decoder Decoder +English ASR: decoder 10.5 English text English text 37

Ablation: model parameters English Spanish Spanish to English, N = 20 hours BLEU Encoder Encoder baseline 10.8 Attention Attention +English ASR 19.9 random Decoder Decoder +English ASR: decoder 10.5 English text English text +English ASR: encoder 16.6 38

Ablation: model parameters English Spanish Spanish to English, N = 20 hours BLEU Encoder Encoder baseline 10.8 Attention Attention +English ASR 19.9 random Decoder Decoder +English ASR: decoder 10.5 English text English text +English ASR: encoder 16.6 … transferring encoder only parameters works well! 39

Ablation: model parameters English Spanish Spanish to English, N = 20 hours BLEU Encoder Encoder baseline 10.8 Attention Attention +English ASR 19.9 random Decoder Decoder +English ASR: decoder 10.5 English text English text +English ASR: encoder 16.6 … can pretrain on a language different from both source and target in ST pair 40

Pretraining on French French Spanish Spanish to English, N = 20 hours BLEU Encoder Encoder baseline 10.8 Attention Attention +English ASR 19.9 random Decoder Decoder +English ASR: encoder 16.6 French text English text +French ASR: encoder ? *only 20 hours of French ASR 41

Pretraining on French French Spanish Spanish to English, N = 20 hours BLEU Encoder Encoder baseline 10.8 Attention Attention +English ASR 19.9 random Decoder Decoder +English ASR: encoder 16.6 French text English text +French ASR: encoder 12.5 French ASR helps Spanish-English ST 42

Takeaways ● Pretraining on a different language helps ● transfer all model parameters for best gains ● encoder parameters account for most of these … useful when target vocabulary is different 43

… Mboshi-French ST 44

Mboshi-French ST ● ST data by Godard et al. 2018 ○ ~4 hours of speech, paired with French translations ● Mboshi ○ Bantu language, Republic of Congo ○ Unwritten ○ ~160K speakers 45

Mboshi-French: Results Mboshi Mboshi to French, N = 4 hours BLEU Encoder baseline ? Attention Decoder French text 46

Mboshi-French: Results Mboshi Mboshi to French, N = 4 hours BLEU Encoder baseline 3.5 Attention Decoder French text *outperformed by a naive baseline 47

Pretraining on French ASR French Mboshi Mboshi to French, N = 4 hours BLEU Encoder Encoder baseline 3.5 Attention Attention +French ASR: all ? Decoder Decoder French text French text transfer all parameters 48

Pretraining on French ASR French Mboshi Mboshi to French, N = 4 hours BLEU Encoder Encoder baseline 3.5 Attention Attention +French ASR: all 5.9 Decoder Decoder French text French text French ASR helps Mboshi-French ST 49

Pretraining on French ASR French Mboshi Mboshi to French, N = 4 hours BLEU Encoder Encoder baseline 3.5 Attention Attention +French ASR: all 5.9 Decoder Decoder French text French text French ASR helps Mboshi-French ST 50

Pretraining on English ASR English Mboshi Mboshi to French, N = 4 hours BLEU Encoder Encoder baseline 3.5 Attention Attention +French ASR: all 5.9 random Decoder Decoder +English ASR: encoder ? English text French text using encoder trained on a lot more data 51

Pretraining on English ASR English Mboshi Mboshi to French, N = 4 hours BLEU Encoder Encoder baseline 3.5 Attention Attention +French ASR: all 5.9 random Decoder Decoder +English ASR: encoder 5.3 English text French text English ASR helps Mboshi-French ST 52

Pre-training on high-resource speech recognition improves - PowerPoint PPT Presentation

Pre-training on high-resource speech recognition improves low-resource speech-to-text translation Sameer Bansal Herman Kamper Karen Livescu Adam Lopez Sharon Goldwater Current systems Spanish Audio: ? English text: 2 Current systems

8-Speech Recognition Speech Recognition Concepts Speech Recognition Approaches

Speech Processing Speech Processing Using Speech with Computers Overview Overview Speech vs

EECS E6870 converting speech to text Speech Recognition automatic speech recognition

HMMS and Speech HMMS and Speech HMMS and Speech Recognition Recognition Recognition Presented

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 25: Speech

Speech recognition Brief history Technology Computer Literacy 1 Lecture 22 How does

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 1: Introduction

6-Text To Speech (TTS) Speech Synthesis Speech Synthesis Concept Speech Naturalness Phone

Speech Processing 11-492/18-492 Speech Processing 11-492/18-492 Speech Recognition Acoustic

Speech Processing 15-492/18-492 Speech Recognition Template matching Speech Recognition by

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 23: Speech

GPU-Accelerated GPU-Accelerated Large Vocabulary Continuous Speech Recognition Large

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 1: Introduction

Speech Processing 15-492/18-492 Speech Recognition Signal Processing Analog to Digital Speech

Effective Open Source Speech Recognition in Your Application #kde-speech Peter Grasch

Speech Processing 15-492/18-492 Speech Recognition Intro Acoustic modelling HMMs Speech

Assistive Tech Dan Dyer@uidaho.edu What do you want to learn about? I was asked to provide:

Download for PC The most accessible web browser available. Download Step 1:

From Property Tax to? Revenue Sources, Tradeoffs, Trends LISA CHRISTENSEN GEE I N S T I T U T E

Economic Investment Outlook John Pesce, Chief Executive Officer Blake Rhodes, Fixed Income

A PPLYING A RCHITECTURE T ECHNIQUES TO A NCHOR S YSTEMS R OADMAPS Methods & Tools: Experience

SCALABLE ORDINAL EMBEDDING TO MODEL USER BEHAVIOR 2 3 4 PAIRWISE CITY DISTANCES Boston NYC

Teaching & Learning Update Danielle Klingaman- Assistant Superintendent School Committee

Close Reading for ALL Disciplines Jennifer Hengel, Nicole Hochholzer, Brian Reindl, and Coreen

Pre-training on high-resource speech recognition improves - PowerPoint PPT Presentation

Pre-training on high-resource speech recognition improves low-resource speech-to-text translation Sameer Bansal Herman Kamper Karen Livescu Adam Lopez Sharon Goldwater Current systems Spanish Audio: ? English text: 2 Current systems

8-Speech Recognition Speech Recognition Concepts Speech Recognition Approaches

Speech Processing Speech Processing Using Speech with Computers Overview Overview Speech vs

EECS E6870 converting speech to text Speech Recognition automatic speech recognition

HMMS and Speech HMMS and Speech HMMS and Speech Recognition Recognition Recognition Presented

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 25: Speech

Speech recognition Brief history Technology Computer Literacy 1 Lecture 22 How does

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 1: Introduction

6-Text To Speech (TTS) Speech Synthesis Speech Synthesis Concept Speech Naturalness Phone

Speech Processing 11-492/18-492 Speech Processing 11-492/18-492 Speech Recognition Acoustic

Speech Processing 15-492/18-492 Speech Recognition Template matching Speech Recognition by

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 23: Speech

GPU-Accelerated GPU-Accelerated Large Vocabulary Continuous Speech Recognition Large

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 1: Introduction

Speech Processing 15-492/18-492 Speech Recognition Signal Processing Analog to Digital Speech

Effective Open Source Speech Recognition in Your Application #kde-speech Peter Grasch

Speech Processing 15-492/18-492 Speech Recognition Intro Acoustic modelling HMMs Speech

Assistive Tech Dan Dyer@uidaho.edu What do you want to learn about? I was asked to provide:

Download for PC The most accessible web browser available. Download Step 1:

From Property Tax to? Revenue Sources, Tradeoffs, Trends LISA CHRISTENSEN GEE I N S T I T U T E

Economic Investment Outlook John Pesce, Chief Executive Officer Blake Rhodes, Fixed Income

A PPLYING A RCHITECTURE T ECHNIQUES TO A NCHOR S YSTEMS R OADMAPS Methods &amp; Tools: Experience

SCALABLE ORDINAL EMBEDDING TO MODEL USER BEHAVIOR 2 3 4 PAIRWISE CITY DISTANCES Boston NYC

Teaching &amp; Learning Update Danielle Klingaman- Assistant Superintendent School Committee

Close Reading for ALL Disciplines Jennifer Hengel, Nicole Hochholzer, Brian Reindl, and Coreen

A PPLYING A RCHITECTURE T ECHNIQUES TO A NCHOR S YSTEMS R OADMAPS Methods & Tools: Experience

Teaching & Learning Update Danielle Klingaman- Assistant Superintendent School Committee