SLIDE 1

Generative Pre-Training for Speech with Autoregressive Predictive Coding

Yu-An Chung, James Glass
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
ICASSP 2020

SLIDE 2
  • What is self-supervised learning?
  • A form of unsupervised learning where the data itself provides supervision
  • In general, the goal is to predict some part of the data from any other part of it
  • Can leverage large quantities of unlabeled data → cheaper data and richer representations

  • Very successful in Vision and NLP
  • Vision (pretext tasks)
  • Colorization
  • Image patches relationship prediction
  • NLP (pre-training)
  • Masked LM (BERT)
  • Autoregressive LM (GPT)
  • Permutation LM (XLNet)

Self-supervised learning background

Figure examples: relative location prediction [Doersch et al., 2015]; BERT [Devlin et al., 2019]

SLIDE 3
  • Future prediction
  • To predict future audio features from the historical ones
  • Contrastive predictive coding (CPC) [Oord et al., 2018]
  • Autoregressive predictive coding (APC) [Chung et al., 2019]
  • wav2vec [Schneider et al., 2019]
  • Mask prediction
  • To predict masked parts of the input audio signal
  • Mockingjay [Liu et al., 2020]
  • Masked reconstruction [Wang et al., 2020]
  • Multiple self-supervised tasks at the same time
  • Ideally, solving each task contributes prior knowledge to the representation
  • Problem-agnostic speech encoder (PASE) [Pascual et al., 2019]

Self-supervised approaches for speech (non-exhaustive)

SLIDE 4
  • In our previous work (Chung et al., 2019), we:
  • Proposed autoregressive predictive coding (APC)
  • Used RNNs as the backbone architecture
  • Experimented on toy tasks such as phonetic classification
  • In this work, we further explore APC by:
  • Replacing RNNs with Transformers as the backbone architecture
  • Experimenting on real-world applications such as ASR, speech translation, and speaker identification, comparing with CPC and PASE features

  • Investigating the usefulness of the representations in the low-resource regime, where only small amounts of labeled speech data are available

APC is a simple yet effective generative pre-training method for speech applications

What this work is about

SLIDE 5

Autoregressive Predictive Coding (APC)

  • Given a previous context (x_1, x_2, …, x_t), APC tries to predict a future audio feature x_{t+n} that is n steps ahead of x_t
  • Uses an autoregressive model g_AR to summarize the history and produce the output
  • n ≥ 1 encourages g_AR to infer more global underlying structures of the data rather than simply exploiting local smoothness of speech signals

Figure: g_AR reads the input acoustic feature sequence (x_1, …, x_T) (e.g., log Mel) and produces an output sequence (y_1, y_2, …); each y_i is trained to match the target x_{i+n} (n = 2 in the example)

Training objective:

  argmin_{g_AR, W}  ∑_{i=1}^{T−n} |x_{i+n} − y_i|,   where y_i = W · g_AR(x_1, …, x_i)

W is a linear transformation that maps g_AR's output back to x_i's dimensionality
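A minimal PyTorch sketch of this objective is shown below (not the authors' released code; the GRU backbone, dimensions, and random batch are illustrative assumptions): each output y_i is compared against the frame n steps ahead under an L1 loss, following the formula above.

    import torch
    import torch.nn as nn

    n = 3                                # predict the frame n steps ahead
    feat_dim, hidden_dim = 80, 512       # log Mel dimension, g_AR hidden size

    g_ar = nn.GRU(feat_dim, hidden_dim, num_layers=4, batch_first=True)
    W = nn.Linear(hidden_dim, feat_dim)  # maps g_AR's output back to feat_dim

    x = torch.randn(8, 1000, feat_dim)   # (batch, frames, features), e.g. log Mel
    h, _ = g_ar(x)                       # h_t summarizes x_1 ... x_t (unidirectional)
    y = W(h)                             # predicted frames y_1, ..., y_T

    # y_i should match x_{i+n}: compare y_1..y_{T-n} against x_{1+n}..x_T
    loss = torch.abs(x[:, n:] - y[:, :-n]).sum()
    loss.backward()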

SLIDE 6

Types of autoregressive model g_AR

  • g_AR
  • Input: x = (x_1, x_2, …, x_T)
  • Output: y = (y_1, y_2, …, y_T)
  • L-layer unidirectional RNN:
  • h^(0) = x;  h^(ℓ) = RNN^(ℓ)(h^(ℓ−1)), ∀ℓ ∈ [1, L];  y = h^(L) · W_out
  • L-layer Transformer decoder blocks:
  • h^(0) = x · W_in + positional encodings;  h^(ℓ) = TRF^(ℓ)(h^(ℓ−1)), ∀ℓ ∈ [1, L];  y = h^(L) · W_out
  • Feature extraction: the hidden states h^(ℓ) serve as the extracted speech representation

Figure: the RNN and the Transformer (decoder) instantiations of g_AR, mapping the input x through hidden layers h^(1), …, h^(L) to the output y

  • Positional encodings, W_in, and W_out are not shown in the figure
  • We keep W_out = W_in^T as regularization in practice
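To make the Transformer variant and the W_out = W_in^T tying concrete, here is a hedged PyTorch sketch (not the paper's implementation): TransformerEncoder layers with a causal mask stand in for unidirectional decoder blocks, and positional encodings are omitted as in the slide.

    import torch
    import torch.nn as nn

    feat_dim, d_model, L = 80, 512, 4
    W_in = nn.Linear(feat_dim, d_model, bias=False)    # input projection
    block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
    trf = nn.TransformerEncoder(block, num_layers=L)   # self-attention only

    x = torch.randn(8, 1000, feat_dim)                 # (batch, frames, features)
    T = x.size(1)
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

    h0 = W_in(x)                  # h^(0) = x · W_in (positional encodings omitted)
    hL = trf(h0, mask=causal)     # each position attends only to itself and the past
    y = hL @ W_in.weight          # y = h^(L) · W_out with W_out = W_in^T (tied)

Tying the output projection to the input projection (the last line) keeps the regularization described above without adding extra parameters.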

SLIDE 7
  • Setup: pre-training + fine-tuning
  • Pre-training data
  • Speech portion of the LibriSpeech 360 hours subset
  • 921 speakers
  • 80-dimensional log Mel spectrograms as input acoustic features (i.e., x_t ∈ ℝ^80)
  • Use extracted features to replace log Mel as new inputs to downstream models (see the sketch below)
  • Considered downstream tasks
  • Speech recognition
  • Speech translation
  • Speaker identification (skipped in this talk, see paper!)
  • Comparing methods
  • Contrastive predictive coding (CPC)
  • Problem-agnostic speech encoder (PASE)

Transfer learning experiments
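As a schematic of this transfer setup (the frozen-feature setting evaluated in the later slides), here is a short sketch; `g_ar` reuses the hypothetical GRU backbone from the earlier sketch, and `downstream_model` is a stand-in name, not the released code.

    import torch

    @torch.no_grad()
    def extract_features(g_ar, log_mel):
        """Frozen setting: replace 80-dim log Mel inputs with g_AR hidden states."""
        g_ar.eval()
        h, _ = g_ar(log_mel)        # (batch, frames, hidden_dim)
        return h

    # Schematic downstream step: only the downstream model's parameters update.
    # feats = extract_features(g_ar, log_mel_batch)
    # loss = downstream_model(feats, labels)
    # loss.backward(); optimizer.step()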

SLIDE 8
  • Considered dataset: Wall Street Journal
  • Training: 90% of si284 (~ 72 hours of audio)
  • Validation: 10% of si284
  • Test: dev93
  • APC g_AR
  • RNNs: 4-layer, 512-dim GRUs
  • Transformers: 4-layer, 512-dim Transformer decoder blocks
  • Downstream ASR model
  • Seq2seq with attention [Chorowski et al., 2015]
  • Beam search with beam size = 5
  • No language model rescoring

Speech Recognition

SLIDE 9

Choice of n, and whether to fine-tune g_AR

Figure: WER (roughly 12 to 26) versus n ∈ {1, 2, 3, 5, 10, 20} for log Mel, R-APC (Scratch / Frozen / Finetuned), and T-APC (Scratch / Frozen / Finetuned)

Notations

  • R stands for RNN
  • T stands for Transformer
  • Scratch: g_AR randomly initialized, stacked with the ASR model, and trained from scratch
  • Frozen: g_AR kept frozen when training the ASR model
  • Finetuned: g_AR fine-tuned along with the ASR model

Findings

  • A sweet spot exists for both Frozen and Finetuned when varying n
  • Scratch performance is poor, even worse than the log Mel baseline
  • APC outperforms log Mel most of the time
  • For both R and T, Frozen outperforms Finetuned
  • Will use R-APC Frozen with n = 3 and T-APC Frozen with n = 5 for the rest of the talk

SLIDE 10

WER versus the proportion of si284 used for ASR training:

  Features   1      1/2    1/4    1/8    1/16   1/32
  log Mel    18.3   24.1   33.4   44.6   66.4   87.7
  CPC        20.7   28.3   38.8   50.9   69.7   88.1
  R-APC      15.2   18.3   24.6   35.8   49.0   66.8
  T-APC      13.7   16.4   21.3   31.4   43.0   63.2
  PASE       20.8   26.6   32.8   42.1   58.8   78.6

APC for reducing the amount of labeled training data

Recap: all feature extractors were pre-trained with 360 hours of LibriSpeech data; we did not fine-tune any feature extractor with the ASR model

Findings

  • Full set:
    § 25% and 17% relative improvement for T-APC (13.7) and R-APC (15.2) over the log Mel baseline (18.3), respectively
  • As we decrease the amount of training data:
    § T-APC and R-APC always outperform the other methods
    § The gap between T-APC / R-APC and log Mel becomes larger
    § Using just half of si284, T-APC (16.4) already outperforms log Mel trained on the full set (18.3)
  • In the paper we also have the figure where all feature extractors were pre-trained on only 10 hrs of LibriSpeech data. TL;DR: pre-training still helps even with just 10 hrs of pre-training data

SLIDE 11

WER versus the number of encoder layers in the ASR model:

  Features   1 layer   2 layers   3 layers   4 layers (original)
  log Mel    28.8      23.5       20.8       18.3
  CPC        45.4      29.8       25.2       20.7
  R-APC      26.2      20.3       17.6       15.2
  T-APC      25.2      18.6       15.8       13.7
  PASE       29.4      25.7       22.5       20.8

APC for reducing downstream model size

Note: all models trained on full si284

Findings

  • T-APC and R-APC always outperform the other methods
  • T-APC with just 2 layers (18.6) performs similarly to log Mel with 4 layers (18.3)

SLIDE 12
  • Considered dataset: LibriSpeech En-Fr
  • Training set has around 100 hrs of audio
  • Report BLEU scores on test set
  • Downstream speech translation model
  • RNN-based seq2seq with attention model [Berard et al., 2018]
  • Also compare with two other baselines
  • Cascaded system (ASR + MT)
  • S-Transformer (end-to-end SOTA) [Di Gangi et al., 2019]

Speech Translation

SLIDE 13

BLEU scores on the LibriSpeech En-Fr test set:

  System / Features      BLEU
  Cascaded (ASR + MT)    14.6
  S-Transformer          13.8
  log Mel                12.9
  CPC                    12.5
  PASE                   12.4
  R-APC                  13.8
  T-APC                  14.3

Speech translation results

Findings

  • 11% and 7% relative improvement for T-APC (14.3) and R-APC (13.8) over log Mel (12.9), respectively
  • T-APC (14.3) outperforms the end-to-end SOTA S-Transformer with log Mel input (13.8)
  • Since S-Transformer is larger than our RNN-based seq2seq model, this result also suggests that using APC features can reduce downstream model size for speech translation
  • T-APC (14.3) is close to the cascaded system (14.6)

SLIDE 14

We empirically demonstrate that APC is a simple yet effective pre-training strategy for speech

  • Can leverage large quantities of unlabeled data
  • Architecture-agnostic: any autoregressive model can be used as the backbone; in this paper we explored Transformer and RNN

  • Learns general speech representations that can be transferred to different speech applications and outperform the log Mel baseline and other self-supervised representations

  • Makes downstream model training more efficient in terms of both labeled data and model size

Conclusions

SLIDE 15

Thank you! Questions?

Slides: http://people.csail.mit.edu/andyyuan/docs/icassp-20.generative.slides.pdf Code: https://github.com/iamyuanchung/Autoregressive-Predictive-Coding

SLIDE 16
  • [Doersch et al., 2015] Unsupervised visual representation learning by context prediction, ICCV
  • [Devlin et al., 2019] BERT: Pre-training of deep bidirectional Transformers for language understanding, NAACL-HLT
  • [Oord et al., 2018] Representation learning with contrastive predictive coding, arXiv
  • [Chung et al., 2019] An unsupervised autoregressive model for speech representation learning, Interspeech
  • [Schneider et al., 2019] wav2vec: Unsupervised pre-training for speech recognition, Interspeech
  • [Liu et al., 2020] Mockingjay: Unsupervised speech representation learning with deep bidirectional Transformer encoders, ICASSP

  • [Wang et al., 2020] Unsupervised pre-training of bidirectional speech encoders via masked reconstruction, ICASSP
  • [Pascual et al., 2019] Learning problem-agnostic speech representations from multiple self-supervised tasks, Interspeech

  • [Chorowski et al., 2015] Attention-based models for speech recognition, NIPS
  • [Berard et al., 2018] End-to-end automatic speech translation of audiobooks, ICASSP
  • [Di Gangi et al., 2019] Adapting Transformer to end-to-end spoken language translation, Interspeech

References