An Unsupervised Autoregressive Model for Speech Representation Learning
Yu-An Chung, Wei-Ning Hsu, Hao Tang, James Glass
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts
Why representation learning?
- Speech signals are complicated
– They contain rich acoustic and linguistic properties (e.g., lexical content, speaker characteristics)
- High-level properties are important but poorly captured by surface features
– E.g., waveforms, log Mel spectrograms, MFCCs
– A large model is required to learn the feature transformation from surface features
– Supervised learning needs large amounts of paired audio and text
- Representation learning: a two-step procedure
1) Learn a transformation function f(·) that maps a surface feature x into a higher-level, more accessible form f(x)
2) Feed f(x), instead of x, to the downstream model
- Linear separability as accessibility
Autoregressive Predictive Coding, Interspeech 2019
[Figure: points with labels +1 / −1 that are not linearly separable in x-space become linearly separable in f(x)-space, where it is easier to learn the classifier]
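As a toy illustration of the figure's point (my own example, not from the talk): XOR-labeled points cannot be split by any linear classifier in the raw x-space, but a simple feature map makes a single linear rule sufficient.

```python
import numpy as np

# Hypothetical toy example: XOR labels are not linearly separable in
# x-space, but become separable after a nonlinear feature map f(x) --
# the role the slides assign to the learned representation.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, +1, +1, -1])  # XOR labels

def f(x):
    """Feature map: append the product of the two coordinates."""
    return np.array([x[0], x[1], x[0] * x[1]])

# In f(x)-space a single linear rule w . f(x) classifies every point:
w = np.array([0.0, 0.0, -1.0])
preds = np.sign(np.array([w @ f(x) for x in X]))
print(preds)  # matches y for every point
```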
Why unsupervised learning of f(x)?
- Unlabeled data are (much) cheaper
– Vision: a one-time collection of large-scale labeled data may be okay
– Language: infeasible to collect labeled data for all languages
- Less likely to learn specialized representations; sometimes the target task is unknown
- Our goal for f(x): retain as much information about x as possible, while making it more accessible for (possibly unknown) downstream usage
Learning f(x) via Autoregressive Predictive Coding (APC)
- Basic idea: given the frames up to the current one (x_1, x_2, …, x_t), APC tries to predict a future frame x_{t+n} that is n steps ahead
– An autoregressive RNN summarizes the history and produces a new output
– n ≥ 1 encourages the encoder to infer more global structures rather than exploiting local smoothness
[Figure: the RNN reads the input acoustic feature sequence (x_1, …, x_T) (e.g., log Mel) and emits the output sequence (y_1, …, y_T); each y_t is trained against the target x_{t+n}; n = 2 in the example]
- Training
– Minimize the L1 distance between each prediction and the frame n steps ahead:
argmin_θ Σ_{t=1}^{T−n} ‖x_{t+n} − y_t‖_1, where y_t = RNN_θ(x_1, …, x_t)
- Feature extraction
– Take the RNN output at each time step: f(x)_t = RNN_θ(x_1, …, x_t), for t = 1, 2, …, T
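The training objective above can be sketched in a few lines; this is a minimal numpy illustration of the shifted L1 loss (not the authors' code, which builds it into an RNN training loop):

```python
import numpy as np

# Minimal sketch of the APC objective: the model's output sequence y is
# scored against the input sequence x shifted n steps into the future,
# under an L1 loss. Only the first T - n outputs have a valid target.
def apc_l1_loss(x, y, n):
    """x: input frames (T, D); y: RNN outputs (T, D); n: prediction shift."""
    assert x.shape == y.shape and n >= 1
    return np.abs(x[n:] - y[:-n]).sum()

# Toy check: if y_t exactly equals x_{t+n}, the loss is zero.
x = np.arange(12, dtype=float).reshape(6, 2)   # 6 frames, 2 dims
y = np.vstack([x[2:], np.zeros((2, 2))])       # y_t = x_{t+2}
print(apc_l1_loss(x, y, n=2))  # 0.0
```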
Comparing with Contrastive Predictive Coding (CPC)
- Architecture
– APC is almost a pure RNN
– CPC consists of a CNN as frame encoder and an RNN as context encoder
- Training objective
– APC predicts a future frame x_{t+n} directly
– CPC distinguishes x_{t+n} from a set of randomly sampled negative frames x̃
- Learned f(x)
– f_CPC(x) encodes the information most discriminative between x_{t+n} and x̃
* E.g., x̃ sampled from the same vs. a different utterance than x_{t+n}
* Better to know what the downstream task is when choosing a sampling strategy
– f_APC(x) encodes information sufficient for predicting future frames, and is more likely to retain information about the original signals
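The contrast between the two objectives can be made concrete with a small sketch (simplified, and the function names here are mine, not from either paper): APC regresses the future frame under L1, while CPC's InfoNCE-style loss only requires the positive future frame to out-score sampled negatives.

```python
import numpy as np

# Simplified sketch of the two objectives (hypothetical helper names).
def apc_loss(pred, future):
    return np.abs(future - pred).sum()          # direct L1 regression

def cpc_loss(context, future, negatives):
    cands = np.vstack([future, negatives])      # row 0 is the positive
    scores = cands @ context                    # dot-product scores
    # InfoNCE: negative log-probability of the positive under a softmax.
    return -np.log(np.exp(scores[0]) / np.exp(scores).sum())

# If the context vector aligns with the future frame and not with the
# negatives, the InfoNCE loss is near zero.
c = np.array([10.0, 0.0])
fut = np.array([1.0, 0.0])
negs = np.array([[0.0, 1.0], [0.0, -1.0]])
print(cpc_loss(c, fut, negs))   # close to 0
```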
* Representation Learning with Contrastive Predictive Coding, Oord et al., 2018
Experiments
- LibriSpeech 360-hour subset (921 speakers) for training all feature extractors (i.e., all APC and CPC variants)
- 80-dimensional log Mel spectrograms as input (surface) features
– Normalized to zero mean and unit variance per speaker
- Examine two important characteristics of speech: the phone and speaker information contained in the extracted features
– Phone classification on WSJ
– Speaker verification on TIMIT
- This also tests whether the representations generalize to datasets from different domains
Model Hyperparameters
- APC architecture
– ℓ-layer LSTMs, where ℓ ∈ {1, 2, 3}
– 512 hidden units per layer
– Residual connections between consecutive layers
– Predict x_{t+n}, where n ∈ {1, 2, 3, 5, 10, 20}
- CPC architecture
– Mainly follows the original implementation
– The frame encoder is changed to take log Mel spectrograms as inputs
* Original: 5-layer strided CNN
* New: 3-layer, 512-dim fully-connected NN with ReLU activations
Phone Classification on Wall Street Journal
- Data split:
– Train set: 90% of si284
– Dev set: 10% of si284
– Test set: dev93
- Task: predict the phoneme class of each frame and report the frame error rate (FER)
- Linear separability among phoneme classes as accessibility to downstream models
– Comparing x + {linear classifier, MLP}, f_CPC(x) + linear classifier, and f_APC(x) + linear classifier
* x: log Mel features
* f_CPC(x): representations extracted by CPC
* f_APC(x): representations extracted by APC
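The reported metric is simple to state precisely; a one-function sketch (my own, for clarity):

```python
import numpy as np

# Frame error rate (FER): the percentage of frames whose predicted
# phoneme class differs from the gold label.
def frame_error_rate(pred, gold):
    pred, gold = np.asarray(pred), np.asarray(gold)
    return 100.0 * (pred != gold).mean()

print(frame_error_rate([3, 3, 7, 1], [3, 2, 7, 5]))  # 50.0
```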
Phone Classification Results
Method ! 1 2 3 5 10 20 (a) " + linear 50.0 (b) " + 1-layer MLP 43.4 (c) " + 3-layer MLP 41.3 (d) Best #$%$ " + linear 42.1 (e) #
&%$_( " + linear
39.4 36.5 35.4 35.6 35.4 37.7 (f) #
&%$_) " + linear
38.5 35.6 35.9 35.7 34.6 38.8 (g) #
&%$_* " + linear
37.2 36.7 33.5 36.1 37.1 38.8
Autoregressive Predictive Coding, Interspeech 2019
- #
&%$_+ " : , is the number of RNN layers
- is not relevant for (a) ~ (d)
Discussions § Best #$%$ " : 1) Training - Sample negatives from same utterance as target frame 2) Feature extraction - Take context encoder output instead of frame encoder output § Surface features " with linear / non-linear classifier (a) ~ (c): 1) Incorporating non-linearity improves FER 2) " + 3-layer MLP outperforms the best #$%$ " § Comparison of #
&%$_+ " (e) ~ (g):
1) Sweep spot exists when we vary - 2) Significantly outperform (a) ~ (d)
Speaker Verification on TIMIT
- Comparing APC with the i-vector baseline and CPC
– Obtaining i-vector representations:
* Train a universal background model (GMM with 256 components), an i-vector extractor, and an LDA model on the TIMIT train set
* Extract 100-dim i-vectors and project them to 24 dims with LDA
- Utterance representation = simple average of the frame representations
- Report equal error rates (EER) on the dev set; only female-female and male-male pairs are considered
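For concreteness, EER can be computed with a simple threshold sweep; this illustrative sketch (not the paper's evaluation code) finds the operating point where the false-acceptance rate (FAR) and false-rejection rate (FRR) cross:

```python
import numpy as np

# Illustrative equal-error-rate (EER) computation via threshold sweep.
def equal_error_rate(scores, labels):
    """scores: one similarity per trial; labels: 1 = same speaker, 0 = different."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    best_gap, eer = np.inf, 100.0
    for t in np.sort(scores):
        far = np.mean(scores[labels == 0] >= t)   # impostors accepted
        frr = np.mean(scores[labels == 1] < t)    # targets rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), 100.0 * (far + frr) / 2
    return eer

# Perfectly separated scores give 0% EER.
print(equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 0.0
```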
Speaker Verification Results (EER, %)
(a) i-vector: 6.64
(b) best f_CPC(x): 5.00
(c) f_APC_1(x) (n = 1/2/3/5/10/20): 4.71 / 4.07 / 4.14 / 4.14 / 5.14 / 5.29
(d) f_APC_2(x) (n = 1/2/3/5/10/20): 4.71 / 4.64 / 5.71 / 4.86 / 5.57 / 6.07
(e) f_APC_3(x) (n = 1/2/3/5/10/20): 5.21 / 4.93 / 4.43 / 4.57 / 5.79 / 6.21
(f) f_APC_3,1(x) (n = 1/2/3/5/10/20): 3.79 / 4.64 / 4.14 / 4.29 / 5.14 / 5.00
(g) f_APC_3,2(x) (n = 1/2/3/5/10/20): 3.43 / 3.86 / 3.79 / 3.86 / 4.07 / 4.86
- f_APC_ℓ,k(x): output of the k-th layer of f_APC_ℓ(x)
- n is not relevant for (a) and (b)
Discussions
§ f_APC > best f_CPC > i-vector
§ In general, a smaller n captures more speaker information
§ Unlike phone classification, a deeper APC tends to perform worse on speaker verification, (c)–(e)
§ Shallow layers contain more speaker information, (e)–(g)
Conclusions
- Autoregressive Predictive Coding for speech representation learning
– Unsupervised: no labeled data required for training
– Transforms surface features (e.g., log Mel) into a more accessible form
* Accessibility is defined as linear separability
– Extracted representations contain both phone and speaker information
* Outperform surface features, CPC, and i-vectors
– In a deep APC, lower layers tend to be more discriminative for speakers, while upper layers provide more phonetic content
- Code: https://github.com/iamyuanchung/Autoregressive-Predictive-Coding
Thank you! Questions?