  1. An Unsupervised Autoregressive Model for Speech Representation Learning
     Yu-An Chung, Wei-Ning Hsu, Hao Tang, James Glass
     Computer Science and Artificial Intelligence Laboratory
     Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
     Interspeech 2019, Graz, Austria, September 16, 2019

  2. Why representation learning?
     • Speech signals are complicated
       – Contain rich acoustic and linguistic properties (e.g., lexical content, speaker characteristics)
     • High-level properties are important but poorly captured by surface features
       – E.g., waveforms, log Mel spectrograms, MFCCs
       – Require a large model to learn the feature transformation from surface features
       – Need large amounts of paired audio and text for supervised learning
     • Representation learning: a two-step procedure
       1) Learn a transformation function f(x) that maps a surface feature x into a higher-level, more accessible form
       2) Use f(x) instead of x as the input to the downstream model (see the probing sketch below)
     • Linear separability as accessibility
       [Slide figure: the same binary (+1/-1) classification problem shown in the original x-space and in the f(x)-space, where the classes become linearly separable and the classifier is easier to learn]
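The following sketch illustrates this two-step recipe with a linear probe. It is not from the slides: the toy arrays and the `pretrained_encoder` stand-in (here just a fixed random projection) are hypothetical placeholders for a real corpus and a real unsupervised encoder such as APC.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-ins for surface features x (e.g. 80-dim log Mel frames) and
# frame-level labels y; in practice these come from a speech corpus.
x_train, y_train = rng.normal(size=(1000, 80)), rng.integers(0, 40, size=1000)
x_test, y_test = rng.normal(size=(200, 80)), rng.integers(0, 40, size=200)

def pretrained_encoder(x):
    """Placeholder for a frozen unsupervised encoder f(x) such as APC.
    Here it is just a fixed random projection, purely for illustration."""
    w = np.random.default_rng(1).normal(size=(x.shape[1], 512))
    return x @ w

# Step 1: transform surface features into the learned representation space.
f_train, f_test = pretrained_encoder(x_train), pretrained_encoder(x_test)

# Step 2: fit a *linear* classifier on f(x); its accuracy is a proxy for
# how linearly separable (i.e., accessible) the representation is.
probe = LogisticRegression(max_iter=1000).fit(f_train, y_train)
print("linear-probe accuracy:", probe.score(f_test, y_test))
```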

  3. Why unsupervised learning of f(x)?
     • Unlabeled data are (much) cheaper
       – Vision: a one-time collection of large-scale labeled data may be okay
       – Language: infeasible to collect labeled data for all languages
     • Less likely to learn overly specialized representations; sometimes the target task is unknown
     • Our goal for f(x): retain as much information about x as possible, while making it more accessible for (possibly unknown) downstream usage

  4. Learning f(x) via Autoregressive Predictive Coding (APC)
     • Basic idea: given the frames up to the current one, x_1, x_2, ..., x_t, APC tries to predict a future frame x_{t+n} that is n steps ahead
       – Use an autoregressive RNN to summarize the history and produce the predictions
       – n >= 1 encourages the encoder to infer more global structure rather than exploiting local smoothness
       – [Slide diagram, drawn for n = 2: the input acoustic feature sequence x_1, ..., x_N (e.g., log Mel) feeds the RNN, whose output sequence y_1, ..., y_{N-n} is matched against the target sequence x_{1+n}, ..., x_N]
     • Training (sketched in code below): minimize the L1 distance between each prediction and the frame n steps ahead,
       argmin_θ Σ_{i=1}^{N-n} | x_{i+n} - y_i |,  where y_i = RNN(x_1, ..., x_i)
     • Feature extraction: take the RNN output at every time step,
       f_APC(x)_t = RNN(x_1, ..., x_t)  for t = 1, 2, ..., N
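A minimal PyTorch sketch of this objective, written for illustration only: it simplifies the paper's model (e.g., no residual connections between LSTM layers), and the batch of random tensors stands in for real log Mel sequences.

```python
import torch
import torch.nn as nn

class APC(nn.Module):
    """Simplified APC: an LSTM summarizes x_1..x_t and predicts the frame n steps ahead."""
    def __init__(self, feat_dim=80, hidden=512, layers=3):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim)  # map RNN states back to the feature space

    def forward(self, x):                 # x: (batch, time, feat_dim)
        h, _ = self.rnn(x)                # h[:, t] depends only on x[:, :t+1]
        return self.proj(h), h            # predictions y_t and hidden features f(x)_t

n = 3                                     # predict n steps ahead
model = APC()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 200, 80)               # toy batch standing in for log Mel sequences
pred, _ = model(x)
loss = torch.abs(x[:, n:] - pred[:, :-n]).sum(-1).mean()   # L1 between y_i and x_{i+n}
loss.backward()
opt.step()

# Feature extraction: the RNN outputs at every time step serve as f_APC(x)
with torch.no_grad():
    _, features = model(x)                # shape (8, 200, 512)
```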

  5. Comparing with Contrastive Predictive Coding (CPC)*
     • Architecture
       – APC is almost a pure RNN
       – CPC consists of a CNN as the frame encoder and an RNN as the context encoder
     • Training objective (contrasted in the sketch below)
       – APC predicts a future frame x_{t+n} directly
       – CPC distinguishes x_{t+n} from a set of randomly sampled negative frames x̃
     • Learned f(x)
       – f_CPC(x) encodes the information most discriminative between x_{t+n} and x̃
         * E.g., x̃ sampled from the same vs. a different utterance as x_{t+n}
         * Better to know what the downstream task is when choosing the sampling strategy
       – f_APC(x) encodes information sufficient for predicting future frames, so it is more likely to retain information about the original signal
     * Representation Learning with Contrastive Predictive Coding, Oord et al., 2018
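A toy side-by-side of the two objectives (not from the slides; the shapes are arbitrary, and real CPC scores positives and negatives through a learned transform rather than a plain dot product):

```python
import torch
import torch.nn.functional as F

B, D, K = 8, 80, 10                       # batch, feature dim, number of negatives
c_t = torch.randn(B, D)                   # context / prediction at step t
x_future = torch.randn(B, D)              # the true frame n steps ahead
x_neg = torch.randn(B, K, D)              # K randomly sampled negative frames

# APC-style objective: regress the future frame directly (L1 distance)
apc_loss = torch.abs(x_future - c_t).sum(-1).mean()

# CPC-style (InfoNCE) objective: pick the true future frame out of the negatives
pos = (c_t * x_future).sum(-1, keepdim=True)       # (B, 1) score of the true frame
neg = torch.einsum("bd,bkd->bk", c_t, x_neg)       # (B, K) scores of the negatives
logits = torch.cat([pos, neg], dim=1)              # the true frame is class 0
cpc_loss = F.cross_entropy(logits, torch.zeros(B, dtype=torch.long))
```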

  6. Experiments
     • LibriSpeech 360-hour subset (921 speakers) for training all feature extractors (i.e., all APC and CPC variants)
     • 80-dimensional log Mel spectrograms as the input (surface) features
       – Normalized to zero mean and unit variance per speaker (see the sketch below)
     • Examine two important characteristics of speech: the phone and speaker information contained in the extracted features
       – Phone classification on WSJ
       – Speaker verification on TIMIT
     • Using different corpora also tests whether the representations generalize across domains
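A sketch of per-speaker mean/variance normalization, for illustration only (the data layout, a list of (speaker_id, features) pairs, is an assumption, not the paper's actual pipeline):

```python
import numpy as np
from collections import defaultdict

def normalize_per_speaker(utterances, eps=1e-8):
    """utterances: list of (speaker_id, features) pairs, features of shape (T, 80)."""
    by_speaker = defaultdict(list)
    for spk, feats in utterances:
        by_speaker[spk].append(feats)

    # Pool every frame of a speaker to estimate that speaker's per-dimension statistics
    stats = {spk: (np.concatenate(fs).mean(0), np.concatenate(fs).std(0))
             for spk, fs in by_speaker.items()}

    return [(spk, (feats - stats[spk][0]) / (stats[spk][1] + eps))
            for spk, feats in utterances]
```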

  7. Model Hyperparameters
     • APC architecture (a rough reconstruction is sketched below)
       – ℓ-layer LSTMs, where ℓ ∈ {1, 2, 3}
       – 512 hidden units in each layer
       – Residual connections between consecutive layers
       – Predict x_{t+n}, where n ∈ {1, 2, 3, 5, 10, 20}
     • CPC architecture
       – Mainly follows the original implementation
       – Change the frame encoder (to take log Mel spectrograms as inputs)
         * Original: 5-layer strided CNN
         * New: 3-layer, 512-dim fully-connected NN w/ ReLU activations
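One possible reading of those bullets in code (my reconstruction, not the authors' implementation; the released repository linked at the end is the authoritative version):

```python
import torch
import torch.nn as nn

class ResidualLSTMEncoder(nn.Module):
    """Multi-layer LSTM with residual connections between consecutive layers."""
    def __init__(self, feat_dim=80, hidden=512, num_layers=3):
        super().__init__()
        dims = [feat_dim] + [hidden] * num_layers
        self.layers = nn.ModuleList(
            nn.LSTM(dims[i], dims[i + 1], batch_first=True) for i in range(num_layers)
        )

    def forward(self, x):
        h = x
        for i, lstm in enumerate(self.layers):
            out, _ = lstm(h)
            # Residual connection once the dimensions match (i.e., after the first layer)
            h = out + h if i > 0 else out
        return h  # (B, T, 512): used both for prediction and as the extracted features

enc = ResidualLSTMEncoder()
feats = enc(torch.randn(2, 100, 80))   # toy input: 2 utterances, 100 frames, 80-dim log Mel
```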

  8. Phone Classification on Wall Street Journal
     • Data split:
       – Train set: 90% of si284
       – Dev set: 10% of si284
       – Test set: dev93
     • Task: predict the phoneme class of each frame; report the frame error rate (FER)
     • Linear separability among phoneme classes as accessibility to downstream models
       – Comparing x + {linear classifier, MLP}, f_CPC(x) + linear classifier, and f_APC(x) + linear classifier (probe sketched below)
         * x: log Mel features
         * f_CPC(x): representations extracted by CPC
         * f_APC(x): representations extracted by APC
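A bare-bones frame-level linear probe and the FER metric, with toy tensors; the class count (42 here) is a placeholder and not necessarily the label set used in the paper:

```python
import torch
import torch.nn as nn

num_phones, feat_dim = 42, 512             # placeholder class count, APC feature dim
probe = nn.Linear(feat_dim, num_phones)    # linear classifier on top of frozen features
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

feats = torch.randn(1000, feat_dim)                # frozen f_APC(x) frames (toy)
labels = torch.randint(0, num_phones, (1000,))     # frame-level phone labels (toy)

loss = nn.functional.cross_entropy(probe(feats), labels)
loss.backward()
opt.step()

# Frame error rate (FER) = fraction of frames whose predicted phone class is wrong
with torch.no_grad():
    fer = (probe(feats).argmax(dim=-1) != labels).float().mean()
    print(f"FER: {fer.item():.3f}")
```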

  9. Phone Classification Results
     • Frame error rate (FER, %); columns are the prediction step n (n is not relevant for rows (a)-(d)):

       Method                         n=1    2      3      5      10     20
       (a) x + linear                 50.0
       (b) x + 1-layer MLP            43.4
       (c) x + 3-layer MLP            41.3
       (d) Best f_CPC(x) + linear     42.1
       (e) f_APC_1(x) + linear        39.4   36.5   35.4   35.6   35.4   37.7
       (f) f_APC_2(x) + linear        38.5   35.6   35.9   35.7   34.6   38.8
       (g) f_APC_3(x) + linear        37.2   36.7   33.5   36.1   37.1   38.8

       f_APC_ℓ(x): ℓ is the number of RNN layers
     • Discussions
       – Best f_CPC(x): 1) training: negatives are sampled from the same utterance as the target frame; 2) feature extraction: the context encoder output is taken instead of the frame encoder output
       – Surface features x with a linear / non-linear classifier, (a)-(c): 1) incorporating non-linearity improves the FER; 2) x + 3-layer MLP outperforms the best f_CPC(x)
       – Comparison of the f_APC_ℓ(x) rows, (e)-(g): 1) a sweet spot exists as n is varied; 2) all significantly outperform (a)-(d)

  10. Speaker Verification on TIMIT
      • Comparing APC with i-vectors and CPC
        – Obtaining the i-vector representations:
          * Train a universal background model (GMM w/ 256 components), an i-vector extractor, and an LDA model on the TIMIT train set
          * Extract 100-dim i-vectors and project them to 24 dimensions with LDA
      • Utterance representation = simple average of the frame representations
      • Report equal error rates (EER) on the dev set; only female-female and male-male pairs are considered (evaluation sketched below)
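A sketch of the utterance-level evaluation (toy embeddings; cosine similarity as the trial score is my assumption, and the real trial list comes from TIMIT):

```python
import numpy as np

def utterance_embedding(frame_reps):
    """Utterance representation = simple average of the frame representations."""
    return frame_reps.mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def equal_error_rate(scores, labels):
    """EER: the operating point where the false-accept rate equals the false-reject rate.
    labels: 1 for same-speaker trials, 0 for impostor trials."""
    order = np.argsort(scores)[::-1]
    labels = np.asarray(labels, dtype=float)[order]
    far = np.cumsum(1 - labels) / max((1 - labels).sum(), 1)   # false-accept rate
    frr = 1 - np.cumsum(labels) / max(labels.sum(), 1)         # false-reject rate
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

rng = np.random.default_rng(0)
emb_a = utterance_embedding(rng.normal(size=(120, 512)))   # toy frame representations
emb_b = utterance_embedding(rng.normal(size=(150, 512)))
score = cosine(emb_a, emb_b)                               # one trial's score
print(score, equal_error_rate([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0]))
```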

  11. Speaker Verification Results
      • Equal error rate (EER, %) on the dev set; columns are the prediction step n (n is not relevant for rows (a) and (b)):

        Method                n=1    2      3      5      10     20
        (a) i-vector          6.64
        (b) Best f_CPC(x)     5.00
        (c) f_APC_1(x)        4.71   4.07   4.14   4.14   5.14   5.29
        (d) f_APC_2(x)        4.71   4.64   5.71   4.86   5.57   6.07
        (e) f_APC_3(x)        5.21   4.93   4.43   4.57   5.79   6.21
        (f) f_APC_3,1(x)      3.79   4.64   4.14   4.29   5.14   5.00
        (g) f_APC_3,2(x)      3.43   3.86   3.79   3.86   4.07   4.86

        f_APC_3,k(x): output of the k-th layer of f_APC_3(x)
      • Discussions
        – f_APC(x) > best f_CPC(x) > i-vector
        – In general, a smaller n captures more speaker information
        – Unlike phone classification, a deeper APC tends to perform worse on speaker verification, rows (c)-(e)
        – Shallow layers contain more speaker information, rows (e)-(g)

  12. Conclusions
      • Autoregressive Predictive Coding for speech representation learning
        – Unsupervised: no labeled data required for training
        – Transforms surface features (e.g., log Mel) into a more accessible form
          * Accessibility is defined as linear separability
        – Extracted representations contain both phone and speaker information
          * Outperform surface features, CPC, and i-vectors
        – In a deep APC, lower layers tend to be more discriminative for speakers, while upper layers provide more phonetic content
      • Code: https://github.com/iamyuanchung/Autoregressive-Predictive-Coding

  13. Thank you! Questions?
