Vector Quantized Neural Networks for Acoustic Unit Discovery
Benjamin van Niekerk, Leanne Nortje, Herman Kamper
The Generative Factors of Speech
HUMOUR
HH / Y / UW / M / ER
Content: Discrete phonetic units.
Prosody: Rhythm …
Timbre: … spectrum.
The goal is to learn discrete representations of speech that separate phonetic content from the other factors. …all without any labels or annotations!
Bootstrap training of low-resource speech systems:
Automatic speech recognition
Text-to-speech
Non-parallel voice conversion
Advances in Neural Information Processing Systems. 2017.
[Figure: vector quantization, with labels Encoder and Codebook]
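The codebook lookup in a VQ layer can be sketched as follows. This is a minimal NumPy sketch, assuming nearest-neighbour assignment under squared Euclidean distance; the function name `quantize` and the toy sizes are illustrative and not taken from the slides.

```python
import numpy as np

def quantize(z, codebook):
    """Map each encoder output frame to its nearest codebook entry.

    z:        (T, D) array of encoder outputs, one row per frame.
    codebook: (K, D) array of code vectors.
    Returns (indices, quantized): a (T,) int array and a (T, D) array.
    """
    # Squared Euclidean distance between every frame and every code: (T, K).
    distances = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = distances.argmin(axis=1)
    return indices, codebook[indices]

# Toy example (hypothetical sizes): 4 codes of dimension 2, 3 frames.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z = np.array([[0.1, -0.1], [0.9, 0.2], [0.4, 0.8]])
indices, quantized = quantize(z, codebook)
print(indices)
```

The discrete `indices` are the candidate acoustic units; `quantized` is what a decoder or context model would consume.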
Our contribution: we propose and compare two models for acoustic unit discovery in the ZeroSpeech 2020 Challenge.
1. A Vector-Quantized Variational Autoencoder (VQ-VAE)
Inspired by: J. Chorowski, et al. “Unsupervised speech representation learning using wavenet autoencoders.” IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2019.
2. A combination of Vector-Quantization and Contrastive Predictive Coding (VQ-CPC)
Inspired by: A. van den Oord, et al. “Representation Learning with Contrastive Predictive Coding.” 2018.
Encoder → VQ layer → Decoder
Training objective: minimize reconstruction error
VQ layer: information bottleneck
Decoder: powerful autoregressive model, conditioned on the speaker
Input → Encoder → VQ layer → Context model → Predictions
Context vector, positive example, negative examples
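The contrast between the positive example and the negative examples can be sketched as an InfoNCE-style loss. This is a minimal NumPy sketch under assumptions: a single prediction step, a linear predictor `W` (a hypothetical name), and dot-product scores. The slides do not specify these details.

```python
import numpy as np

def info_nce_loss(context, positive, negatives, W):
    """Contrastive loss for one prediction step.

    context:   (C,) context vector from the context model.
    positive:  (D,) the true future code vector.
    negatives: (N, D) code vectors drawn from elsewhere.
    W:         (D, C) linear predictor for this step (assumed form).
    """
    prediction = W @ context                                  # (D,)
    # Stack candidates with the positive example at index 0: (N + 1, D).
    candidates = np.vstack([positive[None, :], negatives])
    scores = candidates @ prediction                          # (N + 1,)
    # Numerically stable log-softmax over the candidates.
    scores = scores - scores.max()
    log_probs = scores - np.log(np.exp(scores).sum())
    # Negative log-probability of picking the positive example.
    return -log_probs[0]

# Toy example (hypothetical values).
W = np.eye(2)
context = np.array([5.0, 0.0])
positive = np.array([1.0, 0.0])
negatives = np.array([[-1.0, 0.0], [0.0, 1.0]])
loss_good = info_nce_loss(context, positive, negatives, W)
loss_bad = info_nce_loss(context, negatives[0], np.vstack([positive, negatives[1]]), W)
```

The loss is small when the prediction scores the positive example above all negatives, and it equals log(N + 1) when all candidates score the same.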
Evaluation Metrics:
Source / Converted / Target / Other Conversion
Triphone A: bug → Encoder
Triphone B: bag → Encoder
Triphone X: bag → Encoder
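The A/B/X comparison can be sketched as a distance test on the encoded triphones: X should be closer to the triphone it matches (B here) than to the other (A). This is a minimal NumPy sketch, assuming a cosine frame distance averaged along a dynamic time warping path; the function names and toy features are hypothetical, and the challenge's official evaluation may differ in detail.

```python
import numpy as np

def cosine_distances(a, b):
    """Pairwise cosine distance between frames of a (Ta, D) and b (Tb, D).

    Assumes no frame is the all-zero vector.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def dtw_distance(a, b):
    """Average frame distance along the minimum-cost DTW alignment path."""
    d = cosine_distances(a, b)
    ta, tb = d.shape
    cost = np.full((ta + 1, tb + 1), np.inf)
    steps = np.zeros((ta + 1, tb + 1), dtype=int)
    cost[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            # Cheapest predecessor: diagonal, vertical, or horizontal move.
            candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
            pi, pj = min(candidates, key=lambda ij: cost[ij])
            cost[i, j] = d[i - 1, j - 1] + cost[pi, pj]
            steps[i, j] = steps[pi, pj] + 1
    return cost[ta, tb] / steps[ta, tb]

def abx_correct(a, b, x):
    """True if X is closer to B than to A (X is assumed to match B)."""
    return dtw_distance(x, b) < dtw_distance(x, a)

# Toy 2-D "encoder outputs" (hypothetical): A = bug, B = bag, X = bag.
bug = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
bag = np.array([[1.0, 0.0], [-1.0, 1.0], [0.0, 1.0]])
x_bag = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 1.1], [0.0, 1.0]])
result = abx_correct(bug, bag, x_bag)
```

Averaging `abx_correct` failures over many triples would give an error rate; a lower rate suggests the encoder separates phonetic content better.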
Encoder
log-Mel spec (100Hz)
conv3(768), batchnorm, ReLU
conv3(768), batchnorm, ReLU
conv4stride2(768), batchnorm, ReLU (100Hz → 50Hz)
conv3(768), batchnorm, ReLU
conv3(768), batchnorm, ReLU

Bottleneck
linear(64)
VQ(512)
jitter(0.5)

Decoder
concat → upsample → biGRU(128) → biGRU(128) → upsample → GRU(896) → linear(256), ReLU → linear(256), ReLU → softmax → sample
Additional inputs: speaker (embedding), mu-law (embedding)
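The mu-law step in the decoder can be sketched as follows. This is a minimal NumPy sketch, assuming standard 8-bit mu-law companding (mu = 255, giving 256 classes). That assumption would be consistent with a softmax over 256 outputs, but the slides do not state the bit depth. The function names are illustrative.

```python
import numpy as np

def mu_law_encode(audio, mu=255):
    """Compand audio in [-1, 1] and quantize it to integer classes 0..mu."""
    audio = np.clip(audio, -1.0, 1.0)
    # Logarithmic companding: finer resolution for quiet samples.
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Map [-1, 1] to [0, mu] and round to the nearest class.
    return np.floor((companded + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(classes, mu=255):
    """Invert mu_law_encode, up to quantization error."""
    companded = 2 * classes.astype(np.float64) / mu - 1
    return np.sign(companded) * np.expm1(np.abs(companded) * np.log1p(mu)) / mu

# Round trip on a toy signal (hypothetical values).
audio = np.linspace(-1.0, 1.0, 101)
classes = mu_law_encode(audio)
restored = mu_law_decode(classes)
```

An autoregressive decoder of this kind would predict a distribution over the classes for the next sample, sample one class, and feed it back as input.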