Vector Quantized Neural Networks for Acoustic Unit Discovery

  1. Vector Quantized Neural Networks for Acoustic Unit Discovery Benjamin van Niekerk, Leanne Nortje, Herman Kamper

  2. The Generative Factors of Speech
     Example: HUMOUR = HH / Y / UW / M / ER
     Content: ● Discrete phonetic units. ● ≅ 44 phonemes in English.
     Prosody: ● Rhythm ● Intonation ● Stresses
     Timbre: ● Quality of a particular voice. ● Characterized by frequency spectrum.

  11. What is Acoustic Unit Discovery? The goal is to learn discrete representations of speech that separate phonetic content from the other factors. …all without any labels or annotations!

  15. Applications. Bootstrap training of low-resource speech systems: ● Automatic speech recognition ● Text-to-speech ● Non-parallel voice conversion

  20. But how do we learn discrete representations using neural networks? A. van den Oord, O. Vinyals, and K. Kavukcuoglu. “Neural discrete representation learning.” Advances in Neural Information Processing Systems. 2017.

  22. Vector Quantization Layer. Figure labels: Encoder, Codebook.
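
The slide only names the two parts of the layer, so the following is a minimal sketch of how a vector quantization layer is commonly described: each encoder output is replaced by its nearest codebook entry. The function name `vector_quantize`, the Euclidean distance, and the toy array values are assumptions for illustration, not details taken from the slides.

```python
import numpy as np

def vector_quantize(z_e, codebook):
    """Map each encoder output to its nearest codebook entry (Euclidean distance).

    z_e:      (T, D) array of encoder outputs, one row per frame.
    codebook: (K, D) array of code vectors.
    Returns the code index per frame and the quantized vectors.
    """
    # Squared distance between every frame and every code: shape (T, K).
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)  # one discrete unit per frame
    z_q = codebook[indices]         # quantized vectors, shape (T, D)
    return indices, z_q

# Toy example with K = 3 codes of dimension D = 2 (hypothetical values).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z_e = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]])
indices, z_q = vector_quantize(z_e, codebook)
print(indices)  # sequence of discrete units, one per frame
```

The index sequence is the discrete representation; training additionally needs a way to pass gradients through the non-differentiable `argmin`, which this sketch does not cover.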

  30. Our contribution: we propose and compare two models for acoustic unit discovery in the ZeroSpeech 2020 Challenge.
      1. A Vector-Quantized Variational Autoencoder (VQ-VAE). Figure labels: Encoder, VQ layer, Decoder. Inspired by: J. Chorowski, et al. “Unsupervised speech representation learning using wavenet autoencoders.” IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2019.
      2. A combination of Vector-Quantization and Contrastive Predictive Coding (VQ-CPC). Inspired by: A. van den Oord, et al. “Representation Learning with Contrastive Predictive Coding.” 2018.

  33. Vector-Quantized Variational Autoencoder. Figure labels: Encoder, VQ layer (information bottleneck), Decoder (powerful autoregressive model), Speaker. Training objective: minimize reconstruction error.
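
For reference, the training objective of the original VQ-VAE paper cited earlier in the deck (van den Oord et al., 2017) can be written as below. Here $x$ is the input, $z_e(x)$ the encoder output, $e$ the selected codebook vector, $z_q(x)$ the quantized output, $\mathrm{sg}[\cdot]$ the stop-gradient operator, and $\beta$ the commitment weight. The slides only state "minimize reconstruction error", so whether the presented model uses exactly this form is an assumption.

```latex
\mathcal{L}
  = \underbrace{-\log p\bigl(x \mid z_q(x)\bigr)}_{\text{reconstruction}}
  + \underbrace{\bigl\lVert \mathrm{sg}[z_e(x)] - e \bigr\rVert_2^2}_{\text{codebook}}
  + \underbrace{\beta \,\bigl\lVert z_e(x) - \mathrm{sg}[e] \bigr\rVert_2^2}_{\text{commitment}}
```

The first term is the reconstruction error named on the slide; the other two pull the codebook vectors and the encoder outputs towards each other.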

  38. Vector-Quantized Contrastive Predictive Coding. Figure labels (bottom to top): Input, Encoder, VQ layer, Context model, Predictions.

  43. Vector-Quantized Contrastive Predictive Coding. Figure labels: Context vector, Positive example, Negative examples.
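
The context vector, positive example, and negative examples named on this slide are the ingredients of a contrastive (InfoNCE-style) loss as used in Contrastive Predictive Coding. The sketch below shows one prediction step under stated assumptions: the function name `info_nce_loss`, the linear prediction matrix `W`, and the dot-product score are illustrative choices, not details confirmed by the slides.

```python
import numpy as np

def info_nce_loss(context, positive, negatives, W):
    """Contrastive loss for a single prediction step.

    context:   (C,)   context vector from the context model.
    positive:  (D,)   the true future code vector.
    negatives: (N, D) distractor code vectors.
    W:         (D, C) linear map from context to a predicted code (assumed).
    """
    prediction = W @ context                                  # (D,)
    candidates = np.vstack([positive[None, :], negatives])    # positive is row 0
    scores = candidates @ prediction                          # (N + 1,)
    scores = scores - scores.max()                            # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())         # log-softmax
    return -log_probs[0]                                      # low when positive scores highest

# Toy example (hypothetical values): the positive aligns with the prediction.
context = np.array([1.0, 0.0])
W = np.eye(2)
loss = info_nce_loss(context,
                     positive=np.array([5.0, 0.0]),
                     negatives=np.array([[-5.0, 0.0], [0.0, 0.0]]),
                     W=W)
print(loss)
```

Minimizing this loss trains the model to pick the true future code out of a set that includes the negatives, without ever reconstructing the waveform.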

  48. Evaluation - Voice Conversion. Figure labels: Encoder, VQ layer, Decoder. Evaluation Metrics: ● Speaker similarity (1-5 scale). ● Intelligibility (character error rate). ● Mean opinion score (1-5 scale).
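
Of the three metrics, character error rate is the one that can be computed mechanically. As a minimal sketch, it is the edit distance between a reference transcript and a transcript of the converted speech, divided by the reference length. The function names below are hypothetical, and how the challenge obtains the transcripts is not stated on the slide.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def character_error_rate(ref, hyp):
    """Edit distance normalized by the number of reference characters."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

print(character_error_rate("humour", "humor"))  # one deletion over six characters
```

Lower is better: a rate of 0 means the transcripts match exactly.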

  50. Evaluation - Voice Conversion. Audio sample labels: Source, Converted, Target, Other Conversion.

  54. Evaluation - ABX Score. Triphone A: bug. Triphone B: bag. Triphone X: bag. Each triphone is passed through the Encoder.
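
The ABX test asks whether the encoded X is closer to the encoded A or to the encoded B. The sketch below makes one such decision using dynamic time warping over frame-wise cosine distances. The choice of cosine distance, the path-length normalization, and all function names are assumptions for illustration, and the toy arrays stand in for encoder outputs with no all-zero frames.

```python
import numpy as np

def cosine_distance_matrix(a, b):
    """Pairwise cosine distance between frames of a (n, D) and b (m, D)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences, normalized by n + m."""
    cost = cosine_distance_matrix(a, b)
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[n, m] / (n + m)

def abx_closer_to(a, b, x):
    """Return 'A' if X is closer to A than to B, otherwise 'B'."""
    return "A" if dtw_distance(a, x) < dtw_distance(b, x) else "B"

# Toy features (hypothetical): X resembles B, as in the bug / bag / bag example.
feat_a = np.array([[1.0, 0.0], [1.0, 0.0]])
feat_b = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
feat_x = np.array([[0.1, 1.0], [0.0, 1.0]])
print(abx_closer_to(feat_a, feat_b, feat_x))
```

An ABX score is then the fraction of such triples for which the representation makes the wrong choice, so lower is better.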

  59. Questions?

  60. Vector Quantized Variational Autoencoder (architecture figure; layer labels as extracted)
      Encoder: input log-Mel spec at 100Hz; four conv 3 (768) layers and one conv 4 stride 2 (768) layer, each with batchnorm and ReLU; output at 50Hz.
      Bottleneck: linear(64), VQ(512).
      Decoder labels include: jitter(0.5), embedding, concat, upsample, biGRU(128), biGRU(128), upsample, GRU(896), linear(256), ReLU, linear(256), softmax, sample (mu-law).
      Conditioning input: speaker (embedding).
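
The decoder labels mention a softmax and mu-law samples. As a minimal sketch, mu-law companding maps a waveform in [-1, 1] to a small set of integer classes that a softmax can predict. The value `mu=255` (256 classes, matching the `linear(256)` label) is an assumption, as are the function names; the slide does not state the bit depth.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand samples in [-1, 1] and quantize to integers in [0, mu]."""
    x = np.clip(x, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(q, mu=255):
    """Invert mu_law_encode up to quantization error."""
    compressed = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

samples = np.array([-1.0, -0.1, 0.0, 0.1, 1.0])
classes = mu_law_encode(samples)
restored = mu_law_decode(classes)
print(classes, restored)
```

The logarithmic spacing gives quiet samples finer resolution than loud ones, which is why the round trip is most accurate near zero.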
