Vector Quantized Neural Networks for Acoustic Unit Discovery - PowerPoint PPT Presentation



SLIDE 1

Vector Quantized Neural Networks for Acoustic Unit Discovery

Benjamin van Niekerk, Leanne Nortje, Herman Kamper

SLIDE 2

The Generative Factors of Speech

HH / Y / UW / M / ER → "HUMOUR"

Content:

  • Discrete phonetic units.
  • ≈44 phonemes in English.

Prosody:

  • Rhythm
  • Intonation
  • Stress

Timbre:

  • Quality of a particular voice.
  • Characterized by the frequency spectrum.


SLIDE 11

What is Acoustic Unit Discovery?

The goal is to learn discrete representations of speech that separate phonetic content from the other factors, all without any labels or annotations!


SLIDE 15

Applications

Bootstrap training of low-resource speech systems:

  • Automatic speech recognition
  • Text-to-speech
  • Non-parallel voice conversion


SLIDE 19

But how do we learn discrete representations using neural networks?

  • A. van den Oord and O. Vinyals, "Neural Discrete Representation Learning," Advances in Neural Information Processing Systems, 2017.

SLIDE 21

Vector Quantization Layer

Encoder → Codebook
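At its core the VQ layer on these slides is a nearest-neighbour lookup: each continuous encoder output is replaced by the closest vector in a learned codebook. A minimal numpy sketch (illustrative, not the authors' code; in training the non-differentiable argmin is bypassed with a straight-through gradient estimator):

```python
import numpy as np

def quantize(z, codebook):
    """Map each encoder output frame to its nearest codebook vector.

    z:        (T, D) array of continuous encoder outputs
    codebook: (K, D) array of learned code vectors
    Returns the discrete unit index per frame and the quantized vectors.
    """
    # Squared Euclidean distance between every frame and every code: (T, K)
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)      # discrete acoustic unit per frame
    return idx, codebook[idx]       # shapes (T,) and (T, D)

# Toy example with the codebook size used in the paper's bottleneck, VQ(512).
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))
z = rng.normal(size=(100, 64))      # 100 frames of 64-dim features
idx, z_q = quantize(z, codebook)
```

The sequence of indices `idx` is the discovered acoustic-unit transcription of the utterance.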

SLIDE 30

Our contribution: we propose and compare two models for acoustic unit discovery in the ZeroSpeech 2020 Challenge.

  1. A vector-quantized variational autoencoder (VQ-VAE), inspired by J. Chorowski et al., "Unsupervised Speech Representation Learning Using WaveNet Autoencoders," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019.

  2. A combination of vector quantization and contrastive predictive coding (VQ-CPC), inspired by A. van den Oord et al., "Representation Learning with Contrastive Predictive Coding," 2018.

Encoder → VQ layer → Decoder

SLIDE 33

Vector-Quantized Variational Autoencoder

Encoder → VQ layer → Decoder

  • Trained to minimize reconstruction error.
  • The VQ layer acts as an information bottleneck.
  • The decoder is conditioned on speaker identity.
  • The decoder is a powerful autoregressive model.
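The training objective behind these slides combines the three standard VQ-VAE terms from van den Oord et al. (2017). A numpy sketch of the terms for clarity (illustrative only: a real implementation applies the stop-gradients noted in the comments, and this paper's decoder reconstructs mu-law samples with a cross-entropy loss rather than the MSE used here):

```python
import numpy as np

def vq_vae_loss_terms(x, x_hat, z_e, z_q, beta=0.25):
    """The three VQ-VAE objective terms, spelled out for clarity.

    x, x_hat: target signal and its reconstruction
    z_e:      continuous encoder output, shape (T, D)
    z_q:      nearest codebook vectors for z_e, shape (T, D)
    beta:     commitment weight (0.25 is the value from the VQ-VAE paper)
    """
    recon = np.mean((x - x_hat) ** 2)          # reconstruction error (MSE for simplicity)
    codebook = np.mean((z_e - z_q) ** 2)       # moves codes toward encoder outputs
                                               # (stop-gradient on z_e in practice)
    commit = beta * np.mean((z_e - z_q) ** 2)  # keeps the encoder committed to its code
                                               # (stop-gradient on z_q in practice)
    return recon, codebook, commit

r, c, m = vq_vae_loss_terms(
    x=np.zeros(4), x_hat=np.ones(4),
    z_e=np.full((3, 2), 2.0), z_q=np.ones((3, 2)))
```

Because the VQ indices carry so few bits per frame, the decoder must get speaker and prosody information from elsewhere (the speaker conditioning and its own autoregression), which is what pushes the codes toward purely phonetic content.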

SLIDE 38

Vector-Quantized Contrastive Predictive Coding

Input → Encoder → VQ layer → Context model → Predictions

SLIDE 43

Vector-Quantized Contrastive Predictive Coding

From each context vector, the model is trained to pick out a positive example (a true future code) from a set of negative examples.
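The picking-out step on these slides is the InfoNCE objective from the CPC paper. A minimal sketch of one prediction step (an illustration of the idea, not the authors' exact code):

```python
import numpy as np

def info_nce(context, positive, negatives):
    """Contrastive loss for one prediction step (InfoNCE-style sketch).

    context:   (D,) context vector produced by the context model
    positive:  (D,) the code the model should prefer
    negatives: (N, D) distractor codes
    Low when the positive scores highest under dot-product similarity.
    """
    candidates = np.vstack([positive[None, :], negatives])  # (N+1, D)
    scores = candidates @ context                           # similarity to context
    log_probs = scores - np.log(np.exp(scores).sum())       # log-softmax
    return -log_probs[0]                                    # positive sits at index 0

ctx = np.array([1.0, 0.0])
loss_easy = info_nce(ctx, np.array([1.0, 0.0]), np.array([[0.0, 1.0]]))  # aligned positive
loss_hard = info_nce(ctx, np.array([0.0, 1.0]), np.array([[1.0, 0.0]]))  # misaligned positive
```

Minimizing this loss forces the codes to capture whatever is predictable across time, i.e. phonetic content, without ever reconstructing the waveform.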

SLIDE 48

Evaluation - Voice Conversion

Encoder → VQ layer → Decoder

Evaluation metrics:

  • Speaker similarity (1-5 scale).
  • Intelligibility (character error rate).
  • Mean opinion score (1-5 scale).

SLIDE 54

Evaluation - ABX Score

Triphone A: bug   Triphone B: bag   Triphone X: bag

Each triphone is passed through the encoder; the trial is correct if the representation of X is closer to that of B (the matching triphone) than to that of A.
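A single ABX trial can be sketched as follows (the official ZeroSpeech metric aligns frame sequences with DTW before comparing; using mean feature vectors here is a simplifying assumption):

```python
import numpy as np

def abx_correct(a, b, x):
    """One ABX trial: X belongs to the same phone category as B, A differs.
    The trial counts as correct when X's representation is closer to B's
    than to A's. Comparing utterance-mean vectors is a simplification of
    the DTW-based distance used in the real evaluation.
    """
    dist = lambda u, v: np.linalg.norm(u.mean(axis=0) - v.mean(axis=0))
    return dist(x, b) < dist(x, a)

# Toy encodings standing in for "bug" (A), "bag" (B), and another "bag" (X).
a = np.zeros((5, 4))        # triphone A features
b = np.ones((8, 4))         # triphone B features
x = np.full((6, 4), 0.9)    # X resembles B
```

The ABX error rate is the fraction of trials that come out incorrect, averaged over many triphone pairs and speakers.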

SLIDE 59

Questions?

SLIDE 60

Vector-Quantized Variational Autoencoder: Architecture

Encoder (log-Mel spectrogram input, 100 Hz):

  • conv3(768), batchnorm, ReLU
  • conv3(768), batchnorm, ReLU
  • conv4, stride 2 (768), batchnorm, ReLU (downsamples 100 Hz to 50 Hz)
  • conv3(768), batchnorm, ReLU
  • conv3(768), batchnorm, ReLU

Bottleneck (50 Hz):

  • linear(64), VQ(512)
  • jitter(0.5), embedding

Decoder:

  • concat, upsample
  • biGRU(128), biGRU(128), upsample
  • GRU(896)
  • linear(256), ReLU
  • linear(256), ReLU
  • softmax, sample; inputs: mu-law embedding, speaker embedding
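The 100 Hz and 50 Hz annotations above pin down the rate at which discrete units are emitted. A small bookkeeping sketch (the sample rate and hop length are assumed typical values, not stated on the slide):

```python
def code_rate_hz(sample_rate=16000, hop_length=160, conv_stride=2):
    """Frame-rate bookkeeping for the encoder above. Log-Mel frames at
    sample_rate / hop_length Hz (100 Hz with these assumed defaults) are
    halved once by the single stride-2 convolution, so the VQ layer emits
    one discrete code every 20 ms of audio."""
    mel_rate = sample_rate / hop_length   # 100.0 Hz with the defaults
    return mel_rate / conv_stride         # 50.0 Hz at the bottleneck

rate = code_rate_hz()
```

With VQ(512), each 20 ms segment is thus summarized by one of 512 codes, i.e. at most 9 bits per code, which is the information bottleneck the earlier slides rely on.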