Neural Discrete Representation Learning (VQ-VAE)



SLIDE 1

Neural Discrete Representation Learning (VQ-VAE)

Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu Google Deepmind NIPS 2017

SLIDE 2

Neural Discrete Representation Learning

  • 1. What is the task?
  • 2. Comparison & Contribution
  • 3. VQ-VAE Model
  • 4. Results
    1. Density estimation & Reconstruction
    2. Sampling
    3. Speech
  • 5. Discussion & Conclusion
SLIDE 3

What is the task?

  • Task 1: Density estimation: learn p(x)
  • Task 2: Extract meaningful latent variable (unsupervised)
  • Task 3: Reconstruct input

[Diagram: input x → latent z → output x′]

SLIDE 4

Comparison & Contribution

  • 1. Bounds p(x), but does not require a variational approximation
  • 2. Trains using maximum likelihood (stable training)
  • 3. First to use discrete latent variables successfully
  • 4. Uses the whole latent space (avoids 'posterior collapse')

Example image caption: "A little girl sitting on a bed with a teddy bear."

After discussion: why is discrete nice? It is a more natural representation for humans; it avoids posterior collapse (you can more easily manage your latent space via the dictionary); it is compressible; and it is easier to learn a prior over a discrete latent space than over a continuous one (more tractable).

SLIDE 5

Autoencoder

For the example, we take the input to be a 4 × 4 image with 2 channels. We can train this system end-to-end using MSE (reconstruction loss). [Diagram: input → latent variable → output (reconstruction)]

How to discretize?

SLIDE 6

How to Discretize?

We have a 4 × 4 image with 2 channels, so each latent pixel e has 2 dimensions (channel 1, channel 2). We plot all 16 pixel values as points in 2D.

SLIDE 7

How to Discretize?

4 × 4 image with 2 channels. Make a dictionary of vectors f_1, …, f_L; each f_j has 2 dimensions.

SLIDE 8

How to Discretize?

4 × 4 image with 2 channels. For each latent pixel, look up the nearest dictionary element f_j. [Scatter plot: latent pixels with dictionary vectors f_1, f_2, f_3]
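The nearest-dictionary lookup above can be sketched in NumPy (a minimal illustration of the idea, not the paper's implementation; the random values and the dictionary size L = 3 are made up, while the shapes follow the 4 × 4 × 2 running example):

```python
import numpy as np

rng = np.random.default_rng(0)

# 4 x 4 latent image with 2 channels -> 16 latent vectors of dimension 2
z_e = rng.normal(size=(16, 2))

# dictionary of L = 3 vectors f_1, ..., f_L, each 2-dimensional
codebook = rng.normal(size=(3, 2))

# squared Euclidean distance from every latent pixel to every dictionary vector
dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (16, 3)

# each latent pixel is replaced by its nearest dictionary element
ids = dists.argmin(axis=1)     # discrete code per pixel, values in {0, 1, 2}
z_q = codebook[ids]            # quantized latents, shape (16, 2)
```

The broadcasted subtraction computes all 16 × 3 pairwise distances at once; `argmin` then picks the dictionary index per latent pixel.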

SLIDE 9

How to Discretize?

4 × 4 image with 2 channels; each f_j has 2 dimensions. [Scatter plot: latent pixels snapped to their nearest dictionary vector, e.g. f_3]

SLIDE 10

Proposed Model

The latent is a 1-channel image containing, for each pixel, the id of its dictionary element (discrete). [Diagram: input → latent variable → output (reconstruction)]
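To make this concrete, a hypothetical NumPy sketch (the id values and the tiny codebook below are made up for illustration): the discrete latent is just a 4 × 4 map of integer ids, and the decoder consumes the looked-up dictionary vectors, not the ids themselves.

```python
import numpy as np

codebook = np.array([[0.0, 1.0],
                     [1.0, 0.0],
                     [1.0, 1.0]])  # 3 dictionary vectors, each 2-dimensional

# discrete latent: a 4 x 4, 1-channel image holding the id of the nearest
# dictionary vector for each latent pixel
ids = np.array([[0, 1, 1, 2],
                [2, 0, 0, 1],
                [1, 2, 0, 0],
                [0, 0, 2, 1]])

# decoding starts with an embedding lookup: ids -> 2-channel latent
z_q = codebook[ids]            # shape (4, 4, 2)
```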

SLIDE 11

How to train?

  • No time to discuss… See slides 18-19
  • Let's talk about the results
SLIDE 12

R1: Density Estimation & Reconstructions

  • Comparable with VAE on CIFAR-10 in terms of density estimation
  • Reconstructions on ImageNet are very good

ImageNet original: 128 × 128 × 3 × 8 = 393216 bits = 48 KB. Reconstruction latent: 32 × 32 × 9 = 9216 bits ≈ 1 KB.
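The compression figures above follow from simple bit counting (a sketch; KB here means 1024-byte kilobytes, and the 9 bits per latent correspond to a 2**9 = 512-entry codebook):

```python
# original ImageNet image: 128 x 128 pixels, 3 channels, 8 bits per value
original_bits = 128 * 128 * 3 * 8            # 393216 bits

# discrete latent: a 32 x 32 map of ids, 9 bits each (512-entry codebook)
latent_bits = 32 * 32 * 9                    # 9216 bits

print(original_bits, original_bits // 8 // 1024)   # 393216 bits = 48 KB
print(latent_bits, latent_bits / 8 / 1024)         # 9216 bits ~ 1.1 KB
print(original_bits // latent_bits)                # ~42x fewer bits
```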

SLIDE 13

R2: Sampling / Generation

PixelCNN

  • Samples lack global structure and are not sharp
  • A single PixelCNN is not powerful enough; a hierarchical representation is necessary

Class: pickup

SLIDE 14

R3: Stacking VQ-VAE

  • No time to discuss… See slides 20-22
  • Let's go to R4: Speech
SLIDE 15

R4: Speech

  • Decoder: Wavenet (state of the art speech generation)
  • Excellent speech reconstruction
  • Sampling results
  • Unsupervised learning
  • Voice style transfer
  • Learns phonemes (classifying phonemes from the latents: 49.3% accuracy vs. 7.2% for a random baseline)

https://avdnoord.github.io/homepage/vqvae/

SLIDE 16

Discussion and Conclusion

  • Impressive results & a good idea
  • Paper
    • Glosses over many details; supplement & implementation missing
    • Are the learned latents useful? This should be addressed quantitatively
    • Image generation can be greatly improved
    • Using a hierarchical model as in Lampert (previous coffee talk) should greatly improve speed and quality

SLIDE 17

Thanks!

  • Author's slides: https://drive.google.com/file/d/1t8W2L1H2RtUge-IQYqGXa9ihKNVQpqNI/view
  • Author's talk: https://www.youtube.com/watch?v=HqaIkq3qH40
SLIDE 18

How to train? (1/2)

  • How do we backpropagate through the discretization?
  • Let's say a gradient is incoming at a dictionary vector
  • We do not update the dictionary vector with it (the dictionary is held fixed here)
  • Instead, we apply that gradient to the non-discretized encoder output (the straight-through estimator)

[Diagram: gradient at f_3 copied back to the encoder output]
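A minimal NumPy sketch of this gradient copy (an illustration of the straight-through idea, not the paper's code; the random `z_e`, `codebook`, and "pretend" decoder gradient are made up): the forward pass uses the quantized vectors, while the backward pass hands the incoming gradient to the encoder output unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
z_e = rng.normal(size=(16, 2))        # encoder outputs (non-discretized)
codebook = rng.normal(size=(3, 2))    # dictionary vectors (held fixed here)

# forward: snap each encoder output to its nearest dictionary vector
dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
ids = dists.argmin(axis=1)
z_q = codebook[ids]

# backward (straight-through): the gradient arriving at z_q is handed to
# z_e unchanged; the dictionary receives no gradient from this path
grad_z_q = rng.normal(size=z_q.shape)  # pretend gradient from the decoder
grad_z_e = grad_z_q.copy()             # copied straight through
```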

SLIDE 19

How to train? (2/2)

  • Loss part 1: reconstruction error (dictionary held fixed)
  • Loss part 2: a term that moves the dictionary vectors toward the encoder outputs
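In the paper these parts are a reconstruction term plus a codebook term and a commitment term, with stop-gradients sg[·] deciding which parameters each term updates. A hypothetical NumPy sketch of the forward value (plain NumPy has no autograd, so sg[·] is invisible here; the comments mark where it would act, and β = 0.25 is the value used in the paper's experiments):

```python
import numpy as np

def vq_vae_loss(x, x_rec, z_e, z_q, beta=0.25):
    """Forward value of the VQ-VAE loss.

    In an autodiff framework, sg[.] makes the codebook term update only the
    dictionary and the commitment term update only the encoder; the numerical
    value computed here is the same either way.
    """
    rec = ((x - x_rec) ** 2).mean()            # part 1: reconstruction error
    codebook_term = ((z_e - z_q) ** 2).mean()  # ||sg[z_e] - e||^2: moves e toward z_e
    commit_term = ((z_e - z_q) ** 2).mean()    # ||z_e - sg[e]||^2: keeps z_e near e
    return rec + codebook_term + beta * commit_term
```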
SLIDE 20

R3: Stacking VQ-VAE (1/2)

  • VQ-VAE stacked to get higher-level latents
  • Uses DeepMind Lab (artificial images)
  • Errors: sharpness and global mismatch
  • Latents seem 'useful': coherent video can be generated from the latent space (input: first 6 frames; output: video)
  • No quantitative experiment

[Figure: Original (21168 bytes ≈ 21 KB) vs. Reconstruction (27 bits)]

SLIDE 21

R3: Stacking VQ-VAE (2/2)

[Generated video frames]

SLIDE 22

Multistage VQ-VAE

Pipeline: 84 × 84 × 3 input with values in [0, 256) → VQ → 21 × 21 × 1 latents with values in [0, 512) → VQ → 3 latents in [0, 512). Before: 84 × 84 × 3 × 8 = 169344 bits = 21168 bytes ≈ 21 KB. After: 3 × 9 = 27 bits. The reconstruction is not very accurate, but the representation is powerful.
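Bit counting for this multistage pipeline (a sketch; 512 = 2**9, so each discrete latent costs 9 bits):

```python
import math

# input: 84 x 84 RGB image, 8 bits per channel value
input_bits = 84 * 84 * 3 * 8            # 169344 bits = 21168 bytes

# top-level code: 3 discrete latents, each from a 512-entry codebook
bits_per_latent = int(math.log2(512))   # 9 bits
code_bits = 3 * bits_per_latent         # 27 bits
```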

SLIDE 23

Comparison

                                  GAN     VAE     PixelCNN   VQ-VAE (this talk)
Compute exact likelihood p(x)      ✗      bound      ✓             bound
Has latent variable z              ✓       ✓         ✗              ✓
Compute latent z (inference)       ✗       ✓         ✗              ✓
Discrete latent variable           ✗       ✗         ✗              ✓
Stable training?                   ✗       ✓         ✓              ✓
Sharp images?                      ✓       ✗         ✓              ?