
Neural Discrete Representation Learning (VQ-VAE)


  1. Neural Discrete Representation Learning (VQ-VAE). Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu. Google DeepMind, NIPS 2017.

  2. Neural Discrete Representation Learning
     1. What is the task?
     2. Comparison & Contribution
     3. VQ-VAE Model
     4. Results
        1. Density estimation & Reconstruction
        2. Sampling
        3. Speech
     5. Discussion & Conclusion

  3. What is the task?
     • Task 1: Density estimation: learn p(x)
     • Task 2: Extract a meaningful latent variable (unsupervised)
     • Task 3: Reconstruct the input
     [Figure: input x -> latent z -> output x' (reconstruction)]

  4. Comparison & Contribution
     1. Bounds p(x), but does not require a variational approximation
     2. Trained using maximum likelihood (stable training)
     3. First to use discrete latent variables successfully
     4. Uses the whole latent space (avoids 'posterior collapse')
     [Figure: image captioned "A little girl sitting on a bed with a teddy bear"]
     After discussion: Why is discrete nice? It is a more natural representation for humans; it avoids posterior collapse (the latent space is easier to manage via the dictionary); it is compressible; and it is easier to learn a prior over a discrete latent space (more tractable than a continuous one).

  5. Auto Encoder
     How to discretize? For the running example, we take the latent to be a 4 x 4 image with 2 channels.
     [Figure: input -> latent variable -> output (reconstruction)]
     We can train this system end-to-end using MSE (reconstruction loss); a minimal sketch follows below.
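
To make the end-to-end training on this slide concrete, here is a minimal JAX sketch of a plain autoencoder trained with an MSE reconstruction loss. The linear encoder/decoder, layer sizes, and parameter names are illustrative assumptions of ours, not the slides' or the paper's architecture.

```python
import jax
import jax.numpy as jnp

def encode(params, x):
    # Linear encoder: flatten the input, project to a 4*4*2 = 32-dim latent.
    return x.reshape(x.shape[0], -1) @ params["W_enc"]

def decode(params, z):
    # Linear decoder back to the input size.
    return z @ params["W_dec"]

def mse_loss(params, x):
    # End-to-end training signal: mean squared reconstruction error.
    x_rec = decode(params, encode(params, x))
    return jnp.mean((x - x_rec) ** 2)

key1, key2 = jax.random.split(jax.random.PRNGKey(0))
in_dim, latent_dim = 64, 4 * 4 * 2          # e.g. an 8x8x1 input (assumption)
params = {
    "W_enc": 0.1 * jax.random.normal(key1, (in_dim, latent_dim)),
    "W_dec": 0.1 * jax.random.normal(key2, (latent_dim, in_dim)),
}
x = jax.random.normal(jax.random.PRNGKey(1), (8, in_dim))  # toy batch
grads = jax.grad(mse_loss)(params, x)  # backprop through the whole pipeline
```

With a continuous latent, gradients flow end to end; the following slides show how to discretize that latent and what that breaks.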

  6. How to Discretize?
     The latent is a 4 x 4 image with 2 channels, so each latent vector has 2 dimensions. We plot all 16 latent pixel values in 2D (since we have 2 channels).
     [Figure: scatter plot of the latent pixel values; axes: channel 1, channel 2]

  7. How to Discretize?
     Make a dictionary of vectors e_1, ..., e_L. The latent is a 4 x 4 image with 2 channels, so each e_j has 2 dimensions.

  8. How to Discretize?
     Make a dictionary of vectors e_1, ..., e_L; each e_j has 2 dimensions. For each latent pixel, look up the nearest dictionary element e_j (a sketch of this lookup follows below).
     [Figure: scatter plot with dictionary vectors e_1, e_2, e_3 marked among the latent pixel values]
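
The nearest-neighbour lookup on this slide fits in a few lines. A JAX sketch with our own assumed shapes and names: z_e is the encoder output for the 4 x 4, 2-channel latent image, and codebook holds the L dictionary vectors.

```python
import jax
import jax.numpy as jnp

def quantize(z_e, codebook):
    # z_e: (H, W, D) encoder output; codebook: (L, D) dictionary vectors.
    # Squared Euclidean distance from every latent pixel to every code.
    dists = jnp.sum((z_e[:, :, None, :] - codebook[None, None, :, :]) ** 2,
                    axis=-1)
    ids = jnp.argmin(dists, axis=-1)  # (H, W) map of discrete dictionary ids
    z_q = codebook[ids]               # (H, W, D) latents snapped to the codes
    return ids, z_q

z_e = jax.random.normal(jax.random.PRNGKey(0), (4, 4, 2))    # 4x4, 2 channels
codebook = jax.random.normal(jax.random.PRNGKey(1), (8, 2))  # L = 8 (assumed)
ids, z_q = quantize(z_e, codebook)
print(ids.shape, z_q.shape)   # (4, 4) (4, 4, 2)
```

The (H, W) id map is exactly the discrete 1-channel latent image that slide 10 describes.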

  9. How to Discretize?
     4 x 4 image with 2 channels; each e_j has 2 dimensions.
     [Figure: the latent pixel values after being snapped to their nearest dictionary vectors]

  10. Proposed Model
      [Figure: input -> latent variable -> output (reconstruction)]
      The latent is a 1-channel image that contains, for each pixel, the id of its dictionary element e (discrete).

  11. How to train?
      • No time to discuss here; see slides 18-19.
      • Let's talk about results.

  12. R1: Density Estimation & Reconstructions
      • Comparable with a VAE on CIFAR-10 in terms of density estimation.
      • Reconstructions on ImageNet are very good.
      ImageNet original: 128 * 128 * 3 * 8 = 393216 bits = 48 KB
      Latent used for reconstruction: 32 * 32 * 9 = 9216 bits ≈ 1 KB

  13. R2: Sampling / Generation
      Samples are generated with a PixelCNN prior over the latents; a sketch of this pipeline follows below.
      [Figure: samples for class 'pickup']
      • The samples lack global structure and are unsharp.
      • A single PixelCNN is not powerful enough; a hierarchical representation is necessary.
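
To make the generation pipeline concrete: sample a grid of discrete code ids autoregressively from a prior, then decode them through the codebook and decoder. In this JAX sketch, prior_logits is a placeholder standing in for the trained PixelCNN; it, and all shapes, are assumptions for illustration only.

```python
import jax
import jax.numpy as jnp

def sample_ids(key, prior_logits, H=4, W=4):
    # Ancestral sampling: draw each code id conditioned on the ids so far.
    ids = jnp.zeros((H, W), dtype=jnp.int32)
    for i in range(H):
        for j in range(W):
            key, sub = jax.random.split(key)
            logits = prior_logits(ids, i, j)   # (L,) logits from the prior
            ids = ids.at[i, j].set(jax.random.categorical(sub, logits))
    return ids

# Dummy uniform "prior" over L = 8 codes, just to make the sketch runnable:
uniform_prior = lambda ids, i, j: jnp.zeros(8)
ids = sample_ids(jax.random.PRNGKey(0), uniform_prior)
# To get an image, look the ids up in the codebook and decode (slide 10).
```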

  14. R3: Stacking VQ-VAE
      • No time to discuss here; see slides 20-22.
      • Let's go to R4: Speech.

  15. R4: Speech
      • Decoder: WaveNet (state-of-the-art speech generation)
      • Excellent speech reconstruction
      • Sampling results
      • Unsupervised learning
      • Voice style transfer
      • Learns phonemes (classifying from the latents: 49.3% accuracy vs. 7.2% for random)
      https://avdnoord.github.io/homepage/vqvae/

  16. Discussion and Conclusion
      • Impressive results & a good idea
      • Paper: glosses over many details; supplement & implementation are missing
      • Are the learned latents useful? This should be addressed quantitatively
      • Image generation can be greatly improved
      • Using a hierarchical model as in Lampert (previous coffee talk) should greatly improve speed and quality

  17. Thanks!
      • Slides (by the author): https://drive.google.com/file/d/1t8W2L1H2RtUge-IQYqGXa9ihKNVQpqNI/view
      • Talk (by the author): https://www.youtube.com/watch?v=HqaIkq3qH40

  18. How to train? (1/2)
      • How do we backpropagate through the discretization?
      • Suppose a gradient arrives at a dictionary vector.
      • We do not update the dictionary vector here (it is fixed).
      • Instead, we apply the gradient of e to the non-discretized encoder output (a sketch follows below).
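
This "copy the gradient past the lookup" trick is the straight-through estimator. A minimal JAX sketch (variable names are ours): the forward pass uses the snapped code z_q, while the gradient flows to the encoder output z_e as if the quantization were the identity.

```python
import jax
import jax.numpy as jnp

def straight_through(z_e, z_q):
    # Forward value equals z_q, but stop_gradient makes d(output)/d(z_e) = I,
    # so the decoder's gradient is copied straight onto the encoder output.
    return z_e + jax.lax.stop_gradient(z_q - z_e)

# Toy check: the gradient w.r.t. z_e ignores the non-differentiable lookup.
z_e = jnp.array([0.3, -1.2])       # non-discretized encoder output
z_q = jnp.array([0.5, -1.0])       # its nearest dictionary vector (fixed here)
g = jax.grad(lambda z: jnp.sum(straight_through(z, z_q) ** 2))(z_e)
print(g)   # [1.0, -2.0] = 2 * z_q, passed through to z_e unchanged
```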

  19. How to train? (2/2)
      • Loss part 1: reconstruction error (dictionary fixed)
      • Loss part 2: dictionary update (moves the dictionary vectors toward the encoder outputs)
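
The slide does not write the objective out; for reference, the full training objective from the paper combines these parts (plus a commitment term) as:

```latex
L = \log p(x \mid z_q(x))
  + \lVert \mathrm{sg}[z_e(x)] - e \rVert_2^2
  + \beta \lVert z_e(x) - \mathrm{sg}[e] \rVert_2^2
```

Here sg is the stop-gradient operator. The first term is the reconstruction error of loss part 1 (the dictionary is held fixed by sg and the straight-through estimator of slide 18); the second term is loss part 2, pulling each dictionary vector e toward the encoder outputs assigned to it; the third, weighted by beta, commits the encoder outputs z_e(x) to stay close to their codes.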

  20. R3: Stacking VQ-VAE (1/2)
      • VQ-VAEs are stacked to obtain higher-level latents.
      • Uses DeepMind Lab (artificial images).
      • Errors: sharpness and global mismatch.
      • The latents seem 'useful': coherent video can be generated from the latent space (input: the first 6 frames; output: video).
      • No quantitative experiment.
      [Figure: original (84 * 84 * 3 * 8 = 169344 bits ≈ 21 KB) vs. reconstruction from 27 bits]

  21. R3: Stacking VQ-VAE (2/2)
      [Figure: generated video]

  22. Multistage VQ-VAE
      Pipeline: 84 x 84 x 3 input with values in [0, 256) -> VQ -> 21 x 21 x 1 latent with ids in [0, 512) -> VQ -> 3 latents with ids in [0, 512).
      Before: 84 * 84 * 3 * 8 = 169344 bits ≈ 21 KB. After: 3 * 9 = 27 bits.
      The reconstruction is not very accurate, but it is a powerful representation.

  23. Comparison

      Property                          GAN   VAE   PixelCNN   VQ-VAE (this talk)
      Compute exact likelihood p(x)     no    no    yes        no (bound)
      Has latent variable z             yes   yes   no         yes
      Compute latent z (inference)      no    yes   no         yes
      Discrete latent variable          no    no    no         yes
      Stable training?                  no    yes   yes        yes
      Sharp images?                     yes   no    yes        ?
