Neural Discrete Representation Learning
Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu


SLIDE 1

Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu

Neural Discrete Representation Learning

SLIDE 2

Goal: Estimate the probability distribution of high-dimensional data, such as images, audio, video, text, ...

Motivation:
  • Learn the underlying structure in data.
  • Capture the dependencies between the variables.
  • Generate new data with similar properties.
  • Learn useful features from the data in an unsupervised fashion.

Generative Models

SLIDE 3

Autoregressive Models

SLIDE 4

Recent Autoregressive models at DeepMind

PixelRNN, PixelCNN (van den Oord et al, 2016ab)
[sample images: White Whale, Hartebeest, Tiger, Geyser]
WaveNet (van den Oord et al, 2016c)
ByteNet (Kalchbrenner et al, 2016a)
Video Pixel Networks (Kalchbrenner et al, 2016b)

SLIDE 5

Modeling Audio

SLIDE 6

Causal Convolution

[diagram: Input → Hidden Layer]

SLIDE 7

Causal Convolution

[diagram: Input → Hidden Layer → Hidden Layer]

SLIDE 8

Causal Convolution

[diagram: Input → Hidden Layer → Hidden Layer → Hidden Layer]

SLIDE 9

Causal Convolution

[diagram: Input → Hidden Layer → Hidden Layer → Hidden Layer → Output]

SLIDE 10

Causal Convolution

[diagram: Input → Hidden Layer → Hidden Layer → Hidden Layer → Output]
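The causal structure in the diagrams can be sketched in a few lines of numpy: left-padding the input means the filter never sees future timesteps. This is an illustrative sketch, not the slides' implementation; the 2-tap filter is made up.

```python
import numpy as np

def causal_conv1d(x, w):
    """1-D causal convolution: output[t] depends only on x[:t+1].

    x: input sequence, shape (T,)
    w: filter taps, shape (k,); w[0] multiplies the oldest input.
    """
    k = len(w)
    # Left-pad with k-1 zeros so the filter never sees the future.
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ w for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.5])  # toy 2-tap average of the current and previous step
y = causal_conv1d(x, w)
```

Changing a future input leaves all earlier outputs unchanged, which is exactly the property that lets the model factorize the joint distribution over timesteps.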

SLIDE 11

Causal Dilated Convolution

[diagram: Input]

SLIDE 12

[diagram: Input → Hidden Layer]

Causal Dilated Convolution

SLIDE 13

[diagram: Input → Hidden Layer (dilation=1) → Hidden Layer (dilation=2)]

Causal Dilated Convolution

SLIDE 14

[diagram: Input → Hidden Layer (dilation=1) → Hidden Layer (dilation=2) → Hidden Layer (dilation=4)]

Causal Dilated Convolution

SLIDE 15

[diagram: Input → Hidden Layer (dilation=1) → Hidden Layer (dilation=2) → Hidden Layer (dilation=4) → Output (dilation=8)]

Causal Dilated Convolution

SLIDE 16

[diagram: Input → Hidden Layer (dilation=1) → Hidden Layer (dilation=2) → Hidden Layer (dilation=4) → Output (dilation=8)]
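Dilation spaces the filter taps apart, so stacking layers with dilations 1, 2, 4, 8 (kernel size 2) grows the receptive field to 16 timesteps while the cost stays linear. A minimal numpy sketch (illustrative, with all-ones filters so the receptive field is easy to read off):

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """Causal 1-D convolution whose filter taps are `dilation` steps apart."""
    k = len(w)
    pad = (k - 1) * dilation
    # Left-pad so every tap still looks only at the past.
    padded = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[j] * padded[t + j * dilation] for j in range(k))
                     for t in range(len(x))])

# Stack four layers with dilations 1, 2, 4, 8 (kernel size 2):
# the receptive field grows to 1 + 1 + 2 + 4 + 8 = 16 timesteps.
x = np.zeros(20)
x[0] = 1.0                       # impulse at t=0
h = x
for d in (1, 2, 4, 8):
    h = dilated_causal_conv1d(h, np.array([1.0, 1.0]), d)
```

Feeding an impulse through the stack shows exactly which outputs see it: the first 16 outputs respond, the rest are zero.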

Causal Dilated Convolution

SLIDE 17

Multiple Stacks

SLIDE 18

Sampling
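Sampling from an autoregressive model is sequential: draw one symbol from the predicted distribution, append it, and feed the longer sequence back in. A generic sketch (the `next_probs` model and the uniform toy distribution are stand-ins, not the slides' WaveNet):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_autoregressive(next_probs, length, n_classes):
    """Draw a sequence one symbol at a time, feeding each sample back in.

    `next_probs(seq)` stands in for the model: it returns a probability
    distribution over the next symbol given the sequence so far.
    """
    seq = []
    for _ in range(length):
        p = next_probs(seq)
        seq.append(int(rng.choice(n_classes, p=p)))
    return seq

# Toy "model": always predicts a uniform distribution over 4 symbols.
uniform = lambda seq: np.full(4, 0.25)
samples = sample_autoregressive(uniform, length=8, n_classes=4)
```

This one-step-at-a-time loop is why naive sampling is slow: each timestep requires a full forward pass conditioned on everything generated so far.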

SLIDE 19

Speaker-conditional Generation

[diagram: a speaker embedding conditions every timestep; it does not depend on the timestep]
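Because the speaker embedding does not depend on the timestep, global conditioning amounts to broadcasting one learned vector across all timesteps of a layer's activations. A minimal numpy sketch; the sizes, the `tanh`, and the projection `V` are illustrative choices, not the paper's exact gated-activation form:

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, E = 10, 16, 8                 # timesteps, channels, embedding size (made up)
speakers = rng.normal(size=(4, E))  # one learned vector per speaker
V = rng.normal(size=(E, C))         # projects the embedding into the layer

def condition(x, speaker_id):
    """Add the same speaker vector to the activations at every timestep."""
    h = speakers[speaker_id] @ V    # shape (C,): no time dimension
    return np.tanh(x + h)           # h broadcasts over all T timesteps

x = rng.normal(size=(T, C))
y = condition(x, speaker_id=2)
```

The key property is that the conditioning term is identical at every timestep; only the choice of speaker changes it.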

SLIDE 20

Text-To-Speech samples

https://deepmind.com/blog/wavenet-generative-model-raw-audio/

SLIDE 21

Speaker-conditional samples

(but not conditioned on text) https://deepmind.com/blog/wavenet-generative-model-raw-audio/

SLIDE 22

Piano Music samples

https://deepmind.com/blog/wavenet-generative-model-raw-audio/

SLIDE 23

VQ-VAE

  • Towards modeling a latent space
    • Learn meaningful representations.
    • Abstract away noise and details.
    • Model what’s important in a compressed latent representation.
  • Why discrete?
    • Many important real-world things are discrete.
    • Arguably easier to model with the prior (e.g., a softmax vs. RNADE).
    • Continuous representations are often inherently discretized by the encoder/decoder.
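The core VQ-VAE operation is vector quantization: each encoder output vector is snapped to its nearest entry in a learned codebook, and the decoder only ever sees codebook vectors (during training, gradients are copied straight through from z_q to z_e). A forward-pass-only numpy sketch with an illustrative random codebook:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 512, 64                      # codebook size and code dimension (illustrative)
codebook = rng.normal(size=(K, D))

def quantize(z_e):
    """Snap each encoder output vector to its nearest codebook entry.

    z_e: encoder outputs, shape (N, D). Returns the discrete indices and
    the quantized vectors z_q that the decoder receives.
    """
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

z_e = rng.normal(size=(32, D))
idx, z_q = quantize(z_e)
```

The indices `idx` are the discrete latent code: a prior (e.g. a PixelCNN) can then be trained over them with an ordinary softmax.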
SLIDE 24

VQ-VAE

Related work: PixelVAE (Gulrajani et al, 2016); Variational Lossy Autoencoder (Chen et al, 2016)

SLIDE 25

VQ-VAE

SLIDE 26

VQ-VAE

SLIDE 27

Images

SLIDE 28

ImageNet reconstructions

[figure: original 128x128 images vs. reconstructions]

SLIDE 29

VQ-VAE - Sample

SLIDE 30

ImageNet samples

SLIDE 31

DM-Lab Samples

SLIDE 32

3 Global Latents Reconstruction

SLIDE 33

3 Global Latents Reconstruction

Originals Reconstructions from compressed representations (27 bits per image).

SLIDE 34

Video Generation in the latent space

SLIDE 35

Speech

SLIDE 36

https://avdnoord.github.io/homepage/vqvae/

SLIDE 37

Speech - reconstruction

[audio samples: original vs. reconstruction]

SLIDE 38

Speech - Sample from prior

SLIDE 39

https://avdnoord.github.io/homepage/vqvae/

SLIDE 40

Speech - speaker conditional

SLIDE 41

https://avdnoord.github.io/homepage/vqvae/

SLIDE 42

Unsupervised Learning of phonemes

[diagram: Encoder → Discrete codes (alphabet = codebook) → Decoder; the codes are compared against phonemes]

SLIDE 43

Unsupervised Learning of phonemes

Mapping discrete codes to phonemes (41-way classification): 49.3% accuracy, fully unsupervised.
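One common way to score an unsupervised discrete code against phoneme labels is to map each code to the phoneme it most often co-occurs with, then measure the accuracy of that induced labelling. A toy numpy sketch of this evaluation (the function name and the tiny example data are made up; this is an assumed evaluation protocol, not one confirmed by the slides):

```python
import numpy as np

def code_phoneme_accuracy(codes, phonemes, n_codes):
    """Map each discrete code to its most frequent co-occurring phoneme,
    then score the induced labelling on the same data."""
    mapping = {}
    for c in range(n_codes):
        mask = codes == c
        if mask.any():
            vals, counts = np.unique(phonemes[mask], return_counts=True)
            mapping[c] = vals[counts.argmax()]   # majority-vote phoneme for code c
    pred = np.array([mapping[c] for c in codes])
    return float((pred == phonemes).mean())

codes = np.array([0, 0, 0, 1, 1])     # toy code sequence
phones = np.array([7, 7, 2, 3, 3])    # toy phoneme labels
acc = code_phoneme_accuracy(codes, phones, n_codes=2)
```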

SLIDE 44

References and related work

Pixel Recurrent Neural Networks - van den Oord et al, ICML 2016
Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016
WaveNet: A Generative Model for Raw Audio - van den Oord et al, arXiv 2016
Neural Machine Translation in Linear Time - Kalchbrenner et al, arXiv 2016
Video Pixel Networks - Kalchbrenner et al, ICML 2017
Neural Discrete Representation Learning - van den Oord et al, NIPS 2017

Related work:
The Neural Autoregressive Distribution Estimator - Larochelle et al, AISTATS 2011
Generative Image Modeling Using Spatial LSTMs - Theis et al, NIPS 2015
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model - Mehri et al, ICLR 2017
PixelVAE: A Latent Variable Model for Natural Images - Gulrajani et al, ICLR 2017
Variational Lossy Autoencoder - Chen et al, ICLR 2017
Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations - Agustsson et al, NIPS 2017

SLIDE 45

Thank you!