SLIDE 1

Improving PixelCNN

  • There is a problem with this form of masked convolution: stacking layers of masked convolution creates a blind spot in the receptive field.
  • Solution: use two stacks of convolution, a vertical stack and a horizontal stack.

SLIDE 2

Improving PixelCNN I

[Figure: masked convolution filter, with entries of 1 over the allowed context and 0 at and after the current pixel]

There is a problem with this form of masked convolution.

  • Blind spot: stacking layers of masked convolution creates a blind spot in the receptive field (sketched below).
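To make the blind spot concrete, here is a minimal sketch of this kind of masked 2-D convolution, assuming a PyTorch-style implementation (the class name MaskedConv2d and the layer sizes are illustrative, not from the slides). Stacking layers like this is what creates the blind spot: a growing region above and to the right of the current pixel never enters the receptive field, which is what the vertical/horizontal split fixes.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """2-D convolution whose kernel only sees pixels above and to the left of the
    current position (raster-scan order). Mask 'A' also hides the centre pixel
    (used in the first layer); mask 'B' keeps it (used in later layers)."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        kh, kw = self.kernel_size
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2 + (mask_type == 'B'):] = 0  # centre row: zero the centre ('A' only) and everything to its right
        mask[kh // 2 + 1:, :] = 0                          # zero every row below the centre
        self.register_buffer('mask', mask[None, None])     # broadcasts over (out_channels, in_channels)

    def forward(self, x):
        self.weight.data *= self.mask   # zero out "future" pixels before convolving
        return super().forward(x)

# Example: a first PixelCNN layer on a 1-channel image.
layer = MaskedConv2d('A', in_channels=1, out_channels=16, kernel_size=7, padding=3)
out = layer(torch.randn(1, 1, 28, 28))   # -> (1, 16, 28, 28)
```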

SLIDE 3

Improving PixelCNN II

  • Use a more expressive nonlinearity (the gated activation unit, sketched below):

    h_{k+1} = \tanh(W_{k,f} * h_k) \odot \sigma(W_{k,g} * h_k)

  • This information flow (between the vertical and horizontal stacks) preserves the correct pixel dependencies.
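A minimal sketch of this gated activation, in the same PyTorch style as the masked-convolution sketch above. The plain (unmasked) convolutions and the class name are illustrative simplifications; in the actual model W_{k,f} and W_{k,g} are the masked convolutions of the vertical and horizontal stacks.

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    """Gated nonlinearity: h_{k+1} = tanh(W_f * h_k) * sigmoid(W_g * h_k)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # Plain convolutions stand in for the masked/stacked ones, to show only the gating.
        self.conv_f = nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.conv_g = nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2)

    def forward(self, h):
        return torch.tanh(self.conv_f(h)) * torch.sigmoid(self.conv_g(h))

h = torch.randn(1, 16, 28, 28)
print(GatedActivation(16)(h).shape)   # torch.Size([1, 16, 28, 28])
```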

SLIDE 4

Topics: CIFAR-10

  • Samples from a class-conditioned PixelCNN

[Figure: samples from PixelCNN conditioned on the class "Coral Reef"]

SLIDE 5

Topics: CIFAR-10

  • Samples from a class-conditioned PixelCNN

[Figure: samples from PixelCNN]

SLIDE 6

Topics: CIFAR-10

  • Samples from a class-conditioned PixelCNN

[Figure: samples from PixelCNN conditioned on the class "Sandbar"]

SLIDE 7

Neural Image Model: PixelRNN

Models the joint distribution p(x_1, \ldots, x_i, \ldots, x_{n^2}) over all n^2 pixels.

Convolutional Long Short-Term Memory (LSTM); Row LSTM.

Stollenga et al., 2015; Oord, Kalchbrenner, Kavukcuoglu, 2016
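The autoregressive factorization behind this joint distribution (the slide shows it only pictorially) is the standard PixelRNN one, with pixels ordered in a raster scan:

p(x) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})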

SLIDE 8

Neural Image Model: PixelRNN

Models the joint distribution p(x_1, \ldots, x_i, \ldots, x_{n^2}) over all n^2 pixels.

PixelRNN: multiple layers of convolutional LSTM.

SLIDE 9

Samples from PixelRNN

SLIDE 10

Samples from PixelRNN

SLIDE 11

Samples from PixelRNN

SLIDE 12

Architecture for 1-D sequences (ByteNet / WaveNet)

  • Stack of dilated, masked 1-D convolutions in the decoder
  • The architecture is parallelizable along the time dimension (during training or scoring)
  • Easy access to many states from the past (see the sketch below)
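A minimal sketch of such a dilated, causal 1-D convolution stack (PyTorch-style; the class name, channel sizes, and the absence of gating and residual connections are illustrative simplifications). Causality comes from left-padding only, and doubling the dilation each layer makes the receptive field grow exponentially, which is what gives cheap access to many past states.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Dilated 1-D convolution that is causal: the output at time t only
    depends on inputs at times <= t (left-pad, then convolve)."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                          # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))    # pad on the left only
        return self.conv(x)

# Dilations 1, 2, 4, ...: the receptive field roughly doubles per layer.
stack = nn.Sequential(*[CausalConv1d(32, dilation=2 ** i) for i in range(6)])
y = stack(torch.randn(1, 32, 100))   # -> (1, 32, 100)
```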

SLIDE 13

Masked convolution

Video Pixel Net

SLIDE 14

Video Pixel Net

SLIDE 15

VPN Samples for Moving MNIST

[Figure: samples with no frame dependencies vs. VPN samples; videos at nal.ai/vpn]

SLIDE 16

VPN Samples for Robotic Pushing

[Figure: samples with no frame dependencies vs. VPN samples; videos at nal.ai/vpn]

SLIDE 17

VPN Samples for Robotic Pushing

SLIDE 18

Variational Autoencoders

slide-19
SLIDE 19

Variational Auto-Encoders in General

Design choices:

  • Prior on the latent variable
    − Continuous, discrete, Gaussian, Bernoulli, mixture
  • Likelihood function
    − i.i.d. (static), sequential, temporal, spatial
  • Approximating posterior
    − distribution, sequential, spatial

F(q) = E_{q_\phi(z)}[\log p_\theta(x|z)] - KL[q_\phi(z|x) \| p(z)]

Variational Auto-encoder (VAE): amortised variational inference for latent variable models.

[Figure: Data x → Inference network q(z|x), z ~ q(z|x) → Model p(x|z), x ~ p(x|z)]

For scalability and ease of implementation:

  • Stochastic gradient descent (and variants)
  • Stochastic gradient estimation

SLIDE 20

Variational Autoencoders (VAEs)

  • The VAE defines a generative process in terms of ancestral sampling through a cascade of hidden stochastic layers:

[Figure: generative cascade h3 → h2 → h1 → v (input data) with weights W3, W2, W1]

SLIDE 21

Variational Autoencoders (VAEs)

  • The VAE defines a generative process in terms of ancestral sampling through a cascade of hidden stochastic layers:

[Figure: generative cascade h3 → h2 → h1 → v (input data) with weights W3, W2, W1, labelled "Generative Process"]

SLIDE 22

Variational Autoencoders (VAEs)

  • The VAE defines a generative process in terms of ancestral sampling through a cascade of hidden stochastic layers:

[Figure: generative cascade h3 → h2 → h1 → v (input data) with weights W3, W2, W1, labelled "Generative Process"]

  • θ denotes the parameters of the VAE.
  • L is the number of stochastic layers.
  • Sampling and probability evaluation is tractable for each conditional term.

SLIDE 23

Variational Autoencoders (VAEs)

  • The VAE defines a generative process in terms of ancestral sampling through a cascade of hidden stochastic layers:

[Figure: generative cascade h3 → h2 → h1 → v (input data) with weights W3, W2, W1, labelled "Generative Process"]

  • θ denotes the parameters of the VAE.
  • L is the number of stochastic layers.
  • Sampling and probability evaluation is tractable for each conditional term.
  • Each term may denote a complicated nonlinear relationship (written out below).
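Written out in standard deep latent variable model notation (the slide's own equation appears only in the image, so the symbols here are a reconstruction), the ancestral-sampling factorization over the L stochastic layers is:

p_\theta(v, h^1, \ldots, h^L) = p_\theta(h^L) \prod_{l=1}^{L-1} p_\theta(h^l \mid h^{l+1}) \; p_\theta(v \mid h^1)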

SLIDE 24

Variational Autoencoders (VAEs)

  • The VAE defines a generative process in terms of ancestral sampling through a cascade of hidden stochastic layers:

[Figure: generative cascade as above, with the layers labelled deterministic or stochastic]

  • θ denotes the parameters of the VAE.
  • L is the number of stochastic layers.
  • Sampling and probability evaluation is tractable for each conditional term.
  • In the figure, each such term denotes a one-layer neural net; the cascade mixes deterministic and stochastic layers.

SLIDE 25

Variational Bound

  • The VAE is trained to maximize the variational lower bound (written out below):

[Figure: generative cascade h3 → h2 → h1 → v (input data) with weights W3, W2, W1]
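The bound itself appears only in the slide image; in the same notation as the objective F(q) introduced earlier, it reads:

\log p_\theta(x) \ge E_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL[q_\phi(z|x) \| p(z)] = F(q)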

SLIDE 26

Variational Bound

  • The VAE is trained to maximize the variational lower bound:
  • Trading off the data log-likelihood and the KL divergence from the true posterior.

[Figure: generative cascade h3 → h2 → h1 → v (input data) with weights W3, W2, W1]

SLIDE 27

Variational Bound

  • The VAE is trained to maximize the variational lower bound:
  • Trading off the data log-likelihood and the KL divergence from the true posterior.

[Figure: generative cascade h3 → h2 → h1 → v (input data) with weights W3, W2, W1]

  • It is hard to optimize the variational bound with respect to the recognition network (high-variance gradients).
  • The key idea of Kingma and Welling is to use the reparameterization trick.

SLIDE 28

Reparameterization Trick

  • Assume that the recognition distribution is Gaussian, with mean and covariance computed from the state of the hidden units at the previous layer.

SLIDE 29

Reparameterization Trick

  • Assume that the recognition distribution is Gaussian, with mean and covariance computed from the state of the hidden units at the previous layer.
  • Alternatively, we can express this in terms of an auxiliary variable: z = \mu^q_\phi(x) + (\Sigma^q_\phi(x))^{1/2} \epsilon, with \epsilon ~ N(0, I).

SLIDE 30

Reparameterization Trick

  • Assume that the recognition distribution is Gaussian: q_\phi(z|x) = N(\mu^q_\phi(x), \Sigma^q_\phi(x)).
  • The recognition distribution can then be expressed in terms of a deterministic mapping: z = \mu^q_\phi(x) + (\Sigma^q_\phi(x))^{1/2} \epsilon, with \epsilon ~ N(0, I).
  • Deterministic encoder: the distribution of \epsilon does not depend on \phi.

SLIDE 31

Reparameterization Trick

[Figure (image: Carl Doersch): encoder/decoder computation graphs without vs. with the reparameterization trick; with the trick, the sample is formed as \mu + \sigma * \epsilon from an external noise source]
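A minimal sketch of the trick in PyTorch (the function name and shapes are illustrative): the sample is a deterministic function of (\mu, \log \sigma^2) and an auxiliary noise draw, so backprop reaches the encoder parameters through the sample.

```python
import torch

def reparameterized_sample(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I); the randomness lives in eps."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

mu = torch.zeros(4, 8, requires_grad=True)
log_var = torch.zeros(4, 8, requires_grad=True)
z = reparameterized_sample(mu, log_var)
z.sum().backward()                       # gradients flow to mu and log_var through the sample
print(mu.grad.shape, log_var.grad.shape)
```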

SLIDE 32

Computing the Gradients

  • The gradient is taken w.r.t. both the recognition and the generative parameters.
  • Gradients can be computed by backprop: the mapping h is a deterministic neural net for fixed \epsilon.
slide-33
SLIDE 33

Implementing a Variational Algorithm

Variational inference turns integration into optimization: Au Automated Tools: :

  • Di

Differentiation: Theano, Torch7, TensorFlow, Stan.

  • Me

Message passing: infer.NET

  • Stochastic gradient descent and
  • ther preconditioned optimization.
  • Same code can run on both GPUs or
  • n distributed clusters.
  • Probabilistic models are modular, can

easily be combined.

103

tion tion. s or

Forward pass

Prior p(z) log p(z)

Model p(x |z) log p(x|z)

Inference q(z |x) H[q(z)] z

Data x

Inference q(z |x) Model p(x |z) Prior p(z)

rφ rφ

Backward pass

Ideally want probabilistic programming using variational inference.

SLIDE 34

Latent Gaussian VAE

Deep Latent Gaussian Model:

p(z) = N(0, I)
q_\phi(z|x) = N(\mu^q_\phi(x), \Sigma^q_\phi(x))
p_\theta(x|z) = N(\mu^p_\theta(z), \Sigma^p_\theta(z)), i.e. p(x | f^p_\theta(z))

F(x, q) = E_{q(z)}[\log p(x|z)] - KL[q(z) \| p(z)]

[Figure: Data x → Inference q(z|x), H[q(z)] → z → Model p(x|z), log p(x|z); Prior p(z), log p(z)]

All functions are deep networks.
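Putting the pieces together, here is a minimal latent Gaussian VAE sketch in PyTorch (layer sizes, class and function names are illustrative; for brevity the likelihood is a unit-variance Gaussian rather than a learned \Sigma^p_\theta(z)). It computes the bound F(x, q) above, with the KL term in closed form for diagonal Gaussians.

```python
import torch
import torch.nn as nn

class GaussianVAE(nn.Module):
    """p(z) = N(0, I); q(z|x) = N(mu_q, diag(sigma_q^2)); p(x|z) = N(mu_p, I)."""
    def __init__(self, x_dim=784, z_dim=20, hidden=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))   # -> [mu_q, log_var_q]
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))       # -> mu_p

    def elbo(self, x):
        mu_q, log_var_q = self.enc(x).chunk(2, dim=-1)
        z = mu_q + torch.exp(0.5 * log_var_q) * torch.randn_like(mu_q)    # reparameterization
        mu_p = self.dec(z)
        log_px_z = -0.5 * ((x - mu_p) ** 2).sum(-1)                       # log N(x | mu_p, I), up to a constant
        kl = 0.5 * (mu_q ** 2 + log_var_q.exp() - 1 - log_var_q).sum(-1)  # KL[q(z|x) || N(0, I)]
        return (log_px_z - kl).mean()                                     # F(x, q), averaged over the batch

vae = GaussianVAE()
loss = -vae.elbo(torch.rand(32, 784))   # maximize the bound by minimizing its negative
loss.backward()
```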

SLIDE 35

Latent Gaussian VAE

Latent space disentangles the input data.

[Figures: a 3-dimensional latent variable of MNIST; a latent factor embedding (Factor 1 vs. Factor 2) with points shaded by likelihood p(Y|θ) (< 0.01, 0.01–0.2, 0.2–0.4); labels such as "Oxygen/Swimmers Moving Left"]

Latent space and the likelihood bound give a visualisation of importance.
slide-36
SLIDE 36

VAE Representations

Representations are useful for strategies such as episodic control.

106

Representation

a R a1 a2 a3

Blundell, Charles, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z. Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. "Model-Free Episodic Control.” 2016

SLIDE 37

Latent Gaussian VAE

  • Require flexible approximations for the types of posteriors we are likely to see.

SLIDE 38

Latent Binary VAE

Deep Auto-regressive Networks:

p(z) = \prod_i p(z_i | z_{<i}), with p(z_i | z_{<i}) = Bern(z_i | f(z_{<i}))
q_\phi(z) = \prod_i q_\phi(z_i | z_{<i}) = \prod_i Bern(z_i | f^q_\phi(z_{<i}))
p(x|z) = \prod_i p(x_i | x_{<i}, z) = \prod_i Bern(x_i | f^p_\theta(x_{<i}, z))

[Figure: Data x → Inference q(z|x), H[q(z)] → z → Model p(x|z), log p(x|z); Prior p(z), log p(z)]

Gregor, Karol, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. "Deep autoregressive networks." 2013
slide-39
SLIDE 39

Latent Binary VAE

Samples from binarized Atari frames

109

Gregor, Karol, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. "Deep autoregressive

  • networks. 2013
slide-40
SLIDE 40

Semi-supervised VAE

Visual Analogies

110

67

2.12 3.3

VAT

0.96

Aux. SS-VAE

1.06

Ladder Net SS-VAE % Classification Error (100 labels)

0.92

SS-GAN

Class Prior p(y) log p(y) Prior p(z) log p(z)

Model p(x |z,y) log p(x|z, y)

Class Inference q(z |x) Latent Inference q(y |x, z)

Data x

% Classification Err

SLIDE 41

Sequential Latent Gaussian VAE

DRAW:

p(z) = \prod_i p(z_i | z_{<i})
q_\phi(z) = \prod_i q_\phi(z_i | z_{<i})
p_\theta(x|z) = N(\mu^p_\theta(z), \Sigma^p_\theta(z)), i.e. p(x | f^p_\theta(z))

[Figure: Data x → Inference q(z|x), H[q(z)] → z → Model p(x|z), log p(x|z); Prior p(z), log p(z)]

Gregor, Karol, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. "DRAW: A recurrent neural network for image generation." ICML 2015

SLIDE 42

Sequential Latent Gaussian VAE

DRAW

  • LSTM or GRU networks for the state modules.
  • Spatial attention in both the recognition and generation phases, using spatial transformers.
  • Can remove the inference-model RNN and share the generative model's state.
  • Can include an additional canvas (see the sketch below).

[Figure: DRAW unrolled over time: at each step t, data x and its state s(x) feed the inference network q(z_t|x); the prior p(z_t) and decoder state h(z) are updated; the model p(x|z) produces log p(x|z)]

Gregor, Karol, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. "DRAW: A recurrent neural network for image generation." ICML 2015
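A much simplified sequential-latent sketch in the spirit of DRAW (PyTorch-style; no spatial attention, a standard normal prior at every step, and all names and sizes are illustrative rather than the paper's architecture). An LSTM encoder reads the data and the current reconstruction error, a latent z_t is sampled by reparameterization, an LSTM decoder writes additively to a canvas, and the per-step KL terms accumulate into a sequential ELBO.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDRAW(nn.Module):
    """DRAW-like loop without attention: read -> infer z_t -> decode -> write to canvas."""
    def __init__(self, x_dim=784, z_dim=10, h_dim=256, T=8):
        super().__init__()
        self.T = T
        self.enc_rnn = nn.LSTMCell(2 * x_dim + h_dim, h_dim)   # reads [x, error, decoder state]
        self.dec_rnn = nn.LSTMCell(z_dim, h_dim)
        self.q = nn.Linear(h_dim, 2 * z_dim)                   # -> [mu_t, log_var_t]
        self.write = nn.Linear(h_dim, x_dim)

    def forward(self, x):
        B = x.size(0)
        canvas = torch.zeros_like(x)
        h_e = c_e = h_d = c_d = torch.zeros(B, self.enc_rnn.hidden_size)
        kl = 0.0
        for _ in range(self.T):
            err = x - torch.sigmoid(canvas)                    # what the canvas has not yet explained
            h_e, c_e = self.enc_rnn(torch.cat([x, err, h_d], dim=-1), (h_e, c_e))
            mu, log_var = self.q(h_e).chunk(2, dim=-1)
            z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterized sample
            h_d, c_d = self.dec_rnn(z, (h_d, c_d))
            canvas = canvas + self.write(h_d)                  # additive canvas update
            kl = kl + 0.5 * (mu ** 2 + log_var.exp() - 1 - log_var).sum(-1)
        log_px = -F.binary_cross_entropy_with_logits(canvas, x, reduction='none').sum(-1)
        return (log_px - kl).mean()                            # sequential ELBO

print(SimpleDRAW()(torch.rand(4, 784)))
```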

SLIDE 43

Sequential Latent Gaussian VAE

DRAW

Gregor, Karol, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. "DRAW: A recurrent neural network for image generation." ICML 2015

SLIDE 44

Generating Images from Captions

  • Generative Model: stochastic recurrent network, a chained sequence of variational autoencoders, each with a single stochastic layer.
  • Recognition Model: deterministic recurrent network.

[Figure: overall model, with the stochastic layer of each variational autoencoder in the chain highlighted]

(Mansimov, Parisotto, Ba, Salakhutdinov, 2015) (Gregor et al., 2015)

SLIDE 45

Motivating Example

  • Can we generate images from natural language descriptions?

(Mansimov, Parisotto, Ba, Salakhutdinov, 2015)

[Figure: generated images for the captions "A stop sign is flying in blue skies", "A pale yellow school bus is flying in blue skies", "A large commercial airplane is flying in blue skies"]

SLIDE 46

Flipping Colors

[Figure: generated images for the captions "A yellow school bus parked in the parking lot", "A red school bus parked in the parking lot", "A green school bus parked in the parking lot", "A blue school bus parked in the parking lot"]

SLIDE 47

Flipping Backgrounds

[Figure: generated images for the captions "A very large commercial plane flying in clear skies", "A very large commercial plane flying in rainy skies", "A herd of elephants walking across a dry grass field", "A herd of elephants walking across a green grass field"]

SLIDE 48

Next lecture: Generative Adversarial Networks