Boltzmann Machines, Belief Nets, and Adversarial Networks - PowerPoint PPT Presentation



SLIDE 1

Unsupervised Learning

Non-probabilistic Models
  • Sparse Coding
  • Autoencoders
  • Others (e.g. k-means)

Probabilistic (Generative) Models

Explicit Density p(x):
  • Tractable Models: Fully observed Belief Nets, NADE, PixelRNN
  • Non-Tractable Models: Boltzmann Machines, Variational Autoencoders, Helmholtz Machines, many others...

Implicit Density:
  • Generative Adversarial Networks
  • Moment Matching Networks

SLIDE 2

Unsupervised Learning

Basic Building Blocks:
  • Sparse Coding
  • Autoencoders
  • Deep Generative Models
    ⎯ Restricted Boltzmann Machines
    ⎯ Deep Boltzmann Machines
    ⎯ Deep Belief Networks
    ⎯ Helmholtz Machines / Variational Autoencoders
    ⎯ Generative Adversarial Networks

SLIDE 3

Deep Generative Model

Model P(image) trained on 25,000 characters from 50 alphabets around the world:
  • 3,000 hidden variables
  • 784 observed variables (28 x 28 images)
  • About 2 million parameters

[Figure: samples of Sanskrit characters generated by the model, a Bernoulli Markov Random Field.]

SLIDE 4

Deep Generative Model

Conditional Simulation: sample from P(image | partial image).

Why so difficult? There are 2^(28 x 28) possible images!

[Figure: conditional samples from the Bernoulli Markov Random Field given a partially observed image.]

SLIDE 5

Fully Observed Models

  • Explicitly model conditional probabilities:

$$p_{\text{model}}(\mathbf{x}) = p_{\text{model}}(x_1) \prod_{i=2}^{n} p_{\text{model}}(x_i \mid x_1, \ldots, x_{i-1})$$

SLIDE 6

Fully Observed Models

  • Explicitly model conditional probabilities:

$$p_{\text{model}}(\mathbf{x}) = p_{\text{model}}(x_1) \prod_{i=2}^{n} p_{\text{model}}(x_i \mid x_1, \ldots, x_{i-1})$$

Each conditional can be a complicated neural network.

SLIDE 7

Fully Observed Models

  • Explicitly model conditional probabilities:

$$p_{\text{model}}(\mathbf{x}) = p_{\text{model}}(x_1) \prod_{i=2}^{n} p_{\text{model}}(x_i \mid x_1, \ldots, x_{i-1})$$

Each conditional can be a complicated neural network, e.g. a Pixel CNN.

  • A number of successful models, including:
    ⎯ NADE, RNADE (Larochelle et al. 2011)
    ⎯ Pixel CNN (van den Oord et al. 2016)
    ⎯ Pixel RNN (van den Oord et al. 2016)
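To make the factorization concrete, here is a minimal Python sketch of ancestral sampling from a fully observed model. The logistic conditional and the weight matrix W are illustrative stand-ins, not the lecture's models; a Pixel CNN/RNN would replace the inner line with a network evaluation.

```python
import numpy as np

def sample_autoregressive(W, b, rng):
    """Sample x ~ p(x_1) * prod_{i>=2} p(x_i | x_1..x_{i-1}) for binary pixels."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        # Each conditional may look only at previously sampled pixels.
        logit = b[i] + W[i, :i] @ x[:i]
        p = 1.0 / (1.0 + np.exp(-logit))   # p(x_i = 1 | x_<i)
        x[i] = float(rng.random() < p)
    return x

rng = np.random.default_rng(0)
n = 28 * 28                                # e.g. a binarized 28 x 28 image
W = rng.normal(scale=0.01, size=(n, n))    # toy parameters, untrained
b = np.zeros(n)
x = sample_autoregressive(W, b, rng)
```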

SLIDE 8

Restricted Boltzmann Machines

RBM is a Markov Random Field with:
  • Stochastic binary visible variables
  • Stochastic binary hidden variables
  • Bipartite connections between the hidden variables (feature detectors) and the visible variables (image pixels).

Related models: Markov random fields, Boltzmann machines, log-linear models.

Graphical Models: a powerful framework for representing dependency structure between random variables.

[Figure: bipartite graph with hidden variables (feature detectors) above visible variables (image pixels).]

SLIDE 9

Restricted Boltzmann Machines

RBM is a Markov Random Field with:
  • Stochastic binary visible variables
  • Stochastic binary hidden variables
  • Bipartite connections.

The joint distribution contains pairwise and unary terms:

$$P_\theta(\mathbf{v}, \mathbf{h}) = \frac{1}{Z(\theta)} \exp\left( \mathbf{v}^\top W \mathbf{h} + \mathbf{b}^\top \mathbf{v} + \mathbf{a}^\top \mathbf{h} \right)$$

where the partition function Z(θ) is intractable.

SLIDE 10

Restricted Boltzmann Machines

RBM is a Markov Random Field with:
  • Stochastic binary visible variables
  • Stochastic binary hidden variables
  • Bipartite connections.

[Figure: bipartite graph of hidden variables (feature detectors) and visible variables (image pixels), with the pairwise and unary terms labeled.]
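Because the connections are bipartite, both conditionals factorize into independent logistic units. A minimal NumPy sketch (the parameter names W, a, b are illustrative assumptions matching the joint above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, a):
    """P(h_j = 1 | v) = sigmoid(a_j + sum_i v_i W_ij); factorizes over j."""
    return sigmoid(a + v @ W)

def p_v_given_h(h, W, b):
    """P(v_i = 1 | h) = sigmoid(b_i + sum_j W_ij h_j); factorizes over i."""
    return sigmoid(b + h @ W.T)
```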

SLIDE 11

Learning Features

Observed Data: subset of 25,000 characters.
Learned W: "edges" (subset of 1,000 features).

New Image: decomposed as a combination of learned features (sparse representations).

Logistic Function: suitable for modeling binary images.

[Figure: observed character data, learned edge-like filters, and the feature decomposition of a new image.]

SLIDE 12

Model Learning

Given a set of i.i.d. training examples, we want to learn the model parameters.

[Figure: RBM with hidden units above image (visible) units.]

SLIDE 13

Model Learning

Given a set of i.i.d. training examples, we want to learn the model parameters.

Maximize the log-likelihood objective.

SLIDE 14

Model Learning

Given a set of i.i.d. training examples, we want to learn the model parameters.

Maximize the log-likelihood objective; learning requires the derivative of the log-likelihood.

SLIDE 15

Model Learning

The derivative of the log-likelihood contains a model expectation that is difficult to compute: exponentially many configurations.

SLIDE 16

Model Learning

The derivative of the log-likelihood contains a model expectation that is difficult to compute: exponentially many configurations.

SLIDE 17

Model Learning

In the derivative of the log-likelihood, the data-dependent term is easy to compute exactly, while the model term is difficult to compute (exponentially many configurations). Use MCMC for approximate maximum likelihood learning; the two terms are written out below.
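For reference, the objective and its gradient for an RBM can be written as follows (a standard form, using the pairwise weights as the example; the slides' exact notation may differ):

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \log P_\theta(\mathbf{v}^{(n)}), \qquad \frac{\partial \mathcal{L}}{\partial W_{ij}} = \mathbb{E}_{P_{\text{data}}}[v_i h_j] - \mathbb{E}_{P_\theta}[v_i h_j]$$

The first, data-dependent expectation is easy to compute exactly because P(h | v) factorizes; the second, model expectation runs over exponentially many configurations, which is why MCMC is used.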

SLIDE 18

Approximate Learning

  • An approximation to the gradient of the log-likelihood objective:
  • Replace the average over all possible input configurations by samples.
  • Run an MCMC chain (Gibbs sampling) starting from the observed examples (see the sketch below):

  • Initialize v0 = v
  • Sample h0 from P(h | v0)
  • For t = 1:T
    ⎯ Sample vt from P(v | ht-1)
    ⎯ Sample ht from P(h | vt)
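A minimal sketch of this alternating Gibbs chain, reusing the conditional helpers p_h_given_v and p_v_given_h from the earlier RBM sketch (T and the parameter names are assumptions):

```python
def gibbs_chain(v, W, a, b, T, rng):
    """Run T steps of alternating Gibbs sampling from an observed example v."""
    h = (rng.random(a.shape) < p_h_given_v(v, W, a)).astype(float)      # h0 ~ P(h | v0)
    for _ in range(T):
        v = (rng.random(b.shape) < p_v_given_h(h, W, b)).astype(float)  # vt ~ P(v | h_{t-1})
        h = (rng.random(a.shape) < p_h_given_v(v, W, a)).astype(float)  # ht ~ P(h | vt)
    return v, h
```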

SLIDE 19

Approximate ML Learning for RBMs

  • Run Markov chain (alternating Gibbs sampling):

SLIDE 20

Approximate ML Learning for RBMs

  • Run Markov chain (alternating Gibbs sampling):

[Figure: the chain initialized at the data.]

SLIDE 21

Approximate ML Learning for RBMs

  • Run Markov chain (alternating Gibbs sampling):

[Figure: the chain initialized at the data.]

SLIDE 22

Approximate ML Learning for RBMs

  • Run Markov chain (alternating Gibbs sampling):

[Figure: the chain initialized at the data, shown after T = 1 step.]

SLIDE 23

Approximate ML Learning for RBMs

  • Run Markov chain (alternating Gibbs sampling):

[Figure: the chain from the data (T = 1) to T = infinity, where it reaches the equilibrium distribution.]

SLIDE 24

Contrastive Divergence (Hinton, Neural Computation 2002)

  • A quick way to learn an RBM:
    ⎯ Start with a training vector on the visible units.
    ⎯ Update all the hidden units in parallel.
    ⎯ Update all the visible units in parallel to get a "reconstruction".
    ⎯ Update the hidden units again.
  • Update the model parameters.
  • Implementation: ~10 lines of Matlab code.

[Figure: data and reconstructed data.]

SLIDE 25

Contrastive Divergence (Hinton, Neural Computation 2002)

  • A quick way to learn an RBM:
    ⎯ Start with a training vector on the visible units.
    ⎯ Update all the hidden units in parallel.
    ⎯ Update all the visible units in parallel to get a "reconstruction".
    ⎯ Update the hidden units again.
  • Update the model parameters.
  • Implementation: ~10 lines of Matlab code.

The distributions of the data and the reconstructed data should be the same. (A sketch of the update follows below.)
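A hedged NumPy sketch of one CD-1 update following the four steps above, reusing the conditional helpers from the earlier RBM sketch (the learning rate lr is an illustrative assumption):

```python
import numpy as np

def cd1_update(v_data, W, a, b, lr, rng):
    """One contrastive-divergence (CD-1) parameter update for a binary RBM."""
    # 1. Start with a training vector on the visible units; update hidden units.
    ph_data = p_h_given_v(v_data, W, a)
    h = (rng.random(ph_data.shape) < ph_data).astype(float)
    # 2. Update the visible units in parallel to get a "reconstruction".
    pv_recon = p_v_given_h(h, W, b)
    v_recon = (rng.random(pv_recon.shape) < pv_recon).astype(float)
    # 3. Update the hidden units again.
    ph_recon = p_h_given_v(v_recon, W, a)
    # 4. Update parameters: data statistics minus reconstruction statistics.
    W += lr * (np.outer(v_data, ph_data) - np.outer(v_recon, ph_recon))
    b += lr * (v_data - v_recon)
    a += lr * (ph_data - ph_recon)
    return W, a, b
```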

SLIDE 26

RBMs for Real-valued Data

Gaussian-Bernoulli RBM:
  • Stochastic real-valued visible variables
  • Stochastic binary hidden variables
  • Bipartite connections.

[Figure: RBM with hidden units above real-valued image (visible) units, with pairwise and unary terms.]
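For reference, a common parameterization of the Gaussian-Bernoulli energy (a standard form; the lecture's exact notation may differ):

$$E(\mathbf{v}, \mathbf{h}) = \sum_{i} \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_{i,j} \frac{v_i}{\sigma_i} W_{ij} h_j - \sum_{j} a_j h_j, \qquad P_\theta(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z(\theta)}$$

Under this energy, P(v_i | h) is a Gaussian with mean $b_i + \sigma_i \sum_j W_{ij} h_j$, while P(h_j = 1 | v) remains a logistic function, so inference stays as cheap as in the binary case.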

SLIDE 27

RBMs for Real-valued Data

Trained on 4 million unlabeled images.

[Figure: learned features (out of 10,000).]

SLIDE 28

RBMs for Real-valued Data

Trained on 4 million unlabeled images.

[Figure: a new image expressed as a weighted combination of learned features, e.g. 0.9 * (filter) + 0.8 * (filter) + 0.6 * (filter) + ...]

SLIDE 29

RBMs for Word Counts
(Salakhutdinov & Hinton, NIPS 2010; Srivastava & Salakhutdinov, NIPS 2012)

Replicated Softmax Model: an undirected topic model with:
  • Stochastic 1-of-K visible variables
  • Stochastic binary hidden variables
  • Bipartite connections.

The joint distribution contains pairwise and unary terms:

$$P_\theta(\mathbf{v}, \mathbf{h}) = \frac{1}{Z(\theta)} \exp\left( \sum_{i=1}^{D}\sum_{k=1}^{K}\sum_{j=1}^{F} W_{ij}^{k} v_i^{k} h_j + \sum_{i=1}^{D}\sum_{k=1}^{K} v_i^{k} b_i^{k} + \sum_{j=1}^{F} h_j a_j \right)$$

with softmax conditionals over the visibles:

$$P_\theta(v_i^{k} = 1 \mid \mathbf{h}) = \frac{\exp\left( b_i^{k} + \sum_{j=1}^{F} h_j W_{ij}^{k} \right)}{\sum_{q=1}^{K} \exp\left( b_i^{q} + \sum_{j=1}^{F} h_j W_{ij}^{q} \right)}$$
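A small NumPy sketch of the softmax conditional over a single visible unit (the shapes and names are illustrative assumptions: W has shape (D, K, F) and b has shape (D, K)):

```python
import numpy as np

def p_v_given_h_softmax(i, h, W, b):
    """P(v_i^k = 1 | h) for visible unit i of a Replicated Softmax model."""
    logits = b[i] + W[i] @ h        # b_i^k + sum_j h_j W_ij^k, for every k
    logits -= logits.max()          # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()              # length-K distribution over the vocabulary
```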

SLIDE 30

RBMs for Word Counts
(Salakhutdinov & Hinton, NIPS 2010; Srivastava & Salakhutdinov, NIPS 2012)

Replicated Softmax Model: an undirected topic model with:
  • Stochastic 1-of-K visible variables
  • Stochastic binary hidden variables
  • Bipartite connections.

The joint distribution and the softmax conditionals are as on the previous slide.

SLIDE 31

RBMs for Word Counts
(Salakhutdinov & Hinton, NIPS 2010; Srivastava & Salakhutdinov, NIPS 2012)

Replicated Softmax Model, as on the previous slides.

Reuters dataset: 804,414 unlabeled newswire stories; Bag-of-Words representation.

Learned features: "topics"
  • russian, russia, moscow, yeltsin, soviet
  • clinton, house, president, bill, congress
  • computer, system, product, software, develop
  • trade, country, import, world, economy
  • stock, wall, street, point, dow

SLIDE 32

RBMs for Word Counts

  • One-step reconstruction from the Replicated Softmax model.

SLIDE 33

Collaborative Filtering (Salakhutdinov, Mnih, Hinton, ICML 2007)

  • Binary hidden: user preferences
  • Multinomial visible: user ratings

[Figure: RBM with binary hidden layer h connected by weights W1 to multinomial visible ratings v.]

SLIDE 34

Collaborative Filtering (Salakhutdinov, Mnih, Hinton, ICML 2007)

  • Binary hidden: user preferences
  • Multinomial visible: user ratings

Netflix dataset: 480,189 users; 17,770 movies; over 100 million ratings.

SLIDE 35

Collaborative Filtering (Salakhutdinov, Mnih, Hinton, ICML 2007)

  • Binary hidden: user preferences
  • Multinomial visible: user ratings

Netflix dataset: 480,189 users; 17,770 movies; over 100 million ratings.

Learned features: "genre"
  • Fahrenheit 9/11, Bowling for Columbine, The People vs. Larry Flynt, Canadian Bacon, La Dolce Vita
  • Friday the 13th, The Texas Chainsaw Massacre, Children of the Corn, Child's Play, The Return of Michael Myers
  • Independence Day, The Day After Tomorrow, Con Air, Men in Black II, Men in Black
  • Scary Movie, Naked Gun, Hot Shots!, American Pie, Police Academy

SLIDE 36

Different Data Modalities

  • Binary / Gaussian / Softmax RBMs: all have binary hidden variables but use them to model different kinds of data.
  • It is easy to infer the states of the hidden variables in every case.

[Figure: the three visible-unit types: binary, real-valued, 1-of-K.]

SLIDE 37

Product of Experts

The joint distribution is given by the RBM form:

$$P_\theta(\mathbf{v}, \mathbf{h}) = \frac{1}{Z(\theta)} \exp\left( \mathbf{v}^\top W \mathbf{h} + \mathbf{b}^\top \mathbf{v} + \mathbf{a}^\top \mathbf{h} \right)$$

Marginalizing over the hidden variables yields a Product of Experts (written out below).
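Summing out each binary hidden unit independently gives the standard Product-of-Experts form (a routine derivation from the joint above, sketched here):

$$P_\theta(\mathbf{v}) = \frac{1}{Z(\theta)} \sum_{\mathbf{h}} \exp\left( \mathbf{v}^\top W \mathbf{h} + \mathbf{b}^\top \mathbf{v} + \mathbf{a}^\top \mathbf{h} \right) = \frac{e^{\mathbf{b}^\top \mathbf{v}}}{Z(\theta)} \prod_{j=1}^{F} \left( 1 + \exp\left( a_j + \sum_{i} W_{ij} v_i \right) \right)$$

Each factor acts as one "expert", and experts multiply, which is why several topics can jointly assign very high probability to a single word.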

SLIDE 38

Product of Experts

Marginalizing over the hidden variables gives a Product of Experts.

Learned topics (example words):
  • government, authority, power, empire, federation
  • clinton, house, president, bill, congress
  • bribery, corruption, dishonesty, corrupt, fraud
  • mafia, business, gang, mob, insider
  • stock, wall, street, point, dow
  • ...

Topics "government", "corruption" and "mafia" can combine to give very high probability to the word "Silvio Berlusconi".

SLIDE 39

Product of Experts

Topics "government", "corruption" and "mafia" can combine to give very high probability to the word "Silvio Berlusconi".

[Figure: precision (%) vs. recall (%) on document retrieval, comparing 50-D Replicated Softmax with 50-D LDA.]

SLIDE 40

Unsupervised Learning

Basic Building Blocks:
  • Sparse Coding
  • Autoencoders
  • Deep Generative Models
    ⎯ Restricted Boltzmann Machines
    ⎯ Deep Boltzmann Machines
    ⎯ Deep Belief Networks
    ⎯ Helmholtz Machines / Variational Autoencoders
    ⎯ Generative Adversarial Networks

SLIDE 41

Deep Boltzmann Machine (Hinton et al. Neural Computation 2006)

Built from unlabeled inputs.

  • Input: Pixels
  • Low-level features: Edges

SLIDE 42

Deep Boltzmann Machine (Hinton et al. Neural Computation 2006)

Built from unlabeled inputs.

  • Input: Pixels
  • Low-level features: Edges
  • Higher-level features: Combinations of edges

Internal representations capture higher-order statistical structure.

SLIDE 43

Unsupervised Learning

Basic Building Blocks:
  • Sparse Coding
  • Autoencoders
  • Deep Generative Models
    ⎯ Restricted Boltzmann Machines
    ⎯ Deep Boltzmann Machines
    ⎯ Deep Belief Networks
    ⎯ Helmholtz Machines / Variational Autoencoders
    ⎯ Generative Adversarial Networks

SLIDE 44

Deep Belief Network

[Figure: a stack of hidden layers above a visible layer. The top two hidden layers form an RBM; the layers below form a Sigmoid Belief Network.]

SLIDE 45

Deep Belief Network

The joint probability distribution factorizes into an RBM (top two layers) and a Sigmoid Belief Network (lower layers).

SLIDE 46

Deep Belief Network

The joint probability distribution factorizes into an RBM (top two layers) and a Sigmoid Belief Network (lower layers).

SLIDE 47

Deep Belief Network

The joint probability distribution factorizes into an RBM (top two layers) and a Sigmoid Belief Network (lower layers); the factorization is written out below.
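For the three-hidden-layer network on these slides, the factorization takes the standard DBN form (reconstructed from the usual definition; notation follows the figure):

$$P(\mathbf{v}, \mathbf{h}^1, \mathbf{h}^2, \mathbf{h}^3) = P(\mathbf{v} \mid \mathbf{h}^1)\, P(\mathbf{h}^1 \mid \mathbf{h}^2)\, P(\mathbf{h}^2, \mathbf{h}^3)$$

The top factor P(h2, h3) is the RBM; the directed factors P(h1 | h2) and P(v | h1) form the Sigmoid Belief Network.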

SLIDE 48

Deep Belief Network

[Figure: layers v, h1, h2, h3 with weights W1, W2, W3. Top-down arrows show the generative process; bottom-up arrows show approximate inference.]

SLIDE 49

DBN Layer-wise Training

  • Learn an RBM with an input layer v and a hidden layer h.

SLIDE 50

DBN Layer-wise Training

  • Learn an RBM with an input layer v and a hidden layer h.
  • Treat the inferred values of the hidden units as the data for training the 2nd-layer RBM.
  • Learn and freeze the 2nd-layer RBM.
slide-51
SLIDE 51

DBN Layer-wise Training

  • Learn an RBM with an input layer v

and a hidden layer h.

  • Treat inferred values

as the data for training 2nd-layer RBM.

  • Learn and freeze 2nd layer RBM
  • Proceed to the next layer

148

v h2 h1 h3 W1 W3 W2

layer. layer as the data RBM.

  • Treat inferred values

as for training 2nd-layer

Unsupervised Feature Learning

SLIDE 52

DBN Layer-wise Training

  • Learn an RBM with an input layer v and a hidden layer h.
  • Treat the inferred values of the hidden units as the data for training the 2nd-layer RBM.
  • Learn and freeze the 2nd-layer RBM.
  • Proceed to the next layer.

Unsupervised Feature Learning.

Layer-wise pretraining improves the variational lower bound. (A sketch of the greedy loop follows below.)
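A compact sketch of the greedy loop, reusing cd1_update and p_h_given_v from the earlier sketches (layer sizes, epochs, and learning rate are illustrative assumptions):

```python
import numpy as np

def train_dbn(data, layer_sizes, epochs=5, lr=0.05, seed=0):
    """Greedy layer-wise pretraining: train an RBM, freeze it, and feed its
    inferred hidden activations to the next RBM as data."""
    rng = np.random.default_rng(seed)
    rbms, x = [], data
    for n_hid in layer_sizes:
        W = rng.normal(scale=0.01, size=(x.shape[1], n_hid))
        a, b = np.zeros(n_hid), np.zeros(x.shape[1])
        for _ in range(epochs):
            for v in x:
                W, a, b = cd1_update(v, W, a, b, lr, rng)
        rbms.append((W, a, b))
        x = p_h_given_v(x, W, a)   # inferred values become the next layer's data
    return rbms
```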

SLIDE 53

Why Does this Pre-training Work?

  • Greedy training improves the variational lower bound!
  • The bound holds for any approximating distribution Q(h | v).

SLIDE 54

Why Does this Pre-training Work?

  • Greedy training improves the variational lower bound!
  • The bound holds for any approximating distribution Q(h | v).
  • An RBM and a 2-layer DBN are equivalent when the 2nd-layer RBM is initialized as the transpose of the 1st-layer weights.
  • At that initialization the lower bound is tight, and the log-likelihood improves by greedy training of the 2nd-layer RBM (the bound is written out below).
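The bound in question, written out (a standard form; reconstructed from the usual greedy-training argument):

$$\log P(\mathbf{v}; \theta) \;\geq\; \sum_{\mathbf{h}} Q(\mathbf{h} \mid \mathbf{v}) \log P(\mathbf{v}, \mathbf{h}; \theta) + \mathcal{H}\big(Q(\mathbf{h} \mid \mathbf{v})\big)$$

Initializing the 2nd-layer RBM as the transpose of the 1st makes the bound tight, so any improvement from training the 2nd layer improves the bound on the log-likelihood.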

SLIDE 55

Learning Part-based Representation (Lee, Grosse, Ranganath, Ng, ICML 2009)

Convolutional DBN trained on face images.

[Figure: hierarchy from object parts to groups of parts to faces, across layers h1, h2, h3 with weights W1, W2, W3.]

SLIDE 56

Learning Part-based Representation (Lee, Grosse, Ranganath, Ng, ICML 2009)

[Figure: learned part hierarchies for Faces, Cars, Elephants, Chairs.]

SLIDE 57

Learning Part-based Representation (Lee, Grosse, Ranganath, Ng, ICML 2009)

Trained on multiple classes (cars, faces, motorbikes, airplanes): groups of parts and class-specific object parts emerge.

SLIDE 58

Next lecture: Deep Generative Models, Part I