SLIDE 1

Neural Network Part 5: Unsupervised Models

CS 760@UW-Madison

SLIDE 2

Goals for the lecture

You should understand the following concepts:

  • autoencoder
  • restricted Boltzmann machine (RBM)
  • Nash equilibrium
  • minimax game
  • generative adversarial network (GAN)


SLIDE 3

Autoencoder

SLIDE 4

Autoencoder

  • A neural network trained to copy its input to its output
  • Contains two parts (see the sketch below):
  • Encoder: maps the input to a hidden representation
  • Decoder: maps the hidden representation to the output
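
To make the structure concrete, here is a minimal PyTorch sketch (not from the slides; the layer sizes, nonlinearity, and names are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Minimal autoencoder sketch: encoder f maps input x to the code h,
# decoder g maps h to the reconstruction r = g(f(x)).
class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, input_dim)

    def forward(self, x):
        h = self.encoder(x)     # hidden representation (the code)
        return self.decoder(h)  # reconstruction

model = Autoencoder()
x = torch.randn(16, 784)              # a batch of 16 dummy inputs
loss = ((model(x) - x) ** 2).mean()   # reconstruction error L(x, r)
```
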
SLIDE 5

Autoencoder

[Diagram: input $x$ → hidden representation $h$ (the code) → reconstruction $r$]

SLIDE 6

Autoencoder

[Diagram: encoder $f(\cdot)$ maps input $x$ to the code $h$; decoder $g(\cdot)$ maps $h$ to the reconstruction $r$]

$h = f(x), \qquad r = g(h) = g(f(x))$

SLIDE 7

Why copy the input to the output?

  • We do not really care about the copying itself
  • Interesting case: the network is NOT able to copy exactly but strives to do so
  • The autoencoder is forced to select which aspects of the input to preserve, and thus can hopefully learn useful properties of the data
  • Historical note: goes back to (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, 1994)

SLIDE 8

Undercomplete autoencoder

  • Constrain the code to have smaller dimension than the input
  • Training: minimize a loss function

$L(x, r) = L(x, g(f(x)))$

SLIDE 9

Undercomplete autoencoder

  • Constrain the code to have smaller dimension than the input
  • Training: minimize a loss function

$L(x, r) = L(x, g(f(x)))$

  • Special case: $f, g$ linear, $L$ the mean squared error
  • Reduces to Principal Component Analysis (PCA), as illustrated below
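
A quick way to see the PCA connection (a NumPy sketch, not slide material, assuming centered data): the optimal rank-$k$ linear reconstruction under mean squared error is the projection onto the top-$k$ principal components, which is what a trained linear autoencoder converges to.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X -= X.mean(axis=0)          # center the data (assumed throughout)

k = 3
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T               # top-k principal directions

H = X @ V_k                  # linear "encoder": k-dimensional code h
R = H @ V_k.T                # linear "decoder": reconstruction r
print("rank-%d reconstruction MSE:" % k, np.mean((X - R) ** 2))
```
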
SLIDE 10

Undercomplete autoencoder

  • What about a nonlinear encoder and decoder?
  • Capacity should not be too large
  • Suppose we are given data $x^{(1)}, x^{(2)}, \dots, x^{(n)}$
  • Encoder maps $x^{(i)}$ to $i$
  • Decoder maps $i$ back to $x^{(i)}$
  • A one-dimensional code $h$ then suffices for perfect reconstruction, without learning anything useful about the data
SLIDE 11

Regularization

  • Typically NOT achieved by
  • keeping the encoder/decoder shallow, or
  • using a small code size
  • Regularized autoencoders: add a regularization term that encourages the model to have other properties
  • Sparsity of the representation (sparse autoencoder)
  • Robustness to noise or to missing inputs (denoising autoencoder)
SLIDE 12

Sparse autoencoder

  • Constrain the code to be sparse
  • Training: minimize a loss function (a code sketch follows below)

$L_R = L(x, g(f(x))) + R(h)$
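
In code, the only change from a plain autoencoder is the extra penalty term. A minimal sketch with an assumed $\ell_1$ regularizer $R(h) = \lambda \|h\|_1$ (the weight `lam` and all sizes are illustrative):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Linear(64, 784)

x = torch.randn(16, 784)
h = encoder(x)                 # code
r = decoder(h)                 # reconstruction

lam = 1e-3                                 # illustrative sparsity weight
recon = ((r - x) ** 2).mean()              # L(x, g(f(x)))
penalty = lam * h.abs().sum(dim=1).mean()  # R(h) = lam * ||h||_1
loss = recon + penalty
```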

SLIDE 13

Probabilistic view of regularizing $h$

  • Suppose we have a probabilistic model $p(h, x)$
  • MLE on $x$:

$\log p(x) = \log \sum_{h'} p(h', x)$

  • Hard to sum over $h'$
SLIDE 14

Probabilistic view of regularizing $h$

  • Suppose we have a probabilistic model $p(h, x)$
  • MLE on $x$:

$\max \log p(x) = \max \log \sum_{h'} p(h', x)$

  • Approximation: suppose $h = f(x)$ gives the most likely hidden representation, and $\sum_{h'} p(h', x)$ can be approximated by $p(h, x)$

SLIDE 15

Probabilistic view of regularizing $h$

  • Suppose we have a probabilistic model $p(h, x)$
  • Approximate MLE on $x$, with $h = f(x)$:

$\max \log p(h, x) = \max\,[\log p(x|h) + \log p(h)]$

where $\log p(x|h)$ corresponds to the reconstruction loss and $\log p(h)$ to the regularization term

SLIDE 16

Sparse autoencoder

  • Constrain the code to be sparse
  • Laplacian prior: $p(h) = \prod_i \frac{\lambda}{2} \exp(-\lambda |h_i|)$
  • Training: minimize a loss function (see the derivation below)

$L_R = L(x, g(f(x))) + \lambda \|h\|_1$
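
Spelling out the step the slide leaves implicit (standard algebra; only the additive constant is glossed over): taking the negative log of the Laplacian prior turns the $\log p(h)$ regularization term into exactly the $\ell_1$ penalty above.

```latex
-\log p(h) = -\sum_i \log\!\Big(\frac{\lambda}{2}\, e^{-\lambda |h_i|}\Big)
           = \lambda \sum_i |h_i| + \text{const}
           = \lambda \|h\|_1 + \text{const}
```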

SLIDE 17

Denoising autoencoder

  • Traditional autoencoder: encouraged to learn $g(f(\cdot))$ to be the identity
  • Denoising: minimize a loss function (a code sketch follows below)

$L(x, r) = L(x, g(f(\tilde{x})))$, where $\tilde{x}$ is $x$ + noise
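
A minimal sketch of the training step (not slide material; the Gaussian corruption and all sizes are assumptions): corrupt the input, then score the reconstruction against the clean input.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Linear(64, 784)

x = torch.randn(16, 784)
x_tilde = x + 0.1 * torch.randn_like(x)   # x~ = x + noise
r = decoder(encoder(x_tilde))             # g(f(x~))
loss = ((r - x) ** 2).mean()              # L(x, g(f(x~))): reconstruct the CLEAN x
```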

SLIDE 18

Boltzmann Machine

SLIDE 19

Boltzmann machine

  • Introduced by Ackley et al. (1985)
  • General "connectionist" approach to learning arbitrary probability distributions over binary vectors
  • Special case of energy model: $p(x) = \frac{\exp(-E(x))}{Z}$

SLIDE 20

Boltzmann machine

  • Energy model:

$p(x) = \frac{\exp(-E(x))}{Z}$

  • Boltzmann machine: special case of energy model with

$E(x) = -x^\top U x - b^\top x$

where $U$ is the weight matrix and $b$ is the bias parameter

SLIDE 21

Boltzmann machine with latent variables

  • Some variables are not observed

$x = (x_v, x_h)$, with $x_v$ visible and $x_h$ hidden

$E(x) = -x_v^\top R x_v - x_v^\top W x_h - x_h^\top S x_h - b^\top x_v - c^\top x_h$

  • Universal approximator of probability mass functions
SLIDE 22

Maximum likelihood

  • Suppose we are given data $X = \{x_v^{(1)}, x_v^{(2)}, \dots, x_v^{(n)}\}$
  • Maximum likelihood is to maximize

$\log p(X) = \sum_i \log p(x_v^{(i)})$

where

$p(x_v) = \sum_{x_h} p(x_v, x_h) = \sum_{x_h} \frac{1}{Z} \exp(-E(x_v, x_h))$

  • $Z = \sum \exp(-E(x_v, x_h))$: the partition function, difficult to compute (see the sketch below)
SLIDE 23

Restricted Boltzmann machine

  • Invented under the name harmonium (Smolensky, 1986)
  • Popularized by Hinton and collaborators under the name restricted Boltzmann machine

SLIDE 24

Restricted Boltzmann machine

  • Special case of Boltzmann machine with latent variables:

$p(v, h) = \frac{\exp(-E(v, h))}{Z}$

where the energy function is

$E(v, h) = -v^\top W h - b^\top v - c^\top h$

with weight matrix $W$ and biases $b$, $c$

  • Partition function:

$Z = \sum_v \sum_h \exp(-E(v, h))$

SLIDE 25

Restricted Boltzmann machine

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 26

Restricted Boltzmann machine

  • Conditional distribution is factorial:

$p(h|v) = \frac{p(v, h)}{p(v)} = \prod_j p(h_j|v)$

and

$p(h_j = 1|v) = \sigma(c_j + v^\top W_{:,j})$

where $\sigma$ is the logistic function

SLIDE 27

Restricted Boltzmann machine

  • Similarly,

$p(v|h) = \frac{p(v, h)}{p(h)} = \prod_i p(v_i|h)$

and

$p(v_i = 1|h) = \sigma(b_i + W_{i,:}\, h)$

where $\sigma$ is the logistic function (a sampling sketch follows below)
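
Because both conditionals factorize, all hidden units can be sampled at once given $v$, and vice versa. A minimal NumPy sketch of one block-Gibbs step (not from the slides; parameter values are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_v, n_h = 6, 4
W = rng.normal(scale=0.1, size=(n_v, n_h))  # weight matrix W
b = rng.normal(scale=0.1, size=n_v)         # visible bias b
c = rng.normal(scale=0.1, size=n_h)         # hidden bias c

v = rng.integers(0, 2, size=n_v).astype(float)

p_h = sigmoid(c + v @ W)                    # p(h_j = 1 | v) = sigma(c_j + v^T W_:,j)
h = (rng.random(n_h) < p_h).astype(float)   # sample all hidden units in parallel

p_v = sigmoid(b + W @ h)                    # p(v_i = 1 | h) = sigma(b_i + W_i,: h)
v = (rng.random(n_v) < p_v).astype(float)   # sample all visible units in parallel
```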

SLIDE 28

Generative Adversarial Networks (GAN)

See Ian Goodfellow’s tutorial slides: http://www.iangoodfellow.com/slides/2018-06-22-gan_tutorial.pdf
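
For reference (the standard formulation from Goodfellow et al., 2014, summarized here rather than taken from this slide): the generator $G$ and discriminator $D$ play the minimax game

```latex
\min_G \max_D \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)]
  + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]
```

At the Nash equilibrium of this game, the generator's distribution matches $p_{\text{data}}$ and the optimal discriminator outputs $D(x) = 1/2$ everywhere.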

SLIDE 29

THANK YOU

Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, Pedro Domingos, Geoffrey Hinton, and Ian Goodfellow.