
SLIDE 1

Neural Network Part 5: Unsupervised Models

Yingyu Liang Computer Sciences 760 Fall 2017

http://pages.cs.wisc.edu/~yliang/cs760/

Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, and Pedro Domingos.

SLIDE 2

Goals for the lecture

You should understand the following concepts:

  • autoencoder
  • restricted Boltzmann machine (RBM)
  • Nash equilibrium
  • minimax game
  • generative adversarial network (GAN)


SLIDE 3

Autoencoder

  • A neural network trained to attempt to copy its input to its output
  • Contains two parts:
  • Encoder: maps the input to a hidden representation
  • Decoder: maps the hidden representation to the output
SLIDE 4

Autoencoder

[Figure: the input x is encoded into the hidden representation h (the code), which is decoded into the reconstruction r]
SLIDE 5

Autoencoder

Encoder f(·): h = f(x). Decoder g(·): r = g(h) = g(f(x)).

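As a concrete illustration of the pipeline h = f(x), r = g(f(x)), here is a minimal NumPy sketch; the layer sizes, the sigmoid encoder, and the linear decoder are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 6-dimensional input, 2-dimensional code.
d_in, d_code = 6, 2
W_enc = rng.normal(scale=0.1, size=(d_code, d_in))   # encoder weights
b_enc = np.zeros(d_code)
W_dec = rng.normal(scale=0.1, size=(d_in, d_code))   # decoder weights
b_dec = np.zeros(d_in)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(x):                       # encoder: x -> h
    return sigmoid(W_enc @ x + b_enc)

def g(h):                       # decoder: h -> r
    return W_dec @ h + b_dec

x = rng.normal(size=d_in)
h = f(x)                        # the code
r = g(h)                        # reconstruction r = g(f(x))
loss = np.sum((x - r) ** 2)     # squared-error reconstruction loss L(x, r)
```

Training would adjust the weights to reduce this loss over the data; only the forward pass is sketched here.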
SLIDE 6

Why do we want to copy the input to the output?

  • We do not really care about the copying itself
  • Interesting case: the network is NOT able to copy exactly but strives to do so
  • The autoencoder is forced to select which aspects of the input to preserve, and thus hopefully learns useful properties of the data
  • Historical note: goes back to (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, 1994)

SLIDE 7

Undercomplete autoencoder

  • Constrain the code to have smaller dimension than the input
  • Training: minimize a loss function

L(x, r) = L(x, g(f(x)))

SLIDE 8

Undercomplete autoencoder

  • Constrain the code to have smaller dimension than the input
  • Training: minimize a loss function

L(x, r) = L(x, g(f(x)))

  • Special case: f, g linear, L the mean squared error
  • Reduces to Principal Component Analysis
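The PCA connection can be checked numerically: projecting centered data onto the top principal direction is the best possible rank-1 linear encode/decode under squared error, so any other rank-1 linear autoencoder reconstructs no better. The data below is a hypothetical toy set, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 50 points in 3-D with one dominant direction of variance.
X = rng.normal(size=(50, 3)) @ np.diag([3.0, 1.0, 0.2])
X = X - X.mean(axis=0)                      # center, as PCA assumes

# Optimal linear encoder/decoder with squared error: project onto
# the top principal direction (code dimension k = 1).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V1 = Vt[:1].T                               # top principal direction, shape (3, 1)
H = X @ V1                                  # codes h
R = H @ V1.T                                # reconstructions r
pca_err = np.sum((X - R) ** 2)

# Any other rank-1 linear autoencoder (here: projection onto a random
# unit direction) incurs at least as much reconstruction error.
v = rng.normal(size=(3, 1))
v = v / np.linalg.norm(v)
rand_err = np.sum((X - (X @ v) @ v.T) ** 2)
```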
SLIDE 9

Undercomplete autoencoder

  • What about a nonlinear encoder and decoder?
  • Capacity should not be too large
  • Suppose we are given data x⁽¹⁾, x⁽²⁾, …, x⁽ⁿ⁾
  • Encoder maps x⁽ⁱ⁾ to the index i
  • Decoder maps i back to x⁽ⁱ⁾
  • A one-dimensional code h then suffices for perfect reconstruction, yet nothing useful about the data has been learned
SLIDE 10

Regularization

  • Typically does NOT mean:
  • Keeping the encoder/decoder shallow, or
  • Using a small code size
  • Regularized autoencoders: add a regularization term that encourages the model to have other properties
  • Sparsity of the representation (sparse autoencoder)
  • Robustness to noise or to missing inputs (denoising autoencoder)
SLIDE 11

Sparse autoencoder

  • Constrain the code to be sparse
  • Training: minimize a loss function

L_S = L(x, g(f(x))) + Ω(h)

SLIDE 12

Probabilistic view of regularizing h

  • Suppose we have a probabilistic model p(h, x)
  • MLE on x:

log p(x) = log Σ_{h'} p(h', x)

  • Hard to sum over h'
SLIDE 13

Probabilistic view of regularizing h

  • Suppose we have a probabilistic model p(h, x)
  • MLE on x:

max log p(x) = max log Σ_{h'} p(h', x)

  • Approximation: suppose h = f(x) gives the most likely hidden representation, so that Σ_{h'} p(h', x) can be approximated by p(h, x)

SLIDE 14

Probabilistic view of regularizing h

  • Suppose we have a probabilistic model p(h, x)
  • Approximate MLE on x, with h = f(x):

max log p(h, x) = max [ log p(x|h) + log p(h) ]

where log p(x|h) corresponds to the (negative) reconstruction loss and log p(h) to the regularization term

SLIDE 15

Sparse autoencoder

  • Constrain the code to be sparse
  • Laplacian prior: p(h) = (λ/2) exp(−λ‖h‖₁)
  • Training: minimize a loss function

L_S = L(x, g(f(x))) + λ‖h‖₁

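To make the two terms of L_S concrete, here is a small NumPy sketch of the sparse-autoencoder loss; the ReLU encoder, linear decoder, sizes, and λ = 0.5 are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

d_in, d_code = 5, 8                    # overcomplete code (hypothetical sizes)
W = rng.normal(scale=0.1, size=(d_code, d_in))   # encoder weights
D = rng.normal(scale=0.1, size=(d_in, d_code))   # decoder weights

x = rng.normal(size=d_in)
h = np.maximum(W @ x, 0.0)             # ReLU encoder: produces exact zeros
r = D @ h                              # linear decoder

lam = 0.5                              # weight of the Laplacian-prior term
recon = np.sum((x - r) ** 2)           # L(x, g(f(x)))
penalty = lam * np.sum(np.abs(h))      # lambda * ||h||_1 (= -log prior + const)
loss = recon + penalty
```

Minimizing `loss` trades reconstruction accuracy against sparsity of the code; larger λ pushes more code units to exactly zero.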
SLIDE 16

Denoising autoencoder

  • Traditional autoencoder: encouraged to learn g(f(·)) to be the identity
  • Denoising autoencoder: minimize a loss function

L(x, r) = L(x, g(f(x̃))), where x̃ is x + noise

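The key point is where x̃ enters: the network encodes the corrupted input but is scored against the clean one. A minimal NumPy sketch, with tied weights, a tanh encoder, and Gaussian noise as illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

x = rng.normal(size=6)                          # clean input
x_tilde = x + rng.normal(scale=0.1, size=6)     # corrupted copy: x~ = x + noise

# Hypothetical tied-weight autoencoder.
W = rng.normal(scale=0.3, size=(3, 6))
h = np.tanh(W @ x_tilde)                        # encode the *corrupted* input
r = W.T @ h                                     # decode

# The loss compares the reconstruction to the *clean* input:
loss = np.sum((x - r) ** 2)                     # L(x, g(f(x~)))
```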
SLIDE 17

Boltzmann machine

  • Introduced by Ackley et al. (1985)
  • General "connectionist" approach to learning arbitrary probability distributions over binary vectors
  • Special case of an energy model: p(x) = exp(−E(x)) / Z

SLIDE 18

Boltzmann machine

  • Energy model:

p(x) = exp(−E(x)) / Z

  • Boltzmann machine: special case of an energy model with

E(x) = −xᵀUx − bᵀx

where U is the weight matrix and b is the bias parameter

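For a tiny fully-visible machine, p(x) = exp(−E(x))/Z can be computed by brute force over all binary states; the 4-unit model and random parameters below are hypothetical, chosen only to show the energy and normalization.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical small model: 4 binary units.
d = 4
U = rng.normal(scale=0.5, size=(d, d))
U = (U + U.T) / 2                       # symmetric weight matrix
np.fill_diagonal(U, 0.0)                # no self-connections
b = rng.normal(scale=0.5, size=d)

def energy(x):
    return -x @ U @ x - b @ x           # E(x) = -x^T U x - b^T x

# Enumerate all 2**d binary states to normalize exactly.
states = [np.array(s) for s in itertools.product([0, 1], repeat=d)]
Z = sum(np.exp(-energy(x)) for x in states)          # partition function
p = np.array([np.exp(-energy(x)) / Z for x in states])
```

This enumeration has 2ᵈ terms, which is exactly why Z is intractable for realistic d.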
SLIDE 19

Boltzmann machine with latent variables

  • Some variables are not observed:

x = (x_v, x_h), with x_v visible and x_h hidden

E(x) = −x_vᵀ R x_v − x_vᵀ W x_h − x_hᵀ S x_h − bᵀ x_v − cᵀ x_h

  • Universal approximator of probability mass functions
SLIDE 20

Maximum likelihood

  • Suppose we are given data X = {x_v⁽¹⁾, x_v⁽²⁾, …, x_v⁽ⁿ⁾}
  • Maximum likelihood maximizes

log p(X) = Σ_i log p(x_v⁽ⁱ⁾)

where

p(x_v) = Σ_{x_h} p(x_v, x_h) = Σ_{x_h} (1/Z) exp(−E(x_v, x_h))

  • Z = Σ_{x_v, x_h} exp(−E(x_v, x_h)) is the partition function, which is difficult to compute
SLIDE 21

Restricted Boltzmann machine

  • Invented under the name harmonium (Smolensky, 1986)
  • Popularized by Hinton and collaborators under the name restricted Boltzmann machine

SLIDE 22

Restricted Boltzmann machine

  • Special case of a Boltzmann machine with latent variables:

p(v, h) = exp(−E(v, h)) / Z

where the energy function is

E(v, h) = −vᵀWh − bᵀv − cᵀh

with weight matrix W and biases b, c

  • Partition function:

Z = Σ_v Σ_h exp(−E(v, h))

SLIDE 23

Restricted Boltzmann machine

Figure from Deep Learning, Goodfellow, Bengio and Courville

SLIDE 24

Restricted Boltzmann machine

  • The conditional distribution is factorial:

p(h|v) = p(v, h) / p(v) = Π_j p(h_j|v)

and

p(h_j = 1|v) = σ(c_j + vᵀW_{:,j})

where σ is the logistic function

SLIDE 25

Restricted Boltzmann machine

  • Similarly,

p(v|h) = p(v, h) / p(h) = Π_i p(v_i|h)

and

p(v_i = 1|h) = σ(b_i + W_{i,:} h)

where σ is the logistic function

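These factorial conditionals are what make block Gibbs sampling in an RBM cheap: all hidden units can be sampled at once given v, and vice versa. A minimal NumPy sketch, with hypothetical sizes and random parameters:

```python
import numpy as np

rng = np.random.default_rng(5)

nv, nh = 4, 3                      # hypothetical sizes
W = rng.normal(scale=0.5, size=(nv, nh))
b = rng.normal(scale=0.5, size=nv)
c = rng.normal(scale=0.5, size=nh)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_h_given_v(v):
    # Factorial conditional: p(h_j = 1 | v) = sigma(c_j + v^T W[:, j])
    return sigmoid(c + v @ W)

def p_v_given_h(h):
    # p(v_i = 1 | h) = sigma(b_i + W[i, :] h)
    return sigmoid(b + W @ h)

# One step of block Gibbs sampling: v -> h -> v'
v = rng.integers(0, 2, size=nv).astype(float)
h = (rng.random(nh) < p_h_given_v(v)).astype(float)
v_new = (rng.random(nv) < p_v_given_h(h)).astype(float)
```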
SLIDE 26

Prisoners' Dilemma

Two suspects in a major crime are held in separate cells. There is enough evidence to convict each of them of a minor offense, but not enough evidence to convict either of them of the major crime unless one of them acts as an informer against the other (defects). If they both stay quiet, each will be convicted of the minor offense and spend one year in prison. If one and only one of them defects, she will be freed and used as a witness against the other, who will spend four years in prison. If they both defect, each will spend three years in prison.

Players: The two suspects.
Actions: Each player's set of actions is {Quiet, Defect}.
Preferences: Suspect 1's ordering of the action profiles, from best to worst, is (Defect, Quiet) (he defects and suspect 2 remains quiet, so he is freed), (Quiet, Quiet) (he gets one year in prison), (Defect, Defect) (he gets three years in prison), (Quiet, Defect) (he gets four years in prison). Suspect 2's ordering is (Quiet, Defect), (Quiet, Quiet), (Defect, Defect), (Defect, Quiet).
SLIDE 27

Payoff matrix (3 represents the best outcome, 0 the worst, etc.), built from the preference orderings on the previous slide; each cell lists (suspect 1's payoff, suspect 2's payoff):

                 Quiet     Defect
      Quiet      2, 2      0, 3
      Defect     3, 0      1, 1

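A pure-strategy Nash equilibrium is a cell where neither player can gain by deviating unilaterally. The check below encodes the Prisoners' Dilemma payoffs (3 = best, 0 = worst) and searches all action profiles; the helper `pure_nash` is an illustrative sketch, not from the slides.

```python
import numpy as np

# Prisoners' Dilemma payoffs, actions ordered [Quiet, Defect];
# entry [i, j] is the payoff when player 1 plays i and player 2 plays j.
P1 = np.array([[2, 0],
               [3, 1]])
P2 = np.array([[2, 3],
               [0, 1]])

def pure_nash(P1, P2):
    """Return all pure-strategy Nash equilibria of a 2-player game."""
    eqs = []
    for i in range(P1.shape[0]):
        for j in range(P1.shape[1]):
            best_for_1 = P1[i, j] >= P1[:, j].max()   # no profitable deviation for 1
            best_for_2 = P2[i, j] >= P2[i, :].max()   # no profitable deviation for 2
            if best_for_1 and best_for_2:
                eqs.append((i, j))
    return eqs

equilibria = pure_nash(P1, P2)
```

The unique equilibrium is (1, 1), i.e. (Defect, Defect), even though (Quiet, Quiet) would leave both players better off.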
SLIDE 28

Nash Equilibrium

Thanks, Wikipedia.

SLIDE 29

Another Example

Thanks, Prof. Osborne of U. Toronto, Economics

SLIDE 30

Minimax with Simultaneous Moves

  • maximin value: the largest value a player can be assured of without knowing the other players' actions
  • minimax value: the smallest value the other players can force this player to receive without knowing this player's action
  • the minimax value is an upper bound on the maximin value
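For pure strategies in a zero-sum matrix game these two values are one-liners; the payoff matrix below is a hypothetical example for the row player (the column player receives the negation).

```python
import numpy as np

# Hypothetical zero-sum payoff matrix for the row player.
A = np.array([[ 3, -1,  2],
              [ 1,  0, -2],
              [-2,  4,  1]])

# maximin: the row player picks the row whose worst case is best.
maximin = A.min(axis=1).max()      # max_i min_j A[i, j]

# minimax: the column player picks the column that caps the row
# player's payoff at the smallest possible value.
minimax = A.max(axis=0).min()      # min_j max_i A[i, j]
```

Here maximin = −1 and minimax = 2, illustrating the general guarantee maximin ≤ minimax; the two coincide (in mixed strategies) by von Neumann's minimax theorem.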
SLIDE 31

Key Result

  • Utility: numeric reward for actions
  • Game: 2 or more players take turns or take simultaneous actions. Moves lead to states, and states have utilities.
  • A game is like an optimization problem, but each player tries to maximize their own objective function (utility function)
  • Zero-sum game: each player's gain or loss in utility is exactly balanced by the others'
  • In a zero-sum game, the Minimax solution is the same as the Nash Equilibrium

SLIDE 32

Generative Adversarial Networks

  • Approach: set up a zero-sum game between two deep nets:
  – Generator: generate data that looks like the training set
  – Discriminator: distinguish between real and synthetic data
  • Motivation:
  – Building accurate generative models is hard (e.g., learning and sampling from a Markov net or Bayes net)
  – Want to use all our great progress on supervised learners to do this unsupervised learning task better
  – Deep nets may be our favorite supervised learner, especially for image data, if the nets are convolutional (use tricks of sliding windows with parameter tying, cross-entropy transfer function, batch normalization)

SLIDE 33

Does It Work?

Thanks, Ian Goodfellow, NIPS 2016 Tutorial on GANS, for this and most of what follows…

SLIDE 34

A Bit More on GAN Algorithm

SLIDE 35

The Rest of the Details

  • Use deep convolutional neural networks for the Discriminator D and the Generator G
  • Let x denote samples from the training set and z denote random, uniform input
  • Set up a zero-sum game by giving D the following objective, and G the negation of it:

V(D, G) = E_x[log D(x)] + E_z[log(1 − D(G(z)))]

  • Let D and G compute their gradients simultaneously, each make one step in the direction of the gradient, and repeat until neither can make progress… Minimax

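The value function V(D, G) can be estimated from samples. The 1-D toy below is an illustrative assumption (real data from N(2, 1), a generator that just shifts uniform noise by θ, a logistic discriminator with parameters w); it shows that moving the fake samples toward the data lowers V, which is exactly what the generator's gradient step aims for.

```python
import numpy as np

rng = np.random.default_rng(6)

def G(z, theta):
    # Toy "generator": shift the noise by a single parameter theta.
    return z + theta

def D(x, w):
    # Logistic discriminator: probability that x is real.
    return 1.0 / (1.0 + np.exp(-(w[0] * x + w[1])))

x_real = rng.normal(loc=2.0, size=1000)      # samples from the data distribution
z = rng.uniform(-1.0, 1.0, size=1000)        # random uniform input

def value(theta, w):
    # Monte Carlo estimate of V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]
    return (np.mean(np.log(D(x_real, w)))
            + np.mean(np.log(1.0 - D(G(z, theta), w))))

# D maximizes V, G minimizes it.  With a fixed discriminator that favors
# larger x, moving the fakes toward the data (theta: 0 -> 2) raises
# D(G(z)) and therefore lowers V.
w = np.array([1.0, -2.0])
v_far, v_close = value(0.0, w), value(2.0, w)
```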
SLIDE 36

Not So Fast

  • While the preceding version is theoretically elegant, in practice the gradient for G vanishes before we reach the best practical solution
  • While no longer a true Minimax game, use the same objective for D but change the objective for G to: maximize E_z[log D(G(z))]
  • Sometimes better if, instead of using one minibatch at a time to compute the gradient and do batch normalization, we also keep a fixed subset of the training set and use a combination of the fixed subset and the current minibatch

SLIDE 37

Comments on GANs

  • Potentially we can use our high-powered supervised learners to build better, faster data generators (can they replace MCMC, etc.?)
  • While there is some nice theory based on Nash Equilibria, results are better in practice if we move a bit away from the theory
  • In general, many in the ML community have a strong concern that we don't really understand why deep learning works, including GANs
  • Still much research into figuring out why this works better than other generative approaches for some types of data, how we can improve performance further, and how to take these from image data to other data types where CNNs might not be the most natural deep network structure