
ACISS’09 tutorial on deep belief nets

Marcus Frean

Melbourne, 2009

1 December, 2009

Victoria University, Wellington, New Zealand

Marcus Frean (Melbourne, 2009) ACISS’09 tutorial on deep belief nets 1 / 60

outline of this tutorial

1. motivations
   deep autoencoders; deep belief nets
2. sigmoid belief nets
   why are they hard to train? could layer-by-layer training work?
3. Boltzmann machines
   why are they hard to train? the restricted Boltzmann machine (RBM)
4. towers built from RBMs
   how to do it; why it works; fine-tuning the result; 2 applications: a classifier and an autoencoder

Several of the diagrams used here are based on those in Geoff Hinton’s papers & lectures.

back-propagation networks


autoencoder nets

unsupervised learners: map each pattern in a training set back to itself
dimensionality reduction, if there's a "bottleneck"
could be trained by back-propagation
a nice way to do dimensionality reduction


why haven’t deep auto-encoders worked?

all the hidden units interact
the gradient gets tiny as you move away from the "output" layer
too hard to learn


belief nets

A belief net is a directed acyclic graph composed of stochastic variables. We get to observe some of the variables, and would like to solve two problems:

1. The inference problem: infer the states of the unobserved variables.
2. The learning problem: adjust the interactions between variables to make the network more likely to generate the observed data.


parameterized belief networks

Large belief nets are still too powerful to learn with finite data. But we can parameterize the factors, e.g. with a sigmoid function...


what would a really interesting generative model for (say) images look like?

stochastic
lots of units
several layers
easy to sample from

sigmoid belief net: an interesting generative model


stochastic neurons

input to the ith neuron: φ_i = Σ_j w_ji x_j

probability of generating a 1: p_i = 1 / (1 + exp(−φ_i))

learning rule for making x more likely: Δw_ji ∝ (x_i − p_i) x_j

it's easy to make particular patterns more likely
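These three formulas can be checked numerically. Below is a minimal sketch (the input vector, initial weights, and learning rate are made up for illustration): repeatedly applying Δw_ji ∝ (x_i − p_i) x_j drives the neuron's probability of emitting a 1 towards the target.

```python
import numpy as np

def neuron_prob(x, w):
    """p_i = 1 / (1 + exp(-phi_i)), with phi_i the weighted sum of inputs."""
    phi = w @ x
    return 1.0 / (1.0 + np.exp(-phi))

def delta_w(x, target, w, eta=0.1):
    """Learning rule: Delta w_ji proportional to (x_i - p_i) x_j."""
    return eta * (target - neuron_prob(x, w)) * x

x = np.array([1.0, 0.0, 1.0])   # inputs to the neuron
w = np.zeros(3)                 # initial weights: p = 0.5
for _ in range(100):            # make "output 1" more likely
    w += delta_w(x, 1.0, w)
```

After the loop, `neuron_prob(x, w)` has climbed well above its initial 0.5, illustrating how easy it is to make a particular pattern more likely.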


sampling from the joint

It is easy to sample from the prior p(x) = p(h, v) of a sigmoidal belief net:

p(x) = p(x_1, x_2, …, x_n) = Π_i p(x_i | parents_i)

so we sample from each layer in turn, ending at the visible units. This seems like an attractive generative model.
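A sketch of that top-down (ancestral) pass, with invented layer sizes and random weights standing in for learned ones: sample the top layer from its biases, then each lower layer given the one above, ending at the visibles.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def ancestral_sample(weights, biases):
    """Sample from the prior of a layered sigmoid belief net, top-down."""
    x = (rng.random(biases[0].shape) < sigmoid(biases[0])).astype(float)
    for W, b in zip(weights, biases[1:]):
        # each unit: sigmoid of the weighted input from the layer above, then a coin flip
        x = (rng.random(b.shape) < sigmoid(W @ x + b)).astype(float)
    return x  # the visible pattern

# toy net: 4 top-layer units generating 6 visible units
W = rng.normal(size=(6, 4))
v = ancestral_sample([W], [np.zeros(4), np.zeros(6)])
```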


but...

to learn better weights based on a training set, we need to generate samples from the posterior, not the prior

joint p(h, v), the "generative model": start at the top and sample from sigmoids
posterior p(h|v), the "recognition model": sample the hidden units, given the visible ones

how to draw such samples? filter from the generative model? no!

oh dear


Gibbs sampling

To draw samples from p(x) = p(x1, x2, . . . , xn):

1. choose i at random
2. choose x_i from p(x_i | x_\i)

adapted from David MacKay’s classic book

This results in a Markov chain. Running the chain for long enough → samples from p(x).
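A minimal sketch of that procedure (the two-variable target distribution below is an invented example): the conditional makes the two binary variables prefer to agree, and after enough sweeps the chain's samples reflect that.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs(x, conditional, n_steps):
    """Repeatedly pick a variable i at random and resample it from p(x_i | x_\\i)."""
    x = x.copy()
    for _ in range(n_steps):
        i = rng.integers(len(x))
        x[i] = float(rng.random() < conditional(i, x))
    return x

# toy target over two binary variables that prefer to agree:
# p(x_i = 1 | x_other) = sigmoid(J * (2 * x_other - 1)), with coupling J > 0
J = 2.0
cond = lambda i, x: 1.0 / (1.0 + np.exp(-J * (2.0 * x[1 - i] - 1.0)))

samples = [gibbs(np.zeros(2), cond, 50) for _ in range(500)]
agree = np.mean([s[0] == s[1] for s in samples])
```

Most samples have the two variables equal, as the coupling intends: `agree` comes out well above 0.5.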


Gibbs sampling in sigmoid belief nets

So what does p(x_i | x_\i) look like, in a sigmoid belief net? Take two hidden "causes" h1, h2, with the visible node observed. Gibbs sampling for h1, h2 requires, e.g.,

p(h1 = 1 | v = 1) = [ 1 + ( (1 − f(b1)) f(w2) ) / ( f(b1) f(w1 + w2) ) ]^(−1) = yuck!

The Gibbs sampler in a sigmoid belief net is ugly and slow. The reason is 'explaining away'.


explaining away

Hidden states are independent in the prior, but dependent in the posterior. That dependence means sampling from one hidden unit has to cause a change in how all the other hidden units update their states. But we are interested in nets with lots of hidden units.

an inconvenient truth: there's no quick way to draw a sample from p(hidden | visible)
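Explaining away can be verified by brute-force enumeration of a tiny net (the two-cause network and its weights below are invented for illustration): given v = 1, the joint posterior over the two hidden causes is not the product of its marginals.

```python
import numpy as np
from itertools import product

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# two hidden "causes" h1, h2 with bias b; one visible child v with
# p(v = 1 | h) = sigmoid(w*h1 + w*h2 + c)
b, w, c = -1.0, 4.0, -2.0

def joint(h1, h2, v):
    p1 = sigmoid(b) if h1 else 1 - sigmoid(b)
    p2 = sigmoid(b) if h2 else 1 - sigmoid(b)
    pv = sigmoid(w * h1 + w * h2 + c)
    return p1 * p2 * (pv if v else 1 - pv)

# posterior p(h1, h2 | v = 1), by enumeration over the 4 hidden states
post = {hh: joint(hh[0], hh[1], 1) for hh in product([0, 1], repeat=2)}
Z = sum(post.values())
post = {hh: p / Z for hh, p in post.items()}

p_h1 = post[(1, 0)] + post[(1, 1)]   # marginal p(h1 = 1 | v = 1)
p_h2 = post[(0, 1)] + post[(1, 1)]
# independence would require post[(1,1)] == p_h1 * p_h2; it doesn't hold
gap = abs(post[(1, 1)] - p_h1 * p_h2)
```

Here `post[(1, 1)]` is noticeably smaller than `p_h1 * p_h2`: given that v is on, the two causes become anticorrelated, because either one suffices to explain the observation.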


sigmoidal belief nets are:

Easy to sample from as a generative model, but hard to learn

1. sampling from the posterior is slow, due to explaining away. This also makes them hard to use for recognition.
2. 'deep' layers learn nothing until the 'shallow' ones have settled, but shallow layers have to learn while being driven by the deep layers (chicken and egg...)

Let's ignore the quibbles about slowness for a moment, and consider this idea: one way around the second difficulty might be to train the first layer as a simple BN first, and then the second layer, and so on.


building a deep BN layer-by-layer

Here’s a way to train a multilayer sigmoid belief net:

1. start with a single layer only. The hidden units are driven by bias inputs only. Train to maximize the likelihood of generating the training data.
2. freeze the weights in that layer, and replace the hidden units' bias inputs by a second layer of weights.
3. train the second layer of weights to maximize the likelihood.
4. and so on...

Question: what should the training set be for the second layer?


the EM algorithm

The W1 weights, and the biases into the hidden units, are trained so as to maximize the probability of generating the patterns on the visible units in a data set. The EM algorithm achieves this by repeating two steps, for each v in the training data:

E step: calculate p(h|v)
M step: move the W1 weights so as to make v more likely, under p(h|v). And move the hidden biases to better produce that distribution.


the EM algorithm, via sampling

In practice we can't work with the posterior p(h|v) analytically: we have to sample from it instead. For each v in the training data:

E step: draw a sample from p(h|v)
M step: move the W1 weights so as to make v more likely, given that h. Move the hidden biases to better produce that h.


aggregate posterior

Averaging over the training set, we have an aggregate posterior distribution:

p_agg(h) = (1/|D|) Σ_{v∈D} p(h|v)

So the W1 weights end up at values that maximize the likelihood of the data, given h sampled from the aggregate posterior distribution. The hidden bias weights end up at values that approximate this distribution.


training the next layer

Q: what is the best thing that the second layer could learn to do?
A: accurately generate the aggregate posterior distribution over the layer-1 hidden units. It is the distribution that makes the training data most likely, given W1.

Easy! For each visible pattern, we just collect one sample (or more) from p(h|v). This gives us a greedy, layer-wise procedure for training deep belief nets.


a comment about factorial distributions

This greedy procedure doesn't work at all well. Subsequent layers add very little to what the first layer achieves in modelling the data set. Why not?

Consider some patterns v on a set of visible nodes. If these came from a world where the components of each vector are independent, then

p(v) = p(v_1, v_2, …, v_n) = Π_i p(v_i)

and we say p(v) is factorial. If p(v) is factorial, there is no point in having hidden units in the generative model: a model that included hidden units could do no better than a model with just bias inputs to the visibles.


why does greedy training fail for deep belief nets?

Consider the 1-layer architecture, and imagine going through the training set, clamping each visible pattern in turn, and generating a set of hidden patterns from the posterior p(h|v). These would be samples from an "aggregate posterior", namely

p_agg(h) ∝ Σ_{v∈D} p(h|v)

Notice: the posterior p(h|v) is not factorial, but the prior over h is factorial. Given that, learning the W1 weights with EM will result in weights that tend to make the aggregate posterior as factorial as possible. But a factorial aggregate posterior is the very last thing we want for a greedy learning procedure, because it leaves nothing left for W2 to do!


why does greedy training fail for deep belief nets?

EM:

1. for each v, sample from the posterior
2. make these v and h more likely under the joint

The prior distribution is factorial → W1 tries to learn features that are independent in the prior. But this is premature: we want to add another layer precisely because we DON'T BELIEVE THIS!


Boltzmann machines

If we connect stochastic sigmoid units via symmetric weights, and avoid self-weights, we get a Boltzmann machine. A Boltzmann machine generates state x with probability p(x) ∝ e^(−E(x)), where E is the energy

E(x) = −(1/2) Σ_{i,j} x_i x_j w_ij − Σ_i x_i b_i

In other words, Gibbs sampling for the above distribution is achieved by calculating the weighted sum into each unit, putting it through the sigmoid function, and choosing to output a 1 with the resulting probability. But you have to run a Markov chain: it's not 'instant'.
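The energy function and the resulting Gibbs update can be written down directly. A minimal sketch (the 2-unit weight matrix is a made-up example):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def energy(x, W, b):
    """E(x) = -1/2 sum_ij x_i x_j w_ij - sum_i x_i b_i (W symmetric, zero diagonal)."""
    return -0.5 * x @ W @ x - b @ x

def gibbs_update(x, W, b, i):
    """Resample unit i: weighted sum in, through the sigmoid, then a coin flip."""
    x[i] = float(rng.random() < sigmoid(W[i] @ x + b[i]))
    return x

W = np.array([[0.0, 1.0],
              [1.0, 0.0]])    # symmetric weights, no self-weights
b = np.zeros(2)
x = np.array([1.0, 1.0])      # E = -1.0 for this state
```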


Gibbs sampling in Boltzmann machines

sampling from the joint: just use the sigmoid activation rule
sampling from the hiddens, given a visible pattern: just use the sigmoid activation rule
i.e. ... just use the sigmoid activation rule

Contrast that with the sigmoid belief net, where sampling from the joint was easy and instant, but sampling from the posterior wasn't (and didn't use the sigmoid activation).

sampling from a Boltzmann machine: pretty easy (but requires waiting for a Markov chain to reach equilibrium)


learning in Boltzmann machines

The log likelihood of generating the training set D is

log L = Σ_{v∈D} log p(v)

where p(v) = Σ_{all h} p(v, h), with p(v, h) = e^(−E(v,h)) / Z. Z is the normalisation, summed over all possible states.

learning by gradient ascent: Δw_ij ∝ ∂ log L / ∂w_ij

Piece of cake, surely...


learning in Boltzmann machines

The gradient turns out to be:

∂ log P(D) / ∂w_ij = ⟨x_i x_j⟩_data − ⟨x_i x_j⟩_model

As a learning rule, this is:

Δw_ij = η ( ⟨x_i x_j⟩_data − ⟨x_i x_j⟩_model )    (1)

clamped phase: "clamp" each training pattern to the units, do Gibbs sampling on the hidden units, and accumulate the Hebbian changes.
unclamped phase: do Gibbs sampling on all units, and accumulate anti-Hebbian changes.


the problem

The gradient estimate is itself the difference between two noisy estimates, each of which requires sampling from a long MCMC chain, after waiting for it to reach equilibrium. This learning algorithm is beautiful but glacially slow in practice, to the point of being unusable. Despite their intuitive appeal as generative models, and their tempting similarities with biological neural nets, Boltzmann machines seemed doomed to the scrap heap until recently. We’re going to do two tricks to make Boltzmann machines practical devices:

1. restrict the connectivity.
2. use a new learning algorithm to train the weights.

and then we're going to show how to use them to solve the towers problem of sigmoid belief nets...


trick # 1: restrict the connections

Assume visible units are one layer, and hidden units are another. Throw out all the connections within each layer. Restricted Boltzmann machine (RBM)


Gibbs sampling in an RBM

Since none of the units in a layer are connected, we can do Gibbs sampling by updating all of one layer at a time. This is called “alternating” Gibbs sampling.
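A sketch of one alternating sweep (the RBM sizes and random weights are placeholders for trained ones): because no units within a layer are connected, each whole layer is resampled in parallel.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def sample_hidden(v, W, b_h):
    """All hidden units at once: conditionally independent given the visibles."""
    p = sigmoid(v @ W + b_h)
    return (rng.random(p.shape) < p).astype(float)

def sample_visible(h, W, b_v):
    """All visible units at once, given the hiddens."""
    p = sigmoid(h @ W.T + b_v)
    return (rng.random(p.shape) < p).astype(float)

# one alternating Gibbs sweep on a toy RBM: 4 visible units, 3 hidden units
W = rng.normal(scale=0.1, size=(4, 3))
v0 = np.array([1.0, 0.0, 1.0, 0.0])
h0 = sample_hidden(v0, W, np.zeros(3))   # up
v1 = sample_visible(h0, W, np.zeros(4))  # down
```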


learning in an RBM

The very first upward pass gives us a sample from the equilibrium distribution over the hidden units, given the visible ones. The clamped Hebbian phase of BM learning is done in the first step! This is a consequence of the fact that, in the RBM graphical model, hidden activities are conditionally independent of one another given the visible activities. FAST.

The anti-Hebbian phase involves unclamping the inputs, and waiting long enough to get a sample from the generative distribution. SLOW.


learning in an RBM

Repeat for all data:

1. start with a training vector on the visible units
2. then alternate between updating all the hidden units in parallel and updating all the visible units in parallel

Δw_ij = η ( ⟨v_i h_j⟩^0 − ⟨v_i h_j⟩^∞ )

restricted connectivity is trick #1: it saves waiting for equilibrium in the clamped phase.


trick # 2: curtail the Markov chain during learning

1. start with a training vector on the visible units
2. update all the hidden units in parallel
3. update all the visible units in parallel to get a "reconstruction"
4. update the hidden units again

Δw_ij = η ( ⟨v_i h_j⟩^0 − ⟨v_i h_j⟩^1 )

This is not following the correct gradient, but works well in practice. Hinton calls this "learning by contrastive divergence".
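The four steps above amount to the CD-1 update. Here is a minimal sketch (toy sizes, a single repeated training pattern, and hyperparameters chosen only for illustration); after training, the RBM's reconstruction of the pattern should be close to the pattern itself.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b_v, b_h, eta=0.05):
    """One CD-1 step: v0 -> h0 -> v1 (the "reconstruction") -> h1,
    then Delta W = eta * (<v h>^0 - <v h>^1)."""
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    p_v1 = sigmoid(h0 @ W.T + b_v)        # step 3: the reconstruction
    p_h1 = sigmoid(p_v1 @ W + b_h)        # step 4: update the hiddens again
    W += eta * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
    b_v += eta * (v0 - p_v1)
    b_h += eta * (p_h0 - p_h1)

# train a tiny RBM (4 visible, 2 hidden) on one repeated pattern
W = 0.01 * rng.normal(size=(4, 2))
b_v, b_h = np.zeros(4), np.zeros(2)
v = np.array([1.0, 1.0, 0.0, 0.0])
for _ in range(500):
    cd1_update(v, W, b_v, b_h)
recon = sigmoid(sigmoid(v @ W + b_h) @ W.T + b_v)   # close to [1, 1, 0, 0]
```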


why does this work?

If we start at a data point, the Markov chain wanders off towards patterns that are more likely. We can see the direction it is wandering in after just a few steps; it's a big waste of time to let it go all the way to equilibrium.

NOTE: in an RBM the hidden units are conditionally independent given the visible ones, so there is no "explaining away" inference required.

trick #2 is contrastive divergence: it saves waiting for equilibrium in the unclamped phase.


what kinds of distributions are RBMs well-suited for?

A single RBM with appropriate weights can generate any desired distribution over the visible units, if we give it enough hidden units (≈ 2^(#visible)). The probability of a joint state in an RBM is

P(v, h | W) ∝ exp(h^T W v)

and so

P(v | W) ∝ Σ_h exp(h^T W v)


what kinds of distributions are RBMs well-suited for?

Carefully tracking terms (!), in log space this leads to:

log P(v | W) = w_0⋆ · v + Σ_j log(1 + e^(w_j⋆ · v)) + constant

where w_0⋆ holds the biases and w_j⋆ is the vector of weights between the jth hidden unit and the visible units. This is:

a linear trend across the input space, plus
a sum of functions of the form f = log(1 + e^φ), which is zero for φ < 0 and the identity function for φ > 0, with a smooth transition.

Each hidden unit represents a "feature" characterised by the direction of its weight vector w_j⋆. The hidden unit lowers the energy of any states that are aligned with this vector, making them more likely.
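The marginalization behind this identity is easy to check numerically (random weights and one arbitrary visible vector; biases are folded into the constant and ignored here): summing exp(hᵀWv) over all hidden states equals the product form Π_j (1 + e^(w_j⋆·v)).

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

W = rng.normal(size=(3, 4))          # 3 hidden units, 4 visible units
v = np.array([1.0, 0.0, 1.0, 1.0])

# analytic marginal: log sum_h exp(h^T W v) = sum_j log(1 + exp(w_j. @ v))
analytic = float(np.sum(np.log1p(np.exp(W @ v))))

# brute force over all 2^3 hidden configurations
brute = float(np.log(sum(np.exp(np.array(h, dtype=float) @ W @ v)
                         for h in product([0, 1], repeat=3))))
```

The two quantities agree to numerical precision, which is exactly why each hidden unit contributes one thresholded-ramp term to log P(v|W).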


natural distributions for RBMs

RBMs seem predisposed to capture distributions consisting of conjunctions of high-probability features. They should find it difficult to capture distributions that have 'probability holes' in them, since they can only add thresholded ramps together: to make a 'hole', they essentially have to add probability mass everywhere else.

But this needs to be taken with a grain of salt. David MacKay and I trained RBMs (using the exact gradient) to learn "parity" distribution problems: exponentially many probability holes. Incredibly, an RBM with 6 hidden units can learn the parity distribution on 6 visible units perfectly! It does this by arranging a set of 6 ramps in just the right way to get high probability mass on all 32 of the desired patterns, while staying low for the 32 undesirable ones.


an RBM with 6 hidden units doing 6 bit parity


RBM summary

Two tricks were used to make Boltzmann machines practical devices:

restrict the wiring: this saves waiting for equilibrium in the clamped phase.
truncate the Markov chain (contrastive divergence): this saves waiting for equilibrium in the unclamped phase.


e.g. training data for a single class


e.g. samples from an RBM trained on a single class


e.g. weights after training on a single class


a suggestion

greedy layer-wise learning: take samples from the aggregate posterior and use them to train another layer (and then another... and so on). To generate data, we could

1. run alternating Gibbs sampling for a while on the top layer, and then
2. do a top-down pass to the visible layer.

This gives us a deep belief net (with an RBM as the top layer).
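Structurally, the greedy stack looks like the sketch below (random weights stand in for trained RBMs, which keeps the sketch short and self-contained): each layer's sampled hidden patterns become the "data" for the next layer up.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def up_sample(V, W, b_h):
    """Sample hidden patterns given a batch of data: these samples from the
    aggregate posterior become the training set for the next layer's RBM."""
    P = sigmoid(V @ W + b_h)
    return (rng.random(P.shape) < P).astype(float)

# greedy stacking, structurally: train RBM-1 on `data`, sample H1,
# train RBM-2 on H1, sample H2, and so on (the training itself is omitted)
data = (rng.random((100, 8)) < 0.5).astype(float)
W1 = rng.normal(scale=0.5, size=(8, 6))
W2 = rng.normal(scale=0.5, size=(6, 4))
H1 = up_sample(data, W1, np.zeros(6))
H2 = up_sample(H1, W2, np.zeros(4))
```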


a suggestion

But why would we expect this to do any better than the more obvious procedure of using single-layer sigmoid belief nets in the same way? Two ideas that help understanding:

1. RBMs are already deep belief nets...
2. RBMs don't try to force the aggregate posterior to be factorial...


RBMs are already belief nets

An RBM with alternating Gibbs sampling is equivalent to a sigmoid belief net with infinitely many layers. So when we train an RBM, we're really training an infinitely deep belief net! It's just that the weights of all the layers are tied.


RBMs don’t make the aggregate posterior ∼factorial

in a 1-layer sigmoid belief net: the prior over the hiddens is factorial, but the posterior isn't. Learning tries to make the aggregate posterior factorial... which leaves little for subsequent layers to do.

in a restricted Boltzmann machine: the posterior over the hiddens is factorial, but the prior isn't. Learning DOESN'T try to force the aggregate posterior to be factorial... which leaves more for the next layer to do.


1 layer: weights and samples after training on “4”


2 layer network


training data for multiple classes


weights after training on multiple classes


samples from RBM trained on multiple classes


application # 1: classification

Say there are 10 classes in the data. Since we have a generative model, we can do semi-supervised learning:

train the network almost entirely using unlabelled data, but include a "1-of-N" softmax unit in the top-layer RBM.
just use the labels to identify which softmax state corresponds to which class label, at the end.

Notice how different this is from supervised learning!


application # 1: classification

(All details, and movies of the above in generative and recognition mode, are available via Geoff Hinton's homepage)

fine tuning

So far, the up and down weights have been symmetric, as required by the Boltzmann machine learning algorithm. But once the tower is built they can be "untied":

wake: do a bottom-up pass, starting with a pattern from the training set. Use the delta rule to make this more likely under the generative model.
sleep: do a top-down pass, starting from an equilibrium sample from the top RBM. Use the delta rule to make this more likely under the recognition model.

[CD version: start the top RBM at the sample from the wake phase, and don't wait for equilibrium before doing the top-down pass.]

This is the wake-sleep learning algorithm: it unties the recognition weights from the generative ones.


application # 2: autoencoding

basic training
unrolling
fine-tuning with back-prop

example: autoencoding digits

Notice: again, the fine tuning (by back-prop this time) achieves its effect by untying the generative and recognition weights.


example: autoencoding digits

Codes for digits, produced by taking the first 2 principal components. Codes from a 784-1000-500-250-2 autoencoder.


example: autoencoding documents

Hinton & Salakhutdinov, Science, 2006

semantic hashing

Suppose the input is (say) word counts from documents. Suppose the bottleneck is (say) 20 binary neurons. Surprised? cf. 20 questions... That's a 20-bit hashcode of the document. It's a "semantic" hash: similar documents will have similar codes. We can navigate to documents that are "one question away" (without knowing what the question is..!)
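A sketch of the hashing step itself (the 50-word vocabulary and the random "encoder" weights are hypothetical placeholders for a trained recognition net): threshold the 20 bottleneck activations at 0.5 and read the bits off as an integer.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def semantic_hash(word_counts, enc_W, enc_b):
    """Map a document's word counts to a 20-bit code via the bottleneck units."""
    bits = sigmoid(word_counts @ enc_W + enc_b) > 0.5
    return int("".join("1" if b else "0" for b in bits), 2)

# hypothetical encoder: 50 word counts -> 20 binary bottleneck units
enc_W = rng.normal(scale=0.1, size=(50, 20))
enc_b = np.zeros(20)
doc = rng.poisson(1.0, size=50).astype(float)
code = semantic_hash(doc, enc_W, enc_b)
```

Similar documents map to codes with small Hamming distance, which is what makes nearby buckets worth searching.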

Salakhutdinov & Hinton, 2007

summary

1. Sigmoid belief nets
   why are they hard to train? could layer-by-layer training work?
2. Boltzmann machines
   why are they hard to train? RBMs
3. Deep belief nets built from RBMs
   how to do it; why it works; fine-tuning the result [key: untying the weights]; 2 applications: a classifier and an autoencoder

I hope that helps - thanks for listening!
