CSC321 Lecture 20: Autoencoders
Roger Grosse


  1. CSC321 Lecture 20: Autoencoders (Roger Grosse)

  2. Overview. Latent variable models so far: mixture models and Boltzmann machines. Both of these involve discrete latent variables; now let's talk about continuous ones. One use of continuous latent variables is dimensionality reduction.

  3. Autoencoders. An autoencoder is a feed-forward neural net whose job is to take an input x and predict x. To make this non-trivial, we need to add a bottleneck layer whose dimension is much smaller than the input.
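As an added illustration (not part of the original slides), here is a minimal sketch of such an autoencoder in PyTorch. The 784-dimensional input (e.g. a flattened digit image), the 30-dimensional bottleneck, and the intermediate layer size are all hypothetical choices.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Feed-forward net trained to reproduce its input through a small bottleneck."""
    def __init__(self, input_dim=784, code_dim=30):
        super().__init__()
        # Encoder maps the input down to a low-dimensional code.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim),
        )
        # Decoder maps the code back up to the input dimension.
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.randn(64, 784)             # a stand-in batch of inputs
loss = ((model(x) - x) ** 2).mean()  # squared reconstruction error
loss.backward()
```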

  4. Autoencoders. Why autoencoders? Map high-dimensional data to two dimensions for visualization. Compression (i.e. reducing the file size); note that autoencoders don't do this for free, as it requires other ideas as well. Learn abstract features in an unsupervised way so you can apply them to a supervised task; unlabeled data can be much more plentiful than labeled data.

  5. Principal Component Analysis. The simplest kind of autoencoder has one hidden layer, linear activations, and squared error loss
\[ \mathcal{L}(\mathbf{x}, \tilde{\mathbf{x}}) = \| \mathbf{x} - \tilde{\mathbf{x}} \|^2. \]
This network computes x̃ = UVx, which is a linear function. If K ≥ D, we can choose U and V such that UV is the identity. This isn't very interesting. But suppose K < D: V maps x to a K-dimensional space, so it's doing dimensionality reduction. The output must lie in a K-dimensional subspace, namely the column space of U.
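As a small added illustration (not from the slides), the NumPy sketch below instantiates this linear case with hypothetical dimensions D = 5 and K = 2: the composite map UV has rank at most K, so every reconstruction lies in the column space of U.

```python
import numpy as np

D, K = 5, 2                       # input dimension and bottleneck size (K < D)
rng = np.random.default_rng(0)

V = rng.standard_normal((K, D))   # encoder: maps x to a K-dimensional code
U = rng.standard_normal((D, K))   # decoder: maps the code back to D dimensions

x = rng.standard_normal(D)
x_tilde = U @ (V @ x)             # reconstruction: a linear function of x

print(np.linalg.matrix_rank(U @ V))  # at most K, so outputs live in a K-dim subspace
```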

  6. Principal Component Analysis. We just saw that a linear autoencoder has to map D-dimensional inputs to a K-dimensional subspace S. Knowing this, what is the best possible mapping it can choose?

  7. Principal Component Analysis. We just saw that a linear autoencoder has to map D-dimensional inputs to a K-dimensional subspace S. Knowing this, what is the best possible mapping it can choose? By definition, the projection of x onto S is the point in S which minimizes the distance to x. Fortunately, the linear autoencoder can represent projection onto S: pick U = Q and V = Qᵀ, where Q is an orthonormal basis for S.
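To make this concrete (an added illustration, not the lecture's code), the sketch below builds an orthonormal basis Q for a random 2-dimensional subspace of R⁵ and checks that the map x ↦ QQᵀx is an orthogonal projection (idempotent and symmetric); the dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 2))   # random D x K matrix whose columns span a subspace S
Q, _ = np.linalg.qr(A)            # orthonormal basis for S (columns of Q)

P = Q @ Q.T                       # with U = Q and V = Q^T, UV is projection onto S
print(np.allclose(P @ P, P))      # idempotent: projecting twice changes nothing
print(np.allclose(P, P.T))        # symmetric: hallmark of an orthogonal projection
```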

  8. Principal Component Analysis. The autoencoder should learn to choose the subspace which minimizes the squared distance from the data to the projections. This is equivalent to choosing the subspace which maximizes the variance of the projections. By the Pythagorean Theorem,
\[
\underbrace{\frac{1}{N}\sum_{i=1}^N \|\mathbf{x}^{(i)} - \tilde{\mathbf{x}}^{(i)}\|^2}_{\text{reconstruction error}}
\;+\;
\underbrace{\frac{1}{N}\sum_{i=1}^N \|\tilde{\mathbf{x}}^{(i)} - \boldsymbol{\mu}\|^2}_{\text{projected variance}}
\;=\;
\underbrace{\frac{1}{N}\sum_{i=1}^N \|\mathbf{x}^{(i)} - \boldsymbol{\mu}\|^2}_{\text{constant}}
\]
You wouldn't actually solve this problem by training a neural net. There's a closed-form solution, which you learn about in CSC 411. The algorithm is called principal component analysis (PCA).
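As a sanity check (an illustration added here, not in the slides), the sketch below computes the top-K principal subspace in closed form via the SVD of the centered data and verifies numerically that reconstruction error plus projected variance equals the total variance; the toy data and dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, K = 200, 5, 2
X = rng.standard_normal((N, D)) @ rng.standard_normal((D, D))  # correlated toy data

mu = X.mean(axis=0)
Xc = X - mu                               # center the data

# Closed-form PCA: the top-K right singular vectors span the best subspace.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Q = Vt[:K].T                              # D x K orthonormal basis

X_tilde = mu + Xc @ Q @ Q.T               # project each point onto the subspace

recon_error = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
proj_var    = np.mean(np.sum((X_tilde - mu) ** 2, axis=1))
total_var   = np.mean(np.sum((X - mu) ** 2, axis=1))

print(np.isclose(recon_error + proj_var, total_var))  # the decomposition holds
```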

  9. Principal Component Analysis. PCA for faces ("Eigenfaces").

  10. Principal Component Analysis. PCA for digits.

  11. Deep Autoencoders. Deep nonlinear autoencoders learn to project the data, not onto a subspace, but onto a nonlinear manifold. This manifold is the image of the decoder. This is a kind of nonlinear dimensionality reduction.
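As an added sketch (not from the slides), consider a deep autoencoder with a 2-dimensional code, e.g. for visualization: the decoder maps the 2-D code space into input space, so its image is an (at most) 2-dimensional nonlinear manifold. The layer sizes below are hypothetical.

```python
import torch
import torch.nn as nn

# Deep nonlinear autoencoder with a 2-D bottleneck.
encoder = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 2),                 # 2-D code
)
decoder = nn.Sequential(
    nn.Linear(2, 64), nn.ReLU(),
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, 784),
)

# The decoder's image is a nonlinear 2-D manifold in the 784-D input space:
# sweeping the code over a grid traces out points on that manifold.
codes = torch.cartesian_prod(torch.linspace(-2, 2, 10), torch.linspace(-2, 2, 10))
manifold_points = decoder(codes)      # shape (100, 784)
```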

  12. Deep Autoencoders. Nonlinear autoencoders can learn more powerful codes for a given dimensionality, compared with linear autoencoders (PCA).

  13. Layerwise Training. There's a neat connection between autoencoders and RBMs. An RBM is like an autoencoder with tied weights, except that the units are sampled stochastically.
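The following tiny NumPy snippet (an added illustration, with biases omitted and all sizes hypothetical) shows what "tied weights, stochastic units" means: the same matrix W is used in both directions, and the hidden units are sampled rather than computed deterministically.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
W = rng.standard_normal((784, 100)) * 0.01    # one weight matrix, used in both directions

v = (rng.random(784) < 0.5).astype(float)     # a binary visible vector
h_prob = sigmoid(v @ W)                       # "encoder" direction uses W ...
h = (rng.random(100) < h_prob).astype(float)  # ... but the units are sampled stochastically
v_prob = sigmoid(h @ W.T)                     # "decoder" direction uses the tied weights W^T
```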

  14. Layerwise Training. Suppose we've already trained an RBM with weights W^(1). Let's compute its hidden features on the training set, and feed that in as data to another RBM. Note that now W^(1) is held fixed, but W^(2) is being trained using contrastive divergence.
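The sketch below (an added illustration, not the lecture's code) shows one CD-1 update for the second RBM: the frozen first-layer weights W1 produce hidden features, which then act as the "visible" data for the second RBM with trainable weights W2. Biases are omitted and all sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
X = (rng.random((64, 784)) < 0.5).astype(float)  # stand-in batch of binary inputs

W1 = rng.standard_normal((784, 500)) * 0.01      # first RBM's weights: already trained, frozen
W2 = rng.standard_normal((500, 250)) * 0.01      # second RBM's weights: being trained
lr = 0.01

# Frozen first layer: its hidden activations become the "data" for the second RBM.
H1 = sigmoid(X @ W1)

# One step of contrastive divergence (CD-1) on the second RBM.
pos_hid = sigmoid(H1 @ W2)                                 # positive phase
hid_sample = (rng.random(pos_hid.shape) < pos_hid).astype(float)
recon = sigmoid(hid_sample @ W2.T)                         # reconstruct with the tied weights
neg_hid = sigmoid(recon @ W2)                              # negative phase

W2 += lr * (H1.T @ pos_hid - recon.T @ neg_hid) / X.shape[0]  # approximate gradient step
```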

  15. Layerwise Training. A stack of two RBMs can be thought of as an autoencoder with three hidden layers. This gives a good initialization for the deep autoencoder. You can then fine-tune the autoencoder weights using backprop. This strategy is known as layerwise pre-training.
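As a final added sketch (not from the slides), one way to unroll two pretrained RBMs into such an autoencoder: the encoder uses W1 then W2, the decoder uses their transposes, and all weights are then fine-tuned jointly with backprop on reconstruction error. The sigmoid-only, bias-free layers and the sizes continue the hypothetical setup above.

```python
import numpy as np
import torch
import torch.nn as nn

# Pretend W1, W2 are the weight matrices of two pretrained RBMs.
W1 = np.random.randn(784, 500) * 0.01
W2 = np.random.randn(500, 250) * 0.01

def linear_from(weights):
    """Build a bias-free linear layer initialized from an RBM weight matrix."""
    layer = nn.Linear(weights.shape[0], weights.shape[1], bias=False)
    layer.weight.data = torch.tensor(weights.T, dtype=torch.float32)
    return layer

# Unroll the two RBMs into an autoencoder with three hidden layers:
# the encoder uses W1 then W2; the decoder uses their transposes.
autoencoder = nn.Sequential(
    linear_from(W1), nn.Sigmoid(),    # hidden layer 1
    linear_from(W2), nn.Sigmoid(),    # hidden layer 2 (the code)
    linear_from(W2.T), nn.Sigmoid(),  # hidden layer 3
    linear_from(W1.T), nn.Sigmoid(),  # reconstruction
)

# Fine-tune all weights jointly with backprop on the reconstruction error.
optimizer = torch.optim.SGD(autoencoder.parameters(), lr=0.1)
x = torch.rand(64, 784)
loss = ((autoencoder(x) - x) ** 2).mean()
loss.backward()
optimizer.step()
```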

  16. Autoencoders are not a probabilistic model. However, there is an autoencoder-like probabilistic model called a variational autoencoder (VAE). These are beyond the scope of the course, and require some more advanced math. Check out David Duvenaud's excellent course "Differentiable Inference and Generative Models": https://www.cs.toronto.edu/~duvenaud/courses/csc2541/index.html

  17. Deep Autoencoders (Professor Hinton's slides).
