Deep Learning & Convolutional Networks In Vision (part 2)
VRML, Paris 2013-07-23
Yann LeCun
Center for Data Science & Courant Institute, NYU yann@cs.nyu.edu http://yann.lecun.com
Energy-Based Unsupervised Learning
Learning an energy function (or contrast function) that takes low values on the data manifold and higher values everywhere else.
Capturing Dependencies Between Variables with an Energy Function
The energy surface is a "contrast function" that takes low values on the data manifold, and higher values everywhere else
Special case: energy = negative log density
Example: the samples live on a manifold
Transforming Energies into Probabilities (if necessary)
The energy can be interpreted as an unnormalized negative log density
Gibbs distribution: probability proportional to exp(-energy)
The beta parameter is akin to an inverse temperature
Don't compute probabilities unless you absolutely have to, because the denominator (the partition function) is often intractable
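For reference, the Gibbs distribution mentioned above can be written explicitly (beta is the inverse-temperature parameter):

\[ P(Y \mid W) \;=\; \frac{e^{-\beta E(Y,W)}}{\int_y e^{-\beta E(y,W)}} \]

The integral in the denominator is the partition function, which is what often makes exact probabilities intractable.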
Learning the Energy Function
A parameterized energy function E(Y,W):
Make the energy low on the samples
Make the energy higher everywhere else
Making the energy low on the samples is easy, but how do we make it higher everywhere else?
Seven Strategies to Shape the Energy Function
1. Constant volume of low energy: PCA, K-means, GMM, square ICA
2. Push down the energy of data points, push up everywhere else: max likelihood (needs a tractable partition function)
3. Push down the energy of data points, push up on chosen locations: contrastive divergence, Ratio Matching, Noise Contrastive Estimation, Minimum Probability Flow
4. Score matching
5. Denoising auto-encoder
6. Use a regularizer that limits the volume of space that has low energy: sparse coding, sparse auto-encoder, PSD
7. Contracting auto-encoder, saturating auto-encoder
#1: constant volume of low energy
PCA, K-means, GMM, square ICA...
PCA: E(Y) = \|W^T W Y - Y\|^2
K-Means (Z constrained to a 1-of-K code): E(Y) = \min_Z \sum_i \|Y - W_i Z_i\|^2
#2: push down the energy of data points, push up everywhere else
Max likelihood (requires a tractable partition function)
Maximizing P(Y|W) on training samples: make the numerator exp(-beta E(Y,W)) big and the denominator (the partition function) small
Minimizing -log P(Y|W) on training samples: make E(Y,W) small on the samples
#2: push down the energy of data points, push up everywhere else
Gradient of the negative log-likelihood loss for one sample Y:
Pushes down on the energy of the samples; pulls up on the energy of low-energy Y's
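The gradient in question has the standard maximum-likelihood form for energy-based models:

\[ \frac{\partial \mathcal{L}(Y,W)}{\partial W} \;=\; \frac{\partial E(Y,W)}{\partial W} \;-\; \int_y P(y \mid W)\,\frac{\partial E(y,W)}{\partial W} \]

The first term pushes down on the energy of the training sample; the second term, an expectation under the model, pulls up on the energy of the Y's that currently have low energy.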
Gradient descent on the loss: W ← W − η ∂L(Y,W)/∂W (η: step size)
#3: push down the energy of data points, push up on chosen locations
Contrastive divergence, Ratio Matching, Noise Contrastive Estimation, Minimum Probability Flow
Contrastive divergence, basic idea:
Pick a training sample, lower the energy at that point
From the sample, move down on the energy surface with noise
Stop after a while; push up on the energy of the point where we stopped
This creates grooves in the energy surface around the data manifolds
CD can be applied to any energy function (not just RBMs)
Persistent CD: use a bunch of "particles" and remember their positions
Make them roll down the energy surface with noise
Push up on the energy wherever they are
Faster than CD
RBM: E(Y,Z) = -Z^T W Y
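A minimal sketch of the basic contrastive-divergence recipe above, for a generic differentiable energy function. The noisy-descent schedule, hyperparameters, and the toy quadratic energy are illustrative assumptions, not the original implementation:

```python
# Illustrative contrastive-divergence step for a generic differentiable energy E(y, w).
import torch

def cd_step(energy, w, y_data, n_steps=10, step=0.1, noise=0.05, lr=0.01):
    # Negative samples: start at the data, move down the energy surface with noise.
    y_neg = y_data.clone()
    for _ in range(n_steps):
        y_neg = y_neg.detach().requires_grad_(True)
        (g,) = torch.autograd.grad(energy(y_neg, w).sum(), y_neg)
        y_neg = y_neg - step * g + noise * torch.randn_like(y_neg)
    y_neg = y_neg.detach()

    # Push down the energy of the data, push up where the descent stopped.
    w = w.detach().requires_grad_(True)
    loss = energy(y_data, w).mean() - energy(y_neg, w).mean()
    (gw,) = torch.autograd.grad(loss, w)
    return (w - lr * gw).detach()

# Toy example (an assumption): quadratic energy E(y, w) = ||y - w||^2 per sample.
energy = lambda y, w: ((y - w) ** 2).sum(dim=1)
w = torch.zeros(1, 2)
y_batch = torch.randn(16, 2) + 3.0
for _ in range(100):
    w = cd_step(energy, w, y_batch)
```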
#6: use a regularizer that limits the volume of space that has low energy
Sparse coding, sparse auto-encoder, Predictive Sparse Decomposition
How to Speed Up Inference in a Generative Model?
Factor graph with an asymmetric factor
Inference Z → Y is easy: run Z through the deterministic decoder, and sample Y
Inference Y → Z is hard, particularly if the decoder function is many-to-one
MAP: minimize the sum of the two factors with respect to Z
Z* = argmin_Z Distance[Decoder(Z), Y] + FactorB(Z)
Examples: K-Means (1-of-K), Sparse Coding (sparse), Factor Analysis
[Diagram: generative model factor graph — INPUT Y, LATENT VARIABLE Z; Factor A = Distance(Decoder(Z), Y), Factor B = prior on Z]
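A minimal sketch of MAP inference Y → Z by gradient descent on the sum of the two factors, with a placeholder linear decoder and an L1 prior standing in for FactorB (names, dimensions, and hyperparameters are illustrative):

```python
# Illustrative MAP inference: Z* = argmin_Z Distance[Decoder(Z), Y] + FactorB(Z).
import torch

def map_infer_z(y, decoder, factor_b, z_dim, n_steps=200, lr=0.1):
    z = torch.zeros(z_dim, requires_grad=True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = ((decoder(z) - y) ** 2).sum() + factor_b(z)   # Factor A + Factor B
        loss.backward()
        opt.step()
    return z.detach()

# Example (assumed): linear decoder W_d (the sparse-coding case), L1 prior as FactorB.
w_d = torch.randn(64, 10)                     # 64-dim input, 10-dim latent code
z_star = map_infer_z(torch.randn(64),
                     decoder=lambda z: w_d @ z,
                     factor_b=lambda z: 0.1 * z.abs().sum(),
                     z_dim=10)
```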
Sparse Coding & Sparse Modeling
Sparse linear reconstruction
Energy = reconstruction_error + code_sparsity
E(Y^i, Z) = \|Y^i - W_d Z\|^2 + \lambda \sum_j |z_j|
[Olshausen & Field 1997]
[Diagram: INPUT Y → FEATURES Z through the deterministic decoder W_d]
Inference is slow
Encoder Architecture
Examples: most ICA models, Product of Experts
[Diagram: encoder architecture — INPUT Y, LATENT VARIABLE Z; Factor A' = Distance(Encoder(Y), Z), a fast feed-forward model; Factor B = prior on Z]
Encoder-Decoder Architecture
Train a "simple" feed-forward function to predict the result of a complex optimization on the data points of interest
[Kavukcuoglu, Ranzato, LeCun, rejected by every conference, 2008-2009]
[Diagram: encoder-decoder architecture — generative model Factor A = Distance(Decoder(Z), Y); fast feed-forward model Factor A' = Distance(Encoder(Y), Z); Factor B = prior on the latent variable Z]
Why Limit the Information Content of the Code?
[Diagram: INPUT SPACE vs. FEATURE SPACE — training samples, an input vector which is NOT a training sample, and their feature vectors]
Predictive Sparse Decomposition (PSD): sparse auto-encoder
Predict the optimal code with a trained encoder
Energy = reconstruction_error + code_prediction_error + code_sparsity
E(Y^i, Z) = \|Y^i - W_d Z\|^2 + \|Z - g_e(W_e, Y^i)\|^2 + \lambda \sum_j |z_j|
g_e(W_e, Y^i) = shrinkage(W_e Y^i)
[Kavukcuoglu, Ranzato, LeCun, 2008 → arXiv:1010.3467],
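A hedged sketch of one PSD training step following the energy above: infer the code Z by minimizing the energy, then take a gradient step on W_d and W_e with Z fixed. Batch shapes, the soft-shrinkage encoder, and the learning rates are illustrative assumptions:

```python
# Illustrative PSD training step (not the original implementation).
import torch
import torch.nn.functional as F

def psd_energy(y, z, w_d, w_e, lam=0.1):
    recon = ((y - z @ w_d.T) ** 2).sum(dim=1)                     # ||Y - Wd Z||^2
    pred = ((z - F.softshrink(y @ w_e.T, lam)) ** 2).sum(dim=1)   # ||Z - ge(We, Y)||^2
    return (recon + pred + lam * z.abs().sum(dim=1)).mean()

def psd_train_step(y, w_d, w_e, lam=0.1, z_steps=50, z_lr=0.1, w_lr=0.01):
    # 1) Inference: minimize the energy with respect to the code Z.
    z = F.softshrink(y @ w_e.T, lam).detach().requires_grad_(True)
    opt_z = torch.optim.SGD([z], lr=z_lr)
    for _ in range(z_steps):
        opt_z.zero_grad()
        psd_energy(y, z, w_d, w_e, lam).backward()
        opt_z.step()
    # 2) Learning: one gradient step on the parameters with Z* held fixed.
    z = z.detach()
    for w in (w_d, w_e):
        w.requires_grad_(True)
    psd_energy(y, z, w_d, w_e, lam).backward()
    with torch.no_grad():
        for w in (w_d, w_e):
            w -= w_lr * w.grad
            w.grad = None
            w.requires_grad_(False)

# Example shapes (assumed): 64-dim inputs, 128-dim codes.
w_d = torch.randn(64, 128) * 0.1    # decoder dictionary
w_e = torch.randn(128, 64) * 0.1    # encoder matrix
psd_train_step(torch.randn(32, 64), w_d, w_e)
```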
PSD: Basis Functions on MNIST
Basis functions (and encoder matrix) are digit parts
Predictive Sparse Decomposition (PSD): Training
Training on natural image patches: 12x12 patches, 256 basis functions
Learned Features on natural patches: V1-like receptive fields
ISTA/FISTA: iterative algorithm that converges to optimal sparse code
[Gregor & LeCun, ICML 2010], [Bronstein et al. ICML 2012], [Rolfe & LeCun ICLR 2013]
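A minimal ISTA sketch for the sparse-coding energy 0.5*||Y - Wd Z||^2 + lam*||Z||_1 (FISTA adds a momentum term on top of the same proximal update; the 1/Lipschitz step-size rule and the shapes are illustrative):

```python
# Illustrative ISTA: proximal gradient descent with soft thresholding.
import torch
import torch.nn.functional as F

def ista(y, w_d, lam=0.1, n_iter=100):
    # Step size from the Lipschitz constant of the quadratic term (spectral norm squared).
    eta = 1.0 / (torch.linalg.matrix_norm(w_d, ord=2) ** 2).item()
    z = torch.zeros(w_d.shape[1])
    for _ in range(n_iter):
        grad = w_d.T @ (w_d @ z - y)                  # gradient of the quadratic term
        z = F.softshrink(z - eta * grad, eta * lam)   # proximal step = soft thresholding
    return z

# z_star = ista(torch.randn(64), torch.randn(64, 128) / 8.0)
```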
Better idea: give the "right" structure to the encoder, e.g. lateral inhibition
Think of the FISTA flow graph as a recurrent neural net where We and S are trainable parameters
Time-unfold the flow graph for K iterations
Learn the We and S matrices with "backprop-through-time"
Get the best approximate solution within K iterations
LISTA: Train We and S matrices to give a good approximation quickly
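A hedged sketch of a LISTA encoder: K time-unfolded iterations with trainable We and S matrices, trained by backprop to regress onto precomputed optimal sparse codes. Layer sizes and the fixed shrinkage threshold are illustrative assumptions:

```python
# Illustrative LISTA module: z_{k+1} = shrink(We*y + S*z_k), unrolled K times.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LISTA(nn.Module):
    def __init__(self, in_dim, code_dim, n_iter=3, theta=0.1):
        super().__init__()
        self.we = nn.Linear(in_dim, code_dim, bias=False)    # filter matrix We
        self.s = nn.Linear(code_dim, code_dim, bias=False)   # lateral "inhibition" matrix S
        self.n_iter = n_iter
        self.theta = theta                                    # shrinkage threshold (fixed here)

    def forward(self, y):
        b = self.we(y)
        z = F.softshrink(b, self.theta)
        for _ in range(self.n_iter):
            z = F.softshrink(b + self.s(z), self.theta)       # time-unfolded update
        return z

# Training (backprop through time): regress onto target codes z_star, e.g. from ISTA/FISTA.
# model = LISTA(64, 128); loss = ((model(y) - z_star) ** 2).mean(); loss.backward()
```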
[Plot: reconstruction error vs. number of LISTA or FISTA iterations]
[Plot: reconstruction error vs. proportion of non-zero S matrix elements (smallest elements removed first)]
Discriminative Recurrent Sparse Auto-Encoder (DrSAE) [Rolfe & LeCun ICLR 2013]
Architecture: rectified linear units
Classification loss: cross-entropy; reconstruction loss: squared error
Sparsity penalty: L1 norm of the last hidden layer
Rows of Wd and columns of We constrained to the unit sphere
[Diagram: encoding filters, lateral inhibition, decoding filters; the recurrent block can be repeated]
DrSAE discovers the manifold structure of handwritten digits
Image = prototype + sparse sum of "parts" (to move around the manifold)
Convolutional Sparse Coding
Replace the dot products with dictionary elements by convolutions:
Input Y is a full image
Each code component Zk is a feature map (an image)
Each dictionary element is a convolution kernel
Regular sparse coding: Y \approx \sum_k Z_k W_k
Convolutional sparse coding: Y \approx \sum_k Z_k \ast W_k
"Deconvolutional networks" [Zeiler, Taylor, Fergus CVPR 2010]
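A small sketch of the convolutional sparse-coding energy: the reconstruction is a sum of convolutions of the feature maps Zk with the kernels Wk (tensor shapes and the padding choice are illustrative assumptions):

```python
# Illustrative convolutional sparse-coding energy: ||Y - sum_k Wk * Zk||^2 + lam*||Z||_1.
import torch
import torch.nn.functional as F

def conv_sc_energy(y, z, kernels, lam=0.1):
    # y: (1, 1, H, W) image; z: (1, K, H, W) feature maps; kernels: (1, K, k, k), k odd
    recon = F.conv2d(z, kernels, padding=kernels.shape[-1] // 2)   # sum_k Wk * Zk
    return ((y - recon) ** 2).sum() + lam * z.abs().sum()

# Example shapes (assumed): a 32x32 image, 16 feature maps, 7x7 kernels.
y = torch.randn(1, 1, 32, 32)
z = torch.randn(1, 16, 32, 32)
kernels = torch.randn(1, 16, 7, 7)
e = conv_sc_energy(y, z, kernels)
```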
Convolutional formulation: extend sparse coding from PATCH to IMAGE (patch-based learning → convolutional learning)
Convolutional PSD: Encoder with a soft shrinkage sh() function
Convolutional Sparse Auto-Encoder on Natural Images
Filters and Basis Functions obtained with 1, 2, 4, 8, 16, 32, and 64 filters.
Using PSD to Train a Hierarchy of Features
Phase 1: train the first layer using PSD
Phase 2: use encoder + absolute value as feature extractor
Phase 3: train the second layer using PSD
Phase 4: use encoder + absolute value as 2nd feature extractor
Phase 5: train a supervised classifier on top
Phase 6 (optional): train the entire system with supervised back-propagation
[Each PSD layer minimizes \|Y^i - W_d Z\|^2 + \|Z - g_e(W_e, Y^i)\|^2 + \lambda \sum_j |z_j|]
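A high-level sketch of the phased recipe above; train_psd_layer and train_classifier are hypothetical placeholders supplied by the caller, and only the sequencing of the phases is meant to mirror the slides:

```python
# Illustrative greedy layer-wise training with PSD, followed by a supervised classifier.
import torch

def extract_features(encoder, data):
    # Phases 2/4: run the trained encoder and take the absolute value (rectification).
    with torch.no_grad():
        return encoder(data).abs()

def train_feature_hierarchy(data, train_psd_layer, train_classifier):
    enc1 = train_psd_layer(data)                 # Phase 1: PSD on the raw input
    feats1 = extract_features(enc1, data)        # Phase 2: encoder + |.|
    enc2 = train_psd_layer(feats1)               # Phase 3: PSD on layer-1 features
    feats2 = extract_features(enc2, feats1)      # Phase 4: encoder + |.|
    clf = train_classifier(feats2)               # Phase 5: supervised classifier on top
    return enc1, enc2, clf                       # Phase 6 (optional): fine-tune the whole
                                                 # stack with supervised backprop
```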
Pedestrian Detection: INRIA Dataset [Kavukcuoglu et al. NIPS 2010] [Sermanet et al. ArXiv 2012]
[Plot: miss rate vs. false positives for ConvNet Color+Skip Supervised, ConvNet Color+Skip Unsup+Sup, ConvNet B&W Unsup+Sup, ConvNet B&W Supervised]
Learning Invariant Features with L2 Group Sparsity
Unsupervised PSD ignores the spatial pooling step. Could we devise a similar method that learns the pooling layer as well?
Idea [Hyvarinen & Hoyer 2001]: group sparsity on pools of features
A minimal number of pools must be non-zero
The number of features that are on within a pool doesn't matter
Pools tend to regroup similar features
[Invariant-PSD energy: \|Y^i - \tilde{Y}\|^2 + \|Z - \tilde{Z}\|^2 + \sum_j \sqrt{\sum_{k \in P_j} Z_k^2}, where the last term is the L2 norm within each pool]
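A small sketch of the L2-within-pool group-sparsity term above, assuming non-overlapping pools of equal size (the pooling layout is an illustrative choice):

```python
# Illustrative group-sparsity penalty: sum_j sqrt(sum_{k in P_j} Z_k^2).
import torch

def group_sparsity(z, pool_size):
    # z: (batch, K) code with K divisible by pool_size
    pools = z.view(z.shape[0], -1, pool_size)                    # (batch, n_pools, pool_size)
    # small epsilon keeps the gradient finite when a pool is exactly zero
    return pools.pow(2).sum(dim=2).add(1e-8).sqrt().sum(dim=1)   # L2 norm per pool, summed

# penalty = group_sparsity(torch.randn(32, 128), pool_size=4)
```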
Learning Invariant Features with L2 Group Sparsity
Idea: features are pooled in groups. Sparsity: sum over groups of the L2 norm of the activity in each group.
[Hyvärinen & Hoyer 2001]: "subspace ICA" — decoder only, square
[Welling, Hinton, Osindero NIPS 2002]: pooled product of experts — encoder only, overcomplete, log Student-T penalty on L2 pooling
[Kavukcuoglu, Ranzato, Fergus, LeCun CVPR 2010]: Invariant PSD — encoder-decoder (like PSD), overcomplete, L2 pooling
[Le et al. NIPS 2011]: Reconstruction ICA — same as [Kavukcuoglu 2010] with linear encoder and tied decoder
[Gregor & LeCun arXiv:1006.0448, 2010], [Le et al. ICML 2012]: locally-connected, non-shared (tiled) encoder-decoder
[Diagram: SIMPLE FEATURES Z pooled into INVARIANT FEATURES via the L2 norm within each pool; encoder only (PoE, ICA), decoder only, or encoder-decoder (iPSD, RICA)]
Groups are local in a 2D Topographic Map
The filters arrange themselves spontaneously so that similar filters enter the same pool.
The pooling units can be seen as complex cells.
Outputs of pooling units are invariant to local transformations of the input: for some it's translations, for others rotations, or other transformations.
Image-level training, local filters but no weight sharing
Training on 115x115 images. Kernels are 15x15 (not shared across space!) [Gregor & LeCun 2010]
Local receptive fields; no shared weights; 4x overcomplete; L2 pooling; group sparsity over pools
[Diagram: input, reconstructed input, (inferred) code, predicted code, decoder, encoder]
Topographic Maps
119x119 image input; 100x100 code; 20x20 receptive field size; sigma=5
Michael C. Crair et al., The Journal of Neurophysiology
K. Obermayer and G.G. Blasdel, Journal of Neuroscience, Vol 13, 4114-4129 (Monkey)
Image-level training, local filters but no weight sharing
Color indicates orientation (by fitting Gabors)
Invariant Features via Lateral Inhibition
Replace the L1 sparsity term by a lateral-inhibition matrix: an easy way to impose some structure on the sparsity
[Gregor, Szlam, LeCun NIPS 2011]
Invariant Features via Lateral Inhibition: Structured Sparsity
Each edge in the tree indicates a zero in the S matrix (no mutual inhibition)
Sij is larger if two neurons are far away in the tree
Invariant Features via Lateral Inhibition: Topographic Maps
Non-zero values in S form a ring in a 2D topology
Input patches are high-pass filtered
Invariant Features through Temporal Constancy
An object is the cross-product of an object type and instantiation parameters
Mapping units [Hinton 1981], capsules [Hinton 2011]
[Diagram: object type × object size (small, medium, large)] [Karol Gregor et al.]
What-Where Auto-Encoder Architecture
[Diagram: what-where auto-encoder — encoder and decoder over inputs at times t-2, t-1, t; inferred code, predicted code, predicted input]
Low-Level Filters Connected to Each Complex Cell
C1 (where), C2 (what)
Generating Images
[Figure: input images and generated images]
The Graph of Deep Learning ↔ Sparse Modeling ↔ Neuroscience
Architecture of V1 [Hubel, Wiesel 62] Basis/Matching Pursuit [Mallat 93; Donoho 94] Sparse Modeling [Olshausen-Field 97] Neocognitron [Fukushima 82] Backprop [many 85] Convolutional Net [LeCun 89] Sparse Auto-Encoder [LeCun 06; Ng 07] Restricted Boltzmann Machine [Hinton 05] Normalization [Simoncelli 94] Speech Recognition [Goog, IBM, MSFT 12] Object Recog [Hinton 12] Scene Labeling [LeCun 12] Connectomics [Seung 10] Object Reco [LeCun 10]
[Candès-Tao 04], L2-L1 optimization [Nesterov, Nemirovski, Daubechies, Osher...], Scattering Transform [Mallat 10], Stochastic Optimization [Nesterov, Bottou, Nemirovski,...], Sparse Modeling [Bach, Sapiro, Elad], MCMC, HMC [Neal, Hinton], Visual Metamers [Simoncelli 12]
Integrating Feed-Forward and Feedback
Marrying feed-forward convolutional nets with generative "deconvolutional nets"
Deconvolutional networks [Zeiler-Graham-Fergus ICCV 2011]
Feed-forward/feedback networks allow reconstruction, multimodal prediction, restoration, etc.
Deep Boltzmann machines can do this, but there are scalability issues with training
[Diagram: stack of trainable feature transforms]
Integrating Deep Learning and Structured Prediction
Deep learning systems can be assembled into factor graphs
The energy function is a sum of factors; factors can embed whole deep learning systems
X: observed variables (inputs); Z: never observed (latent variables); Y: observed on the training set (output variables)
Inference is energy minimization (MAP) or free-energy minimization (marginalization) over Z and Y given an X:
F(X,Y) = \min_Z E(X,Y,Z)  (MAP), or F(X,Y) = -\log \sum_Z \exp[-E(X,Y,Z)]  (marginalization over Z)
[Diagram: Energy Model (factor graph) E(X,Y,Z), with X observed, Z unobserved, Y observed on the training set]
Integrating Deep Learning and Structured Prediction
Integrating deep learning and structured prediction is a very old idea; in fact, it predates structured prediction
Globally-trained convolutional net + graphical models, trained discriminatively at the word level
Loss identical to CRF and structured perceptron
Compositional movable parts model
A system like this was reading 10 to 20% of all the checks in the US in the late 1990s
Integrating feed-forward and feedback: deep Boltzmann machines do this, but there are issues of scalability
Integrating supervised and unsupervised learning in a single algorithm: again, deep Boltzmann machines do this, but....
Integrating deep learning and structured prediction ("reasoning"): this has been around since the 1990's but needs to be revived
Learning representations for complex reasoning: "recursive" networks that operate on vector-space representations [… 2011]
Representation learning in natural language processing [Y. Bengio 01], [Collobert & Weston 10], [Mnih & Hinton 11], [Socher 12]
Better theoretical understanding of deep learning and convolutional nets: e.g. Stéphane Mallat's "scattering transform", work on sparse representations from the applied math community....
DeepLearning.net – http://deeplearning.net – maintained by Yoshua Bengio's group
International Conference on Learning Representations
– https://sites.google.com/site/representationlearning2013/
– Open review system
– Papers and videos available online
– Takes place in April
– Extended versions of selected papers published in JMLR
– https://plus.google.com/communities/108755902083074010353
"Deep Learning" community on Google+
– https://plus.google.com/communities/112866381580457264725
Torch7: learning library that supports neural net training
– http://www.torch.ch
– http://code.cogbits.com/wiki/doku.php (tutorial with demos by C. Farabet)
Theano: Python-based learning library (U. Montreal)
RNN
– www.fit.vutbr.cz/~imikolov/rnnlm (language modeling)
– http://sourceforge.net/apps/mediawiki/rnnl/index.php (LSTM)
Misc
– www.deeplearning.net/software_links
CUDAMat & GNumpy
– code.google.com/p/cudamat
– www.cs.toronto.edu/~tijmen/gnumpy.html
Convolutional Nets
– LeCun, Bottou, Bengio and Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998
– Krizhevsky, Sutskever, Hinton: ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
– Jarrett, Kavukcuoglu, Ranzato, LeCun: What is the Best Multi-Stage Architecture for Object Recognition?, Proc. International Conference on Computer Vision (ICCV'09), IEEE, 2009
– Kavukcuoglu, Sermanet, Boureau, Gregor, Mathieu, LeCun: Learning Convolutional Feature Hierarchies for Visual Recognition, Advances in Neural Information Processing Systems (NIPS 2010), 23, 2010
– see yann.lecun.com/exdb/publis for references on many different kinds of convnets
– see http://www.cmap.polytechnique.fr/scattering/ for scattering networks (similar to convnets but with less learning and stronger mathematical foundations)
Applications of Convolutional Nets
– Farabet, Couprie, Najman, LeCun: Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers, ICML 2012
– Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala and Yann LeCun: Pedestrian Detection with Unsupervised Multi-Stage Feature Learning, CVPR 2013
– Ciresan, Giusti, Gambardella, Schmidhuber: Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images, NIPS 2012
– Hadsell, Sermanet, Ben, Erkan, Scoffier, Kavukcuoglu, Muller and LeCun: Learning Long-Range Vision for Autonomous Off-Road Driving, Journal of Field Robotics, 26(2):120-144, February 2009
– Burger, Schuler, Harmeling: Image Denoising: Can Plain Neural Networks Compete with BM3D?, Computer Vision and Pattern Recognition, CVPR 2012
Applications of RNNs
– Mikolov: "Statistical language models based on neural networks", PhD thesis, 2012
– Boden: "A guide to RNNs and backpropagation", Tech Report, 2002
– Hochreiter, Schmidhuber: "Long short-term memory", Neural Computation, 1997
– Graves: "Offline Arabic handwriting recognition with multidimensional neural networks", Springer, 2012
– Graves: "Speech recognition with deep recurrent neural networks", ICASSP 2013
Deep Learning & Energy-Based Models
– Y. Bengio: Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2(1), pp.1-127, 2009
– LeCun, Chopra, Hadsell, Ranzato, Huang: A Tutorial on Energy-Based Learning, in Bakir, G. and Hofman, T. and Schölkopf, B. and Smola, A. and Taskar, B. (Eds), Predicting Structured Data, MIT Press, 2006
– M. Ranzato: Unsupervised Learning of Feature Hierarchies, Ph.D. Thesis, NYU, 2009
Practical guide
– Y. LeCun et al.: Efficient BackProp, Neural Networks: Tricks of the Trade, 1998
– L. Bottou: Stochastic Gradient Descent Tricks, Neural Networks: Tricks of the Trade Reloaded, LNCS, 2012
– Y. Bengio: Practical Recommendations for Gradient-Based Training of Deep Architectures, ArXiv 2012