SLIDE 1

Energy Based Models

Stefano Ermon, Aditya Grover

Stanford University

Lecture 11

SLIDE 2

Summary

Story so far:

  • Representation: latent variable vs. fully observed
  • Objective function and optimization algorithm: many divergences and distances, optimized via likelihood-free (two-sample test) or likelihood-based methods

Plan for today: energy-based models

SLIDE 3

Likelihood based learning

Probability distributions $p(x)$ are a key building block in generative modeling. Properties:

1. Non-negative: $p(x) \geq 0$

2. Sum-to-one: $\sum_x p(x) = 1$ (or $\int p(x)\,dx = 1$ for continuous variables)

Sum-to-one is key: the total "volume" is fixed, so increasing $p(x_{\text{train}})$ guarantees that $x_{\text{train}}$ becomes relatively more likely (compared to the rest).

SLIDE 4

Parameterizing probability distributions

Probability distributions $p(x)$ are a key building block in generative modeling. Properties:

1. Non-negative: $p(x) \geq 0$

2. Sum-to-one: $\sum_x p(x) = 1$ (or $\int p(x)\,dx = 1$ for continuous variables)

Coming up with a non-negative function $g_\theta(x)$ is not hard. For example:

  • $g_\theta(x) = f_\theta(x)^2$, where $f_\theta$ is any neural network
  • $g_\theta(x) = \exp(f_\theta(x))$, where $f_\theta$ is any neural network
  • ...

Problem: $g_\theta(x) \geq 0$ is easy, but $g_\theta(x)$ might not sum to one. In general $\sum_x g_\theta(x) = Z(\theta) \neq 1$, so $g_\theta(x)$ is not a valid probability mass function or density.
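As a quick illustration, here is a minimal NumPy sketch (the tiny two-layer network standing in for $f_\theta$ is hypothetical): exponentiating any network output gives non-negativity for free, but a crude Monte Carlo estimate of the resulting "volume" shows it is generally not 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical f_theta: a tiny two-layer network mapping R^2 -> R.
W1, b1 = rng.normal(size=(16, 2)), np.zeros(16)
w2 = rng.normal(size=16)

def f_theta(x):
    return w2 @ np.tanh(W1 @ x + b1)

def g_theta(x):
    # Non-negative by construction...
    return np.exp(f_theta(x))

# ...but nothing forces it to integrate to one: a crude Monte Carlo
# estimate of the volume over the box [-3, 3]^2 is generally != 1.
xs = rng.uniform(-3, 3, size=(100_000, 2))
Z_est = np.mean([g_theta(x) for x in xs]) * 6.0 ** 2  # box area = 36
print(f"estimated volume over the box: {Z_est:.3f}")
```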

SLIDE 5

Likelihood based learning

Problem: $g_\theta(x) \geq 0$ is easy, but $g_\theta(x)$ might not be normalized.

Solution:
$$p_\theta(x) = \frac{1}{\text{Volume}(g_\theta)}\, g_\theta(x) = \frac{g_\theta(x)}{\int g_\theta(x)\,dx}$$

Then by definition, $\int p_\theta(x)\,dx = 1$. Typically, choose $g_\theta(x)$ so that we know the volume analytically as a function of $\theta$. For example:

1. $g_{(\mu,\sigma)}(x) = e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. Volume: $\int e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx = \sqrt{2\pi\sigma^2}$ → Gaussian

2. $g_\lambda(x) = e^{-\lambda x}$. Volume: $\int_0^{+\infty} e^{-\lambda x}\,dx = \frac{1}{\lambda}$ → Exponential

3. Etc.

We can only choose functional forms $g_\theta(x)$ that we can integrate analytically. This is very restrictive, but as we have seen, such forms are very useful as building blocks for more complex models (e.g., conditionals in autoregressive models).
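A small numerical check of the Gaussian case (a sketch using scipy.integrate.quad): quadrature agrees with the analytic volume $\sqrt{2\pi\sigma^2}$, and dividing by it yields a density that integrates to one.

```python
import numpy as np
from scipy.integrate import quad

mu, sigma = 1.5, 0.7

# Unnormalized Gaussian bump g_(mu,sigma)(x) = exp(-(x - mu)^2 / (2 sigma^2)).
g = lambda x: np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

vol_numeric, _ = quad(g, -np.inf, np.inf)
vol_analytic = np.sqrt(2 * np.pi * sigma ** 2)
print(vol_numeric, vol_analytic)   # agree to numerical precision

p = lambda x: g(x) / vol_analytic  # normalized density
print(quad(p, -np.inf, np.inf)[0]) # ~1.0
```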

SLIDE 6

Likelihood based learning

Problem: $g_\theta(x) \geq 0$ is easy, but $g_\theta(x)$ might not be normalized.

Solution:
$$p_\theta(x) = \frac{1}{\text{Volume}(g_\theta)}\, g_\theta(x) = \frac{g_\theta(x)}{\int g_\theta(x)\,dx}$$

Typically, choose $g_\theta(x)$ so that we know the volume analytically. More complex models can be obtained by combining these building blocks. Two main strategies:

1. Autoregressive: products of normalized objects $p_\theta(x)\, p_{\theta'(x)}(y)$:
$$\int_x \int_y p_\theta(x)\, p_{\theta'(x)}(y)\,dx\,dy = \int_x p_\theta(x) \underbrace{\int_y p_{\theta'(x)}(y)\,dy}_{=1}\,dx = \int_x p_\theta(x)\,dx = 1$$

2. Latent variables: mixtures of normalized objects $\alpha\, p_\theta(x) + (1-\alpha)\, p_{\theta'}(x)$:
$$\int_x \big(\alpha\, p_\theta(x) + (1-\alpha)\, p_{\theta'}(x)\big)\,dx = \alpha + (1-\alpha) = 1$$

How about using models where the "volume"/normalization constant is not easy to compute analytically?

SLIDE 7

Energy based model

$$p_\theta(x) = \frac{\exp(f_\theta(x))}{\int \exp(f_\theta(x))\,dx} = \frac{1}{Z(\theta)} \exp(f_\theta(x))$$

The volume/normalization constant $Z(\theta) = \int \exp(f_\theta(x))\,dx$ is also called the partition function. Why exponential (and not, e.g., $f_\theta(x)^2$)?

1. We want to capture very large variations in probability; log-probability is the natural scale to work with. Otherwise we would need a highly non-smooth $f_\theta$.

2. Exponential families: many common distributions can be written in this form.

3. These distributions arise under fairly general assumptions in statistical physics (maximum entropy, second law of thermodynamics). $-f_\theta(x)$ is called the energy, hence the name. Intuitively, configurations $x$ with low energy (high $f_\theta(x)$) are more likely.
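A minimal 1-D sketch (the particular $f_\theta$ below is a hypothetical stand-in): computing $Z(\theta)$ by quadrature is feasible here only because $x$ is one-dimensional.

```python
import numpy as np
from scipy.integrate import quad

# Hypothetical f_theta: any scalar function of x defines a valid model.
f_theta = lambda x: -0.5 * x ** 2 + np.sin(3 * x)

# Partition function Z(theta) = integral of exp(f_theta(x)) dx.
Z, _ = quad(lambda x: np.exp(f_theta(x)), -np.inf, np.inf)

# Normalized density: high f_theta (low energy -f_theta) => more likely.
p_theta = lambda x: np.exp(f_theta(x)) / Z
print(quad(p_theta, -np.inf, np.inf)[0])  # ~1.0
```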

SLIDE 8

Energy based model

$$p_\theta(x) = \frac{\exp(f_\theta(x))}{\int \exp(f_\theta(x))\,dx} = \frac{1}{Z(\theta)} \exp(f_\theta(x))$$

Pros:

1. Extreme flexibility: can use pretty much any function $f_\theta(x)$ you want

Cons (lots of them):

1. Sampling from $p_\theta(x)$ is hard
2. Evaluating and optimizing the likelihood $p_\theta(x)$ is hard (learning is hard)
3. No feature learning (but can add latent variables)

Curse of dimensionality: the fundamental issue is that computing $Z(\theta)$ numerically (when no analytic solution is available) scales exponentially in the number of dimensions of $x$. Nevertheless, some tasks do not require knowing $Z(\theta)$.

SLIDE 9

Applications of Energy based models

$$p_\theta(x) = \frac{\exp(f_\theta(x))}{\int \exp(f_\theta(x))\,dx} = \frac{1}{Z(\theta)} \exp(f_\theta(x))$$

Given $x$, $x'$, evaluating $p_\theta(x)$ or $p_\theta(x')$ requires $Z(\theta)$. However, their ratio
$$\frac{p_\theta(x)}{p_\theta(x')} = \exp\big(f_\theta(x) - f_\theta(x')\big)$$
does not involve $Z(\theta)$. This means we can easily check which one is more likely. Applications:

1. Anomaly detection

2. Denoising
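For instance (a sketch; the quadratic $f_\theta$ below stands in for a trained energy network), the likelihood ratio of two inputs needs only an energy difference:

```python
import numpy as np

# Hypothetical stand-in for a trained energy network f_theta.
f_theta = lambda x: -np.sum((x - 1.0) ** 2)

x, x_prime = np.zeros(10), np.ones(10)

# p_theta(x) / p_theta(x') = exp(f_theta(x) - f_theta(x')): no Z needed.
ratio = np.exp(f_theta(x) - f_theta(x_prime))
print("x is more likely" if ratio > 1 else "x' is more likely")
```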

SLIDE 10

Applications of Energy based models

[Figure: three energy-based prediction tasks, each scoring a pair $(X, Y)$ with an energy $E(Y, X)$: object recognition (image $X$, label $Y$ such as "cat"), sequence labeling (labels $Y$ such as "noun"), and image restoration.]

Given a trained model, many applications require relative comparisons. Hence Z(θ) is not needed.

SLIDE 11

Example: Ising Model

There is a true image $y \in \{0, 1\}^{3 \times 3}$ and a corrupted image $x \in \{0, 1\}^{3 \times 3}$. We know $x$, and want to somehow recover $y$.

[Figure: a 3×3 Markov random field. Each noisy pixel $X_i$ is connected to its true pixel $Y_i$, and neighboring true pixels are connected to each other.]

$X_i$: noisy pixels. $Y_i$: "true" pixels.

We model the joint probability distribution $p(y, x)$ as
$$p(y, x) = \frac{1}{Z} \exp\Big(\sum_i \psi_i(x_i, y_i) + \sum_{(i,j) \in E} \psi_{ij}(y_i, y_j)\Big)$$

  • $\psi_i(x_i, y_i)$: the $i$-th corrupted pixel depends on the $i$-th original pixel
  • $\psi_{ij}(y_i, y_j)$: neighboring pixels tend to have the same value

What did the original image $y$ look like? Solution: maximize $p(y \mid x)$.
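A brute-force sketch of this recipe (the potentials and coupling strengths below are illustrative choices, not the lecture's): scoring each candidate $y$ with the unnormalized log-probability suffices for maximizing $p(y \mid x)$, since $Z$ and $p(x)$ are constant in $y$.

```python
import numpy as np

ALPHA, BETA = 1.0, 2.0  # hypothetical coupling strengths

def log_score(y, x):
    """Unnormalized log p(y, x) for binary 3x3 images."""
    s = ALPHA * np.sum(y == x)                 # sum_i psi_i(x_i, y_i)
    s += BETA * np.sum(y[:, :-1] == y[:, 1:])  # horizontal neighbor pairs
    s += BETA * np.sum(y[:-1, :] == y[1:, :])  # vertical neighbor pairs
    return s

# MAP denoising by enumerating all 2^9 = 512 candidate images y.
x = np.array([[1, 1, 0],
              [1, 0, 1],
              [1, 1, 1]])
y_map = max((np.array(bits).reshape(3, 3) for bits in np.ndindex(*(2,) * 9)),
            key=lambda y: log_score(y, x))
print(y_map)
```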

SLIDE 12

Example: Product of Experts

Suppose you have trained several models $q_{\theta_1}(x)$, $r_{\theta_2}(x)$, $t_{\theta_3}(x)$. They can be different models (PixelCNN, Flow, etc.). Each one is like an expert that can be used to score how likely an input $x$ is. Assuming the experts make their judgments independently, it is tempting to ensemble them as $q_{\theta_1}(x)\, r_{\theta_2}(x)\, t_{\theta_3}(x)$. To get a valid probability distribution, we need to normalize:
$$p_{\theta_1,\theta_2,\theta_3}(x) = \frac{1}{Z(\theta_1, \theta_2, \theta_3)}\, q_{\theta_1}(x)\, r_{\theta_2}(x)\, t_{\theta_3}(x)$$
Note: this is similar to an AND operation (e.g., the probability is zero as long as one model gives zero probability), unlike mixture models, which behave more like OR.
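A sketch of the AND-like behavior in log space (the three 1-D expert log-densities below are hypothetical stand-ins for trained models):

```python
import numpy as np
from scipy.integrate import quad

# Hypothetical 1-D expert log-densities (stand-ins for PixelCNN, Flow, ...).
log_q = lambda x: -0.5 * (x - 1.0) ** 2
log_r = lambda x: -0.5 * (x + 1.0) ** 2
log_t = lambda x: -abs(x)

# Product of experts, scored in log space: x is likely only if *every*
# expert assigns it reasonable probability (AND, not OR).
log_poe = lambda x: log_q(x) + log_r(x) + log_t(x)

Z, _ = quad(lambda x: np.exp(log_poe(x)), -np.inf, np.inf)  # new normalizer
p = lambda x: np.exp(log_poe(x)) / Z
print(p(0.0), p(3.0))  # x = 0 satisfies all experts; x = 3 is vetoed
```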

SLIDE 13

Example: Restricted Boltzmann machine (RBM)

RBM: an energy-based model with latent variables. Two types of variables:

1. $x \in \{0, 1\}^n$ are visible variables (e.g., pixel values)

2. $z \in \{0, 1\}^m$ are latent ones

The joint distribution is
$$p_{W,b,c}(x, z) = \frac{1}{Z} \exp\big(x^\top W z + b^\top x + c^\top z\big) = \frac{1}{Z} \exp\Big(\sum_{i=1}^n \sum_{j=1}^m x_i z_j w_{ij} + b^\top x + c^\top z\Big)$$

[Figure: bipartite graph connecting visible units to hidden units.]

"Restricted" because there are no visible-visible or hidden-hidden connections, i.e., no $x_i x_j$ or $z_i z_j$ terms in the objective.
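A minimal sketch of evaluating the RBM's unnormalized log-probability (parameters drawn at random purely for illustration; computing $Z$ is deferred to the slide on the normalization constant):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 4  # visible / hidden dimensions (tiny, for illustration)

# Hypothetical RBM parameters.
W = rng.normal(scale=0.1, size=(n, m))
b = rng.normal(scale=0.1, size=n)
c = rng.normal(scale=0.1, size=m)

def log_unnormalized(x, z):
    # x^T W z + b^T x + c^T z: only x-z interactions, no x-x or z-z terms.
    return x @ W @ z + b @ x + c @ z

x = rng.integers(0, 2, size=n)
z = rng.integers(0, 2, size=m)
print(log_unnormalized(x, z))
```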

SLIDE 14

Deep Boltzmann Machines

Stacked RBMs are one of the first deep generative models:

[Figure: a deep Boltzmann machine with visible layer $v$ and stacked hidden layers $h^{(1)}$, $h^{(2)}$, $h^{(3)}$, connected by weight matrices $W^{(1)}$, $W^{(2)}$, $W^{(3)}$.]

Bottom-layer variables $v$ are pixel values. Layers above ($h$) represent "higher-level" features (corners, edges, etc.). Early deep neural networks for supervised learning had to be pre-trained like this to make them work.

SLIDE 15

Boltzmann Machines: samples

SLIDE 16

Energy based models: learning and inference

$$p_\theta(x) = \frac{\exp(f_\theta(x))}{\int \exp(f_\theta(x))\,dx} = \frac{1}{Z(\theta)} \exp(f_\theta(x))$$

Pros:

1. Can plug in pretty much any function $f_\theta(x)$ you want

Cons (lots of them):

1. Sampling is hard
2. Evaluating the likelihood (learning) is hard
3. No feature learning

Curse of dimensionality: the fundamental issue is that computing $Z(\theta)$ numerically (when no analytic solution is available) scales exponentially in the number of dimensions of $x$.

SLIDE 17

Computing the normalization constant is hard

As an example, the RBM joint distribution is
$$p_{W,b,c}(x, z) = \frac{1}{Z} \exp\big(x^\top W z + b^\top x + c^\top z\big)$$
where

1. $x \in \{0, 1\}^n$ are visible variables (e.g., pixel values)

2. $z \in \{0, 1\}^m$ are latent ones

The normalization constant (the "volume") is
$$Z(W, b, c) = \sum_{x \in \{0,1\}^n} \sum_{z \in \{0,1\}^m} \exp\big(x^\top W z + b^\top x + c^\top z\big)$$

Note: it is a well-defined function of the parameters $W$, $b$, $c$, but it has no simple closed form and takes time exponential in $n$ and $m$ to compute. This means that evaluating the objective function $p_{W,b,c}(x, z)$ for likelihood-based learning is hard. Optimizing the unnormalized probability $\exp\big(x^\top W z + b^\top x + c^\top z\big)$ is easy (w.r.t. the trainable parameters $W$, $b$, $c$), but optimizing the likelihood $p_{W,b,c}(x, z)$ is difficult.
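A brute-force sketch of that exponential sum for a toy RBM (hypothetical random parameters): the double enumeration below is exactly the $2^{n+m}$-term sum, so it is only feasible for tiny $n$ and $m$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3  # 2**(n + m) = 128 terms; adding one unit doubles the count

W = rng.normal(scale=0.1, size=(n, m))
b = rng.normal(scale=0.1, size=n)
c = rng.normal(scale=0.1, size=m)

# Z(W, b, c): sum exp(x^T W z + b^T x + c^T z) over all binary (x, z).
Z = 0.0
for xs in itertools.product([0, 1], repeat=n):
    for zs in itertools.product([0, 1], repeat=m):
        x, z = np.array(xs), np.array(zs)
        Z += np.exp(x @ W @ z + b @ x + c @ z)
print(Z)
```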

SLIDE 18

Training intuition

Goal: maximize $\frac{\exp(f_\theta(x_{\text{train}}))}{Z(\theta)}$. Increase the numerator, decrease the denominator.

Intuition: because the model is not normalized, increasing the unnormalized log-probability $f_\theta(x_{\text{train}})$ by changing $\theta$ does not guarantee that $x_{\text{train}}$ becomes relatively more likely (compared to the rest). We also need to take into account the effect on other "wrong points" and try to "push them down" to also make $Z(\theta)$ small.

SLIDE 19

Contrastive Divergence

Goal: maximize $\frac{\exp(f_\theta(x_{\text{train}}))}{Z(\theta)}$

Idea: instead of evaluating $Z(\theta)$ exactly, use a Monte Carlo estimate.

Contrastive divergence algorithm: sample $x_{\text{sample}} \sim p_\theta$, then take a step on $\nabla_\theta \big(f_\theta(x_{\text{train}}) - f_\theta(x_{\text{sample}})\big)$. Make the training data more likely than a typical sample from the model. Recall that comparisons are easy in energy-based models! Looks simple, but how to sample? Unfortunately, sampling is hard.
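A sketch of one contrastive-divergence update in PyTorch (the small network and the stand-in negative batch are illustrative; in practice $x_{\text{sample}}$ would come from an MCMC sampler like the one on the next slide):

```python
import torch

# Hypothetical energy network f_theta: R^2 -> R.
f_theta = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt = torch.optim.SGD(f_theta.parameters(), lr=1e-2)

def cd_step(x_train, x_sample):
    # Ascend f_theta on data, descend on model samples: minimizing this
    # loss takes a step on grad_theta(f(x_train) - f(x_sample)).
    loss = f_theta(x_sample).mean() - f_theta(x_train).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

x_train = torch.randn(16, 2)   # batch of training data (stand-in)
x_sample = torch.randn(16, 2)  # negatives, ideally drawn from p_theta
cd_step(x_train, x_sample)
```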

SLIDE 20

Sampling from Energy based models

$$p_\theta(x) = \frac{\exp(f_\theta(x))}{\int \exp(f_\theta(x))\,dx} = \frac{1}{Z(\theta)} \exp(f_\theta(x))$$

There is no direct way to sample as in autoregressive or flow models. The main issue: we cannot easily compute how likely each possible sample is. However, we can easily compare two samples $x$, $x'$. Use an iterative approach called Markov chain Monte Carlo:

1. Initialize $x^0$ randomly, set $t = 0$

2. Let $x' = x^t + \text{noise}$
   - If $f_\theta(x') > f_\theta(x^t)$, let $x^{t+1} = x'$
   - Else let $x^{t+1} = x'$ with probability $\exp\big(f_\theta(x') - f_\theta(x^t)\big)$, and $x^{t+1} = x^t$ otherwise

3. Go to step 2

This works in theory, but can take a very long time to converge.
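A sketch of this sampler on a hypothetical bimodal energy (the specific $f_\theta$ is illustrative): each step needs only an energy difference, never $Z(\theta)$, and the slow hopping between the two modes illustrates the convergence issue.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bimodal model: f_theta peaks near (-2, 0) and (+2, 0).
def f_theta(x):
    return np.logaddexp(-np.sum((x - [2.0, 0.0]) ** 2),
                        -np.sum((x + [2.0, 0.0]) ** 2))

def mcmc_sample(n_steps=10_000, step=0.5):
    x = rng.normal(size=2)  # step 1: random initialization
    samples = []
    for _ in range(n_steps):
        x_prop = x + step * rng.normal(size=2)  # step 2: add noise
        # Accept if f increases; otherwise accept with prob exp(delta f).
        if np.log(rng.uniform()) < f_theta(x_prop) - f_theta(x):
            x = x_prop
        samples.append(x)
    return np.array(samples)

chain = mcmc_sample()
print(chain[5000:].mean(axis=0))  # mode-hopping is slow in practice
```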

SLIDE 21

Conclusion

Energy-based models are another useful tool for modeling high-dimensional probability distributions. They are a very flexible class of models, but are currently less popular because of computational issues. Energy-based GANs: the energy is represented by a discriminator, and contrastive samples (as in contrastive divergence) come from a GAN-style generator. Reference: LeCun et al., A Tutorial on Energy-Based Learning [Link]
