Normalizing Flow Models


  1. Normalizing Flow Models
     Stefano Ermon, Aditya Grover
     Stanford University
     Lecture 7

  2. Recap of likelihood-based learning so far
     Model families:
     Autoregressive models: $p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i \mid x_{<i})$
     Variational autoencoders: $p_\theta(x) = \int p_\theta(x, z) \, dz$
     Autoregressive models provide tractable likelihoods but no direct mechanism for learning features.
     Variational autoencoders can learn feature representations (via latent variables $z$) but have intractable marginal likelihoods.
     Key question: Can we design a latent variable model with tractable likelihoods? Yes!

  3. Simple Prior to Complex Data Distributions
     Desirable properties of any model distribution:
     Analytic density
     Easy to sample
     Many simple distributions satisfy the above properties, e.g., Gaussian and uniform distributions.
     Unfortunately, data distributions could be much more complex (multi-modal).
     Key idea: Map simple distributions (easy to sample and evaluate densities) to complex distributions (learned via data) using change of variables.

  4. Change of Variables Formula
     Let $Z$ be a uniform random variable $U[0, 2]$ with density $p_Z$. What is $p_Z(1)$? It is $1/2$.
     Let $X = 4Z$, and let $p_X$ be its density. What is $p_X(4)$?
     Is it $p_X(4) = p(X = 4) = p(4Z = 4) = p(Z = 1) = p_Z(1) = 1/2$? No.
     Clearly, $X$ is uniform in $[0, 8]$, so $p_X(4) = 1/8$.

  5. Change of Variables Formula
     Change of variables (1D case): If $X = f(Z)$ and $f(\cdot)$ is monotone with inverse $Z = f^{-1}(X) = h(X)$, then:
     $p_X(x) = p_Z(h(x)) \, |h'(x)|$
     Previous example: If $X = 4Z$ and $Z \sim U[0, 2]$, what is $p_X(4)$?
     Note that $h(X) = X/4$, so
     $p_X(4) = p_Z(1) \, |h'(4)| = 1/2 \times 1/4 = 1/8$
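To make the 1D formula concrete, here is a minimal numerical sketch (assuming NumPy) that evaluates $p_X(4)$ for the example $X = 4Z$, $Z \sim U[0, 2]$, both via the change of variables formula and via a Monte Carlo histogram; the helper names are illustrative, not from the slides.

```python
import numpy as np

# Change of variables in 1D: X = f(Z) = 4Z with Z ~ U[0, 2].
# Inverse: h(x) = x / 4, so p_X(x) = p_Z(h(x)) * |h'(x)|.

def p_Z(z):
    # Density of U[0, 2]
    return np.where((z >= 0) & (z <= 2), 0.5, 0.0)

def p_X(x):
    h = x / 4.0          # inverse map z = h(x)
    dh_dx = 1.0 / 4.0    # |h'(x)|
    return p_Z(h) * dh_dx

print(p_X(4.0))          # 0.125, i.e. 1/8

# Monte Carlo sanity check: histogram of samples of X = 4Z near x = 4
rng = np.random.default_rng(0)
x_samples = 4.0 * rng.uniform(0.0, 2.0, size=1_000_000)
hist, edges = np.histogram(x_samples, bins=80, range=(0.0, 8.0), density=True)
print(hist[np.searchsorted(edges, 4.0) - 1])  # approximately 0.125
```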

  6. Geometry: Determinants and Volumes
     Let $Z$ be a uniform random vector in $[0, 1]^n$.
     Let $X = AZ$ for a square invertible matrix $A$, with inverse $W = A^{-1}$. How is $X$ distributed?
     Geometrically, the matrix $A$ maps the unit hypercube $[0, 1]^n$ to a parallelotope.
     Hypercubes and parallelotopes are generalizations of squares/cubes and parallelograms/parallelepipeds to higher dimensions.
     Figure: The matrix $A = \begin{pmatrix} a & c \\ b & d \end{pmatrix}$ maps a unit square to a parallelogram.

  7. Geometry: Determinants and Volumes
     The volume of the parallelotope is equal to the absolute value of the determinant of the transformation $A$:
     $\det(A) = \det \begin{pmatrix} a & c \\ b & d \end{pmatrix} = ad - bc$
     $X$ is uniformly distributed over the parallelotope. Hence, we have
     $p_X(x) = p_Z(Wx) \, |\det(W)| = p_Z(Wx) \, / \, |\det(A)|$
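A minimal sketch of this linear case (assuming NumPy); the specific matrix $A$ is an arbitrary illustrative choice, not from the slides.

```python
import numpy as np

# Linear change of variables: Z ~ U([0, 1]^2), X = A Z.
# Then p_X(x) = p_Z(W x) * |det(W)| = p_Z(W x) / |det(A)|, with W = A^{-1}.
A = np.array([[2.0, 1.0],
              [0.5, 3.0]])           # illustrative invertible matrix
W = np.linalg.inv(A)

def p_Z(z):
    # Uniform density on the unit square
    return float(np.all((z >= 0) & (z <= 1)))

def p_X(x):
    return p_Z(W @ x) * abs(np.linalg.det(W))

x = A @ np.array([0.25, 0.5])        # a point inside the parallelogram
print(abs(np.linalg.det(A)))         # area of the parallelogram: 5.5
print(p_X(x))                        # 1 / 5.5 ~= 0.1818
```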

  8. Generalized Change of Variables
     For linear transformations specified via $A$, the change in volume is given by the determinant of $A$.
     For non-linear transformations $f(\cdot)$, the linearized change in volume is given by the determinant of the Jacobian of $f(\cdot)$.
     Change of variables (general case): The mapping between $Z$ and $X$, given by $f: \mathbb{R}^n \mapsto \mathbb{R}^n$, is invertible such that $X = f(Z)$ and $Z = f^{-1}(X)$:
     $p_X(x) = p_Z\left(f^{-1}(x)\right) \left| \det\left( \frac{\partial f^{-1}(x)}{\partial x} \right) \right|$
     Note 1: $x$, $z$ need to be continuous and have the same dimension. For example, if $x \in \mathbb{R}^n$ then $z \in \mathbb{R}^n$.
     Note 2: For any invertible matrix $A$, $\det(A^{-1}) = \det(A)^{-1}$, so equivalently
     $p_X(x) = p_Z(z) \left| \det\left( \frac{\partial f(z)}{\partial z} \right) \right|^{-1}$ with $z = f^{-1}(x)$.
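Below is a hedged sketch of evaluating the general formula with an automatically computed Jacobian (assuming PyTorch); the invertible map `f` and its fixed-point inversion are illustrative choices made for this example, not part of the lecture.

```python
import torch
from torch.autograd.functional import jacobian

# General change of variables for an invertible f: R^n -> R^n,
#   p_X(x) = p_Z(f^{-1}(x)) * |det(d f^{-1}(x) / dx)|
#          = p_Z(z) * |det(d f(z) / dz)|^{-1}   with z = f^{-1}(x).

def f(z):
    # Illustrative invertible map: element-wise and strictly increasing per coordinate
    return z + 0.5 * torch.tanh(z)

def f_inv(x, n_iter=50):
    # Fixed-point inversion of z + 0.5*tanh(z) = x (the 0.5 factor makes this a contraction)
    z = x.clone()
    for _ in range(n_iter):
        z = x - 0.5 * torch.tanh(z)
    return z

prior = torch.distributions.MultivariateNormal(torch.zeros(2), torch.eye(2))

x = torch.tensor([0.7, -1.2])
z = f_inv(x)
J = jacobian(f, z)                        # d f(z) / d z, a 2x2 matrix
_, logabsdet = torch.linalg.slogdet(J)
# log p_X(x) = log p_Z(z) - log|det(df/dz)|, i.e. + log|det(df^{-1}/dx)|
log_px = prior.log_prob(z) - logabsdet
print(log_px.item())
```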

  9. Two-Dimensional Example
     Let $Z_1$ and $Z_2$ be continuous random variables with joint density $p_{Z_1, Z_2}$.
     Let $u = (u_1, u_2)$ be a transformation and $v = (v_1, v_2)$ the inverse transformation.
     Let $X_1 = u_1(Z_1, Z_2)$ and $X_2 = u_2(Z_1, Z_2)$. Then $Z_1 = v_1(X_1, X_2)$ and $Z_2 = v_2(X_1, X_2)$.
     $p_{X_1, X_2}(x_1, x_2) = p_{Z_1, Z_2}\left(v_1(x_1, x_2), v_2(x_1, x_2)\right) \left| \det \begin{pmatrix} \frac{\partial v_1(x_1, x_2)}{\partial x_1} & \frac{\partial v_1(x_1, x_2)}{\partial x_2} \\ \frac{\partial v_2(x_1, x_2)}{\partial x_1} & \frac{\partial v_2(x_1, x_2)}{\partial x_2} \end{pmatrix} \right|$ (inverse)
     $\qquad\qquad\quad\;\, = p_{Z_1, Z_2}(z_1, z_2) \left| \det \begin{pmatrix} \frac{\partial u_1(z_1, z_2)}{\partial z_1} & \frac{\partial u_1(z_1, z_2)}{\partial z_2} \\ \frac{\partial u_2(z_1, z_2)}{\partial z_1} & \frac{\partial u_2(z_1, z_2)}{\partial z_2} \end{pmatrix} \right|^{-1}$ (forward)

  10. Normalizing Flow Models
     Consider a directed, latent-variable model over observed variables $X$ and latent variables $Z$.
     In a normalizing flow model, the mapping between $Z$ and $X$, given by $f_\theta: \mathbb{R}^n \mapsto \mathbb{R}^n$, is deterministic and invertible such that $X = f_\theta(Z)$ and $Z = f_\theta^{-1}(X)$.
     Using change of variables, the marginal likelihood $p(x)$ is given by
     $p_X(x; \theta) = p_Z\left(f_\theta^{-1}(x)\right) \left| \det\left( \frac{\partial f_\theta^{-1}(x)}{\partial x} \right) \right|$
     Note: $x$, $z$ need to be continuous and have the same dimension.

  11. A Flow of Transformations
     Normalizing: Change of variables gives a normalized density after applying an invertible transformation.
     Flow: Invertible transformations can be composed with each other:
     $z_m := f_\theta^m \circ \cdots \circ f_\theta^1(z_0) = f_\theta^m\left(f_\theta^{m-1}\left(\cdots\left(f_\theta^1(z_0)\right)\right)\right) \triangleq f_\theta(z_0)$
     Start with a simple distribution for $z_0$ (e.g., Gaussian).
     Apply a sequence of $M$ invertible transformations, with $x \triangleq z_M$.
     By change of variables:
     $p_X(x; \theta) = p_Z\left(f_\theta^{-1}(x)\right) \prod_{m=1}^{M} \left| \det\left( \frac{\partial (f_\theta^m)^{-1}(z_m)}{\partial z_m} \right) \right|$
     (Note: the determinant of a product equals the product of the determinants.)
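The composition can be sketched in code as follows (assuming PyTorch); the element-wise affine layers are a deliberately simple illustrative choice, used only because their log-determinants are trivial to accumulate across the flow.

```python
import torch

# A "flow" as a composition of simple invertible maps. Each layer returns the
# log|det| of its Jacobian so the total log-likelihood accumulates one term per layer:
#   log p_X(x) = log p_Z(z_0) + sum_m log|det d(f^m)^{-1}/dz|   (inverse direction)

class AffineLayer:
    """x = exp(log_scale) * z + shift, element-wise and hence trivially invertible."""
    def __init__(self, dim):
        self.log_scale = torch.zeros(dim)
        self.shift = torch.zeros(dim)

    def forward(self, z):            # z -> x, returns x and log|det J|
        return z * self.log_scale.exp() + self.shift, self.log_scale.sum()

    def inverse(self, x):            # x -> z, returns z and log|det J^{-1}|
        return (x - self.shift) * (-self.log_scale).exp(), -self.log_scale.sum()

def log_likelihood(x, layers, prior):
    log_det_total = 0.0
    z = x
    for layer in reversed(layers):   # invert the flow: x = z_M -> ... -> z_0
        z, log_det = layer.inverse(z)
        log_det_total = log_det_total + log_det
    return prior.log_prob(z) + log_det_total

layers = [AffineLayer(2) for _ in range(3)]
prior = torch.distributions.MultivariateNormal(torch.zeros(2), torch.eye(2))
print(log_likelihood(torch.tensor([0.3, -0.8]), layers, prior).item())
```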

  12. Planar Flows (Rezende & Mohamed, 2016)
     Planar flow: invertible transformation
     $x = f_\theta(z) = z + u \, h(w^T z + b)$
     parameterized by $\theta = (w, u, b)$, where $h(\cdot)$ is a non-linearity.
     The absolute value of the determinant of the Jacobian is given by
     $\left| \det \frac{\partial f_\theta(z)}{\partial z} \right| = \left| \det\left( I + h'(w^T z + b) \, u w^T \right) \right| = \left| 1 + h'(w^T z + b) \, u^T w \right|$ (matrix determinant lemma)
     Need to restrict parameters and non-linearity for the mapping to be invertible. For example, $h = \tanh(\cdot)$ and $h'(w^T z + b) \, u^T w \geq -1$.
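A minimal sketch of a planar flow layer (assuming PyTorch); it computes the forward map and the log-determinant via the matrix determinant lemma, but for brevity it does not enforce the invertibility constraint mentioned above. The parameter values are illustrative.

```python
import torch

def planar_flow(z, w, u, b):
    """Planar flow x = z + u * tanh(w^T z + b) and log|det(Jacobian)|.

    Uses the matrix determinant lemma:
        det(I + h'(w^T z + b) u w^T) = 1 + h'(w^T z + b) u^T w.
    Shapes: z is (batch, dim); w and u are (dim,); b is a scalar.
    Invertibility (h'(w^T z + b) u^T w >= -1) is assumed, not enforced here.
    """
    pre = z @ w + b                        # (batch,)
    x = z + u * torch.tanh(pre).unsqueeze(-1)
    h_prime = 1.0 - torch.tanh(pre) ** 2   # derivative of tanh
    log_det = torch.log(torch.abs(1.0 + h_prime * (u @ w)))
    return x, log_det

z = torch.randn(5, 2)
w = torch.tensor([0.5, -1.0])
u = torch.tensor([1.0, 0.3])
b = torch.tensor(0.1)
x, log_det = planar_flow(z, w, u, b)
print(x.shape, log_det.shape)              # torch.Size([5, 2]) torch.Size([5])
```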

  13. Planar Flows (Rezende & Mohamed, 2016)
     [Figure: samples after applying planar transformations to a Gaussian base distribution and to a uniform base distribution.]
     10 planar transformations can transform simple distributions into a more complex one.

  14. Learning and Inference
     Learning via maximum likelihood over the dataset $\mathcal{D}$:
     $\max_\theta \log p_X(\mathcal{D}; \theta) = \sum_{x \in \mathcal{D}} \log p_Z\left(f_\theta^{-1}(x)\right) + \log \left| \det\left( \frac{\partial f_\theta^{-1}(x)}{\partial x} \right) \right|$
     Exact likelihood evaluation via the inverse transformation $x \mapsto z$ and the change of variables formula.
     Sampling via the forward transformation $z \mapsto x$: $z \sim p_Z(z)$, $x = f_\theta(z)$.
     Latent representations inferred via the inverse transformation (no inference network required!): $z = f_\theta^{-1}(x)$.
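A schematic maximum-likelihood loop (assuming PyTorch), using a single learnable element-wise affine map purely to keep the sketch short; a real flow would stack invertible layers such as the planar flow above. The toy dataset and hyperparameters are illustrative.

```python
import torch

# Schematic MLE training of a flow: here x = exp(s) * z + t, element-wise.
dim = 2
s = torch.zeros(dim, requires_grad=True)   # log-scale
t = torch.zeros(dim, requires_grad=True)   # shift
prior = torch.distributions.MultivariateNormal(torch.zeros(dim), torch.eye(dim))
opt = torch.optim.Adam([s, t], lr=1e-2)

data = 3.0 + 0.5 * torch.randn(1024, dim)  # toy dataset

for step in range(1000):
    # Inverse transformation x -> z and change of variables:
    #   log p_X(x) = log p_Z(z) + log|det dz/dx| = log p_Z((x - t) * exp(-s)) - sum(s)
    z = (data - t) * torch.exp(-s)
    log_px = prior.log_prob(z) - s.sum()
    loss = -log_px.mean()                  # maximize likelihood = minimize NLL
    opt.zero_grad()
    loss.backward()
    opt.step()

print(t.detach(), s.exp().detach())        # should approach mean ~3.0 and scale ~0.5
```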

  15. Desiderata for Flow Models
     Simple prior $p_Z(z)$ that allows for efficient sampling and tractable likelihood evaluation, e.g., an isotropic Gaussian.
     Invertible transformations with tractable evaluation:
     Likelihood evaluation requires efficient evaluation of the $x \mapsto z$ mapping.
     Sampling requires efficient evaluation of the $z \mapsto x$ mapping.
     Computing likelihoods also requires evaluating determinants of $n \times n$ Jacobian matrices, where $n$ is the data dimensionality.
     Computing the determinant of an $n \times n$ matrix is $O(n^3)$: prohibitively expensive within a learning loop!
     Key idea: Choose transformations so that the resulting Jacobian matrix has special structure. For example, the determinant of a triangular matrix is the product of the diagonal entries, i.e., an $O(n)$ operation.

  16. Triangular Jacobian
     $x = (x_1, \cdots, x_n) = f(z) = (f_1(z), \cdots, f_n(z))$
     $J = \frac{\partial f}{\partial z} = \begin{pmatrix} \frac{\partial f_1}{\partial z_1} & \cdots & \frac{\partial f_1}{\partial z_n} \\ \cdots & \cdots & \cdots \\ \frac{\partial f_n}{\partial z_1} & \cdots & \frac{\partial f_n}{\partial z_n} \end{pmatrix}$
     Suppose $x_i = f_i(z)$ only depends on $z_{\leq i}$. Then
     $J = \frac{\partial f}{\partial z} = \begin{pmatrix} \frac{\partial f_1}{\partial z_1} & \cdots & 0 \\ \cdots & \cdots & \cdots \\ \frac{\partial f_n}{\partial z_1} & \cdots & \frac{\partial f_n}{\partial z_n} \end{pmatrix}$
     has lower triangular structure. Its determinant can be computed in linear time.
     Similarly, the Jacobian is upper triangular if $x_i$ only depends on $z_{\geq i}$.
     Next lecture: Designing invertible transformations!
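A quick numerical check (assuming NumPy) that the log-determinant of a triangular Jacobian equals the sum of the logs of its diagonal entries, which is the $O(n)$ shortcut referenced above; the random matrix stands in for an actual Jacobian.

```python
import numpy as np

# For a lower-triangular Jacobian, det(J) is the product of the diagonal entries: O(n).
n = 1000
J = np.tril(np.random.default_rng(0).normal(size=(n, n)))   # random lower-triangular matrix

log_det_full = np.linalg.slogdet(J)[1]              # O(n^3) general-purpose algorithm
log_det_diag = np.sum(np.log(np.abs(np.diag(J))))   # O(n), valid because J is triangular
print(np.allclose(log_det_full, log_det_diag))      # True
```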
