

  1. Gatsby Theoretical Neuroscience Lectures: Non-Gaussian statistics and natural images, Parts III-IV. Aapo Hyvärinen, Gatsby Unit, University College London

  2. Part III: Estimation of unnormalized models ◮ Often, in natural image statistics, the probabilistic models are unnormalized ◮ This is a major computational problem ◮ Here, we consider new methods to tackle this problem ◮ Later, we see applications to natural image statistics

  3. Unnormalized models: Problem definition ◮ We want to estimate a parametric model of a multivariate random vector $\mathbf{x} \in \mathbb{R}^n$ ◮ The density function $f_{\mathrm{norm}}$ is known only up to a multiplicative constant: $f_{\mathrm{norm}}(\mathbf{x};\theta) = \frac{1}{Z(\theta)}\, p_{\mathrm{un}}(\mathbf{x};\theta)$, where $Z(\theta) = \int_{\xi \in \mathbb{R}^n} p_{\mathrm{un}}(\xi;\theta)\, d\xi$ ◮ The functional form of $p_{\mathrm{un}}$ is known (and can be easily computed) ◮ The partition function $Z$ cannot be computed in reasonable computing time (numerical integration) ◮ Here: how to estimate the model while avoiding numerical integration?

  4. Examples of unnormalized models related to ICA ◮ ICA with an overcomplete basis, simply by $f_{\mathrm{norm}}(\mathbf{x};\mathbf{W}) = \frac{1}{Z(\mathbf{W})} \exp\left[\sum_i G(\mathbf{w}_i^T \mathbf{x})\right]$ (1) ◮ Estimation of the second layer in ISA and topographic ICA: $f_{\mathrm{norm}}(\mathbf{x};\mathbf{W},\mathbf{M}) = \frac{1}{Z(\mathbf{W},\mathbf{M})} \exp\left[\sum_i G\left(\sum_j m_{ij} (\mathbf{w}_j^T \mathbf{x})^2\right)\right]$ (2) ◮ Non-Gaussian Markov random fields ◮ ... many more
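To make (1) concrete, here is a minimal Python sketch of how the unnormalized log-density of the overcomplete ICA model might be evaluated; the log-cosh nonlinearity and the random overcomplete filter matrix are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def logcosh(u):
    # A smooth sparsity-inducing nonlinearity G(u) = -log cosh(u), a common ICA choice (assumed here).
    return -np.log(np.cosh(u))

def log_p_un(x, W, G=logcosh):
    """Unnormalized log-density of the overcomplete ICA model (1):
    log p_un(x; W) = sum_i G(w_i^T x).  The partition function Z(W) is omitted."""
    return np.sum(G(W @ x))

# Toy usage: an overcomplete basis of 6 filters in a 4-dimensional space.
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))   # rows are the filters w_i
x = rng.standard_normal(4)
print(log_p_un(x, W))
```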

  5. Previous solutions ◮ Monte Carlo methods: consistent estimators (convergence to the true parameter values as sample size → ∞), but computation is very slow (I think) ◮ Various approximations, e.g. variational methods: computation is often fast, but consistency is not known, or proven inconsistent ◮ Pseudo-likelihood and contrastive divergence: presumably consistent, but computations are slow with continuous-valued variables (need 1-D integration at every step, or sophisticated MCMC methods)

  6. Content of this talk ◮ We have proposed two methods for estimation of unnormalized models ◮ Both methods avoid numerical integration ◮ First: Score matching (Hyvärinen, JMLR, 2005): take the derivative of the model log-density w.r.t. x, so the partition function disappears; fit this derivative to the same derivative of the data density; easy to compute due to a partial integration trick; closed-form solution for exponential families ◮ Second: Noise-contrastive estimation (Gutmann and Hyvärinen, JMLR, 2012): learn to distinguish data from artificially generated noise; logistic regression learns the ratio of the pdfs of data and noise; for a known noise pdf, we have in fact learnt the data pdf; consistent even in the unnormalized case

  7. Definition of “score function” (in this talk) ◮ Define the model score function $\psi: \mathbb{R}^n \to \mathbb{R}^n$ as $\psi(\xi;\theta) = \left(\frac{\partial \log f_{\mathrm{norm}}(\xi;\theta)}{\partial \xi_1}, \ldots, \frac{\partial \log f_{\mathrm{norm}}(\xi;\theta)}{\partial \xi_n}\right)^T = \nabla_\xi \log f_{\mathrm{norm}}(\xi;\theta)$, where $f_{\mathrm{norm}}$ is the normalized model density ◮ Similarly, define the data score function as $\psi_x(\xi) = \nabla_\xi \log p_x(\xi)$, where the observed data is assumed to follow $p_x(\cdot)$ ◮ In conventional terminology: the Fisher score with respect to a hypothetical location parameter, i.e. of $f_{\mathrm{norm}}(\mathbf{x} - \theta)$, evaluated at $\theta = \mathbf{0}$
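As a sketch of what the model score function looks like in practice, the following computes $\psi(\mathbf{x};\mathbf{W}) = \nabla_\mathbf{x} \log p_{\mathrm{un}}(\mathbf{x};\mathbf{W})$ analytically for the overcomplete ICA model (1), again under the assumption $G(u) = -\log\cosh(u)$ so that $G'(u) = -\tanh(u)$; since $Z(\mathbf{W})$ does not depend on $\mathbf{x}$, it contributes nothing to the gradient.

```python
import numpy as np

def model_score(x, W):
    """Model score psi(x; W) = grad_x sum_i G(w_i^T x) for the overcomplete
    ICA model (1), with the assumed G(u) = -log cosh(u), i.e. G'(u) = -tanh(u).
    The term grad_x log Z(W) is identically zero, so Z never appears."""
    return W.T @ (-np.tanh(W @ x))

# Toy usage with a random overcomplete basis.
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
x = rng.standard_normal(4)
print(model_score(x, W))   # a vector in R^4
```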

  8. Score matching: definition of the objective function ◮ Estimate by minimizing a distance between the model score function $\psi(\cdot;\theta)$ and the score function of the observed data $\psi_x(\cdot)$: $J(\theta) = \frac{1}{2} \int_{\xi \in \mathbb{R}^n} p_x(\xi)\, \|\psi(\xi;\theta) - \psi_x(\xi)\|^2\, d\xi$ (3), with $\hat\theta = \arg\min_\theta J(\theta)$ ◮ This gives a consistent estimator almost by construction ◮ $\psi(\xi;\theta)$ does not depend on $Z(\theta)$ because $\psi(\xi;\theta) = \nabla_\xi \log p_{\mathrm{un}}(\xi;\theta) - \nabla_\xi \log Z(\theta) = \nabla_\xi \log p_{\mathrm{un}}(\xi;\theta) - 0$ (4) ◮ No need to compute the normalization constant $Z$; the non-normalized pdf $p_{\mathrm{un}}$ is enough ◮ Computation of $J$ is quite simple due to the theorem below

  9. A computational trick: the central theorem of score matching ◮ The objective function contains the score function of the data distribution, $\psi_x(\cdot)$. How to compute it? ◮ In fact, there is no need to compute it, because of the following. Theorem: Assume some regularity conditions, and smooth densities. Then, the score matching objective function $J$ can be expressed as $J(\theta) = \int_{\xi \in \mathbb{R}^n} p_x(\xi) \sum_{i=1}^{n} \left[\partial_i \psi_i(\xi;\theta) + \frac{1}{2}\psi_i(\xi;\theta)^2\right] d\xi + \text{const.}$ (5), where the constant does not depend on $\theta$, and $\psi_i(\xi;\theta) = \frac{\partial \log p_{\mathrm{un}}(\xi;\theta)}{\partial \xi_i}$, $\partial_i \psi_i(\xi;\theta) = \frac{\partial^2 \log p_{\mathrm{un}}(\xi;\theta)}{\partial \xi_i^2}$

  10. Simple explanation of the score matching trick ◮ Consider the objective function $J(\theta)$: expanding the squared norm in (3) gives $\frac{1}{2}\int p_x(\xi)\, \|\psi(\xi;\theta)\|^2\, d\xi - \int p_x(\xi)\, \psi_x(\xi)^T \psi(\xi;\theta)\, d\xi + \text{const.}$ ◮ The constant does not depend on $\theta$, and the first term is easy to compute ◮ The trick is to use partial integration on the second term. In one dimension: $\int p_x(x)\, (\log p_x)'(x)\, \psi(x;\theta)\, dx = \int p_x(x)\, \frac{p_x'(x)}{p_x(x)}\, \psi(x;\theta)\, dx = \int p_x'(x)\, \psi(x;\theta)\, dx = 0 - \int p_x(x)\, \psi'(x;\theta)\, dx$ ◮ This is why the score function of the data distribution $p_x$ disappears!
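A quick Monte Carlo sanity check of the one-dimensional partial integration identity, under the illustrative assumption that the data is standard normal (so $\psi_x(x) = -x$) and the model score is $\psi(x;\theta) = -\theta x$; both sample averages should be close to $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # data score of N(0,1) is psi_x(x) = -x
theta = 2.5

psi = -theta * x                     # assumed model score psi(x; theta) = -theta * x
dpsi = -theta * np.ones_like(x)      # its derivative psi'(x; theta) = -theta

lhs = np.mean((-x) * psi)   # sample version of E[ psi_x(x) psi(x; theta) ]
rhs = -np.mean(dpsi)        # sample version of -E[ psi'(x; theta) ]
print(lhs, rhs)             # both close to theta = 2.5
```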

  11. Final method of score matching ◮ Replace the integration over the data density $p_x(\cdot)$ by a sample average ◮ Given $T$ observations $\mathbf{x}(1), \ldots, \mathbf{x}(T)$, minimize $\tilde{J}(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n} \left[\partial_i \psi_i(\mathbf{x}(t);\theta) + \frac{1}{2}\psi_i(\mathbf{x}(t);\theta)^2\right]$ (6), where $\psi_i$ is a partial derivative of the non-normalized model log-density $\log p_{\mathrm{un}}$, and $\partial_i \psi_i$ a second partial derivative ◮ Only requires evaluating some derivatives of the non-normalized log-density $p_{\mathrm{un}}$, which are simple to compute (by assumption) ◮ Thus: a new, computationally simple, and statistically consistent method for parameter estimation
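A minimal sketch of minimizing the empirical objective (6) for a toy one-dimensional model, $\log p_{\mathrm{un}}(x;\theta) = -\frac{1}{2}\theta x^2$ (a zero-mean Gaussian with precision $\theta$, normalizer dropped); the choice of model and the use of SciPy's scalar minimizer are assumptions for illustration. The estimate should be close to $1/\widehat{\mathrm{var}}(x)$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sm_objective(theta, x):
    """Empirical score matching objective (6) for the assumed 1-D model
    log p_un(x; theta) = -0.5 * theta * x**2, so that
    psi(x; theta) = -theta * x  and  d psi / dx = -theta."""
    psi = -theta * x
    dpsi = -theta
    return np.mean(dpsi + 0.5 * psi**2)

rng = np.random.default_rng(0)
x = rng.normal(scale=2.0, size=10_000)   # true precision = 1/4
res = minimize_scalar(sm_objective, args=(x,), bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1.0 / np.var(x))            # score matching estimate vs. 1 / sample variance
```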

  12. Closed-form solution in the exponential family ◮ Assume the pdf can be expressed in the form $\log f_{\mathrm{norm}}(\xi;\theta) = \sum_{k=1}^{m} \theta_k F_k(\xi) - \log Z(\theta)$ (7) ◮ Define matrices of partial derivatives: $K_{ki}(\xi) = \frac{\partial F_k(\xi)}{\partial \xi_i}$ and $H_{ki}(\xi) = \frac{\partial^2 F_k(\xi)}{\partial \xi_i^2}$ (8) ◮ Then, the score matching estimator is given by $\hat\theta = -\left(\hat{E}\{K(\mathbf{x})K(\mathbf{x})^T\}\right)^{-1} \left(\sum_i \hat{E}\{\mathbf{h}_i(\mathbf{x})\}\right)$ (9), where $\hat{E}$ denotes the sample average, and the vector $\mathbf{h}_i$ is the $i$-th column of the matrix $H$.
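A sketch of the closed-form estimator (9), applied to a hypothetical one-dimensional family with features $F_1(\xi) = -\xi^2/2$ and $F_2(\xi) = -\xi^4/4$ (an assumption chosen only as a sanity check, not from the slides); for Gaussian data the estimate should approach (1/variance, 0) by consistency.

```python
import numpy as np

def score_matching_expfam(X, K_fun, H_fun):
    """Closed-form score matching estimator (9) for an exponential family.
    K_fun(xi) returns the m x n matrix K_ki = dF_k/dxi_i,
    H_fun(xi) returns the m x n matrix H_ki = d^2 F_k/dxi_i^2."""
    KK = np.mean([K_fun(x) @ K_fun(x).T for x in X], axis=0)   # E_hat{K K^T}, m x m
    h = np.mean([H_fun(x).sum(axis=1) for x in X], axis=0)     # sum_i E_hat{h_i}, length m
    return -np.linalg.solve(KK, h)

# Assumed 1-D family: F_1 = -xi^2/2, F_2 = -xi^4/4 (n = 1, m = 2).
K_fun = lambda xi: np.array([[-xi], [-xi**3]])
H_fun = lambda xi: np.array([[-1.0], [-3.0 * xi**2]])

rng = np.random.default_rng(0)
X = rng.standard_normal(50_000)
print(score_matching_expfam(X, K_fun, H_fun))   # roughly [1.0, 0.0] for N(0,1) data
```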

  13. ICA with overcomplete basis [figure slide; images not reproduced in this transcript]

  14. Second method: Noise-contrastive estimation (NCE) ◮ Train a nonlinear classifier to discriminate observed data from some artificial noise ◮ To be successful, the classifier must “discover structure” in the data ◮ For example, compare natural images with Gaussian noise [image panels: natural images vs. Gaussian noise]

  15. Definition of the classifier in NCE ◮ Observed data set $X = (\mathbf{x}(1), \ldots, \mathbf{x}(T))$ with unknown pdf $p_x$ ◮ Generate “noise” $Y = (\mathbf{y}(1), \ldots, \mathbf{y}(T))$ with known pdf $p_y$ ◮ Define a nonlinear function (e.g. a multilayer perceptron) $g(\mathbf{u};\theta)$, which models the data log-density $\log p_x(\mathbf{u})$ ◮ We use logistic regression with the nonlinear function $G(\mathbf{u};\theta) = g(\mathbf{u};\theta) - \log p_y(\mathbf{u})$ (10) ◮ Well-known developments lead to the objective (likelihood) $J(\theta) = \sum_t \log[h(\mathbf{x}(t);\theta)] + \log[1 - h(\mathbf{y}(t);\theta)]$, where $h(\mathbf{u};\theta) = \frac{1}{1 + \exp[-G(\mathbf{u};\theta)]}$ (11)
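A minimal sketch of maximizing the NCE objective (11) for a toy one-dimensional model $g(u;\theta) = -\frac{1}{2}\theta_1 u^2 + \theta_2$, where $\theta_2$ plays the role of a freely learnt negative log partition function and the noise is standard normal; the model, noise choice, and optimizer are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def nce_loss(theta, x, y):
    """Negative of the NCE objective (11) for the assumed 1-D model
    g(u; theta) = -0.5*theta[0]*u**2 + theta[1]; theta[1] acts as the
    (learnt) negative log partition function.  Noise pdf p_y is standard normal."""
    def G(u):
        return (-0.5 * theta[0] * u**2 + theta[1]) - norm.logpdf(u)   # G = g - log p_y
    log_h_x = -np.logaddexp(0.0, -G(x))    # log sigmoid(G(x))  for data points
    log_1mh_y = -np.logaddexp(0.0, G(y))   # log(1 - sigmoid(G(y)))  for noise points
    return -(np.sum(log_h_x) + np.sum(log_1mh_y))

rng = np.random.default_rng(0)
x = rng.normal(scale=2.0, size=10_000)   # data: N(0, 4), true precision 0.25
y = rng.standard_normal(10_000)          # noise: N(0, 1), known pdf
res = minimize(nce_loss, x0=np.array([1.0, 0.0]), args=(x, y))
theta1, theta2 = res.x
print(theta1, theta2, 0.5 * np.log(theta1 / (2 * np.pi)))
# theta1 is roughly 0.25 and theta2 is roughly -log Z(theta1)
```

Note that $\theta_2$ approaches $-\log Z(\theta_1)$ even though no integral is ever computed, which is the point of NCE for unnormalized models.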

  16. What does the classifying system do in NCE? ◮ Theorem: Assume our parametric model $g(\mathbf{u};\theta)$ (e.g. an MLP) can approximate any function. Then, the maximum of the classification objective is attained when $g(\mathbf{u};\theta) = \log p_x(\mathbf{u})$ (12), where $p_x(\mathbf{u})$ is the pdf of the observed data ◮ Corollary: If the data is generated according to the model, i.e. $\log p_x(\mathbf{u}) = g(\mathbf{u};\theta^*)$, we have a statistically consistent estimator ◮ Supervised learning thus leads to unsupervised estimation of a probabilistic model given by the log-density $g(\mathbf{u};\theta)$
