SLIDE 1

Higher Order Statistics

Matthias Hennig

School of Informatics, University of Edinburgh

March 1, 2019

Acknowledgements: Mark van Rossum and Chris Williams.

SLIDE 2

Outline

First, second and higher-order statistics
Generative models, recognition models
Sparse Coding
Independent Components Analysis

SLIDE 3

Sensory information is highly redundant

[Figure: Matthias Bethge]

SLIDE 4

and higher order correlations are relevant

[Figure: Matthias Bethge]

Note: the Fourier transform of the autocorrelation function equals the power spectral density (Wiener–Khinchin theorem).

SLIDE 5

Redundancy Reduction

(Barlow, 1961; Attneave, 1954)
Natural images are redundant in that there exist statistical dependencies amongst pixel values in space and time.
In order to make efficient use of resources, the visual system should reduce redundancy by removing statistical dependencies.

SLIDE 6

The visual system

[Figure from Matthias Bethge]

SLIDE 7

The visual system

[Figure from Matthias Bethge]

SLIDE 8

Natural Image Statistics and Efficient Coding

First-order statistics

Intensity/contrast histograms ⇒ e.g. histogram equalization

Second-order statistics

Autocorrelation function (1/f² power spectrum) ⇒ decorrelation/whitening

Higher-order statistics

Orientation, phase spectrum (systematically model higher orders)

Projection pursuit, sparse coding (find useful projections)
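
To make the decorrelation/whitening step concrete, here is a minimal numpy sketch (the helper `zca_whiten` and the random data standing in for natural image patches are illustrative, not part of the original slides); it removes all second-order structure, which is exactly what the later slides go beyond.

```python
import numpy as np

def zca_whiten(patches, eps=1e-5):
    """Decorrelate (ZCA-whiten) a set of flattened image patches.

    After the transform the patch covariance is approximately the
    identity, i.e. all second-order (pairwise) structure is removed.
    """
    X = patches - patches.mean(axis=0)           # zero-mean each pixel
    C = np.cov(X, rowvar=False)                  # pixel covariance matrix
    evals, evecs = np.linalg.eigh(C)             # eigendecomposition of C
    W = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T
    return X @ W                                 # whitened patches

# Random data standing in for 10,000 flattened 8x8 natural image patches:
patches = np.random.rand(10000, 64)
white = zca_whiten(patches)
print(np.allclose(np.cov(white, rowvar=False), np.eye(64), atol=0.1))  # True
```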

SLIDE 9

Image synthesis: First-order statistics

[Figure: Olshausen, 2005]

Log-normal distribution of intensities.

SLIDE 10

Image synthesis: Second-order statistics

[Figure: Olshausen, 2005]

Described by correlated Gaussian statistics or, equivalently, by the power spectrum.
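
A hedged sketch of this kind of synthesis: draw an image whose amplitude spectrum falls off as 1/f (hence a ~1/f² power spectrum) but whose phases are random; the function name and parameters are illustrative, not Olshausen's code.

```python
import numpy as np

def synth_second_order(n=128, seed=0):
    """Synthesize an image matching only second-order statistics:
    ~1/f amplitude (~1/f^2 power) spectrum with random phases."""
    rng = np.random.default_rng(seed)
    fx = np.fft.fftfreq(n)[:, None]
    fy = np.fft.fftfreq(n)[None, :]
    f = np.sqrt(fx**2 + fy**2)
    f[0, 0] = 1.0                                  # avoid division by zero at DC
    amplitude = 1.0 / f                            # 1/f amplitude spectrum
    phase = rng.uniform(0.0, 2.0 * np.pi, (n, n))  # random, uninformative phases
    img = np.real(np.fft.ifft2(amplitude * np.exp(1j * phase)))
    # Taking the real part is a shortcut (Hermitian symmetry is not enforced).
    return (img - img.mean()) / img.std()

image = synth_second_order()   # cloud-like: correct spectrum, no phase structure
```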

SLIDE 11

Higher-order statistics

[Figure: Olshausen, 2005]

SLIDE 12

Importance of phase information

[Hyvärinen et al., 2009]

SLIDE 13

Generative models, recognition models

(§10.1, Dayan and Abbott)
How is sensory information encoded to support higher-level tasks? The encoding has to be based on the statistical structure of sensory information.
Causal models: find the causes that give rise to the observed stimuli.
Generative models: reconstruct stimuli based on causes; the model can fill in missing information based on the statistics.
This allows the brain to generate appropriate actions (motor outputs) based on causes.
A stronger constraint than optimal encoding alone (although the encoding should still be optimal).

SLIDE 14

Generative models, recognition models

(§10.1, Dayan and Abbott)
Left: observations. Middle: poor model; two latent causes (prior distribution) but the wrong generating distribution given the causes. Right: good model.
In an image-processing context one would want, e.g., A to be cars and B to be faces. They would explain the image, and could generate images with an appropriate generating distribution.

SLIDE 15

Generative models, recognition models

Hidden (latent) variables h (causes) that explain visible variables u (e.g. an image).
Generative model: $p(u|G) = \sum_h p(u|h, G)\, p(h|G)$
Recognition model: $p(h|u, G) = \dfrac{p(u|h, G)\, p(h|G)}{p(u|G)}$
Match p(u|G) to the actual density p(u).
Maximize the log likelihood $L(G) = \langle \log p(u|G) \rangle_{p(u)}$.
Train the parameters G of the model using EM (expectation-maximization).
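
A minimal sketch of this framework using the simplest generative model from the next slide, a two-cause mixture of Gaussians: the E-step evaluates the recognition model p(h|u, G), the M-step re-fits the generative parameters G. The data and starting values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
u = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])  # observations

# Generative parameters G = (mixing priors, means, variances) of two causes h
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: recognition model p(h|u, G) proportional to p(u|h, G) p(h|G)
    lik = np.exp(-0.5 * (u[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    post = pi * lik
    post /= post.sum(axis=1, keepdims=True)
    # M-step: re-estimate the generative parameters from the posteriors
    Nk = post.sum(axis=0)
    pi = Nk / len(u)
    mu = (post * u[:, None]).sum(axis=0) / Nk
    var = (post * (u[:, None] - mu) ** 2).sum(axis=0) / Nk

print(pi.round(2), mu.round(2), var.round(2))  # roughly [0.5 0.5], [-2 3], [1 1]
```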

SLIDE 16

Examples of generative models

(§10.1, Dayan and Abbott)
Mixtures of Gaussians
Factor analysis, PCA
Sparse Coding
Independent Components Analysis

SLIDE 17

Sparse Coding

Area V1 is highly overcomplete: V1 : LGN ≈ 25:1 (in cat).
The firing rate distribution is typically exponential (i.e. sparse).
There is experimental evidence for sparse coding in insects, zebra finch, mouse, rabbit, rat, macaque monkey and human [Olshausen and Field, 2004].

Activity of a macaque IT cell in response to video images. [Figure: Dayan and Abbott, 2001]

SLIDE 18

Sparse Coding

Distributions that are close to zero most of the time but occasionally far from zero are called sparse.
Sparse distributions are more likely than Gaussians to generate values near zero, and also far from zero (heavy tails).

$$\mathrm{kurtosis} = \frac{\int p(x)\,(x - \bar{x})^4\, dx}{\left( \int p(x)\,(x - \bar{x})^2\, dx \right)^2} - 3$$

A Gaussian has kurtosis 0; positive kurtosis implies a sparse distribution (super-Gaussian, leptokurtotic).
Kurtosis is sensitive to outliers (i.e. it is not robust). See HHH §6.2 for other measures of sparsity.
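
A quick numpy check of the definition above (a sample estimate of the excess kurtosis); the distributions and sample sizes are arbitrary.

```python
import numpy as np

def excess_kurtosis(x):
    """Sample version of the slide's definition: fourth central moment
    divided by the squared variance, minus 3."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return (d ** 4).mean() / (d ** 2).mean() ** 2 - 3

rng = np.random.default_rng(0)
print(excess_kurtosis(rng.normal(size=100_000)))   # ~ 0    (Gaussian)
print(excess_kurtosis(rng.laplace(size=100_000)))  # ~ 3    (sparse, super-Gaussian)
print(excess_kurtosis(rng.uniform(size=100_000)))  # ~ -1.2 (sub-Gaussian)
```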

SLIDE 19

Skewed distributions

p(h) = exp(g(h))
exponential: g(h) = −|h|
Cauchy: g(h) = −log(1 + h²)
Gaussian: g(h) = −h²/2

[Figure: Dayan and Abbott, 2001]

SLIDE 20

The sparse coding model

Single-component model for an image: u = gh. Find g so that sparseness is maximal, while ⟨h⟩ = 0 and ⟨h²⟩ = 1.
Multiple components: u = Gh + n, where n is a noise term.
Minimize [Olshausen and Field, 1996]: E = [reconstruction error] − λ[sparseness]
Factorial prior: $p(h) = \prod_i p(h_i)$, sparse: $p(h_i) \propto \exp(g(h_i))$ (non-Gaussian)
Laplacian: $g(h) = -\alpha|h|$; Cauchy: $g(h) = -\log(\beta^2 + h^2)$
Goal: find a set of basis functions G such that the coefficients h are as sparse and statistically independent as possible.
See D and A pp 378–383, and HHH §13.1.1–13.1.4
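
A sketch of the resulting objective written as a function (the negative log posterior up to constants), assuming the Laplacian prior; `sigma` and `lam` are illustrative values, not those of the original paper.

```python
import numpy as np

def sparse_coding_energy(u, G, h, sigma=0.3, lam=1.0):
    """E = [reconstruction error] + lam * [sparseness penalty], i.e. the
    negative log posterior for Gaussian noise and a Laplacian prior."""
    reconstruction = 0.5 / sigma**2 * np.sum((u - G @ h) ** 2)
    sparseness = lam * np.sum(np.abs(h))   # -sum_a g(h_a) for g(h) = -alpha|h|
    return reconstruction + sparseness
```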

SLIDE 21

Recognition step

Suppose G is given. For a given image, what is h?
If g(h) is, e.g., the Cauchy prior, p(h|u, G) is difficult to compute exactly; the overcomplete model is not invertible.
$$p(h|u) = \frac{p(u|h)\, p(h)}{p(u)}$$
Olshausen and Field (1996) used a MAP approximation. As p(u) does not depend on h, we can find h by maximising
$$\log p(h|u) = \log p(u|h) + \log p(h)$$

SLIDE 22

Recognition step

We assume a sparse and independent prior p(h), so
$$\log p(h) = \sum_{a=1}^{N_h} g(h_a)$$
Assuming Gaussian noise n ∼ N(0, σ²I), p(u|h) is a Gaussian with mean Gh and variance σ², so
$$\log p(h|u, G) = -\frac{1}{2\sigma^2}\, |u - Gh|^2 + \sum_{a=1}^{N_h} g(h_a) + \text{const}$$

SLIDE 23

Recognition step

At the maximum (differentiate w.r.t. h):
$$\frac{1}{\sigma^2} \sum_b [u - G\hat{h}]_b\, G_{ba} + g'(\hat{h}_a) = 0$$
or, in vector form,
$$\frac{1}{\sigma^2}\, G^\top [u - G\hat{h}] + g'(\hat{h}) = 0$$

SLIDE 24

To solve this equation, follow the dynamics
$$\tau_h \frac{dh_a}{dt} = \frac{1}{\sigma^2} \sum_b [u - Gh]_b\, G_{ba} + g'(h_a)$$
Neural network interpretation (notation: v = h):

[Figure: Dayan and Abbott, 2001]

The dynamics performs gradient ascent on the log posterior: a combination of feed-forward excitation, lateral inhibition and relaxation of neural firing rates.
The process is only guaranteed to find a local (not global) maximum.
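
A minimal sketch of this recognition step as plain gradient ascent on the log posterior, assuming a Laplacian prior (so g'(h) = −α sign(h)); the dictionary, noise level, step size and iteration count are all illustrative.

```python
import numpy as np

def infer_h(u, G, sigma=0.3, alpha=1.0, eta=0.01, steps=500):
    """MAP recognition: discretized version of tau_h dh/dt = grad log p(h|u,G)."""
    h = np.zeros(G.shape[1])
    for _ in range(steps):
        grad = G.T @ (u - G @ h) / sigma**2 - alpha * np.sign(h)  # data term + prior
        h += eta * grad
    return h

# Toy usage: recover a sparse h from a noisy projection through a random dictionary.
rng = np.random.default_rng(0)
G = rng.normal(size=(64, 100)) / 8            # random overcomplete "basis"
h_true = np.zeros(100)
h_true[[3, 42]] = [2.0, -1.5]
u = G @ h_true + 0.05 * rng.normal(size=64)
h_hat = infer_h(u, G)                         # large mainly at indices 3 and 42
```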

SLIDE 25

Learning of the model

Now that we have ĥ, we can compute the log likelihood L(G) = log p(u|G).
Learning rule: ∆G ∝ ∂L/∂G. This is basically linear regression (mean-square error cost):
$$\Delta G = \epsilon\, (u - G\hat{h})\, \hat{h}^\top$$
Small values of h can be balanced by scaling up G. Hence impose a constraint on $\sum_b G_{ba}^2$ for each cause a, to encourage the variances of the $h_a$ to be approximately equal.
It is common to whiten the inputs before learning (so that ⟨u⟩ = 0 and ⟨uuᵀ⟩ = I), to force the network to find structure beyond second order.
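
A sketch of the outer learning loop, reusing the infer_h sketch from the previous slide's example; the patch source, learning rate and normalisation scheme are assumptions rather than Olshausen and Field's exact settings.

```python
import numpy as np

def learn_G(patches, n_causes=100, epsilon=0.01, n_iter=1000, seed=0):
    """Alternate MAP inference of h with the Hebbian-like update
    Delta G = epsilon * (u - G h) h^T, rescaling columns to keep
    sum_b G_ba^2 fixed for each cause a."""
    rng = np.random.default_rng(seed)
    G = rng.normal(size=(patches.shape[1], n_causes))
    G /= np.linalg.norm(G, axis=0)
    for _ in range(n_iter):
        u = patches[rng.integers(len(patches))]   # one (ideally whitened) patch
        h = infer_h(u, G)                         # recognition step from slide 24
        G += epsilon * np.outer(u - G @ h, h)     # Delta G = eps (u - G h) h^T
        G /= np.linalg.norm(G, axis=0)            # norm constraint on each cause
    return G
```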

SLIDE 26

[Figure: Dayan and Abbott (2001), after Olshausen and Field (1997)]

SLIDE 27

Projective Fields and Receptive Fields

The projective field for cause h_a is G_{ba} for all b. Note the resemblance to simple cells in V1.
Receptive fields: include the network interaction. The outputs of the network are sparser than the feed-forward input or the pixel values.
Comparison with physiology: spatial-frequency bandwidth, orientation bandwidth.

SLIDE 28

Overcomplete: 200 basis functions from 12 × 12 patches [Figure: Olshausen, 2005]

SLIDE 29

Gabor functions

Can be used to model the receptive fields: a sinusoid modulated by a Gaussian envelope,
$$\frac{1}{2\pi\sigma_x\sigma_y} \exp\!\left(-\frac{x^2}{2\sigma_x^2} - \frac{y^2}{2\sigma_y^2}\right) \cos(kx - \phi)$$
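
A small sketch evaluating this expression on a pixel grid; the extra `theta` rotation parameter is an addition beyond the slide, which writes only the unrotated case.

```python
import numpy as np

def gabor(size=21, sigma_x=3.0, sigma_y=5.0, k=0.8, phi=0.0, theta=0.0):
    """Gabor patch: cosine grating modulated by a Gaussian envelope."""
    r = np.arange(size) - size // 2
    X, Y = np.meshgrid(r, r)
    x = X * np.cos(theta) + Y * np.sin(theta)     # rotated coordinates
    y = -X * np.sin(theta) + Y * np.cos(theta)
    envelope = np.exp(-x**2 / (2 * sigma_x**2) - y**2 / (2 * sigma_y**2))
    envelope /= 2 * np.pi * sigma_x * sigma_y
    return envelope * np.cos(k * x - phi)

patch = gabor(theta=np.pi / 4)   # an oriented, localized, band-pass filter
```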

SLIDE 30

Image synthesis: sparse coding

[Figure: Olshausen, 2005]

SLIDE 31

Spatio-temporal sparse coding

(Olshausen, 2002)
$$u(t) = \sum_{m=1}^{M} \sum_{i=1}^{n_m} h^m_i\, g^m(t - \tau^m_i) + n(t)$$
G is now 3-dimensional, having time slices as well.
Goal: find a set of space-time basis functions for representing natural images such that the time-varying coefficients $\{h^m_i\}$ are as sparse and statistically independent as possible over both space and time.
200 bases, 12 × 12 × 7:

http://redwood.berkeley.edu/bruno/bfmovie/bfmovie.html

SLIDE 32

Sparse coding: limitations

Sparseness-enforcing non-linearity: the choice is arbitrary
Learning based on enforcing uncorrelated h is ad hoc
Unclear if p(h) is a proper prior distribution
Solution: a generative model which describes how the image was generated from a transformation of the latent variables.

SLIDE 33

ICA: Independent Components Analysis [Bell and Sejnowski, 1995]

Linear network with an output non-linearity: h = Wu, y_j = f(h_j).
The h_j are statistically independent random variables, drawn from a non-Gaussian distribution (as in sparse coding).
Find the weight matrix maximizing the information between u and y. With no noise (cf. Linsker): I(u, y) = H(y) − H(y|u) = H(y).
$$H(y) = -\langle \log p(y) \rangle_y = -\left\langle \log \frac{p(u)}{|\det J|} \right\rangle_u, \qquad J_{ji} = \frac{\partial y_j}{\partial u_i} = \frac{\partial h_j}{\partial u_i}\,\frac{\partial y_j}{\partial h_j} = W_{ji}\, f'(h_j)$$
(Under a change of variables, the density picks up a factor of the absolute value of the Jacobian determinant of the transformation, which ensures normalisation.)

SLIDE 34

ICA: Independent Components Analysis [Bell and Sejnowski, 1995]

Mutual information: $H(y) = \log |\det W| + \sum_j \log f'(h_j) + \text{const}$
Maximize the entropy by producing a uniform output distribution (histogram equalization): p(h_i) = f'(h_i).
Choose f so that it encourages a sparse p(h), e.g. f(h) = 1/(1 + e^{−h}).
For f(h) = 1/(1 + e^{−h}):
$$\frac{dH(y)}{dW} = (W^\top)^{-1} + (1 - 2y)\, u^\top$$
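
A sketch of a single infomax update with the logistic non-linearity, transcribing the gradient above; bias terms, batching and learning-rate schedules are omitted.

```python
import numpy as np

def infomax_step(W, u, epsilon=0.01):
    """One Bell-Sejnowski update for f(h) = 1/(1 + exp(-h))."""
    h = W @ u
    y = 1.0 / (1.0 + np.exp(-h))
    dW = np.linalg.inv(W.T) + np.outer(1.0 - 2.0 * y, u)   # dH(y)/dW from the slide
    return W + epsilon * dW
```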

SLIDE 35

ICA: how does it differ from PCA?

The (symmetric) covariance matrix only constrains n(n − 1)/2 components. Hence in a larger model (e.g. with n² coefficients) the coefficients are not fully constrained.
The more random variables are added together, the more Gaussian the sum becomes (central limit theorem). So we look for the most non-Gaussian projection. Often, but not always, this is the sparsest projection.
ICA can be used to de-mix (e.g. blind source separation of sounds).
Left: whitened by PCA; middle: 2 mixed independent components; right: 2 independent components.
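
A hedged de-mixing demo using scikit-learn's FastICA (a different ICA algorithm from the infomax/maximum-likelihood ones on these slides, but built on the same idea of finding maximally non-Gaussian, independent projections); the sources and mixing matrix are made up.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(3 * t), np.sign(np.sin(5 * t))]   # two independent sources
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                         # unknown mixing matrix
X = S @ A.T                                        # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)   # recovered sources, up to permutation, sign and scale
```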

SLIDE 36

ICA as generative model

Simplify the sparse coding network: let G be square, with u = Gh and W = G⁻¹.
$$p(u) = |\det W| \prod_{a=1}^{N_h} p_h([Wu]_a)$$
Log likelihood:
$$L(W) = \left\langle \sum_a g([Wu]_a) \right\rangle + \log |\det W| + \text{const}$$

See Dayan and Abbott pp 384-386 [also HHH ch 7]

SLIDE 37

Stochastic gradient ascent gives the update rule
$$\Delta W_{ab} = \epsilon \left( [W^{-1}]_{ba} + g'(h_a)\, u_b \right),$$
using $\partial \log |\det W| / \partial W_{ab} = [W^{-1}]_{ba}$.
Natural gradient update: multiply by $W^\top W$ (which is positive definite) to get
$$\Delta W_{ab} = \epsilon \left( W_{ab} + g'(h_a)\, [h^\top W]_b \right)$$
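
A sketch of one natural-gradient update, assuming a Laplacian prior so that g'(h) = −sign(h); the learning rate and single-sample update are illustrative.

```python
import numpy as np

def natural_gradient_ica_step(W, u, epsilon=0.01):
    """Delta W = epsilon * (I + g'(h) h^T) W, which matches the
    component-wise rule Delta W_ab = eps (W_ab + g'(h_a) [h^T W]_b)."""
    h = W @ u
    gprime = -np.sign(h)                        # Laplacian prior: g(h) = -|h|
    return W + epsilon * (np.eye(len(h)) + np.outer(gprime, h)) @ W
```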

SLIDE 38

ICA features

[Hyvärinen et al., 2009]

SLIDE 39

ICA synthesised images

[Hyvärinen et al., 2009]

SLIDE 40

The visual system

[Figure from Matthias Bethge]

SLIDE 41

Are Gabor patches what we want?

Dayan and Abbott (2001), p. 382, say: In a generative model, projective fields are associated with the causes underlying the visual images presented during training. The fact that the causes extracted by the sparse coding model resemble Gabor patches within the visual field is somewhat strange from this perspective. It is difficult to conceive of images arising from such low-level causes, instead of causes couched in terms of objects within images, for example. From the perspective of good representation, causes more like objects and less like Gabor patches would be more useful. To put this another way, although the prior distribution over causes biased them toward mutual independence, the causes produced by the recognition model in response to natural images are not actually independent...

SLIDE 42

This is due to the structure in images arising from more complex objects than bars and gratings. It is unlikely that this higher-order structure can be extracted by a model with only one set of causes. It is more natural to think of causes in a hierarchical manner, with causes at a higher level accounting for structure in the causes at a lower level. The multiple representations in areas along the visual pathway suggest such a hierarchical scheme, but the corresponding models are still in the rudimentary stages of development.

SLIDE 43

Summary

Both ICA and sparse coding lead to similar RFs, and to sparse output for natural images.
Both give a good description of V1 simple cell RFs, although not perfectly [van Hateren and van der Schaaf, 1998] (and so do many other algorithms).
Different objectives: ICA maximizes information; sparse coding seeks a sparse reconstruction.
What about deeper layers? See [Hyvärinen et al., 2009] for a discussion of these points.

SLIDE 44

References I

Bell, A. J. and Sejnowski, T. J. (1995). An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159.

Hyvärinen, A., Hurri, J., and Hoyer, P. (2009). Natural Image Statistics. Springer.

Olshausen, B. A. and Field, D. J. (1996). Emergence of simple cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609.

Olshausen, B. A. and Field, D. J. (2004). Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14(4):481–487.
