Information Theory
Matthias Hennig, School of Informatics, University of Edinburgh
February 7, 2019
Acknowledgements: Mark van Rossum and Chris Williams.

Why information theory?
Understanding the neural code: encoding and decoding.
So far we imposed coding schemes, such as a linear kernel or a GLM, and possibly lost information in doing so.
Instead, use information theory:
  No need to impose an encoding or decoding scheme (non-parametric).
  Particularly important for 1) spike-timing codes and 2) higher areas.
  Estimates how much information is present in a recorded signal.
Caveats:
  The decoding process is ignored (the estimate is an upper bound only).
  Requires more data, and biases are tricky.

Overview
Entropy, Mutual Information
Entropy Maximization for a Single Neuron
Maximizing Mutual Information
Estimating information
Reading: Dayan and Abbott ch. 4, Rieke

Definitions
For an event $x$ with probability $P(x)$, the quantity $h(x) = -\log P(x)$ is called the 'surprise' or 'information'.
It measures the information gained when observing $x$.
It is additive for independent events.
Often $\log_2$ is used, and the unit is then bits ($\log_e$ gives nats).

Surprise [figure]

Definitions
The entropy of a quantity is the average surprise:
$$H(X) = -\sum_x P(x) \log_2 P(x)$$
Properties:
  Continuous, non-negative, $H(1) = 0$.
  If $p_i = \frac{1}{n}$, it increases monotonically with $n$: $H = \log_2 n$.
  Parallel independent events add.
[Shannon and Weaver, 1949, Cover and Thomas, 1991, Rieke et al., 1996]
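The properties listed above can be checked numerically. The following is a minimal sketch; the function name `entropy_bits` and the example distributions are illustrative choices, not part of the lecture.

```python
import math

def entropy_bits(probs):
    """Shannon entropy H = -sum_x P(x) log2 P(x), in bits (terms with P = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_certain = entropy_bits([1.0])            # a certain outcome: H(1) = 0
h_uniform8 = entropy_bits([1.0 / 8] * 8)   # uniform over n = 8 outcomes: log2(8) bits
h_biased = entropy_bits([0.9, 0.1])        # a biased coin carries less than 1 bit

# Independent events add: two independent fair coins carry 1 + 1 bits.
fair = [0.5, 0.5]
joint = [a * b for a in fair for b in fair]
h_joint = entropy_bits(joint)

print(h_certain, h_uniform8, h_biased, h_joint)
```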

Entropy
Discrete variable:
$$H(R) = -\sum_r p(r) \log_2 p(r)$$
Continuous variable at resolution $\Delta r$:
$$H(R) = -\sum_r p(r)\Delta r \log_2\left(p(r)\Delta r\right) = -\sum_r p(r)\Delta r \log_2 p(r) - \log_2 \Delta r$$
Letting $\Delta r \to 0$ we have
$$\lim_{\Delta r \to 0} \left[H + \log_2 \Delta r\right] = -\int p(r) \log_2 p(r)\, dr$$
(also called differential entropy).

Joint, Conditional entropy
Joint entropy:
$$H(S,R) = -\sum_{r,s} P(s,r) \log_2 P(s,r)$$
Conditional entropy:
$$H(S|R) = \sum_r P(R=r)\, H(S|R=r) = -\sum_r P(r) \sum_s P(s|r) \log_2 P(s|r) = H(S,R) - H(R)$$
If $S$, $R$ are independent:
$$H(S,R) = H(S) + H(R)$$

Mutual information
Mutual information:
$$I_m(R;S) = \sum_{r,s} p(r,s) \log_2 \frac{p(r,s)}{p(r)\,p(s)} = H(R) - H(R|S) = H(S) - H(S|R)$$
Measures the reduction in uncertainty of $R$ by knowing $S$ (or vice versa).
$H(R|S)$ is called the noise entropy, the part of the response not explained by the stimulus.
$I_m(R;S) \ge 0$
The continuous version is the difference of two entropies, so the $\Delta r$ divergence cancels.
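As an illustration, the sketch below computes the mutual information of a small joint table both from the definition and as $H(R) - H(R|S)$. The joint table (a binary response that matches a binary stimulus 90% of the time) is a hypothetical example, and the helper names are illustrative.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I_m(R;S) in bits from a table joint[r][s] = p(r, s)."""
    p_r = [sum(row) for row in joint]
    p_s = [sum(col) for col in zip(*joint)]
    info = 0.0
    for r, row in enumerate(joint):
        for s, p_rs in enumerate(row):
            if p_rs > 0:
                info += p_rs * math.log2(p_rs / (p_r[r] * p_s[s]))
    return info

# Hypothetical joint distribution p(r, s): rows index r, columns index s.
joint = [[0.45, 0.05],
         [0.05, 0.45]]

p_r = [sum(row) for row in joint]
p_s = [sum(col) for col in zip(*joint)]
h_r = entropy_bits(p_r)
h_rs = entropy_bits([p for row in joint for p in row])
noise_entropy = h_rs - entropy_bits(p_s)      # H(R|S) = H(R,S) - H(S)

i_direct = mutual_information(joint)          # from the definition
i_from_entropies = h_r - noise_entropy        # H(R) - H(R|S)
i_independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])

print(i_direct, i_from_entropies, i_independent)
```

Both routes give the same value, and an independent joint table gives zero, as expected from the definition.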

Relationships between information measures [figure]

Coding channels [figure]

Coding channels
Can we reconstruct the stimulus? We need an en/decoding model:
$$P(s|r) = \frac{P(r|s)\,P(s)}{P(r)}$$
How much information is conveyed? This can be addressed non-parametrically:
$$I_m(S;R) = H(S) - H(S|R) = H(R) - H(R|S)$$

Kullback-Leibler divergence
The KL-divergence measures the distance between two probability distributions:
$$D_{KL}(P\|Q) = \int P(x) \log_2 \frac{P(x)}{Q(x)}\, dx$$
or, for discrete distributions,
$$D_{KL}(P\|Q) \equiv \sum_i P_i \log_2 \frac{P_i}{Q_i}$$
Not symmetric (the Jensen-Shannon divergence is the symmetrised form).
$I_m(R;S) = D_{KL}\left(p(r,s)\,\|\,p(r)p(s)\right)$, hence it measures the KLD to the independent model.
Often used as a probabilistic cost function: $D_{KL}(\text{data}\,\|\,\text{model})$.
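A small numerical sketch of the discrete KL divergence, showing the asymmetry and the link to mutual information. The distributions are hypothetical, and `kl_bits` is an illustrative helper that assumes $Q_i > 0$ wherever $P_i > 0$.

```python
import math

def kl_bits(p, q):
    """D_KL(P||Q) in bits for discrete distributions (assumes q_i > 0 where p_i > 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl_bits(p, q)      # D_KL(P||Q)
d_qp = kl_bits(q, p)      # D_KL(Q||P), generally different
d_pp = kl_bits(p, p)      # zero for identical distributions

# Mutual information as the KL divergence from the joint to the independent model.
joint = [0.45, 0.05, 0.05, 0.45]   # hypothetical p(r, s), flattened
independent = [0.25] * 4           # p(r) p(s) for uniform marginals
mi = kl_bits(joint, independent)

print(d_pq, d_qp, d_pp, mi)
```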

Mutual info between jointly Gaussian variables
$$I(Y_1;Y_2) = \int\!\!\int P(y_1,y_2) \log_2 \frac{P(y_1,y_2)}{P(y_1)\,P(y_2)}\, dy_1\, dy_2 = -\frac{1}{2} \log_2\left(1 - \rho^2\right)$$
$\rho$ is the (Pearson-r) correlation coefficient.
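One way to see where this result comes from is to write the mutual information as a difference of differential entropies. The sketch below assumes marginal variances $\sigma_1^2$, $\sigma_2^2$ (symbols introduced here for illustration) and uses the standard entropy of a multivariate Gaussian with covariance $\Sigma$.

```latex
\begin{align}
I(Y_1;Y_2) &= H(Y_1) + H(Y_2) - H(Y_1,Y_2) \\
H(Y_i) &= \tfrac{1}{2}\log_2\left(2\pi e\,\sigma_i^2\right) \\
H(Y_1,Y_2) &= \tfrac{1}{2}\log_2\left((2\pi e)^2 \det\Sigma\right),
  \qquad \det\Sigma = \sigma_1^2\sigma_2^2\left(1-\rho^2\right) \\
I(Y_1;Y_2) &= \tfrac{1}{2}\log_2\frac{\sigma_1^2\sigma_2^2}{\sigma_1^2\sigma_2^2\left(1-\rho^2\right)}
  = -\tfrac{1}{2}\log_2\left(1-\rho^2\right)
\end{align}
```

The variances cancel, so the information depends only on $\rho$: it is zero for $\rho = 0$ and diverges as $|\rho| \to 1$.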

Populations of Neurons
Given
$$H(\mathbf{R}) = -\int p(\mathbf{r}) \log_2 p(\mathbf{r})\, d\mathbf{r} - N \log_2 \Delta r$$
and
$$H(R_i) = -\int p(r_i) \log_2 p(r_i)\, dr_i - \log_2 \Delta r$$
we have
$$H(\mathbf{R}) \le \sum_i H(R_i)$$
(proof: consider the KL divergence between $p(\mathbf{r})$ and $\prod_i p(r_i)$).

Mutual information in populations of Neurons
Redundancy can be defined as (compare to above)
$$R = \sum_{i=1}^{n_r} I(r_i; s) - I(\mathbf{r}; s)$$
Some codes have $R > 0$ (redundant code), others $R < 0$ (synergistic code).
Example of a synergistic code: $P(r_1, r_2, s)$ with $P(0,0,1) = P(0,1,0) = P(1,0,0) = P(1,1,1) = \frac{1}{4}$ and all other probabilities zero.
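The synergistic example can be verified directly: each neuron alone carries no information about $s$, but the pair determines it completely. The sketch below uses illustrative helper names (`mutual_information`, `marginal`) that are not from the lecture.

```python
import math
from collections import defaultdict

# Joint distribution P(r1, r2, s) from the example; all other entries are zero.
joint = {(0, 0, 1): 0.25, (0, 1, 0): 0.25, (1, 0, 0): 0.25, (1, 1, 1): 0.25}

def mutual_information(pairs):
    """I(X;Y) in bits from a dict {(x, y): p(x, y)}."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in pairs.items():
        p_x[x] += p
        p_y[y] += p
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in pairs.items() if p > 0)

def marginal(joint, response):
    """Collapse P(r1, r2, s) onto the pair (response(r1, r2), s)."""
    out = defaultdict(float)
    for (r1, r2, s), p in joint.items():
        out[(response(r1, r2), s)] += p
    return dict(out)

i_r1 = mutual_information(marginal(joint, lambda r1, r2: r1))          # I(r1; s)
i_r2 = mutual_information(marginal(joint, lambda r1, r2: r2))          # I(r2; s)
i_pop = mutual_information(marginal(joint, lambda r1, r2: (r1, r2)))   # I(r; s)
redundancy = i_r1 + i_r2 - i_pop

print(i_r1, i_r2, i_pop, redundancy)
```

Here $R = 0 + 0 - 1 = -1$ bit, so the code is synergistic.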

Entropy Maximization for a Single Neuron
$$I_m(R;S) = H(R) - H(R|S)$$
If the noise entropy $H(R|S)$ is independent of the transformation $S \to R$, we can maximize mutual information by maximizing $H(R)$ under given constraints.
Possible constraint: the response $r$ is bounded, $0 < r < r_{max}$. $H(R)$ is maximal if $p(r) \sim U(0, r_{max})$ ($U$ is the uniform distribution).
If the average firing rate is limited and $0 < r < \infty$: the exponential distribution is optimal, $p(x) = \frac{1}{\bar{x}} \exp(-x/\bar{x})$, with $H = \log_2(e\bar{x})$.
If the variance is fixed and $-\infty < r < \infty$: the Gaussian distribution is optimal, with $H = \frac{1}{2} \log_2(2\pi e \sigma^2)$.
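As a check on the exponential case, the differential entropy can be worked out in a few lines. This is a sketch using the differential entropy (the $\log_2 \Delta r$ term is dropped), with $\langle x \rangle = \bar{x}$ the mean of the distribution.

```latex
\begin{align}
H &= -\int_0^\infty p(x)\log_2 p(x)\,dx,
  \qquad p(x) = \frac{1}{\bar{x}}\exp\left(-x/\bar{x}\right) \\
  &= -\int_0^\infty p(x)\left[-\log_2\bar{x} - \frac{x}{\bar{x}}\log_2 e\right]dx \\
  &= \log_2\bar{x} + \frac{\langle x\rangle}{\bar{x}}\log_2 e
   = \log_2\bar{x} + \log_2 e = \log_2\left(e\,\bar{x}\right)
\end{align}
```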

Let $r = f(s)$ and $s \sim p(s)$. Which $f$ (assumed monotonic) maximizes $H(R)$ under the maximal firing rate constraint? We require
$$P(r) = \frac{1}{r_{max}}$$
$$p(s) = p(r) \frac{dr}{ds} = \frac{1}{r_{max}} \frac{df}{ds}$$
Thus $df/ds = r_{max}\, p(s)$ and
$$f(s) = r_{max} \int_{s_{min}}^{s} p(s')\, ds'$$
This strategy is known as histogram equalization in signal processing.
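The result says that the optimal response function is the cumulative distribution of the stimulus, scaled by $r_{max}$. A minimal sketch, assuming the stimulus distribution is only known through samples (here hypothetical Gaussian stimuli) and replacing the integral with the empirical CDF:

```python
import bisect
import random

def make_equalizer(stimulus_samples, r_max):
    """Return f(s) = r_max * (empirical CDF of the stimulus at s)."""
    ordered = sorted(stimulus_samples)
    n = len(ordered)

    def f(s):
        # Fraction of observed stimuli <= s approximates the integral of p(s').
        return r_max * bisect.bisect_right(ordered, s) / n

    return f

rng = random.Random(0)
stimuli = [rng.gauss(0.0, 1.0) for _ in range(20000)]   # hypothetical stimulus ensemble
r_max = 100.0
f = make_equalizer(stimuli, r_max)
responses = [f(s) for s in stimuli]

# Histogram of responses in 10 equal bins: each should hold about 10% of the mass.
counts = [0] * 10
for r in responses:
    counts[min(int(r / r_max * 10), 9)] += 1
fractions = [c / len(responses) for c in counts]

print(fractions)
```

The response histogram comes out flat even though the stimulus distribution is not, which is the uniform $p(r)$ required above.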

Fly retina
Evidence that the large monopolar cell (LMC) in the fly visual system carries out histogram equalization: the contrast response of the LMC (points) matches the environment statistics (line) [Laughlin, 1981], although this changes in high-noise conditions.

V1 contrast responses
Similar in V1, but with On and Off channels [Brady and Field, 2000].

Information of time varying signals
Single analog channel with Gaussian signal $s$ and Gaussian noise $\eta$:
$$r = s + \eta$$
$$I = \frac{1}{2} \log_2\left(1 + \frac{\sigma_s^2}{\sigma_\eta^2}\right) = \frac{1}{2} \log_2(1 + SNR)$$
For time dependent signals:
$$I = \frac{T}{2} \int \frac{d\omega}{2\pi} \log_2\left(1 + \frac{s(\omega)}{n(\omega)}\right)$$
To maximize information when the variance of the signal is constrained, use all frequency bands such that signal + noise = constant (whitening).
Water filling analogy: [figure]
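The water-filling idea can be sketched for a small number of discrete frequency bands. This is an illustrative simplification, not the lecture's calculation: each band is treated as an independent Gaussian channel carrying $\frac{1}{2}\log_2(1 + s/n)$ bits, the noise powers and power budget are hypothetical, and the water level is found by bisection.

```python
import math

def water_fill(noise_power, total_signal_power, iterations=100):
    """Find the level with sum(max(0, level - n_i)) = total power; return level and allocation."""
    lo, hi = min(noise_power), max(noise_power) + total_signal_power
    for _ in range(iterations):
        level = 0.5 * (lo + hi)
        used = sum(max(0.0, level - n) for n in noise_power)
        if used > total_signal_power:
            hi = level
        else:
            lo = level
    level = 0.5 * (lo + hi)
    return level, [max(0.0, level - n) for n in noise_power]

def information_bits(signal_power, noise_power):
    """Sum over bands of 0.5 * log2(1 + s/n)."""
    return sum(0.5 * math.log2(1.0 + s / n) for s, n in zip(signal_power, noise_power))

noise = [1.0, 2.0, 4.0, 10.0]   # hypothetical noise power per band
budget = 5.0                    # hypothetical total signal power

level, signal = water_fill(noise, budget)
info_water = information_bits(signal, noise)
info_equal = information_bits([budget / len(noise)] * len(noise), noise)

print(level, signal, info_water, info_equal)
```

In the bands that receive power, signal + noise equals the same level, and bands whose noise lies above the level receive nothing. In this example the allocation transmits more than spreading the same power equally over all bands.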

Information of graded synapses
Light - (photon noise) - photoreceptor - (synaptic noise) - LMC
At low light levels photon noise dominates and synaptic noise is negligible.
Information rate: 1500 bits/s [de Ruyter van Steveninck and Laughlin, 1996].

Spiking neurons: maximal information
Spike train with $N = T/\delta t$ bins [MacKay and McCulloch, 1952], where $\delta t$ is the "time-resolution".
With $pN = N_1$ events, the number of possible words is
$$\#\text{words} = \frac{N!}{N_1!\,(N - N_1)!}$$
The entropy is maximal if all words are equally likely:
$$H = -\sum_i p_i \log_2 p_i = \log_2 N! - \log_2 N_1! - \log_2 (N - N_1)!$$
Using $\log x! \approx x(\log x - 1)$ for large $x$:
$$H = -\frac{T}{\delta t} \left[p \log_2 p + (1 - p) \log_2 (1 - p)\right]$$
For low rates $p \ll 1$ we have $(1 - p)\log_2(1 - p) \approx -p \log_2 e$; setting $\lambda = p/\delta t$ gives
$$H = T\lambda \log_2\left(\frac{e}{\lambda\, \delta t}\right)$$
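To get a feel for how good these approximations are, the sketch below compares the exact word count with the Stirling and low-rate expressions. The recording length, bin size and firing rate are hypothetical values chosen for illustration.

```python
import math

def log2_factorial(n):
    """log2(n!) via the log-gamma function."""
    return math.lgamma(n + 1) / math.log(2)

def exact_entropy_bits(n_bins, n_spikes):
    """log2 of the number of words with n_spikes spikes in n_bins bins."""
    return (log2_factorial(n_bins) - log2_factorial(n_spikes)
            - log2_factorial(n_bins - n_spikes))

def stirling_entropy_bits(T, dt, p):
    """H = -(T/dt) [p log2 p + (1 - p) log2 (1 - p)]."""
    return -(T / dt) * (p * math.log2(p) + (1 - p) * math.log2(1 - p))

def low_rate_entropy_bits(T, dt, rate):
    """H = T * rate * log2(e / (rate * dt)), valid for rate * dt << 1."""
    return T * rate * math.log2(math.e / (rate * dt))

T, dt, rate = 10.0, 0.001, 20.0     # hypothetical: 10 s, 1 ms bins, 20 spikes/s
n_bins = round(T / dt)
p = rate * dt                        # probability of a spike per bin
n_spikes = round(p * n_bins)

h_exact = exact_entropy_bits(n_bins, n_spikes)
h_stirling = stirling_entropy_bits(T, dt, p)
h_low_rate = low_rate_entropy_bits(T, dt, rate)

print(h_exact, h_stirling, h_low_rate)
```

For these values the three numbers agree to within about one percent, with the Stirling form slightly above the exact count.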

Spiking neurons
The calculation is incorrect when there are multiple spikes per bin.

Spiking neurons: rate code [Stein, 1967]
Measure the rate in a window $T$, during which the stimulus is constant.
A periodic neuron can encode at most $1 + (f_{max} - f_{min})T$ stimuli, so
$$H \approx \log_2\left[1 + (f_{max} - f_{min})T\right]$$
Note that this grows only as $\log(T)$.

Similar behaviour for a Poisson neuron: $H \propto \log(T)$ [Stein, 1967]

Maximizing Information Transmission: single output
Single linear neuron with post-synaptic noise:
$$v = \mathbf{w} \cdot \mathbf{u} + \eta$$
where $\eta$ is an independent noise variable.
$$I_m(\mathbf{u}; v) = H(v) - H(v|\mathbf{u})$$
The second term depends only on $p(\eta)$.
To maximize $I_m$ we need to maximize $H(v)$; a sensible constraint is $\|\mathbf{w}\|^2 = 1$.
If $\mathbf{u} \sim N(0, Q)$ and $\eta \sim N(0, \sigma_\eta^2)$, then $v \sim N(0, \mathbf{w}^T Q \mathbf{w} + \sigma_\eta^2)$.

For a Gaussian random variable with variance $\sigma^2$ we have $H = \frac{1}{2} \log_2(2\pi e \sigma^2)$.
To maximize $H(v)$ we need to maximize $\mathbf{w}^T Q \mathbf{w}$ subject to the constraint $\|\mathbf{w}\|^2 = 1$.
Thus $\mathbf{w} \propto \mathbf{e}_1$, the eigenvector of $Q$ with the largest eigenvalue, so we obtain PCA.
If $v$ is non-Gaussian, this calculation gives an upper bound on $H(v)$ (as the Gaussian distribution is the maximum entropy distribution for a given mean and covariance).
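The argument can be illustrated numerically for a small input covariance. The sketch below is an assumption-laden example rather than the lecture's method: the matrix `Q` and noise variance are hypothetical, and the principal eigenvector is found with plain power iteration.

```python
import math

def matvec(Q, w):
    """Matrix-vector product Q w."""
    return [sum(q * x for q, x in zip(row, w)) for row in Q]

def normalize(w):
    """Rescale w so that ||w||^2 = 1."""
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]

def output_variance(Q, w, noise_var):
    """Variance of v = w.u + eta, i.e. w^T Q w + sigma_eta^2."""
    return sum(wi * qi for wi, qi in zip(w, matvec(Q, w))) + noise_var

def gaussian_entropy_bits(var):
    """H = 0.5 * log2(2 pi e sigma^2)."""
    return 0.5 * math.log2(2 * math.pi * math.e * var)

Q = [[2.0, 1.0],
     [1.0, 2.0]]          # hypothetical input covariance (eigenvalues 3 and 1)
noise_var = 0.1           # hypothetical sigma_eta^2

# Power iteration converges to the eigenvector with the largest eigenvalue.
w = normalize([1.0, 0.0])
for _ in range(200):
    w = normalize(matvec(Q, w))
w_pca = w

w_other = normalize([1.0, -1.0])   # the other eigenvector, for comparison
h_pca = gaussian_entropy_bits(output_variance(Q, w_pca, noise_var))
h_other = gaussian_entropy_bits(output_variance(Q, w_other, noise_var))

print(w_pca, h_pca, h_other)
```

The unit-norm weight vector along the principal eigenvector gives the largest output variance and therefore the largest $H(v)$.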

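Infomax
Infomax: maximize the information in multiple outputs with respect to the weights [Linsker, 1988].
$$\mathbf{v} = W\mathbf{u} + \boldsymbol{\eta}$$
$$H(\mathbf{v}) = \frac{1}{2} \log \det\left(\langle \mathbf{v}\mathbf{v}^T \rangle\right)$$
(up to an additive constant).
Example: 2 inputs and 2 outputs, with correlated input and the constraint $w_{k1}^2 + w_{k2}^2 = 1$. At low noise the optimum is independent coding; at high noise it is joint coding.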

Estimating information
Information estimation requires a lot of data.
Estimators of most statistical quantities (mean, variance, ...) are unbiased, but estimates of both the entropy and the noise entropy are biased. [Panzeri et al., 2007]

Try to fit a $1/N$ correction [Strong et al., 1998]
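A minimal simulation of the bias and of a $1/N$ extrapolation is sketched below. It is a simplified illustration under stated assumptions, not the procedure of the cited paper: the source is a hypothetical uniform distribution over 16 symbols (true entropy 4 bits), the plug-in estimate is averaged over repeated draws at several sample sizes $N$, and a straight line in $1/N$ is fitted by least squares.

```python
import math
import random

def plugin_entropy_bits(samples, n_symbols):
    """Plug-in estimate: entropy of the empirical symbol frequencies."""
    counts = [0] * n_symbols
    for x in samples:
        counts[x] += 1
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def mean_plugin_entropy(n_samples, n_symbols, n_repeats, rng):
    """Average plug-in estimate over repeated draws from a uniform source."""
    total = 0.0
    for _ in range(n_repeats):
        samples = [int(rng.random() * n_symbols) for _ in range(n_samples)]
        total += plugin_entropy_bits(samples, n_symbols)
    return total / n_repeats

rng = random.Random(1)
n_symbols = 16
true_entropy = math.log2(n_symbols)
sizes = [50, 100, 200, 400]
estimates = [mean_plugin_entropy(n, n_symbols, 200, rng) for n in sizes]

# Least-squares fit H_est(N) = H_inf + a / N; the intercept extrapolates to N -> infinity.
xs = [1.0 / n for n in sizes]
x_mean = sum(xs) / len(xs)
y_mean = sum(estimates) / len(estimates)
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, estimates))
         / sum((x - x_mean) ** 2 for x in xs))
h_extrapolated = y_mean - slope * x_mean

print(estimates, h_extrapolated, true_entropy)
```

The plug-in estimates underestimate the true entropy, more strongly for small $N$, and the extrapolated intercept lies much closer to the true value.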

A common technique for $I_m$ is shuffle correction [Panzeri et al., 2007].
See also: [Paninski, 2003, Nemenman et al., 2002]

Summary
Information theory provides a non-parametric framework for coding.
Optimal coding schemes depend strongly on noise assumptions and optimization constraints.
In data analysis, biases can be substantial.
