Information Theory
Matthias Hennig, School of Informatics, University of Edinburgh
February 7, 2019
Acknowledgements: Mark van Rossum and Chris Williams.

Why information theory? Understanding the neural code: encoding and decoding.
- So far we imposed coding schemes, such as a linear kernel or a GLM, and possibly lost information in doing so.
- Instead, use information theory: no need to impose an encoding or decoding scheme (non-parametric). This is particularly important for 1) spike-timing codes and 2) higher areas.
- It lets us estimate how much information is present in a recorded signal.
Caveats:
- The decoding process is ignored (upper bound only).
- It requires more data, and biases are tricky.

Overview
- Entropy, mutual information
- Entropy maximization for a single neuron
- Maximizing mutual information
- Estimating information
Reading: Dayan and Abbott ch. 4; Rieke et al.

Definitions
For an event x with probability P(x), the quantity
  h(x) = - log P(x)
is called 'surprise' or 'information'.
- It measures the information gained when observing x.
- It is additive for independent events.
- Often log2 is used, then the unit is bits (log_e gives nats).

Surprise
[Figure]

Definitions
The entropy of a random variable is the average surprise:
  H(X) = - Σ_x P(x) log2 P(x)
Properties:
- Continuous, non-negative; H = 0 for a certain event (P(x) = 1).
- If p_i = 1/n, it increases monotonically with n: H = log2 n.
- Entropies of independent events add.
[Shannon and Weaver, 1949, Cover and Thomas, 1991, Rieke et al., 1996]

Entropy
Discrete variable:
  H(R) = - Σ_r p(r) log2 p(r)
Continuous variable at resolution Δr:
  H(R) = - Σ_r p(r)Δr log2(p(r)Δr) = - Σ_r p(r)Δr log2 p(r) - log2 Δr
Letting Δr → 0 we have
  lim_{Δr→0} [H + log2 Δr] = - ∫ p(r) log2 p(r) dr
(also called the differential entropy)
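As a quick sanity check of the discrete formula, a minimal sketch (the `entropy` helper is illustrative, not from any library):

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution; zero-probability terms contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)) + 0.0)  # + 0.0 normalizes -0.0

print(entropy([0.25] * 4))   # uniform over 4 outcomes: log2(4) = 2.0
print(entropy([1.0, 0.0]))   # a certain outcome carries no surprise: 0.0
```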

Joint, conditional entropy
Joint entropy:
  H(S,R) = - Σ_{r,s} P(s,r) log2 P(s,r)
Conditional entropy:
  H(S|R) = Σ_r P(R=r) H(S|R=r)
         = - Σ_r P(r) Σ_s P(s|r) log2 P(s|r)
         = H(S,R) - H(R)
If S, R are independent: H(S,R) = H(S) + H(R)

Mutual information
Mutual information:
  I_m(R;S) = Σ_{r,s} p(r,s) log2 [ p(r,s) / (p(r)p(s)) ]
           = H(R) - H(R|S) = H(S) - H(S|R)
- Measures the reduction in uncertainty of R by knowing S (or vice versa).
- H(R|S) is called the noise entropy, the part of the response not explained by the stimulus.
- I_m(R;S) ≥ 0
- The continuous version is the difference of two entropies; the log2 Δr divergence cancels.
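The identity I_m(R;S) = H(R) + H(S) - H(R,S) can be verified on a small joint table; the binary channel below (stimulus flipped with probability 0.1) is an invented example:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(p_rs):
    """I(R;S) in bits from a joint probability table p_rs[r, s]."""
    return (entropy(p_rs.sum(axis=1)) + entropy(p_rs.sum(axis=0))
            - entropy(p_rs.ravel()))

# Binary stimulus passed through a channel that flips it with probability 0.1
p_rs = np.array([[0.45, 0.05],
                 [0.05, 0.45]])
print(mutual_information(p_rs))   # ≈ 0.531 bits = 1 - H2(0.1)
```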

Relationships between information measures
[Figure]

Coding channels
[Figure]

Coding channels
Can we reconstruct the stimulus? We need an encoding/decoding model:
  P(s|r) = P(r|s) P(s) / P(r)
How much information is conveyed? This can be addressed non-parametrically:
  I_m(S;R) = H(S) - H(S|R) = H(R) - H(R|S)

Kullback-Leibler divergence
The KL divergence measures a distance between two probability distributions:
  D_KL(P||Q) = ∫ P(x) log2 [ P(x) / Q(x) ] dx
  D_KL(P||Q) ≡ Σ_i P_i log2 ( P_i / Q_i )
- Not symmetric (the Jensen-Shannon divergence is the symmetrized form).
- I_m(R;S) = D_KL( p(r,s) || p(r)p(s) ), hence it measures the KL divergence to the independent model.
- Often used as a probabilistic cost function: D_KL( data || model ).
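Both properties can be checked numerically; `kl_divergence` and the distributions below are illustrative:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in bits, for discrete distributions on the same support."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Mutual information as the KL divergence to the independent model
p_rs = np.array([[0.45, 0.05],
                 [0.05, 0.45]])
p_indep = np.outer(p_rs.sum(axis=1), p_rs.sum(axis=0))
print(kl_divergence(p_rs.ravel(), p_indep.ravel()))   # I_m(R;S) ≈ 0.531 bits

# KL divergence is not symmetric
p, q = np.array([0.8, 0.2]), np.array([0.5, 0.5])
print(kl_divergence(p, q), kl_divergence(q, p))
```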

Mutual information between jointly Gaussian variables
  I(Y_1;Y_2) = ∫ P(y_1,y_2) log2 [ P(y_1,y_2) / (P(y_1)P(y_2)) ] dy_1 dy_2 = - (1/2) log2(1 - ρ²)
where ρ is the (Pearson) correlation coefficient.
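The closed form can be checked against a brute-force discretization of the bivariate Gaussian; the value of ρ and the grid parameters below are arbitrary choices:

```python
import numpy as np

rho = 0.6
analytic = -0.5 * np.log2(1 - rho**2)

# Numerical check: discretize the joint density on a fine grid
x = np.linspace(-6, 6, 601)
dx = x[1] - x[0]
X, Y = np.meshgrid(x, x, indexing="ij")
density = np.exp(-(X**2 - 2*rho*X*Y + Y**2) / (2*(1 - rho**2)))
density /= 2*np.pi*np.sqrt(1 - rho**2)
joint = density * dx * dx              # probability mass per grid cell
px = joint.sum(axis=1)
py = joint.sum(axis=0)
indep = np.outer(px, py)
mask = joint > 0
numeric = np.sum(joint[mask] * np.log2(joint[mask] / indep[mask]))
print(analytic, numeric)               # both ≈ 0.322 bits
```

As the slide notes, the Δr terms cancel in the discretized mutual information, so the grid estimate converges to the continuous value.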

Populations of neurons
Given
  H(R) = - ∫ p(r) log2 p(r) dr - N log2 Δr
and
  H(R_i) = - ∫ p(r_i) log2 p(r_i) dr_i - log2 Δr
we have
  H(R) ≤ Σ_i H(R_i)
(proof: consider the KL divergence between p(r) and Π_i p(r_i))

Mutual information in populations of neurons
Redundancy can be defined as (compare to above)
  R = Σ_{i=1..n_r} I(r_i; s) - I(r; s).
Some codes have R > 0 (redundant code), others R < 0 (synergistic).
Example of a synergistic code: P(r_1, r_2, s) with
  P(0,0,1) = P(0,1,0) = P(1,0,0) = P(1,1,1) = 1/4,
all other probabilities zero.
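The synergistic example on this slide (where s is the XOR of r_1 and r_2) can be worked out directly; the helper functions are illustrative:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mi(p_xy):
    return (entropy(p_xy.sum(axis=1)) + entropy(p_xy.sum(axis=0))
            - entropy(p_xy.ravel()))

# P(r1, r2, s): the four listed outcomes each have probability 1/4 (s = r1 XOR r2)
p = np.zeros((2, 2, 2))
p[0, 0, 1] = p[0, 1, 0] = p[1, 0, 0] = p[1, 1, 1] = 0.25

I_r1 = mi(p.sum(axis=1))       # joint table of (r1, s): marginalize out r2
I_r2 = mi(p.sum(axis=0))       # joint table of (r2, s)
I_pop = mi(p.reshape(4, 2))    # joint table of ((r1, r2), s)
R = I_r1 + I_r2 - I_pop
print(I_r1, I_r2, I_pop, R)    # 0.0 0.0 1.0 -1.0
```

Neither neuron alone carries any information about s, yet together they determine it completely, so R = -1: a synergistic code.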

Entropy maximization for a single neuron
  I_m(R;S) = H(R) - H(R|S)
If the noise entropy H(R|S) is independent of the transformation S → R, we can maximize mutual information by maximizing H(R) under given constraints.
- Constraint 0 < r < r_max: H(R) is maximal if p(r) ~ U(0, r_max) (U is the uniform distribution).
- If the average firing rate is limited and 0 < r < ∞: the exponential distribution is optimal, p(r) = (1/r̄) exp(-r/r̄), with H = log2(e r̄).
- If the variance is fixed and -∞ < r < ∞: the Gaussian distribution, with H = (1/2) log2(2πe σ²).

Let r = f(s) and s ~ p(s). Which f (assumed monotonic) maximizes H(R) under the maximum firing rate constraint?
Require:
  P(r) = 1/r_max
  p(s) = p(r) dr/ds = (1/r_max) df/ds
Thus df/ds = r_max p(s) and
  f(s) = r_max ∫_{s_min}^{s} p(s') ds'
This strategy is known as histogram equalization in signal processing.
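A minimal empirical sketch of histogram equalization: f(s) = r_max · CDF(s) is approximated by the rank of each sample. The Gaussian stimulus distribution and r_max value are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
r_max = 100.0

# Stimuli from some non-uniform distribution (here Gaussian, as an example)
s = rng.normal(size=100_000)

# Empirical equalizing nonlinearity f(s) = r_max * CDF(s),
# approximated by the rank of each sample among all samples
ranks = np.argsort(np.argsort(s))
r = r_max * (ranks + 0.5) / len(s)

# Responses are now uniform on (0, r_max): every decile bin is equally full
counts, _ = np.histogram(r, bins=10, range=(0, r_max))
print(counts)   # 10 equal bins of 10000
```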

Fly retina
Evidence that the large monopolar cell (LMC) in the fly visual system carries out histogram equalization: the contrast response of the fly LMC (points) matches the environmental statistics (line) [Laughlin, 1981] (but it changes in high-noise conditions).

V1 contrast responses
Similar in V1, but with separate On and Off channels [Brady and Field, 2000]

Information in time-varying signals
Single analog channel with Gaussian signal s and Gaussian noise η: r = s + η
  I = (1/2) log2(1 + σ_s²/σ_η²) = (1/2) log2(1 + SNR)
For time-dependent signals:
  I = (T/2) ∫ (dω/2π) log2(1 + s(ω)/n(ω))
To maximize information when the signal variance is constrained, use all frequency bands such that signal + noise = constant: whitening. Water-filling analogy:
[Figure]
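The water-filling rule (fill the quietest bands up to a common level, skip bands whose noise exceeds that level) can be sketched with a bisection on the level; `water_fill` and the band noise values are illustrative:

```python
import numpy as np

def water_fill(noise, total_power, iters=100):
    """Find signal powers s_i = max(0, level - noise_i) whose sum equals
    total_power, by bisecting on the common 'water level'."""
    lo, hi = 0.0, np.max(noise) + total_power
    for _ in range(iters):
        level = 0.5 * (lo + hi)
        if np.maximum(0.0, level - noise).sum() > total_power:
            hi = level
        else:
            lo = level
    return np.maximum(0.0, 0.5 * (lo + hi) - noise)

noise = np.array([1.0, 2.0, 4.0, 8.0])   # noise power per frequency band
s = water_fill(noise, total_power=5.0)
print(s)            # quietest bands receive the most signal power
print(s + noise)    # filled bands share a common level (4.0); noisiest band stays empty
info = 0.5 * np.sum(np.log2(1 + s / noise))   # resulting information per sample
```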

Information of graded synapses
Light → (photon noise) → photoreceptor → (synaptic noise) → LMC
At low light levels photon noise dominates and synaptic noise is negligible. Information rate: 1500 bits/s [de Ruyter van Steveninck and Laughlin, 1996].

Spiking neurons: maximal information
Spike train with N = T/δt bins [MacKay and McCulloch, 1952], δt the "time resolution", N_1 = pN spikes:
  #words = N! / (N_1! (N - N_1)!)
Maximal entropy if all words are equally likely:
  H = - Σ_i p_i log2 p_i = log2 N! - log2 N_1! - log2 (N - N_1)!
Using, for large x, log x! ≈ x (log x - 1):
  H = - (T/δt) [ p log2 p + (1-p) log2(1-p) ]
For low rates p ≪ 1, setting p = λδt:
  H ≈ T λ log2( e / (λδt) )
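The exact word count and the two approximations above can be compared numerically; the parameters (1 s of spikes, 1 ms bins, 40 Hz rate) are arbitrary choices:

```python
from math import lgamma, log, log2, e

def log2_binom(n, k):
    """Exact log2 of the binomial coefficient C(n, k)."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

T, dt, rate = 1.0, 0.001, 40.0   # 1 s of spikes at 1 ms resolution, 40 Hz
N = round(T / dt)                # number of bins
p = rate * dt                    # probability of a spike per bin
N1 = round(p * N)                # expected spike count

exact = log2_binom(N, N1)                            # log2 of #words
stirling = -N * (p*log2(p) + (1 - p)*log2(1 - p))    # Stirling approximation
low_rate = T * rate * log2(e / (rate * dt))          # low-rate limit
print(exact, stirling, low_rate)                     # all within a few percent
```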

Spiking neurons
This calculation is incorrect when there are multiple spikes per bin.

Spiking neurons: rate code
[Stein, 1967] Measure the rate in a window T, during which the stimulus is constant. A periodic neuron can maximally encode 1 + (f_max - f_min)T stimuli:
  H ≈ log2[ 1 + (f_max - f_min) T ].
Note: this grows only ∝ log(T).

[Stein, 1967] Similar behaviour for a Poisson neuron: H ∝ log(T)
[Figure]

Maximizing information transmission: single output
Single linear neuron with post-synaptic noise:
  v = w · u + η
where η is an independent noise variable.
  I_m(u;v) = H(v) - H(v|u)
- The second term depends only on p(η).
- To maximize I_m we need to maximize H(v); a sensible constraint is ||w||² = 1.
- If u ~ N(0, Q) and η ~ N(0, σ_η²) then v ~ N(0, wᵀQw + σ_η²).

For a Gaussian RV with variance σ² we have H = (1/2) log2(2πe σ²). To maximize H(v) we therefore need to maximize wᵀQw subject to the constraint ||w||² = 1.
Thus w ∝ e_1, the principal eigenvector of Q, so we obtain PCA.
If v is non-Gaussian then this calculation gives an upper bound on H(v) (as the Gaussian distribution is the maximum entropy distribution for a given mean and covariance).
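A small numerical check that the top eigenvector maximizes the output variance wᵀQw over unit vectors; the covariance matrix Q is an invented example:

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated Gaussian input u ~ N(0, Q)
Q = np.array([[2.0, 1.2],
              [1.2, 1.0]])

# The weight maximizing w^T Q w subject to ||w|| = 1 is the top eigenvector
evals, evecs = np.linalg.eigh(Q)
w_opt = evecs[:, -1]

# Compare against random unit vectors: none beats the eigenvector
variances = []
for _ in range(1000):
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)
    variances.append(w @ Q @ w)
print(w_opt @ Q @ w_opt >= max(variances))   # True
```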

Infomax
Infomax: maximize the information in multiple outputs with respect to the weights [Linsker, 1988]
  v = W u + η
  H(v) = (1/2) log2 det( 2πe ⟨vvᵀ⟩ )
Example: 2 inputs and 2 outputs, correlated input, with constraint w_k1² + w_k2² = 1. At low noise: independent coding; at high noise: joint coding.

Estimating information
Information estimation requires a lot of data. Many statistical quantities have unbiased estimators (mean, variance, ...), but both the entropy and the noise entropy estimates are biased. [Panzeri et al., 2007]

Try to fit a 1/N correction [Strong et al., 1998]
[Figure]
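The downward bias of the plug-in entropy estimate, and a first-order 1/N correction (the Miller-Madow correction, one standard choice), can be demonstrated in simulation; the uniform source and sample sizes are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8                      # alphabet size (e.g. number of response words)
H_true = np.log2(K)        # true entropy of a uniform source: 3 bits

def plugin_entropy(counts):
    q = counts[counts > 0] / counts.sum()
    return float(-np.sum(q * np.log2(q)))

N = 30                     # few samples relative to K: strong downward bias
naive, corrected = [], []
for _ in range(2000):
    counts = np.bincount(rng.integers(K, size=N), minlength=K)
    h = plugin_entropy(counts)
    naive.append(h)
    # first-order correction: bias ≈ -(K_observed - 1) / (2 N ln 2)
    corrected.append(h + (np.count_nonzero(counts) - 1) / (2 * N * np.log(2)))

print(H_true, np.mean(naive), np.mean(corrected))
```

The naive estimate systematically underestimates H_true; the corrected estimate is much closer on average, though for mutual information in real data shuffle corrections (next slide) are often preferred.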

A common technique for I_m: the shuffle correction [Panzeri et al., 2007]
See also: [Paninski, 2003, Nemenman et al., 2002]

Summary
- Information theory provides a non-parametric framework for studying neural coding.
- Optimal coding schemes depend strongly on noise assumptions and optimization constraints.
- In data analysis, biases can be substantial.
