  1. Information Theory
     Matthias Hennig, School of Informatics, University of Edinburgh
     February 7, 2019
     Acknowledgements: Mark van Rossum and Chris Williams.

  2. Why information theory?
     Understanding the neural code: encoding and decoding. So far we imposed coding schemes, such as a linear kernel or a GLM, and possibly lost information in doing so. Instead, use information theory:
     No encoding or decoding scheme needs to be imposed (non-parametric). This is particularly important for 1) spike-timing codes and 2) higher areas.
     It lets us estimate how much information is present in a recorded signal.
     Caveats: the decoding process is ignored (upper bound only); more data are required, and biases are tricky.

  3. Overview
     Entropy and mutual information. Entropy maximization for a single neuron. Maximizing mutual information. Estimating information.
     Reading: Dayan and Abbott ch. 4; Rieke et al.

  4. Definitions
     For an event with probability $P(x)$, the quantity $h(x) = -\log P(x)$ is called 'surprise' or 'information'. It measures the information gained when observing $x$ and is additive for independent events. If $\log_2$ is used the unit is bits ($\log_e$ gives nats).

  5. Surprise

  6. Definitions
     The entropy of a random variable is the average surprise:
     $$H(X) = -\sum_x P(x) \log_2 P(x)$$
     Properties: continuous; non-negative; $H = 0$ for a certain event ($p = 1$). For a uniform distribution $p_i = 1/n$ it increases monotonically with $n$: $H = \log_2 n$. Entropies of independent events add.
     [Shannon and Weaver, 1949, Cover and Thomas, 1991, Rieke et al., 1996]
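
A quick numerical check of these properties, as a minimal Python sketch (the helper name `entropy_bits` and the example distributions are illustrative, not from the slides):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy H(X) = -sum_x p(x) log2 p(x), skipping zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Certain event: zero entropy
print(entropy_bits([1.0]))                    # 0.0
# Uniform over n outcomes: H = log2(n), increasing with n
for n in (2, 4, 8):
    print(n, entropy_bits(np.full(n, 1 / n)))  # 1, 2, 3 bits
# Independent events add: H of the product distribution equals H(X) + H(Y)
px, py = np.array([0.2, 0.8]), np.array([0.5, 0.3, 0.2])
print(entropy_bits(np.outer(px, py).ravel()), entropy_bits(px) + entropy_bits(py))
```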

  7. Entropy
     Discrete variable:
     $$H(R) = -\sum_r p(r) \log_2 p(r)$$
     Continuous variable at resolution $\Delta r$:
     $$H(R) = -\sum_r p(r)\Delta r \, \log_2\big(p(r)\Delta r\big) = -\sum_r p(r)\Delta r \, \log_2 p(r) - \log_2 \Delta r$$
     Letting $\Delta r \to 0$ we have
     $$\lim_{\Delta r \to 0}\,\big[H + \log_2 \Delta r\big] = -\int p(r) \log_2 p(r)\, dr$$
     (also called differential entropy).
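
A small sketch (assumed Gaussian density, chosen only for illustration) showing that the discrete entropy diverges as $\Delta r \to 0$ while $H + \log_2 \Delta r$ converges to the differential entropy $\tfrac{1}{2}\log_2(2\pi e \sigma^2)$:

```python
import numpy as np

sigma = 1.0
diff_entropy = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)  # analytic differential entropy

for dr in (0.5, 0.1, 0.01):
    r = np.arange(-10, 10, dr)                              # grid at resolution dr
    p = np.exp(-r**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    P = p * dr                                              # probability per bin
    P = P[P > 0]
    H = -np.sum(P * np.log2(P))                             # discrete entropy, grows as dr shrinks
    print(dr, H, H + np.log2(dr), diff_entropy)             # H + log2(dr) approaches ~2.05 bits
```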

  8. Joint, conditional entropy
     Joint entropy:
     $$H(S, R) = -\sum_{r,s} P(s, r) \log_2 P(s, r)$$
     Conditional entropy:
     $$H(S \mid R) = \sum_r P(R = r)\, H(S \mid R = r) = -\sum_r P(r) \sum_s P(s \mid r) \log_2 P(s \mid r) = H(S, R) - H(R)$$
     If $S$ and $R$ are independent: $H(S, R) = H(S) + H(R)$.
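
A sketch with a made-up joint table $P(s, r)$, checking the identity $H(S \mid R) = H(S, R) - H(R)$:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Made-up joint distribution; rows index s, columns index r
P_sr = np.array([[0.3, 0.1],
                 [0.1, 0.5]])
P_r = P_sr.sum(axis=0)              # marginal over r
H_joint, H_r = H(P_sr), H(P_r)

# Conditional entropy from the definition: sum_r P(r) H(S | R = r)
H_s_given_r = sum(P_r[j] * H(P_sr[:, j] / P_r[j]) for j in range(P_sr.shape[1]))
print(H_s_given_r, H_joint - H_r)   # identical: H(S|R) = H(S,R) - H(R)
```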

  9. Mutual information
     Mutual information:
     $$I_m(R; S) = \sum_{r,s} p(r, s) \log_2 \frac{p(r, s)}{p(r)\, p(s)} = H(R) - H(R \mid S) = H(S) - H(S \mid R)$$
     It measures the reduction in uncertainty about $R$ from knowing $S$ (or vice versa). $H(R \mid S)$ is called the noise entropy, the part of the response not explained by the stimulus. $I_m(R; S) \geq 0$.
     The continuous version is a difference of two entropies, so the divergent $\log_2 \Delta r$ terms cancel.
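
In the same spirit, a sketch computing the mutual information of a made-up joint table both from the definition and via the entropy identity:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

P_rs = np.array([[0.3, 0.1],        # rows index r, columns index s (all entries > 0 here;
                 [0.1, 0.5]])       # zero entries would need masking in the log)
P_r, P_s = P_rs.sum(axis=1), P_rs.sum(axis=0)

# Direct definition: sum p(r,s) log2 [ p(r,s) / (p(r) p(s)) ]
I_direct = np.sum(P_rs * np.log2(P_rs / np.outer(P_r, P_s)))
# Entropy identity: I = H(R) + H(S) - H(R,S) = H(R) - H(R|S)
I_identity = H(P_r) + H(P_s) - H(P_rs)
print(I_direct, I_identity)         # equal, and non-negative
```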

  10. Relationships between information measures

  11. Coding channels

  12. Coding channels
     Can we reconstruct the stimulus? For that we need an encoding/decoding model:
     $$P(s \mid r) = \frac{P(r \mid s)\, P(s)}{P(r)}$$
     How much information is conveyed? This can be addressed non-parametrically:
     $$I_m(S; R) = H(S) - H(S \mid R) = H(R) - H(R \mid S)$$

  13. Kullback-Leibler divergence
     The KL divergence measures how different two probability distributions are:
     $$D_{KL}(P \| Q) = \int P(x) \log_2 \frac{P(x)}{Q(x)}\, dx \qquad \text{or, discretely,} \qquad D_{KL}(P \| Q) \equiv \sum_i P_i \log_2 \frac{P_i}{Q_i}$$
     It is not symmetric (the Jensen-Shannon divergence is the symmetrized form).
     $I_m(R; S) = D_{KL}\big(p(r, s) \,\|\, p(r)p(s)\big)$, so mutual information measures the divergence from the independent model.
     Often used as a probabilistic cost function: $D_{KL}(\text{data} \,\|\, \text{model})$.
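
A sketch of the discrete KL divergence and of the identity $I_m(R;S) = D_{KL}(p(r,s)\,\|\,p(r)p(s))$; the table is made up and the helper assumes $Q > 0$ wherever $P > 0$:

```python
import numpy as np

def kl_bits(P, Q):
    """D_KL(P || Q) = sum_i P_i log2(P_i / Q_i); requires Q_i > 0 wherever P_i > 0."""
    P, Q = np.asarray(P, float).ravel(), np.asarray(Q, float).ravel()
    mask = P > 0
    return np.sum(P[mask] * np.log2(P[mask] / Q[mask]))

P_rs = np.array([[0.3, 0.1],
                 [0.1, 0.5]])
P_r, P_s = P_rs.sum(axis=1), P_rs.sum(axis=0)
indep = np.outer(P_r, P_s)                            # independent model p(r) p(s)
print(kl_bits(P_rs, indep))                           # equals the mutual information of this table
print(kl_bits(P_rs, indep), kl_bits(indep, P_rs))     # the two directions differ: not symmetric
```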

  14. Mutual information between jointly Gaussian variables
     $$I(Y_1; Y_2) = \int\!\!\int P(y_1, y_2) \log_2 \frac{P(y_1, y_2)}{P(y_1)\, P(y_2)}\, dy_1\, dy_2 = -\tfrac{1}{2} \log_2(1 - \rho^2)$$
     where $\rho$ is the (Pearson) correlation coefficient.
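
A quick numerical illustration (the correlation value, sample size and seed are arbitrary): draw correlated Gaussian samples and plug the estimated $\rho$ into the closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
cov = [[1.0, rho], [rho, 1.0]]
y = rng.multivariate_normal([0.0, 0.0], cov, size=100_000)

rho_hat = np.corrcoef(y[:, 0], y[:, 1])[0, 1]   # sample Pearson correlation
I_closed = -0.5 * np.log2(1 - rho**2)           # analytic value, about 0.74 bits
I_plugin = -0.5 * np.log2(1 - rho_hat**2)       # estimate from the samples
print(I_closed, I_plugin)
```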

  15. Populations of neurons
     Given
     $$H(\mathbf{R}) = -\int p(\mathbf{r}) \log_2 p(\mathbf{r})\, d\mathbf{r} - N \log_2 \Delta r$$
     and
     $$H(R_i) = -\int p(r_i) \log_2 p(r_i)\, dr_i - \log_2 \Delta r$$
     we have
     $$H(\mathbf{R}) \leq \sum_i H(R_i)$$
     (proof: consider the KL divergence between $p(\mathbf{r})$ and $\prod_i p(r_i)$).

  16. Mutual information in populations of neurons
     Redundancy can be defined as (compare to above)
     $$R = \sum_{i=1}^{n_r} I(r_i; s) - I(\mathbf{r}; s).$$
     Some codes have $R > 0$ (redundant code), others $R < 0$ (synergistic code).
     Example of a synergistic code: $P(r_1, r_2, s)$ with $P(0,0,1) = P(0,1,0) = P(1,0,0) = P(1,1,1) = \tfrac{1}{4}$ and all other probabilities zero.
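
A worked check of this synergistic example by direct enumeration (the helper names are illustrative, not from the slides):

```python
import numpy as np

def H(p):
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def MI(P):
    """Mutual information between the two axes of a 2D joint table."""
    Pa, Pb = P.sum(axis=1), P.sum(axis=0)
    return H(Pa) + H(Pb) - H(P)

# P(r1, r2, s): the four listed outcomes have probability 1/4, the rest are zero
P = np.zeros((2, 2, 2))
for r1, r2, s in [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)]:
    P[r1, r2, s] = 0.25

I_r1_s = MI(P.sum(axis=1))      # I(r1; s) = 0: r1 alone says nothing about s
I_r2_s = MI(P.sum(axis=0))      # I(r2; s) = 0
I_r_s  = MI(P.reshape(4, 2))    # I((r1, r2); s) = 1 bit
print(I_r1_s, I_r2_s, I_r_s)    # redundancy R = 0 + 0 - 1 = -1 bit < 0: synergy
```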

  17. Entropy maximization for a single neuron
     $$I_m(R; S) = H(R) - H(R \mid S)$$
     If the noise entropy $H(R \mid S)$ is independent of the transformation $S \to R$, we can maximize the mutual information by maximizing $H(R)$ under given constraints.
     Possible constraint: the response is bounded, $0 < r < r_{max}$. Then $H(R)$ is maximal if $p(r) \sim U(0, r_{max})$ ($U$ is the uniform distribution).
     If the average firing rate is limited and $0 < r < \infty$: the exponential distribution is optimal, $p(r) = \frac{1}{\bar r} \exp(-r/\bar r)$, with $H = \log_2(e \bar r)$.
     If the variance is fixed and $-\infty < r < \infty$: the Gaussian distribution is optimal, with $H = \tfrac{1}{2} \log_2(2\pi e \sigma^2)$.
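
A numerical check of the two stated maximum-entropy values by Riemann-sum integration on truncated grids (the mean and variance values are arbitrary):

```python
import numpy as np

def diff_entropy_bits(p, dx):
    """Differential entropy -integral p log2 p dx, approximated by a Riemann sum."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p)) * dx

# Exponential with mean rbar on 0 < r < infinity (truncated grid): H = log2(e * rbar)
rbar, dr = 2.0, 1e-4
r = np.arange(dr, 50 * rbar, dr)
p_exp = np.exp(-r / rbar) / rbar
print(diff_entropy_bits(p_exp, dr), np.log2(np.e * rbar))

# Gaussian with variance sigma^2: H = 0.5 * log2(2 pi e sigma^2)
sigma, dx = 1.5, 1e-4
x = np.arange(-12 * sigma, 12 * sigma, dx)
p_gauss = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
print(diff_entropy_bits(p_gauss, dx), 0.5 * np.log2(2 * np.pi * np.e * sigma**2))
```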

  18. Let $r = f(s)$ with $s \sim p(s)$. Which $f$ (assumed monotonic) maximizes $H(R)$ under the maximum firing rate constraint?
     Require $p(r) = \frac{1}{r_{max}}$. Since $p(r)\, dr = p(s)\, ds$, we get $\frac{1}{r_{max}} \frac{df}{ds} = p(s)$, i.e. $df/ds = r_{max}\, p(s)$, and
     $$f(s) = r_{max} \int_{s_{min}}^{s} p(s')\, ds'$$
     This strategy is known as histogram equalization in signal processing.
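
A sketch of histogram equalization for an assumed stimulus distribution (Gaussian contrast samples, chosen only for illustration): $f$ is $r_{max}$ times the empirical cumulative distribution of $s$, and the resulting responses are uniform on $(0, r_{max})$:

```python
import numpy as np

rng = np.random.default_rng(1)
r_max = 100.0                              # maximum response (e.g. spikes/s)
s = rng.normal(0.0, 0.35, size=50_000)     # assumed stimulus (contrast) samples

# Empirical f(s) = r_max * integral_{s_min}^{s} p(s') ds' = r_max * CDF(s)
s_grid = np.sort(s)
cdf = np.arange(1, s_grid.size + 1) / s_grid.size
f = lambda x: r_max * np.interp(x, s_grid, cdf)

r = f(s)                                   # responses after histogram equalization
hist, _ = np.histogram(r, bins=10, range=(0, r_max), density=True)
print(hist * r_max)                        # every bin close to 1: p(r) is approximately uniform
```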

  19. Fly retina
     Evidence that the large monopolar cell (LMC) in the fly visual system carries out histogram equalization: the contrast response of the fly LMC (points) matches the statistics of contrasts in the environment (line) [Laughlin, 1981] (but it changes under high-noise conditions).

  20. V1 contrast responses
     A similar match is found in V1, but with separate On and Off channels [Brady and Field, 2000].

  21. Information in time-varying signals
     Single analog channel with Gaussian signal $s$ and Gaussian noise $\eta$: $r = s + \eta$.
     $$I = \tfrac{1}{2} \log_2\Big(1 + \frac{\sigma_s^2}{\sigma_\eta^2}\Big) = \tfrac{1}{2} \log_2(1 + \mathrm{SNR})$$
     For time-dependent signals:
     $$I = \frac{T}{2} \int \frac{d\omega}{2\pi} \log_2\Big(1 + \frac{s(\omega)}{n(\omega)}\Big)$$
     To maximize information when the variance of the signal is constrained, use all frequency bands such that signal + noise = constant (whitening). Water-filling analogy (figure).
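
A sketch of the water-filling allocation for an assumed noise spectrum: raise a "water level" $\nu$ so that the signal power $s(\omega) = \max(0, \nu - n(\omega))$ uses up the total power budget; in every used band, signal + noise then equals the constant $\nu$:

```python
import numpy as np

n = np.array([0.2, 0.5, 1.0, 2.0, 4.0])   # assumed noise power per frequency band
P_total = 3.0                             # total signal power budget

# Find the water level nu by bisection so that sum max(0, nu - n) = P_total
lo, hi = n.min(), n.max() + P_total
for _ in range(100):
    nu = 0.5 * (lo + hi)
    if np.sum(np.maximum(0.0, nu - n)) > P_total:
        hi = nu
    else:
        lo = nu

s = np.maximum(0.0, nu - n)               # optimal signal power per band
info = 0.5 * np.sum(np.log2(1.0 + s / n)) # information per sample (bits)
print(nu, s, s + n, info)                 # s + n equals nu in every band that is used
```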

  22. Information of graded synapses
     Light → (photon noise) → photoreceptor → (synaptic noise) → LMC.
     At low light levels photon noise dominates and synaptic noise is negligible. Information rate: about 1500 bits/s [de Ruyter van Steveninck and Laughlin, 1996].

  23. Spiking neurons: maximal information
     Consider a spike train with $N = T/\delta t$ bins [MacKay and McCulloch, 1952], where $\delta t$ is the time resolution. With $N_1 = pN$ spikes, the number of possible words is
     $$\#\text{words} = \frac{N!}{N_1!\,(N - N_1)!}$$
     Entropy is maximal if all words are equally likely:
     $$H = -\sum_i p_i \log_2 p_i = \log_2 N! - \log_2 N_1! - \log_2 (N - N_1)!$$
     Using, for large $x$, $\log x! \approx x(\log x - 1)$:
     $$H \approx -\frac{T}{\delta t}\big[p \log_2 p + (1 - p) \log_2(1 - p)\big]$$
     For low rates $p \ll 1$, setting $\lambda = p/\delta t$:
     $$H \approx T \lambda \log_2\Big(\frac{e}{\lambda \delta t}\Big)$$
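
A numerical comparison of the exact word-count entropy, the Stirling approximation and the low-rate limit (the duration, resolution and rate below are illustrative):

```python
import math

T, dt = 1.0, 0.005           # 1 s of spike train at 5 ms resolution
N = int(T / dt)              # number of bins
rate = 20.0                  # firing rate in Hz
p = rate * dt                # probability of a spike per bin
N1 = int(round(p * N))       # number of spikes

# Exact: all words with N1 spikes in N bins equally likely
H_exact = math.log2(math.comb(N, N1))
# Stirling approximation: -N [p log2 p + (1 - p) log2(1 - p)]
H_stirling = -N * (p * math.log2(p) + (1 - p) * math.log2(1 - p))
# Low-rate limit: T * lambda * log2(e / (lambda * dt)) with lambda = rate
H_lowrate = T * rate * math.log2(math.e / (rate * dt))
print(H_exact, H_stirling, H_lowrate)
```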

  24. Spiking neurons
     This calculation is incorrect when there can be multiple spikes per bin.

  25. Spiking neurons: rate code
     [Stein, 1967] Measure the rate in a window $T$ during which the stimulus is constant. A periodic neuron can maximally encode $1 + (f_{max} - f_{min})T$ distinct stimuli, so
     $$H \approx \log_2\big[1 + (f_{max} - f_{min})T\big].$$
     Note that this grows only $\propto \log T$.

  26. [Stein, 1967] Similar behaviour for a Poisson neuron: $H \propto \log T$.

  27. Maximizing information transmission: single output
     Single linear neuron with post-synaptic noise: $v = \mathbf{w} \cdot \mathbf{u} + \eta$, where $\eta$ is an independent noise variable.
     $$I_m(\mathbf{u}; v) = H(v) - H(v \mid \mathbf{u})$$
     The second term depends only on $p(\eta)$, so to maximize $I_m$ we need to maximize $H(v)$; a sensible constraint is $\|\mathbf{w}\|^2 = 1$.
     If $\mathbf{u} \sim \mathcal{N}(0, Q)$ and $\eta \sim \mathcal{N}(0, \sigma_\eta^2)$, then $v \sim \mathcal{N}(0, \mathbf{w}^T Q \mathbf{w} + \sigma_\eta^2)$.

  28. For a Gaussian random variable with variance $\sigma^2$ we have $H = \tfrac{1}{2} \log_2(2\pi e \sigma^2)$.
     To maximize $H(v)$ we therefore maximize $\mathbf{w}^T Q \mathbf{w}$ subject to $\|\mathbf{w}\|^2 = 1$. Thus $\mathbf{w} \propto \mathbf{e}_1$, the principal eigenvector of $Q$, so we obtain PCA.
     If $v$ is non-Gaussian this calculation gives an upper bound on $H(v)$ (as the Gaussian distribution is the maximum entropy distribution for a given mean and covariance).
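
A sketch under the slide's assumptions (Gaussian input with covariance $Q$, unit-norm weight, independent output noise): the variance-maximizing weight is the principal eigenvector of $Q$, and no random unit-norm weight yields a larger output entropy. The covariance and noise values here are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
Q = np.array([[1.0, 0.8],        # assumed input covariance (correlated inputs)
              [0.8, 1.0]])
sigma_eta2 = 0.1                 # post-synaptic noise variance

# Output variance w^T Q w + sigma_eta^2 is maximized over unit-norm w by the top eigenvector
evals, evecs = np.linalg.eigh(Q)
w_opt = evecs[:, -1]             # eigenvector with the largest eigenvalue (PCA direction)

H = lambda var: 0.5 * np.log2(2 * np.pi * np.e * var)
var_opt = w_opt @ Q @ w_opt + sigma_eta2
# Compare against random unit-norm weights: none exceeds the optimum
for _ in range(5):
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)
    print(H(w @ Q @ w + sigma_eta2), "<=", H(var_opt))
```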

  29. Infomax
     Infomax: maximize the information in multiple outputs with respect to the weights [Linsker, 1988].
     $$\mathbf{v} = W\mathbf{u} + \boldsymbol{\eta}, \qquad H(\mathbf{v}) = \tfrac{1}{2} \log_2 \det\big(\langle \mathbf{v}\mathbf{v}^T \rangle\big) + \text{const}$$
     Example: 2 inputs and 2 outputs with correlated input and constraint $w_{k1}^2 + w_{k2}^2 = 1$. At low noise the outputs code independently; at high noise they code jointly.
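
An illustrative grid search for the 2-input, 2-output example (assumed input covariance; each unit-norm weight row is parametrized by an angle, and the additive constant in $H(\mathbf{v})$ is ignored). At low noise the two rows point in clearly different, roughly orthogonal directions, while at high noise both align with the principal direction:

```python
import numpy as np

Q = np.array([[1.0, 0.9],            # assumed correlated input covariance
              [0.9, 1.0]])
angles = np.linspace(0, np.pi, 181)  # each unit-norm weight row = (cos a, sin a)

def best_angles(sigma2):
    """Maximize det(W Q W^T + sigma^2 I), equivalent to maximizing H(v), over the two angles."""
    best, best_det = None, -np.inf
    for a1 in angles:
        for a2 in angles:
            W = np.array([[np.cos(a1), np.sin(a1)],
                          [np.cos(a2), np.sin(a2)]])
            d = np.linalg.det(W @ Q @ W.T + sigma2 * np.eye(2))
            if d > best_det:
                best_det, best = d, (a1, a2)
    return np.degrees(best)

print("low noise :", best_angles(0.01))   # two clearly different, roughly orthogonal directions
print("high noise:", best_angles(10.0))   # both rows near 45 deg: joint coding of the principal component
```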

  30. Estimating information
     Information estimation requires a lot of data. Many common statistics have unbiased estimators (mean, variance, ...), but estimates of both the entropy and the noise entropy are biased. [Panzeri et al., 2007]

  31. One remedy: fit a $1/N$ correction to the estimate as a function of the amount of data $N$ and extrapolate [Strong et al., 1998].
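
A sketch of the idea with a toy distribution and the plug-in estimator: compute the naive estimate at several data sizes, fit a straight line in $1/N$, and take the intercept as the extrapolated, bias-corrected value. The actual procedure of [Strong et al., 1998] differs in its details; this only illustrates the $1/N$ behaviour of the bias:

```python
import numpy as np

rng = np.random.default_rng(3)
p_true = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
H_true = -np.sum(p_true * np.log2(p_true))

def plugin_entropy(samples, k):
    counts = np.bincount(samples, minlength=k)
    f = counts[counts > 0] / counts.sum()
    return -np.sum(f * np.log2(f))

sizes = np.array([50, 100, 200, 400, 800, 1600])
H_est = []
for N in sizes:
    reps = [plugin_entropy(rng.choice(len(p_true), size=N, p=p_true), len(p_true))
            for _ in range(200)]            # average over repeats to isolate the bias
    H_est.append(np.mean(reps))

# Fit H_est(N) ~ H_inf + a / N and extrapolate to infinite data (1/N -> 0)
a, H_inf = np.polyfit(1.0 / sizes, H_est, 1)
print(H_true, H_inf, np.array(H_est))       # naive estimates are biased low; the intercept is close to H_true
```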

  32. A common technique for estimating $I_m$: the shuffle correction [Panzeri et al., 2007].
     See also: [Paninski, 2003, Nemenman et al., 2002]

  33. Summary
     Information theory provides a non-parametric framework for studying neural coding.
     Optimal coding schemes depend strongly on noise assumptions and optimization constraints.
     In data analysis, biases can be substantial.
