

SLIDE 1

Information Theory

Maneesh Sahani Gatsby Computational Neuroscience Unit University College London February 2019

SLIDE 3

Quantifying a Code

◮ How much information does a neural response carry about a stimulus?
◮ How efficient is a hypothetical code, given the statistical behaviour of the components?
◮ How much better could another code do, given the same components?
◮ Is the information carried by different neurons complementary, synergistic (whole is greater than sum of parts), or redundant?
◮ Can further processing extract more information about a stimulus?

Information theory is the mathematical framework within which questions such as these can be framed and answered. Information theory does not directly address:

◮ estimation (but there are some relevant bounds)
◮ computation (but the “information bottleneck” might provide a motivating framework)
◮ representation (but redundancy reduction has obvious information-theoretic connections)

SLIDE 8

Uncertainty and Information

Information is related to the removal of uncertainty.

    S → R → P(S|R)    How informative is R about S?

◮ P(S|R) = (0, 0, 1, 0, . . . , 0) ⇒ high information?
◮ P(S|R) = (1/M, 1/M, . . . , 1/M) ⇒ low information?

But this also depends on P(S). We need to start by considering the uncertainty in a probability distribution → called the entropy.

Let S ∼ P(S). The entropy is the minimum number of bits needed, on average, to specify the value S takes, assuming P(S) is known. Equivalently, it is the minimum average number of yes/no questions needed to guess S.

SLIDE 11

Entropy

◮ Suppose there are M equiprobable stimuli: P(sm) = 1/M.

To specify which stimulus appears on a given trial, we would need to assign each a (binary) number. This would take

    Bs ≤ log2 M + 1    [choosing Bs so that 2^Bs ≥ M]    = −log2(1/M) + 1 bits

◮ Now suppose we code N such stimuli, drawn iid, at once. Then

    BN ≤ log2 M^N + 1 → −N log2(1/M)  as N → ∞

so the cost per stimulus is Bs = BN/N → −log2 p bits, with p = 1/M.

This is called block coding. It is useful for extracting theoretical limits. The nervous system is unlikely to use block codes in time, but may in space.
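The block-coding limit above is easy to check numerically. The sketch below (all numbers are illustrative choices) counts the ⌈log2 M^N⌉ bits needed to index a block of N equiprobable stimuli, and shows the per-stimulus cost falling from a wasteful whole number of bits toward −log2(1/M):

```python
import math

def bits_per_stimulus(M, N):
    """Bits per stimulus when coding blocks of N iid equiprobable stimuli.

    A block of N stimuli has M**N possible values, so it can be indexed
    with ceil(log2(M**N)) = ceil(N log2 M) bits; divide by N to get the
    per-stimulus cost.
    """
    return math.ceil(N * math.log2(M)) / N

M = 5                                # 5 equiprobable stimuli: log2(5) ~ 2.32 bits
single = bits_per_stimulus(M, 1)     # 3 bits: rounding up wastes most of a bit
blocked = bits_per_stimulus(M, 1000) # per-stimulus cost approaches log2 M
print(single, blocked)
```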

SLIDE 16

Entropy

◮ Now suppose stimuli are not equiprobable. Write P(sm) = pm. Then

    P(S1, S2, . . . , SN) = ∏m pm^nm    [where nm = (# of Si = sm)]

As N → ∞ only “typical” sequences, with nm = pmN, have non-zero probability of occurring; and they are all equally likely. This is called the Asymptotic Equipartition Property (or AEP). Thus,

    BN → −log2 ∏m pm^nm = −∑m nm log2 pm = −∑m pmN log2 pm = −N ∑m pm log2 pm = N·H[S]

H[S] = E[−log2 P(S)], also written H[P(S)], is the entropy of the stimulus distribution.

Rather than appealing to typicality, we could instead have used the law of large numbers directly:

    (1/N) log2 P(S1, S2, . . . , SN) = (1/N) log2 ∏i P(Si) = (1/N) ∑i log2 P(Si) → E[log2 P(Si)]  as N → ∞
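Both the entropy formula and its law-of-large-numbers reading can be verified numerically; a small sketch (the distribution and sample size are arbitrary choices):

```python
import math
import random

def entropy(p):
    """H[S] = -sum_m p_m log2 p_m, the entropy of a discrete distribution."""
    return -sum(pm * math.log2(pm) for pm in p if pm > 0)

p = [0.5, 0.25, 0.125, 0.125]
print(entropy(p))                      # 1.75 bits

# Law-of-large-numbers view: -(1/N) log2 P(S_1, ..., S_N) -> H[S]
random.seed(0)
N = 100_000
sample = random.choices(range(len(p)), weights=p, k=N)
est = -sum(math.log2(p[s]) for s in sample) / N
print(est)                             # close to 1.75
```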

SLIDE 21

Conditional Entropy

Entropy is a measure of the “available information” in the stimulus ensemble. Now suppose we measure a particular response r which depends on the stimulus according to P(R|S). How uncertain is the stimulus once we know r?

Bayes’ rule gives us

    P(S|r) = P(r|S)P(S) / ∑s P(r|s)P(s)

so we can write

    H[S|r] = −∑s P(s|r) log2 P(s|r)

The average uncertainty in S for r ∼ P(R) = ∑s P(R|s)P(s) is then

    H[S|R] = ∑r P(r) H[S|r] = −∑s,r P(s, r) log2 P(s|r)

It is easy to show that:

  1. H[S|R] ≤ H[S]
  2. H[S|R] = H[S, R] − H[R]
  3. H[S|R] = H[S] iff S ⊥⊥ R
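As a sketch of the definitions above, H[S|R] can be computed from a small joint distribution and compared against H[S] (the probabilities are purely illustrative):

```python
import math

def H(p):
    """Entropy (bits) of a distribution given as a list of probabilities."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# Joint P(s, r) for a hypothetical 2-stimulus, 2-response example.
P_joint = {('s1', 'r1'): 0.4, ('s1', 'r2'): 0.1,
           ('s2', 'r1'): 0.1, ('s2', 'r2'): 0.4}

P_r = {}
for (s, r), p in P_joint.items():
    P_r[r] = P_r.get(r, 0.0) + p

# H[S|R] = -sum_{s,r} P(s,r) log2 P(s|r), with P(s|r) = P(s,r)/P(r)
H_S_given_R = -sum(p * math.log2(p / P_r[r]) for (s, r), p in P_joint.items())

P_s = {}
for (s, r), p in P_joint.items():
    P_s[s] = P_s.get(s, 0.0) + p
H_S = H(list(P_s.values()))

print(H_S, H_S_given_R)   # knowing r reduces the uncertainty about s
```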

slide-22
SLIDE 22

Average Mutual Information

A natural definition of the average information gained about S from R is I[S; R] = H[S] − H[S|R] Measures reduction in uncertainty due to R.

slide-23
SLIDE 23

Average Mutual Information

A natural definition of the average information gained about S from R is I[S; R] = H[S] − H[S|R] Measures reduction in uncertainty due to R. It follows from the definition that I[S; R] =

  • s

P(s) log 1 P(s) −

  • s,r

P(s, r) log 1 P(s|r)

=

  • s,r

P(s, r) log 1 P(s) +

  • s,r

P(s, r) log P(s|r)

=

  • s,r

P(s, r) log P(s|r) P(s)

=

  • s,r

P(s, r) log P(s, r) P(s)P(r)

= I[R; S]
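The symmetry I[S; R] = I[R; S] can be checked directly from the final expression; a minimal sketch with an illustrative joint distribution:

```python
import math

# Joint P(s, r) for a small hypothetical channel (illustrative numbers).
P = [[0.4, 0.1],
     [0.1, 0.4]]          # rows index s, columns index r

P_s = [sum(row) for row in P]
P_r = [sum(col) for col in zip(*P)]

def mi(P, P_a, P_b):
    """I = sum_{a,b} P(a,b) log2 P(a,b)/(P(a)P(b)) for a joint matrix P."""
    return sum(P[i][j] * math.log2(P[i][j] / (P_a[i] * P_b[j]))
               for i in range(len(P_a)) for j in range(len(P_b))
               if P[i][j] > 0)

I_sr = mi(P, P_s, P_r)
# Symmetry: transposing the joint swaps the roles of S and R.
P_t = [list(col) for col in zip(*P)]
I_rs = mi(P_t, P_r, P_s)
print(I_sr, I_rs)   # equal, by symmetry of the definition
```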

SLIDE 24

Average Mutual Information

The symmetry suggests a Venn-like diagram. [Figure: two overlapping circles H[S] and H[R]; the overlap is I[S; R] = I[R; S], the non-overlapping parts are H[S|R] and H[R|S], and the union is H[S, R].] All of the additive and equality relationships implied by this picture hold for two variables. Unfortunately, we will see that this does not generalise to any more than two.

SLIDE 26

Kullback-Leibler Divergence

Another useful information-theoretic quantity measures the difference between two distributions:

    KL[P(S)‖Q(S)] = ∑s P(s) log P(s)/Q(s) = ∑s P(s) log 1/Q(s) − H[P]

where the first term on the right is the cross entropy. This is the excess cost in bits paid by encoding according to Q instead of P.

    −KL[P‖Q] = ∑s P(s) log Q(s)/P(s)
             ≤ log ∑s P(s) Q(s)/P(s)    by Jensen’s inequality
             = log ∑s Q(s) = log 1 = 0

So KL[P‖Q] ≥ 0, with equality iff P = Q.
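A short numerical sketch of these properties, using illustrative distributions: KL as cross entropy minus entropy, non-negativity, and equality iff P = Q:

```python
import math

def kl(p, q):
    """KL[P||Q] = sum_s P(s) log2 P(s)/Q(s), in bits."""
    return sum(ps * math.log2(ps / qs) for ps, qs in zip(p, q) if ps > 0)

p = [0.5, 0.3, 0.2]
q = [1 / 3, 1 / 3, 1 / 3]

# Excess bits paid per symbol by a code built for Q when data follow P:
cross_entropy = -sum(ps * math.log2(qs) for ps, qs in zip(p, q))
H_p = -sum(ps * math.log2(ps) for ps in p)
print(kl(p, q), cross_entropy - H_p)   # identical by definition

print(kl(p, p))   # 0: equality iff P = Q
```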

SLIDE 29

Mutual Information and KL

    I[S; R] = ∑s,r P(s, r) log P(s, r)/(P(s)P(r)) = KL[P(S, R)‖P(S)P(R)]

Thus:

  1. Mutual information is always non-negative: I[S; R] ≥ 0
  2. Conditioning never increases entropy: H[S|R] ≤ H[S]

SLIDE 37

Multiple Responses

Two responses to the same stimulus, R1 and R2, may provide either more or less information jointly than independently.

    I12 = I[S; R1, R2] = H[R1, R2] − H[R1, R2|S]

    R1 ⊥⊥ R2   ⇒ H[R1, R2] = H[R1] + H[R2]
    R1 ⊥⊥ R2|S ⇒ H[R1, R2|S] = H[R1|S] + H[R2|S]

    R1 ⊥⊥ R2    R1 ⊥⊥ R2|S
    no          yes           I12 < I1 + I2    redundant
    yes         yes           I12 = I1 + I2    independent
    yes         no            I12 > I1 + I2    synergistic
    no          no            ?                any of the above

I12 ≥ max(I1, I2): the second response cannot destroy information. Thus, the Venn-like diagram with three variables is misleading.

SLIDE 41

Data Processing Inequality

Suppose S → R1 → R2 form a Markov chain; that is, R2 ⊥⊥ S | R1. Then

    P(R2, S|R1) = P(R2|R1)P(S|R1) ⇒ P(S|R1, R2) = P(S|R1)

Thus,

    H[S|R2] ≥ H[S|R1, R2] = H[S|R1] ⇒ I[S; R2] ≤ I[S; R1]

So any computation based on R1 that does not have separate access to S cannot add information (in the Shannon sense) about the world.

Equality holds iff S → R2 → R1 as well. In this case R2 is called a sufficient statistic for S.
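The inequality can be checked on a small Markov chain; a sketch in which R2 is a noisy relay of R1, with no separate access to S (all probabilities are illustrative):

```python
import math

def mi_from_joint(pxy):
    """I[X;Y] in bits from a joint distribution given as {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

# Markov chain S -> R1 -> R2 (illustrative conditionals).
P_s = {0: 0.5, 1: 0.5}
P_r1_given_s = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
P_r2_given_r1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.3, 1: 0.7}}

P_sr1, P_sr2 = {}, {}
for s, ps in P_s.items():
    for r1, p1 in P_r1_given_s[s].items():
        P_sr1[(s, r1)] = P_sr1.get((s, r1), 0) + ps * p1
        for r2, p2 in P_r2_given_r1[r1].items():
            P_sr2[(s, r2)] = P_sr2.get((s, r2), 0) + ps * p1 * p2

print(mi_from_joint(P_sr1), mi_from_joint(P_sr2))
# Further processing cannot add information: I[S;R2] <= I[S;R1].
```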

SLIDE 45

Entropy Rate

So far we have discussed S and R as single (or iid) random variables. But real stimuli and responses form time series. Let S = {S1, S2, S3, . . .} be a stochastic process. Then

    H[S1, S2, . . . , Sn] = H[Sn|S1, S2, . . . , Sn−1] + H[S1, S2, . . . , Sn−1]
                          = H[Sn|S1, S2, . . . , Sn−1] + H[Sn−1|S1, S2, . . . , Sn−2] + . . . + H[S1]

The entropy rate of S is defined as

    H[S] = lim n→∞ H[S1, S2, . . . , Sn]/n

or alternatively as

    H[S] = lim n→∞ H[Sn|S1, S2, . . . , Sn−1]

(for stationary processes the two limits agree).

If the Si are iid ∼ P(S), the entropy rate of the process equals the entropy H[S] of a single variable. If S is Markov (and stationary), the entropy rate is H[Sn|Sn−1].
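For a stationary Markov chain the entropy rate H[Sn|Sn−1] reduces to a weighted sum of transition-row entropies under the stationary distribution; a small sketch with an illustrative two-state transition matrix:

```python
import math

# Two-state Markov chain: T[i][j] = P(S_n = j | S_{n-1} = i).
T = [[0.9, 0.1],
     [0.4, 0.6]]

# Stationary distribution of a 2-state chain, in closed form.
pi0 = T[1][0] / (T[0][1] + T[1][0])
pi = [pi0, 1 - pi0]

def H(row):
    """Entropy (bits) of one transition row."""
    return -sum(p * math.log2(p) for p in row if p > 0)

# Entropy rate: H[S_n | S_{n-1}] = sum_i pi_i H(row_i).
rate = sum(pi[i] * H(T[i]) for i in range(2))
print(rate)   # < 1 bit per step, because the chain is predictable
```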

SLIDE 51

Continuous Random Variables

The discussion so far has involved discrete S and R. Now let S ∈ ℝ with density p(s). What is its entropy? Suppose we discretise with bin length ∆s:

    H∆[S] = −∑i p(si)∆s log p(si)∆s
          = −∑i p(si)∆s (log p(si) + log ∆s)
          = −∑i p(si)∆s log p(si) − log ∆s ∑i p(si)∆s
          = −∑i ∆s p(si) log p(si) − log ∆s
          → −∫ds p(s) log p(s) + ∞    as ∆s → 0

We define the differential entropy: h(S) = −∫ds p(s) log p(s).

Note that h(S) can be < 0, and can be ±∞.
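The relation H∆[S] ≈ h(S) − log2 ∆s is easy to see numerically; a sketch for a standard Gaussian, whose differential entropy is ½ log2(2πe) bits (the grid limits and bin widths are arbitrary choices):

```python
import math

def gauss_pdf(s, sigma=1.0):
    """Density of N(0, sigma^2)."""
    return math.exp(-0.5 * (s / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def discretised_entropy(ds, lim=10.0):
    """H_delta = -sum_i p(s_i) ds log2(p(s_i) ds) over a grid covering +/- lim."""
    n = int(2 * lim / ds)
    total = 0.0
    for i in range(n):
        p = gauss_pdf(-lim + (i + 0.5) * ds) * ds   # bin probability
        if p > 0:
            total -= p * math.log2(p)
    return total

h_gauss = 0.5 * math.log2(2 * math.pi * math.e)   # differential entropy, bits
for ds in [0.1, 0.01]:
    # The discretised entropy tracks h(S) - log2(ds), diverging as ds -> 0,
    # while h(S) itself stays finite.
    print(discretised_entropy(ds), h_gauss - math.log2(ds))
```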

SLIDE 54

Continuous Random Variables

We can define other information-theoretic quantities similarly. The conditional differential entropy is

    h(S|R) = −∫ds dr p(s, r) log p(s|r)

and, like the differential entropy itself, may be poorly behaved. The mutual information, however, is well defined:

    I∆[S; R] = H∆[S] − H∆[S|R]
             = (−∑i ∆s p(si) log p(si) − log ∆s) − ∫dr p(r) (−∑i ∆s p(si|r) log p(si|r) − log ∆s)
             → h(S) − h(S|R)

since the log ∆s terms cancel; as are other KL divergences.

SLIDE 59

Maximum Entropy Distributions

  1. H[R1, R2] ≤ H[R1] + H[R2], with equality iff R1 ⊥⊥ R2.

  2. Let ∫ds p(s)f(s) = a for some function f. What distribution has maximum entropy?

Use Lagrange multipliers:

    L = ∫ds p(s) log p(s) − λ0 (∫ds p(s) − 1) − λ1 (∫ds p(s)f(s) − a)

    δL/δp(s) = 1 + log p(s) − λ0 − λ1 f(s) = 0
    ⇒ log p(s) = λ0 + λ1 f(s) − 1
    ⇒ p(s) = (1/Z) e^(λ1 f(s))

The constants λ0 and λ1 can be found by solving the constraint equations. Thus:

    f(s) = s  ⇒ p(s) = (1/Z) e^(λ1 s):   exponential (need p(s) = 0 for s < T).
    f(s) = s² ⇒ p(s) = (1/Z) e^(λ1 s²):  Gaussian.

Both results together ⇒ the maximum entropy point process (for fixed mean arrival rate) is homogeneous Poisson – independent, exponentially distributed ISIs.

SLIDE 62

Channels

We now direct our focus to the conditional P(R|S), which defines the channel linking S to R:

    S −−P(R|S)−→ R

The mutual information

    I[S; R] = ∑s,r P(s, r) log P(s, r)/(P(s)P(r)) = ∑s,r P(s)P(r|s) log P(r|s)/P(r)

depends on the marginals P(s) and P(r) = ∑s P(r|s)P(s) as well, and thus is unsuitable to characterise the conditional alone. Instead, we characterise the channel by its capacity

    CR|S = sup over P(s) of I[S; R]

Thus the capacity gives the theoretical limit on the amount of information that can be transmitted over a channel. Clearly, this is limited by the properties of the noise.

SLIDE 63

Joint source-channel coding theorem

The remarkable central result of information theory:

    S −−encoder−→ S̃ −−channel (CR|S̃)−→ R −−decoder−→ T

Any source ensemble S with entropy H[S] < CR|S̃ can be transmitted (in sufficiently long blocks) with Perror → 0. The proof is beyond our scope. Some of the key ideas that appear in the proof are:

◮ block coding
◮ error correction
◮ joint typicality
◮ random codes

SLIDE 64

The channel coding problem

    S −−encoder−→ S̃ −−channel (CR|S̃)−→ R −−decoder−→ T

Given channel P(R|S̃) and source P(S), find an encoding P(S̃|S) (which may be deterministic) to maximise I[S; R]. By the data processing inequality, and the definition of capacity:

    I[S; R] ≤ I[S̃; R] ≤ CR|S̃

By the JSCT, equality can be achieved (in the limit of increasing block size). Thus I[S̃; R] should saturate CR|S̃. See homework for an algorithm (Blahut-Arimoto) to find the P(S̃) that saturates CR|S̃ for a general discrete channel.
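A minimal sketch of the Blahut-Arimoto iteration referred to above, applied to a binary symmetric channel (the channel and flip probability are illustrative choices; for a BSC the capacity is known in closed form, 1 − H[f], which the iteration should reproduce):

```python
import math

def blahut_arimoto(Q, iters=500):
    """Capacity (bits) of a discrete channel Q[x][y] = P(y|x).

    Alternates between the optimal posterior phi(x|y) for the current
    input distribution p(x), and the multiplicative update
    p(x) <- p(x) exp(sum_y Q(y|x) log(Q(y|x)/r(y))) / Z,
    which converges to the p achieving C = sup_p I(X;Y).
    """
    nx, ny = len(Q), len(Q[0])
    p = [1.0 / nx] * nx
    for _ in range(iters):
        r = [sum(p[x] * Q[x][y] for x in range(nx)) for y in range(ny)]
        w = [math.exp(sum(Q[x][y] * math.log(p[x] * Q[x][y] / r[y])
                          for y in range(ny) if Q[x][y] > 0))
             for x in range(nx)]
        z = sum(w)
        p = [wx / z for wx in w]
    # Mutual information at the final input distribution.
    r = [sum(p[x] * Q[x][y] for x in range(nx)) for y in range(ny)]
    return sum(p[x] * Q[x][y] * math.log2(Q[x][y] / r[y])
               for x in range(nx) for y in range(ny) if Q[x][y] > 0)

f = 0.1                         # binary symmetric channel, flip probability f
Q = [[1 - f, f], [f, 1 - f]]
C = blahut_arimoto(Q)
print(C, 1 + f * math.log2(f) + (1 - f) * math.log2(1 - f))   # both ~0.531
```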

SLIDE 70

Entropy maximisation

    I[S̃; R] = H[R] − H[R|S̃]
               (marginal entropy)  (noise entropy)

If the noise is small and “constant” ⇒ maximise the marginal entropy ⇒ maximise H[S̃].

Consider a (rate coding) neuron with r ∈ [0, rmax]:

    h(r) = −∫0^rmax dr p(r) log p(r)

To maximise the marginal entropy, we add a Lagrange multiplier (µ) to enforce normalisation and then differentiate:

    δ/δp(r) [h(r) − µ ∫0^rmax p(r)] = −log p(r) − 1 − µ    for r ∈ [0, rmax]  (0 otherwise)

Setting this to zero ⇒ p(r) = const for r ∈ [0, rmax], i.e.

    p(r) = 1/rmax for r ∈ [0, rmax], 0 otherwise.
slide-71
SLIDE 71

Histogram Equalisation

Suppose r = ˜ s + η where η represents a (relatively small) source of noise. Consider deterministic encoding ˜ s = f(s). How do we ensure that p(r) = 1/rmax? 1 rmax = p(r) ≈ p(˜ s) = p(s) f ′(s)

⇒ f ′(s) = rmax p(s) ⇒ f(s) = rmax s

−∞

ds′ p(s′)

˜

s

−3 −2 −1 1 2 3 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

s

slide-72
SLIDE 72

Histogram Equalisation

Laughlin (1981)
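Histogram equalisation is straightforward to demonstrate by sampling: with f(s) = rmax ∫−∞^s ds′ p(s′), the encoded values are uniform on [0, rmax] whatever the stimulus distribution. A sketch assuming a standard Gaussian stimulus (rmax and the sample count are arbitrary choices):

```python
import math
import random

random.seed(1)
rmax = 10.0

def cdf(s):
    """Standard-normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(s / math.sqrt(2)))

# Encode: s~ = f(s) = rmax * CDF(s), the histogram-equalising transfer function.
samples = [random.gauss(0, 1) for _ in range(50_000)]
encoded = [rmax * cdf(s) for s in samples]

# Uniformity check: each tenth of [0, rmax] should catch ~10% of samples.
counts = [0] * 10
for r in encoded:
    counts[min(int(r), 9)] += 1
fractions = [c / len(encoded) for c in counts]
print(fractions)   # all close to 0.1
```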

SLIDE 79

Gaussian channel

A similar idea of output-entropy maximisation appears in the theory of Gaussian channel coding, where it is called the water filling algorithm. We will need the differential entropy of a (multivariate) Gaussian distribution. Let

    p(Z) = |2πΣ|^(−1/2) exp(−½ (Z − µ)ᵀ Σ⁻¹ (Z − µ))

Then

    h(Z) = −∫dZ p(Z) (−½ log |2πΣ| − ½ (Z − µ)ᵀ Σ⁻¹ (Z − µ) log e)
         = ½ log |2πΣ| + ½ (log e) ∫dZ p(Z) Tr[Σ⁻¹ (Z − µ)(Z − µ)ᵀ]
         = ½ log |2πΣ| + ½ (log e) Tr[Σ⁻¹ Σ]
         = ½ log |2πΣ| + ½ d (log e)
         = ½ log |2πeΣ|

slide-80
SLIDE 80

Gaussian channel – white noise

+

  • S

R Z

∼ N (0, kz)

slide-81
SLIDE 81

Gaussian channel – white noise

+

  • S

R Z

∼ N (0, kz)

I[ S; R] = h(R) − h(R| S)

slide-82
SLIDE 82

Gaussian channel – white noise

+

  • S

R Z

∼ N (0, kz)

I[ S; R] = h(R) − h(R| S)

= h(R) − h(

S + Z| S)

slide-83
SLIDE 83

Gaussian channel – white noise

+

  • S

R Z

∼ N (0, kz)

I[ S; R] = h(R) − h(R| S)

= h(R) − h(

S + Z| S)

= h(R) − h(Z)

slide-84
SLIDE 84

Gaussian channel – white noise

+

  • S

R Z

∼ N (0, kz)

I[ S; R] = h(R) − h(R| S)

= h(R) − h(

S + Z| S)

= h(R) − h(Z) ⇒ I[

S; R] = h(R) − 1 2 log 2πekz.

slide-93
SLIDE 93

Gaussian channel – white noise

R = S + Z, with noise Z ∼ N(0, k_z) and power constraint ⟨S²⟩ ≤ P.

I[S; R] = h(R) − h(R|S)
        = h(R) − h(S + Z|S)
        = h(R) − h(Z)
⇒ I[S; R] = h(R) − ½ log 2πe k_z.

Without constraint, h(R) → ∞ and C_{R|S} = ∞. Therefore, constrain (1/n) Σᵢ₌₁ⁿ s̃ᵢ² ≤ P.

Then,

⟨R²⟩ = ⟨(S + Z)²⟩ = ⟨S²⟩ + ⟨Z²⟩ + 2⟨SZ⟩ ≤ P + k_z + 0

⇒ h(R) ≤ h(N(0, P + k_z)) = ½ log 2πe(P + k_z)

⇒ I[S; R] ≤ ½ log 2πe(P + k_z) − ½ log 2πe k_z = ½ log(1 + P/k_z)

C_{R|S} = ½ log(1 + P/k_z)

The capacity is achieved iff R ∼ N(0, P + k_z), i.e. S ∼ N(0, P).
slide-99
SLIDE 99

Gaussian channel – correlated noise

Now consider a vector Gaussian channel: R = S + Z, with
S = (S₁, …, S_d), R = (R₁, …, R_d), Z = (Z₁, …, Z_d) ∼ N(0, K_z), and (1/d) Tr⟨S Sᵀ⟩ ≤ P.

Following the same approach as before:

I[S; R] = h(R) − h(Z) ≤ ½ log((2πe)^d |K_s̃ + K_z|) − ½ log((2πe)^d |K_z|),

⇒ C_{R|S} is achieved when S (and thus R) ∼ N, with |K_s̃ + K_z| maximised given (1/d) Tr[K_s̃] ≤ P.

Diagonalise K_z ⇒ K_s̃ is diagonal in the same basis.

For stationary noise (wrt the dimension indexed by d) this can be achieved by a Fourier transform ⇒ index the diagonal elements by ω.

k*_s̃(ω) = argmax Σ_ω log(k_s̃(ω) + k_z(ω))   such that   (1/d) Σ_ω k_s̃(ω) ≤ P
SLIDE 104

Water filling

Assume that optimum is achieved for max. input power. k∗

˜ s (ω) = argmax

  • ω

log (k˜

s(ω) + kz(ω)) − λ

  • 1

d

  • ω

s(ω) − P

1 k∗

˜ s (ω) + kz(ω) − λ

d = 0

⇒ k∗

˜ s (ω) + kz(ω) = ν

(const.) (k˜

s ≥ 0) ⇒ k∗ ˜ s (ω) = [ν − kz(ω)]+

Waterfilling: choose ν so

  • ω

s(ω) = d · P

ν kz(ω) ks(ω) ω k(ω)

R is white or decorrelated (within power budget) ⇒variance equalisation.
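The water level ν can be found by bisection, since the allocated power Σ_ω [ν − k_z(ω)]₊ grows monotonically with ν. A minimal sketch (the example noise spectrum is illustrative):

```python
import numpy as np

def water_fill(kz: np.ndarray, P: float, tol: float = 1e-12) -> np.ndarray:
    """Return k_s(w) = [nu - kz(w)]_+ with the water level nu chosen so mean(k_s) = P."""
    lo, hi = float(kz.min()), float(kz.max()) + P * len(kz)  # brackets nu
    while hi - lo > tol:
        nu = 0.5 * (lo + hi)
        if np.maximum(nu - kz, 0.0).mean() > P:
            hi = nu   # water level too high: over budget
        else:
            lo = nu
    return np.maximum(0.5 * (lo + hi) - kz, 0.0)

kz = np.array([0.1, 0.5, 1.0, 2.0])   # noise power rising with frequency
ks = water_fill(kz, P=0.5)

assert abs(ks.mean() - 0.5) < 1e-9    # power budget met with equality
# Wherever signal power is assigned, k_s + k_z sits at the common level nu (= 1.2 here);
# the noisiest channel (kz = 2.0 > nu) receives no power at all.
assert np.allclose((ks + kz)[ks > 1e-9], 1.2)
assert ks[-1] == 0.0
```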

slide-109
SLIDE 109

Decorrelation at the retina

Atick and Redlich (1992) argued that the retina decorrelates natural spatial statistics.

RGCs exhibit roughly linear (centre–surround) processing:

r_a − ⟨r_a⟩ = ∫ dx D_s(x − a) s(x)
                 [filter]    [stimulus]

Therefore the correlation (covariance) between cells is

Q_r(a, b) = ⟨ ∫ dx dy D_s(x − a) D_s(y − b) s(x) s(y) ⟩
          = ∫ dx dy D_s(x − a) D_s(y − b) ⟨s(x) s(y)⟩,   with ⟨s(x) s(y)⟩ = Q_s(x, y)

Using (spatial) stationarity, we can transform to the Fourier domain:

Q̃_r(k) = |D̃_s(k)|² Q̃_s(k)

and thus output decorrelation requires

|D̃_s(k)|² ∝ 1 / Q̃_s(k)
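A quick numerical illustration (a sketch, with an arbitrary k₀ and a 1-D stationary signal standing in for an image row): shaping white noise to have spectrum Q̃_s(k), then filtering with |D̃(k)| ∝ 1/√Q̃_s(k), yields an output whose power spectrum is flat, i.e. decorrelated:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 256, 4000

# Target power spectrum Q_s(k) ∝ 1/(|k|^2 + k0^2), as for natural-image statistics.
k = np.fft.fftfreq(n)
Qs = 1.0 / (k**2 + 0.05**2)

# Draw stationary signals with spectrum Qs by shaping white noise in Fourier space,
# then apply the decorrelating filter |D(k)| ∝ 1/sqrt(Qs(k)).
white = rng.standard_normal((trials, n))
s = np.fft.ifft(np.fft.fft(white, axis=1) * np.sqrt(Qs), axis=1).real
r = np.fft.ifft(np.fft.fft(s, axis=1) / np.sqrt(Qs), axis=1).real

# Output spectrum is flat: equal power at every frequency, up to sampling error.
Qr = (np.abs(np.fft.fft(r, axis=1)) ** 2).mean(axis=0)
assert Qr.std() / Qr.mean() < 0.1
```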
SLIDE 113

Decorrelation at the retina

Spatial correlations of natural images fall off with f −2:

  • Qs(k) ∝

1

|k|2 + k 2

and the optical filter of the eye introduces (crudely) a low-pass term ∝ e−α|k|. So decorrelation requires

|

Ds(k)|2 ∝ |k|2 + k 2 e−α|k| But: not all input is signal. Photodetection introduces noise. Therefore, cascade linear filters: s + η −

− − − − →

ˆ

s −

− − − − →

Ds

r with

  • Dη(k) =
  • Qs(k)
  • Qs(k) +

Qη(k) (Wiener filter) Thus the combined RGC filter is predicted to be:

|

Ds(k)| Dη(k) ∝

  • Qs(k)
  • Qs(k) +

Qη(k)

slide-114
SLIDE 114

Decorrelation at the retina

slide-115
SLIDE 115

Decorrelation at the retina

slide-116
SLIDE 116

Related ideas

◮ efficient channel utilisation
◮ output entropy maximisation
◮ variance equalisation
◮ redundancy reduction
◮ decorrelation
◮ discovery of independent projections or components