Information Theory
Maneesh Sahani
Gatsby Computational Neuroscience Unit, University College London
February 2019

Quantifying a Code: How much information does a neural response carry about a stimulus? How efficient is a hypothetical …


Kullback-Leibler Divergence

Another useful information-theoretic quantity measures the difference between two distributions:

$$ KL[P(S) \,\|\, Q(S)] = \sum_s P(s) \log \frac{P(s)}{Q(s)} = -H[P] + \underbrace{\sum_s P(s) \log \frac{1}{Q(s)}}_{\text{cross entropy}} $$

This is the excess cost in bits paid by encoding according to $Q$ instead of $P$.

$$ -KL[P \,\|\, Q] = \sum_s P(s) \log \frac{Q(s)}{P(s)} \;\le\; \log \sum_s P(s) \frac{Q(s)}{P(s)} \quad \text{(by Jensen)} \;=\; \log \sum_s Q(s) = \log 1 = 0 $$

So $KL[P \,\|\, Q] \ge 0$, with equality iff $P = Q$.

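As a quick numerical illustration (not part of the slides), here is a minimal NumPy sketch of these definitions for two made-up discrete distributions p and q, confirming that KL equals cross entropy minus entropy and is non-negative.

```python
import numpy as np

def entropy(p):
    """H[P] = -sum_s P(s) log2 P(s), in bits."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def cross_entropy(p, q):
    """sum_s P(s) log2 (1/Q(s)): mean code length if events from P are coded with Q's code."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(q[nz]))

def kl(p, q):
    """KL[P || Q] = cross_entropy(P, Q) - H[P]; always >= 0, zero iff P = Q."""
    return cross_entropy(p, q) - entropy(p)

# Made-up example distributions over 4 symbols.
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

print(entropy(p))           # 1.75 bits
print(cross_entropy(p, q))  # 2.0 bits: cost of coding P with the code optimal for Q
print(kl(p, q))             # 0.25 bits of excess cost
```
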
Mutual Information and KL

$$ I[S; R] = \sum_{s,r} P(s,r) \log \frac{P(s,r)}{P(s)P(r)} = KL\big[ P(S,R) \,\|\, P(S)P(R) \big] $$

Thus:

1. Mutual information is always non-negative: $I[S; R] \ge 0$.
2. Conditioning never increases entropy: $H[S \mid R] \le H[S]$.

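A small sketch along the same lines (again, not from the slides): computing $I[S; R]$ directly as the KL divergence between a made-up joint table and the product of its marginals.

```python
import numpy as np

def mutual_info(joint):
    """I[S;R] = KL[ P(S,R) || P(S)P(R) ] in bits, for a joint table P(s, r)."""
    joint = np.asarray(joint, dtype=float)
    ps = joint.sum(axis=1, keepdims=True)   # P(s)
    pr = joint.sum(axis=0, keepdims=True)   # P(r)
    indep = ps * pr                         # P(s)P(r)
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / indep[nz]))

# Made-up joint distribution: a noisy binary channel with correlated S and R.
P_sr = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

print(mutual_info(P_sr))                              # > 0 because S and R are dependent
print(mutual_info(np.outer([0.5, 0.5], [0.5, 0.5])))  # 0 when S and R are independent
```
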
Multiple Responses

Two responses to the same stimulus, $R_1$ and $R_2$, may provide either more or less information jointly than independently. Writing $I_1 = I[S; R_1]$, $I_2 = I[S; R_2]$ and

$$ I_{12} = I[S; R_1, R_2] = H[R_1, R_2] - H[R_1, R_2 \mid S], $$

note that

$$ R_1 \perp\!\!\!\perp R_2 \;\Rightarrow\; H[R_1, R_2] = H[R_1] + H[R_2] $$
$$ R_1 \perp\!\!\!\perp R_2 \mid S \;\Rightarrow\; H[R_1, R_2 \mid S] = H[R_1 \mid S] + H[R_2 \mid S] $$

                      R1 ⊥⊥ R2    R1 ⊥⊥ R2 | S
    I12 < I1 + I2     no          yes             redundant
    I12 = I1 + I2     yes         yes             independent
    I12 > I1 + I2     yes         no              synergistic
    I12 ? I1 + I2     no          no              any of the above

$I_{12} \ge \max(I_1, I_2)$: the second response cannot destroy information. Thus, the Venn-like diagram with three variables is misleading.

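To make the synergy row concrete, here is a toy construction of my own (not from the slides): with $R_1$, $R_2$ independent fair bits and $S = R_1 \oplus R_2$, each response alone carries zero information about $S$, but together they determine it.

```python
import numpy as np
from itertools import product

def mutual_info(joint):
    """I[X;Y] in bits from a joint table P(x, y) (rows = x, columns = y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / (px * py)[nz]))

# Toy synergistic code: R1, R2 independent fair bits, S = R1 XOR R2.
# Build P(s, (r1, r2)) as a table with rows s and columns indexed by 2*r1 + r2.
P = np.zeros((2, 4))
for r1, r2 in product([0, 1], [0, 1]):
    s = r1 ^ r2
    P[s, 2 * r1 + r2] = 0.25

I_12 = mutual_info(P)                              # I[S; R1, R2]
I_1 = mutual_info(P.reshape(2, 2, 2).sum(axis=2))  # I[S; R1] (marginalise out R2)
I_2 = mutual_info(P.reshape(2, 2, 2).sum(axis=1))  # I[S; R2] (marginalise out R1)

print(I_1, I_2, I_12)   # 0.0, 0.0, 1.0  ->  I_12 > I_1 + I_2: synergy
```
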
Data Processing Inequality

Suppose $S \to R_1 \to R_2$ form a Markov chain; that is, $R_2 \perp\!\!\!\perp S \mid R_1$. Then

$$ P(R_2, S \mid R_1) = P(R_2 \mid R_1)\, P(S \mid R_1) \;\Rightarrow\; P(S \mid R_1, R_2) = P(S \mid R_1). $$

Thus,

$$ H[S \mid R_2] \ge H[S \mid R_1, R_2] = H[S \mid R_1] \;\Rightarrow\; I[S; R_2] \le I[S; R_1]. $$

So any computation based on $R_1$ that does not have separate access to $S$ cannot add information (in the Shannon sense) about the world.

Equality holds iff $S \to R_2 \to R_1$ as well. In this case $R_2$ is called a sufficient statistic for $S$.

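A numerical check of the inequality on a hypothetical chain of two binary symmetric channels (my own example, not from the slides):

```python
import numpy as np

def mutual_info(joint):
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / (px * py)[nz]))

def bsc(eps):
    """Binary symmetric channel P(out | in) with flip probability eps."""
    return np.array([[1 - eps, eps],
                     [eps, 1 - eps]])

p_s = np.array([0.5, 0.5])   # uniform binary source
A = bsc(0.1)                 # S  -> R1
B = bsc(0.2)                 # R1 -> R2 (no separate access to S)

P_s_r1 = p_s[:, None] * A    # joint P(s, r1)
P_s_r2 = P_s_r1 @ B          # joint P(s, r2) = sum_r1 P(s, r1) P(r2 | r1)

print(mutual_info(P_s_r1))   # I[S; R1] ~ 0.53 bits
print(mutual_info(P_s_r2))   # I[S; R2] ~ 0.17 bits <= I[S; R1], as the DPI requires
```
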
Entropy Rate

So far we have discussed $S$ and $R$ as single (or iid) random variables. But real stimuli and responses form time series.

Let $\mathcal S = \{S_1, S_2, S_3, \ldots\}$ be a stochastic process. By the chain rule,

$$ H[S_1, S_2, \ldots, S_n] = H[S_n \mid S_1, \ldots, S_{n-1}] + H[S_1, \ldots, S_{n-1}] = H[S_n \mid S_1, \ldots, S_{n-1}] + H[S_{n-1} \mid S_1, \ldots, S_{n-2}] + \cdots + H[S_1]. $$

The entropy rate of $\mathcal S$ is defined as

$$ H[\mathcal S] = \lim_{n \to \infty} \frac{H[S_1, S_2, \ldots, S_n]}{n}, $$

or alternatively as

$$ H[\mathcal S] = \lim_{n \to \infty} H[S_n \mid S_1, S_2, \ldots, S_{n-1}]; $$

the two limits agree for stationary processes.

If $S_i \stackrel{\text{iid}}{\sim} P(S)$ then $H[\mathcal S] = H[S]$. If $\mathcal S$ is Markov (and stationary) then $H[\mathcal S] = H[S_n \mid S_{n-1}]$.

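A sketch (not from the slides) for a hypothetical two-state stationary Markov chain, comparing the block entropy per symbol with the conditional entropy rate $\sum_i \pi_i H(T_{i\cdot})$:

```python
import numpy as np
from itertools import product

# Hypothetical two-state Markov chain (rows = current state, columns = next state).
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Stationary distribution: left eigenvector of T with eigenvalue 1.
w, v = np.linalg.eig(T.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()

def H(p):
    p = np.asarray(p, float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

# Entropy rate of a stationary Markov chain: H[S_n | S_{n-1}] = sum_i pi_i H(T[i, :]).
rate = sum(pi[i] * H(T[i]) for i in range(2))

# Block entropy H[S_1, ..., S_n] / n, by enumerating all length-n sequences.
def block_entropy(n):
    probs = []
    for seq in product(range(2), repeat=n):
        p = pi[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= T[a, b]
        probs.append(p)
    return H(probs)

for n in (1, 2, 5, 10):
    print(n, block_entropy(n) / n)   # decreases towards the entropy rate

print("entropy rate:", rate)
```
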
Continuous Random Variables

The discussion so far has involved discrete $S$ and $R$. Now let $S \in \mathbb{R}$ with density $p(s)$. What is its entropy? Suppose we discretise with bin length $\Delta s$:

$$ H_\Delta[S] = -\sum_i p(s_i)\,\Delta s \, \log\big( p(s_i)\,\Delta s \big) $$
$$ = -\sum_i p(s_i)\,\Delta s \,\big( \log p(s_i) + \log \Delta s \big) $$
$$ = -\sum_i p(s_i)\,\Delta s \log p(s_i) \;-\; \log \Delta s \sum_i p(s_i)\,\Delta s $$
$$ = -\Delta s \sum_i p(s_i) \log p(s_i) \;-\; \log \Delta s $$
$$ \to -\int ds\, p(s) \log p(s) \;+\; \infty \quad \text{as } \Delta s \to 0. $$

We define the differential entropy:

$$ h(S) = -\int ds\, p(s) \log p(s). $$

Note that $h(S)$ can be negative, and can be $\pm\infty$.

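A numerical illustration (not from the slides) of the divergence: for a standard Gaussian, $H_\Delta[S] \approx h(S) - \log_2 \Delta s$, growing without bound as the bins shrink.

```python
import numpy as np

def discretised_entropy(pdf, lo, hi, ds):
    """H_Delta[S] for bins of width ds, approximating each bin mass by pdf(s_i) * ds."""
    s = np.arange(lo, hi, ds) + ds / 2          # bin centres
    mass = pdf(s) * ds
    mass = mass[mass > 0]
    return -np.sum(mass * np.log2(mass))

gauss = lambda s: np.exp(-s**2 / 2) / np.sqrt(2 * np.pi)
h_true = 0.5 * np.log2(2 * np.pi * np.e)        # differential entropy of N(0,1), ~2.05 bits

for ds in (0.5, 0.1, 0.01):
    H_delta = discretised_entropy(gauss, -10, 10, ds)
    # H_Delta ~ h(S) - log2(ds), diverging as ds -> 0
    print(ds, H_delta, h_true - np.log2(ds))
```
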
Continuous Random Variables

We can define other information-theoretic quantities similarly. The conditional differential entropy is

$$ h(S \mid R) = -\int ds\, dr\, p(s,r) \log p(s \mid r) $$

and, like the differential entropy itself, may be poorly behaved. The mutual information, however, is well defined:

$$ I_\Delta[S; R] = H_\Delta[S] - H_\Delta[S \mid R] $$
$$ = \Big( -\Delta s \sum_i p(s_i) \log p(s_i) - \log \Delta s \Big) - \int dr\, p(r) \Big( -\Delta s \sum_i p(s_i \mid r) \log p(s_i \mid r) - \log \Delta s \Big) $$
$$ \to h(S) - h(S \mid R), $$

since the $\log \Delta s$ terms cancel; the same is true of other KL divergences.

Maximum Entropy Distributions

1. $H[R_1, R_2] \le H[R_1] + H[R_2]$, with equality iff $R_1 \perp\!\!\!\perp R_2$.

2. Let $\int ds\, p(s) f(s) = a$ for some function $f$. What distribution has maximum entropy? Use Lagrange multipliers:

$$ L = \int ds\, p(s) \log p(s) - \lambda_0 \Big( \int ds\, p(s) - 1 \Big) - \lambda_1 \Big( \int ds\, p(s) f(s) - a \Big) $$
$$ \frac{\delta L}{\delta p(s)} = 1 + \log p(s) - \lambda_0 - \lambda_1 f(s) = 0 \;\Rightarrow\; \log p(s) = \lambda_0 + \lambda_1 f(s) - 1 \;\Rightarrow\; p(s) = \frac{1}{Z} e^{\lambda_1 f(s)} $$

The constants $\lambda_0$ and $\lambda_1$ can be found by solving the constraint equations. Thus:

- $f(s) = s \;\Rightarrow\; p(s) = \frac{1}{Z} e^{\lambda_1 s}$: Exponential (need $p(s) = 0$ for $s < T$).
- $f(s) = s^2 \;\Rightarrow\; p(s) = \frac{1}{Z} e^{\lambda_1 s^2}$: Gaussian.

Both results together imply that the maximum entropy point process (for fixed mean arrival rate) is homogeneous Poisson, with independent, exponentially distributed ISIs.

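As a sanity check (my own addition, using standard closed-form entropies rather than anything from the slides), among unit-variance densities the Gaussian indeed has the largest differential entropy:

```python
import numpy as np

# Differential entropies (in bits) of three densities, all with variance 1.
h_gauss = 0.5 * np.log2(2 * np.pi * np.e * 1.0)   # N(0, 1)
b = 1 / np.sqrt(2)                                # Laplace scale: variance = 2 b^2 = 1
h_laplace = np.log2(2 * np.e * b)
width = np.sqrt(12.0)                             # uniform width: variance = width^2 / 12 = 1
h_uniform = np.log2(width)

print(h_gauss, h_laplace, h_uniform)   # ~2.05 > ~1.94 > ~1.79: Gaussian wins, as max-ent predicts
```
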
Channels

We now direct our focus to the conditional $P(R \mid S)$, which defines the channel linking $S$ to $R$:

$$ S \;\xrightarrow{\;P(R|S)\;}\; R $$

The mutual information

$$ I[S; R] = \sum_{s,r} P(s,r) \log \frac{P(s,r)}{P(s)P(r)} = \sum_{s,r} P(s)\, P(r|s) \log \frac{P(r|s)}{P(r)} $$

depends on the marginals $P(s)$ and $P(r) = \sum_s P(r|s) P(s)$ as well, and thus is unsuitable for characterising the conditional alone. Instead, we characterise the channel by its capacity

$$ C_{R|S} = \sup_{P(s)} I[S; R]. $$

Thus the capacity gives the theoretical limit on the amount of information that can be transmitted over the channel. Clearly, this is limited by the properties of the noise.

Joint source-channel coding theorem

The remarkable central result of information theory.

$$ S \;\xrightarrow{\;\text{encoder}\;}\; \tilde S \;\xrightarrow{\;\text{channel } C_{R|\tilde S}\;}\; R \;\xrightarrow{\;\text{decoder}\;}\; T $$

Any source ensemble $S$ with entropy $H[S] < C_{R|\tilde S}$ can be transmitted (in sufficiently long blocks) with $P_{\text{error}} \to 0$.

The proof is beyond our scope. Some of the key ideas that appear in the proof are:
- block coding
- error correction
- joint typicality
- random codes

The channel coding problem

$$ S \;\xrightarrow{\;\text{encoder}\;}\; \tilde S \;\xrightarrow{\;\text{channel } C_{R|\tilde S}\;}\; R \;\xrightarrow{\;\text{decoder}\;}\; T $$

Given the channel $P(R \mid \tilde S)$ and source $P(S)$, find an encoding $P(\tilde S \mid S)$ (which may be deterministic) to maximise $I[S; R]$. By the data processing inequality and the definition of capacity:

$$ I[S; R] \le I[\tilde S; R] \le C_{R|\tilde S}. $$

By the JSCT, equality can be achieved (in the limit of increasing block size). Thus $I[\tilde S; R]$ should saturate $C_{R|\tilde S}$.

See homework for an algorithm (Blahut-Arimoto) to find the $P(\tilde S)$ that saturates $C_{R|\tilde S}$ for a general discrete channel.

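The algorithm itself is left to the homework; the following is a minimal sketch of one standard form of Blahut-Arimoto, written from the generic textbook recipe rather than the course materials, so treat the details as illustrative.

```python
import numpy as np

def blahut_arimoto(W, n_iter=200):
    """Sketch of Blahut-Arimoto for a discrete channel W[x, y] = P(y | x).

    Returns an estimate of the capacity-achieving input distribution p(x)
    and of the capacity in bits.
    """
    W = np.asarray(W, dtype=float)
    n_x = W.shape[0]
    p = np.full(n_x, 1.0 / n_x)                  # start from a uniform input
    for _ in range(n_iter):
        # Posterior q(x | y) under the current input distribution.
        joint = p[:, None] * W                   # p(x) P(y|x)
        q = joint / joint.sum(axis=0, keepdims=True)
        # Update: p(x) proportional to exp( sum_y P(y|x) log q(x|y) ).
        log_c = np.sum(W * np.log(q + 1e-300), axis=1)
        p = np.exp(log_c - log_c.max())
        p /= p.sum()
    # Mutual information I[X; Y] under the final p, in bits.
    joint = p[:, None] * W
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    I = np.sum(joint[nz] * np.log2(joint[nz] / (p[:, None] * py)[nz]))
    return p, I

# Example: binary symmetric channel with flip probability 0.1.
eps = 0.1
W = np.array([[1 - eps, eps],
              [eps, 1 - eps]])
p_star, C = blahut_arimoto(W)
print(p_star, C)
```

For this binary symmetric channel the optimal input is uniform and the capacity is $1 - H_2(0.1) \approx 0.53$ bits, which the iteration recovers.
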
Entropy maximisation

$$ I[\tilde S; R] = \underbrace{H[R]}_{\text{marginal entropy}} - \underbrace{H[R \mid \tilde S]}_{\text{noise entropy}} $$

If the noise is small and "constant", maximising $I[\tilde S; R]$ amounts to maximising the marginal entropy, i.e. maximising $H[\tilde S]$.

Consider a (rate coding) neuron with $r \in [0, r_{\max}]$:

$$ h(r) = -\int_0^{r_{\max}} dr\, p(r) \log p(r). $$

To maximise the marginal entropy, we add a Lagrange multiplier ($\mu$) to enforce normalisation and then differentiate:

$$ \frac{\delta}{\delta p(r)} \left[ h(r) - \mu \int_0^{r_{\max}} dr\, p(r) \right] = \begin{cases} -\log p(r) - 1 - \mu & r \in [0, r_{\max}] \\ 0 & \text{otherwise} \end{cases} $$

Setting this to zero gives $p(r) = \text{const}$ for $r \in [0, r_{\max}]$, i.e.

$$ p(r) = \begin{cases} 1/r_{\max} & r \in [0, r_{\max}] \\ 0 & \text{otherwise.} \end{cases} $$

Histogram Equalisation

Suppose $r = \tilde s + \eta$, where $\eta$ represents a (relatively small) source of noise, and consider a deterministic encoding $\tilde s = f(s)$. How do we ensure that $p(r) = 1/r_{\max}$?

$$ \frac{1}{r_{\max}} = p(r) \approx p(\tilde s) = \frac{p(s)}{f'(s)} \;\Rightarrow\; f'(s) = r_{\max}\, p(s) \;\Rightarrow\; f(s) = r_{\max} \int_{-\infty}^{s} ds'\, p(s'). $$

[Figure: the resulting encoder $\tilde s = f(s)$, a cumulative (sigmoidal) curve; vertical axis $\tilde s$ from 0 to 1, horizontal axis $s$ from $-3$ to 3.]

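A sketch (not from the slides) of the equalising encoder built from samples of a hypothetical Gaussian stimulus, using the empirical CDF in place of the integral:

```python
import numpy as np

rng = np.random.default_rng(0)
r_max = 1.0

# Hypothetical stimulus distribution: standard Gaussian samples.
s = rng.normal(size=100_000)

# Empirical CDF gives the equalising encoder f(s) = r_max * P(S <= s).
s_sorted = np.sort(s)
def f(x):
    return r_max * np.searchsorted(s_sorted, x) / len(s_sorted)

r = f(s)    # encoded responses

# The encoded responses should be roughly uniform on [0, r_max]:
hist, _ = np.histogram(r, bins=10, range=(0, r_max), density=True)
print(hist)   # every bin close to 1/r_max = 1
```
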
Histogram Equalisation: Laughlin (1981)

Gaussian channel

A similar idea of output-entropy maximisation appears in the theory of Gaussian channel coding, where it is called the water filling algorithm.

We will need the differential entropy of a (multivariate) Gaussian distribution. Let

$$ p(Z) = |2\pi\Sigma|^{-1/2} \exp\!\Big( -\tfrac12 (Z-\mu)^T \Sigma^{-1} (Z-\mu) \Big). $$

Then (with $\log$ in bits, so the quadratic term carries a factor of $\log e$):

$$ h(Z) = -\int dZ\, p(Z) \Big( -\tfrac12 \log|2\pi\Sigma| - \tfrac12 (Z-\mu)^T \Sigma^{-1} (Z-\mu) \log e \Big) $$
$$ = \tfrac12 \log|2\pi\Sigma| + \tfrac12 \log e \int dZ\, p(Z)\, \mathrm{Tr}\!\big[ \Sigma^{-1} (Z-\mu)(Z-\mu)^T \big] $$
$$ = \tfrac12 \log|2\pi\Sigma| + \tfrac12 \log e \; \mathrm{Tr}\!\big[ \Sigma^{-1}\Sigma \big] $$
$$ = \tfrac12 \log|2\pi\Sigma| + \tfrac{d}{2} \log e $$
$$ = \tfrac12 \log|2\pi e \Sigma|. $$

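A quick Monte Carlo check of this formula (my own sketch, not from the slides): since $h(Z) = E[-\log p(Z)]$, the sample average of $-\log p$ at draws from $p$ should match $\tfrac12 \log|2\pi e \Sigma|$. The covariance below is a made-up example.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up 2-D covariance.
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
mvn = multivariate_normal(mean=np.zeros(2), cov=Sigma)

# Closed form: h(Z) = 1/2 log2 |2 pi e Sigma| (in bits).
h_closed = 0.5 * np.log2(np.linalg.det(2 * np.pi * np.e * Sigma))

# Monte Carlo: h(Z) = E[ -log p(Z) ], estimated from samples (nats -> bits).
z = mvn.rvs(size=200_000, random_state=0)
h_mc = -np.mean(mvn.logpdf(z)) / np.log(2)

print(h_closed, h_mc)   # the two agree to roughly two decimal places
```
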
Gaussian channel – white noise

Let $R = \tilde S + Z$ with additive noise $Z \sim \mathcal N(0, k_z)$. Then

$$ I[\tilde S; R] = h(R) - h(R \mid \tilde S) = h(R) - h(\tilde S + Z \mid \tilde S) = h(R) - h(Z) = h(R) - \tfrac12 \log 2\pi e k_z. $$

Without constraint, $h(R) \to \infty$ and $C_{R|\tilde S} = \infty$. Therefore, constrain the input power: $\frac1n \sum_{i=1}^n \tilde s_i^2 \le P$, i.e. $\langle \tilde S^2 \rangle \le P$. Then

$$ \langle R^2 \rangle = \big\langle (\tilde S + Z)^2 \big\rangle = \langle \tilde S^2 \rangle + \langle Z^2 \rangle + 2\langle \tilde S Z \rangle \le P + k_z + 0 $$
$$ \Rightarrow\; h(R) \le h\big(\mathcal N(0, P + k_z)\big) = \tfrac12 \log 2\pi e (P + k_z) $$
$$ \Rightarrow\; I[\tilde S; R] \le \tfrac12 \log 2\pi e (P + k_z) - \tfrac12 \log 2\pi e k_z = \tfrac12 \log\Big( 1 + \frac{P}{k_z} \Big) $$
$$ \Rightarrow\; C_{R|\tilde S} = \tfrac12 \log\Big( 1 + \frac{P}{k_z} \Big). $$

The capacity is achieved iff $R \sim \mathcal N(0, P + k_z)$, i.e. $\tilde S \sim \mathcal N(0, P)$.

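A small sketch (not from the slides) evaluating this capacity and confirming that, with the capacity-achieving Gaussian input, $h(R) - h(Z)$ gives the same number:

```python
import numpy as np

def capacity_awgn(P, k_z):
    """Capacity of the scalar Gaussian channel R = S + Z, Z ~ N(0, k_z), E[S^2] <= P (bits/use)."""
    return 0.5 * np.log2(1 + P / k_z)

P, k_z = 4.0, 1.0
print(capacity_awgn(P, k_z))     # ~1.16 bits per channel use

# With the capacity-achieving input S ~ N(0, P), R is Gaussian with variance P + k_z,
# so I[S; R] = h(R) - h(Z) reduces to the same expression:
h = lambda var: 0.5 * np.log2(2 * np.pi * np.e * var)   # differential entropy of N(0, var)
print(h(P + k_z) - h(k_z))       # equals capacity_awgn(P, k_z)
```
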
Gaussian channel – correlated noise

Now consider a vector Gaussian channel: $R = \tilde S + Z$ with $\tilde S = (\tilde S_1, \ldots, \tilde S_d)$, $Z = (Z_1, \ldots, Z_d) \sim \mathcal N(0, K_z)$, $R = (R_1, \ldots, R_d)$, and power constraint $\frac1d \mathrm{Tr}\big[ \langle \tilde S \tilde S^T \rangle \big] \le P$.

Following the same approach as before:

$$ I[\tilde S; R] = h(R) - h(Z) \le \tfrac12 \log\big( (2\pi e)^d |K_{\tilde s} + K_z| \big) - \tfrac12 \log\big( (2\pi e)^d |K_z| \big), $$

so $C_{R|\tilde S}$ is achieved when $\tilde S$ (and thus $R$) is Gaussian, with $|K_{\tilde s} + K_z|$ maximised subject to $\frac1d \mathrm{Tr}[K_{\tilde s}] \le P$.

Diagonalise $K_z$; then $K_{\tilde s}$ is diagonal in the same basis. For stationary noise (with respect to the dimension indexed by $d$) this can be achieved by a Fourier transform, so index the diagonal elements by $\omega$:

$$ k^*_{\tilde s}(\omega) = \operatorname*{argmax}_{k_{\tilde s}} \sum_\omega \log\big( k_{\tilde s}(\omega) + k_z(\omega) \big) \quad \text{such that} \quad \frac1d \sum_\omega k_{\tilde s}(\omega) \le P. $$

Water filling

Assume that the optimum is achieved at maximum input power. Introducing a Lagrange multiplier $\lambda$ for the power constraint,

$$ k^*_{\tilde s}(\omega) = \operatorname*{argmax}_{k_{\tilde s} \ge 0} \; \sum_\omega \log\big( k_{\tilde s}(\omega) + k_z(\omega) \big) - \lambda \Big( \frac1d \sum_\omega k_{\tilde s}(\omega) - P \Big). $$

Setting the derivative with respect to each $k_{\tilde s}(\omega)$ to zero gives $k_{\tilde s}(\omega) + k_z(\omega) = \text{const}$ wherever $k^*_{\tilde s}(\omega) > 0$: signal power is poured preferentially into the low-noise bands up to a common "water level", which is what gives the algorithm its name.

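A sketch of the resulting allocation (my own implementation of the standard recipe, not taken from the course materials): $k_{\tilde s}(\omega) = \max(0, \nu - k_z(\omega))$, with the level $\nu$ found by bisection so that the mean power equals $P$. The noise spectrum is a made-up example.

```python
import numpy as np

def water_fill(k_z, P, iters=100):
    """Water-filling power allocation: k_s(w) = max(0, nu - k_z(w)), with the
    water level nu chosen by bisection so that the mean allocated power is P."""
    k_z = np.asarray(k_z, dtype=float)
    lo, hi = k_z.min(), k_z.max() + P
    for _ in range(iters):
        nu = 0.5 * (lo + hi)
        k_s = np.maximum(0.0, nu - k_z)
        if k_s.mean() > P:
            hi = nu
        else:
            lo = nu
    return k_s, nu

# Made-up noise spectrum over a few frequency bands.
k_z = np.array([0.5, 1.0, 2.0, 4.0])
P = 1.0
k_s, nu = water_fill(k_z, P)

print(nu)          # common "water level" (2.5 for this example)
print(k_s)         # power goes to the quiet bands; the noisiest band gets none here
print(k_s.mean())  # equals the power budget P

# Resulting information rate per band, in bits:
print(0.5 * np.log2(1 + k_s / k_z))
```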