

SLIDE 1

Information Theory

Maneesh Sahani Gatsby Computational Neuroscience Unit University College London February 2019

SLIDE 3

Quantifying a Code

◮ How much information does a neural response carry about a stimulus?
◮ How efficient is a hypothetical code, given the statistical behaviour of the components?
◮ How much better could another code do, given the same components?
◮ Is the information carried by different neurons complementary, synergistic (whole is greater than sum of parts), or redundant?
◮ Can further processing extract more information about a stimulus?

Information theory is the mathematical framework within which questions such as these can be framed and answered. Information theory does not directly address:

◮ estimation (but there are some relevant bounds)
◮ computation (but the “information bottleneck” might provide a motivating framework)
◮ representation (but redundancy reduction has obvious information-theoretic connections)

SLIDE 8

Uncertainty and Information

Information is related to the removal of uncertainty.

    S → R → P(S|R)    How informative is R about S?

◮ P(S|R) = (0, 0, 1, 0, . . . , 0) ⇒ high information?
◮ P(S|R) = (1/M, 1/M, . . . , 1/M) ⇒ low information?

But this also depends on P(S). We need to start by considering the uncertainty in a probability distribution → called the entropy.

Let S ∼ P(S). The entropy is the minimum number of bits needed, on average, to specify the value S takes, assuming P(S) is known. Equivalently, it is the minimum average number of yes/no questions needed to guess S.

SLIDE 11

Entropy

◮ Suppose there are M equiprobable stimuli: P(sm) = 1/M.

To specify which stimulus appears on a given trial, we would need to assign each a (binary) number. This would take

    Bs ≤ log2 M + 1    [choosing Bs so that 2^Bs ≥ M]    = −log2(1/M) + 1 bits

◮ Now suppose we code N such stimuli, drawn iid, at once. Then

    BN ≤ log2 M^N + 1 → −N log2(1/M)  as N → ∞

so the cost per stimulus is Bs = BN/N → −log2 p bits, with p = 1/M.

This is called block coding. It is useful for extracting theoretical limits. The nervous system is unlikely to use block codes in time, but may in space.
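The block-coding limit above is easy to check numerically. The sketch below (all numbers are illustrative choices) counts the ⌈log2 M^N⌉ bits needed to index a block of N equiprobable stimuli, and shows the per-stimulus cost falling from a wasteful whole number of bits toward −log2(1/M):

```python
import math

def bits_per_stimulus(M, N):
    """Bits per stimulus when coding blocks of N iid equiprobable stimuli.

    A block of N stimuli has M**N possible values, so it can be indexed
    with ceil(log2(M**N)) = ceil(N log2 M) bits; divide by N to get the
    per-stimulus cost.
    """
    return math.ceil(N * math.log2(M)) / N

M = 5                                # 5 equiprobable stimuli: log2(5) ~ 2.32 bits
single = bits_per_stimulus(M, 1)     # 3 bits: rounding up wastes most of a bit
blocked = bits_per_stimulus(M, 1000) # per-stimulus cost approaches log2 M
print(single, blocked)
```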

SLIDE 16

Entropy

◮ Now suppose stimuli are not equiprobable. Write P(sm) = pm. Then

    P(S1, S2, . . . , SN) = ∏m pm^nm    [where nm = (# of Si = sm)]

As N → ∞ only “typical” sequences, with nm = pmN, have non-zero probability of occurring; and they are all equally likely. This is called the Asymptotic Equipartition Property (or AEP). Thus,

    BN → −log2 ∏m pm^nm = −∑m nm log2 pm = −∑m pmN log2 pm = −N ∑m pm log2 pm = N·H[S]

H[S] = E[−log2 P(S)], also written H[P(S)], is the entropy of the stimulus distribution.

Rather than appealing to typicality, we could instead have used the law of large numbers directly:

    (1/N) log2 P(S1, S2, . . . , SN) = (1/N) log2 ∏i P(Si) = (1/N) ∑i log2 P(Si) → E[log2 P(Si)]  as N → ∞
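Both the entropy formula and its law-of-large-numbers reading can be verified numerically; a small sketch (the distribution and sample size are arbitrary choices):

```python
import math
import random

def entropy(p):
    """H[S] = -sum_m p_m log2 p_m, the entropy of a discrete distribution."""
    return -sum(pm * math.log2(pm) for pm in p if pm > 0)

p = [0.5, 0.25, 0.125, 0.125]
print(entropy(p))                      # 1.75 bits

# Law-of-large-numbers view: -(1/N) log2 P(S_1, ..., S_N) -> H[S]
random.seed(0)
N = 100_000
sample = random.choices(range(len(p)), weights=p, k=N)
est = -sum(math.log2(p[s]) for s in sample) / N
print(est)                             # close to 1.75
```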

SLIDE 21

Conditional Entropy

Entropy is a measure of the “available information” in the stimulus ensemble. Now suppose we measure a particular response r which depends on the stimulus according to P(R|S). How uncertain is the stimulus once we know r?

Bayes’ rule gives us

    P(S|r) = P(r|S)P(S) / ∑s P(r|s)P(s)

so we can write

    H[S|r] = −∑s P(s|r) log2 P(s|r)

The average uncertainty in S for r ∼ P(R) = ∑s P(R|s)P(s) is then

    H[S|R] = ∑r P(r) H[S|r] = −∑s,r P(s, r) log2 P(s|r)

It is easy to show that:

  1. H[S|R] ≤ H[S]
  2. H[S|R] = H[S, R] − H[R]
  3. H[S|R] = H[S] iff S ⊥⊥ R
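As a sketch of the definitions above, H[S|R] can be computed from a small joint distribution and compared against H[S] (the probabilities are purely illustrative):

```python
import math

def H(p):
    """Entropy (bits) of a distribution given as a list of probabilities."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# Joint P(s, r) for a hypothetical 2-stimulus, 2-response example.
P_joint = {('s1', 'r1'): 0.4, ('s1', 'r2'): 0.1,
           ('s2', 'r1'): 0.1, ('s2', 'r2'): 0.4}

P_r = {}
for (s, r), p in P_joint.items():
    P_r[r] = P_r.get(r, 0.0) + p

# H[S|R] = -sum_{s,r} P(s,r) log2 P(s|r), with P(s|r) = P(s,r)/P(r)
H_S_given_R = -sum(p * math.log2(p / P_r[r]) for (s, r), p in P_joint.items())

P_s = {}
for (s, r), p in P_joint.items():
    P_s[s] = P_s.get(s, 0.0) + p
H_S = H(list(P_s.values()))

print(H_S, H_S_given_R)   # knowing r reduces the uncertainty about s
```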

slide-22
SLIDE 22

Average Mutual Information

A natural definition of the average information gained about S from R is I[S; R] = H[S] − H[S|R] Measures reduction in uncertainty due to R.

slide-23
SLIDE 23

Average Mutual Information

A natural definition of the average information gained about S from R is I[S; R] = H[S] − H[S|R] Measures reduction in uncertainty due to R. It follows from the definition that I[S; R] =

  • s

P(s) log 1 P(s) −

  • s,r

P(s, r) log 1 P(s|r)

=

  • s,r

P(s, r) log 1 P(s) +

  • s,r

P(s, r) log P(s|r)

=

  • s,r

P(s, r) log P(s|r) P(s)

=

  • s,r

P(s, r) log P(s, r) P(s)P(r)

= I[R; S]
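The symmetry I[S; R] = I[R; S] can be checked directly from the final expression; a minimal sketch with an illustrative joint distribution:

```python
import math

# Joint P(s, r) for a small hypothetical channel (illustrative numbers).
P = [[0.4, 0.1],
     [0.1, 0.4]]          # rows index s, columns index r

P_s = [sum(row) for row in P]
P_r = [sum(col) for col in zip(*P)]

def mi(P, P_a, P_b):
    """I = sum_{a,b} P(a,b) log2 P(a,b)/(P(a)P(b)) for a joint matrix P."""
    return sum(P[i][j] * math.log2(P[i][j] / (P_a[i] * P_b[j]))
               for i in range(len(P_a)) for j in range(len(P_b))
               if P[i][j] > 0)

I_sr = mi(P, P_s, P_r)
# Symmetry: transposing the joint swaps the roles of S and R.
P_t = [list(col) for col in zip(*P)]
I_rs = mi(P_t, P_r, P_s)
print(I_sr, I_rs)   # equal, by symmetry of the definition
```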

SLIDE 24

Average Mutual Information

The symmetry suggests a Venn-like diagram. [Figure: two overlapping circles H[S] and H[R]; the overlap is I[S; R] = I[R; S], the non-overlapping parts are H[S|R] and H[R|S], and the union is H[S, R].] All of the additive and equality relationships implied by this picture hold for two variables. Unfortunately, we will see that this does not generalise to any more than two.

SLIDE 26

Kullback-Leibler Divergence

Another useful information-theoretic quantity measures the difference between two distributions:

    KL[P(S)‖Q(S)] = ∑s P(s) log P(s)/Q(s) = ∑s P(s) log 1/Q(s) − H[P]

where the first term on the right is the cross entropy. This is the excess cost in bits paid by encoding according to Q instead of P.

    −KL[P‖Q] = ∑s P(s) log Q(s)/P(s)
             ≤ log ∑s P(s) Q(s)/P(s)    by Jensen’s inequality
             = log ∑s Q(s) = log 1 = 0

So KL[P‖Q] ≥ 0, with equality iff P = Q.
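A short numerical sketch of these properties, using illustrative distributions: KL as cross entropy minus entropy, non-negativity, and equality iff P = Q:

```python
import math

def kl(p, q):
    """KL[P||Q] = sum_s P(s) log2 P(s)/Q(s), in bits."""
    return sum(ps * math.log2(ps / qs) for ps, qs in zip(p, q) if ps > 0)

p = [0.5, 0.3, 0.2]
q = [1 / 3, 1 / 3, 1 / 3]

# Excess bits paid per symbol by a code built for Q when data follow P:
cross_entropy = -sum(ps * math.log2(qs) for ps, qs in zip(p, q))
H_p = -sum(ps * math.log2(ps) for ps in p)
print(kl(p, q), cross_entropy - H_p)   # identical by definition

print(kl(p, p))   # 0: equality iff P = Q
```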

SLIDE 29

Mutual Information and KL

    I[S; R] = ∑s,r P(s, r) log P(s, r)/(P(s)P(r)) = KL[P(S, R)‖P(S)P(R)]

Thus:

  1. Mutual information is always non-negative: I[S; R] ≥ 0
  2. Conditioning never increases entropy: H[S|R] ≤ H[S]

SLIDE 37

Multiple Responses

Two responses to the same stimulus, R1 and R2, may provide either more or less information jointly than independently.

    I12 = I[S; R1, R2] = H[R1, R2] − H[R1, R2|S]

    R1 ⊥⊥ R2   ⇒ H[R1, R2] = H[R1] + H[R2]
    R1 ⊥⊥ R2|S ⇒ H[R1, R2|S] = H[R1|S] + H[R2|S]

    R1 ⊥⊥ R2    R1 ⊥⊥ R2|S
    no          yes           I12 < I1 + I2    redundant
    yes         yes           I12 = I1 + I2    independent
    yes         no            I12 > I1 + I2    synergistic
    no          no            ?                any of the above

I12 ≥ max(I1, I2): the second response cannot destroy information. Thus, the Venn-like diagram with three variables is misleading.

SLIDE 41

Data Processing Inequality

Suppose S → R1 → R2 form a Markov chain; that is, R2 ⊥⊥ S | R1. Then

    P(R2, S|R1) = P(R2|R1)P(S|R1) ⇒ P(S|R1, R2) = P(S|R1)

Thus,

    H[S|R2] ≥ H[S|R1, R2] = H[S|R1] ⇒ I[S; R2] ≤ I[S; R1]

So any computation based on R1 that does not have separate access to S cannot add information (in the Shannon sense) about the world.

Equality holds iff S → R2 → R1 as well. In this case R2 is called a sufficient statistic for S.
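The inequality can be checked on a small Markov chain; a sketch in which R2 is a noisy relay of R1, with no separate access to S (all probabilities are illustrative):

```python
import math

def mi_from_joint(pxy):
    """I[X;Y] in bits from a joint distribution given as {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

# Markov chain S -> R1 -> R2 (illustrative conditionals).
P_s = {0: 0.5, 1: 0.5}
P_r1_given_s = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
P_r2_given_r1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.3, 1: 0.7}}

P_sr1, P_sr2 = {}, {}
for s, ps in P_s.items():
    for r1, p1 in P_r1_given_s[s].items():
        P_sr1[(s, r1)] = P_sr1.get((s, r1), 0) + ps * p1
        for r2, p2 in P_r2_given_r1[r1].items():
            P_sr2[(s, r2)] = P_sr2.get((s, r2), 0) + ps * p1 * p2

print(mi_from_joint(P_sr1), mi_from_joint(P_sr2))
# Further processing cannot add information: I[S;R2] <= I[S;R1].
```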

SLIDE 45

Entropy Rate

So far we have discussed S and R as single (or iid) random variables. But real stimuli and responses form time series. Let S = {S1, S2, S3, . . .} be a stochastic process. Then

    H[S1, S2, . . . , Sn] = H[Sn|S1, S2, . . . , Sn−1] + H[S1, S2, . . . , Sn−1]
                          = H[Sn|S1, S2, . . . , Sn−1] + H[Sn−1|S1, S2, . . . , Sn−2] + . . . + H[S1]

The entropy rate of S is defined as

    H[S] = lim n→∞ H[S1, S2, . . . , Sn]/n

or alternatively as

    H[S] = lim n→∞ H[Sn|S1, S2, . . . , Sn−1]

(for stationary processes the two limits agree).

If the Si are iid ∼ P(S), the entropy rate of the process equals the entropy H[S] of a single variable. If S is Markov (and stationary), the entropy rate is H[Sn|Sn−1].
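For a stationary Markov chain the entropy rate H[Sn|Sn−1] reduces to a weighted sum of transition-row entropies under the stationary distribution; a small sketch with an illustrative two-state transition matrix:

```python
import math

# Two-state Markov chain: T[i][j] = P(S_n = j | S_{n-1} = i).
T = [[0.9, 0.1],
     [0.4, 0.6]]

# Stationary distribution of a 2-state chain, in closed form.
pi0 = T[1][0] / (T[0][1] + T[1][0])
pi = [pi0, 1 - pi0]

def H(row):
    """Entropy (bits) of one transition row."""
    return -sum(p * math.log2(p) for p in row if p > 0)

# Entropy rate: H[S_n | S_{n-1}] = sum_i pi_i H(row_i).
rate = sum(pi[i] * H(T[i]) for i in range(2))
print(rate)   # < 1 bit per step, because the chain is predictable
```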

SLIDE 51

Continuous Random Variables

The discussion so far has involved discrete S and R. Now let S ∈ ℝ with density p(s). What is its entropy? Suppose we discretise with bin length ∆s:

    H∆[S] = −∑i p(si)∆s log p(si)∆s
          = −∑i p(si)∆s (log p(si) + log ∆s)
          = −∑i p(si)∆s log p(si) − log ∆s ∑i p(si)∆s
          = −∑i ∆s p(si) log p(si) − log ∆s
          → −∫ds p(s) log p(s) + ∞    as ∆s → 0

We define the differential entropy: h(S) = −∫ds p(s) log p(s).

Note that h(S) can be < 0, and can be ±∞.
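The relation H∆[S] ≈ h(S) − log2 ∆s is easy to see numerically; a sketch for a standard Gaussian, whose differential entropy is ½ log2(2πe) bits (the grid limits and bin widths are arbitrary choices):

```python
import math

def gauss_pdf(s, sigma=1.0):
    """Density of N(0, sigma^2)."""
    return math.exp(-0.5 * (s / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def discretised_entropy(ds, lim=10.0):
    """H_delta = -sum_i p(s_i) ds log2(p(s_i) ds) over a grid covering +/- lim."""
    n = int(2 * lim / ds)
    total = 0.0
    for i in range(n):
        p = gauss_pdf(-lim + (i + 0.5) * ds) * ds   # bin probability
        if p > 0:
            total -= p * math.log2(p)
    return total

h_gauss = 0.5 * math.log2(2 * math.pi * math.e)   # differential entropy, bits
for ds in [0.1, 0.01]:
    # The discretised entropy tracks h(S) - log2(ds), diverging as ds -> 0,
    # while h(S) itself stays finite.
    print(discretised_entropy(ds), h_gauss - math.log2(ds))
```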

SLIDE 54

Continuous Random Variables

We can define other information-theoretic quantities similarly. The conditional differential entropy is

    h(S|R) = −∫ds dr p(s, r) log p(s|r)

and, like the differential entropy itself, may be poorly behaved. The mutual information, however, is well defined:

    I∆[S; R] = H∆[S] − H∆[S|R]
             = (−∑i ∆s p(si) log p(si) − log ∆s) − ∫dr p(r) (−∑i ∆s p(si|r) log p(si|r) − log ∆s)
             → h(S) − h(S|R)

since the log ∆s terms cancel; as are other KL divergences.

SLIDE 59

Maximum Entropy Distributions

  1. H[R1, R2] ≤ H[R1] + H[R2], with equality iff R1 ⊥⊥ R2.

  2. Let ∫ds p(s)f(s) = a for some function f. What distribution has maximum entropy?

Use Lagrange multipliers:

    L = ∫ds p(s) log p(s) − λ0 (∫ds p(s) − 1) − λ1 (∫ds p(s)f(s) − a)

    δL/δp(s) = 1 + log p(s) − λ0 − λ1 f(s) = 0
    ⇒ log p(s) = λ0 + λ1 f(s) − 1
    ⇒ p(s) = (1/Z) e^(λ1 f(s))

The constants λ0 and λ1 can be found by solving the constraint equations. Thus:

    f(s) = s  ⇒ p(s) = (1/Z) e^(λ1 s):   exponential (need p(s) = 0 for s < T).
    f(s) = s² ⇒ p(s) = (1/Z) e^(λ1 s²):  Gaussian.

Both results together ⇒ the maximum entropy point process (for fixed mean arrival rate) is homogeneous Poisson – independent, exponentially distributed ISIs.

SLIDE 62

Channels

We now direct our focus to the conditional P(R|S), which defines the channel linking S to R:

    S −−P(R|S)−→ R

The mutual information

    I[S; R] = ∑s,r P(s, r) log P(s, r)/(P(s)P(r)) = ∑s,r P(s)P(r|s) log P(r|s)/P(r)

depends on the marginals P(s) and P(r) = ∑s P(r|s)P(s) as well, and thus is unsuitable to characterise the conditional alone. Instead, we characterise the channel by its capacity

    CR|S = sup over P(s) of I[S; R]

Thus the capacity gives the theoretical limit on the amount of information that can be transmitted over a channel. Clearly, this is limited by the properties of the noise.

SLIDE 63

Joint source-channel coding theorem

The remarkable central result of information theory:

    S −−encoder−→ S̃ −−channel (CR|S̃)−→ R −−decoder−→ T

Any source ensemble S with entropy H[S] < CR|S̃ can be transmitted (in sufficiently long blocks) with Perror → 0. The proof is beyond our scope. Some of the key ideas that appear in the proof are:

◮ block coding
◮ error correction
◮ joint typicality
◮ random codes

SLIDE 64

The channel coding problem

    S −−encoder−→ S̃ −−channel (CR|S̃)−→ R −−decoder−→ T

Given channel P(R|S̃) and source P(S), find an encoding P(S̃|S) (which may be deterministic) to maximise I[S; R]. By the data processing inequality, and the definition of capacity:

    I[S; R] ≤ I[S̃; R] ≤ CR|S̃

By the JSCT, equality can be achieved (in the limit of increasing block size). Thus I[S̃; R] should saturate CR|S̃. See homework for an algorithm (Blahut-Arimoto) to find the P(S̃) that saturates CR|S̃ for a general discrete channel.
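A minimal sketch of the Blahut-Arimoto iteration referred to above, applied to a binary symmetric channel (the channel and flip probability are illustrative choices; for a BSC the capacity is known in closed form, 1 − H[f], which the iteration should reproduce):

```python
import math

def blahut_arimoto(Q, iters=500):
    """Capacity (bits) of a discrete channel Q[x][y] = P(y|x).

    Alternates between the optimal posterior phi(x|y) for the current
    input distribution p(x), and the multiplicative update
    p(x) <- p(x) exp(sum_y Q(y|x) log(Q(y|x)/r(y))) / Z,
    which converges to the p achieving C = sup_p I(X;Y).
    """
    nx, ny = len(Q), len(Q[0])
    p = [1.0 / nx] * nx
    for _ in range(iters):
        r = [sum(p[x] * Q[x][y] for x in range(nx)) for y in range(ny)]
        w = [math.exp(sum(Q[x][y] * math.log(p[x] * Q[x][y] / r[y])
                          for y in range(ny) if Q[x][y] > 0))
             for x in range(nx)]
        z = sum(w)
        p = [wx / z for wx in w]
    # Mutual information at the final input distribution.
    r = [sum(p[x] * Q[x][y] for x in range(nx)) for y in range(ny)]
    return sum(p[x] * Q[x][y] * math.log2(Q[x][y] / r[y])
               for x in range(nx) for y in range(ny) if Q[x][y] > 0)

f = 0.1                         # binary symmetric channel, flip probability f
Q = [[1 - f, f], [f, 1 - f]]
C = blahut_arimoto(Q)
print(C, 1 + f * math.log2(f) + (1 - f) * math.log2(1 - f))   # both ~0.531
```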

SLIDE 70

Entropy maximisation

    I[S̃; R] = H[R] − H[R|S̃]
               (marginal entropy)  (noise entropy)

If the noise is small and “constant” ⇒ maximise the marginal entropy ⇒ maximise H[S̃].

Consider a (rate coding) neuron with r ∈ [0, rmax]:

    h(r) = −∫0^rmax dr p(r) log p(r)

To maximise the marginal entropy, we add a Lagrange multiplier (µ) to enforce normalisation and then differentiate:

    δ/δp(r) [h(r) − µ ∫0^rmax p(r)] = −log p(r) − 1 − µ    for r ∈ [0, rmax]  (0 otherwise)

Setting this to zero ⇒ p(r) = const for r ∈ [0, rmax], i.e.

    p(r) = 1/rmax for r ∈ [0, rmax], 0 otherwise.
slide-71
SLIDE 71

Histogram Equalisation

Suppose r = ˜ s + η where η represents a (relatively small) source of noise. Consider deterministic encoding ˜ s = f(s). How do we ensure that p(r) = 1/rmax? 1 rmax = p(r) ≈ p(˜ s) = p(s) f ′(s)

⇒ f ′(s) = rmax p(s) ⇒ f(s) = rmax s

−∞

ds′ p(s′)

˜

s

−3 −2 −1 1 2 3 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

s

slide-72
SLIDE 72

Histogram Equalisation

Laughlin (1981)
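Histogram equalisation is straightforward to demonstrate by sampling: with f(s) = rmax ∫−∞^s ds′ p(s′), the encoded values are uniform on [0, rmax] whatever the stimulus distribution. A sketch assuming a standard Gaussian stimulus (rmax and the sample count are arbitrary choices):

```python
import math
import random

random.seed(1)
rmax = 10.0

def cdf(s):
    """Standard-normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(s / math.sqrt(2)))

# Encode: s~ = f(s) = rmax * CDF(s), the histogram-equalising transfer function.
samples = [random.gauss(0, 1) for _ in range(50_000)]
encoded = [rmax * cdf(s) for s in samples]

# Uniformity check: each tenth of [0, rmax] should catch ~10% of samples.
counts = [0] * 10
for r in encoded:
    counts[min(int(r), 9)] += 1
fractions = [c / len(encoded) for c in counts]
print(fractions)   # all close to 0.1
```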

SLIDE 79

Gaussian channel

A similar idea of output-entropy maximisation appears in the theory of Gaussian channel coding, where it is called the water filling algorithm. We will need the differential entropy of a (multivariate) Gaussian distribution. Let

    p(Z) = |2πΣ|^(−1/2) exp(−½ (Z − µ)ᵀ Σ⁻¹ (Z − µ))

Then

    h(Z) = −∫dZ p(Z) (−½ log |2πΣ| − ½ (Z − µ)ᵀ Σ⁻¹ (Z − µ) log e)
         = ½ log |2πΣ| + ½ (log e) ∫dZ p(Z) Tr[Σ⁻¹ (Z − µ)(Z − µ)ᵀ]
         = ½ log |2πΣ| + ½ (log e) Tr[Σ⁻¹ Σ]
         = ½ log |2πΣ| + ½ d (log e)
         = ½ log |2πeΣ|

slide-80
SLIDE 80

Gaussian channel – white noise

+

  • S

R Z

∼ N (0, kz)

slide-81
SLIDE 81

Gaussian channel – white noise

+

  • S

R Z

∼ N (0, kz)

I[ S; R] = h(R) − h(R| S)

slide-82
SLIDE 82

Gaussian channel – white noise

+

  • S

R Z

∼ N (0, kz)

I[ S; R] = h(R) − h(R| S)

= h(R) − h(

S + Z| S)

slide-83
SLIDE 83

Gaussian channel – white noise

+

  • S

R Z

∼ N (0, kz)

I[ S; R] = h(R) − h(R| S)

= h(R) − h(

S + Z| S)

= h(R) − h(Z)

slide-84
SLIDE 84

Gaussian channel – white noise

+

  • S

R Z

∼ N (0, kz)

I[ S; R] = h(R) − h(R| S)

= h(R) − h(

S + Z| S)

= h(R) − h(Z) ⇒ I[

S; R] = h(R) − 1 2 log 2πekz.

slide-93
SLIDE 93

Gaussian channel – white noise

R = S + Z, with noise Z ∼ N(0, k_z) and power constraint ⟨S²⟩ ≤ P.

I[S; R] = h(R) − h(R|S)
        = h(R) − h(S + Z|S)
        = h(R) − h(Z)
⇒ I[S; R] = h(R) − ½ log 2πe k_z.

Without constraint, h(R) → ∞ and C_{R|S} = ∞. Therefore, constrain (1/n) Σᵢ₌₁ⁿ s̃ᵢ² ≤ P.

Then,

⟨R²⟩ = ⟨(S + Z)²⟩ = ⟨S²⟩ + ⟨Z²⟩ + 2⟨SZ⟩ ≤ P + k_z + 0

⇒ h(R) ≤ h(N(0, P + k_z)) = ½ log 2πe(P + k_z)

⇒ I[S; R] ≤ ½ log 2πe(P + k_z) − ½ log 2πe k_z = ½ log(1 + P/k_z)

C_{R|S} = ½ log(1 + P/k_z)

The capacity is achieved iff R ∼ N(0, P + k_z), i.e. S ∼ N(0, P).
slide-99
SLIDE 99

Gaussian channel – correlated noise

Now consider a vector Gaussian channel: R = S + Z, with
S = (S₁, …, S_d), R = (R₁, …, R_d), Z = (Z₁, …, Z_d) ∼ N(0, K_z), and (1/d) Tr⟨S Sᵀ⟩ ≤ P.

Following the same approach as before:

I[S; R] = h(R) − h(Z) ≤ ½ log((2πe)^d |K_s̃ + K_z|) − ½ log((2πe)^d |K_z|),

⇒ C_{R|S} is achieved when S (and thus R) ∼ N, with |K_s̃ + K_z| maximised given (1/d) Tr[K_s̃] ≤ P.

Diagonalise K_z ⇒ K_s̃ is diagonal in the same basis.

For stationary noise (wrt the dimension indexed by d) this can be achieved by a Fourier transform ⇒ index the diagonal elements by ω.

k*_s̃(ω) = argmax Σ_ω log(k_s̃(ω) + k_z(ω))   such that   (1/d) Σ_ω k_s̃(ω) ≤ P
SLIDE 104

Water filling

Assume that optimum is achieved for max. input power. k∗

˜ s (ω) = argmax

  • ω

log (k˜

s(ω) + kz(ω)) − λ

  • 1

d

  • ω

s(ω) − P

1 k∗

˜ s (ω) + kz(ω) − λ

d = 0

⇒ k∗

˜ s (ω) + kz(ω) = ν

(const.) (k˜

s ≥ 0) ⇒ k∗ ˜ s (ω) = [ν − kz(ω)]+

Waterfilling: choose ν so

  • ω

s(ω) = d · P

ν kz(ω) ks(ω) ω k(ω)

R is white or decorrelated (within power budget) ⇒variance equalisation.
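The water level ν can be found by bisection, since the allocated power Σ_ω [ν − k_z(ω)]₊ grows monotonically with ν. A minimal sketch (the example noise spectrum is illustrative):

```python
import numpy as np

def water_fill(kz: np.ndarray, P: float, tol: float = 1e-12) -> np.ndarray:
    """Return k_s(w) = [nu - kz(w)]_+ with the water level nu chosen so mean(k_s) = P."""
    lo, hi = float(kz.min()), float(kz.max()) + P * len(kz)  # brackets nu
    while hi - lo > tol:
        nu = 0.5 * (lo + hi)
        if np.maximum(nu - kz, 0.0).mean() > P:
            hi = nu   # water level too high: over budget
        else:
            lo = nu
    return np.maximum(0.5 * (lo + hi) - kz, 0.0)

kz = np.array([0.1, 0.5, 1.0, 2.0])   # noise power rising with frequency
ks = water_fill(kz, P=0.5)

assert abs(ks.mean() - 0.5) < 1e-9    # power budget met with equality
# Wherever signal power is assigned, k_s + k_z sits at the common level nu (= 1.2 here);
# the noisiest channel (kz = 2.0 > nu) receives no power at all.
assert np.allclose((ks + kz)[ks > 1e-9], 1.2)
assert ks[-1] == 0.0
```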

slide-109
SLIDE 109

Decorrelation at the retina

Atick and Redlich (1992) argued that the retina decorrelates natural spatial statistics.

RGCs exhibit roughly linear (centre–surround) processing:

r_a − ⟨r_a⟩ = ∫ dx D_s(x − a) s(x)
                 [filter]    [stimulus]

Therefore the correlation (covariance) between cells is

Q_r(a, b) = ⟨ ∫ dx dy D_s(x − a) D_s(y − b) s(x) s(y) ⟩
          = ∫ dx dy D_s(x − a) D_s(y − b) ⟨s(x) s(y)⟩,   with ⟨s(x) s(y)⟩ = Q_s(x, y)

Using (spatial) stationarity, we can transform to the Fourier domain:

Q̃_r(k) = |D̃_s(k)|² Q̃_s(k)

and thus output decorrelation requires

|D̃_s(k)|² ∝ 1 / Q̃_s(k)
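A quick numerical illustration (a sketch, with an arbitrary k₀ and a 1-D stationary signal standing in for an image row): shaping white noise to have spectrum Q̃_s(k), then filtering with |D̃(k)| ∝ 1/√Q̃_s(k), yields an output whose power spectrum is flat, i.e. decorrelated:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 256, 4000

# Target power spectrum Q_s(k) ∝ 1/(|k|^2 + k0^2), as for natural-image statistics.
k = np.fft.fftfreq(n)
Qs = 1.0 / (k**2 + 0.05**2)

# Draw stationary signals with spectrum Qs by shaping white noise in Fourier space,
# then apply the decorrelating filter |D(k)| ∝ 1/sqrt(Qs(k)).
white = rng.standard_normal((trials, n))
s = np.fft.ifft(np.fft.fft(white, axis=1) * np.sqrt(Qs), axis=1).real
r = np.fft.ifft(np.fft.fft(s, axis=1) / np.sqrt(Qs), axis=1).real

# Output spectrum is flat: equal power at every frequency, up to sampling error.
Qr = (np.abs(np.fft.fft(r, axis=1)) ** 2).mean(axis=0)
assert Qr.std() / Qr.mean() < 0.1
```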
SLIDE 113

Decorrelation at the retina

Spatial correlations of natural images fall off with f −2:

  • Qs(k) ∝

1

|k|2 + k 2

and the optical filter of the eye introduces (crudely) a low-pass term ∝ e−α|k|. So decorrelation requires

|

Ds(k)|2 ∝ |k|2 + k 2 e−α|k| But: not all input is signal. Photodetection introduces noise. Therefore, cascade linear filters: s + η −

− − − − →

ˆ

s −

− − − − →

Ds

r with

  • Dη(k) =
  • Qs(k)
  • Qs(k) +

Qη(k) (Wiener filter) Thus the combined RGC filter is predicted to be:

|

Ds(k)| Dη(k) ∝

  • Qs(k)
  • Qs(k) +

Qη(k)

slide-114
SLIDE 114

Decorrelation at the retina

slide-115
SLIDE 115

Decorrelation at the retina

slide-116
SLIDE 116

Related ideas

◮ efficient channel utilisation
◮ output entropy maximisation
◮ variance equalisation
◮ redundancy reduction
◮ decorrelation
◮ discovery of independent projections or components