

SLIDE 1

Entropy and Conditional Entropy; Mutual Information and Kullback–Leibler Divergence

Lecture 2: Measures of Information

I-Hsiang Wang

Department of Electrical Engineering, National Taiwan University
ihwang@ntu.edu.tw

September 16, 2015

SLIDE 2

How to measure information?

Before answering this, we should first ask:

What is information?

SLIDE 3

In our daily lives, information is often obtained by learning something that we did not know before. Examples: the result of a ball game, the score of an exam, the weather, … In other words, one gets some information by learning something about which he or she was previously uncertain. Shannon: "Information is the resolution of uncertainty."

SLIDE 4

Motivating Example

Let us take the following example. Suppose there is a professional basketball (NBA) final and a tennis tournament (the French Open, at the quarterfinal stage) happening right now. D is an enthusiastic sports fan. He is interested in who will win the NBA final and who will win the Men's singles. However, due to his work, he cannot access any news for 10 days. How much information can he get after 10 days, when he learns the two pieces of news (the two messages)?

For the NBA final, D will learn that one of the two teams eventually wins the final (message B).
For the French Open, D will learn that one of the eight remaining players eventually wins the title (message T).

SLIDE 5

Observations

1 The amount of information is related to the number of possible outcomes: message B is a result of two possible outcomes, while message T is a result of eight possible outcomes.

2 The amount of information obtained in learning the two messages should be additive, while the number of possible outcomes is multiplicative. Let f(·) be a function that measures the amount of information:

f( (# of possible outcomes of B) × (# of possible outcomes of T) )
= (amount of info. from learning B) + (amount of info. from learning T)
= f(# of possible outcomes of B) + f(# of possible outcomes of T).

What function produces additive outputs from multiplicative inputs? The logarithmic function.
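This uniqueness of the logarithm can be stated compactly; a compile-ready sketch (the symbols N_B and N_T are our notation, not from the slides):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% N_B, N_T: the numbers of possible outcomes of messages B and T (our notation).
% Additivity over independent messages forces a logarithm:
\[
  f(N_B \cdot N_T) = f(N_B) + f(N_T) \quad \forall\, N_B, N_T \ge 1
  \;\Longrightarrow\; f(N) = c \log N \quad (c > 0),
\]
% by Cauchy's multiplicative functional equation (assuming $f$ is increasing).
% Taking $c = 1$ with base-2 logarithms measures information in bits.
\end{document}
```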

SLIDE 6

Logarithm as the Information Measure

Initial guess of the measure of information: log(# of possible outcomes). However, this measure does not take the likeliness into account: if some outcome occurs with very high probability, the amount of information of that outcome should be very little.

For example, suppose D knows that the Spurs were leading the Heat 3:1.

The probability that the Heat win the final: 1/2 → 1/8. The Heat win the final (w.p. 1/8): it is as if only 1 out of 8 trials generates this outcome ⇒ the amount of information = log 8 = 3 bits.

The probability that the Spurs win the final: 1/2 → 7/8. The Spurs win the final (w.p. 7/8): it is as if only 1 out of 8/7 trials generates this outcome ⇒ the amount of information = log(8/7) = 3 − log 7 bits.
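These self-information values are easy to check numerically; a minimal sketch in Python (the helper name surprisal is our own choice):

```python
import math

def surprisal(p: float) -> float:
    """Self-information log2(1/p), in bits, of an outcome with probability p."""
    return math.log2(1.0 / p)

print(surprisal(1 / 8))  # Heat win the final:  3.0 bits
print(surprisal(7 / 8))  # Spurs win the final: 3 - log2(7) ≈ 0.193 bits
```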

SLIDE 7

Information and Uncertainty

From the motivation, we collect the following intuitions:

1 The amount of information is related to the # of possible outcomes.
2 The measure of information should be additive.
3 The measure of information should take the likeliness into account.
4 The measure of information is actually measuring the amount of uncertainty of an unknown outcome.

Hence, a plausible measure of information of a realization x drawn from a random outcome X is f(x) := log(1 / P{X = x}). Correspondingly, the measure of information of a random outcome X is the averaged value of f(x): E_X[f(X)].

Notation: in this lecture, the logarithms are of base 2 if not specified.

SLIDE 8

1 Entropy and Conditional Entropy
  Definition of Entropy and Conditional Entropy
  Properties of Entropy and Conditional Entropy

2 Mutual Information and Kullback–Leibler Divergence

SLIDE 9

Entropy: Measure of Uncertainty of a Random Variable

log(1 / P{X = x}) ⇝ measure of information/uncertainty of an outcome x.

If the outcome has small probability, it contains higher uncertainty. However, on average, it happens rarely. Hence, to measure the uncertainty of a random variable, we should take the expectation of the self-information over all possible realizations:

Definition 1 (Entropy)
The entropy of a random variable X is defined by
H(X) := E_X[ log(1/p(X)) ] = Σ_{x∈X} p(x) log(1/p(x)).

Note: Entropy can be understood as the (average) amount of information when one learns the actual outcome/realization of r.v. X.
Note: By convention we set 0 log(1/0) = 0, since lim_{t→0} t log t = 0.
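Definition 1 translates directly into code; a minimal sketch in Python (the helper name entropy and the list-based p.m.f. are our choices, not from the slides):

```python
import math

def entropy(pmf) -> float:
    """H(X) = sum over x of p(x) * log2(1/p(x)), with the 0 * log(1/0) = 0 convention."""
    return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))            # a fair coin flip: 1.0 bit
print(entropy([1/6, 1/3, 1/3, 1/6]))  # 1/3 + log2(3) ≈ 1.918 (cf. Example 2 below)
```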

SLIDE 10

Example 1 (Binary entropy function)
Let X ∼ Ber(p) be a Bernoulli random variable, that is, X ∈ {0, 1}, pX(1) = 1 − pX(0) = p. Then the entropy of X is called the binary entropy function Hb(p) (note: we follow the convention that 0 log(1/0) = 0):

Hb(p) := H(X) = −p log p − (1 − p) log(1 − p), p ∈ [0, 1].

[Figure: plot of Hb(p) over p ∈ [0, 1]; the curve is concave, equals 0 at p = 0 and p = 1, and peaks at Hb(1/2) = 1.]

Exercise 1
1 Analytically check that max_{p∈[0,1]} Hb(p) = 1 and arg max_{p∈[0,1]} Hb(p) = 1/2.
2 Analytically prove that Hb(p) is concave in p.
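A numeric sanity check of the plotted curve, as a sketch (binary_entropy is our own helper, not from the slides):

```python
import math

def binary_entropy(p: float) -> float:
    """Hb(p) = -p*log2(p) - (1-p)*log2(1-p), with the 0 * log(1/0) = 0 convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 1.0):
    print(f"Hb({p}) = {binary_entropy(p):.4f}")  # symmetric in p, peaking at p = 0.5
```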

SLIDE 11

Example 2
Consider a random variable X ∈ {0, 1, 2, 3} with p.m.f. defined as follows:

x    :  0    1    2    3
p(x) : 1/6  1/3  1/3  1/6

Compute H(X) and H(Y), where Y := X mod 2.
sol: H(X) = 2 × (1/6) × log 6 + 2 × (1/3) × log 3 = 1/3 + log 3.
H(Y) = 2 × (1/2) × log 2 = 1.

SLIDE 12

Operational Meaning of Entropy

Besides the intuitive motivation, entropy has operational meanings. Below we take a slight deviation and look at a mathematical problem.

Problem: Consider a sequence of discrete r.v.'s X^n := (X1, X2, . . . , Xn), where Xi ∈ X, with Xi i.i.d. ∼ pX for all i = 1, 2, . . . , n, and |X| < ∞. For a given ϵ ∈ (0, 1), we say A ⊆ X^n is an ϵ-high-probability set iff P{X^n ∈ A} ≥ 1 − ϵ. We would like to find the asymptotic scaling of the smallest cardinality of ϵ-high-probability sets as n → ∞. Let s(n, ϵ) be that smallest cardinality.

SLIDE 13

Theorem 1 (Cardinality of High-Probability Sets)
lim_{n→∞} (1/n) log s(n, ϵ) = H(X), ∀ ϵ ∈ (0, 1).

pf: Application of the Law of Large Numbers. See HW1.

Implications: H(X) is the minimum possible compression ratio. With the theorem, if one would like to describe a random length-n X-sequence with a missed probability at most ϵ, he/she only needs k ≈ nH(X) bits when n is large. Why? Because the theorem guarantees that, for any prescribed missed probability,

(minimum # of bits required) / n → H(X) as n → ∞.

This is the saving (compression) due to the statistical structure of random source sequences, as Shannon pointed out in his 1948 paper.
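Theorem 1 can be watched in action for a Bernoulli source: the smallest ϵ-high-probability set is obtained by greedily collecting the most probable sequences until their total probability reaches 1 − ϵ, and (1/n) log2 of its size approaches H(X). A brute-force sketch (all names are ours, not from the slides; it enumerates type classes rather than individual sequences, so it stays cheap even for large n):

```python
import math
from math import comb

def smallest_set_exponent(n: int, p: float, eps: float) -> float:
    """(1/n) * log2 s(n, eps) for an i.i.d. Bernoulli(p) source: greedily take
    the most probable length-n sequences until total probability >= 1 - eps."""
    # All sequences with k ones share the probability p^k * (1-p)^(n-k),
    # and there are C(n, k) of them; sort these type classes by that probability.
    classes = sorted(
        ((p**k * (1 - p) ** (n - k), comb(n, k)) for k in range(n + 1)),
        reverse=True,
    )
    total, size = 0.0, 0
    for seq_prob, count in classes:
        need = math.ceil((1 - eps - total) / seq_prob)  # sequences still missing
        take = min(count, need)
        size += take
        total += take * seq_prob
        if total >= 1 - eps:
            break
    return math.log2(size) / n

p, eps = 0.11, 0.1
h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # H(X) ≈ 0.4999 bits
for n in (50, 200, 800):
    print(n, round(smallest_set_exponent(n, p, eps), 4), "vs H(X) =", round(h, 4))
```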

SLIDE 14

Entropy: Definition

Initially we defined entropy for a single random variable. It is straightforward to extend the definition to a sequence of random variables, or, a random vector. In some literature, the entropy of a random vector is also called the joint entropy of the component random variables.

Definition 2 (Entropy)
The entropy of a d-dimensional random vector X := [X1 · · · Xd]^T is defined by the expectation of the self-information:
H(X) := E_X[ log(1/p(X)) ] = Σ_{x∈X1×···×Xd} p(x) log(1/p(x)) = H(X1, . . . , Xd).

SLIDE 15

Example 3
Consider two random variables X1, X2 ∈ {0, 1} with joint p.m.f.

(x1, x2)  : (0,0)  (0,1)  (1,0)  (1,1)
p(x1, x2) :  1/6    1/3    1/3    1/6

Compute H(X1), H(X2), and H(X1, X2).
sol: H(X1, X2) = 2 × (1/6) × log 6 + 2 × (1/3) × log 3 = 1/3 + log 3.
H(X1) = 2 × (1/3 + 1/6) × log(1 / (1/3 + 1/6)) = 1 = H(X2).

Compared to Example 2, it can be seen that the value of the entropy only depends on the distribution of the random variable/vector, not on the actual values it may take.

SLIDE 16

Conditional Entropy

For two r.v.'s with conditional p.m.f. pX|Y(x|y), we are able to define "the entropy of X given Y = y" according to pX|Y(·|y):

H(X|Y = y) := Σ_{x∈X} pX|Y(x|y) log(1 / pX|Y(x|y)).

This can be understood as the amount of uncertainty of X when we know that a potentially correlated Y takes the value y. Averaging over Y, we obtain the amount of uncertainty of X given Y:

Definition 3 (Conditional Entropy)
The conditional entropy of X given Y is defined by
H(X|Y) := Σ_{y∈Y} p(y) H(X|Y = y) = Σ_{x∈X, y∈Y} p(x, y) log(1 / pX|Y(x|y)).
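Definition 3 in code, as a minimal sketch (the dict-based joint p.m.f. and the helper name are our choices); note that p(x,y) log(1/p(x|y)) = p(x,y) log(p(y)/p(x,y)):

```python
import math

def conditional_entropy(joint) -> float:
    """H(X|Y) = sum over (x, y) of p(x,y) * log2( p(y) / p(x,y) ),
    where `joint` maps (x, y) -> p(x, y)."""
    p_y = {}
    for (x, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p  # marginal p(y)
    return sum(p * math.log2(p_y[y] / p) for (x, y), p in joint.items() if p > 0)

joint = {(0, 0): 1/6, (0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/6}
print(conditional_entropy(joint))  # log2(3) - 2/3 ≈ 0.918 (cf. Example 4 below)
```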

SLIDE 17

Example 4
Consider two random variables X1, X2 ∈ {0, 1} with joint p.m.f.

(x1, x2)  : (0,0)  (0,1)  (1,0)  (1,1)
p(x1, x2) :  1/6    1/3    1/3    1/6

Compute H(X1|X2 = 0), H(X1|X2 = 1), H(X1|X2), and H(X2|X1).
sol: The conditional p.m.f.'s are

(x1, x2)  : (0,0)  (0,1)  (1,0)  (1,1)
p(x1|x2)  :  1/3    2/3    2/3    1/3
p(x2|x1)  :  1/3    2/3    2/3    1/3

H(X1|X2 = 0) = (1/3) log 3 + (2/3) log(3/2) = Hb(1/3),
H(X1|X2 = 1) = (2/3) log(3/2) + (1/3) log 3 = Hb(1/3),
H(X1|X2) = 2 × (1/6) × log 3 + 2 × (1/3) × log(3/2) = Hb(1/3) = log 3 − 2/3 = H(X2|X1).

SLIDE 18

Properties of Entropy

Theorem 2 (Properties of (Joint) Entropy)
1 H(X) ≥ 0, with equality iff X is deterministic.
2 H(X) ≤ log |X|, with equality iff X is uniformly distributed over X.
For a d-dimensional random vector X:
3 H(X) ≥ 0, with equality iff X is deterministic.
4 H(X) ≤ Σ_{i=1}^{d} log |Xi|, with equality iff X is uniformly distributed over X1 × · · · × Xd.

Interpretation (quite natural):
Amount of uncertainty in X = 0 ⇔ X is deterministic.
Amount of uncertainty in X is maximized ⇔ X is equally likely to take every value in its alphabet X.

SLIDE 19

Lemma 1 (Jensen's Inequality)
Let f : R → R be a strictly concave function, and X be a real-valued r.v. Then E[f(X)] ≤ f(E[X]), with equality iff X is deterministic.

We shall use the above lemma to prove that H(X) ≤ log |X|, with equality iff X is uniformly distributed over X.

pf: Let the support of X, supp X, denote the subset of X where X takes non-zero probability. Define a new r.v. U := 1/p(X). Note that E[U] = Σ_{x∈supp X} p(x) · (1/p(x)) = |supp X|. Hence,

H(X) = E[log U] ≤ log(E[U]) = log |supp X| ≤ log |X|,

where the first inequality is Jensen's. It holds with equality ⇔ U is deterministic ⇔ p(x) is the same for all x ∈ supp X. The second inequality holds with equality ⇔ supp X = X.

SLIDE 20

Chain Rule

Theorem 3 (Chain Rule)
H(X, Y) = H(Y) + H(X|Y) = H(X) + H(Y|X).

Interpretation: Amount of uncertainty of (X, Y) = amount of uncertainty of Y + amount of uncertainty of X after knowing Y.

pf: By definition,
H(X, Y) = Σ_{x∈X, y∈Y} p(x, y) log(1 / p(x, y)) = Σ_{x∈X, y∈Y} p(x, y) log(1 / (p(y) p(x|y)))
= Σ_{x∈X, y∈Y} p(x, y) log(1 / p(y)) + Σ_{x∈X, y∈Y} p(x, y) log(1 / p(x|y))
= H(Y) + H(X|Y).
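A numeric check of the chain rule on the joint p.m.f. of Examples 3 and 4, as a standalone sketch (not part of the slides):

```python
import math

joint = {(0, 0): 1/6, (0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/6}

H_XY = sum(p * math.log2(1 / p) for p in joint.values())
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p  # marginal p(y)
H_Y = sum(p * math.log2(1 / p) for p in p_y.values())
H_X_given_Y = sum(p * math.log2(p_y[y] / p) for (x, y), p in joint.items())

print(H_XY)               # 1/3 + log2(3) ≈ 1.918
print(H_Y + H_X_given_Y)  # same value: the chain rule holds
```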

SLIDE 21

Conditioning Reduces Entropy

Theorem 4 (Conditioning Reduces Entropy)
H(X|Y) ≤ H(X), with equality iff X is independent of Y.

Interpretation: The more one learns, the less the uncertainty. If and only if what you have learned is independent of your target does the amount of uncertainty of your target remain the same.

Exercise 2
While it is always true that H(X|Y) ≤ H(X), for a particular y ∈ Y both of the following are possible: H(X|Y = y) < H(X), or H(X|Y = y) > H(X). Please construct examples for the two cases respectively.

SLIDE 22

pf: By definition and Jensen's inequality, we have

H(X|Y) − H(X) = Σ_{x∈X, y∈Y} p(x, y) log(p(x) / p(x|y)) = Σ_{x∈X, y∈Y} p(x, y) log(p(x) p(y) / p(x, y))
≤ log( Σ_{x∈X, y∈Y} p(x, y) · p(x) p(y) / p(x, y) ) = log( Σ_{x∈X, y∈Y} p(x) p(y) ) = log 1 = 0.

Exercise 3
For any jointly distributed (X, Y), show that H(X|Y) ≥ 0, with equality iff X is a deterministic function of Y.

SLIDE 23

Example 5
Consider two random variables X1, X2 ∈ {0, 1} with joint p.m.f.

(x1, x2)  : (0,0)  (0,1)  (1,0)  (1,1)
p(x1, x2) :  1/6    1/3    1/3    1/6

In the previous examples, we have H(X1, X2) = log 3 + 1/3, H(X1) = H(X2) = 1, and H(X1|X2) = H(X2|X1) = log 3 − 2/3. It is straightforward to check that the chain rule holds. Besides, it can easily be seen that conditioning reduces entropy.

SLIDE 24

Generalization

Theorem 5 (Chain Rule)
The chain rule can be generalized to more than two r.v.'s:
H(X1, . . . , Xn) = Σ_{i=1}^{n} H(Xi | X1, . . . , Xi−1).
The proof is left as an exercise.

Theorem 6 (Conditioning Reduces Entropy)
"Conditioning reduces entropy" can be generalized to more than two r.v.'s:
H(X|Y, Z) ≤ H(X|Y).
The proof is left as an exercise.

SLIDE 25

Upper Bound on Joint Entropy

Exercise 4 (Joint Entropy ≤ Sum of Marginal Entropies)
Use the chain rule of entropy and the fact that conditioning reduces entropy to prove that
H(X1, . . . , Xn) ≤ Σ_{i=1}^{n} H(Xi).

SLIDE 26

1 Entropy and Conditional Entropy
  Definition of Entropy and Conditional Entropy
  Properties of Entropy and Conditional Entropy

2 Mutual Information and Kullback–Leibler Divergence

SLIDE 27

Conditioning Reduces Entropy Revisited

Entropy quantifies the amount of uncertainty of a r.v., say, X. Conditional entropy quantifies the amount of uncertainty of a r.v. X given another r.v., say, Y.

[Figure: a bar of height H(X) shrinks to H(X|Y) after learning Y; the gap is I(X; Y).]

Question: How much information does Y tell about X?
Ans: The amount of information about X that one obtains by learning Y is the difference between H(X) and H(X|Y).

SLIDE 28

Mutual Information

Definition 4 (Mutual Information)
For a pair of jointly distributed r.v.'s (X, Y), the mutual information between them is defined as
I(X; Y) := H(X) − H(X|Y).

What channel coding does is to infer some information about the channel input X from the channel output Y.

[Figure: a channel pY|X(y|x) mapping X to Y; learning Y lowers the uncertainty about X from H(X) to H(X|Y), and the gap is I(X; Y).]
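Definition 4 in code, as a sketch (the dict-based joint p.m.f. and the helper name are ours); it uses the equivalent form I(X; Y) = Σ p(x,y) log( p(x,y) / (p(x) p(y)) ), which the KL-divergence slides below make explicit:

```python
import math

def mutual_information(joint) -> float:
    """I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) ),
    where `joint` maps (x, y) -> p(x, y)."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

joint = {(0, 0): 1/6, (0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/6}
print(mutual_information(joint))  # 5/3 - log2(3) ≈ 0.082 (cf. Example 6 below)
```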

SLIDE 29

Properties of Mutual Information

Theorem 7 (An Identity)
I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(X) + H(Y) − H(X, Y).

pf: By the chain rule: H(X|Y) = H(X, Y) − H(Y).

[Figure: Venn diagram of H(X, Y); the circles H(X) and H(Y) overlap in I(X; Y), with H(X|Y) and H(Y|X) as the non-overlapping parts.]

Note: Mutual information is symmetric, that is, I(X; Y) = I(Y; X).
Note: The mutual information between X and itself equals its entropy: I(X; X) = H(X), since H(X|X) = 0. Hence, entropy is also called "self-information" in some of the literature.

SLIDE 30

Mutual Information Measures the Level of Dependency

Theorem 8 (Extremal Values of Mutual Information)
1 I(X; Y) ≥ 0, with equality iff X, Y are independent.
2 I(X; Y) ≤ H(X), with equality iff X is a deterministic function of Y.

pf: The first is due to the fact that conditioning reduces entropy. The second is due to H(X|Y) ≥ 0.

Interpretation: the mutual information between X and Y, I(X; Y), can also be viewed as a measure of the dependency between X and Y:
If X is determined by Y (highly dependent), I(X; Y) is maximized.
If X is independent of Y (no dependency), I(X; Y) = 0.
This interpretation will become clearer when we introduce the notion of Kullback–Leibler divergence.

SLIDE 31

Example 6
Consider two random variables X1, X2 ∈ {0, 1} with joint p.m.f.

(x1, x2)  : (0,0)  (0,1)  (1,0)  (1,1)
p(x1, x2) :  1/6    1/3    1/3    1/6

Compute I(X1; X2).
sol: From the previous examples, we have H(X1, X2) = log 3 + 1/3, H(X1) = H(X2) = 1, and H(X1|X2) = H(X2|X1) = log 3 − 2/3. Hence,
I(X1; X2) = H(X1) − H(X1|X2) = 5/3 − log 3.

SLIDE 32

Conditional Mutual Information

Definition 5 (Conditional Mutual Information)
For a tuple of jointly distributed r.v.'s (X, Y, Z), the mutual information between X and Y given Z is
I(X; Y|Z) := H(X|Z) − H(X|Y, Z).

Similar to the previous identity (Theorem 7), we have
I(X; Y|Z) = H(X|Z) − H(X|Y, Z) = H(Y|Z) − H(Y|X, Z) = H(X|Z) + H(Y|Z) − H(X, Y|Z).

Similar to Theorem 8, we have
1 I(X; Y|Z) ≥ 0, with equality iff X, Y are independent given Z, that is, X − Z − Y forms a Markov chain.
2 I(X; Y|Z) ≤ H(X|Z), with equality iff X is a deterministic function of Y and Z.

SLIDE 33

Chain Rule for Mutual Information

Theorem 9 (Chain Rule for Mutual Information)
I(X; Y1, . . . , Yn) = Σ_{i=1}^{n} I(X; Yi | Y1, . . . , Yi−1).

pf: The chain rule for mutual information can be proved from the definition and the chain rule for entropy.

Exercise 5
Show that I(X; Z) ≤ I(X; Y, Z) and I(X; Y|Z) ≤ I(X; Y, Z).

SLIDE 34

Data Processing Inequality

Theorem 10 (Data Processing Inequality)
For a Markov chain X − Y − Z, that is, p(x, y, z) = p(x) p(y|x) p(z|y), we have I(X; Y) ≥ I(X; Z).

Interpretation: The Markov chain X − Y − Z implies that the information about X that Z can provide is contained in Y. Hence, the amount of information about X that can be inferred from Z ≤ the amount of information about X that can be inferred from Y.

pf: Since X − Y − Z, we have I(X; Z|Y) = 0. Hence,
I(X; Y, Z) = I(X; Y) + I(X; Z|Y) = I(X; Y)   (∵ I(X; Z|Y) = 0),
I(X; Y, Z) = I(X; Z) + I(X; Y|Z)   (chain rule),
⇒ I(X; Y) = I(X; Z) + I(X; Y|Z) ≥ I(X; Z).

SLIDE 35

Data Processing Inequality: Applications

Markov chains are common in communication systems. For example, in channel coding (without feedback), the message W, the channel input X^N := X[1 : N], the channel output Y^N := Y[1 : N], and the decoded message Ŵ form a Markov chain W − X^N − Y^N − Ŵ.

[Figure: W → Encoder → X[1 : N] → Noisy Channel pY|X → Y[1 : N] → Decoder → Ŵ.]

The data processing inequality turns out to be crucial in obtaining impossibility results in information theory.

Exercise 6 (Functions of R.V.)
For Z := g(Y) a deterministic function of Y, show that H(Y) ≥ H(Z) and I(X; Y) ≥ I(X; Z).

Exercise 7
Show that X1 − X2 − X3 − X4 ⇒ I(X1; X4) ≤ I(X2; X3).

SLIDE 36

Example 7
Consider two random variables X1, X2 ∈ {0, 1} with the same joint p.m.f. as that in Example 6. Let X3 := X2 ⊕ Z, where Z ∼ Ber(p) and Z is independent of (X1, X2).
1 Compute I(X1; X3) and I(X1; X2|X3).
2 Show that X1 − X2 − X3 forms a Markov chain.
3 Verify the data processing inequality I(X1; X2) ≥ I(X1; X3).

sol:

(x1, x2, x3)  : (0,0,0)     (0,0,1)  (0,1,0)  (0,1,1)
p(x1, x2, x3) : (1/6)(1−p)  (1/6)p   (1/3)p   (1/3)(1−p)

(x1, x2, x3)  : (1,0,0)     (1,0,1)  (1,1,0)  (1,1,1)
p(x1, x2, x3) : (1/3)(1−p)  (1/3)p   (1/6)p   (1/6)(1−p)

Then it is straightforward to compute the mutual informations and verify the Markov chain X1 − X2 − X3.
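The data-processing check in part 3 is easy to do numerically for a concrete p; a standalone sketch (helper names ours, not from the slides), building p(x1, x3) from the table above:

```python
import math

def mi(joint):
    """I(A;B) from a dict mapping (a, b) -> p(a, b)."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

p = 0.1  # any p in (0, 1) exhibits the same inequality
p12 = {(0, 0): 1/6, (0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/6}
p13 = {}
for (x1, x2), q in p12.items():           # X3 = X2 xor Z, Z ~ Ber(p) independent
    for z, pz in ((0, 1 - p), (1, p)):
        p13[(x1, x2 ^ z)] = p13.get((x1, x2 ^ z), 0.0) + q * pz

print(mi(p12), mi(p13))  # I(X1;X2) >= I(X1;X3), as the DPI predicts
```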

SLIDE 37

Conditioning Reduces Mutual Information?

Does conditioning reduce mutual information? Does conditioning reduce the dependency between two r.v.'s? The answer is: sometimes yes, and sometimes no.

Example 8 (Conditioning Increases Mutual Information)
Let X and Y be i.i.d. Ber(1/2) random variables, and Z := X ⊕ Y. Evaluate I(X; Y|Z) and show that I(X; Y|Z) > I(X; Y).
sol: I(X; Y|Z) = H(X|Z) − H(X|Y, Z) = H(X|Z) − H(X|Y, X ⊕ Y) = H(X|Z) − H(X|Y, X) = H(X|Z) = H(X) = 1
(note that X and Z are independent).
On the other hand, I(X; Y) = 0. Hence, 1 = I(X; Y|Z) > I(X; Y) = 0.

Corollary 1 (Conditioning Decreases Mutual Information)
For a Markov chain X − Y − Z, we have I(X; Y) ≥ I(X; Y|Z).

SLIDE 38

Measuring the Distance between Probability Distributions

Recall that mutual information I(X; Y) measures the dependency between two r.v.'s X and Y. Reason:

I(X; Y) = E[ log(1 / (pX(X) pY(Y))) − log(1 / pX,Y(X, Y)) ],

where the first term inside E[·] measures the uncertainty of (X, Y) when they are independent (with distribution pX · pY), while the second term measures the actual uncertainty of (X, Y) (with distribution pX,Y). In other words, it measures how far the independent distribution pX · pY is from the actual distribution pX,Y, in terms of uncertainty.

Kullback–Leibler divergence is a generalization of this concept: it measures how far a distribution q is from the actual distribution p.

Note: Since distributions over a finite alphabet are finite-dimensional vectors, geometrically Lp norms also naturally measure how far q is from p. Example (L1-norm): Σ_x |q(x) − p(x)|.

SLIDE 39

Kullback–Leibler Divergence

Definition 6 (Kullback–Leibler Divergence (Relative Entropy))
Let p(·) and q(·) be two p.m.f.'s of a random variable X. The relative entropy between p and q is
D(p||q) := E_p[ log(p(X) / q(X)) ]
(the subscript "p" denotes that the expectation is taken over the distribution p).

Note: In taking the above expectation, we follow the conventions that 0 log(0/q) = 0 for any 0 ≤ q ≤ 1, and p log(p/0) = ∞ for any 0 < p ≤ 1.
Note: Hence, it is easy to see that if the support of q is strictly contained in the support of p, then D(p||q) = ∞.
Note: I(X; Y) = D(pX,Y || pX · pY).
Note: KL divergence is NOT symmetric: D(p||q) ≠ D(q||p) in general.
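Definition 6 as a sketch in code (the list-based p.m.f.'s and the helper name are our choices), with the two conventions made explicit:

```python
import math

def kl_divergence(p, q) -> float:
    """D(p||q) = sum over x of p(x) * log2( p(x) / q(x) )."""
    d = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue            # convention: 0 * log(0/q) = 0
        if qx == 0:
            return math.inf     # convention: p * log(p/0) = infinity
        d += px * math.log2(px / qx)
    return d

p = [1/2, 1/4, 1/4]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q), kl_divergence(q, p))  # ≈ 0.085 vs ≈ 0.082: not symmetric
```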

SLIDE 40

Properties of KL Divergence

Theorem 11
D(p||q) ≥ 0, with equality iff p(x) = q(x) for all x ∈ X.
pf: Proved by Jensen's inequality, similarly to the previous proofs.

Note: Although it is tempting to think of KL divergence as a distance function, in fact it is not, because (1) it is asymmetric, and (2) it does not satisfy the triangle inequality.

Exercise 8
Show that the uniform distribution attains maximal entropy, by using D(p||u) ≥ 0, where u denotes the uniform distribution.

SLIDE 41

Summary

SLIDE 42

Entropy H(X) := E_pX[ log(1/pX(X)) ] measures the amount of uncertainty in X.
Conditional entropy H(X|Y) := E_pX,Y[ log(1/pX|Y(X|Y)) ] measures the amount of uncertainty in X given Y.
Mutual information I(X; Y) := H(X) − H(X|Y) = H(Y) − H(Y|X) measures the amount of information about X (resp. Y) that Y (resp. X) can provide.
Conditioning reduces entropy: H(X|Y, Z) ≤ H(X|Y).
Chain rule: I(X; Y, Z) = I(X; Y) + I(X; Z|Y).
Data processing inequality: X − Y − Z ⇒ I(X; Y) ≥ I(X; Z).
0 ≤ H(X) ≤ log |X|: maximized if X is uniformly distributed, minimized if X is deterministic.
0 ≤ I(X; Y) ≤ min{H(X), H(Y)}: maximized if X or Y is determined by the other, minimized if X and Y are independent.
