
Revised December 5, 2019 Information Theoretic Security

Lecture 1

Matthieu Bloch

1 Notation and basic definitions

We start by briefly introducing the notation used throughout this book. The sets of real numbers and integers are denoted by ℝ and ℕ, respectively. For a < b ∈ ℝ, the open (resp. closed) interval between a and b is denoted by ]a; b[ (resp. [a; b]). For n < m ∈ ℕ, we set ⟦n, m⟧ ≜ {i ∈ ℕ : n ⩽ i ⩽ m}. Scalar real-valued random variables are denoted by uppercase letters, e.g., X, with realizations denoted in lowercase, e.g., x. Vector real-valued random variables and realizations are shown in bold, e.g., X and x. Sets are denoted by calligraphic letters, e.g., 𝒳. Given a vector x ∈ 𝒳ⁿ, the components of x are denoted {x_i}_{i=1}^n. For 1 ⩽ i ⩽ j ⩽ n, we define x_{i:j} ≜ {x_k : i ⩽ k ⩽ j}.

Matrices are also denoted by capital bold letters, e.g., A ∈ ℝ^{n×m}, and we make sure no confusion arises with random vectors. Unless otherwise specified, sets are assumed to be finite, e.g., |𝒳| < ∞, and random variables are assumed to be discrete. The probability simplex over 𝒳 is denoted ∆(𝒳) and we always denote the Probability Mass Function (PMF) of X by pX. The support of a PMF p ∈ ∆(𝒳) is supp(p) ≜ {x ∈ 𝒳 : p(x) > 0}. If X is a continuous random variable, we will abuse notation and also denote its Probability Density Function (PDF) by pX; whether we manipulate PDFs or PMFs will always be clear from context and no confusion should arise. Given two jointly distributed random variables X ∈ 𝒳 and Y ∈ 𝒴, their joint distribution is denoted by pXY and the conditional distribution of Y given X is denoted by pY|X. It is sometimes convenient to define a conditional distribution without specific random variables, in which case we use the notion of Markov kernel as defined below.

Definition 1.1. W is a Markov kernel from 𝒳 to 𝒴 if W(y|x) ⩾ 0 for all (x, y) ∈ 𝒳 × 𝒴 and Σ_{y∈𝒴} W(y|x) = 1 for all x ∈ 𝒳. For any p ∈ ∆(𝒳), we define W · p ∈ ∆(𝒳 × 𝒴) and W ◦ p ∈ ∆(𝒴) as

\[
(W \cdot p)(x, y) \triangleq W(y|x)\,p(x) \quad\text{and}\quad (W \circ p)(y) \triangleq \sum_{x\in\mathcal{X}} (W \cdot p)(x, y). \tag{1}
\]

The notion of Markov kernel is mainly used to simplify notation and does not bear any profound meaning.
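Concretely, a Markov kernel between finite sets can be stored as a row-stochastic matrix, and the two operations in (1) become elementary array manipulations. The following minimal Python sketch (the function names and the example kernel are ours, not from the notes) illustrates this:

```python
import numpy as np

# A Markov kernel W from X to Y stored as a |X| x |Y| row-stochastic matrix;
# row x is the conditional PMF W(.|x), so each row sums to 1.

def kernel_dot(W, p):
    """Joint distribution (W . p)(x, y) = W(y|x) p(x), as a |X| x |Y| array."""
    return p[:, None] * W

def kernel_circ(W, p):
    """Output distribution (W o p)(y) = sum_x W(y|x) p(x)."""
    return p @ W

# Example: a binary symmetric kernel with crossover probability 0.1.
W = np.array([[0.9, 0.1],
              [0.1, 0.9]])
p = np.array([0.3, 0.7])
joint = kernel_dot(W, p)
assert np.allclose(joint.sum(axis=1), p)                  # x-marginal of W.p is p
assert np.allclose(kernel_circ(W, p), joint.sum(axis=0))  # W o p is the y-marginal of W.p
```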

Two jointly distributed random variables X and Y are called independent if pXY = pX pY. The expected value, or average, of a random variable X is defined as E(X) ≜ Σ_{x∈𝒳} x pX(x). The m-th centered moment of X is defined as E((X − E(X))^m), and in particular the variance of X, denoted Var(X), is the second centered moment.

Definition 1.2 (Markov chain and conditional independence). Let X, Y, Z be real-valued random variables with joint PMF pXYZ. Then X, Y, Z form a Markov chain in that order, denoted X − Y − Z, if X and Z are conditionally independent given Y, i.e., ∀(x, y, z) ∈ 𝒳 × 𝒴 × 𝒵, we have pXYZ(x, y, z) = pZ|Y(z|y) pY|X(y|x) pX(x).


Finally, we define the notion of absolute continuity, which will prove useful in later sections.

Definition 1.3 (Absolute continuity). Let p, q ∈ ∆(𝒳). We say that p is absolutely continuous with respect to (w.r.t.) q, denoted by p ≪ q, if supp(p) ⊆ supp(q). If p is not absolutely continuous w.r.t. q, we write p ≪̸ q.

2 Convexity and Jensen's inequality

Convexity plays a central role in many of our proofs, largely because information-theoretic metrics possess convenient convexity properties. We shall extensively use Jensen's inequality and the related log-sum inequality in subsequent chapters, and we recall these results here for completeness.

Definition 2.1 (Convex and concave functions). A function f : [a, b] → ℝ is convex if for all x, y ∈ [a, b] and all λ ∈ [0, 1],

f(λx + (1 − λ)y) ⩽ λf(x) + (1 − λ)f(y).

A function f is strictly convex if the inequality above is strict whenever x ≠ y and λ ∈ ]0; 1[. A function f is (strictly) concave if −f is (strictly) convex.

Tieorem 2.2 (Jensen’s inequality). Jensen’s inequality Let X be a real-valued random variable defined

  • n some interval [a, b] and with PDF pX. Let f : [a, b] → R be a real valued function that is convex

in [a, b]. Tien, f(E(X)) ⩽ E(f(X)). For any strictly convex function, equality holds if and only if X is a constant. Tie results also holds more generally for continuous random variables.

  • Proof. Let hL : [a, b] → R be a line such that ∀x ∈ a, b hL(x) ⩽ f(x). Such a line always exists

as a result of convexity. Tien, E(hL(X)) ⩽ E(f(X)), but since hL is a line, we have hL(E(X)) = E(hL(X)) ⩽ E(f(X)). In particular, we can choose hL such that hL(E(X)) = f(E(X)) because f is convex. Hence, f(E(X)) ⩽ E(f(X)) and if f is strictly convex, we have equality if and only if X = cst.
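As a quick numerical illustration (our toy example, not from the notes), one can check f(E(X)) ⩽ E(f(X)) for the strictly convex function f(x) = x² on a small PMF:

```python
import numpy as np

# Sanity check of Jensen's inequality for f(x) = x^2 (strictly convex).
x = np.array([0.0, 1.0, 2.0, 4.0])   # support of X
p = np.array([0.1, 0.4, 0.3, 0.2])   # PMF of X

f = lambda t: t ** 2
lhs = f(np.dot(p, x))                # f(E(X)) = 1.8^2 = 3.24
rhs = np.dot(p, f(x))                # E(f(X)) = 4.8
assert lhs <= rhs                    # strict here, since X is not constant
```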

Corollary 2.3 (Log-sum inequality). Let {a_i}_{i=1}^n ∈ ℝ₊ⁿ and {b_i}_{i=1}^n ∈ ℝ₊ⁿ. Then,

\[
\sum_{i=1}^n a_i \ln\frac{a_i}{b_i} \geqslant \left(\sum_{i=1}^n a_i\right) \ln\frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}. \tag{2}
\]

Proof. Note that if b_j = 0 and a_j ≠ 0 for some j ∈ ⟦1, n⟧, then the result holds since the left-hand side of (2) is infinite. If not, we introduce the function f : ℝ₊ → ℝ : x ↦ x ln x with the convention that f(0) = 0, which is infinitely differentiable on its domain. Since f″(x) = 1/x ⩾ 0, f is convex. Set a ≜ Σ_{i=1}^n a_i and b ≜ Σ_{i=1}^n b_i. Then, note that

\[
\sum_{i=1}^n a_i \ln\frac{a_i}{b_i} = \sum_{i=1}^n b_i f\!\left(\frac{a_i}{b_i}\right) = b\sum_{i=1}^n \frac{b_i}{b}\, f\!\left(\frac{a_i}{b_i}\right) \geqslant b\, f\!\left(\sum_{i=1}^n \frac{b_i}{b}\,\frac{a_i}{b_i}\right) = b\, f\!\left(\frac{a}{b}\right) = a\ln\frac{a}{b}, \tag{3}
\]

where we have used Jensen's inequality.
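A direct numerical check of (2) on arbitrary non-negative sequences (our toy values, not from the notes); note that the sequences need not be normalized:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 4.0])

lhs = np.sum(a * np.log(a / b))             # sum_i a_i ln(a_i / b_i)
rhs = a.sum() * np.log(a.sum() / b.sum())   # (sum a_i) ln(sum a_i / sum b_i)
assert lhs >= rhs                           # here lhs ~= -0.170, rhs ~= -0.925
```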

Proposition 2.4. Let X be a real-valued random variable defined on some interval [a, b] and with distribution pX. Let f : [a, b] → ℝ be a real-valued function that is convex on [a, b]. Then,

\[
E(f(X)) \leqslant f(a) + \frac{f(b) - f(a)}{b - a}\left(E(X) - a\right).
\]


For any strictly convex function, equality holds if and only if X is supported on the endpoints of the interval.

Proof. Let hU : [a, b] → ℝ be a line such that ∀x ∈ [a, b], f(x) ⩽ hU(x). Then, E(f(X)) ⩽ E(hU(X)) = hU(E(X)). In particular, we may choose hU : x ↦ f(a) + ((f(b) − f(a))/(b − a))(x − a), which lies above f on [a, b] by convexity. Hence, E(f(X)) ⩽ f(a) + ((f(b) − f(a))/(b − a))(E(X) − a), and if f is strictly convex, equality holds if X is such that pX(x) = 0 for x ∈ ]a; b[.

3 Distances between distributions

As we will see in subsequent chapters, many information-theoretic security metrics can be expressed in terms of how close or distinct probability distributions are. We develop here the properties of two distances, the total variation distance and the relative entropy, which we shall use extensively.

Definition 3.1 (Total variation distance). The total variation between two distributions p, q ∈ ∆(𝒳) is

\[
V(p, q) \triangleq \frac{1}{2}\lVert p - q\rVert_1 \triangleq \frac{1}{2}\sum_{x\in\mathcal{X}} |p(x) - q(x)|. \tag{4}
\]

For all practical purposes, the total variation distance is an ℓ1 norm on the probability simplex ∆(𝒳) and inherits all its properties (symmetry, positivity, triangle inequality). The normalization by 1/2 is for convenience, as we shall see from the properties derived next. The total variation can be expressed more generally, as shown in the next proposition.

Proposition 3.2. The total variation between two distributions p, q ∈ ∆(𝒳) is

\[
V(p, q) = \sup_{\mathcal{E}\subset\mathcal{X}} \left(P_p(\mathcal{E}) - P_q(\mathcal{E})\right) = \sup_{\mathcal{E}\subset\mathcal{X}} \left(P_q(\mathcal{E}) - P_p(\mathcal{E})\right). \tag{5}
\]

The supremum is attained for 𝓔 ≜ {x ∈ 𝒳 : p(x) > q(x)}. Consequently, 0 ⩽ V(p, q) ⩽ 1.

Proof. From the definition, upon setting 𝓔₀ ≜ {x ∈ 𝒳 : p(x) > q(x)}, we have

\[
\begin{aligned}
V(p, q) &= \frac{1}{2}\sum_{x\in\mathcal{E}_0}(p(x) - q(x)) + \frac{1}{2}\sum_{x\in\mathcal{E}_0^c}(q(x) - p(x)) && (6)\\
&= \frac{1}{2}\left(P_p(\mathcal{E}_0) - P_q(\mathcal{E}_0) + P_q(\mathcal{E}_0^c) - P_p(\mathcal{E}_0^c)\right) && (7)\\
&= P_p(\mathcal{E}_0) - P_q(\mathcal{E}_0) && (8)\\
&\leqslant \sup_{\mathcal{E}\subset\mathcal{X}}\left(P_p(\mathcal{E}) - P_q(\mathcal{E})\right). && (9)
\end{aligned}
\]

Conversely, note that for every 𝓔,

\[
\begin{aligned}
P_p(\mathcal{E}) - P_q(\mathcal{E}) &= \frac{1}{2}\left(P_p(\mathcal{E}) - P_q(\mathcal{E}) + P_q(\mathcal{E}^c) - P_p(\mathcal{E}^c)\right) && (10)\\
&= \frac{1}{2}\sum_{x\in\mathcal{E}}(p(x) - q(x)) + \frac{1}{2}\sum_{x\in\mathcal{E}^c}(q(x) - p(x)) && (11)\\
&\leqslant V(p, q), && (12)
\end{aligned}
\]

so that sup_{𝓔⊂𝒳} (P_p(𝓔) − P_q(𝓔)) ⩽ V(p, q).
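The two expressions for the total variation are easy to compare numerically; the sketch below (our example, not from the notes) computes V(p, q) from the ℓ1 formula (4) and from the maximizing event of Proposition 3.2:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

tv_l1 = 0.5 * np.abs(p - q).sum()     # definition (4)
E0 = p > q                            # event attaining the supremum in (5)
tv_event = p[E0].sum() - q[E0].sum()  # P_p(E0) - P_q(E0)
assert np.isclose(tv_l1, tv_event)    # both equal 0.3 here
```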


Proposition 3.2 is convenient to lower bound the total variation, since it suffices to choose any 𝓔 ⊂ 𝒳 for which the evaluation of (5) is easy to carry out.

Proposition 3.3. Let W be a Markov kernel from 𝒳 to 𝒴 and let p, q ∈ ∆(𝒳). Then,

V(W ◦ p, W ◦ q) ⩽ V(W · p, W · q) = V(p, q).

Proof. The results follow from the definition of W · p and W ◦ p and the properties of |·| as a distance on ℝ. Note that

\[
V(W \circ p, W \circ q) = \frac{1}{2}\sum_{y}\left|\sum_{x}\left((W\cdot p)(x, y) - (W\cdot q)(x, y)\right)\right| \leqslant V(W\cdot p, W\cdot q),
\]

\[
V(W\cdot p, W\cdot q) = \frac{1}{2}\sum_{x}\sum_{y} W(y|x)\left|p(x) - q(x)\right| = V(p, q).
\]
The inequality V(W ◦ p, W ◦ q) ⩽ V(p, q) can be understood as a data-processing inequality, stating that distributions only become more indistinguishable when passed through the same Markov kernel.

Definition 3.4 (Relative entropy). The relative entropy, also called Kullback-Leibler divergence, between two distributions p, q ∈ ∆(𝒳) is

\[
D(p\|q) \triangleq \sum_{x\in\mathcal{X}} p(x)\ln\frac{p(x)}{q(x)}, \tag{13}
\]

with the convention that D(p‖q) = ∞ if p ≪̸ q.

Proposition 3.5 (Positivity of relative entropy). For any p, q ∈ ∆(𝒳), D(p‖q) ⩾ 0 with equality if and only if p = q.

Proof. Using the concavity of the logarithm and Jensen's inequality,

\[
-D(p\|q) = \sum_{x\in\mathcal{X}} p(x)\ln\frac{q(x)}{p(x)} \leqslant \ln\sum_{x\in\mathcal{X}} q(x) = \ln(1) = 0.
\]

Since the logarithm is strictly concave, equality happens if and only if there exists c ∈ ℝ such that ∀x ∈ 𝒳, p(x) = c q(x). Since 1 = Σ_x p(x) = c Σ_x q(x) = c, equality happens if and only if p(x) = q(x) for all x.
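A minimal implementation of the relative entropy in nats (our helper, not from the notes), with the conventions 0 ln 0 = 0 and D(p‖q) = ∞ when p is not absolutely continuous w.r.t. q:

```python
import numpy as np

def kl(p, q):
    """Relative entropy D(p||q) in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any((p > 0) & (q == 0)):    # p not absolutely continuous w.r.t. q
        return np.inf
    mask = p > 0                      # convention 0 ln 0 = 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p, q = [0.5, 0.3, 0.2], [0.25, 0.25, 0.5]
assert kl(p, q) >= 0.0                # Proposition 3.5
assert kl(p, p) == 0.0                # equality if and only if p = q
```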

Proposition 3.6. Let W be a Markov kernel from 𝒳 to 𝒴 and let p, q ∈ ∆(𝒳). Then,

D(W ◦ p‖W ◦ q) ⩽ D(W · p‖W · q) = D(p‖q).

Proof. If p ≪̸ q, then D(p‖q) = ∞ and the inequality holds. If p ≪ q, then W ◦ p ≪ W ◦ q as well. Using the log-sum inequality, we then have

\[
D(W\circ p\|W\circ q) = \sum_{y}\left(\sum_{x}(W\cdot p)(x, y)\right)\ln\frac{\sum_{x}(W\cdot p)(x, y)}{\sum_{x}(W\cdot q)(x, y)} \leqslant D(W\cdot p\|W\cdot q),
\]

\[
D(W\cdot p\|W\cdot q) = \sum_{x}\sum_{y}(W\cdot p)(x, y)\ln\frac{p(x)}{q(x)} = D(p\|q).
\]


Proposition 3.6 can again be interpreted as a data-processing inequality, stating that relative entropy can only decrease through a Markov kernel. The relative entropy is unfortunately not a distance on ∆(𝒳), since it is not symmetric and does not satisfy the triangle inequality. It is, however, often convenient to use because it involves the ratio of probabilities as opposed to their difference. To get the benefits of both total variation and relative entropy, it can be convenient to relate the two metrics through the following inequalities.

Proposition 3.7 (Pinsker's inequality). For any p, q ∈ ∆(𝒳),

\[
V(p, q) \leqslant \sqrt{\frac{1}{2}D(p\|q)}.
\]

Proof. We start by considering the special case in which p, q ∈ ∆({0, 1}) with p(1) = α ∈ [0; 1] and q(1) = β ∈ [0; 1]. Without loss of generality, we assume α ⩾ β. Then, V(p, q) = α − β and

\[
D(p\|q) = \alpha\ln\frac{\alpha}{\beta} + (1-\alpha)\ln\frac{1-\alpha}{1-\beta}.
\]

If β = 0, D(p‖q) is infinite and the inequality holds. If β = α, V(p, q) = D(p‖q) = 0, and the inequality holds as well. Else, consider the differentiable function

\[
f_\alpha : \,]0;\alpha[\, \to \mathbb{R} : \beta \mapsto \alpha\ln\frac{\alpha}{\beta} + (1-\alpha)\ln\frac{1-\alpha}{1-\beta} - 2(\alpha-\beta)^2. \tag{14}
\]

Since f′_α(β) = (α − β)(4 − 1/(β(1 − β))) ⩽ 0, f_α is non-increasing on ]0; α[; as f_α(β) → 0 when β → α, we deduce f_α(β) ⩾ 0, i.e., D(p‖q) ⩾ 2(α − β)² = 2V(p, q)², so the inequality holds in this case too.

We consider next arbitrary p, q ∈ ∆(𝒳) and set 𝓔 ≜ {x ∈ 𝒳 : p(x) > q(x)}. From Proposition 3.2, we have V(p, q) = P_p(𝓔) − P_q(𝓔). Define a Markov kernel W from 𝒳 to {0, 1} such that W(1|x) ≜ 1{x ∈ 𝓔}. Notice that (W ◦ p)(1) = P_p(𝓔) and (W ◦ q)(1) = P_q(𝓔), so that V(p, q) = V(W ◦ p, W ◦ q). Since W ◦ p, W ◦ q ∈ ∆({0, 1}), the result of the special case applies and V(W ◦ p, W ◦ q) ⩽ √(½ D(W ◦ p‖W ◦ q)). Finally, the result follows because D(W ◦ p‖W ◦ q) ⩽ D(p‖q) by Proposition 3.6.
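Pinsker's inequality is easy to probe empirically; the following sketch (ours, not from the notes) checks V(p, q) ⩽ √(D(p‖q)/2) on random pairs of distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(4))     # random distributions on a 4-letter alphabet
    q = rng.dirichlet(np.ones(4))
    tv = 0.5 * np.abs(p - q).sum()
    kl = np.sum(p * np.log(p / q))    # entries are positive almost surely
    assert tv <= np.sqrt(kl / 2) + 1e-12
```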

Very often, we will loosen Pinsker’s inequality and only use V(p, q) ⩽

  • D(pq), which will be

sufficient for our purpose. Proposition 3.8 (Reverse Pinsker’s inequality). For any p, q ∈ ∆(X) with supp (q) = X and p ≪ q, we have D(pq) ⩽ 2V(p, q) ln

1 qmin where qmin ≜ minx∈supp(q) q(x).

Proof. Define the differentiable function ϕ : [0; ∞[ → ℝ : x ↦ (x ln x)/(x − 1), with ϕ(0) = 0 and ϕ(1) = 1 by continuity. Note that ϕ′(x) = (x − 1 − ln x)/(x − 1)² ⩾ 0 on the domain, so ϕ(·) is increasing and non-negative. In addition, ϕ(x) ⩽ 2 ln x for x ⩾ 2. Consequently,

\[
D(p\|q) = \sum_{x\in\mathcal{X}: p(x)\neq q(x)} p(x)\ln\frac{p(x)}{q(x)} = \sum_{x\in\mathcal{X}: p(x)\neq q(x)} \phi\!\left(\frac{p(x)}{q(x)}\right)(p(x) - q(x)) \leqslant \sum_{x\in\mathcal{X}: p(x)>q(x)} \phi\!\left(\frac{p(x)}{q(x)}\right)(p(x) - q(x)),
\]

where the inequality follows because the terms with p(x) < q(x) are non-positive.


Upper bounding ϕ(p(x)/q(x)) by ϕ(1/q_min) and recalling the characterization of V(·, ·) from Proposition 3.2, we have

\[
\sum_{x\in\mathcal{X}: p(x)>q(x)} \phi\!\left(\frac{p(x)}{q(x)}\right)(p(x) - q(x)) \leqslant \phi\!\left(\frac{1}{q_{\min}}\right)V(p, q).
\]

The result follows because q_min ⩽ 1/2, so that ϕ(1/q_min) ⩽ 2 ln(1/q_min).

Tie reverse Pinkser’s inequality stated here should be used with care, as the bound involves

  • qmin. For some distributions, qmin may be very small so that the bound might not be tight at all.

Fortunately, the presence of a logarithm mitigates this, as we shall see later in the textbook when we manipulate product distributions. Definition 3.9. Tie χ2 distance, between two distributions p, q ∈ ∆(X) is χ2 (pq) ≜

  • x∈X

(p(x) − q(x))2 q(x) =

  • x∈X

p(x)2 q(x) − 1. with the convention that χ2 (pq) = ∞ if p ≪ / q.

4 Tail and concentration inequalities

For a random variable X, a tail inequality is a bound of the form P(X ⩾ t) ⩽ f(t) for some t ∈ ℝ and some function f : ℝ → ℝ₊. The idea is to capture in the bound the fact that the random variable cannot take large values with too high a probability. Most if not all tail inequalities are derived from Markov's inequality.

Lemma 4.1 (Markov's inequality). Let X be a non-negative real-valued random variable. Then for all t > 0,

\[
P(X \geqslant t) \leqslant \frac{E(X)}{t}. \tag{15}
\]

Proof. For t > 0, let 1{X ⩾ t} be the indicator function of the event {X ⩾ t}. Then,

\[
E[X] \geqslant E[X\,\mathbf{1}\{X \geqslant t\}] \geqslant t\,P[X \geqslant t], \tag{16}
\]

where the first inequality follows because the indicator function is {0, 1}-valued and X is non-negative, and the second because X ⩾ t whenever 1{X ⩾ t} = 1 and X 1{X ⩾ t} = 0 otherwise.

By choosing t = sE(X) for s > 0 in (15), we obtain P(X ⩾ sE(X)) ⩽ 1/s, which is consistent with the intuition that it is unlikely that a random variable takes a value very far away from its mean. Despite its relative simplicity, Markov's inequality is a powerful tool because it can be "boosted." For X ∈ 𝒳 ⊂ ℝ, consider ϕ : 𝒳 → ℝ₊ increasing on 𝒳 such that E(|ϕ(X)|) < ∞. Then,

\[
P[X \geqslant t] = E[\mathbf{1}\{X \geqslant t\}] = E[\mathbf{1}\{X \geqslant t\}\,\mathbf{1}\{\phi(X) \geqslant \phi(t)\}] \leqslant P[\phi(X) \geqslant \phi(t)], \tag{17}
\]

where we have used the definition of ϕ and the fact that an indicator function is upper bounded by one. Applying Markov's inequality, we obtain

\[
P[X \geqslant t] \leqslant \frac{E[\phi(X)]}{\phi(t)}, \tag{18}
\]


which is potentially a better bound than (15). Of course, the difficulty is in choosing the appropriate function ϕ to make the result meaningful. The most well-known application of this concept leads to Chebyshev's inequality.

Lemma 4.2 (Chebyshev's inequality). Let X ∈ ℝ. Then,

\[
P[|X - E(X)| \geqslant t] \leqslant \frac{\mathrm{Var}(X)}{t^2}. \tag{19}
\]

Proof. Define Y ≜ |X − E(X)| and ϕ : ℝ₊ → ℝ₊ : t ↦ t². Then, by the boosted Markov's inequality, we obtain

\[
P[|X - E(X)| \geqslant t] = P[Y \geqslant t] \leqslant \frac{E[Y^2]}{t^2} = \frac{\mathrm{Var}(X)}{t^2}. \tag{20}
\]
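The sketch below (our example, not from the notes) compares the empirical tail of an exponential random variable with the Markov bound (15), and applies Chebyshev's bound (19) to the centered variable:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)  # E(X) = 1, Var(X) = 1
t = 3.0
empirical = np.mean(x >= t)                   # true tail = exp(-3) ~= 0.050
markov = 1.0 / t                              # E(X)/t ~= 0.333
chebyshev = 1.0 / (t - 1.0) ** 2              # P(|X - E(X)| >= t - 1) <= Var/(t-1)^2 = 0.25
assert empirical <= markov and empirical <= chebyshev
```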

Chebyshev’s inequality is an example of a concentration inequality that quantifies how the vari- able X concentrates its probability mass around the average E(X). Not surprisingly, Chebyshev’s inequality states that most of the probability mass is in a neighborhood around the average E(X) but the size of that neighborhood is a function of the variance of X. As an application of Chebyshev’s inequality, we derive the weak law of large numbers. Lemma 4.3 (Weak law of large numbers). Let Xi ∼ pXi be independent with E[|Xi|] < ∞ and Var(Xi) < σ2 for some σ2 ∈ R+. Tien the random variable 1

n

n

i=1(Xi − E(Xi)) converges in

probability to 0.

Proof. Set Z ≜ (1/n) Σ_{i=1}^n (X_i − E(X_i)) and note that

\[
E[Z] = \frac{1}{n}\sum_{i=1}^n \left(E[X_i] - E(X_i)\right) = 0 \quad\text{and}\quad \mathrm{Var}(Z) = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(X_i). \tag{21}
\]

Therefore, by Chebyshev's inequality,

\[
P\!\left[\left|\frac{1}{n}\sum_{i=1}^n X_i - \frac{1}{n}\sum_{i=1}^n E[X_i]\right| \geqslant \epsilon\right] \leqslant \frac{\sum_{i=1}^n \mathrm{Var}(X_i)}{n^2\epsilon^2} < \frac{\sigma^2}{n\epsilon^2}. \tag{22}
\]
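To see the 1/n decay in (22), the following sketch (ours, not from the notes) estimates the deviation probability of the empirical mean of uniform variables and compares it to the Chebyshev bound:

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 0.1
for n in (10, 100, 1000):
    x = rng.uniform(0, 1, size=(20_000, n))   # E(X_i) = 1/2, Var(X_i) = 1/12
    dev = np.abs(x.mean(axis=1) - 0.5)
    empirical = np.mean(dev >= eps)
    bound = 1 / (12 * n * eps**2)             # sigma^2/(n eps^2) from (22)
    print(n, empirical, bound)                # the empirical tail decays much faster
```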

The weak law of large numbers essentially states that (1/n) Σ_{i=1}^n X_i concentrates its probability mass in a small neighborhood around its average. Note, however, that the convergence proved in (22) is rather slow, on the order of 1/n. Under some mild conditions, it is possible to considerably improve the statement of the weak law of large numbers and obtain a fast convergence, exponential in n. We state without proof two such inequalities that will be particularly useful later on.

Proposition 4.4 (Hoeffding's inequality). Consider independent random variables X_i with E[X_i] = 0 and X_i ∈ [a_i, b_i]. Let Y ≜ Σ_{i=1}^n X_i. Then,

\[
P\!\left[Y \geqslant t\right] \leqslant \exp\!\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right). \tag{23}
\]
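The exponential decay promised by (23) can be seen on centered ±1/2 variables; in the sketch below (ours, not from the notes), the empirical tail sits well under the Hoeffding bound:

```python
import numpy as np

rng = np.random.default_rng(2)
n, t = 100, 10.0
x = rng.choice([-0.5, 0.5], size=(200_000, n))  # X_i in [-1/2, 1/2], E(X_i) = 0
empirical = np.mean(x.sum(axis=1) >= t)         # ~ 0.02 by the Gaussian approximation
bound = np.exp(-2 * t**2 / (n * 1.0**2))        # (b_i - a_i)^2 = 1, bound = exp(-2)
assert empirical <= bound
```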


Proposition 4.5 (McDiarmid’s inequality). Consider independent real-valued random variables {Xi}n

i=1 ∈

X n and a function f : X n → R. If for all j ∈ 1, n and for all {xi}n

i=1 ∈ X n and x′ j ∈ X there

exists cj ⩾ 0 such that

  • f(x1, · · · , xj−1, xj, xj+1, · · · , xn) − f(x1, · · · , xj−1, x′

j, xj+1, · · · , xn)

  • ⩽ cj

then P(|f(X1, · · · , Xn) − E(f(X1, · · · , Xn))| ⩾ t) ⩽ 2 exp

2t2 n

i=1 c2 i

  • .

McDiarmid’s inequality is more general that Hoeffding’s inequality and allows us to bound the concentration of a function of n random variables around its average. 5 Shannon entropy and mutual information Definition 5.1 (Shannon entropy). entropy!Shannon entropy Let X ∈ X be a discrete random variable with |X| < ∞. Tie Shannon entropy of X is defined as H(X) ≜ EX(− ln pX(X)) = −

  • x∈X

pX(x) ln pX(x), (24) with the convention that 0 ln 0 = 0. Unless the context requires clarification, we refer to the Shannon entropy as the entropy for

  • short. Note that H(X) only depends on the PMF pX and not on the exact choice of X, and we

sometimes use the notation H(pX) ≜ H(X) to emphasize this. Tie base of the log determines the unit of entropy and, by convention, we always use the natural log in the textbook to measure entropy in nats. One can convert results bits by merely scaling the entropy by

1 ln 2. In the special

case of a binary random variable X ∈ {0, 1}, the entropy takes on a particularly simple form. Since pX is fully specified by the parameter p ∈ [0, 1] such that pX(1) = p, we define the binary entropy function as Hb (p) ≜ −p ln p − (1 − p) ln(1 − p). (25) When viewed as a function of a PMF, the following properties hold. Proposition 5.2. Tie entropy H : ∆(X) → R : p → H(p) is a concave function.
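A direct implementation of (24) and (25) in nats (our helpers, not from the notes):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, with the convention 0 ln 0 = 0."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def h_b(p):
    """Binary entropy function (25)."""
    return entropy([p, 1.0 - p])

print(entropy([0.25] * 4))              # ln 4 ~= 1.386 nats, maximal for |X| = 4
print(h_b(0.5), h_b(0.5) / np.log(2))   # ln 2 nats, i.e., exactly 1 bit
```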

When viewed as a function of a PMF, the following properties hold.

Proposition 5.2. The entropy H : ∆(𝒳) → ℝ : p ↦ H(p) is a concave function.

Proof. TBD

Proposition 5.3 (Csiszár's inequality). For p, q ∈ ∆(𝒳), we have

\[
|H(p) - H(q)| \leqslant V(p, q)\ln\frac{|\mathcal{X}|}{V(p, q)}.
\]

Proof. TBD

The entropy of a random variable X ∈ 𝒳 can be intuitively understood as a means to measure the uncertainty in the outcome of an experiment modeled by the sampling of the random variable. This intuition is justified by several properties that we develop next.


Proposition 5.4 (Positivity of entropy and condition for equality). Let X ∈ 𝒳 be a discrete random variable. Then H(X) ⩾ 0 with equality if and only if X is a constant, i.e., there exists x* ∈ 𝒳 such that P(X = x*) = 1.

Proof. Since X is a discrete random variable, we have ∀x ∈ 𝒳, pX(x) ∈ [0, 1]. Hence, −ln pX(x) ⩾ 0 and H(X) ⩾ 0 as a convex combination of non-negative terms. Assume now that H(X) = 0. Since ∀x ∈ 𝒳, −pX(x) ln pX(x) ⩾ 0, it must be that ∀x ∈ 𝒳, pX(x) ln pX(x) = 0. Necessarily, ∀x ∈ 𝒳, pX(x) ∈ {0, 1}. Since Σ_{x∈𝒳} pX(x) = 1, there exists x* ∈ 𝒳 such that pX(x*) = 1.

Theorem 5.5 (Maximum entropy). Let X ∈ 𝒳 be a discrete random variable. Then H(X) ⩽ ln |𝒳| with equality if and only if X is uniform over 𝒳, i.e., ∀x ∈ 𝒳, pX(x) = 1/|𝒳|.

Proof. Using Jensen's inequality and the concavity of ln, we obtain

\[
H(X) = E_X\!\left(\ln\frac{1}{p_X(X)}\right) \leqslant \ln E_X\!\left(\frac{1}{p_X(X)}\right) = \ln|\mathcal{X}|.
\]

Equality is achieved when pX(X) is constant, i.e., pX(x) = c for all x ∈ 𝒳 and some c ⩾ 0. Since Σ_{x∈𝒳} pX(x) = 1, we must have c = 1/|𝒳|, and X must be uniform over 𝒳.

The entropy can be extended to jointly distributed random variables.

Definition 5.6 (Joint and conditional entropy). The joint entropy of two discrete random variables X ∈ 𝒳 and Y ∈ 𝒴 with joint PMF pXY is

\[
H(X, Y) \triangleq E_{XY}(-\ln p_{XY}(X, Y)) = -\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} p_{XY}(x, y)\ln p_{XY}(x, y).
\]

Furthermore, the conditional entropy of Y given X is

\[
H(Y|X) \triangleq E_{XY}(-\ln p_{Y|X}(Y|X)) = -\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} p_{XY}(x, y)\ln p_{Y|X}(y|x).
\]

Note that H(X, Y) = H(Y, X) by definition but, in general, H(Y|X) ≠ H(X|Y). Upon defining the entropy of the random variable Y|X = x for x ∈ 𝒳 as

\[
H(Y|X = x) = -\sum_{y\in\mathcal{Y}} p_{Y|X}(y|x)\ln p_{Y|X}(y|x), \tag{26}
\]

note that H(Y|X) = Σ_{x∈𝒳} pX(x) H(Y|X = x). This allows us to intuitively interpret the conditional entropy as the average uncertainty about Y after observing X, and we obtain the following properties.

Proposition 5.7 (Positivity of conditional entropy and condition for equality). Let X, Y be discrete random variables with joint PMF pXY. Then H(Y|X) ⩾ 0 with equality if and only if Y is a function of X.

Proof. By definition, H(Y|X = x) ⩾ 0 for all x ∈ 𝒳. Therefore, H(Y|X) ⩾ 0 as a convex combination of non-negative terms, and H(Y|X) = 0 if and only if ∀x ∈ 𝒳, H(Y|X = x) = 0. From Proposition 5.4, this happens if and only if for all x ∈ 𝒳 there exists y_x ∈ 𝒴 such that pY|X(y_x|x) = 1, i.e., Y is a function of X.


Proposition 5.8 (Chain rule of entropy). Let X, Y, and Z be discrete random variables with joint PMF pXYZ. Then

H(XY|Z) = H(X|Z) + H(Y|XZ) = H(Y|Z) + H(X|YZ).

More generally, if X ≜ {X_i}_{i=1}^n and Z are jointly distributed random variables, we have

\[
H(\mathbf{X}|Z) = \sum_{i=1}^n H(X_i|X_{1:i-1}Z),
\]

with the convention X_{1:0} ≜ ∅.

Proof. The first result follows from Bayes' rule and basic manipulations of the definition of joint entropy. The generalization follows by induction.
For two jointly distributed random variables X and Y, the quantity H(X) − H(X|Y) intuitively represents the uncertainty of X minus the uncertainty of X after observing Y or, in other words, the reduction of uncertainty about X resulting from observing Y. This reduction of uncertainty is what Shannon defined as the mutual information between X and Y.

Definition 5.9 (Mutual information). Let X, Y be two random variables with joint PMF pXY. The mutual information between X and Y is

I(X; Y) ≜ H(X) − H(X|Y).

Using Bayes' rule, one can directly check that the mutual information is expressed in terms of relative entropy as

I(X; Y) = D(pXY‖pX pY).

By Proposition 3.5, this alternative formulation proves that I(X; Y) ⩾ 0 and that equality happens if and only if pXY = pX pY, i.e., X is independent of Y. Additionally, the symmetry of the expression in X and Y also shows that I(X; Y) = H(Y) − H(Y|X). Combining these two observations, we obtain the following important result.

Corollary 5.10 (Monotonicity of entropy). Let X and Y be discrete random variables with joint PMF pXY. Then H(X|Y) ⩽ H(X) with equality if and only if X is independent of Y.

Proof. H(X) − H(X|Y) = I(X; Y) = D(pXY‖pX pY) and the result follows by Proposition 3.5.
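The identity I(X; Y) = H(X) + H(Y) − H(X, Y) = D(pXY‖pX pY) is easy to verify numerically; the sketch below (with an example joint PMF of our choosing, not from the notes) does so:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

pXY = np.array([[0.3, 0.1],
                [0.1, 0.5]])                             # a joint PMF on {0,1} x {0,1}
pX, pY = pXY.sum(axis=1), pXY.sum(axis=0)
mi = entropy(pX) + entropy(pY) - entropy(pXY.ravel())    # H(X) - H(X|Y)
mi_kl = np.sum(pXY * np.log(pXY / np.outer(pX, pY)))     # D(p_XY || p_X p_Y)
assert np.isclose(mi, mi_kl) and mi >= 0
```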

The monotonicity of entropy w.r.t. conditioning is colloquially known as the property that "conditioning reduces entropy." The mutual information is a function of the joint distribution pXY, which itself consists of a PMF pX and a kernel pY|X. We sometimes make this explicit using the notation I(pX, pY|X) ≜ I(X; Y).

Proposition 5.11. If ∆(𝒳 → 𝒴) denotes the set of kernels from 𝒳 to 𝒴, then the function I : ∆(𝒳) × ∆(𝒳 → 𝒴) → ℝ : (p, W) ↦ I(p, W) is a concave function of p and a convex function of W.

Proof. TBD

Definition 5.12 (Conditional mutual information). Let X, Y, Z be discrete random variables with joint distribution pXYZ. The conditional mutual information between X and Y given Z is

I(X; Y|Z) ≜ H(X|Z) − H(X|YZ).


Using again Baye’s rule, we can express the conditional mutual information in terms of relative entropy as I(X; Y |Z) = EZ

  • D
  • pXY |ZpX|ZpY |Z
  • .

(27) Tie symmetry of (27) directly shows that I(X; Y |Z) = I(Y ; X|Z). In addition, (27) together with Proposition 3.5 shows that I(X; Y |Z) ⩾ 0 if and only if X and Y a re conditionally independent given Z. Proposition 5.13 (Chain rule of mutual information). Let X ≜ {Xi}n

i=1, Y , and Z be jointly

distributed random variables. Tien, I(X; Y |Z) =

n

  • i=1

I

  • Xi; Y |ZX1:i−1 with the convention X1:0 = ∅.
  • Proof. Tie result follows from the chain rule of entropy by writing I(X; Y |Z) = H(X|Z)−H(X|Y Z).

Proposition 5.14 (Data-processing inequality). data processing inequality Let X, Y ,and Z be discrete random variables such that X → Y → Z. Tien I(X; Y ) ⩾ I(X; Z) or, equivalently, H(X|Z) ⩾ H(X|Y ).

  • Proof. Tie result follows by using the chain rule to write I(X; Y Z) = I(X; Y ) + I(X; Z|Y ) =

I(X; Z) + I(X; Y |Z). Note that I(X; Z|Y ) = 0 since X − Y − Z and I(X; Y |X) ⩾ 0, the result follows.

We conclude this section with the celebrated Fano's inequality, which relates a probability of error to a conditional entropy.

Theorem 5.15 (Fano's inequality). Let X be a discrete random variable with alphabet 𝒳. Let X̂ ∈ 𝒳 be an estimate of X, with joint distribution p_{XX̂}. We define the probability of estimation error P_e ≜ P[X ≠ X̂]. Then,

\[
H(X|\hat{X}) \leqslant H_b(P_e) + P_e \ln(|\mathcal{X}| - 1).
\]

Proof. Let us introduce the binary random variable E ≜ 1{X ≠ X̂}, i.e., E is a function of X and X̂ that indicates whether an error occurs. Note that P[E = 1] = P[X ≠ X̂] = P_e. Then,

\[
\begin{aligned}
H(X|\hat{X}) &= H(XE|\hat{X}) - H(E|X\hat{X}) && \text{(chain rule)}\\
&= H(XE|\hat{X}) && \text{($H(E|X\hat{X}) = 0$ since $E$ is a function of $X$ and $\hat{X}$)}\\
&= H(E|\hat{X}) + H(X|E\hat{X}) \\
&\leqslant H(E) + H(X|E\hat{X}) && \text{(conditioning reduces entropy)}\\
&= H_b(P_e) + H(X|E\hat{X}).
\end{aligned}
\]

Note that H(X|EX̂) = H(X|X̂, E = 0) P(E = 0) + H(X|X̂, E = 1) P(E = 1), with H(X|X̂, E = 0) = 0, P(E = 1) = P_e, and H(X|X̂, E = 1) ⩽ ln(|𝒳| − 1), which yields the result.

Despite it’s apparent simplicity, Fano’s inequality plays a crucial role in information theory because it allows one to relate an operational quantity, the probability of error, to an information- theoretic measure, a conditional entropy. 11