

SLIDE 1

Lecture 8: Information Theory and Statistics

Part II: Hypothesis Testing and Estimation

I-Hsiang Wang

Department of Electrical Engineering, National Taiwan University, ihwang@ntu.edu.tw

December 22, 2015

SLIDE 2

1 Hypothesis Testing
   Basic Theory
   Asymptotics

SLIDE 3

1 Hypothesis Testing
   Basic Theory
   Asymptotics

SLIDE 4

Basic Setup

We begin with the simplest setup – binary hypothesis testing:

1 Two hypotheses regarding the observation X, indexed by θ ∈ {0, 1}:

   H0 : X ∼ P0 (Null Hypothesis, θ = 0)
   H1 : X ∼ P1 (Alternative Hypothesis, θ = 1)

2 Goal: design a decision-making algorithm φ : X → {0, 1}, x ↦ θ̂, to choose one of the two hypotheses based on the observed realization x of X, so that a certain cost (or risk) is minimized.

3 A popular measure of the cost is based on the probabilities of error:

   Probability of false alarm (false positive; type I error): α_φ ≡ P_FA(φ) ≜ P{H1 is chosen | H0}.
   Probability of miss detection (false negative; type II error): β_φ ≡ P_MD(φ) ≜ P{H0 is chosen | H1}.

SLIDE 5

Deterministic Testing Algorithm ≡ Decision Regions

[Figure: the observation space X partitioned into A1(φ), the acceptance region of H1, and A0(φ), the acceptance region of H0.]

A test φ : X → {0, 1} is equivalently characterized by its corresponding acceptance (decision) regions:

   A_θ̂(φ) ≡ φ⁻¹(θ̂) ≜ {x ∈ X : φ(x) = θ̂},  θ̂ = 0, 1.

Hence, the two types of error probability can be equivalently represented as

   α_φ = Σ_{x∈A1(φ)} P0(x) = Σ_{x∈X} φ(x) P0(x),
   β_φ = Σ_{x∈A0(φ)} P1(x) = Σ_{x∈X} (1 − φ(x)) P1(x).

When the context is clear, we often drop the dependency on the test φ and simply write the acceptance regions as A_θ̂.

SLIDE 6

Likelihood Ratio Test

Definition 1 (Likelihood Ratio Test)
A (deterministic) likelihood ratio test (LRT) is a test φ_τ, parametrized by a constant τ > 0 (called the threshold), defined as follows:

   φ_τ(x) = 1 if P1(x) > τ P0(x);  φ_τ(x) = 0 if P1(x) ≤ τ P0(x).

For x ∈ supp P0, the likelihood ratio is L(x) ≜ P1(x)/P0(x). Hence, an LRT is a thresholding algorithm on the likelihood ratio L(x).

Remark: For computational convenience, one often works with the log likelihood ratio (LLR) log L(x) = log P1(x) − log P0(x).
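To make the thresholding concrete, here is a minimal Python sketch (not from the lecture; the helper names lrt and llr and the example distributions are illustrative assumptions):

```python
import numpy as np

# Sketch of a deterministic LRT on a finite alphabet {0, ..., d-1},
# where P0 and P1 are probability vectors over the alphabet.
def lrt(x, P0, P1, tau):
    """Decide 1 (H1) iff P1(x) > tau * P0(x); otherwise decide 0 (H0)."""
    return int(P1[x] > tau * P0[x])

def llr(xs, P0, P1):
    """Log likelihood ratio of a sequence; working in the log domain
    avoids numerical underflow for long observation sequences."""
    xs = np.asarray(xs)
    return np.sum(np.log(P1[xs]) - np.log(P0[xs]))

P0 = np.array([0.5, 0.5])
P1 = np.array([0.2, 0.8])
print(lrt(1, P0, P1, tau=1.0))  # P1(1) = 0.8 > 0.5 = P0(1), so decide H1
print(llr([0, 1, 1], P0, P1))   # log(0.4) + 2*log(1.6), about 0.024
```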

SLIDE 7

Trade-Off Between α (PFA) and β (PMD)

Theorem 1 (Neyman-Pearson Lemma)
For a likelihood ratio test φ_τ and any other deterministic test φ,

   α_φ ≤ α_{φ_τ}  ⟹  β_φ ≥ β_{φ_τ}.

pf: Observe that ∀ x ∈ X,

   0 ≤ (φ_τ(x) − φ(x)) (P1(x) − τ P0(x)),

because
   if P1(x) − τ P0(x) > 0 ⟹ φ_τ(x) = 1 ⟹ φ_τ(x) − φ(x) ≥ 0;
   if P1(x) − τ P0(x) ≤ 0 ⟹ φ_τ(x) = 0 ⟹ φ_τ(x) − φ(x) ≤ 0.

Summing over all x ∈ X, we get

   0 ≤ (1 − β_{φ_τ}) − (1 − β_φ) − τ (α_{φ_τ} − α_φ) = (β_φ − β_{φ_τ}) + τ (α_φ − α_{φ_τ}).

Since τ > 0, we conclude that α_φ ≤ α_{φ_τ} ⟹ β_φ ≥ β_{φ_τ}.
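Since the lemma quantifies over all deterministic tests, it is easy to check numerically on a tiny alphabet. The following sketch (my own illustration; the distributions are arbitrary assumptions) enumerates all 2³ deterministic tests and verifies that none improves on the LRT in both error probabilities:

```python
import itertools
import numpy as np

P0 = np.array([0.5, 0.3, 0.2])
P1 = np.array([0.2, 0.3, 0.5])
tau = 1.0
phi_lrt = (P1 > tau * P0).astype(float)   # phi_tau(x) = 1{P1(x) > tau*P0(x)}

def errors(phi):
    alpha = np.sum(phi * P0)              # P{decide H1 | H0} (false alarm)
    beta = np.sum((1 - phi) * P1)         # P{decide H0 | H1} (miss)
    return alpha, beta

a_lrt, b_lrt = errors(phi_lrt)
for bits in itertools.product([0.0, 1.0], repeat=3):
    a, b = errors(np.array(bits))
    # Neyman-Pearson: alpha <= alpha_LRT must imply beta >= beta_LRT.
    assert not (a <= a_lrt and b < b_lrt)
print(f"LRT: alpha = {a_lrt:.2f}, beta = {b_lrt:.2f}")  # 0.20, 0.50
```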

SLIDE 8

[Figure: trade-off curves between α (P_FA) and β (P_MD), plotted on the unit square [0, 1] × [0, 1].]

Question: What is the optimal trade-off curve? What is the optimal test achieving the curve?

SLIDE 9

Randomized Testing Algorithm

Randomized tests include deterministic tests as special cases.

Definition 2 (Randomized Test)
A randomized test decides θ̂ = 1 with probability φ(x) and θ̂ = 0 with probability 1 − φ(x), where φ is a mapping φ : X → [0, 1].

Note: A randomized test is characterized by φ, just as in the deterministic case.

Definition 3 (Randomized LRT)
A randomized likelihood ratio test (LRT) is a test φ_{τ,γ}, parametrized by constants τ > 0 and γ ∈ (0, 1), defined as follows:

   φ_{τ,γ}(x) = 1 if P1(x) > τ P0(x);  γ if P1(x) = τ P0(x);  0 if P1(x) < τ P0(x).

SLIDE 10

Randomized LRT Achieves the Optimal Trade-Off

Consider the following optimization problem:

Neyman-Pearson Problem
   minimize over φ : X → [0, 1]:  β_φ
   subject to:  α_φ ≤ α*

Theorem 2 (Neyman-Pearson)
A randomized LRT φ_{τ*,γ*} with parameters (τ*, γ*) satisfying α* = α_{φ_{τ*,γ*}} attains optimality for the Neyman-Pearson Problem.

SLIDE 11

pf: First argue that for any α* ∈ (0, 1), one can find (τ*, γ*) such that

   α* = α_{φ_{τ*,γ*}} = Σ_{x∈X} φ_{τ*,γ*}(x) P0(x) = Σ_{x: L(x)>τ*} P0(x) + γ* Σ_{x: L(x)=τ*} P0(x).

For any test φ, by the same argument as in Theorem 1, we have ∀ x ∈ X,

   (φ_{τ*,γ*}(x) − φ(x)) (P1(x) − τ* P0(x)) ≥ 0.

Summing over all x ∈ X, we similarly get

   (β_φ − β_{φ_{τ*,γ*}}) + τ* (α_φ − α_{φ_{τ*,γ*}}) ≥ 0.

Hence, for any feasible test φ with α_φ ≤ α* = α_{φ_{τ*,γ*}}, its probability of type II error satisfies β_φ ≥ β_{φ_{τ*,γ*}}.
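The existence argument for (τ*, γ*) is constructive: sort the symbols by likelihood ratio, accumulate P0-mass until the target α* is reached, and randomize on the boundary set. A sketch, assuming a finite alphabet with P0(x) > 0 for all x (the helper name np_threshold is my own):

```python
import numpy as np

def np_threshold(P0, P1, alpha_star):
    """Find (tau, gamma) so the randomized LRT has false alarm alpha_star."""
    L = P1 / P0                            # likelihood ratios
    order = np.argsort(-L)                 # symbols by decreasing L(x)
    mass = np.cumsum(P0[order])            # alpha of "accept H1 on a prefix"
    k = np.searchsorted(mass, alpha_star)  # first prefix with mass >= alpha*
    tau = L[order[k]]                      # threshold sits at this symbol
    above = np.sum(P0[L > tau])            # P0{L(x) > tau}
    at = np.sum(P0[L == tau])              # P0{L(x) = tau}
    gamma = (alpha_star - above) / at      # randomize on the boundary set
    return tau, gamma

P0 = np.array([0.5, 0.3, 0.2])
P1 = np.array([0.2, 0.3, 0.5])
tau, gamma = np_threshold(P0, P1, alpha_star=0.35)
L = P1 / P0
alpha = np.sum(P0[L > tau]) + gamma * np.sum(P0[L == tau])
print(tau, gamma, alpha)                   # 1.0, 0.5, 0.35
```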

SLIDE 12

Bayesian Setup

Sometimes prior probabilities of the two hypotheses are known:

   π_θ ≜ P{Hθ is true}, θ = 0, 1,  with π0 + π1 = 1.

In this sense, one can view the index Θ as a (binary) random variable with prior distribution P{Θ = θ} = π_θ, for θ = 0, 1. With prior probabilities, it then makes sense to talk about the average probability of error for a test φ, or more generally, the average cost (risk):

   Pe(φ) ≜ π0 α_φ + π1 β_φ = E_{Θ,X}[1{Θ ≠ Θ̂}],  R(φ) ≜ E_{Θ,X}[r_{Θ,Θ̂}].

The Bayesian hypothesis testing problem is to test the two hypotheses with knowledge of the prior probabilities so that the average probability of error (or, in general, a risk function) is minimized.

SLIDE 13

Minimizing Bayes Risk

Consider the following problem of minimizing the Bayes risk.

Bayesian Problem
   minimize over φ : X → [0, 1]:  R(φ) ≜ E_{Θ,X}[r_{Θ,Θ̂}]
   with known priors (π0, π1) and costs r_{θ,θ̂}

Theorem 3 (LRT is an Optimal Bayesian Test)
Assume r_{0,0} < r_{0,1} and r_{1,1} < r_{1,0}. A deterministic LRT φ_{τ*} with threshold

   τ* = ((r_{0,1} − r_{0,0}) π0) / ((r_{1,0} − r_{1,1}) π1)

attains optimality for the Bayesian Problem.

SLIDE 14

pf: Expand the risk:

   R(φ) = Σ_{x∈X} r_{0,0} π0 P0(x) (1 − φ(x)) + Σ_{x∈X} r_{0,1} π0 P0(x) φ(x)
        + Σ_{x∈X} r_{1,0} π1 P1(x) (1 − φ(x)) + Σ_{x∈X} r_{1,1} π1 P1(x) φ(x)
        = r_{0,0} π0 + r_{1,0} π1 + Σ_{x∈X} [(r_{0,1} − r_{0,0}) π0 P0(x) − (r_{1,0} − r_{1,1}) π1 P1(x)] φ(x).  (∗)

For each x ∈ X, we shall choose φ(x) ∈ [0, 1] such that the summand in (∗) is minimized. It is then obvious that we should choose

   φ(x) = 1 if (r_{0,1} − r_{0,0}) π0 P0(x) − (r_{1,0} − r_{1,1}) π1 P1(x) < 0,
   φ(x) = 0 if (r_{0,1} − r_{0,0}) π0 P0(x) − (r_{1,0} − r_{1,1}) π1 P1(x) ≥ 0,

which is exactly the deterministic LRT with threshold τ* given in Theorem 3.
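As a quick sanity check of Theorem 3, the following sketch (illustrative names and numbers, not from the slides) evaluates the optimal threshold and the resulting Bayes risk; with 0-1 costs the risk reduces to the average error probability Pe:

```python
import numpy as np

def bayes_lrt(P0, P1, pi0, pi1, r):
    """Bayes-optimal deterministic LRT; r[theta, theta_hat] are the costs."""
    tau = (r[0, 1] - r[0, 0]) * pi0 / ((r[1, 0] - r[1, 1]) * pi1)
    phi = (P1 > tau * P0).astype(float)    # decide H1 iff L(x) > tau*
    alpha = np.sum(phi * P0)
    beta = np.sum((1 - phi) * P1)
    risk = (r[0, 0] * pi0 * (1 - alpha) + r[0, 1] * pi0 * alpha
            + r[1, 0] * pi1 * beta + r[1, 1] * pi1 * (1 - beta))
    return tau, phi, risk

P0 = np.array([0.5, 0.3, 0.2])
P1 = np.array([0.2, 0.3, 0.5])
r = np.array([[0.0, 1.0],                  # 0-1 costs: risk equals Pe
              [1.0, 0.0]])
tau, phi, risk = bayes_lrt(P0, P1, pi0=0.5, pi1=0.5, r=r)
print(tau, phi, risk)                      # tau = 1.0, Pe = 0.35
```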

SLIDE 15

Discussions

For binary hypothesis testing problems, the likelihood ratio L(x) ≜ P1(x)/P0(x) turns out to be a sufficient statistic. Moreover, a likelihood ratio test (LRT) is optimal in both the Bayesian and the Neyman-Pearson settings.

Extensions include:
   M-ary hypothesis testing
   Minimax risk optimization (with unknown prior)
   Composite hypothesis testing, etc.

Here we do not pursue these directions further. Instead, we explore the asymptotic behavior of hypothesis testing and its connections with information-theoretic tools.

SLIDE 16

1 Hypothesis Testing
   Basic Theory
   Asymptotics

SLIDE 17

i.i.d. Observations

So far we have focused on the general setting where the observation space X can be an arbitrary alphabet. In the following, we consider the product space Xⁿ and a length-n observation sequence Xⁿ drawn i.i.d. from one of the two distributions; the two hypotheses are

   H0 : Xi i.i.d. ∼ P0, i = 1, 2, ..., n
   H1 : Xi i.i.d. ∼ P1, i = 1, 2, ..., n

The corresponding probabilities of error are denoted by

   α(n) ≡ P_FA^(n) ≜ P{H1 is chosen | H0},
   β(n) ≡ P_MD^(n) ≜ P{H0 is chosen | H1}.

Throughout the lecture we assume X = {a1, a2, ..., ad} is a finite set.

SLIDE 18

LRT under i.i.d. Observation (1)

With i.i.d. observations, the likelihood ratio of a sequence xⁿ ∈ Xⁿ is

   L(xⁿ) = Π_{i=1}^{n} P1(xi)/P0(xi) = Π_{a∈X} (P1(a)/P0(a))^{N(a|xⁿ)} = Π_{a∈X} (P1(a)/P0(a))^{n π(a|xⁿ)},

where N(a|xⁿ) ≜ the number of occurrences of a in xⁿ, and π(a|xⁿ) ≜ (1/n) N(a|xⁿ) is the relative frequency of occurrence of symbol a in the sequence xⁿ.

Note: From the above manipulation, we see that the collection of relative frequencies of occurrence (as a |X|-dimensional probability vector),

   Π_{xⁿ} ≜ [π(a1|xⁿ) π(a2|xⁿ) ··· π(ad|xⁿ)]ᵀ,

called the type of the sequence xⁿ, is a sufficient statistic for all the previously mentioned hypothesis testing problems.
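The sufficiency claim can be checked numerically: the LLR of a sequence is a function of its type alone. A short sketch (illustrative distributions; the helper name sequence_type is my own):

```python
import numpy as np

def sequence_type(xs, d):
    """Type (empirical distribution) of xs over the alphabet {0, ..., d-1}."""
    return np.bincount(xs, minlength=d) / len(xs)

P0 = np.array([0.5, 0.3, 0.2])
P1 = np.array([0.2, 0.3, 0.5])
rng = np.random.default_rng(0)
xs = rng.choice(3, size=1000, p=P1)        # sample a sequence under H1
Pi = sequence_type(xs, d=3)

n = len(xs)
llr_direct = np.sum(np.log(P1[xs] / P0[xs]))      # sum over the sequence
llr_from_type = n * np.sum(Pi * np.log(P1 / P0))  # function of the type only
print(np.allclose(llr_direct, llr_from_type))     # True
```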

SLIDE 19

LRT under i.i.d. Observation (2)

Let us further manipulate the LRT by taking the log likelihood ratio:

   L(xⁿ) ⋛ τn
   ⟺ log L(xⁿ) ⋛ log τn
   ⟺ Σ_{a∈X} n π(a|xⁿ) log(P1(a)/P0(a)) ⋛ log τn
   ⟺ Σ_{a∈X} π(a|xⁿ) log(π(a|xⁿ)/P0(a)) − Σ_{a∈X} π(a|xⁿ) log(π(a|xⁿ)/P1(a)) ⋛ (1/n) log τn
   ⟺ D(Π_{xⁿ} ‖ P0) − D(Π_{xⁿ} ‖ P1) ⋛ (1/n) log τn
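The last equivalence is an identity worth checking numerically: the normalized LLR equals D(Π_{xⁿ} ‖ P0) − D(Π_{xⁿ} ‖ P1). A sketch under the same illustrative distributions as before:

```python
import numpy as np

def kl(P, Q):
    """KL divergence in nats, with the 0*log 0 = 0 convention."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

P0 = np.array([0.5, 0.3, 0.2])
P1 = np.array([0.2, 0.3, 0.5])
rng = np.random.default_rng(1)
xs = rng.choice(3, size=500, p=P0)
Pi = np.bincount(xs, minlength=3) / len(xs)       # type of the sequence

lhs = np.mean(np.log(P1[xs] / P0[xs]))            # (1/n) log L(x^n)
rhs = kl(Pi, P0) - kl(Pi, P1)
print(np.allclose(lhs, rhs))                      # True
```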

SLIDE 20

[Figure: the correspondence between decision regions in the observation space and regions in the probability simplex. Left: Xⁿ partitioned into A1(φ) (acceptance region of H1) and A0(φ) (acceptance region of H0). Right: the probability simplex P(X), containing P0 and P1, partitioned into F1^(n) and F0^(n).]

   {xⁿ ∈ Ai ⟹ decide Hi}  ⟷  {Π_{xⁿ} ∈ Fi^(n) ⟹ decide Hi}.

SLIDE 21

[Figure: the probability simplex P(X), with P0, P1, the optimizer P*, and the regions F1^(n) and F0^(n).]

By Sanov's Theorem, we know that

   α(n) = P0ⁿ(F1^(n)) ≈ 2^{−n D(P* ‖ P0)},  β(n) = P1ⁿ(F0^(n)) ≈ 2^{−n D(P* ‖ P1)}.

SLIDE 22

Asymptotic Behaviors

1 Neyman-Pearson:

   β*(n, ε) ≜ min over φn : Xⁿ → [0, 1] of β_{φn}^(n), subject to α_{φn}^(n) ≤ ε.

It turns out that for all ε ∈ (0, 1),

   lim_{n→∞} {−(1/n) log β*(n, ε)} = D(P0 ‖ P1).

2 Bayesian:

   Pe*(n) ≜ min over φn : Xⁿ → [0, 1] of {π0 α_{φn}^(n) + π1 β_{φn}^(n)}.

It turns out that

   lim_{n→∞} {−(1/n) log Pe*(n)} = D(Pλ* ‖ P0) = D(Pλ* ‖ P1),

where

   Pλ(a) ≜ (P0(a))^λ (P1(a))^{1−λ} / Σ_{x∈X} (P0(x))^λ (P1(x))^{1−λ}, ∀ a ∈ X,

and λ* ∈ (0, 1) is such that D(Pλ* ‖ P0) = D(Pλ* ‖ P1).
SLIDE 23

Asymptotics in Neyman-Pearson Setup

Theorem 4 (Chernoff-Stein)
For all ε ∈ (0, 1), lim_{n→∞} {−(1/n) log β*(n, ε)} = D(P0 ‖ P1).

pf: We shall prove the achievability and the converse parts separately.

Achievability: construct a sequence of tests {φn} with α_{φn}^(n) ≤ ε for n sufficiently large, such that

   lim inf_{n→∞} {−(1/n) log β_{φn}^(n)} ≥ D(P0 ‖ P1).

Converse: for any sequence of tests {φn} with α_{φn}^(n) ≤ ε for n sufficiently large, show that

   lim sup_{n→∞} {−(1/n) log β_{φn}^(n)} ≤ D(P0 ‖ P1).

We use the method of types to prove both the achievability and the converse. Alternatively, Chapter 11.8 of Cover & Thomas [1] proves the theorem with a kind of weak typicality.

SLIDE 24

Achievability: Consider the deterministic test

   φn(xⁿ) = 1{D(Π_{xⁿ} ‖ P0) ≥ δn},  δn ≜ (1/n) (log(1/ε) + d log(n+1)).

In other words, it decides H1 if D(Π_{xⁿ} ‖ P0) ≥ δn, and H0 otherwise.

Check the probability of type I error: by Prop. 4 in Part I, we have

   α_{φn}^(n) = P_{Xi i.i.d.∼P0} {D(Π_{Xⁿ} ‖ P0) ≥ δn} ≤ 2^{−n(δn − d log(n+1)/n)} =(a) ε,

where (a) is due to our construction.

Analyze the probability of type II error ((b) is due to Prop. 3 in Part I):

   β_{φn}^(n) = Σ_{Q∈Pn: D(Q‖P0)<δn} P1ⁿ(Tn(Q)) ≤(b) Σ_{Q∈Pn: D(Q‖P0)<δn} 2^{−n D(Q‖P1)} ≤ |Pn| 2^{−n Dn*},

where Dn* ≜ min_{Q∈Pn: D(Q‖P0)<δn} D(Q‖P1).

Since lim_{n→∞} δn = 0, we have lim_{n→∞} Dn* = D(P0 ‖ P1), and the achievability part is done.
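For a small alphabet, β_{φn}^(n) can be computed exactly by enumerating all types in Pn, which makes the convergence of the exponent toward D(P0 ‖ P1) visible. A sketch with d = 3 and illustrative distributions (since δn shrinks only like (log n)/n, the exponent approaches its limit gradually):

```python
import numpy as np
from scipy.stats import multinomial

def kl2(P, Q):
    mask = P > 0
    return np.sum(P[mask] * np.log2(P[mask] / Q[mask]))

P0 = np.array([0.5, 0.3, 0.2])
P1 = np.array([0.2, 0.3, 0.5])
d, eps = 3, 0.1
for n in (50, 100, 200):
    delta_n = (np.log2(1 / eps) + d * np.log2(n + 1)) / n
    beta = 0.0
    for k0 in range(n + 1):                     # enumerate all types in P_n
        for k1 in range(n + 1 - k0):
            counts = np.array([k0, k1, n - k0 - k1])
            if kl2(counts / n, P0) < delta_n:   # test decides H0 on this type
                beta += multinomial.pmf(counts, n, P1)
    print(n, -np.log2(beta) / n, "->", kl2(P0, P1))
```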

SLIDE 25

Converse: We prove the converse for deterministic tests. The extension to randomized tests is left as an exercise (HW6).

Let A_i^(n) ≜ {xⁿ | φn(xⁿ) = i} denote the acceptance region of Hi, for i = 0, 1. Let B^(n) ≜ {xⁿ | D(Π_{xⁿ} ‖ P0) < εn}, with εn ≜ 2d log(n+1)/n. By Prop. 4, we have

   P0ⁿ(B^(n)) = 1 − P_{Xi i.i.d.∼P0} {D(Π_{Xⁿ} ‖ P0) ≥ εn} ≥ 1 − 2^{−n(εn − d log(n+1)/n)} = 1 − 2^{−d log(n+1)} → 1 as n → ∞.

Hence, for sufficiently large n, both P0ⁿ(B^(n)) and P0ⁿ(A0^(n)) > 1 − ε, and

   P0ⁿ(B^(n) ∩ A0^(n)) = P0ⁿ(B^(n)) + P0ⁿ(A0^(n)) − P0ⁿ(B^(n) ∪ A0^(n)) > 2(1 − ε) − 1 = 1 − 2ε.

Note that B^(n) = ∪_{Q∈Pn: D(Q‖P0)<εn} Tn(Q). Hence ∃ Qn ∈ Pn with D(Qn ‖ P0) < εn such that

   P0ⁿ(Tn(Qn) ∩ A0^(n)) > (1 − 2ε) P0ⁿ(Tn(Qn)).  (1)

SLIDE 26

Key Observation: the probability of each sequence in the same type class is the same under any product distribution. Hence, (1) is equivalent to

   |Tn(Qn) ∩ A0^(n)| > (1 − 2ε) |Tn(Qn)|,

which in turn implies

   P1ⁿ(Tn(Qn) ∩ A0^(n)) > (1 − 2ε) P1ⁿ(Tn(Qn)).

Hence, for sufficiently large n, ∃ Qn ∈ Pn with D(Qn ‖ P0) < εn such that

   β_{φn}^(n) = P1ⁿ(A0^(n)) ≥ P1ⁿ(Tn(Qn) ∩ A0^(n)) > (1 − 2ε) P1ⁿ(Tn(Qn)) ≥(c) (1 − 2ε) |Pn|⁻¹ 2^{−n D(Qn ‖ P1)},

where (c) is due to Prop. 3. Finally, since lim_{n→∞} εn = 0, we have lim_{n→∞} D(Qn ‖ P1) = D(P0 ‖ P1), and the converse proof is done.

SLIDE 27

Asymptotics in Bayesian Setup

Theorem 5 (Chernoff)

   lim_{n→∞} {−(1/n) log Pe*(n)} = D(Pλ* ‖ P0) = D(Pλ* ‖ P1) = max_{λ∈[0,1]} log(1 / Σ_{x∈X} (P0(x))^λ (P1(x))^{1−λ}) ≜ Chernoff Information CI(P0, P1),

where

   Pλ(a) ≜ (P0(a))^λ (P1(a))^{1−λ} / Σ_{x∈X} (P0(x))^λ (P1(x))^{1−λ}, ∀ a ∈ X,

and λ* ∈ (0, 1) is such that D(Pλ* ‖ P0) = D(Pλ* ‖ P1).

Note: The optimal Bayesian test (for minimizing Pe) is the maximum a posteriori (MAP) test: φ_MAP(xⁿ) = 1{π1 P1ⁿ(xⁿ) ≥ π0 P0ⁿ(xⁿ)}.
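Numerically, CI(P0, P1) can be obtained by maximizing −log₂ Σ_x (P0(x))^λ (P1(x))^{1−λ} over λ ∈ [0, 1]; the function being minimized is convex in λ. A sketch (illustrative distributions, divergences in bits) that also checks the equalizer property of Pλ*:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl2(P, Q):
    return np.sum(P * np.log2(P / Q))      # KL divergence in bits

P0 = np.array([0.5, 0.3, 0.2])
P1 = np.array([0.2, 0.3, 0.5])
f = lambda lam: np.log2(np.sum(P0**lam * P1**(1 - lam)))  # convex in lam
res = minimize_scalar(f, bounds=(0.0, 1.0), method="bounded")
lam_star, ci = res.x, -res.fun             # CI(P0, P1) = max of -f

P_star = P0**lam_star * P1**(1 - lam_star)
P_star /= P_star.sum()                     # tilted distribution at lambda*
print(ci, kl2(P_star, P0), kl2(P_star, P1))  # the three values should agree
```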

SLIDE 28

pf: The proof is based on applying large deviations to the analysis of the optimal test, MAP: φ_MAP(xⁿ) = 1{π1 P1ⁿ(xⁿ) ≥ π0 P0ⁿ(xⁿ)}.

Analysis of the error probabilities of the MAP test:

   α(n) = P0ⁿ(F1^(n)),  β(n) = P1ⁿ(F0^(n)),

where

   F1^(n) ≜ {Q ∈ P(X) | D(Q ‖ P0) − D(Q ‖ P1) ≥ (1/n) log(π0/π1)},
   F0^(n) ≜ {Q ∈ P(X) | D(Q ‖ P0) − D(Q ‖ P1) ≤ (1/n) log(π0/π1)}.

Asymptotics: By Sanov's Theorem, we have

   lim_{n→∞} {−(1/n) log α(n)} = min_{Q∈F1} D(Q ‖ P0),
   lim_{n→∞} {−(1/n) log β(n)} = min_{Q∈F0} D(Q ‖ P1),

where F1 ≜ {Q ∈ P(X) | D(Q ‖ P0) − D(Q ‖ P1) ≥ 0} and F0 ≜ {Q ∈ P(X) | D(Q ‖ P0) − D(Q ‖ P1) ≤ 0}.

SLIDE 29

Exponents: Characterizing the two exponents is equivalent to solving the two (convex) optimization problems:

min_{Q∈F1} D(Q ‖ P0):
   minimize over (Q1, ..., Qd):  Σ_{l=1}^{d} Ql log(Ql / P0(al))
   subject to:  Σ_{l=1}^{d} Ql log(P1(al)/P0(al)) ≥ 0;  Ql ≥ 0, l = 1, ..., d;  Σ_{l=1}^{d} Ql = 1.

min_{Q∈F0} D(Q ‖ P1):
   minimize over (Q1, ..., Qd):  Σ_{l=1}^{d} Ql log(Ql / P1(al))
   subject to:  Σ_{l=1}^{d} Ql log(P1(al)/P0(al)) ≤ 0;  Ql ≥ 0, l = 1, ..., d;  Σ_{l=1}^{d} Ql = 1.

It turns out that both problems have a common optimal solution

   Pλ*(a) = (P0(a))^{λ*} (P1(a))^{1−λ*} / Σ_{x∈X} (P0(x))^{λ*} (P1(x))^{1−λ*}, ∀ a ∈ X,

with λ* ∈ [0, 1] such that D(Pλ* ‖ P0) = D(Pλ* ‖ P1).

SLIDE 30

Hence, both types of error probability have the same exponent, and so does the average error probability. This completes the proof of the first part.

Chernoff Information: To show that

   CI(P0, P1) ≜ max_{λ∈[0,1]} log(1 / Σ_{x∈X} (P0(x))^λ (P1(x))^{1−λ}) = D(Pλ* ‖ P0),

simply observe that

   D(Pλ ‖ P0) = D(Pλ ‖ P1)
   ⟺ Σ_{a∈X} (P0(a))^λ (P1(a))^{1−λ} (log P0(a) − log P1(a)) = 0
   ⟺ D(Pλ ‖ P0) = D(Pλ ‖ P1) = log(1 / Σ_{x∈X} (P0(x))^λ (P1(x))^{1−λ}).

Proof complete.
