Lecture 8: Information Theory and Statistics



SLIDE 1


Lecture 8: Information Theory and Statistics

Part II: Hypothesis Testing and Estimation

I-Hsiang Wang

Department of Electrical Engineering, National Taiwan University, ihwang@ntu.edu.tw

December 23, 2015

1 / 50 I-Hsiang Wang IT Lecture 8 Part II

SLIDE 2

Hypothesis Testing Estimation Basic Theory Asymptotics

1 Hypothesis Testing
   Basic Theory
   Asymptotics

2 Estimation
   Performance Evaluation of Estimators
   MLE, Asymptotics, and Bayesian Estimators

SLIDE 3

1 Hypothesis Testing
   Basic Theory
   Asymptotics

2 Estimation
   Performance Evaluation of Estimators
   MLE, Asymptotics, and Bayesian Estimators

SLIDE 4

Basic Setup

We begin with the simplest setup – binary hypothesis testing:

1 Two hypotheses regarding the observation X, indexed by θ ∈ {0, 1}:

H0 : X ∼ P0 (Null Hypothesis, θ = 0)
H1 : X ∼ P1 (Alternative Hypothesis, θ = 1)

2 Goal: design a decision-making algorithm φ : X → {0, 1}, x ↦ θ̂, to choose one of the two hypotheses based on the observed realization of X, so that a certain cost (or risk) is minimized.

3 A popular measure of the cost is based on the error probabilities:

Probability of false alarm (false positive; type I error): αφ ≡ PFA(φ) ≜ P{H1 is chosen | H0}.
Probability of miss detection (false negative; type II error): βφ ≡ PMD(φ) ≜ P{H0 is chosen | H1}.
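On a finite alphabet, the two error probabilities of a deterministic test can be computed exactly by summing over symbols. A minimal sketch; the two PMFs and the test below are hypothetical values chosen for illustration, not from the lecture:

```python
# Two hypothetical PMFs on a 3-symbol alphabet (illustrative values only).
P0 = {"a": 0.5, "b": 0.3, "c": 0.2}
P1 = {"a": 0.2, "b": 0.3, "c": 0.5}

def error_probs(phi, P0, P1):
    """Return (alpha, beta) for a deterministic test phi : X -> {0, 1}.

    alpha = P0(phi(X) = 1) is the false-alarm probability,
    beta  = P1(phi(X) = 0) is the miss-detection probability.
    """
    alpha = sum(p for x, p in P0.items() if phi(x) == 1)
    beta = sum(p for x, p in P1.items() if phi(x) == 0)
    return alpha, beta

# Example test: decide H1 exactly when x == "c".
phi = lambda x: 1 if x == "c" else 0
alpha, beta = error_probs(phi, P0, P1)
```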

SLIDE 5

Deterministic Testing Algorithm ≡ Decision Regions

[Figure: the observation space X partitioned into A1(φ), the acceptance region of H1, and A0(φ), the acceptance region of H0.]

A test φ : X → {0, 1} is equivalently characterized by its corresponding acceptance (decision) regions:

A_θ̂(φ) ≡ φ^(−1)(θ̂) ≜ {x ∈ X : φ(x) = θ̂}, θ̂ = 0, 1.

Hence, the two types of error probability can be equivalently represented as

αφ = ∑_{x∈A1(φ)} P0(x) = ∑_{x∈X} φ(x) P0(x), βφ = ∑_{x∈A0(φ)} P1(x) = ∑_{x∈X} (1 − φ(x)) P1(x).

When the context is clear, we often drop the dependency on the test φ and simply write A_θ̂.

SLIDE 6

Likelihood Ratio Test

Definition 1 (Likelihood Ratio Test) A (deterministic) likelihood ratio test (LRT) is a test φτ, parametrized by a constant τ > 0 (called the threshold), defined as follows:

φτ(x) = 1 if P1(x) > τ P0(x); 0 if P1(x) ≤ τ P0(x).

For x ∈ supp P0, the likelihood ratio is L(x) ≜ P1(x)/P0(x). Hence, an LRT is a thresholding algorithm on the likelihood ratio L(x).

Remark: For computational convenience, one often works with the log-likelihood ratio (LLR) log L(x) = log P1(x) − log P0(x).

SLIDE 7

Trade-Off Between α (PFA) and β (PMD)

Theorem 1 (Neyman-Pearson Lemma) For a likelihood ratio test φτ and any other deterministic test φ,

αφ ≤ αφτ ⟹ βφ ≥ βφτ.

pf: Observe that ∀ x ∈ X,

0 ≤ (φτ(x) − φ(x)) (P1(x) − τ P0(x)),

because

if P1(x) − τ P0(x) > 0 ⟹ φτ(x) = 1 ⟹ φτ(x) − φ(x) ≥ 0;
if P1(x) − τ P0(x) ≤ 0 ⟹ φτ(x) = 0 ⟹ φτ(x) − φ(x) ≤ 0.

Summing over all x ∈ X, we get

0 ≤ (1 − βφτ) − (1 − βφ) − τ (αφτ − αφ) = (βφ − βφτ) + τ (αφ − αφτ).

Since τ > 0, we conclude that αφ ≤ αφτ ⟹ βφ ≥ βφτ.

SLIDE 8

[Figure: trade-off curves between α (PFA) and β (PMD), with both axes running from 0 to 1.]

Question: What is the optimal trade-off curve? What is the optimal test achieving the curve?

SLIDE 9

Randomized Testing Algorithm

Randomized tests include deterministic tests as special cases.

Definition 2 (Randomized Test) A randomized test decides θ̂ = 1 with probability φ(x) and θ̂ = 0 with probability 1 − φ(x), where φ is a mapping φ : X → [0, 1].

Note: A randomized test is characterized by φ, as in the deterministic case.

Definition 3 (Randomized LRT) A randomized likelihood ratio test (LRT) is a test φτ,γ, parametrized by constants τ > 0 and γ ∈ (0, 1), defined as follows:

φτ,γ(x) = 1 if P1(x) > τ P0(x); γ if P1(x) = τ P0(x); 0 if P1(x) < τ P0(x).
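A sketch of Definition 3 (hypothetical PMFs again): φτ,γ(x) is the probability of declaring H1, and an actual decision is drawn from that probability:

```python
import random

# Hypothetical PMFs, for illustration only.
P0 = {"a": 0.5, "b": 0.3, "c": 0.2}
P1 = {"a": 0.2, "b": 0.3, "c": 0.5}

def phi(x, tau, gamma):
    """Randomized LRT phi_{tau,gamma}(x): probability of deciding H1."""
    if P1[x] > tau * P0[x]:
        return 1.0
    if P1[x] == tau * P0[x]:
        return gamma
    return 0.0

def decide(x, tau, gamma, rng=random):
    """Draw the actual decision: H1 with probability phi(x)."""
    return 1 if rng.random() < phi(x, tau, gamma) else 0

# With tau = 1, symbol "b" sits exactly on the threshold (P1(b) = P0(b)),
# so it is resolved in favor of H1 with probability gamma.
assert phi("b", 1.0, 0.25) == 0.25
```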

SLIDE 10

Randomized LRT Achieves the Optimal Trade-Off

Consider the following optimization problem:

Neyman-Pearson Problem: minimize βφ over φ : X → [0, 1], subject to αφ ≤ α∗.

Theorem 2 (Neyman-Pearson) A randomized LRT φτ∗,γ∗ with parameters (τ∗, γ∗) satisfying α∗ = α_{φτ∗,γ∗} attains optimality for the Neyman-Pearson Problem.

SLIDE 11

pf: First argue that for any α∗ ∈ (0, 1), one can find (τ∗, γ∗) such that

α∗ = α_{φτ∗,γ∗} = ∑_{x∈X} φτ∗,γ∗(x) P0(x) = ∑_{x : L(x)>τ∗} P0(x) + γ∗ ∑_{x : L(x)=τ∗} P0(x).

For any test φ, by an argument similar to that of Theorem 1, we have ∀ x ∈ X,

(φτ∗,γ∗(x) − φ(x)) (P1(x) − τ∗ P0(x)) ≥ 0.

Summing over all x ∈ X, we similarly get

(βφ − β_{φτ∗,γ∗}) + τ∗ (αφ − α_{φτ∗,γ∗}) ≥ 0.

Hence, for any feasible test φ with αφ ≤ α∗ = α_{φτ∗,γ∗}, its type II error probability satisfies βφ ≥ β_{φτ∗,γ∗}.

SLIDE 12

Bayesian Setup

Sometimes the prior probabilities of the two hypotheses are known:

πθ ≜ P{Hθ is true}, θ = 0, 1, with π0 + π1 = 1.

In this case, one can view the index Θ as a (binary) random variable with prior distribution P{Θ = θ} = πθ, for θ = 0, 1. With prior probabilities, it then makes sense to talk about the average probability of error of a test φ, or more generally, the average cost (risk):

Pe(φ) ≜ π0 αφ + π1 βφ = E_{Θ,X}[1{Θ ≠ Θ̂}], R(φ) ≜ E_{Θ,X}[r_{Θ,Θ̂}].

The Bayesian hypothesis testing problem is to test the two hypotheses with knowledge of the prior probabilities, so that the average probability of error (or, in general, a risk function) is minimized.

SLIDE 13

Minimizing Bayes Risk

Consider the following problem of minimizing the Bayes risk.

Bayesian Problem: minimize R(φ) ≜ E_{Θ,X}[r_{Θ,Θ̂}] over φ : X → [0, 1], with known (π0, π1) and r_{θ,θ̂}.

Theorem 3 (LRT is an Optimal Bayesian Test) Assume r0,0 < r0,1 and r1,1 < r1,0. A deterministic LRT φτ∗ with threshold

τ∗ = ((r0,1 − r0,0) π0) / ((r1,0 − r1,1) π1)

attains optimality for the Bayesian Problem.
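Theorem 3's threshold is a one-line formula. A sketch (the priors and risk values below are hypothetical); note that with the 0-1 risk (r0,0 = r1,1 = 0, r0,1 = r1,0 = 1) it reduces to the MAP threshold π0/π1:

```python
def bayes_threshold(pi0, pi1, r00, r01, r10, r11):
    """Optimal LRT threshold tau* = (r01 - r00) * pi0 / ((r10 - r11) * pi1).

    Requires r00 < r01 and r11 < r10, as assumed in Theorem 3.
    """
    return (r01 - r00) * pi0 / ((r10 - r11) * pi1)

# 0-1 risk: the threshold is the prior ratio pi0 / pi1 (the MAP test).
tau = bayes_threshold(0.6, 0.4, 0, 1, 1, 0)
assert abs(tau - 1.5) < 1e-9
```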

SLIDE 14

pf:

R(φ) = ∑_{x∈X} r0,0 π0 P0(x) (1 − φ(x)) + ∑_{x∈X} r0,1 π0 P0(x) φ(x) + ∑_{x∈X} r1,0 π1 P1(x) (1 − φ(x)) + ∑_{x∈X} r1,1 π1 P1(x) φ(x)

= r0,0 π0 + ∑_{x∈X} (r0,1 − r0,0) π0 P0(x) φ(x) + r1,0 π1 + ∑_{x∈X} (r1,1 − r1,0) π1 P1(x) φ(x)

= ∑_{x∈X} [(r0,1 − r0,0) π0 P0(x) − (r1,0 − r1,1) π1 P1(x)] φ(x) + r0,0 π0 + r1,0 π1.   (∗)

For each x ∈ X, we shall choose φ(x) ∈ [0, 1] such that (∗) is minimized. It is then obvious that we should choose

φ(x) = 1 if (r0,1 − r0,0) π0 P0(x) − (r1,0 − r1,1) π1 P1(x) < 0; 0 if it is ≥ 0,

which is exactly the LRT with the threshold τ∗ given in Theorem 3.

SLIDE 15

Discussions

For binary hypothesis testing problems, the likelihood ratio L(x) ≜ P1(x)/P0(x) turns out to be a sufficient statistic. Moreover, a likelihood ratio test (LRT) is optimal in both the Bayesian and Neyman-Pearson settings. Extensions include:

   M-ary hypothesis testing
   Minimax risk optimization (with unknown prior)
   Composite hypothesis testing, etc.

Here we do not pursue these directions further. Instead, we explore the asymptotic behavior of hypothesis testing and its connection with information-theoretic tools.

SLIDE 16

1 Hypothesis Testing
   Basic Theory
   Asymptotics

2 Estimation
   Performance Evaluation of Estimators
   MLE, Asymptotics, and Bayesian Estimators

SLIDE 17

i.i.d. Observations

So far we have focused on the general setting where the observation space X can be an arbitrary alphabet. In the following, we consider the product space X^n and a length-n observation sequence X^n drawn i.i.d. from one of the two distributions; the two hypotheses are

H0 : Xi i.i.d. ∼ P0, i = 1, 2, . . . , n
H1 : Xi i.i.d. ∼ P1, i = 1, 2, . . . , n

The corresponding error probabilities are denoted by

α(n) ≡ P(n)_FA ≜ P{H1 is chosen | H0},
β(n) ≡ P(n)_MD ≜ P{H0 is chosen | H1}.

Throughout the lecture we assume X = {a1, a2, . . . , ad} is a finite set.

SLIDE 18

LRT under i.i.d. Observation (1)

With i.i.d. observations, the likelihood ratio of a sequence x^n ∈ X^n is

L(x^n) = ∏_{i=1}^n P1(xi)/P0(xi) = ∏_{a∈X} (P1(a)/P0(a))^(N(a|x^n)) = ∏_{a∈X} (P1(a)/P0(a))^(n π(a|x^n)),

where N(a|x^n) ≜ the number of occurrences of a in x^n, and π(a|x^n) ≜ (1/n) N(a|x^n) is the relative frequency of occurrence of symbol a in the sequence x^n.

Note: From the above manipulation, we see that the collection of relative frequencies of occurrence (as a |X|-dimensional probability vector),

Π_{x^n} ≜ [π(a1|x^n) π(a2|x^n) · · · π(ad|x^n)]^T,

called the type of the sequence x^n, is a sufficient statistic for all the previously mentioned hypothesis testing problems.
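The identity L(x^n) = ∏_a (P1(a)/P0(a))^(n π(a|x^n)) can be checked numerically. A minimal sketch with hypothetical PMFs; `type_of` computes the type Π_{x^n}:

```python
import math
from collections import Counter

# Hypothetical PMFs, for illustration only.
P0 = {"a": 0.5, "b": 0.3, "c": 0.2}
P1 = {"a": 0.2, "b": 0.3, "c": 0.5}

def type_of(xs):
    """Type (empirical distribution) pi(.|x^n) of the sequence xs."""
    n, counts = len(xs), Counter(xs)
    return {a: counts[a] / n for a in P0}

def llr_direct(xs):
    """log L(x^n) computed symbol by symbol."""
    return sum(math.log(P1[x] / P0[x]) for x in xs)

def llr_from_type(xs):
    """log L(x^n) = n * sum_a pi(a|x^n) * log(P1(a)/P0(a))."""
    n, t = len(xs), type_of(xs)
    return n * sum(t[a] * math.log(P1[a] / P0[a]) for a in P0)

xs = list("aabcc")
assert abs(llr_direct(xs) - llr_from_type(xs)) < 1e-9
```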

SLIDE 19

LRT under i.i.d. Observation (2)

Let us further manipulate the LRT by taking the log-likelihood ratio:

L(x^n) ⋛ τn ⟺ log L(x^n) ⋛ log τn

⟺ ∑_{a∈X} n π(a|x^n) log(P1(a)/P0(a)) ⋛ log τn

⟺ ∑_{a∈X} π(a|x^n) log(π(a|x^n)/P0(a)) − ∑_{a∈X} π(a|x^n) log(π(a|x^n)/P1(a)) ⋛ (1/n) log τn

⟺ D(Π_{x^n}‖P0) − D(Π_{x^n}‖P1) ⋛ (1/n) log τn
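The final equivalence, (1/n) log L(x^n) = D(Π_{x^n}‖P0) − D(Π_{x^n}‖P1), admits a one-line numerical check. Sketch (hypothetical PMFs):

```python
import math
from collections import Counter

# Hypothetical PMFs, for illustration only.
P0 = {"a": 0.5, "b": 0.3, "c": 0.2}
P1 = {"a": 0.2, "b": 0.3, "c": 0.5}

def kl(Q, P):
    """KL divergence D(Q || P) in nats; zero-probability terms contribute 0."""
    return sum(q * math.log(q / P[a]) for a, q in Q.items() if q > 0)

def type_of(xs):
    n, counts = len(xs), Counter(xs)
    return {a: counts[a] / n for a in P0}

xs = list("abccc")
n = len(xs)
t = type_of(xs)
lhs = kl(t, P0) - kl(t, P1)                          # D(Pi||P0) - D(Pi||P1)
rhs = sum(math.log(P1[x] / P0[x]) for x in xs) / n   # (1/n) log L(x^n)
assert abs(lhs - rhs) < 1e-9
```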

SLIDE 20

[Figure: the observation space X, partitioned into acceptance regions A1(φ) and A0(φ), corresponds to the probability simplex P(X) (containing P0 and P1), partitioned into F(n)_1 (acceptance region of H1) and F(n)_0 (acceptance region of H0).]

{x^n ∈ Ai ⟹ decide Hi} ⟷ {Π_{x^n} ∈ F(n)_i ⟹ decide Hi}.

SLIDE 21

[Figure: the probability simplex P(X), showing P0, P1, the optimizer P∗, and the regions F(n)_1 and F(n)_0.]

By Sanov's Theorem, we know that

α(n) = P0^n(F(n)_1) ≈ 2^(−n D(P∗‖P0)), β(n) = P1^n(F(n)_0) ≈ 2^(−n D(P∗‖P1)).

SLIDE 22

Asymptotic Behaviors

1 Neyman-Pearson: β∗(n, ε) ≜ min_{φn : X^n → [0,1]} β(n)_{φn}, subject to α(n)_{φn} ≤ ε.

It turns out that for all ε ∈ (0, 1),

lim_{n→∞} {−(1/n) log β∗(n, ε)} = D(P0‖P1).

2 Bayesian: P∗_e(n) ≜ min_{φn : X^n → [0,1]} {π0 α(n)_{φn} + π1 β(n)_{φn}}.

It turns out that

lim_{n→∞} {−(1/n) log P∗_e(n)} = D(Pλ∗‖P0) = D(Pλ∗‖P1),

where Pλ(a) ≜ (P0(a))^(1−λ) (P1(a))^λ / ∑_{x∈X} (P0(x))^(1−λ) (P1(x))^λ, ∀ a ∈ X,

and λ∗ ∈ (0, 1) is such that D(Pλ∗‖P0) = D(Pλ∗‖P1).
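The tilted distribution Pλ and the balancing point λ∗ can be found numerically. A sketch with hypothetical PMFs; it uses bisection on g(λ) = D(Pλ‖P0) − D(Pλ‖P1), which is negative at λ = 0 (where it equals −D(P0‖P1)) and positive at λ = 1:

```python
import math

# Hypothetical PMFs, for illustration only.
P0 = {"a": 0.5, "b": 0.3, "c": 0.2}
P1 = {"a": 0.2, "b": 0.3, "c": 0.5}

def tilted(lam):
    """P_lambda(a) proportional to P0(a)^(1-lam) * P1(a)^lam."""
    w = {a: P0[a] ** (1 - lam) * P1[a] ** lam for a in P0}
    z = sum(w.values())
    return {a: w[a] / z for a in w}

def kl(Q, P):
    """D(Q || P) in nats (zero-probability terms contribute 0)."""
    return sum(q * math.log(q / P[a]) for a, q in Q.items() if q > 0)

def gap(lam):
    pl = tilted(lam)
    return kl(pl, P0) - kl(pl, P1)

# Bisection: gap is continuous, < 0 at lam = 0 and > 0 at lam = 1.
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if gap(mid) < 0 else (lo, mid)
lam_star = (lo + hi) / 2
p_star = tilted(lam_star)

# Chernoff information: -log sum_x P0^(1-lam*) P1^(lam*) equals the
# common divergence value at lambda* (see the later slides).
ci = -math.log(sum(P0[a] ** (1 - lam_star) * P1[a] ** lam_star for a in P0))
```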

SLIDE 23

Asymptotics in Neyman-Pearson Setup

Theorem 4 (Chernoff-Stein) For all ε ∈ (0, 1), lim_{n→∞} {−(1/n) log β∗(n, ε)} = D(P0‖P1).

pf: We prove the achievability and the converse parts separately.

Achievability: construct a sequence of tests {φn} with α(n)_{φn} ≤ ε for n sufficiently large, such that

lim inf_{n→∞} {−(1/n) log β(n)_{φn}} ≥ D(P0‖P1).

Converse: for any sequence of tests {φn} with α(n)_{φn} ≤ ε for n sufficiently large, show that

lim sup_{n→∞} {−(1/n) log β(n)_{φn}} ≤ D(P0‖P1).

We use the method of types to prove both parts. Alternatively, Chapter 11.8 of Cover & Thomas [1] proves the theorem using a kind of weak typicality.

SLIDE 24

Achievability: Consider the deterministic test

φn(x^n) = 1{D(Π_{x^n}‖P0) ≥ δn}, δn ≜ (1/n) (log(1/ε) + d log(n+1)).

In other words, it decides H1 if D(Π_{x^n}‖P0) ≥ δn, and H0 otherwise.

Check the probability of type I error: by Prop. 4 in Part I (with Xi i.i.d. ∼ P0),

α(n)_{φn} = P{D(Π_{X^n}‖P0) ≥ δn} ≤ 2^(−n (δn − d log(n+1)/n)) =(a) ε,

where (a) is due to our choice of δn.

Analyze the probability of type II error ((b) is due to Prop. 3 in Part I):

β(n)_{φn} = ∑_{Q∈Pn : D(Q‖P0)<δn} P1^n(Tn(Q)) ≤(b) ∑_{Q∈Pn : D(Q‖P0)<δn} 2^(−n D(Q‖P1)) ≤ |Pn| 2^(−n D∗_n),

where D∗_n ≜ min_{Q∈Pn : D(Q‖P0)<δn} D(Q‖P1).

Since lim_{n→∞} δn = 0, we have lim_{n→∞} D∗_n = D(P0‖P1), and achievability is done.

SLIDE 25

Converse: We prove the converse for deterministic tests. Extension to randomized tests is left as an exercise (HW6).

Let A(n)_i ≜ {x^n | φn(x^n) = i}, the acceptance region of Hi, for i = 0, 1. Let B(n) ≜ {x^n | D(Π_{x^n}‖P0) < εn}, with εn ≜ 2d log(n+1)/n. By Prop. 4 (with Xi i.i.d. ∼ P0), we have

P0^n(B(n)) = 1 − P{D(Π_{X^n}‖P0) ≥ εn} ≥ 1 − 2^(−n (εn − d log(n+1)/n)) = 1 − 2^(−d log(n+1)) → 1 as n → ∞.

Hence, for sufficiently large n, both P0^n(B(n)) and P0^n(A(n)_0) are > 1 − ε, and

P0^n(B(n) ∩ A(n)_0) = P0^n(B(n)) + P0^n(A(n)_0) − P0^n(B(n) ∪ A(n)_0) > 2(1 − ε) − 1 = 1 − 2ε.

Note that B(n) = ∪_{Q∈Pn : D(Q‖P0)<εn} Tn(Q). Hence ∃ Qn ∈ Pn with D(Qn‖P0) < εn such that

P0^n(Tn(Qn) ∩ A(n)_0) > (1 − 2ε) P0^n(Tn(Qn)).   (1)

SLIDE 26

Key Observation: under any product distribution, every sequence in the same type class has the same probability. Hence, (1) is equivalent to

|Tn(Qn) ∩ A(n)_0| > (1 − 2ε) |Tn(Qn)|,

which implies

P1^n(Tn(Qn) ∩ A(n)_0) > (1 − 2ε) P1^n(Tn(Qn)).

Hence, for sufficiently large n, ∃ Qn ∈ Pn with D(Qn‖P0) < εn such that

β(n)_{φn} = P1^n(A(n)_0) ≥ P1^n(Tn(Qn) ∩ A(n)_0) > (1 − 2ε) P1^n(Tn(Qn)) ≥(c) (1 − 2ε) |Pn|^(−1) 2^(−n D(Qn‖P1)),

where (c) is due to Prop. 3. Finally, since lim_{n→∞} εn = 0, we have lim_{n→∞} D(Qn‖P1) = D(P0‖P1), and the converse proof is done.

SLIDE 27

Asymptotics in Bayesian Setup

Theorem 5 (Chernoff)

lim_{n→∞} {−(1/n) log P∗_e(n)} = D(Pλ∗‖P0) = D(Pλ∗‖P1) = max_{λ∈[0,1]} log(1 / ∑_{x∈X} (P0(x))^(1−λ) (P1(x))^λ) ≜ Chernoff information CI(P0, P1),

where Pλ(a) ≜ (P0(a))^(1−λ) (P1(a))^λ / ∑_{x∈X} (P0(x))^(1−λ) (P1(x))^λ, ∀ a ∈ X, and λ∗ ∈ (0, 1) is such that D(Pλ∗‖P0) = D(Pλ∗‖P1).

Note: The optimal Bayesian test (for minimizing Pe) is the maximum a posteriori (MAP) test: φMAP(x^n) = 1{π1 P1^n(x^n) ≥ π0 P0^n(x^n)}.

SLIDE 28

[Figure: D(Pλ‖P0) and D(Pλ‖P1) plotted versus λ; the two curves intersect at exactly one point, and it lies in [0, 1].]

SLIDE 29

[Figure: D(Pλ‖P0), D(Pλ‖P1), and min{D(Pλ‖P0), D(Pλ‖P1)} plotted versus λ; the minimum is maximized at λ∗, where the two curves meet.]

SLIDE 30

pf: The proof is based on applying large deviations to the analysis of the optimal test, MAP: φMAP(x^n) = 1{π1 P1^n(x^n) ≥ π0 P0^n(x^n)}.

Analysis of the error probabilities of the MAP test:

α(n) = P0^n(F(n)_1), β(n) = P1^n(F(n)_0), where

F(n)_1 ≜ {Q ∈ P(X) | D(Q‖P0) − D(Q‖P1) ≥ (1/n) log(π0/π1)},
F(n)_0 ≜ {Q ∈ P(X) | D(Q‖P0) − D(Q‖P1) ≤ (1/n) log(π0/π1)}.

Asymptotics: By Sanov's Theorem, we have

lim_{n→∞} {−(1/n) log α(n)} = min_{Q∈F1} D(Q‖P0), lim_{n→∞} {−(1/n) log β(n)} = min_{Q∈F0} D(Q‖P1),

where F1 ≜ {Q ∈ P(X) | D(Q‖P0) − D(Q‖P1) ≥ 0} and F0 ≜ {Q ∈ P(X) | D(Q‖P0) − D(Q‖P1) ≤ 0}.

SLIDE 31

Exponents: Characterizing the two exponents is equivalent to solving the two (convex) optimization problems:

min_{Q∈F1} D(Q‖P0):
   minimize over (Q1, . . . , Qd): ∑_{l=1}^d Ql log(Ql / P0(al))
   subject to ∑_{l=1}^d Ql log(P1(al) / P0(al)) ≥ 0, Ql ≥ 0 for l = 1, . . . , d, and ∑_{l=1}^d Ql = 1.

min_{Q∈F0} D(Q‖P1):
   minimize over (Q1, . . . , Qd): ∑_{l=1}^d Ql log(Ql / P1(al))
   subject to ∑_{l=1}^d Ql log(P1(al) / P0(al)) ≤ 0, Ql ≥ 0 for l = 1, . . . , d, and ∑_{l=1}^d Ql = 1.

It turns out that both problems have a common optimal solution

Pλ∗(a) = (P0(a))^(1−λ∗) (P1(a))^(λ∗) / ∑_{x∈X} (P0(x))^(1−λ∗) (P1(x))^(λ∗), ∀ a ∈ X,

with λ∗ ∈ [0, 1] such that D(Pλ∗‖P0) = D(Pλ∗‖P1).

SLIDE 32

Hence, both types of error probabilities have the same exponent, and so does the average error probability. This completes the proof of the first part.

Chernoff Information: To show that

CI(P0, P1) ≜ max_{λ∈[0,1]} log(1 / ∑_{x∈X} (P0(x))^(1−λ) (P1(x))^λ) = D(Pλ∗‖P0),

simply observe that

D(Pλ‖P0) = D(Pλ‖P1) ⟺ ∑_{a∈X} (P0(a))^(1−λ) (P1(a))^λ (log P0(a) − log P1(a)) = 0
⟺ D(Pλ‖P0) = D(Pλ‖P1) = log(1 / ∑_{x∈X} (P0(x))^(1−λ) (P1(x))^λ).

Proof complete.

SLIDE 33

Hypothesis Testing Estimation Performance Evaluation of Estimators MLE, Asymptotics, and Bayesian Estimators

1 Hypothesis Testing
   Basic Theory
   Asymptotics

2 Estimation
   Performance Evaluation of Estimators
   MLE, Asymptotics, and Bayesian Estimators

SLIDE 34

Parametric Estimation

In this lecture we focus on parametric estimation, where samples of data are assumed to be drawn from a family of distributions on an alphabet X:

{Pθ ∈ P(X) | θ ∈ Θ},

where θ is called the parameter and Θ is the parameter set.

(In this lecture, we mainly focus on X = R or R^n, where the Pθ are densities.)

Such a parametric framework is useful when one is familiar with certain properties of the data and has a good statistical model for it. The parameter set Θ is hence fixed and does not scale with the number of samples. In contrast, if such knowledge about the underlying data is insufficient, a non-parametric framework may be more suitable.

SLIDE 35

Outline

Parametric estimation itself is a vast area. In this lecture we shall go through some basic results and then draw some connections between estimation theory and information theory. Topics to be discussed in this lecture:

1 Performance Evaluation of Estimators
   Bias, mean squared error, and the Cramér-Rao lower bound
   Risk function optimization

2 Maximum Likelihood Estimator (MLE)

3 Asymptotic Evaluation
   Consistency
   Efficiency

4 Bayesian Estimators

SLIDE 36

1 Hypothesis Testing
   Basic Theory
   Asymptotics

2 Estimation
   Performance Evaluation of Estimators
   MLE, Asymptotics, and Bayesian Estimators

SLIDE 37

Estimator, Bias, Mean Squared Error

Definition 4 (Estimator) Consider X ∼ Pθ generating the observed sample x, where θ is an unknown parameter lying in the parameter set Θ. An estimator of θ based on the observed x is a mapping φ : X → Θ, x ↦ θ̂. An estimator of a function z(θ) is a mapping ζ : X → z(Θ), x ↦ ẑ.

For the case X = R or R^n, it is reasonable to consider the following two measures of estimator performance.

Definition 5 (Bias, Mean Squared Error) For an estimator φ(x) of θ,

Biasθ(φ) ≜ E_{Pθ}[φ(X)] − θ, MSEθ(φ) ≜ E_{Pθ}[|φ(X) − θ|²].
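For a finite observation alphabet, bias and MSE can be computed exactly by summing over outcomes. A sketch using a single Bernoulli(θ) observation (a hypothetical toy model); it also checks the decomposition MSE = Var + Bias²:

```python
def bias_mse(phi, pmf, theta):
    """Exact bias and MSE of estimator phi under a finite pmf {x: P_theta(x)}."""
    mean = sum(p * phi(x) for x, p in pmf.items())
    mse = sum(p * (phi(x) - theta) ** 2 for x, p in pmf.items())
    return mean - theta, mse

# Toy model: a single Bernoulli(theta) observation, estimator phi(x) = x.
theta = 0.3
pmf = {0: 1 - theta, 1: theta}
bias, mse = bias_mse(lambda x: x, pmf, theta)
assert abs(bias) < 1e-12                        # phi(x) = x is unbiased
assert abs(mse - theta * (1 - theta)) < 1e-12   # here MSE = variance

# A biased estimator phi2(x) = x / 2 still satisfies MSE = Var + Bias^2.
phi2 = lambda x: 0.5 * x
bias2, mse2 = bias_mse(phi2, pmf, theta)
mean2 = sum(p * phi2(x) for x, p in pmf.items())
var2 = sum(p * (phi2(x) - mean2) ** 2 for x, p in pmf.items())
assert abs(mse2 - (var2 + bias2 ** 2)) < 1e-12
```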

SLIDE 38

Risk Function

Fact 1 (MSE = Variance + (Bias)²) For an estimator φ(x) of θ, MSEθ(φ) = Var_{Pθ}[φ(X)] + (Biasθ(φ))².

pf:

MSEθ(φ) ≜ E_{Pθ}[|φ(X) − θ|²]
= E_{Pθ}[(φ(X) − E_{Pθ}[φ(X)] + E_{Pθ}[φ(X)] − θ)²]
= Var_{Pθ}[φ(X)] + (Biasθ(φ))² + 2 Biasθ(φ) E_{Pθ}[φ(X) − E_{Pθ}[φ(X)]],

and the last term vanishes since E_{Pθ}[φ(X) − E_{Pθ}[φ(X)]] = 0.

MSE is a special case of the risk function of an estimator.

Definition 6 (Risk Function) Let r : Θ × Θ → R denote the risk (cost) of estimating θ with θ̂. The risk function of an estimator φ is defined as Rθ(φ) ≜ E_{Pθ}[r(θ, φ(X))].

SLIDE 39

With risk functions as the performance measures of estimators, it is then possible to ask: What is the best estimator that minimizes the risk? What is the minimum risk?

But these questions are not explicit: optimal in what sense?

   Minimax: the worst-case risk (over Θ) is minimized.
   Bayesian: with a prior distribution {π(θ) | θ ∈ Θ}, the expected risk (Bayes risk) is minimized.

In the following, we do not pursue these directions further (detailed treatment can be found in decision theory). Instead, we provide a parameter-dependent lower bound on the MSE of unbiased estimators, namely, the Cramér-Rao Inequality. Later, we shall also briefly introduce results in the Bayesian setup.

SLIDE 40

Lower Bound on MSE of Unbiased Estimators

Below we deal with densities and hence change notation from Pθ to fθ.

Definition 7 (Fisher Information) The Fisher information of θ is defined as

J(θ) ≜ E_{fθ}[(∂/∂θ ln fθ(X))²].

Definition 8 (Unbiased Estimator) An estimator φ is unbiased if Biasθ(φ) = 0 for all θ ∈ Θ.

Now we are ready to state the theorem.

Theorem 6 (Cramér-Rao) For any unbiased estimator φ, we have MSEθ(φ) ≥ 1/J(θ), ∀ θ ∈ Θ.
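As a numerical illustration (not from the lecture): for Xi i.i.d. N(θ, σ²) with σ known, the Fisher information of n samples is J_n(θ) = n/σ², and the sample mean is unbiased with MSE exactly σ²/n, so it attains the Cramér-Rao bound. A Monte Carlo sketch with toy parameters:

```python
import random
import statistics

# For X_i iid N(theta, sigma^2) with sigma known, J_n(theta) = n / sigma^2,
# and the sample mean has MSE = sigma^2 / n = 1 / J_n(theta).
random.seed(0)
theta, sigma, n, trials = 2.0, 1.0, 25, 4000

sq_errors = []
for _ in range(trials):
    xs = [random.gauss(theta, sigma) for _ in range(n)]
    sq_errors.append((statistics.fmean(xs) - theta) ** 2)

mse = statistics.fmean(sq_errors)   # Monte Carlo estimate of the MSE
crlb = sigma ** 2 / n               # Cramer-Rao lower bound 1 / J_n(theta)
```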

SLIDE 41

Proof of Cramér-Rao Inequality

pf: The proof is essentially an application of the Cauchy-Schwarz inequality. Begin with the observation that J(θ) = Var_{fθ}[sθ(X)], where the score sθ(X) ≜ ∂/∂θ ln fθ(X) = (1/fθ(X)) ∂/∂θ fθ(X), because

E_{fθ}[sθ(X)] = ∫_{−∞}^{∞} fθ(x) (1/fθ(x)) ∂/∂θ fθ(x) dx = ∫_{−∞}^{∞} ∂/∂θ fθ(x) dx = d/dθ ∫_{−∞}^{∞} fθ(x) dx = 0.

Hence, by the Cauchy-Schwarz inequality, we have

(Cov_{fθ}(sθ(X), φ(X)))² ≤ Var_{fθ}[sθ(X)] Var_{fθ}[φ(X)].

Since Biasθ(φ) = 0, we have MSEθ(φ) = Var_{fθ}[φ(X)], and hence

MSEθ(φ) J(θ) ≥ (Cov_{fθ}(sθ(X), φ(X)))².

SLIDE 42

It remains to prove that Cov_{fθ}(sθ(X), φ(X)) = 1:

Cov_{fθ}(sθ(X), φ(X)) = E_{fθ}[sθ(X) φ(X)] − E_{fθ}[sθ(X)] E_{fθ}[φ(X)] = E_{fθ}[sθ(X) φ(X)]
= E_{fθ}[(1/fθ(X)) (∂/∂θ fθ(X)) φ(X)] = d/dθ ∫_{−∞}^{∞} fθ(x) φ(x) dx = d/dθ E_{fθ}[φ(X)] =(a) d/dθ θ = 1,

where (a) holds because φ is unbiased. The proof is complete.

Remark: The Cramér-Rao inequality can be extended to vector estimators, biased estimators, estimators of a function of θ, etc.

SLIDE 43

Extensions of Cramér-Rao Inequality

Below we list some extensions and leave the proofs as exercises.

Exercise 1 (Cramér-Rao Inequality for Unbiased Functional Estimators) Prove that for any unbiased estimator ζ of z(θ),

MSEθ(ζ) ≥ (1/J(θ)) (d/dθ z(θ))².

Exercise 2 (Cramér-Rao Inequality for Biased Estimators) Prove that for any estimator φ of the parameter θ,

MSEθ(φ) ≥ (1/J(θ)) (1 + d/dθ Biasθ(φ))² + (Biasθ(φ))².

Exercise 3 (Attainment of Cramér-Rao) Show that the necessary and sufficient condition for an unbiased estimator φ to attain the Cramér-Rao lower bound is that there exists some function g such that for all x,

g(θ) (φ(x) − θ) = ∂/∂θ ln fθ(x).

SLIDE 44

More on Fisher Information

Fisher information plays a key role in the Cramér-Rao lower bound. We make some further remarks about it.

1 J(θ) ≜ E_{fθ}[(sθ(X))²] = Var_{fθ}[sθ(X)], where the score of θ, sθ(X) ≜ ∂/∂θ ln fθ(X) = (1/fθ(X)) ∂/∂θ fθ(X), is zero-mean.

2 Suppose Xi i.i.d. ∼ fθ; then for the estimation problem with observation X^n, the Fisher information is Jn(θ) = n J(θ), where J(θ) is the Fisher information when the observation is just X ∼ fθ.

3 For an exponential family {fθ | θ ∈ Θ}, it can be shown that J(θ) = −E_{fθ}[∂²/∂θ² ln fθ(X)], which makes the computation of J(θ) simpler.

SLIDE 45

1 Hypothesis Testing
   Basic Theory
   Asymptotics

2 Estimation
   Performance Evaluation of Estimators
   MLE, Asymptotics, and Bayesian Estimators

SLIDE 46

Maximum Likelihood Estimator

The Maximum Likelihood Estimator (MLE) is a widely used estimator.

Definition 9 (Maximum Likelihood Estimator) The MLE for estimating θ from a randomly drawn X ∼ Pθ is defined as

φMLE(x) ≜ arg max_{θ∈Θ} {Pθ(x)}.

Here Pθ(x) is called the likelihood function.

Exercise 4 (MLE of Gaussian with Unknown Mean and Variance) Consider Xi i.i.d. ∼ N(µ, σ²) for i = 1, 2, . . . , n, where θ ≜ (µ, σ²) denotes the unknown parameter. Let x̄ ≜ (1/n) ∑_{i=1}^n xi. Show that

φMLE(x^n) = (x̄, (1/n) ∑_{i=1}^n (xi − x̄)²).
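A quick numerical check of Exercise 4's answer (the data points are arbitrary). Note the MLE variance divides by n, not n − 1, so it is a biased estimator of σ²:

```python
def mle_gaussian(xs):
    """MLE of (mu, sigma^2) for i.i.d. Gaussian data: the sample mean and
    the 1/n (biased) sample variance."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

mu, var = mle_gaussian([1.0, 2.0, 3.0, 6.0])
assert mu == 3.0                 # (1 + 2 + 3 + 6) / 4
assert abs(var - 3.5) < 1e-12    # (4 + 1 + 0 + 9) / 4
```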

SLIDE 47

Asymptotic Evaluations: Consistency

In the following, we consider the observation of n i.i.d. samples Xi ∼ Pθ, i = 1, . . . , n, and give two ways of evaluating the performance of a sequence of estimators {φn(x^n) | n ∈ N} as n → ∞.

Definition 10 (Consistency) A sequence of estimators {ζn(x^n) | n ∈ N} is consistent if ∀ ε > 0,

lim_{n→∞} P{|ζn(X^n) − z(θ)| < ε} = 1, ∀ θ ∈ Θ,

with Xi i.i.d. ∼ Pθ. In other words, ζn(X^n) → z(θ) in probability, for all θ ∈ Θ.

Theorem 7 (MLE is Consistent) For a family of densities {fθ | θ ∈ Θ}, under some regularity conditions on fθ(x), the plug-in estimator z(φMLE(x^n)) is a consistent estimator of z(θ), where z is a continuous function of θ.

SLIDE 48

Asymptotic Evaluations: Efficiency

Motivated by the Cramér-Rao inequality, we would like to see if the lower bound is asymptotically attainable.

Definition 11 (Efficiency) A sequence of estimators {ζn(x^n) | n ∈ N} is asymptotically efficient if

√n (ζn(X^n) − z(θ)) → N(0, (1/J(θ)) (d/dθ z(θ))²) in distribution, as n → ∞.

Theorem 8 (MLE is Asymptotically Efficient) For a family of densities {fθ | θ ∈ Θ}, under some regularity conditions on fθ(x), the plug-in estimator z(φMLE(x^n)) is an asymptotically efficient estimator of z(θ), where z is a continuous function of θ.

SLIDE 49

Bayesian Estimators

In the Bayesian setting, the prior distribution of the parameter, π(θ) for θ ∈ Θ, is known, and hence the joint distribution of (θ, X) is π(θ) Pθ(x). The goal is to find an estimator that minimizes the Bayes risk, defined as the risk function averaged over the random θ:

R(φ) ≜ E_{θ∼π}[Rθ(φ)] = E_{(θ,X)∼π·Pθ}[r(θ, φ(X))].

The optimal Bayesian estimator is

φ∗(·) ≜ arg min_{φ : X→Θ} {R(φ)}.

Below we give some examples of Bayesian estimators for several kinds of risks: 0-1 risk, squared-error risk, and absolute-error risk.

SLIDE 50

1 0-1 risk r(θ, θ̂) = 1{θ ≠ θ̂}: this kind of risk is reasonable for a finite parameter set Θ, and the Bayes risk equals the average probability of error. The optimal Bayesian estimator is

φMAP(x) = arg max_{θ∈Θ} {π(θ) Pθ(x)},

called the maximum a posteriori (MAP) estimator.

2 Squared-error risk r(θ, θ̂) = |θ − θ̂|²: the optimal Bayesian estimator is

φMMSE(x) = E_{θ∼π(θ | X=x)}[θ | X = x],

where π(θ | X = x) ≜ π(θ) Pθ(x) / ∑_{θ∈Θ} π(θ) Pθ(x) is the posterior probability.

3 Absolute-error risk r(θ, θ̂) = |θ − θ̂|: the optimal Bayesian estimator is the median of π(θ | X = x).
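The three Bayes-optimal estimators can be compared on a toy discrete model (a two-point parameter set with Bernoulli likelihoods; all values hypothetical):

```python
def posterior(x, prior, likelihood):
    """pi(theta | x), proportional to pi(theta) * P_theta(x), over finite Theta."""
    w = {th: prior[th] * likelihood[th][x] for th in prior}
    z = sum(w.values())
    return {th: w[th] / z for th in w}

prior = {0.2: 0.5, 0.7: 0.5}                           # hypothetical prior on Theta
likelihood = {th: {0: 1 - th, 1: th} for th in prior}  # Bernoulli(theta)

post = posterior(1, prior, likelihood)                 # observe x = 1

map_est = max(post, key=post.get)                  # 0-1 risk -> MAP
mmse_est = sum(th * p for th, p in post.items())   # squared error -> posterior mean
# Absolute error -> posterior median; here post(0.7) = 7/9 > 1/2, so the
# median of this two-point posterior is 0.7 as well.
```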
