Lecture 8: Information Theory and Statistics, I-Hsiang Wang (PowerPoint PPT Presentation)




SLIDE 1

Hypothesis Testing

Lecture 8: Information Theory and Statistics

Part II: Hypothesis Testing and Estimation

I-Hsiang Wang

Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw

December 14, 2015

1 / 20 I-Hsiang Wang IT Lecture 8 Part II

SLIDE 2

Hypothesis Testing Basic Theory Asymptotics

1 Hypothesis Testing
  Basic Theory
  Asymptotics

SLIDE 3

1 Hypothesis Testing
  Basic Theory
  Asymptotics

SLIDE 4

Basic Setup

We begin with the simplest setup – binary hypothesis testing:

1 Two hypotheses regarding the observation X, indexed by θ ∈ {0, 1}:

H0 : X ∼ P0 (Null Hypothesis, θ = 0)
H1 : X ∼ P1 (Alternative Hypothesis, θ = 1)

2 Goal: design a decision-making algorithm (test) φ : X → {0, 1}, x ↦ θ̂, to choose one of the two hypotheses, based on the observed realization x of X, so that a certain cost (or risk) is minimized.

3 A popular measure of the cost is based on the probabilities of error:

Probability of false alarm (false positive; type I error): αφ ≡ PFA(φ) ≜ P{H1 is chosen | H0}.
Probability of miss detection (false negative; type II error): βφ ≡ PMD(φ) ≜ P{H0 is chosen | H1}.
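As a quick illustration, the two error probabilities can be estimated by simulation. The sketch below is not from the lecture: the three-symbol alphabet, the distributions P0 and P1, and the toy test phi are all hand-picked for the example.

```python
import random

# Illustrative three-symbol alphabet and distributions (not from the lecture).
P0 = {"a": 0.5, "b": 0.3, "c": 0.2}  # null hypothesis H0
P1 = {"a": 0.2, "b": 0.3, "c": 0.5}  # alternative hypothesis H1

def phi(x):
    """A toy deterministic test: choose H1 iff x == 'c'."""
    return 1 if x == "c" else 0

def estimate_errors(n_trials=100_000, seed=0):
    rng = random.Random(seed)
    syms, w0 = zip(*P0.items())
    _, w1 = zip(*P1.items())
    # P_FA: probability of choosing H1 when X ~ P0 (type I error)
    alpha = sum(phi(x) for x in rng.choices(syms, w0, k=n_trials)) / n_trials
    # P_MD: probability of choosing H0 when X ~ P1 (type II error)
    beta = sum(1 - phi(x) for x in rng.choices(syms, w1, k=n_trials)) / n_trials
    return alpha, beta

alpha, beta = estimate_errors()
# alpha should be near P0('c') = 0.2, and beta near 1 - P1('c') = 0.5
```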

SLIDE 5

Deterministic Testing Algorithm ≡ Decision Regions

(Figure: the observation space X is partitioned into A1(φ), the acceptance region of H1, and A0(φ), the acceptance region of H0.)

A test φ : X → {0, 1} is equivalently characterized by its corresponding acceptance (decision) regions:
  Aθ̂(φ) ≡ φ⁻¹(θ̂) ≜ { x ∈ X : φ(x) = θ̂ },  θ̂ = 0, 1.
Hence, the two types of probability of error can be equivalently represented as
  αφ = ∑_{x∈A1(φ)} P0(x) = ∑_{x∈X} φ(x) P0(x),
  βφ = ∑_{x∈A0(φ)} P1(x) = ∑_{x∈X} (1 − φ(x)) P1(x).
When the context is clear, we often drop the dependency on the test φ and simply write Aθ̂.
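The two summation formulas translate directly into code. A minimal sketch, assuming an illustrative three-symbol alphabet and a toy test (both my own choices, not from the slides):

```python
# Error probabilities computed exactly from the summation formulas:
# alpha = sum over x of phi(x) P0(x), beta = sum of (1 - phi(x)) P1(x).
P0 = {"a": 0.5, "b": 0.3, "c": 0.2}
P1 = {"a": 0.2, "b": 0.3, "c": 0.5}

def phi(x):
    return 1 if x == "c" else 0  # acceptance regions A1 = {'c'}, A0 = {'a', 'b'}

alpha = sum(phi(x) * p for x, p in P0.items())       # = P0(A1) = 0.2
beta = sum((1 - phi(x)) * p for x, p in P1.items())  # = P1(A0) = 0.5
```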

SLIDE 6

Likelihood Ratio Test

Definition 1 (Likelihood Ratio Test)
A (deterministic) likelihood ratio test (LRT) is a test φτ, parametrized by a constant τ > 0 (called the threshold), defined as follows:
  φτ(x) = 1 if P1(x) > τP0(x), and φτ(x) = 0 if P1(x) ≤ τP0(x).
For x ∈ supp P0, the likelihood ratio is L(x) ≜ P1(x)/P0(x).
Hence, an LRT is a thresholding algorithm on the likelihood ratio L(x).
Remark: For computational convenience, one often works with the log-likelihood ratio (LLR) log L(x) = log P1(x) − log P0(x).
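A minimal sketch of the deterministic LRT and the LLR, assuming the same kind of hand-picked toy distributions as the examples above (not from the lecture):

```python
import math

# Illustrative distributions on a three-symbol alphabet.
P0 = {"a": 0.5, "b": 0.3, "c": 0.2}
P1 = {"a": 0.2, "b": 0.3, "c": 0.5}

def likelihood_ratio(x):
    return P1[x] / P0[x]  # defined for x in supp(P0)

def lrt(x, tau):
    """Deterministic LRT: decide 1 iff P1(x) > tau * P0(x)."""
    return 1 if P1[x] > tau * P0[x] else 0

def llr(x):
    """Log-likelihood ratio log P1(x) - log P0(x)."""
    return math.log(P1[x]) - math.log(P0[x])
```

Thresholding L(x) against τ makes the same decisions as thresholding llr(x) against log τ, and the LLR form is numerically safer for long i.i.d. sequences where the raw product of ratios would under- or overflow.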

SLIDE 7

Trade-Off Between α (PFA) and β (PMD)

Theorem 1 (Neyman-Pearson Lemma)
For a likelihood ratio test φτ and any other deterministic test φ,
  αφ ≤ αφτ ⇒ βφ ≥ βφτ.

pf: Observe that for every x ∈ X,
  0 ≤ (φτ(x) − φ(x)) (P1(x) − τP0(x)),
because
  if P1(x) − τP0(x) > 0, then φτ(x) = 1, so φτ(x) − φ(x) ≥ 0;
  if P1(x) − τP0(x) ≤ 0, then φτ(x) = 0, so φτ(x) − φ(x) ≤ 0.
Summing over all x ∈ X, we get
  0 ≤ (1 − βφτ) − (1 − βφ) − τ(αφτ − αφ) = (βφ − βφτ) + τ(αφ − αφτ).
Since τ > 0, we conclude that αφ ≤ αφτ ⇒ βφ ≥ βφτ.
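The lemma can be checked by brute force on a small example: enumerate all 2^|X| deterministic tests and verify that none beats the LRT on both error probabilities. The alphabet, distributions, and threshold below are illustrative choices, not from the lecture.

```python
from itertools import product

# Brute-force check of the Neyman-Pearson lemma on a toy alphabet.
P0 = {"a": 0.5, "b": 0.3, "c": 0.2}
P1 = {"a": 0.2, "b": 0.3, "c": 0.5}
symbols = sorted(P0)

def errors(test):
    """test maps symbol -> decision in {0, 1}; returns (alpha, beta)."""
    alpha = sum(P0[x] for x in symbols if test[x] == 1)
    beta = sum(P1[x] for x in symbols if test[x] == 0)
    return alpha, beta

tau = 1.5
lrt = {x: 1 if P1[x] > tau * P0[x] else 0 for x in symbols}
a_lrt, b_lrt = errors(lrt)

# A violation would be a test with alpha <= a_lrt and strictly smaller beta.
violations = []
for bits in product([0, 1], repeat=len(symbols)):
    test = dict(zip(symbols, bits))
    a, b = errors(test)
    if a <= a_lrt and b < b_lrt:
        violations.append(test)
# violations stays empty, in agreement with the lemma
```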

SLIDE 8

(Figure: achievable trade-off curves between α (PFA) and β (PMD), plotted on the unit square [0, 1] × [0, 1].)

Question: What is the optimal trade-off curve? What is the optimal test achieving the curve?

SLIDE 9

Randomized Testing Algorithm

Randomized tests include deterministic tests as special cases.

Definition 2 (Randomized Test)
A randomized test decides θ̂ = 1 with probability φ(x) and θ̂ = 0 with probability 1 − φ(x), where φ is a mapping φ : X → [0, 1].
Note: A randomized test is characterized by φ, just as deterministic tests are.

Definition 3 (Randomized LRT)
A randomized likelihood ratio test (LRT) is a test φτ,γ, parametrized by constants τ > 0 and γ ∈ (0, 1), defined as follows:
  φτ,γ(x) = 1 if P1(x) > τP0(x); γ if P1(x) = τP0(x); 0 if P1(x) < τP0(x).

SLIDE 10

Randomized LRT Achieves the Optimal Trade-Off

Consider the following optimization problem:

Neyman-Pearson Problem: minimize βφ over φ : X → [0, 1], subject to αφ ≤ α*.

Theorem 2 (Neyman-Pearson)
A randomized LRT φτ*,γ* with parameters (τ*, γ*) satisfying α* = αφτ*,γ* attains optimality for the Neyman-Pearson Problem.
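A sketch of how (τ*, γ*) can be found for a target α* on a finite alphabet: sweep the distinct likelihood-ratio values in decreasing order, put all P0-mass with L(x) > τ* into the reject-H0 region, and use γ* to split the boundary mass so PFA hits α* exactly. The distributions and the target below are illustrative, not from the lecture.

```python
def np_test(P0, P1, alpha_star):
    """Find (tau, gamma) so the randomized LRT has P_FA = alpha_star exactly.
    Assumes a finite alphabet with P0(x) > 0 for all x."""
    ratios = sorted({P1[x] / P0[x] for x in P0}, reverse=True)
    above = 0.0  # accumulated P0-mass of {x : L(x) > tau}
    for tau in ratios:
        at_tau = sum(P0[x] for x in P0 if P1[x] / P0[x] == tau)
        if above + at_tau >= alpha_star:
            gamma = (alpha_star - above) / at_tau
            return tau, gamma
        above += at_tau
    raise ValueError("alpha_star not achievable")

def errors_of(P0, P1, tau, gamma):
    """(alpha, beta) of the randomized LRT with parameters (tau, gamma)."""
    alpha = beta = 0.0
    for x in P0:
        L = P1[x] / P0[x]
        phi = 1.0 if L > tau else (gamma if L == tau else 0.0)
        alpha += phi * P0[x]
        beta += (1.0 - phi) * P1[x]
    return alpha, beta

P0 = {"a": 0.5, "b": 0.3, "c": 0.2}
P1 = {"a": 0.2, "b": 0.3, "c": 0.5}
tau, gamma = np_test(P0, P1, alpha_star=0.3)
alpha, beta = errors_of(P0, P1, tau, gamma)
# alpha equals the target 0.3; beta is the corresponding type II error
```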

SLIDE 11

pf: First argue that for any α* ∈ (0, 1), one can find (τ*, γ*) such that
  α* = αφτ*,γ* = ∑_{x∈X} φτ*,γ*(x) P0(x) = ∑_{x: L(x)>τ*} P0(x) + γ* ∑_{x: L(x)=τ*} P0(x).
For any test φ, by the same argument as in Theorem 1, we have for all x ∈ X,
  (φτ*,γ*(x) − φ(x)) (P1(x) − τ*P0(x)) ≥ 0.
Summing over all x ∈ X, we similarly get
  (βφ − βφτ*,γ*) + τ* (αφ − αφτ*,γ*) ≥ 0.
Hence, for any feasible test φ with αφ ≤ α* = αφτ*,γ*, its probability of type II error satisfies βφ ≥ βφτ*,γ*.

SLIDE 12

Bayesian Setup

Sometimes prior probabilities of the two hypotheses are known:
  πθ ≜ P{Hθ is true}, θ = 0, 1, with π0 + π1 = 1.
In this sense, one can view the index Θ as a (binary) random variable with prior distribution P{Θ = θ} = πθ, for θ = 0, 1.
With prior probabilities, it then makes sense to talk about the average probability of error of a test φ, or more generally, the average cost (risk):
  Pe(φ) ≜ π0 αφ + π1 βφ = EΘ,X[ 1{Θ ≠ Θ̂} ],   R(φ) ≜ EΘ,X[ rΘ,Θ̂ ].
The Bayesian hypothesis testing problem is to test the two hypotheses with knowledge of the prior probabilities so that the average probability of error (or, in general, a risk function) is minimized.

SLIDE 13

Minimizing Bayes Risk

Consider the following problem of minimizing the Bayes risk.

Bayes Problem: minimize R(φ) ≜ EΘ,X[ rΘ,Θ̂ ] over φ : X → [0, 1], with known priors (π0, π1) and costs rθ,θ̂.

Theorem 3 (LRT is an Optimal Bayesian Test)
Assume r0,0 < r0,1 and r1,1 < r1,0. A deterministic LRT φτ* with threshold
  τ* = ( (r0,1 − r0,0) π0 ) / ( (r1,0 − r1,1) π1 )
attains optimality for the Bayes Problem.
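Theorem 3 can be verified numerically on a toy problem: compute τ* from the formula, form the LRT, and compare its risk against an exhaustive search over all deterministic tests. The priors, costs, and distributions below are hand-picked for illustration, not from the lecture.

```python
from itertools import product

# Numerical check of Theorem 3: the LRT with threshold
#   tau* = (r01 - r00) pi0 / ((r10 - r11) pi1)
# minimizes the Bayes risk among all deterministic tests.
P0 = {"a": 0.5, "b": 0.3, "c": 0.2}
P1 = {"a": 0.2, "b": 0.3, "c": 0.5}
pi0, pi1 = 0.6, 0.4
r = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.0}  # r[(theta, decision)]

def risk(test):  # test maps symbol -> decision in {0, 1}
    return sum(pi0 * P0[x] * r[(0, test[x])] + pi1 * P1[x] * r[(1, test[x])]
               for x in P0)

tau_star = (r[(0, 1)] - r[(0, 0)]) * pi0 / ((r[(1, 0)] - r[(1, 1)]) * pi1)
bayes_lrt = {x: 1 if P1[x] > tau_star * P0[x] else 0 for x in P0}

best = min(risk(dict(zip(sorted(P0), bits)))
           for bits in product([0, 1], repeat=len(P0)))
# risk(bayes_lrt) coincides with best
```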

SLIDE 14

pf: Expanding the risk,
  R(φ) = ∑_{x∈X} r0,0 π0 P0(x) (1 − φ(x)) + ∑_{x∈X} r0,1 π0 P0(x) φ(x)
       + ∑_{x∈X} r1,0 π1 P1(x) (1 − φ(x)) + ∑_{x∈X} r1,1 π1 P1(x) φ(x)
  = r0,0 π0 + ∑_{x∈X} (r0,1 − r0,0) π0 P0(x) φ(x) + r1,0 π1 + ∑_{x∈X} (r1,1 − r1,0) π1 P1(x) φ(x)
  = ∑_{x∈X} [ (r0,1 − r0,0) π0 P0(x) − (r1,0 − r1,1) π1 P1(x) ] φ(x) + r0,0 π0 + r1,0 π1.   (∗)
For each x ∈ X, we choose φ(x) ∈ [0, 1] such that (∗) is minimized. It is then clear that we should choose
  φ(x) = 1 if (r0,1 − r0,0) π0 P0(x) − (r1,0 − r1,1) π1 P1(x) < 0, and φ(x) = 0 otherwise,
which is exactly the deterministic LRT φτ* with the threshold τ* given in Theorem 3.

SLIDE 15

Discussions

For binary hypothesis testing problems, the likelihood ratio L(x) ≜ P1(x)/P0(x) turns out to be a sufficient statistic. Moreover, a likelihood ratio test (LRT) is optimal in both the Bayesian and Neyman-Pearson settings.
Extensions include:
  M-ary hypothesis testing
  Minimax risk optimization (with unknown prior)
  Composite hypothesis testing, etc.
Here we do not pursue these directions further. Instead, we explore the asymptotic behavior of hypothesis testing and its connection with information-theoretic tools.

SLIDE 16

1 Hypothesis Testing
  Basic Theory
  Asymptotics

SLIDE 17

i.i.d. Observations

So far we have focused on the general setting, where the observation space X can be an arbitrary alphabet. In the following, we consider the product space X^n and a length-n observation sequence X^n drawn i.i.d. from one of the two distributions, so that the two hypotheses are
  H0 : Xi ∼ P0 i.i.d., i = 1, 2, ..., n
  H1 : Xi ∼ P1 i.i.d., i = 1, 2, ..., n
The corresponding probabilities of error are denoted by
  α(n) ≡ P(n)FA ≜ P{H1 is chosen | H0}
  β(n) ≡ P(n)MD ≜ P{H0 is chosen | H1}
Throughout the lecture we assume X = {a1, a2, ..., ad} is a finite set.

SLIDE 18

LRT under i.i.d. Observation (1)

With i.i.d. observations, the likelihood ratio of a sequence x^n ∈ X^n is
  L(x^n) = ∏_{i=1}^{n} P1(xi)/P0(xi) = ∏_{a∈X} (P1(a)/P0(a))^{N(a|x^n)} = ∏_{a∈X} (P1(a)/P0(a))^{n π(a|x^n)},
where N(a|x^n) ≜ the number of occurrences of a in x^n, and π(a|x^n) ≜ (1/n) N(a|x^n) is the relative frequency of occurrence of symbol a in the sequence x^n.
Note: From the above manipulation, we see that the collection of relative frequencies of occurrence (as a |X|-dimensional probability vector),
  Π_{x^n} ≜ [π(a1|x^n) π(a2|x^n) ··· π(ad|x^n)]^T,
called the type of the sequence x^n, is a sufficient statistic for all the previously mentioned hypothesis testing problems.
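Computing the type of a sequence is a one-liner with a counter. A sketch, with an illustrative sequence and alphabet of my own choosing:

```python
from collections import Counter

def type_of(xn, alphabet):
    """Empirical distribution (type) of the sequence xn over the alphabet."""
    n = len(xn)
    counts = Counter(xn)
    return {a: counts[a] / n for a in alphabet}

xn = list("abacca")  # an illustrative length-6 sequence
Pi = type_of(xn, alphabet=["a", "b", "c"])
# Pi['a'] = 3/6, Pi['b'] = 1/6, Pi['c'] = 2/6
```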

SLIDE 19

LRT under i.i.d. Observation (2)

Let us further manipulate the LRT by taking the log-likelihood ratio:
  L(x^n) ⋛ τ
  ⇔ log L(x^n) ⋛ log τ
  ⇔ ∑_{a∈X} n π(a|x^n) log (P1(a)/P0(a)) ⋛ log τ
  ⇔ ∑_{a∈X} π(a|x^n) log (π(a|x^n)/P0(a)) − ∑_{a∈X} π(a|x^n) log (π(a|x^n)/P1(a)) ⋛ (1/n) log τ
  ⇔ D(Π_{x^n} ∥ P0) − D(Π_{x^n} ∥ P1) ⋛ (1/n) log τ
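The identity (1/n) log L(x^n) = D(Π_{x^n} ∥ P0) − D(Π_{x^n} ∥ P1) can be sanity-checked numerically. The distributions and the sequence below are illustrative choices, not from the lecture.

```python
import math
from collections import Counter

# Check: the normalized LLR of x^n equals D(Pi || P0) - D(Pi || P1),
# where Pi is the type (empirical distribution) of x^n.
P0 = {"a": 0.5, "b": 0.3, "c": 0.2}
P1 = {"a": 0.2, "b": 0.3, "c": 0.5}

def kl(P, Q):
    """KL divergence D(P || Q), with the convention 0 log 0 = 0."""
    return sum(p * math.log(p / Q[a]) for a, p in P.items() if p > 0)

xn = list("abacca")
n = len(xn)
counts = Counter(xn)
Pi = {a: counts[a] / n for a in P0}  # the type of x^n

llr_direct = sum(math.log(P1[x] / P0[x]) for x in xn) / n
llr_via_types = kl(Pi, P0) - kl(Pi, P1)
# the two quantities agree up to floating-point error
```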

SLIDE 20

We see that types and KL divergence play central roles in hypothesis testing problems with i.i.d. observations. Types turn out to be a powerful tool in other contexts. Hence, in the following we first give an introduction to the method of types, and then return to the original hypothesis testing problems.
