Lecture 7 Introduction to Statistical Decision Theory
I-Hsiang Wang
Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw
December 14, 2016
1 / 62 I-Hsiang Wang IT Lecture 7
1 In this lecture, we will introduce the basic elements of statistical decision theory.
2 In the follow-up lectures, we will go into the details of several topics.
1 We will begin by setting up the framework of statistical decision theory.
2 Next, we will introduce two basic statistical decision-making problems: hypothesis testing and estimation.
Statistical Model and Decision Making
Statistical Model and Decision Making Basic Framework
[Figure: a statistical experiment maps the parameter θ ∈ Θ to an observation X ∈ X drawn from Pθ, which is then fed to the decision making.]
▶ Θ is called the parameter space; it could be finite, countably infinite, or uncountable.
▶ Pθ(·) is a probability distribution which accounts for the implicit randomness in experiments.
▶ X could be random variables, vectors, matrices, processes, etc.
Loss function: l(T(θ), τ(X)).
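As a small numeric sketch of this framework (the two-point parameter space, the distributions, and the 0-1 loss below are all hypothetical choices for illustration), the risk of a decision rule τ is the expected loss under Pθ:

```python
# Sketch of the decision-theoretic framework (hypothetical finite example):
# parameter space Theta = {0, 1}, observation alphabet X = {0, 1},
# 0-1 loss, and risk R_theta(tau) = E_{X ~ P_theta}[ l(theta, tau(X)) ].

P = {0: {0: 0.9, 1: 0.1},   # P_0(x): distribution of X when theta = 0
     1: {0: 0.2, 1: 0.8}}   # P_1(x): distribution of X when theta = 1

def loss(theta, decision):
    """0-1 loss: 1 if the decision misses the parameter."""
    return 0 if theta == decision else 1

def risk(theta, tau):
    """Risk R_theta(tau) = sum_x P_theta(x) * l(theta, tau(x))."""
    return sum(p * loss(theta, tau(x)) for x, p in P[theta].items())

tau = lambda x: x          # decide theta-hat = x
print(risk(0, tau))        # 0.1
print(risk(1, tau))        # 0.2
```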
Statistical Model and Decision Making Examples
∏_{i=1}^{n} PY|X(yi | xi(m))
∏_{i=1}^{n} PX(xi) PY|h(X)(yi | h(xi))
(Note: This is still random, as ĥ depends on the data.)
∏_{i=1}^{n} PX(xi) PZ(yi − f(xi))   (Y = f(X) + Z where Z is the observation noise)
(Note: This is still random, as f̂ depends on the data.)
∏_{i=1}^{n} PX(xi) PZ(yi − γ − β⊺xi)
(Note: This is still random, as (β̂, γ̂) depends on the data.)
Statistical Model and Decision Making Paradigms
The Bayes risk under prior π is R∗π ≜ infτ Rπ(τ), attained by a Bayes-optimal rule τ∗π (may not be unique).
Sufficient conditions for the existence of a Bayes-optimal rule τ∗π include:
1 The parameter space Θ and the data alphabet X are both finite.
2 |Θ| < ∞ and the loss function is bounded from below.
The least favorable prior π∗ satisfies R∗π∗ = supπ infτ EΘ∼π[LΘ(τ)].
1 If the loss function τ ↦ l(T, τ) is convex, then randomization does not help.
2 In the Bayes paradigm, there always exists a deterministic decision rule which is Bayes optimal.
Hypothesis Testing
Hypothesis Testing Basics
1 Two hypotheses regarding the observation X, indexed by θ ∈ {0, 1}: H0 : X ∼ P0 versus H1 : X ∼ P1.
2 Goal: design a decision-making algorithm ϕ : X → {0, 1}, x ↦ θ̂.
3 The loss function is the 0-1 loss, rendering two kinds of error probabilities:
PFA(ϕ) ≜ ∑_{x∈A1(ϕ)} P0(x) = ∑_{x∈X} ϕ(x) P0(x),
PMD(ϕ) ≜ ∑_{x∈A0(ϕ)} P1(x) = ∑_{x∈X} (1 − ϕ(x)) P1(x).
The likelihood ratio is L(x) ≜ P1(x)/P0(x). Hence, the LRT is a thresholding algorithm on the likelihood ratio.
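A minimal sketch of the LRT, assuming hypothetical distributions P0, P1 on a three-letter alphabet (decide H1 whenever L(x) ≥ τ):

```python
# Likelihood ratio test: decide H1 iff L(x) = P1(x)/P0(x) >= tau.
P0 = {'a': 0.5, 'b': 0.3, 'c': 0.2}
P1 = {'a': 0.1, 'b': 0.3, 'c': 0.6}

def lrt(x, tau):
    L = P1[x] / P0[x]            # likelihood ratio at x
    return 1 if L >= tau else 0  # 1 = accept H1, 0 = accept H0

# with tau = 1: 'a' -> H0 (L = 0.2), 'b' -> H1 (L = 1.0), 'c' -> H1 (L = 3.0)
print([lrt(x, 1.0) for x in 'abc'])  # [0, 1, 1]
```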
The Neyman-Pearson problem optimizes over randomized tests φ : X → [0, 1].
PFA(φ∗) = ∑_{x: L(x)>τ∗} P0(x) + γ∗ ∑_{x: L(x)=τ∗} P0(x).
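The pair (τ∗, γ∗) can be found by greedily filling the false-alarm budget in decreasing order of the likelihood ratio; a sketch under assumed P0, P1 and level ε:

```python
# Neyman-Pearson: pick tau* and gamma* so that P_FA equals eps exactly.
P0 = {'a': 0.5, 'b': 0.3, 'c': 0.2}
P1 = {'a': 0.1, 'b': 0.3, 'c': 0.6}
eps = 0.3

# sort symbols by likelihood ratio L(x) = P1(x)/P0(x), largest first
order = sorted(P0, key=lambda x: P1[x] / P0[x], reverse=True)
fa = 0.0
for x in order:
    if fa + P0[x] <= eps:         # accept H1 on x deterministically
        fa += P0[x]
    else:                          # randomize on the boundary symbol
        tau_star = P1[x] / P0[x]
        gamma_star = (eps - fa) / P0[x]
        break

print(tau_star, gamma_star)       # here: threshold 1.0, randomization 1/3
```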
The Bayes-optimal test over randomized rules φ : X → [0, 1] is an LRT with threshold ((l(0,1) − l(0,0))π0) / ((l(1,0) − l(1,1))π1).
R(ϕ) = ∑_{x∈X} l(0,0)π0P0(x)(1 − ϕ(x)) + ∑_{x∈X} l(0,1)π0P0(x)ϕ(x)
+ ∑_{x∈X} l(1,0)π1P1(x)(1 − ϕ(x)) + ∑_{x∈X} l(1,1)π1P1(x)ϕ(x)
= const + ∑_{x∈X} [(l(0,1) − l(0,0))π0P0(x) + (l(1,1) − l(1,0))π1P1(x)] ϕ(x),
which is minimized by setting ϕ(x) = 1 exactly when the bracketed coefficient is negative.
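Since the risk is linear in ϕ(x), it is minimized pointwise; a sketch of the resulting Bayes rule (the distributions, priors, and losses are hypothetical):

```python
# Bayes-optimal test from the risk expansion: set phi(x) = 1 exactly when
# (l01 - l00) * pi0 * P0(x) < (l10 - l11) * pi1 * P1(x),
# i.e. an LRT with threshold ((l01 - l00) * pi0) / ((l10 - l11) * pi1).
P0 = {'a': 0.5, 'b': 0.3, 'c': 0.2}
P1 = {'a': 0.1, 'b': 0.3, 'c': 0.6}
pi0, pi1 = 0.5, 0.5
l00, l01, l10, l11 = 0.0, 1.0, 1.0, 0.0   # 0-1 loss

def bayes_rule(x):
    return 1 if (l01 - l00) * pi0 * P0[x] < (l10 - l11) * pi1 * P1[x] else 0

print([bayes_rule(x) for x in 'abc'])  # with these numbers: [0, 0, 1]
```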
The likelihood ratio L(x) = P1(x)/P0(x) is a sufficient statistic.
Hypothesis Testing Asymptotics: Overview
Under H0, X1, …, Xn are i.i.d. ∼ P0; under H1, they are i.i.d. ∼ P1. The two error probabilities are
P(n)FA ≜ P{H1 is chosen | H0},
P(n)MD ≜ P{H0 is chosen | H1}.
The likelihood ratio of xⁿ factorizes through the empirical distribution:
L(xⁿ) = ∏_{i=1}^{n} P1(xi)/P0(xi) = ∏_{a∈X} (P1(a)/P0(a))^{N(a|xⁿ)},
so (1/n) log L(xⁿ) = ∑_{a∈X} (N(a|xⁿ)/n) log(P1(a)/P0(a)),
where (1/n) N(a|xⁿ) is the relative frequency of occurrence of a in xⁿ.
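This factorization can be verified numerically: the normalized log-likelihood ratio computed term by term matches the one computed from the empirical distribution (P0, P1, and the sequence below are hypothetical):

```python
import math
from collections import Counter

P0 = {'a': 0.5, 'b': 0.5}
P1 = {'a': 0.2, 'b': 0.8}
xn = "abbab"
n = len(xn)

# direct computation: (1/n) * sum_i log(P1(x_i) / P0(x_i))
direct = sum(math.log(P1[x] / P0[x]) for x in xn) / n

# via the empirical distribution pi(a|x^n) = N(a|x^n) / n
emp = {a: c / n for a, c in Counter(xn).items()}
via_type = sum(emp[a] * math.log(P1[a] / P0[a]) for a in emp)

print(abs(direct - via_type) < 1e-12)  # True: the two forms agree
```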
In terms of the empirical distribution π(·|xⁿ) ≜ (1/n) N(·|xⁿ),
(1/n) log L(xⁿ) = ∑_{a∈X} π(a|xⁿ) log(P1(a)/P0(a)) = D(π(·|xⁿ) ∥ P0) − D(π(·|xⁿ) ∥ P1),
so the LRT compares this quantity with the normalized threshold (1/n) log τn.
[Figure: the observation space is partitioned into the acceptance region of H0 and the acceptance region of H1.]
1 Neyman-Pearson: β∗(n, ε) ≜ inf over φn : Xⁿ → [0, 1] of β(n)(φn), subject to α(n)(φn) ≤ ε. It turns out that (Chernoff-Stein)
lim_{n→∞} −(1/n) log β∗(n, ε) = D(P0 ∥ P1).
2 Bayes: P∗e(n) ≜ inf over φn : Xⁿ → [0, 1] of π0 α(n)(φn) + π1 β(n)(φn). It turns out that
lim_{n→∞} −(1/n) log P∗e(n) = C(P0, P1) ≜ − min_{0≤λ≤1} log ∑_{x∈X} (P0(x))^{1−λ}(P1(x))^{λ},
the Chernoff information; the optimal exponent is attained by the tilted distribution
Pλ(a) = (P0(a))^{1−λ}(P1(a))^{λ} / ∑_{x∈X} (P0(x))^{1−λ}(P1(x))^{λ}, ∀ a ∈ X.
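Both exponents can be evaluated numerically; a sketch computing D(P0 ∥ P1) directly and the Chernoff information by a grid search over λ (P0, P1 are hypothetical):

```python
import math

P0 = {'a': 0.5, 'b': 0.3, 'c': 0.2}
P1 = {'a': 0.1, 'b': 0.3, 'c': 0.6}

# Chernoff-Stein exponent for the miss-detection probability: D(P0 || P1)
D01 = sum(p * math.log(p / P1[a]) for a, p in P0.items())

# Chernoff information: -min_{0<=lam<=1} log sum_x P0(x)^(1-lam) P1(x)^lam
def log_moment(lam):
    return math.log(sum(P0[x] ** (1 - lam) * P1[x] ** lam for x in P0))

chernoff = -min(log_moment(k / 1000) for k in range(1001))

print(D01, chernoff)  # the Bayes exponent is smaller than the NP exponent
```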
Estimation
Estimation Mean-Squared Error (MSE) and Cramér-Rao Lower Bound
The Fisher information is J(θ) ≜ Efθ[(∂/∂θ ln fθ(X))²]. Cramér-Rao: for any unbiased estimator ϕ of θ,
MSEθ(ϕ) ≥ 1/J(θ), ∀ θ ∈ Θ.
The score ∂/∂θ ln fθ(X) = (1/fθ(X)) ∂fθ(X)/∂θ has zero mean, because
∫_{−∞}^{∞} fθ(x) (1/fθ(x)) (∂fθ(x)/∂θ) dx = ∫_{−∞}^{∞} (∂fθ(x)/∂θ) dx = (d/dθ) ∫_{−∞}^{∞} fθ(x) dx = 0.
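The zero-mean property of the score can be checked exactly for a discrete family, e.g. Bernoulli(θ) with fθ(1) = θ and fθ(0) = 1 − θ (a sketch; the family choice is for illustration):

```python
# Numerical check that the score has zero mean, for a Bernoulli(theta) family:
# f_theta(1) = theta, f_theta(0) = 1 - theta, so the score is
#   d/dtheta ln f_theta(x) = x / theta - (1 - x) / (1 - theta).
theta = 0.3

def score(x):
    return x / theta - (1 - x) / (1 - theta)

mean_score = (1 - theta) * score(0) + theta * score(1)
print(mean_score)  # 0 up to floating point
```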
Moreover, Efθ[(1/fθ(X)) (∂fθ(X)/∂θ) ϕ(X)] = ∫_{−∞}^{∞} (∂fθ(x)/∂θ) ϕ(x) dx = (d/dθ) ∫_{−∞}^{∞} fθ(x)ϕ(x) dx = (d/dθ) Efθ[ϕ(X)] =(a) (d/dθ) θ = 1,
where (a) is due to the unbiasedness of ϕ.
Exercise 1 (Cramér-Rao Inequality for Unbiased Functional Estimators) Prove that for any unbiased estimator ζ of z(θ), MSEθ(ζ) ≥ (1/J(θ)) ((d/dθ) z(θ))².
Exercise 2 (Cramér-Rao Inequality for Biased Estimators) Prove that for any estimator ϕ of the parameter θ, MSEθ(ϕ) ≥ (1/J(θ)) (1 + (d/dθ) Biasθ(ϕ))² + (Biasθ(ϕ))².
Exercise 3 (Attainment of Cramér-Rao) Show that the necessary and sufficient condition for an unbiased estimator ϕ to attain the Cramér-Rao lower bound is that there exists some function g such that for all x, ϕ(x) − θ = g(θ) ∂/∂θ ln fθ(x).
1 J(θ) ≜ Efθ[(∂/∂θ ln fθ(X))²] is the variance of the score, since ∂/∂θ ln fθ(X) = (1/fθ(X)) ∂fθ(X)/∂θ is zero-mean.
2 Suppose Xi i.i.d. ∼ fθ; then the Fisher information of Xⁿ = (X1, …, Xn) is n J(θ).
3 For an exponential family {fθ | θ ∈ Θ}, it can be shown that J(θ) = −Efθ[∂²/∂θ² ln fθ(X)].
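As a sketch of the first remark, the Fisher information of a Bernoulli(θ) family equals the second moment of its score, which matches the closed form 1/(θ(1 − θ)):

```python
# Fisher information of Bernoulli(theta): J(theta) = 1 / (theta * (1 - theta)).
# Check it against the second moment of the (zero-mean) score.
theta = 0.3
score = {0: -1 / (1 - theta), 1: 1 / theta}   # d/dtheta ln f_theta(x)
probs = {0: 1 - theta, 1: theta}

J_from_score = sum(probs[x] * score[x] ** 2 for x in (0, 1))  # E[score^2]
J_closed_form = 1 / (theta * (1 - theta))

print(J_from_score, J_closed_form)  # both about 4.76
```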
Estimation Maximum Likelihood Estimator, Consistency, and Efficiency
The maximum likelihood estimator (MLE) is θ̂ML(x) ≜ arg max_{θ∈Θ} fθ(x).
Exercise 4 (MLE of Gaussian with Unknown Mean and Variance) Consider Xi i.i.d. ∼ N(μ, σ²) for i = 1, 2, …, n, where θ ≜ (μ, σ²) denotes the unknown parameter. Let x̄ ≜ (1/n) ∑_{i=1}^{n} xi. Show that θ̂ML = (x̄, (1/n) ∑_{i=1}^{n} (xi − x̄)²).
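The closed form in Exercise 4 can be sanity-checked numerically: on a fixed sample, the log-likelihood at (x̄, σ̂²) should dominate nearby parameter values (the sample below is arbitrary):

```python
import math

xs = [1.0, 2.0, 2.0, 3.0, 4.5]
n = len(xs)

# closed-form MLE: sample mean and (biased) sample variance
mu_hat = sum(xs) / n
var_hat = sum((x - mu_hat) ** 2 for x in xs) / n

def log_lik(mu, var):
    """Gaussian log-likelihood of the sample at parameters (mu, var)."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in xs)

# the closed form should dominate perturbed parameter values
best = log_lik(mu_hat, var_hat)
print(best >= log_lik(mu_hat + 0.1, var_hat))   # True
print(best >= log_lik(mu_hat, var_hat * 1.1))   # True
```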
For Xi i.i.d. ∼ Pθ, two desirable asymptotic properties of an estimator are:
1 Consistency: the estimate converges (in probability) to the true value as n → ∞.
2 Efficiency: the estimator asymptotically attains the Cramér-Rao lower bound.
Definition (Consistency): ζn is consistent if for every ε > 0,
lim_{n→∞} P_{Xi i.i.d. ∼ Pθ}{|ζn(Xⁿ) − z(θ)| < ε} = 1, ∀ θ ∈ Θ,
that is, ζn(Xⁿ) →p z(θ).
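Consistency of the sample mean (as an estimator of a Bernoulli parameter) can be illustrated by a seeded simulation; the deviation from θ shrinks as n grows:

```python
import random

random.seed(0)
theta = 0.3   # true Bernoulli parameter

def sample_mean(n):
    """Empirical mean of n Bernoulli(theta) draws."""
    return sum(random.random() < theta for _ in range(n)) / n

dev_small = abs(sample_mean(100) - theta)
dev_large = abs(sample_mean(100000) - theta)
print(dev_small, dev_large)   # the deviation concentrates as n grows
```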
Definition (Efficiency): ζn is (asymptotically) efficient if its MSE attains the Cramér-Rao lower bound (1/J(θ)) ((d/dθ) z(θ))² asymptotically.